diff mbox series

[net,4/4] docs: net: dsa: replace TODO section with info about history and devel ideas

Message ID 20231208193518.2018114-5-vladimir.oltean@nxp.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series Add some history to the DSA documentation | expand

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net
netdev/ynl success SINGLE THREAD; Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success No Fixes tags, but series doesn't touch code
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 8 this patch: 8
netdev/cc_maintainers warning 2 maintainers not CCed: florian.fainelli@broadcom.com linux-doc@vger.kernel.org
netdev/build_clang success Errors and warnings before: 8 this patch: 8
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 8 this patch: 8
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 190 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Vladimir Oltean Dec. 8, 2023, 7:35 p.m. UTC
It was a bit unclear to me what the TODO is about and what is even
actionable about it. I had a discussion with Florian about it at NetConf
2023, and it seems that it's about the amount of boilerplate code that
exists in switchdev drivers, and how that could be maybe made common
with DSA, through something like another library.

I think we are seeing a lot of people who are looking at DSA now,
and there is a lot of misunderstanding about why things are the way
they are, and which are the changes that would benefit the subsystem,
compared to the changes that go against DSA's past development trend.

I think what is missing is the context, which is admittedly a bit
hard to grasp considering there are 15 years of development.
Based on the git log and on the discussions where I participated,
I tried to cobble up a section about DSA's history. Here and there,
I've mentioned the limitations that I am aware of, and some possible
ways forward.

I'm also personally surprised by the amount of change in DSA over the
years, and I hope that putting things into perspective will also
encourage others to not think that it's set in stone the way it is now.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 Documentation/networking/dsa/dsa.rst | 186 +++++++++++++++++++++++++--
 1 file changed, 176 insertions(+), 10 deletions(-)

Comments

Linus Walleij Dec. 8, 2023, 10:40 p.m. UTC | #1
On Fri, Dec 8, 2023 at 8:36 PM Vladimir Oltean <vladimir.oltean@nxp.com> wrote:

> It was a bit unclear to me what the TODO is about and what is even
> actionable about it. I had a discussion with Florian about it at NetConf
> 2023, and it seems that it's about the amount of boilerplate code that
> exists in switchdev drivers, and how that could be maybe made common
> with DSA, through something like another library.
>
> I think we are seeing a lot of people who are looking at DSA now,
> and there is a lot of misunderstanding about why things are the way
> they are, and which are the changes that would benefit the subsystem,
> compared to the changes that go against DSA's past development trend.
>
> I think what is missing is the context, which is admittedly a bit
> hard to grasp considering there are 15 years of development.
> Based on the git log and on the discussions where I participated,
> I tried to cobble up a section about DSA's history. Here and there,
> I've mentioned the limitations that I am aware of, and some possible
> ways forward.
>
> I'm also personally surprised by the amount of change in DSA over the
> years, and I hope that putting things into perspective will also
> encourage others to not think that it's set in stone the way it is now.
>
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

Adding documentation is always good, and the kernel definitely
looks better after this patch than before so:
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>

For ease of reading I would personally restructure it a bit, by
putting in three sections:

- History
  Pure development history (for the old code maybe add examples
  of which switches brought about the changes, in the same way
  that the Ocelot driver is mentioned?)

- Policy
  Do this, don't do that. The text has a few paragraphs that read
  like so.

- Future directions
  What we want to do next, or can be expected for the future,
  again the text has a few paragraphs that read like that.

Yours,
Linus Walleij
Florian Fainelli Dec. 8, 2023, 11:03 p.m. UTC | #2
On 12/8/23 11:35, Vladimir Oltean wrote:
> It was a bit unclear to me what the TODO is about and what is even
> actionable about it. I had a discussion with Florian about it at NetConf
> 2023, and it seems that it's about the amount of boilerplate code that
> exists in switchdev drivers, and how that could be maybe made common
> with DSA, through something like another library.
> 
> I think we are seeing a lot of people who are looking at DSA now,
> and there is a lot of misunderstanding about why things are the way
> they are, and which are the changes that would benefit the subsystem,
> compared to the changes that go against DSA's past development trend.
> 
> I think what is missing is the context, which is admittedly a bit
> hard to grasp considering there are 15 years of development.
> Based on the git log and on the discussions where I participated,
> I tried to cobble up a section about DSA's history. Here and there,
> I've mentioned the limitations that I am aware of, and some possible
> ways forward.
> 
> I'm also personally surprised by the amount of change in DSA over the
> years, and I hope that putting things into perspective will also
> encourage others to not think that it's set in stone the way it is now.
> 
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

This is really great, and thanks for having put that together, it 
represents an useful timeline of changes introduces.

> ---

[snip]

> +Probing through ``platform_data`` remains limited in functionality. The
> +``ds->dev->of_node`` and ``dp->dn`` pointers are NULL, and the OF API calls
> +made by drivers for discovering more complex setups fall back to the implicit
> +handling. There is no way to describe multi-chip trees, or switches with
> +multiple CPU ports. It is always assumed that shared ports are configured by
> +the driver to the maximum supported link speed (they do not use phylink).
> +User ports cannot connect to arbitrary PHYs, but are limited to
> +``ds->user_mii_bus``.

Maybe a mention here that this implies built-in/internal PHY devices 
only, just as a way to re-iterate the limitation and to echo to the 
previous patch?

> +
> +Many switch drivers introduced since after DSA's second OF binding were not
> +designed to support probing through ``platform_data``. Most notably,
> +``device_get_match_data()`` and ``of_device_get_match_data()`` return NULL with
> +``platform_data``, so generally, drivers which do not have alternative
> +mechanisms for this do not support ``platform_data``.
> +
> +Extending the ``platform_data`` support implies adding more separate code.
> +An alternative worth exploring is growing DSA towards the ``fwnode`` API.
> +However, not the entire OF binding should be generalized to ``fwnode``.
> +The current bindings must be examined with a critical eye, and the properties
> +which are no longer considered good practice (like ``label``, because ``udev``
> +offers this functionality) should first be deprecated in OF, and not migrated
> +to ``fwnode``.
> +
> +With ``fwnode`` support in the DSA framework, the ``fwnode_create_software_node()``
> +API could be used as an alternative to ``platform_data``, to allow describing
> +and probing switches on non-OF.

Might suggest to move the 3 paragraphs towards the end because otherwise 
it might be a distraction for the reader who might think: ah that's it? 
no more technical details!? Looks like Linus made the same suggestion in 
his review.

> +
> +DSA is used to control very complex switching chips. Some devices have a
> +microprocessor, and in some cases, this microprocessor can run a variant of the
> +Linux kernel. Sometimes, the switch packet I/O procedure of the internal
> +microprocessor is different from the packet I/O procedure for an external host.
> +The internal processor may have access to switch queues, while the external
> +processor may require DSA tags. Other times, the microprocessor may also be
> +connected to the switch in a DSA fashion (using an internal MAC to MAC
> +connection).
> +
> +Since DSA is only concerned with switches where the packet I/O is handled
> +by an intermediate conduit driver, this leads to the situation where it is
> +recommended to have two drivers for the same switch hardware. 

the same switch hardware, one for each of the use cases described above.

When the queues
> +are accessed directly, a separate non-DSA driver should be used, with its own
> +skeleton which is integrated with ``switchdev`` on its own.
> +
> +In 2019, a DSA driver was added for the ``ocelot`` switch, which is a thin
> +front-end over a hardware library that is also common with a ``switchdev``
> +driver. While this design is encouraged for other similar cases, code
> +duplication among multiple front-ends is a concern, so it may be desirable to
> +extract some of DSA's core functionality into a reusable library for Ethernet
> +switches. This could offer a driver-facing API similar to ``dsa_switch_ops``,
> +but the aspects relating to cross-chip management, to DSA tags and to the
> +conduit interface would remain DSA-specific.

Yes! That was exactly the idea indeed.

> +
> +Traditionally, DSA switch drivers for discrete chips own the entire
> +``spi_device``, ``i2c_client``, ``mdio_device`` etc. When the chip is complex
> +and has multiple embedded peripherals (IRQ controller, GPIO controller, MDIO
> +controller, LED controller, sensors), the handling of these peripherals is
> +currently monolithic within the same device driver that also calls
> +``dsa_register_switch()``.
> +
> +But an internal microprocessor may have a very different view of the switch
> +address space, and may have discrete Linux drivers for each peripheral.
> +In 2023, the ``ocelot_ext`` driver was added, which deviated from the
> +traditional DSA driver architecture. Rather than probing on the entire
> +``spi_device``, it created a multi-function device (MFD) top-level driver for
> +it (associated with the SoC at large), and the switching IP is only one of the
> +children of the MFD (it is a platform device with regmaps provided by its
> +parent). The drivers for each peripheral in this switch SoC are the same when
> +controlled over SPI and when controlled by the internal processor.
> +
> +Authors of new switch drivers that use DSA are encouraged to have a wider view
> +of the peripherals on the chip that they are controlling, and to use the MFD
> +model to separate the components whenever possible. The general direction for
> +the DSA core is to shrink in size and to focus only on the Ethernet switching
> +IP and its ports. ``CONFIG_NET_DSA_HWMON`` was removed in 2017. Adding new
> +methods to ``struct dsa_switch_ops`` which are outside of DSA's core focus on
> +Ethernet is strongly discouraged.

Agreed and good idea to put that on (virtual) paper.

> +
> +DSA's support for multi-chip trees also has limitations. After converting from
> +the first to the second OF binding, the switch tree stopped being a platform
> +device, and its probing became implicit, and distributed among its constituent
> +switch devices. There is currently a synchronization point in
> +``dsa_tree_setup_routing_table()``, through which the tree setup is performed
> +only once, when there is more than one switch in the tree. The first N-1
> +switches will end their probing early, and the last switch will configure the
> +entire tree, and thus all the other switches, in its ``dsa_register_switch()``
> +calling context.
> +
> +Furthermore, the synchronization point works because each switch is able to
> +determine, in a distributed manner, that the routing table is not complete, aka
> +that there is at least one switch which has not probed. This is only possible
> +because the ``link`` properties in the device tree describe the connections to
> +all other cascade ports in the tree, not just to the directly connected cascade
> +port. If only the latter were described, it could happen that a switch waits
> +for its direct neighbors to probe before setting up the tree, but not
> +necessarily for all switches in the tree (therefore, it sets up the tree too
> +early).
> +
> +With more than 3 switches in a tree, it becomes a difficult task to write
> +correct device trees which are not missing any link to the other cascade ports
> +in the tree. The routing table, based on which ``dsa_routing_port()`` works, is
> +directly taken from the device tree, although it could be computed through BFS
> +instead. This means that the device tree writer needs to specify more than just
> +the hardware description (represented by the direct cascade port connections).
> +
> +Simplifying the device tree bindings to require a single ``link`` phandle
> +cannot be done without rethinking the distributed probing scheme. One idea is
> +to reinstate the switch tree as a platform device, but this time created
> +dynamically by ``dsa_register_switch()`` if the switch's tree ID is not already
> +present in the system. The switch tree driver walks the device tree hop by hop,
> +following the ``link`` references, to discover all the other switches, and to
> +construct the full routing table. It then uses the component API to register
> +itself as an aggregate driver, with each of the discovered switches as a
> +component. When ``dsa_register_switch()`` completes for all component switches,
> +the tree probing continues and calls ``dsa_tree_setup()``.

An interesting paragraph, but I am not sure this is such a big pain 
point that we should be spending much description of it, especially 
since it sounds like this is solved, but could be improved, but in the 
grand scheme of things, should we really be spending any time on it?

Same-vendor cascade configurations are Marvell specific, and different 
vendor cascades require distinct switch trees, therefore do not really 
fall into the cross-chip design anymore. In a nutshell, cross-chip is 
very very niche and limited.

> +
> +The cross-chip management layer (``net/dsa/switch.c``) can also be improved.
> +Currently ``struct dsa_switch_tree`` holds a list of ports rather than a list
> +of switches, and thus, calling one function for each switch in a tree is hard.
> +DSA currently uses one notifier chain per tree as a workaround for that, with
> +each switch registered as a listener (``dsa_switch_event()``).
> +
> +It is considered bad practice to use notifiers when the emitter and the
> +listener are known to each other, instead of a plain function call. Also, error
> +handling with notifiers is not robust. When one switch fails mid-operation,
> +there is no rollback to the previous state for switches which already completed
> +the operation successfully.
> +
> +To untangle this situation and improve the reliability of the cross-chip
> +management layer, it is necessary to split the DSA operations into ones which
> +can fail, and ones which cannot fail. For example, ``port_fdb_add()`` can fail,
> +whereas ``port_fdb_del()`` cannot. Then, cross-chip operations can take a
> +fallible function to make forward progress, and an infallible function for
> +rollback. However, it is unclear what to do in the case of ``change_mtu()``.
> +It is hard to classify this operation as either fallible or infallible. It is
> +also unclear how to deal with I/O access errors on the switch's management bus.

How about something like this:

I/O access errors occurring during the switch configuration should 
always be logged for debugging but are very unlikely to be recoverable 
and therefore require an investigation into the failure mechanism and 
root cause or possible work around.
Vladimir Oltean Dec. 9, 2023, 1:58 a.m. UTC | #3
On Fri, Dec 08, 2023 at 03:03:35PM -0800, Florian Fainelli wrote:
> > +Probing through ``platform_data`` remains limited in functionality. The
> > +``ds->dev->of_node`` and ``dp->dn`` pointers are NULL, and the OF API calls
> > +made by drivers for discovering more complex setups fall back to the implicit
> > +handling. There is no way to describe multi-chip trees, or switches with
> > +multiple CPU ports. It is always assumed that shared ports are configured by
> > +the driver to the maximum supported link speed (they do not use phylink).
> > +User ports cannot connect to arbitrary PHYs, but are limited to
> > +``ds->user_mii_bus``.
> 
> Maybe a mention here that this implies built-in/internal PHY devices only,
> just as a way to re-iterate the limitation and to echo to the previous
> patch?

I am not fully convinced that saying "user_mii_bus can only access internal PHYs"
only would be correct. This paragraph also exists in the user MDIO bus section:

For Ethernet switches which have both external and internal MDIO buses, the
user MII bus can be utilized to mux/demux MDIO reads and writes towards either
internal or external MDIO devices this switch might be connected to: internal
PHYs, external PHYs, or even external switches.

> > +
> > +Many switch drivers introduced since after DSA's second OF binding were not
> > +designed to support probing through ``platform_data``. Most notably,
> > +``device_get_match_data()`` and ``of_device_get_match_data()`` return NULL with
> > +``platform_data``, so generally, drivers which do not have alternative
> > +mechanisms for this do not support ``platform_data``.
> > +
> > +Extending the ``platform_data`` support implies adding more separate code.
> > +An alternative worth exploring is growing DSA towards the ``fwnode`` API.
> > +However, not the entire OF binding should be generalized to ``fwnode``.
> > +The current bindings must be examined with a critical eye, and the properties
> > +which are no longer considered good practice (like ``label``, because ``udev``
> > +offers this functionality) should first be deprecated in OF, and not migrated
> > +to ``fwnode``.
> > +
> > +With ``fwnode`` support in the DSA framework, the ``fwnode_create_software_node()``
> > +API could be used as an alternative to ``platform_data``, to allow describing
> > +and probing switches on non-OF.
> 
> Might suggest to move the 3 paragraphs towards the end because otherwise it
> might be a distraction for the reader who might think: ah that's it? no more
> technical details!? Looks like Linus made the same suggestion in his review.

I think it needs even more rethinking than that. I now remembered that
we also have a "Design limitations" section where the future work can go.

It's hard to navigate through what is now a 1400 line document and not
get lost.

I'm hoping I could move the documentation of variables and methods that
now sits in "Driver development" into kdoc comments inline with the code,
to reduce the clutter a bit.

But I don't know how to tackle this. Should documentation changes go to
"net" or to "net-next"? I targeted this for "net" as a documentation-only
change set. But if I start adding kdocs, it won't be so clear-cut anymore...

> > +Simplifying the device tree bindings to require a single ``link`` phandle
> > +cannot be done without rethinking the distributed probing scheme. One idea is
> > +to reinstate the switch tree as a platform device, but this time created
> > +dynamically by ``dsa_register_switch()`` if the switch's tree ID is not already
> > +present in the system. The switch tree driver walks the device tree hop by hop,
> > +following the ``link`` references, to discover all the other switches, and to
> > +construct the full routing table. It then uses the component API to register
> > +itself as an aggregate driver, with each of the discovered switches as a
> > +component. When ``dsa_register_switch()`` completes for all component switches,
> > +the tree probing continues and calls ``dsa_tree_setup()``.
> 
> An interesting paragraph, but I am not sure this is such a big pain point
> that we should be spending much description of it, especially since it
> sounds like this is solved, but could be improved, but in the grand scheme
> of things, should we really be spending any time on it?
> 
> Same-vendor cascade configurations are Marvell specific, and different
> vendor cascades require distinct switch trees, therefore do not really fall
> into the cross-chip design anymore. In a nutshell, cross-chip is very very
> niche and limited.

Well, I've been contacted by somebody to help with a custom board with 3
daisy chained SJA1105 switches. He is doing the testing for me, and I'm
waiting for the results to come back. I'm currently waiting for an uprev
to an NXP BSP on top of 5.15 to be finalized, so that patches developed
over net-next are at least barely testable...

If you remember, the SJA1105 has these one-shot management routes which
must be installed over SPI, and they decide where the next transmitted
link-local packet goes.

Well, the driver only supports single-chip trees, as you say, because it
only programs the management route in the targeted switch. With daisy
chained switches, one needs to figure out the actual packet route from
the CPU to the leaf user port, and install a management route for every
switch along that route. Otherwise, intermediary switches won't know
what to do with the packets, and drop them.

The specific request was: "help, PTP doesn't work!"

I did solve the problem, and the documentation paragraphs above are
basically my development notes while examining the existing support
and the way in which it isn't giving me the tools I need.

I do need to send a dt-bindings patch on this topic as well. The fact
that we put all cascade links in the device tree means we don't know
which one represents the direct connection to the neighbor cascade port,
and which one is an indirect connection. We need to bake in an
assumption, like for example "the first element of the 'link' phandle
array is the direct connection". I hope retroactively doing that won't
bother the device tree maintainers too much. If it does, the problem is
intractable.

But I agree, requirements for cross-chip support are rare, and with
SJA1105 I don't even own a board where I can directly test it. I
specifically bought the Turris MOX for that.

> > +To untangle this situation and improve the reliability of the cross-chip
> > +management layer, it is necessary to split the DSA operations into ones which
> > +can fail, and ones which cannot fail. For example, ``port_fdb_add()`` can fail,
> > +whereas ``port_fdb_del()`` cannot. Then, cross-chip operations can take a
> > +fallible function to make forward progress, and an infallible function for
> > +rollback. However, it is unclear what to do in the case of ``change_mtu()``.
> > +It is hard to classify this operation as either fallible or infallible. It is
> > +also unclear how to deal with I/O access errors on the switch's management bus.
> 
> How about something like this:
> 
> I/O access errors occurring during the switch configuration should always be
> logged for debugging but are very unlikely to be recoverable and therefore
> require an investigation into the failure mechanism and root cause or
> possible work around.

Yeah, I suppose.

What do you think, has something like phy_error() been a useful mechanism
for anything? Or just a pain in the rear? Would it be useful to shut
everything down on a bus I/O error, phylib style?
Florian Fainelli Dec. 11, 2023, 5:29 p.m. UTC | #4
On 12/8/23 17:58, Vladimir Oltean wrote:
> On Fri, Dec 08, 2023 at 03:03:35PM -0800, Florian Fainelli wrote:
>>> +Probing through ``platform_data`` remains limited in functionality. The
>>> +``ds->dev->of_node`` and ``dp->dn`` pointers are NULL, and the OF API calls
>>> +made by drivers for discovering more complex setups fall back to the implicit
>>> +handling. There is no way to describe multi-chip trees, or switches with
>>> +multiple CPU ports. It is always assumed that shared ports are configured by
>>> +the driver to the maximum supported link speed (they do not use phylink).
>>> +User ports cannot connect to arbitrary PHYs, but are limited to
>>> +``ds->user_mii_bus``.
>>
>> Maybe a mention here that this implies built-in/internal PHY devices only,
>> just as a way to re-iterate the limitation and to echo to the previous
>> patch?
> 
> I am not fully convinced that saying "user_mii_bus can only access internal PHYs"
> only would be correct. This paragraph also exists in the user MDIO bus section:
> 
> For Ethernet switches which have both external and internal MDIO buses, the
> user MII bus can be utilized to mux/demux MDIO reads and writes towards either
> internal or external MDIO devices this switch might be connected to: internal
> PHYs, external PHYs, or even external switches.

OK, so what you have put is good enough, no need to add more that would 
only be a repeat of the user_mii_bus description (with the risk of the 
two being out of sync eventually).

> 
>>> +
>>> +Many switch drivers introduced since after DSA's second OF binding were not
>>> +designed to support probing through ``platform_data``. Most notably,
>>> +``device_get_match_data()`` and ``of_device_get_match_data()`` return NULL with
>>> +``platform_data``, so generally, drivers which do not have alternative
>>> +mechanisms for this do not support ``platform_data``.
>>> +
>>> +Extending the ``platform_data`` support implies adding more separate code.
>>> +An alternative worth exploring is growing DSA towards the ``fwnode`` API.
>>> +However, not the entire OF binding should be generalized to ``fwnode``.
>>> +The current bindings must be examined with a critical eye, and the properties
>>> +which are no longer considered good practice (like ``label``, because ``udev``
>>> +offers this functionality) should first be deprecated in OF, and not migrated
>>> +to ``fwnode``.
>>> +
>>> +With ``fwnode`` support in the DSA framework, the ``fwnode_create_software_node()``
>>> +API could be used as an alternative to ``platform_data``, to allow describing
>>> +and probing switches on non-OF.
>>
>> Might suggest to move the 3 paragraphs towards the end because otherwise it
>> might be a distraction for the reader who might think: ah that's it? no more
>> technical details!? Looks like Linus made the same suggestion in his review.
> 
> I think it needs even more rethinking than that. I now remembered that
> we also have a "Design limitations" section where the future work can go.
> 
> It's hard to navigate through what is now a 1400 line document and not
> get lost.

Agreed.

> 
> I'm hoping I could move the documentation of variables and methods that
> now sits in "Driver development" into kdoc comments inline with the code,
> to reduce the clutter a bit.
> 
> But I don't know how to tackle this. Should documentation changes go to
> "net" or to "net-next"? I targeted this for "net" as a documentation-only
> change set. But if I start adding kdocs, it won't be so clear-cut anymore...

Personal preference is that documentation changes should always target 
'net' (unless they document a 'net-next' change obviously) because 
accurate documentation is most important than anything.

> 
>>> +Simplifying the device tree bindings to require a single ``link`` phandle
>>> +cannot be done without rethinking the distributed probing scheme. One idea is
>>> +to reinstate the switch tree as a platform device, but this time created
>>> +dynamically by ``dsa_register_switch()`` if the switch's tree ID is not already
>>> +present in the system. The switch tree driver walks the device tree hop by hop,
>>> +following the ``link`` references, to discover all the other switches, and to
>>> +construct the full routing table. It then uses the component API to register
>>> +itself as an aggregate driver, with each of the discovered switches as a
>>> +component. When ``dsa_register_switch()`` completes for all component switches,
>>> +the tree probing continues and calls ``dsa_tree_setup()``.
>>
>> An interesting paragraph, but I am not sure this is such a big pain point
>> that we should be spending much description of it, especially since it
>> sounds like this is solved, but could be improved, but in the grand scheme
>> of things, should we really be spending any time on it?
>>
>> Same-vendor cascade configurations are Marvell specific, and different
>> vendor cascades require distinct switch trees, therefore do not really fall
>> into the cross-chip design anymore. In a nutshell, cross-chip is very very
>> niche and limited.
> 
> Well, I've been contacted by somebody to help with a custom board with 3
> daisy chained SJA1105 switches. He is doing the testing for me, and I'm
> waiting for the results to come back. I'm currently waiting for an uprev
> to an NXP BSP on top of 5.15 to be finalized, so that patches developed
> over net-next are at least barely testable...
> 
> If you remember, the SJA1105 has these one-shot management routes which
> must be installed over SPI, and they decide where the next transmitted
> link-local packet goes.

Yes, it's coming back now.

> 
> Well, the driver only supports single-chip trees, as you say, because it
> only programs the management route in the targeted switch. With daisy
> chained switches, one needs to figure out the actual packet route from
> the CPU to the leaf user port, and install a management route for every
> switch along that route. Otherwise, intermediary switches won't know
> what to do with the packets, and drop them.
> 
> The specific request was: "help, PTP doesn't work!"
> 
> I did solve the problem, and the documentation paragraphs above are
> basically my development notes while examining the existing support
> and the way in which it isn't giving me the tools I need.
> 
> I do need to send a dt-bindings patch on this topic as well. The fact
> that we put all cascade links in the device tree means we don't know
> which one represents the direct connection to the neighbor cascade port,
> and which one is an indirect connection. We need to bake in an
> assumption, like for example "the first element of the 'link' phandle
> array is the direct connection". I hope retroactively doing that won't
> bother the device tree maintainers too much. If it does, the problem is
> intractable.

Yes I believe this is exactly how we had intended for the "link" 
property to be used. We should have indeed encoded that more precisely 
into the property.

> 
> But I agree, requirements for cross-chip support are rare, and with
> SJA1105 I don't even own a board where I can directly test it. I
> specifically bought the Turris MOX for that.
> 
>>> +To untangle this situation and improve the reliability of the cross-chip
>>> +management layer, it is necessary to split the DSA operations into ones which
>>> +can fail, and ones which cannot fail. For example, ``port_fdb_add()`` can fail,
>>> +whereas ``port_fdb_del()`` cannot. Then, cross-chip operations can take a
>>> +fallible function to make forward progress, and an infallible function for
>>> +rollback. However, it is unclear what to do in the case of ``change_mtu()``.
>>> +It is hard to classify this operation as either fallible or infallible. It is
>>> +also unclear how to deal with I/O access errors on the switch's management bus.
>>
>> How about something like this:
>>
>> I/O access errors occurring during the switch configuration should always be
>> logged for debugging but are very unlikely to be recoverable and therefore
>> require an investigation into the failure mechanism and root cause or
>> possible work around.
> 
> Yeah, I suppose.
> 
> What do you think, has something like phy_error() been a useful mechanism
> for anything? Or just a pain in the rear? Would it be useful to shut
> everything down on a bus I/O error, phylib style?

phy_error() has been somewhat helpful in knowing whether we have a hard 
to debug, possibly silent MDIO error lurking around. Because the PHY 
library polls or reacts to interrupts updating the link status has a 
very high probability of hitting a MDIO transaction error. This is 
similar to what we see with some of our fielded reports where most 
text/data corruptions occur in scheduler code paths... because the 
scheduler is very often executed. So you know you have an error, but you 
don't know yet why.

For a switch there can be a lot of different transactions, but being 
able to know where it fails precisely could be a lead as to where the 
failure is.
diff mbox series

Patch

diff --git a/Documentation/networking/dsa/dsa.rst b/Documentation/networking/dsa/dsa.rst
index 2cd91358421e..305bb471fc02 100644
--- a/Documentation/networking/dsa/dsa.rst
+++ b/Documentation/networking/dsa/dsa.rst
@@ -1200,14 +1200,180 @@  methods must be implemented:
 - ``port_hsr_leave``: function invoked when a given switch port leaves a
   DANP/DANH and returns to normal operation as a standalone port.
 
-TODO
-====
-
-Making SWITCHDEV and DSA converge towards an unified codebase
--------------------------------------------------------------
+History and development directions
+==================================
 
-SWITCHDEV properly takes care of abstracting the networking stack with offload
-capable hardware, but does not enforce a strict switch device driver model. On
-the other DSA enforces a fairly strict device driver model, and deals with most
-of the switch specific. At some point we should envision a merger between these
-two subsystems and get the best of both worlds.
+This section gives some background context to the developers who are eager to
+make changes to the DSA core (``net/dsa/`` excepting ``tag_*.c`` files).
+
+Initially (2008), the DSA core was a platform driver for a platform device
+representing the switch tree. Support for hardware switch chips lived in
+separate modules, which were required to call the ``register_switch_driver()``
+method. The tree platform device was instantiated through ``platform_data``.
+
+Later (2013), the DSA core gained support for OF bindings. The initial bindings
+(now no longer supported) expected a compatible string of "marvell,dsa" or
+"brcm,bcm7445-switch-v4.0" for the tree, and, like the implementation of the
+DSA core, also expected that all switches in the tree are MDIO devices on an
+``mii_bus``. The initial bindings and core driver implementation first assumed
+that all switches in the tree are connected to the same MDIO bus, then in 2015,
+they were augmented such that each switch may be on its own MDIO bus.
+
+The early DSA core was more concerned with using switches as port multiplexers
+and with managing auxiliary functionality like temperature sensors
+(``CONFIG_NET_DSA_HWMON``) rather than with production-ready features.
+Bridging and bonding were handled in software. Support for hardware-accelerated
+bridging, by means of integrating with the ``switchdev`` framework, was added
+in 2015.
+
+In mid 2016, the second (and current) device tree binding for DSA was
+introduced. Here, individual switches are represented as devices in the Linux
+device model sense, on arbitrary buses, not just MDIO. The limitation of being
+able to describe a single CPU port was lifted (the driver support for multiple
+CPU ports came much later, in 2022). During this time, DSA stopped playing the
+role of a device model for switches, and ``register_switch_driver()`` was
+replaced with ``dsa_register_switch()``, the modern mechanism through which
+arbitrary devices may use DSA as a framework exclusively for integrating
+Ethernet switch IP blocks with the network stack.
+
+With the conversion to the second device tree binding, DSA's support for
+``platform_data`` (used in non-OF scenarios) was also changed to align. With
+the DSA tree no longer being a platform device, the ``platform_data`` structure
+moved to individual switch devices.
+
+Support for the initial device tree binding was subsequently removed in 2019.
+
+Probing through ``platform_data`` remains limited in functionality. The
+``ds->dev->of_node`` and ``dp->dn`` pointers are NULL, and the OF API calls
+made by drivers for discovering more complex setups fall back to the implicit
+handling. There is no way to describe multi-chip trees, or switches with
+multiple CPU ports. It is always assumed that shared ports are configured by
+the driver to the maximum supported link speed (they do not use phylink).
+User ports cannot connect to arbitrary PHYs, but are limited to
+``ds->user_mii_bus``.
+
+Many switch drivers introduced since after DSA's second OF binding were not
+designed to support probing through ``platform_data``. Most notably,
+``device_get_match_data()`` and ``of_device_get_match_data()`` return NULL with
+``platform_data``, so generally, drivers which do not have alternative
+mechanisms for this do not support ``platform_data``.
+
+Extending the ``platform_data`` support implies adding more separate code.
+An alternative worth exploring is growing DSA towards the ``fwnode`` API.
+However, not the entire OF binding should be generalized to ``fwnode``.
+The current bindings must be examined with a critical eye, and the properties
+which are no longer considered good practice (like ``label``, because ``udev``
+offers this functionality) should first be deprecated in OF, and not migrated
+to ``fwnode``.
+
+With ``fwnode`` support in the DSA framework, the ``fwnode_create_software_node()``
+API could be used as an alternative to ``platform_data``, to allow describing
+and probing switches on non-OF.
+
+DSA is used to control very complex switching chips. Some devices have a
+microprocessor, and in some cases, this microprocessor can run a variant of the
+Linux kernel. Sometimes, the switch packet I/O procedure of the internal
+microprocessor is different from the packet I/O procedure for an external host.
+The internal processor may have access to switch queues, while the external
+processor may require DSA tags. Other times, the microprocessor may also be
+connected to the switch in a DSA fashion (using an internal MAC to MAC
+connection).
+
+Since DSA is only concerned with switches where the packet I/O is handled
+by an intermediate conduit driver, this leads to the situation where it is
+recommended to have two drivers for the same switch hardware. When the queues
+are accessed directly, a separate non-DSA driver should be used, with its own
+skeleton which is integrated with ``switchdev`` on its own.
+
+In 2019, a DSA driver was added for the ``ocelot`` switch, which is a thin
+front-end over a hardware library that is also common with a ``switchdev``
+driver. While this design is encouraged for other similar cases, code
+duplication among multiple front-ends is a concern, so it may be desirable to
+extract some of DSA's core functionality into a reusable library for Ethernet
+switches. This could offer a driver-facing API similar to ``dsa_switch_ops``,
+but the aspects relating to cross-chip management, to DSA tags and to the
+conduit interface would remain DSA-specific.
+
+Traditionally, DSA switch drivers for discrete chips own the entire
+``spi_device``, ``i2c_client``, ``mdio_device`` etc. When the chip is complex
+and has multiple embedded peripherals (IRQ controller, GPIO controller, MDIO
+controller, LED controller, sensors), the handling of these peripherals is
+currently monolithic within the same device driver that also calls
+``dsa_register_switch()``.
+
+But an internal microprocessor may have a very different view of the switch
+address space, and may have discrete Linux drivers for each peripheral.
+In 2023, the ``ocelot_ext`` driver was added, which deviated from the
+traditional DSA driver architecture. Rather than probing on the entire
+``spi_device``, it created a multi-function device (MFD) top-level driver for
+it (associated with the SoC at large), and the switching IP is only one of the
+children of the MFD (it is a platform device with regmaps provided by its
+parent). The drivers for each peripheral in this switch SoC are the same when
+controlled over SPI and when controlled by the internal processor.
+
+Authors of new switch drivers that use DSA are encouraged to have a wider view
+of the peripherals on the chip that they are controlling, and to use the MFD
+model to separate the components whenever possible. The general direction for
+the DSA core is to shrink in size and to focus only on the Ethernet switching
+IP and its ports. ``CONFIG_NET_DSA_HWMON`` was removed in 2017. Adding new
+methods to ``struct dsa_switch_ops`` which are outside of DSA's core focus on
+Ethernet is strongly discouraged.
+
+DSA's support for multi-chip trees also has limitations. After converting from
+the first to the second OF binding, the switch tree stopped being a platform
+device, and its probing became implicit, and distributed among its constituent
+switch devices. There is currently a synchronization point in
+``dsa_tree_setup_routing_table()``, through which the tree setup is performed
+only once, when there is more than one switch in the tree. The first N-1
+switches will end their probing early, and the last switch will configure the
+entire tree, and thus all the other switches, in its ``dsa_register_switch()``
+calling context.
+
+Furthermore, the synchronization point works because each switch is able to
+determine, in a distributed manner, that the routing table is not complete, aka
+that there is at least one switch which has not probed. This is only possible
+because the ``link`` properties in the device tree describe the connections to
+all other cascade ports in the tree, not just to the directly connected cascade
+port. If only the latter were described, it could happen that a switch waits
+for its direct neighbors to probe before setting up the tree, but not
+necessarily for all switches in the tree (therefore, it sets up the tree too
+early).
+
+With more than 3 switches in a tree, it becomes a difficult task to write
+correct device trees which are not missing any link to the other cascade ports
+in the tree. The routing table, based on which ``dsa_routing_port()`` works, is
+directly taken from the device tree, although it could be computed through BFS
+instead. This means that the device tree writer needs to specify more than just
+the hardware description (represented by the direct cascade port connections).
+
+Simplifying the device tree bindings to require a single ``link`` phandle
+cannot be done without rethinking the distributed probing scheme. One idea is
+to reinstate the switch tree as a platform device, but this time created
+dynamically by ``dsa_register_switch()`` if the switch's tree ID is not already
+present in the system. The switch tree driver walks the device tree hop by hop,
+following the ``link`` references, to discover all the other switches, and to
+construct the full routing table. It then uses the component API to register
+itself as an aggregate driver, with each of the discovered switches as a
+component. When ``dsa_register_switch()`` completes for all component switches,
+the tree probing continues and calls ``dsa_tree_setup()``.
+
+The cross-chip management layer (``net/dsa/switch.c``) can also be improved.
+Currently ``struct dsa_switch_tree`` holds a list of ports rather than a list
+of switches, and thus, calling one function for each switch in a tree is hard.
+DSA currently uses one notifier chain per tree as a workaround for that, with
+each switch registered as a listener (``dsa_switch_event()``).
+
+It is considered bad practice to use notifiers when the emitter and the
+listener are known to each other, instead of a plain function call. Also, error
+handling with notifiers is not robust. When one switch fails mid-operation,
+there is no rollback to the previous state for switches which already completed
+the operation successfully.
+
+To untangle this situation and improve the reliability of the cross-chip
+management layer, it is necessary to split the DSA operations into ones which
+can fail, and ones which cannot fail. For example, ``port_fdb_add()`` can fail,
+whereas ``port_fdb_del()`` cannot. Then, cross-chip operations can take a
+fallible function to make forward progress, and an infallible function for
+rollback. However, it is unclear what to do in the case of ``change_mtu()``.
+It is hard to classify this operation as either fallible or infallible. It is
+also unclear how to deal with I/O access errors on the switch's management bus.