
[RFC] Documentation: devicetree: add description for generic bus properties

Message ID 20131127172806.GC2291@e103592.cambridge.arm.com (mailing list archive)
State New, archived

Commit Message

Dave Martin Nov. 27, 2013, 5:28 p.m. UTC
Hi all,

SoC architectures are getting increasingly complex in ways that are not
transparent to software.

A particular emerging issue is that of multi-master SoCs, which may have
different address views, IOMMUs, and coherency behaviour from one master
to the next.

DT can't describe multi-master systems today except for PCI DMA and
similar.  This comes with constraints and assumptions that won't work
for emerging SoC bus architectures.  On-SoC, a device's interface to the
system can't be described in terms of a single interface to a single
"bus".

Different masters may have different views of the system too.  Software
needs to understand the true topology in order to do address mapping,
coherency management etc., in any generic way.

One piece of the puzzle is to define how to describe these topologies in
DT.

The other is how to get the right abstractions in the kernel to drive
these systems in a generic way.

The following proposal (originally from Will) begins to address the DT
part.

Comments encouraged -- I anticipate it may take some discussion to
reach a consensus here.

Cheers
---Dave


From will.deacon@arm.com Wed Nov 20 12:06:22 2013
Date: Wed, 20 Nov 2013 12:06:13 +0000
Subject: [PATCH RFC v2] Documentation: devicetree: add description for generic bus properties

This patch documents properties that can be used as part of bus and
device bindings in order to describe their linkages within the system
topology.

Use of these properties allows topological parsing to occur in generic
library code, making it easier for bus drivers to parse information
regarding their upstream masters and potentially allowing us to treat
the slave and master interfaces separately for a given device.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---

A number of discussion points remain to be resolved:

  - Use of the ranges property and describing slave vs master bus
    address ranges. In the latter case, we actually want to describe our
    address space with respect to the bus on which the device masters,
    rather than with respect to the parent. This could potentially be
    achieved by adding properties such as dma-parent and dma-ranges
    (already used by PPC?) -- see the rough sketch after this list.

  - Describing masters that master through multiple different buses

  - How on Earth this fits in with the Linux device model (it doesn't)

  - Interaction with IOMMU bindings (currently under discussion)
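
As a strawman for the first discussion point, the master-side view
might be described with a dma-parent cross-link plus the standard
dma-ranges translation.  Purely illustrative: "dma-parent" is a
hypothetical property name here, and all addresses are made up.

    dma_bus: dma-bus {
            #address-cells = <1>;
            #size-cells = <1>;
            /* master-visible 0x0... appears at 0x80000000... upstream */
            dma-ranges = <0x0 0x80000000 0x40000000>;
    };

    dma-engine@4000 {
            reg = <0x4000 0x1000>;       /* slave side: control registers */
            dma-parent = <&dma_bus>;     /* master side: DMA issued via dma_bus */
    };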

Cheers,

Will

 .../devicetree/bindings/arm/coherent-bus.txt       | 110 +++++++++++++++++++++
 1 file changed, 110 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/arm/coherent-bus.txt

Comments

Greg Kroah-Hartman Nov. 27, 2013, 11:06 p.m. UTC | #1
On Wed, Nov 27, 2013 at 05:28:06PM +0000, Dave Martin wrote:
> Hi all,
> 
> SoC architectures are getting increasingly complex in ways that are not
> transparent to software.
> 
> A particular emerging issue is that of multi-master SoCs, which may have
> different address views, IOMMUs, and coherency behaviour from one master
> to the next.
> 
> DT can't describe multi-master systems today except for PCI DMA and
> similar.  This comes with constraints and assumptions that won't work
> for emerging SoC bus architectures.  On-SoC, a device's interface to the
> system can't be described in terms of a single interface to a single
> "bus".
> 
> Different masters may have different views of the system too.  Software
> needs to understand the true topology in order to do address mapping,
> coherency management etc., in any generic way.
> 
> One piece of the puzzle is to define how to describe these topologies in
> DT.
> 
> The other is how to get the right abstractions in the kernel to drive
> these systems in a generic way.
> 
> The following proposal (originally from Will) begins to address the DT
> part.
> 
> Comments encouraged -- I anticipate it may take some discussion to
> reach a consensus here.
> 
> Cheers
> ---Dave
> 
> 
> >From will.deacon@arm.com Wed Nov 20 12:06:22 2013
> Date: Wed, 20 Nov 2013 12:06:13 +0000
> Subject: [PATCH RFC v2] Documentation: devicetree: add description for generic bus properties
> 
> This patch documents properties that can be used as part of bus and
> device bindings in order to describe their linkages within the system
> topology.
> 
> Use of these properties allows topological parsing to occur in generic
> library code, making it easier for bus drivers to parse information
> regarding their upstream masters and potentially allows us to treat
> the slave and master interfaces separately for a given device.
> 
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
> 
> A number of discussion points remain to be resolved:
> 
>   - Use of the ranges property and describing slave vs master bus
>     address ranges. In the latter case, we actually want to describe our
>     address space with respect to the bus on which the bus masters,
>     rather than the parent. This could potentially be achieved by adding
>     properties such as dma-parent and dma-ranges (already used by PPC?)
> 
>   - Describing masters that master through multiple different buses
> 
>   - How on Earth this fits in with the Linux device model (it doesn't)

How does this _not_ fit into the Linux device model?  What am I missing
here that precludes the use of the "driver/device/bus" model we have
today?

>   - Interaction with IOMMU bindings (currently under discussion)
> 
> Cheers,
> 
> Will
> 
>  .../devicetree/bindings/arm/coherent-bus.txt       | 110 +++++++++++++++++++++

Why "arm"?

What makes it ARM specific?

thanks,

greg k-h
Will Deacon Nov. 28, 2013, 10:28 a.m. UTC | #2
Hi Greg,

On Wed, Nov 27, 2013 at 11:06:50PM +0000, Greg KH wrote:
> On Wed, Nov 27, 2013 at 05:28:06PM +0000, Dave Martin wrote:
> > >From will.deacon@arm.com Wed Nov 20 12:06:22 2013
> > A number of discussion points remain to be resolved:
> > 
> >   - Use of the ranges property and describing slave vs master bus
> >     address ranges. In the latter case, we actually want to describe our
> >     address space with respect to the bus on which the bus masters,
> >     rather than the parent. This could potentially be achieved by adding
> >     properties such as dma-parent and dma-ranges (already used by PPC?)
> > 
> >   - Describing masters that master through multiple different buses
> > 
> >   - How on Earth this fits in with the Linux device model (it doesn't)
> 
> How does this _not_ fit into the Linux device model?  What am I missing
> here that precludes the use of the "driver/device/bus" model we have
> today?

The main problem is that we have devices which slave on one bus and master
on another. That then complicates probing, power-management, IOMMU
configuration, address mapping (e.g. I walk the slave buses to figure out
where the slave registers live, but then I need a way to work out where
exactly I master on a different bus) and dynamic coherency, amongst other
things.

If we try to use the current infrastructure then we end up with one bus per
device, which usually ends up being a fake bus representing both the slave
and master buses (which is how the platform bus gets abused) and then device
drivers having their own idea of the system topology where it's required.
This is fairly horrible and doesn't work for anything other than the trivial
case, where one or both of the buses are `dumb' and don't require any work
from Linux.
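
To make that concrete, here is a minimal sketch of such a device.  The
node names and the dma-parent-style cross-link are placeholders for
whatever the binding finally defines, not settled syntax:

    periph_bus: periph-bus {        /* dumb bus carrying the slave interface */
            compatible = "simple-bus";

            dma-engine@4000 {
                    reg = <0x4000 0x1000>;         /* slaves here for MMIO */
                    dma-parent = <&coherent_bus>;  /* masters over here */
            };
    };

    coherent_bus: coherent-bus {    /* master-side, coherency-capable bus */
            /* coherency/topology properties would live here */
    };

Probing naturally walks periph_bus, while DMA address mapping and
coherency management would have to follow the cross-link instead.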

> >  .../devicetree/bindings/arm/coherent-bus.txt       | 110 +++++++++++++++++++++
> 
> Why "arm"?
> 
> What makes it ARM specific?

This is just an RFC, so I'd be happy to put the binding somewhere more
broad. I'm not sure how much of an issue this is outside of the SoC space,
though.

Will
Dave Martin Nov. 28, 2013, 5:33 p.m. UTC | #3
[Resending -- apologies for any duplicates received.

Real reply below.

My lame excuse:

        It turns out that Mutt's decode-copy command (Esc-C) will lose
        most headers unless you invoke it from the message viewer *and*
        you have full header display enabled at the time.  Otherwise, or
        if invoked from the index, many headers may disappear including
        headers not filtered by normal header weeding, and everything
        needed for threading.  copy-message (C) always saves full headers
        though.  Maybe it's a bug.  Or not.  Go figure.
       
        This + my habit of saving messages I want to reply to in a
        separate mbox + mindlessly using decode-save instead of
        save-message even when I'm not going to run git am on the result
        = facepalm.]


On Thu, Nov 28, 2013 at 10:28:45AM +0000, Will Deacon wrote:
> Hi Greg,
> 
> On Wed, Nov 27, 2013 at 11:06:50PM +0000, Greg KH wrote:
> > On Wed, Nov 27, 2013 at 05:28:06PM +0000, Dave Martin wrote:
> > > >From will.deacon@arm.com Wed Nov 20 12:06:22 2013
> > > A number of discussion points remain to be resolved:
> > > 
> > >   - Use of the ranges property and describing slave vs master bus
> > >     address ranges. In the latter case, we actually want to describe our
> > >     address space with respect to the bus on which the bus masters,
> > >     rather than the parent. This could potentially be achieved by adding
> > >     properties such as dma-parent and dma-ranges (already used by PPC?)
> > > 
> > >   - Describing masters that master through multiple different buses
> > > 
> > >   - How on Earth this fits in with the Linux device model (it doesn't)
> > 
> > How does this _not_ fit into the Linux device model?  What am I missing
> > here that precludes the use of the "driver/device/bus" model we have
> > today?

The physical-sockets history of buses like PCI tends to force a simple
tree-like topology as a natural consequence.  You also end up with
closely linked topologies for power, clocks, interrupts etc., because
those all pass through the same sockets, so it's difficult to have a
different structure.

On SoCs, those constraints have never existed and are not followed.  A
device's interface to the system is almost always split into multiple
connections, not covered by a single design or standard.  The problem
now is that increasing complexity means that the sometimes bizarre
topology features of SoCs are becoming less and less transparent for
software.

The device model currently seems to assume that certain things (power,
DMA and MMIO accessibility) follow the tree, which may not work for many
SoCs, while other things (clocks, regulators, interrupts etc.) are not
incorporated into the tree at all.  That keeps them independent, but it
may make some abstractions impossible today.

How much this matters for actual systems is hard to foresee yet, since
not _all_ possible insanities find their way into silicon.  The onus
should certainly be on us (i.e., the ARM/SoC community) to demonstrate
if the device model needs to change, and to find practical ways to
change it that minimise the resulting churn.

> The main problem is that we have devices which slave on one bus and master
> on another. That then complicates probing, power-management, IOMMU
> configuration, address mapping (e.g. I walk the slave buses to figure out
> where the slave registers live, but then I need a way to work out where
> exactly I master on a different bus) and dynamic coherency, amongst other
> things.
> 
> If we try to use the current infrastructure then we end up with one bus per
> device, which usually ends up being a fake bus representing both the slave
> and master buses (which is how the platform bus gets abused) and then device
> drivers having their own idea of the system topology where it's required.
> 
> This is fairly horrible and doesn't work for anything other than the trivial
> case, where one or both of the buses are `dumb' and don't require any work
> from Linux.

If we can come up with some generic bus type that is just a container for
a load of hooks that know how to deal with various aspects of each device's
interface to the system, on a per-device basis, then that may be a start.

The platform bus kinda serves that role, but the trouble with that is that
it doesn't encourage any abstraction at all.  In the face of increasing
complexity, abstraction is desperately needed.


> > >  .../devicetree/bindings/arm/coherent-bus.txt       | 110 +++++++++++++++++++++
> > 
> > Why "arm"?
> > 
> > What makes it ARM specific?
> 
> This is just an RFC, so I'd be happy to put the binding somewhere more
> broad. I'm not sure how much of an issue this is outside of the SoC space,
> though.

I think that the ARM community are the ones who care the most today,
so are likely to make the most noise about it.

The binding is entirely generic in concept, so we should certainly
push for it to be non-ARM-specific.  Non-ARM SoCs will likely need
to solve this problem too at some point.

Cheers
---Dave
Greg Kroah-Hartman Nov. 28, 2013, 7:10 p.m. UTC | #4
On Thu, Nov 28, 2013 at 10:28:45AM +0000, Will Deacon wrote:
> Hi Greg,
> 
> On Wed, Nov 27, 2013 at 11:06:50PM +0000, Greg KH wrote:
> > On Wed, Nov 27, 2013 at 05:28:06PM +0000, Dave Martin wrote:
> > > >From will.deacon@arm.com Wed Nov 20 12:06:22 2013
> > > A number of discussion points remain to be resolved:
> > > 
> > >   - Use of the ranges property and describing slave vs master bus
> > >     address ranges. In the latter case, we actually want to describe our
> > >     address space with respect to the bus on which the bus masters,
> > >     rather than the parent. This could potentially be achieved by adding
> > >     properties such as dma-parent and dma-ranges (already used by PPC?)
> > > 
> > >   - Describing masters that master through multiple different buses
> > > 
> > >   - How on Earth this fits in with the Linux device model (it doesn't)
> > 
> > How does this _not_ fit into the Linux device model?  What am I missing
> > here that precludes the use of the "driver/device/bus" model we have
> > today?
> 
> The main problem is that we have devices which slave on one bus and master
> on another. That then complicates probing, power-management, IOMMU
> configuration, address mapping (e.g. I walk the slave buses to figure out
> where the slave registers live, but then I need a way to work out where
> exactly I master on a different bus) and dynamic coherency, amongst other
> things.
> 
> If we try to use the current infrastructure then we end up with one bus per
> device, which usually ends up being a fake bus representing both the slave
> and master buses (which is how the platform bus gets abused) and then device
> drivers having their own idea of the system topology where it's required.
> This is fairly horrible and doesn't work for anything other than the trivial
> case, where one or both of the buses are `dumb' and don't require any work
> from Linux.

Then just put everything on a single "bus", there's nothing in the
driver core that requires a bus to work in a specific way.

> > >  .../devicetree/bindings/arm/coherent-bus.txt       | 110 +++++++++++++++++++++
> > 
> > Why "arm"?
> > 
> > What makes it ARM specific?
> 
> This is just an RFC, so I'd be happy to put the binding somewhere more
> broad. I'm not sure how much of an issue this is outside of the SoC space,
> though.

There aren't "SoC"s on other architectures?  :)

thanks,

greg k-h
Greg Kroah-Hartman Nov. 28, 2013, 7:13 p.m. UTC | #5
On Thu, Nov 28, 2013 at 05:33:39PM +0000, Dave Martin wrote:
> On Thu, Nov 28, 2013 at 10:28:45AM +0000, Will Deacon wrote:
> > Hi Greg,
> > 
> > On Wed, Nov 27, 2013 at 11:06:50PM +0000, Greg KH wrote:
> > > On Wed, Nov 27, 2013 at 05:28:06PM +0000, Dave Martin wrote:
> > > > >From will.deacon@arm.com Wed Nov 20 12:06:22 2013
> > > > A number of discussion points remain to be resolved:
> > > > 
> > > >   - Use of the ranges property and describing slave vs master bus
> > > >     address ranges. In the latter case, we actually want to describe our
> > > >     address space with respect to the bus on which the bus masters,
> > > >     rather than the parent. This could potentially be achieved by adding
> > > >     properties such as dma-parent and dma-ranges (already used by PPC?)
> > > > 
> > > >   - Describing masters that master through multiple different buses
> > > > 
> > > >   - How on Earth this fits in with the Linux device model (it doesn't)
> > > 
> > > How does this _not_ fit into the Linux device model?  What am I missing
> > > here that precludes the use of the "driver/device/bus" model we have
> > > today?
> 
> The physical-sockets history of buses like PCI tends to force a simple
> tree-like topology as a natural consequence.  You also end up with
> closely linked topologies for power, clocks, interrupts etc., because
> those all pass through the same sockets, so it's difficult to have a
> different structure.

There's nothing in the driver core that enforces such a topology.

> On SoC, those constraints have never existed and are not followed.  A
> device's interface to the system is almost always split into multiple
> connections, not covered by a single design or standard.  The problem
> now is that increasing complexity means that the sometimes bizarre
> topology features of SoCs are becoming less and less transparent for
> software.
> 
> The device model currently seems to assume that certain things (power,
> DMA and MMIO accessibility) follow the tree (which may not work for many
> SoCs), and some other things (clocks, regulators, interrupts etc.) are
> not incorporated at all -- making them independent, but it may make some
> abstractions impossible today.
> 
> How much this matters for actual systems is hard to foresee yet.  Since
> not _all_ possible insanities find their way into silicon.  The
> onus should certainly be on us (i.e., the ARM/SoC community) to
> demonstrate if the device model needs to change, and to find practical
> ways to change it that minimise the resulting churn.

Yes it is, you all are the ones tasked with implementing the crazy crap
the hardware people have created, best of luck with that :)

> > The main problem is that we have devices which slave on one bus and master
> > on another. That then complicates probing, power-management, IOMMU
> > configuration, address mapping (e.g. I walk the slave buses to figure out
> > where the slave registers live, but then I need a way to work out where
> > exactly I master on a different bus) and dynamic coherency, amongst other
> > things.
> > 
> > If we try to use the current infrastructure then we end up with one bus per
> > device, which usually ends up being a fake bus representing both the slave
> > and master buses (which is how the platform bus gets abused) and then device
> > drivers having their own idea of the system topology where it's required.
> > 
> > This is fairly horrible and doesn't work for anything other than the trivial
> > case, where one or both of the buses are `dumb' and don't require any work
> > from Linux.
> 
> If we can come up with some generic bus type that is just a container for
> a load of hooks that know how to deal with various aspects of each device's
> interface to the system, on a per-device basis, than may be a start.
> 
> The platform bus kinda serves that role, but the trouble with that is that
> it doesn't encourage any abstraction at all.  In the face of increasing
> complexity, abstraction is desperately needed.

Then create a different abstraction, the normal solution to any problem
in programming :)

thanks,

greg k-h
Dave Martin Nov. 28, 2013, 7:39 p.m. UTC | #6
On Thu, Nov 28, 2013 at 11:13:31AM -0800, Greg KH wrote:
> On Thu, Nov 28, 2013 at 05:33:39PM +0000, Dave Martin wrote:
> > On Thu, Nov 28, 2013 at 10:28:45AM +0000, Will Deacon wrote:
> > > Hi Greg,
> > > 
> > > On Wed, Nov 27, 2013 at 11:06:50PM +0000, Greg KH wrote:
> > > > On Wed, Nov 27, 2013 at 05:28:06PM +0000, Dave Martin wrote:
> > > > > >From will.deacon@arm.com Wed Nov 20 12:06:22 2013
> > > > > A number of discussion points remain to be resolved:
> > > > > 
> > > > >   - Use of the ranges property and describing slave vs master bus
> > > > >     address ranges. In the latter case, we actually want to describe our
> > > > >     address space with respect to the bus on which the bus masters,
> > > > >     rather than the parent. This could potentially be achieved by adding
> > > > >     properties such as dma-parent and dma-ranges (already used by PPC?)
> > > > > 
> > > > >   - Describing masters that master through multiple different buses
> > > > > 
> > > > >   - How on Earth this fits in with the Linux device model (it doesn't)
> > > > 
> > > > How does this _not_ fit into the Linux device model?  What am I missing
> > > > here that precludes the use of the "driver/device/bus" model we have
> > > > today?
> > 
> > The physical-sockets history of buses like PCI tends to force a simple
> > tree-like topology as a natural consequence.  You also end up with
> > closely linked topologies for power, clocks, interrupts etc., because
> > those all pass through the same sockets, so it's difficult to have a
> > different structure.
> 
> There's nothing in the driver core that enforces such a topology.

Maybe not ... I have to wrap my head around that stuff a bit more.

> > On SoC, those constraints have never existed and are not followed.  A
> > device's interface to the system is almost always split into multiple
> > connections, not covered by a single design or standard.  The problem
> > now is that increasing complexity means that the sometimes bizarre
> > topology features of SoCs are becoming less and less transparent for
> > software.
> > 
> > The device model currently seems to assume that certain things (power,
> > DMA and MMIO accessibility) follow the tree (which may not work for many
> > SoCs), and some other things (clocks, regulators, interrupts etc.) are
> > not incorporated at all -- making them independent, but it may make some
> > abstractions impossible today.
> > 
> > How much this matters for actual systems is hard to foresee yet.  Since
> > not _all_ possible insanities find their way into silicon.  The
> > onus should certainly be on us (i.e., the ARM/SoC community) to
> > demonstrate if the device model needs to change, and to find practical
> > ways to change it that minimise the resulting churn.
> 
> Yes it is, you all are the ones tasked with implementing the crazy crap
> the hardware people have created, best of luck with that :)

Agreed.  The first assumption should be that we can fit in with the
existing device model -- we should only reconsider if we find that
to be impossible.

> > > The main problem is that we have devices which slave on one bus and master
> > > on another. That then complicates probing, power-management, IOMMU
> > > configuration, address mapping (e.g. I walk the slave buses to figure out
> > > where the slave registers live, but then I need a way to work out where
> > > exactly I master on a different bus) and dynamic coherency, amongst other
> > > things.
> > > 
> > > If we try to use the current infrastructure then we end up with one bus per
> > > device, which usually ends up being a fake bus representing both the slave
> > > and master buses (which is how the platform bus gets abused) and then device
> > > drivers having their own idea of the system topology where it's required.
> > > 
> > > This is fairly horrible and doesn't work for anything other than the trivial
> > > case, where one or both of the buses are `dumb' and don't require any work
> > > from Linux.
> > 
> > If we can come up with some generic bus type that is just a container for
> > a load of hooks that know how to deal with various aspects of each device's
> > interface to the system, on a per-device basis, than may be a start.
> > 
> > The platform bus kinda serves that role, but the trouble with that is that
> > it doesn't encourage any abstraction at all.  In the face of increasing
> > complexity, abstraction is desperately needed.
> 
> Then create a different abstraction, the normal solution to any problem
> in programming :)

That's certainly the first step.  It might end up looking a lot like a
kludge layer which duplicates core functionality -- if so, we should
then consider whether there is a better way, but we shouldn't judge it
prematurely.


It would be great to get some comments [hint to everyone] on the proposed
DT binding so that we can start to explore this properly.

Cheers
---Dave
Thierry Reding Nov. 28, 2013, 8:33 p.m. UTC | #7
On Wed, Nov 27, 2013 at 05:28:06PM +0000, Dave Martin wrote:
[...]
> From will.deacon@arm.com Wed Nov 20 12:06:22 2013
[...]
> A number of discussion points remain to be resolved:
> 
>   - Use of the ranges property and describing slave vs master bus
>     address ranges. In the latter case, we actually want to describe our
>     address space with respect to the bus on which the bus masters,
>     rather than the parent. This could potentially be achieved by adding
>     properties such as dma-parent and dma-ranges (already used by PPC?)
> 
>   - Describing masters that master through multiple different buses
> 
>   - How on Earth this fits in with the Linux device model (it doesn't)
> 
>   - Interaction with IOMMU bindings (currently under discussion)

This is all very vague. Perhaps everyone else knows what this is all
about, in which case it'd be great if somebody could clue me in.

In particular I'm not sure what exact problem this solves. Perhaps a
somewhat more concrete example would help. Or perhaps pointers to
documentation that can help filling in the gaps.

>  .../devicetree/bindings/arm/coherent-bus.txt       | 110 +++++++++++++++++++++
>  1 file changed, 110 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/arm/coherent-bus.txt
> 
> diff --git a/Documentation/devicetree/bindings/arm/coherent-bus.txt b/Documentation/devicetree/bindings/arm/coherent-bus.txt
> new file mode 100644
> index 000000000000..e3fbc2e491c7
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/arm/coherent-bus.txt
> @@ -0,0 +1,110 @@
> +* Generic binding to describe a coherent bus
> +
> +In some systems, devices (peripherals and/or CPUs) do not share
> +coherent views of memory, while on other systems sets of devices may
> +share a coherent view of memory depending on the static bus topology
> +and/or dynamic configuration of both the bus and device. Establishing
> +such dynamic configurations requires appropriate topological information
> +to be communicated to the operating system.
> +
> +This binding document attempts to define a set of generic properties
> +which can be used to encode topological information in bus and device
> +nodes.
> +
> +
> +* Terminology
> +
> +  - Port                : An interface over which memory transactions
> +                          can propagate. A port may act as a master,
> +                          slave or both (see below).
> +
> +  - Master port         : A port capable of issuing memory transactions
> +                          to a slave. For example, a port connecting a
> +                          DMA controller to main memory.
> +
> +  - Slave port          : A port capable of responding to memory
> +                          transactions received from a master. For
> +                          example, a port connecting the control
> +                          registers of an MMIO device to a peripheral
> +                          bus.

"Port" sounds awfully generic. Other bindings (such as those for V4L2,
aka media) use ports for something completely different. Perhaps we can
come up with a more specific term that matches the use-case better?

What exactly does this map to in hardware?

Thierry
Jason Gunthorpe Nov. 28, 2013, 9:10 p.m. UTC | #8
On Thu, Nov 28, 2013 at 09:33:23PM +0100, Thierry Reding wrote:

> >   - Describing masters that master through multiple different buses
> > 
> >   - How on Earth this fits in with the Linux device model (it doesn't)
> > 
> >   - Interaction with IOMMU bindings (currently under discussion)
> 
> This is all very vague. Perhaps everyone else knows what this is all
> about, in which case it'd be great if somebody could clue me in.

It looks like an approach to describe an AXI physical bus topology in
DT..

AFAIK the issue is that the AXI toolkit ARM provides encourages a
single IP block to have several AXI ports - control, DMA, high speed
MMIO, for instance. Each of those ports is hooked up into an AXI bus
DAG that has little to do with the CPU address map.

Contrasted with something like PCI, where each IP has exactly one bus
port into the system, so the MMIO register access address range
directly implies the bus master DMA path.

To my mind, a sensible modeling would be to have the DT tree represent
the AXI DAG flattened into a tree rooted at the CPU vertex. Separately
in the DT would be the full AXI DAG represented with phandle
connections.

Nodes in the DT would use phandles to indicate their connections into
the AXI DAG.

Hugely roughly:
soc {
   ranges = <...>;	/* quasi-real ranges indicating IP-to-CPU mapping */

   ip_block {
      reg = <...>;
      axi-ports = <&axi_low_speed_port0>,	/* mmio */
                  <&axi_dma_port1>;		/* dma */
   };
};

axi {
   /* Describe a DAG of AXI connections here. */
   cpu { downstream = <&axi_switch>; };
   axi_switch: axi_switch { downstream = <&memory>, <&low_speed>; };
   memory: memory {};
   dma { downstream = <&memory>; };
   low_speed: low_speed {};
};

I was just reading the Zynq manual which gives a pretty good
description of what one vendor did using the ARM AXI toolkits..

http://www.xilinx.com/support/documentation/user_guides/ug585-Zynq-7000-TRM.pdf
Figure 5-1 pg 122

You can see it is a complex DAG of AXI busses. For instance, if you
want to master from a 'High Performance Port M0' to 'On Chip RAM' you
follow the path AXI_HP[M0] -> Switch1[M2] -> OCM.

But you can't master from 'High Performance Port M0' to internal
slaves, as there is no routing path.
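
In the rough style of the fragment above, those Zynq paths might be
sketched like this (node and label names are invented for illustration):

   axi {
      /* AXI_HP[M0] -> Switch1[M2] -> OCM */
      axi_hp_m0 { downstream = <&switch1>; };
      switch1: switch1 { downstream = <&ocm>; };
      ocm: ocm {};
      /* no edge from axi_hp_m0 towards the internal slaves, so that
         mastering path simply cannot be expressed */
      internal_slaves: internal_slaves {};
   };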

Each switch block is an opportunity for the designer to provide
address remapping/IO MMU hardware that needs configuring :)

Which is why I think encoding the AXI DAG directly in DT is probably
the most future-proof way to model this stuff - it sticks close to the
tools ARM provides to the SOC designers, so it is very likely to be
able to model arbitrary SOC designs.

Regards,
Jason
Greg Kroah-Hartman Nov. 28, 2013, 9:25 p.m. UTC | #9
On Thu, Nov 28, 2013 at 07:39:17PM +0000, Dave Martin wrote:
> On Thu, Nov 28, 2013 at 11:13:31AM -0800, Greg KH wrote:
> > On Thu, Nov 28, 2013 at 05:33:39PM +0000, Dave Martin wrote:
> > > On Thu, Nov 28, 2013 at 10:28:45AM +0000, Will Deacon wrote:
> > > > Hi Greg,
> > > > 
> > > > On Wed, Nov 27, 2013 at 11:06:50PM +0000, Greg KH wrote:
> > > > > On Wed, Nov 27, 2013 at 05:28:06PM +0000, Dave Martin wrote:
> > > > > > >From will.deacon@arm.com Wed Nov 20 12:06:22 2013
> > > > > > A number of discussion points remain to be resolved:
> > > > > > 
> > > > > >   - Use of the ranges property and describing slave vs master bus
> > > > > >     address ranges. In the latter case, we actually want to describe our
> > > > > >     address space with respect to the bus on which the bus masters,
> > > > > >     rather than the parent. This could potentially be achieved by adding
> > > > > >     properties such as dma-parent and dma-ranges (already used by PPC?)
> > > > > > 
> > > > > >   - Describing masters that master through multiple different buses
> > > > > > 
> > > > > >   - How on Earth this fits in with the Linux device model (it doesn't)
> > > > > 
> > > > > How does this _not_ fit into the Linux device model?  What am I missing
> > > > > here that precludes the use of the "driver/device/bus" model we have
> > > > > today?
> > > 
> > > The physical-sockets history of buses like PCI tends to force a simple
> > > tree-like topology as a natural consequence.  You also end up with
> > > closely linked topologies for power, clocks, interrupts etc., because
> > > those all pass through the same sockets, so it's difficult to have a
> > > different structure.
> > 
> > There's nothing in the driver core that enforces such a topology.
> 
> Maybe not ... I have to wrap my head around that stuff a bit more.
> 
> > > On SoC, those constraints have never existed and are not followed.  A
> > > device's interface to the system is almost always split into multiple
> > > connections, not covered by a single design or standard.  The problem
> > > now is that increasing complexity means that the sometimes bizarre
> > > topology features of SoCs are becoming less and less transparent for
> > > software.
> > > 
> > > The device model currently seems to assume that certain things (power,
> > > DMA and MMIO accessibility) follow the tree (which may not work for many
> > > SoCs), and some other things (clocks, regulators, interrupts etc.) are
> > > not incorporated at all -- making them independent, but it may make some
> > > abstractions impossible today.
> > > 
> > > How much this matters for actual systems is hard to foresee yet.  Since
> > > not _all_ possible insanities find their way into silicon.  The
> > > onus should certainly be on us (i.e., the ARM/SoC community) to
> > > demonstrate if the device model needs to change, and to find practical
> > > ways to change it that minimise the resulting churn.
> > 
> > Yes it is, you all are the ones tasked with implementing the crazy crap
> > the hardware people have created, best of luck with that :)
> 
> Agreed.  The first assumption should be that we can fit in with the
> existing device model -- we should only reconsider if we find that
> to be impossible.

Let me know if you think it is somehow impossible, but you all should
really push back on the insane hardware designers that are forcing you
all to do this work.  I find it "interesting" how this all becomes your
workload for their crazy ideas.

Best of luck,

greg k-h
Thierry Reding Nov. 28, 2013, 10:22 p.m. UTC | #10
On Thu, Nov 28, 2013 at 02:10:09PM -0700, Jason Gunthorpe wrote:
> On Thu, Nov 28, 2013 at 09:33:23PM +0100, Thierry Reding wrote:
> 
> > >   - Describing masters that master through multiple different buses
> > > 
> > >   - How on Earth this fits in with the Linux device model (it doesn't)
> > > 
> > >   - Interaction with IOMMU bindings (currently under discussion)
> > 
> > This is all very vague. Perhaps everyone else knows what this is all
> > about, in which case it'd be great if somebody could clue me in.
> 
> It looks like an approach to describe an AXI physical bus topology in
> DT..

Thanks for explaining this. It makes a whole lot more sense now.

> AFAIK the issue is that the AXI toolkit arm provides encourages a
> single IP block to have several AXI ports - control, DMA, high speed
> MMIO, for instance. Each of those ports is hooked up into an AXI bus
> DAG that has little to do with the CPU address map.
> 
> Contrasted with something like PCI, where each IP has exactly one bus
> port into the system, so the MMIO register access address range
> directly implies the bus master DMA path.
> 
> To my mind, a sensble modeling would be to have the DT tree represent
> the AXI DAG flattened into a tree rooted at the CPU vertex. Separately
> in the DT would be the full AXI DAG represented with phandle
> connections.
> 
> Nodes in the DT would use phandles to indicate their connections into
> the AXI DAG.
> 
> Hugely roughly:
> soc 
> {
>    ranges = <Some quasi-real ranges indicating IP to CPU mapping>;
>    ip_block 
>    { 
>       reg = <...>
>       axi-ports = <mmio = &axi_low_speed_port0, dma = &axi_dma_port1, .. >;
>    }
> }
> 
> axi
> {
>    /* Describe a DAG of AXI connections here. */
>    cpu { downstream = &ax_switch,}
>    axi_switch {downstream = &memory,&low_speed}
>    memory {}
>    dma {downstream = &memory}
>    low_speed {}
> }

Correct me if I'm wrong, but the switch would be what the specification
refers to as "interconnect", while a port would correspond to what is
called an "interface" in the specification?

> I was just reading the Zynq manual which gives a pretty good
> description of what one vendor did using the ARM AXI toolkits..
> 
> http://www.xilinx.com/support/documentation/user_guides/ug585-Zynq-7000-TRM.pdf
> Figure 5-1 pg 122
> 
> You can see it is a complex DAG of AXI busses. For instance if you
> want to master from a 'High Performance Port M0' to 'On Chip RAM' you
> follow the path AXI_HP[MO] -> Switch1[M2] -> OCM.
> 
> But you can't master from 'High Performance Port M0' to internal
> slaves, as there is no routing path.
> 
> Each switch block is an opportunity for the designer to provide
> address remapping/IO MMU hardware that needs configuring :)
> 
> Which is why I think encoding the AXI DAG directly in DT is probably
> the most future proof way to model this stuff - it sticks close to the
> tools ARM provides to the SOC designers, so it is very likely to be
> able to model arbitary SOC designs.

I'm not sure I agree with you fully here. At least I think that if what
we want to describe is an AXI bus topology, then we should be describing
it in terms of the AXI specification.

On the other hand I fear that this will lead to very many nodes and
properties that we need to add, with potentially no immediate gain. So I
think we should be cautious about what we do add, and restrict ourselves
to what we really need.

I mean, even though device tree is supposed to describe hardware, there
needs to be a limit to the amount of detail we put into it. After all it
isn't a hardware description language, but rather a language to describe
the hardware in a way that makes sense for operating system software to
use it.

Perhaps this is just another way of saying what Greg has already said.
If we continue down this road, we'll eventually end up having to
describe all sorts of nitty gritty details. And we'll need even more
code to deal with those descriptions and the hardware they represent. At
some point we need to start pushing some of the complexity back into
hardware so that we can keep a sane code-base.

Thierry
Jason Gunthorpe Nov. 28, 2013, 11:31 p.m. UTC | #11
On Thu, Nov 28, 2013 at 11:22:33PM +0100, Thierry Reding wrote:
> On Thu, Nov 28, 2013 at 02:10:09PM -0700, Jason Gunthorpe wrote:
> > On Thu, Nov 28, 2013 at 09:33:23PM +0100, Thierry Reding wrote:
> > 
> > > >   - Describing masters that master through multiple different buses
> > > > 
> > > >   - How on Earth this fits in with the Linux device model (it doesn't)
> > > > 
> > > >   - Interaction with IOMMU bindings (currently under discussion)
> > > 
> > > This is all very vague. Perhaps everyone else knows what this is all
> > > about, in which case it'd be great if somebody could clue me in.
> > 
> > It looks like an approach to describe an AXI physical bus topology in
> > DT..
> 
> Thanks for explaining this. It makes a whole lot more sense now.

Hopefully the ARM guys concur, this was just my impression from
reviewing their patches and having recently done some design work with
AXI..

> > axi
> > {
> >    /* Describe a DAG of AXI connections here. */
> >    cpu { downstream = &ax_switch,}
> >    axi_switch {downstream = &memory,&low_speed}
> >    memory {}
> >    dma {downstream = &memory}
> >    low_speed {}
> > }
> 
> Correct me if I'm wrong, but the switch would be what the specification
> refers to as "interconnect", while a port would correspond to what is
> called an "interface" in the specification?

That seems correct, but for this purpose we are not interested in
boring dumb interconnect but fancy interconnect with address remapping
capabilities, or cache coherency (eg the SCU/L2 cache is modeled as
switch/interconnect in an AXI DAG).

I called it a switch because the job of the interconnect block is to
take an AXI input packet on a slave interface and route it to the
proper master interface with internal arbitration between slave
interfaces. In my world that is called a switch ;)

AXI is basically an on-chip point-to-point switched fabric like PCI-E,
and the stuff that travels on AXI looks fairly similar to PCI-E TLPs..

If you refer to the PDF I linked I broadly modeled the above DT
fragment on that diagram, each axi sub node (vertex) represents an
'interconnect' and 'downstream' is a master->slave interface pair (edge).

AXI is inherently a DAG, but unlike on other platforms you don't have
to go through a fused CPU/cache/memory controller unit to access
memory, so there are software-visible asymmetries depending on how the
DMA flows through the AXI DAG.
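
As an illustration of that asymmetry (names invented), two masters can
reach the same memory over different edges of the DAG:

   axi {
      cpu { downstream = <&scu>; };
      scu: scu { downstream = <&memory>; };   /* coherent path */
      dma0 { downstream = <&scu>; };          /* snooped via the SCU: coherent */
      dma1 { downstream = <&memory>; };       /* bypasses the SCU: not coherent */
      memory: memory {};
   };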

> > Which is why I think encoding the AXI DAG directly in DT is probably
> > the most future proof way to model this stuff - it sticks close to the
> > tools ARM provides to the SOC designers, so it is very likely to be
> > able to model arbitary SOC designs.
> 
> I'm not sure I agree with you fully here. At least I think that if what
> we want to describe is an AXI bus topology, then we should be describing
> it in terms of the AXI specification.

Right, that was what I was trying to describe :) 

The DAG's vertices would be 'interconnect' blocks, and its directed
edges 'master -> slave interface' pairs.

This would be an addendum/side-table dataset to the standard 'soc' CPU
address map tree, which would only be needed to program address
mapping/IOMMU hardware.

And it isn't really AXI-specific: x86-style platforms can have a DAG
too, it is just much simpler, as there is only one vertex - the IOMMU.
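
In that degenerate case the side-table collapses to almost nothing,
something like (again purely illustrative):

   dma-topology {
      device { downstream = <&iommu>; };
      iommu: iommu { downstream = <&memory>; }; /* the single vertex */
      memory: memory {};
   };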

> I mean, even though device tree is supposed to describe hardware, there
> needs to be a limit to the amount of detail we put into it. After all it
> isn't a hardware description language, but rather a language to describe
> the hardware in a way that makes sense for operating system software to
> use it.

Right - which is why I said the usual 'soc' node should remain as it
typically is today - a tree formed by viewing the AXI DAG from the CPU
vertex. That 100% matches the OS perspective of the system for
CPU-originated MMIO.

The AXI DAG side-table would be used to resolve weirdness with 'bus
master' DMA programming. The OS can detect all the required
configuration and properties by tracing a path through the DAG from
the source of the DMA to the target - that tells you what IOMMUs are
involved, if the path is cache coherent, etc.

> Perhaps this is just another way of saying what Greg has already said.
> If we continue down this road, we'll eventually end up having to
> describe all sorts of nitty gritty details. And we'll need even more

Greg's point makes sense, but the HW guys are not designing things
this way for kicks - there are real physics-based reasons for some of
these choices...

eg An all-to-all bus cross bar (eg like Intel's ring bus) is energy
expensive compared to a purpose-built muxed bus tree. Doing coherency
lookups on DMA traffic costs energy, etc.

> code to deal with those descriptions and the hardware they represent. At
> some point we need to start pushing some of the complexity back into
> hardware so that we can keep a sane code-base.

Some of this is a consequence of the push to have the firmware
minimal. As soon as you say the kernel has to configure the address
map you've created a big complexity for it..

Jason
Greg Kroah-Hartman Nov. 29, 2013, 2:35 a.m. UTC | #12
On Thu, Nov 28, 2013 at 04:31:47PM -0700, Jason Gunthorpe wrote:
> > Perhaps this is just another way of saying what Greg has already said.
> > If we continue down this road, we'll eventually end up having to
> > describe all sorts of nitty gritty details. And we'll need even more
> 
> Greg's point makes sense, but the HW guys are not designing things
> this way for kicks - there are real physics based reasons for some of
> these choices...
> 
> eg An all-to-all bus cross bar (eg like Intel's ring bus) is engery
> expensive compared to a purpose built muxed bus tree. Doing coherency
> look ups on DMA traffic costs energy, etc.

Really?  How much power exactly does it take / save?  Yes, hardware
people think "software is free", but when you can't actually control the
hardware in the software properly, well, you end up with something like
Itanium...

> > code to deal with those descriptions and the hardware they represent. At
> > some point we need to start pushing some of the complexity back into
> > hardware so that we can keep a sane code-base.
> 
> Some of this is a consequence of the push to have the firmware
> minimal. As soon as you say the kernel has to configure the address
> map you've created a big complexity for it..

Why the push to make firmware "minimal"?  What is that "saving"?  You
just push the complexity from one place to the other, just because ARM
doesn't seem to have good firmware engineers, doesn't mean they should
punish their kernel developers :)

greg k-h
Thierry Reding Nov. 29, 2013, 9:37 a.m. UTC | #13
On Thu, Nov 28, 2013 at 06:35:54PM -0800, Greg KH wrote:
> On Thu, Nov 28, 2013 at 04:31:47PM -0700, Jason Gunthorpe wrote:
> > > Perhaps this is just another way of saying what Greg has already said.
> > > If we continue down this road, we'll eventually end up having to
> > > describe all sorts of nitty gritty details. And we'll need even more
> > 
> > Greg's point makes sense, but the HW guys are not designing things
> > this way for kicks - there are real physics based reasons for some of
> > these choices...
> > 
> > eg An all-to-all bus cross bar (eg like Intel's ring bus) is engery
> > expensive compared to a purpose built muxed bus tree. Doing coherency
> > look ups on DMA traffic costs energy, etc.
> 
> Really?  How much power exactly does it take / save?  Yes, hardware
> people think "software is free", but when you can't actually control the
> hardware in the software properly, well, you end up with something like
> itanium...
> 
> > > code to deal with those descriptions and the hardware they represent. At
> > > some point we need to start pushing some of the complexity back into
> > > hardware so that we can keep a sane code-base.
> > 
> > Some of this is a consequence of the push to have the firmware
> > minimal. As soon as you say the kernel has to configure the address
> > map you've created a big complexity for it..
> 
> Why the push to make firmware "minimal"?  What is that "saving"?  You
> just push the complexity from one place to the other, just because ARM
> doesn't seem to have good firmware engineers, doesn't mean they should
> punish their kernel developers :)

In my experience the biggest problem here is that people working on
upstream kernels and therefore confronted with these issues are seldom
able to track the latest developments of new chips.

When the time comes to upstream support, most of the functionality has
been implemented downstream already, so it actually works and there's no
apparent reason why things should change.

Now I know that that's not an ideal situation and upstreaming should
start a whole lot earlier, but even if that were the case, once the
silicon tapes out there's not a whole lot you can do about it anymore.
Starting with upstreaming even before that would have to be a solution,
but I don't think that's realistic at the current pace of development.

There's a large gap between how fast new SoCs are supposed to tape out
and the rate at which new code can be merged upstream. Perhaps some of
that could be mitigated by putting more of the complexity into firmware
and that's already happening to some degree for ARMv8. But I suspect
there's a limit to what you can hide away in firmware while at the same
time giving the kernel enough information to do the right thing.

I am completely convinced that our goal should be to do upstreaming
early and ideally there shouldn't be any downstream development in the
first place. The reason why we're not there yet is because it isn't
practical to do so currently, so I'm very interested in suggestions or
finding ways to improve the situation.

Thierry
Russell King - ARM Linux Nov. 29, 2013, 9:57 a.m. UTC | #14
On Fri, Nov 29, 2013 at 10:37:14AM +0100, Thierry Reding wrote:
> There's a large gap between how fast new SoCs are supposed to tape out
> and the rate at which new code can be merged upstream. Perhaps some of
> that could be mitigated by putting more of the complexity into firmware
> and that's already happening to some degree for ARMv8. But I suspect
> there's a limit to what you can hide away in firmware while at the same
> time giving the kernel enough information to do the right thing.

One of the bigger issues which stands in the way of companies caring
about mainstream support is closed source IPs like VPUs and GPUs.

If you have one of those on your chip, even if the kernel side code
is already under the GPL, normally that code is not "mainline worthy".
Also, as the userspace code may not be open source, some people object
to having the open source part in the kernel.

So for customers to be able to get the performance out of the chip,
they have to stick with having non-mainline kernel.

At that point, why bother spending too much time getting mainline
support for the device.  It's never going to be fully functional in
mainline.  It doesn't make sense for these SoC companies.
Thierry Reding Nov. 29, 2013, 9:57 a.m. UTC | #15
On Thu, Nov 28, 2013 at 04:31:47PM -0700, Jason Gunthorpe wrote:
> On Thu, Nov 28, 2013 at 11:22:33PM +0100, Thierry Reding wrote:
> > On Thu, Nov 28, 2013 at 02:10:09PM -0700, Jason Gunthorpe wrote:
> > > On Thu, Nov 28, 2013 at 09:33:23PM +0100, Thierry Reding wrote:
> > > 
> > > > >   - Describing masters that master through multiple different buses
> > > > > 
> > > > >   - How on Earth this fits in with the Linux device model (it doesn't)
> > > > > 
> > > > >   - Interaction with IOMMU bindings (currently under discussion)
> > > > 
> > > > This is all very vague. Perhaps everyone else knows what this is all
> > > > about, in which case it'd be great if somebody could clue me in.
> > > 
> > > It looks like an approach to describe an AXI physical bus topology in
> > > DT..
> > 
> > Thanks for explaining this. It makes a whole lot more sense now.
> 
> Hopefully the ARM guys concur, this was just my impression from
> reviewing their patches and having recently done some design work with
> AXI..
> 
> > > axi
> > > {
> > >    /* Describe a DAG of AXI connections here. */
> > >    cpu { downstream = &ax_switch,}
> > >    axi_switch {downstream = &memory,&low_speed}
> > >    memory {}
> > >    dma {downstream = &memory}
> > >    low_speed {}
> > > }
> > 
> > Correct me if I'm wrong, but the switch would be what the specification
> > refers to as "interconnect", while a port would correspond to what is
> > called an "interface" in the specification?
> 
> That seems correct, but for this purpose we are not interested in
> boring dumb interconnect but fancy interconnect with address remapping
> capabilities, or cache coherency (eg the SCU/L2 cache is modeled as
> switch/interconnect in a AXI DAG).
> 
> I called it a switch because the job of the interconnect block is to
> take an AXI input packet on a slave interface and route it to the
> proper master interface with internal arbitration between slave
> interfaces. In my world that is a called a switch ;)
> 
> AXI is basically an on-chip point-to-point switched fabric like PCI-E,
> and the stuff that travels on AXI looks fairly similar to PCI-E TLPs..
> 
> If you refer to the PDF I linked I broadly modeled the above DT
> fragment on that diagram, each axi sub node (vertex) represents an
> 'interconnect' and 'downstream' is a master->slave interface pair (edge).
> 
> Fundamentally AXI is inherently a DAG, but unlike what we are used to
> in other platforms you don't have to go through a fused
> CPU/cache/memory controller unit to access memory, so there are
> software visible asymmetries depending on how the DMA flows through
> the AXI DAG.
> 
> > > Which is why I think encoding the AXI DAG directly in DT is probably
> > > the most future proof way to model this stuff - it sticks close to the
> > > tools ARM provides to the SOC designers, so it is very likely to be
> > > able to model arbitary SOC designs.
> > 
> > I'm not sure I agree with you fully here. At least I think that if what
> > we want to describe is an AXI bus topology, then we should be describing
> > it in terms of the AXI specification.
> 
> Right, that was what I was trying to describe :) 
> 
> The DAG would be vertexes that are 'interconnect' and directed edges
> that are 'master -> slave interface' pairs.
> 
> This would be an addendum/side-table dataset to the standard 'soc' CPU
> address map tree, that would only be needed to program address
> mapping/iommu hardware.
> 
> And it isn't really AXI specific, x86 style platforms can have a DAG
> too, it is just much simpler, as there is only 1 vertex - the IOMMU.
> 
> > I mean, even though device tree is supposed to describe hardware, there
> > needs to be a limit to the amount of detail we put into it. After all it
> > isn't a hardware description language, but rather a language to describe
> > the hardware in a way that makes sense for operating system software to
> > use it.
> 
> Right - which is why I said the usual 'soc' node should remain as-is
> typical today - a tree formed by viewing the AXI DAG from the CPU
> vertex. That 100% matches the OS perspective of the system for CPU
> originated MMIO.
> 
> The AXI DAG side-table would be used to resolve weirdness with 'bus
> master' DMA programming. The OS can detect all the required
> configuration and properties by tracing a path through the DAG from
> the source of the DMA to the target - that tells you what IOMMUs are
> involved, if the path is cache coherent, etc.

That all sounds like an awful amount of data to wade through. Do we
really need all of it to do what we want? Perhaps it can be simplified
a bit. For instance it seems like the majority of hardware where this is
actually required will have to go through one IOMMU (or a cascade of
IOMMUs) and the path isn't cache coherent.

IOMMUs typically require additional parameters to properly map devices
to virtual address spaces, so we'll need to hook them up with masters in
DT anyway. If we further assume that all masters use non-cache-coherent
paths, then the problem becomes much simpler.
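
The per-master linkage might then look something like the fragment
below.  The iommus-style property and the extra master-ID cell are
borrowed from the ongoing IOMMU binding discussion, so treat them as
assumptions rather than settled syntax:

    smmu: iommu@e0200000 {
            reg = <0xe0200000 0x10000>;
            #iommu-cells = <1>;         /* one cell: master/stream ID */
    };

    dma-engine@4000 {
            reg = <0x4000 0x1000>;
            iommus = <&smmu 0x42>;      /* 0x42: made-up stream ID */
            /* master path assumed non-cache-coherent */
    };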

Of course that would only work for a specific case and not solve the
more general case. But perhaps it'll be good enough to cover the
majority of uses.

> > Perhaps this is just another way of saying what Greg has already said.
> > If we continue down this road, we'll eventually end up having to
> > describe all sorts of nitty gritty details. And we'll need even more
> 
> Greg's point makes sense, but the HW guys are not designing things
> this way for kicks - there are real physics based reasons for some of
> these choices...
> 
> eg An all-to-all bus cross bar (eg like Intel's ring bus) is engery
> expensive compared to a purpose built muxed bus tree. Doing coherency
> look ups on DMA traffic costs energy, etc.

I understand that these may all contribute to saving power. However what
good is a system that's very power-efficient if it's so complex that the
software can no longer control it?

Thierry
Thierry Reding Nov. 29, 2013, 10:43 a.m. UTC | #16
On Fri, Nov 29, 2013 at 09:57:03AM +0000, Russell King - ARM Linux wrote:
> On Fri, Nov 29, 2013 at 10:37:14AM +0100, Thierry Reding wrote:
> > There's a large gap between how fast new SoCs are supposed to tape out
> > and the rate at which new code can be merged upstream. Perhaps some of
> > that could be mitigated by putting more of the complexity into firmware
> > and that's already happening to some degree for ARMv8. But I suspect
> > there's a limit to what you can hide away in firmware while at the same
> > time giving the kernel enough information to do the right thing.
> 
> One of the bigger issues which stands in the way of companies caring
> about mainstream support is closed source IPs like VPUs and GPUs.
> 
> If you have one of those on your chip, even if the kernel side code
> is already under the GPL, normally that code is not "mainline worthy".
> Also, as the userspace code may not be open source, some people object
> to having the open source part in the kernel.
> 
> So for customers to be able to get the performance out of the chip,
> they have to stick with having non-mainline kernel.
> 
> At that point, why bother spending too much time getting mainline
> support for the device.  It's never going to be fully functional in
> mainline.  It doesn't make sense for these SoC companies.

Well, there are advantages to having large parts, even if not all, of an
SoC supported in the mainline kernel. The closer you are to mainline,
the easier it becomes for customers and users to use a mainline kernel.

The better an SoC is supported upstream the fewer vendor-specific
patches are required to get feature parity, which in turn makes it
easier for customers to forward-port (and back-port for that matter)
those patches to whatever kernel they want. It also allows vendors to
concentrate on the more contentious patches and spend time on making
them worthy of mainline.

Customers aren't only end-users but also embedded partners that want to
use the SoC in their products. It's no secret that many vendor trees lag
behind upstream and that causes all kinds of pain such as having to port
drivers for new hardware that's not supported in whatever vendor tree
you happen to get. Tracking upstream is invaluable because it makes it
almost trivial to support new hardware and it helps in turn with getting
your own changes merged.

So upstreaming SoC support isn't only for the benefit of SoC vendors. It
also is very convenient for all users of the SoC. It's true that ideally
an SoC would be fully functional in mainline. That's not the case today,
but it won't necessarily stay that way.

Thierry
Will Deacon Nov. 29, 2013, 11:44 a.m. UTC | #17
On Thu, Nov 28, 2013 at 09:25:28PM +0000, Greg KH wrote:
> On Thu, Nov 28, 2013 at 07:39:17PM +0000, Dave Martin wrote:
> > On Thu, Nov 28, 2013 at 11:13:31AM -0800, Greg KH wrote:
> > > Yes it is, you all are the ones tasked with implementing the crazy crap
> > > the hardware people have created, best of luck with that :)
> > 
> > Agreed.  The first assumption should be that we can fit in with the
> > existing device model -- we should only reconsider if we find that
> > to be impossible.
> 
> Let me know if you think it is somehow impossible, but you all should
> really push back on the insane hardware designers that are forcing you
> all to do this work.  I find it "interesting" how this all becomes your
> workload for their crazy ideas.

Oh, I don't think we're claiming anything is impossible here :) It's more
that we will probably want to make some changes to the device model to allow,
for example, a device to be associated with multiple buses of potentially
different types.

Step one is to get the DT binding sorted, then we can try and get Linux to
make use of it. This goes hand-in-hand with the IOMMU discussion going on
here:

  http://lists.infradead.org/pipermail/linux-arm-kernel/2013-November/210401.html

which is one of the issues that is hitting us right now.

Cheers,

Will
Dave Martin Nov. 29, 2013, 11:58 a.m. UTC | #18
On Thu, Nov 28, 2013 at 04:31:47PM -0700, Jason Gunthorpe wrote:
> On Thu, Nov 28, 2013 at 11:22:33PM +0100, Thierry Reding wrote:
> > On Thu, Nov 28, 2013 at 02:10:09PM -0700, Jason Gunthorpe wrote:
> > > On Thu, Nov 28, 2013 at 09:33:23PM +0100, Thierry Reding wrote:
> > > 
> > > > >   - Describing masters that master through multiple different buses
> > > > > 
> > > > >   - How on Earth this fits in with the Linux device model (it doesn't)
> > > > > 
> > > > >   - Interaction with IOMMU bindings (currently under discussion)
> > > > 
> > > > This is all very vague. Perhaps everyone else knows what this is all
> > > > about, in which case it'd be great if somebody could clue me in.
> > > 
> > > It looks like an approach to describe an AXI physical bus topology in
> > > DT..
> > 
> > Thanks for explaining this. It makes a whole lot more sense now.
> 
> Hopefully the ARM guys concur, this was just my impression from
> reviewing their patches and having recently done some design work with
> AXI..

Yes and no.  We are trying to describe a real topology here, but only
because there are salient features that the kernel genuinely does
need to know about if we want to be able to abstract this kind of thing.

It's not just about AXI.

Things like CCI-400 ("cache coherent interconnect") and its successors 
have real run-time control requirements on each connection to the bus.
(The "port" terminology is used in the CCI documentation, but in any
case the concept of a link between a device and a bus should be a pretty
generic concept, not tied to a specific name or a specific interconnect.)

If I need to turn on the bus interface for device X, I need to know
how to tell the bus which interface to poke -- hence the need for
a "port ID".  Of course, we're free to choose other names.

The master-slave link concept is not supposed to be a new concept at
all: DT already has this concept.  All we are aiming to add here is the
ability to describe cross-links that ePAPR cannot describe directly.

> 
> > > axi
> > > {
> > >    /* Describe a DAG of AXI connections here. */
> > >    cpu { downstream = &axi_switch,}
> > >    axi_switch {downstream = &memory,&low_speed}
> > >    memory {}
> > >    dma {downstream = &memory}
> > >    low_speed {}
> > > }
> > 
> > Correct me if I'm wrong, but the switch would be what the specification
> > refers to as "interconnect", while a port would correspond to what is
> > called an "interface" in the specification?
> 
> That seems correct, but for this purpose we are not interested in
> boring dumb interconnect but fancy interconnect with address remapping
> capabilities, or cache coherency (eg the SCU/L2 cache is modeled as
> switch/interconnect in a AXI DAG).

Bear in mind that "fancy interconnect with address remapping
capabilities" probably means at least two independent components on an
ARM SoC.

To avoid excessive code fragmentation we'd want a driver for each, not a
driver for every possible pairing.  The pairing could be different on
every port even in a single SoC, though I hope we will never see that.

> I called it a switch because the job of the interconnect block is to
> take an AXI input packet on a slave interface and route it to the
> proper master interface with internal arbitration between slave
> interfaces. In my world that is called a switch ;)

In axi { axi_switch {} }, are you describing two levels of bus, or
one?  I'm guessing one, but then the nested node looks a bit weird.

> AXI is basically an on-chip point-to-point switched fabric like PCI-E,
> and the stuff that travels on AXI looks fairly similar to PCI-E TLPs..
> 
> If you refer to the PDF I linked I broadly modeled the above DT
> fragment on that diagram, each axi sub node (vertex) represents an
> 'interconnect' and 'downstream' is a master->slave interface pair (edge).
> 
> Fundamentally AXI is inherently a DAG, but unlike what we are used to
> in other platforms you don't have to go through a fused
> CPU/cache/memory controller unit to access memory, so there are
> software visible asymmetries depending on how the DMA flows through
> the AXI DAG.

Just to call this out, the linkage is *not* guaranteed to be acyclic.

If you connect pass-through devices (i.e., buses) round in a cycle,
you may get transactions going round and round forever, so we should
never see that in a system.

However, there's nothing to stop a DMA controller's master side being
looped back so that it can access its own slave interface.  This is the
normal situation for coherent DMA, since the whole point there is
that the DMA controller should share its system view closely with
the CPUs, including some levels of cache.

(This does mean that the DMA may be able to program itself -- but I don't
claim that this is useful.  Rather, it's a side-effect of providing a
coherent system view.)

> > > Which is why I think encoding the AXI DAG directly in DT is probably
> > > the most future proof way to model this stuff - it sticks close to the
> > > tools ARM provides to the SOC designers, so it is very likely to be
> > > able to model arbitrary SOC designs.
> > 
> > I'm not sure I agree with you fully here. At least I think that if what
> > we want to describe is an AXI bus topology, then we should be describing
> > it in terms of the AXI specification.
> 
> Right, that was what I was trying to describe :) 
> 
> The DAG would be vertexes that are 'interconnect' and directed edges
> that are 'master -> slave interface' pairs.
> 
> This would be an addendum/side-table dataset to the standard 'soc' CPU
> address map tree, that would only be needed to program address
> mapping/iommu hardware.
> 
> And it isn't really AXI specific, x86 style platforms can have a DAG
> too, it is just much simpler, as there is only 1 vertex - the IOMMU.

Agreed -- the master/slave link is a really generic concept.

The complete set of properties associated with each link will be
specific to each different interconnect, and possibly from port to port:
_that_ stuff would be described by separate, non-generic properties
defined per interconnect type.

But to do DMA mapping, you should only need to know what master/slave
links exist, and any associated mappings.
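
In the style of the axi fragment quoted above, that could be as little
as this ("downstream-ranges" is purely hypothetical, standing in for
whatever per-link mapping description we end up with):

	axi {
		dma_engine {
			downstream = <&central_interconnect>;
			/* hypothetical per-link mapping: master address 0x0
			 * appears at system address 0x80000000, 1GB window */
			downstream-ranges = <0x0 0x80000000 0x40000000>;
		};
	};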

> 
> > I mean, even though device tree is supposed to describe hardware, there
> > needs to be a limit to the amount of detail we put into it. After all it
> > isn't a hardware description language, but rather a language to describe
> > the hardware in a way that makes sense for operating system software to
> > use it.
> 
> Right - which is why I said the usual 'soc' node should remain as-is
> typical today - a tree formed by viewing the AXI DAG from the CPU
> vertex. That 100% matches the OS perspective of the system for CPU
> originated MMIO.

Do you mean the top-level bus node in the DT and its contents, or
something else?

If so, agreed ...

> The AXI DAG side-table would be used to resolve weirdness with 'bus
> master' DMA programming. The OS can detect all the required
> configuration and properties by tracing a path through the DAG from
> the source of the DMA to the target - that tells you what IOMMUs are
> involved, if the path is cache coherent, etc.

... that could work, although putting the links in the natural places
in the DT directly feels cleaner than stashing a crib table elsewhere
in the DT.  That's partly cosmetic, I think both could work?

> 
> > Perhaps this is just another way of saying what Greg has already said.
> > If we continue down this road, we'll eventually end up having to
> > describe all sorts of nitty gritty details. And we'll need even more
> 
> Greg's point makes sense, but the HW guys are not designing things
> this way for kicks - there are real physics-based reasons for some of
> these choices...
> 
> eg An all-to-all bus cross bar (eg like Intel's ring bus) is energy
> expensive compared to a purpose-built muxed bus tree. Doing coherency
> lookups on DMA traffic costs energy, etc.
> 
> > code to deal with those descriptions and the hardware they represent. At
> > some point we need to start pushing some of the complexity back into
> > hardware so that we can keep a sane code-base.
> 
> Some of this is a consequence of the push to have the firmware
> minimal. As soon as you say the kernel has to configure the address
> map you've created a big complexity for it..
> 
> Jason
Dave Martin Nov. 29, 2013, 1:13 p.m. UTC | #19
On Fri, Nov 29, 2013 at 09:57:03AM +0000, Russell King - ARM Linux wrote:
> On Fri, Nov 29, 2013 at 10:37:14AM +0100, Thierry Reding wrote:
> > There's a large gap between how fast new SoCs are supposed to tape out
> > and the rate at which new code can be merged upstream. Perhaps some of
> > that could be mitigated by putting more of the complexity into firmware
> > and that's already happening to some degree for ARMv8. But I suspect
> > there's a limit to what you can hide away in firmware while at the same
> > time giving the kernel enough information to do the right thing.
> 
> One of the bigger issues which stands in the way of companies caring
> about mainstream support is closed source IPs like VPUs and GPUs.
> 
> If you have one of those on your chip, even if the kernel side code
> is already under the GPL, normally that code is not "mainline worthy".
> Also, as the userspace code may not be open source, some people object
> to having the open source part in the kernel.
> 
> So for customers to be able to get the performance out of the chip,
> they have to stick with having a non-mainline kernel.
> 
> At that point, why bother spending too much time getting mainline
> support for the device.  It's never going to be fully functional in
> mainline.  It doesn't make sense for these SoC companies.

Putting effort into upstream support for something that is only relevant
to GPUs (or VPUs) does look less valuable for us right now, unless it
encourages people to start posting more GPU/VPU code upstream, and we
know there are other blockers there.

DMA and IOMMU we definitely care about, though.

Cheers
---Dave
Russell King - ARM Linux Nov. 29, 2013, 1:29 p.m. UTC | #20
On Fri, Nov 29, 2013 at 01:13:59PM +0000, Dave Martin wrote:
> On Fri, Nov 29, 2013 at 09:57:03AM +0000, Russell King - ARM Linux wrote:
> > On Fri, Nov 29, 2013 at 10:37:14AM +0100, Thierry Reding wrote:
> > > There's a large gap between how fast new SoCs are supposed to tape out
> > > and the rate at which new code can be merged upstream. Perhaps some of
> > > that could be mitigated by putting more of the complexity into firmware
> > > and that's already happening to some degree for ARMv8. But I suspect
> > > there's a limit to what you can hide away in firmware while at the same
> > > time giving the kernel enough information to do the right thing.
> > 
> > One of the bigger issues which stands in the way of companies caring
> > about mainstream support is closed source IPs like VPUs and GPUs.
> > 
> > If you have one of those on your chip, even if the kernel side code
> > is already under the GPL, normally that code is not "mainline worthy".
> > Also, as the userspace code may not be open source, some people object
> > to having the open source part in the kernel.
> > 
> > So for customers to be able to get the performance out of the chip,
> > they have to stick with having a non-mainline kernel.
> > 
> > At that point, why bother spending too much time getting mainline
> > support for the device.  It's never going to be fully functional in
> > mainline.  It doesn't make sense for these SoC companies.
> 
> Putting effort into upstream support for something that is only relevant
> to GPUs (or VPUs) does look less valuable for us right now, unless it
> encourages people to start posting more GPU/VPU code upstream, and we
> know there are other blockers there.

I think you miss my point.  Manufacturers want their chips to be useful
to people, and they want all the features on their chip to be usable.
They don't want something which sort-of works but leaves chunks of the
IP they spent time integrating unsupported.

So they have two options: either they develop a kernel out of mainline
which supports everything, which isn't subject to the whims of mainline
kernel developers breaking it all the time because of lack of testing,
or they decide that they're not going to support everything and work on
mainline only.

The problem with the latter is they're explicitly saying to some
customers that they're on their own as far as that's concerned, and
they're not prepared to do that: customers are the people who pay the
bills, remember, and you don't turn them away without good reason.

For example, I doubt that SolidRun would've picked Freescale's IMX6
for their next board unless there was support for the GPU and VPU in
some kernel somewhere.  Remember, not everyone is interested in producing
yet another toy board which can only really be used as a NAS.  Some want
accelerated graphics and hardware assisted video decode today too, and
want to sell products based on those features.
Dave Martin Nov. 29, 2013, 2:13 p.m. UTC | #21
On Fri, Nov 29, 2013 at 09:57:12AM +0000, Thierry Reding wrote:
> On Thu, Nov 28, 2013 at 04:31:47PM -0700, Jason Gunthorpe wrote:

[...]

> > The AXI DAG side-table would be used to resolve weirdness with 'bus
> > master' DMA programming. The OS can detect all the required
> > configuration and properties by tracing a path through the DAG from
> > the source of the DMA to the target - that tells you what IOMMUs are
> > involved, if the path is cache coherent, etc.
> 
> That all sounds like an awful lot of data to wade through. Do we
> really need all of it to do what we want? Perhaps it can be simplified
> a bit. For instance it seems like the majority of hardware where this is
> actually required will have to go through one IOMMU (or a cascade of
> IOMMUs) and the path isn't cache coherent.

The DT should describe the hardware, but only those aspects that a sane
OS should need to care about.  Some judgment is needed.

Figuring out exactly which info we ought to care about is part of the
purpose of this discussion.  There are certainly lots of hardware
integration and configuration parameters that we don't need to know.

I think that figuring out the path capabilities ought to be a one-off
step, done when the DMA client device is probed.  We need to retain
enough information to do the mapping each time a buffer needs to be
set up, but we shouldn't have to re-scan the DT each time.


In more general cases, there are still some things that really can't
be pushed into firmware for which Linux needs a fair amount of
topology information.

Our current example is things like MSIs from PCIe devices in systems
with newer GICs and SMMU.  Particularly for guests under KVM, the way
all these link together is needed for configuring and routing MSIs to
guest CPUs.
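
Roughly the kind of linkage involved, as a hedged sketch ("msi-parent"
is an existing property; the "iommu" link is hypothetical, pending the
IOMMU binding discussion):

	gic: interrupt-controller@2c010000 {
		interrupt-controller;
		#interrupt-cells = <3>;
	};

	smmu: iommu@2b400000 {
		reg = <0x2b400000 0x10000>;
	};

	pcie@30000000 {
		reg = <0x30000000 0x1000000>;
		msi-parent = <&gic>;	/* where MSI writes must be routed */
		iommu = <&smmu>;	/* hypothetical: masters via the SMMU */
	};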


[...]

> > eg An all-to-all bus cross bar (eg like Intel's ring bus) is energy
> > expensive compared to a purpose-built muxed bus tree. Doing coherency
> > lookups on DMA traffic costs energy, etc.
> 
> I understand that these may all contribute to saving power. However what
> good is a system that's very power-efficient if it's so complex that the
> software can no longer control it?

Not a lot of good.  However, that's the extreme case.  We have to deal
with some pain for sure -- and some parts of that pain may turn out to be
too ridiculous (either too useless, or too unworkable) to be worth
supporting in Linux.

This thread focuses on one of the less ridiculous things: the aim is not
to describe the hardware bus architecture in full detail, just the
aspects the OS needs to know about for important, abstractable things
like DMA and IOMMU topology.

If we model the description around the actual topology, there seems
less chance of needing to bodge the bindings in the future when some
previously non-relevant aspect of the topology becomes important.

Cheers
---Dave
Greg Kroah-Hartman Nov. 29, 2013, 5:37 p.m. UTC | #22
On Fri, Nov 29, 2013 at 11:44:53AM +0000, Will Deacon wrote:
> On Thu, Nov 28, 2013 at 09:25:28PM +0000, Greg KH wrote:
> > On Thu, Nov 28, 2013 at 07:39:17PM +0000, Dave Martin wrote:
> > > On Thu, Nov 28, 2013 at 11:13:31AM -0800, Greg KH wrote:
> > > > Yes it is, you all are the ones tasked with implementing the crazy crap
> > > > the hardware people have created, best of luck with that :)
> > > 
> > > Agreed.  The first assumption should be that we can fit in with the
> > > existing device model -- we should only reconsider if we find that
> > > to be impossible.
> > 
> > Let me know if you think it is somehow impossible, but you all should
> > really push back on the insane hardware designers that are forcing you
> > all to do this work.  I find it "interesting" how this all becomes your
> > workload for their crazy ideas.
> 
> Oh, I don't think we're claiming anything is impossible here :) It's more
> that we will probably want to make some changes to the device model to allow,
> for example, a device to be associated with multiple buses of potentially
> different types.

Why would you want that?  What good would that help with?

> Step one is to get the DT binding sorted, then we can try and get Linux to
> make use of it. This goes hand-in-hand with the IOMMU discussion going on
> here:
> 
>   http://lists.infradead.org/pipermail/linux-arm-kernel/2013-November/210401.html
> 
> which is one of the issues that is hitting us right now.

Interesting how people seem to not know how to cc: the needed
maintainers when they touch core code :(
Greg Kroah-Hartman Nov. 29, 2013, 5:42 p.m. UTC | #23
On Fri, Nov 29, 2013 at 10:37:14AM +0100, Thierry Reding wrote:
> > > Some of this is a consequence of the push to have the firmware
> > > minimal. As soon as you say the kernel has to configure the address
> > > map you've created a big complexity for it..
> > 
> > Why the push to make firmware "minimal"?  What is that "saving"?  You
> > just push the complexity from one place to the other, just because ARM
> > doesn't seem to have good firmware engineers, doesn't mean they should
> > punish their kernel developers :)
> 
> In my experience the biggest problem here is that people working on
> upstream kernels and therefore confronted with these issues are seldom
> able to track the latest developments of new chips.
> 
> When the time comes to upstream support, most of the functionality has
> been implemented downstream already, so it actually works and there's no
> apparent reason why things should change.

That's a failure of the companies involved.

> Now I know that that's not an ideal situation and upstreaming should
> start a whole lot earlier, but even if that were the case, once the
> silicon tapes out there's not a whole lot you can do about it anymore.
> Starting with upstreaming even before that would have to be a solution,
> but I don't think that's realistic at the current pace of development.

For other companies it is realistic.  I have a whole presentation on
this, and why it even makes good business sense to do it properly (hint,
saves you time and money, who doesn't like that?)

> There's a large gap between how fast new SoCs are supposed to tape out
> and the rate at which new code can be merged upstream. Perhaps some of
> that could be mitigated by putting more of the complexity into firmware
> and that's already happening to some degree for ARMv8. But I suspect
> there's a limit to what you can hide away in firmware while at the same
> time giving the kernel enough information to do the right thing.
> 
> I am completely convinced that our goal should be to do upstreaming
> early and ideally there shouldn't be any downstream development in the
> first place. The reason why we're not there yet is because it isn't
> practical to do so currently, so I'm very interested in suggestions or
> finding ways to improve the situation.

"Practical"?  Heh, other companies know how to do this properly, and
because of that, they will succeed, sorry.

It can be done; the fact that ARM and its licensees don't want to do
it doesn't mean it isn't "practical" at all.  It's just a failure on
their part to do things in the "correct" way, wasting time and money in
the process.

Oh well, I guess you all have tons of time and money, best of luck with
that :)

greg k-h
Greg Kroah-Hartman Nov. 29, 2013, 5:43 p.m. UTC | #24
On Fri, Nov 29, 2013 at 01:13:59PM +0000, Dave Martin wrote:
> On Fri, Nov 29, 2013 at 09:57:03AM +0000, Russell King - ARM Linux wrote:
> > On Fri, Nov 29, 2013 at 10:37:14AM +0100, Thierry Reding wrote:
> > > There's a large gap between how fast new SoCs are supposed to tape out
> > > and the rate at which new code can be merged upstream. Perhaps some of
> > > that could be mitigated by putting more of the complexity into firmware
> > > and that's already happening to some degree for ARMv8. But I suspect
> > > there's a limit to what you can hide away in firmware while at the same
> > > time giving the kernel enough information to do the right thing.
> > 
> > One of the bigger issues which stands in the way of companies caring
> > about mainstream support is closed source IPs like VPUs and GPUs.
> > 
> > If you have one of those on your chip, even if the kernel side code
> > is already under the GPL, normally that code is not "mainline worthy".
> > Also, as the userspace code may not be open source, some people object
> > to having the open source part in the kernel.
> > 
> > So for customers to be able to get the performance out of the chip,
> > they have to stick with having a non-mainline kernel.
> > 
> > At that point, why bother spending too much time getting mainline
> > support for the device.  It's never going to be fully functional in
> > mainline.  It doesn't make sense for these SoC companies.
> 
> Putting effort into upstream support for something that is only relevant
> to GPUs (or VPUs) does look less valuable for us right now, unless it
> encourages people to start posting more GPU/VPU code upstream, and we
> know there are other blockers there.

What are these "other blockers"?
Will Deacon Nov. 29, 2013, 6:01 p.m. UTC | #25
On Fri, Nov 29, 2013 at 05:37:01PM +0000, Greg KH wrote:
> On Fri, Nov 29, 2013 at 11:44:53AM +0000, Will Deacon wrote:
> > On Thu, Nov 28, 2013 at 09:25:28PM +0000, Greg KH wrote:
> > > On Thu, Nov 28, 2013 at 07:39:17PM +0000, Dave Martin wrote:
> > > > On Thu, Nov 28, 2013 at 11:13:31AM -0800, Greg KH wrote:
> > > > > Yes it is, you all are the ones tasked with implementing the crazy crap
> > > > > the hardware people have created, best of luck with that :)
> > > > 
> > > > Agreed.  The first assumption should be that we can fit in with the
> > > > existing device model -- we should only reconsider if we find that
> > > > to be impossible.
> > > 
> > > Let me know if you think it is somehow impossible, but you all should
> > > really push back on the insane hardware designers that are forcing you
> > > all to do this work.  I find it "interesting" how this all becomes your
> > > workload for their crazy ideas.
> > 
> > Oh, I don't think we're claiming anything is impossible here :) It's more
> > that we will probably want to make some changes to the device model to allow,
> > for example, a device to be associated with multiple buses of potentially
> > different types.
> 
> Why would you want that?  What good would that help with?

It would help with devices which have their slave interface on one bus, but
master to another.

We need a way to configure the master side of things (IOMMU, coherency, MSI
routing, etc) on one bus and configure the slave side (device probing, power
management, clocks, etc) on another.
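
For instance (a rough sketch; the "master-bus" cross-link property is
made up purely for illustration):

	apb {
		dev: device@10000 {
			/* slave side: MMIO control, probing, clocks, PM */
			reg = <0x10000 0x1000>;
			/* master side: DMA goes out through the memory
			 * interconnect (hypothetical cross-link) */
			master-bus = <&mem_interconnect>;
		};
	};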

> > Step one is to get the DT binding sorted, then we can try and get Linux to
> > make use of it. This goes hand-in-hand with the IOMMU discussion going on
> > here:
> > 
> >   http://lists.infradead.org/pipermail/linux-arm-kernel/2013-November/210401.html
> > 
> > which is one of the issues that is hitting us right now.
> 
> Interesting how people seem to not know how to cc: the needed
> maintainers when they touch core code :(

To be fair, I don't think that code was intended to be merged; it ended up
sparking a discussion about what we need in the DT to represent these
topologies. DT people were on CC iirc.

Will
Greg Kroah-Hartman Nov. 29, 2013, 6:11 p.m. UTC | #26
On Fri, Nov 29, 2013 at 06:01:10PM +0000, Will Deacon wrote:
> On Fri, Nov 29, 2013 at 05:37:01PM +0000, Greg KH wrote:
> > On Fri, Nov 29, 2013 at 11:44:53AM +0000, Will Deacon wrote:
> > > On Thu, Nov 28, 2013 at 09:25:28PM +0000, Greg KH wrote:
> > > > On Thu, Nov 28, 2013 at 07:39:17PM +0000, Dave Martin wrote:
> > > > > On Thu, Nov 28, 2013 at 11:13:31AM -0800, Greg KH wrote:
> > > > > > Yes it is, you all are the ones tasked with implementing the crazy crap
> > > > > > the hardware people have created, best of luck with that :)
> > > > > 
> > > > > Agreed.  The first assumption should be that we can fit in with the
> > > > > existing device model -- we should only reconsider if we find that
> > > > > to be impossible.
> > > > 
> > > > Let me know if you think it is somehow impossible, but you all should
> > > > really push back on the insane hardware designers that are forcing you
> > > > all to do this work.  I find it "interesting" how this all becomes your
> > > > workload for their crazy ideas.
> > > 
> > > Oh, I don't think we're claiming anything is impossible here :) It's more
> > > that we will probably want to make some changes to the device model to allow,
> > > for example, a device to be associated with multiple buses of potentially
> > > different types.
> > 
> > Why would you want that?  What good would that help with?
> 
> It would help with devices which have their slave interface on one bus, but
> master to another.
> 
> We need a way to configure the master side of things (IOMMU, coherency, MSI
> routing, etc) on one bus and configure the slave side (device probing, power
> management, clocks, etc) on another.

Make this two "devices" and have each "device" have a pointer or a way
to "find" the other one.
Will Deacon Nov. 29, 2013, 6:15 p.m. UTC | #27
On Fri, Nov 29, 2013 at 06:11:23PM +0000, Greg KH wrote:
> On Fri, Nov 29, 2013 at 06:01:10PM +0000, Will Deacon wrote:
> > On Fri, Nov 29, 2013 at 05:37:01PM +0000, Greg KH wrote:
> > > On Fri, Nov 29, 2013 at 11:44:53AM +0000, Will Deacon wrote:
> > > > On Thu, Nov 28, 2013 at 09:25:28PM +0000, Greg KH wrote:
> > > > > On Thu, Nov 28, 2013 at 07:39:17PM +0000, Dave Martin wrote:
> > > > > > On Thu, Nov 28, 2013 at 11:13:31AM -0800, Greg KH wrote:
> > > > > > > Yes it is, you all are the ones tasked with implementing the crazy crap
> > > > > > > the hardware people have created, best of luck with that :)
> > > > > > 
> > > > > > Agreed.  The first assumption should be that we can fit in with the
> > > > > > existing device model -- we should only reconsider if we find that
> > > > > > to be impossible.
> > > > > 
> > > > > Let me know if you think it is somehow impossible, but you all should
> > > > > really push back on the insane hardware designers that are forcing you
> > > > > all to do this work.  I find it "interesting" how this all becomes your
> > > > > workload for their crazy ideas.
> > > > 
> > > > Oh, I don't think we're claiming anything is impossible here :) It's more
> > > > that we will probably want to make some changes to the device model to allow,
> > > > for example, a device to be associated with multiple buses of potentially
> > > > different types.
> > > 
> > > Why would you want that?  What good would that help with?
> > 
> > It would help with devices which have their slave interface on one bus, but
> > master to another.
> > 
> > We need a way to configure the master side of things (IOMMU, coherency, MSI
> > routing, etc) on one bus and configure the slave side (device probing, power
> > management, clocks, etc) on another.
> 
> Make this two "devices" and have each "device" have a pointer or a way
> to "find" the other one.

That's certainly one possibility, and one that I'd also toyed around with.
The risk is that we're just spreading the problem around (e.g. into the
dmaengine API), but it's definitely a starting point.

As I said, we need to sort out the DT bindings first then we can see exactly
what we need to fit into Linux.

Will
Jason Gunthorpe Nov. 29, 2013, 6:43 p.m. UTC | #28
On Fri, Nov 29, 2013 at 11:58:15AM +0000, Dave Martin wrote:
> > Hopefully the ARM guys concur, this was just my impression from
> > reviewing their patches and having recently done some design work with
> > AXI..
> 
> Yes and no.  We are trying to describe a real topology here, but only
> because there are salient features that the kernel genuinely does
> need to know about if we want to be able to abstract this kind of thing.
> 
> It's not just about AXI.

Right, I brought up AXI because it is public, well documented and easy
to talk about - every bus/interconnect (PCI, PCI-E, RapidIO,
HyperTransport, etc) I've ever seen works in essentially the same way
- links and 'switches'.

> The master-slave link concept is not supposed to be a new concept at
> all: DT already has this concept.  All we are aiming to add here is
> the ability to describe cross-links that ePAPR cannot describe
> directly.

The main issue seems to be how to merge the DT standard CPU-centric
tree with a bus graph that isn't CPU-centric - eg like in that Zynq
diagram I mentioned.

All the existing DT cases I'm aware of are able to capture the DMA bus
topology within the CPU tree - because they are the same :)

> In axi { axi_switch {} }, are you describing two levels of bus, or
> one?  I'm guessing one, but then the nested node looks a bit weird.

So, my attempt was to sketch a vertex list and adjacency matrix in DT.

'axi' is the container for the graph, 'axi_switch' is a vertex and
then 'downstream' encodes the adjacency list.

We can't use the natural DT tree hierarchy here because there is no
natural graph root - referring to the Zynq diagram there is no vertex
you can start at and then reach every other vertex - so a tree can't
work, and there is no such thing as a 'bus level'

> However, there's nothing to stop a DMA controller's master side being
> looped back so that it can access its own slave interface.  This is the
> normal situation for coherent DMA, since the whole point there is
> that the DMA controller should share its system view closely with
> the CPUs, including some levels of cache.

The DAG would only have vertexes for switches and distinct vertexes
for 'end-ports'. So if an IP block has a master interface and a slave
interface then it would have two DAG end-port vertexes and the DAG can
remain acyclic.

The only way to create cycles is to connect switches in loops, and you
can always model a group of looped switches as a single switch vertex
to remove cycles.

If cycles really are required then it just makes the kernel's job
harder, it doesn't break the DT representation ..

> > Right - which is why I said the usual 'soc' node should remain as-is
> > typical today - a tree formed by viewing the AXI DAG from the CPU
> > vertex. That 100% matches the OS perspective of the system for CPU
> > originated MMIO.
> 
> Do you mean the top-level bus node in the DT and its contents, or
> something else?
> 
> If so, agreed ...

Right, the DT, and the 'reg' properties should present a tree that is
the MMIO path for the CPU. That tree should be a subset of the full
bus graph.

If the bus is 'sane' then that tree matches the DMA graph as well,
which is where most implementations are today.

> ... that could work, although putting the links in the natural places
> in the DT directly feels cleaner that stashing a crib table elsewhere
> in the DT.  That's partly cosmetic, I think both could work?

I chose to talk about this as a side table for a few reasons (touched
on above) but perhaps the most important is where do you put switches
that the CPU's MMIO path doesn't flow through? What is the natural
place in the DT tree?

Again refering to the Zynq diagram, you could have a SOC node like
this:

soc
{
  // Start at Cortex A9
  scu {
     OCM {}
     l2cache {
             memory {}
	     slave {
	     	   on_chip0 {reg = {}}
	     	   on_chip1 {reg = {}}
	     	   on_chip2 {reg = {}}
		   ...
	     }
     }
  }
}

Where do I put a node for the 'memory interconnect' switch? How do I
model DMA connected to the 'Cache Coherent AXI port'?

MMIO config registers for these blocks are going to fit into the MMIO
tree someplace, but that doesn't really tell you anything about how
they fit into the dma graph.

Easy to do with a side table:
axi {
    cpu {downstream = scu}
    scu {downstream = ocm, l2cache}
    l2cache {downstream = memory, slave_interconnect}
    slave_interconnect {} // No more switches

    hp_axi {downstream = memory_interconnect}
    memory_interconnect {downstream = memory, ocm}

    coherent_acp {downstream = scu}

    gp_axi {downstream = master_interconnect}
    master_interconnect {downstream = central_interconnect}
    central_interconnect {downstream = ocm, slave_interconnect, memory}

    dma_engine {downstream = central_interconnect}
}

Which captures the switch vertex list and adjacency list.

Then you have to connect the device nodes into the AXI graph:

Perhaps:

on_chip0 {
   reg = {}
   axi_mmio_slave_port = <&slave_interconnect M0>; // Vertex and edge
   axi_bus_master_port = <&hp_axi S1>;
}

Or maybe back reference from the graph table is better:

axi {
   hp_axi {
       downstream = memory_interconnect
       controller = &...;
       S1 {
          bus_master = &on_chip0;   // Vertex and edge
          axi,arbitration-priority = <10>;
       }
   }
   slave_interconnect {
       M0 {
          mmio_slave = &on_chip0;
       }
   }
}

It sort of feels natural that you could describe the interconnect
under its own tidy node and the main tree is left alone...

This might be easier to parse as well, since you know everything under
'axi' is related to interconnect and not jumbled with other stuff.

Cheers,
Jason
Thierry Reding Nov. 29, 2013, 7:45 p.m. UTC | #29
On Fri, Nov 29, 2013 at 09:42:23AM -0800, Greg KH wrote:
> On Fri, Nov 29, 2013 at 10:37:14AM +0100, Thierry Reding wrote:
> > > > Some of this is a consequence of the push to have the firmware
> > > > minimal. As soon as you say the kernel has to configure the address
> > > > map you've created a big complexity for it..
> > > 
> > > Why the push to make firmware "minimal"?  What is that "saving"?  You
> > > just push the complexity from one place to the other, just because ARM
> > > doesn't seem to have good firmware engineers, doesn't mean they should
> > > punish their kernel developers :)
> > 
> > In my experience the biggest problem here is that people working on
> > upstream kernels and therefore confronted with these issues are seldom
> > able to track the latest developments of new chips.
> > 
> > When the time comes to upstream support, most of the functionality has
> > been implemented downstream already, so it actually works and there's no
> > apparent reason why things should change.
> 
> That's a failure of the companies involved.

Yes, I know.

> > Now I know that that's not an ideal situation and upstreaming should
> > start a whole lot earlier, but even if that were the case, once the
> > silicon tapes out there's not a whole lot you can do about it anymore.
> > Starting with upstreaming even before that would have to be a solution,
> > but I don't think that's realistic at the current pace of development.
> 
> For other companies it is realistic.  I have a whole presentation on
> this, and why it even makes good business sense to do it properly (hint,
> saves you time and money, who doesn't like that?)

I've seen a recording of that presentation. Twice. =)

> > There's a large gap between how fast new SoCs are supposed to tape out
> > and the rate at which new code can be merged upstream. Perhaps some of
> > that could be mitigated by putting more of the complexity into firmware
> > and that's already happening to some degree for ARMv8. But I suspect
> > there's a limit to what you can hide away in firmware while at the same
> > time giving the kernel enough information to do the right thing.
> > 
> > I am completely convinced that our goal should be to do upstreaming
> > early and ideally there shouldn't be any downstream development in the
> > first place. The reason why we're not there yet is because it isn't
> > practical to do so currently, so I'm very interested in suggestions or
> > finding ways to improve the situation.
> 
> "Practical"?  Heh, other companies know how to do this properly, and
> because of that, they will succeed, sorry.
> 
> It can be done, the fact that ARM and it's licensees don't want to do
> it, doesn't mean it isn't "practical" at all, it's just a failure on
> their part to do things in the "correct" way, wasting time and money in
> the process.

Well, I can't really argue with that, so I'll stop with the whining and
go back to work.

Thierry
Dave Martin Dec. 2, 2013, 8:25 p.m. UTC | #30
On Fri, Nov 29, 2013 at 11:43:41AM -0700, Jason Gunthorpe wrote:
> On Fri, Nov 29, 2013 at 11:58:15AM +0000, Dave Martin wrote:
> > > Hopefully the ARM guys concur, this was just my impression from
> > > reviewing their patches and having recently done some design work with
> > > AXI..
> > 
> > Yes and no.  We are trying to describe a real topology here, but only
> > because there are salient features that the kernel genuinely does
> > need to know about if we want to be able to abstract this kind of thing.
> > 
> > It's not just about AXI.
> 
> Right, I brought up AXI because it is public, well documented and easy
> to talk about - every bus/interconnect (PCI, PCI-E, RapidIO,
> HyperTransport, etc) I've ever seen works in essentially the same way
> - links and 'switches'.
> 
> > The master-slave link concept is not supposed to be a new concept at
> > all: DT already has this concept.  All we are aiming to add here is
> > the ability to describe cross-links that ePAPR cannot describe
> > directly.
> 
> The main issue seems to be how to merge the DT standard CPU-centric
> tree with a bus graph that isn't CPU-centric - eg like in that Zynq
> diagram I mentioned.
> 
> All the existing DT cases I'm aware of are able to capture the DMA bus
> topology within the CPU tree - because they are the same :)
> 
> > In axi { axi_switch {} }, are you describing two levels of bus, or
> > one?  I'm guessing one, but then the nested node looks a bit weird.
> 
> So, my attempt was to sketch a vertex list and adjacency matrix in DT.
> 
> 'axi' is the container for the graph, 'axi_switch' is a vertex and
> then 'downstream' encodes the adjacency list.
> 
> We can't use the natural DT tree hierarchy here because there is no
> natural graph root - referring to the Zynq diagram there is no vertex
> you can start at and then reach every other vertex - so a tree can't
> work, and there is no such thing as a 'bus level'
> 
> > However, there's nothing to stop a DMA controller's master side being
> > looped back so that it can access its own slave interface.  This is the
> > normal situation for coherent DMA, since the whole point there is
> > that the DMA controller should share its system view closely with
> > the CPUs, including some levels of cache.
> 
> The DAG would only have vertexes for switches and distinct vertexes
> for 'end-ports'. So if an IP block has a master interface and a slave
> interface then it would have two DAG end-port vertexes and the DAG can
> remain acyclic.
> 
> The only way to create cycles is to connect switches in loops, and you
> can always model a group of looped switches as a single switch vertex
> to remove cycles.
> 
> If cycles really are required then it just makes the kernel's job
> harder, it doesn't break the DT representation ..
> 
> > > Right - which is why I said the usual 'soc' node should remain as-is
> > > typical today - a tree formed by viewing the AXI DAG from the CPU
> > > vertex. That 100% matches the OS perspective of the system for CPU
> > > originated MMIO.
> > 
> > Do you mean the top-level bus node in the DT and its contents, or
> > something else?
> > 
> > If so, agreed ...
> 
> Right, the DT, and the 'reg' properties should present a tree that is
> the MMIO path for the CPU. That tree should be a subset of the full
> bus graph.
> 
> If the bus is 'sane' then that tree matches the DMA graph as well,
> which is where most implementations are today.
> 
> > ... that could work, although putting the links in the natural places
> > in the DT directly feels cleaner than stashing a crib table elsewhere
> > in the DT.  That's partly cosmetic, I think both could work?
> 
> I chose to talk about this as a side table for a few reasons (touched
> on above) but perhaps the most important is where do you put switches
> that the CPU's MMIO path doesn't flow through? What is the natural
> place in the DT tree?
> 
> Again refering to the Zynq diagram, you could have a SOC node like
> this:
> 
> soc
> {
>   // Start at Cortex A9
>   scu {
>      OCM {}
>      l2cache {
>              memory {}
> 	     slave {
> 	     	   on_chip0 {reg = {}}
> 	     	   on_chip1 {reg = {}}
> 	     	   on_chip2 {reg = {}}
> 		   ...
> 	     }
>      }
>   }
> }
> 
> Where do I put a node for the 'memory interconnect' switch? How do I
> model DMA connected to the 'Cache Coherent AXI port'?
> 
> MMIO config registers for these blocks are going to fit into the MMIO
> tree someplace, but that doesn't really tell you anything about how
> they fit into the dma graph.
> 
> Easy to do with a side table:
> axi {
>     cpu {downstream = scu}
>     scu {downstream = ocm, l2cache}
>     l2cache {downstream = memory, slave_interconnect}
>     slave_interconnect {} // No more switches
> 
>     hp_axi {downstream = memory_interconnect}
>     memory_interconnect {downstream = memory, ocm}
> 
>     coherent_acp {downstream = scu}
> 
>     gp_axi {downstream = master_interconnect}
>     master_interconnect {downstream = central_interconnect}
>     central_interconnect {downstream = ocm, slave_interconnect, memory}
> 
>     dma_engine {downstream = central_interconnect}
> }
> 
> Which captures the switch vertex list and adjacency list.
> 
> Then you have to connect the device nodes into the AXI graph:
> 
> Perhaps:
> 
> on_chip0 {
>    reg = {}
>    axi_mmio_slave_port = <&slave_interconnect M0>; // Vertex and edge
>    axi_bus_master_port = <&hp_axi S1>;
> }
> 
> Or maybe back reference from the graph table is better:
> 
> axi {
>    hp_axi {
>        downstream = memory_interconnect
>        controller = &...;
>        S1 {
>           bus_master = &on_chip0;   // Vertex and edge
>           axi,arbitration-priority = <10>;
>        }
>    }
>    slave_interconnect {
>        M0 {
>           mmio_slave = &on_chip0;
>        }
>    }
> }
> 
> It sort of feels natural that you could describe the interconnect
> under its own tidy node and the main tree is left alone...
> 
> This might be easier to parse as well, since you know everything under
> 'axi' is related to interconnect and not jumbled with other stuff.

That is true, but I do have a concern that bolting more and more
info onto the side of DT may leave us with a mess, while the "main"
tree becomes increasingly fictional.

You make a lot of good points -- apologies for not responding in detail
to all of them yet, but I tie myself in knots trying to say too many
different things at the same time.


For comparison, here's what I think is the main alternative approach.
My discussion touches obliquely on some of the issues you raise...


My basic idea is that DT allows us to express master/slave relationships
like this already:

	master_device {
		slave_device {
			reg = < ... >;
			ranges = < ... >;
			dma-ranges = < ... >;
		};
	};

(dma-ranges is provided by ePAPR to describe the reverse mappings used
when slave_device masters back onto master_device.)
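
For instance, a concrete (made-up) example of the existing mechanism:

	soc {
		#address-cells = <1>;
		#size-cells = <1>;

		bus@40000000 {
			compatible = "simple-bus";
			#address-cells = <1>;
			#size-cells = <1>;
			/* CPU-side view: the bus's devices at 0x40000000 */
			ranges = <0x0 0x40000000 0x10000>;
			/* master-side view: masters on this bus see system
			 * RAM from 0x0 at bus address 0x80000000 */
			dma-ranges = <0x80000000 0x0 0x40000000>;
		};
	};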

In a multi-master system this isn't enough, because a node might have
to have multiple parents in order to express all the master/slave
relationships.

In that case, we can choose one of the parents as the canonical one
(e.g., the immediate master on the path from the coherent CPUs), or if
there is no obvious canonical parent the child node can be a
freestanding node in the tree (i.e., with no reg or ranges properties,
either in the / { } junkyard, or in some location that makes topological
sense for the device in question).  The DT hierarchy retains real
meaning since the direction of master/slave relationships is fixed,
but the DT becomes a tree of connected trees, rather than a single tree.

So, some master/slave relationships will not be child/parent any more,
and we need another way to express the linkage.


My idea now, building on Will's suggestion, is to keep the existing
abstractions unchanged, but create an alternate "detached"
representation of each relationship, to use in multi-master
situations.


In the following, the #address-cells and #size-cells properties of a
and b (where applicable) control the parsing of REG or RANGE in
precisely the same way for both forms.

SLAVE-PORT applies to slave buses with multiple masters, but it is
only needed in systems where the ports must be identified
distinctly.  DT has no traditional way to describe this, so we'd need
to add another property ("parent-master" in my example) if we want to
describe that in tree form.  For now, I assume that a device (i.e., a
slave that is not a bus) does not need to distinguish between
different slave ports -- if it did need to do so, additional
properties could be added to describe that, but we have no example of
this today(?)

Drivers for buses that use SLAVE-PORT could still work with a DT that
does not provide the SLAVE-PORT information, but might have restricted
functionality in that case (no per-port control, profiling, etc.)


Note also that:

	{ #slave-cells = <0>; } is equivalent to { // no #slave-cells }
	{ parent-master = <>; } is equivalent to { // no parent-master }

...which gives the correct interpretation for traditional DTs.


We then have the following transformations:


Slave device mapping, tree form:
	a {
		b {
			reg = < REG >;
		};
	};

Slave device mapping, detached form:
	a {
		slave-reg = < &b REG >;
	};

	b {
	};


Slave passthrough bus mapping, tree form:
	a {
		b {
			#slave-cells = < SLAVE-PORT-CELLS >;
			parent-master = < SLAVE-PORT >;

			ranges;
		};
	};

Slave passthrough bus mapping, detached form:
	a {
		slave = < &b SLAVE-PORT >;
	};

	b {
		#slave-cells = < SLAVE-PORT-CELLS >;
	};


Remapped slave bus mapping, tree form:
	a {
		b {
			#slave-cells = < SLAVE-PORT-CELLS >;
			parent-master = < SLAVE-PORT >;

			ranges = < RANGE >;
		};
	};

Remapped slave bus mapping, detached form:
	a {
		slave = < &b SLAVE-PORT RANGE >;
	};

	b {
		#slave-cells = < SLAVE-PORT-CELLS >;
	};


There are no dma-ranges properties here, because in this context "DMA"
is just a master/slave relationship where the master isn't a CPU.
The added properties actually allow us to describe that just fine, but
in a more explicit and general way.
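
For example, a DMA master expressed with the detached form above (all
values illustrative):

	interconnect: cci {
		#slave-cells = <1>;	/* one cell: the slave port ID */
	};

	dma: dma-controller@12000000 {
		reg = <0x12000000 0x1000>;	/* slave side, in the CPU tree */
		/* master side: masters onto the interconnect's port 3,
		 * mapping the first 2GB 1:1 */
		slave = <&interconnect 3 0x0 0x0 0x80000000>;
	};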


Some new code will be required to parse this, but since it is just a
new, more flexible _mechanism_ for expressing an old relationship,
extended with a straightforward slave-port identifier (a simple array
of cells), we should have a good chance of burying the change under
abstracted interfaces.

Existing DTs should not need to change, because we have a well-defined
mapping between the two representations in the non-multi-master case.

I may try to come up with a partial description of the Zynq SoC, but
I was getting myself confused when I tried it earlier ;)

Cheers
---Dave
Jason Gunthorpe Dec. 3, 2013, 12:07 a.m. UTC | #31
On Mon, Dec 02, 2013 at 08:25:43PM +0000, Dave Martin wrote:
> > This might be easier to parse as well, since you know everything under
> > 'axi' is related to interconnect and not jumbled with other stuff.
> 
> That is true, but I do have a concern that bolting more and more
> info onto the side of DT may leave us with a mess, while the "main"
> tree becomes increasingly fictional.
> 
> You make a lot of good points -- apologies for not responding in detail
> to all of them yet, but I tie myself in knots trying to say too many
> different things at the same time.

I think the main point is to observe that we are encoding a directed
graph onto DT, so long as the original graph can be extracted the
DT encoding can be whatever people like :)

> In a multi-master system this isn't enough, because a node might have
> to have multiple parents in order to express all the master/slave
> relationships.

Right, DT is a tree, not a graph - and this is already a minor problem
we've seen modeling some IP blocks on the Marvell chips. They also
have multiple ports into the various system busses.

> In that case, we can choose one of the parents as the canonical one
> (e.g., the immediate master on the path from the coherent CPUs), or if
> there is no obvious canonical parent the child node can be a
> freestanding node in the tree (i.e., with no reg or ranges properties,
> either in the / { } junkyard, or in some location that makes topological
> sense for the device in question).  The DT herarchy retains real
> meaning since the direction of master/slave relationships is fixed,
> but the DT becomes a tree of connected trees, rather than a single
> tree.

I'm not sure this will really be a problem in practice:

Consider:
 - All IP blocks we care about are going to have a CPU MMIO port for
   control.
 - The 'soc' tree is the MMIO hierarchy from the CPU perspective
 - IP blocks should be modelled in DT as a single node when possible

In that case, the location of a DT node for a multiport IP is now well
defined: It is the path from the CPU to the MMIO port, expressed in
DT.

Further, every 'switch' is going to have MMIO to control the switch,
so the switch node DT locations are also well defined.

Basically, I think the main 'soc' tree's layout is mostly unambiguous
and covers all the relevant blocks.

You won't get a forest of DT trees because every block must be MMIO
reachable.

It is also the same core DT tree with my suggestion or yours.

Your edge encoding also makes sense, but I think this is where I would
disagree the most:

> Slave device mapping, tree form:
> 	a {
> 		b {
> 			reg = < REG >;
> 		};
> 	};
> 
> Slave device mapping, detached form:
> 	a {
> 		slave-reg = < &b REG >;
> 	};
> 
> 	b {
> 	};

This now requires the OS to parse this dataset just to access standard
MMIO, and you have to change the standard existing code that parses
ranges and reg to support this extended format.

Both of those reasons seem like major downsides to me. If the OS
doesn't support advanced features (IOMMU, power management, etc) it
should not require DT parsing beyond the standard items. This may
become relevant when re-using a kernel DT in uboot for instance.

On the other hand, this is a great way to actually express the correct
address mapping path for every reg window - but isn't that a separate
issue from the IOMMU/DMA problem? You still need to describe the DMA
bus mastering ports on IP directly.

The side-table concept would keep the parsing completely contained
within the IOMMU/etc drivers, and not have it leak out into existing
core DT code, but it doesn't completely tidy multiple slave ports.

Also, I was thinking after I sent the last email that this is a good
time to be thinking about a future need for describing NUMA affinites
in DT. That is basically the same directed graph we are talking about
here. Trying some modeling samples with that in mind would be a good
idea..

You should also think about places to encode parameters like
master/slave QOS and other edge-specific tunables..

> I may try to come up with a partial description of the Zynq SoC, but
> I was getting myself confused when I tried it earlier ;)

The Zynq is interesting because all the information is public - and it
is a good example of the various AXI building blocks. Imagine some
IOMMUs in there and you have a complete scenario to talk about..

It even has a coherent AXI port available for IP to hook up to. :)

Regards,
Jason
Dave Martin Dec. 3, 2013, 11:45 a.m. UTC | #32
On Mon, Dec 02, 2013 at 05:07:40PM -0700, Jason Gunthorpe wrote:
> On Mon, Dec 02, 2013 at 08:25:43PM +0000, Dave Martin wrote:
> > > This might be easier to parse as well, since you know everything under
> > > 'axi' is related to interconnect and not jumbled with other stuff.
> > 
> > That is true, but I do have a concern that bolting more and more
> > info onto the side of DT may leave us with a mess, while the "main"
> > tree becomes increasingly fictional.
> > 
> > You make a lot of good points -- apologies for not responding in detail
> > to all of them yet, but I tie myself in knots trying to say too many
> > different things at the same time.
> 
> I think the main point is to observe that we are encoding a directed
> graph onto DT, so long as the original graph can be extracted the
> DT encoding can be whatever people like :)

Sure, we're just juggling different descriptions for the same thing here.
The fact that our different representations do seem to agree on that
is reassuring...

> > In a multi-master system this isn't enough, because a node might have
> > to have multiple parents in order to express all the master/slave
> > relationships.
> 
> Right, DT is a tree, not a graph - and this is already a minor problem
> we've seen modeling some IP blocks on the Marvell chips. They also
> have multiple ports into the various system busses.
> 
> > In that case, we can choose one of the parents as the canonical one
> > (e.g., the immediate master on the path from the coherent CPUs), or if
> > there is no obvious canonical parent the child node can be a
> > freestanding node in the tree (i.e., with no reg or ranges properties,
> > either in the / { } junkyard, or in some location that makes topological
> sense for the device in question).  The DT hierarchy retains real
> > meaning since the direction of master/slave relationships is fixed,
> > but the DT becomes a tree of connected trees, rather than a single
> > tree.
> 
> I'm not sure this will really be a problem in practice:
> 
> Consider:
>  - All IP blocks we care about are going to have a CPU MMIO port for
>    control.
>  - The 'soc' tree is the MMIO hierarchy from the CPU perspective
>  - IP blocks should try to DT model as a single node when possible
> 
> In that case, the location of a DT node for a multiport IP is now well
> defined: It is the path from the CPU to the MMIO port, expressed in
> DT.
> 
> Further, every 'switch' is going to have MMIO to control the switch,
> so the switch node DT locations are also well defined.
> 
> Basically, I think the main 'soc' tree's layout is mostly unambiguous
> and covers all the relevant blocks.
> 
> You won't get a forest of DT trees because every block must be MMIO
> reachable.
> 
> It is also the same core DT tree with my suggestion or yours.

Absolutely: I didn't argue this very well.

The CPU's-eye view of the system determines a natural hierarchy for
everything, or almost everything.

It's possible that there is some bus or switch that only non-CPUs
can see.  But if the CPU has no control interface for it, that
suggests that bus is transparent enough that it needs no control --
and might not need to be represented in the DT at all.

As a rule, we should never put anything in the DT that does not
need to be described.

But if we end up with deviations from this rule, floating nodes
give us an escape route.

Suppose you have a cluster of DSPs used to implement a GPU.  They
might have their own front-side bus which they control themselves.
In this situation, it might be more natural to represent that whole
side cluster as a separate floating subtree within /.
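
Purely as a sketch (every name here is invented), such a floating
subtree might look like:

	/ {
		soc {
			[...]	/* the normal CPU-mastered hierarchy */
		};

		/*
		 * Floating subtree: no reg or ranges tie it into
		 * /soc, because the CPU cannot address this bus.
		 */
		gpu {
			dsp@0 {
				[...]
			};

			dsp@1 {
				[...]
			};
		};
	};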

But that's all very hypothetical.  In most cases, you just call
that monstrosity "gpu" and make it look like a device -- even in
the hardware.

> 
> Your edge encoding also makes sense, but I think this is where I would
> disagree the most:
> 
> > Slave device mapping, tree form:
> > 	a {
> > 		b {
> > 			reg = < REG >;
> > 		};
> > 	};
> > 
> > Slave device mapping, detached form:
> > 	a {
> > 		slave-reg = < &b REG >;
> > 	};
> > 
> > 	b {
> > 	};
> 
> This now requires the OS to parse this dataset just to access standard
> MMIO, and you have to change the standard existing code that parses
> ranges and reg to support this extended format.
> 
> Both of those reasons seem like major downsides to me. If the OS
> doesn't support advanced features (IOMMU, power management, etc) it
> should not require DT parsing beyond the standard items. This may
> become relevant when re-using a kernel DT in uboot for instance.

You're right that this is a change.  However, I think that no existing
DT needs to change, and few DTs will use it -- similar to the argument
about why DT will normally look like a single tree.

In real systems, I think multi-master slaves which are accessed
directly and not via some multi-master shared bus are not that common.

A partial dodge would be to introduce a dummy bus node:

	a {
		b {
			reg = < REG >;
		};
	};

becomes

	a {
		slave-ranges = < &b_bus RANGE >;
	};

	b_bus {
		// #slave-cells = <0> is the default
		compatible = "simple-bus";

		b {
			reg = < REG' >;
		};
	};

Where RANGE maps REG into a's address space, and REG' is REG rebased
to address 0.
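
To make that concrete with invented numbers: suppose REG was
< 0x40000000 0x10000 > in a's address space, and assume slave-ranges
follows the usual (child, parent, size) layout of ranges:

	a {
		/* b_bus offset 0 appears at 0x40000000 in a's space */
		slave-ranges = < &b_bus 0x0 0x40000000 0x10000 >;
	};

	b_bus {
		// #slave-cells = <0> is the default
		compatible = "simple-bus";

		b {
			/* REG' = REG rebased to address 0 */
			reg = < 0x0 0x10000 >;
		};
	};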

Now, we can refer indirectly to b as many times as we like, without
using the slave-reg thing.

This still needs special parsing though -- but again, only for cases
that we already can't describe with DT.


The common situation will be that all shared slaves are really under
some shared bus, and that bus really has some natural location in
the DT.  So for cases simple enough not to require these extensions,
I think there would still be no change.

> On the other hand, this is a great way to actually express the correct
> address mapping path for every reg window - but isn't that a separate
> issue from the IOMMU/DMA problem? You still need to describe the DMA
> bus mastering ports on IP directly.

Those problems aren't identical, but they seem closely related.

My thought was that this gives us most of the language required to
describe the mastering links for bus-mastering devices.

> 
> The side-table concept would keep the parsing completely contained
> within the IOMMU/etc drivers, and not have it leak out into existing
> core DT code, but it doesn't completely tidy multiple slave ports.
> 
> Also, I was thinking after I sent the last email that this is a good
> time to be thinking about a future need for describing NUMA affinities
> in DT. That is basically the same directed graph we are talking about
> here. Trying some modeling samples with that in mind would be a good
> idea..
> 
> You should also think about places to encode parameters like
> master/slave QoS and other edge-specific tunables..

The idea that some things are properties of an edge or link, not a
node or device, overlaps with my thinking about IOMMU.  This may
apply to any device that behaves like some adaptor or passthrough.

One option is to create subnodes for these links.  I did not elaborate
on this previously, but allowing "slave" to be a node might help here,
for the more complex cases where we need to attach more info:

	dma {
		slave {
			compatible = "slave-link", "simple-bus";
			ranges = < ... >;
			iommu-foo = < ... >;

			slave = < &shared_bus SLAVE-PORT >;
		};
	};

The best way to describe multiple master ports on the DMA controller
would need some thought.


One option would be to extend the address space on the slave node with
additional cell(s) to carry port identifiers.  ePAPR already does
things in this sort of way in the interrupt-map and interrupt-map-mask
properties, to describe PCI interrupt routing.
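
As a rough sketch, where the leading port-id cell and all the numbers
are invented for illustration:

	dma {
		slave {
			compatible = "slave-link", "simple-bus";

			/* The first address cell selects the master port. */
			#address-cells = <2>;
			#size-cells = <1>;

			/* port 0 window, then port 1 window */
			ranges = < 0 0x0 0x80000000 0x100000 >,
				 < 1 0x0 0xc0000000 0x100000 >;

			slave = < &shared_bus SLAVE-PORT >;
		};
	};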

> 
> > I may try to come up with a partial description of the Zynq SoC, but
> > I was getting myself confused when I tried it earlier ;)
> 
> The Zynq is interesting because all the information is public - and it
> is a good example of the various AXI building blocks. Imagine some
> IOMMUs in there and you have a complete scenario to talk about..

Indeed.  I was impressed to see a non-trivial block diagram that wasn't
pasted straight out of some marketing powerpoint :)

It's a good example for discussion here, particularly if we add some
IOMMUs to the mix.

> It even has a coherent AXI port available for IP to hook up to. :)

You mean the ACP port connecting the PL Fabric back to the CPU cluster?
I'm guessing the PL Fabric is the interface to the FPGA logic.


Now I need to go back to your proposal and the IOMMU thread and try to
understand better how the approaches map onto each other.

Cheers
---Dave
Mark Brown Dec. 4, 2013, 6:43 p.m. UTC | #33
On Thu, Nov 28, 2013 at 06:35:54PM -0800, Greg KH wrote:
> On Thu, Nov 28, 2013 at 04:31:47PM -0700, Jason Gunthorpe wrote:

> > Greg's point makes sense, but the HW guys are not designing things
> > this way for kicks - there are real physics based reasons for some of
> > these choices...

> > eg An all-to-all bus cross bar (eg like Intel's ring bus) is energy
> > expensive compared to a purpose built muxed bus tree. Doing coherency
> > look ups on DMA traffic costs energy, etc.

> Really?  How much power exactly does it take / save?  Yes, hardware
> people think "software is free", but when you can't actually control the
> hardware in the software properly, well, you end up with something like
> itanium...

If you look at the hardware design decisions this stuff tends to be
totally sensible; there's a bunch of factors at play (complexity, area
and isolation tend to be other ones).  There's a lot of the stuff that
we're complaining about where they can reasonably question why this is
so complex for us.  That doesn't mean that everything that it's possible
to do is sensible but there's definitely limitations on the kernel side
here.

> > > code to deal with those descriptions and the hardware they represent. At
> > > some point we need to start pushing some of the complexity back into
> > > hardware so that we can keep a sane code-base.

> > Some of this is a consequence of the push to have the firmware
> > minimal. As soon as you say the kernel has to configure the address
> > map you've created a big complexity for it..

> Why the push to make firmware "minimal"?  What is that "saving"?  You
just push the complexity from one place to the other.  Just because ARM
> doesn't seem to have good firmware engineers, doesn't mean they should
> punish their kernel developers :)

These firmwares have tended to be ROMed or otherwise require expensive
validation to change for sometimes sensible reasons, keeping the amount
of code that's painful to change low will tend to make people happier if
a change is needed.  Most people like the risk mitigation.
Greg Kroah-Hartman Dec. 4, 2013, 7:03 p.m. UTC | #34
On Wed, Dec 04, 2013 at 06:43:45PM +0000, Mark Brown wrote:
> On Thu, Nov 28, 2013 at 06:35:54PM -0800, Greg KH wrote:
> > On Thu, Nov 28, 2013 at 04:31:47PM -0700, Jason Gunthorpe wrote:
> 
> > > Greg's point makes sense, but the HW guys are not designing things
> > > this way for kicks - there are real physics based reasons for some of
> > > these choices...
> 
> > > eg An all-to-all bus cross bar (eg like Intel's ring bus) is energy
> > > expensive compared to a purpose built muxed bus tree. Doing coherency
> > > look ups on DMA traffic costs energy, etc.
> 
> > Really?  How much power exactly does it take / save?  Yes, hardware
> > people think "software is free", but when you can't actually control the
> > hardware in the software properly, well, you end up with something like
> > itanium...
> 
> If you look at the hardware design decisions this stuff tends to be
> totally sensible; there's a bunch of factors at play (complexity, area
> and isolation tend to be other ones).  There's a lot of the stuff that
> we're complaining about where they can reasonably question why this is
> so complex for us.  That doesn't mean that everything that it's possible
> to do is sensible but there's definitely limitations on the kernel side
> here.

The main reason it's so "complex" is the drive for people to have a "one
kernel image for multiple systems" and hence the need to have DT handle
all of this.  I don't think that requirement has been pushed back on the
hardware engineers yet, and they are still thinking that a custom image
per chip is ok.

If the hardware designers don't have that goal, this is just going to
get harder and harder over time, as your systems get more and more
complex.

> > > > code to deal with those descriptions and the hardware they represent. At
> > > > some point we need to start pushing some of the complexity back into
> > > > hardware so that we can keep a sane code-base.
> 
> > > Some of this is a consequence of the push to have the firmware
> > > minimal. As soon as you say the kernel has to configure the address
> > > map you've created a big complexity for it..
> 
> > Why the push to make firmware "minimal"?  What is that "saving"?  You
> > just push the complexity from one place to the other.  Just because ARM
> > doesn't seem to have good firmware engineers, doesn't mean they should
> > punish their kernel developers :)
> 
> These firmwares have tended to be ROMed or otherwise require expensive
> validation to change for sometimes sensible reasons; keeping the amount
> of code that's painful to change low will tend to make people happier if
> a change is needed.  Most people like the risk mitigation.

I love how it's so easy to make the kernel be the part of the whole
system stack that is simpler to change than any other :)

I'm all for making Linux be the "firmware" and deal with these very
low-level issues directly, but again, this drives in the face of your
self-stated goal of "one image per architecture" for the kernel.  You
kind of can't have it both ways it seems, so someone needs to make up
their mind as to what it's going to be...

Best of luck with this.

greg k-h
Mark Brown Dec. 4, 2013, 8:27 p.m. UTC | #35
On Wed, Dec 04, 2013 at 11:03:12AM -0800, Greg KH wrote:
> On Wed, Dec 04, 2013 at 06:43:45PM +0000, Mark Brown wrote:

> > If you look at the hardware design decisions this stuff tends to be
> > totally sensible; there's a bunch of factors at play (complexity, area
> > and isolation tend to be other ones).  There's a lot of the stuff that
> > we're complaining about where they can reasonably question why this is
> > so complex for us.  That doesn't mean that everything that it's possible
> > to do is sensible but there's definitely limitations on the kernel side
> > here.

> The main reason it's so "complex" is the drive for people to have a "one
> kernel image for multiple systems" and hence the need to have DT handle
> all of this.  I don't think that requirement has been pushed back on the
> hardware engineers yet, and they are still thinking that a custom image
> per chip is ok.

No, the single system image stuff is orthogonal here and is basically
irrelevant for many use cases - it's essential for servers and so on but
most consumer electronics guys pretty much don't care and this stuff is
as much a problem for them as anyone else.  We've always struggled with
these things even when building for specific hardware, DT is just
another way to write the data structure here.  When people are talking
about figuring out the DT first here, what they're talking about is as
much working out what we need to abstract first as anything else.

This isn't a million miles away from the stuff we've dealt with using
probe deferral in terms of fitting into the device model, at least at a
high level, though it is harder to sidestep the issues here.

> > These firmwares have tended to be ROMed or otherwise require expensive
> > validation to change for sometimes sensible reasons, keeping the amount
> > of code that's painful to change low will tend to make people happier if
> > a change is needed.  Most people like the risk mitigation.

> I love how it's so easy to make the kernel be the part of the whole
> system stack that is simpler to change than any other :)

Damn open source license :)

> I'm all for making Linux be the "firmware" and deal with these very
> low-level issues directly, but again, this drives in the face of your
> self-stated goal of "one image per architecture" for the kernel.  You
> kind of can't have it both ways it seems, so someone needs to make up
> their mind as to what it's going to be...

I think you're confusing me with someone else...  in any case, I don't
see why there should be any conflict here.
diff mbox

Patch

diff --git a/Documentation/devicetree/bindings/arm/coherent-bus.txt b/Documentation/devicetree/bindings/arm/coherent-bus.txt
new file mode 100644
index 000000000000..e3fbc2e491c7
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/coherent-bus.txt
@@ -0,0 +1,110 @@ 
+* Generic binding to describe a coherent bus
+
+In some systems, devices (peripherals and/or CPUs) do not share
+coherent views of memory, while on other systems sets of devices may
+share a coherent view of memory depending on the static bus topology
+and/or dynamic configuration of both the bus and device. Establishing
+such dynamic configurations requires appropriate topological information
+to be communicated to the operating system.
+
+This binding document attempts to define a set of generic properties
+which can be used to encode topological information in bus and device
+nodes.
+
+
+* Terminology
+
+  - Port                : An interface over which memory transactions
+                          can propagate. A port may act as a master,
+                          slave or both (see below).
+
+  - Master port         : A port capable of issuing memory transactions
+                          to a slave. For example, a port connecting a
+                          DMA controller to main memory.
+
+  - Slave port          : A port capable of responding to memory
+                          transactions received from a master. For
+                          example, a port connecting the control
+                          registers of an MMIO device to a peripheral
+                          bus.
+
+  **Note** The ports on a bus to which masters are connected are
+           referred to as slave ports on that bus.
+
+
+* Properties
+
+  - #slave-port-cells   : A property of the bus, describing the number
+                          of cells required for an upstream master
+                          device to encode a single slave port on the
+                          bus. The actual encoding is defined by the
+                          bus binding.
+
+  - slave-ports         : A property of a device mastering through a
+                          downstream bus, describing the set of slave
+                          ports on the bus to which the device is
+                          connected. The property takes the form of a
+                          list of pairs, where each pair contains a
+                          phandle to the bus node as its first element
+                          and #slave-port-cells cells (for the bus
+                          referred to in the first element) as the
+                          second element.
+
+
+* Example
+
+        my_coherent_bus: my-coherent-bus {
+                compatible = "acme,coherent-bus-9000";
+                #address-cells = <1>;
+                #size-cells = <1>;
+                reg = <0xba5e0000 0x10000>;
+
+                [...]        /* More bus-specific properties */
+
+                /*
+                 * Slave ports on this bus can be identified with a
+                 * single cell.
+                 */
+                #slave-port-cells = <1>;
+
+                /* 1:1 address space mapping with our parent bus. */
+                ranges;
+
+                /*
+                 * These devices all have at least their *slave* interfaces
+                 * on the coherent bus.
+                 */
+                dma0@fff00000 {
+                        compatible = "acme,coherent-dma-9000";
+                        reg = <0xfff00000 0x10000>;
+
+                        [...]        /* More dma-specific properties */
+
+                        /*
+                         * The DMA controller can master through two
+                         * ports on the coherent bus, using port
+                         * identifiers '0' and '1'.
+                         */
+                        slave-ports = <&my_coherent_bus 0>,
+                                      <&my_coherent_bus 1>;
+                };
+
+                [...]        /* More devices */
+        };
+
+        /*
+         * A device that can master through the coherent bus, but has its
+         * slave interface elsewhere.
+         */
+        dma1@fff80000 {
+                compatible = "acme,coherent-dma-9000";
+                reg = <0xfff80000 0x10000>;
+
+                [...]        /* More dma-specific properties */
+
+                /*
+                 * The DMA controller can master through a single port
+                 * on the coherent bus above, using port identifier '8'.
+                 */
+                slave-ports = <&my_coherent_bus 8>;
+        };