[0/3] Apple M1 DART IOMMU driver

Message ID	20210320151903.60759-1-sven@svenpeter.dev (mailing list archive)
Headers	show Return-Path: <SRS0=X12G=IS=lists.infradead.org=linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 425B061964 Date: Sat, 20 Mar 2021 15:19:33 +0000 To: iommu@lists.linux-foundation.org From: Sven Peter <sven@svenpeter.dev> Cc: Joerg Roedel <joro@8bytes.org>, Will Deacon <will@kernel.org>, Robin Murphy <robin.murphy@arm.com>, Rob Herring <robh+dt@kernel.org>, Arnd Bergmann <arnd@kernel.org>, Hector Martin <marcan@marcan.st>, Mark Kettenis <mark.kettenis@xs4all.nl>, Marc Zyngier <maz@kernel.org>, Mohamed Mediouni <mohamed.mediouni@caramail.com>, Stan Skowronek <stan@corellium.com>, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, devicetree@vger.kernel.org Subject: [PATCH 0/3] Apple M1 DART IOMMU driver Message-ID: <20210320151903.60759-1-sven@svenpeter.dev> MIME-Version: 1.0 Precedence: list Reply-To: Sven Peter <sven@svenpeter.dev> Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org
Series	Apple M1 DART IOMMU driver \| expand [0/3] Apple M1 DART IOMMU driver [1/3] iommu: io-pgtable: add DART pagetable format [2/3] dt-bindings: iommu: add DART iommu bindings [3/3] iommu: dart: Add DART iommu driver

Sven Peter March 20, 2021, 3:19 p.m. UTC

Hi,

After Hector's initial work [1] to bring up Linux on Apple's M1 it's time to
bring up more devices. Most peripherals connected to the SoC are behind a iommu
which Apple calls "Device Address Resolution Table", or DART for short [2].
Unfortunately, it only shares the name with PowerPC's DART.
Configuring this iommu is mandatory if these peripherals require DMA access.

This patchset implements initial support for this iommu. The hardware itself
uses a pagetable format that's very similar to the one already implement in
io-pgtable.c. There are some minor modifications, namely some details of the
PTE format and that there are always three pagetable levels, which I've
implement as a new format variant.

I have mainly tested this with the USB controller in device mode which is
compatible with Linux's dwc3 driver. Some custom PHY initialization (which is
not yet ready or fully understood) is required though to bring up the ports,
see e.g. my patches to our m1n1 bootloader [3,4]. If you want to test the same
setup you will probably need that branch for now and add the nodes from
the DT binding specification example to your device tree.

Even though each DART instances could support up to 16 devices usually only
a single device is actually connected. Different devices generally just use
an entirely separate DART instance with a seperate MMIO range, IRQ, etc.

I have just noticed today though that at least the USB DWC3 controller in host
mode uses *two* darts at the same time. I'm not sure yet which parts seem to
require which DART instance.

This means that we might need to support devices attached to two iommus
simultaneously and just create the same iova mappings. Currently this only
seems to be required for USB according to Apple's Device Tree.

I see two options for this and would like to get feedback before
I implement either one:

1) Change #iommu-cells = <1>; to #iommu-cells = <2>; and use the first cell
to identify the DART and the second one to identify the master.
The DART DT node would then also take two register ranges that would
correspond to the two DARTs. Both instances use the same IRQ and the
same clocks according to Apple's device tree and my experiments.
This would keep a single device node and the DART driver would then
simply map iovas in both DARTs if required.

2) Keep #iommu-cells as-is but support
iommus = <&usb_dart1a 1>, <&usb_dart1b 0>;
instead.
This would then require two devices nodes for the two DART instances and
some housekeeping in the DART driver to support mapping iovas in both
DARTs.
I believe omap-iommu.c supports this setup but I will have to read
more code to understand the details there and figure out how to implement
this in a sane way.

I currently prefer the first option but I don't understand enough details of
the iommu system to actually make an informed decision.
I'm obviously also open to more options :-)

Best regards,

Sven

[1] https://lore.kernel.org/linux-arch/20210304213902.83903-1-marcan@marcan.st/
[2] https://developer.apple.com/library/archive/documentation/DeviceDrivers/Conceptual/IOKitFundamentals/DataMgmt/DataMgmt.html
[3] https://github.com/svenpeter42/m1n1/commit/1e2661abf5ea2c820297b3ff591235c408d19a34
[4] https://github.com/svenpeter42/m1n1/tree/usb-uartproxy-console-wip

Mark Kettenis March 21, 2021, 4 p.m. UTC | #1

> Date: Sat, 20 Mar 2021 15:19:33 +0000
> From: Sven Peter <sven@svenpeter.dev>
> 
> Hi,
> 
> After Hector's initial work [1] to bring up Linux on Apple's M1 it's time to
> bring up more devices. Most peripherals connected to the SoC are behind a iommu
> which Apple calls "Device Address Resolution Table", or DART for short [2].
> Unfortunately, it only shares the name with PowerPC's DART.
> Configuring this iommu is mandatory if these peripherals require DMA access.
> 
> This patchset implements initial support for this iommu. The hardware itself
> uses a pagetable format that's very similar to the one already implement in
> io-pgtable.c. There are some minor modifications, namely some details of the
> PTE format and that there are always three pagetable levels, which I've
> implement as a new format variant.
> 
> I have mainly tested this with the USB controller in device mode which is
> compatible with Linux's dwc3 driver. Some custom PHY initialization (which is
> not yet ready or fully understood) is required though to bring up the ports,
> see e.g. my patches to our m1n1 bootloader [3,4]. If you want to test the same
> setup you will probably need that branch for now and add the nodes from
> the DT binding specification example to your device tree.
> 
> Even though each DART instances could support up to 16 devices usually only
> a single device is actually connected. Different devices generally just use
> an entirely separate DART instance with a seperate MMIO range, IRQ, etc.
> 
> I have just noticed today though that at least the USB DWC3 controller in host
> mode uses *two* darts at the same time. I'm not sure yet which parts seem to
> require which DART instance.
> 
> This means that we might need to support devices attached to two iommus
> simultaneously and just create the same iova mappings. Currently this only
> seems to be required for USB according to Apple's Device Tree.
> 
> I see two options for this and would like to get feedback before
> I implement either one:
> 
>     1) Change #iommu-cells = <1>; to #iommu-cells = <2>; and use the first cell
>        to identify the DART and the second one to identify the master.
>        The DART DT node would then also take two register ranges that would
>        correspond to the two DARTs. Both instances use the same IRQ and the
>        same clocks according to Apple's device tree and my experiments.
>        This would keep a single device node and the DART driver would then
>        simply map iovas in both DARTs if required.
> 
>     2) Keep #iommu-cells as-is but support
>             iommus = <&usb_dart1a 1>, <&usb_dart1b 0>;
>        instead.
>        This would then require two devices nodes for the two DART instances and
>        some housekeeping in the DART driver to support mapping iovas in both
>        DARTs.
>        I believe omap-iommu.c supports this setup but I will have to read
>        more code to understand the details there and figure out how to implement
>        this in a sane way.
> 
> I currently prefer the first option but I don't understand enough details of
> the iommu system to actually make an informed decision.
> I'm obviously also open to more options :-)

Hi Sven,

I don't think the first option is going to work for PCIe.  PCIe
devices will have to use "iommu-map" properties to map PCI devices to
the right iommu, and the currently implementation seems to assume that
#iommu-cells = <1>.  The devictree binding[1] doesn't explicitly state
that it relies on #iommu-cells = <1>, but it isn't clear how the
rid-base to iommu-base mapping mechanism would work when that isn't
the case.

Now the PCIe DARTs are simpler and seem to have only one "instance"
per DART.  So if we keep #iommu-cells = <1> for those, you'd still be
fine using the first approach.

As I mentioned before, not all DARTs support the full 32-bit aperture.
In particular the PCIe DARTs support a smaller address-space.  It is
not clear whether this is a restriction of the PCIe host controller or
the DART, but the Apple Device Tree has "vm-base" and "vm-size"
properties that encode the base address and size of the aperture.
These single-cell properties which is probably why for the USB DARTs
only "vm-base" is given; since "vm-base" is 0, a 32-bit number
wouldn't be able to encode the full aperture size.  We could make them
64-bit numbers in the Linux device tree though and always be explicit
about the size.  Older Sun SPARC machines used a single "virtual-dma"
property to encode the aperture.  We could do someting similar.  You
would use this property to initialize domain->geometry.aperture_start
and domain->geometry.aperture_end in diff 3/3 of this series.

I think it would make sense to include this in this series, as this
would make adding support for PCIe very easy, and PCIe gives you
aupport for network (both wired and wireless) and the type-A USB ports
on the mini.

Cheers,

Mark

[1] Documentation/devicetree/bindings/pci/pci-iommu.txt

Sven Peter March 21, 2021, 5:28 p.m. UTC | #2

Hi Mark,

Sorry for the spam if you get this message twice. This is pretty embarrassing
but I've just switched mail providers after ProtonMail messed up yesterday and
it looks like my new one defaulted to sending HTML messages even though I only
typed plaintext. This shouldn't have happened in the first place but it
certainly shouldn't happen again now :-(

> On 21. Mar 2021, at 17:00, Mark Kettenis <mark.kettenis@xs4all.nl> wrote:
> 
> I don't think the first option is going to work for PCIe.  PCIe
> devices will have to use "iommu-map" properties to map PCI devices to
> the right iommu, and the currently implementation seems to assume that
> #iommu-cells = <1>.  The devictree binding[1] doesn't explicitly state
> that it relies on #iommu-cells = <1>, but it isn't clear how the
> rid-base to iommu-base mapping mechanism would work when that isn't
> the case.
> 
> Now the PCIe DARTs are simpler and seem to have only one "instance"
> per DART.  So if we keep #iommu-cells = <1> for those, you'd still be
> fine using the first approach.

Good point, I guess that only leaves option two for now then.
Having some DARTs use cells = <1> and others <2> sounds confusing to me.

> 
> As I mentioned before, not all DARTs support the full 32-bit aperture.
> In particular the PCIe DARTs support a smaller address-space.  It is
> not clear whether this is a restriction of the PCIe host controller or
> the DART, but the Apple Device Tree has "vm-base" and "vm-size"
> properties that encode the base address and size of the aperture.
> These single-cell properties which is probably why for the USB DARTs
> only "vm-base" is given; since "vm-base" is 0, a 32-bit number
> wouldn't be able to encode the full aperture size.  We could make them
> 64-bit numbers in the Linux device tree though and always be explicit
> about the size.  Older Sun SPARC machines used a single "virtual-dma"
> property to encode the aperture.  We could do someting similar.  You
> would use this property to initialize domain->geometry.aperture_start
> and domain->geometry.aperture_end in diff 3/3 of this series.
> 
> I think it would make sense to include this in this series, as this
> would make adding support for PCIe very easy, and PCIe gives you
> aupport for network (both wired and wireless) and the type-A USB ports
> on the mini.

Agreed, I'd ideally like to converge on a device tree binding
that won't have to change in the near future.

I've tried to use an address space larger than 32bit and that seems to
work for parts of the dwc3 controller but breaks for the xhci parts because
the upper lines don't seem to be connected there (e.g. if xhci tries to
use <0x1 0xffff0000> I get a fault for <0 0xffff0000>).

Looking at other iommu drivers I have found the following two similar
bindings:

qcom uses a ranges property with a 64bit address and 32 bit size [1]

  apps_iommu: iommu@1e20000 {
      ...
      ranges = <0 0x1e20000 0x40000>;
      ...
  };

and tegra seems to support a dma-window property with 32bit address
and size [2]

  smmu {
          [...]
          dma-window = <0 0x40000000>;    /* IOVA start & length */
          [...]
  };

I believe there already is of_get_dma_window to handle parsing this
in the common iommu code [3] but I can't find any place using it.
It's a little bit more complex that we need since it allows to specify the
number of cells for both the address and the size but it should allow us to
express all possible configurations:

  usb_dart {
      [ ... ]
      #dma-address-cells = <1>;
      #dma-size-cells = <2>;
      dma-window = <0 0x1 0x0>;
      [ ... ]
  };
  pcie_dart {
      [ ... ]
      #dma-address-cells = <1>;
      #dma-size-cells = <1>;
      dma-window = <0x100000 0x3fe00000>;
      [ ... ]
  };

where #dma-address-cells and #dma-size-cells default to
#address-cells and #size-cells respectively if I understand
the code correctly. That way we could also just always use
a 64bit address and size in the DT, e.g.

  pcie_dart {
      [ ... ]
      dma-window = <0 0x100000 0 0x3fe00000>;
      [ ... ]
  };

Best,

Sven

[1] Documentation/devicetree/bindings/iommu/qcom,iommu.txt
[2] Documentation/devicetree/bindings/iommu/nvidia,tegra30-smmu.txt
[3] drivers/iommu/of_iommu.c

Mark Kettenis March 21, 2021, 6:35 p.m. UTC | #3

> Date: Sun, 21 Mar 2021 18:22:15 +0100
> From: "Sven Peter" <sven@svenpeter.dev>
> 
> Hi Mark,
> 
>  On 21. Mar 2021, at 17:00, Mark Kettenis <mark.kettenis@xs4all.nl>
>  wrote:
> 
>  I don't think the first option is going to work for PCIe.  PCIe
>  devices will have to use "iommu-map" properties to map PCI devices to
>  the right iommu, and the currently implementation seems to assume that
>  #iommu-cells = <1>.  The devictree binding[1] doesn't explicitly state
>  that it relies on #iommu-cells = <1>, but it isn't clear how the
>  rid-base to iommu-base mapping mechanism would work when that isn't
>  the case.
> 
>  Now the PCIe DARTs are simpler and seem to have only one "instance"
>  per DART.  So if we keep #iommu-cells = <1> for those, you'd still be
>  fine using the first approach.
> 
> Good point, I guess that only leaves option two for now then.
> Having some DARTs use cells = <1> and others <2> sounds confusing to me.

Guess we do need to understand a little bit better how the USB DART
actually works.  My hypothesis (based on our discussion on #asahi) is
that the XHCI host controller and the peripheral controller of the
DWC3 block use different DMA "streams" that are handled by the
different sub-DARTs.

The Corellium folks use a DART + sub-DART model in their driver and a
single node in the device tree that represents both.  That might sense
since the error registers and interrupts are shared.  Maybe it would
make sense to select the appropriate sub-DART based on the DMA stream
ID?

>  As I mentioned before, not all DARTs support the full 32-bit aperture.
>  In particular the PCIe DARTs support a smaller address-space.  It is
>  not clear whether this is a restriction of the PCIe host controller or
>  the DART, but the Apple Device Tree has "vm-base" and "vm-size"
>  properties that encode the base address and size of the aperture.
>  These single-cell properties which is probably why for the USB DARTs
>  only "vm-base" is given; since "vm-base" is 0, a 32-bit number
>  wouldn't be able to encode the full aperture size.  We could make them
>  64-bit numbers in the Linux device tree though and always be explicit
>  about the size.  Older Sun SPARC machines used a single "virtual-dma"
>  property to encode the aperture.  We could do someting similar.  You
>  would use this property to initialize domain->geometry.aperture_start
>  and domain->geometry.aperture_end in diff 3/3 of this series.
> 
>  I think it would make sense to include this in this series, as this
>  would make adding support for PCIe very easy, and PCIe gives you
>  aupport for network (both wired and wireless) and the type-A USB ports
>  on the mini.
> 
> Agreed, I'd ideally like to converge on a device tree binding
> that won't have to change in the near future.
> 
> I've tried to use an address space larger than 32bit and that seems to
> work for parts of the dwc3 controller but breaks for the xhci parts because
> the upper lines don't seem to be connected there (e.g. if xhci tries to
> use <0x1 0xffff0000> I get a fault for <0 0xffff0000>).
> 
> Looking at other iommu drivers I have found the following two similar
> bindings:

Ah, missed those.

> qcom uses a ranges property with a 64bit address and 32 bit size [1]
> 
>   apps_iommu: iommu@1e20000 {
>       ...
>       ranges = <0 0x1e20000 0x40000>;
>       ...
>   };

Doesn't sound like a very good idea to me to repurpose "ranges" for
this...

> 
> and tegra seems to support a dma-window property with 32bit address
> and size [2]
> 
>   smmu {
>           [...]
>           dma-window = <0 0x40000000>;    /* IOVA start & length */
>           [...]
>   };
> 
> I believe there already is of_get_dma_window to handle parsing this
> in the common iommu code [3] but I can't find any place using it.
> It's a little bit more complex that we need since it allows to specify the
> number of cells for both the address and the size but it should allow us to
> express all possible configurations:
> 
>   usb_dart {
>       [ ... ]
>       #dma-address-cells = <1>;
>       #dma-size-cells = <2>;
>       dma-window = <0 0x1 0x0>;
>       [ ... ]
>   };
>   pcie_dart {
>       [ ... ]
>       #dma-address-cells = <1>;
>       #dma-size-cells = <1>;
>       dma-window = <0x100000 0x3fe00000>;
>       [ ... ]
>   };
> 
> where #dma-address-cells and #dma-size-cells default to
> #address-cells and #size-cells respectively if I understand
> the code correctly. That way we could also just always use
> a 64bit address and size in the DT, e.g.
> 
>   pcie_dart {
>       [ ... ]
>       dma-window = <0 0x100000 0 0x3fe00000>;
>       [ ... ]
>   };

That sounds like a serious contender to me!  Hopefully one of the
Linux kernel developers can give this some sort of blessing.

I think it would make sense for us to just rely on the #address-cells
and #size-cells defaults for the M1 device tree.

Sven Peter March 22, 2021, 10:17 p.m. UTC | #4

Hi Mark,

On Sun, Mar 21, 2021, at 19:35, Mark Kettenis wrote:
>
> Guess we do need to understand a little bit better how the USB DART
> actually works.  My hypothesis (based on our discussion on #asahi) is
> that the XHCI host controller and the peripheral controller of the
> DWC3 block use different DMA "streams" that are handled by the
> different sub-DARTs.

I've done some more experiments and the situation is unfortunately more
complicated: Most DMA transfers are translated with the first DART.
But sometimes (and I have not been able to figure out the exact conditions)
transfers instead have to go through the second DART. 
This happens e.g. with one of my USB keyboards after a stop EP command
is issued: Suddenly the xhci_ep_ctx struct must be translated through the
second DART.

What this likely means is that we'll need to point both DARTs
to the same pagetables and just issue the TLB maintenance operations
as a group.

> 
> The Corellium folks use a DART + sub-DART model in their driver and a
> single node in the device tree that represents both.  That might sense
> since the error registers and interrupts are shared.  Maybe it would
> make sense to select the appropriate sub-DART based on the DMA stream
> ID?

dwc3 specifically seems to require stream id #1 from the DART
at <0x5 0x02f00000> and stream id #0 from the DART at <0x5 0x02f80000>.
Both of these only share a IRQ line but are otherwise completely independent.
Each has their own error registers, etc. and we need some way to
specify these two DARTs + the appropriate stream ID.

Essentially we have three options to represent this now:

1) Add both DARTs as separate regs, use #iommu-cells = <2> and have the
   first cell select the DART and the second one the stream ID.
   We could allow #iommu-cells = <1> in case only one reg is specified
   for the PCIe DART:

   usb_dart1@502f00000 {
     compatible = "apple,t8103-dart";
     reg = <0x5 0x02f00000 0x0 0x4000>, <0x5 0x02f80000 0x0 0x4000>;
     #iommu-cells = <2>;
     ...
   };

   usb1 {
     iommus = <&usb_dart1 0 1>, <&usb_dart1 1 0>;
     ...
   };

   I prefer this option because we fully describe the DART in a single
   device node here. It also feels natural to group them like this because
   they need to share some properties (like dma-window and the interrupt)
   anyway. 

2) Create two DART nodes which share the same IRQ line and attach them
   both to the master node:

   usb_dart1a@502f00000 {
     compatible = "apple,t8103-dart";
     reg = <0x5 0x02f00000 0x0 0x4000>;
     #iommu-cells = <1>;
     ...
   };
   usb_dart1b@502f80000 {
     compatible = "apple,t8103-dart";
     reg = <0x5 0x02f80000 0x0 0x4000>;
     #iommu-cells = <1>;
     ...
   };

   usb1 {
     iommus = <&usb_dart1a 1>, <&usb_dart1b 0>;
     ...
   };

   I dislike this one because attaching those two DARTs to a single device
   seems rather unusual. We'd also have to duplicate the dma-window setting,
   make sure it's the same for both DARTs and there are probably even more
   complications I can't think of right now. It seems like this would also
   make the device tree very verbose and the implementation itself more
   complicated.

3) Introduce another property and let the DART driver take care of
   mirroring the pagetables. I believe this would be similar to
   the sid-remap property:

   usb_dart1@502f00000 {
     compatible = "apple,t8103-dart";
     reg = <0x5 0x02f00000 0x0 0x4000>, <0x5 0x02f80000 0x0 0x4000>;
     #iommu-cells = <1>;
     sid-remap = <0 1>;
   };
   usb1 {
     iommus = <&usb_dart1 0>;
   };

   I slightly dislike this one because we now specify which stream id 
   to use in two places: Once in the device node and another time in the
   new property in the DART node. I also don't think the binding is much
   simpler than the first one.

> > where #dma-address-cells and #dma-size-cells default to
> > #address-cells and #size-cells respectively if I understand
> > the code correctly. That way we could also just always use
> > a 64bit address and size in the DT, e.g.
> > 
> >   pcie_dart {
> >       [ ... ]
> >       dma-window = <0 0x100000 0 0x3fe00000>;
> >       [ ... ]
> >   };
> 
> That sounds like a serious contender to me!  Hopefully one of the
> Linux kernel developers can give this some sort of blessing.
> 
> I think it would make sense for us to just rely on the #address-cells
> and #size-cells defaults for the M1 device tree.
>

Agreed.

Best,

Sven

Mark Kettenis March 23, 2021, 8 p.m. UTC | #5

> Date: Mon, 22 Mar 2021 23:17:31 +0100
> From: "Sven Peter" <sven@svenpeter.dev>
> 
> Hi Mark,
> 
> On Sun, Mar 21, 2021, at 19:35, Mark Kettenis wrote:
> >
> > Guess we do need to understand a little bit better how the USB DART
> > actually works.  My hypothesis (based on our discussion on #asahi) is
> > that the XHCI host controller and the peripheral controller of the
> > DWC3 block use different DMA "streams" that are handled by the
> > different sub-DARTs.
> 
> I've done some more experiments and the situation is unfortunately more
> complicated: Most DMA transfers are translated with the first DART.
> But sometimes (and I have not been able to figure out the exact conditions)
> transfers instead have to go through the second DART. 
> This happens e.g. with one of my USB keyboards after a stop EP command
> is issued: Suddenly the xhci_ep_ctx struct must be translated through the
> second DART.
> 
> What this likely means is that we'll need to point both DARTs
> to the same pagetables and just issue the TLB maintenance operations
> as a group.
> 
> > 
> > The Corellium folks use a DART + sub-DART model in their driver and a
> > single node in the device tree that represents both.  That might sense
> > since the error registers and interrupts are shared.  Maybe it would
> > make sense to select the appropriate sub-DART based on the DMA stream
> > ID?
> 
> dwc3 specifically seems to require stream id #1 from the DART
> at <0x5 0x02f00000> and stream id #0 from the DART at <0x5 0x02f80000>.
> Both of these only share a IRQ line but are otherwise completely independent.
> Each has their own error registers, etc. and we need some way to
> specify these two DARTs + the appropriate stream ID.
> 
> Essentially we have three options to represent this now:
> 
> 1) Add both DARTs as separate regs, use #iommu-cells = <2> and have the
>    first cell select the DART and the second one the stream ID.
>    We could allow #iommu-cells = <1> in case only one reg is specified
>    for the PCIe DART:
> 
>    usb_dart1@502f00000 {
>      compatible = "apple,t8103-dart";
>      reg = <0x5 0x02f00000 0x0 0x4000>, <0x5 0x02f80000 0x0 0x4000>;
>      #iommu-cells = <2>;
>      ...
>    };
> 
>    usb1 {
>      iommus = <&usb_dart1 0 1>, <&usb_dart1 1 0>;
>      ...
>    };
> 
>    I prefer this option because we fully describe the DART in a single
>    device node here. It also feels natural to group them like this because
>    they need to share some properties (like dma-window and the interrupt)
>    anyway. 
> 
> 2) Create two DART nodes which share the same IRQ line and attach them
>    both to the master node:
> 
>    usb_dart1a@502f00000 {
>      compatible = "apple,t8103-dart";
>      reg = <0x5 0x02f00000 0x0 0x4000>;
>      #iommu-cells = <1>;
>      ...
>    };
>    usb_dart1b@502f80000 {
>      compatible = "apple,t8103-dart";
>      reg = <0x5 0x02f80000 0x0 0x4000>;
>      #iommu-cells = <1>;
>      ...
>    };
> 
>    usb1 {
>      iommus = <&usb_dart1a 1>, <&usb_dart1b 0>;
>      ...
>    };
> 
>    I dislike this one because attaching those two DARTs to a single device
>    seems rather unusual. We'd also have to duplicate the dma-window setting,
>    make sure it's the same for both DARTs and there are probably even more
>    complications I can't think of right now. It seems like this would also
>    make the device tree very verbose and the implementation itself more
>    complicated.
> 
> 3) Introduce another property and let the DART driver take care of
>    mirroring the pagetables. I believe this would be similar to
>    the sid-remap property:
> 
>    usb_dart1@502f00000 {
>      compatible = "apple,t8103-dart";
>      reg = <0x5 0x02f00000 0x0 0x4000>, <0x5 0x02f80000 0x0 0x4000>;
>      #iommu-cells = <1>;
>      sid-remap = <0 1>;
>    };
>    usb1 {
>      iommus = <&usb_dart1 0>;
>    };
> 
>    I slightly dislike this one because we now specify which stream id 
>    to use in two places: Once in the device node and another time in the
>    new property in the DART node. I also don't think the binding is much
>    simpler than the first one.

Hi Sven,

The problem with both #1 and #2 is that you end up with two references
to (effectively) different iommu's in the dwc3 device node.  I don't
see how that is compatible with the idea of using a single translation
table for both sub-DARTs.

If you no longer think that is desirable, you'll still have the
problem that you'll need to modify the dwc3 driver code such that it
uses the right IOMMU to do its DMA address translation.  Given what
you write above that sounds really ugly and confusing.  I would
certainly want to avoid doing that in OpenBSD.

So I think #3 is the only realistic option here.  In my opinion it is
perfectly fine for the devicetree to present a workable model of the
hardware instead of a 100% accurate model.

> > > where #dma-address-cells and #dma-size-cells default to
> > > #address-cells and #size-cells respectively if I understand
> > > the code correctly. That way we could also just always use
> > > a 64bit address and size in the DT, e.g.
> > > 
> > >   pcie_dart {
> > >       [ ... ]
> > >       dma-window = <0 0x100000 0 0x3fe00000>;
> > >       [ ... ]
> > >   };
> > 
> > That sounds like a serious contender to me!  Hopefully one of the
> > Linux kernel developers can give this some sort of blessing.
> > 
> > I think it would make sense for us to just rely on the #address-cells
> > and #size-cells defaults for the M1 device tree.
> >
> 
> Agreed.
> 
> 
> Best,
> 
> Sven
>

Rob Herring (Arm) March 23, 2021, 8:53 p.m. UTC | #6

On Sun, Mar 21, 2021 at 05:00:50PM +0100, Mark Kettenis wrote:
> > Date: Sat, 20 Mar 2021 15:19:33 +0000
> > From: Sven Peter <sven@svenpeter.dev>
> > 
> > Hi,
> > 
> > After Hector's initial work [1] to bring up Linux on Apple's M1 it's time to
> > bring up more devices. Most peripherals connected to the SoC are behind a iommu
> > which Apple calls "Device Address Resolution Table", or DART for short [2].
> > Unfortunately, it only shares the name with PowerPC's DART.
> > Configuring this iommu is mandatory if these peripherals require DMA access.
> > 
> > This patchset implements initial support for this iommu. The hardware itself
> > uses a pagetable format that's very similar to the one already implement in
> > io-pgtable.c. There are some minor modifications, namely some details of the
> > PTE format and that there are always three pagetable levels, which I've
> > implement as a new format variant.
> > 
> > I have mainly tested this with the USB controller in device mode which is
> > compatible with Linux's dwc3 driver. Some custom PHY initialization (which is
> > not yet ready or fully understood) is required though to bring up the ports,
> > see e.g. my patches to our m1n1 bootloader [3,4]. If you want to test the same
> > setup you will probably need that branch for now and add the nodes from
> > the DT binding specification example to your device tree.
> > 
> > Even though each DART instances could support up to 16 devices usually only
> > a single device is actually connected. Different devices generally just use
> > an entirely separate DART instance with a seperate MMIO range, IRQ, etc.
> > 
> > I have just noticed today though that at least the USB DWC3 controller in host
> > mode uses *two* darts at the same time. I'm not sure yet which parts seem to
> > require which DART instance.
> > 
> > This means that we might need to support devices attached to two iommus
> > simultaneously and just create the same iova mappings. Currently this only
> > seems to be required for USB according to Apple's Device Tree.
> > 
> > I see two options for this and would like to get feedback before
> > I implement either one:
> > 
> >     1) Change #iommu-cells = <1>; to #iommu-cells = <2>; and use the first cell
> >        to identify the DART and the second one to identify the master.
> >        The DART DT node would then also take two register ranges that would
> >        correspond to the two DARTs. Both instances use the same IRQ and the
> >        same clocks according to Apple's device tree and my experiments.
> >        This would keep a single device node and the DART driver would then
> >        simply map iovas in both DARTs if required.
> > 
> >     2) Keep #iommu-cells as-is but support
> >             iommus = <&usb_dart1a 1>, <&usb_dart1b 0>;
> >        instead.
> >        This would then require two devices nodes for the two DART instances and
> >        some housekeeping in the DART driver to support mapping iovas in both
> >        DARTs.
> >        I believe omap-iommu.c supports this setup but I will have to read
> >        more code to understand the details there and figure out how to implement
> >        this in a sane way.
> > 
> > I currently prefer the first option but I don't understand enough details of
> > the iommu system to actually make an informed decision.

Please don't mix what does the h/w look like and what's easy to 
implement in Linux's IOMMU subsytem. It's pretty clear (at least 
from the description here) that option 2 reflects the h/w. 

> > I'm obviously also open to more options :-)
> 
> Hi Sven,
> 
> I don't think the first option is going to work for PCIe.  PCIe
> devices will have to use "iommu-map" properties to map PCI devices to
> the right iommu, and the currently implementation seems to assume that
> #iommu-cells = <1>.  The devictree binding[1] doesn't explicitly state
> that it relies on #iommu-cells = <1>, but it isn't clear how the
> rid-base to iommu-base mapping mechanism would work when that isn't
> the case.
> 
> Now the PCIe DARTs are simpler and seem to have only one "instance"
> per DART.  So if we keep #iommu-cells = <1> for those, you'd still be
> fine using the first approach.
> 
> As I mentioned before, not all DARTs support the full 32-bit aperture.
> In particular the PCIe DARTs support a smaller address-space.  It is
> not clear whether this is a restriction of the PCIe host controller or
> the DART, but the Apple Device Tree has "vm-base" and "vm-size"
> properties that encode the base address and size of the aperture.
> These single-cell properties which is probably why for the USB DARTs
> only "vm-base" is given; since "vm-base" is 0, a 32-bit number
> wouldn't be able to encode the full aperture size.  We could make them
> 64-bit numbers in the Linux device tree though and always be explicit
> about the size.  Older Sun SPARC machines used a single "virtual-dma"
> property to encode the aperture.  We could do someting similar.  You
> would use this property to initialize domain->geometry.aperture_start
> and domain->geometry.aperture_end in diff 3/3 of this series.

'dma-ranges' is what should be used here.

Rob

Sven Peter March 23, 2021, 9:03 p.m. UTC | #7

Hi Mark,

On Tue, Mar 23, 2021, at 21:00, Mark Kettenis wrote:
> The problem with both #1 and #2 is that you end up with two references
> to (effectively) different iommu's in the dwc3 device node.  I don't
> see how that is compatible with the idea of using a single translation
> table for both sub-DARTs.

I don't have a strong opinion about this fwiw.
Option #1 and #2 seem to have precedence in the already existing
iommu bindings though [1,2,3].
I just want to make sure we've thought through all options and understand
their advantages/disadvantages before we take a decision.

I've been reading some more Linux iommu code and here's how I understand
it now / how I could implement at least #1 with a single shared pagetable.
This might also be possible for option #2, but I'll need to think that through
in more detail.

An iommu domain is a collection of devices that share a virtual address space.
During domain allocation I can just allocate a single DART pagetable and
not have anything point to it for now.

A stream identifies the smallest unit the iommu hardware can differentiate.
For the DART we have 16 of these with a single TCR + the four TTBR register
for each stream.

A device is assigned to individual streams using the "iommus" property. When
a device is attached to a domain we now simply setup the TTBR registers to
point to the iommu domain pagetable. It doesn't matter here if it's a single
stream or multiple ones or even multiple devices sharing a single stream as
long as they're attached to the same domain.

All operations (map, unmap, etc.) now simply first modify the domain
pagetable and then issue the TLB maintenance operations for attached streams.

> 
> If you no longer think that is desirable, you'll still have the
> problem that you'll need to modify the dwc3 driver code such that it
> uses the right IOMMU to do its DMA address translation.  Given what
> you write above that sounds really ugly and confusing.  I would
> certainly want to avoid doing that in OpenBSD.

Yeah, I absolutely don't want to hack on the dwc3 code at all.
That will end up being *very* ugly.

Best,

Sven

[1] Documentation/devicetree/bindings/iommu/iommu.txt "Multiple-master IOMMU"
[2] Documentation/devicetree/bindings/qcom,iommu.txt last example
[3] Documentation/devicetree/bindings/arm,smmu.yaml first example

Mark Kettenis March 23, 2021, 10:33 p.m. UTC | #8

> Date: Tue, 23 Mar 2021 14:53:46 -0600
> From: Rob Herring <robh@kernel.org>
> 
> On Sun, Mar 21, 2021 at 05:00:50PM +0100, Mark Kettenis wrote:
> > > Date: Sat, 20 Mar 2021 15:19:33 +0000
> > > From: Sven Peter <sven@svenpeter.dev>
> > > 
> > > Hi,
> > > 
> > > After Hector's initial work [1] to bring up Linux on Apple's M1 it's time to
> > > bring up more devices. Most peripherals connected to the SoC are behind a iommu
> > > which Apple calls "Device Address Resolution Table", or DART for short [2].
> > > Unfortunately, it only shares the name with PowerPC's DART.
> > > Configuring this iommu is mandatory if these peripherals require DMA access.
> > > 
> > > This patchset implements initial support for this iommu. The hardware itself
> > > uses a pagetable format that's very similar to the one already implement in
> > > io-pgtable.c. There are some minor modifications, namely some details of the
> > > PTE format and that there are always three pagetable levels, which I've
> > > implement as a new format variant.
> > > 
> > > I have mainly tested this with the USB controller in device mode which is
> > > compatible with Linux's dwc3 driver. Some custom PHY initialization (which is
> > > not yet ready or fully understood) is required though to bring up the ports,
> > > see e.g. my patches to our m1n1 bootloader [3,4]. If you want to test the same
> > > setup you will probably need that branch for now and add the nodes from
> > > the DT binding specification example to your device tree.
> > > 
> > > Even though each DART instances could support up to 16 devices usually only
> > > a single device is actually connected. Different devices generally just use
> > > an entirely separate DART instance with a seperate MMIO range, IRQ, etc.
> > > 
> > > I have just noticed today though that at least the USB DWC3 controller in host
> > > mode uses *two* darts at the same time. I'm not sure yet which parts seem to
> > > require which DART instance.
> > > 
> > > This means that we might need to support devices attached to two iommus
> > > simultaneously and just create the same iova mappings. Currently this only
> > > seems to be required for USB according to Apple's Device Tree.
> > > 
> > > I see two options for this and would like to get feedback before
> > > I implement either one:
> > > 
> > >     1) Change #iommu-cells = <1>; to #iommu-cells = <2>; and use the first cell
> > >        to identify the DART and the second one to identify the master.
> > >        The DART DT node would then also take two register ranges that would
> > >        correspond to the two DARTs. Both instances use the same IRQ and the
> > >        same clocks according to Apple's device tree and my experiments.
> > >        This would keep a single device node and the DART driver would then
> > >        simply map iovas in both DARTs if required.
> > > 
> > >     2) Keep #iommu-cells as-is but support
> > >             iommus = <&usb_dart1a 1>, <&usb_dart1b 0>;
> > >        instead.
> > >        This would then require two devices nodes for the two DART instances and
> > >        some housekeeping in the DART driver to support mapping iovas in both
> > >        DARTs.
> > >        I believe omap-iommu.c supports this setup but I will have to read
> > >        more code to understand the details there and figure out how to implement
> > >        this in a sane way.
> > > 
> > > I currently prefer the first option but I don't understand enough details of
> > > the iommu system to actually make an informed decision.
> 
> Please don't mix what does the h/w look like and what's easy to 
> implement in Linux's IOMMU subsytem. It's pretty clear (at least 
> from the description here) that option 2 reflects the h/w.

Apple does represent these as a single node in their device tree.  The
two instances share an interrupt and share power/clock gating.  So
they seem to be part of a single hardware block.

> > > I'm obviously also open to more options :-)
> > 
> > Hi Sven,
> > 
> > I don't think the first option is going to work for PCIe.  PCIe
> > devices will have to use "iommu-map" properties to map PCI devices to
> > the right iommu, and the currently implementation seems to assume that
> > #iommu-cells = <1>.  The devictree binding[1] doesn't explicitly state
> > that it relies on #iommu-cells = <1>, but it isn't clear how the
> > rid-base to iommu-base mapping mechanism would work when that isn't
> > the case.
> > 
> > Now the PCIe DARTs are simpler and seem to have only one "instance"
> > per DART.  So if we keep #iommu-cells = <1> for those, you'd still be
> > fine using the first approach.
> > 
> > As I mentioned before, not all DARTs support the full 32-bit aperture.
> > In particular the PCIe DARTs support a smaller address-space.  It is
> > not clear whether this is a restriction of the PCIe host controller or
> > the DART, but the Apple Device Tree has "vm-base" and "vm-size"
> > properties that encode the base address and size of the aperture.
> > These single-cell properties which is probably why for the USB DARTs
> > only "vm-base" is given; since "vm-base" is 0, a 32-bit number
> > wouldn't be able to encode the full aperture size.  We could make them
> > 64-bit numbers in the Linux device tree though and always be explicit
> > about the size.  Older Sun SPARC machines used a single "virtual-dma"
> > property to encode the aperture.  We could do someting similar.  You
> > would use this property to initialize domain->geometry.aperture_start
> > and domain->geometry.aperture_end in diff 3/3 of this series.
> 
> 'dma-ranges' is what should be used here.
> 
> Rob
>

Robin Murphy March 24, 2021, 3:29 p.m. UTC | #9

On 2021-03-20 15:19, Sven Peter wrote:
> Hi,
> 
> After Hector's initial work [1] to bring up Linux on Apple's M1 it's time to
> bring up more devices. Most peripherals connected to the SoC are behind a iommu
> which Apple calls "Device Address Resolution Table", or DART for short [2].
> Unfortunately, it only shares the name with PowerPC's DART.
> Configuring this iommu is mandatory if these peripherals require DMA access.
> 
> This patchset implements initial support for this iommu. The hardware itself
> uses a pagetable format that's very similar to the one already implement in
> io-pgtable.c. There are some minor modifications, namely some details of the
> PTE format and that there are always three pagetable levels, which I've
> implement as a new format variant.
> 
> I have mainly tested this with the USB controller in device mode which is
> compatible with Linux's dwc3 driver. Some custom PHY initialization (which is
> not yet ready or fully understood) is required though to bring up the ports,
> see e.g. my patches to our m1n1 bootloader [3,4]. If you want to test the same
> setup you will probably need that branch for now and add the nodes from
> the DT binding specification example to your device tree.
> 
> Even though each DART instances could support up to 16 devices usually only
> a single device is actually connected. Different devices generally just use
> an entirely separate DART instance with a seperate MMIO range, IRQ, etc.
> 
> I have just noticed today though that at least the USB DWC3 controller in host
> mode uses *two* darts at the same time. I'm not sure yet which parts seem to
> require which DART instance.
> 
> This means that we might need to support devices attached to two iommus
> simultaneously and just create the same iova mappings. Currently this only
> seems to be required for USB according to Apple's Device Tree.
> 
> I see two options for this and would like to get feedback before
> I implement either one:
> 
>      1) Change #iommu-cells = <1>; to #iommu-cells = <2>; and use the first cell
>         to identify the DART and the second one to identify the master.
>         The DART DT node would then also take two register ranges that would
>         correspond to the two DARTs. Both instances use the same IRQ and the
>         same clocks according to Apple's device tree and my experiments.
>         This would keep a single device node and the DART driver would then
>         simply map iovas in both DARTs if required.

This is broadly similar to the approach used by rockchip-iommu and the 
special arm-smmu-nvidia implementation, where there are multiple 
instances which require programming identically, that are abstracted 
behind a single "device". Your case is a little different since you're 
not programming both *entirely* identically, although maybe that's a 
possibility if each respective ID isn't used by anything else on the 
"other" DART?

Overall I tend to view this approach as a bit of a hack because it's not 
really describing the hardware truthfully - just because two distinct 
functional blocks have their IRQ lines wired together doesn't suddenly 
make them a single monolithic block with multiple interfaces - and tends 
to be done for the sake of making the driver implementation easier in 
terms of the Linux IOMMU API (which, note, hasn't evolved all that far 
from its PCI-centric origins and isn't exactly great for arbitrary SoC 
topologies).

>      2) Keep #iommu-cells as-is but support
>              iommus = <&usb_dart1a 1>, <&usb_dart1b 0>;
>         instead.
>         This would then require two devices nodes for the two DART instances and
>         some housekeeping in the DART driver to support mapping iovas in both
>         DARTs.
>         I believe omap-iommu.c supports this setup but I will have to read
>         more code to understand the details there and figure out how to implement
>         this in a sane way.

This approach is arguably the most honest, and more robust in terms of 
making fewer assumptions, and is used by at least exynos-iommu and 
omap-iommu. In Linux it currently takes a little bit more housekeeping 
to keep track of linked instances within the driver since the IOMMU API 
holds the notion that any given client device is associated with "an 
IOMMU", but that's always free to change at any time, unlike the design 
of a DT binding.

There's also the funky "root" and "leaf" devices thing that ipmmu-vmsa 
does, although I believe that represents genuine hardware differences 
where the leaves are more like extra TLBs rather than fully-functional 
IOMMU blocks in their own right, so that may not be a relevant model here.

Robin.

> I currently prefer the first option but I don't understand enough details of
> the iommu system to actually make an informed decision.
> I'm obviously also open to more options :-)
> 
> 
> Best regards,
> 
> 
> Sven
> 
> [1] https://lore.kernel.org/linux-arch/20210304213902.83903-1-marcan@marcan.st/
> [2] https://developer.apple.com/library/archive/documentation/DeviceDrivers/Conceptual/IOKitFundamentals/DataMgmt/DataMgmt.html
> [3] https://github.com/svenpeter42/m1n1/commit/1e2661abf5ea2c820297b3ff591235c408d19a34
> [4] https://github.com/svenpeter42/m1n1/tree/usb-uartproxy-console-wip
> 
> 
>

Sven Peter March 25, 2021, 7:53 a.m. UTC | #10

On Tue, Mar 23, 2021, at 21:53, Rob Herring wrote:
> On Sun, Mar 21, 2021 at 05:00:50PM +0100, Mark Kettenis wrote:
> > > Date: Sat, 20 Mar 2021 15:19:33 +0000
> > > From: Sven Peter <sven@svenpeter.dev>
> > > I have just noticed today though that at least the USB DWC3 controller in host
> > > mode uses *two* darts at the same time. I'm not sure yet which parts seem to
> > > require which DART instance.
> > > 
> > > This means that we might need to support devices attached to two iommus
> > > simultaneously and just create the same iova mappings. Currently this only
> > > seems to be required for USB according to Apple's Device Tree.
> > > 
> > > I see two options for this and would like to get feedback before
> > > I implement either one:
> > > 
> > >     1) Change #iommu-cells = <1>; to #iommu-cells = <2>; and use the first cell
> > >        to identify the DART and the second one to identify the master.
> > >        The DART DT node would then also take two register ranges that would
> > >        correspond to the two DARTs. Both instances use the same IRQ and the
> > >        same clocks according to Apple's device tree and my experiments.
> > >        This would keep a single device node and the DART driver would then
> > >        simply map iovas in both DARTs if required.
> > > 
> > >     2) Keep #iommu-cells as-is but support
> > >             iommus = <&usb_dart1a 1>, <&usb_dart1b 0>;
> > >        instead.
> > >        This would then require two devices nodes for the two DART instances and
> > >        some housekeeping in the DART driver to support mapping iovas in both
> > >        DARTs.
> > >        I believe omap-iommu.c supports this setup but I will have to read
> > >        more code to understand the details there and figure out how to implement
> > >        this in a sane way.
> > > 
> > > I currently prefer the first option but I don't understand enough details of
> > > the iommu system to actually make an informed decision.
> 
> Please don't mix what does the h/w look like and what's easy to 
> implement in Linux's IOMMU subsytem. It's pretty clear (at least 
> from the description here) that option 2 reflects the h/w. 
> 

Good point, I'll keep that in mind and give option 2 a try.

> > 
> > As I mentioned before, not all DARTs support the full 32-bit aperture.
> > In particular the PCIe DARTs support a smaller address-space.  It is
> > not clear whether this is a restriction of the PCIe host controller or
> > the DART, but the Apple Device Tree has "vm-base" and "vm-size"
> > properties that encode the base address and size of the aperture.
> > These single-cell properties which is probably why for the USB DARTs
> > only "vm-base" is given; since "vm-base" is 0, a 32-bit number
> > wouldn't be able to encode the full aperture size.  We could make them
> > 64-bit numbers in the Linux device tree though and always be explicit
> > about the size.  Older Sun SPARC machines used a single "virtual-dma"
> > property to encode the aperture.  We could do someting similar.  You
> > would use this property to initialize domain->geometry.aperture_start
> > and domain->geometry.aperture_end in diff 3/3 of this series.
> 
> 'dma-ranges' is what should be used here.
> 

The iommu binding documentation [1] mentions that

    The device tree node of the IOMMU device's parent bus must contain a valid
    "dma-ranges" property that describes how the physical address space of the
    IOMMU maps to memory. An empty "dma-ranges" property means that there is a 
    1:1 mapping from IOMMU to memory.

which, if I understand this correctly, means that the 'dma-ranges' for the
parent bus of the iommu should be empty since the DART hardware can see the
full physical address space with a 1:1 mapping.


The documentation also mentions that

     When an "iommus" property is specified in a device tree node, the IOMMU
     will be used for address translation. If a "dma-ranges" property exists
     in the device's parent node it will be ignored.

which means that specifying a 'dma-ranges' in the parent bus of any devices
that use the iommu will just be ignored.

As a concrete example, the PCIe DART IOMMU only allows translations from iovas
within 0x00100000...0x3ff00000 to the entire physical address space (though
realistically it will only map to 16GB RAM starting at 0x800000000 on the M1).

I'm probably just confused or maybe the documentation is outdated but I don't
see how I could specify "this device can only use DMA addresses from
0x00100000...0x3ff00000 but can map these via the iommu to any physical
address" using 'dma-ranges'.

Could you maybe point me to the right direction or give me a small example?
That would help a lot!



Thanks,

Sven


[1] Documentation/devicetree/bindings/iommu/iommu.txt

Sven Peter March 25, 2021, 7:58 a.m. UTC | #11

Hi Robin,


On Wed, Mar 24, 2021, at 16:29, Robin Murphy wrote:
> On 2021-03-20 15:19, Sven Peter wrote:
> > 
> > I have just noticed today though that at least the USB DWC3 controller in host
> > mode uses *two* darts at the same time. I'm not sure yet which parts seem to
> > require which DART instance.
> > 
> > This means that we might need to support devices attached to two iommus
> > simultaneously and just create the same iova mappings. Currently this only
> > seems to be required for USB according to Apple's Device Tree.
> > 
> > I see two options for this and would like to get feedback before
> > I implement either one:
> > 
> >      1) Change #iommu-cells = <1>; to #iommu-cells = <2>; and use the first cell
> >         to identify the DART and the second one to identify the master.
> >         The DART DT node would then also take two register ranges that would
> >         correspond to the two DARTs. Both instances use the same IRQ and the
> >         same clocks according to Apple's device tree and my experiments.
> >         This would keep a single device node and the DART driver would then
> >         simply map iovas in both DARTs if required.
> 
> This is broadly similar to the approach used by rockchip-iommu and the 
> special arm-smmu-nvidia implementation, where there are multiple 
> instances which require programming identically, that are abstracted 
> behind a single "device". Your case is a little different since you're 
> not programming both *entirely* identically, although maybe that's a 
> possibility if each respective ID isn't used by anything else on the 
> "other" DART?

That would be possible. The only difference is that I need to
program ID 0 of the first DART and ID 1 of the second one. Both
of these IDs are only connected to the same USB controller.


> 
> Overall I tend to view this approach as a bit of a hack because it's not 
> really describing the hardware truthfully - just because two distinct 
> functional blocks have their IRQ lines wired together doesn't suddenly 
> make them a single monolithic block with multiple interfaces - and tends 
> to be done for the sake of making the driver implementation easier in 
> terms of the Linux IOMMU API (which, note, hasn't evolved all that far 
> from its PCI-centric origins and isn't exactly great for arbitrary SoC 
> topologies).

Yes, the easier driver implementation was my reason to favour this option.

> 
> >      2) Keep #iommu-cells as-is but support
> >              iommus = <&usb_dart1a 1>, <&usb_dart1b 0>;
> >         instead.
> >         This would then require two devices nodes for the two DART instances and
> >         some housekeeping in the DART driver to support mapping iovas in both
> >         DARTs.
> >         I believe omap-iommu.c supports this setup but I will have to read
> >         more code to understand the details there and figure out how to implement
> >         this in a sane way.
> 
> This approach is arguably the most honest, and more robust in terms of 
> making fewer assumptions, and is used by at least exynos-iommu and 
> omap-iommu. In Linux it currently takes a little bit more housekeeping 
> to keep track of linked instances within the driver since the IOMMU API 
> holds the notion that any given client device is associated with "an 
> IOMMU", but that's always free to change at any time, unlike the design 
> of a DT binding.

Sounds good. I'll read those drivers and give it a try for v2.


Thanks,


Sven

Robin Murphy March 25, 2021, 11:50 a.m. UTC | #12

On 2021-03-25 07:53, Sven Peter wrote:
> 
> 
> On Tue, Mar 23, 2021, at 21:53, Rob Herring wrote:
>> On Sun, Mar 21, 2021 at 05:00:50PM +0100, Mark Kettenis wrote:
>>>> Date: Sat, 20 Mar 2021 15:19:33 +0000
>>>> From: Sven Peter <sven@svenpeter.dev>
>>>> I have just noticed today though that at least the USB DWC3 controller in host
>>>> mode uses *two* darts at the same time. I'm not sure yet which parts seem to
>>>> require which DART instance.
>>>>
>>>> This means that we might need to support devices attached to two iommus
>>>> simultaneously and just create the same iova mappings. Currently this only
>>>> seems to be required for USB according to Apple's Device Tree.
>>>>
>>>> I see two options for this and would like to get feedback before
>>>> I implement either one:
>>>>
>>>>      1) Change #iommu-cells = <1>; to #iommu-cells = <2>; and use the first cell
>>>>         to identify the DART and the second one to identify the master.
>>>>         The DART DT node would then also take two register ranges that would
>>>>         correspond to the two DARTs. Both instances use the same IRQ and the
>>>>         same clocks according to Apple's device tree and my experiments.
>>>>         This would keep a single device node and the DART driver would then
>>>>         simply map iovas in both DARTs if required.
>>>>
>>>>      2) Keep #iommu-cells as-is but support
>>>>              iommus = <&usb_dart1a 1>, <&usb_dart1b 0>;
>>>>         instead.
>>>>         This would then require two devices nodes for the two DART instances and
>>>>         some housekeeping in the DART driver to support mapping iovas in both
>>>>         DARTs.
>>>>         I believe omap-iommu.c supports this setup but I will have to read
>>>>         more code to understand the details there and figure out how to implement
>>>>         this in a sane way.
>>>>
>>>> I currently prefer the first option but I don't understand enough details of
>>>> the iommu system to actually make an informed decision.
>>
>> Please don't mix what does the h/w look like and what's easy to
>> implement in Linux's IOMMU subsytem. It's pretty clear (at least
>> from the description here) that option 2 reflects the h/w.
>>
> 
> Good point, I'll keep that in mind and give option 2 a try.
> 
>>>
>>> As I mentioned before, not all DARTs support the full 32-bit aperture.
>>> In particular the PCIe DARTs support a smaller address-space.  It is
>>> not clear whether this is a restriction of the PCIe host controller or
>>> the DART, but the Apple Device Tree has "vm-base" and "vm-size"
>>> properties that encode the base address and size of the aperture.
>>> These single-cell properties which is probably why for the USB DARTs
>>> only "vm-base" is given; since "vm-base" is 0, a 32-bit number
>>> wouldn't be able to encode the full aperture size.  We could make them
>>> 64-bit numbers in the Linux device tree though and always be explicit
>>> about the size.  Older Sun SPARC machines used a single "virtual-dma"
>>> property to encode the aperture.  We could do someting similar.  You
>>> would use this property to initialize domain->geometry.aperture_start
>>> and domain->geometry.aperture_end in diff 3/3 of this series.
>>
>> 'dma-ranges' is what should be used here.
>>
> 
> The iommu binding documentation [1] mentions that
> 
>      The device tree node of the IOMMU device's parent bus must contain a valid
>      "dma-ranges" property that describes how the physical address space of the
>      IOMMU maps to memory. An empty "dma-ranges" property means that there is a
>      1:1 mapping from IOMMU to memory.
> 
> which, if I understand this correctly, means that the 'dma-ranges' for the
> parent bus of the iommu should be empty since the DART hardware can see the
> full physical address space with a 1:1 mapping.
> 
> 
> The documentation also mentions that
> 
>       When an "iommus" property is specified in a device tree node, the IOMMU
>       will be used for address translation. If a "dma-ranges" property exists
>       in the device's parent node it will be ignored.
> 
> which means that specifying a 'dma-ranges' in the parent bus of any devices
> that use the iommu will just be ignored.

I think that's just wrong and wants updating (or at least clarifying). 
The high-level view now is that we use "dma-ranges" to describe 
limitations imposed by a bridge or interconnect segment, and that can 
certainly happen upstream of an IOMMU. As it happens, I've just recently 
sent a patch for precisely that case[1].

I guess what it might have been trying to say is that "dma-ranges" 
*does* become irrelevant in terms of constraining what physical memory 
is usable for DMA, but that shouldn't imply that its meaning doesn't 
just shift to a different purpose.

> As a concrete example, the PCIe DART IOMMU only allows translations from iovas
> within 0x00100000...0x3ff00000 to the entire physical address space (though
> realistically it will only map to 16GB RAM starting at 0x800000000 on the M1).
> 
> I'm probably just confused or maybe the documentation is outdated but I don't
> see how I could specify "this device can only use DMA addresses from
> 0x00100000...0x3ff00000 but can map these via the iommu to any physical
> address" using 'dma-ranges'.
> 
> Could you maybe point me to the right direction or give me a small example?
> That would help a lot!

PCI is easy, since it's already standard practice to use "dma-ranges" to 
describe host bridge inbound windows. Even if the restriction is really 
out in the host-side interconnect rather than in the bridge itself, to 
all intents and purposes it's indistinguishable so can still be 
described the same way.

The case of a standalone device having fewer address bits wired up than 
both its output and the corresponding IOMMU input might expect is a 
little more awkward, since that often *does* require adding an extra 
level of "bus" to explicitly represent that interconnect link in the DT 
model, e.g. [2].

Robin.

[1] 
https://lore.kernel.org/linux-arm-kernel/720d0a9a42e33148fcac45cd39a727093a32bf32.1614965598.git.robin.murphy@arm.com/
[2] 
https://lore.kernel.org/linux-arm-kernel/20180926132247.10971-23-laurentiu.tudor@nxp.com/

Sven Peter March 25, 2021, 8:49 p.m. UTC | #13

On Thu, Mar 25, 2021, at 12:50, Robin Murphy wrote:
> On 2021-03-25 07:53, Sven Peter wrote:
> > 
> > 
> > On Tue, Mar 23, 2021, at 21:53, Rob Herring wrote:
> >> On Sun, Mar 21, 2021 at 05:00:50PM +0100, Mark Kettenis wrote:
> >>>
> >>> As I mentioned before, not all DARTs support the full 32-bit aperture.
> >>> In particular the PCIe DARTs support a smaller address-space.  It is
> >>> not clear whether this is a restriction of the PCIe host controller or
> >>> the DART, but the Apple Device Tree has "vm-base" and "vm-size"
> >>> properties that encode the base address and size of the aperture.
> >>> These single-cell properties which is probably why for the USB DARTs
> >>> only "vm-base" is given; since "vm-base" is 0, a 32-bit number
> >>> wouldn't be able to encode the full aperture size.  We could make them
> >>> 64-bit numbers in the Linux device tree though and always be explicit
> >>> about the size.  Older Sun SPARC machines used a single "virtual-dma"
> >>> property to encode the aperture.  We could do someting similar.  You
> >>> would use this property to initialize domain->geometry.aperture_start
> >>> and domain->geometry.aperture_end in diff 3/3 of this series.
> >>
> >> 'dma-ranges' is what should be used here.
> >>
> > 
> > The iommu binding documentation [1] mentions that
> > 
> >      The device tree node of the IOMMU device's parent bus must contain a valid
> >      "dma-ranges" property that describes how the physical address space of the
> >      IOMMU maps to memory. An empty "dma-ranges" property means that there is a
> >      1:1 mapping from IOMMU to memory.
> > 
> > which, if I understand this correctly, means that the 'dma-ranges' for the
> > parent bus of the iommu should be empty since the DART hardware can see the
> > full physical address space with a 1:1 mapping.
> > 
> > 
> > The documentation also mentions that
> > 
> >       When an "iommus" property is specified in a device tree node, the IOMMU
> >       will be used for address translation. If a "dma-ranges" property exists
> >       in the device's parent node it will be ignored.
> > 
> > which means that specifying a 'dma-ranges' in the parent bus of any devices
> > that use the iommu will just be ignored.
> 
> I think that's just wrong and wants updating (or at least clarifying). 
> The high-level view now is that we use "dma-ranges" to describe 
> limitations imposed by a bridge or interconnect segment, and that can 
> certainly happen upstream of an IOMMU. As it happens, I've just recently 
> sent a patch for precisely that case[1].
> 
> I guess what it might have been trying to say is that "dma-ranges" 
> *does* become irrelevant in terms of constraining what physical memory 
> is usable for DMA, but that shouldn't imply that its meaning doesn't 
> just shift to a different purpose.
> 

Okay, now it makes sense then!

> > As a concrete example, the PCIe DART IOMMU only allows translations from iovas
> > within 0x00100000...0x3ff00000 to the entire physical address space (though
> > realistically it will only map to 16GB RAM starting at 0x800000000 on the M1).
> > 
> > I'm probably just confused or maybe the documentation is outdated but I don't
> > see how I could specify "this device can only use DMA addresses from
> > 0x00100000...0x3ff00000 but can map these via the iommu to any physical
> > address" using 'dma-ranges'.
> > 
> > Could you maybe point me to the right direction or give me a small example?
> > That would help a lot!
> 
> PCI is easy, since it's already standard practice to use "dma-ranges" to 
> describe host bridge inbound windows. Even if the restriction is really 
> out in the host-side interconnect rather than in the bridge itself, to 
> all intents and purposes it's indistinguishable so can still be 
> described the same way.
> 
> The case of a standalone device having fewer address bits wired up than 
> both its output and the corresponding IOMMU input might expect is a 
> little more awkward, since that often *does* require adding an extra 
> level of "bus" to explicitly represent that interconnect link in the DT 
> model, e.g. [2].
> 

Nice, thanks! That's exactly what I was looking for :)


Best,

Sven

Arnd Bergmann March 25, 2021, 9:41 p.m. UTC | #14

On Thu, Mar 25, 2021 at 8:53 AM Sven Peter <sven@svenpeter.dev> wrote:
> On Tue, Mar 23, 2021, at 21:53, Rob Herring wrote:
>
> I'm probably just confused or maybe the documentation is outdated but I don't
> see how I could specify "this device can only use DMA addresses from
> 0x00100000...0x3ff00000 but can map these via the iommu to any physical
> address" using 'dma-ranges'.

It sounds like this is a holdover from the original powerpc iommu, which also
had a limited set of virtual addresses in the iommu.

I would think it's sufficient to describe it in the iommu itself,
since the limitation
is more "addresses coming into the iommu must be this range" than "this device
must use that address range for talking to the iommu".

If the addresses are allocated by the iommu driver, and each iommu only has
one DMA master attached to it, having a simple range property in the iommu
node should do the trick here. If there might be multiple devices on the same
iommu but with different address ranges (which I don't think is the case), then
it could be part of the reference to the iommu.

         Arnd

Mark Kettenis March 26, 2021, 3:59 p.m. UTC | #15

> From: Arnd Bergmann <arnd@kernel.org>
> Date: Thu, 25 Mar 2021 22:41:09 +0100
> 
> On Thu, Mar 25, 2021 at 8:53 AM Sven Peter <sven@svenpeter.dev> wrote:
> > On Tue, Mar 23, 2021, at 21:53, Rob Herring wrote:
> >
> > I'm probably just confused or maybe the documentation is outdated but I don't
> > see how I could specify "this device can only use DMA addresses from
> > 0x00100000...0x3ff00000 but can map these via the iommu to any physical
> > address" using 'dma-ranges'.
> 
> It sounds like this is a holdover from the original powerpc iommu,
> which also had a limited set of virtual addresses in the iommu.
> 
> I would think it's sufficient to describe it in the iommu itself,
> since the limitation is more "addresses coming into the iommu must
> be this range" than "this device must use that address range for
> talking to the iommu".
> 
> If the addresses are allocated by the iommu driver, and each iommu
> only has one DMA master attached to it, having a simple range
> property in the iommu node should do the trick here. If there might
> be multiple devices on the same iommu but with different address
> ranges (which I don't think is the case), then it could be part of
> the reference to the iommu.

The ADT has properties on the iommu node that describe the adresses it
accepts for translation ("vm-base" and "vm-size").  So I think we can
safely assume that the same limits apply to all DMA masters that are
attached to it.  We don't know if the range limit is baked into the
silicon or whether it is related to how the firmware sets things up.
Having the properties on the iommu node makes it easy for m1n1 to
update the properties with the right values if necessary.

Some of the DARTs provide a bypass facility.  That code make using the
standard "dma-ranges" property tricky.  That property would need to
contain the bypass address range.  But that would mean that if the
DART driver needs to look at that property to figure out the address
range that supports translation it will need to be able to distinguish
between the translatable address range and the bypass address range.

Cheers,

Mark

Arnd Bergmann March 26, 2021, 4:09 p.m. UTC | #16

On Fri, Mar 26, 2021 at 4:59 PM Mark Kettenis <mark.kettenis@xs4all.nl> wrote:
>
> > From: Arnd Bergmann <arnd@kernel.org>
> > Date: Thu, 25 Mar 2021 22:41:09 +0100
> >
> > On Thu, Mar 25, 2021 at 8:53 AM Sven Peter <sven@svenpeter.dev> wrote:
> > > On Tue, Mar 23, 2021, at 21:53, Rob Herring wrote:
> > >
> > > I'm probably just confused or maybe the documentation is outdated but I don't
> > > see how I could specify "this device can only use DMA addresses from
> > > 0x00100000...0x3ff00000 but can map these via the iommu to any physical
> > > address" using 'dma-ranges'.
> >
> > It sounds like this is a holdover from the original powerpc iommu,
> > which also had a limited set of virtual addresses in the iommu.
> >
> > I would think it's sufficient to describe it in the iommu itself,
> > since the limitation is more "addresses coming into the iommu must
> > be this range" than "this device must use that address range for
> > talking to the iommu".
> >
> > If the addresses are allocated by the iommu driver, and each iommu
> > only has one DMA master attached to it, having a simple range
> > property in the iommu node should do the trick here. If there might
> > be multiple devices on the same iommu but with different address
> > ranges (which I don't think is the case), then it could be part of
> > the reference to the iommu.
>
> The ADT has properties on the iommu node that describe the adresses it
> accepts for translation ("vm-base" and "vm-size").  So I think we can
> safely assume that the same limits apply to all DMA masters that are
> attached to it.  We don't know if the range limit is baked into the
> silicon or whether it is related to how the firmware sets things up.
> Having the properties on the iommu node makes it easy for m1n1 to
> update the properties with the right values if necessary.

Ok.

> Some of the DARTs provide a bypass facility.  That code make using the
> standard "dma-ranges" property tricky.  That property would need to
> contain the bypass address range.  But that would mean that if the
> DART driver needs to look at that property to figure out the address
> range that supports translation it will need to be able to distinguish
> between the translatable address range and the bypass address range.

I'm not following here. Do you mean the bus address used for bypass is
different from the upstream address that gets programmed into the iommu
when it is active?

          Arnd

Sven Peter March 26, 2021, 4:10 p.m. UTC | #17

On Fri, Mar 26, 2021, at 16:59, Mark Kettenis wrote:
> Some of the DARTs provide a bypass facility.  That code make using the
> standard "dma-ranges" property tricky.  That property would need to
> contain the bypass address range.  But that would mean that if the
> DART driver needs to look at that property to figure out the address
> range that supports translation it will need to be able to distinguish
> between the translatable address range and the bypass address range.

Do we understand if and why we even need to bypass certain streams?
And do you have an example for a node in the ADT that contains this bypass range?

I've only seen nodes with "bypass" and "bypass-adress" but that could just be
some software abstraction Apple uses which doesn't map well to Linux or other OSes
and might not even be required here.

Sven

Arnd Bergmann March 26, 2021, 4:38 p.m. UTC | #18

On Fri, Mar 26, 2021 at 5:10 PM Sven Peter <sven@svenpeter.dev> wrote:
> On Fri, Mar 26, 2021, at 16:59, Mark Kettenis wrote:
> > Some of the DARTs provide a bypass facility.  That code make using the
> > standard "dma-ranges" property tricky.  That property would need to
> > contain the bypass address range.  But that would mean that if the
> > DART driver needs to look at that property to figure out the address
> > range that supports translation it will need to be able to distinguish
> > between the translatable address range and the bypass address range.
>
> Do we understand if and why we even need to bypass certain streams?

My guess is that this is a performance optimization.

There are generally three reasons to want an iommu in the first place:
 - Pass a device down to a guest or user process without giving
   access to all of memory
 - Avoid problems with limitations in the device, typically when it
only supports
   32-bit bus addressing, but the installed memory is larger than 4GB
 - Protect kernel memory from broken drivers

If you care about none of the above, but you do care about data transfer
speed, you are better off just leaving the IOMMU in bypass mode.
I don't think we have to support it if the IOMMU works reliably, but it's
something that users might want.

        Arnd

Sven Peter March 26, 2021, 5:06 p.m. UTC | #19

On Fri, Mar 26, 2021, at 17:38, Arnd Bergmann wrote:
> On Fri, Mar 26, 2021 at 5:10 PM Sven Peter <sven@svenpeter.dev> wrote:
> > On Fri, Mar 26, 2021, at 16:59, Mark Kettenis wrote:
> > > Some of the DARTs provide a bypass facility.  That code make using the
> > > standard "dma-ranges" property tricky.  That property would need to
> > > contain the bypass address range.  But that would mean that if the
> > > DART driver needs to look at that property to figure out the address
> > > range that supports translation it will need to be able to distinguish
> > > between the translatable address range and the bypass address range.
> >
> > Do we understand if and why we even need to bypass certain streams?
> 
> My guess is that this is a performance optimization.

Makes sense.

> 
> There are generally three reasons to want an iommu in the first place:
>  - Pass a device down to a guest or user process without giving
>    access to all of memory
>  - Avoid problems with limitations in the device, typically when it
> only supports
>    32-bit bus addressing, but the installed memory is larger than 4GB
>  - Protect kernel memory from broken drivers
> 
> If you care about none of the above, but you do care about data transfer
> speed, you are better off just leaving the IOMMU in bypass mode.
> I don't think we have to support it if the IOMMU works reliably, but it's
> something that users might want.

Right now the IOMMU works very reliably while bypass mode seems to be tricky
at best. I think I partly know how to enable it but it looks like either not
every DART or DART/master combination even supports it or that there is
some additional configuration required to make it work reliably.

I had it working with the USB DART at one point but I needed to enable it in
all 16 streams of the IOMMU even though the pagetables only need to be setup
in one stream as indicated by the ADT.
I couldn't get it to work at all for the framebuffer IOMMU.

I think it's fine to skip it for now until it's either actually required due
to some hardware quirk or once we have users requesting support. Apple uses
almost all IOMMUs without bypass mode if that ADT is to be believed though.

Best,

Sven

Mark Kettenis March 26, 2021, 5:26 p.m. UTC | #20

> From: Arnd Bergmann <arnd@kernel.org>
> Date: Fri, 26 Mar 2021 17:38:24 +0100
> 
> On Fri, Mar 26, 2021 at 5:10 PM Sven Peter <sven@svenpeter.dev> wrote:
> > On Fri, Mar 26, 2021, at 16:59, Mark Kettenis wrote:
> > > Some of the DARTs provide a bypass facility.  That code make using the
> > > standard "dma-ranges" property tricky.  That property would need to
> > > contain the bypass address range.  But that would mean that if the
> > > DART driver needs to look at that property to figure out the address
> > > range that supports translation it will need to be able to distinguish
> > > between the translatable address range and the bypass address range.
> >
> > Do we understand if and why we even need to bypass certain streams?
> 
> My guess is that this is a performance optimization.
> 
> There are generally three reasons to want an iommu in the first place:
>  - Pass a device down to a guest or user process without giving
>    access to all of memory
>  - Avoid problems with limitations in the device, typically when it
> only supports
>    32-bit bus addressing, but the installed memory is larger than 4GB
>  - Protect kernel memory from broken drivers
> 
> If you care about none of the above, but you do care about data transfer
> speed, you are better off just leaving the IOMMU in bypass mode.
> I don't think we have to support it if the IOMMU works reliably, but it's
> something that users might want.

Another reason might be that a device needs access to large amounts of
physical memory at the same time and the 32-bit address space that the
DART provides is too tight.

In U-Boot I might want to use bypass where it works since there is no
proper IOMMU support in U-Boot.  Generally speaking, the goal is to
keep the code size down in U-Boot.  In OpenBSD I'll probably avoid
bypass mode if I can.

I haven't figured out how the bypass stuff really works.  Corellium
added support for it in their codebase when they added support for
Thunderbolt, and some of the DARTs that seem to be related to
Thunderbolt do indeed have a "bypass" property.  But it is unclear to
me how the different puzzle pieces fit together for Thunderbolt.

Anyway, from my viewpoint having the information about the IOVA
address space sit on the devices makes little sense.  This information
is needed by the DART driver, and there is no direct cnnection from
the DART to the individual devices in the devicetree.  The "iommus"
property makes a connection in the opposite direction.

Cheers,

Mark

Robin Murphy March 26, 2021, 5:34 p.m. UTC | #21

On 2021-03-26 17:26, Mark Kettenis wrote:
>> From: Arnd Bergmann <arnd@kernel.org>
>> Date: Fri, 26 Mar 2021 17:38:24 +0100
>>
>> On Fri, Mar 26, 2021 at 5:10 PM Sven Peter <sven@svenpeter.dev> wrote:
>>> On Fri, Mar 26, 2021, at 16:59, Mark Kettenis wrote:
>>>> Some of the DARTs provide a bypass facility.  That code make using the
>>>> standard "dma-ranges" property tricky.  That property would need to
>>>> contain the bypass address range.  But that would mean that if the
>>>> DART driver needs to look at that property to figure out the address
>>>> range that supports translation it will need to be able to distinguish
>>>> between the translatable address range and the bypass address range.
>>>
>>> Do we understand if and why we even need to bypass certain streams?
>>
>> My guess is that this is a performance optimization.
>>
>> There are generally three reasons to want an iommu in the first place:
>>   - Pass a device down to a guest or user process without giving
>>     access to all of memory
>>   - Avoid problems with limitations in the device, typically when it
>> only supports
>>     32-bit bus addressing, but the installed memory is larger than 4GB
>>   - Protect kernel memory from broken drivers
>>
>> If you care about none of the above, but you do care about data transfer
>> speed, you are better off just leaving the IOMMU in bypass mode.
>> I don't think we have to support it if the IOMMU works reliably, but it's
>> something that users might want.
> 
> Another reason might be that a device needs access to large amounts of
> physical memory at the same time and the 32-bit address space that the
> DART provides is too tight.
> 
> In U-Boot I might want to use bypass where it works since there is no
> proper IOMMU support in U-Boot.  Generally speaking, the goal is to
> keep the code size down in U-Boot.  In OpenBSD I'll probably avoid
> bypass mode if I can.
> 
> I haven't figured out how the bypass stuff really works.  Corellium
> added support for it in their codebase when they added support for
> Thunderbolt, and some of the DARTs that seem to be related to
> Thunderbolt do indeed have a "bypass" property.  But it is unclear to
> me how the different puzzle pieces fit together for Thunderbolt.
> 
> Anyway, from my viewpoint having the information about the IOVA
> address space sit on the devices makes little sense.  This information
> is needed by the DART driver, and there is no direct cnnection from
> the DART to the individual devices in the devicetree.  The "iommus"
> property makes a connection in the opposite direction.

What still seems unclear is whether these addressing limitations are a 
property of the DART input interface, the device output interface, or 
the interconnect between them. Although the observable end result 
appears more or less the same either way, they are conceptually 
different things which we have different abstractions to deal with.

Robin.

Sven Peter March 26, 2021, 5:51 p.m. UTC | #22

On Fri, Mar 26, 2021, at 18:34, Robin Murphy wrote:
> On 2021-03-26 17:26, Mark Kettenis wrote:
> > 
> > Anyway, from my viewpoint having the information about the IOVA
> > address space sit on the devices makes little sense.  This information
> > is needed by the DART driver, and there is no direct cnnection from
> > the DART to the individual devices in the devicetree.  The "iommus"
> > property makes a connection in the opposite direction.
> 
> What still seems unclear is whether these addressing limitations are a 
> property of the DART input interface, the device output interface, or 
> the interconnect between them. Although the observable end result 
> appears more or less the same either way, they are conceptually 
> different things which we have different abstractions to deal with.
> 
> Robin.
>

I'm not really sure if there is any way for us to figure out where these
limitation comes from though.

I've done some more experiments and looked at all DART nodes in Apple's Device
Tree though. It seems that most (if not all) masters only connect 32 address
lines even though the iommu can handle a much larger address space. I'll therefore
remove the code to handle the full space for v2 since it's essentially dead
code that can't be tested anyway.

There are some exceptions though:

There are the PCIe DARTs which have a different limitation which could be
encoded as 'dma-ranges' in the pci bus node:

           name         base      size
         dart-apcie1: 00100000  3fe00000
         dart-apcie2: 00100000  3fe00000
         dart-apcie0: 00100000  3fe00000
        dart-apciec0: 00004000  7fffc000
        dart-apciec1: 80000000  7fffc000

Then there are also these display controller DARTs. If we wanted to use dma-ranges
we could just put them in a single sub bus:

              name     base      size
          dart-disp0: 00000000 fc000000
            dart-dcp: 00000000 fc000000
       dart-dispext0: 00000000 fc000000
         dart-dcpext: 00000000 fc000000

And finally we have these strange ones which might eventually each require
another awkward sub-bus if we want to stick to the dma-ranges property.

    name     base      size
  dart-aop: 00030000 ffffffff ("always-on processor")
  dart-pmp: 00000000 bff00000 (no idea yet)
  dart-sio: 0021c000 fbde4000 (at least their Secure Enclave/TPM co-processor)
  dart-ane: 00000000 e0000000 ("Neural Engine", their ML accelerator)

For all we know these limitations could even arise for different reasons.
(the secure enclave one looks like it might be imposed by the code running
on there).

Not really sure to proceed from here. I'll give the dma-ranges options a try
for v2 and see how that one works out but that's not going to help us understand
*why* these limitations exist. 
At least I won't have to change much code if we agree on a different abstraction :)

The important ones for now are probably the USB and the PCIe ones. We'll need the
display ones after that and can probably ignore the strange ones for quite a while.

Best,

Sven

Arnd Bergmann March 26, 2021, 7:59 p.m. UTC | #23

On Fri, Mar 26, 2021 at 6:51 PM Sven Peter <sven@svenpeter.dev> wrote:
> On Fri, Mar 26, 2021, at 18:34, Robin Murphy wrote:
> > On 2021-03-26 17:26, Mark Kettenis wrote:
> > >
> > > Anyway, from my viewpoint having the information about the IOVA
> > > address space sit on the devices makes little sense.  This information
> > > is needed by the DART driver, and there is no direct cnnection from
> > > the DART to the individual devices in the devicetree.  The "iommus"
> > > property makes a connection in the opposite direction.
> >
> > What still seems unclear is whether these addressing limitations are a
> > property of the DART input interface, the device output interface, or
> > the interconnect between them. Although the observable end result
> > appears more or less the same either way, they are conceptually
> > different things which we have different abstractions to deal with.
>
> I'm not really sure if there is any way for us to figure out where these
> limitation comes from though.

My first guess was that this is done to partition the available address
address space in a way that allows one physical IOTLB to handle
multiple devices that each have their own page table for a subset
of the address space, as was done on old PowerPC IOMMUs.
However, the ranges you list don't really support that model.

> I've done some more experiments and looked at all DART nodes in Apple's Device
> Tree though. It seems that most (if not all) masters only connect 32 address
> lines even though the iommu can handle a much larger address space. I'll therefore
> remove the code to handle the full space for v2 since it's essentially dead
> code that can't be tested anyway.
>
>
> There are some exceptions though:
>
> There are the PCIe DARTs which have a different limitation which could be
> encoded as 'dma-ranges' in the pci bus node:
>
>            name         base      size
>          dart-apcie1: 00100000  3fe00000
>          dart-apcie2: 00100000  3fe00000
>          dart-apcie0: 00100000  3fe00000
>         dart-apciec0: 00004000  7fffc000
>         dart-apciec1: 80000000  7fffc000

This looks like they are reserving some address space in the beginning
and/or the end, and for apciec0, the address space is partitioned into
two equal-sized regions.

> Then there are also these display controller DARTs. If we wanted to use dma-ranges
> we could just put them in a single sub bus:
>
>               name     base      size
>           dart-disp0: 00000000 fc000000
>             dart-dcp: 00000000 fc000000
>        dart-dispext0: 00000000 fc000000
>          dart-dcpext: 00000000 fc000000
>
>
> And finally we have these strange ones which might eventually each require
> another awkward sub-bus if we want to stick to the dma-ranges property.
>
>     name     base      size
>   dart-aop: 00030000 ffffffff ("always-on processor")
>   dart-pmp: 00000000 bff00000 (no idea yet)

Here I also see a "pio-vm-size" property:

    dart-pmp {
      pio-vm-base = <0xc0000000>;
      pio-vm-size = <0x40000000>;
      vm-size = <0xbff00000>;
      ...
   };

Which seems to give 3GB of address space to the normal iotlb,
plus the last 1GB to something else. The vm-base property is also
missing rather than zero, but that could just be part of their syntax
instead of a real difference.

Could it be that there are

>   dart-sio: 0021c000 fbde4000 (at least their Secure Enclave/TPM co-processor)

Same here:
    dart-sio {
       vm-base = <0x0>;
       vm-size = <0xfc000000>;
       pio-vm-base = <0xfd000000>;
      pio-vm-size = <0x2000000>;
      pio-granularity = <0x1000000>;
   }

There are clearly two distinct ranges that split up the 4GB space again,
with a small hole of 16MB (==pio-granularity) at the end of each range.

The "pio" name might indicate that this is a range of addresses that
can be programmed to point at I/O registers in another device, rather
than pointing to RAM.

           Arnd

Arnd Bergmann March 26, 2021, 8:03 p.m. UTC | #24

On Fri, Mar 26, 2021 at 6:28 PM Mark Kettenis <mark.kettenis@xs4all.nl> wrote:

> I haven't figured out how the bypass stuff really works.  Corellium
> added support for it in their codebase when they added support for
> Thunderbolt, and some of the DARTs that seem to be related to
> Thunderbolt do indeed have a "bypass" property.  But it is unclear to
> me how the different puzzle pieces fit together for Thunderbolt.

As a general observation, bypass mode for Thunderbolt is what enabled
the http://thunderclap.io/ attack. This is extremely useful for debugging
a running kernel from another machine, but it's also something that
should never be done in a production kernel.

         Arnd

Mark Kettenis March 26, 2021, 9:13 p.m. UTC | #25

> From: Arnd Bergmann <arnd@kernel.org>
> Date: Fri, 26 Mar 2021 21:03:32 +0100
> 
> On Fri, Mar 26, 2021 at 6:28 PM Mark Kettenis <mark.kettenis@xs4all.nl> wrote:
> 
> > I haven't figured out how the bypass stuff really works.  Corellium
> > added support for it in their codebase when they added support for
> > Thunderbolt, and some of the DARTs that seem to be related to
> > Thunderbolt do indeed have a "bypass" property.  But it is unclear to
> > me how the different puzzle pieces fit together for Thunderbolt.
> 
> As a general observation, bypass mode for Thunderbolt is what enabled
> the http://thunderclap.io/ attack. This is extremely useful for debugging
> a running kernel from another machine, but it's also something that
> should never be done in a production kernel.

No kidding!  I was surprised to see the bypass support on the
Thunderbolt-related nodes.

Mark Kettenis March 26, 2021, 9:16 p.m. UTC | #26

> From: Arnd Bergmann <arnd@kernel.org>
> Date: Fri, 26 Mar 2021 20:59:58 +0100
> 
> On Fri, Mar 26, 2021 at 6:51 PM Sven Peter <sven@svenpeter.dev> wrote:
> > On Fri, Mar 26, 2021, at 18:34, Robin Murphy wrote:
> > > On 2021-03-26 17:26, Mark Kettenis wrote:
> > > >
> > > > Anyway, from my viewpoint having the information about the IOVA
> > > > address space sit on the devices makes little sense.  This information
> > > > is needed by the DART driver, and there is no direct cnnection from
> > > > the DART to the individual devices in the devicetree.  The "iommus"
> > > > property makes a connection in the opposite direction.
> > >
> > > What still seems unclear is whether these addressing limitations are a
> > > property of the DART input interface, the device output interface, or
> > > the interconnect between them. Although the observable end result
> > > appears more or less the same either way, they are conceptually
> > > different things which we have different abstractions to deal with.
> >
> > I'm not really sure if there is any way for us to figure out where these
> > limitation comes from though.
> 
> My first guess was that this is done to partition the available address
> address space in a way that allows one physical IOTLB to handle
> multiple devices that each have their own page table for a subset
> of the address space, as was done on old PowerPC IOMMUs.
> However, the ranges you list don't really support that model.
> 
> > I've done some more experiments and looked at all DART nodes in Apple's Device
> > Tree though. It seems that most (if not all) masters only connect 32 address
> > lines even though the iommu can handle a much larger address space. I'll therefore
> > remove the code to handle the full space for v2 since it's essentially dead
> > code that can't be tested anyway.
> >
> >
> > There are some exceptions though:
> >
> > There are the PCIe DARTs which have a different limitation which could be
> > encoded as 'dma-ranges' in the pci bus node:
> >
> >            name         base      size
> >          dart-apcie1: 00100000  3fe00000
> >          dart-apcie2: 00100000  3fe00000
> >          dart-apcie0: 00100000  3fe00000
> >         dart-apciec0: 00004000  7fffc000
> >         dart-apciec1: 80000000  7fffc000
> 
> This looks like they are reserving some address space in the beginning
> and/or the end, and for apciec0, the address space is partitioned into
> two equal-sized regions.
> 
> > Then there are also these display controller DARTs. If we wanted to use dma-ranges
> > we could just put them in a single sub bus:
> >
> >               name     base      size
> >           dart-disp0: 00000000 fc000000
> >             dart-dcp: 00000000 fc000000
> >        dart-dispext0: 00000000 fc000000
> >          dart-dcpext: 00000000 fc000000
> >
> >
> > And finally we have these strange ones which might eventually each require
> > another awkward sub-bus if we want to stick to the dma-ranges property.
> >
> >     name     base      size
> >   dart-aop: 00030000 ffffffff ("always-on processor")
> >   dart-pmp: 00000000 bff00000 (no idea yet)
> 
> Here I also see a "pio-vm-size" property:
> 
>     dart-pmp {
>       pio-vm-base = <0xc0000000>;
>       pio-vm-size = <0x40000000>;
>       vm-size = <0xbff00000>;
>       ...
>    };
> 
> Which seems to give 3GB of address space to the normal iotlb,
> plus the last 1GB to something else. The vm-base property is also
> missing rather than zero, but that could just be part of their syntax
> instead of a real difference.

Yes.  It seems like "vm-base" is omitted when it is 0, and "vm-size"
is omitted when the end of the window is at 0xffffffff.

> 
> Could it be that there are
> 
> >   dart-sio: 0021c000 fbde4000 (at least their Secure Enclave/TPM co-processor)
> 
> Same here:
>     dart-sio {
>        vm-base = <0x0>;
>        vm-size = <0xfc000000>;
>        pio-vm-base = <0xfd000000>;
>       pio-vm-size = <0x2000000>;
>       pio-granularity = <0x1000000>;
>    }
> 
> There are clearly two distinct ranges that split up the 4GB space again,
> with a small hole of 16MB (==pio-granularity) at the end of each range.
> 
> The "pio" name might indicate that this is a range of addresses that
> can be programmed to point at I/O registers in another device, rather
> than pointing to RAM.
> 
>            Arnd
>

Sven Peter March 27, 2021, 3:30 p.m. UTC | #27

On Fri, Mar 26, 2021, at 20:59, Arnd Bergmann wrote:
> On Fri, Mar 26, 2021 at 6:51 PM Sven Peter <sven@svenpeter.dev> wrote:
> >   dart-sio: 0021c000 fbde4000 (at least their Secure Enclave/TPM co-processor)
> 
> Same here:
>     dart-sio {
>        vm-base = <0x0>;
>        vm-size = <0xfc000000>;
>        pio-vm-base = <0xfd000000>;
>       pio-vm-size = <0x2000000>;
>       pio-granularity = <0x1000000>;
>    }
> 
> There are clearly two distinct ranges that split up the 4GB space again,
> with a small hole of 16MB (==pio-granularity) at the end of each range.
> 
> The "pio" name might indicate that this is a range of addresses that
> can be programmed to point at I/O registers in another device, rather
> than pointing to RAM.
> 
>            Arnd
>

Very interesting observation!

Mark and I have discussed this a little bit further on IRC. Mark also successfully
used the PCIe DARTs with a DMA window outside of the one specified by vm-base/vm-size
in the ADT.

I believe that the (pio-)vm-base/size properties merely specify the ranges their
allocator uses and do not describe actual hardware limitations. Mark also suggested
that they might reserve memory at the beginning to find bugs similar to how one
might not allow to map memory at 0x0.

I have also done a few more experiments and figured out that if I put the IOMMU
into bypass mode (which doesn't seem to work for all IOMMUs/master combinations
which is why I'll leave it out of this series for now until I figure out more
details) I *can* use the full address space. I think the limitation
is therefore imposed by the translation hardware inside the IOMMU and not by
the bus/the interconnect.

If that's correct I think the right place to enforce this is to just limit
the aperture inside the DART driver to a 32bit address space whenever address
translation is enabled.

Thanks,

Sven

Sven Peter March 27, 2021, 3:33 p.m. UTC | #28

Hi Robin,


On Thu, Mar 25, 2021, at 12:50, Robin Murphy wrote:
> On 2021-03-25 07:53, Sven Peter wrote:
> > The iommu binding documentation [1] mentions that
> > 
> >      The device tree node of the IOMMU device's parent bus must contain a valid
> >      "dma-ranges" property that describes how the physical address space of the
> >      IOMMU maps to memory. An empty "dma-ranges" property means that there is a
> >      1:1 mapping from IOMMU to memory.
> > 
> > which, if I understand this correctly, means that the 'dma-ranges' for the
> > parent bus of the iommu should be empty since the DART hardware can see the
> > full physical address space with a 1:1 mapping.
> > 
> > 
> > The documentation also mentions that
> > 
> >       When an "iommus" property is specified in a device tree node, the IOMMU
> >       will be used for address translation. If a "dma-ranges" property exists
> >       in the device's parent node it will be ignored.
> > 
> > which means that specifying a 'dma-ranges' in the parent bus of any devices
> > that use the iommu will just be ignored.
> 
> I think that's just wrong and wants updating (or at least clarifying). 
> The high-level view now is that we use "dma-ranges" to describe 
> limitations imposed by a bridge or interconnect segment, and that can 
> certainly happen upstream of an IOMMU. As it happens, I've just recently 
> sent a patch for precisely that case[1].
> 
> I guess what it might have been trying to say is that "dma-ranges" 
> *does* become irrelevant in terms of constraining what physical memory 
> is usable for DMA, but that shouldn't imply that its meaning doesn't 
> just shift to a different purpose.

Should I add a patch to clarify this paragraph for v2 or submit a separate
one-off patch? I'm not entirely sure about the process here but I could add
a Suggested-by: to the commit if that's fine with you.



Best,

Sven

[0/3] Apple M1 DART IOMMU driver

Message

Comments