mbox series

[v4,0/3] Apple M1 DART IOMMU driver

Message ID 20210627143405.77298-1-sven@svenpeter.dev (mailing list archive)
Headers show
Series Apple M1 DART IOMMU driver | expand

Message

Sven Peter June 27, 2021, 2:34 p.m. UTC
Hi,

This is v4 of my Apple M1 DART IOMMU driver series as a follow up to the previous
versions [1][2][3].

Short summary: this series adds support for the iommu found in Apple's new M1
SoC which is required to use DMA on most peripherals like the display controller,
the USB ports or the internal PCIe bus (which is used for WiFi, Ethernet and
more USB ports).
So far this code has been tested by multiple people with dwc3 in host and
device mode (which both only require changes to the device tree after this
patchset) and PCIe (using a yet to be finalized patchset).


== Testing this patchset with the USB-C controller == 

The two USB-C ports on the M1 machines are exposed as two separate dwc3
controllers which are behind a DART. Now that my USB phy bringup code has been
merged into our bootloader m1n1 you can easily test this patchset yourself:

1) Follow the instructions at [4] to setup our bootloader m1n1 on your M1
   machine which will allow you to boot kernels using a normal USB cable.
   Note that you'll still need a special setup to expose the UART for very
   low-level debugging.

2) Either apply this patchset and add the DART and dwc3 nodes as done in [5]
   or alternatively just pull from [6] to get the tree that I've been using.
   (That tree contains two commits already in linux-next, this patchset and
    a final commit to add the dart+dwc3 nodes)

3) Boot the kernel through our bootloader m1n1. You'll need a version after
   commit [7] which enables the USB PHY and the USB PD chip.

Note that the dwc3 controller has a quirk where each root port can only be used
once right now. The most stable way to test is to already connected the USB
device(s) before booting the kernel.

It's also possible to test the PCIe bus but this requires a more complex setup
for now. I can write a quick howto if anyone is interested though.
(tl;dr: Mark Kettenis has a u-boot fork that includes PCIe bringup code and
Marc Zyngier has a WIP patchset to add a PCIe driver)


== Hardwired 16K IOMMU pagesize ==

This IOMMU comes with a hard-wired pagesize of 16K. This makes booting a
kernel with 4K page challenging. Right now it will hit a BUG_ON if a translated
domain is used.

For dwc3 this is no issue: As long as the iommu is set to bypass mode
dwc3 works just fine. Translated mode just isn't supported then.

The most controversial part on which I'd like to get feedback are the
PCIe DARTs. These DARTs do not support hardware bypass mode and also limit
the iova space to 32 bit. To still allow booting on kernels with a 4K
pagesize I manually program a software bypass mode. With a correctly configured
dma-ranges in the PCIe nodes this then also allows a 4K kernel to boot.

In the long term, I'd like to extend the dma-iommu framework itself to
support iommu pagesizes with a larger granule than the CPU pagesize if that is
something you agree with.
This would be important to later support the thunderbolt DARTs since I would be
very uncomfortable to have these running in (software or hardware) bypass mode.

== Project Blurb ==

Asahi Linux is an open community project dedicated to developing and
maintaining mainline support for Apple Silicon on Linux. Feel free to
drop by #asahi and #asahi-dev on OFTC to chat with us, or check
our website for more information on the project:

https://asahilinux.org/

== Changes ==

Changes for v4:
 - Addressed Rob Herring's remark about the incorrect phandles in the device
   tree binding example and added his reviewed-by tag
 - Take the software linear mapping range from the bus instead of hardcoding
   it in the driver
 - Use def_domain_type to force bypass mode if there's a pagesize mismatch
   between the DART (hardwired to 16KB) and the kernel (may use 4K)
 - Added lockdep_assert_held instead of comments as suggested by Rouven Czerwinski
 - rebased on 5.13-rc7

Changes for v3:
 - fixed name of the iommu node in the device tree binding example
   pointed out by Arnd Bergmann
 - remove hardware specific checks from io-pgtable.c  as pointed out by
   Will Deacon
 - introduced a fake bypass mode by programming static linear pagetables
   if the DART does not support regular bypass mode as proposed by Alex
   Graf
 - added checks to enforce bypass mode if there is a pagesize mismatch
   between the DART HW and the CPU.
 - fixed usage of GFP_KERNEL during a held spinlock found by Julia Lawall
 - rebased on v5.13-rc3

Changes for v2:
 - fixed devicetree binding linting issues pointed out by Rob Herring and
   reworked that file.
 - made DART-specific code in io-pgtable.c unconditional and removed flag from
   Kconfig as proposed by Robin Murphy.
 - allowed multiple DART nodes in the "iommus" property as proposed by
   Rob Herring and Robin Murphy. this resulted in significant changes
   to apple-iommu-dart.c.
 - the domain aperture is now forced to 32bit if translation is enabled after
   the original suggestion to limit the aperture by Mark Kettenis and the
   follow-up discussion and investigation with Mark Kettenis, Arnd Bergmann,
   Robin Murphy and Rob Herring. This change also simplified the code
   in io-pgtable.c and made some of the improvements suggested during review
   not apply anymore.
 - added support for bypassed and isolated domain modes.
 - reject IOMMU_MMIO and IOMMU_NOEXEC since it's unknown how to set these up
   for now or if the hardware even supports these flags.
 - renamed some registers to be less confusing (mainly s/DOMAIN/STREAM/ to
   prevent confusion with linux's iommu domain concept).


[1] https://lore.kernel.org/linux-iommu/20210320151903.60759-1-sven@svenpeter.dev/
[2] https://lore.kernel.org/linux-iommu/20210328074009.95932-1-sven@svenpeter.dev/
[3] https://lore.kernel.org/linux-iommu/20210603085003.50465-1-sven@svenpeter.dev/
[4] https://github.com/AsahiLinux/docs/wiki/Developer-Quickstart
[5] https://github.com/AsahiLinux/linux/commit/7d4ebb0b22e9bfec849e2af86ddeb46ec29d7feb
[6] https://github.com/AsahiLinux/linux/tree/dart/dev
[7] https://github.com/AsahiLinux/m1n1/commit/9529ec2b4fd6550f9cfd66d9f2448b90804699a1

Sven Peter (3):
  iommu: io-pgtable: add DART pagetable format
  dt-bindings: iommu: add DART iommu bindings
  iommu: dart: Add DART iommu driver

 .../devicetree/bindings/iommu/apple,dart.yaml |   81 ++
 MAINTAINERS                                   |    7 +
 drivers/iommu/Kconfig                         |   15 +
 drivers/iommu/Makefile                        |    1 +
 drivers/iommu/apple-dart-iommu.c              | 1058 +++++++++++++++++
 drivers/iommu/io-pgtable-arm.c                |   62 +
 drivers/iommu/io-pgtable.c                    |    1 +
 include/linux/io-pgtable.h                    |    7 +
 8 files changed, 1232 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/iommu/apple,dart.yaml
 create mode 100644 drivers/iommu/apple-dart-iommu.c

Comments

Robin Murphy July 14, 2021, 6:19 p.m. UTC | #1
On 2021-06-27 15:34, Sven Peter wrote:
[...]
> In the long term, I'd like to extend the dma-iommu framework itself to
> support iommu pagesizes with a larger granule than the CPU pagesize if that is
> something you agree with.

BTW this isn't something we can fully support in general. IOMMU API 
users may expect this to work:

iommu_map(domain, iova, page_to_phys(p1), PAGE_SIZE, prot);
iommu_map(domain, iova + PAGE_SIZE, page_to_phys(p2), PAGE_SIZE, prot);

Although they do in principle have visibility of pgsize_bitmap, I still 
doubt anyone is really prepared for CPU-page-aligned mappings to fail.
Even at the DMA API level you could hide *some* of it (at the cost of 
effectively only having 1/4 of the usable address space), but there are 
still cases like where v4l2 has a hard requirement that a page-aligned 
scatterlist can be mapped into a contiguous region of DMA addresses.

> This would be important to later support the thunderbolt DARTs since I would be
> very uncomfortable to have these running in (software or hardware) bypass mode.

Funnily enough that's the one case that would be relatively workable, 
since untrusted devices are currently subject to bounce-buffering of the 
entire DMA request, so it doesn't matter so much how the bounce buffer 
itself is mapped. Even with the possible future optimisation of only 
bouncing the non-page-aligned start and end parts of a buffer I think it 
still works (the physical alignment just has to be considered in terms 
of the IOMMU granule).

Robin.
Arnd Bergmann July 14, 2021, 8:51 p.m. UTC | #2
On Wed, Jul 14, 2021 at 8:21 PM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 2021-06-27 15:34, Sven Peter wrote:
> [...]
> > In the long term, I'd like to extend the dma-iommu framework itself to
> > support iommu pagesizes with a larger granule than the CPU pagesize if that is
> > something you agree with.
>
> BTW this isn't something we can fully support in general. IOMMU API
> users may expect this to work:
>
> iommu_map(domain, iova, page_to_phys(p1), PAGE_SIZE, prot);
> iommu_map(domain, iova + PAGE_SIZE, page_to_phys(p2), PAGE_SIZE, prot);
>
> Although they do in principle have visibility of pgsize_bitmap, I still
> doubt anyone is really prepared for CPU-page-aligned mappings to fail.
> Even at the DMA API level you could hide *some* of it (at the cost of
> effectively only having 1/4 of the usable address space), but there are
> still cases like where v4l2 has a hard requirement that a page-aligned
> scatterlist can be mapped into a contiguous region of DMA addresses.

I think that was the same conclusion we had earlier: the dma-mapping
interfaces should be possible for large iotlb pages, but any driver directly
using the IOMMU API, such as VFIO, would not work.

The question is how we can best allow one but not the other.

         Arnd
Joerg Roedel July 15, 2021, 6:52 a.m. UTC | #3
On Wed, Jul 14, 2021 at 10:51:34PM +0200, Arnd Bergmann wrote:
> The question is how we can best allow one but not the other.

By only allowing to allocate domains of type IDENTITY and DMA, but fail
to allocate UNMANAGED domains.

Regards,

	Joerg
Christoph Hellwig July 16, 2021, 6:24 a.m. UTC | #4
On Wed, Jul 14, 2021 at 07:19:50PM +0100, Robin Murphy wrote:
> Even at the DMA API level you could hide *some* of it (at the cost of
> effectively only having 1/4 of the usable address space), but there are
> still cases like where v4l2 has a hard requirement that a page-aligned
> scatterlist can be mapped into a contiguous region of DMA addresses.

Where does v4l2 make that broken assumption?  Plenty of dma mapping
implementations including dma-direct do not support that.

Drivers need to call dma_get_merge_boundary() to check for that kind of
behavior.
Robin Murphy July 16, 2021, 3:32 p.m. UTC | #5
On 2021-07-16 07:24, Christoph Hellwig wrote:
> On Wed, Jul 14, 2021 at 07:19:50PM +0100, Robin Murphy wrote:
>> Even at the DMA API level you could hide *some* of it (at the cost of
>> effectively only having 1/4 of the usable address space), but there are
>> still cases like where v4l2 has a hard requirement that a page-aligned
>> scatterlist can be mapped into a contiguous region of DMA addresses.
> 
> Where does v4l2 make that broken assumption?  Plenty of dma mapping
> implementations including dma-direct do not support that.

See vb2_dc_get_contiguous_size() and its callers. I still remember 
spending an entire work day on writing one email at the culmination of 
this discussion:

https://lore.kernel.org/linux-iommu/56409B6D.5090903@arm.com/

809eac54cdd6 was framed as an efficiency improvement because it 
technically was one (and something I had wanted to implement anyway), 
but it was also very much to save myself from any further email debates 
or customer calls about "regressing" code ported from 32-bit platforms...

Robin.