diff mbox series

[13/18] arm64: ioremap: use nGnRnE mappings on platforms that require it

Message ID 20210204203951.52105-14-marcan@marcan.st (mailing list archive)
State Changes Requested
Headers show
Series Apple M1 SoC platform bring-up | expand

Commit Message

Hector Martin Feb. 4, 2021, 8:39 p.m. UTC
This follows from the fixmap patch, but relates to the general case.
This is a hack, and incomplete. Read on for discussion.

The problem: on Apple ARM platforms, SoC MMIO needs to use nGnRnE
mappings: writes using nGnRE are blackholed. This seems to be by design,
and there doesn't seem to be any fabric configuration or other bit we
can flip to make the problem go away.

As an additional confounding factor, reportedly PCIe MMIO BAR mappings
conversely *do* need to use nGnRE to work properly. So we can't even get
away with a single ioremap setting, but need to discriminate based on
what bus the device is in. Since these devices have Thunderbolt, all PCI
devices in the tree are potentially in scope. Ugh.

Ideas:

(1) Set up some devicetree property to default to nGnRnE at the platform
    level, and then make PCI drivers use nGnRE.

    This will require changing the PCI code to make pci_ioremap_bar do
    something other than a plain ioremap().

    Unfortunately, of the ~630 PCI drivers in the tree, only ~90 use
    pci_ioremap_bar(). This would require a tree-wide cleanup to
    introduce something akin to pci_ioremap(), and make all PCI
    drivers use it instead of naked ioremap().

    Currently there are three ioremap variants:

    ioremap()
    ioremap_wc()
    ioremap_uc() (not normally used on arm64)

    None of these really capture the nGnRE vs nGnRnE distinction. If
    a new variant is introduced in common code, we'd have to provide
    a default implementation that falls back to regular ioremap() on
    other arches. Something like ioremap() vs. ioremap_np() (nonposted)?

(2) The converse of (1): keep the nGnRE default, but introduce special
    casing to the OF binding code to use nGnRnE when instructed to do so
    on these platforms. This means of_iomap() needs changing.

    The advantage of this approach is that the set of possible non-PCI
    drivers that are useful on these SoCs is bounded, so not all drivers
    that don't go through that path need to be fixed.

    Additionally, this could take advantage of the OF address
    translation stuff to be smarter about deciding to use nGnRnE, e.g.
    doing it based on a property of the parent bus node.

    Of note, some devices (like samsung_tty) go through the platform
    device framework, which eventually goes into devm code. So
    of_address_to_resource would need to set some flag on the struct
    resource, that can then be used by both of_iomap() and
    devm_ioremap_resource() and friends to eventually call the right
    ioremap variant.

    The ioremap considerations from (1) apply here too.

(3) Do it at a lower level, in ioremap() itself. This requires that
    ioremap() somehow discriminates based on address range to pick what
    kind of mapping to make.

    Declaring these address ranges would be an issue. Options:

    a) An out of band list in a DT node, a la /reserved-memory

    b) Something based on the existing DT hierarchy, where we can scan
       bus ranges and locate buses with a property that says "nGnRnE" or
       "nGnRE" and dynamically build the list based on that.

    The advantage of this option is that it doesn't touch non-arch code.
    The disadvantage is that it adds a complete new bespoke mechanism to
    the DT, and that it does not let device drivers actually select the
    IO mode, which might be desirable in the future anyway for some
    devices.

All discussion and additional ideas welcome.

Signed-off-by: Hector Martin <marcan@marcan.st>
---
 arch/arm64/include/asm/io.h | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Comments

Arnd Bergmann Feb. 4, 2021, 10:21 p.m. UTC | #1
On Thu, Feb 4, 2021 at 9:39 PM Hector Martin <marcan@marcan.st> wrote:
>
>     Currently there are three ioremap variants:
>
>     ioremap()
>     ioremap_wc()
>     ioremap_uc() (not normally used on arm64)
>
>     None of these really capture the nGnRE vs nGnRnE distinction. If
>     a new variant is introduced in common code, we'd have to provide
>     a default implementation that falls back to regular ioremap() on
>     other arches. Something like ioremap() vs. ioremap_np() (nonposted)?

The ioport_map() function could be considered a variant of nonposted
I/O, as being nonposted is a requirement for PCI I/O space. It's
a bit weird to overload it here, as I/O space has a number of other
special cases, including a limit for the total size of the address space,
and the assumption that all I/O ports are always mapped into virtual
addresses at boot time.

Are the registers that need nGnRnE all part of a well-defined
physical address range that we could pretend to be I/O space?
Also, is the actual PCI I/O space within this region?

The main advantage here would be that we could reuse the
IORESOURCE_IO bit to signify a register in this area.

Note: I don't actually think this is going to be a good solution, just
throwing it out as another alternative in case everything else ends
up being worse.

> (2) The converse of (1): keep the nGnRE default, but introduce special
>     casing to the OF binding code to use nGnRnE when instructed to do so
>     on these platforms. This means of_iomap() needs changing.

It probably also means changing of_address_to_resource(),
and devm_ioremap_resource().

> (3) Do it at a lower level, in ioremap() itself. This requires that
>     ioremap() somehow discriminates based on address range to pick what
>     kind of mapping to make.
>
>     Declaring these address ranges would be an issue. Options:
>
>     a) An out of band list in a DT node, a la /reserved-memory
>
>     b) Something based on the existing DT hierarchy, where we can scan
>        bus ranges and locate buses with a property that says "nGnRnE" or
>        "nGnRE" and dynamically build the list based on that.
>
>     The advantage of this option is that it doesn't touch non-arch code.
>     The disadvantage is that it adds a complete new bespoke mechanism to
>     the DT, and that it does not let device drivers actually select the
>     IO mode, which might be desirable in the future anyway for some
>     devices.
>
> All discussion and additional ideas welcome.

A very simple but ugly hack would be to take one of the high address
bits in phys_addr_t to encode the type, and then pass that through
all the way into the ioremap implementation.

This does have some precedent with the upper bits of the (96-bit)
PCI addresses encoding the type of resource, but it also conflates
the DT representation with the arm64 kernel implementation and
requires extending both in a fairly generic way to do something that
is highly platform specific.

         Arnd
Arnd Bergmann Feb. 8, 2021, 10:57 p.m. UTC | #2
On Thu, Feb 4, 2021 at 9:39 PM Hector Martin <marcan@marcan.st> wrote:

> (3) Do it at a lower level, in ioremap() itself. This requires that
>     ioremap() somehow discriminates based on address range to pick what
>     kind of mapping to make.
>
>     Declaring these address ranges would be an issue. Options:
>
>     a) An out of band list in a DT node, a la /reserved-memory
>
>     b) Something based on the existing DT hierarchy, where we can scan
>        bus ranges and locate buses with a property that says "nGnRnE" or
>        "nGnRE" and dynamically build the list based on that.
>
>     The advantage of this option is that it doesn't touch non-arch code.
>     The disadvantage is that it adds a complete new bespoke mechanism to
>     the DT, and that it does not let device drivers actually select the
>     IO mode, which might be desirable in the future anyway for some
>     devices.

I tried investigating further what this would look like, but scanning through
the ADT dump for what nodes use which register ranges. At first it seemed
the range 0x200000000-0x2ffffffff is used for all normal devices, while
the three PCI buses fall into the 0x380000000-0x4ffffffff,
0x500000000-0x67fffffff and 0x680000000-0x6ffffffff ranges
respectively. This would allow a nice abstraction where one node
contains all the devices in the 0x200000000-0x2ffffffff, and we do a
translation in of_address_to_resource(),  similar to what we have for
pci and isa nodes with their special addresses.

However, I did find that there are several nodes that use mmio
addresses next to the PCI addresses, e.g. apciec0, dart-apciec0,
apciec0-piodma, dart-acio0, acio0, acio-cpu0, atc0-dpin0, atc-phy0,
dart-usb0, and usb-drd0 in the 0x380000000-0x3ffffffff range, just
before the MMIO space of the first PCIe bus, so it gets a little
more complicated.

The actual device node could look something like

#define MAP_NONPOSTED 0x80000000

arm-io { /* name for adt, should be changed */
     compatible = "apple,m1-internal-bus";
     #address-cells = <2>; /* or maybe <3> if we want */
     #size-cells = <2>;
     ranges =
               /* on-chip MMIO */
               <(MAP_NONPOSTED | 0x2) 0x0   0x2 0x0 0x1 0x0>,

              /* first PCI: 2GB control, 4GB memory space */
             <(MAP_NONPOSTED | 0x3) 0x80000000  0x3 0x80000000  0x0 0x80000000>,
              <0x4 0x0   0x4 0x0  0x1 0x0>,

              /* second PCI: 2GB control, 4GB memory space */
             <(MAP_NONPOSTED | 0x5) 0x0  0x5 0x0  0x0 0x80000000>,
              <0x5 0x80000000  0x5 0x80000000  0x1>,

              /* third PCI  0.5GB control, 1.5GB memory space*/
             <(MAP_NONPOSTED | 0x6) 0x80000000  0x6 0x80000000  0x0 0x20000000>,
             <0x6 0xa0000000   0x6 0xa0000000  0x0 0x60000000>;
}

The MAP_NONPOSTED flag then gets interpreted by the .translate() and
.get_flags() callbacks of "struct of_bus" in the kernel, where it is put into
a "struct resource" flag, and interpreted when the resource gets mapped.

The PCI host controller nests inside of the node above, and (in theory)
uses the same trick to distinguish between prefetchable and non-prefetchable
memory, except in practice this is handled in device drivers that already
know whether to call ioremap() or ioremap_wc().

        Arnd
Mark Kettenis Feb. 8, 2021, 11:20 p.m. UTC | #3
> From: Arnd Bergmann <arnd@kernel.org>
> Date: Mon, 8 Feb 2021 23:57:20 +0100
> 
> On Thu, Feb 4, 2021 at 9:39 PM Hector Martin <marcan@marcan.st> wrote:
> 
> > (3) Do it at a lower level, in ioremap() itself. This requires that
> >     ioremap() somehow discriminates based on address range to pick what
> >     kind of mapping to make.
> >
> >     Declaring these address ranges would be an issue. Options:
> >
> >     a) An out of band list in a DT node, a la /reserved-memory
> >
> >     b) Something based on the existing DT hierarchy, where we can scan
> >        bus ranges and locate buses with a property that says "nGnRnE" or
> >        "nGnRE" and dynamically build the list based on that.
> >
> >     The advantage of this option is that it doesn't touch non-arch code.
> >     The disadvantage is that it adds a complete new bespoke mechanism to
> >     the DT, and that it does not let device drivers actually select the
> >     IO mode, which might be desirable in the future anyway for some
> >     devices.
> 
> I tried investigating further what this would look like, but scanning through
> the ADT dump for what nodes use which register ranges. At first it seemed
> the range 0x200000000-0x2ffffffff is used for all normal devices, while
> the three PCI buses fall into the 0x380000000-0x4ffffffff,
> 0x500000000-0x67fffffff and 0x680000000-0x6ffffffff ranges
> respectively. This would allow a nice abstraction where one node
> contains all the devices in the 0x200000000-0x2ffffffff, and we do a
> translation in of_address_to_resource(),  similar to what we have for
> pci and isa nodes with their special addresses.
> 
> However, I did find that there are several nodes that use mmio
> addresses next to the PCI addresses, e.g. apciec0, dart-apciec0,
> apciec0-piodma, dart-acio0, acio0, acio-cpu0, atc0-dpin0, atc-phy0,
> dart-usb0, and usb-drd0 in the 0x380000000-0x3ffffffff range, just
> before the MMIO space of the first PCIe bus, so it gets a little
> more complicated.
> 
> The actual device node could look something like
> 
> #define MAP_NONPOSTED 0x80000000
> 
> arm-io { /* name for adt, should be changed */
>      compatible = "apple,m1-internal-bus";
>      #address-cells = <2>; /* or maybe <3> if we want */
>      #size-cells = <2>;
>      ranges =
>                /* on-chip MMIO */
>                <(MAP_NONPOSTED | 0x2) 0x0   0x2 0x0 0x1 0x0>,
> 
>               /* first PCI: 2GB control, 4GB memory space */
>              <(MAP_NONPOSTED | 0x3) 0x80000000  0x3 0x80000000  0x0 0x80000000>,
>               <0x4 0x0   0x4 0x0  0x1 0x0>,
> 
>               /* second PCI: 2GB control, 4GB memory space */
>              <(MAP_NONPOSTED | 0x5) 0x0  0x5 0x0  0x0 0x80000000>,
>               <0x5 0x80000000  0x5 0x80000000  0x1>,
> 
>               /* third PCI  0.5GB control, 1.5GB memory space*/
>              <(MAP_NONPOSTED | 0x6) 0x80000000  0x6 0x80000000  0x0 0x20000000>,
>              <0x6 0xa0000000   0x6 0xa0000000  0x0 0x60000000>;
> }
> 
> The MAP_NONPOSTED flag then gets interpreted by the .translate() and
> .get_flags() callbacks of "struct of_bus" in the kernel, where it is put into
> a "struct resource" flag, and interpreted when the resource gets mapped.
> 
> The PCI host controller nests inside of the node above, and (in theory)
> uses the same trick to distinguish between prefetchable and non-prefetchable
> memory, except in practice this is handled in device drivers that already
> know whether to call ioremap() or ioremap_wc().

It is only PCI mmio space that needs to be nGnRE.  The PCI host
controller register space itself needs nGnRnE just like all other
integrated peripherals (or at least it works that way).

For U-Boot I'm using the following memory map:

static struct mm_region apple_mem_map[] = {
	{
		/* I/O */
		.virt = 0x200000000,
		.phys = 0x200000000,
		.size = 8UL * SZ_1G,
		.attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) |
			 PTE_BLOCK_NON_SHARE |
			 PTE_BLOCK_PXN | PTE_BLOCK_UXN
	}, {
		/* I/O */
		.virt = 0x500000000,
		.phys = 0x500000000,
		.size = 2UL * SZ_1G,
		.attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) |
			 PTE_BLOCK_NON_SHARE |
			 PTE_BLOCK_PXN | PTE_BLOCK_UXN
	}, {
		/* I/O */
		.virt = 0x680000000,
		.phys = 0x680000000,
		.size = SZ_512M,
		.attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) |
			 PTE_BLOCK_NON_SHARE |
			 PTE_BLOCK_PXN | PTE_BLOCK_UXN
	}, {
		/* PCIE */
		.virt = 0x6a0000000,
		.phys = 0x6a0000000,
		.size = SZ_512M,
		.attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRE) |
			 PTE_BLOCK_INNER_SHARE |
			 PTE_BLOCK_PXN | PTE_BLOCK_UXN
	}, {
		/* PCIE */
		.virt = 0x6c0000000,
		.phys = 0x6c0000000,
		.size = SZ_1G,
		.attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRE) |
			 PTE_BLOCK_INNER_SHARE |
			 PTE_BLOCK_PXN | PTE_BLOCK_UXN
	}, {
		/* RAM */
		.virt = 0x800000000,
		.phys = 0x800000000,
		.size = 8UL * SZ_1G,
		.attrs = PTE_BLOCK_MEMTYPE(MT_NORMAL) |
			 PTE_BLOCK_INNER_SHARE
	}, {
		/* List terminator */
		0,
	}
};

struct mm_region *mem_map = apple_mem_map;

This seems to work so far.  It only has the regions for one PCIe
controller.  I suppose the other two are there to support the TB4
ports?

So there is one 512M region for "32-bit" mmio starting at 0x6a0000000
and one 1G region for "64-bit" mmio starting at 0x6c0000000.

Cheers,

Mark
Hector Martin Feb. 9, 2021, 12:25 a.m. UTC | #4
On 09/02/2021 08.20, Mark Kettenis wrote:
> It is only PCI mmio space that needs to be nGnRE.  The PCI host
> controller register space itself needs nGnRnE just like all other
> integrated peripherals (or at least it works that way).

This is correct. Actually, as I just discovered, nGnRE writes to MMIO 
are not silently blackholed, but rather raise an SError. A certain other 
Linux loader masks those SErrors in a vendor register completely 
unnecessarily, which is why this isn't apparent when you use it. I never 
noticed this myself until now because when I first ran into it, it was 
breaking the UART, so of course I'd never see the SErrors, and I didn't 
try again after I learned more about the L2C SError control mechanism :-)

Testing now, it seems we can actually fairly neatly figure out where 
nGnRE is allowed and where not, as writes that fail due to that raise a 
SError with L2C_ERR_INF=3.

I probed writing to i<<28 for i = [0..255], using nGnRE. This reveals 
that nGnRE writes are allowed (i.e. either succeed or error out 
differently) in the following ranges:

0x400000000 - 0x4ffffffff (apciec0)
0x580000000 - 0x67fffffff (apciec1)
0x6a0000000 - 0x6ffffffff (apcie)

Which matches the `ranges` properties of the respective apcie devices in 
the ADT. The first two are obviously the TB3 ports, amd have more 
features (three ranges instead of two, presumably IO port ranges are 
supported on those, there's some extra DMA stuff, etc).

So the hardware behavior is to block nGnRE everywhere except in those 
ranges (i.e. the nGnRnE fault takes precedence over other errors, like 
the address not existing at all).
Arnd Bergmann Feb. 9, 2021, 9:15 a.m. UTC | #5
On Tue, Feb 9, 2021 at 1:25 AM Hector Martin <marcan@marcan.st> wrote:
> On 09/02/2021 08.20, Mark Kettenis wrote:
> I probed writing to i<<28 for i = [0..255], using nGnRE. This reveals
> that nGnRE writes are allowed (i.e. either succeed or error out
> differently) in the following ranges:
>
> 0x400000000 - 0x4ffffffff (apciec0)
> 0x580000000 - 0x67fffffff (apciec1)
> 0x6a0000000 - 0x6ffffffff (apcie)
>
> Which matches the `ranges` properties of the respective apcie devices in
> the ADT.

Right, these are the same ranges that I found in the adt and that Mark
listed in his code snippet, so it seems we all see the same partitioning
of the address space. I also see them reflected in the
/defaults/pmap-io-ranges property in ADT, which seems to have an entry
for every register range that has some mmio registers, along with what
appears to be a bitmask of some attributes, and it clearly shows
the above ranges as having a distinct set of bits from the others
(in little-endian):

 00000000 04000000 00000080 00000000 27000080 65494350
 00000080 04000000 00000080 00000000 27000080 65494350
 00000080 05000000 00000080 00000000 27000080 65494350
 00000000 06000000 00000080 00000000 27000080 65494350
 000000a0 06000000 00000020 00000000 27000080 65494350
 000000c0 06000000 00000040 00000000 27000080 65494350
   ^64-bit address       ^64-bit length            ^ 64-bit flags?

As opposed to e.g.

 0000f002 05000000 00400000 00000000 07400000 54524144
 0000f802 05000000 00400000 00000000 07400000 54524144
 00800021 05000000 00400000 00000000 07400000 44495344
 0000a801 05000000 00400000 00000000 07400000 54524144
00000367 02000000 00400000 00000000 07400000 54524144
...

There is one more entry for the 0x700000000-0x7ffffffff range, which
has yet another distinct bitmask, but does not seem to correspond
to any registers listed in other nodes.

> The first two are obviously the TB3 ports, and have more
> features (three ranges instead of two, presumably IO port ranges are
> supported on those, there's some extra DMA stuff, etc).

The PCI ranges property identifies these as 64-bit prefetchable (0x43),
32-bit non-prefetchable (0x02), and 32-bit pre prefetchable (0x42)
respectively. The third bus only lacks the 32-bit prefetchable range,
that is normally ok. Is this the NVMe host or something else?

None of them have an I/O space ranges though, only memory space.

> So the hardware behavior is to block nGnRE everywhere except in those
> ranges (i.e. the nGnRnE fault takes precedence over other errors, like
> the address not existing at all).

Ok, so if we want this to get encoded in a 'struct resource' flag, the PCI
resources should work just fine as these resources come from the
PCI layer rather than of_address_to_resource(). I think it would be
reasonable here to add something to of_address_to_resource() to
set such a flag if we can find an unused one, and then require the
drivers for this platform to go through devm_ioremap_resource()
or similar.

       Arnd
Arnd Bergmann Feb. 9, 2021, 9:35 a.m. UTC | #6
On Tue, Feb 9, 2021 at 12:20 AM Mark Kettenis <mark.kettenis@xs4all.nl> wrote:
> > From: Arnd Bergmann <arnd@kernel.org>
> It is only PCI mmio space that needs to be nGnRE.  The PCI host
> controller register space itself needs nGnRnE just like all other
> integrated peripherals (or at least it works that way).
>
> For U-Boot I'm using the following memory map:
>
> static struct mm_region apple_mem_map[] = {
>         {
>                 /* I/O */
>                 .virt = 0x200000000,
>                 .phys = 0x200000000,
>                 .size = 8UL * SZ_1G,
>                 .attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) |
>                          PTE_BLOCK_NON_SHARE |
>                          PTE_BLOCK_PXN | PTE_BLOCK_UXN
>         }, {...
>                 /* List terminator */
>                 0,
>         }
> };

Right, that list agrees mostly what I found, except I didn't see a
continuous 8GB range at 0x200000000 in the ADT ranges properties
but instead a 4GB range for most of the on-chip components plus a
2GB range for stuff that is related to the first PCIe host, starting at
0x380000000 and nothing inbetween. I suppose this makes no
practical difference.

For the PCIe ranges, I see that only one of them is non-prefetchable,
while the other two ranges are marked as prefetchable. These usually
get turned into ioremap_wc() or ioremap_wt() mappings in Linux,
which are more relaxed than MT_DEVICE_NGNRE. Having
stricter attributes in u-boot shouldn't hurt, it might just be slightly
slower than it has to be.

> struct mm_region *mem_map = apple_mem_map;
>
> This seems to work so far.  It only has the regions for one PCIe
> controller.  I suppose the other two are there to support the TB4
> ports?
>
> So there is one 512M region for "32-bit" mmio starting at 0x6a0000000
> and one 1G region for "64-bit" mmio starting at 0x6c0000000.

What is connected to this controller?

        Arnd
Mark Kettenis Feb. 9, 2021, 9:58 a.m. UTC | #7
> From: Arnd Bergmann <arnd@kernel.org>
> Date: Tue, 9 Feb 2021 10:15:31 +0100
> 
> On Tue, Feb 9, 2021 at 1:25 AM Hector Martin <marcan@marcan.st> wrote:
> > On 09/02/2021 08.20, Mark Kettenis wrote:
> > I probed writing to i<<28 for i = [0..255], using nGnRE. This reveals
> > that nGnRE writes are allowed (i.e. either succeed or error out
> > differently) in the following ranges:
> >
> > 0x400000000 - 0x4ffffffff (apciec0)
> > 0x580000000 - 0x67fffffff (apciec1)
> > 0x6a0000000 - 0x6ffffffff (apcie)
> >
> > Which matches the `ranges` properties of the respective apcie devices in
> > the ADT.
> 
> Right, these are the same ranges that I found in the adt and that Mark
> listed in his code snippet, so it seems we all see the same partitioning
> of the address space. I also see them reflected in the
> /defaults/pmap-io-ranges property in ADT, which seems to have an entry
> for every register range that has some mmio registers, along with what
> appears to be a bitmask of some attributes, and it clearly shows
> the above ranges as having a distinct set of bits from the others
> (in little-endian):
> 
>  00000000 04000000 00000080 00000000 27000080 65494350
>  00000080 04000000 00000080 00000000 27000080 65494350
>  00000080 05000000 00000080 00000000 27000080 65494350
>  00000000 06000000 00000080 00000000 27000080 65494350
>  000000a0 06000000 00000020 00000000 27000080 65494350
>  000000c0 06000000 00000040 00000000 27000080 65494350
>    ^64-bit address       ^64-bit length            ^ 64-bit flags?
> 
> As opposed to e.g.
> 
>  0000f002 05000000 00400000 00000000 07400000 54524144
>  0000f802 05000000 00400000 00000000 07400000 54524144
>  00800021 05000000 00400000 00000000 07400000 44495344
>  0000a801 05000000 00400000 00000000 07400000 54524144
> 00000367 02000000 00400000 00000000 07400000 54524144
> ...
> 
> There is one more entry for the 0x700000000-0x7ffffffff range, which
> has yet another distinct bitmask, but does not seem to correspond
> to any registers listed in other nodes.
> 
> > The first two are obviously the TB3 ports, and have more
> > features (three ranges instead of two, presumably IO port ranges are
> > supported on those, there's some extra DMA stuff, etc).
> 
> The PCI ranges property identifies these as 64-bit prefetchable (0x43),
> 32-bit non-prefetchable (0x02), and 32-bit pre prefetchable (0x42)
> respectively. The third bus only lacks the 32-bit prefetchable range,
> that is normally ok. Is this the NVMe host or something else?

Third bus has Broadcom WiFi, Fresco Logic xHCI and Broadcom Ethernet,
so all the onboard PCIe devices.
Hector Martin Feb. 9, 2021, 11:22 a.m. UTC | #8
On 09/02/2021 18.15, Arnd Bergmann wrote:
> Right, these are the same ranges that I found in the adt and that Mark
> listed in his code snippet, so it seems we all see the same partitioning
> of the address space. I also see them reflected in the
> /defaults/pmap-io-ranges property in ADT, which seems to have an entry
> for every register range that has some mmio registers, along with what
> appears to be a bitmask of some attributes, and it clearly shows
> the above ranges as having a distinct set of bits from the others
> (in little-endian):
> 
>   00000000 04000000 00000080 00000000 27000080 65494350
>   00000080 04000000 00000080 00000000 27000080 65494350
>   00000080 05000000 00000080 00000000 27000080 65494350
>   00000000 06000000 00000080 00000000 27000080 65494350
>   000000a0 06000000 00000020 00000000 27000080 65494350
>   000000c0 06000000 00000040 00000000 27000080 65494350
>     ^64-bit address       ^64-bit length            ^ 64-bit flags?

That's ASCII :-)

'PCIe'

> 
> As opposed to e.g.
> 
>   0000f002 05000000 00400000 00000000 07400000 54524144

'DART'

>   00800021 05000000 00400000 00000000 07400000 44495344

'DSID'

> Ok, so if we want this to get encoded in a 'struct resource' flag, the PCI
> resources should work just fine as these resources come from the
> PCI layer rather than of_address_to_resource(). I think it would be
> reasonable here to add something to of_address_to_resource() to
> set such a flag if we can find an unused one, and then require the
> drivers for this platform to go through devm_ioremap_resource()
> or similar.

This sounds reasonable. For setting such a flag, I guess looking for a 
property (inherited from parents) would make sense. `mmio-map-mode = 
"nonposted"` or something like that?
Hector Martin Feb. 10, 2021, 12:24 p.m. UTC | #9
Hi Will, I'm pulling you in at Marc's suggestion. Do you have an opinion 
on what the better solution to this problem is?

The executive summary is that Apple SoCs require nGnRnE memory mappings 
for MMIO, except for PCI space which uses nGnRE. ARM64 currently uses 
nGnRE everywhere. Solutions discussed in this thread and elsewhere include:

1. Something similar to the current hacky patch (which isn't really 
intended as a real solution...); change ioremap to default to nGnRnE 
using some kind of platform-level flag in the DT, then figure out how to 
get the PCI drivers to bypass that. This requires changing almost every 
PCI driver, since most of them use plain ioremap().

2. A per-device DT property that tells of_address_to_resource to set a 
flag in the struct resource, which is then used by 
devm_ioremap_resource/of_iomap to pick a new ioremap_ variant for nGnRnE 
(introduce ioremap_np() for nonposted?) (PCI would work with this 
inasmuch as it would not support it, and thus fall back to the current 
nGnRE default, which is correct for PCI...). This requires changing 
DT-binding drivers that we depend on to not use plain ioremap() (if any 
do so), but that is a finite subset (unlike PCI which involves 
potentially every driver, because thunderbolt).

3. Encode the mapping type in the address of child devices, either 
abusing the high bits of the reg or introducing an extra DT cell there, 
introduce a new OF bus type that strips those away via a ranges mapping 
and sets flags in the struct resource, similar to what the PCI bus does 
with its 3-cell ranges, then do as (2) above and make 
devm_ioremap_resource/of_iomap use it:

On 09/02/2021 07.57, Arnd Bergmann wrote:
> #define MAP_NONPOSTED 0x80000000
> 
> arm-io { /* name for adt, should be changed */
>       compatible = "apple,m1-internal-bus";
>       #address-cells = <2>; /* or maybe <3> if we want */
>       #size-cells = <2>;
>       ranges =
>                 /* on-chip MMIO */
>                 <(MAP_NONPOSTED | 0x2) 0x0   0x2 0x0 0x1 0x0>,
> 
>                /* first PCI: 2GB control, 4GB memory space */
>               <(MAP_NONPOSTED | 0x3) 0x80000000  0x3 0x80000000  0x0 0x80000000>,
>                <0x4 0x0   0x4 0x0  0x1 0x0>,
[...]

> The MAP_NONPOSTED flag then gets interpreted by the .translate() and
> .get_flags() callbacks of "struct of_bus" in the kernel, where it is put into
> a "struct resource" flag, and interpreted when the resource gets mapped.
> 
> The PCI host controller nests inside of the node above, and (in theory)
> uses the same trick to distinguish between prefetchable and non-prefetchable
> memory, except in practice this is handled in device drivers that already
> know whether to call ioremap() or ioremap_wc().

4. Introduce a new top-level DT element in the style of reserved-memory, 
that describes address ranges and the mapping type to use. This could be 
implemented entirely in arch code, having arm64's ioremap look up the 
address in a structure populated from this.

As an additional wrinkle, earlycon is almost certainly going to need a 
special path to handle this very early, before OF stuff is available; it 
also uses fixmap instead of ioremap, which has its own idea about what 
type of mapping to use.
Mark Kettenis Feb. 10, 2021, 1:40 p.m. UTC | #10
> From: Hector Martin <marcan@marcan.st>
> Date: Wed, 10 Feb 2021 21:24:15 +0900

Hi Hector,

Since device tree bindings are widely used outside the Linux tree,
here are some thoughts from a U-Boot and OpenBSD perspective.

> Hi Will, I'm pulling you in at Marc's suggestion. Do you have an opinion 
> on what the better solution to this problem is?
> 
> The executive summary is that Apple SoCs require nGnRnE memory mappings 
> for MMIO, except for PCI space which uses nGnRE. ARM64 currently uses 
> nGnRE everywhere. Solutions discussed in this thread and elsewhere include:
> 
> 1. Something similar to the current hacky patch (which isn't really 
> intended as a real solution...); change ioremap to default to nGnRnE 
> using some kind of platform-level flag in the DT, then figure out how to 
> get the PCI drivers to bypass that. This requires changing almost every 
> PCI driver, since most of them use plain ioremap().
> 
> 2. A per-device DT property that tells of_address_to_resource to set a 
> flag in the struct resource, which is then used by 
> devm_ioremap_resource/of_iomap to pick a new ioremap_ variant for nGnRnE 
> (introduce ioremap_np() for nonposted?) (PCI would work with this 
> inasmuch as it would not support it, and thus fall back to the current 
> nGnRE default, which is correct for PCI...). This requires changing 
> DT-binding drivers that we depend on to not use plain ioremap() (if any 
> do so), but that is a finite subset (unlike PCI which involves 
> potentially every driver, because thunderbolt).

That solution is not dissimilar to how "dma-coherent", "big-endian"
and "little-endian" properties work.  U-Boot could simply ignore the
property since it already has a memory map with the required memory
attributes for each SoC.  I don't see any issue with this for the
OpenBSD kernel either.

The number of existing drivers that would need to be changed is small.
The dwc3 core driver already uses devm_ioremap_resource().  The nvme
driver uses plain ioremap() in the PCI-specific part of the driver,
but that will need new glue for a platform driver anyway.

As things stand now that leaves us with the samsung serial driver
which uses devm_ioremap().  That could be turned into a
devm_iomap_resource() without much trouble I think.

> 3. Encode the mapping type in the address of child devices, either 
> abusing the high bits of the reg or introducing an extra DT cell there, 
> introduce a new OF bus type that strips those away via a ranges mapping 
> and sets flags in the struct resource, similar to what the PCI bus does 
> with its 3-cell ranges, then do as (2) above and make 
> devm_ioremap_resource/of_iomap use it:
> 
> On 09/02/2021 07.57, Arnd Bergmann wrote:
> > #define MAP_NONPOSTED 0x80000000
> > 
> > arm-io { /* name for adt, should be changed */
> >       compatible = "apple,m1-internal-bus";
> >       #address-cells = <2>; /* or maybe <3> if we want */
> >       #size-cells = <2>;
> >       ranges =
> >                 /* on-chip MMIO */
> >                 <(MAP_NONPOSTED | 0x2) 0x0   0x2 0x0 0x1 0x0>,
> > 
> >                /* first PCI: 2GB control, 4GB memory space */
> >               <(MAP_NONPOSTED | 0x3) 0x80000000  0x3 0x80000000  0x0 0x80000000>,
> >                <0x4 0x0   0x4 0x0  0x1 0x0>,
> [...]
> 
> > The MAP_NONPOSTED flag then gets interpreted by the .translate() and
> > .get_flags() callbacks of "struct of_bus" in the kernel, where it is put into
> > a "struct resource" flag, and interpreted when the resource gets mapped.
> > 
> > The PCI host controller nests inside of the node above, and (in theory)
> > uses the same trick to distinguish between prefetchable and non-prefetchable
> > memory, except in practice this is handled in device drivers that already
> > know whether to call ioremap() or ioremap_wc().

Using the high bit in the address would be awkward I think.  For
example in U-Boot I'd have to mask that bit away but doing so in a
per-SoC way would be ugly.  Unless you map the high bit away using a
ranges property for the bus.

Using #address-cells = <3> wll cause some fallout as well as it is
convenient to store addresses as 64-bit integers.  I've written code
that just panics if that isn't possible.

> 4. Introduce a new top-level DT element in the style of reserved-memory, 
> that describes address ranges and the mapping type to use. This could be 
> implemented entirely in arch code, having arm64's ioremap look up the 
> address in a structure populated from this.

This isn't a strange idea either.  For UEFI such a mapping already
exists as a separate table that encodes the cache attributes of
certain memory regions.
 
> As an additional wrinkle, earlycon is almost certainly going to need a 
> special path to handle this very early, before OF stuff is available; it 
> also uses fixmap instead of ioremap, which has its own idea about what 
> type of mapping to use.
diff mbox series

Patch

diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
index 5ea8656a2030..f2609a4f5019 100644
--- a/arch/arm64/include/asm/io.h
+++ b/arch/arm64/include/asm/io.h
@@ -167,7 +167,14 @@  extern void __iomem *__ioremap(phys_addr_t phys_addr, size_t size, pgprot_t prot
 extern void iounmap(volatile void __iomem *addr);
 extern void __iomem *ioremap_cache(phys_addr_t phys_addr, size_t size);
 
-#define ioremap(addr, size)		__ioremap((addr), (size), __pgprot(PROT_DEVICE_nGnRE))
+/*
+ * Some platforms require nGnRnE for MMIO.
+ */
+extern bool arm64_use_ne_io;
+
+#define ioremap(addr, size)		__ioremap((addr), (size), \
+						  arm64_use_ne_io ? __pgprot(PROT_DEVICE_nGnRnE) \
+								  : __pgprot(PROT_DEVICE_nGnRE))
 #define ioremap_wc(addr, size)		__ioremap((addr), (size), __pgprot(PROT_NORMAL_NC))
 
 /*