Message ID | 20210204203951.52105-14-marcan@marcan.st (mailing list archive) |
---|---|
State | Changes Requested |
Headers | show |
Series | Apple M1 SoC platform bring-up | expand |
On Thu, Feb 4, 2021 at 9:39 PM Hector Martin <marcan@marcan.st> wrote: > > Currently there are three ioremap variants: > > ioremap() > ioremap_wc() > ioremap_uc() (not normally used on arm64) > > None of these really capture the nGnRE vs nGnRnE distinction. If > a new variant is introduced in common code, we'd have to provide > a default implementation that falls back to regular ioremap() on > other arches. Something like ioremap() vs. ioremap_np() (nonposted)? The ioport_map() function could be considered a variant of nonposted I/O, as being nonposted is a requirement for PCI I/O space. It's a bit weird to overload it here, as I/O space has a number of other special cases, including a limit for the total size of the address space, and the assumption that all I/O ports are always mapped into virtual addresses at boot time. Are the registers that need nGnRnE all part of a well-defined physical address range that we could pretend to be I/O space? Also, is the actual PCI I/O space within this region? The main advantage here would be that we could reuse the IORESOURCE_IO bit to signify a register in this area. Note: I don't actually think this is going to be a good solution, just throwing it out as another alternative in case everything else ends up being worse. > (2) The converse of (1): keep the nGnRE default, but introduce special > casing to the OF binding code to use nGnRnE when instructed to do so > on these platforms. This means of_iomap() needs changing. It probably also means changing of_address_to_resource(), and devm_ioremap_resource(). > (3) Do it at a lower level, in ioremap() itself. This requires that > ioremap() somehow discriminates based on address range to pick what > kind of mapping to make. > > Declaring these address ranges would be an issue. Options: > > a) An out of band list in a DT node, a la /reserved-memory > > b) Something based on the existing DT hierarchy, where we can scan > bus ranges and locate buses with a property that says "nGnRnE" or > "nGnRE" and dynamically build the list based on that. > > The advantage of this option is that it doesn't touch non-arch code. > The disadvantage is that it adds a complete new bespoke mechanism to > the DT, and that it does not let device drivers actually select the > IO mode, which might be desirable in the future anyway for some > devices. > > All discussion and additional ideas welcome. A very simple but ugly hack would be to take one of the high address bits in phys_addr_t to encode the type, and then pass that through all the way into the ioremap implementation. This does have some precedent with the upper bits of the (96-bit) PCI addresses encoding the type of resource, but it also conflates the DT representation with the arm64 kernel implementation and requires extending both in a fairly generic way to do something that is highly platform specific. Arnd
On Thu, Feb 4, 2021 at 9:39 PM Hector Martin <marcan@marcan.st> wrote: > (3) Do it at a lower level, in ioremap() itself. This requires that > ioremap() somehow discriminates based on address range to pick what > kind of mapping to make. > > Declaring these address ranges would be an issue. Options: > > a) An out of band list in a DT node, a la /reserved-memory > > b) Something based on the existing DT hierarchy, where we can scan > bus ranges and locate buses with a property that says "nGnRnE" or > "nGnRE" and dynamically build the list based on that. > > The advantage of this option is that it doesn't touch non-arch code. > The disadvantage is that it adds a complete new bespoke mechanism to > the DT, and that it does not let device drivers actually select the > IO mode, which might be desirable in the future anyway for some > devices. I tried investigating further what this would look like, but scanning through the ADT dump for what nodes use which register ranges. At first it seemed the range 0x200000000-0x2ffffffff is used for all normal devices, while the three PCI buses fall into the 0x380000000-0x4ffffffff, 0x500000000-0x67fffffff and 0x680000000-0x6ffffffff ranges respectively. This would allow a nice abstraction where one node contains all the devices in the 0x200000000-0x2ffffffff, and we do a translation in of_address_to_resource(), similar to what we have for pci and isa nodes with their special addresses. However, I did find that there are several nodes that use mmio addresses next to the PCI addresses, e.g. apciec0, dart-apciec0, apciec0-piodma, dart-acio0, acio0, acio-cpu0, atc0-dpin0, atc-phy0, dart-usb0, and usb-drd0 in the 0x380000000-0x3ffffffff range, just before the MMIO space of the first PCIe bus, so it gets a little more complicated. The actual device node could look something like #define MAP_NONPOSTED 0x80000000 arm-io { /* name for adt, should be changed */ compatible = "apple,m1-internal-bus"; #address-cells = <2>; /* or maybe <3> if we want */ #size-cells = <2>; ranges = /* on-chip MMIO */ <(MAP_NONPOSTED | 0x2) 0x0 0x2 0x0 0x1 0x0>, /* first PCI: 2GB control, 4GB memory space */ <(MAP_NONPOSTED | 0x3) 0x80000000 0x3 0x80000000 0x0 0x80000000>, <0x4 0x0 0x4 0x0 0x1 0x0>, /* second PCI: 2GB control, 4GB memory space */ <(MAP_NONPOSTED | 0x5) 0x0 0x5 0x0 0x0 0x80000000>, <0x5 0x80000000 0x5 0x80000000 0x1>, /* third PCI 0.5GB control, 1.5GB memory space*/ <(MAP_NONPOSTED | 0x6) 0x80000000 0x6 0x80000000 0x0 0x20000000>, <0x6 0xa0000000 0x6 0xa0000000 0x0 0x60000000>; } The MAP_NONPOSTED flag then gets interpreted by the .translate() and .get_flags() callbacks of "struct of_bus" in the kernel, where it is put into a "struct resource" flag, and interpreted when the resource gets mapped. The PCI host controller nests inside of the node above, and (in theory) uses the same trick to distinguish between prefetchable and non-prefetchable memory, except in practice this is handled in device drivers that already know whether to call ioremap() or ioremap_wc(). Arnd
> From: Arnd Bergmann <arnd@kernel.org> > Date: Mon, 8 Feb 2021 23:57:20 +0100 > > On Thu, Feb 4, 2021 at 9:39 PM Hector Martin <marcan@marcan.st> wrote: > > > (3) Do it at a lower level, in ioremap() itself. This requires that > > ioremap() somehow discriminates based on address range to pick what > > kind of mapping to make. > > > > Declaring these address ranges would be an issue. Options: > > > > a) An out of band list in a DT node, a la /reserved-memory > > > > b) Something based on the existing DT hierarchy, where we can scan > > bus ranges and locate buses with a property that says "nGnRnE" or > > "nGnRE" and dynamically build the list based on that. > > > > The advantage of this option is that it doesn't touch non-arch code. > > The disadvantage is that it adds a complete new bespoke mechanism to > > the DT, and that it does not let device drivers actually select the > > IO mode, which might be desirable in the future anyway for some > > devices. > > I tried investigating further what this would look like, but scanning through > the ADT dump for what nodes use which register ranges. At first it seemed > the range 0x200000000-0x2ffffffff is used for all normal devices, while > the three PCI buses fall into the 0x380000000-0x4ffffffff, > 0x500000000-0x67fffffff and 0x680000000-0x6ffffffff ranges > respectively. This would allow a nice abstraction where one node > contains all the devices in the 0x200000000-0x2ffffffff, and we do a > translation in of_address_to_resource(), similar to what we have for > pci and isa nodes with their special addresses. > > However, I did find that there are several nodes that use mmio > addresses next to the PCI addresses, e.g. apciec0, dart-apciec0, > apciec0-piodma, dart-acio0, acio0, acio-cpu0, atc0-dpin0, atc-phy0, > dart-usb0, and usb-drd0 in the 0x380000000-0x3ffffffff range, just > before the MMIO space of the first PCIe bus, so it gets a little > more complicated. > > The actual device node could look something like > > #define MAP_NONPOSTED 0x80000000 > > arm-io { /* name for adt, should be changed */ > compatible = "apple,m1-internal-bus"; > #address-cells = <2>; /* or maybe <3> if we want */ > #size-cells = <2>; > ranges = > /* on-chip MMIO */ > <(MAP_NONPOSTED | 0x2) 0x0 0x2 0x0 0x1 0x0>, > > /* first PCI: 2GB control, 4GB memory space */ > <(MAP_NONPOSTED | 0x3) 0x80000000 0x3 0x80000000 0x0 0x80000000>, > <0x4 0x0 0x4 0x0 0x1 0x0>, > > /* second PCI: 2GB control, 4GB memory space */ > <(MAP_NONPOSTED | 0x5) 0x0 0x5 0x0 0x0 0x80000000>, > <0x5 0x80000000 0x5 0x80000000 0x1>, > > /* third PCI 0.5GB control, 1.5GB memory space*/ > <(MAP_NONPOSTED | 0x6) 0x80000000 0x6 0x80000000 0x0 0x20000000>, > <0x6 0xa0000000 0x6 0xa0000000 0x0 0x60000000>; > } > > The MAP_NONPOSTED flag then gets interpreted by the .translate() and > .get_flags() callbacks of "struct of_bus" in the kernel, where it is put into > a "struct resource" flag, and interpreted when the resource gets mapped. > > The PCI host controller nests inside of the node above, and (in theory) > uses the same trick to distinguish between prefetchable and non-prefetchable > memory, except in practice this is handled in device drivers that already > know whether to call ioremap() or ioremap_wc(). It is only PCI mmio space that needs to be nGnRE. The PCI host controller register space itself needs nGnRnE just like all other integrated peripherals (or at least it works that way). For U-Boot I'm using the following memory map: static struct mm_region apple_mem_map[] = { { /* I/O */ .virt = 0x200000000, .phys = 0x200000000, .size = 8UL * SZ_1G, .attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | PTE_BLOCK_NON_SHARE | PTE_BLOCK_PXN | PTE_BLOCK_UXN }, { /* I/O */ .virt = 0x500000000, .phys = 0x500000000, .size = 2UL * SZ_1G, .attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | PTE_BLOCK_NON_SHARE | PTE_BLOCK_PXN | PTE_BLOCK_UXN }, { /* I/O */ .virt = 0x680000000, .phys = 0x680000000, .size = SZ_512M, .attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | PTE_BLOCK_NON_SHARE | PTE_BLOCK_PXN | PTE_BLOCK_UXN }, { /* PCIE */ .virt = 0x6a0000000, .phys = 0x6a0000000, .size = SZ_512M, .attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRE) | PTE_BLOCK_INNER_SHARE | PTE_BLOCK_PXN | PTE_BLOCK_UXN }, { /* PCIE */ .virt = 0x6c0000000, .phys = 0x6c0000000, .size = SZ_1G, .attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRE) | PTE_BLOCK_INNER_SHARE | PTE_BLOCK_PXN | PTE_BLOCK_UXN }, { /* RAM */ .virt = 0x800000000, .phys = 0x800000000, .size = 8UL * SZ_1G, .attrs = PTE_BLOCK_MEMTYPE(MT_NORMAL) | PTE_BLOCK_INNER_SHARE }, { /* List terminator */ 0, } }; struct mm_region *mem_map = apple_mem_map; This seems to work so far. It only has the regions for one PCIe controller. I suppose the other two are there to support the TB4 ports? So there is one 512M region for "32-bit" mmio starting at 0x6a0000000 and one 1G region for "64-bit" mmio starting at 0x6c0000000. Cheers, Mark
On 09/02/2021 08.20, Mark Kettenis wrote: > It is only PCI mmio space that needs to be nGnRE. The PCI host > controller register space itself needs nGnRnE just like all other > integrated peripherals (or at least it works that way). This is correct. Actually, as I just discovered, nGnRE writes to MMIO are not silently blackholed, but rather raise an SError. A certain other Linux loader masks those SErrors in a vendor register completely unnecessarily, which is why this isn't apparent when you use it. I never noticed this myself until now because when I first ran into it, it was breaking the UART, so of course I'd never see the SErrors, and I didn't try again after I learned more about the L2C SError control mechanism :-) Testing now, it seems we can actually fairly neatly figure out where nGnRE is allowed and where not, as writes that fail due to that raise a SError with L2C_ERR_INF=3. I probed writing to i<<28 for i = [0..255], using nGnRE. This reveals that nGnRE writes are allowed (i.e. either succeed or error out differently) in the following ranges: 0x400000000 - 0x4ffffffff (apciec0) 0x580000000 - 0x67fffffff (apciec1) 0x6a0000000 - 0x6ffffffff (apcie) Which matches the `ranges` properties of the respective apcie devices in the ADT. The first two are obviously the TB3 ports, amd have more features (three ranges instead of two, presumably IO port ranges are supported on those, there's some extra DMA stuff, etc). So the hardware behavior is to block nGnRE everywhere except in those ranges (i.e. the nGnRnE fault takes precedence over other errors, like the address not existing at all).
On Tue, Feb 9, 2021 at 1:25 AM Hector Martin <marcan@marcan.st> wrote: > On 09/02/2021 08.20, Mark Kettenis wrote: > I probed writing to i<<28 for i = [0..255], using nGnRE. This reveals > that nGnRE writes are allowed (i.e. either succeed or error out > differently) in the following ranges: > > 0x400000000 - 0x4ffffffff (apciec0) > 0x580000000 - 0x67fffffff (apciec1) > 0x6a0000000 - 0x6ffffffff (apcie) > > Which matches the `ranges` properties of the respective apcie devices in > the ADT. Right, these are the same ranges that I found in the adt and that Mark listed in his code snippet, so it seems we all see the same partitioning of the address space. I also see them reflected in the /defaults/pmap-io-ranges property in ADT, which seems to have an entry for every register range that has some mmio registers, along with what appears to be a bitmask of some attributes, and it clearly shows the above ranges as having a distinct set of bits from the others (in little-endian): 00000000 04000000 00000080 00000000 27000080 65494350 00000080 04000000 00000080 00000000 27000080 65494350 00000080 05000000 00000080 00000000 27000080 65494350 00000000 06000000 00000080 00000000 27000080 65494350 000000a0 06000000 00000020 00000000 27000080 65494350 000000c0 06000000 00000040 00000000 27000080 65494350 ^64-bit address ^64-bit length ^ 64-bit flags? As opposed to e.g. 0000f002 05000000 00400000 00000000 07400000 54524144 0000f802 05000000 00400000 00000000 07400000 54524144 00800021 05000000 00400000 00000000 07400000 44495344 0000a801 05000000 00400000 00000000 07400000 54524144 00000367 02000000 00400000 00000000 07400000 54524144 ... There is one more entry for the 0x700000000-0x7ffffffff range, which has yet another distinct bitmask, but does not seem to correspond to any registers listed in other nodes. > The first two are obviously the TB3 ports, and have more > features (three ranges instead of two, presumably IO port ranges are > supported on those, there's some extra DMA stuff, etc). The PCI ranges property identifies these as 64-bit prefetchable (0x43), 32-bit non-prefetchable (0x02), and 32-bit pre prefetchable (0x42) respectively. The third bus only lacks the 32-bit prefetchable range, that is normally ok. Is this the NVMe host or something else? None of them have an I/O space ranges though, only memory space. > So the hardware behavior is to block nGnRE everywhere except in those > ranges (i.e. the nGnRnE fault takes precedence over other errors, like > the address not existing at all). Ok, so if we want this to get encoded in a 'struct resource' flag, the PCI resources should work just fine as these resources come from the PCI layer rather than of_address_to_resource(). I think it would be reasonable here to add something to of_address_to_resource() to set such a flag if we can find an unused one, and then require the drivers for this platform to go through devm_ioremap_resource() or similar. Arnd
On Tue, Feb 9, 2021 at 12:20 AM Mark Kettenis <mark.kettenis@xs4all.nl> wrote: > > From: Arnd Bergmann <arnd@kernel.org> > It is only PCI mmio space that needs to be nGnRE. The PCI host > controller register space itself needs nGnRnE just like all other > integrated peripherals (or at least it works that way). > > For U-Boot I'm using the following memory map: > > static struct mm_region apple_mem_map[] = { > { > /* I/O */ > .virt = 0x200000000, > .phys = 0x200000000, > .size = 8UL * SZ_1G, > .attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | > PTE_BLOCK_NON_SHARE | > PTE_BLOCK_PXN | PTE_BLOCK_UXN > }, {... > /* List terminator */ > 0, > } > }; Right, that list agrees mostly what I found, except I didn't see a continuous 8GB range at 0x200000000 in the ADT ranges properties but instead a 4GB range for most of the on-chip components plus a 2GB range for stuff that is related to the first PCIe host, starting at 0x380000000 and nothing inbetween. I suppose this makes no practical difference. For the PCIe ranges, I see that only one of them is non-prefetchable, while the other two ranges are marked as prefetchable. These usually get turned into ioremap_wc() or ioremap_wt() mappings in Linux, which are more relaxed than MT_DEVICE_NGNRE. Having stricter attributes in u-boot shouldn't hurt, it might just be slightly slower than it has to be. > struct mm_region *mem_map = apple_mem_map; > > This seems to work so far. It only has the regions for one PCIe > controller. I suppose the other two are there to support the TB4 > ports? > > So there is one 512M region for "32-bit" mmio starting at 0x6a0000000 > and one 1G region for "64-bit" mmio starting at 0x6c0000000. What is connected to this controller? Arnd
> From: Arnd Bergmann <arnd@kernel.org> > Date: Tue, 9 Feb 2021 10:15:31 +0100 > > On Tue, Feb 9, 2021 at 1:25 AM Hector Martin <marcan@marcan.st> wrote: > > On 09/02/2021 08.20, Mark Kettenis wrote: > > I probed writing to i<<28 for i = [0..255], using nGnRE. This reveals > > that nGnRE writes are allowed (i.e. either succeed or error out > > differently) in the following ranges: > > > > 0x400000000 - 0x4ffffffff (apciec0) > > 0x580000000 - 0x67fffffff (apciec1) > > 0x6a0000000 - 0x6ffffffff (apcie) > > > > Which matches the `ranges` properties of the respective apcie devices in > > the ADT. > > Right, these are the same ranges that I found in the adt and that Mark > listed in his code snippet, so it seems we all see the same partitioning > of the address space. I also see them reflected in the > /defaults/pmap-io-ranges property in ADT, which seems to have an entry > for every register range that has some mmio registers, along with what > appears to be a bitmask of some attributes, and it clearly shows > the above ranges as having a distinct set of bits from the others > (in little-endian): > > 00000000 04000000 00000080 00000000 27000080 65494350 > 00000080 04000000 00000080 00000000 27000080 65494350 > 00000080 05000000 00000080 00000000 27000080 65494350 > 00000000 06000000 00000080 00000000 27000080 65494350 > 000000a0 06000000 00000020 00000000 27000080 65494350 > 000000c0 06000000 00000040 00000000 27000080 65494350 > ^64-bit address ^64-bit length ^ 64-bit flags? > > As opposed to e.g. > > 0000f002 05000000 00400000 00000000 07400000 54524144 > 0000f802 05000000 00400000 00000000 07400000 54524144 > 00800021 05000000 00400000 00000000 07400000 44495344 > 0000a801 05000000 00400000 00000000 07400000 54524144 > 00000367 02000000 00400000 00000000 07400000 54524144 > ... > > There is one more entry for the 0x700000000-0x7ffffffff range, which > has yet another distinct bitmask, but does not seem to correspond > to any registers listed in other nodes. > > > The first two are obviously the TB3 ports, and have more > > features (three ranges instead of two, presumably IO port ranges are > > supported on those, there's some extra DMA stuff, etc). > > The PCI ranges property identifies these as 64-bit prefetchable (0x43), > 32-bit non-prefetchable (0x02), and 32-bit pre prefetchable (0x42) > respectively. The third bus only lacks the 32-bit prefetchable range, > that is normally ok. Is this the NVMe host or something else? Third bus has Broadcom WiFi, Fresco Logic xHCI and Broadcom Ethernet, so all the onboard PCIe devices.
On 09/02/2021 18.15, Arnd Bergmann wrote: > Right, these are the same ranges that I found in the adt and that Mark > listed in his code snippet, so it seems we all see the same partitioning > of the address space. I also see them reflected in the > /defaults/pmap-io-ranges property in ADT, which seems to have an entry > for every register range that has some mmio registers, along with what > appears to be a bitmask of some attributes, and it clearly shows > the above ranges as having a distinct set of bits from the others > (in little-endian): > > 00000000 04000000 00000080 00000000 27000080 65494350 > 00000080 04000000 00000080 00000000 27000080 65494350 > 00000080 05000000 00000080 00000000 27000080 65494350 > 00000000 06000000 00000080 00000000 27000080 65494350 > 000000a0 06000000 00000020 00000000 27000080 65494350 > 000000c0 06000000 00000040 00000000 27000080 65494350 > ^64-bit address ^64-bit length ^ 64-bit flags? That's ASCII :-) 'PCIe' > > As opposed to e.g. > > 0000f002 05000000 00400000 00000000 07400000 54524144 'DART' > 00800021 05000000 00400000 00000000 07400000 44495344 'DSID' > Ok, so if we want this to get encoded in a 'struct resource' flag, the PCI > resources should work just fine as these resources come from the > PCI layer rather than of_address_to_resource(). I think it would be > reasonable here to add something to of_address_to_resource() to > set such a flag if we can find an unused one, and then require the > drivers for this platform to go through devm_ioremap_resource() > or similar. This sounds reasonable. For setting such a flag, I guess looking for a property (inherited from parents) would make sense. `mmio-map-mode = "nonposted"` or something like that?
Hi Will, I'm pulling you in at Marc's suggestion. Do you have an opinion on what the better solution to this problem is? The executive summary is that Apple SoCs require nGnRnE memory mappings for MMIO, except for PCI space which uses nGnRE. ARM64 currently uses nGnRE everywhere. Solutions discussed in this thread and elsewhere include: 1. Something similar to the current hacky patch (which isn't really intended as a real solution...); change ioremap to default to nGnRnE using some kind of platform-level flag in the DT, then figure out how to get the PCI drivers to bypass that. This requires changing almost every PCI driver, since most of them use plain ioremap(). 2. A per-device DT property that tells of_address_to_resource to set a flag in the struct resource, which is then used by devm_ioremap_resource/of_iomap to pick a new ioremap_ variant for nGnRnE (introduce ioremap_np() for nonposted?) (PCI would work with this inasmuch as it would not support it, and thus fall back to the current nGnRE default, which is correct for PCI...). This requires changing DT-binding drivers that we depend on to not use plain ioremap() (if any do so), but that is a finite subset (unlike PCI which involves potentially every driver, because thunderbolt). 3. Encode the mapping type in the address of child devices, either abusing the high bits of the reg or introducing an extra DT cell there, introduce a new OF bus type that strips those away via a ranges mapping and sets flags in the struct resource, similar to what the PCI bus does with its 3-cell ranges, then do as (2) above and make devm_ioremap_resource/of_iomap use it: On 09/02/2021 07.57, Arnd Bergmann wrote: > #define MAP_NONPOSTED 0x80000000 > > arm-io { /* name for adt, should be changed */ > compatible = "apple,m1-internal-bus"; > #address-cells = <2>; /* or maybe <3> if we want */ > #size-cells = <2>; > ranges = > /* on-chip MMIO */ > <(MAP_NONPOSTED | 0x2) 0x0 0x2 0x0 0x1 0x0>, > > /* first PCI: 2GB control, 4GB memory space */ > <(MAP_NONPOSTED | 0x3) 0x80000000 0x3 0x80000000 0x0 0x80000000>, > <0x4 0x0 0x4 0x0 0x1 0x0>, [...] > The MAP_NONPOSTED flag then gets interpreted by the .translate() and > .get_flags() callbacks of "struct of_bus" in the kernel, where it is put into > a "struct resource" flag, and interpreted when the resource gets mapped. > > The PCI host controller nests inside of the node above, and (in theory) > uses the same trick to distinguish between prefetchable and non-prefetchable > memory, except in practice this is handled in device drivers that already > know whether to call ioremap() or ioremap_wc(). 4. Introduce a new top-level DT element in the style of reserved-memory, that describes address ranges and the mapping type to use. This could be implemented entirely in arch code, having arm64's ioremap look up the address in a structure populated from this. As an additional wrinkle, earlycon is almost certainly going to need a special path to handle this very early, before OF stuff is available; it also uses fixmap instead of ioremap, which has its own idea about what type of mapping to use.
> From: Hector Martin <marcan@marcan.st> > Date: Wed, 10 Feb 2021 21:24:15 +0900 Hi Hector, Since device tree bindings are widely used outside the Linux tree, here are some thoughts from a U-Boot and OpenBSD perspective. > Hi Will, I'm pulling you in at Marc's suggestion. Do you have an opinion > on what the better solution to this problem is? > > The executive summary is that Apple SoCs require nGnRnE memory mappings > for MMIO, except for PCI space which uses nGnRE. ARM64 currently uses > nGnRE everywhere. Solutions discussed in this thread and elsewhere include: > > 1. Something similar to the current hacky patch (which isn't really > intended as a real solution...); change ioremap to default to nGnRnE > using some kind of platform-level flag in the DT, then figure out how to > get the PCI drivers to bypass that. This requires changing almost every > PCI driver, since most of them use plain ioremap(). > > 2. A per-device DT property that tells of_address_to_resource to set a > flag in the struct resource, which is then used by > devm_ioremap_resource/of_iomap to pick a new ioremap_ variant for nGnRnE > (introduce ioremap_np() for nonposted?) (PCI would work with this > inasmuch as it would not support it, and thus fall back to the current > nGnRE default, which is correct for PCI...). This requires changing > DT-binding drivers that we depend on to not use plain ioremap() (if any > do so), but that is a finite subset (unlike PCI which involves > potentially every driver, because thunderbolt). That solution is not dissimilar to how "dma-coherent", "big-endian" and "little-endian" properties work. U-Boot could simply ignore the property since it already has a memory map with the required memory attributes for each SoC. I don't see any issue with this for the OpenBSD kernel either. The number of existing drivers that would need to be changed is small. The dwc3 core driver already uses devm_ioremap_resource(). The nvme driver uses plain ioremap() in the PCI-specific part of the driver, but that will need new glue for a platform driver anyway. As things stand now that leaves us with the samsung serial driver which uses devm_ioremap(). That could be turned into a devm_iomap_resource() without much trouble I think. > 3. Encode the mapping type in the address of child devices, either > abusing the high bits of the reg or introducing an extra DT cell there, > introduce a new OF bus type that strips those away via a ranges mapping > and sets flags in the struct resource, similar to what the PCI bus does > with its 3-cell ranges, then do as (2) above and make > devm_ioremap_resource/of_iomap use it: > > On 09/02/2021 07.57, Arnd Bergmann wrote: > > #define MAP_NONPOSTED 0x80000000 > > > > arm-io { /* name for adt, should be changed */ > > compatible = "apple,m1-internal-bus"; > > #address-cells = <2>; /* or maybe <3> if we want */ > > #size-cells = <2>; > > ranges = > > /* on-chip MMIO */ > > <(MAP_NONPOSTED | 0x2) 0x0 0x2 0x0 0x1 0x0>, > > > > /* first PCI: 2GB control, 4GB memory space */ > > <(MAP_NONPOSTED | 0x3) 0x80000000 0x3 0x80000000 0x0 0x80000000>, > > <0x4 0x0 0x4 0x0 0x1 0x0>, > [...] > > > The MAP_NONPOSTED flag then gets interpreted by the .translate() and > > .get_flags() callbacks of "struct of_bus" in the kernel, where it is put into > > a "struct resource" flag, and interpreted when the resource gets mapped. > > > > The PCI host controller nests inside of the node above, and (in theory) > > uses the same trick to distinguish between prefetchable and non-prefetchable > > memory, except in practice this is handled in device drivers that already > > know whether to call ioremap() or ioremap_wc(). Using the high bit in the address would be awkward I think. For example in U-Boot I'd have to mask that bit away but doing so in a per-SoC way would be ugly. Unless you map the high bit away using a ranges property for the bus. Using #address-cells = <3> wll cause some fallout as well as it is convenient to store addresses as 64-bit integers. I've written code that just panics if that isn't possible. > 4. Introduce a new top-level DT element in the style of reserved-memory, > that describes address ranges and the mapping type to use. This could be > implemented entirely in arch code, having arm64's ioremap look up the > address in a structure populated from this. This isn't a strange idea either. For UEFI such a mapping already exists as a separate table that encodes the cache attributes of certain memory regions. > As an additional wrinkle, earlycon is almost certainly going to need a > special path to handle this very early, before OF stuff is available; it > also uses fixmap instead of ioremap, which has its own idea about what > type of mapping to use.
diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h index 5ea8656a2030..f2609a4f5019 100644 --- a/arch/arm64/include/asm/io.h +++ b/arch/arm64/include/asm/io.h @@ -167,7 +167,14 @@ extern void __iomem *__ioremap(phys_addr_t phys_addr, size_t size, pgprot_t prot extern void iounmap(volatile void __iomem *addr); extern void __iomem *ioremap_cache(phys_addr_t phys_addr, size_t size); -#define ioremap(addr, size) __ioremap((addr), (size), __pgprot(PROT_DEVICE_nGnRE)) +/* + * Some platforms require nGnRnE for MMIO. + */ +extern bool arm64_use_ne_io; + +#define ioremap(addr, size) __ioremap((addr), (size), \ + arm64_use_ne_io ? __pgprot(PROT_DEVICE_nGnRnE) \ + : __pgprot(PROT_DEVICE_nGnRE)) #define ioremap_wc(addr, size) __ioremap((addr), (size), __pgprot(PROT_NORMAL_NC)) /*
This follows from the fixmap patch, but relates to the general case. This is a hack, and incomplete. Read on for discussion. The problem: on Apple ARM platforms, SoC MMIO needs to use nGnRnE mappings: writes using nGnRE are blackholed. This seems to be by design, and there doesn't seem to be any fabric configuration or other bit we can flip to make the problem go away. As an additional confounding factor, reportedly PCIe MMIO BAR mappings conversely *do* need to use nGnRE to work properly. So we can't even get away with a single ioremap setting, but need to discriminate based on what bus the device is in. Since these devices have Thunderbolt, all PCI devices in the tree are potentially in scope. Ugh. Ideas: (1) Set up some devicetree property to default to nGnRnE at the platform level, and then make PCI drivers use nGnRE. This will require changing the PCI code to make pci_ioremap_bar do something other than a plain ioremap(). Unfortunately, of the ~630 PCI drivers in the tree, only ~90 use pci_ioremap_bar(). This would require a tree-wide cleanup to introduce something akin to pci_ioremap(), and make all PCI drivers use it instead of naked ioremap(). Currently there are three ioremap variants: ioremap() ioremap_wc() ioremap_uc() (not normally used on arm64) None of these really capture the nGnRE vs nGnRnE distinction. If a new variant is introduced in common code, we'd have to provide a default implementation that falls back to regular ioremap() on other arches. Something like ioremap() vs. ioremap_np() (nonposted)? (2) The converse of (1): keep the nGnRE default, but introduce special casing to the OF binding code to use nGnRnE when instructed to do so on these platforms. This means of_iomap() needs changing. The advantage of this approach is that the set of possible non-PCI drivers that are useful on these SoCs is bounded, so not all drivers that don't go through that path need to be fixed. Additionally, this could take advantage of the OF address translation stuff to be smarter about deciding to use nGnRnE, e.g. doing it based on a property of the parent bus node. Of note, some devices (like samsung_tty) go through the platform device framework, which eventually goes into devm code. So of_address_to_resource would need to set some flag on the struct resource, that can then be used by both of_iomap() and devm_ioremap_resource() and friends to eventually call the right ioremap variant. The ioremap considerations from (1) apply here too. (3) Do it at a lower level, in ioremap() itself. This requires that ioremap() somehow discriminates based on address range to pick what kind of mapping to make. Declaring these address ranges would be an issue. Options: a) An out of band list in a DT node, a la /reserved-memory b) Something based on the existing DT hierarchy, where we can scan bus ranges and locate buses with a property that says "nGnRnE" or "nGnRE" and dynamically build the list based on that. The advantage of this option is that it doesn't touch non-arch code. The disadvantage is that it adds a complete new bespoke mechanism to the DT, and that it does not let device drivers actually select the IO mode, which might be desirable in the future anyway for some devices. All discussion and additional ideas welcome. Signed-off-by: Hector Martin <marcan@marcan.st> --- arch/arm64/include/asm/io.h | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-)