mbox series

[v2,0/7] irqchip/apple-aic: Add support for AICv2

Message ID 20220224130741.63924-1-marcan@marcan.st (mailing list archive)
Headers show
Series irqchip/apple-aic: Add support for AICv2 | expand

Message

Hector Martin Feb. 24, 2022, 1:07 p.m. UTC
Hi folks,

In the t6000/t6001 (M1 Pro / Max) SoCs, Apple introduced a new version
of their interrupt controller. This is a significant departure from
AICv1 and seems designed to better scale to larger chips. This series
adds support for it to the existing AIC driver.

Gone are CPU affinities; instead there seems to be some kind of
"automagic" dispatch to willing CPU cores, and cores can also opt-out
via an IMP-DEF sysreg (!). Right now the bootloader just sets up all
cores to accept IRQs, and we ignore all this and let the magic
algorithm pick a CPU to accept the IRQ. In the future, we might start
making use of these finer-grained capabilities for e.g. better
real-time guarantees (CPUs running RT threads might opt out of IRQs).

Legacy IPI support is also gone, so this implements Fast IPI support.
Fast IPIs are implemented entirely in the CPU core complexes, using
FIQs and IMP-DEF sysregs. This is also supported on t8103/M1, so we
enable it there too, but we keep the legacy AIC IPI codepath in case
it is useful for backporting to older chips.

This also adds support for multi-die AIC2 controllers. While no
multi-die products exist yet, the AIC2 in t600x is built to support
up to 2 dies, and it's pretty clear how it works, so let's implement
it. If we're lucky, when multi-die products roll around, this will
let us support them with only DT changes. In order to support the
extra die dimension, this introduces a 4-argument IRQ phandle form
(3-argument is always supported and just implies die 0).

All register offsets are computed based on capability register values,
which should allow forward-compatibility with future AIC2 variants...
except for one. For some inexplicable reason, the number of actually
implemented die register sets is nowhere to be found (t600x has 2,
but claims 1 die in use and 8 dies max, neither of which is what we
need), and this is necessary to compute the event register offset,
which is page-aligned after the die register sets. We have no choice
but to stick this offset in the device tree... which is the same thing
Apple do in their ADT.

Changes since v1:
- Split off the DT binding
- Changed fast-ipi codepath selection to use a static key for performance
- Added fix for PCI driver to support the new 4-cell IRQ form
- Minor style / review feedback fixes

Hector Martin (7):
  PCI: apple: Change MSI handling to handle 4-cell AIC fwspec form
  dt-bindings: interrupt-controller: apple,aic2: New binding for AICv2
  irqchip/apple-aic: Add Fast IPI support
  irqchip/apple-aic: Switch to irq_domain_create_tree and sparse hwirqs
  irqchip/apple-aic: Dynamically compute register offsets
  irqchip/apple-aic: Support multiple dies
  irqchip/apple-aic: Add support for AICv2

 .../interrupt-controller/apple,aic2.yaml      |  99 ++++
 MAINTAINERS                                   |   2 +-
 drivers/irqchip/irq-apple-aic.c               | 432 +++++++++++++++---
 drivers/pci/controller/pcie-apple.c           |   2 +-
 4 files changed, 458 insertions(+), 77 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/interrupt-controller/apple,aic2.yaml

Comments

Mark Rutland Feb. 24, 2022, 6:26 p.m. UTC | #1
On Thu, Feb 24, 2022 at 10:07:34PM +0900, Hector Martin wrote:
> Hi folks,

Hi Hector,

> In the t6000/t6001 (M1 Pro / Max) SoCs, Apple introduced a new version
> of their interrupt controller. This is a significant departure from
> AICv1 and seems designed to better scale to larger chips. This series
> adds support for it to the existing AIC driver.
> 
> Gone are CPU affinities; instead there seems to be some kind of
> "automagic" dispatch to willing CPU cores, and cores can also opt-out
> via an IMP-DEF sysreg (!). Right now the bootloader just sets up all
> cores to accept IRQs, and we ignore all this and let the magic
> algorithm pick a CPU to accept the IRQ.

Maybe that's ok for the set of peripherals attached, but in general that
violates existing expectations regarding affinity, and I fear there'll
be some subtle brokenness resulting from this automatic target
selection.

For example, in the perf events subsystem there are PMU drivers (even
those for "uncore" or "system" devices which are shared by many/all
CPUs) which rely on a combination of interrupt affinity and local IRQ
masking (without any other locking) to provide exclusion between a PMU's
IRQ handler and any other management operations for that PMU (which are
all handled from the same CPU).

> In the future, we might start making use of these finer-grained
> capabilities for e.g. better real-time guarantees (CPUs running RT
> threads might opt out of IRQs).

What mechanism does the HW have for affinity selection? The wording
above makes it sound like each CPU has to opt-out rather than having a
central affinity selection. Is there a mechanism to select a single
target?

Thanks,
Mark.

> Legacy IPI support is also gone, so this implements Fast IPI support.
> Fast IPIs are implemented entirely in the CPU core complexes, using
> FIQs and IMP-DEF sysregs. This is also supported on t8103/M1, so we
> enable it there too, but we keep the legacy AIC IPI codepath in case
> it is useful for backporting to older chips.
> 
> This also adds support for multi-die AIC2 controllers. While no
> multi-die products exist yet, the AIC2 in t600x is built to support
> up to 2 dies, and it's pretty clear how it works, so let's implement
> it. If we're lucky, when multi-die products roll around, this will
> let us support them with only DT changes. In order to support the
> extra die dimension, this introduces a 4-argument IRQ phandle form
> (3-argument is always supported and just implies die 0).
> 
> All register offsets are computed based on capability register values,
> which should allow forward-compatibility with future AIC2 variants...
> except for one. For some inexplicable reason, the number of actually
> implemented die register sets is nowhere to be found (t600x has 2,
> but claims 1 die in use and 8 dies max, neither of which is what we
> need), and this is necessary to compute the event register offset,
> which is page-aligned after the die register sets. We have no choice
> but to stick this offset in the device tree... which is the same thing
> Apple do in their ADT.
> 
> Changes since v1:
> - Split off the DT binding
> - Changed fast-ipi codepath selection to use a static key for performance
> - Added fix for PCI driver to support the new 4-cell IRQ form
> - Minor style / review feedback fixes
> 
> Hector Martin (7):
>   PCI: apple: Change MSI handling to handle 4-cell AIC fwspec form
>   dt-bindings: interrupt-controller: apple,aic2: New binding for AICv2
>   irqchip/apple-aic: Add Fast IPI support
>   irqchip/apple-aic: Switch to irq_domain_create_tree and sparse hwirqs
>   irqchip/apple-aic: Dynamically compute register offsets
>   irqchip/apple-aic: Support multiple dies
>   irqchip/apple-aic: Add support for AICv2
> 
>  .../interrupt-controller/apple,aic2.yaml      |  99 ++++
>  MAINTAINERS                                   |   2 +-
>  drivers/irqchip/irq-apple-aic.c               | 432 +++++++++++++++---
>  drivers/pci/controller/pcie-apple.c           |   2 +-
>  4 files changed, 458 insertions(+), 77 deletions(-)
>  create mode 100644 Documentation/devicetree/bindings/interrupt-controller/apple,aic2.yaml
> 
> -- 
> 2.33.0
>
Marc Zyngier Feb. 24, 2022, 7:06 p.m. UTC | #2
On Thu, 24 Feb 2022 18:26:41 +0000,
Mark Rutland <mark.rutland@arm.com> wrote:
> 
> On Thu, Feb 24, 2022 at 10:07:34PM +0900, Hector Martin wrote:
> > Hi folks,
> 
> Hi Hector,
> 
> > In the t6000/t6001 (M1 Pro / Max) SoCs, Apple introduced a new version
> > of their interrupt controller. This is a significant departure from
> > AICv1 and seems designed to better scale to larger chips. This series
> > adds support for it to the existing AIC driver.
> > 
> > Gone are CPU affinities; instead there seems to be some kind of
> > "automagic" dispatch to willing CPU cores, and cores can also opt-out
> > via an IMP-DEF sysreg (!). Right now the bootloader just sets up all
> > cores to accept IRQs, and we ignore all this and let the magic
> > algorithm pick a CPU to accept the IRQ.
> 
> Maybe that's ok for the set of peripherals attached, but in general that
> violates existing expectations regarding affinity, and I fear there'll
> be some subtle brokenness resulting from this automatic target
> selection.
> 
> For example, in the perf events subsystem there are PMU drivers (even
> those for "uncore" or "system" devices which are shared by many/all
> CPUs) which rely on a combination of interrupt affinity and local IRQ
> masking (without any other locking) to provide exclusion between a PMU's
> IRQ handler and any other management operations for that PMU (which are
> all handled from the same CPU).

It will definitely break anything that relies on managed interrupts,
where the kernel expects to allocate interrupts that have a strict
affinity. Drivers using this feature can legitimately expect that they
can keep their state in per-CPU pointers, and that obviously breaks.

This may affect any PCIe device with more than a couple of queues.
Maybe users of this HW do not care (yet), but we'll have to find a way
to tell drivers of the limitation.

	M.
Hector Martin Feb. 25, 2022, 4:27 a.m. UTC | #3
On 25/02/2022 04.06, Marc Zyngier wrote:
> On Thu, 24 Feb 2022 18:26:41 +0000,
> Mark Rutland <mark.rutland@arm.com> wrote:
>>
>> On Thu, Feb 24, 2022 at 10:07:34PM +0900, Hector Martin wrote:
>>> Hi folks,
>>
>> Hi Hector,
>>
>>> In the t6000/t6001 (M1 Pro / Max) SoCs, Apple introduced a new version
>>> of their interrupt controller. This is a significant departure from
>>> AICv1 and seems designed to better scale to larger chips. This series
>>> adds support for it to the existing AIC driver.
>>>
>>> Gone are CPU affinities; instead there seems to be some kind of
>>> "automagic" dispatch to willing CPU cores, and cores can also opt-out
>>> via an IMP-DEF sysreg (!). Right now the bootloader just sets up all
>>> cores to accept IRQs, and we ignore all this and let the magic
>>> algorithm pick a CPU to accept the IRQ.
>>
>> Maybe that's ok for the set of peripherals attached, but in general that
>> violates existing expectations regarding affinity, and I fear there'll
>> be some subtle brokenness resulting from this automatic target
>> selection.
>>
>> For example, in the perf events subsystem there are PMU drivers (even
>> those for "uncore" or "system" devices which are shared by many/all
>> CPUs) which rely on a combination of interrupt affinity and local IRQ
>> masking (without any other locking) to provide exclusion between a PMU's
>> IRQ handler and any other management operations for that PMU (which are
>> all handled from the same CPU).
> 
> It will definitely break anything that relies on managed interrupts,
> where the kernel expects to allocate interrupts that have a strict
> affinity. Drivers using this feature can legitimately expect that they
> can keep their state in per-CPU pointers, and that obviously breaks.
> 
> This may affect any PCIe device with more than a couple of queues.
> Maybe users of this HW do not care (yet), but we'll have to find a way
> to tell drivers of the limitation.

Yes, we already had a brief discussion about this in the v1 thread:

https://lore.kernel.org/linux-arm-kernel/4a83dfb1-3188-8b09-fc60-d3083230fb54@marcan.st/

TL;DR there is no explicit per-IRQ affinity control, nor does an unknown
one seem possible, since there just aren't enough bits for it in per-IRQ
registers. AICv1 had that, but AICv2 got rid of it in favor of heuristic
magic and global per-CPU controls.

This hasn't actually been fully tested yet, but current hypothesis is
the mapping goes:

1 IRQ -> group (0-7) -> priority (0-3?) -> 1 CPU (local priority threshold)

This is based on the fact that the per-IRQ group field is 3 bits, and
the per-CPU mask IMP-DEF sysreg is 2 bits. There may or may not be
per-IRQ cluster controls. But that still leaves all IRQs funnelled into,
at most, 3-4 classes per CPU cluster, and 8 groups globally, so there's
no way to implement proper per-IRQ affinity (since we have 10 CPUs on
these platforms).

My guess is Apple has bet on heuristic magic to optimize IRQ delivery to
avoid waking up (deep?-)sleeping CPUs on low-priority events and
optimize for power, and forgone strict per-CPU queues which are how many
drivers optimize for performance. This makes some sense, since these are
largely consumer/prosumer platforms, many of them battery-powered, not
128-CPU datacenter monsters with multiple 10GbE interfaces. They can
probably get away without hard multiqueue stuff.

This won't be an issue for PMU interrupts (including the uncore PMU),
since those do not go through AIC per se but rather the FIQ path (which
is inherently per-CPU), same as the local timers. Marc's PMU support
patch set already takes care of adding support for those FIQ sources.
But it will indeed break some PCIe drivers for devices that users might
have arbitrarily attached through Thunderbolt.

Since we do not support Thunderbolt yet, I suggest we kick this can down
the road until we have test cases for how this breaks and how to fix it :-)

There are also other fun things to be done with the local CPU masking,
e.g. directing low-priority IRQs away from CPUs running real-time
threads. I definitely want to take a look in more detail at the controls
we *do* have, especially since I have a personal interest in RT for
audio production (and these platforms have no SMM/TEE, no latency
spikes, and fast cpufreq, woo!). But for now this works and brings up
the platform, so that yak is probably best shaved in the future. Let me
know if you're interested in having more discussions about RT-centric
features, though. I suspect we'll need some new kernel
mechanisms/interfaces to handle e.g. the CPU IMPDEF mask/prio stuff...

Aside, I wonder how they'll handle multi-die devices... for a single
die, you can probably well get away with no CPU pinning, but for
multi-die, are they going to do NUMA? If so, they'd want at least die
controls to avoid bouncing cache lines between dies too much... though
for some reason, I'm getting the feeling they're just going to
interleave memory and pretend it's UMA. Good chance we find out next
month...