[v10,00/15] Linux RISC-V AIA Support

Message ID 20231003044403.1974628-1-apatel@ventanamicro.com (mailing list archive)

Message

Anup Patel Oct. 3, 2023, 4:43 a.m. UTC
The RISC-V AIA specification has been ratified as per the RISC-V International
process. The latest ratified AIA specification can be found at:
https://github.com/riscv/riscv-aia/releases/download/1.0/riscv-interrupts-1.0.pdf

At a high-level, the AIA specification adds three things:
1) AIA CSRs
   - Improved local interrupt support
2) Incoming Message Signaled Interrupt Controller (IMSIC)
   - Per-HART MSI controller
   - Support MSI virtualization
   - Support IPI along with virtualization
3) Advanced Platform-Level Interrupt Controller (APLIC)
   - Wired interrupt controller
   - In MSI-mode, converts wired interrupts into MSIs (i.e. MSI generator)
   - In Direct-mode, injects external interrupts directly into HARTs

For an overview of the AIA specification, refer to the AIA virtualization
talk at KVM Forum 2022:
https://static.sched.com/hosted_files/kvmforum2022/a1/AIA_Virtualization_in_KVM_RISCV_final.pdf
https://www.youtube.com/watch?v=r071dL8Z0yo

To test this series, use QEMU v7.2 (or higher) and OpenSBI v1.2 (or higher).

These patches can also be found in the riscv_aia_v10 branch at:
https://github.com/avpatel/linux.git

Changes since v9:
 - Rebased on Linux-6.6-rc4
 - Use builtin_platform_driver() in PATCH5, PATCH9, and PATCH12

Changes since v8:
 - Rebased on Linux-6.6-rc3
 - Dropped PATCH2 of the v8 series since we won't require
   riscv_get_intc_hartid(), based on Marc Z's comments on ACPI AIA support.
 - Addressed Saravana's comments in PATCH3 of v8 series
 - Updated PATCH9 and PATCH13 of the v8 series based on comments from Sunil

Changes since v7:
 - Rebased on Linux-6.6-rc1
 - Addressed comments on PATCH1 of v7 series and split it into two PATCHes
 - Use DEFINE_SIMPLE_PROP() in PATCH2 of v7 series

Changes since v6:
 - Rebased on Linux-6.5-rc4
 - Updated PATCH2 to use IS_ENABLED(CONFIG_SPARC) instead of
   !IS_ENABLED(CONFIG_OF_IRQ)
 - Added new PATCH4 to fix syscore registration in PLIC driver
 - Updated PATCH5 to convert the PLIC driver into a full-blown platform
   driver with a rewritten probe function.

Changes since v5:
 - Rebased on Linux-6.5-rc2
 - Updated the overall series to ensure that only IPI, timer, and
   INTC drivers are probed very early, whereas the rest of the interrupt
   controllers (such as PLIC, APLIC, and IMSIC) are probed as
   regular platform drivers.
 - Renamed riscv_fw_parent_hartid() to riscv_get_intc_hartid()
 - New PATCH1 to add fw_devlink support for msi-parent DT property
 - New PATCH2 to ensure all INTC suppliers are initialized, which in turn
   fixes the probing issue for PLIC, APLIC, and IMSIC as platform drivers
 - New PATCH3 to use platform driver probing for PLIC
 - Re-structured the IMSIC driver into two separate drivers: early and
   platform. The IMSIC early driver (PATCH7) only initializes IMSIC state
   and provides IPIs, whereas the IMSIC platform driver (PATCH8) is probed
   later and provides the MSI domain for platform devices.
 - Re-structured the APLIC platform driver into three separate sources: main,
   direct mode, and MSI mode.

Changes since v4:
 - Rebased on Linux-6.5-rc1
 - Added "Dependencies" in the APLIC bindings (PATCH6 in v4)
 - Dropped the PATCH6 which was changing the IOMMU DMA domain APIs
 - Dropped use of IOMMU DMA APIs in the IMSIC driver (PATCH4)

Changes since v3:
 - Rebased on Linux-6.4-rc6
 - Dropped PATCH2 of the v3 series; instead, we now set FWNODE_FLAG_BEST_EFFORT
   via IRQCHIP_DECLARE()
 - Extend riscv_fw_parent_hartid() to support both DT and ACPI in PATCH1
 - Extend iommu_dma_compose_msi_msg() instead of adding iommu_dma_select_msi()
   in PATCH6
 - Addressed Conor's comments in PATCH3
 - Addressed Conor's and Rob's comments in PATCH7

Changes since v2:
 - Rebased on Linux-6.4-rc1
 - Addressed Rob's comments on DT bindings patches 4 and 8.
 - Addressed Marc's comments on the IMSIC driver in PATCH5
 - Replaced use of OF APIs in the APLIC and IMSIC drivers with FWNODE APIs;
   this makes both drivers easily portable for ACPI support and also
   removes unnecessary indirection from the APLIC and IMSIC drivers.
 - PATCH1 is a new patch for portability with ACPI support
 - PATCH2 is a new patch to fix probing in the APLIC driver for APLIC-only systems.
 - PATCH7 is a new patch which addresses the IOMMU DMA domain issues pointed
   out by SiFive

Changes since v1:
 - Rebased on Linux-6.2-rc2
 - Addressed comments on IMSIC DT bindings for PATCH4
 - Use raw_spin_lock_irqsave() on ids_lock for PATCH5
 - Improved MMIO alignment checks in PATCH5 to allow MMIO regions
   with holes.
 - Addressed comments on APLIC DT bindings for PATCH6
 - Fixed warning splat in aplic_msi_write_msg() caused by
   zeroed MSI message in PATCH7
 - Dropped the DT property riscv,slow-ipi; a module parameter will be
   added instead in the future.

Anup Patel (15):
  RISC-V: Don't fail in riscv_of_parent_hartid() for disabled HARTs
  of: property: Add fw_devlink support for msi-parent
  drivers: irqchip/riscv-intc: Mark all INTC nodes as initialized
  irqchip/sifive-plic: Fix syscore registration for multi-socket systems
  irqchip/sifive-plic: Convert PLIC driver into a platform driver
  irqchip/riscv-intc: Add support for RISC-V AIA
  dt-bindings: interrupt-controller: Add RISC-V incoming MSI controller
  irqchip: Add RISC-V incoming MSI controller early driver
  irqchip/riscv-imsic: Add support for platform MSI irqdomain
  irqchip/riscv-imsic: Add support for PCI MSI irqdomain
  dt-bindings: interrupt-controller: Add RISC-V advanced PLIC
  irqchip: Add RISC-V advanced PLIC driver for direct-mode
  irqchip/riscv-aplic: Add support for MSI-mode
  RISC-V: Select APLIC and IMSIC drivers
  MAINTAINERS: Add entry for RISC-V AIA drivers

 .../interrupt-controller/riscv,aplic.yaml     | 172 ++++++
 .../interrupt-controller/riscv,imsics.yaml    | 172 ++++++
 MAINTAINERS                                   |  14 +
 arch/riscv/Kconfig                            |   2 +
 arch/riscv/kernel/cpu.c                       |  11 +-
 drivers/irqchip/Kconfig                       |  24 +
 drivers/irqchip/Makefile                      |   3 +
 drivers/irqchip/irq-riscv-aplic-direct.c      | 343 +++++++++++
 drivers/irqchip/irq-riscv-aplic-main.c        | 232 +++++++
 drivers/irqchip/irq-riscv-aplic-main.h        |  53 ++
 drivers/irqchip/irq-riscv-aplic-msi.c         | 285 +++++++++
 drivers/irqchip/irq-riscv-imsic-early.c       | 259 ++++++++
 drivers/irqchip/irq-riscv-imsic-platform.c    | 319 ++++++++++
 drivers/irqchip/irq-riscv-imsic-state.c       | 570 ++++++++++++++++++
 drivers/irqchip/irq-riscv-imsic-state.h       |  67 ++
 drivers/irqchip/irq-riscv-intc.c              |  44 +-
 drivers/irqchip/irq-sifive-plic.c             | 242 +++++---
 drivers/of/property.c                         |   2 +
 include/linux/irqchip/riscv-aplic.h           | 119 ++++
 include/linux/irqchip/riscv-imsic.h           |  86 +++
 20 files changed, 2915 insertions(+), 104 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/interrupt-controller/riscv,aplic.yaml
 create mode 100644 Documentation/devicetree/bindings/interrupt-controller/riscv,imsics.yaml
 create mode 100644 drivers/irqchip/irq-riscv-aplic-direct.c
 create mode 100644 drivers/irqchip/irq-riscv-aplic-main.c
 create mode 100644 drivers/irqchip/irq-riscv-aplic-main.h
 create mode 100644 drivers/irqchip/irq-riscv-aplic-msi.c
 create mode 100644 drivers/irqchip/irq-riscv-imsic-early.c
 create mode 100644 drivers/irqchip/irq-riscv-imsic-platform.c
 create mode 100644 drivers/irqchip/irq-riscv-imsic-state.c
 create mode 100644 drivers/irqchip/irq-riscv-imsic-state.h
 create mode 100644 include/linux/irqchip/riscv-aplic.h
 create mode 100644 include/linux/irqchip/riscv-imsic.h

Comments

Björn Töpel Oct. 19, 2023, 1:43 p.m. UTC | #1
Hi Anup,

Anup Patel <apatel@ventanamicro.com> writes:

> The RISC-V AIA specification is ratified as-per the RISC-V international
> process. The latest ratified AIA specifcation can be found at:
> https://github.com/riscv/riscv-aia/releases/download/1.0/riscv-interrupts-1.0.pdf
>
> At a high-level, the AIA specification adds three things:
> 1) AIA CSRs
>    - Improved local interrupt support
> 2) Incoming Message Signaled Interrupt Controller (IMSIC)
>    - Per-HART MSI controller
>    - Support MSI virtualization
>    - Support IPI along with virtualization
> 3) Advanced Platform-Level Interrupt Controller (APLIC)
>    - Wired interrupt controller
>    - In MSI-mode, converts wired interrupt into MSIs (i.e. MSI generator)
>    - In Direct-mode, injects external interrupts directly into HARTs

Thanks for working on the AIA support! I had a look at the series, and
have some concerns about interrupt ID abstraction.

A bit of background, for readers not familiar with the AIA details.

IMSIC allows for 2047 unique MSI ("msi-irq") sources per hart, and
each MSI is dedicated to a certain hart. The series takes the approach
of saying that there are, e.g., 2047 interrupts ("lnx-irq") globally.
Each lnx-irq consists of #harts * msi-irq -- a slice -- and in the
slice only *one* msi-irq is actually used.
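
To make the model concrete, here is a rough sketch (plain C, purely for
illustration -- not the actual driver data structures) of what one such
slice looks like:

#include <stdbool.h>

#define NR_HARTS   128   /* example system size      */
#define NR_MSI_IDS 2047  /* per-hart IMSIC IDs (AIA) */

/*
 * One Linux interrupt ("lnx-irq") reserves the same IMSIC ID on every
 * hart -- a "slice" -- but only the copy on the currently targeted
 * hart is enabled and in use at any given time.
 */
struct lnx_irq_slice {
        unsigned int msi_id;        /* ID reserved on all harts      */
        unsigned int active_hart;   /* hart whose copy is in use     */
        bool enabled_on[NR_HARTS];  /* only one entry true at a time */
};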

This scheme makes affinity changes more robust, because the interrupt
sources on "other" harts are pre-allocated. On the other hand, it
requires propagating irq masking to other harts via IPIs (this is
mostly done at setup/tear down). It's also wasteful, because msi-irqs
are hogged and cannot be used by anyone else.

Contemporary storage/networking drivers usually use queues per core
(or a sub-set of cores). The current scheme wastes a lot of msi-irqs.
If we instead used a scheme where "msi-irq == lnx-irq", instead of
"lnx-irq = {hart 0: msi-irq x, ..., hart N: msi-irq x}", there would be
a lot of MSIs left for other users. 1-1 vs 1-N. E.g., if a storage device
would like to use 5 queues (5 cores) on a 128 core system, the current
scheme would consume 5 * 128 MSIs, instead of just 5.
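
A quick back-of-the-envelope calculation of that example (hypothetical
numbers, just to make the 1-N vs 1-1 cost visible):

#include <stdio.h>

int main(void)
{
        unsigned int harts = 128, queues = 5, ids_per_hart = 2047;

        /* 1-N (this series): each queue reserves its ID on every hart. */
        printf("1-N scheme consumes %u MSI IDs\n", queues * harts); /* 640 */

        /* 1-1 (msi-irq == lnx-irq): each queue uses one ID on one hart. */
        printf("1-1 scheme consumes %u MSI IDs\n", queues);         /* 5 */

        /* System-wide unique interrupts available per hart under 1-N. */
        printf("1-N unique interrupts per hart: ~%u\n", ids_per_hart / harts);
        return 0;
}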

On the plus side:
* Changing interrupt affinity will never fail, because the interrupts
  on each hart are pre-allocated.

On the negative side:
* Wasteful interrupt usage, and a system can potentially "run out" of
  interrupts, especially on many-core systems.
* Interrupt masking needs to propagate to harts via IPIs (there's no
  broadcast CSR in IMSIC), and the IMSIC driver needs a more complex
  locking scheme.

Summary:
The current series caps the number of global interrupts at a maximum of
2047 MSIs for all cores (the whole system). A better scheme, IMO, would
be to expose 2047 * #harts unique MSIs.

I think this could simplify/remove(?) the locking as well.

I realize that the series is at v10, and coming with a change like this
now might be a bit of a pain...

Finally, another question related to APLIC/IMSIC. AFAIU the memory map
of the IMSIC regions is constrained by the APLIC, which requires a
certain layout for MSI forwarding (group/hart/guest bits). Say that a
system doesn't have an APLIC, couldn't the layout requirement be
simplified?


Again, thanks for the hard work!
Björn
Anup Patel Oct. 19, 2023, 4:11 p.m. UTC | #2
On Thu, Oct 19, 2023 at 7:13 PM Björn Töpel <bjorn@kernel.org> wrote:
>
> Hi Anup,
>
> Anup Patel <apatel@ventanamicro.com> writes:
>
> > The RISC-V AIA specification is ratified as-per the RISC-V international
> > process. The latest ratified AIA specifcation can be found at:
> > https://github.com/riscv/riscv-aia/releases/download/1.0/riscv-interrupts-1.0.pdf
> >
> > At a high-level, the AIA specification adds three things:
> > 1) AIA CSRs
> >    - Improved local interrupt support
> > 2) Incoming Message Signaled Interrupt Controller (IMSIC)
> >    - Per-HART MSI controller
> >    - Support MSI virtualization
> >    - Support IPI along with virtualization
> > 3) Advanced Platform-Level Interrupt Controller (APLIC)
> >    - Wired interrupt controller
> >    - In MSI-mode, converts wired interrupt into MSIs (i.e. MSI generator)
> >    - In Direct-mode, injects external interrupts directly into HARTs
>
> Thanks for working on the AIA support! I had a look at the series, and
> have some concerns about interrupt ID abstraction.
>
> A bit of background, for readers not familiar with the AIA details.
>
> IMSIC allows for 2047 unique MSI ("msi-irq") sources per hart, and
> each MSI is dedicated to a certain hart. The series takes the approach
> to say that there are, e.g., 2047 interrupts ("lnx-irq") globally.
> Each lnx-irq consists of #harts * msi-irq -- a slice -- and in the
> slice only *one* msi-irq is acutally used.
>
> This scheme makes affinity changes more robust, because the interrupt
> sources on "other" harts are pre-allocated. On the other hand it
> requires to propagate irq masking to other harts via IPIs (this is
> mostly done up setup/tear down). It's also wasteful, because msi-irqs
> are hogged, and cannot be used.
>
> Contemporary storage/networking drivers usually uses queues per core
> (or a sub-set of cores). The current scheme wastes a lot of msi-irqs.
> If we instead used a scheme where "msi-irq == lnx-irq", instead of
> "lnq-irq = {hart 0;msi-irq x , ... hart N;msi-irq x}", there would be
> a lot MSIs for other users. 1-1 vs 1-N. E.g., if a storage device
> would like to use 5 queues (5 cores) on a 128 core system, the current
> scheme would consume 5 * 128 MSIs, instead of just 5.
>
> On the plus side:
> * Changing interrupts affinity will never fail, because the interrupts
>   on each hart is pre-allocated.
>
> On the negative side:
> * Wasteful interrupt usage, and a system can potientially "run out" of
>   interrupts. Especially for many core systems.
> * Interrupt masking need to proagate to harts via IPIs (there's no
>   broadcast csr in IMSIC), and a more complex locking scheme IMSIC
>
> Summary:
> The current series caps the number of global interrupts to maximum
> 2047 MSIs for all cores (whole system). A better scheme, IMO, would be
> to expose 2047 * #harts unique MSIs.
>
> I think this could simplify/remove(?) the locking as well.

Exposing 2047 * #harts unique MSIs has multiple issues:
1) irq_set_affinity() does not work for MSIs because each
   IRQ is not tied to a particular HART. This means we can't
   balance the IRQ processing load among HARTs.
2) All wired IRQs for APLIC MSI-mode will also target a
   fixed HART, hence irq_set_affinity() won't work for wired
   IRQs either.
3) Contemporary storage/networking drivers which use per-core
   queues use irq_set_affinity() on queue IRQs to balance
   across cores, but this will fail.
4) HART hotplug breaks because the kernel irq subsystem can't
   migrate the IRQs (both MSIs and wired) targeting HART X
   to another HART Y when HART X goes down.

The idea of treating per-HART MSIs as separate IRQs has
been discussed in the past. The current approach works nicely
with all kernel use-cases at the cost of additional work on the
driver side.

Also, the current approach is very similar to the ARM GICv3
driver where ITS LPIs across CPUs are treated as a single IRQ.

>
> I realize that the series in v10, and coming with a change like this
> now might be a bit of a pain...
>
> Finally, another question related to APLIC/IMSIC. AFAIU the memory map
> of the IMSIC regions are constrained by the APLIC, which requires a
> certain layout for MSI forwarding (group/hart/guest bits). Say that a
> system doesn't have an APLIC, couldn't the layout requirement be
> simplified?

Yes, this is already taken care of in the current IMSIC driver based
on feedback from Atish. We can certainly improve flexibility on the
IMSIC driver side if some case has been missed.

The APLIC driver is certainly very strict about the arrangement of
IMSIC files, so we do additional sanity checks on the APLIC driver
side at the time of probing.

>
>
> Again, thanks for the hard work!
> Björn

Regards,
Anup
Björn Töpel Oct. 20, 2023, 8:47 a.m. UTC | #3
Thanks for the quick reply!

Anup Patel <apatel@ventanamicro.com> writes:

> On Thu, Oct 19, 2023 at 7:13 PM Björn Töpel <bjorn@kernel.org> wrote:
>>
>> Hi Anup,
>>
>> Anup Patel <apatel@ventanamicro.com> writes:
>>
>> > The RISC-V AIA specification is ratified as-per the RISC-V international
>> > process. The latest ratified AIA specifcation can be found at:
>> > https://github.com/riscv/riscv-aia/releases/download/1.0/riscv-interrupts-1.0.pdf
>> >
>> > At a high-level, the AIA specification adds three things:
>> > 1) AIA CSRs
>> >    - Improved local interrupt support
>> > 2) Incoming Message Signaled Interrupt Controller (IMSIC)
>> >    - Per-HART MSI controller
>> >    - Support MSI virtualization
>> >    - Support IPI along with virtualization
>> > 3) Advanced Platform-Level Interrupt Controller (APLIC)
>> >    - Wired interrupt controller
>> >    - In MSI-mode, converts wired interrupt into MSIs (i.e. MSI generator)
>> >    - In Direct-mode, injects external interrupts directly into HARTs
>>
>> Thanks for working on the AIA support! I had a look at the series, and
>> have some concerns about interrupt ID abstraction.
>>
>> A bit of background, for readers not familiar with the AIA details.
>>
>> IMSIC allows for 2047 unique MSI ("msi-irq") sources per hart, and
>> each MSI is dedicated to a certain hart. The series takes the approach
>> to say that there are, e.g., 2047 interrupts ("lnx-irq") globally.
>> Each lnx-irq consists of #harts * msi-irq -- a slice -- and in the
>> slice only *one* msi-irq is acutally used.
>>
>> This scheme makes affinity changes more robust, because the interrupt
>> sources on "other" harts are pre-allocated. On the other hand it
>> requires to propagate irq masking to other harts via IPIs (this is
>> mostly done up setup/tear down). It's also wasteful, because msi-irqs
>> are hogged, and cannot be used.
>>
>> Contemporary storage/networking drivers usually uses queues per core
>> (or a sub-set of cores). The current scheme wastes a lot of msi-irqs.
>> If we instead used a scheme where "msi-irq == lnx-irq", instead of
>> "lnq-irq = {hart 0;msi-irq x , ... hart N;msi-irq x}", there would be
>> a lot MSIs for other users. 1-1 vs 1-N. E.g., if a storage device
>> would like to use 5 queues (5 cores) on a 128 core system, the current
>> scheme would consume 5 * 128 MSIs, instead of just 5.
>>
>> On the plus side:
>> * Changing interrupts affinity will never fail, because the interrupts
>>   on each hart is pre-allocated.
>>
>> On the negative side:
>> * Wasteful interrupt usage, and a system can potientially "run out" of
>>   interrupts. Especially for many core systems.
>> * Interrupt masking need to proagate to harts via IPIs (there's no
>>   broadcast csr in IMSIC), and a more complex locking scheme IMSIC
>>
>> Summary:
>> The current series caps the number of global interrupts to maximum
>> 2047 MSIs for all cores (whole system). A better scheme, IMO, would be
>> to expose 2047 * #harts unique MSIs.
>>
>> I think this could simplify/remove(?) the locking as well.
>
> Exposing 2047 * #harts unique MSIs has multiple issues:
> 1) The irq_set_affinity() does not work for MSIs because each
>      IRQ is not tied to a particular HART. This means we can't
>      balance the IRQ processing load among HARTs.

Yes, you can balance. In your code, each *active* MSI is still
bound/active to a specific hart together with the affinity mask. In a
1-1 model you would still need to track the affinity mask, but the
irq_set_affinity() would be different. It would try to allocate a new
MSI from the target CPU, and then switch to having that MSI active.

That's what x86 does AFAIU, which is also constrained by the # of
available MSIs.

The downside, as I pointed out, is that the set affinity action can
fail for a certain target CPU.
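
For illustration, here is a minimal userspace model (hypothetical, not a
proposed kernel implementation) of such an allocate-on-target
set_affinity, including the failure case:

#include <errno.h>
#include <stdio.h>
#include <string.h>

#define NR_CPUS 4
#define NR_IDS  8   /* tiny per-CPU ID space for the example */

static unsigned char id_used[NR_CPUS][NR_IDS];

struct my_irq { int cpu; int id; };

/* Try to allocate a fresh ID on the target CPU, then retire the old one. */
static int my_set_affinity(struct my_irq *irq, int target_cpu)
{
        for (int id = 0; id < NR_IDS; id++) {
                if (!id_used[target_cpu][id]) {
                        id_used[target_cpu][id] = 1;
                        id_used[irq->cpu][irq->id] = 0;
                        irq->cpu = target_cpu;
                        irq->id = id;
                        return 0;
                }
        }
        return -ENOSPC; /* target CPU has no free IDs: affinity change fails */
}

int main(void)
{
        struct my_irq irq = { .cpu = 1, .id = 0 };

        memset(id_used[0], 1, sizeof(id_used[0])); /* CPU 0 is already full  */
        id_used[1][0] = 1;                         /* the IRQ's current slot */

        printf("move to cpu0: %d\n", my_set_affinity(&irq, 0)); /* -ENOSPC */
        printf("move to cpu2: %d\n", my_set_affinity(&irq, 2)); /* 0 */
        return 0;
}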

> 2) All wired IRQs for APLIC MSI-mode will also target a
>     fixed HART hence irq_set_affinity() won't work for wired
>     IRQs as well.

I'm not following here. Why would APLIC put a constraint here? I had a
look at the specs, and I didn't see anything supporting the current
scheme explicitly.

> 3) Contemporary storage/networking drivers which use per-core
>      queues use irq_set_affinity() on queue IRQs to balance
>      across cores but this will fail.

Or via managed interrupts. But this is a non-issue, as pointed
out in my reply to 1.

> 4) HART hotplug breaks because kernel irq-subsystem can't
>     migrate the IRQs (both MSIs and Wired) targeting HART X
>     to another HART Y when the HART X goes down.

Yes, we might end up in scenarios where we can't move to a certain
target CPU, but I wouldn't expect that to be a common scenario.

> The idea of treating per-HART MSIs as separate IRQs has
> been discussed in the past.

Aha! I tried to look for it in lore, but didn't find any. Could you
point me to those discussions?

> Also, the current approach is very similar to the ARM GICv3 driver
> where ITS LPIs across CPUs are treated as single IRQ.

I'm not familiar with the GIC. Is the GICv3 design similar to IMSIC? I
had the impression that the GIC had a more advanced interrupt routing
mechanism than what IMSIC exposes. I think the x86 APIC takes the 1-1
approach (the folks on the To: list definitely know! ;-)).

My concern is that interrupts become a scarce resource with this
implementation, but maybe my view is incorrect. I've seen bare-metal
x86 systems (no VMs) with ~200 cores and ~2000 interrupts, but maybe
that is considered "a lot of interrupts".

As long as we don't get into scenarios where we're running out of
interrupts, due to the software design.


Björn
Anup Patel Oct. 20, 2023, 11 a.m. UTC | #4
On Fri, Oct 20, 2023 at 2:17 PM Björn Töpel <bjorn@kernel.org> wrote:
>
> Thanks for the quick reply!
>
> Anup Patel <apatel@ventanamicro.com> writes:
>
> > On Thu, Oct 19, 2023 at 7:13 PM Björn Töpel <bjorn@kernel.org> wrote:
> >>
> >> Hi Anup,
> >>
> >> Anup Patel <apatel@ventanamicro.com> writes:
> >>
> >> > The RISC-V AIA specification is ratified as-per the RISC-V international
> >> > process. The latest ratified AIA specifcation can be found at:
> >> > https://github.com/riscv/riscv-aia/releases/download/1.0/riscv-interrupts-1.0.pdf
> >> >
> >> > At a high-level, the AIA specification adds three things:
> >> > 1) AIA CSRs
> >> >    - Improved local interrupt support
> >> > 2) Incoming Message Signaled Interrupt Controller (IMSIC)
> >> >    - Per-HART MSI controller
> >> >    - Support MSI virtualization
> >> >    - Support IPI along with virtualization
> >> > 3) Advanced Platform-Level Interrupt Controller (APLIC)
> >> >    - Wired interrupt controller
> >> >    - In MSI-mode, converts wired interrupt into MSIs (i.e. MSI generator)
> >> >    - In Direct-mode, injects external interrupts directly into HARTs
> >>
> >> Thanks for working on the AIA support! I had a look at the series, and
> >> have some concerns about interrupt ID abstraction.
> >>
> >> A bit of background, for readers not familiar with the AIA details.
> >>
> >> IMSIC allows for 2047 unique MSI ("msi-irq") sources per hart, and
> >> each MSI is dedicated to a certain hart. The series takes the approach
> >> to say that there are, e.g., 2047 interrupts ("lnx-irq") globally.
> >> Each lnx-irq consists of #harts * msi-irq -- a slice -- and in the
> >> slice only *one* msi-irq is acutally used.
> >>
> >> This scheme makes affinity changes more robust, because the interrupt
> >> sources on "other" harts are pre-allocated. On the other hand it
> >> requires to propagate irq masking to other harts via IPIs (this is
> >> mostly done up setup/tear down). It's also wasteful, because msi-irqs
> >> are hogged, and cannot be used.
> >>
> >> Contemporary storage/networking drivers usually uses queues per core
> >> (or a sub-set of cores). The current scheme wastes a lot of msi-irqs.
> >> If we instead used a scheme where "msi-irq == lnx-irq", instead of
> >> "lnq-irq = {hart 0;msi-irq x , ... hart N;msi-irq x}", there would be
> >> a lot MSIs for other users. 1-1 vs 1-N. E.g., if a storage device
> >> would like to use 5 queues (5 cores) on a 128 core system, the current
> >> scheme would consume 5 * 128 MSIs, instead of just 5.
> >>
> >> On the plus side:
> >> * Changing interrupts affinity will never fail, because the interrupts
> >>   on each hart is pre-allocated.
> >>
> >> On the negative side:
> >> * Wasteful interrupt usage, and a system can potientially "run out" of
> >>   interrupts. Especially for many core systems.
> >> * Interrupt masking need to proagate to harts via IPIs (there's no
> >>   broadcast csr in IMSIC), and a more complex locking scheme IMSIC
> >>
> >> Summary:
> >> The current series caps the number of global interrupts to maximum
> >> 2047 MSIs for all cores (whole system). A better scheme, IMO, would be
> >> to expose 2047 * #harts unique MSIs.
> >>
> >> I think this could simplify/remove(?) the locking as well.
> >
> > Exposing 2047 * #harts unique MSIs has multiple issues:
> > 1) The irq_set_affinity() does not work for MSIs because each
> >      IRQ is not tied to a particular HART. This means we can't
> >      balance the IRQ processing load among HARTs.
>
> Yes, you can balance. In your code, each *active* MSI is still
> bound/active to a specific hard together with the affinity mask. In an
> 1-1 model you would still need to track the affinity mask, but the
> irq_set_affinity() would be different. It would try to allocate a new
> MSI from the target CPU, and then switch to having that MSI active.
>
> That's what x86 does AFAIU, which is also constrained by the # of
> available MSIs.
>
> The downside, as I pointed out, is that the set affinity action can
> fail for a certain target CPU.

Yes, irq_set_affinity() can fail for the suggested approach. In addition,
for RISC-V AIA, one HART does not have access to other HARTs'
MSI enable/disable bits, so the approach will also involve IPIs.

>
> > 2) All wired IRQs for APLIC MSI-mode will also target a
> >     fixed HART hence irq_set_affinity() won't work for wired
> >     IRQs as well.
>
> I'm not following here. Why would APLIC put a constraint here? I had a
> look at the specs, and I didn't see anything supporting the current
> scheme explicitly.

Let's say the number of APLIC wired interrupts is greater than the
number of per-CPU IMSIC IDs. In this case, if all wired interrupts are
moved to a particular CPU, then irq_set_affinity() will fail for some of
the wired interrupts.
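
A tiny worked example of that failure mode (the counts here are made up,
just to illustrate the arithmetic):

#include <stdio.h>

int main(void)
{
        unsigned int wired_irqs = 300;   /* APLIC wired sources         */
        unsigned int ids_per_hart = 255; /* per-CPU IMSIC IDs available */

        /*
         * If all wired IRQs are pinned to one CPU in a 1-1 scheme,
         * anything beyond the per-hart ID budget cannot be granted.
         */
        if (wired_irqs > ids_per_hart)
                printf("%u of %u affinity requests would fail\n",
                       wired_irqs - ids_per_hart, wired_irqs);
        return 0;
}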

>
> > 3) Contemporary storage/networking drivers which use per-core
> >      queues use irq_set_affinity() on queue IRQs to balance
> >      across cores but this will fail.
>
> Or via the the managed interrupts. But this is a non-issue, as pointed
> out in my reply to 1.
>
> > 4) HART hotplug breaks because kernel irq-subsystem can't
> >     migrate the IRQs (both MSIs and Wired) targeting HART X
> >     to another HART Y when the HART X goes down.
>
> Yes, we might end up in scenarios where we can't move to a certain
> target cpu, but I wouldn't expect that to be a common scenario.
>
> > The idea of treating per-HART MSIs as separate IRQs has
> > been discussed in the past.
>
> Aha! I tried to look for it in lore, but didn't find any. Could you
> point me to those discussions?

This was discussed two years back in the AIA TG meeting, when we were
doing the PoC for the AIA spec.

>
> > Also, the current approach is very similar to the ARM GICv3 driver
> > where ITS LPIs across CPUs are treated as single IRQ.
>
> I'm not familiar with the GIC. Is the GICv3 design similar to IMSIC? I
> had the impression that the GIC had a more advanced interrupt routing
> mechanism, than what IMSIC exposes. I think x86 APIC takes the 1-1
> approach (the folks on the To: list definitely knows! ;-)).

The GIC has a per-CPU redistributor which handles LPIs. MSIs are
taken by the GIC ITS and forwarded as LPIs to the redistributor of a CPU.

The GIC driver treats the LPI numbering space as global and not per-CPU.
Also, the limit on the maximum number of LPIs is quite high because the
LPI INTID can be 32 bits wide.

>
> My concern is interrupts become a scarce resource with this
> implementation, but maybe my view is incorrect. I've seen bare-metal
> x86 systems (no VMs) with ~200 cores, and ~2000 interrupts, but maybe
> that is considered "a lot of interrupts".
>
> As long as we don't get into scenarios where we're running out of
> interrupts, due to the software design.
>

The current approach is simpler and ensures that irq_set_affinity()
always works. The limit of at most 2047 IDs is sufficient for many
systems (if not all).

When we encounter a system requiring a large number of MSIs,
we can either:
1) Extend the AIA spec to support greater than 2047 IDs
2) Re-think the approach in the IMSIC driver

The choice between #1 and #2 above depends on the
guarantees we want for irq_set_affinity().

Regards,
Anup
Björn Töpel Oct. 20, 2023, 2:40 p.m. UTC | #5
Anup Patel <apatel@ventanamicro.com> writes:

> On Fri, Oct 20, 2023 at 2:17 PM Björn Töpel <bjorn@kernel.org> wrote:
>>
>> Thanks for the quick reply!
>>
>> Anup Patel <apatel@ventanamicro.com> writes:
>>
>> > On Thu, Oct 19, 2023 at 7:13 PM Björn Töpel <bjorn@kernel.org> wrote:
>> >>
>> >> Hi Anup,
>> >>
>> >> Anup Patel <apatel@ventanamicro.com> writes:
>> >>
>> >> > The RISC-V AIA specification is ratified as-per the RISC-V international
>> >> > process. The latest ratified AIA specifcation can be found at:
>> >> > https://github.com/riscv/riscv-aia/releases/download/1.0/riscv-interrupts-1.0.pdf
>> >> >
>> >> > At a high-level, the AIA specification adds three things:
>> >> > 1) AIA CSRs
>> >> >    - Improved local interrupt support
>> >> > 2) Incoming Message Signaled Interrupt Controller (IMSIC)
>> >> >    - Per-HART MSI controller
>> >> >    - Support MSI virtualization
>> >> >    - Support IPI along with virtualization
>> >> > 3) Advanced Platform-Level Interrupt Controller (APLIC)
>> >> >    - Wired interrupt controller
>> >> >    - In MSI-mode, converts wired interrupt into MSIs (i.e. MSI generator)
>> >> >    - In Direct-mode, injects external interrupts directly into HARTs
>> >>
>> >> Thanks for working on the AIA support! I had a look at the series, and
>> >> have some concerns about interrupt ID abstraction.
>> >>
>> >> A bit of background, for readers not familiar with the AIA details.
>> >>
>> >> IMSIC allows for 2047 unique MSI ("msi-irq") sources per hart, and
>> >> each MSI is dedicated to a certain hart. The series takes the approach
>> >> to say that there are, e.g., 2047 interrupts ("lnx-irq") globally.
>> >> Each lnx-irq consists of #harts * msi-irq -- a slice -- and in the
>> >> slice only *one* msi-irq is acutally used.
>> >>
>> >> This scheme makes affinity changes more robust, because the interrupt
>> >> sources on "other" harts are pre-allocated. On the other hand it
>> >> requires to propagate irq masking to other harts via IPIs (this is
>> >> mostly done up setup/tear down). It's also wasteful, because msi-irqs
>> >> are hogged, and cannot be used.
>> >>
>> >> Contemporary storage/networking drivers usually uses queues per core
>> >> (or a sub-set of cores). The current scheme wastes a lot of msi-irqs.
>> >> If we instead used a scheme where "msi-irq == lnx-irq", instead of
>> >> "lnq-irq = {hart 0;msi-irq x , ... hart N;msi-irq x}", there would be
>> >> a lot MSIs for other users. 1-1 vs 1-N. E.g., if a storage device
>> >> would like to use 5 queues (5 cores) on a 128 core system, the current
>> >> scheme would consume 5 * 128 MSIs, instead of just 5.
>> >>
>> >> On the plus side:
>> >> * Changing interrupts affinity will never fail, because the interrupts
>> >>   on each hart is pre-allocated.
>> >>
>> >> On the negative side:
>> >> * Wasteful interrupt usage, and a system can potientially "run out" of
>> >>   interrupts. Especially for many core systems.
>> >> * Interrupt masking need to proagate to harts via IPIs (there's no
>> >>   broadcast csr in IMSIC), and a more complex locking scheme IMSIC
>> >>
>> >> Summary:
>> >> The current series caps the number of global interrupts to maximum
>> >> 2047 MSIs for all cores (whole system). A better scheme, IMO, would be
>> >> to expose 2047 * #harts unique MSIs.
>> >>
>> >> I think this could simplify/remove(?) the locking as well.
>> >
>> > Exposing 2047 * #harts unique MSIs has multiple issues:
>> > 1) The irq_set_affinity() does not work for MSIs because each
>> >      IRQ is not tied to a particular HART. This means we can't
>> >      balance the IRQ processing load among HARTs.
>>
>> Yes, you can balance. In your code, each *active* MSI is still
>> bound/active to a specific hard together with the affinity mask. In an
>> 1-1 model you would still need to track the affinity mask, but the
>> irq_set_affinity() would be different. It would try to allocate a new
>> MSI from the target CPU, and then switch to having that MSI active.
>>
>> That's what x86 does AFAIU, which is also constrained by the # of
>> available MSIs.
>>
>> The downside, as I pointed out, is that the set affinity action can
>> fail for a certain target CPU.
>
> Yes, irq_set_affinity() can fail for the suggested approach plus for
> RISC-V AIA, one HART does not have access to other HARTs
> MSI enable/disable bits so the approach will also involve IPI.

Correct, but the current series does a broadcast to all cores, whereas the
1-1 approach needs at most an IPI to a single core.

128+c machines are getting more common, and you have devices that you
bring up/down on a per-core basis. Broadcasting IPIs to all cores when
dealing with a per-core activity makes for a pretty noisy neighbor.

This could be fixed in the existing 1-N approach by not requiring a sync
of the cores that are not handling the MSI in question: a "lazy disable".

>> > 2) All wired IRQs for APLIC MSI-mode will also target a
>> >     fixed HART hence irq_set_affinity() won't work for wired
>> >     IRQs as well.
>>
>> I'm not following here. Why would APLIC put a constraint here? I had a
>> look at the specs, and I didn't see anything supporting the current
>> scheme explicitly.
>
> Lets say the number of APLIC wired interrupts  are greater than the
> number of per-CPU IMSIC IDs. In this case, if all wired interrupts are
> moved to a particular CPU then irq_set_affinity() will fail for some of
> the wired interrupts.

Right, it's the case of "full remote CPU" again. Thanks for clearing
that up.

>> > The idea of treating per-HART MSIs as separate IRQs has
>> > been discussed in the past.
>>
>> Aha! I tried to look for it in lore, but didn't find any. Could you
>> point me to those discussions?
>
> This was done 2 years back in the AIA TG meeting when we were
> doing the PoC for AIA spec.

Ah, too bad. Thanks regardless.

>> My concern is interrupts become a scarce resource with this
>> implementation, but maybe my view is incorrect. I've seen bare-metal
>> x86 systems (no VMs) with ~200 cores, and ~2000 interrupts, but maybe
>> that is considered "a lot of interrupts".
>>
>> As long as we don't get into scenarios where we're running out of
>> interrupts, due to the software design.
>>
>
> The current approach is simpler and ensures irq_set_affinity
> always works. The limit of max 2047 IDs is sufficient for many
> systems (if not all).

Let me give you another view. On a 128c system each core has ~16 unique
interrupts at its disposal. E.g., the Intel E800 NIC has more than 2048
network queue pairs for each PF.

> When we encounter a system requiring a large number of MSIs,
> we can either:
> 1) Extend the AIA spec to support greater than 2047 IDs
> 2) Re-think the approach in the IMSIC driver
>
> The choice between #1 and #2 above depends on the
> guarantees we want for irq_set_affinity().

The irq_set_affinity() behavior is better with this series, but I think
the other aspects (the number of available interrupt sources, and the
IPI broadcast) are worse.


Björn
Anup Patel Oct. 20, 2023, 3:34 p.m. UTC | #6
On Fri, Oct 20, 2023 at 8:10 PM Björn Töpel <bjorn@kernel.org> wrote:
>
> Anup Patel <apatel@ventanamicro.com> writes:
>
> > On Fri, Oct 20, 2023 at 2:17 PM Björn Töpel <bjorn@kernel.org> wrote:
> >>
> >> Thanks for the quick reply!
> >>
> >> Anup Patel <apatel@ventanamicro.com> writes:
> >>
> >> > On Thu, Oct 19, 2023 at 7:13 PM Björn Töpel <bjorn@kernel.org> wrote:
> >> >>
> >> >> Hi Anup,
> >> >>
> >> >> Anup Patel <apatel@ventanamicro.com> writes:
> >> >>
> >> >> > The RISC-V AIA specification is ratified as-per the RISC-V international
> >> >> > process. The latest ratified AIA specifcation can be found at:
> >> >> > https://github.com/riscv/riscv-aia/releases/download/1.0/riscv-interrupts-1.0.pdf
> >> >> >
> >> >> > At a high-level, the AIA specification adds three things:
> >> >> > 1) AIA CSRs
> >> >> >    - Improved local interrupt support
> >> >> > 2) Incoming Message Signaled Interrupt Controller (IMSIC)
> >> >> >    - Per-HART MSI controller
> >> >> >    - Support MSI virtualization
> >> >> >    - Support IPI along with virtualization
> >> >> > 3) Advanced Platform-Level Interrupt Controller (APLIC)
> >> >> >    - Wired interrupt controller
> >> >> >    - In MSI-mode, converts wired interrupt into MSIs (i.e. MSI generator)
> >> >> >    - In Direct-mode, injects external interrupts directly into HARTs
> >> >>
> >> >> Thanks for working on the AIA support! I had a look at the series, and
> >> >> have some concerns about interrupt ID abstraction.
> >> >>
> >> >> A bit of background, for readers not familiar with the AIA details.
> >> >>
> >> >> IMSIC allows for 2047 unique MSI ("msi-irq") sources per hart, and
> >> >> each MSI is dedicated to a certain hart. The series takes the approach
> >> >> to say that there are, e.g., 2047 interrupts ("lnx-irq") globally.
> >> >> Each lnx-irq consists of #harts * msi-irq -- a slice -- and in the
> >> >> slice only *one* msi-irq is acutally used.
> >> >>
> >> >> This scheme makes affinity changes more robust, because the interrupt
> >> >> sources on "other" harts are pre-allocated. On the other hand it
> >> >> requires to propagate irq masking to other harts via IPIs (this is
> >> >> mostly done up setup/tear down). It's also wasteful, because msi-irqs
> >> >> are hogged, and cannot be used.
> >> >>
> >> >> Contemporary storage/networking drivers usually uses queues per core
> >> >> (or a sub-set of cores). The current scheme wastes a lot of msi-irqs.
> >> >> If we instead used a scheme where "msi-irq == lnx-irq", instead of
> >> >> "lnq-irq = {hart 0;msi-irq x , ... hart N;msi-irq x}", there would be
> >> >> a lot MSIs for other users. 1-1 vs 1-N. E.g., if a storage device
> >> >> would like to use 5 queues (5 cores) on a 128 core system, the current
> >> >> scheme would consume 5 * 128 MSIs, instead of just 5.
> >> >>
> >> >> On the plus side:
> >> >> * Changing interrupts affinity will never fail, because the interrupts
> >> >>   on each hart is pre-allocated.
> >> >>
> >> >> On the negative side:
> >> >> * Wasteful interrupt usage, and a system can potientially "run out" of
> >> >>   interrupts. Especially for many core systems.
> >> >> * Interrupt masking need to proagate to harts via IPIs (there's no
> >> >>   broadcast csr in IMSIC), and a more complex locking scheme IMSIC
> >> >>
> >> >> Summary:
> >> >> The current series caps the number of global interrupts to maximum
> >> >> 2047 MSIs for all cores (whole system). A better scheme, IMO, would be
> >> >> to expose 2047 * #harts unique MSIs.
> >> >>
> >> >> I think this could simplify/remove(?) the locking as well.
> >> >
> >> > Exposing 2047 * #harts unique MSIs has multiple issues:
> >> > 1) The irq_set_affinity() does not work for MSIs because each
> >> >      IRQ is not tied to a particular HART. This means we can't
> >> >      balance the IRQ processing load among HARTs.
> >>
> >> Yes, you can balance. In your code, each *active* MSI is still
> >> bound/active to a specific hard together with the affinity mask. In an
> >> 1-1 model you would still need to track the affinity mask, but the
> >> irq_set_affinity() would be different. It would try to allocate a new
> >> MSI from the target CPU, and then switch to having that MSI active.
> >>
> >> That's what x86 does AFAIU, which is also constrained by the # of
> >> available MSIs.
> >>
> >> The downside, as I pointed out, is that the set affinity action can
> >> fail for a certain target CPU.
> >
> > Yes, irq_set_affinity() can fail for the suggested approach plus for
> > RISC-V AIA, one HART does not have access to other HARTs
> > MSI enable/disable bits so the approach will also involve IPI.
>
> Correct, but the current series does a broadcast to all cores, where the
> 1-1 approach is at most an IPI to a single core.
>
> 128+c machines are getting more common, and you have devices that you
> bring up/down on a per-core basis. Broadcasting IPIs to all cores, when
> dealing with a per-core activity is a pretty noisy neighbor.

The broadcast IPI in the current approach is only done upon an MSI
mask/unmask operation. It is not done upon irq_set_affinity().

>
> This could be fixed in the existing 1-n approach, by not require to sync
> the cores that are not handling the MSI in question. "Lazy disable"

Incorrect. The approach you are suggesting involves an IPI upon every
irq_set_affinity(). This is because a HART can only enable its own
MSI IDs, so when an IRQ is moved from HART A to HART B with
a different ID X on HART B, we will need an IPI in irq_set_affinity()
to enable ID X on HART B.
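
To illustrate the constraint (a rough userspace model, not the actual
driver code): the IMSIC enable bits can only be written by the owning
HART, so enabling ID X on HART B from anywhere else has to go through an
IPI-style callback that runs on HART B:

#include <stdio.h>

#define NR_HARTS 4

static unsigned int eie[NR_HARTS]; /* per-hart enable bits (models eie CSRs) */
static int this_hart;              /* hart the "current code" runs on        */

/* Only the local hart can write its own enable bits. */
static void local_enable_id(unsigned int id)
{
        eie[this_hart] |= 1u << id;
}

/*
 * In the real system this would be an IPI plus a remote callback;
 * modelled here by switching the "current hart" before the call.
 */
static void ipi_call(int hart, void (*fn)(unsigned int), unsigned int arg)
{
        int prev = this_hart;

        this_hart = hart;
        fn(arg);
        this_hart = prev;
}

int main(void)
{
        this_hart = 0;
        local_enable_id(3);   /* the IRQ currently uses ID 3 on hart 0 */

        /* Move the IRQ to hart 1 with ID 5: the enable must run on hart 1. */
        ipi_call(1, local_enable_id, 5);
        eie[0] &= ~(1u << 3); /* local disable, no IPI needed */

        printf("eie[0]=%#x eie[1]=%#x\n", eie[0], eie[1]);
        return 0;
}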

>
> >> > 2) All wired IRQs for APLIC MSI-mode will also target a
> >> >     fixed HART hence irq_set_affinity() won't work for wired
> >> >     IRQs as well.
> >>
> >> I'm not following here. Why would APLIC put a constraint here? I had a
> >> look at the specs, and I didn't see anything supporting the current
> >> scheme explicitly.
> >
> > Lets say the number of APLIC wired interrupts  are greater than the
> > number of per-CPU IMSIC IDs. In this case, if all wired interrupts are
> > moved to a particular CPU then irq_set_affinity() will fail for some of
> > the wired interrupts.
>
> Right, it's the case of "full remote CPU" again. Thanks for clearing
> that up.
>
> >> > The idea of treating per-HART MSIs as separate IRQs has
> >> > been discussed in the past.
> >>
> >> Aha! I tried to look for it in lore, but didn't find any. Could you
> >> point me to those discussions?
> >
> > This was done 2 years back in the AIA TG meeting when we were
> > doing the PoC for AIA spec.
>
> Ah, too bad. Thanks regardless.
>
> >> My concern is interrupts become a scarce resource with this
> >> implementation, but maybe my view is incorrect. I've seen bare-metal
> >> x86 systems (no VMs) with ~200 cores, and ~2000 interrupts, but maybe
> >> that is considered "a lot of interrupts".
> >>
> >> As long as we don't get into scenarios where we're running out of
> >> interrupts, due to the software design.
> >>
> >
> > The current approach is simpler and ensures irq_set_affinity
> > always works. The limit of max 2047 IDs is sufficient for many
> > systems (if not all).
>
> Let me give you another view. On a 128c system each core has ~16 unique
> interrupts for disposal. E.g. the Intel E800 NIC has more than 2048
> network queue pairs for each PF.

Clearly, this example is hypothetical and represents a poorly
designed platform.

Having just 16 IDs per core is a very poor design choice. In fact, the
Server SoC spec mandates a minimum of 255 IDs.

Regarding NICs which support a large number of queues, the driver
will typically enable only one queue per core and set the affinity to
separate cores. We have user-space data-plane applications based
on DPDK which are capable of using a large number of NIC queues,
but these applications are polling-based and don't use MSIs.

>
> > When we encounter a system requiring a large number of MSIs,
> > we can either:
> > 1) Extend the AIA spec to support greater than 2047 IDs
> > 2) Re-think the approach in the IMSIC driver
> >
> > The choice between #1 and #2 above depends on the
> > guarantees we want for irq_set_affinity().
>
> The irq_set_affinity() behavior is better with this series, but I think
> the other downsides: number of available interrupt sources, and IPI
> broadcast are worse.

The IPI overhead in the approach you are suggesting will be
even worse compared to the IPI overhead of the current approach,
because we will end up doing an IPI upon every irq_set_affinity()
in the suggested approach, compared to doing an IPI upon every
mask/unmask in the current approach.

The biggest advantage of the current approach is a reliable
irq_set_affinity() which is a very valuable thing to have.

ARM systems easily support a large number of LPIs per-core.
For example, GIC-700 supports 56000 LPIs per-core.
(Refer, https://developer.arm.com/documentation/101516/0300/About-the-GIC-700/Features)

In the RISC-V world, we can easily define a small fast track
extension based on S*csrind extension which can allow a
large number of IMSIC IDs per-core.

Instead of addressing problems on a hypothetical system,
I suggest we go ahead with the current approach and deal
with a system having MSI over-subscription when such a
system shows up.

Regards,
Anup
Björn Töpel Oct. 20, 2023, 4:36 p.m. UTC | #7
Anup Patel <apatel@ventanamicro.com> writes:

> On Fri, Oct 20, 2023 at 8:10 PM Björn Töpel <bjorn@kernel.org> wrote:
>>
>> Anup Patel <apatel@ventanamicro.com> writes:
>>
>> > On Fri, Oct 20, 2023 at 2:17 PM Björn Töpel <bjorn@kernel.org> wrote:
>> >>
>> >> Thanks for the quick reply!
>> >>
>> >> Anup Patel <apatel@ventanamicro.com> writes:
>> >>
>> >> > On Thu, Oct 19, 2023 at 7:13 PM Björn Töpel <bjorn@kernel.org> wrote:
>> >> >>
>> >> >> Hi Anup,
>> >> >>
>> >> >> Anup Patel <apatel@ventanamicro.com> writes:
>> >> >>
>> >> >> > The RISC-V AIA specification is ratified as-per the RISC-V international
>> >> >> > process. The latest ratified AIA specifcation can be found at:
>> >> >> > https://github.com/riscv/riscv-aia/releases/download/1.0/riscv-interrupts-1.0.pdf
>> >> >> >
>> >> >> > At a high-level, the AIA specification adds three things:
>> >> >> > 1) AIA CSRs
>> >> >> >    - Improved local interrupt support
>> >> >> > 2) Incoming Message Signaled Interrupt Controller (IMSIC)
>> >> >> >    - Per-HART MSI controller
>> >> >> >    - Support MSI virtualization
>> >> >> >    - Support IPI along with virtualization
>> >> >> > 3) Advanced Platform-Level Interrupt Controller (APLIC)
>> >> >> >    - Wired interrupt controller
>> >> >> >    - In MSI-mode, converts wired interrupt into MSIs (i.e. MSI generator)
>> >> >> >    - In Direct-mode, injects external interrupts directly into HARTs
>> >> >>
>> >> >> Thanks for working on the AIA support! I had a look at the series, and
>> >> >> have some concerns about interrupt ID abstraction.
>> >> >>
>> >> >> A bit of background, for readers not familiar with the AIA details.
>> >> >>
>> >> >> IMSIC allows for 2047 unique MSI ("msi-irq") sources per hart, and
>> >> >> each MSI is dedicated to a certain hart. The series takes the approach
>> >> >> to say that there are, e.g., 2047 interrupts ("lnx-irq") globally.
>> >> >> Each lnx-irq consists of #harts * msi-irq -- a slice -- and in the
>> >> >> slice only *one* msi-irq is acutally used.
>> >> >>
>> >> >> This scheme makes affinity changes more robust, because the interrupt
>> >> >> sources on "other" harts are pre-allocated. On the other hand it
>> >> >> requires to propagate irq masking to other harts via IPIs (this is
>> >> >> mostly done up setup/tear down). It's also wasteful, because msi-irqs
>> >> >> are hogged, and cannot be used.
>> >> >>
>> >> >> Contemporary storage/networking drivers usually uses queues per core
>> >> >> (or a sub-set of cores). The current scheme wastes a lot of msi-irqs.
>> >> >> If we instead used a scheme where "msi-irq == lnx-irq", instead of
>> >> >> "lnq-irq = {hart 0;msi-irq x , ... hart N;msi-irq x}", there would be
>> >> >> a lot MSIs for other users. 1-1 vs 1-N. E.g., if a storage device
>> >> >> would like to use 5 queues (5 cores) on a 128 core system, the current
>> >> >> scheme would consume 5 * 128 MSIs, instead of just 5.
>> >> >>
>> >> >> On the plus side:
>> >> >> * Changing interrupts affinity will never fail, because the interrupts
>> >> >>   on each hart is pre-allocated.
>> >> >>
>> >> >> On the negative side:
>> >> >> * Wasteful interrupt usage, and a system can potientially "run out" of
>> >> >>   interrupts. Especially for many core systems.
>> >> >> * Interrupt masking need to proagate to harts via IPIs (there's no
>> >> >>   broadcast csr in IMSIC), and a more complex locking scheme IMSIC
>> >> >>
>> >> >> Summary:
>> >> >> The current series caps the number of global interrupts to maximum
>> >> >> 2047 MSIs for all cores (whole system). A better scheme, IMO, would be
>> >> >> to expose 2047 * #harts unique MSIs.
>> >> >>
>> >> >> I think this could simplify/remove(?) the locking as well.
>> >> >
>> >> > Exposing 2047 * #harts unique MSIs has multiple issues:
>> >> > 1) The irq_set_affinity() does not work for MSIs because each
>> >> >      IRQ is not tied to a particular HART. This means we can't
>> >> >      balance the IRQ processing load among HARTs.
>> >>
>> >> Yes, you can balance. In your code, each *active* MSI is still
>> >> bound/active to a specific hard together with the affinity mask. In an
>> >> 1-1 model you would still need to track the affinity mask, but the
>> >> irq_set_affinity() would be different. It would try to allocate a new
>> >> MSI from the target CPU, and then switch to having that MSI active.
>> >>
>> >> That's what x86 does AFAIU, which is also constrained by the # of
>> >> available MSIs.
>> >>
>> >> The downside, as I pointed out, is that the set affinity action can
>> >> fail for a certain target CPU.
>> >
>> > Yes, irq_set_affinity() can fail for the suggested approach plus for
>> > RISC-V AIA, one HART does not have access to other HARTs
>> > MSI enable/disable bits so the approach will also involve IPI.
>>
>> Correct, but the current series does a broadcast to all cores, where the
>> 1-1 approach is at most an IPI to a single core.
>>
>> 128+c machines are getting more common, and you have devices that you
>> bring up/down on a per-core basis. Broadcasting IPIs to all cores, when
>> dealing with a per-core activity is a pretty noisy neighbor.
>
> Broadcast IPI in the current approach is only done upon MSI mask/unmask
> operation. It is not done upon set_affinity() of interrupt handling.

I'm aware. We're on the same page here.

>>
>> This could be fixed in the existing 1-n approach, by not require to sync
>> the cores that are not handling the MSI in question. "Lazy disable"
>
> Incorrect. The approach you are suggesting involves an IPI upon every
> irq_set_affinity(). This is because a HART can only enable it's own
> MSI ID so when an IRQ is moved to from HART A to HART B with
> a different ID X on HART B then we will need an IPI in irq_set_affinit()
> to enable ID X on HART B.

Yes, the 1-1 approach will require an IPI to one target CPU on affinity
changes, and similarly on mask/unmask.

The 1-N approach requires no IPI on affinity changes (nice!), but an IPI
broadcast to all cores on mask/unmask (not so nice).

>> >> My concern is interrupts become a scarce resource with this
>> >> implementation, but maybe my view is incorrect. I've seen bare-metal
>> >> x86 systems (no VMs) with ~200 cores, and ~2000 interrupts, but maybe
>> >> that is considered "a lot of interrupts".
>> >>
>> >> As long as we don't get into scenarios where we're running out of
>> >> interrupts, due to the software design.
>> >>
>> >
>> > The current approach is simpler and ensures irq_set_affinity
>> > always works. The limit of max 2047 IDs is sufficient for many
>> > systems (if not all).
>>
>> Let me give you another view. On a 128c system each core has ~16 unique
>> interrupts for disposal. E.g. the Intel E800 NIC has more than 2048
>> network queue pairs for each PF.
>
> Clearly, this example is a hypothetical and represents a poorly
> designed platform.
>
> Having just 16 IDs per-Core is a very poor design choice. In fact, the
> Server SoC spec mandates a minimum 255 IDs.

You are misreading. A 128c system with 2047 MSIs per core will only
have 16 *per-core unique* (2047/128) interrupts with the current series.

I'm not saying that each IMSIC has 16 IDs; I'm saying that in a 128c
system with the maximum number of MSIs possible in the spec, you'll end
up with 16 *unique* interrupts per core.

> Regarding NICs which support a large number of queues, the driver
> will typically enable only one queue per-core and set the affinity to
> separate cores. We have user-space data plane applications based
> on DPDK which are capable of using a large number of NIC queues
> but these applications are polling based and don't use MSIs.

That's one sample point, and clearly not the only one. There are *many*
different usage models. Just because you *assign* MSIs doesn't mean they
are firing all the time.

I can show you a couple of networking setups where this is clearly not
enough. Each core has a large number of QoS queues, and each queue would
very much like to have a dedicated MSI.

>> > When we encounter a system requiring a large number of MSIs,
>> > we can either:
>> > 1) Extend the AIA spec to support greater than 2047 IDs
>> > 2) Re-think the approach in the IMSIC driver
>> >
>> > The choice between #1 and #2 above depends on the
>> > guarantees we want for irq_set_affinity().
>>
>> The irq_set_affinity() behavior is better with this series, but I think
>> the other downsides: number of available interrupt sources, and IPI
>> broadcast are worse.
>
> The IPI overhead in the approach you are suggesting will be
> even bad compared to the IPI overhead of the current approach
> because we will end-up doing IPI upon every irq_set_affinity()
> in the suggested approach compared to doing IPI upon every
> mask/unmask in the current approach.

Again, very workload dependent.

This series does IPI broadcast on masking/unmasking, which means that
cores that don't care get interrupted because, say, a network queue-pair
is set up on another core.

Some workloads never change the irq affinity.

I'm just pointing out that there are pro/cons with both variants.

> The biggest advantage of the current approach is a reliable
> irq_set_affinity() which is a very valuable thing to have.

...and I'm arguing that we're paying a big price for that.

> ARM systems easily support a large number of LPIs per-core.
> For example, GIC-700 supports 56000 LPIs per-core.
> (Refer, https://developer.arm.com/documentation/101516/0300/About-the-GIC-700/Features)

Yeah, but this is not the GIC. This is something that looks more like
the x86 world. We'll be stuck with a lot of implementations of the AIA 1.0
spec, and many cores.

> In the RISC-V world, we can easily define a small fast track
> extension based on S*csrind extension which can allow a
> large number of IMSIC IDs per-core.
>
> Instead of addressing problems on a hypothetical system,
> I suggest we go ahead with the current approach and deal
> with a system having MSI over-subscription when such a
> system shows up.

I've pointed out my concerns. We're not agreeing, but hey, I'm just one
sample point here! I'll leave it here for others to chime in!

Still much appreciate all the hard work on the series!


Have a nice weekend,
Björn
Anup Patel Oct. 20, 2023, 5:13 p.m. UTC | #8
On Fri, Oct 20, 2023 at 10:07 PM Björn Töpel <bjorn@kernel.org> wrote:
>
> Anup Patel <apatel@ventanamicro.com> writes:
>
> > On Fri, Oct 20, 2023 at 8:10 PM Björn Töpel <bjorn@kernel.org> wrote:
> >>
> >> Anup Patel <apatel@ventanamicro.com> writes:
> >>
> >> > On Fri, Oct 20, 2023 at 2:17 PM Björn Töpel <bjorn@kernel.org> wrote:
> >> >>
> >> >> Thanks for the quick reply!
> >> >>
> >> >> Anup Patel <apatel@ventanamicro.com> writes:
> >> >>
> >> >> > On Thu, Oct 19, 2023 at 7:13 PM Björn Töpel <bjorn@kernel.org> wrote:
> >> >> >>
> >> >> >> Hi Anup,
> >> >> >>
> >> >> >> Anup Patel <apatel@ventanamicro.com> writes:
> >> >> >>
> >> >> >> > The RISC-V AIA specification is ratified as-per the RISC-V international
> >> >> >> > process. The latest ratified AIA specifcation can be found at:
> >> >> >> > https://github.com/riscv/riscv-aia/releases/download/1.0/riscv-interrupts-1.0.pdf
> >> >> >> >
> >> >> >> > At a high-level, the AIA specification adds three things:
> >> >> >> > 1) AIA CSRs
> >> >> >> >    - Improved local interrupt support
> >> >> >> > 2) Incoming Message Signaled Interrupt Controller (IMSIC)
> >> >> >> >    - Per-HART MSI controller
> >> >> >> >    - Support MSI virtualization
> >> >> >> >    - Support IPI along with virtualization
> >> >> >> > 3) Advanced Platform-Level Interrupt Controller (APLIC)
> >> >> >> >    - Wired interrupt controller
> >> >> >> >    - In MSI-mode, converts wired interrupt into MSIs (i.e. MSI generator)
> >> >> >> >    - In Direct-mode, injects external interrupts directly into HARTs
> >> >> >>
> >> >> >> Thanks for working on the AIA support! I had a look at the series, and
> >> >> >> have some concerns about interrupt ID abstraction.
> >> >> >>
> >> >> >> A bit of background, for readers not familiar with the AIA details.
> >> >> >>
> >> >> >> IMSIC allows for 2047 unique MSI ("msi-irq") sources per hart, and
> >> >> >> each MSI is dedicated to a certain hart. The series takes the approach
> >> >> >> to say that there are, e.g., 2047 interrupts ("lnx-irq") globally.
> >> >> >> Each lnx-irq consists of #harts * msi-irq -- a slice -- and in the
> >> >> >> slice only *one* msi-irq is acutally used.
> >> >> >>
> >> >> >> This scheme makes affinity changes more robust, because the interrupt
> >> >> >> sources on "other" harts are pre-allocated. On the other hand it
> >> >> >> requires to propagate irq masking to other harts via IPIs (this is
> >> >> >> mostly done up setup/tear down). It's also wasteful, because msi-irqs
> >> >> >> are hogged, and cannot be used.
> >> >> >>
> >> >> >> Contemporary storage/networking drivers usually uses queues per core
> >> >> >> (or a sub-set of cores). The current scheme wastes a lot of msi-irqs.
> >> >> >> If we instead used a scheme where "msi-irq == lnx-irq", instead of
> >> >> >> "lnq-irq = {hart 0;msi-irq x , ... hart N;msi-irq x}", there would be
> >> >> >> a lot MSIs for other users. 1-1 vs 1-N. E.g., if a storage device
> >> >> >> would like to use 5 queues (5 cores) on a 128 core system, the current
> >> >> >> scheme would consume 5 * 128 MSIs, instead of just 5.
> >> >> >>
> >> >> >> On the plus side:
> >> >> >> * Changing interrupts affinity will never fail, because the interrupts
> >> >> >>   on each hart is pre-allocated.
> >> >> >>
> >> >> >> On the negative side:
> >> >> >> * Wasteful interrupt usage, and a system can potientially "run out" of
> >> >> >>   interrupts. Especially for many core systems.
> >> >> >> * Interrupt masking need to proagate to harts via IPIs (there's no
> >> >> >>   broadcast csr in IMSIC), and a more complex locking scheme IMSIC
> >> >> >>
> >> >> >> Summary:
> >> >> >> The current series caps the number of global interrupts to maximum
> >> >> >> 2047 MSIs for all cores (whole system). A better scheme, IMO, would be
> >> >> >> to expose 2047 * #harts unique MSIs.
> >> >> >>
> >> >> >> I think this could simplify/remove(?) the locking as well.
> >> >> >
> >> >> > Exposing 2047 * #harts unique MSIs has multiple issues:
> >> >> > 1) The irq_set_affinity() does not work for MSIs because each
> >> >> >      IRQ is not tied to a particular HART. This means we can't
> >> >> >      balance the IRQ processing load among HARTs.
> >> >>
> >> >> Yes, you can balance. In your code, each *active* MSI is still
> >> >> bound/active to a specific hard together with the affinity mask. In an
> >> >> 1-1 model you would still need to track the affinity mask, but the
> >> >> irq_set_affinity() would be different. It would try to allocate a new
> >> >> MSI from the target CPU, and then switch to having that MSI active.
> >> >>
> >> >> That's what x86 does AFAIU, which is also constrained by the # of
> >> >> available MSIs.
> >> >>
> >> >> The downside, as I pointed out, is that the set affinity action can
> >> >> fail for a certain target CPU.
> >> >
> >> > Yes, irq_set_affinity() can fail for the suggested approach plus for
> >> > RISC-V AIA, one HART does not have access to other HARTs
> >> > MSI enable/disable bits so the approach will also involve IPI.
> >>
> >> Correct, but the current series does a broadcast to all cores, where the
> >> 1-1 approach is at most an IPI to a single core.
> >>
> >> 128+c machines are getting more common, and you have devices that you
> >> bring up/down on a per-core basis. Broadcasting IPIs to all cores, when
> >> dealing with a per-core activity is a pretty noisy neighbor.
> >
> > Broadcast IPI in the current approach is only done upon MSI mask/unmask
> > operation. It is not done upon set_affinity() of interrupt handling.
>
> I'm aware. We're on the same page here.
>
> >>
> >> This could be fixed in the existing 1-n approach, by not require to sync
> >> the cores that are not handling the MSI in question. "Lazy disable"
> >
> > Incorrect. The approach you are suggesting involves an IPI upon every
> > irq_set_affinity(). This is because a HART can only enable it's own
> > MSI ID so when an IRQ is moved to from HART A to HART B with
> > a different ID X on HART B then we will need an IPI in irq_set_affinit()
> > to enable ID X on HART B.
>
> Yes, the 1-1 approach will require an IPI to one target cpu on affinity
> changes, and similar on mask/unmask.
>
> The 1-n approach, require no-IPI on affinity changes (nice!), but IPI
> broadcast to all cores on mask/unmask (not so nice).
>
> >> >> My concern is interrupts become a scarce resource with this
> >> >> implementation, but maybe my view is incorrect. I've seen bare-metal
> >> >> x86 systems (no VMs) with ~200 cores, and ~2000 interrupts, but maybe
> >> >> that is considered "a lot of interrupts".
> >> >>
> >> >> As long as we don't get into scenarios where we're running out of
> >> >> interrupts, due to the software design.
> >> >>
> >> >
> >> > The current approach is simpler and ensures irq_set_affinity
> >> > always works. The limit of max 2047 IDs is sufficient for many
> >> > systems (if not all).
> >>
> >> Let me give you another view. On a 128c system each core has ~16 unique
> >> interrupts for disposal. E.g. the Intel E800 NIC has more than 2048
> >> network queue pairs for each PF.
> >
> > Clearly, this example is a hypothetical and represents a poorly
> > designed platform.
> >
> > Having just 16 IDs per-Core is a very poor design choice. In fact, the
> > Server SoC spec mandates a minimum 255 IDs.
>
> You are misreading. A 128c system with 2047 MSIs per-core, will only
> have 16 *per-core unique* (2047/128) interrupts with the current series.
>
> I'm not saying that each IMSIC has 16 IDs, I'm saying that in a 128c
> system with the maximum amount of MSIs possible in the spec, you'll end
> up with 16 *unique* interrupts per core.

-ENOPARSE

I don't see how this applies to the current approach, because we treat
the MSI ID space as global across cores, so if a system has 2047 MSIs
per-core then we have 2047 MSIs across all cores.

>
> > Regarding NICs which support a large number of queues, the driver
> > will typically enable only one queue per-core and set the affinity to
> > separate cores. We have user-space data plane applications based
> > on DPDK which are capable of using a large number of NIC queues
> > but these applications are polling based and don't use MSIs.
>
> That's one sample point, and clearly not the only one. There are *many*
> different usage models. Just because you *assign* MSI, doesn't mean they
> are firing all the time.
>
> I can show you a couple of networking setups where this is clearly not
> enough. Each core has a large number of QoS queues, and each queue would
> very much like to have a dedicated MSI.
>
> >> > When we encounter a system requiring a large number of MSIs,
> >> > we can either:
> >> > 1) Extend the AIA spec to support greater than 2047 IDs
> >> > 2) Re-think the approach in the IMSIC driver
> >> >
> >> > The choice between #1 and #2 above depends on the
> >> > guarantees we want for irq_set_affinity().
> >>
> >> The irq_set_affinity() behavior is better with this series, but I think
> >> the other downsides: number of available interrupt sources, and IPI
> >> broadcast are worse.
> >
> > The IPI overhead in the approach you are suggesting will be
> > even bad compared to the IPI overhead of the current approach
> > because we will end-up doing IPI upon every irq_set_affinity()
> > in the suggested approach compared to doing IPI upon every
> > mask/unmask in the current approach.
>
> Again, very workload dependent.
>
> This series does IPI broadcast on masking/unmasking, which means that
> cores that don't care get interrupted because, say, a network queue-pair
> is setup on another core.
>
> Some workloads never change the irq affinity.

There are various events which change irq affinity, such as irq balance,
CPU hotplug, system suspend, etc.

Also, the 1-1 approach does an IPI upon set_affinity, mask and unmask,
whereas the 1-n approach does an IPI only upon mask and unmask.

>
> I'm just pointing out that there are pro/cons with both variants.
>
> > The biggest advantage of the current approach is a reliable
> > irq_set_affinity() which is a very valuable thing to have.
>
> ...and I'm arguing that we're paying a big price for that.
>
> > ARM systems easily support a large number of LPIs per-core.
> > For example, GIC-700 supports 56000 LPIs per-core.
> > (Refer, https://developer.arm.com/documentation/101516/0300/About-the-GIC-700/Features)
>
> Yeah, but this is not the GIC. This is something that looks more like
> the x86 world. We'll be stuck with a lot of implementations with AIA 1.0
> spec, and many cores.

Well, RISC-V AIA is neither ARM GIC nor x86 APIC. All I am saying
is that there are systems with a large number of per-core interrupt IDs
for handling MSIs.

>
> > In the RISC-V world, we can easily define a small fast track
> > extension based on S*csrind extension which can allow a
> > large number of IMSIC IDs per-core.
> >
> > Instead of addressing problems on a hypothetical system,
> > I suggest we go ahead with the current approach and deal
> > with a system having MSI over-subscription when such a
> > system shows up.
>
> I've pointed out my concerns. We're not agreeing, but hey, I'm just one
> sample point here! I'll leave it here for others to chime in!
>
> Still much appreciate all the hard work on the series!

Thanks, we have disagreements on this topic but this is
certainly a good discussion.

>
>
> Have a nice weekend,

Regards,
Anup
Björn Töpel Oct. 20, 2023, 7:45 p.m. UTC | #9
Anup Patel <apatel@ventanamicro.com> writes:

> On Fri, Oct 20, 2023 at 10:07 PM Björn Töpel <bjorn@kernel.org> wrote:
>>
>> Anup Patel <apatel@ventanamicro.com> writes:
>>
>> > On Fri, Oct 20, 2023 at 8:10 PM Björn Töpel <bjorn@kernel.org> wrote:
>> >>
>> >> Anup Patel <apatel@ventanamicro.com> writes:
>> >>
>> >> > On Fri, Oct 20, 2023 at 2:17 PM Björn Töpel <bjorn@kernel.org> wrote:
>> >> >>
>> >> >> Thanks for the quick reply!
>> >> >>
>> >> >> Anup Patel <apatel@ventanamicro.com> writes:
>> >> >>
>> >> >> > On Thu, Oct 19, 2023 at 7:13 PM Björn Töpel <bjorn@kernel.org> wrote:
>> >> >> >>
>> >> >> >> Hi Anup,
>> >> >> >>
>> >> >> >> Anup Patel <apatel@ventanamicro.com> writes:
>> >> >> >>
>> >> >> >> > The RISC-V AIA specification is ratified as-per the RISC-V international
>> >> >> >> > process. The latest ratified AIA specifcation can be found at:
>> >> >> >> > https://github.com/riscv/riscv-aia/releases/download/1.0/riscv-interrupts-1.0.pdf
>> >> >> >> >
>> >> >> >> > At a high-level, the AIA specification adds three things:
>> >> >> >> > 1) AIA CSRs
>> >> >> >> >    - Improved local interrupt support
>> >> >> >> > 2) Incoming Message Signaled Interrupt Controller (IMSIC)
>> >> >> >> >    - Per-HART MSI controller
>> >> >> >> >    - Support MSI virtualization
>> >> >> >> >    - Support IPI along with virtualization
>> >> >> >> > 3) Advanced Platform-Level Interrupt Controller (APLIC)
>> >> >> >> >    - Wired interrupt controller
>> >> >> >> >    - In MSI-mode, converts wired interrupt into MSIs (i.e. MSI generator)
>> >> >> >> >    - In Direct-mode, injects external interrupts directly into HARTs
>> >> >> >>
>> >> >> >> Thanks for working on the AIA support! I had a look at the series, and
>> >> >> >> have some concerns about interrupt ID abstraction.
>> >> >> >>
>> >> >> >> A bit of background, for readers not familiar with the AIA details.
>> >> >> >>
>> >> >> >> IMSIC allows for 2047 unique MSI ("msi-irq") sources per hart, and
>> >> >> >> each MSI is dedicated to a certain hart. The series takes the approach
>> >> >> >> to say that there are, e.g., 2047 interrupts ("lnx-irq") globally.
>> >> >> >> Each lnx-irq consists of #harts * msi-irq -- a slice -- and in the
>> >> >> >> slice only *one* msi-irq is acutally used.
>> >> >> >>
>> >> >> >> This scheme makes affinity changes more robust, because the interrupt
>> >> >> >> sources on "other" harts are pre-allocated. On the other hand it
>> >> >> >> requires to propagate irq masking to other harts via IPIs (this is
>> >> >> >> mostly done up setup/tear down). It's also wasteful, because msi-irqs
>> >> >> >> are hogged, and cannot be used.
>> >> >> >>
>> >> >> >> Contemporary storage/networking drivers usually uses queues per core
>> >> >> >> (or a sub-set of cores). The current scheme wastes a lot of msi-irqs.
>> >> >> >> If we instead used a scheme where "msi-irq == lnx-irq", instead of
>> >> >> >> "lnq-irq = {hart 0;msi-irq x , ... hart N;msi-irq x}", there would be
>> >> >> >> a lot MSIs for other users. 1-1 vs 1-N. E.g., if a storage device
>> >> >> >> would like to use 5 queues (5 cores) on a 128 core system, the current
>> >> >> >> scheme would consume 5 * 128 MSIs, instead of just 5.
>> >> >> >>
>> >> >> >> On the plus side:
>> >> >> >> * Changing interrupts affinity will never fail, because the interrupts
>> >> >> >>   on each hart is pre-allocated.
>> >> >> >>
>> >> >> >> On the negative side:
>> >> >> >> * Wasteful interrupt usage, and a system can potientially "run out" of
>> >> >> >>   interrupts. Especially for many core systems.
>> >> >> >> * Interrupt masking need to proagate to harts via IPIs (there's no
>> >> >> >>   broadcast csr in IMSIC), and a more complex locking scheme IMSIC
>> >> >> >>
>> >> >> >> Summary:
>> >> >> >> The current series caps the number of global interrupts to maximum
>> >> >> >> 2047 MSIs for all cores (whole system). A better scheme, IMO, would be
>> >> >> >> to expose 2047 * #harts unique MSIs.
>> >> >> >>
>> >> >> >> I think this could simplify/remove(?) the locking as well.
>> >> >> >
>> >> >> > Exposing 2047 * #harts unique MSIs has multiple issues:
>> >> >> > 1) The irq_set_affinity() does not work for MSIs because each
>> >> >> >      IRQ is not tied to a particular HART. This means we can't
>> >> >> >      balance the IRQ processing load among HARTs.
>> >> >>
>> >> >> Yes, you can balance. In your code, each *active* MSI is still
>> >> >> bound/active to a specific hard together with the affinity mask. In an
>> >> >> 1-1 model you would still need to track the affinity mask, but the
>> >> >> irq_set_affinity() would be different. It would try to allocate a new
>> >> >> MSI from the target CPU, and then switch to having that MSI active.
>> >> >>
>> >> >> That's what x86 does AFAIU, which is also constrained by the # of
>> >> >> available MSIs.
>> >> >>
>> >> >> The downside, as I pointed out, is that the set affinity action can
>> >> >> fail for a certain target CPU.
>> >> >
>> >> > Yes, irq_set_affinity() can fail for the suggested approach plus for
>> >> > RISC-V AIA, one HART does not have access to other HARTs
>> >> > MSI enable/disable bits so the approach will also involve IPI.
>> >>
>> >> Correct, but the current series does a broadcast to all cores, where the
>> >> 1-1 approach is at most an IPI to a single core.
>> >>
>> >> 128+c machines are getting more common, and you have devices that you
>> >> bring up/down on a per-core basis. Broadcasting IPIs to all cores, when
>> >> dealing with a per-core activity is a pretty noisy neighbor.
>> >
>> > Broadcast IPI in the current approach is only done upon MSI mask/unmask
>> > operation. It is not done upon set_affinity() of interrupt handling.
>>
>> I'm aware. We're on the same page here.
>>
>> >>
>> >> This could be fixed in the existing 1-n approach, by not require to sync
>> >> the cores that are not handling the MSI in question. "Lazy disable"
>> >
>> > Incorrect. The approach you are suggesting involves an IPI upon every
>> > irq_set_affinity(). This is because a HART can only enable it's own
>> > MSI ID so when an IRQ is moved to from HART A to HART B with
>> > a different ID X on HART B then we will need an IPI in irq_set_affinit()
>> > to enable ID X on HART B.
>>
>> Yes, the 1-1 approach will require an IPI to one target cpu on affinity
>> changes, and similar on mask/unmask.
>>
>> The 1-n approach, require no-IPI on affinity changes (nice!), but IPI
>> broadcast to all cores on mask/unmask (not so nice).
>>
>> >> >> My concern is interrupts become a scarce resource with this
>> >> >> implementation, but maybe my view is incorrect. I've seen bare-metal
>> >> >> x86 systems (no VMs) with ~200 cores, and ~2000 interrupts, but maybe
>> >> >> that is considered "a lot of interrupts".
>> >> >>
>> >> >> As long as we don't get into scenarios where we're running out of
>> >> >> interrupts, due to the software design.
>> >> >>
>> >> >
>> >> > The current approach is simpler and ensures irq_set_affinity
>> >> > always works. The limit of max 2047 IDs is sufficient for many
>> >> > systems (if not all).
>> >>
>> >> Let me give you another view. On a 128c system each core has ~16 unique
>> >> interrupts for disposal. E.g. the Intel E800 NIC has more than 2048
>> >> network queue pairs for each PF.
>> >
>> > Clearly, this example is a hypothetical and represents a poorly
>> > designed platform.
>> >
>> > Having just 16 IDs per-Core is a very poor design choice. In fact, the
>> > Server SoC spec mandates a minimum 255 IDs.
>>
>> You are misreading. A 128c system with 2047 MSIs per-core, will only
>> have 16 *per-core unique* (2047/128) interrupts with the current series.
>>
>> I'm not saying that each IMSIC has 16 IDs, I'm saying that in a 128c
>> system with the maximum amount of MSIs possible in the spec, you'll end
>> up with 16 *unique* interrupts per core.
>
> -ENOPARSE
>
> I don't see how this applies to the current approach because we treat
> MSI ID space as global across cores so if a system has 2047 MSIs
> per-core then we have 2047 MSIs across all cores.

Ok, I'll try again! :-)

Let's assume that each core in the 128c system has some per-core
resources, say two NIC queue pairs and a storage queue pair. This
will consume, e.g., 2*2 + 2 (6) MSI sources from the global namespace.

If each core does this, it'll be 6*128 MSI sources of the global
namespace.

The maximum number of "private" MSI sources a core can utilize is 16.

I'm trying (it doesn't seem to go that well ;-)) to point out that it's
only 16 unique sources per core. For, say, a 256 core system it would be
8. 2047 MSI sources in a system is not much.

Say that I want to spin up 24 NIC queues, with one MSI each, on each core
of my 128c system. That's not possible with this series, while with a
1-1 scheme it wouldn't be an issue.
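
The same arithmetic as a tiny back-of-the-envelope sketch (plain C,
purely illustrative):

/* Back-of-the-envelope numbers for the 1-n scheme; purely illustrative. */
#include <stdio.h>

int main(void)
{
	const int global_ids = 2047;	/* one shared ID namespace in the 1-n scheme */
	const int cores = 128;

	/* Each lnx-irq reserves one msi-irq on *every* core, so a single
	 * core's budget of distinct interrupt sources is about
	 * global_ids / cores. */
	printf("per-core unique sources: ~%d\n", (global_ids + cores - 1) / cores);

	/* 24 NIC queues with one MSI each, on every core: */
	printf("needed %d, available %d\n", 24 * cores, global_ids);
	return 0;
}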

Clearer, or still weird?

>
>>
>> > Regarding NICs which support a large number of queues, the driver
>> > will typically enable only one queue per-core and set the affinity to
>> > separate cores. We have user-space data plane applications based
>> > on DPDK which are capable of using a large number of NIC queues
>> > but these applications are polling based and don't use MSIs.
>>
>> That's one sample point, and clearly not the only one. There are *many*
>> different usage models. Just because you *assign* MSI, doesn't mean they
>> are firing all the time.
>>
>> I can show you a couple of networking setups where this is clearly not
>> enough. Each core has a large number of QoS queues, and each queue would
>> very much like to have a dedicated MSI.
>>
>> >> > When we encounter a system requiring a large number of MSIs,
>> >> > we can either:
>> >> > 1) Extend the AIA spec to support greater than 2047 IDs
>> >> > 2) Re-think the approach in the IMSIC driver
>> >> >
>> >> > The choice between #1 and #2 above depends on the
>> >> > guarantees we want for irq_set_affinity().
>> >>
>> >> The irq_set_affinity() behavior is better with this series, but I think
>> >> the other downsides: number of available interrupt sources, and IPI
>> >> broadcast are worse.
>> >
>> > The IPI overhead in the approach you are suggesting will be
>> > even bad compared to the IPI overhead of the current approach
>> > because we will end-up doing IPI upon every irq_set_affinity()
>> > in the suggested approach compared to doing IPI upon every
>> > mask/unmask in the current approach.
>>
>> Again, very workload dependent.
>>
>> This series does IPI broadcast on masking/unmasking, which means that
>> cores that don't care get interrupted because, say, a network queue-pair
>> is setup on another core.
>>
>> Some workloads never change the irq affinity.
>
> There are various events which irq affinity such as irq balance,
> CPU hotplug, system suspend, etc.
>
> Also, the 1-1 approach does IPI upon set_affinity, mask and
> unmask whereas the 1-n approach does IPI only upon mask
> and unmask.

An important distinction: when you say IPI on mask/unmask, it is a
broadcast IPI to *all* cores, which is pretty intrusive.

The 1-1 variant does an IPI to *one* target core.

>> I'm just pointing out that there are pro/cons with both variants.
>>
>> > The biggest advantage of the current approach is a reliable
>> > irq_set_affinity() which is a very valuable thing to have.
>>
>> ...and I'm arguing that we're paying a big price for that.
>>
>> > ARM systems easily support a large number of LPIs per-core.
>> > For example, GIC-700 supports 56000 LPIs per-core.
>> > (Refer, https://developer.arm.com/documentation/101516/0300/About-the-GIC-700/Features)
>>
>> Yeah, but this is not the GIC. This is something that looks more like
>> the x86 world. We'll be stuck with a lot of implementations with AIA 1.0
>> spec, and many cores.
>
> Well, RISC-V AIA is neigher ARM GIG not x86 APIC. All I am saying
> is that there are systems with large number per-core interrupt IDs
> for handling MSIs.

Yes, and while that is nice, it's not what IMSIC is.


Now, back to the weekend for real! ;-) (https://xkcd.com/386/)
Björn
Björn Töpel Oct. 23, 2023, 7:02 a.m. UTC | #10
Björn Töpel <bjorn@kernel.org> writes:

> Anup Patel <apatel@ventanamicro.com> writes:
>
>> On Fri, Oct 20, 2023 at 10:07 PM Björn Töpel <bjorn@kernel.org> wrote:
>>>
>>> Anup Patel <apatel@ventanamicro.com> writes:
>>>
>>> > On Fri, Oct 20, 2023 at 8:10 PM Björn Töpel <bjorn@kernel.org> wrote:
>>> >>
>>> >> Anup Patel <apatel@ventanamicro.com> writes:
>>> >>
>>> >> > On Fri, Oct 20, 2023 at 2:17 PM Björn Töpel <bjorn@kernel.org> wrote:
>>> >> >>
>>> >> >> Thanks for the quick reply!
>>> >> >>
>>> >> >> Anup Patel <apatel@ventanamicro.com> writes:
>>> >> >>
>>> >> >> > On Thu, Oct 19, 2023 at 7:13 PM Björn Töpel <bjorn@kernel.org> wrote:
>>> >> >> >>
>>> >> >> >> Hi Anup,
>>> >> >> >>
>>> >> >> >> Anup Patel <apatel@ventanamicro.com> writes:
>>> >> >> >>
>>> >> >> >> > The RISC-V AIA specification is ratified as-per the RISC-V international
>>> >> >> >> > process. The latest ratified AIA specifcation can be found at:
>>> >> >> >> > https://github.com/riscv/riscv-aia/releases/download/1.0/riscv-interrupts-1.0.pdf
>>> >> >> >> >
>>> >> >> >> > At a high-level, the AIA specification adds three things:
>>> >> >> >> > 1) AIA CSRs
>>> >> >> >> >    - Improved local interrupt support
>>> >> >> >> > 2) Incoming Message Signaled Interrupt Controller (IMSIC)
>>> >> >> >> >    - Per-HART MSI controller
>>> >> >> >> >    - Support MSI virtualization
>>> >> >> >> >    - Support IPI along with virtualization
>>> >> >> >> > 3) Advanced Platform-Level Interrupt Controller (APLIC)
>>> >> >> >> >    - Wired interrupt controller
>>> >> >> >> >    - In MSI-mode, converts wired interrupt into MSIs (i.e. MSI generator)
>>> >> >> >> >    - In Direct-mode, injects external interrupts directly into HARTs
>>> >> >> >>
>>> >> >> >> Thanks for working on the AIA support! I had a look at the series, and
>>> >> >> >> have some concerns about interrupt ID abstraction.
>>> >> >> >>
>>> >> >> >> A bit of background, for readers not familiar with the AIA details.
>>> >> >> >>
>>> >> >> >> IMSIC allows for 2047 unique MSI ("msi-irq") sources per hart, and
>>> >> >> >> each MSI is dedicated to a certain hart. The series takes the approach
>>> >> >> >> to say that there are, e.g., 2047 interrupts ("lnx-irq") globally.
>>> >> >> >> Each lnx-irq consists of #harts * msi-irq -- a slice -- and in the
>>> >> >> >> slice only *one* msi-irq is acutally used.
>>> >> >> >>
>>> >> >> >> This scheme makes affinity changes more robust, because the interrupt
>>> >> >> >> sources on "other" harts are pre-allocated. On the other hand it
>>> >> >> >> requires to propagate irq masking to other harts via IPIs (this is
>>> >> >> >> mostly done up setup/tear down). It's also wasteful, because msi-irqs
>>> >> >> >> are hogged, and cannot be used.
>>> >> >> >>
>>> >> >> >> Contemporary storage/networking drivers usually uses queues per core
>>> >> >> >> (or a sub-set of cores). The current scheme wastes a lot of msi-irqs.
>>> >> >> >> If we instead used a scheme where "msi-irq == lnx-irq", instead of
>>> >> >> >> "lnq-irq = {hart 0;msi-irq x , ... hart N;msi-irq x}", there would be
>>> >> >> >> a lot MSIs for other users. 1-1 vs 1-N. E.g., if a storage device
>>> >> >> >> would like to use 5 queues (5 cores) on a 128 core system, the current
>>> >> >> >> scheme would consume 5 * 128 MSIs, instead of just 5.
>>> >> >> >>
>>> >> >> >> On the plus side:
>>> >> >> >> * Changing interrupts affinity will never fail, because the interrupts
>>> >> >> >>   on each hart is pre-allocated.
>>> >> >> >>
>>> >> >> >> On the negative side:
>>> >> >> >> * Wasteful interrupt usage, and a system can potientially "run out" of
>>> >> >> >>   interrupts. Especially for many core systems.
>>> >> >> >> * Interrupt masking need to proagate to harts via IPIs (there's no
>>> >> >> >>   broadcast csr in IMSIC), and a more complex locking scheme IMSIC
>>> >> >> >>
>>> >> >> >> Summary:
>>> >> >> >> The current series caps the number of global interrupts to maximum
>>> >> >> >> 2047 MSIs for all cores (whole system). A better scheme, IMO, would be
>>> >> >> >> to expose 2047 * #harts unique MSIs.
>>> >> >> >>
>>> >> >> >> I think this could simplify/remove(?) the locking as well.
>>> >> >> >
>>> >> >> > Exposing 2047 * #harts unique MSIs has multiple issues:
>>> >> >> > 1) The irq_set_affinity() does not work for MSIs because each
>>> >> >> >      IRQ is not tied to a particular HART. This means we can't
>>> >> >> >      balance the IRQ processing load among HARTs.
>>> >> >>
>>> >> >> Yes, you can balance. In your code, each *active* MSI is still
>>> >> >> bound/active to a specific hard together with the affinity mask. In an
>>> >> >> 1-1 model you would still need to track the affinity mask, but the
>>> >> >> irq_set_affinity() would be different. It would try to allocate a new
>>> >> >> MSI from the target CPU, and then switch to having that MSI active.
>>> >> >>
>>> >> >> That's what x86 does AFAIU, which is also constrained by the # of
>>> >> >> available MSIs.
>>> >> >>
>>> >> >> The downside, as I pointed out, is that the set affinity action can
>>> >> >> fail for a certain target CPU.
>>> >> >
>>> >> > Yes, irq_set_affinity() can fail for the suggested approach plus for
>>> >> > RISC-V AIA, one HART does not have access to other HARTs
>>> >> > MSI enable/disable bits so the approach will also involve IPI.
>>> >>
>>> >> Correct, but the current series does a broadcast to all cores, where the
>>> >> 1-1 approach is at most an IPI to a single core.
>>> >>
>>> >> 128+c machines are getting more common, and you have devices that you
>>> >> bring up/down on a per-core basis. Broadcasting IPIs to all cores, when
>>> >> dealing with a per-core activity is a pretty noisy neighbor.
>>> >
>>> > Broadcast IPI in the current approach is only done upon MSI mask/unmask
>>> > operation. It is not done upon set_affinity() of interrupt handling.
>>>
>>> I'm aware. We're on the same page here.
>>>
>>> >>
>>> >> This could be fixed in the existing 1-n approach, by not require to sync
>>> >> the cores that are not handling the MSI in question. "Lazy disable"
>>> >
>>> > Incorrect. The approach you are suggesting involves an IPI upon every
>>> > irq_set_affinity(). This is because a HART can only enable it's own
>>> > MSI ID so when an IRQ is moved to from HART A to HART B with
>>> > a different ID X on HART B then we will need an IPI in irq_set_affinit()
>>> > to enable ID X on HART B.
>>>
>>> Yes, the 1-1 approach will require an IPI to one target cpu on affinity
>>> changes, and similar on mask/unmask.
>>>
>>> The 1-n approach, require no-IPI on affinity changes (nice!), but IPI
>>> broadcast to all cores on mask/unmask (not so nice).
>>>
>>> >> >> My concern is interrupts become a scarce resource with this
>>> >> >> implementation, but maybe my view is incorrect. I've seen bare-metal
>>> >> >> x86 systems (no VMs) with ~200 cores, and ~2000 interrupts, but maybe
>>> >> >> that is considered "a lot of interrupts".
>>> >> >>
>>> >> >> As long as we don't get into scenarios where we're running out of
>>> >> >> interrupts, due to the software design.
>>> >> >>
>>> >> >
>>> >> > The current approach is simpler and ensures irq_set_affinity
>>> >> > always works. The limit of max 2047 IDs is sufficient for many
>>> >> > systems (if not all).
>>> >>
>>> >> Let me give you another view. On a 128c system each core has ~16 unique
>>> >> interrupts for disposal. E.g. the Intel E800 NIC has more than 2048
>>> >> network queue pairs for each PF.
>>> >
>>> > Clearly, this example is a hypothetical and represents a poorly
>>> > designed platform.
>>> >
>>> > Having just 16 IDs per-Core is a very poor design choice. In fact, the
>>> > Server SoC spec mandates a minimum 255 IDs.
>>>
>>> You are misreading. A 128c system with 2047 MSIs per-core, will only
>>> have 16 *per-core unique* (2047/128) interrupts with the current series.
>>>
>>> I'm not saying that each IMSIC has 16 IDs, I'm saying that in a 128c
>>> system with the maximum amount of MSIs possible in the spec, you'll end
>>> up with 16 *unique* interrupts per core.
>>
>> -ENOPARSE
>>
>> I don't see how this applies to the current approach because we treat
>> MSI ID space as global across cores so if a system has 2047 MSIs
>> per-core then we have 2047 MSIs across all cores.
>
> Ok, I'll try again! :-)
>
> Let's assume that each core in the 128c system has some per-core
> resources, say a two NIC queue pairs, and a storage queue pair. This
> will consume, e.g., 2*2 + 2 (6) MSI sources from the global namespace.
>
> If each core does this it'll be 6*128 MSI sources of the global
> namespace.
>
> The maximum number of "privates" MSI sources a core can utilize is 16.
>
> I'm trying (it's does seem to go that well ;-)) to point out that it's
> only 16 unique sources per core. For, say, a 256 core system it would be
> 8. 2047 MSI sources in a system is not much.
>
> Say that I want to spin up 24 NIC queues with one MSI each on each core
> on my 128c system. That's not possible with this series, while with an
> 1-1 system it wouldn't be an issue.
>
> Clearer, or still weird?
>
>>
>>>
>>> > Regarding NICs which support a large number of queues, the driver
>>> > will typically enable only one queue per-core and set the affinity to
>>> > separate cores. We have user-space data plane applications based
>>> > on DPDK which are capable of using a large number of NIC queues
>>> > but these applications are polling based and don't use MSIs.
>>>
>>> That's one sample point, and clearly not the only one. There are *many*
>>> different usage models. Just because you *assign* MSI, doesn't mean they
>>> are firing all the time.
>>>
>>> I can show you a couple of networking setups where this is clearly not
>>> enough. Each core has a large number of QoS queues, and each queue would
>>> very much like to have a dedicated MSI.
>>>
>>> >> > When we encounter a system requiring a large number of MSIs,
>>> >> > we can either:
>>> >> > 1) Extend the AIA spec to support greater than 2047 IDs
>>> >> > 2) Re-think the approach in the IMSIC driver
>>> >> >
>>> >> > The choice between #1 and #2 above depends on the
>>> >> > guarantees we want for irq_set_affinity().
>>> >>
>>> >> The irq_set_affinity() behavior is better with this series, but I think
>>> >> the other downsides: number of available interrupt sources, and IPI
>>> >> broadcast are worse.
>>> >
>>> > The IPI overhead in the approach you are suggesting will be
>>> > even bad compared to the IPI overhead of the current approach
>>> > because we will end-up doing IPI upon every irq_set_affinity()
>>> > in the suggested approach compared to doing IPI upon every
>>> > mask/unmask in the current approach.
>>>
>>> Again, very workload dependent.
>>>
>>> This series does IPI broadcast on masking/unmasking, which means that
>>> cores that don't care get interrupted because, say, a network queue-pair
>>> is setup on another core.
>>>
>>> Some workloads never change the irq affinity.
>>
>> There are various events which irq affinity such as irq balance,
>> CPU hotplug, system suspend, etc.
>>
>> Also, the 1-1 approach does IPI upon set_affinity, mask and
>> unmask whereas the 1-n approach does IPI only upon mask
>> and unmask.
>
> An important distinction; When you say IPI on mask/unmask it is a
> broadcast IPI to *all* cores, which is pretty instrusive.
>
> The 1-1 variant does an IPI to a *one* target core.
>
>>> I'm just pointing out that there are pro/cons with both variants.
>>>
>>> > The biggest advantage of the current approach is a reliable
>>> > irq_set_affinity() which is a very valuable thing to have.
>>>
>>> ...and I'm arguing that we're paying a big price for that.
>>>
>>> > ARM systems easily support a large number of LPIs per-core.
>>> > For example, GIC-700 supports 56000 LPIs per-core.
>>> > (Refer, https://developer.arm.com/documentation/101516/0300/About-the-GIC-700/Features)
>>>
>>> Yeah, but this is not the GIC. This is something that looks more like
>>> the x86 world. We'll be stuck with a lot of implementations with AIA 1.0
>>> spec, and many cores.
>>
>> Well, RISC-V AIA is neigher ARM GIG not x86 APIC. All I am saying
>> is that there are systems with large number per-core interrupt IDs
>> for handling MSIs.
>
> Yes, and while that is nice, it's not what IMSIC is.

Some follow-ups, after thinking about it more over the weekend.

* Does one really need an IPI for irq_set_affinity() in the 1-1 model?
  Why touch the enable/disable bits when moving interrupts?

* In my book the IMSIC looks very much like the x86 LAPIC, which also
  has few interrupts (IMSIC <2048, LAPIC 256). The IRQ matrix allocator
  [1], and a scheme similar to LAPIC [2] would be a good fit. This is
  the 1-1 model, but more sophisticated than what I've been describing
  (e.g. properly handling managed/regular irqs). As a bonus we would
  get the IRQ matrix debugfs/tracepoint support. A rough toy model of
  the allocation policy is sketched below.
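
To make the allocation policy concrete, here is a rough userspace toy
model (this is NOT kernel/irq/matrix.c, just the core idea of picking
the CPU in the affinity mask with the most free IDs; all names are
invented for illustration):

/* Toy model of the 1-1 allocation policy -- pick the CPU in the affinity
 * mask with the most free IMSIC IDs and allocate one there. All names
 * are invented for illustration; this is not kernel code. */
#include <stdio.h>

#define NR_CPUS	4
#define NR_IDS	2047

static int used[NR_CPUS];		/* IDs already handed out, per CPU */

/* Return the CPU in 'mask' (bit i == CPU i) with the most free IDs,
 * or -1 if every candidate CPU is exhausted. */
static int pick_cpu(unsigned int mask)
{
	int best = -1, best_avail = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		int avail = NR_IDS - used[cpu];

		if ((mask & (1u << cpu)) && avail > best_avail) {
			best = cpu;
			best_avail = avail;
		}
	}
	return best;
}

int main(void)
{
	used[0] = NR_IDS;		/* CPU0 has no free IDs left */
	used[1] = 100;
	used[2] = 5;

	int cpu = pick_cpu(0x7);	/* requested affinity: CPUs 0-2 */

	if (cpu >= 0) {
		used[cpu]++;		/* allocate one ID on the chosen CPU */
		printf("allocated an ID on cpu%d\n", cpu);	/* -> cpu2 */
	} else {
		printf("irq_set_affinity() would fail: mask exhausted\n");
	}
	return 0;
}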
  
  
Björn
Anup Patel Oct. 23, 2023, 8:34 a.m. UTC | #11
On Mon, Oct 23, 2023 at 12:32 PM Björn Töpel <bjorn@kernel.org> wrote:
>
> Björn Töpel <bjorn@kernel.org> writes:
>
> > Anup Patel <apatel@ventanamicro.com> writes:
> >
> >> On Fri, Oct 20, 2023 at 10:07 PM Björn Töpel <bjorn@kernel.org> wrote:
> >>>
> >>> Anup Patel <apatel@ventanamicro.com> writes:
> >>>
> >>> > On Fri, Oct 20, 2023 at 8:10 PM Björn Töpel <bjorn@kernel.org> wrote:
> >>> >>
> >>> >> Anup Patel <apatel@ventanamicro.com> writes:
> >>> >>
> >>> >> > On Fri, Oct 20, 2023 at 2:17 PM Björn Töpel <bjorn@kernel.org> wrote:
> >>> >> >>
> >>> >> >> Thanks for the quick reply!
> >>> >> >>
> >>> >> >> Anup Patel <apatel@ventanamicro.com> writes:
> >>> >> >>
> >>> >> >> > On Thu, Oct 19, 2023 at 7:13 PM Björn Töpel <bjorn@kernel.org> wrote:
> >>> >> >> >>
> >>> >> >> >> Hi Anup,
> >>> >> >> >>
> >>> >> >> >> Anup Patel <apatel@ventanamicro.com> writes:
> >>> >> >> >>
> >>> >> >> >> > The RISC-V AIA specification is ratified as-per the RISC-V international
> >>> >> >> >> > process. The latest ratified AIA specifcation can be found at:
> >>> >> >> >> > https://github.com/riscv/riscv-aia/releases/download/1.0/riscv-interrupts-1.0.pdf
> >>> >> >> >> >
> >>> >> >> >> > At a high-level, the AIA specification adds three things:
> >>> >> >> >> > 1) AIA CSRs
> >>> >> >> >> >    - Improved local interrupt support
> >>> >> >> >> > 2) Incoming Message Signaled Interrupt Controller (IMSIC)
> >>> >> >> >> >    - Per-HART MSI controller
> >>> >> >> >> >    - Support MSI virtualization
> >>> >> >> >> >    - Support IPI along with virtualization
> >>> >> >> >> > 3) Advanced Platform-Level Interrupt Controller (APLIC)
> >>> >> >> >> >    - Wired interrupt controller
> >>> >> >> >> >    - In MSI-mode, converts wired interrupt into MSIs (i.e. MSI generator)
> >>> >> >> >> >    - In Direct-mode, injects external interrupts directly into HARTs
> >>> >> >> >>
> >>> >> >> >> Thanks for working on the AIA support! I had a look at the series, and
> >>> >> >> >> have some concerns about interrupt ID abstraction.
> >>> >> >> >>
> >>> >> >> >> A bit of background, for readers not familiar with the AIA details.
> >>> >> >> >>
> >>> >> >> >> IMSIC allows for 2047 unique MSI ("msi-irq") sources per hart, and
> >>> >> >> >> each MSI is dedicated to a certain hart. The series takes the approach
> >>> >> >> >> to say that there are, e.g., 2047 interrupts ("lnx-irq") globally.
> >>> >> >> >> Each lnx-irq consists of #harts * msi-irq -- a slice -- and in the
> >>> >> >> >> slice only *one* msi-irq is acutally used.
> >>> >> >> >>
> >>> >> >> >> This scheme makes affinity changes more robust, because the interrupt
> >>> >> >> >> sources on "other" harts are pre-allocated. On the other hand it
> >>> >> >> >> requires to propagate irq masking to other harts via IPIs (this is
> >>> >> >> >> mostly done up setup/tear down). It's also wasteful, because msi-irqs
> >>> >> >> >> are hogged, and cannot be used.
> >>> >> >> >>
> >>> >> >> >> Contemporary storage/networking drivers usually uses queues per core
> >>> >> >> >> (or a sub-set of cores). The current scheme wastes a lot of msi-irqs.
> >>> >> >> >> If we instead used a scheme where "msi-irq == lnx-irq", instead of
> >>> >> >> >> "lnq-irq = {hart 0;msi-irq x , ... hart N;msi-irq x}", there would be
> >>> >> >> >> a lot MSIs for other users. 1-1 vs 1-N. E.g., if a storage device
> >>> >> >> >> would like to use 5 queues (5 cores) on a 128 core system, the current
> >>> >> >> >> scheme would consume 5 * 128 MSIs, instead of just 5.
> >>> >> >> >>
> >>> >> >> >> On the plus side:
> >>> >> >> >> * Changing interrupts affinity will never fail, because the interrupts
> >>> >> >> >>   on each hart is pre-allocated.
> >>> >> >> >>
> >>> >> >> >> On the negative side:
> >>> >> >> >> * Wasteful interrupt usage, and a system can potientially "run out" of
> >>> >> >> >>   interrupts. Especially for many core systems.
> >>> >> >> >> * Interrupt masking need to proagate to harts via IPIs (there's no
> >>> >> >> >>   broadcast csr in IMSIC), and a more complex locking scheme IMSIC
> >>> >> >> >>
> >>> >> >> >> Summary:
> >>> >> >> >> The current series caps the number of global interrupts to maximum
> >>> >> >> >> 2047 MSIs for all cores (whole system). A better scheme, IMO, would be
> >>> >> >> >> to expose 2047 * #harts unique MSIs.
> >>> >> >> >>
> >>> >> >> >> I think this could simplify/remove(?) the locking as well.
> >>> >> >> >
> >>> >> >> > Exposing 2047 * #harts unique MSIs has multiple issues:
> >>> >> >> > 1) The irq_set_affinity() does not work for MSIs because each
> >>> >> >> >      IRQ is not tied to a particular HART. This means we can't
> >>> >> >> >      balance the IRQ processing load among HARTs.
> >>> >> >>
> >>> >> >> Yes, you can balance. In your code, each *active* MSI is still
> >>> >> >> bound/active to a specific hard together with the affinity mask. In an
> >>> >> >> 1-1 model you would still need to track the affinity mask, but the
> >>> >> >> irq_set_affinity() would be different. It would try to allocate a new
> >>> >> >> MSI from the target CPU, and then switch to having that MSI active.
> >>> >> >>
> >>> >> >> That's what x86 does AFAIU, which is also constrained by the # of
> >>> >> >> available MSIs.
> >>> >> >>
> >>> >> >> The downside, as I pointed out, is that the set affinity action can
> >>> >> >> fail for a certain target CPU.
> >>> >> >
> >>> >> > Yes, irq_set_affinity() can fail for the suggested approach plus for
> >>> >> > RISC-V AIA, one HART does not have access to other HARTs
> >>> >> > MSI enable/disable bits so the approach will also involve IPI.
> >>> >>
> >>> >> Correct, but the current series does a broadcast to all cores, where the
> >>> >> 1-1 approach is at most an IPI to a single core.
> >>> >>
> >>> >> 128+c machines are getting more common, and you have devices that you
> >>> >> bring up/down on a per-core basis. Broadcasting IPIs to all cores, when
> >>> >> dealing with a per-core activity is a pretty noisy neighbor.
> >>> >
> >>> > Broadcast IPI in the current approach is only done upon MSI mask/unmask
> >>> > operation. It is not done upon set_affinity() of interrupt handling.
> >>>
> >>> I'm aware. We're on the same page here.
> >>>
> >>> >>
> >>> >> This could be fixed in the existing 1-n approach, by not require to sync
> >>> >> the cores that are not handling the MSI in question. "Lazy disable"
> >>> >
> >>> > Incorrect. The approach you are suggesting involves an IPI upon every
> >>> > irq_set_affinity(). This is because a HART can only enable it's own
> >>> > MSI ID so when an IRQ is moved to from HART A to HART B with
> >>> > a different ID X on HART B then we will need an IPI in irq_set_affinit()
> >>> > to enable ID X on HART B.
> >>>
> >>> Yes, the 1-1 approach will require an IPI to one target cpu on affinity
> >>> changes, and similar on mask/unmask.
> >>>
> >>> The 1-n approach, require no-IPI on affinity changes (nice!), but IPI
> >>> broadcast to all cores on mask/unmask (not so nice).
> >>>
> >>> >> >> My concern is interrupts become a scarce resource with this
> >>> >> >> implementation, but maybe my view is incorrect. I've seen bare-metal
> >>> >> >> x86 systems (no VMs) with ~200 cores, and ~2000 interrupts, but maybe
> >>> >> >> that is considered "a lot of interrupts".
> >>> >> >>
> >>> >> >> As long as we don't get into scenarios where we're running out of
> >>> >> >> interrupts, due to the software design.
> >>> >> >>
> >>> >> >
> >>> >> > The current approach is simpler and ensures irq_set_affinity
> >>> >> > always works. The limit of max 2047 IDs is sufficient for many
> >>> >> > systems (if not all).
> >>> >>
> >>> >> Let me give you another view. On a 128c system each core has ~16 unique
> >>> >> interrupts for disposal. E.g. the Intel E800 NIC has more than 2048
> >>> >> network queue pairs for each PF.
> >>> >
> >>> > Clearly, this example is a hypothetical and represents a poorly
> >>> > designed platform.
> >>> >
> >>> > Having just 16 IDs per-Core is a very poor design choice. In fact, the
> >>> > Server SoC spec mandates a minimum 255 IDs.
> >>>
> >>> You are misreading. A 128c system with 2047 MSIs per-core, will only
> >>> have 16 *per-core unique* (2047/128) interrupts with the current series.
> >>>
> >>> I'm not saying that each IMSIC has 16 IDs, I'm saying that in a 128c
> >>> system with the maximum amount of MSIs possible in the spec, you'll end
> >>> up with 16 *unique* interrupts per core.
> >>
> >> -ENOPARSE
> >>
> >> I don't see how this applies to the current approach because we treat
> >> MSI ID space as global across cores so if a system has 2047 MSIs
> >> per-core then we have 2047 MSIs across all cores.
> >
> > Ok, I'll try again! :-)
> >
> > Let's assume that each core in the 128c system has some per-core
> > resources, say a two NIC queue pairs, and a storage queue pair. This
> > will consume, e.g., 2*2 + 2 (6) MSI sources from the global namespace.
> >
> > If each core does this it'll be 6*128 MSI sources of the global
> > namespace.
> >
> > The maximum number of "privates" MSI sources a core can utilize is 16.
> >
> > I'm trying (it's does seem to go that well ;-)) to point out that it's
> > only 16 unique sources per core. For, say, a 256 core system it would be
> > 8. 2047 MSI sources in a system is not much.
> >
> > Say that I want to spin up 24 NIC queues with one MSI each on each core
> > on my 128c system. That's not possible with this series, while with an
> > 1-1 system it wouldn't be an issue.
> >
> > Clearer, or still weird?
> >
> >>
> >>>
> >>> > Regarding NICs which support a large number of queues, the driver
> >>> > will typically enable only one queue per-core and set the affinity to
> >>> > separate cores. We have user-space data plane applications based
> >>> > on DPDK which are capable of using a large number of NIC queues
> >>> > but these applications are polling based and don't use MSIs.
> >>>
> >>> That's one sample point, and clearly not the only one. There are *many*
> >>> different usage models. Just because you *assign* MSI, doesn't mean they
> >>> are firing all the time.
> >>>
> >>> I can show you a couple of networking setups where this is clearly not
> >>> enough. Each core has a large number of QoS queues, and each queue would
> >>> very much like to have a dedicated MSI.
> >>>
> >>> >> > When we encounter a system requiring a large number of MSIs,
> >>> >> > we can either:
> >>> >> > 1) Extend the AIA spec to support greater than 2047 IDs
> >>> >> > 2) Re-think the approach in the IMSIC driver
> >>> >> >
> >>> >> > The choice between #1 and #2 above depends on the
> >>> >> > guarantees we want for irq_set_affinity().
> >>> >>
> >>> >> The irq_set_affinity() behavior is better with this series, but I think
> >>> >> the other downsides: number of available interrupt sources, and IPI
> >>> >> broadcast are worse.
> >>> >
> >>> > The IPI overhead in the approach you are suggesting will be
> >>> > even bad compared to the IPI overhead of the current approach
> >>> > because we will end-up doing IPI upon every irq_set_affinity()
> >>> > in the suggested approach compared to doing IPI upon every
> >>> > mask/unmask in the current approach.
> >>>
> >>> Again, very workload dependent.
> >>>
> >>> This series does IPI broadcast on masking/unmasking, which means that
> >>> cores that don't care get interrupted because, say, a network queue-pair
> >>> is setup on another core.
> >>>
> >>> Some workloads never change the irq affinity.
> >>
> >> There are various events which irq affinity such as irq balance,
> >> CPU hotplug, system suspend, etc.
> >>
> >> Also, the 1-1 approach does IPI upon set_affinity, mask and
> >> unmask whereas the 1-n approach does IPI only upon mask
> >> and unmask.
> >
> > An important distinction; When you say IPI on mask/unmask it is a
> > broadcast IPI to *all* cores, which is pretty instrusive.
> >
> > The 1-1 variant does an IPI to a *one* target core.
> >
> >>> I'm just pointing out that there are pro/cons with both variants.
> >>>
> >>> > The biggest advantage of the current approach is a reliable
> >>> > irq_set_affinity() which is a very valuable thing to have.
> >>>
> >>> ...and I'm arguing that we're paying a big price for that.
> >>>
> >>> > ARM systems easily support a large number of LPIs per-core.
> >>> > For example, GIC-700 supports 56000 LPIs per-core.
> >>> > (Refer, https://developer.arm.com/documentation/101516/0300/About-the-GIC-700/Features)
> >>>
> >>> Yeah, but this is not the GIC. This is something that looks more like
> >>> the x86 world. We'll be stuck with a lot of implementations with AIA 1.0
> >>> spec, and many cores.
> >>
> >> Well, RISC-V AIA is neigher ARM GIG not x86 APIC. All I am saying
> >> is that there are systems with large number per-core interrupt IDs
> >> for handling MSIs.
> >
> > Yes, and while that is nice, it's not what IMSIC is.
>
> Some follow-ups, after thinking more about it more over the weekend.
>
> * Do one really need an IPI for irq_set_affinity() for the 1-1 model?
>   Why touch the enable/disable bits when moving interrupts?

In the 1-1 model, the ID on the current HART and the ID on the target
HART will be different upon irq_set_affinity(), so we can't leave the
unused ID on the current HART enabled because it can lead to spurious
interrupts when that ID is re-used for some other device.

There is also a possibility of receiving an interrupt while the ID is
being moved to the new target HART, in which case we have to detect this
and re-trigger the interrupt on the new target HART. In fact, the x86
APIC does an IPI to take care of this case.
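
Roughly, the window looks like this (userspace toy model, not the actual
driver code; the per-hart enable/pending arrays and the helpers are
invented for illustration):

/* Toy model of the "interrupt fired during the move" window described
 * above -- not driver code. */
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS	2
#define NR_IDS	8

static bool enabled[NR_CPUS][NR_IDS + 1];
static bool pending[NR_CPUS][NR_IDS + 1];

/* The device writes its MSI to whatever (hart, ID) it is programmed with. */
static void device_fires(int cpu, int id)
{
	pending[cpu][id] = true;
}

static void move_irq(int old_cpu, int old_id, int new_cpu, int new_id)
{
	/* 1. The target hart enables the new ID (needs an IPI on real
	 *    hardware, since a hart can only touch its own IMSIC). */
	enabled[new_cpu][new_id] = true;

	/* 2. Reprogram the device; from now on it targets (new_cpu, new_id). */

	/* 3. The device may already have fired at the old (hart, ID) before
	 *    step 2 took effect. Detect that and re-trigger on the new
	 *    target so the interrupt is not lost. */
	if (pending[old_cpu][old_id]) {
		pending[old_cpu][old_id] = false;
		pending[new_cpu][new_id] = true;	/* e.g. a setipnum write */
	}

	/* 4. The old hart disables the old ID (again via IPI on real
	 *    hardware), so a later user of that ID sees no spurious
	 *    interrupts. */
	enabled[old_cpu][old_id] = false;
}

int main(void)
{
	enabled[0][3] = true;
	device_fires(0, 3);		/* interrupt raced with the move */
	move_irq(0, 3, 1, 5);
	printf("pending on cpu1/id5: %d\n", pending[1][5]);	/* 1 */
	return 0;
}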

>
> * In my book the IMSIC looks very much like the x86 LAPIC, which also
>   has few interrupts (IMSIC <2048, LAPIC 256). The IRQ matrix allocator
>   [1], and a scheme similar to LAPIC [2] would be a good fit. This is
>   the 1-1 model, but more sophisticated than what I've been describing
>   (e.g. properly handling mangaged/regular irqs). As a bonus we would
>   get the IRQ matrix debugfs/tracepoint support.
>

Yes, I have been evaluating the 1-1 model for the past few days. I also
have a working implementation with a simple per-CPU bitmap-based
allocator which handles both legacy MSI (blocks of 1, 2, 4, 8, 16, or
32 IDs) and MSI-X.
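
For illustration, here is a minimal sketch of such an allocator
(userspace toy code, not the actual v11 implementation; it only shows
the naturally-aligned power-of-two block allocation that legacy MSI
needs):

/* Minimal sketch of a per-CPU bitmap allocator for IMSIC IDs, only to
 * illustrate naturally-aligned power-of-two blocks (1, 2, 4, 8, 16 or
 * 32 vectors) for legacy MSI. Not the actual v11 code; in a real IMSIC,
 * ID 0 is reserved. */
#include <stdio.h>

#define NR_IDS	64			/* tiny pool for the example */

struct id_pool {
	unsigned long long bitmap;	/* bit i set => ID i is in use */
};

/* Allocate 'count' (power of two, <= 32) naturally aligned IDs.
 * Returns the first ID of the block, or -1 when the pool is full. */
static int ids_alloc(struct id_pool *p, unsigned int count)
{
	unsigned long long mask = (1ULL << count) - 1;

	for (unsigned int base = 0; base + count <= NR_IDS; base += count) {
		if (!(p->bitmap & (mask << base))) {
			p->bitmap |= mask << base;
			return (int)base;
		}
	}
	return -1;
}

static void ids_free(struct id_pool *p, unsigned int base, unsigned int count)
{
	p->bitmap &= ~(((1ULL << count) - 1) << base);
}

int main(void)
{
	struct id_pool cpu0 = { 0 };

	int msix   = ids_alloc(&cpu0, 1);	/* MSI-X: one ID at a time */
	int legacy = ids_alloc(&cpu0, 4);	/* legacy MSI: aligned block of 4 */

	printf("msix=%d legacy=%d\n", msix, legacy);
	ids_free(&cpu0, (unsigned int)legacy, 4);
	return 0;
}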

The irq matrix allocator needs to be improved for handling legacy MSI,
so initially I will post a v11 series which works for me; converging
with the irq matrix allocator can be future work.

Regards,
Anup
Björn Töpel Oct. 23, 2023, 2:07 p.m. UTC | #12
Anup Patel <apatel@ventanamicro.com> writes:

> On Mon, Oct 23, 2023 at 12:32 PM Björn Töpel <bjorn@kernel.org> wrote:
>>
>> Björn Töpel <bjorn@kernel.org> writes:
>>
>> > Anup Patel <apatel@ventanamicro.com> writes:
>> >
>> >> On Fri, Oct 20, 2023 at 10:07 PM Björn Töpel <bjorn@kernel.org> wrote:
>> >>>
>> >>> Anup Patel <apatel@ventanamicro.com> writes:
>> >>>
>> >>> > On Fri, Oct 20, 2023 at 8:10 PM Björn Töpel <bjorn@kernel.org> wrote:
>> >>> >>
>> >>> >> Anup Patel <apatel@ventanamicro.com> writes:
>> >>> >>
>> >>> >> > On Fri, Oct 20, 2023 at 2:17 PM Björn Töpel <bjorn@kernel.org> wrote:
>> >>> >> >>
>> >>> >> >> Thanks for the quick reply!
>> >>> >> >>
>> >>> >> >> Anup Patel <apatel@ventanamicro.com> writes:
>> >>> >> >>
>> >>> >> >> > On Thu, Oct 19, 2023 at 7:13 PM Björn Töpel <bjorn@kernel.org> wrote:
>> >>> >> >> >>
>> >>> >> >> >> Hi Anup,
>> >>> >> >> >>
>> >>> >> >> >> Anup Patel <apatel@ventanamicro.com> writes:
>> >>> >> >> >>
>> >>> >> >> >> > The RISC-V AIA specification is ratified as-per the RISC-V international
>> >>> >> >> >> > process. The latest ratified AIA specifcation can be found at:
>> >>> >> >> >> > https://github.com/riscv/riscv-aia/releases/download/1.0/riscv-interrupts-1.0.pdf
>> >>> >> >> >> >
>> >>> >> >> >> > At a high-level, the AIA specification adds three things:
>> >>> >> >> >> > 1) AIA CSRs
>> >>> >> >> >> >    - Improved local interrupt support
>> >>> >> >> >> > 2) Incoming Message Signaled Interrupt Controller (IMSIC)
>> >>> >> >> >> >    - Per-HART MSI controller
>> >>> >> >> >> >    - Support MSI virtualization
>> >>> >> >> >> >    - Support IPI along with virtualization
>> >>> >> >> >> > 3) Advanced Platform-Level Interrupt Controller (APLIC)
>> >>> >> >> >> >    - Wired interrupt controller
>> >>> >> >> >> >    - In MSI-mode, converts wired interrupt into MSIs (i.e. MSI generator)
>> >>> >> >> >> >    - In Direct-mode, injects external interrupts directly into HARTs
>> >>> >> >> >>
>> >>> >> >> >> Thanks for working on the AIA support! I had a look at the series, and
>> >>> >> >> >> have some concerns about interrupt ID abstraction.
>> >>> >> >> >>
>> >>> >> >> >> A bit of background, for readers not familiar with the AIA details.
>> >>> >> >> >>
>> >>> >> >> >> IMSIC allows for 2047 unique MSI ("msi-irq") sources per hart, and
>> >>> >> >> >> each MSI is dedicated to a certain hart. The series takes the approach
>> >>> >> >> >> to say that there are, e.g., 2047 interrupts ("lnx-irq") globally.
>> >>> >> >> >> Each lnx-irq consists of #harts * msi-irq -- a slice -- and in the
>> >>> >> >> >> slice only *one* msi-irq is acutally used.
>> >>> >> >> >>
>> >>> >> >> >> This scheme makes affinity changes more robust, because the interrupt
>> >>> >> >> >> sources on "other" harts are pre-allocated. On the other hand it
>> >>> >> >> >> requires to propagate irq masking to other harts via IPIs (this is
>> >>> >> >> >> mostly done up setup/tear down). It's also wasteful, because msi-irqs
>> >>> >> >> >> are hogged, and cannot be used.
>> >>> >> >> >>
>> >>> >> >> >> Contemporary storage/networking drivers usually uses queues per core
>> >>> >> >> >> (or a sub-set of cores). The current scheme wastes a lot of msi-irqs.
>> >>> >> >> >> If we instead used a scheme where "msi-irq == lnx-irq", instead of
>> >>> >> >> >> "lnq-irq = {hart 0;msi-irq x , ... hart N;msi-irq x}", there would be
>> >>> >> >> >> a lot MSIs for other users. 1-1 vs 1-N. E.g., if a storage device
>> >>> >> >> >> would like to use 5 queues (5 cores) on a 128 core system, the current
>> >>> >> >> >> scheme would consume 5 * 128 MSIs, instead of just 5.
>> >>> >> >> >>
>> >>> >> >> >> On the plus side:
>> >>> >> >> >> * Changing interrupts affinity will never fail, because the interrupts
>> >>> >> >> >>   on each hart is pre-allocated.
>> >>> >> >> >>
>> >>> >> >> >> On the negative side:
>> >>> >> >> >> * Wasteful interrupt usage, and a system can potientially "run out" of
>> >>> >> >> >>   interrupts. Especially for many core systems.
>> >>> >> >> >> * Interrupt masking need to proagate to harts via IPIs (there's no
>> >>> >> >> >>   broadcast csr in IMSIC), and a more complex locking scheme IMSIC
>> >>> >> >> >>
>> >>> >> >> >> Summary:
>> >>> >> >> >> The current series caps the number of global interrupts to maximum
>> >>> >> >> >> 2047 MSIs for all cores (whole system). A better scheme, IMO, would be
>> >>> >> >> >> to expose 2047 * #harts unique MSIs.
>> >>> >> >> >>
>> >>> >> >> >> I think this could simplify/remove(?) the locking as well.
>> >>> >> >> >
>> >>> >> >> > Exposing 2047 * #harts unique MSIs has multiple issues:
>> >>> >> >> > 1) The irq_set_affinity() does not work for MSIs because each
>> >>> >> >> >      IRQ is not tied to a particular HART. This means we can't
>> >>> >> >> >      balance the IRQ processing load among HARTs.
>> >>> >> >>
>> >>> >> >> Yes, you can balance. In your code, each *active* MSI is still
>> >>> >> >> bound/active to a specific hard together with the affinity mask. In an
>> >>> >> >> 1-1 model you would still need to track the affinity mask, but the
>> >>> >> >> irq_set_affinity() would be different. It would try to allocate a new
>> >>> >> >> MSI from the target CPU, and then switch to having that MSI active.
>> >>> >> >>
>> >>> >> >> That's what x86 does AFAIU, which is also constrained by the # of
>> >>> >> >> available MSIs.
>> >>> >> >>
>> >>> >> >> The downside, as I pointed out, is that the set affinity action can
>> >>> >> >> fail for a certain target CPU.
>> >>> >> >
>> >>> >> > Yes, irq_set_affinity() can fail for the suggested approach plus for
>> >>> >> > RISC-V AIA, one HART does not have access to other HARTs
>> >>> >> > MSI enable/disable bits so the approach will also involve IPI.
>> >>> >>
>> >>> >> Correct, but the current series does a broadcast to all cores, where the
>> >>> >> 1-1 approach is at most an IPI to a single core.
>> >>> >>
>> >>> >> 128+c machines are getting more common, and you have devices that you
>> >>> >> bring up/down on a per-core basis. Broadcasting IPIs to all cores, when
>> >>> >> dealing with a per-core activity is a pretty noisy neighbor.
>> >>> >
>> >>> > Broadcast IPI in the current approach is only done upon MSI mask/unmask
>> >>> > operation. It is not done upon set_affinity() of interrupt handling.
>> >>>
>> >>> I'm aware. We're on the same page here.
>> >>>
>> >>> >>
>> >>> >> This could be fixed in the existing 1-n approach, by not require to sync
>> >>> >> the cores that are not handling the MSI in question. "Lazy disable"
>> >>> >
>> >>> > Incorrect. The approach you are suggesting involves an IPI upon every
>> >>> > irq_set_affinity(). This is because a HART can only enable it's own
>> >>> > MSI ID so when an IRQ is moved to from HART A to HART B with
>> >>> > a different ID X on HART B then we will need an IPI in irq_set_affinit()
>> >>> > to enable ID X on HART B.
>> >>>
>> >>> Yes, the 1-1 approach will require an IPI to one target cpu on affinity
>> >>> changes, and similar on mask/unmask.
>> >>>
>> >>> The 1-n approach, require no-IPI on affinity changes (nice!), but IPI
>> >>> broadcast to all cores on mask/unmask (not so nice).
>> >>>
>> >>> >> >> My concern is interrupts become a scarce resource with this
>> >>> >> >> implementation, but maybe my view is incorrect. I've seen bare-metal
>> >>> >> >> x86 systems (no VMs) with ~200 cores, and ~2000 interrupts, but maybe
>> >>> >> >> that is considered "a lot of interrupts".
>> >>> >> >>
>> >>> >> >> As long as we don't get into scenarios where we're running out of
>> >>> >> >> interrupts, due to the software design.
>> >>> >> >>
>> >>> >> >
>> >>> >> > The current approach is simpler and ensures irq_set_affinity
>> >>> >> > always works. The limit of max 2047 IDs is sufficient for many
>> >>> >> > systems (if not all).
>> >>> >>
>> >>> >> Let me give you another view. On a 128c system each core has ~16 unique
>> >>> >> interrupts for disposal. E.g. the Intel E800 NIC has more than 2048
>> >>> >> network queue pairs for each PF.
>> >>> >
>> >>> > Clearly, this example is a hypothetical and represents a poorly
>> >>> > designed platform.
>> >>> >
>> >>> > Having just 16 IDs per-Core is a very poor design choice. In fact, the
>> >>> > Server SoC spec mandates a minimum 255 IDs.
>> >>>
>> >>> You are misreading. A 128c system with 2047 MSIs per-core, will only
>> >>> have 16 *per-core unique* (2047/128) interrupts with the current series.
>> >>>
>> >>> I'm not saying that each IMSIC has 16 IDs, I'm saying that in a 128c
>> >>> system with the maximum amount of MSIs possible in the spec, you'll end
>> >>> up with 16 *unique* interrupts per core.
>> >>
>> >> -ENOPARSE
>> >>
>> >> I don't see how this applies to the current approach because we treat
>> >> MSI ID space as global across cores so if a system has 2047 MSIs
>> >> per-core then we have 2047 MSIs across all cores.
>> >
>> > Ok, I'll try again! :-)
>> >
>> > Let's assume that each core in the 128c system has some per-core
>> > resources, say two NIC queue pairs and a storage queue pair. This
>> > will consume, e.g., 2*2 + 2 (6) MSI sources from the global namespace.
>> >
>> > If each core does this it'll be 6*128 MSI sources of the global
>> > namespace.
>> >
>> > The maximum number of "private" MSI sources a core can utilize is 16.
>> >
>> > I'm trying (it doesn't seem to go that well ;-)) to point out that it's
>> > only 16 unique sources per core. For, say, a 256 core system it would be
>> > 8. 2047 MSI sources in a system is not much.
>> >
>> > Say that I want to spin up 24 NIC queues with one MSI each on each core
>> > on my 128c system. That's not possible with this series, while with an
>> > 1-1 system it wouldn't be an issue.
>> >
>> > Clearer, or still weird?
>> >
>> >>
>> >>>
>> >>> > Regarding NICs which support a large number of queues, the driver
>> >>> > will typically enable only one queue per-core and set the affinity to
>> >>> > separate cores. We have user-space data plane applications based
>> >>> > on DPDK which are capable of using a large number of NIC queues
>> >>> > but these applications are polling based and don't use MSIs.
>> >>>
>> >>> That's one sample point, and clearly not the only one. There are *many*
>> >>> different usage models. Just because you *assign* MSIs doesn't mean they
>> >>> are firing all the time.
>> >>>
>> >>> I can show you a couple of networking setups where this is clearly not
>> >>> enough. Each core has a large number of QoS queues, and each queue would
>> >>> very much like to have a dedicated MSI.
>> >>>
>> >>> >> > When we encounter a system requiring a large number of MSIs,
>> >>> >> > we can either:
>> >>> >> > 1) Extend the AIA spec to support greater than 2047 IDs
>> >>> >> > 2) Re-think the approach in the IMSIC driver
>> >>> >> >
>> >>> >> > The choice between #1 and #2 above depends on the
>> >>> >> > guarantees we want for irq_set_affinity().
>> >>> >>
>> >>> >> The irq_set_affinity() behavior is better with this series, but I think
>> >>> >> the other downsides: number of available interrupt sources, and IPI
>> >>> >> broadcast are worse.
>> >>> >
>> >>> > The IPI overhead in the approach you are suggesting will be
>> >>> > even worse compared to the IPI overhead of the current approach
>> >>> > because we will end up doing an IPI upon every irq_set_affinity()
>> >>> > in the suggested approach compared to doing an IPI upon every
>> >>> > mask/unmask in the current approach.
>> >>>
>> >>> Again, very workload dependent.
>> >>>
>> >>> This series does IPI broadcast on masking/unmasking, which means that
>> >>> cores that don't care get interrupted because, say, a network queue-pair
>> >>> is setup on another core.
>> >>>
>> >>> Some workloads never change the irq affinity.
>> >>
>> >> There are various events which change irq affinity, such as irq balancing,
>> >> CPU hotplug, system suspend, etc.
>> >>
>> >> Also, the 1-1 approach does IPI upon set_affinity, mask and
>> >> unmask whereas the 1-n approach does IPI only upon mask
>> >> and unmask.
>> >
>> > An important distinction: when you say IPI on mask/unmask it is a
>> > broadcast IPI to *all* cores, which is pretty intrusive.
>> >
>> > The 1-1 variant does an IPI to *one* target core.
>> >
>> >>> I'm just pointing out that there are pro/cons with both variants.
>> >>>
>> >>> > The biggest advantage of the current approach is a reliable
>> >>> > irq_set_affinity() which is a very valuable thing to have.
>> >>>
>> >>> ...and I'm arguing that we're paying a big price for that.
>> >>>
>> >>> > ARM systems easily support a large number of LPIs per-core.
>> >>> > For example, GIC-700 supports 56000 LPIs per-core.
>> >>> > (Refer, https://developer.arm.com/documentation/101516/0300/About-the-GIC-700/Features)
>> >>>
>> >>> Yeah, but this is not the GIC. This is something that looks more like
>> >>> the x86 world. We'll be stuck with a lot of implementations with AIA 1.0
>> >>> spec, and many cores.
>> >>
>> >> Well, RISC-V AIA is neither ARM GIC nor x86 APIC. All I am saying
>> >> is that there are systems with a large number of per-core interrupt IDs
>> >> for handling MSIs.
>> >
>> > Yes, and while that is nice, it's not what IMSIC is.
>>
>> Some follow-ups, after thinking about it more over the weekend.
>>
>> * Does one really need an IPI for irq_set_affinity() in the 1-1 model?
>>   Why touch the enable/disable bits when moving interrupts?
>
> In the 1-1 model, the ID on the current HART and target HART upon
> irq_set_affinity will be different so we can't leave the unused ID on
> current HART enabled because it can lead to spurious interrupts
> when the ID on current HART is re-used for some other device.

Hmm, is this really an actual problem, or a theoretical one? The
implementation needs to track what's in use, so can we ever get into this
situation?

Somewhat related: I had a similar question for imsic_pci_{un,}mask_irq()
-- why not just do the default mask operation (only
pci_msi_{un,}mask_irq()), instead of also propagating to the IMSIC
mask/unmask?

> There is also a possibility of receiving an interrupt while the ID was
> moved to a new target HART, in which case we have to detect and
> re-trigger the interrupt on the new target HART. In fact, x86 APIC does
> an IPI to take care of this case.

This case I get, and the implementation can track that both are in use.
It's the spurious one that I'm dubious of (don't get).

>>
>> * In my book the IMSIC looks very much like the x86 LAPIC, which also
>>   has few interrupts (IMSIC <2048, LAPIC 256). The IRQ matrix allocator
>>   [1], and a scheme similar to LAPIC [2] would be a good fit. This is
>>   the 1-1 model, but more sophisticated than what I've been describing
>>   (e.g. properly handling managed/regular irqs). As a bonus we would
>>   get the IRQ matrix debugfs/tracepoint support.
>>
>
> Yes, I have been evaluating the 1-1 model for the past few days. I also
> have a working implementation with a simple per-CPU bitmap based
> allocator which handles both legacy MSI (block of 1,2,4,8,16, or 32 IDs)
> and MSI-X.
>
> The irq matrix allocator needs to be improved for handling legacy MSI
> so, initially, I will post a v11 series which works for me; converging
> with the irq matrix allocator can be future work.

What's missing/needs to be improved for legacy MSI (legacy MSI ==
!MSI-X, right?) in the matrix allocator?


Björn
Anup Patel Oct. 23, 2023, 2:41 p.m. UTC | #13
On Mon, Oct 23, 2023 at 7:37 PM Björn Töpel <bjorn@kernel.org> wrote:
>
> Anup Patel <apatel@ventanamicro.com> writes:
>
> > On Mon, Oct 23, 2023 at 12:32 PM Björn Töpel <bjorn@kernel.org> wrote:
> >>
> >> Björn Töpel <bjorn@kernel.org> writes:
> >>
> >> > Anup Patel <apatel@ventanamicro.com> writes:
> >> >
> >> >> On Fri, Oct 20, 2023 at 10:07 PM Björn Töpel <bjorn@kernel.org> wrote:
> >> >>>
> >> >>> Anup Patel <apatel@ventanamicro.com> writes:
> >> >>>
> >> >>> > On Fri, Oct 20, 2023 at 8:10 PM Björn Töpel <bjorn@kernel.org> wrote:
> >> >>> >>
> >> >>> >> Anup Patel <apatel@ventanamicro.com> writes:
> >> >>> >>
> >> >>> >> > On Fri, Oct 20, 2023 at 2:17 PM Björn Töpel <bjorn@kernel.org> wrote:
> >> >>> >> >>
> >> >>> >> >> Thanks for the quick reply!
> >> >>> >> >>
> >> >>> >> >> Anup Patel <apatel@ventanamicro.com> writes:
> >> >>> >> >>
> >> >>> >> >> > On Thu, Oct 19, 2023 at 7:13 PM Björn Töpel <bjorn@kernel.org> wrote:
> >> >>> >> >> >>
> >> >>> >> >> >> Hi Anup,
> >> >>> >> >> >>
> >> >>> >> >> >> Anup Patel <apatel@ventanamicro.com> writes:
> >> >>> >> >> >>
> >> >>> >> >> >> > The RISC-V AIA specification is ratified as-per the RISC-V international
> >> >>> >> >> >> > process. The latest ratified AIA specifcation can be found at:
> >> >>> >> >> >> > https://github.com/riscv/riscv-aia/releases/download/1.0/riscv-interrupts-1.0.pdf
> >> >>> >> >> >> >
> >> >>> >> >> >> > At a high-level, the AIA specification adds three things:
> >> >>> >> >> >> > 1) AIA CSRs
> >> >>> >> >> >> >    - Improved local interrupt support
> >> >>> >> >> >> > 2) Incoming Message Signaled Interrupt Controller (IMSIC)
> >> >>> >> >> >> >    - Per-HART MSI controller
> >> >>> >> >> >> >    - Support MSI virtualization
> >> >>> >> >> >> >    - Support IPI along with virtualization
> >> >>> >> >> >> > 3) Advanced Platform-Level Interrupt Controller (APLIC)
> >> >>> >> >> >> >    - Wired interrupt controller
> >> >>> >> >> >> >    - In MSI-mode, converts wired interrupt into MSIs (i.e. MSI generator)
> >> >>> >> >> >> >    - In Direct-mode, injects external interrupts directly into HARTs
> >> >>> >> >> >>
> >> >>> >> >> >> Thanks for working on the AIA support! I had a look at the series, and
> >> >>> >> >> >> have some concerns about interrupt ID abstraction.
> >> >>> >> >> >>
> >> >>> >> >> >> A bit of background, for readers not familiar with the AIA details.
> >> >>> >> >> >>
> >> >>> >> >> >> IMSIC allows for 2047 unique MSI ("msi-irq") sources per hart, and
> >> >>> >> >> >> each MSI is dedicated to a certain hart. The series takes the approach
> >> >>> >> >> >> to say that there are, e.g., 2047 interrupts ("lnx-irq") globally.
> >> >>> >> >> >> Each lnx-irq consists of #harts * msi-irq -- a slice -- and in the
> >> >>> >> >> >> slice only *one* msi-irq is acutally used.
> >> >>> >> >> >>
> >> >>> >> >> >> This scheme makes affinity changes more robust, because the interrupt
> >> >>> >> >> >> sources on "other" harts are pre-allocated. On the other hand it
> >> >>> >> >> >> requires to propagate irq masking to other harts via IPIs (this is
> >> >>> >> >> >> mostly done up setup/tear down). It's also wasteful, because msi-irqs
> >> >>> >> >> >> are hogged, and cannot be used.
> >> >>> >> >> >>
> >> >>> >> >> >> Contemporary storage/networking drivers usually uses queues per core
> >> >>> >> >> >> (or a sub-set of cores). The current scheme wastes a lot of msi-irqs.
> >> >>> >> >> >> If we instead used a scheme where "msi-irq == lnx-irq", instead of
> >> >>> >> >> >> "lnq-irq = {hart 0;msi-irq x , ... hart N;msi-irq x}", there would be
> >> >>> >> >> >> a lot MSIs for other users. 1-1 vs 1-N. E.g., if a storage device
> >> >>> >> >> >> would like to use 5 queues (5 cores) on a 128 core system, the current
> >> >>> >> >> >> scheme would consume 5 * 128 MSIs, instead of just 5.
> >> >>> >> >> >>
> >> >>> >> >> >> On the plus side:
> >> >>> >> >> >> * Changing interrupts affinity will never fail, because the interrupts
> >> >>> >> >> >>   on each hart is pre-allocated.
> >> >>> >> >> >>
> >> >>> >> >> >> On the negative side:
> >> >>> >> >> >> * Wasteful interrupt usage, and a system can potientially "run out" of
> >> >>> >> >> >>   interrupts. Especially for many core systems.
> >> >>> >> >> >> * Interrupt masking need to proagate to harts via IPIs (there's no
> >> >>> >> >> >>   broadcast csr in IMSIC), and a more complex locking scheme IMSIC
> >> >>> >> >> >>
> >> >>> >> >> >> Summary:
> >> >>> >> >> >> The current series caps the number of global interrupts to maximum
> >> >>> >> >> >> 2047 MSIs for all cores (whole system). A better scheme, IMO, would be
> >> >>> >> >> >> to expose 2047 * #harts unique MSIs.
> >> >>> >> >> >>
> >> >>> >> >> >> I think this could simplify/remove(?) the locking as well.
> >> >>> >> >> >
> >> >>> >> >> > Exposing 2047 * #harts unique MSIs has multiple issues:
> >> >>> >> >> > 1) The irq_set_affinity() does not work for MSIs because each
> >> >>> >> >> >      IRQ is not tied to a particular HART. This means we can't
> >> >>> >> >> >      balance the IRQ processing load among HARTs.
> >> >>> >> >>
> >> >>> >> >> Yes, you can balance. In your code, each *active* MSI is still
> >> >>> >> >> bound/active to a specific hard together with the affinity mask. In an
> >> >>> >> >> 1-1 model you would still need to track the affinity mask, but the
> >> >>> >> >> irq_set_affinity() would be different. It would try to allocate a new
> >> >>> >> >> MSI from the target CPU, and then switch to having that MSI active.
> >> >>> >> >>
> >> >>> >> >> That's what x86 does AFAIU, which is also constrained by the # of
> >> >>> >> >> available MSIs.
> >> >>> >> >>
> >> >>> >> >> The downside, as I pointed out, is that the set affinity action can
> >> >>> >> >> fail for a certain target CPU.
> >> >>> >> >
> >> >>> >> > Yes, irq_set_affinity() can fail for the suggested approach plus for
> >> >>> >> > RISC-V AIA, one HART does not have access to other HARTs
> >> >>> >> > MSI enable/disable bits so the approach will also involve IPI.
> >> >>> >>
> >> >>> >> Correct, but the current series does a broadcast to all cores, where the
> >> >>> >> 1-1 approach is at most an IPI to a single core.
> >> >>> >>
> >> >>> >> 128+c machines are getting more common, and you have devices that you
> >> >>> >> bring up/down on a per-core basis. Broadcasting IPIs to all cores, when
> >> >>> >> dealing with a per-core activity is a pretty noisy neighbor.
> >> >>> >
> >> >>> > Broadcast IPI in the current approach is only done upon MSI mask/unmask
> >> >>> > operation. It is not done upon set_affinity() of interrupt handling.
> >> >>>
> >> >>> I'm aware. We're on the same page here.
> >> >>>
> >> >>> >>
> >> >>> >> This could be fixed in the existing 1-n approach, by not require to sync
> >> >>> >> the cores that are not handling the MSI in question. "Lazy disable"
> >> >>> >
> >> >>> > Incorrect. The approach you are suggesting involves an IPI upon every
> >> >>> > irq_set_affinity(). This is because a HART can only enable it's own
> >> >>> > MSI ID so when an IRQ is moved to from HART A to HART B with
> >> >>> > a different ID X on HART B then we will need an IPI in irq_set_affinit()
> >> >>> > to enable ID X on HART B.
> >> >>>
> >> >>> Yes, the 1-1 approach will require an IPI to one target cpu on affinity
> >> >>> changes, and similar on mask/unmask.
> >> >>>
> >> >>> The 1-n approach, require no-IPI on affinity changes (nice!), but IPI
> >> >>> broadcast to all cores on mask/unmask (not so nice).
> >> >>>
> >> >>> >> >> My concern is interrupts become a scarce resource with this
> >> >>> >> >> implementation, but maybe my view is incorrect. I've seen bare-metal
> >> >>> >> >> x86 systems (no VMs) with ~200 cores, and ~2000 interrupts, but maybe
> >> >>> >> >> that is considered "a lot of interrupts".
> >> >>> >> >>
> >> >>> >> >> As long as we don't get into scenarios where we're running out of
> >> >>> >> >> interrupts, due to the software design.
> >> >>> >> >>
> >> >>> >> >
> >> >>> >> > The current approach is simpler and ensures irq_set_affinity
> >> >>> >> > always works. The limit of max 2047 IDs is sufficient for many
> >> >>> >> > systems (if not all).
> >> >>> >>
> >> >>> >> Let me give you another view. On a 128c system each core has ~16 unique
> >> >>> >> interrupts for disposal. E.g. the Intel E800 NIC has more than 2048
> >> >>> >> network queue pairs for each PF.
> >> >>> >
> >> >>> > Clearly, this example is a hypothetical and represents a poorly
> >> >>> > designed platform.
> >> >>> >
> >> >>> > Having just 16 IDs per-Core is a very poor design choice. In fact, the
> >> >>> > Server SoC spec mandates a minimum 255 IDs.
> >> >>>
> >> >>> You are misreading. A 128c system with 2047 MSIs per-core, will only
> >> >>> have 16 *per-core unique* (2047/128) interrupts with the current series.
> >> >>>
> >> >>> I'm not saying that each IMSIC has 16 IDs, I'm saying that in a 128c
> >> >>> system with the maximum amount of MSIs possible in the spec, you'll end
> >> >>> up with 16 *unique* interrupts per core.
> >> >>
> >> >> -ENOPARSE
> >> >>
> >> >> I don't see how this applies to the current approach because we treat
> >> >> MSI ID space as global across cores so if a system has 2047 MSIs
> >> >> per-core then we have 2047 MSIs across all cores.
> >> >
> >> > Ok, I'll try again! :-)
> >> >
> >> > Let's assume that each core in the 128c system has some per-core
> >> > resources, say a two NIC queue pairs, and a storage queue pair. This
> >> > will consume, e.g., 2*2 + 2 (6) MSI sources from the global namespace.
> >> >
> >> > If each core does this it'll be 6*128 MSI sources of the global
> >> > namespace.
> >> >
> >> > The maximum number of "privates" MSI sources a core can utilize is 16.
> >> >
> >> > I'm trying (it's does seem to go that well ;-)) to point out that it's
> >> > only 16 unique sources per core. For, say, a 256 core system it would be
> >> > 8. 2047 MSI sources in a system is not much.
> >> >
> >> > Say that I want to spin up 24 NIC queues with one MSI each on each core
> >> > on my 128c system. That's not possible with this series, while with an
> >> > 1-1 system it wouldn't be an issue.
> >> >
> >> > Clearer, or still weird?
> >> >
> >> >>
> >> >>>
> >> >>> > Regarding NICs which support a large number of queues, the driver
> >> >>> > will typically enable only one queue per-core and set the affinity to
> >> >>> > separate cores. We have user-space data plane applications based
> >> >>> > on DPDK which are capable of using a large number of NIC queues
> >> >>> > but these applications are polling based and don't use MSIs.
> >> >>>
> >> >>> That's one sample point, and clearly not the only one. There are *many*
> >> >>> different usage models. Just because you *assign* MSI, doesn't mean they
> >> >>> are firing all the time.
> >> >>>
> >> >>> I can show you a couple of networking setups where this is clearly not
> >> >>> enough. Each core has a large number of QoS queues, and each queue would
> >> >>> very much like to have a dedicated MSI.
> >> >>>
> >> >>> >> > When we encounter a system requiring a large number of MSIs,
> >> >>> >> > we can either:
> >> >>> >> > 1) Extend the AIA spec to support greater than 2047 IDs
> >> >>> >> > 2) Re-think the approach in the IMSIC driver
> >> >>> >> >
> >> >>> >> > The choice between #1 and #2 above depends on the
> >> >>> >> > guarantees we want for irq_set_affinity().
> >> >>> >>
> >> >>> >> The irq_set_affinity() behavior is better with this series, but I think
> >> >>> >> the other downsides: number of available interrupt sources, and IPI
> >> >>> >> broadcast are worse.
> >> >>> >
> >> >>> > The IPI overhead in the approach you are suggesting will be
> >> >>> > even bad compared to the IPI overhead of the current approach
> >> >>> > because we will end-up doing IPI upon every irq_set_affinity()
> >> >>> > in the suggested approach compared to doing IPI upon every
> >> >>> > mask/unmask in the current approach.
> >> >>>
> >> >>> Again, very workload dependent.
> >> >>>
> >> >>> This series does IPI broadcast on masking/unmasking, which means that
> >> >>> cores that don't care get interrupted because, say, a network queue-pair
> >> >>> is setup on another core.
> >> >>>
> >> >>> Some workloads never change the irq affinity.
> >> >>
> >> >> There are various events which irq affinity such as irq balance,
> >> >> CPU hotplug, system suspend, etc.
> >> >>
> >> >> Also, the 1-1 approach does IPI upon set_affinity, mask and
> >> >> unmask whereas the 1-n approach does IPI only upon mask
> >> >> and unmask.
> >> >
> >> > An important distinction; When you say IPI on mask/unmask it is a
> >> > broadcast IPI to *all* cores, which is pretty instrusive.
> >> >
> >> > The 1-1 variant does an IPI to a *one* target core.
> >> >
> >> >>> I'm just pointing out that there are pro/cons with both variants.
> >> >>>
> >> >>> > The biggest advantage of the current approach is a reliable
> >> >>> > irq_set_affinity() which is a very valuable thing to have.
> >> >>>
> >> >>> ...and I'm arguing that we're paying a big price for that.
> >> >>>
> >> >>> > ARM systems easily support a large number of LPIs per-core.
> >> >>> > For example, GIC-700 supports 56000 LPIs per-core.
> >> >>> > (Refer, https://developer.arm.com/documentation/101516/0300/About-the-GIC-700/Features)
> >> >>>
> >> >>> Yeah, but this is not the GIC. This is something that looks more like
> >> >>> the x86 world. We'll be stuck with a lot of implementations with AIA 1.0
> >> >>> spec, and many cores.
> >> >>
> >> >> Well, RISC-V AIA is neigher ARM GIG not x86 APIC. All I am saying
> >> >> is that there are systems with large number per-core interrupt IDs
> >> >> for handling MSIs.
> >> >
> >> > Yes, and while that is nice, it's not what IMSIC is.
> >>
> >> Some follow-ups, after thinking more about it more over the weekend.
> >>
> >> * Do one really need an IPI for irq_set_affinity() for the 1-1 model?
> >>   Why touch the enable/disable bits when moving interrupts?
> >
> > In the 1-1 model, the ID on the current HART and target HART upon
> > irq_set_affinity will be different so we can't leave the unused ID on
> > current HART enabled because it can lead to spurious interrupts
> > when the ID on current HART is re-used for some other device.
>
> Hmm, is this really an actual problem, or a theoretical one? The
> implementation need to track what's in-use, so can we ever get into this
> situation?

As of now, it is theoretical but it is certainly possible to hit this issue.

>
> Somewhat related; I had a similar question for imsic_pci_{un,}mask_irq()
> -- why not only do the the default mask operation (only
> pci_msi_{un,}mask_irq()), but instead propagate to the IMSIC
> mask/unmask?

We have a hierarchical IMSIC PCI irq domain whose parent irq domain
is the IMSIC base domain. Unfortunately, pci_msi_[un]mask_irq() doesn't
work for hierarchical irq domains.
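
To make this concrete, the PCI-level irq_chip in such a hierarchical
setup would typically forward mask/unmask to its parent domain, roughly
along the lines of the sketch below. This is only an illustration (the
example_* names are made up, not code from this series);
pci_msi_{mask,unmask}_irq() and irq_chip_{mask,unmask}_parent() are the
existing kernel helpers:

#include <linux/irq.h>
#include <linux/msi.h>

/* Hypothetical PCI-level callbacks for a hierarchical MSI irq domain */
static void example_pci_mask_irq(struct irq_data *d)
{
	pci_msi_mask_irq(d);		/* mask at the PCI MSI/MSI-X capability */
	irq_chip_mask_parent(d);	/* propagate to the parent (IMSIC) domain */
}

static void example_pci_unmask_irq(struct irq_data *d)
{
	irq_chip_unmask_parent(d);	/* unmask the parent (IMSIC) ID first */
	pci_msi_unmask_irq(d);		/* then unmask at the PCI capability */
}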

>
> > There is also a possibility of receiving an interrupt while the ID was
> > moved to a new target HART in-which case we have to detect and
> > re-trigger interrupt on the new target HART. In fact, x86 APLIC does
> > an IPI to take care of this case.
>
> This case I get, and the implementation can track that both are in use.
> It's the spurious one that I'm dubious of (don't get).
>
> >>
> >> * In my book the IMSIC looks very much like the x86 LAPIC, which also
> >>   has few interrupts (IMSIC <2048, LAPIC 256). The IRQ matrix allocator
> >>   [1], and a scheme similar to LAPIC [2] would be a good fit. This is
> >>   the 1-1 model, but more sophisticated than what I've been describing
> >>   (e.g. properly handling mangaged/regular irqs). As a bonus we would
> >>   get the IRQ matrix debugfs/tracepoint support.
> >>
> >
> > Yes, I have been evaluating the 1-1 model for the past few days. I also
> > have a working implementation with a simple per-CPU bitmap based
> > allocator which handles both legacy MSI (block of 1,2,4,8,16, or 32 IDs)
> > and MSI-X.
> >
> > The irq matrix allocator needs to be improved for handling legacy MSI
> > so initially I will post a v11 series which works for me and converging
> > with irq matrix allocator can be future work.
>
> What's missing/needs to be improved for legacy MSI (legacy MSI ==
> !MSI-X, right?) in the matrix allocator?

For legacy MSI, a block of IDs needs to be contiguous and the number
of IDs must be a power of 2.
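
As a rough illustration of what the allocator has to provide for legacy
MSI (only a sketch with made-up example_* names, not code from the
series; bitmap_find_free_region()/bitmap_release_region() are existing
kernel helpers that hand out naturally aligned power-of-2 regions):

#include <linux/bitmap.h>
#include <linux/log2.h>

#define EXAMPLE_NR_IDS	2048

static DECLARE_BITMAP(example_ids, EXAMPLE_NR_IDS);

/* nr_irqs is 1, 2, 4, 8, 16 or 32 for legacy MSI */
static int example_alloc_msi_block(unsigned int nr_irqs)
{
	/* returns the first ID of a contiguous, size-aligned block */
	return bitmap_find_free_region(example_ids, EXAMPLE_NR_IDS,
				       order_base_2(nr_irqs));
}

static void example_free_msi_block(unsigned int base, unsigned int nr_irqs)
{
	bitmap_release_region(example_ids, base, order_base_2(nr_irqs));
}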

For a short summary of the differences between MSI and MSI-X, refer to:
https://en.wikipedia.org/wiki/Message_Signaled_Interrupts

Regards,
Anup
Björn Töpel Oct. 23, 2023, 3:45 p.m. UTC | #14
Anup Patel <apatel@ventanamicro.com> writes:

> On Mon, Oct 23, 2023 at 7:37 PM Björn Töpel <bjorn@kernel.org> wrote:
>>
>> Anup Patel <apatel@ventanamicro.com> writes:
>>
>> > On Mon, Oct 23, 2023 at 12:32 PM Björn Töpel <bjorn@kernel.org> wrote:
>> >>
>> >> Björn Töpel <bjorn@kernel.org> writes:
>> >>
>> >> > Anup Patel <apatel@ventanamicro.com> writes:
>> >> >
>> >> >> On Fri, Oct 20, 2023 at 10:07 PM Björn Töpel <bjorn@kernel.org> wrote:
>> >> >>>
>> >> >>> Anup Patel <apatel@ventanamicro.com> writes:
>> >> >>>
>> >> >>> > On Fri, Oct 20, 2023 at 8:10 PM Björn Töpel <bjorn@kernel.org> wrote:
>> >> >>> >>
>> >> >>> >> Anup Patel <apatel@ventanamicro.com> writes:
>> >> >>> >>
>> >> >>> >> > On Fri, Oct 20, 2023 at 2:17 PM Björn Töpel <bjorn@kernel.org> wrote:
>> >> >>> >> >>
>> >> >>> >> >> Thanks for the quick reply!
>> >> >>> >> >>
>> >> >>> >> >> Anup Patel <apatel@ventanamicro.com> writes:
>> >> >>> >> >>
>> >> >>> >> >> > On Thu, Oct 19, 2023 at 7:13 PM Björn Töpel <bjorn@kernel.org> wrote:
>> >> >>> >> >> >>
>> >> >>> >> >> >> Hi Anup,
>> >> >>> >> >> >>
>> >> >>> >> >> >> Anup Patel <apatel@ventanamicro.com> writes:
>> >> >>> >> >> >>
>> >> >>> >> >> >> > The RISC-V AIA specification is ratified as-per the RISC-V international
>> >> >>> >> >> >> > process. The latest ratified AIA specifcation can be found at:
>> >> >>> >> >> >> > https://github.com/riscv/riscv-aia/releases/download/1.0/riscv-interrupts-1.0.pdf
>> >> >>> >> >> >> >
>> >> >>> >> >> >> > At a high-level, the AIA specification adds three things:
>> >> >>> >> >> >> > 1) AIA CSRs
>> >> >>> >> >> >> >    - Improved local interrupt support
>> >> >>> >> >> >> > 2) Incoming Message Signaled Interrupt Controller (IMSIC)
>> >> >>> >> >> >> >    - Per-HART MSI controller
>> >> >>> >> >> >> >    - Support MSI virtualization
>> >> >>> >> >> >> >    - Support IPI along with virtualization
>> >> >>> >> >> >> > 3) Advanced Platform-Level Interrupt Controller (APLIC)
>> >> >>> >> >> >> >    - Wired interrupt controller
>> >> >>> >> >> >> >    - In MSI-mode, converts wired interrupt into MSIs (i.e. MSI generator)
>> >> >>> >> >> >> >    - In Direct-mode, injects external interrupts directly into HARTs
>> >> >>> >> >> >>
>> >> >>> >> >> >> Thanks for working on the AIA support! I had a look at the series, and
>> >> >>> >> >> >> have some concerns about interrupt ID abstraction.
>> >> >>> >> >> >>
>> >> >>> >> >> >> A bit of background, for readers not familiar with the AIA details.
>> >> >>> >> >> >>
>> >> >>> >> >> >> IMSIC allows for 2047 unique MSI ("msi-irq") sources per hart, and
>> >> >>> >> >> >> each MSI is dedicated to a certain hart. The series takes the approach
>> >> >>> >> >> >> to say that there are, e.g., 2047 interrupts ("lnx-irq") globally.
>> >> >>> >> >> >> Each lnx-irq consists of #harts * msi-irq -- a slice -- and in the
>> >> >>> >> >> >> slice only *one* msi-irq is acutally used.
>> >> >>> >> >> >>
>> >> >>> >> >> >> This scheme makes affinity changes more robust, because the interrupt
>> >> >>> >> >> >> sources on "other" harts are pre-allocated. On the other hand it
>> >> >>> >> >> >> requires to propagate irq masking to other harts via IPIs (this is
>> >> >>> >> >> >> mostly done up setup/tear down). It's also wasteful, because msi-irqs
>> >> >>> >> >> >> are hogged, and cannot be used.
>> >> >>> >> >> >>
>> >> >>> >> >> >> Contemporary storage/networking drivers usually uses queues per core
>> >> >>> >> >> >> (or a sub-set of cores). The current scheme wastes a lot of msi-irqs.
>> >> >>> >> >> >> If we instead used a scheme where "msi-irq == lnx-irq", instead of
>> >> >>> >> >> >> "lnq-irq = {hart 0;msi-irq x , ... hart N;msi-irq x}", there would be
>> >> >>> >> >> >> a lot MSIs for other users. 1-1 vs 1-N. E.g., if a storage device
>> >> >>> >> >> >> would like to use 5 queues (5 cores) on a 128 core system, the current
>> >> >>> >> >> >> scheme would consume 5 * 128 MSIs, instead of just 5.
>> >> >>> >> >> >>
>> >> >>> >> >> >> On the plus side:
>> >> >>> >> >> >> * Changing interrupts affinity will never fail, because the interrupts
>> >> >>> >> >> >>   on each hart is pre-allocated.
>> >> >>> >> >> >>
>> >> >>> >> >> >> On the negative side:
>> >> >>> >> >> >> * Wasteful interrupt usage, and a system can potientially "run out" of
>> >> >>> >> >> >>   interrupts. Especially for many core systems.
>> >> >>> >> >> >> * Interrupt masking need to proagate to harts via IPIs (there's no
>> >> >>> >> >> >>   broadcast csr in IMSIC), and a more complex locking scheme IMSIC
>> >> >>> >> >> >>
>> >> >>> >> >> >> Summary:
>> >> >>> >> >> >> The current series caps the number of global interrupts to maximum
>> >> >>> >> >> >> 2047 MSIs for all cores (whole system). A better scheme, IMO, would be
>> >> >>> >> >> >> to expose 2047 * #harts unique MSIs.
>> >> >>> >> >> >>
>> >> >>> >> >> >> I think this could simplify/remove(?) the locking as well.
>> >> >>> >> >> >
>> >> >>> >> >> > Exposing 2047 * #harts unique MSIs has multiple issues:
>> >> >>> >> >> > 1) The irq_set_affinity() does not work for MSIs because each
>> >> >>> >> >> >      IRQ is not tied to a particular HART. This means we can't
>> >> >>> >> >> >      balance the IRQ processing load among HARTs.
>> >> >>> >> >>
>> >> >>> >> >> Yes, you can balance. In your code, each *active* MSI is still
>> >> >>> >> >> bound/active to a specific hard together with the affinity mask. In an
>> >> >>> >> >> 1-1 model you would still need to track the affinity mask, but the
>> >> >>> >> >> irq_set_affinity() would be different. It would try to allocate a new
>> >> >>> >> >> MSI from the target CPU, and then switch to having that MSI active.
>> >> >>> >> >>
>> >> >>> >> >> That's what x86 does AFAIU, which is also constrained by the # of
>> >> >>> >> >> available MSIs.
>> >> >>> >> >>
>> >> >>> >> >> The downside, as I pointed out, is that the set affinity action can
>> >> >>> >> >> fail for a certain target CPU.
>> >> >>> >> >
>> >> >>> >> > Yes, irq_set_affinity() can fail for the suggested approach plus for
>> >> >>> >> > RISC-V AIA, one HART does not have access to other HARTs
>> >> >>> >> > MSI enable/disable bits so the approach will also involve IPI.
>> >> >>> >>
>> >> >>> >> Correct, but the current series does a broadcast to all cores, where the
>> >> >>> >> 1-1 approach is at most an IPI to a single core.
>> >> >>> >>
>> >> >>> >> 128+c machines are getting more common, and you have devices that you
>> >> >>> >> bring up/down on a per-core basis. Broadcasting IPIs to all cores, when
>> >> >>> >> dealing with a per-core activity is a pretty noisy neighbor.
>> >> >>> >
>> >> >>> > Broadcast IPI in the current approach is only done upon MSI mask/unmask
>> >> >>> > operation. It is not done upon set_affinity() of interrupt handling.
>> >> >>>
>> >> >>> I'm aware. We're on the same page here.
>> >> >>>
>> >> >>> >>
>> >> >>> >> This could be fixed in the existing 1-n approach, by not require to sync
>> >> >>> >> the cores that are not handling the MSI in question. "Lazy disable"
>> >> >>> >
>> >> >>> > Incorrect. The approach you are suggesting involves an IPI upon every
>> >> >>> > irq_set_affinity(). This is because a HART can only enable it's own
>> >> >>> > MSI ID so when an IRQ is moved to from HART A to HART B with
>> >> >>> > a different ID X on HART B then we will need an IPI in irq_set_affinit()
>> >> >>> > to enable ID X on HART B.
>> >> >>>
>> >> >>> Yes, the 1-1 approach will require an IPI to one target cpu on affinity
>> >> >>> changes, and similar on mask/unmask.
>> >> >>>
>> >> >>> The 1-n approach, require no-IPI on affinity changes (nice!), but IPI
>> >> >>> broadcast to all cores on mask/unmask (not so nice).
>> >> >>>
>> >> >>> >> >> My concern is interrupts become a scarce resource with this
>> >> >>> >> >> implementation, but maybe my view is incorrect. I've seen bare-metal
>> >> >>> >> >> x86 systems (no VMs) with ~200 cores, and ~2000 interrupts, but maybe
>> >> >>> >> >> that is considered "a lot of interrupts".
>> >> >>> >> >>
>> >> >>> >> >> As long as we don't get into scenarios where we're running out of
>> >> >>> >> >> interrupts, due to the software design.
>> >> >>> >> >>
>> >> >>> >> >
>> >> >>> >> > The current approach is simpler and ensures irq_set_affinity
>> >> >>> >> > always works. The limit of max 2047 IDs is sufficient for many
>> >> >>> >> > systems (if not all).
>> >> >>> >>
>> >> >>> >> Let me give you another view. On a 128c system each core has ~16 unique
>> >> >>> >> interrupts for disposal. E.g. the Intel E800 NIC has more than 2048
>> >> >>> >> network queue pairs for each PF.
>> >> >>> >
>> >> >>> > Clearly, this example is a hypothetical and represents a poorly
>> >> >>> > designed platform.
>> >> >>> >
>> >> >>> > Having just 16 IDs per-Core is a very poor design choice. In fact, the
>> >> >>> > Server SoC spec mandates a minimum 255 IDs.
>> >> >>>
>> >> >>> You are misreading. A 128c system with 2047 MSIs per-core, will only
>> >> >>> have 16 *per-core unique* (2047/128) interrupts with the current series.
>> >> >>>
>> >> >>> I'm not saying that each IMSIC has 16 IDs, I'm saying that in a 128c
>> >> >>> system with the maximum amount of MSIs possible in the spec, you'll end
>> >> >>> up with 16 *unique* interrupts per core.
>> >> >>
>> >> >> -ENOPARSE
>> >> >>
>> >> >> I don't see how this applies to the current approach because we treat
>> >> >> MSI ID space as global across cores so if a system has 2047 MSIs
>> >> >> per-core then we have 2047 MSIs across all cores.
>> >> >
>> >> > Ok, I'll try again! :-)
>> >> >
>> >> > Let's assume that each core in the 128c system has some per-core
>> >> > resources, say a two NIC queue pairs, and a storage queue pair. This
>> >> > will consume, e.g., 2*2 + 2 (6) MSI sources from the global namespace.
>> >> >
>> >> > If each core does this it'll be 6*128 MSI sources of the global
>> >> > namespace.
>> >> >
>> >> > The maximum number of "privates" MSI sources a core can utilize is 16.
>> >> >
>> >> > I'm trying (it's does seem to go that well ;-)) to point out that it's
>> >> > only 16 unique sources per core. For, say, a 256 core system it would be
>> >> > 8. 2047 MSI sources in a system is not much.
>> >> >
>> >> > Say that I want to spin up 24 NIC queues with one MSI each on each core
>> >> > on my 128c system. That's not possible with this series, while with an
>> >> > 1-1 system it wouldn't be an issue.
>> >> >
>> >> > Clearer, or still weird?
>> >> >
>> >> >>
>> >> >>>
>> >> >>> > Regarding NICs which support a large number of queues, the driver
>> >> >>> > will typically enable only one queue per-core and set the affinity to
>> >> >>> > separate cores. We have user-space data plane applications based
>> >> >>> > on DPDK which are capable of using a large number of NIC queues
>> >> >>> > but these applications are polling based and don't use MSIs.
>> >> >>>
>> >> >>> That's one sample point, and clearly not the only one. There are *many*
>> >> >>> different usage models. Just because you *assign* MSI, doesn't mean they
>> >> >>> are firing all the time.
>> >> >>>
>> >> >>> I can show you a couple of networking setups where this is clearly not
>> >> >>> enough. Each core has a large number of QoS queues, and each queue would
>> >> >>> very much like to have a dedicated MSI.
>> >> >>>
>> >> >>> >> > When we encounter a system requiring a large number of MSIs,
>> >> >>> >> > we can either:
>> >> >>> >> > 1) Extend the AIA spec to support greater than 2047 IDs
>> >> >>> >> > 2) Re-think the approach in the IMSIC driver
>> >> >>> >> >
>> >> >>> >> > The choice between #1 and #2 above depends on the
>> >> >>> >> > guarantees we want for irq_set_affinity().
>> >> >>> >>
>> >> >>> >> The irq_set_affinity() behavior is better with this series, but I think
>> >> >>> >> the other downsides: number of available interrupt sources, and IPI
>> >> >>> >> broadcast are worse.
>> >> >>> >
>> >> >>> > The IPI overhead in the approach you are suggesting will be
>> >> >>> > even bad compared to the IPI overhead of the current approach
>> >> >>> > because we will end-up doing IPI upon every irq_set_affinity()
>> >> >>> > in the suggested approach compared to doing IPI upon every
>> >> >>> > mask/unmask in the current approach.
>> >> >>>
>> >> >>> Again, very workload dependent.
>> >> >>>
>> >> >>> This series does IPI broadcast on masking/unmasking, which means that
>> >> >>> cores that don't care get interrupted because, say, a network queue-pair
>> >> >>> is setup on another core.
>> >> >>>
>> >> >>> Some workloads never change the irq affinity.
>> >> >>
>> >> >> There are various events which irq affinity such as irq balance,
>> >> >> CPU hotplug, system suspend, etc.
>> >> >>
>> >> >> Also, the 1-1 approach does IPI upon set_affinity, mask and
>> >> >> unmask whereas the 1-n approach does IPI only upon mask
>> >> >> and unmask.
>> >> >
>> >> > An important distinction; When you say IPI on mask/unmask it is a
>> >> > broadcast IPI to *all* cores, which is pretty instrusive.
>> >> >
>> >> > The 1-1 variant does an IPI to a *one* target core.
>> >> >
>> >> >>> I'm just pointing out that there are pro/cons with both variants.
>> >> >>>
>> >> >>> > The biggest advantage of the current approach is a reliable
>> >> >>> > irq_set_affinity() which is a very valuable thing to have.
>> >> >>>
>> >> >>> ...and I'm arguing that we're paying a big price for that.
>> >> >>>
>> >> >>> > ARM systems easily support a large number of LPIs per-core.
>> >> >>> > For example, GIC-700 supports 56000 LPIs per-core.
>> >> >>> > (Refer, https://developer.arm.com/documentation/101516/0300/About-the-GIC-700/Features)
>> >> >>>
>> >> >>> Yeah, but this is not the GIC. This is something that looks more like
>> >> >>> the x86 world. We'll be stuck with a lot of implementations with AIA 1.0
>> >> >>> spec, and many cores.
>> >> >>
>> >> >> Well, RISC-V AIA is neigher ARM GIG not x86 APIC. All I am saying
>> >> >> is that there are systems with large number per-core interrupt IDs
>> >> >> for handling MSIs.
>> >> >
>> >> > Yes, and while that is nice, it's not what IMSIC is.
>> >>
>> >> Some follow-ups, after thinking more about it more over the weekend.
>> >>
>> >> * Do one really need an IPI for irq_set_affinity() for the 1-1 model?
>> >>   Why touch the enable/disable bits when moving interrupts?
>> >
>> > In the 1-1 model, the ID on the current HART and target HART upon
>> > irq_set_affinity will be different so we can't leave the unused ID on
>> > current HART enabled because it can lead to spurious interrupts
>> > when the ID on current HART is re-used for some other device.
>>
>> Hmm, is this really an actual problem, or a theoretical one? The
>> implementation need to track what's in-use, so can we ever get into this
>> situation?
>
> As of now, it is theoretical but it is certainly possible to hit this issue.

Sorry for being slow here, Anup, but could you give an example of how
this could happen? To me it sounds like this could only be caused by a
broken (buggy) implementation?

>> Somewhat related; I had a similar question for imsic_pci_{un,}mask_irq()
>> -- why not only do the the default mask operation (only
>> pci_msi_{un,}mask_irq()), but instead propagate to the IMSIC
>> mask/unmask?
>
> We have hierarchical IMSIC PCI irq domain whoes parent irq domain
> is IMSIC base domain. Unfortunately pci_msi_[un]mask_irq() don't
> work for hierarchical irq domain.

Ok! Thanks for the explanation!

>> > There is also a possibility of receiving an interrupt while the ID was
>> > moved to a new target HART in-which case we have to detect and
>> > re-trigger interrupt on the new target HART. In fact, x86 APLIC does
>> > an IPI to take care of this case.
>>
>> This case I get, and the implementation can track that both are in use.
>> It's the spurious one that I'm dubious of (don't get).
>>
>> >>
>> >> * In my book the IMSIC looks very much like the x86 LAPIC, which also
>> >>   has few interrupts (IMSIC <2048, LAPIC 256). The IRQ matrix allocator
>> >>   [1], and a scheme similar to LAPIC [2] would be a good fit. This is
>> >>   the 1-1 model, but more sophisticated than what I've been describing
>> >>   (e.g. properly handling mangaged/regular irqs). As a bonus we would
>> >>   get the IRQ matrix debugfs/tracepoint support.
>> >>
>> >
>> > Yes, I have been evaluating the 1-1 model for the past few days. I also
>> > have a working implementation with a simple per-CPU bitmap based
>> > allocator which handles both legacy MSI (block of 1,2,4,8,16, or 32 IDs)
>> > and MSI-X.
>> >
>> > The irq matrix allocator needs to be improved for handling legacy MSI
>> > so initially I will post a v11 series which works for me and converging
>> > with irq matrix allocator can be future work.
>>
>> What's missing/needs to be improved for legacy MSI (legacy MSI ==
>> !MSI-X, right?) in the matrix allocator?
>
> For legacy MSI, a block IDs needs to be contiguous and the number
> of IDs can be a power of 2.

Oh, so this is not supported by the matrix allocator?


Björn
Anup Patel Oct. 23, 2023, 5:25 p.m. UTC | #15
On Mon, Oct 23, 2023 at 9:15 PM Björn Töpel <bjorn@kernel.org> wrote:
>
> Anup Patel <apatel@ventanamicro.com> writes:
>
> > On Mon, Oct 23, 2023 at 7:37 PM Björn Töpel <bjorn@kernel.org> wrote:
> >>
> >> Anup Patel <apatel@ventanamicro.com> writes:
> >>
> >> > On Mon, Oct 23, 2023 at 12:32 PM Björn Töpel <bjorn@kernel.org> wrote:
> >> >>
> >> >> Björn Töpel <bjorn@kernel.org> writes:
> >> >>
> >> >> > Anup Patel <apatel@ventanamicro.com> writes:
> >> >> >
> >> >> >> On Fri, Oct 20, 2023 at 10:07 PM Björn Töpel <bjorn@kernel.org> wrote:
> >> >> >>>
> >> >> >>> Anup Patel <apatel@ventanamicro.com> writes:
> >> >> >>>
> >> >> >>> > On Fri, Oct 20, 2023 at 8:10 PM Björn Töpel <bjorn@kernel.org> wrote:
> >> >> >>> >>
> >> >> >>> >> Anup Patel <apatel@ventanamicro.com> writes:
> >> >> >>> >>
> >> >> >>> >> > On Fri, Oct 20, 2023 at 2:17 PM Björn Töpel <bjorn@kernel.org> wrote:
> >> >> >>> >> >>
> >> >> >>> >> >> Thanks for the quick reply!
> >> >> >>> >> >>
> >> >> >>> >> >> Anup Patel <apatel@ventanamicro.com> writes:
> >> >> >>> >> >>
> >> >> >>> >> >> > On Thu, Oct 19, 2023 at 7:13 PM Björn Töpel <bjorn@kernel.org> wrote:
> >> >> >>> >> >> >>
> >> >> >>> >> >> >> Hi Anup,
> >> >> >>> >> >> >>
> >> >> >>> >> >> >> Anup Patel <apatel@ventanamicro.com> writes:
> >> >> >>> >> >> >>
> >> >> >>> >> >> >> > The RISC-V AIA specification is ratified as-per the RISC-V international
> >> >> >>> >> >> >> > process. The latest ratified AIA specifcation can be found at:
> >> >> >>> >> >> >> > https://github.com/riscv/riscv-aia/releases/download/1.0/riscv-interrupts-1.0.pdf
> >> >> >>> >> >> >> >
> >> >> >>> >> >> >> > At a high-level, the AIA specification adds three things:
> >> >> >>> >> >> >> > 1) AIA CSRs
> >> >> >>> >> >> >> >    - Improved local interrupt support
> >> >> >>> >> >> >> > 2) Incoming Message Signaled Interrupt Controller (IMSIC)
> >> >> >>> >> >> >> >    - Per-HART MSI controller
> >> >> >>> >> >> >> >    - Support MSI virtualization
> >> >> >>> >> >> >> >    - Support IPI along with virtualization
> >> >> >>> >> >> >> > 3) Advanced Platform-Level Interrupt Controller (APLIC)
> >> >> >>> >> >> >> >    - Wired interrupt controller
> >> >> >>> >> >> >> >    - In MSI-mode, converts wired interrupt into MSIs (i.e. MSI generator)
> >> >> >>> >> >> >> >    - In Direct-mode, injects external interrupts directly into HARTs
> >> >> >>> >> >> >>
> >> >> >>> >> >> >> Thanks for working on the AIA support! I had a look at the series, and
> >> >> >>> >> >> >> have some concerns about interrupt ID abstraction.
> >> >> >>> >> >> >>
> >> >> >>> >> >> >> A bit of background, for readers not familiar with the AIA details.
> >> >> >>> >> >> >>
> >> >> >>> >> >> >> IMSIC allows for 2047 unique MSI ("msi-irq") sources per hart, and
> >> >> >>> >> >> >> each MSI is dedicated to a certain hart. The series takes the approach
> >> >> >>> >> >> >> to say that there are, e.g., 2047 interrupts ("lnx-irq") globally.
> >> >> >>> >> >> >> Each lnx-irq consists of #harts * msi-irq -- a slice -- and in the
> >> >> >>> >> >> >> slice only *one* msi-irq is acutally used.
> >> >> >>> >> >> >>
> >> >> >>> >> >> >> This scheme makes affinity changes more robust, because the interrupt
> >> >> >>> >> >> >> sources on "other" harts are pre-allocated. On the other hand it
> >> >> >>> >> >> >> requires to propagate irq masking to other harts via IPIs (this is
> >> >> >>> >> >> >> mostly done up setup/tear down). It's also wasteful, because msi-irqs
> >> >> >>> >> >> >> are hogged, and cannot be used.
> >> >> >>> >> >> >>
> >> >> >>> >> >> >> Contemporary storage/networking drivers usually uses queues per core
> >> >> >>> >> >> >> (or a sub-set of cores). The current scheme wastes a lot of msi-irqs.
> >> >> >>> >> >> >> If we instead used a scheme where "msi-irq == lnx-irq", instead of
> >> >> >>> >> >> >> "lnq-irq = {hart 0;msi-irq x , ... hart N;msi-irq x}", there would be
> >> >> >>> >> >> >> a lot MSIs for other users. 1-1 vs 1-N. E.g., if a storage device
> >> >> >>> >> >> >> would like to use 5 queues (5 cores) on a 128 core system, the current
> >> >> >>> >> >> >> scheme would consume 5 * 128 MSIs, instead of just 5.
> >> >> >>> >> >> >>
> >> >> >>> >> >> >> On the plus side:
> >> >> >>> >> >> >> * Changing interrupts affinity will never fail, because the interrupts
> >> >> >>> >> >> >>   on each hart is pre-allocated.
> >> >> >>> >> >> >>
> >> >> >>> >> >> >> On the negative side:
> >> >> >>> >> >> >> * Wasteful interrupt usage, and a system can potientially "run out" of
> >> >> >>> >> >> >>   interrupts. Especially for many core systems.
> >> >> >>> >> >> >> * Interrupt masking need to proagate to harts via IPIs (there's no
> >> >> >>> >> >> >>   broadcast csr in IMSIC), and a more complex locking scheme IMSIC
> >> >> >>> >> >> >>
> >> >> >>> >> >> >> Summary:
> >> >> >>> >> >> >> The current series caps the number of global interrupts to maximum
> >> >> >>> >> >> >> 2047 MSIs for all cores (whole system). A better scheme, IMO, would be
> >> >> >>> >> >> >> to expose 2047 * #harts unique MSIs.
> >> >> >>> >> >> >>
> >> >> >>> >> >> >> I think this could simplify/remove(?) the locking as well.
> >> >> >>> >> >> >
> >> >> >>> >> >> > Exposing 2047 * #harts unique MSIs has multiple issues:
> >> >> >>> >> >> > 1) The irq_set_affinity() does not work for MSIs because each
> >> >> >>> >> >> >      IRQ is not tied to a particular HART. This means we can't
> >> >> >>> >> >> >      balance the IRQ processing load among HARTs.
> >> >> >>> >> >>
> >> >> >>> >> >> Yes, you can balance. In your code, each *active* MSI is still
> >> >> >>> >> >> bound/active to a specific hard together with the affinity mask. In an
> >> >> >>> >> >> 1-1 model you would still need to track the affinity mask, but the
> >> >> >>> >> >> irq_set_affinity() would be different. It would try to allocate a new
> >> >> >>> >> >> MSI from the target CPU, and then switch to having that MSI active.
> >> >> >>> >> >>
> >> >> >>> >> >> That's what x86 does AFAIU, which is also constrained by the # of
> >> >> >>> >> >> available MSIs.
> >> >> >>> >> >>
> >> >> >>> >> >> The downside, as I pointed out, is that the set affinity action can
> >> >> >>> >> >> fail for a certain target CPU.
> >> >> >>> >> >
> >> >> >>> >> > Yes, irq_set_affinity() can fail for the suggested approach plus for
> >> >> >>> >> > RISC-V AIA, one HART does not have access to other HARTs
> >> >> >>> >> > MSI enable/disable bits so the approach will also involve IPI.
> >> >> >>> >>
> >> >> >>> >> Correct, but the current series does a broadcast to all cores, where the
> >> >> >>> >> 1-1 approach is at most an IPI to a single core.
> >> >> >>> >>
> >> >> >>> >> 128+c machines are getting more common, and you have devices that you
> >> >> >>> >> bring up/down on a per-core basis. Broadcasting IPIs to all cores, when
> >> >> >>> >> dealing with a per-core activity is a pretty noisy neighbor.
> >> >> >>> >
> >> >> >>> > Broadcast IPI in the current approach is only done upon MSI mask/unmask
> >> >> >>> > operation. It is not done upon set_affinity() of interrupt handling.
> >> >> >>>
> >> >> >>> I'm aware. We're on the same page here.
> >> >> >>>
> >> >> >>> >>
> >> >> >>> >> This could be fixed in the existing 1-n approach, by not require to sync
> >> >> >>> >> the cores that are not handling the MSI in question. "Lazy disable"
> >> >> >>> >
> >> >> >>> > Incorrect. The approach you are suggesting involves an IPI upon every
> >> >> >>> > irq_set_affinity(). This is because a HART can only enable it's own
> >> >> >>> > MSI ID so when an IRQ is moved to from HART A to HART B with
> >> >> >>> > a different ID X on HART B then we will need an IPI in irq_set_affinit()
> >> >> >>> > to enable ID X on HART B.
> >> >> >>>
> >> >> >>> Yes, the 1-1 approach will require an IPI to one target cpu on affinity
> >> >> >>> changes, and similar on mask/unmask.
> >> >> >>>
> >> >> >>> The 1-n approach, require no-IPI on affinity changes (nice!), but IPI
> >> >> >>> broadcast to all cores on mask/unmask (not so nice).
> >> >> >>>
> >> >> >>> >> >> My concern is interrupts become a scarce resource with this
> >> >> >>> >> >> implementation, but maybe my view is incorrect. I've seen bare-metal
> >> >> >>> >> >> x86 systems (no VMs) with ~200 cores, and ~2000 interrupts, but maybe
> >> >> >>> >> >> that is considered "a lot of interrupts".
> >> >> >>> >> >>
> >> >> >>> >> >> As long as we don't get into scenarios where we're running out of
> >> >> >>> >> >> interrupts, due to the software design.
> >> >> >>> >> >>
> >> >> >>> >> >
> >> >> >>> >> > The current approach is simpler and ensures irq_set_affinity
> >> >> >>> >> > always works. The limit of max 2047 IDs is sufficient for many
> >> >> >>> >> > systems (if not all).
> >> >> >>> >>
> >> >> >>> >> Let me give you another view. On a 128c system each core has ~16 unique
> >> >> >>> >> interrupts for disposal. E.g. the Intel E800 NIC has more than 2048
> >> >> >>> >> network queue pairs for each PF.
> >> >> >>> >
> >> >> >>> > Clearly, this example is a hypothetical and represents a poorly
> >> >> >>> > designed platform.
> >> >> >>> >
> >> >> >>> > Having just 16 IDs per-Core is a very poor design choice. In fact, the
> >> >> >>> > Server SoC spec mandates a minimum 255 IDs.
> >> >> >>>
> >> >> >>> You are misreading. A 128c system with 2047 MSIs per-core, will only
> >> >> >>> have 16 *per-core unique* (2047/128) interrupts with the current series.
> >> >> >>>
> >> >> >>> I'm not saying that each IMSIC has 16 IDs, I'm saying that in a 128c
> >> >> >>> system with the maximum amount of MSIs possible in the spec, you'll end
> >> >> >>> up with 16 *unique* interrupts per core.
> >> >> >>
> >> >> >> -ENOPARSE
> >> >> >>
> >> >> >> I don't see how this applies to the current approach because we treat
> >> >> >> MSI ID space as global across cores so if a system has 2047 MSIs
> >> >> >> per-core then we have 2047 MSIs across all cores.
> >> >> >
> >> >> > Ok, I'll try again! :-)
> >> >> >
> >> >> > Let's assume that each core in the 128c system has some per-core
> >> >> > resources, say a two NIC queue pairs, and a storage queue pair. This
> >> >> > will consume, e.g., 2*2 + 2 (6) MSI sources from the global namespace.
> >> >> >
> >> >> > If each core does this it'll be 6*128 MSI sources of the global
> >> >> > namespace.
> >> >> >
> >> >> > The maximum number of "privates" MSI sources a core can utilize is 16.
> >> >> >
> >> >> > I'm trying (it's does seem to go that well ;-)) to point out that it's
> >> >> > only 16 unique sources per core. For, say, a 256 core system it would be
> >> >> > 8. 2047 MSI sources in a system is not much.
> >> >> >
> >> >> > Say that I want to spin up 24 NIC queues with one MSI each on each core
> >> >> > on my 128c system. That's not possible with this series, while with an
> >> >> > 1-1 system it wouldn't be an issue.
> >> >> >
> >> >> > Clearer, or still weird?
> >> >> >
> >> >> >>
> >> >> >>>
> >> >> >>> > Regarding NICs which support a large number of queues, the driver
> >> >> >>> > will typically enable only one queue per-core and set the affinity to
> >> >> >>> > separate cores. We have user-space data plane applications based
> >> >> >>> > on DPDK which are capable of using a large number of NIC queues
> >> >> >>> > but these applications are polling based and don't use MSIs.
> >> >> >>>
> >> >> >>> That's one sample point, and clearly not the only one. There are *many*
> >> >> >>> different usage models. Just because you *assign* MSI, doesn't mean they
> >> >> >>> are firing all the time.
> >> >> >>>
> >> >> >>> I can show you a couple of networking setups where this is clearly not
> >> >> >>> enough. Each core has a large number of QoS queues, and each queue would
> >> >> >>> very much like to have a dedicated MSI.
> >> >> >>>
> >> >> >>> >> > When we encounter a system requiring a large number of MSIs,
> >> >> >>> >> > we can either:
> >> >> >>> >> > 1) Extend the AIA spec to support greater than 2047 IDs
> >> >> >>> >> > 2) Re-think the approach in the IMSIC driver
> >> >> >>> >> >
> >> >> >>> >> > The choice between #1 and #2 above depends on the
> >> >> >>> >> > guarantees we want for irq_set_affinity().
> >> >> >>> >>
> >> >> >>> >> The irq_set_affinity() behavior is better with this series, but I think
> >> >> >>> >> the other downsides: number of available interrupt sources, and IPI
> >> >> >>> >> broadcast are worse.
> >> >> >>> >
> >> >> >>> > The IPI overhead in the approach you are suggesting will be
> >> >> >>> > even bad compared to the IPI overhead of the current approach
> >> >> >>> > because we will end-up doing IPI upon every irq_set_affinity()
> >> >> >>> > in the suggested approach compared to doing IPI upon every
> >> >> >>> > mask/unmask in the current approach.
> >> >> >>>
> >> >> >>> Again, very workload dependent.
> >> >> >>>
> >> >> >>> This series does IPI broadcast on masking/unmasking, which means that
> >> >> >>> cores that don't care get interrupted because, say, a network queue-pair
> >> >> >>> is setup on another core.
> >> >> >>>
> >> >> >>> Some workloads never change the irq affinity.
> >> >> >>
> >> >> >> There are various events which irq affinity such as irq balance,
> >> >> >> CPU hotplug, system suspend, etc.
> >> >> >>
> >> >> >> Also, the 1-1 approach does IPI upon set_affinity, mask and
> >> >> >> unmask whereas the 1-n approach does IPI only upon mask
> >> >> >> and unmask.
> >> >> >
> >> >> > An important distinction; When you say IPI on mask/unmask it is a
> >> >> > broadcast IPI to *all* cores, which is pretty instrusive.
> >> >> >
> >> >> > The 1-1 variant does an IPI to a *one* target core.
> >> >> >
> >> >> >>> I'm just pointing out that there are pro/cons with both variants.
> >> >> >>>
> >> >> >>> > The biggest advantage of the current approach is a reliable
> >> >> >>> > irq_set_affinity() which is a very valuable thing to have.
> >> >> >>>
> >> >> >>> ...and I'm arguing that we're paying a big price for that.
> >> >> >>>
> >> >> >>> > ARM systems easily support a large number of LPIs per-core.
> >> >> >>> > For example, GIC-700 supports 56000 LPIs per-core.
> >> >> >>> > (Refer, https://developer.arm.com/documentation/101516/0300/About-the-GIC-700/Features)
> >> >> >>>
> >> >> >>> Yeah, but this is not the GIC. This is something that looks more like
> >> >> >>> the x86 world. We'll be stuck with a lot of implementations with AIA 1.0
> >> >> >>> spec, and many cores.
> >> >> >>
> >> >> >> Well, RISC-V AIA is neigher ARM GIG not x86 APIC. All I am saying
> >> >> >> is that there are systems with large number per-core interrupt IDs
> >> >> >> for handling MSIs.
> >> >> >
> >> >> > Yes, and while that is nice, it's not what IMSIC is.
> >> >>
> >> >> Some follow-ups, after thinking more about it more over the weekend.
> >> >>
> >> >> * Do one really need an IPI for irq_set_affinity() for the 1-1 model?
> >> >>   Why touch the enable/disable bits when moving interrupts?
> >> >
> >> > In the 1-1 model, the ID on the current HART and target HART upon
> >> > irq_set_affinity will be different so we can't leave the unused ID on
> >> > current HART enabled because it can lead to spurious interrupts
> >> > when the ID on current HART is re-used for some other device.
> >>
> >> Hmm, is this really an actual problem, or a theoretical one? The
> >> implementation need to track what's in-use, so can we ever get into this
> >> situation?
> >
> > As of now, it is theoretical but it is certainly possible to hit this issue.
>
> Sorry for being slow here, Anup, but could you give an example how this
> could happen? For me it sounds like this could only be caused by a
> broken (buggy) implementation?

Let me re-state the problem: ID X on HART A is moved to ID Y on
HART B. The movement might have been initiated by some other
HART C doing irq_set_affinity(). Once the movement is done,
ID X on HART A should be disabled because, if it is not, some
(possibly buggy) device can still cause HART A to take an MSI with
ID X, but this MSI won't map to any Linux IRQ since it has already
been moved to ID Y on HART B.
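
In other words, the affinity change in the 1-1 model would roughly
follow the sketch below. All imsic_example_*() helpers are hypothetical
and only illustrate the ordering being discussed, not code from this
series:

/* Purely illustrative; every imsic_example_*() helper is made up. */
static int example_1_1_set_affinity(unsigned int old_cpu, unsigned int old_id,
				    unsigned int new_cpu)
{
	int new_id = imsic_example_alloc_id(new_cpu);

	if (new_id < 0)
		return new_id;	/* in the 1-1 model the affinity change can fail */

	imsic_example_enable_id(new_cpu, new_id);	/* IPI to new_cpu */
	imsic_example_program_msi_address(new_cpu, new_id); /* re-program device */

	/*
	 * Disable ID X on the old hart so that a stale MSI write cannot be
	 * confused with a later user of that ID, and re-trigger on the new
	 * hart if an MSI landed while the interrupt was being moved.
	 */
	imsic_example_disable_id(old_cpu, old_id);	/* IPI to old_cpu */
	if (imsic_example_id_pending(old_cpu, old_id))
		imsic_example_retrigger(new_cpu, new_id);

	imsic_example_free_id(old_cpu, old_id);
	return 0;
}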

>
> >> Somewhat related; I had a similar question for imsic_pci_{un,}mask_irq()
> >> -- why not only do the the default mask operation (only
> >> pci_msi_{un,}mask_irq()), but instead propagate to the IMSIC
> >> mask/unmask?
> >
> > We have hierarchical IMSIC PCI irq domain whoes parent irq domain
> > is IMSIC base domain. Unfortunately pci_msi_[un]mask_irq() don't
> > work for hierarchical irq domain.
>
> Ok! Thanks for the explanation!
>
> >> > There is also a possibility of receiving an interrupt while the ID was
> >> > moved to a new target HART in-which case we have to detect and
> >> > re-trigger interrupt on the new target HART. In fact, x86 APLIC does
> >> > an IPI to take care of this case.
> >>
> >> This case I get, and the implementation can track that both are in use.
> >> It's the spurious one that I'm dubious of (don't get).
> >>
> >> >>
> >> >> * In my book the IMSIC looks very much like the x86 LAPIC, which also
> >> >>   has few interrupts (IMSIC <2048, LAPIC 256). The IRQ matrix allocator
> >> >>   [1], and a scheme similar to LAPIC [2] would be a good fit. This is
> >> >>   the 1-1 model, but more sophisticated than what I've been describing
> >> >>   (e.g. properly handling mangaged/regular irqs). As a bonus we would
> >> >>   get the IRQ matrix debugfs/tracepoint support.
> >> >>
> >> >
> >> > Yes, I have been evaluating the 1-1 model for the past few days. I also
> >> > have a working implementation with a simple per-CPU bitmap based
> >> > allocator which handles both legacy MSI (block of 1,2,4,8,16, or 32 IDs)
> >> > and MSI-X.
> >> >
> >> > The irq matrix allocator needs to be improved for handling legacy MSI
> >> > so initially I will post a v11 series which works for me and converging
> >> > with irq matrix allocator can be future work.
> >>
> >> What's missing/needs to be improved for legacy MSI (legacy MSI ==
> >> !MSI-X, right?) in the matrix allocator?
> >
> > For legacy MSI, a block IDs needs to be contiguous and the number
> > of IDs can be a power of 2.
>
> Oh, so this is not supported by the matrix allocator?

Yes, that's correct. The IRQ matrix allocator only supports allocating
IDs suitable for MSI-X. Improving the IRQ matrix allocator is a
separate effort in my opinion.
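
For comparison, the matrix allocator hands out exactly one vector at a
time on a chosen CPU, roughly as in the sketch below (example_* names
are made up; irq_matrix_alloc() is the existing entry point, assuming a
matrix created with irq_alloc_matrix() at boot, as x86 does):

#include <linux/irq.h>
#include <linux/cpumask.h>

static struct irq_matrix *example_matrix;	/* from irq_alloc_matrix() */

static int example_alloc_single_vector(const struct cpumask *mask)
{
	unsigned int cpu;
	int vec;

	/* allocates exactly one vector on the best CPU in 'mask' */
	vec = irq_matrix_alloc(example_matrix, mask, false, &cpu);
	if (vec < 0)
		return vec;

	/* ... bind the Linux IRQ to (cpu, vec) here ... */
	return vec;
}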

The ARM GICv3/v4 driver supports both MSI and MSI-X.

Regards,
Anup