Message ID | 20250204075405.824721-11-apatel@ventanamicro.com |
---|---|
State | New |
Series | RISC-V IMSIC driver improvements |
On Tue, Feb 04 2025 at 13:24, Anup Patel wrote:
> Devices (such as PCI) which have non-atomic MSI update should
> migrate irq in the interrupt-context so use IRQCHIP_MOVE_DEFERRED
> flag for corresponding irqchips.
>
> The use of IRQCHIP_MOVE_DEFERRED further simplifies IMSIC vector
> movement as follows:
>
> 1) No need to handle the intermediate state seen by devices with
>    non-atomic MSI update because imsic_irq_set_affinity() is called
>    in the interrupt-context with interrupt masked.
> 2) No need to check temporary vector when completing vector movement
>    on the old CPU in __imsic_local_sync().
> 3) No need to call imsic_local_sync_all() from imsic_handle_irq()

I have no idea how you came to that delusion.

IRQCHIP_MOVE_DEFERRED is part of the mechanism to handle this insanity
correctly. It does not prevent the device from observing and actually
using the intermediate state. All it does is ensure that the kernel can
observe this case and act on it.

The fact that the kernel executes the interrupt handler on the original
target CPU does not prevent the device from firing another interrupt.
PCI/MSI interrupts are strictly edge, i.e. fire and forget.

IRQCHIP_MOVE_DEFERRED solely ensures that the racy affinity update in
the PCI device happens in the context of the original target CPU, which
is required to handle all possible cases correctly.

Let's assume the interrupt is affine to CPU0, vector A, and a move to
CPU1, vector B is pending. So we have three possible scenarios:

    CPU0                          Device
    interrupt
                               1) Raises interrupt on CPU0, vector A
    ...
    write_msg()
      write_address(CPU1)
                               2) Raises interrupt on CPU1, vector A
      write_data(vector B)
                               3) Raises interrupt on CPU1, vector B

#1 is handled correctly because the interrupt is retriggered on CPU0,
vector A, which still has the interrupt associated (it's cleaned up
_after_ the first interrupt arrives on CPU1, vector B).

#2 cannot be handled because CPU1, vector A is either not in use or
associated to a completely unrelated interrupt, which means that if it
happens the interrupt is lost and the device might become stale.

#3 is handled correctly for obvious reasons.

The only way to handle #2 properly is to do the intermediate update to
CPU0, vector B and check for a pending interrupt on that. The important
role IRQCHIP_MOVE_DEFERRED plays here is that it guarantees that the
update happens on CPU0 (the original target), which in turn is required
to observe that CPU0, vector B has been raised.

The same could be achieved by executing that intermediate transition on
CPU0 with interrupts disabled, by affining the calling context (thread)
to CPU0, or by issuing an IPI on CPU0 and doing it in that context. I
looked into that, but it has its own pile of issues. So in the end,
moving it in the context of the interrupt on the original CPU/vector
turned out to be the simplest way to achieve it.

Thanks,

        tglx
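[Editor's note] The intermediate transition Thomas describes is what the pre-patch IMSIC code implements (mirroring the x86 APIC) and what the diff below removes. As a reading aid, here is a condensed sketch of that two-step update, using the identifiers from imsic_irq_set_affinity(); locking, error paths, and the surrounding function body are elided:

```c
/*
 * Condensed from imsic_irq_set_affinity() before this patch; this is
 * the code the hunks below delete, not new logic.
 */
if (new_vec->local_id != old_vec->local_id) {
	/*
	 * Step 1: write a temporary vector which keeps the old MSI
	 * address (CPU0) but already carries the new MSI data
	 * (vector B). A device sampling mid-update now lands on
	 * CPU0/vector B instead of the unhandled CPU1/vector A.
	 */
	tmp_vec.cpu = old_vec->cpu;
	tmp_vec.local_id = new_vec->local_id;
	imsic_msi_update_msg(irq_get_irq_data(d->irq), &tmp_vec);
}

/* Step 2: point the device at the final target (CPU1, vector B). */
imsic_msi_update_msg(irq_get_irq_data(d->irq), new_vec);
```

With this scheme, __imsic_local_sync() on the old CPU must check both the old and the temporary id for pending bits; that is the check which catches scenario #2.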
On Tue, Feb 4, 2025 at 2:26 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Tue, Feb 04 2025 at 13:24, Anup Patel wrote:
> > Devices (such as PCI) which have non-atomic MSI update should
> > migrate irq in the interrupt-context so use IRQCHIP_MOVE_DEFERRED
> > flag for corresponding irqchips.
> >
> > The use of IRQCHIP_MOVE_DEFERRED further simplifies IMSIC vector
> > movement as follows:
> >
> > 1) No need to handle the intermediate state seen by devices with
> >    non-atomic MSI update because imsic_irq_set_affinity() is called
> >    in the interrupt-context with interrupt masked.
> > 2) No need to check temporary vector when completing vector movement
> >    on the old CPU in __imsic_local_sync().
> > 3) No need to call imsic_local_sync_all() from imsic_handle_irq()
>
> I have no idea how you came to that delusion.
>
> IRQCHIP_MOVE_DEFERRED is part of the mechanism to handle this insanity
> correctly. It does not prevent the device from observing and actually
> using the intermediate state. All it does is ensure that the kernel can
> observe this case and act on it.
>
> The fact that the kernel executes the interrupt handler on the original
> target CPU does not prevent the device from firing another interrupt.
> PCI/MSI interrupts are strictly edge, i.e. fire and forget.
>
> IRQCHIP_MOVE_DEFERRED solely ensures that the racy affinity update in
> the PCI device happens in the context of the original target CPU, which
> is required to handle all possible cases correctly.
>
> Let's assume the interrupt is affine to CPU0, vector A, and a move to
> CPU1, vector B is pending. So we have three possible scenarios:
>
>     CPU0                          Device
>     interrupt
>                                1) Raises interrupt on CPU0, vector A
>     ...
>     write_msg()
>       write_address(CPU1)
>                                2) Raises interrupt on CPU1, vector A
>       write_data(vector B)
>                                3) Raises interrupt on CPU1, vector B
>
> #1 is handled correctly because the interrupt is retriggered on CPU0,
> vector A, which still has the interrupt associated (it's cleaned up
> _after_ the first interrupt arrives on CPU1, vector B).
>
> #2 cannot be handled because CPU1, vector A is either not in use or
> associated to a completely unrelated interrupt, which means that if it
> happens the interrupt is lost and the device might become stale.
>
> #3 is handled correctly for obvious reasons.
>
> The only way to handle #2 properly is to do the intermediate update to
> CPU0, vector B and check for a pending interrupt on that. The important
> role IRQCHIP_MOVE_DEFERRED plays here is that it guarantees that the
> update happens on CPU0 (the original target), which in turn is required
> to observe that CPU0, vector B has been raised.
>
> The same could be achieved by executing that intermediate transition on
> CPU0 with interrupts disabled, by affining the calling context (thread)
> to CPU0, or by issuing an IPI on CPU0 and doing it in that context. I
> looked into that, but it has its own pile of issues. So in the end,
> moving it in the context of the interrupt on the original CPU/vector
> turned out to be the simplest way to achieve it.

I got confused because IRQCHIP_MOVE_DEFERRED updates affinity with the
interrupt masked, which I interpreted as masked at the device level.
Also, PCI MSI mask/unmask is an optional feature of PCI devices, which
I totally missed.

I will keep the intermediate transition in the next revision.

Thanks for the clarification.

Regards,
Anup
On Tue, Feb 04 2025 at 20:19, Anup Patel wrote:
> On Tue, Feb 4, 2025 at 2:26 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> The same could be achieved by executing that intermediate transition on
>> CPU0 with interrupts disabled, by affining the calling context (thread)
>> to CPU0, or by issuing an IPI on CPU0 and doing it in that context. I
>> looked into that, but it has its own pile of issues. So in the end,
>> moving it in the context of the interrupt on the original CPU/vector
>> turned out to be the simplest way to achieve it.
>
> I got confused because IRQCHIP_MOVE_DEFERRED updates affinity with the
> interrupt masked, which I interpreted as masked at the device level.
> Also, PCI MSI mask/unmask is an optional feature of PCI devices, which
> I totally missed.

That's the problem this actually handles. If PCI mask/unmask were
mandatory, the problem would not exist in the first place :)
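[Editor's note] The non-atomic update both emails refer to comes from the MSI message living in PCI config space as separate registers. As an illustration (a simplified paraphrase of what the PCI core's __pci_write_msi_msg() does for a device without MSI masking; `dev` is the pci_dev and `pos` its MSI capability offset), the device can sample the message between any two of these writes, which is precisely how scenario #2 above arises:

```c
/*
 * Simplified sketch: updating an MSI message is three independent
 * config-space writes, so a device without MSI masking can observe
 * (new address, old data) or other mixed states in between.
 */
pci_write_config_dword(dev, pos + PCI_MSI_ADDRESS_LO, msg->address_lo);
pci_write_config_dword(dev, pos + PCI_MSI_ADDRESS_HI, msg->address_hi);
pci_write_config_word(dev, pos + PCI_MSI_DATA_64, msg->data);
```

If mask/unmask were mandatory, the core could simply mask the vector around these writes and no intermediate state would ever reach the interrupt controller.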
```diff
diff --git a/drivers/irqchip/irq-riscv-imsic-platform.c b/drivers/irqchip/irq-riscv-imsic-platform.c
index e6c81718ba78..eac7f358bbba 100644
--- a/drivers/irqchip/irq-riscv-imsic-platform.c
+++ b/drivers/irqchip/irq-riscv-imsic-platform.c
@@ -64,6 +64,11 @@ static int imsic_irq_retrigger(struct irq_data *d)
 	return 0;
 }
 
+static void imsic_irq_ack(struct irq_data *d)
+{
+	irq_move_irq(d);
+}
+
 static void imsic_irq_compose_vector_msg(struct imsic_vector *vec, struct msi_msg *msg)
 {
 	phys_addr_t msi_addr;
@@ -97,7 +102,20 @@ static int imsic_irq_set_affinity(struct irq_data *d, const struct cpumask *mask
 				  bool force)
 {
 	struct imsic_vector *old_vec, *new_vec;
-	struct imsic_vector tmp_vec;
+
+	/*
+	 * Requirements for the downstream irqdomains (or devices):
+	 *
+	 * 1) Downstream irqdomains (or devices) with atomic MSI update can
+	 *    happily do imsic_irq_set_affinity() in the process-context on
+	 *    any CPU so the irqchip of such irqdomains must not set the
+	 *    IRQCHIP_MOVE_DEFERRED flag.
+	 *
+	 * 2) Downstream irqdomains (or devices) with non-atomic MSI update
+	 *    must do imsic_irq_set_affinity() in the interrupt-context upon
+	 *    next interrupt so the irqchip of such irqdomains must set the
+	 *    IRQCHIP_MOVE_DEFERRED flag.
+	 */
 
 	old_vec = irq_data_get_irq_chip_data(d);
 	if (WARN_ON(!old_vec))
@@ -117,31 +135,13 @@ static int imsic_irq_set_affinity(struct irq_data *d, const struct cpumask *mask
 		return -ENOSPC;
 
 	/*
-	 * Device having non-atomic MSI update might see an intermediate
-	 * state when changing target IMSIC vector from one CPU to another.
-	 *
-	 * To avoid losing interrupt to some intermediate state, do the
-	 * following (just like x86 APIC):
-	 *
-	 * 1) First write a temporary IMSIC vector to the device which
-	 * has MSI address same as the old IMSIC vector but MSI data
-	 * matches the new IMSIC vector.
-	 *
-	 * 2) Next write the new IMSIC vector to the device.
-	 *
-	 * Based on the above, the __imsic_local_sync() must check both
-	 * old MSI data and new MSI data on the old CPU for pending
+	 * Downstream irqdomains (or devices) with non-atomic MSI update
+	 * may see an intermediate state when changing target IMSIC vector
+	 * from one CPU to another but using the IRQCHIP_MOVE_DEFERRED
+	 * flag this is taken care because imsic_irq_set_affinity() is
+	 * called in the interrupt-context with interrupt masked.
 	 */
-	if (new_vec->local_id != old_vec->local_id) {
-		/* Setup temporary vector */
-		tmp_vec.cpu = old_vec->cpu;
-		tmp_vec.local_id = new_vec->local_id;
-
-		/* Point device to the temporary vector */
-		imsic_msi_update_msg(irq_get_irq_data(d->irq), &tmp_vec);
-	}
-
 	/* Point device to the new vector */
 	imsic_msi_update_msg(irq_get_irq_data(d->irq), new_vec);
 
@@ -198,6 +198,7 @@ static struct irq_chip imsic_irq_base_chip = {
 	.irq_force_complete_move = imsic_irq_force_complete_move,
 #endif
 	.irq_retrigger		= imsic_irq_retrigger,
+	.irq_ack		= imsic_irq_ack,
 	.irq_compose_msi_msg	= imsic_irq_compose_msg,
 	.flags			= IRQCHIP_SKIP_SET_WAKE |
 				  IRQCHIP_MASK_ON_SUSPEND,
@@ -217,7 +218,7 @@ static int imsic_irq_domain_alloc(struct irq_domain *domain, unsigned int virq,
 		return -ENOSPC;
 
 	irq_domain_set_info(domain, virq, virq, &imsic_irq_base_chip, vec,
-			    handle_simple_irq, NULL, NULL);
+			    handle_edge_irq, NULL, NULL);
 	irq_set_noprobe(virq);
 	irq_set_affinity(virq, cpu_online_mask);
 	irq_data_update_effective_affinity(irq_get_irq_data(virq), cpumask_of(vec->cpu));
@@ -256,15 +257,36 @@ static const struct irq_domain_ops imsic_base_domain_ops = {
 #endif
 };
 
+static bool imsic_init_dev_msi_info(struct device *dev,
+				    struct irq_domain *domain,
+				    struct irq_domain *real_parent,
+				    struct msi_domain_info *info)
+{
+	if (!msi_lib_init_dev_msi_info(dev, domain, real_parent, info))
+		return false;
+
+	switch (info->bus_token) {
+	case DOMAIN_BUS_PCI_DEVICE_MSI:
+	case DOMAIN_BUS_PCI_DEVICE_MSIX:
+		info->chip->flags |= IRQCHIP_MOVE_DEFERRED;
+		break;
+	default:
+		break;
+	}
+
+	return true;
+}
+
 static const struct msi_parent_ops imsic_msi_parent_ops = {
 	.supported_flags	= MSI_GENERIC_FLAGS_MASK |
 				  MSI_FLAG_PCI_MSIX,
 	.required_flags		= MSI_FLAG_USE_DEF_DOM_OPS |
 				  MSI_FLAG_USE_DEF_CHIP_OPS |
 				  MSI_FLAG_PCI_MSI_MASK_PARENT,
+	.chip_flags		= MSI_CHIP_FLAG_SET_ACK,
 	.bus_select_token	= DOMAIN_BUS_NEXUS,
 	.bus_select_mask	= MATCH_PCI_MSI | MATCH_PLATFORM_MSI,
-	.init_dev_msi_info	= msi_lib_init_dev_msi_info,
+	.init_dev_msi_info	= imsic_init_dev_msi_info,
 };
 
 int imsic_irqdomain_init(void)
diff --git a/drivers/irqchip/irq-riscv-imsic-state.c b/drivers/irqchip/irq-riscv-imsic-state.c
index e70f497a9326..f9b2cec72ff2 100644
--- a/drivers/irqchip/irq-riscv-imsic-state.c
+++ b/drivers/irqchip/irq-riscv-imsic-state.c
@@ -126,8 +126,8 @@ void __imsic_eix_update(unsigned long base_id, unsigned long num_id, bool pend,
 
 static bool __imsic_local_sync(struct imsic_local_priv *lpriv)
 {
-	struct imsic_local_config *tlocal, *mlocal;
-	struct imsic_vector *vec, *tvec, *mvec;
+	struct imsic_local_config *mlocal;
+	struct imsic_vector *vec, *mvec;
 	bool ret = true;
 	int i;
 
@@ -170,27 +170,6 @@ static bool __imsic_local_sync(struct imsic_local_priv *lpriv)
 		 */
 		mvec = READ_ONCE(vec->move_next);
 		if (mvec) {
-			/*
-			 * Device having non-atomic MSI update might see an
-			 * intermediate state so check both old ID and new ID
-			 * for pending interrupts.
-			 *
-			 * For details, refer imsic_irq_set_affinity().
-			 */
-
-			tvec = vec->local_id == mvec->local_id ?
-			       NULL : &lpriv->vectors[mvec->local_id];
-			if (tvec && __imsic_id_read_clear_pending(tvec->local_id)) {
-				/* Retrigger temporary vector if it was already in-use */
-				if (READ_ONCE(tvec->enable)) {
-					tlocal = per_cpu_ptr(imsic->global.local, tvec->cpu);
-					writel_relaxed(tvec->local_id, tlocal->msi_va);
-				}
-
-				mlocal = per_cpu_ptr(imsic->global.local, mvec->cpu);
-				writel_relaxed(mvec->local_id, mlocal->msi_va);
-			}
-
 			if (__imsic_id_read_clear_pending(vec->local_id)) {
 				mlocal = per_cpu_ptr(imsic->global.local, mvec->cpu);
 				writel_relaxed(mvec->local_id, mlocal->msi_va);
```
Devices (such as PCI) which have non-atomic MSI update should
migrate irq in the interrupt-context so use IRQCHIP_MOVE_DEFERRED
flag for corresponding irqchips.

The use of IRQCHIP_MOVE_DEFERRED further simplifies IMSIC vector
movement as follows:

1) No need to handle the intermediate state seen by devices with
   non-atomic MSI update because imsic_irq_set_affinity() is called
   in the interrupt-context with interrupt masked.

2) No need to check temporary vector when completing vector movement
   on the old CPU in __imsic_local_sync().

3) No need to call imsic_local_sync_all() from imsic_handle_irq()

Signed-off-by: Anup Patel <apatel@ventanamicro.com>
---
 drivers/irqchip/irq-riscv-imsic-platform.c | 74 ++++++++++++++--------
 drivers/irqchip/irq-riscv-imsic-state.c    | 25 +-------
 2 files changed, 50 insertions(+), 49 deletions(-)
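[Editor's note] For completeness, the other half of the two-step scheme, which Anup agreed to keep in the next revision, is the pending check on the old CPU. This is a condensed excerpt of the __imsic_local_sync() code the diff above removes (reassembled as a reading aid, not new logic); it is the path that catches scenario #2 from the review:

```c
/*
 * Old-CPU side, condensed from __imsic_local_sync() before this patch:
 * if the device fired on the temporary (old CPU, new id) vector,
 * forward the interrupt to the new target before finishing the move.
 */
tvec = vec->local_id == mvec->local_id ?
       NULL : &lpriv->vectors[mvec->local_id];
if (tvec && __imsic_id_read_clear_pending(tvec->local_id)) {
	/* Retrigger the temporary vector if it was already in use. */
	if (READ_ONCE(tvec->enable)) {
		tlocal = per_cpu_ptr(imsic->global.local, tvec->cpu);
		writel_relaxed(tvec->local_id, tlocal->msi_va);
	}

	mlocal = per_cpu_ptr(imsic->global.local, mvec->cpu);
	writel_relaxed(mvec->local_id, mlocal->msi_va);
}
```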