Message ID | 20200316115433.9017-1-maz@kernel.org |
---|---|
Series | irqchip/gic-v3-its: Balance LPI affinity across CPUs |
On 16/03/2020 11:54, Marc Zyngier wrote:
> When mapping a LPI, the ITS driver picks the first possible
> affinity, which is in most cases CPU0, assuming that if
> that's not suitable, someone will come and set the affinity
> to something more interesting.
>
> It apparently isn't the case, and people complain of poor
> performance when many interrupts are glued to the same CPU.
> So let's place the interrupts by finding the "least loaded"
> CPU (that is, the one that has the fewer LPIs mapped to it).
> So called 'managed' interrupts are an interesting case where
> the affinity is actually dictated by the kernel itself, and
> we should honor this.
>
> * From v2:
>   - Split accounting from CPU selection
>   - Track managed and unmanaged interrupts separately
>
> Marc Zyngier (2):
>   irqchip/gic-v3-its: Track LPI distribution on a per CPU basis
>   irqchip/gic-v3-its: Balance initial LPI affinity across CPUs
>
>  drivers/irqchip/irq-gic-v3-its.c | 153 +++++++++++++++++++++++++------
>  1 file changed, 127 insertions(+), 26 deletions(-)
>

Hi Marc,

Initial results look good. We have 3x NVMe drives now, as opposed to 2x
previously, which is better for this test.

Before: ~1.3M IOPs fio read
After:  ~1.8M IOPs fio read

So a ~50% gain in throughput.

We also did try NVMe with nvme.use_threaded_interrupts=1. As you may
remember, the NVMe interrupt handling can cause lockups, as they handle
all completions in interrupt context by default.

Before: ~1.2M IOPs fio read
After:  ~1.2M IOPs fio read

So they were about the same. I would have hoped for an improvement here,
considering before we would have all the per-queue threaded handlers
running on the single CPU handling the hard irq.

But we will retest all this tomorrow, so please consider these
provisional for now.

Thanks to Luo Jiaxing for testing.

Cheers,
john
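For readers skimming the thread, a minimal sketch of the "least loaded" placement described in the cover letter above — kernel-style C with illustrative names (`sketch_lpi_count` and `sketch_pick_least_loaded` are assumptions, not the driver's actual symbols):

```c
#include <linux/atomic.h>
#include <linux/cpumask.h>
#include <linux/percpu.h>

/* Illustrative only: one counter per CPU for the LPIs currently targeting it. */
static DEFINE_PER_CPU(atomic_t, sketch_lpi_count);

/* Return the online CPU in @mask that currently carries the fewest LPIs. */
static int sketch_pick_least_loaded(const struct cpumask *mask)
{
	unsigned int lowest = ~0U;
	int cpu, best = nr_cpu_ids;

	for_each_cpu_and(cpu, mask, cpu_online_mask) {
		unsigned int count = atomic_read(per_cpu_ptr(&sketch_lpi_count, cpu));

		if (count < lowest) {
			lowest = count;
			best = cpu;
		}
	}

	return best;
}
```

The series itself tracks managed and unmanaged interrupts in separate counters; the single counter above is only meant to show the selection step.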
> Before: ~1.2M IOPs fio read
> After:  ~1.2M IOPs fio read
>
> So they were about the same. I would have hoped for an improvement here,
> considering before we would have all the per-queue threaded handlers
> running on the single CPU handling the hard irq.
>
> But we will retest all this tomorrow, so please consider these
> provisional for now.
>
> Thanks to Luo Jiaxing for testing.

Hi Marc,

Just to let you know that we're still looking at this. It turns out that
we're not getting the results as hoped, and the previous results were
incorrect due to a test booboo ..

I also think that the SMMU may even be creating a performance ceiling. I
need to check this more, as I can't seem to get the throughput above a
certain level.

I realise that we're so late in the cycle that there is now no immediate
rush. But I would also like to report some other unexpected behaviour for
managed interrupts in this series - I'll reply directly to the specific
patch for that.

Much appreciated,
John
Hi Marc,

> But I would also like to report some other unexpected behaviour for
> managed interrupts in this series - I'll reply directly to the specific
> patch for that.
>

So I made this change:

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index 9199fb53c75c..ebbfc8d44d35 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -1539,6 +1539,8 @@ static int its_set_affinity(struct irq_data *d, const struct cpumask *mask_val,
 	if (irqd_is_forwarded_to_vcpu(d))
 		return -EINVAL;
 
+	its_dec_lpi_count(d, its_dev->event_map.col_map[id]);
+
 	if (!force)
 		cpu = its_select_cpu(d, mask_val);
 	else
@@ -1549,14 +1551,14 @@ static int its_set_affinity(struct irq_data *d, const struct cpumask *mask_val,
 
 	/* don't set the affinity when the target cpu is same as current one */
 	if (cpu != its_dev->event_map.col_map[id]) {
-		its_inc_lpi_count(d, cpu);
-		its_dec_lpi_count(d, its_dev->event_map.col_map[id]);
 		target_col = &its_dev->its->collections[cpu];
 		its_send_movi(its_dev, target_col, id);
 		its_dev->event_map.col_map[id] = cpu;
 		irq_data_update_effective_affinity(d, cpumask_of(cpu));
 	}
 
+	its_inc_lpi_count(d, cpu);
+
 	return IRQ_SET_MASK_OK_DONE;
 }

Results look ok:

		nvme.use_threaded_interrupts=1	=0*
Before		950K IOPs			1000K IOPs
After		1100K IOPs			1150K IOPs

* as mentioned before, this is quite unstable and causes lockups. JFYI,
there was an attempt to fix this:

https://lore.kernel.org/linux-nvme/20191209175622.1964-1-kbusch@kernel.org/

Thanks,
John
> 
> +	its_inc_lpi_count(d, cpu);
> +
>  	return IRQ_SET_MASK_OK_DONE;
>  }
> 
> Results look ok:
> 
> 		nvme.use_threaded_interrupts=1	=0*
> Before		950K IOPs			1000K IOPs
> After		1100K IOPs			1150K IOPs
> 
> * as mentioned before, this is quite unstable and causes lockups. JFYI,
> there was an attempt to fix this:
> 
> https://lore.kernel.org/linux-nvme/20191209175622.1964-1-kbusch@kernel.org/
> 

Hi Marc,

Just wondering if we can try to get this series over the line?

So I tested the patches on v5.7-rc5, and get similar performance
improvement to above.

I did apply a couple of patches, below, to remedy the issues I
experienced for my D06CS.

Thanks,
John

---->8

[PATCH 1/2] irqchip/gic-v3-its: Don't double account for target CPU
assignment

In its_set_affinity(), when a managed irq is already assigned to a CPU,
we may needlessly reassign the irq to another CPU.

This is because when selecting the target CPU, being the least loaded
CPU in the mask, we account for that irq still being assigned to a CPU;
thereby we may unfairly select another CPU.

Modify this behaviour to pre-decrement the current target CPU LPI count
when finding the least loaded CPU.

Alternatively we may be able to just bail out early when the current
target CPU already falls within the requested mask.

---
 drivers/irqchip/irq-gic-v3-its.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index 73f5c12..2b18feb 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -1636,6 +1636,8 @@ static int its_set_affinity(struct irq_data *d, const struct cpumask *mask_val,
 	if (irqd_is_forwarded_to_vcpu(d))
 		return -EINVAL;
 
+	its_dec_lpi_count(d, its_dev->event_map.col_map[id]);
+
 	if (!force)
 		cpu = its_select_cpu(d, mask_val);
 	else
@@ -1646,14 +1648,14 @@ static int its_set_affinity(struct irq_data *d, const struct cpumask *mask_val,
 
 	/* don't set the affinity when the target cpu is same as current one */
 	if (cpu != its_dev->event_map.col_map[id]) {
-		its_inc_lpi_count(d, cpu);
-		its_dec_lpi_count(d, its_dev->event_map.col_map[id]);
 		target_col = &its_dev->its->collections[cpu];
 		its_send_movi(its_dev, target_col, id);
 		its_dev->event_map.col_map[id] = cpu;
 		irq_data_update_effective_affinity(d, cpumask_of(cpu));
 	}
 
+	its_inc_lpi_count(d, cpu);
+
 	return IRQ_SET_MASK_OK_DONE;
 }

---

[PATCH 2/2] irqchip/gic-v3-its: Handle no overlap of non-managed irq
affinity mask

In selecting the target CPU for a non-managed interrupt, we may select a
target CPU outside the requested affinity mask.

This is because there may be no overlap of the ITS node mask and the
requested CPU affinity mask. The requested affinity mask may be coming
from userspace or some drivers which try to set irq affinity, see [0].

In this case, just ignore the ITS node cpumask. This is a deviation from
what Thomas described. Having said that, I am not sure if the interrupt
is ever bound to a node for us.

[0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/perf/hisilicon/hisi_uncore_pmu.c#n417

---
 drivers/irqchip/irq-gic-v3-its.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index 2b18feb..12d5d4b4 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -1584,10 +1584,6 @@ static int its_select_cpu(struct irq_data *d,
 		cpumask_and(tmpmask, cpumask_of_node(node), aff_mask);
 		cpumask_and(tmpmask, tmpmask, cpu_online_mask);
 
-		/* If that doesn't work, try the nodemask itself */
-		if (cpumask_empty(tmpmask))
-			cpumask_and(tmpmask, cpumask_of_node(node), cpu_online_mask);
-
 		cpu = cpumask_pick_least_loaded(d, tmpmask);
 		if (cpu < nr_cpu_ids)
 			goto out;
Hi John,

On 2020-05-14 13:05, John Garry wrote:
>>
>> +	its_inc_lpi_count(d, cpu);
>> +
>>  	return IRQ_SET_MASK_OK_DONE;
>>  }
>>
>> Results look ok:
>>
>> 		nvme.use_threaded_interrupts=1	=0*
>> Before		950K IOPs			1000K IOPs
>> After		1100K IOPs			1150K IOPs
>>
>> * as mentioned before, this is quite unstable and causes lockups.
>> JFYI, there was an attempt to fix this:
>>
>> https://lore.kernel.org/linux-nvme/20191209175622.1964-1-kbusch@kernel.org/
>>
>
> Hi Marc,
>
> Just wondering if we can try to get this series over the line?

Absolutely. Life has got in the way, so let me page it back in...

> So I tested the patches on v5.7-rc5, and get similar performance
> improvement to above.
>
> I did apply a couple of patches, below, to remedy the issues I
> experienced for my D06CS.

Comments on that below.

>
> Thanks,
> John
>
>
> ---->8
>
>
> [PATCH 1/2] irqchip/gic-v3-its: Don't double account for target CPU
> assignment
>
> In its_set_affinity(), when a managed irq is already assigned to a CPU,
> we may needlessly reassign the irq to another CPU.
>
> This is because when selecting the target CPU, being the least loaded
> CPU in the mask, we account for that irq still being assigned to a CPU;
> thereby we may unfairly select another CPU.
>
> Modify this behaviour to pre-decrement the current target CPU LPI count
> when finding the least loaded CPU.
>
> Alternatively we may be able to just bail out early when the current
> target CPU already falls within the requested mask.
>
> ---
>  drivers/irqchip/irq-gic-v3-its.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/irqchip/irq-gic-v3-its.c
> b/drivers/irqchip/irq-gic-v3-its.c
> index 73f5c12..2b18feb 100644
> --- a/drivers/irqchip/irq-gic-v3-its.c
> +++ b/drivers/irqchip/irq-gic-v3-its.c
> @@ -1636,6 +1636,8 @@ static int its_set_affinity(struct irq_data *d,
> const struct cpumask *mask_val,
> 	if (irqd_is_forwarded_to_vcpu(d))
> 		return -EINVAL;
>
> +	its_dec_lpi_count(d, its_dev->event_map.col_map[id]);
> +
> 	if (!force)
> 		cpu = its_select_cpu(d, mask_val);
> 	else
> @@ -1646,14 +1648,14 @@ static int its_set_affinity(struct irq_data
> *d, const struct cpumask *mask_val,
>
> 	/* don't set the affinity when the target cpu is same as current one
> */
> 	if (cpu != its_dev->event_map.col_map[id]) {
> -		its_inc_lpi_count(d, cpu);
> -		its_dec_lpi_count(d, its_dev->event_map.col_map[id]);
> 		target_col = &its_dev->its->collections[cpu];
> 		its_send_movi(its_dev, target_col, id);
> 		its_dev->event_map.col_map[id] = cpu;
> 		irq_data_update_effective_affinity(d, cpumask_of(cpu));
> 	}
>
> +	its_inc_lpi_count(d, cpu);
> +

I'm OK with that change, as it removes unnecessary churn.

> 	return IRQ_SET_MASK_OK_DONE;
> }
>
> ---
>
>
> [PATCH 2/2] irqchip/gic-v3-its: Handle no overlap of non-managed irq
> affinity mask
>
> In selecting the target CPU for a non-managed interrupt, we may select
> a
> target CPU outside the requested affinity mask.
>
> This is because there may be no overlap of the ITS node mask and the
> requested CPU affinity mask. The requested affinity mask may be coming
> from userspace or some drivers which try to set irq affinity, see [0].
>
> In this case, just ignore the ITS node cpumask. This is a deviation
> from
> what Thomas described. Having said that, I am not sure if the
> interrupt is ever bound to a node for us.
>
> [0]
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/perf/hisilicon/hisi_uncore_pmu.c#n417
>
> ---
>  drivers/irqchip/irq-gic-v3-its.c | 4 ----
>  1 file changed, 4 deletions(-)
>
> diff --git a/drivers/irqchip/irq-gic-v3-its.c
> b/drivers/irqchip/irq-gic-v3-its.c
> index 2b18feb..12d5d4b4 100644
> --- a/drivers/irqchip/irq-gic-v3-its.c
> +++ b/drivers/irqchip/irq-gic-v3-its.c
> @@ -1584,10 +1584,6 @@ static int its_select_cpu(struct irq_data *d,
> 		cpumask_and(tmpmask, cpumask_of_node(node), aff_mask);
> 		cpumask_and(tmpmask, tmpmask, cpu_online_mask);
>
> -		/* If that doesn't work, try the nodemask itself */
> -		if (cpumask_empty(tmpmask))
> -			cpumask_and(tmpmask, cpumask_of_node(node), cpu_online_mask);
> -
> 		cpu = cpumask_pick_least_loaded(d, tmpmask);
> 		if (cpu < nr_cpu_ids)
> 			goto out;

I'm really not sure. Shouldn't we then drop the wider search on
cpu_online_mask, because userspace could have given us something
that we cannot deal with?

What you are advocating for is a strict adherence to the provided
mask, and it doesn't seem to be what other architectures are providing.
I consider the userspace-provided affinity as a hint more than anything
else, as in this case the kernel does know better (routing the interrupt
to a foreign node might be costly, or even impossible, see the TX1
erratum).

From what I remember of the earlier discussion, you saw an issue on
a system with two sockets and a single ITS, with the node mask limited
to the first socket. Is that correct?

I'll respin the series today and repost it with your first patch
squashed in.

Thanks,

M.
Hi Marc,

> Absolutely. Life has got in the way, so let me page it back in...

Great

>>
>> [PATCH 2/2] irqchip/gic-v3-its: Handle no overlap of non-managed irq
>> affinity mask
>>
>> In selecting the target CPU for a non-managed interrupt, we may select
>> a
>> target CPU outside the requested affinity mask.
>>
>> This is because there may be no overlap of the ITS node mask and the
>> requested CPU affinity mask. The requested affinity mask may be coming
>> from userspace or some drivers which try to set irq affinity, see [0].
>>
>> In this case, just ignore the ITS node cpumask. This is a deviation
>> from
>> what Thomas described. Having said that, I am not sure if the
>> interrupt is ever bound to a node for us.
>>
>> [0]
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/perf/hisilicon/hisi_uncore_pmu.c#n417
>>
>> ---
>>  drivers/irqchip/irq-gic-v3-its.c | 4 ----
>>  1 file changed, 4 deletions(-)
>>
>> diff --git a/drivers/irqchip/irq-gic-v3-its.c
>> b/drivers/irqchip/irq-gic-v3-its.c
>> index 2b18feb..12d5d4b4 100644
>> --- a/drivers/irqchip/irq-gic-v3-its.c
>> +++ b/drivers/irqchip/irq-gic-v3-its.c
>> @@ -1584,10 +1584,6 @@ static int its_select_cpu(struct irq_data *d,
>> 		cpumask_and(tmpmask, cpumask_of_node(node), aff_mask);
>> 		cpumask_and(tmpmask, tmpmask, cpu_online_mask);
>>
>> -		/* If that doesn't work, try the nodemask itself */
>> -		if (cpumask_empty(tmpmask))
>> -			cpumask_and(tmpmask, cpumask_of_node(node), cpu_online_mask);
>> -
>> 		cpu = cpumask_pick_least_loaded(d, tmpmask);
>> 		if (cpu < nr_cpu_ids)
>> 			goto out;
>
> I'm really not sure. Shouldn't we then drop the wider search on
> cpu_online_mask, because userspace could have given us something
> that we cannot deal with?

It's not just userspace. Some drivers call irq_set_affinity{_hint}()
also, with a non-overlapping affinity mask.

We could just error these requests, but some drivers rely on this
behavior. Consider the uncore driver I mentioned above, which WARNs
when the affinity setting fails. So it tries to set the affinity with
the cpumask of the cluster associated with the device, but with D06's
ITS config, below, there may be no overlap.

>
> What you are advocating for is a strict adherence to the provided
> mask, and it doesn't seem to be what other architectures are providing.
> I consider the userspace-provided affinity as a hint more than anything
> else, as in this case the kernel does know better (routing the interrupt
> to a foreign node might be costly, or even impossible, see the TX1
> erratum).

Right

>
> From what I remember of the earlier discussion, you saw an issue on
> a system with two sockets and a single ITS, with the node mask limited
> to the first socket. Is that correct?

A bit more complicated: 2 sockets, 2 NUMA nodes per socket, and ITS
config as follows:

D06ES 1x ITS with proximity node #0

root@(none)$ dmesg | grep ITS
[    0.000000] SRAT: PXM 0 -> ITS 0 -> Node 0


D06CS
2x ITS with proximity node #0, #2

estuary:/$ dmesg | grep ITS
[    0.000000] SRAT: PXM 0 -> ITS 0 -> Node 0
[    0.000000] SRAT: PXM 2 -> ITS 1 -> Node 2

It complicates things. We could add extra intelligence to record if a
node has an ITS associated. In the case of that not being true, we would
fallback on the requested affin only (for case of no overlap). It gets a
bit more messy then.

Thanks,
John
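A rough sketch of the fallback John describes here, purely as an assumption — `node_has_its()` does not exist in the driver and would have to be derived from the SRAT/ITS parsing, and the function names are illustrative:

```c
#include <linux/cpumask.h>
#include <linux/numa.h>
#include <linux/topology.h>

/* Hypothetical: would be filled in from SRAT/ITS affinity parsing. */
static bool node_has_its(int node)
{
	return true;	/* placeholder */
}

/*
 * Build the candidate mask for a non-managed interrupt: restrict to the
 * ITS's node only when that node actually has an ITS, otherwise honour
 * the requested affinity mask as-is.
 */
static void sketch_build_candidates(struct cpumask *tmpmask, int node,
				    const struct cpumask *aff_mask)
{
	if (node != NUMA_NO_NODE && node_has_its(node))
		cpumask_and(tmpmask, cpumask_of_node(node), aff_mask);
	else
		cpumask_copy(tmpmask, aff_mask);

	cpumask_and(tmpmask, tmpmask, cpu_online_mask);
}
```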
On 2020-05-15 12:50, John Garry wrote:
> Hi Marc,
>
>> Absolutely. Life has got in the way, so let me page it back in...
>
> Great
>
>>>
>>> [PATCH 2/2] irqchip/gic-v3-its: Handle no overlap of non-managed irq
>>> affinity mask
>>>
>>> In selecting the target CPU for a non-managed interrupt, we may
>>> select
>>> a
>>> target CPU outside the requested affinity mask.
>>>
>>> This is because there may be no overlap of the ITS node mask and the
>>> requested CPU affinity mask. The requested affinity mask may be
>>> coming
>>> from userspace or some drivers which try to set irq affinity, see
>>> [0].
>>>
>>> In this case, just ignore the ITS node cpumask. This is a deviation
>>> from
>>> what Thomas described. Having said that, I am not sure if the
>>> interrupt is ever bound to a node for us.
>>>
>>> [0]
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/perf/hisilicon/hisi_uncore_pmu.c#n417
>>>
>>> ---
>>>  drivers/irqchip/irq-gic-v3-its.c | 4 ----
>>>  1 file changed, 4 deletions(-)
>>>
>>> diff --git a/drivers/irqchip/irq-gic-v3-its.c
>>> b/drivers/irqchip/irq-gic-v3-its.c
>>> index 2b18feb..12d5d4b4 100644
>>> --- a/drivers/irqchip/irq-gic-v3-its.c
>>> +++ b/drivers/irqchip/irq-gic-v3-its.c
>>> @@ -1584,10 +1584,6 @@ static int its_select_cpu(struct irq_data *d,
>>> 		cpumask_and(tmpmask, cpumask_of_node(node), aff_mask);
>>> 		cpumask_and(tmpmask, tmpmask, cpu_online_mask);
>>>
>>> -		/* If that doesn't work, try the nodemask itself */
>>> -		if (cpumask_empty(tmpmask))
>>> -			cpumask_and(tmpmask, cpumask_of_node(node), cpu_online_mask);
>>> -
>>> 		cpu = cpumask_pick_least_loaded(d, tmpmask);
>>> 		if (cpu < nr_cpu_ids)
>>> 			goto out;
>>
>> I'm really not sure. Shouldn't we then drop the wider search on
>> cpu_online_mask, because userspace could have given us something
>> that we cannot deal with?
>
> It's not just userspace. Some drivers call irq_set_affinity{_hint}()
> also, with a non-overlapping affinity mask.
>
> We could just error these requests, but some drivers rely on this
> behavior. Consider the uncore driver I mentioned above, which WARNs
> when the affinity setting fails. So it tries to set the affinity with
> the cpumask of the cluster associated with the device, but with D06's
> ITS config, below, there may be no overlap.

Does this PMU use the ITS? That's a pretty odd setup.

So this is a case where the device has an implicit affinity that
isn't that of the ITS. Huhu...

>>
>> What you are advocating for is a strict adherence to the provided
>> mask, and it doesn't seem to be what other architectures are
>> providing.
>> I consider the userspace-provided affinity as a hint more than
>> anything
>> else, as in this case the kernel does know better (routing the
>> interrupt
>> to a foreign node might be costly, or even impossible, see the TX1
>> erratum).
>
> Right
>
>>
>> From what I remember of the earlier discussion, you saw an issue on
>> a system with two sockets and a single ITS, with the node mask limited
>> to the first socket. Is that correct?
>
> A bit more complicated: 2 sockets, 2 NUMA nodes per socket, and ITS
> config as follows:
>
> D06ES 1x ITS with proximity node #0
>
> root@(none)$ dmesg | grep ITS
> [    0.000000] SRAT: PXM 0 -> ITS 0 -> Node 0
>
>
> D06CS
> 2x ITS with proximity node #0, #2
>
> estuary:/$ dmesg | grep ITS
> [    0.000000] SRAT: PXM 0 -> ITS 0 -> Node 0
> [    0.000000] SRAT: PXM 2 -> ITS 1 -> Node 2
>
> It complicates things.
>
> We could add extra intelligence to record if a node has an ITS
> associated. In the case of that not being true, we would fallback on
> the requested affin only (for case of no overlap). It gets a bit more
> messy then.

It looks like part of the problem is that we can't reliably describe
an ITS affine to multiple NUMA nodes... If we could describe that, then
the above situation wouldn't occur (we'd say that ITS-0 covers both
nodes 0 and 1). But I can't find a way to express this with SRAT and
_PXM. Also, SRAT describes the affinity of the ITS with memory, and not
with the CPUs... It is all a bit fsck'd. :-(

I guess I'll apply your change for now with a comment explaining the
situation.

M.
On 15/05/2020 16:37, Marc Zyngier wrote:

Hi Marc,

>>>
>>
>> It's not just userspace. Some drivers call irq_set_affinity{_hint}()
>> also, with a non-overlapping affinity mask.
>>
>> We could just error these requests, but some drivers rely on this
>> behavior. Consider the uncore driver I mentioned above, which WARNs
>> when the affinity setting fails. So it tries to set the affinity with
>> the cpumask of the cluster associated with the device, but with D06's
>> ITS config, below, there may be no overlap.
>
> Does this PMU use the ITS? That's a pretty odd setup.
>

Yes, I think so. Our old friend, mbigen...

> So this is a case where the device has an implicit affinity that
> isn't that of the ITS. Huhu...
>
>>>
>>> What you are advocating for is a strict adherence to the provided
>>> mask, and it doesn't seem to be what other architectures are providing.
>>> I consider the userspace-provided affinity as a hint more than anything
>>> else, as in this case the kernel does know better (routing the interrupt
>>> to a foreign node might be costly, or even impossible, see the TX1
>>> erratum).
>>
>> Right
>>
>>>
>>> From what I remember of the earlier discussion, you saw an issue on
>>> a system with two sockets and a single ITS, with the node mask limited
>>> to the first socket. Is that correct?
>>
>> A bit more complicated: 2 sockets, 2 NUMA nodes per socket, and ITS
>> config as follows:
>> D06ES 1x ITS with proximity node #0
>>
>> root@(none)$ dmesg | grep ITS
>> [    0.000000] SRAT: PXM 0 -> ITS 0 -> Node 0
>>
>>
>> D06CS
>> 2x ITS with proximity node #0, #2
>>
>> estuary:/$ dmesg | grep ITS
>> [    0.000000] SRAT: PXM 0 -> ITS 0 -> Node 0
>> [    0.000000] SRAT: PXM 2 -> ITS 1 -> Node 2
>>
>> It complicates things.
>>
>> We could add extra intelligence to record if a node has an ITS
>> associated. In the case of that not being true, we would fallback on
>> the requested affin only (for case of no overlap). It gets a bit more
>> messy then.
>
> It looks like part of the problem is that we can't reliably describe
> an ITS affine to multiple NUMA nodes... If we could describe that, then
> the above situation wouldn't occur (we'd say that ITS-0 covers both
> nodes 0 and 1). But I can't find a way to express this with SRAT and
> _PXM. Also, SRAT describes the affinity of the ITS with memory, and not
> with the CPUs... It is all a bit fsck'd. :-(

Yeah. We could try to deal with those excluded CPUs, not within an ITS
node cpumask, by trying to find the "closest" ITS, and expanding the
cpumask of that ITS to cover those. Not sure if it's worth it. But this
sort of problem has always cropped up, with no clear way to describe our
topology exactly.

>
> I guess I'll apply your change for now with a comment explaining the
> situation.
>

Cheers,
John
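To make the "closest ITS" idea John floats here concrete, one possible shape — entirely hypothetical, nothing like this exists in the driver, and the nodemask of ITS-bearing nodes is assumed to be available from firmware-table parsing:

```c
#include <linux/kernel.h>
#include <linux/nodemask.h>
#include <linux/topology.h>

/*
 * For a node with no local ITS, pick the ITS-bearing node with the
 * smallest NUMA distance; that ITS's cpumask could then be widened to
 * cover the otherwise-excluded CPUs.
 */
static int sketch_closest_its_node(int node, nodemask_t its_nodes)
{
	int its_node, best = NUMA_NO_NODE, best_dist = INT_MAX;

	for_each_node_mask(its_node, its_nodes) {
		int dist = node_distance(node, its_node);

		if (dist < best_dist) {
			best_dist = dist;
			best = its_node;
		}
	}

	return best;
}
```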