[v3,0/2] irqchip/gic-v3-its: Balance LPI affinity across CPUs

Message ID: 20200316115433.9017-1-maz@kernel.org

Message

Marc Zyngier March 16, 2020, 11:54 a.m. UTC
When mapping an LPI, the ITS driver picks the first possible
affinity, which is in most cases CPU0, assuming that if
that's not suitable, someone will come and set the affinity
to something more interesting.

It apparently isn't the case, and people complain of poor
performance when many interrupts are glued to the same CPU.
So let's place the interrupts by finding the "least loaded"
CPU (that is, the one that has the fewest LPIs mapped to it).
So-called 'managed' interrupts are an interesting case where
the affinity is actually dictated by the kernel itself, and
we should honor this.
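
For illustration, a simplified sketch of the per-CPU accounting and
"least loaded" pick described above. The helper and structure names
mirror those visible in the diffs later in the thread, but the bodies
here are an approximation, not the actual patches:

/* Kernel context assumed: linux/irq.h, linux/percpu.h, linux/atomic.h */
struct cpu_lpi_count {
	atomic_t	managed;
	atomic_t	unmanaged;
};

static DEFINE_PER_CPU(struct cpu_lpi_count, cpu_lpi_count);

/* Pick the CPU in @cpu_mask with the fewest LPIs of the relevant kind */
static unsigned int cpumask_pick_least_loaded(struct irq_data *d,
					      const struct cpumask *cpu_mask)
{
	unsigned int cpu = nr_cpu_ids, tmp;
	int count = INT_MAX;

	for_each_cpu(tmp, cpu_mask) {
		int this_count;

		/* Managed and unmanaged interrupts are tracked separately */
		if (irqd_affinity_is_managed(d))
			this_count = atomic_read(&per_cpu(cpu_lpi_count, tmp).managed);
		else
			this_count = atomic_read(&per_cpu(cpu_lpi_count, tmp).unmanaged);

		if (this_count < count) {
			cpu = tmp;
			count = this_count;
		}
	}

	return cpu;
}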

* From v2:
  - Split accounting from CPU selection
  - Track managed and unmanaged interrupts separately

Marc Zyngier (2):
  irqchip/gic-v3-its: Track LPI distribution on a per CPU basis
  irqchip/gic-v3-its: Balance initial LPI affinity across CPUs

 drivers/irqchip/irq-gic-v3-its.c | 153 +++++++++++++++++++++++++------
 1 file changed, 127 insertions(+), 26 deletions(-)

Comments

John Garry March 19, 2020, 12:31 p.m. UTC | #1
On 16/03/2020 11:54, Marc Zyngier wrote:
> When mapping an LPI, the ITS driver picks the first possible
> affinity, which is in most cases CPU0, assuming that if
> that's not suitable, someone will come and set the affinity
> to something more interesting.
> 
> It apparently isn't the case, and people complain of poor
> performance when many interrupts are glued to the same CPU.
> So let's place the interrupts by finding the "least loaded"
> CPU (that is, the one that has the fewest LPIs mapped to it).
> So-called 'managed' interrupts are an interesting case where
> the affinity is actually dictated by the kernel itself, and
> we should honor this.
> 
> * From v2:
>    - Split accounting from CPU selection
>    - Track managed and unmanaged interrupts separately
> 
> Marc Zyngier (2):
>    irqchip/gic-v3-its: Track LPI distribution on a per CPU basis
>    irqchip/gic-v3-its: Balance initial LPI affinity across CPUs
> 
>   drivers/irqchip/irq-gic-v3-its.c | 153 +++++++++++++++++++++++++------
>   1 file changed, 127 insertions(+), 26 deletions(-)
> 

Hi Marc,

Initial results look good. We have 3x NVMe drives now, as opposed to 2x 
previously, which is better for this test.

Before: ~1.3M IOPs fio read
After: ~1.8M IOPs fio read

So a gain of roughly 40% in throughput.

We also did try NVMe with nvme.use_threaded_interrupts=1. As you may 
remember, NVMe interrupt handling can cause lockups, as it handles all 
completions in interrupt context by default.

Before: ~1.2M IOPs fio read
After: ~1.2M IOPs fio read

So they were about the same. I would have hoped for an improvement here, 
considering that, before, all the per-queue threaded handlers would have 
been running on the single CPU handling the hard irq.

But we will retest all this tomorrow, so please consider these 
provisional for now.

Thanks to Luo Jiaxing for testing.

Cheers,
John
John Garry March 27, 2020, 5:47 p.m. UTC | #2
> Before: ~1.2M IOPs fio read
> After: ~1.2M IOPs fio read
> 
> So they were about the same. I would have hoped for an improvement here, 
> considering that, before, all the per-queue threaded handlers would have 
> been running on the single CPU handling the hard irq.
> 
> But we will retest all this tomorrow, so please consider these 
> provisional for now.
> 
> Thanks to Luo Jiaxing for testing.

Hi Marc,

Just to let you know that we're still looking at this. It turns out that 
we're not getting the results we had hoped for, and the previous results 
were incorrect due to a mistake in the test setup.

I also think that the SMMU may even be creating a performance ceiling. I 
need to check this more, as I can't seem to get the throughput above a 
certain level.

I realise that we're so late in the cycle that there is now no immediate 
rush.

But I would also like to report some other unexpected behaviour for 
managed interrupts in this series - I'll reply directly to the specific 
patch for that.

Much appreciated,
John
John Garry April 1, 2020, 11:33 a.m. UTC | #3
Hi Marc,

> But I would also like to report some other unexpected behaviour for 
> managed interrupts in this series - I'll reply directly to the specific 
> patch for that.
> 

So I made this change:

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index 9199fb53c75c..ebbfc8d44d35 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -1539,6 +1539,8 @@ static int its_set_affinity(struct irq_data *d, const struct cpumask *mask_val,
         if (irqd_is_forwarded_to_vcpu(d))
                 return -EINVAL;

+       its_dec_lpi_count(d, its_dev->event_map.col_map[id]);
+
         if (!force)
                 cpu = its_select_cpu(d, mask_val);
         else
@@ -1549,14 +1551,14 @@ static int its_set_affinity(struct irq_data *d, const struct cpumask *mask_val,

         /* don't set the affinity when the target cpu is same as current one */
         if (cpu != its_dev->event_map.col_map[id]) {
-               its_inc_lpi_count(d, cpu);
-               its_dec_lpi_count(d, its_dev->event_map.col_map[id]);
                 target_col = &its_dev->its->collections[cpu];
                 its_send_movi(its_dev, target_col, id);
                 its_dev->event_map.col_map[id] = cpu;
                 irq_data_update_effective_affinity(d, cpumask_of(cpu));
         }

+       its_inc_lpi_count(d, cpu);
+
         return IRQ_SET_MASK_OK_DONE;
  }

Results look ok:
        nvme.use_threaded_interrupts=1   nvme.use_threaded_interrupts=0*
Before  950K IOPs                        1000K IOPs
After   1100K IOPs                       1150K IOPs

* as mentioned before, this is quite unstable and causes lockups. JFYI, 
there was an attempt to fix this:

https://lore.kernel.org/linux-nvme/20191209175622.1964-1-kbusch@kernel.org/

Thanks,
John
John Garry May 14, 2020, 12:05 p.m. UTC | #4
> 
> +       its_inc_lpi_count(d, cpu);
> +
>          return IRQ_SET_MASK_OK_DONE;
>   }
> 
> Results look ok:
>      nvme.use_threaded_interrupts=1    =0*
> Before    950K IOPs            1000K IOPs
> After    1100K IOPs            1150K IOPs
> 
> * as mentioned before, this is quite unstable and causes lockups. JFYI, 
> there was an attempt to fix this:
> 
> https://lore.kernel.org/linux-nvme/20191209175622.1964-1-kbusch@kernel.org/
> 

Hi Marc,

Just wondering if we can try to get this series over the line?

So I tested the patches on v5.7-rc5, and got a similar performance 
improvement to the above.

I did apply a couple of patches, below, to remedy the issues I 
experienced for my D06CS.

Thanks,
John


---->8


[PATCH 1/2] irqchip/gic-v3-its: Don't double account for target CPU
  assignment

In its_set_affinity(), when a managed irq is already assigned to a CPU,
we may needlessly reassign the irq to another CPU.

This is because when selecting the target CPU (the least loaded CPU in
the mask), we still account for that irq being assigned to a CPU, and so
we may unfairly select another CPU.

Modify this behaviour to pre-decrement the current target CPU LPI count
when finding the least loaded CPU.

Alternatively we may be able to just bail out early when the current 
target CPU already falls within the requested mask.

---
  drivers/irqchip/irq-gic-v3-its.c | 6 ++++--
  1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index 73f5c12..2b18feb 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -1636,6 +1636,8 @@ static int its_set_affinity(struct irq_data *d, const struct cpumask *mask_val,
  	if (irqd_is_forwarded_to_vcpu(d))
  		return -EINVAL;

+	its_dec_lpi_count(d, its_dev->event_map.col_map[id]);
+
  	if (!force)
  		cpu = its_select_cpu(d, mask_val);
  	else
@@ -1646,14 +1648,14 @@ static int its_set_affinity(struct irq_data *d, const struct cpumask *mask_val,

  	/* don't set the affinity when the target cpu is same as current one */
  	if (cpu != its_dev->event_map.col_map[id]) {
-		its_inc_lpi_count(d, cpu);
-		its_dec_lpi_count(d, its_dev->event_map.col_map[id]);
  		target_col = &its_dev->its->collections[cpu];
  		its_send_movi(its_dev, target_col, id);
  		its_dev->event_map.col_map[id] = cpu;
  		irq_data_update_effective_affinity(d, cpumask_of(cpu));
  	}

+	its_inc_lpi_count(d, cpu);
+
  	return IRQ_SET_MASK_OK_DONE;
  }

---
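
For reference, a hedged sketch of the alternative floated in the commit
message above (bailing out early when the current target CPU already
falls within the requested mask). This is only an illustration of the
idea, not a tested change; the helper below is hypothetical and reuses
its_get_event_id()/col_map as seen elsewhere in the driver:

static bool its_affinity_already_satisfied(struct irq_data *d,
					   const struct cpumask *mask_val)
{
	/* Hypothetical helper for its_set_affinity() */
	struct its_device *its_dev = irq_data_get_irq_chip_data(d);
	u32 id = its_get_event_id(d);

	/* Current target already inside the requested mask: nothing to do */
	return cpumask_test_cpu(its_dev->event_map.col_map[id], mask_val);
}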


[PATCH 2/2] irqchip/gic-v3-its: Handle no overlap of non-managed irq
  affinity mask

In selecting the target CPU for a non-managed interrupt, we may select a
target CPU outside the requested affinity mask.

This is because there may be no overlap of the ITS node mask and the
requested CPU affinity mask. The requested affinity mask may be coming 
from userspace or some drivers which try to set irq affinity, see [0].

In this case, just ignore the ITS node cpumask. This is a deviation from
what Thomas described. Having said that, I am not sure if the interrupt 
is ever bound to a node for us.

[0] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/perf/hisilicon/hisi_uncore_pmu.c#n417

---
  drivers/irqchip/irq-gic-v3-its.c | 4 ----
  1 file changed, 4 deletions(-)

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index 2b18feb..12d5d4b4 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -1584,10 +1584,6 @@ static int its_select_cpu(struct irq_data *d,
  			cpumask_and(tmpmask, cpumask_of_node(node), aff_mask);
  			cpumask_and(tmpmask, tmpmask, cpu_online_mask);

-			/* If that doesn't work, try the nodemask itself */
-			if (cpumask_empty(tmpmask))
-				cpumask_and(tmpmask, cpumask_of_node(node), cpu_online_mask);
-
  			cpu = cpumask_pick_least_loaded(d, tmpmask);
  			if (cpu < nr_cpu_ids)
  				goto out;
Marc Zyngier May 15, 2020, 10:14 a.m. UTC | #5
Hi John,

On 2020-05-14 13:05, John Garry wrote:
>> 
>> +       its_inc_lpi_count(d, cpu);
>> +
>>          return IRQ_SET_MASK_OK_DONE;
>>   }
>> 
>> Results look ok:
>>      nvme.use_threaded_interrupts=1    =0*
>> Before    950K IOPs            1000K IOPs
>> After    1100K IOPs            1150K IOPs
>> 
>> * as mentioned before, this is quite unstable and causes lockups. 
>> JFYI, there was an attempt to fix this:
>> 
>> https://lore.kernel.org/linux-nvme/20191209175622.1964-1-kbusch@kernel.org/
>> 
> 
> Hi Marc,
> 
> Just wondering if we can try to get this series over the line?

Absolutely. Life has got in the way, so let me page it back in...

> So I tested the patches on v5.7-rc5, and got a similar performance
> improvement to the above.
> 
> I did apply a couple of patches, below, to remedy the issues I
> experienced for my D06CS.

Comments on that below.

> 
> Thanks,
> John
> 
> 
> ---->8
> 
> 
> [PATCH 1/2] irqchip/gic-v3-its: Don't double account for target CPU
>  assignment
> 
> In its_set_affinity(), when a managed irq is already assigned to a CPU,
> we may needlessly reassign the irq to another CPU.
> 
> This is because when selecting the target CPU (the least loaded CPU in
> the mask), we still account for that irq being assigned to a CPU, and so
> we may unfairly select another CPU.
> 
> Modify this behaviour to pre-decrement the current target CPU LPI count
> when finding the least loaded CPU.
> 
> Alternatively we may be able to just bail out early when the current
> target CPU already falls within the requested mask.
> 
> ---
>  drivers/irqchip/irq-gic-v3-its.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/irqchip/irq-gic-v3-its.c 
> b/drivers/irqchip/irq-gic-v3-its.c
> index 73f5c12..2b18feb 100644
> --- a/drivers/irqchip/irq-gic-v3-its.c
> +++ b/drivers/irqchip/irq-gic-v3-its.c
> @@ -1636,6 +1636,8 @@ static int its_set_affinity(struct irq_data *d,
> const struct cpumask *mask_val,
>  	if (irqd_is_forwarded_to_vcpu(d))
>  		return -EINVAL;
> 
> +	its_dec_lpi_count(d, its_dev->event_map.col_map[id]);
> +
>  	if (!force)
>  		cpu = its_select_cpu(d, mask_val);
>  	else
> @@ -1646,14 +1648,14 @@ static int its_set_affinity(struct irq_data
> *d, const struct cpumask *mask_val,
> 
>  	/* don't set the affinity when the target cpu is same as current one 
> */
>  	if (cpu != its_dev->event_map.col_map[id]) {
> -		its_inc_lpi_count(d, cpu);
> -		its_dec_lpi_count(d, its_dev->event_map.col_map[id]);
>  		target_col = &its_dev->its->collections[cpu];
>  		its_send_movi(its_dev, target_col, id);
>  		its_dev->event_map.col_map[id] = cpu;
>  		irq_data_update_effective_affinity(d, cpumask_of(cpu));
>  	}
> 
> +	its_inc_lpi_count(d, cpu);
> +

I'm OK with that change, as it removes unnecessary churn.

>  	return IRQ_SET_MASK_OK_DONE;
>  }
> 
> ---
> 
> 
> [PATCH 2/2] irqchip/gic-v3-its: Handle no overlap of non-managed irq
>  affinity mask
> 
> In selecting the target CPU for a non-managed interrupt, we may select 
> a
> target CPU outside the requested affinity mask.
> 
> This is because there may be no overlap of the ITS node mask and the
> requested CPU affinity mask. The requested affinity mask may be coming
> from userspace or some drivers which try to set irq affinity, see [0].
> 
> In this case, just ignore the ITS node cpumask. This is a deviation 
> from
> what Thomas described. Having said that, I am not sure if the
> interrupt is ever bound to a node for us.
> 
> [0] 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/perf/hisilicon/hisi_uncore_pmu.c#n417
> 
> ---
>  drivers/irqchip/irq-gic-v3-its.c | 4 ----
>  1 file changed, 4 deletions(-)
> 
> diff --git a/drivers/irqchip/irq-gic-v3-its.c 
> b/drivers/irqchip/irq-gic-v3-its.c
> index 2b18feb..12d5d4b4 100644
> --- a/drivers/irqchip/irq-gic-v3-its.c
> +++ b/drivers/irqchip/irq-gic-v3-its.c
> @@ -1584,10 +1584,6 @@ static int its_select_cpu(struct irq_data *d,
>  			cpumask_and(tmpmask, cpumask_of_node(node), aff_mask);
>  			cpumask_and(tmpmask, tmpmask, cpu_online_mask);
> 
> -			/* If that doesn't work, try the nodemask itself */
> -			if (cpumask_empty(tmpmask))
> -				cpumask_and(tmpmask, cpumask_of_node(node), cpu_online_mask);
> -
>  			cpu = cpumask_pick_least_loaded(d, tmpmask);
>  			if (cpu < nr_cpu_ids)
>  				goto out;

I'm really not sure. Shouldn't we then drop the wider search on
cpu_online_mask, because userspace could have given us something
that we cannot deal with?

What you are advocating for is a strict adherence to the provided
mask, and it doesn't seem to be what other architectures are providing.
I consider the userspace-provided affinity as a hint more than anything
else, as in this case the kernel does know better (routing the interrupt
to a foreign node might be costly, or even impossible, see the TX1
erratum).

From what I remember of the earlier discussion, you saw an issue on
a system with two sockets and a single ITS, with the node mask limited
to the first socket. Is that correct?

I'll respin the series today and repost it with your first patch
squashed in.

Thanks,

         M.
John Garry May 15, 2020, 11:50 a.m. UTC | #6
Hi Marc,

> Absolutely. Life has got in the way, so let me page it back in...

Great

>>
>> [PATCH 2/2] irqchip/gic-v3-its: Handle no overlap of non-managed irq
>>   affinity mask
>>
>> In selecting the target CPU for a non-managed interrupt, we may select
>> a
>> target CPU outside the requested affinity mask.
>>
>> This is because there may be no overlap of the ITS node mask and the
>> requested CPU affinity mask. The requested affinity mask may be coming
>> from userspace or some drivers which try to set irq affinity, see [0].
>>
>> In this case, just ignore the ITS node cpumask. This is a deviation
>> from
>> what Thomas described. Having said that, I am not sure if the
>> interrupt is ever bound to a node for us.
>>
>> [0]
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/perf/hisilicon/hisi_uncore_pmu.c#n417
>>
>> ---
>>   drivers/irqchip/irq-gic-v3-its.c | 4 ----
>>   1 file changed, 4 deletions(-)
>>
>> diff --git a/drivers/irqchip/irq-gic-v3-its.c
>> b/drivers/irqchip/irq-gic-v3-its.c
>> index 2b18feb..12d5d4b4 100644
>> --- a/drivers/irqchip/irq-gic-v3-its.c
>> +++ b/drivers/irqchip/irq-gic-v3-its.c
>> @@ -1584,10 +1584,6 @@ static int its_select_cpu(struct irq_data *d,
>>   			cpumask_and(tmpmask, cpumask_of_node(node), aff_mask);
>>   			cpumask_and(tmpmask, tmpmask, cpu_online_mask);
>>
>> -			/* If that doesn't work, try the nodemask itself */
>> -			if (cpumask_empty(tmpmask))
>> -				cpumask_and(tmpmask, cpumask_of_node(node), cpu_online_mask);
>> -
>>   			cpu = cpumask_pick_least_loaded(d, tmpmask);
>>   			if (cpu < nr_cpu_ids)
>>   				goto out;
> 
> I'm really not sure. Shouldn't we then drop the wider search on
> cpu_online_mask, because userspace could have given us something
> that we cannot deal with?

It's not just userspace. Some drivers call irq_set_affinity{_hint}() 
also, with a non-overlapping affinity mask.

We could just error these requests, but some drivers rely on this 
behavior. Consider the uncore driver I mentioned above, which WARNs when 
the affinity setting fails. So it tries to set the affinity with the 
cpumask of the cluster associated with the device, but with D06's ITS 
config, below, there may be no overlap.
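
To illustrate the kind of in-kernel caller being discussed, a
hypothetical driver pattern (not the actual hisi_uncore_pmu code) along
these lines runs into the no-overlap case:

static void example_pmu_init_irq_affinity(struct device *dev, int irq)
{
	/* Hypothetical: pin the IRQ to the device's own node/cluster */
	const struct cpumask *mask = cpumask_of_node(dev_to_node(dev));

	/*
	 * Erroring out on a non-overlapping mask would trip this WARN;
	 * silently preferring the ITS node mask instead would place the
	 * IRQ outside the mask the driver asked for.
	 */
	WARN_ON(irq_set_affinity(irq, mask));
}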

> 
> What you are advocating for is a strict adherence to the provided
> mask, and it doesn't seem to be what other architectures are providing.
> I consider the userspace-provided affinity as a hint more than anything
> else, as in this case the kernel does know better (routing the interrupt
> to a foreign node might be costly, or even impossible, see the TX1
> erratum).

Right

> 
>   From what I remember of the earlier discussion, you saw an issue on
> a system with two sockets and a single ITS, with the node mask limited
> to the first socket. Is that correct?

A bit more complicated: 2 sockets, 2 NUMA nodes per socket, and ITS 
config as follows:
D06ES  1x ITS with proximity node #0

root@(none)$ dmesg | grep ITS
[ 0.000000] SRAT: PXM 0 -> ITS 0 -> Node 0


D06CS
2x ITS with proximity node #0, #2

estuary:/$ dmesg | grep ITS
[    0.000000] SRAT: PXM 0 -> ITS 0 -> Node 0
[    0.000000] SRAT: PXM 2 -> ITS 1 -> Node 2

It complicates things.

We could add extra intelligence to record whether a node has an ITS 
associated. In the case of that not being true, we would fall back on the 
requested affinity only (for the case of no overlap). It gets a bit more 
messy then.
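
A rough sketch of that extra logic, purely as an illustration (the
node-to-ITS bookkeeping and the helper below are hypothetical, not part
of the series):

/* Hypothetical: bits set for nodes that actually have an ITS,
 * populated as each ITS is probed from its SRAT proximity domain.
 */
static nodemask_t its_node_mask;

static void its_build_candidate_mask(int node, const struct cpumask *aff_mask,
				     struct cpumask *tmpmask)
{
	if (node != NUMA_NO_NODE && node_isset(node, its_node_mask))
		/* Node has an ITS: prefer CPUs local to it */
		cpumask_and(tmpmask, cpumask_of_node(node), aff_mask);
	else
		/* No local ITS: honour the requested affinity alone */
		cpumask_copy(tmpmask, aff_mask);

	cpumask_and(tmpmask, tmpmask, cpu_online_mask);
}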

Thanks,
John
Marc Zyngier May 15, 2020, 3:37 p.m. UTC | #7
On 2020-05-15 12:50, John Garry wrote:
> Hi Marc,
> 
>> Absolutely. Life has got in the way, so let me page it back in...
> 
> Great
> 
>>> 
>>> [PATCH 2/2] irqchip/gic-v3-its: Handle no overlap of non-managed irq
>>>   affinity mask
>>> 
>>> In selecting the target CPU for a non-managed interrupt, we may 
>>> select
>>> a
>>> target CPU outside the requested affinity mask.
>>> 
>>> This is because there may be no overlap of the ITS node mask and the
>>> requested CPU affinity mask. The requested affinity mask may be 
>>> coming
>>> from userspace or some drivers which try to set irq affinity, see 
>>> [0].
>>> 
>>> In this case, just ignore the ITS node cpumask. This is a deviation
>>> from
>>> what Thomas described. Having said that, I am not sure if the
>>> interrupt is ever bound to a node for us.
>>> 
>>> [0]
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/perf/hisilicon/hisi_uncore_pmu.c#n417
>>> 
>>> ---
>>>   drivers/irqchip/irq-gic-v3-its.c | 4 ----
>>>   1 file changed, 4 deletions(-)
>>> 
>>> diff --git a/drivers/irqchip/irq-gic-v3-its.c
>>> b/drivers/irqchip/irq-gic-v3-its.c
>>> index 2b18feb..12d5d4b4 100644
>>> --- a/drivers/irqchip/irq-gic-v3-its.c
>>> +++ b/drivers/irqchip/irq-gic-v3-its.c
>>> @@ -1584,10 +1584,6 @@ static int its_select_cpu(struct irq_data *d,
>>>   			cpumask_and(tmpmask, cpumask_of_node(node), aff_mask);
>>>   			cpumask_and(tmpmask, tmpmask, cpu_online_mask);
>>> 
>>> -			/* If that doesn't work, try the nodemask itself */
>>> -			if (cpumask_empty(tmpmask))
>>> -				cpumask_and(tmpmask, cpumask_of_node(node), cpu_online_mask);
>>> -
>>>   			cpu = cpumask_pick_least_loaded(d, tmpmask);
>>>   			if (cpu < nr_cpu_ids)
>>>   				goto out;
>> 
>> I'm really not sure. Shouldn't we then drop the wider search on
>> cpu_online_mask, because userspace could have given us something
>> that we cannot deal with?
> 
> It's not just userspace. Some drivers call irq_set_affinity{_hint}()
> also, with a non-overlapping affinity mask.
> 
> We could just error these requests, but some drivers rely on this
> behavior. Consider the uncore driver I mentioned above, which WARNs
> when the affinity setting fails. So it tries to set the affinity with
> the cpumask of the cluster associated with the device, but with D06's
> ITS config, below, there may be no overlap.

Does this PMU use the ITS? That's a pretty odd setup.

So this is a case where the device has an implicit affinity that
isn't that of the ITS. Huhu...

>> 
>> What you are advocating for is a strict adherence to the provided
>> mask, and it doesn't seem to be what other architectures are 
>> providing.
>> I consider the userspace-provided affinity as a hint more than 
>> anything
>> else, as in this case the kernel does know better (routing the 
>> interrupt
>> to a foreign node might be costly, or even impossible, see the TX1
>> erratum).
> 
> Right
> 
>> 
>>   From what I remember of the earlier discussion, you saw an issue on
>> a system with two sockets and a single ITS, with the node mask limited
>> to the first socket. Is that correct?
> 
> A bit more complicated: 2 sockets, 2 NUMA nodes per socket, and ITS
> config as follows:
> D06ES  1x ITS with proximity node #0
> 
> root@(none)$ dmesg | grep ITS
> [ 0.000000] SRAT: PXM 0 -> ITS 0 -> Node 0
> 
> 
> D06CS
> 2x ITS with proximity node #0, #2
> 
> estuary:/$ dmesg | grep ITS
> [    0.000000] SRAT: PXM 0 -> ITS 0 -> Node 0
> [    0.000000] SRAT: PXM 2 -> ITS 1 -> Node 2
> 
> It complicates things.
> 
> We could add extra intelligence to record whether a node has an ITS
> associated. In the case of that not being true, we would fall back on
> the requested affinity only (for the case of no overlap). It gets a bit
> more messy then.

It looks like part of the problem is that we can't reliably describe
an ITS affine to multiple NUMA nodes... If we could describe that, then
the above situation wouldn't occur (we'd say that ITS-0 covers both
nodes 0 and 1). But I can't find a way to express this with SRAT and
_PXM. Also, SRAT describes the affinity of the ITS with memory, and not
with the CPUs... It is all a bit fsck'd. :-(

I guess I'll apply your change for now with a comment explaining the
situation.

         M.
John Garry May 15, 2020, 4:15 p.m. UTC | #8
On 15/05/2020 16:37, Marc Zyngier wrote:

Hi Marc,

>>>
>>
>> It's not just userspace. Some drivers call irq_set_affinity{_hint}()
>> also, with a non-overlapping affinity mask.
>>
>> We could just error these requests, but some drivers rely on this
>> behavior. Consider the uncore driver I mentioned above, which WARNs
>> when the affinity setting fails. So it tries to set the affinity with
>> the cpumask of the cluster associated with the device, but with D06's
>> ITS config, below, there may be no overlap.
> 
> Does this PMU use the ITS? That's a pretty odd setup.
> 

Yes, I think so. Our old friend, mbigen...

> So this is a case where the device has an implicit affinity that
> isn't that of the ITS. Huhu...
> 
>>>
>>> What you are advocating for is a strict adherence to the provided
>>> mask, and it doesn't seem to be what other architectures are providing.
>>> I consider the userspace-provided affinity as a hint more than anything
>>> else, as in this case the kernel does know better (routing the interrupt
>>> to a foreign node might be costly, or even impossible, see the TX1
>>> erratum).
>>
>> Right
>>
>>>
>>>   From what I remember of the earlier discussion, you saw an issue on
>>> a system with two sockets and a single ITS, with the node mask limited
>>> to the first socket. Is that correct?
>>
>> A bit more complicated: 2 sockets, 2 NUMA nodes per socket, and ITS
>> config as follows:
>> D06ES  1x ITS with proximity node #0
>>
>> root@(none)$ dmesg | grep ITS
>> [ 0.000000] SRAT: PXM 0 -> ITS 0 -> Node 0
>>
>>
>> D06CS
>> 2x ITS with proximity node #0, #2
>>
>> estuary:/$ dmesg | grep ITS
>> [    0.000000] SRAT: PXM 0 -> ITS 0 -> Node 0
>> [    0.000000] SRAT: PXM 2 -> ITS 1 -> Node 2
>>
>> It complicates things.
>>
>> We could add extra intelligence to record whether a node has an ITS
>> associated. In the case of that not being true, we would fall back on
>> the requested affinity only (for the case of no overlap). It gets a bit
>> more messy then.
> 
> It looks like part of the problem is that we can't reliably describe
> an ITS affine to multiple NUMA nodes... If we could describe that, then
> the above situation wouldn't occur (we'd say that ITS-0 covers both
> nodes 0 and 1). But I can't find a way to express this with SRAT and
> _PXM. Also, SRAT describes the affinity of the ITS with memory, and not
> with the CPUs... It is all a bit fsck'd. :-(

Yeah.

We could try to deal with those excluded CPUs (the ones not within any 
ITS node cpumask) by trying to find the "closest" ITS and expanding that 
ITS's cpumask to cover them. Not sure if it's worth it.

But this sort of problem has always cropped up, with no clear way to 
describe our topology exactly.

> 
> I guess I'll apply your change for now with a comment explaining the
> situation.
> 

Cheers,
John