
irq-gic: target all CPUs selected in interrupt affinity settings

Message ID CAPaFbat4MM0=iVB-VazTK9=2qRebAgEN4euYCTESRo3yfx75Kw@mail.gmail.com (mailing list archive)
State New, archived
Series irq-gic: target all CPUs selected in interrupt affinity settings

Commit Message

Leonid Movshovich Nov. 19, 2019, 11:12 p.m. UTC
Until now, only one CPU from the affinity mask was selected. This
resulted in all interrupts being processed by CPU0 by default, despite
the default "FF" affinity setting for all interrupts.
---
 drivers/irqchip/irq-gic.c | 27 ++++++++++++++++-----------
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c
index 30ab62334..e6c6451ea 100644
--- a/drivers/irqchip/irq-gic.c
+++ b/drivers/irqchip/irq-gic.c
@@ -331,25 +331,30 @@ static int gic_set_affinity(struct irq_data *d, const struct cpumask *mask_val,
 {
 	void __iomem *reg = gic_dist_base(d) + GIC_DIST_TARGET + (gic_irq(d) & ~3);
 	unsigned int cpu, shift = (gic_irq(d) % 4) * 8;
-	u32 val, mask, bit;
+	u32 val, reg_mask, bits = 0;
 	unsigned long flags;
+	const struct cpumask* cpu_mask;
 
-	if (!force)
-		cpu = cpumask_any_and(mask_val, cpu_online_mask);
+	if (force)
+		cpu_mask = mask_val;
 	else
-		cpu = cpumask_first(mask_val);
+		cpu_mask = cpu_online_mask;
 
-	if (cpu >= NR_GIC_CPU_IF || cpu >= nr_cpu_ids)
-		return -EINVAL;
+	for_each_cpu_and(cpu, mask_val, cpu_mask) {
+		if (cpu >= NR_GIC_CPU_IF || cpu >= nr_cpu_ids) {
+			return -EINVAL;
+		}
+		bits |= gic_cpu_map[cpu];
+	}
 
 	gic_lock_irqsave(flags);
-	mask = 0xff << shift;
-	bit = gic_cpu_map[cpu] << shift;
-	val = readl_relaxed(reg) & ~mask;
-	writel_relaxed(val | bit, reg);
+	reg_mask = 0xff << shift;
+	bits <<= shift;
+	val = readl_relaxed(reg) & ~reg_mask;
+	writel_relaxed(val | bits, reg);
 	gic_unlock_irqrestore(flags);
 
-	irq_data_update_effective_affinity(d, cpumask_of(cpu));
+	irq_data_update_effective_affinity(d, cpu_mask);
 
 	return IRQ_SET_MASK_OK_DONE;
 }
--
2.17.1
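
For readers unfamiliar with the register layout the patch manipulates: each
32-bit GICD_ITARGETSRn register packs the 8-bit CPU-interface target masks
of four interrupts, one byte per interrupt. Below is a minimal standalone
sketch of that addressing arithmetic, for illustration only (not kernel
code; the 0x800 offset is GICD_ITARGETSR0 per the GICv2 specification, and
the IRQ number and target mask are made-up values):

#include <stdint.h>
#include <stdio.h>

#define GIC_DIST_TARGET 0x800	/* GICD_ITARGETSR0 offset in the GICv2 spec */

int main(void)
{
	unsigned int irq = 45;			/* example hardware IRQ number */
	unsigned int reg = GIC_DIST_TARGET + (irq & ~3);	/* 4 IRQs/register */
	unsigned int shift = (irq % 4) * 8;	/* byte lane for this IRQ */
	uint32_t targets = 0x07;		/* example mask: CPU interfaces 0-2 */
	uint32_t val = 0;			/* stands in for readl_relaxed() */

	/* clear this IRQ's byte, then OR in the new target mask */
	val = (val & ~(0xffu << shift)) | (targets << shift);
	printf("IRQ %u: register 0x%03x, shift %u, value 0x%08x\n",
	       irq, reg, shift, val);
	return 0;
}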

Comments

Russell King (Oracle) Nov. 19, 2019, 11:36 p.m. UTC | #1
On Tue, Nov 19, 2019 at 11:12:26PM +0000, event wrote:
> So far only a CPU selected with top affinity bit was selected. This
> resulted in all interrupts
> being processed by CPU0 by default despite "FF" default affinity
> setting for all interrupts

Have you checked whether this causes _ALL_ CPUs in the mask to be
delivered a single interrupt, thereby causing _ALL_ CPUs to be
slowed down and to hit the same locks at the same time?

> [snip]
Leonid Movshovich Nov. 20, 2019, 12:24 a.m. UTC | #2
On Tue, 19 Nov 2019 at 23:36, Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
>
> On Tue, Nov 19, 2019 at 11:12:26PM +0000, event wrote:
> > So far only a CPU selected with top affinity bit was selected. This
> > resulted in all interrupts
> > being processed by CPU0 by default despite "FF" default affinity
> > setting for all interrupts
>
> Have you checked whether this causes _ALL_ CPUs in the mask to be
> delivered a single interrupt, thereby causing _ALL_ CPUs to be
> slowed down and hit the same locks at the same time.
>

Yes, I've checked this. No, the interrupt is delivered to only one CPU.
The ARM GIC architecture specification also states explicitly, in chapter
3.1.1, that hardware interrupts are delivered to a single CPU in a
multiprocessor system (the "1-N model"). Here is the output of
/proc/interrupts from my rk3328 with the patch applied:
root@host:~ # cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  1:          0          0          0          0     GICv2  29 Edge      arch_timer
  2:      16615      17538      17932      18593     GICv2  30 Edge      arch_timer
 12:          0          0          0          0     GICv2  90 Level     rockchip_thermal
 15:          0          0          0          0     GICv2  68 Level     ff150000.i2c
 16:        557        526        181        479     GICv2  69 Level     ff160000.i2c
 19:          0        325          0          0     GICv2  82 Level     rk_pwm_irq
 20:        401        315        328        294     GICv2  32 Level     ff1f0000.dmac
 21:          0          0          0          0     GICv2  33 Level     ff1f0000.dmac
 22:        537        430        557        378     GICv2 122 Level     Mali_GP
 23:          0          0          0          0     GICv2 119 Level     Mali_GP_MMU
 24:        257        236        257        201     GICv2 125 Level     Mali_PP_Broadcast
 25:          0          0          0          0     GICv2 120 Level     Mali_PP0
 26:          0          0          0          0     GICv2 121 Level     Mali_PP0_MMU
 27:          0          0          0          0     GICv2 123 Level     Mali_PP1
 28:          0          0          0          0     GICv2 124 Level     Mali_PP1_MMU
 29:          0          0          0          0     GICv2  41 Level     ff350000.vpu_service, ff351000.avsd
 31:          0          0          0          0     GICv2  39 Level     ff360000.rkvdec
 33:          0          0          0          0     GICv2 127 Level     ff330000.h265e
 35:          0          0          0          0     GICv2 129 Level     ff340000.vepu
 37:       1140        832        902        789     GICv2  64 Level     ff370000.vop, ff370000.vop
 38:          0          0          0          0     GICv2  63 Level     ff3a0000.iep
 39:        983        759        959        741     GICv2  67 Level     ff3c0000.hdmi, dw-hdmi-cec
 41:          0          0          0          0     GICv2 115 Level     ff430000.hdmiphy
 42:          0          0          0          0     GICv2 109 Level     rockchip_u3phy
 43:          7          0          5          3     GICv2  46 Level     dw-mci
 44:      52394       1141      50331      21990     GICv2  44 Level     dw-mci
 45:         71         44         68         63     GICv2  56 Level     eth0
 46:          0          0          0          0     GICv2  55 Level     ff580000.usb, dwc2_hsotg:usb1
 47:          0          0          0          0     GICv2  48 Level     ehci_hcd:usb2
 48:          0          0          0          0     GICv2  49 Level     ohci_hcd:usb3
124:          0          0          0          0     gpio2   6 Level     rk805
182:          0          0          0          0     GICv2  94 Level     rockchip_usb2phy
183:          0          0          0          0     GICv2  93 Level     rockchip_usb2phy
184:          0          0          0          0     GICv2  99 Level     xhci-hcd:usb4

The interrupt counts would be the same for all CPUs if every interrupt
were delivered to all CPUs.

> [snip]
Robin Murphy Nov. 20, 2019, 1:15 a.m. UTC | #3
On 2019-11-20 12:24 am, Leonid Movshovich wrote:
> On Tue, 19 Nov 2019 at 23:36, Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
>>
>> On Tue, Nov 19, 2019 at 11:12:26PM +0000, event wrote:
>>> So far only a CPU selected with top affinity bit was selected. This
>>> resulted in all interrupts
>>> being processed by CPU0 by default despite "FF" default affinity
>>> setting for all interrupts
>>
>> Have you checked whether this causes _ALL_ CPUs in the mask to be
>> delivered a single interrupt, thereby causing _ALL_ CPUs to be
>> slowed down and hit the same locks at the same time.
>>
> 
> Yes, I've checked this. No, interrupt is delivered to only one CPU.
> Also ARM GIC architecture specification specifically states in chapter
> 3.1.1 that hardware interrupts are delivered to a single CPU in
> multiprocessor system ("1-N model").

But see also section 3.2.3 - just because only one CPU actually runs the 
given ISR doesn't necessarily guarantee that the others *weren't* 
interrupted. I'd also hesitate to make any assumptions that all GIC 
implementations behave exactly the same way.

Robin.

> [snip]
Leonid Movshovich Nov. 20, 2019, 10:44 a.m. UTC | #4
On Wed, 20 Nov 2019 at 01:15, Robin Murphy <robin.murphy@arm.com> wrote:
> [snip]
> But see also section 3.2.3 - just because only one CPU actually runs the
> given ISR doesn't necessarily guarantee that the others *weren't*
> interrupted. I'd also hesitate to make any assumptions that all GIC
> implementations behave exactly the same way.
>
> Robin.

Yes, that's right, however:
1. They are only interrupted for a split second, since the interrupt is
immediately ACKed in gic_handle_irq.
2. More importantly, smp_affinity in procfs is defined to allow the user
to configure multiple CPUs to handle interrupts (see
Documentation/IRQ-affinity.txt), which is effectively prohibited by the
current implementation. I mean, when a user sets it to FF, she expects
all CPUs to process interrupts, not CPU0 only (see the sketch below).
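
For concreteness, the procfs knob takes a hex CPU bitmask. A minimal
standalone sketch of pinning one IRQ to CPUs 0-2 (the IRQ number 44 and
the 0x7 mask are purely illustrative):

#include <stdio.h>

int main(void)
{
	/* write a hex CPU bitmask, as Documentation/IRQ-affinity.txt
	 * describes; 0x7 selects CPU0, CPU1 and CPU2 */
	FILE *f = fopen("/proc/irq/44/smp_affinity", "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	fprintf(f, "%x\n", 0x7);
	fclose(f);
	return 0;
}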

Leo.

> [snip]
Russell King (Oracle) Nov. 20, 2019, 10:50 a.m. UTC | #5
On Wed, Nov 20, 2019 at 10:44:39AM +0000, Leonid Movshovich wrote:
> > [snip]
> Yes, that's right, however:
> 1. They are only interrupted for a split-second, since interrupt is
> immediately ACKed in gic_handle_irq

Even that is detrimental - consider cpuidle where a CPU is placed in
a low power state waiting for an interrupt, and it keeps getting woken
for interrupts that it isn't able to handle.  The effect will be to
stop the CPU hitting the lower power states, which would be a regression
over how the kernel behaves today.

> 2. More important that smp_affinity in procfs is defined to allow user
> to configure multiple CPU's to handle interrupts (see
> Documentation/IRQ-affinity.txt) which is effectively prohibited in
> current implementation. I mean, when user sets it to FF, she expects
> all CPUs to process interrupts, not CPU0 only

The reason we've ended up with that on ARM is precisely because it
wasted CPU resources, and my attempts at writing code to distribute
the interrupt between CPU cores did not have a successful outcome.
So, the best thing that could be done was to route interrupts to the
first core, and run irqbalance to distribute the interrupts in a
sensible, cache friendly way between CPU cores.

And no, the current implementation is *NOT* prohibited.  You can't
prohibit something that hardware hasn't been able to provide.
Leonid Movshovich Nov. 20, 2019, 11:25 a.m. UTC | #6
On Wed, 20 Nov 2019 at 10:50, Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
>
> > [snip]
> > Yes, that's right, however:
> > 1. They are only interrupted for a split-second, since interrupt is
> > immediately ACKed in gic_handle_irq
>
> Even that is detrimental - consider cpuidle where a CPU is placed in
> a low power state waiting for an interrupt, and it keeps getting woken
> for interrupts that it isn't able to handle.  The effect will be to
> stop the CPU hitting the lower power states, which would be a regression
> over how the kernel behaves today.
>
> > 2. More important that smp_affinity in procfs is defined to allow user
> > to configure multiple CPU's to handle interrupts (see
> > Documentation/IRQ-affinity.txt) which is effectively prohibited in
> > current implementation. I mean, when user sets it to FF, she expects
> > all CPUs to process interrupts, not CPU0 only
>
> The reason we've ended up with that on ARM is precisely because it
> wasted CPU resources, and my attempts at writing code to distribute
> the interrupt between CPU cores did not have a successful outcome.
> So, the best thing that could be done was to route interrupts to the
> first core, and run irqbalance to distribute the interrupts in a
> sensible, cache friendly way between CPU cores.
>
> And no, the current implementation is *NOT* prohibited.  You can't
> prohibit something that hardware hasn't been able to provide.
>

The hardware allows delivering an interrupt to an arbitrary CPU from a
selected bitmask, and the current implementation doesn't allow this to be
configured. While this may be an issue for power-conscious systems, there
are also systems with plenty of electricity where using all CPUs for e.g.
network packet handling is more important.
Anyway, I see your point about keeping the default behaviour unchanged. I'd
suggest setting irq_default_affinity for the arm/arm64 architectures so the
default behaviour stays as is (i.e. deliver everything to CPU0).

Robin Murphy Nov. 20, 2019, 1:33 p.m. UTC | #7
On 20/11/2019 11:25 am, Leonid Movshovich wrote:
>>> [snip]
>>> 2. More important that smp_affinity in procfs is defined to allow user
>>> to configure multiple CPU's to handle interrupts (see
>>> Documentation/IRQ-affinity.txt) which is effectively prohibited in
>>> current implementation. I mean, when user sets it to FF, she expects
>>> all CPUs to process interrupts, not CPU0 only

I have to say, my interaction with the IRQ layer is far more as a "user" 
than as a "developer", yet I've always assumed that the affinity mask 
represents the set of CPUs that *may* handle an interrupt and have never 
felt particularly surprised by the naive implementation of "just pick 
the first one".

Do these users also expect the scheduler to constantly context-switch a 
single active task all over the place just because the default thread 
affinity mask says it can?

>> The reason we've ended up with that on ARM is precisely because it
>> wasted CPU resources, and my attempts at writing code to distribute
>> the interrupt between CPU cores did not have a successful outcome.
>> So, the best thing that could be done was to route interrupts to the
>> first core, and run irqbalance to distribute the interrupts in a
>> sensible, cache friendly way between CPU cores.
>>
>> And no, the current implementation is *NOT* prohibited.  You can't
>> prohibit something that hardware hasn't been able to provide.
>>
> 
> Hardware allows delivering interrupt to random CPU from selected
> bitmask and current implementation doesn't allow to configure this.
> While this may be an issue for power-concerned systems, there are also
> systems with plenty of electricity where using all CPUs for e.g.
> network packet handling is more important.

It's not just about batteries - more and more SoCs these days have 
internally constrained power/thermal budgets too. Think of Intel's turbo 
boost, or those Amlogic TV box chips that can only hit their advertised 
top frequencies with one or two cores active - on systems like that, 
yanking all the cores out of standby every time could be actively 
detrimental to single-thread performance and actually end up 
*increasing* interrupt-handling latency.

If you want to optimise a particular system for a particular use-case, 
you're almost certainly better off manually tuning affinities anyway 
(certain distros already do this). If you mostly just want 
/proc/interrupts to look pretty, there's irqbalance.

> Anyway, I see your point of keeping default behaviour unchanged. I'd
> suggest to set irq_default_affinity for arm/arm64 architectures to
> keep default behaviour as is (i.e. deliver everything to CPU0).

More than anything, though, let me reiterate my second point more 
strongly. Much as we might like to pretend otherwise, GICv1 is a thing, 
plus I have a feeling that there are implementation errata around 1-N 
arbitration that we've so far ignored because Linux doesn't make use of 
it. If there really is a provable benefit to supporting and maintaining 
this feature upstream at all, it at least needs to be limited to cases 
where it's guaranteed to actually work properly and safely, and I'm 
fairly confident that that set is smaller than the set of all GIC 
implementations covered by this driver.

And given the earlier argument, it's probably worth noting that there 
are precious few networking/infrastructure/server SoCs using GICv2 anyway.

Robin.
Russell King (Oracle) Nov. 20, 2019, 1:58 p.m. UTC | #8
On Wed, Nov 20, 2019 at 01:33:11PM +0000, Robin Murphy wrote:
> [snip]
> I have to say, my interaction with the IRQ layer is far more as a "user"
> than as a "developer", yet I've always assumed that the affinity mask
> represents the set of CPUs that *may* handle an interrupt and have never
> felt particularly surprised by the naive implementation of "just pick the
> first one".
> 
> Do these users also expect the scheduler to constantly context-switch a
> single active task all over the place just because the default thread
> affinity mask says it can?

It is my understanding that the scheduler will try to keep tasks on
the CPU they are already running on, unless there's a benefit to
migrating them to a different CPU - because if you're constantly
migrating code between different CPUs, you're having to bounce
cache lines around the system.

> > > [snip]
> > 
> > Hardware allows delivering interrupt to random CPU from selected
> > bitmask and current implementation doesn't allow to configure this.
> > While this may be an issue for power-concerned systems, there are also
> > systems with plenty of electricity where using all CPUs for e.g.
> > network packet handling is more important.
> 
> It's not just about batteries - more and more SoCs these days have
> internally constrained power/thermal budgets too. Think of Intel's turbo
> boost, or those Amlogic TV box chips that can only hit their advertised top
> frequencies with one or two cores active - on systems like that, yanking all
> the cores out of standby every time could be actively detrimental to
> single-thread performance and actually end up *increasing*
> interrupt-handling latency.
> 
> If you want to optimise a particular system for a particular use-case,
> you're almost certainly better off manually tuning affinities anyway
> (certain distros already do this). If you mostly just want /proc/interrupts
> to look pretty, there's irqbalance.

The conclusion I came to when I did the initial 32-bit ARM SMP support
was:

1) it is policy, and userspace deals with policy
2) routing the IRQ to distribute it between CPUs is difficult
3) the problem is already solved by userspace (irqbalance)

(2) is difficult because you don't want to do something naive like
routing the first interrupt to CPU0, the second to CPU1, the third to
CPU2, etc., because that totally destroys cache locality and therefore
performance.  Your network card goes faster if its IRQ is always
processed by the same CPU (benefiting from a hot cache) rather than
spreading it around the CPUs.
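
As a hedged standalone sketch of the naive rotation being warned against
(an illustrative policy only, not anything the kernel implements):

#include <stdio.h>

#define NR_CPUS 4

static unsigned int next_cpu;

/* naive policy: rotate the target CPU on every interrupt */
static unsigned int naive_pick_cpu(void)
{
	unsigned int cpu = next_cpu;

	next_cpu = (next_cpu + 1) % NR_CPUS;
	return cpu;
}

int main(void)
{
	/* eight back-to-back NIC interrupts bounce across every CPU,
	 * dragging the handler's cache lines along with them */
	for (int i = 0; i < 8; i++)
		printf("irq event %d -> CPU%u\n", i, naive_pick_cpu());
	return 0;
}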

> And given the earlier argument, it's probably worth noting that there are
> precious few networking/infrastructure/server SoCs using GICv2 anyway.

Networking is just one specific example where it's beneficial.
Other examples are available.
Marc Zyngier Nov. 20, 2019, 3:04 p.m. UTC | #9
On 2019-11-20 01:15, Robin Murphy wrote:
> [snip]
> But see also section 3.2.3 - just because only one CPU actually runs
> the given ISR doesn't necessarily guarantee that the others *weren't*
> interrupted. I'd also hesitate to make any assumptions that all GIC
> implementations behave exactly the same way.

What happens is that *all* CPUs are being sent the interrupt, and there
is some logic in the GIC that ensures that only one sees it (the first
one to read the IAR register). All the others see a spurious (1023)
interrupt, and have wasted some precious cycles in doing so.
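
To make the cost concrete, here is a hedged standalone sketch of that
acknowledge race (illustration only, not the driver's gic_handle_irq; the
fake IAR read merely mimics the GIC handing the interrupt to the first
reader and returning the spurious ID 1023 to everyone else):

#include <stdint.h>
#include <stdio.h>

#define GIC_SPURIOUS 1023u	/* spurious interrupt ID in the GICv2 spec */

/* stand-in for a read of GICC_IAR: the first reader wins the 1-N race */
static uint32_t read_iar(uint32_t *pending)
{
	uint32_t irqnr = *pending;

	*pending = GIC_SPURIOUS;	/* later readers see 1023 */
	return irqnr;
}

int main(void)
{
	uint32_t pending = 44;	/* one pending hardware IRQ */

	/* all targeted CPUs take the exception and race to acknowledge */
	for (int cpu = 0; cpu < 4; cpu++) {
		uint32_t irqnr = read_iar(&pending) & 0x3ff;

		if (irqnr == GIC_SPURIOUS)
			printf("CPU%d: spurious, cycles wasted\n", cpu);
		else
			printf("CPU%d: handles IRQ %u\n", cpu, irqnr);
	}
	return 0;
}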

I get this patch, more or less well written, every other year.
My answer is that it may help a very small minority of use cases, and
suck for everyone else. So thank you, but no, thank you.

Note that GICv3's version of the thing is even more unusable:
- the configuration is secure only
- the distribution mode is IMPDEF
- LPIs can only be precisely routed

         M.
Leonid Movshovich Nov. 20, 2019, 3:07 p.m. UTC | #10
On Wed, 20 Nov 2019 at 13:58, Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
>
> On Wed, Nov 20, 2019 at 01:33:11PM +0000, Robin Murphy wrote:
> > [snip]
> > I have to say, my interaction with the IRQ layer is far more as a "user"
> > than as a "developer", yet I've always assumed that the affinity mask
> > represents the set of CPUs that *may* handle an interrupt and have never
> > felt particularly surprised by the naive implementation of "just pick the
> > first one".

The kernel documentation in Documentation/IRQ-affinity.txt sets the
expectation that IRQs will be spread evenly between CPUs when multiple
CPUs are selected in smp_affinity. It also seems to be quite a common
practice (in consumer devices at least) to have interrupts spread
between CPUs. At least, that's what happens on my PC and phone according
to /proc/interrupts.

> [snip]
> The conclusion I came to when I did the initial 32-bit ARM SMP support
> was:
>
> 1) it is policy, and userspace deals with policy
> 2) routing the IRQ in to distribute it between CPUs is difficult

Yes, but the current implementation of smp_affinity does not allow
setting multiple CPUs to handle the same interrupt. Neither the hardware
nor the software seems to have any issue with distribution. In any case,
I suggest keeping the default behaviour as is, so that only those who
know what they are doing would be playing around with this.

> 3) the problem is already solved by userspace (irqbalance)

irqbalance sets smp_affinity. If one wants to dedicate a subset of
CPUs to a certain interrupt with the current implementation of
set_affinity, irqbalance has to sit there and switch affinities all
the time: constantly read /proc/interrupts and change smp_affinity.
That doesn't sound like a great solution at all.
Not even mentioning that irqbalance pulls in glib, which won't make many
embedded developers happy.

>
> (2) is difficult because you don't want to do something naieve like
> route the first interrupt to CPU0, second to CPU1, third to CPU2
> etc, because that totally destroys cache locality and therefore
> performance.  Your network card goes faster if its IRQ is always
> processed by the same CPU (benefiting from hot cache) rather than
> spreading it around the CPUs.

Imagine my network card receives traffic at 100 Mbps, but a single CPU
can only handle 33 Mbps. I would like to dedicate three CPUs to
networking, but that's not possible at the moment without patching the
kernel or adding a userspace application which would sit and switch the
interrupt's smp_affinity a few times a second, keeping another CPU busy.


So if the new set_affinity implementation is done together with a default
affinity change on arm/arm64, it would be business as usual for users of
the default setup, and those "lucky" owners of strange setups (like
myself) would be able to configure their systems.

>
> > And given the earlier argument, it's probably worth noting that there are
> > precious few networking/infrastructure/server SoCs using GICv2 anyway.
>
> Networking is just one specific example where it's beneficial.
> Other examples are available.
Leonid Movshovich Nov. 20, 2019, 3:28 p.m. UTC | #11
On Wed, 20 Nov 2019 at 15:04, Marc Zyngier <maz@kernel.org> wrote:
>
> [snip]
>
> What happens is that *all* CPUs are being sent the interrupt, and there
> is some logic in the GIC that ensures that only one sees it (the first
> one to read the IAR register). All the other see a spurious (1023)
> interrupt, and have wasted some precious cycles in doing so.

Cycles are only precious when the system is under high load. Under high
load, to achieve a fair spread of interrupts between CPUs, one would need
a userspace app (irqbalance) to sit there and constantly rebalance
smp_affinity based on /proc/interrupts. It's hard to believe such an
approach wastes fewer cycles.


>
> I get this patch, more or less well written, every other year.
> My answer is that it may help a very small minority of use cases, and
> suck for everyone else. So thank you, but no, thank you.

I was wondering why such an obvious change was never made. Now I know
whom to blame :).

Anyway, I don't suggest "happiness for everyone". I suggest changing the
behaviour AND the default affinity, so existing setups are not affected
AND the "small minority" gets the benefit.

>
> Note that GICv3's version of the thing is even more unusable:
> - the configuration is secure only
> - the distribution mode is IMPDEF
> - LPIs can only be precisely routed
>
>          M.
Marc Zyngier Nov. 20, 2019, 3:39 p.m. UTC | #12
On 2019-11-20 15:28, Leonid Movshovich wrote:
> [snip]
> Cycles are only precious when system is under high load. Under high
> load, to achieve fair spread of interrupts between CPUs one would need
> a userspace app (irqbalance) to sit there and constantly rebalance
> smp_affinity based on /proc/interrupts. Hard to believe such an
> approach wastes less cycles.

You'd be surprised. As always when looking at these things, do come up
with actual figures with a wide range of workloads that show benefits
for the approach you're suggesting.

Also, if your system isn't under high load, why would you even care
about this kind of distribution?

>> I get this patch, more or less well written, every other year.
>> My answer is that it may help a very small minority of use cases, and
>> suck for everyone else. So thank you, but no, thank you.
>
> I was wondering, why such an obvious change was never made. Now I know
> whom to blame :).

The MAINTAINERS file (and a basic git log) would have told you that.
And yes, I'm proudly taking the blame for having resisted this all along.

> Anyway, I don't suggest "happiness for everyone". I suggest changing
> the behaviour AND the default affinity, so existing setups are not
> affected AND the "small minority" gets the benefit.

As I said above, show me the numbers on a wide range of HW, with a wide
range of workloads.

         M.
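
A minimal sketch of the acknowledge path described above, loosely
modelled on the GICv2 flow in drivers/irqchip/irq-gic.c (the register
offsets match the kernel's arm-gic.h definitions; dispatch_irq() is a
hypothetical stand-in for the real irq_domain dispatch, and the snippet
assumes <linux/io.h> and <linux/types.h>):

    #define GIC_CPU_INTACK		0x0c	/* GICC_IAR: acknowledge register */
    #define GIC_CPU_EOI		0x10	/* GICC_EOIR: end of interrupt */
    #define GICC_IAR_INT_ID_MASK	0x3ff

    /* Every CPU in the target byte takes the exception and races to
     * read IAR; exactly one gets the real INTID, the rest read back
     * 1023 and have been woken for nothing. */
    static void gic_handle_irq_sketch(void __iomem *cpu_base)
    {
    	u32 irqstat, irqnr;

    	do {
    		irqstat = readl_relaxed(cpu_base + GIC_CPU_INTACK);
    		irqnr = irqstat & GICC_IAR_INT_ID_MASK;

    		if (irqnr >= 1020)	/* 1023 = spurious */
    			break;		/* another CPU won the race */

    		dispatch_irq(irqnr);	/* hypothetical stand-in */
    		writel_relaxed(irqstat, cpu_base + GIC_CPU_EOI);
    	} while (1);
    }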
Leonid Movshovich Nov. 20, 2019, 4:45 p.m. UTC | #13
On Wed, 20 Nov 2019 at 15:39, Marc Zyngier <maz@kernel.org> wrote:
>
> On 2019-11-20 15:28, Leonid Movshovich wrote:
> > On Wed, 20 Nov 2019 at 15:04, Marc Zyngier <maz@kernel.org> wrote:
> >>
> >> On 2019-11-20 01:15, Robin Murphy wrote:
> >> > On 2019-11-20 12:24 am, Leonid Movshovich wrote:
> >> >> On Tue, 19 Nov 2019 at 23:36, Russell King - ARM Linux admin
> >> >> <linux@armlinux.org.uk> wrote:
> >> >>>
> >> >>> On Tue, Nov 19, 2019 at 11:12:26PM +0000, event wrote:
> >> >>>> So far only a CPU selected with top affinity bit was selected. This
> >> >>>> resulted in all interrupts
> >> >>>> being processed by CPU0 by default despite "FF" default affinity
> >> >>>> setting for all interrupts
> >> >>>
> >> >>> Have you checked whether this causes _ALL_ CPUs in the mask to be
> >> >>> delivered a single interrupt, thereby causing _ALL_ CPUs to be
> >> >>> slowed down and hit the same locks at the same time.
> >> >>>
> >> >> Yes, I've checked this. No, interrupt is delivered to only one CPU.
> >> >> Also ARM GIC architecture specification specifically states in chapter
> >> >> 3.1.1 that hardware interrupts are delivered to a single CPU in
> >> >> multiprocessor system ("1-N model").
> >> >
> >> > But see also section 3.2.3 - just because only one CPU actually runs
> >> > the given ISR doesn't necessarily guarantee that the others *weren't*
> >> > interrupted. I'd also hesitate to make any assumptions that all GIC
> >> > implementations behave exactly the same way.
> >>
> >> What happens is that *all* CPUs are being sent the interrupt, and there
> >> is some logic in the GIC that ensures that only one sees it (the first
> >> one to read the IAR register). All the other see a spurious (1023)
> >> interrupt, and have wasted some precious cycles in doing so.
> >
> > Cycles are only precious when system is under high load. Under high
> > load, to achieve fair spread of interrupts between CPUs one would need
> > a userspace app (irqbalance) to sit there and constantly rebalance
> > smp_affinity based on /proc/interrupts. Hard to believe such an
> > approach wastes less cycles.
>
> You'd be surprised. As always when looking at these things, do come up
> with actual figures with a wide range of workloads that show benefits
> for the approach you're suggesting.
>
> Also, if your system isn't under high load, why would you even care
> about this kind of distribution?

Coming back to my network example: under moderate load, without
distribution, you'd get one CPU struggling to process all the traffic
while the others sit idle.

>
> >> I get this patch, more or less well written, every other year.
> >> My answer is that it may help a very small minority of use cases, and
> >> suck for everyone else. So thank you, but no, thank you.
> >
> > I was wondering why such an obvious change was never made. Now I know
> > whom to blame :).
>
> The MAINTAINERS file (and a basic git log) would have told you that.
> And yes, I'm proudly taking the blame for having resisted this all along.
>
> > Anyway, I don't suggest "happiness for everyone". I suggest changing
> > the behaviour AND the default affinity, so existing setups are not
> > affected AND the "small minority" gets the benefit.
>
> As I said above, show me the numbers on a wide range of HW, with a wide
> range of workloads.

If the default affinity were changed as well, the behaviour would stay
the same as it is now. Thus, the change would only affect those who
deliberately and knowingly want to spread the load.

>
>          M.
> --
> Jazz is not dead. It just smells funny...
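
As an aside, the default under discussion is itself visible in procfs
(see Documentation/IRQ-affinity.txt); a minimal standalone sketch for
inspecting it:

    #include <stdio.h>

    /* Newly requested IRQs inherit this boot-time default mask. */
    int main(void)
    {
    	char buf[64];
    	FILE *f = fopen("/proc/irq/default_smp_affinity", "r");

    	if (!f) {
    		perror("default_smp_affinity");
    		return 1;
    	}
    	if (fgets(buf, sizeof(buf), f))
    		printf("default IRQ affinity mask: %s", buf);
    	fclose(f);
    	return 0;
    }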
Russell King (Oracle) Nov. 20, 2019, 5:13 p.m. UTC | #14
On Wed, Nov 20, 2019 at 03:07:16PM +0000, Leonid Movshovich wrote:
> On Wed, 20 Nov 2019 at 13:58, Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
> >
> > On Wed, Nov 20, 2019 at 01:33:11PM +0000, Robin Murphy wrote:
> > > On 20/11/2019 11:25 am, Leonid Movshovich wrote:
> > > > On Wed, 20 Nov 2019 at 10:50, Russell King - ARM Linux admin
> > > > <linux@armlinux.org.uk> wrote:
> > > > >
> > > > > On Wed, Nov 20, 2019 at 10:44:39AM +0000, Leonid Movshovich wrote:
> > > > > > On Wed, 20 Nov 2019 at 01:15, Robin Murphy <robin.murphy@arm.com> wrote:
> > > > > > >
> > > > > > > On 2019-11-20 12:24 am, Leonid Movshovich wrote:
> > > > > > > > On Tue, 19 Nov 2019 at 23:36, Russell King - ARM Linux admin
> > > > > > > > <linux@armlinux.org.uk> wrote:
> > > > > > > > >
> > > > > > > > > On Tue, Nov 19, 2019 at 11:12:26PM +0000, event wrote:
> > > > > > > > > > So far only a CPU selected with top affinity bit was selected. This
> > > > > > > > > > resulted in all interrupts
> > > > > > > > > > being processed by CPU0 by default despite "FF" default affinity
> > > > > > > > > > setting for all interrupts
> > > > > > > > >
> > > > > > > > > Have you checked whether this causes _ALL_ CPUs in the mask to be
> > > > > > > > > delivered a single interrupt, thereby causing _ALL_ CPUs to be
> > > > > > > > > slowed down and hit the same locks at the same time.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Yes, I've checked this. No, interrupt is delivered to only one CPU.
> > > > > > > > Also ARM GIC architecture specification specifically states in chapter
> > > > > > > > 3.1.1 that hardware interrupts are delivered to a single CPU in
> > > > > > > > multiprocessor system ("1-N model").
> > > > > > >
> > > > > > > But see also section 3.2.3 - just because only one CPU actually runs the
> > > > > > > given ISR doesn't necessarily guarantee that the others *weren't*
> > > > > > > interrupted. I'd also hesitate to make any assumptions that all GIC
> > > > > > > implementations behave exactly the same way.
> > > > > > >
> > > > > > > Robin.
> > > > > >
> > > > > > Yes, that's right, however:
> > > > > > 1. They are only interrupted for a split-second, since interrupt is
> > > > > > immediately ACKed in gic_handle_irq
> > > > >
> > > > > Even that is detrimental - consider cpuidle where a CPU is placed in
> > > > > a low power state waiting for an interrupt, and it keeps getting woken
> > > > > for interrupts that it isn't able to handle.  The effect will be to
> > > > > stop the CPU hitting the lower power states, which would be a regression
> > > > > over how the kernel behaves today.
> > > > >
> > > > > > 2. More important that smp_affinity in procfs is defined to allow user
> > > > > > to configure multiple CPU's to handle interrupts (see
> > > > > > Documentation/IRQ-affinity.txt) which is effectively prohibited in
> > > > > > current implementation. I mean, when user sets it to FF, she expects
> > > > > > all CPUs to process interrupts, not CPU0 only
> > >
> > > I have to say, my interaction with the IRQ layer is far more as a "user"
> > > than as a "developer", yet I've always assumed that the affinity mask
> > > represents the set of CPUs that *may* handle an interrupt and have never
> > > felt particularly surprised by the naive implementation of "just pick the
> > > first one".
> 
> Kernel documentation in Documentation/IRQ-affinity.txt sets an
> expectation that IRQs would be spread between CPUs evenly in case
> multiple CPUs are selected in smp_affinity. It also seems to be quite
> a common practice (in consumer devices at least) to have interrupts
> spread between CPUs. At least that's what happens on my PC and phone
> according to /proc/interrupts
> 
> > >
> > > Do these users also expect the scheduler to constantly context-switch a
> > > single active task all over the place just because the default thread
> > > affinity mask says it can?
> >
> > It is my understanding that the scheduler will try to keep tasks on
> > the CPU they are already running on, unless there's a benefit to
> > migrating it to a different CPU - because if you're constantly
> > migrating code between different CPUs, you're having to bounce
> > cache lines around the system.
> >
> > > > > The reason we've ended up with that on ARM is precisely because it
> > > > > wasted CPU resources, and my attempts at writing code to distribute
> > > > > the interrupt between CPU cores did not have a successful outcome.
> > > > > So, the best thing that could be done was to route interrupts to the
> > > > > first core, and run irqbalance to distribute the interrupts in a
> > > > > sensible, cache friendly way between CPU cores.
> > > > >
> > > > > And no, the current implementation is *NOT* prohibited.  You can't
> > > > > prohibit something that hardware hasn't been able to provide.
> > > > >
> > > >
> > > > Hardware allows delivering interrupt to random CPU from selected
> > > > bitmask and current implementation doesn't allow to configure this.
> > > > While this may be an issue for power-concerned systems, there are also
> > > > systems with plenty of electricity where using all CPUs for e.g.
> > > > network packet handling is more important.
> > >
> > > It's not just about batteries - more and more SoCs these days have
> > > internally constrained power/thermal budgets too. Think of Intel's turbo
> > > boost, or those Amlogic TV box chips that can only hit their advertised top
> > > frequencies with one or two cores active - on systems like that, yanking all
> > > the cores out of standby every time could be actively detrimental to
> > > single-thread performance and actually end up *increasing*
> > > interrupt-handling latency.
> > >
> > > If you want to optimise a particular system for a particular use-case,
> > > you're almost certainly better off manually tuning affinities anyway
> > > (certain distros already do this). If you mostly just want /proc/interrupts
> > > to look pretty, there's irqbalance.
> >
> > The conclusion I came to when I did the initial 32-bit ARM SMP support
> > was:
> >
> > 1) it is policy, and userspace deals with policy
> > 2) routing the IRQ in to distribute it between CPUs is difficult
> 
> Yes, but current implementation of smp_affinity does not allow to set
> multiple CPUs to handle same interrupt. Neither hardware nor software
> seem to have any issues with distribution. In any case, I suggest to
> keep default behaviour as is, so only those who know what are they
> doing would be playing around with this.
> 
> > 3) the problem is already solved by userspace (irqbalance)
> 
> irqbalance sets smp_affinity. If one wants to dedicate a subset of
> CPUs to a certain interrupt with current implementation of
> set_affinity, irqbalance have to sit there and switch affinities all
> the time. Constantly read /proc/interrupts and change smp_affinity.
> That doesn't sound like a great solution at all.
> Not even mentioning that irqbalance pulls glib which won't make many
> embedded developers happy.

This discussion is going nowhere.

I've stated my position based on experience as 32-bit ARM maintainer
trying to make it work.  It may not conform to the documentation, but
it's what has been used for decades on 32-bit ARM, and what most
people have been perfectly happy with.

If you think you have a solution to the stated problem that solves
it for hardware that doesn't automatically distribute interrupts,
then go off and code it and provide a patch.  Otherwise, no amount
of emails stating "but the documentation says X" is going to change
anything.
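
For reference, the knob being argued about is just a hex CPU mask in
procfs; applying a placement, as irqbalance ultimately does, amounts to
the following minimal userspace sketch (the IRQ number and mask are
made-up illustrations):

    #include <stdio.h>

    int main(void)
    {
    	const unsigned int irq = 32;	/* hypothetical IRQ line */
    	const char *mask = "ff";	/* hex mask: allow CPUs 0-7 */
    	char path[64];
    	FILE *f;

    	snprintf(path, sizeof(path), "/proc/irq/%u/smp_affinity", irq);
    	f = fopen(path, "w");
    	if (!f) {
    		perror(path);
    		return 1;
    	}
    	fprintf(f, "%s\n", mask);	/* kernel parses this as a cpumask */
    	fclose(f);
    	return 0;
    }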
Russell King (Oracle) Nov. 20, 2019, 5:14 p.m. UTC | #15
On Wed, Nov 20, 2019 at 03:28:31PM +0000, Leonid Movshovich wrote:
> On Wed, 20 Nov 2019 at 15:04, Marc Zyngier <maz@kernel.org> wrote:
> >
> > On 2019-11-20 01:15, Robin Murphy wrote:
> > > On 2019-11-20 12:24 am, Leonid Movshovich wrote:
> > >> On Tue, 19 Nov 2019 at 23:36, Russell King - ARM Linux admin
> > >> <linux@armlinux.org.uk> wrote:
> > >>>
> > >>> On Tue, Nov 19, 2019 at 11:12:26PM +0000, event wrote:
> > >>>> So far only a CPU selected with top affinity bit was selected. This
> > >>>> resulted in all interrupts
> > >>>> being processed by CPU0 by default despite "FF" default affinity
> > >>>> setting for all interrupts
> > >>>
> > >>> Have you checked whether this causes _ALL_ CPUs in the mask to be
> > >>> delivered a single interrupt, thereby causing _ALL_ CPUs to be
> > >>> slowed down and hit the same locks at the same time.
> > >>>
> > >> Yes, I've checked this. No, interrupt is delivered to only one CPU.
> > >> Also ARM GIC architecture specification specifically states in
> > >> chapter
> > >> 3.1.1 that hardware interrupts are delivered to a single CPU in
> > >> multiprocessor system ("1-N model").
> > >
> > > But see also section 3.2.3 - just because only one CPU actually runs
> > > the given ISR doesn't necessarily guarantee that the others *weren't*
> > > interrupted. I'd also hesitate to make any assumptions that all GIC
> > > implementations behave exactly the same way.
> >
> > What happens is that *all* CPUs are being sent the interrupt, and there
> > is some logic in the GIC that ensures that only one sees it (the first
> > one to read the IAR register). All the other see a spurious (1023)
> > interrupt, and have wasted some precious cycles in doing so.
> 
> Cycles are only precious when system is under high load. Under high
> load, to achieve fair spread of interrupts between CPUs one would need
> a userspace app (irqbalance) to sit there and constantly rebalance
> smp_affinity based on /proc/interrupts. Hard to believe such an
> approach wastes less cycles.

So you have no idea how irqbalance works...
Russell King (Oracle) Nov. 20, 2019, 5:17 p.m. UTC | #16
On Wed, Nov 20, 2019 at 04:45:59PM +0000, Leonid Movshovich wrote:
> On Wed, 20 Nov 2019 at 15:39, Marc Zyngier <maz@kernel.org> wrote:
> >
> > On 2019-11-20 15:28, Leonid Movshovich wrote:
> > > On Wed, 20 Nov 2019 at 15:04, Marc Zyngier <maz@kernel.org> wrote:
> > >>
> > >> On 2019-11-20 01:15, Robin Murphy wrote:
> > >> > On 2019-11-20 12:24 am, Leonid Movshovich wrote:
> > >> >> On Tue, 19 Nov 2019 at 23:36, Russell King - ARM Linux admin
> > >> >> <linux@armlinux.org.uk> wrote:
> > >> >>>
> > >> >>> On Tue, Nov 19, 2019 at 11:12:26PM +0000, event wrote:
> > >> >>>> So far only a CPU selected with top affinity bit was selected. This
> > >> >>>> resulted in all interrupts
> > >> >>>> being processed by CPU0 by default despite "FF" default affinity
> > >> >>>> setting for all interrupts
> > >> >>>
> > >> >>> Have you checked whether this causes _ALL_ CPUs in the mask to be
> > >> >>> delivered a single interrupt, thereby causing _ALL_ CPUs to be
> > >> >>> slowed down and hit the same locks at the same time.
> > >> >>>
> > >> >> Yes, I've checked this. No, interrupt is delivered to only one CPU.
> > >> >> Also ARM GIC architecture specification specifically states in chapter
> > >> >> 3.1.1 that hardware interrupts are delivered to a single CPU in
> > >> >> multiprocessor system ("1-N model").
> > >> >
> > >> > But see also section 3.2.3 - just because only one CPU actually runs
> > >> > the given ISR doesn't necessarily guarantee that the others *weren't*
> > >> > interrupted. I'd also hesitate to make any assumptions that all GIC
> > >> > implementations behave exactly the same way.
> > >>
> > >> What happens is that *all* CPUs are being sent the interrupt, and there
> > >> is some logic in the GIC that ensures that only one sees it (the first
> > >> one to read the IAR register). All the other see a spurious (1023)
> > >> interrupt, and have wasted some precious cycles in doing so.
> > >
> > > Cycles are only precious when system is under high load. Under high
> > > load, to achieve fair spread of interrupts between CPUs one would need
> > > a userspace app (irqbalance) to sit there and constantly rebalance
> > > smp_affinity based on /proc/interrupts. Hard to believe such an
> > > approach wastes less cycles.
> >
> > You'd be surprised. As always when looking at these things, do come up
> > with actual figures with a wide range of workloads that show benefits
> > for the approach you're suggesting.
> >
> > Also, if your system isn't under high load, why would you even care
> > about this kind of distribution?
> 
> > Coming back to my network example: under moderate load, without
> > distribution, you'd get one CPU struggling to process all the traffic
> > while the others sit idle.

And you think that receiving TCP packet 1 on CPU0, TCP packet 2 on
CPU1, TCP packet 3 on CPU2 etc. will help?

I guess you're not aware of network features such as GRO which
combine consecutive packets.  Forcing each packet onto a different
CPU will bounce the cache lines associated with managing the state
between different CPUs => negative performance impact.

Userspace doesn't see individual packets.
Leonid Movshovich Nov. 20, 2019, 5:37 p.m. UTC | #17
On Wed, 20 Nov 2019 at 17:18, Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
>
> On Wed, Nov 20, 2019 at 04:45:59PM +0000, Leonid Movshovich wrote:
> > On Wed, 20 Nov 2019 at 15:39, Marc Zyngier <maz@kernel.org> wrote:
> > >
> > > On 2019-11-20 15:28, Leonid Movshovich wrote:
> > > > On Wed, 20 Nov 2019 at 15:04, Marc Zyngier <maz@kernel.org> wrote:
> > > >>
> > > >> On 2019-11-20 01:15, Robin Murphy wrote:
> > > >> > On 2019-11-20 12:24 am, Leonid Movshovich wrote:
> > > >> >> On Tue, 19 Nov 2019 at 23:36, Russell King - ARM Linux admin
> > > >> >> <linux@armlinux.org.uk> wrote:
> > > >> >>>
> > > >> >>> On Tue, Nov 19, 2019 at 11:12:26PM +0000, event wrote:
> > > >> >>>> So far only a CPU selected with top affinity bit was selected. This
> > > >> >>>> resulted in all interrupts
> > > >> >>>> being processed by CPU0 by default despite "FF" default affinity
> > > >> >>>> setting for all interrupts
> > > >> >>>
> > > >> >>> Have you checked whether this causes _ALL_ CPUs in the mask to be
> > > >> >>> delivered a single interrupt, thereby causing _ALL_ CPUs to be
> > > >> >>> slowed down and hit the same locks at the same time.
> > > >> >>>
> > > >> >> Yes, I've checked this. No, interrupt is delivered to only one CPU.
> > > >> >> Also ARM GIC architecture specification specifically states in chapter
> > > >> >> 3.1.1 that hardware interrupts are delivered to a single CPU in
> > > >> >> multiprocessor system ("1-N model").
> > > >> >
> > > >> > But see also section 3.2.3 - just because only one CPU actually runs
> > > >> > the given ISR doesn't necessarily guarantee that the others *weren't*
> > > >> > interrupted. I'd also hesitate to make any assumptions that all GIC
> > > >> > implementations behave exactly the same way.
> > > >>
> > > >> What happens is that *all* CPUs are being sent the interrupt, and there
> > > >> is some logic in the GIC that ensures that only one sees it (the first
> > > >> one to read the IAR register). All the other see a spurious (1023)
> > > >> interrupt, and have wasted some precious cycles in doing so.
> > > >
> > > > Cycles are only precious when system is under high load. Under high
> > > > load, to achieve fair spread of interrupts between CPUs one would need
> > > > a userspace app (irqbalance) to sit there and constantly rebalance
> > > > smp_affinity based on /proc/interrupts. Hard to believe such an
> > > > approach wastes less cycles.
> > >
> > > You'd be surprised. As always when looking at these things, do come up
> > > with actual figures with a wide range of workloads that show benefits
> > > for the approach you're suggesting.
> > >
> > > Also, if your system isn't under high load, why would you even care
> > > about this kind of distribution?
> >
> > Coming back to my network example: under moderate load, without
> > distribution, you'd get one CPU struggling to process all the traffic
> > while the others sit idle.
>
> And you think that receiving TCP packet 1 on CPU0, TCP packet 2 on
> CPU1, TCP packet 3 on CPU2 etc. will help?
>
> I guess you're not aware of network features such as GRO which
> combine consecutive packets.  Forcing each packet onto a different
> CPU will bounce the cache lines associated with managing the state
> between different CPUs => negative performance impact.

I guess you're not aware that TCP is not the only protocol on the
internet, that GRO is not a "network feature" but rather a NIC feature,
and that not all NICs support it.

>
> Userspace doesn't see individual packets.

And there are packet destinations other than userspace processes.

>
> --
> RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
> According to speedtest.net: 11.9Mbps down 500kbps up
Leonid Movshovich Nov. 20, 2019, 5:48 p.m. UTC | #18
On Wed, 20 Nov 2019 at 17:14, Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
>
> On Wed, Nov 20, 2019 at 03:28:31PM +0000, Leonid Movshovich wrote:
> > On Wed, 20 Nov 2019 at 15:04, Marc Zyngier <maz@kernel.org> wrote:
> > >
> > > On 2019-11-20 01:15, Robin Murphy wrote:
> > > > On 2019-11-20 12:24 am, Leonid Movshovich wrote:
> > > >> On Tue, 19 Nov 2019 at 23:36, Russell King - ARM Linux admin
> > > >> <linux@armlinux.org.uk> wrote:
> > > >>>
> > > >>> On Tue, Nov 19, 2019 at 11:12:26PM +0000, event wrote:
> > > >>>> So far only a CPU selected with top affinity bit was selected. This
> > > >>>> resulted in all interrupts
> > > >>>> being processed by CPU0 by default despite "FF" default affinity
> > > >>>> setting for all interrupts
> > > >>>
> > > >>> Have you checked whether this causes _ALL_ CPUs in the mask to be
> > > >>> delivered a single interrupt, thereby causing _ALL_ CPUs to be
> > > >>> slowed down and hit the same locks at the same time.
> > > >>>
> > > >> Yes, I've checked this. No, interrupt is delivered to only one CPU.
> > > >> Also ARM GIC architecture specification specifically states in
> > > >> chapter
> > > >> 3.1.1 that hardware interrupts are delivered to a single CPU in
> > > >> multiprocessor system ("1-N model").
> > > >
> > > > But see also section 3.2.3 - just because only one CPU actually runs
> > > > the given ISR doesn't necessarily guarantee that the others *weren't*
> > > > interrupted. I'd also hesitate to make any assumptions that all GIC
> > > > implementations behave exactly the same way.
> > >
> > > What happens is that *all* CPUs are being sent the interrupt, and there
> > > is some logic in the GIC that ensures that only one sees it (the first
> > > one to read the IAR register). All the other see a spurious (1023)
> > > interrupt, and have wasted some precious cycles in doing so.
> >
> > Cycles are only precious when system is under high load. Under high
> > load, to achieve fair spread of interrupts between CPUs one would need
> > a userspace app (irqbalance) to sit there and constantly rebalance
> > smp_affinity based on /proc/interrupts. Hard to believe such an
> > approach wastes less cycles.
>
> So you have no idea how irqbalance works...

Here is the one from GitHub
(https://github.com/Irqbalance/irqbalance/blob/master/irqbalance.c#L257):

gboolean scan(gpointer data __attribute__((unused)))
{
    ....
    clear_work_stats();
    parse_proc_interrupts(); <----
    ....
    // few more parse_proc_interrupts here
    ....
    calculate_placement();
    activate_mappings();  <---- finally this guy sets irq affinities through procfs
    ....
}

And here is the main loop setup:

    main_loop = g_main_loop_new(NULL, FALSE);
    last_interval = sleep_interval;
    g_timeout_add_seconds(sleep_interval, scan, NULL);
    g_main_loop_run(main_loop);

So unless your irqbalance is significantly different from the one on
GitHub, that's exactly what it does: every sleep_interval seconds it
parses /proc/interrupts and changes affinities to make sure its target
multi-parameter balance is maintained.

>
> --
> RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
> According to speedtest.net: 11.9Mbps down 500kbps up
Leonid Movshovich Nov. 20, 2019, 5:54 p.m. UTC | #19
On Wed, 20 Nov 2019 at 17:13, Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
>
> On Wed, Nov 20, 2019 at 03:07:16PM +0000, Leonid Movshovich wrote:
> > On Wed, 20 Nov 2019 at 13:58, Russell King - ARM Linux admin
> > <linux@armlinux.org.uk> wrote:
> > >
> > > On Wed, Nov 20, 2019 at 01:33:11PM +0000, Robin Murphy wrote:
> > > > On 20/11/2019 11:25 am, Leonid Movshovich wrote:
> > > > > On Wed, 20 Nov 2019 at 10:50, Russell King - ARM Linux admin
> > > > > <linux@armlinux.org.uk> wrote:
> > > > > >
> > > > > > On Wed, Nov 20, 2019 at 10:44:39AM +0000, Leonid Movshovich wrote:
> > > > > > > On Wed, 20 Nov 2019 at 01:15, Robin Murphy <robin.murphy@arm.com> wrote:
> > > > > > > >
> > > > > > > > On 2019-11-20 12:24 am, Leonid Movshovich wrote:
> > > > > > > > > On Tue, 19 Nov 2019 at 23:36, Russell King - ARM Linux admin
> > > > > > > > > <linux@armlinux.org.uk> wrote:
> > > > > > > > > >
> > > > > > > > > > On Tue, Nov 19, 2019 at 11:12:26PM +0000, event wrote:
> > > > > > > > > > > So far only a CPU selected with top affinity bit was selected. This
> > > > > > > > > > > resulted in all interrupts
> > > > > > > > > > > being processed by CPU0 by default despite "FF" default affinity
> > > > > > > > > > > setting for all interrupts
> > > > > > > > > >
> > > > > > > > > > Have you checked whether this causes _ALL_ CPUs in the mask to be
> > > > > > > > > > delivered a single interrupt, thereby causing _ALL_ CPUs to be
> > > > > > > > > > slowed down and hit the same locks at the same time.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Yes, I've checked this. No, interrupt is delivered to only one CPU.
> > > > > > > > > Also ARM GIC architecture specification specifically states in chapter
> > > > > > > > > 3.1.1 that hardware interrupts are delivered to a single CPU in
> > > > > > > > > multiprocessor system ("1-N model").
> > > > > > > >
> > > > > > > > But see also section 3.2.3 - just because only one CPU actually runs the
> > > > > > > > given ISR doesn't necessarily guarantee that the others *weren't*
> > > > > > > > interrupted. I'd also hesitate to make any assumptions that all GIC
> > > > > > > > implementations behave exactly the same way.
> > > > > > > >
> > > > > > > > Robin.
> > > > > > >
> > > > > > > Yes, that's right, however:
> > > > > > > 1. They are only interrupted for a split-second, since interrupt is
> > > > > > > immediately ACKed in gic_handle_irq
> > > > > >
> > > > > > Even that is detrimental - consider cpuidle where a CPU is placed in
> > > > > > a low power state waiting for an interrupt, and it keeps getting woken
> > > > > > for interrupts that it isn't able to handle.  The effect will be to
> > > > > > stop the CPU hitting the lower power states, which would be a regression
> > > > > > over how the kernel behaves today.
> > > > > >
> > > > > > > 2. More important that smp_affinity in procfs is defined to allow user
> > > > > > > to configure multiple CPU's to handle interrupts (see
> > > > > > > Documentation/IRQ-affinity.txt) which is effectively prohibited in
> > > > > > > current implementation. I mean, when user sets it to FF, she expects
> > > > > > > all CPUs to process interrupts, not CPU0 only
> > > >
> > > > I have to say, my interaction with the IRQ layer is far more as a "user"
> > > > than as a "developer", yet I've always assumed that the affinity mask
> > > > represents the set of CPUs that *may* handle an interrupt and have never
> > > > felt particularly surprised by the naive implementation of "just pick the
> > > > first one".
> >
> > Kernel documentation in Documentation/IRQ-affinity.txt sets an
> > expectation that IRQs would be spread between CPUs evenly in case
> > multiple CPUs are selected in smp_affinity. It also seems to be quite
> > a common practice (in consumer devices at least) to have interrupts
> > spread between CPUs. At least that's what happens on my PC and phone
> > according to /proc/interrupts
> >
> > > >
> > > > Do these users also expect the scheduler to constantly context-switch a
> > > > single active task all over the place just because the default thread
> > > > affinity mask says it can?
> > >
> > > It is my understanding that the scheduler will try to keep tasks on
> > > the CPU they are already running on, unless there's a benefit to
> > > migrating it to a different CPU - because if you're constantly
> > > migrating code between different CPUs, you're having to bounce
> > > cache lines around the system.
> > >
> > > > > > The reason we've ended up with that on ARM is precisely because it
> > > > > > wasted CPU resources, and my attempts at writing code to distribute
> > > > > > the interrupt between CPU cores did not have a successful outcome.
> > > > > > So, the best thing that could be done was to route interrupts to the
> > > > > > first core, and run irqbalance to distribute the interrupts in a
> > > > > > sensible, cache friendly way between CPU cores.
> > > > > >
> > > > > > And no, the current implementation is *NOT* prohibited.  You can't
> > > > > > prohibit something that hardware hasn't been able to provide.
> > > > > >
> > > > >
> > > > > Hardware allows delivering interrupt to random CPU from selected
> > > > > bitmask and current implementation doesn't allow to configure this.
> > > > > While this may be an issue for power-concerned systems, there are also
> > > > > systems with plenty of electricity where using all CPUs for e.g.
> > > > > network packet handling is more important.
> > > >
> > > > It's not just about batteries - more and more SoCs these days have
> > > > internally constrained power/thermal budgets too. Think of Intel's turbo
> > > > boost, or those Amlogic TV box chips that can only hit their advertised top
> > > > frequencies with one or two cores active - on systems like that, yanking all
> > > > the cores out of standby every time could be actively detrimental to
> > > > single-thread performance and actually end up *increasing*
> > > > interrupt-handling latency.
> > > >
> > > > If you want to optimise a particular system for a particular use-case,
> > > > you're almost certainly better off manually tuning affinities anyway
> > > > (certain distros already do this). If you mostly just want /proc/interrupts
> > > > to look pretty, there's irqbalance.
> > >
> > > The conclusion I came to when I did the initial 32-bit ARM SMP support
> > > was:
> > >
> > > 1) it is policy, and userspace deals with policy
> > > 2) routing the IRQ in to distribute it between CPUs is difficult
> >
> > Yes, but current implementation of smp_affinity does not allow to set
> > multiple CPUs to handle same interrupt. Neither hardware nor software
> > seem to have any issues with distribution. In any case, I suggest to
> > keep default behaviour as is, so only those who know what are they
> > doing would be playing around with this.
> >
> > > 3) the problem is already solved by userspace (irqbalance)
> >
> > irqbalance sets smp_affinity. If one wants to dedicate a subset of
> > CPUs to a certain interrupt with current implementation of
> > set_affinity, irqbalance have to sit there and switch affinities all
> > the time. Constantly read /proc/interrupts and change smp_affinity.
> > That doesn't sound like a great solution at all.
> > Not even mentioning that irqbalance pulls glib which won't make many
> > embedded developers happy.
>
> This discussion is going nowhere.
>
> I've stated my position based on experience as 32-bit ARM maintainer
> trying to make it work.  It may not conform to the documentation, but
> it's what has been used for decades on 32-bit ARM, and what most
> people have been perfectly happy with.
>
> If you think you have a solution to the stated problem that solves
> it for hardware that doesn't automatically distribute interrupts,
> then go off and code it and provide a patch.  Otherwise, no amount
> of emails stating "but the documentation says X" is going to change
> anything.

So would it be good enough if I changed the default affinity value for
irq-gic on top of the current patch?

The reference to the kernel docs was regarding Robin's expectations and
nothing else.

>
> --
> RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
> According to speedtest.net: 11.9Mbps down 500kbps up
Russell King (Oracle) Nov. 20, 2019, 5:55 p.m. UTC | #20
On Wed, Nov 20, 2019 at 05:37:38PM +0000, Leonid Movshovich wrote:
> On Wed, 20 Nov 2019 at 17:18, Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
> >
> > On Wed, Nov 20, 2019 at 04:45:59PM +0000, Leonid Movshovich wrote:
> > > On Wed, 20 Nov 2019 at 15:39, Marc Zyngier <maz@kernel.org> wrote:
> > > >
> > > > On 2019-11-20 15:28, Leonid Movshovich wrote:
> > > > > On Wed, 20 Nov 2019 at 15:04, Marc Zyngier <maz@kernel.org> wrote:
> > > > >>
> > > > >> On 2019-11-20 01:15, Robin Murphy wrote:
> > > > >> > On 2019-11-20 12:24 am, Leonid Movshovich wrote:
> > > > >> >> On Tue, 19 Nov 2019 at 23:36, Russell King - ARM Linux admin
> > > > >> >> <linux@armlinux.org.uk> wrote:
> > > > >> >>>
> > > > >> >>> On Tue, Nov 19, 2019 at 11:12:26PM +0000, event wrote:
> > > > >> >>>> So far only a CPU selected with top affinity bit was selected. This
> > > > >> >>>> resulted in all interrupts
> > > > >> >>>> being processed by CPU0 by default despite "FF" default affinity
> > > > >> >>>> setting for all interrupts
> > > > >> >>>
> > > > >> >>> Have you checked whether this causes _ALL_ CPUs in the mask to be
> > > > >> >>> delivered a single interrupt, thereby causing _ALL_ CPUs to be
> > > > >> >>> slowed down and hit the same locks at the same time.
> > > > >> >>>
> > > > >> >> Yes, I've checked this. No, interrupt is delivered to only one CPU.
> > > > >> >> Also ARM GIC architecture specification specifically states in chapter
> > > > >> >> 3.1.1 that hardware interrupts are delivered to a single CPU in
> > > > >> >> multiprocessor system ("1-N model").
> > > > >> >
> > > > >> > But see also section 3.2.3 - just because only one CPU actually runs
> > > > >> > the given ISR doesn't necessarily guarantee that the others *weren't*
> > > > >> > interrupted. I'd also hesitate to make any assumptions that all GIC
> > > > >> > implementations behave exactly the same way.
> > > > >>
> > > > >> What happens is that *all* CPUs are being sent the interrupt, and there
> > > > >> is some logic in the GIC that ensures that only one sees it (the first
> > > > >> one to read the IAR register). All the other see a spurious (1023)
> > > > >> interrupt, and have wasted some precious cycles in doing so.
> > > > >
> > > > > Cycles are only precious when system is under high load. Under high
> > > > > load, to achieve fair spread of interrupts between CPUs one would need
> > > > > a userspace app (irqbalance) to sit there and constantly rebalance
> > > > > smp_affinity based on /proc/interrupts. Hard to believe such an
> > > > > approach wastes less cycles.
> > > >
> > > > You'd be surprised. As always when looking at these things, do come up
> > > > with actual figures with a wide range of workloads that show benefits
> > > > for the approach you're suggesting.
> > > >
> > > > Also, if your system isn't under high load, why would you even care
> > > > about this kind of distribution?
> > >
> > > Coming back to my network example: under moderate load, without
> > > distribution, you'd get one CPU struggling to process all the traffic
> > > while the others sit idle.
> >
> > And you think that receiving TCP packet 1 on CPU0, TCP packet 2 on
> > CPU1, TCP packet 3 on CPU2 etc. will help?
> >
> > I guess you're not aware of network features such as GRO which
> > combine consecutive packets.  Forcing each packet onto a different
> > CPU will bounce the cache lines associated with managing the state
> > between different CPUs => negative performance impact.
> 
> I guess you're not aware that TCP is not the only protocol on the
> internet, that GRO is not a "network feature" but rather a NIC feature,
> and that not all NICs support it.

You have an answer to everything.  Pointless continuing this, sorry.
diff mbox series

Patch

diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c
index 30ab62334..e6c6451ea 100644
--- a/drivers/irqchip/irq-gic.c
+++ b/drivers/irqchip/irq-gic.c
@@ -331,25 +331,30 @@ static int gic_set_affinity(struct irq_data *d, const struct cpumask *mask_val,
 {
 	void __iomem *reg = gic_dist_base(d) + GIC_DIST_TARGET + (gic_irq(d) & ~3);
 	unsigned int cpu, shift = (gic_irq(d) % 4) * 8;
-	u32 val, mask, bit;
+	u32 val, reg_mask, bits = 0;
 	unsigned long flags;
+	const struct cpumask* cpu_mask;
 
-	if (!force)
-		cpu = cpumask_any_and(mask_val, cpu_online_mask);
+	if (force)
+		cpu_mask = mask_val;
 	else
-		cpu = cpumask_first(mask_val);
+		cpu_mask = cpu_online_mask;
 
-	if (cpu >= NR_GIC_CPU_IF || cpu >= nr_cpu_ids)
-		return -EINVAL;
+	for_each_cpu_and(cpu, mask_val, cpu_mask) {
+		if (cpu >= NR_GIC_CPU_IF || cpu >= nr_cpu_ids) {
+			return -EINVAL;
+		}
+		bits |= gic_cpu_map[cpu];
+	}
 
 	gic_lock_irqsave(flags);
-	mask = 0xff << shift;
-	bit = gic_cpu_map[cpu] << shift;
-	val = readl_relaxed(reg) & ~mask;
-	writel_relaxed(val | bit, reg);
+	reg_mask = 0xff << shift;
+	bits <<= shift;
+	val = readl_relaxed(reg) & ~reg_mask;
+	writel_relaxed(val | bits, reg);
 	gic_unlock_irqrestore(flags);
 
-	irq_data_update_effective_affinity(d, cpumask_of(cpu));
+	irq_data_update_effective_affinity(d, cpu_mask);
 
 	return IRQ_SET_MASK_OK_DONE;
 }
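
To make the register arithmetic concrete, here is a standalone sketch
of how the patch assembles the GICD_ITARGETSRn byte for one interrupt;
the IRQ number, CPU set, and one-hot gic_cpu_map values are illustrative
assumptions, not probed hardware values:

    #include <stdio.h>
    #include <stdint.h>

    /* Illustrative stand-in: the kernel probes gic_cpu_map[] at boot;
     * a one-hot encoding (CPU n -> bit n) is the common case. */
    static const uint8_t gic_cpu_map[8] = {
    	0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80
    };

    int main(void)
    {
    	unsigned int irq = 42;			/* hypothetical SPI */
    	unsigned int cpus[] = { 0, 2 };		/* hypothetical affinity mask */
    	unsigned int shift = (irq % 4) * 8;	/* byte lane in GICD_ITARGETSRn */
    	uint32_t bits = 0;
    	unsigned int i;

    	for (i = 0; i < sizeof(cpus) / sizeof(cpus[0]); i++)
    		bits |= gic_cpu_map[cpus[i]];	/* OR all targets, as the patch does */

    	printf("register: GIC_DIST_TARGET + %u\n", irq & ~3u);
    	printf("field: 0x%02x << %u -> 0x%08x\n", bits, shift, bits << shift);
    	return 0;
    }

For the values above this prints a target field of 0x05 shifted into
byte lane 2, i.e. CPUs 0 and 2 are both set in the interrupt's target
byte, which is precisely what the old code's single gic_cpu_map[cpu]
write could not express.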