
[RFC] arm64: defconfig: Disable fine-grained task level IRQ time accounting

Message ID 20200729033934.22349-1-alison.wang@nxp.com (mailing list archive)
State New, archived
Series [RFC] arm64: defconfig: Disable fine-grained task level IRQ time accounting

Commit Message

Alison Wang July 29, 2020, 3:39 a.m. UTC
In the current arm64 defconfig, CONFIG_IRQ_TIME_ACCOUNTING is enabled by
default. According to my tests on NXP's LayerScape and i.MX platforms,
the system hangs when running the command "stress-ng --hrtimers 1" with
CONFIG_IRQ_TIME_ACCOUNTING enabled. With this option disabled, the issue
disappears. CONFIG_IRQ_TIME_ACCOUNTING causes a serious performance
impact while the hrtimer stress test is running.

Signed-off-by: Alison Wang <alison.wang@nxp.com>
---
 arch/arm64/configs/defconfig | 1 -
 1 file changed, 1 deletion(-)

Comments

Kurt Kanzenbach July 29, 2020, 8:40 a.m. UTC | #1
Hi Alison,

On Wed Jul 29 2020, Alison Wang wrote:
> In the current arm64 defconfig, CONFIG_IRQ_TIME_ACCOUNTING is enabled as
> default. According to my tests on NXP's LayerScape and i.MX platforms,
> the system hangs when running the command "stress-ng --hrtimers 1" with
> CONFIG_IRQ_TIME_ACCOUNTING enabled. Disabling this option, the issue
> disappears. CONFIG_IRQ_TIME_ACCOUNTING causes serious performance impact
> when running hrtimer stress test at the same time.

I think instead of disabling the option for all arm64 devices, it might
be better to analyze the root cause of why the hrtimer test hangs when
this option is enabled.

+Cc hrtimer maintainers: Thomas and Anna-Maria

Thanks,
Kurt
Alison Wang July 29, 2020, 8:50 a.m. UTC | #2
Hi, Kurt,

> On Wed Jul 29 2020, Alison Wang wrote:
> > In the current arm64 defconfig, CONFIG_IRQ_TIME_ACCOUNTING is enabled
> > as default. According to my tests on NXP's LayerScape and i.MX
> > platforms, the system hangs when running the command "stress-ng
> > --hrtimers 1" with CONFIG_IRQ_TIME_ACCOUNTING enabled. Disabling this
> > option, the issue disappears. CONFIG_IRQ_TIME_ACCOUNTING causes
> > serious performance impact when running hrtimer stress test at the same
> time.
> 
> I think instead of disabling the option for all arm64 devices, it might be better
> to analyze the root-cause why the hrtimer test hangs when this option is
> enabled.
> 
> +Cc hrtimer maintainers: Thomas and Anna-Maria
> 
[Alison] Yes, I agree. I hope others can give me some clues about the root cause.


Best Regards,
Alison Wang
Vladimir Oltean July 29, 2020, 9:49 a.m. UTC | #3
On Wed, Jul 29, 2020 at 10:40:29AM +0200, Kurt Kanzenbach wrote:
> Hi Alison,
> 
> On Wed Jul 29 2020, Alison Wang wrote:
> > In the current arm64 defconfig, CONFIG_IRQ_TIME_ACCOUNTING is enabled as
> > default. According to my tests on NXP's LayerScape and i.MX platforms,
> > the system hangs when running the command "stress-ng --hrtimers 1" with
> > CONFIG_IRQ_TIME_ACCOUNTING enabled. Disabling this option, the issue
> > disappears. CONFIG_IRQ_TIME_ACCOUNTING causes serious performance impact
> > when running hrtimer stress test at the same time.
> 
> I think instead of disabling the option for all arm64 devices, it might
> be better to analyze the root-cause why the hrtimer test hangs when this
> option is enabled.
> 
> +Cc hrtimer maintainers: Thomas and Anna-Maria
> 
> Thanks,
> Kurt

For more context, here is my original report of the issue:
https://lkml.org/lkml/2020/6/4/1062

Just like you, I could not reproduce the RCU stalls and system hang on a
5.6-rt kernel, just on mainline and derivatives, using the plain
defconfig.

The issue is not specific to Layerscape or i.MX8, but rather I was able
to see the same behavior on Marvell Armada 37xx as well as Qualcomm
MSM8976.

So, while of course I agree that disabling IRQ time accounting for arm64
isn't a real solution, it isn't an exaggerated proposal either.
Nonetheless, the patch is just an RFC and should be treated as such. We
are at a loss when it comes to debugging this any further and we would
appreciate some pointers.

Thanks,
-Vladimir
Kurt Kanzenbach July 30, 2020, 7:23 a.m. UTC | #4
Hi Vladimir,

On Wed Jul 29 2020, Vladimir Oltean wrote:
> For more context, here is my original report of the issue:
> https://lkml.org/lkml/2020/6/4/1062
>
> Just like you, I could not reproduce the RCU stalls and system hang on a
> 5.6-rt kernel, just on mainline and derivatives, using the plain
> defconfig.
>
> The issue is not specific to Layerscape or i.MX8, but rather I was able
> to see the same behavior on Marvell Armada 37xx as well as Qualcomm
> MSM8976.
>
> So, while of course I agree that disabling IRQ time accounting for arm64
> isn't a real solution, it isn't by far an exaggerated proposal either.
> Nonetheless, the patch is just a RFC and should be treated as such. We
> are at a loss when it comes to debugging this any further and we would
> appreciate some pointers.

Yeah, sure. I'll try to reproduce this issue first. So it triggers with:

 * arm64
 * mainline, not -rt kernel
 * opened serial console
 * irq accounting enabled

Anything else?

Thanks,
Kurt
Vladimir Oltean July 30, 2020, 8:22 a.m. UTC | #5
On Thu, Jul 30, 2020 at 09:23:44AM +0200, Kurt Kanzenbach wrote:
> Hi Vladimir,
> 
> On Wed Jul 29 2020, Vladimir Oltean wrote:
> > For more context, here is my original report of the issue:
> > https://lkml.org/lkml/2020/6/4/1062
> >
> > Just like you, I could not reproduce the RCU stalls and system hang on a
> > 5.6-rt kernel, just on mainline and derivatives, using the plain
> > defconfig.
> >
> > The issue is not specific to Layerscape or i.MX8, but rather I was able
> > to see the same behavior on Marvell Armada 37xx as well as Qualcomm
> > MSM8976.
> >
> > So, while of course I agree that disabling IRQ time accounting for arm64
> > isn't a real solution, it isn't by far an exaggerated proposal either.
> > Nonetheless, the patch is just a RFC and should be treated as such. We
> > are at a loss when it comes to debugging this any further and we would
> > appreciate some pointers.
> 
> Yeah, sure. I'll try to reproduce this issue first. So it triggers with:
> 
>  * arm64
>  * mainline, not -rt kernel
>  * opened serial console
>  * irq accounting enabled
> 
> Anything else?
> 
> Thanks,
> Kurt

Thanks for giving a helping hand, Kurt. The defconfig should be enough.
In the interest of full disclosure, the only arm64 device on which we
didn't reproduce this was the 16-core LX2160A. But we did reproduce it
on that board with maxcpus=1, and also on msm8976 with all 8 cores
booted. Just mentioning this in case you're testing on a 16-core system;
you might want to reduce the core count a bit.

Thanks,
-Vladimir
Kurt Kanzenbach Aug. 3, 2020, 8:04 a.m. UTC | #6
On Thu Jul 30 2020, Vladimir Oltean wrote:
> On Thu, Jul 30, 2020 at 09:23:44AM +0200, Kurt Kanzenbach wrote:
>> On Wed Jul 29 2020, Vladimir Oltean wrote:
>> > For more context, here is my original report of the issue:
>> > https://lkml.org/lkml/2020/6/4/1062
>> >
>> > Just like you, I could not reproduce the RCU stalls and system hang on a
>> > 5.6-rt kernel, just on mainline and derivatives, using the plain
>> > defconfig.
>> >
>> > The issue is not specific to Layerscape or i.MX8, but rather I was able
>> > to see the same behavior on Marvell Armada 37xx as well as Qualcomm
>> > MSM8976.
>> >
>> > So, while of course I agree that disabling IRQ time accounting for arm64
>> > isn't a real solution, it isn't by far an exaggerated proposal either.
>> > Nonetheless, the patch is just a RFC and should be treated as such. We
>> > are at a loss when it comes to debugging this any further and we would
>> > appreciate some pointers.
>> 
>> Yeah, sure. I'll try to reproduce this issue first. So it triggers with:
>> 
>>  * arm64
>>  * mainline, not -rt kernel
>>  * opened serial console
>>  * irq accounting enabled
>> 
>> Anything else?
>> 
>> Thanks,
>> Kurt
>
> Thanks for giving a helping hand, Kurt. The defconfig should be enough.
> In the interest of full disclosure, the only arm64 device on which we
> didn't reproduce this was the 16-core LX2160A. But we did reproduce on
> that with maxcpus=1 though. And also on msm8976 with all 8 cores booted.
> Just mentioning this in case you're testing on a 16-core system, you
> might want to reduce the number a bit.

OK. I've reproduced it on a Marvell Armada SoC with v5.6 mainline. See
splats below. Running with irq time accounting enabled kills the
machine immediately. However, I'm not getting the possible deadlock
warnings in 8250 that you did, so that might be unrelated.

Unfortunately I have no idea what to debug here.

Thanks,
Kurt

Splats:

|root@marvell ~ # [  150.996487] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
|[  150.999764]  (detected by 0, t=2602 jiffies, g=9697, q=24)
|[  151.005403] rcu: All QSes seen, last rcu_sched kthread activity 2602 (4294952389-4294949787), jiffies_till_next_fqs=1, root ->qsmask 0x0
|[  151.018034] stress-ng-hrtim R  running task        0  2251   2248 0x00000001
|[  151.025290] Call trace:
|[  151.027808]  dump_backtrace+0x0/0x198
|[  151.031563]  show_stack+0x1c/0x28
|[  151.034971]  sched_show_task+0x224/0x248
|[  151.039002]  rcu_sched_clock_irq+0x7c4/0x840
|[  151.043392]  update_process_times+0x34/0x70
|[  151.047694]  tick_sched_handle.isra.0+0x34/0x70
|[  151.052350]  tick_sched_timer+0x50/0xa0
|[  151.056295]  __hrtimer_run_queues+0x18c/0x5b0
|[  151.060773]  hrtimer_interrupt+0xec/0x248
|[  151.064896]  arch_timer_handler_phys+0x38/0x48
|[  151.069465]  handle_percpu_devid_irq+0xd0/0x3c8
|[  151.074124]  generic_handle_irq+0x2c/0x40
|[  151.078246]  __handle_domain_irq+0x68/0xc0
|[  151.082456]  gic_handle_irq+0x64/0x150
|[  151.086308]  el1_irq+0xbc/0x140
|[  151.089536]  lock_acquire+0xdc/0x250
|[  151.093210]  __might_fault+0x68/0x88
|[  151.096880]  setup_rt_frame+0x2a4/0xee8
|[  151.100823]  do_notify_resume+0x3b0/0x488
|[  151.104945]  work_pending+0x8/0x14
|[  151.108443] rcu: rcu_sched kthread starved for 2613 jiffies! g9697 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=1
|[  151.118741] rcu: RCU grace-period kthread stack dump:
|[  151.123939] rcu_sched       R  running task        0    10      2 0x00000028
|[  151.131196] Call trace:
|[  151.133707]  __switch_to+0x104/0x170
|[  151.137384]  __schedule+0x36c/0x860
|[  151.140966]  schedule+0x7c/0x108
|[  151.144280]  schedule_timeout+0x204/0x4c0
|[  151.148403]  rcu_gp_kthread+0x5b0/0x1798
|[  151.152435]  kthread+0x110/0x140
|[  151.155749]  ret_from_fork+0x10/0x18
|[  181.665482] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 55s!
|[  181.671022] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 56s!
|[  181.679198] BUG: workqueue lockup - pool cpus=0-1 flags=0x4 nice=0 stuck for 56s!
|[  181.686902] Showing busy workqueues and worker pools:
|[  181.692059] workqueue events: flags=0x0
|[  181.696226]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256 refcnt=3
|[  181.703039]     pending: dbs_work_handler, mv88e6xxx_ptp_overflow_check
|[  181.709826]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256 refcnt=3
|[  181.716783]     pending: vmstat_shepherd, switchdev_deferred_process_work
|[  181.723804] workqueue events_unbound: flags=0x2
|[  181.728526]   pwq 4: cpus=0-1 flags=0x4 nice=0 active=1/512 refcnt=3
|[  181.735029]     pending: flush_to_ldisc
|[  181.738947] workqueue events_freezable: flags=0x4
|[  181.743755]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
|[  181.750741]     pending: mmc_rescan
|[  181.754359] workqueue events_power_efficient: flags=0x82
|[  181.759825]   pwq 4: cpus=0-1 flags=0x4 nice=0 active=5/256 refcnt=7
|[  181.766329]     pending: gc_worker, neigh_periodic_work, neigh_periodic_work, do_cache_clean, check_lifetime
|[  181.776488] workqueue mm_percpu_wq: flags=0x8
|[  181.780942]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
|[  181.787924]     pending: vmstat_update
|[  181.791808] workqueue pm: flags=0x4
|[  181.795368]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
|[  181.802349]     pending: pm_runtime_work
|[  181.806412] workqueue writeback: flags=0x4a
|[  181.810686]   pwq 4: cpus=0-1 flags=0x4 nice=0 active=1/256 refcnt=3
|[  181.817224]     pending: wb_workfn
|[  181.820756] workqueue ipv6_addrconf: flags=0x40008
|[  181.825684]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1 refcnt=2
|[  181.832456]     pending: addrconf_verify_work
|[  181.836968] workqueue nfsd4: flags=0x2
|[  181.840794]   pwq 4: cpus=0-1 flags=0x4 nice=0 active=1/256 refcnt=3
|[  181.847329]     pending: laundromat_main
|[  229.043792] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
|[  229.047053]  (detected by 0, t=10407 jiffies, g=9697, q=29)
|[  229.052785] rcu: All QSes seen, last rcu_sched kthread activity 10407 (4294960194-4294949787), jiffies_till_next_fqs=1, root ->qsmask 0x0
|[  229.065505] stress-ng-hrtim R  running task        0  2251   2248 0x00000001
|[  229.072763] Call trace:
|[  229.075276]  dump_backtrace+0x0/0x198
|[  229.079037]  show_stack+0x1c/0x28
|[  229.082443]  sched_show_task+0x224/0x248
|[  229.086476]  rcu_sched_clock_irq+0x7c4/0x840
|[  229.090866]  update_process_times+0x34/0x70
|[  229.095165]  tick_sched_handle.isra.0+0x34/0x70
|[  229.099824]  tick_sched_timer+0x50/0xa0
|[  229.103767]  __hrtimer_run_queues+0x18c/0x5b0
|[  229.108246]  hrtimer_interrupt+0xec/0x248
|[  229.112368]  arch_timer_handler_phys+0x38/0x48
|[  229.116938]  handle_percpu_devid_irq+0xd0/0x3c8
|[  229.121597]  generic_handle_irq+0x2c/0x40
|[  229.125718]  __handle_domain_irq+0x68/0xc0
|[  229.129930]  gic_handle_irq+0x64/0x150
|[  229.133781]  el1_irq+0xbc/0x140
|[  229.137008]  lock_acquire+0xdc/0x250
|[  229.140682]  __might_fault+0x68/0x88
|[  229.144354]  preserve_fpsimd_context+0x160/0x2f8
|[  229.149101]  setup_rt_frame+0x48c/0xee8
|[  229.153044]  do_notify_resume+0x3b0/0x488
|[  229.157167]  work_pending+0x8/0x14
|[  229.160666] rcu: rcu_sched kthread starved for 10418 jiffies! g9697 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=1
|[  229.171053] rcu: RCU grace-period kthread stack dump:
|[  229.176250] rcu_sched       R  running task        0    10      2 0x00000028
|[  229.183506] Call trace:
|[  229.186019]  __switch_to+0x104/0x170
|[  229.189697]  __schedule+0x36c/0x860
|[  229.193277]  schedule+0x7c/0x108
|[  229.196592]  schedule_timeout+0x204/0x4c0
|[  229.200714]  rcu_gp_kthread+0x5b0/0x1798
|[  229.204747]  kthread+0x110/0x140
|[  229.208061]  ret_from_fork+0x10/0x18
Vladimir Oltean Aug. 3, 2020, 8:16 a.m. UTC | #7
On Mon, Aug 03, 2020 at 10:04:01AM +0200, Kurt Kanzenbach wrote:
> On Thu Jul 30 2020, Vladimir Oltean wrote:
> > On Thu, Jul 30, 2020 at 09:23:44AM +0200, Kurt Kanzenbach wrote:
> >> On Wed Jul 29 2020, Vladimir Oltean wrote:
> >> > For more context, here is my original report of the issue:
> >> > https://lkml.org/lkml/2020/6/4/1062
> >> >
> >> > Just like you, I could not reproduce the RCU stalls and system hang on a
> >> > 5.6-rt kernel, just on mainline and derivatives, using the plain
> >> > defconfig.
> >> >
> >> > The issue is not specific to Layerscape or i.MX8, but rather I was able
> >> > to see the same behavior on Marvell Armada 37xx as well as Qualcomm
> >> > MSM8976.
> >> >
> >> > So, while of course I agree that disabling IRQ time accounting for arm64
> >> > isn't a real solution, it isn't by far an exaggerated proposal either.
> >> > Nonetheless, the patch is just a RFC and should be treated as such. We
> >> > are at a loss when it comes to debugging this any further and we would
> >> > appreciate some pointers.
> >> 
> >> Yeah, sure. I'll try to reproduce this issue first. So it triggers with:
> >> 
> >>  * arm64
> >>  * mainline, not -rt kernel
> >>  * opened serial console
> >>  * irq accounting enabled
> >> 
> >> Anything else?
> >> 
> >> Thanks,
> >> Kurt
> >
> > Thanks for giving a helping hand, Kurt. The defconfig should be enough.
> > In the interest of full disclosure, the only arm64 device on which we
> > didn't reproduce this was the 16-core LX2160A. But we did reproduce on
> > that with maxcpus=1 though. And also on msm8976 with all 8 cores booted.
> > Just mentioning this in case you're testing on a 16-core system, you
> > might want to reduce the number a bit.
> 
> OK. I've reproduced it on a Marvell Armada SoC with v5.6 mainline. See
> splats below. Running with irq time accounting enabled, kills the
> machine immediately. However, I'm not getting the possible deadlock
> warnings in 8250 as you did. So that might be unrelated.
> 

Yes, the console lockdep warnings are unrelated. They are discussed
here:
https://lore.kernel.org/lkml/CAHQZ30BnfX+gxjPm1DUd5psOTqbyDh4EJE=2=VAMW_VDafctkA@mail.gmail.com/

> Unfortunately I have no idea what to debug here.
> 
> Thanks,
> Kurt

So, this means we could submit a formal version of this patch? :)

Thanks,
-Vladimir
Robin Murphy Aug. 3, 2020, 9:51 a.m. UTC | #8
On 2020-08-03 09:16, Vladimir Oltean wrote:
> On Mon, Aug 03, 2020 at 10:04:01AM +0200, Kurt Kanzenbach wrote:
>> On Thu Jul 30 2020, Vladimir Oltean wrote:
>>> On Thu, Jul 30, 2020 at 09:23:44AM +0200, Kurt Kanzenbach wrote:
>>>> On Wed Jul 29 2020, Vladimir Oltean wrote:
>>>>> For more context, here is my original report of the issue:
>>>>> https://lkml.org/lkml/2020/6/4/1062
>>>>>
>>>>> Just like you, I could not reproduce the RCU stalls and system hang on a
>>>>> 5.6-rt kernel, just on mainline and derivatives, using the plain
>>>>> defconfig.
>>>>>
>>>>> The issue is not specific to Layerscape or i.MX8, but rather I was able
>>>>> to see the same behavior on Marvell Armada 37xx as well as Qualcomm
>>>>> MSM8976.
>>>>>
>>>>> So, while of course I agree that disabling IRQ time accounting for arm64
>>>>> isn't a real solution, it isn't by far an exaggerated proposal either.
>>>>> Nonetheless, the patch is just a RFC and should be treated as such. We
>>>>> are at a loss when it comes to debugging this any further and we would
>>>>> appreciate some pointers.
>>>>
>>>> Yeah, sure. I'll try to reproduce this issue first. So it triggers with:
>>>>
>>>>   * arm64
>>>>   * mainline, not -rt kernel
>>>>   * opened serial console
>>>>   * irq accounting enabled
>>>>
>>>> Anything else?
>>>>
>>>> Thanks,
>>>> Kurt
>>>
>>> Thanks for giving a helping hand, Kurt. The defconfig should be enough.
>>> In the interest of full disclosure, the only arm64 device on which we
>>> didn't reproduce this was the 16-core LX2160A. But we did reproduce on
>>> that with maxcpus=1 though. And also on msm8976 with all 8 cores booted.
>>> Just mentioning this in case you're testing on a 16-core system, you
>>> might want to reduce the number a bit.
>>
>> OK. I've reproduced it on a Marvell Armada SoC with v5.6 mainline. See
>> splats below. Running with irq time accounting enabled, kills the
>> machine immediately. However, I'm not getting the possible deadlock
>> warnings in 8250 as you did. So that might be unrelated.
>>
> 
> Yes, the console lockdep warnings are unrelated. They are discussed
> here:
> https://lore.kernel.org/lkml/CAHQZ30BnfX+gxjPm1DUd5psOTqbyDh4EJE=2=VAMW_VDafctkA@mail.gmail.com/
> 
>> Unfortunately I have no idea what to debug here.
>>
>> Thanks,
>> Kurt
> 
> So, this means we could submit a formal version of this patch? :)

Defconfig is absolutely not the appropriate way to work around bugs - 
it's merely a starting point for users and distros to set up their own 
kernel, and if they can still enable this option and render their system 
unusable then patching some other config that they aren't using is 
pointless. To usefully mitigate a problem you'd need to make sure the 
offending option cannot be selected at all (i.e. prohibit 
HAVE_IRQ_TIME_ACCOUNTING as well).

Having glanced across another thread that mentions IRQ accounting 
recently[1], I wonder if the underlying bug here might have something to
do with the stuff that Marc's trying to clean up.

Robin.

[1] 
https://lore.kernel.org/linux-arm-kernel/20200624195811.435857-16-maz@kernel.org/
Thomas Gleixner Aug. 3, 2020, 10:02 a.m. UTC | #9
Vladimir Oltean <olteanv@gmail.com> writes:
> On Mon, Aug 03, 2020 at 10:04:01AM +0200, Kurt Kanzenbach wrote:
>> Unfortunately I have no idea what to debug here.
>
> So, this means we could submit a formal version of this patch? :)

To paper over the underlying problem. Oh well...
Thomas Gleixner Aug. 3, 2020, 10:49 a.m. UTC | #10
Kurt,

Kurt Kanzenbach <kurt.kanzenbach@linutronix.de> writes:
> On Thu Jul 30 2020, Vladimir Oltean wrote:
> OK. I've reproduced it on a Marvell Armada SoC with v5.6 mainline. See
> splats below. Running with irq time accounting enabled, kills the
> machine immediately. However, I'm not getting the possible deadlock
> warnings in 8250 as you did. So that might be unrelated.
>
> Unfortunately I have no idea what to debug here.

let's look at the facts we have so far.

 1) When irq accounting is disabled, RT throttling kicks in as
    expected.

 2) With irq accounting the RT throttler does not kick in and the RCU
    stall/lockups happen.

Not much, but there is clearly interaction between irq time accounting
and scheduler accounting.

Can you please reduce /proc/sys/kernel/sched_rt_runtime_us in steps of
50000? At least in theory, reduction should bring the throttling back.

Thanks,

        tglx
Vladimir Oltean Aug. 3, 2020, 11:38 a.m. UTC | #11
On Mon, Aug 03, 2020 at 10:51:32AM +0100, Robin Murphy wrote:
> 
> Having glanced across another thread that mentions IRQ accounting
> recently[1], I wonder if the underlying bug here might have something do to
> with the stuff that Marc's trying to clean up.
> 
> Robin.
> 
> [1] https://lore.kernel.org/linux-arm-kernel/20200624195811.435857-16-maz@kernel.org/

Thanks Robin. I've applied Marc's "[PATCH v2 00/17] arm/arm64: Turning
IPIs into normal interrupts" series and the LS1028A I'm debugging hangs
in absolutely the same way.

-Vladimir
Vladimir Oltean Aug. 3, 2020, 11:41 a.m. UTC | #12
Hi Thomas,

On Mon, Aug 03, 2020 at 12:49:36PM +0200, Thomas Gleixner wrote:
> Kurt,
> 
> Kurt Kanzenbach <kurt.kanzenbach@linutronix.de> writes:
> > On Thu Jul 30 2020, Vladimir Oltean wrote:
> > OK. I've reproduced it on a Marvell Armada SoC with v5.6 mainline. See
> > splats below. Running with irq time accounting enabled, kills the
> > machine immediately. However, I'm not getting the possible deadlock
> > warnings in 8250 as you did. So that might be unrelated.
> >
> > Unfortunately I have no idea what to debug here.
> 
> lets look at the facts we have so far.
> 
>  1) When irq accounting is disabled, RT throttling kicks in as
>     expected.
> 
>  2) With irq accounting the RT throttler does not kick in and the RCU
>     stall/lockups happen.
> 
> Not much, but there is clearly interaction between irq time accounting
> and scheduler accounting.
> 
> Can you please reduce /proc/sys/kernel/sched_rt_runtime_us in steps of
> 50000? At least in theory, reduction should bring the throttling back.
> 
> Thanks,
> 
>         tglx
> 

We already know from my original report that reducing
sched_rt_runtime_us to 80% (800000) gives us a stable system.
https://lkml.org/lkml/2020/6/4/1062

What is this telling us?

Thanks,
-Vladimir
Valentin Schneider Aug. 3, 2020, 11:48 a.m. UTC | #13
On 03/08/20 12:38, Vladimir Oltean wrote:
> On Mon, Aug 03, 2020 at 10:51:32AM +0100, Robin Murphy wrote:
>> 
>> Having glanced across another thread that mentions IRQ accounting
>> recently[1], I wonder if the underlying bug here might have something do to
>> with the stuff that Marc's trying to clean up.
>> 
>> Robin.
>> 
>> [1] https://lore.kernel.org/linux-arm-kernel/20200624195811.435857-16-maz@kernel.org/
>
> Thanks Robin. I've applied Marc's "[PATCH v2 00/17] arm/arm64: Turning
> IPIs into normal interrupts" series and the LS1028A I'm debugging hangs
> in absolutely the same way.
>

I'm not too surprised by that; wrt accounting, this mostly changes where the
stores go to and barely shuffles when they happen (slightly earlier on the
IPI handling path).

FWIW I've had 'stress-ng --hrtimers 1' running on my Juno and eMAG for ~15
minutes and haven't had a splat yet. 

> -Vladimir
Marc Zyngier Aug. 3, 2020, 1:24 p.m. UTC | #14
On 2020-08-03 12:48, Valentin Schneider wrote:
> On 03/08/20 12:38, Vladimir Oltean wrote:
>> On Mon, Aug 03, 2020 at 10:51:32AM +0100, Robin Murphy wrote:
>>> 
>>> Having glanced across another thread that mentions IRQ accounting
>>> recently[1], I wonder if the underlying bug here might have something 
>>> do to
>>> with the stuff that Marc's trying to clean up.
>>> 
>>> Robin.
>>> 
>>> [1] 
>>> https://lore.kernel.org/linux-arm-kernel/20200624195811.435857-16-maz@kernel.org/
>> 
>> Thanks Robin. I've applied Marc's "[PATCH v2 00/17] arm/arm64: Turning
>> IPIs into normal interrupts" series and the LS1028A I'm debugging 
>> hangs
>> in absolutely the same way.
>> 
> 
> I'm not too surprised by that, wrt accounting this mostly changes where 
> the
> stores go to and barely shuffles when they happen (slightly earlier on 
> the
> IPI handling path).

Indeed. This series is just a "let's make things be the way they should be",
and isn't really fixing any bug. It actually may introduce a couple...

> FWIW I've had 'stress-ng --hrtimers 1' running on my Juno and eMAG for 
> ~15
> minutes and haven't had a splat yet.

I've started a couple of VMs with that workload too. 400K irq/s on an
8 vcpu guest, nothing to report so far.

But removing IRQ_TIME_ACCOUNTING from defconfig is a resounding "no way!".
It looks like we have a bug, and it should be squashed, not glossed over.

         M.
Thomas Gleixner Aug. 3, 2020, 3:13 p.m. UTC | #15
Vladimir Oltean <olteanv@gmail.com> writes:
>>  1) When irq accounting is disabled, RT throttling kicks in as
>>     expected.
>> 
>>  2) With irq accounting the RT throttler does not kick in and the RCU
>>     stall/lockups happen.
>> 
>> Not much, but there is clearly interaction between irq time accounting
>> and scheduler accounting.
>> 
>> Can you please reduce /proc/sys/kernel/sched_rt_runtime_us in steps of
>> 50000? At least in theory, reduction should bring the throttling back.
>
> We already know from my original report that reducing
> sched_rt_runtime_us to 80% (800000) gives us a stable system.
> https://lkml.org/lkml/2020/6/4/1062

Oh. Somehow missed that. Btw, can you please use links in the form of:

    https://lore.kernel.org/r/$MSGID

lkml.org is spotty and having the message id allows searching in the
mail client as well.

> What is this telling us?

It seems that the fine-grained irq time accounting affects the runtime
accounting in some way which I haven't figured out yet.

Thanks,

        tglx
Valentin Schneider Aug. 3, 2020, 3:47 p.m. UTC | #16
On 03/08/20 16:13, Thomas Gleixner wrote:
> Vladimir Oltean <olteanv@gmail.com> writes:
>>>  1) When irq accounting is disabled, RT throttling kicks in as
>>>     expected.
>>> 
>>>  2) With irq accounting the RT throttler does not kick in and the RCU
>>>     stall/lockups happen.
>> What is this telling us?
>
> It seems that the fine grained irq time accounting affects the runtime
> accounting in some way which I haven't figured out yet.
>

With IRQ_TIME_ACCOUNTING, rq_clock_task() will always be incremented by a
lesser-or-equal value than when not having the option; you start with the
same delta_exec but slice some for the IRQ accounting, and leave the rest
for the rq_clock_task() (+paravirt).
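
For reference, this is roughly where that slicing happens (a simplified
excerpt of update_rq_clock_task() from kernel/sched/core.c; the paravirt
steal-time handling and the accounting of irq_delta into cpustat are
elided):

    static void update_rq_clock_task(struct rq *rq, s64 delta)
    {
            s64 irq_delta = 0;

    #ifdef CONFIG_IRQ_TIME_ACCOUNTING
            /* per-CPU IRQ time accrued since the last update */
            irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
            if (irq_delta > delta)
                    irq_delta = delta;
            rq->prev_irq_time += irq_delta;

            /* carve the IRQ share out of the raw delta... */
            delta -= irq_delta;
    #endif
            /* ...so rq_clock_task(), and hence rt_time, only sees the rest */
            rq->clock_task += delta;
    }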

IIUC this means that if you spend e.g. 10% of the time in IRQ and 90% of
the time running the stress-ng RT tasks, despite having RT tasks hogging
the entirety of the "available time" it is still only 90% runtime, which is
below the 95% default and the throttling doesn't happen.

I don't know if considering IRQ time in some way or another in
sched_rt_runtime_exceeded() really is a way out here.

> Thanks,
>
>         tglx
Vladimir Oltean Aug. 3, 2020, 4:14 p.m. UTC | #17
On Mon, Aug 03, 2020 at 04:47:03PM +0100, Valentin Schneider wrote:
> 
> On 03/08/20 16:13, Thomas Gleixner wrote:
> > Vladimir Oltean <olteanv@gmail.com> writes:
> >>>  1) When irq accounting is disabled, RT throttling kicks in as
> >>>     expected.
> >>> 
> >>>  2) With irq accounting the RT throttler does not kick in and the RCU
> >>>     stall/lockups happen.
> >> What is this telling us?
> >
> > It seems that the fine grained irq time accounting affects the runtime
> > accounting in some way which I haven't figured out yet.
> >
> 
> With IRQ_TIME_ACCOUNTING, rq_clock_task() will always be incremented by a
> lesser-or-equal value than when not having the option; you start with the
> same delta_exec but slice some for the IRQ accounting, and leave the rest
> for the rq_clock_task() (+paravirt).
> 
> IIUC this means that if you spend e.g. 10% of the time in IRQ and 90% of
> the time running the stress-ng RT tasks, despite having RT tasks hogging
> the entirety of the "available time" it is still only 90% runtime, which is
> below the 95% default and the throttling doesn't happen.
> 
> I don't know if considering IRQ time in some way or another in
> sched_rt_runtime_exceeded() really is a way out here.
> 
> > Thanks,
> >
> >         tglx
> 

Not sure if this is relevant at all to the conversation we're having
(sorry if it isn't), but FWIW, on my armv7 board, enabling
CONFIG_IRQ_TIME_ACCOUNTING makes it completely unable to boot (the log
below is just earlyprintk):

Starting kernel ...

[    0.000000] Booting Linux on physical CPU 0xf00
[    0.000000] Linux version 5.8.0-rc7-02154-g1a0b51626a14-dirty (tigrisor@skbuf) (arm-linux-gnueabihf-gcc (Linaro GCC 7.3-2018.05) 7.3.1 20180425 [linaro-7.3-2018.05 revision d29120a424ecfbc167ef90065c0eeb7f91977701], GNU ld (Linaro_Binutils-2018.05) 2.28.2.20170706) #383 SMP Mon Aug 3 19:03:50 EEST 2020
[    0.000000] CPU: ARMv7 Processor [410fc075] revision 5 (ARMv7), cr=10c5387d
[    0.000000] CPU: div instructions available: patching division code
[    0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
[    0.000000] OF: fdt: Machine model: NXP LS1021A-TSN Board
[    0.000000] earlycon: uart8250 at MMIO 0x021c0500 (options '')
[    0.000000] printk: bootconsole [uart8250] enabled
[    0.000000] Memory policy: Data cache writealloc
[    0.000000] efi: UEFI not found.
[    0.000000] Ignoring RAM at 0xb0000000-0xc0000000
[    0.000000] Consider using a HIGHMEM enabled kernel.
[    0.000000] cma: Reserved 64 MiB at 0xac000000
[    0.000000] Zone ranges:
[    0.000000]   Normal   [mem 0x0000000080000000-0x00000000afffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000080000000-0x00000000afffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000080000000-0x00000000afffffff]
[    0.000000] percpu: Embedded 20 pages/cpu s51660 r8192 d22068 u81920
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 194880
[    0.000000] Kernel command line: console=ttyS0,115200 earlycon=uart8250,mmio,0x21c0500 root=/dev/mmcblk0p2 rootwait rw
[    0.000000] Dentry cache hash table entries: 131072 (order: 7, 524288 bytes, linear)
[    0.000000] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes, linear)
[    0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.000000] Memory: 687944K/786432K available (12288K kernel code, 1026K rwdata, 6980K rodata, 1024K init, 279K bss, 32952K reserved, 65536K cma-reserved)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=2, Nodes=1
[    0.000000] rcu: Hierarchical RCU implementation.
[    0.000000] rcu:     RCU event tracing is enabled.
[    0.000000] rcu:     RCU restricting CPUs from NR_CPUS=16 to nr_cpu_ids=2.
[    0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 10 jiffies.
[    0.000000] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=2
[    0.000000] NR_IRQS: 16, nr_irqs: 16, preallocated irqs: 16
[    0.000000] random: get_random_bytes called from start_kernel+0x36c/0x54c with crng_init=0
[    0.000000] timer_probe: no matching timers found
[    0.000000] Console: colour dummy device 80x30
[    0.000000] sched_clock: 32 bits at 100 Hz, resolution 10000000ns, wraps every 21474836475000000ns

Thanks,
-Vladimir
Thomas Gleixner Aug. 3, 2020, 7:22 p.m. UTC | #18
Valentin,

Valentin Schneider <valentin.schneider@arm.com> writes:
> On 03/08/20 16:13, Thomas Gleixner wrote:
>> Vladimir Oltean <olteanv@gmail.com> writes:
>>>>  1) When irq accounting is disabled, RT throttling kicks in as
>>>>     expected.
>>>> 
>>>>  2) With irq accounting the RT throttler does not kick in and the RCU
>>>>     stall/lockups happen.
>>> What is this telling us?
>>
>> It seems that the fine grained irq time accounting affects the runtime
>> accounting in some way which I haven't figured out yet.
>>
>
> With IRQ_TIME_ACCOUNTING, rq_clock_task() will always be incremented by a
> lesser-or-equal value than when not having the option; you start with the
> same delta_exec but slice some for the IRQ accounting, and leave the rest
> for the rq_clock_task() (+paravirt).
>
> IIUC this means that if you spend e.g. 10% of the time in IRQ and 90% of
> the time running the stress-ng RT tasks, despite having RT tasks hogging
> the entirety of the "available time" it is still only 90% runtime, which is
> below the 95% default and the throttling doesn't happen.

   totaltime = irqtime + tasktime

Ignoring irqtime and pretending that totaltime is what the scheduler
can control and deal with is naive at best.

Thanks,

        tglx
Valentin Schneider Aug. 3, 2020, 11:59 p.m. UTC | #19
On 03/08/20 20:22, Thomas Gleixner wrote:
> Valentin,
>
> Valentin Schneider <valentin.schneider@arm.com> writes:
>> On 03/08/20 16:13, Thomas Gleixner wrote:
>>> Vladimir Oltean <olteanv@gmail.com> writes:
>>>>>  1) When irq accounting is disabled, RT throttling kicks in as
>>>>>     expected.
>>>>>
>>>>>  2) With irq accounting the RT throttler does not kick in and the RCU
>>>>>     stall/lockups happen.
>>>> What is this telling us?
>>>
>>> It seems that the fine grained irq time accounting affects the runtime
>>> accounting in some way which I haven't figured out yet.
>>>
>>
>> With IRQ_TIME_ACCOUNTING, rq_clock_task() will always be incremented by a
>> lesser-or-equal value than when not having the option; you start with the
>> same delta_exec but slice some for the IRQ accounting, and leave the rest
>> for the rq_clock_task() (+paravirt).
>>
>> IIUC this means that if you spend e.g. 10% of the time in IRQ and 90% of
>> the time running the stress-ng RT tasks, despite having RT tasks hogging
>> the entirety of the "available time" it is still only 90% runtime, which is
>> below the 95% default and the throttling doesn't happen.
>
>    totaltime = irqtime + tasktime
>
> Ignoring irqtime and pretending that totaltime is what the scheduler
> can control and deal with is naive at best.
>

Agreed, however AFAICT rt_time is only incremented by rq_clock_task()
deltas, which don't include IRQ time with IRQ_TIME_ACCOUNTING=y. That would
then be directly compared to the sysctl runtime.

Adding some prints in sched_rt_runtime_exceeded() and running this test
case on my Juno, I get:

  # IRQ_TIME_ACCOUNTING=y
  cpu=2 rt_time=713455220 runtime=950000000 rq->avg_irq.util_avg=265
  (rt_time oscillates between [70.1e7, 75.1e7]; avg_irq between [220, 270])

  # IRQ_TIME_ACCOUNTING=n
  cpu=2 rt_time=963035300 runtime=949951811
  (rt_time oscillates between [94.1e7, 96.1e7];

Throttling happens for IRQ_TIME_ACCOUNTING=n and doesn't for
IRQ_TIME_ACCOUNTING=y - clearly the accounted rt_time isn't high enough for
that to happen, and it does look like what is missing in rt_time (or what
should be subtracted from the available runtime) is there in the avg_irq.
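
As a back-of-the-envelope check, assuming the default 1 s period and
treating util_avg/1024 as a rough fraction of time spent in hardirq on
this (big) CPU: rt_time ~= 713 ms is ~71% of the period, comfortably
below the 950 ms cap, while avg_irq ~= 265/1024 ~= 26%. The two together
account for roughly the whole CPU, with the missing runtime showing up
as IRQ time.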

Or is that another case where I shouldn't have been writing emails at this
hour?

> Thanks,
>
>         tglx
Dietmar Eggemann Aug. 5, 2020, 8:50 a.m. UTC | #20
On 04/08/2020 01:59, Valentin Schneider wrote:
> 
> On 03/08/20 20:22, Thomas Gleixner wrote:
>> Valentin,
>>
>> Valentin Schneider <valentin.schneider@arm.com> writes:
>>> On 03/08/20 16:13, Thomas Gleixner wrote:
>>>> Vladimir Oltean <olteanv@gmail.com> writes:
>>>>>>  1) When irq accounting is disabled, RT throttling kicks in as
>>>>>>     expected.
>>>>>>
>>>>>>  2) With irq accounting the RT throttler does not kick in and the RCU
>>>>>>     stall/lockups happen.
>>>>> What is this telling us?
>>>>
>>>> It seems that the fine grained irq time accounting affects the runtime
>>>> accounting in some way which I haven't figured out yet.
>>>>
>>>
>>> With IRQ_TIME_ACCOUNTING, rq_clock_task() will always be incremented by a
>>> lesser-or-equal value than when not having the option; you start with the
>>> same delta_exec but slice some for the IRQ accounting, and leave the rest
>>> for the rq_clock_task() (+paravirt).
>>>
>>> IIUC this means that if you spend e.g. 10% of the time in IRQ and 90% of
>>> the time running the stress-ng RT tasks, despite having RT tasks hogging
>>> the entirety of the "available time" it is still only 90% runtime, which is
>>> below the 95% default and the throttling doesn't happen.
>>
>>    totaltime = irqtime + tasktime
>>
>> Ignoring irqtime and pretending that totaltime is what the scheduler
>> can control and deal with is naive at best.
>>
> 
> Agreed, however AFAICT rt_time is only incremented by rq_clock_task()
> deltas, which don't include IRQ time with IRQ_TIME_ACCOUNTING=y. That would
> then be directly compared to the sysctl runtime.
> 
> Adding some prints in sched_rt_runtime_exceeded() and running this test
> case on my Juno, I get:
>   # IRQ_TIME_ACCOUNTING=y
>   cpu=2 rt_time=713455220 runtime=950000000 rq->avg_irq.util_avg=265
>   (rt_time oscillates between [70.1e7, 75.1e7]; avg_irq between [220, 270])
> 
>   # IRQ_TIME_ACCOUNTING=n
>   cpu=2 rt_time=963035300 runtime=949951811
>   (rt_time oscillates between [94.1e7, 96.1e7];
> 
> Throttling happens for IRQ_TIME_ACCOUNTING=n and doesn't for
> IRQ_TIME_ACCOUNTING=y - clearly the accounted rt_time isn't high enough for
> that to happen, and it does look like what is missing in rt_time (or what
> should be subtracted from the available runtime) is there in the avg_irq.

I agree that w/ IRQ_TIME_ACCOUNTING=y rt_rq->rt_time isn't high enough
in this testcase.

stress-ng-hrtim-1655 [001] 462.897733: bprint: update_curr_rt:
rt_rq->rt_time=416716900 rt_rq->rt_runtime=950000000
rt_b->rt_runtime=950000000

The 5% reservation (1 - sched_rt_runtime_us/sched_rt_period_us) for CFS
is massively eclipsed by irqtime.
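
(With the defaults sched_rt_period_us=1000000 and
sched_rt_runtime_us=950000 that reservation is only 50 ms per second,
whereas going by Valentin's avg_irq numbers the hrtimer storm spends on
the order of a couple of hundred ms per second in hard IRQ context.)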

It's true that avg_irq tracks 'irq_delta + steal' time but it is meant
to potentially reduce cpu capacity. It's also cpu and frequency
invariant (your CPU2 is a big CPU so no issue here).

Could an rq_clock(rq)-derived rt_rq signal be used to compare against
rt_runtime?

BTW, DL already influences rt_rq->rt_time.

[...]
Peter Zijlstra Aug. 5, 2020, 1:40 p.m. UTC | #21
On Mon, Aug 03, 2020 at 09:22:53PM +0200, Thomas Gleixner wrote:

>    totaltime = irqtime + tasktime
> 
> Ignoring irqtime and pretending that totaltime is what the scheduler
> can control and deal with is naive at best.

Well no, that's what we call system overhead and is assumed to be
included in the 'error margin'.

The way things are set up is that we say that, by default, RT tasks can
consume 95% of cputime and the remaining 5% is sufficient to keep the
system alive.

Those 5% include all system overhead, IRQs, RCU, !RT workqueues etc..

Obviously IRQ_TIME accounting changes the balance a bit, but that's what
it is. We can't really do anything better.

Apparently this SoC has significant IRQ time for some reason. Also,
relying on RT throttling for 'correct' behaviour is wrong. What needs to
be done is to find out who is using all this RT time and why; that isn't
right.
Valentin Schneider Aug. 5, 2020, 1:56 p.m. UTC | #22
On 05/08/20 14:40, peterz@infradead.org wrote:
> On Mon, Aug 03, 2020 at 09:22:53PM +0200, Thomas Gleixner wrote:
>
>>    totaltime = irqtime + tasktime
>>
>> Ignoring irqtime and pretending that totaltime is what the scheduler
>> can control and deal with is naive at best.
>
> Well no, that's what we call system overhead and is assumed to be
> included in the 'error margin'.
>
> The way things are set up is that we say that, by default, RT tasks can
> consume 95% of cputime and the remaining 5% is sufficient to keep the
> system alive.
>
> Those 5% include all system overhead, IRQs, RCU, !RT workqueues etc..
>
> Obviously IRQ_TIME accounting changes the balance a bit, but that's what
> it is. We can't really do anything better.
>

I'm starting to think that as well. I tried some fugly hack of injecting
avg_irq into sched_rt_runtime_exceeded() with something along the lines of:

  irq_time = (rq->avg_irq.util_avg * sched_rt_period(rt_rq)) >> SCHED_CAPACITY_SHIFT;

It's pretty bad for a few reasons; one is that avg_irq already has its own
period (PELT-based). Another is that it is, as Dietmar pointed out, CPU and
freq invariant, so falls over on big.LITTLE.

Making update_curr_rt() use rq_clock() rather than rq_clock_task() makes it
"work" but goes against all the good reasons there were to introduce
rq_clock_task() in the first place.
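
Concretely, that experiment amounts to something like the following in
update_curr_rt() (kernel/sched/rt.c); sketch only, not a proposal:

        /* now = rq_clock_task(rq); */
        now = rq_clock(rq);     /* includes IRQ time again, i.e. undoes
                                   the split rq_clock_task() exists for */
        delta_exec = now - curr->se.exec_start;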

> Apparently this SoC has significant IRQ time for some reason. Also,
> relying on RT throttling for 'correct' behaviour is also wrong. What
> needs to be done is find who is using all this RT time and why, that
> isn't right.

I've been tempted to say the test case is a bit bogus, but am not familiar
enough with the RT throttling details to stand that ground. That said, from
both looking at the execution and the stress-ng source code, it seems to
unconditionally spawn 32 FIFO-50 tasks (there's even an option to make
these FIFO-99!!!), which is quite a crowd on monoCPU systems.
Peter Zijlstra Aug. 5, 2020, 3:31 p.m. UTC | #23
On Wed, Aug 05, 2020 at 02:56:49PM +0100, Valentin Schneider wrote:

> I've been tempted to say the test case is a bit bogus, but am not familiar
> enough with the RT throttling details to stand that ground. That said, from
> both looking at the execution and the stress-ng source code, it seems to
> unconditionally spawn 32 FIFO-50 tasks (there's even an option to make
> these FIFO-99!!!), which is quite a crowd on monoCPU systems.

Oh, so it's a case of: we do stupid without tuning and the system falls
over. I can live with that.
Thomas Gleixner Aug. 6, 2020, 9:34 a.m. UTC | #24
peterz@infradead.org writes:
> On Mon, Aug 03, 2020 at 09:22:53PM +0200, Thomas Gleixner wrote:
>
>>    totaltime = irqtime + tasktime
>> 
>> Ignoring irqtime and pretending that totaltime is what the scheduler
>> can control and deal with is naive at best.
>
> Well no, that's what we call system overhead and is assumed to be
> included in the 'error margin'.
>
> The way things are set up is that we say that, by default, RT tasks can
> consume 95% of cputime and the remaining 5% is sufficient to keep the
> system alive.
>
> Those 5% include all system overhead, IRQs, RCU, !RT workqueues etc..
>
> Obviously IRQ_TIME accounting changes the balance a bit, but that's what
> it is. We can't really do anything better.
>
> Apparently this SoC has significant IRQ time for some reason. Also,
> relying on RT throttling for 'correct' behaviour is also wrong. What
> needs to be done is find who is using all this RT time and why, that
> isn't right.

It's a test case and we know already who is using the time. But that's
not the point.

A runaway RT task resulting in an RCU stall or whatever lockup of the
system is definitely not the right answer.

The throttler, as much as it's a horrible hack, is there to prevent this
and to give the admin a chance to pinpoint and kill that thing instead
of having to press the reset button and scratch their head over what
might have caused it.

Thanks,

        tglx
Thomas Gleixner Aug. 6, 2020, 9:41 a.m. UTC | #25
peterz@infradead.org writes:
> On Wed, Aug 05, 2020 at 02:56:49PM +0100, Valentin Schneider wrote:
>
>> I've been tempted to say the test case is a bit bogus, but am not familiar
>> enough with the RT throttling details to stand that ground. That said, from
>> both looking at the execution and the stress-ng source code, it seems to
>> unconditionally spawn 32 FIFO-50 tasks (there's even an option to make
>> these FIFO-99!!!), which is quite a crowd on monoCPU systems.
>
> Oh, so it's a case of: we do stupid without tuning and the system falls
> over. I can live with that.

It's not a question of whether you can live with that behaviour for a
particular silly test case.

The same happens with a single RT runaway task with enough interrupt
load on a UP machine. Just validated that. And that has nothing to do
with a silly test case. Sporadic runaways due to a bug in a once per
week code path simply can happen and having the safety net working
depending on a config option selected or not is just wrong.

Thanks,

        tglx
Peter Zijlstra Aug. 6, 2020, 11:45 a.m. UTC | #26
On Thu, Aug 06, 2020 at 11:41:06AM +0200, Thomas Gleixner wrote:
> peterz@infradead.org writes:
> > On Wed, Aug 05, 2020 at 02:56:49PM +0100, Valentin Schneider wrote:
> >
> >> I've been tempted to say the test case is a bit bogus, but am not familiar
> >> enough with the RT throttling details to stand that ground. That said, from
> >> both looking at the execution and the stress-ng source code, it seems to
> >> unconditionally spawn 32 FIFO-50 tasks (there's even an option to make
> >> these FIFO-99!!!), which is quite a crowd on monoCPU systems.
> >
> > Oh, so it's a case of: we do stupid without tuning and the system falls
> > over. I can live with that.
> 
> It's not a question of whether you can live with that behaviour for a
> particular silly test case.
> 
> The same happens with a single RT runaway task with enough interrupt
> load on a UP machine. Just validated that. 

Of course.

> And that has nothing to do
> with a silly test case. Sporadic runaways due to a bug in a once per
> week code path simply can happen and having the safety net working
> depending on a config option selected or not is just wrong.

The safety thing is concerned with RT tasks. It doesn't pretend to help
with runaway IRQs, never has, never will.

The further extreme is an interrupt storm, those have always taken a
machine down.

Accounting unrelated IRQ time to RT tasks is equally wrong, the task
execution is unrelated to the IRQs. The config option at least offers
insight into where time goes -- and it's a config option because doing
time accounting on interrupts adds overhead :/

This really is a no-win all round.

The only 'sensible' option here is threaded IRQs, where the IRQ line
gets disabled until the handler thread has ran, that also helps with IRQ
storms.
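
For reference, the threaded-IRQ shape referred to here looks roughly
like this from a driver's perspective (the my_dev_* names are made up
for illustration); with a NULL primary handler and IRQF_ONESHOT the
line stays masked until the thread function has finished:

    #include <linux/interrupt.h>

    static irqreturn_t my_dev_irq_thread(int irq, void *dev_id)
    {
            /* all of the device work runs here, in task context */
            return IRQ_HANDLED;
    }

    static int my_dev_request_irq(unsigned int irq, void *dev)
    {
            /*
             * NULL primary handler + IRQF_ONESHOT: the line is kept
             * masked until my_dev_irq_thread() returns, which is what
             * bounds the damage of an interrupt storm.
             */
            return request_threaded_irq(irq, NULL, my_dev_irq_thread,
                                        IRQF_ONESHOT, "my_dev", dev);
    }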
Paul E. McKenney Aug. 6, 2020, 1:27 p.m. UTC | #27
On Thu, Aug 06, 2020 at 01:45:45PM +0200, peterz@infradead.org wrote:
> On Thu, Aug 06, 2020 at 11:41:06AM +0200, Thomas Gleixner wrote:
> > peterz@infradead.org writes:
> > > On Wed, Aug 05, 2020 at 02:56:49PM +0100, Valentin Schneider wrote:
> > >
> > >> I've been tempted to say the test case is a bit bogus, but am not familiar
> > >> enough with the RT throttling details to stand that ground. That said, from
> > >> both looking at the execution and the stress-ng source code, it seems to
> > >> unconditionally spawn 32 FIFO-50 tasks (there's even an option to make
> > >> these FIFO-99!!!), which is quite a crowd on monoCPU systems.
> > >
> > > Oh, so it's a case of: we do stupid without tuning and the system falls
> > > over. I can live with that.
> > 
> > It's not a question of whether you can live with that behaviour for a
> > particular silly test case.
> > 
> > The same happens with a single RT runaway task with enough interrupt
> > load on a UP machine. Just validated that. 
> 
> Of course.
> 
> > And that has nothing to do
> > with a silly test case. Sporadic runaways due to a bug in a once per
> > week code path simply can happen and having the safety net working
> > depending on a config option selected or not is just wrong.
> 
> The safety thing is concerned with RT tasks. It doesn't pretend to help
> with runnaway IRQs, never has, never will.

Getting into the time machine back to the 1990s...

DYNIX/ptx had a discretionary mechanism to deal with excessive interrupts.
There was a function that long-running interrupt handlers were supposed
to call periodically that would return false if the system felt that
the CPU had done enough interrupts for the time being.  In that case,
the interrupt handler was supposed to schedule itself for a later time,
but leave the interrupt unacknowledged in order to prevent retriggering
in the meantime.

Of course, this mechanism would be rather less helpful in Linux.

For one, Linux has way more device drivers and way more oddball devices.
In contrast, the few devices that DYNIX/ptx supported were carefully
selected, and the selection criteria included being able to put up
with this sort of thing.  Also, the fact that there was but a handful
of device drivers meant that changes like this could be more easily
propagated through all drivers.

Also, Linux supports way more workloads.  In contrast, DYNIX/ptx could
pick a small percentage of each CPU that would be permitted to be used
by hardware interrupt handlers.  As in there are probably Linux workloads
that run >90% of some poor CPU within hardware interrupt handlers.

But reminiscing anyway on the off-chance that this inspires someone
to come up with an idea that would work well in the Linux environment.

							Thanx, Paul

> The further extreme is an interrupt storm, those have always taken a
> machine down.
> 
> Accounting unrelated IRQ time to RT tasks is equally wrong, the task
> execution is unrelated to the IRQs. The config option at least offers
> insight into where time goes -- and it's a config option because doing
> time accounting on interrupts adds overhead :/
> 
> This really is a no-win all round.
> 
> The only 'sensible' option here is threaded IRQs, where the IRQ line
> gets disabled until the handler thread has ran, that also helps with IRQ
> storms.
Thomas Gleixner Aug. 6, 2020, 6:58 p.m. UTC | #28
Peter,

peterz@infradead.org writes:
> On Thu, Aug 06, 2020 at 11:41:06AM +0200, Thomas Gleixner wrote:
>> And that has nothing to do
>> with a silly test case. Sporadic runaways due to a bug in a once per
>> week code path simply can happen and having the safety net working
>> depending on a config option selected or not is just wrong.
>
> The safety thing is concerned with RT tasks. It doesn't pretend to help
> with runnaway IRQs, never has, never will.

Of course not. But without irq accounting the runtime is accounted to
the runaway task, which causes it to be throttled.

> The further extreme is an interrupt storm, those have always taken a
> machine down.

If every interrupt is actually handled, then yes.

> Accounting unrelated IRQ time to RT tasks is equally wrong, the task
> execution is unrelated to the IRQs. The config option at least offers
> insight into where time goes -- and it's a config option because doing
> time accounting on interrupts adds overhead :/

Right, but it's not totally out of the world either to make the
throttler do:

          if (rt_runtime + irq_time > threshold)
          	try_to_keep_the_box_alive()
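
For illustration, one way that comparison could be fed from data the
kernel already has (irq_time_read() is the existing per-CPU accessor
behind CONFIG_IRQ_TIME_ACCOUNTING; the prev_irq_time baseline on the
rt_rq is made up for this sketch):

          u64 irq_time = irq_time_read(cpu_of(rq_of_rt_rq(rt_rq))) -
                         rt_rq->prev_irq_time;    /* hypothetical field */

          if (rt_rq->rt_time + irq_time > sched_rt_runtime(rt_rq))
                  try_to_keep_the_box_alive();    /* i.e. throttle as usual */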

> This really is a no-win all round.

That's not the question here :)

> The only 'sensible' option here is threaded IRQs, where the IRQ line
> gets disabled until the handler thread has ran, that also helps with IRQ
> storms.

I'm not against enforcing threaded IRQs. :)

Thanks,

        tglx
Thomas Gleixner Aug. 6, 2020, 7:03 p.m. UTC | #29
Paul,

"Paul E. McKenney" <paulmck@kernel.org> writes:
> On Thu, Aug 06, 2020 at 01:45:45PM +0200, peterz@infradead.org wrote:
>> The safety thing is concerned with RT tasks. It doesn't pretend to help
>> with runnaway IRQs, never has, never will.
>
> Getting into the time machine back to the 1990s...
>
> DYNIX/ptx had a discretionary mechanism to deal with excessive interrupts.
> There was a function that long-running interrupt handlers were supposed
> to call periodically that would return false if the system felt that
> the CPU had done enough interrupts for the time being.  In that case,
> the interrupt handler was supposed to schedule itself for a later time,
> but leave the interrupt unacknowledged in order to prevent retriggering
> in the meantime.
>
> Of course, this mechanism would be rather less helpful in Linux.
>
> For one, Linux has way more device drivers and way more oddball devices.
> In contrast, the few devices that DYNIX/ptx supported were carefully
> selected, and the selection criteria included being able to put up
> with this sort of thing.  Also, the fact that there was but a handful
> of device drivers meant that changes like this could be more easily
> propagated through all drivers.

We could do that completely at the core interrupt handling level. 

> Also, Linux supports way more workloads.  In contrast, DYNIX/ptx could
> pick a small percentage of each CPU that would be permitted to be used
> by hardware interrupt handlers.  As in there are probably Linux workloads
> that run >90% of some poor CPU within hardware interrupt handlers.

Yet another tunable. /me runs
Paul E. McKenney Aug. 6, 2020, 8:39 p.m. UTC | #30
On Thu, Aug 06, 2020 at 09:03:24PM +0200, Thomas Gleixner wrote:
> Paul,
> 
> "Paul E. McKenney" <paulmck@kernel.org> writes:
> > On Thu, Aug 06, 2020 at 01:45:45PM +0200, peterz@infradead.org wrote:
> >> The safety thing is concerned with RT tasks. It doesn't pretend to help
> >> with runnaway IRQs, never has, never will.
> >
> > Getting into the time machine back to the 1990s...
> >
> > DYNIX/ptx had a discretionary mechanism to deal with excessive interrupts.
> > There was a function that long-running interrupt handlers were supposed
> > to call periodically that would return false if the system felt that
> > the CPU had done enough interrupts for the time being.  In that case,
> > the interrupt handler was supposed to schedule itself for a later time,
> > but leave the interrupt unacknowledged in order to prevent retriggering
> > in the meantime.
> >
> > Of course, this mechanism would be rather less helpful in Linux.
> >
> > For one, Linux has way more device drivers and way more oddball devices.
> > In contrast, the few devices that DYNIX/ptx supported were carefully
> > selected, and the selection criteria included being able to put up
> > with this sort of thing.  Also, the fact that there was but a handful
> > of device drivers meant that changes like this could be more easily
> > propagated through all drivers.
> 
> We could do that completely at the core interrupt handling level. 

Ah, true enough if the various NAPI-like devices give up the CPU from
time to time.  Which they might well do for all I know.

> > Also, Linux supports way more workloads.  In contrast, DYNIX/ptx could
> > pick a small percentage of each CPU that would be permitted to be used
> > by hardware interrupt handlers.  As in there are probably Linux workloads
> > that run >90% of some poor CPU within hardware interrupt handlers.
> 
> Yet another tunable. /me runs

;-) ;-) ;-)

If there are workloads that would like to be able to keep one or more
CPUs completely busy handling interrupts, it should be possible to
create something that is used sort of like cond_resched() to keep RCU,
the scheduler, and the various watchdogs and lockup detectors at bay.

For example, RCU could supply a function that checked to make sure that
it was in an interrupt from idle, and if so report a quiescent state
for that CPU.  So if the CPU was idle and there wasn't anything pending
for it, that CPU could safely stay in a hardirq handler indefinitely.
I suppose that the function should return an indication in cases such
as interrupt from non-idle.

Sort of like NO_HZ_FULL, but for hardirq handlers, and also allowing
those handlers to use RCU read-side critical sections.

Or we could do what all the cool kids do these days, namely just apply
machine learning, thus automatically self-tuning in real time.

/me runs...

							Thanx, Paul

Patch

diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
index e0f33826819f..ff1c11d8b10b 100644
--- a/arch/arm64/configs/defconfig
+++ b/arch/arm64/configs/defconfig
@@ -4,7 +4,6 @@  CONFIG_AUDIT=y
 CONFIG_NO_HZ_IDLE=y
 CONFIG_HIGH_RES_TIMERS=y
 CONFIG_PREEMPT=y
-CONFIG_IRQ_TIME_ACCOUNTING=y
 CONFIG_BSD_PROCESS_ACCT=y
 CONFIG_BSD_PROCESS_ACCT_V3=y
 CONFIG_TASK_XACCT=y