diff mbox

usb: dwc3: omap: Fix imprecise external abort and oops on boot

Message ID 9cce1e34-2fec-9477-4827-6b4c67920a8a@ti.com (mailing list archive)
State New, archived
Headers show

Commit Message

Grygorii Strashko Dec. 8, 2016, 9:54 p.m. UTC
Hi Tony,

On 12/08/2016 09:37 AM, Tony Lindgren wrote:
> * Felipe Balbi <balbi@kernel.org> [161208 01:45]:
>> Tony Lindgren <tony@atomide.com> writes:
>>> Somehow starting with v4.9-rc7 there have been imprecise
>>
>> There's nothing touching dwc3 since v4.9-rc5.
> 
> Right, nothing obvious has changed. I think it's just a slight timing
> change in the code that started triggering it.
> 
>>> external aborts on omap5-uevm dwc3 controller. I have not been
>>> able to bisect what exactly triggered this as it does not always
>>> happen. It seems that something changed with probing that
>>> now exposes the issue:
>>>
>>> Unhandled fault: imprecise external abort (0x1406) at 0x00000000
>>
>> hmmm, clock disabled... dwc3-omap shouldn't have pm runtime at all.
> 
> It does for the interconnect target module clkctrl register via PM
> runtime. That's the "usb_otg_ss" module.
> 
>>> ...
>>> PC is at dwc3_omap_interrupt_thread+0x20/0x80
>>> LR is at irq_thread_fn+0x1c/0x54
>>> ...
>>> [<c0989fa8>] (dwc3_omap_interrupt_thread) from [<c038a4d8>]
>>> (irq_thread_fn+0x1c/0x54)
>>> [<c038a4d8>] (irq_thread_fn) from [<c038a7b0>] (irq_thread+0x12c/0x1f0)
>>> [<c038a7b0>] (irq_thread) from [<c035d6f0>] (kthread+0xdc/0xf4)
>>> [<c035d6f0>] (kthread) from [<c0307d78>] (ret_from_fork+0x14/0x3c)
>>> ...
>>>
>>> Unable to handle kernel paging request at virtual address ffffffec
>>> ...
>>> Internal error: Oops: 37 [#2] SMP ARM
>>> PC is at kthread_data+0x4/0xc
>>> LR is at irq_thread_dtor+0x28/0xd0
>>> ...
>>> [<c035e0b4>] (kthread_data) from [<c038a5dc>] (irq_thread_dtor+0x28/0xd0)
>>> [<c038a5dc>] (irq_thread_dtor) from [<c035bef0>] (task_work_run+0xb8/0xdc)
>>> [<c035bef0>] (task_work_run) from [<c03448d4>] (do_exit+0x314/0xa20)
>>> [<c03448d4>] (do_exit) from [<c030bea8>] (die+0x410/0x474)
>>> [<c030bea8>] (die) from [<c0301350>] (do_DataAbort+0xb4/0xb8)
>>> [<c0301350>] (do_DataAbort) from [<c030c578>] (__dabt_svc+0x58/0x80)
>>> Exception stack(0xee777ec8 to 0xee777f10)
>>> 7ec0:                   0000014d ee6e6f10 00000034 fc020034 ee6e6f10 ee1eec00
>>> 7ee0: 00000000 00000000 ee6ec640 c038a4bc 00000000 00000000 00000000 ee777f18
>>> 7f00: c038a4d8 c0989fa8 60000013 ffffffff
>>> [<c030c578>] (__dabt_svc) from [<c0989fa8>] (dwc3_omap_interrupt_thread+0x20/0x80)
>>> [<c0989fa8>] (dwc3_omap_interrupt_thread) from [<c038a4d8>] (irq_thread_fn+0x1c/0x54)
>>> [<c038a4d8>] (irq_thread_fn) from [<c038a7b0>] (irq_thread+0x12c/0x1f0)
>>> [<c038a7b0>] (irq_thread) from [<c035d6f0>] (kthread+0xdc/0xf4)
>>> [<c035d6f0>] (kthread) from [<c0307d78>] (ret_from_fork+0x14/0x3c)
>>>
>>> Fix the issue by making sure the dwc3 interrupts are disabled
>>> before we call devm_request_threaded_irq().
>>
>> that should already be the case. Are you sure that register isn't zero
>> in probe?
> 
> Looks like irq0_status = 0 and irqmisc_status = 0x2121. Also just
> clearing irqmisc with dwc3_omap_write_irqmisc_clr(omap, 0xffffffff)
> stops the issue from happening.
> 
> There is some deferred probing happening but irqmisc is always 0x2121.
> 

My assumption is that you have race here between IRQ handler and PM runtime
in error handling path during deferred probing.

 Will below change fix issue for you?

Comments

Tony Lindgren Dec. 8, 2016, 10:57 p.m. UTC | #1
* Grygorii Strashko <grygorii.strashko@ti.com> [161208 13:54]:
> Hi Tony,
> 
> On 12/08/2016 09:37 AM, Tony Lindgren wrote:
> > * Felipe Balbi <balbi@kernel.org> [161208 01:45]:
> >> Tony Lindgren <tony@atomide.com> writes:
> >>> Somehow starting with v4.9-rc7 there have been imprecise
> >>
> >> There's nothing touching dwc3 since v4.9-rc5.
> > 
> > Right, nothing obvious has changed. I think it's just a slight timing
> > change in the code that started triggering it.
> > 
> >>> external aborts on omap5-uevm dwc3 controller. I have not been
> >>> able to bisect what exactly triggered this as it does not always
> >>> happen. It seems that something changed with probing that
> >>> now exposes the issue:
> >>>
> >>> Unhandled fault: imprecise external abort (0x1406) at 0x00000000
> >>
> >> hmmm, clock disabled... dwc3-omap shouldn't have pm runtime at all.
> > 
> > It does for the interconnect target module clkctrl register via PM
> > runtime. That's the "usb_otg_ss" module.
> > 
> >>> ...
> >>> PC is at dwc3_omap_interrupt_thread+0x20/0x80
> >>> LR is at irq_thread_fn+0x1c/0x54
> >>> ...
> >>> [<c0989fa8>] (dwc3_omap_interrupt_thread) from [<c038a4d8>]
> >>> (irq_thread_fn+0x1c/0x54)
> >>> [<c038a4d8>] (irq_thread_fn) from [<c038a7b0>] (irq_thread+0x12c/0x1f0)
> >>> [<c038a7b0>] (irq_thread) from [<c035d6f0>] (kthread+0xdc/0xf4)
> >>> [<c035d6f0>] (kthread) from [<c0307d78>] (ret_from_fork+0x14/0x3c)
> >>> ...
> >>>
> >>> Unable to handle kernel paging request at virtual address ffffffec
> >>> ...
> >>> Internal error: Oops: 37 [#2] SMP ARM
> >>> PC is at kthread_data+0x4/0xc
> >>> LR is at irq_thread_dtor+0x28/0xd0
> >>> ...
> >>> [<c035e0b4>] (kthread_data) from [<c038a5dc>] (irq_thread_dtor+0x28/0xd0)
> >>> [<c038a5dc>] (irq_thread_dtor) from [<c035bef0>] (task_work_run+0xb8/0xdc)
> >>> [<c035bef0>] (task_work_run) from [<c03448d4>] (do_exit+0x314/0xa20)
> >>> [<c03448d4>] (do_exit) from [<c030bea8>] (die+0x410/0x474)
> >>> [<c030bea8>] (die) from [<c0301350>] (do_DataAbort+0xb4/0xb8)
> >>> [<c0301350>] (do_DataAbort) from [<c030c578>] (__dabt_svc+0x58/0x80)
> >>> Exception stack(0xee777ec8 to 0xee777f10)
> >>> 7ec0:                   0000014d ee6e6f10 00000034 fc020034 ee6e6f10 ee1eec00
> >>> 7ee0: 00000000 00000000 ee6ec640 c038a4bc 00000000 00000000 00000000 ee777f18
> >>> 7f00: c038a4d8 c0989fa8 60000013 ffffffff
> >>> [<c030c578>] (__dabt_svc) from [<c0989fa8>] (dwc3_omap_interrupt_thread+0x20/0x80)
> >>> [<c0989fa8>] (dwc3_omap_interrupt_thread) from [<c038a4d8>] (irq_thread_fn+0x1c/0x54)
> >>> [<c038a4d8>] (irq_thread_fn) from [<c038a7b0>] (irq_thread+0x12c/0x1f0)
> >>> [<c038a7b0>] (irq_thread) from [<c035d6f0>] (kthread+0xdc/0xf4)
> >>> [<c035d6f0>] (kthread) from [<c0307d78>] (ret_from_fork+0x14/0x3c)
> >>>
> >>> Fix the issue by making sure the dwc3 interrupts are disabled
> >>> before we call devm_request_threaded_irq().
> >>
> >> that should already be the case. Are you sure that register isn't zero
> >> in probe?
> > 
> > Looks like irq0_status = 0 and irqmisc_status = 0x2121. Also just
> > clearing irqmisc with dwc3_omap_write_irqmisc_clr(omap, 0xffffffff)
> > stops the issue from happening.
> > 
> > There is some deferred probing happening but irqmisc is always 0x2121.
> > 
> 
> My assumption is that you have race here between IRQ handler and PM runtime
> in error handling path during deferred probing.
> 
>  Will below change fix issue for you?

I think you figured out why it behaves that way :)

Seems to work based on few boot tests. Probably both should be applied,
my original patch to prevent spurious interrupts before things are
initialized, this to disable interrupts before pm_runtime_suspend()
on exit path:

Tested-by: Tony Lindgren <tony@atomide.com>


> diff --git a/drivers/usb/dwc3/dwc3-omap.c b/drivers/usb/dwc3/dwc3-omap.c
> index 29e80cc..dbc21bc 100644
> --- a/drivers/usb/dwc3/dwc3-omap.c
> +++ b/drivers/usb/dwc3/dwc3-omap.c
> @@ -517,12 +517,12 @@ static int dwc3_omap_probe(struct platform_device *pdev)
>         if (ret) {
>                 dev_err(dev, "failed to request IRQ #%d --> %d\n",
>                                 omap->irq, ret);
> -               goto err1;
> +               goto err11;
>         }
>  
>         ret = dwc3_omap_extcon_register(omap);
>         if (ret < 0)
> -               goto err1;
> +               goto err11;
>  
>         ret = of_platform_populate(node, NULL, NULL, dev);
>         if (ret) {
> @@ -538,6 +538,8 @@ static int dwc3_omap_probe(struct platform_device *pdev)
>         extcon_unregister_notifier(omap->edev, EXTCON_USB, &omap->vbus_nb);
>         extcon_unregister_notifier(omap->edev, EXTCON_USB_HOST, &omap->id_nb);
>  
> +err11:
> +       disable_irq(omap->irq);
>  err1:
>         pm_runtime_put_sync(dev);
>         pm_runtime_disable(dev);
> 
> 
> -- 
> regards,
> -grygorii
--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Grygorii Strashko Dec. 8, 2016, 11:38 p.m. UTC | #2
On 12/08/2016 04:57 PM, Tony Lindgren wrote:
> * Grygorii Strashko <grygorii.strashko@ti.com> [161208 13:54]:
>> Hi Tony,
>>
>> On 12/08/2016 09:37 AM, Tony Lindgren wrote:
>>> * Felipe Balbi <balbi@kernel.org> [161208 01:45]:
>>>> Tony Lindgren <tony@atomide.com> writes:
>>>>> Somehow starting with v4.9-rc7 there have been imprecise
>>>>
>>>> There's nothing touching dwc3 since v4.9-rc5.
>>>
>>> Right, nothing obvious has changed. I think it's just a slight timing
>>> change in the code that started triggering it.
>>>
>>>>> external aborts on omap5-uevm dwc3 controller. I have not been
>>>>> able to bisect what exactly triggered this as it does not always
>>>>> happen. It seems that something changed with probing that
>>>>> now exposes the issue:
>>>>>
>>>>> Unhandled fault: imprecise external abort (0x1406) at 0x00000000
>>>>
>>>> hmmm, clock disabled... dwc3-omap shouldn't have pm runtime at all.
>>>
>>> It does for the interconnect target module clkctrl register via PM
>>> runtime. That's the "usb_otg_ss" module.
>>>
>>>>> ...
>>>>> PC is at dwc3_omap_interrupt_thread+0x20/0x80
>>>>> LR is at irq_thread_fn+0x1c/0x54
>>>>> ...
>>>>> [<c0989fa8>] (dwc3_omap_interrupt_thread) from [<c038a4d8>]
>>>>> (irq_thread_fn+0x1c/0x54)
>>>>> [<c038a4d8>] (irq_thread_fn) from [<c038a7b0>] (irq_thread+0x12c/0x1f0)
>>>>> [<c038a7b0>] (irq_thread) from [<c035d6f0>] (kthread+0xdc/0xf4)
>>>>> [<c035d6f0>] (kthread) from [<c0307d78>] (ret_from_fork+0x14/0x3c)
>>>>> ...
>>>>>
>>>>> Unable to handle kernel paging request at virtual address ffffffec
>>>>> ...
>>>>> Internal error: Oops: 37 [#2] SMP ARM
>>>>> PC is at kthread_data+0x4/0xc
>>>>> LR is at irq_thread_dtor+0x28/0xd0
>>>>> ...
>>>>> [<c035e0b4>] (kthread_data) from [<c038a5dc>] (irq_thread_dtor+0x28/0xd0)
>>>>> [<c038a5dc>] (irq_thread_dtor) from [<c035bef0>] (task_work_run+0xb8/0xdc)
>>>>> [<c035bef0>] (task_work_run) from [<c03448d4>] (do_exit+0x314/0xa20)
>>>>> [<c03448d4>] (do_exit) from [<c030bea8>] (die+0x410/0x474)
>>>>> [<c030bea8>] (die) from [<c0301350>] (do_DataAbort+0xb4/0xb8)
>>>>> [<c0301350>] (do_DataAbort) from [<c030c578>] (__dabt_svc+0x58/0x80)
>>>>> Exception stack(0xee777ec8 to 0xee777f10)
>>>>> 7ec0:                   0000014d ee6e6f10 00000034 fc020034 ee6e6f10 ee1eec00
>>>>> 7ee0: 00000000 00000000 ee6ec640 c038a4bc 00000000 00000000 00000000 ee777f18
>>>>> 7f00: c038a4d8 c0989fa8 60000013 ffffffff
>>>>> [<c030c578>] (__dabt_svc) from [<c0989fa8>] (dwc3_omap_interrupt_thread+0x20/0x80)
>>>>> [<c0989fa8>] (dwc3_omap_interrupt_thread) from [<c038a4d8>] (irq_thread_fn+0x1c/0x54)
>>>>> [<c038a4d8>] (irq_thread_fn) from [<c038a7b0>] (irq_thread+0x12c/0x1f0)
>>>>> [<c038a7b0>] (irq_thread) from [<c035d6f0>] (kthread+0xdc/0xf4)
>>>>> [<c035d6f0>] (kthread) from [<c0307d78>] (ret_from_fork+0x14/0x3c)
>>>>>
>>>>> Fix the issue by making sure the dwc3 interrupts are disabled
>>>>> before we call devm_request_threaded_irq().
>>>>
>>>> that should already be the case. Are you sure that register isn't zero
>>>> in probe?
>>>
>>> Looks like irq0_status = 0 and irqmisc_status = 0x2121. Also just
>>> clearing irqmisc with dwc3_omap_write_irqmisc_clr(omap, 0xffffffff)
>>> stops the issue from happening.
>>>
>>> There is some deferred probing happening but irqmisc is always 0x2121.
>>>
>>
>> My assumption is that you have race here between IRQ handler and PM runtime
>> in error handling path during deferred probing.
>>
>>  Will below change fix issue for you?
>
> I think you figured out why it behaves that way :)

just followed the code and your bug report:) not tested by my self.

>
> Seems to work based on few boot tests. Probably both should be applied,
> my original patch to prevent spurious interrupts before things are
> initialized,

you patch - do you mean disable irq  or clean up
irq status regs before requesting irq? for me, more logical is to
clean up irq status regs.

this to disable interrupts before pm_runtime_suspend()
> on exit path:
>
> Tested-by: Tony Lindgren <tony@atomide.com>

So, you wanna me to send this as patch or wish to squash in yours?
(I'll be able to do this only tomorrow)

>
>
>> diff --git a/drivers/usb/dwc3/dwc3-omap.c b/drivers/usb/dwc3/dwc3-omap.c
>> index 29e80cc..dbc21bc 100644
>> --- a/drivers/usb/dwc3/dwc3-omap.c
>> +++ b/drivers/usb/dwc3/dwc3-omap.c
>> @@ -517,12 +517,12 @@ static int dwc3_omap_probe(struct platform_device *pdev)
>>         if (ret) {
>>                 dev_err(dev, "failed to request IRQ #%d --> %d\n",
>>                                 omap->irq, ret);
>> -               goto err1;
>> +               goto err11;
>>         }
>>
>>         ret = dwc3_omap_extcon_register(omap);
>>         if (ret < 0)
>> -               goto err1;
>> +               goto err11;
>>
>>         ret = of_platform_populate(node, NULL, NULL, dev);
>>         if (ret) {
>> @@ -538,6 +538,8 @@ static int dwc3_omap_probe(struct platform_device *pdev)
>>         extcon_unregister_notifier(omap->edev, EXTCON_USB, &omap->vbus_nb);
>>         extcon_unregister_notifier(omap->edev, EXTCON_USB_HOST, &omap->id_nb);
>>
>> +err11:
>> +       disable_irq(omap->irq);
>>  err1:
>>         pm_runtime_put_sync(dev);
>>         pm_runtime_disable(dev);
>>
>>
>> --
>> regards,
>> -grygorii
Tony Lindgren Dec. 8, 2016, 11:55 p.m. UTC | #3
* Grygorii Strashko <grygorii.strashko@ti.com> [161208 15:38]:
> On 12/08/2016 04:57 PM, Tony Lindgren wrote:
> > Seems to work based on few boot tests. Probably both should be applied,
> > my original patch to prevent spurious interrupts before things are
> > initialized,
> 
> you patch - do you mean disable irq  or clean up
> irq status regs before requesting irq? for me, more logical is to
> clean up irq status regs.

What I originally posted in this thread, clear irq registers
before requesting irq.

> this to disable interrupts before pm_runtime_suspend()
> > on exit path:
> > 
> > Tested-by: Tony Lindgren <tony@atomide.com>
> 
> So, you wanna me to send this as patch or wish to squash in yours?
> (I'll be able to do this only tomorrow)

I think this should be a separate patch for the exit path.

Regards,

Tony
--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Grygorii Strashko Dec. 9, 2016, 6:45 p.m. UTC | #4
On 12/08/2016 05:55 PM, Tony Lindgren wrote:
> * Grygorii Strashko <grygorii.strashko@ti.com> [161208 15:38]:
>> On 12/08/2016 04:57 PM, Tony Lindgren wrote:
>>> Seems to work based on few boot tests. Probably both should be applied,
>>> my original patch to prevent spurious interrupts before things are
>>> initialized,
>>
>> you patch - do you mean disable irq  or clean up
>> irq status regs before requesting irq? for me, more logical is to
>> clean up irq status regs.
> 
> What I originally posted in this thread, clear irq registers
> before requesting irq.
> 
>> this to disable interrupts before pm_runtime_suspend()
>>> on exit path:
>>>
>>> Tested-by: Tony Lindgren <tony@atomide.com>
>>
>> So, you wanna me to send this as patch or wish to squash in yours?
>> (I'll be able to do this only tomorrow)
> 
> I think this should be a separate patch for the exit path.
> 

Actually, It seems correct solution will be to switch back from
devm_request_threaded_irq() to just request_threaded_irq(), because
similar problem can happen with dwc3_omap_remove() - there are no guarantee
that there will be no IRQ handler running when dwc3_omap_remove()
is executed and dwc3_omap_disable_irqs() will not help with that.
diff mbox

Patch

diff --git a/drivers/usb/dwc3/dwc3-omap.c b/drivers/usb/dwc3/dwc3-omap.c
index 29e80cc..dbc21bc 100644
--- a/drivers/usb/dwc3/dwc3-omap.c
+++ b/drivers/usb/dwc3/dwc3-omap.c
@@ -517,12 +517,12 @@  static int dwc3_omap_probe(struct platform_device *pdev)
        if (ret) {
                dev_err(dev, "failed to request IRQ #%d --> %d\n",
                                omap->irq, ret);
-               goto err1;
+               goto err11;
        }
 
        ret = dwc3_omap_extcon_register(omap);
        if (ret < 0)
-               goto err1;
+               goto err11;
 
        ret = of_platform_populate(node, NULL, NULL, dev);
        if (ret) {
@@ -538,6 +538,8 @@  static int dwc3_omap_probe(struct platform_device *pdev)
        extcon_unregister_notifier(omap->edev, EXTCON_USB, &omap->vbus_nb);
        extcon_unregister_notifier(omap->edev, EXTCON_USB_HOST, &omap->id_nb);
 
+err11:
+       disable_irq(omap->irq);
 err1:
        pm_runtime_put_sync(dev);
        pm_runtime_disable(dev);