diff mbox series

usb: dwc3: Potential fix of possible dwc3 interrupt storm

Message ID 20240719110100.329-1-selvarasu.g@samsung.com (mailing list archive)
State New
Headers show
Series usb: dwc3: Potential fix of possible dwc3 interrupt storm | expand

Commit Message

Selvarasu Ganesan July 19, 2024, 11 a.m. UTC
In certain scenarios, there is a chance that the CPU may not be
scheduled the bottom half of dwc3 interrupt. This is because the CPU
may hang up where any work queue lockup has happened for the same CPU
that is trying to schedule the dwc3 thread interrupt. In this scenario,
the USB can enter runtime suspend as the bus may idle for a longer time
, or user can reconnect the USB cable. Then, the dwc3 event interrupt
can be enabled when runtime resume is happening with regardless of the
previous event status. This can lead to a dwc3 IRQ storm due to the
return from the interrupt handler by checking only the evt->flags as
DWC3_EVENT_PENDING, where the same flag was set as DWC3_EVENT_PENDING
in previous work queue lockup.
Let's consider the following sequences in this scenario,

Call trace of dwc3 IRQ after workqueue lockup scenario
======================================================
IRQ #1:
->dwc3_interrupt()
  ->dwc3_check_event_buf()
        ->if (evt->flags & DWC3_EVENT_PENDING)
                     return IRQ_HANDLED;
        ->evt->flags |= DWC3_EVENT_PENDING;
        ->/* Disable interrupt by setting DWC3_GEVNTSIZ_INTMASK  in
                                                        DWC3_GEVNTSIZ
        ->return IRQ_WAKE_THREAD; // No workqueue scheduled for dwc3
                                     thread_fu due to workqueue lockup
                                     even after return IRQ_WAKE_THREAD
                                     from top-half.

Thread #2:
->dwc3_runtime_resume()
 ->dwc3_resume_common()
   ->dwc3_gadget_resume()
      ->dwc3_gadget_soft_connect()
        ->dwc3_event_buffers_setup()
           ->/*Enable interrupt by clearing  DWC3_GEVNTSIZ_INTMASK in
                                                        DWC3_GEVNTSIZ*/

Start IRQ Storming after enable dwc3 event in resume path
=========================================================
CPU0: IRQ
dwc3_interrupt()
 dwc3_check_event_buf()
        if (evt->flags & DWC3_EVENT_PENDING)
         return IRQ_HANDLED;

CPU0: IRQ
dwc3_interrupt()
 dwc3_check_event_buf()
        if (evt->flags & DWC3_EVENT_PENDING)
         return IRQ_HANDLED;
..
..

To fix this issue by avoiding enabling of the dwc3 event interrupt in
the runtime resume path if dwc3 event processing is in progress.

Signed-off-by: Selvarasu Ganesan <selvarasu.g@samsung.com>
---
 drivers/usb/dwc3/core.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

Comments

Thinh Nguyen Aug. 7, 2024, 12:38 a.m. UTC | #1
On Fri, Jul 19, 2024, Selvarasu Ganesan wrote:
> In certain scenarios, there is a chance that the CPU may not be
> scheduled the bottom half of dwc3 interrupt. This is because the CPU
> may hang up where any work queue lockup has happened for the same CPU
> that is trying to schedule the dwc3 thread interrupt. In this scenario,
> the USB can enter runtime suspend as the bus may idle for a longer time
> , or user can reconnect the USB cable. Then, the dwc3 event interrupt
> can be enabled when runtime resume is happening with regardless of the
> previous event status. This can lead to a dwc3 IRQ storm due to the
> return from the interrupt handler by checking only the evt->flags as
> DWC3_EVENT_PENDING, where the same flag was set as DWC3_EVENT_PENDING
> in previous work queue lockup.
> Let's consider the following sequences in this scenario,
> 
> Call trace of dwc3 IRQ after workqueue lockup scenario
> ======================================================
> IRQ #1:
> ->dwc3_interrupt()
>   ->dwc3_check_event_buf()
>         ->if (evt->flags & DWC3_EVENT_PENDING)
>                      return IRQ_HANDLED;
>         ->evt->flags |= DWC3_EVENT_PENDING;
>         ->/* Disable interrupt by setting DWC3_GEVNTSIZ_INTMASK  in
>                                                         DWC3_GEVNTSIZ
>         ->return IRQ_WAKE_THREAD; // No workqueue scheduled for dwc3
>                                      thread_fu due to workqueue lockup
>                                      even after return IRQ_WAKE_THREAD
>                                      from top-half.
> 
> Thread #2:
> ->dwc3_runtime_resume()
>  ->dwc3_resume_common()
>    ->dwc3_gadget_resume()
>       ->dwc3_gadget_soft_connect()
>         ->dwc3_event_buffers_setup()
>            ->/*Enable interrupt by clearing  DWC3_GEVNTSIZ_INTMASK in
>                                                         DWC3_GEVNTSIZ*/
> 
> Start IRQ Storming after enable dwc3 event in resume path
> =========================================================
> CPU0: IRQ
> dwc3_interrupt()
>  dwc3_check_event_buf()
>         if (evt->flags & DWC3_EVENT_PENDING)
>          return IRQ_HANDLED;
> 
> CPU0: IRQ
> dwc3_interrupt()
>  dwc3_check_event_buf()
>         if (evt->flags & DWC3_EVENT_PENDING)
>          return IRQ_HANDLED;
> ..
> ..
> 
> To fix this issue by avoiding enabling of the dwc3 event interrupt in
> the runtime resume path if dwc3 event processing is in progress.
> 
> Signed-off-by: Selvarasu Ganesan <selvarasu.g@samsung.com>
> ---
>  drivers/usb/dwc3/core.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
> index cb82557678dd..610792a70805 100644
> --- a/drivers/usb/dwc3/core.c
> +++ b/drivers/usb/dwc3/core.c
> @@ -549,8 +549,12 @@ int dwc3_event_buffers_setup(struct dwc3 *dwc)
>  			lower_32_bits(evt->dma));
>  	dwc3_writel(dwc->regs, DWC3_GEVNTADRHI(0),
>  			upper_32_bits(evt->dma));
> -	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
> -			DWC3_GEVNTSIZ_SIZE(evt->length));
> +
> +	/* Skip enable dwc3 event interrupt if event is processing in middle */
> +	if (!(evt->flags & DWC3_EVENT_PENDING))
> +		dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
> +				DWC3_GEVNTSIZ_SIZE(evt->length));
> +
>  	dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 0);
>  
>  	return 0;
> -- 
> 2.17.1
> 

We're not waking up from a hibernation. So after a soft-reset and
resume, the events that weren't processed are stale. They should be
processed prior to entering suspend or be discarded before resume.

The synchronize_irq() during suspend() was not sufficient to prevent
this? What are we missing here.

Thanks,
Thinh
Selvarasu Ganesan Aug. 7, 2024, 6:20 a.m. UTC | #2
On 8/7/2024 6:08 AM, Thinh Nguyen wrote:
> On Fri, Jul 19, 2024, Selvarasu Ganesan wrote:
>> In certain scenarios, there is a chance that the CPU may not be
>> scheduled the bottom half of dwc3 interrupt. This is because the CPU
>> may hang up where any work queue lockup has happened for the same CPU
>> that is trying to schedule the dwc3 thread interrupt. In this scenario,
>> the USB can enter runtime suspend as the bus may idle for a longer time
>> , or user can reconnect the USB cable. Then, the dwc3 event interrupt
>> can be enabled when runtime resume is happening with regardless of the
>> previous event status. This can lead to a dwc3 IRQ storm due to the
>> return from the interrupt handler by checking only the evt->flags as
>> DWC3_EVENT_PENDING, where the same flag was set as DWC3_EVENT_PENDING
>> in previous work queue lockup.
>> Let's consider the following sequences in this scenario,
>>
>> Call trace of dwc3 IRQ after workqueue lockup scenario
>> ======================================================
>> IRQ #1:
>> ->dwc3_interrupt()
>>    ->dwc3_check_event_buf()
>>          ->if (evt->flags & DWC3_EVENT_PENDING)
>>                       return IRQ_HANDLED;
>>          ->evt->flags |= DWC3_EVENT_PENDING;
>>          ->/* Disable interrupt by setting DWC3_GEVNTSIZ_INTMASK  in
>>                                                          DWC3_GEVNTSIZ
>>          ->return IRQ_WAKE_THREAD; // No workqueue scheduled for dwc3
>>                                       thread_fu due to workqueue lockup
>>                                       even after return IRQ_WAKE_THREAD
>>                                       from top-half.
>>
>> Thread #2:
>> ->dwc3_runtime_resume()
>>   ->dwc3_resume_common()
>>     ->dwc3_gadget_resume()
>>        ->dwc3_gadget_soft_connect()
>>          ->dwc3_event_buffers_setup()
>>             ->/*Enable interrupt by clearing  DWC3_GEVNTSIZ_INTMASK in
>>                                                          DWC3_GEVNTSIZ*/
>>
>> Start IRQ Storming after enable dwc3 event in resume path
>> =========================================================
>> CPU0: IRQ
>> dwc3_interrupt()
>>   dwc3_check_event_buf()
>>          if (evt->flags & DWC3_EVENT_PENDING)
>>           return IRQ_HANDLED;
>>
>> CPU0: IRQ
>> dwc3_interrupt()
>>   dwc3_check_event_buf()
>>          if (evt->flags & DWC3_EVENT_PENDING)
>>           return IRQ_HANDLED;
>> ..
>> ..
>>
>> To fix this issue by avoiding enabling of the dwc3 event interrupt in
>> the runtime resume path if dwc3 event processing is in progress.
>>
>> Signed-off-by: Selvarasu Ganesan <selvarasu.g@samsung.com>
>> ---
>>   drivers/usb/dwc3/core.c | 8 ++++++--
>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
>> index cb82557678dd..610792a70805 100644
>> --- a/drivers/usb/dwc3/core.c
>> +++ b/drivers/usb/dwc3/core.c
>> @@ -549,8 +549,12 @@ int dwc3_event_buffers_setup(struct dwc3 *dwc)
>>   			lower_32_bits(evt->dma));
>>   	dwc3_writel(dwc->regs, DWC3_GEVNTADRHI(0),
>>   			upper_32_bits(evt->dma));
>> -	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>> -			DWC3_GEVNTSIZ_SIZE(evt->length));
>> +
>> +	/* Skip enable dwc3 event interrupt if event is processing in middle */
>> +	if (!(evt->flags & DWC3_EVENT_PENDING))
>> +		dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>> +				DWC3_GEVNTSIZ_SIZE(evt->length));
>> +
>>   	dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 0);
>>   
>>   	return 0;
>> -- 
>> 2.17.1
>>
> We're not waking up from a hibernation. So after a soft-reset and
> resume, the events that weren't processed are stale. They should be
> processed prior to entering suspend or be discarded before resume.
>
> The synchronize_irq() during suspend() was not sufficient to prevent
> this? What are we missing here.
>
> Thanks,
> Thinh
I don’t think the triggering of interrupt would not be stopped even if 
do soft reset. It's because of event count is may be valid .
Let consider the scenarios where SW is not acknowledge the event count 
after getting a interrupt with disable GEVNTSIZ_INTMASK =0. It will 
triggering the spurious interrupts until enable GEVNTSIZ_INTMASK=1 or 
acknowledge the event count by SW. This is happening here because of We 
just return from interrupt handler by checking if evt->flags as 
DWC3_EVENT_PENDING. Clearing of DWC3_EVENT_PENDING flag is done in 
dwc3_thread_interrupt. In some scenario it's not happening due to cpu 
hangup or work queue lockup.

The same issue was reported earlier and not derived actual root cause 
from USB dwc3 driver point of view, and somehow we managing fix with our 
vendor kernel by using kretprobe.

To fix this issue, We have to reset the evt->flags or stop disable  
GEVNTSIZ_INTMASK=0
https://lore.kernel.org/linux-usb/20230102050831.105499-1-jh0801.jung@samsung.com/
https://lore.kernel.org/linux-usb/1646630679-121585-1-git-send-email-jh0801.jung@samsung.com/

Thanks
Selva
Thinh Nguyen Aug. 8, 2024, 1:15 a.m. UTC | #3
On Wed, Aug 07, 2024, Selvarasu Ganesan wrote:
> 
> On 8/7/2024 6:08 AM, Thinh Nguyen wrote:
> > On Fri, Jul 19, 2024, Selvarasu Ganesan wrote:
> >> In certain scenarios, there is a chance that the CPU may not be
> >> scheduled the bottom half of dwc3 interrupt. This is because the CPU
> >> may hang up where any work queue lockup has happened for the same CPU
> >> that is trying to schedule the dwc3 thread interrupt. In this scenario,
> >> the USB can enter runtime suspend as the bus may idle for a longer time
> >> , or user can reconnect the USB cable. Then, the dwc3 event interrupt
> >> can be enabled when runtime resume is happening with regardless of the
> >> previous event status. This can lead to a dwc3 IRQ storm due to the
> >> return from the interrupt handler by checking only the evt->flags as
> >> DWC3_EVENT_PENDING, where the same flag was set as DWC3_EVENT_PENDING
> >> in previous work queue lockup.
> >> Let's consider the following sequences in this scenario,
> >>
> >> Call trace of dwc3 IRQ after workqueue lockup scenario
> >> ======================================================
> >> IRQ #1:
> >> ->dwc3_interrupt()
> >>    ->dwc3_check_event_buf()
> >>          ->if (evt->flags & DWC3_EVENT_PENDING)
> >>                       return IRQ_HANDLED;
> >>          ->evt->flags |= DWC3_EVENT_PENDING;
> >>          ->/* Disable interrupt by setting DWC3_GEVNTSIZ_INTMASK  in
> >>                                                          DWC3_GEVNTSIZ
> >>          ->return IRQ_WAKE_THREAD; // No workqueue scheduled for dwc3
> >>                                       thread_fu due to workqueue lockup
> >>                                       even after return IRQ_WAKE_THREAD
> >>                                       from top-half.
> >>
> >> Thread #2:
> >> ->dwc3_runtime_resume()
> >>   ->dwc3_resume_common()
> >>     ->dwc3_gadget_resume()
> >>        ->dwc3_gadget_soft_connect()
> >>          ->dwc3_event_buffers_setup()
> >>             ->/*Enable interrupt by clearing  DWC3_GEVNTSIZ_INTMASK in
> >>                                                          DWC3_GEVNTSIZ*/
> >>
> >> Start IRQ Storming after enable dwc3 event in resume path
> >> =========================================================
> >> CPU0: IRQ
> >> dwc3_interrupt()
> >>   dwc3_check_event_buf()
> >>          if (evt->flags & DWC3_EVENT_PENDING)
> >>           return IRQ_HANDLED;
> >>
> >> CPU0: IRQ
> >> dwc3_interrupt()
> >>   dwc3_check_event_buf()
> >>          if (evt->flags & DWC3_EVENT_PENDING)
> >>           return IRQ_HANDLED;
> >> ..
> >> ..
> >>
> >> To fix this issue by avoiding enabling of the dwc3 event interrupt in
> >> the runtime resume path if dwc3 event processing is in progress.
> >>
> >> Signed-off-by: Selvarasu Ganesan <selvarasu.g@samsung.com>
> >> ---
> >>   drivers/usb/dwc3/core.c | 8 ++++++--
> >>   1 file changed, 6 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
> >> index cb82557678dd..610792a70805 100644
> >> --- a/drivers/usb/dwc3/core.c
> >> +++ b/drivers/usb/dwc3/core.c
> >> @@ -549,8 +549,12 @@ int dwc3_event_buffers_setup(struct dwc3 *dwc)
> >>   			lower_32_bits(evt->dma));
> >>   	dwc3_writel(dwc->regs, DWC3_GEVNTADRHI(0),
> >>   			upper_32_bits(evt->dma));
> >> -	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
> >> -			DWC3_GEVNTSIZ_SIZE(evt->length));
> >> +
> >> +	/* Skip enable dwc3 event interrupt if event is processing in middle */
> >> +	if (!(evt->flags & DWC3_EVENT_PENDING))
> >> +		dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
> >> +				DWC3_GEVNTSIZ_SIZE(evt->length));
> >> +
> >>   	dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 0);
> >>   
> >>   	return 0;
> >> -- 
> >> 2.17.1
> >>
> > We're not waking up from a hibernation. So after a soft-reset and
> > resume, the events that weren't processed are stale. They should be
> > processed prior to entering suspend or be discarded before resume.
> >
> > The synchronize_irq() during suspend() was not sufficient to prevent
> > this? What are we missing here.
> >
> > Thanks,
> > Thinh
> I don’t think the triggering of interrupt would not be stopped even if 
> do soft reset. It's because of event count is may be valid .

Ok. I think I see what you're referring to when you say "event is
processing in the middle" now.

What you want to check is probably this in dwc3_event_buffers_setup().
Please confirm:

if (dwc->pending_events)
	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
			DWC3_GEVNTSIZ_INTMASK | DWC3_GEVNTSIZ_SIZE(evt->length));
else
	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0), DWC3_GEVNTSIZ_SIZE(evt->length));

> Let consider the scenarios where SW is not acknowledge the event count 
> after getting a interrupt with disable GEVNTSIZ_INTMASK =0. It will 
> triggering the spurious interrupts until enable GEVNTSIZ_INTMASK=1 or 
> acknowledge the event count by SW. This is happening here because of We 
> just return from interrupt handler by checking if evt->flags as 
> DWC3_EVENT_PENDING. Clearing of DWC3_EVENT_PENDING flag is done in 
> dwc3_thread_interrupt. In some scenario it's not happening due to cpu 
> hangup or work queue lockup.

This can be mitigated by adjusting the imod_interval (interrupt
moderation). Have you tried that?

Thanks,
Thinh

> 
> The same issue was reported earlier and not derived actual root cause 
> from USB dwc3 driver point of view, and somehow we managing fix with our 
> vendor kernel by using kretprobe.
> 
> To fix this issue, We have to reset the evt->flags or stop disable  
> GEVNTSIZ_INTMASK=0
> https://urldefense.com/v3/__https://lore.kernel.org/linux-usb/20230102050831.105499-1-jh0801.jung@samsung.com/__;!!A4F2R9G_pg!b0Uj7NvYzWqrvNKOrkkXaY-g4Uaso-pW520AY2fBdlqmJeudUnmdsgpnL1YxAZaw2ydyg2M-XzhetbCraKxJa5C3NBQ$ 
> https://urldefense.com/v3/__https://lore.kernel.org/linux-usb/1646630679-121585-1-git-send-email-jh0801.jung@samsung.com/__;!!A4F2R9G_pg!b0Uj7NvYzWqrvNKOrkkXaY-g4Uaso-pW520AY2fBdlqmJeudUnmdsgpnL1YxAZaw2ydyg2M-XzhetbCraKxJ2d2qDqA$ 
> 

Thanks,
Thinh
Selvarasu Ganesan Aug. 8, 2024, 6:23 a.m. UTC | #4
On 8/8/2024 6:45 AM, Thinh Nguyen wrote:
> On Wed, Aug 07, 2024, Selvarasu Ganesan wrote:
>> On 8/7/2024 6:08 AM, Thinh Nguyen wrote:
>>> On Fri, Jul 19, 2024, Selvarasu Ganesan wrote:
>>>> In certain scenarios, there is a chance that the CPU may not be
>>>> scheduled the bottom half of dwc3 interrupt. This is because the CPU
>>>> may hang up where any work queue lockup has happened for the same CPU
>>>> that is trying to schedule the dwc3 thread interrupt. In this scenario,
>>>> the USB can enter runtime suspend as the bus may idle for a longer time
>>>> , or user can reconnect the USB cable. Then, the dwc3 event interrupt
>>>> can be enabled when runtime resume is happening with regardless of the
>>>> previous event status. This can lead to a dwc3 IRQ storm due to the
>>>> return from the interrupt handler by checking only the evt->flags as
>>>> DWC3_EVENT_PENDING, where the same flag was set as DWC3_EVENT_PENDING
>>>> in previous work queue lockup.
>>>> Let's consider the following sequences in this scenario,
>>>>
>>>> Call trace of dwc3 IRQ after workqueue lockup scenario
>>>> ======================================================
>>>> IRQ #1:
>>>> ->dwc3_interrupt()
>>>>     ->dwc3_check_event_buf()
>>>>           ->if (evt->flags & DWC3_EVENT_PENDING)
>>>>                        return IRQ_HANDLED;
>>>>           ->evt->flags |= DWC3_EVENT_PENDING;
>>>>           ->/* Disable interrupt by setting DWC3_GEVNTSIZ_INTMASK  in
>>>>                                                           DWC3_GEVNTSIZ
>>>>           ->return IRQ_WAKE_THREAD; // No workqueue scheduled for dwc3
>>>>                                        thread_fu due to workqueue lockup
>>>>                                        even after return IRQ_WAKE_THREAD
>>>>                                        from top-half.
>>>>
>>>> Thread #2:
>>>> ->dwc3_runtime_resume()
>>>>    ->dwc3_resume_common()
>>>>      ->dwc3_gadget_resume()
>>>>         ->dwc3_gadget_soft_connect()
>>>>           ->dwc3_event_buffers_setup()
>>>>              ->/*Enable interrupt by clearing  DWC3_GEVNTSIZ_INTMASK in
>>>>                                                           DWC3_GEVNTSIZ*/
>>>>
>>>> Start IRQ Storming after enable dwc3 event in resume path
>>>> =========================================================
>>>> CPU0: IRQ
>>>> dwc3_interrupt()
>>>>    dwc3_check_event_buf()
>>>>           if (evt->flags & DWC3_EVENT_PENDING)
>>>>            return IRQ_HANDLED;
>>>>
>>>> CPU0: IRQ
>>>> dwc3_interrupt()
>>>>    dwc3_check_event_buf()
>>>>           if (evt->flags & DWC3_EVENT_PENDING)
>>>>            return IRQ_HANDLED;
>>>> ..
>>>> ..
>>>>
>>>> To fix this issue by avoiding enabling of the dwc3 event interrupt in
>>>> the runtime resume path if dwc3 event processing is in progress.
>>>>
>>>> Signed-off-by: Selvarasu Ganesan <selvarasu.g@samsung.com>
>>>> ---
>>>>    drivers/usb/dwc3/core.c | 8 ++++++--
>>>>    1 file changed, 6 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
>>>> index cb82557678dd..610792a70805 100644
>>>> --- a/drivers/usb/dwc3/core.c
>>>> +++ b/drivers/usb/dwc3/core.c
>>>> @@ -549,8 +549,12 @@ int dwc3_event_buffers_setup(struct dwc3 *dwc)
>>>>    			lower_32_bits(evt->dma));
>>>>    	dwc3_writel(dwc->regs, DWC3_GEVNTADRHI(0),
>>>>    			upper_32_bits(evt->dma));
>>>> -	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>>>> -			DWC3_GEVNTSIZ_SIZE(evt->length));
>>>> +
>>>> +	/* Skip enable dwc3 event interrupt if event is processing in middle */
>>>> +	if (!(evt->flags & DWC3_EVENT_PENDING))
>>>> +		dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>>>> +				DWC3_GEVNTSIZ_SIZE(evt->length));
>>>> +
>>>>    	dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 0);
>>>>    
>>>>    	return 0;
>>>> -- 
>>>> 2.17.1
>>>>
>>> We're not waking up from a hibernation. So after a soft-reset and
>>> resume, the events that weren't processed are stale. They should be
>>> processed prior to entering suspend or be discarded before resume.
>>>
>>> The synchronize_irq() during suspend() was not sufficient to prevent
>>> this? What are we missing here.
>>>
>>> Thanks,
>>> Thinh
>> I don’t think the triggering of interrupt would not be stopped even if
>> do soft reset. It's because of event count is may be valid .
> Ok. I think I see what you're referring to when you say "event is
> processing in the middle" now.
>
> What you want to check is probably this in dwc3_event_buffers_setup().
> Please confirm:
>
> if (dwc->pending_events)
> 	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
> 			DWC3_GEVNTSIZ_INTMASK | DWC3_GEVNTSIZ_SIZE(evt->length));
> else
> 	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0), DWC3_GEVNTSIZ_SIZE(evt->length));

Yes, we are expecting the same. But, we must verify the status of 
evt->flags, which will indicate whether the event is currently 
processing in middle or not. The below code is for the reference.

if (!(evt->flags & DWC3_EVENT_PENDING))
	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
			 DWC3_GEVNTSIZ_SIZE(evt->length));
else
	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
			DWC3_GEVNTSIZ_INTMASK | DWC3_GEVNTSIZ_SIZE(evt->length));

>> Let consider the scenarios where SW is not acknowledge the event count
>> after getting a interrupt with disable GEVNTSIZ_INTMASK =0. It will
>> triggering the spurious interrupts until enable GEVNTSIZ_INTMASK=1 or
>> acknowledge the event count by SW. This is happening here because of We
>> just return from interrupt handler by checking if evt->flags as
>> DWC3_EVENT_PENDING. Clearing of DWC3_EVENT_PENDING flag is done in
>> dwc3_thread_interrupt. In some scenario it's not happening due to cpu
>> hangup or work queue lockup.
> This can be mitigated by adjusting the imod_interval (interrupt
> moderation). Have you tried that?


Yes we tried to play around the changing of imod interval value but no 
improvements.
>
> Thanks,
> Thinh
>
>> The same issue was reported earlier and not derived actual root cause
>> from USB dwc3 driver point of view, and somehow we managing fix with our
>> vendor kernel by using kretprobe.
>>
>> To fix this issue, We have to reset the evt->flags or stop disable
>> GEVNTSIZ_INTMASK=0
>> https://urldefense.com/v3/__https://lore.kernel.org/linux-usb/20230102050831.105499-1-jh0801.jung@samsung.com/__;!!A4F2R9G_pg!b0Uj7NvYzWqrvNKOrkkXaY-g4Uaso-pW520AY2fBdlqmJeudUnmdsgpnL1YxAZaw2ydyg2M-XzhetbCraKxJa5C3NBQ$
>> https://urldefense.com/v3/__https://lore.kernel.org/linux-usb/1646630679-121585-1-git-send-email-jh0801.jung@samsung.com/__;!!A4F2R9G_pg!b0Uj7NvYzWqrvNKOrkkXaY-g4Uaso-pW520AY2fBdlqmJeudUnmdsgpnL1YxAZaw2ydyg2M-XzhetbCraKxJ2d2qDqA$
>>
> Thanks,
> Thinh
Thinh Nguyen Aug. 9, 2024, 11:42 p.m. UTC | #5
On Thu, Aug 08, 2024, Selvarasu Ganesan wrote:
> 
> On 8/8/2024 6:45 AM, Thinh Nguyen wrote:
> > On Wed, Aug 07, 2024, Selvarasu Ganesan wrote:
> >> On 8/7/2024 6:08 AM, Thinh Nguyen wrote:
> >>> On Fri, Jul 19, 2024, Selvarasu Ganesan wrote:
> >>>> In certain scenarios, there is a chance that the CPU may not be
> >>>> scheduled the bottom half of dwc3 interrupt. This is because the CPU
> >>>> may hang up where any work queue lockup has happened for the same CPU
> >>>> that is trying to schedule the dwc3 thread interrupt. In this scenario,
> >>>> the USB can enter runtime suspend as the bus may idle for a longer time
> >>>> , or user can reconnect the USB cable. Then, the dwc3 event interrupt
> >>>> can be enabled when runtime resume is happening with regardless of the
> >>>> previous event status. This can lead to a dwc3 IRQ storm due to the
> >>>> return from the interrupt handler by checking only the evt->flags as
> >>>> DWC3_EVENT_PENDING, where the same flag was set as DWC3_EVENT_PENDING
> >>>> in previous work queue lockup.
> >>>> Let's consider the following sequences in this scenario,
> >>>>
> >>>> Call trace of dwc3 IRQ after workqueue lockup scenario
> >>>> ======================================================
> >>>> IRQ #1:
> >>>> ->dwc3_interrupt()
> >>>>     ->dwc3_check_event_buf()
> >>>>           ->if (evt->flags & DWC3_EVENT_PENDING)
> >>>>                        return IRQ_HANDLED;
> >>>>           ->evt->flags |= DWC3_EVENT_PENDING;
> >>>>           ->/* Disable interrupt by setting DWC3_GEVNTSIZ_INTMASK  in
> >>>>                                                           DWC3_GEVNTSIZ
> >>>>           ->return IRQ_WAKE_THREAD; // No workqueue scheduled for dwc3
> >>>>                                        thread_fu due to workqueue lockup
> >>>>                                        even after return IRQ_WAKE_THREAD
> >>>>                                        from top-half.
> >>>>
> >>>> Thread #2:
> >>>> ->dwc3_runtime_resume()
> >>>>    ->dwc3_resume_common()
> >>>>      ->dwc3_gadget_resume()
> >>>>         ->dwc3_gadget_soft_connect()
> >>>>           ->dwc3_event_buffers_setup()
> >>>>              ->/*Enable interrupt by clearing  DWC3_GEVNTSIZ_INTMASK in
> >>>>                                                           DWC3_GEVNTSIZ*/
> >>>>
> >>>> Start IRQ Storming after enable dwc3 event in resume path
> >>>> =========================================================
> >>>> CPU0: IRQ
> >>>> dwc3_interrupt()
> >>>>    dwc3_check_event_buf()
> >>>>           if (evt->flags & DWC3_EVENT_PENDING)
> >>>>            return IRQ_HANDLED;
> >>>>
> >>>> CPU0: IRQ
> >>>> dwc3_interrupt()
> >>>>    dwc3_check_event_buf()
> >>>>           if (evt->flags & DWC3_EVENT_PENDING)
> >>>>            return IRQ_HANDLED;
> >>>> ..
> >>>> ..
> >>>>
> >>>> To fix this issue by avoiding enabling of the dwc3 event interrupt in
> >>>> the runtime resume path if dwc3 event processing is in progress.
> >>>>
> >>>> Signed-off-by: Selvarasu Ganesan <selvarasu.g@samsung.com>
> >>>> ---
> >>>>    drivers/usb/dwc3/core.c | 8 ++++++--
> >>>>    1 file changed, 6 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
> >>>> index cb82557678dd..610792a70805 100644
> >>>> --- a/drivers/usb/dwc3/core.c
> >>>> +++ b/drivers/usb/dwc3/core.c
> >>>> @@ -549,8 +549,12 @@ int dwc3_event_buffers_setup(struct dwc3 *dwc)
> >>>>    			lower_32_bits(evt->dma));
> >>>>    	dwc3_writel(dwc->regs, DWC3_GEVNTADRHI(0),
> >>>>    			upper_32_bits(evt->dma));
> >>>> -	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
> >>>> -			DWC3_GEVNTSIZ_SIZE(evt->length));
> >>>> +
> >>>> +	/* Skip enable dwc3 event interrupt if event is processing in middle */
> >>>> +	if (!(evt->flags & DWC3_EVENT_PENDING))
> >>>> +		dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
> >>>> +				DWC3_GEVNTSIZ_SIZE(evt->length));
> >>>> +
> >>>>    	dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 0);
> >>>>    
> >>>>    	return 0;
> >>>> -- 
> >>>> 2.17.1
> >>>>
> >>> We're not waking up from a hibernation. So after a soft-reset and
> >>> resume, the events that weren't processed are stale. They should be
> >>> processed prior to entering suspend or be discarded before resume.
> >>>
> >>> The synchronize_irq() during suspend() was not sufficient to prevent
> >>> this? What are we missing here.
> >>>
> >>> Thanks,
> >>> Thinh
> >> I don’t think the triggering of interrupt would not be stopped even if
> >> do soft reset. It's because of event count is may be valid .
> > Ok. I think I see what you're referring to when you say "event is
> > processing in the middle" now.
> >
> > What you want to check is probably this in dwc3_event_buffers_setup().
> > Please confirm:
> >
> > if (dwc->pending_events)
> > 	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
> > 			DWC3_GEVNTSIZ_INTMASK | DWC3_GEVNTSIZ_SIZE(evt->length));
> > else
> > 	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0), DWC3_GEVNTSIZ_SIZE(evt->length));
> 
> Yes, we are expecting the same. But, we must verify the status of 
> evt->flags, which will indicate whether the event is currently 
> processing in middle or not. The below code is for the reference.
> 
> if (!(evt->flags & DWC3_EVENT_PENDING))
> 	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
> 			 DWC3_GEVNTSIZ_SIZE(evt->length));
> else
> 	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
> 			DWC3_GEVNTSIZ_INTMASK | DWC3_GEVNTSIZ_SIZE(evt->length));

So, this happens while pending_events is set right? I need to review
this runtime suspend flow next week. Something doesn't look right. When
there's a suspend/resume runtime or not, there's a soft disconnect. We
shouldn't be processing any event prior to going into suspend. Also, we
shouldn't be doing soft-disconnect while connected and in operation
unless we specifically tell it to.

> 
> >> Let consider the scenarios where SW is not acknowledge the event count
> >> after getting a interrupt with disable GEVNTSIZ_INTMASK =0. It will
> >> triggering the spurious interrupts until enable GEVNTSIZ_INTMASK=1 or
> >> acknowledge the event count by SW. This is happening here because of We
> >> just return from interrupt handler by checking if evt->flags as
> >> DWC3_EVENT_PENDING. Clearing of DWC3_EVENT_PENDING flag is done in
> >> dwc3_thread_interrupt. In some scenario it's not happening due to cpu
> >> hangup or work queue lockup.
> > This can be mitigated by adjusting the imod_interval (interrupt
> > moderation). Have you tried that?
> 
> 
> Yes we tried to play around the changing of imod interval value but no 
> improvements.

Ok.

Thanks,
Thinh
Thinh Nguyen Aug. 9, 2024, 11:45 p.m. UTC | #6
On Fri, Aug 09, 2024, Thinh Nguyen wrote:
> On Thu, Aug 08, 2024, Selvarasu Ganesan wrote:
> > 
> > On 8/8/2024 6:45 AM, Thinh Nguyen wrote:
> > > On Wed, Aug 07, 2024, Selvarasu Ganesan wrote:
> > >> On 8/7/2024 6:08 AM, Thinh Nguyen wrote:
> > >>> On Fri, Jul 19, 2024, Selvarasu Ganesan wrote:
> > >>>> In certain scenarios, there is a chance that the CPU may not be
> > >>>> scheduled the bottom half of dwc3 interrupt. This is because the CPU
> > >>>> may hang up where any work queue lockup has happened for the same CPU
> > >>>> that is trying to schedule the dwc3 thread interrupt. In this scenario,
> > >>>> the USB can enter runtime suspend as the bus may idle for a longer time
> > >>>> , or user can reconnect the USB cable. Then, the dwc3 event interrupt
> > >>>> can be enabled when runtime resume is happening with regardless of the
> > >>>> previous event status. This can lead to a dwc3 IRQ storm due to the
> > >>>> return from the interrupt handler by checking only the evt->flags as
> > >>>> DWC3_EVENT_PENDING, where the same flag was set as DWC3_EVENT_PENDING
> > >>>> in previous work queue lockup.
> > >>>> Let's consider the following sequences in this scenario,
> > >>>>
> > >>>> Call trace of dwc3 IRQ after workqueue lockup scenario
> > >>>> ======================================================
> > >>>> IRQ #1:
> > >>>> ->dwc3_interrupt()
> > >>>>     ->dwc3_check_event_buf()
> > >>>>           ->if (evt->flags & DWC3_EVENT_PENDING)
> > >>>>                        return IRQ_HANDLED;
> > >>>>           ->evt->flags |= DWC3_EVENT_PENDING;
> > >>>>           ->/* Disable interrupt by setting DWC3_GEVNTSIZ_INTMASK  in
> > >>>>                                                           DWC3_GEVNTSIZ
> > >>>>           ->return IRQ_WAKE_THREAD; // No workqueue scheduled for dwc3
> > >>>>                                        thread_fu due to workqueue lockup
> > >>>>                                        even after return IRQ_WAKE_THREAD
> > >>>>                                        from top-half.
> > >>>>
> > >>>> Thread #2:
> > >>>> ->dwc3_runtime_resume()
> > >>>>    ->dwc3_resume_common()
> > >>>>      ->dwc3_gadget_resume()
> > >>>>         ->dwc3_gadget_soft_connect()
> > >>>>           ->dwc3_event_buffers_setup()
> > >>>>              ->/*Enable interrupt by clearing  DWC3_GEVNTSIZ_INTMASK in
> > >>>>                                                           DWC3_GEVNTSIZ*/
> > >>>>
> > >>>> Start IRQ Storming after enable dwc3 event in resume path
> > >>>> =========================================================
> > >>>> CPU0: IRQ
> > >>>> dwc3_interrupt()
> > >>>>    dwc3_check_event_buf()
> > >>>>           if (evt->flags & DWC3_EVENT_PENDING)
> > >>>>            return IRQ_HANDLED;
> > >>>>
> > >>>> CPU0: IRQ
> > >>>> dwc3_interrupt()
> > >>>>    dwc3_check_event_buf()
> > >>>>           if (evt->flags & DWC3_EVENT_PENDING)
> > >>>>            return IRQ_HANDLED;
> > >>>> ..
> > >>>> ..
> > >>>>
> > >>>> To fix this issue by avoiding enabling of the dwc3 event interrupt in
> > >>>> the runtime resume path if dwc3 event processing is in progress.
> > >>>>
> > >>>> Signed-off-by: Selvarasu Ganesan <selvarasu.g@samsung.com>
> > >>>> ---
> > >>>>    drivers/usb/dwc3/core.c | 8 ++++++--
> > >>>>    1 file changed, 6 insertions(+), 2 deletions(-)
> > >>>>
> > >>>> diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
> > >>>> index cb82557678dd..610792a70805 100644
> > >>>> --- a/drivers/usb/dwc3/core.c
> > >>>> +++ b/drivers/usb/dwc3/core.c
> > >>>> @@ -549,8 +549,12 @@ int dwc3_event_buffers_setup(struct dwc3 *dwc)
> > >>>>    			lower_32_bits(evt->dma));
> > >>>>    	dwc3_writel(dwc->regs, DWC3_GEVNTADRHI(0),
> > >>>>    			upper_32_bits(evt->dma));
> > >>>> -	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
> > >>>> -			DWC3_GEVNTSIZ_SIZE(evt->length));
> > >>>> +
> > >>>> +	/* Skip enable dwc3 event interrupt if event is processing in middle */
> > >>>> +	if (!(evt->flags & DWC3_EVENT_PENDING))
> > >>>> +		dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
> > >>>> +				DWC3_GEVNTSIZ_SIZE(evt->length));
> > >>>> +
> > >>>>    	dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 0);
> > >>>>    
> > >>>>    	return 0;
> > >>>> -- 
> > >>>> 2.17.1
> > >>>>
> > >>> We're not waking up from a hibernation. So after a soft-reset and
> > >>> resume, the events that weren't processed are stale. They should be
> > >>> processed prior to entering suspend or be discarded before resume.
> > >>>
> > >>> The synchronize_irq() during suspend() was not sufficient to prevent
> > >>> this? What are we missing here.
> > >>>
> > >>> Thanks,
> > >>> Thinh
> > >> I don’t think the triggering of interrupt would not be stopped even if
> > >> do soft reset. It's because of event count is may be valid .
> > > Ok. I think I see what you're referring to when you say "event is
> > > processing in the middle" now.
> > >
> > > What you want to check is probably this in dwc3_event_buffers_setup().
> > > Please confirm:
> > >
> > > if (dwc->pending_events)
> > > 	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
> > > 			DWC3_GEVNTSIZ_INTMASK | DWC3_GEVNTSIZ_SIZE(evt->length));
> > > else
> > > 	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0), DWC3_GEVNTSIZ_SIZE(evt->length));
> > 
> > Yes, we are expecting the same. But, we must verify the status of 
> > evt->flags, which will indicate whether the event is currently 
> > processing in middle or not. The below code is for the reference.
> > 
> > if (!(evt->flags & DWC3_EVENT_PENDING))
> > 	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
> > 			 DWC3_GEVNTSIZ_SIZE(evt->length));
> > else
> > 	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
> > 			DWC3_GEVNTSIZ_INTMASK | DWC3_GEVNTSIZ_SIZE(evt->length));
> 
> So, this happens while pending_events is set right? I need to review
> this runtime suspend flow next week. Something doesn't look right. When
> there's a suspend/resume runtime or not, there's a soft disconnect. We
> shouldn't be processing any event prior to going into suspend. Also, we

Clarification: I mean we shouldn't process any event that happened prior
to suspend on resume because there was a disconnect.

> shouldn't be doing soft-disconnect while connected and in operation
> unless we specifically tell it to.

Thinh
Selvarasu Ganesan Aug. 10, 2024, 3:14 p.m. UTC | #7
On 8/10/2024 5:15 AM, Thinh Nguyen wrote:
> On Fri, Aug 09, 2024, Thinh Nguyen wrote:
>> On Thu, Aug 08, 2024, Selvarasu Ganesan wrote:
>>> On 8/8/2024 6:45 AM, Thinh Nguyen wrote:
>>>> On Wed, Aug 07, 2024, Selvarasu Ganesan wrote:
>>>>> On 8/7/2024 6:08 AM, Thinh Nguyen wrote:
>>>>>> On Fri, Jul 19, 2024, Selvarasu Ganesan wrote:
>>>>>>> In certain scenarios, there is a chance that the CPU may not be
>>>>>>> scheduled the bottom half of dwc3 interrupt. This is because the CPU
>>>>>>> may hang up where any work queue lockup has happened for the same CPU
>>>>>>> that is trying to schedule the dwc3 thread interrupt. In this scenario,
>>>>>>> the USB can enter runtime suspend as the bus may idle for a longer time
>>>>>>> , or user can reconnect the USB cable. Then, the dwc3 event interrupt
>>>>>>> can be enabled when runtime resume is happening with regardless of the
>>>>>>> previous event status. This can lead to a dwc3 IRQ storm due to the
>>>>>>> return from the interrupt handler by checking only the evt->flags as
>>>>>>> DWC3_EVENT_PENDING, where the same flag was set as DWC3_EVENT_PENDING
>>>>>>> in previous work queue lockup.
>>>>>>> Let's consider the following sequences in this scenario,
>>>>>>>
>>>>>>> Call trace of dwc3 IRQ after workqueue lockup scenario
>>>>>>> ======================================================
>>>>>>> IRQ #1:
>>>>>>> ->dwc3_interrupt()
>>>>>>>      ->dwc3_check_event_buf()
>>>>>>>            ->if (evt->flags & DWC3_EVENT_PENDING)
>>>>>>>                         return IRQ_HANDLED;
>>>>>>>            ->evt->flags |= DWC3_EVENT_PENDING;
>>>>>>>            ->/* Disable interrupt by setting DWC3_GEVNTSIZ_INTMASK  in
>>>>>>>                                                            DWC3_GEVNTSIZ
>>>>>>>            ->return IRQ_WAKE_THREAD; // No workqueue scheduled for dwc3
>>>>>>>                                         thread_fu due to workqueue lockup
>>>>>>>                                         even after return IRQ_WAKE_THREAD
>>>>>>>                                         from top-half.
>>>>>>>
>>>>>>> Thread #2:
>>>>>>> ->dwc3_runtime_resume()
>>>>>>>     ->dwc3_resume_common()
>>>>>>>       ->dwc3_gadget_resume()
>>>>>>>          ->dwc3_gadget_soft_connect()
>>>>>>>            ->dwc3_event_buffers_setup()
>>>>>>>               ->/*Enable interrupt by clearing  DWC3_GEVNTSIZ_INTMASK in
>>>>>>>                                                            DWC3_GEVNTSIZ*/
>>>>>>>
>>>>>>> Start IRQ Storming after enable dwc3 event in resume path
>>>>>>> =========================================================
>>>>>>> CPU0: IRQ
>>>>>>> dwc3_interrupt()
>>>>>>>     dwc3_check_event_buf()
>>>>>>>            if (evt->flags & DWC3_EVENT_PENDING)
>>>>>>>             return IRQ_HANDLED;
>>>>>>>
>>>>>>> CPU0: IRQ
>>>>>>> dwc3_interrupt()
>>>>>>>     dwc3_check_event_buf()
>>>>>>>            if (evt->flags & DWC3_EVENT_PENDING)
>>>>>>>             return IRQ_HANDLED;
>>>>>>> ..
>>>>>>> ..
>>>>>>>
>>>>>>> To fix this issue by avoiding enabling of the dwc3 event interrupt in
>>>>>>> the runtime resume path if dwc3 event processing is in progress.
>>>>>>>
>>>>>>> Signed-off-by: Selvarasu Ganesan <selvarasu.g@samsung.com>
>>>>>>> ---
>>>>>>>     drivers/usb/dwc3/core.c | 8 ++++++--
>>>>>>>     1 file changed, 6 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
>>>>>>> index cb82557678dd..610792a70805 100644
>>>>>>> --- a/drivers/usb/dwc3/core.c
>>>>>>> +++ b/drivers/usb/dwc3/core.c
>>>>>>> @@ -549,8 +549,12 @@ int dwc3_event_buffers_setup(struct dwc3 *dwc)
>>>>>>>     			lower_32_bits(evt->dma));
>>>>>>>     	dwc3_writel(dwc->regs, DWC3_GEVNTADRHI(0),
>>>>>>>     			upper_32_bits(evt->dma));
>>>>>>> -	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>>>>>>> -			DWC3_GEVNTSIZ_SIZE(evt->length));
>>>>>>> +
>>>>>>> +	/* Skip enable dwc3 event interrupt if event is processing in middle */
>>>>>>> +	if (!(evt->flags & DWC3_EVENT_PENDING))
>>>>>>> +		dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>>>>>>> +				DWC3_GEVNTSIZ_SIZE(evt->length));
>>>>>>> +
>>>>>>>     	dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 0);
>>>>>>>     
>>>>>>>     	return 0;
>>>>>>> -- 
>>>>>>> 2.17.1
>>>>>>>
>>>>>> We're not waking up from a hibernation. So after a soft-reset and
>>>>>> resume, the events that weren't processed are stale. They should be
>>>>>> processed prior to entering suspend or be discarded before resume.
>>>>>>
>>>>>> The synchronize_irq() during suspend() was not sufficient to prevent
>>>>>> this? What are we missing here.
>>>>>>
>>>>>> Thanks,
>>>>>> Thinh
>>>>> I don’t think the triggering of interrupt would not be stopped even if
>>>>> do soft reset. It's because of event count is may be valid .
>>>> Ok. I think I see what you're referring to when you say "event is
>>>> processing in the middle" now.
>>>>
>>>> What you want to check is probably this in dwc3_event_buffers_setup().
>>>> Please confirm:
>>>>
>>>> if (dwc->pending_events)
>>>> 	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>>>> 			DWC3_GEVNTSIZ_INTMASK | DWC3_GEVNTSIZ_SIZE(evt->length));
>>>> else
>>>> 	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0), DWC3_GEVNTSIZ_SIZE(evt->length));
>>> Yes, we are expecting the same. But, we must verify the status of
>>> evt->flags, which will indicate whether the event is currently
>>> processing in middle or not. The below code is for the reference.
>>>
>>> if (!(evt->flags & DWC3_EVENT_PENDING))
>>> 	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>>> 			 DWC3_GEVNTSIZ_SIZE(evt->length));
>>> else
>>> 	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>>> 			DWC3_GEVNTSIZ_INTMASK | DWC3_GEVNTSIZ_SIZE(evt->length));
>> So, this happens while pending_events is set right? I need to review
>> this runtime suspend flow next week. Something doesn't look right. When

yes. You are correct. Its happening while pending_events is set.
>> there's a suspend/resume runtime or not, there's a soft disconnect. We
>> shouldn't be processing any event prior to going into suspend. Also, we
> Clarification: I mean we shouldn't process any event that happened prior
> to suspend on resume because there was a disconnect.

Agree.
>
>> shouldn't be doing soft-disconnect while connected and in operation
>> unless we specifically tell it to.
> Thinh
Selvarasu Ganesan Aug. 30, 2024, 12:16 p.m. UTC | #8
On 8/10/2024 5:12 AM, Thinh Nguyen wrote:
> On Thu, Aug 08, 2024, Selvarasu Ganesan wrote:
>> On 8/8/2024 6:45 AM, Thinh Nguyen wrote:
>>> On Wed, Aug 07, 2024, Selvarasu Ganesan wrote:
>>>> On 8/7/2024 6:08 AM, Thinh Nguyen wrote:
>>>>> On Fri, Jul 19, 2024, Selvarasu Ganesan wrote:
>>>>>> In certain scenarios, there is a chance that the CPU may not be
>>>>>> scheduled the bottom half of dwc3 interrupt. This is because the CPU
>>>>>> may hang up where any work queue lockup has happened for the same CPU
>>>>>> that is trying to schedule the dwc3 thread interrupt. In this scenario,
>>>>>> the USB can enter runtime suspend as the bus may idle for a longer time
>>>>>> , or user can reconnect the USB cable. Then, the dwc3 event interrupt
>>>>>> can be enabled when runtime resume is happening with regardless of the
>>>>>> previous event status. This can lead to a dwc3 IRQ storm due to the
>>>>>> return from the interrupt handler by checking only the evt->flags as
>>>>>> DWC3_EVENT_PENDING, where the same flag was set as DWC3_EVENT_PENDING
>>>>>> in previous work queue lockup.
>>>>>> Let's consider the following sequences in this scenario,
>>>>>>
>>>>>> Call trace of dwc3 IRQ after workqueue lockup scenario
>>>>>> ======================================================
>>>>>> IRQ #1:
>>>>>> ->dwc3_interrupt()
>>>>>>      ->dwc3_check_event_buf()
>>>>>>            ->if (evt->flags & DWC3_EVENT_PENDING)
>>>>>>                         return IRQ_HANDLED;
>>>>>>            ->evt->flags |= DWC3_EVENT_PENDING;
>>>>>>            ->/* Disable interrupt by setting DWC3_GEVNTSIZ_INTMASK  in
>>>>>>                                                            DWC3_GEVNTSIZ
>>>>>>            ->return IRQ_WAKE_THREAD; // No workqueue scheduled for dwc3
>>>>>>                                         thread_fu due to workqueue lockup
>>>>>>                                         even after return IRQ_WAKE_THREAD
>>>>>>                                         from top-half.
>>>>>>
>>>>>> Thread #2:
>>>>>> ->dwc3_runtime_resume()
>>>>>>     ->dwc3_resume_common()
>>>>>>       ->dwc3_gadget_resume()
>>>>>>          ->dwc3_gadget_soft_connect()
>>>>>>            ->dwc3_event_buffers_setup()
>>>>>>               ->/*Enable interrupt by clearing  DWC3_GEVNTSIZ_INTMASK in
>>>>>>                                                            DWC3_GEVNTSIZ*/
>>>>>>
>>>>>> Start IRQ Storming after enable dwc3 event in resume path
>>>>>> =========================================================
>>>>>> CPU0: IRQ
>>>>>> dwc3_interrupt()
>>>>>>     dwc3_check_event_buf()
>>>>>>            if (evt->flags & DWC3_EVENT_PENDING)
>>>>>>             return IRQ_HANDLED;
>>>>>>
>>>>>> CPU0: IRQ
>>>>>> dwc3_interrupt()
>>>>>>     dwc3_check_event_buf()
>>>>>>            if (evt->flags & DWC3_EVENT_PENDING)
>>>>>>             return IRQ_HANDLED;
>>>>>> ..
>>>>>> ..
>>>>>>
>>>>>> To fix this issue by avoiding enabling of the dwc3 event interrupt in
>>>>>> the runtime resume path if dwc3 event processing is in progress.
>>>>>>
>>>>>> Signed-off-by: Selvarasu Ganesan <selvarasu.g@samsung.com>
>>>>>> ---
>>>>>>     drivers/usb/dwc3/core.c | 8 ++++++--
>>>>>>     1 file changed, 6 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
>>>>>> index cb82557678dd..610792a70805 100644
>>>>>> --- a/drivers/usb/dwc3/core.c
>>>>>> +++ b/drivers/usb/dwc3/core.c
>>>>>> @@ -549,8 +549,12 @@ int dwc3_event_buffers_setup(struct dwc3 *dwc)
>>>>>>     			lower_32_bits(evt->dma));
>>>>>>     	dwc3_writel(dwc->regs, DWC3_GEVNTADRHI(0),
>>>>>>     			upper_32_bits(evt->dma));
>>>>>> -	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>>>>>> -			DWC3_GEVNTSIZ_SIZE(evt->length));
>>>>>> +
>>>>>> +	/* Skip enable dwc3 event interrupt if event is processing in middle */
>>>>>> +	if (!(evt->flags & DWC3_EVENT_PENDING))
>>>>>> +		dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>>>>>> +				DWC3_GEVNTSIZ_SIZE(evt->length));
>>>>>> +
>>>>>>     	dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 0);
>>>>>>     
>>>>>>     	return 0;
>>>>>> -- 
>>>>>> 2.17.1
>>>>>>
>>>>> We're not waking up from a hibernation. So after a soft-reset and
>>>>> resume, the events that weren't processed are stale. They should be
>>>>> processed prior to entering suspend or be discarded before resume.
>>>>>
>>>>> The synchronize_irq() during suspend() was not sufficient to prevent
>>>>> this? What are we missing here.
>>>>>
>>>>> Thanks,
>>>>> Thinh
>>>> I don’t think the triggering of interrupt would not be stopped even if
>>>> do soft reset. It's because of event count is may be valid .
>>> Ok. I think I see what you're referring to when you say "event is
>>> processing in the middle" now.
>>>
>>> What you want to check is probably this in dwc3_event_buffers_setup().
>>> Please confirm:
>>>
>>> if (dwc->pending_events)
>>> 	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>>> 			DWC3_GEVNTSIZ_INTMASK | DWC3_GEVNTSIZ_SIZE(evt->length));
>>> else
>>> 	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0), DWC3_GEVNTSIZ_SIZE(evt->length));
>> Yes, we are expecting the same. But, we must verify the status of
>> evt->flags, which will indicate whether the event is currently
>> processing in middle or not. The below code is for the reference.
>>
>> if (!(evt->flags & DWC3_EVENT_PENDING))
>> 	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>> 			 DWC3_GEVNTSIZ_SIZE(evt->length));
>> else
>> 	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>> 			DWC3_GEVNTSIZ_INTMASK | DWC3_GEVNTSIZ_SIZE(evt->length));
> So, this happens while pending_events is set right? I need to review
> this runtime suspend flow next week. Something doesn't look right. When
> there's a suspend/resume runtime or not, there's a soft disconnect. We
> shouldn't be processing any event prior to going into suspend. Also, we
> shouldn't be doing soft-disconnect while connected and in operation
> unless we specifically tell it to.
HI Thinh,

Would you be able to review this runtime suspend flow?

 From our end, after conducting multiple regression tests, we have 
determined that the resetting of "evt->flags" are sufficient when 
attempting to enable event IRQ masks instead of enable event IRQ mask 
based on pending event flags. There is a possibility that reconnecting 
USB with the host PC may cause event interrupts to be missed by the CPU 
if disable event IRQ mask.  So, The fix should be as follow. Could you 
please review this once from your end?

diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
index ccc3895dbd7f..3b2441608e9e 100644
--- a/drivers/usb/dwc3/core.c
+++ b/drivers/usb/dwc3/core.c
@@ -554,6 +554,15 @@ int dwc3_event_buffers_setup(struct dwc3 *dwc)
                         lower_32_bits(evt->dma));
         dwc3_writel(dwc->regs, DWC3_GEVNTADRHI(0),
                         upper_32_bits(evt->dma));
+
+       /*
+        * The DWC3_EVENT_PENDING flag is cleared if it has
+        * already been set when enabling the event IRQ mask
+        * to prevent possibly of an IRQ storm.
+        */
+       if (evt->flags & DWC3_EVENT_PENDING)
+               evt->flags &= ~DWC3_EVENT_PENDING;
+
         dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
                         DWC3_GEVNTSIZ_SIZE(evt->length));
         dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 0);

Thanks,
Selva
>
>>>> Let consider the scenarios where SW is not acknowledge the event count
>>>> after getting a interrupt with disable GEVNTSIZ_INTMASK =0. It will
>>>> triggering the spurious interrupts until enable GEVNTSIZ_INTMASK=1 or
>>>> acknowledge the event count by SW. This is happening here because of We
>>>> just return from interrupt handler by checking if evt->flags as
>>>> DWC3_EVENT_PENDING. Clearing of DWC3_EVENT_PENDING flag is done in
>>>> dwc3_thread_interrupt. In some scenario it's not happening due to cpu
>>>> hangup or work queue lockup.
>>> This can be mitigated by adjusting the imod_interval (interrupt
>>> moderation). Have you tried that?
>>
>> Yes we tried to play around the changing of imod interval value but no
>> improvements.
> Ok.
>
> Thanks,
> Thinh
Thinh Nguyen Aug. 31, 2024, 12:50 a.m. UTC | #9
Hi Selvarasu,

On Fri, Aug 30, 2024, Selvarasu Ganesan wrote:
> 
> On 8/10/2024 5:12 AM, Thinh Nguyen wrote:
> > On Thu, Aug 08, 2024, Selvarasu Ganesan wrote:
> >> On 8/8/2024 6:45 AM, Thinh Nguyen wrote:
> >>> On Wed, Aug 07, 2024, Selvarasu Ganesan wrote:
> >>>> On 8/7/2024 6:08 AM, Thinh Nguyen wrote:
> >>>>> On Fri, Jul 19, 2024, Selvarasu Ganesan wrote:
> >>>>>> In certain scenarios, there is a chance that the CPU may not be
> >>>>>> scheduled the bottom half of dwc3 interrupt. This is because the CPU
> >>>>>> may hang up where any work queue lockup has happened for the same CPU
> >>>>>> that is trying to schedule the dwc3 thread interrupt. In this scenario,
> >>>>>> the USB can enter runtime suspend as the bus may idle for a longer time
> >>>>>> , or user can reconnect the USB cable. Then, the dwc3 event interrupt
> >>>>>> can be enabled when runtime resume is happening with regardless of the
> >>>>>> previous event status. This can lead to a dwc3 IRQ storm due to the
> >>>>>> return from the interrupt handler by checking only the evt->flags as
> >>>>>> DWC3_EVENT_PENDING, where the same flag was set as DWC3_EVENT_PENDING
> >>>>>> in previous work queue lockup.
> >>>>>> Let's consider the following sequences in this scenario,
> >>>>>>
> >>>>>> Call trace of dwc3 IRQ after workqueue lockup scenario
> >>>>>> ======================================================
> >>>>>> IRQ #1:
> >>>>>> ->dwc3_interrupt()
> >>>>>>      ->dwc3_check_event_buf()
> >>>>>>            ->if (evt->flags & DWC3_EVENT_PENDING)
> >>>>>>                         return IRQ_HANDLED;
> >>>>>>            ->evt->flags |= DWC3_EVENT_PENDING;
> >>>>>>            ->/* Disable interrupt by setting DWC3_GEVNTSIZ_INTMASK  in
> >>>>>>                                                            DWC3_GEVNTSIZ
> >>>>>>            ->return IRQ_WAKE_THREAD; // No workqueue scheduled for dwc3
> >>>>>>                                         thread_fu due to workqueue lockup
> >>>>>>                                         even after return IRQ_WAKE_THREAD
> >>>>>>                                         from top-half.
> >>>>>>
> >>>>>> Thread #2:
> >>>>>> ->dwc3_runtime_resume()
> >>>>>>     ->dwc3_resume_common()
> >>>>>>       ->dwc3_gadget_resume()
> >>>>>>          ->dwc3_gadget_soft_connect()
> >>>>>>            ->dwc3_event_buffers_setup()
> >>>>>>               ->/*Enable interrupt by clearing  DWC3_GEVNTSIZ_INTMASK in
> >>>>>>                                                            DWC3_GEVNTSIZ*/
> >>>>>>
> >>>>>> Start IRQ Storming after enable dwc3 event in resume path
> >>>>>> =========================================================
> >>>>>> CPU0: IRQ
> >>>>>> dwc3_interrupt()
> >>>>>>     dwc3_check_event_buf()
> >>>>>>            if (evt->flags & DWC3_EVENT_PENDING)
> >>>>>>             return IRQ_HANDLED;
> >>>>>>
> >>>>>> CPU0: IRQ
> >>>>>> dwc3_interrupt()
> >>>>>>     dwc3_check_event_buf()
> >>>>>>            if (evt->flags & DWC3_EVENT_PENDING)
> >>>>>>             return IRQ_HANDLED;
> >>>>>> ..
> >>>>>> ..
> >>>>>>
> >>>>>> To fix this issue by avoiding enabling of the dwc3 event interrupt in
> >>>>>> the runtime resume path if dwc3 event processing is in progress.
> >>>>>>
> >>>>>> Signed-off-by: Selvarasu Ganesan <selvarasu.g@samsung.com>
> >>>>>> ---
> >>>>>>     drivers/usb/dwc3/core.c | 8 ++++++--
> >>>>>>     1 file changed, 6 insertions(+), 2 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
> >>>>>> index cb82557678dd..610792a70805 100644
> >>>>>> --- a/drivers/usb/dwc3/core.c
> >>>>>> +++ b/drivers/usb/dwc3/core.c
> >>>>>> @@ -549,8 +549,12 @@ int dwc3_event_buffers_setup(struct dwc3 *dwc)
> >>>>>>     			lower_32_bits(evt->dma));
> >>>>>>     	dwc3_writel(dwc->regs, DWC3_GEVNTADRHI(0),
> >>>>>>     			upper_32_bits(evt->dma));
> >>>>>> -	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
> >>>>>> -			DWC3_GEVNTSIZ_SIZE(evt->length));
> >>>>>> +
> >>>>>> +	/* Skip enable dwc3 event interrupt if event is processing in middle */
> >>>>>> +	if (!(evt->flags & DWC3_EVENT_PENDING))
> >>>>>> +		dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
> >>>>>> +				DWC3_GEVNTSIZ_SIZE(evt->length));
> >>>>>> +
> >>>>>>     	dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 0);
> >>>>>>     
> >>>>>>     	return 0;
> >>>>>> -- 
> >>>>>> 2.17.1
> >>>>>>
> >>>>> We're not waking up from a hibernation. So after a soft-reset and
> >>>>> resume, the events that weren't processed are stale. They should be
> >>>>> processed prior to entering suspend or be discarded before resume.
> >>>>>
> >>>>> The synchronize_irq() during suspend() was not sufficient to prevent
> >>>>> this? What are we missing here.
> >>>>>
> >>>>> Thanks,
> >>>>> Thinh
> >>>> I don’t think the triggering of interrupt would not be stopped even if
> >>>> do soft reset. It's because of event count is may be valid .
> >>> Ok. I think I see what you're referring to when you say "event is
> >>> processing in the middle" now.
> >>>
> >>> What you want to check is probably this in dwc3_event_buffers_setup().
> >>> Please confirm:
> >>>
> >>> if (dwc->pending_events)
> >>> 	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
> >>> 			DWC3_GEVNTSIZ_INTMASK | DWC3_GEVNTSIZ_SIZE(evt->length));
> >>> else
> >>> 	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0), DWC3_GEVNTSIZ_SIZE(evt->length));
> >> Yes, we are expecting the same. But, we must verify the status of
> >> evt->flags, which will indicate whether the event is currently
> >> processing in middle or not. The below code is for the reference.
> >>
> >> if (!(evt->flags & DWC3_EVENT_PENDING))
> >> 	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
> >> 			 DWC3_GEVNTSIZ_SIZE(evt->length));
> >> else
> >> 	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
> >> 			DWC3_GEVNTSIZ_INTMASK | DWC3_GEVNTSIZ_SIZE(evt->length));
> > So, this happens while pending_events is set right? I need to review
> > this runtime suspend flow next week. Something doesn't look right. When
> > there's a suspend/resume runtime or not, there's a soft disconnect. We
> > shouldn't be processing any event prior to going into suspend. Also, we
> > shouldn't be doing soft-disconnect while connected and in operation
> > unless we specifically tell it to.
> HI Thinh,
> 
> Would you be able to review this runtime suspend flow?
> 
>  From our end, after conducting multiple regression tests, we have 
> determined that the resetting of "evt->flags" are sufficient when 
> attempting to enable event IRQ masks instead of enable event IRQ mask 
> based on pending event flags. There is a possibility that reconnecting 
> USB with the host PC may cause event interrupts to be missed by the CPU 
> if disable event IRQ mask.  So, The fix should be as follow. Could you 
> please review this once from your end?
> 
> diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
> index ccc3895dbd7f..3b2441608e9e 100644
> --- a/drivers/usb/dwc3/core.c
> +++ b/drivers/usb/dwc3/core.c
> @@ -554,6 +554,15 @@ int dwc3_event_buffers_setup(struct dwc3 *dwc)
>                          lower_32_bits(evt->dma));
>          dwc3_writel(dwc->regs, DWC3_GEVNTADRHI(0),
>                          upper_32_bits(evt->dma));
> +
> +       /*
> +        * The DWC3_EVENT_PENDING flag is cleared if it has
> +        * already been set when enabling the event IRQ mask
> +        * to prevent possibly of an IRQ storm.
> +        */
> +       if (evt->flags & DWC3_EVENT_PENDING)
> +               evt->flags &= ~DWC3_EVENT_PENDING;
> +
>          dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>                          DWC3_GEVNTSIZ_SIZE(evt->length));
>          dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 0);
> 

Sorry for the delay response.

In addition to that, please rework and rename
dwc3_gadget_process_pending_event(). We're not supposed to handle any
left-over event. So remove the manual calls to the interrupt handlers
there.

On runtime suspend, the device is soft disconnected. So any interrupt
assertion to notify a new connection must be a custom configuration of
your platform. No event should be generated while the run_stop bit is
cleared.

On runtime resume, we will initiate soft-reset and soft-connect to
restore the run_stop bit. A new connection event will be generated then.

Thanks,
Thinh
Selvarasu Ganesan Sept. 2, 2024, 11:27 a.m. UTC | #10
On 8/31/2024 6:20 AM, Thinh Nguyen wrote:
> Hi Selvarasu,
>
> On Fri, Aug 30, 2024, Selvarasu Ganesan wrote:
>> On 8/10/2024 5:12 AM, Thinh Nguyen wrote:
>>> On Thu, Aug 08, 2024, Selvarasu Ganesan wrote:
>>>> On 8/8/2024 6:45 AM, Thinh Nguyen wrote:
>>>>> On Wed, Aug 07, 2024, Selvarasu Ganesan wrote:
>>>>>> On 8/7/2024 6:08 AM, Thinh Nguyen wrote:
>>>>>>> On Fri, Jul 19, 2024, Selvarasu Ganesan wrote:
>>>>>>>> In certain scenarios, there is a chance that the CPU may not be
>>>>>>>> scheduled the bottom half of dwc3 interrupt. This is because the CPU
>>>>>>>> may hang up where any work queue lockup has happened for the same CPU
>>>>>>>> that is trying to schedule the dwc3 thread interrupt. In this scenario,
>>>>>>>> the USB can enter runtime suspend as the bus may idle for a longer time
>>>>>>>> , or user can reconnect the USB cable. Then, the dwc3 event interrupt
>>>>>>>> can be enabled when runtime resume is happening with regardless of the
>>>>>>>> previous event status. This can lead to a dwc3 IRQ storm due to the
>>>>>>>> return from the interrupt handler by checking only the evt->flags as
>>>>>>>> DWC3_EVENT_PENDING, where the same flag was set as DWC3_EVENT_PENDING
>>>>>>>> in previous work queue lockup.
>>>>>>>> Let's consider the following sequences in this scenario,
>>>>>>>>
>>>>>>>> Call trace of dwc3 IRQ after workqueue lockup scenario
>>>>>>>> ======================================================
>>>>>>>> IRQ #1:
>>>>>>>> ->dwc3_interrupt()
>>>>>>>>       ->dwc3_check_event_buf()
>>>>>>>>             ->if (evt->flags & DWC3_EVENT_PENDING)
>>>>>>>>                          return IRQ_HANDLED;
>>>>>>>>             ->evt->flags |= DWC3_EVENT_PENDING;
>>>>>>>>             ->/* Disable interrupt by setting DWC3_GEVNTSIZ_INTMASK  in
>>>>>>>>                                                             DWC3_GEVNTSIZ
>>>>>>>>             ->return IRQ_WAKE_THREAD; // No workqueue scheduled for dwc3
>>>>>>>>                                          thread_fu due to workqueue lockup
>>>>>>>>                                          even after return IRQ_WAKE_THREAD
>>>>>>>>                                          from top-half.
>>>>>>>>
>>>>>>>> Thread #2:
>>>>>>>> ->dwc3_runtime_resume()
>>>>>>>>      ->dwc3_resume_common()
>>>>>>>>        ->dwc3_gadget_resume()
>>>>>>>>           ->dwc3_gadget_soft_connect()
>>>>>>>>             ->dwc3_event_buffers_setup()
>>>>>>>>                ->/*Enable interrupt by clearing  DWC3_GEVNTSIZ_INTMASK in
>>>>>>>>                                                             DWC3_GEVNTSIZ*/
>>>>>>>>
>>>>>>>> Start IRQ Storming after enable dwc3 event in resume path
>>>>>>>> =========================================================
>>>>>>>> CPU0: IRQ
>>>>>>>> dwc3_interrupt()
>>>>>>>>      dwc3_check_event_buf()
>>>>>>>>             if (evt->flags & DWC3_EVENT_PENDING)
>>>>>>>>              return IRQ_HANDLED;
>>>>>>>>
>>>>>>>> CPU0: IRQ
>>>>>>>> dwc3_interrupt()
>>>>>>>>      dwc3_check_event_buf()
>>>>>>>>             if (evt->flags & DWC3_EVENT_PENDING)
>>>>>>>>              return IRQ_HANDLED;
>>>>>>>> ..
>>>>>>>> ..
>>>>>>>>
>>>>>>>> To fix this issue by avoiding enabling of the dwc3 event interrupt in
>>>>>>>> the runtime resume path if dwc3 event processing is in progress.
>>>>>>>>
>>>>>>>> Signed-off-by: Selvarasu Ganesan <selvarasu.g@samsung.com>
>>>>>>>> ---
>>>>>>>>      drivers/usb/dwc3/core.c | 8 ++++++--
>>>>>>>>      1 file changed, 6 insertions(+), 2 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
>>>>>>>> index cb82557678dd..610792a70805 100644
>>>>>>>> --- a/drivers/usb/dwc3/core.c
>>>>>>>> +++ b/drivers/usb/dwc3/core.c
>>>>>>>> @@ -549,8 +549,12 @@ int dwc3_event_buffers_setup(struct dwc3 *dwc)
>>>>>>>>      			lower_32_bits(evt->dma));
>>>>>>>>      	dwc3_writel(dwc->regs, DWC3_GEVNTADRHI(0),
>>>>>>>>      			upper_32_bits(evt->dma));
>>>>>>>> -	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>>>>>>>> -			DWC3_GEVNTSIZ_SIZE(evt->length));
>>>>>>>> +
>>>>>>>> +	/* Skip enable dwc3 event interrupt if event is processing in middle */
>>>>>>>> +	if (!(evt->flags & DWC3_EVENT_PENDING))
>>>>>>>> +		dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>>>>>>>> +				DWC3_GEVNTSIZ_SIZE(evt->length));
>>>>>>>> +
>>>>>>>>      	dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 0);
>>>>>>>>      
>>>>>>>>      	return 0;
>>>>>>>> -- 
>>>>>>>> 2.17.1
>>>>>>>>
>>>>>>> We're not waking up from a hibernation. So after a soft-reset and
>>>>>>> resume, the events that weren't processed are stale. They should be
>>>>>>> processed prior to entering suspend or be discarded before resume.
>>>>>>>
>>>>>>> The synchronize_irq() during suspend() was not sufficient to prevent
>>>>>>> this? What are we missing here.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Thinh
>>>>>> I don’t think the triggering of interrupt would not be stopped even if
>>>>>> do soft reset. It's because of event count is may be valid .
>>>>> Ok. I think I see what you're referring to when you say "event is
>>>>> processing in the middle" now.
>>>>>
>>>>> What you want to check is probably this in dwc3_event_buffers_setup().
>>>>> Please confirm:
>>>>>
>>>>> if (dwc->pending_events)
>>>>> 	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>>>>> 			DWC3_GEVNTSIZ_INTMASK | DWC3_GEVNTSIZ_SIZE(evt->length));
>>>>> else
>>>>> 	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0), DWC3_GEVNTSIZ_SIZE(evt->length));
>>>> Yes, we are expecting the same. But, we must verify the status of
>>>> evt->flags, which will indicate whether the event is currently
>>>> processing in middle or not. The below code is for the reference.
>>>>
>>>> if (!(evt->flags & DWC3_EVENT_PENDING))
>>>> 	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>>>> 			 DWC3_GEVNTSIZ_SIZE(evt->length));
>>>> else
>>>> 	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>>>> 			DWC3_GEVNTSIZ_INTMASK | DWC3_GEVNTSIZ_SIZE(evt->length));
>>> So, this happens while pending_events is set right? I need to review
>>> this runtime suspend flow next week. Something doesn't look right. When
>>> there's a suspend/resume runtime or not, there's a soft disconnect. We
>>> shouldn't be processing any event prior to going into suspend. Also, we
>>> shouldn't be doing soft-disconnect while connected and in operation
>>> unless we specifically tell it to.
>> HI Thinh,
>>
>> Would you be able to review this runtime suspend flow?
>>
>>   From our end, after conducting multiple regression tests, we have
>> determined that the resetting of "evt->flags" are sufficient when
>> attempting to enable event IRQ masks instead of enable event IRQ mask
>> based on pending event flags. There is a possibility that reconnecting
>> USB with the host PC may cause event interrupts to be missed by the CPU
>> if disable event IRQ mask.  So, The fix should be as follow. Could you
>> please review this once from your end?
>>
>> diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
>> index ccc3895dbd7f..3b2441608e9e 100644
>> --- a/drivers/usb/dwc3/core.c
>> +++ b/drivers/usb/dwc3/core.c
>> @@ -554,6 +554,15 @@ int dwc3_event_buffers_setup(struct dwc3 *dwc)
>>                           lower_32_bits(evt->dma));
>>           dwc3_writel(dwc->regs, DWC3_GEVNTADRHI(0),
>>                           upper_32_bits(evt->dma));
>> +
>> +       /*
>> +        * The DWC3_EVENT_PENDING flag is cleared if it has
>> +        * already been set when enabling the event IRQ mask
>> +        * to prevent possibly of an IRQ storm.
>> +        */
>> +       if (evt->flags & DWC3_EVENT_PENDING)
>> +               evt->flags &= ~DWC3_EVENT_PENDING;
>> +
>>           dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>>                           DWC3_GEVNTSIZ_SIZE(evt->length));
>>           dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 0);
>>
> Sorry for the delay response.
>
> In addition to that, please rework and rename
> dwc3_gadget_process_pending_event(). We're not supposed to handle any
> left-over event. So remove the manual calls to the interrupt handlers
> there.
Hi Thinh,

Thanks for your comments.

Regarding the handling of leftover events during dwc3 runtime resume, I 
understand that we are not supposed to handle any leftover events. Would 
you be interested in making changes to the code as suggested below? The 
reason for removing interrupt handlers from 
dwc3_gadget_process_pending_events() is to avoid handling any leftover 
events from the last suspend right. If so, based on my understanding, we 
can simply remove the use of dwc3_gadget_process_pending_events() 
instead of rework on this function since it is not necessary if we 
remove the call to interrupt handlers from there.

I would like to reconfirm from our end that in our failure scenario, we 
observe that DWC3_EVENT_PENDING is set in evt->flags when the dwc3 
resume sequence is executed, and the dwc->pending_events flag is not 
being set.



diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
index ccc3895dbd7f..63e8dd24ad0e 100644
--- a/drivers/usb/dwc3/core.c
+++ b/drivers/usb/dwc3/core.c
@@ -550,6 +550,15 @@ int dwc3_event_buffers_setup(struct dwc3 *dwc)

         evt = dwc->ev_buf;
         evt->lpos = 0;
+
+       /*
+        * The DWC3_EVENT_PENDING flag is cleared if it has
+        * already been set when enabling the event IRQ mask
+        * to prevent possibly of an IRQ storm.
+        */
+       if (evt->flags & DWC3_EVENT_PENDING)
+               evt->flags &= ~DWC3_EVENT_PENDING;
+
         dwc3_writel(dwc->regs, DWC3_GEVNTADRLO(0),
lower_32_bits(evt->dma));
         dwc3_writel(dwc->regs, DWC3_GEVNTADRHI(0),
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
index 89fc690fdf34..951c805337c2 100644
--- a/drivers/usb/dwc3/gadget.c
+++ b/drivers/usb/dwc3/gadget.c
@@ -4739,8 +4739,6 @@ int dwc3_gadget_resume(struct dwc3 *dwc)
  void dwc3_gadget_process_pending_events(struct dwc3 *dwc)
  {
         if (dwc->pending_events) {
- dwc3_interrupt(dwc->irq_gadget, dwc->ev_buf);
- dwc3_thread_interrupt(dwc->irq_gadget, dwc->ev_buf);
                 pm_runtime_put(dwc->dev);
                 dwc->pending_events = false;
enable_irq(dwc->irq_gadget);


Thanks,
Selva

>
> On runtime suspend, the device is soft disconnected. So any interrupt
> assertion to notify a new connection must be a custom configuration of
> your platform. No event should be generated while the run_stop bit is
> cleared.
>
> On runtime resume, we will initiate soft-reset and soft-connect to
> restore the run_stop bit. A new connection event will be generated then.

Agree.


>
> Thanks,
> Thinh
Thinh Nguyen Sept. 3, 2024, 11:41 p.m. UTC | #11
On Mon, Sep 02, 2024, Selvarasu Ganesan wrote:
> 
> On 8/31/2024 6:20 AM, Thinh Nguyen wrote:
> > Hi Selvarasu,
> >
> > On Fri, Aug 30, 2024, Selvarasu Ganesan wrote:
> >> On 8/10/2024 5:12 AM, Thinh Nguyen wrote:
> >>> On Thu, Aug 08, 2024, Selvarasu Ganesan wrote:
> >>>> On 8/8/2024 6:45 AM, Thinh Nguyen wrote:
> >>>>> On Wed, Aug 07, 2024, Selvarasu Ganesan wrote:
> >>>>>> On 8/7/2024 6:08 AM, Thinh Nguyen wrote:
> >>>>>>> On Fri, Jul 19, 2024, Selvarasu Ganesan wrote:
> >>>>>>>> In certain scenarios, there is a chance that the CPU may not be
> >>>>>>>> scheduled the bottom half of dwc3 interrupt. This is because the CPU
> >>>>>>>> may hang up where any work queue lockup has happened for the same CPU
> >>>>>>>> that is trying to schedule the dwc3 thread interrupt. In this scenario,
> >>>>>>>> the USB can enter runtime suspend as the bus may idle for a longer time
> >>>>>>>> , or user can reconnect the USB cable. Then, the dwc3 event interrupt
> >>>>>>>> can be enabled when runtime resume is happening with regardless of the
> >>>>>>>> previous event status. This can lead to a dwc3 IRQ storm due to the
> >>>>>>>> return from the interrupt handler by checking only the evt->flags as
> >>>>>>>> DWC3_EVENT_PENDING, where the same flag was set as DWC3_EVENT_PENDING
> >>>>>>>> in previous work queue lockup.
> >>>>>>>> Let's consider the following sequences in this scenario,
> >>>>>>>>
> >>>>>>>> Call trace of dwc3 IRQ after workqueue lockup scenario
> >>>>>>>> ======================================================
> >>>>>>>> IRQ #1:
> >>>>>>>> ->dwc3_interrupt()
> >>>>>>>>       ->dwc3_check_event_buf()
> >>>>>>>>             ->if (evt->flags & DWC3_EVENT_PENDING)
> >>>>>>>>                          return IRQ_HANDLED;
> >>>>>>>>             ->evt->flags |= DWC3_EVENT_PENDING;
> >>>>>>>>             ->/* Disable interrupt by setting DWC3_GEVNTSIZ_INTMASK  in
> >>>>>>>>                                                             DWC3_GEVNTSIZ
> >>>>>>>>             ->return IRQ_WAKE_THREAD; // No workqueue scheduled for dwc3
> >>>>>>>>                                          thread_fu due to workqueue lockup
> >>>>>>>>                                          even after return IRQ_WAKE_THREAD
> >>>>>>>>                                          from top-half.
> >>>>>>>>
> >>>>>>>> Thread #2:
> >>>>>>>> ->dwc3_runtime_resume()
> >>>>>>>>      ->dwc3_resume_common()
> >>>>>>>>        ->dwc3_gadget_resume()
> >>>>>>>>           ->dwc3_gadget_soft_connect()
> >>>>>>>>             ->dwc3_event_buffers_setup()
> >>>>>>>>                ->/*Enable interrupt by clearing  DWC3_GEVNTSIZ_INTMASK in
> >>>>>>>>                                                             DWC3_GEVNTSIZ*/
> >>>>>>>>
> >>>>>>>> Start IRQ Storming after enable dwc3 event in resume path
> >>>>>>>> =========================================================
> >>>>>>>> CPU0: IRQ
> >>>>>>>> dwc3_interrupt()
> >>>>>>>>      dwc3_check_event_buf()
> >>>>>>>>             if (evt->flags & DWC3_EVENT_PENDING)
> >>>>>>>>              return IRQ_HANDLED;
> >>>>>>>>
> >>>>>>>> CPU0: IRQ
> >>>>>>>> dwc3_interrupt()
> >>>>>>>>      dwc3_check_event_buf()
> >>>>>>>>             if (evt->flags & DWC3_EVENT_PENDING)
> >>>>>>>>              return IRQ_HANDLED;
> >>>>>>>> ..
> >>>>>>>> ..
> >>>>>>>>
> >>>>>>>> To fix this issue by avoiding enabling of the dwc3 event interrupt in
> >>>>>>>> the runtime resume path if dwc3 event processing is in progress.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Selvarasu Ganesan <selvarasu.g@samsung.com>
> >>>>>>>> ---
> >>>>>>>>      drivers/usb/dwc3/core.c | 8 ++++++--
> >>>>>>>>      1 file changed, 6 insertions(+), 2 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
> >>>>>>>> index cb82557678dd..610792a70805 100644
> >>>>>>>> --- a/drivers/usb/dwc3/core.c
> >>>>>>>> +++ b/drivers/usb/dwc3/core.c
> >>>>>>>> @@ -549,8 +549,12 @@ int dwc3_event_buffers_setup(struct dwc3 *dwc)
> >>>>>>>>      			lower_32_bits(evt->dma));
> >>>>>>>>      	dwc3_writel(dwc->regs, DWC3_GEVNTADRHI(0),
> >>>>>>>>      			upper_32_bits(evt->dma));
> >>>>>>>> -	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
> >>>>>>>> -			DWC3_GEVNTSIZ_SIZE(evt->length));
> >>>>>>>> +
> >>>>>>>> +	/* Skip enable dwc3 event interrupt if event is processing in middle */
> >>>>>>>> +	if (!(evt->flags & DWC3_EVENT_PENDING))
> >>>>>>>> +		dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
> >>>>>>>> +				DWC3_GEVNTSIZ_SIZE(evt->length));
> >>>>>>>> +
> >>>>>>>>      	dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 0);
> >>>>>>>>      
> >>>>>>>>      	return 0;
> >>>>>>>> -- 
> >>>>>>>> 2.17.1
> >>>>>>>>
> >>>>>>> We're not waking up from a hibernation. So after a soft-reset and
> >>>>>>> resume, the events that weren't processed are stale. They should be
> >>>>>>> processed prior to entering suspend or be discarded before resume.
> >>>>>>>
> >>>>>>> The synchronize_irq() during suspend() was not sufficient to prevent
> >>>>>>> this? What are we missing here.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Thinh
> >>>>>> I don’t think the triggering of interrupt would not be stopped even if
> >>>>>> do soft reset. It's because of event count is may be valid .
> >>>>> Ok. I think I see what you're referring to when you say "event is
> >>>>> processing in the middle" now.
> >>>>>
> >>>>> What you want to check is probably this in dwc3_event_buffers_setup().
> >>>>> Please confirm:
> >>>>>
> >>>>> if (dwc->pending_events)
> >>>>> 	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
> >>>>> 			DWC3_GEVNTSIZ_INTMASK | DWC3_GEVNTSIZ_SIZE(evt->length));
> >>>>> else
> >>>>> 	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0), DWC3_GEVNTSIZ_SIZE(evt->length));
> >>>> Yes, we are expecting the same. But, we must verify the status of
> >>>> evt->flags, which will indicate whether the event is currently
> >>>> processing in middle or not. The below code is for the reference.
> >>>>
> >>>> if (!(evt->flags & DWC3_EVENT_PENDING))
> >>>> 	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
> >>>> 			 DWC3_GEVNTSIZ_SIZE(evt->length));
> >>>> else
> >>>> 	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
> >>>> 			DWC3_GEVNTSIZ_INTMASK | DWC3_GEVNTSIZ_SIZE(evt->length));
> >>> So, this happens while pending_events is set right? I need to review
> >>> this runtime suspend flow next week. Something doesn't look right. When
> >>> there's a suspend/resume runtime or not, there's a soft disconnect. We
> >>> shouldn't be processing any event prior to going into suspend. Also, we
> >>> shouldn't be doing soft-disconnect while connected and in operation
> >>> unless we specifically tell it to.
> >> HI Thinh,
> >>
> >> Would you be able to review this runtime suspend flow?
> >>
> >>   From our end, after conducting multiple regression tests, we have
> >> determined that the resetting of "evt->flags" are sufficient when
> >> attempting to enable event IRQ masks instead of enable event IRQ mask
> >> based on pending event flags. There is a possibility that reconnecting
> >> USB with the host PC may cause event interrupts to be missed by the CPU
> >> if disable event IRQ mask.  So, The fix should be as follow. Could you
> >> please review this once from your end?
> >>
> >> diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
> >> index ccc3895dbd7f..3b2441608e9e 100644
> >> --- a/drivers/usb/dwc3/core.c
> >> +++ b/drivers/usb/dwc3/core.c
> >> @@ -554,6 +554,15 @@ int dwc3_event_buffers_setup(struct dwc3 *dwc)
> >>                           lower_32_bits(evt->dma));
> >>           dwc3_writel(dwc->regs, DWC3_GEVNTADRHI(0),
> >>                           upper_32_bits(evt->dma));
> >> +
> >> +       /*
> >> +        * The DWC3_EVENT_PENDING flag is cleared if it has
> >> +        * already been set when enabling the event IRQ mask
> >> +        * to prevent possibly of an IRQ storm.
> >> +        */
> >> +       if (evt->flags & DWC3_EVENT_PENDING)
> >> +               evt->flags &= ~DWC3_EVENT_PENDING;
> >> +
> >>           dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
> >>                           DWC3_GEVNTSIZ_SIZE(evt->length));
> >>           dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 0);
> >>
> > Sorry for the delay response.
> >
> > In addition to that, please rework and rename
> > dwc3_gadget_process_pending_event(). We're not supposed to handle any
> > left-over event. So remove the manual calls to the interrupt handlers
> > there.
> Hi Thinh,
> 
> Thanks for your comments.
> 
> Regarding the handling of leftover events during dwc3 runtime resume, I 
> understand that we are not supposed to handle any leftover events. Would 

Actually, there's no leftover event.

> you be interested in making changes to the code as suggested below? The 
> reason for removing interrupt handlers from 
> dwc3_gadget_process_pending_events() is to avoid handling any leftover 
> events from the last suspend right. If so, based on my understanding, we 
> can simply remove the use of dwc3_gadget_process_pending_events() 
> instead of rework on this function since it is not necessary if we 
> remove the call to interrupt handlers from there.

That's right.

> 
> I would like to reconfirm from our end that in our failure scenario, we 
> observe that DWC3_EVENT_PENDING is set in evt->flags when the dwc3 
> resume sequence is executed, and the dwc->pending_events flag is not 
> being set.
> 
> 
> 
> diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
> index ccc3895dbd7f..63e8dd24ad0e 100644
> --- a/drivers/usb/dwc3/core.c
> +++ b/drivers/usb/dwc3/core.c
> @@ -550,6 +550,15 @@ int dwc3_event_buffers_setup(struct dwc3 *dwc)
> 
>          evt = dwc->ev_buf;
>          evt->lpos = 0;
> +
> +       /*
> +        * The DWC3_EVENT_PENDING flag is cleared if it has
> +        * already been set when enabling the event IRQ mask
> +        * to prevent possibly of an IRQ storm.
> +        */
> +       if (evt->flags & DWC3_EVENT_PENDING)
> +               evt->flags &= ~DWC3_EVENT_PENDING;
> +

We actually don't need to clear DWCC3_EVENT_PENDING flag if we remove
the below.

>          dwc3_writel(dwc->regs, DWC3_GEVNTADRLO(0),
> lower_32_bits(evt->dma));
>          dwc3_writel(dwc->regs, DWC3_GEVNTADRHI(0),
> diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
> index 89fc690fdf34..951c805337c2 100644
> --- a/drivers/usb/dwc3/gadget.c
> +++ b/drivers/usb/dwc3/gadget.c
> @@ -4739,8 +4739,6 @@ int dwc3_gadget_resume(struct dwc3 *dwc)
>   void dwc3_gadget_process_pending_events(struct dwc3 *dwc)
>   {
>          if (dwc->pending_events) {
> - dwc3_interrupt(dwc->irq_gadget, dwc->ev_buf);
> - dwc3_thread_interrupt(dwc->irq_gadget, dwc->ev_buf);
>                  pm_runtime_put(dwc->dev);
>                  dwc->pending_events = false;
> enable_irq(dwc->irq_gadget);
> 
> 
> Thanks,
> Selva
> 
> >
> > On runtime suspend, the device is soft disconnected. So any interrupt
> > assertion to notify a new connection must be a custom configuration of
> > your platform. No event should be generated while the run_stop bit is
> > cleared.
> >
> > On runtime resume, we will initiate soft-reset and soft-connect to
> > restore the run_stop bit. A new connection event will be generated then.
> 
> Agree.
> 

Please try the following with some cleanup and additional comments:

diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
index ccc3895dbd7f..f02dae464efe 100644
--- a/drivers/usb/dwc3/core.c
+++ b/drivers/usb/dwc3/core.c
@@ -2484,7 +2484,11 @@ static int dwc3_runtime_resume(struct device *dev)
 
 	switch (dwc->current_dr_role) {
 	case DWC3_GCTL_PRTCAP_DEVICE:
-		dwc3_gadget_process_pending_events(dwc);
+		if (dwc->pending_events) {
+			pm_runtime_put(dwc->dev);
+			dwc->pending_events = false;
+			enable_irq(dwc->irq_gadget);
+		}
 		break;
 	case DWC3_GCTL_PRTCAP_HOST:
 	default:
@@ -2574,6 +2578,14 @@ static void dwc3_complete(struct device *dev)
 static const struct dev_pm_ops dwc3_dev_pm_ops = {
 	SET_SYSTEM_SLEEP_PM_OPS(dwc3_suspend, dwc3_resume)
 	.complete = dwc3_complete,
+
+	/*
+	 * Current implementation for runtime suspend stops the controller on
+	 * disconnection. It relies on platforms with custom connection
+	 * interrupt to notify the driver to start the controller again. This
+	 * flow is platform specific implemenation and not part of the
+	 * controller's programming flow.
+	 */
 	SET_RUNTIME_PM_OPS(dwc3_runtime_suspend, dwc3_runtime_resume,
 			dwc3_runtime_idle)
 };
diff --git a/drivers/usb/dwc3/core.h b/drivers/usb/dwc3/core.h
index 1e561fd8b86e..2fa3fd55eb32 100644
--- a/drivers/usb/dwc3/core.h
+++ b/drivers/usb/dwc3/core.h
@@ -1673,7 +1673,6 @@ static inline void dwc3_otg_host_init(struct dwc3 *dwc)
 #if !IS_ENABLED(CONFIG_USB_DWC3_HOST)
 int dwc3_gadget_suspend(struct dwc3 *dwc);
 int dwc3_gadget_resume(struct dwc3 *dwc);
-void dwc3_gadget_process_pending_events(struct dwc3 *dwc);
 #else
 static inline int dwc3_gadget_suspend(struct dwc3 *dwc)
 {
@@ -1684,10 +1683,6 @@ static inline int dwc3_gadget_resume(struct dwc3 *dwc)
 {
 	return 0;
 }
-
-static inline void dwc3_gadget_process_pending_events(struct dwc3 *dwc)
-{
-}
 #endif /* !IS_ENABLED(CONFIG_USB_DWC3_HOST) */
 
 #if IS_ENABLED(CONFIG_USB_DWC3_ULPI)
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
index 89fc690fdf34..ea97e9c1d1bd 100644
--- a/drivers/usb/dwc3/gadget.c
+++ b/drivers/usb/dwc3/gadget.c
@@ -4735,14 +4735,3 @@ int dwc3_gadget_resume(struct dwc3 *dwc)
 
 	return dwc3_gadget_soft_connect(dwc);
 }
-
-void dwc3_gadget_process_pending_events(struct dwc3 *dwc)
-{
-	if (dwc->pending_events) {
-		dwc3_interrupt(dwc->irq_gadget, dwc->ev_buf);
-		dwc3_thread_interrupt(dwc->irq_gadget, dwc->ev_buf);
-		pm_runtime_put(dwc->dev);
-		dwc->pending_events = false;
-		enable_irq(dwc->irq_gadget);
-	}
-}


--
BR,
Thinh
Thinh Nguyen Sept. 4, 2024, 1:03 a.m. UTC | #12
On Mon, Sep 02, 2024, Selvarasu Ganesan wrote:
> 
> I would like to reconfirm from our end that in our failure scenario, we 
> observe that DWC3_EVENT_PENDING is set in evt->flags when the dwc3 
> resume sequence is executed, and the dwc->pending_events flag is not 
> being set.
> 

If the controller is stopped, no event is generated until it's restarted
again. (ie, you should not see GEVNTCOUNT updated after clearing
DCTL.run_stop). If there's no event, no interrupt assertion should come
from the controller.

If the pending_events is not set and you still see this failure, then
likely that the controller had started, and the interrupt is generated
from the controller event. This occurs along with the interrupt
generated from your connection notification from your setup.

Check your platform and internal team, what condition would cause the
setup to generate the interrupt and what condition would stop this
custom connection interrupt assertion? This is outside of what's defined
by the flow of the controller.

BR,
Thinh
Selvarasu Ganesan Sept. 4, 2024, 3:50 p.m. UTC | #13
On 9/4/2024 6:33 AM, Thinh Nguyen wrote:
> On Mon, Sep 02, 2024, Selvarasu Ganesan wrote:
>> I would like to reconfirm from our end that in our failure scenario, we
>> observe that DWC3_EVENT_PENDING is set in evt->flags when the dwc3
>> resume sequence is executed, and the dwc->pending_events flag is not
>> being set.
>>
> If the controller is stopped, no event is generated until it's restarted
> again. (ie, you should not see GEVNTCOUNT updated after clearing
> DCTL.run_stop). If there's no event, no interrupt assertion should come
> from the controller.
>
> If the pending_events is not set and you still see this failure, then
> likely that the controller had started, and the interrupt is generated
> from the controller event. This occurs along with the interrupt
> generated from your connection notification from your setup.


I completely agree. My discussion revolves around the handling of the 
DWC3_EVENT_PENDING flag in all situations. The purpose of using this 
flag is to prevent the processing of new events if an existing event is 
still being processed. This flag is set in the top-half interrupt 
handler and cleared at the end of the bottom-half handler.

Now, let's consider scenarios where the bottom half is not scheduled, 
and a USB reconnect occurs. In this case, there is a possibility that 
the interrupt line is unmasked in dwc3_event_buffers_setup, and the USB 
controller begins posting new events. The top-half interrupt handler 
checks for the DWC3_EVENT_PENDING flag and returns IRQ_HANDLED without 
processing any new events. However, the USB controller continues to post 
interrupts until they are acknowledged.

Please review the complete sequence once with DWC3_EVENT_PENDING flag.

My proposal is to clear or reset the DWC3_EVENT_PENDING flag when 
unmasking the interrupt line dwc3_event_buffers_setup, apart from 
bottom-half handler. Clearing the DWC3_EVENT_PENDING flag in 
dwc3_event_buffers_setup does not cause any harm, as we have implemented 
a temporary workaround in our test setup to prevent IRQ storms.



Working scenarios:
==================
1. Top-half handler:
     a. if (evt->flags & DWC3_EVENT_PENDING)
         return IRQ_HANDLED;
     b. Set DWC3_EVENT_PENDING flag
     c. Masking interrupt line

2. Bottom-half handler:
     a. Un-masking interrupt line
     b. Clear DWC3_EVENT_PENDING flag

Failure scenarios:
==================
1. Top-half handler:
     a. if (evt->flags & DWC3_EVENT_PENDING)
                 return IRQ_HANDLED;
     b. Set DWC3_EVENT_PENDING flag
     c. Masking interrupt line

2. No Bottom-half scheduled:

3. USB reconnect: dwc3_event_buffers_setup
     a. Un-masking interrupt line

4. Continuous interrupts : Top-half handler:
     a. if (evt->flags & DWC3_EVENT_PENDING)
                 return IRQ_HANDLED;

     a. if (evt->flags & DWC3_EVENT_PENDING)
                 return IRQ_HANDLED;

     a. if (evt->flags & DWC3_EVENT_PENDING)
                 return IRQ_HANDLED;
.....

.....

.....

Thanks,
Selva

>
> Check your platform and internal team, what condition would cause the
> setup to generate the interrupt and what condition would stop this
> custom connection interrupt assertion? This is outside of what's defined
> by the flow of the controller.
>
> BR,
> Thinh
Thinh Nguyen Sept. 5, 2024, 12:26 a.m. UTC | #14
On Wed, Sep 04, 2024, Selvarasu Ganesan wrote:
> 
> On 9/4/2024 6:33 AM, Thinh Nguyen wrote:
> > On Mon, Sep 02, 2024, Selvarasu Ganesan wrote:
> >> I would like to reconfirm from our end that in our failure scenario, we
> >> observe that DWC3_EVENT_PENDING is set in evt->flags when the dwc3
> >> resume sequence is executed, and the dwc->pending_events flag is not
> >> being set.
> >>
> > If the controller is stopped, no event is generated until it's restarted
> > again. (ie, you should not see GEVNTCOUNT updated after clearing
> > DCTL.run_stop). If there's no event, no interrupt assertion should come
> > from the controller.
> >
> > If the pending_events is not set and you still see this failure, then
> > likely that the controller had started, and the interrupt is generated
> > from the controller event. This occurs along with the interrupt
> > generated from your connection notification from your setup.
> 
> 
> I completely agree. My discussion revolves around the handling of the 
> DWC3_EVENT_PENDING flag in all situations. The purpose of using this 
> flag is to prevent the processing of new events if an existing event is 
> still being processed. This flag is set in the top-half interrupt 
> handler and cleared at the end of the bottom-half handler.
> 
> Now, let's consider scenarios where the bottom half is not scheduled, 
> and a USB reconnect occurs. In this case, there is a possibility that 
> the interrupt line is unmasked in dwc3_event_buffers_setup, and the USB 
> controller begins posting new events. The top-half interrupt handler 
> checks for the DWC3_EVENT_PENDING flag and returns IRQ_HANDLED without 
> processing any new events. However, the USB controller continues to post 
> interrupts until they are acknowledged.
> 
> Please review the complete sequence once with DWC3_EVENT_PENDING flag.
> 
> My proposal is to clear or reset the DWC3_EVENT_PENDING flag when 
> unmasking the interrupt line dwc3_event_buffers_setup, apart from 
> bottom-half handler. Clearing the DWC3_EVENT_PENDING flag in 
> dwc3_event_buffers_setup does not cause any harm, as we have implemented 
> a temporary workaround in our test setup to prevent IRQ storms.
> 
> 
> 
> Working scenarios:
> ==================
> 1. Top-half handler:
>      a. if (evt->flags & DWC3_EVENT_PENDING)
>          return IRQ_HANDLED;
>      b. Set DWC3_EVENT_PENDING flag
>      c. Masking interrupt line
> 
> 2. Bottom-half handler:
>      a. Un-masking interrupt line
>      b. Clear DWC3_EVENT_PENDING flag
> 
> Failure scenarios:
> ==================
> 1. Top-half handler:
>      a. if (evt->flags & DWC3_EVENT_PENDING)
>                  return IRQ_HANDLED;
>      b. Set DWC3_EVENT_PENDING flag
>      c. Masking interrupt line

For DWC3_EVENT_PENDING flag to be set at this point (before we start the
controller), that means that the GEVNTCOUNT was not 0 after
soft-disconnect and that the pm_runtime_suspended() must be false.

> 
> 2. No Bottom-half scheduled:

Why is the bottom-half not scheduled? Or do you mean it hasn't woken up
yet before the next top-half coming?

> 
> 3. USB reconnect: dwc3_event_buffers_setup
>      a. Un-masking interrupt line

Do we know that the GEVNTCOUNT is non-zero before starting the
controller again?

> 
> 4. Continuous interrupts : Top-half handler:
>      a. if (evt->flags & DWC3_EVENT_PENDING)
>                  return IRQ_HANDLED;
> 
>      a. if (evt->flags & DWC3_EVENT_PENDING)
>                  return IRQ_HANDLED;
> 
>      a. if (evt->flags & DWC3_EVENT_PENDING)
>                  return IRQ_HANDLED;
> .....
> 
> .....
> 
> .....
> 

I don't want the driver to process any stale event after a
soft-disconnect. Can we try this instead:

diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
index ccc3895dbd7f..a1e6ba92e808 100644
--- a/drivers/usb/dwc3/core.c
+++ b/drivers/usb/dwc3/core.c
@@ -544,6 +544,7 @@ static int dwc3_alloc_event_buffers(struct dwc3 *dwc, unsigned int length)
 int dwc3_event_buffers_setup(struct dwc3 *dwc)
 {
 	struct dwc3_event_buffer	*evt;
+	u32				reg;
 
 	if (!dwc->ev_buf)
 		return 0;
@@ -556,7 +557,10 @@ int dwc3_event_buffers_setup(struct dwc3 *dwc)
 			upper_32_bits(evt->dma));
 	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
 			DWC3_GEVNTSIZ_SIZE(evt->length));
-	dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 0);
+
+	/* Clear any stale event */
+	reg = dwc3_readl(dwc->regs, DWC3_GEVNTCOUNT(0));
+	dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), reg);
 
 	return 0;
 }
@@ -2484,7 +2488,11 @@ static int dwc3_runtime_resume(struct device *dev)
 
 	switch (dwc->current_dr_role) {
 	case DWC3_GCTL_PRTCAP_DEVICE:
-		dwc3_gadget_process_pending_events(dwc);
+		if (dwc->pending_events) {
+			pm_runtime_put(dwc->dev);
+			dwc->pending_events = false;
+			enable_irq(dwc->irq_gadget);
+		}
 		break;
 	case DWC3_GCTL_PRTCAP_HOST:
 	default:
@@ -2574,6 +2582,12 @@ static void dwc3_complete(struct device *dev)
 static const struct dev_pm_ops dwc3_dev_pm_ops = {
 	SET_SYSTEM_SLEEP_PM_OPS(dwc3_suspend, dwc3_resume)
 	.complete = dwc3_complete,
+
+	/*
+	 * Runtime suspend halts the controller on disconnection. It replies on
+	 * platforms with custom connection notification to start the controller
+	 * again.
+	 */
 	SET_RUNTIME_PM_OPS(dwc3_runtime_suspend, dwc3_runtime_resume,
 			dwc3_runtime_idle)
 };
diff --git a/drivers/usb/dwc3/core.h b/drivers/usb/dwc3/core.h
index 1e561fd8b86e..2fa3fd55eb32 100644
--- a/drivers/usb/dwc3/core.h
+++ b/drivers/usb/dwc3/core.h
@@ -1673,7 +1673,6 @@ static inline void dwc3_otg_host_init(struct dwc3 *dwc)
 #if !IS_ENABLED(CONFIG_USB_DWC3_HOST)
 int dwc3_gadget_suspend(struct dwc3 *dwc);
 int dwc3_gadget_resume(struct dwc3 *dwc);
-void dwc3_gadget_process_pending_events(struct dwc3 *dwc);
 #else
 static inline int dwc3_gadget_suspend(struct dwc3 *dwc)
 {
@@ -1684,10 +1683,6 @@ static inline int dwc3_gadget_resume(struct dwc3 *dwc)
 {
 	return 0;
 }
-
-static inline void dwc3_gadget_process_pending_events(struct dwc3 *dwc)
-{
-}
 #endif /* !IS_ENABLED(CONFIG_USB_DWC3_HOST) */
 
 #if IS_ENABLED(CONFIG_USB_DWC3_ULPI)
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
index 89fc690fdf34..a525f7ea5886 100644
--- a/drivers/usb/dwc3/gadget.c
+++ b/drivers/usb/dwc3/gadget.c
@@ -4490,6 +4490,17 @@ static irqreturn_t dwc3_check_event_buf(struct dwc3_event_buffer *evt)
 		return IRQ_HANDLED;
 
 	count = dwc3_readl(dwc->regs, DWC3_GEVNTCOUNT(0));
+
+	/*
+	 * If the controller is halted, the event count is stale/invalid. Ignore
+	 * them. This happens if the interrupt assertion is from an out-of-band
+	 * resume notification.
+	 */
+	if (!dwc->pullups_connected && count) {
+		dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), count);
+		return IRQ_HANDLED;
+	}
+
 	count &= DWC3_GEVNTCOUNT_MASK;
 	if (!count)
 		return IRQ_NONE;
@@ -4735,14 +4746,3 @@ int dwc3_gadget_resume(struct dwc3 *dwc)
 
 	return dwc3_gadget_soft_connect(dwc);
 }
-
-void dwc3_gadget_process_pending_events(struct dwc3 *dwc)
-{
-	if (dwc->pending_events) {
-		dwc3_interrupt(dwc->irq_gadget, dwc->ev_buf);
-		dwc3_thread_interrupt(dwc->irq_gadget, dwc->ev_buf);
-		pm_runtime_put(dwc->dev);
-		dwc->pending_events = false;
-		enable_irq(dwc->irq_gadget);
-	}
-}


---

Thanks,
Thinh
Selvarasu Ganesan Sept. 5, 2024, 1:19 p.m. UTC | #15
On 9/5/2024 5:56 AM, Thinh Nguyen wrote:
> On Wed, Sep 04, 2024, Selvarasu Ganesan wrote:
>> On 9/4/2024 6:33 AM, Thinh Nguyen wrote:
>>> On Mon, Sep 02, 2024, Selvarasu Ganesan wrote:
>>>> I would like to reconfirm from our end that in our failure scenario, we
>>>> observe that DWC3_EVENT_PENDING is set in evt->flags when the dwc3
>>>> resume sequence is executed, and the dwc->pending_events flag is not
>>>> being set.
>>>>
>>> If the controller is stopped, no event is generated until it's restarted
>>> again. (ie, you should not see GEVNTCOUNT updated after clearing
>>> DCTL.run_stop). If there's no event, no interrupt assertion should come
>>> from the controller.
>>>
>>> If the pending_events is not set and you still see this failure, then
>>> likely that the controller had started, and the interrupt is generated
>>> from the controller event. This occurs along with the interrupt
>>> generated from your connection notification from your setup.
>>
>> I completely agree. My discussion revolves around the handling of the
>> DWC3_EVENT_PENDING flag in all situations. The purpose of using this
>> flag is to prevent the processing of new events if an existing event is
>> still being processed. This flag is set in the top-half interrupt
>> handler and cleared at the end of the bottom-half handler.
>>
>> Now, let's consider scenarios where the bottom half is not scheduled,
>> and a USB reconnect occurs. In this case, there is a possibility that
>> the interrupt line is unmasked in dwc3_event_buffers_setup, and the USB
>> controller begins posting new events. The top-half interrupt handler
>> checks for the DWC3_EVENT_PENDING flag and returns IRQ_HANDLED without
>> processing any new events. However, the USB controller continues to post
>> interrupts until they are acknowledged.
>>
>> Please review the complete sequence once with DWC3_EVENT_PENDING flag.
>>
>> My proposal is to clear or reset the DWC3_EVENT_PENDING flag when
>> unmasking the interrupt line dwc3_event_buffers_setup, apart from
>> bottom-half handler. Clearing the DWC3_EVENT_PENDING flag in
>> dwc3_event_buffers_setup does not cause any harm, as we have implemented
>> a temporary workaround in our test setup to prevent IRQ storms.
>>
>>
>>
>> Working scenarios:
>> ==================
>> 1. Top-half handler:
>>       a. if (evt->flags & DWC3_EVENT_PENDING)
>>           return IRQ_HANDLED;
>>       b. Set DWC3_EVENT_PENDING flag
>>       c. Masking interrupt line
>>
>> 2. Bottom-half handler:
>>       a. Un-masking interrupt line
>>       b. Clear DWC3_EVENT_PENDING flag
>>
>> Failure scenarios:
>> ==================
>> 1. Top-half handler:
>>       a. if (evt->flags & DWC3_EVENT_PENDING)
>>                   return IRQ_HANDLED;
>>       b. Set DWC3_EVENT_PENDING flag
>>       c. Masking interrupt line
> For DWC3_EVENT_PENDING flag to be set at this point (before we start the
> controller), that means that the GEVNTCOUNT was not 0 after
> soft-disconnect and that the pm_runtime_suspended() must be false.

In the top-half code where we set the DWC3_EVENT_PENDING flag, we 
acknowledge GEVNTCOUNT. Therefore, I think it is not necessary for 
GEVNTCOUNT to have a non-zero value until a new event occurs. In fact, 
when we tried to print GEVNTCOUNT in a non-interrupt context, we found 
that it was zero, where we received DWC3_EVENT_PENDING being set in 
non-interrupt context.

>
>> 2. No Bottom-half scheduled:
> Why is the bottom-half not scheduled? Or do you mean it hasn't woken up
> yet before the next top-half coming?

In very rare cases, it is possible in our platform that the CPU may not 
be able to schedule the bottom half of the dwc3 interrupt because a work 
queue lockup has occurred on the same CPU that is attempting to schedule 
the dwc3 thread interrupt. In this case Yes, the bottom-half handler 
hasn't woken up, then initiate an IRQ storm for new events after the 
controller restarts, resulting in no more bottom-half scheduling due to 
the CPU being stuck in processing continuous interrupts and return 
IRQ_HANDLED by checking if (evt->flags & DWC3_EVENT_PENDING).

>
>> 3. USB reconnect: dwc3_event_buffers_setup
>>       a. Un-masking interrupt line
> Do we know that the GEVNTCOUNT is non-zero before starting the
> controller again?

The GEVNTCOUNT value showing as zero that we confirmed by adding debug 
message here.
>
>> 4. Continuous interrupts : Top-half handler:
>>       a. if (evt->flags & DWC3_EVENT_PENDING)
>>                   return IRQ_HANDLED;
>>
>>       a. if (evt->flags & DWC3_EVENT_PENDING)
>>                   return IRQ_HANDLED;
>>
>>       a. if (evt->flags & DWC3_EVENT_PENDING)
>>                   return IRQ_HANDLED;
>> .....
>>
>> .....
>>
>> .....
>>

Sure, I can try implementing the proposed code modifications in our 
testing environment.

But, I am uncertain about how these changes will effectively prevent an 
IRQ storm when the USB controller sequence restarts with the 
DWC3_EVENT_PENDING. The following code will only execute until the 
DWC3_EVENT_PENDING is cleared, at which point the previous bottom-half 
will not be scheduled.

Please correct me if i am wrong in my above understanding.


static irqreturn_t dwc3_check_event_buf(struct dwc3_event_buffer *evt) { 
struct dwc3 *dwc = evt->dwc; u32 amount; u32 count; .....

.....

.....

if (evt->flags & DWC3_EVENT_PENDING) return IRQ_HANDLED; }


Thanks,
Selva
> I don't want the driver to process any stale event after a
> soft-disconnect. Can we try this instead:
>
> diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
> index ccc3895dbd7f..a1e6ba92e808 100644
> --- a/drivers/usb/dwc3/core.c
> +++ b/drivers/usb/dwc3/core.c
> @@ -544,6 +544,7 @@ static int dwc3_alloc_event_buffers(struct dwc3 *dwc, unsigned int length)
>   int dwc3_event_buffers_setup(struct dwc3 *dwc)
>   {
>   	struct dwc3_event_buffer	*evt;
> +	u32				reg;
>   
>   	if (!dwc->ev_buf)
>   		return 0;
> @@ -556,7 +557,10 @@ int dwc3_event_buffers_setup(struct dwc3 *dwc)
>   			upper_32_bits(evt->dma));
>   	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>   			DWC3_GEVNTSIZ_SIZE(evt->length));
> -	dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 0);
> +
> +	/* Clear any stale event */
> +	reg = dwc3_readl(dwc->regs, DWC3_GEVNTCOUNT(0));
> +	dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), reg);
>   
>   	return 0;
>   }
> @@ -2484,7 +2488,11 @@ static int dwc3_runtime_resume(struct device *dev)
>   
>   	switch (dwc->current_dr_role) {
>   	case DWC3_GCTL_PRTCAP_DEVICE:
> -		dwc3_gadget_process_pending_events(dwc);
> +		if (dwc->pending_events) {
> +			pm_runtime_put(dwc->dev);
> +			dwc->pending_events = false;
> +			enable_irq(dwc->irq_gadget);
> +		}
>   		break;
>   	case DWC3_GCTL_PRTCAP_HOST:
>   	default:
> @@ -2574,6 +2582,12 @@ static void dwc3_complete(struct device *dev)
>   static const struct dev_pm_ops dwc3_dev_pm_ops = {
>   	SET_SYSTEM_SLEEP_PM_OPS(dwc3_suspend, dwc3_resume)
>   	.complete = dwc3_complete,
> +
> +	/*
> +	 * Runtime suspend halts the controller on disconnection. It replies on
> +	 * platforms with custom connection notification to start the controller
> +	 * again.
> +	 */
>   	SET_RUNTIME_PM_OPS(dwc3_runtime_suspend, dwc3_runtime_resume,
>   			dwc3_runtime_idle)
>   };
> diff --git a/drivers/usb/dwc3/core.h b/drivers/usb/dwc3/core.h
> index 1e561fd8b86e..2fa3fd55eb32 100644
> --- a/drivers/usb/dwc3/core.h
> +++ b/drivers/usb/dwc3/core.h
> @@ -1673,7 +1673,6 @@ static inline void dwc3_otg_host_init(struct dwc3 *dwc)
>   #if !IS_ENABLED(CONFIG_USB_DWC3_HOST)
>   int dwc3_gadget_suspend(struct dwc3 *dwc);
>   int dwc3_gadget_resume(struct dwc3 *dwc);
> -void dwc3_gadget_process_pending_events(struct dwc3 *dwc);
>   #else
>   static inline int dwc3_gadget_suspend(struct dwc3 *dwc)
>   {
> @@ -1684,10 +1683,6 @@ static inline int dwc3_gadget_resume(struct dwc3 *dwc)
>   {
>   	return 0;
>   }
> -
> -static inline void dwc3_gadget_process_pending_events(struct dwc3 *dwc)
> -{
> -}
>   #endif /* !IS_ENABLED(CONFIG_USB_DWC3_HOST) */
>   
>   #if IS_ENABLED(CONFIG_USB_DWC3_ULPI)
> diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
> index 89fc690fdf34..a525f7ea5886 100644
> --- a/drivers/usb/dwc3/gadget.c
> +++ b/drivers/usb/dwc3/gadget.c
> @@ -4490,6 +4490,17 @@ static irqreturn_t dwc3_check_event_buf(struct dwc3_event_buffer *evt)
>   		return IRQ_HANDLED;
>   
>   	count = dwc3_readl(dwc->regs, DWC3_GEVNTCOUNT(0));
> +
> +	/*
> +	 * If the controller is halted, the event count is stale/invalid. Ignore
> +	 * them. This happens if the interrupt assertion is from an out-of-band
> +	 * resume notification.
> +	 */
> +	if (!dwc->pullups_connected && count) {
> +		dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), count);
> +		return IRQ_HANDLED;
> +	}
> +
>   	count &= DWC3_GEVNTCOUNT_MASK;
>   	if (!count)
>   		return IRQ_NONE;
> @@ -4735,14 +4746,3 @@ int dwc3_gadget_resume(struct dwc3 *dwc)
>   
>   	return dwc3_gadget_soft_connect(dwc);
>   }
> -
> -void dwc3_gadget_process_pending_events(struct dwc3 *dwc)
> -{
> -	if (dwc->pending_events) {
> -		dwc3_interrupt(dwc->irq_gadget, dwc->ev_buf);
> -		dwc3_thread_interrupt(dwc->irq_gadget, dwc->ev_buf);
> -		pm_runtime_put(dwc->dev);
> -		dwc->pending_events = false;
> -		enable_irq(dwc->irq_gadget);
> -	}
> -}
>
>
> ---
>
> Thanks,
> Thinh
Thinh Nguyen Sept. 5, 2024, 9:13 p.m. UTC | #16
On Thu, Sep 05, 2024, Selvarasu Ganesan wrote:
> 
> On 9/5/2024 5:56 AM, Thinh Nguyen wrote:
> > On Wed, Sep 04, 2024, Selvarasu Ganesan wrote:
> >> On 9/4/2024 6:33 AM, Thinh Nguyen wrote:
> >>> On Mon, Sep 02, 2024, Selvarasu Ganesan wrote:
> >>>> I would like to reconfirm from our end that in our failure scenario, we
> >>>> observe that DWC3_EVENT_PENDING is set in evt->flags when the dwc3
> >>>> resume sequence is executed, and the dwc->pending_events flag is not
> >>>> being set.
> >>>>
> >>> If the controller is stopped, no event is generated until it's restarted
> >>> again. (ie, you should not see GEVNTCOUNT updated after clearing
> >>> DCTL.run_stop). If there's no event, no interrupt assertion should come
> >>> from the controller.
> >>>
> >>> If the pending_events is not set and you still see this failure, then
> >>> likely that the controller had started, and the interrupt is generated
> >>> from the controller event. This occurs along with the interrupt
> >>> generated from your connection notification from your setup.
> >>
> >> I completely agree. My discussion revolves around the handling of the
> >> DWC3_EVENT_PENDING flag in all situations. The purpose of using this
> >> flag is to prevent the processing of new events if an existing event is
> >> still being processed. This flag is set in the top-half interrupt
> >> handler and cleared at the end of the bottom-half handler.
> >>
> >> Now, let's consider scenarios where the bottom half is not scheduled,
> >> and a USB reconnect occurs. In this case, there is a possibility that
> >> the interrupt line is unmasked in dwc3_event_buffers_setup, and the USB
> >> controller begins posting new events. The top-half interrupt handler
> >> checks for the DWC3_EVENT_PENDING flag and returns IRQ_HANDLED without
> >> processing any new events. However, the USB controller continues to post
> >> interrupts until they are acknowledged.
> >>
> >> Please review the complete sequence once with DWC3_EVENT_PENDING flag.
> >>
> >> My proposal is to clear or reset the DWC3_EVENT_PENDING flag when
> >> unmasking the interrupt line dwc3_event_buffers_setup, apart from
> >> bottom-half handler. Clearing the DWC3_EVENT_PENDING flag in
> >> dwc3_event_buffers_setup does not cause any harm, as we have implemented
> >> a temporary workaround in our test setup to prevent IRQ storms.
> >>
> >>
> >>
> >> Working scenarios:
> >> ==================
> >> 1. Top-half handler:
> >>       a. if (evt->flags & DWC3_EVENT_PENDING)
> >>           return IRQ_HANDLED;
> >>       b. Set DWC3_EVENT_PENDING flag
> >>       c. Masking interrupt line
> >>
> >> 2. Bottom-half handler:
> >>       a. Un-masking interrupt line
> >>       b. Clear DWC3_EVENT_PENDING flag
> >>
> >> Failure scenarios:
> >> ==================
> >> 1. Top-half handler:
> >>       a. if (evt->flags & DWC3_EVENT_PENDING)
> >>                   return IRQ_HANDLED;
> >>       b. Set DWC3_EVENT_PENDING flag
> >>       c. Masking interrupt line
> > For DWC3_EVENT_PENDING flag to be set at this point (before we start the
> > controller), that means that the GEVNTCOUNT was not 0 after
> > soft-disconnect and that the pm_runtime_suspended() must be false.
> 
> In the top-half code where we set the DWC3_EVENT_PENDING flag, we 
> acknowledge GEVNTCOUNT. Therefore, I think it is not necessary for 
> GEVNTCOUNT to have a non-zero value until a new event occurs. In fact, 
> when we tried to print GEVNTCOUNT in a non-interrupt context, we found 
> that it was zero, where we received DWC3_EVENT_PENDING being set in 
> non-interrupt context.

For DWC3_EVENT_PENDING to be set, GEVNTCOUNT must be non-zero. If you
see it's zero, that means that it was already decremented by the driver.

If the driver acknowledges the GEVNTCOUNT, then that means that the
events are copied and prepared to be processed. The bottom-half thread
is scheduled. If it's for stale event, I don't want it to be processed.

> 
> >
> >> 2. No Bottom-half scheduled:
> > Why is the bottom-half not scheduled? Or do you mean it hasn't woken up
> > yet before the next top-half coming?
> 
> In very rare cases, it is possible in our platform that the CPU may not 
> be able to schedule the bottom half of the dwc3 interrupt because a work 
> queue lockup has occurred on the same CPU that is attempting to schedule 
> the dwc3 thread interrupt. In this case Yes, the bottom-half handler 
> hasn't woken up, then initiate an IRQ storm for new events after the 
> controller restarts, resulting in no more bottom-half scheduling due to 
> the CPU being stuck in processing continuous interrupts and return 
> IRQ_HANDLED by checking if (evt->flags & DWC3_EVENT_PENDING).
> 
> >
> >> 3. USB reconnect: dwc3_event_buffers_setup
> >>       a. Un-masking interrupt line
> > Do we know that the GEVNTCOUNT is non-zero before starting the
> > controller again?
> 
> The GEVNTCOUNT value showing as zero that we confirmed by adding debug 
> message here.
> >
> >> 4. Continuous interrupts : Top-half handler:
> >>       a. if (evt->flags & DWC3_EVENT_PENDING)
> >>                   return IRQ_HANDLED;
> >>
> >>       a. if (evt->flags & DWC3_EVENT_PENDING)
> >>                   return IRQ_HANDLED;
> >>
> >>       a. if (evt->flags & DWC3_EVENT_PENDING)
> >>                   return IRQ_HANDLED;
> >> .....
> >>
> >> .....
> >>
> >> .....
> >>
> 
> Sure, I can try implementing the proposed code modifications in our 
> testing environment.
> 
> But, I am uncertain about how these changes will effectively prevent an 
> IRQ storm when the USB controller sequence restarts with the 
> DWC3_EVENT_PENDING. The following code will only execute until the 
> DWC3_EVENT_PENDING is cleared, at which point the previous bottom-half 
> will not be scheduled.
> 
> Please correct me if i am wrong in my above understanding.

As I mentioned, I don't want DWC3_EVENT_PENDING flag to be set due to
the stale event. I want to ignore and skip processing any stale event.

The DWC3_EVENT_PENDING should not be set by the time
dwc3_event_buffers_setup() is called.

Specifically review this condition in my testing patch:

	/*
	 * If the controller is halted, the event count is stale/invalid. Ignore
	 * them. This happens if the interrupt assertion is from an out-of-band
	 * resume notification.
	 */
	if (!dwc->pullups_connected && count) {
		dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), count);
		return IRQ_HANDLED;
	}

Let me know if the condition matches with what's happening for your
case.

Thanks,
Thinh
Selvarasu Ganesan Sept. 5, 2024, 11:05 p.m. UTC | #17
On 9/6/2024 2:43 AM, Thinh Nguyen wrote:
> On Thu, Sep 05, 2024, Selvarasu Ganesan wrote:
>> On 9/5/2024 5:56 AM, Thinh Nguyen wrote:
>>> On Wed, Sep 04, 2024, Selvarasu Ganesan wrote:
>>>> On 9/4/2024 6:33 AM, Thinh Nguyen wrote:
>>>>> On Mon, Sep 02, 2024, Selvarasu Ganesan wrote:
>>>>>> I would like to reconfirm from our end that in our failure scenario, we
>>>>>> observe that DWC3_EVENT_PENDING is set in evt->flags when the dwc3
>>>>>> resume sequence is executed, and the dwc->pending_events flag is not
>>>>>> being set.
>>>>>>
>>>>> If the controller is stopped, no event is generated until it's restarted
>>>>> again. (ie, you should not see GEVNTCOUNT updated after clearing
>>>>> DCTL.run_stop). If there's no event, no interrupt assertion should come
>>>>> from the controller.
>>>>>
>>>>> If the pending_events is not set and you still see this failure, then
>>>>> likely that the controller had started, and the interrupt is generated
>>>>> from the controller event. This occurs along with the interrupt
>>>>> generated from your connection notification from your setup.
>>>> I completely agree. My discussion revolves around the handling of the
>>>> DWC3_EVENT_PENDING flag in all situations. The purpose of using this
>>>> flag is to prevent the processing of new events if an existing event is
>>>> still being processed. This flag is set in the top-half interrupt
>>>> handler and cleared at the end of the bottom-half handler.
>>>>
>>>> Now, let's consider scenarios where the bottom half is not scheduled,
>>>> and a USB reconnect occurs. In this case, there is a possibility that
>>>> the interrupt line is unmasked in dwc3_event_buffers_setup, and the USB
>>>> controller begins posting new events. The top-half interrupt handler
>>>> checks for the DWC3_EVENT_PENDING flag and returns IRQ_HANDLED without
>>>> processing any new events. However, the USB controller continues to post
>>>> interrupts until they are acknowledged.
>>>>
>>>> Please review the complete sequence once with DWC3_EVENT_PENDING flag.
>>>>
>>>> My proposal is to clear or reset the DWC3_EVENT_PENDING flag when
>>>> unmasking the interrupt line dwc3_event_buffers_setup, apart from
>>>> bottom-half handler. Clearing the DWC3_EVENT_PENDING flag in
>>>> dwc3_event_buffers_setup does not cause any harm, as we have implemented
>>>> a temporary workaround in our test setup to prevent IRQ storms.
>>>>
>>>>
>>>>
>>>> Working scenarios:
>>>> ==================
>>>> 1. Top-half handler:
>>>>        a. if (evt->flags & DWC3_EVENT_PENDING)
>>>>            return IRQ_HANDLED;
>>>>        b. Set DWC3_EVENT_PENDING flag
>>>>        c. Masking interrupt line
>>>>
>>>> 2. Bottom-half handler:
>>>>        a. Un-masking interrupt line
>>>>        b. Clear DWC3_EVENT_PENDING flag
>>>>
>>>> Failure scenarios:
>>>> ==================
>>>> 1. Top-half handler:
>>>>        a. if (evt->flags & DWC3_EVENT_PENDING)
>>>>                    return IRQ_HANDLED;
>>>>        b. Set DWC3_EVENT_PENDING flag
>>>>        c. Masking interrupt line
>>> For DWC3_EVENT_PENDING flag to be set at this point (before we start the
>>> controller), that means that the GEVNTCOUNT was not 0 after
>>> soft-disconnect and that the pm_runtime_suspended() must be false.
>> In the top-half code where we set the DWC3_EVENT_PENDING flag, we
>> acknowledge GEVNTCOUNT. Therefore, I think it is not necessary for
>> GEVNTCOUNT to have a non-zero value until a new event occurs. In fact,
>> when we tried to print GEVNTCOUNT in a non-interrupt context, we found
>> that it was zero, where we received DWC3_EVENT_PENDING being set in
>> non-interrupt context.
> For DWC3_EVENT_PENDING to be set, GEVNTCOUNT must be non-zero. If you
> see it's zero, that means that it was already decremented by the driver.
>
> If the driver acknowledges the GEVNTCOUNT, then that means that the
> events are copied and prepared to be processed. The bottom-half thread
> is scheduled. If it's for stale event, I don't want it to be processed.
>
>>>> 2. No Bottom-half scheduled:
>>> Why is the bottom-half not scheduled? Or do you mean it hasn't woken up
>>> yet before the next top-half coming?
>> In very rare cases, it is possible in our platform that the CPU may not
>> be able to schedule the bottom half of the dwc3 interrupt because a work
>> queue lockup has occurred on the same CPU that is attempting to schedule
>> the dwc3 thread interrupt. In this case Yes, the bottom-half handler
>> hasn't woken up, then initiate an IRQ storm for new events after the
>> controller restarts, resulting in no more bottom-half scheduling due to
>> the CPU being stuck in processing continuous interrupts and return
>> IRQ_HANDLED by checking if (evt->flags & DWC3_EVENT_PENDING).
>>
>>>> 3. USB reconnect: dwc3_event_buffers_setup
>>>>        a. Un-masking interrupt line
>>> Do we know that the GEVNTCOUNT is non-zero before starting the
>>> controller again?
>> The GEVNTCOUNT value showing as zero that we confirmed by adding debug
>> message here.
>>>> 4. Continuous interrupts : Top-half handler:
>>>>        a. if (evt->flags & DWC3_EVENT_PENDING)
>>>>                    return IRQ_HANDLED;
>>>>
>>>>        a. if (evt->flags & DWC3_EVENT_PENDING)
>>>>                    return IRQ_HANDLED;
>>>>
>>>>        a. if (evt->flags & DWC3_EVENT_PENDING)
>>>>                    return IRQ_HANDLED;
>>>> .....
>>>>
>>>> .....
>>>>
>>>> .....
>>>>
>> Sure, I can try implementing the proposed code modifications in our
>> testing environment.
>>
>> But, I am uncertain about how these changes will effectively prevent an
>> IRQ storm when the USB controller sequence restarts with the
>> DWC3_EVENT_PENDING. The following code will only execute until the
>> DWC3_EVENT_PENDING is cleared, at which point the previous bottom-half
>> will not be scheduled.
>>
>> Please correct me if i am wrong in my above understanding.
> As I mentioned, I don't want DWC3_EVENT_PENDING flag to be set due to
> the stale event. I want to ignore and skip processing any stale event.
>
> The DWC3_EVENT_PENDING should not be set by the time
> dwc3_event_buffers_setup() is called.
>
> Specifically review this condition in my testing patch:
>
> 	/*
> 	 * If the controller is halted, the event count is stale/invalid. Ignore
> 	 * them. This happens if the interrupt assertion is from an out-of-band
> 	 * resume notification.
> 	 */
> 	if (!dwc->pullups_connected && count) {
> 		dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), count);
> 		return IRQ_HANDLED;
> 	}
>
> Let me know if the condition matches with what's happening for your
> case.
Hi Thinh,

Thanks for your continuous reviews and suggestions.

The given condition also will not matches in our case.
As i mentioned in starting of this thread please refer once the below 
link of older discussion for similar issue from Samsung..

https://lore.kernel.org/linux-usb/20230102050831.105499-1-jh0801.jung@samsung.com/


DWC3_EVENT_PENDING flags set when count is 0.
It means "There are no interrupts to handle.".

(struct dwc3_event_buffer *) ev_buf = 0xFFFFFF883DBF1180 (
	(void *) buf = 0xFFFFFFC00DBDD000 = end+0x337D000,
	(void *) cache = 0xFFFFFF8839F54080,
	(unsigned int) length = 0x1000,
	(unsigned int) lpos = 0x0,
*(unsigned int) count = 0x0, (unsigned int) flags = 0x00000001,*
	(dma_addr_t) dma = 0x00000008BD7D7000,
	(struct dwc3 *) dwc = 0xFFFFFF8839CBC880,
	(u64) android_kabi_reserved1 = 0x0),

IRQ Storm:
(time = 47557628930999, irq = 165, fn = dwc3_interrupt, latency = 0, en = 1),
(time = 47557628931268, irq = 165, fn = dwc3_interrupt, latency = 0, en = 3),
(time = 47557628932383, irq = 165, fn = dwc3_interrupt, latency = 0, en = 1),
(time = 47557628932652, irq = 165, fn = dwc3_interrupt, latency = 0, en = 3),
(time = 47557628933768, irq = 165, fn = dwc3_interrupt, latency = 0, en = 1),
(time = 47557628934037, irq = 165, fn = dwc3_interrupt, latency = 0, en = 3),
...
...
...


We are also fine with below code changes as you suggested earlier.
https://lore.kernel.org/linux-usb/20230109190914.3blihjfjdcszazdd@synopsys.com/

diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
index 65500246323b..3c36dfdb88f0 100644
--- a/drivers/usb/dwc3/gadget.c
+++ b/drivers/usb/dwc3/gadget.c
@@ -5515,8 +5515,15 @@ static irqreturn_t dwc3_check_event_buf(struct 
dwc3_event_buffer *evt)
          * irq event handler completes before caching new event to prevent
          * losing events.
          */
-       if (evt->flags & DWC3_EVENT_PENDING)
+       if (evt->flags & DWC3_EVENT_PENDING) {
+               if (!evt->count) {
+                       u32 reg = dwc3_readl(dwc->regs, DWC3_GEVNTSIZ(0));
+
+                       if (!(reg & DWC3_GEVNTSIZ_INTMASK))
+                               evt->flags &= ~DWC3_EVENT_PENDING;
+               }
                 return IRQ_HANDLED;
+       }


Thanks,
Selva


> .
> Thanks,
> Thinh
Thinh Nguyen Sept. 5, 2024, 11:18 p.m. UTC | #18
On Fri, Sep 06, 2024, Selvarasu Ganesan wrote:
> 
> On 9/6/2024 2:43 AM, Thinh Nguyen wrote:
> > On Thu, Sep 05, 2024, Selvarasu Ganesan wrote:
> >> On 9/5/2024 5:56 AM, Thinh Nguyen wrote:
> >>> On Wed, Sep 04, 2024, Selvarasu Ganesan wrote:
> >>>> On 9/4/2024 6:33 AM, Thinh Nguyen wrote:
> >>>>> On Mon, Sep 02, 2024, Selvarasu Ganesan wrote:
> >>>>>> I would like to reconfirm from our end that in our failure scenario, we
> >>>>>> observe that DWC3_EVENT_PENDING is set in evt->flags when the dwc3
> >>>>>> resume sequence is executed, and the dwc->pending_events flag is not
> >>>>>> being set.
> >>>>>>
> >>>>> If the controller is stopped, no event is generated until it's restarted
> >>>>> again. (ie, you should not see GEVNTCOUNT updated after clearing
> >>>>> DCTL.run_stop). If there's no event, no interrupt assertion should come
> >>>>> from the controller.
> >>>>>
> >>>>> If the pending_events is not set and you still see this failure, then
> >>>>> likely that the controller had started, and the interrupt is generated
> >>>>> from the controller event. This occurs along with the interrupt
> >>>>> generated from your connection notification from your setup.
> >>>> I completely agree. My discussion revolves around the handling of the
> >>>> DWC3_EVENT_PENDING flag in all situations. The purpose of using this
> >>>> flag is to prevent the processing of new events if an existing event is
> >>>> still being processed. This flag is set in the top-half interrupt
> >>>> handler and cleared at the end of the bottom-half handler.
> >>>>
> >>>> Now, let's consider scenarios where the bottom half is not scheduled,
> >>>> and a USB reconnect occurs. In this case, there is a possibility that
> >>>> the interrupt line is unmasked in dwc3_event_buffers_setup, and the USB
> >>>> controller begins posting new events. The top-half interrupt handler
> >>>> checks for the DWC3_EVENT_PENDING flag and returns IRQ_HANDLED without
> >>>> processing any new events. However, the USB controller continues to post
> >>>> interrupts until they are acknowledged.
> >>>>
> >>>> Please review the complete sequence once with DWC3_EVENT_PENDING flag.
> >>>>
> >>>> My proposal is to clear or reset the DWC3_EVENT_PENDING flag when
> >>>> unmasking the interrupt line dwc3_event_buffers_setup, apart from
> >>>> bottom-half handler. Clearing the DWC3_EVENT_PENDING flag in
> >>>> dwc3_event_buffers_setup does not cause any harm, as we have implemented
> >>>> a temporary workaround in our test setup to prevent IRQ storms.
> >>>>
> >>>>
> >>>>
> >>>> Working scenarios:
> >>>> ==================
> >>>> 1. Top-half handler:
> >>>>        a. if (evt->flags & DWC3_EVENT_PENDING)
> >>>>            return IRQ_HANDLED;
> >>>>        b. Set DWC3_EVENT_PENDING flag
> >>>>        c. Masking interrupt line
> >>>>
> >>>> 2. Bottom-half handler:
> >>>>        a. Un-masking interrupt line
> >>>>        b. Clear DWC3_EVENT_PENDING flag
> >>>>
> >>>> Failure scenarios:
> >>>> ==================
> >>>> 1. Top-half handler:
> >>>>        a. if (evt->flags & DWC3_EVENT_PENDING)
> >>>>                    return IRQ_HANDLED;
> >>>>        b. Set DWC3_EVENT_PENDING flag
> >>>>        c. Masking interrupt line
> >>> For DWC3_EVENT_PENDING flag to be set at this point (before we start the
> >>> controller), that means that the GEVNTCOUNT was not 0 after
> >>> soft-disconnect and that the pm_runtime_suspended() must be false.
> >> In the top-half code where we set the DWC3_EVENT_PENDING flag, we
> >> acknowledge GEVNTCOUNT. Therefore, I think it is not necessary for
> >> GEVNTCOUNT to have a non-zero value until a new event occurs. In fact,
> >> when we tried to print GEVNTCOUNT in a non-interrupt context, we found
> >> that it was zero, where we received DWC3_EVENT_PENDING being set in
> >> non-interrupt context.
> > For DWC3_EVENT_PENDING to be set, GEVNTCOUNT must be non-zero. If you
> > see it's zero, that means that it was already decremented by the driver.
> >
> > If the driver acknowledges the GEVNTCOUNT, then that means that the
> > events are copied and prepared to be processed. The bottom-half thread
> > is scheduled. If it's for stale event, I don't want it to be processed.
> >
> >>>> 2. No Bottom-half scheduled:
> >>> Why is the bottom-half not scheduled? Or do you mean it hasn't woken up
> >>> yet before the next top-half coming?
> >> In very rare cases, it is possible in our platform that the CPU may not
> >> be able to schedule the bottom half of the dwc3 interrupt because a work
> >> queue lockup has occurred on the same CPU that is attempting to schedule
> >> the dwc3 thread interrupt. In this case Yes, the bottom-half handler
> >> hasn't woken up, then initiate an IRQ storm for new events after the
> >> controller restarts, resulting in no more bottom-half scheduling due to
> >> the CPU being stuck in processing continuous interrupts and return
> >> IRQ_HANDLED by checking if (evt->flags & DWC3_EVENT_PENDING).
> >>
> >>>> 3. USB reconnect: dwc3_event_buffers_setup
> >>>>        a. Un-masking interrupt line
> >>> Do we know that the GEVNTCOUNT is non-zero before starting the
> >>> controller again?
> >> The GEVNTCOUNT value showing as zero that we confirmed by adding debug
> >> message here.
> >>>> 4. Continuous interrupts : Top-half handler:
> >>>>        a. if (evt->flags & DWC3_EVENT_PENDING)
> >>>>                    return IRQ_HANDLED;
> >>>>
> >>>>        a. if (evt->flags & DWC3_EVENT_PENDING)
> >>>>                    return IRQ_HANDLED;
> >>>>
> >>>>        a. if (evt->flags & DWC3_EVENT_PENDING)
> >>>>                    return IRQ_HANDLED;
> >>>> .....
> >>>>
> >>>> .....
> >>>>
> >>>> .....
> >>>>
> >> Sure, I can try implementing the proposed code modifications in our
> >> testing environment.
> >>
> >> But, I am uncertain about how these changes will effectively prevent an
> >> IRQ storm when the USB controller sequence restarts with the
> >> DWC3_EVENT_PENDING. The following code will only execute until the
> >> DWC3_EVENT_PENDING is cleared, at which point the previous bottom-half
> >> will not be scheduled.
> >>
> >> Please correct me if i am wrong in my above understanding.
> > As I mentioned, I don't want DWC3_EVENT_PENDING flag to be set due to
> > the stale event. I want to ignore and skip processing any stale event.
> >
> > The DWC3_EVENT_PENDING should not be set by the time
> > dwc3_event_buffers_setup() is called.
> >
> > Specifically review this condition in my testing patch:
> >
> > 	/*
> > 	 * If the controller is halted, the event count is stale/invalid. Ignore
> > 	 * them. This happens if the interrupt assertion is from an out-of-band
> > 	 * resume notification.
> > 	 */
> > 	if (!dwc->pullups_connected && count) {
> > 		dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), count);
> > 		return IRQ_HANDLED;
> > 	}
> >
> > Let me know if the condition matches with what's happening for your
> > case.
> Hi Thinh,
> 
> Thanks for your continuous reviews and suggestions.
> 
> The given condition also will not matches in our case.
> As i mentioned in starting of this thread please refer once the below 
> link of older discussion for similar issue from Samsung..
> 
> https://urldefense.com/v3/__https://lore.kernel.org/linux-usb/20230102050831.105499-1-jh0801.jung@samsung.com/__;!!A4F2R9G_pg!a3VpPHvMr9enk0YPjSoWJ12Kr5Hw2Ka43Q_wi80lw6ty2tJT4hKRKsCnQNdqbVS3JORK2VwqdoXDWz1q8ynpe7Ex6cU$ 
> 
> 
> DWC3_EVENT_PENDING flags set when count is 0.
> It means "There are no interrupts to handle.".
> 
> (struct dwc3_event_buffer *) ev_buf = 0xFFFFFF883DBF1180 (
> 	(void *) buf = 0xFFFFFFC00DBDD000 = end+0x337D000,
> 	(void *) cache = 0xFFFFFF8839F54080,
> 	(unsigned int) length = 0x1000,
> 	(unsigned int) lpos = 0x0,
> *(unsigned int) count = 0x0, (unsigned int) flags = 0x00000001,*
> 	(dma_addr_t) dma = 0x00000008BD7D7000,
> 	(struct dwc3 *) dwc = 0xFFFFFF8839CBC880,
> 	(u64) android_kabi_reserved1 = 0x0),


This is the info of the event buffer that was reset after the
dwc3_event_buffers_setup(). I'm talking about the first time
DWC3_EVENT_PENDING flag was set.

By the time the interrupt storm below occur, the count in the buffer is
already zero'ed out.

> 
> IRQ Storm:
> (time = 47557628930999, irq = 165, fn = dwc3_interrupt, latency = 0, en = 1),
> (time = 47557628931268, irq = 165, fn = dwc3_interrupt, latency = 0, en = 3),
> (time = 47557628932383, irq = 165, fn = dwc3_interrupt, latency = 0, en = 1),
> (time = 47557628932652, irq = 165, fn = dwc3_interrupt, latency = 0, en = 3),
> (time = 47557628933768, irq = 165, fn = dwc3_interrupt, latency = 0, en = 1),
> (time = 47557628934037, irq = 165, fn = dwc3_interrupt, latency = 0, en = 3),
> ...
> ...
> ...
> 
> 
> We are also fine with below code changes as you suggested earlier.
> https://urldefense.com/v3/__https://lore.kernel.org/linux-usb/20230109190914.3blihjfjdcszazdd@synopsys.com/__;!!A4F2R9G_pg!a3VpPHvMr9enk0YPjSoWJ12Kr5Hw2Ka43Q_wi80lw6ty2tJT4hKRKsCnQNdqbVS3JORK2VwqdoXDWz1q8ynp367zvEw$ 
> 
> diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
> index 65500246323b..3c36dfdb88f0 100644
> --- a/drivers/usb/dwc3/gadget.c
> +++ b/drivers/usb/dwc3/gadget.c
> @@ -5515,8 +5515,15 @@ static irqreturn_t dwc3_check_event_buf(struct 
> dwc3_event_buffer *evt)
>           * irq event handler completes before caching new event to prevent
>           * losing events.
>           */
> -       if (evt->flags & DWC3_EVENT_PENDING)
> +       if (evt->flags & DWC3_EVENT_PENDING) {
> +               if (!evt->count) {
> +                       u32 reg = dwc3_readl(dwc->regs, DWC3_GEVNTSIZ(0));
> +
> +                       if (!(reg & DWC3_GEVNTSIZ_INTMASK))
> +                               evt->flags &= ~DWC3_EVENT_PENDING;
> +               }
>                  return IRQ_HANDLED;
> +       }
> 
> 

I don't want the bottom-half to be scheduled in the beginning as it may
come before the cleanup in dwc3_event_buffers_setup().

BR,
Thinh
Selvarasu Ganesan Sept. 6, 2024, 12:28 a.m. UTC | #19
On 9/6/2024 4:48 AM, Thinh Nguyen wrote:
> On Fri, Sep 06, 2024, Selvarasu Ganesan wrote:
>> On 9/6/2024 2:43 AM, Thinh Nguyen wrote:
>>> On Thu, Sep 05, 2024, Selvarasu Ganesan wrote:
>>>> On 9/5/2024 5:56 AM, Thinh Nguyen wrote:
>>>>> On Wed, Sep 04, 2024, Selvarasu Ganesan wrote:
>>>>>> On 9/4/2024 6:33 AM, Thinh Nguyen wrote:
>>>>>>> On Mon, Sep 02, 2024, Selvarasu Ganesan wrote:
>>>>>>>> I would like to reconfirm from our end that in our failure scenario, we
>>>>>>>> observe that DWC3_EVENT_PENDING is set in evt->flags when the dwc3
>>>>>>>> resume sequence is executed, and the dwc->pending_events flag is not
>>>>>>>> being set.
>>>>>>>>
>>>>>>> If the controller is stopped, no event is generated until it's restarted
>>>>>>> again. (ie, you should not see GEVNTCOUNT updated after clearing
>>>>>>> DCTL.run_stop). If there's no event, no interrupt assertion should come
>>>>>>> from the controller.
>>>>>>>
>>>>>>> If the pending_events is not set and you still see this failure, then
>>>>>>> likely that the controller had started, and the interrupt is generated
>>>>>>> from the controller event. This occurs along with the interrupt
>>>>>>> generated from your connection notification from your setup.
>>>>>> I completely agree. My discussion revolves around the handling of the
>>>>>> DWC3_EVENT_PENDING flag in all situations. The purpose of using this
>>>>>> flag is to prevent the processing of new events if an existing event is
>>>>>> still being processed. This flag is set in the top-half interrupt
>>>>>> handler and cleared at the end of the bottom-half handler.
>>>>>>
>>>>>> Now, let's consider scenarios where the bottom half is not scheduled,
>>>>>> and a USB reconnect occurs. In this case, there is a possibility that
>>>>>> the interrupt line is unmasked in dwc3_event_buffers_setup, and the USB
>>>>>> controller begins posting new events. The top-half interrupt handler
>>>>>> checks for the DWC3_EVENT_PENDING flag and returns IRQ_HANDLED without
>>>>>> processing any new events. However, the USB controller continues to post
>>>>>> interrupts until they are acknowledged.
>>>>>>
>>>>>> Please review the complete sequence once with DWC3_EVENT_PENDING flag.
>>>>>>
>>>>>> My proposal is to clear or reset the DWC3_EVENT_PENDING flag when
>>>>>> unmasking the interrupt line dwc3_event_buffers_setup, apart from
>>>>>> bottom-half handler. Clearing the DWC3_EVENT_PENDING flag in
>>>>>> dwc3_event_buffers_setup does not cause any harm, as we have implemented
>>>>>> a temporary workaround in our test setup to prevent IRQ storms.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Working scenarios:
>>>>>> ==================
>>>>>> 1. Top-half handler:
>>>>>>         a. if (evt->flags & DWC3_EVENT_PENDING)
>>>>>>             return IRQ_HANDLED;
>>>>>>         b. Set DWC3_EVENT_PENDING flag
>>>>>>         c. Masking interrupt line
>>>>>>
>>>>>> 2. Bottom-half handler:
>>>>>>         a. Un-masking interrupt line
>>>>>>         b. Clear DWC3_EVENT_PENDING flag
>>>>>>
>>>>>> Failure scenarios:
>>>>>> ==================
>>>>>> 1. Top-half handler:
>>>>>>         a. if (evt->flags & DWC3_EVENT_PENDING)
>>>>>>                     return IRQ_HANDLED;
>>>>>>         b. Set DWC3_EVENT_PENDING flag
>>>>>>         c. Masking interrupt line
>>>>> For DWC3_EVENT_PENDING flag to be set at this point (before we start the
>>>>> controller), that means that the GEVNTCOUNT was not 0 after
>>>>> soft-disconnect and that the pm_runtime_suspended() must be false.
>>>> In the top-half code where we set the DWC3_EVENT_PENDING flag, we
>>>> acknowledge GEVNTCOUNT. Therefore, I think it is not necessary for
>>>> GEVNTCOUNT to have a non-zero value until a new event occurs. In fact,
>>>> when we tried to print GEVNTCOUNT in a non-interrupt context, we found
>>>> that it was zero, where we received DWC3_EVENT_PENDING being set in
>>>> non-interrupt context.
>>> For DWC3_EVENT_PENDING to be set, GEVNTCOUNT must be non-zero. If you
>>> see it's zero, that means that it was already decremented by the driver.
>>>
>>> If the driver acknowledges the GEVNTCOUNT, then that means that the
>>> events are copied and prepared to be processed. The bottom-half thread
>>> is scheduled. If it's for stale event, I don't want it to be processed.
>>>
>>>>>> 2. No Bottom-half scheduled:
>>>>> Why is the bottom-half not scheduled? Or do you mean it hasn't woken up
>>>>> yet before the next top-half coming?
>>>> In very rare cases, it is possible in our platform that the CPU may not
>>>> be able to schedule the bottom half of the dwc3 interrupt because a work
>>>> queue lockup has occurred on the same CPU that is attempting to schedule
>>>> the dwc3 thread interrupt. In this case Yes, the bottom-half handler
>>>> hasn't woken up, then initiate an IRQ storm for new events after the
>>>> controller restarts, resulting in no more bottom-half scheduling due to
>>>> the CPU being stuck in processing continuous interrupts and return
>>>> IRQ_HANDLED by checking if (evt->flags & DWC3_EVENT_PENDING).
>>>>
>>>>>> 3. USB reconnect: dwc3_event_buffers_setup
>>>>>>         a. Un-masking interrupt line
>>>>> Do we know that the GEVNTCOUNT is non-zero before starting the
>>>>> controller again?
>>>> The GEVNTCOUNT value showing as zero that we confirmed by adding debug
>>>> message here.
>>>>>> 4. Continuous interrupts : Top-half handler:
>>>>>>         a. if (evt->flags & DWC3_EVENT_PENDING)
>>>>>>                     return IRQ_HANDLED;
>>>>>>
>>>>>>         a. if (evt->flags & DWC3_EVENT_PENDING)
>>>>>>                     return IRQ_HANDLED;
>>>>>>
>>>>>>         a. if (evt->flags & DWC3_EVENT_PENDING)
>>>>>>                     return IRQ_HANDLED;
>>>>>> .....
>>>>>>
>>>>>> .....
>>>>>>
>>>>>> .....
>>>>>>
>>>> Sure, I can try implementing the proposed code modifications in our
>>>> testing environment.
>>>>
>>>> But, I am uncertain about how these changes will effectively prevent an
>>>> IRQ storm when the USB controller sequence restarts with the
>>>> DWC3_EVENT_PENDING. The following code will only execute until the
>>>> DWC3_EVENT_PENDING is cleared, at which point the previous bottom-half
>>>> will not be scheduled.
>>>>
>>>> Please correct me if i am wrong in my above understanding.
>>> As I mentioned, I don't want DWC3_EVENT_PENDING flag to be set due to
>>> the stale event. I want to ignore and skip processing any stale event.
>>>
>>> The DWC3_EVENT_PENDING should not be set by the time
>>> dwc3_event_buffers_setup() is called.
>>>
>>> Specifically review this condition in my testing patch:
>>>
>>> 	/*
>>> 	 * If the controller is halted, the event count is stale/invalid. Ignore
>>> 	 * them. This happens if the interrupt assertion is from an out-of-band
>>> 	 * resume notification.
>>> 	 */
>>> 	if (!dwc->pullups_connected && count) {
>>> 		dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), count);
>>> 		return IRQ_HANDLED;
>>> 	}
>>>
>>> Let me know if the condition matches with what's happening for your
>>> case.
>> Hi Thinh,
>>
>> Thanks for your continuous reviews and suggestions.
>>
>> The given condition also will not matches in our case.
>> As i mentioned in starting of this thread please refer once the below
>> link of older discussion for similar issue from Samsung..
>>
>> https://urldefense.com/v3/__https://lore.kernel.org/linux-usb/20230102050831.105499-1-jh0801.jung@samsung.com/__;!!A4F2R9G_pg!a3VpPHvMr9enk0YPjSoWJ12Kr5Hw2Ka43Q_wi80lw6ty2tJT4hKRKsCnQNdqbVS3JORK2VwqdoXDWz1q8ynpe7Ex6cU$
>>
>>
>> DWC3_EVENT_PENDING flags set when count is 0.
>> It means "There are no interrupts to handle.".
>>
>> (struct dwc3_event_buffer *) ev_buf = 0xFFFFFF883DBF1180 (
>> 	(void *) buf = 0xFFFFFFC00DBDD000 = end+0x337D000,
>> 	(void *) cache = 0xFFFFFF8839F54080,
>> 	(unsigned int) length = 0x1000,
>> 	(unsigned int) lpos = 0x0,
>> *(unsigned int) count = 0x0, (unsigned int) flags = 0x00000001,*
>> 	(dma_addr_t) dma = 0x00000008BD7D7000,
>> 	(struct dwc3 *) dwc = 0xFFFFFF8839CBC880,
>> 	(u64) android_kabi_reserved1 = 0x0),
>
> This is the info of the event buffer that was reset after the
> dwc3_event_buffers_setup(). I'm talking about the first time
> DWC3_EVENT_PENDING flag was set.

Yes, the buffer that was reset before as part of 
dwc3_event_buffers_setup() is correct.
I agree on your new code changes in below will prevent setting 
DWC3_EVENT_PENDING and avoid the bottom-half handler if the controller 
is halted, and the event count is invalid.

Are you suspecting that the DWC3_EVENT_PENDING flag was only set in this 
scenario in our failure case?

/*diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
index 89fc690fdf34..a525f7ea5886 100644
--- a/drivers/usb/dwc3/gadget.c
+++ b/drivers/usb/dwc3/gadget.c
@@ -4490,6 +4490,17 @@ static irqreturn_t dwc3_check_event_buf(struct dwc3_event_buffer *evt)
  		return IRQ_HANDLED;
  
  	count = dwc3_readl(dwc->regs, DWC3_GEVNTCOUNT(0));
+
+	/*
+	 * If the controller is halted, the event count is stale/invalid. Ignore
+	 * them. This happens if the interrupt assertion is from an out-of-band
+	 * resume notification.
+	 */
+	if (!dwc->pullups_connected && count) {
+		dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), count);
+		return IRQ_HANDLED;
+	}
+
  	count &= DWC3_GEVNTCOUNT_MASK;
  	if (!count)
  		return IRQ_NONE;

>
> By the time the interrupt storm below occur, the count in the buffer is
> already zero'ed out.
>
>> IRQ Storm:
>> (time = 47557628930999, irq = 165, fn = dwc3_interrupt, latency = 0, en = 1),
>> (time = 47557628931268, irq = 165, fn = dwc3_interrupt, latency = 0, en = 3),
>> (time = 47557628932383, irq = 165, fn = dwc3_interrupt, latency = 0, en = 1),
>> (time = 47557628932652, irq = 165, fn = dwc3_interrupt, latency = 0, en = 3),
>> (time = 47557628933768, irq = 165, fn = dwc3_interrupt, latency = 0, en = 1),
>> (time = 47557628934037, irq = 165, fn = dwc3_interrupt, latency = 0, en = 3),
>> ...
>> ...
>> ...
>>
>>
>> We are also fine with below code changes as you suggested earlier.
>> https://urldefense.com/v3/__https://lore.kernel.org/linux-usb/20230109190914.3blihjfjdcszazdd@synopsys.com/__;!!A4F2R9G_pg!a3VpPHvMr9enk0YPjSoWJ12Kr5Hw2Ka43Q_wi80lw6ty2tJT4hKRKsCnQNdqbVS3JORK2VwqdoXDWz1q8ynp367zvEw$
>>
>> diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
>> index 65500246323b..3c36dfdb88f0 100644
>> --- a/drivers/usb/dwc3/gadget.c
>> +++ b/drivers/usb/dwc3/gadget.c
>> @@ -5515,8 +5515,15 @@ static irqreturn_t dwc3_check_event_buf(struct
>> dwc3_event_buffer *evt)
>>            * irq event handler completes before caching new event to prevent
>>            * losing events.
>>            */
>> -       if (evt->flags & DWC3_EVENT_PENDING)
>> +       if (evt->flags & DWC3_EVENT_PENDING) {
>> +               if (!evt->count) {
>> +                       u32 reg = dwc3_readl(dwc->regs, DWC3_GEVNTSIZ(0));
>> +
>> +                       if (!(reg & DWC3_GEVNTSIZ_INTMASK))
>> +                               evt->flags &= ~DWC3_EVENT_PENDING;
>> +               }
>>                   return IRQ_HANDLED;
>> +       }
>>
>>
> I don't want the bottom-half to be scheduled in the beginning as it may
> come before the cleanup in dwc3_event_buffers_setup().
You mean the above changes for clearing DWC3_EVENT_PENDINGnot required 
as you given new change will prevent setting of DWC3_EVENT_PENDING 
before dwc3_event_buffers_setup().
But I dont see any harm in above code changes for clearing 
DWC3_EVENT_PENDING if it already set with evt->count=0.

Anyway I will try the your new proposed changes alone on our testing 
setup and will update the status,


>
> BR,
> Thinh
Thinh Nguyen Sept. 6, 2024, 12:59 a.m. UTC | #20
On Fri, Sep 06, 2024, Selvarasu Ganesan wrote:
> 
> On 9/6/2024 4:48 AM, Thinh Nguyen wrote:
> > On Fri, Sep 06, 2024, Selvarasu Ganesan wrote:
> >> On 9/6/2024 2:43 AM, Thinh Nguyen wrote:
> >>> On Thu, Sep 05, 2024, Selvarasu Ganesan wrote:
> >>>> On 9/5/2024 5:56 AM, Thinh Nguyen wrote:
> >>>>> On Wed, Sep 04, 2024, Selvarasu Ganesan wrote:
> >>>>>> On 9/4/2024 6:33 AM, Thinh Nguyen wrote:
> >>>>>>> On Mon, Sep 02, 2024, Selvarasu Ganesan wrote:
> >>>>>>>> I would like to reconfirm from our end that in our failure scenario, we
> >>>>>>>> observe that DWC3_EVENT_PENDING is set in evt->flags when the dwc3
> >>>>>>>> resume sequence is executed, and the dwc->pending_events flag is not
> >>>>>>>> being set.
> >>>>>>>>
> >>>>>>> If the controller is stopped, no event is generated until it's restarted
> >>>>>>> again. (ie, you should not see GEVNTCOUNT updated after clearing
> >>>>>>> DCTL.run_stop). If there's no event, no interrupt assertion should come
> >>>>>>> from the controller.
> >>>>>>>
> >>>>>>> If the pending_events is not set and you still see this failure, then
> >>>>>>> likely that the controller had started, and the interrupt is generated
> >>>>>>> from the controller event. This occurs along with the interrupt
> >>>>>>> generated from your connection notification from your setup.
> >>>>>> I completely agree. My discussion revolves around the handling of the
> >>>>>> DWC3_EVENT_PENDING flag in all situations. The purpose of using this
> >>>>>> flag is to prevent the processing of new events if an existing event is
> >>>>>> still being processed. This flag is set in the top-half interrupt
> >>>>>> handler and cleared at the end of the bottom-half handler.
> >>>>>>
> >>>>>> Now, let's consider scenarios where the bottom half is not scheduled,
> >>>>>> and a USB reconnect occurs. In this case, there is a possibility that
> >>>>>> the interrupt line is unmasked in dwc3_event_buffers_setup, and the USB
> >>>>>> controller begins posting new events. The top-half interrupt handler
> >>>>>> checks for the DWC3_EVENT_PENDING flag and returns IRQ_HANDLED without
> >>>>>> processing any new events. However, the USB controller continues to post
> >>>>>> interrupts until they are acknowledged.
> >>>>>>
> >>>>>> Please review the complete sequence once with DWC3_EVENT_PENDING flag.
> >>>>>>
> >>>>>> My proposal is to clear or reset the DWC3_EVENT_PENDING flag when
> >>>>>> unmasking the interrupt line dwc3_event_buffers_setup, apart from
> >>>>>> bottom-half handler. Clearing the DWC3_EVENT_PENDING flag in
> >>>>>> dwc3_event_buffers_setup does not cause any harm, as we have implemented
> >>>>>> a temporary workaround in our test setup to prevent IRQ storms.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Working scenarios:
> >>>>>> ==================
> >>>>>> 1. Top-half handler:
> >>>>>>         a. if (evt->flags & DWC3_EVENT_PENDING)
> >>>>>>             return IRQ_HANDLED;
> >>>>>>         b. Set DWC3_EVENT_PENDING flag
> >>>>>>         c. Masking interrupt line
> >>>>>>
> >>>>>> 2. Bottom-half handler:
> >>>>>>         a. Un-masking interrupt line
> >>>>>>         b. Clear DWC3_EVENT_PENDING flag
> >>>>>>
> >>>>>> Failure scenarios:
> >>>>>> ==================
> >>>>>> 1. Top-half handler:
> >>>>>>         a. if (evt->flags & DWC3_EVENT_PENDING)
> >>>>>>                     return IRQ_HANDLED;
> >>>>>>         b. Set DWC3_EVENT_PENDING flag
> >>>>>>         c. Masking interrupt line
> >>>>> For DWC3_EVENT_PENDING flag to be set at this point (before we start the
> >>>>> controller), that means that the GEVNTCOUNT was not 0 after
> >>>>> soft-disconnect and that the pm_runtime_suspended() must be false.
> >>>> In the top-half code where we set the DWC3_EVENT_PENDING flag, we
> >>>> acknowledge GEVNTCOUNT. Therefore, I think it is not necessary for
> >>>> GEVNTCOUNT to have a non-zero value until a new event occurs. In fact,
> >>>> when we tried to print GEVNTCOUNT in a non-interrupt context, we found
> >>>> that it was zero, where we received DWC3_EVENT_PENDING being set in
> >>>> non-interrupt context.
> >>> For DWC3_EVENT_PENDING to be set, GEVNTCOUNT must be non-zero. If you
> >>> see it's zero, that means that it was already decremented by the driver.
> >>>
> >>> If the driver acknowledges the GEVNTCOUNT, then that means that the
> >>> events are copied and prepared to be processed. The bottom-half thread
> >>> is scheduled. If it's for stale event, I don't want it to be processed.
> >>>
> >>>>>> 2. No Bottom-half scheduled:
> >>>>> Why is the bottom-half not scheduled? Or do you mean it hasn't woken up
> >>>>> yet before the next top-half coming?
> >>>> In very rare cases, it is possible in our platform that the CPU may not
> >>>> be able to schedule the bottom half of the dwc3 interrupt because a work
> >>>> queue lockup has occurred on the same CPU that is attempting to schedule
> >>>> the dwc3 thread interrupt. In this case Yes, the bottom-half handler
> >>>> hasn't woken up, then initiate an IRQ storm for new events after the
> >>>> controller restarts, resulting in no more bottom-half scheduling due to
> >>>> the CPU being stuck in processing continuous interrupts and return
> >>>> IRQ_HANDLED by checking if (evt->flags & DWC3_EVENT_PENDING).
> >>>>
> >>>>>> 3. USB reconnect: dwc3_event_buffers_setup
> >>>>>>         a. Un-masking interrupt line
> >>>>> Do we know that the GEVNTCOUNT is non-zero before starting the
> >>>>> controller again?
> >>>> The GEVNTCOUNT value showing as zero that we confirmed by adding debug
> >>>> message here.
> >>>>>> 4. Continuous interrupts : Top-half handler:
> >>>>>>         a. if (evt->flags & DWC3_EVENT_PENDING)
> >>>>>>                     return IRQ_HANDLED;
> >>>>>>
> >>>>>>         a. if (evt->flags & DWC3_EVENT_PENDING)
> >>>>>>                     return IRQ_HANDLED;
> >>>>>>
> >>>>>>         a. if (evt->flags & DWC3_EVENT_PENDING)
> >>>>>>                     return IRQ_HANDLED;
> >>>>>> .....
> >>>>>>
> >>>>>> .....
> >>>>>>
> >>>>>> .....
> >>>>>>
> >>>> Sure, I can try implementing the proposed code modifications in our
> >>>> testing environment.
> >>>>
> >>>> But, I am uncertain about how these changes will effectively prevent an
> >>>> IRQ storm when the USB controller sequence restarts with the
> >>>> DWC3_EVENT_PENDING. The following code will only execute until the
> >>>> DWC3_EVENT_PENDING is cleared, at which point the previous bottom-half
> >>>> will not be scheduled.
> >>>>
> >>>> Please correct me if i am wrong in my above understanding.
> >>> As I mentioned, I don't want DWC3_EVENT_PENDING flag to be set due to
> >>> the stale event. I want to ignore and skip processing any stale event.
> >>>
> >>> The DWC3_EVENT_PENDING should not be set by the time
> >>> dwc3_event_buffers_setup() is called.
> >>>
> >>> Specifically review this condition in my testing patch:
> >>>
> >>> 	/*
> >>> 	 * If the controller is halted, the event count is stale/invalid. Ignore
> >>> 	 * them. This happens if the interrupt assertion is from an out-of-band
> >>> 	 * resume notification.
> >>> 	 */
> >>> 	if (!dwc->pullups_connected && count) {
> >>> 		dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), count);
> >>> 		return IRQ_HANDLED;
> >>> 	}
> >>>
> >>> Let me know if the condition matches with what's happening for your
> >>> case.
> >> Hi Thinh,
> >>
> >> Thanks for your continuous reviews and suggestions.
> >>
> >> The given condition also will not matches in our case.
> >> As i mentioned in starting of this thread please refer once the below
> >> link of older discussion for similar issue from Samsung..
> >>
> >> https://urldefense.com/v3/__https://lore.kernel.org/linux-usb/20230102050831.105499-1-jh0801.jung@samsung.com/__;!!A4F2R9G_pg!a3VpPHvMr9enk0YPjSoWJ12Kr5Hw2Ka43Q_wi80lw6ty2tJT4hKRKsCnQNdqbVS3JORK2VwqdoXDWz1q8ynpe7Ex6cU$
> >>
> >>
> >> DWC3_EVENT_PENDING flags set when count is 0.
> >> It means "There are no interrupts to handle.".
> >>
> >> (struct dwc3_event_buffer *) ev_buf = 0xFFFFFF883DBF1180 (
> >> 	(void *) buf = 0xFFFFFFC00DBDD000 = end+0x337D000,
> >> 	(void *) cache = 0xFFFFFF8839F54080,
> >> 	(unsigned int) length = 0x1000,
> >> 	(unsigned int) lpos = 0x0,
> >> *(unsigned int) count = 0x0, (unsigned int) flags = 0x00000001,*
> >> 	(dma_addr_t) dma = 0x00000008BD7D7000,
> >> 	(struct dwc3 *) dwc = 0xFFFFFF8839CBC880,
> >> 	(u64) android_kabi_reserved1 = 0x0),
> >
> > This is the info of the event buffer that was reset after the
> > dwc3_event_buffers_setup(). I'm talking about the first time
> > DWC3_EVENT_PENDING flag was set.
> 
> Yes, the buffer that was reset before as part of 
> dwc3_event_buffers_setup() is correct.
> I agree on your new code changes in below will prevent setting 
> DWC3_EVENT_PENDING and avoid the bottom-half handler if the controller 
> is halted, and the event count is invalid.
> 
> Are you suspecting that the DWC3_EVENT_PENDING flag was only set in this 
> scenario in our failure case?

Base on the discussion so far, that's what I'm suspecting.

> 
> /*diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
> index 89fc690fdf34..a525f7ea5886 100644
> --- a/drivers/usb/dwc3/gadget.c
> +++ b/drivers/usb/dwc3/gadget.c
> @@ -4490,6 +4490,17 @@ static irqreturn_t dwc3_check_event_buf(struct dwc3_event_buffer *evt)
>   		return IRQ_HANDLED;
>   
>   	count = dwc3_readl(dwc->regs, DWC3_GEVNTCOUNT(0));
> +
> +	/*
> +	 * If the controller is halted, the event count is stale/invalid. Ignore
> +	 * them. This happens if the interrupt assertion is from an out-of-band
> +	 * resume notification.
> +	 */
> +	if (!dwc->pullups_connected && count) {
> +		dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), count);
> +		return IRQ_HANDLED;
> +	}
> +
>   	count &= DWC3_GEVNTCOUNT_MASK;
>   	if (!count)
>   		return IRQ_NONE;
> 
> >
> > By the time the interrupt storm below occur, the count in the buffer is
> > already zero'ed out.
> >
> >> IRQ Storm:
> >> (time = 47557628930999, irq = 165, fn = dwc3_interrupt, latency = 0, en = 1),
> >> (time = 47557628931268, irq = 165, fn = dwc3_interrupt, latency = 0, en = 3),
> >> (time = 47557628932383, irq = 165, fn = dwc3_interrupt, latency = 0, en = 1),
> >> (time = 47557628932652, irq = 165, fn = dwc3_interrupt, latency = 0, en = 3),
> >> (time = 47557628933768, irq = 165, fn = dwc3_interrupt, latency = 0, en = 1),
> >> (time = 47557628934037, irq = 165, fn = dwc3_interrupt, latency = 0, en = 3),
> >> ...
> >> ...
> >> ...
> >>
> >>
> >> We are also fine with below code changes as you suggested earlier.
> >> https://urldefense.com/v3/__https://lore.kernel.org/linux-usb/20230109190914.3blihjfjdcszazdd@synopsys.com/__;!!A4F2R9G_pg!a3VpPHvMr9enk0YPjSoWJ12Kr5Hw2Ka43Q_wi80lw6ty2tJT4hKRKsCnQNdqbVS3JORK2VwqdoXDWz1q8ynp367zvEw$
> >>
> >> diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
> >> index 65500246323b..3c36dfdb88f0 100644
> >> --- a/drivers/usb/dwc3/gadget.c
> >> +++ b/drivers/usb/dwc3/gadget.c
> >> @@ -5515,8 +5515,15 @@ static irqreturn_t dwc3_check_event_buf(struct
> >> dwc3_event_buffer *evt)
> >>            * irq event handler completes before caching new event to prevent
> >>            * losing events.
> >>            */
> >> -       if (evt->flags & DWC3_EVENT_PENDING)
> >> +       if (evt->flags & DWC3_EVENT_PENDING) {
> >> +               if (!evt->count) {
> >> +                       u32 reg = dwc3_readl(dwc->regs, DWC3_GEVNTSIZ(0));
> >> +
> >> +                       if (!(reg & DWC3_GEVNTSIZ_INTMASK))
> >> +                               evt->flags &= ~DWC3_EVENT_PENDING;
> >> +               }
> >>                   return IRQ_HANDLED;
> >> +       }
> >>
> >>
> > I don't want the bottom-half to be scheduled in the beginning as it may
> > come before the cleanup in dwc3_event_buffers_setup().
> You mean the above changes for clearing DWC3_EVENT_PENDINGnot required 
> as you given new change will prevent setting of DWC3_EVENT_PENDING 
> before dwc3_event_buffers_setup().

Yes.

> But I dont see any harm in above code changes for clearing 
> DWC3_EVENT_PENDING if it already set with evt->count=0.

You can add it there, but do we need to if we can solve the actual
issue?

I'm interested in confirming my suspiction of what's really causing the
DWC3_EVENT_PENDING to be set here. The code logic would be clearer
rather than masking the behavior by depending on the reset by the
dwc3_event_buffers_setup(). The runtime resume doesn't share the same
locking mechanism as when processing an event. While it may be unlikely,
I don't want the bottom-half thread to handle stale event or race with
the runtime resume.

> 
> Anyway I will try the your new proposed changes alone on our testing 
> setup and will update the status,
> 

Thanks,
Thinh
Selvarasu Ganesan Sept. 6, 2024, 7:02 p.m. UTC | #21
On 9/6/2024 6:29 AM, Thinh Nguyen wrote:
> On Fri, Sep 06, 2024, Selvarasu Ganesan wrote:
>> On 9/6/2024 4:48 AM, Thinh Nguyen wrote:
>>> On Fri, Sep 06, 2024, Selvarasu Ganesan wrote:
>>>> On 9/6/2024 2:43 AM, Thinh Nguyen wrote:
>>>>> On Thu, Sep 05, 2024, Selvarasu Ganesan wrote:
>>>>>> On 9/5/2024 5:56 AM, Thinh Nguyen wrote:
>>>>>>> On Wed, Sep 04, 2024, Selvarasu Ganesan wrote:
>>>>>>>> On 9/4/2024 6:33 AM, Thinh Nguyen wrote:
>>>>>>>>> On Mon, Sep 02, 2024, Selvarasu Ganesan wrote:
>>>>>>>>>> I would like to reconfirm from our end that in our failure scenario, we
>>>>>>>>>> observe that DWC3_EVENT_PENDING is set in evt->flags when the dwc3
>>>>>>>>>> resume sequence is executed, and the dwc->pending_events flag is not
>>>>>>>>>> being set.
>>>>>>>>>>
>>>>>>>>> If the controller is stopped, no event is generated until it's restarted
>>>>>>>>> again. (ie, you should not see GEVNTCOUNT updated after clearing
>>>>>>>>> DCTL.run_stop). If there's no event, no interrupt assertion should come
>>>>>>>>> from the controller.
>>>>>>>>>
>>>>>>>>> If the pending_events is not set and you still see this failure, then
>>>>>>>>> likely that the controller had started, and the interrupt is generated
>>>>>>>>> from the controller event. This occurs along with the interrupt
>>>>>>>>> generated from your connection notification from your setup.
>>>>>>>> I completely agree. My discussion revolves around the handling of the
>>>>>>>> DWC3_EVENT_PENDING flag in all situations. The purpose of using this
>>>>>>>> flag is to prevent the processing of new events if an existing event is
>>>>>>>> still being processed. This flag is set in the top-half interrupt
>>>>>>>> handler and cleared at the end of the bottom-half handler.
>>>>>>>>
>>>>>>>> Now, let's consider scenarios where the bottom half is not scheduled,
>>>>>>>> and a USB reconnect occurs. In this case, there is a possibility that
>>>>>>>> the interrupt line is unmasked in dwc3_event_buffers_setup, and the USB
>>>>>>>> controller begins posting new events. The top-half interrupt handler
>>>>>>>> checks for the DWC3_EVENT_PENDING flag and returns IRQ_HANDLED without
>>>>>>>> processing any new events. However, the USB controller continues to post
>>>>>>>> interrupts until they are acknowledged.
>>>>>>>>
>>>>>>>> Please review the complete sequence once with DWC3_EVENT_PENDING flag.
>>>>>>>>
>>>>>>>> My proposal is to clear or reset the DWC3_EVENT_PENDING flag when
>>>>>>>> unmasking the interrupt line dwc3_event_buffers_setup, apart from
>>>>>>>> bottom-half handler. Clearing the DWC3_EVENT_PENDING flag in
>>>>>>>> dwc3_event_buffers_setup does not cause any harm, as we have implemented
>>>>>>>> a temporary workaround in our test setup to prevent IRQ storms.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Working scenarios:
>>>>>>>> ==================
>>>>>>>> 1. Top-half handler:
>>>>>>>>          a. if (evt->flags & DWC3_EVENT_PENDING)
>>>>>>>>              return IRQ_HANDLED;
>>>>>>>>          b. Set DWC3_EVENT_PENDING flag
>>>>>>>>          c. Masking interrupt line
>>>>>>>>
>>>>>>>> 2. Bottom-half handler:
>>>>>>>>          a. Un-masking interrupt line
>>>>>>>>          b. Clear DWC3_EVENT_PENDING flag
>>>>>>>>
>>>>>>>> Failure scenarios:
>>>>>>>> ==================
>>>>>>>> 1. Top-half handler:
>>>>>>>>          a. if (evt->flags & DWC3_EVENT_PENDING)
>>>>>>>>                      return IRQ_HANDLED;
>>>>>>>>          b. Set DWC3_EVENT_PENDING flag
>>>>>>>>          c. Masking interrupt line
>>>>>>> For DWC3_EVENT_PENDING flag to be set at this point (before we start the
>>>>>>> controller), that means that the GEVNTCOUNT was not 0 after
>>>>>>> soft-disconnect and that the pm_runtime_suspended() must be false.
>>>>>> In the top-half code where we set the DWC3_EVENT_PENDING flag, we
>>>>>> acknowledge GEVNTCOUNT. Therefore, I think it is not necessary for
>>>>>> GEVNTCOUNT to have a non-zero value until a new event occurs. In fact,
>>>>>> when we tried to print GEVNTCOUNT in a non-interrupt context, we found
>>>>>> that it was zero, where we received DWC3_EVENT_PENDING being set in
>>>>>> non-interrupt context.
>>>>> For DWC3_EVENT_PENDING to be set, GEVNTCOUNT must be non-zero. If you
>>>>> see it's zero, that means that it was already decremented by the driver.
>>>>>
>>>>> If the driver acknowledges the GEVNTCOUNT, then that means that the
>>>>> events are copied and prepared to be processed. The bottom-half thread
>>>>> is scheduled. If it's for stale event, I don't want it to be processed.
>>>>>
>>>>>>>> 2. No Bottom-half scheduled:
>>>>>>> Why is the bottom-half not scheduled? Or do you mean it hasn't woken up
>>>>>>> yet before the next top-half coming?
>>>>>> In very rare cases, it is possible in our platform that the CPU may not
>>>>>> be able to schedule the bottom half of the dwc3 interrupt because a work
>>>>>> queue lockup has occurred on the same CPU that is attempting to schedule
>>>>>> the dwc3 thread interrupt. In this case Yes, the bottom-half handler
>>>>>> hasn't woken up, then initiate an IRQ storm for new events after the
>>>>>> controller restarts, resulting in no more bottom-half scheduling due to
>>>>>> the CPU being stuck in processing continuous interrupts and return
>>>>>> IRQ_HANDLED by checking if (evt->flags & DWC3_EVENT_PENDING).
>>>>>>
>>>>>>>> 3. USB reconnect: dwc3_event_buffers_setup
>>>>>>>>          a. Un-masking interrupt line
>>>>>>> Do we know that the GEVNTCOUNT is non-zero before starting the
>>>>>>> controller again?
>>>>>> The GEVNTCOUNT value showing as zero that we confirmed by adding debug
>>>>>> message here.
>>>>>>>> 4. Continuous interrupts : Top-half handler:
>>>>>>>>          a. if (evt->flags & DWC3_EVENT_PENDING)
>>>>>>>>                      return IRQ_HANDLED;
>>>>>>>>
>>>>>>>>          a. if (evt->flags & DWC3_EVENT_PENDING)
>>>>>>>>                      return IRQ_HANDLED;
>>>>>>>>
>>>>>>>>          a. if (evt->flags & DWC3_EVENT_PENDING)
>>>>>>>>                      return IRQ_HANDLED;
>>>>>>>> .....
>>>>>>>>
>>>>>>>> .....
>>>>>>>>
>>>>>>>> .....
>>>>>>>>
>>>>>> Sure, I can try implementing the proposed code modifications in our
>>>>>> testing environment.
>>>>>>
>>>>>> But, I am uncertain about how these changes will effectively prevent an
>>>>>> IRQ storm when the USB controller sequence restarts with the
>>>>>> DWC3_EVENT_PENDING. The following code will only execute until the
>>>>>> DWC3_EVENT_PENDING is cleared, at which point the previous bottom-half
>>>>>> will not be scheduled.
>>>>>>
>>>>>> Please correct me if i am wrong in my above understanding.
>>>>> As I mentioned, I don't want DWC3_EVENT_PENDING flag to be set due to
>>>>> the stale event. I want to ignore and skip processing any stale event.
>>>>>
>>>>> The DWC3_EVENT_PENDING should not be set by the time
>>>>> dwc3_event_buffers_setup() is called.
>>>>>
>>>>> Specifically review this condition in my testing patch:
>>>>>
>>>>> 	/*
>>>>> 	 * If the controller is halted, the event count is stale/invalid. Ignore
>>>>> 	 * them. This happens if the interrupt assertion is from an out-of-band
>>>>> 	 * resume notification.
>>>>> 	 */
>>>>> 	if (!dwc->pullups_connected && count) {
>>>>> 		dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), count);
>>>>> 		return IRQ_HANDLED;
>>>>> 	}
>>>>>
>>>>> Let me know if the condition matches with what's happening for your
>>>>> case.
>>>> Hi Thinh,
>>>>
>>>> Thanks for your continuous reviews and suggestions.
>>>>
>>>> The given condition also will not matches in our case.
>>>> As i mentioned in starting of this thread please refer once the below
>>>> link of older discussion for similar issue from Samsung..
>>>>
>>>> https://urldefense.com/v3/__https://lore.kernel.org/linux-usb/20230102050831.105499-1-jh0801.jung@samsung.com/__;!!A4F2R9G_pg!a3VpPHvMr9enk0YPjSoWJ12Kr5Hw2Ka43Q_wi80lw6ty2tJT4hKRKsCnQNdqbVS3JORK2VwqdoXDWz1q8ynpe7Ex6cU$
>>>>
>>>>
>>>> DWC3_EVENT_PENDING flags set when count is 0.
>>>> It means "There are no interrupts to handle.".
>>>>
>>>> (struct dwc3_event_buffer *) ev_buf = 0xFFFFFF883DBF1180 (
>>>> 	(void *) buf = 0xFFFFFFC00DBDD000 = end+0x337D000,
>>>> 	(void *) cache = 0xFFFFFF8839F54080,
>>>> 	(unsigned int) length = 0x1000,
>>>> 	(unsigned int) lpos = 0x0,
>>>> *(unsigned int) count = 0x0, (unsigned int) flags = 0x00000001,*
>>>> 	(dma_addr_t) dma = 0x00000008BD7D7000,
>>>> 	(struct dwc3 *) dwc = 0xFFFFFF8839CBC880,
>>>> 	(u64) android_kabi_reserved1 = 0x0),
>>> This is the info of the event buffer that was reset after the
>>> dwc3_event_buffers_setup(). I'm talking about the first time
>>> DWC3_EVENT_PENDING flag was set.
>> Yes, the buffer that was reset before as part of
>> dwc3_event_buffers_setup() is correct.
>> I agree on your new code changes in below will prevent setting
>> DWC3_EVENT_PENDING and avoid the bottom-half handler if the controller
>> is halted, and the event count is invalid.
>>
>> Are you suspecting that the DWC3_EVENT_PENDING flag was only set in this
>> scenario in our failure case?
> Base on the discussion so far, that's what I'm suspecting.
>
>> /*diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
>> index 89fc690fdf34..a525f7ea5886 100644
>> --- a/drivers/usb/dwc3/gadget.c
>> +++ b/drivers/usb/dwc3/gadget.c
>> @@ -4490,6 +4490,17 @@ static irqreturn_t dwc3_check_event_buf(struct dwc3_event_buffer *evt)
>>    		return IRQ_HANDLED;
>>    
>>    	count = dwc3_readl(dwc->regs, DWC3_GEVNTCOUNT(0));
>> +
>> +	/*
>> +	 * If the controller is halted, the event count is stale/invalid. Ignore
>> +	 * them. This happens if the interrupt assertion is from an out-of-band
>> +	 * resume notification.
>> +	 */
>> +	if (!dwc->pullups_connected && count) {
>> +		dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), count);
>> +		return IRQ_HANDLED;
>> +	}
>> +
>>    	count &= DWC3_GEVNTCOUNT_MASK;
>>    	if (!count)
>>    		return IRQ_NONE;
>>
>>> By the time the interrupt storm below occur, the count in the buffer is
>>> already zero'ed out.
>>>
>>>> IRQ Storm:
>>>> (time = 47557628930999, irq = 165, fn = dwc3_interrupt, latency = 0, en = 1),
>>>> (time = 47557628931268, irq = 165, fn = dwc3_interrupt, latency = 0, en = 3),
>>>> (time = 47557628932383, irq = 165, fn = dwc3_interrupt, latency = 0, en = 1),
>>>> (time = 47557628932652, irq = 165, fn = dwc3_interrupt, latency = 0, en = 3),
>>>> (time = 47557628933768, irq = 165, fn = dwc3_interrupt, latency = 0, en = 1),
>>>> (time = 47557628934037, irq = 165, fn = dwc3_interrupt, latency = 0, en = 3),
>>>> ...
>>>> ...
>>>> ...
>>>>
>>>>
>>>> We are also fine with below code changes as you suggested earlier.
>>>> https://urldefense.com/v3/__https://lore.kernel.org/linux-usb/20230109190914.3blihjfjdcszazdd@synopsys.com/__;!!A4F2R9G_pg!a3VpPHvMr9enk0YPjSoWJ12Kr5Hw2Ka43Q_wi80lw6ty2tJT4hKRKsCnQNdqbVS3JORK2VwqdoXDWz1q8ynp367zvEw$
>>>>
>>>> diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
>>>> index 65500246323b..3c36dfdb88f0 100644
>>>> --- a/drivers/usb/dwc3/gadget.c
>>>> +++ b/drivers/usb/dwc3/gadget.c
>>>> @@ -5515,8 +5515,15 @@ static irqreturn_t dwc3_check_event_buf(struct
>>>> dwc3_event_buffer *evt)
>>>>             * irq event handler completes before caching new event to prevent
>>>>             * losing events.
>>>>             */
>>>> -       if (evt->flags & DWC3_EVENT_PENDING)
>>>> +       if (evt->flags & DWC3_EVENT_PENDING) {
>>>> +               if (!evt->count) {
>>>> +                       u32 reg = dwc3_readl(dwc->regs, DWC3_GEVNTSIZ(0));
>>>> +
>>>> +                       if (!(reg & DWC3_GEVNTSIZ_INTMASK))
>>>> +                               evt->flags &= ~DWC3_EVENT_PENDING;
>>>> +               }
>>>>                    return IRQ_HANDLED;
>>>> +       }
>>>>
>>>>
>>> I don't want the bottom-half to be scheduled in the beginning as it may
>>> come before the cleanup in dwc3_event_buffers_setup().
>> You mean the above changes for clearing DWC3_EVENT_PENDINGnot required
>> as you given new change will prevent setting of DWC3_EVENT_PENDING
>> before dwc3_event_buffers_setup().
> Yes.
>
>> But I dont see any harm in above code changes for clearing
>> DWC3_EVENT_PENDING if it already set with evt->count=0.
> You can add it there, but do we need to if we can solve the actual
> issue?
Agree. Yes, we need to find actual issue.
>
> I'm interested in confirming my suspiction of what's really causing the
> DWC3_EVENT_PENDING to be set here. The code logic would be clearer
> rather than masking the behavior by depending on the reset by the
> dwc3_event_buffers_setup(). The runtime resume doesn't share the same
> locking mechanism as when processing an event. While it may be unlikely,
> I don't want the bottom-half thread to handle stale event or race with
> the runtime resume.

Hi Thinh,

I ran the code you recommended on our testing environment and was able 
to reproduce the issue one time.

When evt->flags contains DWC3_EVENT_PENDING, I've included the following 
debugging information: I added this debug message at the start of 
dwc3_event_buffers_cleanup and dwc3_event_buffers_setup functions in 
during suspend and resume.

The results were quite interesting . I'm curious to understand how 
evt->flags is set to DWC3_EVENT_PENDING, and along with DWC3_GEVNTSIZ is 
equal to 0x1000 during the suspend.
Its means that the previous bottom-half handler prior to suspend might 
still be executing in the middle of the process.

Could you please give your suggestions here? And let me know if anything 
want to test or additional details are required.


##DBG: dwc3_event_buffers_cleanup:
  evt->length    :0x1000
  evt->lpos      :0x20c
  evt->count     :0x0
  evt->flags     :0x1 // This is Unexpected if DWC3_GEVNTSIZ(0)(0xc408): 
0x00001000. Its means that previous bottom-half handler may be still 
running in middle

  DWC3_GEVNTSIZ(0)(0xc408)       : 0x00001000
  DWC3_GEVNTCOUNT(0)(0xc40c)     : 0x00000000
  DWC3_DCFG(0xc700)              : 0x00e008a8
  DWC3_DCTL(0xc704)              : 0x0cf00a00
  DWC3_DEVTEN(0xc708)            : 0x00000000
  DWC3_DSTS(0xc70c)              : 0x00d20cd1


##DBG: dwc3_event_buffers_setup:
  evt->length    :0x1000
  evt->lpos      :0x20c
  evt->count     :0x0
  evt->flags     :0x1 // Still It's not clearing in during resume.

  DWC3_GEVNTSIZ(0)(0xc408)       : 0x00000000
  DWC3_GEVNTCOUNT(0)(0xc40c)     : 0x00000000
  DWC3_DCFG(0xc700)              : 0x00080800
  DWC3_DCTL(0xc704)              : 0x00f00000
  DWC3_DEVTEN(0xc708)            : 0x00000000
  DWC3_DSTS(0xc70c)              : 0x00d20001

>
>> Anyway I will try the your new proposed changes alone on our testing
>> setup and will update the status,
>>
> Thanks,
> Thinh
Thinh Nguyen Sept. 7, 2024, 12:39 a.m. UTC | #22
On Sat, Sep 07, 2024, Selvarasu Ganesan wrote:
> 
> Hi Thinh,
> 
> I ran the code you recommended on our testing environment and was able 
> to reproduce the issue one time.
> 
> When evt->flags contains DWC3_EVENT_PENDING, I've included the following 
> debugging information: I added this debug message at the start of 
> dwc3_event_buffers_cleanup and dwc3_event_buffers_setup functions in 
> during suspend and resume.
> 
> The results were quite interesting . I'm curious to understand how 
> evt->flags is set to DWC3_EVENT_PENDING, and along with DWC3_GEVNTSIZ is 
> equal to 0x1000 during the suspend.

That is indeed strange.

> Its means that the previous bottom-half handler prior to suspend might 
> still be executing in the middle of the process.
> 
> Could you please give your suggestions here? And let me know if anything 
> want to test or additional details are required.
> 
> 
> ##DBG: dwc3_event_buffers_cleanup:
>   evt->length    :0x1000
>   evt->lpos      :0x20c
>   evt->count     :0x0
>   evt->flags     :0x1 // This is Unexpected if DWC3_GEVNTSIZ(0)(0xc408): 
> 0x00001000. Its means that previous bottom-half handler may be still 
> running in middle

Perhaps.

But I doubt that's the case since it shouldn't take that long for the
bottom-half to be completed before the next resume yet the flag is still
set.

> 
>   DWC3_GEVNTSIZ(0)(0xc408)       : 0x00001000
>   DWC3_GEVNTCOUNT(0)(0xc40c)     : 0x00000000
>   DWC3_DCFG(0xc700)              : 0x00e008a8
>   DWC3_DCTL(0xc704)              : 0x0cf00a00
>   DWC3_DEVTEN(0xc708)            : 0x00000000
>   DWC3_DSTS(0xc70c)              : 0x00d20cd1
> 

The controller status is halted. So there's no problem with
soft-disconnect. For the interrupt mask in GEVNTSIZ to be cleared,
that likely means that the bottom-half had probably completed.

> 
> ##DBG: dwc3_event_buffers_setup:
>   evt->length    :0x1000
>   evt->lpos      :0x20c

They fact that evt->lpos did not get updated tells me that there's
something wrong with memory access to your platform during suspend and
resume.

>   evt->count     :0x0
>   evt->flags     :0x1 // Still It's not clearing in during resume.
> 
>   DWC3_GEVNTSIZ(0)(0xc408)       : 0x00000000
>   DWC3_GEVNTCOUNT(0)(0xc40c)     : 0x00000000
>   DWC3_DCFG(0xc700)              : 0x00080800
>   DWC3_DCTL(0xc704)              : 0x00f00000
>   DWC3_DEVTEN(0xc708)            : 0x00000000
>   DWC3_DSTS(0xc70c)              : 0x00d20001
> 

Please help look into your platform to see what condition triggers this
memory access issue. If this is a hardware quirk, we can properly update
the change and note it to be so.

Thanks,
Thinh

(If possible, for future tests, please dump the dwc3 tracepoints. Many
thanks for the tests.)
Selvarasu Ganesan Sept. 10, 2024, 1:37 p.m. UTC | #23
On 9/7/2024 6:09 AM, Thinh Nguyen wrote:
> On Sat, Sep 07, 2024, Selvarasu Ganesan wrote:
>> Hi Thinh,
>>
>> I ran the code you recommended on our testing environment and was able
>> to reproduce the issue one time.
>>
>> When evt->flags contains DWC3_EVENT_PENDING, I've included the following
>> debugging information: I added this debug message at the start of
>> dwc3_event_buffers_cleanup and dwc3_event_buffers_setup functions in
>> during suspend and resume.
>>
>> The results were quite interesting . I'm curious to understand how
>> evt->flags is set to DWC3_EVENT_PENDING, and along with DWC3_GEVNTSIZ is
>> equal to 0x1000 during the suspend.
> That is indeed strange.
>
>> Its means that the previous bottom-half handler prior to suspend might
>> still be executing in the middle of the process.
>>
>> Could you please give your suggestions here? And let me know if anything
>> want to test or additional details are required.
>>
>>
>> ##DBG: dwc3_event_buffers_cleanup:
>>    evt->length    :0x1000
>>    evt->lpos      :0x20c
>>    evt->count     :0x0
>>    evt->flags     :0x1 // This is Unexpected if DWC3_GEVNTSIZ(0)(0xc408):
>> 0x00001000. Its means that previous bottom-half handler may be still
>> running in middle
> Perhaps.
>
> But I doubt that's the case since it shouldn't take that long for the
> bottom-half to be completed before the next resume yet the flag is still
> set.
>
>>    DWC3_GEVNTSIZ(0)(0xc408)       : 0x00001000
>>    DWC3_GEVNTCOUNT(0)(0xc40c)     : 0x00000000
>>    DWC3_DCFG(0xc700)              : 0x00e008a8
>>    DWC3_DCTL(0xc704)              : 0x0cf00a00
>>    DWC3_DEVTEN(0xc708)            : 0x00000000
>>    DWC3_DSTS(0xc70c)              : 0x00d20cd1
>>
> The controller status is halted. So there's no problem with
> soft-disconnect. For the interrupt mask in GEVNTSIZ to be cleared,
> that likely means that the bottom-half had probably completed.

Agree, But I am worrying on If the bottom-half is completed, then 
DWC3_EVENT_PENDING must be cleared in evt->flags.
Is there any possibility of a CPU reordering issue when updating 
evt->flags in the bottom-half handler?.
Should I try with wmb() when writing to evt->flags?
>
>> ##DBG: dwc3_event_buffers_setup:
>>    evt->length    :0x1000
>>    evt->lpos      :0x20c
> They fact that evt->lpos did not get updated tells me that there's
> something wrong with memory access to your platform during suspend and
> resume.

Are you expecting the evt->lpos value to be zero here? If so, this is 
expected in our test setup because we avoid writing zero to evt->lpos as 
part of dwc3_event_buffers_cleanup if evt->flags have a value of 1. This 
is simply to track the status of evt->lpos during suspend to resume when 
evt->flags have a value of DWC3_EVENT_PENDING. The following test codes 
for the reference.

--- a/drivers/usb/dwc3/core.c
+++ b/drivers/usb/dwc3/core.c
@@ -505,8 +505,20 @@ static int dwc3_alloc_event_buffers(struct dwc3 
*dwc, unsigned int length)
  int dwc3_event_buffers_setup(struct dwc3 *dwc)
  {
         struct dwc3_event_buffer        *evt;
+       u32                             reg;

         evt = dwc->ev_buf;
+
+       if (evt->flags & DWC3_EVENT_PENDING) {
+               pr_info("evt->length :%x\n", evt->length);
+               pr_info("evt->lpos :%x\n", evt->lpos);
+               pr_info("evt->count :%x\n", evt->count);
+               pr_info("evt->flags :%x\n", evt->flags);
+
+               dwc3_exynos_reg_dump(dwc);
+
+       }
+
         evt->lpos = 0;
         dwc3_writel(dwc->regs, DWC3_GEVNTADRLO(0),
                         lower_32_bits(evt->dma));
@@ -514,8 +526,10 @@ int dwc3_event_buffers_setup(struct dwc3 *dwc)
                         upper_32_bits(evt->dma));
         dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
                         DWC3_GEVNTSIZ_SIZE(evt->length));
-       dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 0);

+       /* Clear any stale event */
+       reg = dwc3_readl(dwc->regs, DWC3_GEVNTCOUNT(0));
+       dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), reg);
         return 0;
  }

@@ -525,7 +539,16 @@ void dwc3_event_buffers_cleanup(struct dwc3 *dwc)

         evt = dwc->ev_buf;

-       evt->lpos = 0;
+       if (evt->flags & DWC3_EVENT_PENDING) {
+               pr_info("evt->length :%x\n", evt->length);
+               pr_info("evt->lpos :%x\n", evt->lpos);
+               pr_info("evt->count :%x\n", evt->count);
+               pr_info("evt->flags :%x\n", evt->flags);
+
+               dwc3_exynos_reg_dump(dwc);
+       } else {
+               evt->lpos = 0;
+       }

>
>>    evt->count     :0x0
>>    evt->flags     :0x1 // Still It's not clearing in during resume.
>>
>>    DWC3_GEVNTSIZ(0)(0xc408)       : 0x00000000
>>    DWC3_GEVNTCOUNT(0)(0xc40c)     : 0x00000000
>>    DWC3_DCFG(0xc700)              : 0x00080800
>>    DWC3_DCTL(0xc704)              : 0x00f00000
>>    DWC3_DEVTEN(0xc708)            : 0x00000000
>>    DWC3_DSTS(0xc70c)              : 0x00d20001
>>
> Please help look into your platform to see what condition triggers this
> memory access issue. If this is a hardware quirk, we can properly update
> the change and note it to be so.

Sure I will try to figure it out. However, we are facing challenges in 
reproducing the issue. There could be a delay in understanding the 
conditions that trigger the memory issue if it is related to a memory issue.

>
> Thanks,
> Thinh
>
> (If possible, for future tests, please dump the dwc3 tracepoints. Many
> thanks for the tests.)

I tried to get dwc3 traces in the failure case, but so far no instances 
have been reported. Our testing is still in progress with enable dwc3 
traces.

I will keep posting once I get the dwc3 traces in the failure scenario.


Thanks,
Selva
Thinh Nguyen Sept. 11, 2024, 12:24 a.m. UTC | #24
On Tue, Sep 10, 2024, Selvarasu Ganesan wrote:
> 
> On 9/7/2024 6:09 AM, Thinh Nguyen wrote:
> > On Sat, Sep 07, 2024, Selvarasu Ganesan wrote:
> >> Hi Thinh,
> >>
> >> I ran the code you recommended on our testing environment and was able
> >> to reproduce the issue one time.
> >>
> >> When evt->flags contains DWC3_EVENT_PENDING, I've included the following
> >> debugging information: I added this debug message at the start of
> >> dwc3_event_buffers_cleanup and dwc3_event_buffers_setup functions in
> >> during suspend and resume.
> >>
> >> The results were quite interesting . I'm curious to understand how
> >> evt->flags is set to DWC3_EVENT_PENDING, and along with DWC3_GEVNTSIZ is
> >> equal to 0x1000 during the suspend.
> > That is indeed strange.
> >
> >> Its means that the previous bottom-half handler prior to suspend might
> >> still be executing in the middle of the process.
> >>
> >> Could you please give your suggestions here? And let me know if anything
> >> want to test or additional details are required.
> >>
> >>
> >> ##DBG: dwc3_event_buffers_cleanup:
> >>    evt->length    :0x1000
> >>    evt->lpos      :0x20c
> >>    evt->count     :0x0
> >>    evt->flags     :0x1 // This is Unexpected if DWC3_GEVNTSIZ(0)(0xc408):
> >> 0x00001000. Its means that previous bottom-half handler may be still
> >> running in middle
> > Perhaps.
> >
> > But I doubt that's the case since it shouldn't take that long for the
> > bottom-half to be completed before the next resume yet the flag is still
> > set.
> >
> >>    DWC3_GEVNTSIZ(0)(0xc408)       : 0x00001000
> >>    DWC3_GEVNTCOUNT(0)(0xc40c)     : 0x00000000
> >>    DWC3_DCFG(0xc700)              : 0x00e008a8
> >>    DWC3_DCTL(0xc704)              : 0x0cf00a00
> >>    DWC3_DEVTEN(0xc708)            : 0x00000000
> >>    DWC3_DSTS(0xc70c)              : 0x00d20cd1
> >>
> > The controller status is halted. So there's no problem with
> > soft-disconnect. For the interrupt mask in GEVNTSIZ to be cleared,
> > that likely means that the bottom-half had probably completed.
> 
> Agree, But I am worrying on If the bottom-half is completed, then 
> DWC3_EVENT_PENDING must be cleared in evt->flags.
> Is there any possibility of a CPU reordering issue when updating 
> evt->flags in the bottom-half handler?.
> Should I try with wmb() when writing to evt->flags?

Assuming that the problem occurs after the bottom-half completed, there
should be implicit memory barrier. The memory operation should complete
before the release from spin_unlock complete. I don't think wmb() will
help.

> >
> >> ##DBG: dwc3_event_buffers_setup:
> >>    evt->length    :0x1000
> >>    evt->lpos      :0x20c
> > They fact that evt->lpos did not get updated tells me that there's
> > something wrong with memory access to your platform during suspend and
> > resume.
> 
> Are you expecting the evt->lpos value to be zero here? If so, this is 
> expected in our test setup because we avoid writing zero to evt->lpos as 
> part of dwc3_event_buffers_cleanup if evt->flags have a value of 1. This 

Oh ok. I did not know you made this modification.

> is simply to track the status of evt->lpos during suspend to resume when 
> evt->flags have a value of DWC3_EVENT_PENDING. The following test codes 
> for the reference.
> 
> --- a/drivers/usb/dwc3/core.c
> +++ b/drivers/usb/dwc3/core.c
> @@ -505,8 +505,20 @@ static int dwc3_alloc_event_buffers(struct dwc3 
> *dwc, unsigned int length)
>   int dwc3_event_buffers_setup(struct dwc3 *dwc)
>   {
>          struct dwc3_event_buffer        *evt;
> +       u32                             reg;
> 
>          evt = dwc->ev_buf;
> +
> +       if (evt->flags & DWC3_EVENT_PENDING) {
> +               pr_info("evt->length :%x\n", evt->length);
> +               pr_info("evt->lpos :%x\n", evt->lpos);
> +               pr_info("evt->count :%x\n", evt->count);
> +               pr_info("evt->flags :%x\n", evt->flags);
> +
> +               dwc3_exynos_reg_dump(dwc);
> +
> +       }
> +
>          evt->lpos = 0;
>          dwc3_writel(dwc->regs, DWC3_GEVNTADRLO(0),
>                          lower_32_bits(evt->dma));
> @@ -514,8 +526,10 @@ int dwc3_event_buffers_setup(struct dwc3 *dwc)
>                          upper_32_bits(evt->dma));
>          dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>                          DWC3_GEVNTSIZ_SIZE(evt->length));
> -       dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 0);
> 
> +       /* Clear any stale event */
> +       reg = dwc3_readl(dwc->regs, DWC3_GEVNTCOUNT(0));
> +       dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), reg);
>          return 0;
>   }
> 
> @@ -525,7 +539,16 @@ void dwc3_event_buffers_cleanup(struct dwc3 *dwc)
> 
>          evt = dwc->ev_buf;
> 
> -       evt->lpos = 0;
> +       if (evt->flags & DWC3_EVENT_PENDING) {
> +               pr_info("evt->length :%x\n", evt->length);
> +               pr_info("evt->lpos :%x\n", evt->lpos);
> +               pr_info("evt->count :%x\n", evt->count);
> +               pr_info("evt->flags :%x\n", evt->flags);
> +
> +               dwc3_exynos_reg_dump(dwc);
> +       } else {
> +               evt->lpos = 0;

I wasn't aware of this change.

> +       }
> 
> >
> >>    evt->count     :0x0
> >>    evt->flags     :0x1 // Still It's not clearing in during resume.
> >>
> >>    DWC3_GEVNTSIZ(0)(0xc408)       : 0x00000000
> >>    DWC3_GEVNTCOUNT(0)(0xc40c)     : 0x00000000
> >>    DWC3_DCFG(0xc700)              : 0x00080800
> >>    DWC3_DCTL(0xc704)              : 0x00f00000
> >>    DWC3_DEVTEN(0xc708)            : 0x00000000
> >>    DWC3_DSTS(0xc70c)              : 0x00d20001
> >>
> > Please help look into your platform to see what condition triggers this
> > memory access issue. If this is a hardware quirk, we can properly update
> > the change and note it to be so.
> 
> Sure I will try to figure it out. However, we are facing challenges in 
> reproducing the issue. There could be a delay in understanding the 
> conditions that trigger the memory issue if it is related to a memory issue.
> 
> >
> > Thanks,
> > Thinh
> >
> > (If possible, for future tests, please dump the dwc3 tracepoints. Many
> > thanks for the tests.)
> 
> I tried to get dwc3 traces in the failure case, but so far no instances 
> have been reported. Our testing is still in progress with enable dwc3 
> traces.
> 
> I will keep posting once I get the dwc3 traces in the failure scenario.
> 

Thanks for the update. I hope enabling of the driver tracepoints will
not impact the reproduction of the issue. With the driver log, we'll get
more clues to what was going on.

Thanks,
Thinh
Selvarasu Ganesan Sept. 13, 2024, 12:42 p.m. UTC | #25
On 9/11/2024 5:54 AM, Thinh Nguyen wrote:
> On Tue, Sep 10, 2024, Selvarasu Ganesan wrote:
>> On 9/7/2024 6:09 AM, Thinh Nguyen wrote:
>>> On Sat, Sep 07, 2024, Selvarasu Ganesan wrote:
>>>> Hi Thinh,
>>>>
>>>> I ran the code you recommended on our testing environment and was able
>>>> to reproduce the issue one time.
>>>>
>>>> When evt->flags contains DWC3_EVENT_PENDING, I've included the following
>>>> debugging information: I added this debug message at the start of
>>>> dwc3_event_buffers_cleanup and dwc3_event_buffers_setup functions in
>>>> during suspend and resume.
>>>>
>>>> The results were quite interesting . I'm curious to understand how
>>>> evt->flags is set to DWC3_EVENT_PENDING, and along with DWC3_GEVNTSIZ is
>>>> equal to 0x1000 during the suspend.
>>> That is indeed strange.
>>>
>>>> Its means that the previous bottom-half handler prior to suspend might
>>>> still be executing in the middle of the process.
>>>>
>>>> Could you please give your suggestions here? And let me know if anything
>>>> want to test or additional details are required.
>>>>
>>>>
>>>> ##DBG: dwc3_event_buffers_cleanup:
>>>>     evt->length    :0x1000
>>>>     evt->lpos      :0x20c
>>>>     evt->count     :0x0
>>>>     evt->flags     :0x1 // This is Unexpected if DWC3_GEVNTSIZ(0)(0xc408):
>>>> 0x00001000. Its means that previous bottom-half handler may be still
>>>> running in middle
>>> Perhaps.
>>>
>>> But I doubt that's the case since it shouldn't take that long for the
>>> bottom-half to be completed before the next resume yet the flag is still
>>> set.
>>>
>>>>     DWC3_GEVNTSIZ(0)(0xc408)       : 0x00001000
>>>>     DWC3_GEVNTCOUNT(0)(0xc40c)     : 0x00000000
>>>>     DWC3_DCFG(0xc700)              : 0x00e008a8
>>>>     DWC3_DCTL(0xc704)              : 0x0cf00a00
>>>>     DWC3_DEVTEN(0xc708)            : 0x00000000
>>>>     DWC3_DSTS(0xc70c)              : 0x00d20cd1
>>>>
>>> The controller status is halted. So there's no problem with
>>> soft-disconnect. For the interrupt mask in GEVNTSIZ to be cleared,
>>> that likely means that the bottom-half had probably completed.
>> Agree, But I am worrying on If the bottom-half is completed, then
>> DWC3_EVENT_PENDING must be cleared in evt->flags.
>> Is there any possibility of a CPU reordering issue when updating
>> evt->flags in the bottom-half handler?.
>> Should I try with wmb() when writing to evt->flags?
> Assuming that the problem occurs after the bottom-half completed, there
> should be implicit memory barrier. The memory operation should complete
> before the release from spin_unlock complete. I don't think wmb() will
> help.
Agree.
>>>> ##DBG: dwc3_event_buffers_setup:
>>>>     evt->length    :0x1000
>>>>     evt->lpos      :0x20c
>>> They fact that evt->lpos did not get updated tells me that there's
>>> something wrong with memory access to your platform during suspend and
>>> resume.
>> Are you expecting the evt->lpos value to be zero here? If so, this is
>> expected in our test setup because we avoid writing zero to evt->lpos as
>> part of dwc3_event_buffers_cleanup if evt->flags have a value of 1. This
> Oh ok. I did not know you made this modification.
>
>> is simply to track the status of evt->lpos during suspend to resume when
>> evt->flags have a value of DWC3_EVENT_PENDING. The following test codes
>> for the reference.
>>
>> --- a/drivers/usb/dwc3/core.c
>> +++ b/drivers/usb/dwc3/core.c
>> @@ -505,8 +505,20 @@ static int dwc3_alloc_event_buffers(struct dwc3
>> *dwc, unsigned int length)
>>    int dwc3_event_buffers_setup(struct dwc3 *dwc)
>>    {
>>           struct dwc3_event_buffer        *evt;
>> +       u32                             reg;
>>
>>           evt = dwc->ev_buf;
>> +
>> +       if (evt->flags & DWC3_EVENT_PENDING) {
>> +               pr_info("evt->length :%x\n", evt->length);
>> +               pr_info("evt->lpos :%x\n", evt->lpos);
>> +               pr_info("evt->count :%x\n", evt->count);
>> +               pr_info("evt->flags :%x\n", evt->flags);
>> +
>> +               dwc3_exynos_reg_dump(dwc);
>> +
>> +       }
>> +
>>           evt->lpos = 0;
>>           dwc3_writel(dwc->regs, DWC3_GEVNTADRLO(0),
>>                           lower_32_bits(evt->dma));
>> @@ -514,8 +526,10 @@ int dwc3_event_buffers_setup(struct dwc3 *dwc)
>>                           upper_32_bits(evt->dma));
>>           dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>>                           DWC3_GEVNTSIZ_SIZE(evt->length));
>> -       dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 0);
>>
>> +       /* Clear any stale event */
>> +       reg = dwc3_readl(dwc->regs, DWC3_GEVNTCOUNT(0));
>> +       dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), reg);
>>           return 0;
>>    }
>>
>> @@ -525,7 +539,16 @@ void dwc3_event_buffers_cleanup(struct dwc3 *dwc)
>>
>>           evt = dwc->ev_buf;
>>
>> -       evt->lpos = 0;
>> +       if (evt->flags & DWC3_EVENT_PENDING) {
>> +               pr_info("evt->length :%x\n", evt->length);
>> +               pr_info("evt->lpos :%x\n", evt->lpos);
>> +               pr_info("evt->count :%x\n", evt->count);
>> +               pr_info("evt->flags :%x\n", evt->flags);
>> +
>> +               dwc3_exynos_reg_dump(dwc);
>> +       } else {
>> +               evt->lpos = 0;
> I wasn't aware of this change.
>
>> +       }
>>
>>>>     evt->count     :0x0
>>>>     evt->flags     :0x1 // Still It's not clearing in during resume.
>>>>
>>>>     DWC3_GEVNTSIZ(0)(0xc408)       : 0x00000000
>>>>     DWC3_GEVNTCOUNT(0)(0xc40c)     : 0x00000000
>>>>     DWC3_DCFG(0xc700)              : 0x00080800
>>>>     DWC3_DCTL(0xc704)              : 0x00f00000
>>>>     DWC3_DEVTEN(0xc708)            : 0x00000000
>>>>     DWC3_DSTS(0xc70c)              : 0x00d20001
>>>>
>>> Please help look into your platform to see what condition triggers this
>>> memory access issue. If this is a hardware quirk, we can properly update
>>> the change and note it to be so.
>> Sure I will try to figure it out. However, we are facing challenges in
>> reproducing the issue. There could be a delay in understanding the
>> conditions that trigger the memory issue if it is related to a memory issue.
>>
>>> Thanks,
>>> Thinh
>>>
>>> (If possible, for future tests, please dump the dwc3 tracepoints. Many
>>> thanks for the tests.)
>> I tried to get dwc3 traces in the failure case, but so far no instances
>> have been reported. Our testing is still in progress with enable dwc3
>> traces.
>>
>> I will keep posting once I get the dwc3 traces in the failure scenario.
>>
> Thanks for the update. I hope enabling of the driver tracepoints will
> not impact the reproduction of the issue. With the driver log, we'll get
> more clues to what was going on.
>
> Thanks,
> Thinh
Hi Thinh,

So far, there have been no reported error instances. But, we suspecting 
that the issue may be related to our glue driver. In our glue driver, we 
access the reference of evt->flags when starting or stopping the gadget 
based on a VBUS notification. We apologize for sharing this information 
so late, as we only became aware of it recently.

The following sequence outlines the possible scenarios of race conditions:

Thread#1 (Our glue Driver Sequence)
===================================
->USB VBUS notification
->Start/Stop gadget
->dwc->ev_buf->flags |= BIT(20); (It's for our reference)
->Call dwc3 exynos runtime suspend/resume
->dwc->ev_buf->flags &= ~BIT(20);
->Call dwc3 core runtime suspend/resume

Thread#2
========
->dwc3_interrupt()
->evt->flags |= DWC3_EVENT_PENDING;
->dwc3_thread_interrupt()
->evt->flags &= ~DWC3_EVENT_PENDING;



After our internal discussions, we have decided to remove the 
unnecessary access to evt->flag in our glue driver. We have made these 
changes and initiated testing.

Thank you for your help so far to understand more into our glue driver code.

And We are thinking that it would be fine to reset evt->flag when the 
USB controller is started, along with the changes you suggested earlier. 
This additional measure will help prevent similar issues from occurring 
in the future.

Please let us know your thoughts on this proposal. If it is not 
necessary, we understand and will proceed accordingly.


Thanks,
Selva
Thinh Nguyen Sept. 13, 2024, 5:51 p.m. UTC | #26
Hi,

On Fri, Sep 13, 2024, Selvarasu Ganesan wrote:
> Hi Thinh,
> 
> So far, there have been no reported error instances. But, we suspecting 
> that the issue may be related to our glue driver. In our glue driver, we 
> access the reference of evt->flags when starting or stopping the gadget 
> based on a VBUS notification. We apologize for sharing this information 
> so late, as we only became aware of it recently.
> 
> The following sequence outlines the possible scenarios of race conditions:
> 
> Thread#1 (Our glue Driver Sequence)
> ===================================
> ->USB VBUS notification
> ->Start/Stop gadget
> ->dwc->ev_buf->flags |= BIT(20); (It's for our reference)
> ->Call dwc3 exynos runtime suspend/resume
> ->dwc->ev_buf->flags &= ~BIT(20);
> ->Call dwc3 core runtime suspend/resume
> 
> Thread#2
> ========
> ->dwc3_interrupt()
> ->evt->flags |= DWC3_EVENT_PENDING;
> ->dwc3_thread_interrupt()
> ->evt->flags &= ~DWC3_EVENT_PENDING;
> 

This is great! That's likely the problem. Glad you found it.

> 
> 
> After our internal discussions, we have decided to remove the 
> unnecessary access to evt->flag in our glue driver. We have made these 
> changes and initiated testing.
> 
> Thank you for your help so far to understand more into our glue driver code.
> 
> And We are thinking that it would be fine to reset evt->flag when the 
> USB controller is started, along with the changes you suggested earlier. 
> This additional measure will help prevent similar issues from occurring 
> in the future.
> 
> Please let us know your thoughts on this proposal. If it is not 
> necessary, we understand and will proceed accordingly.
> 

You can submit the change I suggested. That's a valid change. However,
we should not include the reset of the DWC3_EVENT_PENDING flag. Had we
done this, you may not found the issue above. It serves no purpose for
the core driver logic and will be an extra burden for us to maintain. (I
don't want to scratch my head in the future to figure out why that
change was needed or concern whether it can be removed without causing
regression).

Thanks,
Thinh
Thinh Nguyen Sept. 13, 2024, 6 p.m. UTC | #27
On Fri, Sep 13, 2024, Thinh Nguyen wrote:
> On Fri, Sep 13, 2024, Selvarasu Ganesan wrote:
> > Hi Thinh,
> > 
> > So far, there have been no reported error instances. But, we suspecting 
> > that the issue may be related to our glue driver. In our glue driver, we 
> > access the reference of evt->flags when starting or stopping the gadget 
> > based on a VBUS notification. We apologize for sharing this information 
> > so late, as we only became aware of it recently.
> > 
> > The following sequence outlines the possible scenarios of race conditions:
> > 
> > Thread#1 (Our glue Driver Sequence)
> > ===================================
> > ->USB VBUS notification
> > ->Start/Stop gadget
> > ->dwc->ev_buf->flags |= BIT(20); (It's for our reference)
> > ->Call dwc3 exynos runtime suspend/resume
> > ->dwc->ev_buf->flags &= ~BIT(20);
> > ->Call dwc3 core runtime suspend/resume
> > 
> > Thread#2
> > ========
> > ->dwc3_interrupt()
> > ->evt->flags |= DWC3_EVENT_PENDING;
> > ->dwc3_thread_interrupt()
> > ->evt->flags &= ~DWC3_EVENT_PENDING;
> > 
> 
> This is great! That's likely the problem. Glad you found it.
> 
> > 
> > 
> > After our internal discussions, we have decided to remove the 
> > unnecessary access to evt->flag in our glue driver. We have made these 
> > changes and initiated testing.
> > 
> > Thank you for your help so far to understand more into our glue driver code.
> > 
> > And We are thinking that it would be fine to reset evt->flag when the 
> > USB controller is started, along with the changes you suggested earlier. 
> > This additional measure will help prevent similar issues from occurring 
> > in the future.
> > 
> > Please let us know your thoughts on this proposal. If it is not 
> > necessary, we understand and will proceed accordingly.
> > 
> 
> You can submit the change I suggested. That's a valid change. However,
> we should not include the reset of the DWC3_EVENT_PENDING flag. Had we
> done this, you may not found the issue above. It serves no purpose for
> the core driver logic and will be an extra burden for us to maintain. (I
> don't want to scratch my head in the future to figure out why that
> change was needed or concern whether it can be removed without causing
> regression).
> 

Also, perhaps you may want to revisit and review the change below to see
if the glue driver may be the culprit:

14e497183df2 ("usb: dwc3: core: Prevent USB core invalid event buffer address access")

Thanks,
Thinh
Selvarasu Ganesan Sept. 16, 2024, 12:41 p.m. UTC | #28
On 9/13/2024 11:21 PM, Thinh Nguyen wrote:
> Hi,
>
> On Fri, Sep 13, 2024, Selvarasu Ganesan wrote:
>> Hi Thinh,
>>
>> So far, there have been no reported error instances. But, we suspecting
>> that the issue may be related to our glue driver. In our glue driver, we
>> access the reference of evt->flags when starting or stopping the gadget
>> based on a VBUS notification. We apologize for sharing this information
>> so late, as we only became aware of it recently.
>>
>> The following sequence outlines the possible scenarios of race conditions:
>>
>> Thread#1 (Our glue Driver Sequence)
>> ===================================
>> ->USB VBUS notification
>> ->Start/Stop gadget
>> ->dwc->ev_buf->flags |= BIT(20); (It's for our reference)
>> ->Call dwc3 exynos runtime suspend/resume
>> ->dwc->ev_buf->flags &= ~BIT(20);
>> ->Call dwc3 core runtime suspend/resume
>>
>> Thread#2
>> ========
>> ->dwc3_interrupt()
>> ->evt->flags |= DWC3_EVENT_PENDING;
>> ->dwc3_thread_interrupt()
>> ->evt->flags &= ~DWC3_EVENT_PENDING;
>>
> This is great! That's likely the problem. Glad you found it.
>
>>
>> After our internal discussions, we have decided to remove the
>> unnecessary access to evt->flag in our glue driver. We have made these
>> changes and initiated testing.
>>
>> Thank you for your help so far to understand more into our glue driver code.
>>
>> And We are thinking that it would be fine to reset evt->flag when the
>> USB controller is started, along with the changes you suggested earlier.
>> This additional measure will help prevent similar issues from occurring
>> in the future.
>>
>> Please let us know your thoughts on this proposal. If it is not
>> necessary, we understand and will proceed accordingly.
>>
> You can submit the change I suggested. That's a valid change. However,
> we should not include the reset of the DWC3_EVENT_PENDING flag. Had we
> done this, you may not found the issue above. It serves no purpose for
> the core driver logic and will be an extra burden for us to maintain. (I
> don't want to scratch my head in the future to figure out why that
> change was needed or concern whether it can be removed without causing
> regression).

Yeah I understand.

Please reconfirm the below changes once with commit message. I will post 
new version if this changes are fine.


[PATCH] usb: dwc3: core: Stop processing of pending events if controller 
is halted

This commit addresses an issue where events were being processed when
the controller was in a halted state. To fix this issue by stop
processing the events as the event count was considered stale or
invalid when the controller was halted.

Fixes: fc8bb91bc83e ("usb: dwc3: implement runtime PM")
Cc: stable <stable@kernel.org>
Signed-off-by: Selvarasu Ganesan <selvarasu.g@samsung.com>
Suggested-by: Thinh Nguyen <Thinh.Nguyen@synopsys.com>
---
  drivers/usb/dwc3/core.c   | 17 +++++++++++++++--
  drivers/usb/dwc3/core.h   |  4 ----
  drivers/usb/dwc3/gadget.c | 22 +++++++++++-----------
  3 files changed, 26 insertions(+), 17 deletions(-)

diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
index 9eb085f359ce..f47b20bc2d1f 100644
--- a/drivers/usb/dwc3/core.c
+++ b/drivers/usb/dwc3/core.c
@@ -544,6 +544,7 @@ static int dwc3_alloc_event_buffers(struct dwc3 
*dwc, unsigned int length)
  int dwc3_event_buffers_setup(struct dwc3 *dwc)
  {
         struct dwc3_event_buffer        *evt;
+       u32                             reg;

         if (!dwc->ev_buf)
                 return 0;
@@ -556,8 +557,10 @@ int dwc3_event_buffers_setup(struct dwc3 *dwc)
                         upper_32_bits(evt->dma));
         dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
                         DWC3_GEVNTSIZ_SIZE(evt->length));
-       dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 0);

+       /* Clear any stale event */
+       reg = dwc3_readl(dwc->regs, DWC3_GEVNTCOUNT(0));
+       dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), reg);
         return 0;
  }

@@ -2499,7 +2502,11 @@ static int dwc3_runtime_resume(struct device *dev)

         switch (dwc->current_dr_role) {
         case DWC3_GCTL_PRTCAP_DEVICE:
-               dwc3_gadget_process_pending_events(dwc);
+               if (dwc->pending_events) {
+                       pm_runtime_put(dwc->dev);
+                       dwc->pending_events = false;
+                       enable_irq(dwc->irq_gadget);
+               }
                 break;
         case DWC3_GCTL_PRTCAP_HOST:
         default:
@@ -2589,6 +2596,12 @@ static void dwc3_complete(struct device *dev)
  static const struct dev_pm_ops dwc3_dev_pm_ops = {
         SET_SYSTEM_SLEEP_PM_OPS(dwc3_suspend, dwc3_resume)
         .complete = dwc3_complete,
+
+       /*
+        * Runtime suspend halts the controller on disconnection. It 
replies on
+        * platforms with custom connection notification to start the 
controller
+        * again.
+        */
         SET_RUNTIME_PM_OPS(dwc3_runtime_suspend, dwc3_runtime_resume,
                         dwc3_runtime_idle)
  };
diff --git a/drivers/usb/dwc3/core.h b/drivers/usb/dwc3/core.h
index c71240e8f7c7..9c508e0c5cdf 100644
--- a/drivers/usb/dwc3/core.h
+++ b/drivers/usb/dwc3/core.h
@@ -1675,7 +1675,6 @@ static inline void dwc3_otg_host_init(struct dwc3 
*dwc)
  #if !IS_ENABLED(CONFIG_USB_DWC3_HOST)
  int dwc3_gadget_suspend(struct dwc3 *dwc);
  int dwc3_gadget_resume(struct dwc3 *dwc);
-void dwc3_gadget_process_pending_events(struct dwc3 *dwc);
  #else
  static inline int dwc3_gadget_suspend(struct dwc3 *dwc)
  {
@@ -1687,9 +1686,6 @@ static inline int dwc3_gadget_resume(struct dwc3 *dwc)
         return 0;
  }

-static inline void dwc3_gadget_process_pending_events(struct dwc3 *dwc)
-{
-}
  #endif /* !IS_ENABLED(CONFIG_USB_DWC3_HOST) */

  #if IS_ENABLED(CONFIG_USB_DWC3_ULPI)
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
index 291bc549935b..a32c3a292353 100644
--- a/drivers/usb/dwc3/gadget.c
+++ b/drivers/usb/dwc3/gadget.c
@@ -4483,6 +4483,17 @@ static irqreturn_t dwc3_check_event_buf(struct 
dwc3_event_buffer *evt)
                 return IRQ_HANDLED;

         count = dwc3_readl(dwc->regs, DWC3_GEVNTCOUNT(0));
+
+       /*
+        * If the controller is halted, the event count is 
stale/invalid. Ignore
+        * them. This happens if the interrupt assertion is from an 
out-of-band
+        * resume notification.
+        */
+       if (!dwc->pullups_connected && count) {
+               dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), count);
+               return IRQ_HANDLED;
+       }
+
         count &= DWC3_GEVNTCOUNT_MASK;
         if (!count)
                 return IRQ_NONE;
@@ -4728,14 +4739,3 @@ int dwc3_gadget_resume(struct dwc3 *dwc)

         return dwc3_gadget_soft_connect(dwc);
  }
-
-void dwc3_gadget_process_pending_events(struct dwc3 *dwc)
-{
-       if (dwc->pending_events) {
-               dwc3_interrupt(dwc->irq_gadget, dwc->ev_buf);
-               dwc3_thread_interrupt(dwc->irq_gadget, dwc->ev_buf);
-               pm_runtime_put(dwc->dev);
-               dwc->pending_events = false;
-               enable_irq(dwc->irq_gadget);
-       }
-}
--
2.17.1

Thanks,
Selva
> Thanks,
> Thinh
Selvarasu Ganesan Sept. 16, 2024, 12:43 p.m. UTC | #29
On 9/13/2024 11:30 PM, Thinh Nguyen wrote:
> On Fri, Sep 13, 2024, Thinh Nguyen wrote:
>> On Fri, Sep 13, 2024, Selvarasu Ganesan wrote:
>>> Hi Thinh,
>>>
>>> So far, there have been no reported error instances. But, we suspecting
>>> that the issue may be related to our glue driver. In our glue driver, we
>>> access the reference of evt->flags when starting or stopping the gadget
>>> based on a VBUS notification. We apologize for sharing this information
>>> so late, as we only became aware of it recently.
>>>
>>> The following sequence outlines the possible scenarios of race conditions:
>>>
>>> Thread#1 (Our glue Driver Sequence)
>>> ===================================
>>> ->USB VBUS notification
>>> ->Start/Stop gadget
>>> ->dwc->ev_buf->flags |= BIT(20); (It's for our reference)
>>> ->Call dwc3 exynos runtime suspend/resume
>>> ->dwc->ev_buf->flags &= ~BIT(20);
>>> ->Call dwc3 core runtime suspend/resume
>>>
>>> Thread#2
>>> ========
>>> ->dwc3_interrupt()
>>> ->evt->flags |= DWC3_EVENT_PENDING;
>>> ->dwc3_thread_interrupt()
>>> ->evt->flags &= ~DWC3_EVENT_PENDING;
>>>
>> This is great! That's likely the problem. Glad you found it.
>>
>>>
>>> After our internal discussions, we have decided to remove the
>>> unnecessary access to evt->flag in our glue driver. We have made these
>>> changes and initiated testing.
>>>
>>> Thank you for your help so far to understand more into our glue driver code.
>>>
>>> And We are thinking that it would be fine to reset evt->flag when the
>>> USB controller is started, along with the changes you suggested earlier.
>>> This additional measure will help prevent similar issues from occurring
>>> in the future.
>>>
>>> Please let us know your thoughts on this proposal. If it is not
>>> necessary, we understand and will proceed accordingly.
>>>
>> You can submit the change I suggested. That's a valid change. However,
>> we should not include the reset of the DWC3_EVENT_PENDING flag. Had we
>> done this, you may not found the issue above. It serves no purpose for
>> the core driver logic and will be an extra burden for us to maintain. (I
>> don't want to scratch my head in the future to figure out why that
>> change was needed or concern whether it can be removed without causing
>> regression).
>>
> Also, perhaps you may want to revisit and review the change below to see
> if the glue driver may be the culprit:
>
> 14e497183df2 ("usb: dwc3: core: Prevent USB core invalid event buffer address access")

Hi Thinh,

We reconfirmed that this issue not due to our glue driver.


Thanks,
Selva
> Thanks,
> Thinh
Thinh Nguyen Sept. 16, 2024, 9:18 p.m. UTC | #30
On Mon, Sep 16, 2024, Selvarasu Ganesan wrote:
> 
> On 9/13/2024 11:21 PM, Thinh Nguyen wrote:
> > Hi,
> >
> > On Fri, Sep 13, 2024, Selvarasu Ganesan wrote:
> >> Hi Thinh,
> >>
> >> So far, there have been no reported error instances. But, we suspecting
> >> that the issue may be related to our glue driver. In our glue driver, we
> >> access the reference of evt->flags when starting or stopping the gadget
> >> based on a VBUS notification. We apologize for sharing this information
> >> so late, as we only became aware of it recently.
> >>
> >> The following sequence outlines the possible scenarios of race conditions:
> >>
> >> Thread#1 (Our glue Driver Sequence)
> >> ===================================
> >> ->USB VBUS notification
> >> ->Start/Stop gadget
> >> ->dwc->ev_buf->flags |= BIT(20); (It's for our reference)
> >> ->Call dwc3 exynos runtime suspend/resume
> >> ->dwc->ev_buf->flags &= ~BIT(20);
> >> ->Call dwc3 core runtime suspend/resume
> >>
> >> Thread#2
> >> ========
> >> ->dwc3_interrupt()
> >> ->evt->flags |= DWC3_EVENT_PENDING;
> >> ->dwc3_thread_interrupt()
> >> ->evt->flags &= ~DWC3_EVENT_PENDING;
> >>
> > This is great! That's likely the problem. Glad you found it.
> >
> >>
> >> After our internal discussions, we have decided to remove the
> >> unnecessary access to evt->flag in our glue driver. We have made these
> >> changes and initiated testing.
> >>
> >> Thank you for your help so far to understand more into our glue driver code.
> >>
> >> And We are thinking that it would be fine to reset evt->flag when the
> >> USB controller is started, along with the changes you suggested earlier.
> >> This additional measure will help prevent similar issues from occurring
> >> in the future.
> >>
> >> Please let us know your thoughts on this proposal. If it is not
> >> necessary, we understand and will proceed accordingly.
> >>
> > You can submit the change I suggested. That's a valid change. However,
> > we should not include the reset of the DWC3_EVENT_PENDING flag. Had we
> > done this, you may not found the issue above. It serves no purpose for
> > the core driver logic and will be an extra burden for us to maintain. (I
> > don't want to scratch my head in the future to figure out why that
> > change was needed or concern whether it can be removed without causing
> > regression).
> 
> Yeah I understand.
> 
> Please reconfirm the below changes once with commit message. I will post 
> new version if this changes are fine.
> 
> 
> [PATCH] usb: dwc3: core: Stop processing of pending events if controller 
> is halted
> 
> This commit addresses an issue where events were being processed when
> the controller was in a halted state. To fix this issue by stop
> processing the events as the event count was considered stale or
> invalid when the controller was halted.
> 
> Fixes: fc8bb91bc83e ("usb: dwc3: implement runtime PM")
> Cc: stable <stable@kernel.org>
> Signed-off-by: Selvarasu Ganesan <selvarasu.g@samsung.com>
> Suggested-by: Thinh Nguyen <Thinh.Nguyen@synopsys.com>
> ---
>   drivers/usb/dwc3/core.c   | 17 +++++++++++++++--
>   drivers/usb/dwc3/core.h   |  4 ----
>   drivers/usb/dwc3/gadget.c | 22 +++++++++++-----------
>   3 files changed, 26 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
> index 9eb085f359ce..f47b20bc2d1f 100644
> --- a/drivers/usb/dwc3/core.c
> +++ b/drivers/usb/dwc3/core.c
> @@ -544,6 +544,7 @@ static int dwc3_alloc_event_buffers(struct dwc3 
> *dwc, unsigned int length)
>   int dwc3_event_buffers_setup(struct dwc3 *dwc)
>   {
>          struct dwc3_event_buffer        *evt;
> +       u32                             reg;
> 
>          if (!dwc->ev_buf)
>                  return 0;
> @@ -556,8 +557,10 @@ int dwc3_event_buffers_setup(struct dwc3 *dwc)
>                          upper_32_bits(evt->dma));
>          dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>                          DWC3_GEVNTSIZ_SIZE(evt->length));
> -       dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 0);
> 
> +       /* Clear any stale event */

Do the same thing here as in dwc3_event_buffers_cleanup().

> +       reg = dwc3_readl(dwc->regs, DWC3_GEVNTCOUNT(0));
> +       dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), reg);
>          return 0;
>   }
> 
> @@ -2499,7 +2502,11 @@ static int dwc3_runtime_resume(struct device *dev)
> 
>          switch (dwc->current_dr_role) {
>          case DWC3_GCTL_PRTCAP_DEVICE:
> -               dwc3_gadget_process_pending_events(dwc);
> +               if (dwc->pending_events) {
> +                       pm_runtime_put(dwc->dev);
> +                       dwc->pending_events = false;
> +                       enable_irq(dwc->irq_gadget);
> +               }
>                  break;
>          case DWC3_GCTL_PRTCAP_HOST:
>          default:
> @@ -2589,6 +2596,12 @@ static void dwc3_complete(struct device *dev)
>   static const struct dev_pm_ops dwc3_dev_pm_ops = {
>          SET_SYSTEM_SLEEP_PM_OPS(dwc3_suspend, dwc3_resume)
>          .complete = dwc3_complete,
> +
> +       /*
> +        * Runtime suspend halts the controller on disconnection. It 
> replies on
> +        * platforms with custom connection notification to start the 
> controller
> +        * again.
> +        */
>          SET_RUNTIME_PM_OPS(dwc3_runtime_suspend, dwc3_runtime_resume,
>                          dwc3_runtime_idle)
>   };
> diff --git a/drivers/usb/dwc3/core.h b/drivers/usb/dwc3/core.h
> index c71240e8f7c7..9c508e0c5cdf 100644
> --- a/drivers/usb/dwc3/core.h
> +++ b/drivers/usb/dwc3/core.h
> @@ -1675,7 +1675,6 @@ static inline void dwc3_otg_host_init(struct dwc3 
> *dwc)
>   #if !IS_ENABLED(CONFIG_USB_DWC3_HOST)
>   int dwc3_gadget_suspend(struct dwc3 *dwc);
>   int dwc3_gadget_resume(struct dwc3 *dwc);
> -void dwc3_gadget_process_pending_events(struct dwc3 *dwc);
>   #else
>   static inline int dwc3_gadget_suspend(struct dwc3 *dwc)
>   {
> @@ -1687,9 +1686,6 @@ static inline int dwc3_gadget_resume(struct dwc3 *dwc)
>          return 0;
>   }
> 
> -static inline void dwc3_gadget_process_pending_events(struct dwc3 *dwc)
> -{
> -}
>   #endif /* !IS_ENABLED(CONFIG_USB_DWC3_HOST) */
> 
>   #if IS_ENABLED(CONFIG_USB_DWC3_ULPI)
> diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
> index 291bc549935b..a32c3a292353 100644
> --- a/drivers/usb/dwc3/gadget.c
> +++ b/drivers/usb/dwc3/gadget.c
> @@ -4483,6 +4483,17 @@ static irqreturn_t dwc3_check_event_buf(struct 
> dwc3_event_buffer *evt)
>                  return IRQ_HANDLED;
> 
>          count = dwc3_readl(dwc->regs, DWC3_GEVNTCOUNT(0));

If we properly cleanup event count in dwc3_event_buffers_cleanup() as
noted above, then you don't need this condition below. You can remove
the below check:

> +
> +       /*
> +        * If the controller is halted, the event count is 
> stale/invalid. Ignore
> +        * them. This happens if the interrupt assertion is from an 
> out-of-band
> +        * resume notification.
> +        */
> +       if (!dwc->pullups_connected && count) {
> +               dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), count);
> +               return IRQ_HANDLED;
> +       }
> +
>          count &= DWC3_GEVNTCOUNT_MASK;
>          if (!count)
>                  return IRQ_NONE;
> @@ -4728,14 +4739,3 @@ int dwc3_gadget_resume(struct dwc3 *dwc)
> 
>          return dwc3_gadget_soft_connect(dwc);
>   }
> -
> -void dwc3_gadget_process_pending_events(struct dwc3 *dwc)
> -{
> -       if (dwc->pending_events) {
> -               dwc3_interrupt(dwc->irq_gadget, dwc->ev_buf);
> -               dwc3_thread_interrupt(dwc->irq_gadget, dwc->ev_buf);
> -               pm_runtime_put(dwc->dev);
> -               dwc->pending_events = false;
> -               enable_irq(dwc->irq_gadget);
> -       }
> -}

The rest looks fine.

Thanks,
Thinh
Thinh Nguyen Sept. 16, 2024, 9:19 p.m. UTC | #31
On Mon, Sep 16, 2024, Selvarasu Ganesan wrote:
> 
> On 9/13/2024 11:30 PM, Thinh Nguyen wrote:
> > Also, perhaps you may want to revisit and review the change below to see
> > if the glue driver may be the culprit:
> >
> > 14e497183df2 ("usb: dwc3: core: Prevent USB core invalid event buffer address access")
> 
> Hi Thinh,
> 
> We reconfirmed that this issue not due to our glue driver.
> 

Thanks for checking.

BR,
Thinh
Selvarasu Ganesan Sept. 16, 2024, 10:54 p.m. UTC | #32
On 9/17/2024 2:48 AM, Thinh Nguyen wrote:
> On Mon, Sep 16, 2024, Selvarasu Ganesan wrote:
>> On 9/13/2024 11:21 PM, Thinh Nguyen wrote:
>>> Hi,
>>>
>>> On Fri, Sep 13, 2024, Selvarasu Ganesan wrote:
>>>> Hi Thinh,
>>>>
>>>> So far, there have been no reported error instances. But, we suspecting
>>>> that the issue may be related to our glue driver. In our glue driver, we
>>>> access the reference of evt->flags when starting or stopping the gadget
>>>> based on a VBUS notification. We apologize for sharing this information
>>>> so late, as we only became aware of it recently.
>>>>
>>>> The following sequence outlines the possible scenarios of race conditions:
>>>>
>>>> Thread#1 (Our glue Driver Sequence)
>>>> ===================================
>>>> ->USB VBUS notification
>>>> ->Start/Stop gadget
>>>> ->dwc->ev_buf->flags |= BIT(20); (It's for our reference)
>>>> ->Call dwc3 exynos runtime suspend/resume
>>>> ->dwc->ev_buf->flags &= ~BIT(20);
>>>> ->Call dwc3 core runtime suspend/resume
>>>>
>>>> Thread#2
>>>> ========
>>>> ->dwc3_interrupt()
>>>> ->evt->flags |= DWC3_EVENT_PENDING;
>>>> ->dwc3_thread_interrupt()
>>>> ->evt->flags &= ~DWC3_EVENT_PENDING;
>>>>
>>> This is great! That's likely the problem. Glad you found it.
>>>
>>>> After our internal discussions, we have decided to remove the
>>>> unnecessary access to evt->flag in our glue driver. We have made these
>>>> changes and initiated testing.
>>>>
>>>> Thank you for your help so far to understand more into our glue driver code.
>>>>
>>>> And We are thinking that it would be fine to reset evt->flag when the
>>>> USB controller is started, along with the changes you suggested earlier.
>>>> This additional measure will help prevent similar issues from occurring
>>>> in the future.
>>>>
>>>> Please let us know your thoughts on this proposal. If it is not
>>>> necessary, we understand and will proceed accordingly.
>>>>
>>> You can submit the change I suggested. That's a valid change. However,
>>> we should not include the reset of the DWC3_EVENT_PENDING flag. Had we
>>> done this, you may not found the issue above. It serves no purpose for
>>> the core driver logic and will be an extra burden for us to maintain. (I
>>> don't want to scratch my head in the future to figure out why that
>>> change was needed or concern whether it can be removed without causing
>>> regression).
>> Yeah I understand.
>>
>> Please reconfirm the below changes once with commit message. I will post
>> new version if this changes are fine.
>>
>>
>> [PATCH] usb: dwc3: core: Stop processing of pending events if controller
>> is halted
>>
>> This commit addresses an issue where events were being processed when
>> the controller was in a halted state. To fix this issue by stop
>> processing the events as the event count was considered stale or
>> invalid when the controller was halted.
>>
>> Fixes: fc8bb91bc83e ("usb: dwc3: implement runtime PM")
>> Cc: stable <stable@kernel.org>
>> Signed-off-by: Selvarasu Ganesan <selvarasu.g@samsung.com>
>> Suggested-by: Thinh Nguyen <Thinh.Nguyen@synopsys.com>
>> ---
>>    drivers/usb/dwc3/core.c   | 17 +++++++++++++++--
>>    drivers/usb/dwc3/core.h   |  4 ----
>>    drivers/usb/dwc3/gadget.c | 22 +++++++++++-----------
>>    3 files changed, 26 insertions(+), 17 deletions(-)
>>
>> diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
>> index 9eb085f359ce..f47b20bc2d1f 100644
>> --- a/drivers/usb/dwc3/core.c
>> +++ b/drivers/usb/dwc3/core.c
>> @@ -544,6 +544,7 @@ static int dwc3_alloc_event_buffers(struct dwc3
>> *dwc, unsigned int length)
>>    int dwc3_event_buffers_setup(struct dwc3 *dwc)
>>    {
>>           struct dwc3_event_buffer        *evt;
>> +       u32                             reg;
>>
>>           if (!dwc->ev_buf)
>>                   return 0;
>> @@ -556,8 +557,10 @@ int dwc3_event_buffers_setup(struct dwc3 *dwc)
>>                           upper_32_bits(evt->dma));
>>           dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>>                           DWC3_GEVNTSIZ_SIZE(evt->length));
>> -       dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 0);
>>
>> +       /* Clear any stale event */
> Do the same thing here as in dwc3_event_buffers_cleanup().
done in new version.
>
>> +       reg = dwc3_readl(dwc->regs, DWC3_GEVNTCOUNT(0));
>> +       dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), reg);
>>           return 0;
>>    }
>>
>> @@ -2499,7 +2502,11 @@ static int dwc3_runtime_resume(struct device *dev)
>>
>>           switch (dwc->current_dr_role) {
>>           case DWC3_GCTL_PRTCAP_DEVICE:
>> -               dwc3_gadget_process_pending_events(dwc);
>> +               if (dwc->pending_events) {
>> +                       pm_runtime_put(dwc->dev);
>> +                       dwc->pending_events = false;
>> +                       enable_irq(dwc->irq_gadget);
>> +               }
>>                   break;
>>           case DWC3_GCTL_PRTCAP_HOST:
>>           default:
>> @@ -2589,6 +2596,12 @@ static void dwc3_complete(struct device *dev)
>>    static const struct dev_pm_ops dwc3_dev_pm_ops = {
>>           SET_SYSTEM_SLEEP_PM_OPS(dwc3_suspend, dwc3_resume)
>>           .complete = dwc3_complete,
>> +
>> +       /*
>> +        * Runtime suspend halts the controller on disconnection. It
>> replies on
>> +        * platforms with custom connection notification to start the
>> controller
>> +        * again.
>> +        */
>>           SET_RUNTIME_PM_OPS(dwc3_runtime_suspend, dwc3_runtime_resume,
>>                           dwc3_runtime_idle)
>>    };
>> diff --git a/drivers/usb/dwc3/core.h b/drivers/usb/dwc3/core.h
>> index c71240e8f7c7..9c508e0c5cdf 100644
>> --- a/drivers/usb/dwc3/core.h
>> +++ b/drivers/usb/dwc3/core.h
>> @@ -1675,7 +1675,6 @@ static inline void dwc3_otg_host_init(struct dwc3
>> *dwc)
>>    #if !IS_ENABLED(CONFIG_USB_DWC3_HOST)
>>    int dwc3_gadget_suspend(struct dwc3 *dwc);
>>    int dwc3_gadget_resume(struct dwc3 *dwc);
>> -void dwc3_gadget_process_pending_events(struct dwc3 *dwc);
>>    #else
>>    static inline int dwc3_gadget_suspend(struct dwc3 *dwc)
>>    {
>> @@ -1687,9 +1686,6 @@ static inline int dwc3_gadget_resume(struct dwc3 *dwc)
>>           return 0;
>>    }
>>
>> -static inline void dwc3_gadget_process_pending_events(struct dwc3 *dwc)
>> -{
>> -}
>>    #endif /* !IS_ENABLED(CONFIG_USB_DWC3_HOST) */
>>
>>    #if IS_ENABLED(CONFIG_USB_DWC3_ULPI)
>> diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
>> index 291bc549935b..a32c3a292353 100644
>> --- a/drivers/usb/dwc3/gadget.c
>> +++ b/drivers/usb/dwc3/gadget.c
>> @@ -4483,6 +4483,17 @@ static irqreturn_t dwc3_check_event_buf(struct
>> dwc3_event_buffer *evt)
>>                   return IRQ_HANDLED;
>>
>>           count = dwc3_readl(dwc->regs, DWC3_GEVNTCOUNT(0));
> If we properly cleanup event count in dwc3_event_buffers_cleanup() as
> noted above, then you don't need this condition below. You can remove
> the below check:
done in new version.
>
>> +
>> +       /*
>> +        * If the controller is halted, the event count is
>> stale/invalid. Ignore
>> +        * them. This happens if the interrupt assertion is from an
>> out-of-band
>> +        * resume notification.
>> +        */
>> +       if (!dwc->pullups_connected && count) {
>> +               dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), count);
>> +               return IRQ_HANDLED;
>> +       }
>> +
>>           count &= DWC3_GEVNTCOUNT_MASK;
>>           if (!count)
>>                   return IRQ_NONE;
>> @@ -4728,14 +4739,3 @@ int dwc3_gadget_resume(struct dwc3 *dwc)
>>
>>           return dwc3_gadget_soft_connect(dwc);
>>    }
>> -
>> -void dwc3_gadget_process_pending_events(struct dwc3 *dwc)
>> -{
>> -       if (dwc->pending_events) {
>> -               dwc3_interrupt(dwc->irq_gadget, dwc->ev_buf);
>> -               dwc3_thread_interrupt(dwc->irq_gadget, dwc->ev_buf);
>> -               pm_runtime_put(dwc->dev);
>> -               dwc->pending_events = false;
>> -               enable_irq(dwc->irq_gadget);
>> -       }
>> -}

Hi Thinh,

Thanks for your suggestions. I posted a updated version in the below 
link. SO, kindly review the same.

https://lore.kernel.org/lkml/20240916224543.187-1-selvarasu.g@samsung.com/

Thanks,
Selva
> The rest looks fine.
>
> Thanks,
> Thinh
diff mbox series

Patch

diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
index cb82557678dd..610792a70805 100644
--- a/drivers/usb/dwc3/core.c
+++ b/drivers/usb/dwc3/core.c
@@ -549,8 +549,12 @@  int dwc3_event_buffers_setup(struct dwc3 *dwc)
 			lower_32_bits(evt->dma));
 	dwc3_writel(dwc->regs, DWC3_GEVNTADRHI(0),
 			upper_32_bits(evt->dma));
-	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
-			DWC3_GEVNTSIZ_SIZE(evt->length));
+
+	/* Skip enable dwc3 event interrupt if event is processing in middle */
+	if (!(evt->flags & DWC3_EVENT_PENDING))
+		dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
+				DWC3_GEVNTSIZ_SIZE(evt->length));
+
 	dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 0);
 
 	return 0;