diff mbox series

[RFC] usb: dwc3: gadget: Fix amount of data copied from event buf to cache

Message ID 20230521100330.22478-1-quic_kriskura@quicinc.com (mailing list archive)
State New, archived
Headers show
Series [RFC] usb: dwc3: gadget: Fix amount of data copied from event buf to cache | expand

Commit Message

Krishna Kurapati May 21, 2023, 10:03 a.m. UTC
In the current implementation, the check_event_buf call reads the
GEVNTCOUNT register to determine the amount of event data generated
and copies it from ev->buf to ev->cache after masking interrupt.

During copy if the amount of data to be copied is more than
(length - lpos), we fill the ev->cache till the end of 4096 byte
buffer allocated and then start filling from the top (lpos = 0).

In one instance of SMMU crash it is observed that GEVNTCOUNT register
reads more than 4096 bytes:

dwc3_readl   base=0xffffffc0091dc000  offset=50188  value=63488

(offset = 50188 -> 0xC40C)  -> reads 63488 bytes

As per crash dump:
dwc->lpos = 2056

and evt->buf is at 0xFFFFFFC009185000 and the crash is at
0xFFFFFFC009186000. The diff which is exactly 0x1000 bytes.

We first memcpy 2040 bytes from (buf + lpos) to (buf + 0x1000).

And then we copy the rest of the data (64388 - 2040) from beginning
of dwc->ev_buf. While doing so we go beyond bounds as we are trying
to memcpy 62348 bytes into a 4K buffer resulting in crash.

Fix this by limiting the total data being copied to ev->length to
avoid copying data beyond bounds. Moreover this is logical because if
the controller generated events more than the size of ring buffer,
some of them might have been overwritten by the controller.

Signed-off-by: Krishna Kurapati <quic_kriskura@quicinc.com>
---
Only one instance of this crash was observed so far. As per the
databook:

"The controller always leaves one entry free in each Event Buffer.
When the Event Buffer is almost full, hardware writes the Event
Buffer Overflow event and the USB eventually gets stalled when
endpoints start responding NRDY or the link layer stops returning
credits (in SuperSpeed). This event is an indication to software that
it is not processing events quickly enough. During this time, events
are queued up internally. When software frees up Event Buffer space,
the queued up events are written out and the USB returns to normal
operation"

I didn't see any overflow event in the event buffer after parsing
crash dump. Although this could be some HW issue, I thought we can
include this fix in software as well to avoid such scenario.

 drivers/usb/dwc3/gadget.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Thinh Nguyen May 22, 2023, 9:17 p.m. UTC | #1
On Sun, May 21, 2023, Krishna Kurapati wrote:
> In the current implementation, the check_event_buf call reads the
> GEVNTCOUNT register to determine the amount of event data generated
> and copies it from ev->buf to ev->cache after masking interrupt.
> 
> During copy if the amount of data to be copied is more than
> (length - lpos), we fill the ev->cache till the end of 4096 byte
> buffer allocated and then start filling from the top (lpos = 0).
> 
> In one instance of SMMU crash it is observed that GEVNTCOUNT register
> reads more than 4096 bytes:
> 
> dwc3_readl   base=0xffffffc0091dc000  offset=50188  value=63488
> 
> (offset = 50188 -> 0xC40C)  -> reads 63488 bytes
> 
> As per crash dump:
> dwc->lpos = 2056
> 
> and evt->buf is at 0xFFFFFFC009185000 and the crash is at
> 0xFFFFFFC009186000. The diff which is exactly 0x1000 bytes.
> 
> We first memcpy 2040 bytes from (buf + lpos) to (buf + 0x1000).
> 
> And then we copy the rest of the data (64388 - 2040) from beginning
> of dwc->ev_buf. While doing so we go beyond bounds as we are trying
> to memcpy 62348 bytes into a 4K buffer resulting in crash.
> 
> Fix this by limiting the total data being copied to ev->length to
> avoid copying data beyond bounds. Moreover this is logical because if
> the controller generated events more than the size of ring buffer,
> some of them might have been overwritten by the controller.
> 
> Signed-off-by: Krishna Kurapati <quic_kriskura@quicinc.com>
> ---
> Only one instance of this crash was observed so far. As per the
> databook:
> 
> "The controller always leaves one entry free in each Event Buffer.
> When the Event Buffer is almost full, hardware writes the Event
> Buffer Overflow event and the USB eventually gets stalled when
> endpoints start responding NRDY or the link layer stops returning
> credits (in SuperSpeed). This event is an indication to software that
> it is not processing events quickly enough. During this time, events
> are queued up internally. When software frees up Event Buffer space,
> the queued up events are written out and the USB returns to normal
> operation"
> 
> I didn't see any overflow event in the event buffer after parsing
> crash dump. Although this could be some HW issue, I thought we can
> include this fix in software as well to avoid such scenario.
> 

What's the GEVNTSIZ at the point of the crash? That's where the driver
tells the controller how much it allocated for the event buffer.

Check to make sure that it wasn't reset during operation (not cleanup).

BR,
Thinh
Krishna Kurapati May 23, 2023, 1:25 a.m. UTC | #2
On 5/23/2023 2:47 AM, Thinh Nguyen wrote:
> On Sun, May 21, 2023, Krishna Kurapati wrote:
>> In the current implementation, the check_event_buf call reads the
>> GEVNTCOUNT register to determine the amount of event data generated
>> and copies it from ev->buf to ev->cache after masking interrupt.
>>
>> During copy if the amount of data to be copied is more than
>> (length - lpos), we fill the ev->cache till the end of 4096 byte
>> buffer allocated and then start filling from the top (lpos = 0).
>>
>> In one instance of SMMU crash it is observed that GEVNTCOUNT register
>> reads more than 4096 bytes:
>>
>> dwc3_readl   base=0xffffffc0091dc000  offset=50188  value=63488
>>
>> (offset = 50188 -> 0xC40C)  -> reads 63488 bytes
>>
>> As per crash dump:
>> dwc->lpos = 2056
>>
>> and evt->buf is at 0xFFFFFFC009185000 and the crash is at
>> 0xFFFFFFC009186000. The diff which is exactly 0x1000 bytes.
>>
>> We first memcpy 2040 bytes from (buf + lpos) to (buf + 0x1000).
>>
>> And then we copy the rest of the data (64388 - 2040) from beginning
>> of dwc->ev_buf. While doing so we go beyond bounds as we are trying
>> to memcpy 62348 bytes into a 4K buffer resulting in crash.
>>
>> Fix this by limiting the total data being copied to ev->length to
>> avoid copying data beyond bounds. Moreover this is logical because if
>> the controller generated events more than the size of ring buffer,
>> some of them might have been overwritten by the controller.
>>
>> Signed-off-by: Krishna Kurapati <quic_kriskura@quicinc.com>
>> ---
>> Only one instance of this crash was observed so far. As per the
>> databook:
>>
>> "The controller always leaves one entry free in each Event Buffer.
>> When the Event Buffer is almost full, hardware writes the Event
>> Buffer Overflow event and the USB eventually gets stalled when
>> endpoints start responding NRDY or the link layer stops returning
>> credits (in SuperSpeed). This event is an indication to software that
>> it is not processing events quickly enough. During this time, events
>> are queued up internally. When software frees up Event Buffer space,
>> the queued up events are written out and the USB returns to normal
>> operation"
>>
>> I didn't see any overflow event in the event buffer after parsing
>> crash dump. Although this could be some HW issue, I thought we can
>> include this fix in software as well to avoid such scenario.
>>
> 
> What's the GEVNTSIZ at the point of the crash? That's where the driver
> tells the controller how much it allocated for the event buffer.
> 
> Check to make sure that it wasn't reset during operation (not cleanup).


Hi Thinh,

  The last 3 RW traces were as follows:

<idle>-0    [001]   5834.471640:  dwc3_writel   base=0xffffffc0091dc000  
offset=50184  value=4096
<idle>-0    [001]   5834.471799:  dwc3_readl   base=0xffffffc0091dc000  
offset=50188  value=63488
<idle>-0    [001]   5834.471803:  dwc3_writel   base=0xffffffc0091dc000  
offset=50184  value=2147487744

The first one was at the end of previous process_event_entry call where 
we unmasked the interrupt.

The second one was the read of GEVTCOUNT.

The third one was where we wrote 31st bit of the GEVTSIZ register to 
mask interrupt along with 4096 bytes to [15:0] bits.

There is only 100-150us gap between each of these read writes and if you 
are referring to whether the GEVTSIZ got modified in between, I am not 
sure of that.

Regards,
Krishna,
Thinh Nguyen May 23, 2023, 9:22 p.m. UTC | #3
On Tue, May 23, 2023, Krishna Kurapati PSSNV wrote:
> 
> 
> On 5/23/2023 2:47 AM, Thinh Nguyen wrote:
> > On Sun, May 21, 2023, Krishna Kurapati wrote:
> > > In the current implementation, the check_event_buf call reads the
> > > GEVNTCOUNT register to determine the amount of event data generated
> > > and copies it from ev->buf to ev->cache after masking interrupt.
> > > 
> > > During copy if the amount of data to be copied is more than
> > > (length - lpos), we fill the ev->cache till the end of 4096 byte
> > > buffer allocated and then start filling from the top (lpos = 0).
> > > 
> > > In one instance of SMMU crash it is observed that GEVNTCOUNT register
> > > reads more than 4096 bytes:
> > > 
> > > dwc3_readl   base=0xffffffc0091dc000  offset=50188  value=63488
> > > 
> > > (offset = 50188 -> 0xC40C)  -> reads 63488 bytes
> > > 
> > > As per crash dump:
> > > dwc->lpos = 2056
> > > 
> > > and evt->buf is at 0xFFFFFFC009185000 and the crash is at
> > > 0xFFFFFFC009186000. The diff which is exactly 0x1000 bytes.
> > > 
> > > We first memcpy 2040 bytes from (buf + lpos) to (buf + 0x1000).
> > > 
> > > And then we copy the rest of the data (64388 - 2040) from beginning
> > > of dwc->ev_buf. While doing so we go beyond bounds as we are trying
> > > to memcpy 62348 bytes into a 4K buffer resulting in crash.
> > > 
> > > Fix this by limiting the total data being copied to ev->length to
> > > avoid copying data beyond bounds. Moreover this is logical because if
> > > the controller generated events more than the size of ring buffer,
> > > some of them might have been overwritten by the controller.
> > > 
> > > Signed-off-by: Krishna Kurapati <quic_kriskura@quicinc.com>
> > > ---
> > > Only one instance of this crash was observed so far. As per the
> > > databook:
> > > 
> > > "The controller always leaves one entry free in each Event Buffer.
> > > When the Event Buffer is almost full, hardware writes the Event
> > > Buffer Overflow event and the USB eventually gets stalled when
> > > endpoints start responding NRDY or the link layer stops returning
> > > credits (in SuperSpeed). This event is an indication to software that
> > > it is not processing events quickly enough. During this time, events
> > > are queued up internally. When software frees up Event Buffer space,
> > > the queued up events are written out and the USB returns to normal
> > > operation"
> > > 
> > > I didn't see any overflow event in the event buffer after parsing
> > > crash dump. Although this could be some HW issue, I thought we can
> > > include this fix in software as well to avoid such scenario.
> > > 
> > 
> > What's the GEVNTSIZ at the point of the crash? That's where the driver
> > tells the controller how much it allocated for the event buffer.
> > 
> > Check to make sure that it wasn't reset during operation (not cleanup).
> 
> 
> Hi Thinh,
> 
>  The last 3 RW traces were as follows:
> 
> <idle>-0    [001]   5834.471640:  dwc3_writel   base=0xffffffc0091dc000 
> offset=50184  value=4096
> <idle>-0    [001]   5834.471799:  dwc3_readl   base=0xffffffc0091dc000 
> offset=50188  value=63488
> <idle>-0    [001]   5834.471803:  dwc3_writel   base=0xffffffc0091dc000 
> offset=50184  value=2147487744
> 
> The first one was at the end of previous process_event_entry call where we
> unmasked the interrupt.
> 
> The second one was the read of GEVTCOUNT.
> 
> The third one was where we wrote 31st bit of the GEVTSIZ register to mask
> interrupt along with 4096 bytes to [15:0] bits.
> 
> There is only 100-150us gap between each of these read writes and if you are
> referring to whether the GEVTSIZ got modified in between, I am not sure of
> that.
> 

The read from GEVNTCOUNT looks corrupted. When was the last time (delta
time) an event was handled prior to the failure? If it's a short delta,
it further indicates that whatever the value returned is invalid.

So, setting count = evt->length if count > evt->length is no good. The
event count is probably much lower and you may end up reading stale
events.

It's odd that we are only able to see this issue now. For this type of
failure, it should've been exposed much early on. Is there potentially
something different for your setup/configuration that may cause this
issue. Also, how easily are you able to reproduce this. Did you try to
re-read the GEVNTCOUNT multiple times when this happens to see if the
value changes dramatically?

Thanks,
Thinh
Krishna Kurapati May 24, 2023, 3:25 a.m. UTC | #4
On 5/24/2023 2:52 AM, Thinh Nguyen wrote:
> On Tue, May 23, 2023, Krishna Kurapati PSSNV wrote:
>>
>>
>> On 5/23/2023 2:47 AM, Thinh Nguyen wrote:
>>> On Sun, May 21, 2023, Krishna Kurapati wrote:
>>>> In the current implementation, the check_event_buf call reads the
>>>> GEVNTCOUNT register to determine the amount of event data generated
>>>> and copies it from ev->buf to ev->cache after masking interrupt.
>>>>
>>>> During copy if the amount of data to be copied is more than
>>>> (length - lpos), we fill the ev->cache till the end of 4096 byte
>>>> buffer allocated and then start filling from the top (lpos = 0).
>>>>
>>>> In one instance of SMMU crash it is observed that GEVNTCOUNT register
>>>> reads more than 4096 bytes:
>>>>
>>>> dwc3_readl   base=0xffffffc0091dc000  offset=50188  value=63488
>>>>
>>>> (offset = 50188 -> 0xC40C)  -> reads 63488 bytes
>>>>
>>>> As per crash dump:
>>>> dwc->lpos = 2056
>>>>
>>>> and evt->buf is at 0xFFFFFFC009185000 and the crash is at
>>>> 0xFFFFFFC009186000. The diff which is exactly 0x1000 bytes.
>>>>
>>>> We first memcpy 2040 bytes from (buf + lpos) to (buf + 0x1000).
>>>>
>>>> And then we copy the rest of the data (64388 - 2040) from beginning
>>>> of dwc->ev_buf. While doing so we go beyond bounds as we are trying
>>>> to memcpy 62348 bytes into a 4K buffer resulting in crash.
>>>>
>>>> Fix this by limiting the total data being copied to ev->length to
>>>> avoid copying data beyond bounds. Moreover this is logical because if
>>>> the controller generated events more than the size of ring buffer,
>>>> some of them might have been overwritten by the controller.
>>>>
>>>> Signed-off-by: Krishna Kurapati <quic_kriskura@quicinc.com>
>>>> ---
>>>> Only one instance of this crash was observed so far. As per the
>>>> databook:
>>>>
>>>> "The controller always leaves one entry free in each Event Buffer.
>>>> When the Event Buffer is almost full, hardware writes the Event
>>>> Buffer Overflow event and the USB eventually gets stalled when
>>>> endpoints start responding NRDY or the link layer stops returning
>>>> credits (in SuperSpeed). This event is an indication to software that
>>>> it is not processing events quickly enough. During this time, events
>>>> are queued up internally. When software frees up Event Buffer space,
>>>> the queued up events are written out and the USB returns to normal
>>>> operation"
>>>>
>>>> I didn't see any overflow event in the event buffer after parsing
>>>> crash dump. Although this could be some HW issue, I thought we can
>>>> include this fix in software as well to avoid such scenario.
>>>>
>>>
>>> What's the GEVNTSIZ at the point of the crash? That's where the driver
>>> tells the controller how much it allocated for the event buffer.
>>>
>>> Check to make sure that it wasn't reset during operation (not cleanup).
>>
>>
>> Hi Thinh,
>>
>>   The last 3 RW traces were as follows:
>>
>> <idle>-0    [001]   5834.471640:  dwc3_writel   base=0xffffffc0091dc000
>> offset=50184  value=4096
>> <idle>-0    [001]   5834.471799:  dwc3_readl   base=0xffffffc0091dc000
>> offset=50188  value=63488
>> <idle>-0    [001]   5834.471803:  dwc3_writel   base=0xffffffc0091dc000
>> offset=50184  value=2147487744
>>
>> The first one was at the end of previous process_event_entry call where we
>> unmasked the interrupt.
>>
>> The second one was the read of GEVTCOUNT.
>>
>> The third one was where we wrote 31st bit of the GEVTSIZ register to mask
>> interrupt along with 4096 bytes to [15:0] bits.
>>
>> There is only 100-150us gap between each of these read writes and if you are
>> referring to whether the GEVTSIZ got modified in between, I am not sure of
>> that.
>>
> 
> The read from GEVNTCOUNT looks corrupted. When was the last time (delta
> time) an event was handled prior to the failure? If it's a short delta,
> it further indicates that whatever the value returned is invalid.
> 
> So, setting count = evt->length if count > evt->length is no good. The
> event count is probably much lower and you may end up reading stale
> events.
> 
> It's odd that we are only able to see this issue now. For this type of
> failure, it should've been exposed much early on. Is there potentially
> something different for your setup/configuration that may cause this
> issue. Also, how easily are you able to reproduce this. Did you try to
> re-read the GEVNTCOUNT multiple times when this happens to see if the
> value changes dramatically?
> 

The previous event processed was just before this interrupt came in 
which too was having a huge GEVTCOUNT but not big to crash the device 
(in memcpy). And very soon, after it was processed, this current 
interrupt came in.

(That is why in the comment section of the patch, I mentioned that this 
could be a hw issue too that came up randomly. But wanted to check if 
this case is something that can be considered for fixing.)

And there has been only one instance reported so far on a target which 
has very less computational capability and has a hug sw stack (apart 
from linux) running on top of it.

Regards,
Krishna,
Thinh Nguyen May 25, 2023, 12:18 a.m. UTC | #5
On Wed, May 24, 2023, Krishna Kurapati PSSNV wrote:
> 
> 
> On 5/24/2023 2:52 AM, Thinh Nguyen wrote:
> > On Tue, May 23, 2023, Krishna Kurapati PSSNV wrote:
> > > 
> > > 
> > > On 5/23/2023 2:47 AM, Thinh Nguyen wrote:
> > > > On Sun, May 21, 2023, Krishna Kurapati wrote:
> > > > > In the current implementation, the check_event_buf call reads the
> > > > > GEVNTCOUNT register to determine the amount of event data generated
> > > > > and copies it from ev->buf to ev->cache after masking interrupt.
> > > > > 
> > > > > During copy if the amount of data to be copied is more than
> > > > > (length - lpos), we fill the ev->cache till the end of 4096 byte
> > > > > buffer allocated and then start filling from the top (lpos = 0).
> > > > > 
> > > > > In one instance of SMMU crash it is observed that GEVNTCOUNT register
> > > > > reads more than 4096 bytes:
> > > > > 
> > > > > dwc3_readl   base=0xffffffc0091dc000  offset=50188  value=63488
> > > > > 
> > > > > (offset = 50188 -> 0xC40C)  -> reads 63488 bytes
> > > > > 
> > > > > As per crash dump:
> > > > > dwc->lpos = 2056
> > > > > 
> > > > > and evt->buf is at 0xFFFFFFC009185000 and the crash is at
> > > > > 0xFFFFFFC009186000. The diff which is exactly 0x1000 bytes.
> > > > > 
> > > > > We first memcpy 2040 bytes from (buf + lpos) to (buf + 0x1000).
> > > > > 
> > > > > And then we copy the rest of the data (64388 - 2040) from beginning
> > > > > of dwc->ev_buf. While doing so we go beyond bounds as we are trying
> > > > > to memcpy 62348 bytes into a 4K buffer resulting in crash.
> > > > > 
> > > > > Fix this by limiting the total data being copied to ev->length to
> > > > > avoid copying data beyond bounds. Moreover this is logical because if
> > > > > the controller generated events more than the size of ring buffer,
> > > > > some of them might have been overwritten by the controller.
> > > > > 
> > > > > Signed-off-by: Krishna Kurapati <quic_kriskura@quicinc.com>
> > > > > ---
> > > > > Only one instance of this crash was observed so far. As per the
> > > > > databook:
> > > > > 
> > > > > "The controller always leaves one entry free in each Event Buffer.
> > > > > When the Event Buffer is almost full, hardware writes the Event
> > > > > Buffer Overflow event and the USB eventually gets stalled when
> > > > > endpoints start responding NRDY or the link layer stops returning
> > > > > credits (in SuperSpeed). This event is an indication to software that
> > > > > it is not processing events quickly enough. During this time, events
> > > > > are queued up internally. When software frees up Event Buffer space,
> > > > > the queued up events are written out and the USB returns to normal
> > > > > operation"
> > > > > 
> > > > > I didn't see any overflow event in the event buffer after parsing
> > > > > crash dump. Although this could be some HW issue, I thought we can
> > > > > include this fix in software as well to avoid such scenario.
> > > > > 
> > > > 
> > > > What's the GEVNTSIZ at the point of the crash? That's where the driver
> > > > tells the controller how much it allocated for the event buffer.
> > > > 
> > > > Check to make sure that it wasn't reset during operation (not cleanup).
> > > 
> > > 
> > > Hi Thinh,
> > > 
> > >   The last 3 RW traces were as follows:
> > > 
> > > <idle>-0    [001]   5834.471640:  dwc3_writel   base=0xffffffc0091dc000
> > > offset=50184  value=4096
> > > <idle>-0    [001]   5834.471799:  dwc3_readl   base=0xffffffc0091dc000
> > > offset=50188  value=63488
> > > <idle>-0    [001]   5834.471803:  dwc3_writel   base=0xffffffc0091dc000
> > > offset=50184  value=2147487744
> > > 
> > > The first one was at the end of previous process_event_entry call where we
> > > unmasked the interrupt.
> > > 
> > > The second one was the read of GEVTCOUNT.
> > > 
> > > The third one was where we wrote 31st bit of the GEVTSIZ register to mask
> > > interrupt along with 4096 bytes to [15:0] bits.
> > > 
> > > There is only 100-150us gap between each of these read writes and if you are
> > > referring to whether the GEVTSIZ got modified in between, I am not sure of
> > > that.
> > > 
> > 
> > The read from GEVNTCOUNT looks corrupted. When was the last time (delta
> > time) an event was handled prior to the failure? If it's a short delta,
> > it further indicates that whatever the value returned is invalid.
> > 
> > So, setting count = evt->length if count > evt->length is no good. The
> > event count is probably much lower and you may end up reading stale
> > events.
> > 
> > It's odd that we are only able to see this issue now. For this type of
> > failure, it should've been exposed much early on. Is there potentially
> > something different for your setup/configuration that may cause this
> > issue. Also, how easily are you able to reproduce this. Did you try to
> > re-read the GEVNTCOUNT multiple times when this happens to see if the
> > value changes dramatically?
> > 
> 
> The previous event processed was just before this interrupt came in which
> too was having a huge GEVTCOUNT but not big to crash the device (in memcpy).
> And very soon, after it was processed, this current interrupt came in.
> 
> (That is why in the comment section of the patch, I mentioned that this
> could be a hw issue too that came up randomly. But wanted to check if this
> case is something that can be considered for fixing.)
> 
> And there has been only one instance reported so far on a target which has
> very less computational capability and has a hug sw stack (apart from linux)
> running on top of it.
> 

This seems to be a hw issue. We can add a check if the event count is
beyond evt->length, then ignore the interrupt and print an error to
prevent the invalid memory access. We're not sure the events are valid
at this point.

As for the recovery from this issue, we need root cause the problem to
provide an appropriate fix.

Thanks,
Thinh
diff mbox series

Patch

diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
index 5965796bc5d5..cff043ab9017 100644
--- a/drivers/usb/dwc3/gadget.c
+++ b/drivers/usb/dwc3/gadget.c
@@ -4457,6 +4457,9 @@  static irqreturn_t dwc3_check_event_buf(struct dwc3_event_buffer *evt)
 	if (!count)
 		return IRQ_NONE;
 
+	if (count > evt->length)
+		count  = evt->length;
+
 	evt->count = count;
 	evt->flags |= DWC3_EVENT_PENDING;