diff mbox

[v2] serial: 8250_dw: Avoid "too much work" from bogus rx timeout interrupt

Message ID 20170206233000.3021-1-dianders@chromium.org (mailing list archive)
State New, archived
Headers show

Commit Message

Doug Anderson Feb. 6, 2017, 11:30 p.m. UTC
On a Rockchip rk3399-based board during suspend/resume testing, we
found that we could get the console UART into a state where it would
print this to the console a lot:
  serial8250: too much work for irq42

Followed eventually by:
  NMI watchdog: BUG: soft lockup - CPU#0 stuck for 11s!

Upon debugging I found that we're in this state:
  iir = 0x000000cc
  lsr = 0x00000060

It appears that somehow we have a RX Timeout interrupt but there is no
actual data present to receive.  When we're in this state the UART
driver claims that it handled the interrupt but it actually doesn't
really do anything.  This means that we keep getting the interrupt
over and over again.

Normally we don't actually need to do anything special to handle a RX
Timeout interrupt.  We'll notice that there is some data ready and
we'll read it, which will end up clearing the RX Timeout.  In this
case we have a problem specifically because we got the RX TImeout
without any data.  Reading a bogus byte is confirmed to get us out of
this state.

It's unclear how exactly the UART got into this state, but it is known
that the UART lines are essentially undriven and unpowered during
suspend, so possibly during resume some garbage / half transmitted
bits are seen on the line and put the UART into this state.

The UART on the rk3399 is a DesignWare based 8250 UART.  From mailing
list posts, it appears that other people have run into similar
problems with DesignWare based IP.  Presumably this problem is unique
to that IP, so I have placed the workaround there to avoid possibly of
accidentally triggering bad behavior on other IP.  Also note the RX
Timeout behaves very differently in the DMA case, for for now the
workaround is only applied to the non-DMA case.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---
Testing and development done on a kernel-4.4 based tree, then picked
to ToT, where the code applied cleanly.

Changes in v2:
- Only apply to 8250_dw, not all 8250
- Only apply to the non-DMA case

 drivers/tty/serial/8250/8250_dw.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

Comments

california.l.sullivan@intel.com Feb. 7, 2017, 12:04 a.m. UTC | #1
On 02/06/2017 03:30 PM, Douglas Anderson wrote:
> On a Rockchip rk3399-based board during suspend/resume testing, we
> found that we could get the console UART into a state where it would
> print this to the console a lot:
>    serial8250: too much work for irq42
>
> Followed eventually by:
>    NMI watchdog: BUG: soft lockup - CPU#0 stuck for 11s!
>
> Upon debugging I found that we're in this state:
>    iir = 0x000000cc
>    lsr = 0x00000060
>
> It appears that somehow we have a RX Timeout interrupt but there is no
> actual data present to receive.  When we're in this state the UART
> driver claims that it handled the interrupt but it actually doesn't
> really do anything.  This means that we keep getting the interrupt
> over and over again.
>
> Normally we don't actually need to do anything special to handle a RX
> Timeout interrupt.  We'll notice that there is some data ready and
> we'll read it, which will end up clearing the RX Timeout.  In this
> case we have a problem specifically because we got the RX TImeout
> without any data.  Reading a bogus byte is confirmed to get us out of
> this state.
>
> It's unclear how exactly the UART got into this state, but it is known
> that the UART lines are essentially undriven and unpowered during
> suspend, so possibly during resume some garbage / half transmitted
> bits are seen on the line and put the UART into this state.
Its been a long time since I looked at this, but IIRC it wasn't garbage 
bits, but spurious interrupts. The FIFO is implemented in such a way 
that it acts as a ring, and with a known input you could know ahead of 
time what the result of the extra read would be. This tricked me up, as 
with the inputs I was originally using it appeared to be valid data, 
when in fact it was just the next buffer in the ring which still had old 
data.

This probably doesn't help much, but at least gives some background 
knowledge.

---
Cal

>
> The UART on the rk3399 is a DesignWare based 8250 UART.  From mailing
> list posts, it appears that other people have run into similar
> problems with DesignWare based IP.  Presumably this problem is unique
> to that IP, so I have placed the workaround there to avoid possibly of
> accidentally triggering bad behavior on other IP.  Also note the RX
> Timeout behaves very differently in the DMA case, for for now the
> workaround is only applied to the non-DMA case.
>
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---
> Testing and development done on a kernel-4.4 based tree, then picked
> to ToT, where the code applied cleanly.
>
> Changes in v2:
> - Only apply to 8250_dw, not all 8250
> - Only apply to the non-DMA case
>
>   drivers/tty/serial/8250/8250_dw.c | 23 +++++++++++++++++++++++
>   1 file changed, 23 insertions(+)
>
> diff --git a/drivers/tty/serial/8250/8250_dw.c b/drivers/tty/serial/8250/8250_dw.c
> index c89ae4581378..6ee55a2d47bb 100644
> --- a/drivers/tty/serial/8250/8250_dw.c
> +++ b/drivers/tty/serial/8250/8250_dw.c
> @@ -201,8 +201,31 @@ static unsigned int dw8250_serial_in32be(struct uart_port *p, int offset)
>   
>   static int dw8250_handle_irq(struct uart_port *p)
>   {
> +	struct uart_8250_port *up = up_to_u8250p(p);
>   	struct dw8250_data *d = p->private_data;
>   	unsigned int iir = p->serial_in(p, UART_IIR);
> +	unsigned int status;
> +	unsigned long flags;
> +
> +	/*
> +	 * There are ways to get Designware-based UARTs into a state where
> +	 * they are asserting UART_IIR_RX_TIMEOUT but there is no actual
> +	 * data available.  If we see such a case then we'll do a bogus
> +	 * read.  If we don't do this then the "RX TIMEOUT" interrupt will
> +	 * fire forever.
> +	 *
> +	 * This problem has only been observed so far when not in DMA mode
> +	 * so we limit the workaround only to non-DMA mode.
> +	 */
> +	if (!up->dma && ((iir & 0x3f) == UART_IIR_RX_TIMEOUT)) {
> +		spin_lock_irqsave(&p->lock, flags);
> +		status = p->serial_in(p, UART_LSR);
> +
> +		if (!(status & (UART_LSR_DR | UART_LSR_BI)))
> +			(void) p->serial_in(p, UART_RX);
> +
> +		spin_unlock_irqrestore(&p->lock, flags);
> +	}
>   
>   	if (serial8250_handle_irq(p, iir))
>   		return 1;
oliver@schinagl.nl March 29, 2017, 7:58 a.m. UTC | #2
Hey Douglas,

On 07-02-17 00:30, Douglas Anderson wrote:
> On a Rockchip rk3399-based board during suspend/resume testing, we
> found that we could get the console UART into a state where it would
> print this to the console a lot:
>   serial8250: too much work for irq42
>
> Followed eventually by:
>   NMI watchdog: BUG: soft lockup - CPU#0 stuck for 11s!
>
> Upon debugging I found that we're in this state:
>   iir = 0x000000cc
>   lsr = 0x00000060
>
> It appears that somehow we have a RX Timeout interrupt but there is no
> actual data present to receive.  When we're in this state the UART
> driver claims that it handled the interrupt but it actually doesn't
> really do anything.  This means that we keep getting the interrupt
> over and over again.

I may be running into the same thing on an A20 SoC, but still in the 
stage of figuring out what is going on, as we get this error very 
occasionally. Do you have a way to externally induce this behavior other 
then suspend/resume? As we get it during uart-use and do not have (or I 
have never tried) suspend/resume on our platform.

>
> Normally we don't actually need to do anything special to handle a RX
> Timeout interrupt.  We'll notice that there is some data ready and
> we'll read it, which will end up clearing the RX Timeout.  In this
> case we have a problem specifically because we got the RX TImeout
> without any data.  Reading a bogus byte is confirmed to get us out of
> this state.
>
> It's unclear how exactly the UART got into this state, but it is known
> that the UART lines are essentially undriven and unpowered during
> suspend, so possibly during resume some garbage / half transmitted
> bits are seen on the line and put the UART into this state.
>
> The UART on the rk3399 is a DesignWare based 8250 UART.  From mailing
> list posts, it appears that other people have run into similar
> problems with DesignWare based IP.  Presumably this problem is unique
> to that IP, so I have placed the workaround there to avoid possibly of
> accidentally triggering bad behavior on other IP.  Also note the RX
> Timeout behaves very differently in the DMA case, for for now the
> workaround is only applied to the non-DMA case.
>
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---
> Testing and development done on a kernel-4.4 based tree, then picked
> to ToT, where the code applied cleanly.
>
> Changes in v2:
> - Only apply to 8250_dw, not all 8250
> - Only apply to the non-DMA case
>
>  drivers/tty/serial/8250/8250_dw.c | 23 +++++++++++++++++++++++
>  1 file changed, 23 insertions(+)
>
> diff --git a/drivers/tty/serial/8250/8250_dw.c b/drivers/tty/serial/8250/8250_dw.c
> index c89ae4581378..6ee55a2d47bb 100644
> --- a/drivers/tty/serial/8250/8250_dw.c
> +++ b/drivers/tty/serial/8250/8250_dw.c
> @@ -201,8 +201,31 @@ static unsigned int dw8250_serial_in32be(struct uart_port *p, int offset)
>
>  static int dw8250_handle_irq(struct uart_port *p)
>  {
> +	struct uart_8250_port *up = up_to_u8250p(p);
>  	struct dw8250_data *d = p->private_data;
>  	unsigned int iir = p->serial_in(p, UART_IIR);
> +	unsigned int status;
> +	unsigned long flags;
> +
> +	/*
> +	 * There are ways to get Designware-based UARTs into a state where
> +	 * they are asserting UART_IIR_RX_TIMEOUT but there is no actual
> +	 * data available.  If we see such a case then we'll do a bogus
> +	 * read.  If we don't do this then the "RX TIMEOUT" interrupt will
> +	 * fire forever.
I think what you are saying is 'do a bogus read as that is the only way 
to clear the interrupt, otherwise it will keep firing forever.'?
> +	 *
> +	 * This problem has only been observed so far when not in DMA mode
> +	 * so we limit the workaround only to non-DMA mode.
> +	 */
> +	if (!up->dma && ((iir & 0x3f) == UART_IIR_RX_TIMEOUT)) {
why not
if (!up->dma && ((iir & UART_IIR_RX_TIMEOUT) == UART_IIR_RX_TIMEOUT)) {

it follows the flow as other conditionals in the 8250 source and you 
really only need to mask the specific interrupt anyway.

> +		spin_lock_irqsave(&p->lock, flags);
this is a bit above my knowledge of driver etc, but I don't any 
spinlocks in the 8250 handle_irq glue drivers, except in the OMAP's case 
where they are handeling a DMA IRQ. So I ask, because I don't know, why 
is it needed here?
> +		status = p->serial_in(p, UART_LSR);
> +
> +		if (!(status & (UART_LSR_DR | UART_LSR_BI)))
> +			(void) p->serial_in(p, UART_RX);
I think there should be no space between (void) and p->serial_in
> +
> +		spin_unlock_irqrestore(&p->lock, flags);
> +	}
>
>  	if (serial8250_handle_irq(p, iir))
>  		return 1;
>

Once I found a way to reproduce the problem (without suspend) I will 
test this to see if it fixes it for us too.

Olliver
Andy Shevchenko March 29, 2017, 9:11 a.m. UTC | #3
On Wed, Mar 29, 2017 at 10:58 AM, Olliver Schinagl <oliver@schinagl.nl> wrote:
> On 07-02-17 00:30, Douglas Anderson wrote:

First of all I didn't get why people from Cc list are suddenly
disappeared. Check your mail client settings.
Returning back some of them.

>> It appears that somehow we have a RX Timeout interrupt but there is no
>> actual data present to receive.  When we're in this state the UART
>> driver claims that it handled the interrupt but it actually doesn't
>> really do anything.  This means that we keep getting the interrupt
>> over and over again.

> I may be running into the same thing on an A20 SoC, but still in the stage
> of figuring out what is going on, as we get this error very occasionally. Do
> you have a way to externally induce this behavior other then suspend/resume?
> As we get it during uart-use and do not have (or I have never tried)
> suspend/resume on our platform.

On Intel platforms with this IP I can see similar when run loopback
test on high speeds.
California may correct me since he did a lot of investigation of the
issue on x86.

>>  static int dw8250_handle_irq(struct uart_port *p)
>>  {
>> +       struct uart_8250_port *up = up_to_u8250p(p);
>>         struct dw8250_data *d = p->private_data;
>>         unsigned int iir = p->serial_in(p, UART_IIR);
>> +       unsigned int status;
>> +       unsigned long flags;
>> +
>> +       /*
>> +        * There are ways to get Designware-based UARTs into a state where
>> +        * they are asserting UART_IIR_RX_TIMEOUT but there is no actual
>> +        * data available.  If we see such a case then we'll do a bogus
>> +        * read.  If we don't do this then the "RX TIMEOUT" interrupt will
>> +        * fire forever.
>
> I think what you are saying is 'do a bogus read as that is the only way to
> clear the interrupt, otherwise it will keep firing forever.'?

No, we don't know if this _the only way_. It looks like no one from us
can tell you a root cause, except may be Synopsys guys.

>> +               spin_lock_irqsave(&p->lock, flags);
>
> this is a bit above my knowledge of driver etc, but I don't any spinlocks in
> the 8250 handle_irq glue drivers, except in the OMAP's case where they are
> handeling a DMA IRQ. So I ask, because I don't know, why is it needed here?

They serialize IO accessors.

Regarding to the rest comments, the patch is already in upstream, if
you feel that something should be changed, send an incremental fix.

> Once I found a way to reproduce the problem (without suspend) I will test
> this to see if it fixes it for us too.

It would be appreciated, but better to get know the root cause and
what _hardware_ guys think about solutions.
Olliver Schinagl March 29, 2017, 9:45 a.m. UTC | #4
Hey Andy,

On 29-03-17 11:11, Andy Shevchenko wrote:
> On Wed, Mar 29, 2017 at 10:58 AM, Olliver Schinagl <oliver@schinagl.nl> wrote:
>> On 07-02-17 00:30, Douglas Anderson wrote:
>
> First of all I didn't get why people from Cc list are suddenly
> disappeared. Check your mail client settings.
> Returning back some of them.
Appologies, I replied via gmane's news feed to Douglas's initial post as 
I did not have the original post and I failed to check the other 
recipients. My fault. Sorry. I've added the original others as well.

>
>>> It appears that somehow we have a RX Timeout interrupt but there is no
>>> actual data present to receive.  When we're in this state the UART
>>> driver claims that it handled the interrupt but it actually doesn't
>>> really do anything.  This means that we keep getting the interrupt
>>> over and over again.
>
>> I may be running into the same thing on an A20 SoC, but still in the stage
>> of figuring out what is going on, as we get this error very occasionally. Do
>> you have a way to externally induce this behavior other then suspend/resume?
>> As we get it during uart-use and do not have (or I have never tried)
>> suspend/resume on our platform.
>
> On Intel platforms with this IP I can see similar when run loopback
> test on high speeds.
> California may correct me since he did a lot of investigation of the
> issue on x86.
>
>>>  static int dw8250_handle_irq(struct uart_port *p)
>>>  {
>>> +       struct uart_8250_port *up = up_to_u8250p(p);
>>>         struct dw8250_data *d = p->private_data;
>>>         unsigned int iir = p->serial_in(p, UART_IIR);
>>> +       unsigned int status;
>>> +       unsigned long flags;
>>> +
>>> +       /*
>>> +        * There are ways to get Designware-based UARTs into a state where
>>> +        * they are asserting UART_IIR_RX_TIMEOUT but there is no actual
>>> +        * data available.  If we see such a case then we'll do a bogus
>>> +        * read.  If we don't do this then the "RX TIMEOUT" interrupt will
>>> +        * fire forever.
>>
>> I think what you are saying is 'do a bogus read as that is the only way to
>> clear the interrupt, otherwise it will keep firing forever.'?
>
> No, we don't know if this _the only way_. It looks like no one from us
> can tell you a root cause, except may be Synopsys guys.
Has anybody tried to contact synopsis/dw about this issue at all?

true, it is not the only way (maybe only as far as we know for now) but 
it is 'the' way currently.
>
>>> +               spin_lock_irqsave(&p->lock, flags);
>>
>> this is a bit above my knowledge of driver etc, but I don't any spinlocks in
>> the 8250 handle_irq glue drivers, except in the OMAP's case where they are
>> handeling a DMA IRQ. So I ask, because I don't know, why is it needed here?
>
> They serialize IO accessors.
>
> Regarding to the rest comments, the patch is already in upstream, if
> you feel that something should be changed, send an incremental fix.
Ah, I thought I checked, but thought I didn't see it. I'll probably 
forgot to fetch. I'll send a patch for the small mask fix.
>
>> Once I found a way to reproduce the problem (without suspend) I will test
>> this to see if it fixes it for us too.
>
> It would be appreciated, but better to get know the root cause and
> what _hardware_ guys think about solutions.
>
I read over the docs of the IP block (I know a little FPGA programming) 
(dw_apb_uart of 2006) but found nothing yet that would warn for this 
behavior. I suppose hardware/fgpa guys can give more background here 
potentially, but it may also be simply an IP bug?

Olliver
diff mbox

Patch

diff --git a/drivers/tty/serial/8250/8250_dw.c b/drivers/tty/serial/8250/8250_dw.c
index c89ae4581378..6ee55a2d47bb 100644
--- a/drivers/tty/serial/8250/8250_dw.c
+++ b/drivers/tty/serial/8250/8250_dw.c
@@ -201,8 +201,31 @@  static unsigned int dw8250_serial_in32be(struct uart_port *p, int offset)
 
 static int dw8250_handle_irq(struct uart_port *p)
 {
+	struct uart_8250_port *up = up_to_u8250p(p);
 	struct dw8250_data *d = p->private_data;
 	unsigned int iir = p->serial_in(p, UART_IIR);
+	unsigned int status;
+	unsigned long flags;
+
+	/*
+	 * There are ways to get Designware-based UARTs into a state where
+	 * they are asserting UART_IIR_RX_TIMEOUT but there is no actual
+	 * data available.  If we see such a case then we'll do a bogus
+	 * read.  If we don't do this then the "RX TIMEOUT" interrupt will
+	 * fire forever.
+	 *
+	 * This problem has only been observed so far when not in DMA mode
+	 * so we limit the workaround only to non-DMA mode.
+	 */
+	if (!up->dma && ((iir & 0x3f) == UART_IIR_RX_TIMEOUT)) {
+		spin_lock_irqsave(&p->lock, flags);
+		status = p->serial_in(p, UART_LSR);
+
+		if (!(status & (UART_LSR_DR | UART_LSR_BI)))
+			(void) p->serial_in(p, UART_RX);
+
+		spin_unlock_irqrestore(&p->lock, flags);
+	}
 
 	if (serial8250_handle_irq(p, iir))
 		return 1;