xhci_hcd HC died; cleaning up with TUSB7340 and µPD720201

Message ID	f57ccfd8-0b99-278c-6c10-474395f96a3c@ti.com (mailing list archive)
State	New, archived
Delegated to:	Bjorn Helgaas
From: Vignesh R <vigneshr@ti.com>
To: "Quadros, Roger" <rogerq@ti.com>
CC: Chris Welch <Chris.Welch@viavisolutions.com>, "linux-usb@vger.kernel.org" <linux-usb@vger.kernel.org>, <linux-pci@vger.kernel.org>, Joao Pinto <jpinto@synopsys.com>, KISHON VIJAY ABRAHAM <kishon@ti.com>
Subject: Re: xhci_hcd HC died; cleaning up with TUSB7340 and µPD720201
Date: Thu, 16 Nov 2017 17:56:56 +0530
In-Reply-To: <3dd7a4fc-da86-03cc-9b01-a0d29dd73230@ti.com>
References: <BN6PR18MB124994D85EAC4B5B1AD5EC56866E0@BN6PR18MB1249.namprd18.prod.outlook.com> <3dd7a4fc-da86-03cc-9b01-a0d29dd73230@ti.com>

Vignesh Raghavendra Nov. 16, 2017, 12:26 p.m. UTC

+linux-pci

Hi Chris,

On Thursday 16 November 2017 05:20 PM, Quadros, Roger wrote:
> +Vignesh
> 
> On 13/09/17 17:26, Chris Welch wrote:
>> We are developing a product based on the TI AM5728 EVM. The product utilizes a TUSB7340 PCIe USB host for additional ports. The TUSB7340 is detected and set up properly and works OK with low data rate devices. However, hot plugging a Realtek USB network adapter and doing Ethernet transfer bandwidth testing using iperf3 causes the TUSB7340 host to be locked out. The TUSB7340 host appears to no longer communicate and the logging indicates xhci_hcd 0000:01:00.0: HC died; cleaning up. The same issue occurs with another USB Ethernet adapter I tried (Asus).
>>
>> We looked at using another host and found a mini PCIe card that utilizes the µPD720201 and can be directly installed on the TI AM5728 EVM. The card is detected properly and we reran the transfer test. The µPD720201 gets locked out with the same problem.
>>
>> The AM5728 testing was performed using the TI SD card stock am57xx-evm-linux-04.00.00.04.img, kernel am57xx-evm 4.9.28-geed43d1050, and it reports that it is using the TI AM572x EVM Rev A3 device tree.
>>
>> It shows the following logging when it fails (this is with the TI EVM and µPD720201).
>>
>> [  630.400899] xhci_hcd 0000:01:00.0: xHCI host not responding to stop endpoint command.
>> [  630.408769] xhci_hcd 0000:01:00.0: Assuming host is dying, halting host.
>> [  630.420849] r8152 2-4:1.0 enp1s0u4: Tx status -108
>> [  630.425667] r8152 2-4:1.0 enp1s0u4: Tx status -108
>> [  630.430483] r8152 2-4:1.0 enp1s0u4: Tx status -108
>> [  630.435297] r8152 2-4:1.0 enp1s0u4: Tx status -108
>> [  630.440122] xhci_hcd 0000:01:00.0: HC died; cleaning up
>> [  630.453961] usb 2-4: USB disconnect, device number 2
>>
>> The problem appears to be a general driver issue given we get the same problem with both the TUSB7340 and the µPD720201.

It seems the PCIe driver is missing MSI IRQs, which leads to the stall.
Reading the xHCI registers via PCIe memory space confirms this.

I see two problems with MSI handling.
First, since commit 8c934095fa2f3 ("PCI: dwc: Clear MSI interrupt status after it is handled, not before"),
dwc clears the MSI status after calling the EP's IRQ handler. But another MSI interrupt can be
raised just at the end of the EP's IRQ handler, before the MSI status is cleared.
The new MSI IRQ is then lost, because we clear its status without handling it.
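
The ordering problem can be reproduced in a toy model. What follows is a minimal sketch in plain user-space C, not kernel code: `struct msi_model`, `ep_raise_msi()`, `handle_msi_once()` and `lost_events()` are hypothetical names, and the single status bit that software clears is an assumption made for illustration, not the real DWC register layout. It only shows why clearing the status after the endpoint handler can drop an MSI that arrives in between, while clearing it before cannot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of one MSI status bit; NOT the real DWC register layout. */
struct msi_model {
    uint32_t status;    /* latched MSI status bits */
    unsigned pending;   /* events raised by the endpoint, not yet serviced */
    unsigned serviced;  /* events the endpoint handler has serviced */
};

enum clear_order { CLEAR_BEFORE_HANDLER, CLEAR_AFTER_HANDLER };

/* The endpoint raises MSI vector `bit`: latch the status and queue one event. */
static void ep_raise_msi(struct msi_model *m, unsigned bit)
{
    assert(bit < 32);
    m->status |= UINT32_C(1) << bit;
    m->pending++;
}

/*
 * One pass of the demultiplexing handler over vector 0.
 * `raise_late` injects a second MSI right after the endpoint handler
 * has finished, i.e. inside the race window described in the mail.
 * Returns true if the vector was found set and handled.
 */
static bool handle_msi_once(struct msi_model *m, enum clear_order order,
                            bool raise_late)
{
    const uint32_t bit = UINT32_C(1) << 0;

    if (!(m->status & bit))
        return false;

    if (order == CLEAR_BEFORE_HANDLER)
        m->status &= ~bit;

    /* Endpoint handler: services everything pending at this moment. */
    m->serviced += m->pending;
    m->pending = 0;

    if (raise_late)
        ep_raise_msi(m, 0);

    if (order == CLEAR_AFTER_HANDLER)
        m->status &= ~bit;  /* also wipes the late MSI's status bit */

    return true;
}

/* Number of events left unserviced with no status bit set to announce them. */
static unsigned lost_events(enum clear_order order)
{
    struct msi_model m = { 0, 0, 0 };

    ep_raise_msi(&m, 0);
    handle_msi_once(&m, order, true);

    /* Keep handling for as long as the status bit says there is work. */
    while (handle_msi_once(&m, order, false))
        ;

    return m.pending;
}
```

In this model `lost_events(CLEAR_AFTER_HANDLER)` leaves one event stranded, while `lost_events(CLEAR_BEFORE_HANDLER)` leaves none, because the late MSI re-latches the status bit after it was cleared.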

The second problem appears to be in the dra7xx PCIe wrapper:
PCIECTRL_DRA7XX_CONF_IRQSTATUS_MSI does not seem to catch MSI IRQs unless
it is ensured that a read of the PCIE_MSI_INTR0_STATUS register returns 0.

So, could you try reverting commit 8c934095fa2f3 and
also applying the patch below, and let me know if that fixes the issue?

-----------

Roger Quadros Nov. 20, 2017, 8:01 a.m. UTC | #1

Hi Vignesh,

On 16/11/17 14:26, Vignesh R wrote:
> [...]
>
> So, could you try reverting commit 8c934095fa2f3 and 
> also apply below patch and let me know if that fixes the issue?
> 
> -----------
> 
> diff --git a/drivers/pci/dwc/pci-dra7xx.c b/drivers/pci/dwc/pci-dra7xx.c
> index e77a4ceed74c..8280abc56f30 100644
> --- a/drivers/pci/dwc/pci-dra7xx.c
> +++ b/drivers/pci/dwc/pci-dra7xx.c
> @@ -259,10 +259,17 @@ static irqreturn_t dra7xx_pcie_msi_irq_handler(int irq, void *arg)
>         u32 reg;
>  
>         reg = dra7xx_pcie_readl(dra7xx, PCIECTRL_DRA7XX_CONF_IRQSTATUS_MSI);
> +       dra7xx_pcie_writel(dra7xx, PCIECTRL_DRA7XX_CONF_IRQSTATUS_MSI, reg);
>  
>         switch (reg) {
>         case MSI:
> -               dw_handle_msi_irq(pp);
> +               /*
> +                * Need to make sure no MSI IRQs are pending before
> +                * exiting handler, else the wrapper will not catch new
> +                * IRQs. So loop around till dw_handle_msi_irq() returns
> +                * IRQ_NONE
> +                */
> +               while (dw_handle_msi_irq(pp) != IRQ_NONE);

To avoid this kind of looping, shouldn't we be disabling all IRQ events while
the interrupt handler is running and enable them just before we return from the hardirq
handler?


>                 break;
>         case INTA:
>         case INTB:
> @@ -273,8 +280,6 @@ static irqreturn_t dra7xx_pcie_msi_irq_handler(int irq, void *arg)
>                 break;
>         }
>  
> -       dra7xx_pcie_writel(dra7xx, PCIECTRL_DRA7XX_CONF_IRQSTATUS_MSI, reg);
> -
>         return IRQ_HANDLED;
>  }
> 
> 
> 
>

Vignesh Raghavendra Nov. 20, 2017, 1:19 p.m. UTC | #2

On Monday 20 November 2017 01:31 PM, Roger Quadros wrote:
[...]
>>
>> So, could you try reverting commit 8c934095fa2f3 and 
>> also apply below patch and let me know if that fixes the issue?
>>
>> -----------
>>
>> diff --git a/drivers/pci/dwc/pci-dra7xx.c b/drivers/pci/dwc/pci-dra7xx.c
>> index e77a4ceed74c..8280abc56f30 100644
>> --- a/drivers/pci/dwc/pci-dra7xx.c
>> +++ b/drivers/pci/dwc/pci-dra7xx.c
>> @@ -259,10 +259,17 @@ static irqreturn_t dra7xx_pcie_msi_irq_handler(int irq, void *arg)
>>         u32 reg;
>>  
>>         reg = dra7xx_pcie_readl(dra7xx, PCIECTRL_DRA7XX_CONF_IRQSTATUS_MSI);
>> +       dra7xx_pcie_writel(dra7xx, PCIECTRL_DRA7XX_CONF_IRQSTATUS_MSI, reg);
>>  
>>         switch (reg) {
>>         case MSI:
>> -               dw_handle_msi_irq(pp);
>> +               /*
>> +                * Need to make sure no MSI IRQs are pending before
>> +                * exiting handler, else the wrapper will not catch new
>> +                * IRQs. So loop around till dw_handle_msi_irq() returns
>> +                * IRQ_NONE
>> +                */
>> +               while (dw_handle_msi_irq(pp) != IRQ_NONE);
> 
> To avoid this kind of looping, shouldn't we be disabling all IRQ events while
> the interrupt handler is running and enable them just before we return from the hardirq
> handler?

IIUC, you are suggesting that we disable all MSIs at the PCIe DesignWare
core level, call dw_handle_msi_irq(), and then re-enable MSIs after the
hardirq handler returns. The problem is that if the PCIe EP raises
another MSI after the call to the EP's handler but before MSIs are
re-enabled, it will be ignored because IRQs are not yet enabled.
Ideally, EPs would support Per-Vector Masking (PVM), which allows the RC
to prevent the EP from sending MSI messages for some time. Unfortunately,
the cards mentioned here don't support this feature.


Roger Quadros Nov. 20, 2017, 1:31 p.m. UTC | #3

On 20/11/17 15:19, Vignesh R wrote:
> 
> 
> On Monday 20 November 2017 01:31 PM, Roger Quadros wrote:
> [...]
>>>
>>> So, could you try reverting commit 8c934095fa2f3 and 
>>> also apply below patch and let me know if that fixes the issue?
>>>
>>> -----------
>>>
>>> diff --git a/drivers/pci/dwc/pci-dra7xx.c b/drivers/pci/dwc/pci-dra7xx.c
>>> index e77a4ceed74c..8280abc56f30 100644
>>> --- a/drivers/pci/dwc/pci-dra7xx.c
>>> +++ b/drivers/pci/dwc/pci-dra7xx.c
>>> @@ -259,10 +259,17 @@ static irqreturn_t dra7xx_pcie_msi_irq_handler(int irq, void *arg)
>>>         u32 reg;
>>>  
>>>         reg = dra7xx_pcie_readl(dra7xx, PCIECTRL_DRA7XX_CONF_IRQSTATUS_MSI);
>>> +       dra7xx_pcie_writel(dra7xx, PCIECTRL_DRA7XX_CONF_IRQSTATUS_MSI, reg);
>>>  
>>>         switch (reg) {
>>>         case MSI:
>>> -               dw_handle_msi_irq(pp);
>>> +               /*
>>> +                * Need to make sure no MSI IRQs are pending before
>>> +                * exiting handler, else the wrapper will not catch new
>>> +                * IRQs. So loop around till dw_handle_msi_irq() returns
>>> +                * IRQ_NONE
>>> +                */
>>> +               while (dw_handle_msi_irq(pp) != IRQ_NONE);
>>
>> To avoid this kind of looping, shouldn't we be disabling all IRQ events while
>> the interrupt handler is running and enable them just before we return from the hardirq
>> handler?
> 
> IIUC, you are saying to disable all MSIs at PCIe designware core level,
> then call dw_handle_msi_irq() and then enable MSIs after hardirq
> returns. But, the problem is if PCIe EP raises another MSI after the
> call to EP's handler but before re-enabling MSIs, then it will be
> ignored as IRQs are not yet enabled.
> Ideally, EP's support Per Vector Masking(PVM) which allow RC to prevent
> EP from sending MSI messages for sometime. But, unfortunately, the cards
> mentioned here don't support this feature.

I'm not familiar with MSIs.

But for any typical hardware, there should be an interrupt event enable register and an
interrupt mask register.

In the IRQ handler, we mask the interrupt but still keep the interrupt events enabled so that
they can be latched during the time the interrupt was masked.

When the interrupt is unmasked at end of the IRQ handler, it should re-trigger the interrupt
if any events were latched and pending.

This way you don't need to keep checking for any pending events in the IRQ handler.
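
The mask-and-latch scheme described here can be sketched as a small model. This is a generic illustration in user-space C under stated assumptions, not a description of the DesignWare core or the dra7xx wrapper: `struct irq_model`, `raise_event()`, `irq_line_asserted()`, `run_until_idle()` and `handler_runs()` are hypothetical names, and the model assumes the status latches while masked and the IRQ line is level-derived from `latched & ~mask`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Generic interrupt block with a latched status and a mask (assumed model). */
struct irq_model {
    uint32_t latched;  /* raw event status: latches even while masked */
    uint32_t mask;     /* 1 = masked, the bit does not assert the IRQ line */
};

/* The IRQ line is asserted when any unmasked event is latched. */
static bool irq_line_asserted(const struct irq_model *c)
{
    return (c->latched & ~c->mask) != 0;
}

/* A device event arrives; it is latched regardless of the mask. */
static void raise_event(struct irq_model *c, unsigned bit)
{
    assert(bit < 32);
    c->latched |= UINT32_C(1) << bit;
}

/*
 * Run the handler for as long as the line is asserted. Each invocation
 * masks, acknowledges the events it snapshotted, and unmasks on exit.
 * While masked, one of the `late_events` may arrive; it stays latched and
 * re-asserts the line after unmasking. Returns the number of invocations.
 */
static unsigned run_until_idle(struct irq_model *c, unsigned late_events)
{
    unsigned invocations = 0;

    while (irq_line_asserted(c)) {
        uint32_t snapshot;

        c->mask = UINT32_MAX;      /* mask on entry */
        snapshot = c->latched;
        c->latched &= ~snapshot;   /* ack only what is being serviced */

        if (late_events > 0) {     /* new event while masked */
            raise_event(c, 0);
            late_events--;
        }

        c->mask = 0;               /* unmask on exit */
        invocations++;
    }

    return invocations;
}

/* Convenience wrapper: one initial event plus `late_events` late arrivals. */
static unsigned handler_runs(unsigned late_events)
{
    struct irq_model c = { 0, 0 };

    raise_event(&c, 0);
    return run_until_idle(&c, late_events);
}
```

In this model no event is lost and the handler never loops internally: each late event simply causes one more invocation, so `handler_runs(1)` runs the handler twice. Whether real hardware behaves this way depends on the status register actually being gated by the mask.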


Vignesh Raghavendra Nov. 21, 2017, 5:47 a.m. UTC | #4

On Monday 20 November 2017 07:01 PM, Roger Quadros wrote:
> On 20/11/17 15:19, Vignesh R wrote:
>> [...]
> 
> I'm not aware of MSIs.
> 
> But for any typical hardware, there should be an interrupt event enable register and an
> interrupt mask register.
> 
> In the IRQ handler, we mask the interrupt but still keep the interrupt events enabled so that
> they can be latched during the time the interrupt was masked.
> 
> When the interrupt is unmasked at end of the IRQ handler, it should re-trigger the interrupt
> if any events were latched and pending.
> 
> This way you don't need to keep checking for any pending events in the IRQ handler.
> 

Thanks for the suggestion! I tried using interrupt masking at the
DesignWare level. Unfortunately, that does not help and my test cases
still fail. It seems the DesignWare MSI status register is a non-masked
status, and the dra7xx-specific wrapper seems to rely on this non-masked
status (instead of the actual IRQ signal of DesignWare) to raise the IRQ
to the CPU. There is very little documentation in the TRM on how the
wrapper forwards the DesignWare IRQ status to the CPU.

So, at best, I will add a check in the above while() loop to break out
and exit the IRQ handler after, let's say, 1K loops.
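
A bounded version of the drain loop might look like the following. This is a minimal user-space sketch, not the patch that was eventually sent: `MSI_DRAIN_LIMIT`, `drain_msi_bounded()`, `fake_poll()` and `demo_drain()` are hypothetical names, and the `msi_poll_fn` callback is a stand-in for dw_handle_msi_irq() that returns true where the real function would return IRQ_HANDLED and false for IRQ_NONE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cap, mirroring the "1K loops" mentioned in the mail. */
#define MSI_DRAIN_LIMIT 1000

/*
 * Stand-in for dw_handle_msi_irq(): returns true if at least one MSI was
 * handled in this pass (IRQ_HANDLED), false if nothing was pending (IRQ_NONE).
 */
typedef bool (*msi_poll_fn)(void *ctx);

/*
 * Call `poll` until it reports nothing pending or `limit` passes have
 * handled something, whichever comes first, so that a misbehaving endpoint
 * cannot keep the hardirq handler spinning forever.
 * Returns the number of passes that handled an MSI.
 */
static unsigned drain_msi_bounded(msi_poll_fn poll, void *ctx, unsigned limit)
{
    unsigned passes = 0;

    assert(poll != NULL);

    while (passes < limit && poll(ctx))
        passes++;

    return passes;
}

/* Test stub: pretends that `*remaining` MSIs are still pending. */
static bool fake_poll(void *ctx)
{
    unsigned *remaining = ctx;

    if (*remaining == 0)
        return false;

    (*remaining)--;
    return true;
}

/* Drain `pending` fake MSIs with the given pass limit. */
static unsigned demo_drain(unsigned pending, unsigned limit)
{
    return drain_msi_bounded(fake_poll, &pending, limit);
}
```

With three pending MSIs the loop exits after three passes; with more pending MSIs than the limit it stops at the limit and leaves the rest for the next interrupt.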

Niklas Cassel Feb. 8, 2018, 8:54 a.m. UTC | #5

On 16/11/17 13:26, Vignesh R wrote:
> [...]
> 
> Seems like PCIe driver is missing MSI IRQs leading to stall. 
> Reading xHCI registers via PCIe mem space confirms this.
> 
> I see two problems wrt MSI handling:
> Since commit 8c934095fa2f3 ("PCI: dwc: Clear MSI interrupt status after it is handled, not before"),
> dwc clears MSI status after calling EP's IRQ handler. But, it happens that another MSI interrupt is
> raised just at the end of EP's IRQ handler and before clearing MSI status.
> This will result in loss of new MSI IRQ as we clear the MSI IRQ status without handling.
> 
> Another problem appears to be wrt dra7xx PCIe wrapper:
> PCIECTRL_DRA7XX_CONF_IRQSTATUS_MSI does not seem to catch MSI IRQs unless,
> its ensured that PCIE_MSI_INTR0_STATUS register read returns 0.
> 
> So, could you try reverting commit 8c934095fa2f3 and 
> also apply below patch and let me know if that fixes the issue?

Hello Vignesh,

It is not only dra7xx that is affected by this bug,
but all other DWC-based drivers as well.

I suggest that we either:
Revert 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status after it is handled,
not before")
or
You implement a generic solution
(i.e. in drivers/pci/dwc/pcie-designware-host.c)
rather than just a dra7xx specific solution.


Regards,
Niklas


Vignesh Raghavendra Feb. 8, 2018, 12:52 p.m. UTC | #6

On Thursday 08 February 2018 02:24 PM, Niklas Cassel wrote:
> On 16/11/17 13:26, Vignesh R wrote:
>> [...]
> 
> Hello Vignesh,
> 
> It is not only dra7xx that is affected by this bug,
> all other DWC based drivers as well.
> 
> I suggest that we either:
> Revert 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status after it is handled,
> not before")

I will send a revert soon.

> or
> You implement a generic solution
> (i.e. in drivers/pci/dwc/pcie-designware-host.c)
> rather than just a dra7xx specific solution.

Reverting 8c934095fa2f should unblock all other DWC drivers. For dra7xx,
I will send separate patches to fix TI dra7xx specific wrapper level IRQ
handler.


