diff mbox series

SSD surprise removal leads to long wait inside pci_dev_wait() and FLR 65s timeout

Message ID 8f2d88a5-9524-c4c3-a61f-7d55d97e1c18@broadcom.com (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas
Headers show
Series SSD surprise removal leads to long wait inside pci_dev_wait() and FLR 65s timeout | expand

Commit Message

JD Zheng May 31, 2019, 4:55 p.m. UTC
Hello,

I am running DPDK 18.11+SPDK 19.04 with v5.1 kernel. DPDK/SPDK uses SSD 
vfio devices and after running SPDK's nvmf_tgt, unplugging a SSD cause 
kernel to print out following:
[  105.426952] vfio-pci 0000:04:00.0: not ready 2047ms after FLR; waiting
[  107.698953] vfio-pci 0000:04:00.0: not ready 4095ms after FLR; waiting
[  112.050960] vfio-pci 0000:04:00.0: not ready 8191ms after FLR; waiting
[  120.498953] vfio-pci 0000:04:00.0: not ready 16383ms after FLR; waiting
[  138.418957] vfio-pci 0000:04:00.0: not ready 32767ms after FLR; waiting
[  173.234953] vfio-pci 0000:04:00.0: not ready 65535ms after FLR; giving up

Looks like it is a PCI hotplug racing condition between DPDK's 
eal-intr-thread thread and kernel's pciehp thread. And it causes lockup 
in pci_dev_wait() at kernel side.

When SSD is removed, eal-intr-thread immediately receives 
RTE_INTR_HANDLE_ALARM and handler calls rte_pci_detach_dev() and at 
kernel side vfio_pci_release() is triggered to release this vfio device, 
which calls pci_try_reset_function(), then _pci_reset_function_locked(). 
pci_try_reset_function acquires the device lock but 
_pci_reset_function_locked() doesn't return, therefore lock is NOT released.

Inside _pci_reset_function_locked(), pcie_has_flr(), pci_pm_reset(), 
etc. call pci_dev_wait() at the end but it doesn't return and print out 
above message until 65s timeout.

At kernel pciehp side, it also detects the removal but doesn't run 
immediately as it is configured as "pciehp.pciehp_poll_time=5". So a 
couple of seconds later, it calls pciehp_unconfigure_device -> 
pci_walk_bus -> pci_dev_set_disconnected. pci_dev_set_disconnected() 
couldn't get the device lock and is stuck too because the lock is hold 
by eal-intr-thread.

The first issue is in pci_dev_wait(). It calls pci_read_config_dword() 
and only when id is not all ones, it can return. But when SSD is 
physically removed, id retrieved is always all ones therefore, it has to 
wait for FLR 65s timeout to return.

I did the following to check return value of pci_read_config_dword() to 
fix this:
         if (delay > 1000)

The second issue is that due to lock up described above, the 
pci_dev_set_disconnected() is stuck and pci_read_config_dword() won't 
return PCIBIOS_DEVICE_NOT_FOUND.

I didn't find a easy way to fix it. Maybe use device lock in 
pci_dev_set_disconnected() is too coarse and we need a finer device 
err_state lock?

BTW, pci_dev_set_disconnected wasn't using device lock until this change 
a6bd101b8f.

Any suggestions to fix this problem?

Thanks,
JD Zheng

Comments

Bjorn Helgaas June 3, 2019, 12:44 a.m. UTC | #1
[+cc Alex, Lukas]

On Fri, May 31, 2019 at 09:55:20AM -0700, JD Zheng wrote:
> Hello,
> 
> I am running DPDK 18.11+SPDK 19.04 with v5.1 kernel. DPDK/SPDK uses SSD vfio
> devices and after running SPDK's nvmf_tgt, unplugging a SSD cause kernel to
> print out following:
> [  105.426952] vfio-pci 0000:04:00.0: not ready 2047ms after FLR; waiting
> [  107.698953] vfio-pci 0000:04:00.0: not ready 4095ms after FLR; waiting
> [  112.050960] vfio-pci 0000:04:00.0: not ready 8191ms after FLR; waiting
> [  120.498953] vfio-pci 0000:04:00.0: not ready 16383ms after FLR; waiting
> [  138.418957] vfio-pci 0000:04:00.0: not ready 32767ms after FLR; waiting
> [  173.234953] vfio-pci 0000:04:00.0: not ready 65535ms after FLR; giving up
> 
> Looks like it is a PCI hotplug racing condition between DPDK's
> eal-intr-thread thread and kernel's pciehp thread. And it causes lockup in
> pci_dev_wait() at kernel side.
> 
> When SSD is removed, eal-intr-thread immediately receives
> RTE_INTR_HANDLE_ALARM and handler calls rte_pci_detach_dev() and at kernel
> side vfio_pci_release() is triggered to release this vfio device, which
> calls pci_try_reset_function(), then _pci_reset_function_locked().
> pci_try_reset_function acquires the device lock but
> _pci_reset_function_locked() doesn't return, therefore lock is NOT released.
> 
> Inside _pci_reset_function_locked(), pcie_has_flr(), pci_pm_reset(), etc.
> call pci_dev_wait() at the end but it doesn't return and print out above
> message until 65s timeout.
> 
> At kernel pciehp side, it also detects the removal but doesn't run
> immediately as it is configured as "pciehp.pciehp_poll_time=5". So a couple
> of seconds later, it calls pciehp_unconfigure_device -> pci_walk_bus ->
> pci_dev_set_disconnected. pci_dev_set_disconnected() couldn't get the device
> lock and is stuck too because the lock is hold by eal-intr-thread.
> 
> The first issue is in pci_dev_wait(). It calls pci_read_config_dword() and
> only when id is not all ones, it can return. But when SSD is physically
> removed, id retrieved is always all ones therefore, it has to wait for FLR
> 65s timeout to return.
> 
> I did the following to check return value of pci_read_config_dword() to fix
> this:
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -4439,7 +4439,11 @@ static int pci_dev_wait(struct pci_dev *dev, char
> *reset_type, int timeout)
> 
>                 msleep(delay);
>                 delay *= 2;
> -               pci_read_config_dword(dev, PCI_COMMAND, &id);
> +               if (pci_read_config_dword(dev, PCI_COMMAND, &id) ==
> +                   PCIBIOS_DEVICE_NOT_FOUND) {
> +                       pci_info(dev, "device disconnected\n");
> +                       return -ENODEV;
> +               }
>         }
> 
>         if (delay > 1000)
> 
> The second issue is that due to lock up described above, the
> pci_dev_set_disconnected() is stuck and pci_read_config_dword() won't return
> PCIBIOS_DEVICE_NOT_FOUND.
> 
> I didn't find a easy way to fix it. Maybe use device lock in
> pci_dev_set_disconnected() is too coarse and we need a finer device
> err_state lock?
> 
> BTW, pci_dev_set_disconnected wasn't using device lock until this change
> a6bd101b8f.
> 
> Any suggestions to fix this problem?

Would you mind opening a report at https://bugzilla.kernel.org and
attaching the complete dmesg log and "lspci -vv" output?

Out of curiosity, why do you use "pciehp.pciehp_poll_time=5"?

Bjorn
Alex Williamson June 3, 2019, 1:40 a.m. UTC | #2
On Sun, 2 Jun 2019 19:44:14 -0500
Bjorn Helgaas <helgaas@kernel.org> wrote:

> [+cc Alex, Lukas]
> 
> On Fri, May 31, 2019 at 09:55:20AM -0700, JD Zheng wrote:
> > Hello,
> > 
> > I am running DPDK 18.11+SPDK 19.04 with v5.1 kernel. DPDK/SPDK uses SSD vfio
> > devices and after running SPDK's nvmf_tgt, unplugging a SSD cause kernel to
> > print out following:
> > [  105.426952] vfio-pci 0000:04:00.0: not ready 2047ms after FLR; waiting
> > [  107.698953] vfio-pci 0000:04:00.0: not ready 4095ms after FLR; waiting
> > [  112.050960] vfio-pci 0000:04:00.0: not ready 8191ms after FLR; waiting
> > [  120.498953] vfio-pci 0000:04:00.0: not ready 16383ms after FLR; waiting
> > [  138.418957] vfio-pci 0000:04:00.0: not ready 32767ms after FLR; waiting
> > [  173.234953] vfio-pci 0000:04:00.0: not ready 65535ms after FLR; giving up
> > 
> > Looks like it is a PCI hotplug racing condition between DPDK's
> > eal-intr-thread thread and kernel's pciehp thread. And it causes lockup in
> > pci_dev_wait() at kernel side.
> > 
> > When SSD is removed, eal-intr-thread immediately receives
> > RTE_INTR_HANDLE_ALARM and handler calls rte_pci_detach_dev() and at kernel
> > side vfio_pci_release() is triggered to release this vfio device, which
> > calls pci_try_reset_function(), then _pci_reset_function_locked().
> > pci_try_reset_function acquires the device lock but
> > _pci_reset_function_locked() doesn't return, therefore lock is NOT released.

To what extent does vfio-pci need to learn about surprise hotplug?  My
expectation is that the current state of the code would only support
cooperative hotplug.  When a device is surprise removed, what backs a
user's mmaps?  AIUI, we don't have a revoke interface to invalidate
these.  We should probably start with an RFE or some development effort
to harden vfio-pci for surprise hotplug, it's not surprising it doesn't
just work TBH.  Thanks,

Alex


> > Inside _pci_reset_function_locked(), pcie_has_flr(), pci_pm_reset(), etc.
> > call pci_dev_wait() at the end but it doesn't return and print out above
> > message until 65s timeout.
> > 
> > At kernel pciehp side, it also detects the removal but doesn't run
> > immediately as it is configured as "pciehp.pciehp_poll_time=5". So a couple
> > of seconds later, it calls pciehp_unconfigure_device -> pci_walk_bus ->
> > pci_dev_set_disconnected. pci_dev_set_disconnected() couldn't get the device
> > lock and is stuck too because the lock is hold by eal-intr-thread.
> > 
> > The first issue is in pci_dev_wait(). It calls pci_read_config_dword() and
> > only when id is not all ones, it can return. But when SSD is physically
> > removed, id retrieved is always all ones therefore, it has to wait for FLR
> > 65s timeout to return.
> > 
> > I did the following to check return value of pci_read_config_dword() to fix
> > this:
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -4439,7 +4439,11 @@ static int pci_dev_wait(struct pci_dev *dev, char
> > *reset_type, int timeout)
> > 
> >                 msleep(delay);
> >                 delay *= 2;
> > -               pci_read_config_dword(dev, PCI_COMMAND, &id);
> > +               if (pci_read_config_dword(dev, PCI_COMMAND, &id) ==
> > +                   PCIBIOS_DEVICE_NOT_FOUND) {
> > +                       pci_info(dev, "device disconnected\n");
> > +                       return -ENODEV;
> > +               }
> >         }
> > 
> >         if (delay > 1000)
> > 
> > The second issue is that due to lock up described above, the
> > pci_dev_set_disconnected() is stuck and pci_read_config_dword() won't return
> > PCIBIOS_DEVICE_NOT_FOUND.
> > 
> > I didn't find a easy way to fix it. Maybe use device lock in
> > pci_dev_set_disconnected() is too coarse and we need a finer device
> > err_state lock?
> > 
> > BTW, pci_dev_set_disconnected wasn't using device lock until this change
> > a6bd101b8f.
> > 
> > Any suggestions to fix this problem?  
> 
> Would you mind opening a report at https://bugzilla.kernel.org and
> attaching the complete dmesg log and "lspci -vv" output?
> 
> Out of curiosity, why do you use "pciehp.pciehp_poll_time=5"?
> 
> Bjorn
JD Zheng June 3, 2019, 9:17 p.m. UTC | #3
On 6/2/19 6:40 PM, Alex Williamson wrote:
> On Sun, 2 Jun 2019 19:44:14 -0500
> Bjorn Helgaas <helgaas@kernel.org> wrote:
> 
>> [+cc Alex, Lukas]
>>
>> On Fri, May 31, 2019 at 09:55:20AM -0700, JD Zheng wrote:
>>> Hello,
>>>
>>> I am running DPDK 18.11+SPDK 19.04 with v5.1 kernel. DPDK/SPDK uses SSD vfio
>>> devices and after running SPDK's nvmf_tgt, unplugging a SSD cause kernel to
>>> print out following:
>>> [  105.426952] vfio-pci 0000:04:00.0: not ready 2047ms after FLR; waiting
>>> [  107.698953] vfio-pci 0000:04:00.0: not ready 4095ms after FLR; waiting
>>> [  112.050960] vfio-pci 0000:04:00.0: not ready 8191ms after FLR; waiting
>>> [  120.498953] vfio-pci 0000:04:00.0: not ready 16383ms after FLR; waiting
>>> [  138.418957] vfio-pci 0000:04:00.0: not ready 32767ms after FLR; waiting
>>> [  173.234953] vfio-pci 0000:04:00.0: not ready 65535ms after FLR; giving up
>>>
>>> Looks like it is a PCI hotplug racing condition between DPDK's
>>> eal-intr-thread thread and kernel's pciehp thread. And it causes lockup in
>>> pci_dev_wait() at kernel side.
>>>
>>> When SSD is removed, eal-intr-thread immediately receives
>>> RTE_INTR_HANDLE_ALARM and handler calls rte_pci_detach_dev() and at kernel
>>> side vfio_pci_release() is triggered to release this vfio device, which
>>> calls pci_try_reset_function(), then _pci_reset_function_locked().
>>> pci_try_reset_function acquires the device lock but
>>> _pci_reset_function_locked() doesn't return, therefore lock is NOT released.
> 
> To what extent does vfio-pci need to learn about surprise hotplug?  My
> expectation is that the current state of the code would only support
> cooperative hotplug.  When a device is surprise removed, what backs a
> user's mmaps?  AIUI, we don't have a revoke interface to invalidate
> these.  We should probably start with an RFE or some development effort
> to harden vfio-pci for surprise hotplug, it's not surprising it doesn't
> just work TBH.  Thanks,
> 
> Alex
> 
> 

I did see other issues that DPDK unmap vifo device memory was stuck due 
to surprise removal.

Are you saying that vfio-pci surprise removal is not fully implemented yet?

>>> Inside _pci_reset_function_locked(), pcie_has_flr(), pci_pm_reset(), etc.
>>> call pci_dev_wait() at the end but it doesn't return and print out above
>>> message until 65s timeout.
>>>
>>> At kernel pciehp side, it also detects the removal but doesn't run
>>> immediately as it is configured as "pciehp.pciehp_poll_time=5". So a couple
>>> of seconds later, it calls pciehp_unconfigure_device -> pci_walk_bus ->
>>> pci_dev_set_disconnected. pci_dev_set_disconnected() couldn't get the device
>>> lock and is stuck too because the lock is hold by eal-intr-thread.
>>>
>>> The first issue is in pci_dev_wait(). It calls pci_read_config_dword() and
>>> only when id is not all ones, it can return. But when SSD is physically
>>> removed, id retrieved is always all ones therefore, it has to wait for FLR
>>> 65s timeout to return.
>>>
>>> I did the following to check return value of pci_read_config_dword() to fix
>>> this:
>>> --- a/drivers/pci/pci.c
>>> +++ b/drivers/pci/pci.c
>>> @@ -4439,7 +4439,11 @@ static int pci_dev_wait(struct pci_dev *dev, char
>>> *reset_type, int timeout)
>>>
>>>                  msleep(delay);
>>>                  delay *= 2;
>>> -               pci_read_config_dword(dev, PCI_COMMAND, &id);
>>> +               if (pci_read_config_dword(dev, PCI_COMMAND, &id) ==
>>> +                   PCIBIOS_DEVICE_NOT_FOUND) {
>>> +                       pci_info(dev, "device disconnected\n");
>>> +                       return -ENODEV;
>>> +               }
>>>          }
>>>
>>>          if (delay > 1000)
>>>
>>> The second issue is that due to lock up described above, the
>>> pci_dev_set_disconnected() is stuck and pci_read_config_dword() won't return
>>> PCIBIOS_DEVICE_NOT_FOUND.
>>>
>>> I didn't find a easy way to fix it. Maybe use device lock in
>>> pci_dev_set_disconnected() is too coarse and we need a finer device
>>> err_state lock?
>>>
>>> BTW, pci_dev_set_disconnected wasn't using device lock until this change
>>> a6bd101b8f.
>>>
>>> Any suggestions to fix this problem?
>>
>> Would you mind opening a report at https://bugzilla.kernel.org and
>> attaching the complete dmesg log and "lspci -vv" output?

I submitted one as https://bugzilla.kernel.org/show_bug.cgi?id=203797

>>
>> Out of curiosity, why do you use "pciehp.pciehp_poll_time=5"?

This was chosen by other engineer. I tried shorter polling time, which 
doesn't make difference.

I also tried pciehp.pciehp_poll_mode=0 but hotplug doesn't seem working.
Should I use poll mode or not?

>>
>> Bjorn
>
Alex Williamson June 3, 2019, 10:40 p.m. UTC | #4
On Mon, 3 Jun 2019 14:17:36 -0700
JD Zheng <jiandong.zheng@broadcom.com> wrote:

> On 6/2/19 6:40 PM, Alex Williamson wrote:
> > On Sun, 2 Jun 2019 19:44:14 -0500
> > Bjorn Helgaas <helgaas@kernel.org> wrote:
> >   
> >> [+cc Alex, Lukas]
> >>
> >> On Fri, May 31, 2019 at 09:55:20AM -0700, JD Zheng wrote:  
> >>> Hello,
> >>>
> >>> I am running DPDK 18.11+SPDK 19.04 with v5.1 kernel. DPDK/SPDK uses SSD vfio
> >>> devices and after running SPDK's nvmf_tgt, unplugging a SSD cause kernel to
> >>> print out following:
> >>> [  105.426952] vfio-pci 0000:04:00.0: not ready 2047ms after FLR; waiting
> >>> [  107.698953] vfio-pci 0000:04:00.0: not ready 4095ms after FLR; waiting
> >>> [  112.050960] vfio-pci 0000:04:00.0: not ready 8191ms after FLR; waiting
> >>> [  120.498953] vfio-pci 0000:04:00.0: not ready 16383ms after FLR; waiting
> >>> [  138.418957] vfio-pci 0000:04:00.0: not ready 32767ms after FLR; waiting
> >>> [  173.234953] vfio-pci 0000:04:00.0: not ready 65535ms after FLR; giving up
> >>>
> >>> Looks like it is a PCI hotplug racing condition between DPDK's
> >>> eal-intr-thread thread and kernel's pciehp thread. And it causes lockup in
> >>> pci_dev_wait() at kernel side.
> >>>
> >>> When SSD is removed, eal-intr-thread immediately receives
> >>> RTE_INTR_HANDLE_ALARM and handler calls rte_pci_detach_dev() and at kernel
> >>> side vfio_pci_release() is triggered to release this vfio device, which
> >>> calls pci_try_reset_function(), then _pci_reset_function_locked().
> >>> pci_try_reset_function acquires the device lock but
> >>> _pci_reset_function_locked() doesn't return, therefore lock is NOT released.  
> > 
> > To what extent does vfio-pci need to learn about surprise hotplug?  My
> > expectation is that the current state of the code would only support
> > cooperative hotplug.  When a device is surprise removed, what backs a
> > user's mmaps?  AIUI, we don't have a revoke interface to invalidate
> > these.  We should probably start with an RFE or some development effort
> > to harden vfio-pci for surprise hotplug, it's not surprising it doesn't
> > just work TBH.  Thanks,
> > 
> > Alex
> > 
> >   
> 
> I did see other issues that DPDK unmap vifo device memory was stuck due 
> to surprise removal.
> 
> Are you saying that vfio-pci surprise removal is not fully implemented yet?

Has not even been evaluated for that use case afaik.  You are in
unknown and expected broken territory.  If this is an area you'd like
to contribute, great, but expect more than a bug fix here and there.
Thanks,

Alex
Bjorn Helgaas June 3, 2019, 11:05 p.m. UTC | #5
On Mon, Jun 03, 2019 at 02:17:36PM -0700, JD Zheng wrote:
> > On Sun, 2 Jun 2019 19:44:14 -0500
> > Bjorn Helgaas <helgaas@kernel.org> wrote:

> > > Would you mind opening a report at https://bugzilla.kernel.org and
> > > attaching the complete dmesg log and "lspci -vv" output?
> 
> I submitted one as https://bugzilla.kernel.org/show_bug.cgi?id=203797
> 
> > > Out of curiosity, why do you use "pciehp.pciehp_poll_time=5"?
> 
> This was chosen by other engineer. I tried shorter polling time, which
> doesn't make difference.
> 
> I also tried pciehp.pciehp_poll_mode=0 but hotplug doesn't seem working.
> Should I use poll mode or not?

This is just a tangent to the real issue you're working; I'm not
suggesting any changes here to fix that issue.

In my mind, the existence of the pciehp.pciehp_poll_mode and
pciehp.pciehp_poll_time parameters is a defect in pciehp.  I would far
rather that pciehp figured out by itself whether it needed to poll.  I
don't know whether that's actually feasible, or if there's some reason
why pciehp can't be that smart.

Bjorn
diff mbox series

Patch

--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4439,7 +4439,11 @@  static int pci_dev_wait(struct pci_dev *dev, char 
*reset_type, int timeout)

                 msleep(delay);
                 delay *= 2;
-               pci_read_config_dword(dev, PCI_COMMAND, &id);
+               if (pci_read_config_dword(dev, PCI_COMMAND, &id) ==
+                   PCIBIOS_DEVICE_NOT_FOUND) {
+                       pci_info(dev, "device disconnected\n");
+                       return -ENODEV;
+               }
         }