diff mbox series

PCI/portdrv: Don't disable pci device during shutdown

Message ID 1600680138-10949-1-git-send-email-chenhc@lemote.com (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas
Headers show
Series PCI/portdrv: Don't disable pci device during shutdown | expand

Commit Message

Huacai Chen Sept. 21, 2020, 9:22 a.m. UTC
Use separate remove()/shutdown() callback, and don't disable pci device
during shutdown. This can avoid some poweroff/reboot failures.

The poweroff/reboot failures can easily reproduce on Loongson platforms.
I think this is not a Loongson-specific problem, instead, is a problem
related to some specific PCI hosts. On some x86 platforms, radeon/amdgpu
devices can cause the same problem, and commit faefba95c9e8ca3a523831c2e
("drm/amdgpu: just suspend the hw on pci shutdown") can resolve it.

As Tiezhu said, this occasionally shutdown or reboot failure is due to
clear PCI_COMMAND_MASTER on the device in do_pci_disable_device().

drivers/pci/pci.c
static void do_pci_disable_device(struct pci_dev *dev)
{
        u16 pci_command;

        pci_read_config_word(dev, PCI_COMMAND, &pci_command);
        if (pci_command & PCI_COMMAND_MASTER) {
                pci_command &= ~PCI_COMMAND_MASTER;
                pci_write_config_word(dev, PCI_COMMAND, pci_command);
        }

        pcibios_disable_device(dev);
}

When remove "pci_command &= ~PCI_COMMAND_MASTER;", it can work well when
shutdown or reboot. This may implies that there are DMA activities on the
device while shutdown.

Radeon driver is more difficult than amdgpu due to its confusing symbol
names, and I have maintained an out-of-tree patch for a long time [1].
Recently, we found more and more devices can cause the same problem, and
it is very difficult to modify all problematic drivers as radeon/amdgpu
does (the .shutdown callback should make sure there is no DMA activity).
So, I think modify the PCIe port driver is a simple and effective way.
And as early discussed, kexec can still work after this patch.

[1] https://github.com/chenhuacai/linux/commit/6612f9c1fc290d42a14618ce9a7d03014d8ebb1a

Signed-off-by: Huacai Chen <chenhc@lemote.com>
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
---
 drivers/pci/pcie/portdrv.h      |  2 +-
 drivers/pci/pcie/portdrv_core.c |  6 ++++--
 drivers/pci/pcie/portdrv_pci.c  | 15 +++++++++++++--
 3 files changed, 18 insertions(+), 5 deletions(-)

Comments

Sinan Kaya Sept. 21, 2020, 3:50 p.m. UTC | #1
On 9/21/2020 5:22 AM, Huacai Chen wrote:
> Use separate remove()/shutdown() callback, and don't disable pci device
> during shutdown. This can avoid some poweroff/reboot failures.
> 
> The poweroff/reboot failures can easily reproduce on Loongson platforms.
> I think this is not a Loongson-specific problem, instead, is a problem
> related to some specific PCI hosts. On some x86 platforms, radeon/amdgpu
> devices can cause the same problem, and commit faefba95c9e8ca3a523831c2e
> ("drm/amdgpu: just suspend the hw on pci shutdown") can resolve it.

This sounds like a quirk to me rather than a behavior that should be
applied to all platforms.
Huacai Chen Sept. 22, 2020, 2:11 a.m. UTC | #2
Hi, Sinan,

On Mon, Sep 21, 2020 at 11:50 PM Sinan Kaya <okaya@kernel.org> wrote:
>
> On 9/21/2020 5:22 AM, Huacai Chen wrote:
> > Use separate remove()/shutdown() callback, and don't disable pci device
> > during shutdown. This can avoid some poweroff/reboot failures.
> >
> > The poweroff/reboot failures can easily reproduce on Loongson platforms.
> > I think this is not a Loongson-specific problem, instead, is a problem
> > related to some specific PCI hosts. On some x86 platforms, radeon/amdgpu
> > devices can cause the same problem, and commit faefba95c9e8ca3a523831c2e
> > ("drm/amdgpu: just suspend the hw on pci shutdown") can resolve it.
>
> This sounds like a quirk to me rather than a behavior that should be
> applied to all platforms.
Yes, this is very like a quirk, but it seems there are a lot of
platforms that have problems, and removing the pci_disable_device()
has no side effect.

Huacai
Sinan Kaya Sept. 22, 2020, 4:30 a.m. UTC | #3
On 9/21/2020 10:11 PM, Huacai Chen wrote:
>> his sounds like a quirk to me rather than a behavior that should be
>> applied to all platforms.
> Yes, this is very like a quirk, but it seems there are a lot of
> platforms that have problems, and removing the pci_disable_device()
> has no side effect.

Why is there no side effect?

AFAIK, kexec goes through the shutdown path and you are leaving a PCI
device enabled during kexec boot which can corrupt the booting OS
memory.

I don't think you can generalize a behavior based on a few quirky
devices. You should be quirking only the device that has a problem
rather than changing the behavior of all other platforms.
Huacai Chen Sept. 22, 2020, 6:16 a.m. UTC | #4
Hi, Sinan,

On Tue, Sep 22, 2020 at 10:11 AM Huacai Chen <chenhc@lemote.com> wrote:
>
> Hi, Sinan,
>
> On Mon, Sep 21, 2020 at 11:50 PM Sinan Kaya <okaya@kernel.org> wrote:
> >
> > On 9/21/2020 5:22 AM, Huacai Chen wrote:
> > > Use separate remove()/shutdown() callback, and don't disable pci device
> > > during shutdown. This can avoid some poweroff/reboot failures.
> > >
> > > The poweroff/reboot failures can easily reproduce on Loongson platforms.
> > > I think this is not a Loongson-specific problem, instead, is a problem
> > > related to some specific PCI hosts. On some x86 platforms, radeon/amdgpu
> > > devices can cause the same problem, and commit faefba95c9e8ca3a523831c2e
> > > ("drm/amdgpu: just suspend the hw on pci shutdown") can resolve it.
> >
> > This sounds like a quirk to me rather than a behavior that should be
> > applied to all platforms.
> Yes, this is very like a quirk, but it seems there are a lot of
> platforms that have problems, and removing the pci_disable_device()
> has no side effect.
I have seen that you talk about kexec (but this email didn't go to my
inbox). This has been discussed in another thread, and Lucas told us
that in pci_device_shutdown the Bus Master is disabled for kexec. So I
think there will be no memory corruption. Moreover, before 4.15 there
is no .shutdown callback for portdrv, but kexec still works.

Yes, the perfect way is to modify all problematic drivers, as
radeon/amdgpu does. But I don't have enough knowledge about all of the
devices.

Huacai
>
> Huacai
Tiezhu Yang Sept. 22, 2020, 6:20 a.m. UTC | #5
On 09/22/2020 12:30 PM, Sinan Kaya wrote:
> On 9/21/2020 10:11 PM, Huacai Chen wrote:
>>> his sounds like a quirk to me rather than a behavior that should be
>>> applied to all platforms.
>> Yes, this is very like a quirk, but it seems there are a lot of
>> platforms that have problems, and removing the pci_disable_device()
>> has no side effect.
> Why is there no side effect?
>
> AFAIK, kexec goes through the shutdown path and you are leaving a PCI
> device enabled during kexec boot which can corrupt the booting OS
> memory.

Hi,

The related kexec operations are already executed afterwards by the function
pci_device_shutdown(), this is done by commit 4fc9bbf98fd6 ("PCI: Disable
Bus Master only on kexec reboot") and commit 6e0eda3c3898 ("PCI: Don't try
to disable Bus Master on disconnected PCI devices").

drivers/pci/pci-driver.c
static void pci_device_shutdown(struct device *dev)
{
  ...
         if (drv && drv->shutdown)
                 drv->shutdown(pci_dev);

         /*
          * If this is a kexec reboot, turn off Bus Master bit on the
          * device to tell it to not continue to do DMA. Don't touch
          * devices in D3cold or unknown states.
          * If it is not a kexec reboot, firmware will hit the PCI
          * devices with big hammer and stop their DMA any way.
          */
         if (kexec_in_progress && (pci_dev->current_state <= PCI_D3hot))
                 pci_clear_master(pci_dev);
}

device_shutdown()
   dev->bus->shutdown() == pci_device_shutdown()
     drv->shutdown() == pcie_portdrv_shutdown()
     pci_disable_device()

[   36.159446] Call Trace:
[   36.241688] [<ffffffff80211434>] show_stack+0x9c/0x130
[   36.326619] [<ffffffff80661b70>] dump_stack+0xb0/0xf0
[   36.410403] [<ffffffff806a8240>] pcie_portdrv_shutdown+0x18/0x78
[   36.495302] [<ffffffff8069c6b4>] pci_device_shutdown+0x44/0x90
[   36.580027] [<ffffffff807aac90>] device_shutdown+0x130/0x290
[   36.664486] [<ffffffff80265448>] kernel_power_off+0x38/0x80
[   36.748272] [<ffffffff80265634>] __do_sys_reboot+0x1a4/0x258
[   36.831985] [<ffffffff80218b90>] syscall_common+0x34/0x58

Early discussions:
https://lore.kernel.org/patchwork/patch/1304917/#1499666
https://lore.kernel.org/patchwork/patch/1305067/


Thanks,
Tiezhu

>
> I don't think you can generalize a behavior based on a few quirky
> devices. You should be quirking only the device that has a problem
> rather than changing the behavior of all other platforms.
Huacai Chen Dec. 31, 2020, 6:15 a.m. UTC | #6
Hi, Bjorn

On Tue, Sep 22, 2020 at 2:16 PM Huacai Chen <chenhuacai@gmail.com> wrote:
>
> Hi, Sinan,
>
> On Tue, Sep 22, 2020 at 10:11 AM Huacai Chen <chenhc@lemote.com> wrote:
> >
> > Hi, Sinan,
> >
> > On Mon, Sep 21, 2020 at 11:50 PM Sinan Kaya <okaya@kernel.org> wrote:
> > >
> > > On 9/21/2020 5:22 AM, Huacai Chen wrote:
> > > > Use separate remove()/shutdown() callback, and don't disable pci device
> > > > during shutdown. This can avoid some poweroff/reboot failures.
> > > >
> > > > The poweroff/reboot failures can easily reproduce on Loongson platforms.
> > > > I think this is not a Loongson-specific problem, instead, is a problem
> > > > related to some specific PCI hosts. On some x86 platforms, radeon/amdgpu
> > > > devices can cause the same problem, and commit faefba95c9e8ca3a523831c2e
> > > > ("drm/amdgpu: just suspend the hw on pci shutdown") can resolve it.
> > >
> > > This sounds like a quirk to me rather than a behavior that should be
> > > applied to all platforms.
> > Yes, this is very like a quirk, but it seems there are a lot of
> > platforms that have problems, and removing the pci_disable_device()
> > has no side effect.
> I have seen that you talk about kexec (but this email didn't go to my
> inbox). This has been discussed in another thread, and Lucas told us
> that in pci_device_shutdown the Bus Master is disabled for kexec. So I
> think there will be no memory corruption. Moreover, before 4.15 there
> is no .shutdown callback for portdrv, but kexec still works.
>
> Yes, the perfect way is to modify all problematic drivers, as
> radeon/amdgpu does. But I don't have enough knowledge about all of the
> devices.
>
> Huacai
Any new comments?

Huacai
> >
> > Huacai
diff mbox series

Patch

diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
index af7cf23..22ba7b9 100644
--- a/drivers/pci/pcie/portdrv.h
+++ b/drivers/pci/pcie/portdrv.h
@@ -123,7 +123,7 @@  int pcie_port_device_resume(struct device *dev);
 int pcie_port_device_runtime_suspend(struct device *dev);
 int pcie_port_device_runtime_resume(struct device *dev);
 #endif
-void pcie_port_device_remove(struct pci_dev *dev);
+void pcie_port_device_remove(struct pci_dev *dev, bool disable);
 int __must_check pcie_port_bus_register(void);
 void pcie_port_bus_unregister(void);
 
diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
index 50a9522..aa165be 100644
--- a/drivers/pci/pcie/portdrv_core.c
+++ b/drivers/pci/pcie/portdrv_core.c
@@ -487,11 +487,13 @@  EXPORT_SYMBOL_GPL(pcie_port_find_device);
  * Remove PCI Express port service devices associated with given port and
  * disable MSI-X or MSI for the port.
  */
-void pcie_port_device_remove(struct pci_dev *dev)
+void pcie_port_device_remove(struct pci_dev *dev, bool disable)
 {
 	device_for_each_child(&dev->dev, NULL, remove_iter);
 	pci_free_irq_vectors(dev);
-	pci_disable_device(dev);
+
+	if (disable)
+		pci_disable_device(dev);
 }
 
 /**
diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
index 3a3ce40..e95c474 100644
--- a/drivers/pci/pcie/portdrv_pci.c
+++ b/drivers/pci/pcie/portdrv_pci.c
@@ -142,7 +142,18 @@  static void pcie_portdrv_remove(struct pci_dev *dev)
 		pm_runtime_dont_use_autosuspend(&dev->dev);
 	}
 
-	pcie_port_device_remove(dev);
+	pcie_port_device_remove(dev, true);
+}
+
+static void pcie_portdrv_shutdown(struct pci_dev *dev)
+{
+	if (pci_bridge_d3_possible(dev)) {
+		pm_runtime_forbid(&dev->dev);
+		pm_runtime_get_noresume(&dev->dev);
+		pm_runtime_dont_use_autosuspend(&dev->dev);
+	}
+
+	pcie_port_device_remove(dev, false);
 }
 
 static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
@@ -211,7 +222,7 @@  static struct pci_driver pcie_portdriver = {
 
 	.probe		= pcie_portdrv_probe,
 	.remove		= pcie_portdrv_remove,
-	.shutdown	= pcie_portdrv_remove,
+	.shutdown	= pcie_portdrv_shutdown,
 
 	.err_handler	= &pcie_portdrv_err_handler,