[V13,5/6] PCI: Add quirk for LS7A to avoid reboot failure

Message ID	20220430084846.3127041-6-chenhuacai@loongson.cn (mailing list archive)
State	Superseded
Headers	show Return-Path: <linux-pci-owner@kernel.org> From: Huacai Chen <chenhuacai@loongson.cn> To: Bjorn Helgaas <bhelgaas@google.com>, Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>, Rob Herring <robh@kernel.org>, =?utf-8?q?Krzysztof_Wilczy=C5=84ski?= <kw@linux.com> Cc: linux-pci@vger.kernel.org, Xuefeng Li <lixuefeng@loongson.cn>, Huacai Chen <chenhuacai@gmail.com>, Jiaxun Yang <jiaxun.yang@flygoat.com>, Huacai Chen <chenhuacai@loongson.cn> Subject: [PATCH V13 5/6] PCI: Add quirk for LS7A to avoid reboot failure Date: Sat, 30 Apr 2022 16:48:45 +0800 Message-Id: <20220430084846.3127041-6-chenhuacai@loongson.cn> In-Reply-To: <20220430084846.3127041-1-chenhuacai@loongson.cn> References: <20220430084846.3127041-1-chenhuacai@loongson.cn> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	PCI: Loongson pci improvements and quirks \| expand [V13,0/6] PCI: Loongson pci improvements and quirks [V13,1/6] PCI: loongson: Use generic 8/16/32-bit config ops on LS2K/LS7A [V13,2/6] PCI: loongson: Add ACPI init support [V13,3/6] PCI: loongson: Don't access unexisting devices [V13,4/6] PCI: loongson: Improve the MRRS quirk for LS7A [V13,5/6] PCI: Add quirk for LS7A to avoid reboot failure [V13,6/6] PCI: Add quirk for multifunction devices of LS7A

Huacai Chen April 30, 2022, 8:48 a.m. UTC

Commit cc27b735ad3a75574a ("PCI/portdrv: Turn off PCIe services during
shutdown") causes poweroff/reboot failure on systems with LS7A chipset.
We found that if we remove "pci_command &= ~PCI_COMMAND_MASTER;" in
do_pci_disable_device(), it can work well. The hardware engineer says
that the root cause is that CPU is still accessing PCIe devices while
poweroff/reboot, and if we disable the Bus Master Bit at this time, the
PCIe controller doesn't forward requests to downstream devices, and also
doesn't send TIMEOUT to CPU, which causes CPU wait forever (hardware
deadlock). This behavior is a PCIe protocol violation (Bus Master should
not be involved in CPU MMIO transactions), and it will be fixed in new
revisions of hardware (add timeout mechanism for CPU read request,
whether or not Bus Master bit is cleared).

On some x86 platforms, radeon/amdgpu devices can cause similar problems
[1][2]. Once before I wanted to make a single patch to solve "all of
these problems" together, but it seems unreasonable because maybe they
are not exactly the same problem. So, this patch just add a quirk for
LS7A to avoid clearing Bus Master bit in pcie_port_device_remove(), and
leave other platforms as is.

[1] https://bugs.freedesktop.org/show_bug.cgi?id=97980
[2] https://bugs.freedesktop.org/show_bug.cgi?id=98638

Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
---
 drivers/pci/controller/pci-loongson.c | 17 +++++++++++++++++
 drivers/pci/pcie/portdrv_core.c       |  6 +++++-
 include/linux/pci.h                   |  1 +
 3 files changed, 23 insertions(+), 1 deletion(-)

Bjorn Helgaas May 31, 2022, 11:35 p.m. UTC | #1

On Sat, Apr 30, 2022 at 04:48:45PM +0800, Huacai Chen wrote:
> Commit cc27b735ad3a75574a ("PCI/portdrv: Turn off PCIe services during
> shutdown") 

Usual quoting style would be 

  cc27b735ad3a ("PCI/portdrv: Turn off PCIe services during shutdown")
  causes poweroff/reboot ...

> causes poweroff/reboot failure on systems with LS7A chipset.
> We found that if we remove "pci_command &= ~PCI_COMMAND_MASTER;" in
> do_pci_disable_device(), it can work well. The hardware engineer says
> that the root cause is that CPU is still accessing PCIe devices while
> poweroff/reboot, and if we disable the Bus Master Bit at this time, the
> PCIe controller doesn't forward requests to downstream devices, and also
> doesn't send TIMEOUT to CPU, which causes CPU wait forever (hardware
> deadlock). This behavior is a PCIe protocol violation (Bus Master should
> not be involved in CPU MMIO transactions), and it will be fixed in new
> revisions of hardware (add timeout mechanism for CPU read request,
> whether or not Bus Master bit is cleared).

LS7A might have bugs in that clearing Bus Master Enable prevents the
root port from forwarding Memory or I/O requests in the downstream
direction.

But this feels like a bit of a band-aid because we don't know exactly
what those requests are.  If we're removing the Root Port, I assume we
think we no longer need any devices *below* the Root Port.

If that's not the case, e.g., if we still need to produce console
output or save state to a device, we probably should not be removing
the Root Port at all.

> On some x86 platforms, radeon/amdgpu devices can cause similar problems
> [1][2]. Once before I wanted to make a single patch to solve "all of
> these problems" together, but it seems unreasonable because maybe they
> are not exactly the same problem. So, this patch just add a quirk for
> LS7A to avoid clearing Bus Master bit in pcie_port_device_remove(), and
> leave other platforms as is.
> 
> [1] https://bugs.freedesktop.org/show_bug.cgi?id=97980
> [2] https://bugs.freedesktop.org/show_bug.cgi?id=98638
> 
> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
> ---
>  drivers/pci/controller/pci-loongson.c | 17 +++++++++++++++++
>  drivers/pci/pcie/portdrv_core.c       |  6 +++++-
>  include/linux/pci.h                   |  1 +
>  3 files changed, 23 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/controller/pci-loongson.c b/drivers/pci/controller/pci-loongson.c
> index 83447264048a..49d8b8c24ffb 100644
> --- a/drivers/pci/controller/pci-loongson.c
> +++ b/drivers/pci/controller/pci-loongson.c
> @@ -85,6 +85,23 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
>  			DEV_PCIE_PORT_2, loongson_mrrs_quirk);
>  
> +static void loongson_bmaster_quirk(struct pci_dev *pdev)
> +{
> +	/*
> +	 * Some Loongson PCIe ports will cause CPU deadlock if disable
> +	 * the Bus Master bit during poweroff/reboot.
> +	 */
> +	struct pci_host_bridge *bridge = pci_find_host_bridge(pdev->bus);
> +
> +	bridge->no_dis_bmaster = 1;
> +}
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> +			DEV_PCIE_PORT_0, loongson_bmaster_quirk);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> +			DEV_PCIE_PORT_1, loongson_bmaster_quirk);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> +			DEV_PCIE_PORT_2, loongson_bmaster_quirk);
> +
>  static struct loongson_pci *pci_bus_to_loongson_pci(struct pci_bus *bus)
>  {
>  	struct pci_config_window *cfg;
> diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
> index 604feeb84ee4..23f41e31a6c6 100644
> --- a/drivers/pci/pcie/portdrv_core.c
> +++ b/drivers/pci/pcie/portdrv_core.c
> @@ -491,9 +491,13 @@ EXPORT_SYMBOL_GPL(pcie_port_find_device);
>   */
>  void pcie_port_device_remove(struct pci_dev *dev)
>  {
> +	struct pci_host_bridge *bridge = pci_find_host_bridge(dev->bus);
> +
>  	device_for_each_child(&dev->dev, NULL, remove_iter);
>  	pci_free_irq_vectors(dev);
> -	pci_disable_device(dev);
> +
> +	if (!bridge->no_dis_bmaster)
> +		pci_disable_device(dev);
>  }
>  
>  /**
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index d146eb28e6da..c52d6486ff99 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -570,6 +570,7 @@ struct pci_host_bridge {
>  	unsigned int	ignore_reset_delay:1;	/* For entire hierarchy */
>  	unsigned int	no_ext_tags:1;		/* No Extended Tags */
>  	unsigned int	no_inc_mrrs:1;		/* No Increase MRRS */
> +	unsigned int	no_dis_bmaster:1;	/* No Disable Bus Master */
>  	unsigned int	native_aer:1;		/* OS may use PCIe AER */
>  	unsigned int	native_pcie_hotplug:1;	/* OS may use PCIe hotplug */
>  	unsigned int	native_shpc_hotplug:1;	/* OS may use SHPC hotplug */
> -- 
> 2.27.0
>

Huacai Chen June 2, 2022, 12:48 p.m. UTC | #2

Hi, Bjorn,

On Wed, Jun 1, 2022 at 7:35 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Sat, Apr 30, 2022 at 04:48:45PM +0800, Huacai Chen wrote:
> > Commit cc27b735ad3a75574a ("PCI/portdrv: Turn off PCIe services during
> > shutdown")
>
> Usual quoting style would be
>
>   cc27b735ad3a ("PCI/portdrv: Turn off PCIe services during shutdown")
>   causes poweroff/reboot ...
OK, thanks.

>
> > causes poweroff/reboot failure on systems with LS7A chipset.
> > We found that if we remove "pci_command &= ~PCI_COMMAND_MASTER;" in
> > do_pci_disable_device(), it can work well. The hardware engineer says
> > that the root cause is that CPU is still accessing PCIe devices while
> > poweroff/reboot, and if we disable the Bus Master Bit at this time, the
> > PCIe controller doesn't forward requests to downstream devices, and also
> > doesn't send TIMEOUT to CPU, which causes CPU wait forever (hardware
> > deadlock). This behavior is a PCIe protocol violation (Bus Master should
> > not be involved in CPU MMIO transactions), and it will be fixed in new
> > revisions of hardware (add timeout mechanism for CPU read request,
> > whether or not Bus Master bit is cleared).
>
> LS7A might have bugs in that clearing Bus Master Enable prevents the
> root port from forwarding Memory or I/O requests in the downstream
> direction.
>
> But this feels like a bit of a band-aid because we don't know exactly
> what those requests are.  If we're removing the Root Port, I assume we
> think we no longer need any devices *below* the Root Port.
>
> If that's not the case, e.g., if we still need to produce console
> output or save state to a device, we probably should not be removing
> the Root Port at all.
Do you mean it is better to skip the whole pcie_port_device_remove()
instead of just removing the "clear bus master" operation for the
buggy hardware?

Huacai
>
> > On some x86 platforms, radeon/amdgpu devices can cause similar problems
> > [1][2]. Once before I wanted to make a single patch to solve "all of
> > these problems" together, but it seems unreasonable because maybe they
> > are not exactly the same problem. So, this patch just add a quirk for
> > LS7A to avoid clearing Bus Master bit in pcie_port_device_remove(), and
> > leave other platforms as is.
> >
> > [1] https://bugs.freedesktop.org/show_bug.cgi?id=97980
> > [2] https://bugs.freedesktop.org/show_bug.cgi?id=98638
> >
> > Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
> > ---
> >  drivers/pci/controller/pci-loongson.c | 17 +++++++++++++++++
> >  drivers/pci/pcie/portdrv_core.c       |  6 +++++-
> >  include/linux/pci.h                   |  1 +
> >  3 files changed, 23 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/pci/controller/pci-loongson.c b/drivers/pci/controller/pci-loongson.c
> > index 83447264048a..49d8b8c24ffb 100644
> > --- a/drivers/pci/controller/pci-loongson.c
> > +++ b/drivers/pci/controller/pci-loongson.c
> > @@ -85,6 +85,23 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> >  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> >                       DEV_PCIE_PORT_2, loongson_mrrs_quirk);
> >
> > +static void loongson_bmaster_quirk(struct pci_dev *pdev)
> > +{
> > +     /*
> > +      * Some Loongson PCIe ports will cause CPU deadlock if disable
> > +      * the Bus Master bit during poweroff/reboot.
> > +      */
> > +     struct pci_host_bridge *bridge = pci_find_host_bridge(pdev->bus);
> > +
> > +     bridge->no_dis_bmaster = 1;
> > +}
> > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> > +                     DEV_PCIE_PORT_0, loongson_bmaster_quirk);
> > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> > +                     DEV_PCIE_PORT_1, loongson_bmaster_quirk);
> > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
> > +                     DEV_PCIE_PORT_2, loongson_bmaster_quirk);
> > +
> >  static struct loongson_pci *pci_bus_to_loongson_pci(struct pci_bus *bus)
> >  {
> >       struct pci_config_window *cfg;
> > diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
> > index 604feeb84ee4..23f41e31a6c6 100644
> > --- a/drivers/pci/pcie/portdrv_core.c
> > +++ b/drivers/pci/pcie/portdrv_core.c
> > @@ -491,9 +491,13 @@ EXPORT_SYMBOL_GPL(pcie_port_find_device);
> >   */
> >  void pcie_port_device_remove(struct pci_dev *dev)
> >  {
> > +     struct pci_host_bridge *bridge = pci_find_host_bridge(dev->bus);
> > +
> >       device_for_each_child(&dev->dev, NULL, remove_iter);
> >       pci_free_irq_vectors(dev);
> > -     pci_disable_device(dev);
> > +
> > +     if (!bridge->no_dis_bmaster)
> > +             pci_disable_device(dev);
> >  }
> >
> >  /**
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index d146eb28e6da..c52d6486ff99 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -570,6 +570,7 @@ struct pci_host_bridge {
> >       unsigned int    ignore_reset_delay:1;   /* For entire hierarchy */
> >       unsigned int    no_ext_tags:1;          /* No Extended Tags */
> >       unsigned int    no_inc_mrrs:1;          /* No Increase MRRS */
> > +     unsigned int    no_dis_bmaster:1;       /* No Disable Bus Master */
> >       unsigned int    native_aer:1;           /* OS may use PCIe AER */
> >       unsigned int    native_pcie_hotplug:1;  /* OS may use PCIe hotplug */
> >       unsigned int    native_shpc_hotplug:1;  /* OS may use SHPC hotplug */
> > --
> > 2.27.0
> >

Bjorn Helgaas June 2, 2022, 4:29 p.m. UTC | #3

On Thu, Jun 02, 2022 at 08:48:20PM +0800, Huacai Chen wrote:
> Hi, Bjorn,
> 
> On Wed, Jun 1, 2022 at 7:35 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >
> > On Sat, Apr 30, 2022 at 04:48:45PM +0800, Huacai Chen wrote:
> > > Commit cc27b735ad3a75574a ("PCI/portdrv: Turn off PCIe services
> > > during shutdown") causes poweroff/reboot failure on systems with
> > > LS7A chipset.  We found that if we remove "pci_command &=
> > > ~PCI_COMMAND_MASTER;" in do_pci_disable_device(), it can work
> > > well. The hardware engineer says that the root cause is that CPU
> > > is still accessing PCIe devices while poweroff/reboot, and if we
> > > disable the Bus Master Bit at this time, the PCIe controller
> > > doesn't forward requests to downstream devices, and also doesn't
> > > send TIMEOUT to CPU, which causes CPU wait forever (hardware
> > > deadlock). This behavior is a PCIe protocol violation (Bus
> > > Master should not be involved in CPU MMIO transactions), and it
> > > will be fixed in new revisions of hardware (add timeout
> > > mechanism for CPU read request, whether or not Bus Master bit is
> > > cleared).
> >
> > LS7A might have bugs in that clearing Bus Master Enable prevents the
> > root port from forwarding Memory or I/O requests in the downstream
> > direction.
> >
> > But this feels like a bit of a band-aid because we don't know exactly
> > what those requests are.  If we're removing the Root Port, I assume we
> > think we no longer need any devices *below* the Root Port.
> >
> > If that's not the case, e.g., if we still need to produce console
> > output or save state to a device, we probably should not be removing
> > the Root Port at all.
>
> Do you mean it is better to skip the whole pcie_port_device_remove()
> instead of just removing the "clear bus master" operation for the
> buggy hardware?

No, that's not what I want at all.  That's just another band-aid to
avoid a problem without understanding what the problem is.

My point is that apparently we remove a Root Port (which means we've
already removed any devices under it), and then we try to use a device
below the Root Port.  That seems broken.  I want to understand why we
try to use a device after we've removed it.

If the scenario ends up being legitimate and unavoidable, fine -- we
can figure out a quirk to work around the fact the LS7A doesn't allow
that access after we clear Bus Master Enable.  But right now the
scenario smells like a latent bug, and leaving bus mastering enabled 
just avoids it without fixing it.

Bjorn

Huacai Chen June 8, 2022, 9:34 a.m. UTC | #4

Hi, Bjorn,

On Fri, Jun 3, 2022 at 12:29 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Thu, Jun 02, 2022 at 08:48:20PM +0800, Huacai Chen wrote:
> > Hi, Bjorn,
> >
> > On Wed, Jun 1, 2022 at 7:35 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > >
> > > On Sat, Apr 30, 2022 at 04:48:45PM +0800, Huacai Chen wrote:
> > > > Commit cc27b735ad3a75574a ("PCI/portdrv: Turn off PCIe services
> > > > during shutdown") causes poweroff/reboot failure on systems with
> > > > LS7A chipset.  We found that if we remove "pci_command &=
> > > > ~PCI_COMMAND_MASTER;" in do_pci_disable_device(), it can work
> > > > well. The hardware engineer says that the root cause is that CPU
> > > > is still accessing PCIe devices while poweroff/reboot, and if we
> > > > disable the Bus Master Bit at this time, the PCIe controller
> > > > doesn't forward requests to downstream devices, and also doesn't
> > > > send TIMEOUT to CPU, which causes CPU wait forever (hardware
> > > > deadlock). This behavior is a PCIe protocol violation (Bus
> > > > Master should not be involved in CPU MMIO transactions), and it
> > > > will be fixed in new revisions of hardware (add timeout
> > > > mechanism for CPU read request, whether or not Bus Master bit is
> > > > cleared).
> > >
> > > LS7A might have bugs in that clearing Bus Master Enable prevents the
> > > root port from forwarding Memory or I/O requests in the downstream
> > > direction.
> > >
> > > But this feels like a bit of a band-aid because we don't know exactly
> > > what those requests are.  If we're removing the Root Port, I assume we
> > > think we no longer need any devices *below* the Root Port.
> > >
> > > If that's not the case, e.g., if we still need to produce console
> > > output or save state to a device, we probably should not be removing
> > > the Root Port at all.
> >
> > Do you mean it is better to skip the whole pcie_port_device_remove()
> > instead of just removing the "clear bus master" operation for the
> > buggy hardware?
>
> No, that's not what I want at all.  That's just another band-aid to
> avoid a problem without understanding what the problem is.
>
> My point is that apparently we remove a Root Port (which means we've
> already removed any devices under it), and then we try to use a device
> below the Root Port.  That seems broken.  I want to understand why we
> try to use a device after we've removed it.
I agree, and I think "why we try to use a device after remove it" is
because the userspace programs don't know whether a device is
"usable", so they just use it, at any time. Then it seems it is the
responsibility of the device drivers to avoid the problem.

Take radeon/amdgpu driver as an example, the .shutdown() of the
callback can call suspend() to fix.

But..., another problem is: There are many drivers "broken", not just
radeon/amdgpu drivers (at least, the ahci driver is also "broken").
Implementing the driver's .shutdown() correctly is nearly impossible
for us, because we don't know the hardware details of so many devices.
On the other hand, those drivers "just works" on most platforms, so
the driver authors don't think they are "broken".

So, very sadly, we can only use a band-aid now. :(

Huacai

>
> If the scenario ends up being legitimate and unavoidable, fine -- we
> can figure out a quirk to work around the fact the LS7A doesn't allow
> that access after we clear Bus Master Enable.  But right now the
> scenario smells like a latent bug, and leaving bus mastering enabled
> just avoids it without fixing it.
>
> Bjorn

Bjorn Helgaas June 8, 2022, 7:31 p.m. UTC | #5

On Wed, Jun 08, 2022 at 05:34:21PM +0800, Huacai Chen wrote:
> On Fri, Jun 3, 2022 at 12:29 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Thu, Jun 02, 2022 at 08:48:20PM +0800, Huacai Chen wrote:
> > > On Wed, Jun 1, 2022 at 7:35 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > On Sat, Apr 30, 2022 at 04:48:45PM +0800, Huacai Chen wrote:
> > > > > Commit cc27b735ad3a75574a ("PCI/portdrv: Turn off PCIe services
> > > > > during shutdown") causes poweroff/reboot failure on systems with
> > > > > LS7A chipset.  We found that if we remove "pci_command &=
> > > > > ~PCI_COMMAND_MASTER;" in do_pci_disable_device(), it can work
> > > > > well. The hardware engineer says that the root cause is that CPU
> > > > > is still accessing PCIe devices while poweroff/reboot, and if we
> > > > > disable the Bus Master Bit at this time, the PCIe controller
> > > > > doesn't forward requests to downstream devices, and also doesn't
> > > > > send TIMEOUT to CPU, which causes CPU wait forever (hardware
> > > > > deadlock). This behavior is a PCIe protocol violation (Bus
> > > > > Master should not be involved in CPU MMIO transactions), and it
> > > > > will be fixed in new revisions of hardware (add timeout
> > > > > mechanism for CPU read request, whether or not Bus Master bit is
> > > > > cleared).
> > > >
> > > > LS7A might have bugs in that clearing Bus Master Enable prevents the
> > > > root port from forwarding Memory or I/O requests in the downstream
> > > > direction.
> > > >
> > > > But this feels like a bit of a band-aid because we don't know exactly
> > > > what those requests are.  If we're removing the Root Port, I assume we
> > > > think we no longer need any devices *below* the Root Port.
> > > >
> > > > If that's not the case, e.g., if we still need to produce console
> > > > output or save state to a device, we probably should not be removing
> > > > the Root Port at all.
> > >
> > > Do you mean it is better to skip the whole pcie_port_device_remove()
> > > instead of just removing the "clear bus master" operation for the
> > > buggy hardware?
> >
> > No, that's not what I want at all.  That's just another band-aid to
> > avoid a problem without understanding what the problem is.
> >
> > My point is that apparently we remove a Root Port (which means we've
> > already removed any devices under it), and then we try to use a device
> > below the Root Port.  That seems broken.  I want to understand why we
> > try to use a device after we've removed it.
>
> I agree, and I think "why we try to use a device after remove it" is
> because the userspace programs don't know whether a device is
> "usable", so they just use it, at any time. Then it seems it is the
> responsibility of the device drivers to avoid the problem.

How is userspace able to use a device after the device is removed?
E.g., if userspace does a read/write to a device that has been
removed, the syscall should return error, not touch the missing
device.  If userspace mmaps a device, an access after the device has
been removed should fault, not do MMIO to the missing device.

> Take radeon/amdgpu driver as an example, the .shutdown() of the
> callback can call suspend() to fix.
> 
> But..., another problem is: There are many drivers "broken", not just
> radeon/amdgpu drivers (at least, the ahci driver is also "broken").
> Implementing the driver's .shutdown() correctly is nearly impossible
> for us, because we don't know the hardware details of so many devices.
> On the other hand, those drivers "just works" on most platforms, so
> the driver authors don't think they are "broken".

It sounds like you have analyzed this and have more details about
exactly how this happens.  Can you please share those details?  There
are a lot of steps in the middle that I don't understand yet.

Without those details, "userspace programs don't know whether a device
is 'usable', so they just use it, at any time" is kind of hand-wavey
and not very convincing.

> So, very sadly, we can only use a band-aid now. :(
> 
> > If the scenario ends up being legitimate and unavoidable, fine -- we
> > can figure out a quirk to work around the fact the LS7A doesn't allow
> > that access after we clear Bus Master Enable.  But right now the
> > scenario smells like a latent bug, and leaving bus mastering enabled
> > just avoids it without fixing it.
> >
> > Bjorn

Huacai Chen June 16, 2022, 8:39 a.m. UTC | #6

Hi, Bjorn,

On Thu, Jun 9, 2022 at 3:31 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Wed, Jun 08, 2022 at 05:34:21PM +0800, Huacai Chen wrote:
> > On Fri, Jun 3, 2022 at 12:29 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Thu, Jun 02, 2022 at 08:48:20PM +0800, Huacai Chen wrote:
> > > > On Wed, Jun 1, 2022 at 7:35 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > On Sat, Apr 30, 2022 at 04:48:45PM +0800, Huacai Chen wrote:
> > > > > > Commit cc27b735ad3a75574a ("PCI/portdrv: Turn off PCIe services
> > > > > > during shutdown") causes poweroff/reboot failure on systems with
> > > > > > LS7A chipset.  We found that if we remove "pci_command &=
> > > > > > ~PCI_COMMAND_MASTER;" in do_pci_disable_device(), it can work
> > > > > > well. The hardware engineer says that the root cause is that CPU
> > > > > > is still accessing PCIe devices while poweroff/reboot, and if we
> > > > > > disable the Bus Master Bit at this time, the PCIe controller
> > > > > > doesn't forward requests to downstream devices, and also doesn't
> > > > > > send TIMEOUT to CPU, which causes CPU wait forever (hardware
> > > > > > deadlock). This behavior is a PCIe protocol violation (Bus
> > > > > > Master should not be involved in CPU MMIO transactions), and it
> > > > > > will be fixed in new revisions of hardware (add timeout
> > > > > > mechanism for CPU read request, whether or not Bus Master bit is
> > > > > > cleared).
> > > > >
> > > > > LS7A might have bugs in that clearing Bus Master Enable prevents the
> > > > > root port from forwarding Memory or I/O requests in the downstream
> > > > > direction.
> > > > >
> > > > > But this feels like a bit of a band-aid because we don't know exactly
> > > > > what those requests are.  If we're removing the Root Port, I assume we
> > > > > think we no longer need any devices *below* the Root Port.
> > > > >
> > > > > If that's not the case, e.g., if we still need to produce console
> > > > > output or save state to a device, we probably should not be removing
> > > > > the Root Port at all.
> > > >
> > > > Do you mean it is better to skip the whole pcie_port_device_remove()
> > > > instead of just removing the "clear bus master" operation for the
> > > > buggy hardware?
> > >
> > > No, that's not what I want at all.  That's just another band-aid to
> > > avoid a problem without understanding what the problem is.
> > >
> > > My point is that apparently we remove a Root Port (which means we've
> > > already removed any devices under it), and then we try to use a device
> > > below the Root Port.  That seems broken.  I want to understand why we
> > > try to use a device after we've removed it.
> >
> > I agree, and I think "why we try to use a device after remove it" is
> > because the userspace programs don't know whether a device is
> > "usable", so they just use it, at any time. Then it seems it is the
> > responsibility of the device drivers to avoid the problem.
>
> How is userspace able to use a device after the device is removed?
> E.g., if userspace does a read/write to a device that has been
> removed, the syscall should return error, not touch the missing
> device.  If userspace mmaps a device, an access after the device has
> been removed should fault, not do MMIO to the missing device.
To give more details, let's take the graphics driver (e.g. amdgpu) as
an example again. The userspace programs call printf() to display
"shutting down xxx service" during shutdown/reboot. Or we can even
simplify further, the kernel calls printk() to display something
during shutdown/reboot. You know, printk() can happen at any time,
even after we call pcie_port_device_remove() to disable the pcie port
on the graphic card.

The call stack is: printk() --> call_console_drivers() -->
con->write() --> vt_console_print() --> fbcon_putcs()

I think this is a scenario of "userspace programs (or kernel itself)
don't know whether a device is 'usable', so they just use it, at any
time"

And why does the graphic driver call .suspend() in its .shutdown() can
fix the problem?
One of the key operations in .suspend() is drm_fb_helper_set_suspend()
--> fb_set_suspend() --> info->state = FBINFO_STATE_SUSPENDED;
This operation will cause fbcon_is_inactive() to return true, then
refuse fbcon_putcs() to really write to the framebuffer.

Huacai


>
> > Take radeon/amdgpu driver as an example, the .shutdown() of the
> > callback can call suspend() to fix.
> >
> > But..., another problem is: There are many drivers "broken", not just
> > radeon/amdgpu drivers (at least, the ahci driver is also "broken").
> > Implementing the driver's .shutdown() correctly is nearly impossible
> > for us, because we don't know the hardware details of so many devices.
> > On the other hand, those drivers "just works" on most platforms, so
> > the driver authors don't think they are "broken".
>
> It sounds like you have analyzed this and have more details about
> exactly how this happens.  Can you please share those details?  There
> are a lot of steps in the middle that I don't understand yet.
>
> Without those details, "userspace programs don't know whether a device
> is 'usable', so they just use it, at any time" is kind of hand-wavey
> and not very convincing.
>
> > So, very sadly, we can only use a band-aid now. :(
> >
> > > If the scenario ends up being legitimate and unavoidable, fine -- we
> > > can figure out a quirk to work around the fact the LS7A doesn't allow
> > > that access after we clear Bus Master Enable.  But right now the
> > > scenario smells like a latent bug, and leaving bus mastering enabled
> > > just avoids it without fixing it.
> > >
> > > Bjorn

Bjorn Helgaas June 16, 2022, 10:57 p.m. UTC | #7

On Thu, Jun 16, 2022 at 04:39:46PM +0800, Huacai Chen wrote:
> On Thu, Jun 9, 2022 at 3:31 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Wed, Jun 08, 2022 at 05:34:21PM +0800, Huacai Chen wrote:
> > > On Fri, Jun 3, 2022 at 12:29 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > On Thu, Jun 02, 2022 at 08:48:20PM +0800, Huacai Chen wrote:
> > > > > On Wed, Jun 1, 2022 at 7:35 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > On Sat, Apr 30, 2022 at 04:48:45PM +0800, Huacai Chen wrote:
> > > > > > > Commit cc27b735ad3a75574a ("PCI/portdrv: Turn off PCIe
> > > > > > > services during shutdown") causes poweroff/reboot
> > > > > > > failure on systems with LS7A chipset.  We found that if
> > > > > > > we remove "pci_command &= ~PCI_COMMAND_MASTER;" in
> > > > > > > do_pci_disable_device(), it can work well. The hardware
> > > > > > > engineer says that the root cause is that CPU is still
> > > > > > > accessing PCIe devices while poweroff/reboot, and if we
> > > > > > > disable the Bus Master Bit at this time, the PCIe
> > > > > > > controller doesn't forward requests to downstream
> > > > > > > devices, and also doesn't send TIMEOUT to CPU, which
> > > > > > > causes CPU wait forever (hardware deadlock). This
> > > > > > > behavior is a PCIe protocol violation (Bus Master should
> > > > > > > not be involved in CPU MMIO transactions), and it will
> > > > > > > be fixed in new revisions of hardware (add timeout
> > > > > > > mechanism for CPU read request, whether or not Bus
> > > > > > > Master bit is cleared).
> > > > > >
> > > > > > LS7A might have bugs in that clearing Bus Master Enable
> > > > > > prevents the root port from forwarding Memory or I/O
> > > > > > requests in the downstream direction.
> > > > > >
> > > > > > But this feels like a bit of a band-aid because we don't
> > > > > > know exactly what those requests are.  If we're removing
> > > > > > the Root Port, I assume we think we no longer need any
> > > > > > devices *below* the Root Port.
> > > > > >
> > > > > > If that's not the case, e.g., if we still need to produce
> > > > > > console output or save state to a device, we probably
> > > > > > should not be removing the Root Port at all.
> > > > >
> > > > > Do you mean it is better to skip the whole
> > > > > pcie_port_device_remove() instead of just removing the
> > > > > "clear bus master" operation for the buggy hardware?
> > > >
> > > > No, that's not what I want at all.  That's just another
> > > > band-aid to avoid a problem without understanding what the
> > > > problem is.
> > > >
> > > > My point is that apparently we remove a Root Port (which means
> > > > we've already removed any devices under it), and then we try
> > > > to use a device below the Root Port.  That seems broken.  I
> > > > want to understand why we try to use a device after we've
> > > > removed it.
> > >
> > > I agree, and I think "why we try to use a device after remove
> > > it" is because the userspace programs don't know whether a
> > > device is "usable", so they just use it, at any time. Then it
> > > seems it is the responsibility of the device drivers to avoid
> > > the problem.
> >
> > How is userspace able to use a device after the device is removed?
> > E.g., if userspace does a read/write to a device that has been
> > removed, the syscall should return error, not touch the missing
> > device.  If userspace mmaps a device, an access after the device
> > has been removed should fault, not do MMIO to the missing device.
>
> To give more details, let's take the graphics driver (e.g. amdgpu)
> as an example again. The userspace programs call printf() to display
> "shutting down xxx service" during shutdown/reboot. Or we can even
> simplify further, the kernel calls printk() to display something
> during shutdown/reboot. You know, printk() can happen at any time,
> even after we call pcie_port_device_remove() to disable the pcie
> port on the graphic card.

I've been focusing on the *remove* path, but you said the problem
you're solving is with *poweroff/reboot*.  pcie_portdrv_remove() is
used for both paths, but if there's a reason we need those paths to be
different, we might be able to split them.

For remove, we have to assume accesses to the device may already or
will soon fail.  A driver that touches the device, or a device that
performs DMA, after its drv->remove() method has been called would be
seriously broken.  The remove operation also unbinds the driver from
the device.

The semantics of device_shutdown(), pci_device_shutdown(), and any
drv->shutdown() methods are not documented very well.  This path is
only used for halt/poweroff/reboot, so my guess is that not very much
is actually required, and it doesn't do anything at all with respect
to driver binding.

I think we mainly want to disable things that might interfere with
halt/poweroff/reboot, like interrupts and maybe DMA.  We disable DMA
in this path to prevent devices from corrupting memory, but I'm more
open to a quirk in the shutdown path than in the remove path.  So how
about if you clone pcie_port_device_remove() to make a separate
pcie_port_device_shutdown() and put the quirk there?

> The call stack is: printk() --> call_console_drivers() -->
> con->write() --> vt_console_print() --> fbcon_putcs()
> 
> I think this is a scenario of "userspace programs (or kernel itself)
> don't know whether a device is 'usable', so they just use it, at any
> time"
> 
> And why does the graphic driver call .suspend() in its .shutdown() can
> fix the problem?
>
> One of the key operations in .suspend() is drm_fb_helper_set_suspend()
> --> fb_set_suspend() --> info->state = FBINFO_STATE_SUSPENDED;
>
> This operation will cause fbcon_is_inactive() to return true, then
> refuse fbcon_putcs() to really write to the framebuffer.

> > > Take radeon/amdgpu driver as an example, the .shutdown() of the
> > > callback can call suspend() to fix.
> > >
> > > But..., another problem is: There are many drivers "broken", not just
> > > radeon/amdgpu drivers (at least, the ahci driver is also "broken").
> > > Implementing the driver's .shutdown() correctly is nearly impossible
> > > for us, because we don't know the hardware details of so many devices.
> > > On the other hand, those drivers "just works" on most platforms, so
> > > the driver authors don't think they are "broken".
> >
> > It sounds like you have analyzed this and have more details about
> > exactly how this happens.  Can you please share those details?  There
> > are a lot of steps in the middle that I don't understand yet.
> >
> > Without those details, "userspace programs don't know whether a device
> > is 'usable', so they just use it, at any time" is kind of hand-wavey
> > and not very convincing.
> >
> > > So, very sadly, we can only use a band-aid now. :(
> > >
> > > > If the scenario ends up being legitimate and unavoidable, fine -- we
> > > > can figure out a quirk to work around the fact the LS7A doesn't allow
> > > > that access after we clear Bus Master Enable.  But right now the
> > > > scenario smells like a latent bug, and leaving bus mastering enabled
> > > > just avoids it without fixing it.
> > > >
> > > > Bjorn

Huacai Chen June 17, 2022, 2:21 a.m. UTC | #8

Hi, Bjorn,

On Fri, Jun 17, 2022 at 6:57 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Thu, Jun 16, 2022 at 04:39:46PM +0800, Huacai Chen wrote:
> > On Thu, Jun 9, 2022 at 3:31 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Wed, Jun 08, 2022 at 05:34:21PM +0800, Huacai Chen wrote:
> > > > On Fri, Jun 3, 2022 at 12:29 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > On Thu, Jun 02, 2022 at 08:48:20PM +0800, Huacai Chen wrote:
> > > > > > On Wed, Jun 1, 2022 at 7:35 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > > On Sat, Apr 30, 2022 at 04:48:45PM +0800, Huacai Chen wrote:
> > > > > > > > Commit cc27b735ad3a75574a ("PCI/portdrv: Turn off PCIe
> > > > > > > > services during shutdown") causes poweroff/reboot
> > > > > > > > failure on systems with LS7A chipset.  We found that if
> > > > > > > > we remove "pci_command &= ~PCI_COMMAND_MASTER;" in
> > > > > > > > do_pci_disable_device(), it can work well. The hardware
> > > > > > > > engineer says that the root cause is that CPU is still
> > > > > > > > accessing PCIe devices while poweroff/reboot, and if we
> > > > > > > > disable the Bus Master Bit at this time, the PCIe
> > > > > > > > controller doesn't forward requests to downstream
> > > > > > > > devices, and also doesn't send TIMEOUT to CPU, which
> > > > > > > > causes CPU wait forever (hardware deadlock). This
> > > > > > > > behavior is a PCIe protocol violation (Bus Master should
> > > > > > > > not be involved in CPU MMIO transactions), and it will
> > > > > > > > be fixed in new revisions of hardware (add timeout
> > > > > > > > mechanism for CPU read request, whether or not Bus
> > > > > > > > Master bit is cleared).
> > > > > > >
> > > > > > > LS7A might have bugs in that clearing Bus Master Enable
> > > > > > > prevents the root port from forwarding Memory or I/O
> > > > > > > requests in the downstream direction.
> > > > > > >
> > > > > > > But this feels like a bit of a band-aid because we don't
> > > > > > > know exactly what those requests are.  If we're removing
> > > > > > > the Root Port, I assume we think we no longer need any
> > > > > > > devices *below* the Root Port.
> > > > > > >
> > > > > > > If that's not the case, e.g., if we still need to produce
> > > > > > > console output or save state to a device, we probably
> > > > > > > should not be removing the Root Port at all.
> > > > > >
> > > > > > Do you mean it is better to skip the whole
> > > > > > pcie_port_device_remove() instead of just removing the
> > > > > > "clear bus master" operation for the buggy hardware?
> > > > >
> > > > > No, that's not what I want at all.  That's just another
> > > > > band-aid to avoid a problem without understanding what the
> > > > > problem is.
> > > > >
> > > > > My point is that apparently we remove a Root Port (which means
> > > > > we've already removed any devices under it), and then we try
> > > > > to use a device below the Root Port.  That seems broken.  I
> > > > > want to understand why we try to use a device after we've
> > > > > removed it.
> > > >
> > > > I agree, and I think "why we try to use a device after remove
> > > > it" is because the userspace programs don't know whether a
> > > > device is "usable", so they just use it, at any time. Then it
> > > > seems it is the responsibility of the device drivers to avoid
> > > > the problem.
> > >
> > > How is userspace able to use a device after the device is removed?
> > > E.g., if userspace does a read/write to a device that has been
> > > removed, the syscall should return error, not touch the missing
> > > device.  If userspace mmaps a device, an access after the device
> > > has been removed should fault, not do MMIO to the missing device.
> >
> > To give more details, let's take the graphics driver (e.g. amdgpu)
> > as an example again. The userspace programs call printf() to display
> > "shutting down xxx service" during shutdown/reboot. Or we can even
> > simplify further, the kernel calls printk() to display something
> > during shutdown/reboot. You know, printk() can happen at any time,
> > even after we call pcie_port_device_remove() to disable the pcie
> > port on the graphic card.
>
> I've been focusing on the *remove* path, but you said the problem
> you're solving is with *poweroff/reboot*.  pcie_portdrv_remove() is
> used for both paths, but if there's a reason we need those paths to be
> different, we might be able to split them.
I'm very sorry for that. I have misunderstood before because I suppose
the "remove path" is the pcie_portdrv_remove() function, but your
meaning is the .remove() callback in pcie_portdriver. Am I right this
time?

>
> For remove, we have to assume accesses to the device may already or
> will soon fail.  A driver that touches the device, or a device that
> performs DMA, after its drv->remove() method has been called would be
> seriously broken.  The remove operation also unbinds the driver from
> the device.
Then what will happen about the "remove path"? If we still take the
graphics driver as an example, "rmmod amdgpu" always fails with
"device is busy" because the graphics card is always be used once
after the driver is loaded. So the "remove path" has no chance to be
executed. But if we take a NIC driver as an example, "rmmod igb" can
mostly succeed, and there will be no access on the device after
removing, at least in our observation. I think there is nothing broken
about the "remove path".

>
> The semantics of device_shutdown(), pci_device_shutdown(), and any
> drv->shutdown() methods are not documented very well.  This path is
> only used for halt/poweroff/reboot, so my guess is that not very much
> is actually required, and it doesn't do anything at all with respect
> to driver binding.
>
> I think we mainly want to disable things that might interfere with
> halt/poweroff/reboot, like interrupts and maybe DMA.  We disable DMA
> in this path to prevent devices from corrupting memory, but I'm more
> open to a quirk in the shutdown path than in the remove path.  So how
> about if you clone pcie_port_device_remove() to make a separate
> pcie_port_device_shutdown() and put the quirk there?
Hmm, I think this is better. So I will clone
pcie_portdrv_remove()/pcie_port_device_remove() to make a separate
pcie_portdrv_shutdown()/pcie_port_device_shutdown() and only apply the
quirk on the shutdown path.

Huacai
>
> > The call stack is: printk() --> call_console_drivers() -->
> > con->write() --> vt_console_print() --> fbcon_putcs()
> >
> > I think this is a scenario of "userspace programs (or kernel itself)
> > don't know whether a device is 'usable', so they just use it, at any
> > time"
> >
> > And why does the graphic driver call .suspend() in its .shutdown() can
> > fix the problem?
> >
> > One of the key operations in .suspend() is drm_fb_helper_set_suspend()
> > --> fb_set_suspend() --> info->state = FBINFO_STATE_SUSPENDED;
> >
> > This operation will cause fbcon_is_inactive() to return true, then
> > refuse fbcon_putcs() to really write to the framebuffer.
>
> > > > Take radeon/amdgpu driver as an example, the .shutdown() of the
> > > > callback can call suspend() to fix.
> > > >
> > > > But..., another problem is: There are many drivers "broken", not just
> > > > radeon/amdgpu drivers (at least, the ahci driver is also "broken").
> > > > Implementing the driver's .shutdown() correctly is nearly impossible
> > > > for us, because we don't know the hardware details of so many devices.
> > > > On the other hand, those drivers "just works" on most platforms, so
> > > > the driver authors don't think they are "broken".
> > >
> > > It sounds like you have analyzed this and have more details about
> > > exactly how this happens.  Can you please share those details?  There
> > > are a lot of steps in the middle that I don't understand yet.
> > >
> > > Without those details, "userspace programs don't know whether a device
> > > is 'usable', so they just use it, at any time" is kind of hand-wavey
> > > and not very convincing.
> > >
> > > > So, very sadly, we can only use a band-aid now. :(
> > > >
> > > > > If the scenario ends up being legitimate and unavoidable, fine -- we
> > > > > can figure out a quirk to work around the fact the LS7A doesn't allow
> > > > > that access after we clear Bus Master Enable.  But right now the
> > > > > scenario smells like a latent bug, and leaving bus mastering enabled
> > > > > just avoids it without fixing it.
> > > > >
> > > > > Bjorn

Bjorn Helgaas June 17, 2022, 11:37 a.m. UTC | #9

On Fri, Jun 17, 2022 at 10:21:14AM +0800, Huacai Chen wrote:
> On Fri, Jun 17, 2022 at 6:57 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Thu, Jun 16, 2022 at 04:39:46PM +0800, Huacai Chen wrote:
> > > On Thu, Jun 9, 2022 at 3:31 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > On Wed, Jun 08, 2022 at 05:34:21PM +0800, Huacai Chen wrote:
> > > > > On Fri, Jun 3, 2022 at 12:29 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > On Thu, Jun 02, 2022 at 08:48:20PM +0800, Huacai Chen wrote:
> > > > > > > On Wed, Jun 1, 2022 at 7:35 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > > > On Sat, Apr 30, 2022 at 04:48:45PM +0800, Huacai Chen wrote:
> > > > > > > > > Commit cc27b735ad3a75574a ("PCI/portdrv: Turn off PCIe
> > > > > > > > > services during shutdown") causes poweroff/reboot
> > > > > > > > > failure on systems with LS7A chipset.  We found that if
> > > > > > > > > we remove "pci_command &= ~PCI_COMMAND_MASTER;" in
> > > > > > > > > do_pci_disable_device(), it can work well. The hardware
> > > > > > > > > engineer says that the root cause is that CPU is still
> > > > > > > > > accessing PCIe devices while poweroff/reboot, and if we
> > > > > > > > > disable the Bus Master Bit at this time, the PCIe
> > > > > > > > > controller doesn't forward requests to downstream
> > > > > > > > > devices, and also doesn't send TIMEOUT to CPU, which
> > > > > > > > > causes CPU wait forever (hardware deadlock). This
> > > > > > > > > behavior is a PCIe protocol violation (Bus Master should
> > > > > > > > > not be involved in CPU MMIO transactions), and it will
> > > > > > > > > be fixed in new revisions of hardware (add timeout
> > > > > > > > > mechanism for CPU read request, whether or not Bus
> > > > > > > > > Master bit is cleared).
> > > > > > > >
> > > > > > > > LS7A might have bugs in that clearing Bus Master Enable
> > > > > > > > prevents the root port from forwarding Memory or I/O
> > > > > > > > requests in the downstream direction.
> > > > > > > >
> > > > > > > > But this feels like a bit of a band-aid because we don't
> > > > > > > > know exactly what those requests are.  If we're removing
> > > > > > > > the Root Port, I assume we think we no longer need any
> > > > > > > > devices *below* the Root Port.
> > > > > > > >
> > > > > > > > If that's not the case, e.g., if we still need to produce
> > > > > > > > console output or save state to a device, we probably
> > > > > > > > should not be removing the Root Port at all.
> > > > > > >
> > > > > > > Do you mean it is better to skip the whole
> > > > > > > pcie_port_device_remove() instead of just removing the
> > > > > > > "clear bus master" operation for the buggy hardware?
> > > > > >
> > > > > > No, that's not what I want at all.  That's just another
> > > > > > band-aid to avoid a problem without understanding what the
> > > > > > problem is.
> > > > > >
> > > > > > My point is that apparently we remove a Root Port (which means
> > > > > > we've already removed any devices under it), and then we try
> > > > > > to use a device below the Root Port.  That seems broken.  I
> > > > > > want to understand why we try to use a device after we've
> > > > > > removed it.
> > > > >
> > > > > I agree, and I think "why we try to use a device after remove
> > > > > it" is because the userspace programs don't know whether a
> > > > > device is "usable", so they just use it, at any time. Then it
> > > > > seems it is the responsibility of the device drivers to avoid
> > > > > the problem.
> > > >
> > > > How is userspace able to use a device after the device is removed?
> > > > E.g., if userspace does a read/write to a device that has been
> > > > removed, the syscall should return error, not touch the missing
> > > > device.  If userspace mmaps a device, an access after the device
> > > > has been removed should fault, not do MMIO to the missing device.
> > >
> > > To give more details, let's take the graphics driver (e.g. amdgpu)
> > > as an example again. The userspace programs call printf() to display
> > > "shutting down xxx service" during shutdown/reboot. Or we can even
> > > simplify further, the kernel calls printk() to display something
> > > during shutdown/reboot. You know, printk() can happen at any time,
> > > even after we call pcie_port_device_remove() to disable the pcie
> > > port on the graphic card.
> >
> > I've been focusing on the *remove* path, but you said the problem
> > you're solving is with *poweroff/reboot*.  pcie_portdrv_remove() is
> > used for both paths, but if there's a reason we need those paths to be
> > different, we might be able to split them.
>
> I'm very sorry for that. I have misunderstood before because I suppose
> the "remove path" is the pcie_portdrv_remove() function, but your
> meaning is the .remove() callback in pcie_portdriver. Am I right this
> time?

No need to be sorry, you clearly said from the beginning that this was
a shutdown issue, not a remove issue!  I was just confused because the
.remove() and the .shutdown() callbacks are both
pcie_portdrv_remove(), so I was thinking "remove" even though you said
"poweroff".

> > For remove, we have to assume accesses to the device may already or
> > will soon fail.  A driver that touches the device, or a device that
> > performs DMA, after its drv->remove() method has been called would be
> > seriously broken.  The remove operation also unbinds the driver from
> > the device.
>
> Then what will happen about the "remove path"? If we still take the
> graphics driver as an example, "rmmod amdgpu" always fails with
> "device is busy" because the graphics card is always be used once
> after the driver is loaded. So the "remove path" has no chance to be
> executed.

Do you think this is a problem?  It doesn't sound like a problem to
me, but I don't know anything about graphics drivers.  I assume that
if a device is in use, the expected behavior is that we can't remove
the driver.

> But if we take a NIC driver as an example, "rmmod igb" can
> mostly succeed, and there will be no access on the device after
> removing, at least in our observation. I think there is nothing broken
> about the "remove path".

I agree.

Bjorn

Huacai Chen June 17, 2022, 12:14 p.m. UTC | #10

Hi, Bjorn,

On Fri, Jun 17, 2022 at 7:37 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Fri, Jun 17, 2022 at 10:21:14AM +0800, Huacai Chen wrote:
> > On Fri, Jun 17, 2022 at 6:57 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Thu, Jun 16, 2022 at 04:39:46PM +0800, Huacai Chen wrote:
> > > > On Thu, Jun 9, 2022 at 3:31 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > On Wed, Jun 08, 2022 at 05:34:21PM +0800, Huacai Chen wrote:
> > > > > > On Fri, Jun 3, 2022 at 12:29 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > > On Thu, Jun 02, 2022 at 08:48:20PM +0800, Huacai Chen wrote:
> > > > > > > > On Wed, Jun 1, 2022 at 7:35 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > > > > On Sat, Apr 30, 2022 at 04:48:45PM +0800, Huacai Chen wrote:
> > > > > > > > > > Commit cc27b735ad3a75574a ("PCI/portdrv: Turn off PCIe
> > > > > > > > > > services during shutdown") causes poweroff/reboot
> > > > > > > > > > failure on systems with LS7A chipset.  We found that if
> > > > > > > > > > we remove "pci_command &= ~PCI_COMMAND_MASTER;" in
> > > > > > > > > > do_pci_disable_device(), it can work well. The hardware
> > > > > > > > > > engineer says that the root cause is that CPU is still
> > > > > > > > > > accessing PCIe devices while poweroff/reboot, and if we
> > > > > > > > > > disable the Bus Master Bit at this time, the PCIe
> > > > > > > > > > controller doesn't forward requests to downstream
> > > > > > > > > > devices, and also doesn't send TIMEOUT to CPU, which
> > > > > > > > > > causes CPU wait forever (hardware deadlock). This
> > > > > > > > > > behavior is a PCIe protocol violation (Bus Master should
> > > > > > > > > > not be involved in CPU MMIO transactions), and it will
> > > > > > > > > > be fixed in new revisions of hardware (add timeout
> > > > > > > > > > mechanism for CPU read request, whether or not Bus
> > > > > > > > > > Master bit is cleared).
> > > > > > > > >
> > > > > > > > > LS7A might have bugs in that clearing Bus Master Enable
> > > > > > > > > prevents the root port from forwarding Memory or I/O
> > > > > > > > > requests in the downstream direction.
> > > > > > > > >
> > > > > > > > > But this feels like a bit of a band-aid because we don't
> > > > > > > > > know exactly what those requests are.  If we're removing
> > > > > > > > > the Root Port, I assume we think we no longer need any
> > > > > > > > > devices *below* the Root Port.
> > > > > > > > >
> > > > > > > > > If that's not the case, e.g., if we still need to produce
> > > > > > > > > console output or save state to a device, we probably
> > > > > > > > > should not be removing the Root Port at all.
> > > > > > > >
> > > > > > > > Do you mean it is better to skip the whole
> > > > > > > > pcie_port_device_remove() instead of just removing the
> > > > > > > > "clear bus master" operation for the buggy hardware?
> > > > > > >
> > > > > > > No, that's not what I want at all.  That's just another
> > > > > > > band-aid to avoid a problem without understanding what the
> > > > > > > problem is.
> > > > > > >
> > > > > > > My point is that apparently we remove a Root Port (which means
> > > > > > > we've already removed any devices under it), and then we try
> > > > > > > to use a device below the Root Port.  That seems broken.  I
> > > > > > > want to understand why we try to use a device after we've
> > > > > > > removed it.
> > > > > >
> > > > > > I agree, and I think "why we try to use a device after remove
> > > > > > it" is because the userspace programs don't know whether a
> > > > > > device is "usable", so they just use it, at any time. Then it
> > > > > > seems it is the responsibility of the device drivers to avoid
> > > > > > the problem.
> > > > >
> > > > > How is userspace able to use a device after the device is removed?
> > > > > E.g., if userspace does a read/write to a device that has been
> > > > > removed, the syscall should return error, not touch the missing
> > > > > device.  If userspace mmaps a device, an access after the device
> > > > > has been removed should fault, not do MMIO to the missing device.
> > > >
> > > > To give more details, let's take the graphics driver (e.g. amdgpu)
> > > > as an example again. The userspace programs call printf() to display
> > > > "shutting down xxx service" during shutdown/reboot. Or we can even
> > > > simplify further, the kernel calls printk() to display something
> > > > during shutdown/reboot. You know, printk() can happen at any time,
> > > > even after we call pcie_port_device_remove() to disable the pcie
> > > > port on the graphic card.
> > >
> > > I've been focusing on the *remove* path, but you said the problem
> > > you're solving is with *poweroff/reboot*.  pcie_portdrv_remove() is
> > > used for both paths, but if there's a reason we need those paths to be
> > > different, we might be able to split them.
> >
> > I'm very sorry for that. I have misunderstood before because I suppose
> > the "remove path" is the pcie_portdrv_remove() function, but your
> > meaning is the .remove() callback in pcie_portdriver. Am I right this
> > time?
>
> No need to be sorry, you clearly said from the beginning that this was
> a shutdown issue, not a remove issue!  I was just confused because the
> .remove() and the .shutdown() callbacks are both
> pcie_portdrv_remove(), so I was thinking "remove" even though you said
> "poweroff".
>
> > > For remove, we have to assume accesses to the device may already or
> > > will soon fail.  A driver that touches the device, or a device that
> > > performs DMA, after its drv->remove() method has been called would be
> > > seriously broken.  The remove operation also unbinds the driver from
> > > the device.
> >
> > Then what will happen about the "remove path"? If we still take the
> > graphics driver as an example, "rmmod amdgpu" always fails with
> > "device is busy" because the graphics card is always be used once
> > after the driver is loaded. So the "remove path" has no chance to be
> > executed.
>
> Do you think this is a problem?  It doesn't sound like a problem to
> me, but I don't know anything about graphics drivers.  I assume that
> if a device is in use, the expected behavior is that we can't remove
> the driver.
This isn't a problem, and I've sent V14, which only modifies the shutdown logic.

Huacai
>
> > But if we take a NIC driver as an example, "rmmod igb" can
> > mostly succeed, and there will be no access on the device after
> > removing, at least in our observation. I think there is nothing broken
> > about the "remove path".
>
> I agree.
>
> Bjorn

[V13,5/6] PCI: Add quirk for LS7A to avoid reboot failure

Commit Message

Comments

Patch