diff mbox

PCI: Disable async suspend/resume for Jmicron chip

Message ID 1438133951.1856.23.camel@rzhang1-mobl4 (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas
Headers show

Commit Message

Zhang Rui July 29, 2015, 1:39 a.m. UTC
On Tue, 2015-07-28 at 13:58 -0400, Tejun Heo wrote:
> On Tue, Jul 28, 2015 at 09:42:27AM +0800, Zhang Rui wrote:
> > From 57edba9c677e47354846db951014dc4d5b13ce54 Mon Sep 17 00:00:00 2001
> > From: Zhang Rui <rui.zhang@intel.com>
> > Date: Sun, 26 Jul 2015 14:15:36 +0800
> > Subject: [PATCH] PCI: Disable async suspend/resume for Jmicron chip
> > 
> > In https://bugzilla.kernel.org/show_bug.cgi?id=81551,
> > we found that Jmicron chip 361/363 is broken after resume if async noirq
> > (76569faa62 (PM / sleep: Asynchronous threads for resume_noirq)) is supported,
> > thus commit e6b7e41cdd (ata: Disabling the async PM for JMicron chip 363/361)
> > is introduced to fix this problem.
> > But then, we found that Jmicron chip 368 also has this problem, and it is decided
> > to disable the pm async feature for all the Jmicron chips.
> > 
> > But how to fix this was discussed in the mailing list for some time.
> > After some investigation, we believe that a proper fix is to disable
> > the async PM in PCI instead of ata driver, because, on this platform,
> > pci_resume_noirq() of IDE controller must happen after pci_resume_noirq()
> > of AHCI controller. But as .resume_noirq() of the pata_jmicron driver is
> > no-op, this suggests that it is the PCI common actions, aka,
> > pci_pm_default_resume_early(), have the dependency.
> > To fix this, using device_pm_wait_for_dev() in pata_jmicron driver can not
> > solve the dependency because pci_pm_default_resume_early() is invoked before
> > driver callback being invoked, plus, as it is the PCI common actions that
> > have the dependency, it is reasonable to fix it in PCI bus code,
> > rather than driver code.
> > 
> > This patch is made based on the patch from Liu Chuansheng at
> > https://lkml.org/lkml/2014/12/5/74
> > it reverts commit e6b7e41cdd ("ata: Disabling the async PM for JMicron
> > chip 363/361"), and introduces a PCI quirk to disable async PM for Jmicron
> > chips.
> > 
> > Reference: https://bugzilla.kernel.org/show_bug.cgi?id=81551
> > Tested-by: Jay <MyMailClone@t-online.de>
> > Tested-by: Barto <mister.freeman@laposte.net>
> > Signed-off-by: Zhang Rui <rui.zhang@intel.com>
> 
> Acked-by: Tejun Heo <tj@kernel.org>
> 
> >  /*
> > + * For JMicron chips, we need to disable the async_suspend method, otherwise
> > + * they will hit the power-on issue when doing device resume, add one quick
> > + * solution to disable the async_suspend method.
> > + */
> 
> Maybe add a link to the bug report and/or discussion thread?

Yes, refreshed patch is attached below.


From c6b063b2874b62466362a10b53097216b16e400d Mon Sep 17 00:00:00 2001
From: Zhang Rui <rui.zhang@intel.com>
Date: Sun, 26 Jul 2015 14:15:36 +0800
Subject: [PATCH] PCI: Disable async suspend/resume for Jmicron chip

In https://bugzilla.kernel.org/show_bug.cgi?id=81551,
we found that Jmicron chip 361/363 is broken after resume if async noirq
(76569faa62 (PM / sleep: Asynchronous threads for resume_noirq)) is supported,
thus commit e6b7e41cdd (ata: Disabling the async PM for JMicron chip 363/361)
is introduced to fix this problem.
But then, we found that Jmicron chip 368 also has this problem, and it is decided
to disable the pm async feature for all the Jmicron chips.

But how to fix this was discussed in the mailing list for some time.
After some investigation, we believe that a proper fix is to disable
the async PM in PCI instead of ata driver, because, on this platform,
pci_resume_noirq() of IDE controller must happen after pci_resume_noirq()
of AHCI controller. But as .resume_noirq() of the pata_jmicron driver is
no-op, this suggests that it is the PCI common actions, aka,
pci_pm_default_resume_early(), have the dependency.
To fix this, using device_pm_wait_for_dev() in pata_jmicron driver can not
solve the dependency because pci_pm_default_resume_early() is invoked before
driver callback being invoked, plus, as it is the PCI common actions that
have the dependency, it is reasonable to fix it in PCI bus code,
rather than driver code.

This patch is made based on the patch from Liu Chuansheng at
https://lkml.org/lkml/2014/12/5/74
it reverts commit e6b7e41cdd ("ata: Disabling the async PM for JMicron
chip 363/361"), and introduces a PCI quirk to disable async PM for Jmicron
chips.

Reference: https://bugzilla.kernel.org/show_bug.cgi?id=81511
Reference: https://bugzilla.kernel.org/show_bug.cgi?id=81551
Reviewed-by: Aaron Lu <aaron.lu@intel.com>
Acked-by: Chuansheng Liu <chuansheng.liu@intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
---
 drivers/ata/ahci.c         | 12 ------------
 drivers/ata/pata_jmicron.c | 12 ------------
 drivers/pci/quirks.c       | 17 +++++++++++++++++
 3 files changed, 17 insertions(+), 24 deletions(-)

Comments

Bjorn Helgaas Aug. 14, 2015, 8:17 p.m. UTC | #1
Hi Zhang,

On Wed, Jul 29, 2015 at 09:39:11AM +0800, Zhang Rui wrote:
> From c6b063b2874b62466362a10b53097216b16e400d Mon Sep 17 00:00:00 2001
> From: Zhang Rui <rui.zhang@intel.com>
> Date: Sun, 26 Jul 2015 14:15:36 +0800
> Subject: [PATCH] PCI: Disable async suspend/resume for Jmicron chip
> 
> In https://bugzilla.kernel.org/show_bug.cgi?id=81551,
> we found that Jmicron chip 361/363 is broken after resume if async noirq
> (76569faa62 (PM / sleep: Asynchronous threads for resume_noirq)) is supported,
> thus commit e6b7e41cdd (ata: Disabling the async PM for JMicron chip 363/361)
> is introduced to fix this problem.
> But then, we found that Jmicron chip 368 also has this problem, and it is decided
> to disable the pm async feature for all the Jmicron chips.
> 
> But how to fix this was discussed in the mailing list for some time.
> After some investigation, we believe that a proper fix is to disable
> the async PM in PCI instead of ata driver, because, on this platform,
> pci_resume_noirq() of IDE controller must happen after pci_resume_noirq()
> of AHCI controller. But as .resume_noirq() of the pata_jmicron driver is
> no-op, this suggests that it is the PCI common actions, aka,
> pci_pm_default_resume_early(), have the dependency.
> To fix this, using device_pm_wait_for_dev() in pata_jmicron driver can not
> solve the dependency because pci_pm_default_resume_early() is invoked before
> driver callback being invoked, plus, as it is the PCI common actions that
> have the dependency, it is reasonable to fix it in PCI bus code,
> rather than driver code.
> 
> This patch is made based on the patch from Liu Chuansheng at
> https://lkml.org/lkml/2014/12/5/74
> it reverts commit e6b7e41cdd ("ata: Disabling the async PM for JMicron
> chip 363/361"), and introduces a PCI quirk to disable async PM for Jmicron
> chips.

This changelog has a lot of text, but doesn't tell me what I really
want to know.

We need to know what the problem is from the *device* point of view,
not in terms of kernel function names.  Function names are only
meaningful to a handful of experts, but a concrete problem description
may be useful to hardware designers and firmware writers who can
potentially fix the root issue.

In this case, I think the problem is something like: "on these
multi-function JMicron devices, the PATA controller at function 1
doesn't work if it is powered on before the SATA controller at
function 0."

It's also helpful to include a symptom to help people connect a
problem with the solution.  For example, the 81551 bug reports says
these are symptoms:

  pata_jmicron 0000:02:00.1: Refused to change power state, currently in D3
  irq 17: nobody cared (try booting with the "irqpoll" option)

This patch doesn't change the workaround: we still use
device_disable_async_suspend() on these devices.  This patch merely:

  1) Changes the workaround so instead of doing this only for 361 and
  363 devices, we turn off async suspend on *all* JMicron devices, and

  2) Moves the workaround from the AHCI and PATA drivers to the PCI
  core.

For 1), I think it is probably overkill to penalize all JMicron
devices.  There's no reason to think NICs and SD/MMC/etc controllers
have the same issue.  Or if there *is* reason to think that, you
should give evidence for it.

For 2), you have not made a convincing argument for why the workaround
needs to be in the PCI core.  Such an argument would be of the form
"we need this quirk even if the driver hasn't been loaded yet
because ..."

Since you didn't say anything like that, I assume the patch in comment
#72 of bug #81551 (https://bugzilla.kernel.org/attachment.cgi?id=156301)
would work as well.  That has the advantage that it wouldn't penalize
non-storage controllers.

Other minor nits:

- The function "pci_resume_noirq()" does not exist.  I assume you meant
  pci_pm_resume_noirq().  And as I mentioned earlier, I don't think the
  name is relevant in the changelog anyway.

- You have a mix of SHA1 lengths and summary quoting above.  Please use
  the conventional commit reference style (with 12-char SHA-1)
  consistently, e.g.,

    76569faa62c4 ("PM / sleep: Asynchronous threads for resume_noirq")

- Please use http://lkml.kernel.org/r/... references when possible,
  not https://lkml.org/....

> Reference: https://bugzilla.kernel.org/show_bug.cgi?id=81511
> Reference: https://bugzilla.kernel.org/show_bug.cgi?id=81551
> Reviewed-by: Aaron Lu <aaron.lu@intel.com>
> Acked-by: Chuansheng Liu <chuansheng.liu@intel.com>
> Acked-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Zhang Rui <rui.zhang@intel.com>
> ---
>  drivers/ata/ahci.c         | 12 ------------
>  drivers/ata/pata_jmicron.c | 12 ------------
>  drivers/pci/quirks.c       | 17 +++++++++++++++++
>  3 files changed, 17 insertions(+), 24 deletions(-)
> 
> diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
> index 7e62751..26bb40d 100644
> --- a/drivers/ata/ahci.c
> +++ b/drivers/ata/ahci.c
> @@ -1451,18 +1451,6 @@ static int ahci_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>  	else if (pdev->vendor == 0x177d && pdev->device == 0xa01c)
>  		ahci_pci_bar = AHCI_PCI_BAR_CAVIUM;
>  
> -	/*
> -	 * The JMicron chip 361/363 contains one SATA controller and one
> -	 * PATA controller,for powering on these both controllers, we must
> -	 * follow the sequence one by one, otherwise one of them can not be
> -	 * powered on successfully, so here we disable the async suspend
> -	 * method for these chips.
> -	 */
> -	if (pdev->vendor == PCI_VENDOR_ID_JMICRON &&
> -		(pdev->device == PCI_DEVICE_ID_JMICRON_JMB363 ||
> -		pdev->device == PCI_DEVICE_ID_JMICRON_JMB361))
> -		device_disable_async_suspend(&pdev->dev);
> -
>  	/* acquire resources */
>  	rc = pcim_enable_device(pdev);
>  	if (rc)
> diff --git a/drivers/ata/pata_jmicron.c b/drivers/ata/pata_jmicron.c
> index 47e418b..4d1a5d2 100644
> --- a/drivers/ata/pata_jmicron.c
> +++ b/drivers/ata/pata_jmicron.c
> @@ -143,18 +143,6 @@ static int jmicron_init_one (struct pci_dev *pdev, const struct pci_device_id *i
>  	};
>  	const struct ata_port_info *ppi[] = { &info, NULL };
>  
> -	/*
> -	 * The JMicron chip 361/363 contains one SATA controller and one
> -	 * PATA controller,for powering on these both controllers, we must
> -	 * follow the sequence one by one, otherwise one of them can not be
> -	 * powered on successfully, so here we disable the async suspend
> -	 * method for these chips.
> -	 */
> -	if (pdev->vendor == PCI_VENDOR_ID_JMICRON &&
> -		(pdev->device == PCI_DEVICE_ID_JMICRON_JMB363 ||
> -		pdev->device == PCI_DEVICE_ID_JMICRON_JMB361))
> -		device_disable_async_suspend(&pdev->dev);
> -
>  	return ata_pci_bmdma_init_one(pdev, ppi, &jmicron_sht, NULL, 0);
>  }
>  
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index e9fd0e9..02803f8 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -29,6 +29,23 @@
>  #include "pci.h"
>  
>  /*
> + * For JMicron chips, we need to disable the async_suspend method, otherwise
> + * they will hit the power-on issue when doing device resume, add one quick
> + * solution to disable the async_suspend method.
> + *
> + * https://bugzilla.kernel.org/show_bug.cgi?id=81551
> + */
> +static void pci_async_suspend_fixup(struct pci_dev *pdev)
> +{
> +	/*
> +	 * disabling the async_suspend method for JMicron chips to
> +	 * avoid device resuming issue.
> +	 */
> +	device_disable_async_suspend(&pdev->dev);
> +}
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_JMICRON, PCI_ANY_ID, pci_async_suspend_fixup);
> +
> +/*
>   * Decoding should be disabled for a PCI device during BAR sizing to avoid
>   * conflict. But doing so may cause problems on host bridge and perhaps other
>   * key system devices. For devices that need to have mmio decoding always-on,
> -- 
> 1.9.1
> 
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bjorn Helgaas Aug. 15, 2015, 2:39 p.m. UTC | #2
Hi Jay,

On Sat, Aug 15, 2015 at 7:49 AM, Jay <MyMailClone@t-online.de> wrote:
> Am Freitag, 14. August 2015, 15:17:52 schrieb Bjorn Helgaas:
> ...
>> Since you didn't say anything like that, I assume the patch in comment
>> #72 of bug #81551 (https://bugzilla.kernel.org/attachment.cgi?id=156301)
>> would work as well.  That has the advantage that it wouldn't penalize
>> non-storage controllers.
>
> The problem was introduced with 3.15 and the kernel is almost at 4.2 now.
>
> A general solution for the JMicron-AHCI/PATA-controllers is still missing
> although available since late September 2014. It was tested by Barto.
>
> Seems what I wrote in comment #76 wasn't completely wrong:
> https://bugzilla.kernel.org/show_bug.cgi?id=81551#c76
>
> So this may be an opportunity to reconsider the approach to solving
> things like this.

I don't understand your point, except to acknowledge that we are
imperfect, have limited resources, and make many mistakes in
diagnosing problems and communicating solutions.  Do you have a
proposal for a better approach?

The patch that Barto tested in late September 2014
(https://bugzilla.kernel.org/show_bug.cgi?id=81551#c66) is exactly the
patch from comment #72 that I mentioned as possibly being a better
solution.

That patch wouldn't involve me at all, since it doesn't touch PCI.
Zhang is proposing a PCI change, so I'm asking for a clear changelog.

A changelog is a "write-once, read-many" situation.  It's very
important that it be concise and clear, and it's worth having the
expert spend extra time writing the log to make it easier for the many
novices that will read it in the future.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Zhang Rui Aug. 15, 2015, 4:57 p.m. UTC | #3
Hi, Bjorn,

On Fri, 2015-08-14 at 15:17 -0500, Bjorn Helgaas wrote:
> Hi Zhang,
> 
> On Wed, Jul 29, 2015 at 09:39:11AM +0800, Zhang Rui wrote:
> > From c6b063b2874b62466362a10b53097216b16e400d Mon Sep 17 00:00:00 2001
> > From: Zhang Rui <rui.zhang@intel.com>
> > Date: Sun, 26 Jul 2015 14:15:36 +0800
> > Subject: [PATCH] PCI: Disable async suspend/resume for Jmicron chip
> > 
> > In https://bugzilla.kernel.org/show_bug.cgi?id=81551,
> > we found that Jmicron chip 361/363 is broken after resume if async noirq
> > (76569faa62 (PM / sleep: Asynchronous threads for resume_noirq)) is supported,
> > thus commit e6b7e41cdd (ata: Disabling the async PM for JMicron chip 363/361)
> > is introduced to fix this problem.
> > But then, we found that Jmicron chip 368 also has this problem, and it is decided
> > to disable the pm async feature for all the Jmicron chips.
> > 
> > But how to fix this was discussed in the mailing list for some time.
> > After some investigation, we believe that a proper fix is to disable
> > the async PM in PCI instead of ata driver, because, on this platform,
> > pci_resume_noirq() of IDE controller must happen after pci_resume_noirq()
> > of AHCI controller. But as .resume_noirq() of the pata_jmicron driver is
> > no-op, this suggests that it is the PCI common actions, aka,
> > pci_pm_default_resume_early(), have the dependency.
> > To fix this, using device_pm_wait_for_dev() in pata_jmicron driver can not
> > solve the dependency because pci_pm_default_resume_early() is invoked before
> > driver callback being invoked, plus, as it is the PCI common actions that
> > have the dependency, it is reasonable to fix it in PCI bus code,
> > rather than driver code.
> > 
> > This patch is made based on the patch from Liu Chuansheng at
> > https://lkml.org/lkml/2014/12/5/74
> > it reverts commit e6b7e41cdd ("ata: Disabling the async PM for JMicron
> > chip 363/361"), and introduces a PCI quirk to disable async PM for Jmicron
> > chips.
> 
> This changelog has a lot of text, but doesn't tell me what I really
> want to know.
> 
> We need to know what the problem is from the *device* point of view,
> not in terms of kernel function names.  Function names are only
> meaningful to a handful of experts, but a concrete problem description
> may be useful to hardware designers and firmware writers who can
> potentially fix the root issue.
> 
Well, I sent this patch just because there is a regression since 3.15,
and we already have two patches that have been verified to fix the
problem, but none of them are pushed for upstream. I don't have many PCI
background, thus I chose to use function names to make myself precise
enough but apparently I failed :p
Maybe I should send this patch as a RFC patch to get a perfect fix.

> In this case, I think the problem is something like: "on these
> multi-function JMicron devices, the PATA controller at function 1
> doesn't work if it is powered on before the SATA controller at
> function 0."
> 
exactly.

> It's also helpful to include a symptom to help people connect a
> problem with the solution.  For example, the 81551 bug reports says
> these are symptoms:
> 
>   pata_jmicron 0000:02:00.1: Refused to change power state, currently in D3
>   irq 17: nobody cared (try booting with the "irqpoll" option)
> 
okay.

> This patch doesn't change the workaround: we still use
> device_disable_async_suspend() on these devices.  This patch merely:
> 
>   1) Changes the workaround so instead of doing this only for 361 and
>   363 devices, we turn off async suspend on *all* JMicron devices, and
> 
>   2) Moves the workaround from the AHCI and PATA drivers to the PCI
>   core.
> 
> For 1), I think it is probably overkill to penalize all JMicron
> devices.  There's no reason to think NICs and SD/MMC/etc controllers
> have the same issue.  Or if there *is* reason to think that, you
> should give evidence for it.

No, I don't have any evidence. It's just because the previous fix
becomes insufficient, which makes people wondering if we should disable
all of them.

> 
> For 2), you have not made a convincing argument for why the workaround
> needs to be in the PCI core.  Such an argument would be of the form
> "we need this quirk even if the driver hasn't been loaded yet
> because ..."

Good point, Jay and Barto, can you please verify this?

> Since you didn't say anything like that, I assume the patch in comment
> #72 of bug #81551 (https://bugzilla.kernel.org/attachment.cgi?id=156301)
> would work as well.  That has the advantage that it wouldn't penalize
> non-storage controllers.
> 
Yes, the fix in driver code also works. But let's wait for the test
results from Jay and Barto because this problem really sounds like a
dependency in PCI code.

> Other minor nits:
> 
> - The function "pci_resume_noirq()" does not exist.  I assume you meant
>   pci_pm_resume_noirq().  And as I mentioned earlier, I don't think the
>   name is relevant in the changelog anyway.
> 
> - You have a mix of SHA1 lengths and summary quoting above.  Please use
>   the conventional commit reference style (with 12-char SHA-1)
>   consistently, e.g.,
> 
>     76569faa62c4 ("PM / sleep: Asynchronous threads for resume_noirq")
> 
> - Please use http://lkml.kernel.org/r/... references when possible,
>   not https://lkml.org/....
> 
Thanks for pointing these out. I was not aware of such mistakes before.
Thank you.

-rui
> > Reference: https://bugzilla.kernel.org/show_bug.cgi?id=81511
> > Reference: https://bugzilla.kernel.org/show_bug.cgi?id=81551
> > Reviewed-by: Aaron Lu <aaron.lu@intel.com>
> > Acked-by: Chuansheng Liu <chuansheng.liu@intel.com>
> > Acked-by: Tejun Heo <tj@kernel.org>
> > Signed-off-by: Zhang Rui <rui.zhang@intel.com>
> > ---
> >  drivers/ata/ahci.c         | 12 ------------
> >  drivers/ata/pata_jmicron.c | 12 ------------
> >  drivers/pci/quirks.c       | 17 +++++++++++++++++
> >  3 files changed, 17 insertions(+), 24 deletions(-)
> > 
> > diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
> > index 7e62751..26bb40d 100644
> > --- a/drivers/ata/ahci.c
> > +++ b/drivers/ata/ahci.c
> > @@ -1451,18 +1451,6 @@ static int ahci_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
> >  	else if (pdev->vendor == 0x177d && pdev->device == 0xa01c)
> >  		ahci_pci_bar = AHCI_PCI_BAR_CAVIUM;
> >  
> > -	/*
> > -	 * The JMicron chip 361/363 contains one SATA controller and one
> > -	 * PATA controller,for powering on these both controllers, we must
> > -	 * follow the sequence one by one, otherwise one of them can not be
> > -	 * powered on successfully, so here we disable the async suspend
> > -	 * method for these chips.
> > -	 */
> > -	if (pdev->vendor == PCI_VENDOR_ID_JMICRON &&
> > -		(pdev->device == PCI_DEVICE_ID_JMICRON_JMB363 ||
> > -		pdev->device == PCI_DEVICE_ID_JMICRON_JMB361))
> > -		device_disable_async_suspend(&pdev->dev);
> > -
> >  	/* acquire resources */
> >  	rc = pcim_enable_device(pdev);
> >  	if (rc)
> > diff --git a/drivers/ata/pata_jmicron.c b/drivers/ata/pata_jmicron.c
> > index 47e418b..4d1a5d2 100644
> > --- a/drivers/ata/pata_jmicron.c
> > +++ b/drivers/ata/pata_jmicron.c
> > @@ -143,18 +143,6 @@ static int jmicron_init_one (struct pci_dev *pdev, const struct pci_device_id *i
> >  	};
> >  	const struct ata_port_info *ppi[] = { &info, NULL };
> >  
> > -	/*
> > -	 * The JMicron chip 361/363 contains one SATA controller and one
> > -	 * PATA controller,for powering on these both controllers, we must
> > -	 * follow the sequence one by one, otherwise one of them can not be
> > -	 * powered on successfully, so here we disable the async suspend
> > -	 * method for these chips.
> > -	 */
> > -	if (pdev->vendor == PCI_VENDOR_ID_JMICRON &&
> > -		(pdev->device == PCI_DEVICE_ID_JMICRON_JMB363 ||
> > -		pdev->device == PCI_DEVICE_ID_JMICRON_JMB361))
> > -		device_disable_async_suspend(&pdev->dev);
> > -
> >  	return ata_pci_bmdma_init_one(pdev, ppi, &jmicron_sht, NULL, 0);
> >  }
> >  
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index e9fd0e9..02803f8 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -29,6 +29,23 @@
> >  #include "pci.h"
> >  
> >  /*
> > + * For JMicron chips, we need to disable the async_suspend method, otherwise
> > + * they will hit the power-on issue when doing device resume, add one quick
> > + * solution to disable the async_suspend method.
> > + *
> > + * https://bugzilla.kernel.org/show_bug.cgi?id=81551
> > + */
> > +static void pci_async_suspend_fixup(struct pci_dev *pdev)
> > +{
> > +	/*
> > +	 * disabling the async_suspend method for JMicron chips to
> > +	 * avoid device resuming issue.
> > +	 */
> > +	device_disable_async_suspend(&pdev->dev);
> > +}
> > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_JMICRON, PCI_ANY_ID, pci_async_suspend_fixup);
> > +
> > +/*
> >   * Decoding should be disabled for a PCI device during BAR sizing to avoid
> >   * conflict. But doing so may cause problems on host bridge and perhaps other
> >   * key system devices. For devices that need to have mmio decoding always-on,
> > -- 
> > 1.9.1
> > 
> > 
> > 


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bjorn Helgaas Aug. 15, 2015, 8:57 p.m. UTC | #4
On Sat, Aug 15, 2015 at 3:33 PM, Jay <MyMailClone@t-online.de> wrote:
> Am Samstag, 15. August 2015, 09:39:24 schrieb Bjorn Helgaas:
>> Hi Jay,
>>
>> On Sat, Aug 15, 2015 at 7:49 AM, Jay <MyMailClone@t-online.de> wrote:
>> > Am Freitag, 14. August 2015, 15:17:52 schrieb Bjorn Helgaas:
>> > ...
>> >
>> >> Since you didn't say anything like that, I assume the patch in comment
>> >> #72 of bug #81551 (https://bugzilla.kernel.org/attachment.cgi?id=156301)
>> >> would work as well.  That has the advantage that it wouldn't penalize
>> >> non-storage controllers.
>> >
>> > The problem was introduced with 3.15 and the kernel is almost at 4.2 now.
>> >
>> > A general solution for the JMicron-AHCI/PATA-controllers is still missing
>> > although available since late September 2014. It was tested by Barto.
>> >
>> > Seems what I wrote in comment #76 wasn't completely wrong:
>> > https://bugzilla.kernel.org/show_bug.cgi?id=81551#c76
>> >
>> > So this may be an opportunity to reconsider the approach to solving
>> > things like this.
>>
>> I don't understand your point, except to acknowledge that we are
>> imperfect, have limited resources, and make many mistakes in
>> diagnosing problems and communicating solutions.  Do you have a
>> proposal for a better approach?
>>
>> The patch that Barto tested in late September 2014
>> (https://bugzilla.kernel.org/show_bug.cgi?id=81551#c66) is exactly the
>> patch from comment #72 that I mentioned as possibly being a better
>> solution.
>>
>> That patch wouldn't involve me at all, since it doesn't touch PCI.
>> Zhang is proposing a PCI change, so I'm asking for a clear changelog.
>>
>> A changelog is a "write-once, read-many" situation.  It's very
>> important that it be concise and clear, and it's worth having the
>> expert spend extra time writing the log to make it easier for the many
>> novices that will read it in the future.
>
> please don't feel offended. Nobody should. No need to. And I like what you
> wrote, it's definitely constructive.
>
> My point is that instead of looking for a "perfect" solution, a more pragmatic
> approach may be more effective in solving things like this.
>
> And this was exactly what I suggested in comment #76: "Let's be pragmatic,
> take this patch and solve the problem now. And then you (developers) may take
> your time to look for a better or even the "perfect" solution."
>
> That way the case would have been closed a year ago.

The only problem with that approach is that the owner of the issue now
considers the issue closed, so the better solution never happens.
Even if it *does* happen, we now have a fix for a fix, which makes it
that much more difficult to follow the history.  This is just a
generic issue with the way Linux works: maintainers have zero leverage
after accepting a patch, so anything they really care about has to
happen before that.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bjorn Helgaas Aug. 15, 2015, 9:45 p.m. UTC | #5
On Sat, Aug 15, 2015 at 11:57 AM, Zhang Rui <rui.zhang@intel.com> wrote:
> Hi, Bjorn,
>
> On Fri, 2015-08-14 at 15:17 -0500, Bjorn Helgaas wrote:
>> Hi Zhang,
>>
>> On Wed, Jul 29, 2015 at 09:39:11AM +0800, Zhang Rui wrote:
>> > From c6b063b2874b62466362a10b53097216b16e400d Mon Sep 17 00:00:00 2001
>> > From: Zhang Rui <rui.zhang@intel.com>
>> > Date: Sun, 26 Jul 2015 14:15:36 +0800
>> > Subject: [PATCH] PCI: Disable async suspend/resume for Jmicron chip
>> >
>> > In https://bugzilla.kernel.org/show_bug.cgi?id=81551,
>> > we found that Jmicron chip 361/363 is broken after resume if async noirq
>> > (76569faa62 (PM / sleep: Asynchronous threads for resume_noirq)) is supported,
>> > thus commit e6b7e41cdd (ata: Disabling the async PM for JMicron chip 363/361)
>> > is introduced to fix this problem.
>> > But then, we found that Jmicron chip 368 also has this problem, and it is decided
>> > to disable the pm async feature for all the Jmicron chips.
>> >
>> > But how to fix this was discussed in the mailing list for some time.
>> > After some investigation, we believe that a proper fix is to disable
>> > the async PM in PCI instead of ata driver, because, on this platform,
>> > pci_resume_noirq() of IDE controller must happen after pci_resume_noirq()
>> > of AHCI controller. But as .resume_noirq() of the pata_jmicron driver is
>> > no-op, this suggests that it is the PCI common actions, aka,
>> > pci_pm_default_resume_early(), have the dependency.
>> > To fix this, using device_pm_wait_for_dev() in pata_jmicron driver can not
>> > solve the dependency because pci_pm_default_resume_early() is invoked before
>> > driver callback being invoked, plus, as it is the PCI common actions that
>> > have the dependency, it is reasonable to fix it in PCI bus code,
>> > rather than driver code.
>> >
>> > This patch is made based on the patch from Liu Chuansheng at
>> > https://lkml.org/lkml/2014/12/5/74
>> > it reverts commit e6b7e41cdd ("ata: Disabling the async PM for JMicron
>> > chip 363/361"), and introduces a PCI quirk to disable async PM for Jmicron
>> > chips.
>>
>> This changelog has a lot of text, but doesn't tell me what I really
>> want to know.
>>
>> We need to know what the problem is from the *device* point of view,
>> not in terms of kernel function names.  Function names are only
>> meaningful to a handful of experts, but a concrete problem description
>> may be useful to hardware designers and firmware writers who can
>> potentially fix the root issue.
>>
> Well, I sent this patch just because there is a regression since 3.15,
> and we already have two patches that have been verified to fix the
> problem, but none of them are pushed for upstream. I don't have many PCI
> background, thus I chose to use function names to make myself precise
> enough but apparently I failed :p

Yep.  If you use function names, they have to be real function names,
not "almost" function names.

> Maybe I should send this patch as a RFC patch to get a perfect fix.

RFC has nothing to do with it.  The problem is that the changelog
doesn't say what the problem is or how we're fixing it (except in such
abstract terms as to be useless).

>> This patch doesn't change the workaround: we still use
>> device_disable_async_suspend() on these devices.  This patch merely:
>>
>>   1) Changes the workaround so instead of doing this only for 361 and
>>   363 devices, we turn off async suspend on *all* JMicron devices, and
>>
>>   2) Moves the workaround from the AHCI and PATA drivers to the PCI
>>   core.
>>
>> For 1), I think it is probably overkill to penalize all JMicron
>> devices.  There's no reason to think NICs and SD/MMC/etc controllers
>> have the same issue.  Or if there *is* reason to think that, you
>> should give evidence for it.
>
> No, I don't have any evidence. It's just because the previous fix
> becomes insufficient, which makes people wondering if we should disable
> all of them.

It's reasonable to say "JMicron multi-function PATA controllers X, Y,
and Z have this problem; let's assume all JMicron multi-function PATA
controllers have it."  All those controllers are likely to share some
silicon design.  Saying "all JMicron adapters, even single-function
and NIC and SD/etc. adapters, have this problem" is not nearly such an
obvious conclusion.

>> For 2), you have not made a convincing argument for why the workaround
>> needs to be in the PCI core.  Such an argument would be of the form
>> "we need this quirk even if the driver hasn't been loaded yet
>> because ..."
>
> Good point, Jay and Barto, can you please verify this?

I don't know Jay or Barto, but I don't even know what you're asking
them to verify, so I think you are asking too much of them.  What
exactly would you like them to do?

The scenario where I can imagine the quirk would need to be in the
core is the following, and if I were you, this is the scenario I would
be asking them to test:

  - cold boot system with comment #72 patch (quirks in ahci & pata_jmicron)
  - load ahci driver to claim JMicron SATA device
  - suspend system, which powers down both SATA and PATA devices
  - resume system, which powers up PATA (function 1), then SATA (function 0)
  - verify that SATA works fine
  - load pata_jmicron driver to claim PATA
  - see whether PATA works

Now, the question is whether PATA works after resuming and loading
pata_jmicron.  If it does, then it should be fine to keep the quirks
in ahci and pata_jmicron.

If it doesn't work (and I think it's actually likely that it *won't*
work), then we probably need to put the quirks in the PCI core so they
will happen even before ahci and pata_jmicron are loaded.

>> Since you didn't say anything like that, I assume the patch in comment
>> #72 of bug #81551 (https://bugzilla.kernel.org/attachment.cgi?id=156301)
>> would work as well.  That has the advantage that it wouldn't penalize
>> non-storage controllers.
>>
> Yes, the fix in driver code also works. But let's wait for the test
> results from Jay and Barto because this problem really sounds like a
> dependency in PCI code.

It might "sound like" a dependency in PCI code, but that sort of
hand-waving makes me think you don't really understand the problem.
There might be something in the PCI specs that says you have to power
up function 0 before function 1.  I don't think that's the case,
because we don't see this problem with other multi-function devices.
But if it were the case, you should write a PCI patch to enforce that
ordering and include a spec citation in the changelog.  That would fix
this device as well as potentially others.

If the spec doesn't require that ordering, then it's probably a
hardware defect.  It's possible there's a way to describe the ordering
via ACPI; if there is, you could argue that the lack of that
description is a BIOS defect.  Either way, it's not a PCI core
problem.  We might still want a workaround in the PCI core, but we
need to be clear about what the problem is.

>> Other minor nits:
>>
>> - The function "pci_resume_noirq()" does not exist.  I assume you meant
>>   pci_pm_resume_noirq().  And as I mentioned earlier, I don't think the
>>   name is relevant in the changelog anyway.
>>
>> - You have a mix of SHA1 lengths and summary quoting above.  Please use
>>   the conventional commit reference style (with 12-char SHA-1)
>>   consistently, e.g.,
>>
>>     76569faa62c4 ("PM / sleep: Asynchronous threads for resume_noirq")
>>
>> - Please use http://lkml.kernel.org/r/... references when possible,
>>   not https://lkml.org/....
>>
> Thanks for pointing these out. I was not aware of such mistakes before.

I forgot to mention that your changelog has random line lengths.  New
paragraphs start after a blank line.  A new sentence does not start a
new line.  Fill the lines so they fit in 75 columns so they fit in 80
columns after git indents them.

Usually I fix things like that myself.  But there were other issues,
and your SHA-1 citations weren't even consistent with each other,
which I think is sloppy, and I get grumpy when people ask me to merge
sloppy work.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Barto Aug. 15, 2015, 11:40 p.m. UTC | #6
Le 14/08/2015 22:17, Bjorn Helgaas a écrit :
> For 1), I think it is probably overkill to penalize all JMicron
> devices.  There's no reason to think NICs and SD/MMC/etc controllers
> have the same issue.  Or if there *is* reason to think that, you
> should give evidence for it.

if you don't want to penalize all JMicron models then a workaround would
be to create a new feature : the ability for the user to disable async
for a specific device,

for example a kernel parameter ( no-async-for-jmicron ), or a more
configurable mechanism, like a configuration file where the user can add
the name of the device, a kind of blacklist in order to disable async
only for a list of devices,

it's only a workaround, the real solution would to be find the real root
cause ( a bug in async method ? a design flaw in JMicron hardware ? a
bios/acpi bug ? )

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index 7e62751..26bb40d 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -1451,18 +1451,6 @@  static int ahci_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	else if (pdev->vendor == 0x177d && pdev->device == 0xa01c)
 		ahci_pci_bar = AHCI_PCI_BAR_CAVIUM;
 
-	/*
-	 * The JMicron chip 361/363 contains one SATA controller and one
-	 * PATA controller,for powering on these both controllers, we must
-	 * follow the sequence one by one, otherwise one of them can not be
-	 * powered on successfully, so here we disable the async suspend
-	 * method for these chips.
-	 */
-	if (pdev->vendor == PCI_VENDOR_ID_JMICRON &&
-		(pdev->device == PCI_DEVICE_ID_JMICRON_JMB363 ||
-		pdev->device == PCI_DEVICE_ID_JMICRON_JMB361))
-		device_disable_async_suspend(&pdev->dev);
-
 	/* acquire resources */
 	rc = pcim_enable_device(pdev);
 	if (rc)
diff --git a/drivers/ata/pata_jmicron.c b/drivers/ata/pata_jmicron.c
index 47e418b..4d1a5d2 100644
--- a/drivers/ata/pata_jmicron.c
+++ b/drivers/ata/pata_jmicron.c
@@ -143,18 +143,6 @@  static int jmicron_init_one (struct pci_dev *pdev, const struct pci_device_id *i
 	};
 	const struct ata_port_info *ppi[] = { &info, NULL };
 
-	/*
-	 * The JMicron chip 361/363 contains one SATA controller and one
-	 * PATA controller,for powering on these both controllers, we must
-	 * follow the sequence one by one, otherwise one of them can not be
-	 * powered on successfully, so here we disable the async suspend
-	 * method for these chips.
-	 */
-	if (pdev->vendor == PCI_VENDOR_ID_JMICRON &&
-		(pdev->device == PCI_DEVICE_ID_JMICRON_JMB363 ||
-		pdev->device == PCI_DEVICE_ID_JMICRON_JMB361))
-		device_disable_async_suspend(&pdev->dev);
-
 	return ata_pci_bmdma_init_one(pdev, ppi, &jmicron_sht, NULL, 0);
 }
 
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index e9fd0e9..02803f8 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -29,6 +29,23 @@ 
 #include "pci.h"
 
 /*
+ * For JMicron chips, we need to disable the async_suspend method, otherwise
+ * they will hit the power-on issue when doing device resume, add one quick
+ * solution to disable the async_suspend method.
+ *
+ * https://bugzilla.kernel.org/show_bug.cgi?id=81551
+ */
+static void pci_async_suspend_fixup(struct pci_dev *pdev)
+{
+	/*
+	 * disabling the async_suspend method for JMicron chips to
+	 * avoid device resuming issue.
+	 */
+	device_disable_async_suspend(&pdev->dev);
+}
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_JMICRON, PCI_ANY_ID, pci_async_suspend_fixup);
+
+/*
  * Decoding should be disabled for a PCI device during BAR sizing to avoid
  * conflict. But doing so may cause problems on host bridge and perhaps other
  * key system devices. For devices that need to have mmio decoding always-on,