diff mbox series

[1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD

Message ID DM6PR04MB6473197DBD89FF4643CC391F8BC19@DM6PR04MB6473.namprd04.prod.outlook.com (mailing list archive)
State Changes Requested
Delegated to: Bjorn Helgaas
Headers show
Series [1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD | expand

Commit Message

Alexey Bogoslavsky Jan. 16, 2023, 6:32 p.m. UTC
From: Alexey Bogoslavsky <mailto:Alexey.Bogoslavsky@wdc.com>

A bug was found in SN730 WD SSD that causes occasional false AER reporting
of correctable errors. While functionally harmless, this causes error
messages to appear in the system log (dmesg) which, in turn, causes
problems in automated platform validation tests. Since the issue can not
be fixed by FW, customers asked for correctable error reporting to be
quirked out in the kernel for this particular device.

The patch was manually verified. It was checked that correctable errors
are still detected but ignored for the target device (SN730), and are both
detected and reported for devices not affected by this quirk.

Signed-off-by: Alexey Bogoslavsky mailto:alexey.bogoslavsky@wdc.com
---
 drivers/pci/pcie/aer.c  | 3 +++
 drivers/pci/quirks.c    | 6 ++++++
 include/linux/pci.h     | 1 +
 include/linux/pci_ids.h | 4 ++++
 4 files changed, 14 insertions(+)

Comments

hch@lst.de Jan. 17, 2023, 7:14 a.m. UTC | #1
On Mon, Jan 16, 2023 at 06:32:54PM +0000, Alexey Bogoslavsky wrote:
> From: Alexey Bogoslavsky <mailto:Alexey.Bogoslavsky@wdc.com>
> 
> A bug was found in SN730 WD SSD that causes occasional false AER reporting
> of correctable errors. While functionally harmless, this causes error
> messages to appear in the system log (dmesg) which, in turn, causes
> problems in automated platform validation tests. Since the issue can not
> be fixed by FW, customers asked for correctable error reporting to be
> quirked out in the kernel for this particular device.
> 
> The patch was manually verified. It was checked that correctable errors
> are still detected but ignored for the target device (SN730), and are both
> detected and reported for devices not affected by this quirk.
> 
> Signed-off-by: Alexey Bogoslavsky mailto:alexey.bogoslavsky@wdc.com
> ---
>  drivers/pci/pcie/aer.c  | 3 +++
>  drivers/pci/quirks.c    | 6 ++++++
>  include/linux/pci.h     | 1 +
>  include/linux/pci_ids.h | 4 ++++
>  4 files changed, 14 insertions(+)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index d7ee79d7b192..5cc24d28b76d 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -721,6 +721,9 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  		goto out;
>  	}
>  
> +	if ((info->severity == AER_CORRECTABLE) && dev->ignore_correctable_errors)

No need for the inner braces.

> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_WESTERN_DIGITAL, PCI_DEVICE_ID_WESTERN_DIGITAL_SN730, wd_ignore_correctable_errors);

Overly long line.  Also wd_ seems like an odd prefix, I'd do pci_
instead.

But overall I'm not really sure it's worth adding code just to surpress
a harmless warning.
Alexey Bogoslavsky Jan. 17, 2023, 1:20 p.m. UTC | #2
>From: 'hch@lst.de' <hch@lst.de> 
>Sent: Tuesday, January 17, 2023 9:14 AM
>To: Alexey Bogoslavsky <Alexey.Bogoslavsky@wdc.com>
>Cc: linux-pci@vger.kernel.org; bhelgaas@google.com; 'hch@lst.de' <hch@lst.de>
>Subject: Re: [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD

>On Mon, Jan 16, 2023 at 06:32:54PM +0000, Alexey Bogoslavsky wrote:
>> From: Alexey Bogoslavsky <mailto:Alexey.Bogoslavsky@wdc.com>
>>
>> A bug was found in SN730 WD SSD that causes occasional false AER reporting
>> of correctable errors. While functionally harmless, this causes error
>> messages to appear in the system log (dmesg) which, in turn, causes
>> problems in automated platform validation tests. Since the issue can not
>> be fixed by FW, customers asked for correctable error reporting to be
>> quirked out in the kernel for this particular device.
>>
>> The patch was manually verified. It was checked that correctable errors
>> are still detected but ignored for the target device (SN730), and are both
>> detected and reported for devices not affected by this quirk.
>>
>> Signed-off-by: Alexey Bogoslavsky mailto:alexey.bogoslavsky@wdc.com
>> ---
>>  drivers/pci/pcie/aer.c  | 3 +++
>>  drivers/pci/quirks.c    | 6 ++++++
>>  include/linux/pci.h     | 1 +
>>  include/linux/pci_ids.h | 4 ++++
>>  4 files changed, 14 insertions(+)
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index d7ee79d7b192..5cc24d28b76d 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -721,6 +721,9 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>>               goto out;
>>       }
>>
>> +     if ((info->severity == AER_CORRECTABLE) && dev->ignore_correctable_errors)

>No need for the inner braces.
Will fix

>> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_WESTERN_DIGITAL, PCI_DEVICE_ID_WESTERN_DIGITAL_SN730, wd_ignore_correctable_errors);

>Overly long line.  Also wd_ seems like an odd prefix, I'd do pci_
instead.
Will fix both, thanks
>But overall I'm not really sure it's worth adding code just to surpress
>a harmless warning.
This is, of course, problematic. We're only resorting to this option after we've tried pretty much everything else
Bjorn Helgaas Jan. 17, 2023, 2:22 p.m. UTC | #3
[+cc Rajat]

On Tue, Jan 17, 2023 at 01:20:28PM +0000, Alexey Bogoslavsky wrote:
> >From: 'hch@lst.de' <hch@lst.de> 
> >On Mon, Jan 16, 2023 at 06:32:54PM +0000, Alexey Bogoslavsky wrote:
> >> From: Alexey Bogoslavsky <mailto:Alexey.Bogoslavsky@wdc.com>
> >>
> >> A bug was found in SN730 WD SSD that causes occasional false AER reporting
> >> of correctable errors. While functionally harmless, this causes error
> >> messages to appear in the system log (dmesg) which, in turn, causes
> >> problems in automated platform validation tests.

I don't think automated test problems warrant an OS change to suppress
warnings.  Those are internal tests where you can automatically ignore
warnings you don't care about.

> >> Since the issue can not
> >> be fixed by FW, customers asked for correctable error reporting to be
> >> quirked out in the kernel for this particular device.

Customer complaints are a little more of an issue.  But let's clarify
what's happening here.  You mention "false AER reporting," and I want
to understand that better.  I guess you mean that SN730 logs a
correctable error and generates an ERR_COR message when no error
actually occurred?

I hesitate to turn off correctable error reporting completely because
that information is useful for monitoring overall system health.  But
there are two things I think we could do:

  - Reformat the correctable error messages so they are obviously
    different from uncorrectable errors and indicate that these have
    been automatically corrected, and

  - Possibly rate-limit them on a per-device basis, e.g., something
    like https://lore.kernel.org/r/20230103165548.570377-1-rajat.khandelwal@linux.intel.com

If we do end up having to turn off reporting completely, I would log a
note that we're doing that so we know we're giving up something and
there may be legitimate errors that we will never see.

> >> The patch was manually verified. It was checked that correctable errors
> >> are still detected but ignored for the target device (SN730), and are both
> >> detected and reported for devices not affected by this quirk.
> >>
> >> Signed-off-by: Alexey Bogoslavsky mailto:alexey.bogoslavsky@wdc.com

Check the history and use the same format here, i.e.,

  Alexey Bogoslavsky <alexey.bogoslavsky@wdc.com>

(Also for the From: line inserted at the top.)

> >> ---
> >>  drivers/pci/pcie/aer.c  | 3 +++
> >>  drivers/pci/quirks.c    | 6 ++++++
> >>  include/linux/pci.h     | 1 +
> >>  include/linux/pci_ids.h | 4 ++++
> >>  4 files changed, 14 insertions(+)
> >>
> >> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> >> index d7ee79d7b192..5cc24d28b76d 100644
> >> --- a/drivers/pci/pcie/aer.c
> >> +++ b/drivers/pci/pcie/aer.c
> >> @@ -721,6 +721,9 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> >>               goto out;
> >>       }
> >>
> >> +     if ((info->severity == AER_CORRECTABLE) && dev->ignore_correctable_errors)
> 
> >No need for the inner braces.
> Will fix
> 
> >> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_WESTERN_DIGITAL, PCI_DEVICE_ID_WESTERN_DIGITAL_SN730, wd_ignore_correctable_errors);
> 
> >Overly long line.  Also wd_ seems like an odd prefix, I'd do pci_
> instead.
> Will fix both, thanks
> >But overall I'm not really sure it's worth adding code just to surpress
> >a harmless warning.
> This is, of course, problematic. We're only resorting to this option after we've tried pretty much everything else
Keith Busch Jan. 17, 2023, 3:54 p.m. UTC | #4
On Mon, Jan 16, 2023 at 06:32:54PM +0000, Alexey Bogoslavsky wrote:
> From: Alexey Bogoslavsky <mailto:Alexey.Bogoslavsky@wdc.com>
> 
> A bug was found in SN730 WD SSD that causes occasional false AER reporting
> of correctable errors. While functionally harmless, this causes error
> messages to appear in the system log (dmesg) which, in turn, causes
> problems in automated platform validation tests. Since the issue can not
> be fixed by FW, customers asked for correctable error reporting to be
> quirked out in the kernel for this particular device.
> 
> The patch was manually verified. It was checked that correctable errors
> are still detected but ignored for the target device (SN730), and are both
> detected and reported for devices not affected by this quirk.

If you're just going to have the kernel ignore these, are you not able
to suppress the ERR_COR message at the source? Have the following
options been tried?

 a. Disabling Correctable Error Reporting Enable in Device Control
    Register; i.e. mask out PCI_EXP_DEVCTL_CERE.
 b. Setting AER Correctable Error Mask Register to all 1's

I think it's usually possible for firmware to hardwire these. If the
firmware can't do that, quirking the kernel to always disable reporting
sounds like a better option. If either of the above fail to suppress the
error messages, then I guess having the kernel ignore it is the only
option.
Alexey Bogoslavsky Jan. 17, 2023, 6:06 p.m. UTC | #5
>From: Bjorn Helgaas <helgaas@kernel.org> 
>Sent: Tuesday, January 17, 2023 4:23 PM
>To: Alexey Bogoslavsky <Alexey.Bogoslavsky@wdc.com>
>Cc: 'hch@lst.de' <hch@lst.de>; linux-pci@vger.kernel.org; bhelgaas@google.com; Rajat Khandelwal <rajat.khandelwal@linux.intel.com>
>Subject: Re: [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD

>[+cc Rajat]

>On Tue, Jan 17, 2023 at 01:20:28PM +0000, Alexey Bogoslavsky wrote:
>> >From: 'hch@lst.de' <hch@lst.de>
>> >On Mon, Jan 16, 2023 at 06:32:54PM +0000, Alexey Bogoslavsky wrote:
>> >> From: Alexey Bogoslavsky <mailto:Alexey.Bogoslavsky@wdc.com>
>> >>
>> >> A bug was found in SN730 WD SSD that causes occasional false AER reporting
>> >> of correctable errors. While functionally harmless, this causes error
>> >> messages to appear in the system log (dmesg) which, in turn, causes
>> >> problems in automated platform validation tests.

>I don't think automated test problems warrant an OS change to suppress
>warnings.  Those are internal tests where you can automatically ignore
>warnings you don't care about.

>> >> Since the issue can not
>> >> be fixed by FW, customers asked for correctable error reporting to be
>> >> quirked out in the kernel for this particular device.

>Customer complaints are a little more of an issue.  But let's clarify
>what's happening here.  You mention "false AER reporting," and I want
>to understand that better.  I guess you mean that SN730 logs a
>correctable error and generates an ERR_COR message when no error
>actually occurred?
Exactly

>I hesitate to turn off correctable error reporting completely because
>that information is useful for monitoring overall system health.  But
>there are two things I think we could do:

>  - Reformat the correctable error messages so they are obviously
>    different from uncorrectable errors and indicate that these have
>    been automatically corrected, and
We've had long and tiresome discussions with the customers using this
specific device in their design trying to convince them to just ignore
the messages, but they wouldn't budge
>  - Possibly rate-limit them on a per-device basis, e.g., something
>    like https://lore.kernel.org/r/20230103165548.570377-1-rajat.khandelwal@linux.intel.com
The thing is on this device correctable PCI errors can never be trusted. So reporting
them in any format or at any rate makes little sense as one can never assume they're real.
Also, the loss of potential real error notifications is something the customers seem
to be OK with, given the pros and cons.

>If we do end up having to turn off reporting completely, I would log a
>note that we're doing that so we know we're giving up something and
>there may be legitimate errors that we will never see.
You mean, print a warning message the first time such error is reported by the device
and ignore all additional ERR_COR messages? This seems like a fine idea, I would
definitely be willing to implement it

>> >> The patch was manually verified. It was checked that correctable errors
>> >> are still detected but ignored for the target device (SN730), and are both
>> >> detected and reported for devices not affected by this quirk.
>> >>
>> >> Signed-off-by: Alexey Bogoslavsky mailto:alexey.bogoslavsky@wdc.com

>Check the history and use the same format here, i.e.,

>  Alexey Bogoslavsky <alexey.bogoslavsky@wdc.com>

>(Also for the From: line inserted at the top.)
Sorry about that, will fix
> >> ---
> >>  drivers/pci/pcie/aer.c  | 3 +++
> >>  drivers/pci/quirks.c    | 6 ++++++
> >>  include/linux/pci.h     | 1 +
> >>  include/linux/pci_ids.h | 4 ++++
> >>  4 files changed, 14 insertions(+)
> >>
> >> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> >> index d7ee79d7b192..5cc24d28b76d 100644
> >> --- a/drivers/pci/pcie/aer.c
> >> +++ b/drivers/pci/pcie/aer.c
> >> @@ -721,6 +721,9 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> >>               goto out;
> >>       }
> >>
> >> +     if ((info->severity == AER_CORRECTABLE) && dev->ignore_correctable_errors)
>
> >No need for the inner braces.
> Will fix
>
> >> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_WESTERN_DIGITAL, PCI_DEVICE_ID_WESTERN_DIGITAL_SN730, wd_ignore_correctable_errors);
>
> >Overly long line.  Also wd_ seems like an odd prefix, I'd do pci_
> instead.
> Will fix both, thanks
> >But overall I'm not really sure it's worth adding code just to surpress
> >a harmless warning.
> This is, of course, problematic. We're only resorting to this option after we've tried pretty much everything else
Alexey Bogoslavsky Jan. 17, 2023, 6:15 p.m. UTC | #6
>From: Keith Busch <kbusch@kernel.org> 
>Sent: Tuesday, January 17, 2023 5:55 PM
>To: Alexey Bogoslavsky <Alexey.Bogoslavsky@wdc.com>
>Cc: linux-pci@vger.kernel.org; bhelgaas@google.com; 'hch@lst.de' <hch@lst.de>
>Subject: Re: [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD

>On Mon, Jan 16, 2023 at 06:32:54PM +0000, Alexey Bogoslavsky wrote:
>> From: Alexey Bogoslavsky <mailto:Alexey.Bogoslavsky@wdc.com>
>>
>> A bug was found in SN730 WD SSD that causes occasional false AER reporting
>> of correctable errors. While functionally harmless, this causes error
>> messages to appear in the system log (dmesg) which, in turn, causes
>> problems in automated platform validation tests. Since the issue can not
>> be fixed by FW, customers asked for correctable error reporting to be
>> quirked out in the kernel for this particular device.
>
>> The patch was manually verified. It was checked that correctable errors
>> are still detected but ignored for the target device (SN730), and are both
>> detected and reported for devices not affected by this quirk.

>If you're just going to have the kernel ignore these, are you not able
>to suppress the ERR_COR message at the source? Have the following
>options been tried?

> a. Disabling Correctable Error Reporting Enable in Device Control
>    Register; i.e. mask out PCI_EXP_DEVCTL_CERE.
> b. Setting AER Correctable Error Mask Register to all 1's

>I think it's usually possible for firmware to hardwire these. If the
I believe these options were discussed but deemed non-viable. I'll double check anyway
>If firmware can't do that, quirking the kernel to always disable reporting
>sounds like a better option. If either of the above fail to suppress the
>error messages, then I guess having the kernel ignore it is the only
>option.
This could probably work. I'll discuss this with our FW team to make sure the issue can
be resolved this way. Thank you
Bjorn Helgaas April 11, 2023, 10:15 p.m. UTC | #7
[+cc Grant, Rajat]

On Tue, Jan 17, 2023 at 06:15:28PM +0000, Alexey Bogoslavsky wrote:
> >From: Keith Busch <kbusch@kernel.org> 
> >Sent: Tuesday, January 17, 2023 5:55 PM
> >To: Alexey Bogoslavsky <Alexey.Bogoslavsky@wdc.com>
> >Cc: linux-pci@vger.kernel.org; bhelgaas@google.com; 'hch@lst.de' <hch@lst.de>
> >Subject: Re: [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD
> 
> >On Mon, Jan 16, 2023 at 06:32:54PM +0000, Alexey Bogoslavsky wrote:
> >> From: Alexey Bogoslavsky <mailto:Alexey.Bogoslavsky@wdc.com>
> >>
> >> A bug was found in SN730 WD SSD that causes occasional false AER reporting
> >> of correctable errors. While functionally harmless, this causes error
> >> messages to appear in the system log (dmesg) which, in turn, causes
> >> problems in automated platform validation tests. Since the issue can not
> >> be fixed by FW, customers asked for correctable error reporting to be
> >> quirked out in the kernel for this particular device.
> >
> >> The patch was manually verified. It was checked that correctable errors
> >> are still detected but ignored for the target device (SN730), and are both
> >> detected and reported for devices not affected by this quirk.
> 
> >If you're just going to have the kernel ignore these, are you not able
> >to suppress the ERR_COR message at the source? Have the following
> >options been tried?
> 
> > a. Disabling Correctable Error Reporting Enable in Device Control
> >    Register; i.e. mask out PCI_EXP_DEVCTL_CERE.
> > b. Setting AER Correctable Error Mask Register to all 1's
> 
> >I think it's usually possible for firmware to hardwire these. If the
>
> I believe these options were discussed but deemed non-viable. I'll
> double check anyway
>
> >If firmware can't do that, quirking the kernel to always disable
> >reporting sounds like a better option. If either of the above fail
> >to suppress the error messages, then I guess having the kernel
> >ignore it is the only option.
>
> This could probably work. I'll discuss this with our FW team to make
> sure the issue can be resolved this way. Thank you

Any resolution on this FW possibility?

We have patches in progress to rate-limit correctable error messages
and make them KERN_INFO instead of KERN_WARN [1], but I don't think
that's going to be a good enough solution for you because nobody wants
to see even an informational message every 5 seconds if the message is
useless.

If firmware on the device can turn off these errors, that would be the
best solution.  If not, I think your quirk is a reasonable approach
and just needs a litle polishing per the previous comments.

Bjorn

[1] https://lore.kernel.org/r/20230317175109.3859943-1-grundler@chromium.org
Alexey Bogoslavsky April 24, 2023, 11:27 a.m. UTC | #8
Hello Bjorn,

Sorry for not addressing your questions earlier. As you may have heard, WD experienced
a hacking attack which left us with no access to the company e-mail for weeks.
As for the patch, no FW change was an option as the product causing the issue was
basically at the end of life. So, I prepared a workaround that took into account
all the comments from the community.

Yet, at this point it seems like the company has lost interest in promoting this patch altogether.
So we could just drop it. Please let me know if there's anything I need to do to request that
officially.

Thank you,
Alexey

-----Original Message-----
From: Bjorn Helgaas <helgaas@kernel.org> 
Sent: Wednesday, April 12, 2023 1:15 AM
To: Alexey Bogoslavsky <Alexey.Bogoslavsky@wdc.com>
Cc: Keith Busch <kbusch@kernel.org>; linux-pci@vger.kernel.org; Bjorn Helgas <bhelgaas@google.com>; Christoph Hellwig <hch@lst.de>; Grant Grundler <grundler@chromium.org>; Rajat Khandelwal <rajat.khandelwal@linux.intel.com>
Subject: Re: [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD

CAUTION: This email originated from outside of Western Digital. Do not click on links or open attachments unless you recognize the sender and know that the content is safe.


[+cc Grant, Rajat]

On Tue, Jan 17, 2023 at 06:15:28PM +0000, Alexey Bogoslavsky wrote:
> >From: Keith Busch <kbusch@kernel.org>
> >Sent: Tuesday, January 17, 2023 5:55 PM
> >To: Alexey Bogoslavsky <Alexey.Bogoslavsky@wdc.com>
> >Cc: linux-pci@vger.kernel.org; bhelgaas@google.com; 'hch@lst.de' <hch@lst.de>
> >Subject: Re: [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD
>
> >On Mon, Jan 16, 2023 at 06:32:54PM +0000, Alexey Bogoslavsky wrote:
> >> From: Alexey Bogoslavsky <mailto:Alexey.Bogoslavsky@wdc.com>
> >>
> >> A bug was found in SN730 WD SSD that causes occasional false AER reporting
> >> of correctable errors. While functionally harmless, this causes error
> >> messages to appear in the system log (dmesg) which, in turn, causes
> >> problems in automated platform validation tests. Since the issue can not
> >> be fixed by FW, customers asked for correctable error reporting to be
> >> quirked out in the kernel for this particular device.
> >
> >> The patch was manually verified. It was checked that correctable errors
> >> are still detected but ignored for the target device (SN730), and are both
> >> detected and reported for devices not affected by this quirk.
>
> >If you're just going to have the kernel ignore these, are you not able
> >to suppress the ERR_COR message at the source? Have the following
> >options been tried?
>
> > a. Disabling Correctable Error Reporting Enable in Device Control
> >    Register; i.e. mask out PCI_EXP_DEVCTL_CERE.
> > b. Setting AER Correctable Error Mask Register to all 1's
>
> >I think it's usually possible for firmware to hardwire these. If the
>
> I believe these options were discussed but deemed non-viable. I'll
> double check anyway
>
> >If firmware can't do that, quirking the kernel to always disable
> >reporting sounds like a better option. If either of the above fail
> >to suppress the error messages, then I guess having the kernel
> >ignore it is the only option.
>
> This could probably work. I'll discuss this with our FW team to make
> sure the issue can be resolved this way. Thank you

Any resolution on this FW possibility?

We have patches in progress to rate-limit correctable error messages
and make them KERN_INFO instead of KERN_WARN [1], but I don't think
that's going to be a good enough solution for you because nobody wants
to see even an informational message every 5 seconds if the message is
useless.

If firmware on the device can turn off these errors, that would be the
best solution.  If not, I think your quirk is a reasonable approach
and just needs a litle polishing per the previous comments.

Bjorn

[1] https://lore.kernel.org/r/20230317175109.3859943-1-grundler@chromium.org
Bjorn Helgaas April 28, 2023, 10 p.m. UTC | #9
On Mon, Apr 24, 2023 at 11:27:43AM +0000, Alexey Bogoslavsky wrote:
> Hello Bjorn,
> 
> Sorry for not addressing your questions earlier. As you may have
> heard, WD experienced a hacking attack which left us with no access
> to the company e-mail for weeks.

So sorry to hear that, that is incredibly disruptive.

> As for the patch, no FW change was an option as the product causing
> the issue was basically at the end of life. So, I prepared a
> workaround that took into account all the comments from the
> community.

Makes sense and is a very common situation.  In general we try to
avoid "fixing" issues by requiring a firmware update because even if
such an update is available, most users will not have it.

I think the hope was that a kernel quirk could tweak something on the
device to make it stop reporting these errors.

> Yet, at this point it seems like the company has lost interest in
> promoting this patch altogether.  So we could just drop it. Please
> let me know if there's anything I need to do to request that
> officially.

No worries, you don't need to do anything.  If you can point me to any
users or bug reports, maybe I can tidy this up and merge it.  Those
users are still being annoyed by the error spam, and that seems
pointless.

Bjorn
diff mbox series

Patch

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index d7ee79d7b192..5cc24d28b76d 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -721,6 +721,9 @@  void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 		goto out;
 	}
 
+	if ((info->severity == AER_CORRECTABLE) && dev->ignore_correctable_errors)
+		return;
+
 	layer = AER_GET_LAYER_ERROR(info->severity, info->status);
 	agent = AER_GET_AGENT(info->severity, info->status);
 
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 285acc4aaccc..21e4972fb242 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -5911,6 +5911,12 @@  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1536, rom_bar_overlap_defect);
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1537, rom_bar_overlap_defect);
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1538, rom_bar_overlap_defect);
 
+static void wd_ignore_correctable_errors(struct pci_dev *pdev)
+{
+	pdev->ignore_correctable_errors = true;
+}
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_WESTERN_DIGITAL, PCI_DEVICE_ID_WESTERN_DIGITAL_SN730, wd_ignore_correctable_errors);
+
 #ifdef CONFIG_PCIEASPM
 /*
  * Several Intel DG2 graphics devices advertise that they can only tolerate
diff --git a/include/linux/pci.h b/include/linux/pci.h
index adffd65e84b4..1e002be07255 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -463,6 +463,7 @@  struct pci_dev {
 	unsigned int	link_active_reporting:1;/* Device capable of reporting link active */
 	unsigned int	no_vf_scan:1;		/* Don't scan for VFs after IOV enablement */
 	unsigned int	no_command_memory:1;	/* No PCI_COMMAND_MEMORY */
+	unsigned int	ignore_correctable_errors:1;	/* Ignore correctable errors reported */
 	unsigned int	rom_bar_overlap:1;	/* ROM BAR disable broken */
 	pci_dev_flags_t dev_flags;
 	atomic_t	enable_cnt;	/* pci_enable_device has been called */
diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
index b362d90eb9b0..b5a08a4cf047 100644
--- a/include/linux/pci_ids.h
+++ b/include/linux/pci_ids.h
@@ -2299,9 +2299,13 @@ 
 #define PCI_VENDOR_ID_QUICKNET		0x15e2
 #define PCI_DEVICE_ID_QUICKNET_XJ	0x0500
 
+#define PCI_VENDOR_ID_WESTERN_DIGITAL		0x15B7
+#define PCI_DEVICE_ID_WESTERN_DIGITAL_SN730	0x5006
+
 /*
  * ADDI-DATA GmbH communication cards mailto:info@addi-data.com
  */
+
 #define PCI_VENDOR_ID_ADDIDATA                 0x15B8
 #define PCI_DEVICE_ID_ADDIDATA_APCI7500        0x7000
 #define PCI_DEVICE_ID_ADDIDATA_APCI7420        0x7001