
[v2,12/15] ath10k: use new module_firmware_crashed()

Message ID 20200515212846.1347-13-mcgrof@kernel.org (mailing list archive)
State New, archived

Commit Message

Luis Chamberlain May 15, 2020, 9:28 p.m. UTC
This makes use of the new module_firmware_crashed() to help
annotate when firmware for device drivers crashes. When firmware
crashes, devices can sometimes become unresponsive, and recovery
sometimes requires a driver unload / reload and, in the worst cases,
a reboot.

Using a taint flag allows us to clearly annotate when this happens.

Cc: linux-wireless@vger.kernel.org
Cc: ath10k@lists.infradead.org
Cc: Kalle Valo <kvalo@codeaurora.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/net/wireless/ath/ath10k/pci.c  | 2 ++
 drivers/net/wireless/ath/ath10k/sdio.c | 2 ++
 drivers/net/wireless/ath/ath10k/snoc.c | 1 +
 3 files changed, 5 insertions(+)
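
The module_firmware_crashed() helper itself is introduced earlier in the
series and is not part of this diff. A rough, non-verbatim sketch of what
such a helper could look like, assuming it does little more than set the
proposed TAINT_FIRMWARE_CRASHED flag (the series also taints the calling
module, which is omitted here):

#include <linux/kernel.h>

/* Sketch only, not the series verbatim: mark the kernel (and, in the
 * real series, the calling module) as tainted by a firmware crash.
 * TAINT_FIRMWARE_CRASHED is the new taint bit proposed by this series.
 */
#define module_firmware_crashed() \
	add_taint(TAINT_FIRMWARE_CRASHED, LOCKDEP_STILL_OK)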

Comments

Rafael Aquini May 16, 2020, 4:11 a.m. UTC | #1
On Fri, May 15, 2020 at 09:28:43PM +0000, Luis Chamberlain wrote:
> This makes use of the new module_firmware_crashed() to help
> annotate when firmware for device drivers crashes. When firmware
> crashes, devices can sometimes become unresponsive, and recovery
> sometimes requires a driver unload / reload and, in the worst cases,
> a reboot.
> 
> Using a taint flag allows us to clearly annotate when this happens.
> 
> Cc: linux-wireless@vger.kernel.org
> Cc: ath10k@lists.infradead.org
> Cc: Kalle Valo <kvalo@codeaurora.org>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  drivers/net/wireless/ath/ath10k/pci.c  | 2 ++
>  drivers/net/wireless/ath/ath10k/sdio.c | 2 ++
>  drivers/net/wireless/ath/ath10k/snoc.c | 1 +
>  3 files changed, 5 insertions(+)
> 
> diff --git a/drivers/net/wireless/ath/ath10k/pci.c b/drivers/net/wireless/ath/ath10k/pci.c
> index 1d941d53fdc9..6bd0f3b518b9 100644
> --- a/drivers/net/wireless/ath/ath10k/pci.c
> +++ b/drivers/net/wireless/ath/ath10k/pci.c
> @@ -1767,6 +1767,7 @@ static void ath10k_pci_fw_dump_work(struct work_struct *work)
>  		scnprintf(guid, sizeof(guid), "n/a");
>  
>  	ath10k_err(ar, "firmware crashed! (guid %s)\n", guid);
> +	module_firmware_crashed();
>  	ath10k_print_driver_info(ar);
>  	ath10k_pci_dump_registers(ar, crash_data);
>  	ath10k_ce_dump_registers(ar, crash_data);
> @@ -2837,6 +2838,7 @@ static int ath10k_pci_hif_power_up(struct ath10k *ar,
>  	if (ret) {
>  		if (ath10k_pci_has_fw_crashed(ar)) {
>  			ath10k_warn(ar, "firmware crashed during chip reset\n");
> +			module_firmware_crashed();
>  			ath10k_pci_fw_crashed_clear(ar);
>  			ath10k_pci_fw_crashed_dump(ar);
>  		}
> diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c
> index e2aff2254a40..d34ad289380f 100644
> --- a/drivers/net/wireless/ath/ath10k/sdio.c
> +++ b/drivers/net/wireless/ath/ath10k/sdio.c
> @@ -794,6 +794,7 @@ static int ath10k_sdio_mbox_proc_dbg_intr(struct ath10k *ar)
>  
>  	/* TODO: Add firmware crash handling */
>  	ath10k_warn(ar, "firmware crashed\n");
> +	module_firmware_crashed();
>  
>  	/* read counter to clear the interrupt, the debug error interrupt is
>  	 * counter 0.
> @@ -915,6 +916,7 @@ static int ath10k_sdio_mbox_proc_cpu_intr(struct ath10k *ar)
>  	if (cpu_int_status & MBOX_CPU_STATUS_ENABLE_ASSERT_MASK) {
>  		ath10k_err(ar, "firmware crashed!\n");
>  		queue_work(ar->workqueue, &ar->restart_work);
> +		module_firmware_crashed();
>  	}
>  	return ret;
>  }
> diff --git a/drivers/net/wireless/ath/ath10k/snoc.c b/drivers/net/wireless/ath/ath10k/snoc.c
> index 354d49b1cd45..7cfc123c345c 100644
> --- a/drivers/net/wireless/ath/ath10k/snoc.c
> +++ b/drivers/net/wireless/ath/ath10k/snoc.c
> @@ -1451,6 +1451,7 @@ void ath10k_snoc_fw_crashed_dump(struct ath10k *ar)
>  		scnprintf(guid, sizeof(guid), "n/a");
>  
>  	ath10k_err(ar, "firmware crashed! (guid %s)\n", guid);
> +	module_firmware_crashed();
>  	ath10k_print_driver_info(ar);
>  	ath10k_msa_dump_memory(ar, crash_data);
>  	mutex_unlock(&ar->dump_mutex);
> -- 
> 2.26.2
> 
Acked-by: Rafael Aquini <aquini@redhat.com>
Johannes Berg May 16, 2020, 1:24 p.m. UTC | #2
On Fri, 2020-05-15 at 21:28 +0000, Luis Chamberlain wrote:
> module_firmware_crashed

You didn't CC me or the wireless list on the rest of the patches, so I'm
replying to a random one, but ...

What is the point here?

This should in no way affect the integrity of the system/kernel, for
most devices anyway.

So what if ath10k's firmware crashes? If there's a driver bug it will
not handle it right (and probably crash, WARN_ON, or something else),
but if the driver is working right then that will not affect the kernel
at all.

So maybe I can understand that maybe you want an easy way to discover -
per device - that the firmware crashed, but that still doesn't warrant a
complete kernel taint.

Instead of the kernel taint, IMHO you should provide an annotation in
sysfs (or somewhere else) for the *struct device* that had its firmware
crash. Or maybe, if it's too complex to walk the entire hierarchy
checking for that, have a uevent, or add the ability for the kernel to
print out elsewhere in debugfs the list of devices that crashed at some
point... All of that is fine, but a kernel taint?
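
Purely as an illustration (the attribute name, the driver structure and
where the counter lives are all made up), such a per-device annotation
could look something like this:

#include <linux/atomic.h>
#include <linux/device.h>
#include <linux/sysfs.h>

struct foo_priv {
	atomic_t fw_crash_count;	/* bumped on every firmware crash */
};

static ssize_t firmware_crash_count_show(struct device *dev,
					 struct device_attribute *attr,
					 char *buf)
{
	struct foo_priv *priv = dev_get_drvdata(dev);

	return sprintf(buf, "%u\n", atomic_read(&priv->fw_crash_count));
}
static DEVICE_ATTR_RO(firmware_crash_count);

/* at probe:   device_create_file(dev, &dev_attr_firmware_crash_count);
 * on a crash: atomic_inc(&priv->fw_crash_count);
 *             sysfs_notify(&dev->kobj, NULL, "firmware_crash_count");
 */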

johannes
Johannes Berg May 16, 2020, 1:50 p.m. UTC | #3
On Sat, 2020-05-16 at 15:24 +0200, Johannes Berg wrote:

> Instead of the kernel taint, IMHO you should provide an annotation in
> sysfs (or somewhere else) for the *struct device* that had its firmware
> crash. Or maybe, if it's too complex to walk the entire hierarchy
> checking for that, have a uevent, or add the ability for the kernel to
> print out elsewhere in debugfs the list of devices that crashed at some

I mean sysfs, oops.


In addition, look what we have in iwl_trans_pcie_removal_wk(). If we
detect that the device is really wedged enough that the only way we can
still try to recover is by completely unbinding the driver from it, then
we give userspace a uevent for that. I don't remember exactly how and
where that gets used (ChromeOS) though, but it'd be nice to have that
sort of thing as part of the infrastructure, in a sort of two-level
notification?

Level 1: firmware crashed, but we're recovering, at least mostly, and
it's more informational

Level 2: device is wedged, going to try to recover by some more forceful
means (perhaps some devices can be power-cycled? etc.) but (more) state
would be lost in these cases?
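
For illustration only, roughly the shape such a notification could take,
modeled loosely on the iwlwifi uevent mentioned above (the FW_EVENT
strings here are made up, not an existing ABI):

#include <linux/device.h>
#include <linux/kobject.h>

static void notify_fw_crash(struct device *dev, bool wedged)
{
	char *envp[] = {
		/* level 1 vs. level 2 from above */
		wedged ? "FW_EVENT=wedged" : "FW_EVENT=crashed",
		NULL
	};

	kobject_uevent_env(&dev->kobj, KOBJ_CHANGE, envp);
}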

Still don't think a kernel taint is appropriate for either of these.

johannes
Luis Chamberlain May 18, 2020, 4:51 p.m. UTC | #4
On Sat, May 16, 2020 at 03:24:01PM +0200, Johannes Berg wrote:
> On Fri, 2020-05-15 at 21:28 +0000, Luis Chamberlain wrote:> module_firmware_crashed
> 
> You didn't CC me or the wireless list on the rest of the patches, so I'm
> replying to a random one, but ...
> 
> What is the point here?
> 
> This should in no way affect the integrity of the system/kernel, for
> most devices anyway.

Keyword you used here is "most device". And in the worst case, *who*
knows what other odd things may happen afterwards.

> So what if ath10k's firmware crashes? If there's a driver bug it will
> not handle it right (and probably crash, WARN_ON, or something else),
> but if the driver is working right then that will not affect the kernel
> at all.

Sometimes the device can go into a state which requires driver removal
and addition to get things back up.

> So maybe I can understand that maybe you want an easy way to discover -
> per device - that the firmware crashed, but that still doesn't warrant a
> complete kernel taint.

That is one reason; another is that a taint helps support cases *quickly*
and easily detect whether the issue was a firmware crash, instead of
scraping logs for driver-specific ways of saying the firmware has crashed.

> Instead of the kernel taint, IMHO you should provide an annotation in
> sysfs (or somewhere else) for the *struct device* that had its firmware
> crash.

It would seem the way some folks are thinking about getting more details
would be through devlink.

> Or maybe, if it's too complex to walk the entire hierarchy
> checking for that, have a uevent,  or add the ability for the kernel to
> print out elsewhere in debugfs the list of devices that crashed at some
> point... All of that is fine, but a kernel taint?

debugfs is optional; a taint is simple and device-agnostic. From a
support perspective it makes it very easy to see whether a possible
issue may be device-firmware specific.

  Luis
Luis Chamberlain May 18, 2020, 4:56 p.m. UTC | #5
On Sat, May 16, 2020 at 03:50:55PM +0200, Johannes Berg wrote:
> On Sat, 2020-05-16 at 15:24 +0200, Johannes Berg wrote:
> 
> > Instead of the kernel taint, IMHO you should provide an annotation in
> > sysfs (or somewhere else) for the *struct device* that had its firmware
> > crash. Or maybe, if it's too complex to walk the entire hierarchy
> > checking for that, have a uevent, or add the ability for the kernel to
> > print out elsewhere in debugfs the list of devices that crashed at some
> 
> I mean sysfs, oops.
> 
> 
> In addition, look what we have in iwl_trans_pcie_removal_wk(). If we
> detect that the device is really wedged enough that the only way we can
> still try to recover is by completely unbinding the driver from it, then
> we give userspace a uevent for that.

Nice! Indeed a uevent is in order for these sorts of things, and I'd
argue that it raises the question of whether we should even uevent for
any taint as well. Today these are silent. If the kernel crashes, today we only
give userspace a log.

> I don't remember exactly how and
> where that gets used (ChromeOS) though, but it'd be nice to have that
> sort of thing as part of the infrastructure, in a sort of two-level
> notification?
> 
> Level 1: firmware crashed, but we're recovering, at least mostly, and
> it's more informational
> 
> Level 2: device is wedged, going to try to recover by some more forceful
> means (perhaps some devices can be power-cycled? etc.) but (more) state
> would be lost in these cases?

I agree that *all* this would be ideal. I don't see this as mutually
exclusive with a taint on the kernel and module for the device.

> Still don't think a kernel taint is appropriate for either of these.

From a support perspective, I do think it provides vital information quickly.

  Luis
Ben Greear May 18, 2020, 4:58 p.m. UTC | #6
On 05/18/2020 09:51 AM, Luis Chamberlain wrote:
> On Sat, May 16, 2020 at 03:24:01PM +0200, Johannes Berg wrote:
>> On Fri, 2020-05-15 at 21:28 +0000, Luis Chamberlain wrote:> module_firmware_crashed
>>
>> You didn't CC me or the wireless list on the rest of the patches, so I'm
>> replying to a random one, but ...
>>
>> What is the point here?
>>
>> This should in no way affect the integrity of the system/kernel, for
>> most devices anyway.
>
> Keyword you used here is "most device". And in the worst case, *who*
> knows what other odd things may happen afterwards.
>
>> So what if ath10k's firmware crashes? If there's a driver bug it will
>> not handle it right (and probably crash, WARN_ON, or something else),
>> but if the driver is working right then that will not affect the kernel
>> at all.
>
> Sometimes the device can go into a state which requires driver removal
> and addition to get things back up.

It would be lovely to be able to detect this case in the driver/system
somehow!  I haven't seen any such cases recently, but in case there is
some common case you see, maybe we can think of a way to detect it?

>
>> So maybe I can understand that maybe you want an easy way to discover -
>> per device - that the firmware crashed, but that still doesn't warrant a
>> complete kernel taint.
>
> That is one reason, another is that a taint helps support cases *fast*
> easily detect if the issue was a firmware crash, instead of scraping
> logs for driver specific ways to say the firmware has crashed.

You can listen for udev events (I think that is the right term),
and find crashes that way.  You get the actual crash info as well.

Thanks,
Ben
Luis Chamberlain May 18, 2020, 5:09 p.m. UTC | #7
On Mon, May 18, 2020 at 09:58:53AM -0700, Ben Greear wrote:
> 
> 
> On 05/18/2020 09:51 AM, Luis Chamberlain wrote:
> > On Sat, May 16, 2020 at 03:24:01PM +0200, Johannes Berg wrote:
> > > On Fri, 2020-05-15 at 21:28 +0000, Luis Chamberlain wrote:> module_firmware_crashed
> > > 
> > > You didn't CC me or the wireless list on the rest of the patches, so I'm
> > > replying to a random one, but ...
> > > 
> > > What is the point here?
> > > 
> > > This should in no way affect the integrity of the system/kernel, for
> > > most devices anyway.
> > 
> > Keyword you used here is "most device". And in the worst case, *who*
> > knows what other odd things may happen afterwards.
> > 
> > > So what if ath10k's firmware crashes? If there's a driver bug it will
> > > not handle it right (and probably crash, WARN_ON, or something else),
> > > but if the driver is working right then that will not affect the kernel
> > > at all.
> > 
> > Sometimes the device can go into a state which requires driver removal
> > and addition to get things back up.
> 
> It would be lovely to be able to detect this case in the driver/system
> somehow!  I haven't seen any such cases recently,

I assure you that I have run into it. Once it happens again I'll report
the crash, but the problem with some of this is that unless you scrape
the log you won't know. Eventually, a uevent would indeed inform
me.

> but in case there is
> some common case you see, maybe we can think of a way to detect it?

ath10k is just one case, this patch series addresses a simple way to
annotate this tree-wide.

> > > So maybe I can understand that maybe you want an easy way to discover -
> > > per device - that the firmware crashed, but that still doesn't warrant a
> > > complete kernel taint.
> > 
> > That is one reason, another is that a taint helps support cases *fast*
> > easily detect if the issue was a firmware crash, instead of scraping
> > logs for driver specific ways to say the firmware has crashed.
> 
> You can listen for udev events (I think that is the right term),
> and find crashes that way.  You get the actual crash info as well.

My follow-up to this was to add a uevent to add_taint() as well, so that
these could be processed generically by userspace.

  Luis
Ben Greear May 18, 2020, 5:15 p.m. UTC | #8
On 05/18/2020 10:09 AM, Luis Chamberlain wrote:
> On Mon, May 18, 2020 at 09:58:53AM -0700, Ben Greear wrote:
>>
>>
>> On 05/18/2020 09:51 AM, Luis Chamberlain wrote:
>>> On Sat, May 16, 2020 at 03:24:01PM +0200, Johannes Berg wrote:
>>>> On Fri, 2020-05-15 at 21:28 +0000, Luis Chamberlain wrote:> module_firmware_crashed
>>>>
>>>> You didn't CC me or the wireless list on the rest of the patches, so I'm
>>>> replying to a random one, but ...
>>>>
>>>> What is the point here?
>>>>
>>>> This should in no way affect the integrity of the system/kernel, for
>>>> most devices anyway.
>>>
>>> Keyword you used here is "most device". And in the worst case, *who*
>>> knows what other odd things may happen afterwards.
>>>
>>>> So what if ath10k's firmware crashes? If there's a driver bug it will
>>>> not handle it right (and probably crash, WARN_ON, or something else),
>>>> but if the driver is working right then that will not affect the kernel
>>>> at all.
>>>
>>> Sometimes the device can go into a state which requires driver removal
>>> and addition to get things back up.
>>
>> It would be lovely to be able to detect this case in the driver/system
>> somehow!  I haven't seen any such cases recently,
>
> I assure you that I have run into it. Once it does again I'll report
> the crash, but the problem with some of this is that unless you scrape
> the log you won't know. Eventually, a uevent would indeed tell inform
> me.
>
>> but in case there is
>> some common case you see, maybe we can think of a way to detect it?
>
> ath10k is just one case, this patch series addresses a simple way to
> annotate this tree-wide.
>
>>>> So maybe I can understand that maybe you want an easy way to discover -
>>>> per device - that the firmware crashed, but that still doesn't warrant a
>>>> complete kernel taint.
>>>
>>> That is one reason, another is that a taint helps support cases *fast*
>>> easily detect if the issue was a firmware crash, instead of scraping
>>> logs for driver specific ways to say the firmware has crashed.
>>
>> You can listen for udev events (I think that is the right term),
>> and find crashes that way.  You get the actual crash info as well.
>
> My follow up to this was to add uevent to add_taint() as well, this way
> these could generically be processed by userspace.

I'm not opposed to the taint, though I have not thought much on it.

But, if you can already get the crash info from uevent, and it automatically
comes without polling or scraping logs, then what benefit beyond that does
the taint give you?

Thanks,
Ben
Luis Chamberlain May 18, 2020, 5:18 p.m. UTC | #9
On Mon, May 18, 2020 at 10:15:45AM -0700, Ben Greear wrote:
> 
> 
> On 05/18/2020 10:09 AM, Luis Chamberlain wrote:
> > On Mon, May 18, 2020 at 09:58:53AM -0700, Ben Greear wrote:
> > > 
> > > 
> > > On 05/18/2020 09:51 AM, Luis Chamberlain wrote:
> > > > On Sat, May 16, 2020 at 03:24:01PM +0200, Johannes Berg wrote:
> > > > > On Fri, 2020-05-15 at 21:28 +0000, Luis Chamberlain wrote:> module_firmware_crashed
> > > > > 
> > > > > You didn't CC me or the wireless list on the rest of the patches, so I'm
> > > > > replying to a random one, but ...
> > > > > 
> > > > > What is the point here?
> > > > > 
> > > > > This should in no way affect the integrity of the system/kernel, for
> > > > > most devices anyway.
> > > > 
> > > > Keyword you used here is "most device". And in the worst case, *who*
> > > > knows what other odd things may happen afterwards.
> > > > 
> > > > > So what if ath10k's firmware crashes? If there's a driver bug it will
> > > > > not handle it right (and probably crash, WARN_ON, or something else),
> > > > > but if the driver is working right then that will not affect the kernel
> > > > > at all.
> > > > 
> > > > Sometimes the device can go into a state which requires driver removal
> > > > and addition to get things back up.
> > > 
> > > It would be lovely to be able to detect this case in the driver/system
> > > somehow!  I haven't seen any such cases recently,
> > 
> > I assure you that I have run into it. Once it does again I'll report
> > the crash, but the problem with some of this is that unless you scrape
> > the log you won't know. Eventually, a uevent would indeed tell inform
> > me.
> > 
> > > but in case there is
> > > some common case you see, maybe we can think of a way to detect it?
> > 
> > ath10k is just one case, this patch series addresses a simple way to
> > annotate this tree-wide.
> > 
> > > > > So maybe I can understand that maybe you want an easy way to discover -
> > > > > per device - that the firmware crashed, but that still doesn't warrant a
> > > > > complete kernel taint.
> > > > 
> > > > That is one reason, another is that a taint helps support cases *fast*
> > > > easily detect if the issue was a firmware crash, instead of scraping
> > > > logs for driver specific ways to say the firmware has crashed.
> > > 
> > > You can listen for udev events (I think that is the right term),
> > > and find crashes that way.  You get the actual crash info as well.
> > 
> > My follow up to this was to add uevent to add_taint() as well, this way
> > these could generically be processed by userspace.
> 
> I'm not opposed to the taint, though I have not thought much on it.
> 
> But, if you can already get the crash info from uevent, and it automatically
> comes without polling or scraping logs, then what benefit beyond that does
> the taint give you?

From a support perspective it is a *crystal* clear sign that the device
and / or device driver may be in a very bad state, in a generic way.

  Luis
Steve deRosier May 18, 2020, 6:06 p.m. UTC | #10
On Mon, May 18, 2020 at 10:19 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Mon, May 18, 2020 at 10:15:45AM -0700, Ben Greear wrote:
> >
> >
> > On 05/18/2020 10:09 AM, Luis Chamberlain wrote:
> > > On Mon, May 18, 2020 at 09:58:53AM -0700, Ben Greear wrote:
> > > >
> > > >
> > > > On 05/18/2020 09:51 AM, Luis Chamberlain wrote:
> > > > > On Sat, May 16, 2020 at 03:24:01PM +0200, Johannes Berg wrote:
> > > > > > On Fri, 2020-05-15 at 21:28 +0000, Luis Chamberlain wrote:> module_firmware_crashed
> > > > > >
> > > > > > You didn't CC me or the wireless list on the rest of the patches, so I'm
> > > > > > replying to a random one, but ...
> > > > > >
> > > > > > What is the point here?
> > > > > >
> > > > > > This should in no way affect the integrity of the system/kernel, for
> > > > > > most devices anyway.
> > > > >
> > > > > Keyword you used here is "most device". And in the worst case, *who*
> > > > > knows what other odd things may happen afterwards.
> > > > >
> > > > > > So what if ath10k's firmware crashes? If there's a driver bug it will
> > > > > > not handle it right (and probably crash, WARN_ON, or something else),
> > > > > > but if the driver is working right then that will not affect the kernel
> > > > > > at all.
> > > > >
> > > > > Sometimes the device can go into a state which requires driver removal
> > > > > and addition to get things back up.
> > > >
> > > > It would be lovely to be able to detect this case in the driver/system
> > > > somehow!  I haven't seen any such cases recently,
> > >
> > > I assure you that I have run into it. Once it does again I'll report
> > > the crash, but the problem with some of this is that unless you scrape
> > > the log you won't know. Eventually, a uevent would indeed tell inform
> > > me.
> > >
> > > > but in case there is
> > > > some common case you see, maybe we can think of a way to detect it?
> > >
> > > ath10k is just one case, this patch series addresses a simple way to
> > > annotate this tree-wide.
> > >
> > > > > > So maybe I can understand that maybe you want an easy way to discover -
> > > > > > per device - that the firmware crashed, but that still doesn't warrant a
> > > > > > complete kernel taint.
> > > > >
> > > > > That is one reason, another is that a taint helps support cases *fast*
> > > > > easily detect if the issue was a firmware crash, instead of scraping
> > > > > logs for driver specific ways to say the firmware has crashed.
> > > >
> > > > You can listen for udev events (I think that is the right term),
> > > > and find crashes that way.  You get the actual crash info as well.
> > >
> > > My follow up to this was to add uevent to add_taint() as well, this way
> > > these could generically be processed by userspace.
> >
> > I'm not opposed to the taint, though I have not thought much on it.
> >
> > But, if you can already get the crash info from uevent, and it automatically
> > comes without polling or scraping logs, then what benefit beyond that does
> > the taint give you?
>
> From a support perspective it is a *crystal* clear sign that the device
> and / or device driver may be in a very bad state, in a generic way.
>

Unfortunately a "taint" is interpreted by many users as: "your kernel
is really F#*D up, you better do something about it right now."
Assuming they're paying attention at all in the first place of course.

The fact is, WiFi chip firmware crashes, and in most cases the driver
is able to recover seamlessly. At least that is the case with most QCA
chipsets I work with. And the users or our ability to do anything
about it is minimal to none as we don't have access to firmware
source. It's too bad and I wish it weren't the case, but we have
embraced reality and most drivers have a recovery mechanism built in
for this case. In short, it's a non-event. I fear that elevating this
to a kernel taint will significantly increase "support" requests that
really are nothing but noise; similar to how the firmware load failure
messages (fail to load fw-2.bin, fail to load fw-1.bin, yay loaded
fw-0.bin) cause a lot of noise.

Not specifically opposed, but I wonder what it really accomplishes in
a world where the firmware crashing is pretty much a normal
occurrence.

If it goes in, I think that the drivers shouldn't trigger the taint if
they're able to recover normally. Only trigger on failure to come back
up.  In other words, the ideal place in the ath10k driver isn't where
you have proposed as at that point operation is normal and we're doing
a routine recovery.

- Steve





>   Luis
Luis Chamberlain May 18, 2020, 7:09 p.m. UTC | #11
On Mon, May 18, 2020 at 11:06:27AM -0700, Steve deRosier wrote:
> On Mon, May 18, 2020 at 10:19 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
> > From a support perspective it is a *crystal* clear sign that the device
> > and / or device driver may be in a very bad state, in a generic way.
> >
> 
> Unfortunately a "taint" is interpreted by many users as: "your kernel
> is really F#*D up, you better do something about it right now."
> Assuming they're paying attention at all in the first place of course.

Taint historically has been used and still is today to help rule out
whether or not you get support, or how you get support.

For instance, a staging driver is not supported by some upstream
developers, but it will be by those who help staging and Greg. TAINT_CRAP
could not be clearer.

So, no, it is not just about "hey your kernel is messed up", there are
clear support boundaries being drawn.

> The fact is, WiFi chip firmware crashes, and in most cases the driver
> is able to recover seamlessly. At least that is the case with most QCA
> chipsets I work with. 

That has not been my experience with the same driver, and so how do we
know? And this patch set is not about ath10k alone; I want you to
think about *all* device drivers with firmware. In my journey to scrape
the kernel for these cases I was very surprised by the amount of code
which clearly annotates these situations.

> And the users or our ability to do anything
> about it is minimal to none as we don't have access to firmware
> source.

This is not true; we have open firmware in WiFi. Some vendors choose
not to open source their firmware; that is their decision.

These days, though, I think we all admit that firmware crashes could use
better generic infrastructure for dealing with issues that clearly affect
the user experience. This patch is about ensuring that *when and if these
happen*, we annotate it in the kernel for support purposes.

> It's too bad and I wish it weren't the case, but we have
> embraced reality and most drivers have a recovery mechanism built in
> for this case.

The mentality that firmware crashes are the end of the world is
certainly what will lead developers to often hide these. Where this
is out in the open, and not obfuscated, I'd argue that firmware issues
likely get fixed more often.

So what you describe is not bad, it's just accepting evolution.

> In short, it's a non-event. I fear that elevating this
> to a kernel taint will significantly increase "support" requests that
> really are nothing but noise;

That will depend on where you put this on the driver, and that is
why it is important to place it in the right place, if any.

> similar to how the firmware load failure
> messages (fail to load fw-2.bin, fail to load fw-1.bin, yay loaded
> fw-0.bin) cause a lot of noise.

That can be fixed; the developers behind this series gave up on it.
It has to do with a version range of supported firmwares, all of them
optional, but with at least one required.

> Not specifically opposed, but I wonder what it really accomplishes in
> a world where the firmware crashing is pretty much a normal
> occurrence.

Recovery without affecting user experience would be great, the taint is
*not* for those cases. The taint definition has:

+ 18) ``Q`` used by device drivers to annotate that the device driver's firmware
+     has crashed and the device's operation has been severely affected. The    
+     device may be left in a crippled state, requiring full driver removal /   
+     addition, system reboot, or it is unclear how long recovery will take.

Let me know if this is not clear.

> If it goes in, I think that the drivers shouldn't trigger the taint if
> they're able to recover normally. Only trigger on failure to come back
> up.  In other words, the ideal place in the ath10k driver isn't where
> you have proposed as at that point operation is normal and we're doing
> a routine recovery.

Sure, happy to remove it if indeed it is the case that the firmware
crash does not cripple the device, but I can vouch for the
fact that the exact place where I placed it left my device driver in a
state where I had to remove / add again.

  Luis
Johannes Berg May 18, 2020, 7:25 p.m. UTC | #12
On Mon, 2020-05-18 at 19:09 +0000, Luis Chamberlain wrote:

> > Unfortunately a "taint" is interpreted by many users as: "your kernel
> > is really F#*D up, you better do something about it right now."
> > Assuming they're paying attention at all in the first place of course.
> 
> Taint historically has been used and still is today to help rule out
> whether or not you get support, or how you get support.
> 
> For instance, a staging driver is not supported by some upstream
> developers, but it will be by those who help staging and Greg. TAINT_CRAP
> cannot be even more clear.
> 
> So, no, it is not just about "hey your kernel is messed up", there are
> clear support boundaries being drawn.

Err, no. Those two are most definitely related. Have you looked at (most
or some or whatever) staging drivers recently? Those contain all kinds
of garbage that might do whatever with your kernel.

Of course that's not a completely clear boundary, maybe you can find a
driver in staging that's perfect code just not written to kernel style?
But I find that hard to believe, in most cases.

So no, it's really not about "[a] staging driver is not supported" vs.
"your kernel is messed up". The very fact that you loaded one of those
things might very well have messed up your kernel entirely.

> These days though, I think we all admit, that firmware crashes can use
> a better generic infrastructure for ensuring that clearly affecting-user
> experience issues. This patch is about that *when and if these happen*,
> we annotate it in the kernel for support pursposes.

That's all fine, I just don't think it's appropriate to pretend that
your kernel is now 'tainted' (think about the meaning of that word) when
the firmware of some random device crashed. Heck, that could have been a
USB device that was since unplugged. Unless the driver is complete
garbage (hello staging again?) that really should have no lasting effect
on the system itself.

> Recovery without affecting user experience would be great, the taint is
> *not* for those cases. The taint definition has:
> 
> + 18) ``Q`` used by device drivers to annotate that the device driver's firmware
> +     has crashed and the device's operation has been severely affected. The    
> +     device may be left in a crippled state, requiring full driver removal /   
> +     addition, system reboot, or it is unclear how long recovery will take.
> 
> Let me know if this is not clear.

It's pretty clear, but even then, first of all I doubt this is the case
for many of the places that you've sprinkled the annotation on, and
secondly it actually hides useful information.

Regardless of the support issue, I think this hiding of information is
also problematic.

I really think we'd all be better off if you just made a sysfs file (I
mistyped debugfs in some other email, sorry, apparently you didn't see
the correction in time) that listed which device(s) crashed and how many
times. That would actually be useful. Because honestly, if a random
device crashed for some random reason, that's pretty much a non-event.
If it keeps happening, then we might even want to know about it.

You can obviously save the contents of this file into your bug reports
automatically and act accordingly, but I think you'll find that this is
far more useful than saying "TAINT_FIRMWARE_CRASHED" so I'll ignore this
report. Yeah, that might be reasonable thing if the bug report is about
slow wifi *and* you see that ath10k firmware crashed every 10 seconds,
but if it just crashed once a few days earlier it's of no importance to
the system anymore ... And certainly a reasonable driver (which I
believe ath10k to be) would _not_ randomly start corrupting memory
because its firmware crashed. Which really is what tainting the kernel
is about.

So no, even with all that, I still really believe you're solving the
wrong problem. Having information about firmware crashes, preferably
with some kind of frequency information attached, and *clearly* with
information about which device attached would be _great_. Munging it all
into one bit is actively harmful, IMO.

johannes
Luis Chamberlain May 18, 2020, 7:59 p.m. UTC | #13
On Mon, May 18, 2020 at 09:25:09PM +0200, Johannes Berg wrote:
> On Mon, 2020-05-18 at 19:09 +0000, Luis Chamberlain wrote:
> 
> > > Unfortunately a "taint" is interpreted by many users as: "your kernel
> > > is really F#*D up, you better do something about it right now."
> > > Assuming they're paying attention at all in the first place of course.
> > 
> > Taint historically has been used and still is today to help rule out
> > whether or not you get support, or how you get support.
> > 
> > For instance, a staging driver is not supported by some upstream
> > developers, but it will be by those who help staging and Greg. TAINT_CRAP
> > cannot be even more clear.
> > 
> > So, no, it is not just about "hey your kernel is messed up", there are
> > clear support boundaries being drawn.
> 
> Err, no. Those two are most definitely related. Have you looked at (most
> or some or whatever) staging drivers recently? Those contain all kinds
> of garbage that might do whatever with your kernel.

No, I stay away :)

> Of course that's not a completely clear boundary, maybe you can find a
> driver in staging that's perfect code just not written to kernel style?
> But I find that hard to believe, in most cases.
> 
> So no, it's really not about "[a] staging driver is not supported" vs.
> "your kernel is messed up". The very fact that you loaded one of those
> things might very well have messed up your kernel entirely.
> 
> > These days though, I think we all admit, that firmware crashes can use
> > a better generic infrastructure for ensuring that clearly affecting-user
> > experience issues. This patch is about that *when and if these happen*,
> > we annotate it in the kernel for support pursposes.
> 
> That's all fine, I just don't think it's appropriate to pretend that
> your kernel is now 'tainted' (think about the meaning of that word) when
> the firmware of some random device crashed.

If the firmware crash *does* require driver remove / addition again,
or a reboot, would you think that this is a situation that merits a taint?

> > Recovery without affecting user experience would be great, the taint is
> > *not* for those cases. The taint definition has:
> > 
> > + 18) ``Q`` used by device drivers to annotate that the device driver's firmware
> > +     has crashed and the device's operation has been severely affected. The    
> > +     device may be left in a crippled state, requiring full driver removal /   
> > +     addition, system reboot, or it is unclear how long recovery will take.
> > 
> > Let me know if this is not clear.
> 
> It's pretty clear, but even then, first of all I doubt this is the case
> for many of the places that you've sprinkled the annotation on,

We can remove it; for this driver I can vouch for its location, as it did
reach a state where I required a reboot. And it's not the first time this
has happened. This got me thinking about the bigger picture of the lack
of a proper way to address these cases in the kernel, and how the user is
left dumbfounded.

> and secondly it actually hides useful information.

What is it hiding?

> Regardless of the support issue, I think this hiding of information is
> also problematic.
> 
> I really think we'd all be better off if you just made a sysfs file (I
> mistyped debugfs in some other email, sorry, apparently you didn't see
> the correction in time) that listed which device(s) crashed and how many
> times. 

Ah yes, count. The taint does not address count.

> That would actually be useful. Because honestly, if a random
> device crashed for some random reason, that's pretty much a non-event.
> If it keeps happening, then we might even want to know about it.

True.

> You can obviously save the contents of this file into your bug reports
> automatically and act accordingly, but I think you'll find that this is
> far more useful than saying "TAINT_FIRMWARE_CRASHED" so I'll ignore this
> report.

Absolutely.

> Yeah, that might be reasonable thing if the bug report is about
> slow wifi *and* you see that ath10k firmware crashed every 10 seconds,
> but if it just crashed once a few days earlier it's of no importance to
> the system anymore ... And certainly a reasonable driver (which I
> believe ath10k to be) would _not_ randomly start corrupting memory
> because its firmware crashed. Which really is what tainting the kernel
> is about.

I still see it as a support thing too. But discussing this further is
pointless as I agree that taint does not cover count and that it is
important.


  Luis
Johannes Berg May 18, 2020, 8:07 p.m. UTC | #14
On Mon, 2020-05-18 at 19:59 +0000, Luis Chamberlain wrote:

> > Err, no. Those two are most definitely related. Have you looked at (most
> > or some or whatever) staging drivers recently? Those contain all kinds
> > of garbage that might do whatever with your kernel.
> 
> No, I stay away :)

:)

> > That's all fine, I just don't think it's appropriate to pretend that
> > your kernel is now 'tainted' (think about the meaning of that word) when
> > the firmware of some random device crashed.
> 
> If the firmware crash *does* require driver remove / addition again,
> or a reboot, would you think that this is a situation that merits a taint?

Not really. In my experience, that's more likely a hardware issue (card
not properly seated, for example) that a bus reset happens to "fix".

> > It's pretty clear, but even then, first of all I doubt this is the case
> > for many of the places that you've sprinkled the annotation on,
> 
> We can remove it, for this driver I can vouch for its location as it did
> reach a state where I required a reboot. And its not the first time this
> has happened. This got me thinking about the bigger picture of the lack
> of proper way to address these cases in the kernel, and how the user is
> left dumbfounded.

Fair, so the driver is still broken wrt. recovery here. I still don't
think that's a situation where e.g. the system should say "hey you have
a taint here, if your graphics go bad now you should not report that
bug" (which is effectively what the single taint bit does).

> > and secondly it actually hides useful information.
> 
> What is it hiding?

Most importantly, which device crashed. Secondarily I'd say how many
times (*).

The information "firmware crashed" is really only useful in relation to
the device. If your graphics firmware crashed, yeah, well, you probably
won't even see this. If your USB wifi firmware crashed? Not really
interesting, you'll anyway just unplug. In fact it's very hard for a USB
driver (short of arbitrary memory corruption) to significantly mess up
the system.

johannes

(*) though if it crashed only once, was that because it was wedged
enough to be unusable afterwards, or because everything was fine?
Jakub Kicinski May 18, 2020, 8:28 p.m. UTC | #15
On Mon, 18 May 2020 21:25:09 +0200 Johannes Berg wrote:
> It's pretty clear, but even then, first of all I doubt this is the case
> for many of the places that you've sprinkled the annotation on, and
> secondly it actually hides useful information.
> 
> Regardless of the support issue, I think this hiding of information is
> also problematic.
> 
> I really think we'd all be better off if you just made a sysfs file (I
> mistyped debugfs in some other email, sorry, apparently you didn't see
> the correction in time) that listed which device(s) crashed and how many
> times. That would actually be useful. Because honestly, if a random
> device crashed for some random reason, that's pretty much a non-event.
> If it keeps happening, then we might even want to know about it.

Johannes - have you seen devlink health? I think we should just use
that interface, since it supports all the things you're requesting,
rather than duplicate it in sysfs.
Johannes Berg May 18, 2020, 8:29 p.m. UTC | #16
On Mon, 2020-05-18 at 13:28 -0700, Jakub Kicinski wrote:
> On Mon, 18 May 2020 21:25:09 +0200 Johannes Berg wrote:
> > It's pretty clear, but even then, first of all I doubt this is the case
> > for many of the places that you've sprinkled the annotation on, and
> > secondly it actually hides useful information.
> > 
> > Regardless of the support issue, I think this hiding of information is
> > also problematic.
> > 
> > I really think we'd all be better off if you just made a sysfs file (I
> > mistyped debugfs in some other email, sorry, apparently you didn't see
> > the correction in time) that listed which device(s) crashed and how many
> > times. That would actually be useful. Because honestly, if a random
> > device crashed for some random reason, that's pretty much a non-event.
> > If it keeps happening, then we might even want to know about it.
> 
> Johannes - have you seen devlink health? I think we should just use
> that interface, since it supports all the things you're requesting,
> rather than duplicate it in sysfs.

I haven't, and I'm glad to hear that's there, sounds good!

I suspect that Luis wants something more generic though, that isn't just
applicable to netdevices, unless devlink grew some kind of non-netdev
stuff while I wasn't looking? :)

johannes
Jakub Kicinski May 18, 2020, 8:35 p.m. UTC | #17
On Mon, 18 May 2020 22:29:53 +0200 Johannes Berg wrote:
> On Mon, 2020-05-18 at 13:28 -0700, Jakub Kicinski wrote:
> > On Mon, 18 May 2020 21:25:09 +0200 Johannes Berg wrote:  
> > > It's pretty clear, but even then, first of all I doubt this is the case
> > > for many of the places that you've sprinkled the annotation on, and
> > > secondly it actually hides useful information.
> > > 
> > > Regardless of the support issue, I think this hiding of information is
> > > also problematic.
> > > 
> > > I really think we'd all be better off if you just made a sysfs file (I
> > > mistyped debugfs in some other email, sorry, apparently you didn't see
> > > the correction in time) that listed which device(s) crashed and how many
> > > times. That would actually be useful. Because honestly, if a random
> > > device crashed for some random reason, that's pretty much a non-event.
> > > If it keeps happening, then we might even want to know about it.  
> > 
> > Johannes - have you seen devlink health? I think we should just use
> > that interface, since it supports all the things you're requesting,
> > rather than duplicate it in sysfs.  
> 
> I haven't, and I'm glad to hear that's there, sounds good!
> 
> I suspect that Luis wants something more generic though, that isn't just
> applicable to netdevices, unless devlink grew some kind of non-netdev
> stuff while I wasn't looking? :)

It's intended to be a generic netlink channel for configuring devices.

All the firmware-related interfaces have no dependencies on netdevs,
in fact that's one of the reasons we moved to devlink - we don't want
to hold rtnl lock just for talking to device firmware.
Johannes Berg May 18, 2020, 8:41 p.m. UTC | #18
On Mon, 2020-05-18 at 13:35 -0700, Jakub Kicinski wrote:
> 
> It's intended to be a generic netlink channel for configuring devices.
> 
> All the firmware-related interfaces have no dependencies on netdevs,
> in fact that's one of the reasons we moved to devlink - we don't want
> to hold rtnl lock just for talking to device firmware.

Sounds good :)

So I guess Luis just has to add some way in devlink to hook up devlink
health in a simple way to drivers, perhaps? I mean, many drivers won't
really want to use devlink for anything else, so I guess it should be as
simple as the API that Luis proposed ("firmware crashed for this struct
device"), if nothing more interesting is done with devlink?

Dunno. But anyway sounds like it should somehow integrate there rather
than the way this patchset proposed?

johannes
Jakub Kicinski May 18, 2020, 8:46 p.m. UTC | #19
On Mon, 18 May 2020 22:41:48 +0200 Johannes Berg wrote:
> On Mon, 2020-05-18 at 13:35 -0700, Jakub Kicinski wrote:
> > It's intended to be a generic netlink channel for configuring devices.
> > 
> > All the firmware-related interfaces have no dependencies on netdevs,
> > in fact that's one of the reasons we moved to devlink - we don't want
> > to hold rtnl lock just for talking to device firmware.  
> 
> Sounds good :)
> 
> So I guess Luis just has to add some way in devlink to hook up devlink
> health in a simple way to drivers, perhaps? I mean, many drivers won't
> really want to use devlink for anything else, so I guess it should be as
> simple as the API that Luis proposed ("firmware crashed for this struct
> device"), if nothing more interesting is done with devlink?
> 
> Dunno. But anyway sounds like it should somehow integrate there rather
> than the way this patchset proposed?

Right, that'd be great. Simple API to register a devlink instance with
whatever number of reporters the device would need. I'm happy to help.
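
As a rough sketch of what that could look like from a driver's point of
view (names are illustrative, error handling is omitted, and the exact
devlink_health_reporter_create() signature has varied a little between
kernel versions):

#include <net/devlink.h>

static const struct devlink_health_reporter_ops foo_fw_reporter_ops = {
	.name = "fw",
	/* optional: .recover / .dump / .diagnose callbacks */
};

/* at probe time, once a devlink instance exists */
static struct devlink_health_reporter *
foo_fw_reporter_create(struct devlink *devlink, void *priv)
{
	return devlink_health_reporter_create(devlink, &foo_fw_reporter_ops,
					      0, priv);
}

/* wherever the driver detects a firmware crash */
static void foo_fw_crashed(struct devlink_health_reporter *reporter)
{
	devlink_health_report(reporter, "firmware crashed", NULL);
}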
Luis Chamberlain May 18, 2020, 9:18 p.m. UTC | #20
On Mon, May 18, 2020 at 10:07:49PM +0200, Johannes Berg wrote:
> On Mon, 2020-05-18 at 19:59 +0000, Luis Chamberlain wrote:
> 
> > > Err, no. Those two are most definitely related. Have you looked at (most
> > > or some or whatever) staging drivers recently? Those contain all kinds
> > > of garbage that might do whatever with your kernel.
> > 
> > No, I stay away :)
> 
> :)
> 
> > > That's all fine, I just don't think it's appropriate to pretend that
> > > your kernel is now 'tainted' (think about the meaning of that word) when
> > > the firmware of some random device crashed.
> > 
> > If the firmware crash *does* require driver remove / addition again,
> > or a reboot, would you think that this is a situation that merits a taint?
> 
> Not really. In my experience, that's more likely a hardware issue (card
> not properly seated, for example) that a bus reset happens to "fix".
> 
> > > It's pretty clear, but even then, first of all I doubt this is the case
> > > for many of the places that you've sprinkled the annotation on,
> > 
> > We can remove it, for this driver I can vouch for its location as it did
> > reach a state where I required a reboot. And its not the first time this
> > has happened. This got me thinking about the bigger picture of the lack
> > of proper way to address these cases in the kernel, and how the user is
> > left dumbfounded.
> 
> Fair, so the driver is still broken wrt. recovery here. I still don't
> think that's a situation where e.g. the system should say "hey you have
> a taint here, if your graphics go bad now you should not report that
> bug" (which is effectively what the single taint bit does).

But again, let's think about the generic type of issue, and the
unexpected type of state that can be reached. The circumstance here
*does* lead to a case which is not recoverable. Now, consider how
many cases in the kernel where similar situations can happen and leave
the device or driver in a non-functional state.

> > > and secondly it actually hides useful information.
> > 
> > What is it hiding?
> 
> Most importantly, which device crashed. Secondarily I'd say how many
> times (*).

The device is implied by the module, the taint is applied to both.
If you had multiple devices, however, yes, it would not be possible
to distinguish from the taint which exact device it happened on.

So the only thing *generic* which would be left out is count.

> The information "firmware crashed" is really only useful in relation to
> the device.

If you have to reboot to get a functional network again then the device
is quite useless for many people, regardless of which device that
happened on.

But from a support perspective a sysfs interface which provides a tiny
bit more generic information indeed provides more value than a taint.

  Luis
Luis Chamberlain May 18, 2020, 9:22 p.m. UTC | #21
On Mon, May 18, 2020 at 01:46:43PM -0700, Jakub Kicinski wrote:
> On Mon, 18 May 2020 22:41:48 +0200 Johannes Berg wrote:
> > On Mon, 2020-05-18 at 13:35 -0700, Jakub Kicinski wrote:
> > > It's intended to be a generic netlink channel for configuring devices.
> > > 
> > > All the firmware-related interfaces have no dependencies on netdevs,
> > > in fact that's one of the reasons we moved to devlink - we don't want
> > > to hold rtnl lock just for talking to device firmware.  
> > 
> > Sounds good :)
> > 
> > So I guess Luis just has to add some way in devlink to hook up devlink
> > health in a simple way to drivers, perhaps? I mean, many drivers won't
> > really want to use devlink for anything else, so I guess it should be as
> > simple as the API that Luis proposed ("firmware crashed for this struct
> > device"), if nothing more interesting is done with devlink?
> > 
> > Dunno. But anyway sounds like it should somehow integrate there rather
> > than the way this patchset proposed?
> 
> Right, that'd be great. Simple API to register a devlink instance with
> whatever number of reporters the device would need. I'm happy to help.

Indeed my issue with devlink is that it did not seem generic enough for
all devices which use firmware and for which firmware can crash. Support
should not have to be "add devlink support" + "now use this new hook",
but rather a very lightweight devlink_crash(device) call we can sprinkle
in with only the device as a functional requirement.

  Luis
Jakub Kicinski May 18, 2020, 10:16 p.m. UTC | #22
On Mon, 18 May 2020 21:22:02 +0000 Luis Chamberlain wrote:
> Indeed my issue with devlink is that it did not seem generic enough for
> all devices which use firmware and for which firmware can crash. Support
> should not have to be "add devlink support" + "now use this new hook",
> but rather a very lighweight devlink_crash(device) call we can sprinkly
> with only the device as a functional requirement.

We can provide a lightweight devlink_crash(device) which only generates
the notification, without the need to register the health reporter or a
devlink instance upfront. But then we lose the ability to control the
recovery, count errors, etc. So I'd think that's not the direction we
want to go in.
Luis Chamberlain May 19, 2020, 1:05 a.m. UTC | #23
On Mon, May 18, 2020 at 03:16:45PM -0700, Jakub Kicinski wrote:
> On Mon, 18 May 2020 21:22:02 +0000 Luis Chamberlain wrote:
> > Indeed my issue with devlink is that it did not seem generic enough for
> > all devices which use firmware and for which firmware can crash. Support
> > should not have to be "add devlink support" + "now use this new hook",
> > but rather a very lighweight devlink_crash(device) call we can sprinkly
> > with only the device as a functional requirement.
> 
> We can provide a lightweight devlink_crash(device) which only generates
> the notification, without the need to register the health reporter or a
> devlink instance upfront. But then we loose the ability to control the
> recovery, count errors, etc. So I'd think that's not the direction we
> want to go in.

Care to show me what the diff stat for a non-devlink driver would look
like? Just keep in mind this taint is a one-line addition. Granted, if we
can use SmPL grammar to automate addition of an initial framework to a
driver that'd be great, but since the device addition is subsystem
specific (device_add() and friends), I don't suspect this will be easily
automated.

   Luis
Brian Norris May 19, 2020, 1:23 a.m. UTC | #24
On Sat, May 16, 2020 at 6:51 AM Johannes Berg <johannes@sipsolutions.net> wrote:
> In addition, look what we have in iwl_trans_pcie_removal_wk(). If we
> detect that the device is really wedged enough that the only way we can
> still try to recover is by completely unbinding the driver from it, then
> we give userspace a uevent for that. I don't remember exactly how and
> where that gets used (ChromeOS) though, but it'd be nice to have that
> sort of thing as part of the infrastructure, in a sort of two-level
> notification?

<slight side track>
We use this on certain devices where we know the underlying hardware
has design issues that may lead to device failure -- then when we see
this sort of unrecoverable "firmware-death", we remove the
device[*]+driver, force-reset the PCI device (SBR), and try to
reload/reattach the driver. This all happens by way of a udev rule. We
also log this sort of stuff (and metrics around it) for bug reports
and health statistics, since we really hope to not see this happen
often.

[*] "We" (user space) don't actually do this...it happens via the
'remove_when_gone' module parameter abomination found in iwlwifi. I'd
personally rather see the EVENT=INACCESSIBLE stuff on its own, and let
user space deal with when and how to remove and reset the device. But
I digress too much here ;)
</slight side track>

I really came to this thread to say that I also love the idea of a
generic mechanism (a la $subject) to report firmware crashes, but I
also have no interest in seeing a taint flag for it. For Chrome OS, I
would readily (as in, we're already looking at more-hacky /
non-generic ways to do this for drivers we care about) process these
kinds of stats as they happen, logging metrics for bug reports and/or
for automated crash statistics, when we see a firmware crash. A uevent
would suit us very well I think, although it would be nice if drivers
could also supply some small amount of informative text along with it
(e.g., a sort of "reason code", in case we can possibly aggregate
certain failure types). We already do this sort of thing for WARN()
and friends (not via uevent, but via log parsing; at least it has nice
"cut here" markers!).

Perhaps devlink (as proposed down-thread) would also fit the bill. I
don't think sysfs alone would fit our needs, as we'd like to process
these things as they happen, not only when a user submits a bug
report.

> Level 1: firmware crashed, but we're recovering, at least mostly, and
> it's more informational

Chrome OS would love to track these things too, since we'd like to see
these minimized, even if they're usually recoverable ;)

> Level 2: device is wedged, going to try to recover by some more forceful
> means (perhaps some devices can be power-cycled? etc.) but (more) state
> would be lost in these cases?

And we'd definitely want to know about these. We already get this for
the iwlwifi case described above, in a non-generic way.

In general, it's probably not that easy to tell the difference between
1 and 2, since even as you and Luis have noted, with the same driver
(and the same driver location), you find the same crashes may or may
not be recoverable. iwlwifi has extracted certain level 2 cases into
iwl_trans_pcie_removal_wk(), but even iwlwifi doesn't know all the
ways in which level 1 crashes actually lead to severe
(non-recoverable) failure.

Regards,
Brian
Luis Chamberlain May 19, 2020, 2:02 p.m. UTC | #25
On Mon, May 18, 2020 at 06:23:33PM -0700, Brian Norris wrote:
> On Sat, May 16, 2020 at 6:51 AM Johannes Berg <johannes@sipsolutions.net> wrote:
> > In addition, look what we have in iwl_trans_pcie_removal_wk(). If we
> > detect that the device is really wedged enough that the only way we can
> > still try to recover is by completely unbinding the driver from it, then
> > we give userspace a uevent for that. I don't remember exactly how and
> > where that gets used (ChromeOS) though, but it'd be nice to have that
> > sort of thing as part of the infrastructure, in a sort of two-level
> > notification?
> 
> <slight side track>
> We use this on certain devices where we know the underlying hardware
> has design issues that may lead to device failure 

Ah, after reading below I see you meant for iwlwifi.

If userspace can indeed grow to support this, that would be fantastic.

I should note that I don't encourage hiding firmware or hardware
issues. Quite the contrary, I suspect that taking pride in being
transparent about it, and dealing with it fast, can help lead the pack.
I wrote about this long ago in 2015 [0], and stand by it.

[0] https://www.do-not-panic.com/2015/04/god-complex-why-open-models-will-win.html

> -- then when we see
> this sort of unrecoverable "firmware-death", we remove the
> device[*]+driver, force-reset the PCI device (SBR), and try to
> reload/reattach the driver. This all happens by way of a udev rule.

So you've sprinkled your own udev event here as part of your kernel delta?

> We
> also log this sort of stuff (and metrics around it) for bug reports
> and health statistics, since we really hope to not see this happen
> often.

Assuming perfection is ideal but silly. So, what infrastructure do you
use for this sort of issue?

> [*] "We" (user space) don't actually do this...it happens via the
> 'remove_when_gone' module parameter abomination found in iwlwifi.

Holy moly... but hey, at least it may seem a bit more seamless than forcing
a reboot / manual driver removal / re-addition on the user.

BTW, is this a place in iwlwifi where the firmware likely crashed?

> I'd
> personally rather see the EVENT=INACCESSIBLE stuff on its own, and let
> user space deal with when and how to remove and reset the device. But
> I digress too much here ;)
> </slight side track>

This is all useful information. We are just touching the surface of the
topic by addressing networking first. Imagine when we address other
subsystems.

> I really came to this thread to say that I also love the idea of a
> generic mechanism (a la $subject) to report firmware crashes, but I
> also have no interest in seeing a taint flag for it. For Chrome OS, I
> would readily (as in, we're already looking at more-hacky /
> non-generic ways to do this for drivers we care about) process these
> kinds of stats as they happen, logging metrics for bug reports and/or
> for automated crash statistics, when we see a firmware crash.

Great!

> A uevent
> would suit us very well I think, although it would be nice if drivers
> could also supply some small amount of informative text along with it

A follow up to this series was to add a uevent to add_taint(), however
since a *count* is not considered I think it is correct to seek
alternatives at this point. The leaner the solution the better though.
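
For context, a purely illustrative sketch of what "a uevent on
add_taint()" could have looked like -- this is not the actual follow-up
patches, just a sketch over today's add_taint() in kernel/panic.c:

/* illustrative only: emit a uevent whenever a taint bit is set, so
 * user space hears about it as it happens; the TAINT=%u payload is
 * made up for this sketch
 */
void add_taint(unsigned flag, enum lockdep_ok lockdep_ok)
{
	char taint[32];
	char *envp[] = { taint, NULL };

	if (lockdep_ok == LOCKDEP_NOW_UNRELIABLE && __debug_locks_off())
		pr_warn("Disabling lock debugging due to kernel taint\n");

	set_bit(flag, &tainted_mask);

	snprintf(taint, sizeof(taint), "TAINT=%u", flag);
	kobject_uevent_env(kernel_kobj, KOBJ_CHANGE, envp);
}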

Do you have a pointer to what you guys use so I can read up?

> (e.g., a sort of "reason code", in case we can possibly aggregate
> certain failure types). We already do this sort of thing for WARN()
> and friends (not via uevent, but via log parsing; at least it has nice
> "cut here" markers!).

Indeed, similar things can be argued about WARN*()... this, however,
can be non-device-specific. With panic-on-warn becoming a "thing", it
becomes all the more important to tally exactly *why* these WARN*()s
may trigger.

> Perhaps 

Note below.

> devlink (as proposed down-thread) would also fit the bill. I
> don't think sysfs alone would fit our needs, as we'd like to process
> these things as they happen, not only when a user submits a bug
> report.

I think we've reached a point where using "*Perhaps*" does not suffice,
and if there is already a *user* of similar desired infrastructure, I
think we should jump on the opportunity to replace what you have with
something which could be used by other devices / subsystems that
require firmware. We should also consider, in the abstract sense, the
possibility of leveraging something like this for WARN*()s later too.

> > Level 1: firmware crashed, but we're recovering, at least mostly, and
> > it's more informational
> 
> Chrome OS would love to track these things too, since we'd like to see
> these minimized, even if they're usually recoverable ;)
> 
> > Level 2: device is wedged, going to try to recover by some more forceful
> > means (perhaps some devices can be power-cycled? etc.) but (more) state
> > would be lost in these cases?
> 
> And we'd definitely want to know about these. We already get this for
> the iwlwifi case described above, in a non-generic way.
> 
> In general, it's probably not that easy to tell the difference between
> 1 and 2, since even as you and Luis have noted, with the same driver
> (and the same driver location), you find the same crashes may or may
> not be recoverable. iwlwifi has extracted certain level 2 cases into
> iwl_trans_pcie_removal_wk(), but even iwlwifi doesn't know all the
> ways in which level 1 crashes actually lead to severe
> (non-recoverable) failure.

And that is fine; accepting these for what they are will help. However,
leaving the user in the *dark* is what we should *not* do.

  Luis
Brian Norris May 20, 2020, 12:47 a.m. UTC | #26
Hi Luis,

On Tue, May 19, 2020 at 7:02 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
> On Mon, May 18, 2020 at 06:23:33PM -0700, Brian Norris wrote:
> > On Sat, May 16, 2020 at 6:51 AM Johannes Berg <johannes@sipsolutions.net> wrote:
> > > In addition, look what we have in iwl_trans_pcie_removal_wk(). If we
> > > detect that the device is really wedged enough that the only way we can
> > > still try to recover is by completely unbinding the driver from it, then
> > > we give userspace a uevent for that. I don't remember exactly how and
> > > where that gets used (ChromeOS) though, but it'd be nice to have that
> > > sort of thing as part of the infrastructure, in a sort of two-level
> > > notification?
> >
> > <slight side track>
> > We use this on certain devices where we know the underlying hardware
> > has design issues that may lead to device failure
>
> Ah, after reading below I see you meant for iwlwifi.

Sorry, I was replying to Johannes, who I believe had his "we"="Intel"
hat (as iwlwifi maintainer) on, and was pointing at
iwl_trans_pcie_removal_wk().

> If userspace can indeed grow to support this, that would be fantastic.

Well, Chrome OS tailors its user space a bit more to the hardware (and
kernel/drivers in use) than the average distro might. We already do
this (for some values of "this") today. Is that "fantastic" to you? :D

> > -- then when we see
> > this sort of unrecoverable "firmware-death", we remove the
> > device[*]+driver, force-reset the PCI device (SBR), and try to
> > reload/reattach the driver. This all happens by way of a udev rule.
>
> So you've sprinkled your own udev event here as part of your kernel delta?

No kernel delta -- the event is there already:
iwl_trans_pcie_removal_wk()
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/wireless/intel/iwlwifi/pcie/trans.c?h=v5.6#n2027
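
For reference, roughly what that work function does today (a from-memory
sketch, details omitted and approximate; the link above has the
authoritative code):

/* sketch of iwl_trans_pcie_removal_wk(); approximate, see link above */
static void iwl_trans_pcie_removal_wk(struct work_struct *wk)
{
	struct iwl_trans_pcie_removal *removal =
		container_of(wk, struct iwl_trans_pcie_removal, work);
	struct pci_dev *pdev = removal->pdev;
	char *prop[] = { "EVENT=INACCESSIBLE", NULL };

	dev_err(&pdev->dev, "device inaccessible, removing\n");

	/* the KOBJ_CHANGE uevent our udev rule (linked below) matches on */
	kobject_uevent_env(&pdev->dev.kobj, KOBJ_CHANGE, prop);

	/* drop the wedged PCI device so user space can reset and rescan it */
	pci_lock_rescan_remove();
	pci_stop_and_remove_bus_device(pdev);
	pci_unlock_rescan_remove();

	kfree(removal);
	module_put(THIS_MODULE);
}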

And you can see our udev rules and scripts, in all their ugly details
here, if you really care:
https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/master/net-wireless/iwlwifi_rescan/files/

> > We
> > also log this sort of stuff (and metrics around it) for bug reports
> > and health statistics, since we really hope to not see this happen
> > often.
>
> Assuming perfection is ideal but silly. So, what infrastructure do you
> use for this sort of issue?

We don't yet log firmware crashes generally, but for all our current
crash reports (including WARN()), they go through this:
https://chromium.googlesource.com/chromiumos/platform2/+/master/crash-reporter/README.md

For example, look for "cut here" in:
https://chromium.googlesource.com/chromiumos/platform2/+/master/crash-reporter/anomaly_detector.cc

For other specific metrics (like counting "EVENT=INACCESSIBLE"), we
use the Chrome UMA system:
https://chromium.googlesource.com/chromiumos/platform2/+/master/metrics/README.md

I don't imagine the "infrastructure" side of any of that would be
useful to you, but maybe the client-side gathering can at least show
you what we do.

> > [*] "We" (user space) don't actually do this...it happens via the
> > 'remove_when_gone' module parameter abomination found in iwlwifi.
>
> BTW, is this a place in iwlwifi where the firmware likely crashed?

iwl_trans_pcie_removal_wk() is triggered because HW accesses timed out
in a way that is likely due to a dead PCIe endpoint. It's not directly
a firmware crash, although there may be firmware crashes reported
around the same time.

Intel folks can probably give a more nuanced answer, but their
firmware crashes usually land in something like iwl_mvm_nic_error() ->
iwl_mvm_dump_nic_error_log() + iwl_mvm_nic_restart(). If you can make
your proposal less punishing (e.g., removing the "taint" as Johannes
requested), maybe iwlwifi would be another good candidate for
$subject. iwlwifi generally expects to recover seamlessly, but it's
also good to know if you've seen a hundred of these in a row.
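
To make that concrete, a purely hypothetical placement mirroring the
ath10k hunks in $subject -- a sketch only, not an actual iwlwifi patch,
and the surrounding function is simplified from memory:

static void iwl_mvm_nic_error(struct iwl_op_mode *op_mode)
{
	struct iwl_mvm *mvm = IWL_OP_MODE_GET_MVM(op_mode);

	iwl_mvm_dump_nic_error_log(mvm);
	/* hypothetical: annotate the firmware crash like ath10k does */
	module_firmware_crashed();
	iwl_mvm_nic_restart(mvm, true);
}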

> > A uevent
> > would suit us very well I think, although it would be nice if drivers
> > could also supply some small amount of informative text along with it
>
> A follow up to this series was to add a uevent to add_taint(), however
> since a *count* is not considered I think it is correct to seek
> alternatives at this point. The leaner the solution the better though.
>
> Do you have a pointer to what you guys use so I can read up?

Hopefully the above pointers are useful to you. We don't yet have
firmware-crash parsing implemented, so I don't have pointers for that
piece yet.

> > (e.g., a sort of "reason code", in case we can possibly aggregate
> > certain failure types). We already do this sort of thing for WARN()
> > and friends (not via uevent, but via log parsing; at least it has nice
> > "cut here" markers!).
>
> Indeed, similar things can be argued about WARN*()... this, however,
> can be non-device-specific. With panic-on-warn becoming a "thing", it
> becomes all the more important to tally exactly *why* these WARN*()s
> may trigger.

panic-on-warn? Yikes. I guess such folks don't run mac80211... I'm
probably not supposed to publicize information related to how many
Chromebooks are out there, but we regularly see a scary number of
non-fatal warnings (WARN(), WARN_ON(), etc.) logged by Chrome OS users
every day, and a scary number of those are in WiFi drivers...

> > Perhaps
>
> Note below.

(My use of "perhaps" is only because I'm not familiar with devlink at all.)

> > devlink (as proposed down-thread) would also fit the bill. I
> > don't think sysfs alone would fit our needs, as we'd like to process
> > these things as they happen, not only when a user submits a bug
> > report.
>
> I think we've reached a point where using "*Perhaps*" does not suffice,
> and if there is already a *user* of similar desired infrastructure, I
> think we should jump on the opportunity to replace what you have with
> something which could be used by other devices / subsystems that
> require firmware. We should also consider, in the abstract sense, the
> possibility of leveraging something like this for WARN*()s later too.

To precisely lay out what Chrome OS does today:

* WARN() and similar: implemented, see anomaly_detector.cc above. It's
just log parsing, and that handy "cut here" stuff -- I'm nearly
certain there are other distros using this already? A uevent would
probably be nice, but log parsing is good enough for us today.
* EVENT=INACCESSIBLE: iwlwifi-specific, but reference code is linked
above. It's a uevent, and we handle it via udev. Code is linked above.
* Other firmware crashes: not yet implemented here, but we're looking
at it. Currently, we plan to do something similar to WARN(), but it
will be ugly and non-generic. A uevent would be a lovely replacement,
if it has some basic metadata with it. Stats in sysfs might be icing
on the cake, but mostly redundant for us, if we can grab it on the fly
via uevent.

HTH,
Brian
Emmanuel Grumbach May 20, 2020, 5:37 a.m. UTC | #27
Hi all,

<top post intro>

Since I have been involved quite a bit in the firmware debugging
features in iwlwifi, I think I can give a few insights here.

But before this, we need to understand that there are several sources of issues:
1) the firmware may crash but the bus is still alive, you can still
use the bus to get the crash data
2) the bus is dead, when that happens, the firmware might even be in a
good condition, but since the bus is dead, you stop getting any
information about the firmware, and then, at some point, you get to
the conclusion that the firmware is dead. You can't get the crash data
that resides on the other side of the bus (you may have gathered data
in the DRAM directly, but that's a different thing), and you don't
have much recovery to do besides re-starting the PCI enumeration.

At Intel, we have seen both unfortunately. The bus issues are the ones
that are trickier obviously. Trickier to detect (because you just get
garbage from any request you issue on the bus), and trickier to
handle. One can argue that the kernel should *not* handle those and
leave this in userspace's hands. I guess it all depends on what component
you ship to your customer and what your customer asks of you :).

</top post intro>

>
> Hi Luis,
>
> On Tue, May 19, 2020 at 7:02 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
> > On Mon, May 18, 2020 at 06:23:33PM -0700, Brian Norris wrote:
> > > On Sat, May 16, 2020 at 6:51 AM Johannes Berg <johannes@sipsolutions.net> wrote:
> > > > In addition, look what we have in iwl_trans_pcie_removal_wk(). If we
> > > > detect that the device is really wedged enough that the only way we can
> > > > still try to recover is by completely unbinding the driver from it, then
> > > > we give userspace a uevent for that. I don't remember exactly how and
> > > > where that gets used (ChromeOS) though, but it'd be nice to have that
> > > > sort of thing as part of the infrastructure, in a sort of two-level
> > > > notification?
> > >
> > > <slight side track>
> > > We use this on certain devices where we know the underlying hardware
> > > has design issues that may lead to device failure
> >
> > Ah, after reading below I see you meant for iwlwifi.
>
> Sorry, I was replying to Johannes, who I believe had his "we"="Intel"
> hat (as iwlwifi maintainer) on, and was pointing at
> iwl_trans_pcie_removal_wk().
>

This pcie_removal thing is for the dead-bus case, my 2) above.

> > If userspace can indeed grow to support this, that would be fantastic.
>
> Well, Chrome OS tailors its user space a bit more to the hardware (and
> kernel/drivers in use) than the average distro might. We already do
> this (for some values of "this") today. Is that "fantastic" to you? :D

I guess it can be fantastic if other vendors also suffer from this. Or
maybe that could be done as part of the PCI bus driver inside the
kernel?

>
> > > -- then when we see
> > > this sort of unrecoverable "firmware-death", we remove the
> > > device[*]+driver, force-reset the PCI device (SBR), and try to
> > > reload/reattach the driver. This all happens by way of a udev rule.
> >
> > So you've sprinkled your own udev event here as part of your kernel delta?
>
> No kernel delta -- the event is there already:
> iwl_trans_pcie_removal_wk()
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/wireless/intel/iwlwifi/pcie/trans.c?h=v5.6#n2027
>
> And you can see our udev rules and scripts, in all their ugly details
> here, if you really care:
> https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/master/net-wireless/iwlwifi_rescan/files/
>
> > > We
> > > also log this sort of stuff (and metrics around it) for bug reports
> > > and health statistics, since we really hope to not see this happen
> > > often.
> >
> > Assuming perfection is ideal but silly. So, what infrastructure do you
> > use for this sort of issue?
>
> We don't yet log firmware crashes generally, but for all our current
> crash reports (including WARN()), they go through this:
> https://chromium.googlesource.com/chromiumos/platform2/+/master/crash-reporter/README.md
>
> For example, look for "cut here" in:
> https://chromium.googlesource.com/chromiumos/platform2/+/master/crash-reporter/anomaly_detector.cc
>
> For other specific metrics (like counting "EVENT=INACCESSIBLE"), we
> use the Chrome UMA system:
> https://chromium.googlesource.com/chromiumos/platform2/+/master/metrics/README.md
>
> I don't imagine the "infrastructure" side of any of that would be
> useful to you, but maybe the client-side gathering can at least show
> you what we do.
>
> > > [*] "We" (user space) don't actually do this...it happens via the
> > > 'remove_when_gone' module parameter abomination found in iwlwifi.
> >
> > BTW, is this a place in iwlwifi where the firmware likely crashed?
>
> iwl_trans_pcie_removal_wk() is triggered because HW accesses timed out
> in a way that is likely due to a dead PCIe endpoint. It's not directly
> a firmware crash, although there may be firmware crashes reported
> around the same time.

iwl_trans_pcie_removal_wk() is only for a dead bus, not for a
firmware crash.
>
> Intel folks can probably give a more nuanced answer, but their
> firmware crashes usually land in something like iwl_mvm_nic_error() ->
> iwl_mvm_dump_nic_error_log() + iwl_mvm_nic_restart(). If you can make
> your proposal less punishing (e.g., removing the "taint" as Johannes
> requested), maybe iwlwifi would be another good candidate for
> $subject. iwlwifi generally expects to recover seamlessly, but it's
> also good to know if you've seen a hundred of these in a row.

Yes, you are right, mvm_nic_error is the place.

So here is the proposal.
I think that a standard interface that can allow a driver to report a
firmware crash along with a proprietary unique id that makes sense to
the vendor can be useful. Then, yes, distros can listen to this,
optionally open bugs (automatically?) when that happens. I even
planned to do this long ago, integrating it with a specific distro, but
it never rolled out. The vendor-supplied unique id is very important
for the bug de-duplication part. For iwlwifi, we have the SYSASSERT
number which is basically how the firmware tells us briefly what
happened. Of course, there is always more data that can be useful, but
for a first level of bug de-duplication this can be enough. Then, you
have a standard way to track the firmware crashes.
We need to remember that the firmware recovery path is expected to
work, but it is less tested. I specifically remember a bug where bad
handling of a CSA frame caused a firmware crash in a flow that led the
driver not to re-add a station or something like that, and because of
that, you got another firmware crash. So sometimes it is interesting
to see not only which firmware crash happened and how many times, but
whether there is a timing relationship between them. I guess that's
for the next level though...
Not really critical for now.

The next problem here is that when you tell the firmware folks that
they have a bug, the first thing they'll do is ask for more data. Back
then, I enabled our firmware debug infra on Linux, and from there the
devcoredump infra was born (thanks Johannes). What we do at Intel is
have a script that runs when a udev event reports the creation of a
devcoredump. It parses the kernel log (dmesg) to determine the unique
id I mentioned earlier (the SYSASSERT number basically) and then it
creates a file in /var/log/ whose name contains the SYSASSERT number.
It is then quite easy to match the firmware data with the
firmware crash.
So I believe the right way to do this is to create the devcoredump
along with the id. Meaning that we basically don't need another
interface; we just need to use the same one, but provide the unique
id of the crash. This unique id can then be part of the uevent that is
thrown to userspace, and userspace can use it to name the dump file
with the right id. This way, it is fairly easy (and standard across
vendors) to track the firmware crashes, but most importantly, you
already have the firmware data that goes along with it. It would
look like this:


driver_code.c:

/* sketch: dev_coredumpsg() extended to also take a vendor-specific
 * unique crash id (uid); today's dev_coredumpsg() does not have such
 * an argument, so this is the proposed interface, not the current API
 */
void my_vendor_error_interrupt(void)
{
	u64 uid = get_the_unique_id_from_your_device();
	void *firmware_data = get_the_data_you_need();
	size_t firmware_data_len = get_the_data_len_you_need();

	dev_coredumpsg(dev_struct, firmware_data, firmware_data_len,
		       GFP_KERNEL /* or whatever fits the context */, uid);
}

And then, this would cause a
/var/log/wifi-crash-$(KBUILD_MODNAME)-$uid.bin to appear on your filesystem.

>
> > > A uevent
> > > would suit us very well I think, although it would be nice if drivers
> > > could also supply some small amount of informative text along with it
> >
> > A follow up to this series was to add a uevent to add_taint(), however
> > since a *count* is not considered I think it is correct to seek
> > alternatives at this point. The leaner the solution the better though.
> >
> > Do you have a pointer to what you guys use so I can read up?
>
> Hopefully the above pointers are useful to you. We don't yet have
> firmware-crash parsing implemented, so I don't have pointers for that
> piece yet.

See above. I don't think it is the device driver's role to count those.
We can add this count in userspace, I think. Debatable though.

>
> > > (e.g., a sort of "reason code", in case we can possibly aggregate
> > > certain failure types). We already do this sort of thing for WARN()
> > > and friends (not via uevent, but via log parsing; at least it has nice
> > > "cut here" markers!).
> >
> > Indeed, similar things can be argued about WARN*()... this, however,
> > can be non-device-specific. With panic-on-warn becoming a "thing", it
> > becomes all the more important to tally exactly *why* these WARN*()s
> > may trigger.
>
> panic-on-warn? Yikes. I guess such folks don't run mac80211... I'm
> probably not supposed to publicize information related to how many
> Chromebooks are out there, but we regularly see a scary number of
> non-fatal warnings (WARN(), WARN_ON(), etc.) logged by Chrome OS users
> every day, and a scary number of those are in WiFi drivers...
>
> > > Perhaps
> >
> > Note below.
>
> (My use of "perhaps" is only because I'm not familiar with devlink at all.)
>
> > > devlink (as proposed down-thread) would also fit the bill. I
> > > don't think sysfs alone would fit our needs, as we'd like to process
> > > these things as they happen, not only when a user submits a bug
> > > report.
> >
> > I think we've reached a point where using "*Perhaps*" does not suffice,
> > and if there is already a *user* of similar desired infrastructure, I
> > think we should jump on the opportunity to replace what you have with
> > something which could be used by other devices / subsystems that
> > require firmware. We should also consider, in the abstract sense, the
> > possibility of leveraging something like this for WARN*()s later too.
>
> To precisely lay out what Chrome OS does today:
>
> * WARN() and similar: implemented, see anomaly_detector.cc above. It's
> just log parsing, and that handy "cut here" stuff -- I'm nearly
> certain there are other distros using this already? A uevent would
> probably be nice, but log parsing is good enough for us today.
> * EVENT=INACCESSIBLE: iwlwifi-specific, but reference code is linked
> above. It's a uevent, and we handle it via udev. Code is linked above.
> * Other firmware crashes: not yet implemented here, but we're looking
> at it. Currently, we plan to do something similar to WARN(), but it
> will be ugly and non-generic. A uevent would be a lovely replacement,
> if it has some basic metadata with it. Stats in sysfs might be icing
> on the cake, but mostly redundant for us, if we can grab it on the fly
> via uevent.

So I believe we already have this uevent: it is the devcoredump. All
we need is to add the unique id.
Note that all this is good for firmware crashes, and not for dead-bus
scenarios, but those two are fundamentally different even if they look
alike at the beginning of your error-detection flow.
>
> HTH,
> Brian
Andy Shevchenko May 20, 2020, 8:32 a.m. UTC | #28
On Wed, May 20, 2020 at 8:40 AM Emmanuel Grumbach <egrumbach@gmail.com> wrote:

> Since I have been involved quite a bit in the firmware debugging
> features in iwlwifi, I think I can give a few insights here.
>
> But before this, we need to understand that there are several sources of issues:
> 1) the firmware may crash but the bus is still alive, you can still
> use the bus to get the crash data
> 2) the bus is dead, when that happens, the firmware might even be in a
> good condition, but since the bus is dead, you stop getting any
> information about the firmware, and then, at some point, you get to
> the conclusion that the firmware is dead. You can't get the crash data
> that resides on the other side of the bus (you may have gathered data
> in the DRAM directly, but that's a different thing), and you don't
> have much recovery to do besides re-starting the PCI enumeration.
>
> At Intel, we have seen both unfortunately. The bus issues are the ones
> that are trickier obviously. Trickier to detect (because you just get
> garbage from any request you issue on the bus), and trickier to
> handle. One can argue that the kernel should *not* handle those and
> leave this in userspace's hands. I guess it all depends on what component
> you ship to your customer and what your customer asks of you :).

Or the two best approaches:
1) get rid of firmware completely;
2) make it OSS (like SOF).

I think either of these is the right thing to do from a long-term perspective.

How many firmwares does an average computer have? 50? 100? Each of
them is a burden and a PITA.
Brian Norris May 21, 2020, 7:01 p.m. UTC | #29
On Tue, May 19, 2020 at 10:37 PM Emmanuel Grumbach <egrumbach@gmail.com> wrote:
> So I believe we already have this uevent, it is the devcoredump. All
> we need is to add the unique id.

I think there are a few reasons that devcoredump doesn't satisfy what
either Luis or I want.

1) it can be disabled entirely [1], for good reasons (e.g., think of
non-${CHIP_VENDOR} folks, who can't (and don't want to) do anything
with the opaque dumps provided by closed-source firmware)
2) not all drivers necessarily have a useful dump to provide when
there's a crash; look at the rest of Luis's series to see the kinds of
drivers-with-firmware that are crashing, some of which aren't dumping
anything
3) for those that do support devcoredump, it may be used for purposes
that are not "crashes" -- e.g., some provide debugfs or other knobs to
initiate dumps, for diagnostic or debugging purposes

Brian

[1] devcd_disabled
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/base/devcoredump.c?h=v5.6#n22
Emmanuel Grumbach May 22, 2020, 5:12 a.m. UTC | #30
>
> On Tue, May 19, 2020 at 10:37 PM Emmanuel Grumbach <egrumbach@gmail.com> wrote:
> > So I believe we already have this uevent, it is the devcoredump. All
> > we need is to add the unique id.
>
> I think there are a few reasons that devcoredump doesn't satisfy what
> either Luis or I want.
>
> 1) it can be disabled entirely [1], for good reasons (e.g., think of
> non-${CHIP_VENDOR} folks, who can't (and don't want to) do anything
> with the opaque dumps provided by closed-source firmware)

Ok, if all you're interested in is the information that this event
happened (as opposed to reporting a bug and providing the data), then I
agree. True, not everybody wants to or can enable devcoredump. I am
just a bit concerned that we may end up with two interfaces that notify
the same event, basically. The ideal maybe would be to be able to
optionally reduce the content of the devcoredump to nothing more than
what is already in the dmesg output. But then, it is not what it is
meant to be: namely, a core dump...

> 2) not all drivers necessarily have a useful dump to provide when
> there's a crash; look at the rest of Luis's series to see the kinds of
> drivers-with-firmware that are crashing, some of which aren't dumping
> anything

Fair enough.

> 3) for those that do support devcoredump, it may be used for purposes
> that are not "crashes" -- e.g., some provide debugfs or other knobs to
> initiate dumps, for diagnostic or debugging purposes

Not sure I really think we need to care about those cases, but you
already have 2 good arguments :)

>
> Brian
>
> [1] devcd_disabled
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/base/devcoredump.c?h=v5.6#n22
Luis Chamberlain May 22, 2020, 5:23 a.m. UTC | #31
On Fri, May 22, 2020 at 08:12:59AM +0300, Emmanuel Grumbach wrote:
> >
> > On Tue, May 19, 2020 at 10:37 PM Emmanuel Grumbach <egrumbach@gmail.com> wrote:
> > > So I believe we already have this uevent, it is the devcoredump. All
> > > we need is to add the unique id.
> >
> > I think there are a few reasons that devcoredump doesn't satisfy what
> > either Luis or I want.
> >
> > 1) it can be disabled entirely [1], for good reasons (e.g., think of
> > non-${CHIP_VENDOR} folks, who can't (and don't want to) do anything
> > with the opaque dumps provided by closed-source firmware)
> 
> Ok, if all you're interested in is the information that this event
> happened (as opposed to reporting a bug and providing the data), then I
> agree.

I've now hit another firmware crash with ath10k with the latest firmware
and kernel, and the *only* thing that helped recovery was a full reboot,
so that is a crystal-clear case that this needs to taint the kernel. And
yes, we do want to inform users too, so I've just added uevent support
for a few panic / taint events in the kernel and rolled it into my
series. I'll run some final tests and then post this as a follow-up.

devlink didn't cut it; it's networking-specific.

  Luis
diff mbox series

Patch

diff --git a/drivers/net/wireless/ath/ath10k/pci.c b/drivers/net/wireless/ath/ath10k/pci.c
index 1d941d53fdc9..6bd0f3b518b9 100644
--- a/drivers/net/wireless/ath/ath10k/pci.c
+++ b/drivers/net/wireless/ath/ath10k/pci.c
@@ -1767,6 +1767,7 @@  static void ath10k_pci_fw_dump_work(struct work_struct *work)
 		scnprintf(guid, sizeof(guid), "n/a");
 
 	ath10k_err(ar, "firmware crashed! (guid %s)\n", guid);
+	module_firmware_crashed();
 	ath10k_print_driver_info(ar);
 	ath10k_pci_dump_registers(ar, crash_data);
 	ath10k_ce_dump_registers(ar, crash_data);
@@ -2837,6 +2838,7 @@  static int ath10k_pci_hif_power_up(struct ath10k *ar,
 	if (ret) {
 		if (ath10k_pci_has_fw_crashed(ar)) {
 			ath10k_warn(ar, "firmware crashed during chip reset\n");
+			module_firmware_crashed();
 			ath10k_pci_fw_crashed_clear(ar);
 			ath10k_pci_fw_crashed_dump(ar);
 		}
diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c
index e2aff2254a40..d34ad289380f 100644
--- a/drivers/net/wireless/ath/ath10k/sdio.c
+++ b/drivers/net/wireless/ath/ath10k/sdio.c
@@ -794,6 +794,7 @@  static int ath10k_sdio_mbox_proc_dbg_intr(struct ath10k *ar)
 
 	/* TODO: Add firmware crash handling */
 	ath10k_warn(ar, "firmware crashed\n");
+	module_firmware_crashed();
 
 	/* read counter to clear the interrupt, the debug error interrupt is
 	 * counter 0.
@@ -915,6 +916,7 @@  static int ath10k_sdio_mbox_proc_cpu_intr(struct ath10k *ar)
 	if (cpu_int_status & MBOX_CPU_STATUS_ENABLE_ASSERT_MASK) {
 		ath10k_err(ar, "firmware crashed!\n");
 		queue_work(ar->workqueue, &ar->restart_work);
+		module_firmware_crashed();
 	}
 	return ret;
 }
diff --git a/drivers/net/wireless/ath/ath10k/snoc.c b/drivers/net/wireless/ath/ath10k/snoc.c
index 354d49b1cd45..7cfc123c345c 100644
--- a/drivers/net/wireless/ath/ath10k/snoc.c
+++ b/drivers/net/wireless/ath/ath10k/snoc.c
@@ -1451,6 +1451,7 @@  void ath10k_snoc_fw_crashed_dump(struct ath10k *ar)
 		scnprintf(guid, sizeof(guid), "n/a");
 
 	ath10k_err(ar, "firmware crashed! (guid %s)\n", guid);
+	module_firmware_crashed();
 	ath10k_print_driver_info(ar);
 	ath10k_msa_dump_memory(ar, crash_data);
 	mutex_unlock(&ar->dump_mutex);