mbox series

[v3,0/8] kernel: taint when the driver firmware crashes

Message ID 20200526145815.6415-1-mcgrof@kernel.org (mailing list archive)
Headers show
Series kernel: taint when the driver firmware crashes | expand

Message

Luis Chamberlain May 26, 2020, 2:58 p.m. UTC
To those new on CC -- this is intended to be a simple generic interface
to the kernel to annotate when the firwmare has crashed leaving the
driver or system in a questionable state, in the worst case requiring
full system reboot. This series is first addressing only a few
networking patches, however, I already have an idea of where such
firmware crashes happen across the tree. The goal with this series then
is to first introduce the simple framework, and only if that moves
forward will I continue to chug on with the rest of the drivers /
subsystems.

This is *not* a networking specific problem only.

This v3 augments the last series by introducing the uevent for panic
events, one of them is during tainting. The uvent mechanism is
independent from any of this firmware taint mechanism. I've also
addressed Jessica Yu's feedback. Given I've extended the patches a bit
with other minor cleanup which checkpatch.pl complains over, and since
this infrastructure is still being discussed, I've trimmed the patch
series size to only cover drivers for which I've received an Acked-by
from the respective driver maintainer, or where we have bug reports to
support such dire situations on the driver such as ath10k.

During the last v2 it was discussed that we should instead use devlink
for this work, however the initial RFC patches produced by Jakub
Kicinski [0] shows how devlink is networking specific, and the intent
behind this series is to produce simple helpers which can be used by *any*
device driver, for any subsystem, not just networking. Subsystem
specific infrastructure to help address firwmare crashes may still make
sense, however that does not mean we *don't* need something even more
generic regardless of the subsystem the issue happens on. Since uevents
for taints are exposed, we now expose these through uapi as well, and
that was something which eventually had to happen given that the current
scheme of relying on sensible character representations for each taint
will not scale beyond the alphabet.

This series is avaialble my 20200526-taint-firmware-net-intro branch, based on
linux-next tag next-20200526 [1].

[0] https://lkml.kernel.org/r/20200519211531.3702593-1-kuba@kernel.org
[1] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20200526-taint-firmware-net-intro

Luis Chamberlain (7):
  kernel.h: move taint and system state flags to uapi
  panic: add uevent support
  taint: add firmware crash taint support
  panic: make taint data type clearer
  ath10k: use new taint_firmware_crashed()
  liquidio: use new taint_firmware_crashed()
  qed: use new taint_firmware_crashed()

Vasundhara Volam (1):
  bnxt_en: use new taint_firmware_crashed()

 Documentation/admin-guide/tainted-kernels.rst |   6 +
 MAINTAINERS                                   |   8 +
 .../net/ethernet/broadcom/bnxt/bnxt_devlink.c |   1 +
 .../net/ethernet/cavium/liquidio/lio_main.c   |   1 +
 drivers/net/ethernet/qlogic/qed/qed_mcp.c     |   1 +
 drivers/net/wireless/ath/ath10k/pci.c         |   2 +
 drivers/net/wireless/ath/ath10k/sdio.c        |   2 +
 drivers/net/wireless/ath/ath10k/snoc.c        |   1 +
 include/asm-generic/bug.h                     |   4 +-
 include/linux/kernel.h                        |  40 +--
 include/linux/module.h                        |  13 +
 include/linux/panic_events.h                  |  26 ++
 include/trace/events/module.h                 |   3 +-
 include/uapi/linux/kernel.h                   |  36 +++
 include/uapi/linux/panic_events.h             |  17 ++
 init/main.c                                   |   1 +
 kernel/Makefile                               |   1 +
 kernel/module.c                               |  13 +-
 kernel/panic.c                                |  16 +-
 kernel/panic_events.c                         | 289 ++++++++++++++++++
 lib/Kconfig.debug                             |  13 +
 tools/debugging/kernel-chktaint               |   7 +
 22 files changed, 454 insertions(+), 47 deletions(-)
 create mode 100644 include/linux/panic_events.h
 create mode 100644 include/uapi/linux/panic_events.h
 create mode 100644 kernel/panic_events.c

Comments

Jakub Kicinski May 26, 2020, 10:46 p.m. UTC | #1
On Tue, 26 May 2020 14:58:07 +0000 Luis Chamberlain wrote:
> To those new on CC -- this is intended to be a simple generic interface
> to the kernel to annotate when the firwmare has crashed leaving the
> driver or system in a questionable state, in the worst case requiring
> full system reboot. This series is first addressing only a few
> networking patches, however, I already have an idea of where such
> firmware crashes happen across the tree. The goal with this series then
> is to first introduce the simple framework, and only if that moves
> forward will I continue to chug on with the rest of the drivers /
> subsystems.
> 
> This is *not* a networking specific problem only.
> 
> This v3 augments the last series by introducing the uevent for panic
> events, one of them is during tainting. The uvent mechanism is
> independent from any of this firmware taint mechanism. I've also
> addressed Jessica Yu's feedback. Given I've extended the patches a bit
> with other minor cleanup which checkpatch.pl complains over, and since
> this infrastructure is still being discussed, I've trimmed the patch
> series size to only cover drivers for which I've received an Acked-by
> from the respective driver maintainer, or where we have bug reports to
> support such dire situations on the driver such as ath10k.
> 
> During the last v2 it was discussed that we should instead use devlink
> for this work, however the initial RFC patches produced by Jakub
> Kicinski [0] shows how devlink is networking specific, and the intent
> behind this series is to produce simple helpers which can be used by *any*
> device driver, for any subsystem, not just networking. Subsystem
> specific infrastructure to help address firwmare crashes may still make
> sense, however that does not mean we *don't* need something even more
> generic regardless of the subsystem the issue happens on. Since uevents
> for taints are exposed, we now expose these through uapi as well, and
> that was something which eventually had to happen given that the current
> scheme of relying on sensible character representations for each taint
> will not scale beyond the alphabet.

Nacked-by: Jakub Kicinski <kuba@kernel.org>
Luis Chamberlain May 26, 2020, 11:07 p.m. UTC | #2
On Tue, May 26, 2020 at 03:46:06PM -0700, Jakub Kicinski wrote:
> On Tue, 26 May 2020 14:58:07 +0000 Luis Chamberlain wrote:
> > To those new on CC -- this is intended to be a simple generic interface
> > to the kernel to annotate when the firwmare has crashed leaving the
> > driver or system in a questionable state, in the worst case requiring
> > full system reboot. This series is first addressing only a few
> > networking patches, however, I already have an idea of where such
> > firmware crashes happen across the tree. The goal with this series then
> > is to first introduce the simple framework, and only if that moves
> > forward will I continue to chug on with the rest of the drivers /
> > subsystems.
> > 
> > This is *not* a networking specific problem only.
> > 
> > This v3 augments the last series by introducing the uevent for panic
> > events, one of them is during tainting. The uvent mechanism is
> > independent from any of this firmware taint mechanism. I've also
> > addressed Jessica Yu's feedback. Given I've extended the patches a bit
> > with other minor cleanup which checkpatch.pl complains over, and since
> > this infrastructure is still being discussed, I've trimmed the patch
> > series size to only cover drivers for which I've received an Acked-by
> > from the respective driver maintainer, or where we have bug reports to
> > support such dire situations on the driver such as ath10k.
> > 
> > During the last v2 it was discussed that we should instead use devlink
> > for this work, however the initial RFC patches produced by Jakub
> > Kicinski [0] shows how devlink is networking specific, and the intent
> > behind this series is to produce simple helpers which can be used by *any*
> > device driver, for any subsystem, not just networking. Subsystem
> > specific infrastructure to help address firwmare crashes may still make
> > sense, however that does not mean we *don't* need something even more
> > generic regardless of the subsystem the issue happens on. Since uevents
> > for taints are exposed, we now expose these through uapi as well, and
> > that was something which eventually had to happen given that the current
> > scheme of relying on sensible character representations for each taint
> > will not scale beyond the alphabet.
> 
> Nacked-by: Jakub Kicinski <kuba@kernel.org>

Care to elaborate?
 
  Luis
Jakub Kicinski May 26, 2020, 11:30 p.m. UTC | #3
On Tue, 26 May 2020 23:07:48 +0000 Luis Chamberlain wrote:
> On Tue, May 26, 2020 at 03:46:06PM -0700, Jakub Kicinski wrote:
> > On Tue, 26 May 2020 14:58:07 +0000 Luis Chamberlain wrote:  
> > > To those new on CC -- this is intended to be a simple generic interface
> > > to the kernel to annotate when the firwmare has crashed leaving the
> > > driver or system in a questionable state, in the worst case requiring
> > > full system reboot. This series is first addressing only a few
> > > networking patches, however, I already have an idea of where such
> > > firmware crashes happen across the tree. The goal with this series then
> > > is to first introduce the simple framework, and only if that moves
> > > forward will I continue to chug on with the rest of the drivers /
> > > subsystems.
> > > 
> > > This is *not* a networking specific problem only.
> > > 
> > > This v3 augments the last series by introducing the uevent for panic
> > > events, one of them is during tainting. The uvent mechanism is
> > > independent from any of this firmware taint mechanism. I've also
> > > addressed Jessica Yu's feedback. Given I've extended the patches a bit
> > > with other minor cleanup which checkpatch.pl complains over, and since
> > > this infrastructure is still being discussed, I've trimmed the patch
> > > series size to only cover drivers for which I've received an Acked-by
> > > from the respective driver maintainer, or where we have bug reports to
> > > support such dire situations on the driver such as ath10k.
> > > 
> > > During the last v2 it was discussed that we should instead use devlink
> > > for this work, however the initial RFC patches produced by Jakub
> > > Kicinski [0] shows how devlink is networking specific, and the intent
> > > behind this series is to produce simple helpers which can be used by *any*
> > > device driver, for any subsystem, not just networking. Subsystem
> > > specific infrastructure to help address firwmare crashes may still make
> > > sense, however that does not mean we *don't* need something even more
> > > generic regardless of the subsystem the issue happens on. Since uevents
> > > for taints are exposed, we now expose these through uapi as well, and
> > > that was something which eventually had to happen given that the current
> > > scheme of relying on sensible character representations for each taint
> > > will not scale beyond the alphabet.  
> > 
> > Nacked-by: Jakub Kicinski <kuba@kernel.org>  
> 
> Care to elaborate?

I elaborated in the previous thread and told you I will nack this, 
but sure let's go over this again.

For the third time saying the devlink is networking specific is not
true. It was created as a netlink configuration channel for devices
when there is no networking reference that could be used. It can be
compiled in or out much like sysfs.

And as I've shown you devlink already has the uAPI for what you're
trying to achieve.

Regardless of your opinions about wider interfaces, networking drivers
should implement devlink, and not have to sprinkle magic taint calls.
Luis Chamberlain May 27, 2020, 3:19 a.m. UTC | #4
On Tue, May 26, 2020 at 04:30:31PM -0700, Jakub Kicinski wrote:
> On Tue, 26 May 2020 23:07:48 +0000 Luis Chamberlain wrote:
> > On Tue, May 26, 2020 at 03:46:06PM -0700, Jakub Kicinski wrote:
> > > On Tue, 26 May 2020 14:58:07 +0000 Luis Chamberlain wrote:  
> > > > To those new on CC -- this is intended to be a simple generic interface
> > > > to the kernel to annotate when the firwmare has crashed leaving the
> > > > driver or system in a questionable state, in the worst case requiring
> > > > full system reboot. This series is first addressing only a few
> > > > networking patches, however, I already have an idea of where such
> > > > firmware crashes happen across the tree. The goal with this series then
> > > > is to first introduce the simple framework, and only if that moves
> > > > forward will I continue to chug on with the rest of the drivers /
> > > > subsystems.
> > > > 
> > > > This is *not* a networking specific problem only.
> > > > 
> > > > This v3 augments the last series by introducing the uevent for panic
> > > > events, one of them is during tainting. The uvent mechanism is
> > > > independent from any of this firmware taint mechanism. I've also
> > > > addressed Jessica Yu's feedback. Given I've extended the patches a bit
> > > > with other minor cleanup which checkpatch.pl complains over, and since
> > > > this infrastructure is still being discussed, I've trimmed the patch
> > > > series size to only cover drivers for which I've received an Acked-by
> > > > from the respective driver maintainer, or where we have bug reports to
> > > > support such dire situations on the driver such as ath10k.
> > > > 
> > > > During the last v2 it was discussed that we should instead use devlink
> > > > for this work, however the initial RFC patches produced by Jakub
> > > > Kicinski [0] shows how devlink is networking specific, and the intent
> > > > behind this series is to produce simple helpers which can be used by *any*
> > > > device driver, for any subsystem, not just networking. Subsystem
> > > > specific infrastructure to help address firwmare crashes may still make
> > > > sense, however that does not mean we *don't* need something even more
> > > > generic regardless of the subsystem the issue happens on. Since uevents
> > > > for taints are exposed, we now expose these through uapi as well, and
> > > > that was something which eventually had to happen given that the current
> > > > scheme of relying on sensible character representations for each taint
> > > > will not scale beyond the alphabet.  
> > > 
> > > Nacked-by: Jakub Kicinski <kuba@kernel.org>  
> > 
> > Care to elaborate?
> 
> I elaborated in the previous thread

No you didn't.

> and told you I will nack this, 

That's all you said.

> but sure let's go over this again.
> 
> For the third time saying the devlink is networking specific is not
> true. It was created as a netlink configuration channel for devices
> when there is no networking reference that could be used. It can be
> compiled in or out much like sysfs.

Perhaps I didn't get your email but this clarification was in no way
shape or form present in your reply on that thread.

> And as I've shown you devlink already has the uAPI for what you're
> trying to achieve.

I read your patch, and granted, I will accept I was under the incorrect
assumption that this can only be used by networking devices, however it
the devlink approach achieves getting userspace the ability with
iproute2 devlink util to query a device health, on to which we can peg
firmware health. But *this* patch series is not about health status and
letting users query it, its about a *critical* situation which has come up
with firmware requiring me to reboot my system, and the lack of *any*
infrastructure in the kernel today to inform userspace about it.

So say we use netlink to report a critical health situation, how are we
informing userspace with your patch series about requring a reboot?

  Luis
Jakub Kicinski May 27, 2020, 9:36 p.m. UTC | #5
On Wed, 27 May 2020 03:19:18 +0000 Luis Chamberlain wrote:
> I read your patch, and granted, I will accept I was under the incorrect
> assumption that this can only be used by networking devices, however it
> the devlink approach achieves getting userspace the ability with
> iproute2 devlink util to query a device health, on to which we can peg
> firmware health. But *this* patch series is not about health status and
> letting users query it, its about a *critical* situation which has come up
> with firmware requiring me to reboot my system, and the lack of *any*
> infrastructure in the kernel today to inform userspace about it.
> 
> So say we use netlink to report a critical health situation, how are we
> informing userspace with your patch series about requring a reboot?

One of main features of netlink is pub/sub model of notifications.

Whatever you imagine listening to your uevent can listen to
devlink-health notifications via devlink. 

In fact I've shown this off in the RFC patches I sent to you, see 
the devlink mon health command being used.
Luis Chamberlain May 28, 2020, 2:27 p.m. UTC | #6
On Wed, May 27, 2020 at 02:36:42PM -0700, Jakub Kicinski wrote:
> On Wed, 27 May 2020 03:19:18 +0000 Luis Chamberlain wrote:
> > I read your patch, and granted, I will accept I was under the incorrect
> > assumption that this can only be used by networking devices, however it
> > the devlink approach achieves getting userspace the ability with
> > iproute2 devlink util to query a device health, on to which we can peg
> > firmware health. But *this* patch series is not about health status and
> > letting users query it, its about a *critical* situation which has come up
> > with firmware requiring me to reboot my system, and the lack of *any*
> > infrastructure in the kernel today to inform userspace about it.
> > 
> > So say we use netlink to report a critical health situation, how are we
> > informing userspace with your patch series about requring a reboot?
> 
> One of main features of netlink is pub/sub model of notifications.
> 
> Whatever you imagine listening to your uevent can listen to
> devlink-health notifications via devlink. 
> 
> In fact I've shown this off in the RFC patches I sent to you, see 
> the devlink mon health command being used.

Yes but I looked at iputils2 devlink and seems I made an incorrect
assumption this can only be used for a network device rather than
a struct device.

I'll take a second look.

  Luis
Ben Greear May 28, 2020, 3:04 p.m. UTC | #7
On 05/28/2020 07:27 AM, Luis Chamberlain wrote:
> On Wed, May 27, 2020 at 02:36:42PM -0700, Jakub Kicinski wrote:
>> On Wed, 27 May 2020 03:19:18 +0000 Luis Chamberlain wrote:
>>> I read your patch, and granted, I will accept I was under the incorrect
>>> assumption that this can only be used by networking devices, however it
>>> the devlink approach achieves getting userspace the ability with
>>> iproute2 devlink util to query a device health, on to which we can peg
>>> firmware health. But *this* patch series is not about health status and
>>> letting users query it, its about a *critical* situation which has come up
>>> with firmware requiring me to reboot my system, and the lack of *any*
>>> infrastructure in the kernel today to inform userspace about it.
>>>
>>> So say we use netlink to report a critical health situation, how are we
>>> informing userspace with your patch series about requring a reboot?
>>
>> One of main features of netlink is pub/sub model of notifications.
>>
>> Whatever you imagine listening to your uevent can listen to
>> devlink-health notifications via devlink.
>>
>> In fact I've shown this off in the RFC patches I sent to you, see
>> the devlink mon health command being used.
>
> Yes but I looked at iputils2 devlink and seems I made an incorrect
> assumption this can only be used for a network device rather than
> a struct device.
>
> I'll take a second look.

Hello Jakub,

I'm thinking about something similar to what Luis is proposing, but in
my case I'd like to report just when the driver knows the hardware is gone
and cannot be recovered, like when this is reported:

[ 2548.851832] WARNING: CPU: 3 PID: 98 at backports-4.19.98-1/net/mac80211/util.c:2040 ieee80211_reconfig+0x98/0xb64 [mac80211]
[ 2548.856020] Hardware became unavailable during restart.

I'd like to be able to tie this into a watch-dog program to allow automatic reboot
of the system soon after this event is seen, for instance.

Could you post your devlink RFC patches somewhere public?

Thanks,
Ben
Luis Chamberlain May 28, 2020, 4:33 p.m. UTC | #8
On Thu, May 28, 2020 at 08:04:50AM -0700, Ben Greear wrote:
> 
> Could you post your devlink RFC patches somewhere public?

This cover letter provided a URL to these.

  Luis