Message ID | 20200515212846.1347-13-mcgrof@kernel.org (mailing list archive) |
---|---|
State | New, archived |
On Fri, May 15, 2020 at 09:28:43PM +0000, Luis Chamberlain wrote:
> This makes use of the new module_firmware_crashed() to help
> annotate when firmware for device drivers crash. When firmware
> crashes devices can sometimes become unresponsive, and recovery
> sometimes requires a driver unload / reload and in the worst cases
> a reboot.
>
> Using a taint flag allows us to annotate when this happens clearly.
>
> Cc: linux-wireless@vger.kernel.org
> Cc: ath10k@lists.infradead.org
> Cc: Kalle Valo <kvalo@codeaurora.org>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  drivers/net/wireless/ath/ath10k/pci.c  | 2 ++
>  drivers/net/wireless/ath/ath10k/sdio.c | 2 ++
>  drivers/net/wireless/ath/ath10k/snoc.c | 1 +
>  3 files changed, 5 insertions(+)
>
> diff --git a/drivers/net/wireless/ath/ath10k/pci.c b/drivers/net/wireless/ath/ath10k/pci.c
> index 1d941d53fdc9..6bd0f3b518b9 100644
> --- a/drivers/net/wireless/ath/ath10k/pci.c
> +++ b/drivers/net/wireless/ath/ath10k/pci.c
> @@ -1767,6 +1767,7 @@ static void ath10k_pci_fw_dump_work(struct work_struct *work)
>  		scnprintf(guid, sizeof(guid), "n/a");
>
>  	ath10k_err(ar, "firmware crashed! (guid %s)\n", guid);
> +	module_firmware_crashed();
>  	ath10k_print_driver_info(ar);
>  	ath10k_pci_dump_registers(ar, crash_data);
>  	ath10k_ce_dump_registers(ar, crash_data);
> @@ -2837,6 +2838,7 @@ static int ath10k_pci_hif_power_up(struct ath10k *ar,
>  	if (ret) {
>  		if (ath10k_pci_has_fw_crashed(ar)) {
>  			ath10k_warn(ar, "firmware crashed during chip reset\n");
> +			module_firmware_crashed();
>  			ath10k_pci_fw_crashed_clear(ar);
>  			ath10k_pci_fw_crashed_dump(ar);
>  		}
> diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c
> index e2aff2254a40..d34ad289380f 100644
> --- a/drivers/net/wireless/ath/ath10k/sdio.c
> +++ b/drivers/net/wireless/ath/ath10k/sdio.c
> @@ -794,6 +794,7 @@ static int ath10k_sdio_mbox_proc_dbg_intr(struct ath10k *ar)
>
>  	/* TODO: Add firmware crash handling */
>  	ath10k_warn(ar, "firmware crashed\n");
> +	module_firmware_crashed();
>
>  	/* read counter to clear the interrupt, the debug error interrupt is
>  	 * counter 0.
> @@ -915,6 +916,7 @@ static int ath10k_sdio_mbox_proc_cpu_intr(struct ath10k *ar)
>  	if (cpu_int_status & MBOX_CPU_STATUS_ENABLE_ASSERT_MASK) {
>  		ath10k_err(ar, "firmware crashed!\n");
>  		queue_work(ar->workqueue, &ar->restart_work);
> +		module_firmware_crashed();
>  	}
>  	return ret;
>  }
> diff --git a/drivers/net/wireless/ath/ath10k/snoc.c b/drivers/net/wireless/ath/ath10k/snoc.c
> index 354d49b1cd45..7cfc123c345c 100644
> --- a/drivers/net/wireless/ath/ath10k/snoc.c
> +++ b/drivers/net/wireless/ath/ath10k/snoc.c
> @@ -1451,6 +1451,7 @@ void ath10k_snoc_fw_crashed_dump(struct ath10k *ar)
>  		scnprintf(guid, sizeof(guid), "n/a");
>
>  	ath10k_err(ar, "firmware crashed! (guid %s)\n", guid);
> +	module_firmware_crashed();
>  	ath10k_print_driver_info(ar);
>  	ath10k_msa_dump_memory(ar, crash_data);
>  	mutex_unlock(&ar->dump_mutex);
> --
> 2.26.2

Acked-by: Rafael Aquini <aquini@redhat.com>
On Fri, 2020-05-15 at 21:28 +0000, Luis Chamberlain wrote:
> module_firmware_crashed

You didn't CC me or the wireless list on the rest of the patches, so I'm replying to a random one, but ...

What is the point here?

This should in no way affect the integrity of the system/kernel, for most devices anyway.

So what if ath10k's firmware crashes? If there's a driver bug it will not handle it right (and probably crash, WARN_ON, or something else), but if the driver is working right then that will not affect the kernel at all.

So maybe I can understand that maybe you want an easy way to discover - per device - that the firmware crashed, but that still doesn't warrant a complete kernel taint.

Instead of the kernel taint, IMHO you should provide an annotation in sysfs (or somewhere else) for the *struct device* that had its firmware crash. Or maybe, if it's too complex to walk the entire hierarchy checking for that, have a uevent, or add the ability for the kernel to print out elsewhere in debugfs the list of devices that crashed at some point... All of that is fine, but a kernel taint?

johannes
On Sat, 2020-05-16 at 15:24 +0200, Johannes Berg wrote:
> Instead of the kernel taint, IMHO you should provide an annotation in
> sysfs (or somewhere else) for the *struct device* that had its firmware
> crash. Or maybe, if it's too complex to walk the entire hierarchy
> checking for that, have a uevent, or add the ability for the kernel to
> print out elsewhere in debugfs the list of devices that crashed at some

I mean sysfs, oops.

In addition, look what we have in iwl_trans_pcie_removal_wk(). If we detect that the device is really wedged enough that the only way we can still try to recover is by completely unbinding the driver from it, then we give userspace a uevent for that. I don't remember exactly how and where that gets used (ChromeOS) though, but it'd be nice to have that sort of thing as part of the infrastructure, in a sort of two-level notification?

Level 1: firmware crashed, but we're recovering, at least mostly, and it's more informational

Level 2: device is wedged, going to try to recover by some more forceful means (perhaps some devices can be power-cycled? etc.) but (more) state would be lost in these cases?

Still don't think a kernel taint is appropriate for either of these.

johannes
On Sat, May 16, 2020 at 03:24:01PM +0200, Johannes Berg wrote:
> On Fri, 2020-05-15 at 21:28 +0000, Luis Chamberlain wrote:
> > module_firmware_crashed
>
> You didn't CC me or the wireless list on the rest of the patches, so I'm replying to a random one, but ...
>
> What is the point here?
>
> This should in no way affect the integrity of the system/kernel, for most devices anyway.

Keyword you used here is "most devices". And in the worst case, *who* knows what other odd things may happen afterwards.

> So what if ath10k's firmware crashes? If there's a driver bug it will not handle it right (and probably crash, WARN_ON, or something else), but if the driver is working right then that will not affect the kernel at all.

Sometimes the device can go into a state which requires driver removal and addition to get things back up.

> So maybe I can understand that maybe you want an easy way to discover - per device - that the firmware crashed, but that still doesn't warrant a complete kernel taint.

That is one reason; another is that a taint helps support cases *fast*: easily detect if the issue was a firmware crash, instead of scraping logs for driver-specific ways to say the firmware has crashed.

> Instead of the kernel taint, IMHO you should provide an annotation in sysfs (or somewhere else) for the *struct device* that had its firmware crash.

It would seem the way some folks are thinking about getting more details would be through devlink.

> Or maybe, if it's too complex to walk the entire hierarchy checking for that, have a uevent, or add the ability for the kernel to print out elsewhere in debugfs the list of devices that crashed at some point... All of that is fine, but a kernel taint?

debugfs is optional; a taint is simple, and device agnostic. From a support perspective it is very easy to see if a possible issue may be device firmware specific.

Luis
On Sat, May 16, 2020 at 03:50:55PM +0200, Johannes Berg wrote:
> On Sat, 2020-05-16 at 15:24 +0200, Johannes Berg wrote:
> > Instead of the kernel taint, IMHO you should provide an annotation in sysfs (or somewhere else) for the *struct device* that had its firmware crash. Or maybe, if it's too complex to walk the entire hierarchy checking for that, have a uevent, or add the ability for the kernel to print out elsewhere in debugfs the list of devices that crashed at some
>
> I mean sysfs, oops.
>
> In addition, look what we have in iwl_trans_pcie_removal_wk(). If we detect that the device is really wedged enough that the only way we can still try to recover is by completely unbinding the driver from it, then we give userspace a uevent for that.

Nice! Indeed a uevent is in order for these sorts of things, and I'd argue that it begs the question if we should even uevent for any taint as well. Today these are silent. If the kernel crashes, today we only give userspace a log.

> I don't remember exactly how and where that gets used (ChromeOS) though, but it'd be nice to have that sort of thing as part of the infrastructure, in a sort of two-level notification?
>
> Level 1: firmware crashed, but we're recovering, at least mostly, and it's more informational
>
> Level 2: device is wedged, going to try to recover by some more forceful means (perhaps some devices can be power-cycled? etc.) but (more) state would be lost in these cases?

I agree that *all* this would be ideal. I don't see this as mutually exclusive with a taint on the kernel and module for the device.

> Still don't think a kernel taint is appropriate for either of these.

From a support perspective, I do think it is vital quick information.

Luis
On 05/18/2020 09:51 AM, Luis Chamberlain wrote:
> On Sat, May 16, 2020 at 03:24:01PM +0200, Johannes Berg wrote:
>> On Fri, 2020-05-15 at 21:28 +0000, Luis Chamberlain wrote:
>>> module_firmware_crashed
>>
>> You didn't CC me or the wireless list on the rest of the patches, so I'm replying to a random one, but ...
>>
>> What is the point here?
>>
>> This should in no way affect the integrity of the system/kernel, for most devices anyway.
>
> Keyword you used here is "most device". And in the worst case, *who* knows what other odd things may happen afterwards.
>
>> So what if ath10k's firmware crashes? If there's a driver bug it will not handle it right (and probably crash, WARN_ON, or something else), but if the driver is working right then that will not affect the kernel at all.
>
> Sometimes the device can go into a state which requires driver removal and addition to get things back up.

It would be lovely to be able to detect this case in the driver/system somehow! I haven't seen any such cases recently, but in case there is some common case you see, maybe we can think of a way to detect it?

>> So maybe I can understand that maybe you want an easy way to discover - per device - that the firmware crashed, but that still doesn't warrant a complete kernel taint.
>
> That is one reason, another is that a taint helps support cases *fast* easily detect if the issue was a firmware crash, instead of scraping logs for driver specific ways to say the firmware has crashed.

You can listen for udev events (I think that is the right term), and find crashes that way. You get the actual crash info as well.

Thanks,
Ben
On Mon, May 18, 2020 at 09:58:53AM -0700, Ben Greear wrote:
> On 05/18/2020 09:51 AM, Luis Chamberlain wrote:
> > On Sat, May 16, 2020 at 03:24:01PM +0200, Johannes Berg wrote:
> > > On Fri, 2020-05-15 at 21:28 +0000, Luis Chamberlain wrote:
> > > > module_firmware_crashed
> > >
> > > You didn't CC me or the wireless list on the rest of the patches, so I'm replying to a random one, but ...
> > >
> > > What is the point here?
> > >
> > > This should in no way affect the integrity of the system/kernel, for most devices anyway.
> >
> > Keyword you used here is "most device". And in the worst case, *who* knows what other odd things may happen afterwards.
> >
> > > So what if ath10k's firmware crashes? If there's a driver bug it will not handle it right (and probably crash, WARN_ON, or something else), but if the driver is working right then that will not affect the kernel at all.
> >
> > Sometimes the device can go into a state which requires driver removal and addition to get things back up.
>
> It would be lovely to be able to detect this case in the driver/system somehow! I haven't seen any such cases recently,

I assure you that I have run into it. Once it does again I'll report the crash, but the problem with some of this is that unless you scrape the log you won't know. Eventually, a uevent would indeed inform me.

> but in case there is some common case you see, maybe we can think of a way to detect it?

ath10k is just one case; this patch series addresses a simple way to annotate this tree-wide.

> > > So maybe I can understand that maybe you want an easy way to discover - per device - that the firmware crashed, but that still doesn't warrant a complete kernel taint.
> >
> > That is one reason, another is that a taint helps support cases *fast* easily detect if the issue was a firmware crash, instead of scraping logs for driver specific ways to say the firmware has crashed.
>
> You can listen for udev events (I think that is the right term), and find crashes that way. You get the actual crash info as well.

My follow up to this was to add a uevent to add_taint() as well; this way these could generically be processed by userspace.

Luis
On 05/18/2020 10:09 AM, Luis Chamberlain wrote:
> On Mon, May 18, 2020 at 09:58:53AM -0700, Ben Greear wrote:
>> On 05/18/2020 09:51 AM, Luis Chamberlain wrote:
>>> On Sat, May 16, 2020 at 03:24:01PM +0200, Johannes Berg wrote:
>>>> On Fri, 2020-05-15 at 21:28 +0000, Luis Chamberlain wrote:
>>>>> module_firmware_crashed
>>>>
>>>> You didn't CC me or the wireless list on the rest of the patches, so I'm replying to a random one, but ...
>>>>
>>>> What is the point here?
>>>>
>>>> This should in no way affect the integrity of the system/kernel, for most devices anyway.
>>>
>>> Keyword you used here is "most device". And in the worst case, *who* knows what other odd things may happen afterwards.
>>>
>>>> So what if ath10k's firmware crashes? If there's a driver bug it will not handle it right (and probably crash, WARN_ON, or something else), but if the driver is working right then that will not affect the kernel at all.
>>>
>>> Sometimes the device can go into a state which requires driver removal and addition to get things back up.
>>
>> It would be lovely to be able to detect this case in the driver/system somehow! I haven't seen any such cases recently,
>
> I assure you that I have run into it. Once it does again I'll report the crash, but the problem with some of this is that unless you scrape the log you won't know. Eventually, a uevent would indeed tell inform me.
>
>> but in case there is some common case you see, maybe we can think of a way to detect it?
>
> ath10k is just one case, this patch series addresses a simple way to annotate this tree-wide.
>
>>>> So maybe I can understand that maybe you want an easy way to discover - per device - that the firmware crashed, but that still doesn't warrant a complete kernel taint.
>>>
>>> That is one reason, another is that a taint helps support cases *fast* easily detect if the issue was a firmware crash, instead of scraping logs for driver specific ways to say the firmware has crashed.
>>
>> You can listen for udev events (I think that is the right term), and find crashes that way. You get the actual crash info as well.
>
> My follow up to this was to add uevent to add_taint() as well, this way these could generically be processed by userspace.

I'm not opposed to the taint, though I have not thought much on it.

But, if you can already get the crash info from uevent, and it automatically comes without polling or scraping logs, then what benefit beyond that does the taint give you?

Thanks,
Ben
On Mon, May 18, 2020 at 10:15:45AM -0700, Ben Greear wrote:
> On 05/18/2020 10:09 AM, Luis Chamberlain wrote:
> > On Mon, May 18, 2020 at 09:58:53AM -0700, Ben Greear wrote:
> > > On 05/18/2020 09:51 AM, Luis Chamberlain wrote:
> > > > On Sat, May 16, 2020 at 03:24:01PM +0200, Johannes Berg wrote:
> > > > > On Fri, 2020-05-15 at 21:28 +0000, Luis Chamberlain wrote:
> > > > > > module_firmware_crashed
> > > > >
> > > > > You didn't CC me or the wireless list on the rest of the patches, so I'm replying to a random one, but ...
> > > > >
> > > > > What is the point here?
> > > > >
> > > > > This should in no way affect the integrity of the system/kernel, for most devices anyway.
> > > >
> > > > Keyword you used here is "most device". And in the worst case, *who* knows what other odd things may happen afterwards.
> > > >
> > > > > So what if ath10k's firmware crashes? If there's a driver bug it will not handle it right (and probably crash, WARN_ON, or something else), but if the driver is working right then that will not affect the kernel at all.
> > > >
> > > > Sometimes the device can go into a state which requires driver removal and addition to get things back up.
> > >
> > > It would be lovely to be able to detect this case in the driver/system somehow! I haven't seen any such cases recently,
> >
> > I assure you that I have run into it. Once it does again I'll report the crash, but the problem with some of this is that unless you scrape the log you won't know. Eventually, a uevent would indeed tell inform me.
> >
> > > but in case there is some common case you see, maybe we can think of a way to detect it?
> >
> > ath10k is just one case, this patch series addresses a simple way to annotate this tree-wide.
> >
> > > > > So maybe I can understand that maybe you want an easy way to discover - per device - that the firmware crashed, but that still doesn't warrant a complete kernel taint.
> > > >
> > > > That is one reason, another is that a taint helps support cases *fast* easily detect if the issue was a firmware crash, instead of scraping logs for driver specific ways to say the firmware has crashed.
> > >
> > > You can listen for udev events (I think that is the right term), and find crashes that way. You get the actual crash info as well.
> >
> > My follow up to this was to add uevent to add_taint() as well, this way these could generically be processed by userspace.
>
> I'm not opposed to the taint, though I have not thought much on it.
>
> But, if you can already get the crash info from uevent, and it automatically comes without polling or scraping logs, then what benefit beyond that does the taint give you?

From a support perspective it is a *crystal* clear sign that the device and / or device driver may be in a very bad state, in a generic way.

Luis
On Mon, May 18, 2020 at 10:19 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
> On Mon, May 18, 2020 at 10:15:45AM -0700, Ben Greear wrote:
> > On 05/18/2020 10:09 AM, Luis Chamberlain wrote:
> > > On Mon, May 18, 2020 at 09:58:53AM -0700, Ben Greear wrote:
> > > > On 05/18/2020 09:51 AM, Luis Chamberlain wrote:
> > > > > On Sat, May 16, 2020 at 03:24:01PM +0200, Johannes Berg wrote:
> > > > > > On Fri, 2020-05-15 at 21:28 +0000, Luis Chamberlain wrote:
> > > > > > > module_firmware_crashed
> > > > > >
> > > > > > You didn't CC me or the wireless list on the rest of the patches, so I'm replying to a random one, but ...
> > > > > >
> > > > > > What is the point here?
> > > > > >
> > > > > > This should in no way affect the integrity of the system/kernel, for most devices anyway.
> > > > >
> > > > > Keyword you used here is "most device". And in the worst case, *who* knows what other odd things may happen afterwards.
> > > > >
> > > > > > So what if ath10k's firmware crashes? If there's a driver bug it will not handle it right (and probably crash, WARN_ON, or something else), but if the driver is working right then that will not affect the kernel at all.
> > > > >
> > > > > Sometimes the device can go into a state which requires driver removal and addition to get things back up.
> > > >
> > > > It would be lovely to be able to detect this case in the driver/system somehow! I haven't seen any such cases recently,
> > >
> > > I assure you that I have run into it. Once it does again I'll report the crash, but the problem with some of this is that unless you scrape the log you won't know. Eventually, a uevent would indeed tell inform me.
> > >
> > > > but in case there is some common case you see, maybe we can think of a way to detect it?
> > >
> > > ath10k is just one case, this patch series addresses a simple way to annotate this tree-wide.
> > >
> > > > > > So maybe I can understand that maybe you want an easy way to discover - per device - that the firmware crashed, but that still doesn't warrant a complete kernel taint.
> > > > >
> > > > > That is one reason, another is that a taint helps support cases *fast* easily detect if the issue was a firmware crash, instead of scraping logs for driver specific ways to say the firmware has crashed.
> > > >
> > > > You can listen for udev events (I think that is the right term), and find crashes that way. You get the actual crash info as well.
> > >
> > > My follow up to this was to add uevent to add_taint() as well, this way these could generically be processed by userspace.
> >
> > I'm not opposed to the taint, though I have not thought much on it.
> >
> > But, if you can already get the crash info from uevent, and it automatically comes without polling or scraping logs, then what benefit beyond that does the taint give you?
>
> From a support perspective it is a *crystal* clear sign that the device and / or device driver may be in a very bad state, in a generic way.

Unfortunately a "taint" is interpreted by many users as: "your kernel is really F#*D up, you better do something about it right now." Assuming they're paying attention at all in the first place of course.

The fact is, WiFi chip firmware crashes, and in most cases the driver is able to recover seamlessly. At least that is the case with most QCA chipsets I work with. And the users or our ability to do anything about it is minimal to none as we don't have access to firmware source. It's too bad and I wish it weren't the case, but we have embraced reality and most drivers have a recovery mechanism built in for this case. In short, it's a non-event.

I fear that elevating this to a kernel taint will significantly increase "support" requests that really are nothing but noise; similar to how the firmware load failure messages (fail to load fw-2.bin, fail to load fw-1.bin, yay loaded fw-0.bin) cause a lot of noise.

Not specifically opposed, but I wonder what it really accomplishes in a world where the firmware crashing is pretty much a normal occurrence.

If it goes in, I think that the drivers shouldn't trigger the taint if they're able to recover normally. Only trigger on failure to come back up. In other words, the ideal place in the ath10k driver isn't where you have proposed as at that point operation is normal and we're doing a routine recovery.

- Steve

> Luis
On Mon, May 18, 2020 at 11:06:27AM -0700, Steve deRosier wrote:
> On Mon, May 18, 2020 at 10:19 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
> > From a support perspective it is a *crystal* clear sign that the device and / or device driver may be in a very bad state, in a generic way.
>
> Unfortunately a "taint" is interpreted by many users as: "your kernel is really F#*D up, you better do something about it right now." Assuming they're paying attention at all in the first place of course.

Taint historically has been used and still is today to help rule out whether or not you get support, or how you get support. For instance, a staging driver is not supported by some upstream developers, but it will be by those who help staging and Greg. TAINT_CRAP cannot be more clear.

So, no, it is not just about "hey your kernel is messed up", there are clear support boundaries being drawn.

> The fact is, WiFi chip firmware crashes, and in most cases the driver is able to recover seamlessly. At least that is the case with most QCA chipsets I work with.

That has not been my experience with the same driver, and so how do we know? And this patch set is not about ath10k alone; I want you to think about *all* device drivers with firmware. In my journey to scrape the kernel for these cases I was very surprised by the amount of code which clearly annotates these situations.

> And the users or our ability to do anything about it is minimal to none as we don't have access to firmware source.

This is not true, we have open firmware in WiFi. Some vendors choose to not open source their firmware; that is their decision.

These days though, I think we all admit that firmware crashes could use better generic infrastructure for flagging issues that clearly affect user experience. This patch is about ensuring that *when and if these happen*, we annotate it in the kernel for support purposes.

> It's too bad and I wish it weren't the case, but we have embraced reality and most drivers have a recovery mechanism built in for this case.

The mentality that firmware crashes are the end of the world is certainly what will lead developers to often hide these. Where this is openly clear, and not obfuscated, I'd argue that firmware issues likely get fixed more often. So what you describe is not bad, it's just accepting evolution.

> In short, it's a non-event. I fear that elevating this to a kernel taint will significantly increase "support" requests that really are nothing but noise;

That will depend on where you put this in the driver, and that is why it is important to place it in the right place, if any.

> similar to how the firmware load failure messages (fail to load fw-2.bin, fail to load fw-1.bin, yay loaded fw-0.bin) cause a lot of noise.

That can be fixed; the developers behind this series gave up on it. It has to do with a range of supported firmware versions, all being optional, but at least one being required.

> Not specifically opposed, but I wonder what it really accomplishes in a world where the firmware crashing is pretty much a normal occurrence.

Recovery without affecting user experience would be great; the taint is *not* for those cases. The taint definition has:

+ 18) ``Q`` used by device drivers to annotate that the device driver's firmware
+     has crashed and the device's operation has been severely affected. The
+     device may be left in a crippled state, requiring full driver removal /
+     addition, system reboot, or it is unclear how long recovery will take.

Let me know if this is not clear.

> If it goes in, I think that the drivers shouldn't trigger the taint if they're able to recover normally. Only trigger on failure to come back up. In other words, the ideal place in the ath10k driver isn't where you have proposed as at that point operation is normal and we're doing a routine recovery.

Sure, happy to remove it if indeed it is the case that the firmware crash is not happening to cripple the device, but I can vouch for the fact that the exact place where I placed it left my device driver in a state where I had to remove / add again.

Luis
On Mon, 2020-05-18 at 19:09 +0000, Luis Chamberlain wrote:
> > Unfortunately a "taint" is interpreted by many users as: "your kernel is really F#*D up, you better do something about it right now." Assuming they're paying attention at all in the first place of course.
>
> Taint historically has been used and still is today to help rule out whether or not you get support, or how you get support.
>
> For instance, a staging driver is not supported by some upstream developers, but it will be by those who help staging and Greg. TAINT_CRAP cannot be even more clear.
>
> So, no, it is not just about "hey your kernel is messed up", there are clear support boundaries being drawn.

Err, no. Those two are most definitely related. Have you looked at (most or some or whatever) staging drivers recently? Those contain all kinds of garbage that might do whatever with your kernel.

Of course that's not a completely clear boundary, maybe you can find a driver in staging that's perfect code just not written to kernel style? But I find that hard to believe, in most cases.

So no, it's really not about "[a] staging driver is not supported" vs. "your kernel is messed up". The very fact that you loaded one of those things might very well have messed up your kernel entirely.

> These days though, I think we all admit, that firmware crashes can use a better generic infrastructure for ensuring that clearly affecting-user experience issues. This patch is about that *when and if these happen*, we annotate it in the kernel for support pursposes.

That's all fine, I just don't think it's appropriate to pretend that your kernel is now 'tainted' (think about the meaning of that word) when the firmware of some random device crashed. Heck, that could have been a USB device that was since unplugged. Unless the driver is complete garbage (hello staging again?) that really should have no lasting effect on the system itself.

> Recovery without affecting user experience would be great, the taint is *not* for those cases. The taint definition has:
>
> + 18) ``Q`` used by device drivers to annotate that the device driver's firmware
> +     has crashed and the device's operation has been severely affected. The
> +     device may be left in a crippled state, requiring full driver removal /
> +     addition, system reboot, or it is unclear how long recovery will take.
>
> Let me know if this is not clear.

It's pretty clear, but even then, first of all I doubt this is the case for many of the places that you've sprinkled the annotation on, and secondly it actually hides useful information.

Regardless of the support issue, I think this hiding of information is also problematic.

I really think we'd all be better off if you just made a sysfs file (I mistyped debugfs in some other email, sorry, apparently you didn't see the correction in time) that listed which device(s) crashed and how many times. That would actually be useful. Because honestly, if a random device crashed for some random reason, that's pretty much a non-event. If it keeps happening, then we might even want to know about it.

You can obviously save the contents of this file into your bug reports automatically and act accordingly, but I think you'll find that this is far more useful than saying "TAINT_FIRMWARE_CRASHED" so I'll ignore this report. Yeah, that might be a reasonable thing if the bug report is about slow wifi *and* you see that ath10k firmware crashed every 10 seconds, but if it just crashed once a few days earlier it's of no importance to the system anymore ... And certainly a reasonable driver (which I believe ath10k to be) would _not_ randomly start corrupting memory because its firmware crashed. Which really is what tainting the kernel is about.

So no, even with all that, I still really believe you're solving the wrong problem.

Having information about firmware crashes, preferably with some kind of frequency information attached, and *clearly* with information about which device attached would be _great_. Munging it all into one bit is actively harmful, IMO.

johannes
On Mon, May 18, 2020 at 09:25:09PM +0200, Johannes Berg wrote: > On Mon, 2020-05-18 at 19:09 +0000, Luis Chamberlain wrote: > > > > Unfortunately a "taint" is interpreted by many users as: "your kernel > > > is really F#*D up, you better do something about it right now." > > > Assuming they're paying attention at all in the first place of course. > > > > Taint historically has been used and still is today to help rule out > > whether or not you get support, or how you get support. > > > > For instance, a staging driver is not supported by some upstream > > developers, but it will be by those who help staging and Greg. TAINT_CRAP > > cannot be even more clear. > > > > So, no, it is not just about "hey your kernel is messed up", there are > > clear support boundaries being drawn. > > Err, no. Those two are most definitely related. Have you looked at (most > or some or whatever) staging drivers recently? Those contain all kinds > of garbage that might do whatever with your kernel. No, I stay away :) > Of course that's not a completely clear boundary, maybe you can find a > driver in staging that's perfect code just not written to kernel style? > But I find that hard to believe, in most cases. > > So no, it's really not about "[a] staging driver is not supported" vs. > "your kernel is messed up". The very fact that you loaded one of those > things might very well have messed up your kernel entirely. > > > These days though, I think we all admit, that firmware crashes can use > > a better generic infrastructure for ensuring that clearly affecting-user > > experience issues. This patch is about that *when and if these happen*, > > we annotate it in the kernel for support pursposes. > > That's all fine, I just don't think it's appropriate to pretend that > your kernel is now 'tainted' (think about the meaning of that word) when > the firmware of some random device crashed. 
If the firmware crash *does* require driver remove / addition again, or a reboot, would you think that this is a situation that merits a taint? > > Recovery without affecting user experience would be great, the taint is > *not* for those cases. The taint definition has: > > > > + 18) ``Q`` used by device drivers to annotate that the device driver's firmware > > + has crashed and the device's operation has been severely affected. The > > + device may be left in a crippled state, requiring full driver removal / > > + addition, system reboot, or it is unclear how long recovery will take. > > > > Let me know if this is not clear. > > It's pretty clear, but even then, first of all I doubt this is the case > for many of the places that you've sprinkled the annotation on, We can remove it, for this driver I can vouch for its location as it did reach a state where I required a reboot. And it's not the first time this has happened. This got me thinking about the bigger picture of the lack of a proper way to address these cases in the kernel, and how the user is left dumbfounded. > and secondly it actually hides useful information. What is it hiding? > Regardless of the support issue, I think this hiding of information is > also problematic. > > I really think we'd all be better off if you just made a sysfs file (I > mistyped debugfs in some other email, sorry, apparently you didn't see > the correction in time) that listed which device(s) crashed and how many > times. Ah yes, count. The taint does not address count. > That would actually be useful. Because honestly, if a random > device crashed for some random reason, that's pretty much a non-event. > If it keeps happening, then we might even want to know about it. True. > You can obviously save the contents of this file into your bug reports > automatically and act accordingly, but I think you'll find that this is > far more useful than saying "TAINT_FIRMWARE_CRASHED" so I'll ignore this > report. Absolutely. 
> Yeah, that might be reasonable thing if the bug report is about > slow wifi *and* you see that ath10k firmware crashed every 10 seconds, > but if it just crashed once a few days earlier it's of no importance to > the system anymore ... And certainly a reasonable driver (which I > believe ath10k to be) would _not_ randomly start corrupting memory > because its firmware crashed. Which really is what tainting the kernel > is about. I still see it as a support thing too. But discussing this further is pointless as I agree that taint does not cover count and that it is important. Luis
On Mon, 2020-05-18 at 19:59 +0000, Luis Chamberlain wrote: > > Err, no. Those two are most definitely related. Have you looked at (most > > or some or whatever) staging drivers recently? Those contain all kinds > > of garbage that might do whatever with your kernel. > > No, I stay away :) :) > > That's all fine, I just don't think it's appropriate to pretend that > > your kernel is now 'tainted' (think about the meaning of that word) when > > the firmware of some random device crashed. > > If the firmware crash *does* require driver remove / addition again, > or a reboot, would you think that this is a situation that merits a taint? Not really. In my experience, that's more likely a hardware issue (card not properly seated, for example) that a bus reset happens to "fix". > > It's pretty clear, but even then, first of all I doubt this is the case > > for many of the places that you've sprinkled the annotation on, > > We can remove it, for this driver I can vouch for its location as it did > reach a state where I required a reboot. And its not the first time this > has happened. This got me thinking about the bigger picture of the lack > of proper way to address these cases in the kernel, and how the user is > left dumbfounded. Fair, so the driver is still broken wrt. recovery here. I still don't think that's a situation where e.g. the system should say "hey you have a taint here, if your graphics go bad now you should not report that bug" (which is effectively what the single taint bit does). > > and secondly it actually hides useful information. > > What is it hiding? Most importantly, which device crashed. Secondarily I'd say how many times (*). The information "firmware crashed" is really only useful in relation to the device. If your graphics firmware crashed, yeah, well, you probably won't even see this. If your USB wifi firmware crashed? Not really interesting, you'll anyway just unplug. 
In fact it's very hard for a USB driver (short of arbitrary memory corruption) to significantly mess up the system. johannes (*) though if it crashed only once, was that because it was wedged enough to be unusable afterwards, or because everything was fine?
On Mon, 18 May 2020 21:25:09 +0200 Johannes Berg wrote: > It's pretty clear, but even then, first of all I doubt this is the case > for many of the places that you've sprinkled the annotation on, and > secondly it actually hides useful information. > > Regardless of the support issue, I think this hiding of information is > also problematic. > > I really think we'd all be better off if you just made a sysfs file (I > mistyped debugfs in some other email, sorry, apparently you didn't see > the correction in time) that listed which device(s) crashed and how many > times. That would actually be useful. Because honestly, if a random > device crashed for some random reason, that's pretty much a non-event. > If it keeps happening, then we might even want to know about it. Johannes - have you seen devlink health? I think we should just use that interface, since it supports all the things you're requesting, rather than duplicate it in sysfs.
On Mon, 2020-05-18 at 13:28 -0700, Jakub Kicinski wrote: > On Mon, 18 May 2020 21:25:09 +0200 Johannes Berg wrote: > > It's pretty clear, but even then, first of all I doubt this is the case > > for many of the places that you've sprinkled the annotation on, and > > secondly it actually hides useful information. > > > > Regardless of the support issue, I think this hiding of information is > > also problematic. > > > > I really think we'd all be better off if you just made a sysfs file (I > > mistyped debugfs in some other email, sorry, apparently you didn't see > > the correction in time) that listed which device(s) crashed and how many > > times. That would actually be useful. Because honestly, if a random > > device crashed for some random reason, that's pretty much a non-event. > > If it keeps happening, then we might even want to know about it. > > Johannes - have you seen devlink health? I think we should just use > that interface, since it supports all the things you're requesting, > rather than duplicate it in sysfs. I haven't, and I'm glad to hear that's there, sounds good! I suspect that Luis wants something more generic though, that isn't just applicable to netdevices, unless devlink grew some kind of non-netdev stuff while I wasn't looking? :) johannes
On Mon, 18 May 2020 22:29:53 +0200 Johannes Berg wrote: > On Mon, 2020-05-18 at 13:28 -0700, Jakub Kicinski wrote: > > On Mon, 18 May 2020 21:25:09 +0200 Johannes Berg wrote: > > > It's pretty clear, but even then, first of all I doubt this is the case > > > for many of the places that you've sprinkled the annotation on, and > > > secondly it actually hides useful information. > > > > > > Regardless of the support issue, I think this hiding of information is > > > also problematic. > > > > > > I really think we'd all be better off if you just made a sysfs file (I > > > mistyped debugfs in some other email, sorry, apparently you didn't see > > > the correction in time) that listed which device(s) crashed and how many > > > times. That would actually be useful. Because honestly, if a random > > > device crashed for some random reason, that's pretty much a non-event. > > > If it keeps happening, then we might even want to know about it. > > > > Johannes - have you seen devlink health? I think we should just use > > that interface, since it supports all the things you're requesting, > > rather than duplicate it in sysfs. > > I haven't, and I'm glad to hear that's there, sounds good! > > I suspect that Luis wants something more generic though, that isn't just > applicable to netdevices, unless devlink grew some kind of non-netdev > stuff while I wasn't looking? :) It's intended to be a generic netlink channel for configuring devices. All the firmware-related interfaces have no dependencies on netdevs, in fact that's one of the reasons we moved to devlink - we don't want to hold rtnl lock just for talking to device firmware.
On Mon, 2020-05-18 at 13:35 -0700, Jakub Kicinski wrote: > > It's intended to be a generic netlink channel for configuring devices. > > All the firmware-related interfaces have no dependencies on netdevs, > in fact that's one of the reasons we moved to devlink - we don't want > to hold rtnl lock just for talking to device firmware. Sounds good :) So I guess Luis just has to add some way in devlink to hook up devlink health in a simple way to drivers, perhaps? I mean, many drivers won't really want to use devlink for anything else, so I guess it should be as simple as the API that Luis proposed ("firmware crashed for this struct device"), if nothing more interesting is done with devlink? Dunno. But anyway sounds like it should somehow integrate there rather than the way this patchset proposed? johannes
On Mon, 18 May 2020 22:41:48 +0200 Johannes Berg wrote: > On Mon, 2020-05-18 at 13:35 -0700, Jakub Kicinski wrote: > > It's intended to be a generic netlink channel for configuring devices. > > > > All the firmware-related interfaces have no dependencies on netdevs, > > in fact that's one of the reasons we moved to devlink - we don't want > > to hold rtnl lock just for talking to device firmware. > > Sounds good :) > > So I guess Luis just has to add some way in devlink to hook up devlink > health in a simple way to drivers, perhaps? I mean, many drivers won't > really want to use devlink for anything else, so I guess it should be as > simple as the API that Luis proposed ("firmware crashed for this struct > device"), if nothing more interesting is done with devlink? > > Dunno. But anyway sounds like it should somehow integrate there rather > than the way this patchset proposed? Right, that'd be great. Simple API to register a devlink instance with whatever number of reporters the device would need. I'm happy to help.
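[Editor's note: for readers following along without devlink background, the registration Jakub describes is fairly small per driver. The sketch below is not taken from any driver in this thread — the `drv_*` names and the priv structure are invented — and the devlink health API signatures shifted slightly between releases around this time (v5.6/v5.7), so treat it as pseudocode for the shape of the integration rather than buildable code.]

```c
/* Hypothetical driver glue for a devlink health "fw" reporter.
 * Assumes the driver has already created a struct devlink instance. */
static int drv_fw_recover(struct devlink_health_reporter *reporter,
			  void *priv_ctx, struct netlink_ext_ack *extack)
{
	struct drv_priv *priv = devlink_health_reporter_priv(reporter);

	return drv_reset_firmware(priv);	/* driver-specific reset */
}

static const struct devlink_health_reporter_ops drv_fw_reporter_ops = {
	.name = "fw",
	.recover = drv_fw_recover,
};

/* At probe time: one reporter per failure class the device can hit. */
priv->fw_reporter = devlink_health_reporter_create(devlink,
						   &drv_fw_reporter_ops,
						   0, true, priv);

/* On a firmware crash: bumps the error count, emits the netlink
 * notification, and optionally invokes ->recover(). */
devlink_health_report(priv->fw_reporter, "FW assert", NULL);
```

The point of contention in the rest of the thread then reduces to whether something like `devlink_health_report()` can be reached from a one-line call site without each driver having to create the `struct devlink` instance itself.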
On Mon, May 18, 2020 at 10:07:49PM +0200, Johannes Berg wrote: > On Mon, 2020-05-18 at 19:59 +0000, Luis Chamberlain wrote: > > > > Err, no. Those two are most definitely related. Have you looked at (most > > > or some or whatever) staging drivers recently? Those contain all kinds > > > of garbage that might do whatever with your kernel. > > > > No, I stay away :) > > :) > > > > That's all fine, I just don't think it's appropriate to pretend that > > > your kernel is now 'tainted' (think about the meaning of that word) when > > > the firmware of some random device crashed. > > > > If the firmware crash *does* require driver remove / addition again, > > or a reboot, would you think that this is a situation that merits a taint? > > Not really. In my experience, that's more likely a hardware issue (card > not properly seated, for example) that a bus reset happens to "fix". > > > > It's pretty clear, but even then, first of all I doubt this is the case > > > for many of the places that you've sprinkled the annotation on, > > > > We can remove it, for this driver I can vouch for its location as it did > > reach a state where I required a reboot. And its not the first time this > > has happened. This got me thinking about the bigger picture of the lack > > of proper way to address these cases in the kernel, and how the user is > > left dumbfounded. > > Fair, so the driver is still broken wrt. recovery here. I still don't > think that's a situation where e.g. the system should say "hey you have > a taint here, if your graphics go bad now you should not report that > bug" (which is effectively what the single taint bit does). But again, let's think about the generic type of issue, and the unexpected type of state that can be reached. The circumstance here *does* lead to a case which is not recoverable. Now, consider how many cases in the kernel where similar situations can happen and leave the device or driver in a non-functional state. 
> > > and secondly it actually hides useful information. > > > > What is it hiding? > > Most importantly, which device crashed. Secondarily I'd say how many > times (*). The device is implied by the module, the taint is applied to both. If you had multiple devices, however, yes, it would not be possible to distinguish from the taint which exact device it happened on. So the only thing *generic* which would be left out is count. > The information "firmware crashed" is really only useful in relation to > the device. If you have to reboot to get a functional network again then the device is quite useless for many people, regardless of which device that happened on. But from a support perspective a sysfs interface which provides a tiny bit more generic information indeed provides more value than a taint. Luis
On Mon, May 18, 2020 at 01:46:43PM -0700, Jakub Kicinski wrote: > On Mon, 18 May 2020 22:41:48 +0200 Johannes Berg wrote: > > On Mon, 2020-05-18 at 13:35 -0700, Jakub Kicinski wrote: > > > It's intended to be a generic netlink channel for configuring devices. > > > > > > All the firmware-related interfaces have no dependencies on netdevs, > > > in fact that's one of the reasons we moved to devlink - we don't want > > > to hold rtnl lock just for talking to device firmware. > > > > Sounds good :) > > > > So I guess Luis just has to add some way in devlink to hook up devlink > > health in a simple way to drivers, perhaps? I mean, many drivers won't > > really want to use devlink for anything else, so I guess it should be as > > simple as the API that Luis proposed ("firmware crashed for this struct > > device"), if nothing more interesting is done with devlink? > > > > Dunno. But anyway sounds like it should somehow integrate there rather > > than the way this patchset proposed? > > Right, that'd be great. Simple API to register a devlink instance with > whatever number of reporters the device would need. I'm happy to help. Indeed my issue with devlink is that it did not seem generic enough for all devices which use firmware and for which firmware can crash. Support should not have to be "add devlink support" + "now use this new hook", but rather a very lightweight devlink_crash(device) call we can sprinkle with only the device as a functional requirement. Luis
On Mon, 18 May 2020 21:22:02 +0000 Luis Chamberlain wrote: > Indeed my issue with devlink is that it did not seem generic enough for > all devices which use firmware and for which firmware can crash. Support > should not have to be "add devlink support" + "now use this new hook", > but rather a very lightweight devlink_crash(device) call we can sprinkle > with only the device as a functional requirement. We can provide a lightweight devlink_crash(device) which only generates the notification, without the need to register the health reporter or a devlink instance upfront. But then we lose the ability to control the recovery, count errors, etc. So I'd think that's not the direction we want to go in.
On Mon, May 18, 2020 at 03:16:45PM -0700, Jakub Kicinski wrote: > On Mon, 18 May 2020 21:22:02 +0000 Luis Chamberlain wrote: > > Indeed my issue with devlink is that it did not seem generic enough for > > all devices which use firmware and for which firmware can crash. Support > > should not have to be "add devlink support" + "now use this new hook", > > but rather a very lightweight devlink_crash(device) call we can sprinkle > > with only the device as a functional requirement. > > We can provide a lightweight devlink_crash(device) which only generates > the notification, without the need to register the health reporter or a > devlink instance upfront. But then we lose the ability to control the > recovery, count errors, etc. So I'd think that's not the direction we > want to go in. Care to show me what the diff stat for a non-devlink driver would look like? Just keep in mind this taint is a one-line addition. Granted, if we can use SmPL grammar to automate addition of an initial framework to a driver that'd be great, but since the device addition is subsystem specific (device_add() and friends), I don't suspect this will be easily automated. Luis
On Sat, May 16, 2020 at 6:51 AM Johannes Berg <johannes@sipsolutions.net> wrote: > In addition, look what we have in iwl_trans_pcie_removal_wk(). If we > detect that the device is really wedged enough that the only way we can > still try to recover is by completely unbinding the driver from it, then > we give userspace a uevent for that. I don't remember exactly how and > where that gets used (ChromeOS) though, but it'd be nice to have that > sort of thing as part of the infrastructure, in a sort of two-level > notification? <slight side track> We use this on certain devices where we know the underlying hardware has design issues that may lead to device failure -- then when we see this sort of unrecoverable "firmware-death", we remove the device[*]+driver, force-reset the PCI device (SBR), and try to reload/reattach the driver. This all happens by way of a udev rule. We also log this sort of stuff (and metrics around it) for bug reports and health statistics, since we really hope to not see this happen often. [*] "We" (user space) don't actually do this...it happens via the 'remove_when_gone' module parameter abomination found in iwlwifi. I'd personally rather see the EVENT=INACCESSIBLE stuff on its own, and let user space deal with when and how to remove and reset the device. But I digress too much here ;) </slight side track> I really came to this thread to say that I also love the idea of a generic mechanism (a la $subject) to report firmware crashes, but I also have no interest in seeing a taint flag for it. For Chrome OS, I would readily (as in, we're already looking at more-hacky / non-generic ways to do this for drivers we care about) process these kinds of stats as they happen, logging metrics for bug reports and/or for automated crash statistics, when we see a firmware crash. 
A uevent would suit us very well I think, although it would be nice if drivers could also supply some small amount of informative text along with it (e.g., a sort of "reason code", in case we can possibly aggregate certain failure types). We already do this sort of thing for WARN() and friends (not via uevent, but via log parsing; at least it has nice "cut here" markers!). Perhaps devlink (as proposed down-thread) would also fit the bill. I don't think sysfs alone would fit our needs, as we'd like to process these things as they happen, not only when a user submits a bug report. > Level 1: firmware crashed, but we're recovering, at least mostly, and > it's more informational Chrome OS would love to track these things too, since we'd like to see these minimized, even if they're usually recoverable ;) > Level 2: device is wedged, going to try to recover by some more forceful > means (perhaps some devices can be power-cycled? etc.) but (more) state > would be lost in these cases? And we'd definitely want to know about these. We already get this for the iwlwifi case described above, in a non-generic way. In general, it's probably not that easy to tell the difference between 1 and 2, since even as you and Luis have noted, with the same driver (and the same driver location), you find the same crashes may or may not be recoverable. iwlwifi has extracted certain level 2 cases into iwl_trans_pcie_removal_wk(), but even iwlwifi doesn't know all the ways in which level 1 crashes actually lead to severe (non-recoverable) failure. Regards, Brian
On Mon, May 18, 2020 at 06:23:33PM -0700, Brian Norris wrote: > On Sat, May 16, 2020 at 6:51 AM Johannes Berg <johannes@sipsolutions.net> wrote: > > In addition, look what we have in iwl_trans_pcie_removal_wk(). If we > > detect that the device is really wedged enough that the only way we can > > still try to recover is by completely unbinding the driver from it, then > > we give userspace a uevent for that. I don't remember exactly how and > > where that gets used (ChromeOS) though, but it'd be nice to have that > > sort of thing as part of the infrastructure, in a sort of two-level > > notification? > > <slight side track> > We use this on certain devices where we know the underlying hardware > has design issues that may lead to device failure Ah, after reading below I see you meant for iwlwifi. If userspace can indeed grow to support this, that would be fantastic. I should note that I don't encourage hiding firmware or hardware issues. Quite the contrary, I suspect that taking pride in being transparent about it, and dealing with it fast, can help lead the pack. I wrote about this long ago in 2015 [0], and stand by it. [0] https://www.do-not-panic.com/2015/04/god-complex-why-open-models-will-win.html > -- then when we see > this sort of unrecoverable "firmware-death", we remove the > device[*]+driver, force-reset the PCI device (SBR), and try to > reload/reattach the driver. This all happens by way of a udev rule. So you've sprinkled your own udev event here as part of your kernel delta? > We > also log this sort of stuff (and metrics around it) for bug reports > and health statistics, since we really hope to not see this happen > often. Assuming perfection is ideal but silly. So, what infrastructure do you use for this sort of issue? > [*] "We" (user space) don't actually do this...it happens via the > 'remove_when_gone' module parameter abomination found in iwlwifi. Holy moly.. 
but hey, at least it may seem a bit more seamless to the user than forcing a reboot / manual driver removal / addition. BTW is this a place in iwlwifi where the firmware likely crashed? > I'd > personally rather see the EVENT=INACCESSIBLE stuff on its own, and let > user space deal with when and how to remove and reset the device. But > I digress too much here ;) > </slight side track> This is all useful information. We are just touching the surface of the topic by addressing networking first. Imagine when we address other subsystems. > I really came to this thread to say that I also love the idea of a > generic mechanism (a la $subject) to report firmware crashes, but I > also have no interest in seeing a taint flag for it. For Chrome OS, I > would readily (as in, we're already looking at more-hacky / > non-generic ways to do this for drivers we care about) process these > kinds of stats as they happen, logging metrics for bug reports and/or > for automated crash statistics, when we see a firmware crash. Great! > A uevent > would suit us very well I think, although it would be nice if drivers > could also supply some small amount of informative text along with it A follow-up to this series was to add a uevent to add_taint(), however since a *count* is not considered I think it is correct to seek alternatives at this point. The leaner the solution the better though. Do you have a pointer to what you guys use so I can read? > (e.g., a sort of "reason code", in case we can possibly aggregate > certain failure types). We already do this sort of thing for WARN() > and friends (not via uevent, but via log parsing; at least it has nice > "cut here" markers!). Indeed, similar things can be argued about WARN*()... this however can be non-device specific. With panic-on-warn becoming a "thing", the more important it becomes to really tally exactly *why* these WARN*()s may trigger. > Perhaps Note below. > devlink (as proposed down-thread) would also fit the bill. 
I > don't think sysfs alone would fit our needs, as we'd like to process > these things as they happen, not only when a user submits a bug > report. I think we've reached a point where using "*Perhaps*" does not suffice, and if there is already a *user* of similar desired infrastructure I think we should jump on the opportunity to replace what you have with something which could be used by other devices / subsystems which require firmware. And indeed, also even consider in the abstract sense, the possibility to leverage something like this for WARN*()s later too. > > Level 1: firmware crashed, but we're recovering, at least mostly, and > > it's more informational > > Chrome OS would love to track these things too, since we'd like to see > these minimized, even if they're usually recoverable ;) > > > Level 2: device is wedged, going to try to recover by some more forceful > > means (perhaps some devices can be power-cycled? etc.) but (more) state > > would be lost in these cases? > > And we'd definitely want to know about these. We already get this for > the iwlwifi case described above, in a non-generic way. > > In general, it's probably not that easy to tell the difference between > 1 and 2, since even as you and Luis have noted, with the same driver > (and the same driver location), you find the same crashes may or may > not be recoverable. iwlwifi has extracted certain level 2 cases into > iwl_trans_pcie_removal_wk(), but even iwlwifi doesn't know all the > ways in which level 1 crashes actually lead to severe > (non-recoverable) failure. And that is fine, accepting these for what they are will help. However, leaving the user in the *dark*, is what we should *not do*. Luis
Hi Luis, On Tue, May 19, 2020 at 7:02 AM Luis Chamberlain <mcgrof@kernel.org> wrote: > On Mon, May 18, 2020 at 06:23:33PM -0700, Brian Norris wrote: > > On Sat, May 16, 2020 at 6:51 AM Johannes Berg <johannes@sipsolutions.net> wrote: > > > In addition, look what we have in iwl_trans_pcie_removal_wk(). If we > > > detect that the device is really wedged enough that the only way we can > > > still try to recover is by completely unbinding the driver from it, then > > > we give userspace a uevent for that. I don't remember exactly how and > > > where that gets used (ChromeOS) though, but it'd be nice to have that > > > sort of thing as part of the infrastructure, in a sort of two-level > > > notification? > > > > <slight side track> > > We use this on certain devices where we know the underlying hardware > > has design issues that may lead to device failure > > Ah, after reading below I see you meant for iwlwifi. Sorry, I was replying to Johannes, who I believe had his "we"="Intel" hat (as iwlwifi maintainer) on, and was pointing at iwl_trans_pcie_removal_wk(). > If userspace can indeed grow to support this, that would be fantastic. Well, Chrome OS tailors its user space a bit more to the hardware (and kernel/drivers in use) than the average distro might. We already do this (for some values of "this") today. Is that "fantastic" to you? :D > > -- then when we see > > this sort of unrecoverable "firmware-death", we remove the > > device[*]+driver, force-reset the PCI device (SBR), and try to > > reload/reattach the driver. This all happens by way of a udev rule. > > So you've sprikled your own udev event here as part of your kernel delta? 
No kernel delta -- the event is there already: iwl_trans_pcie_removal_wk() https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/wireless/intel/iwlwifi/pcie/trans.c?h=v5.6#n2027 And you can see our udev rules and scripts, in all their ugly details here, if you really care: https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/master/net-wireless/iwlwifi_rescan/files/ > > We > > also log this sort of stuff (and metrics around it) for bug reports > > and health statistics, since we really hope to not see this happen > > often. > > Assuming perfection is ideal but silly. So, what infrastructure do you > use for this sort of issue? We don't yet log firmware crashes generally, but for all our current crash reports (including WARN()), they go through this: https://chromium.googlesource.com/chromiumos/platform2/+/master/crash-reporter/README.md For example, look for "cut here" in: https://chromium.googlesource.com/chromiumos/platform2/+/master/crash-reporter/anomaly_detector.cc For other specific metrics (like counting "EVENT=INACCESSIBLE"), we use the Chrome UMA system: https://chromium.googlesource.com/chromiumos/platform2/+/master/metrics/README.md I don't imagine the "infrastructure" side of any of that would be useful to you, but maybe the client-side gathering can at least show you what we do. > > [*] "We" (user space) don't actually do this...it happens via the > > 'remove_when_gone' module parameter abomination found in iwlwifi. > > BTW is this likely a place on iwlwifi where the firmware likely crashed? iwl_trans_pcie_removal_wk() is triggered because HW accesses timed out in a way that is likely due to a dead PCIe endpoint. It's not directly a firmware crash, although there may be firmware crashes reported around the same time. Intel folks can probably give a more nuanced answer, but their firmware crashes usually land in something like iwl_mvm_nic_error() -> iwl_mvm_dump_nic_error_log() + iwl_mvm_nic_restart(). 
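[Editor's note: for readers who don't want to dig through the linked repository, the shape of such a udev rule is roughly the following. The rules-file path, match keys, and script name here are illustrative sketches, not Chrome OS's actual rule; the EVENT=INACCESSIBLE property is the one iwl_trans_pcie_removal_wk() sends with its "change" uevent on the PCI device.]

```
# Hypothetical /etc/udev/rules.d/99-iwlwifi-rescan.rules (illustrative)
# React to iwlwifi declaring the PCI endpoint dead by running a
# remove + bus-reset + rescan helper script.
ACTION=="change", SUBSYSTEM=="pci", DRIVERS=="iwlwifi", \
  ENV{EVENT}=="INACCESSIBLE", RUN+="/usr/local/sbin/iwlwifi-rescan"
```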
If you can make your proposal less punishing (e.g., removing the "taint" as Johannes requested), maybe iwlwifi would be another good candidate for $subject. iwlwifi generally expects to recover seamlessly, but it's also good to know if you've seen a hundred of these in a row. > > A uevent > > would suit us very well I think, although it would be nice if drivers > > could also supply some small amount of informative text along with it > > A follow up to this series was to add a uevent to add_taint(), however > since a *count* is not considered I think it is correct to seek > alternatives at this point. The leaner the solution the better though. > > Do you have a pointer to what guys use so I can read? Hopefully the above pointers are useful to you. We don't yet have firmware-crash parsing implemented, so I don't have pointers for that piece yet. > > (e.g., a sort of "reason code", in case we can possibly aggregate > > certain failure types). We already do this sort of thing for WARN() > > and friends (not via uevent, but via log parsing; at least it has nice > > "cut here" markers!). > > Indeed, similar things can indeed be argued about WARN*()... this > however can be non-device specific. With panic-on-warn becoming a > "thing", the more important it becomes to really tally exactly *why* > these WARN*()s may trigger. panic-on-warn? Yikes. I guess such folks don't run mac80211... I'm probably not supposed to publicize information related to how many Chromebooks are out there, but we regularly see a scary number of non-fatal warnings (WARN(), WARN_ON(), etc.) logged by Chrome OS users every day, and a scary number of those are in WiFi drivers... > > Perhaps > > Note below. (My use of "perhaps" is only because I'm not familiar with devlink at all.) > > devlink (as proposed down-thread) would also fit the bill. I > > don't think sysfs alone would fit our needs, as we'd like to process > > these things as they happen, not only when a user submits a bug > > report. 
> > I think we've reached a point where using "*Perhaps*" does not suffice, > and if there is already a *user* of similar desired infrastructure I > think we should jump on the opportunity to replace what you have with > something which could be used by other devices / subsystems which > require firmware. And indeed, also even consider in the abstract sense, > the possibility to leverage something like this for WARN*()s later too. To precisely lay out what Chrome OS does today: * WARN() and similar: implemented, see anomaly_detector.cc above. It's just log parsing, and that handy "cut here" stuff -- I'm nearly certain there are other distros using this already? A uevent would probably be nice, but log parsing is good enough for us today. * EVENT=INACCESSIBLE: iwlwifi-specific, but reference code is linked above. It's a uevent, and we handle it via udev. Code is linked above. * Other firmware crashes: not yet implemented here, but we're looking at it. Currently, we plan to do something similar to WARN(), but it will be ugly and non-generic. A uevent would be a lovely replacement, if it has some basic metadata with it. Stats in sysfs might be icing on the cake, but mostly redundant for us, if we can grab it on the fly via uevent. HTH, Brian
Hi all,

<top post intro>

Since I have been involved quite a bit in the firmware debugging
features in iwlwifi, I think I can give a few insights here.

But before this, we need to understand that there are several sources
of issues:

1) the firmware may crash but the bus is still alive; you can still
use the bus to get the crash data
2) the bus is dead; when that happens, the firmware might even be in a
good condition, but since the bus is dead, you stop getting any
information about the firmware, and then, at some point, you get to
the conclusion that the firmware is dead. You can't get the crash data
that resides on the other side of the bus (you may have gathered data
in the DRAM directly, but that's a different thing), and you don't
have much recovery to do besides re-starting the PCI enumeration.

At Intel, we have seen both, unfortunately. The bus issues are the
ones that are trickier, obviously. Trickier to detect (because you
just get garbage from any request you issue on the bus), and trickier
to handle. One can argue that the kernel should *not* handle those and
leave this in userspace's hands. I guess it all depends on what
component you ship to your customer and what your customer asks from
you :).

</top post intro>

>
> Hi Luis,
>
> On Tue, May 19, 2020 at 7:02 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
> > On Mon, May 18, 2020 at 06:23:33PM -0700, Brian Norris wrote:
> > > On Sat, May 16, 2020 at 6:51 AM Johannes Berg <johannes@sipsolutions.net> wrote:
> > > > In addition, look what we have in iwl_trans_pcie_removal_wk(). If we
> > > > detect that the device is really wedged enough that the only way we can
> > > > still try to recover is by completely unbinding the driver from it, then
> > > > we give userspace a uevent for that. I don't remember exactly how and
> > > > where that gets used (ChromeOS) though, but it'd be nice to have that
> > > > sort of thing as part of the infrastructure, in a sort of two-level
> > > > notification?
> > >
> > > <slight side track>
> > > We use this on certain devices where we know the underlying hardware
> > > has design issues that may lead to device failure
> >
> > Ah, after reading below I see you meant for iwlwifi.
>
> Sorry, I was replying to Johannes, who I believe had his "we"="Intel"
> hat (as iwlwifi maintainer) on, and was pointing at
> iwl_trans_pcie_removal_wk().

This pcie_removal thing is for the bus dead thing. My 2) above.

> > If userspace can indeed grow to support this, that would be fantastic.
>
> Well, Chrome OS tailors its user space a bit more to the hardware (and
> kernel/drivers in use) than the average distro might. We already do
> this (for some values of "this") today. Is that "fantastic" to you? :D

I guess it can be fantastic if other vendors also suffer from this. Or
maybe that could be done as part of the PCI bus driver inside the
kernel?

> > > -- then when we see
> > > this sort of unrecoverable "firmware-death", we remove the
> > > device[*]+driver, force-reset the PCI device (SBR), and try to
> > > reload/reattach the driver. This all happens by way of a udev rule.
> >
> > So you've sprinkled your own udev event here as part of your kernel delta?
>
> No kernel delta -- the event is there already:
> iwl_trans_pcie_removal_wk()
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/wireless/intel/iwlwifi/pcie/trans.c?h=v5.6#n2027
>
> And you can see our udev rules and scripts, in all their ugly details
> here, if you really care:
> https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/master/net-wireless/iwlwifi_rescan/files/
>
> > > We
> > > also log this sort of stuff (and metrics around it) for bug reports
> > > and health statistics, since we really hope to not see this happen
> > > often.
> >
> > Assuming perfection is ideal but silly. So, what infrastructure do you
> > use for this sort of issue?
>
> We don't yet log firmware crashes generally, but for all our current
> crash reports (including WARN()), they go through this:
> https://chromium.googlesource.com/chromiumos/platform2/+/master/crash-reporter/README.md
>
> For example, look for "cut here" in:
> https://chromium.googlesource.com/chromiumos/platform2/+/master/crash-reporter/anomaly_detector.cc
>
> For other specific metrics (like counting "EVENT=INACCESSIBLE"), we
> use the Chrome UMA system:
> https://chromium.googlesource.com/chromiumos/platform2/+/master/metrics/README.md
>
> I don't imagine the "infrastructure" side of any of that would be
> useful to you, but maybe the client-side gathering can at least show
> you what we do.
>
> > > [*] "We" (user space) don't actually do this...it happens via the
> > > 'remove_when_gone' module parameter abomination found in iwlwifi.
> >
> > BTW is this likely a place on iwlwifi where the firmware likely crashed?
>
> iwl_trans_pcie_removal_wk() is triggered because HW accesses timed out
> in a way that is likely due to a dead PCIe endpoint. It's not directly
> a firmware crash, although there may be firmware crashes reported
> around the same time.

iwl_trans_pcie_removal_wk() is only because of a dead bus, not because
of a firmware crash.

> Intel folks can probably give a more nuanced answer, but their
> firmware crashes usually land in something like iwl_mvm_nic_error() ->
> iwl_mvm_dump_nic_error_log() + iwl_mvm_nic_restart(). If you can make
> your proposal less punishing (e.g., removing the "taint" as Johannes
> requested), maybe iwlwifi would be another good candidate for
> $subject. iwlwifi generally expects to recover seamlessly, but it's
> also good to know if you've seen a hundred of these in a row.

Yes, you are right, mvm_nic_error is the place.

So here is the proposal. I think that a standard interface that can
allow a driver to report a firmware crash along with a proprietary
unique id that makes sense to the vendor can be useful.
Then, yes, distros can listen to this, and optionally open bugs
(automatically?) when that happens. I even planned to do this long ago
and integrated it with a specific distro, but it never rolled out.

The vendor-supplied unique id is very important for the bug
de-duplication part. For iwlwifi, we have the SYSASSERT number, which
is basically how the firmware tells us briefly what happened. Of
course, there is always more data that can be useful, but for a first
level of bug de-duplication this can be enough. Then, you have a
standard way to track the firmware crashes.

We need to remember that the firmware recovery path is expected to
work, yet it is less tested. I specifically remember a bug where bad
handling of a CSA frame caused a firmware crash in a flow that caused
the driver not to re-add a station or something like that, and because
of that, you get another firmware crash. So sometimes it is
interesting to see not only the data on which firmware crash happened
and how many times, but whether there is a timing relationship between
them. I guess that's for the next level though... Not really critical
for now.

The next problem here is that when you tell the firmware folks that
they have a bug, the first thing they'll do is to ask for more data.
Back then, I enabled our firmware debug infra on Linux, and from there
the devcoredump infra was born (thanks Johannes).

What we do at Intel is that we have a script that runs when a udev
event reports the creation of a devcoredump. It parses the kernel log
(dmesg) to determine the unique id I mentioned earlier (the SYSASSERT
number basically) and then it creates a file in /var/log/ whose name
contains the SYSASSERT number. It is then quite easy to match the
firmware data with the firmware crash.

So I believe the right way to do this is to create the devcoredump
along with the id. Meaning that we basically don't need another
interface, we just need to use the same one, but to provide the unique
id of the crash.
This unique id can then be part of the uevent that is thrown to
userspace, and userspace can use it to name the dump file with the
right id. This way, it is fairly easy (and standard across vendors) to
track the firmware crashes, but most importantly, you already have the
firmware data that goes along with it.

It would look like this:

driver_code.c:

void my_vendor_error_interrupt()
{
	u64 uid = get_the_unique_id_from_your_device();
	void *firmware_data = get_the_data_you_need();

	dev_coredumpsg(dev_struct, firmware_data, firmware_data_len,
		       GFP_whatever, uid);
}

And then, this would cause a:

/var/log/wifi-crash-$(KBUILD_MODNAME)-$uid.bin

to appear on your filesystem.

> > > A uevent
> > > would suit us very well I think, although it would be nice if drivers
> > > could also supply some small amount of informative text along with it
> >
> > A follow up to this series was to add a uevent to add_taint(), however
> > since a *count* is not considered I think it is correct to seek
> > alternatives at this point. The leaner the solution the better though.
> >
> > Do you have a pointer to what guys use so I can read?
>
> Hopefully the above pointers are useful to you. We don't yet have
> firmware-crash parsing implemented, so I don't have pointers for that
> piece yet.

See above. I don't think it is the device driver's role to count
those. We can add this count in userspace, I think. Debatable though.

> > > (e.g., a sort of "reason code", in case we can possibly aggregate
> > > certain failure types). We already do this sort of thing for WARN()
> > > and friends (not via uevent, but via log parsing; at least it has nice
> > > "cut here" markers!).
> >
> > Indeed, similar things can indeed be argued about WARN*()... this
> > however can be non-device specific. With panic-on-warn becoming a
> > "thing", the more important it becomes to really tally exactly *why*
> > these WARN*()s may trigger.
>
> panic-on-warn? Yikes. I guess such folks don't run mac80211...
> I'm
> probably not supposed to publicize information related to how many
> Chromebooks are out there, but we regularly see a scary number of
> non-fatal warnings (WARN(), WARN_ON(), etc.) logged by Chrome OS users
> every day, and a scary number of those are in WiFi drivers...
>
> > > Perhaps
> >
> > Note below.
>
> (My use of "perhaps" is only because I'm not familiar with devlink at all.)
>
> > > devlink (as proposed down-thread) would also fit the bill. I
> > > don't think sysfs alone would fit our needs, as we'd like to process
> > > these things as they happen, not only when a user submits a bug
> > > report.
> >
> > I think we've reached a point where using "*Perhaps*" does not suffice,
> > and if there is already a *user* of similar desired infrastructure I
> > think we should jump on the opportunity to replace what you have with
> > something which could be used by other devices / subsystems which
> > require firmware. And indeed, also even consider in the abstract sense,
> > the possibility to leverage something like this for WARN*()s later too.
>
> To precisely lay out what Chrome OS does today:
>
> * WARN() and similar: implemented, see anomaly_detector.cc above. It's
> just log parsing, and that handy "cut here" stuff -- I'm nearly
> certain there are other distros using this already? A uevent would
> probably be nice, but log parsing is good enough for us today.
> * EVENT=INACCESSIBLE: iwlwifi-specific, but reference code is linked
> above. It's a uevent, and we handle it via udev. Code is linked above.
> * Other firmware crashes: not yet implemented here, but we're looking
> at it. Currently, we plan to do something similar to WARN(), but it
> will be ugly and non-generic. A uevent would be a lovely replacement,
> if it has some basic metadata with it. Stats in sysfs might be icing
> on the cake, but mostly redundant for us, if we can grab it on the fly
> via uevent.

So I believe we already have this uevent, it is the devcoredump.
All we need is to add the unique id.

Note that all this is good for firmware crashes, and not for bus dead
scenarios, but those two are fundamentally different even if they look
alike at the beginning of your error detection flow.

> HTH,
> Brian
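[Editor's note: the userspace half of the flow Emmanuel describes, a udev-triggered script that pulls the assert id out of the kernel log and names the saved dump after it, might look roughly like the sketch below. The "SYSASSERT 0x..." log format, the paths, and the naming scheme are assumptions for illustration and do not match Intel's internal tooling exactly.]

```shell
#!/bin/sh
# Hedged sketch of the udev-run handler: on devcoredump creation, grep
# the kernel log for the firmware assert id and derive a /var/log file
# name carrying that id, so dumps de-duplicate by crash signature.
# The log format and file naming are illustrative assumptions.

LOG_DIR="${LOG_DIR:-/var/log}"

extract_assert_id() {
    # Pull something like "SYSASSERT 0x2078" out of log text on stdin;
    # emit only the hex id, first match wins.
    sed -n 's/.*SYSASSERT \(0x[0-9a-fA-F][0-9a-fA-F]*\).*/\1/p' | head -n 1
}

uid=$(dmesg 2>/dev/null | extract_assert_id)
uid="${uid:-unknown}"

# A real handler would now copy the devcoredump 'data' node to this
# path (and write to it afterwards to dismiss the dump); dry-run here.
echo "would save: ${LOG_DIR}/wifi-crash-iwlwifi-${uid}.bin"
```

With the unique id carried in the uevent itself, as proposed, the `dmesg` parsing step would disappear and the handler would just read the id from its environment.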
On Wed, May 20, 2020 at 8:40 AM Emmanuel Grumbach <egrumbach@gmail.com> wrote:
> Since I have been involved quite a bit in the firmware debugging
> features in iwlwifi, I think I can give a few insights here.
>
> But before this, we need to understand that there are several sources of issues:
> 1) the firmware may crash but the bus is still alive; you can still
> use the bus to get the crash data
> 2) the bus is dead; when that happens, the firmware might even be in a
> good condition, but since the bus is dead, you stop getting any
> information about the firmware, and then, at some point, you get to
> the conclusion that the firmware is dead. You can't get the crash data
> that resides on the other side of the bus (you may have gathered data
> in the DRAM directly, but that's a different thing), and you don't
> have much recovery to do besides re-starting the PCI enumeration.
>
> At Intel, we have seen both, unfortunately. The bus issues are the ones
> that are trickier, obviously. Trickier to detect (because you just get
> garbage from any request you issue on the bus), and trickier to
> handle. One can argue that the kernel should *not* handle those and
> leave this in userspace's hands. I guess it all depends on what component
> you ship to your customer and what your customer asks from you :).

Or the two best approaches: 1) get rid of firmware completely; 2) make
it OSS (like SOF). I think either of these is the right thing to do
from a long-term perspective. How many firmware images does the
average computer have? 50? 100? Each of them is a burden and a PITA.
On Tue, May 19, 2020 at 10:37 PM Emmanuel Grumbach <egrumbach@gmail.com> wrote:
> So I believe we already have this uevent, it is the devcoredump. All
> we need is to add the unique id.

I think there are a few reasons that devcoredump doesn't satisfy what
either Luis or I want.

1) it can be disabled entirely [1], for good reasons (e.g., think of
non-${CHIP_VENDOR} folks, who can't (and don't want to) do anything
with the opaque dumps provided by closed-source firmware)
2) not all drivers necessarily have a useful dump to provide when
there's a crash; look at the rest of Luis's series to see the kinds of
drivers-with-firmware that are crashing, some of which aren't dumping
anything
3) for those that do support devcoredump, it may be used for purposes
that are not "crashes" -- e.g., some provide debugfs or other knobs to
initiate dumps, for diagnostic or debugging purposes

Brian

[1] devcd_disabled
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/base/devcoredump.c?h=v5.6#n22
>
> On Tue, May 19, 2020 at 10:37 PM Emmanuel Grumbach <egrumbach@gmail.com> wrote:
> > So I believe we already have this uevent, it is the devcoredump. All
> > we need is to add the unique id.
>
> I think there are a few reasons that devcoredump doesn't satisfy what
> either Luis or I want.
>
> 1) it can be disabled entirely [1], for good reasons (e.g., think of
> non-${CHIP_VENDOR} folks, who can't (and don't want to) do anything
> with the opaque dumps provided by closed-source firmware)

Ok, if all you're interested in is the information that this event
happened (as opposed to reporting a bug and providing the data), then
I agree. True, not everybody wants to or can enable devcoredump.

I am just a bit concerned that we may end up with two interfaces that
notify on the same event, basically. The ideal maybe would be to be
able to optionally reduce the content of the devcoredump to nothing
more than what is already in the dmesg output. But then, it is not
what it is meant to be: namely, a core dump..

> 2) not all drivers necessarily have a useful dump to provide when
> there's a crash; look at the rest of Luis's series to see the kinds of
> drivers-with-firmware that are crashing, some of which aren't dumping
> anything

Fair enough.

> 3) for those that do support devcoredump, it may be used for purposes
> that are not "crashes" -- e.g., some provide debugfs or other knobs to
> initiate dumps, for diagnostic or debugging purposes

Not sure I really think we need to care about those cases, but you
already have 2 good arguments :)

>
> Brian
>
> [1] devcd_disabled
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/base/devcoredump.c?h=v5.6#n22
On Fri, May 22, 2020 at 08:12:59AM +0300, Emmanuel Grumbach wrote:
> >
> > On Tue, May 19, 2020 at 10:37 PM Emmanuel Grumbach <egrumbach@gmail.com> wrote:
> > > So I believe we already have this uevent, it is the devcoredump. All
> > > we need is to add the unique id.
> >
> > I think there are a few reasons that devcoredump doesn't satisfy what
> > either Luis or I want.
> >
> > 1) it can be disabled entirely [1], for good reasons (e.g., think of
> > non-${CHIP_VENDOR} folks, who can't (and don't want to) do anything
> > with the opaque dumps provided by closed-source firmware)
>
> Ok, if all you're interested in is the information that this event
> happened (as opposed to reporting a bug and providing the data), then
> I agree.

I've now hit again a firmware crash with ath10k with the latest
firmware and kernel, and the *only* thing that helped recovery was a
full reboot, so that is a crystal clear case that this needs to taint
the kernel. And yes, we do want to inform users too, so I've just
added uevent support for a few panic / taint events in the kernel now
and rolled it into my series. I'll run some final tests and then post
this as a follow up.

devlink didn't cut it, it's networking-specific.

  Luis
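[Editor's note: once a taint flag along these lines lands, userspace can poll for it with nothing more than shell arithmetic against /proc/sys/kernel/tainted. The sketch below is hedged: the bit position used for the proposed firmware-crash flag is an assumption for illustration, since the series was still under review and this thread fixes no final bit number.]

```shell
#!/bin/sh
# Hedged sketch: test a kernel taint bitmask for a given bit. The
# TAINT_FIRMWARE_CRASH_BIT value is a hypothetical placeholder, not a
# number defined by the kernel at the time of this thread.

TAINT_FIRMWARE_CRASH_BIT=18   # hypothetical bit position

taint_bit_set() {
    # $1 = taint bitmask (e.g. contents of /proc/sys/kernel/tainted)
    # $2 = bit position to test
    [ $(( ($1 >> $2) & 1 )) -eq 1 ]
}

if [ -r /proc/sys/kernel/tainted ]; then
    if taint_bit_set "$(cat /proc/sys/kernel/tainted)" "$TAINT_FIRMWARE_CRASH_BIT"; then
        echo "firmware-crash taint is set"
    else
        echo "firmware-crash taint is not set"
    fi
fi
```

Unlike the uevent, this only answers "has it ever happened since boot", which is exactly the count-less limitation discussed earlier in the thread; the uevent remains the way to catch each occurrence as it happens.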
diff --git a/drivers/net/wireless/ath/ath10k/pci.c b/drivers/net/wireless/ath/ath10k/pci.c
index 1d941d53fdc9..6bd0f3b518b9 100644
--- a/drivers/net/wireless/ath/ath10k/pci.c
+++ b/drivers/net/wireless/ath/ath10k/pci.c
@@ -1767,6 +1767,7 @@ static void ath10k_pci_fw_dump_work(struct work_struct *work)
 		scnprintf(guid, sizeof(guid), "n/a");
 
 	ath10k_err(ar, "firmware crashed! (guid %s)\n", guid);
+	module_firmware_crashed();
 	ath10k_print_driver_info(ar);
 	ath10k_pci_dump_registers(ar, crash_data);
 	ath10k_ce_dump_registers(ar, crash_data);
@@ -2837,6 +2838,7 @@ static int ath10k_pci_hif_power_up(struct ath10k *ar,
 	if (ret) {
 		if (ath10k_pci_has_fw_crashed(ar)) {
 			ath10k_warn(ar, "firmware crashed during chip reset\n");
+			module_firmware_crashed();
 			ath10k_pci_fw_crashed_clear(ar);
 			ath10k_pci_fw_crashed_dump(ar);
 		}
diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c
index e2aff2254a40..d34ad289380f 100644
--- a/drivers/net/wireless/ath/ath10k/sdio.c
+++ b/drivers/net/wireless/ath/ath10k/sdio.c
@@ -794,6 +794,7 @@ static int ath10k_sdio_mbox_proc_dbg_intr(struct ath10k *ar)
 
 	/* TODO: Add firmware crash handling */
 	ath10k_warn(ar, "firmware crashed\n");
+	module_firmware_crashed();
 
 	/* read counter to clear the interrupt, the debug error interrupt is
 	 * counter 0.
@@ -915,6 +916,7 @@ static int ath10k_sdio_mbox_proc_cpu_intr(struct ath10k *ar)
 	if (cpu_int_status & MBOX_CPU_STATUS_ENABLE_ASSERT_MASK) {
 		ath10k_err(ar, "firmware crashed!\n");
 		queue_work(ar->workqueue, &ar->restart_work);
+		module_firmware_crashed();
 	}
 	return ret;
 }
diff --git a/drivers/net/wireless/ath/ath10k/snoc.c b/drivers/net/wireless/ath/ath10k/snoc.c
index 354d49b1cd45..7cfc123c345c 100644
--- a/drivers/net/wireless/ath/ath10k/snoc.c
+++ b/drivers/net/wireless/ath/ath10k/snoc.c
@@ -1451,6 +1451,7 @@ void ath10k_snoc_fw_crashed_dump(struct ath10k *ar)
 		scnprintf(guid, sizeof(guid), "n/a");
 
 	ath10k_err(ar, "firmware crashed! (guid %s)\n", guid);
+	module_firmware_crashed();
 	ath10k_print_driver_info(ar);
 	ath10k_msa_dump_memory(ar, crash_data);
 	mutex_unlock(&ar->dump_mutex);
This makes use of the new module_firmware_crashed() to help
annotate when firmware for device drivers crash. When firmware
crashes devices can sometimes become unresponsive, and recovery
sometimes requires a driver unload / reload and in the worst cases
a reboot.

Using a taint flag allows us to annotate when this happens clearly.

Cc: linux-wireless@vger.kernel.org
Cc: ath10k@lists.infradead.org
Cc: Kalle Valo <kvalo@codeaurora.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/net/wireless/ath/ath10k/pci.c  | 2 ++
 drivers/net/wireless/ath/ath10k/sdio.c | 2 ++
 drivers/net/wireless/ath/ath10k/snoc.c | 1 +
 3 files changed, 5 insertions(+)