Message ID | 20210306024220.251721-1-kuba@kernel.org (mailing list archive) |
---|---|
State | RFC |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [RFC] devlink: health: add remediation type | expand |
Context | Check | Description |
---|---|---|
netdev/cover_letter | success | Link |
netdev/fixes_present | success | Link |
netdev/patch_count | success | Link |
netdev/tree_selection | success | Guessed tree name to be net-next |
netdev/subject_prefix | warning | Target tree name not specified in the subject |
netdev/cc_maintainers | warning | 2 maintainers not CCed: davem@davemloft.net jiri@nvidia.com |
netdev/source_inline | success | Was 0 now: 0 |
netdev/verify_signedoff | success | Link |
netdev/module_param | success | Was 0 now: 0 |
netdev/build_32bit | success | Errors and warnings before: 486 this patch: 486 |
netdev/kdoc | fail | Errors and warnings before: 16 this patch: 17 |
netdev/verify_fixes | success | Link |
netdev/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 73 lines checked |
netdev/build_allmodconfig_warn | success | Errors and warnings before: 678 this patch: 678 |
netdev/header_inline | success | Link |
netdev/stable | success | Stable not CCed |
+/** > + * enum devlink_health_reporter_remedy - severity of remediation procedure > + * @DLH_REMEDY_NONE: transient error, no remediation required > + * @DLH_REMEDY_COMP_RESET: associated device component (e.g. device queue) > + * will be reset > + * @DLH_REMEDY_RESET: full device reset, will result in temporary unavailability > + * of the device, device configuration should not be lost > + * @DLH_REMEDY_REINIT: device will be reinitialized and configuration lost > + * @DLH_REMEDY_POWER_CYCLE: device requires a power cycle to recover > + * @DLH_REMEDY_REIMAGE: device needs to be reflashed > + * @DLH_REMEDY_BAD_PART: indication of failing hardware, device needs to be > + * replaced > + * > + * Used in %DEVLINK_ATTR_HEALTH_REPORTER_REMEDY, categorizes the health reporter > + * by the severity of the required remediation, and indicates the remediation > + * type to the user if it can't be applied automatically (e.g. "reimage"). > + */ > +enum devlink_health_reporter_remedy { > + DLH_REMEDY_NONE = 1, > + DLH_REMEDY_COMP_RESET, > + DLH_REMEDY_RESET, > + DLH_REMEDY_REINIT, > + DLH_REMEDY_POWER_CYCLE, > + DLH_REMEDY_REIMAGE, > + DLH_REMEDY_BAD_PART, > +}; Hi Jakub Are there any cases where the host is the problem, not the device? The host driver needs to be unloaded and reloaded? The host needs a reboot? Andrew
On Sat, 6 Mar 2021 15:48:11 +0100 Andrew Lunn wrote: > +/** > > + * enum devlink_health_reporter_remedy - severity of remediation procedure > > + * @DLH_REMEDY_NONE: transient error, no remediation required > > + * @DLH_REMEDY_COMP_RESET: associated device component (e.g. device queue) > > + * will be reset > > + * @DLH_REMEDY_RESET: full device reset, will result in temporary unavailability > > + * of the device, device configuration should not be lost > > + * @DLH_REMEDY_REINIT: device will be reinitialized and configuration lost > > + * @DLH_REMEDY_POWER_CYCLE: device requires a power cycle to recover > > + * @DLH_REMEDY_REIMAGE: device needs to be reflashed > > + * @DLH_REMEDY_BAD_PART: indication of failing hardware, device needs to be > > + * replaced > > + * > > + * Used in %DEVLINK_ATTR_HEALTH_REPORTER_REMEDY, categorizes the health reporter > > + * by the severity of the required remediation, and indicates the remediation > > + * type to the user if it can't be applied automatically (e.g. "reimage"). > > + */ > > +enum devlink_health_reporter_remedy { > > + DLH_REMEDY_NONE = 1, > > + DLH_REMEDY_COMP_RESET, > > + DLH_REMEDY_RESET, > > + DLH_REMEDY_REINIT, > > + DLH_REMEDY_POWER_CYCLE, > > + DLH_REMEDY_REIMAGE, > > + DLH_REMEDY_BAD_PART, > > +}; > > Hi Jakub > > Are there any cases where the host is the problem, not the device? The > host driver needs to be unloaded and reloaded? The host needs a > reboot? I was thinking of REINIT addressing that case. Maybe RELOAD would be a better name since that's what the devlink command is called? But also to answer your direct question, I haven't seen such case in practice, I don't think, just trying to cover all the bases.
On 3/6/2021 4:42 AM, Jakub Kicinski wrote: > Currently devlink health does not give user any clear information > of what kind of remediation ->recover callback will perform. This > makes it difficult to understand the impact of enabling auto- > -remediation, and the severity of the error itself. > > To allow users to make more informed decision, as well as stretch > the applicability of devlink health beyond what can be fixed by > resetting things - add a new remediation type attribute. > > Note that we only allow one remediation type per reporter, this > is intentional. devlink health is not built for mixing issues > of different severity into one reporter since it only maintains > one dump, of the first event and a single error counter. > Nudging vendors towards categorizing issues beyond coarse > groups is an added bonus. > > Signed-off-by: Jakub Kicinski <kuba@kernel.org> > --- > include/net/devlink.h | 2 ++ > include/uapi/linux/devlink.h | 30 ++++++++++++++++++++++++++++++ > net/core/devlink.c | 7 +++++-- > 3 files changed, 37 insertions(+), 2 deletions(-) > > diff --git a/include/net/devlink.h b/include/net/devlink.h > index 853420db5d32..70b5fd6a8c0d 100644 > --- a/include/net/devlink.h > +++ b/include/net/devlink.h > @@ -664,6 +664,7 @@ enum devlink_health_reporter_state { > /** > * struct devlink_health_reporter_ops - Reporter operations > * @name: reporter name > + * remedy: severity of the remediation required > * @recover: callback to recover from reported error > * if priv_ctx is NULL, run a full recover > * @dump: callback to dump an object > @@ -674,6 +675,7 @@ enum devlink_health_reporter_state { > > struct devlink_health_reporter_ops { > char *name; > + enum devlink_health_reporter_remedy remedy; > int (*recover)(struct devlink_health_reporter *reporter, > void *priv_ctx, struct netlink_ext_ack *extack); > int (*dump)(struct devlink_health_reporter *reporter, > diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h > index f6008b2fa60f..bd490c5536b1 100644 > --- a/include/uapi/linux/devlink.h > +++ b/include/uapi/linux/devlink.h > @@ -534,6 +534,9 @@ enum devlink_attr { > DEVLINK_ATTR_RELOAD_ACTION_STATS, /* nested */ > > DEVLINK_ATTR_PORT_PCI_SF_NUMBER, /* u32 */ > + > + DEVLINK_ATTR_HEALTH_REPORTER_REMEDY, /* u32 */ > + > /* add new attributes above here, update the policy in devlink.c */ > > __DEVLINK_ATTR_MAX, > @@ -608,4 +611,31 @@ enum devlink_port_fn_opstate { > DEVLINK_PORT_FN_OPSTATE_ATTACHED, > }; > > +/** > + * enum devlink_health_reporter_remedy - severity of remediation procedure > + * @DLH_REMEDY_NONE: transient error, no remediation required DLH_REMEDY_LOCAL_FIX: associated component will undergo a local un-harmful fix attempt. (e.g look for lost interrupt in mlx5e_tx_reporter_timeout_recover()) > + * @DLH_REMEDY_COMP_RESET: associated device component (e.g. device queue) > + * will be reset > + * @DLH_REMEDY_RESET: full device reset, will result in temporary unavailability > + * of the device, device configuration should not be lost > + * @DLH_REMEDY_REINIT: device will be reinitialized and configuration lost > + * @DLH_REMEDY_POWER_CYCLE: device requires a power cycle to recover > + * @DLH_REMEDY_REIMAGE: device needs to be reflashed > + * @DLH_REMEDY_BAD_PART: indication of failing hardware, device needs to be > + * replaced > + * > + * Used in %DEVLINK_ATTR_HEALTH_REPORTER_REMEDY, categorizes the health reporter > + * by the severity of the required remediation, and indicates the remediation > + * type to the user if it can't be applied automatically (e.g. "reimage"). > + */ The assumption here is that a reporter's recovery function has one remedy. But it can have few remedies and escalate between them. Did you consider a bitmask? > +enum devlink_health_reporter_remedy { > + DLH_REMEDY_NONE = 1, > + DLH_REMEDY_COMP_RESET, > + DLH_REMEDY_RESET, > + DLH_REMEDY_REINIT, > + DLH_REMEDY_POWER_CYCLE, > + DLH_REMEDY_REIMAGE, In general, I don't expect a reported to perform POWER_CYCLE or REIMAGE as part of the recovery. > + DLH_REMEDY_BAD_PART, BAD_PART probably indicates that the reporter (or any command line execution) cannot recover the issue. As the suggested remedy is static per reporter's recover method, it doesn't make sense for one to set a recover method that by design cannot recover successfully. Maybe we should extend devlink_health_reporter_state with POWER_CYCLE, REIMAGE and BAD_PART? To indicate the user that for a successful recovery, it should run a non-devlink-health operation? > +}; > + > #endif /* _UAPI_LINUX_DEVLINK_H_ */ > diff --git a/net/core/devlink.c b/net/core/devlink.c > index 737b61c2976e..73eb665070b9 100644 > --- a/net/core/devlink.c > +++ b/net/core/devlink.c > @@ -6095,7 +6095,8 @@ __devlink_health_reporter_create(struct devlink *devlink, > { > struct devlink_health_reporter *reporter; > > - if (WARN_ON(graceful_period && !ops->recover)) > + if (WARN_ON(graceful_period && !ops->recover) || > + WARN_ON(!ops->remedy)) Here you fail every reported that doesn't have remedy set, not only the ones with recovery callback > return ERR_PTR(-EINVAL); > > reporter = kzalloc(sizeof(*reporter), GFP_KERNEL); > @@ -6263,7 +6264,9 @@ devlink_nl_health_reporter_fill(struct sk_buff *msg, > if (!reporter_attr) > goto genlmsg_cancel; > if (nla_put_string(msg, DEVLINK_ATTR_HEALTH_REPORTER_NAME, > - reporter->ops->name)) > + reporter->ops->name) || > + nla_put_u32(msg, DEVLINK_ATTR_HEALTH_REPORTER_REMEDY, > + reporter->ops->remedy)) Why not new if clause like all other attributes later. > goto reporter_nest_cancel; > if (nla_put_u8(msg, DEVLINK_ATTR_HEALTH_REPORTER_STATE, > reporter->health_state)) >
On Sun, 7 Mar 2021 17:59:58 +0200 Eran Ben Elisha wrote: > On 3/6/2021 4:42 AM, Jakub Kicinski wrote: > > Currently devlink health does not give user any clear information > > of what kind of remediation ->recover callback will perform. This > > makes it difficult to understand the impact of enabling auto- > > -remediation, and the severity of the error itself. > > > > To allow users to make more informed decision, as well as stretch > > the applicability of devlink health beyond what can be fixed by > > resetting things - add a new remediation type attribute. > > > > Note that we only allow one remediation type per reporter, this > > is intentional. devlink health is not built for mixing issues > > of different severity into one reporter since it only maintains > > one dump, of the first event and a single error counter. > > Nudging vendors towards categorizing issues beyond coarse > > groups is an added bonus. > > > > Signed-off-by: Jakub Kicinski <kuba@kernel.org> > > --- > > include/net/devlink.h | 2 ++ > > include/uapi/linux/devlink.h | 30 ++++++++++++++++++++++++++++++ > > net/core/devlink.c | 7 +++++-- > > 3 files changed, 37 insertions(+), 2 deletions(-) > > > > diff --git a/include/net/devlink.h b/include/net/devlink.h > > index 853420db5d32..70b5fd6a8c0d 100644 > > --- a/include/net/devlink.h > > +++ b/include/net/devlink.h > > @@ -664,6 +664,7 @@ enum devlink_health_reporter_state { > > /** > > * struct devlink_health_reporter_ops - Reporter operations > > * @name: reporter name > > + * remedy: severity of the remediation required > > * @recover: callback to recover from reported error > > * if priv_ctx is NULL, run a full recover > > * @dump: callback to dump an object > > @@ -674,6 +675,7 @@ enum devlink_health_reporter_state { > > > > struct devlink_health_reporter_ops { > > char *name; > > + enum devlink_health_reporter_remedy remedy; > > int (*recover)(struct devlink_health_reporter *reporter, > > void *priv_ctx, struct netlink_ext_ack *extack); > > int (*dump)(struct devlink_health_reporter *reporter, > > diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h > > index f6008b2fa60f..bd490c5536b1 100644 > > --- a/include/uapi/linux/devlink.h > > +++ b/include/uapi/linux/devlink.h > > @@ -534,6 +534,9 @@ enum devlink_attr { > > DEVLINK_ATTR_RELOAD_ACTION_STATS, /* nested */ > > > > DEVLINK_ATTR_PORT_PCI_SF_NUMBER, /* u32 */ > > + > > + DEVLINK_ATTR_HEALTH_REPORTER_REMEDY, /* u32 */ > > + > > /* add new attributes above here, update the policy in devlink.c */ > > > > __DEVLINK_ATTR_MAX, > > @@ -608,4 +611,31 @@ enum devlink_port_fn_opstate { > > DEVLINK_PORT_FN_OPSTATE_ATTACHED, > > }; > > > > +/** > > + * enum devlink_health_reporter_remedy - severity of remediation procedure > > + * @DLH_REMEDY_NONE: transient error, no remediation required > DLH_REMEDY_LOCAL_FIX: associated component will undergo a local > un-harmful fix attempt. > (e.g look for lost interrupt in mlx5e_tx_reporter_timeout_recover()) Should we make it more specific? Maybe DLH_REMEDY_STALL: device stall detected, resumed by re-trigerring processing, without reset? > > + * @DLH_REMEDY_COMP_RESET: associated device component (e.g. device queue) > > + * will be reset > > + * @DLH_REMEDY_RESET: full device reset, will result in temporary unavailability > > + * of the device, device configuration should not be lost > > + * @DLH_REMEDY_REINIT: device will be reinitialized and configuration lost > > + * @DLH_REMEDY_POWER_CYCLE: device requires a power cycle to recover > > + * @DLH_REMEDY_REIMAGE: device needs to be reflashed > > + * @DLH_REMEDY_BAD_PART: indication of failing hardware, device needs to be > > + * replaced > > + * > > + * Used in %DEVLINK_ATTR_HEALTH_REPORTER_REMEDY, categorizes the health reporter > > + * by the severity of the required remediation, and indicates the remediation > > + * type to the user if it can't be applied automatically (e.g. "reimage"). > > + */ > The assumption here is that a reporter's recovery function has one > remedy. But it can have few remedies and escalate between them. Did you > consider a bitmask? Yes, I tried to explain in the commit message. If we wanted to support escalating remediations we'd also need separate counters etc. I think having a health reporter per remediation should actually work fairly well. The major case where escalations would be immediately useful would be cases where normal remediation failed therefore we need a hard reset. But in those cases I think having the health reporter in a failed state should be a sufficient indication to automation to take a machine out of service and power cycle it. > > +enum devlink_health_reporter_remedy { > > + DLH_REMEDY_NONE = 1, > > + DLH_REMEDY_COMP_RESET, > > + DLH_REMEDY_RESET, > > + DLH_REMEDY_REINIT, > > + DLH_REMEDY_POWER_CYCLE, > > + DLH_REMEDY_REIMAGE, > In general, I don't expect a reported to perform POWER_CYCLE or REIMAGE > as part of the recovery. Right, these are extending the use of health reporters beyond what can be remediated automatically. > > + DLH_REMEDY_BAD_PART, > BAD_PART probably indicates that the reporter (or any command line > execution) cannot recover the issue. > As the suggested remedy is static per reporter's recover method, it > doesn't make sense for one to set a recover method that by design cannot > recover successfully. > > Maybe we should extend devlink_health_reporter_state with POWER_CYCLE, > REIMAGE and BAD_PART? To indicate the user that for a successful > recovery, it should run a non-devlink-health operation? Hm, export and extend devlink_health_reporter_state? I like that idea. > > diff --git a/net/core/devlink.c b/net/core/devlink.c > > index 737b61c2976e..73eb665070b9 100644 > > --- a/net/core/devlink.c > > +++ b/net/core/devlink.c > > @@ -6095,7 +6095,8 @@ __devlink_health_reporter_create(struct devlink *devlink, > > { > > struct devlink_health_reporter *reporter; > > > > - if (WARN_ON(graceful_period && !ops->recover)) > > + if (WARN_ON(graceful_period && !ops->recover) || > > + WARN_ON(!ops->remedy)) > Here you fail every reported that doesn't have remedy set, not only the > ones with recovery callback Right, DLH_REMEDY_NONE doesn't require a callback. Some health issues can be remedied by the device e.g. ECC errors or something along those lines. Obviously non-RFC patch will have to come with driver changes. > > return ERR_PTR(-EINVAL); > > > > reporter = kzalloc(sizeof(*reporter), GFP_KERNEL); > > @@ -6263,7 +6264,9 @@ devlink_nl_health_reporter_fill(struct sk_buff *msg, > > if (!reporter_attr) > > goto genlmsg_cancel; > > if (nla_put_string(msg, DEVLINK_ATTR_HEALTH_REPORTER_NAME, > > - reporter->ops->name)) > > + reporter->ops->name) || > > + nla_put_u32(msg, DEVLINK_ATTR_HEALTH_REPORTER_REMEDY, > > + reporter->ops->remedy)) > Why not new if clause like all other attributes later. Eh. > > goto reporter_nest_cancel; > > if (nla_put_u8(msg, DEVLINK_ATTR_HEALTH_REPORTER_STATE, > > reporter->health_state)) > >
On Mon, 8 Mar 2021 09:16:00 -0800 Jakub Kicinski wrote: > > > + DLH_REMEDY_BAD_PART, > > BAD_PART probably indicates that the reporter (or any command line > > execution) cannot recover the issue. > > As the suggested remedy is static per reporter's recover method, it > > doesn't make sense for one to set a recover method that by design cannot > > recover successfully. > > > > Maybe we should extend devlink_health_reporter_state with POWER_CYCLE, > > REIMAGE and BAD_PART? To indicate the user that for a successful > > recovery, it should run a non-devlink-health operation? > > Hm, export and extend devlink_health_reporter_state? I like that idea. Trying to type it up it looks less pretty than expected. Let's looks at some examples. A queue reporter, say "rx", resets the queue dropping all outstanding buffers. As previously mentioned when the normal remediation fails user is expected to power cycle the machine or maybe swap the card. The device itself does not have a crystal ball. A management FW reporter "fw", has a auto recovery of FW reset (REMEDY_RESET). On failure -> power cycle. An "io" reporter (PCI link had to be trained down) can only return a hardware failure (we should probably have a HW failure other than BAD_PART for this). Flash reporters - the device will know if the flash had a bad block or the entire part is bad, so probably can have 2 reporters for this. Most of the reporters would only report one "action" that can be performed to fix them. The cartesian product of ->recovery types vs manual recovery does not seem necessary. And drivers would get bloated with additional boilerplate of returning ERROR_NEED_POWER_CYCLE for _all_ cases with ->recovery. Because what else would the fix be if software-initiated reset didn't work?
On 3/8/2021 7:59 PM, Jakub Kicinski wrote: > On Mon, 8 Mar 2021 09:16:00 -0800 Jakub Kicinski wrote: >>>> + DLH_REMEDY_BAD_PART, >>> BAD_PART probably indicates that the reporter (or any command line >>> execution) cannot recover the issue. >>> As the suggested remedy is static per reporter's recover method, it >>> doesn't make sense for one to set a recover method that by design cannot >>> recover successfully. >>> >>> Maybe we should extend devlink_health_reporter_state with POWER_CYCLE, >>> REIMAGE and BAD_PART? To indicate the user that for a successful >>> recovery, it should run a non-devlink-health operation? >> >> Hm, export and extend devlink_health_reporter_state? I like that idea. > > Trying to type it up it looks less pretty than expected. > > Let's looks at some examples. > > A queue reporter, say "rx", resets the queue dropping all outstanding > buffers. As previously mentioned when the normal remediation fails user > is expected to power cycle the machine or maybe swap the card. The > device itself does not have a crystal ball. Not sure, reopen the queue, or reinit the driver might also be good in case of issue in the SW/HW queue context for example. But I agree that RX reporter can't tell from its perspective what further escalation is needed in case its local defined operations failed. > > A management FW reporter "fw", has a auto recovery of FW reset > (REMEDY_RESET). On failure -> power cycle. > > An "io" reporter (PCI link had to be trained down) can only return > a hardware failure (we should probably have a HW failure other than > BAD_PART for this). > > Flash reporters - the device will know if the flash had a bad block > or the entire part is bad, so probably can have 2 reporters for this. > > Most of the reporters would only report one "action" that can be > performed to fix them. The cartesian product of ->recovery types vs > manual recovery does not seem necessary. And drivers would get bloated > with additional boilerplate of returning ERROR_NEED_POWER_CYCLE for > _all_ cases with ->recovery. Because what else would the fix be if > software-initiated reset didn't work? > OK, I see your point. If I got you right, this is the conclusions so far: 1. Each reporter with recover callback will have to supply a remedy definition. 2. We shouldn't have POWER_CYCLE, REIMAGE and BAD_PART as a remedy, because these are not valid reporter recover flows in any case. 3. If a reporter will fail to recover, its status shall remain as error, and it is out of the reporter's scope to advise the administrator on further actions.
On 3/8/2021 7:16 PM, Jakub Kicinski wrote: > On Sun, 7 Mar 2021 17:59:58 +0200 Eran Ben Elisha wrote: >> On 3/6/2021 4:42 AM, Jakub Kicinski wrote: >>> Currently devlink health does not give user any clear information >>> of what kind of remediation ->recover callback will perform. This >>> makes it difficult to understand the impact of enabling auto- >>> -remediation, and the severity of the error itself. >>> >>> To allow users to make more informed decision, as well as stretch >>> the applicability of devlink health beyond what can be fixed by >>> resetting things - add a new remediation type attribute. >>> >>> Note that we only allow one remediation type per reporter, this >>> is intentional. devlink health is not built for mixing issues >>> of different severity into one reporter since it only maintains >>> one dump, of the first event and a single error counter. >>> Nudging vendors towards categorizing issues beyond coarse >>> groups is an added bonus. >>> >>> Signed-off-by: Jakub Kicinski <kuba@kernel.org> >>> --- >>> include/net/devlink.h | 2 ++ >>> include/uapi/linux/devlink.h | 30 ++++++++++++++++++++++++++++++ >>> net/core/devlink.c | 7 +++++-- >>> 3 files changed, 37 insertions(+), 2 deletions(-) >>> >>> diff --git a/include/net/devlink.h b/include/net/devlink.h >>> index 853420db5d32..70b5fd6a8c0d 100644 >>> --- a/include/net/devlink.h >>> +++ b/include/net/devlink.h >>> @@ -664,6 +664,7 @@ enum devlink_health_reporter_state { >>> /** >>> * struct devlink_health_reporter_ops - Reporter operations >>> * @name: reporter name >>> + * remedy: severity of the remediation required >>> * @recover: callback to recover from reported error >>> * if priv_ctx is NULL, run a full recover >>> * @dump: callback to dump an object >>> @@ -674,6 +675,7 @@ enum devlink_health_reporter_state { >>> >>> struct devlink_health_reporter_ops { >>> char *name; >>> + enum devlink_health_reporter_remedy remedy; >>> int (*recover)(struct devlink_health_reporter *reporter, >>> void *priv_ctx, struct netlink_ext_ack *extack); >>> int (*dump)(struct devlink_health_reporter *reporter, >>> diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h >>> index f6008b2fa60f..bd490c5536b1 100644 >>> --- a/include/uapi/linux/devlink.h >>> +++ b/include/uapi/linux/devlink.h >>> @@ -534,6 +534,9 @@ enum devlink_attr { >>> DEVLINK_ATTR_RELOAD_ACTION_STATS, /* nested */ >>> >>> DEVLINK_ATTR_PORT_PCI_SF_NUMBER, /* u32 */ >>> + >>> + DEVLINK_ATTR_HEALTH_REPORTER_REMEDY, /* u32 */ >>> + >>> /* add new attributes above here, update the policy in devlink.c */ >>> >>> __DEVLINK_ATTR_MAX, >>> @@ -608,4 +611,31 @@ enum devlink_port_fn_opstate { >>> DEVLINK_PORT_FN_OPSTATE_ATTACHED, >>> }; >>> >>> +/** >>> + * enum devlink_health_reporter_remedy - severity of remediation procedure >>> + * @DLH_REMEDY_NONE: transient error, no remediation required >> DLH_REMEDY_LOCAL_FIX: associated component will undergo a local >> un-harmful fix attempt. >> (e.g look for lost interrupt in mlx5e_tx_reporter_timeout_recover()) > > Should we make it more specific? Maybe DLH_REMEDY_STALL: device stall > detected, resumed by re-trigerring processing, without reset? Sounds good. > >>> + * @DLH_REMEDY_COMP_RESET: associated device component (e.g. device queue) >>> + * will be reset >>> + * @DLH_REMEDY_RESET: full device reset, will result in temporary unavailability >>> + * of the device, device configuration should not be lost >>> + * @DLH_REMEDY_REINIT: device will be reinitialized and configuration lost >>> + * @DLH_REMEDY_POWER_CYCLE: device requires a power cycle to recover >>> + * @DLH_REMEDY_REIMAGE: device needs to be reflashed >>> + * @DLH_REMEDY_BAD_PART: indication of failing hardware, device needs to be >>> + * replaced >>> + * >>> + * Used in %DEVLINK_ATTR_HEALTH_REPORTER_REMEDY, categorizes the health reporter >>> + * by the severity of the required remediation, and indicates the remediation >>> + * type to the user if it can't be applied automatically (e.g. "reimage"). >>> + */ >> The assumption here is that a reporter's recovery function has one >> remedy. But it can have few remedies and escalate between them. Did you >> consider a bitmask? > > Yes, I tried to explain in the commit message. If we wanted to support > escalating remediations we'd also need separate counters etc. I think > having a health reporter per remediation should actually work fairly > well. That would require reporter's recovery procedure failure to trigger health flow for other reporter. So we can find ourselves with 2 RX reporters, sharing the same diagnose and dump callbacks, and each has other recovery flow. Seems a bit counterintuitive. Maybe, per reporter, exposing a counter per each supported remedy is not that bad? > > The major case where escalations would be immediately useful would be > cases where normal remediation failed therefore we need a hard reset. > But in those cases I think having the health reporter in a failed state > should be a sufficient indication to automation to take a machine out > of service and power cycle it. > >>> +enum devlink_health_reporter_remedy { >>> + DLH_REMEDY_NONE = 1, >>> + DLH_REMEDY_COMP_RESET, >>> + DLH_REMEDY_RESET, >>> + DLH_REMEDY_REINIT, >>> + DLH_REMEDY_POWER_CYCLE, >>> + DLH_REMEDY_REIMAGE, >> In general, I don't expect a reported to perform POWER_CYCLE or REIMAGE >> as part of the recovery. > > Right, these are extending the use of health reporters beyond what can > be remediated automatically. > >>> + DLH_REMEDY_BAD_PART, >> BAD_PART probably indicates that the reporter (or any command line >> execution) cannot recover the issue. >> As the suggested remedy is static per reporter's recover method, it >> doesn't make sense for one to set a recover method that by design cannot >> recover successfully. >> >> Maybe we should extend devlink_health_reporter_state with POWER_CYCLE, >> REIMAGE and BAD_PART? To indicate the user that for a successful >> recovery, it should run a non-devlink-health operation? > > Hm, export and extend devlink_health_reporter_state? I like that idea. > >>> diff --git a/net/core/devlink.c b/net/core/devlink.c >>> index 737b61c2976e..73eb665070b9 100644 >>> --- a/net/core/devlink.c >>> +++ b/net/core/devlink.c >>> @@ -6095,7 +6095,8 @@ __devlink_health_reporter_create(struct devlink *devlink, >>> { >>> struct devlink_health_reporter *reporter; >>> >>> - if (WARN_ON(graceful_period && !ops->recover)) >>> + if (WARN_ON(graceful_period && !ops->recover) || >>> + WARN_ON(!ops->remedy)) >> Here you fail every reported that doesn't have remedy set, not only the >> ones with recovery callback > > Right, DLH_REMEDY_NONE doesn't require a callback. Some health > issues can be remedied by the device e.g. ECC errors or something > along those lines. Obviously non-RFC patch will have to come with > driver changes. > >>> return ERR_PTR(-EINVAL); >>> >>> reporter = kzalloc(sizeof(*reporter), GFP_KERNEL); >>> @@ -6263,7 +6264,9 @@ devlink_nl_health_reporter_fill(struct sk_buff *msg, >>> if (!reporter_attr) >>> goto genlmsg_cancel; >>> if (nla_put_string(msg, DEVLINK_ATTR_HEALTH_REPORTER_NAME, >>> - reporter->ops->name)) >>> + reporter->ops->name) || >>> + nla_put_u32(msg, DEVLINK_ATTR_HEALTH_REPORTER_REMEDY, >>> + reporter->ops->remedy)) >> Why not new if clause like all other attributes later. > > Eh. > Currently, each nla_put_* is under its own if clause. >>> goto reporter_nest_cancel; >>> if (nla_put_u8(msg, DEVLINK_ATTR_HEALTH_REPORTER_STATE, >>> reporter->health_state)) >>> >
On Tue, 9 Mar 2021 16:18:58 +0200 Eran Ben Elisha wrote: > >> DLH_REMEDY_LOCAL_FIX: associated component will undergo a local > >> un-harmful fix attempt. > >> (e.g look for lost interrupt in mlx5e_tx_reporter_timeout_recover()) > > > > Should we make it more specific? Maybe DLH_REMEDY_STALL: device stall > > detected, resumed by re-trigerring processing, without reset? > > Sounds good. FWIW I ended up calling it: + * @DLH_REMEDY_KICK: device stalled, processing will be re-triggered > >> The assumption here is that a reporter's recovery function has one > >> remedy. But it can have few remedies and escalate between them. Did you > >> consider a bitmask? > > > > Yes, I tried to explain in the commit message. If we wanted to support > > escalating remediations we'd also need separate counters etc. I think > > having a health reporter per remediation should actually work fairly > > well. > > That would require reporter's recovery procedure failure to trigger > health flow for other reporter. > So we can find ourselves with 2 RX reporters, sharing the same diagnose > and dump callbacks, and each has other recovery flow. > Seems a bit counterintuitive. Let's talk about particular cases. Otherwise it's too easy to misunderstand each other. I can't think of any practical case where escalation makes sense. > Maybe, per reporter, exposing a counter per each supported remedy is not > that bad? It's a large change to the uAPI, and it makes vendors more likely to lump different problems under a single reporter (although I take your point that it may cause over-splitting, but if we have to choose between the two my preference is "too granular").
On Tue, 9 Mar 2021 16:06:49 +0200 Eran Ben Elisha wrote: > On 3/8/2021 7:59 PM, Jakub Kicinski wrote: > >> Hm, export and extend devlink_health_reporter_state? I like that idea. > > > > Trying to type it up it looks less pretty than expected. > > > > Let's looks at some examples. > > > > A queue reporter, say "rx", resets the queue dropping all outstanding > > buffers. As previously mentioned when the normal remediation fails user > > is expected to power cycle the machine or maybe swap the card. The > > device itself does not have a crystal ball. > > Not sure, reopen the queue, or reinit the driver might also be good in > case of issue in the SW/HW queue context for example. But I agree that > RX reporter can't tell from its perspective what further escalation is > needed in case its local defined operations failed. Right, the point being if normal remediation fails collect a full system dump and do the big hammer remediation (power cycle or reinit if user wants to try that). > > A management FW reporter "fw", has a auto recovery of FW reset > > (REMEDY_RESET). On failure -> power cycle. > > > > An "io" reporter (PCI link had to be trained down) can only return > > a hardware failure (we should probably have a HW failure other than > > BAD_PART for this). > > > > Flash reporters - the device will know if the flash had a bad block > > or the entire part is bad, so probably can have 2 reporters for this. > > > > Most of the reporters would only report one "action" that can be > > performed to fix them. The cartesian product of ->recovery types vs > > manual recovery does not seem necessary. And drivers would get bloated > > with additional boilerplate of returning ERROR_NEED_POWER_CYCLE for > > _all_ cases with ->recovery. Because what else would the fix be if > > software-initiated reset didn't work? > > OK, I see your point. > > If I got you right, this is the conclusions so far: > 1. Each reporter with recover callback will have to supply a remedy > definition. > 2. We shouldn't have POWER_CYCLE, REIMAGE and BAD_PART as a remedy, > because these are not valid reporter recover flows in any case. > 3. If a reporter will fail to recover, its status shall remain as error, > and it is out of the reporter's scope to advise the administrator on > further actions. I was actually intending to go back to the original proposal, mostly as is (plus he KICK). Indeed the intent is that if local remediation fails or is unavailable and reporter is in failed state - power cycle or other manual intervention is needed. So we can drop the POWER_CYCLE remedy and leave it implicit. But how are you suggesting we handle BAD_PART and REIMAGE? Still extending the health status or a separate mechanism than dl-health?
On 3/9/2021 2:52 PM, Jakub Kicinski wrote: > On Tue, 9 Mar 2021 16:18:58 +0200 Eran Ben Elisha wrote: >>>> DLH_REMEDY_LOCAL_FIX: associated component will undergo a local >>>> un-harmful fix attempt. >>>> (e.g look for lost interrupt in mlx5e_tx_reporter_timeout_recover()) >>> >>> Should we make it more specific? Maybe DLH_REMEDY_STALL: device stall >>> detected, resumed by re-trigerring processing, without reset? >> >> Sounds good. > > FWIW I ended up calling it: > > + * @DLH_REMEDY_KICK: device stalled, processing will be re-triggered > >>>> The assumption here is that a reporter's recovery function has one >>>> remedy. But it can have few remedies and escalate between them. Did you >>>> consider a bitmask? >>> >>> Yes, I tried to explain in the commit message. If we wanted to support >>> escalating remediations we'd also need separate counters etc. I think >>> having a health reporter per remediation should actually work fairly >>> well. >> >> That would require reporter's recovery procedure failure to trigger >> health flow for other reporter. >> So we can find ourselves with 2 RX reporters, sharing the same diagnose >> and dump callbacks, and each has other recovery flow. >> Seems a bit counterintuitive. > > Let's talk about particular cases. Otherwise it's too easy to > misunderstand each other. I can't think of any practical case > where escalation makes sense. > >> Maybe, per reporter, exposing a counter per each supported remedy is not >> that bad? > > It's a large change to the uAPI, and it makes vendors more likely > to lump different problems under a single reporter (although I take > your point that it may cause over-splitting, but if we have to choose > between the two my preference is "too granular"). > I also prefer keeping it more granular and forcing only a single "remedy" per reporter. If that remedy fails, I liked the thought of possibly having some way to indicate possible "hammer" remedies as some sort of extended status. i.e. our reporter can try one known to be effective remedy automatically, and then if it fails it could somehow report an extended status that indicates "we still failed to recover, and we think the problem might be fixed with RELOAD/REBOOT/REIMAGE" But I would keep those 3 larger remedies that require user intervention out of the set of regular remedies, and more as some other way to indicate they might help? I really don't think escalation makes a lot of sense because it's far more complicated and as an administrator I am not sure I want a remedy which could have larger impacts like resetting the device if that could cause other issues... Thanks, Jake
diff --git a/include/net/devlink.h b/include/net/devlink.h index 853420db5d32..70b5fd6a8c0d 100644 --- a/include/net/devlink.h +++ b/include/net/devlink.h @@ -664,6 +664,7 @@ enum devlink_health_reporter_state { /** * struct devlink_health_reporter_ops - Reporter operations * @name: reporter name + * remedy: severity of the remediation required * @recover: callback to recover from reported error * if priv_ctx is NULL, run a full recover * @dump: callback to dump an object @@ -674,6 +675,7 @@ enum devlink_health_reporter_state { struct devlink_health_reporter_ops { char *name; + enum devlink_health_reporter_remedy remedy; int (*recover)(struct devlink_health_reporter *reporter, void *priv_ctx, struct netlink_ext_ack *extack); int (*dump)(struct devlink_health_reporter *reporter, diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h index f6008b2fa60f..bd490c5536b1 100644 --- a/include/uapi/linux/devlink.h +++ b/include/uapi/linux/devlink.h @@ -534,6 +534,9 @@ enum devlink_attr { DEVLINK_ATTR_RELOAD_ACTION_STATS, /* nested */ DEVLINK_ATTR_PORT_PCI_SF_NUMBER, /* u32 */ + + DEVLINK_ATTR_HEALTH_REPORTER_REMEDY, /* u32 */ + /* add new attributes above here, update the policy in devlink.c */ __DEVLINK_ATTR_MAX, @@ -608,4 +611,31 @@ enum devlink_port_fn_opstate { DEVLINK_PORT_FN_OPSTATE_ATTACHED, }; +/** + * enum devlink_health_reporter_remedy - severity of remediation procedure + * @DLH_REMEDY_NONE: transient error, no remediation required + * @DLH_REMEDY_COMP_RESET: associated device component (e.g. device queue) + * will be reset + * @DLH_REMEDY_RESET: full device reset, will result in temporary unavailability + * of the device, device configuration should not be lost + * @DLH_REMEDY_REINIT: device will be reinitialized and configuration lost + * @DLH_REMEDY_POWER_CYCLE: device requires a power cycle to recover + * @DLH_REMEDY_REIMAGE: device needs to be reflashed + * @DLH_REMEDY_BAD_PART: indication of failing hardware, device needs to be + * replaced + * + * Used in %DEVLINK_ATTR_HEALTH_REPORTER_REMEDY, categorizes the health reporter + * by the severity of the required remediation, and indicates the remediation + * type to the user if it can't be applied automatically (e.g. "reimage"). + */ +enum devlink_health_reporter_remedy { + DLH_REMEDY_NONE = 1, + DLH_REMEDY_COMP_RESET, + DLH_REMEDY_RESET, + DLH_REMEDY_REINIT, + DLH_REMEDY_POWER_CYCLE, + DLH_REMEDY_REIMAGE, + DLH_REMEDY_BAD_PART, +}; + #endif /* _UAPI_LINUX_DEVLINK_H_ */ diff --git a/net/core/devlink.c b/net/core/devlink.c index 737b61c2976e..73eb665070b9 100644 --- a/net/core/devlink.c +++ b/net/core/devlink.c @@ -6095,7 +6095,8 @@ __devlink_health_reporter_create(struct devlink *devlink, { struct devlink_health_reporter *reporter; - if (WARN_ON(graceful_period && !ops->recover)) + if (WARN_ON(graceful_period && !ops->recover) || + WARN_ON(!ops->remedy)) return ERR_PTR(-EINVAL); reporter = kzalloc(sizeof(*reporter), GFP_KERNEL); @@ -6263,7 +6264,9 @@ devlink_nl_health_reporter_fill(struct sk_buff *msg, if (!reporter_attr) goto genlmsg_cancel; if (nla_put_string(msg, DEVLINK_ATTR_HEALTH_REPORTER_NAME, - reporter->ops->name)) + reporter->ops->name) || + nla_put_u32(msg, DEVLINK_ATTR_HEALTH_REPORTER_REMEDY, + reporter->ops->remedy)) goto reporter_nest_cancel; if (nla_put_u8(msg, DEVLINK_ATTR_HEALTH_REPORTER_STATE, reporter->health_state))
Currently devlink health does not give user any clear information of what kind of remediation ->recover callback will perform. This makes it difficult to understand the impact of enabling auto- -remediation, and the severity of the error itself. To allow users to make more informed decision, as well as stretch the applicability of devlink health beyond what can be fixed by resetting things - add a new remediation type attribute. Note that we only allow one remediation type per reporter, this is intentional. devlink health is not built for mixing issues of different severity into one reporter since it only maintains one dump, of the first event and a single error counter. Nudging vendors towards categorizing issues beyond coarse groups is an added bonus. Signed-off-by: Jakub Kicinski <kuba@kernel.org> --- include/net/devlink.h | 2 ++ include/uapi/linux/devlink.h | 30 ++++++++++++++++++++++++++++++ net/core/devlink.c | 7 +++++-- 3 files changed, 37 insertions(+), 2 deletions(-)