Message ID | 20210311032613.1533100-3-kuba@kernel.org (mailing list archive) |
---|---|
State | RFC |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [RFC,net-next,v2,1/3] devlink: move health state to uAPI | expand |
Context | Check | Description |
---|---|---|
netdev/cover_letter | warning | Series does not have a cover letter |
netdev/fixes_present | success | Link |
netdev/patch_count | success | Link |
netdev/tree_selection | success | Clearly marked for net-next |
netdev/subject_prefix | success | Link |
netdev/cc_maintainers | warning | 3 maintainers not CCed: jiri@nvidia.com linux-api@vger.kernel.org davem@davemloft.net |
netdev/source_inline | success | Was 0 now: 0 |
netdev/verify_signedoff | success | Link |
netdev/module_param | success | Was 0 now: 0 |
netdev/build_32bit | success | Errors and warnings before: 496 this patch: 496 |
netdev/kdoc | success | Errors and warnings before: 1 this patch: 1 |
netdev/verify_fixes | success | Link |
netdev/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 26 lines checked |
netdev/build_allmodconfig_warn | success | Errors and warnings before: 688 this patch: 688 |
netdev/header_inline | success | Link |
On 3/11/2021 5:26 AM, Jakub Kicinski wrote: >>> Pending vendors adding the right reporters. << Would you like Nvidia to reply with the remedy per reporter or to actually prepare the patch? > > Extend the applicability of devlink health reporters > beyond what can be locally remedied. Add failure modes > which require re-flashing the NVM image or HW changes. > > The expectation is that driver will call > devlink_health_reporter_state_update() to put hardware > health reporters into bad state. > > Signed-off-by: Jakub Kicinski <kuba@kernel.org> > --- > include/uapi/linux/devlink.h | 7 +++++++ > net/core/devlink.c | 3 +-- > 2 files changed, 8 insertions(+), 2 deletions(-) > > diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h > index 8cd1508b525b..f623bbc63489 100644 > --- a/include/uapi/linux/devlink.h > +++ b/include/uapi/linux/devlink.h > @@ -617,10 +617,17 @@ enum devlink_port_fn_opstate { > * @DL_HEALTH_STATE_ERROR: error state, running health reporter's recovery > * may fix the issue, otherwise user needs to try > * power cycling or other forms of reset > + * @DL_HEALTH_STATE_BAD_IMAGE: device's non-volatile memory needs > + * to be re-written, usually due to block corruption > + * @DL_HEALTH_STATE_BAD_HW: hardware errors detected, device, host > + * or the connection between the two may be at fault > */ > enum devlink_health_state { > DL_HEALTH_STATE_HEALTHY, > DL_HEALTH_STATE_ERROR, > + > + DL_HEALTH_STATE_BAD_IMAGE, > + DL_HEALTH_STATE_BAD_HW, > }; > > /** > diff --git a/net/core/devlink.c b/net/core/devlink.c > index 09d77d43ff63..4a9fa6288a4a 100644 > --- a/net/core/devlink.c > +++ b/net/core/devlink.c > @@ -6527,8 +6527,7 @@ void > devlink_health_reporter_state_update(struct devlink_health_reporter *reporter, > enum devlink_health_state state) > { > - if (WARN_ON(state != DL_HEALTH_STATE_HEALTHY && > - state != DL_HEALTH_STATE_ERROR)) > + if (WARN_ON(state > DL_HEALTH_STATE_BAD_HW)) > return; > > if (reporter->health_state == state) > devlink_health_reporter_recover() requires an update as well. something like: @@ -6346,8 +6346,15 @@ devlink_health_reporter_recover(struct devlink_health_reporter *reporter, { int err; - if (reporter->health_state == DL_HEALTH_STATE_HEALTHY) + switch (reporter->health_state) { + case DL_HEALTH_STATE_HEALTHY: return 0; + case DL_HEALTH_STATE_ERROR: + break; + case DL_HEALTH_STATE_BAD_IMAGE: + case DL_HEALTH_STATE_BAD_HW: + return -EOPNOTSUPP; + } if (!reporter->ops->recover) return -EOPNOTSUPP;
On Thu, 11 Mar 2021 16:23:09 +0200 Eran Ben Elisha wrote: > On 3/11/2021 5:26 AM, Jakub Kicinski wrote: > >>> Pending vendors adding the right reporters. << > > Would you like Nvidia to reply with the remedy per reporter or to > actually prepare the patch? You mean the patch adding .remedy? If you can that'd be helpful. Or do you have HW error reporters to add? > > Extend the applicability of devlink health reporters > > beyond what can be locally remedied. Add failure modes > > which require re-flashing the NVM image or HW changes. > > > > The expectation is that driver will call > > devlink_health_reporter_state_update() to put hardware > > health reporters into bad state. > > > > Signed-off-by: Jakub Kicinski <kuba@kernel.org> > > --- > > include/uapi/linux/devlink.h | 7 +++++++ > > net/core/devlink.c | 3 +-- > > 2 files changed, 8 insertions(+), 2 deletions(-) > > > > diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h > > index 8cd1508b525b..f623bbc63489 100644 > > --- a/include/uapi/linux/devlink.h > > +++ b/include/uapi/linux/devlink.h > > @@ -617,10 +617,17 @@ enum devlink_port_fn_opstate { > > * @DL_HEALTH_STATE_ERROR: error state, running health reporter's recovery > > * may fix the issue, otherwise user needs to try > > * power cycling or other forms of reset > > + * @DL_HEALTH_STATE_BAD_IMAGE: device's non-volatile memory needs > > + * to be re-written, usually due to block corruption > > + * @DL_HEALTH_STATE_BAD_HW: hardware errors detected, device, host > > + * or the connection between the two may be at fault > > */ > > enum devlink_health_state { > > DL_HEALTH_STATE_HEALTHY, > > DL_HEALTH_STATE_ERROR, > > + > > + DL_HEALTH_STATE_BAD_IMAGE, > > + DL_HEALTH_STATE_BAD_HW, > > }; > > > > /** > > diff --git a/net/core/devlink.c b/net/core/devlink.c > > index 09d77d43ff63..4a9fa6288a4a 100644 > > --- a/net/core/devlink.c > > +++ b/net/core/devlink.c > > @@ -6527,8 +6527,7 @@ void > > devlink_health_reporter_state_update(struct devlink_health_reporter *reporter, > > enum devlink_health_state state) > > { > > - if (WARN_ON(state != DL_HEALTH_STATE_HEALTHY && > > - state != DL_HEALTH_STATE_ERROR)) > > + if (WARN_ON(state > DL_HEALTH_STATE_BAD_HW)) > > return; > > > > if (reporter->health_state == state) > > > > devlink_health_reporter_recover() requires an update as well. > something like: > > @@ -6346,8 +6346,15 @@ devlink_health_reporter_recover(struct > devlink_health_reporter *reporter, > { > int err; > > - if (reporter->health_state == DL_HEALTH_STATE_HEALTHY) > + switch (reporter->health_state) { > + case DL_HEALTH_STATE_HEALTHY: > return 0; > + case DL_HEALTH_STATE_ERROR: > + break; > + case DL_HEALTH_STATE_BAD_IMAGE: > + case DL_HEALTH_STATE_BAD_HW: > + return -EOPNOTSUPP; > + } > > if (!reporter->ops->recover) > return -EOPNOTSUPP; > Thanks!
On 3/11/2021 6:49 PM, Jakub Kicinski wrote: > On Thu, 11 Mar 2021 16:23:09 +0200 Eran Ben Elisha wrote: >> On 3/11/2021 5:26 AM, Jakub Kicinski wrote: >>>>> Pending vendors adding the right reporters. << >> Would you like Nvidia to reply with the remedy per reporter or to >> actually prepare the patch? > You mean the patch adding .remedy? If you can that'd be helpful. > > Or do you have HW error reporters to add? > I meant a patch to add .remedy to existing mlx5* reporters to be part of your series.
On Sun, 14 Mar 2021 14:33:10 +0200 Eran Ben Elisha wrote: > On 3/11/2021 6:49 PM, Jakub Kicinski wrote: > > On Thu, 11 Mar 2021 16:23:09 +0200 Eran Ben Elisha wrote: > >> Would you like Nvidia to reply with the remedy per reporter or to > >> actually prepare the patch? > > You mean the patch adding .remedy? If you can that'd be helpful. > > > > Or do you have HW error reporters to add? > > I meant a patch to add .remedy to existing mlx5* reporters to be part of > your series. After talking some more with the HW health team the series appears less necessary than I thought. I'm putting it on hold for now, sorry.
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h index 8cd1508b525b..f623bbc63489 100644 --- a/include/uapi/linux/devlink.h +++ b/include/uapi/linux/devlink.h @@ -617,10 +617,17 @@ enum devlink_port_fn_opstate { * @DL_HEALTH_STATE_ERROR: error state, running health reporter's recovery * may fix the issue, otherwise user needs to try * power cycling or other forms of reset + * @DL_HEALTH_STATE_BAD_IMAGE: device's non-volatile memory needs + * to be re-written, usually due to block corruption + * @DL_HEALTH_STATE_BAD_HW: hardware errors detected, device, host + * or the connection between the two may be at fault */ enum devlink_health_state { DL_HEALTH_STATE_HEALTHY, DL_HEALTH_STATE_ERROR, + + DL_HEALTH_STATE_BAD_IMAGE, + DL_HEALTH_STATE_BAD_HW, }; /** diff --git a/net/core/devlink.c b/net/core/devlink.c index 09d77d43ff63..4a9fa6288a4a 100644 --- a/net/core/devlink.c +++ b/net/core/devlink.c @@ -6527,8 +6527,7 @@ void devlink_health_reporter_state_update(struct devlink_health_reporter *reporter, enum devlink_health_state state) { - if (WARN_ON(state != DL_HEALTH_STATE_HEALTHY && - state != DL_HEALTH_STATE_ERROR)) + if (WARN_ON(state > DL_HEALTH_STATE_BAD_HW)) return; if (reporter->health_state == state)