diff mbox

[RFC] Report the Health Status Detail for the HPE1 DSM family

Message ID 1490989421-20872-1-git-send-email-linda.knippers@hpe.com (mailing list archive)
State New, archived
Headers show

Commit Message

Linda Knippers March 31, 2017, 7:43 p.m. UTC
Dan, 
This is an RFC because I'd like some initial feedback on the
approach.  I think this is what you had in mind from your last
exchanges with Brian but I wanted to check a few things before
going too far.

1) Do we want to export a library function for what could be a long
list of DSM-family-specific health information?  I think there
could be some common information between the HPE1 and MSFT DSM
but much will not be common.
2) If we do export the functions, would we need to also export
the ndctl-hpe1.h include file or consolidate the information into
an already exported file?
3) Do you want json-smart.c to keep growing or should new smart
functions provide their own matching json functions?
4) The code in json-smart.c with a macro was a quick prototype
but if you have feedback on the json parts, that would be appreciated.
Right now the detail is reported as a string if all is well and an 
array if there are errors.  I'm not sure about that or whether 
the strings should have spaces.  

Anyway, here's the patch ...

This patch adds a new interface to provide Health Status Detail.
This field is reported as part of the Smart Health with the HPE1
DSM family so the function for the Intel family is NULL.  If
the field is available, the ndctl --health option will decode
the bits that make up the field.

On a healthy device, the output would look something like:
{
    "dev":"nmem0",
    "id":"802c-01-1521-b300bdbc",
    "health":{
      "health_state":"ok",
      "temperature_celsius":25.000000,
      "spares_percentage":99,
      "alarm_temperature":false,
      "alarm_spares":false,
      "temperature_threshold":50.000000,
      "spares_threshold":20,
      "life_used_percentage":2,
      "shutdown_state":"clean",
      "health_status_detail":"ok"
    }
  }

A device with every possible error could look like this:

 {
    "dev":"nmem0",
    "id":"802c-01-1521-b300bdbc",
    "health":{
      "health_state":"ok",
      "temperature_celsius":25.000000,
      "spares_percentage":99,
      "alarm_temperature":false,
      "alarm_spares":false,
      "temperature_threshold":50.000000,
      "spares_threshold":20,
      "life_used_percentage":2,
      "shutdown_state":"clean",
      "health_status_detail":[
        "energy source error",
        "controller error",
        "UC ECC error",
        "CE trip",
        "save error",
        "restore error",
        "arm error",
        "erase error",
        "configuration error",
        "firmware error",
        "vendor specific error"
      ]
    }
  }

---
 ndctl/lib/libndctl-hpe1.c    | 12 ++++++++++++
 ndctl/lib/libndctl-private.h |  1 +
 ndctl/lib/libndctl-smart.c   |  2 ++
 ndctl/lib/libndctl.sym       |  1 +
 ndctl/libndctl.h.in          |  5 +++++
 ndctl/ndctl.h                |  1 +
 ndctl/util/json-smart.c      | 46 ++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 68 insertions(+)

Comments

Dan Williams April 5, 2017, 8:53 p.m. UTC | #1
On Fri, Mar 31, 2017 at 12:43 PM, Linda Knippers <linda.knippers@hpe.com> wrote:
> Dan,
> This is an RFC because I'd like some initial feedback on the
> approach.  I think this is what you had in mind from your last
> exchanges with Brian but I wanted to check a few things before
> going too far.
>
> 1) Do we want to export a library function for what could be a long
> list of DSM-family-specific health information?  I think there
> could be some common information between the HPE1 and MSFT DSM
> but much will not be common.

Even for non-common information I want libndctl to hide the details
about which DIMM families support which fields. Just extend the
validity flags concept to make unsupported / family specific health
data appear as just an invalid field to the library interface if the
DIMM does not support it.

> 2) If we do export the functions, would we need to also export
> the ndctl-hpe1.h include file or consolidate the information into
> an already exported file?

I want to avoid libndctl ever exporting vendor-specific details. There
may be json keys that only show up on one vendor's DIMMs, but that
should be indistinguishable from a SMART payload that happens to clear
a validity bit.

> 3) Do you want json-smart.c to keep growing or should new smart
> functions provide their own matching json functions?

json-smart.c should understand the superset of all possible health
fields, i.e. the union of all vendor fields that can possibly be
returned. If this ever gets unwieldy it means we need to redouble
standardization efforts.

> 4) The code in json-smart.c with a macro was a quick prototype
> but if you have feedback on the json parts, that would be appreciated.
> Right now the detail is reported as a string if all is well and an
> array if there are errors.  I'm not sure about that or whether
> the strings should have spaces.

I think we should have all fields searchable as json keys in lowercase
with underscores. Lets handle these flags similar to the existing
boolean alarm_temperature and alarm_spares flags, but with a small
change. Since these flags are reporting some dimm-type-specific
details I think we should only add them to the json when they are
true. This way we won't have confusing entries like a
'"energy_source_error":false' line for non-energy-source-backed dimms.
I'll go back and fix up 'alarm_spares' and 'alarm_temperature' to be
hidden on DIMMs where they are always false.

Hmm, it seems like we need a "supported" vs "valid" distinction. Where
supported but invalid shows flags in the 'false' state, but
"unsupported" hides flags that are always going to be 'false'.
Linda Knippers April 11, 2017, 8:45 p.m. UTC | #2
On 04/05/2017 04:53 PM, Dan Williams wrote:
> On Fri, Mar 31, 2017 at 12:43 PM, Linda Knippers <linda.knippers@hpe.com> wrote:
>> Dan,
>> This is an RFC because I'd like some initial feedback on the
>> approach.  I think this is what you had in mind from your last
>> exchanges with Brian but I wanted to check a few things before
>> going too far.
>>
>> 1) Do we want to export a library function for what could be a long
>> list of DSM-family-specific health information?  I think there
>> could be some common information between the HPE1 and MSFT DSM
>> but much will not be common.
> 
> Even for non-common information I want libndctl to hide the details
> about which DIMM families support which fields. Just extend the
> validity flags concept to make unsupported / family specific health
> data appear as just an invalid field to the library interface if the
> DIMM does not support it.
> 
>> 2) If we do export the functions, would we need to also export
>> the ndctl-hpe1.h include file or consolidate the information into
>> an already exported file?
> 
> I want to avoid libndctl ever exporting vendor-specific details. There
> may be json keys that only show up on one vendor's DIMMs, but that
> should be indistinguishable from a SMART payload that happens to clear
> a validity bit.
> 
>> 3) Do you want json-smart.c to keep growing or should new smart
>> functions provide their own matching json functions?
> 
> json-smart.c should understand the superset of all possible health
> fields, i.e. the union of all vendor fields that can possibly be
> returned. If this ever gets unwieldy it means we need to redouble
> standardization efforts.

Ok.  For the above 3, I think that's what my current patch does.
> 
>> 4) The code in json-smart.c with a macro was a quick prototype
>> but if you have feedback on the json parts, that would be appreciated.
>> Right now the detail is reported as a string if all is well and an
>> array if there are errors.  I'm not sure about that or whether
>> the strings should have spaces.
> 
> I think we should have all fields searchable as json keys in lowercase
> with underscores. 

Ok, will fix that.

> Lets handle these flags similar to the existing
> boolean alarm_temperature and alarm_spares flags, but with a small
> change. Since these flags are reporting some dimm-type-specific
> details I think we should only add them to the json when they are
> true. This way we won't have confusing entries like a
> '"energy_source_error":false' line for non-energy-source-backed dimms.

My current patch won't display an '*_error: false' for the field I added.
You'll either get "ok" (which means that the field is valid and no error bits
are set, like the health_state field) or you'll get just the specific errors.
If the field isn't valid at all, either because it's not supported by the NVDIMM
or because it wasn't marked as valid in the DSM output, nothing will added to
the json.

Going back to the example from my commit message (fixed the underscores), the
json for an NVDIMM for which the field isn't supported or isn't valid would be
unchanged.  For an NVDIMM that reports it and no bits are are,
this would be added:

      "health_status_detail":"ok"

For an NVDIMM that reports it and some bits are set, you'd get a json
array element for each bit that is set.  The worst case with all error
bits set is this:

      "health_status_detail":[
        "energy_source_error",
        "controller_error",
        "UC_ECC_error",
        "CE_trip",
        "save_error",
        "restore_error",
        "arm_error",
        "erase_error",
        "configuration_error",
        "firmware_error",
        "vendor_specific_error"
      ]

The more likely case is that there would just be one error, like
this:

      "health_status_detail":[
        "energy_source_error",
      ]

That could be "energy_source: error" or "energy_source_error: true".
I was trying to minimize the output on healthy systems and not add information
for non-errors. I wouldn't necessarily want to see a long list of
"energy_source_error: false".

> I'll go back and fix up 'alarm_spares' and 'alarm_temperature' to be
> hidden on DIMMs where they are always false.
> 
> Hmm, it seems like we need a "supported" vs "valid" distinction. Where
> supported but invalid shows flags in the 'false' state, but
> "unsupported" hides flags that are always going to be 'false'.

There are reasons that a field might be supported by the DSM family
but not valid for this particular device and it's not necessarily
an error.  At least for the fields I know about, unsupported and invalid
would make sense to treat the same way.  If something truly is broken,
the DSM would likely fail.

-- ljk
diff mbox

Patch

diff --git a/ndctl/lib/libndctl-hpe1.c b/ndctl/lib/libndctl-hpe1.c
index ec54252..23b76a4 100644
--- a/ndctl/lib/libndctl-hpe1.c
+++ b/ndctl/lib/libndctl-hpe1.c
@@ -63,6 +63,7 @@  static struct ndctl_cmd *hpe1_dimm_cmd_new_smart(struct ndctl_dimm *dimm)
 	hpe1->u.smart.in_valid_flags |= NDN_HPE1_SMART_USED_VALID;
 	hpe1->u.smart.in_valid_flags |= NDN_HPE1_SMART_SHUTDOWN_VALID;
 	hpe1->u.smart.in_valid_flags |= NDN_HPE1_SMART_VENDOR_VALID;
+	hpe1->u.smart.in_valid_flags |= NDN_HPE1_SMART_DETAIL_VALID;
 
 	cmd->firmware_status = &hpe1->u.smart.status;
 
@@ -104,6 +105,8 @@  static unsigned int hpe1_cmd_smart_get_flags(struct ndctl_cmd *cmd)
 		flags |= ND_SMART_SHUTDOWN_VALID;
 	if (hpe1flags & NDN_HPE1_SMART_VENDOR_VALID)
 		flags |= ND_SMART_VENDOR_VALID;
+	if (hpe1flags & NDN_HPE1_SMART_DETAIL_VALID)
+		flags |= ND_SMART_DETAIL_VALID;
 
 	return flags;
 }
@@ -282,6 +285,14 @@  static unsigned int hpe1_cmd_smart_threshold_get_spares(struct ndctl_cmd *cmd)
 
 	return CMD_HPE1_SMART_THRESH(cmd)->spare_block_threshold;
 }
+static unsigned int hpe1_cmd_smart_get_detail(struct ndctl_cmd *cmd)
+{
+	if (hpe1_smart_valid(cmd) < 0)
+		return UINT_MAX;
+
+	return CMD_HPE1_SMART(cmd)->mod_hlth_stat;
+}
+
 
 struct ndctl_smart_ops * const hpe1_smart_ops = &(struct ndctl_smart_ops) {
 	.new_smart = hpe1_dimm_cmd_new_smart,
@@ -298,4 +309,5 @@  struct ndctl_smart_ops * const hpe1_smart_ops = &(struct ndctl_smart_ops) {
 	.smart_threshold_get_alarm_control = hpe1_cmd_smart_threshold_get_alarm_control,
 	.smart_threshold_get_temperature = hpe1_cmd_smart_threshold_get_temperature,
 	.smart_threshold_get_spares = hpe1_cmd_smart_threshold_get_spares,
+	.smart_get_detail = hpe1_cmd_smart_get_detail,
 };
diff --git a/ndctl/lib/libndctl-private.h b/ndctl/lib/libndctl-private.h
index 3e67db0..e379e7d 100644
--- a/ndctl/lib/libndctl-private.h
+++ b/ndctl/lib/libndctl-private.h
@@ -221,6 +221,7 @@  struct ndctl_smart_ops {
 	unsigned int (*smart_threshold_get_alarm_control)(struct ndctl_cmd *);
 	unsigned int (*smart_threshold_get_temperature)(struct ndctl_cmd *);
 	unsigned int (*smart_threshold_get_spares)(struct ndctl_cmd *);
+	unsigned int (*smart_get_detail)(struct ndctl_cmd *);
 };
 
 #if HAS_SMART == 1
diff --git a/ndctl/lib/libndctl-smart.c b/ndctl/lib/libndctl-smart.c
index 73a49ef..890fa47 100644
--- a/ndctl/lib/libndctl-smart.c
+++ b/ndctl/lib/libndctl-smart.c
@@ -63,6 +63,7 @@  smart_cmd_op(ndctl_cmd_smart_get_vendor_data, smart_get_vendor_data, unsigned ch
 smart_cmd_op(ndctl_cmd_smart_threshold_get_alarm_control, smart_threshold_get_alarm_control, unsigned int, 0)
 smart_cmd_op(ndctl_cmd_smart_threshold_get_temperature, smart_threshold_get_temperature, unsigned int, 0)
 smart_cmd_op(ndctl_cmd_smart_threshold_get_spares, smart_threshold_get_spares, unsigned int, 0)
+smart_cmd_op(ndctl_cmd_smart_get_detail, smart_get_detail, unsigned int, 0)
 
 /*
  * The following intel_dimm_*() and intel_smart_*() functions implement
@@ -202,4 +203,5 @@  struct ndctl_smart_ops * const intel_smart_ops = &(struct ndctl_smart_ops) {
 	.smart_threshold_get_alarm_control = intel_cmd_smart_threshold_get_alarm_control,
 	.smart_threshold_get_temperature = intel_cmd_smart_threshold_get_temperature,
 	.smart_threshold_get_spares = intel_cmd_smart_threshold_get_spares,
+	.smart_get_detail = NULL,
 };
diff --git a/ndctl/lib/libndctl.sym b/ndctl/lib/libndctl.sym
index be2e368..d3a55f4 100644
--- a/ndctl/lib/libndctl.sym
+++ b/ndctl/lib/libndctl.sym
@@ -110,6 +110,7 @@  global:
 	ndctl_cmd_smart_threshold_get_alarm_control;
 	ndctl_cmd_smart_threshold_get_temperature;
 	ndctl_cmd_smart_threshold_get_spares;
+	ndctl_cmd_smart_get_detail;
 	ndctl_dimm_zero_labels;
 	ndctl_dimm_get_available_labels;
 	ndctl_region_get_first;
diff --git a/ndctl/libndctl.h.in b/ndctl/libndctl.h.in
index c27581d..d215c48 100644
--- a/ndctl/libndctl.h.in
+++ b/ndctl/libndctl.h.in
@@ -280,6 +280,7 @@  struct ndctl_cmd *ndctl_dimm_cmd_new_smart_threshold(struct ndctl_dimm *dimm);
 unsigned int ndctl_cmd_smart_threshold_get_alarm_control(struct ndctl_cmd *cmd);
 unsigned int ndctl_cmd_smart_threshold_get_temperature(struct ndctl_cmd *cmd);
 unsigned int ndctl_cmd_smart_threshold_get_spares(struct ndctl_cmd *cmd);
+unsigned int ndctl_cmd_smart_get_detail(struct ndctl_cmd *cmd);
 #else
 static inline struct ndctl_cmd *ndctl_dimm_cmd_new_smart(struct ndctl_dimm *dimm)
 {
@@ -341,6 +342,10 @@  static inline unsigned int ndctl_cmd_smart_threshold_get_spares(
 {
 	return 0;
 }
+static inline unsigned int ndctl_cmd_smart_get_detail(struct ndctl_cmd *cmd)
+{
+	return 0;
+}
 #endif
 
 struct ndctl_cmd *ndctl_dimm_cmd_new_vendor_specific(struct ndctl_dimm *dimm,
diff --git a/ndctl/ndctl.h b/ndctl/ndctl.h
index 3b1d703..0bdf96f 100644
--- a/ndctl/ndctl.h
+++ b/ndctl/ndctl.h
@@ -28,6 +28,7 @@  struct nd_cmd_smart {
 #define ND_SMART_ALARM_VALID	(1 << 9)
 #define ND_SMART_SHUTDOWN_VALID	(1 << 10)
 #define ND_SMART_VENDOR_VALID	(1 << 11)
+#define ND_SMART_DETAIL_VALID	(1 << 13)
 #define ND_SMART_SPARE_TRIP	(1 << 0)
 #define ND_SMART_TEMP_TRIP	(1 << 1)
 #define ND_SMART_CTEMP_TRIP	(1 << 2)
diff --git a/ndctl/util/json-smart.c b/ndctl/util/json-smart.c
index 94519da..304a66a 100644
--- a/ndctl/util/json-smart.c
+++ b/ndctl/util/json-smart.c
@@ -10,6 +10,7 @@ 
 #else
 #include <ndctl.h>
 #endif
+#include "lib/ndctl-hpe1.h"
 
 static double parse_smart_temperature(unsigned int temp)
 {
@@ -151,6 +152,51 @@  struct json_object *util_dimm_health_to_json(struct ndctl_dimm *dimm)
 			json_object_object_add(jhealth, "shutdown_state", jobj);
 	}
 
+#define json_detail(jobj,jstring,detail,bit,string) \
+{ \
+	if (detail & bit) { \
+		jstring = json_object_new_string(string); \
+		if (jstring) \
+			json_object_array_add(jobj,jstring); \
+	} \
+}
+
+	if (flags & ND_SMART_DETAIL_VALID) {
+		unsigned int detail = ndctl_cmd_smart_get_detail(cmd);
+		if (detail) {
+			jobj = json_object_new_array();
+			json_object *jstring = NULL;
+
+			json_detail(jobj, jstring, detail,
+				NDN_HPE1_SMART_ES_FAILURE, "energy source error")
+			json_detail(jobj, jstring, detail,
+				NDN_HPE1_SMART_CTLR_FAILURE, "controller error")
+			json_detail(jobj, jstring, detail,
+				NDN_HPE1_SMART_UE_TRIP, "UC ECC error")
+			json_detail(jobj, jstring, detail,
+				NDN_HPE1_SMART_CE_TRIP, "CE trip")
+			json_detail(jobj, jstring, detail,
+				NDN_HPE1_SMART_SAVE_FAILED, "save error")
+			json_detail(jobj, jstring, detail,
+				NDN_HPE1_SMART_RESTORE_FAILED, "restore error")
+			json_detail(jobj, jstring, detail,
+				NDN_HPE1_SMART_ARM_FAILED, "arm error")
+			json_detail(jobj, jstring, detail,
+				NDN_HPE1_SMART_ERASE_FAILED, "erase error")
+			json_detail(jobj, jstring, detail,
+				NDN_HPE1_SMART_CONFIG_ERROR, "configuration error")
+			json_detail(jobj, jstring, detail,
+				NDN_HPE1_SMART_FW_ERROR, "firmware error")
+			json_detail(jobj, jstring, detail,
+				NDN_HPE1_SMART_VENDOR_ERROR, "vendor specific error")
+		}
+		else
+			jobj = json_object_new_string("ok");
+		if (jobj)
+			json_object_object_add(jhealth, "health_status_detail",
+				jobj);
+	}
+
 	ndctl_cmd_unref(cmd);
 	return jhealth;
  err: