diff mbox series

[ndctl,v5,4/5] cxl/list: add --poison option to cxl list

Message ID 216ab396ab0c34fc391d1c3d3797a0d832a8d563.1700615159.git.alison.schofield@intel.com
State Superseded
Commit f996c3806c8daabad889e9565da0f4e87628ad2e
Headers show
Series Support poison list retrieval | expand

Commit Message

Alison Schofield Nov. 22, 2023, 1:22 a.m. UTC
From: Alison Schofield <alison.schofield@intel.com>

The --poison option to 'cxl list' retrieves poison lists from
memory devices supporting the capability and displays the
returned poison records in the cxl list json. This option can
apply to memdevs or regions.

Example usage in the Documentation/cxl/cxl-list.txt update.

Signed-off-by: Alison Schofield <alison.schofield@intel.com>
---
 Documentation/cxl/cxl-list.txt | 58 ++++++++++++++++++++++++++++++++++
 cxl/filter.h                   |  3 ++
 cxl/list.c                     |  2 ++
 3 files changed, 63 insertions(+)

Comments

Dan Williams Dec. 7, 2023, 4:23 a.m. UTC | #1
alison.schofield@ wrote:
> From: Alison Schofield <alison.schofield@intel.com>
> 
> The --poison option to 'cxl list' retrieves poison lists from
> memory devices supporting the capability and displays the
> returned poison records in the cxl list json. This option can
> apply to memdevs or regions.
> 
> Example usage in the Documentation/cxl/cxl-list.txt update.
> 
> Signed-off-by: Alison Schofield <alison.schofield@intel.com>
> ---
>  Documentation/cxl/cxl-list.txt | 58 ++++++++++++++++++++++++++++++++++
>  cxl/filter.h                   |  3 ++
>  cxl/list.c                     |  2 ++
>  3 files changed, 63 insertions(+)
> 
> diff --git a/Documentation/cxl/cxl-list.txt b/Documentation/cxl/cxl-list.txt
> index 838de4086678..ee2f1b2d9fae 100644
> --- a/Documentation/cxl/cxl-list.txt
> +++ b/Documentation/cxl/cxl-list.txt
> @@ -415,6 +415,64 @@ OPTIONS
>  --region::
>  	Specify CXL region device name(s), or device id(s), to filter the listing.
>  
> +-L::
> +--poison::
> +	Include poison information. The poison list is retrieved from the
> +	device(s) and poison records are added to the listing. Apply this
> +	option to memdevs and regions where devices support the poison
> +	list capability.

While CXL calls it "poison" I am not convinced that's the term that end
users can universally use for this. This is why "ndctl list" uses -M,
but yeah, -M and -P are already taken. Even -E is taken for "errors".

> +
> +----
> +# cxl list -m mem11 --poison
> +[
> +  {
> +    "memdev":"mem11",
> +    "pmem_size":268435456,
> +    "ram_size":0,
> +    "serial":0,
> +    "host":"0000:37:00.0",
> +    "poison":{
> +      "nr_records":1,
> +      "records":[

One cleanup I want to see before this goes live... drop nr_records and
just make "poison" an array object directly. The number of records is
trivially determined by the jq "len" operator.

Also, per above rename "poison" to "media_errors". I believe "poison" is
an x86'ism where "media_error" is a more generic term.

> +        {
> +          "dpa":0,
> +          "dpa_length":64,
> +          "source":"Internal",
> +        }
> +      ]
> +    }
> +  }
> +]
> +# cxl list -r region5 --poison
> +[
> +  {
> +    "region":"region5",
> +    "resource":1035623989248,
> +    "size":2147483648,
> +    "interleave_ways":2,
> +    "interleave_granularity":4096,
> +    "decode_state":"commit",
> +    "poison":{
> +      "nr_records":2,
> +      "records":[
> +        {
> +          "memdev":"mem2",
> +          "dpa":0,
> +          "dpa_length":64,

Does length need to be prefixed with "dpa_"?

> +          "source":"Internal",

I am not sure what the end user can do with "source"? I have tended to
not emit things if I can't think of a use case for the field to be
there.
Tsaur, Erwin Dec. 13, 2023, 12:34 a.m. UTC | #2
Some comments inline

-----Original Message-----
From: Dan Williams <dan.j.williams@intel.com> 
Sent: Wednesday, December 6, 2023 8:23 PM
To: Schofield, Alison <alison.schofield@intel.com>; Verma, Vishal L <vishal.l.verma@intel.com>
Cc: Schofield, Alison <alison.schofield@intel.com>; nvdimm@lists.linux.dev; linux-cxl@vger.kernel.org
Subject: Re: [ndctl PATCH v5 4/5] cxl/list: add --poison option to cxl list

alison.schofield@ wrote:
> From: Alison Schofield <alison.schofield@intel.com>
> 
> The --poison option to 'cxl list' retrieves poison lists from memory 
> devices supporting the capability and displays the returned poison 
> records in the cxl list json. This option can apply to memdevs or 
> regions.
> 
> Example usage in the Documentation/cxl/cxl-list.txt update.
> 
> Signed-off-by: Alison Schofield <alison.schofield@intel.com>
> ---
>  Documentation/cxl/cxl-list.txt | 58 ++++++++++++++++++++++++++++++++++
>  cxl/filter.h                   |  3 ++
>  cxl/list.c                     |  2 ++
>  3 files changed, 63 insertions(+)
> 
> diff --git a/Documentation/cxl/cxl-list.txt 
> b/Documentation/cxl/cxl-list.txt index 838de4086678..ee2f1b2d9fae 
> 100644
> --- a/Documentation/cxl/cxl-list.txt
> +++ b/Documentation/cxl/cxl-list.txt
> @@ -415,6 +415,64 @@ OPTIONS
>  --region::
>  	Specify CXL region device name(s), or device id(s), to filter the listing.
>  
> +-L::
> +--poison::
> +	Include poison information. The poison list is retrieved from the
> +	device(s) and poison records are added to the listing. Apply this
> +	option to memdevs and regions where devices support the poison
> +	list capability.

While CXL calls it "poison" I am not convinced that's the term that end users can universally use for this. This is why "ndctl list" uses -M, but yeah, -M and -P are already taken. Even -E is taken for "errors".

> +
> +----
> +# cxl list -m mem11 --poison
> +[
> +  {
> +    "memdev":"mem11",
> +    "pmem_size":268435456,
> +    "ram_size":0,
> +    "serial":0,
> +    "host":"0000:37:00.0",
> +    "poison":{
> +      "nr_records":1,
> +      "records":[

One cleanup I want to see before this goes live... drop nr_records and just make "poison" an array object directly. The number of records is trivially determined by the jq "len" operator.

Also, per above rename "poison" to "media_errors". I believe "poison" is an x86'ism where "media_error" is a more generic term.
[ETT] The problem is that a poison can be written into CXL and is NOT a media error.  Generally... the only way a customer should interpret that a DPA is "poisoned" is that if you consume that address, you'll end up in a MCA-Recovery path.  It does not indicate media health.  But see discussion later about "source"

> +        {
> +          "dpa":0,
> +          "dpa_length":64,
> +          "source":"Internal",
> +        }
> +      ]
> +    }
> +  }
> +]
> +# cxl list -r region5 --poison
> +[
> +  {
> +    "region":"region5",
> +    "resource":1035623989248,
> +    "size":2147483648,
> +    "interleave_ways":2,
> +    "interleave_granularity":4096,
> +    "decode_state":"commit",
> +    "poison":{
> +      "nr_records":2,
> +      "records":[
> +        {
> +          "memdev":"mem2",
> +          "dpa":0,
> +          "dpa_length":64,

Does length need to be prefixed with "dpa_"?

> +          "source":"Internal",

I am not sure what the end user can do with "source"? I have tended to not emit things if I can't think of a use case for the field to be there.
[ETT] The sources are "external" (poison written into CXL), "internal" (the device generated the poison), "injected" (ie via poison inject) and "vendor specific".
This is how I would use source.
"external" = don't expect to see a cxl media error, look elsewhere like a UCNA or a mem_data error in the RP's CXL.CM RAS.
"internal" = expect to see a media error for more information.
"injected" = somebody injected the error, no service action needed except to maybe tighten up your security.
"vendor" = see vendor 

How about a HPA? :D
Alison Schofield Dec. 13, 2023, 2:02 a.m. UTC | #3
On Wed, Dec 06, 2023 at 08:23:10PM -0800, Dan Williams wrote:
> alison.schofield@ wrote:
> > From: Alison Schofield <alison.schofield@intel.com>
> > 
> > The --poison option to 'cxl list' retrieves poison lists from
> > memory devices supporting the capability and displays the
> > returned poison records in the cxl list json. This option can
> > apply to memdevs or regions.
> > 
> > Example usage in the Documentation/cxl/cxl-list.txt update.
> > 
> > Signed-off-by: Alison Schofield <alison.schofield@intel.com>
> > ---
> >  Documentation/cxl/cxl-list.txt | 58 ++++++++++++++++++++++++++++++++++
> >  cxl/filter.h                   |  3 ++
> >  cxl/list.c                     |  2 ++
> >  3 files changed, 63 insertions(+)
> > 
> > diff --git a/Documentation/cxl/cxl-list.txt b/Documentation/cxl/cxl-list.txt
> > index 838de4086678..ee2f1b2d9fae 100644
> > --- a/Documentation/cxl/cxl-list.txt
> > +++ b/Documentation/cxl/cxl-list.txt
> > @@ -415,6 +415,64 @@ OPTIONS
> >  --region::
> >  	Specify CXL region device name(s), or device id(s), to filter the listing.
> >  
> > +-L::
> > +--poison::
> > +	Include poison information. The poison list is retrieved from the
> > +	device(s) and poison records are added to the listing. Apply this
> > +	option to memdevs and regions where devices support the poison
> > +	list capability.
> 
> While CXL calls it "poison" I am not convinced that's the term that end
> users can universally use for this. This is why "ndctl list" uses -M,
> but yeah, -M and -P are already taken. Even -E is taken for "errors".
> 
> > +
> > +----
> > +# cxl list -m mem11 --poison
> > +[
> > +  {
> > +    "memdev":"mem11",
> > +    "pmem_size":268435456,
> > +    "ram_size":0,
> > +    "serial":0,
> > +    "host":"0000:37:00.0",
> > +    "poison":{
> > +      "nr_records":1,
> > +      "records":[
> 
> One cleanup I want to see before this goes live... drop nr_records and
> just make "poison" an array object directly. The number of records is
> trivially determined by the jq "len" operator.

The poison nr_records count is a convenience for the cmdline user,
not for the user doing their own jq parsing. It also intends to
explicitly show nr_records:0 when no poison is found.

Say NAK again and I'll rm it all.

> 
> Also, per above rename "poison" to "media_errors". I believe "poison" is
> an x86'ism where "media_error" is a more generic term.

OK

> 
> > +        {
> > +          "dpa":0,
> > +          "dpa_length":64,
> > +          "source":"Internal",
> > +        }
> > +      ]
> > +    }
> > +  }
> > +]
> > +# cxl list -r region5 --poison
> > +[
> > +  {
> > +    "region":"region5",
> > +    "resource":1035623989248,
> > +    "size":2147483648,
> > +    "interleave_ways":2,
> > +    "interleave_granularity":4096,
> > +    "decode_state":"commit",
> > +    "poison":{
> > +      "nr_records":2,
> > +      "records":[
> > +        {
> > +          "memdev":"mem2",
> > +          "dpa":0,
> > +          "dpa_length":64,
> 
> Does length need to be prefixed with "dpa_"?
>

Nope. Will remove.

> > +          "source":"Internal",
> 
> I am not sure what the end user can do with "source"? I have tended to
> not emit things if I can't think of a use case for the field to be
> there.
Erwin followed w a comment on this one. Yeah, I think folks want to 
know if error was injected at least.

Alison
diff mbox series

Patch

diff --git a/Documentation/cxl/cxl-list.txt b/Documentation/cxl/cxl-list.txt
index 838de4086678..ee2f1b2d9fae 100644
--- a/Documentation/cxl/cxl-list.txt
+++ b/Documentation/cxl/cxl-list.txt
@@ -415,6 +415,64 @@  OPTIONS
 --region::
 	Specify CXL region device name(s), or device id(s), to filter the listing.
 
+-L::
+--poison::
+	Include poison information. The poison list is retrieved from the
+	device(s) and poison records are added to the listing. Apply this
+	option to memdevs and regions where devices support the poison
+	list capability.
+
+----
+# cxl list -m mem11 --poison
+[
+  {
+    "memdev":"mem11",
+    "pmem_size":268435456,
+    "ram_size":0,
+    "serial":0,
+    "host":"0000:37:00.0",
+    "poison":{
+      "nr_records":1,
+      "records":[
+        {
+          "dpa":0,
+          "dpa_length":64,
+          "source":"Internal",
+        }
+      ]
+    }
+  }
+]
+# cxl list -r region5 --poison
+[
+  {
+    "region":"region5",
+    "resource":1035623989248,
+    "size":2147483648,
+    "interleave_ways":2,
+    "interleave_granularity":4096,
+    "decode_state":"commit",
+    "poison":{
+      "nr_records":2,
+      "records":[
+        {
+          "memdev":"mem2",
+          "dpa":0,
+          "dpa_length":64,
+          "source":"Internal",
+        },
+        {
+          "memdev":"mem5",
+          "dpa":0,
+          "length":512,
+          "source":"Vendor",
+        }
+      ]
+    }
+  }
+]
+----
+
 -v::
 --verbose::
 	Increase verbosity of the output. This can be specified
diff --git a/cxl/filter.h b/cxl/filter.h
index 3f65990f835a..1241f72ccf62 100644
--- a/cxl/filter.h
+++ b/cxl/filter.h
@@ -30,6 +30,7 @@  struct cxl_filter_params {
 	bool fw;
 	bool alert_config;
 	bool dax;
+	bool poison;
 	int verbose;
 	struct log_ctx ctx;
 };
@@ -88,6 +89,8 @@  static inline unsigned long cxl_filter_to_flags(struct cxl_filter_params *param)
 		flags |= UTIL_JSON_ALERT_CONFIG;
 	if (param->dax)
 		flags |= UTIL_JSON_DAX | UTIL_JSON_DAX_DEVS;
+	if (param->poison)
+		flags |= UTIL_JSON_MEDIA_ERRORS;
 	return flags;
 }
 
diff --git a/cxl/list.c b/cxl/list.c
index 93ba51ef895c..13fef8569340 100644
--- a/cxl/list.c
+++ b/cxl/list.c
@@ -57,6 +57,8 @@  static const struct option options[] = {
 		    "include memory device firmware information"),
 	OPT_BOOLEAN('A', "alert-config", &param.alert_config,
 		    "include alert configuration information"),
+	OPT_BOOLEAN('L', "poison", &param.poison,
+		    "include poison information "),
 	OPT_INCR('v', "verbose", &param.verbose, "increase output detail"),
 #ifdef ENABLE_DEBUG
 	OPT_BOOLEAN(0, "debug", &debug, "debug list walk"),