mbox series

[ndctl,v10,0/7] Support poison list retrieval

Message ID cover.1709748564.git.alison.schofield@intel.com (mailing list archive)
Headers show
Series Support poison list retrieval | expand

Message

Alison Schofield March 6, 2024, 6:42 p.m. UTC
From: Alison Schofield <alison.schofield@intel.com>

Changes since v9:
- Replace the multi-use 'name' var, with multiple descriptive
  flavors: memdev_name, region_name, decoder_name (DaveJ)
- Use a static string table for poison source lookup (DaveJ)
- Rebased on latest pending
Link to v9: https://lore.kernel.org/r/cover.1709253898.git.alison.schofield@intel.com/


Add the option to add a memory devices poison list to the cxl-list
json output. Offer the option by memdev and by region. Sample usage:

# cxl list -m mem1 --media-errors
[
  {
    "memdev":"mem1",
    "pmem_size":1073741824,
    "ram_size":1073741824,
    "serial":1,
    "numa_node":1,
    "host":"cxl_mem.1",
    "media_errors":[
      {
        "dpa":0,
        "length":64,
        "source":"Internal"
      },
      {
        "decoder":"decoder10.0",
        "hpa":1035355557888,
        "dpa":1073741824,
        "length":64,
        "source":"External"
      },
      {
        "decoder":"decoder10.0",
        "hpa":1035355566080,
        "dpa":1073745920,
        "length":64,
        "source":"Injected"
      }
    ]
  }
]

# cxl list -r region5 --media-errors
[
  {
    "region":"region5",
    "resource":1035355553792,
    "size":2147483648,
    "type":"pmem",
    "interleave_ways":2,
    "interleave_granularity":4096,
    "decode_state":"commit",
    "media_errors":[
      {
        "decoder":"decoder10.0",
        "hpa":1035355557888,
        "dpa":1073741824,
        "length":64,
        "source":"External"
      },
      {
        "decoder":"decoder8.1",
        "hpa":1035355566080,
        "dpa":1073745920,
        "length":64,
        "source":"Internal"
      }
    ]
  }
]

Alison Schofield (7):
  libcxl: add interfaces for GET_POISON_LIST mailbox commands
  cxl: add an optional pid check to event parsing
  cxl/event_trace: add a private context for private parsers
  cxl/event_trace: add helpers to retrieve tep fields by type
  cxl/list: collect and parse media_error records
  cxl/list: add --media-errors option to cxl list
  cxl/test: add cxl-poison.sh unit test

 Documentation/cxl/cxl-list.txt |  79 +++++++++-
 cxl/event_trace.c              |  82 ++++++++++-
 cxl/event_trace.h              |  14 +-
 cxl/filter.h                   |   3 +
 cxl/json.c                     | 257 +++++++++++++++++++++++++++++++++
 cxl/lib/libcxl.c               |  47 ++++++
 cxl/lib/libcxl.sym             |   2 +
 cxl/libcxl.h                   |   2 +
 cxl/list.c                     |   3 +
 test/cxl-poison.sh             | 137 ++++++++++++++++++
 test/meson.build               |   2 +
 11 files changed, 624 insertions(+), 4 deletions(-)
 create mode 100644 test/cxl-poison.sh


base-commit: e0d0680bd3e554bd5f211e989480c5a13a023b2d

Comments

Dan Williams March 6, 2024, 11:03 p.m. UTC | #1
alison.schofield@ wrote:
> From: Alison Schofield <alison.schofield@intel.com>
> 
> Changes since v9:
> - Replace the multi-use 'name' var, with multiple descriptive
>   flavors: memdev_name, region_name, decoder_name (DaveJ)
> - Use a static string table for poison source lookup (DaveJ)
> - Rebased on latest pending
> Link to v9: https://lore.kernel.org/r/cover.1709253898.git.alison.schofield@intel.com/
> 
> 
> Add the option to add a memory devices poison list to the cxl-list
> json output. Offer the option by memdev and by region. Sample usage:
> 
> # cxl list -m mem1 --media-errors
> [
>   {
>     "memdev":"mem1",
>     "pmem_size":1073741824,
>     "ram_size":1073741824,
>     "serial":1,
>     "numa_node":1,
>     "host":"cxl_mem.1",
>     "media_errors":[
>       {
>         "dpa":0,
>         "length":64,
>         "source":"Internal"
>       },
>       {
>         "decoder":"decoder10.0",
>         "hpa":1035355557888,
>         "dpa":1073741824,
>         "length":64,
>         "source":"External"
>       },
>       {
>         "decoder":"decoder10.0",
>         "hpa":1035355566080,
>         "dpa":1073745920,
>         "length":64,
>         "source":"Injected"
>       }
>     ]
>   }
> ]
> 
> # cxl list -r region5 --media-errors
> [
>   {
>     "region":"region5",
>     "resource":1035355553792,
>     "size":2147483648,
>     "type":"pmem",
>     "interleave_ways":2,
>     "interleave_granularity":4096,
>     "decode_state":"commit",
>     "media_errors":[
>       {
>         "decoder":"decoder10.0",
>         "hpa":1035355557888,
>         "dpa":1073741824,
>         "length":64,

I notice that the ndctl --media-errors records are:

{ offset, length }

...it is not clear to me that "dpa" and "hpa" have much meaning to
userspace by default. Physical address information is privileged, so if
these records were { offset, length } tuples there is the possibility
that they can be provided to non-root.

"Offset" is region relative "hpa" when listing region media errors, and
"offset" is memdev relative "dpa" while listing memdev relative media
errors.
Alison Schofield March 8, 2024, 3:58 a.m. UTC | #2
On Wed, Mar 06, 2024 at 03:03:40PM -0800, Dan Williams wrote:
> alison.schofield@ wrote:
> > From: Alison Schofield <alison.schofield@intel.com>
> > 
> > Changes since v9:
> > - Replace the multi-use 'name' var, with multiple descriptive
> >   flavors: memdev_name, region_name, decoder_name (DaveJ)
> > - Use a static string table for poison source lookup (DaveJ)
> > - Rebased on latest pending
> > Link to v9: https://lore.kernel.org/r/cover.1709253898.git.alison.schofield@intel.com/
> > 
> > 
> > Add the option to add a memory devices poison list to the cxl-list
> > json output. Offer the option by memdev and by region. Sample usage:
> > 
> > # cxl list -m mem1 --media-errors
> > [
> >   {
> >     "memdev":"mem1",
> >     "pmem_size":1073741824,
> >     "ram_size":1073741824,
> >     "serial":1,
> >     "numa_node":1,
> >     "host":"cxl_mem.1",
> >     "media_errors":[
> >       {
> >         "dpa":0,
> >         "length":64,
> >         "source":"Internal"
> >       },
> >       {
> >         "decoder":"decoder10.0",
> >         "hpa":1035355557888,
> >         "dpa":1073741824,
> >         "length":64,
> >         "source":"External"
> >       },
> >       {
> >         "decoder":"decoder10.0",
> >         "hpa":1035355566080,
> >         "dpa":1073745920,
> >         "length":64,
> >         "source":"Injected"
> >       }
> >     ]
> >   }
> > ]
> > 
> > # cxl list -r region5 --media-errors
> > [
> >   {
> >     "region":"region5",
> >     "resource":1035355553792,
> >     "size":2147483648,
> >     "type":"pmem",
> >     "interleave_ways":2,
> >     "interleave_granularity":4096,
> >     "decode_state":"commit",
> >     "media_errors":[
> >       {
> >         "decoder":"decoder10.0",
> >         "hpa":1035355557888,
> >         "dpa":1073741824,
> >         "length":64,
> 
> I notice that the ndctl --media-errors records are:
> 
> { offset, length }
> 
> ...it is not clear to me that "dpa" and "hpa" have much meaning to
> userspace by default. Physical address information is privileged, so if
> these records were { offset, length } tuples there is the possibility
> that they can be provided to non-root.
> 
> "Offset" is region relative "hpa" when listing region media errors, and
> "offset" is memdev relative "dpa" while listing memdev relative media
> errors.

Done. memdev relative dpa is just dpa right? Unless you are thinking
offset into a partition? I don't think so.
Dan Williams March 8, 2024, 4:03 a.m. UTC | #3
Alison Schofield wrote:
[..]
> > I notice that the ndctl --media-errors records are:
> > 
> > { offset, length }
> > 
> > ...it is not clear to me that "dpa" and "hpa" have much meaning to
> > userspace by default. Physical address information is privileged, so if
> > these records were { offset, length } tuples there is the possibility
> > that they can be provided to non-root.
> > 
> > "Offset" is region relative "hpa" when listing region media errors, and
> > "offset" is memdev relative "dpa" while listing memdev relative media
> > errors.
> 
> Done. memdev relative dpa is just dpa right? Unless you are thinking
> offset into a partition? I don't think so.

Right,
	memdev offset == absolute device dpa

	region offset == region base relative
Alison Schofield March 10, 2024, 7:21 p.m. UTC | #4
On Wed, Mar 06, 2024 at 03:03:40PM -0800, Dan Williams wrote:
> alison.schofield@ wrote:
> > From: Alison Schofield <alison.schofield@intel.com>
> > 
> > Changes since v9:
> > - Replace the multi-use 'name' var, with multiple descriptive
> >   flavors: memdev_name, region_name, decoder_name (DaveJ)
> > - Use a static string table for poison source lookup (DaveJ)
> > - Rebased on latest pending
> > Link to v9: https://lore.kernel.org/r/cover.1709253898.git.alison.schofield@intel.com/
> > 
> > 
> > Add the option to add a memory devices poison list to the cxl-list
> > json output. Offer the option by memdev and by region. Sample usage:
> > 
> > # cxl list -m mem1 --media-errors
> > [
> >   {
> >     "memdev":"mem1",
> >     "pmem_size":1073741824,
> >     "ram_size":1073741824,
> >     "serial":1,
> >     "numa_node":1,
> >     "host":"cxl_mem.1",
> >     "media_errors":[
> >       {
> >         "dpa":0,
> >         "length":64,
> >         "source":"Internal"
> >       },
> >       {
> >         "decoder":"decoder10.0",
> >         "hpa":1035355557888,
> >         "dpa":1073741824,
> >         "length":64,
> >         "source":"External"
> >       },
> >       {
> >         "decoder":"decoder10.0",
> >         "hpa":1035355566080,
> >         "dpa":1073745920,
> >         "length":64,
> >         "source":"Injected"
> >       }

Dan,

In cleaning up the man pages, I need to follow up on this offset, length
notation.

A default by memdev list now looks like this-

{
 "offset" :
 "length" :
 "source" :
}

Which means dropping the 'decoder' even if it can be discovered from
the trace event. Recall previously a region was listed if present,
then we changed that to a decoder is listed if present.  Now, with
no 'hpa' listing I've dropped the decoder too. That leaves no hint
in the by memdev listing that this poison is in a region. 

Decoders will only be included in the by region list:
{
 "decoder":
 "offset" :
 "length" :
 "source" :
}

OK ?


> >   }
> > ]
> > 
> > # cxl list -r region5 --media-errors
> > [
> >   {
> >     "region":"region5",
> >     "resource":1035355553792,
> >     "size":2147483648,
> >     "type":"pmem",
> >     "interleave_ways":2,
> >     "interleave_granularity":4096,
> >     "decode_state":"commit",
> >     "media_errors":[
> >       {
> >         "decoder":"decoder10.0",
> >         "hpa":1035355557888,
> >         "dpa":1073741824,
> >         "length":64,
> 
> I notice that the ndctl --media-errors records are:
> 
> { offset, length }
> 
> ...it is not clear to me that "dpa" and "hpa" have much meaning to
> userspace by default. Physical address information is privileged, so if
> these records were { offset, length } tuples there is the possibility
> that they can be provided to non-root.
> 
> "Offset" is region relative "hpa" when listing region media errors, and
> "offset" is memdev relative "dpa" while listing memdev relative media
> errors.