[ndctl,RFC,0/3] Support poison list retrieval

Message ID cover.1665699750.git.alison.schofield@intel.com

Alison Schofield Oct. 13, 2022, 11:39 p.m. UTC
From: Alison Schofield <alison.schofield@intel.com>

The RFC label is because this is built upon in-flight patchsets,
making it unlikely that others can try it out. It depends upon the
tracing support in Dave's monitor patchset [1], and the kernel
driver support for poison in this patchset [2].

The first patch adds a libcxl API for triggering the read of a
poison list from a memory device. Users of that API will need to
trace the kernel events to collect the error records.

Patches 2 & 3 offer a friendlier alternative: a --media-errors
option to 'cxl list' that reads the poison list, collects and parses
the results, and includes the media error records in the JSON list
output.

The JSON output of 'cxl list' does not include all the same fields
that are available in the 'cxl_poison' trace event.

Trace events of 'cxl_poison' always include these fields:
region: memdev: pcidev: hpa: dpa: length: source: flags: overflow_time:

'cxl list --media-errors' omits fields that seem useless in the
context of the cxl list command:
- Do not repeat the memdev, region, or pcidev values that are
  already included in the list output.
- Only include 'hpa' when media errors are listed by region.
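
The omission rules above can be sketched as a small filter. This is only
an illustration in Python, not the code in cxl/json.c: the helper name
to_list_record() and the by_region flag are hypothetical, and the choice
to keep 'memdev' in per-region records is inferred from the region
example in this letter. The field names come from the 'cxl_poison' trace
event list above.

```python
# Fields that every 'cxl_poison' trace event carries, per the list above.
TRACE_FIELDS = (
    "region", "memdev", "pcidev", "hpa", "dpa",
    "length", "source", "flags", "overflow_time",
)


def to_list_record(event, by_region):
    """Hypothetical helper: reduce one trace event (a dict) to the fields
    that 'cxl list --media-errors' reports for it.

    by_region=False: listing by memdev; the enclosing object already
                     names the memdev, and 'hpa' is not reported.
    by_region=True:  listing by region; the enclosing object already
                     names the region, so keep 'memdev' and 'hpa'.
    """
    keep = ["dpa", "length", "source", "flags", "overflow_time"]
    if by_region:
        keep = ["memdev", "hpa"] + keep
    # 'region' and 'pcidev' are never repeated inside a record.
    return {name: event[name] for name in keep}
```

Calling to_list_record(event, by_region=False) on an event dict holding
all nine trace fields yields only dpa, length, source, flags, and
overflow_time.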

Examples:
# cxl list -m mem2 --media-errors
[
  {
    "memdev":"mem2",
    "pmem_size":1073741824,
    "ram_size":0,
    "serial":2,
    "host":"cxl_mem.2",
    "media_errors":{
      "nr media-errors":2,
      "media-error records":[
        {
          "dpa":64,
          "length":128,
          "source":"Injected",
          "flags":"Overflow,",
          "overflow_time":1656711046
        },
        {
          "dpa":192,
          "length":192,
          "source":"Internal",
          "flags":"Overflow,",
          "overflow_time":1656711046
        }
      ]
    }
  }
]

# cxl list -r region5 --media-errors
[
  {
    "region":"region5",
    "resource":1035623989248,
    "size":2147483648,
    "interleave_ways":2,
    "interleave_granularity":4096,
    "decode_state":"commit",
    "media_errors":{
      "nr media-errors":2,
      "media-error records":[
        {
          "memdev":"mem2",
          "hpa":0,
          "dpa":0,
          "length":64,
          "source":"Reserved",
          "flags":"",
          "overflow_time":0
        },
        {
          "memdev":"mem5",
          "hpa":0,
          "dpa":384,
          "length":256,
          "source":"Injected",
          "flags":"",
          "overflow_time":0
        }
      ]
    }
  }
]
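
For anyone scripting against this output, a consumer might look like the
sketch below. It assumes the output is valid JSON and uses the key names
exactly as they appear in the examples above ("media_errors",
"media-error records", "length"). The function name
summarize_media_errors() is made up for illustration, and no unit is
assumed for 'length'; the values are simply summed.

```python
import json


def summarize_media_errors(list_output):
    """Map each listed memdev or region name to a tuple of
    (number of media-error records, sum of their 'length' fields)."""
    summary = {}
    for obj in json.loads(list_output):
        errors = obj.get("media_errors")
        if errors is None:
            # Object was listed without media error data.
            continue
        # An object is identified by its "memdev" or its "region" key.
        name = obj.get("memdev") or obj.get("region")
        records = errors.get("media-error records", [])
        summary[name] = (len(records), sum(r["length"] for r in records))
    return summary
```

Fed the output of 'cxl list -r region5 --media-errors' shown above, this
would report two records for region5 with a summed length of 320.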

[1] https://lore.kernel.org/nvdimm/166363103019.3861186.3067220004819656109.stgit@djiang5-desk3.ch.intel.com/
[2] https://lore.kernel.org/linux-cxl/cover.1665606782.git.alison.schofield@intel.com/

Alison Schofield (3):
  libcxl: add interfaces for GET_POISON_LIST mailbox commands
  cxl/list: collect and parse the poison list records
  cxl/list: add --media-errors option to cxl list

 Documentation/cxl/cxl-list.txt |  66 +++++++++++
 cxl/filter.c                   |   2 +
 cxl/filter.h                   |   1 +
 cxl/json.c                     | 197 +++++++++++++++++++++++++++++++++
 cxl/lib/libcxl.c               |  40 +++++++
 cxl/lib/libcxl.sym             |   6 +
 cxl/libcxl.h                   |   2 +
 cxl/list.c                     |   2 +
 8 files changed, 316 insertions(+)