mbox series

[ndctl,v11,0/7] Support poison list retrieval

Message ID cover.1710386468.git.alison.schofield@intel.com (mailing list archive)
Headers show
Series Support poison list retrieval | expand

Message

Alison Schofield March 14, 2024, 4:05 a.m. UTC
From: Alison Schofield <alison.schofield@intel.com>

Changes since v10:
- Use offset, length notation in json output (Dan)
- Remove endpoint decoder from json output
- Man page updates to reflect above changes
- Remove open coded tep_find_field() (Dan)
- Use raw instead of custom string helper 
- Use get_field_val() in u8,32,64 helpers instead of _raw (Dan)
- Pass event_ctx to its own parsing method as a typical 'this' pointer (Dan)
- Replace private_ctx w poison_ctx in event_ctx. This addresses Dan's
  feedback to avoid a void* but stops short of his suggestion to wrap
  event_ctx in a private_ctx for this single use case.
- v10: https://lore.kernel.org/linux-cxl/cover.1709748564.git.alison.schofield@intel.com/


Begin cover letter:
Add the option to add a memory devices poison list to the cxl-list
json output. Offer the option by memdev and by region. 

From the man page cxl-list:

       -L, --media-errors
           Include media-error information. The poison list is retrieved from
           the device(s) and media_error records are added to the listing.
           Apply this option to memdevs and regions where devices support the
           poison list capability. "offset:" is relative to the region
           resource when listing by region and is the absolute device DPA when
           listing by memdev. "source:" is one of: External, Internal,
           Injected, Vendor Specific, or Unknown, as defined in CXL
           Specification v3.1 Table 8-140.

           # cxl list -m mem9 --media-errors -u
           {
             "memdev":"mem9",
             "pmem_size":"1024.00 MiB (1073.74 MB)",
             "pmem_qos_class":42,
             "ram_size":"1024.00 MiB (1073.74 MB)",
             "ram_qos_class":42,
             "serial":"0x5",
             "numa_node":1,
             "host":"cxl_mem.5",
             "media_errors":[
               {
                 "offset":"0x40000000",
                 "length":64,
                 "source":"Injected"
               }
             ]
           }

       In the above example, region mappings can be found using: "cxl list -p
       mem9 --decoders"

           # cxl list -r region5 --media-errors -u
           {
             "region":"region5",
             "resource":"0xf110000000",
             "size":"2.00 GiB (2.15 GB)",
             "type":"pmem",
             "interleave_ways":2,
             "interleave_granularity":4096,
             "decode_state":"commit",
             "media_errors":[
               {
                 "offset":"0x1000",
                 "length":64,
                 "source":"Injected"
               },
               {
                 "offset":"0x2000",
                 "length":64,
                 "source":"Injected"
               }
             ]
           }

       In the above example, memdev mappings can be found using: "cxl list -r
       region5 --targets" and "cxl list -d <decoder_name>"



Alison Schofield (7):
  libcxl: add interfaces for GET_POISON_LIST mailbox commands
  cxl/event_trace: add an optional pid check to event parsing
  cxl/event_trace: support poison context in event parsing
  cxl/event_trace: add helpers to retrieve tep fields by type
  cxl/list: collect and parse media_error records
  cxl/list: add --media-errors option to cxl list
  cxl/test: add cxl-poison.sh unit test

 Documentation/cxl/cxl-list.txt |  62 ++++++++++-
 cxl/event_trace.c              |  51 ++++++++-
 cxl/event_trace.h              |  19 +++-
 cxl/filter.h                   |   3 +
 cxl/json.c                     | 194 +++++++++++++++++++++++++++++++++
 cxl/lib/libcxl.c               |  47 ++++++++
 cxl/lib/libcxl.sym             |   2 +
 cxl/libcxl.h                   |   2 +
 cxl/list.c                     |   3 +
 test/cxl-poison.sh             | 137 +++++++++++++++++++++++
 test/meson.build               |   2 +
 11 files changed, 514 insertions(+), 8 deletions(-)
 create mode 100644 test/cxl-poison.sh


base-commit: e0d0680bd3e554bd5f211e989480c5a13a023b2d

Comments

Alison Schofield March 20, 2024, 8:42 p.m. UTC | #1
On Wed, Mar 13, 2024 at 09:05:16PM -0700, alison.schofield@intel.com wrote:
> From: Alison Schofield <alison.schofield@intel.com>

Asking folks to share this with future users of the poison list
feature of ndctl. ie. cxl list --media-errors

I'd like to get additional 'user' input on the json output provided by
this --media-errors option to cxl-list. After a few iterations of what
should be included in the cxl-list output, I'm not so sure that we've
captured sufficient input from potential users. (Since they typically
won't use this til it's released in ndctl.)

To guide your thinking recall that users can retrieve a devices poison
list now without any cxl-cli (ndctl) tool support. Users can trigger
the collection via sysfs and see the results in the trace logs like:

this:
- echo 1 > /sys/kernel/tracing/events/cxl/cxl_poison/enable
- echo 1 > /sys/bus/cxl/devices/memX/trigger_poison_list
- Examine the cxl_poison events in the trace file at 

or this:
- cxl monitor --daemon --log=<poison-log-path>
- echo 1 > /sys/bus/cxl/devices/memX/trigger_poison_list
- Examine the cxl_poison events in the monitor log

or this:
- enable tp_printk
- echo 1 > /sys/kernel/tracing/events/cxl/cxl_poison/enable
- echo 1 > /sys/bus/cxl/devices/memX/trigger_poison_list
- Examine the cxl_poison events in the dmesg log

So, a few ways to get at this cxl_poison trace data:
memdev=mem9
host=cxl_mem.5 
serial=5 
trace_type=List 
region=region5 
region_uuid=99352a43-44cb-405d-85c9-fdbd971455d8
hpa=0xf110001000
dpa=0x40000000
dpa_length=0x40
source=Injected
flags=
overflow_time=0

The tool should be providing a better experience that the sysfs/trace.
The tools does look up memdevs contributing to a region and triggers
the needed poison list reads, so that's a small convenience. It's
usefulness needs to extend to the json listing.

Here's history of json output pulled from the patch cover letters.
It's long, but I didn't want to omit any detail.

I've appended here the history of changes to the output.
Only including samples where the json output actually changed.
I'm including it to spur conversation not as a guideline.

Subject: [ndctl PATCH v11 0/7] Support poison list retrieval

           # cxl list -m mem9 --media-errors -u
           {
             "memdev":"mem9",
             "pmem_size":"1024.00 MiB (1073.74 MB)",
             "pmem_qos_class":42,
             "ram_size":"1024.00 MiB (1073.74 MB)",
             "ram_qos_class":42,
             "serial":"0x5",
             "numa_node":1,
             "host":"cxl_mem.5",
             "media_errors":[
               {
                 "offset":"0x40000000",
                 "length":64,
                 "source":"Injected"
               }
             ]
           }


           # cxl list -r region5 --media-errors -u
           {
             "region":"region5",
             "resource":"0xf110000000",
             "size":"2.00 GiB (2.15 GB)",
             "type":"pmem",
             "interleave_ways":2,
             "interleave_granularity":4096,
             "decode_state":"commit",
             "media_errors":[
               {
                 "offset":"0x1000",
                 "length":64,
                 "source":"Injected"
               },
               {
                 "offset":"0x2000",
                 "length":64,
                 "source":"Injected"
               }
             ]
           }


Subject: [ndctl PATCH v7 0/7] Support poison list retrieval

# cxl list -m mem1 --media-errors
[
  {
    "memdev":"mem1",
    "pmem_size":1073741824,
    "ram_size":1073741824,
    "serial":1,
    "numa_node":1,
    "host":"cxl_mem.1",
    "media_errors":[
      {
        "dpa":0,
        "length":64,
        "source":"Internal"
      },
      {
        "decoder":"decoder10.0",
        "hpa":1035355557888,
        "dpa":1073741824,
        "length":64,
        "source":"External"
      },
      {
        "decoder":"decoder10.0",
        "hpa":1035355566080,
        "dpa":1073745920,
        "length":64,
        "source":"Injected"
      }
    ]
  }
]

# cxl list -r region5 --media-errors
[
  {
    "region":"region5",
    "resource":1035355553792,
    "size":2147483648,
    "type":"pmem",
    "interleave_ways":2,
    "interleave_granularity":4096,
    "decode_state":"commit",
    "media_errors":[
      {
        "decoder":"decoder10.0",
        "hpa":1035355557888,
        "dpa":1073741824,
        "length":64,
        "source":"External"
      },
      {
        "decoder":"decoder8.1",
        "hpa":1035355566080,
        "dpa":1073745920,
        "length":64,
        "source":"Internal"
      }
    ]
  }
]

Subject: [ndctl PATCH v6 0/7] Support poison list retrieval

# cxl list -m mem1 --media-errors
[
  {
    "memdev":"mem1",
    "pmem_size":1073741824,
    "ram_size":1073741824,
    "serial":1,
    "numa_node":1,
    "host":"cxl_mem.1",
    "media_errors":[
      {
        "dpa":0,
        "dpa_length":64,
        "source":"Injected"
      },
      {
        "region":"region5",
        "dpa":1073741824,
        "dpa_length":64,
        "hpa":1035355557888,
        "source":"Injected"
      },
      {
        "region":"region5",
        "dpa":1073745920,
        "dpa_length":64,
        "hpa":1035355566080,
        "source":"Injected"
      }
    ]
  }
]

# cxl list -r region5 --media-errors
[
  {
    "region":"region5",
    "resource":1035355553792,
    "size":2147483648,
    "type":"pmem",
    "interleave_ways":2,
    "interleave_granularity":4096,
    "decode_state":"commit",
    "media_errors":[
      {
        "memdev":"mem1",
        "dpa":1073741824,
        "dpa_length":64,
        "hpa":1035355557888,
        "source":"Injected"
      },
      {
        "memdev":"mem1",
        "dpa":1073745920,
        "dpa_length":64,
        "hpa":1035355566080,
        "source":"Injected"
      }
    ]
  }
]

Subject: [ndctl PATCH v2 0/3] Support poison list retrieval

Example: By memdev
cxl list -m mem1 --poison -u
{
  "memdev":"mem1",
  "pmem_size":"1024.00 MiB (1073.74 MB)",
  "ram_size":"1024.00 MiB (1073.74 MB)",
  "serial":"0x1",
  "numa_node":1,
  "host":"cxl_mem.1",
  "poison":{
    "nr_poison_records":4,
    "poison_records":[
      {
        "dpa":"0x40000000",
        "dpa_length":64,
        "source":"Injected",
        "flags":""
      },
      {
        "dpa":"0x40001000",
        "dpa_length":64,
        "source":"Injected",
        "flags":""
      },
      {
        "dpa":"0",
        "dpa_length":64,
        "source":"Injected",
        "flags":""
      },
      {
        "dpa":"0x600",
        "dpa_length":64,
        "source":"Injected",
        "flags":""
      }
    ]
  }
}

Example: By region
cxl list -r region5 --poison -u
{
  "region":"region5",
  "resource":"0xf110000000",
  "size":"2.00 GiB (2.15 GB)",
  "type":"pmem",
  "interleave_ways":2,
  "interleave_granularity":4096,
  "decode_state":"commit",
  "poison":{
    "nr_poison_records":2,
    "poison_records":[
      {
        "memdev":"mem1",
        "region":"region5",
        "hpa":"0xf110001000",
        "dpa":"0x40000000",
        "dpa_length":64,
        "source":"Injected",
        "flags":""
      },
      {
        "memdev":"mem0",
        "region":"region5",
        "hpa":"0xf110000000",
        "dpa":"0x40000000",
        "dpa_length":64,
        "source":"Injected",
        "flags":""
      }
    ]
  }
}


Example: By memdev and coincidentally in a region
# cxl list -m mem0 --poison -u
{
  "memdev":"mem0",
  "pmem_size":"1024.00 MiB (1073.74 MB)",
  "ram_size":"1024.00 MiB (1073.74 MB)",
  "serial":"0",
  "numa_node":0,
  "host":"cxl_mem.0",
  "poison":{
    "nr_poison_records":1,
    "poison_records":[
      {
        "region":"region5",
        "hpa":"0xf110000000",
        "dpa":"0x40000000",
        "dpa_length":64,
        "source":"Injected",
        "flags":""
      }
    ]
  }
}


Example: No poison found
cxl list -m mem9 --poison -u
{
  "memdev":"mem9",
  "pmem_size":"1024.00 MiB (1073.74 MB)",
  "ram_size":"1024.00 MiB (1073.74 MB)",
  "serial":"0x9",
  "numa_node":1,
  "host":"cxl_mem.9",
  "poison":{
    "nr_poison_records":0
  }
}