mbox series

[RFC,0/2] CXL UE RAS Multiple Header Logging support

Message ID 20230113154058.16227-1-Jonathan.Cameron@huawei.com
Headers show
Series CXL UE RAS Multiple Header Logging support | expand

Message

Jonathan Cameron Jan. 13, 2023, 3:40 p.m. UTC
CXL UE RAS Error reporting allows an EP to report the capability of
recording Multiple Header Logs for uncorrectable errors.
Unlike equivalent feature in PCIe, there is no enable control
for this feature, so a supporting device may be expecting
a more complex software flow than that necessary for devices
that do not support this feature. Documentation of this feature
is sparse, with assumption it works the same as PCIe.

There are hardware implementation choices allowed in the
equivalent PCIe r6.0 base spec section (6.4.2.4) that could
be safely used with the existing code, even with Multiple
Header Recording support but there are others that cannot.

The issue is what happens when the EP is doing Multiple Header
Recording but then the software writes 1 to clear more than one
status bit at the time (PCIe spec warns against doing this
 - but it is what the current kernel code will do):
Option 1)
  It does the nice thing and clears all matching errors.
  Note this is a bit strange for the case where the device
  supports logging multiple instances of a given error - so
  the two can't be combined cleanly.  With that feature
  I can't see how anyone could implement hardware that coped
  cleanly with the wrong software flow.
Option 2)
  It clears only the first error bit leaving a bunch of error
  bits set (note that if it has recorded multiple errors of
  same type it might not even do that). These are sticky
  across resets, so you will probably end up coming back up
  and immediately seeing an error.

So whilst you can design an EP to safe against non MH recording
aware software, it isn't generally the case. As we don't have
an explicit enable on CXL we have to handle anything reporting
the capability in a MH safe fashion.

This feature was developed against emulation in QEMU.
The relevant patches have not yet been posted but can be found on
https://gitlab.com/jic23/qemu/-/commits/cxl-2023-01-11
along with description of how to inject errors in the patch
descriptions.  I'll post them for review for QEMU inclusion
shortly.

RFC simply because the lack of specification detail means I am
less sure on this code than I would normally be. Unfortunately it
could be argued that the first patch is a fix for the
current upstream CXL RAS support.  If we want a simpler fix
one option would be to just fail to enable RAS support if
Multiple Header recording capability bit is set.  Or we
decide that it doesn't matter for now and add support for this
feature via the normal merge cycle.

Second patch is just there to make this easier to test as
no additional software is needed to print the header log.

Base is rather messy due to a clash between multiple cxl tree
branches.
cxl/fixes with the trace move on cxl/next cherry picked on top
as it moves the code that was fixed.

Jonathan Cameron (2):
  cxl: RAS: Multiple header recording support
  cxl: Add tprintk support for header log hex dump

 drivers/cxl/core/pci.c   | 17 ++++++++++++-----
 drivers/cxl/core/trace.h |  7 +++++--
 drivers/cxl/cxl.h        |  1 +
 3 files changed, 18 insertions(+), 7 deletions(-)