diff mbox series

[02/11] cxl/mem: Implement Get Event Records command

Message ID 20221110185758.879472-3-ira.weiny@intel.com
State Superseded
Headers show
Series CXL: Process event logs | expand

Commit Message

Ira Weiny Nov. 10, 2022, 6:57 p.m. UTC
From: Ira Weiny <ira.weiny@intel.com>

CXL devices have multiple event logs which can be queried for CXL event
records.  Devices are required to support the storage of at least one
event record in each event log type.

Devices track event log overflow by incrementing a counter and tracking
the time of the first and last overflow event seen.

Software queries events via the Get Event Record mailbox command; CXL
rev 3.0 section 8.2.9.2.2.

Issue the Get Event Record mailbox command on driver load.  Trace each
record found with a generic record trace.  Trace any overflow
conditions.

The device can return up to 1MB worth of event records per query.  This
presents complications with allocating a huge buffers to potentially
capture all the records.  It is not anticipated that these event logs
will be very deep and reading them does not need to be performant.
Process only 3 records at a time.  3 records was chosen as it fits
comfortably on the stack to prevent dynamic allocation while still
cutting down on extra mailbox messages.

This patch traces a raw event record only and leaves the specific event
record types to subsequent patches.

Macros are created to use for tracing the common CXL Event header
fields.

Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Change from RFC v2:
	Support reading 3 events at once.
	Reverse Jonathan's suggestion and check for positive number of
		records.  Because the record count may have been
		returned as something > 3 based on what the device
		thinks it can send back even though the core Linux mbox
		processing truncates the data.
	Alison and Dave Jiang
		Change header uuid type to uuid_t for better user space
		processing
	Smita
		Check status reg before reading log.
	Steven
		Prefix all trace points with 'cxl_'
		Use static branch <trace>_enabled() calls
	Jonathan
		s/CXL_EVENT_TYPE_INFO/0
		s/{first,last}/{first,last}_ts
		Remove Reserved field from header
		Fix header issue for cxl_event_log_type_str()

Change from RFC:
	Remove redundant error message in get event records loop
	s/EVENT_RECORD_DATA_LENGTH/CXL_EVENT_RECORD_DATA_LENGTH
	Use hdr_uuid for the header UUID field
	Use cxl_event_log_type_str() for the trace events
	Create macros for the header fields and common entries of each event
	Add reserved buffer output dump
	Report error if event query fails
	Remove unused record_cnt variable
	Steven - reorder overflow record
		Remove NOTE about checkpatch
	Jonathan
		check for exactly 1 record
		s/v3.0/rev 3.0
		Use 3 byte fields for 24bit fields
		Add 3.0 Maintenance Operation Class
		Add Dynamic Capacity log type
		Fix spelling
	Dave Jiang/Dan/Alison
		s/cxl-event/cxl
		trace/events/cxl-events => trace/events/cxl.h
		s/cxl_event_overflow/overflow
		s/cxl_event/generic_event
---
 MAINTAINERS                  |   1 +
 drivers/cxl/core/mbox.c      |  70 +++++++++++++++++++
 drivers/cxl/cxl.h            |   8 +++
 drivers/cxl/cxlmem.h         |  73 ++++++++++++++++++++
 include/trace/events/cxl.h   | 127 +++++++++++++++++++++++++++++++++++
 include/uapi/linux/cxl_mem.h |   1 +
 6 files changed, 280 insertions(+)
 create mode 100644 include/trace/events/cxl.h

Comments

Dave Jiang Nov. 15, 2022, 9:54 p.m. UTC | #1
On 11/10/2022 10:57 AM, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> CXL devices have multiple event logs which can be queried for CXL event
> records.  Devices are required to support the storage of at least one
> event record in each event log type.
> 
> Devices track event log overflow by incrementing a counter and tracking
> the time of the first and last overflow event seen.
> 
> Software queries events via the Get Event Record mailbox command; CXL
> rev 3.0 section 8.2.9.2.2.
> 
> Issue the Get Event Record mailbox command on driver load.  Trace each
> record found with a generic record trace.  Trace any overflow
> conditions.
> 
> The device can return up to 1MB worth of event records per query.  This
> presents complications with allocating a huge buffers to potentially
> capture all the records.  It is not anticipated that these event logs
> will be very deep and reading them does not need to be performant.
> Process only 3 records at a time.  3 records was chosen as it fits
> comfortably on the stack to prevent dynamic allocation while still
> cutting down on extra mailbox messages.
> 
> This patch traces a raw event record only and leaves the specific event
> record types to subsequent patches.
> 
> Macros are created to use for tracing the common CXL Event header
> fields.
> 
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Would it be cleaner to split out the include/trace/events/cxl.h changes 
to its own patch?

Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> 
> ---
> Change from RFC v2:
> 	Support reading 3 events at once.
> 	Reverse Jonathan's suggestion and check for positive number of
> 		records.  Because the record count may have been
> 		returned as something > 3 based on what the device
> 		thinks it can send back even though the core Linux mbox
> 		processing truncates the data.
> 	Alison and Dave Jiang
> 		Change header uuid type to uuid_t for better user space
> 		processing
> 	Smita
> 		Check status reg before reading log.
> 	Steven
> 		Prefix all trace points with 'cxl_'
> 		Use static branch <trace>_enabled() calls
> 	Jonathan
> 		s/CXL_EVENT_TYPE_INFO/0
> 		s/{first,last}/{first,last}_ts
> 		Remove Reserved field from header
> 		Fix header issue for cxl_event_log_type_str()
> 
> Change from RFC:
> 	Remove redundant error message in get event records loop
> 	s/EVENT_RECORD_DATA_LENGTH/CXL_EVENT_RECORD_DATA_LENGTH
> 	Use hdr_uuid for the header UUID field
> 	Use cxl_event_log_type_str() for the trace events
> 	Create macros for the header fields and common entries of each event
> 	Add reserved buffer output dump
> 	Report error if event query fails
> 	Remove unused record_cnt variable
> 	Steven - reorder overflow record
> 		Remove NOTE about checkpatch
> 	Jonathan
> 		check for exactly 1 record
> 		s/v3.0/rev 3.0
> 		Use 3 byte fields for 24bit fields
> 		Add 3.0 Maintenance Operation Class
> 		Add Dynamic Capacity log type
> 		Fix spelling
> 	Dave Jiang/Dan/Alison
> 		s/cxl-event/cxl
> 		trace/events/cxl-events => trace/events/cxl.h
> 		s/cxl_event_overflow/overflow
> 		s/cxl_event/generic_event
> ---
>   MAINTAINERS                  |   1 +
>   drivers/cxl/core/mbox.c      |  70 +++++++++++++++++++
>   drivers/cxl/cxl.h            |   8 +++
>   drivers/cxl/cxlmem.h         |  73 ++++++++++++++++++++
>   include/trace/events/cxl.h   | 127 +++++++++++++++++++++++++++++++++++
>   include/uapi/linux/cxl_mem.h |   1 +
>   6 files changed, 280 insertions(+)
>   create mode 100644 include/trace/events/cxl.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index ca063a504026..4b7c6e3055c6 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -5223,6 +5223,7 @@ M:	Dan Williams <dan.j.williams@intel.com>
>   L:	linux-cxl@vger.kernel.org
>   S:	Maintained
>   F:	drivers/cxl/
> +F:	include/trace/events/cxl.h
>   F:	include/uapi/linux/cxl_mem.h
>   
>   CONEXANT ACCESSRUNNER USB DRIVER
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 16176b9278b4..a908b95a7de4 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -7,6 +7,9 @@
>   #include <cxlmem.h>
>   #include <cxl.h>
>   
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/cxl.h>
> +
>   #include "core.h"
>   
>   static bool cxl_raw_allow_all;
> @@ -48,6 +51,7 @@ static struct cxl_mem_command cxl_mem_commands[CXL_MEM_COMMAND_ID_MAX] = {
>   	CXL_CMD(RAW, CXL_VARIABLE_PAYLOAD, CXL_VARIABLE_PAYLOAD, 0),
>   #endif
>   	CXL_CMD(GET_SUPPORTED_LOGS, 0, CXL_VARIABLE_PAYLOAD, CXL_CMD_FLAG_FORCE_ENABLE),
> +	CXL_CMD(GET_EVENT_RECORD, 1, CXL_VARIABLE_PAYLOAD, 0),
>   	CXL_CMD(GET_FW_INFO, 0, 0x50, 0),
>   	CXL_CMD(GET_PARTITION_INFO, 0, 0x20, 0),
>   	CXL_CMD(GET_LSA, 0x8, CXL_VARIABLE_PAYLOAD, 0),
> @@ -704,6 +708,72 @@ int cxl_enumerate_cmds(struct cxl_dev_state *cxlds)
>   }
>   EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL);
>   
> +static void cxl_mem_get_records_log(struct cxl_dev_state *cxlds,
> +				    enum cxl_event_log_type type)
> +{
> +	struct cxl_get_event_payload payload;
> +	u16 pl_nr;
> +
> +	do {
> +		u8 log_type = type;
> +		int rc;
> +
> +		rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_GET_EVENT_RECORD,
> +				       &log_type, sizeof(log_type),
> +				       &payload, sizeof(payload));
> +		if (rc) {
> +			dev_err(cxlds->dev, "Event log '%s': Failed to query event records : %d",
> +				cxl_event_log_type_str(type), rc);
> +			return;
> +		}
> +
> +		pl_nr = le16_to_cpu(payload.record_count);
> +		if (trace_cxl_generic_event_enabled()) {
> +			u16 nr_rec = min_t(u16, pl_nr, CXL_GET_EVENT_NR_RECORDS);
> +			int i;
> +
> +			for (i = 0; i < nr_rec; i++)
> +				trace_cxl_generic_event(dev_name(cxlds->dev),
> +							type,
> +							&payload.record[i]);
> +		}
> +
> +		if (trace_cxl_overflow_enabled() &&
> +		    (payload.flags & CXL_GET_EVENT_FLAG_OVERFLOW))
> +			trace_cxl_overflow(dev_name(cxlds->dev), type, &payload);
> +
> +	} while (pl_nr > CXL_GET_EVENT_NR_RECORDS ||
> +		 payload.flags & CXL_GET_EVENT_FLAG_MORE_RECORDS);
> +}
> +
> +/**
> + * cxl_mem_get_event_records - Get Event Records from the device
> + * @cxlds: The device data for the operation
> + *
> + * Retrieve all event records available on the device and report them as trace
> + * events.
> + *
> + * See CXL rev 3.0 @8.2.9.2.2 Get Event Records
> + */
> +void cxl_mem_get_event_records(struct cxl_dev_state *cxlds)
> +{
> +	u32 status = readl(cxlds->regs.status + CXLDEV_DEV_EVENT_STATUS_OFFSET);
> +
> +	dev_dbg(cxlds->dev, "Reading event logs: %x\n", status);
> +
> +	if (status & CXLDEV_EVENT_STATUS_INFO)
> +		cxl_mem_get_records_log(cxlds, CXL_EVENT_TYPE_INFO);
> +	if (status & CXLDEV_EVENT_STATUS_WARN)
> +		cxl_mem_get_records_log(cxlds, CXL_EVENT_TYPE_WARN);
> +	if (status & CXLDEV_EVENT_STATUS_FAIL)
> +		cxl_mem_get_records_log(cxlds, CXL_EVENT_TYPE_FAIL);
> +	if (status & CXLDEV_EVENT_STATUS_FATAL)
> +		cxl_mem_get_records_log(cxlds, CXL_EVENT_TYPE_FATAL);
> +	if (status & CXLDEV_EVENT_STATUS_DYNAMIC_CAP)
> +		cxl_mem_get_records_log(cxlds, CXL_EVENT_TYPE_DYNAMIC_CAP);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_mem_get_event_records, CXL);
> +
>   /**
>    * cxl_mem_get_partition_info - Get partition info
>    * @cxlds: The device data for the operation
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index f680450f0b16..492cff1bea6d 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -132,6 +132,14 @@ static inline int ways_to_cxl(unsigned int ways, u8 *iw)
>   #define CXLDEV_CAP_CAP_ID_SECONDARY_MAILBOX 0x3
>   #define CXLDEV_CAP_CAP_ID_MEMDEV 0x4000
>   
> +/* CXL 3.0 8.2.8.3.1 Event Status Register */
> +#define CXLDEV_DEV_EVENT_STATUS_OFFSET		0x00
> +#define CXLDEV_EVENT_STATUS_INFO		BIT(0)
> +#define CXLDEV_EVENT_STATUS_WARN		BIT(1)
> +#define CXLDEV_EVENT_STATUS_FAIL		BIT(2)
> +#define CXLDEV_EVENT_STATUS_FATAL		BIT(3)
> +#define CXLDEV_EVENT_STATUS_DYNAMIC_CAP		BIT(4)
> +
>   /* CXL 2.0 8.2.8.4 Mailbox Registers */
>   #define CXLDEV_MBOX_CAPS_OFFSET 0x00
>   #define   CXLDEV_MBOX_CAP_PAYLOAD_SIZE_MASK GENMASK(4, 0)
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index b7b955ded3ac..da64ba0f156b 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -4,6 +4,7 @@
>   #define __CXL_MEM_H__
>   #include <uapi/linux/cxl_mem.h>
>   #include <linux/cdev.h>
> +#include <linux/uuid.h>
>   #include "cxl.h"
>   
>   /* CXL 2.0 8.2.8.5.1.1 Memory Device Status Register */
> @@ -256,6 +257,7 @@ struct cxl_dev_state {
>   enum cxl_opcode {
>   	CXL_MBOX_OP_INVALID		= 0x0000,
>   	CXL_MBOX_OP_RAW			= CXL_MBOX_OP_INVALID,
> +	CXL_MBOX_OP_GET_EVENT_RECORD	= 0x0100,
>   	CXL_MBOX_OP_GET_FW_INFO		= 0x0200,
>   	CXL_MBOX_OP_ACTIVATE_FW		= 0x0202,
>   	CXL_MBOX_OP_GET_SUPPORTED_LOGS	= 0x0400,
> @@ -325,6 +327,76 @@ struct cxl_mbox_identify {
>   	u8 qos_telemetry_caps;
>   } __packed;
>   
> +/*
> + * Common Event Record Format
> + * CXL rev 3.0 section 8.2.9.2.1; Table 8-42
> + */
> +struct cxl_event_record_hdr {
> +	uuid_t id;
> +	u8 length;
> +	u8 flags[3];
> +	__le16 handle;
> +	__le16 related_handle;
> +	__le64 timestamp;
> +	u8 maint_op_class;
> +	u8 reserved[0xf];
> +} __packed;
> +
> +#define CXL_EVENT_RECORD_DATA_LENGTH 0x50
> +struct cxl_event_record_raw {
> +	struct cxl_event_record_hdr hdr;
> +	u8 data[CXL_EVENT_RECORD_DATA_LENGTH];
> +} __packed;
> +
> +/*
> + * Get Event Records output payload
> + * CXL rev 3.0 section 8.2.9.2.2; Table 8-50
> + */
> +#define CXL_GET_EVENT_FLAG_OVERFLOW		BIT(0)
> +#define CXL_GET_EVENT_FLAG_MORE_RECORDS		BIT(1)
> +#define CXL_GET_EVENT_NR_RECORDS		3
> +struct cxl_get_event_payload {
> +	u8 flags;
> +	u8 reserved1;
> +	__le16 overflow_err_count;
> +	__le64 first_overflow_timestamp;
> +	__le64 last_overflow_timestamp;
> +	__le16 record_count;
> +	u8 reserved2[0xa];
> +	struct cxl_event_record_raw record[CXL_GET_EVENT_NR_RECORDS];
> +} __packed;
> +
> +/*
> + * CXL rev 3.0 section 8.2.9.2.2; Table 8-49
> + */
> +enum cxl_event_log_type {
> +	CXL_EVENT_TYPE_INFO = 0x00,
> +	CXL_EVENT_TYPE_WARN,
> +	CXL_EVENT_TYPE_FAIL,
> +	CXL_EVENT_TYPE_FATAL,
> +	CXL_EVENT_TYPE_DYNAMIC_CAP,
> +	CXL_EVENT_TYPE_MAX
> +};
> +
> +static inline const char *cxl_event_log_type_str(enum cxl_event_log_type type)
> +{
> +	switch (type) {
> +	case CXL_EVENT_TYPE_INFO:
> +		return "Informational";
> +	case CXL_EVENT_TYPE_WARN:
> +		return "Warning";
> +	case CXL_EVENT_TYPE_FAIL:
> +		return "Failure";
> +	case CXL_EVENT_TYPE_FATAL:
> +		return "Fatal";
> +	case CXL_EVENT_TYPE_DYNAMIC_CAP:
> +		return "Dynamic Capacity";
> +	default:
> +		break;
> +	}
> +	return "<unknown>";
> +}
> +
>   struct cxl_mbox_get_partition_info {
>   	__le64 active_volatile_cap;
>   	__le64 active_persistent_cap;
> @@ -384,6 +456,7 @@ int cxl_mem_create_range_info(struct cxl_dev_state *cxlds);
>   struct cxl_dev_state *cxl_dev_state_create(struct device *dev);
>   void set_exclusive_cxl_commands(struct cxl_dev_state *cxlds, unsigned long *cmds);
>   void clear_exclusive_cxl_commands(struct cxl_dev_state *cxlds, unsigned long *cmds);
> +void cxl_mem_get_event_records(struct cxl_dev_state *cxlds);
>   #ifdef CONFIG_CXL_SUSPEND
>   void cxl_mem_active_inc(void);
>   void cxl_mem_active_dec(void);
> diff --git a/include/trace/events/cxl.h b/include/trace/events/cxl.h
> new file mode 100644
> index 000000000000..60dec9a84918
> --- /dev/null
> +++ b/include/trace/events/cxl.h
> @@ -0,0 +1,127 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM cxl
> +
> +#if !defined(_CXL_TRACE_EVENTS_H) ||  defined(TRACE_HEADER_MULTI_READ)
> +#define _CXL_TRACE_EVENTS_H
> +
> +#include <asm-generic/unaligned.h>
> +#include <linux/tracepoint.h>
> +#include <cxlmem.h>
> +
> +TRACE_EVENT(cxl_overflow,
> +
> +	TP_PROTO(const char *dev_name, enum cxl_event_log_type log,
> +		 struct cxl_get_event_payload *payload),
> +
> +	TP_ARGS(dev_name, log, payload),
> +
> +	TP_STRUCT__entry(
> +		__string(dev_name, dev_name)
> +		__field(int, log)
> +		__field(u64, first_ts)
> +		__field(u64, last_ts)
> +		__field(u16, count)
> +	),
> +
> +	TP_fast_assign(
> +		__assign_str(dev_name, dev_name);
> +		__entry->log = log;
> +		__entry->count = le16_to_cpu(payload->overflow_err_count);
> +		__entry->first_ts = le64_to_cpu(payload->first_overflow_timestamp);
> +		__entry->last_ts = le64_to_cpu(payload->last_overflow_timestamp);
> +	),
> +
> +	TP_printk("%s: EVENT LOG OVERFLOW log=%s : %u records from %llu to %llu",
> +		__get_str(dev_name), cxl_event_log_type_str(__entry->log),
> +		__entry->count, __entry->first_ts, __entry->last_ts)
> +
> +);
> +
> +/*
> + * Common Event Record Format
> + * CXL 3.0 section 8.2.9.2.1; Table 8-42
> + */
> +#define CXL_EVENT_RECORD_FLAG_PERMANENT		BIT(2)
> +#define CXL_EVENT_RECORD_FLAG_MAINT_NEEDED	BIT(3)
> +#define CXL_EVENT_RECORD_FLAG_PERF_DEGRADED	BIT(4)
> +#define CXL_EVENT_RECORD_FLAG_HW_REPLACE	BIT(5)
> +#define show_hdr_flags(flags)	__print_flags(flags, " | ",			   \
> +	{ CXL_EVENT_RECORD_FLAG_PERMANENT,	"Permanent Condition"		}, \
> +	{ CXL_EVENT_RECORD_FLAG_MAINT_NEEDED,	"Maintenance Needed"		}, \
> +	{ CXL_EVENT_RECORD_FLAG_PERF_DEGRADED,	"Performance Degraded"		}, \
> +	{ CXL_EVENT_RECORD_FLAG_HW_REPLACE,	"Hardware Replacement Needed"	}  \
> +)
> +
> +/*
> + * Define macros for the common header of each CXL event.
> + *
> + * Tracepoints using these macros must do 3 things:
> + *
> + *	1) Add CXL_EVT_TP_entry to TP_STRUCT__entry
> + *	2) Use CXL_EVT_TP_fast_assign within TP_fast_assign;
> + *	   pass the dev_name, log, and CXL event header
> + *	3) Use CXL_EVT_TP_printk() instead of TP_printk()
> + *
> + * See the generic_event tracepoint as an example.
> + */
> +#define CXL_EVT_TP_entry					\
> +	__string(dev_name, dev_name)				\
> +	__field(int, log)					\
> +	__field_struct(uuid_t, hdr_uuid)			\
> +	__field(u32, hdr_flags)					\
> +	__field(u16, hdr_handle)				\
> +	__field(u16, hdr_related_handle)			\
> +	__field(u64, hdr_timestamp)				\
> +	__field(u8, hdr_length)					\
> +	__field(u8, hdr_maint_op_class)
> +
> +#define CXL_EVT_TP_fast_assign(dname, l, hdr)					\
> +	__assign_str(dev_name, (dname));					\
> +	__entry->log = (l);							\
> +	memcpy(&__entry->hdr_uuid, &(hdr).id, sizeof(uuid_t));			\
> +	__entry->hdr_length = (hdr).length;					\
> +	__entry->hdr_flags = get_unaligned_le24((hdr).flags);			\
> +	__entry->hdr_handle = le16_to_cpu((hdr).handle);			\
> +	__entry->hdr_related_handle = le16_to_cpu((hdr).related_handle);	\
> +	__entry->hdr_timestamp = le64_to_cpu((hdr).timestamp);			\
> +	__entry->hdr_maint_op_class = (hdr).maint_op_class
> +
> +
> +#define CXL_EVT_TP_printk(fmt, ...) \
> +	TP_printk("%s log=%s : time=%llu uuid=%pUb len=%d flags='%s' "		\
> +		"handle=%x related_handle=%x maint_op_class=%u"			\
> +		" : " fmt,							\
> +		__get_str(dev_name), cxl_event_log_type_str(__entry->log),	\
> +		__entry->hdr_timestamp, &__entry->hdr_uuid, __entry->hdr_length,\
> +		show_hdr_flags(__entry->hdr_flags), __entry->hdr_handle,	\
> +		__entry->hdr_related_handle, __entry->hdr_maint_op_class,	\
> +		##__VA_ARGS__)
> +
> +TRACE_EVENT(cxl_generic_event,
> +
> +	TP_PROTO(const char *dev_name, enum cxl_event_log_type log,
> +		 struct cxl_event_record_raw *rec),
> +
> +	TP_ARGS(dev_name, log, rec),
> +
> +	TP_STRUCT__entry(
> +		CXL_EVT_TP_entry
> +		__array(u8, data, CXL_EVENT_RECORD_DATA_LENGTH)
> +	),
> +
> +	TP_fast_assign(
> +		CXL_EVT_TP_fast_assign(dev_name, log, rec->hdr);
> +		memcpy(__entry->data, &rec->data, CXL_EVENT_RECORD_DATA_LENGTH);
> +	),
> +
> +	CXL_EVT_TP_printk("%s",
> +		__print_hex(__entry->data, CXL_EVENT_RECORD_DATA_LENGTH))
> +);
> +
> +#endif /* _CXL_TRACE_EVENTS_H */
> +
> +/* This part must be outside protection */
> +#undef TRACE_INCLUDE_FILE
> +#define TRACE_INCLUDE_FILE cxl
> +#include <trace/define_trace.h>
> diff --git a/include/uapi/linux/cxl_mem.h b/include/uapi/linux/cxl_mem.h
> index c71021a2a9ed..70459be5bdd4 100644
> --- a/include/uapi/linux/cxl_mem.h
> +++ b/include/uapi/linux/cxl_mem.h
> @@ -24,6 +24,7 @@
>   	___C(IDENTIFY, "Identify Command"),                               \
>   	___C(RAW, "Raw device command"),                                  \
>   	___C(GET_SUPPORTED_LOGS, "Get Supported Logs"),                   \
> +	___C(GET_EVENT_RECORD, "Get Event Record"),                       \
>   	___C(GET_FW_INFO, "Get FW Info"),                                 \
>   	___C(GET_PARTITION_INFO, "Get Partition Information"),            \
>   	___C(GET_LSA, "Get Label Storage Area"),                          \
Jonathan Cameron Nov. 16, 2022, 3:19 p.m. UTC | #2
On Thu, 10 Nov 2022 10:57:49 -0800
ira.weiny@intel.com wrote:

> From: Ira Weiny <ira.weiny@intel.com>
> 
> CXL devices have multiple event logs which can be queried for CXL event
> records.  Devices are required to support the storage of at least one
> event record in each event log type.
> 
> Devices track event log overflow by incrementing a counter and tracking
> the time of the first and last overflow event seen.
> 
> Software queries events via the Get Event Record mailbox command; CXL
> rev 3.0 section 8.2.9.2.2.
> 
> Issue the Get Event Record mailbox command on driver load.  Trace each
> record found with a generic record trace.  Trace any overflow
> conditions.
> 
> The device can return up to 1MB worth of event records per query.  This
> presents complications with allocating a huge buffers to potentially
> capture all the records.  It is not anticipated that these event logs
> will be very deep and reading them does not need to be performant.
> Process only 3 records at a time.  3 records was chosen as it fits
> comfortably on the stack to prevent dynamic allocation while still
> cutting down on extra mailbox messages.
> 
> This patch traces a raw event record only and leaves the specific event
> record types to subsequent patches.
> 
> Macros are created to use for tracing the common CXL Event header
> fields.
> 
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Hi Ira,

A question inline about whether some of the conditions you are checking
for can actually happen. Otherwise looks good to me.

Jonathan

> 
> ---
> Change from RFC v2:
> 	Support reading 3 events at once.
> 	Reverse Jonathan's suggestion and check for positive number of
> 		records.  Because the record count may have been
> 		returned as something > 3 based on what the device
> 		thinks it can send back even though the core Linux mbox
> 		processing truncates the data.
> 	Alison and Dave Jiang
> 		Change header uuid type to uuid_t for better user space
> 		processing
> 	Smita
> 		Check status reg before reading log.
> 	Steven
> 		Prefix all trace points with 'cxl_'
> 		Use static branch <trace>_enabled() calls
> 	Jonathan
> 		s/CXL_EVENT_TYPE_INFO/0
> 		s/{first,last}/{first,last}_ts
> 		Remove Reserved field from header
> 		Fix header issue for cxl_event_log_type_str()
> 
> Change from RFC:
> 	Remove redundant error message in get event records loop
> 	s/EVENT_RECORD_DATA_LENGTH/CXL_EVENT_RECORD_DATA_LENGTH
> 	Use hdr_uuid for the header UUID field
> 	Use cxl_event_log_type_str() for the trace events
> 	Create macros for the header fields and common entries of each event
> 	Add reserved buffer output dump
> 	Report error if event query fails
> 	Remove unused record_cnt variable
> 	Steven - reorder overflow record
> 		Remove NOTE about checkpatch
> 	Jonathan
> 		check for exactly 1 record
> 		s/v3.0/rev 3.0
> 		Use 3 byte fields for 24bit fields
> 		Add 3.0 Maintenance Operation Class
> 		Add Dynamic Capacity log type
> 		Fix spelling
> 	Dave Jiang/Dan/Alison
> 		s/cxl-event/cxl
> 		trace/events/cxl-events => trace/events/cxl.h
> 		s/cxl_event_overflow/overflow
> 		s/cxl_event/generic_event
> ---
>  MAINTAINERS                  |   1 +
>  drivers/cxl/core/mbox.c      |  70 +++++++++++++++++++
>  drivers/cxl/cxl.h            |   8 +++
>  drivers/cxl/cxlmem.h         |  73 ++++++++++++++++++++
>  include/trace/events/cxl.h   | 127 +++++++++++++++++++++++++++++++++++
>  include/uapi/linux/cxl_mem.h |   1 +
>  6 files changed, 280 insertions(+)
>  create mode 100644 include/trace/events/cxl.h

> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 16176b9278b4..a908b95a7de4 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c

> +static void cxl_mem_get_records_log(struct cxl_dev_state *cxlds,
> +				    enum cxl_event_log_type type)
> +{
> +	struct cxl_get_event_payload payload;
> +	u16 pl_nr;
> +
> +	do {
> +		u8 log_type = type;
> +		int rc;
> +
> +		rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_GET_EVENT_RECORD,
> +				       &log_type, sizeof(log_type),
> +				       &payload, sizeof(payload));
> +		if (rc) {
> +			dev_err(cxlds->dev, "Event log '%s': Failed to query event records : %d",
> +				cxl_event_log_type_str(type), rc);
> +			return;
> +		}
> +
> +		pl_nr = le16_to_cpu(payload.record_count);
> +		if (trace_cxl_generic_event_enabled()) {
> +			u16 nr_rec = min_t(u16, pl_nr, CXL_GET_EVENT_NR_RECORDS);

Either I'm misreading the spec, or it can't be greater than NR_RECORDS.
"The number of event records in the Event Records list...."
Event Records being the field inside this payload which is not big enough to
take more than CXL_GET_EVENT_NR_RECORDS and the intro to Get Event Records
refers to the number being restricted by the mailbox output payload provided.

I'm in favor of defense against broken hardware, but don't paper over any
such error - scream about it.

> +			int i;
> +
> +			for (i = 0; i < nr_rec; i++)
> +				trace_cxl_generic_event(dev_name(cxlds->dev),
> +							type,
> +							&payload.record[i]);
> +		}
> +
> +		if (trace_cxl_overflow_enabled() &&
> +		    (payload.flags & CXL_GET_EVENT_FLAG_OVERFLOW))
> +			trace_cxl_overflow(dev_name(cxlds->dev), type, &payload);
> +
> +	} while (pl_nr > CXL_GET_EVENT_NR_RECORDS ||

Isn't pl_nr > CXL_GET_EVENT_NR_RECORDS a hardware bug? It's the number in returned
payload not the total number.

> +		 payload.flags & CXL_GET_EVENT_FLAG_MORE_RECORDS);
> +}


> diff --git a/include/trace/events/cxl.h b/include/trace/events/cxl.h
> new file mode 100644
> index 000000000000..60dec9a84918
> --- /dev/null
> +++ b/include/trace/events/cxl.h
> @@ -0,0 +1,127 @@


> +#define CXL_EVT_TP_fast_assign(dname, l, hdr)					\
> +	__assign_str(dev_name, (dname));					\
> +	__entry->log = (l);							\
> +	memcpy(&__entry->hdr_uuid, &(hdr).id, sizeof(uuid_t));			\
> +	__entry->hdr_length = (hdr).length;					\
> +	__entry->hdr_flags = get_unaligned_le24((hdr).flags);			\
> +	__entry->hdr_handle = le16_to_cpu((hdr).handle);			\
> +	__entry->hdr_related_handle = le16_to_cpu((hdr).related_handle);	\
> +	__entry->hdr_timestamp = le64_to_cpu((hdr).timestamp);			\
> +	__entry->hdr_maint_op_class = (hdr).maint_op_class
> +
Trivial: Maybe one blank line is enough?
> +
> +#define CXL_EVT_TP_printk(fmt, ...) \
> +	TP_printk("%s log=%s : time=%llu uuid=%pUb len=%d flags='%s' "		\
> +		"handle=%x related_handle=%x maint_op_class=%u"			\
> +		" : " fmt,							\
> +		__get_str(dev_name), cxl_event_log_type_str(__entry->log),	\
> +		__entry->hdr_timestamp, &__entry->hdr_uuid, __entry->hdr_length,\
> +		show_hdr_flags(__entry->hdr_flags), __entry->hdr_handle,	\
> +		__entry->hdr_related_handle, __entry->hdr_maint_op_class,	\
> +		##__VA_ARGS__)
Ira Weiny Nov. 17, 2022, 12:47 a.m. UTC | #3
On Wed, Nov 16, 2022 at 03:19:36PM +0000, Jonathan Cameron wrote:
> On Thu, 10 Nov 2022 10:57:49 -0800
> ira.weiny@intel.com wrote:
> 
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > CXL devices have multiple event logs which can be queried for CXL event
> > records.  Devices are required to support the storage of at least one
> > event record in each event log type.
> > 
> > Devices track event log overflow by incrementing a counter and tracking
> > the time of the first and last overflow event seen.
> > 
> > Software queries events via the Get Event Record mailbox command; CXL
> > rev 3.0 section 8.2.9.2.2.
> > 
> > Issue the Get Event Record mailbox command on driver load.  Trace each
> > record found with a generic record trace.  Trace any overflow
> > conditions.
> > 
> > The device can return up to 1MB worth of event records per query.  This
> > presents complications with allocating a huge buffers to potentially
> > capture all the records.  It is not anticipated that these event logs
> > will be very deep and reading them does not need to be performant.
> > Process only 3 records at a time.  3 records was chosen as it fits
> > comfortably on the stack to prevent dynamic allocation while still
> > cutting down on extra mailbox messages.
> > 
> > This patch traces a raw event record only and leaves the specific event
> > record types to subsequent patches.
> > 
> > Macros are created to use for tracing the common CXL Event header
> > fields.
> > 
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> Hi Ira,
> 
> A question inline about whether some of the conditions you are checking
> for can actually happen. Otherwise looks good to me.
> 
> Jonathan
> 

[snip]

> > +static void cxl_mem_get_records_log(struct cxl_dev_state *cxlds,
> > +				    enum cxl_event_log_type type)
> > +{
> > +	struct cxl_get_event_payload payload;
> > +	u16 pl_nr;
> > +
> > +	do {
> > +		u8 log_type = type;
> > +		int rc;
> > +
> > +		rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_GET_EVENT_RECORD,
> > +				       &log_type, sizeof(log_type),
> > +				       &payload, sizeof(payload));
> > +		if (rc) {
> > +			dev_err(cxlds->dev, "Event log '%s': Failed to query event records : %d",
> > +				cxl_event_log_type_str(type), rc);
> > +			return;
> > +		}
> > +
> > +		pl_nr = le16_to_cpu(payload.record_count);
> > +		if (trace_cxl_generic_event_enabled()) {
> > +			u16 nr_rec = min_t(u16, pl_nr, CXL_GET_EVENT_NR_RECORDS);
> 
> Either I'm misreading the spec, or it can't be greater than NR_RECORDS.

Well...  I could have read the spec wrong as well.  But after reading very
carefully I think this is actually correct.

> "The number of event records in the Event Records list...."

Where is this quote from?  I don't see that in the spec.

> Event Records being the field inside this payload which is not big enough to
> take more than CXL_GET_EVENT_NR_RECORDS and the intro to Get Event Records
> refers to the number being restricted by the mailbox output payload provided.

My understanding is that the output payload is only limited by the Payload Size
reported in the Mailbox Capability Register.Payload Size.  (Section 8.2.8.4.3)

This can be up to 1MB.  So the device could fill up to 1MB's worth of Event
Records while still being in compliance.  The generic mailbox code in the
driver caps the data based on the size passed into cxl_mbox_send_cmd() however,
the number of records reported is not changed.

> 
> I'm in favor of defense against broken hardware, but don't paper over any
> such error - scream about it.

I don't think this is out of spec unless the device is trying to write more
than 1MB and I think the core mailbox code will scream about that.

> 
> > +			int i;
> > +
> > +			for (i = 0; i < nr_rec; i++)
> > +				trace_cxl_generic_event(dev_name(cxlds->dev),
> > +							type,
> > +							&payload.record[i]);
> > +		}
> > +
> > +		if (trace_cxl_overflow_enabled() &&
> > +		    (payload.flags & CXL_GET_EVENT_FLAG_OVERFLOW))
> > +			trace_cxl_overflow(dev_name(cxlds->dev), type, &payload);
> > +
> > +	} while (pl_nr > CXL_GET_EVENT_NR_RECORDS ||
> 
> Isn't pl_nr > CXL_GET_EVENT_NR_RECORDS a hardware bug? It's the number in returned
> payload not the total number.

I don't think so.  The only value passed to the device is the _input_ payload
size.  The output payload size is not passed to the device and is not included
in the Get Event Records Input Payload.  (Table 8-49)

So my previous code was wrong.  Here is an example I think which is within the
spec but would result in the more records flag not being set.

	Device log depth == 10
	nr log entries == 7
	nr log entries in 1MB ~= (1M - hdr size) / 128 ~= 8000

Device sets Output Payload.Event Record Count == 7 (which is < 8000).  Common
mailbox code truncates that to 3.  More Event Records == 0 because it sent all
7 that it had.

This code will clear 3 and read again 2 more times.

Am I reading that wrong?

> 
> > +		 payload.flags & CXL_GET_EVENT_FLAG_MORE_RECORDS);
> > +}
> 
> 
> > diff --git a/include/trace/events/cxl.h b/include/trace/events/cxl.h
> > new file mode 100644
> > index 000000000000..60dec9a84918
> > --- /dev/null
> > +++ b/include/trace/events/cxl.h
> > @@ -0,0 +1,127 @@
> 
> 
> > +#define CXL_EVT_TP_fast_assign(dname, l, hdr)					\
> > +	__assign_str(dev_name, (dname));					\
> > +	__entry->log = (l);							\
> > +	memcpy(&__entry->hdr_uuid, &(hdr).id, sizeof(uuid_t));			\
> > +	__entry->hdr_length = (hdr).length;					\
> > +	__entry->hdr_flags = get_unaligned_le24((hdr).flags);			\
> > +	__entry->hdr_handle = le16_to_cpu((hdr).handle);			\
> > +	__entry->hdr_related_handle = le16_to_cpu((hdr).related_handle);	\
> > +	__entry->hdr_timestamp = le64_to_cpu((hdr).timestamp);			\
> > +	__entry->hdr_maint_op_class = (hdr).maint_op_class
> > +
> Trivial: Maybe one blank line is enough?

Yea I'll adjust,
Ira

> > +
> > +#define CXL_EVT_TP_printk(fmt, ...) \
> > +	TP_printk("%s log=%s : time=%llu uuid=%pUb len=%d flags='%s' "		\
> > +		"handle=%x related_handle=%x maint_op_class=%u"			\
> > +		" : " fmt,							\
> > +		__get_str(dev_name), cxl_event_log_type_str(__entry->log),	\
> > +		__entry->hdr_timestamp, &__entry->hdr_uuid, __entry->hdr_length,\
> > +		show_hdr_flags(__entry->hdr_flags), __entry->hdr_handle,	\
> > +		__entry->hdr_related_handle, __entry->hdr_maint_op_class,	\
> > +		##__VA_ARGS__)
>
Jonathan Cameron Nov. 17, 2022, 10:43 a.m. UTC | #4
On Wed, 16 Nov 2022 16:47:20 -0800
Ira Weiny <ira.weiny@intel.com> wrote:

> On Wed, Nov 16, 2022 at 03:19:36PM +0000, Jonathan Cameron wrote:
> > On Thu, 10 Nov 2022 10:57:49 -0800
> > ira.weiny@intel.com wrote:
> >   
> > > From: Ira Weiny <ira.weiny@intel.com>
> > > 
> > > CXL devices have multiple event logs which can be queried for CXL event
> > > records.  Devices are required to support the storage of at least one
> > > event record in each event log type.
> > > 
> > > Devices track event log overflow by incrementing a counter and tracking
> > > the time of the first and last overflow event seen.
> > > 
> > > Software queries events via the Get Event Record mailbox command; CXL
> > > rev 3.0 section 8.2.9.2.2.
> > > 
> > > Issue the Get Event Record mailbox command on driver load.  Trace each
> > > record found with a generic record trace.  Trace any overflow
> > > conditions.
> > > 
> > > The device can return up to 1MB worth of event records per query.  This
> > > presents complications with allocating a huge buffers to potentially
> > > capture all the records.  It is not anticipated that these event logs
> > > will be very deep and reading them does not need to be performant.
> > > Process only 3 records at a time.  3 records was chosen as it fits
> > > comfortably on the stack to prevent dynamic allocation while still
> > > cutting down on extra mailbox messages.
> > > 
> > > This patch traces a raw event record only and leaves the specific event
> > > record types to subsequent patches.
> > > 
> > > Macros are created to use for tracing the common CXL Event header
> > > fields.
> > > 
> > > Cc: Steven Rostedt <rostedt@goodmis.org>
> > > Signed-off-by: Ira Weiny <ira.weiny@intel.com>  
> > 
> > Hi Ira,
> > 
> > A question inline about whether some of the conditions you are checking
> > for can actually happen. Otherwise looks good to me.
> > 
> > Jonathan
> >   
> 
> [snip]
> 
> > > +static void cxl_mem_get_records_log(struct cxl_dev_state *cxlds,
> > > +				    enum cxl_event_log_type type)
> > > +{
> > > +	struct cxl_get_event_payload payload;
> > > +	u16 pl_nr;
> > > +
> > > +	do {
> > > +		u8 log_type = type;
> > > +		int rc;
> > > +
> > > +		rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_GET_EVENT_RECORD,
> > > +				       &log_type, sizeof(log_type),
> > > +				       &payload, sizeof(payload));
> > > +		if (rc) {
> > > +			dev_err(cxlds->dev, "Event log '%s': Failed to query event records : %d",
> > > +				cxl_event_log_type_str(type), rc);
> > > +			return;
> > > +		}
> > > +
> > > +		pl_nr = le16_to_cpu(payload.record_count);
> > > +		if (trace_cxl_generic_event_enabled()) {
> > > +			u16 nr_rec = min_t(u16, pl_nr, CXL_GET_EVENT_NR_RECORDS);  
> > 
> > Either I'm misreading the spec, or it can't be greater than NR_RECORDS.  
> 
> Well...  I could have read the spec wrong as well.  But after reading very
> carefully I think this is actually correct.
> 
> > "The number of event records in the Event Records list...."  
> 
> Where is this quote from?  I don't see that in the spec.

Table 8-50 Event Record Count (the field we are reading here).

> 
> > Event Records being the field inside this payload which is not big enough to
> > take more than CXL_GET_EVENT_NR_RECORDS and the intro to Get Event Records
> > refers to the number being restricted by the mailbox output payload provided.  
> 
> My understanding is that the output payload is only limited by the Payload Size
> reported in the Mailbox Capability Register.Payload Size.  (Section 8.2.8.4.3)
> 
> This can be up to 1MB.  So the device could fill up to 1MB's worth of Event
> Records while still being in compliance.  The generic mailbox code in the
> driver caps the data based on the size passed into cxl_mbox_send_cmd() however,
> the number of records reported is not changed.

Indeed I had that wrong.  I thought we passed in an output payload length whereas
we only provide "payload length" which is defined as being the input length in 8.2.8.4.5

> 
> > 
> > I'm in favor of defense against broken hardware, but don't paper over any
> > such error - scream about it.  
> 
> I don't think this is out of spec unless the device is trying to write more
> than 1MB and I think the core mailbox code will scream about that.
> 
> >   
> > > +			int i;
> > > +
> > > +			for (i = 0; i < nr_rec; i++)
> > > +				trace_cxl_generic_event(dev_name(cxlds->dev),
> > > +							type,
> > > +							&payload.record[i]);
> > > +		}
> > > +
> > > +		if (trace_cxl_overflow_enabled() &&
> > > +		    (payload.flags & CXL_GET_EVENT_FLAG_OVERFLOW))
> > > +			trace_cxl_overflow(dev_name(cxlds->dev), type, &payload);
> > > +
> > > +	} while (pl_nr > CXL_GET_EVENT_NR_RECORDS ||  
> > 
> > Isn't pl_nr > CXL_GET_EVENT_NR_RECORDS a hardware bug? It's the number in returned
> > payload not the total number.  
> 
> I don't think so.  The only value passed to the device is the _input_ payload
> size.  The output payload size is not passed to the device and is not included
> in the Get Event Records Input Payload.  (Table 8-49)
> 
> So my previous code was wrong.  Here is an example I think which is within the
> spec but would result in the more records flag not being set.
> 
> 	Device log depth == 10
> 	nr log entries == 7
> 	nr log entries in 1MB ~= (1M - hdr size) / 128 ~= 8000
> 
> Device sets Output Payload.Event Record Count == 7 (which is < 8000).  Common
> mailbox code truncates that to 3.  More Event Records == 0 because it sent all
> 7 that it had.
> 
> This code will clear 3 and read again 2 more times.
> 
> Am I reading that wrong?

I think this is still wrong, but for a different reason. :)
If we don't clear the records and more records is set, that means it didn't
fit in the mailbox payload (potentially 1MB)  then the next read
will return the next set of records from there.

Taking this patch only, let's say the mailbox takes 4 records.
Read 1: Records 0, 1, 2, 3 More set.
   We handle 0, 1, 2
Read 2: Records 4, 5, 6 More not set.
   We handle 4, 5, 6

Record 3 is never handled.

If we add in clearing as happens later in the series, the current
assumption is that if we clear some records a subsequent read will
start again.  I'm not sure that is true. If it is spec reference needed.

So assumption is
Read 1: Records 0, 1, 2, 3 More set
  Clear 0, 1, 2
Read 2: Records 3, 4, 5, 6
  Clear 3, 4, 5 More not set, but catch it with the condition above.
Read 3: 6 only
  Clear 6

However, I think a valid implementation could do the following
(imagine a ring buffer with a pointer to the 'next' record to read out and
 each record has a 'valid' flag to deal with corner cases around
 sequences such as read log once, start reading again and some
 clears occur using handles obtained from first read - not that
 case isn't ruled out by the spec as far as I can see).

Read 1: Records 0, 1, 2, 3 More set.  'next' pointer points to record 4.
  Clear 0, 1, 2
Read 2: Records 4, 5, 6 More not set. 'next' pointer points to record 7.
  Clear 4, 5, 6

Skipping record 3.

So I think we have to absorb the full mailbox payload each time to guarantee
we don't skip events or process them out of order (which is what would happen
if we relied on a retry loop - we aren't allowed to clear them out of
order anyway 8.2.9.2.3 "Events shall be cleared in temporal order. The device
shall verify the event record handles specified in the input payload are in
temporal order. ... "). 
Obviously that temporal order thing is only relevant if we get my second
example occurring on real hardware.  I think the spec is vague enough
to allow that implementation.  Would have been easy to specify this originally
but it probably won't go in as errata so we need to cope with all the
flexibility that is present.

What fun and oh for a parameter to control how many records are returned!

Jonathan


> 
> >   
> > > +		 payload.flags & CXL_GET_EVENT_FLAG_MORE_RECORDS);
> > > +}  
> > 

>
Ira Weiny Nov. 18, 2022, 11:26 p.m. UTC | #5
On Thu, Nov 17, 2022 at 10:43:37AM +0000, Jonathan Cameron wrote:
> On Wed, 16 Nov 2022 16:47:20 -0800
> Ira Weiny <ira.weiny@intel.com> wrote:
> 
> 

[snip]

> > 
> > >   
> > > > +			int i;
> > > > +
> > > > +			for (i = 0; i < nr_rec; i++)
> > > > +				trace_cxl_generic_event(dev_name(cxlds->dev),
> > > > +							type,
> > > > +							&payload.record[i]);
> > > > +		}
> > > > +
> > > > +		if (trace_cxl_overflow_enabled() &&
> > > > +		    (payload.flags & CXL_GET_EVENT_FLAG_OVERFLOW))
> > > > +			trace_cxl_overflow(dev_name(cxlds->dev), type, &payload);
> > > > +
> > > > +	} while (pl_nr > CXL_GET_EVENT_NR_RECORDS ||  
> > > 
> > > Isn't pl_nr > CXL_GET_EVENT_NR_RECORDS a hardware bug? It's the number in returned
> > > payload not the total number.  
> > 
> > I don't think so.  The only value passed to the device is the _input_ payload
> > size.  The output payload size is not passed to the device and is not included
> > in the Get Event Records Input Payload.  (Table 8-49)
> > 
> > So my previous code was wrong.  Here is an example I think which is within the
> > spec but would result in the more records flag not being set.
> > 
> > 	Device log depth == 10
> > 	nr log entries == 7
> > 	nr log entries in 1MB ~= (1M - hdr size) / 128 ~= 8000
> > 
> > Device sets Output Payload.Event Record Count == 7 (which is < 8000).  Common
> > mailbox code truncates that to 3.  More Event Records == 0 because it sent all
> > 7 that it had.
> > 
> > This code will clear 3 and read again 2 more times.
> > 
> > Am I reading that wrong?
> 
> I think this is still wrong, but for a different reason. :)

I hope not...  :-/

> If we don't clear the records and more records is set, that means it didn't
> fit in the mailbox payload (potentially 1MB)  then the next read
> will return the next set of records from there.

That is not how I read the Get Event Records command:

From 8.2.9.2.2 Get Event Records

... "Devices shall return event records to the host in the temporal order the
device detected the events in. The event occurring the earliest in time, in the
specific event log, shall be returned first."

If item 3 below is earlier than 4 then it must be returned if we have not
cleared it.  At least that is how I read the above.  :-/

> 
> Taking this patch only, let's say the mailbox takes 4 records.
> Read 1: Records 0, 1, 2, 3 More set.
>    We handle 0, 1, 2
> Read 2: Records 4, 5, 6 More not set.
>    We handle 4, 5, 6
> 
> Record 3 is never handled.
> 
> If we add in clearing as happens later in the series,

I suppose I should squash the patches as this may not work without the
clearing.  :-/

> the current
> assumption is that if we clear some records a subsequent read will
> start again.  I'm not sure that is true. If it is spec reference needed.
> 
> So assumption is
> Read 1: Records 0, 1, 2, 3 More set
>   Clear 0, 1, 2
> Read 2: Records 3, 4, 5, 6
>   Clear 3, 4, 5 More not set, but catch it with the condition above.
> Read 3: 6 only
>   Clear 6
> 
> However, I think a valid implementation could do the following
> (imagine a ring buffer with a pointer to the 'next' record to read out and
>  each record has a 'valid' flag to deal with corner cases around
>  sequences such as read log once, start reading again and some
>  clears occur using handles obtained from first read - not that
>  case isn't ruled out by the spec as far as I can see).

I believe this is a violation because the next pointer can't be advanced until
the record is cleared.  Otherwise the device is not returning items in temporal
order based on what is in the log.

> 
> Read 1: Records 0, 1, 2, 3 More set.  'next' pointer points to record 4.
>   Clear 0, 1, 2
> Read 2: Records 4, 5, 6 More not set. 'next' pointer points to record 7.
>   Clear 4, 5, 6
> 
> Skipping record 3.
> 
> So I think we have to absorb the full mailbox payload each time to guarantee
> we don't skip events or process them out of order (which is what would happen
> if we relied on a retry loop - we aren't allowed to clear them out of
> order anyway 8.2.9.2.3 "Events shall be cleared in temporal order. The device
> shall verify the event record handles specified in the input payload are in
> temporal order. ... "). 
> Obviously that temporal order thing is only relevant if we get my second
> example occurring on real hardware.  I think the spec is vague enough
> to allow that implementation.  Would have been easy to specify this originally
> but it probably won't go in as errata so we need to cope with all the
> flexibility that is present.

:-(  Yea coulda, woulda, shoulda...  ;-)

> 
> What fun and oh for a parameter to control how many records are returned!

Yea.  But I really don't think there is a problem unless someone really take
liberty with the spec.  I think it boils down to how one interprets _when_ a
record is removed from the log.

If the record is removed when it is returned (as in your 'next' pointer
example) then why have a clear at all?  If my interpretation is correct then
the next available entry is the one which has not been cleared.  Therefore in
your example 'next' is not incremented until clear has been called.  I think
that implementation is also supported by the idea that records must be cleared
in temporal order.  Otherwise I think devices would get confused.

FWIW the qemu implementation is based on my interpretation ATM.

Ira

> 
> Jonathan
> 
> 
> > 
> > >   
> > > > +		 payload.flags & CXL_GET_EVENT_FLAG_MORE_RECORDS);
> > > > +}  
> > > 
> 
> > 
>
Jonathan Cameron Nov. 21, 2022, 10:47 a.m. UTC | #6
On Fri, 18 Nov 2022 15:26:17 -0800
Ira Weiny <ira.weiny@intel.com> wrote:

> On Thu, Nov 17, 2022 at 10:43:37AM +0000, Jonathan Cameron wrote:
> > On Wed, 16 Nov 2022 16:47:20 -0800
> > Ira Weiny <ira.weiny@intel.com> wrote:
> > 
> >   
> 
> [snip]
> 
> > >   
> > > >     
> > > > > +			int i;
> > > > > +
> > > > > +			for (i = 0; i < nr_rec; i++)
> > > > > +				trace_cxl_generic_event(dev_name(cxlds->dev),
> > > > > +							type,
> > > > > +							&payload.record[i]);
> > > > > +		}
> > > > > +
> > > > > +		if (trace_cxl_overflow_enabled() &&
> > > > > +		    (payload.flags & CXL_GET_EVENT_FLAG_OVERFLOW))
> > > > > +			trace_cxl_overflow(dev_name(cxlds->dev), type, &payload);
> > > > > +
> > > > > +	} while (pl_nr > CXL_GET_EVENT_NR_RECORDS ||    
> > > > 
> > > > Isn't pl_nr > CXL_GET_EVENT_NR_RECORDS a hardware bug? It's the number in returned
> > > > payload not the total number.    
> > > 
> > > I don't think so.  The only value passed to the device is the _input_ payload
> > > size.  The output payload size is not passed to the device and is not included
> > > in the Get Event Records Input Payload.  (Table 8-49)
> > > 
> > > So my previous code was wrong.  Here is an example I think which is within the
> > > spec but would result in the more records flag not being set.
> > > 
> > > 	Device log depth == 10
> > > 	nr log entries == 7
> > > 	nr log entries in 1MB ~= (1M - hdr size) / 128 ~= 8000
> > > 
> > > Device sets Output Payload.Event Record Count == 7 (which is < 8000).  Common
> > > mailbox code truncates that to 3.  More Event Records == 0 because it sent all
> > > 7 that it had.
> > > 
> > > This code will clear 3 and read again 2 more times.
> > > 
> > > Am I reading that wrong?  
> > 
> > I think this is still wrong, but for a different reason. :)  
> 
> I hope not...  :-/
> 
> > If we don't clear the records and more records is set, that means it didn't
> > fit in the mailbox payload (potentially 1MB)  then the next read
> > will return the next set of records from there.  
> 
> That is not how I read the Get Event Records command:
> 
> From 8.2.9.2.2 Get Event Records
> 
> ... "Devices shall return event records to the host in the temporal order the
> device detected the events in. The event occurring the earliest in time, in the
> specific event log, shall be returned first."
> 
> If item 3 below is earlier than 4 then it must be returned if we have not
> cleared it.  At least that is how I read the above.  :-/

In general that doesn't work.  Imagine we cleared no records.
In that case we'd return 4 despite there being earlier records.
There is no language to cover this particular case of clearing
part of what was returned.  The device did return the records
in temporal order, we just didn't notice some of them.

The wonders of slightly loose spec wording.  Far as I can tell
we are stuck with having to come with all things that could be
read as being valid implementations.

> 
> > 
> > Taking this patch only, let's say the mailbox takes 4 records.
> > Read 1: Records 0, 1, 2, 3 More set.
> >    We handle 0, 1, 2
> > Read 2: Records 4, 5, 6 More not set.
> >    We handle 4, 5, 6
> > 
> > Record 3 is never handled.
> > 
> > If we add in clearing as happens later in the series,  
> 
> I suppose I should squash the patches as this may not work without the
> clearing.  :-/
> 
> > the current
> > assumption is that if we clear some records a subsequent read will
> > start again.  I'm not sure that is true. If it is spec reference needed.
> > 
> > So assumption is
> > Read 1: Records 0, 1, 2, 3 More set
> >   Clear 0, 1, 2
> > Read 2: Records 3, 4, 5, 6
> >   Clear 3, 4, 5 More not set, but catch it with the condition above.
> > Read 3: 6 only
> >   Clear 6
> > 
> > However, I think a valid implementation could do the following
> > (imagine a ring buffer with a pointer to the 'next' record to read out and
> >  each record has a 'valid' flag to deal with corner cases around
> >  sequences such as read log once, start reading again and some
> >  clears occur using handles obtained from first read - not that
> >  case isn't ruled out by the spec as far as I can see).  
> 
> I believe this is a violation because the next pointer can't be advanced until
> the record is cleared.  Otherwise the device is not returning items in temporal
> order based on what is in the log.

Ah. This is where we disagree.  The temporal order is (potentially?) unconnected
from the clearing.  The device did return them in temporal order, we just didn't
take any novice of record 3 being returned.
A valid reading of that temporal order comment is actually the other way around
that the device must not reset it's idea of temporal order until all records
have been read (reading 3 twice is not in temporal order - imagine we had
read 5 each time and it becomes more obvious as the read order becomes
0,1,2,3,4,3,4,5,6,7 etc which is clearly not in temporal order by any normal
reading of the term. The more I read this, the more I think the current implementation
is not compliant with the specification at all.

I'm not seeing a spec mention of 'reseting' the ordering on clearing records
(which might have been a good thing in the first place but too late now).

> 
> > 
> > Read 1: Records 0, 1, 2, 3 More set.  'next' pointer points to record 4.
> >   Clear 0, 1, 2
> > Read 2: Records 4, 5, 6 More not set. 'next' pointer points to record 7.
> >   Clear 4, 5, 6
> > 
> > Skipping record 3.
> > 
> > So I think we have to absorb the full mailbox payload each time to guarantee
> > we don't skip events or process them out of order (which is what would happen
> > if we relied on a retry loop - we aren't allowed to clear them out of
> > order anyway 8.2.9.2.3 "Events shall be cleared in temporal order. The device
> > shall verify the event record handles specified in the input payload are in
> > temporal order. ... "). 
> > Obviously that temporal order thing is only relevant if we get my second
> > example occurring on real hardware.  I think the spec is vague enough
> > to allow that implementation.  Would have been easy to specify this originally
> > but it probably won't go in as errata so we need to cope with all the
> > flexibility that is present.  
> 
> :-(  Yea coulda, woulda, shoulda...  ;-)
> 
> > 
> > What fun and oh for a parameter to control how many records are returned!  
> 
> Yea.  But I really don't think there is a problem unless someone really take
> liberty with the spec.  I think it boils down to how one interprets _when_ a
> record is removed from the log.

This is nothing to do with removal. The wording we have is just about reading
and I think a strict reading of the spec would say your assumption of a reset of the
read pointer on clear is NOT a valid implementation.  There is separate wording
about clears being in temporal order, but that doesn't effect the Get Event
Records handling.

> 
> If the record is removed when it is returned (as in your 'next' pointer
> example) then why have a clear at all?

Because if your software crashes, you don't have a handshake to reestablish
state.  If that happens you read the whole log until MORE is not set and
then read it again to get a clean list.  It's messy situation that has
been discussed before for GET POISON LIST which has the same nasty handing
of MORE.  (look in appropriate forum for resolution to that one that we can't
yet discuss here!)

Also, allows for non destructive readback (debugging tools might take a look
having paused the normal handling).

> If my interpretation is correct then
> the next available entry is the one which has not been cleared.

If that is the case the language in "More Event Records" doesn't work
"The host should continue to retrieve records using this command, until
this indicator is no longer set by the device"

With your reading of the spec, if we clear nothing, we'd keep getting the
first set of records and only be able to read more by clearing them...


>  Therefore in
> your example 'next' is not incremented until clear has been called.  I think
> that implementation is also supported by the idea that records must be cleared
> in temporal order.  Otherwise I think devices would get confused.

Not hard for device to do this (how I now read the spec) properly.

Two pointers:
1) Next to clear: CLEAR
2) Next to read:  READ

Advance the the READ pointer on Get Event Records
For CLEAR, check that the requested clears are handled in order and that
they are before the READ pointer.

Maybe we should just take it to appropriate spec forum to seek a clarification?

Jonathan

> 
> FWIW the qemu implementation is based on my interpretation ATM.
> 
> Ira
> 
> > 
> > Jonathan
> > 
> >   
> > >   
> > > >     
> > > > > +		 payload.flags & CXL_GET_EVENT_FLAG_MORE_RECORDS);
> > > > > +}    
> > > >   
> >   
> > >   
> >
Ira Weiny Nov. 28, 2022, 11:30 p.m. UTC | #7
On Mon, Nov 21, 2022 at 10:47:14AM +0000, Jonathan Cameron wrote:
> On Fri, 18 Nov 2022 15:26:17 -0800
> Ira Weiny <ira.weiny@intel.com> wrote:
> 
> > On Thu, Nov 17, 2022 at 10:43:37AM +0000, Jonathan Cameron wrote:
> > > On Wed, 16 Nov 2022 16:47:20 -0800
> > > Ira Weiny <ira.weiny@intel.com> wrote:
> > > 
> > >   
> > 
> > [snip]
> > 
> > > >   
> > > > >     
> > > > > > +			int i;
> > > > > > +
> > > > > > +			for (i = 0; i < nr_rec; i++)
> > > > > > +				trace_cxl_generic_event(dev_name(cxlds->dev),
> > > > > > +							type,
> > > > > > +							&payload.record[i]);
> > > > > > +		}
> > > > > > +
> > > > > > +		if (trace_cxl_overflow_enabled() &&
> > > > > > +		    (payload.flags & CXL_GET_EVENT_FLAG_OVERFLOW))
> > > > > > +			trace_cxl_overflow(dev_name(cxlds->dev), type, &payload);
> > > > > > +
> > > > > > +	} while (pl_nr > CXL_GET_EVENT_NR_RECORDS ||    
> > > > > 
> > > > > Isn't pl_nr > CXL_GET_EVENT_NR_RECORDS a hardware bug? It's the number in returned
> > > > > payload not the total number.    
> > > > 
> > > > I don't think so.  The only value passed to the device is the _input_ payload
> > > > size.  The output payload size is not passed to the device and is not included
> > > > in the Get Event Records Input Payload.  (Table 8-49)
> > > > 
> > > > So my previous code was wrong.  Here is an example I think which is within the
> > > > spec but would result in the more records flag not being set.
> > > > 
> > > > 	Device log depth == 10
> > > > 	nr log entries == 7
> > > > 	nr log entries in 1MB ~= (1M - hdr size) / 128 ~= 8000
> > > > 
> > > > Device sets Output Payload.Event Record Count == 7 (which is < 8000).  Common
> > > > mailbox code truncates that to 3.  More Event Records == 0 because it sent all
> > > > 7 that it had.
> > > > 
> > > > This code will clear 3 and read again 2 more times.
> > > > 
> > > > Am I reading that wrong?  
> > > 
> > > I think this is still wrong, but for a different reason. :)  
> > 
> > I hope not...  :-/
> > 
> > > If we don't clear the records and more records is set, that means it didn't
> > > fit in the mailbox payload (potentially 1MB)  then the next read
> > > will return the next set of records from there.  
> > 
> > That is not how I read the Get Event Records command:
> > 
> > From 8.2.9.2.2 Get Event Records
> > 
> > ... "Devices shall return event records to the host in the temporal order the
> > device detected the events in. The event occurring the earliest in time, in the
> > specific event log, shall be returned first."
> > 
> > If item 3 below is earlier than 4 then it must be returned if we have not
> > cleared it.  At least that is how I read the above.  :-/
> 
> In general that doesn't work.  Imagine we cleared no records.
> In that case we'd return 4 despite there being earlier records.
> There is no language to cover this particular case of clearing
> part of what was returned.  The device did return the records
> in temporal order, we just didn't notice some of them.
> 
> The wonders of slightly loose spec wording.  Far as I can tell
> we are stuck with having to come with all things that could be
> read as being valid implementations.

So I've been thinking about this for a while.

Lets take this example:

> > > 
> > > Taking this patch only, let's say the mailbox takes 4 records.
> > > Read 1: Records 0, 1, 2, 3 More set.
> > >    We handle 0, 1, 2
> > > Read 2: Records 4, 5, 6 More not set.
> > >    We handle 4, 5, 6
> > > 

In this case what happens if you do a 3rd read?  Does the device return
nothing?  Or does it return 0, 1, 2, 3 again?

It must start from the beginning right?  But that is no longer in temporal
order by your definition either.

And if it returns nothing then there is no way to recover them except on device
reset?

FWIW I'm altering the patch set to do what you say and allocate a buffer large
enough to get all the records.  Because I am thinking you are correct.

However, considering the buffer may be large, I fear we may run afoul of memory
allocation failures.  And that will require some more tricky error recovery to
continue reading the log because the irq settings state:

"... Settings: Specifies the settings for the interrupt when the <event> event
log transitions from having no entries to having one or more entries."
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This means that no more interrupts will happen until the log is empty and
additional events occur.  So if an allocation failure happens I'll have to put
a task on a work queue to wake up and continue to try.  Otherwise the log will
stall.  Or we could just put a WARN_ON_ONCE() in and hope this never happens...

I still believe that with a clear operation defined my method makes more sense.
But I agree with you that the language is not strong.

:-(

> > > Record 3 is never handled.
> > > 
> > > If we add in clearing as happens later in the series,  
> > 
> > I suppose I should squash the patches as this may not work without the
> > clearing.  :-/
> > 
> > > the current
> > > assumption is that if we clear some records a subsequent read will
> > > start again.  I'm not sure that is true. If it is spec reference needed.
> > > 
> > > So assumption is
> > > Read 1: Records 0, 1, 2, 3 More set
> > >   Clear 0, 1, 2
> > > Read 2: Records 3, 4, 5, 6
> > >   Clear 3, 4, 5 More not set, but catch it with the condition above.
> > > Read 3: 6 only
> > >   Clear 6
> > > 
> > > However, I think a valid implementation could do the following
> > > (imagine a ring buffer with a pointer to the 'next' record to read out and
> > >  each record has a 'valid' flag to deal with corner cases around
> > >  sequences such as read log once, start reading again and some
> > >  clears occur using handles obtained from first read - not that
> > >  case isn't ruled out by the spec as far as I can see).  
> > 
> > I believe this is a violation because the next pointer can't be advanced until
> > the record is cleared.  Otherwise the device is not returning items in temporal
> > order based on what is in the log.
> 
> Ah. This is where we disagree.  The temporal order is (potentially?) unconnected
> from the clearing.  The device did return them in temporal order, we just didn't
> take any novice of record 3 being returned.

:-/

> A valid reading of that temporal order comment is actually the other way around
> that the device must not reset it's idea of temporal order until all records
> have been read (reading 3 twice is not in temporal order - imagine we had
> read 5 each time and it becomes more obvious as the read order becomes
> 0,1,2,3,4,3,4,5,6,7 etc which is clearly not in temporal order by any normal
> reading of the term.

Well I guess.  My reading was that it must return the first element temporally
within the list at the time of the Get operation.

So in this example since 3 is still in the list it must return it first.  Each
read is considered atomic from the others.  Yes as long as 0 is in the queue it
will be returned.

But I can see it your way too...

>
> The more I read this, the more I think the current implementation
> is not compliant with the specification at all.
> 
> I'm not seeing a spec mention of 'reseting' the ordering on clearing records
> (which might have been a good thing in the first place but too late now).

There is no resetting of order.  Only that the device does not consider the
previous reads on determining which events to return on any individual Get
call.

> 
> > 
> > > 
> > > Read 1: Records 0, 1, 2, 3 More set.  'next' pointer points to record 4.
> > >   Clear 0, 1, 2
> > > Read 2: Records 4, 5, 6 More not set. 'next' pointer points to record 7.
> > >   Clear 4, 5, 6
> > > 
> > > Skipping record 3.
> > > 
> > > So I think we have to absorb the full mailbox payload each time to guarantee
> > > we don't skip events or process them out of order (which is what would happen
> > > if we relied on a retry loop - we aren't allowed to clear them out of
> > > order anyway 8.2.9.2.3 "Events shall be cleared in temporal order. The device
> > > shall verify the event record handles specified in the input payload are in
> > > temporal order. ... "). 
> > > Obviously that temporal order thing is only relevant if we get my second
> > > example occurring on real hardware.  I think the spec is vague enough
> > > to allow that implementation.  Would have been easy to specify this originally
> > > but it probably won't go in as errata so we need to cope with all the
> > > flexibility that is present.  
> > 
> > :-(  Yea coulda, woulda, shoulda...  ;-)
> > 
> > > 
> > > What fun and oh for a parameter to control how many records are returned!  
> > 
> > Yea.  But I really don't think there is a problem unless someone really take
> > liberty with the spec.  I think it boils down to how one interprets _when_ a
> > record is removed from the log.
> 
> This is nothing to do with removal. The wording we have is just about reading
> and I think a strict reading of the spec would say your assumption of a reset of the
> read pointer on clear is NOT a valid implementation.  There is separate wording
> about clears being in temporal order, but that doesn't effect the Get Event
> Records handling.
> 
> > 
> > If the record is removed when it is returned (as in your 'next' pointer
> > example) then why have a clear at all?
> 
> Because if your software crashes, you don't have a handshake to reestablish
> state.  If that happens you read the whole log until MORE is not set and
> then read it again to get a clean list.  It's messy situation that has
> been discussed before for GET POISON LIST which has the same nasty handing
> of MORE.  (look in appropriate forum for resolution to that one that we can't
> yet discuss here!)

I can see the similarities but I think events are a more ephemeral item which
makes sense to clear once they are consumed.  The idea that they should be left
for others to consume does not make sense to me.  Where Poison is something
which could be a permanent marker which should be left in a list.

> 
> Also, allows for non destructive readback (debugging tools might take a look
> having paused the normal handling).

That is true.

> 
> > If my interpretation is correct then
> > the next available entry is the one which has not been cleared.
> 
> If that is the case the language in "More Event Records" doesn't work
> "The host should continue to retrieve records using this command, until
> this indicator is no longer set by the device"
> 
> With your reading of the spec, if we clear nothing, we'd keep getting the
> first set of records and only be able to read more by clearing them...
> 

Yea.

> 
> >  Therefore in
> > your example 'next' is not incremented until clear has been called.  I think
> > that implementation is also supported by the idea that records must be cleared
> > in temporal order.  Otherwise I think devices would get confused.
> 
> Not hard for device to do this (how I now read the spec) properly.
> 
> Two pointers:
> 1) Next to clear: CLEAR
> 2) Next to read:  READ
> 
> Advance the the READ pointer on Get Event Records

And loop back to the start on a further read...  I'm looking at changing the
code for this but I think making it fully robust under a memory allocation
failure is going to be more tedious or we punt.

> For CLEAR, check that the requested clears are handled in order and that
> they are before the READ pointer.
> 
> Maybe we should just take it to appropriate spec forum to seek a clarification?

Probably.  I've not paid attention lately.

I've sent a separate email with you cc'ed.  Perhaps we can get some
clarification before I completely rework this.

Ira

> 
> Jonathan
> 
> > 
> > FWIW the qemu implementation is based on my interpretation ATM.
> > 
> > Ira
> > 
> > > 
> > > Jonathan
> > > 
> > >   
> > > >   
> > > > >     
> > > > > > +		 payload.flags & CXL_GET_EVENT_FLAG_MORE_RECORDS);
> > > > > > +}    
> > > > >   
> > >   
> > > >   
> > >   
>
Jonathan Cameron Nov. 29, 2022, 12:26 p.m. UTC | #8
On Mon, 28 Nov 2022 15:30:12 -0800
Ira Weiny <ira.weiny@intel.com> wrote:

> On Mon, Nov 21, 2022 at 10:47:14AM +0000, Jonathan Cameron wrote:
> > On Fri, 18 Nov 2022 15:26:17 -0800
> > Ira Weiny <ira.weiny@intel.com> wrote:
> >   
> > > On Thu, Nov 17, 2022 at 10:43:37AM +0000, Jonathan Cameron wrote:  
> > > > On Wed, 16 Nov 2022 16:47:20 -0800
> > > > Ira Weiny <ira.weiny@intel.com> wrote:
> > > > 
> > > >     
> > > 
> > > [snip]
> > >   
> > > > >     
> > > > > >       
> > > > > > > +			int i;
> > > > > > > +
> > > > > > > +			for (i = 0; i < nr_rec; i++)
> > > > > > > +				trace_cxl_generic_event(dev_name(cxlds->dev),
> > > > > > > +							type,
> > > > > > > +							&payload.record[i]);
> > > > > > > +		}
> > > > > > > +
> > > > > > > +		if (trace_cxl_overflow_enabled() &&
> > > > > > > +		    (payload.flags & CXL_GET_EVENT_FLAG_OVERFLOW))
> > > > > > > +			trace_cxl_overflow(dev_name(cxlds->dev), type, &payload);
> > > > > > > +
> > > > > > > +	} while (pl_nr > CXL_GET_EVENT_NR_RECORDS ||      
> > > > > > 
> > > > > > Isn't pl_nr > CXL_GET_EVENT_NR_RECORDS a hardware bug? It's the number in returned
> > > > > > payload not the total number.      
> > > > > 
> > > > > I don't think so.  The only value passed to the device is the _input_ payload
> > > > > size.  The output payload size is not passed to the device and is not included
> > > > > in the Get Event Records Input Payload.  (Table 8-49)
> > > > > 
> > > > > So my previous code was wrong.  Here is an example I think which is within the
> > > > > spec but would result in the more records flag not being set.
> > > > > 
> > > > > 	Device log depth == 10
> > > > > 	nr log entries == 7
> > > > > 	nr log entries in 1MB ~= (1M - hdr size) / 128 ~= 8000
> > > > > 
> > > > > Device sets Output Payload.Event Record Count == 7 (which is < 8000).  Common
> > > > > mailbox code truncates that to 3.  More Event Records == 0 because it sent all
> > > > > 7 that it had.
> > > > > 
> > > > > This code will clear 3 and read again 2 more times.
> > > > > 
> > > > > Am I reading that wrong?    
> > > > 
> > > > I think this is still wrong, but for a different reason. :)    
> > > 
> > > I hope not...  :-/
> > >   
> > > > If we don't clear the records and more records is set, that means it didn't
> > > > fit in the mailbox payload (potentially 1MB)  then the next read
> > > > will return the next set of records from there.    
> > > 
> > > That is not how I read the Get Event Records command:
> > > 
> > > From 8.2.9.2.2 Get Event Records
> > > 
> > > ... "Devices shall return event records to the host in the temporal order the
> > > device detected the events in. The event occurring the earliest in time, in the
> > > specific event log, shall be returned first."
> > > 
> > > If item 3 below is earlier than 4 then it must be returned if we have not
> > > cleared it.  At least that is how I read the above.  :-/  
> > 
> > In general that doesn't work.  Imagine we cleared no records.
> > In that case we'd return 4 despite there being earlier records.
> > There is no language to cover this particular case of clearing
> > part of what was returned.  The device did return the records
> > in temporal order, we just didn't notice some of them.
> > 
> > The wonders of slightly loose spec wording.  Far as I can tell
> > we are stuck with having to come with all things that could be
> > read as being valid implementations.  
> 
> So I've been thinking about this for a while.
> 
> Lets take this example:
> 
> > > > 
> > > > Taking this patch only, let's say the mailbox takes 4 records.
> > > > Read 1: Records 0, 1, 2, 3 More set.
> > > >    We handle 0, 1, 2
> > > > Read 2: Records 4, 5, 6 More not set.
> > > >    We handle 4, 5, 6
> > > >   
> 
> In this case what happens if you do a 3rd read?  Does the device return
> nothing?  Or does it return 0, 1, 2, 3 again?
> 
> It must start from the beginning right?  But that is no longer in temporal
> order by your definition either.

Agreed that is not clearly specified either.  I assume it works the same
way as poison where we raised the question and conclusion was it starts again
at the beginning. In fact we have to loop twice to guarantee that we have
all the records (as other software may have crashed half way through reading
the poison list so we don't know if we have the first record or not)..

> 
> And if it returns nothing then there is no way to recover them except on device
> reset?
> 
> FWIW I'm altering the patch set to do what you say and allocate a buffer large
> enough to get all the records.  Because I am thinking you are correct.

Horrible, but maybe the best we can do (subject to suggested hack below ;)

> 
> However, considering the buffer may be large, I fear we may run afoul of memory
> allocation failures.  And that will require some more tricky error recovery to
> continue reading the log because the irq settings state:
> 

We could implement cleverer mailbox handling to avoid the large allocation requirement.
Would be messy though as we'd effectively have to lock the mailbox whilst we did
multiple reads of the content into a smaller buffer.

> "... Settings: Specifies the settings for the interrupt when the <event> event
> log transitions from having no entries to having one or more entries."
>                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> This means that no more interrupts will happen until the log is empty and
> additional events occur.  So if an allocation failure happens I'll have to put
> a task on a work queue to wake up and continue to try.  Otherwise the log will
> stall.  Or we could just put a WARN_ON_ONCE() in and hope this never happens...

I think the WARN_ON_ONCE() is probably fine.  If we are paranoid vmalloc one
when we initially connect device as failure less likely...

As a side note, seems like we should maybe take a request to SSWG for devices
to optionally be told to use a smaller mailbox than they support - in order to allow
for corners like this. There is such a command but it's prohibited on primary and
secondary mailboxes (Set Response Message Limit).  That is allowed on switch CCIs,
I guess because it is assumed they may be connected to a BMC without much memory.

> 
> I still believe that with a clear operation defined my method makes more sense.
> But I agree with you that the language is not strong.

Absolutely agree!  Your method would be the one I'd push for if we were starting
from scratch (or another similar method looking like what I can't talk about
for a similar case...)

> 
> :-(
> 
> > > > Record 3 is never handled.
> > > > 
> > > > If we add in clearing as happens later in the series,    
> > > 
> > > I suppose I should squash the patches as this may not work without the
> > > clearing.  :-/
> > >   
> > > > the current
> > > > assumption is that if we clear some records a subsequent read will
> > > > start again.  I'm not sure that is true. If it is spec reference needed.
> > > > 
> > > > So assumption is
> > > > Read 1: Records 0, 1, 2, 3 More set
> > > >   Clear 0, 1, 2
> > > > Read 2: Records 3, 4, 5, 6
> > > >   Clear 3, 4, 5 More not set, but catch it with the condition above.
> > > > Read 3: 6 only
> > > >   Clear 6
> > > > 
> > > > However, I think a valid implementation could do the following
> > > > (imagine a ring buffer with a pointer to the 'next' record to read out and
> > > >  each record has a 'valid' flag to deal with corner cases around
> > > >  sequences such as read log once, start reading again and some
> > > >  clears occur using handles obtained from first read - not that
> > > >  case isn't ruled out by the spec as far as I can see).    
> > > 
> > > I believe this is a violation because the next pointer can't be advanced until
> > > the record is cleared.  Otherwise the device is not returning items in temporal
> > > order based on what is in the log.  
> > 
> > Ah. This is where we disagree.  The temporal order is (potentially?) unconnected
> > from the clearing.  The device did return them in temporal order, we just didn't
> > take any novice of record 3 being returned.  
> 
> :-/
> 
> > A valid reading of that temporal order comment is actually the other way around
> > that the device must not reset it's idea of temporal order until all records
> > have been read (reading 3 twice is not in temporal order - imagine we had
> > read 5 each time and it becomes more obvious as the read order becomes
> > 0,1,2,3,4,3,4,5,6,7 etc which is clearly not in temporal order by any normal
> > reading of the term.  
> 
> Well I guess.  My reading was that it must return the first element temporally
> within the list at the time of the Get operation.
> 
> So in this example since 3 is still in the list it must return it first.  Each
> read is considered atomic from the others.  Yes as long as 0 is in the queue it
> will be returned.
> 
> But I can see it your way too...

That pesky text under More Event Records flag doesn't mention clearing when it
says "The host should continue to retrieve 
records using this command, until this indicator is no longer set by the 
device."

I wish it did :(

> 
> >
> > The more I read this, the more I think the current implementation
> > is not compliant with the specification at all.
> > 
> > I'm not seeing a spec mention of 'reseting' the ordering on clearing records
> > (which might have been a good thing in the first place but too late now).  
> 
> There is no resetting of order.  Only that the device does not consider the
> previous reads on determining which events to return on any individual Get
> call.

Sure, see above quote though. 

> 
> >   
> > >   
> > > > 
> > > > Read 1: Records 0, 1, 2, 3 More set.  'next' pointer points to record 4.
> > > >   Clear 0, 1, 2
> > > > Read 2: Records 4, 5, 6 More not set. 'next' pointer points to record 7.
> > > >   Clear 4, 5, 6
> > > > 
> > > > Skipping record 3.
> > > > 
> > > > So I think we have to absorb the full mailbox payload each time to guarantee
> > > > we don't skip events or process them out of order (which is what would happen
> > > > if we relied on a retry loop - we aren't allowed to clear them out of
> > > > order anyway 8.2.9.2.3 "Events shall be cleared in temporal order. The device
> > > > shall verify the event record handles specified in the input payload are in
> > > > temporal order. ... "). 
> > > > Obviously that temporal order thing is only relevant if we get my second
> > > > example occurring on real hardware.  I think the spec is vague enough
> > > > to allow that implementation.  Would have been easy to specify this originally
> > > > but it probably won't go in as errata so we need to cope with all the
> > > > flexibility that is present.    
> > > 
> > > :-(  Yea coulda, woulda, shoulda...  ;-)
> > >   
> > > > 
> > > > What fun and oh for a parameter to control how many records are returned!    
> > > 
> > > Yea.  But I really don't think there is a problem unless someone really take
> > > liberty with the spec.  I think it boils down to how one interprets _when_ a
> > > record is removed from the log.  
> > 
> > This is nothing to do with removal. The wording we have is just about reading
> > and I think a strict reading of the spec would say your assumption of a reset of the
> > read pointer on clear is NOT a valid implementation.  There is separate wording
> > about clears being in temporal order, but that doesn't effect the Get Event
> > Records handling.
> >   
> > > 
> > > If the record is removed when it is returned (as in your 'next' pointer
> > > example) then why have a clear at all?  
> > 
> > Because if your software crashes, you don't have a handshake to reestablish
> > state.  If that happens you read the whole log until MORE is not set and
> > then read it again to get a clean list.  It's messy situation that has
> > been discussed before for GET POISON LIST which has the same nasty handing
> > of MORE.  (look in appropriate forum for resolution to that one that we can't
> > yet discuss here!)  
> 
> I can see the similarities but I think events are a more ephemeral item which
> makes sense to clear once they are consumed.  The idea that they should be left
> for others to consume does not make sense to me.  Where Poison is something
> which could be a permanent marker which should be left in a list.

Agreed - but sections use same wording for the More flag.. So we need to interpret
the same.

> 
> > 
> > Also, allows for non destructive readback (debugging tools might take a look
> > having paused the normal handling).  
> 
> That is true.
> 
> >   
> > > If my interpretation is correct then
> > > the next available entry is the one which has not been cleared.  
> > 
> > If that is the case the language in "More Event Records" doesn't work
> > "The host should continue to retrieve records using this command, until
> > this indicator is no longer set by the device"
> > 
> > With your reading of the spec, if we clear nothing, we'd keep getting the
> > first set of records and only be able to read more by clearing them...
> >   
> 
> Yea.
> 
> >   
> > >  Therefore in
> > > your example 'next' is not incremented until clear has been called.  I think
> > > that implementation is also supported by the idea that records must be cleared
> > > in temporal order.  Otherwise I think devices would get confused.  
> > 
> > Not hard for device to do this (how I now read the spec) properly.
> > 
> > Two pointers:
> > 1) Next to clear: CLEAR
> > 2) Next to read:  READ
> > 
> > Advance the the READ pointer on Get Event Records  
> 
> And loop back to the start on a further read...  I'm looking at changing the
> code for this but I think making it fully robust under a memory allocation
> failure is going to be more tedious or we punt.

If we get a memory allocation failure, perhaps we could do the follow horrible hack.

1 Allocate a small buffer.
2 Read once.
3 Hopefully we get the full record - in which case success.
4 Clear those records.
5 If not dealt with all records - read again until More Event Records not set
  (may already not be if it fitted in the buffer)
6 Go back to 2.

If we think a valid implementation might reset the read pointer on clear then
there is a variant where we make use of the fact the handles are constant
 - read 3 records, clear 2 and then use the handle of remaining one to identify
   if we have the next 3 to clear or not... 

> 
> > For CLEAR, check that the requested clears are handled in order and that
> > they are before the READ pointer.
> > 
> > Maybe we should just take it to appropriate spec forum to seek a clarification?  
> 
> Probably.  I've not paid attention lately.
> 
> I've sent a separate email with you cc'ed.  Perhaps we can get some
> clarification before I completely rework this.

Fingers crossed.

Thanks,

Jonathan

> 
> Ira
> 
> > 
> > Jonathan
> >   
> > > 
> > > FWIW the qemu implementation is based on my interpretation ATM.
> > > 
> > > Ira
> > >   
> > > > 
> > > > Jonathan
> > > > 
> > > >     
> > > > >     
> > > > > >       
> > > > > > > +		 payload.flags & CXL_GET_EVENT_FLAG_MORE_RECORDS);
> > > > > > > +}      
> > > > > >     
> > > >     
> > > > >     
> > > >     
> >
Ira Weiny Nov. 30, 2022, 5:09 a.m. UTC | #9
On Tue, Nov 29, 2022 at 12:26:20PM +0000, Jonathan Cameron wrote:
> On Mon, 28 Nov 2022 15:30:12 -0800
> Ira Weiny <ira.weiny@intel.com> wrote:
> 

[snip]

> > > A valid reading of that temporal order comment is actually the other way around
> > > that the device must not reset it's idea of temporal order until all records
> > > have been read (reading 3 twice is not in temporal order - imagine we had
> > > read 5 each time and it becomes more obvious as the read order becomes
> > > 0,1,2,3,4,3,4,5,6,7 etc which is clearly not in temporal order by any normal
> > > reading of the term.  
> > 
> > Well I guess.  My reading was that it must return the first element temporally
> > within the list at the time of the Get operation.
> > 
> > So in this example since 3 is still in the list it must return it first.  Each
> > read is considered atomic from the others.  Yes as long as 0 is in the queue it
> > will be returned.
> > 
> > But I can see it your way too...
> 
> That pesky text under More Event Records flag doesn't mention clearing when it
> says "The host should continue to retrieve 
> records using this command, until this indicator is no longer set by the 
> device."
> 
> I wish it did :(
> 

As I have reviewed these in my head again I have come to the conclusion that
the More Event Records flags is useless.  Let me explain:

The Clear all Records flag is useless because if an event which occurs between the
Get and Clear all operation will be dropped without the host having seen it.

However, while clearing records based on the handles read, additional events
could come in.  Because of the way the interrupts are specified the host can't
be sure that those new events will cause a zero to non-zero transition.  This
is because there is no way to guarantee all the events were cleared at the
moment the events came in.

I believe this is what you mentioned in another email about needing an 'extra
read' at the end to ensure there was nothing more to be read.  But based on
that logic the only thing that matters is the Get Event.Record
Count.  If it is not 0 keep on reading because while the host is clearing the
records another event could come in.

In other words, the only way to be sure that all records are seen is to do a
Get and see the number of records equal to 0.  Thus any further events will
trigger an interrupt and we can safely exit the loop.

Ira

Basically the loop looks like:

	int nr_rec;

	do {
		... <Get Events> ...

		nr_rec = le16_to_cpu(payload->record_count);

		... <for each record trace> ...
		... <for each record clear> ...

	} while (nr_rec);
Jonathan Cameron Nov. 30, 2022, 2:05 p.m. UTC | #10
On Tue, 29 Nov 2022 21:09:58 -0800
Ira Weiny <ira.weiny@intel.com> wrote:

> On Tue, Nov 29, 2022 at 12:26:20PM +0000, Jonathan Cameron wrote:
> > On Mon, 28 Nov 2022 15:30:12 -0800
> > Ira Weiny <ira.weiny@intel.com> wrote:
> >   
> 
> [snip]
> 
> > > > A valid reading of that temporal order comment is actually the other way around
> > > > that the device must not reset it's idea of temporal order until all records
> > > > have been read (reading 3 twice is not in temporal order - imagine we had
> > > > read 5 each time and it becomes more obvious as the read order becomes
> > > > 0,1,2,3,4,3,4,5,6,7 etc which is clearly not in temporal order by any normal
> > > > reading of the term.    
> > > 
> > > Well I guess.  My reading was that it must return the first element temporally
> > > within the list at the time of the Get operation.
> > > 
> > > So in this example since 3 is still in the list it must return it first.  Each
> > > read is considered atomic from the others.  Yes as long as 0 is in the queue it
> > > will be returned.
> > > 
> > > But I can see it your way too...  
> > 
> > That pesky text under More Event Records flag doesn't mention clearing when it
> > says "The host should continue to retrieve 
> > records using this command, until this indicator is no longer set by the 
> > device."
> > 
> > I wish it did :(
> >   
> 
> As I have reviewed these in my head again I have come to the conclusion that
> the More Event Records flags is useless.  Let me explain:
> 
> The Clear all Records flag is useless because if an event which occurs between the
> Get and Clear all operation will be dropped without the host having seen it.

Can still be used to get a known clean sheet if you don't care about a bunch
of records on initial boot because no data in flight yet etc.
Agreed it is no use if you care about content of the records.

Make sure interrupts are enabled before re-checking if there are new records
to close that race.

> 
> However, while clearing records based on the handles read, additional events
> could come in.  Because of the way the interrupts are specified the host can't
> be sure that those new events will cause a zero to non-zero transition.  This
> is because there is no way to guarantee all the events were cleared at the
> moment the events came in.
> 
> I believe this is what you mentioned in another email about needing an 'extra
> read' at the end to ensure there was nothing more to be read.  But based on
> that logic the only thing that matters is the Get Event.Record
> Count.  If it is not 0 keep on reading because while the host is clearing the
> records another event could come in.
> 
> In other words, the only way to be sure that all records are seen is to do a
> Get and see the number of records equal to 0.  Thus any further events will
> trigger an interrupt and we can safely exit the loop.

Agreed - standard race to close when ever we have a FIFO with edge interrupts
on how full it is.

More records is useful for a different potential pattern of non destructive
read and later clear.  Or for a debug non destructive read.


	int nr_rec;
	<list>

round_we_go:
	do {
		... <for each record trace and add to list...> ...	
		... 
	} while (!MORE);

	for_each_list_entry() {
		clear records one at a time.
	}

	nr_rec = le16_to_cpu(payload->record_count);
	if (nr_rec)
		goto round_we_go;

	...

> 
> Ira
> 
> Basically the loop looks like:
> 
> 	int nr_rec;
> 
> 	do {
> 		... <Get Events> ...
> 
> 		nr_rec = le16_to_cpu(payload->record_count);
> 
> 		... <for each record trace> ...
> 		... <for each record clear> ...
> 
> 	} while (nr_rec);
>
diff mbox series

Patch

diff --git a/MAINTAINERS b/MAINTAINERS
index ca063a504026..4b7c6e3055c6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5223,6 +5223,7 @@  M:	Dan Williams <dan.j.williams@intel.com>
 L:	linux-cxl@vger.kernel.org
 S:	Maintained
 F:	drivers/cxl/
+F:	include/trace/events/cxl.h
 F:	include/uapi/linux/cxl_mem.h
 
 CONEXANT ACCESSRUNNER USB DRIVER
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 16176b9278b4..a908b95a7de4 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -7,6 +7,9 @@ 
 #include <cxlmem.h>
 #include <cxl.h>
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/cxl.h>
+
 #include "core.h"
 
 static bool cxl_raw_allow_all;
@@ -48,6 +51,7 @@  static struct cxl_mem_command cxl_mem_commands[CXL_MEM_COMMAND_ID_MAX] = {
 	CXL_CMD(RAW, CXL_VARIABLE_PAYLOAD, CXL_VARIABLE_PAYLOAD, 0),
 #endif
 	CXL_CMD(GET_SUPPORTED_LOGS, 0, CXL_VARIABLE_PAYLOAD, CXL_CMD_FLAG_FORCE_ENABLE),
+	CXL_CMD(GET_EVENT_RECORD, 1, CXL_VARIABLE_PAYLOAD, 0),
 	CXL_CMD(GET_FW_INFO, 0, 0x50, 0),
 	CXL_CMD(GET_PARTITION_INFO, 0, 0x20, 0),
 	CXL_CMD(GET_LSA, 0x8, CXL_VARIABLE_PAYLOAD, 0),
@@ -704,6 +708,72 @@  int cxl_enumerate_cmds(struct cxl_dev_state *cxlds)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL);
 
+static void cxl_mem_get_records_log(struct cxl_dev_state *cxlds,
+				    enum cxl_event_log_type type)
+{
+	struct cxl_get_event_payload payload;
+	u16 pl_nr;
+
+	do {
+		u8 log_type = type;
+		int rc;
+
+		rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_GET_EVENT_RECORD,
+				       &log_type, sizeof(log_type),
+				       &payload, sizeof(payload));
+		if (rc) {
+			dev_err(cxlds->dev, "Event log '%s': Failed to query event records : %d",
+				cxl_event_log_type_str(type), rc);
+			return;
+		}
+
+		pl_nr = le16_to_cpu(payload.record_count);
+		if (trace_cxl_generic_event_enabled()) {
+			u16 nr_rec = min_t(u16, pl_nr, CXL_GET_EVENT_NR_RECORDS);
+			int i;
+
+			for (i = 0; i < nr_rec; i++)
+				trace_cxl_generic_event(dev_name(cxlds->dev),
+							type,
+							&payload.record[i]);
+		}
+
+		if (trace_cxl_overflow_enabled() &&
+		    (payload.flags & CXL_GET_EVENT_FLAG_OVERFLOW))
+			trace_cxl_overflow(dev_name(cxlds->dev), type, &payload);
+
+	} while (pl_nr > CXL_GET_EVENT_NR_RECORDS ||
+		 payload.flags & CXL_GET_EVENT_FLAG_MORE_RECORDS);
+}
+
+/**
+ * cxl_mem_get_event_records - Get Event Records from the device
+ * @cxlds: The device data for the operation
+ *
+ * Retrieve all event records available on the device and report them as trace
+ * events.
+ *
+ * See CXL rev 3.0 @8.2.9.2.2 Get Event Records
+ */
+void cxl_mem_get_event_records(struct cxl_dev_state *cxlds)
+{
+	u32 status = readl(cxlds->regs.status + CXLDEV_DEV_EVENT_STATUS_OFFSET);
+
+	dev_dbg(cxlds->dev, "Reading event logs: %x\n", status);
+
+	if (status & CXLDEV_EVENT_STATUS_INFO)
+		cxl_mem_get_records_log(cxlds, CXL_EVENT_TYPE_INFO);
+	if (status & CXLDEV_EVENT_STATUS_WARN)
+		cxl_mem_get_records_log(cxlds, CXL_EVENT_TYPE_WARN);
+	if (status & CXLDEV_EVENT_STATUS_FAIL)
+		cxl_mem_get_records_log(cxlds, CXL_EVENT_TYPE_FAIL);
+	if (status & CXLDEV_EVENT_STATUS_FATAL)
+		cxl_mem_get_records_log(cxlds, CXL_EVENT_TYPE_FATAL);
+	if (status & CXLDEV_EVENT_STATUS_DYNAMIC_CAP)
+		cxl_mem_get_records_log(cxlds, CXL_EVENT_TYPE_DYNAMIC_CAP);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_mem_get_event_records, CXL);
+
 /**
  * cxl_mem_get_partition_info - Get partition info
  * @cxlds: The device data for the operation
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index f680450f0b16..492cff1bea6d 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -132,6 +132,14 @@  static inline int ways_to_cxl(unsigned int ways, u8 *iw)
 #define CXLDEV_CAP_CAP_ID_SECONDARY_MAILBOX 0x3
 #define CXLDEV_CAP_CAP_ID_MEMDEV 0x4000
 
+/* CXL 3.0 8.2.8.3.1 Event Status Register */
+#define CXLDEV_DEV_EVENT_STATUS_OFFSET		0x00
+#define CXLDEV_EVENT_STATUS_INFO		BIT(0)
+#define CXLDEV_EVENT_STATUS_WARN		BIT(1)
+#define CXLDEV_EVENT_STATUS_FAIL		BIT(2)
+#define CXLDEV_EVENT_STATUS_FATAL		BIT(3)
+#define CXLDEV_EVENT_STATUS_DYNAMIC_CAP		BIT(4)
+
 /* CXL 2.0 8.2.8.4 Mailbox Registers */
 #define CXLDEV_MBOX_CAPS_OFFSET 0x00
 #define   CXLDEV_MBOX_CAP_PAYLOAD_SIZE_MASK GENMASK(4, 0)
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index b7b955ded3ac..da64ba0f156b 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -4,6 +4,7 @@ 
 #define __CXL_MEM_H__
 #include <uapi/linux/cxl_mem.h>
 #include <linux/cdev.h>
+#include <linux/uuid.h>
 #include "cxl.h"
 
 /* CXL 2.0 8.2.8.5.1.1 Memory Device Status Register */
@@ -256,6 +257,7 @@  struct cxl_dev_state {
 enum cxl_opcode {
 	CXL_MBOX_OP_INVALID		= 0x0000,
 	CXL_MBOX_OP_RAW			= CXL_MBOX_OP_INVALID,
+	CXL_MBOX_OP_GET_EVENT_RECORD	= 0x0100,
 	CXL_MBOX_OP_GET_FW_INFO		= 0x0200,
 	CXL_MBOX_OP_ACTIVATE_FW		= 0x0202,
 	CXL_MBOX_OP_GET_SUPPORTED_LOGS	= 0x0400,
@@ -325,6 +327,76 @@  struct cxl_mbox_identify {
 	u8 qos_telemetry_caps;
 } __packed;
 
+/*
+ * Common Event Record Format
+ * CXL rev 3.0 section 8.2.9.2.1; Table 8-42
+ */
+struct cxl_event_record_hdr {
+	uuid_t id;
+	u8 length;
+	u8 flags[3];
+	__le16 handle;
+	__le16 related_handle;
+	__le64 timestamp;
+	u8 maint_op_class;
+	u8 reserved[0xf];
+} __packed;
+
+#define CXL_EVENT_RECORD_DATA_LENGTH 0x50
+struct cxl_event_record_raw {
+	struct cxl_event_record_hdr hdr;
+	u8 data[CXL_EVENT_RECORD_DATA_LENGTH];
+} __packed;
+
+/*
+ * Get Event Records output payload
+ * CXL rev 3.0 section 8.2.9.2.2; Table 8-50
+ */
+#define CXL_GET_EVENT_FLAG_OVERFLOW		BIT(0)
+#define CXL_GET_EVENT_FLAG_MORE_RECORDS		BIT(1)
+#define CXL_GET_EVENT_NR_RECORDS		3
+struct cxl_get_event_payload {
+	u8 flags;
+	u8 reserved1;
+	__le16 overflow_err_count;
+	__le64 first_overflow_timestamp;
+	__le64 last_overflow_timestamp;
+	__le16 record_count;
+	u8 reserved2[0xa];
+	struct cxl_event_record_raw record[CXL_GET_EVENT_NR_RECORDS];
+} __packed;
+
+/*
+ * CXL rev 3.0 section 8.2.9.2.2; Table 8-49
+ */
+enum cxl_event_log_type {
+	CXL_EVENT_TYPE_INFO = 0x00,
+	CXL_EVENT_TYPE_WARN,
+	CXL_EVENT_TYPE_FAIL,
+	CXL_EVENT_TYPE_FATAL,
+	CXL_EVENT_TYPE_DYNAMIC_CAP,
+	CXL_EVENT_TYPE_MAX
+};
+
+static inline const char *cxl_event_log_type_str(enum cxl_event_log_type type)
+{
+	switch (type) {
+	case CXL_EVENT_TYPE_INFO:
+		return "Informational";
+	case CXL_EVENT_TYPE_WARN:
+		return "Warning";
+	case CXL_EVENT_TYPE_FAIL:
+		return "Failure";
+	case CXL_EVENT_TYPE_FATAL:
+		return "Fatal";
+	case CXL_EVENT_TYPE_DYNAMIC_CAP:
+		return "Dynamic Capacity";
+	default:
+		break;
+	}
+	return "<unknown>";
+}
+
 struct cxl_mbox_get_partition_info {
 	__le64 active_volatile_cap;
 	__le64 active_persistent_cap;
@@ -384,6 +456,7 @@  int cxl_mem_create_range_info(struct cxl_dev_state *cxlds);
 struct cxl_dev_state *cxl_dev_state_create(struct device *dev);
 void set_exclusive_cxl_commands(struct cxl_dev_state *cxlds, unsigned long *cmds);
 void clear_exclusive_cxl_commands(struct cxl_dev_state *cxlds, unsigned long *cmds);
+void cxl_mem_get_event_records(struct cxl_dev_state *cxlds);
 #ifdef CONFIG_CXL_SUSPEND
 void cxl_mem_active_inc(void);
 void cxl_mem_active_dec(void);
diff --git a/include/trace/events/cxl.h b/include/trace/events/cxl.h
new file mode 100644
index 000000000000..60dec9a84918
--- /dev/null
+++ b/include/trace/events/cxl.h
@@ -0,0 +1,127 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM cxl
+
+#if !defined(_CXL_TRACE_EVENTS_H) ||  defined(TRACE_HEADER_MULTI_READ)
+#define _CXL_TRACE_EVENTS_H
+
+#include <asm-generic/unaligned.h>
+#include <linux/tracepoint.h>
+#include <cxlmem.h>
+
+TRACE_EVENT(cxl_overflow,
+
+	TP_PROTO(const char *dev_name, enum cxl_event_log_type log,
+		 struct cxl_get_event_payload *payload),
+
+	TP_ARGS(dev_name, log, payload),
+
+	TP_STRUCT__entry(
+		__string(dev_name, dev_name)
+		__field(int, log)
+		__field(u64, first_ts)
+		__field(u64, last_ts)
+		__field(u16, count)
+	),
+
+	TP_fast_assign(
+		__assign_str(dev_name, dev_name);
+		__entry->log = log;
+		__entry->count = le16_to_cpu(payload->overflow_err_count);
+		__entry->first_ts = le64_to_cpu(payload->first_overflow_timestamp);
+		__entry->last_ts = le64_to_cpu(payload->last_overflow_timestamp);
+	),
+
+	TP_printk("%s: EVENT LOG OVERFLOW log=%s : %u records from %llu to %llu",
+		__get_str(dev_name), cxl_event_log_type_str(__entry->log),
+		__entry->count, __entry->first_ts, __entry->last_ts)
+
+);
+
+/*
+ * Common Event Record Format
+ * CXL 3.0 section 8.2.9.2.1; Table 8-42
+ */
+#define CXL_EVENT_RECORD_FLAG_PERMANENT		BIT(2)
+#define CXL_EVENT_RECORD_FLAG_MAINT_NEEDED	BIT(3)
+#define CXL_EVENT_RECORD_FLAG_PERF_DEGRADED	BIT(4)
+#define CXL_EVENT_RECORD_FLAG_HW_REPLACE	BIT(5)
+#define show_hdr_flags(flags)	__print_flags(flags, " | ",			   \
+	{ CXL_EVENT_RECORD_FLAG_PERMANENT,	"Permanent Condition"		}, \
+	{ CXL_EVENT_RECORD_FLAG_MAINT_NEEDED,	"Maintenance Needed"		}, \
+	{ CXL_EVENT_RECORD_FLAG_PERF_DEGRADED,	"Performance Degraded"		}, \
+	{ CXL_EVENT_RECORD_FLAG_HW_REPLACE,	"Hardware Replacement Needed"	}  \
+)
+
+/*
+ * Define macros for the common header of each CXL event.
+ *
+ * Tracepoints using these macros must do 3 things:
+ *
+ *	1) Add CXL_EVT_TP_entry to TP_STRUCT__entry
+ *	2) Use CXL_EVT_TP_fast_assign within TP_fast_assign;
+ *	   pass the dev_name, log, and CXL event header
+ *	3) Use CXL_EVT_TP_printk() instead of TP_printk()
+ *
+ * See the generic_event tracepoint as an example.
+ */
+#define CXL_EVT_TP_entry					\
+	__string(dev_name, dev_name)				\
+	__field(int, log)					\
+	__field_struct(uuid_t, hdr_uuid)			\
+	__field(u32, hdr_flags)					\
+	__field(u16, hdr_handle)				\
+	__field(u16, hdr_related_handle)			\
+	__field(u64, hdr_timestamp)				\
+	__field(u8, hdr_length)					\
+	__field(u8, hdr_maint_op_class)
+
+#define CXL_EVT_TP_fast_assign(dname, l, hdr)					\
+	__assign_str(dev_name, (dname));					\
+	__entry->log = (l);							\
+	memcpy(&__entry->hdr_uuid, &(hdr).id, sizeof(uuid_t));			\
+	__entry->hdr_length = (hdr).length;					\
+	__entry->hdr_flags = get_unaligned_le24((hdr).flags);			\
+	__entry->hdr_handle = le16_to_cpu((hdr).handle);			\
+	__entry->hdr_related_handle = le16_to_cpu((hdr).related_handle);	\
+	__entry->hdr_timestamp = le64_to_cpu((hdr).timestamp);			\
+	__entry->hdr_maint_op_class = (hdr).maint_op_class
+
+
+#define CXL_EVT_TP_printk(fmt, ...) \
+	TP_printk("%s log=%s : time=%llu uuid=%pUb len=%d flags='%s' "		\
+		"handle=%x related_handle=%x maint_op_class=%u"			\
+		" : " fmt,							\
+		__get_str(dev_name), cxl_event_log_type_str(__entry->log),	\
+		__entry->hdr_timestamp, &__entry->hdr_uuid, __entry->hdr_length,\
+		show_hdr_flags(__entry->hdr_flags), __entry->hdr_handle,	\
+		__entry->hdr_related_handle, __entry->hdr_maint_op_class,	\
+		##__VA_ARGS__)
+
+TRACE_EVENT(cxl_generic_event,
+
+	TP_PROTO(const char *dev_name, enum cxl_event_log_type log,
+		 struct cxl_event_record_raw *rec),
+
+	TP_ARGS(dev_name, log, rec),
+
+	TP_STRUCT__entry(
+		CXL_EVT_TP_entry
+		__array(u8, data, CXL_EVENT_RECORD_DATA_LENGTH)
+	),
+
+	TP_fast_assign(
+		CXL_EVT_TP_fast_assign(dev_name, log, rec->hdr);
+		memcpy(__entry->data, &rec->data, CXL_EVENT_RECORD_DATA_LENGTH);
+	),
+
+	CXL_EVT_TP_printk("%s",
+		__print_hex(__entry->data, CXL_EVENT_RECORD_DATA_LENGTH))
+);
+
+#endif /* _CXL_TRACE_EVENTS_H */
+
+/* This part must be outside protection */
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE cxl
+#include <trace/define_trace.h>
diff --git a/include/uapi/linux/cxl_mem.h b/include/uapi/linux/cxl_mem.h
index c71021a2a9ed..70459be5bdd4 100644
--- a/include/uapi/linux/cxl_mem.h
+++ b/include/uapi/linux/cxl_mem.h
@@ -24,6 +24,7 @@ 
 	___C(IDENTIFY, "Identify Command"),                               \
 	___C(RAW, "Raw device command"),                                  \
 	___C(GET_SUPPORTED_LOGS, "Get Supported Logs"),                   \
+	___C(GET_EVENT_RECORD, "Get Event Record"),                       \
 	___C(GET_FW_INFO, "Get FW Info"),                                 \
 	___C(GET_PARTITION_INFO, "Get Partition Information"),            \
 	___C(GET_LSA, "Get Label Storage Area"),                          \