[RFC,0/9] CXL: Read and clear event logs

Message ID 20220813053243.757363-1-ira.weiny@intel.com

Message

Ira Weiny Aug. 13, 2022, 5:32 a.m. UTC
From: Ira Weiny <ira.weiny@intel.com>

Event records inform the OS of various device events.  Events are not needed
for any kernel operation but various user level software will want to track
events.

Add event reporting through the trace event mechanism.  On driver load read and
clear all device events.

Normally interrupts will trigger new events to be reported as they occur.
Because the interrupt code is still being worked on this series provides a
cxl-test mechanism to create a series of events and trigger the reporting of
those events.
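
The read-and-clear-on-load flow described above can be sketched in user space
roughly as follows. This is only an illustration under stated assumptions:
`struct mock_log`, `mock_get_event_records()`, `mock_clear_event_records()` and
`mock_drain_event_log()` are hypothetical names that do not appear in the
series, the batch size is arbitrary, and `printf()` stands in for emitting a
trace event.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical mock of one device event log; not the driver's real types. */
#define MOCK_LOG_CAPACITY 8
#define MOCK_BATCH 3	/* records returned per mock "Get Event Records" call */

struct mock_event {
	uint16_t handle;	/* device-assigned record handle */
	uint8_t severity;
};

struct mock_log {
	struct mock_event rec[MOCK_LOG_CAPACITY];
	size_t nr;		/* records still pending in the mock device */
};

/* Mock "Get Event Records": copy out up to MOCK_BATCH pending records. */
static size_t mock_get_event_records(const struct mock_log *log,
				     struct mock_event *out)
{
	size_t n = log->nr < MOCK_BATCH ? log->nr : MOCK_BATCH;

	for (size_t i = 0; i < n; i++)
		out[i] = log->rec[i];
	return n;
}

/* Mock "Clear Event Records": drop the n oldest records, checking handles. */
static void mock_clear_event_records(struct mock_log *log,
				     const struct mock_event *seen, size_t n)
{
	assert(n <= log->nr);
	for (size_t i = 0; i < n; i++)
		assert(log->rec[i].handle == seen[i].handle);
	for (size_t i = n; i < log->nr; i++)
		log->rec[i - n] = log->rec[i];
	log->nr -= n;
}

/* Drain the log: report each record, then clear it. Returns total reported. */
static size_t mock_drain_event_log(struct mock_log *log)
{
	struct mock_event batch[MOCK_BATCH];
	size_t total = 0, n;

	while ((n = mock_get_event_records(log, batch)) > 0) {
		for (size_t i = 0; i < n; i++)
			printf("event handle=%u severity=%u\n",
			       (unsigned)batch[i].handle,
			       (unsigned)batch[i].severity);
		mock_clear_event_records(log, batch, n);
		total += n;
	}
	return total;
}
```

Calling `mock_drain_event_log()` once at "driver load" leaves the mock log
empty; the same loop is what an interrupt handler could reuse later.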

This series is submitted as an RFC for a few reasons:

	1) Interrupt support is still missing
	2) I'd like to get comments on the format of the trace events
	3) Some of the event formats are badly aligned and I would like to see
	   if there is any clarification on how the data will be formatted
	   (See individual patches for details)

Ira Weiny (9):
  cxl/mem: Implement Get Event Records command
  cxl/mem: Implement Clear Event Records command
  cxl/mem: Clear events on driver load
  cxl/mem: Trace General Media Event Record
  cxl/mem: Trace DRAM Event Record
  cxl/mem: Trace Memory Module Event Record
  cxl/test: Add generic mock events
  cxl/test: Add specific events
  cxl/test: Simulate event log overflow

 MAINTAINERS                       |   1 +
 drivers/cxl/core/mbox.c           | 143 ++++++++
 drivers/cxl/cxlmem.h              | 149 +++++++++
 drivers/cxl/pci.c                 |   2 +
 include/trace/events/cxl-events.h | 521 ++++++++++++++++++++++++++++++
 include/uapi/linux/cxl_mem.h      |   2 +
 tools/testing/cxl/test/mem.c      | 399 +++++++++++++++++++++++
 7 files changed, 1217 insertions(+)
 create mode 100644 include/trace/events/cxl-events.h


base-commit: 1cd8a2537eb07751d405ab7e2223f20338a90506

Comments

Davidlohr Bueso Aug. 22, 2022, 4:18 p.m. UTC | #1
On Fri, 12 Aug 2022, ira.weiny@intel.com wrote:

>From: Ira Weiny <ira.weiny@intel.com>
>
>Event records inform the OS of various device events.  Events are not needed
>for any kernel operation but various user level software will want to track
>events.
>
>Add event reporting through the trace event mechanism.  On driver load read and
>clear all device events.
>
>Normally interrupts will trigger new events to be reported as they occur.
>Because the interrupt code is still being worked on this series provides a
>cxl-test mechanism to create a series of events and trigger the reporting of
>those events.

Where is this irq code being worked on? I've asked about this for async mbox
commands, and Jonathan has also posted some code for the PMU implementation.

Could we not just start with an initial MSI/MSI-X support? Then gradually
interested users can be added? So each "feature" would need to implement
its "get message number" and to install the isr just do the standard:

      irq = pci_irq_vector(pdev, num);
      irq_name = devm_kasprintf(dev, GFP_KERNEL, "%s_%s", dev_name(dev),
			       cxl_irq_cap_table[feature].name);
      rc = devm_request_irq(dev, irq, isr_fn, IRQF_SHARED, irq_name, info);

The only complexity I see for this is to know the number of vectors to request
a priori, for which we'd have to get the largest value of all CXL features that
can support interrupts. Something like the following? One thing I have not
considered in this is the DOE stuff.

Thanks,
Davidlohr

------
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 88e3a8e54b6a..b334d2f497c1 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -245,6 +245,8 @@ struct cxl_dev_state {
	resource_size_t component_reg_phys;
	u64 serial;

+	int irq_type; /* MSI-X, MSI */
+
	struct xarray doe_mbs;

	int (*mbox_send)(struct cxl_dev_state *cxlds, struct cxl_mbox_cmd *cmd);
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index eec597dbe763..95f4b91f43b1 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -53,15 +53,6 @@
  #define	    CXL_DVSEC_REG_LOCATOR_BLOCK_ID_MASK			GENMASK(15, 8)
  #define     CXL_DVSEC_REG_LOCATOR_BLOCK_OFF_LOW_MASK		GENMASK(31, 16)

-/* Register Block Identifier (RBI) */
-enum cxl_regloc_type {
-	CXL_REGLOC_RBI_EMPTY = 0,
-	CXL_REGLOC_RBI_COMPONENT,
-	CXL_REGLOC_RBI_VIRT,
-	CXL_REGLOC_RBI_MEMDEV,
-	CXL_REGLOC_RBI_TYPES
-};
-
  static inline resource_size_t cxl_regmap_to_base(struct pci_dev *pdev,
						 struct cxl_register_map *map)
  {
@@ -75,4 +66,44 @@ int devm_cxl_port_enumerate_dports(struct cxl_port *port);
  struct cxl_dev_state;
  int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm);
  void read_cdat_data(struct cxl_port *port);
+
+#define CXL_IRQ_CAPABILITY_TABLE				\
+	C(ISOLATION, "isolation", NULL),			\
+	C(PMU, "pmu_overflow", NULL), /* per pmu instance */	\
+	C(MBOX, "mailbox", NULL), /* primary-only */		\
+	C(EVENT, "event", NULL),
+
+#undef C
+#define C(a, b, c) CXL_IRQ_CAPABILITY_##a
+enum  { CXL_IRQ_CAPABILITY_TABLE };
+#undef C
+#define C(a, b, c) { b, c }
+/**
+ * struct cxl_irq_cap - CXL feature that is capable of receiving MSI/MSI-X irqs.
+ *
+ * @name: Name of the device generating this interrupt.
+ * @get_max_msgnum: Get the feature's largest interrupt message number. In cases
+ *                  where there is only one instance it also indicates which
+ *                  MSI/MSI-X vector is used for the interrupt message generated
+ *                  in association with the feature. If the feature does not
+ *                  have the Interrupt Supported bit set, then return -1.
+ */
+struct cxl_irq_cap {
+	const char *name;
+	int (*get_max_msgnum)(struct cxl_dev_state *cxlds);
+};
+
+static const
+struct cxl_irq_cap cxl_irq_cap_table[] = { CXL_IRQ_CAPABILITY_TABLE };
+#undef C
+
+/* Register Block Identifier (RBI) */
+enum cxl_regloc_type {
+	CXL_REGLOC_RBI_EMPTY = 0,
+	CXL_REGLOC_RBI_COMPONENT,
+	CXL_REGLOC_RBI_VIRT,
+	CXL_REGLOC_RBI_MEMDEV,
+	CXL_REGLOC_RBI_TYPES
+};
+
  #endif /* __CXL_PCI_H__ */
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index faeb5d9d7a7a..c0fe78e0559b 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -387,6 +387,52 @@ static int cxl_setup_regs(struct pci_dev *pdev, enum cxl_regloc_type type,
	return rc;
  }

+static void cxl_pci_free_irq_vectors(void *data)
+{
+	pci_free_irq_vectors(data);
+}
+
+static int cxl_pci_alloc_irq_vectors(struct cxl_dev_state *cxlds)
+{
+	struct device *dev = cxlds->dev;
+	struct pci_dev *pdev = to_pci_dev(dev);
+	int rc, i, vectors = -1;
+
+	for (i = 0; i < ARRAY_SIZE(cxl_irq_cap_table); i++) {
+		int irq;
+
+		if (!cxl_irq_cap_table[i].get_max_msgnum)
+			continue;
+
+		irq = cxl_irq_cap_table[i].get_max_msgnum(cxlds);
+		vectors = max_t(int, irq, vectors);
+	}
+
+	if (vectors == -1)
+		return -EINVAL; /* no irq support whatsoever */
+
+	vectors++;
+	rc = pci_alloc_irq_vectors(pdev, vectors, vectors, PCI_IRQ_MSIX);
+	if (rc < 0) {
+		rc = pci_alloc_irq_vectors(pdev, vectors, vectors, PCI_IRQ_MSI);
+		if (rc < 0)
+			return rc;
+
+		cxlds->irq_type = PCI_IRQ_MSI;
+	} else {
+		cxlds->irq_type = PCI_IRQ_MSIX;
+	}
+
+	if (rc != vectors) {
+		pci_err(pdev, "Not enough interrupts; use polling where supported\n");
+		/* Some got allocated; clean them up */
+		cxl_pci_free_irq_vectors(pdev);
+		return -ENOSPC;
+	}
+
+	return devm_add_action_or_reset(dev, cxl_pci_free_irq_vectors, pdev);
+}
+
  static void cxl_pci_destroy_doe(void *mbs)
  {
	xa_destroy(mbs);
@@ -476,6 +522,9 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)

	cxlds->component_reg_phys = cxl_regmap_to_base(pdev, &map);

+	if (cxl_pci_alloc_irq_vectors(cxlds))
+		cxlds->irq_type = 0;
+
	devm_cxl_pci_create_doe(cxlds);

	rc = cxl_pci_setup_mailbox(cxlds);
Ira Weiny Aug. 22, 2022, 10:53 p.m. UTC | #2
On Mon, Aug 22, 2022 at 09:18:02AM -0700, Davidlohr Bueso wrote:
> On Fri, 12 Aug 2022, ira.weiny@intel.com wrote:
> 
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > Event records inform the OS of various device events.  Events are not needed
> > for any kernel operation but various user level software will want to track
> > events.
> > 
> > Add event reporting through the trace event mechanism.  On driver load read and
> > clear all device events.
> > 
> > Normally interrupts will trigger new events to be reported as they occur.
> > Because the interrupt code is still being worked on this series provides a
> > cxl-test mechanism to create a series of events and trigger the reporting of
> > those events.
> 
> Where is this irq code being worked on? I've asked about this for async mbox
> commands, and Jonathan has also posted some code for the PMU implementation.

I'm still trying to work out how to share irq's between PCI and CXL.  Mainly
for DOE.

I thought that we could skip IRQ support for DOE completely and this would
support your proposal below.  But I just found that:

"A device may interrupt the host when CDAT content changes using the MSI
associated with this DOE Capability instance."

So I guess it needs to be supported at some point.

> 
> Could we not just start with an initial MSI/MSI-X support? Then gradually
> interested users can be added? So each "feature" would need to implement
> its "get message number" and to install the isr just do the standard:
> 
>      irq = pci_irq_vector(pdev, num);
>      irq_name = devm_kasprintf(dev, GFP_KERNEL, "%s_%s", dev_name(dev),
> 			       cxl_irq_cap_table[feature].name);
>      rc = devm_request_irq(dev, irq, isr_fn, IRQF_SHARED, irq_name, info);
> 
> The only complexity I see for this is to know the number of vectors to request
> a priori, for which we'd have to get the largest value of all CXL features that
> can support interrupts. Something like the following?

Generally it seems ok but I have questions below.

> One thing I have not
> considered in this is the DOE stuff.

I think this is the harder thing to support because of needing to allow both
the PCI layer and the CXL layer to create irqs.  Potentially at different
times.

> 
> Thanks,
> Davidlohr
> 
> ------
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 88e3a8e54b6a..b334d2f497c1 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -245,6 +245,8 @@ struct cxl_dev_state {
> 	resource_size_t component_reg_phys;
> 	u64 serial;
> 
> +	int irq_type; /* MSI-X, MSI */
> +
> 	struct xarray doe_mbs;
> 
> 	int (*mbox_send)(struct cxl_dev_state *cxlds, struct cxl_mbox_cmd *cmd);
> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> index eec597dbe763..95f4b91f43b1 100644
> --- a/drivers/cxl/cxlpci.h
> +++ b/drivers/cxl/cxlpci.h
> @@ -53,15 +53,6 @@
>  #define	    CXL_DVSEC_REG_LOCATOR_BLOCK_ID_MASK			GENMASK(15, 8)
>  #define     CXL_DVSEC_REG_LOCATOR_BLOCK_OFF_LOW_MASK		GENMASK(31, 16)
> 
> -/* Register Block Identifier (RBI) */
> -enum cxl_regloc_type {
> -	CXL_REGLOC_RBI_EMPTY = 0,
> -	CXL_REGLOC_RBI_COMPONENT,
> -	CXL_REGLOC_RBI_VIRT,
> -	CXL_REGLOC_RBI_MEMDEV,
> -	CXL_REGLOC_RBI_TYPES
> -};

Why move this?

> -
>  static inline resource_size_t cxl_regmap_to_base(struct pci_dev *pdev,
> 						 struct cxl_register_map *map)
>  {
> @@ -75,4 +66,44 @@ int devm_cxl_port_enumerate_dports(struct cxl_port *port);
>  struct cxl_dev_state;
>  int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm);
>  void read_cdat_data(struct cxl_port *port);
> +
> +#define CXL_IRQ_CAPABILITY_TABLE				\
> +	C(ISOLATION, "isolation", NULL),			\
> +	C(PMU, "pmu_overflow", NULL), /* per pmu instance */	\
> +	C(MBOX, "mailbox", NULL), /* primary-only */		\
> +	C(EVENT, "event", NULL),

This is defining get_max_msgnum to NULL right?

> +
> +#undef C
> +#define C(a, b, c) CXL_IRQ_CAPABILITY_##a
> +enum  { CXL_IRQ_CAPABILITY_TABLE };
> +#undef C
> +#define C(a, b, c) { b, c }
> +/**
> + * struct cxl_irq_cap - CXL feature that is capable of receiving MSI/MSI-X irqs.
> + *
> + * @name: Name of the device generating this interrupt.
> + * @get_max_msgnum: Get the feature's largest interrupt message number. In cases
> + *                  where there is only one instance it also indicates which
> + *                  MSI/MSI-X vector is used for the interrupt message generated
> + *                  in association with the feature. If the feature does not
> + *                  have the Interrupt Supported bit set, then return -1.
> + */
> +struct cxl_irq_cap {
> +	const char *name;
> +	int (*get_max_msgnum)(struct cxl_dev_state *cxlds);
> +};
> +
> +static const
> +struct cxl_irq_cap cxl_irq_cap_table[] = { CXL_IRQ_CAPABILITY_TABLE };
> +#undef C

Why all this macro magic?

> +
> +/* Register Block Identifier (RBI) */
> +enum cxl_regloc_type {
> +	CXL_REGLOC_RBI_EMPTY = 0,
> +	CXL_REGLOC_RBI_COMPONENT,
> +	CXL_REGLOC_RBI_VIRT,
> +	CXL_REGLOC_RBI_MEMDEV,
> +	CXL_REGLOC_RBI_TYPES
> +};
> +
>  #endif /* __CXL_PCI_H__ */
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index faeb5d9d7a7a..c0fe78e0559b 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -387,6 +387,52 @@ static int cxl_setup_regs(struct pci_dev *pdev, enum cxl_regloc_type type,
> 	return rc;
>  }
> 
> +static void cxl_pci_free_irq_vectors(void *data)
> +{
> +	pci_free_irq_vectors(data);
> +}
> +
> +static int cxl_pci_alloc_irq_vectors(struct cxl_dev_state *cxlds)
> +{
> +	struct device *dev = cxlds->dev;
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	int rc, i, vectors = -1;
> +
> +	for (i = 0; i < ARRAY_SIZE(cxl_irq_cap_table); i++) {
> +		int irq;
> +
> +		if (!cxl_irq_cap_table[i].get_max_msgnum)
> +			continue;
> +
> +		irq = cxl_irq_cap_table[i].get_max_msgnum(cxlds);
> +		vectors = max_t(int, irq, vectors);
> +	}
> +
> +	if (vectors == -1)
> +		return -EINVAL; /* no irq support whatsoever */
> +
> +	vectors++;

This is pretty much what earlier versions of the DOE code did, with the
exception of having only one get_max_msgnum() call defined (for DOE).  But there
was a lot of debate about how to share vectors with the PCI layer.  And
eventually we got rid of it.  I'm still trying to figure it out.  Sorry for
being slow.

Perhaps we do this for this series.  However, won't we have an issue if we want
to support switch events?

Ira

> +	rc = pci_alloc_irq_vectors(pdev, vectors, vectors, PCI_IRQ_MSIX);
> +	if (rc < 0) {
> +		rc = pci_alloc_irq_vectors(pdev, vectors, vectors, PCI_IRQ_MSI);
> +		if (rc < 0)
> +			return rc;
> +
> +		cxlds->irq_type = PCI_IRQ_MSI;
> +	} else {
> +		cxlds->irq_type = PCI_IRQ_MSIX;
> +	}
> +
> +	if (rc != vectors) {
> +		pci_err(pdev, "Not enough interrupts; use polling where supported\n");
> +		/* Some got allocated; clean them up */
> +		cxl_pci_free_irq_vectors(pdev);
> +		return -ENOSPC;
> +	}
> +
> +	return devm_add_action_or_reset(dev, cxl_pci_free_irq_vectors, pdev);
> +}
> +
>  static void cxl_pci_destroy_doe(void *mbs)
>  {
> 	xa_destroy(mbs);
> @@ -476,6 +522,9 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> 
> 	cxlds->component_reg_phys = cxl_regmap_to_base(pdev, &map);
> 
> +	if (cxl_pci_alloc_irq_vectors(cxlds))
> +		cxlds->irq_type = 0;
> +
> 	devm_cxl_pci_create_doe(cxlds);
> 
> 	rc = cxl_pci_setup_mailbox(cxlds);
Davidlohr Bueso Aug. 23, 2022, 4:12 p.m. UTC | #3
On Mon, 22 Aug 2022, Ira Weiny wrote:

>Generally it seems ok but I have questions below.
>
>> One thing I have not
>> considered in this is the DOE stuff.
>
>I think this is the harder thing to support because of needing to allow both
>the PCI layer and the CXL layer to create irqs.  Potentially at different
>times.

I agree.

>> -/* Register Block Identifier (RBI) */
>> -enum cxl_regloc_type {
>> -	CXL_REGLOC_RBI_EMPTY = 0,
>> -	CXL_REGLOC_RBI_COMPONENT,
>> -	CXL_REGLOC_RBI_VIRT,
>> -	CXL_REGLOC_RBI_MEMDEV,
>> -	CXL_REGLOC_RBI_TYPES
>> -};
>
>Why move this?

That was sloppy of me, sorry. I wanted to reuse the struct cxl_dev_state forward
declaration; no idea why the diff came out that way.

>> -
>>  static inline resource_size_t cxl_regmap_to_base(struct pci_dev *pdev,
>>						 struct cxl_register_map *map)
>>  {
>> @@ -75,4 +66,44 @@ int devm_cxl_port_enumerate_dports(struct cxl_port *port);
>>  struct cxl_dev_state;
>>  int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm);
>>  void read_cdat_data(struct cxl_port *port);
>> +
>> +#define CXL_IRQ_CAPABILITY_TABLE				\
>> +	C(ISOLATION, "isolation", NULL),			\
>> +	C(PMU, "pmu_overflow", NULL), /* per pmu instance */	\
>> +	C(MBOX, "mailbox", NULL), /* primary-only */		\
>> +	C(EVENT, "event", NULL),
>
>This is defining get_max_msgnum to NULL right?

Yes. So until there are any users, everything's a nop.

>> +
>> +#undef C
>> +#define C(a, b, c) CXL_IRQ_CAPABILITY_##a
>> +enum  { CXL_IRQ_CAPABILITY_TABLE };
>> +#undef C
>> +#define C(a, b, c) { b, c }
>> +/**
>> + * struct cxl_irq_cap - CXL feature that is capable of receiving MSI/MSI-X irqs.
>> + *
>> + * @name: Name of the device generating this interrupt.
>> + * @get_max_msgnum: Get the feature's largest interrupt message number. In cases
>> + *                  where there is only one instance it also indicates which
>> + *                  MSI/MSI-X vector is used for the interrupt message generated
>> + *                  in association with the feature. If the feature does not
>> + *                  have the Interrupt Supported bit set, then return -1.
>> + */
>> +struct cxl_irq_cap {
>> +	const char *name;
>> +	int (*get_max_msgnum)(struct cxl_dev_state *cxlds);
>> +};
>> +
>> +static const
>> +struct cxl_irq_cap cxl_irq_cap_table[] = { CXL_IRQ_CAPABILITY_TABLE };
>> +#undef C
>
>Why all this macro magic?

A nifty trick Dan likes; it avoids duplicating the fields (enums + the table).
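
For readers unfamiliar with the trick, here is a minimal standalone sketch of
the same pattern (often called an X-macro). It is not the proposed kernel code:
the `DEMO_`/`demo_` names and the lookup helper are assumptions made up for
illustration, and the callback column is dropped to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Single source of truth: one C() entry per (hypothetical) feature. */
#define DEMO_IRQ_CAPABILITY_TABLE	\
	C(ISOLATION, "isolation"),	\
	C(PMU, "pmu_overflow"),		\
	C(MBOX, "mailbox"),		\
	C(EVENT, "event")

/* First expansion: each entry becomes an enumerator. */
#define C(a, b) DEMO_IRQ_CAPABILITY_##a
enum demo_irq_cap_id { DEMO_IRQ_CAPABILITY_TABLE, DEMO_IRQ_CAPABILITY_COUNT };
#undef C

/* Second expansion: each entry becomes a table initializer. */
struct demo_irq_cap {
	const char *name;
};

#define C(a, b) { b }
static const struct demo_irq_cap demo_irq_cap_table[] = {
	DEMO_IRQ_CAPABILITY_TABLE
};
#undef C

/* Map a feature name back to its enumerator; returns -1 if unknown. */
static int demo_irq_cap_find(const char *name)
{
	size_t nr = sizeof(demo_irq_cap_table) / sizeof(demo_irq_cap_table[0]);

	/* Both expansions come from the same list, so they stay in sync. */
	assert(nr == (size_t)DEMO_IRQ_CAPABILITY_COUNT);

	for (size_t i = 0; i < nr; i++)
		if (strcmp(demo_irq_cap_table[i].name, name) == 0)
			return (int)i;
	return -1;
}
```

Adding a feature means adding one `C()` line; the enum and the table both pick
it up. The cost is that the definitions are harder to grep for and read.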

>> +
>> +/* Register Block Identifier (RBI) */
>> +enum cxl_regloc_type {
>> +	CXL_REGLOC_RBI_EMPTY = 0,
>> +	CXL_REGLOC_RBI_COMPONENT,
>> +	CXL_REGLOC_RBI_VIRT,
>> +	CXL_REGLOC_RBI_MEMDEV,
>> +	CXL_REGLOC_RBI_TYPES
>> +};
>> +
>>  #endif /* __CXL_PCI_H__ */
>> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
>> index faeb5d9d7a7a..c0fe78e0559b 100644
>> --- a/drivers/cxl/pci.c
>> +++ b/drivers/cxl/pci.c
>> @@ -387,6 +387,52 @@ static int cxl_setup_regs(struct pci_dev *pdev, enum cxl_regloc_type type,
>>	return rc;
>>  }
>>
>> +static void cxl_pci_free_irq_vectors(void *data)
>> +{
>> +	pci_free_irq_vectors(data);
>> +}
>> +
>> +static int cxl_pci_alloc_irq_vectors(struct cxl_dev_state *cxlds)
>> +{
>> +	struct device *dev = cxlds->dev;
>> +	struct pci_dev *pdev = to_pci_dev(dev);
>> +	int rc, i, vectors = -1;
>> +
>> +	for (i = 0; i < ARRAY_SIZE(cxl_irq_cap_table); i++) {
>> +		int irq;
>> +
>> +		if (!cxl_irq_cap_table[i].get_max_msgnum)
>> +			continue;
>> +
>> +		irq = cxl_irq_cap_table[i].get_max_msgnum(cxlds);
>> +		vectors = max_t(int, irq, vectors);
>> +	}
>> +
>> +	if (vectors == -1)
>> +		return -EINVAL; /* no irq support whatsoever */
>> +
>> +	vectors++;
>
>This is pretty much what earlier versions of the DOE code did, with the
>exception of having only one get_max_msgnum() call defined (for DOE).  But there
>was a lot of debate about how to share vectors with the PCI layer.  And
>eventually we got rid of it.  I'm still trying to figure it out.  Sorry for
>being slow.

That makes sense, thanks for the explanation. And no, not slow; it is _I_
who needs to go re-read the DOE stuff with more attention. But while I
knew this was the hardest part, all I really wanted was basic irq
support to add to the bg cmd handling series.

>Perhaps we do this for this series.  However, won't we have an issue if we want
>to support switch events?

If possible, could you elaborate more on this?

Thanks,
Davidlohr
Jonathan Cameron Aug. 24, 2022, 10:07 a.m. UTC | #4
On Mon, 22 Aug 2022 15:53:54 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> On Mon, Aug 22, 2022 at 09:18:02AM -0700, Davidlohr Bueso wrote:
> > On Fri, 12 Aug 2022, ira.weiny@intel.com wrote:
> >   
> > > From: Ira Weiny <ira.weiny@intel.com>
> > > 
> > > Event records inform the OS of various device events.  Events are not needed
> > > for any kernel operation but various user level software will want to track
> > > events.
> > > 
> > > Add event reporting through the trace event mechanism.  On driver load read and
> > > clear all device events.
> > > 
> > > Normally interrupts will trigger new events to be reported as they occur.
> > > Because the interrupt code is still being worked on this series provides a
> > > cxl-test mechanism to create a series of events and trigger the reporting of
> > > those events.  
> > 
> > Where is this irq code being worked on? I've asked about this for async mbox
> > commands, and Jonathan has also posted some code for the PMU implementation.  
> 
> I'm still trying to work out how to share irq's between PCI and CXL.  Mainly
> for DOE.
> 
> I thought that we could skip IRQ support for DOE completely and this would
> support your proposal below.  But I just found that:
> 
> "A device may interrupt the host when CDAT content changes using the MSI
> associated with this DOE Capability instance."

As of today that doesn't work because there is no status flag anywhere to let
you know that was the interrupt source.

It's been raised in appropriate places, but I can't say any more on that
until stuff is published.

Hence I'd not worry about that corner for now.

> 
> So I guess it needs to be supported at some point.
> 
> > 
> > Could we not just start with an initial MSI/MSI-X support? Then gradually
> > > interested users can be added? So each "feature" would need to implement
> > > its "get message number" and to install the isr just do the standard:
> > 
> >      irq = pci_irq_vector(pdev, num);
> >      irq_name = devm_kasprintf(dev, GFP_KERNEL, "%s_%s\n", dev_name(dev),
> > 			       cxl_irq_cap_table[feature].name);
> >      rc = devm_request_irq(dev, irq, isr_fn, IRQF_SHARED, irq_name, info);
> > 
> > > The only complexity I see for this is to know the number of vectors to request
> > > a priori, for which we'd have to get the largest value of all CXL features that
> > can support interrupts. Something like the following?  
> 
> Generally it seems ok but I have questions below.
> 
> > One thing I have not
> > considered in this is the DOE stuff.  
> 
> I think this is the harder thing to support because of needing to allow both
> the PCI layer and the CXL layer to create irqs.  Potentially at different
> times.

My reasoning on this is that IRQ creation has to be done by
the PCI device driver.  That may result in some juggling and late starting
or indeed restarting of DOE mailboxes once we can know the list of vectors.
(e.g. query them by polling, then a later driver register can request enabling
the DOE with an irq).
Or it needs the ability to do dynamic increasing of the requested IRQ vectors.
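
A rough user-space sketch of the "start polled, enable the irq later" option,
purely to make the ordering concrete. Everything here is hypothetical: the
`demo_doe_*` names, the two-phase split and the `irqs[]` array are assumptions
for illustration, not an existing DOE or PCI API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a DOE mailbox that can run polled or irq-driven. */
enum demo_doe_mode { DEMO_DOE_POLLING, DEMO_DOE_IRQ };

struct demo_doe_mb {
	enum demo_doe_mode mode;
	int msgnum;	/* interrupt message number, or -1 if unsupported */
	int irq;	/* irq number once assigned, else -1 */
};

/* Phase 1: bring the mailbox up polled, before any vectors are allocated. */
static void demo_doe_start_polled(struct demo_doe_mb *mb, int msgnum)
{
	mb->mode = DEMO_DOE_POLLING;
	mb->msgnum = msgnum;
	mb->irq = -1;
}

/*
 * Phase 2: once the device driver has allocated nr_vectors vectors, switch
 * the mailbox to irq mode if its message number is covered. Returns 0 on
 * success, -1 if the mailbox has to stay polled.
 */
static int demo_doe_enable_irq(struct demo_doe_mb *mb, const int *irqs,
			       size_t nr_vectors)
{
	assert(mb->mode == DEMO_DOE_POLLING);
	if (mb->msgnum < 0 || (size_t)mb->msgnum >= nr_vectors)
		return -1;
	mb->irq = irqs[mb->msgnum];
	mb->mode = DEMO_DOE_IRQ;
	return 0;
}
```

A mailbox whose message number falls outside the allocated range simply keeps
polling, which is the fallback the sketch assumes.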

> 
> > 
> > Thanks,
> > Davidlohr
> > 
> > ------
> > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> > index 88e3a8e54b6a..b334d2f497c1 100644
> > --- a/drivers/cxl/cxlmem.h
> > +++ b/drivers/cxl/cxlmem.h
> > @@ -245,6 +245,8 @@ struct cxl_dev_state {
> > 	resource_size_t component_reg_phys;
> > 	u64 serial;
> > 
> > +	int irq_type; /* MSI-X, MSI */
> > +
> > 	struct xarray doe_mbs;
> > 
> > 	int (*mbox_send)(struct cxl_dev_state *cxlds, struct cxl_mbox_cmd *cmd);
> > diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> > index eec597dbe763..95f4b91f43b1 100644
> > --- a/drivers/cxl/cxlpci.h
> > +++ b/drivers/cxl/cxlpci.h
> > @@ -53,15 +53,6 @@
> >  #define	    CXL_DVSEC_REG_LOCATOR_BLOCK_ID_MASK			GENMASK(15, 8)
> >  #define     CXL_DVSEC_REG_LOCATOR_BLOCK_OFF_LOW_MASK		GENMASK(31, 16)
> > 
> > -/* Register Block Identifier (RBI) */
> > -enum cxl_regloc_type {
> > -	CXL_REGLOC_RBI_EMPTY = 0,
> > -	CXL_REGLOC_RBI_COMPONENT,
> > -	CXL_REGLOC_RBI_VIRT,
> > -	CXL_REGLOC_RBI_MEMDEV,
> > -	CXL_REGLOC_RBI_TYPES
> > -};  
> 
> Why move this?
> 
> > -
> >  static inline resource_size_t cxl_regmap_to_base(struct pci_dev *pdev,
> > 						 struct cxl_register_map *map)
> >  {
> > @@ -75,4 +66,44 @@ int devm_cxl_port_enumerate_dports(struct cxl_port *port);
> >  struct cxl_dev_state;
> >  int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm);
> >  void read_cdat_data(struct cxl_port *port);
> > +
> > +#define CXL_IRQ_CAPABILITY_TABLE				\
> > +	C(ISOLATION, "isolation", NULL),			\
> > +	C(PMU, "pmu_overflow", NULL), /* per pmu instance */	\
> > +	C(MBOX, "mailbox", NULL), /* primary-only */		\
> > +	C(EVENT, "event", NULL),  
> 
> This is defining get_max_msgnum to NULL right?
> 
> > +
> > +#undef C
> > +#define C(a, b, c) CXL_IRQ_CAPABILITY_##a
> > +enum  { CXL_IRQ_CAPABILITY_TABLE };
> > +#undef C
> > +#define C(a, b, c) { b, c }
> > +/**
> > + * struct cxl_irq_cap - CXL feature that is capable of receiving MSI/MSI-X irqs.
> > + *
> > + * @name: Name of the device generating this interrupt.
> > + * @get_max_msgnum: Get the feature's largest interrupt message number. In cases
> > + *                  where there is only one instance it also indicates which
> > + *                  MSI/MSI-X vector is used for the interrupt message generated
> > + *                  in association with the feature. If the feature does not
> > + *                  have the Interrupt Supported bit set, then return -1.
> > + */
> > +struct cxl_irq_cap {
> > +	const char *name;
> > +	int (*get_max_msgnum)(struct cxl_dev_state *cxlds);
> > +};
> > +
> > +static const
> > +struct cxl_irq_cap cxl_irq_cap_table[] = { CXL_IRQ_CAPABILITY_TABLE };
> > +#undef C  
> 
> Why all this macro magic?

Agreed. I'm rarely persuaded it's a good idea to do this sort of trickery
and it definitely isn't worth the readability problems unless there are a
large number of users.

> 
> > +
> > +/* Register Block Identifier (RBI) */
> > +enum cxl_regloc_type {
> > +	CXL_REGLOC_RBI_EMPTY = 0,
> > +	CXL_REGLOC_RBI_COMPONENT,
> > +	CXL_REGLOC_RBI_VIRT,
> > +	CXL_REGLOC_RBI_MEMDEV,
> > +	CXL_REGLOC_RBI_TYPES
> > +};
> > +
> >  #endif /* __CXL_PCI_H__ */
> > diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> > index faeb5d9d7a7a..c0fe78e0559b 100644
> > --- a/drivers/cxl/pci.c
> > +++ b/drivers/cxl/pci.c
> > @@ -387,6 +387,52 @@ static int cxl_setup_regs(struct pci_dev *pdev, enum cxl_regloc_type type,
> > 	return rc;
> >  }
> > 
> > +static void cxl_pci_free_irq_vectors(void *data)
> > +{
> > +	pci_free_irq_vectors(data);
> > +}
> > +
> > +static int cxl_pci_alloc_irq_vectors(struct cxl_dev_state *cxlds)
> > +{
> > +	struct device *dev = cxlds->dev;
> > +	struct pci_dev *pdev = to_pci_dev(dev);
> > +	int rc, i, vectors = -1;
> > +
> > +	for (i = 0; i < ARRAY_SIZE(cxl_irq_cap_table); i++) {
> > +		int irq;
> > +
> > +		if (!cxl_irq_cap_table[i].get_max_msgnum)
> > +			continue;
> > +
> > +		irq = cxl_irq_cap_table[i].get_max_msgnum(cxlds);
> > +		vectors = max_t(int, irq, vectors);
> > +	}
> > +
> > +	if (vectors == -1)
> > +		return -EINVAL; /* no irq support whatsoever */
> > +
> > +	vectors++;  
> 
> This is pretty much what earlier versions of the DOE code did, with the
> exception of having only one get_max_msgnum() call defined (for DOE).  But there
> was a lot of debate about how to share vectors with the PCI layer.  And
> eventually we got rid of it.  I'm still trying to figure it out.  Sorry for
> being slow.

I'm not yet seeing a huge advantage in wrapping this up. For now a set of
linear calls to establish the max irq vector is more readable.  Sure,
down the line moving to this may make sense.
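
To illustrate what "linear calls" could look like, here is a small user-space
sketch. The `struct demo_dev` fields and the `demo_*_msgnum()` helpers are
hypothetical stand-ins for per-feature register reads; each returns the
feature's message number, or -1 when the feature cannot interrupt.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-device state; each field is a message number or -1. */
struct demo_dev {
	int mbox_msgnum;
	int event_msgnum;
	int pmu_msgnum;
};

static int demo_max(int a, int b)
{
	return a > b ? a : b;
}

/* Hypothetical per-feature queries; -1 means no interrupt support. */
static int demo_mbox_msgnum(const struct demo_dev *d)
{
	return d->mbox_msgnum;
}

static int demo_event_msgnum(const struct demo_dev *d)
{
	return d->event_msgnum;
}

static int demo_pmu_msgnum(const struct demo_dev *d)
{
	return d->pmu_msgnum;
}

/* Number of vectors to request, or 0 if no feature can interrupt. */
static int demo_nr_irq_vectors(const struct demo_dev *d)
{
	int max = -1;

	assert(d != NULL);

	/* One explicit call per feature instead of walking a table. */
	max = demo_max(max, demo_mbox_msgnum(d));
	max = demo_max(max, demo_event_msgnum(d));
	max = demo_max(max, demo_pmu_msgnum(d));

	return max + 1;	/* message numbers are zero-based */
}
```

With only a handful of features this stays short and easy to grep; a table
starts to pay off only once the list grows.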

> 
> Perhaps we do this for this series.  However, won't we have an issue if we want
> to support switch events?

We 'could' extend existing stuff in the portdrv code (which is ultimately
where this general approach was copied from ;) but I suspect doing that
for non generic PCI stuff is going to be controversial.

That whole infrastructure in PCI may need a rewrite.

> 
> Ira
> 
> > +	rc = pci_alloc_irq_vectors(pdev, vectors, vectors, PCI_IRQ_MSIX);
> > +	if (rc < 0) {
> > +		rc = pci_alloc_irq_vectors(pdev, vectors, vectors, PCI_IRQ_MSI);
> > +		if (rc < 0)
> > +			return rc;
> > +
> > +		cxlds->irq_type = PCI_IRQ_MSI;
> > +	} else {
> > +		cxlds->irq_type = PCI_IRQ_MSIX;
> > +	}
> > +
> > +	if (rc != vectors) {
> > +		pci_err(pdev, "Not enough interrupts; use polling where supported\n");
> > +		/* Some got allocated; clean them up */
> > +		cxl_pci_free_irq_vectors(pdev);
> > +		return -ENOSPC;
> > +	}
> > +
> > +	return devm_add_action_or_reset(dev, cxl_pci_free_irq_vectors, pdev);
> > +}
> > +
> >  static void cxl_pci_destroy_doe(void *mbs)
> >  {
> > 	xa_destroy(mbs);
> > @@ -476,6 +522,9 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> > 
> > 	cxlds->component_reg_phys = cxl_regmap_to_base(pdev, &map);
> > 
> > +	if (cxl_pci_alloc_irq_vectors(cxlds))
> > +		cxlds->irq_type = 0;
> > +
> > 	devm_cxl_pci_create_doe(cxlds);
> > 
> > 	rc = cxl_pci_setup_mailbox(cxlds);
Dave Jiang Sept. 1, 2022, 6:10 p.m. UTC | #5
On 8/24/2022 3:07 AM, Jonathan Cameron wrote:
> On Mon, 22 Aug 2022 15:53:54 -0700
> Ira Weiny <ira.weiny@intel.com> wrote:
>
>> On Mon, Aug 22, 2022 at 09:18:02AM -0700, Davidlohr Bueso wrote:
>>> On Fri, 12 Aug 2022, ira.weiny@intel.com wrote:
>>>    
>>>> From: Ira Weiny <ira.weiny@intel.com>
>>>>
>>>> Event records inform the OS of various device events.  Events are not needed
>>>> for any kernel operation but various user level software will want to track
>>>> events.
>>>>
>>>> Add event reporting through the trace event mechanism.  On driver load read and
>>>> clear all device events.
>>>>
>>>> Normally interrupts will trigger new events to be reported as they occur.
>>>> Because the interrupt code is still being worked on this series provides a
>>>> cxl-test mechanism to create a series of events and trigger the reporting of
>>>> those events.
>>> Where is this irq code being worked on? I've asked about this for async mbox
>>> commands, and Jonathan has also posted some code for the PMU implementation.
>> I'm still trying to work out how to share irq's between PCI and CXL.  Mainly
>> for DOE.
>>
>> I thought that we could skip IRQ support for DOE completely and this would
>> support your proposal below.  But I just found that:
>>
>> "A device may interrupt the host when CDAT content changes using the MSI
>> associated with this DOE Capability instance."
> As of today that doesn't work because there is no status flag anywhere to let
> you know that was the interrupt source.
>
> It's been raised in appropriate places, but I can't say any more on that
> until stuff is published.
>
> Hence I'd not worry about that corner for now.
>
>> So I guess it needs to be supported at some point.
>>
>>> Could we not just start with an initial MSI/MSI-X support? Then gradually
>>> interested users can be added? So each "feature" would need to implement
>>> its "get message number" and to install the isr just do the standard:
>>>
>>>       irq = pci_irq_vector(pdev, num);
>>>       irq_name = devm_kasprintf(dev, GFP_KERNEL, "%s_%s", dev_name(dev),
>>> 			       cxl_irq_cap_table[feature].name);
>>>       rc = devm_request_irq(dev, irq, isr_fn, IRQF_SHARED, irq_name, info);
>>>
>>> The only complexity I see for this is to know the number of vectors to request
>>> a priori, for which we'd have to get the largest value of all CXL features that
>>> can support interrupts. Something like the following?
>> Generally it seems ok but I have questions below.
>>
>>> One thing I have not
>>> considered in this is the DOE stuff.
>> I think this is the harder thing to support because of needing to allow both
>> the PCI layer and the CXL layer to create irqs.  Potentially at different
>> times.
> My reasoning on this is that IRQ creation has to be done by
> the PCI device driver.  That may result in some juggling and late starting
> or indeed restarting of DOE mailboxes once we can know the list of vectors.
> (e.g. query them by polling, then a later driver register can request enabling
> the DOE with an irq).
> Or it needs the ability to dynamically increase the number of requested IRQ vectors.

tglx was working on dynamic MSI-X a while back. Not sure what the state of
that is now.

https://lore.kernel.org/lkml/87a6hof5sr.ffs@tglx/T/

DJ

>
>>> Thanks,
>>> Davidlohr
>>>
>>> ------
>>> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
>>> index 88e3a8e54b6a..b334d2f497c1 100644
>>> --- a/drivers/cxl/cxlmem.h
>>> +++ b/drivers/cxl/cxlmem.h
>>> @@ -245,6 +245,8 @@ struct cxl_dev_state {
>>> 	resource_size_t component_reg_phys;
>>> 	u64 serial;
>>>
>>> +	int irq_type; /* MSI-X, MSI */
>>> +
>>> 	struct xarray doe_mbs;
>>>
>>> 	int (*mbox_send)(struct cxl_dev_state *cxlds, struct cxl_mbox_cmd *cmd);
>>> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
>>> index eec597dbe763..95f4b91f43b1 100644
>>> --- a/drivers/cxl/cxlpci.h
>>> +++ b/drivers/cxl/cxlpci.h
>>> @@ -53,15 +53,6 @@
>>>   #define	    CXL_DVSEC_REG_LOCATOR_BLOCK_ID_MASK			GENMASK(15, 8)
>>>   #define     CXL_DVSEC_REG_LOCATOR_BLOCK_OFF_LOW_MASK		GENMASK(31, 16)
>>>
>>> -/* Register Block Identifier (RBI) */
>>> -enum cxl_regloc_type {
>>> -	CXL_REGLOC_RBI_EMPTY = 0,
>>> -	CXL_REGLOC_RBI_COMPONENT,
>>> -	CXL_REGLOC_RBI_VIRT,
>>> -	CXL_REGLOC_RBI_MEMDEV,
>>> -	CXL_REGLOC_RBI_TYPES
>>> -};
>> Why move this?
>>
>>> -
>>>   static inline resource_size_t cxl_regmap_to_base(struct pci_dev *pdev,
>>> 						 struct cxl_register_map *map)
>>>   {
>>> @@ -75,4 +66,44 @@ int devm_cxl_port_enumerate_dports(struct cxl_port *port);
>>>   struct cxl_dev_state;
>>>   int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm);
>>>   void read_cdat_data(struct cxl_port *port);
>>> +
>>> +#define CXL_IRQ_CAPABILITY_TABLE				\
>>> +	C(ISOLATION, "isolation", NULL),			\
>>> +	C(PMU, "pmu_overflow", NULL), /* per pmu instance */	\
>>> +	C(MBOX, "mailbox", NULL), /* primary-only */		\
>>> +	C(EVENT, "event", NULL),
>> This is defining get_max_msgnum to NULL right?
>>
>>> +
>>> +#undef C
>>> +#define C(a, b, c) CXL_IRQ_CAPABILITY_##a
>>> +enum  { CXL_IRQ_CAPABILITY_TABLE };
>>> +#undef C
>>> +#define C(a, b, c) { b, c }
>>> +/**
>>> + * struct cxl_irq_cap - CXL feature that is capable of receiving MSI/MSI-X irqs.
>>> + *
>>> + * @name: Name of the device generating this interrupt.
>>> + * @get_max_msgnum: Get the feature's largest interrupt message number. In cases
>>> + *                  where there is only one instance it also indicates which
>>> + *                  MSI/MSI-X vector is used for the interrupt message generated
>>> + *                  in association with the feature. If the feature does not
>>> + *                  have the Interrupt Supported bit set, then return -1.
>>> + */
>>> +struct cxl_irq_cap {
>>> +	const char *name;
>>> +	int (*get_max_msgnum)(struct cxl_dev_state *cxlds);
>>> +};
>>> +
>>> +static const
>>> +struct cxl_irq_cap cxl_irq_cap_table[] = { CXL_IRQ_CAPABILITY_TABLE };
>>> +#undef C
>> Why all this macro magic?
> Agreed. I'm rarely persuaded it's a good idea to do this sort of trickery
> and it definitely isn't worth the readability problems unless there are a
> large number of users.
>
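To make the readability trade-off concrete, here is a sketch of what the C()
table above expands to when written out by hand with designated initializers.
It also shows that the third C() argument, NULL in every row, ends up in
get_max_msgnum. The CXL_IRQ_CAPABILITY_COUNT sentinel and the bare forward
declaration of struct cxl_dev_state are assumptions added so the fragment
compiles on its own outside the kernel; they are not part of the patch.

```c
/* Opaque here; in the kernel this is defined in drivers/cxl/cxlmem.h. */
struct cxl_dev_state;

enum cxl_irq_capability {
	CXL_IRQ_CAPABILITY_ISOLATION,
	CXL_IRQ_CAPABILITY_PMU,		/* per pmu instance */
	CXL_IRQ_CAPABILITY_MBOX,	/* primary-only */
	CXL_IRQ_CAPABILITY_EVENT,
	CXL_IRQ_CAPABILITY_COUNT	/* hypothetical sentinel, not in the patch */
};

struct cxl_irq_cap {
	const char *name;
	int (*get_max_msgnum)(struct cxl_dev_state *cxlds);
};

/*
 * Members that are not named in an initializer are zero-initialized, so
 * every get_max_msgnum below is a null pointer, matching the NULL passed
 * as the third argument of each C() row.
 */
static const struct cxl_irq_cap cxl_irq_cap_table[CXL_IRQ_CAPABILITY_COUNT] = {
	[CXL_IRQ_CAPABILITY_ISOLATION]	= { .name = "isolation" },
	[CXL_IRQ_CAPABILITY_PMU]	= { .name = "pmu_overflow" },
	[CXL_IRQ_CAPABILITY_MBOX]	= { .name = "mailbox" },
	[CXL_IRQ_CAPABILITY_EVENT]	= { .name = "event" },
};
```

With every callback NULL, the loop in cxl_pci_alloc_irq_vectors() below skips
all four entries, so it would return -EINVAL until a feature fills one in.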
>>> +
>>> +/* Register Block Identifier (RBI) */
>>> +enum cxl_regloc_type {
>>> +	CXL_REGLOC_RBI_EMPTY = 0,
>>> +	CXL_REGLOC_RBI_COMPONENT,
>>> +	CXL_REGLOC_RBI_VIRT,
>>> +	CXL_REGLOC_RBI_MEMDEV,
>>> +	CXL_REGLOC_RBI_TYPES
>>> +};
>>> +
>>>   #endif /* __CXL_PCI_H__ */
>>> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
>>> index faeb5d9d7a7a..c0fe78e0559b 100644
>>> --- a/drivers/cxl/pci.c
>>> +++ b/drivers/cxl/pci.c
>>> @@ -387,6 +387,52 @@ static int cxl_setup_regs(struct pci_dev *pdev, enum cxl_regloc_type type,
>>> 	return rc;
>>>   }
>>>
>>> +static void cxl_pci_free_irq_vectors(void *data)
>>> +{
>>> +	pci_free_irq_vectors(data);
>>> +}
>>> +
>>> +static int cxl_pci_alloc_irq_vectors(struct cxl_dev_state *cxlds)
>>> +{
>>> +	struct device *dev = cxlds->dev;
>>> +	struct pci_dev *pdev = to_pci_dev(dev);
>>> +	int rc, i, vectors = -1;
>>> +
>>> +	for (i = 0; i < ARRAY_SIZE(cxl_irq_cap_table); i++) {
>>> +		int irq;
>>> +
>>> +		if (!cxl_irq_cap_table[i].get_max_msgnum)
>>> +			continue;
>>> +
>>> +		irq = cxl_irq_cap_table[i].get_max_msgnum(cxlds);
>>> +		vectors = max_t(int, irq, vectors);
>>> +	}
>>> +
>>> +	if (vectors == -1)
>>> +		return -EINVAL; /* no irq support whatsoever */
>>> +
>>> +	vectors++;
>> This is pretty much what earlier versions of the DOE code did, with the
>> exception of only having one get_max_msgnum() call defined (for DOE).  But
>> there was a lot of debate about how to share vectors with the PCI layer.  And
>> eventually we got rid of it.  I'm still trying to figure it out.  Sorry for
>> being slow.
> I'm not yet seeing a huge advantage in wrapping this up. For now a set of
> linear calls to establish the max irq vector is more readable.  Sure,
> down the line moving to this may make sense.
>
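For comparison, a minimal userspace sketch of the "linear calls" alternative.
Everything here is a hypothetical stand-in: the mock struct cxl_dev_state, its
two fields, and both helper functions are invented for illustration and do not
exist in the driver. The only convention taken from the patch is that a
feature without interrupt support reports -1.

```c
/* Mock device state; the real struct cxl_dev_state has no such fields. */
struct cxl_dev_state {
	int mbox_msgnum;	/* -1 when the mailbox has no interrupt */
	int event_msgnum;	/* -1 when event logs have no interrupt */
};

/* Hypothetical per-feature queries, one plain function per feature. */
static int cxl_mbox_max_msgnum(const struct cxl_dev_state *cxlds)
{
	return cxlds->mbox_msgnum;
}

static int cxl_event_max_msgnum(const struct cxl_dev_state *cxlds)
{
	return cxlds->event_msgnum;
}

/*
 * Return how many vectors to request, or 0 if no feature supports
 * interrupts. Message numbers are zero-based, so the highest message
 * number N needs N + 1 vectors.
 */
static int cxl_irq_vectors_needed(const struct cxl_dev_state *cxlds)
{
	int max_msgnum = -1;
	int n;

	n = cxl_mbox_max_msgnum(cxlds);
	if (n > max_msgnum)
		max_msgnum = n;

	n = cxl_event_max_msgnum(cxlds);
	if (n > max_msgnum)
		max_msgnum = n;

	return max_msgnum + 1;
}
```

Adding a feature means adding one more call and comparison, which stays
readable for a handful of users; the table only starts to pay off once the
list grows.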
>> Perhaps we do this for this series.  However, won't we have an issue if we want
>> to support switch events?
> We 'could' extend existing stuff in the portdrv code (which is ultimately
> where this general approach was copied from ;) but I suspect doing that
> for non-generic PCI stuff is going to be controversial.
>
> That whole infrastructure in PCI may need a rewrite.
>
>> Ira
>>
>>> +	rc = pci_alloc_irq_vectors(pdev, vectors, vectors, PCI_IRQ_MSIX);
>>> +	if (rc < 0) {
>>> +		rc = pci_alloc_irq_vectors(pdev, vectors, vectors, PCI_IRQ_MSI);
>>> +		if (rc < 0)
>>> +			return rc;
>>> +
>>> +		cxlds->irq_type = PCI_IRQ_MSI;
>>> +	} else {
>>> +		cxlds->irq_type = PCI_IRQ_MSIX;
>>> +	}
>>> +
>>> +	if (rc != vectors) {
>>> +		pci_err(pdev, "Not enough interrupts; use polling where supported\n");
>>> +		/* Some got allocated; clean them up */
>>> +		cxl_pci_free_irq_vectors(pdev);
>>> +		return -ENOSPC;
>>> +	}
>>> +
>>> +	return devm_add_action_or_reset(dev, cxl_pci_free_irq_vectors, pdev);
>>> +}
>>> +
>>>   static void cxl_pci_destroy_doe(void *mbs)
>>>   {
>>> 	xa_destroy(mbs);
>>> @@ -476,6 +522,9 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>>>
>>> 	cxlds->component_reg_phys = cxl_regmap_to_base(pdev, &map);
>>>
>>> +	if (cxl_pci_alloc_irq_vectors(cxlds))
>>> +		cxlds->irq_type = 0;
>>> +
>>> 	devm_cxl_pci_create_doe(cxlds);
>>>
>>> 	rc = cxl_pci_setup_mailbox(cxlds);