[13/18] cxl: Add latency and bandwidth calculations for the CXL path

Message ID 167571667794.587790.14172786993094257614.stgit@djiang5-mobl3.local (mailing list archive)
State Handled Elsewhere
Series cxl: Add support for QTG ID retrieval for CXL subsystem

Commit Message

Dave Jiang Feb. 6, 2023, 8:51 p.m. UTC
CXL Memory Device SW Guide rev1.0 2.11.2 provides instructions on how to
calculate latency and bandwidth for a CXL memory device. Calculate minimum
bandwidth and total latency for the path from the CXL device to the root
port. The calculated values are stored in the cached DSMAS entries attached
to the cxl_port of the CXL device.

For example for a device that is directly attached to a host bus:
Total Latency = Device Latency (from CDAT) + Dev to Host Bus (HB) Link
		Latency
Min Bandwidth = Link Bandwidth between Host Bus and CXL device

For a device that has a switch in between host bus and CXL device:
Total Latency = Device (CDAT) Latency + Dev to Switch Link Latency +
		Switch (CDAT) Latency + Switch to HB Link Latency
Min Bandwidth = min(dev to switch bandwidth, switch to HB bandwidth)

The internal latency for a switch can be retrieved from the CDAT of the
switch PCI device. However, since there's no easy way to retrieve that
right now on Linux, a guesstimated constant is used per switch to simplify
the driver code.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/cxl/core/port.c |   60 +++++++++++++++++++++++++++++++++++++++++++++++
 drivers/cxl/cxl.h       |    9 +++++++
 drivers/cxl/port.c      |   42 +++++++++++++++++++++++++++++++++
 3 files changed, 111 insertions(+)
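
The two formulas in the commit message can be illustrated with a small standalone sketch. This is not the code from the patch: `struct link_qos`, `path_qos()` and `SWITCH_APPROX_LAT_PS` are hypothetical names chosen for illustration, and the sketch assumes one switch sits between every pair of adjacent links, with the same per-switch placeholder latency the patch uses (5000 ps).

```c
#include <stddef.h>

/* Hypothetical description of one link on the path (not a kernel type). */
struct link_qos {
	long latency_ps;	/* link latency in picoseconds */
	long bandwidth_mbs;	/* link bandwidth in MB/s */
};

/* Assumed per-switch latency, mirroring the patch's placeholder constant. */
#define SWITCH_APPROX_LAT_PS 5000L

/*
 * Total latency = device latency + sum of link latencies
 *                 + approximate latency of each switch between links.
 * Min bandwidth = smallest link bandwidth on the path.
 * Returns 0 on success, -1 if no links are given.
 */
static int path_qos(long dev_latency_ps, const struct link_qos *links,
		    size_t nr_links, long *total_lat_ps, long *min_bw_mbs)
{
	long lat = dev_latency_ps;
	long bw;
	size_t i;

	if (!links || nr_links == 0)
		return -1;

	bw = links[0].bandwidth_mbs;
	for (i = 0; i < nr_links; i++) {
		lat += links[i].latency_ps;
		if (links[i].bandwidth_mbs < bw)
			bw = links[i].bandwidth_mbs;
	}

	/* nr_links links imply nr_links - 1 switches under the assumption above. */
	lat += SWITCH_APPROX_LAT_PS * (long)(nr_links - 1);

	*total_lat_ps = lat;
	*min_bw_mbs = bw;
	return 0;
}
```

With a single link this reduces to the directly attached case (device latency plus one link latency, bandwidth of that link). With two links it gives the one-switch case from the commit message.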

Comments

Jonathan Cameron Feb. 9, 2023, 3:24 p.m. UTC | #1
On Mon, 06 Feb 2023 13:51:19 -0700
Dave Jiang <dave.jiang@intel.com> wrote:

> CXL Memory Device SW Guide rev1.0 2.11.2 provides instruction on how to
> caluclate latency and bandwidth for CXL memory device. Calculate minimum

Spell check your descriptions (I often forget to do this as well!)
> bandwidth and total latency for the path from the CXL device to the root
> port. The calculates values are stored in the cached DSMAS entries attached
> to the cxl_port of the CXL device.
> 
> For example for a device that is directly attached to a host bus:
> Total Latency = Device Latency (from CDAT) + Dev to Host Bus (HB) Link
> 		Latency
> Min Bandwidth = Link Bandwidth between Host Bus and CXL device
> 
> For a device that has a switch in between host bus and CXL device:
> Total Latency = Device (CDAT) Latency + Dev to Switch Link Latency +
> 		Switch (CDAT) Latency + Switch to HB Link Latency

For QTG purposes, are we also supposed to take into account HB to
system interconnect type latency (or maybe nearest CPU?).
That is likely to be non trivial.

> Min Bandwidth = min(dev to switch bandwidth, switch to HB bandwidth)
> Signed-off-by: Dave Jiang <dave.jiang@intel.com>

Stray sign off.

> 
> The internal latency for a switch can be retrieved from the CDAT of the
> switch PCI device. However, since there's no easy way to retrieve that
> right now on Linux, a guesstimated constant is used per switch to simplify
> the driver code.

I'd like to see that gap closed asap. I think it is fairly obvious how to do
it, so shouldn't be too hard, just needs a dance to get the DOE for a switch
port using Lukas' updated handling of DOE mailboxes. 

> 
> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
> ---
>  drivers/cxl/core/port.c |   60 +++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/cxl/cxl.h       |    9 +++++++
>  drivers/cxl/port.c      |   42 +++++++++++++++++++++++++++++++++
>  3 files changed, 111 insertions(+)
> 
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 2b27319cfd42..aa260361ba7d 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -1899,6 +1899,66 @@ bool schedule_cxl_memdev_detach(struct cxl_memdev *cxlmd)
>  }
>  EXPORT_SYMBOL_NS_GPL(schedule_cxl_memdev_detach, CXL);
>  
> +int cxl_port_get_downstream_qos(struct cxl_port *port, long *bw, long *lat)
> +{
> +	long total_lat = 0, latency;

Similar to before, not good for readability to hide assignments in a list all on one line.
Dave Jiang Feb. 14, 2023, 11:03 p.m. UTC | #2
On 2/9/23 8:24 AM, Jonathan Cameron wrote:
> On Mon, 06 Feb 2023 13:51:19 -0700
> Dave Jiang <dave.jiang@intel.com> wrote:
> 
>> CXL Memory Device SW Guide rev1.0 2.11.2 provides instruction on how to
>> caluclate latency and bandwidth for CXL memory device. Calculate minimum
> 
> Spell check your descriptions (I often forget to do this as well!
> )
>> bandwidth and total latency for the path from the CXL device to the root
>> port. The calculates values are stored in the cached DSMAS entries attached
>> to the cxl_port of the CXL device.
>>
>> For example for a device that is directly attached to a host bus:
>> Total Latency = Device Latency (from CDAT) + Dev to Host Bus (HB) Link
>> 		Latency
>> Min Bandwidth = Link Bandwidth between Host Bus and CXL device
>>
>> For a device that has a switch in between host bus and CXL device:
>> Total Latency = Device (CDAT) Latency + Dev to Switch Link Latency +
>> 		Switch (CDAT) Latency + Switch to HB Link Latency
> 
> For QTG purposes, are we also supposed to take into account HB to
> system interconnect type latency (or maybe nearest CPU?).
> That is likely to be non trivial.

Dan brought this ECN [1] to my attention. We can add this if we can find 
a BIOS that implements the ECN. Or should we code a place holder for it 
until this is available?

[1] https://lore.kernel.org/linux-cxl/e1a52da9aec90766da5de51b1b839fd95d63a5af.camel@intel.com/

> 
>> Min Bandwidth = min(dev to switch bandwidth, switch to HB bandwidth)
>> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
> 
> Stray sign off.
> 
>>
>> The internal latency for a switch can be retrieved from the CDAT of the
>> switch PCI device. However, since there's no easy way to retrieve that
>> right now on Linux, a guesstimated constant is used per switch to simplify
>> the driver code.
> 
> I'd like to see that gap closed asap. I think it is fairly obvious how to do
> it, so shouldn't be too hard, just needs a dance to get the DOE for a switch
> port using Lukas' updated handling of DOE mailboxes.

Talked to Lukas and this may not be difficult with his latest changes. I 
can take a look. Do we support switch CDAT in QEMU yet?

> 
>>
>> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
>> ---
>>   drivers/cxl/core/port.c |   60 +++++++++++++++++++++++++++++++++++++++++++++++
>>   drivers/cxl/cxl.h       |    9 +++++++
>>   drivers/cxl/port.c      |   42 +++++++++++++++++++++++++++++++++
>>   3 files changed, 111 insertions(+)
>>
>> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
>> index 2b27319cfd42..aa260361ba7d 100644
>> --- a/drivers/cxl/core/port.c
>> +++ b/drivers/cxl/core/port.c
>> @@ -1899,6 +1899,66 @@ bool schedule_cxl_memdev_detach(struct cxl_memdev *cxlmd)
>>   }
>>   EXPORT_SYMBOL_NS_GPL(schedule_cxl_memdev_detach, CXL);
>>   
>> +int cxl_port_get_downstream_qos(struct cxl_port *port, long *bw, long *lat)
>> +{
>> +	long total_lat = 0, latency;
> 
> Similar to before, not good for readability to hide asignments in a list all on one line.
>
Jonathan Cameron Feb. 15, 2023, 1:17 p.m. UTC | #3
On Tue, 14 Feb 2023 16:03:27 -0700
Dave Jiang <dave.jiang@intel.com> wrote:

> On 2/9/23 8:24 AM, Jonathan Cameron wrote:
> > On Mon, 06 Feb 2023 13:51:19 -0700
> > Dave Jiang <dave.jiang@intel.com> wrote:
> >   
> >> CXL Memory Device SW Guide rev1.0 2.11.2 provides instruction on how to
> >> caluclate latency and bandwidth for CXL memory device. Calculate minimum  
> > 
> > Spell check your descriptions (I often forget to do this as well!
> > )  
> >> bandwidth and total latency for the path from the CXL device to the root
> >> port. The calculates values are stored in the cached DSMAS entries attached
> >> to the cxl_port of the CXL device.
> >>
> >> For example for a device that is directly attached to a host bus:
> >> Total Latency = Device Latency (from CDAT) + Dev to Host Bus (HB) Link
> >> 		Latency
> >> Min Bandwidth = Link Bandwidth between Host Bus and CXL device
> >>
> >> For a device that has a switch in between host bus and CXL device:
> >> Total Latency = Device (CDAT) Latency + Dev to Switch Link Latency +
> >> 		Switch (CDAT) Latency + Switch to HB Link Latency  
> > 
> > For QTG purposes, are we also supposed to take into account HB to
> > system interconnect type latency (or maybe nearest CPU?).
> > That is likely to be non trivial.  
> 
> Dan brought this ECN [1] to my attention. We can add this if we can find 
> a BIOS that implements the ECN. Or should we code a place holder for it 
> until this is available?
> 
> https://lore.kernel.org/linux-cxl/e1a52da9aec90766da5de51b1b839fd95d63a5af.camel@intel.com/

I've had Generic Ports on my list to add to QEMU for a while but not been
high enough priority to either do it myself, or make it someone else's problem.
I suspect the biggest barrier in QEMU is going to be the interface to add
these to the NUMA description.

It's easy enough to hand build and inject SRAT/SLIT/HMAT tables with
these in (that's how we developed the Generic Initiator support in Linux before
any BIOS support).

So I'd like to see it soon, but I'm not hugely bothered if that element
follows this patch set. However, we are potentially going to see different
decisions made when that detail is added so it 'might' count as ABI
breakage if it's not there from the start. I think we are fine, though, as
probably no BIOS implements it yet.

> 
> >   
> >> Min Bandwidth = min(dev to switch bandwidth, switch to HB bandwidth)
> >> Signed-off-by: Dave Jiang <dave.jiang@intel.com>  
> > 
> > Stray sign off.
> >   
> >>
> >> The internal latency for a switch can be retrieved from the CDAT of the
> >> switch PCI device. However, since there's no easy way to retrieve that
> >> right now on Linux, a guesstimated constant is used per switch to simplify
> >> the driver code.  
> > 
> > I'd like to see that gap closed asap. I think it is fairly obvious how to do
> > it, so shouldn't be too hard, just needs a dance to get the DOE for a switch
> > port using Lukas' updated handling of DOE mailboxes.  
> 
> Talked to Lukas and this may not be difficult with his latest changes. I 
> can take a look. Do we support switch CDAT in QEMU yet?

I started typing no, then thought I'd just check.  Seems I did write support
for CDAT on switches (and then completely forgot about it ;)
It's upstream and everything!
https://elixir.bootlin.com/qemu/latest/source/hw/pci-bridge/cxl_upstream.c#L194
Dave Jiang Feb. 15, 2023, 4:38 p.m. UTC | #4
On 2/15/23 6:17 AM, Jonathan Cameron wrote:
> On Tue, 14 Feb 2023 16:03:27 -0700
> Dave Jiang <dave.jiang@intel.com> wrote:
> 
>> On 2/9/23 8:24 AM, Jonathan Cameron wrote:
>>> On Mon, 06 Feb 2023 13:51:19 -0700
>>> Dave Jiang <dave.jiang@intel.com> wrote:
>>>    
>>>> CXL Memory Device SW Guide rev1.0 2.11.2 provides instruction on how to
>>>> caluclate latency and bandwidth for CXL memory device. Calculate minimum
>>>
>>> Spell check your descriptions (I often forget to do this as well!
>>> )
>>>> bandwidth and total latency for the path from the CXL device to the root
>>>> port. The calculates values are stored in the cached DSMAS entries attached
>>>> to the cxl_port of the CXL device.
>>>>
>>>> For example for a device that is directly attached to a host bus:
>>>> Total Latency = Device Latency (from CDAT) + Dev to Host Bus (HB) Link
>>>> 		Latency
>>>> Min Bandwidth = Link Bandwidth between Host Bus and CXL device
>>>>
>>>> For a device that has a switch in between host bus and CXL device:
>>>> Total Latency = Device (CDAT) Latency + Dev to Switch Link Latency +
>>>> 		Switch (CDAT) Latency + Switch to HB Link Latency
>>>
>>> For QTG purposes, are we also supposed to take into account HB to
>>> system interconnect type latency (or maybe nearest CPU?).
>>> That is likely to be non trivial.
>>
>> Dan brought this ECN [1] to my attention. We can add this if we can find
>> a BIOS that implements the ECN. Or should we code a place holder for it
>> until this is available?
>>
>> https://lore.kernel.org/linux-cxl/e1a52da9aec90766da5de51b1b839fd95d63a5af.camel@intel.com/
> 
> I've had Generic Ports on my list to add to QEMU for a while but not been
> high enough priority to either do it myself, or make it someone else's problem.
> I suspect the biggest barrier in QEMU is going to be the interface to add
> these to the NUMA description.
> 
> It's easy enough to hand build and inject a SRAT /SLIT/HMAT tables with
> these in (that's how we developed the Generic Initiator support in Linux before
> any BIOS support).
> 
> So I'd like to see it soon, but I'm not hugely bothered if that element
> follows this patch set. However, we are potentially going to see different
> decisions made when that detail is added so it 'might' count as ABI
> breakage if it's not there from the start. I think we are fine as probably
> no BIOS' yet though.
> 
>>
>>>    
>>>> Min Bandwidth = min(dev to switch bandwidth, switch to HB bandwidth)
>>>> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
>>>
>>> Stray sign off.
>>>    
>>>>
>>>> The internal latency for a switch can be retrieved from the CDAT of the
>>>> switch PCI device. However, since there's no easy way to retrieve that
>>>> right now on Linux, a guesstimated constant is used per switch to simplify
>>>> the driver code.
>>>
>>> I'd like to see that gap closed asap. I think it is fairly obvious how to do
>>> it, so shouldn't be too hard, just needs a dance to get the DOE for a switch
>>> port using Lukas' updated handling of DOE mailboxes.
>>
>> Talked to Lukas and this may not be difficult with his latest changes. I
>> can take a look. Do we support switch CDAT in QEMU yet?
> 
> I started typing no, then thought I'd just check.  Seems I did write support
> for CDAT on switches (and then completely forgot about it ;)
> It's upstream and everything!
> https://elixir.bootlin.com/qemu/latest/source/hw/pci-bridge/cxl_upstream.c#L194
> 
Awesome! I'll go poke around a bit. Also it's very helpful to see the 
creation code. Helped me realize that I need to support parsing of 
SSLBIS sub-table for switches. Thanks!

Patch

diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 2b27319cfd42..aa260361ba7d 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -1899,6 +1899,66 @@  bool schedule_cxl_memdev_detach(struct cxl_memdev *cxlmd)
 }
 EXPORT_SYMBOL_NS_GPL(schedule_cxl_memdev_detach, CXL);
 
+int cxl_port_get_downstream_qos(struct cxl_port *port, long *bw, long *lat)
+{
+	long total_lat = 0, latency;
+	long min_bw = INT_MAX;
+	struct pci_dev *pdev;
+	struct cxl_port *p;
+	struct device *dev;
+	int devices = 0;
+
+	/* Grab the device that is the PCI device for CXL memdev */
+	dev = port->uport->parent;
+	/* Skip if it's not PCI, most likely a cxl_test device */
+	if (!dev_is_pci(dev))
+		return 0;
+
+	pdev = to_pci_dev(dev);
+	min_bw = pcie_bandwidth_available(pdev, NULL, NULL, NULL);
+	if (min_bw == 0)
+		return -ENXIO;
+
+	/* convert to MB/s from Mb/s */
+	min_bw >>= 3;
+
+	p = port;
+	do {
+		struct cxl_dport *dport;
+
+		latency = cxl_pci_get_latency(pdev);
+		if (latency < 0)
+			return latency;
+
+		total_lat += latency;
+		devices++;
+
+		dport = p->parent_dport;
+		if (!dport)
+			break;
+
+		p = dport->port;
+		dev = p->uport;
+		if (!dev_is_pci(dev))
+			break;
+		pdev = to_pci_dev(dev);
+	} while (1);
+
+	/*
+	 * Add an approximate latency to the switch. Currently there
+	 * is no easy mechanism to read the CDAT for switches. 'devices'
+	 * should account for all the PCI devices encountered minus the
+	 * root device. So the number of switches would be 'devices - 1'
+	 * to account for the CXL device.
+	 */
+	total_lat += CXL_SWITCH_APPROX_LAT * (devices - 1);
+
+	*bw = min_bw;
+	*lat = total_lat;
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_port_get_downstream_qos, CXL);
+
 /* for user tooling to ensure port disable work has completed */
 static ssize_t flush_store(struct bus_type *bus, const char *buf, size_t count)
 {
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index ac6ea550ab0a..86668fab6e91 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -480,6 +480,13 @@  struct cxl_pmem_region {
 	struct cxl_pmem_region_mapping mapping[];
 };
 
+/*
+ * Set in picoseconds per ACPI spec 6.5 Table 5.148 Entry Base Unit.
+ * This is an approximate constant to use for switch latency calculation
+ * until there's a way to access switch CDAT.
+ */
+#define CXL_SWITCH_APPROX_LAT	5000
+
 /**
  * struct cxl_port - logical collection of upstream port devices and
  *		     downstream port devices to construct a CXL memory
@@ -706,6 +713,7 @@  struct dsmas_entry {
 	struct range dpa_range;
 	u16 handle;
 	u64 qos[ACPI_HMAT_WRITE_BANDWIDTH + 1];
+	int qtg_id;
 };
 
 typedef int (*cdat_tbl_entry_handler)(struct acpi_cdat_header *header, void *arg);
@@ -734,6 +742,7 @@  struct qtg_dsm_output {
 struct qtg_dsm_output *cxl_acpi_evaluate_qtg_dsm(acpi_handle handle,
 						 struct qtg_dsm_input *input);
 acpi_handle cxl_acpi_get_rootdev_handle(struct device *dev);
+int cxl_port_get_downstream_qos(struct cxl_port *port, long *bw, long *lat);
 
 /*
  * Unit test builds overrides this to __weak, find the 'strong' version
diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
index 8de311208b37..d72e38f9ae44 100644
--- a/drivers/cxl/port.c
+++ b/drivers/cxl/port.c
@@ -30,6 +30,44 @@  static void schedule_detach(void *cxlmd)
 	schedule_cxl_memdev_detach(cxlmd);
 }
 
+static int cxl_port_qos_calculate(struct cxl_port *port)
+{
+	struct qtg_dsm_output *output;
+	struct qtg_dsm_input input;
+	struct dsmas_entry *dent;
+	long min_bw, total_lat;
+	acpi_handle handle;
+	int rc;
+
+	rc = cxl_port_get_downstream_qos(port, &min_bw, &total_lat);
+	if (rc)
+		return rc;
+
+	handle = cxl_acpi_get_rootdev_handle(&port->dev);
+	if (IS_ERR(handle))
+		return PTR_ERR(handle);
+
+	mutex_lock(&port->cdat.dsmas_lock);
+	list_for_each_entry(dent, &port->cdat.dsmas_list, list) {
+		input.rd_lat = dent->qos[ACPI_HMAT_READ_LATENCY] + total_lat;
+		input.wr_lat = dent->qos[ACPI_HMAT_WRITE_LATENCY] + total_lat;
+		input.rd_bw = min_t(int, min_bw,
+				    dent->qos[ACPI_HMAT_READ_BANDWIDTH]);
+		input.wr_bw = min_t(int, min_bw,
+				    dent->qos[ACPI_HMAT_WRITE_BANDWIDTH]);
+
+		output = cxl_acpi_evaluate_qtg_dsm(handle, &input);
+		if (IS_ERR(output))
+			continue;
+
+		dent->qtg_id = output->qtg_ids[0];
+		kfree(output);
+	}
+	mutex_unlock(&port->cdat.dsmas_lock);
+
+	return 0;
+}
+
 static int cxl_port_probe(struct device *dev)
 {
 	struct cxl_port *port = to_cxl_port(dev);
@@ -74,6 +112,10 @@  static int cxl_port_probe(struct device *dev)
 			} else {
 				dev_dbg(dev, "Failed to parse DSMAS: %d\n", rc);
 			}
+
+			rc = cxl_port_qos_calculate(port);
+			if (rc)
+				dev_dbg(dev, "Failed to do QoS calculations\n");
 		}
 
 		rc = cxl_hdm_decode_init(cxlds, cxlhdm);
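
For readers following the arithmetic in cxl_port_qos_calculate(), here is a minimal standalone sketch of how the path numbers are folded into each device entry: the link bandwidth is converted from Mb/s to MB/s with a right shift by 3, latencies are summed, and bandwidths are clamped to the link minimum. `struct qos_vals`, `mbits_to_mbytes()`, `min_u64()` and `combine_qos()` are hypothetical names, not kernel APIs, and the sketch compares everything as 64-bit unsigned values, which is an assumption of the sketch rather than what the patch does.

```c
#include <stdint.h>

/* Hypothetical stand-in for the four QoS values kept per DSMAS entry. */
struct qos_vals {
	uint64_t rd_lat_ps;	/* read latency, picoseconds */
	uint64_t wr_lat_ps;	/* write latency, picoseconds */
	uint64_t rd_bw_mbs;	/* read bandwidth, MB/s */
	uint64_t wr_bw_mbs;	/* write bandwidth, MB/s */
};

/* Convert megabits per second to megabytes per second (divide by 8). */
static uint64_t mbits_to_mbytes(uint64_t mbits)
{
	return mbits >> 3;
}

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/*
 * Combine device-reported values with the path: latencies add up,
 * bandwidths are limited by the slowest element (device or link).
 */
static struct qos_vals combine_qos(const struct qos_vals *dev,
				   uint64_t path_lat_ps, uint64_t link_mbits)
{
	uint64_t link_mbs = mbits_to_mbytes(link_mbits);
	struct qos_vals out;

	out.rd_lat_ps = dev->rd_lat_ps + path_lat_ps;
	out.wr_lat_ps = dev->wr_lat_ps + path_lat_ps;
	out.rd_bw_mbs = min_u64(link_mbs, dev->rd_bw_mbs);
	out.wr_bw_mbs = min_u64(link_mbs, dev->wr_bw_mbs);
	return out;
}
```

For example, a 64000 Mb/s link becomes 8000 MB/s, so a device that reports 10000 MB/s read bandwidth is clamped to 8000 MB/s, while a 6000 MB/s write bandwidth is left unchanged.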