
[v1,25/29] cxl/amd: Enable Zen5 address translation using ACPI PRMT

Message ID 20250107141015.3367194-26-rrichter@amd.com
State New
Series cxl: Add address translation support and enable AMD Zen5 platforms

Commit Message

Robert Richter Jan. 7, 2025, 2:10 p.m. UTC
Add AMD platform specific Zen5 support for address translation.

Zen5 systems may be configured to use 'Normalized addresses'. Then,
CXL endpoints use their own physical address space and Host Physical
Addresses (HPAs) need address translation from the endpoint to its CXL
host bridge. The HPA of a CXL host bridge is equivalent to the System
Physical Address (SPA).

The ACPI Platform Runtime Mechanism (PRM) is used to translate the CXL
Device Physical Address (DPA) to its System Physical Address. This is
documented in:

 AMD Family 1Ah Models 00h–0Fh and Models 10h–1Fh
 ACPI v6.5 Porting Guide, Publication # 58088
 https://www.amd.com/en/search/documentation/hub.html

Note that DPA and HPA of an endpoint may differ depending on the
interleaving configuration. That is, an additional calculation between
DPA and HPA is needed.
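
For example, with 2-way interleaving at 256 byte granularity, the
zero-based endpoint HPA 0x200 maps to DPA 0x100: the interleaving
position bits are removed.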

To implement AMD Zen5 address translation the following steps are
needed:

Introduce the generic function cxl_port_platform_setup() that allows
platform-specific changes to be applied to each port where necessary.

Add a function cxl_port_setup_amd() to implement AMD platform-specific
code. Use Kbuild and Kconfig options respectively to enable the code
depending on architecture and platform options. Create a new file
core/amd.c for this.

Introduce a function cxl_zen5_init() to handle Zen5 specific
enablement. Zen5 platforms are detected using the PCIe vendor and
device ID of the corresponding CXL root port.

Apply cxl_zen5_to_hpa() as cxl_port->to_hpa() callback to Zen5 CXL
host bridges to enable platform specific address translation.

Use ACPI PRM DPA to SPA translation to determine an endpoint's
interleaving configuration and base address during the early
initialization process. This is used to determine an endpoint's SPA
range.

Since the PRM translates DPA->SPA, but HPA->SPA is needed, determine
the interleaving config and base address of the endpoint first, then
calculate the SPA based on the given HPA using the address base.

The config can be determined by calling the PRM for specific DPAs.
Since the interleaving configuration is still unknown, choose DPAs
starting at 0xd20000. This address is a multiple of all values from 1
to 8 and thus valid for all possible interleaving configurations.
The resulting SPAs are used to calculate the interleaving parameters
and the SPA base address of the endpoint. The maximum granularity
(chunk size) is 16k, the minimum is 256. Use the following
calculation for a given DPA:

 ways    = hpa_len(SZ_16K) / SZ_16K
 gran    = (hpa_len(SZ_16K) - hpa_len(SZ_16K - SZ_256) - SZ_256)
          / (ways - 1)
 pos     = (hpa_len(SZ_16K) - ways * SZ_16K) / gran
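
For illustration, with the 2-way, 256 byte granularity configuration
discussed later in this thread, the three PRM calls return (for the
second device) base 0xc051a40100, spa 0xc051a48100 and spa2
0xc051a47f00. Following the offset form of this arithmetic used in
cxl_zen5_to_hpa() below:

 len  = spa - base                         = 0x8000
 len2 = spa2 - base                        = 0x7e00
 ways = len / SZ_16K                       = 2
 gran = (len - len2 - SZ_256) / (ways - 1) = 0x100 (256)
 pos  = (spa & (SZ_16K - 1)) / gran        = 1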

Once the endpoint is attached to a region and its SPA range is known,
calling the PRM is no longer needed; the SPA base can be used.

Signed-off-by: Robert Richter <rrichter@amd.com>
---
 drivers/cxl/Kconfig       |   4 +
 drivers/cxl/core/Makefile |   1 +
 drivers/cxl/core/amd.c    | 227 ++++++++++++++++++++++++++++++++++++++
 drivers/cxl/core/core.h   |   6 +
 drivers/cxl/core/port.c   |   7 ++
 5 files changed, 245 insertions(+)
 create mode 100644 drivers/cxl/core/amd.c

Comments

Robert Richter Jan. 7, 2025, 4:32 p.m. UTC | #1
On 07.01.25 15:10:11, Robert Richter wrote:

> +#include <linux/prmt.h>

Note: linux/prmt.h is broken and causes a build error:

 ./include/linux/prmt.h:5:27: error: unknown type name ‘guid_t’
     5 | int acpi_call_prm_handler(guid_t handler_guid, void *param_buffer);
       |                           ^~~~~~

A fix for this is handled through linux-acpi:

 https://patchwork.kernel.org/project/linux-acpi/patch/20250107161923.3387552-1-rrichter@amd.com/
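
(The fix is essentially just the missing include for guid_t, along
the lines of adding

 #include <linux/uuid.h>

to include/linux/prmt.h; see the patch behind the link above for the
actual change.)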

I expect it will already have been applied by the time those cxl
patches go in.

Anyway, I will add it to a v2 to make testing easier.

Thanks,

-Robert
Gregory Price Jan. 7, 2025, 11:28 p.m. UTC | #2
On Tue, Jan 07, 2025 at 03:10:11PM +0100, Robert Richter wrote:
> 
> Add a function cxl_port_setup_amd() to implement AMD platform-specific
> code. Use Kbuild and Kconfig options respectively to enable the code
> depending on architecture and platform options. Create a new file
> core/amd.c for this.
> 

A build option here specific to AMD doesn't seem the best. At Meta,
we try to maintain a platform agnostic kernel for our fleet (at least
for build options), and this would necessitate us maintaining separate
builds for AMD systems vs other vendors.

Is there a reason to simply not include it by default and just report
whether translation is required or not? (i.e. no build option)

Or maybe generalize to CXL_PLATFORM_QUIRKS rather than CXL_AMD?

~Gregory

> Introduce a function cxl_zen5_init() to handle Zen5 specific
> enablement. Zen5 platforms are detected using the PCIe vendor and
> device ID of the corresponding CXL root port.
> 
> Apply cxl_zen5_to_hpa() as cxl_port->to_hpa() callback to Zen5 CXL
> host bridges to enable platform specific address translation.
> 
> Use ACPI PRM DPA to SPA translation to determine an endpoint's
> interleaving configuration and base address during the early
> initialization process. This is used to determine an endpoint's SPA
> range.
> 
> Since the PRM translates DPA->SPA, but HPA->SPA is needed, determine
> the interleaving config and base address of the endpoint first, then
> calculate the SPA based on the given HPA using the address base.
> 
> The config can be determined by calling the PRM for specific DPAs.
> Since the interleaving configuration is still unknown, choose DPAs
> starting at 0xd20000. This address is a multiple of all values from 1
> to 8 and thus valid for all possible interleaving configurations.
> The resulting SPAs are used to calculate the interleaving parameters
> and the SPA base address of the endpoint. The maximum granularity
> (chunk size) is 16k, the minimum is 256. Use the following
> calculation for a given DPA:
> 
>  ways    = hpa_len(SZ_16K) / SZ_16K
>  gran    = (hpa_len(SZ_16K) - hpa_len(SZ_16K - SZ_256) - SZ_256)
>           / (ways - 1)
>  pos     = (hpa_len(SZ_16K) - ways * SZ_16K) / gran
> 
> Once the endpoint is attached to a region and its SPA range is known,
> calling the PRM is no longer needed; the SPA base can be used.
> 
> Signed-off-by: Robert Richter <rrichter@amd.com>
> ---
>  drivers/cxl/Kconfig       |   4 +
>  drivers/cxl/core/Makefile |   1 +
>  drivers/cxl/core/amd.c    | 227 ++++++++++++++++++++++++++++++++++++++
>  drivers/cxl/core/core.h   |   6 +
>  drivers/cxl/core/port.c   |   7 ++
>  5 files changed, 245 insertions(+)
>  create mode 100644 drivers/cxl/core/amd.c
> 
> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> index 876469e23f7a..e576028dd983 100644
> --- a/drivers/cxl/Kconfig
> +++ b/drivers/cxl/Kconfig
> @@ -146,4 +146,8 @@ config CXL_REGION_INVALIDATION_TEST
>  	  If unsure, or if this kernel is meant for production environments,
>  	  say N.
>  
> +config CXL_AMD
> +       def_bool y
> +       depends on AMD_NB
> +
>  endif
> diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
> index 9259bcc6773c..dc368e61d281 100644
> --- a/drivers/cxl/core/Makefile
> +++ b/drivers/cxl/core/Makefile
> @@ -16,3 +16,4 @@ cxl_core-y += pmu.o
>  cxl_core-y += cdat.o
>  cxl_core-$(CONFIG_TRACING) += trace.o
>  cxl_core-$(CONFIG_CXL_REGION) += region.o
> +cxl_core-$(CONFIG_CXL_AMD) += amd.o
> diff --git a/drivers/cxl/core/amd.c b/drivers/cxl/core/amd.c
> new file mode 100644
> index 000000000000..553b7d0caefd
> --- /dev/null
> +++ b/drivers/cxl/core/amd.c
> @@ -0,0 +1,227 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2024 Advanced Micro Devices, Inc.
> + */
> +
> +#include <linux/prmt.h>
> +#include <linux/pci.h>
> +
> +#include "cxlmem.h"
> +#include "core.h"
> +
> +#define PCI_DEVICE_ID_AMD_ZEN5_ROOT		0x153e
> +
> +static const struct pci_device_id zen5_root_port_ids[] = {
> +	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_ZEN5_ROOT) },
> +	{},
> +};
> +
> +static int is_zen5_root_port(struct device *dev, void *unused)
> +{
> +	if (!dev_is_pci(dev))
> +		return 0;
> +
> +	return !!pci_match_id(zen5_root_port_ids, to_pci_dev(dev));
> +}
> +
> +static bool is_zen5(struct cxl_port *port)
> +{
> +	if (!IS_ENABLED(CONFIG_ACPI_PRMT))
> +		return false;
> +
> +	/* To get the CXL root port, find the CXL host bridge first. */
> +	if (is_cxl_root(port) ||
> +	    !port->host_bridge ||
> +	    !is_cxl_root(to_cxl_port(port->dev.parent)))
> +		return false;
> +
> +	return !!device_for_each_child(port->host_bridge, NULL,
> +				       is_zen5_root_port);
> +}
> +
> +/*
> + * PRM Address Translation - CXL DPA to System Physical Address
> + *
> + * Reference:
> + *
> + * AMD Family 1Ah Models 00h–0Fh and Models 10h–1Fh
> + * ACPI v6.5 Porting Guide, Publication # 58088
> + */
> +
> +static const guid_t prm_cxl_dpa_spa_guid =
> +	GUID_INIT(0xee41b397, 0x25d4, 0x452c, 0xad, 0x54, 0x48, 0xc6, 0xe3,
> +		  0x48, 0x0b, 0x94);
> +
> +struct prm_cxl_dpa_spa_data {
> +	u64 dpa;
> +	u8 reserved;
> +	u8 devfn;
> +	u8 bus;
> +	u8 segment;
> +	void *out;
> +} __packed;
> +
> +static u64 prm_cxl_dpa_spa(struct pci_dev *pci_dev, u64 dpa)
> +{
> +	struct prm_cxl_dpa_spa_data data;
> +	u64 spa;
> +	int rc;
> +
> +	data = (struct prm_cxl_dpa_spa_data) {
> +		.dpa     = dpa,
> +		.devfn   = pci_dev->devfn,
> +		.bus     = pci_dev->bus->number,
> +		.segment = pci_domain_nr(pci_dev->bus),
> +		.out     = &spa,
> +	};
> +
> +	rc = acpi_call_prm_handler(prm_cxl_dpa_spa_guid, &data);
> +	if (rc) {
> +		pci_dbg(pci_dev, "failed to get SPA for %#llx: %d\n", dpa, rc);
> +		return ULLONG_MAX;
> +	}
> +
> +	pci_dbg(pci_dev, "PRM address translation: DPA -> SPA: %#llx -> %#llx\n", dpa, spa);
> +
> +	return spa;
> +}
> +
> +static u64 cxl_zen5_to_hpa(struct cxl_decoder *cxld, u64 hpa)
> +{
> +	struct cxl_memdev *cxlmd;
> +	struct pci_dev *pci_dev;
> +	struct cxl_port *port;
> +	u64 dpa, base, spa, spa2, len, len2, offset, granularity;
> +	int ways, pos;
> +
> +	/*
> +	 * Nothing to do if base is non-zero and Normalized Addressing
> +	 * is disabled.
> +	 */
> +	if (cxld->hpa_range.start)
> +		return hpa;
> +
> +	/* Only translate from endpoint to its parent port. */
> +	if (!is_endpoint_decoder(&cxld->dev))
> +		return hpa;
> +
> +	if (hpa > cxld->hpa_range.end) {
> +		dev_dbg(&cxld->dev, "hpa addr %#llx out of range %#llx-%#llx\n",
> +			hpa, cxld->hpa_range.start, cxld->hpa_range.end);
> +		return ULLONG_MAX;
> +	}
> +
> +	/*
> +	 * If the decoder is already attached, the region's base can
> +	 * be used.
> +	 */
> +	if (cxld->region)
> +		return cxld->region->params.res->start + hpa;
> +
> +	port = to_cxl_port(cxld->dev.parent);
> +	cxlmd = port ? to_cxl_memdev(port->uport_dev) : NULL;
> +	if (!port || !dev_is_pci(cxlmd->dev.parent)) {
> +		dev_dbg(&cxld->dev, "No endpoint found: %s, range %#llx-%#llx\n",
> +			dev_name(cxld->dev.parent), cxld->hpa_range.start,
> +			cxld->hpa_range.end);
> +		return ULLONG_MAX;
> +	}
> +	pci_dev = to_pci_dev(cxlmd->dev.parent);
> +
> +	/*
> +	 * The PRM translates DPA->SPA, but we need HPA->SPA.
> +	 * Determine the interleaving config first, then calculate the
> +	 * DPA. Maximum granularity (chunk size) is 16k, minimum is
> +	 * 256. Calculated with:
> +	 *
> +	 *	ways	= hpa_len(SZ_16K) / SZ_16K
> +	 * 	gran	= (hpa_len(SZ_16K) - hpa_len(SZ_16K - SZ_256) - SZ_256)
> +	 *                / (ways - 1)
> +	 *	pos	= (hpa_len(SZ_16K) - ways * SZ_16K) / gran
> +	 */
> +
> +	/*
> +	 * DPA magic:
> +	 *
> +	 * Position and granularity are unknown yet, use an always
> +	 * valid DPA:
> +	 *
> +	 * 0xd20000 = 13762560 = 16k * 2 * 3 * 2 * 5 * 7 * 2
> +	 *
> +	 * It is divisible by all positions 1 to 8. The DPA is valid
> +	 * for all positions and granularities.
> +	 */
> +#define DPA_MAGIC	0xd20000
> +	base = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC);
> +	spa  = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC + SZ_16K);
> +	spa2 = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC + SZ_16K - SZ_256);
> +
> +	/* Includes checks to avoid div by zero */
> +	if (!base || base == ULLONG_MAX || spa == ULLONG_MAX ||
> +	    spa2 == ULLONG_MAX || spa < base + SZ_16K || spa2 <= base ||
> +	    (spa > base + SZ_16K && spa - spa2 < SZ_256 * 2)) {
> +		dev_dbg(&cxld->dev, "Error translating HPA: base %#llx, spa %#llx, spa2 %#llx\n",
> +			base, spa, spa2);
> +		return ULLONG_MAX;
> +	}
> +
> +	len = spa - base;
> +	len2 = spa2 - base;
> +
> +	/* offset = pos * granularity */
> +	if (len == SZ_16K && len2 == SZ_16K - SZ_256) {
> +		ways = 1;
> +		offset = 0;
> +		granularity = 0;
> +		pos = 0;
> +	} else {
> +		ways = len / SZ_16K;
> +		offset = spa & (SZ_16K - 1);
> +		granularity = (len - len2 - SZ_256) / (ways - 1);
> +		pos = offset / granularity;
> +	}
> +
> +	base = base - DPA_MAGIC * ways - pos * granularity;
> +	spa = base + hpa;
> +
> +	/*
> +	 * Check SPA using a PRM call for the closest DPA calculated
> +	 * for the HPA. If the HPA matches a different interleaving
> +	 * position other than the decoder's, determine its offset to
> +	 * adjust the SPA.
> +	 */
> +
> +	dpa = (hpa & ~(granularity * ways - 1)) / ways
> +		+ (hpa & (granularity - 1));
> +	offset = hpa & (granularity * ways - 1) & ~(granularity - 1);
> +	offset -= pos * granularity;
> +	spa2 = prm_cxl_dpa_spa(pci_dev, dpa) + offset;
> +
> +	dev_dbg(&cxld->dev,
> +		"address mapping found for %s (dpa -> hpa -> spa): %#llx -> %#llx -> %#llx base: %#llx ways: %d pos: %d granularity: %llu\n",
> +		pci_name(pci_dev), dpa, hpa, spa, base, ways, pos, granularity);
> +
> +	if (spa != spa2) {
> +		dev_dbg(&cxld->dev, "SPA calculation failed: %#llx:%#llx\n",
> +			spa, spa2);
> +		return ULLONG_MAX;
> +	}
> +
> +	return spa;
> +}
> +
> +static void cxl_zen5_init(struct cxl_port *port)
> +{
> +	if (!is_zen5(port))
> +		return;
> +
> +	port->to_hpa = cxl_zen5_to_hpa;
> +
> +	dev_dbg(port->host_bridge, "PRM address translation enabled for %s.\n",
> +		dev_name(&port->dev));
> +}
> +
> +void cxl_port_setup_amd(struct cxl_port *port)
> +{
> +	cxl_zen5_init(port);
> +}
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 800466f96a68..efe34ae6943e 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -115,4 +115,10 @@ bool cxl_need_node_perf_attrs_update(int nid);
>  int cxl_port_get_switch_dport_bandwidth(struct cxl_port *port,
>  					struct access_coordinate *c);
>  
> +#ifdef CONFIG_CXL_AMD
> +void cxl_port_setup_amd(struct cxl_port *port);
> +#else
> +static inline void cxl_port_setup_amd(struct cxl_port *port) {};
> +#endif
> +
>  #endif /* __CXL_CORE_H__ */
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 901555bf4b73..c8176265c15c 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -831,6 +831,11 @@ static void cxl_debugfs_create_dport_dir(struct cxl_dport *dport)
>  			    &cxl_einj_inject_fops);
>  }
>  
> +static void cxl_port_platform_setup(struct cxl_port *port)
> +{
> +	cxl_port_setup_amd(port);
> +}
> +
>  static int cxl_port_add(struct cxl_port *port,
>  			resource_size_t component_reg_phys,
>  			struct cxl_dport *parent_dport)
> @@ -868,6 +873,8 @@ static int cxl_port_add(struct cxl_port *port,
>  			return rc;
>  	}
>  
> +	cxl_port_platform_setup(port);
> +
>  	rc = device_add(dev);
>  	if (rc)
>  		return rc;
> -- 
> 2.39.5
>
Robert Richter Jan. 8, 2025, 2:52 p.m. UTC | #3
On 07.01.25 18:28:57, Gregory Price wrote:
> On Tue, Jan 07, 2025 at 03:10:11PM +0100, Robert Richter wrote:
> > 
> > Add a function cxl_port_setup_amd() to implement AMD platform-specific
> > code. Use Kbuild and Kconfig options respectively to enable the code
> > depending on architecture and platform options. Create a new file
> > core/amd.c for this.
> > 
> 
> A build option here specific to AMD doesn't seem the best. At Meta,
> we try to maintain a platform agnostic kernel for our fleet (at least
> for build options), and this would necessitate us maintaining separate
> builds for AMD systems vs other vendors.
> 
> Is there a reason to simply not include it by default and just report
> whether translation is required or not? (i.e. no build option)

There is no (menu) option for CXL_AMD; it just checks the dependency
on AMD_NB (and indirectly the arch, platform and vendor). In that
case it is always enabled.

Thanks for review.

-Robert

> 
> Or maybe generalize to CXL_PLATFORM_QUIRKS rather than CXL_AMD?
> 
> ~Gregory
Gregory Price Jan. 8, 2025, 3:48 p.m. UTC | #4
On Tue, Jan 07, 2025 at 03:10:11PM +0100, Robert Richter wrote:
> Add AMD platform specific Zen5 support for address translation.
> 
... snip ...
> 
> Once the endpoint is attached to a region and its SPA range is known,
> calling the PRM is no longer needed; the SPA base can be used.
> 
> Signed-off-by: Robert Richter <rrichter@amd.com>

One inline question, but not a blocker

Reviewed-by: Gregory Price <gourry@gourry.net>

> ---
>  drivers/cxl/Kconfig       |   4 +
>  drivers/cxl/core/Makefile |   1 +
>  drivers/cxl/core/amd.c    | 227 ++++++++++++++++++++++++++++++++++++++
>  drivers/cxl/core/core.h   |   6 +
>  drivers/cxl/core/port.c   |   7 ++
>  5 files changed, 245 insertions(+)
>  create mode 100644 drivers/cxl/core/amd.c
> 
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 901555bf4b73..c8176265c15c 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -831,6 +831,11 @@ static void cxl_debugfs_create_dport_dir(struct cxl_dport *dport)
>  			    &cxl_einj_inject_fops);
>  }
>  
> +static void cxl_port_platform_setup(struct cxl_port *port)
> +{
> +	cxl_port_setup_amd(port);
> +}
> +

Assuming this gets expanded (which it may not), should we expect this
function to end up like so?

static void cxl_port_platform_setup(struct cxl_port *port)
{
	cxl_port_setup_amd(port);
	cxl_port_setup_intel(port);
	cxl_port_setup_arm(port);
	... etc ...
}

I suppose this logic has to exist somewhere in some form, just want to make
sure this is what we want.  Either way, this is easily modifiable, so
not a blocker as I said.

>  static int cxl_port_add(struct cxl_port *port,
>  			resource_size_t component_reg_phys,
>  			struct cxl_dport *parent_dport)
> @@ -868,6 +873,8 @@ static int cxl_port_add(struct cxl_port *port,
>  			return rc;
>  	}
>  
> +	cxl_port_platform_setup(port);
> +
>  	rc = device_add(dev);
>  	if (rc)
>  		return rc;
> -- 
> 2.39.5
>
Gregory Price Jan. 8, 2025, 3:49 p.m. UTC | #5
On Wed, Jan 08, 2025 at 03:52:35PM +0100, Robert Richter wrote:
> On 07.01.25 18:28:57, Gregory Price wrote:
> > On Tue, Jan 07, 2025 at 03:10:11PM +0100, Robert Richter wrote:
> > > 
> > > Add a function cxl_port_setup_amd() to implement AMD platform-specific
> > > code. Use Kbuild and Kconfig options respectively to enable the code
> > > depending on architecture and platform options. Create a new file
> > > core/amd.c for this.
> > > 
> > 
> > A build option here specific to AMD doesn't seem the best. At Meta,
> > we try to maintain a platform agnostic kernel for our fleet (at least
> > for build options), and this would necessitate us maintaining separate
> > builds for AMD systems vs other vendors.
> > 
> > Is there a reason to simply not include it by default and just report
> > whether translation is required or not? (i.e. no build option)
> 
> There is no (menu) option for CXL_AMD; it just checks the dependency
> on AMD_NB (and indirectly the arch, platform and vendor). In that
> case it is always enabled.
> 

Ah! I completely misunderstood the build option, my bad.  This makes
sense, sorry about that

~Gregory
Robert Richter Jan. 9, 2025, 10:14 a.m. UTC | #6
On 08.01.25 10:48:23, Gregory Price wrote:

> > diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> > index 901555bf4b73..c8176265c15c 100644
> > --- a/drivers/cxl/core/port.c
> > +++ b/drivers/cxl/core/port.c
> > @@ -831,6 +831,11 @@ static void cxl_debugfs_create_dport_dir(struct cxl_dport *dport)
> >  			    &cxl_einj_inject_fops);
> >  }
> >  
> > +static void cxl_port_platform_setup(struct cxl_port *port)
> > +{
> > +	cxl_port_setup_amd(port);
> > +}
> > +
> 
> Assuming this gets expanded (which it may not), should we expect this
> function to end up like so?
> 
> static void cxl_port_platform_setup(struct cxl_port *port)
> {
> 	cxl_port_setup_amd(port);
> 	cxl_port_setup_intel(port);
> 	cxl_port_setup_arm(port);
> 	... etc ...
> }
> 
> I suppose this logic has to exist somewhere in some form, just want to make
> sure this is what we want.  Either way, this is easily modifiable, so
> not a blocker as I said.

Yes, it is exactly designed like that. I will update the patch
description.

-Robert
Gregory Price Jan. 9, 2025, 10:25 p.m. UTC | #7
On Tue, Jan 07, 2025 at 03:10:11PM +0100, Robert Richter wrote:
> Add AMD platform specific Zen5 support for address translation.

Doing some testing here and I'm seeing some odd results, and also
noticing some naming inconsistencies.

> 
> +static u64 cxl_zen5_to_hpa(struct cxl_decoder *cxld, u64 hpa)
> +{

Function name is _to_hpa, but hpa is an argument?

Should be dpa as argument? Confusing to convert an hpa to an hpa.

... snip ...

> +#define DPA_MAGIC	0xd20000
> +	base = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC);
> +	spa  = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC + SZ_16K);
> +	spa2 = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC + SZ_16K - SZ_256);

For two devices interleaved, the base should be the same, correct?

example: 2 128GB devices interleaved/normalized:

dev0: base(0xc051a40000) spa(0xc051a48000) spa2(0xc051a47e00)
dev1: base(0xc051a40100) spa(0xc051a48100) spa2(0xc051a47f00)

I believe these numbers are correct.

(Note: Using PRMT emulation because I don't have a BIOS with this blob,
 but this is the same emulation I have been using for about 4 months now
 with operational hardware, so unless the translation contract changed
 and this code expects something different, it should be correct).

... snip ...
> +	len = spa - base;
> +	len2 = spa2 - base;
> +
> +	/* offset = pos * granularity */
> +	if (len == SZ_16K && len2 == SZ_16K - SZ_256) {
> +		ways = 1;
> +		offset = 0;
> +		granularity = 0;
> +		pos = 0;
> +	} else {
> +		ways = len / SZ_16K;
> +		offset = spa & (SZ_16K - 1);
> +		granularity = (len - len2 - SZ_256) / (ways - 1);
> +		pos = offset / granularity;
> +	}

the interleave ways and such calculate out correctly

dev0: ways(0x2) offset(0x0)   granularity(0x100) pos(0x0)
dev1: ways(0x2) offset(0x100) granularity(0x100) pos(0x1)

> +
> +	base = base - DPA_MAGIC * ways - pos * granularity;
> +	spa = base + hpa;

DPA(0)
dev0: base(0xc050000000) spa(0xc050000000)
dev1: base(0xc050000000) spa(0xc050000000)

DPA(0x1fffffffff)
dev0: base(0xc050000000) spa(0xe04fffffff)
dev1: base(0xc050000000) spa(0xe04fffffff)

The bases seem correct, but the SPAs look suspect.

dev1 should have a very different SPA shouldn't it?

> +
> +	/*
> +	 * Check SPA using a PRM call for the closest DPA calculated
> +	 * for the HPA. If the HPA matches a different interleaving
> +	 * position other than the decoder's, determine its offset to
> +	 * adjust the SPA.
> +	 */
> +
> +	dpa = (hpa & ~(granularity * ways - 1)) / ways
> +		+ (hpa & (granularity - 1));

I do not understand this chunk here, we seem to just be chopping the HPA
in half to acquire the DPA.  But the value passed in is already a DPA.

dpa = (0x1fffffffff & ~(256 * 2 - 1)) / 2 + (0x1fffffffff & (256 - 1))
    =  0xfffffffff

I don't understand why the DPA address is suddenly half (64GB boundary).

> +	offset = hpa & (granularity * ways - 1) & ~(granularity - 1);
> +	offset -= pos * granularity;
> +	spa2 = prm_cxl_dpa_spa(pci_dev, dpa) + offset;
> +
> +	dev_dbg(&cxld->dev,
> +		"address mapping found for %s (dpa -> hpa -> spa): %#llx -> %#llx -> %#llx base: %#llx ways: %d pos: %d granularity: %llu\n",
> +		pci_name(pci_dev), dpa, hpa, spa, base, ways, pos, granularity);
> +

This results in a translation that appears to be wrong:

dev0:
cxl decoder5.0: address mapping found for 0000:e1:00.0
    (dpa -> hpa -> spa): 0x0 -> 0x0 -> 0xc050000000
    base: 0xc050000000 ways: 2 pos: 0 granularity: 256
cxl decoder5.0: address mapping found for 0000:e1:00.0
    (dpa -> hpa -> spa): 0xfffffffff -> 0x1fffffffff -> 0xe04fffffff
    base: 0xc050000000 ways: 2 pos: 0 granularity: 256

dev1:
cxl decoder6.0: address mapping found for 0000:c1:00.0
	(dpa -> hpa -> spa): 0x0 -> 0x0 -> 0xc050000000 
	base: 0xc050000000 ways: 2 pos: 1 granularity: 256
cxl decoder6.0: address mapping found for 0000:c1:00.0
	(dpa -> hpa -> spa): 0xfffffffff -> 0x1fffffffff -> 0xe04fffffff
	base: 0xc050000000 ways: 2 pos: 1 granularity: 256

These do not look correct.

Is my understanding of the PRMT translation incorrect?
I expect the following: (assuming one contiguous CFMW)

dev0 (dpa -> hpa -> spa): 0x0          -> 0x0          -> 0xc050000000
dev1 (dpa -> hpa -> spa): 0x0          -> 0x100        -> 0xc050000100
dev0 (dpa -> hpa -> spa): 0x1fffffffff -> 0x3ffffffeff -> 0x1004ffffeff
dev1 (dpa -> hpa -> spa): 0x1fffffffff -> 0x3fffffffff -> 0x1004fffffff

Extra data: here are the programmed endpoint decoder values

[endpoint5/decoder5.0]# cat start size dpa_size interleave_ways interleave_granularity 
0x0
0x2000000000
0x0000002000000000
1
256

[endpoint6/decoder6.0]# cat start size dpa_size interleave_ways interleave_granularity 
0x0
0x2000000000
0x0000002000000000
1
256


Anyway, yeah I'm a bit confused how this is all supposed to actually
work given that both devices translate to the same addresses.

In theory this *should* work since the root decoder covers the whole
space - as this has been working for me previously with some hacked up
PRMT emulation code.

[decoder0.0]# cat start size interleave_ways interleave_granularity
0xc050000000
0x4000000000
2
256

[decoder1.0]# cat start size interleave_ways interleave_granularity
0xc050000000
0x4000000000
1
256

[decoder3.0]# cat start size interleave_ways interleave_granularity 
0xc050000000
0x4000000000
1
256

[decoder5.0]# cat start size interleave_ways interleave_granularity
0x0
0x2000000000
1
256

[decoder6.0]# cat start size interleave_ways interleave_granularity 
0x0
0x2000000000
1
256

~Gregory
Gregory Price Jan. 10, 2025, 10:48 p.m. UTC | #8
On Tue, Jan 07, 2025 at 03:10:11PM +0100, Robert Richter wrote:
> Add AMD platform specific Zen5 support for address translation.
> 
> Zen5 systems may be configured to use 'Normalized addresses'. Then,
> CXL endpoints use their own physical address space and Host Physical
> Addresses (HPAs) need address translation from the endpoint to its CXL
> host bridge. The HPA of a CXL host bridge is equivalent to the System
> Physical Address (SPA).
> 

Just adding the note that I've tested this patch set for HPA==SPA and
found it causes no regressions in my setup.

Still working on testing the normalized address mode due to a few BIOS
quirks I'm running up against.

~Gregory
Jonathan Cameron Jan. 14, 2025, 11:13 a.m. UTC | #9
On Thu, 9 Jan 2025 11:14:46 +0100
Robert Richter <rrichter@amd.com> wrote:

> On 08.01.25 10:48:23, Gregory Price wrote:
> 
> > > diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> > > index 901555bf4b73..c8176265c15c 100644
> > > --- a/drivers/cxl/core/port.c
> > > +++ b/drivers/cxl/core/port.c
> > > @@ -831,6 +831,11 @@ static void cxl_debugfs_create_dport_dir(struct cxl_dport *dport)
> > >  			    &cxl_einj_inject_fops);
> > >  }
> > >  
> > > +static void cxl_port_platform_setup(struct cxl_port *port)
> > > +{
> > > +	cxl_port_setup_amd(port);
> > > +}
> > > +  
> > 
> > Assuming this gets expanded (which it may not), should we expect this
> > function to end up like so?
> > 
> > static void cxl_port_platform_setup(struct cxl_port *port)
> > {
> > 	cxl_port_setup_amd(port);
> > 	cxl_port_setup_intel(port);
> > 	cxl_port_setup_arm(port);
> > 	... etc ...
> > }
> > 
> > I suppose this logic has to exist somewhere in some form, just want to make
> > sure this is what we want.  Either way, this is easily modifiable, so
> > not a blocker as I said.  
> 
> Yes, it is exactly designed like that. I will update the patch
> description.

If we need it on ARM then we might wrap this in an arch_cxl_port_platform_setup(),
since we never build a kernel that covers both x86 and arm.  We could rely on
stubs but that tends to get ugly as things grow.

Other than that, all makes sense.

Jonathan

> 
> -Robert
Robert Richter Jan. 15, 2025, 3:05 p.m. UTC | #10
On 09.01.25 17:25:13, Gregory Price wrote:
> On Tue, Jan 07, 2025 at 03:10:11PM +0100, Robert Richter wrote:
> > Add AMD platform specific Zen5 support for address translation.
> 
> Doing some testing here and I'm seeing some odd results, also noticing
> some naming inconsistencies
> 
> > 
> > +static u64 cxl_zen5_to_hpa(struct cxl_decoder *cxld, u64 hpa)
> > +{
> 
> Function name is _to_hpa, but hpa is an argument?

Conversion is always done from the (old) HPA to the (new) HPA of the
parent port. Note that the HPA of the root port/host bridge is the
same as the SPA. Ports in between may have their own HPA ranges.

> 
> Should be dpa as argument? Confusing to convert an hpa to an hpa.

We need to handle the decoder address ranges; the argument is always
an address in the HPA range the decoder belongs to. The DPA exists
only on the device side, which is a different address range from the
decoders'. The decoders also do the interleaving arithmetic, so the
DPA range may differ. E.g. the decoders may split requests across
different endpoints depending on the number of interleaving ways, and
the endpoints then have their own (smaller) DPA address ranges.

> 
> ... snip ...
> 
> > +#define DPA_MAGIC	0xd20000
> > +	base = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC);
> > +	spa  = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC + SZ_16K);
> > +	spa2 = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC + SZ_16K - SZ_256);
> 
> For two devices interleaved, the base should be the same, correct?

Same except for the interleaving offset, which is seen below (dev1's
addresses are offset by 0x100). At this stage we don't know the
interleaving position of the endpoint yet.

> 
> example: 2 128GB devices interleaved/normalized:
> 
> dev0: base(0xc051a40000) spa(0xc051a48000) spa2(0xc051a47e00)
> dev1: base(0xc051a40100) spa(0xc051a48100) spa2(0xc051a47f00)
> 
> I believe these numbers are correct.

Looks good.

> 
> (Note: Using PRMT emulation because I don't have a BIOS with this blob,
>  but this is the same emulation i have been using for about 4 months now
>  with operational hardware, so unless the translation contract changed
>  and this code expects something different, it should be correct).
> 
> ... snip ...
> > +	len = spa - base;
> > +	len2 = spa2 - base;
> > +
> > +	/* offset = pos * granularity */
> > +	if (len == SZ_16K && len2 == SZ_16K - SZ_256) {
> > +		ways = 1;
> > +		offset = 0;
> > +		granularity = 0;
> > +		pos = 0;
> > +	} else {
> > +		ways = len / SZ_16K;
> > +		offset = spa & (SZ_16K - 1);
> > +		granularity = (len - len2 - SZ_256) / (ways - 1);
> > +		pos = offset / granularity;
> > +	}
> 
> the interleave ways and such calculate out correctly
> 
> dev0: ways(0x2) offset(0x0)   granularity(0x100) pos(0x0)
> dev1: ways(0x2) offset(0x100) granularity(0x100) pos(0x1)
> 
> > +
> > +	base = base - DPA_MAGIC * ways - pos * granularity;
> > +	spa = base + hpa;
> 
> DPA(0)
> dev0: base(0xc050000000) spa(0xc050000000)
> dev1: base(0xc050000000) spa(0xc050000000)
> 
> DPA(0x1fffffffff)
> dev0: base(0xc050000000) spa(0xe04fffffff)
> dev1: base(0xc050000000) spa(0xe04fffffff)
> 
> The bases seem correct, but the SPAs look suspect.

The SPA range length must be 0x4000000000 (2x 128G). That is, the
upper SPA must be 0x1004fffffff (0xc050000000 + 0x4000000000 - 1).
This one is too short.

The decoder range lengths below look correct (0x2000000000), the
interleaving configuration should be checked for the decoders.

> 
> dev1 should have a very different SPA shouldn't it?

No, the HPA range is calculated, not the DPA range. Both endpoints
have the same HPA range; it must be equal, and this looks correct. In
the end we calculate the following here (see cxl_find_auto_decoder()):

  hpa = cxld->hpa_range;
  // endpoint's hpa range is zero-based, equivalent to:
  // hpa->start = 0;
  // hpa->end = range_len(&hpa) - 1;
  base = hpa.start = port->to_hpa(cxld, hpa.start); // HPA(0)
  spa = hpa.end = port->to_hpa(cxld, hpa.end));     // HPA(decoder_size - 1)

Again, the HPA is the address the decoder is programmed with. The HPA
length is 0x2000000000 (spa - base + 1). The DPA range is (for 2-way)
half its size. The PRM translates DPA to SPA, but we want to
translate HPA to SPA. That is what we need the calculation for.

> 
> > +
> > +	/*
> > +	 * Check SPA using a PRM call for the closest DPA calculated
> > +	 * for the HPA. If the HPA matches a different interleaving
> > +	 * position other than the decoder's, determine its offset to
> > +	 * adjust the SPA.
> > +	 */
> > +
> > +	dpa = (hpa & ~(granularity * ways - 1)) / ways
> > +		+ (hpa & (granularity - 1));
> 
> I do not understand this chunk here, we seem to just be chopping the HPA
> in half to acquire the DPA.  But the value passed in is already a DPA.
> 
> dpa = (0x1fffffffff & ~(256 * 2 - 1)) / 2 + (0x1fffffffff & (256 - 1))
>     =  0xfffffffff

HPA is:

 HPA = 2 * 0x2000000000 - 1 = 0x3fffffffff

Should calculate for a 2-way config to:

 DPA = 0x1fffffffff.

Actual formula:

 dpa = HPA div (granularity * ways) * granularity + HPA mod granularity
 pos = (HPA mod (granularity * ways)) div granularity

Bits used (e.g. HPA length: 0x4000000000 = 2^38, ways: 2):

 hpa = 00000000000000000000000000XXXXXXXXXXXXXXXXXXXXXXXXXXXXXYZZZZZZZZ
 dpa = 000000000000000000000000000XXXXXXXXXXXXXXXXXXXXXXXXXXXXXZZZZZZZZ
 pos = Y

With:

 X ... base part of the address
 Y ... interleaving position
 Z ... address offset

For DPA the positional bits are removed.
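
As a sketch in C (helper name made up for illustration, not from the
patch; assumes ways and gran are powers of two, as the mask
arithmetic requires):

 static u64 hpa_to_dpa(u64 hpa, u64 gran, u64 ways, u64 *pos)
 {
 	/* Y: the interleaving position bits */
 	*pos = (hpa & (gran * ways - 1)) / gran;
 	/* X and Z: strip Y, keep the base and offset bits */
 	return (hpa & ~(gran * ways - 1)) / ways + (hpa & (gran - 1));
 }

E.g. hpa_to_dpa(0x3fffffffff, 256, 2, &pos) returns 0x1fffffffff with
pos = 1, matching the example above.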

> 
> I don't understand why the DPA address is suddenly half (64GB boundary).

There is probably a broken interleaving config causing half the size
of the total device memory.

> 
> > +	offset = hpa & (granularity * ways - 1) & ~(granularity - 1);
> > +	offset -= pos * granularity;
> > +	spa2 = prm_cxl_dpa_spa(pci_dev, dpa) + offset;
> > +
> > +	dev_dbg(&cxld->dev,
> > +		"address mapping found for %s (dpa -> hpa -> spa): %#llx -> %#llx -> %#llx base: %#llx ways: %d pos: %d granularity: %llu\n",
> > +		pci_name(pci_dev), dpa, hpa, spa, base, ways, pos, granularity);
> > +
> 
> This results in a translation that appears to be wrong:
> 
> dev0:
> cxl decoder5.0: address mapping found for 0000:e1:00.0
>     (dpa -> hpa -> spa): 0x0 -> 0x0 -> 0xc050000000
>     base: 0xc050000000 ways: 2 pos: 0 granularity: 256
> cxl decoder5.0: address mapping found for 0000:e1:00.0
>     (dpa -> hpa -> spa): 0xfffffffff -> 0x1fffffffff -> 0xe04fffffff
>     base: 0xc050000000 ways: 2 pos: 0 granularity: 256
> 
> dev1:
> cxl decoder6.0: address mapping found for 0000:c1:00.0
> 	(dpa -> hpa -> spa): 0x0 -> 0x0 -> 0xc050000000 
> 	base: 0xc050000000 ways: 2 pos: 1 granularity: 256
> cxl decoder6.0: address mapping found for 0000:c1:00.0
> 	(dpa -> hpa -> spa): 0xfffffffff -> 0x1fffffffff -> 0xe04fffffff
> 	base: 0xc050000000 ways: 2 pos: 1 granularity: 256
> 
> These do not look correct.
> 
> Is my understanding of the PRMT translation incorrect?
> I expect the following: (assuming one contiguous CFMW)
> 
> dev0 (dpa -> hpa -> spa): 0x0          -> 0x0          -> 0xc050000000
> dev1 (dpa -> hpa -> spa): 0x0          -> 0x100        -> 0xc050000100
> dev0 (dpa -> hpa -> spa): 0x1fffffffff -> 0x3ffffffeff -> 0x1004ffffeff
> dev1 (dpa -> hpa -> spa): 0x1fffffffff -> 0x3fffffffff -> 0x1004fffffff

Yes, that would be the result without the offset applied for spa2
above. The check above calculates the *total* length of hpa and spa
without considering the interleaving position. This is corrected
using the offset. There is no call prm_cxl_dpa_spa(dev0, 0x1fffffffff)
that returns 0x1004fffffff, but we want to check the upper boundary
of the SPA range.

> 
> Extra data: here are the programmed endpoint decoder values
> 
> [endpoint5/decoder5.0]# cat start size dpa_size interleave_ways interleave_granularity 
> 0x0
> 0x2000000000
> 0x0000002000000000
> 1
> 256
> 
> [endpoint6/decoder6.0]# cat start size dpa_size interleave_ways interleave_granularity 
> 0x0
> 0x2000000000
> 0x0000002000000000
> 1
> 256

This is correct and must be half the size of the HPA window.

Thanks for testing.

-Robert

> 
> 
> Anyway, yeah I'm a bit confused how this is all supposed to actually
> work given that both devices translate to the same addresses.
> 
> In theory this *should* work since the root decoder covers the whole
> space - as this has been working for me previously with some hacked up
> PRMT emulation code.
> 
> [decoder0.0]# cat start size interleave_ways interleave_granularity
> 0xc050000000
> 0x4000000000
> 2
> 256
> 
> [decoder1.0]# cat start size interleave_ways interleave_granularity
> 0xc050000000
> 0x4000000000
> 1
> 256
> 
> [decoder3.0]# cat start size interleave_ways interleave_granularity 
> 0xc050000000
> 0x4000000000
> 1
> 256
> 
> [decoder5.0]# cat start size interleave_ways interleave_granularity
> 0x0
> 0x2000000000
> 1
> 256
> 
> [decoder6.0]# cat start size interleave_ways interleave_granularity 
> 0x0
> 0x2000000000
> 1
> 256
> 
> ~Gregory
Gregory Price Jan. 15, 2025, 5:05 p.m. UTC | #11
On Wed, Jan 15, 2025 at 04:05:16PM +0100, Robert Richter wrote:
> > 
> > Should be dpa as argument? Confusing to convert an hpa to an hpa.
> 
> We need to handle the decoder address ranges; the argument is always
> an address in the HPA range the decoder belongs to.

I see, and this is where my confusion stems from. Basically these
addresses are considered "HPA" because they are programmed into
decoders, and decoder addresses are "always HPA".

i.e. 2 interleaved devices (endpoint decoders) with normalized addresses:

dev0: base(0x0) len(0x200000000)
dev1: base(0x0) len(0x200000000)

These are HPAs because decoders are programmed with HPAs.

It's just that in this (specific) case HPA=DPA, while root decoders and
host bridge decoders will always have HPA=SPA.  We're just translating
up the stack from HPA range to HPA range.

I've been dealing with virtualization for a long time and this has been
painful for me to follow - but I think I'm getting there.

> > DPA(0)
> > dev0: base(0xc050000000) spa(0xc050000000)
> > dev1: base(0xc050000000) spa(0xc050000000)
> > 
> > DPA(0x1fffffffff)
> > dev0: base(0xc050000000) spa(0xe04fffffff)
> > dev1: base(0xc050000000) spa(0xe04fffffff)
> > 
> > The bases seem correct, but the SPAs look suspect.
> 
> The SPA range length must be 0x4000000000 (2x 128G). That is, the
> upper SPA must be 0x1004fffffff (0xc050000000 + 0x4000000000 - 1).
> This one is too short.
> 
> The decoder range lengths below look correct (0x2000000000), the
> interleaving configuration should be checked for the decoders.
> 

If I understand correctly, this configuration may be suspect:

[decoder0.0]# cat start size interleave_ways interleave_granularity
0xc050000000
0x4000000000
2             <----- root decoder reports interleave ways = 2
256

[decoder1.0]# cat start size interleave_ways interleave_granularity
0xc050000000
0x4000000000
1             <----- host bridge decoder reports interleave ways = 1
256

[decoder3.0]# cat start size interleave_ways interleave_granularity 
0xc050000000
0x4000000000
1             <----- host bridge decoder reports interleave ways = 1
256

> > I do not understand this chunk here, we seem to just be chopping the HPA
> > in half to acquire the DPA.  But the value passed in is already a DPA.
> > 
> > dpa = (0x1fffffffff & ~(256 * 2 - 1)) / 2 + (0x1fffffffff & (256 - 1))
> >     =  0xfffffffff
> 
> HPA is:
> 
>  HPA = 2 * 0x2000000000 - 1 = 0x3fffffffff
>
... snip ...
> There is probably a broken interleaving config causing half the size
> of total device mem.
>

In my case, I never see 0x3fffffffff passed in.  The value 0x1fffffffff
from the endpoint decoders is always passed in.  This suggests the host
bridge interleave ways should be 2.

I can force this and figure out why it's reporting 1 and get back to you.


> > dev0 (dpa -> hpa -> spa): 0x0          -> 0x0          -> 0xc050000000
> > dev1 (dpa -> hpa -> spa): 0x0          -> 0x100        -> 0xc050000100
> > dev0 (dpa -> hpa -> spa): 0x1fffffffff -> 0x3ffffffeff -> 0x1004ffffeff
> > dev1 (dpa -> hpa -> spa): 0x1fffffffff -> 0x3fffffffff -> 0x1004fffffff
> 
> Yes, would be the result without the offset applied for spa2 above.
> The check above calculates the *total* length of hpa and spa with out
> considering the interleaving position. This is corrected using the
> offset. There is no call prm_cxl_dpa_spa(dev0, 0x1fffffffff) that
> returns 0x1004fffffff, but we want to check the upper boundery of the
> SPA range.
> 

This makes sense now: there's no direct dpa->spa translation because
you may have to go through multiple layers of translation to get there
- so the best you can do is calculate the highest possible endpoint
and say "Yeah, this range is in there somewhere".

Thank you for taking the time to walk me through this, I'm sorry I've
been confused on DPA/HPA/SPA for so long - it's been a bit of a
struggle.

~Gregory
Gregory Price Jan. 15, 2025, 10:24 p.m. UTC | #12
On Wed, Jan 15, 2025 at 04:05:16PM +0100, Robert Richter wrote:
> On 09.01.25 17:25:13, Gregory Price wrote:
> > > +	dpa = (hpa & ~(granularity * ways - 1)) / ways
> > > +		+ (hpa & (granularity - 1));
> > 
> > I do not understand this chunk here, we seem to just be chopping the HPA
> > in half to acquire the DPA.  But the value passed in is already a DPA.
> > 
> > dpa = (0x1fffffffff & ~(256 * 2 - 1)) / 2 + (0x1fffffffff & (256 - 1))
> >     =  0xfffffffff
> 
> HPA is:
> 
>  HPA = 2 * 0x2000000000 - 1 = 0x3fffffffff
> 
> Should calculate for a 2-way config to:
> 
>  DPA = 0x1fffffffff.
> 

I'm looking back through all of this again, and I'm not seeing how the
current code is ever capable of ending up with hpa=0x3fffffffff.

Taking an example endpoint in my setup:

[decoder5.0]# cat start size interleave_ways interleave_granularity
0x0
0x2000000000   <- 128GB (half the total 256GB interleaved range)
1              <- this decoder does not apply interleave
256

translating up to a root decoder:

[decoder0.0]# cat start size interleave_ways interleave_granularity
0xc050000000
0x4000000000   <- 256GB (total interleaved capacity)
2              <- interleaved 2 ways, this decoder applies interleave
256


Now looking at the code that actually invokes the translation

static struct device *
cxl_find_auto_decoder(struct cxl_port *port, struct cxl_endpoint_decoder *cxled,
                      struct cxl_region *cxlr)
{
        struct cxl_decoder *cxld = &cxled->cxld;
        struct range hpa = cxld->hpa_range;
	... snip ...
	if (cxl_port_calc_hpa(parent, cxld, &hpa))
		return NULL;
}

or

static int cxl_endpoint_initialize(struct cxl_endpoint_decoder *cxled)
{
        struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
        struct cxl_port *parent, *iter = cxled_to_port(cxled);
        struct range hpa = cxled->cxld.hpa_range;
        struct cxl_decoder *cxld = &cxled->cxld;

        while (1) {
                if (is_cxl_endpoint(iter))
                        cxld = &cxled->cxld;
		...
                /* Translate HPA to the next upper memory domain. */
                if (cxl_port_calc_hpa(parent, cxld, &hpa)) {
		}
		...
	}
	....
}

Both of these will call cxl_port_calc_hpa with
	hpa = [0, 0x1fffffffff]

Resulting in the following

static int cxl_port_calc_hpa(struct cxl_port *port, struct cxl_decoder *cxld,
                             struct range *hpa_range)
{
...
        /* Translate HPA to the next upper domain. */
        hpa.start = port->to_hpa(cxld, hpa.start); <---- 0x0
        hpa.end = port->to_hpa(cxld, hpa.end);     <---- 0x1fffffffff
}

So we call:
    to_hpa(decoder5.0, hpa.end)
    to_hpa(decoder5.0, 0x1fffffffff)
                       ^^^^^^^^^^^^^ --- hpa will never be 0x3fffffffff


Should the to_hpa() code be taking a decoder length as an argument?

to_hpa(decoder5.0, range_length, addr) ?

This would actually let us calculate the end of region with the
interleave ways and granularity:

upper_hpa_base  = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC) - DPA_MAGIC;
upper_hpa_end   = upper_hpa_base + (range_length * ways) - 1

Without this, you don't have enough information to actually calculate
the upper_hpa_end as you suggested.  The result is the math ends up
chopping the endpoint decoder's range (128GB) in half (64GB).

Below I walk through the translation code with these inputs step by step.

~Gregory

------------------------------------------------------------------------

Walking through the translation code by hand here:

[decoder5.0]# cat start size
0x0
0x2000000000

call:  to_hpa(decoder5.0, 0x1fffffffff)

--- code
    #define DPA_MAGIC       0xd20000
    base = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC);
    spa  = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC + SZ_16K);
    spa2 = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC + SZ_16K - SZ_256);

    len = spa - base;
    len2 = spa2 - base;

    /* offset = pos * granularity */
    if (len == SZ_16K && len2 == SZ_16K - SZ_256) {
        ... snip - not taken ...
    } else {
        ways = len / SZ_16K;
        offset = spa & (SZ_16K - 1);
        granularity = (len - len2 - SZ_256) / (ways - 1);
        pos = offset / granularity;
    }
--- end code

At this point in the code i have the following values:

base   = 0xc051a40100
spa    = 0xc051a48100
spa2   = 0xc051a47f00
len    = 0x8000
len2   = 0x7E00
ways   = 2
offset = 256
granularity = 256
pos    = 1

--- code
        base = base - DPA_MAGIC * ways - pos * granularity;
        spa = base + hpa;
--- end code

base   = 0xc051a40100 - 0xd20000 * 2 - 1 * 256
       = 0xc050000000

spa    = base + hpa
       = 0xc050000000 + 0x1fffffffff  <-----
       = 0xe04fffffff                      |

         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	 Isn't this just incorrect? Should it be something like:

	 base + ((hpa & ~(granularity - 1)) * pos)
	      + (hpa & (ways * granularity - 1))

--- code
        dpa = (hpa & ~(granularity * ways - 1)) / ways
                + (hpa & (granularity - 1));
--- end code

dpa = (hpa & ~(granularity * ways - 1)) / ways + (hpa & (granularity - 1));
       ^^^                                ^^^
       0x1fffffffff                     /  2

       We are dividing the endpoint decoder HPA by interleave ways.
       This is how we end up with the truncated size.

 
      Full math
dpa   = (0x1fffffffff & ~(256 * 2 - 1)) / 2 + (0x1fffffffff & (256 - 1))
      = (0x1fffffffff & (0xF...E00) / 2 + (0x1fffffffff & 0xFF)
      = (0x1ffffffe00) / 2 + (0xFF)
      = 0xfffffff00 + 0xff
      = 0xfffffffff

--- code
        offset = hpa & (granularity * ways - 1) & ~(granularity - 1);
        offset -= pos * granularity;
---
offset = (0x1fffffffff & (256 * 2 - 1) & ~(256 - 1)) - (1 * 256)
         (0x1fffffffff & 0x1ff & 0xffffffffffffff00) - 0x100
         (0x1ff & 0xffffffffffffff00) - 0x100
	 0x100 - 0x100
	 0x0


Final Result:
--- code
        spa2 = prm_cxl_dpa_spa(pci_dev, dpa) + offset;
---
     = prm_cxl_dpa_spa(pci_dev, 0xfffffffff) + 0
     = 0xe04fffffff + 0
       ^^^ note that in thise case my emulation gives the exact address
           you seem to suggest that i'll get the closet granularity

	   So if not for my emulation, the offset calculation is wrong?


--------------------------------------------------------------------

For the sake of completeness, here is my PRMT emulation code to show
you that it is doing the translation as expected.

All this does is just force translation for a particular set of PCI
devices based on the known static CFMW regions. 

Note for onlookers: This patch is extremely dangerous and only applies
to my specific system / interleave configuration.


diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
index ac74b6f6dad7..8ccf2d5638ed 100644
--- a/drivers/cxl/acpi.c
+++ b/drivers/cxl/acpi.c
@@ -432,6 +432,31 @@ static int __cxl_parse_cfmws(struct acpi_cedt_cfmws *cfmws,
        return 0;
 }

+#define MAX_CFMWS (32)
+static unsigned int num_cfmws;
+static unsigned long cfmws_bases[MAX_CFMWS];
+static unsigned long cfmws_sizes[MAX_CFMWS];
+unsigned int cxl_get_num_cfmws(void)
+{
+       return num_cfmws;
+}
+
+unsigned long cxl_get_cfmws_base(unsigned int idx)
+{
+       if (idx >= MAX_CFMWS || idx >= num_cfmws)
+               return ~0;
+
+       return cfmws_bases[idx];
+}
+
+unsigned long cxl_get_cfmws_size(unsigned int idx)
+{
+       if (idx >= MAX_CFMWS || idx >= num_cfmws)
+               return ~0;
+
+       return cfmws_sizes[idx];
+}
+
 static int cxl_parse_cfmws(union acpi_subtable_headers *header, void *arg,
                           const unsigned long end)
 {
@@ -446,10 +471,16 @@ static int cxl_parse_cfmws(union acpi_subtable_headers *header, void *arg,
                        "Failed to add decode range: [%#llx - %#llx] (%d)\n",
                        cfmws->base_hpa,
                        cfmws->base_hpa + cfmws->window_size - 1, rc);
-       else
+       else {
                dev_dbg(dev, "decode range: node: %d range [%#llx - %#llx]\n",
                        phys_to_target_node(cfmws->base_hpa), cfmws->base_hpa,
                        cfmws->base_hpa + cfmws->window_size - 1);
+               if (num_cfmws < MAX_CFMWS) {
+                       cfmws_bases[num_cfmws] = cfmws->base_hpa;
+                       cfmws_sizes[num_cfmws] = cfmws->window_size;
+                       num_cfmws++;
+               }
+       }

        /* never fail cxl_acpi load for a single window failure */
        return 0;
diff --git a/drivers/cxl/core/amd.c b/drivers/cxl/core/amd.c
index 553b7d0caefd..08a5bfb9fbd6 100644
--- a/drivers/cxl/core/amd.c
+++ b/drivers/cxl/core/amd.c
@@ -64,6 +64,10 @@ struct prm_cxl_dpa_spa_data {
 static u64 prm_cxl_dpa_spa(struct pci_dev *pci_dev, u64 dpa)
 {
        struct prm_cxl_dpa_spa_data data;
+       unsigned int cfmws_nr;
+       unsigned int idx;
+       unsigned long offset, size;
+       unsigned int dev;
        u64 spa;
        int rc;

@@ -75,12 +79,35 @@ static u64 prm_cxl_dpa_spa(struct pci_dev *pci_dev, u64 dpa)
                .out     = &spa,
        };

+       cfmws_nr = cxl_get_num_cfmws();
+       if (!cfmws_nr)
+               goto try_prmt;
+
+       /* HACK: Calculate the interleaved offset and find the matching base */
+       if (pci_dev->bus->number != 0xe1 && pci_dev->bus->number != 0xc1)
+               goto try_prmt;
+
+       dev = pci_dev->bus->number == 0xe1 ? 0 : 1;
+       offset = (0x100 * (((dpa >> 8) * 2) + dev)) + (dpa & 0xff);
+
+       for (idx = 0; idx < cfmws_nr; idx++) {
+               size = cxl_get_cfmws_size(idx);
+               if (offset < size) {
+                       spa = cxl_get_cfmws_base(idx) + offset;
+                       goto out;
+               }
+               offset -= size;
+       }
+       /* We failed, fall back to calling the PRMT */
+try_prmt:
+
        rc = acpi_call_prm_handler(prm_cxl_dpa_spa_guid, &data);
        if (rc) {
                pci_dbg(pci_dev, "failed to get SPA for %#llx: %d\n", dpa, rc);
                return ULLONG_MAX;
        }

+out:
        pci_dbg(pci_dev, "PRM address translation: DPA -> SPA: %#llx -> %#llx\n", dpa, spa);

        return spa;
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index f93eb464fc97..48cbfa68d739 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -919,6 +919,10 @@ bool cxl_endpoint_decoder_reset_detected(struct cxl_port *port);

 struct cxl_mem_command *cxl_find_feature_command(u16 opcode);

+unsigned int cxl_get_num_cfmws(void);
+unsigned long cxl_get_cfmws_base(unsigned int idx);
+unsigned long cxl_get_cfmws_size(unsigned int idx);
+
 /*
  * Unit test builds overrides this to __weak, find the 'strong' version
  * of these symbols in tools/testing/cxl/.
Robert Richter Jan. 17, 2025, 7:59 a.m. UTC | #13
On 14.01.25 11:13:07, Jonathan Cameron wrote:
> On Thu, 9 Jan 2025 11:14:46 +0100
> Robert Richter <rrichter@amd.com> wrote:
> 
> > On 08.01.25 10:48:23, Gregory Price wrote:
> > 
> > > > diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> > > > index 901555bf4b73..c8176265c15c 100644
> > > > --- a/drivers/cxl/core/port.c
> > > > +++ b/drivers/cxl/core/port.c
> > > > @@ -831,6 +831,11 @@ static void cxl_debugfs_create_dport_dir(struct cxl_dport *dport)
> > > >  			    &cxl_einj_inject_fops);
> > > >  }
> > > >  
> > > > +static void cxl_port_platform_setup(struct cxl_port *port)
> > > > +{
> > > > +	cxl_port_setup_amd(port);
> > > > +}
> > > > +  
> > > 
> > > Assuming this gets expanded (which it may not), should we expect this
> > > function to end up like so?
> > > 
> > > static void cxl_port_platform_setup(struct cxl_port *port)
> > > {
> > > 	cxl_port_setup_amd(port);
> > > 	cxl_port_setup_intel(port);
> > > 	cxl_port_setup_arm(port);
> > > 	... etc ...
> > > }
> > > 
> > > I suppose this logic has to exist somewhere in some form, just want to make
> > > sure this is what we want.  Either way, this is easily modifiable, so
> > > not a blocker as I said.  
> > 
> > Yes, it is exactly designed like that. I will update the patch
> > description.
> 
> If we need it on ARM then we might wrap this in an arch_cxl_port_platform_setup(),
> since we never build a kernel that covers both x86 and arm.  We could rely on
> stubs but that tends to get ugly as things grow.

I could move the function and file to core/x86/amd.c already and add
a:

 void __weak arch_cxl_port_platform_setup(struct cxl_port *port) { }

-Robert
Robert Richter Jan. 17, 2025, 8:41 a.m. UTC | #14
On 10.01.25 17:48:35, Gregory Price wrote:
> On Tue, Jan 07, 2025 at 03:10:11PM +0100, Robert Richter wrote:
> > Add AMD platform specific Zen5 support for address translation.
> > 
> > Zen5 systems may be configured to use 'Normalized addresses'. Then,
> > CXL endpoints use their own physical address space and Host Physical
> > Addresses (HPAs) need address translation from the endpoint to its CXL
> > host bridge. The HPA of a CXL host bridge is equivalent to the System
> > Physical Address (SPA).
> > 
> 
> Just adding the note that I've tested this patch set for HPA==SPA and
> found it causes no regressions in my setup.

Thanks for testing HPA==SPA.

-Robert

> 
> Still working on testing the normalized address mode due to a few BIOS
> quirks I'm running up against.
> 
> ~Gregory
Jonathan Cameron Jan. 17, 2025, 11:46 a.m. UTC | #15
On Fri, 17 Jan 2025 08:59:00 +0100
Robert Richter <rrichter@amd.com> wrote:

> On 14.01.25 11:13:07, Jonathan Cameron wrote:
> > On Thu, 9 Jan 2025 11:14:46 +0100
> > Robert Richter <rrichter@amd.com> wrote:
> >   
> > > On 08.01.25 10:48:23, Gregory Price wrote:
> > >   
> > > > > diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> > > > > index 901555bf4b73..c8176265c15c 100644
> > > > > --- a/drivers/cxl/core/port.c
> > > > > +++ b/drivers/cxl/core/port.c
> > > > > @@ -831,6 +831,11 @@ static void cxl_debugfs_create_dport_dir(struct cxl_dport *dport)
> > > > >  			    &cxl_einj_inject_fops);
> > > > >  }
> > > > >  
> > > > > +static void cxl_port_platform_setup(struct cxl_port *port)
> > > > > +{
> > > > > +	cxl_port_setup_amd(port);
> > > > > +}
> > > > > +    
> > > > 
> > > > Assuming this gets expanded (which it may not), should we expect this
> > > > function to end up like so?
> > > > 
> > > > static void cxl_port_platform_setup(struct cxl_port *port)
> > > > {
> > > > 	cxl_port_setup_amd(port);
> > > > 	cxl_port_setup_intel(port);
> > > > 	cxl_port_setup_arm(port);
> > > > 	... etc ...
> > > > }
> > > > 
> > > > I suppose this logic has to exist somewhere in some form, just want to make
> > > > sure this is what we want.  Either way, this is easily modifiable, so
> > > > not a blocker as I said.    
> > > 
> > > Yes, it is exactly designed like that. I will update the patch
> > > description.  
> > 
> > If we need it on ARM then we might wrap this in an arch_cxl_port_platform_setup(),
> > since we never build a kernel that covers both x86 and arm.  We could rely on
> > stubs but that tends to get ugly as things grow.
> 
> I could move the function and file to core/x86/amd.c already and add
> a:
> 
>  void __weak arch_cxl_port_platform_setup(struct cxl_port *port) { }
Something like that probably makes sense. I don't like x86 calls in what
I'm building for arm, even if they are stubbed out ;)

Jonathan

> 
> -Robert
Robert Richter Jan. 17, 2025, 2:06 p.m. UTC | #16
On 15.01.25 17:24:54, Gregory Price wrote:
> On Wed, Jan 15, 2025 at 04:05:16PM +0100, Robert Richter wrote:
> > On 09.01.25 17:25:13, Gregory Price wrote:
> > > > +	dpa = (hpa & ~(granularity * ways - 1)) / ways
> > > > +		+ (hpa & (granularity - 1));
> > > 
> > > I do not understand this chunk here, we seem to just be chopping the HPA
> > > in half to acquire the DPA.  But the value passed in is already a DPA.
> > > 
> > > dpa = (0x1fffffffff & ~(256 * 2 - 1)) / 2 + (0x1fffffffff & (256 - 1))
> > >     =  0xfffffffff
> > 
> > HPA is:
> > 
> >  HPA = 2 * 0x2000000000 - 1 = 0x3fffffffff
> > 
> > Should calculate for a 2-way config to:
> > 
> >  DPA = 0x1fffffffff.
> > 
> 
> I'm looking back through all of this again, and I'm not seeing how the
> current code is ever capable of ending up with hpa=0x3fffffffff.
> 
> Taking an example endpoint in my setup:
> 
> [decoder5.0]# cat start size interleave_ways interleave_granularity
> 0x0
> 0x2000000000   <- 128GB (half the total 256GB interleaved range)
> 1              <- this decoder does not apply interleave
> 256
> 
> translating up to a root decoder:
> 
> [decoder0.0]# cat start size interleave_ways interleave_granularity
> 0xc050000000
> 0x4000000000   <- 256GB (total interleaved capacity)
> 2              <- interleaved 2 ways, this decoder applies interleave
> 256
> 
> 
> Now looking at the code that actually invokes the translation
> 
> static struct device *
> cxl_find_auto_decoder(struct cxl_port *port, struct cxl_endpoint_decoder *cxled,
>                       struct cxl_region *cxlr)
> {
>         struct cxl_decoder *cxld = &cxled->cxld;
>         struct range hpa = cxld->hpa_range;
> 	... snip ...
> 	if (cxl_port_calc_hpa(parent, cxld, &hpa))
> 		return NULL;
> }
> 
> or
> 
> static int cxl_endpoint_initialize(struct cxl_endpoint_decoder *cxled)
> {
>         struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
>         struct cxl_port *parent, *iter = cxled_to_port(cxled);
>         struct range hpa = cxled->cxld.hpa_range;
>         struct cxl_decoder *cxld = &cxled->cxld;
> 
>         while (1) {
>                 if (is_cxl_endpoint(iter))
>                         cxld = &cxled->cxld;
> 		...
>                 /* Translate HPA to the next upper memory domain. */
>                 if (cxl_port_calc_hpa(parent, cxld, &hpa)) {
> 		}
> 		...
> 	}
> 	....
> }
> 
> Both of these will call cxl_port_calc_hpa with
> 	hpa = [0, 0x1fffffffff]
> 
> Resulting in the following
> 
> static int cxl_port_calc_hpa(struct cxl_port *port, struct cxl_decoder *cxld,
>                              struct range *hpa_range)
> {
> ...
>         /* Translate HPA to the next upper domain. */
>         hpa.start = port->to_hpa(cxld, hpa.start); <---- 0x0
>         hpa.end = port->to_hpa(cxld, hpa.end);     <---- 0x1fffffffff
> }
> 
> So we call:
>     to_hpa(decoder5.0, hpa.end)
>     to_hpa(decoder5.0, 0x1fffffffff)
>                        ^^^^^^^^^^^^^ --- hpa will never be 0x3fffffffff

Depending on the endpoint the PRM call returns the following here
(2-way interleaving with 256 gran):

 hpa = 0x1fffffff00 * 2 + pos * 0x100 + 0xff;

It will either return 0x3ffffffeff or 0x3fffffffff.
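
This also round-trips through the dpa calculation quoted at the top of
this mail. A quick userspace check (a sketch; 2 ways, 256 byte
granularity, as in this example):

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		uint64_t dpa = 0x1fffffffffULL;	/* decoder5.0's last DPA */
		uint64_t gran = 0x100;		/* 256 byte granularity */
		uint64_t ways = 2;

		for (uint64_t pos = 0; pos < ways; pos++) {
			/*
			 * Interleave-set offset; the region base is left
			 * out here, as in the formula above.
			 */
			uint64_t spa = (dpa / gran) * gran * ways
				     + pos * gran + dpa % gran;
			/* round trip back to the DPA */
			uint64_t back = (spa & ~(gran * ways - 1)) / ways
				      + (spa & (gran - 1));
			printf("pos %llu: spa %#llx -> dpa %#llx\n",
			       (unsigned long long)pos,
			       (unsigned long long)spa,
			       (unsigned long long)back);
		}
		/* prints spa 0x3ffffffeff and 0x3fffffffff; both map
		   back to dpa 0x1fffffffff */
		return 0;
	}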

But the implementation in cxl_zen5_to_hpa() is not correct; see below.

> 
> 
> Should the to_hpa() code be taking a decoder length as an argument?
> 
> to_hpa(decoder5.0, range_length, addr) ?
> 
> This would actually let us calculate the end of region with the
> interleave ways and granularity:
> 
> upper_hpa_base  = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC) - DPA_MAGIC;
> upper_hpa_end   = upper_hpa_base + (range_length * ways) - 1
> 
> Without this, you don't have enough information to actually calculate
> the upper_hpa_end as you suggested.  The result is the math ends up
> chopping the endpoint decoder's range (128GB) in half (64GB).
> 
> Below I walk through the translation code with these inputs step by step.
> 
> ~Gregory
> 
> ------------------------------------------------------------------------
> 
> Walking through the translation code by hand here:
> 
> [decoder5.0]# cat start size
> 0x0
> 0x2000000000
> 
> call:  to_hpa(decoder5.0, 0x1fffffffff)
> 
> --- code
>     #define DPA_MAGIC       0xd20000
>     base = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC);
>     spa  = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC + SZ_16K);
>     spa2 = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC + SZ_16K - SZ_256);
> 
>     len = spa - base;
>     len2 = spa2 - base;
> 
>     /* offset = pos * granularity */
>     if (len == SZ_16K && len2 == SZ_16K - SZ_256) {
>         ... snip - not taken ...
>     } else {
>         ways = len / SZ_16K;
>         offset = spa & (SZ_16K - 1);
>         granularity = (len - len2 - SZ_256) / (ways - 1);
>         pos = offset / granularity;
>     }
> --- end code
> 
> At this point in the code i have the following values:
> 
> base   = 0xc051a40100
> spa    = 0xc051a48100
> spa2   = 0xc051a47f00
> len    = 0x8000
> len2   = 0x7E00
> ways   = 2
> offset = 256
> granularity = 256
> pos    = 1
> 

> --- code
>         base = base - DPA_MAGIC * ways - pos * granularity;
>         spa = base + hpa;
> --- end code
> 
> base   = 0xc051a40100 - 0xd20000 * 2 - 1 * 256
>        = 0xc050000000
> 
> spa    = base + hpa

This is the wrong part; the interleaving parameters must be applied to
the SPA calculation as well, like:

  spa = base + (hpa / gran) * gran * ways + pos * gran + (hpa % gran)

This caused a wrong range size, which went unnoticed due to an
overlapping issue while testing. The detection of the interleaving
config is still correct.
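
A quick check of the corrected formula against the numbers from the
walkthrough (base 0xc050000000, 2 ways, 256 granularity, pos 1):

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		uint64_t base = 0xc050000000ULL; /* probed SPA base */
		uint64_t hpa  = 0x1fffffffffULL; /* decoder5.0 hpa_range.end */
		uint64_t gran = 0x100, pos = 1, ways = 2;

		uint64_t spa = base + (hpa / gran) * gran * ways
			     + pos * gran + hpa % gran;

		/*
		 * Prints 0x1004fffffff, the last byte of decoder0.0's
		 * 256GB window at 0xc050000000, as expected.
		 */
		printf("spa = %#llx\n", (unsigned long long)spa);
		return 0;
	}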

>        = 0xc050000000 + 0x1fffffffff  <-----
>        = 0xe04fffffff                      |
> 
>          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 	 Isn't this just incorrect? Should it be something like:

> --------------------------------------------------------------------
> 
> For the sake of completeness, here is my PRMT emulation code to show
> you that it is doing the translation as-expected.
> 
> All this does is just force translation for a particular set of PCI
> devices based on the known static CFMW regions. 

> @@ -75,12 +79,35 @@ static u64 prm_cxl_dpa_spa(struct pci_dev *pci_dev, u64 dpa)
>                 .out     = &spa,
>         };
> 
> +       cfmws_nr = cxl_get_num_cfmws();
> +       if (!cfmws_nr)
> +               goto try_prmt;
> +
> +       /* HACK: Calculate the interleaved offset and find the matching base */
> +       if (pci_dev->bus->number != 0xe1 && pci_dev->bus->number != 0xc1)
> +               goto try_prmt;
> +
> +       dev = pci_dev->bus->number == 0xe1 ? 0 : 1;
> +       offset = (0x100 * (((dpa >> 8) * 2) + dev)) + (dpa & 0xff);

That looks correct to me; a quick test, too:

 >>> dpa = 0x1fffffffff
 >>> dev = 1
 >>> offset = (0x100 * (((dpa >> 8) * 2) + dev)) + (dpa & 0xff)
 >>> "0x%x" % offset
 '0x3fffffffff'
 >>> dev = 0
 >>> offset = (0x100 * (((dpa >> 8) * 2) + dev)) + (dpa & 0xff)
 >>> "0x%x" % offset
 '0x3ffffffeff'

Thanks for review and testing.

Will fix in v2.

-Robert

> +
> +       for (idx = 0; idx < cfmws_nr; idx++) {
> +               size = cxl_get_cfmws_size(idx);
> +               if (offset < size) {
> +                       spa = cxl_get_cfmws_base(idx) + offset;
> +                       goto out;
> +               }
> +               offset -= size;
> +       }
> +       /* We failed, fall back to calling the PRMT */
> +try_prmt:
> +
>         rc = acpi_call_prm_handler(prm_cxl_dpa_spa_guid, &data);
>         if (rc) {
>                 pci_dbg(pci_dev, "failed to get SPA for %#llx: %d\n", dpa, rc);
>                 return ULLONG_MAX;
>         }
> 
> +out:
>         pci_dbg(pci_dev, "PRM address translation: DPA -> SPA: %#llx -> %#llx\n", dpa, spa);
> 
>         return spa;
Robert Richter Jan. 17, 2025, 2:10 p.m. UTC | #17
On 17.01.25 11:46:42, Jonathan Cameron wrote:
> On Fri, 17 Jan 2025 08:59:00 +0100
> Robert Richter <rrichter@amd.com> wrote:
> > On 14.01.25 11:13:07, Jonathan Cameron wrote:

> > > > > static void cxl_port_platform_setup(struct cxl_port *port)
> > > > > {
> > > > > 	cxl_port_setup_amd(port);
> > > > > 	cxl_port_setup_intel(port);
> > > > > 	cxl_port_setup_arm(port);
> > > > > 	... etc ...
> > > > > }
> > > > > 
> > > > > I suppose this logic has to exist somewhere in some form, just want to make
> > > > > sure this is what we want.  Either way, this is easily modifiable, so
> > > > > not a blocker as I said.    
> > > > 
> > > > Yes, it is exactly designed like that. I will update the patch
> > > > description.  
> > > 
> > > If we need it on ARM then we might wrap this in an arch_cxl_port_platform_setup(),
> > > as we never build a kernel that targets both x86 and arm. We could rely on stubs,
> > > but that tends to get ugly as things grow.
> > 
> > I could move the function and file to core/x86/amd.c right away and add:
> > 
> >  void __weak arch_cxl_port_platform_setup(struct cxl_port *port) { }
> Something like that probably makes sense. I don't like x86 calls in what
> I'm building for arm, even if they are stubbed out ;)

Sure, will change that.

Thanks for review,

-Robert
Ben Cheatham Jan. 17, 2025, 9:32 p.m. UTC | #18
On 1/7/25 8:10 AM, Robert Richter wrote:
> Add AMD platform specific Zen5 support for address translation.
> 
> Zen5 systems may be configured to use 'Normalized addresses'. Then,
> CXL endpoints use their own physical address space and Host Physical
> Addresses (HPAs) need address translation from the endpoint to its CXL
> host bridge. The HPA of a CXL host bridge is equivalent to the System
> Physical Address (SPA).
> 
> ACPI Platform Runtime Mechanism (PRM) is used to translate the CXL
> Device Physical Address (DPA) to its System Physical Address. This is
> documented in:
> 
>  AMD Family 1Ah Models 00h–0Fh and Models 10h–1Fh
>  ACPI v6.5 Porting Guide, Publication # 58088
>  https://www.amd.com/en/search/documentation/hub.html
> 
> Note that DPA and HPA of an endpoint may differ depending on the
> interleaving configuration. That is, an additional calculation between
> DPA and HPA is needed.
> 
> To implement AMD Zen5 address translation the following steps are
> needed:
> 
> Introduce the generic function cxl_port_platform_setup() that allows
> applying platform-specific changes to each port where necessary.
> 
> Add a function cxl_port_setup_amd() to implement AMD platform specific
> code. Use Kbuild and Kconfig options respectively to enable the code
> depending on architecture and platform options. Create a new file
> core/amd.c for this.
> 
> Introduce a function cxl_zen5_init() to handle Zen5 specific
> enablement. Zen5 platforms are detected using the PCIe vendor and
> device ID of the corresponding CXL root port.
> 
> Apply cxl_zen5_to_hpa() as cxl_port->to_hpa() callback to Zen5 CXL
> host bridges to enable platform specific address translation.
> 
> Use ACPI PRM DPA to SPA translation to determine an endpoint's
> interleaving configuration and base address during the early
> initialization process. This is used to determine an endpoint's SPA
> range.
> 
> Since the PRM translates DPA->SPA, but HPA->SPA is needed, determine
> the interleaving config and base address of the endpoint first, then
> calculate the SPA based on the given HPA using the address base.
> 
> The config can be determined calling the PRM for specific DPAs
> given. Since the interleaving configuration is still unknown, choose
> DPAs starting at 0xd20000. This address is a multiple of all values from
> 1 to 8 and thus valid for all possible interleaving configurations.
> The resulting SPAs are used to calculate interleaving parameters and
> the SPA base address of the endpoint. The maximum granularity (chunk
> size) is 16k, minimum is 256. Use the following calculation for a
> given DPA:
> 
>  ways    = hpa_len(SZ_16K) / SZ_16K
>  gran    = (hpa_len(SZ_16K) - hpa_len(SZ_16K - SZ_256) - SZ_256)
>           / (ways - 1)
>  pos     = (hpa_len(SZ_16K) - ways * SZ_16K) / gran
> 
> Once the endpoint is attached to a region and its SPA range is known,
> calling the PRM is no longer needed; the SPA base can be used.
> 
> Signed-off-by: Robert Richter <rrichter@amd.com>
> ---
>  drivers/cxl/Kconfig       |   4 +
>  drivers/cxl/core/Makefile |   1 +
>  drivers/cxl/core/amd.c    | 227 ++++++++++++++++++++++++++++++++++++++
>  drivers/cxl/core/core.h   |   6 +
>  drivers/cxl/core/port.c   |   7 ++
>  5 files changed, 245 insertions(+)
>  create mode 100644 drivers/cxl/core/amd.c
> 
> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> index 876469e23f7a..e576028dd983 100644
> --- a/drivers/cxl/Kconfig
> +++ b/drivers/cxl/Kconfig
> @@ -146,4 +146,8 @@ config CXL_REGION_INVALIDATION_TEST
>  	  If unsure, or if this kernel is meant for production environments,
>  	  say N.
>  
> +config CXL_AMD
> +       def_bool y
> +       depends on AMD_NB
> +
>  endif
> diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
> index 9259bcc6773c..dc368e61d281 100644
> --- a/drivers/cxl/core/Makefile
> +++ b/drivers/cxl/core/Makefile
> @@ -16,3 +16,4 @@ cxl_core-y += pmu.o
>  cxl_core-y += cdat.o
>  cxl_core-$(CONFIG_TRACING) += trace.o
>  cxl_core-$(CONFIG_CXL_REGION) += region.o
> +cxl_core-$(CONFIG_CXL_AMD) += amd.o
> diff --git a/drivers/cxl/core/amd.c b/drivers/cxl/core/amd.c
> new file mode 100644
> index 000000000000..553b7d0caefd
> --- /dev/null
> +++ b/drivers/cxl/core/amd.c
> @@ -0,0 +1,227 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2024 Advanced Micro Devices, Inc.

Make sure to update the year. Don't know if it's supposed to be 2024-2025 or just 2025.

> + */
> +
> +#include <linux/prmt.h>
> +#include <linux/pci.h>
> +
> +#include "cxlmem.h"
> +#include "core.h"
> +
> +#define PCI_DEVICE_ID_AMD_ZEN5_ROOT		0x153e
> +
> +static const struct pci_device_id zen5_root_port_ids[] = {
> +	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_ZEN5_ROOT) },
> +	{},
> +};
> +
> +static int is_zen5_root_port(struct device *dev, void *unused)
> +{
> +	if (!dev_is_pci(dev))
> +		return 0;
> +
> +	return !!pci_match_id(zen5_root_port_ids, to_pci_dev(dev));
> +}
> +
> +static bool is_zen5(struct cxl_port *port)
> +{
> +	if (!IS_ENABLED(CONFIG_ACPI_PRMT))
> +		return false;
> +
> +	/* To get the CXL root port, find the CXL host bridge first. */
> +	if (is_cxl_root(port) ||
> +	    !port->host_bridge ||
> +	    !is_cxl_root(to_cxl_port(port->dev.parent)))
> +		return false;
> +
> +	return !!device_for_each_child(port->host_bridge, NULL,
> +				       is_zen5_root_port);
> +}
> +
> +/*
> + * PRM Address Translation - CXL DPA to System Physical Address
> + *
> + * Reference:
> + *
> + * AMD Family 1Ah Models 00h–0Fh and Models 10h–1Fh
> + * ACPI v6.5 Porting Guide, Publication # 58088
> + */
> +
> +static const guid_t prm_cxl_dpa_spa_guid =
> +	GUID_INIT(0xee41b397, 0x25d4, 0x452c, 0xad, 0x54, 0x48, 0xc6, 0xe3,
> +		  0x48, 0x0b, 0x94);
> +
> +struct prm_cxl_dpa_spa_data {
> +	u64 dpa;
> +	u8 reserved;
> +	u8 devfn;
> +	u8 bus;
> +	u8 segment;
> +	void *out;
> +} __packed;
> +
> +static u64 prm_cxl_dpa_spa(struct pci_dev *pci_dev, u64 dpa)
> +{
> +	struct prm_cxl_dpa_spa_data data;
> +	u64 spa;
> +	int rc;
> +
> +	data = (struct prm_cxl_dpa_spa_data) {
> +		.dpa     = dpa,
> +		.devfn   = pci_dev->devfn,
> +		.bus     = pci_dev->bus->number,
> +		.segment = pci_domain_nr(pci_dev->bus),
> +		.out     = &spa,
> +	};
> +
> +	rc = acpi_call_prm_handler(prm_cxl_dpa_spa_guid, &data);
> +	if (rc) {
> +		pci_dbg(pci_dev, "failed to get SPA for %#llx: %d\n", dpa, rc);
> +		return ULLONG_MAX;
> +	}
> +
> +	pci_dbg(pci_dev, "PRM address translation: DPA -> SPA: %#llx -> %#llx\n", dpa, spa);
> +
> +	return spa;
> +}
> +
> +static u64 cxl_zen5_to_hpa(struct cxl_decoder *cxld, u64 hpa)

After reading through your discussion with Gregory, I'm no longer confused about the
function name, but I would definitely include a comment specifying the expected usage
and return value.
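
Something like this, perhaps (just a sketch of what it could say):

	/**
	 * cxl_zen5_to_hpa() - translate a decoder HPA to an SPA
	 * @cxld: decoder the address belongs to
	 * @hpa: address within @cxld's hpa_range
	 *
	 * Returns the System Physical Address for @hpa, or ULLONG_MAX if
	 * the translation fails. For non-endpoint decoders, and when
	 * Normalized Addressing is disabled (non-zero hpa_range.start),
	 * @hpa is returned unchanged.
	 */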

Thanks,
Ben

> +{
> +	struct cxl_memdev *cxlmd;
> +	struct pci_dev *pci_dev;
> +	struct cxl_port *port;
> +	u64 dpa, base, spa, spa2, len, len2, offset, granularity;
> +	int ways, pos;
> +
> +	/*
> +	 * Nothing to do if base is non-zero and Normalized Addressing
> +	 * is disabled.
> +	 */
> +	if (cxld->hpa_range.start)
> +		return hpa;
> +
> +	/* Only translate from endpoint to its parent port. */
> +	if (!is_endpoint_decoder(&cxld->dev))
> +		return hpa;
> +
> +	if (hpa > cxld->hpa_range.end) {
> +		dev_dbg(&cxld->dev, "hpa addr %#llx out of range %#llx-%#llx\n",
> +			hpa, cxld->hpa_range.start, cxld->hpa_range.end);
> +		return ULLONG_MAX;
> +	}
> +
> +	/*
> +	 * If the decoder is already attached, the region's base can
> +	 * be used.
> +	 */
> +	if (cxld->region)
> +		return cxld->region->params.res->start + hpa;
> +
> +	port = to_cxl_port(cxld->dev.parent);
> +	cxlmd = port ? to_cxl_memdev(port->uport_dev) : NULL;
> +	if (!port || !dev_is_pci(cxlmd->dev.parent)) {
> +		dev_dbg(&cxld->dev, "No endpoint found: %s, range %#llx-%#llx\n",
> +			dev_name(cxld->dev.parent), cxld->hpa_range.start,
> +			cxld->hpa_range.end);
> +		return ULLONG_MAX;
> +	}
> +	pci_dev = to_pci_dev(cxlmd->dev.parent);
> +
> +	/*
> +	 * The PRM translates DPA->SPA, but we need HPA->SPA.
> +	 * Determine the interleaving config first, then calculate the
> +	 * DPA. Maximum granularity (chunk size) is 16k, minimum is
> +	 * 256. Calculated with:
> +	 *
> +	 *	ways	= hpa_len(SZ_16K) / SZ_16K
> +	 * 	gran	= (hpa_len(SZ_16K) - hpa_len(SZ_16K - SZ_256) - SZ_256)
> +	 *                / (ways - 1)
> +	 *	pos	= (hpa_len(SZ_16K) - ways * SZ_16K) / gran
> +	 */
> +
> +	/*
> +	 * DPA magic:
> +	 *
> +	 * Position and granularity are not known yet, so use an
> +	 * always-valid DPA:
> +	 *
> +	 * 0xd20000 = 13762560 = 16k * 2 * 3 * 2 * 5 * 7 * 2
> +	 *
> +	 * It is divisible by all positions 1 to 8. The DPA is valid
> +	 * for all positions and granularities.
> +	 */
> +#define DPA_MAGIC	0xd20000
> +	base = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC);
> +	spa  = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC + SZ_16K);
> +	spa2 = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC + SZ_16K - SZ_256);
> +
> +	/* Includes checks to avoid div by zero */
> +	if (!base || base == ULLONG_MAX || spa == ULLONG_MAX ||
> +	    spa2 == ULLONG_MAX || spa < base + SZ_16K || spa2 <= base ||
> +	    (spa > base + SZ_16K && spa - spa2 < SZ_256 * 2)) {
> +		dev_dbg(&cxld->dev, "Error translating HPA: base %#llx, spa %#llx, spa2 %#llx\n",
> +			base, spa, spa2);
> +		return ULLONG_MAX;
> +	}
> +
> +	len = spa - base;
> +	len2 = spa2 - base;
> +
> +	/* offset = pos * granularity */
> +	if (len == SZ_16K && len2 == SZ_16K - SZ_256) {
> +		ways = 1;
> +		offset = 0;
> +		granularity = 0;
> +		pos = 0;
> +	} else {
> +		ways = len / SZ_16K;
> +		offset = spa & (SZ_16K - 1);
> +		granularity = (len - len2 - SZ_256) / (ways - 1);
> +		pos = offset / granularity;
> +	}
> +
> +	base = base - DPA_MAGIC * ways - pos * granularity;
> +	spa = base + hpa;
> +
> +	/*
> +	 * Check SPA using a PRM call for the closest DPA calculated
> +	 * for the HPA. If the HPA matches a different interleaving
> +	 * position other than the decoder's, determine its offset to
> +	 * adjust the SPA.
> +	 */
> +
> +	dpa = (hpa & ~(granularity * ways - 1)) / ways
> +		+ (hpa & (granularity - 1));
> +	offset = hpa & (granularity * ways - 1) & ~(granularity - 1);
> +	offset -= pos * granularity;
> +	spa2 = prm_cxl_dpa_spa(pci_dev, dpa) + offset;
> +
> +	dev_dbg(&cxld->dev,
> +		"address mapping found for %s (dpa -> hpa -> spa): %#llx -> %#llx -> %#llx base: %#llx ways: %d pos: %d granularity: %llu\n",
> +		pci_name(pci_dev), dpa, hpa, spa, base, ways, pos, granularity);
> +
> +	if (spa != spa2) {
> +		dev_dbg(&cxld->dev, "SPA calculation failed: %#llx:%#llx\n",
> +			spa, spa2);
> +		return ULLONG_MAX;
> +	}
> +
> +	return spa;
> +}
> +
> +static void cxl_zen5_init(struct cxl_port *port)
> +{
> +	if (!is_zen5(port))
> +		return;
> +
> +	port->to_hpa = cxl_zen5_to_hpa;
> +
> +	dev_dbg(port->host_bridge, "PRM address translation enabled for %s.\n",
> +		dev_name(&port->dev));
> +}
> +
> +void cxl_port_setup_amd(struct cxl_port *port)
> +{
> +	cxl_zen5_init(port);
> +}
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 800466f96a68..efe34ae6943e 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -115,4 +115,10 @@ bool cxl_need_node_perf_attrs_update(int nid);
>  int cxl_port_get_switch_dport_bandwidth(struct cxl_port *port,
>  					struct access_coordinate *c);
>  
> +#ifdef CONFIG_CXL_AMD
> +void cxl_port_setup_amd(struct cxl_port *port);
> +#else
> +static inline void cxl_port_setup_amd(struct cxl_port *port) {};
> +#endif
> +
>  #endif /* __CXL_CORE_H__ */
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 901555bf4b73..c8176265c15c 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -831,6 +831,11 @@ static void cxl_debugfs_create_dport_dir(struct cxl_dport *dport)
>  			    &cxl_einj_inject_fops);
>  }
>  
> +static void cxl_port_platform_setup(struct cxl_port *port)
> +{
> +	cxl_port_setup_amd(port);
> +}
> +
>  static int cxl_port_add(struct cxl_port *port,
>  			resource_size_t component_reg_phys,
>  			struct cxl_dport *parent_dport)
> @@ -868,6 +873,8 @@ static int cxl_port_add(struct cxl_port *port,
>  			return rc;
>  	}
>  
> +	cxl_port_platform_setup(port);
> +
>  	rc = device_add(dev);
>  	if (rc)
>  		return rc;
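
For completeness, the probe math can be modeled in userspace (a
throwaway sketch: prm_dpa_spa() below is a stand-in that applies the
same 2-way, 256 byte, position 1 mapping as Gregory's emulation hack
earlier in the thread, with an assumed region base of 0xc050000000;
the probe steps mirror the amd.c logic quoted above):

	#include <stdint.h>
	#include <stdio.h>

	#define SZ_256		0x100ULL
	#define SZ_16K		0x4000ULL
	#define DPA_MAGIC	0xd20000ULL

	/* stand-in for the PRM call: 2 ways, gran 256, pos 1 */
	static uint64_t prm_dpa_spa(uint64_t dpa)
	{
		return 0xc050000000ULL + (dpa / SZ_256) * SZ_256 * 2
		       + 1 * SZ_256 + dpa % SZ_256;
	}

	int main(void)
	{
		uint64_t base = prm_dpa_spa(DPA_MAGIC);
		uint64_t spa  = prm_dpa_spa(DPA_MAGIC + SZ_16K);
		uint64_t spa2 = prm_dpa_spa(DPA_MAGIC + SZ_16K - SZ_256);
		uint64_t len = spa - base, len2 = spa2 - base;
		uint64_t gran = 0, offset;
		int ways = 1, pos = 0;

		if (!(len == SZ_16K && len2 == SZ_16K - SZ_256)) {
			ways = len / SZ_16K;
			offset = spa & (SZ_16K - 1);
			gran = (len - len2 - SZ_256) / (ways - 1);
			pos = offset / gran;
		}
		/* recover the SPA base of the interleave set */
		base -= DPA_MAGIC * ways + (uint64_t)pos * gran;

		/* expect: ways 2 gran 256 pos 1 base 0xc050000000 */
		printf("ways %d gran %llu pos %d base %#llx\n", ways,
		       (unsigned long long)gran, pos,
		       (unsigned long long)base);
		return 0;
	}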