Message ID: 20250107141015.3367194-26-rrichter@amd.com
State: New
Series: cxl: Add address translation support and enable AMD Zen5 platforms
On 07.01.25 15:10:11, Robert Richter wrote:
> +#include <linux/prmt.h>
Note: linux/prmt.h is broken and causes a build error:
./include/linux/prmt.h:5:27: error: unknown type name ‘guid_t’
5 | int acpi_call_prm_handler(guid_t handler_guid, void *param_buffer);
| ^~~~~~
A fix for this is handled through linux-acpi:
https://patchwork.kernel.org/project/linux-acpi/patch/20250107161923.3387552-1-rrichter@amd.com/
I expect it will already have been applied by the time these cxl
patches go in.
Anyway, I will add it to a v2 to make testing easier.
Thanks,
-Robert
On Tue, Jan 07, 2025 at 03:10:11PM +0100, Robert Richter wrote:
>
> Add a function cxl_port_setup_amd() to implement AMD platform specific
> code. Use Kbuild and Kconfig options respectively to enable the code
> depending on architecture and platform options. Create a new file
> core/amd.c for this.
>

A build option here specific to AMD doesn't seem the best. At Meta,
we try to maintain a platform agnostic kernel for our fleet (at least
for build options), and this would necessitate us maintaining separate
builds for AMD systems vs other vendors.

Is there a reason to simply not include it by default and just report
whether translation is required or not? (i.e. no build option)

Or maybe generalize to CXL_PLATFORM_QUIRKS rather than CXL_AMD?

~Gregory

> Introduce a function cxl_zen5_init() to handle Zen5 specific
> enablement. Zen5 platforms are detected using the PCIe vendor and
> device ID of the corresponding CXL root port.
>
> Apply cxl_zen5_to_hpa() as cxl_port->to_hpa() callback to Zen5 CXL
> host bridges to enable platform specific address translation.
>
> Use ACPI PRM DPA to SPA translation to determine an endpoint's
> interleaving configuration and base address during the early
> initialization process. This is used to determine an endpoint's SPA
> range.
>
> Since the PRM translates DPA->SPA, but HPA->SPA is needed, determine
> the interleaving config and base address of the endpoint first, then
> calculate the SPA based on the given HPA using the address base.
>
> The config can be determined by calling the PRM for specific DPAs.
> Since the interleaving configuration is still unknown, choose DPAs
> starting at 0xd20000. This address is divisible by all values from
> 1 to 8 and thus valid for all possible interleaving configurations.
> The resulting SPAs are used to calculate the interleaving parameters
> and the SPA base address of the endpoint. The maximum granularity
> (chunk size) is 16k, minimum is 256. Use the following calculation
> for a given DPA:
>
>   ways = hpa_len(SZ_16K) / SZ_16K
>   gran = (hpa_len(SZ_16K) - hpa_len(SZ_16K - SZ_256) - SZ_256)
>          / (ways - 1)
>   pos  = (hpa_len(SZ_16K) - ways * SZ_16K) / gran
>
> Once the endpoint is attached to a region and its SPA range is known,
> calling the PRM is no longer needed, the SPA base can be used.
>
> Signed-off-by: Robert Richter <rrichter@amd.com>
> ---
>  drivers/cxl/Kconfig       |   4 +
>  drivers/cxl/core/Makefile |   1 +
>  drivers/cxl/core/amd.c    | 227 ++++++++++++++++++++++++++++++++++++++
>  drivers/cxl/core/core.h   |   6 +
>  drivers/cxl/core/port.c   |   7 ++
>  5 files changed, 245 insertions(+)
>  create mode 100644 drivers/cxl/core/amd.c
>
> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> index 876469e23f7a..e576028dd983 100644
> --- a/drivers/cxl/Kconfig
> +++ b/drivers/cxl/Kconfig
> @@ -146,4 +146,8 @@ config CXL_REGION_INVALIDATION_TEST
>           If unsure, or if this kernel is meant for production environments,
>           say N.
>
> +config CXL_AMD
> +        def_bool y
> +        depends on AMD_NB
> +
>  endif
> diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
> index 9259bcc6773c..dc368e61d281 100644
> --- a/drivers/cxl/core/Makefile
> +++ b/drivers/cxl/core/Makefile
> @@ -16,3 +16,4 @@ cxl_core-y += pmu.o
>  cxl_core-y += cdat.o
>  cxl_core-$(CONFIG_TRACING) += trace.o
>  cxl_core-$(CONFIG_CXL_REGION) += region.o
> +cxl_core-$(CONFIG_CXL_AMD) += amd.o
> diff --git a/drivers/cxl/core/amd.c b/drivers/cxl/core/amd.c
> new file mode 100644
> index 000000000000..553b7d0caefd
> --- /dev/null
> +++ b/drivers/cxl/core/amd.c
> @@ -0,0 +1,227 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2024 Advanced Micro Devices, Inc.
> + */
> +
> +#include <linux/prmt.h>
> +#include <linux/pci.h>
> +
> +#include "cxlmem.h"
> +#include "core.h"
> +
> +#define PCI_DEVICE_ID_AMD_ZEN5_ROOT 0x153e
> +
> +static const struct pci_device_id zen5_root_port_ids[] = {
> +        { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_ZEN5_ROOT) },
> +        {},
> +};
> +
> +static int is_zen5_root_port(struct device *dev, void *unused)
> +{
> +        if (!dev_is_pci(dev))
> +                return 0;
> +
> +        return !!pci_match_id(zen5_root_port_ids, to_pci_dev(dev));
> +}
> +
> +static bool is_zen5(struct cxl_port *port)
> +{
> +        if (!IS_ENABLED(CONFIG_ACPI_PRMT))
> +                return false;
> +
> +        /* To get the CXL root port, find the CXL host bridge first. */
> +        if (is_cxl_root(port) ||
> +            !port->host_bridge ||
> +            !is_cxl_root(to_cxl_port(port->dev.parent)))
> +                return false;
> +
> +        return !!device_for_each_child(port->host_bridge, NULL,
> +                                       is_zen5_root_port);
> +}
> +
> +/*
> + * PRM Address Translation - CXL DPA to System Physical Address
> + *
> + * Reference:
> + *
> + * AMD Family 1Ah Models 00h–0Fh and Models 10h–1Fh
> + * ACPI v6.5 Porting Guide, Publication # 58088
> + */
> +
> +static const guid_t prm_cxl_dpa_spa_guid =
> +        GUID_INIT(0xee41b397, 0x25d4, 0x452c, 0xad, 0x54, 0x48, 0xc6, 0xe3,
> +                  0x48, 0x0b, 0x94);
> +
> +struct prm_cxl_dpa_spa_data {
> +        u64 dpa;
> +        u8 reserved;
> +        u8 devfn;
> +        u8 bus;
> +        u8 segment;
> +        void *out;
> +} __packed;
> +
> +static u64 prm_cxl_dpa_spa(struct pci_dev *pci_dev, u64 dpa)
> +{
> +        struct prm_cxl_dpa_spa_data data;
> +        u64 spa;
> +        int rc;
> +
> +        data = (struct prm_cxl_dpa_spa_data) {
> +                .dpa = dpa,
> +                .devfn = pci_dev->devfn,
> +                .bus = pci_dev->bus->number,
> +                .segment = pci_domain_nr(pci_dev->bus),
> +                .out = &spa,
> +        };
> +
> +        rc = acpi_call_prm_handler(prm_cxl_dpa_spa_guid, &data);
> +        if (rc) {
> +                pci_dbg(pci_dev, "failed to get SPA for %#llx: %d\n", dpa, rc);
> +                return ULLONG_MAX;
> +        }
> +
> +        pci_dbg(pci_dev, "PRM address translation: DPA -> SPA: %#llx -> %#llx\n", dpa, spa);
> +
> +        return spa;
> +}
> +
> +static u64 cxl_zen5_to_hpa(struct cxl_decoder *cxld, u64 hpa)
> +{
> +        struct cxl_memdev *cxlmd;
> +        struct pci_dev *pci_dev;
> +        struct cxl_port *port;
> +        u64 dpa, base, spa, spa2, len, len2, offset, granularity;
> +        int ways, pos;
> +
> +        /*
> +         * Nothing to do if base is non-zero and Normalized Addressing
> +         * is disabled.
> +         */
> +        if (cxld->hpa_range.start)
> +                return hpa;
> +
> +        /* Only translate from endpoint to its parent port. */
> +        if (!is_endpoint_decoder(&cxld->dev))
> +                return hpa;
> +
> +        if (hpa > cxld->hpa_range.end) {
> +                dev_dbg(&cxld->dev, "hpa addr %#llx out of range %#llx-%#llx\n",
> +                        hpa, cxld->hpa_range.start, cxld->hpa_range.end);
> +                return ULLONG_MAX;
> +        }
> +
> +        /*
> +         * If the decoder is already attached, the region's base can
> +         * be used.
> +         */
> +        if (cxld->region)
> +                return cxld->region->params.res->start + hpa;
> +
> +        port = to_cxl_port(cxld->dev.parent);
> +        cxlmd = port ? to_cxl_memdev(port->uport_dev) : NULL;
> +        if (!port || !dev_is_pci(cxlmd->dev.parent)) {
> +                dev_dbg(&cxld->dev, "No endpoint found: %s, range %#llx-%#llx\n",
> +                        dev_name(cxld->dev.parent), cxld->hpa_range.start,
> +                        cxld->hpa_range.end);
> +                return ULLONG_MAX;
> +        }
> +        pci_dev = to_pci_dev(cxlmd->dev.parent);
> +
> +        /*
> +         * The PRM translates DPA->SPA, but we need HPA->SPA.
> +         * Determine the interleaving config first, then calculate the
> +         * DPA. Maximum granularity (chunk size) is 16k, minimum is
> +         * 256. Calculated with:
> +         *
> +         *   ways = hpa_len(SZ_16K) / SZ_16K
> +         *   gran = (hpa_len(SZ_16K) - hpa_len(SZ_16K - SZ_256) - SZ_256)
> +         *          / (ways - 1)
> +         *   pos  = (hpa_len(SZ_16K) - ways * SZ_16K) / gran
> +         */
> +
> +        /*
> +         * DPA magic:
> +         *
> +         * Position and granularity are unknown yet, use an always
> +         * valid DPA:
> +         *
> +         *   0xd20000 = 13762560 = 16k * 2 * 3 * 2 * 5 * 7 * 2
> +         *
> +         * It is divisible by all positions 1 to 8. The DPA is valid
> +         * for all positions and granularities.
> +         */
> +#define DPA_MAGIC 0xd20000
> +        base = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC);
> +        spa = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC + SZ_16K);
> +        spa2 = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC + SZ_16K - SZ_256);
> +
> +        /* Includes checks to avoid div by zero */
> +        if (!base || base == ULLONG_MAX || spa == ULLONG_MAX ||
> +            spa2 == ULLONG_MAX || spa < base + SZ_16K || spa2 <= base ||
> +            (spa > base + SZ_16K && spa - spa2 < SZ_256 * 2)) {
> +                dev_dbg(&cxld->dev, "Error translating HPA: base %#llx, spa %#llx, spa2 %#llx\n",
> +                        base, spa, spa2);
> +                return ULLONG_MAX;
> +        }
> +
> +        len = spa - base;
> +        len2 = spa2 - base;
> +
> +        /* offset = pos * granularity */
> +        if (len == SZ_16K && len2 == SZ_16K - SZ_256) {
> +                ways = 1;
> +                offset = 0;
> +                granularity = 0;
> +                pos = 0;
> +        } else {
> +                ways = len / SZ_16K;
> +                offset = spa & (SZ_16K - 1);
> +                granularity = (len - len2 - SZ_256) / (ways - 1);
> +                pos = offset / granularity;
> +        }
> +
> +        base = base - DPA_MAGIC * ways - pos * granularity;
> +        spa = base + hpa;
> +
> +        /*
> +         * Check SPA using a PRM call for the closest DPA calculated
> +         * for the HPA. If the HPA matches a different interleaving
> +         * position other than the decoder's, determine its offset to
> +         * adjust the SPA.
> +         */
> +
> +        dpa = (hpa & ~(granularity * ways - 1)) / ways
> +                + (hpa & (granularity - 1));
> +        offset = hpa & (granularity * ways - 1) & ~(granularity - 1);
> +        offset -= pos * granularity;
> +        spa2 = prm_cxl_dpa_spa(pci_dev, dpa) + offset;
> +
> +        dev_dbg(&cxld->dev,
> +                "address mapping found for %s (dpa -> hpa -> spa): %#llx -> %#llx -> %#llx base: %#llx ways: %d pos: %d granularity: %llu\n",
> +                pci_name(pci_dev), dpa, hpa, spa, base, ways, pos, granularity);
> +
> +        if (spa != spa2) {
> +                dev_dbg(&cxld->dev, "SPA calculation failed: %#llx:%#llx\n",
> +                        spa, spa2);
> +                return ULLONG_MAX;
> +        }
> +
> +        return spa;
> +}
> +
> +static void cxl_zen5_init(struct cxl_port *port)
> +{
> +        if (!is_zen5(port))
> +                return;
> +
> +        port->to_hpa = cxl_zen5_to_hpa;
> +
> +        dev_dbg(port->host_bridge, "PRM address translation enabled for %s.\n",
> +                dev_name(&port->dev));
> +}
> +
> +void cxl_port_setup_amd(struct cxl_port *port)
> +{
> +        cxl_zen5_init(port);
> +}
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 800466f96a68..efe34ae6943e 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -115,4 +115,10 @@ bool cxl_need_node_perf_attrs_update(int nid);
>  int cxl_port_get_switch_dport_bandwidth(struct cxl_port *port,
>                                          struct access_coordinate *c);
>
> +#ifdef CONFIG_CXL_AMD
> +void cxl_port_setup_amd(struct cxl_port *port);
> +#else
> +static inline void cxl_port_setup_amd(struct cxl_port *port) {};
> +#endif
> +
>  #endif /* __CXL_CORE_H__ */
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 901555bf4b73..c8176265c15c 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -831,6 +831,11 @@ static void cxl_debugfs_create_dport_dir(struct cxl_dport *dport)
>                              &cxl_einj_inject_fops);
>  }
>
> +static void cxl_port_platform_setup(struct cxl_port *port)
> +{
> +        cxl_port_setup_amd(port);
> +}
> +
>  static int cxl_port_add(struct cxl_port *port,
>                          resource_size_t component_reg_phys,
>                          struct cxl_dport *parent_dport)
> @@ -868,6 +873,8 @@ static int cxl_port_add(struct cxl_port *port,
>                  return rc;
>          }
>
> +        cxl_port_platform_setup(port);
> +
>          rc = device_add(dev);
>          if (rc)
>                  return rc;
> --
> 2.39.5
>
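[Editor's note: the interleave-probing recipe in the commit message above can be exercised in user space. The sketch below is an illustration of the math only, not the kernel implementation: `mock_prm_dpa_spa` is an invented stand-in for the real ACPI PRM handler, modelling a hypothetical 2-way, 256-byte-granularity configuration based at 0xc050000000, the values Gregory reports later in this thread.]

```c
#include <assert.h>
#include <stdint.h>

#define SZ_256		0x100ULL
#define SZ_16K		0x4000ULL
#define DPA_MAGIC	0xd20000ULL

struct ileave_cfg {
	int ways, pos;
	uint64_t gran, base;
};

/*
 * Invented stand-in for the PRM DPA->SPA handler: forward-maps a DPA
 * for a device at interleave position cfg->pos of a cfg->ways-way set.
 */
static uint64_t mock_prm_dpa_spa(const struct ileave_cfg *cfg, uint64_t dpa)
{
	return cfg->base + (dpa / cfg->gran) * cfg->gran * cfg->ways +
	       cfg->pos * cfg->gran + dpa % cfg->gran;
}

/*
 * Recover ways/granularity/position/base from three PRM probes, as in
 * the commit message: probe DPA_MAGIC, DPA_MAGIC + 16K and
 * DPA_MAGIC + 16K - 256, then compare the SPA distances.
 */
static struct ileave_cfg probe_config(const struct ileave_cfg *hw)
{
	uint64_t base = mock_prm_dpa_spa(hw, DPA_MAGIC);
	uint64_t spa  = mock_prm_dpa_spa(hw, DPA_MAGIC + SZ_16K);
	uint64_t spa2 = mock_prm_dpa_spa(hw, DPA_MAGIC + SZ_16K - SZ_256);
	uint64_t len = spa - base, len2 = spa2 - base;
	struct ileave_cfg r = { 0 };

	if (len == SZ_16K && len2 == SZ_16K - SZ_256) {
		r.ways = 1;	/* no interleaving */
	} else {
		r.ways = len / SZ_16K;
		r.gran = (len - len2 - SZ_256) / (r.ways - 1);
		r.pos  = (spa & (SZ_16K - 1)) / r.gran;
	}
	/* Strip DPA_MAGIC and the position offset to get the SPA base. */
	r.base = base - DPA_MAGIC * r.ways - r.pos * r.gran;
	return r;
}
```

For the second device of the 2-way example discussed later in the thread, the probes come out as base 0xc051a40100, spa 0xc051a48100, spa2 0xc051a47f00, from which ways = 2, granularity = 256, pos = 1 and an SPA base of 0xc050000000 are recovered.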
On 07.01.25 18:28:57, Gregory Price wrote:
> On Tue, Jan 07, 2025 at 03:10:11PM +0100, Robert Richter wrote:
> >
> > Add a function cxl_port_setup_amd() to implement AMD platform specific
> > code. Use Kbuild and Kconfig options respectively to enable the code
> > depending on architecture and platform options. Create a new file
> > core/amd.c for this.
> >
>
> A build option here specific to AMD doesn't seem the best. At Meta,
> we try to maintain a platform agnostic kernel for our fleet (at least
> for build options), and this would necessitate us maintaining separate
> builds for AMD systems vs other vendors.
>
> Is there a reason to simply not include it by default and just report
> whether translation is required or not? (i.e. no build option)

There is no (menu) option for CXL_AMD, it is just checking the
dependencies on AMD_NB (and indirectly arch, platform and vendor). In
that case it is always enabled.

Thanks for the review.

-Robert

> Or maybe generalize to CXL_PLATFORM_QUIRKS rather than CXL_AMD?
>
> ~Gregory
On Tue, Jan 07, 2025 at 03:10:11PM +0100, Robert Richter wrote:
> Add AMD platform specific Zen5 support for address translation.
>
... snip ...
>
> Once the endpoint is attached to a region and its SPA range is known,
> calling the PRM is no longer needed, the SPA base can be used.
>
> Signed-off-by: Robert Richter <rrichter@amd.com>

One inline question, but not a blocker

Reviewed-by: Gregory Price <gourry@gourry.net>

> ---
>  drivers/cxl/Kconfig       |   4 +
>  drivers/cxl/core/Makefile |   1 +
>  drivers/cxl/core/amd.c    | 227 ++++++++++++++++++++++++++++++++++++++
>  drivers/cxl/core/core.h   |   6 +
>  drivers/cxl/core/port.c   |   7 ++
>  5 files changed, 245 insertions(+)
>  create mode 100644 drivers/cxl/core/amd.c
>
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 901555bf4b73..c8176265c15c 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -831,6 +831,11 @@ static void cxl_debugfs_create_dport_dir(struct cxl_dport *dport)
>                              &cxl_einj_inject_fops);
>  }
>
> +static void cxl_port_platform_setup(struct cxl_port *port)
> +{
> +        cxl_port_setup_amd(port);
> +}
> +

Assuming this gets expanded (which it may not), should we expect this
function to end up like so?

    static void cxl_port_platform_setup(struct cxl_port *port)
    {
        cxl_port_setup_amd(port);
        cxl_port_setup_intel(port);
        cxl_port_setup_arm(port);
        ... etc ...
    }

I suppose this logic has to exist somewhere in some form, just want to
make sure this is what we want. Either way, this is easily modifiable,
so not a blocker as I said.

>  static int cxl_port_add(struct cxl_port *port,
>                          resource_size_t component_reg_phys,
>                          struct cxl_dport *parent_dport)
> @@ -868,6 +873,8 @@ static int cxl_port_add(struct cxl_port *port,
>                  return rc;
>          }
>
> +        cxl_port_platform_setup(port);
> +
>          rc = device_add(dev);
>          if (rc)
>                  return rc;
> --
> 2.39.5
>
On Wed, Jan 08, 2025 at 03:52:35PM +0100, Robert Richter wrote:
> On 07.01.25 18:28:57, Gregory Price wrote:
> > On Tue, Jan 07, 2025 at 03:10:11PM +0100, Robert Richter wrote:
> > >
> > > Add a function cxl_port_setup_amd() to implement AMD platform specific
> > > code. Use Kbuild and Kconfig options respectively to enable the code
> > > depending on architecture and platform options. Create a new file
> > > core/amd.c for this.
> > >
> >
> > A build option here specific to AMD doesn't seem the best. At Meta,
> > we try to maintain a platform agnostic kernel for our fleet (at least
> > for build options), and this would necessitate us maintaining separate
> > builds for AMD systems vs other vendors.
> >
> > Is there a reason to simply not include it by default and just report
> > whether translation is required or not? (i.e. no build option)
>
> There is no (menu) option for CXL_AMD, it is just checking the
> dependencies on AMD_NB (and indirectly arch, platform and vendor). In
> that case it is always enabled.
>

Ah! I completely misunderstood the build option, my bad. This makes
sense, sorry about that.

~Gregory
On 08.01.25 10:48:23, Gregory Price wrote:
> > diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> > index 901555bf4b73..c8176265c15c 100644
> > --- a/drivers/cxl/core/port.c
> > +++ b/drivers/cxl/core/port.c
> > @@ -831,6 +831,11 @@ static void cxl_debugfs_create_dport_dir(struct cxl_dport *dport)
> >                              &cxl_einj_inject_fops);
> >  }
> >
> > +static void cxl_port_platform_setup(struct cxl_port *port)
> > +{
> > +        cxl_port_setup_amd(port);
> > +}
> > +
>
> Assuming this gets expanded (which it may not), should we expect this
> function to end up like so?
>
>     static void cxl_port_platform_setup(struct cxl_port *port)
>     {
>         cxl_port_setup_amd(port);
>         cxl_port_setup_intel(port);
>         cxl_port_setup_arm(port);
>         ... etc ...
>     }
>
> I suppose this logic has to exist somewhere in some form, just want to
> make sure this is what we want. Either way, this is easily modifiable,
> so not a blocker as I said.

Yes, it is exactly designed like that. I will update the patch
description.

-Robert
On Tue, Jan 07, 2025 at 03:10:11PM +0100, Robert Richter wrote:
> Add AMD platform specific Zen5 support for address translation.

Doing some testing here and I'm seeing some odd results, also noticing
some naming inconsistencies.

>
> +static u64 cxl_zen5_to_hpa(struct cxl_decoder *cxld, u64 hpa)
> +{

Function name is _to_hpa, but hpa is an argument?

Should be dpa as argument? Confusing to convert an hpa to an hpa.

... snip ...

> +#define DPA_MAGIC 0xd20000
> +        base = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC);
> +        spa = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC + SZ_16K);
> +        spa2 = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC + SZ_16K - SZ_256);

For two devices interleaved, the base should be the same, correct?

example: 2 128GB devices interleaved/normalized:

dev0: base(0xc051a40000) spa(0xc051a48000) spa2(0xc051a47e00)
dev1: base(0xc051a40100) spa(0xc051a48100) spa2(0xc051a47f00)

I believe these numbers are correct.

(Note: Using PRMT emulation because I don't have a BIOS with this blob,
but this is the same emulation I have been using for about 4 months now
with operational hardware, so unless the translation contract changed
and this code expects something different, it should be correct.)

... snip ...

> +        len = spa - base;
> +        len2 = spa2 - base;
> +
> +        /* offset = pos * granularity */
> +        if (len == SZ_16K && len2 == SZ_16K - SZ_256) {
> +                ways = 1;
> +                offset = 0;
> +                granularity = 0;
> +                pos = 0;
> +        } else {
> +                ways = len / SZ_16K;
> +                offset = spa & (SZ_16K - 1);
> +                granularity = (len - len2 - SZ_256) / (ways - 1);
> +                pos = offset / granularity;
> +        }

the interleave ways and such calculate out correctly

dev0: ways(0x2) offset(0x0) granularity(0x100) pos(0x0)
dev1: ways(0x2) offset(0x100) granularity(0x100) pos(0x1)

> +
> +        base = base - DPA_MAGIC * ways - pos * granularity;
> +        spa = base + hpa;

DPA(0)
dev0: base(0xc050000000) spa(0xc050000000)
dev1: base(0xc050000000) spa(0xc050000000)

DPA(0x1fffffffff)
dev0: base(0xc050000000) spa(0xe04fffffff)
dev1: base(0xc050000000) spa(0xe04fffffff)

The bases seem correct, the SPAs look suspect.

dev1 should have a very different SPA, shouldn't it?

> +
> +        /*
> +         * Check SPA using a PRM call for the closest DPA calculated
> +         * for the HPA. If the HPA matches a different interleaving
> +         * position other than the decoder's, determine its offset to
> +         * adjust the SPA.
> +         */
> +
> +        dpa = (hpa & ~(granularity * ways - 1)) / ways
> +                + (hpa & (granularity - 1));

I do not understand this chunk here, we seem to just be chopping the HPA
in half to acquire the DPA. But the value passed in is already a DPA.

dpa = (0x1fffffffff & ~(256 * 2 - 1)) / 2 + (0x1fffffffff & (256 - 1))
    = 0xfffffffff

I don't understand why the DPA address is suddenly half (64GB boundary).

> +        offset = hpa & (granularity * ways - 1) & ~(granularity - 1);
> +        offset -= pos * granularity;
> +        spa2 = prm_cxl_dpa_spa(pci_dev, dpa) + offset;
> +
> +        dev_dbg(&cxld->dev,
> +                "address mapping found for %s (dpa -> hpa -> spa): %#llx -> %#llx -> %#llx base: %#llx ways: %d pos: %d granularity: %llu\n",
> +                pci_name(pci_dev), dpa, hpa, spa, base, ways, pos, granularity);
> +

This results in a translation that appears to be wrong:

dev0:
cxl decoder5.0: address mapping found for 0000:e1:00.0
  (dpa -> hpa -> spa): 0x0 -> 0x0 -> 0xc050000000
  base: 0xc050000000 ways: 2 pos: 0 granularity: 256
cxl decoder5.0: address mapping found for 0000:e1:00.0
  (dpa -> hpa -> spa): 0xfffffffff -> 0x1fffffffff -> 0xe04fffffff
  base: 0xc050000000 ways: 2 pos: 0 granularity: 256

dev1:
cxl decoder6.0: address mapping found for 0000:c1:00.0
  (dpa -> hpa -> spa): 0x0 -> 0x0 -> 0xc050000000
  base: 0xc050000000 ways: 2 pos: 1 granularity: 256
cxl decoder6.0: address mapping found for 0000:c1:00.0
  (dpa -> hpa -> spa): 0xfffffffff -> 0x1fffffffff -> 0xe04fffffff
  base: 0xc050000000 ways: 2 pos: 1 granularity: 256

These do not look correct.

Is my understanding of the PRMT translation incorrect?
I expect the following: (assuming one contiguous CFMW)

dev0 (dpa -> hpa -> spa): 0x0 -> 0x0 -> 0xc050000000
dev1 (dpa -> hpa -> spa): 0x0 -> 0x100 -> 0xc050000100
dev0 (dpa -> hpa -> spa): 0x1fffffffff -> 0x3ffffffeff -> 0x1004ffffeff
dev1 (dpa -> hpa -> spa): 0x1fffffffff -> 0x3fffffffff -> 0x1004fffffff

Extra data: here are the programmed endpoint decoder values

[endpoint5/decoder5.0]# cat start size dpa_size interleave_ways interleave_granularity
0x0
0x2000000000
0x0000002000000000
1
256

[endpoint6/decoder6.0]# cat start size dpa_size interleave_ways interleave_granularity
0x0
0x2000000000
0x0000002000000000
1
256

Anyway, yeah I'm a bit confused how this is all supposed to actually
work given that both devices translate to the same addresses.

In theory this *should* work since the root decoder covers the whole
space - as this has been working for me previously with some hacked up
PRMT emulation code.

[decoder0.0]# cat start size interleave_ways interleave_granularity
0xc050000000
0x4000000000
2
256

[decoder1.0]# cat start size interleave_ways interleave_granularity
0xc050000000
0x4000000000
1
256

[decoder3.0]# cat start size interleave_ways interleave_granularity
0xc050000000
0x4000000000
1
256

[decoder5.0]# cat start size interleave_ways interleave_granularity
0x0
0x2000000000
1
256

[decoder6.0]# cat start size interleave_ways interleave_granularity
0x0
0x2000000000
1
256

~Gregory
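[Editor's note: the expected dpa -> spa values above follow the standard modulo interleave forward mapping. A minimal user-space sketch, with an invented helper name, assuming one contiguous CFMW at 0xc050000000 with 2 ways and 256-byte granularity as in the decoder dumps:]

```c
#include <assert.h>
#include <stdint.h>

/*
 * Forward DPA -> SPA mapping for a device at interleave position 'pos':
 * each 'gran'-sized DPA chunk advances the SPA by gran * ways, plus the
 * device's fixed position offset. (Illustrative sketch only.)
 */
static uint64_t dpa_to_spa(uint64_t spa_base, uint64_t dpa,
			   unsigned int ways, uint64_t gran, unsigned int pos)
{
	return spa_base + (dpa / gran) * gran * ways + pos * gran + dpa % gran;
}
```

Plugging in the four endpoint DPAs listed above reproduces exactly the SPAs Gregory expects (0xc050000000, 0xc050000100, 0x1004ffffeff and 0x1004fffffff).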
On Tue, Jan 07, 2025 at 03:10:11PM +0100, Robert Richter wrote:
> Add AMD platform specific Zen5 support for address translation.
>
> Zen5 systems may be configured to use 'Normalized addresses'. Then,
> CXL endpoints use their own physical address space and Host Physical
> Addresses (HPAs) need address translation from the endpoint to its CXL
> host bridge. The HPA of a CXL host bridge is equivalent to the System
> Physical Address (SPA).
>

Just adding the note that I've tested this patch set for HPA==SPA and
found it causes no regressions in my setup.

Still working on testing the normalized address mode due to a few BIOS
quirks I'm running up against.

~Gregory
On Thu, 9 Jan 2025 11:14:46 +0100
Robert Richter <rrichter@amd.com> wrote:

> On 08.01.25 10:48:23, Gregory Price wrote:
> > > diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> > > index 901555bf4b73..c8176265c15c 100644
> > > --- a/drivers/cxl/core/port.c
> > > +++ b/drivers/cxl/core/port.c
> > > @@ -831,6 +831,11 @@ static void cxl_debugfs_create_dport_dir(struct cxl_dport *dport)
> > >                              &cxl_einj_inject_fops);
> > >  }
> > >
> > > +static void cxl_port_platform_setup(struct cxl_port *port)
> > > +{
> > > +        cxl_port_setup_amd(port);
> > > +}
> > > +
> >
> > Assuming this gets expanded (which it may not), should we expect this
> > function to end up like so?
> >
> >     static void cxl_port_platform_setup(struct cxl_port *port)
> >     {
> >         cxl_port_setup_amd(port);
> >         cxl_port_setup_intel(port);
> >         cxl_port_setup_arm(port);
> >         ... etc ...
> >     }
> >
> > I suppose this logic has to exist somewhere in some form, just want to
> > make sure this is what we want. Either way, this is easily modifiable,
> > so not a blocker as I said.
>
> Yes, it is exactly designed like that. I will update the patch
> description.

If we need it on ARM then we might wrap this in an
arch_cxl_port_platform_setup() as we'd never build a kernel that does
both x86 and arm. Could rely on stubs but that tends to get ugly as
things grow.

Other than that, all makes sense.

Jonathan

> -Robert
On 09.01.25 17:25:13, Gregory Price wrote:
> On Tue, Jan 07, 2025 at 03:10:11PM +0100, Robert Richter wrote:
> > Add AMD platform specific Zen5 support for address translation.
>
> Doing some testing here and I'm seeing some odd results, also noticing
> some naming inconsistencies.
>
> >
> > +static u64 cxl_zen5_to_hpa(struct cxl_decoder *cxld, u64 hpa)
> > +{
>
> Function name is _to_hpa, but hpa is an argument?

Conversion is always done from the (old) HPA to the (new) HPA of the
parent port. Note that the HPA of the root port/host bridge is the
same as the SPA. Ports in between may have their own HPA ranges.

> Should be dpa as argument? Confusing to convert an hpa to an hpa.

We need to handle the decoder address ranges, the argument is always
the HPA range the decoder belongs to. The DPA is only on the device
side, which is a different address range compared to the decoders.

The decoders do the interleaving arithmetic too, and the DPA range may
be different. E.g. the decoders may split requests to different
endpoints depending on the number of interleaving ways, and the
endpoints then have their own (smaller) DPA address ranges.

> ... snip ...
>
> > +#define DPA_MAGIC 0xd20000
> > +        base = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC);
> > +        spa = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC + SZ_16K);
> > +        spa2 = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC + SZ_16K - SZ_256);
>
> For two devices interleaved, the base should be the same, correct?

Same except for the interleaving offset, which is seen below (dev1
shows *100). At this stage we don't know the interleaving position of
the endpoint yet.

> example: 2 128GB devices interleaved/normalized:
>
> dev0: base(0xc051a40000) spa(0xc051a48000) spa2(0xc051a47e00)
> dev1: base(0xc051a40100) spa(0xc051a48100) spa2(0xc051a47f00)
>
> I believe these numbers are correct.

Looks good.

> ... snip ...
>
> > +        len = spa - base;
> > +        len2 = spa2 - base;
> > +
> > +        /* offset = pos * granularity */
> > +        if (len == SZ_16K && len2 == SZ_16K - SZ_256) {
> > +                ways = 1;
> > +                offset = 0;
> > +                granularity = 0;
> > +                pos = 0;
> > +        } else {
> > +                ways = len / SZ_16K;
> > +                offset = spa & (SZ_16K - 1);
> > +                granularity = (len - len2 - SZ_256) / (ways - 1);
> > +                pos = offset / granularity;
> > +        }
>
> the interleave ways and such calculate out correctly
>
> dev0: ways(0x2) offset(0x0) granularity(0x100) pos(0x0)
> dev1: ways(0x2) offset(0x100) granularity(0x100) pos(0x1)
>
> > +
> > +        base = base - DPA_MAGIC * ways - pos * granularity;
> > +        spa = base + hpa;
>
> DPA(0)
> dev0: base(0xc050000000) spa(0xc050000000)
> dev1: base(0xc050000000) spa(0xc050000000)
>
> DPA(0x1fffffffff)
> dev0: base(0xc050000000) spa(0xe04fffffff)
> dev1: base(0xc050000000) spa(0xe04fffffff)
>
> The bases seem correct, the SPAs look suspect.

The SPA range length must be 0x4000000000 (2x 128G). That is, the
upper SPA must be 0x10050000000 (0xc050000000 + 0x4000000000 - 1).
This one is too short.

The decoder range lengths below look correct (0x2000000000), the
interleaving configuration should be checked for the decoders.

> dev1 should have a very different SPA shouldn't it?

No, the HPA range is calculated, not the DPA range. Both endpoints
have the same HPA range, it must be equal and this looks correct.

In the end we calculate the following here (see
cxl_find_auto_decoder()):

	hpa = cxld->hpa_range;

	// endpoint's hpa range is zero-based, equivalent to:
	// hpa->start = 0;
	// hpa->end = range_len(&hpa) - 1;

	base = hpa.start = port->to_hpa(cxld, hpa.start); // HPA(0)
	spa  = hpa.end   = port->to_hpa(cxld, hpa.end);   // HPA(decoder_size - 1)

Again, the HPA is the address the decoder is programmed with. The HPA
length is 0x2000000000 (spa - base + 1). The DPA range is (for 2 ways)
half its size.

The PRM translates DPA to SPA, but we want to translate HPA to SPA.
That is what we need the calculation for.

> > +
> > +        /*
> > +         * Check SPA using a PRM call for the closest DPA calculated
> > +         * for the HPA. If the HPA matches a different interleaving
> > +         * position other than the decoder's, determine its offset to
> > +         * adjust the SPA.
> > +         */
> > +
> > +        dpa = (hpa & ~(granularity * ways - 1)) / ways
> > +                + (hpa & (granularity - 1));
>
> I do not understand this chunk here, we seem to just be chopping the HPA
> in half to acquire the DPA. But the value passed in is already a DPA.
>
> dpa = (0x1fffffffff & ~(256 * 2 - 1)) / 2 + (0x1fffffffff & (256 - 1))
>     = 0xfffffffff

The HPA is:

	HPA = 2 * 0x2000000000 - 1 = 0x3fffffffff

It should calculate for a 2-way config to: DPA = 0x1fffffffff.

Actual formula:

	dpa = HPA div (granularity * ways) * granularity + HPA mod granularity
	pos = (HPA mod (granularity * ways)) div granularity

Bits used (e.g. HPA length: 0x4000000000 = 2^38, ways: 2):

	hpa = 00000000000000000000000000XXXXXXXXXXXXXXXXXXXXXXXXXXXXXYZZZZZZZZ
	dpa = 000000000000000000000000000XXXXXXXXXXXXXXXXXXXXXXXXXXXXXZZZZZZZZ
	pos = Y

With:

	X ... base part of the address
	Y ... interleaving position
	Z ... address offset

For the DPA the positional bits are removed.

> I don't understand why the DPA address is suddenly half (64GB boundary).

There is probably a broken interleaving config causing half the size
of total device mem.

> > +        offset = hpa & (granularity * ways - 1) & ~(granularity - 1);
> > +        offset -= pos * granularity;
> > +        spa2 = prm_cxl_dpa_spa(pci_dev, dpa) + offset;
> > +
> > +        dev_dbg(&cxld->dev,
> > +                "address mapping found for %s (dpa -> hpa -> spa): %#llx -> %#llx -> %#llx base: %#llx ways: %d pos: %d granularity: %llu\n",
> > +                pci_name(pci_dev), dpa, hpa, spa, base, ways, pos, granularity);
> > +
>
> This results in a translation that appears to be wrong:
>
> dev0:
> cxl decoder5.0: address mapping found for 0000:e1:00.0
>   (dpa -> hpa -> spa): 0x0 -> 0x0 -> 0xc050000000
>   base: 0xc050000000 ways: 2 pos: 0 granularity: 256
> cxl decoder5.0: address mapping found for 0000:e1:00.0
>   (dpa -> hpa -> spa): 0xfffffffff -> 0x1fffffffff -> 0xe04fffffff
>   base: 0xc050000000 ways: 2 pos: 0 granularity: 256
>
> dev1:
> cxl decoder6.0: address mapping found for 0000:c1:00.0
>   (dpa -> hpa -> spa): 0x0 -> 0x0 -> 0xc050000000
>   base: 0xc050000000 ways: 2 pos: 1 granularity: 256
> cxl decoder6.0: address mapping found for 0000:c1:00.0
>   (dpa -> hpa -> spa): 0xfffffffff -> 0x1fffffffff -> 0xe04fffffff
>   base: 0xc050000000 ways: 2 pos: 1 granularity: 256
>
> These do not look correct.
>
> Is my understanding of the PRMT translation incorrect?
> I expect the following: (assuming one contiguous CFMW)
>
> dev0 (dpa -> hpa -> spa): 0x0 -> 0x0 -> 0xc050000000
> dev1 (dpa -> hpa -> spa): 0x0 -> 0x100 -> 0xc050000100
> dev0 (dpa -> hpa -> spa): 0x1fffffffff -> 0x3ffffffeff -> 0x1004ffffeff
> dev1 (dpa -> hpa -> spa): 0x1fffffffff -> 0x3fffffffff -> 0x1004fffffff

Yes, that would be the result without the offset applied for spa2
above. The check above calculates the *total* length of hpa and spa
without considering the interleaving position. This is corrected using
the offset. There is no call prm_cxl_dpa_spa(dev0, 0x1fffffffff) that
returns 0x1004fffffff, but we want to check the upper boundary of the
SPA range.

> Extra data: here are the programmed endpoint decoder values
>
> [endpoint5/decoder5.0]# cat start size dpa_size interleave_ways interleave_granularity
> 0x0
> 0x2000000000
> 0x0000002000000000
> 1
> 256
>
> [endpoint6/decoder6.0]# cat start size dpa_size interleave_ways interleave_granularity
> 0x0
> 0x2000000000
> 0x0000002000000000
> 1
> 256

This is correct and must be half the size of the HPA window.

Thanks for testing.

-Robert

> Anyway, yeah I'm a bit confused how this is all supposed to actually
> work given that both devices translate to the same addresses.
>
> In theory this *should* work since the root decoder covers the whole
> space - as this has been working for me previously with some hacked up
> PRMT emulation code.
>
> [decoder0.0]# cat start size interleave_ways interleave_granularity
> 0xc050000000
> 0x4000000000
> 2
> 256
>
> [decoder1.0]# cat start size interleave_ways interleave_granularity
> 0xc050000000
> 0x4000000000
> 1
> 256
>
> [decoder3.0]# cat start size interleave_ways interleave_granularity
> 0xc050000000
> 0x4000000000
> 1
> 256
>
> [decoder5.0]# cat start size interleave_ways interleave_granularity
> 0x0
> 0x2000000000
> 1
> 256
>
> [decoder6.0]# cat start size interleave_ways interleave_granularity
> 0x0
> 0x2000000000
> 1
> 256
>
> ~Gregory
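[Editor's note: Robert's div/mod formula above can be checked in isolation. A user-space sketch with invented helper names, operating on the zero-based endpoint HPA with the 2-way, 256-byte-granularity configuration from this thread:]

```c
#include <assert.h>
#include <stdint.h>

/*
 * Strip the interleave-position bits out of a zero-based endpoint HPA:
 *
 *   dpa = HPA div (gran * ways) * gran + HPA mod gran
 */
static uint64_t hpa_to_dpa(uint64_t hpa, unsigned int ways, uint64_t gran)
{
	return (hpa / (gran * ways)) * gran + hpa % gran;
}

/*
 * Recover the interleaving position encoded in the HPA:
 *
 *   pos = (HPA mod (gran * ways)) div gran
 */
static unsigned int hpa_pos(uint64_t hpa, unsigned int ways, uint64_t gran)
{
	return (hpa % (gran * ways)) / gran;
}
```

With the full 2-way HPA 0x3fffffffff this yields DPA 0x1fffffffff at position 1, as stated above; feeding in the halved value 0x1fffffffff (what Gregory observed being passed in) yields 0xfffffffff, matching his calculation and pointing at the suspect decoder programming rather than the formula.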
On Wed, Jan 15, 2025 at 04:05:16PM +0100, Robert Richter wrote: > > > > Should be dpa as argument? Confusing to convert an hpa to an hpa. > > We need to handle the decoder address ranges, the argument is always > the HPA range the decoder belongs to. I see, and this is where my confusion stems from. Basically these addresses are considered "HPA" because they are programmed to decoders, and decoder addresses are "always HPA". i.e. 2 interleaved devices (endpoint decoders) with normalized addresses: dev0: base(0x0) len(0x200000000) dev1: base(0x0) len(0x200000000) These are HPAs because decoders are programmed with HPAs. It's just that in this (specific) case HPA=DPA, while root decoders and host bridge decoders will always have HPA=SPA. We're just translating up the stack from HPA range to HPA range. I've been dealing with virtualization for a long time and this has been painful for me to follow - but I think I'm getting there. > > DPA(0) > > dev0: base(0xc050000000) spa(0xc050000000) > > dev1: base(0xc050000000) spa(0xc050000000) > > > > DPA(0x1fffffffff) > > dev0: base(0xc050000000) spa(0xe04fffffff) > > dev1: base(0xc050000000) spa(0xe04fffffff) > > > > The bases seems correct, the SPAs looks suspect. > > SPA range length must be 0x4000000000 (2x 128G). That is, upper SPA > must be 0x10050000000 (0xc050000000 + 0x4000000000 - 1). This one is > too short. > > The decoder range lengths below look correct (0x2000000000), the > interleaving configuration should be checked for the decoders. 
> If I understand correctly, this configuration may be suspect [decoder0.0]# cat start size interleave_ways interleave_granularity 0xc050000000 0x4000000000 2 <----- root decoder reports interleave ways = 2 256 [decoder1.0]# cat start size interleave_ways interleave_granularity 0xc050000000 0x4000000000 1 <----- host bridge decoder reports interleave ways = 1 256 [decoder3.0]# cat start size interleave_ways interleave_granularity 0xc050000000 0x4000000000 1 <----- host bridge decoder reports interleave ways = 1 256 > > I do not understand this chunk here, we seem to just be chopping the HPA > > in half to acquire the DPA. But the value passed in is already a DPA. > > > > dpa = (0x1fffffffff & ~(256 * 2 - 1)) / 2 + (0x1fffffffff & (256 - 1)) > > = 0xfffffffff > > HPA is: > > HPA = 2 * 0x2000000000 - 1 = 0x3fffffffff > ... snip ... > There is probably a broken interleaving config causing half the size > of total device mem. > In my case, I never see 0x3fffffffff passed in. The value 0x1fffffffff from the endpoint decoders is always passed in. This suggests the host bridge interleave ways should be 2. I can force this and figure out why it's reporting 1 and get back to you. > > dev0 (dpa -> hpa -> spa): 0x0 -> 0x0 -> 0xc050000000 > > dev1 (dpa -> hpa -> spa): 0x0 -> 0x100 -> 0xc050000100 > > dev0 (dpa -> hpa -> spa): 0x1fffffffff -> 0x3ffffffeff -> 0x1004ffffeff > > dev1 (dpa -> hpa -> spa): 0x1fffffffff -> 0x3fffffffff -> 0x1004fffffff > > Yes, would be the result without the offset applied for spa2 above. > The check above calculates the *total* length of hpa and spa without > considering the interleaving position. This is corrected using the > offset. There is no call prm_cxl_dpa_spa(dev0, 0x1fffffffff) that > returns 0x1004fffffff, but we want to check the upper boundary of the > SPA range. 
> This makes sense now, there's no dpa->spa direct translation because you may have to go through multiple layers of translation to get there - so the best you can do is calculate the highest possible endpoint and say "Yeah this range is in there somewhere". Thank you for taking the time to walk me through this, I'm sorry I've been confused on DPA/HPA/SPA for so long - it's been a bit of a struggle. ~Gregory
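For anyone else tracing the halving effect discussed above, the patch's HPA-to-DPA formula is easy to reproduce outside the kernel. A quick Python sketch, assuming the thread's 2-way / 256-byte granularity configuration (`hpa_to_dpa` is just an illustrative name):

```python
GRAN = 0x100
WAYS = 2

def hpa_to_dpa(hpa, ways=WAYS, gran=GRAN):
    # The formula from the patch under review: strip the interleave
    # bits, divide the chunk index by ways, then re-apply the
    # intra-chunk offset.
    return (hpa & ~(gran * ways - 1)) // ways + (hpa & (gran - 1))

# Handed the full 256G interleaved end address, it yields the right DPA:
print(hex(hpa_to_dpa(0x3fffffffff)))  # 0x1fffffffff

# Handed the endpoint decoder's 128G end address, it halves it again:
print(hex(hpa_to_dpa(0x1fffffffff)))  # 0xfffffffff
```

So the formula itself is sound; the truncation only appears when the device-local range end is passed in instead of the interleaved span end.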
On Wed, Jan 15, 2025 at 04:05:16PM +0100, Robert Richter wrote: > On 09.01.25 17:25:13, Gregory Price wrote: > > > + dpa = (hpa & ~(granularity * ways - 1)) / ways > > > + + (hpa & (granularity - 1)); > > > > I do not understand this chunk here, we seem to just be chopping the HPA > > in half to acquire the DPA. But the value passed in is already a DPA. > > > > dpa = (0x1fffffffff & ~(256 * 2 - 1)) / 2 + (0x1fffffffff & (256 - 1)) > > = 0xfffffffff > > HPA is: > > HPA = 2 * 0x2000000000 - 1 = 0x3fffffffff > > Should calculate for a 2-way config to: > > DPA = 0x1fffffffff. > I'm looking back through all of this again, and I'm not seeing how the current code is ever capable of ending up with hpa=0x3fffffffff. Taking an example endpoint in my setup: [decoder5.0]# cat start size interleave_ways interleave_granularity 0x0 0x2000000000 <- 128GB (half the total 256GB interleaved range) 1 <- this decoder does not apply interleave 256 translating up to a root decoder: [decoder0.0]# cat start size interleave_ways interleave_granularity 0xc050000000 0x4000000000 <- 256GB (total interleaved capacity) 2 <- interleaved 2 ways, this decoder applies interleave 256 Now looking at the code that actually invokes the translation static struct device * cxl_find_auto_decoder(struct cxl_port *port, struct cxl_endpoint_decoder *cxled, struct cxl_region *cxlr) { struct cxl_decoder *cxld = &cxled->cxld; struct range hpa = cxld->hpa_range; ... snip ... if (cxl_port_calc_hpa(parent, cxld, &hpa)) return NULL; } or static int cxl_endpoint_initialize(struct cxl_endpoint_decoder *cxled) { struct cxl_memdev *cxlmd = cxled_to_memdev(cxled); struct cxl_port *parent, *iter = cxled_to_port(cxled); struct range hpa = cxled->cxld.hpa_range; struct cxl_decoder *cxld = &cxled->cxld; while (1) { if (is_cxl_endpoint(iter)) cxld = &cxled->cxld; ... /* Translate HPA to the next upper memory domain. */ if (cxl_port_calc_hpa(parent, cxld, &hpa)) { } ... } .... 
} Both of these will call cxl_port_calc_hpa with hpa = [0, 0x1fffffffff] Resulting in the following static int cxl_port_calc_hpa(struct cxl_port *port, struct cxl_decoder *cxld, struct range *hpa_range) { ... /* Translate HPA to the next upper domain. */ hpa.start = port->to_hpa(cxld, hpa.start); <---- 0x0 hpa.end = port->to_hpa(cxld, hpa.end); <---- 0x1fffffffff } So we call: to_hpa(decoder5.0, hpa.end) to_hpa(decoder5.0, 0x1fffffffff) ^^^^^^^^^^^^^ --- hpa will never be 0x3fffffffff Should the to_hpa() code be taking an decoder length as an argument? to_hpa(decoder5.0, range_length, addr) ? This would actually let us calculate the end of region with the interleave ways and granularity: upper_hpa_base = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC) - DPA_MAGIC; upper_hpa_end = upper_hpa_base + (range_length * ways) - 1 Without this, you don't have enough information to actually calculate the upper_hpa_end as you suggested. The result is the math ends up chopping the endpoint decoder's range (128GB) in half (64GB). Below I walk through the translation code with these inputs step by step. ~Gregory ------------------------------------------------------------------------ Walking through the translation code by hand here: [decoder5.0]# cat start size 0x0 0x2000000000 call: to_hpa(decoder5.0, 0x1fffffffff) --- code #define DPA_MAGIC 0xd20000 base = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC); spa = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC + SZ_16K); spa2 = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC + SZ_16K - SZ_256); len = spa - base; len2 = spa2 - base; /* offset = pos * granularity */ if (len == SZ_16K && len2 == SZ_16K - SZ_256) { ... snip - not taken ... 
} else { ways = len / SZ_16K; offset = spa & (SZ_16K - 1); granularity = (len - len2 - SZ_256) / (ways - 1); pos = offset / granularity; } --- end code At this point in the code I have the following values: base = 0xc051a40100 spa = 0xc051a48100 spa2 = 0xc051a47f00 len = 0x8000 len2 = 0x7E00 ways = 2 offset = 256 granularity = 256 pos = 1 --- code base = base - DPA_MAGIC * ways - pos * granularity; spa = base + hpa; --- end code base = 0xc051a40100 - 0xd20000 * 2 - 1 * 256 = 0xc050000000 spa = base + hpa = 0xc050000000 + 0x1fffffffff <----- = 0xe04fffffff | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Isn't this just incorrect? Should it be something like: base + ((hpa & ~(granularity - 1)) * pos) + (hpa & (ways * granularity - 1)) --- code dpa = (hpa & ~(granularity * ways - 1)) / ways + (hpa & (granularity - 1)); --- end code dpa = (hpa & ~(granularity * ways - 1)) / ways + (hpa & (granularity - 1)); ^^^ ^^^ 0x1fffffffff / 2 We are dividing the endpoint decoder HPA by interleave ways. This is how we end up with the truncated size. Full math dpa = (0x1fffffffff & ~(256 * 2 - 1)) / 2 + (0x1fffffffff & (256 - 1)) = (0x1fffffffff & 0xF...E00) / 2 + (0x1fffffffff & 0xFF) = (0x1ffffffe00) / 2 + (0xFF) = 0xfffffff00 + 0xff = 0xfffffffff --- code offset = hpa & (granularity * ways - 1) & ~(granularity - 1); offset -= pos * granularity; --- offset = (0x1fffffffff & (256 * 2 - 1) & ~(256 - 1)) - (1 * 256) (0x1fffffffff & 0x1ff & 0xffffffffffffff00) - 0x100 (0x1ff & 0xffffffffffffff00) - 0x100 0x100 - 0x100 0x0 Final Result: --- code spa2 = prm_cxl_dpa_spa(pci_dev, dpa) + offset; --- = prm_cxl_dpa_spa(pci_dev, 0xfffffffff) + 0 = 0xe04fffffff + 0 ^^^ note that in this case my emulation gives the exact address; you seem to suggest that I'll get the closest granularity So if not for my emulation, the offset calculation is wrong? 
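As a cross-check, the probe math in the walkthrough above can be reproduced standalone. Python sketch: `fake_prm_dpa_spa` below is my stand-in assumption for what the PRM handler does for the pos=1 device (2 ways, 256B granularity), not the real firmware call:

```python
SZ_256, SZ_16K = 0x100, 0x4000
DPA_MAGIC = 0xd20000
CFMW_BASE = 0xc050000000

def fake_prm_dpa_spa(pos, dpa, ways=2, gran=SZ_256):
    # stand-in for the PRM handler: 2-way, 256B-granularity interleave
    return CFMW_BASE + (dpa // gran * ways + pos) * gran + dpa % gran

pos_probe = 1  # decoder6.0 / 0000:c1:00.0 in the logs above
base = fake_prm_dpa_spa(pos_probe, DPA_MAGIC)
spa  = fake_prm_dpa_spa(pos_probe, DPA_MAGIC + SZ_16K)
spa2 = fake_prm_dpa_spa(pos_probe, DPA_MAGIC + SZ_16K - SZ_256)

len1, len2 = spa - base, spa2 - base
ways = len1 // SZ_16K                               # 2
granularity = (len1 - len2 - SZ_256) // (ways - 1)  # 256
pos = (spa & (SZ_16K - 1)) // granularity           # 1
ep_base = base - DPA_MAGIC * ways - pos * granularity

print(hex(base))                # 0xc051a40100, matches the walkthrough
print(ways, granularity, pos)   # 2 256 1
print(hex(ep_base))             # 0xc050000000
```

The derived interleave parameters and endpoint base all match the values above, which supports the point that the config detection is fine and only the final spa computation goes wrong.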
-------------------------------------------------------------------- For the sake of completeness, here is my PRMT emulation code to show you that it is doing the translation as-expected. All this does is just force translation for a particular set of PCI devices based on the known static CFMW regions. Note for onlookers: This patch is extremely dangerous and only applies to my specific system / interleave configuration. diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c index ac74b6f6dad7..8ccf2d5638ed 100644 --- a/drivers/cxl/acpi.c +++ b/drivers/cxl/acpi.c @@ -432,6 +432,31 @@ static int __cxl_parse_cfmws(struct acpi_cedt_cfmws *cfmws, return 0; } +#define MAX_CFMWS (32) +static unsigned int num_cfmws; +static unsigned long cfmws_bases[MAX_CFMWS]; +static unsigned long cfmws_sizes[MAX_CFMWS]; +unsigned int cxl_get_num_cfmws(void) +{ + return num_cfmws; +} + +unsigned long cxl_get_cfmws_base(unsigned int idx) +{ + if (idx >= MAX_CFMWS || idx >= num_cfmws) + return ~0; + + return cfmws_bases[idx]; +} + +unsigned long cxl_get_cfmws_size(unsigned int idx) +{ + if (idx >= MAX_CFMWS || idx >= num_cfmws) + return ~0; + + return cfmws_sizes[idx]; +} + static int cxl_parse_cfmws(union acpi_subtable_headers *header, void *arg, const unsigned long end) { @@ -446,10 +471,16 @@ static int cxl_parse_cfmws(union acpi_subtable_headers *header, void *arg, "Failed to add decode range: [%#llx - %#llx] (%d)\n", cfmws->base_hpa, cfmws->base_hpa + cfmws->window_size - 1, rc); - else + else { dev_dbg(dev, "decode range: node: %d range [%#llx - %#llx]\n", phys_to_target_node(cfmws->base_hpa), cfmws->base_hpa, cfmws->base_hpa + cfmws->window_size - 1); + if (num_cfmws < MAX_CFMWS) { + cfmws_bases[num_cfmws] = cfmws->base_hpa; + cfmws_sizes[num_cfmws] = cfmws->window_size; + num_cfmws++; + } + } /* never fail cxl_acpi load for a single window failure */ return 0; diff --git a/drivers/cxl/core/amd.c b/drivers/cxl/core/amd.c index 553b7d0caefd..08a5bfb9fbd6 100644 --- 
a/drivers/cxl/core/amd.c +++ b/drivers/cxl/core/amd.c @@ -64,6 +64,10 @@ struct prm_cxl_dpa_spa_data { static u64 prm_cxl_dpa_spa(struct pci_dev *pci_dev, u64 dpa) { struct prm_cxl_dpa_spa_data data; + unsigned int cfmws_nr; + unsigned int idx; + unsigned long offset, size; + unsigned int dev; u64 spa; int rc; @@ -75,12 +79,35 @@ static u64 prm_cxl_dpa_spa(struct pci_dev *pci_dev, u64 dpa) .out = &spa, }; + cfmws_nr = cxl_get_num_cfmws(); + if (!cfmws_nr) + goto try_prmt; + + /* HACK: Calculate the interleaved offset and find the matching base */ + if (pci_dev->bus->number != 0xe1 && pci_dev->bus->number != 0xc1) + goto try_prmt; + + dev = pci_dev->bus->number == 0xe1 ? 0 : 1; + offset = (0x100 * (((dpa >> 8) * 2) + dev)) + (dpa & 0xff); + + for (idx = 0; idx < cfmws_nr; idx++) { + size = cxl_get_cfmws_size(idx); + if (offset < size) { + spa = cxl_get_cfmws_base(idx) + offset; + goto out; + } + offset -= size; + } + /* We failed, fall back to calling the PRMT */ +try_prmt: + rc = acpi_call_prm_handler(prm_cxl_dpa_spa_guid, &data); if (rc) { pci_dbg(pci_dev, "failed to get SPA for %#llx: %d\n", dpa, rc); return ULLONG_MAX; } +out: pci_dbg(pci_dev, "PRM address translation: DPA -> SPA: %#llx -> %#llx\n", dpa, spa); return spa; diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h index f93eb464fc97..48cbfa68d739 100644 --- a/drivers/cxl/cxl.h +++ b/drivers/cxl/cxl.h @@ -919,6 +919,10 @@ bool cxl_endpoint_decoder_reset_detected(struct cxl_port *port); struct cxl_mem_command *cxl_find_feature_command(u16 opcode); +unsigned int cxl_get_num_cfmws(void); +unsigned long cxl_get_cfmws_base(unsigned int idx); +unsigned long cxl_get_cfmws_size(unsigned int idx); + /* * Unit test builds overrides this to __weak, find the 'strong' version * of these symbols in tools/testing/cxl/.
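Incidentally, the hack's mapping can also be inverted, which gives a cheap round-trip check that each SPA in the window maps back to exactly one (device, DPA) pair. Sketch only, same assumptions as the hack (two devices, 256-byte granularity, single CFMW; helper names are made up):

```python
CFMW_BASE = 0xc050000000
GRAN = 0x100
WAYS = 2

def dpa_to_spa(dev, dpa):
    # same math as the emulation hack above
    return CFMW_BASE + GRAN * ((dpa >> 8) * WAYS + dev) + (dpa & 0xff)

def spa_to_dev_dpa(spa):
    # inverse: peel the base, split the chunk index back into
    # (interleave position, device-local chunk)
    off = spa - CFMW_BASE
    chunk = off >> 8
    return chunk % WAYS, ((chunk // WAYS) << 8) | (off & 0xff)

for dev in range(WAYS):
    for dpa in (0x0, 0xff, 0x100, 0xd20000, 0x1fffffffff):
        assert spa_to_dev_dpa(dpa_to_spa(dev, dpa)) == (dev, dpa)
print("round trip ok")
```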
On 14.01.25 11:13:07, Jonathan Cameron wrote: > On Thu, 9 Jan 2025 11:14:46 +0100 > Robert Richter <rrichter@amd.com> wrote: > > > On 08.01.25 10:48:23, Gregory Price wrote: > > > > > > diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c > > > > index 901555bf4b73..c8176265c15c 100644 > > > > --- a/drivers/cxl/core/port.c > > > > +++ b/drivers/cxl/core/port.c > > > > @@ -831,6 +831,11 @@ static void cxl_debugfs_create_dport_dir(struct cxl_dport *dport) > > > > &cxl_einj_inject_fops); > > > > } > > > > > > > > +static void cxl_port_platform_setup(struct cxl_port *port) > > > > +{ > > > > + cxl_port_setup_amd(port); > > > > +} > > > > + > > > > > > Assuming this gets expanded (which it may not), should we expect this > > > function to end up like so? > > > > > > static void cxl_port_platform_setup(struct cxl_port *port) > > > { > > > cxl_port_setup_amd(port); > > > cxl_port_setup_intel(port); > > > cxl_port_setup_arm(port); > > > ... etc ... > > > } > > > > > > I suppose this logic has to exist somewhere in some form, just want to make > > > sure this is what we want. Either way, this is easily modifiable, so > > > not a blocker as I said. > > > > Yes, it is exactly designed like that. I will update the patch > > description. > > If we need it on ARM then we might wrap this in an arch_cxl_port_platform_setup() > as never building a kernel that does x86 and arm. Could rely on stubs but that > tends to get ugly as things grow. I could move the function and file to core/x86/amd.c already and add a: void __weak arch_cxl_port_platform_setup(struct cxl_port *port) { } -Robert
On 10.01.25 17:48:35, Gregory Price wrote: > On Tue, Jan 07, 2025 at 03:10:11PM +0100, Robert Richter wrote: > > Add AMD platform specific Zen5 support for address translation. > > > > Zen5 systems may be configured to use 'Normalized addresses'. Then, > > CXL endpoints use their own physical address space and Host Physical > > Addresses (HPAs) need address translation from the endpoint to its CXL > > host bridge. The HPA of a CXL host bridge is equivalent to the System > > Physical Address (SPA). > > > > Just adding the note that I've tested this patch set for HPA==SPA and > found it causes no regressions in my setup. Thanks for testing HPA==SPA. -Robert > > Still working on testing the normalized address mode due to a few BIOS > quirks I'm running up against. > > ~Gregory
On Fri, 17 Jan 2025 08:59:00 +0100 Robert Richter <rrichter@amd.com> wrote: > On 14.01.25 11:13:07, Jonathan Cameron wrote: > > On Thu, 9 Jan 2025 11:14:46 +0100 > > Robert Richter <rrichter@amd.com> wrote: > > > > > On 08.01.25 10:48:23, Gregory Price wrote: > > > > > > > > diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c > > > > > index 901555bf4b73..c8176265c15c 100644 > > > > > --- a/drivers/cxl/core/port.c > > > > > +++ b/drivers/cxl/core/port.c > > > > > @@ -831,6 +831,11 @@ static void cxl_debugfs_create_dport_dir(struct cxl_dport *dport) > > > > > &cxl_einj_inject_fops); > > > > > } > > > > > > > > > > +static void cxl_port_platform_setup(struct cxl_port *port) > > > > > +{ > > > > > + cxl_port_setup_amd(port); > > > > > +} > > > > > + > > > > > > > > Assuming this gets expanded (which it may not), should we expect this > > > > function to end up like so? > > > > > > > > static void cxl_port_platform_setup(struct cxl_port *port) > > > > { > > > > cxl_port_setup_amd(port); > > > > cxl_port_setup_intel(port); > > > > cxl_port_setup_arm(port); > > > > ... etc ... > > > > } > > > > > > > > I suppose this logic has to exist somewhere in some form, just want to make > > > > sure this is what we want. Either way, this is easily modifiable, so > > > > not a blocker as I said. > > > > > > Yes, it is exactly designed like that. I will update the patch > > > description. > > > > If we need it on ARM then we might wrap this in an arch_cxl_port_platform_setup() > > as never building a kernel that does x86 and arm. Could rely on stubs but that > > tends to get ugly as things grow. > > I could move the function and file to core/x86/amd.c already and add > a: > > void __weak arch_cxl_port_platform_setup(struct cxl_port *port) { } Something like that probably makes sense. I don't like x86 calls in what I'm building for arm, even if they are stubbed out ;) Jonathan > > -Robert
On 15.01.25 17:24:54, Gregory Price wrote: > On Wed, Jan 15, 2025 at 04:05:16PM +0100, Robert Richter wrote: > > On 09.01.25 17:25:13, Gregory Price wrote: > > > > + dpa = (hpa & ~(granularity * ways - 1)) / ways > > > > + + (hpa & (granularity - 1)); > > > > > > I do not understand this chunk here, we seem to just be chopping the HPA > > > in half to acquire the DPA. But the value passed in is already a DPA. > > > > > > dpa = (0x1fffffffff & ~(256 * 2 - 1)) / 2 + (0x1fffffffff & (256 - 1)) > > > = 0xfffffffff > > > > HPA is: > > > > HPA = 2 * 0x2000000000 - 1 = 0x3fffffffff > > > > Should calculate for a 2-way config to: > > > > DPA = 0x1fffffffff. > > > > I'm looking back through all of this again, and I'm not seeing how the > current code is ever capable of ending up with hpa=0x3fffffffff. > > Taking an example endpoint in my setup: > > [decoder5.0]# cat start size interleave_ways interleave_granularity > 0x0 > 0x2000000000 <- 128GB (half the total 256GB interleaved range) > 1 <- this decoder does not apply interleave > 256 > > translating up to a root decoder: > > [decoder0.0]# cat start size interleave_ways interleave_granularity > 0xc050000000 > 0x4000000000 <- 256GB (total interleaved capacity) > 2 <- interleaved 2 ways, this decoder applies interleave > 256 > > > Now looking at the code that actually invokes the translation > > static struct device * > cxl_find_auto_decoder(struct cxl_port *port, struct cxl_endpoint_decoder *cxled, > struct cxl_region *cxlr) > { > struct cxl_decoder *cxld = &cxled->cxld; > struct range hpa = cxld->hpa_range; > ... snip ... 
> if (cxl_port_calc_hpa(parent, cxld, &hpa)) > return NULL; > } > > or > > static int cxl_endpoint_initialize(struct cxl_endpoint_decoder *cxled) > { > struct cxl_memdev *cxlmd = cxled_to_memdev(cxled); > struct cxl_port *parent, *iter = cxled_to_port(cxled); > struct range hpa = cxled->cxld.hpa_range; > struct cxl_decoder *cxld = &cxled->cxld; > > while (1) { > if (is_cxl_endpoint(iter)) > cxld = &cxled->cxld; > ... > /* Translate HPA to the next upper memory domain. */ > if (cxl_port_calc_hpa(parent, cxld, &hpa)) { > } > ... > } > .... > } > > Both of these will call cxl_port_calc_hpa with > hpa = [0, 0x1fffffffff] > > Resulting in the following > > static int cxl_port_calc_hpa(struct cxl_port *port, struct cxl_decoder *cxld, > struct range *hpa_range) > { > ... > /* Translate HPA to the next upper domain. */ > hpa.start = port->to_hpa(cxld, hpa.start); <---- 0x0 > hpa.end = port->to_hpa(cxld, hpa.end); <---- 0x1fffffffff > } > > So we call: > to_hpa(decoder5.0, hpa.end) > to_hpa(decoder5.0, 0x1fffffffff) > ^^^^^^^^^^^^^ --- hpa will never be 0x3fffffffff Depending on the endpoint the PRM call returns the following here (2-way interleaving with 256 gran): hpa = 0x1fffffff00 * 2 + pos * 0x100 + 0xff; It will either return 0x3ffffffeff or 0x3fffffffff. But implementation in cxl_zen5_to_hpa() is not correct, see below. > > > Should the to_hpa() code be taking an decoder length as an argument? > > to_hpa(decoder5.0, range_length, addr) ? > > This would actually let us calculate the end of region with the > interleave ways and granularity: > > upper_hpa_base = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC) - DPA_MAGIC; > upper_hpa_end = upper_hpa_base + (range_length * ways) - 1 > > Without this, you don't have enough information to actually calculate > the upper_hpa_end as you suggested. The result is the math ends up > chopping the endpoint decoder's range (128GB) in half (64GB). > > Below I walk through the translation code with these inputs step by step. 
> > ~Gregory > > ------------------------------------------------------------------------ > > Walking through the translation code by hand here: > > [decoder5.0]# cat start size > > 0x0 > > 0x2000000000 > > call: to_hpa(decoder5.0, 0x1fffffffff) > > --- code > > #define DPA_MAGIC 0xd20000 > > base = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC); > > spa = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC + SZ_16K); > > spa2 = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC + SZ_16K - SZ_256); > > > > len = spa - base; > > len2 = spa2 - base; > > > > /* offset = pos * granularity */ > > if (len == SZ_16K && len2 == SZ_16K - SZ_256) { > > ... snip - not taken ... > > } else { > > ways = len / SZ_16K; > > offset = spa & (SZ_16K - 1); > > granularity = (len - len2 - SZ_256) / (ways - 1); > > pos = offset / granularity; > > } > > --- end code > > > > At this point in the code i have the following values: > > > > base = 0xc051a40100 > > spa = 0xc051a48100 > > spa2 = 0xc051a47f00 > > len = 0x8000 > > len2 = 0x7E00 > > ways = 2 > > offset = 256 > > granularity = 256 > > pos = 1 > > > > --- code > > base = base - DPA_MAGIC * ways - pos * granularity; > > spa = base + hpa; > > --- end code > > > > base = 0xc051a40100 - 0xd20000 * 2 - 1 * 256 > > = 0xc050000000 > > > > spa = base + hpa This is the wrong part, the interleaving parameters must be considered for SPA too, like: spa = base + hpa div gran * gran * ways + pos * gran + hpa mod gran This caused a wrong range size and this went unnoticed due to an overlaying issue while testing. The detection of the interleaving config is still correct. > = 0xc050000000 + 0x1fffffffff <----- > = 0xe04fffffff | > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > Isn't this just incorrect? Should it be something like: > -------------------------------------------------------------------- > > For the sake of completeness, here is my PRMT emulation code to show > you that it is doing the translation as-expected. > > All this does is just force translation for a particular set of PCI > devices based on the known static CFMW regions. 
> @@ -75,12 +79,35 @@ static u64 prm_cxl_dpa_spa(struct pci_dev *pci_dev, u64 dpa) > .out = &spa, > }; > > + cfmws_nr = cxl_get_num_cfmws(); > + if (!cfmws_nr) > + goto try_prmt; > + > + /* HACK: Calculate the interleaved offset and find the matching base */ > + if (pci_dev->bus->number != 0xe1 && pci_dev->bus->number != 0xc1) > + goto try_prmt; > + > + dev = pci_dev->bus->number == 0xe1 ? 0 : 1; > + offset = (0x100 * (((dpa >> 8) * 2) + dev)) + (dpa & 0xff); That looks correct to me, a test too: >>> dpa = 0x1fffffffff >>> dev = 1 >>> offset = (0x100 * (((dpa >> 8) * 2) + dev)) + (dpa & 0xff) >>> "0x%x" % offset '0x3fffffffff' >>> dev = 0 >>> offset = (0x100 * (((dpa >> 8) * 2) + dev)) + (dpa & 0xff) >>> "0x%x" % offset '0x3ffffffeff' Thanks for review and testing. Will fix in v2. -Robert > + > + for (idx = 0; idx < cfmws_nr; idx++) { > + size = cxl_get_cfmws_size(idx); > + if (offset < size) { > + spa = cxl_get_cfmws_base(idx) + offset; > + goto out; > + } > + offset -= size; > + } > + /* We failed, fall back to calling the PRMT */ > +try_prmt: > + > rc = acpi_call_prm_handler(prm_cxl_dpa_spa_guid, &data); > if (rc) { > pci_dbg(pci_dev, "failed to get SPA for %#llx: %d\n", dpa, rc); > return ULLONG_MAX; > } > > +out: > pci_dbg(pci_dev, "PRM address translation: DPA -> SPA: %#llx -> %#llx\n", dpa, spa); > > return spa;
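As a note for v2 testing: the corrected formula and the emulation hack above can be checked against each other mechanically and agree bit for bit. Python sketch, assuming the 2-way / 256B-granularity config and the 0xc050000000 base from this thread:

```python
CFMW_BASE = 0xc050000000
GRAN, WAYS = 0x100, 2

def spa_fixed(base, hpa, ways, gran, pos):
    # the corrected cxl_zen5_to_hpa() math proposed above:
    # spa = base + hpa div gran * gran * ways + pos * gran + hpa mod gran
    return base + (hpa // gran) * gran * ways + pos * gran + hpa % gran

def spa_emulated(dev, dpa):
    # the PRMT emulation hack from earlier in the thread
    return CFMW_BASE + 0x100 * (((dpa >> 8) * WAYS) + dev) + (dpa & 0xff)

for dev in range(WAYS):
    for hpa in (0x0, 0xff, 0x100, 0xd20000, 0xfffffffff, 0x1fffffffff):
        assert spa_fixed(CFMW_BASE, hpa, WAYS, GRAN, dev) == spa_emulated(dev, hpa)

# and the upper boundary now comes out right:
print(hex(spa_fixed(CFMW_BASE, 0x1fffffffff, WAYS, GRAN, 1)))  # 0x1004fffffff
```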
On 17.01.25 11:46:42, Jonathan Cameron wrote: > On Fri, 17 Jan 2025 08:59:00 +0100 > Robert Richter <rrichter@amd.com> wrote: > > On 14.01.25 11:13:07, Jonathan Cameron wrote: > > > > > static void cxl_port_platform_setup(struct cxl_port *port) > > > > > { > > > > > cxl_port_setup_amd(port); > > > > > cxl_port_setup_intel(port); > > > > > cxl_port_setup_arm(port); > > > > > ... etc ... > > > > > } > > > > > > > > > > I suppose this logic has to exist somewhere in some form, just want to make > > > > > sure this is what we want. Either way, this is easily modifiable, so > > > > > not a blocker as I said. > > > > > > > > Yes, it is exactly designed like that. I will update the patch > > > > description. > > > > > > If we need it on ARM then we might wrap this in an arch_cxl_port_platform_setup() > > > as never building a kernel that does x86 and arm. Could rely on stubs but that > > > tends to get ugly as things grow. > > > > I could move the function and file to core/x86/amd.c already and add > > a: > > > > void __weak arch_cxl_port_platform_setup(struct cxl_port *port) { } > Something like that probably makes sense. I don't like x86 calls in what > I'm building for arm, even if they are stubbed out ;) Sure, will change that. Thanks for review, -Robert
On 1/7/25 8:10 AM, Robert Richter wrote: > Add AMD platform specific Zen5 support for address translation. > > Zen5 systems may be configured to use 'Normalized addresses'. Then, > CXL endpoints use their own physical address space and Host Physical > Addresses (HPAs) need address translation from the endpoint to its CXL > host bridge. The HPA of a CXL host bridge is equivalent to the System > Physical Address (SPA). > > ACPI Platform Runtime Mechanism (PRM) is used to translate the CXL > Device Physical Address (DPA) to its System Physical Address. This is > documented in: > > AMD Family 1Ah Models 00h–0Fh and Models 10h–1Fh > ACPI v6.5 Porting Guide, Publication # 58088 > https://www.amd.com/en/search/documentation/hub.html > > Note that DPA and HPA of an endpoint may differ depending on the > interleaving configuration. That is, an additional calculation between > DPA and HPA is needed. > > To implement AMD Zen5 address translation the following steps are > needed: > > Introduce the generic function cxl_port_platform_setup() that allows > to apply platform specific changes to each port where necessary. > > Add a function cxl_port_setup_amd() to implement AMD platform specific > code. Use Kbuild and Kconfig options respectivly to enable the code > depending on architecture and platform options. Create a new file > core/amd.c for this. > > Introduce a function cxl_zen5_init() to handle Zen5 specific > enablement. Zen5 platforms are detected using the PCIe vendor and > device ID of the corresponding CXL root port. > > Apply cxl_zen5_to_hpa() as cxl_port->to_hpa() callback to Zen5 CXL > host bridges to enable platform specific address translation. > > Use ACPI PRM DPA to SPA translation to determine an endpoint's > interleaving configuration and base address during the early > initialization proces. This is used to determine an endpoint's SPA > range. 
> > Since the PRM translates DPA->SPA, but HPA->SPA is needed, determine > the interleaving config and base address of the endpoint first, then > calculate the SPA based on the given HPA using the address base. > > The config can be determined calling the PRM for specific DPAs > given. Since the interleaving configuration is still unknown, chose > DPAs starting at 0xd20000. This address is factor for all values from > 1 to 8 and thus valid for all possible interleaving configuration. > The resulting SPAs are used to calculate interleaving paramters and > the SPA base address of the endpoint. The maximum granularity (chunk > size) is 16k, minimum is 256. Use the following calculation for a > given DPA: > > ways = hpa_len(SZ_16K) / SZ_16K > gran = (hpa_len(SZ_16K) - hpa_len(SZ_16K - SZ_256) - SZ_256) > / (ways - 1) > pos = (hpa_len(SZ_16K) - ways * SZ_16K) / gran > > Once the endpoint is attached to a region and its SPA range is know, > calling the PRM is no longer needed, the SPA base can be used. > > Signed-off-by: Robert Richter <rrichter@amd.com> > --- > drivers/cxl/Kconfig | 4 + > drivers/cxl/core/Makefile | 1 + > drivers/cxl/core/amd.c | 227 ++++++++++++++++++++++++++++++++++++++ > drivers/cxl/core/core.h | 6 + > drivers/cxl/core/port.c | 7 ++ > 5 files changed, 245 insertions(+) > create mode 100644 drivers/cxl/core/amd.c > > diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig > index 876469e23f7a..e576028dd983 100644 > --- a/drivers/cxl/Kconfig > +++ b/drivers/cxl/Kconfig > @@ -146,4 +146,8 @@ config CXL_REGION_INVALIDATION_TEST > If unsure, or if this kernel is meant for production environments, > say N. 
> > +config CXL_AMD > + def_bool y > + depends on AMD_NB > + > endif > diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile > index 9259bcc6773c..dc368e61d281 100644 > --- a/drivers/cxl/core/Makefile > +++ b/drivers/cxl/core/Makefile > @@ -16,3 +16,4 @@ cxl_core-y += pmu.o > cxl_core-y += cdat.o > cxl_core-$(CONFIG_TRACING) += trace.o > cxl_core-$(CONFIG_CXL_REGION) += region.o > +cxl_core-$(CONFIG_CXL_AMD) += amd.o > diff --git a/drivers/cxl/core/amd.c b/drivers/cxl/core/amd.c > new file mode 100644 > index 000000000000..553b7d0caefd > --- /dev/null > +++ b/drivers/cxl/core/amd.c > @@ -0,0 +1,227 @@ > +// SPDX-License-Identifier: GPL-2.0-only > +/* > + * Copyright (C) 2024 Advanced Micro Devices, Inc. Make sure to update the year. Don't know if it's supposed to be 2024-2025 or just 2025. > + */ > + > +#include <linux/prmt.h> > +#include <linux/pci.h> > + > +#include "cxlmem.h" > +#include "core.h" > + > +#define PCI_DEVICE_ID_AMD_ZEN5_ROOT 0x153e > + > +static const struct pci_device_id zen5_root_port_ids[] = { > + { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_ZEN5_ROOT) }, > + {}, > +}; > + > +static int is_zen5_root_port(struct device *dev, void *unused) > +{ > + if (!dev_is_pci(dev)) > + return 0; > + > + return !!pci_match_id(zen5_root_port_ids, to_pci_dev(dev)); > +} > + > +static bool is_zen5(struct cxl_port *port) > +{ > + if (!IS_ENABLED(CONFIG_ACPI_PRMT)) > + return false; > + > + /* To get the CXL root port, find the CXL host bridge first. 
*/ > + if (is_cxl_root(port) || > + !port->host_bridge || > + !is_cxl_root(to_cxl_port(port->dev.parent))) > + return false; > + > + return !!device_for_each_child(port->host_bridge, NULL, > + is_zen5_root_port); > +} > + > +/* > + * PRM Address Translation - CXL DPA to System Physical Address > + * > + * Reference: > + * > + * AMD Family 1Ah Models 00h–0Fh and Models 10h–1Fh > + * ACPI v6.5 Porting Guide, Publication # 58088 > + */ > + > +static const guid_t prm_cxl_dpa_spa_guid = > + GUID_INIT(0xee41b397, 0x25d4, 0x452c, 0xad, 0x54, 0x48, 0xc6, 0xe3, > + 0x48, 0x0b, 0x94); > + > +struct prm_cxl_dpa_spa_data { > + u64 dpa; > + u8 reserved; > + u8 devfn; > + u8 bus; > + u8 segment; > + void *out; > +} __packed; > + > +static u64 prm_cxl_dpa_spa(struct pci_dev *pci_dev, u64 dpa) > +{ > + struct prm_cxl_dpa_spa_data data; > + u64 spa; > + int rc; > + > + data = (struct prm_cxl_dpa_spa_data) { > + .dpa = dpa, > + .devfn = pci_dev->devfn, > + .bus = pci_dev->bus->number, > + .segment = pci_domain_nr(pci_dev->bus), > + .out = &spa, > + }; > + > + rc = acpi_call_prm_handler(prm_cxl_dpa_spa_guid, &data); > + if (rc) { > + pci_dbg(pci_dev, "failed to get SPA for %#llx: %d\n", dpa, rc); > + return ULLONG_MAX; > + } > + > + pci_dbg(pci_dev, "PRM address translation: DPA -> SPA: %#llx -> %#llx\n", dpa, spa); > + > + return spa; > +} > + > +static u64 cxl_zen5_to_hpa(struct cxl_decoder *cxld, u64 hpa) After reading through your discussion with Gregory I'm not confused about the function name anymore, but I would definitely include a comment specifying the expected usage and return value. Thanks, Ben > +{ > + struct cxl_memdev *cxlmd; > + struct pci_dev *pci_dev; > + struct cxl_port *port; > + u64 dpa, base, spa, spa2, len, len2, offset, granularity; > + int ways, pos; > + > + /* > + * Nothing to do if base is non-zero and Normalized Addressing > + * is disabled. > + */ > + if (cxld->hpa_range.start) > + return hpa; > + > + /* Only translate from endpoint to its parent port. 
*/ > + if (!is_endpoint_decoder(&cxld->dev)) > + return hpa; > + > + if (hpa > cxld->hpa_range.end) { > + dev_dbg(&cxld->dev, "hpa addr %#llx out of range %#llx-%#llx\n", > + hpa, cxld->hpa_range.start, cxld->hpa_range.end); > + return ULLONG_MAX; > + } > + > + /* > + * If the decoder is already attached, the region's base can > + * be used. > + */ > + if (cxld->region) > + return cxld->region->params.res->start + hpa; > + > + port = to_cxl_port(cxld->dev.parent); > + cxlmd = port ? to_cxl_memdev(port->uport_dev) : NULL; > + if (!port || !dev_is_pci(cxlmd->dev.parent)) { > + dev_dbg(&cxld->dev, "No endpoint found: %s, range %#llx-%#llx\n", > + dev_name(cxld->dev.parent), cxld->hpa_range.start, > + cxld->hpa_range.end); > + return ULLONG_MAX; > + } > + pci_dev = to_pci_dev(cxlmd->dev.parent); > + > + /* > + * The PRM translates DPA->SPA, but we need HPA->SPA. > + * Determine the interleaving config first, then calculate the > + * DPA. Maximum granularity (chunk size) is 16k, minimum is > + * 256. Calculated with: > + * > + * ways = hpa_len(SZ_16K) / SZ_16K > + * gran = (hpa_len(SZ_16K) - hpa_len(SZ_16K - SZ_256) - SZ_256) > + * / (ways - 1) > + * pos = (hpa_len(SZ_16K) - ways * SZ_16K) / gran > + */ > + > + /* > + * DPA magic: > + * > + * Position and granularity are unknown yet, use an always > + * valid DPA: > + * > + * 0xd20000 = 13762560 = 16k * 2 * 3 * 2 * 5 * 7 * 2 > + * > + * It is divisible by all positions 1 to 8. The DPA is valid > + * for all positions and granularities. 
> + */ > +#define DPA_MAGIC 0xd20000 > + base = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC); > + spa = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC + SZ_16K); > + spa2 = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC + SZ_16K - SZ_256); > + > + /* Includes checks to avoid div by zero */ > + if (!base || base == ULLONG_MAX || spa == ULLONG_MAX || > + spa2 == ULLONG_MAX || spa < base + SZ_16K || spa2 <= base || > + (spa > base + SZ_16K && spa - spa2 < SZ_256 * 2)) { > + dev_dbg(&cxld->dev, "Error translating HPA: base %#llx, spa %#llx, spa2 %#llx\n", > + base, spa, spa2); > + return ULLONG_MAX; > + } > + > + len = spa - base; > + len2 = spa2 - base; > + > + /* offset = pos * granularity */ > + if (len == SZ_16K && len2 == SZ_16K - SZ_256) { > + ways = 1; > + offset = 0; > + granularity = 0; > + pos = 0; > + } else { > + ways = len / SZ_16K; > + offset = spa & (SZ_16K - 1); > + granularity = (len - len2 - SZ_256) / (ways - 1); > + pos = offset / granularity; > + } > + > + base = base - DPA_MAGIC * ways - pos * granularity; > + spa = base + hpa; > + > + /* > + * Check SPA using a PRM call for the closest DPA calculated > + * for the HPA. If the HPA matches a different interleaving > + * position other than the decoder's, determine its offset to > + * adjust the SPA. 
> + */ > + > + dpa = (hpa & ~(granularity * ways - 1)) / ways > + + (hpa & (granularity - 1)); > + offset = hpa & (granularity * ways - 1) & ~(granularity - 1); > + offset -= pos * granularity; > + spa2 = prm_cxl_dpa_spa(pci_dev, dpa) + offset; > + > + dev_dbg(&cxld->dev, > + "address mapping found for %s (dpa -> hpa -> spa): %#llx -> %#llx -> %#llx base: %#llx ways: %d pos: %d granularity: %llu\n", > + pci_name(pci_dev), dpa, hpa, spa, base, ways, pos, granularity); > + > + if (spa != spa2) { > + dev_dbg(&cxld->dev, "SPA calculation failed: %#llx:%#llx\n", > + spa, spa2); > + return ULLONG_MAX; > + } > + > + return spa; > +} > + > +static void cxl_zen5_init(struct cxl_port *port) > +{ > + if (!is_zen5(port)) > + return; > + > + port->to_hpa = cxl_zen5_to_hpa; > + > + dev_dbg(port->host_bridge, "PRM address translation enabled for %s.\n", > + dev_name(&port->dev)); > +} > + > +void cxl_port_setup_amd(struct cxl_port *port) > +{ > + cxl_zen5_init(port); > +} > diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h > index 800466f96a68..efe34ae6943e 100644 > --- a/drivers/cxl/core/core.h > +++ b/drivers/cxl/core/core.h > @@ -115,4 +115,10 @@ bool cxl_need_node_perf_attrs_update(int nid); > int cxl_port_get_switch_dport_bandwidth(struct cxl_port *port, > struct access_coordinate *c); > > +#ifdef CONFIG_CXL_AMD > +void cxl_port_setup_amd(struct cxl_port *port); > +#else > +static inline void cxl_port_setup_amd(struct cxl_port *port) {}; > +#endif > + > #endif /* __CXL_CORE_H__ */ > diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c > index 901555bf4b73..c8176265c15c 100644 > --- a/drivers/cxl/core/port.c > +++ b/drivers/cxl/core/port.c > @@ -831,6 +831,11 @@ static void cxl_debugfs_create_dport_dir(struct cxl_dport *dport) > &cxl_einj_inject_fops); > } > > +static void cxl_port_platform_setup(struct cxl_port *port) > +{ > + cxl_port_setup_amd(port); > +} > + > static int cxl_port_add(struct cxl_port *port, > resource_size_t component_reg_phys, > 
struct cxl_dport *parent_dport) > @@ -868,6 +873,8 @@ static int cxl_port_add(struct cxl_port *port, > return rc; > } > > + cxl_port_platform_setup(port); > + > rc = device_add(dev); > if (rc) > return rc;
Add AMD platform specific Zen5 support for address translation.

Zen5 systems may be configured to use 'Normalized addresses'. Then, CXL endpoints use their own physical address space and Host Physical Addresses (HPAs) need address translation from the endpoint to its CXL host bridge. The HPA of a CXL host bridge is equivalent to the System Physical Address (SPA).

ACPI Platform Runtime Mechanism (PRM) is used to translate the CXL Device Physical Address (DPA) to its System Physical Address. This is documented in:

AMD Family 1Ah Models 00h–0Fh and Models 10h–1Fh
ACPI v6.5 Porting Guide, Publication # 58088
https://www.amd.com/en/search/documentation/hub.html

Note that the DPA and HPA of an endpoint may differ depending on the interleaving configuration. That is, an additional calculation between DPA and HPA is needed.

To implement AMD Zen5 address translation the following steps are needed:

Introduce the generic function cxl_port_platform_setup() that allows platform specific changes to be applied to each port where necessary.

Add a function cxl_port_setup_amd() to implement AMD platform specific code. Use Kbuild and Kconfig options respectively to enable the code depending on architecture and platform options. Create a new file core/amd.c for this.

Introduce a function cxl_zen5_init() to handle Zen5 specific enablement. Zen5 platforms are detected using the PCIe vendor and device ID of the corresponding CXL root port.

Apply cxl_zen5_to_hpa() as the cxl_port->to_hpa() callback to Zen5 CXL host bridges to enable platform specific address translation.

Use ACPI PRM DPA to SPA translation to determine an endpoint's interleaving configuration and base address during the early initialization process. This is used to determine an endpoint's SPA range.

Since the PRM translates DPA->SPA, but HPA->SPA is needed, determine the interleaving config and base address of the endpoint first, then calculate the SPA based on the given HPA using the address base.
The config can be determined by calling the PRM for specific DPAs. Since the interleaving configuration is still unknown, choose DPAs starting at 0xd20000. This address is divisible by all values from 1 to 8 and thus valid for all possible interleaving configurations. The resulting SPAs are used to calculate the interleaving parameters and the SPA base address of the endpoint. The maximum granularity (chunk size) is 16k, the minimum is 256.

Use the following calculation for a given DPA:

ways = hpa_len(SZ_16K) / SZ_16K
gran = (hpa_len(SZ_16K) - hpa_len(SZ_16K - SZ_256) - SZ_256)
       / (ways - 1)
pos  = (hpa_len(SZ_16K) - ways * SZ_16K) / gran

Once the endpoint is attached to a region and its SPA range is known, calling the PRM is no longer needed and the SPA base can be used.

Signed-off-by: Robert Richter <rrichter@amd.com>
---
 drivers/cxl/Kconfig       |   4 +
 drivers/cxl/core/Makefile |   1 +
 drivers/cxl/core/amd.c    | 227 ++++++++++++++++++++++++++++++++++++++
 drivers/cxl/core/core.h   |   6 +
 drivers/cxl/core/port.c   |   7 ++
 5 files changed, 245 insertions(+)
 create mode 100644 drivers/cxl/core/amd.c