[2/4,v2] cxl/core: Add helpers to detect Low memory Holes on x86

Message ID 20250114203432.31861-3-fabio.m.de.francesco@linux.intel.com
State Superseded
Series cxl/core: Enable Region creation on x86 with Low Mem Hole

Commit Message

Fabio M. De Francesco Jan. 14, 2025, 8:32 p.m. UTC
On x86 platforms with a Low Memory Hole, the BIOS may publish CFMWS entries
that describe SPA ranges which are subsets of the corresponding CXL Endpoint
Decoders' HPA ranges: the CFMWS never intersects the LMH, while Endpoint
Decoder HPA ranges are always guaranteed to align to the NIW * 256M rule.

In order to construct Regions and attach Decoders, the driver needs to
match Root Decoders and Regions with Endpoint Decoders, but the process
fails and returns errors because the driver does not expect SPA ranges
shorter than the corresponding HPA ranges.

Introduce functions that indirectly detect x86 LMHs by comparing SPA ranges
with the corresponding HPA ranges. They will be used during Region creation
and Endpoint Decoder attachment to prevent driver failures at several steps
of that process.

The helpers return true when an HPA/SPA misalignment is detected and all of
the following conditions hold: both the SPA and the HPA range start at
LMH_CFMWS_RANGE_START (0x0 on x86 with an LMH), the SPA range is smaller
than the HPA range, the SPA range ends below 4G, and the HPA range size is
aligned to the NIW * 256M rule.

Also introduce a function to adjust the range end of Regions created on x86
with an LMH.
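
For a concrete picture of the layout these helpers look for, below is a
minimal userspace sketch of the same conditions with hypothetical values
(a 1-way interleave endpoint mapped at 0x0 with a 4G HPA window, and a
CFMWS trimmed to 3G so that it ends before the low memory hole); the
in-kernel helpers in this patch operate on the decoder structures instead:

	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	#define SZ_256M	0x10000000ULL
	#define SZ_4G	0x100000000ULL

	struct range { uint64_t start, end; };	/* 'end' is inclusive */

	static uint64_t range_len(const struct range *r)
	{
		return r->end - r->start + 1;
	}

	int main(void)
	{
		struct range spa = { 0x0, 0xBFFFFFFFULL };	/* root decoder: 3G CFMWS */
		struct range hpa = { 0x0, 0xFFFFFFFFULL };	/* endpoint decoder: 4G HPA */
		int niw = 1;					/* interleave ways */

		/* The same conditions arch_match_spa() checks */
		bool lmh = spa.start == 0x0 && spa.start == hpa.start &&
			   spa.end < SZ_4G && spa.end < hpa.end &&
			   range_len(&hpa) % (niw * SZ_256M) == 0;

		printf("low memory hole layout detected: %s\n", lmh ? "yes" : "no");
		return 0;
	}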

Cc: Alison Schofield <alison.schofield@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fabio.m.de.francesco@linux.intel.com>
---
 drivers/cxl/core/lmh.c | 55 ++++++++++++++++++++++++++++++++++++++++++
 drivers/cxl/core/lmh.h | 28 +++++++++++++++++++++
 2 files changed, 83 insertions(+)
 create mode 100644 drivers/cxl/core/lmh.c
 create mode 100644 drivers/cxl/core/lmh.h

Comments

Gregory Price Jan. 15, 2025, 2:23 a.m. UTC | #1
On Tue, Jan 14, 2025 at 09:32:54PM +0100, Fabio M. De Francesco wrote:
> +/*
> + * Match CXL Root and Endpoint Decoders by comparing SPA and HPA ranges.
> + *
> + * On x86, CFMWS ranges never intersect memory holes while endpoint decoders
> + * HPA range sizes are always guaranteed aligned to NIW * 256MB; therefore,
> + * the given endpoint decoder HPA range size is always expected aligned and
> + * also larger than that of the matching root decoder. If there are LMH's,
> + * the root decoder range end is always less than SZ_4G.
> + */

Is there any reason to limit this memory-hole handling to only low
memory holes? I have observed systems where the following memory
hole situation occurs:

(example, not exact sizes)
CFMW1:   [  0xc0000000 - 0xdfffffff  ] 512MB range
Reserved [  0xe0000000 - 0xffffffff  ] 512MB range
CFMW2:   [ 0x100000000 - 0x15fffffff ] 1.5GB range

2 CXL Memory Devices w/ 1GB capacity each (again, an example).

Note that 1 device has its capacity split across the hole, but
if the devices are interleaved then both devices have their capacity
split across the hole.

It seems with some mild modification, this patch set could be
re-used to handle this memory hole scenario as well
(w/ addr translation - Robert's patch set)

Is there a reason not to handle more than just LMH's in this set?

(I may try to hack this up on my test system and report back.)

~Gregory
Fabio M. De Francesco Jan. 21, 2025, 8:35 p.m. UTC | #2
On Wednesday, January 15, 2025 3:23:36 AM GMT+1 Gregory Price wrote:
> On Tue, Jan 14, 2025 at 09:32:54PM +0100, Fabio M. De Francesco wrote:
> > +/*
> > + * Match CXL Root and Endpoint Decoders by comparing SPA and HPA ranges.
> > + *
> > + * On x86, CFMWS ranges never intersect memory holes while endpoint decoders
> > + * HPA range sizes are always guaranteed aligned to NIW * 256MB; therefore,
> > + * the given endpoint decoder HPA range size is always expected aligned and
> > + * also larger than that of the matching root decoder. If there are LMH's,
> > + * the root decoder range end is always less than SZ_4G.
> > + */
> 
> Is there any reason to limit this memory-hole handling to only low
> memory holes? 
>
No, I don't see any special reason to limit this to only low memory holes.
It's just that I didn't know about others.
>
> I have observed systems where the following memory
> hole situation occurs:
> 
> (example, not exact sizes)
> CFMW1:   [  0xc0000000 - 0xdfffffff  ] 512MB range
> Reserved [  0xe0000000 - 0xffffffff  ] 512MB range
> CFMW2:   [ 0x100000000 - 0x15fffffff ] 1.5GB range
> 
> 2 CXL Memory Devices w/ 1GB capacity each (again, an example).
> 
> Note that 1 device has its capacity split across the hole, but
> if the devices are interleaved then both devices have their capacity
> split across the hole.
> 
> It seems with some mild modification, this patch set could be
> re-used to handle this memory hole scenario as well
> (w/ addr translation - Robert's patch set)
> 
> Is there a reason not to handle more than just LMH's in this set?
> 
> (I may try to hack this up on my test system and report back.)
>
I'd appreciate it if you want to report back.

Thank you,

Fabio
> 
> ~Gregory
>
Dan Williams April 1, 2025, 8:32 p.m. UTC | #3
Gregory Price wrote:
> On Tue, Jan 14, 2025 at 09:32:54PM +0100, Fabio M. De Francesco wrote:
> > +/*
> > + * Match CXL Root and Endpoint Decoders by comparing SPA and HPA ranges.
> > + *
> > + * On x86, CFMWS ranges never intersect memory holes while endpoint decoders
> > + * HPA range sizes are always guaranteed aligned to NIW * 256MB; therefore,
> > + * the given endpoint decoder HPA range size is always expected aligned and
> > + * also larger than that of the matching root decoder. If there are LMH's,
> > + * the root decoder range end is always less than SZ_4G.
> > + */
> 
> Is there any reason to limit this memory-hole handling to only low
> memory holes? I have observed systems where the following memory
> hole situation occurs:
> 
> (example, not exact sizes)
> CFMW1:   [  0xc0000000 - 0xdfffffff  ] 512MB range
> Reserved [  0xe0000000 - 0xffffffff  ] 512MB range
> CFMW2:   [ 0x100000000 - 0x15fffffff ] 1.5GB range
> 
> 2 CXL Memory Devices w/ 1GB capacity each (again, an example).
> 
> Note that 1 device has its capacity split across the hole, but
> if the devices are interleaved then both devices have their capacity
> split across the hole.
> 
> It seems with some mild modification, this patch set could be
> re-used to handle this memory hole scenario as well
> (w/ addr translation - Robert's patch set)
> 
> Is there a reason not to handle more than just LMH's in this set?

This discussion was referenced recently on an IM and I wanted to share
my response to it:

The rules for when to apply this memory hole quirk are explicit and
suitable to add to the CXL specification. I want the same standard for
any other quirk and ideally some proof-of-work to get that quirk
recognized by the specification. Otherwise, I worry that generalizing
for all the possible ways that platform BIOS tries to be clever means we
end up with something that has no rules.

The spec is there to allow software to delineate valid configurations vs
mistakes, and this slow drip of "Linux does not understand this platform
configuration" is a spec gap.
Gregory Price April 1, 2025, 9:40 p.m. UTC | #4
On Tue, Apr 01, 2025 at 01:32:33PM -0700, Dan Williams wrote:
> Gregory Price wrote:
> > Is there a reason not to handle more than just LMH's in this set?
> 
> This discussion was referenced recently on an IM and I wanted to share
> my response to it:
> 
> The rules for when to apply this memory hole quirk are explicit and
> suitable to add to the CXL specification. I want the same standard for
> any other quirk and ideally some proof-of-work to get that quirk
> recognized by the specification. Otherwise, I worry that generalizing
> for all the possible ways that platform BIOS tries to be clever means we
> end up with something that has no rules.
>
> The spec is there to allow software to delineate valid configurations vs
> mistakes, and this slow drip of "Linux does not understand this platform
> configuration" is a spec gap.

Note: I've since come around to understanding the whole ecosystem a bit
      better since I wrote this response. I don't know that it's needed.


Some of the explanation of this patch series is a bit confusing. It
justifies itself by saying CFMWS don't intersect memory holes and that
endpoint decoders have to be 256MB aligned.

/*
 * Match CXL Root and Endpoint Decoders by comparing SPA and HPA ranges.
 *
 * On x86, CFMWS ranges never intersect memory holes while endpoint decoders
 * HPA range sizes are always guaranteed aligned to NIW * 256MB; therefore,
 * the given endpoint decoder HPA range size is always expected aligned and
 * also larger than that of the matching root decoder. If there are LMH's,
 * the root decoder range end is always less than SZ_4G.
 */

But per the spec, CFMWS is also required to be aligned to 256MB.

Shouldn't the platform work around memory holes to generate multiple
CFMWS for the entire capacity, and then use multiple endpoint decoders
(1 per CFMWS) to map the capacity accordingly?

(Also, I still don't understand the oracle value of <4GB address range.
 It seems like if this is some quirk of SPA vs HPA alignment, then it
 can hold for *all* occurrences, not just stuff below 4GB)

~Gregory
Dan Williams April 1, 2025, 11:34 p.m. UTC | #5
Gregory Price wrote:
> On Tue, Apr 01, 2025 at 01:32:33PM -0700, Dan Williams wrote:
> > Gregory Price wrote:
> > > Is there a reason not to handle more than just LMH's in this set?
> > 
> > This discussion was referenced recently on an IM and I wanted to share
> > my response to it:
> > 
> > The rules for when to apply this memory hole quirk are explicit and
> > suitable to add to the CXL specification. I want the same standard for
> > any other quirk and ideally some proof-of-work to get that quirk
> > recognized by the specification. Otherwise, I worry that generalizing
> > for all the possible ways that platform BIOS tries to be clever means we
> > end up with something that has no rules.
> >
> > The spec is there to allow software to delineate valid configurations vs
> > mistakes, and this slow drip of "Linux does not understand this platform
> > configuration" is a spec gap.
> 
> Note: I've since come around to understand the whole ecosystem a bit
>       better since I wrote this response.

Yes, I should have acknowledged shifts in understanding since this
thread went quiet. Fabio was about to spin this set again to add more
"generalization" and I wanted to clarify my current thinking that
generalization is the opposite of what should happen here.

> I don't know that it's needed.

Referring to spec changes? I think they are, see below

> Some of the explanation of this patch series is a bit confusing. It
> justifies itself by saying CFMWS don't intersect memory holes and that
> endpoint decoders have to be 256MB aligned.
> 
> /*
>  * Match CXL Root and Endpoint Decoders by comparing SPA and HPA ranges.
>  *
>  * On x86, CFMWS ranges never intersect memory holes while endpoint decoders
>  * HPA range sizes are always guaranteed aligned to NIW * 256MB; therefore,
>  * the given endpoint decoder HPA range size is always expected aligned and
>  * also larger than that of the matching root decoder. If there are LMH's,
>  * the root decoder range end is always less than SZ_4G.
>  */
> 
> But per the spec, CFMWS is also required to be aligned to 256MB.

Right, something has to give, i.e. "spec meet reality". Hardware
endpoint decoders must be aligned; that is a shipping expectation, and
endpoints are not in a position to know or care about host platform
constraints. In contrast, the CFMWS definition runs into a practical
problem meeting the same expectation given competing host physical
memory map constraints.

The platforms with this condition want to support CXL mapped starting at
zero *and* the typical/historical PCI MMIO space in low memory (for
those PCI devices that do not support 64-bit addressing). If the CFMWS
blindly followed the 256MB*NIW constraint the CXL window would overlap
the MMIO space. So the choices are:

1/ Give up on mapping CXL starting at zero when 256MB * NIW does not fit
2/ Give up on maintaining historical availability of and compatibility
   with 32-bit only PCI devices (PCI configuration regression)
3/ Trim CFMWS to match the reality that the platform will always route
   memory cycles in that PCI MMIO range to PCI MMIO and never to CXL.
4/ Define some new protocol for when CFMWS is explicitly countermanded
   by other platform resource descriptors, and not a BIOS bug.

The platform in question chose option 3.
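
To make option 3 concrete, here is an illustrative layout with made-up
sizes (NIW = 2, so an HPA window starting at zero must be a multiple of
2 * 256MB):

  Endpoint HPA window : [ 0x00000000 - 0xFFFFFFFF ] 4GB, 2 * 256MB aligned
  32-bit PCI MMIO     : [ 0xC0000000 - 0xFFFFFFFF ] reserved, never routed to CXL
  Published CFMWS     : [ 0x00000000 - 0xBFFFFFFF ] 3GB, trimmed

A CFMWS equal to the full HPA window would overlap the MMIO space, so the
platform publishes the trimmed window and relies on the OS to tolerate the
resulting SPA/HPA size mismatch.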

> Shouldn't the platform work around memory holes to generate multiple
> CFMWS for the entire capacity, and then use multiple endpoint decoders
> (1 per CFMWS) to map the capacity accordingly?

Per above, the maths do not work out to support that for a CXL region
with a problematic NIW.

> (Also, I still don't understand the oracle value of <4GB address range.
>  It seems like if this is some quirk of SPA vs HPA alignment, then it
>  can hold for *all* occurrences, not just stuff below 4GB)

The goal is to get platform vendors to define the rules so that an OS
has a reasonable expectation to know what is a valid vs invalid
configuration. A hole above 4GB has no reason to exist, there is no
resource conflict like PCI MMIO that explains why typical spec
expectation can not be met.

So I want the subsystem to have an explicit set of platform quirks
ideally backed up by updated spec language. That allows us to validate
that the Linux implementation is correct by some objective source of
truth, encourage platform vendors to update that source of truth when
they create new corner cases, or even better, be more mindful to not
create new corner cases.
Gregory Price April 2, 2025, 5:05 a.m. UTC | #6
On Tue, Apr 01, 2025 at 04:34:44PM -0700, Dan Williams wrote:
> The platforms with this condition want to support CXL mapped starting at
> zero *and* the typical/historical PCI MMIO space in low memory (for
> those PCI devices that do not support 64-bit addressing). If the CFMWS
> blindly followed the 256MB*NIW constraint the CXL window would overlap
> the MMIO space. So the choices are:
> 

If I'm understanding everything correctly, then I think this is intended
to work only when EFI_MEMORY_SP is *not* set for these particular CXL
devices - so it comes up as memory in early boot and we're just trying
to wire up all the bits to let the driver manage it accordingly?

If that's the case, then ignore me; I'm just bikeshedding at this point.

I can't imagine a system where the memory at 0x0-4GB is intended to come
up as EFI_MEMORY_SP, so I hope no one ever tries this :D

> > (Also, I still don't understand the oracle value of <4GB address range.
> >  It seems like if this is some quirk of SPA vs HPA alignment, then it
> >  can hold for *all* ocurrances, not just stuff below 4GB)
> 
> The goal is to get platform vendors to define the rules so that an OS
> has a reasonable expectation to know what is a valid vs invalid
> configuration. A hole above 4GB has no reason to exist, there is no
> resource conflict like PCI MMIO that explains why typical spec
> expectation can not be met.
> 
> So I want the subsystem to have an explicit set of platform quirks
> ideally backed up by updated spec language. That allows us to validate
> that the Linux implementation is correct by some objective source of
> truth, encourage platform vendors to update that source of truth when
> they create new corner cases, or even better, be more mindful to not
> create new corner cases.

I follow, seems reasonable.  

~Gregory

Patch

diff --git a/drivers/cxl/core/lmh.c b/drivers/cxl/core/lmh.c
new file mode 100644
index 0000000000000..232ebea0a8364
--- /dev/null
+++ b/drivers/cxl/core/lmh.c
@@ -0,0 +1,55 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/range.h>
+#include "lmh.h"
+
+/* Start of CFMWS ranges that end before x86 Low Memory Holes */
+#define LMH_CFMWS_RANGE_START 0x0ULL
+
+/*
+ * Match CXL Root and Endpoint Decoders by comparing SPA and HPA ranges.
+ *
+ * On x86, CFMWS ranges never intersect memory holes while endpoint decoders
+ * HPA range sizes are always guaranteed aligned to NIW * 256MB; therefore,
+ * the given endpoint decoder HPA range size is always expected aligned and
+ * also larger than that of the matching root decoder. If there are LMH's,
+ * the root decoder range end is always less than SZ_4G.
+ */
+bool arch_match_spa(struct cxl_root_decoder *cxlrd,
+		    struct cxl_endpoint_decoder *cxled)
+{
+	struct range *r1, *r2;
+	int niw;
+
+	r1 = &cxlrd->cxlsd.cxld.hpa_range;
+	r2 = &cxled->cxld.hpa_range;
+	niw = cxled->cxld.interleave_ways;
+
+	if (r1->start == LMH_CFMWS_RANGE_START && r1->start == r2->start &&
+	    r1->end < (LMH_CFMWS_RANGE_START + SZ_4G) && r1->end < r2->end &&
+	    IS_ALIGNED(range_len(r2), niw * SZ_256M))
+		return true;
+
+	return false;
+}
+
+/* Similar to arch_match_spa(), it matches regions and decoders */
+bool arch_match_region(struct cxl_region_params *p, struct cxl_decoder *cxld)
+{
+	struct range *r = &cxld->hpa_range;
+	struct resource *res = p->res;
+	int niw = cxld->interleave_ways;
+
+	if (res->start == LMH_CFMWS_RANGE_START && res->start == r->start &&
+	    res->end < (LMH_CFMWS_RANGE_START + SZ_4G) && res->end < r->end &&
+	    IS_ALIGNED(range_len(r), niw * SZ_256M))
+		return true;
+
+	return false;
+}
+
+void arch_adjust_region_resource(struct resource *res,
+				 struct cxl_root_decoder *cxlrd)
+{
+	res->end = cxlrd->res->end;
+}
diff --git a/drivers/cxl/core/lmh.h b/drivers/cxl/core/lmh.h
new file mode 100644
index 0000000000000..ec8907145afe8
--- /dev/null
+++ b/drivers/cxl/core/lmh.h
@@ -0,0 +1,28 @@ 
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#include "cxl.h"
+
+#ifdef CONFIG_CXL_ARCH_LOW_MEMORY_HOLE
+bool arch_match_spa(struct cxl_root_decoder *cxlrd,
+		    struct cxl_endpoint_decoder *cxled);
+bool arch_match_region(struct cxl_region_params *p, struct cxl_decoder *cxld);
+void arch_adjust_region_resource(struct resource *res,
+				 struct cxl_root_decoder *cxlrd);
+#else
+static inline bool arch_match_spa(struct cxl_root_decoder *cxlrd,
+				  struct cxl_endpoint_decoder *cxled)
+{
+	return false;
+}
+
+static inline bool arch_match_region(struct cxl_region_params *p,
+				     struct cxl_decoder *cxld)
+{
+	return false;
+}
+
+static inline void arch_adjust_region_resource(struct resource *res,
+					       struct cxl_root_decoder *cxlrd)
+{
+}
+#endif /* CONFIG_CXL_ARCH_LOW_MEMORY_HOLE */
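
Below is a rough sketch of how a caller on the region creation path might
consume these helpers (a hypothetical function, not the actual wiring added
later in this series): accept a root/endpoint decoder pair whose SPA range
is a trimmed prefix of the HPA range, then clamp the region resource to the
trimmed root decoder window.

	#include <linux/range.h>
	#include "lmh.h"

	/* Hypothetical caller, for illustration only */
	static bool match_decoder_allow_lmh(struct cxl_root_decoder *cxlrd,
					    struct cxl_endpoint_decoder *cxled,
					    struct resource *region_res)
	{
		struct range *spa = &cxlrd->cxlsd.cxld.hpa_range;
		struct range *hpa = &cxled->cxld.hpa_range;

		/* Ordinary case: the root decoder covers the endpoint HPA range */
		if (range_contains(spa, hpa))
			return true;

		/* x86 low memory hole: SPA is a trimmed prefix of the HPA range */
		if (arch_match_spa(cxlrd, cxled)) {
			/* The region must not extend past the trimmed CFMWS end */
			arch_adjust_region_resource(region_res, cxlrd);
			return true;
		}

		return false;
	}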