diff mbox series

[v2] cxl/region: Allow 6 & 12 way regions on 3-way HB interleaves

Message ID 20250306232239.2609017-1-alison.schofield@intel.com
State New
Headers show
Series [v2] cxl/region: Allow 6 & 12 way regions on 3-way HB interleaves | expand

Commit Message

Alison Schofield March 6, 2025, 11:22 p.m. UTC
From: Alison Schofield <alison.schofield@intel.com>

The CXL driver requires the granularity of a region and its root
decoder to be the same. This is particularly restrictive for 3-way
host bridge interleaves where the only spec defined interleave
configurations for creating 6-way and 12-way regions on a 3-way HB
interleave require mixed granularities.

CXL 3.2 Specification 9.13.1.1:
Legal Interleaving Configurations: 12-way, 6-way, and 3-way

Adding support for these new interleaves touches these areas:

1) User created regions employing "cxl create-region" fail when the
CXL CLI tool gets a mixed granularity request. That is addressed in
a patch to the ndctl tool.

2) User created regions employing sysfs directly fail at the sysfs
store of the non-matching region granularity. That restriction is
lifted here. Note that the driver immediately allows any reasonable
3-way HB granularity, but the region config may still fail later
if the ways/gran between region and root decoder are incompatible.

3) Auto created regions, in which the drivers role is to reflect
the BIOS created region and provide RAS support, basically sneak on
by. The driver ignores the root decoders granularity and assumes it
is the same as the regions. The impact being that endpoint devices
appear out of order and DPA to HPA address translations that depend
on that order are invalid. Here the driver stops making that same
granularity assumption and checks for an allowable granularity.

A new helper, interleave_granularity_allow(), is used in both the
user and auto creation path to allow the newly supported configs.

Another new helper, gran_multiple(), is used in sorting, target
list searching, and distance calculations, supporting the new
root granularity to region granularity multiples.

And, as noted in 2), there's an added check to see if the
selected sysfs ways/gran make sense when considered together.

Signed-off-by: Alison Schofield <alison.schofield@intel.com>
---

Changes in v2:
- Allow exactly 2*ig or 4*ig in interleave_granularity_allow()(DaveJ)
- Remove needless !root_decoder_gran check in gran_multiple()(Ming)
- Update spec references to 3.2 (DaveJ)
- Commit log grammar cleanup (DaveJ)
- Describe param 'multiple' in cxl_calc_interleave_pos() (lkp)


 drivers/cxl/core/region.c | 104 +++++++++++++++++++++++++++++++-------
 1 file changed, 85 insertions(+), 19 deletions(-)


base-commit: 0a14566be090ca51a32ebdd8a8e21678062dac08

Comments

Dan Williams March 7, 2025, 12:23 a.m. UTC | #1
alison.schofield@ wrote:
> From: Alison Schofield <alison.schofield@intel.com>
> 
> The CXL driver requires the granularity of a region and its root
> decoder to be the same. This is particularly restrictive for 3-way
> host bridge interleaves where the only spec defined interleave
> configurations for creating 6-way and 12-way regions on a 3-way HB
> interleave require mixed granularities.
> 
> CXL 3.2 Specification 9.13.1.1:
> Legal Interleaving Configurations: 12-way, 6-way, and 3-way
> 
> Adding support for these new interleaves touches these areas:
> 
> 1) User created regions employing "cxl create-region" fail when the
> CXL CLI tool gets a mixed granularity request. That is addressed in
> a patch to the ndctl tool.
> 
> 2) User created regions employing sysfs directly fail at the sysfs
> store of the non-matching region granularity. That restriction is
> lifted here. Note that the driver immediately allows any reasonable
> 3-way HB granularity, but the region config may still fail later
> if the ways/gran between region and root decoder are incompatible.
> 
> 3) Auto created regions, in which the drivers role is to reflect
> the BIOS created region and provide RAS support, basically sneak on
> by. The driver ignores the root decoders granularity and assumes it
> is the same as the regions. The impact being that endpoint devices
> appear out of order and DPA to HPA address translations that depend
> on that order are invalid.

I.e. that is a bug. The question is do we backport a "fix" to make that
config start failing per old expectations, or do we backport this patch
to fix those known configs?

I would lean to backporting this patch so this needs a Fixes tag.

> Here the driver stops making that same
> granularity assumption and checks for an allowable granularity.
> 
> A new helper, interleave_granularity_allow(), is used in both the
> user and auto creation path to allow the newly supported configs.

Minor but the function is not performing an "allow" operation it is
evaluating an "allowed" condition, so I would have expected this to be
named:

    is_interleave_granularity_allowed()

> Another new helper, gran_multiple(), is used in sorting, target
> list searching, and distance calculations, supporting the new
> root granularity to region granularity multiples.
> 
> And, as noted in 2), there's an added check to see if the
> selected sysfs ways/gran make sense when considered together.
> 
> Signed-off-by: Alison Schofield <alison.schofield@intel.com>
> ---
> 
> Changes in v2:
> - Allow exactly 2*ig or 4*ig in interleave_granularity_allow()(DaveJ)
> - Remove needless !root_decoder_gran check in gran_multiple()(Ming)
> - Update spec references to 3.2 (DaveJ)
> - Commit log grammar cleanup (DaveJ)
> - Describe param 'multiple' in cxl_calc_interleave_pos() (lkp)
> 
> 
>  drivers/cxl/core/region.c | 104 +++++++++++++++++++++++++++++++-------
>  1 file changed, 85 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 8537b6a9ca18..cc2687407a3e 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -532,6 +532,31 @@ static ssize_t interleave_granularity_show(struct device *dev,
>  	return rc;
>  }
>  
> +static bool interleave_granularity_allow(struct cxl_decoder *cxld, u16 ig)

I don't like that this function takes a plain 'struct cxl_decoder *',
this function is only relevant for root decoders so make that clear by
requring that it is passed a 'struct cxl_root_decoder *'. Yes, it means
redoing the "cxld = &cxlrd->cxlsd.cxld" conversion.

> +{
> +	/*
> +	 * When the host-bridge is interleaved, disallow region granularity
> +	 * != root granularity with the exception of 3-way HB interleaves.
> +	 * Allow the CXL Spec defined 3-way HB interleaves that can only be
> +	 * configured when host-bridge interleave is greater that the
> +	 * region interleave. (CXL 3.2 Specification 9.13.1.1)
> +	 * Allow 2+2+2 interleave where HB gran is 2 * region granularity
> +	 *	 4+4+4 interleave where HB gran is 4 * region granularity

Only an Intel person might understand X+X+X notation topology notation.
Just describe these configs in plain english.

    Allow 6 endpoints across 3 host-bridges when the root-decoder
    granularity is 2x region granularity.

    Allow 12 endpoints across 3 host-bridges when the root-decoder 
    granularity is 4x region granularity.

The problem is that this comment assumes direct-attached endpoints. If
you had switches between a host-bridge and endpoints then the region
granularity is not valid to compare to the root decoder granularity.

> +	 *
> +	 * Regions with a granularity greater than the root interleave result
> +	 * in invalid DPA translations (invalid to support).
> +	 */
> +	if (cxld->interleave_ways > 1 && ig != cxld->interleave_granularity) {
> +		if (cxld->interleave_ways != 3)
> +			return false;
> +
> +		if (cxld->interleave_granularity != 2 * ig &&
> +		    cxld->interleave_granularity != 4 * ig)
> +			return false;
> +	}
> +	return true;
> +}
> +
>  static ssize_t interleave_granularity_store(struct device *dev,
>  					    struct device_attribute *attr,
>  					    const char *buf, size_t len)
> @@ -551,15 +576,7 @@ static ssize_t interleave_granularity_store(struct device *dev,
>  	if (rc)
>  		return rc;
>  
> -	/*
> -	 * When the host-bridge is interleaved, disallow region granularity !=
> -	 * root granularity. Regions with a granularity less than the root
> -	 * interleave result in needing multiple endpoints to support a single
> -	 * slot in the interleave (possible to support in the future). Regions
> -	 * with a granularity greater than the root interleave result in invalid
> -	 * DPA translations (invalid to support).
> -	 */
> -	if (cxld->interleave_ways > 1 && val != cxld->interleave_granularity)
> +	if (!interleave_granularity_allow(cxld, val))
>  		return -EINVAL;
>  
>  	rc = down_write_killable(&cxl_region_rwsem);
> @@ -1305,6 +1322,23 @@ static int check_interleave_cap(struct cxl_decoder *cxld, int iw, int ig)
>  	return 0;
>  }
>  
> +static int gran_multiple(struct cxl_region *cxlr)
> +{
> +	struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(cxlr->dev.parent);
> +	int root_decoder_gran = cxlrd->cxlsd.cxld.interleave_granularity;
> +	int region_gran = cxlr->params.interleave_granularity;
> +
> +	/*
> +	 * In regions built upon 3-way HB interleaves, the root decoder
> +	 * granularity can be a multiple of the region granularity. The
> +	 * multiple value is used in sorting and distance calculations.
> +	 */
> +	if (cxlrd->cxlsd.cxld.interleave_ways != 3)
> +		return 1;
> +
> +	return root_decoder_gran / region_gran;
> +}
> +
>  static int cxl_port_setup_targets(struct cxl_port *port,
>  				  struct cxl_region *cxlr,
>  				  struct cxl_endpoint_decoder *cxled)
> @@ -1336,6 +1370,7 @@ static int cxl_port_setup_targets(struct cxl_port *port,
>  	cxlsd = to_cxl_switch_decoder(&cxld->dev);
>  	if (cxl_rr->nr_targets_set) {
>  		int i, distance = 1;
> +		int multiple = gran_multiple(cxlr);
>  		struct cxl_region_ref *cxl_rr_iter;
>  
>  		/*
> @@ -1347,12 +1382,17 @@ static int cxl_port_setup_targets(struct cxl_port *port,
>  		 * always 1 as every index targets a different host-bridge. At
>  		 * each subsequent switch level those ports map every Nth region
>  		 * position where N is the width of the switch == distance.
> +		 *
> +		 * With the introduction of mixed granularities in 3-way HB
> +		 * interleaves, divide N by a multiple that represents the root
> +		 * decoder to region granularity.

I go into more detail later, but region granularity is not relevant when
switches are present.

>  		 */
>  		do {
>  			cxl_rr_iter = cxl_rr_load(iter, cxlr);
>  			distance *= cxl_rr_iter->nr_targets;
>  			iter = to_cxl_port(iter->dev.parent);
>  		} while (!is_cxl_root(iter));
> +		distance /= multiple;
>  		distance *= cxlrd->cxlsd.cxld.interleave_ways;
>  
>  		for (i = 0; i < cxl_rr->nr_targets_set; i++)
> @@ -1369,12 +1409,9 @@ static int cxl_port_setup_targets(struct cxl_port *port,
>  	if (is_cxl_root(parent_port)) {
>  		/*
>  		 * Root decoder IG is always set to value in CFMWS which
> -		 * may be different than this region's IG.  We can use the
> -		 * region's IG here since interleave_granularity_store()
> -		 * does not allow interleaved host-bridges with
> -		 * root IG != region IG.
> +		 * may be different than this region's IG.
>  		 */
> -		parent_ig = p->interleave_granularity;
> +		parent_ig = cxlrd->cxlsd.cxld.interleave_granularity;
>  		parent_iw = cxlrd->cxlsd.cxld.interleave_ways;
>  		/*
>  		 * For purposes of address bit routing, use power-of-2 math for
> @@ -1446,7 +1483,7 @@ static int cxl_port_setup_targets(struct cxl_port *port,
>  
>  	if (test_bit(CXL_REGION_F_AUTO, &cxlr->flags)) {
>  		if (cxld->interleave_ways != iw ||
> -		    cxld->interleave_granularity != ig ||
> +		    !interleave_granularity_allow(cxld, ig) ||
>  		    !region_res_match_cxl_range(p, &cxld->hpa_range) ||
>  		    ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)) {
>  			dev_err(&cxlr->dev,
> @@ -1675,11 +1712,12 @@ static int cxl_region_attach_position(struct cxl_region *cxlr,
>  	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
>  	struct cxl_switch_decoder *cxlsd = &cxlrd->cxlsd;
>  	struct cxl_decoder *cxld = &cxlsd->cxld;
> +	int multiple = gran_multiple(cxlr);
>  	int iw = cxld->interleave_ways;
>  	struct cxl_port *iter;
>  	int rc;
>  
> -	if (dport != cxlrd->cxlsd.target[pos % iw]) {
> +	if (dport != cxlrd->cxlsd.target[pos / multiple % iw]) {
>  		dev_dbg(&cxlr->dev, "%s:%s invalid target position for %s\n",
>  			dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev),
>  			dev_name(&cxlrd->cxlsd.cxld.dev));
> @@ -1811,6 +1849,9 @@ static int find_pos_and_ways(struct cxl_port *port, struct range *range,
>  /**
>   * cxl_calc_interleave_pos() - calculate an endpoint position in a region
>   * @cxled: endpoint decoder member of given region
> + * @multiple: the root decoder gran as a multiple of region gran
> + *	      Typically enforced as 1 except in 3-way HB interleaves
> + *	      See gran_multiple()
>   *
>   * The endpoint position is calculated by traversing the topology from
>   * the endpoint to the root decoder and iteratively applying this
> @@ -1823,7 +1864,8 @@ static int find_pos_and_ways(struct cxl_port *port, struct range *range,
>   * Return: position >= 0 on success
>   *	   -ENXIO on failure
>   */
> -static int cxl_calc_interleave_pos(struct cxl_endpoint_decoder *cxled)
> +static int cxl_calc_interleave_pos(struct cxl_endpoint_decoder *cxled,
> +				   int multiple)
>  {
>  	struct cxl_port *iter, *port = cxled_to_port(cxled);
>  	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> @@ -1869,6 +1911,11 @@ static int cxl_calc_interleave_pos(struct cxl_endpoint_decoder *cxled)
>  		if (rc)
>  			return rc;
>  
> +		if (multiple > 1 && is_cxl_root(next_port(iter))) {
> +			pos = pos + multiple * parent_pos;
> +			break;
> +		}
> +
>  		pos = pos * parent_ways + parent_pos;
>  	}
>  
> @@ -1883,12 +1930,13 @@ static int cxl_calc_interleave_pos(struct cxl_endpoint_decoder *cxled)
>  static int cxl_region_sort_targets(struct cxl_region *cxlr)
>  {
>  	struct cxl_region_params *p = &cxlr->params;
> +	int multiple = gran_multiple(cxlr);
>  	int i, rc = 0;
>  
>  	for (i = 0; i < p->nr_targets; i++) {
>  		struct cxl_endpoint_decoder *cxled = p->targets[i];
>  
> -		cxled->pos = cxl_calc_interleave_pos(cxled);
> +		cxled->pos = cxl_calc_interleave_pos(cxled, multiple);
>  		/*
>  		 * Record that sorting failed, but still continue to calc
>  		 * cxled->pos so that follow-on code paths can reliably
> @@ -1916,6 +1964,23 @@ static int cxl_region_attach(struct cxl_region *cxlr,
>  	struct cxl_dport *dport;
>  	int rc = -ENXIO;
>  
> +	/*
> +	 * Protect against improper gran mixes on 3-way HB interleave
> +	 * Expect decoder gran*ways == region gran*ways
> +	 */
> +	if (cxlrd->cxlsd.cxld.interleave_ways == 3) {
> +		if ((cxlrd->cxlsd.cxld.interleave_granularity * 3) !=
> +		    (p->interleave_ways * p->interleave_granularity)) {

When switches are present p->interleave_ways and
p->interleave_granularity are not relevant for 3-way validation. For
example imagine an x3 HB, with x6 switches, and x12 endpoints. The mixed
granularity is resolved at the switch level, not the region level.
Alison Schofield March 12, 2025, 4:59 p.m. UTC | #2
On Thu, Mar 06, 2025 at 04:23:35PM -0800, Dan Williams wrote:
> alison.schofield@ wrote:
> > From: Alison Schofield <alison.schofield@intel.com>
> > 
> > The CXL driver requires the granularity of a region and its root
> > decoder to be the same. This is particularly restrictive for 3-way
> > host bridge interleaves where the only spec defined interleave
> > configurations for creating 6-way and 12-way regions on a 3-way HB
> > interleave require mixed granularities.
> > 
> > CXL 3.2 Specification 9.13.1.1:
> > Legal Interleaving Configurations: 12-way, 6-way, and 3-way
> > 
> > Adding support for these new interleaves touches these areas:
> > 
> > 1) User created regions employing "cxl create-region" fail when the
> > CXL CLI tool gets a mixed granularity request. That is addressed in
> > a patch to the ndctl tool.
> > 
> > 2) User created regions employing sysfs directly fail at the sysfs
> > store of the non-matching region granularity. That restriction is
> > lifted here. Note that the driver immediately allows any reasonable
> > 3-way HB granularity, but the region config may still fail later
> > if the ways/gran between region and root decoder are incompatible.
> > 
> > 3) Auto created regions, in which the drivers role is to reflect
> > the BIOS created region and provide RAS support, basically sneak on
> > by. The driver ignores the root decoders granularity and assumes it
> > is the same as the regions. The impact being that endpoint devices
> > appear out of order and DPA to HPA address translations that depend
> > on that order are invalid.
> 
> I.e. that is a bug. The question is do we backport a "fix" to make that
> config start failing per old expectations, or do we backport this patch
> to fix those known configs?
> 

For this specific config -> fix it. For all the others quietly
misassembling -> start failing.  More...

> I would lean to backporting this patch so this needs a Fixes tag.

Thanks for the review Dan. Here's a response in general, and afterwards
responses to your embedded code review comments.

I should have said this was a no switches solution. This is a no
switches solution as that meets the use case needing repair. More
on that below.

I see where this can be more restrictive so as not to let any configs
slip through that are not 100% supported by the CXL driver. Let's
discuss which configs the driver will fully support versus explicitly
reject.

Appended is a list of the 3-6-12 way interleaves described in the CXL
Spec 3.2 Table. The complete list is appended because it's important
to see all the flavors. If we only talk switch, no_switch, or mixed,
no_mixed, combinations are omitted and complexity may be underestimated.
The driver supports 3 of the 11 configs correctly: 3-A, 6-A, 12-A.
Those are the configs with cross-host bridge interleave only.

The driver may pretend to support, but fails on all others. I use the
word 'may' because user created regions fail to create, so there is
no pretending there. It is the auto regions, (BIOS configured regions)
that will appear to be recognized by the CXL driver, but in fact the
assembly is misunderstood and so RAS support is broken.

That pretend is not just for MOD3 regions. Barring a safeguard I've
missed, any auto config with mixed granularity is quietly misassembled.

That suggests a lead in patch here that shuts down the quiet misassembly
of any mixed granularity auto regions. Again, no need to shut it down
for user-created as they are already blocked. User-created regions will
require new allowances.

Which mixed granularity regions require support?

I have a user that needs config 6-B fixed. That is the 6 way region
built on a 3-way Host Bridge Interleave, no switches. It is one of
those that are quietly misassembling as an auto region. Two similar
12-ways configs, 12-B&D come along for minimal additional effort yet
I am not aware of a use case for these.

Proposal is that this patch(set) protects against the mis-assembled
configs and adds support 6-B. That would fully meet the use case at hand
and protect users against the mis-assembled regions.

I do wonder if the shutdown of mis-assembled auto regions might bring
some use cases to light 

I expect I'll work a v3 with the updates noted here while listening
for upstream feedback on driver supported configs.


--------
Section 9.13.1.1
Legal Interleaving Configurations: 12-way, 6-way, and 3-way 

Table 9-8: 3-Way Device-level Interleave at 1GB
3-A:	Cross-host Bridge: 3 way at 1GB
	Host Bridge: No interleaving
	Switch: No interleaving or absent

Table 9-7: 6-Way Device-level Interleave at 1GB
6-A:	Cross-host Bridge: 6 way at 1GB
	Host Bridge: No interleaving
	Switch: No interleaving or absent
6-B:	Cross-host Bridge: 3 way at 2 times 1GB
	Host Bridge: 2 way at 1GB
	Switch: No interleaving
6-C:	Cross-host Bridge: 3 way at 2 times 1GB
	Host Bridge: No interleaving
	Switch: 2 way at 1GB

Table 9-6: 12-Way Device-level Interleave at 1GB
12-A:	Cross-host Bridge: 12 way at 1GB
	Host Bridge: No interleaving
	Switch: No interleaving or absent
12-B:	Cross-host Bridge: 6 way at 2 times 1GB
	Host Bridge: 2 way at 1GB
	Switch: No interleaving or absent
12-C:	Cross-host Bridge: 6 way at 2 times 1GB
	Host Bridge: No interleaving
	Switch: 2 way at 1GB
12-D:	Cross-host Bridge: 3 way at 4 times 1GB
	Host Bridge: 4 way at 1GB
	Switch: No interleaving
12-E:	Cross-host Bridge: 3 way at 4 times 1GB
	Host Bridge: No interleaving
	Switch: 4 way at 1GB
12-F:	Cross-host Bridge: 3 way at 4 times 1GB
	Host Bridge: 2 way at 1GB
	Switch: 2 way at 2 times 1GB
12-G:	Cross-host Bridge: 3 way at 4 times 1GB
	Host Bridge: 2 way at 2 times 1GB
	Switch: 2 way at 1GB


> 
> > Here the driver stops making that same
> > granularity assumption and checks for an allowable granularity.
> > 
> > A new helper, interleave_granularity_allow(), is used in both the
> > user and auto creation path to allow the newly supported configs.
> 
> Minor but the function is not performing an "allow" operation it is
> evaluating an "allowed" condition, so I would have expected this to be
> named:
> 
>     is_interleave_granularity_allowed()
> 

got it.


> > Another new helper, gran_multiple(), is used in sorting, target
> > list searching, and distance calculations, supporting the new
> > root granularity to region granularity multiples.
> > 
> > And, as noted in 2), there's an added check to see if the
> > selected sysfs ways/gran make sense when considered together.
> > 
> > Signed-off-by: Alison Schofield <alison.schofield@intel.com>
> > ---
> > 
> > Changes in v2:
> > - Allow exactly 2*ig or 4*ig in interleave_granularity_allow()(DaveJ)
> > - Remove needless !root_decoder_gran check in gran_multiple()(Ming)
> > - Update spec references to 3.2 (DaveJ)
> > - Commit log grammar cleanup (DaveJ)
> > - Describe param 'multiple' in cxl_calc_interleave_pos() (lkp)
> > 
> > 
> >  drivers/cxl/core/region.c | 104 +++++++++++++++++++++++++++++++-------
> >  1 file changed, 85 insertions(+), 19 deletions(-)
> > 
> > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > index 8537b6a9ca18..cc2687407a3e 100644
> > --- a/drivers/cxl/core/region.c
> > +++ b/drivers/cxl/core/region.c
> > @@ -532,6 +532,31 @@ static ssize_t interleave_granularity_show(struct device *dev,
> >  	return rc;
> >  }
> >  
> > +static bool interleave_granularity_allow(struct cxl_decoder *cxld, u16 ig)
> 
> I don't like that this function takes a plain 'struct cxl_decoder *',
> this function is only relevant for root decoders so make that clear by
> requring that it is passed a 'struct cxl_root_decoder *'. Yes, it means
> redoing the "cxld = &cxlrd->cxlsd.cxld" conversion.
> 

got it: s/cxl_decoder/cxl_root_decoder


> > +{
> > +	/*
> > +	 * When the host-bridge is interleaved, disallow region granularity
> > +	 * != root granularity with the exception of 3-way HB interleaves.
> > +	 * Allow the CXL Spec defined 3-way HB interleaves that can only be
> > +	 * configured when host-bridge interleave is greater that the
> > +	 * region interleave. (CXL 3.2 Specification 9.13.1.1)
> > +	 * Allow 2+2+2 interleave where HB gran is 2 * region granularity
> > +	 *	 4+4+4 interleave where HB gran is 4 * region granularity
> 
> Only an Intel person might understand X+X+X notation topology notation.
> Just describe these configs in plain english.

Can do.

> 
>     Allow 6 endpoints across 3 host-bridges when the root-decoder
>     granularity is 2x region granularity.
> 
>     Allow 12 endpoints across 3 host-bridges when the root-decoder 
>     granularity is 4x region granularity.
> 
> The problem is that this comment assumes direct-attached endpoints. If
> you had switches between a host-bridge and endpoints then the region
> granularity is not valid to compare to the root decoder granularity.
> 

Yes it does.

> > +	 *
> > +	 * Regions with a granularity greater than the root interleave result
> > +	 * in invalid DPA translations (invalid to support).
> > +	 */
> > +	if (cxld->interleave_ways > 1 && ig != cxld->interleave_granularity) {
> > +		if (cxld->interleave_ways != 3)
> > +			return false;
> > +
> > +		if (cxld->interleave_granularity != 2 * ig &&
> > +		    cxld->interleave_granularity != 4 * ig)
> > +			return false;
> > +	}
> > +	return true;
> > +}
> > +

snip

> > @@ -1347,12 +1382,17 @@ static int cxl_port_setup_targets(struct cxl_port *port,
> >  		 * always 1 as every index targets a different host-bridge. At
> >  		 * each subsequent switch level those ports map every Nth region
> >  		 * position where N is the width of the switch == distance.
> > +		 *
> > +		 * With the introduction of mixed granularities in 3-way HB
> > +		 * interleaves, divide N by a multiple that represents the root
> > +		 * decoder to region granularity.
> 
> I go into more detail later, but region granularity is not relevant when
> switches are present.

No switches present. I take another look to see if any of this
doesn't clearly preclude switches.

> 
> > +	/*
> > +	 * Protect against improper gran mixes on 3-way HB interleave
> > +	 * Expect decoder gran*ways == region gran*ways
> > +	 */
> > +	if (cxlrd->cxlsd.cxld.interleave_ways == 3) {
> > +		if ((cxlrd->cxlsd.cxld.interleave_granularity * 3) !=
> > +		    (p->interleave_ways * p->interleave_granularity)) {
> 
> When switches are present p->interleave_ways and
> p->interleave_granularity are not relevant for 3-way validation. For
> example imagine an x3 HB, with x6 switches, and x12 endpoints. The mixed
> granularity is resolved at the switch level, not the region level.

No switches.
Jonathan Cameron March 14, 2025, noon UTC | #3
On Thu,  6 Mar 2025 15:22:37 -0800
alison.schofield@intel.com wrote:

> From: Alison Schofield <alison.schofield@intel.com>
> 
> The CXL driver requires the granularity of a region and its root
> decoder to be the same. This is particularly restrictive for 3-way
> host bridge interleaves where the only spec defined interleave
> configurations for creating 6-way and 12-way regions on a 3-way HB
> interleave require mixed granularities.
> 
> CXL 3.2 Specification 9.13.1.1:
> Legal Interleaving Configurations: 12-way, 6-way, and 3-way

Ah this is finally a valid reason to do coarser interleave first
(going away from host). That was subject of long discussions way back
when original interleaving code was discussed (mostly because
my mental model did it that way around and I couldn't follow what
the kernel code was doing).

We 'could' revisit allowing this more generally - at least for
already configured set ups.  Only real reason for that is
either that someone ships a config that does it for a different
case, or that we want to avoid special casing 3*x cases.

Jonathan
Alison Schofield March 18, 2025, 3:08 a.m. UTC | #4
On Fri, Mar 14, 2025 at 12:00:36PM +0000, Jonathan Cameron wrote:
> On Thu,  6 Mar 2025 15:22:37 -0800
> alison.schofield@intel.com wrote:
> 
> > From: Alison Schofield <alison.schofield@intel.com>
> > 
> > The CXL driver requires the granularity of a region and its root
> > decoder to be the same. This is particularly restrictive for 3-way
> > host bridge interleaves where the only spec defined interleave
> > configurations for creating 6-way and 12-way regions on a 3-way HB
> > interleave require mixed granularities.
> > 
> > CXL 3.2 Specification 9.13.1.1:
> > Legal Interleaving Configurations: 12-way, 6-way, and 3-way
> 
> Ah this is finally a valid reason to do coarser interleave first
> (going away from host). That was subject of long discussions way back
> when original interleaving code was discussed (mostly because
> my mental model did it that way around and I couldn't follow what
> the kernel code was doing).
> 
> We 'could' revisit allowing this more generally - at least for
> already configured set ups.  Only real reason for that is
> either that someone ships a config that does it for a different
> case, or that we want to avoid special casing 3*x cases.

Hi Jonathan,

I'll bring these configs up as an OPEN at the upcoming community collab,
and hope you can elaborate more there.

--Alison


> 
> Jonathan
Jonathan Cameron March 18, 2025, 2:35 p.m. UTC | #5
On Mon, 17 Mar 2025 20:08:24 -0700
Alison Schofield <alison.schofield@intel.com> wrote:

> On Fri, Mar 14, 2025 at 12:00:36PM +0000, Jonathan Cameron wrote:
> > On Thu,  6 Mar 2025 15:22:37 -0800
> > alison.schofield@intel.com wrote:
> >   
> > > From: Alison Schofield <alison.schofield@intel.com>
> > > 
> > > The CXL driver requires the granularity of a region and its root
> > > decoder to be the same. This is particularly restrictive for 3-way
> > > host bridge interleaves where the only spec defined interleave
> > > configurations for creating 6-way and 12-way regions on a 3-way HB
> > > interleave require mixed granularities.
> > > 
> > > CXL 3.2 Specification 9.13.1.1:
> > > Legal Interleaving Configurations: 12-way, 6-way, and 3-way  
> > 
> > Ah this is finally a valid reason to do coarser interleave first
> > (going away from host). That was subject of long discussions way back
> > when original interleaving code was discussed (mostly because
> > my mental model did it that way around and I couldn't follow what
> > the kernel code was doing).
> > 
> > We 'could' revisit allowing this more generally - at least for
> > already configured set ups.  Only real reason for that is
> > either that someone ships a config that does it for a different
> > case, or that we want to avoid special casing 3*x cases.  
> 
> Hi Jonathan,
> 
> I'll bring these configs up as an OPEN at the upcoming community collab,
> and hope you can elaborate more there.
> 

Sure. I'll try and remember how all this worked!

I'm not really suggesting we do handle those other cases unless we know
they are in the wild or it simplifies the code by removing special cases
(I'm not sure it does!)

Jonathan

> --Alison
> 
> 
> > 
> > Jonathan  
>
diff mbox series

Patch

diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 8537b6a9ca18..cc2687407a3e 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -532,6 +532,31 @@  static ssize_t interleave_granularity_show(struct device *dev,
 	return rc;
 }
 
+static bool interleave_granularity_allow(struct cxl_decoder *cxld, u16 ig)
+{
+	/*
+	 * When the host-bridge is interleaved, disallow region granularity
+	 * != root granularity with the exception of 3-way HB interleaves.
+	 * Allow the CXL Spec defined 3-way HB interleaves that can only be
+	 * configured when host-bridge interleave is greater that the
+	 * region interleave. (CXL 3.2 Specification 9.13.1.1)
+	 * Allow 2+2+2 interleave where HB gran is 2 * region granularity
+	 *	 4+4+4 interleave where HB gran is 4 * region granularity
+	 *
+	 * Regions with a granularity greater than the root interleave result
+	 * in invalid DPA translations (invalid to support).
+	 */
+	if (cxld->interleave_ways > 1 && ig != cxld->interleave_granularity) {
+		if (cxld->interleave_ways != 3)
+			return false;
+
+		if (cxld->interleave_granularity != 2 * ig &&
+		    cxld->interleave_granularity != 4 * ig)
+			return false;
+	}
+	return true;
+}
+
 static ssize_t interleave_granularity_store(struct device *dev,
 					    struct device_attribute *attr,
 					    const char *buf, size_t len)
@@ -551,15 +576,7 @@  static ssize_t interleave_granularity_store(struct device *dev,
 	if (rc)
 		return rc;
 
-	/*
-	 * When the host-bridge is interleaved, disallow region granularity !=
-	 * root granularity. Regions with a granularity less than the root
-	 * interleave result in needing multiple endpoints to support a single
-	 * slot in the interleave (possible to support in the future). Regions
-	 * with a granularity greater than the root interleave result in invalid
-	 * DPA translations (invalid to support).
-	 */
-	if (cxld->interleave_ways > 1 && val != cxld->interleave_granularity)
+	if (!interleave_granularity_allow(cxld, val))
 		return -EINVAL;
 
 	rc = down_write_killable(&cxl_region_rwsem);
@@ -1305,6 +1322,23 @@  static int check_interleave_cap(struct cxl_decoder *cxld, int iw, int ig)
 	return 0;
 }
 
+static int gran_multiple(struct cxl_region *cxlr)
+{
+	struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(cxlr->dev.parent);
+	int root_decoder_gran = cxlrd->cxlsd.cxld.interleave_granularity;
+	int region_gran = cxlr->params.interleave_granularity;
+
+	/*
+	 * In regions built upon 3-way HB interleaves, the root decoder
+	 * granularity can be a multiple of the region granularity. The
+	 * multiple value is used in sorting and distance calculations.
+	 */
+	if (cxlrd->cxlsd.cxld.interleave_ways != 3)
+		return 1;
+
+	return root_decoder_gran / region_gran;
+}
+
 static int cxl_port_setup_targets(struct cxl_port *port,
 				  struct cxl_region *cxlr,
 				  struct cxl_endpoint_decoder *cxled)
@@ -1336,6 +1370,7 @@  static int cxl_port_setup_targets(struct cxl_port *port,
 	cxlsd = to_cxl_switch_decoder(&cxld->dev);
 	if (cxl_rr->nr_targets_set) {
 		int i, distance = 1;
+		int multiple = gran_multiple(cxlr);
 		struct cxl_region_ref *cxl_rr_iter;
 
 		/*
@@ -1347,12 +1382,17 @@  static int cxl_port_setup_targets(struct cxl_port *port,
 		 * always 1 as every index targets a different host-bridge. At
 		 * each subsequent switch level those ports map every Nth region
 		 * position where N is the width of the switch == distance.
+		 *
+		 * With the introduction of mixed granularities in 3-way HB
+		 * interleaves, divide N by a multiple that represents the root
+		 * decoder to region granularity.
 		 */
 		do {
 			cxl_rr_iter = cxl_rr_load(iter, cxlr);
 			distance *= cxl_rr_iter->nr_targets;
 			iter = to_cxl_port(iter->dev.parent);
 		} while (!is_cxl_root(iter));
+		distance /= multiple;
 		distance *= cxlrd->cxlsd.cxld.interleave_ways;
 
 		for (i = 0; i < cxl_rr->nr_targets_set; i++)
@@ -1369,12 +1409,9 @@  static int cxl_port_setup_targets(struct cxl_port *port,
 	if (is_cxl_root(parent_port)) {
 		/*
 		 * Root decoder IG is always set to value in CFMWS which
-		 * may be different than this region's IG.  We can use the
-		 * region's IG here since interleave_granularity_store()
-		 * does not allow interleaved host-bridges with
-		 * root IG != region IG.
+		 * may be different than this region's IG.
 		 */
-		parent_ig = p->interleave_granularity;
+		parent_ig = cxlrd->cxlsd.cxld.interleave_granularity;
 		parent_iw = cxlrd->cxlsd.cxld.interleave_ways;
 		/*
 		 * For purposes of address bit routing, use power-of-2 math for
@@ -1446,7 +1483,7 @@  static int cxl_port_setup_targets(struct cxl_port *port,
 
 	if (test_bit(CXL_REGION_F_AUTO, &cxlr->flags)) {
 		if (cxld->interleave_ways != iw ||
-		    cxld->interleave_granularity != ig ||
+		    !interleave_granularity_allow(cxld, ig) ||
 		    !region_res_match_cxl_range(p, &cxld->hpa_range) ||
 		    ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)) {
 			dev_err(&cxlr->dev,
@@ -1675,11 +1712,12 @@  static int cxl_region_attach_position(struct cxl_region *cxlr,
 	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
 	struct cxl_switch_decoder *cxlsd = &cxlrd->cxlsd;
 	struct cxl_decoder *cxld = &cxlsd->cxld;
+	int multiple = gran_multiple(cxlr);
 	int iw = cxld->interleave_ways;
 	struct cxl_port *iter;
 	int rc;
 
-	if (dport != cxlrd->cxlsd.target[pos % iw]) {
+	if (dport != cxlrd->cxlsd.target[pos / multiple % iw]) {
 		dev_dbg(&cxlr->dev, "%s:%s invalid target position for %s\n",
 			dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev),
 			dev_name(&cxlrd->cxlsd.cxld.dev));
@@ -1811,6 +1849,9 @@  static int find_pos_and_ways(struct cxl_port *port, struct range *range,
 /**
  * cxl_calc_interleave_pos() - calculate an endpoint position in a region
  * @cxled: endpoint decoder member of given region
+ * @multiple: the root decoder gran as a multiple of region gran
+ *	      Typically enforced as 1 except in 3-way HB interleaves
+ *	      See gran_multiple()
  *
  * The endpoint position is calculated by traversing the topology from
  * the endpoint to the root decoder and iteratively applying this
@@ -1823,7 +1864,8 @@  static int find_pos_and_ways(struct cxl_port *port, struct range *range,
  * Return: position >= 0 on success
  *	   -ENXIO on failure
  */
-static int cxl_calc_interleave_pos(struct cxl_endpoint_decoder *cxled)
+static int cxl_calc_interleave_pos(struct cxl_endpoint_decoder *cxled,
+				   int multiple)
 {
 	struct cxl_port *iter, *port = cxled_to_port(cxled);
 	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
@@ -1869,6 +1911,11 @@  static int cxl_calc_interleave_pos(struct cxl_endpoint_decoder *cxled)
 		if (rc)
 			return rc;
 
+		if (multiple > 1 && is_cxl_root(next_port(iter))) {
+			pos = pos + multiple * parent_pos;
+			break;
+		}
+
 		pos = pos * parent_ways + parent_pos;
 	}
 
@@ -1883,12 +1930,13 @@  static int cxl_calc_interleave_pos(struct cxl_endpoint_decoder *cxled)
 static int cxl_region_sort_targets(struct cxl_region *cxlr)
 {
 	struct cxl_region_params *p = &cxlr->params;
+	int multiple = gran_multiple(cxlr);
 	int i, rc = 0;
 
 	for (i = 0; i < p->nr_targets; i++) {
 		struct cxl_endpoint_decoder *cxled = p->targets[i];
 
-		cxled->pos = cxl_calc_interleave_pos(cxled);
+		cxled->pos = cxl_calc_interleave_pos(cxled, multiple);
 		/*
 		 * Record that sorting failed, but still continue to calc
 		 * cxled->pos so that follow-on code paths can reliably
@@ -1916,6 +1964,23 @@  static int cxl_region_attach(struct cxl_region *cxlr,
 	struct cxl_dport *dport;
 	int rc = -ENXIO;
 
+	/*
+	 * Protect against improper gran mixes on 3-way HB interleave
+	 * Expect decoder gran*ways == region gran*ways
+	 */
+	if (cxlrd->cxlsd.cxld.interleave_ways == 3) {
+		if ((cxlrd->cxlsd.cxld.interleave_granularity * 3) !=
+		    (p->interleave_ways * p->interleave_granularity)) {
+			dev_dbg(&cxlr->dev,
+				"invalid config region:%d/%d decoder %d/%d\n",
+				p->interleave_ways, p->interleave_granularity,
+				cxlrd->cxlsd.cxld.interleave_ways,
+				cxlrd->cxlsd.cxld.interleave_granularity);
+
+			return rc;
+		}
+	}
+
 	rc = check_interleave_cap(&cxled->cxld, p->interleave_ways,
 				  p->interleave_granularity);
 	if (rc) {
@@ -2069,9 +2134,10 @@  static int cxl_region_attach(struct cxl_region *cxlr,
 	 */
 	for (int i = 0; i < p->nr_targets; i++) {
 		struct cxl_endpoint_decoder *cxled = p->targets[i];
+		int multiple = gran_multiple(cxlr);
 		int test_pos;
 
-		test_pos = cxl_calc_interleave_pos(cxled);
+		test_pos = cxl_calc_interleave_pos(cxled, multiple);
 		dev_dbg(&cxled->cxld.dev,
 			"Test cxl_calc_interleave_pos(): %s test_pos:%d cxled->pos:%d\n",
 			(test_pos == cxled->pos) ? "success" : "fail",