diff mbox series

[RFC,3/3] ACPI: NUMA: Apply SRAT proximity domain to entire CFMWS window

Message ID 79a10b7101a8fab56f9ff3f9a4de73bee3156b40.1683742429.git.alison.schofield@intel.com
State New, archived
Headers show
Series Apply SRAT defined PXM to entire CFMWS | expand

Commit Message

Alison Schofield May 10, 2023, 6:44 p.m. UTC
From: Alison Schofield <alison.schofield@intel.com>

Commit fd49f99c1809 ("ACPI: NUMA: Add a node and memblk for each
CFMWS not in SRAT") did not account for the case where an SRAT entry
only partially describes a CFMWS window. It assumed an SRAT entry
covered the entire CFMWS HPA range, start through end.

Broaden the search for an SRAT defined NUMA node, by replacing the
previously used phys_to_targe_node(start) search, with the recently
introduced, numa_find_node(), that can discover a NUMA node anywhere
in the CFMWS HPA range.

If any NUMA node is discovered, proactively cleanup, by removing any
memblks, partial or whole, in the HPA range, and add one memblk that
encompasses the entire range. That has the effect of applying the SRAT
defined proximity domain to the entire range, as well as doing a memblk
cleanup at the point a redundancy is created.

Considered and rejected, letting numa_cleanup_meminfo() try to remove
the redundancy. It doesn't currently address this case, because these
memblks will be moved to numa_reserved_meminfo, before any numa_meminfo
merge is done. Also, the merge logic in numa_cleanup_meminfo() works
on adjacent memblks, so it would need to grow in complexity to search
for these potential cases.

Considered and ready to reconsider,  allow an extra memblk for every
CFMWS HPA range that is also described in the SRAT. Is that a concern?
If not a concern to have the extra memblk, then skip the memblk remove
work entirely.

Fixes: fd49f99c1809 ("ACPI: NUMA: Add a node and memblk for each CFMWS not in SRAT")
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
---
 drivers/acpi/numa/srat.c | 32 ++++++++++++++++++++++++++------
 1 file changed, 26 insertions(+), 6 deletions(-)

Comments

Dan Williams May 11, 2023, 11:16 p.m. UTC | #1
alison.schofield@ wrote:
> From: Alison Schofield <alison.schofield@intel.com>
> 
> Commit fd49f99c1809 ("ACPI: NUMA: Add a node and memblk for each
> CFMWS not in SRAT") did not account for the case where an SRAT entry
> only partially describes a CFMWS window. It assumed an SRAT entry
> covered the entire CFMWS HPA range, start through end.
> 
> Broaden the search for an SRAT defined NUMA node, by replacing the
> previously used phys_to_targe_node(start) search, with the recently
> introduced, numa_find_node(), that can discover a NUMA node anywhere
> in the CFMWS HPA range.
> 
> If any NUMA node is discovered, proactively cleanup, by removing any
> memblks, partial or whole, in the HPA range, and add one memblk that
> encompasses the entire range. That has the effect of applying the SRAT
> defined proximity domain to the entire range, as well as doing a memblk
> cleanup at the point a redundancy is created.
> 
> Considered and rejected, letting numa_cleanup_meminfo() try to remove
> the redundancy. It doesn't currently address this case, because these
> memblks will be moved to numa_reserved_meminfo, before any numa_meminfo
> merge is done. Also, the merge logic in numa_cleanup_meminfo() works
> on adjacent memblks, so it would need to grow in complexity to search
> for these potential cases.
> 
> Considered and ready to reconsider,  allow an extra memblk for every
> CFMWS HPA range that is also described in the SRAT. Is that a concern?
> If not a concern to have the extra memblk, then skip the memblk remove
> work entirely.
> 
> Fixes: fd49f99c1809 ("ACPI: NUMA: Add a node and memblk for each CFMWS not in SRAT")
> Signed-off-by: Alison Schofield <alison.schofield@intel.com>
> ---
>  drivers/acpi/numa/srat.c | 32 ++++++++++++++++++++++++++------
>  1 file changed, 26 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
> index 1f4fc5f8a819..f41b65e9b085 100644
> --- a/drivers/acpi/numa/srat.c
> +++ b/drivers/acpi/numa/srat.c
> @@ -301,27 +301,47 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
>  static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
>  				   void *arg, const unsigned long table_end)
>  {
> +	int node, found_node, *fake_pxm = arg;
>  	struct acpi_cedt_cfmws *cfmws;
> -	int *fake_pxm = arg;
>  	u64 start, end;
> -	int node;
>  
>  	cfmws = (struct acpi_cedt_cfmws *)header;
>  	start = cfmws->base_hpa;
>  	end = cfmws->base_hpa + cfmws->window_size;
>  
> -	/* Skip if the SRAT already described the NUMA details for this HPA */
> -	node = phys_to_target_node(start);
> -	if (node != NUMA_NO_NODE)
> +	/*
> +	 * The SRAT may have already described the NUMA details for
> +	 * this CFMWS HPA range, yet it may not have created memblks
> +	 * for the entire range. Look for a node with a memblk covering
> +	 * any part of the HPA range. Don't bother figuring out if it
> +	 * is partially or wholly described. Replace any memblks in the
> +	 * range with one single memblk that covers the entire range.
> +	 *
> +	 * This preserves the SRAT defined node and Proximity Domain.
> +	 */
> +
> +	found_node = numa_find_node(start, end);
> +	if (found_node != NUMA_NO_NODE) {
> +		numa_remove_memblks(found_node, start, end);
> +		if (numa_add_memblk(found_node, start, end) < 0) {

I worry that this blows away information in the case where the BIOS for
some reason decides to describe multiple proximity domains per CXL
window.

I think this wants to be an algorithm like

numa_fill_memblks(start, end)

...where that walks the range and for any gaps in that range extend the
last memblk to the end of the gap.

For example if the CFWMS to SRAT mapping is this:

 ┌──────────────────────────────────────────────────┐
 │                  WINDOW0                         │
 ├──────────┬────────────┬────────────┬─────────────┤
 │PXM0      │ GAP        │ PXM1       │ GAP         │
 └──────────┴────────────┴────────────┴─────────────┘


 ┌──────────────────────────────────────────────────┐
 │                  WINDOW0                         │
 ├───────────────────────┬──────────────────────────┤
 │PXM0                   │ PXM1                     │
 └───────────────────────┴──────────────────────────┘

...and if that finds nothing to fill, *then* create the next proximity
domain for the window.


> +			/* CXL driver must handle the NUMA_NO_NODE case */
> +			pr_warn("ACPI NUMA: failed to add memblk for CFMWS node %d [mem %#llx-%#llx]\n",
> +				found_node, start, end);
> +		}
>  		return 0;
> +	}
>  
> +	/*
> +	 * SRAT did not describe this window at all.
> +	 * Create a new node with a fake proximity domain. Add a
> +	 * memblk covering the entire HPA range.
> +	 */
>  	node = acpi_map_pxm_to_node(*fake_pxm);
>  
>  	if (node == NUMA_NO_NODE) {
>  		pr_err("ACPI NUMA: Too many proximity domains while processing CFMWS.\n");
>  		return -EINVAL;
>  	}
> -

Unrelated whitespace change.

>  	if (numa_add_memblk(node, start, end) < 0) {
>  		/* CXL driver must handle the NUMA_NO_NODE case */
>  		pr_warn("ACPI NUMA: Failed to add memblk for CFMWS node %d [mem %#llx-%#llx]\n",
> -- 
> 2.37.3
>
Alison Schofield May 12, 2023, 12:01 a.m. UTC | #2
On Thu, May 11, 2023 at 04:16:53PM -0700, Dan Williams wrote:
> alison.schofield@ wrote:

snip...

> > +	/*
> > +	 * The SRAT may have already described the NUMA details for
> > +	 * this CFMWS HPA range, yet it may not have created memblks
> > +	 * for the entire range. Look for a node with a memblk covering
> > +	 * any part of the HPA range. Don't bother figuring out if it
> > +	 * is partially or wholly described. Replace any memblks in the
> > +	 * range with one single memblk that covers the entire range.
> > +	 *
> > +	 * This preserves the SRAT defined node and Proximity Domain.
> > +	 */
> > +
> > +	found_node = numa_find_node(start, end);
> > +	if (found_node != NUMA_NO_NODE) {
> > +		numa_remove_memblks(found_node, start, end);
> > +		if (numa_add_memblk(found_node, start, end) < 0) {
> 
> I worry that this blows away information in the case where the BIOS for
> some reason decides to describe multiple proximity domains per CXL
> window.
> 
> I think this wants to be an algorithm like
> 
> numa_fill_memblks(start, end)
> 
> ...where that walks the range and for any gaps in that range extend the
> last memblk to the end of the gap.
> 
> For example if the CFWMS to SRAT mapping is this:
> 
>  ┌──────────────────────────────────────────────────┐
>  │                  WINDOW0                         │
>  ├──────────┬────────────┬────────────┬─────────────┤
>  │PXM0      │ GAP        │ PXM1       │ GAP         │
>  └──────────┴────────────┴────────────┴─────────────┘
> 
> 
>  ┌──────────────────────────────────────────────────┐
>  │                  WINDOW0                         │
>  ├───────────────────────┬──────────────────────────┤
>  │PXM0                   │ PXM1                     │
>  └───────────────────────┴──────────────────────────┘
> 
> ...and if that finds nothing to fill, *then* create the next proximity
> domain for the window.

Thanks for the feedback Dan.

Is there any guarantee that the first PXM begins at cfmws.start?
ie...that gaps always 'follow'

This one's easy:	gap-pxm0 	becomes all pxm0
This one's easy too:	gap-pxm0-gap 	becomes all pxm0

This one's not so obvious: gap-pxm0-gap-pxm1 

Imagine those are quarters. Is the end result 50/50 or 75/25?

I'm wondering if there is any policy in existence or this is new,
and we are setting the policy now?

Alison

> 
> 
> > +			/* CXL driver must handle the NUMA_NO_NODE case */
> > +			pr_warn("ACPI NUMA: failed to add memblk for CFMWS node %d [mem %#llx-%#llx]\n",
> > +				found_node, start, end);
> > +		}
> >  		return 0;
> > +	}
> >  
> > +	/*
> > +	 * SRAT did not describe this window at all.
> > +	 * Create a new node with a fake proximity domain. Add a
> > +	 * memblk covering the entire HPA range.
> > +	 */
> >  	node = acpi_map_pxm_to_node(*fake_pxm);
> >  
> >  	if (node == NUMA_NO_NODE) {
> >  		pr_err("ACPI NUMA: Too many proximity domains while processing CFMWS.\n");
> >  		return -EINVAL;
> >  	}
> > -
> 
> Unrelated whitespace change.
> 
> >  	if (numa_add_memblk(node, start, end) < 0) {
> >  		/* CXL driver must handle the NUMA_NO_NODE case */
> >  		pr_warn("ACPI NUMA: Failed to add memblk for CFMWS node %d [mem %#llx-%#llx]\n",
> > -- 
> > 2.37.3
> > 
> 
>
Dan Williams May 12, 2023, 12:18 a.m. UTC | #3
Alison Schofield wrote:
> On Thu, May 11, 2023 at 04:16:53PM -0700, Dan Williams wrote:
> > alison.schofield@ wrote:
> 
> snip...
> 
> > > +	/*
> > > +	 * The SRAT may have already described the NUMA details for
> > > +	 * this CFMWS HPA range, yet it may not have created memblks
> > > +	 * for the entire range. Look for a node with a memblk covering
> > > +	 * any part of the HPA range. Don't bother figuring out if it
> > > +	 * is partially or wholly described. Replace any memblks in the
> > > +	 * range with one single memblk that covers the entire range.
> > > +	 *
> > > +	 * This preserves the SRAT defined node and Proximity Domain.
> > > +	 */
> > > +
> > > +	found_node = numa_find_node(start, end);
> > > +	if (found_node != NUMA_NO_NODE) {
> > > +		numa_remove_memblks(found_node, start, end);
> > > +		if (numa_add_memblk(found_node, start, end) < 0) {
> > 
> > I worry that this blows away information in the case where the BIOS for
> > some reason decides to describe multiple proximity domains per CXL
> > window.
> > 
> > I think this wants to be an algorithm like
> > 
> > numa_fill_memblks(start, end)
> > 
> > ...where that walks the range and for any gaps in that range extend the
> > last memblk to the end of the gap.
> > 
> > For example if the CFWMS to SRAT mapping is this:
> > 
> >  ┌──────────────────────────────────────────────────┐
> >  │                  WINDOW0                         │
> >  ├──────────┬────────────┬────────────┬─────────────┤
> >  │PXM0      │ GAP        │ PXM1       │ GAP         │
> >  └──────────┴────────────┴────────────┴─────────────┘
> > 
> > 
> >  ┌──────────────────────────────────────────────────┐
> >  │                  WINDOW0                         │
> >  ├───────────────────────┬──────────────────────────┤
> >  │PXM0                   │ PXM1                     │
> >  └───────────────────────┴──────────────────────────┘
> > 
> > ...and if that finds nothing to fill, *then* create the next proximity
> > domain for the window.
> 
> Thanks for the feedback Dan.
> 
> Is there any guarantee that the first PXM begins at cfmws.start?
> ie...that gaps always 'follow'
> 
> This one's easy:	gap-pxm0 	becomes all pxm0
> This one's easy too:	gap-pxm0-gap 	becomes all pxm0
> 
> This one's not so obvious: gap-pxm0-gap-pxm1 

Good point.

I would say that pxm0 consuming both gaps is ok. pxm1 grabbing the
second gap is also ok, whatever makes the code simpler.

> Imagine those are quarters. Is the end result 50/50 or 75/25?

The window is all one QoS class anyway as far as CXL is concerned. So
the end distribution is not much of a concern. I.e. I would not worry if
you end up with 75/25 even if 50/50 was possible.

> I'm wondering if there is any policy in existence or this is new,
> and we are setting the policy now?

We are trailblazing on our own. The reason why I think we have the
freedom to make an arbitrary decision here is that it is unlikely that the
BIOS describes multiple proximity domains per window, but on the off
chance they do lets not throw away information.
Jonathan Cameron May 12, 2023, 4:45 p.m. UTC | #4
On Thu, 11 May 2023 17:18:27 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> Alison Schofield wrote:
> > On Thu, May 11, 2023 at 04:16:53PM -0700, Dan Williams wrote:  
> > > alison.schofield@ wrote:  
> > 
> > snip...
> >   
> > > > +	/*
> > > > +	 * The SRAT may have already described the NUMA details for
> > > > +	 * this CFMWS HPA range, yet it may not have created memblks
> > > > +	 * for the entire range. Look for a node with a memblk covering
> > > > +	 * any part of the HPA range. Don't bother figuring out if it
> > > > +	 * is partially or wholly described. Replace any memblks in the
> > > > +	 * range with one single memblk that covers the entire range.
> > > > +	 *
> > > > +	 * This preserves the SRAT defined node and Proximity Domain.
> > > > +	 */
> > > > +
> > > > +	found_node = numa_find_node(start, end);
> > > > +	if (found_node != NUMA_NO_NODE) {
> > > > +		numa_remove_memblks(found_node, start, end);
> > > > +		if (numa_add_memblk(found_node, start, end) < 0) {  
> > > 
> > > I worry that this blows away information in the case where the BIOS for
> > > some reason decides to describe multiple proximity domains per CXL
> > > window.
> > > 
> > > I think this wants to be an algorithm like
> > > 
> > > numa_fill_memblks(start, end)
> > > 
> > > ...where that walks the range and for any gaps in that range extend the
> > > last memblk to the end of the gap.
> > > 
> > > For example if the CFWMS to SRAT mapping is this:
> > > 
> > >  ┌──────────────────────────────────────────────────┐
> > >  │                  WINDOW0                         │
> > >  ├──────────┬────────────┬────────────┬─────────────┤
> > >  │PXM0      │ GAP        │ PXM1       │ GAP         │
> > >  └──────────┴────────────┴────────────┴─────────────┘
> > > 
> > > 
> > >  ┌──────────────────────────────────────────────────┐
> > >  │                  WINDOW0                         │
> > >  ├───────────────────────┬──────────────────────────┤
> > >  │PXM0                   │ PXM1                     │
> > >  └───────────────────────┴──────────────────────────┘
> > > 
> > > ...and if that finds nothing to fill, *then* create the next proximity
> > > domain for the window.  
> > 
> > Thanks for the feedback Dan.
> > 
> > Is there any guarantee that the first PXM begins at cfmws.start?
> > ie...that gaps always 'follow'
> > 
> > This one's easy:	gap-pxm0 	becomes all pxm0
> > This one's easy too:	gap-pxm0-gap 	becomes all pxm0
> > 
> > This one's not so obvious: gap-pxm0-gap-pxm1   
> 
> Good point.
> 
> I would say that pxm0 consuming both gaps is ok. pxm1 grabbing the
> second gap is also ok, whatever makes the code simpler.

Agreed - it's best effort on a heuristic.  My guess is we'll end up
revisiting this when hotplug means we see similar patterns to those
that led a bios to do the above, but this is fine for now.

> 
> > Imagine those are quarters. Is the end result 50/50 or 75/25?  
> 
> The window is all one QoS class anyway as far as CXL is concerned. So
> the end distribution is not much of a concern. I.e. I would not worry if
> you end up with 75/25 even if 50/50 was possible.

I don't like tying QoS class to a narrow range of access characteristics
but that's a problem for another day.

> 
> > I'm wondering if there is any policy in existence or this is new,
> > and we are setting the policy now?  
> 
> We are trailblazing on our own. The reason why I think we have the
> freedom to make an arbitrary decision here is that it is unlikely that the
> BIOS describes multiple proximity domains per window, but on the off
> chance they do lets not throw away information.

Agree with not throwing it away, fairly sure you'll see this :)
If nothing else it's handy for a system that doesn't do interleaving
across devices for some reason where the OS is then going to use NUMA Interleave.

So it can be useful to have multiple proximity domains even for identical
memory if the access paths are different.

Ha! I see an excuse to reuse old work:
https://github.com/hisilicon/acpi-numa-whitepaper/releases/tag/v0.93
See 5.1

I should bring that back one day when things are quieter (so likely never).
Was targeted as an ACPI WP and out for review when something happened (check
the date and who I work for if curious ;)

Jonathan
diff mbox series

Patch

diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
index 1f4fc5f8a819..f41b65e9b085 100644
--- a/drivers/acpi/numa/srat.c
+++ b/drivers/acpi/numa/srat.c
@@ -301,27 +301,47 @@  acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
 static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
 				   void *arg, const unsigned long table_end)
 {
+	int node, found_node, *fake_pxm = arg;
 	struct acpi_cedt_cfmws *cfmws;
-	int *fake_pxm = arg;
 	u64 start, end;
-	int node;
 
 	cfmws = (struct acpi_cedt_cfmws *)header;
 	start = cfmws->base_hpa;
 	end = cfmws->base_hpa + cfmws->window_size;
 
-	/* Skip if the SRAT already described the NUMA details for this HPA */
-	node = phys_to_target_node(start);
-	if (node != NUMA_NO_NODE)
+	/*
+	 * The SRAT may have already described the NUMA details for
+	 * this CFMWS HPA range, yet it may not have created memblks
+	 * for the entire range. Look for a node with a memblk covering
+	 * any part of the HPA range. Don't bother figuring out if it
+	 * is partially or wholly described. Replace any memblks in the
+	 * range with one single memblk that covers the entire range.
+	 *
+	 * This preserves the SRAT defined node and Proximity Domain.
+	 */
+
+	found_node = numa_find_node(start, end);
+	if (found_node != NUMA_NO_NODE) {
+		numa_remove_memblks(found_node, start, end);
+		if (numa_add_memblk(found_node, start, end) < 0) {
+			/* CXL driver must handle the NUMA_NO_NODE case */
+			pr_warn("ACPI NUMA: failed to add memblk for CFMWS node %d [mem %#llx-%#llx]\n",
+				found_node, start, end);
+		}
 		return 0;
+	}
 
+	/*
+	 * SRAT did not describe this window at all.
+	 * Create a new node with a fake proximity domain. Add a
+	 * memblk covering the entire HPA range.
+	 */
 	node = acpi_map_pxm_to_node(*fake_pxm);
 
 	if (node == NUMA_NO_NODE) {
 		pr_err("ACPI NUMA: Too many proximity domains while processing CFMWS.\n");
 		return -EINVAL;
 	}
-
 	if (numa_add_memblk(node, start, end) < 0) {
 		/* CXL driver must handle the NUMA_NO_NODE case */
 		pr_warn("ACPI NUMA: Failed to add memblk for CFMWS node %d [mem %#llx-%#llx]\n",