
[3/3] acpi,srat: reduce memory block size if CFMWS has a smaller alignment

Message ID 20241008044355.4325-4-gourry@gourry.net (mailing list archive)
State Handled Elsewhere, archived
Series memory,acpi: resize memory blocks based on CFMW alignment

Commit Message

Gregory Price Oct. 8, 2024, 4:43 a.m. UTC
The CXL Fixed Memory Window allows for memory aligned down to the
size of 256MB.  However, by default on x86, memory blocks increase
in size as total System RAM capacity increases. On x86, this caps
out at 2G when 64GB of System RAM is reached.

When the CFMWS regions are not aligned to memory block size, this
results in lost capacity on either side of the alignment.

Parse all CFMWS to detect the largest common denominator among all
regions, and reduce the block size accordingly.

This can only be done when MEMORY_HOTPLUG and SPARSEMEM configs are
enabled, but the surrounding code may not necessarily require these
configs, so build accordingly.

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
 drivers/acpi/numa/srat.c | 48 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)
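
To make the lost-capacity claim concrete, here is a small user-space sketch (not part of the patch). `struct window`, `window_align()` and `lost_capacity()` are hypothetical names; the alignment loop mirrors the one in `acpi_align_cfmws()`, and the example window below is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_256M (1ULL << 28)
#define SZ_2G   (1ULL << 31)
#define SZ_64T  (1ULL << 46)

/* Hypothetical stand-in for a CFMWS entry: base HPA and window size. */
struct window {
	uint64_t base;
	uint64_t size;
};

/* Largest power of two in [256MB, 64TB] dividing both base and size, or 0. */
static uint64_t window_align(const struct window *w)
{
	uint64_t bz;

	for (bz = SZ_64T; bz >= SZ_256M; bz >>= 1) {
		if ((w->base % bz) == 0 && (w->size % bz) == 0)
			return bz;
	}
	return 0; /* base/size not aligned to the 256MB minimum */
}

/* Bytes of the window that cannot be covered by whole, aligned blocks. */
static uint64_t lost_capacity(const struct window *w, uint64_t block)
{
	uint64_t start, end, usable;

	assert(block != 0);
	start = (w->base + block - 1) / block * block; /* round base up */
	end = (w->base + w->size) / block * block;     /* round end down */
	usable = end > start ? end - start : 0;

	return w->size - usable;
}
```

Under these assumptions, a 4GB window based at 4GB + 256MB loses 2GB of capacity with 2GB memory blocks and nothing with 256MB blocks.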

Comments

Ira Weiny Oct. 8, 2024, 2:58 p.m. UTC | #1
Gregory Price wrote:
> The CXL Fixed Memory Window allows for memory aligned down to the
> size of 256MB.  However, by default on x86, memory blocks increase
> in size as total System RAM capacity increases. On x86, this caps
> out at 2G when 64GB of System RAM is reached.
> 
> When the CFMWS regions are not aligned to memory block size, this
> results in lost capacity on either side of the alignment.
> 
> Parse all CFMWS to detect the largest common denominator among all
> regions, and reduce the block size accordingly.
> 
> This can only be done when MEMORY_HOTPLUG and SPARSEMEM configs are
> enabled, but the surrounding code may not necessarily require these
> configs, so build accordingly.
> 
> Suggested-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
>  drivers/acpi/numa/srat.c | 48 ++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 48 insertions(+)
> 
> diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
> index 44f91f2c6c5d..9367d36eba9a 100644
> --- a/drivers/acpi/numa/srat.c
> +++ b/drivers/acpi/numa/srat.c
> @@ -14,6 +14,7 @@
>  #include <linux/errno.h>
>  #include <linux/acpi.h>
>  #include <linux/memblock.h>
> +#include <linux/memory.h>
>  #include <linux/numa.h>
>  #include <linux/nodemask.h>
>  #include <linux/topology.h>
> @@ -333,6 +334,37 @@ acpi_parse_memory_affinity(union acpi_subtable_headers *header,
>  	return 0;
>  }
>  
> +#if defined(CONFIG_MEMORY_HOTPLUG)

Generally we avoid config defines in *.c files...  See more below.

> +/*
> + * CXL allows CFMW to be aligned along 256MB boundaries, but large memory
> + * systems default to larger alignments (2GB on x86). Misalignments can
> + * cause some capacity to become unreachable. Calculate the largest supported
> + * alignment for all CFMW to maximize the amount of mappable capacity.
> + */
> +static int __init acpi_align_cfmws(union acpi_subtable_headers *header,
> +				   void *arg, const unsigned long table_end)
> +{
> +	struct acpi_cedt_cfmws *cfmws = (struct acpi_cedt_cfmws *)header;
> +	u64 start = cfmws->base_hpa;
> +	u64 size = cfmws->window_size;
> +	unsigned long *fin_bz = arg;
> +	unsigned long bz;
> +
> +	for (bz = SZ_64T; bz >= SZ_256M; bz >>= 1) {
> +		if (IS_ALIGNED(start, bz) && IS_ALIGNED(size, bz))
> +			break;
> +	}
> +
> +	/* Only adjust downward, we never want to increase block size */
> +	if (bz < *fin_bz && bz >= SZ_256M)
> +		*fin_bz = bz;
> +	else if (bz < SZ_256M)
> +		pr_err("CFMWS: [BIOS BUG] base/size alignment violates spec\n");
> +
> +	return 0;
> +}
> +#endif /* defined(CONFIG_MEMORY_HOTPLUG) */
> +
>  static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
>  				   void *arg, const unsigned long table_end)
>  {
> @@ -501,6 +533,10 @@ acpi_table_parse_srat(enum acpi_srat_type id,
>  int __init acpi_numa_init(void)
>  {
>  	int i, fake_pxm, cnt = 0;
> +#if defined(CONFIG_MEMORY_HOTPLUG)
> +	unsigned long block_sz = memory_block_size_bytes();

To help address David's comment as well;

Is there a way to scan all the alignments of the windows and pass the
desired alignment to the arch in a new call and have the arch determine if
changing the order is ok?

Also the call to the arch would be a noop for !CONFIG_MEMORY_HOTPLUG which
cleans up this function WRT CONFIG_MEMORY_HOTPLUG.

Ira

> +	unsigned long cfmw_align = block_sz;
> +#endif /* defined(CONFIG_MEMORY_HOTPLUG) */
>  
>  	if (acpi_disabled)
>  		return -EINVAL;
> @@ -552,6 +588,18 @@ int __init acpi_numa_init(void)
>  	}
>  	last_real_pxm = fake_pxm;
>  	fake_pxm++;
> +
> +#if defined(CONFIG_MEMORY_HOTPLUG)
> +	/* Calculate and set largest supported memory block size alignment */
> +	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_align_cfmws,
> +			      &cfmw_align);
> +	if (cfmw_align < block_sz && cfmw_align >= SZ_256M) {
> +		if (set_memory_block_size_order(ffs(cfmw_align)-1))
> +			pr_warn("CFMWS: Unable to adjust memory block size\n");
> +	}
> +#endif /* defined(CONFIG_MEMORY_HOTPLUG) */
> +
> +	/* Then parse and fill the numa nodes with the described memory */
>  	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
>  			      &fake_pxm);
>  
> -- 
> 2.43.0
> 
>
Gregory Price Oct. 8, 2024, 3:17 p.m. UTC | #2
On Tue, Oct 08, 2024 at 09:58:35AM -0500, Ira Weiny wrote:
> Gregory Price wrote:
> > The CXL Fixed Memory Window allows for memory aligned down to the
> > size of 256MB.  However, by default on x86, memory blocks increase
> > in size as total System RAM capacity increases. On x86, this caps
> > out at 2G when 64GB of System RAM is reached.
> > 
> > When the CFMWS regions are not aligned to memory block size, this
> > results in lost capacity on either side of the alignment.
> > 
> > Parse all CFMWS to detect the largest common denominator among all
> > regions, and reduce the block size accordingly.
> > 
> > This can only be done when MEMORY_HOTPLUG and SPARSEMEM configs are
> > enabled, but the surrounding code may not necessarily require these
> > configs, so build accordingly.
> > 
> > Suggested-by: Dan Williams <dan.j.williams@intel.com>
> > Signed-off-by: Gregory Price <gourry@gourry.net>
> > ---
> >  drivers/acpi/numa/srat.c | 48 ++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 48 insertions(+)
> > 
> > diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
> > index 44f91f2c6c5d..9367d36eba9a 100644
> > --- a/drivers/acpi/numa/srat.c
> > +++ b/drivers/acpi/numa/srat.c
> > @@ -14,6 +14,7 @@
> >  #include <linux/errno.h>
> >  #include <linux/acpi.h>
> >  #include <linux/memblock.h>
> > +#include <linux/memory.h>
> >  #include <linux/numa.h>
> >  #include <linux/nodemask.h>
> >  #include <linux/topology.h>
> > @@ -333,6 +334,37 @@ acpi_parse_memory_affinity(union acpi_subtable_headers *header,
> >  	return 0;
> >  }
> >  
> > +#if defined(CONFIG_MEMORY_HOTPLUG)
> 
> Generally we avoid config defines in *.c files...  See more below.
> 
> > +/*
> > + * CXL allows CFMW to be aligned along 256MB boundaries, but large memory
> > + * systems default to larger alignments (2GB on x86). Misalignments can
> > + * cause some capacity to become unreachable. Calculate the largest supported
> > + * alignment for all CFMW to maximize the amount of mappable capacity.
> > + */
> > +static int __init acpi_align_cfmws(union acpi_subtable_headers *header,
> > +				   void *arg, const unsigned long table_end)
> > +{
> > +	struct acpi_cedt_cfmws *cfmws = (struct acpi_cedt_cfmws *)header;
> > +	u64 start = cfmws->base_hpa;
> > +	u64 size = cfmws->window_size;
> > +	unsigned long *fin_bz = arg;
> > +	unsigned long bz;
> > +
> > +	for (bz = SZ_64T; bz >= SZ_256M; bz >>= 1) {
> > +		if (IS_ALIGNED(start, bz) && IS_ALIGNED(size, bz))
> > +			break;
> > +	}
> > +
> > +	/* Only adjust downward, we never want to increase block size */
> > +	if (bz < *fin_bz && bz >= SZ_256M)
> > +		*fin_bz = bz;
> > +	else if (bz < SZ_256M)
> > +		pr_err("CFMWS: [BIOS BUG] base/size alignment violates spec\n");
> > +
> > +	return 0;
> > +}
> > +#endif /* defined(CONFIG_MEMORY_HOTPLUG) */
> > +
> >  static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
> >  				   void *arg, const unsigned long table_end)
> >  {
> > @@ -501,6 +533,10 @@ acpi_table_parse_srat(enum acpi_srat_type id,
> >  int __init acpi_numa_init(void)
> >  {
> >  	int i, fake_pxm, cnt = 0;
> > +#if defined(CONFIG_MEMORY_HOTPLUG)
> > +	unsigned long block_sz = memory_block_size_bytes();
> 
> To help address David's comment as well;
> 
> Is there a way to scan all the alignments of the windows and pass the
> desired alignment to the arch in a new call and have the arch determine if
> changing the order is ok?
> 

At least on x86, it's only OK during init, so it would probably look like
setting a static bit (like the global value in x86) and just refusing to
update once it is locked.

I could implement that on the x86 side as an example.

FWIW: this was Dan's suggestion (quoting discord, sorry Dan!)
```
    I am assuming we would call it here
        drivers/acpi/numa/srat.c::acpi_parse_cfmws()
    which should be before page-allocator init
```

It's only safe before page-allocator init (i.e. before blocks start getting
populated and used), and this area occurs before that.
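
A minimal user-space sketch of the "set a static value, refuse updates once locked" idea follows. All names here (`set_block_size_order()`, `lock_block_size()`, `block_size_bytes()`) and the order bounds are assumptions for illustration, not the actual x86 implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MIN_BLOCK_ORDER 28 /* 256MB, assumed floor for this sketch */
#define MAX_BLOCK_ORDER 31 /* 2GB, the x86 cap mentioned in the thread */

static unsigned int block_order = MAX_BLOCK_ORDER;
static bool block_order_locked;

/* Accept a new order only while still in early init (hypothetical). */
static int set_block_size_order(unsigned int order)
{
	if (block_order_locked)
		return -EBUSY;
	if (order < MIN_BLOCK_ORDER || order > MAX_BLOCK_ORDER)
		return -EINVAL;
	block_order = order;
	return 0;
}

/* Called once blocks are about to be populated; later updates are refused. */
static void lock_block_size(void)
{
	block_order_locked = true;
}

/* Current block size in bytes, derived from the stored order. */
static unsigned long long block_size_bytes(void)
{
	assert(block_order >= MIN_BLOCK_ORDER && block_order <= MAX_BLOCK_ORDER);
	return 1ULL << block_order;
}
```

In this model the caller learns from the return value whether the request took effect, so a late call fails loudly instead of silently changing the block size.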


> Also the call to the arch would be a noop for !CONFIG_MEMORY_HOTPLUG which
> cleans up this function WRT CONFIG_MEMORY_HOTPLUG.
> 
> Ira
>

The ifdefs are a nasty result of the HOTPLUG and SPARSEMEM configs
being, from my purview, horrendously inconsistent throughout the system.

As an example, MIN_MEMORY_BLOCK_SIZE depends on SECTION_SIZE_BITS
which on some architectures is dependent on CONFIG_SPARSEMEM, and
on others is defined unconditionally.  Compound this with memblock
usage appearing to imply CONFIG_MEMORY_HOTPLUG which implies
CONFIG_SPARSEMEM (see drivers/base/memory.c) - but mm/memblock.c
makes no such assumption.

The result of this is that if you extern these functions and build
x86 with each combination of HOTPLUG/SPARSEMEM on/off, it builds - but
loongarch (and others) fail to build because SECTION_SIZE_BITS doesn't
get defined in some configurations.

It's not clear if removing those ifdefs from those archs is "correct"
(for some definition of correct) and I didn't want to increase scope.

So it's really not clear how to wire this all up.

I spent the better part of a week trying to detangle this mess just
to get things building successfully in LKP and decided to just add the
ifdefs to get it out and get opinions on the issue :[
 
> > +	unsigned long cfmw_align = block_sz;
> > +#endif /* defined(CONFIG_MEMORY_HOTPLUG) */
> >  
> >  	if (acpi_disabled)
> >  		return -EINVAL;
> > @@ -552,6 +588,18 @@ int __init acpi_numa_init(void)
> >  	}
> >  	last_real_pxm = fake_pxm;
> >  	fake_pxm++;
> > +
> > +#if defined(CONFIG_MEMORY_HOTPLUG)
> > +	/* Calculate and set largest supported memory block size alignment */
> > +	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_align_cfmws,
> > +			      &cfmw_align);
> > +	if (cfmw_align < block_sz && cfmw_align >= SZ_256M) {
> > +		if (set_memory_block_size_order(ffs(cfmw_align)-1))
> > +			pr_warn("CFMWS: Unable to adjust memory block size\n");
> > +	}
> > +#endif /* defined(CONFIG_MEMORY_HOTPLUG) */
> > +
> > +	/* Then parse and fill the numa nodes with the described memory */
> >  	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
> >  			      &fake_pxm);
> >  
> > -- 
> > 2.43.0
> > 
> > 
> 
>
Dan Williams Oct. 8, 2024, 4:46 p.m. UTC | #3
Gregory Price wrote:
> On Tue, Oct 08, 2024 at 09:58:35AM -0500, Ira Weiny wrote:
> > Gregory Price wrote:
> > > The CXL Fixed Memory Window allows for memory aligned down to the
> > > size of 256MB.  However, by default on x86, memory blocks increase
> > > in size as total System RAM capacity increases. On x86, this caps
> > > out at 2G when 64GB of System RAM is reached.
> > > 
> > > When the CFMWS regions are not aligned to memory block size, this
> > > results in lost capacity on either side of the alignment.
> > > 
> > > Parse all CFMWS to detect the largest common denominator among all
> > > regions, and reduce the block size accordingly.
> > > 
> > > This can only be done when MEMORY_HOTPLUG and SPARSEMEM configs are
> > > enabled, but the surrounding code may not necessarily require these
> > > configs, so build accordingly.
> > > 
> > > Suggested-by: Dan Williams <dan.j.williams@intel.com>
> > > Signed-off-by: Gregory Price <gourry@gourry.net>
> > > ---
[..]
> > To help address David's comment as well;
> > 
> > Is there a way to scan all the alignments of the windows and pass the
> > desired alignment to the arch in a new call and have the arch determine if
> > changing the order is ok?
> > 
> 
> At least on x86, it's only OK during init, so it would probably look like
> setting a static bit (like the global value in x86) and just refusing to
> update once it is locked.
> 
> I could implement that on the x86 side as an example.
> 
> FWIW: this was Dan's suggestion (quoting discord, sorry Dan!)
> ```
>     I am assuming we would call it here
>         drivers/acpi/numa/srat.c::acpi_parse_cfmws()
>     which should be before page-allocator init
> ```
> 
> It's only safe before page-allocator init (i.e. before blocks start getting
> populated and used), and this area occurs before that.

I will note though that drivers/acpi/numa/srat.c is always built-in, so
there is no need for set_memory_block_size_order() to be EXPORT_SYMBOL
for modules to play with, just an extern for NUMA init to access.
Ira Weiny Oct. 8, 2024, 7:02 p.m. UTC | #4
Gregory Price wrote:
> On Tue, Oct 08, 2024 at 09:58:35AM -0500, Ira Weiny wrote:
> > Gregory Price wrote:
> > > The CXL Fixed Memory Window allows for memory aligned down to the
> > > size of 256MB.  However, by default on x86, memory blocks increase
> > > in size as total System RAM capacity increases. On x86, this caps
> > > out at 2G when 64GB of System RAM is reached.
> > > 
> > > When the CFMWS regions are not aligned to memory block size, this
> > > results in lost capacity on either side of the alignment.
> > > 
> > > Parse all CFMWS to detect the largest common denominator among all
> > > regions, and reduce the block size accordingly.
> > > 
> > > This can only be done when MEMORY_HOTPLUG and SPARSEMEM configs are
> > > enabled, but the surrounding code may not necessarily require these
> > > configs, so build accordingly.
> > > 
> > > Suggested-by: Dan Williams <dan.j.williams@intel.com>
> > > Signed-off-by: Gregory Price <gourry@gourry.net>
> > > ---
> > >  drivers/acpi/numa/srat.c | 48 ++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 48 insertions(+)
> > > 
> > > diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
> > > index 44f91f2c6c5d..9367d36eba9a 100644
> > > --- a/drivers/acpi/numa/srat.c
> > > +++ b/drivers/acpi/numa/srat.c
> > > @@ -14,6 +14,7 @@
> > >  #include <linux/errno.h>
> > >  #include <linux/acpi.h>
> > >  #include <linux/memblock.h>
> > > +#include <linux/memory.h>
> > >  #include <linux/numa.h>
> > >  #include <linux/nodemask.h>
> > >  #include <linux/topology.h>
> > > @@ -333,6 +334,37 @@ acpi_parse_memory_affinity(union acpi_subtable_headers *header,
> > >  	return 0;
> > >  }
> > >  
> > > +#if defined(CONFIG_MEMORY_HOTPLUG)
> > 
> > Generally we avoid config defines in *.c files...  See more below.
> > 
> > > +/*
> > > + * CXL allows CFMW to be aligned along 256MB boundaries, but large memory
> > > + * systems default to larger alignments (2GB on x86). Misalignments can
> > > + * cause some capacity to become unreachable. Calculate the largest supported
> > > + * alignment for all CFMW to maximize the amount of mappable capacity.
> > > + */
> > > +static int __init acpi_align_cfmws(union acpi_subtable_headers *header,
> > > +				   void *arg, const unsigned long table_end)
> > > +{
> > > +	struct acpi_cedt_cfmws *cfmws = (struct acpi_cedt_cfmws *)header;
> > > +	u64 start = cfmws->base_hpa;
> > > +	u64 size = cfmws->window_size;
> > > +	unsigned long *fin_bz = arg;
> > > +	unsigned long bz;
> > > +
> > > +	for (bz = SZ_64T; bz >= SZ_256M; bz >>= 1) {
> > > +		if (IS_ALIGNED(start, bz) && IS_ALIGNED(size, bz))
> > > +			break;
> > > +	}
> > > +
> > > +	/* Only adjust downward, we never want to increase block size */
> > > +	if (bz < *fin_bz && bz >= SZ_256M)
> > > +		*fin_bz = bz;
> > > +	else if (bz < SZ_256M)
> > > +		pr_err("CFMWS: [BIOS BUG] base/size alignment violates spec\n");
> > > +
> > > +	return 0;
> > > +}
> > > +#endif /* defined(CONFIG_MEMORY_HOTPLUG) */
> > > +
> > >  static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
> > >  				   void *arg, const unsigned long table_end)
> > >  {
> > > @@ -501,6 +533,10 @@ acpi_table_parse_srat(enum acpi_srat_type id,
> > >  int __init acpi_numa_init(void)
> > >  {
> > >  	int i, fake_pxm, cnt = 0;
> > > +#if defined(CONFIG_MEMORY_HOTPLUG)
> > > +	unsigned long block_sz = memory_block_size_bytes();
> > 
> > To help address David's comment as well;
> > 
> > Is there a way to scan all the alignments of the windows and pass the
> > desired alignment to the arch in a new call and have the arch determine if
> > changing the order is ok?
> > 
> 
> At least on x86, it's only OK during init, so it would probably look like
> setting a static bit (like the global value in x86) and just refusing to
> update once it is locked.
> 
> I could implement that on the x86 side as an example.
> 
> FWIW: this was Dan's suggestion (quoting discord, sorry Dan!)
> ```
>     I am assuming we would call it here
>         drivers/acpi/numa/srat.c::acpi_parse_cfmws()
>     which should be before page-allocator init
> ```
> 
> It's only safe before page-allocator init (i.e. before blocks start getting
> populated and used), and this area occurs before that.

Right I was expecting the call to the arch to be here.

> 
> 
> > Also the call to the arch would be a noop for !CONFIG_MEMORY_HOTPLUG which
> > cleans up this function WRT CONFIG_MEMORY_HOTPLUG.
> > 
> > Ira
> >
> 
> The ifdefs are a nasty result of the HOTPLUG and SPARSEMEM configs
> being, from my purview, horrendously inconsistent throughout the system.

:-(

> 
> As an example, MIN_MEMORY_BLOCK_SIZE depends on SECTION_SIZE_BITS
> which on some architectures is dependent on CONFIG_SPARSEMEM, and
> on others is defined unconditionally.  Compound this with memblock
> usage appearing to imply CONFIG_MEMORY_HOTPLUG which implies
> CONFIG_SPARSEMEM (see drivers/base/memory.c) - but mm/memblock.c
> makes no such assumption.
> 
> The result of this is that if you extern these functions and build
> x86 with each combination of HOTPLUG/SPARSEMEM on/off, it builds - but
> loongarch (and others) fail to build because SECTION_SIZE_BITS doesn't
> get defined in some configurations.
> 
> It's not clear if removing those ifdefs from those archs is "correct"
> (for some definition of correct) and I didn't want to increase scope.
> 
> So it's really not clear how to wire this all up.
> 
> I spent the better part of a week trying to detangle this mess just
> to get things building successfully in LKP and decided to just add the
> ifdefs to get it out and get opinions on the issue :[

Yeah, sometimes it is best to throw things out there.  An idea I have is to
have something like CONFIG_ARCH_HAS_ADVISE_ALIGNMENT which is only defined
for x86.  Other archs get a default which is a noop.

So the code would be something like:

	unsigned long long cfmw_align = SZ_64T;

	/* Find largest CXL window alignment */
	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, max_cfmw_align,
			      &cfmw_align);

	/* name/interface TBD */
	if (arch_advise_alignment(cfmw_align))
		pr_warn("...", cfmw_align);

This would set the alignment up early in init if the arch allows for it.
Because it is an 'advise' call (again name TBD) it does not mean anything
is set for sure.
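
One way the no-op default could be wired up is the usual header-stub pattern, sketched here in user-space C. `CONFIG_ARCH_HAS_ADVISE_ALIGNMENT` and `arch_advise_alignment()` are the placeholder names from the snippet above; `common_window_align()` is a hypothetical helper, and per-window alignments are assumed to be powers of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SZ_64T (1ULL << 46)

/*
 * Hypothetical stub pattern: an arch that opts in would define
 * CONFIG_ARCH_HAS_ADVISE_ALIGNMENT and provide its own implementation.
 */
#ifdef CONFIG_ARCH_HAS_ADVISE_ALIGNMENT
int arch_advise_alignment(uint64_t align);
#else
static inline int arch_advise_alignment(uint64_t align)
{
	(void)align;
	return 0; /* no-op: nothing to adjust, nothing to report */
}
#endif

/*
 * Smallest per-window alignment. With power-of-two alignments this is
 * the largest alignment that every window satisfies.
 */
static uint64_t common_window_align(const uint64_t *aligns, size_t n)
{
	uint64_t best = SZ_64T;
	size_t i;

	assert(aligns != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (aligns[i] < best)
			best = aligns[i];
	}
	return best;
}
```

With the stub in a header, the caller needs no `#if` at all: on architectures that do not opt in, the call compiles to nothing and always reports success.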

FWIW I am sure that David and Dan have better ideas.  Just trying to help.

Ira

>  
> > > +	unsigned long cfmw_align = block_sz;
> > > +#endif /* defined(CONFIG_MEMORY_HOTPLUG) */
> > >  
> > >  	if (acpi_disabled)
> > >  		return -EINVAL;
> > > @@ -552,6 +588,18 @@ int __init acpi_numa_init(void)
> > >  	}
> > >  	last_real_pxm = fake_pxm;
> > >  	fake_pxm++;
> > > +
> > > +#if defined(CONFIG_MEMORY_HOTPLUG)
> > > +	/* Calculate and set largest supported memory block size alignment */
> > > +	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_align_cfmws,
> > > +			      &cfmw_align);
> > > +	if (cfmw_align < block_sz && cfmw_align >= SZ_256M) {
> > > +		if (set_memory_block_size_order(ffs(cfmw_align)-1))
> > > +			pr_warn("CFMWS: Unable to adjust memory block size\n");
> > > +	}
> > > +#endif /* defined(CONFIG_MEMORY_HOTPLUG) */
> > > +
> > > +	/* Then parse and fill the numa nodes with the described memory */
> > >  	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
> > >  			      &fake_pxm);
> > >  
> > > -- 
> > > 2.43.0
> > > 
> > > 
> > 
> > 
>
David Hildenbrand Oct. 14, 2024, 11:50 a.m. UTC | #5
On 08.10.24 18:46, Dan Williams wrote:
> Gregory Price wrote:
>> On Tue, Oct 08, 2024 at 09:58:35AM -0500, Ira Weiny wrote:
>>> Gregory Price wrote:
>>>> The CXL Fixed Memory Window allows for memory aligned down to the
>>>> size of 256MB.  However, by default on x86, memory blocks increase
>>>> in size as total System RAM capacity increases. On x86, this caps
>>>> out at 2G when 64GB of System RAM is reached.
>>>>
>>>> When the CFMWS regions are not aligned to memory block size, this
>>>> results in lost capacity on either side of the alignment.
>>>>
>>>> Parse all CFMWS to detect the largest common denominator among all
>>>> regions, and reduce the block size accordingly.
>>>>
>>>> This can only be done when MEMORY_HOTPLUG and SPARSEMEM configs are
>>>> enabled, but the surrounding code may not necessarily require these
>>>> configs, so build accordingly.
>>>>
>>>> Suggested-by: Dan Williams <dan.j.williams@intel.com>
>>>> Signed-off-by: Gregory Price <gourry@gourry.net>
>>>> ---
> [..]
>>> To help address David's comment as well;
>>>
>>> Is there a way to scan all the alignments of the windows and pass the
>>> desired alignment to the arch in a new call and have the arch determine if
>>> changing the order is ok?
>>>
>>
>> At least on x86, it's only OK during init, so it would probably look like
>> setting a static bit (like the global value in x86) and just refusing to
>> update once it is locked.
>>
>> I could implement that on the x86 side as an example.
>>
>> FWIW: this was Dan's suggestion (quoting discord, sorry Dan!)
>> ```
>>      I am assuming we would call it here
>>          drivers/acpi/numa/srat.c::acpi_parse_cfmws()
>>      which should be before page-allocator init
>> ```
>>
>> It's only safe before page-allocator init (i.e. before blocks start getting
>> populated and used), and this area occurs before that.

Sorry for the late reply. It must also be called before 
memory_dev_init(), which happens after the buddy is up IIRC.

> 
> I will note though that drivers/acpi/numa/srat.c is always built-in, so
> there is no need for set_memory_block_size_order() to be EXPORT_SYMBOL
> for modules to play with, just an extern for NUMA init to access.

That's the magic piece I was missing.

Because it didn't make much sense to me that a call to this function
would ever make sense after modules were loaded.

This really only works so early during boot that modules are not
loaded yet.

So this patch here should be dropped.

Patch

diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
index 44f91f2c6c5d..9367d36eba9a 100644
--- a/drivers/acpi/numa/srat.c
+++ b/drivers/acpi/numa/srat.c
@@ -14,6 +14,7 @@ 
 #include <linux/errno.h>
 #include <linux/acpi.h>
 #include <linux/memblock.h>
+#include <linux/memory.h>
 #include <linux/numa.h>
 #include <linux/nodemask.h>
 #include <linux/topology.h>
@@ -333,6 +334,37 @@  acpi_parse_memory_affinity(union acpi_subtable_headers *header,
 	return 0;
 }
 
+#if defined(CONFIG_MEMORY_HOTPLUG)
+/*
+ * CXL allows CFMW to be aligned along 256MB boundaries, but large memory
+ * systems default to larger alignments (2GB on x86). Misalignments can
+ * cause some capacity to become unreachable. Calculate the largest supported
+ * alignment for all CFMW to maximize the amount of mappable capacity.
+ */
+static int __init acpi_align_cfmws(union acpi_subtable_headers *header,
+				   void *arg, const unsigned long table_end)
+{
+	struct acpi_cedt_cfmws *cfmws = (struct acpi_cedt_cfmws *)header;
+	u64 start = cfmws->base_hpa;
+	u64 size = cfmws->window_size;
+	unsigned long *fin_bz = arg;
+	unsigned long bz;
+
+	for (bz = SZ_64T; bz >= SZ_256M; bz >>= 1) {
+		if (IS_ALIGNED(start, bz) && IS_ALIGNED(size, bz))
+			break;
+	}
+
+	/* Only adjust downward, we never want to increase block size */
+	if (bz < *fin_bz && bz >= SZ_256M)
+		*fin_bz = bz;
+	else if (bz < SZ_256M)
+		pr_err("CFMWS: [BIOS BUG] base/size alignment violates spec\n");
+
+	return 0;
+}
+#endif /* defined(CONFIG_MEMORY_HOTPLUG) */
+
 static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
 				   void *arg, const unsigned long table_end)
 {
@@ -501,6 +533,10 @@  acpi_table_parse_srat(enum acpi_srat_type id,
 int __init acpi_numa_init(void)
 {
 	int i, fake_pxm, cnt = 0;
+#if defined(CONFIG_MEMORY_HOTPLUG)
+	unsigned long block_sz = memory_block_size_bytes();
+	unsigned long cfmw_align = block_sz;
+#endif /* defined(CONFIG_MEMORY_HOTPLUG) */
 
 	if (acpi_disabled)
 		return -EINVAL;
@@ -552,6 +588,18 @@  int __init acpi_numa_init(void)
 	}
 	last_real_pxm = fake_pxm;
 	fake_pxm++;
+
+#if defined(CONFIG_MEMORY_HOTPLUG)
+	/* Calculate and set largest supported memory block size alignment */
+	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_align_cfmws,
+			      &cfmw_align);
+	if (cfmw_align < block_sz && cfmw_align >= SZ_256M) {
+		if (set_memory_block_size_order(ffs(cfmw_align)-1))
+			pr_warn("CFMWS: Unable to adjust memory block size\n");
+	}
+#endif /* defined(CONFIG_MEMORY_HOTPLUG) */
+
+	/* Then parse and fill the numa nodes with the described memory */
 	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
 			      &fake_pxm);
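
For reference, the `ffs(cfmw_align)-1` expression in the last hunk converts a power-of-two byte size into a shift order, since `ffs()` returns the 1-based index of the lowest set bit. A portable user-space equivalent, with the hypothetical name `size_to_order()`, might look like this:

```c
#include <assert.h>
#include <stdint.h>

/* log2 of a power-of-two size, e.g. 256MB -> 28. */
static int size_to_order(uint64_t size)
{
	int order = 0;

	/* Only meaningful for a non-zero power of two. */
	assert(size != 0 && (size & (size - 1)) == 0);
	while ((size >>= 1) != 0)
		order++;
	return order;
}
```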