[0/3] memory,acpi: resize memory blocks based on CFMW alignment

Message ID 20241008044355.4325-1-gourry@gourry.net


Gregory Price Oct. 8, 2024, 4:43 a.m. UTC
When a physical address range is not aligned to the memory block size,
the misaligned portion is not mapped - creating an effective loss of
capacity.

This appears to be a calculated decision based on the fact that most
regions would generally be aligned, and the loss of capacity would be
relatively limited. With CXL devices, this is no longer the case.

CXL exposes its memory for management through the ACPI CEDT (CXL Early
Discovery Table) in entries called CXL Fixed Memory Window Structures
(CFMWS).  Per the CXL specification, these windows must be aligned to
at least 256MB.

On x86, the memory block size increases with the overall capacity of
the machine - eventually reaching a maximum of 2GB per memory block.
When a CFMWS is aligned only to 256MB, up to 2GB of capacity can be
lost per window - and with multiple windows, several GB in total.

It is also possible for multiple CFMWS to be exposed for a single device.
This can happen if a reserved region intersects the target memory
location of the memory device, as it does on AMD x86 platforms.

This patch set detects the alignment of each CFMWS in the ACPI CEDT,
and adjusts the memory block size downward to the largest power of two
that aligns with every supported memory region.

To do this, we needed 3 changes:
    1) extern memory block management functions for the acpi driver
    2) modify x86 to update its cached block size value
    3) add code in acpi/numa/srat.c to do the alignment check

Presently this only affects x86, and the mitigation is applied only
there, since it is the only architecture that implements
set_memory_block_size_order.

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Gregory Price <gourry@gourry.net>

Gregory Price (3):
  memory: extern memory_block_size_bytes and set_memory_block_size_order
  x86/mm: if memblock size is adjusted, update the cached value
  acpi,srat: reduce memory block size if CFMWS has a smaller alignment

 arch/x86/mm/init_64.c    | 17 ++++++++++++--
 drivers/acpi/numa/srat.c | 48 ++++++++++++++++++++++++++++++++++++++++
 drivers/base/memory.c    |  6 +++++
 include/linux/memory.h   |  4 ++--
 4 files changed, 71 insertions(+), 4 deletions(-)

Comments

Ira Weiny Oct. 8, 2024, 2:38 p.m. UTC | #1
Gregory Price wrote:
> When physical address capacity is not aligned to the size of a memory
> block managed size, the misaligned portion is not mapped - creating
> an effective loss of capacity.
> 
> This appears to be a calculated decision based on the fact that most
> regions would generally be aligned, and the loss of capacity would be
> relatively limited. With CXL devices, this is no longer the case.
> 
> CXL exposes its memory for management through the ACPI CEDT (CXL Early
> Detection Table) in a field called the CXL Fixed Memory Window.  Per
> the CXL specification, this memory must be aligned to at least 256MB.
> 
> On X86, memory block capacity increases based on the overall capacity
> of the machine - eventually reaching a maximum of 2GB per memory block.
> When a CFMW aligns on 256MB, this causes a loss of at least 2GB of
> capacity, and in some cases more.
> 
> It is also possible for multiple CFMW to be exposed for a single device.
> This can happen if a reserved region intersects with the target memory
> location of the memory device. This happens on AMD x86 platforms. 

I'm not clear why you mention reserved regions here.  IIUC CFMW's can
overlap to describe different attributes which may be utilized based on
the devices which are mapped within them.  For this reason, all CFMW's
must be scanned to find the lowest common denominator even if the HPA
range has already been evaluated.

Is that what you are trying to say?

> 
> This patch set detects the alignments of all CFMW in the ACPI CEDT,
> and changes the memory block size downward to meet the largest common
> denominator of the supported memory regions.
> 
> To do this, we needed 3 changes:
>     1) extern memory block management functions for the acpi driver
>     2) modify x86 to update its cached block size value
>     3) add code in acpi/numa/srat.c to do the alignment check
> 
> Presently this only affects x86, since this is the only architecture
> that implements set_memory_block_size_order.
> 
> Presently this appears to only affect x86, and we only mitigated there
> since it is the only arch to implement set_memory_block_size_order.

NIT : duplicate statement

Ira
Gregory Price Oct. 8, 2024, 2:49 p.m. UTC | #2
On Tue, Oct 08, 2024 at 09:38:35AM -0500, Ira Weiny wrote:
> Gregory Price wrote:
> > When physical address capacity is not aligned to the size of a memory
> > block managed size, the misaligned portion is not mapped - creating
> > an effective loss of capacity.
> > 
> > This appears to be a calculated decision based on the fact that most
> > regions would generally be aligned, and the loss of capacity would be
> > relatively limited. With CXL devices, this is no longer the case.
> > 
> > CXL exposes its memory for management through the ACPI CEDT (CXL Early
> > Detection Table) in a field called the CXL Fixed Memory Window.  Per
> > the CXL specification, this memory must be aligned to at least 256MB.
> > 
> > On X86, memory block capacity increases based on the overall capacity
> > of the machine - eventually reaching a maximum of 2GB per memory block.
> > When a CFMW aligns on 256MB, this causes a loss of at least 2GB of
> > capacity, and in some cases more.
> > 
> > It is also possible for multiple CFMW to be exposed for a single device.
> > This can happen if a reserved region intersects with the target memory
> > location of the memory device. This happens on AMD x86 platforms. 
> 
> I'm not clear why you mention reserved regions here.  IIUC CFMW's can
> overlap to describe different attributes which may be utilized based on
> the devices which are mapped within them.  For this reason, all CFMW's
> must be scanned to find the lowest common denominator even if the HPA
> range has already been evaluated.
> 
> Is that what you are trying to say?
>

On AMD systems, depending on the capacity, it is possible for a single
memory expander to be represented by multiple CFMW.

An example: there are two memory expanders with 256GB total capacity.

[    0.000000] BIOS-e820: [mem 0x000000c050000000-0x000000fcffffffff] soft reserved
[    0.000000] BIOS-e820: [mem 0x000000fd00000000-0x000000ffffffffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000010000000000-0x000001034fffffff] soft reserved

[0A4h 0164   1]                Subtable Type : 01 [CXL Fixed Memory Window Structure]
[0A5h 0165   1]                     Reserved : 00
[0A6h 0166   2]                       Length : 0028
[0A8h 0168   4]                     Reserved : 00000000
[0ACh 0172   8]          Window base address : 000000C050000000
[0B4h 0180   8]                  Window size : 0000002000000000
[0BCh 0188   1]     Interleave Members (2^n) : 00
[0BDh 0189   1]        Interleave Arithmetic : 00
[0BEh 0190   2]                     Reserved : 0000
[0C0h 0192   4]                  Granularity : 00000000
[0C4h 0196   2]                 Restrictions : 0006
[0C6h 0198   2]                        QtgId : 0001
[0C8h 0200   4]                 First Target : 00000007

[0CCh 0204   1]                Subtable Type : 01 [CXL Fixed Memory Window Structure]
[0CDh 0205   1]                     Reserved : 00
[0CEh 0206   2]                       Length : 0028
[0D0h 0208   4]                     Reserved : 00000000
[0D4h 0212   8]          Window base address : 000000E050000000
[0DCh 0220   8]                  Window size : 0000001CB0000000
[0E4h 0228   1]     Interleave Members (2^n) : 00
[0E5h 0229   1]        Interleave Arithmetic : 00
[0E6h 0230   2]                     Reserved : 0000
[0E8h 0232   4]                  Granularity : 00000000
[0ECh 0236   2]                 Restrictions : 0006
[0EEh 0238   2]                        QtgId : 0002
[0F0h 0240   4]                 First Target : 00000006

[0F4h 0244   1]                Subtable Type : 01 [CXL Fixed Memory Window Structure]
[0F5h 0245   1]                     Reserved : 00
[0F6h 0246   2]                       Length : 0028
[0F8h 0248   4]                     Reserved : 00000000
[0FCh 0252   8]          Window base address : 0000010000000000
[104h 0260   8]                  Window size : 0000000350000000
[10Ch 0268   1]     Interleave Members (2^n) : 00
[10Dh 0269   1]        Interleave Arithmetic : 00
[10Eh 0270   2]                     Reserved : 0000
[110h 0272   4]                  Granularity : 00000000
[114h 0276   2]                 Restrictions : 0006
[116h 0278   2]                        QtgId : 0003
[118h 0280   4]                 First Target : 00000006

Note that there are two soft reserved regions, but 3 CFMWS.  This is
because the first device is contained entirely within the first region,
and the second device is split across the first and the second.

The reserved space at the top of the 1TB memory region is reserved by
the system for something else - similar, I imagine, to the PCI hole at
the top of 4GB.

The e820 entries are not aligned to 2GB - so you lose capacity right
off the bat. Then, when you go to map the windows, you're met with more
bases and lengths not aligned to 2GB, which results in even further
loss of usable capacity.

> > 
> > This patch set detects the alignments of all CFMW in the ACPI CEDT,
> > and changes the memory block size downward to meet the largest common
> > denominator of the supported memory regions.
> > 
> > To do this, we needed 3 changes:
> >     1) extern memory block management functions for the acpi driver
> >     2) modify x86 to update its cached block size value
> >     3) add code in acpi/numa/srat.c to do the alignment check
> > 
> > Presently this only affects x86, since this is the only architecture
> > that implements set_memory_block_size_order.
> > 
> > Presently this appears to only affect x86, and we only mitigated there
> > since it is the only arch to implement set_memory_block_size_order.
> 
> NIT : duplicate statement
> 
> Ira