[0/5] Manual definition of Soft Reserved memory devices

Message ID: 158318759687.2216124.4684754859068906007.stgit@dwillia2-desk3.amr.corp.intel.com

Message

Dan Williams March 2, 2020, 10:19 p.m. UTC
Given the current dearth of systems that supply an ACPI HMAT table, and
the utility of being able to manually define device-dax "hmem" instances
via the efi_fake_mem= option, relax the requirements for creating these
devices. Specifically, add an option (numa=nohmat) to optionally disable
consideration of the HMAT and update efi_fake_mem= to behave like
memmap=nn!ss in terms of delimiting device boundaries.
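
For readers unfamiliar with the option, the following is a minimal sketch of how an efi_fake_mem= value breaks down into ranges, assuming the nn[KMG]@ss[KMG]:aa[,...] syntax (size, start address, attribute) and assuming that attribute bit 0x40000 is the "specific purpose" (soft reserved) attribute. The names parse_efi_fake_mem and FakeMemRange are hypothetical helpers for illustration, not kernel code.

```python
import re
from typing import List, NamedTuple

# Assumption: 0x40000 is the EFI "specific purpose" (soft reserved) attribute.
EFI_MEMORY_SP = 0x40000

_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
# One entry: <size>[KMG]@<start>[KMG]:<attribute>
_ENTRY = re.compile(r"^(\d+)([KMG]?)@(\d+)([KMG]?):(0x[0-9a-fA-F]+|\d+)$")


class FakeMemRange(NamedTuple):
    start: int  # first byte of the range
    size: int   # length in bytes
    attr: int   # EFI memory attribute bits to apply

    @property
    def end(self) -> int:
        """Last byte covered by the range (inclusive)."""
        return self.start + self.size - 1


def parse_efi_fake_mem(value: str) -> List[FakeMemRange]:
    """Split a comma-separated efi_fake_mem= value into ranges."""
    ranges = []
    for entry in value.split(","):
        m = _ENTRY.match(entry.strip())
        if m is None:
            raise ValueError(f"malformed efi_fake_mem entry: {entry!r}")
        size = int(m.group(1)) * _UNITS[m.group(2)]
        start = int(m.group(3)) * _UNITS[m.group(4)]
        text = m.group(5)
        attr = int(text, 16) if text.startswith("0x") else int(text)
        ranges.append(FakeMemRange(start, size, attr))
    return ranges


if __name__ == "__main__":
    for r in parse_efi_fake_mem("4G@9G:0x40000,4G@13G:0x40000"):
        soft = bool(r.attr & EFI_MEMORY_SP)
        print(f"{r.start:#x}-{r.end:#x} soft_reserved={soft}")
```

Each comma-separated entry is kept as its own range here, which mirrors the intent of treating every efi_fake_mem= instance as a separate device boundary.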

All review welcome of course, but the E820 changes want an x86
maintainer ack, the efi_fake_mem update needs Ard, and Rafael has
previously shepherded the HMAT changes. For the changes to
kernel/resource.c, where there is no clear maintainer, I just copied the
last few people to make thoughtful changes in that area. I am happy to
take these through the nvdimm tree along with these prerequisites
already in -next:

b2ca916ce392 ACPI: NUMA: Up-level "map to online node" functionality
4fcbe96e4d0b mm/numa: Skip NUMA_NO_NODE and online nodes in numa_map_to_online_node()
575e23b6e13c powerpc/papr_scm: Switch to numa_map_to_online_node()
1e5d8e1e47af x86/mm: Introduce CONFIG_NUMA_KEEP_MEMINFO
5d30f92e7631 x86/NUMA: Provide a range-to-target_node lookup facility
7b27a8622f80 libnvdimm/e820: Retrieve and populate correct 'target_node' info

Tested with:

        numa=nohmat efi_fake_mem=4G@9G:0x40000,4G@13G:0x40000

...to create two device-dax instances:

	# daxctl list -RDu
	[
	  {
	    "path":"\/platform\/hmem.1",
	    "id":1,
	    "size":"4.00 GiB (4.29 GB)",
	    "align":2097152,
	    "devices":[
	      {
	        "chardev":"dax1.0",
	        "size":"4.00 GiB (4.29 GB)",
	        "target_node":3,
	        "mode":"devdax"
	      }
	    ]
	  },
	  {
	    "path":"\/platform\/hmem.0",
	    "id":0,
	    "size":"4.00 GiB (4.29 GB)",
	    "align":2097152,
	    "devices":[
	      {
	        "chardev":"dax0.0",
	        "size":"4.00 GiB (4.29 GB)",
	        "target_node":2,
	        "mode":"devdax"
	      }
	    ]
	  }
	]
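
The two ranges in the test command line are physically adjacent (9G..13G and 13G..17G), which is where device-boundary delimiting matters. The sketch below is an illustrative model only, not the kernel's resource handling: it shows that naively coalescing touching ranges would give one 8 GiB region, whereas keeping one entry per instance gives the two 4 GiB devices reported above. The helper names merge_adjacent and human are hypothetical.

```python
GIB = 1 << 30


def merge_adjacent(ranges):
    """Coalesce (start, size) ranges that touch end-to-start."""
    merged = []
    for start, size in sorted(ranges):
        if merged and merged[-1][0] + merged[-1][1] == start:
            prev_start, prev_size = merged[-1]
            merged[-1] = (prev_start, prev_size + size)
        else:
            merged.append((start, size))
    return merged


def human(size):
    """Format a byte count in the 'GiB (GB)' style used by the listing above."""
    return f"{size / GIB:.2f} GiB ({size / 10**9:.2f} GB)"


# The two ranges from the test command line: 4G@9G and 4G@13G.
ranges = [(9 * GIB, 4 * GIB), (13 * GIB, 4 * GIB)]

if __name__ == "__main__":
    print("coalesced:   ", [human(size) for _, size in merge_adjacent(ranges)])
    print("per instance:", [human(size) for _, size in ranges])
    print("align 2097152 bytes is 2 MiB:", 2097152 == 2 << 20)
```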

---

Dan Williams (5):
      ACPI: NUMA: Add 'nohmat' option
      efi/fake_mem: Arrange for a resource entry per efi_fake_mem instance
      ACPI: HMAT: Refactor hmat_register_target_device to hmem_register_device
      resource: Report parent to walk_iomem_res_desc() callback
      ACPI: HMAT: Attach a device for each soft-reserved range


 arch/x86/kernel/e820.c              |   16 +++++-
 arch/x86/mm/numa.c                  |    4 +
 drivers/acpi/numa/hmat.c            |   71 +++-----------------------
 drivers/dax/Kconfig                 |    5 ++
 drivers/dax/Makefile                |    3 -
 drivers/dax/hmem/Makefile           |    6 ++
 drivers/dax/hmem/device.c           |   97 +++++++++++++++++++++++++++++++++++
 drivers/dax/hmem/hmem.c             |    2 -
 drivers/firmware/efi/x86_fake_mem.c |   12 +++-
 include/acpi/acpi_numa.h            |    1 
 include/linux/dax.h                 |    8 +++
 kernel/resource.c                   |    1 
 12 files changed, 156 insertions(+), 70 deletions(-)
 create mode 100644 drivers/dax/hmem/Makefile
 create mode 100644 drivers/dax/hmem/device.c
 rename drivers/dax/{hmem.c => hmem/hmem.c} (98%)

base-commit: 7b27a8622f802761d5c6abd6c37b22312a35343c

Comments

Jeff Moyer March 6, 2020, 8:07 p.m. UTC | #1
Dan Williams <dan.j.williams@intel.com> writes:

> Given the current dearth of systems that supply an ACPI HMAT table, and
> the utility of being able to manually define device-dax "hmem" instances
> via the efi_fake_mem= option, relax the requirements for creating these
> devices. Specifically, add an option (numa=nohmat) to optionally disable
> consideration of the HMAT and update efi_fake_mem= to behave like
> memmap=nn!ss in terms of delimiting device boundaries.

So, am I correct in deducing that your primary motivation is testing
without hardware/firmware support?  This looks like a bit of a hack to
me, and I think maybe it would be better to just emulate the HMAT using
qemu.  I don't have a strong objection, though.

-Jeff

Dan Williams March 6, 2020, 9:05 p.m. UTC | #2
On Fri, Mar 6, 2020 at 12:07 PM Jeff Moyer <jmoyer@redhat.com> wrote:
>
> Dan Williams <dan.j.williams@intel.com> writes:
>
> > Given the current dearth of systems that supply an ACPI HMAT table, and
> > the utility of being able to manually define device-dax "hmem" instances
> > via the efi_fake_mem= option, relax the requirements for creating these
> > devices. Specifically, add an option (numa=nohmat) to optionally disable
> > consideration of the HMAT and update efi_fake_mem= to behave like
> > memmap=nn!ss in terms of delimiting device boundaries.
>
> So, am I correct in deducing that your primary motivation is testing
> without hardware/firmware support?

My primary motivation is making the dax_kmem facility useful to
shipping platforms that have performance differentiated memory, but
may not have EFI-defined soft-reservations / HMAT (or
non-EFI-ACPI-platform equivalent). I'm anticipating HMAT enabled
platforms where the platform firmware policy for what is
soft-reserved, or not, is not the policy the system owner would pick.
I'd also highlight Joao's work [1] (see the TODO section) as an
indication of the demand for custom carving of memory resources and
for applying the device-dax memory management interface to them.

> This looks like a bit of a hack to
> me, and I think maybe it would be better to just emulate the HMAT using
> qemu.  I don't have a strong objection, though.

Yeah, qemu emulation does not help when you, the system owner, have a
different use case than what the bare-metal platform-firmware
envisioned for "specific-purpose memory".

[1]: https://lore.kernel.org/lkml/20200110190313.17144-1-joao.m.martins@oracle.com/