[00/16] Memory Hierarchy: Enable target node lookups for reserved memory

Message ID 157309899529.1582359.15358067933360719580.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)

Message

Dan Williams Nov. 7, 2019, 3:56 a.m. UTC
Yes, this patch series looks like a pile of boring libnvdimm cleanups,
but buried at the end are some small gems that testing with libnvdimm
uncovered. These gems will prove more valuable over time for Memory
Hierarchy management as more platforms, via the ACPI HMAT and EFI
Specific Purpose Memory, publish reserved or "soft-reserved" ranges to
Linux. Linux system administrators will expect to be able to interact
with those ranges with a unique numa node number when/if that memory is
onlined via the dax_kmem driver [1].

One configuration that currently fails to properly convey the target
node for the resulting memory hotplug operation is persistent memory
defined by the memmap=nn!ss parameter. For example, today if node1 is a
memory-only node, and all the memory from node1 is specified to
memmap=nn!ss and subsequently onlined, it will end up being onlined as
node0 memory. As it stands, memory_add_physaddr_to_nid() can only
identify online nodes, and since node1 in this example has no online
cpus / memory, the target node is initialized to node0.
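
To make that failure mode concrete, here is an illustrative sketch (not
the in-tree implementation) of a lookup that only consults online nodes;
range_on_node() is a hypothetical helper for the example:

    /*
     * Illustrative sketch only: a physaddr-to-nid lookup that walks
     * only online nodes cannot resolve a range owned by a memory-less
     * node and falls back to node 0.
     */
    static int addr_to_online_nid(u64 start)
    {
            int nid;

            for_each_online_node(nid)
                    if (range_on_node(nid, start)) /* hypothetical helper */
                            return nid;

            /* node1 is memory-less / offline, so memmap=nn!ss lands here */
            return 0;
    }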

The fix is to preserve rather than discard the numa_meminfo entries that
are relevant for reserved memory ranges, and to uplevel the node
distance helper for determining the "local" (closest) node relative to
an initiator node.
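
Roughly, the up-leveled distance helper ends up looking like the sketch
below; the exact name and placement may differ from patch 13 itself:

    /*
     * Rough sketch of the "map to closest online node" helper that
     * patch 13 up-levels.
     */
    int map_to_closest_online_node(int node)
    {
            int n, best_node = first_online_node;
            int dist, best_dist = INT_MAX;

            if (node == NUMA_NO_NODE || node_online(node))
                    return node;

            for_each_online_node(n) {
                    dist = node_distance(node, n);
                    if (dist < best_dist) {
                            best_dist = dist;
                            best_node = n;
                    }
            }

            return best_node;
    }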

The first 12 patches are cleanups to make sure that all nvdimm devices
and their children properly export a numa_node attribute. The switch to
a device-type results in less code and is less error prone.
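
For illustration, the shape of the conversion is roughly the following
(identifiers approximated, see the individual patches for the real ones):

    /*
     * Sketch of the device-type arrangement: the attribute groups hang
     * off the device_type, so every device registered with that type
     * exports them automatically.
     */
    static const struct attribute_group *nd_region_attribute_groups[] = {
            &nd_region_attribute_group,
            &nd_numa_attribute_group,
            NULL,
    };

    static const struct device_type nd_region_dev_type = {
            .name = "nd_region",
            .release = nd_region_release,
            .groups = nd_region_attribute_groups,
    };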

Patches 13 and 14 are the core changes (gems) that allow numa node
information for offline memory to be tracked.
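
Conceptually, the range-to-target_node lookup from patch 14 walks the
preserved numa_meminfo entries, along the lines of the sketch below (the
"numa_reserved_meminfo" name is an assumption here):

    /*
     * Conceptual sketch: report the node a reserved range belongs to,
     * whether or not that node is online.
     */
    int range_to_target_node(u64 start)
    {
            struct numa_meminfo *mi = &numa_reserved_meminfo; /* assumed name */
            int i;

            for (i = 0; i < mi->nr_blks; i++)
                    if (start >= mi->blk[i].start && start < mi->blk[i].end)
                            return mi->blk[i].nid;

            return NUMA_NO_NODE;
    }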

Patches 15 and 16 use this new capability to fix the conveyance of numa
node information for memmap=nn!ss assignments. See patch 16 for more
details.
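
The end result for the e820 / memmap=nn!ss case is roughly that the
region description carries both nodes; helper and field names below are
approximate:

    /*
     * Sketch of the e820 pmem registration after patches 15 and 16:
     * target_node is the node the range belongs to, numa_node the
     * closest online node.
     */
    static int e820_register_one(struct resource *res, void *data)
    {
            struct nd_region_desc ndr_desc = { };
            int nid = phys_to_target_node(res->start);

            ndr_desc.res = res;
            ndr_desc.target_node = nid;
            ndr_desc.numa_node = numa_map_to_online_node(nid);
            /* ...then register the pmem region as before... */
            return 0;
    }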

[1]: https://pmem.io/ndctl/daxctl-reconfigure-device.html

---

Dan Williams (16):
      libnvdimm: Move attribute groups to device type
      libnvdimm: Move region attribute group definition
      libnvdimm: Move nd_device_attribute_group to device_type
      libnvdimm: Move nd_numa_attribute_group to device_type
      libnvdimm: Move nd_region_attribute_group to device_type
      libnvdimm: Move nd_mapping_attribute_group to device_type
      libnvdimm: Move nvdimm_attribute_group to device_type
      libnvdimm: Move nvdimm_bus_attribute_group to device_type
      dax: Create a dax device_type
      dax: Simplify root read-only definition for the 'resource' attribute
      libnvdimm: Simplify root read-only definition for the 'resource' attribute
      dax: Add numa_node to the default device-dax attributes
      acpi/mm: Up-level "map to online node" functionality
      x86/numa: Provide a range-to-target_node lookup facility
      libnvdimm/e820: Drop the wrapper around memory_add_physaddr_to_nid
      libnvdimm/e820: Retrieve and populate correct 'target_node' info


 arch/powerpc/platforms/pseries/papr_scm.c |   25 ---
 arch/x86/mm/numa.c                        |   72 ++++++++-
 drivers/acpi/nfit/core.c                  |    7 -
 drivers/acpi/numa.c                       |   41 -----
 drivers/dax/bus.c                         |   22 ++-
 drivers/nvdimm/btt_devs.c                 |   24 +--
 drivers/nvdimm/bus.c                      |   15 +-
 drivers/nvdimm/core.c                     |    8 +
 drivers/nvdimm/dax_devs.c                 |   27 +--
 drivers/nvdimm/dimm_devs.c                |   30 ++--
 drivers/nvdimm/e820.c                     |   30 ----
 drivers/nvdimm/namespace_devs.c           |   77 +++++-----
 drivers/nvdimm/nd.h                       |    5 -
 drivers/nvdimm/of_pmem.c                  |   13 --
 drivers/nvdimm/pfn_devs.c                 |   38 ++---
 drivers/nvdimm/region_devs.c              |  235 +++++++++++++++--------------
 include/linux/acpi.h                      |   23 +++
 include/linux/libnvdimm.h                 |    7 -
 include/linux/memory_hotplug.h            |    6 +
 include/linux/numa.h                      |    2 
 mm/mempolicy.c                            |   30 ++++
 21 files changed, 382 insertions(+), 355 deletions(-)

Comments

Aneesh Kumar K.V Nov. 12, 2019, 11:42 a.m. UTC | #1
Dan Williams <dan.j.williams@intel.com> writes:

> Yes, this patch series looks like a pile of boring libnvdimm cleanups,
> but buried at the end are some small gems that testing with libnvdimm
> uncovered. These gems will prove more valuable over time for Memory
> Hierarchy management as more platforms, via the ACPI HMAT and EFI
> Specific Purpose Memory, publish reserved or "soft-reserved" ranges to
> Linux. Linux system administrators will expect to be able to interact
> with those ranges with a unique numa node number when/if that memory is
> onlined via the dax_kmem driver [1].
>
> One configuration that currently fails to properly convey the target
> node for the resulting memory hotplug operation is persistent memory
> defined by the memmap=nn!ss parameter. For example, today if node1 is a
> memory-only node, and all the memory from node1 is specified to
> memmap=nn!ss and subsequently onlined, it will end up being onlined as
> node0 memory. As it stands, memory_add_physaddr_to_nid() can only
> identify online nodes, and since node1 in this example has no online
> cpus / memory, the target node is initialized to node0.
>
> The fix is to preserve rather than discard the numa_meminfo entries that
> are relevant for reserved memory ranges, and to uplevel the node
> distance helper for determining the "local" (closest) node relative to
> an initiator node.
>
> The first 12 patches are cleanups to make sure that all nvdimm devices
> and their children properly export a numa_node attribute. The switch to
> a device-type results in less code and is less error prone.


Will this still allow a leaf driver to have platform-specific attributes
exposed via sysfs? Or do we want to still keep them in the nvdimm core
and control their visibility via the is_visible() callback?

-aneesh
Dan Williams Nov. 12, 2019, 7:37 p.m. UTC | #2
On Tue, Nov 12, 2019 at 3:43 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> Dan Williams <dan.j.williams@intel.com> writes:
>
> > Yes, this patch series looks like a pile of boring libnvdimm cleanups,
> > but buried at the end are some small gems that testing with libnvdimm
> > uncovered. These gems will prove more valuable over time for Memory
> > Hierarchy management as more platforms, via the ACPI HMAT and EFI
> > Specific Purpose Memory, publish reserved or "soft-reserved" ranges to
> > Linux. Linux system administrators will expect to be able to interact
> > with those ranges with a unique numa node number when/if that memory is
> > onlined via the dax_kmem driver [1].
> >
> > One configuration that currently fails to properly convey the target
> > node for the resulting memory hotplug operation is persistent memory
> > defined by the memmap=nn!ss parameter. For example, today if node1 is a
> > memory-only node, and all the memory from node1 is specified to
> > memmap=nn!ss and subsequently onlined, it will end up being onlined as
> > node0 memory. As it stands, memory_add_physaddr_to_nid() can only
> > identify online nodes, and since node1 in this example has no online
> > cpus / memory, the target node is initialized to node0.
> >
> > The fix is to preserve rather than discard the numa_meminfo entries that
> > are relevant for reserved memory ranges, and to uplevel the node
> > distance helper for determining the "local" (closest) node relative to
> > an initiator node.
> >
> > The first 12 patches are cleanups to make sure that all nvdimm devices
> > and their children properly export a numa_node attribute. The switch to
> > a device-type results in less code and is less error prone.
>
>
> Will this still allow a leaf driver to have platform-specific attributes
> exposed via sysfs? Or do we want to still keep them in the nvdimm core
> and control their visibility via the is_visible() callback?

The leaf driver can still have platform specific attributes, see:

    acpi_nfit_attribute_groups
    acpi_nfit_dimm_attribute_groups
    acpi_nfit_region_attribute_groups

...that still exist after this conversion. The conversion simply
arranges for those to be passed in without making the leaf driver also
responsible for specifying the core attributes.
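
For instance, a driver-private group can still filter individual
attributes with an is_visible() callback along these lines (the
identifiers below are made up for illustration, not taken from the nfit
driver):

    /*
     * Illustration only: a leaf-driver attribute group gating one of
     * its platform-specific attributes via is_visible().
     */
    static umode_t nfit_example_visible(struct kobject *kobj,
                    struct attribute *a, int n)
    {
            struct device *dev = kobj_to_dev(kobj);

            if (a == &dev_attr_example_platform.attr &&
                            !example_platform_cap(dev))
                    return 0; /* hide it where the capability is absent */

            return a->mode;
    }

    static const struct attribute_group nfit_example_group = {
            .attrs = nfit_example_attrs,
            .is_visible = nfit_example_visible,
    };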