[v4,00/23] device-dax: Support sub-dividing soft-reserved ranges

Message ID	159643094279.4062302.17779410714418721328.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)

Dan Williams Aug. 3, 2020, 5:02 a.m. UTC

Changes since v3 [1]:
- Update x86 boot options documentation for 'nohmat' (Randy)

- Fixup a handful of kbuild robot reports, the most significant being
  moving usage of PUD_SIZE and PMD_SIZE under
  #ifdef CONFIG_TRANSPARENT_HUGEPAGE protection.

[1]: http://lore.kernel.org/r/159625229779.3040297.11363509688097221416.stgit@dwillia2-desk3.amr.corp.intel.com

---
Merge notes:

Well, no v5.8-rc8 to line this up for v5.9, so next best is early
integration into -mm before other collisions develop.

Chatted with Justin offline and it currently appears that the missing
NUMA information is due to the platform firmware failing to populate all
the necessary NUMA data in the NFIT.

---
Cover:

The device-dax facility allows an address range to be directly mapped
through a chardev, or optionally hotplugged to the core kernel page
allocator as System-RAM. It is the mechanism for converting persistent
memory (pmem) to be used as another volatile memory pool i.e. the
current Memory Tiering hot topic on linux-mm.

In the case of pmem the nvdimm-namespace-label mechanism can sub-divide
it, but that labeling mechanism is not available / applicable to
soft-reserved ("EFI specific purpose") memory [3]. This series provides
a sysfs-mechanism for the daxctl utility to enable provisioning of
volatile-soft-reserved memory ranges.

The motivations for this facility are:

1/ Allow performance differentiated memory ranges to be split between
   kernel-managed and directly-accessed use cases.

2/ Allow physical memory to be provisioned along performance relevant
   address boundaries. For example, divide a memory-side cache [4] along
   cache-color boundaries.

3/ Parcel out soft-reserved memory to VMs using device-dax as a security
   / permissions boundary [5]. Specifically I have seen people (ab)using
   memmap=nn!ss (mark System-RAM as Persistent Memory) just to get the
   device-dax interface on custom address ranges. A follow-on for the VM
   use case is to teach device-dax to dynamically allocate 'struct page' at
   runtime to reduce the duplication of 'struct page' space in both the
   guest and the host kernel for the same physical pages.
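
As a rough sketch of how a provisioning tool might drive such a sysfs
interface: the helper below picks a size for a new instance by clamping
the request to what the region has free and rounding down to the device
alignment, then writes it out. The attribute names (a per-region
"available_size" and a per-device "size"), the example path and both
function names are assumptions made for illustration, not an interface
confirmed by this cover letter.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Clamp 'want' to 'avail' and round down to a multiple of 'align'.
 * 'align' must be a non-zero power of two (e.g. 2 MiB).
 */
static uint64_t dax_pick_size(uint64_t want, uint64_t avail, uint64_t align)
{
	assert(align && (align & (align - 1)) == 0);
	if (want > avail)
		want = avail;
	return want & ~(align - 1);
}

/*
 * Write a decimal value to a sysfs-style attribute file.
 * Returns 0 on success, -1 on any error.
 */
static int sysfs_write_u64(const char *path, uint64_t val)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%llu\n", (unsigned long long)val) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A caller would read the free space from the region, compute
dax_pick_size(request, free, align), and pass the result to
sysfs_write_u64() with a path such as the hypothetical
/sys/bus/dax/devices/dax0.1/size.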

[2]: http://lore.kernel.org/r/20200713160837.13774-11-joao.m.martins@oracle.com
[3]: http://lore.kernel.org/r/157309097008.1579826.12818463304589384434.stgit@dwillia2-desk3.amr.corp.intel.com
[4]: http://lore.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
[5]: http://lore.kernel.org/r/20200110190313.17144-1-joao.m.martins@oracle.com

---

Dan Williams (19):
      x86/numa: Cleanup configuration dependent command-line options
      x86/numa: Add 'nohmat' option
      efi/fake_mem: Arrange for a resource entry per efi_fake_mem instance
      ACPI: HMAT: Refactor hmat_register_target_device to hmem_register_device
      resource: Report parent to walk_iomem_res_desc() callback
      mm/memory_hotplug: Introduce default phys_to_target_node() implementation
      ACPI: HMAT: Attach a device for each soft-reserved range
      device-dax: Drop the dax_region.pfn_flags attribute
      device-dax: Move instance creation parameters to 'struct dev_dax_data'
      device-dax: Make pgmap optional for instance creation
      device-dax: Kill dax_kmem_res
      device-dax: Add an allocation interface for device-dax instances
      device-dax: Introduce 'seed' devices
      drivers/base: Make device_find_child_by_name() compatible with sysfs inputs
      device-dax: Add resize support
      mm/memremap_pages: Convert to 'struct range'
      mm/memremap_pages: Support multiple ranges per invocation
      device-dax: Add dis-contiguous resource support
      device-dax: Introduce 'mapping' devices

Joao Martins (4):
      device-dax: Make align a per-device property
      device-dax: Add an 'align' attribute
      dax/hmem: Introduce dax_hmem.region_idle parameter
      device-dax: Add a range mapping allocation attribute


 Documentation/x86/x86_64/boot-options.rst |    4 
 arch/powerpc/kvm/book3s_hv_uvmem.c        |   14 
 arch/x86/include/asm/numa.h               |    8 
 arch/x86/kernel/e820.c                    |   16 
 arch/x86/mm/numa.c                        |   11 
 arch/x86/mm/numa_emulation.c              |    3 
 arch/x86/xen/enlighten_pv.c               |    2 
 drivers/acpi/numa/hmat.c                  |   76 --
 drivers/acpi/numa/srat.c                  |    9 
 drivers/base/core.c                       |    2 
 drivers/dax/Kconfig                       |    4 
 drivers/dax/Makefile                      |    3 
 drivers/dax/bus.c                         | 1046 +++++++++++++++++++++++++++--
 drivers/dax/bus.h                         |   28 -
 drivers/dax/dax-private.h                 |   60 +-
 drivers/dax/device.c                      |  134 ++--
 drivers/dax/hmem.c                        |   56 --
 drivers/dax/hmem/Makefile                 |    6 
 drivers/dax/hmem/device.c                 |  100 +++
 drivers/dax/hmem/hmem.c                   |   65 ++
 drivers/dax/kmem.c                        |  199 +++---
 drivers/dax/pmem/compat.c                 |    2 
 drivers/dax/pmem/core.c                   |   22 -
 drivers/firmware/efi/x86_fake_mem.c       |   12 
 drivers/gpu/drm/nouveau/nouveau_dmem.c    |   15 
 drivers/nvdimm/badrange.c                 |   26 -
 drivers/nvdimm/claim.c                    |   13 
 drivers/nvdimm/nd.h                       |    3 
 drivers/nvdimm/pfn_devs.c                 |   13 
 drivers/nvdimm/pmem.c                     |   27 -
 drivers/nvdimm/region.c                   |   21 -
 drivers/pci/p2pdma.c                      |   12 
 include/acpi/acpi_numa.h                  |   14 
 include/linux/dax.h                       |    8 
 include/linux/memory_hotplug.h            |    5 
 include/linux/memremap.h                  |   11 
 include/linux/numa.h                      |   11 
 include/linux/range.h                     |    6 
 kernel/resource.c                         |   11 
 lib/test_hmm.c                            |   15 
 mm/memory_hotplug.c                       |   10 
 mm/memremap.c                             |  299 +++++---
 tools/testing/nvdimm/dax-dev.c            |   22 -
 tools/testing/nvdimm/test/iomap.c         |    2 
 44 files changed, 1825 insertions(+), 601 deletions(-)
 delete mode 100644 drivers/dax/hmem.c
 create mode 100644 drivers/dax/hmem/Makefile
 create mode 100644 drivers/dax/hmem/device.c
 create mode 100644 drivers/dax/hmem/hmem.c

base-commit: 01830e6c042e8eb6eb202e05d7df8057135b4c26

David Hildenbrand Aug. 3, 2020, 7:47 a.m. UTC | #1

[...]

> Well, no v5.8-rc8 to line this up for v5.9, so next best is early
> integration into -mm before other collisions develop.
> 
> Chatted with Justin offline and it currently appears that the missing
> NUMA information is due to the platform firmware failing to populate all
> the necessary NUMA data in the NFIT.

I'm planning on looking at some bits of this series this week, but some
questions upfront ...

> 
> ---
> Cover:
> 
> The device-dax facility allows an address range to be directly mapped
> through a chardev, or optionally hotplugged to the core kernel page
> allocator as System-RAM. It is the mechanism for converting persistent
> memory (pmem) to be used as another volatile memory pool i.e. the
> current Memory Tiering hot topic on linux-mm.
> 
> In the case of pmem the nvdimm-namespace-label mechanism can sub-divide
> it, but that labeling mechanism is not available / applicable to
> soft-reserved ("EFI specific purpose") memory [3]. This series provides
> a sysfs-mechanism for the daxctl utility to enable provisioning of
> volatile-soft-reserved memory ranges.
> 
> The motivations for this facility are:
> 
> 1/ Allow performance differentiated memory ranges to be split between
>    kernel-managed and directly-accessed use cases.
> 
> 2/ Allow physical memory to be provisioned along performance relevant
>    address boundaries. For example, divide a memory-side cache [4] along
>    cache-color boundaries.
> 
> 3/ Parcel out soft-reserved memory to VMs using device-dax as a security
>    / permissions boundary [5]. Specifically I have seen people (ab)using
>    memmap=nn!ss (mark System-RAM as Persistent Memory) just to get the
>    device-dax interface on custom address ranges. A follow-on for the VM
>    use case is to teach device-dax to dynamically allocate 'struct page' at
>    runtime to reduce the duplication of 'struct page' space in both the
>    guest and the host kernel for the same physical pages.


I think I am missing some important pieces. Bear with me.

1. On x86-64, e820 indicates "soft-reserved" memory. This memory is not
automatically used in the buddy during boot, but remains untouched
(similar to pmem). But as it involves ACPI as well, it could also be
used on arm64 (-e820), correct?

2. Soft-reserved memory is volatile RAM with differing performance
characteristics ("performance differentiated memory"). What would be
examples of such memory? Like, memory that is faster than RAM (scratch
pad), or slower (pmem)? Or both? :) Is it a valid use case to use pmem
in a hypervisor to back this memory?

3. There seem to be use cases where "soft-reserved" memory is used via
DAX. What is an example use case? I assume it's *not* to treat it like
PMEM but instead e.g., use it as a fast buffer inside applications or
similar.

4. There seem to be use cases where some part of "soft-reserved" memory
is used via DAX, some other is given to the buddy. What is an example
use case? Is this really necessary or only some theoretical use case?

5. The "provisioned along performance relevant address boundaries." part
is unclear to me. Can you give an example of what this would look like
from user space? Like, split that memory in blocks of size X with
alignment Y and give them to separate applications?

6. If you add such memory to the buddy, is there any way the system can
differentiate it from other memory? E.g., via fake/other NUMA nodes?


Also, can you give examples of how kmem-added memory is represented in
/proc/iomem for a) pmem and b) soft-reserved memory after this series
(skimming over the patches, I think there is a change for pmem, right?)?

I am really wondering if it's the right approach to squeeze this into
our pmem/nvdimm infrastructure just because it's easy to do. E.g., man
"ndctl" - "ndctl - Manage "libnvdimm" subsystem devices (Non-volatile
Memory)" speaks explicitly about non-volatile memory.



Dan Williams Aug. 20, 2020, 1:53 a.m. UTC | #2

On Mon, Aug 3, 2020 at 12:48 AM David Hildenbrand <david@redhat.com> wrote:
>
> [...]
>
> > Well, no v5.8-rc8 to line this up for v5.9, so next best is early
> > integration into -mm before other collisions develop.
> >
> > Chatted with Justin offline and it currently appears that the missing
> > NUMA information is due to the platform firmware failing to populate all
> > the necessary NUMA data in the NFIT.
>
> I'm planning on looking at some bits of this series this week, but some
> questions upfront ...
>
> >
> > ---
> > Cover:
> >
> > The device-dax facility allows an address range to be directly mapped
> > through a chardev, or optionally hotplugged to the core kernel page
> > allocator as System-RAM. It is the mechanism for converting persistent
> > memory (pmem) to be used as another volatile memory pool i.e. the
> > current Memory Tiering hot topic on linux-mm.
> >
> > In the case of pmem the nvdimm-namespace-label mechanism can sub-divide
> > it, but that labeling mechanism is not available / applicable to
> > soft-reserved ("EFI specific purpose") memory [3]. This series provides
> > a sysfs-mechanism for the daxctl utility to enable provisioning of
> > volatile-soft-reserved memory ranges.
> >
> > The motivations for this facility are:
> >
> > 1/ Allow performance differentiated memory ranges to be split between
> >    kernel-managed and directly-accessed use cases.
> >
> > 2/ Allow physical memory to be provisioned along performance relevant
> >    address boundaries. For example, divide a memory-side cache [4] along
> >    cache-color boundaries.
> >
> > 3/ Parcel out soft-reserved memory to VMs using device-dax as a security
> >    / permissions boundary [5]. Specifically I have seen people (ab)using
> >    memmap=nn!ss (mark System-RAM as Persistent Memory) just to get the
> >    device-dax interface on custom address ranges. A follow-on for the VM
> >    use case is to teach device-dax to dynamically allocate 'struct page' at
> >    runtime to reduce the duplication of 'struct page' space in both the
> >    guest and the host kernel for the same physical pages.
>
>
> I think I am missing some important pieces. Bear with me.

No worries, also bear with me, I'm going to be offline intermittently
until at least mid-September. Hopefully Joao and/or Vishal can jump in
on this discussion.

>
> 1. On x86-64, e820 indicates "soft-reserved" memory. This memory is not
> automatically used in the buddy during boot, but remains untouched
> (similar to pmem). But as it involves ACPI as well, it could also be
> used on arm64 (-e820), correct?

Correct, arm64 also gets the EFI support for enumerating memory this
way. However, I would clarify that whether soft-reserved is given to
the buddy allocator by default or not is the kernel's policy choice,
"buddy-by-default" is ok and is what will happen anyways with older
kernels on platforms that enumerate a memory range this way.

> 2. Soft-reserved memory is volatile RAM with differing performance
> characteristics ("performance differentiated memory"). What would be
> examples of such memory?

Likely the most prominent one that drove the creation of the "EFI
Specific Purpose" attribute bit is high-bandwidth memory. One concrete
example of that was a platform called Knights Landing [1] that ended
up shipping firmware that lied to the OS about the latency
characteristics of the memory to try to reverse engineer OS behavior
to not allocate from that memory range by default. With the EFI
attribute firmware performance tables can tell the truth about the
performance characteristics of the memory range *and* indicate that
the OS not use it for general purpose allocations by default.

[1]: https://software.intel.com/content/www/us/en/develop/blogs/an-intro-to-mcdram-high-bandwidth-memory-on-knights-landing.html
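
To make the attribute-plus-policy split concrete, here is a minimal
sketch of how an OS might classify one EFI memory descriptor. The
EFI_MEMORY_SP bit value and the EfiConventionalMemory type number are
taken from the UEFI specification; the trimmed 'struct efi_desc', the
'enum mem_policy' names and the 'honor_sp' switch are hypothetical and
only stand in for the kernel's policy choice described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EFI_CONVENTIONAL_MEMORY	7			/* UEFI memory type */
#define EFI_MEMORY_SP		0x0000000000040000ULL	/* "Specific Purpose" */

/* Hypothetical, trimmed-down view of an EFI memory descriptor. */
struct efi_desc {
	uint32_t type;
	uint64_t attribute;
};

enum mem_policy {
	MEM_GENERAL_PURPOSE,	/* handed to the page allocator at boot */
	MEM_SOFT_RESERVED,	/* held back for device-dax / administrator policy */
	MEM_OTHER,		/* not conventional memory, out of scope here */
};

/*
 * 'honor_sp' models the kernel's policy choice: an older kernel that does
 * not know about the attribute behaves as if honor_sp were false and treats
 * the range as ordinary RAM ("buddy-by-default").
 */
static enum mem_policy classify(const struct efi_desc *md, bool honor_sp)
{
	assert(md != NULL);
	if (md->type != EFI_CONVENTIONAL_MEMORY)
		return MEM_OTHER;
	if (honor_sp && (md->attribute & EFI_MEMORY_SP))
		return MEM_SOFT_RESERVED;
	return MEM_GENERAL_PURPOSE;
}
```

Note that the classification says nothing about latency or bandwidth;
in this model those still come from the firmware performance tables.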

> Like, memory that is faster than RAM (scratch
> pad), or slower (pmem)? Or both? :)

Both, but note that PMEM is already hard-reserved by default.
Soft-reserved is about a memory range that, for example, an
administrator may want to reserve 100% for a weather simulation where
if even a small amount of memory was stolen for the page cache the
application may not meet its performance targets. It could also be a
memory range that is so slow that only applications with higher
latency tolerances would be prepared to consume it.

In other words the soft-reserved memory can be used to indicate memory
that is either too precious, or too slow for general purpose OS
allocations.

> Is it a valid use case to use pmem
> in a hypervisor to back this memory?

Depends on the pmem. That performance capability is indicated by the
ACPI HMAT, not the EFI soft-reserved designation.

> 3. There seem to be use cases where "soft-reserved" memory is used via
> DAX. What is an example use case? I assume it's *not* to treat it like
> PMEM but instead e.g., use it as a fast buffer inside applications or
> similar.

Right, in that weather-simulation example that application could just
mmap /dev/daxX.Y and never worry about contending for the "fast
memory" resource on the platform. Alternatively if that resource needs
to be shared and/or over-committed then kernel memory-management
services are needed and that dax-device can be assigned to kmem.
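
A minimal sketch of the direct-mapping path, assuming a device node
such as /dev/dax0.0 exists and that the mapping has to be shared and
sized in multiples of the device alignment (2 MiB is a common value,
but the real alignment should be read from the device). The helper
names are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round 'len' up to a multiple of 'align' (a non-zero power of two). */
static size_t round_up_pow2(size_t len, size_t align)
{
	assert(align && (align & (align - 1)) == 0);
	return (len + align - 1) & ~(align - 1);
}

/*
 * Map at least 'len' bytes of the device-dax node at 'path'.
 * Returns the mapping, or MAP_FAILED if open() or mmap() fails.
 */
static void *dax_map(const char *path, size_t len, size_t align)
{
	void *p;
	int fd = open(path, O_RDWR);

	if (fd < 0)
		return MAP_FAILED;
	p = mmap(NULL, round_up_pow2(len, align), PROT_READ | PROT_WRITE,
		 MAP_SHARED, fd, 0);
	close(fd);	/* the mapping remains valid after the fd is closed */
	return p;
}
```

The application then uses the returned pointer like any other buffer;
nothing else on the system allocates from that range while the device
stays in device-dax mode.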

> 4. There seem to be use cases where some part of "soft-reserved" memory
> is used via DAX, some other is given to the buddy. What is an example
> use case? Is this really necessary or only some theoretical use case?

It's as necessary as pmem namespace partitioning, or the inclusion of
dax-kmem upstream in the first place. In that kmem case the motivation
was that some users want a portion of pmem provisioned for storage and
some for volatile usage. The motivation is similar here: platform
firmware can only identify memory attributes on coarse boundaries, so
finer-grained provisioning decisions are up to the administrator /
platform-owner, and the kernel is just a facilitator of that policy.

>
> 5. The "provisioned along performance relevant address boundaries." part
> is unclear to me. Can you give an example of what this would look like
> from user space? Like, split that memory in blocks of size X with
> alignment Y and give them to separate applications?

One example of platform address boundaries is the set of memory address
ranges that alias in a direct-mapped memory-side-cache. In the
direct-mapped cache, aliasing may repeat every N GBs where N is the ratio
of far-to-near memory ("near memory" == cache, "far memory" ==
backing memory). Also refer back to the background in the page
allocator shuffling patches [2]. With this partitioning mechanism you
could, for one example use case, assign different VMs to exclusive
colors in the memory side cache.

[2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e900a918b098
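
As a toy model of that aliasing, assume two addresses collide in the
cache whenever they sit at the same offset within a fixed alias stride,
and that the stride is split into equal, contiguous "colors". The
stride, the number of colors and all function names below are
assumptions for illustration; real platforms may hash addresses
differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Offset of a physical address within the alias stride (its cache slot). */
static uint64_t cache_slot(uint64_t addr, uint64_t stride)
{
	assert(stride != 0);
	return addr % stride;
}

/* Two distinct addresses alias if they land in the same cache slot. */
static bool addrs_alias(uint64_t a, uint64_t b, uint64_t stride)
{
	return a != b && cache_slot(a, stride) == cache_slot(b, stride);
}

/* Color of an address when the stride is split into 'ncolors' equal parts. */
static unsigned int cache_color(uint64_t addr, uint64_t stride,
				unsigned int ncolors)
{
	assert(ncolors != 0 && stride % ncolors == 0);
	return (unsigned int)(cache_slot(addr, stride) / (stride / ncolors));
}
```

Under this model, giving each VM only the ranges of one color means two
VMs never compete for the same cache slots, which is the kind of
boundary a partitioning mechanism would need to respect.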

> 6. If you add such memory to the buddy, is there any way the system can
> differentiate it from other memory? E.g., via fake/other NUMA nodes?

NUMA node numbers are how performance differentiated memory ranges
are enumerated. The expectation is that all distinct performance
memory targets have unique ACPI proximity domains and Linux NUMA node
numbers as a result.

> Also, can you give examples of how kmem-added memory is represented in
> /proc/iomem for a) pmem and b) soft-reserved memory after this series
> (skimming over the patches, I think there is a change for pmem, right?)?

I don't expect a change. The only difference is the parent resource
will be marked "Soft Reserved" instead of "Persistent Memory".
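
For anyone who wants to check this on a running system, a small parser
for /proc/iomem lines might look like the sketch below. It assumes the
usual "start-end : name" layout with two spaces of indentation per
nesting level; the struct and function names are hypothetical, and the
sample lines in the usage note are made up rather than captured output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct iomem_entry {
	unsigned long long start, end;	/* inclusive physical range */
	int depth;			/* 0 = top level, 1 = child, ... */
	char name[64];
};

/*
 * Parse one line such as "  100000000-1ffffffff : dax0.0".
 * Returns 0 on success, -1 if the line does not match the layout.
 */
static int parse_iomem_line(const char *line, struct iomem_entry *e)
{
	int indent;

	assert(line != NULL && e != NULL);
	indent = (int)strspn(line, " ");
	if (sscanf(line + indent, "%llx-%llx : %63[^\n]",
		   &e->start, &e->end, e->name) != 3)
		return -1;
	e->depth = indent / 2;
	return 0;
}
```

Walking the file and remembering the most recent depth-0 entry is then
enough to tell whether a given child range sits under "Soft Reserved"
or under "Persistent Memory". Unprivileged readers may see zeroed
addresses, so this is best run as root.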

> I am really wondering if it's the right approach to squeeze this into
> our pmem/nvdimm infrastructure just because it's easy to do. E.g., man
> "ndctl" - "ndctl - Manage "libnvdimm" subsystem devices (Non-volatile
> Memory)" speaks explicitly about non-volatile memory.

In fact it's not squeezed into PMEM infrastructure. dax-kmem and
device-dax are independent of PMEM. PMEM is one source of potential
device-dax instances, soft-reserved memory is another orthogonal
source. This is why device-dax needs its own userspace policy directed
partitioning mechanism, because there is no PMEM to store the
configuration for partitioned high-bandwidth memory. The userspace
tooling for this mechanism is targeted for a tool called daxctl that
has no PMEM dependencies. Look to Joao's use case that is using this
infrastructure independent of PMEM with manual soft-reservations
specified on the kernel command-line.

David Hildenbrand Aug. 21, 2020, 10:15 a.m. UTC | #3

>>
>> 1. On x86-64, e820 indicates "soft-reserved" memory. This memory is not
>> automatically used in the buddy during boot, but remains untouched
>> (similar to pmem). But as it involves ACPI as well, it could also be
>> used on arm64 (-e820), correct?
> 
> Correct, arm64 also gets the EFI support for enumerating memory this
> way. However, I would clarify that whether soft-reserved is given to
> the buddy allocator by default or not is the kernel's policy choice,
> "buddy-by-default" is ok and is what will happen anyways with older
> kernels on platforms that enumerate a memory range this way.

Is "soft-reserved" then the right terminology for that? It sounds very
x86-64/e820 specific. Maybe a compressed form of "performance
differentiated memory" might be a better fit to expose to user space, no?

> 
>> 2. Soft-reserved memory is volatile RAM with differing performance
>> characteristics ("performance differentiated memory"). What would be
>> examples of such memory?
> 
> Likely the most prominent one that drove the creation of the "EFI
> Specific Purpose" attribute bit is high-bandwidth memory. One concrete
> example of that was a platform called Knights Landing [1] that ended
> up shipping firmware that lied to the OS about the latency
> characteristics of the memory to try to reverse engineer OS behavior
> to not allocate from that memory range by default. With the EFI
> attribute firmware performance tables can tell the truth about the
> performance characteristics of the memory range *and* indicate that
> the OS not use it for general purpose allocations by default.

Thanks for clarifying!

> 
> [1]: https://software.intel.com/content/www/us/en/develop/blogs/an-intro-to-mcdram-high-bandwidth-memory-on-knights-landing.html
> 
>> Like, memory that is faster than RAM (scratch
>> pad), or slower (pmem)? Or both? :)
> 
> Both, but note that PMEM is already hard-reserved by default.
> Soft-reserved is about a memory range that, for example, an
> administrator may want to reserve 100% for a weather simulation where
> if even a small amount of memory was stolen for the page cache the
> application may not meet its performance targets. It could also be a
> memory range that is so slow that only applications with higher
> latency tolerances would be prepared to consume it.
> 
> In other words the soft-reserved memory can be used to indicate memory
> that is either too precious, or too slow for general purpose OS
> allocations.

Right, so actually performance-differentiated in any way :)

> 
>> Is it a valid use case to use pmem
>> in a hypervisor to back this memory?
> 
> Depends on the pmem. That performance capability is indicated by the
> ACPI HMAT, not the EFI soft-reserved designation.
> 
>> 3. There seem to be use cases where "soft-reserved" memory is used via
>> DAX. What is an example use case? I assume it's *not* to treat it like
>> PMEM but instead e.g., use it as a fast buffer inside applications or
>> similar.
> 
> Right, in that weather-simulation example that application could just
> mmap /dev/daxX.Y and never worry about contending for the "fast
> memory" resource on the platform. Alternatively if that resource needs
> to be shared and/or over-committed then kernel memory-management
> services are needed and that dax-device can be assigned to kmem.
> 
>> 4. There seem to be use cases where some part of "soft-reserved" memory
>> is used via DAX, some other is given to the buddy. What is an example
>> use case? Is this really necessary or only some theoretical use case?
> 
> It's as necessary as pmem namespace partitioning, or the inclusion of
> dax-kmem upstream in the first place. In that kmem case the motivation
> was that some users want a portion of pmem provisioned for storage and
> some for volatile usage. The motivation is similar here: platform
> firmware can only identify memory attributes on coarse boundaries, so
> finer-grained provisioning decisions are up to the administrator /
> platform-owner, and the kernel is just a facilitator of that policy.
> 
>>
>> 5. The "provisioned along performance relevant address boundaries." part
>> is unclear to me. Can you give an example of what this would look like
>> from user space? Like, split that memory in blocks of size X with
>> alignment Y and give them to separate applications?
> 
> One example of platform address boundaries is the memory address
> ranges that alias in a direct-mapped memory-side-cache. In the
> direct-map-cache aliasing may repeat every N GBs where N is the ratio
> of far-to-near memory. ("Near memory" == cache, "Far memory" ==
> backing memory). Also refer back to the background in the page
> allocator shuffling patches [2]. With this partitioning mechanism you
> could, for one example use case, assign different VMs to exclusive
> colors in the memory side cache.

Interesting, thanks!
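
To make the cache-color idea concrete, here is a small sketch under a deliberately simplified model: a direct-mapped memory-side cache where the cache slot is `address % near_size`, split into equal contiguous "colors". The function name, the modulo mapping, and the sizes are assumptions for illustration; real platforms define their own aliasing stride.

```python
def color_ranges(far_size, near_size, n_colors, color):
    """Return [start, end) address ranges whose cache slots fall in `color`.

    Assumes slot = address % near_size and n_colors equal cache slices.
    """
    if near_size % n_colors or far_size % near_size:
        raise ValueError("sizes must divide evenly in this simplified model")
    slice_size = near_size // n_colors
    ranges = []
    # Every near_size-sized stripe of far memory aliases onto the whole cache,
    # so one color takes the same slice out of each stripe.
    for stripe in range(0, far_size, near_size):
        start = stripe + color * slice_size
        ranges.append((start, start + slice_size))
    return ranges


GiB = 1 << 30
# Hypothetical platform: 64 GiB of far memory behind a 16 GiB near-memory cache.
vm0 = color_ranges(64 * GiB, 16 * GiB, n_colors=4, color=0)
vm1 = color_ranges(64 * GiB, 16 * GiB, n_colors=4, color=1)
```

In this model each VM would be handed the ranges for one color, for example as separately created dax devices, so that two VMs never evict each other's lines from the memory-side cache.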

> 
> [2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e900a918b098
> 
>> 6. If you add such memory to the buddy, is there any way the system can
>> differentiate it from other memory? E.g., via fake/other NUMA nodes?
> 
> NUMA node numbers are how performance-differentiated memory ranges
> are enumerated. The expectation is that all distinct performance
> memory targets have unique ACPI proximity domains and Linux numa node
> numbers as a result.

Makes sense to me (although it's somehow weird, because memory of the
same socket/node would be represented via different NUMA nodes), thanks!

> 
>> Also, can you give examples of how kmem-added memory is represented in
>> /proc/iomem for a) pmem and b) soft-reserved memory after this series
>> (skimming over the patches, I think there is a change for pmem, right?)?
> 
> I don't expect a change. The only difference is the parent resource
> will be marked "Soft Reserved" instead of "Persistent Memory".

Right, I misread patch #11 while skimming - I thought the device
resource would be dropped.
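
For readers who want to check this on their own machine, the following sketch lists what is nested under each "Soft Reserved" parent in /proc/iomem-style text. The sample below is made up for illustration (the child names are hypothetical), and the parser assumes the parent appears at the top level of the resource tree.

```python
import re

# One /proc/iomem line: optional indentation, "start-end : name".
LINE = re.compile(r"^(\s*)([0-9a-f]+)-([0-9a-f]+) : (.*)$")


def soft_reserved_children(iomem_text):
    """Map each top-level "Soft Reserved" range to the names nested under it."""
    result = {}
    current = None
    for line in iomem_text.splitlines():
        m = LINE.match(line)
        if not m:
            continue
        indent, start, end, name = m.groups()
        if not indent:
            # A new top-level resource starts; track it only if soft-reserved.
            if name == "Soft Reserved":
                current = (int(start, 16), int(end, 16))
                result[current] = []
            else:
                current = None
        elif current is not None:
            result[current].append(name)
    return result


# Hypothetical sample, not captured from a real system.
SAMPLE = """\
00100000-bfffffff : System RAM
100000000-13fffffff : Soft Reserved
  100000000-13fffffff : dax0.0
    100000000-13fffffff : System RAM (kmem)
140000000-17fffffff : Persistent Memory
"""

children = soft_reserved_children(SAMPLE)
```

On a live system the input would come from reading /proc/iomem as root, since unprivileged reads report the addresses as zeros.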

> 
>> I am really wondering if it's the right approach to squeeze this into
>> our pmem/nvdimm infrastructure just because it's easy to do. E.g., man
>> "ndctl" - "ndctl - Manage "libnvdimm" subsystem devices (Non-volatile
>> Memory)" speaks explicitly about non-volatile memory.
> 
> In fact it's not squeezed into PMEM infrastructure. dax-kmem and
> device-dax are independent of PMEM. PMEM is one source of potential
> device-dax instances, soft-reserved memory is another orthogonal
> source. This is why device-dax needs its own userspace policy directed
> partitioning mechanism because there is no PMEM to store the
> configuration for partitioned high-bandwidth memory. The userspace
> tooling for this mechanism is targeted for a tool called daxctl that
> has no PMEM dependencies. Look to Joao's use case that is using this
> infrastructure independent of PMEM with manual soft-reservations
> specified on the kernel command-line.

Thanks for clarifying, I was under the impression we would be reusing
libnvdimm to manage that memory.

Dan Williams Aug. 21, 2020, 6:27 p.m. UTC | #4

On Fri, Aug 21, 2020 at 3:15 AM David Hildenbrand <david@redhat.com> wrote:
>
> >>
> >> 1. On x86-64, e820 indicates "soft-reserved" memory. This memory is not
> >> automatically used in the buddy during boot, but remains untouched
> >> (similar to pmem). But as it involves ACPI as well, it could also be
> >> used on arm64 (-e820), correct?
> >
> > Correct, arm64 also gets the EFI support for enumerating memory this
> > way. However, I would clarify that whether soft-reserved is given to
> > the buddy allocator by default or not is the kernel's policy choice,
> > "buddy-by-default" is ok and is what will happen anyways with older
> > kernels on platforms that enumerate a memory range this way.
>
> Is "soft-reserved" then the right terminology for that? It sounds very
> x86-64/e820 specific. Maybe a compressed form of "performance
> differentiated memory" might be a better fit to expose to user space, no?

No. The EFI "Specific Purpose" bit is an attribute independent of
e820, it's x86-Linux that entangles those together. There is no
requirement for platform firmware to use that designation even for
drastic performance differentiation between ranges, and conversely
there is no requirement that memory *with* that designation has any
performance difference compared to the default memory pool. So it
really is a reservation policy about a memory range to keep out of the
buddy allocator by default.

[..]
> > Both, but note that PMEM is already hard-reserved by default.
> > Soft-reserved is about a memory range that, for example, an
> > administrator may want to reserve 100% for a weather simulation where
> > if even a small amount of memory was stolen for the page cache the
> > application may not meet its performance targets. It could also be a
> > memory range that is so slow that only applications with higher
> > latency tolerances would be prepared to consume it.
> >
> > In other words the soft-reserved memory can be used to indicate memory
> > that is either too precious, or too slow for general purpose OS
> > allocations.
>
> Right, so actually performance-differentiated in any way :)

... or not differentiated at all which is Joao's use case for example.

[..]
> > NUMA node numbers are how performance-differentiated memory ranges
> > are enumerated. The expectation is that all distinct performance
> > memory targets have unique ACPI proximity domains and Linux numa node
> > numbers as a result.
>
> Makes sense to me (although it's somehow weird, because memory of the
> same socket/node would be represented via different NUMA nodes), thanks!

Yes, numa ids as only physical socket identifiers is no longer a
reliable assumption since the introduction of the ACPI HMAT.

David Hildenbrand Aug. 21, 2020, 6:30 p.m. UTC | #5

On 21.08.20 20:27, Dan Williams wrote:
> On Fri, Aug 21, 2020 at 3:15 AM David Hildenbrand <david@redhat.com> wrote:
>>
>>>>
>>>> 1. On x86-64, e820 indicates "soft-reserved" memory. This memory is not
>>>> automatically used in the buddy during boot, but remains untouched
>>>> (similar to pmem). But as it involves ACPI as well, it could also be
>>>> used on arm64 (-e820), correct?
>>>
>>> Correct, arm64 also gets the EFI support for enumerating memory this
>>> way. However, I would clarify that whether soft-reserved is given to
>>> the buddy allocator by default or not is the kernel's policy choice,
>>> "buddy-by-default" is ok and is what will happen anyways with older
>>> kernels on platforms that enumerate a memory range this way.
>>
>> Is "soft-reserved" then the right terminology for that? It sounds very
>> x86-64/e820 specific. Maybe a compressed form of "performance
>> differentiated memory" might be a better fit to expose to user space, no?
> 
> No. The EFI "Specific Purpose" bit is an attribute independent of
> e820, it's x86-Linux that entangles those together. There is no
> requirement for platform firmware to use that designation even for
> drastic performance differentiation between ranges, and conversely
> there is no requirement that memory *with* that designation has any
> performance difference compared to the default memory pool. So it
> really is a reservation policy about a memory range to keep out of the
> buddy allocator by default.

Okay, still "soft-reserved" is x86-64 specific, no? (AFAIK,
"soft-reserved" will be visible in /proc/iomem, or am I confusing
stuff?) IOW, if "performance differentiated" is not universally
applicable, maybe "specific purpose memory" is?

Dan Williams Aug. 21, 2020, 9:17 p.m. UTC | #6

On Fri, Aug 21, 2020 at 11:30 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 21.08.20 20:27, Dan Williams wrote:
> > On Fri, Aug 21, 2020 at 3:15 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >>>>
> >>>> 1. On x86-64, e820 indicates "soft-reserved" memory. This memory is not
> >>>> automatically used in the buddy during boot, but remains untouched
> >>>> (similar to pmem). But as it involves ACPI as well, it could also be
> >>>> used on arm64 (-e820), correct?
> >>>
> >>> Correct, arm64 also gets the EFI support for enumerating memory this
> >>> way. However, I would clarify that whether soft-reserved is given to
> >>> the buddy allocator by default or not is the kernel's policy choice,
> >>> "buddy-by-default" is ok and is what will happen anyways with older
> >>> kernels on platforms that enumerate a memory range this way.
> >>
> >> Is "soft-reserved" then the right terminology for that? It sounds very
> >> x86-64/e820 specific. Maybe a compressed form of "performance
> >> differentiated memory" might be a better fit to expose to user space, no?
> >
> > No. The EFI "Specific Purpose" bit is an attribute independent of
> > e820, it's x86-Linux that entangles those together. There is no
> > requirement for platform firmware to use that designation even for
> > drastic performance differentiation between ranges, and conversely
> > there is no requirement that memory *with* that designation has any
> > performance difference compared to the default memory pool. So it
> > really is a reservation policy about a memory range to keep out of the
> > buddy allocator by default.
>
> Okay, still "soft-reserved" is x86-64 specific, no?

There's nothing preventing other EFI archs, or a similar designation
in another firmware spec, picking up this policy.

>   (AFAIK,
> "soft-reserved" will be visible in /proc/iomem, or am I confusing
> stuff?)

No, you're correct.

> IOW, if "performance differentiated" is not universally
> applicable, maybe "specific purpose memory" is?

Those bikeshed colors don't seem an improvement to me.

"Soft-reserved" actually tells you something about the kernel policy
for the memory. The criticism of "specific purpose" that led to
calling it "soft-reserved" in Linux is the fact that "specific" is
undefined as far as the firmware knows, and "specific" may have
different applications based on the platform user. "Soft-reserved"
like "Reserved" tells you that a driver policy might be in play for
that memory.

Also note that the current color of the bikeshed has already shipped since v5.5:

   262b45ae3ab4 x86/efi: EFI soft reservation to E820 enumeration

David Hildenbrand Aug. 21, 2020, 9:33 p.m. UTC | #7

> Am 21.08.2020 um 23:17 schrieb Dan Williams <dan.j.williams@intel.com>:
> 
> On Fri, Aug 21, 2020 at 11:30 AM David Hildenbrand <david@redhat.com> wrote:
>> 
>>> On 21.08.20 20:27, Dan Williams wrote:
>>> On Fri, Aug 21, 2020 at 3:15 AM David Hildenbrand <david@redhat.com> wrote:
>>>> 
>>>>>> 
>>>>>> 1. On x86-64, e820 indicates "soft-reserved" memory. This memory is not
>>>>>> automatically used in the buddy during boot, but remains untouched
>>>>>> (similar to pmem). But as it involves ACPI as well, it could also be
>>>>>> used on arm64 (-e820), correct?
>>>>> 
>>>>> Correct, arm64 also gets the EFI support for enumerating memory this
>>>>> way. However, I would clarify that whether soft-reserved is given to
>>>>> the buddy allocator by default or not is the kernel's policy choice,
>>>>> "buddy-by-default" is ok and is what will happen anyways with older
>>>>> kernels on platforms that enumerate a memory range this way.
>>>> 
>>>> Is "soft-reserved" then the right terminology for that? It sounds very
>>>> x86-64/e820 specific. Maybe a compressed form of "performance
>>>> differentiated memory" might be a better fit to expose to user space, no?
>>> 
>>> No. The EFI "Specific Purpose" bit is an attribute independent of
>>> e820, it's x86-Linux that entangles those together. There is no
>>> requirement for platform firmware to use that designation even for
>>> drastic performance differentiation between ranges, and conversely
>>> there is no requirement that memory *with* that designation has any
>>> performance difference compared to the default memory pool. So it
>>> really is a reservation policy about a memory range to keep out of the
>>> buddy allocator by default.
>> 
>> Okay, still "soft-reserved" is x86-64 specific, no?
> 
> There's nothing preventing other EFI archs, or a similar designation
> in another firmware spec, picking up this policy.
> 
>>  (AFAIK,
>> "soft-reserved" will be visible in /proc/iomem, or am I confusing
>> stuff?)
> 
> No, you're correct.
> 
>> IOW, if "performance differentiated" is not universally
>> applicable, maybe "specific purpose memory" is?
> 
> Those bikeshed colors don't seem an improvement to me.
> 
> "Soft-reserved" actually tells you something about the kernel policy
> for the memory. The criticism of "specific purpose" that led to
> calling it "soft-reserved" in Linux is the fact that "specific" is
> undefined as far as the firmware knows, and "specific" may have
> different applications based on the platform user. "Soft-reserved"
> like "Reserved" tells you that a driver policy might be in play for
> that memory.
> 
> Also note that the current color of the bikeshed has already shipped since v5.5:
> 
>   262b45ae3ab4 x86/efi: EFI soft reservation to E820 enumeration
> 

I was asking because I was struggling to even understand what „soft-reserved“ is and I could bet most people have no clue what that is supposed to be.

In contrast „persistent memory“ or „special purpose memory“ in /proc/iomem is something normal (Linux using) human beings can understand.

But anyhow, just details, and you're telling me that that ship already sailed. So no further comments from my side.

Thanks for all the info!

David Hildenbrand Aug. 21, 2020, 9:42 p.m. UTC | #8

> Am 21.08.2020 um 23:34 schrieb David Hildenbrand <david@redhat.com>:
> 
> 
> 
>>> Am 21.08.2020 um 23:17 schrieb Dan Williams <dan.j.williams@intel.com>:
>>> 
>>> On Fri, Aug 21, 2020 at 11:30 AM David Hildenbrand <david@redhat.com> wrote:
>>> 
>>>> On 21.08.20 20:27, Dan Williams wrote:
>>>> On Fri, Aug 21, 2020 at 3:15 AM David Hildenbrand <david@redhat.com> wrote:
>>>>> 
>>>>>>> 
>>>>>>> 1. On x86-64, e820 indicates "soft-reserved" memory. This memory is not
>>>>>>> automatically used in the buddy during boot, but remains untouched
>>>>>>> (similar to pmem). But as it involves ACPI as well, it could also be
>>>>>>> used on arm64 (-e820), correct?
>>>>>> 
>>>>>> Correct, arm64 also gets the EFI support for enumerating memory this
>>>>>> way. However, I would clarify that whether soft-reserved is given to
>>>>>> the buddy allocator by default or not is the kernel's policy choice,
>>>>>> "buddy-by-default" is ok and is what will happen anyways with older
>>>>>> kernels on platforms that enumerate a memory range this way.
>>>>> 
>>>>> Is "soft-reserved" then the right terminology for that? It sounds very
>>>>> x86-64/e820 specific. Maybe a compressed form of "performance
>>>>> differentiated memory" might be a better fit to expose to user space, no?
>>>> 
>>>> No. The EFI "Specific Purpose" bit is an attribute independent of
>>>> e820, it's x86-Linux that entangles those together. There is no
>>>> requirement for platform firmware to use that designation even for
>>>> drastic performance differentiation between ranges, and conversely
>>>> there is no requirement that memory *with* that designation has any
>>>> performance difference compared to the default memory pool. So it
>>>> really is a reservation policy about a memory range to keep out of the
>>>> buddy allocator by default.
>>> 
>>> Okay, still "soft-reserved" is x86-64 specific, no?
>> 
>> There's nothing preventing other EFI archs, or a similar designation
>> in another firmware spec, picking up this policy.
>> 
>>> (AFAIK,
>>> "soft-reserved" will be visible in /proc/iomem, or am I confusing
>>> stuff?)
>> 
>> No, you're correct.
>> 
>>> IOW, if "performance differentiated" is not universally
>>> applicable, maybe "specific purpose memory" is?
>> 
>> Those bikeshed colors don't seem an improvement to me.
>> 
>> "Soft-reserved" actually tells you something about the kernel policy
>> for the memory. The criticism of "specific purpose" that led to
>> calling it "soft-reserved" in Linux is the fact that "specific" is
>> undefined as far as the firmware knows, and "specific" may have
>> different applications based on the platform user. "Soft-reserved"
>> like "Reserved" tells you that a driver policy might be in play for
>> that memory.
>> 
>> Also note that the current color of the bikeshed has already shipped since v5.5:
>> 
>>  262b45ae3ab4 x86/efi: EFI soft reservation to E820 enumeration
>> 
> 
> I was asking because I was struggling to even understand what „soft-reserved“ is and I could bet most people have no clue what that is supposed to be.
> 
> In contrast „persistent memory“ or „special purpose memory“ in /proc/iomem is something normal (Linux using) human beings can understand.

s/normal/most/ of course :)

David Hildenbrand Aug. 21, 2020, 9:43 p.m. UTC | #9

> Am 21.08.2020 um 23:34 schrieb David Hildenbrand <david@redhat.com>:
> 
> 
> 
>>> Am 21.08.2020 um 23:17 schrieb Dan Williams <dan.j.williams@intel.com>:
>>> 
>>> On Fri, Aug 21, 2020 at 11:30 AM David Hildenbrand <david@redhat.com> wrote:
>>> 
>>>> On 21.08.20 20:27, Dan Williams wrote:
>>>> On Fri, Aug 21, 2020 at 3:15 AM David Hildenbrand <david@redhat.com> wrote:
>>>>> 
>>>>>>> 
>>>>>>> 1. On x86-64, e820 indicates "soft-reserved" memory. This memory is not
>>>>>>> automatically used in the buddy during boot, but remains untouched
>>>>>>> (similar to pmem). But as it involves ACPI as well, it could also be
>>>>>>> used on arm64 (-e820), correct?
>>>>>> 
>>>>>> Correct, arm64 also gets the EFI support for enumerating memory this
>>>>>> way. However, I would clarify that whether soft-reserved is given to
>>>>>> the buddy allocator by default or not is the kernel's policy choice,
>>>>>> "buddy-by-default" is ok and is what will happen anyways with older
>>>>>> kernels on platforms that enumerate a memory range this way.
>>>>> 
>>>>> Is "soft-reserved" then the right terminology for that? It sounds very
>>>>> x86-64/e820 specific. Maybe a compressed form of "performance
>>>>> differentiated memory" might be a better fit to expose to user space, no?
>>>> 
>>>> No. The EFI "Specific Purpose" bit is an attribute independent of
>>>> e820, it's x86-Linux that entangles those together. There is no
>>>> requirement for platform firmware to use that designation even for
>>>> drastic performance differentiation between ranges, and conversely
>>>> there is no requirement that memory *with* that designation has any
>>>> performance difference compared to the default memory pool. So it
>>>> really is a reservation policy about a memory range to keep out of the
>>>> buddy allocator by default.
>>> 
>>> Okay, still "soft-reserved" is x86-64 specific, no?
>> 
>> There's nothing preventing other EFI archs, or a similar designation
>> in another firmware spec, picking up this policy.
>> 
>>> (AFAIK,
>>> "soft-reserved" will be visible in /proc/iomem, or am I confusing
>>> stuff?)
>> 
>> No, you're correct.
>> 
>>> IOW, if "performance differentiated" is not universally
>>> applicable, maybe "specific purpose memory" is?
>> 
>> Those bikeshed colors don't seem an improvement to me.
>> 
>> "Soft-reserved" actually tells you something about the kernel policy
>> for the memory. The criticism of "specific purpose" that led to
>> calling it "soft-reserved" in Linux is the fact that "specific" is
>> undefined as far as the firmware knows, and "specific" may have
>> different applications based on the platform user. "Soft-reserved"
>> like "Reserved" tells you that a driver policy might be in play for
>> that memory.
>> 
>> Also note that the current color of the bikeshed has already shipped since v5.5:
>> 
>>  262b45ae3ab4 x86/efi: EFI soft reservation to E820 enumeration
>> 
> 
> I was asking because I was struggling to even understand what „soft-reserved“ is and I could bet most people have no clue what that is supposed to be.
> 
> In contrast „persistent memory“ or „special purpose memory“ in /proc/iomem is something normal (Linux using) human beings can understand.

Obviously s/normal/most/

Cheers!

David Hildenbrand Aug. 21, 2020, 9:46 p.m. UTC | #10

On 21.08.20 23:33, David Hildenbrand wrote:
> 
> 
>> Am 21.08.2020 um 23:17 schrieb Dan Williams <dan.j.williams@intel.com>:
>>
>> On Fri, Aug 21, 2020 at 11:30 AM David Hildenbrand <david@redhat.com> wrote:
>>>
>>>> On 21.08.20 20:27, Dan Williams wrote:
>>>> On Fri, Aug 21, 2020 at 3:15 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>
>>>>>>>
>>>>>>> 1. On x86-64, e820 indicates "soft-reserved" memory. This memory is not
>>>>>>> automatically used in the buddy during boot, but remains untouched
>>>>>>> (similar to pmem). But as it involves ACPI as well, it could also be
>>>>>>> used on arm64 (-e820), correct?
>>>>>>
>>>>>> Correct, arm64 also gets the EFI support for enumerating memory this
>>>>>> way. However, I would clarify that whether soft-reserved is given to
>>>>>> the buddy allocator by default or not is the kernel's policy choice,
>>>>>> "buddy-by-default" is ok and is what will happen anyways with older
>>>>>> kernels on platforms that enumerate a memory range this way.
>>>>>
>>>>> Is "soft-reserved" then the right terminology for that? It sounds very
>>>>> x86-64/e820 specific. Maybe a compressed form of "performance
>>>>> differentiated memory" might be a better fit to expose to user space, no?
>>>>
>>>> No. The EFI "Specific Purpose" bit is an attribute independent of
>>>> e820, it's x86-Linux that entangles those together. There is no
>>>> requirement for platform firmware to use that designation even for
>>>> drastic performance differentiation between ranges, and conversely
>>>> there is no requirement that memory *with* that designation has any
>>>> performance difference compared to the default memory pool. So it
>>>> really is a reservation policy about a memory range to keep out of the
>>>> buddy allocator by default.
>>>
>>> Okay, still "soft-reserved" is x86-64 specific, no?
>>
>> There's nothing preventing other EFI archs, or a similar designation
>> in another firmware spec, picking up this policy.
>>
>>>  (AFAIK,
>>> "soft-reserved" will be visible in /proc/iomem, or am I confusing
>>> stuff?)
>>
>> No, you're correct.
>>
>>> IOW, if "performance differentiated" is not universally
>>> applicable, maybe "specific purpose memory" is?
>>
>> Those bikeshed colors don't seem an improvement to me.
>>
>> "Soft-reserved" actually tells you something about the kernel policy
>> for the memory. The criticism of "specific purpose" that led to
>> calling it "soft-reserved" in Linux is the fact that "specific" is
>> undefined as far as the firmware knows, and "specific" may have
>> different applications based on the platform user. "Soft-reserved"
>> like "Reserved" tells you that a driver policy might be in play for
>> that memory.
>>
>> Also note that the current color of the bikeshed has already shipped since v5.5:
>>
>>   262b45ae3ab4 x86/efi: EFI soft reservation to E820 enumeration
>>
> 
> I was asking because I was struggling to even understand what „soft-reserved“ is and I could bet most people have no clue what that is supposed to be.
> 
> In contrast „persistent memory“ or „special purpose memory“ in /proc/iomem is something normal (Linux using) human beings can understand.

s/normal/most/ , shouldn't be writing emails from my smartphone. Cheers!

Andrew Morton Aug. 21, 2020, 11:21 p.m. UTC | #11

On Wed, 19 Aug 2020 18:53:57 -0700 Dan Williams <dan.j.williams@intel.com> wrote:

> > I think I am missing some important pieces. Bear with me.
> 
> No worries, also bear with me, I'm going to be offline intermittently
> until at least mid-September. Hopefully Joao and/or Vishal can jump in
> on this discussion.

Ordinarily I'd prefer a refresh&resend for 2+ week-old series such as
this.

But given that v4 all applies OK and that Dan has pending outages, I'll
scoop up this version, even though at least one change has been suggested.

Also, this series has killed Zhen Lei's little cleanup
(http://lkml.kernel.org/r/20200817065926.2239-1-thunder.leizhen@huawei.com).
I don't think the affected code was moved elsewhere, so I'll drop that
patch.

Zhen Lei Aug. 22, 2020, 2:32 a.m. UTC | #12

On 8/22/2020 7:21 AM, Andrew Morton wrote:
> On Wed, 19 Aug 2020 18:53:57 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
> 
>>> I think I am missing some important pieces. Bear with me.
>>
>> No worries, also bear with me, I'm going to be offline intermittently
>> until at least mid-September. Hopefully Joao and/or Vishal can jump in
>> on this discussion.
> 
> Ordinarily I'd prefer a refresh&resend for 2+ week-old series such as
> this.
> 
> But given that v4 all applies OK and that Dan has pending outages, I'll
> scoop up this version, even though at least one change has been suggested.
> 
> Also, this series has killed Zhen Lei's little cleanup
> (http://lkml.kernel.org/r/20200817065926.2239-1-thunder.leizhen@huawei.com).
> I don't think the affected code was moved elsewhere, so I'll drop that
> patch.

OK, this patch is really optional.

> 
> 
> .
>

David Hildenbrand Sept. 8, 2020, 10:45 a.m. UTC | #13

On 22.08.20 01:21, Andrew Morton wrote:
> On Wed, 19 Aug 2020 18:53:57 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
> 
>>> I think I am missing some important pieces. Bear with me.
>>
>> No worries, also bear with me, I'm going to be offline intermittently
>> until at least mid-September. Hopefully Joao and/or Vishal can jump in
>> on this discussion.
> 
> Ordinarily I'd prefer a refresh&resend for 2+ week-old series such as
> this.
> 
> But given that v4 all applies OK and that Dan has pending outages, I'll
> scoop up this version, even though at least one change has been suggested.
> 

Should I try to fix patch #11 while Dan is away? Because I think at
least two things in there are wrong (and it would have been better to
split that patch into reviewable pieces).

Dan Williams Sept. 23, 2020, 12:43 a.m. UTC | #14

On Tue, Sep 8, 2020 at 3:46 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 22.08.20 01:21, Andrew Morton wrote:
> > On Wed, 19 Aug 2020 18:53:57 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
> >
> >>> I think I am missing some important pieces. Bear with me.
> >>
> >> No worries, also bear with me, I'm going to be offline intermittently
> >> until at least mid-September. Hopefully Joao and/or Vishal can jump in
> >> on this discussion.
> >
> > Ordinarily I'd prefer a refresh&resend for 2+ week-old series such as
> > this.
> >
> > But given that v4 all applies OK and that Dan has pending outages, I'll
> > scoop up this version, even though at least one change has been suggested.
> >
>
> Should I try to fix patch #11 while Dan is away? Because I think at
> least two things in there are wrong (and it would have been better to
> split that patch into reviewable pieces).

Hey David,

Back now, I'll take a look. I didn't immediately see a way to
meaningfully break up that patch without some dead-code steps in the
conversion, but I'll take another run at it.
