
[v5,00/10] mm: Sub-section memory hotplug support

Message ID 155327387405.225273.9325594075351253804.stgit@dwillia2-desk3.amr.corp.intel.com
Series mm: Sub-section memory hotplug support

Message

Dan Williams March 22, 2019, 4:57 p.m. UTC
Changes since v4 [1]:
- Given v4 was from March of 2017, the bulk of the changes result from
  rebasing the patch set from a v4.11-rc2 baseline to v5.1-rc1.

- A unit test is added to ndctl to exercise the creation and dax
  mounting of multiple independent namespaces in a single 128M section.

[1]: https://lwn.net/Articles/717383/

---

Quote patch7:

"The libnvdimm sub-system has suffered a series of hacks and broken
 workarounds for the memory-hotplug implementation's awkward
 section-aligned (128MB) granularity. For example the following backtrace
 is emitted when attempting arch_add_memory() with physical address
 ranges that intersect 'System RAM' (RAM) with 'Persistent Memory' (PMEM)
 within a given section:
 
  WARNING: CPU: 0 PID: 558 at kernel/memremap.c:300 devm_memremap_pages+0x3b5/0x4c0
  devm_memremap_pages attempted on mixed region [mem 0x200000000-0x2fbffffff flags 0x200]
  [..]
  Call Trace:
    dump_stack+0x86/0xc3
    __warn+0xcb/0xf0
    warn_slowpath_fmt+0x5f/0x80
    devm_memremap_pages+0x3b5/0x4c0
    __wrap_devm_memremap_pages+0x58/0x70 [nfit_test_iomap]
    pmem_attach_disk+0x19a/0x440 [nd_pmem]
 
 Recently it was discovered that the problem goes beyond RAM vs PMEM
 collisions as some platforms produce PMEM vs PMEM collisions within a
 given section. The libnvdimm workaround for that case revealed that the
 libnvdimm section-alignment-padding implementation has been broken for a
 long while. A fix for that long-standing breakage introduces as many
 problems as it solves as it would require a backward-incompatible change
 to the namespace metadata interpretation. Instead of that dubious route
 [2], address the root problem in the memory-hotplug implementation."

The approach taken is to observe that each section already maintains
an array of 'unsigned long' values to hold the pageblock_flags. A single
additional 'unsigned long' is added to house a 'sub-section active'
bitmask. Each bit tracks the mapped state of one sub-section's worth of
capacity which is SECTION_SIZE / BITS_PER_LONG, or 2MB on x86-64.
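
For illustration, a minimal userspace model of that bookkeeping (the names
here are invented for the sketch and only approximate the helpers in the
series):

  #include <stdio.h>

  #define SECTION_SIZE    (128UL << 20)        /* 128MB per section */
  #define SUBSECTION_SIZE (SECTION_SIZE / 64)  /* 2MB: one bit per sub-section */

  /* Mark [offset, offset + len) of a section active in its bitmask. */
  static void subsection_set(unsigned long *map, unsigned long offset,
                             unsigned long len)
  {
          unsigned long first = offset / SUBSECTION_SIZE;
          unsigned long last = (offset + len - 1) / SUBSECTION_SIZE;

          for (unsigned long i = first; i <= last; i++)
                  *map |= 1UL << i;
  }

  int main(void)
  {
          unsigned long map = 0;

          /* e.g. a 12MB namespace starting 4MB into a section */
          subsection_set(&map, 4UL << 20, 12UL << 20);
          printf("sub-section map: %#lx\n", map); /* bits 2..7 -> 0xfc */
          return 0;
  }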

The implication of allowing sections to be piecemeal mapped/unmapped is
that the valid_section() helper is no longer authoritative to determine
if a section is fully mapped. Instead pfn_valid() is updated to consult
the section-active bitmask. Given that typical memory hotplug still has
deep "section" dependencies the sub-section capability is limited to
'want_memblock=false' invocations of arch_add_memory(), effectively only
devm_memremap_pages() users for now.
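
A hedged sketch of what the updated pfn_valid() then boils down to, reusing
the toy model above (a userspace stand-in, not the patch itself):

  #include <stdbool.h>
  #include <stdio.h>

  #define PAGE_SHIFT           12
  #define PAGES_PER_SECTION    (1UL << (27 - PAGE_SHIFT)) /* 128MB of 4K pages */
  #define PAGES_PER_SUBSECTION (PAGES_PER_SECTION / 64)

  /* Valid iff the section exists AND the covering sub-section bit is set. */
  static bool pfn_valid_model(bool section_valid,
                              unsigned long subsection_map, unsigned long pfn)
  {
          unsigned long idx = (pfn % PAGES_PER_SECTION) / PAGES_PER_SUBSECTION;

          return section_valid && (subsection_map & (1UL << idx));
  }

  int main(void)
  {
          unsigned long map = 0xfcUL; /* only sub-sections 2..7 mapped */

          printf("%d\n", pfn_valid_model(true, map, 3 * PAGES_PER_SUBSECTION));  /* 1 */
          printf("%d\n", pfn_valid_model(true, map, 40 * PAGES_PER_SUBSECTION)); /* 0 */
          return 0;
  }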

With this in place the hacks in the libnvdimm sub-system can be
dropped, and other devm_memremap_pages() users need no longer be
constrained to 128MB mapping granularity.

[2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com

---

Dan Williams (10):
      mm/sparsemem: Introduce struct mem_section_usage
      mm/sparsemem: Introduce common definitions for the size and mask of a section
      mm/sparsemem: Add helpers track active portions of a section at boot
      mm/hotplug: Prepare shrink_{zone,pgdat}_span for sub-section removal
      mm/sparsemem: Convert kmalloc_section_memmap() to populate_section_memmap()
      mm/sparsemem: Prepare for sub-section ranges
      mm/sparsemem: Support sub-section hotplug
      mm/devm_memremap_pages: Enable sub-section remap
      libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields
      libnvdimm/pfn: Stop padding pmem namespaces to section alignment


 arch/x86/mm/init_64.c          |   15 +-
 drivers/nvdimm/dax_devs.c      |    2 
 drivers/nvdimm/pfn.h           |   12 -
 drivers/nvdimm/pfn_devs.c      |   93 +++-------
 include/linux/memory_hotplug.h |    7 -
 include/linux/mm.h             |    4 
 include/linux/mmzone.h         |   60 ++++++
 kernel/memremap.c              |   57 ++----
 mm/hmm.c                       |    2 
 mm/memory_hotplug.c            |  119 +++++++-----
 mm/page_alloc.c                |    6 -
 mm/sparse-vmemmap.c            |   21 +-
 mm/sparse.c                    |  382 ++++++++++++++++++++++++++++------------
 13 files changed, 476 insertions(+), 304 deletions(-)

Comments

Michal Hocko March 22, 2019, 6:05 p.m. UTC | #1
On Fri 22-03-19 09:57:54, Dan Williams wrote:
> [...]
> 
> The approach taken is to observe that each section already maintains
> an array of 'unsigned long' values to hold the pageblock_flags. A single
> additional 'unsigned long' is added to house a 'sub-section active'
> bitmask. Each bit tracks the mapped state of one sub-section's worth of
> capacity which is SECTION_SIZE / BITS_PER_LONG, or 2MB on x86-64.

So the hotpluggable unit is a pageblock now, right? Why is this
sufficient? What prevents new and creative HW to come up with
alignments that do not fit there? Do not get me wrong but the section
as a unit is deeply carved into the memory hotplug and removing all those
assumptions is a major undertaking and I would like to know that you are
not just shifting the problem to a smaller unit and a new/creative HW
will force us to go even more complicated.

What is the fundamental reason that pmem sections cannot be assigned
to a section aligned memory range? The physical address space is
quite large to impose 128MB sections IMHO. I thought this is merely a
configuration issue. How often does this really happen and how often is it
unavoidable?

> The implication of allowing sections to be piecemeal mapped/unmapped is
> that the valid_section() helper is no longer authoritative to determine
> if a section is fully mapped. Instead pfn_valid() is updated to consult
> the section-active bitmask. Given that typical memory hotplug still has
> deep "section" dependencies the sub-section capability is limited to
> 'want_memblock=false' invocations of arch_add_memory(), effectively only
> devm_memremap_pages() users for now.

Does this mean that pfn_valid is more expensive now? How much? For any
pfn? Also, what about the section lifetime? Who is removing sections now?

I will probably have many more questions, but it's Friday and I am mostly
offline already. I would just like to hear much more about the new
design and resulting assumptions.
Dan Williams March 22, 2019, 6:32 p.m. UTC | #2
On Fri, Mar 22, 2019 at 11:06 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Fri 22-03-19 09:57:54, Dan Williams wrote:
> > [...]
>
> So the hotpluggable unit is a pageblock now, right?

No, with this patchset the hotplug unit is 2MB.

> Why is this sufficient?

2MB is sufficient because it allows mapping a namespace at PMD
granularity and there is no practical need to go smaller.

> What prevents new and creative HW to come up with alignments that do not fit there?

There is a resource in hardware memory controllers called
address-decode-registers that control the mapping granularity. The
minimum granularity today is 64MB and the pressure as memory sizes
increases is to make that granularity larger, not smaller. So the
hardware pressure is going in the opposite direction of your concern,
at least for persistent memory.

User-defined memory namespaces have this problem, but 2MB is the
default alignment and is sufficient for most uses.

PCI Address BARs that are also mapped with devm_memremap_pages are
aligned to their size and there is no expectation to support smaller
than 2MB.

All that said, to support a smaller sub-section granularity, just add
more bits to the section-active bitmask.

> Do not get me wrong but the section
> as a unit is deeply carved into the memory hotplug and removing all those
> assumptions is a major undertaking

Right, as stated in the cover letter, this does not remove all those
assumptions, it only removes the ones that impact
devm_memremap_pages(). Specifying that sub-section is only supported
in the 'want_memblock=false' case to arch_add_memory().

> and I would like to know that you are
> not just shifting the problem to a smaller unit and a new/creative HW
> will force us to go even more complicated.

HW will not do this to us. It's software that has the problem.
Namespace creation is unnecessarily constrained to 128MB alignment.

I'm also open to exploring lifting the section alignment constraint
for the 'want_memblock=true', but first things first.

> What is the fundamental reason that pmem sections cannot be assigned
> to a section aligned memory range? The physical address space is
> quite large to impose 128MB sections IMHO. I thought this is merely a
> configuration issue.

1) it's not just hardware that imposes this, software wants to be able
to avoid the constraint

2) the flexibility of the memory controller initialization code is
constrained by address-decode-registers. So while it is simple to say
"just configure it to be aligned" it's not that easy in practice
without throwing away usable memory capacity.

> How often does this really happen and how often is it unavoidable?

Again, software can cause this problem at will. Multiple shipping
systems expose this alignment problem in physical address space, for
example: https://github.com/pmem/ndctl/issues/76

> > [...]
>
> Does this mean that pfn_valid is more expensive now? How much?

Negligible; the information to determine whether the sub-section is
valid for a given pfn is in the same cacheline as the section-valid
flag.

> Also, what about the section lifetime?

A section is live as long as any sub-section is active.

> Who is removing section now?

The last arch_remove_memory() call that removes the last sub-section clears
out the remaining sub-section active bits.
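
A sketch of that teardown in the same toy-model style as the cover letter
notes, with the behavior inferred from this thread rather than taken from
the patches:

  #include <stdbool.h>
  #include <stdio.h>

  #define SUBSECTION_SIZE (2UL << 20) /* 2MB per bit, per the cover letter */

  /* Clear [offset, offset + len); report whether the section is now empty. */
  static bool subsection_clear(unsigned long *map, unsigned long offset,
                               unsigned long len)
  {
          unsigned long first = offset / SUBSECTION_SIZE;
          unsigned long last = (offset + len - 1) / SUBSECTION_SIZE;

          for (unsigned long i = first; i <= last; i++)
                  *map &= ~(1UL << i);
          return *map == 0; /* last sub-section gone: tear down the section */
  }

  int main(void)
  {
          unsigned long map = 0xfcUL; /* sub-sections 2..7 active */

          printf("%d\n", subsection_clear(&map, 4UL << 20, 8UL << 20));  /* 0 */
          printf("%d\n", subsection_clear(&map, 12UL << 20, 4UL << 20)); /* 1 */
          return 0;
  }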

> I will probably have many more questions, but it's Friday and I am mostly
> offline already. I would just like to hear much more about the new
> design and resulting assumptions.

Happy to accommodate this discussion. The section alignment has been
an absolute horror to contend with. So I have years worth of pain to
share for as deep as you want to go on probing why this is needed.
Michal Hocko March 25, 2019, 10:19 a.m. UTC | #3
On Fri 22-03-19 11:32:11, Dan Williams wrote:
> On Fri, Mar 22, 2019 at 11:06 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Fri 22-03-19 09:57:54, Dan Williams wrote:
> > > [...]
> >
> > So the hotpluggable unit is a pageblock now, right?
> 
> No, with this patchset the hotplug unit is 2MB.

Which is a pageblock unit on x86 with hugetlb enabled. I was just
wondering whether this is really bound to pageblock or the math just
works out to be the same.

> > Why is this sufficient?
> 
> 2MB is sufficient because it allows mapping a namespace at PMD
> granularity and there is no practical need to go smaller.
> 
> > What prevents new and creative HW to come up with alignments that do not fit there?
> 
> There is a resource in hardware memory controllers called
> address-decode-registers that control the mapping granularity. The
> minimum granularity today is 64MB and the pressure as memory sizes
> increases is to make that granularity larger, not smaller. So the
> hardware pressure is going in the opposite direction of your concern,
> at least for persistent memory.

OK, this is good to know, and actually against the subsection direction.

> User-defined memory namespaces have this problem, but 2MB is the
> default alignment and is sufficient for most uses.

What prevents users from going and using a larger alignment?

> PCI Address BARs that are also mapped with devm_memremap_pages are
> aligned to their size and there is no expectation to support smaller
> than 2MB.
> 
> All that said, to support a smaller sub-section granularity, just add
> more bits to the section-active bitmask.
> 
> > Do not get me wrong but the section
> > as a unit is deeply carved into the memory hotplug and removing all those
> > assumptions is a major undertaking
> 
> Right, as stated in the cover letter, this does not remove all those
> assumptions, it only removes the ones that impact
> devm_memremap_pages(). Specifying that sub-section is only supported
> in the 'want_memblock=false' case to arch_add_memory().

And this is exactly the problem. Having different assumptions depending
on whether there is a memblock interface or not is utterly wrong and a
maintainability mess.

> > and I would like to know that you are
> > not just shifting the problem to a smaller unit and a new/creative HW
> > will force us to go even more complicated.
> 
> HW will not do this to us. It's software that has the problem.
> Namespace creation is unnecessarily constrained to 128MB alignment.

And why is that a problem? A lack of documentation that this is a
requirement? Something will not work with a larger alignment? Something
else?

Why do we have to go a mile to tweak the kernel, especially something as
fragile as memory hotplug, just to support sub mem section ranges? This
is something that is not clearly explained in the cover letter. Sure you
are talking about hacks at the higher level to deal with this but I do
not see any fundamental reason to actually support that at all.

> I'm also open to exploring lifting the section alignment constraint
> for the 'want_memblock=true', but first things first.

I disagree. If you want to get rid of the section requirement then
do it first and build on top. This is a normal kernel development
process.

> > What is the fundamental reason that pmem sections cannot be assigned
> > to a section aligned memory range? The physical address space is
> > quite large to impose 128MB sections IMHO. I thought this is merely a
> > configuration issue.
> 
> 1) it's not just hardware that imposes this, software wants to be able
> to avoid the constraint
> 
> 2) the flexibility of the memory controller initialization code is
> constrained by address-decode-registers. So while it is simple to say
> "just configure it to be aligned" it's not that easy in practice
> without throwing away usable memory capacity.

Yes, and we are talking about 128MB. Is sacrificing that unit worth all the
trouble?

[...]

> > I will probably have many more questions, but it's Friday and I am mostly
> > offline already. I would just like to hear much more about the new
> > design and resulting assumptions.
> 
> Happy to accommodate this discussion. The section alignment has been
> an absolute horror to contend with. So I have years worth of pain to
> share for as deep as you want to go on probing why this is needed.

I can feel your frustration. I am not entirely happy about the section
size limitation myself but you have to realize that this is a simplicity vs.
feature-set compromise. It works reasonably well for many usecases but
falls flat on some others. But you cannot simply build on top of
existing foundations and tweak some code paths to handle one particular
case. This is exactly how the memory hotplug ended in the unfortunate
state it is now. If you want to make the code more reusable then there
is a _lot_ of ground work first before you can add a shiny new feature.
Jeff Moyer March 25, 2019, 2:28 p.m. UTC | #4
Michal Hocko <mhocko@kernel.org> writes:

>> > and I would like to know that you are
>> > not just shifting the problem to a smaller unit and a new/creative HW
>> > will force us to go even more complicated.
>> 
>> HW will not do this to us. It's software that has the problem.
>> Namespace creation is unnecessarily constrained to 128MB alignment.
>
> And why is that a problem? A lack of documentation that this is a
> requirement? Something will not work with a larger alignment? Something
> else?

See this email for one user-visible problem:
  https://lore.kernel.org/lkml/x49imxbx22d.fsf@segfault.boston.devel.redhat.com/

Cheers,
Jeff
Michal Hocko March 25, 2019, 2:50 p.m. UTC | #5
On Mon 25-03-19 10:28:00, Jeff Moyer wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> 
> >> > and I would like to know that you are
> >> > not just shifting the problem to a smaller unit and a new/creative HW
> >> > will force us to go even more complicated.
> >> 
> >> HW will not do this to us. It's software that has the problem.
> >> Namespace creation is unnecessarily constrained to 128MB alignment.
> >
> > And why is that a problem? A lack of documentation that this is a
> > requirement? Something will not work with a larger alignment? Something
> > else?
> 
> See this email for one user-visible problem:
>   https://lore.kernel.org/lkml/x49imxbx22d.fsf@segfault.boston.devel.redhat.com/

: # ndctl create-namespace -m fsdax -s 128m
:   Error: '--size=' must align to interleave-width: 6 and alignment: 2097152
:   did you intend --size=132M?
: 
: failed to create namespace: Invalid argument

So the size is in section size units. So what prevents userspace from
using a proper alignment? I am sorry if this is a stupid question, but I am
not really familiar with ndctl nor the pmem side of it.
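
One reading of the ndctl error quoted above: the size must be a multiple of
interleave-width times alignment, i.e. 6 * 2MiB = 12MiB, which 128M is not
but 132M is. A quick check of that arithmetic (an assumed rule, not ndctl's
actual code):

  #include <stdio.h>

  int main(void)
  {
          unsigned long long step = 6ULL * (2 << 20); /* width * alignment = 12MiB */

          printf("128M %% 12M = %llu\n", (128ULL << 20) % step); /* 8388608: rejected */
          printf("132M %% 12M = %llu\n", (132ULL << 20) % step); /* 0: accepted */
          return 0;
  }
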
Dan Williams March 25, 2019, 8:03 p.m. UTC | #6
On Mon, Mar 25, 2019 at 3:20 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Fri 22-03-19 11:32:11, Dan Williams wrote:
> > On Fri, Mar 22, 2019 at 11:06 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Fri 22-03-19 09:57:54, Dan Williams wrote:
> > > > [...]
> > >
> > > So the hotpluggable unit is a pageblock now, right?
> >
> > No, with this patchset the hotplug unit is 2MB.
>
> Which is a pageblock unit on x86 with hugetlb enabled. I was just
> wondering whether this is really bound to pageblock or the math just
> works out to be the same.

Ah, ok just coincidental math.

> > > Why is this sufficient?
> >
> > 2MB is sufficient because it allows mapping a namespace at PMD
> > granularity and there is no practical need to go smaller.
> >
> > > What prevents new and creative HW to come up with alignments that do not fit there?
> >
> > There is a resource in hardware memory controllers called
> > address-decode-registers that control the mapping granularity. The
> > minimum granularity today is 64MB and the pressure as memory sizes
> > increases is to make that granularity larger, not smaller. So the
> > hardware pressure is going in the opposite direction of your concern,
> > at least for persistent memory.
>
> OK, this is good to know, and actually against the subsection direction.

Seems I forgot to mention timescales. The 64MB granularity is present
on current generation platforms, and I expect multiple platform
generations (potentially years) until it might change in the future.

That does not even take into consideration the configuration
flexibility of PCI BAR ranges and the interaction with the
peer-to-peer DMA facility which maps 'struct page' for physical ranges
that are not memory. There is no pressure for PCI BAR ranges to submit
to a 128MB alignment.

> > User-defined memory namespaces have this problem, but 2MB is the
> > default alignment and is sufficient for most uses.
>
> What prevents users from going and using a larger alignment?

Given that we are living with 64MB granularity on mainstream platforms
for the foreseeable future, the reason users can't rely on a larger
alignment to address the issue is that the physical alignment may
change from one boot to the next.

No, you can't just wish hardware / platform firmware won't do this,
because there are not enough platform resources to give every hardware
device a guaranteed alignment.

The effect is that even if the driver deploys a software alignment
mitigation when it first sees the persistent memory range, that
alignment can be violated on a subsequent boot leading to data being
unavailable. There is no facility to communicate to the administrator
what went wrong in this scenario as several events can trigger a
physical map layout change. Add / remove of hardware and hardware
failure are the most likely causes.

An additional pain point for users is that the EFI pre-boot environment
has little chance to create a namespace that Linux might be able to
use. The section size is an arbitrary Linux constraint and we should
not encode something Linux specific that might change in the future
into OS agnostic software.

> > PCI Address BARs that are also mapped with devm_memremap_pages are
> > aligned to their size and there is no expectation to support smaller
> > than 2MB.
> >
> > All that said, to support a smaller sub-section granularity, just add
> > more bits to the section-active bitmask.
> >
> > > Do not get me wrong but the section
> > > as a unit is deeply carved into the memory hotplug and removing all those
> > > assumptions is a major undertaking
> >
> > Right, as stated in the cover letter, this does not remove all those
> > assumptions, it only removes the ones that impact
> > devm_memremap_pages(). Specifying that sub-section is only supported
> > in the 'want_memblock=false' case to arch_add_memory().
>
> And this is exactly the problem. Having different assumptions depending
> on whether there is a memblock interface or not is utterly wrong and a
> maintainability mess.

In this case I disagree with you. The hotplug code already has the
want_memblock=false semantic in the implementation. The sub-section
hotplug infrastructure is a strict superset of what is there already.
Now, if it created parallel infrastructure that would indeed be a
maintainability burden, but in this case there are no behavior changes
for typical memory hotplug as it just hotplugs full sections at a time
like always. The 'section' concept is not going away.

> > > and I would like to know that you are
> > > not just shifting the problem to a smaller unit and a new/creative HW
> > > will force us to go even more complicated.
> >
> > HW will not do this to us. It's software that has the problem.
> > Namespace creation is unnecessarily constrained to 128MB alignment.
>
> And why is that a problem?

Data loss, inability to cope with some hardware configurations, and
difficulty interoperating with non-Linux software.

> A lack of documentation that this is a requirement?

It's not a requirement. It's an arbitrary Linux implementation detail.

> Something will not work with a larger alignment? Something else?
[..]
> Why do we have to go a mile to tweak the kernel, especially something as
> fragile as memory hotplug, just to support sub mem section ranges? This
> is something that is not clearly explained in the cover letter. Sure you
> are talking about hacks at the higher level to deal with this but I do
> not see any fundamental reason to actually support that at all.

Like it or not, 'struct page' mappings for arbitrary hardware-physical
memory ranges are a facility that has grown from the pmem case, to hmm,
and peer-to-peer DMA. Unless you want to do the work to eliminate the
'struct page' requirement across the kernel I think it is unreasonable
to effectively archive the arch_add_memory() implementation and
prevent it from reacting to growing demands.

Note that I did try to eliminate 'struct page' before creating
devm_memremap_pages(), that effort failed because 'struct page' is
just too ingrained into too many kernel code paths.

> > I'm also open to exploring lifting the section alignment constraint
> > for the 'want_memblock=true', but first things first.
>
> I disagree. If you want to get rid of the section requirement then
> do it first and build on top. This is a normal kernel development
> process.
>
> > > What is the fundamental reason that pmem sections cannot be assigned
> > > to a section aligned memory range? The physical address space is
> > > quite large to impose 128MB sections IMHO. I thought this is merely a
> > > configuration issue.
> >
> > 1) it's not just hardware that imposes this, software wants to be able
> > to avoid the constraint
> >
> > 2) the flexibility of the memory controller initialization code is
> > constrained by address-decode-registers. So while it is simple to say
> > "just configure it to be aligned" it's not that easy in practice
> > without throwing away usable memory capacity.
>
> Yes, and we are talking about 128MB. Is sacrificing that unit worth all the
> trouble?
>
> [...]
>
> > > I will probably have many more questions, but it's Friday and I am mostly
> > > offline already. I would just like to hear much more about the new
> > > design and resulting assumptions.
> >
> > Happy to accommodate this discussion. The section alignment has been
> > an absolute horror to contend with. So I have years worth of pain to
> > share for as deep as you want to go on probing why this is needed.
>
> I can feel your frustration. I am not entirely happy about the section
> size limitation myself but you have to realize that this is a simplicity vs.
> feature-set compromise.

You have to realize that arch_add_memory() is no longer just a
front-end for typical memory hotplug. The requirements have changed.
Simplicity should be maintained for as long as it can get the job
done, and the simplicity is currently failing.

> It works reasonably well for many usecases but
> falls flat on some others. But you cannot simply build on top of
> existing foundations and tweak some code paths to handle one particular
> case. This is exactly how the memory hotplug ended in the unfortunate
> state it is now. If you want to make the code more reusable then there
> is a _lot_ of ground work first before you can add a shiny new feature.

This *is* the ground work. It solves today's hardware page-mapping
pain points with a path to go deeper and enable sub-section hotplug
for typical memory hotplug, but in the meantime the two cases share
common code paths. This is not a tweak on the side, it's a super-set,
and typical memory-hotplug is properly fixed up to use just the subset
that it needs.
Michal Hocko March 26, 2019, 8:04 a.m. UTC | #7
On Mon 25-03-19 13:03:47, Dan Williams wrote:
> On Mon, Mar 25, 2019 at 3:20 AM Michal Hocko <mhocko@kernel.org> wrote:
[...]
> > > User-defined memory namespaces have this problem, but 2MB is the
> > > default alignment and is sufficient for most uses.
> >
> > What prevents users from going and using a larger alignment?
> 
> Given that we are living with 64MB granularity on mainstream platforms
> for the foreseeable future, the reason users can't rely on a larger
> alignment to address the issue is that the physical alignment may
> change from one boot to the next.

I would love to learn more about this inter-boot volatility. Could you
expand on that some more? I thought that the HW configuration presented
to the OS would be more or less stable unless the underlying HW changes.

> No, you can't just wish hardware / platform firmware won't do this,
> because there are not enough platform resources to give every hardware
> device a guaranteed alignment.

Guarantee is one part and I can see how nobody wants to give you
something as strong but how often does that happen in real life?

> The effect is that even if the driver deploys a software alignment
> mitigation when it first sees the persistent memory range, that
> alignment can be violated on a subsequent boot leading to data being
> unavailable. There is no facility to communicate to the administrator
> what went wrong in this scenario as several events can trigger a
> physical map layout change. Add / remove of hardware and hardware
> failure are the most likely causes.

This is indeed bad and unexpected! That is exactly something to have in
the changelog!

> An additional pain point for users is that the EFI pre-boot environment
> has little chance to create a namespace that Linux might be able to
> use. The section size is an arbitrary Linux constraint and we should
> not encode something Linux specific that might change in the future
> into OS agnostic software.

This looks like a fair point, but please keep in mind that there are hotplug
restrictions on other platforms as well (4MB on Windows IIRC), so
there will be some knowledge required all the time. Besides that there
are likely to be some restrictions depending on the implementation.

[...]
> > > Right, as stated in the cover letter, this does not remove all those
> > > assumptions, it only removes the ones that impact
> > > devm_memremap_pages(). Specifying that sub-section is only supported
> > > in the 'want_memblock=false' case to arch_add_memory().
> >
> > And this is exactly the problem. Having different assumptions depending
> > on whether there is a memblock interface or not is utterly wrong and a
> > maintainability mess.
> 
> In this case I disagree with you. The hotplug code already has the
> want_memblock=false semantic in the implementation.

want_memblock was a hack to allow memory hotplug to not have a
user-visible sysfs interface. It was added to reduce the code duplication
IIRC. Besides that this hasn't changed the underlying assumptions about
hotplugable units or other invariants that were in place.

> The sub-section
> hotplug infrastructure is a strict superset of what is there already.
> Now, if it created parallel infrastructure that would indeed be a
> maintainability burden, but in this case there are no behavior changes
> for typical memory hotplug as it just hotplugs full sections at a time
> like always. The 'section' concept is not going away.

You are really neglecting many details here. E.g. a memory section can be
shared between two different types of memory. We've had some bugs in the
hotplug code when one section can be shared between two different NUMA
nodes (e.g. 4aa9fc2a435a ("Revert "mm, memory_hotplug: initialize struct
pages for the full memory section""). We do not allow to hotremove such
sections because it would open another can of worms. I am not saying
your implementation is incorrect - still haven't had time to look deeply -
but stating that this is a strict superset of want_memblock is simply
wrong.
 
[...]
> > Why do we have to go a mile to tweak the kernel, especially something as
> > fragile as memory hotplug, just to support sub mem section ranges? This
> > is something that is not clearly explained in the cover letter. Sure you
> > are talking about hacks at the higher level to deal with this but I do
> > not see any fundamental reason to actually support that at all.
> 
> Like it or not, 'struct page' mappings for arbitrary hardware-physical
> memory ranges are a facility that has grown from the pmem case, to hmm,
> and peer-to-peer DMA. Unless you want to do the work to eliminate the
> 'struct page' requirement across the kernel I think it is unreasonable
> to effectively archive the arch_add_memory() implementation and
> prevent it from reacting to growing demands.

I am definitely not blocking memory hotplug from being reused more! All I am
saying is that there is much more ground work to be done before you can
add features like that. There are some general assumptions in the code,
like it or not, and you should start by removing those to build on top.
Pmem/nvdimm development is full of the "we have to do it now and find a
way to graft it into the existing infrastructure" pattern that I really
hate. Clean up will come later, I have heard. Have a look at all
zone_device hacks that remained. Why is this any different?

And just to make myself clear. There are places where section cannot go
away because that is the unit in which the memory model maintains struct
pages. But the hotplug code is full of constructs where we iterate mem
sections as one unit and operate on them as a whole. Those have to go away
before you can consider subsection hotadd/remove.

> > I can feel your frustration. I am not entirely happy about the section
> > size limitation myself but you have to realize that this is simplicy vs.
> > feature set compromise.
> 
> You have to realize that arch_add_memory() is no longer just a
> front-end for typical memory hotplug. The requirements have changed.
> Simplicity should be maintained for as long as it can get the job
> done, and the simplicity is currently failing.

I do agree. But you also have to realize that this requires a lot of
work. As long as users of the api are not willing to do that work then
I am afraid the facility will remain dumb. But putting in hacks to make
a specific usecase (almost) work is not the right way.
Dan Williams March 27, 2019, 12:20 a.m. UTC | #8
On Tue, Mar 26, 2019 at 1:04 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 25-03-19 13:03:47, Dan Williams wrote:
> > On Mon, Mar 25, 2019 at 3:20 AM Michal Hocko <mhocko@kernel.org> wrote:
> [...]
> > > > User-defined memory namespaces have this problem, but 2MB is the
> > > > default alignment and is sufficient for most uses.
> > >
> > > What prevents users from going and using a larger alignment?
> >
> > Given that we are living with 64MB granularity on mainstream platforms
> > for the foreseeable future, the reason users can't rely on a larger
> > alignment to address the issue is that the physical alignment may
> > change from one boot to the next.
>
> I would love to learn more about this inter-boot volatility. Could you
> expand on that some more? I thought that the HW configuration presented
> to the OS would be more or less stable unless the underlying HW changes.

Even if the configuration is static there can be hardware failures
that prevent a DIMM or a PCI device from being included in the memory map.
When that happens the BIOS needs to re-layout the map and the result
is not guaranteed to maintain the previous alignment.

> > No, you can't just wish hardware / platform firmware won't do this,
> > because there are not enough platform resources to give every hardware
> > device a guaranteed alignment.
>
> Guarantee is one part and I can see how nobody wants to give you
> something as strong but how often does that happen in real life?

I expect a "rare" event to happen every day in a data-center fleet.
Failure rates tend towards 100% daily occurrence at scale and in this
case the kernel has everything it needs to mitigate such an event.

Setting aside the success rate of a software-alignment mitigation, the
reason I am charging this hill again after a 2 year hiatus is the
realization that this problem is wider spread than the original
failing scenario. Back in 2017 the problem seemed limited to custom
memmap= configurations, and collisions between PMEM and System RAM.
Now it is clear that the collisions can happen between PMEM regions
and namespaces as well, and the problem spans platforms from multiple
vendors. Here is the most recent collision problem:
https://github.com/pmem/ndctl/issues/76, from a third-party platform.

The fix for that issue uncovered a bug in the padding implementation,
and a fix for that bug would result in even more hacks in the nvdimm
code for what is a core kernel deficiency. Code review of those
changes resulted in changing direction to go after the core
deficiency.

> > The effect is that even if the driver deploys a software alignment
> > mitigation when it first sees the persistent memory range, that
> > alignment can be violated on a subsequent boot leading to data being
> > unavailable. There is no facility to communicate to the administrator
> > what went wrong in this scenario as several events can trigger a
> > physical map layout change. Add / remove of hardware and hardware
> > failure are the most likely causes.
>
> This is indeed bad and unexpected! That is exactly something to have in
> the changelog!

Apologies, that was indeed included in the 2017 changelog (see: "a user
could inadvertently lose access to nvdimm namespaces" note here:
https://lwn.net/Articles/717383/), and I failed to carry it forward.

>
> > An additional pain point for users is that the EFI pre-boot environment
> > has little chance to create a namespace that Linux might be able to
> > use. The section size is an arbitrary Linux constraint and we should
> > not encode something Linux specific that might change in the future
> > into OS agnostic software.
>
> This looks like a fair point, but please keep in mind that there are hotplug
> restrictions on other platforms as well (4MB on Windows IIRC), so
> there will be some knowledge required all the time. Besides that there
> are likely to be some restrictions depending on the implementation.

Windows does not have an equivalent constraint, so it's only Linux
that imposes an arbitrary alignment restriction on pmem to agents like
EFI.

> [...]
> > > > Right, as stated in the cover letter, this does not remove all those
> > > > assumptions, it only removes the ones that impact
> > > > devm_memremap_pages(). Specifying that sub-section is only supported
> > > > in the 'want_memblock=false' case to arch_add_memory().
> > >
> > > And this is exactly the problem. Having different assumptions depending
> > > on whether there is a memblock interface or not is utterly wrong and a
> > > maintainability mess.
> >
> > In this case I disagree with you. The hotplug code already has the
> > want_memblock=false semantic in the implementation.
>
> want_memblock was a hack to allow memory hotplug to not have a
> user-visible sysfs interface. It was added to reduce the code duplication
> IIRC. Besides that this hasn't changed the underlying assumptions about
> hotplugable units or other invariants that were in place.

Neither does this patch series for the typical memory hotplug case.
For the device-memory use case I've gone through and fixed up the
underlying assumptions.

> > The sub-section
> > hotplug infrastructure is a strict superset of what is there already.
> > Now, if it created parallel infrastructure that would indeed be a
> > maintainability burden, but in this case there are no behavior changes
> > for typical memory hotplug as it just hotplugs full sections at a time
> > like always. The 'section' concept is not going away.
>
> You are really neglecting many details here. E.g. a memory section can be
> shared between two different types of memory. We've had some bugs in the
> hotplug code when one section can be shared between two different NUMA
> nodes (e.g. 4aa9fc2a435a ("Revert "mm, memory_hotplug: initialize struct
> pages for the full memory section""). We do not allow to hotremove such
> sections because it would open another can of worms. I am not saying
> your implementation is incorrect - still haven't had time to look deeply -
> but stating that this is a strict superset of want_memblock is simply
> wrong.

Please have a look at the code and the handling of "early" sections.
The assertion that I neglected to consider that detail is not true.

My "superset" contention is from the arch_add_memory() api
perspective. All typical memory hotplug use cases are a sub-case of
the new support.

> [...]
> > > Why do we have to go a mile to tweak the kernel, especially something as
> > > fragile as memory hotplug, just to support sub mem section ranges? This
> > > is something that is not clearly explained in the cover letter. Sure you
> > > are talking about hacks at the higher level to deal with this but I do
> > > not see any fundamental reason to actually support that at all.
> >
> > Like it or not, 'struct page' mappings for arbitrary hardware-physical
> > memory ranges are a facility that has grown from the pmem case, to hmm,
> > and peer-to-peer DMA. Unless you want to do the work to eliminate the
> > 'struct page' requirement across the kernel I think it is unreasonable
> > to effectively archive the arch_add_memory() implementation and
> > prevent it from reacting to growing demands.
>
> I am definitely not blocking memory hotplug from being reused more! All I am
> saying is that there is much more ground work to be done before you can
> add features like that. There are some general assumptions in the code,
> like it or not, and you should start by removing those to build on top.

Let's talk about specifics please, because I don't think you've had a
chance to consider the details in the patches. Your "start by removing
those [assumptions] to build on top" request is indeed what the
preparation patches in this series aim to achieve.

The general assumptions of the current (pre-patch-series) implementation are:

- Sections that describe boot memory (early sections) are never
unplugged / removed.

- pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y case, devolves to a
valid_section() check

- __add_pages() and helper routines assume all operations occur in
PAGES_PER_SECTION units.

- the memblock sysfs interface only comprehends full sections

Those assumptions are removed / handled with the following
implementation details respectively:

- Partially populated early sections can be extended with additional
sub-sections, and those sub-sections can be removed with
arch_remove_memory(). With this in place we no longer lose usable
memory capacity to padding.

- pfn_valid() goes beyond valid_section() to also check the
active-sub-section mask. As stated before this indication is in the
same cacheline as the valid_section() so the performance impact is
expected to be negligible. So far the lkp robot has not reported any
regressions.

- Outside of the core vmemmap population routines which are replaced,
other helper routines like shrink_{zone,pgdat}_span() are updated to
handle the smaller granularity. Core memory hotplug routines that deal
with online memory are not updated. That's a feature not a bug until
we decide that sub-section hotplug makes sense for online / typical
memory as well.

- the existing memblock sysfs user api guarantees / assumptions are
not touched since this capability is limited to !online
!sysfs-accessible sections for now.

> Pmem/nvdimm development is full of the "we have to do it now and find a
> way to graft it into the existing infrastructure" pattern that I really
> hate. Clean up will come later, I have heard. Have a look at all
> zone_device hacks that remained. Why is this any different?

This is indeed different because unlike memmap_init_zone_device(),
which is arguably a side-hack to move 'struct page' init outside the
mem_hotplug_lock just for ZONE_DEVICE, this implementation is reused
in the main memory hotplug path. It's not a "temporary implementation
until something better comes along", it moves the implementation
forward not sideways.

> And just to make myself clear. There are places where section cannot go
> away because that is the unit in which the memory model maintains struct
> pages. But the hotplug code is full of constructs where we iterate mem
> sections as one unit and operate on them as a whole. Those have to go away
> before you can consider subsection hotadd/remove.
>
> > > I can feel your frustration. I am not entirely happy about the section
> > > size limitation myself but you have to realize that this is a simplicity vs.
> > > feature-set compromise.
> >
> > You have to realize that arch_add_memory() is no longer just a
> > front-end for typical memory hotplug. The requirements have changed.
> > Simplicity should be maintained for as long as it can get the job
> > done, and the simplicity is currently failing.
>
> I do agree. But you also have to realize that this requires a lot of
> work. As long as users of the api are not willing to do that work then
> I am afraid the facility will remain dumb. But putting in hacks to make
> a specific usecase (almost) work is not the right way.

Please look at the patches. This isn't a half-step, it's a solution to
a problem that has haunted the implementation for years. If there are
opportunities for additional cleanups please point them out.
Michal Hocko March 27, 2019, 4:13 p.m. UTC | #9
On Tue 26-03-19 17:20:41, Dan Williams wrote:
> On Tue, Mar 26, 2019 at 1:04 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > [...]
> >
> > I would love to learn more about this inter-boot volatility. Could you
> > expand on that some more? I thought that the HW configuration presented
> > to the OS would be more or less stable unless the underlying HW changes.
> 
> Even if the configuration is static there can be hardware failures
> that prevent a DIMM or a PCI device from being included in the memory map.
> When that happens the BIOS needs to re-layout the map and the result
> is not guaranteed to maintain the previous alignment.
> 
> > > No, you can't just wish hardware / platform firmware won't do this,
> > > because there are not enough platform resources to give every hardware
> > > device a guaranteed alignment.
> >
> > Guarantee is one part and I can see how nobody wants to give you
> > something as strong but how often does that happen in real life?
> 
> I expect a "rare" event to happen every day in a data-center fleet.
> Failure rates tend towards 100% daily occurrence at scale and in this
> case the kernel has everything it needs to mitigate such an event.
> 
> Setting aside the success rate of a software-alignment mitigation, the
> reason I am charging this hill again after a 2 year hiatus is the
> realization that this problem is wider spread than the original
> failing scenario. Back in 2017 the problem seemed limited to custom
> memmap= configurations, and collisions between PMEM and System RAM.
> Now it is clear that the collisions can happen between PMEM regions
> and namespaces as well, and the problem spans platforms from multiple
> vendors. Here is the most recent collision problem:
> https://github.com/pmem/ndctl/issues/76, from a third-party platform.
> 
> The fix for that issue uncovered a bug in the padding implementation,
> and a fix for that bug would result in even more hacks in the nvdimm
> code for what is a core kernel deficiency. Code review of those
> changes resulted in changing direction to go after the core
> deficiency.

This kind of information along with real-world examples is exactly what
you should have added to the cover letter. The previous very vague
claims were not really convincing, nor something that can be considered a
proper justification. Please do realize that people who are not working
with the affected HW are unlikely to have an idea how serious/relevant
those problems really are.

People are asking for a smaller memory hotplug granularity for other
usecases (e.g. memory ballooning into VMs) which are quite dubious to
be honest and not really worth all the code rework. If we are talking
about something that can be worked around elsewhere then it is preferred
because the code base is not in an excellent shape and putting more on
top is just going to cause more headaches.

I will try to find some time to review this more deeply (no promises
though because time is hectic and this is not a simple feature). For the
future, please try harder to write up a proper justification and a
high-level design description which tells a bit about all important parts
of the new scheme.
Dan Williams March 27, 2019, 4:17 p.m. UTC | #10
On Wed, Mar 27, 2019 at 9:13 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 26-03-19 17:20:41, Dan Williams wrote:
> > On Tue, Mar 26, 2019 at 1:04 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Mon 25-03-19 13:03:47, Dan Williams wrote:
> > > > On Mon, Mar 25, 2019 at 3:20 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > [...]
> > > > > > User-defined memory namespaces have this problem, but 2MB is the
> > > > > > default alignment and is sufficient for most uses.
> > > > >
> > > > > What does prevent users to go and use a larger alignment?
> > > >
> > > > Given that we are living with 64MB granularity on mainstream platforms
> > > > for the foreseeable future, the reason users can't rely on a larger
> > > > alignment to address the issue is that the physical alignment may
> > > > change from one boot to the next.
> > >
> > > I would love to learn more about this inter boot volatility. Could you
> > > expand on that some more? I thought that the HW configuration presented
> > > to the OS would be more or less stable unless the underlying HW changes.
> >
> > Even if the configuration is static there can be hardware failures
> > that prevent a DIMM or a PCI device from being included in the memory map.
> > When that happens the BIOS needs to re-layout the map and the result
> > is not guaranteed to maintain the previous alignment.
> >
> > > > No, you can't just wish hardware / platform firmware won't do this,
> > > > because there are not enough platform resources to give every hardware
> > > > device a guaranteed alignment.
> > >
> > > Guarantee is one part and I can see how nobody wants to give you
> > > something as strong but how often does that happen in real life?
> >
> > I expect a "rare" event to happen every day in a data-center fleet.
> > Failure rates tend towards 100% daily occurrence at scale and in this
> > case the kernel has everything it needs to mitigate such an event.
> >
> > Setting aside the success rate of a software-alignment mitigation, the
> > reason I am charging this hill again after a 2 year hiatus is the
> > realization that this problem is wider spread than the original
> > failing scenario. Back in 2017 the problem seemed limited to custom
> > memmap= configurations, and collisions between PMEM and System RAM.
> > Now it is clear that the collisions can happen between PMEM regions
> > and namespaces as well, and the problem spans platforms from multiple
> > vendors. Here is the most recent collision problem:
> > https://github.com/pmem/ndctl/issues/76, from a third-party platform.
> >
> > The fix for that issue uncovered a bug in the padding implementation,
> > and a fix for that bug would result in even more hacks in the nvdimm
> > code for what is a core kernel deficiency. Code review of those
> > changes resulted in changing direction to go after the core
> > deficiency.
>
> This kind of information, along with real-world examples, is exactly
> what you should have added to the cover letter. The previous, very
> vague claims were not really convincing, nor something that can be
> considered a proper justification. Please do realize that people who
> are not working with the affected HW are unlikely to have an idea how
> serious/relevant those problems really are.
>
> People are asking for a smaller memory hotplug granularity for other
> use cases (e.g. memory ballooning into VMs) which are, to be honest,
> quite dubious and not really worth all the code rework. If we are
> talking about something that can be worked around elsewhere then that
> is preferred, because the code base is not in an excellent shape and
> putting more on top is just going to cause more headaches.
>
> I will try to find some time to review this more deeply (no promises
> though because time is hectic and this is not a simple feature). For the
> future, please try harder to write up a proper justification and a
> high-level design description which tells a bit about all important parts
> of the new scheme.

Fair enough. I've been steeped in this for too long, and should have
taken a wider view to bring reviewers up to speed.
David Hildenbrand March 28, 2019, 1:38 p.m. UTC | #11
On 27.03.19 17:13, Michal Hocko wrote:
> On Tue 26-03-19 17:20:41, Dan Williams wrote:
>> On Tue, Mar 26, 2019 at 1:04 AM Michal Hocko <mhocko@kernel.org> wrote:
>>>
>>> On Mon 25-03-19 13:03:47, Dan Williams wrote:
>>>> On Mon, Mar 25, 2019 at 3:20 AM Michal Hocko <mhocko@kernel.org> wrote:
>>> [...]
>>>>>> User-defined memory namespaces have this problem, but 2MB is the
>>>>>> default alignment and is sufficient for most uses.
>>>>>
>>>>> What does prevent users to go and use a larger alignment?
>>>>
>>>> Given that we are living with 64MB granularity on mainstream platforms
>>>> for the foreseeable future, the reason users can't rely on a larger
>>>> alignment to address the issue is that the physical alignment may
>>>> change from one boot to the next.
>>>
>>> I would love to learn more about this inter boot volatility. Could you
>>> expand on that some more? I thought that the HW configuration presented
>>> to the OS would be more or less stable unless the underlying HW changes.
>>
>> Even if the configuration is static there can be hardware failures
>> that prevent a DIMM or a PCI device from being included in the memory map.
>> When that happens the BIOS needs to re-layout the map and the result
>> is not guaranteed to maintain the previous alignment.
>>
>>>> No, you can't just wish hardware / platform firmware won't do this,
>>>> because there are not enough platform resources to give every hardware
>>>> device a guaranteed alignment.
>>>
>>> Guarantee is one part and I can see how nobody wants to give you
>>> something as strong but how often does that happen in real life?
>>
>> I expect a "rare" event to happen every day in a data-center fleet.
>> Failure rates tend towards 100% daily occurrence at scale and in this
>> case the kernel has everything it needs to mitigate such an event.
>>
>> Setting aside the success rate of a software-alignment mitigation, the
>> reason I am charging this hill again after a 2 year hiatus is the
>> realization that this problem is wider spread than the original
>> failing scenario. Back in 2017 the problem seemed limited to custom
>> memmap= configurations, and collisions between PMEM and System RAM.
>> Now it is clear that the collisions can happen between PMEM regions
>> and namespaces as well, and the problem spans platforms from multiple
>> vendors. Here is the most recent collision problem:
>> https://github.com/pmem/ndctl/issues/76, from a third-party platform.
>>
>> The fix for that issue uncovered a bug in the padding implementation,
>> and a fix for that bug would result in even more hacks in the nvdimm
>> code for what is a core kernel deficiency. Code review of those
>> changes resulted in changing direction to go after the core
>> deficiency.
> 
> This kind of information, along with real-world examples, is exactly
> what you should have added to the cover letter. The previous, very
> vague claims were not really convincing, nor something that can be
> considered a proper justification. Please do realize that people who
> are not working with the affected HW are unlikely to have an idea how
> serious/relevant those problems really are.
> 
> People are asking for a smaller memory hotplug granularity for other
> use cases (e.g. memory ballooning into VMs) which are, to be honest,
> quite dubious and not really worth all the code rework. If we are
> talking about something that can be worked around elsewhere then that
> is preferred, because the code base is not in an excellent shape and
> putting more on top is just going to cause more headaches.

At least for virtio-mem, it will be handled similarly to xen-balloon
and hyper-v balloon, where whole sections are added and some parts are
kept "soft-offline". But there, one device "owns" the complete
section; it does not overlap with other devices. One section only has
one owner.

As we discussed a similar approach back then with virtio-mem
(online/offline of smaller blocks), you had a pretty good point that
such complexity is better avoided in core MM. Sections really seem to be
the granularity with which core MM should work. At least speaking about
!pmem memory hotplug.

> 
> I will try to find some time to review this more deeply (no promises
> though because time is hectic and this is not a simple feature). For the
> future, please try harder to write up a proper justification and a
> high-level design description which tells a bit about all important parts
> of the new scheme.
>
Michal Hocko March 28, 2019, 2:16 p.m. UTC | #12
On Thu 28-03-19 14:38:15, David Hildenbrand wrote:
> On 27.03.19 17:13, Michal Hocko wrote:
[...]
> > People are asking for a smaller memory hotplug granularity for other
> > use cases (e.g. memory ballooning into VMs) which are, to be honest,
> > quite dubious and not really worth all the code rework. If we are
> > talking about something that can be worked around elsewhere then that
> > is preferred, because the code base is not in an excellent shape and
> > putting more on top is just going to cause more headaches.
> 
> At least for virtio-mem, it will be handled similarly to xen-balloon
> and hyper-v balloon, where whole sections are added and some parts are
> kept "soft-offline". But there, one device "owns" the complete
> section; it does not overlap with other devices. One section only has
> one owner.

This is exactly what I meant by handling at a higher level.
David Hildenbrand March 28, 2019, 8:10 p.m. UTC | #13
On 22.03.19 17:57, Dan Williams wrote:
> Changes since v4 [1]:
> - Given v4 was from March of 2017 the bulk of the changes result from
>   rebasing the patch set from a v4.11-rc2 baseline to v5.1-rc1.
> 
> - A unit test is added to ndctl to exercise the creation and dax
>   mounting of multiple independent namespaces in a single 128M section.
> 
> [1]: https://lwn.net/Articles/717383/
> 
> ---

I'm gonna have to ask some very basic questions:

You are using the term "Sub-section memory hotplug support", but is it
actually what you mean? To rephrase, aren't we talking here about
"Sub-section device memory hotplug support" or similar?

Reason I am asking is because I wonder how that would interact with the
memory block device infrastructure and hotplugging of system ram -
add_memory()/add_memory_resource(). I *assume* you are not changing the
add_memory() interface, so that one still only works with whole sections
(or well, memory_block_size_bytes()) - check_hotplug_memory_range().
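
For reference, the enforcement being referred to looks roughly like
this (a paraphrase of check_hotplug_memory_range() in
mm/memory_hotplug.c around v5.1; details may differ between versions):

    static int check_hotplug_memory_range(u64 start, u64 size)
    {
            unsigned long block_sz = memory_block_size_bytes();
            u64 block_nr_pages = block_sz >> PAGE_SHIFT;
            u64 nr_pages = size >> PAGE_SHIFT;

            /* add_memory() et al. only accept ranges aligned to
             * whole memory blocks (multiples of sections) */
            if (!nr_pages || !IS_ALIGNED(PFN_DOWN(start), block_nr_pages) ||
                !IS_ALIGNED(nr_pages, block_nr_pages))
                    return -EINVAL;

            return 0;
    }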

In general, mixing and matching system RAM and persistent memory per
section, I am not a friend of that. Especially when it comes to memory
block devices. But I am getting the feeling that we are rather targeting
PMEM vs. PMEM with this patch series.

> 
> Quote patch7:
> 
> "The libnvdimm sub-system has suffered a series of hacks and broken
>  workarounds for the memory-hotplug implementation's awkward
>  section-aligned (128MB) granularity. For example the following backtrace
>  is emitted when attempting arch_add_memory() with physical address
>  ranges that intersect 'System RAM' (RAM) with 'Persistent Memory' (PMEM)
>  within a given section:
>  
>   WARNING: CPU: 0 PID: 558 at kernel/memremap.c:300 devm_memremap_pages+0x3b5/0x4c0
>   devm_memremap_pages attempted on mixed region [mem 0x200000000-0x2fbffffff flags 0x200]
>   [..]
>   Call Trace:
>     dump_stack+0x86/0xc3
>     __warn+0xcb/0xf0
>     warn_slowpath_fmt+0x5f/0x80
>     devm_memremap_pages+0x3b5/0x4c0
>     __wrap_devm_memremap_pages+0x58/0x70 [nfit_test_iomap]
>     pmem_attach_disk+0x19a/0x440 [nd_pmem]
>  
>  Recently it was discovered that the problem goes beyond RAM vs PMEM
>  collisions as some platforms produce PMEM vs PMEM collisions within a

As side-noted by Michal, I wonder if PMEM vs. PMEM cannot rather be
implemented "on top" of what we have right now. Or is this what we
already have, which you call the "hacks in nvdimm" code? (no NVDIMM expert,
sorry for the stupid questions)

>  given section. The libnvdimm workaround for that case revealed that the
>  libnvdimm section-alignment-padding implementation has been broken for a
>  long while. A fix for that long-standing breakage introduces as many
>  problems as it solves as it would require a backward-incompatible change
>  to the namespace metadata interpretation. Instead of that dubious route
>  [2], address the root problem in the memory-hotplug implementation."
> 
> The approach taken is to observe that each section already maintains
> an array of 'unsigned long' values to hold the pageblock_flags. A single
> additional 'unsigned long' is added to house a 'sub-section active'
> bitmask. Each bit tracks the mapped state of one sub-section's worth of
> capacity which is SECTION_SIZE / BITS_PER_LONG, or 2MB on x86-64.
> 
> The implication of allowing sections to be piecemeal mapped/unmapped is
> that the valid_section() helper is no longer authoritative to determine
> if a section is fully mapped. Instead pfn_valid() is updated to consult
> the section-active bitmask. Given that typical memory hotplug still has
> deep "section" dependencies the sub-section capability is limited to
> 'want_memblock=false' invocations of arch_add_memory(), effectively only
> devm_memremap_pages() users for now.

Ah, there it is. And my point would be, please don't ever unlock
something like that for want_memblock=true. Especially not for memory
added after boot via device drivers (add_memory()).

> 
> With this in place the hacks in the libnvdimm sub-system can be
> dropped, and other devm_memremap_pages() users need no longer be
> constrained to 128MB mapping granularity.
Dan Williams March 28, 2019, 8:43 p.m. UTC | #14
On Thu, Mar 28, 2019 at 1:10 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 22.03.19 17:57, Dan Williams wrote:
> > Changes since v4 [1]:
> > - Given v4 was from March of 2017 the bulk of the changes result from
> >   rebasing the patch set from a v4.11-rc2 baseline to v5.1-rc1.
> >
> > - A unit test is added to ndctl to exercise the creation and dax
> >   mounting of multiple independent namespaces in a single 128M section.
> >
> > [1]: https://lwn.net/Articles/717383/
> >
> > ---
>
> I'm gonna have to ask some very basic questions:

No worries.

>
> You are using the term "Sub-section memory hotplug support", but is it
> actually what you mean? To rephrase, aren't we talking here about
> "Sub-section device memory hotplug support" or similar?

Specifically it is support for passing @start and @size arguments to
arch_add_memory() that are not section aligned. It's not limited to
"device memory" which is otherwise not a concept that
arch_add_memory() understands, it just groks spans of pfns.
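
To make that concrete, here is a minimal sketch of the bookkeeping the
series adds, following the cover letter's description; the field and
helper names below are illustrative, the actual patches may differ:

    /* one bit per sub-section: SECTION_SIZE / BITS_PER_LONG,
     * i.e. 2MB per sub-section on x86-64 */
    #define SUBSECTIONS_PER_SECTION BITS_PER_LONG
    #define PAGES_PER_SUBSECTION \
            (PAGES_PER_SECTION / SUBSECTIONS_PER_SECTION)

    struct mem_section_usage {
            unsigned long subsection_map[1];  /* 'sub-section active' bits */
            unsigned long pageblock_flags[0]; /* pre-existing per-section data */
    };

    static bool pfn_subsection_active(struct mem_section_usage *usage,
                                      unsigned long pfn)
    {
            unsigned long idx = (pfn & (PAGES_PER_SECTION - 1)) /
                                PAGES_PER_SUBSECTION;

            /* pfn_valid() consults this bitmap instead of trusting
             * valid_section() alone */
            return test_bit(idx, usage->subsection_map);
    }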

> Reason I am asking is because I wonder how that would interact with the
> memory block device infrastructure and hotplugging of system ram -
> add_memory()/add_memory_resource(). I *assume* you are not changing the
> add_memory() interface, so that one still only works with whole sections
> (or well, memory_block_size_bytes()) - check_hotplug_memory_range().

Like you found below, the implementation enforces that add_memory_*()
interfaces maintain section alignment for @start and @size.

> In general, mixing and matching system RAM and persistent memory per
> section, I am not a friend of that.

You have no choice. The platform may decide to map PMEM and System RAM
in the same section because the Linux section is too large compared to
typical memory controller mapping granularity capability.
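
A made-up example of what that can look like: a platform that maps
resources at 64MB granularity can land two different resources in a
single 128MB Linux section (addresses hypothetical):

    section 0x200000000-0x207ffffff (128M)
      0x200000000-0x203ffffff : System RAM         (64M)
      0x204000000-0x207ffffff : Persistent Memory  (64M)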

> Especially when it comes to memory
> block devices. But I am getting the feeling that we are rather targeting
> PMEM vs. PMEM with this patch series.

The collisions are between System RAM, PMEM regions, and PMEM
namespaces (sub-divisions of regions that each need their own mapping
lifetime).

> > Quote patch7:
> >
> > "The libnvdimm sub-system has suffered a series of hacks and broken
> >  workarounds for the memory-hotplug implementation's awkward
> >  section-aligned (128MB) granularity. For example the following backtrace
> >  is emitted when attempting arch_add_memory() with physical address
> >  ranges that intersect 'System RAM' (RAM) with 'Persistent Memory' (PMEM)
> >  within a given section:
> >
> >   WARNING: CPU: 0 PID: 558 at kernel/memremap.c:300 devm_memremap_pages+0x3b5/0x4c0
> >   devm_memremap_pages attempted on mixed region [mem 0x200000000-0x2fbffffff flags 0x200]
> >   [..]
> >   Call Trace:
> >     dump_stack+0x86/0xc3
> >     __warn+0xcb/0xf0
> >     warn_slowpath_fmt+0x5f/0x80
> >     devm_memremap_pages+0x3b5/0x4c0
> >     __wrap_devm_memremap_pages+0x58/0x70 [nfit_test_iomap]
> >     pmem_attach_disk+0x19a/0x440 [nd_pmem]
> >
> >  Recently it was discovered that the problem goes beyond RAM vs PMEM
> >  collisions as some platforms produce PMEM vs PMEM collisions within a
>
> As side-noted by Michal, I wonder if PMEM vs. PMEM cannot rather be
> implemented "on top" of what we have right now. Or is this what we
> already have, which you call the "hacks in nvdimm" code? (no NVDIMM expert,
> sorry for the stupid questions)

It doesn't work, because even if the padding was implemented 100%
correctly, which thus far has failed to be the case, the platform may
change physical alignments from one boot to the next for a variety of
reasons.

>
> >  given section. The libnvdimm workaround for that case revealed that the
> >  libnvdimm section-alignment-padding implementation has been broken for a
> >  long while. A fix for that long-standing breakage introduces as many
> >  problems as it solves as it would require a backward-incompatible change
> >  to the namespace metadata interpretation. Instead of that dubious route
> >  [2], address the root problem in the memory-hotplug implementation."
> >
> > The approach taken is to observe that each section already maintains
> > an array of 'unsigned long' values to hold the pageblock_flags. A single
> > additional 'unsigned long' is added to house a 'sub-section active'
> > bitmask. Each bit tracks the mapped state of one sub-section's worth of
> > capacity which is SECTION_SIZE / BITS_PER_LONG, or 2MB on x86-64.
> >
> > The implication of allowing sections to be piecemeal mapped/unmapped is
> > that the valid_section() helper is no longer authoritative to determine
> > if a section is fully mapped. Instead pfn_valid() is updated to consult
> > the section-active bitmask. Given that typical memory hotplug still has
> > deep "section" dependencies the sub-section capability is limited to
> > 'want_memblock=false' invocations of arch_add_memory(), effectively only
> > devm_memremap_pages() users for now.
>
> Ah, there it is. And my point would be, please don't ever unlock
> something like that for want_memblock=true. Especially not for memory
> added after boot via device drivers (add_memory()).

I don't see a strong reason why not, as long as it does not regress
existing use cases. It might need to be an opt-in for new tooling that
is aware of finer granularity hotplug. That said, I have no pressing
need to go there and just care about the arch_add_memory() capability
for now.
David Hildenbrand March 28, 2019, 9:17 p.m. UTC | #15
>> You are using the term "Sub-section memory hotplug support", but is it
>> actually what you mean? To rephrase, aren't we talking here about
>> "Sub-section device memory hotplug support" or similar?
> 
> Specifically it is support for passing @start and @size arguments to
> arch_add_memory() that are not section aligned. It's not limited to
> "device memory" which is otherwise not a concept that
> arch_add_memory() understands, it just groks spans of pfns.

Okay, so everything that does not have a memory block device as of now.

> 
>> Reason I am asking is because I wonder how that would interact with the
>> memory block device infrastructure and hotplugging of system ram -
>> add_memory()/add_memory_resource(). I *assume* you are not changing the
>> add_memory() interface, so that one still only works with whole sections
>> (or well, memory_block_size_bytes()) - check_hotplug_memory_range().
> 
> Like you found below, the implementation enforces that add_memory_*()
> interfaces maintain section alignment for @start and @size.
> 
>> In general, mixing and matching system RAM and persistent memory per
>> section, I am not a friend of that.
> 
> You have no choice. The platform may decide to map PMEM and System RAM
> in the same section because the Linux section is too large compared to
> typical memory controller mapping granularity capability.

I might be very wrong here, but do we actually care about something like
64MB getting lost in the cracks? I mean if it simplifies core MM, let go
of the couple of MB of system ram and handle the PMEM part only. Treat
the system ram parts like memory holes we already have in ordinary
sections (well, there we simply set the relevant struct pages to
PG_reserved). Of course, if we have hundreds of unaligned devices,
stuff will start to add up ... but I assume this is not the case?
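
(For reference, a rough sketch of what the memory-hole treatment above
amounts to today; the helper name is invented here for illustration:)

    static void reserve_section_hole(unsigned long start_pfn,
                                     unsigned long end_pfn)
    {
            unsigned long pfn;

            /* struct pages covering the hole stay PG_reserved, so
             * the range is never onlined or handed to the buddy */
            for (pfn = start_pfn; pfn < end_pfn; pfn++)
                    SetPageReserved(pfn_to_page(pfn));
    }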

> 
>> Especially when it comes to memory
>> block devices. But I am getting the feeling that we are rather targeting
>> PMEM vs. PMEM with this patch series.
> 
> The collisions are between System RAM, PMEM regions, and PMEM
> namespaces (sub-divisions of regions that each need their own mapping
> lifetime).

Understood. I wonder if that PMEM only mapping (including separate
lifetime) could be handled differently. But I am absolutely no expert,
just curious.

> 
>>> Quote patch7:
>>>
>>> "The libnvdimm sub-system has suffered a series of hacks and broken
>>>  workarounds for the memory-hotplug implementation's awkward
>>>  section-aligned (128MB) granularity. For example the following backtrace
>>>  is emitted when attempting arch_add_memory() with physical address
>>>  ranges that intersect 'System RAM' (RAM) with 'Persistent Memory' (PMEM)
>>>  within a given section:
>>>
>>>   WARNING: CPU: 0 PID: 558 at kernel/memremap.c:300 devm_memremap_pages+0x3b5/0x4c0
>>>   devm_memremap_pages attempted on mixed region [mem 0x200000000-0x2fbffffff flags 0x200]
>>>   [..]
>>>   Call Trace:
>>>     dump_stack+0x86/0xc3
>>>     __warn+0xcb/0xf0
>>>     warn_slowpath_fmt+0x5f/0x80
>>>     devm_memremap_pages+0x3b5/0x4c0
>>>     __wrap_devm_memremap_pages+0x58/0x70 [nfit_test_iomap]
>>>     pmem_attach_disk+0x19a/0x440 [nd_pmem]
>>>
>>>  Recently it was discovered that the problem goes beyond RAM vs PMEM
>>>  collisions as some platforms produce PMEM vs PMEM collisions within a
>>
>> As side-noted by Michal, I wonder if PMEM vs. PMEM cannot rather be
>> implemented "on top" of what we have right now. Or is this what we
>> already have, which you call the "hacks in nvdimm" code? (no NVDIMM expert,
>> sorry for the stupid questions)
> 
> It doesn't work, because even if the padding was implemented 100%
> correctly, which thus far has failed to be the case, the platform may
> change physical alignments from one boot to the next for a variety of
> reasons.

Would ignoring the System RAM parts (as mentioned above) help or doesn't
it make any difference in terms of complexity?

> 
>>
>>>  given section. The libnvdimm workaround for that case revealed that the
>>>  libnvdimm section-alignment-padding implementation has been broken for a
>>>  long while. A fix for that long-standing breakage introduces as many
>>>  problems as it solves as it would require a backward-incompatible change
>>>  to the namespace metadata interpretation. Instead of that dubious route
>>>  [2], address the root problem in the memory-hotplug implementation."
>>>
>>> The approach taken is to observe that each section already maintains
>>> an array of 'unsigned long' values to hold the pageblock_flags. A single
>>> additional 'unsigned long' is added to house a 'sub-section active'
>>> bitmask. Each bit tracks the mapped state of one sub-section's worth of
>>> capacity which is SECTION_SIZE / BITS_PER_LONG, or 2MB on x86-64.
>>>
>>> The implication of allowing sections to be piecemeal mapped/unmapped is
>>> that the valid_section() helper is no longer authoritative to determine
>>> if a section is fully mapped. Instead pfn_valid() is updated to consult
>>> the section-active bitmask. Given that typical memory hotplug still has
>>> deep "section" dependencies the sub-section capability is limited to
>>> 'want_memblock=false' invocations of arch_add_memory(), effectively only
>>> devm_memremap_pages() users for now.
>>
>> Ah, there it is. And my point would be, please don't ever unlock
>> something like that for want_memblock=true. Especially not for memory
>> added after boot via device drivers (add_memory()).
> 
> I don't see a strong reason why not, as long as it does not regress
> existing use cases. It might need to be an opt-in for new tooling that
> is aware of finer granularity hotplug. That said, I have no pressing
> need to go there and just care about the arch_add_memory() capability
> for now.

Especially onlining/offlining of memory might end up very ugly. And that
goes hand in hand with memory block devices. They are either online or
offline, not something in between. (I went that path and Michal
correctly told me why it is not a good idea)

I was recently trying to teach memory block devices who their owner is /
of which type they are. Right now I am looking into the option of using
drivers. Memory block devices that could belong to different drivers at
a time are well ... totally broken. I assume it would still be a special
case, but conceptually speaking about the interface it would be
allowed.

Memory block devices (and therefore 1..X sections) should have one owner
only. Anything else just does not fit.
Dan Williams March 28, 2019, 9:32 p.m. UTC | #16
On Thu, Mar 28, 2019 at 2:17 PM David Hildenbrand <david@redhat.com> wrote:
>
> >> You are using the term "Sub-section memory hotplug support", but is it
> >> actually what you mean? To rephrase, aren't we talking here about
> >> "Sub-section device memory hotplug support" or similar?
> >
> > Specifically it is support for passing @start and @size arguments to
> > arch_add_memory() that are not section aligned. It's not limited to
> > "device memory" which is otherwise not a concept that
> > arch_add_memory() understands, it just groks spans of pfns.
>
> Okay, so everything that does not have a memory block device as of now.
>
> >
> >> Reason I am asking is because I wonder how that would interact with the
> >> memory block device infrastructure and hotplugging of system ram -
> >> add_memory()/add_memory_resource(). I *assume* you are not changing the
> >> add_memory() interface, so that one still only works with whole sections
> >> (or well, memory_block_size_bytes()) - check_hotplug_memory_range().
> >
> > Like you found below, the implementation enforces that add_memory_*()
> > interfaces maintain section alignment for @start and @size.
> >
> >> In general, mixing and matching system RAM and persistent memory per
> >> section, I am not a friend of that.
> >
> > You have no choice. The platform may decide to map PMEM and System RAM
> > in the same section because the Linux section is too large compared to
> > typical memory controller mapping granularity capability.
>
> I might be very wrong here, but do we actually care about something like
> 64MB getting lost in the cracks? I mean if it simplifies core MM, let go
> of the couple of MB of system ram and handle the PMEM part only. Treat
> the system ram parts like memory holes we already have in ordinary
> sections (well, there we simply set the relevant struct pages to
> PG_reserved). Of course, if we have hundreds of unaligned devices,
> stuff will start to add up ... but I assume this is not the case?

That's precisely what we do today and it has become untenable as the
collision scenarios pile up. This thread [1] is worth a read if you
care about some of the gory details of why I'm back to pushing for
sub-section support, but most of it has already been summarized in the
current discussion on this thread.

[1]: https://lore.kernel.org/lkml/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com/

>
> >
> >> Especially when it comes to memory
> >> block devices. But I am getting the feeling that we are rather targeting
> >> PMEM vs. PMEM with this patch series.
> >
> > The collisions are between System RAM, PMEM regions, and PMEM
> > namespaces (sub-divisions of regions that each need their own mapping
> > lifetime).
>
> Understood. I wonder if that PMEM only mapping (including separate
> lifetime) could be handled differently. But I am absolutely no expert,
> just curious.

I refer you to the above thread trying to fix the libnvdimm-local hacks.

>
> >
> >>> Quote patch7:
> >>>
> >>> "The libnvdimm sub-system has suffered a series of hacks and broken
> >>>  workarounds for the memory-hotplug implementation's awkward
> >>>  section-aligned (128MB) granularity. For example the following backtrace
> >>>  is emitted when attempting arch_add_memory() with physical address
> >>>  ranges that intersect 'System RAM' (RAM) with 'Persistent Memory' (PMEM)
> >>>  within a given section:
> >>>
> >>>   WARNING: CPU: 0 PID: 558 at kernel/memremap.c:300 devm_memremap_pages+0x3b5/0x4c0
> >>>   devm_memremap_pages attempted on mixed region [mem 0x200000000-0x2fbffffff flags 0x200]
> >>>   [..]
> >>>   Call Trace:
> >>>     dump_stack+0x86/0xc3
> >>>     __warn+0xcb/0xf0
> >>>     warn_slowpath_fmt+0x5f/0x80
> >>>     devm_memremap_pages+0x3b5/0x4c0
> >>>     __wrap_devm_memremap_pages+0x58/0x70 [nfit_test_iomap]
> >>>     pmem_attach_disk+0x19a/0x440 [nd_pmem]
> >>>
> >>>  Recently it was discovered that the problem goes beyond RAM vs PMEM
> >>>  collisions as some platforms produce PMEM vs PMEM collisions within a
> >>
> >> As side-noted by Michal, I wonder if PMEM vs. PMEM cannot rather be
> >> implemented "on top" of what we have right now. Or is this what we
> >> already have, which you call the "hacks in nvdimm" code? (no NVDIMM expert,
> >> sorry for the stupid questions)
> >
> > It doesn't work, because even if the padding was implemented 100%
> > correctly, which thus far has failed to be the case, the platform may
> > change physical alignments from one boot to the next for a variety of
> > reasons.
>
> Would ignoring the System RAM parts (as mentioned above) help or doesn't
> it make any difference in terms of complexity?

Doesn't help much, that's only one of many collision sources.

> >>>  given section. The libnvdimm workaround for that case revealed that the
> >>>  libnvdimm section-alignment-padding implementation has been broken for a
> >>>  long while. A fix for that long-standing breakage introduces as many
> >>>  problems as it solves as it would require a backward-incompatible change
> >>>  to the namespace metadata interpretation. Instead of that dubious route
> >>>  [2], address the root problem in the memory-hotplug implementation."
> >>>
> >>> The approach taken is to observe that each section already maintains
> >>> an array of 'unsigned long' values to hold the pageblock_flags. A single
> >>> additional 'unsigned long' is added to house a 'sub-section active'
> >>> bitmask. Each bit tracks the mapped state of one sub-section's worth of
> >>> capacity which is SECTION_SIZE / BITS_PER_LONG, or 2MB on x86-64.
> >>>
> >>> The implication of allowing sections to be piecemeal mapped/unmapped is
> >>> that the valid_section() helper is no longer authoritative to determine
> >>> if a section is fully mapped. Instead pfn_valid() is updated to consult
> >>> the section-active bitmask. Given that typical memory hotplug still has
> >>> deep "section" dependencies the sub-section capability is limited to
> >>> 'want_memblock=false' invocations of arch_add_memory(), effectively only
> >>> devm_memremap_pages() users for now.
> >>
> >> Ah, there it is. And my point would be, please don't ever unlock
> >> something like that for want_memblock=true. Especially not for memory
> >> added after boot via device drivers (add_memory()).
> >
> > I don't see a strong reason why not, as long as it does not regress
> > existing use cases. It might need to be an opt-in for new tooling that
> > is aware of finer granularity hotplug. That said, I have no pressing
> > need to go there and just care about the arch_add_memory() capability
> > for now.
>
> Especially onlining/offlining of memory might end up very ugly. And that
> goes hand in hand with memory block devices. They are either online or
> offline, not something in between. (I went that path and Michal
> correctly told me why it is not a good idea)

Thread reference?

> I was recently trying to teach memory block devices who their owner is /
> of which type they are. Right now I am looking into the option of using
> drivers. Memory block devices that could belong to different drivers at
> a time are well ... totally broken.

Sub-section support is aimed at a similar case where different
portions of a 128MB span need to be handed out to devices / drivers with
independent lifetimes.

> I assume it would still be a special
> case, but conceptually speaking about the interface it would be
> allowed.
>
> Memory block devices (and therefore 1..X sections) should have one owner
> only. Anything else just does not fit.

Yes, but I would say the problem there is that the
memory-block-devices interface design is showing its age and is being
pressured by how systems want to deploy and use memory today.
David Hildenbrand March 28, 2019, 9:54 p.m. UTC | #17
>>>> Reason I am asking is because I wonder how that would interact with the
>>>> memory block device infrastructure and hotplugging of system ram -
>>>> add_memory()/add_memory_resource(). I *assume* you are not changing the
>>>> add_memory() interface, so that one still only works with whole sections
>>>> (or well, memory_block_size_bytes()) - check_hotplug_memory_range().
>>>
>>> Like you found below, the implementation enforces that add_memory_*()
>>> interfaces maintain section alignment for @start and @size.
>>>
>>>> In general, mixing and matching system RAM and persistent memory per
>>>> section, I am not a friend of that.
>>>
>>> You have no choice. The platform may decide to map PMEM and System RAM
>>> in the same section because the Linux section is too large compared to
>>> typical memory controller mapping granularity capability.
>>
>> I might be very wrong here, but do we actually care about something like
>> 64MB getting lost in the cracks? I mean if it simplifies core MM, let go
>> of the couple of MB of system ram and handle the PMEM part only. Treat
>> the system ram parts like memory holes we already have in ordinary
>> sections (well, there we simply set the relevant struct pages to
>> PG_reserved). Of course, if we have hundreds of unaligned devices,
>> stuff will start to add up ... but I assume this is not the case?
> 
> That's precisely what we do today and it has become untenable as the
> collision scenarios pile up. This thread [1] is worth a read if you
> care about some of the gory details of why I'm back to pushing for
> sub-section support, but most of it has already been summarized in the
> current discussion on this thread.

Thanks, exactly what I am interested in, will have a look!

>>>
>>> I don't see a strong reason why not, as long as it does not regress
>>> existing use cases. It might need to be an opt-in for new tooling that
>>> is aware of finer granularity hotplug. That said, I have no pressing
>>> need to go there and just care about the arch_add_memory() capability
>>> for now.
>>
>> Especially onlining/offlining of memory might end up very ugly. And that
>> goes hand in hand with memory block devices. They are either online or
>> offline, not something in between. (I went that path and Michal
>> correctly told me why it is not a good idea)
> 
> Thread reference?

Sure:

https://marc.info/?l=linux-mm&m=152362539714432&w=2

Onlining/offlining subsections was what I tried (adding/removing whole
sections). But with the memory block device model (online/offline memory
blocks), this really was in some sense dirty, although it worked.

> 
>> I was recently trying to teach memory block devices who their owner is /
>> of which type they are. Right now I am looking into the option of using
>> drivers. Memory block devices that could belong to different drivers at
>> a time are well ... totally broken.
> 
> Sub-section support is aimed at a similar case where different
> portions of a 128MB span need to be handed out to devices / drivers with
> independent lifetimes.

Right, but we are stuck here with memory block devices having a
certain bigger granularity. We already went from 128MB to 2048MB
because "there were too many". Modeling this on a 2MB level (e.g.
subsections), no way.
And as I said, multiple users for one memory block device, very ugly.

What would be interesting is having memory block devices of variable
size (64MB, 1024MB, 6GB, ...), maybe even representing the unit in which
e.g. add_memory() was performed. But it would also have downsides when
it comes to changing the zone of memory blocks. Memory would be
onlined/offlined in way bigger chunks.

E.g. one DIMM = one memory block device.

> 
>> I assume it would still be a special
>> case, but conceptually speaking about the interface it would be
>> allowed.
>>
>> Memory block devices (and therefore 1..X sections) should have one owner
>> only. Anything else just does not fit.
> 
> Yes, but I would say the problem there is that the
> memory-block-devices interface design is showing its age and is being
> pressured by how systems want to deploy and use memory today.

Maybe, I guess the main "issue" started to pop up when different things
(RAM vs. PMEM) started to be mapped into memory side by side. But
it is ABI, and basic kdump would completely break if removed. And of
course memory unplug and much more. It is a crucial part of how system
ram is handled today and might not at all be easy to replace.
David Hildenbrand April 1, 2019, 9:18 a.m. UTC | #18
On 27.03.19 01:20, Dan Williams wrote:
> On Tue, Mar 26, 2019 at 1:04 AM Michal Hocko <mhocko@kernel.org> wrote:
>>
>> On Mon 25-03-19 13:03:47, Dan Williams wrote:
>>> On Mon, Mar 25, 2019 at 3:20 AM Michal Hocko <mhocko@kernel.org> wrote:
>> [...]
>>>>> User-defined memory namespaces have this problem, but 2MB is the
>>>>> default alignment and is sufficient for most uses.
>>>>
>>>> What does prevent users to go and use a larger alignment?
>>>
>>> Given that we are living with 64MB granularity on mainstream platforms
>>> for the foreseeable future, the reason users can't rely on a larger
>>> alignment to address the issue is that the physical alignment may
>>> change from one boot to the next.
>>
>> I would love to learn more about this inter boot volatility. Could you
>> expand on that some more? I thought that the HW configuration presented
>> to the OS would be more or less stable unless the underlying HW changes.
> 
> Even if the configuration is static there can be hardware failures
> that prevent a DIMM or a PCI device from being included in the memory map.
> When that happens the BIOS needs to re-layout the map and the result
> is not guaranteed to maintain the previous alignment.
> 
>>> No, you can't just wish hardware / platform firmware won't do this,
>>> because there are not enough platform resources to give every hardware
>>> device a guaranteed alignment.
>>
>> Guarantee is one part and I can see how nobody wants to give you
>> something as strong but how often does that happen in real life?
> 
> I expect a "rare" event to happen every day in a data-center fleet.
> Failure rates tend towards 100% daily occurrence at scale and in this
> case the kernel has everything it needs to mitigate such an event.
> 
> Setting aside the success rate of a software-alignment mitigation, the
> reason I am charging this hill again after a 2 year hiatus is the
> realization that this problem is wider spread than the original
> failing scenario. Back in 2017 the problem seemed limited to custom
> memmap= configurations, and collisions between PMEM and System RAM.
> Now it is clear that the collisions can happen between PMEM regions
> and namespaces as well, and the problem spans platforms from multiple
> vendors. Here is the most recent collision problem:
> https://github.com/pmem/ndctl/issues/76, from a third-party platform.
> 
> The fix for that issue uncovered a bug in the padding implementation,
> and a fix for that bug would result in even more hacks in the nvdimm
> code for what is a core kernel deficiency. Code review of those
> changes resulted in changing direction to go after the core
> deficiency.
> 
>>> The effect is that even if the driver deploys a software alignment
>>> mitigation when it first sees the persistent memory range, that
>>> alignment can be violated on a subsequent boot leading to data being
>>> unavailable. There is no facility to communicate to the administrator
>>> what went wrong in this scenario as several events can trigger a
>>> physical map layout change. Add / remove of hardware and hardware
>>> failure are the most likely causes.
>>
>> This is indeed bad and unexpected! That is exactly something to have in
>> the changelog!
> 
> Apologies that was indeed included in the 2017 changelog (see: "a user
> could inadvertently lose access to nvdimm namespaces" note here:
> https://lwn.net/Articles/717383/), and I failed to carry it forward.
> 
>>
>>> An additional pain point for users is that the EFI pre-boot environment
>>> has little chance to create a namespace that Linux might be able to
>>> use. The section size is an arbitrary Linux constraint and we should
>>> not encode something Linux specific that might change in the future
>>> into OS agnostic software.
>>
>> This looks like a fair point but please keep in mind that there are
>> hotplug restrictions on other platforms as well (4MB on Windows IIRC) so
>> there will be some knowledge required all the time. Besides that there
>> are likely to be some restrictions depending on the implementation.
> 
> Windows does not have an equivalent constraint, so it's only Linux
> that imposes an arbitrary alignment restriction on pmem to agents like
> EFI.
> 
>> [...]
>>>>> Right, as stated in the cover letter, this does not remove all those
>>>>> assumptions, it only removes the ones that impact
>>>>> devm_memremap_pages(). Specifying that sub-section is only supported
>>>>> in the 'want_memblock=false' case to arch_add_memory().
>>>>
>>>> And this is exactly the problem. Having different assumptions depending
>>>> on whether there is a memblock interface or not is utterly wrong and a
>>>> maintainability mess.
>>>
>>> In this case I disagree with you. The hotplug code already has the
>>> want_memblock=false semantic in the implementation.
>>
>> want_memblock was a hack to allow memory hotplug to not have user
>> visible sysfs interface. It was added to reduce the code duplication
>> IIRC. Besides that this hasn't changed the underlying assumptions about
>> hotplugable units or other invariants that were in place.
> 
> Neither does this patch series for the typical memory hotplug case.
> For the device-memory use case I've gone through and fixed up the
> underlying assumptions.
> 
>>> The sub-section
>>> hotplug infrastructure is a strict superset of what is there already.
>>> Now, if it created parallel infrastructure that would indeed be a
>>> maintainability burden, but in this case there are no behavior changes
>>> for typical memory hotplug as it just hotplugs full sections at a time
>>> like always. The 'section' concept is not going away.
>>
>> You are really neglecting many details here. E.g. memory section can be
>> shared between two different types of memory. We've had some bugs in the
>> hotplug code when one section can be shared between two different NUMA
>> nodes (e.g. 4aa9fc2a435a ("Revert "mm, memory_hotplug: initialize struct
>> pages for the full memory section""). We do not allow to hotremove such
>> sections because it would open another can of worms. I am not saying
>> your implementation is incorrect - still haven't time to look deeply -
>> but stating that this is a strict superset of want_memblock is simply
>> wrong.
> 
> Please have a look at the code and the handling of "early" sections.
> The assertion that I neglected to consider that detail is not true.
> 
> My "superset" contention is from the arch_add_memory() api
> perspective. All typical memory hotplug use cases are a sub-case of
> the new support.
> 
>> [...]
>>>> Why do we have to go a mile to tweak the kernel, especially something as
>>>> fragile as memory hotplug, just to support sub mem section ranges. This
>>>> is something that is not clearly explained in the cover letter. Sure you
>>>> are talking about hacks at the higher level to deal with this but I do
>>>> not see any fundamental reason to actually support that at all.
>>>
>>> Like it or not, 'struct page' mappings for arbitrary hardware-physical
>>> memory ranges is a facility that has grown from the pmem case, to hmm,
>>> and peer-to-peer DMA. Unless you want to do the work to eliminate the
>>> 'struct page' requirement across the kernel I think it is unreasonable
>>> to effectively archive the arch_add_memory() implementation and
>>> prevent it from reacting to growing demands.
>>
>> I am definitely not blocking memory hotplug from being reused more! All I am
>> saying is that there is much more ground work to be done before you can
>> add features like that. There are some general assumptions in the code,
>> like it or not, and you should start by removing those to build on top.
> 
> Let's talk about specifics please, because I don't think you've had a
> chance to consider the details in the patches. Your "start by removing
> those [assumptions] to build on top" request is indeed what the
> preparation patches in this series aim to achieve.
> 
> The general assumptions of the current (pre-patch-series) implementation are:
> 
> - Sections that describe boot memory (early sections) are never
> unplugged / removed.

I'm not sure if this is completely true, and it also recently popped up
while discussing some work Oscar is doing ("[PATCH 0/4]
mm,memory_hotplug: allocate memmap from hotadded memory").

We have powernv (arch/powerpc/platforms/powernv/memtrace.c), which will
offline + remove memory from the system that it didn't originally add.
As far as I understand, this can easily be boot memory. Not sure if
there is anything blocking this code from removing boot memory.

Also, ACPI memory hotplug (drivers/acpi/acpi_memhotplug.c) seems to have
a case where memory provided by a DIMM is already used by the kernel (I
assume this means it was detected and added during boot). This memory
can theoretically be removed. I am still to figure out how that special
case here fits into the big picture.

> 
> - pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y case, devolves to a
> valid_section() check
> 
> - __add_pages() and helper routines assume all operations occur in
> PAGES_PER_SECTION units.
> 
> - the memblock sysfs interface only comprehends full sections
> 
> Those assumptions are removed / handled with the following
> implementation details respectively:
> 
> - Partially populated early sections can be extended with additional
> sub-sections, and those sub-sections can be removed with
> arch_remove_memory(). With this in place we no longer lose usable
> memory capacity to padding.
> 
> - pfn_valid() goes beyond valid_section() to also check the
> active-sub-section mask. As stated before this indication is in the
> same cacheline as the valid_section() so the performance impact is
> expected to be negligible. So far the lkp robot has not reported any
> regressions.
> 
> - Outside of the core vmemmap population routines which are replaced,
> other helper routines like shrink_{zone,pgdat}_span() are updated to
> handle the smaller granularity. Core memory hotplug routines that deal
> with online memory are not updated. That's a feature not a bug until
> we decide that sub-section hotplug makes sense for online / typical
> memory as well.
> 
> - the existing memblock sysfs user api guarantees / assumptions are
> not touched since this capability is limited to !online
> !sysfs-accessible sections for now.

So to expand on that, the main difference between RAM hotplug and
device memory hotplug is that

- Memory has to be onlined/offlined. Sections are marked as being either
online or offline. Not relevant for device memory. Onlining/offlining
implies working on the buddy / core MM.

- Memory is exposed and managed via the memblock sysfs API. memblocks
are multiples of sections. The RAM hotplug granularity really is the
size of memblocks. E.g. kdump uses memblock sysfs events to reload when
new memory is added/onlined. Onlining controlled by userspace is
triggered when memblocks get added. Other users heavily use the
memblock API.

So I think the hotplug granularity of RAM really is memblocks (actually
sections). Changing that might be very complicated, will break APIs and
has a questionable benefit.
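
For context, this is the user-visible sysfs ABI in question (example
output for a 128MB block size; the block number is arbitrary):

    $ cat /sys/devices/system/memory/block_size_bytes
    8000000
    $ cat /sys/devices/system/memory/memory32/state
    online
    $ echo offline > /sys/devices/system/memory/memory32/state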

I am starting to wonder if RAM (memdev) really is the special case and
what you are proposing is the right thing to do for everything that
- doesn't use the memdev sysfs interface
- doesn't require onlining memory (sections)

So it boils down to: memblock=true is the special case. We would have
to make sure that memblock=true cannot be mixed with memblock=false on
the same sections (or even memory blocks).

(not having had a detailed look at the patches yet) Michal, what do you
think?
David Hildenbrand April 10, 2019, 9:51 a.m. UTC | #19
On 22.03.19 17:57, Dan Williams wrote:
> Changes since v4 [1]:
> - Given v4 was from March of 2017 the bulk of the changes result from
>   rebasing the patch set from a v4.11-rc2 baseline to v5.1-rc1.
> 
> - A unit test is added to ndctl to exercise the creation and dax
>   mounting of multiple independent namespaces in a single 128M section.
> 
> [1]: https://lwn.net/Articles/717383/
> 
> ---
> 
> Quote patch7:
> 
> "The libnvdimm sub-system has suffered a series of hacks and broken
>  workarounds for the memory-hotplug implementation's awkward
>  section-aligned (128MB) granularity. For example the following backtrace
>  is emitted when attempting arch_add_memory() with physical address
>  ranges that intersect 'System RAM' (RAM) with 'Persistent Memory' (PMEM)
>  within a given section:
>  
>   WARNING: CPU: 0 PID: 558 at kernel/memremap.c:300 devm_memremap_pages+0x3b5/0x4c0
>   devm_memremap_pages attempted on mixed region [mem 0x200000000-0x2fbffffff flags 0x200]
>   [..]
>   Call Trace:
>     dump_stack+0x86/0xc3
>     __warn+0xcb/0xf0
>     warn_slowpath_fmt+0x5f/0x80
>     devm_memremap_pages+0x3b5/0x4c0
>     __wrap_devm_memremap_pages+0x58/0x70 [nfit_test_iomap]
>     pmem_attach_disk+0x19a/0x440 [nd_pmem]
>  
>  Recently it was discovered that the problem goes beyond RAM vs PMEM
>  collisions as some platforms produce PMEM vs PMEM collisions within a
>  given section. The libnvdimm workaround for that case revealed that the
>  libnvdimm section-alignment-padding implementation has been broken for a
>  long while. A fix for that long-standing breakage introduces as many
>  problems as it solves as it would require a backward-incompatible change
>  to the namespace metadata interpretation. Instead of that dubious route
>  [2], address the root problem in the memory-hotplug implementation."
> 
> The approach taken is to observe that each section already maintains
> an array of 'unsigned long' values to hold the pageblock_flags. A single
> additional 'unsigned long' is added to house a 'sub-section active'
> bitmask. Each bit tracks the mapped state of one sub-section's worth of
> capacity which is SECTION_SIZE / BITS_PER_LONG, or 2MB on x86-64.
> 
> The implication of allowing sections to be piecemeal mapped/unmapped is
> that the valid_section() helper is no longer authoritative to determine
> if a section is fully mapped. Instead pfn_valid() is updated to consult
> the section-active bitmask. Given that typical memory hotplug still has
> deep "section" dependencies the sub-section capability is limited to
> 'want_memblock=false' invocations of arch_add_memory(), effectively only
> devm_memremap_pages() users for now.
> 
> With this in place the hacks in the libnvdimm sub-system can be
> dropped, and other devm_memremap_pages() users need no longer be
> constrained to 128MB mapping granularity.
> 
> [2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
> 

I started to explore the wonderful world of system ram memory hotplug
(memory block devices) and it is full of issues. Too many to name them
all, but two examples are memory block devices that span several nodes
(such memory can only be added during boot; mem->nid would be completely
misleading) and the assumption that struct pages have been initialized
when removing memory, while they really haven't been.

It is already a mess that we have multiple sections per memory block
device (it was never properly cleaned up, and I think I spotted several
issues). I don't like going in the direction of sub-sections for memory
block devices. It is already a big mess.

Memory block devices are an important concept for memory hotplug/unplug.
This is the granularity at which memory gets onlined/offlined by user space.
I don't see this interface going away. On the other hand, memory block
devices only make sense for memory to be onlined/offlined in such
chunks, i.e. system ram. So ZONE_DEVICE memory doesn't run into that
restriction.

I think we should restrict adding/removing system ram via
online_pages()/offline_pages()/add_memory()/remove_memory() to
- memory block device granularity (already mostly checked)
- single zones (already mostly checked)
- single nodes (basically not checked as far as I can see)

Cleaning this mess up might take some time. Essentially, all special
handling related to memory block devices should be factored out from
arch_add_memory()/arch_remove_memory() to add_memory()/remove_memory().
I started looking into that. __add_pages() doesn't properly revert what
it already did when failing.

I don't have a strong opinion against adding sub-section memory hotadd
as long as we don't use it for memory block device hotplug. Meaning,
use it internally, but don't use it along with memory block device hotplug.

As add_memory() only works at memory block device granularity, memory
block devices for something like that are not an issue. The real issue is
memory added during boot that e.g. has holes or overlaps with pmem and
friends. Trying to offline/online/remove such memory should be
completely blocked.

To easily detect multiple nodes per memory block device, I am thinking
about making mem->nid indicate that. E.g. nid == -1: uninitialized; nid
== -2: mixed nodes; don't allow offlining/removing such memory. Might
require some refactorings.
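
A rough sketch of that idea (the sentinel names and helper below are
invented for illustration, not actual kernel code):

    #define MEM_BLOCK_NID_UNINITIALIZED  (-1)
    #define MEM_BLOCK_NID_MIXED          (-2)

    static bool memory_block_offlinable(struct memory_block *mem)
    {
            /* memory blocks spanning multiple nodes (only possible
             * for boot memory) would be excluded from offline/remove */
            return mem->nid != MEM_BLOCK_NID_MIXED;
    }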

The question is how we could easily detect
- memory block devices with some sub-sections missing. Offlining code
forbids this right now as the holes are marked as PG_reserved.
- memory block devices with some sub-sections being ZONE_DEVICE memory
like pmem.

Both cases could only happen when memory was added during boot.
Offlining/removing such memory has to be forbidden.