[RFC,0/6] decrease unnecessary gap due to pmem kmem alignment

Message ID	20200729033424.2629-1-justin.he@arm.com (mailing list archive)
From: Jia He <justin.he@arm.com>
To: Dan Williams <dan.j.williams@intel.com>, Vishal Verma <vishal.l.verma@intel.com>, Mike Rapoport <rppt@linux.ibm.com>, David Hildenbrand <david@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>, Will Deacon <will@kernel.org>, Greg Kroah-Hartman <gregkh@linuxfoundation.org>, "Rafael J. Wysocki" <rafael@kernel.org>, Andrew Morton <akpm@linux-foundation.org>, Steve Capper <steve.capper@arm.com>, Mark Rutland <mark.rutland@arm.com>, Anshuman Khandual <anshuman.khandual@arm.com>, Hsin-Yi Wang <hsinyi@chromium.org>, Jason Gunthorpe <jgg@ziepe.ca>, Dave Hansen <dave.hansen@linux.intel.com>, Kees Cook <keescook@chromium.org>, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-nvdimm@lists.01.org, linux-mm@kvack.org, Pankaj Gupta <pankaj.gupta.linux@gmail.com>, Kaly Xin <Kaly.Xin@arm.com>, Jia He <justin.he@arm.com>
Subject: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment
Date: Wed, 29 Jul 2020 11:34:18 +0800
Message-Id: <20200729033424.2629-1-justin.he@arm.com>
Archived-At: <https://lists.01.org/hyperkitty/list/linux-nvdimm@lists.01.org/message/JFWKKAKU6KNYZS5EZ74NWGMHA3FVWNJ7/>

Justin He July 29, 2020, 3:34 a.m. UTC

When enabling dax pmem as a RAM device on arm64, I noticed that the kmem_start
address in dev_dax_kmem_probe() must be aligned to 1 << SECTION_SIZE_BITS
(30 on arm64), i.e. the 1G memory block size. Even though Dan Williams'
sub-section patch series [1] has been merged upstream, it does not help because
of this hard limitation on kmem_start:
$ ndctl create-namespace -e namespace0.0 --mode=devdax --map=dev -s 2g -f -a 2M
$ echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
$ echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
$ cat /proc/iomem
...
23c000000-23fffffff : System RAM
  23dd40000-23fecffff : reserved
  23fed0000-23fffffff : reserved
240000000-33fdfffff : Persistent Memory
  240000000-2403fffff : namespace0.0
  280000000-2bfffffff : dax0.0          <- aligned with 1G boundary
    280000000-2bfffffff : System RAM
Hence there is a big gap between 0x2403fffff and 0x280000000 due to the 1G
alignment.
 
Without this series, if QEMU creates a 4G nvdimm device, only 2G of it can be
used for dax pmem (kmem) in the worst case, e.g.:
240000000-33fdfffff : Persistent Memory
We can only use the memory blocks within [240000000, 2ffffffff] because of the
hard limitation. This wastes too much memory.

Decreasing SECTION_SIZE_BITS on arm64 might be an alternative, but it raises
too many concerns because of other constraints, e.g. PAGE_SIZE, hugetlb,
SPARSEMEM_VMEMMAP, page bits in struct page ...

Besides decreasing SECTION_SIZE_BITS, we can also relax the requirement that
kmem be aligned to memory_block_size_bytes().

Tested on arm64 and x86 guests, with QEMU creating a 4G pmem device. dax pmem
can be used as RAM with a smaller gap. kmem hotplug add and remove were both
tested on the arm64 and x86 guests as well.

This patch series (mainly patch 6/6) is based on ~v5.8-rc5 plus the fix in [2].

[1] https://lkml.org/lkml/2019/6/19/67
[2] https://lkml.org/lkml/2020/7/8/1546
Jia He (6):
  mm/memory_hotplug: remove redundant memory block size alignment check
  resource: export find_next_iomem_res() helper
  mm/memory_hotplug: allow pmem kmem not to align with memory_block_size
  mm/page_alloc: adjust the start,end in dax pmem kmem case
  device-dax: relax the memblock size alignment for kmem_start
  arm64: fall back to vmemmap_populate_basepages if not aligned  with
    PMD_SIZE

 arch/arm64/mm/mmu.c    |  4 ++++
 drivers/base/memory.c  | 24 ++++++++++++++++--------
 drivers/dax/kmem.c     | 22 +++++++++++++---------
 include/linux/ioport.h |  3 +++
 kernel/resource.c      |  3 ++-
 mm/memory_hotplug.c    | 39 ++++++++++++++++++++++++++++++++++++++-
 mm/page_alloc.c        | 14 ++++++++++++++
 7 files changed, 90 insertions(+), 19 deletions(-)

David Hildenbrand July 29, 2020, 6:36 a.m. UTC | #1

Hi,

I am not convinced this use case is worth such hacks (that’s what it is) for now. On real machines pmem is big - your example (losing 50%) is extreme.

I would much rather see the section size on arm64 reduced. I remember there were patches, and that at least with a base page size of 4k it can be reduced drastically (64k base pages are more problematic due to the ridiculous THP size of 512M). But it could be that a section size of 512M is possible on all configs right now.

In the long term we might want to rework the memory block device model (eventually supporting old/new as discussed with Michal some time ago using a kernel parameter), dropping the fixed sizes:
- allowing sizes / addresses aligned with subsection size
- drastically reducing the number of devices for boot memory to only a handful (e.g., one per resource / DIMM we can actually unplug again).

Long story short, I don’t like this hack.



Justin He July 29, 2020, 8:27 a.m. UTC | #2

Hi David

> -----Original Message-----
> From: David Hildenbrand <david@redhat.com>
> Sent: Wednesday, July 29, 2020 2:37 PM
> 
> Hi,
> 
> I am not convinced this use case is worth such hacks (that’s what it is)
> for now. On real machines pmem is big - your example (losing 50%) is
> extreme.
> 
> I would much rather see the section size on arm64 reduced. I remember
> there were patches, and that at least with a base page size of 4k it can
> be reduced drastically (64k base pages are more problematic due to the
> ridiculous THP size of 512M). But it could be that a section size of 512M
> is possible on all configs right now.

Yes, I once investigated thoroughly how to reduce the section size on arm64.
There are several constraints on reducing SECTION_SIZE_BITS:
1. Given that the number of page->flags bits is limited, SECTION_SIZE_BITS
   can't be reduced too much.
2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, the section id no longer takes
   up bits in page->flags.
3. MAX_ORDER depends on SECTION_SIZE_BITS 
 - 3.1 mmzone.h
#if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
#error Allocator MAX_ORDER exceeds SECTION_SIZE
#endif
 - 3.2 hugepage_init()
MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER);

Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled,
SECTION_SIZE_BITS can be reduced to 27.
But with ARM64_64K_PAGES, 3.2 gives MAX_ORDER > 29-16 = 13, and 3.1 then
gives SECTION_SIZE_BITS >= MAX_ORDER+15 > 28. So SECTION_SIZE_BITS cannot
be reduced to 27.

In short, if we were to reduce SECTION_SIZE_BITS on arm64, the Kconfig might
get very complicated, e.g. we would still need to consider the
ARM64_16K_PAGES case.

> 
> In the long term we might want to rework the memory block device model
> (eventually supporting old/new as discussed with Michal some time ago
> using a kernel parameter), dropping the fixed sizes

Has this been posted to the linux-mm mailing list? Sorry, I searched but could not find it.


--
Cheers,
Justin (Jia He)




David Hildenbrand July 29, 2020, 8:44 a.m. UTC | #3

On 29.07.20 10:27, Justin He wrote:
> Hi David
> 
> Yes, I once investigated thoroughly how to reduce the section size on arm64.
> There are several constraints on reducing SECTION_SIZE_BITS:
> 1. Given that the number of page->flags bits is limited, SECTION_SIZE_BITS
>    can't be reduced too much.
> 2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, the section id no longer takes
>    up bits in page->flags.

Yep.

> 3. MAX_ORDER depends on SECTION_SIZE_BITS 
>  - 3.1 mmzone.h
> #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
> #error Allocator MAX_ORDER exceeds SECTION_SIZE
> #endif

Yep, with 4k base pages it's 4 MB. However, with 64k base pages it's
512 MB ( :( ).

>  - 3.2 hugepage_init()
> MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER);
> 
> Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled,
> SECTION_SIZE_BITS can be reduced to 27.
> But with ARM64_64K_PAGES, 3.2 gives MAX_ORDER > 29-16 = 13, and 3.1 then
> gives SECTION_SIZE_BITS >= MAX_ORDER+15 > 28. So SECTION_SIZE_BITS cannot
> be reduced to 27.

I think there were plans to eventually switch to 2MB THP with 64k base
pages as well (which can be emulated using some sort of consecutive PTE
entries under arm64, don't ask me what this feature is called),
theoretically also allowing smaller section sizes (when also reducing
MAX_ORDER properly). I would highly appreciate that switch. Having the max
allocation/THP size in the range of gigantic pages sounds very weird to me
(and creates issues, e.g., when supporting hot(un)plug of small memory
blocks for virtio-mem). But I guess this is not under our control :)

> 
> In short, if we were to reduce SECTION_SIZE_BITS on arm64, the Kconfig might
> get very complicated, e.g. we would still need to consider the
> ARM64_16K_PAGES case.

Haven't looked into 16k base pages yet. But I remember it's in general
more similar to 4k than to 64k (speaking about sane THP sizes and
similar ...).

> 
>>
>> In the long term we might want to rework the memory block device model
>> (eventually supporting old/new as discussed with Michal some time ago
>> using a kernel parameter), dropping the fixed sizes
> 
> Has this been posted to the linux-mm mailing list? Sorry, I searched but could not find it.

Yeah, but I might not be able to dig it out anymore ...

Anyhow, the idea would be to have some magic switch that converts
between old and new world, to not break userspace that relies on that.

With old, everything would continue to work as it is. With *new* we
would have a reduced number of memory blocks for boot memory and
decouple them from a strict, static memory block size.


There would be another option in corner cases right now. If you *know*
that the metadata memory has no memmap/identity mapping and you have
1G alignment for your pmem device (including the metadata part), you could:

1. add_memory_driver_managed() the whole memory, including the metadata part
2. use generic_online_page() to not expose metadata pages to the buddy
3. Mark metadata pages in a special way, such that you can, e.g., allow
offlining memory again, including the metadata pages (e.g., PG_offline +
memory notifier like virtio-mem does)

Step 3 would only be relevant to support offlining of memory again.

If the metadata part is, however, already ZONE_DEVICE with a memmap,
then that's not an option. (I have no idea how that metadata part is
used, sorry)

Mike Rapoport July 29, 2020, 9:31 a.m. UTC | #4

Hi Justin,

On Wed, Jul 29, 2020 at 08:27:58AM +0000, Justin He wrote:
> Hi David
> 
> Yes, I once investigated thoroughly how to reduce the section size on arm64.
> There are several constraints on reducing SECTION_SIZE_BITS:
> 1. Given that the number of page->flags bits is limited, SECTION_SIZE_BITS
>    can't be reduced too much.
> 2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, the section id no longer takes
>    up bits in page->flags.
> 3. MAX_ORDER depends on SECTION_SIZE_BITS 
>  - 3.1 mmzone.h
> #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
> #error Allocator MAX_ORDER exceeds SECTION_SIZE
> #endif
>  - 3.2 hugepage_init()
> MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER);
> 
> Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled,
> SECTION_SIZE_BITS can be reduced to 27.
> But with ARM64_64K_PAGES, 3.2 gives MAX_ORDER > 29-16 = 13, and 3.1 then
> gives SECTION_SIZE_BITS >= MAX_ORDER+15 > 28. So SECTION_SIZE_BITS cannot
> be reduced to 27.
> 
> In short, if we were to reduce SECTION_SIZE_BITS on arm64, the Kconfig might
> get very complicated, e.g. we would still need to consider the
> ARM64_16K_PAGES case.

It is not necessary to pollute Kconfig with that.
arch/arm64/include/asm/sparsemem.h can have something like

#ifdef CONFIG_ARM64_64K_PAGES
#define SECTION_SIZE_BITS 29
#elif defined(CONFIG_ARM64_16K_PAGES)
#define SECTION_SIZE_BITS 28
#elif defined(CONFIG_ARM64_4K_PAGES)
#define SECTION_SIZE_BITS 27
#else
#error "unsupported page size"
#endif
 
There is still a large gap with ARM64_64K_PAGES, though.

As for SPARSEMEM without VMEMMAP, are there actual benefits to using it?


David Hildenbrand July 29, 2020, 9:35 a.m. UTC | #5

On 29.07.20 11:31, Mike Rapoport wrote:
> Hi Justin,
> 
> On Wed, Jul 29, 2020 at 08:27:58AM +0000, Justin He wrote:
>> Hi David
>>
>> Yes, I once investigated thoroughly how to reduce the section size on arm64.
>> There are several constraints on reducing SECTION_SIZE_BITS:
>> 1. Given that the number of page->flags bits is limited, SECTION_SIZE_BITS
>>    can't be reduced too much.
>> 2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, the section id no longer takes
>>    up bits in page->flags.
>> 3. MAX_ORDER depends on SECTION_SIZE_BITS 
>>  - 3.1 mmzone.h
>> #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
>> #error Allocator MAX_ORDER exceeds SECTION_SIZE
>> #endif
>>  - 3.2 hugepage_init()
>> MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER);
>>
>> Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled,
>> SECTION_SIZE_BITS can be reduced to 27.
>> But with ARM64_64K_PAGES, 3.2 gives MAX_ORDER > 29-16 = 13, and 3.1 then
>> gives SECTION_SIZE_BITS >= MAX_ORDER+15 > 28. So SECTION_SIZE_BITS cannot
>> be reduced to 27.
>>
>> In short, if we were to reduce SECTION_SIZE_BITS on arm64, the Kconfig might
>> get very complicated, e.g. we would still need to consider the
>> ARM64_16K_PAGES case.
> 
> It is not necessary to pollute Kconfig with that.
> arch/arm64/include/asm/sparsemem.h can have something like
> 
> #ifdef CONFIG_ARM64_64K_PAGES
> #define SECTION_SIZE_BITS 29
> #elif defined(CONFIG_ARM64_16K_PAGES)
> #define SECTION_SIZE_BITS 28
> #elif defined(CONFIG_ARM64_4K_PAGES)
> #define SECTION_SIZE_BITS 27
> #else
> #error "unsupported page size"
> #endif

ack

>  
> There is still a large gap with ARM64_64K_PAGES, though.
> 
> As for SPARSEMEM without VMEMMAP, are there actual benefits to using it?

I was asking myself the same question a while ago and didn't really find
a compelling one.

I think it's always enabled by default (SPARSEMEM_VMEMMAP_ENABLE) and
would require config tweaks to even disable it.

Mike Rapoport July 29, 2020, 1 p.m. UTC | #6

On Wed, Jul 29, 2020 at 11:35:20AM +0200, David Hildenbrand wrote:
> On 29.07.20 11:31, Mike Rapoport wrote:
> > Hi Justin,
> > 
> > On Wed, Jul 29, 2020 at 08:27:58AM +0000, Justin He wrote:
> >> Hi David
> >>
> >> Yes, I once investigated thoroughly how to reduce the section size on arm64.
> >> There are several constraints on reducing SECTION_SIZE_BITS:
> >> 1. Given that the number of page->flags bits is limited, SECTION_SIZE_BITS
> >>    can't be reduced too much.
> >> 2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, the section id no longer takes
> >>    up bits in page->flags.
> >> 3. MAX_ORDER depends on SECTION_SIZE_BITS 
> >>  - 3.1 mmzone.h
> >> #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
> >> #error Allocator MAX_ORDER exceeds SECTION_SIZE
> >> #endif
> >>  - 3.2 hugepage_init()
> >> MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER);
> >>
> >> Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled,
> >> SECTION_SIZE_BITS can be reduced to 27.
> >> But when ARM64_64K_PAGES, given 3.2, MAX_ORDER > 29-16 = 13.
> >> Given 3.1 SECTION_SIZE_BITS >= MAX_ORDER+15 > 28. So SECTION_SIZE_BITS can not
> >> be reduced to 27.
> >>
> >> In one word, if we considered to reduce SECTION_SIZE_BITS on arm64, the Kconfig
> >> might be very complicated,e.g. we still need to consider the case for
> >> ARM64_16K_PAGES.
> > 
> > It is not necessary to pollute Kconfig with that.
> > > arch/arm64/include/asm/sparsemem.h can have something like
> > > 
> > > #ifdef CONFIG_ARM64_64K_PAGES
> > > #define SPARSE_SECTION_SIZE 29
> > > #elif defined(CONFIG_ARM64_16K_PAGES)
> > > #define SPARSE_SECTION_SIZE 28
> > > #elif defined(CONFIG_ARM64_4K_PAGES)
> > > #define SPARSE_SECTION_SIZE 27
> > > #else
> > > #error
> > > #endif
> 
> ack
> 
> >  
> > There is still large gap with ARM64_64K_PAGES, though.
> > 
> > As for SPARSEMEM without VMEMMAP, are there actual benefits to use it?
> 
> I was asking myself the same question a while ago and didn't really find
> a compelling one.

Memory overhead for VMEMMAP is larger, especially for arm64 that knows
how to free empty parts of the memory map with "classic" SPARSEMEM.
 
> I think it's always enabled as default (SPARSEMEM_VMEMMAP_ENABLE) and
> would require config tweaks to even disable it.

Nope, it's right there in menuconfig,

"Memory Management options" -> "Sparse Memory virtual memmap"

> -- 
> Thanks,
> 
> David / dhildenb
>
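
For reference, constraints 3.1 and 3.2 quoted above can be evaluated with a
small user-space C sketch. The helper names are made up for illustration, and
the page table geometry (8-byte entries, so each level resolves
PAGE_SHIFT - 3 bits) is an assumption that matches the usual arm64 layout. The
sketch only yields the lower bound implied by those two checks; other limits
such as the page->flags bits or the configured MAX_ORDER are not modelled.

```c
#include <assert.h>

/*
 * Assumed arm64-style geometry: a page table page holds 8-byte entries,
 * so every level resolves (page_shift - 3) bits and
 * PMD_SHIFT = page_shift + (page_shift - 3).
 */
unsigned int pmd_shift(unsigned int page_shift)
{
	assert(page_shift > 3);
	return page_shift + (page_shift - 3);
}

/* Constraint 3.2: HPAGE_PMD_ORDER < MAX_ORDER, so take the next integer. */
unsigned int min_max_order(unsigned int page_shift)
{
	unsigned int hpage_pmd_order = pmd_shift(page_shift) - page_shift;

	return hpage_pmd_order + 1;
}

/* Constraint 3.1: MAX_ORDER - 1 + PAGE_SHIFT <= SECTION_SIZE_BITS. */
unsigned int min_section_size_bits(unsigned int page_shift)
{
	return min_max_order(page_shift) - 1 + page_shift;
}
```

Under these assumptions the lower bound is 21, 25 and 29 for 4K, 16K and 64K
pages. The values 27 and 28 suggested in the thread sit above the bound, while
29 for 64K pages is exactly the bound, which is why the gap stays large there.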

David Hildenbrand July 29, 2020, 1:03 p.m. UTC | #7

On 29.07.20 15:00, Mike Rapoport wrote:
> On Wed, Jul 29, 2020 at 11:35:20AM +0200, David Hildenbrand wrote:
>> On 29.07.20 11:31, Mike Rapoport wrote:
>>> Hi Justin,
>>>
>>> On Wed, Jul 29, 2020 at 08:27:58AM +0000, Justin He wrote:
>>>> Hi David
>>>>>>
>>>>>> Without this series, if qemu creates a 4G bytes nvdimm device, we can only
>>>>>> use 2G bytes for dax pmem(kmem) in the worst case.
>>>>>> e.g.
>>>>>> 240000000-33fdfffff : Persistent Memory
>>>>>> We can only use the memblock between [240000000, 2ffffffff] due to the hard
>>>>>> limitation. It wastes too much memory space.
>>>>>>
>>>>>> Decreasing the SECTION_SIZE_BITS on arm64 might be an alternative, but there
>>>>>> are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb,
>>>>>> SPARSEMEM_VMEMMAP, page bits in struct page ...
>>>>>>
>>>>>> Beside decreasing the SECTION_SIZE_BITS, we can also relax the kmem alignment
>>>>>> with memory_block_size_bytes().
>>>>>>
>>>>>> Tested on arm64 guest and x86 guest, qemu creates a 4G pmem device. dax pmem
>>>>>> can be used as ram with smaller gap. Also the kmem hotplug add/remove are both
>>>>>> tested on arm64/x86 guest.
>>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> I am not convinced this use case is worth such hacks (that’s what it is)
>>>>> for now. On real machines pmem is big - your example (losing 50% is
>>>>> extreme).
>>>>>
>>>>> I would much rather want to see the section size on arm64 reduced. I
>>>>> remember there were patches and that at least with a base page size of 4k
>>>>> it can be reduced drastically (64k base pages are more problematic due to
>>>>> the ridiculous THP size of 512M). But could be a section size of 512 is
>>>>> possible on all configs right now.
>>>>
>>>> Yes, I once investigated how to reduce section size on arm64 thoughtfully:
>>>> There are many constraints for reducing SECTION_SIZE_BITS
>>>> 1. Given page->flags bits is limited, SECTION_SIZE_BITS can't be reduced too
>>>>    much.
>>>> 2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, section id will not be counted
>>>>    into page->flags.
>>>> 3. MAX_ORDER depends on SECTION_SIZE_BITS 
>>>>  - 3.1 mmzone.h
>>>> #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
>>>> #error Allocator MAX_ORDER exceeds SECTION_SIZE
>>>> #endif
>>>>  - 3.2 hugepage_init()
>>>> MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER);
>>>>
>>>> Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled,
>>>> SECTION_SIZE_BITS can be reduced to 27.
>>>> But when ARM64_64K_PAGES, given 3.2, MAX_ORDER > 29-16 = 13.
>>>> Given 3.1 SECTION_SIZE_BITS >= MAX_ORDER+15 > 28. So SECTION_SIZE_BITS can not
>>>> be reduced to 27.
>>>>
>>>> In one word, if we considered to reduce SECTION_SIZE_BITS on arm64, the Kconfig
>>>> might be very complicated,e.g. we still need to consider the case for
>>>> ARM64_16K_PAGES.
>>>
>>> It is not necessary to pollute Kconfig with that.
>>> arch/arm64/include/asm/sparsemem.h can have something like
>>>
>>> #ifdef CONFIG_ARM64_64K_PAGES
>>> #define SPARSE_SECTION_SIZE 29
>>> #elif defined(CONFIG_ARM64_16K_PAGES)
>>> #define SPARSE_SECTION_SIZE 28
>>> #elif defined(CONFIG_ARM64_4K_PAGES)
>>> #define SPARSE_SECTION_SIZE 27
>>> #else
>>> #error
>>> #endif
>>
>> ack
>>
>>>  
>>> There is still large gap with ARM64_64K_PAGES, though.
>>>
>>> As for SPARSEMEM without VMEMMAP, are there actual benefits to use it?
>>
>> I was asking myself the same question a while ago and didn't really find
>> a compelling one.
> 
> Memory overhead for VMEMMAP is larger, especially for arm64 that knows
> how to free empty parts of the memory map with "classic" SPARSEMEM.

You mean the hole punching within section memmap? (which is why their
pfn_valid() implementation is special)

(I do wonder why that shouldn't work with VMEMMAP, or is it simply not
implemented?)

>  
>> I think it's always enabled as default (SPARSEMEM_VMEMMAP_ENABLE) and
>> would require config tweaks to even disable it.
> 
> Nope, it's right there in menuconfig,
> 
> "Memory Management options" -> "Sparse Memory virtual memmap"

Ah, good to know.

Mike Rapoport July 29, 2020, 2:12 p.m. UTC | #8

On Wed, Jul 29, 2020 at 03:03:04PM +0200, David Hildenbrand wrote:
> On 29.07.20 15:00, Mike Rapoport wrote:
> > On Wed, Jul 29, 2020 at 11:35:20AM +0200, David Hildenbrand wrote:
> >>>  
> >>> There is still large gap with ARM64_64K_PAGES, though.
> >>>
> >>> As for SPARSEMEM without VMEMMAP, are there actual benefits to use it?
> >>
> >> I was asking myself the same question a while ago and didn't really find
> >> a compelling one.
> > 
> > Memory overhead for VMEMMAP is larger, especially for arm64 that knows
> > how to free empty parts of the memory map with "classic" SPARSEMEM.
> 
> You mean the hole punching within section memmap? (which is why their
> pfn_valid() implementation is special)

Yes, arm (both 32 and 64) does this. And for smaller systems with a few
memory banks it is very reasonable to trade a slight (if any) slowdown
in pfn_valid() for several megs of memory.
 
> (I do wonder why that shouldn't work with VMEMMAP, or is it simply not
> implemented?)
 
It's not implemented. There was a patch [1] recently to implement this. 

[1] https://lore.kernel.org/lkml/20200721073203.107862-1-liwei213@huawei.com/

> -- 
> Thanks,
> 
> David / dhildenb
>
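
To put "several megs" into numbers, here is a rough sketch of how much memory
map a section needs and how much of it covers a hole. It assumes a 64-byte
struct page, which is typical for 64-bit kernels but not guaranteed, and the
function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: sizeof(struct page) is 64 bytes (common on 64-bit builds). */
#define ASSUMED_STRUCT_PAGE_SIZE 64u

/* Bytes of memmap needed to describe one fully populated section. */
uint64_t memmap_bytes_per_section(unsigned int section_size_bits,
				  unsigned int page_shift)
{
	assert(section_size_bits >= page_shift && section_size_bits < 64);

	uint64_t pages = UINT64_C(1) << (section_size_bits - page_shift);

	return pages * ASSUMED_STRUCT_PAGE_SIZE;
}

/* Bytes of memmap that describe a hole of hole_bytes inside a section. */
uint64_t memmap_bytes_for_hole(uint64_t hole_bytes, unsigned int page_shift)
{
	return (hole_bytes >> page_shift) * ASSUMED_STRUCT_PAGE_SIZE;
}
```

With 4K pages a 1G section (SECTION_SIZE_BITS = 30) carries 16M of memmap, so
a 512M hole inside it accounts for 8M that could be freed. A 128M section
(SECTION_SIZE_BITS = 27) carries only 2M.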

Justin He July 30, 2020, 2:17 a.m. UTC | #9

On Wednesday, July 29, 2020 5:35 PM, David Hildenbrand <david@redhat.com> wrote:
> 
> On 29.07.20 11:31, Mike Rapoport wrote:
> > Hi Justin,
> >
> > On Wed, Jul 29, 2020 at 08:27:58AM +0000, Justin He wrote:
> >> Hi David
> >>>>
> >>>> Without this series, if qemu creates a 4G bytes nvdimm device, we can only
> >>>> use 2G bytes for dax pmem(kmem) in the worst case.
> >>>> e.g.
> >>>> 240000000-33fdfffff : Persistent Memory
> >>>> We can only use the memblock between [240000000, 2ffffffff] due to the hard
> >>>> limitation. It wastes too much memory space.
> >>>>
> >>>> Decreasing the SECTION_SIZE_BITS on arm64 might be an alternative, but there
> >>>> are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb,
> >>>> SPARSEMEM_VMEMMAP, page bits in struct page ...
> >>>>
> >>>> Beside decreasing the SECTION_SIZE_BITS, we can also relax the kmem alignment
> >>>> with memory_block_size_bytes().
> >>>>
> >>>> Tested on arm64 guest and x86 guest, qemu creates a 4G pmem device. dax pmem
> >>>> can be used as ram with smaller gap. Also the kmem hotplug add/remove are both
> >>>> tested on arm64/x86 guest.
> >>>>
> >>>
> >>> Hi,
> >>>
> >>> I am not convinced this use case is worth such hacks (that’s what it is)
> >>> for now. On real machines pmem is big - your example (losing 50% is
> >>> extreme).
> >>>
> >>> I would much rather want to see the section size on arm64 reduced. I
> >>> remember there were patches and that at least with a base page size of 4k
> >>> it can be reduced drastically (64k base pages are more problematic due to
> >>> the ridiculous THP size of 512M). But could be a section size of 512 is
> >>> possible on all configs right now.
> >>
> >> Yes, I once investigated how to reduce section size on arm64 thoughtfully:
> >> There are many constraints for reducing SECTION_SIZE_BITS
> >> 1. Given page->flags bits is limited, SECTION_SIZE_BITS can't be reduced too
> >>    much.
> >> 2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, section id will not be counted
> >>    into page->flags.
> >> 3. MAX_ORDER depends on SECTION_SIZE_BITS
> >>  - 3.1 mmzone.h
> >> #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
> >> #error Allocator MAX_ORDER exceeds SECTION_SIZE
> >> #endif
> >>  - 3.2 hugepage_init()
> >> MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER);
> >>
> >> Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled,
> >> SECTION_SIZE_BITS can be reduced to 27.
> >> But when ARM64_64K_PAGES, given 3.2, MAX_ORDER > 29-16 = 13.
> >> Given 3.1 SECTION_SIZE_BITS >= MAX_ORDER+15 > 28. So SECTION_SIZE_BITS can not
> >> be reduced to 27.
> >>
> >> In one word, if we considered to reduce SECTION_SIZE_BITS on arm64, the Kconfig
> >> might be very complicated,e.g. we still need to consider the case for
> >> ARM64_16K_PAGES.
> >
> > It is not necessary to pollute Kconfig with that.
> > > arch/arm64/include/asm/sparsemem.h can have something like
> > >
> > > #ifdef CONFIG_ARM64_64K_PAGES
> > > #define SPARSE_SECTION_SIZE 29
> > > #elif defined(CONFIG_ARM64_16K_PAGES)
> > > #define SPARSE_SECTION_SIZE 28
> > > #elif defined(CONFIG_ARM64_4K_PAGES)
> > > #define SPARSE_SECTION_SIZE 27
> > > #else
> > > #error
> > > #endif
> 
> ack

Thanks, David and Mike. I will discuss the section size change in more detail
internally at Arm.

--
Cheers,
Justin (Jia He)
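
For reference, the loss described in the cover letter example can be reproduced
with a short C sketch. It assumes that only whole memory blocks inside the
resource can be added, and that the block size equals the section size (1G
with SECTION_SIZE_BITS = 30, 128M with 27). The struct and function names are
illustrative and are not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Physical address range; end is exclusive. */
struct phys_range {
	uint64_t start;
	uint64_t end;
};

/* Round x up or down to a multiple of a, where a is a power of two. */
uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

uint64_t align_down(uint64_t x, uint64_t a)
{
	return x & ~(a - 1);
}

/* Bytes left once the range is trimmed to whole memory blocks. */
uint64_t usable_bytes(struct phys_range r, uint64_t block_size)
{
	assert(block_size != 0 && (block_size & (block_size - 1)) == 0);

	uint64_t start = align_up(r.start, block_size);
	uint64_t end = align_down(r.end, block_size);

	return end > start ? end - start : 0;
}
```

For the range 240000000-33fdfffff (end 0x33fe00000 when exclusive), a 1G block
size leaves 3G usable and drops 1022M, matching the [240000000, 2ffffffff]
window. With a 128M block size the same range leaves 3968M and drops 126M.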
