
[v5,00/21] Free some vmemmap pages of hugetlb page

Message ID 20201120064325.34492-1-songmuchun@bytedance.com (mailing list archive)

Message

Muchun Song Nov. 20, 2020, 6:43 a.m. UTC
Hi all,

This patch series frees some of the vmemmap pages (struct page structures)
associated with each HugeTLB page when it is preallocated, in order to save
memory.

The struct page structures (page structs) are used to describe a physical
page frame. By default, there is a one-to-one mapping from a page frame to
its corresponding page struct.
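
For reference, under CONFIG_SPARSEMEM_VMEMMAP this one-to-one mapping is just
pointer arithmetic into the virtually contiguous vmemmap area. A simplified
sketch of the relevant x86-64 definitions (details trimmed):

/*
 * Simplified sketch: with CONFIG_SPARSEMEM_VMEMMAP the array of struct
 * pages is virtually contiguous, starts at VMEMMAP_START and is indexed
 * directly by pfn.
 */
#define vmemmap			((struct page *)VMEMMAP_START)

#define __pfn_to_page(pfn)	(vmemmap + (pfn))
#define __page_to_pfn(page)	(unsigned long)((page) - vmemmap)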

HugeTLB pages consist of multiple base pages and are supported by many
architectures. See hugetlbpage.rst in the Documentation directory for more
details. On the x86 architecture, HugeTLB pages of size 2MB and 1GB are
currently supported. Since the base page size on x86 is 4KB, a 2MB HugeTLB
page consists of 512 base pages and a 1GB HugeTLB page consists of 262144
base pages. For each base page, there is a corresponding page struct.
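
As a back-of-the-envelope check (assuming the typical 64-byte struct page and
4KB base pages on x86-64), the vmemmap cost per HugeTLB page works out as
follows:

/*
 * Illustrative arithmetic only; the kernel derives the equivalent values
 * from sizeof(struct page) and PAGE_SIZE.
 */
#define PAGE_SIZE		4096UL
#define STRUCT_PAGE_SIZE	64UL	/* typical sizeof(struct page) */

/* 2MB HugeTLB: 512 base pages -> 512 * 64 / 4096 = 8 vmemmap pages */
#define VMEMMAP_PAGES_2M	(512UL * STRUCT_PAGE_SIZE / PAGE_SIZE)

/* 1GB HugeTLB: 262144 base pages -> 262144 * 64 / 4096 = 4096 vmemmap pages */
#define VMEMMAP_PAGES_1G	(262144UL * STRUCT_PAGE_SIZE / PAGE_SIZE)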

Within the HugeTLB subsystem, only the first 4 page structs are used to
contain unique information about a HugeTLB page. HUGETLB_CGROUP_MIN_ORDER
provides this upper limit. The only 'useful' information in the remaining
page structs is the compound_head field, and this field is the same for all
tail pages.

By removing redundant page structs for HugeTLB pages, memory can be returned
to the buddy allocator for other uses.

When the system boots up, every 2MB HugeTLB page has 512 struct page structs
which span 8 pages (sizeof(struct page) * 512 / PAGE_SIZE).

    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
 +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
 |           |                     |     0     | -------------> |     0     |
 |           |                     +-----------+                +-----------+
 |           |                     |     1     | -------------> |     1     |
 |           |                     +-----------+                +-----------+
 |           |                     |     2     | -------------> |     2     |
 |           |                     +-----------+                +-----------+
 |           |                     |     3     | -------------> |     3     |
 |           |                     +-----------+                +-----------+
 |           |                     |     4     | -------------> |     4     |
 |    2MB    |                     +-----------+                +-----------+
 |           |                     |     5     | -------------> |     5     |
 |           |                     +-----------+                +-----------+
 |           |                     |     6     | -------------> |     6     |
 |           |                     +-----------+                +-----------+
 |           |                     |     7     | -------------> |     7     |
 |           |                     +-----------+                +-----------+
 |           |
 |           |
 |           |
 +-----------+

The value of page->compound_head is the same for all tail pages. The first
page of page structs (page 0) associated with the HugeTLB page contains the 4
page structs necessary to describe the HugeTLB page. The only use of the
remaining pages of page structs (page 1 to page 7) is to point to
page->compound_head.
Therefore, we can remap pages 2 to 7 to page 1. Only 2 pages of page structs
will be used for each HugeTLB page. This will allow us to free the remaining
6 pages to the buddy allocator.
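
Conceptually, the freeing side boils down to a loop like the one below. This
is only an illustrative sketch of the idea, not code from this series;
remap_vmemmap_pte() and free_vmemmap_page() are hypothetical helpers.

/*
 * Illustrative sketch only: keep vmemmap pages 0 and 1 of a 2MB HugeTLB
 * page, remap vmemmap pages 2..7 onto the page frame that already backs
 * vmemmap page 1, then hand the six freed page frames back to the buddy
 * allocator. Both helpers below are hypothetical.
 */
static void free_tail_vmemmap_pages(unsigned long vmemmap_addr)
{
	unsigned long reuse = vmemmap_addr + PAGE_SIZE;	/* vmemmap page 1 */
	unsigned long addr = reuse + PAGE_SIZE;		/* vmemmap page 2 */
	unsigned long end = vmemmap_addr + 8 * PAGE_SIZE;

	for (; addr < end; addr += PAGE_SIZE) {
		/* Point the PTE of 'addr' at the frame backing 'reuse'... */
		struct page *old = remap_vmemmap_pte(addr, reuse);

		/* ...and release the frame that used to back 'addr'. */
		free_vmemmap_page(old);
	}
}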

Here is how things look after remapping.

    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
 +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
 |           |                     |     0     | -------------> |     0     |
 |           |                     +-----------+                +-----------+
 |           |                     |     1     | -------------> |     1     |
 |           |                     +-----------+                +-----------+
 |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
 |           |                     +-----------+                   | | | | |
 |           |                     |     3     | ------------------+ | | | |
 |           |                     +-----------+                     | | | |
 |           |                     |     4     | --------------------+ | | |
 |    2MB    |                     +-----------+                       | | |
 |           |                     |     5     | ----------------------+ | |
 |           |                     +-----------+                         | |
 |           |                     |     6     | ------------------------+ |
 |           |                     +-----------+                           |
 |           |                     |     7     | --------------------------+
 |           |                     +-----------+
 |           |
 |           |
 |           |
 +-----------+

When a HugeTLB page is freed to the buddy system, we must allocate 6 pages
for the vmemmap and restore the previous mapping relationship.

Apart from the 2MB HugeTLB page, there is also the 1GB HugeTLB page. The
situation is similar, and the same approach can be used to free its vmemmap
pages.

In this case, for the 1GB HugeTLB page, we can save 4088 pages (there are
4096 pages of struct page structs; we reserve 2 pages for the vmemmap and 8
pages for page tables, so we can save 4088 pages). This is a very substantial
gain. On our servers, we run SPDK/QEMU applications that use 1024GB of
HugeTLB pages. With this feature enabled, we can save ~16GB (1GB hugepages)
or ~11GB (2MB hugepages; the worst case is 10GB and the best case is 12GB)
of memory.
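
The 2MB best-case figure above can be reproduced with a quick calculation
(illustrative only, assuming 4KB base pages):

/* Illustrative only: best case for 1024GB backed by 2MB HugeTLB pages. */
static unsigned long best_case_savings_bytes(void)
{
	unsigned long nr_2m_pages  = 1024UL * 1024 / 2;	/* 524288 hugepages */
	unsigned long saved_per_2m = 6 * 4096UL;	/* 6 vmemmap pages per hugepage */

	return nr_2m_pages * saved_per_2m;		/* 12884901888 bytes = 12GB */
}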

Because the vmemmap page tables are reconstructed on the freeing/allocating
path, some overhead is added. Here is an analysis of that overhead.

1) Allocating 10240 2MB hugetlb pages.

   a) With this patch series applied:
   # time echo 10240 > /proc/sys/vm/nr_hugepages

   real     0m0.166s
   user     0m0.000s
   sys      0m0.166s

   # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
   Attaching 2 probes...

   @latency:
   [8K, 16K)           8360 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
   [16K, 32K)          1868 |@@@@@@@@@@@                                         |
   [32K, 64K)            10 |                                                    |
   [64K, 128K)            2 |                                                    |

   b) Without this patch series:
   # time echo 10240 > /proc/sys/vm/nr_hugepages

   real     0m0.066s
   user     0m0.000s
   sys      0m0.066s

   # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
   Attaching 2 probes...

   @latency:
   [4K, 8K)           10176 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
   [8K, 16K)             62 |                                                    |
   [16K, 32K)             2 |                                                    |

   Summary: with this feature, allocation is about 2x slower than before.

2) Freeing 10240 2MB hugetlb pages.

   a) With this patch series applied:
   # time echo 0 > /proc/sys/vm/nr_hugepages

   real     0m0.004s
   user     0m0.000s
   sys      0m0.002s

   # bpftrace -e 'kprobe:__free_hugepage { @start[tid] = nsecs; } kretprobe:__free_hugepage /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
   Attaching 2 probes...

   @latency:
   [16K, 32K)         10240 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

   b) Without this patch series:
   # time echo 0 > /proc/sys/vm/nr_hugepages

   real     0m0.077s
   user     0m0.001s
   sys      0m0.075s

   # bpftrace -e 'kprobe:__free_hugepage { @start[tid] = nsecs; } kretprobe:__free_hugepage /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
   Attaching 2 probes...

   @latency:
   [4K, 8K)            9950 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
   [8K, 16K)            287 |@                                                   |
   [16K, 32K)             3 |                                                    |

   Summary: __free_hugepage is about 2-4x slower than before. But judging
            from the allocation test above, I think it is also about 2x
            slower here.

            Why is the 'real' time smaller with the patches applied? Because
            in this patch series the freeing of hugetlb pages is asynchronous
            (done via a kworker).

Although the overhead has increased, the overhead is not significant. Like Mike
said, "However, remember that the majority of use cases create hugetlb pages at
or shortly after boot time and add them to the pool. So, additional overhead is
at pool creation time. There is no change to 'normal run time' operations of
getting a page from or returning a page to the pool (think page fault/unmap)".

Todo:
  1. Free all of the tail vmemmap pages
     Currently, for the 2MB HugeTLB page, we only free 6 vmemmap pages, but we
     could actually free 7. In that case, 8 of the 512 struct page structures
     would have the PG_head flag set. This requires adjusting compound_head()
     slightly so that it returns the real head struct page when its parameter
     is a tail struct page that nevertheless has PG_head set (a rough sketch
     follows below).

     To keep the code evolution route clear, this can be a separate patch once
     this patchset is solid.
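
     A rough sketch of what such an adjustment could look like (purely
     illustrative, not part of this series): when PG_head is seen on what may
     actually be a remapped tail struct page, the compound_head field of the
     following struct page can be used to find the real head.

     /*
      * Purely illustrative sketch of the compound_head() adjustment
      * discussed above. If 'page' only looks like a head because its
      * vmemmap page was remapped, the next struct page is still a tail
      * page, so its compound_head field (bit 0 set) points to the real
      * head.
      */
     static inline struct page *fixup_fake_head(struct page *page)
     {
             if (PageHead(page)) {
                     unsigned long head = READ_ONCE(page[1].compound_head);

                     if (head & 1)
                             return (struct page *)(head - 1);
             }
             return page;
     }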

  Changelog in v5:
  1. Rework some comments and code in [PATCH v4 04/21] and [PATCH v4 05/21].
     Thanks to Mike and Oscar's suggestions.

  Changelog in v4:
  1. Move all the vmemmap functions to hugetlb_vmemmap.c.
  2. Make CONFIG_HUGETLB_PAGE_FREE_VMEMMAP default to y; if we want to
     disable this feature, it should be disabled via a boot/kernel command
     line parameter.
  3. Remove vmemmap_pgtable_{init, deposit, withdraw}() helper functions.
  4. Initialize page table lock for vmemmap through core_initcall mechanism.

  Thanks to Mike and Oscar for their suggestions.

  Changelog in v3:
  1. Rename some helper functions. Thanks Mike.
  2. Rework some code. Thanks Mike and Oscar.
  3. Remap the tail vmemmap page with PAGE_KERNEL_RO instead of
     PAGE_KERNEL. Thanks Matthew.
  4. Add some overhead analysis in the cover letter.
  5. Use the vmemmap pmd table lock instead of a hugetlb-specific global lock.

  Changelog in v2:
  1. Fix do not call dissolve_compound_page in alloc_huge_page_vmemmap().
  2. Fix some typos and code style problems.
  3. Remove unused handle_vmemmap_fault().
  4. Merge some commits to one commit suggested by Mike.

Muchun Song (21):
  mm/memory_hotplug: Move bootmem info registration API to
    bootmem_info.c
  mm/memory_hotplug: Move {get,put}_page_bootmem() to bootmem_info.c
  mm/hugetlb: Introduce a new config HUGETLB_PAGE_FREE_VMEMMAP
  mm/hugetlb: Introduce nr_free_vmemmap_pages in the struct hstate
  mm/hugetlb: Introduce pgtable allocation/freeing helpers
  mm/bootmem_info: Introduce {free,prepare}_vmemmap_page()
  mm/bootmem_info: Combine bootmem info and type into page->freelist
  mm/hugetlb: Initialize page table lock for vmemmap
  mm/hugetlb: Free the vmemmap pages associated with each hugetlb page
  mm/hugetlb: Defer freeing of hugetlb pages
  mm/hugetlb: Allocate the vmemmap pages associated with each hugetlb
    page
  mm/hugetlb: Introduce remap_huge_page_pmd_vmemmap helper
  mm/hugetlb: Use PG_slab to indicate split pmd
  mm/hugetlb: Support freeing vmemmap pages of gigantic page
  mm/hugetlb: Set the PageHWPoison to the raw error page
  mm/hugetlb: Flush work when dissolving hugetlb page
  mm/hugetlb: Add a kernel parameter hugetlb_free_vmemmap
  mm/hugetlb: Merge pte to huge pmd only for gigantic page
  mm/hugetlb: Gather discrete indexes of tail page
  mm/hugetlb: Add BUILD_BUG_ON to catch invalid usage of tail struct
    page
  mm/hugetlb: Disable freeing vmemmap if struct page size is not power
    of two

 Documentation/admin-guide/kernel-parameters.txt |   9 +
 Documentation/admin-guide/mm/hugetlbpage.rst    |   3 +
 arch/x86/include/asm/hugetlb.h                  |  17 +
 arch/x86/include/asm/pgtable_64_types.h         |   8 +
 arch/x86/mm/init_64.c                           |   7 +-
 fs/Kconfig                                      |  14 +
 include/linux/bootmem_info.h                    |  78 +++
 include/linux/hugetlb.h                         |  19 +
 include/linux/hugetlb_cgroup.h                  |  15 +-
 include/linux/memory_hotplug.h                  |  27 -
 mm/Makefile                                     |   2 +
 mm/bootmem_info.c                               | 124 ++++
 mm/hugetlb.c                                    | 163 ++++-
 mm/hugetlb_vmemmap.c                            | 765 ++++++++++++++++++++++++
 mm/hugetlb_vmemmap.h                            | 103 ++++
 mm/memory_hotplug.c                             | 116 ----
 mm/sparse.c                                     |   5 +-
 17 files changed, 1295 insertions(+), 180 deletions(-)
 create mode 100644 include/linux/bootmem_info.h
 create mode 100644 mm/bootmem_info.c
 create mode 100644 mm/hugetlb_vmemmap.c
 create mode 100644 mm/hugetlb_vmemmap.h

Comments

Michal Hocko Nov. 20, 2020, 8:42 a.m. UTC | #1
On Fri 20-11-20 14:43:04, Muchun Song wrote:
[...]

Thanks for improving the cover letter and providing some numbers. I have
only glanced through the patchset because I didn't really have more time
to dive deeply into them.

Overall it looks promising. To summarize: I would prefer to not have
the feature enablement controlled by compile time option and the kernel
command line option should be opt-in. I also do not like that freeing
the pool can trigger the oom killer or even shut the system down if no
oom victim is eligible.

One thing that I didn't really get to think hard about is what is the
effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
invalid when racing with the split. How do we enforce that this won't
blow up?

I have also asked in a previous version whether the vmemmap manipulation
should be really unconditional. E.g. shortlived hugetlb pages allocated
from the buddy allocator directly rather than for a pool. Maybe it
should be restricted for the pool allocation as those are considered
long term and therefore the overhead will be amortized and the freeing path
restrictions easier to understand.

>  Documentation/admin-guide/kernel-parameters.txt |   9 +
>  Documentation/admin-guide/mm/hugetlbpage.rst    |   3 +
>  arch/x86/include/asm/hugetlb.h                  |  17 +
>  arch/x86/include/asm/pgtable_64_types.h         |   8 +
>  arch/x86/mm/init_64.c                           |   7 +-
>  fs/Kconfig                                      |  14 +
>  include/linux/bootmem_info.h                    |  78 +++
>  include/linux/hugetlb.h                         |  19 +
>  include/linux/hugetlb_cgroup.h                  |  15 +-
>  include/linux/memory_hotplug.h                  |  27 -
>  mm/Makefile                                     |   2 +
>  mm/bootmem_info.c                               | 124 ++++
>  mm/hugetlb.c                                    | 163 ++++-
>  mm/hugetlb_vmemmap.c                            | 765 ++++++++++++++++++++++++
>  mm/hugetlb_vmemmap.h                            | 103 ++++

I will need to look closer but I suspect that a non-trivial part of the
vmemmap manipulation really belongs to mm/sparse-vmemmap.c because the
split and remapping shouldn't really be hugetlb specific. Sure hugetlb
knows how to split but all the splitting should be implemented in
vmemmap proper.

>  mm/memory_hotplug.c                             | 116 ----
>  mm/sparse.c                                     |   5 +-
>  17 files changed, 1295 insertions(+), 180 deletions(-)
>  create mode 100644 include/linux/bootmem_info.h
>  create mode 100644 mm/bootmem_info.c
>  create mode 100644 mm/hugetlb_vmemmap.c
>  create mode 100644 mm/hugetlb_vmemmap.h

Thanks!
David Hildenbrand Nov. 20, 2020, 9:27 a.m. UTC | #2
On 20.11.20 09:42, Michal Hocko wrote:
> On Fri 20-11-20 14:43:04, Muchun Song wrote:
> [...]
> 
> Thanks for improving the cover letter and providing some numbers. I have
> only glanced through the patchset because I didn't really have more time
> to dive depply into them.
> 
> Overall it looks promissing. To summarize. I would prefer to not have
> the feature enablement controlled by compile time option and the kernel
> command line option should be opt-in. I also do not like that freeing
> the pool can trigger the oom killer or even shut the system down if no
> oom victim is eligible.
> 
> One thing that I didn't really get to think hard about is what is the
> effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
> invalid when racing with the split. How do we enforce that this won't
> blow up?

I have the same concerns - the sections are online the whole time and 
anybody with pfn_to_online_page() can grab them.

I think we have similar issues with memory offlining when removing the 
vmemmap, it's just very hard to trigger and we can easily protect by 
grabbing the memhotplug lock. I once discussed with Dan using rcu to 
protect the SECTION_IS_ONLINE bit, to make sure anybody who did a 
pfn_to_online_page() stopped using the page. Of course, such an approach 
is not easy to use in this context where the sections stay online the 
whole time ... we would have to protect vmemmap table entries using rcu 
or similar, which can get quite ugly.

To keep things easy, maybe simply never allow to free these hugetlb 
pages again for now? If they were reserved during boot and the vmemmap 
condensed, then just let them stick around for all eternity.

Once we have a safe approach on how to modify an online vmemmap, we can 
enable this freeing, and eventually also dynamically manage vmemmaps for 
runtime-allocated huge pages.
Michal Hocko Nov. 20, 2020, 9:39 a.m. UTC | #3
On Fri 20-11-20 10:27:05, David Hildenbrand wrote:
> On 20.11.20 09:42, Michal Hocko wrote:
> > On Fri 20-11-20 14:43:04, Muchun Song wrote:
> > [...]
> > 
> > Thanks for improving the cover letter and providing some numbers. I have
> > only glanced through the patchset because I didn't really have more time
> > to dive depply into them.
> > 
> > Overall it looks promissing. To summarize. I would prefer to not have
> > the feature enablement controlled by compile time option and the kernel
> > command line option should be opt-in. I also do not like that freeing
> > the pool can trigger the oom killer or even shut the system down if no
> > oom victim is eligible.
> > 
> > One thing that I didn't really get to think hard about is what is the
> > effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
> > invalid when racing with the split. How do we enforce that this won't
> > blow up?
> 
> I have the same concerns - the sections are online the whole time and
> anybody with pfn_to_online_page() can grab them
> 
> I think we have similar issues with memory offlining when removing the
> vmemmap, it's just very hard to trigger and we can easily protect by
> grabbing the memhotplug lock.

I am not sure we can/want to span memory hotplug locking out to all pfn
walkers. But you are right that the underlying problem is similar but
much harder to trigger because vmemmaps are only removed when the
physical memory is hotremoved and that happens very seldom. Maybe it
will happen more with virtualization usecases. But this work makes it
even more tricky. If a pfn walker races with a hotremove then it would
just blow up when accessing the unmapped physical address space. For
this feature a pfn walker would just grab a real struct page re-used for
some unpredictable use under its feet. Any failure would be silent and
hard to debug.

[...]
> To keep things easy, maybe simply never allow to free these hugetlb pages
> again for now? If they were reserved during boot and the vmemmap condensed,
> then just let them stick around for all eternity.

Not sure I understand. Do you propose to only free those vmemmap pages
when the pool is initialized during boot time and never allow to free
them up? That would certainly make it safer and maybe even simpler wrt
implementation.
David Hildenbrand Nov. 20, 2020, 9:43 a.m. UTC | #4
On 20.11.20 10:39, Michal Hocko wrote:
> On Fri 20-11-20 10:27:05, David Hildenbrand wrote:
>> On 20.11.20 09:42, Michal Hocko wrote:
>>> On Fri 20-11-20 14:43:04, Muchun Song wrote:
>>> [...]
>>>
>>> Thanks for improving the cover letter and providing some numbers. I have
>>> only glanced through the patchset because I didn't really have more time
>>> to dive depply into them.
>>>
>>> Overall it looks promissing. To summarize. I would prefer to not have
>>> the feature enablement controlled by compile time option and the kernel
>>> command line option should be opt-in. I also do not like that freeing
>>> the pool can trigger the oom killer or even shut the system down if no
>>> oom victim is eligible.
>>>
>>> One thing that I didn't really get to think hard about is what is the
>>> effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
>>> invalid when racing with the split. How do we enforce that this won't
>>> blow up?
>>
>> I have the same concerns - the sections are online the whole time and
>> anybody with pfn_to_online_page() can grab them
>>
>> I think we have similar issues with memory offlining when removing the
>> vmemmap, it's just very hard to trigger and we can easily protect by
>> grabbing the memhotplug lock.
> 
> I am not sure we can/want to span memory hotplug locking out to all pfn
> walkers. But you are right that the underlying problem is similar but
> much harder to trigger because vmemmaps are only removed when the
> physical memory is hotremoved and that happens very seldom. Maybe it
> will happen more with virtualization usecases. But this work makes it
> even more tricky. If a pfn walker races with a hotremove then it would
> just blow up when accessing the unmapped physical address space. For
> this feature a pfn walker would just grab a real struct page re-used for
> some unpredictable use under its feet. Any failure would be silent and
> hard to debug.

Right, we don't want the memory hotplug locking, thus discussions 
regarding rcu. Luckily, for now I never saw a BUG report regarding this 
- maybe because the time between memory offlining (offline_pages()) and 
memory/vmemmap getting removed (try_remove_memory()) is just too long. 
Someone would have to sleep after pfn_to_online_page() for quite a while 
to trigger it.

> 
> [...]
>> To keep things easy, maybe simply never allow to free these hugetlb pages
>> again for now? If they were reserved during boot and the vmemmap condensed,
>> then just let them stick around for all eternity.
> 
> Not sure I understand. Do you propose to only free those vmemmap pages
> when the pool is initialized during boot time and never allow to free
> them up? That would certainly make it safer and maybe even simpler wrt
> implementation.

Exactly, let's keep it simple for now. I guess most use cases of this 
(virtualization, databases, ...) will allocate hugepages during boot and 
never free them.
Muchun Song Nov. 20, 2020, 12:40 p.m. UTC | #5
On Fri, Nov 20, 2020 at 4:42 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 20-11-20 14:43:04, Muchun Song wrote:
> [...]
>
> Thanks for improving the cover letter and providing some numbers. I have
> only glanced through the patchset because I didn't really have more time
> to dive depply into them.
>
> Overall it looks promissing. To summarize. I would prefer to not have
> the feature enablement controlled by compile time option and the kernel
> command line option should be opt-in. I also do not like that freeing
> the pool can trigger the oom killer or even shut the system down if no
> oom victim is eligible.

Hi Michal,

I have replied to you about those questions on the other mail thread.

Thanks.

>
> One thing that I didn't really get to think hard about is what is the
> effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
> invalid when racing with the split. How do we enforce that this won't
> blow up?

This feature depends on CONFIG_SPARSEMEM_VMEMMAP;
in this case, pfn_to_page() works. The return value of
pfn_to_page() is actually the address of its struct page.
I cannot figure out where the problem is. Can you describe the
problem in detail, please? Thanks.

>
> I have also asked in a previous version whether the vmemmap manipulation
> should be really unconditional. E.g. shortlived hugetlb pages allocated
> from the buddy allocator directly rather than for a pool. Maybe it
> should be restricted for the pool allocation as those are considered
> long term and therefore the overhead will be amortized and freeing path
> restrictions better understandable.

Yeah, I agree with you. This can be an optimization. And we can
add it to the todo list and implement it in the future. Now the patch
series is already huge.

>
> >  Documentation/admin-guide/kernel-parameters.txt |   9 +
> >  Documentation/admin-guide/mm/hugetlbpage.rst    |   3 +
> >  arch/x86/include/asm/hugetlb.h                  |  17 +
> >  arch/x86/include/asm/pgtable_64_types.h         |   8 +
> >  arch/x86/mm/init_64.c                           |   7 +-
> >  fs/Kconfig                                      |  14 +
> >  include/linux/bootmem_info.h                    |  78 +++
> >  include/linux/hugetlb.h                         |  19 +
> >  include/linux/hugetlb_cgroup.h                  |  15 +-
> >  include/linux/memory_hotplug.h                  |  27 -
> >  mm/Makefile                                     |   2 +
> >  mm/bootmem_info.c                               | 124 ++++
> >  mm/hugetlb.c                                    | 163 ++++-
> >  mm/hugetlb_vmemmap.c                            | 765 ++++++++++++++++++++++++
> >  mm/hugetlb_vmemmap.h                            | 103 ++++
>
> I will need to look closer but I suspect that a non-trivial part of the
> vmemmap manipulation really belongs to mm/sparse-vmemmap.c because the
> split and remapping shouldn't really be hugetlb specific. Sure hugetlb
> knows how to split but all the splitting should be implemented in
> vmemmap proper.
>
> >  mm/memory_hotplug.c                             | 116 ----
> >  mm/sparse.c                                     |   5 +-
> >  17 files changed, 1295 insertions(+), 180 deletions(-)
> >  create mode 100644 include/linux/bootmem_info.h
> >  create mode 100644 mm/bootmem_info.c
> >  create mode 100644 mm/hugetlb_vmemmap.c
> >  create mode 100644 mm/hugetlb_vmemmap.h
>
> Thanks!
> --
> Michal Hocko
> SUSE Labs
Michal Hocko Nov. 20, 2020, 1:11 p.m. UTC | #6
On Fri 20-11-20 20:40:46, Muchun Song wrote:
> On Fri, Nov 20, 2020 at 4:42 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 20-11-20 14:43:04, Muchun Song wrote:
> > [...]
> >
> > Thanks for improving the cover letter and providing some numbers. I have
> > only glanced through the patchset because I didn't really have more time
> > to dive depply into them.
> >
> > Overall it looks promissing. To summarize. I would prefer to not have
> > the feature enablement controlled by compile time option and the kernel
> > command line option should be opt-in. I also do not like that freeing
> > the pool can trigger the oom killer or even shut the system down if no
> > oom victim is eligible.
> 
> Hi Michal,
> 
> I have replied to you about those questions on the other mail thread.
> 
> Thanks.
> 
> >
> > One thing that I didn't really get to think hard about is what is the
> > effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
> > invalid when racing with the split. How do we enforce that this won't
> > blow up?
> 
> This feature depends on the CONFIG_SPARSEMEM_VMEMMAP,
> in this case, the pfn_to_page can work. The return value of the
> pfn_to_page is actually the address of it's struct page struct.
> I can not figure out where the problem is. Can you describe the
> problem in detail please? Thanks.

struct page returned by pfn_to_page might get invalid right when it is
returned because vmemmap could get freed up and the respective memory
released to the page allocator and reused for something else. See?

> > I have also asked in a previous version whether the vmemmap manipulation
> > should be really unconditional. E.g. shortlived hugetlb pages allocated
> > from the buddy allocator directly rather than for a pool. Maybe it
> > should be restricted for the pool allocation as those are considered
> > long term and therefore the overhead will be amortized and freeing path
> > restrictions better understandable.
> 
> Yeah, I agree with you. This can be an optimization. And we can
> add it to the todo list and implement it in the future. Now the patch
> series is already huge.

Yes the patchset is large and the primary aim should be reducing
functionality to make it smaller in the first incarnation. Especially
when it is tricky to implement. Releasing vmemmap sparse hugepages is
one of those things. Do you really need it for your usecase?
Muchun Song Nov. 20, 2020, 3:44 p.m. UTC | #7
On Fri, Nov 20, 2020 at 9:11 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 20-11-20 20:40:46, Muchun Song wrote:
> > On Fri, Nov 20, 2020 at 4:42 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 20-11-20 14:43:04, Muchun Song wrote:
> > > [...]
> > >
> > > Thanks for improving the cover letter and providing some numbers. I have
> > > only glanced through the patchset because I didn't really have more time
> > > to dive depply into them.
> > >
> > > Overall it looks promissing. To summarize. I would prefer to not have
> > > the feature enablement controlled by compile time option and the kernel
> > > command line option should be opt-in. I also do not like that freeing
> > > the pool can trigger the oom killer or even shut the system down if no
> > > oom victim is eligible.
> >
> > Hi Michal,
> >
> > I have replied to you about those questions on the other mail thread.
> >
> > Thanks.
> >
> > >
> > > One thing that I didn't really get to think hard about is what is the
> > > effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
> > > invalid when racing with the split. How do we enforce that this won't
> > > blow up?
> >
> > This feature depends on the CONFIG_SPARSEMEM_VMEMMAP,
> > in this case, the pfn_to_page can work. The return value of the
> > pfn_to_page is actually the address of it's struct page struct.
> > I can not figure out where the problem is. Can you describe the
> > problem in detail please? Thanks.
>
> struct page returned by pfn_to_page might get invalid right when it is
> returned because vmemmap could get freed up and the respective memory
> released to the page allocator and reused for something else. See?

If the HugeTLB page is already allocated from the buddy allocator,
can the struct page of the HugeTLB page be freed? Does this case exist?
If yes, how can the HugeTLB page be freed back to the buddy allocator
(when its struct page cannot be accessed)?

>
> > > I have also asked in a previous version whether the vmemmap manipulation
> > > should be really unconditional. E.g. shortlived hugetlb pages allocated
> > > from the buddy allocator directly rather than for a pool. Maybe it
> > > should be restricted for the pool allocation as those are considered
> > > long term and therefore the overhead will be amortized and freeing path
> > > restrictions better understandable.
> >
> > Yeah, I agree with you. This can be an optimization. And we can
> > add it to the todo list and implement it in the future. Now the patch
> > series is already huge.
>
> Yes the patchset is large and the primary aim should be reducing
> functionality to make it smaller in the first incarnation. Especially
> when it is tricky to implement. Releasing vmemmap sparse hugepages is
> one of those things. Do you really need it for your usecase?
> --
> Michal Hocko
> SUSE Labs
Mike Kravetz Nov. 20, 2020, 5:45 p.m. UTC | #8
On 11/20/20 1:43 AM, David Hildenbrand wrote:
> On 20.11.20 10:39, Michal Hocko wrote:
>> On Fri 20-11-20 10:27:05, David Hildenbrand wrote:
>>> On 20.11.20 09:42, Michal Hocko wrote:
>>>> On Fri 20-11-20 14:43:04, Muchun Song wrote:
>>>> [...]
>>>>
>>>> Thanks for improving the cover letter and providing some numbers. I have
>>>> only glanced through the patchset because I didn't really have more time
>>>> to dive depply into them.
>>>>
>>>> Overall it looks promissing. To summarize. I would prefer to not have
>>>> the feature enablement controlled by compile time option and the kernel
>>>> command line option should be opt-in. I also do not like that freeing
>>>> the pool can trigger the oom killer or even shut the system down if no
>>>> oom victim is eligible.
>>>>
>>>> One thing that I didn't really get to think hard about is what is the
>>>> effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
>>>> invalid when racing with the split. How do we enforce that this won't
>>>> blow up?
>>>
>>> I have the same concerns - the sections are online the whole time and
>>> anybody with pfn_to_online_page() can grab them
>>>
>>> I think we have similar issues with memory offlining when removing the
>>> vmemmap, it's just very hard to trigger and we can easily protect by
>>> grabbing the memhotplug lock.
>>
>> I am not sure we can/want to span memory hotplug locking out to all pfn
>> walkers. But you are right that the underlying problem is similar but
>> much harder to trigger because vmemmaps are only removed when the
>> physical memory is hotremoved and that happens very seldom. Maybe it
>> will happen more with virtualization usecases. But this work makes it
>> even more tricky. If a pfn walker races with a hotremove then it would
>> just blow up when accessing the unmapped physical address space. For
>> this feature a pfn walker would just grab a real struct page re-used for
>> some unpredictable use under its feet. Any failure would be silent and
>> hard to debug.
> 
> Right, we don't want the memory hotplug locking, thus discussions regarding rcu. Luckily, for now I never saw a BUG report regarding this - maybe because the time between memory offlining (offline_pages()) and memory/vmemmap getting removed (try_remove_memory()) is just too long. Someone would have to sleep after pfn_to_online_page() for quite a while to trigger it.
> 
>>
>> [...]
>>> To keep things easy, maybe simply never allow to free these hugetlb pages
>>> again for now? If they were reserved during boot and the vmemmap condensed,
>>> then just let them stick around for all eternity.
>>
>> Not sure I understand. Do you propose to only free those vmemmap pages
>> when the pool is initialized during boot time and never allow to free
>> them up? That would certainly make it safer and maybe even simpler wrt
>> implementation.
> 
> Exactly, let's keep it simple for now. I guess most use cases of this (virtualization, databases, ...) will allocate hugepages during boot and never free them.

Not sure if I agree with that last statement.  Database and virtualization
use cases from my employer allocate hugetlb pages after boot.  It is shortly
after boot, but still not from the boot/kernel command line.

Somewhat related, but not exactly addressing this issue ...

One idea discussed in a previous patch set was to disable PMD/huge page
mapping of vmemmap if this feature was enabled.  This would eliminate a bunch
of the complex code doing page table manipulation.  It does not address
the issue of struct page pages going away which is being discussed here,
but it could be a way to simplify the first version of this code.  If this
is going to be an 'opt in' feature as previously suggested, then eliminating
the PMD/huge page vmemmap mapping may be acceptable.  My guess is that
sysadmins would only 'opt in' if they expect most of system memory to be used
by hugetlb pages.  We certainly have database and virtualization use cases
where this is true.
David Hildenbrand Nov. 20, 2020, 6 p.m. UTC | #9
On 20.11.20 18:45, Mike Kravetz wrote:
> On 11/20/20 1:43 AM, David Hildenbrand wrote:
>> On 20.11.20 10:39, Michal Hocko wrote:
>>> On Fri 20-11-20 10:27:05, David Hildenbrand wrote:
>>>> On 20.11.20 09:42, Michal Hocko wrote:
>>>>> On Fri 20-11-20 14:43:04, Muchun Song wrote:
>>>>> [...]
>>>>>
>>>>> Thanks for improving the cover letter and providing some numbers. I have
>>>>> only glanced through the patchset because I didn't really have more time
>>>>> to dive depply into them.
>>>>>
>>>>> Overall it looks promissing. To summarize. I would prefer to not have
>>>>> the feature enablement controlled by compile time option and the kernel
>>>>> command line option should be opt-in. I also do not like that freeing
>>>>> the pool can trigger the oom killer or even shut the system down if no
>>>>> oom victim is eligible.
>>>>>
>>>>> One thing that I didn't really get to think hard about is what is the
>>>>> effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
>>>>> invalid when racing with the split. How do we enforce that this won't
>>>>> blow up?
>>>>
>>>> I have the same concerns - the sections are online the whole time and
>>>> anybody with pfn_to_online_page() can grab them
>>>>
>>>> I think we have similar issues with memory offlining when removing the
>>>> vmemmap, it's just very hard to trigger and we can easily protect by
>>>> grabbing the memhotplug lock.
>>>
>>> I am not sure we can/want to span memory hotplug locking out to all pfn
>>> walkers. But you are right that the underlying problem is similar but
>>> much harder to trigger because vmemmaps are only removed when the
>>> physical memory is hotremoved and that happens very seldom. Maybe it
>>> will happen more with virtualization usecases. But this work makes it
>>> even more tricky. If a pfn walker races with a hotremove then it would
>>> just blow up when accessing the unmapped physical address space. For
>>> this feature a pfn walker would just grab a real struct page re-used for
>>> some unpredictable use under its feet. Any failure would be silent and
>>> hard to debug.
>>
>> Right, we don't want the memory hotplug locking, thus discussions regarding rcu. Luckily, for now I never saw a BUG report regarding this - maybe because the time between memory offlining (offline_pages()) and memory/vmemmap getting removed (try_remove_memory()) is just too long. Someone would have to sleep after pfn_to_online_page() for quite a while to trigger it.
>>
>>>
>>> [...]
>>>> To keep things easy, maybe simply never allow to free these hugetlb pages
>>>> again for now? If they were reserved during boot and the vmemmap condensed,
>>>> then just let them stick around for all eternity.
>>>
>>> Not sure I understand. Do you propose to only free those vmemmap pages
>>> when the pool is initialized during boot time and never allow to free
>>> them up? That would certainly make it safer and maybe even simpler wrt
>>> implementation.
>>
>> Exactly, let's keep it simple for now. I guess most use cases of this (virtualization, databases, ...) will allocate hugepages during boot and never free them.
> 
> Not sure if I agree with that last statement.  Database and virtualization
> use cases from my employer allocate allocate hugetlb pages after boot.  It
> is shortly after boot, but still not from boot/kernel command line.

Right, but the ones that care about this optimization for now could be 
converted, I assume? I mean we are talking about "opt-in" from 
sysadmins, so requiring them to specify a different cmdline parameter does
not sound too weird to me. And it should simplify a first version quite a
lot.

The more I think about this, the more I believe that doing these vmemmap
modifications after boot is very dangerous.

> 
> Somewhat related, but not exactly addressing this issue ...
> 
> One idea discussed in a previous patch set was to disable PMD/huge page
> mapping of vmemmap if this feature was enabled.  This would eliminate a bunch
> of the complex code doing page table manipulation.  It does not address
> the issue of struct page pages going away which is being discussed here,
> but it could be a way to simply the first version of this code.  If this
> is going to be an 'opt in' feature as previously suggested, then eliminating
> the  PMD/huge page vmemmap mapping may be acceptable.  My guess is that
> sysadmins would only 'opt in' if they expect most of system memory to be used
> by hugetlb pages.  We certainly have database and virtualization use cases
> where this is true.

It sounds like a hack to me, which does not fully solve the problem. But 
yeah, it's a simplification.
Muchun Song Nov. 22, 2020, 7:29 a.m. UTC | #10
On Sat, Nov 21, 2020 at 1:47 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 11/20/20 1:43 AM, David Hildenbrand wrote:
> > On 20.11.20 10:39, Michal Hocko wrote:
> >> On Fri 20-11-20 10:27:05, David Hildenbrand wrote:
> >>> On 20.11.20 09:42, Michal Hocko wrote:
> >>>> On Fri 20-11-20 14:43:04, Muchun Song wrote:
> >>>> [...]
> >>>>
> >>>> Thanks for improving the cover letter and providing some numbers. I have
> >>>> only glanced through the patchset because I didn't really have more time
> >>>> to dive depply into them.
> >>>>
> >>>> Overall it looks promissing. To summarize. I would prefer to not have
> >>>> the feature enablement controlled by compile time option and the kernel
> >>>> command line option should be opt-in. I also do not like that freeing
> >>>> the pool can trigger the oom killer or even shut the system down if no
> >>>> oom victim is eligible.
> >>>>
> >>>> One thing that I didn't really get to think hard about is what is the
> >>>> effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
> >>>> invalid when racing with the split. How do we enforce that this won't
> >>>> blow up?
> >>>
> >>> I have the same concerns - the sections are online the whole time and
> >>> anybody with pfn_to_online_page() can grab them
> >>>
> >>> I think we have similar issues with memory offlining when removing the
> >>> vmemmap, it's just very hard to trigger and we can easily protect by
> >>> grabbing the memhotplug lock.
> >>
> >> I am not sure we can/want to span memory hotplug locking out to all pfn
> >> walkers. But you are right that the underlying problem is similar but
> >> much harder to trigger because vmemmaps are only removed when the
> >> physical memory is hotremoved and that happens very seldom. Maybe it
> >> will happen more with virtualization usecases. But this work makes it
> >> even more tricky. If a pfn walker races with a hotremove then it would
> >> just blow up when accessing the unmapped physical address space. For
> >> this feature a pfn walker would just grab a real struct page re-used for
> >> some unpredictable use under its feet. Any failure would be silent and
> >> hard to debug.
> >
> > Right, we don't want the memory hotplug locking, thus discussions regarding rcu. Luckily, for now I never saw a BUG report regarding this - maybe because the time between memory offlining (offline_pages()) and memory/vmemmap getting removed (try_remove_memory()) is just too long. Someone would have to sleep after pfn_to_online_page() for quite a while to trigger it.
> >
> >>
> >> [...]
> >>> To keep things easy, maybe simply never allow to free these hugetlb pages
> >>> again for now? If they were reserved during boot and the vmemmap condensed,
> >>> then just let them stick around for all eternity.
> >>
> >> Not sure I understand. Do you propose to only free those vmemmap pages
> >> when the pool is initialized during boot time and never allow to free
> >> them up? That would certainly make it safer and maybe even simpler wrt
> >> implementation.
> >
> > Exactly, let's keep it simple for now. I guess most use cases of this (virtualization, databases, ...) will allocate hugepages during boot and never free them.
>
> Not sure if I agree with that last statement.  Database and virtualization
> use cases from my employer allocate allocate hugetlb pages after boot.  It
> is shortly after boot, but still not from boot/kernel command line.
>
> Somewhat related, but not exactly addressing this issue ...
>
> One idea discussed in a previous patch set was to disable PMD/huge page
> mapping of vmemmap if this feature was enabled.  This would eliminate a bunch
> of the complex code doing page table manipulation.  It does not address
> the issue of struct page pages going away which is being discussed here,
> but it could be a way to simply the first version of this code.  If this
> is going to be an 'opt in' feature as previously suggested, then eliminating
> the  PMD/huge page vmemmap mapping may be acceptable.  My guess is that
> sysadmins would only 'opt in' if they expect most of system memory to be used
> by hugetlb pages.  We certainly have database and virtualization use cases
> where this is true.

Hi Mike,

Yeah, I agree with you that the first version of this feature should be
simple. I can do that (disable PMD/huge page mapping of vmemmap)
in the next version of the patch. But I have another question: what is the
problem when struct page pages go away? I have not understood
the issues discussed here; I hope you can explain. Thanks.

> --
> Mike Kravetz
Michal Hocko Nov. 23, 2020, 7:38 a.m. UTC | #11
On Fri 20-11-20 09:45:12, Mike Kravetz wrote:
> On 11/20/20 1:43 AM, David Hildenbrand wrote:
[...]
> >>> To keep things easy, maybe simply never allow to free these hugetlb pages
> >>> again for now? If they were reserved during boot and the vmemmap condensed,
> >>> then just let them stick around for all eternity.
> >>
> >> Not sure I understand. Do you propose to only free those vmemmap pages
> >> when the pool is initialized during boot time and never allow to free
> >> them up? That would certainly make it safer and maybe even simpler wrt
> >> implementation.
> > 
> > Exactly, let's keep it simple for now. I guess most use cases of this (virtualization, databases, ...) will allocate hugepages during boot and never free them.
> 
> Not sure if I agree with that last statement.  Database and virtualization
> use cases from my employer allocate allocate hugetlb pages after boot.  It
> is shortly after boot, but still not from boot/kernel command line.

Is there any strong reason for that?

> Somewhat related, but not exactly addressing this issue ...
> 
> One idea discussed in a previous patch set was to disable PMD/huge page
> mapping of vmemmap if this feature was enabled.  This would eliminate a bunch
> of the complex code doing page table manipulation.  It does not address
> the issue of struct page pages going away which is being discussed here,
> but it could be a way to simply the first version of this code.  If this
> is going to be an 'opt in' feature as previously suggested, then eliminating
> the  PMD/huge page vmemmap mapping may be acceptable.  My guess is that
> sysadmins would only 'opt in' if they expect most of system memory to be used
> by hugetlb pages.  We certainly have database and virtualization use cases
> where this is true.

Would this simplify the code considerably? I mean, the vmemmap page
tables will need to be updated anyway. So that code has to stay. PMD
entry split shouldn't be the most complex part of that operation.  On
the other hand, dropping large pages for all vmemmaps will likely have a
performance impact.
Michal Hocko Nov. 23, 2020, 7:40 a.m. UTC | #12
On Fri 20-11-20 23:44:26, Muchun Song wrote:
> On Fri, Nov 20, 2020 at 9:11 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 20-11-20 20:40:46, Muchun Song wrote:
> > > On Fri, Nov 20, 2020 at 4:42 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Fri 20-11-20 14:43:04, Muchun Song wrote:
> > > > [...]
> > > >
> > > > Thanks for improving the cover letter and providing some numbers. I have
> > > > only glanced through the patchset because I didn't really have more time
> > > > to dive depply into them.
> > > >
> > > > Overall it looks promissing. To summarize. I would prefer to not have
> > > > the feature enablement controlled by compile time option and the kernel
> > > > command line option should be opt-in. I also do not like that freeing
> > > > the pool can trigger the oom killer or even shut the system down if no
> > > > oom victim is eligible.
> > >
> > > Hi Michal,
> > >
> > > I have replied to you about those questions on the other mail thread.
> > >
> > > Thanks.
> > >
> > > >
> > > > One thing that I didn't really get to think hard about is what is the
> > > > effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
> > > > invalid when racing with the split. How do we enforce that this won't
> > > > blow up?
> > >
> > > This feature depends on the CONFIG_SPARSEMEM_VMEMMAP,
> > > in this case, the pfn_to_page can work. The return value of the
> > > pfn_to_page is actually the address of it's struct page struct.
> > > I can not figure out where the problem is. Can you describe the
> > > problem in detail please? Thanks.
> >
> > struct page returned by pfn_to_page might get invalid right when it is
> > returned because vmemmap could get freed up and the respective memory
> > released to the page allocator and reused for something else. See?
> 
> If the HugeTLB page is already allocated from the buddy allocator,
> the struct page of the HugeTLB can be freed? Does this exist?

Nope, struct pages only ever get deallocated when the respective memory
(they describe) is hotremoved via hotplug.

> If yes, how to free the HugeTLB page to the buddy allocator
> (cannot access the struct page)?

But I do not follow how that relates to my concern above.
Muchun Song Nov. 23, 2020, 8:53 a.m. UTC | #13
On Mon, Nov 23, 2020 at 3:40 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 20-11-20 23:44:26, Muchun Song wrote:
> > On Fri, Nov 20, 2020 at 9:11 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 20-11-20 20:40:46, Muchun Song wrote:
> > > > On Fri, Nov 20, 2020 at 4:42 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Fri 20-11-20 14:43:04, Muchun Song wrote:
> > > > > [...]
> > > > >
> > > > > Thanks for improving the cover letter and providing some numbers. I have
> > > > > only glanced through the patchset because I didn't really have more time
> > > > > to dive depply into them.
> > > > >
> > > > > Overall it looks promissing. To summarize. I would prefer to not have
> > > > > the feature enablement controlled by compile time option and the kernel
> > > > > command line option should be opt-in. I also do not like that freeing
> > > > > the pool can trigger the oom killer or even shut the system down if no
> > > > > oom victim is eligible.
> > > >
> > > > Hi Michal,
> > > >
> > > > I have replied to you about those questions on the other mail thread.
> > > >
> > > > Thanks.
> > > >
> > > > >
> > > > > One thing that I didn't really get to think hard about is what is the
> > > > > effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
> > > > > invalid when racing with the split. How do we enforce that this won't
> > > > > blow up?
> > > >
> > > > This feature depends on the CONFIG_SPARSEMEM_VMEMMAP,
> > > > in this case, the pfn_to_page can work. The return value of the
> > > > pfn_to_page is actually the address of it's struct page struct.
> > > > I can not figure out where the problem is. Can you describe the
> > > > problem in detail please? Thanks.
> > >
> > > struct page returned by pfn_to_page might get invalid right when it is
> > > returned because vmemmap could get freed up and the respective memory
> > > released to the page allocator and reused for something else. See?
> >
> > If the HugeTLB page is already allocated from the buddy allocator,
> > the struct page of the HugeTLB can be freed? Does this exist?
>
> Nope, struct pages only ever get deallocated when the respective memory
> (they describe) is hotremoved via hotplug.
>
> > If yes, how to free the HugeTLB page to the buddy allocator
> > (cannot access the struct page)?
>
> But I do not follow how that relates to my concern above.

Sorry, I may not have understood your concerns.

vmemmap pages                 page frame
+-----------+   mapping to   +-----------+
|           | -------------> |     0     |
+-----------+                +-----------+
|           | -------------> |     1     |
+-----------+                +-----------+
|           | -------------> |     2     |
+-----------+                +-----------+
|           | -------------> |     3     |
+-----------+                +-----------+
|           | -------------> |     4     |
+-----------+                +-----------+
|           | -------------> |     5     |
+-----------+                +-----------+
|           | -------------> |     6     |
+-----------+                +-----------+
|           | -------------> |     7     |
+-----------+                +-----------+

In this patch series, we will free page frames 2-7 to the
buddy allocator. Do you mean that pfn_to_page() can return an invalid
value when the pfn falls within page frames 2-7? Thanks.

>
> --
> Michal Hocko
> SUSE Labs



--
Yours,
Muchun
Michal Hocko Nov. 23, 2020, 9:43 a.m. UTC | #14
On Mon 23-11-20 16:53:53, Muchun Song wrote:
> On Mon, Nov 23, 2020 at 3:40 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 20-11-20 23:44:26, Muchun Song wrote:
> > > On Fri, Nov 20, 2020 at 9:11 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Fri 20-11-20 20:40:46, Muchun Song wrote:
> > > > > On Fri, Nov 20, 2020 at 4:42 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Fri 20-11-20 14:43:04, Muchun Song wrote:
> > > > > > [...]
> > > > > >
> > > > > > Thanks for improving the cover letter and providing some numbers. I have
> > > > > > only glanced through the patchset because I didn't really have more time
> > > > > > to dive depply into them.
> > > > > >
> > > > > > Overall it looks promissing. To summarize. I would prefer to not have
> > > > > > the feature enablement controlled by compile time option and the kernel
> > > > > > command line option should be opt-in. I also do not like that freeing
> > > > > > the pool can trigger the oom killer or even shut the system down if no
> > > > > > oom victim is eligible.
> > > > >
> > > > > Hi Michal,
> > > > >
> > > > > I have replied to you about those questions on the other mail thread.
> > > > >
> > > > > Thanks.
> > > > >
> > > > > >
> > > > > > One thing that I didn't really get to think hard about is what is the
> > > > > > effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
> > > > > > invalid when racing with the split. How do we enforce that this won't
> > > > > > blow up?
> > > > >
> > > > > This feature depends on the CONFIG_SPARSEMEM_VMEMMAP,
> > > > > in this case, the pfn_to_page can work. The return value of the
> > > > > pfn_to_page is actually the address of it's struct page struct.
> > > > > I can not figure out where the problem is. Can you describe the
> > > > > problem in detail please? Thanks.
> > > >
> > > > struct page returned by pfn_to_page might get invalid right when it is
> > > > returned because vmemmap could get freed up and the respective memory
> > > > released to the page allocator and reused for something else. See?
> > >
> > > If the HugeTLB page is already allocated from the buddy allocator,
> > > the struct page of the HugeTLB can be freed? Does this exist?
> >
> > Nope, struct pages only ever get deallocated when the respective memory
> > (they describe) is hotremoved via hotplug.
> >
> > > If yes, how to free the HugeTLB page to the buddy allocator
> > > (cannot access the struct page)?
> >
> > But I do not follow how that relates to my concern above.
> 
> Sorry. I shouldn't understand your concerns.
> 
> vmemmap pages                 page frame
> +-----------+   mapping to   +-----------+
> |           | -------------> |     0     |
> +-----------+                +-----------+
> |           | -------------> |     1     |
> +-----------+                +-----------+
> |           | -------------> |     2     |
> +-----------+                +-----------+
> |           | -------------> |     3     |
> +-----------+                +-----------+
> |           | -------------> |     4     |
> +-----------+                +-----------+
> |           | -------------> |     5     |
> +-----------+                +-----------+
> |           | -------------> |     6     |
> +-----------+                +-----------+
> |           | -------------> |     7     |
> +-----------+                +-----------+
> 
> In this patch series, we will free the page frame 2-7 to the
> buddy allocator. You mean that pfn_to_page can return invalid
> value when the pfn is the page frame 2-7? Thanks.

No, I really mean that pfn_to_page will give you a struct page pointer
backed by pages which you release from the vmemmap page tables. Those pages
might get reused as soon as they are freed to the page allocator.
Muchun Song Nov. 23, 2020, 10:36 a.m. UTC | #15
On Mon, Nov 23, 2020 at 5:43 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 23-11-20 16:53:53, Muchun Song wrote:
> > On Mon, Nov 23, 2020 at 3:40 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 20-11-20 23:44:26, Muchun Song wrote:
> > > > On Fri, Nov 20, 2020 at 9:11 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Fri 20-11-20 20:40:46, Muchun Song wrote:
> > > > > > On Fri, Nov 20, 2020 at 4:42 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > >
> > > > > > > On Fri 20-11-20 14:43:04, Muchun Song wrote:
> > > > > > > [...]
> > > > > > >
> > > > > > > Thanks for improving the cover letter and providing some numbers. I have
> > > > > > > only glanced through the patchset because I didn't really have more time
> > > > > > > to dive depply into them.
> > > > > > >
> > > > > > > Overall it looks promissing. To summarize. I would prefer to not have
> > > > > > > the feature enablement controlled by compile time option and the kernel
> > > > > > > command line option should be opt-in. I also do not like that freeing
> > > > > > > the pool can trigger the oom killer or even shut the system down if no
> > > > > > > oom victim is eligible.
> > > > > >
> > > > > > Hi Michal,
> > > > > >
> > > > > > I have replied to you about those questions on the other mail thread.
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > >
> > > > > > > One thing that I didn't really get to think hard about is what is the
> > > > > > > effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
> > > > > > > invalid when racing with the split. How do we enforce that this won't
> > > > > > > blow up?
> > > > > >
> > > > > > This feature depends on the CONFIG_SPARSEMEM_VMEMMAP,
> > > > > > in this case, the pfn_to_page can work. The return value of the
> > > > > > pfn_to_page is actually the address of it's struct page struct.
> > > > > > I can not figure out where the problem is. Can you describe the
> > > > > > problem in detail please? Thanks.
> > > > >
> > > > > struct page returned by pfn_to_page might get invalid right when it is
> > > > > returned because vmemmap could get freed up and the respective memory
> > > > > released to the page allocator and reused for something else. See?
> > > >
> > > > If the HugeTLB page is already allocated from the buddy allocator,
> > > > the struct page of the HugeTLB can be freed? Does this exist?
> > >
> > > Nope, struct pages only ever get deallocated when the respective memory
> > > (they describe) is hotremoved via hotplug.
> > >
> > > > If yes, how to free the HugeTLB page to the buddy allocator
> > > > (cannot access the struct page)?
> > >
> > > But I do not follow how that relates to my concern above.
> >
> > Sorry. I shouldn't understand your concerns.
> >
> > vmemmap pages                 page frame
> > +-----------+   mapping to   +-----------+
> > |           | -------------> |     0     |
> > +-----------+                +-----------+
> > |           | -------------> |     1     |
> > +-----------+                +-----------+
> > |           | -------------> |     2     |
> > +-----------+                +-----------+
> > |           | -------------> |     3     |
> > +-----------+                +-----------+
> > |           | -------------> |     4     |
> > +-----------+                +-----------+
> > |           | -------------> |     5     |
> > +-----------+                +-----------+
> > |           | -------------> |     6     |
> > +-----------+                +-----------+
> > |           | -------------> |     7     |
> > +-----------+                +-----------+
> >
> > In this patch series, we will free the page frame 2-7 to the
> > buddy allocator. You mean that pfn_to_page can return invalid
> > value when the pfn is the page frame 2-7? Thanks.
>
> No I really mean that pfn_to_page will give you a struct page pointer
> from pages which you release from the vmemmap page tables. Those pages
> might get reused as soon sa they are freed to the page allocator.

We will remap vmemmap pages 2-7 (virtual addresses) to page
frame 1, and then free page frames 2-7 to the buddy allocator.
After that, any access through the struct page pointer returned by
pfn_to_page lands on page frame 1, so I think there is no problem here.
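
Roughly, that remapping step looks like the sketch below (illustration
only: vmemmap_pte() is a made-up placeholder for the real page table
walk, and TLB flushing/locking are left out, so this is not the code
from the series):

/*
 * Sketch only: remap the vmemmap of one 2MB HugeTLB page (8 base
 * pages of struct pages) so that vmemmap pages 2-7 alias the frame
 * behind vmemmap page 1, then hand the six old frames back to buddy.
 */
static void sketch_remap_hugetlb_vmemmap(unsigned long vmemmap_start)
{
        unsigned long reuse = vmemmap_start + PAGE_SIZE;  /* vmemmap page 1 */
        unsigned long reuse_pfn = pte_pfn(*vmemmap_pte(reuse));
        unsigned long addr;

        for (addr = reuse + PAGE_SIZE;
             addr < vmemmap_start + 8 * PAGE_SIZE; addr += PAGE_SIZE) {
                pte_t *ptep = vmemmap_pte(addr);
                struct page *old = pte_page(*ptep);

                /* vmemmap pages 2-7 now map the same frame as page 1 */
                set_pte_at(&init_mm, addr, ptep,
                           pfn_pte(reuse_pfn, PAGE_KERNEL));

                /* the old backing frame can be reused for anything else */
                __free_page(old);
        }
}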

Thanks.

> --
> Michal Hocko
> SUSE Labs
Michal Hocko Nov. 23, 2020, 10:42 a.m. UTC | #16
On Mon 23-11-20 18:36:33, Muchun Song wrote:
> On Mon, Nov 23, 2020 at 5:43 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 23-11-20 16:53:53, Muchun Song wrote:
> > > On Mon, Nov 23, 2020 at 3:40 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Fri 20-11-20 23:44:26, Muchun Song wrote:
> > > > > On Fri, Nov 20, 2020 at 9:11 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Fri 20-11-20 20:40:46, Muchun Song wrote:
> > > > > > > On Fri, Nov 20, 2020 at 4:42 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > > >
> > > > > > > > On Fri 20-11-20 14:43:04, Muchun Song wrote:
> > > > > > > > [...]
> > > > > > > >
> > > > > > > > Thanks for improving the cover letter and providing some numbers. I have
> > > > > > > > only glanced through the patchset because I didn't really have more time
> > > > > > > > to dive depply into them.
> > > > > > > >
> > > > > > > > Overall it looks promissing. To summarize. I would prefer to not have
> > > > > > > > the feature enablement controlled by compile time option and the kernel
> > > > > > > > command line option should be opt-in. I also do not like that freeing
> > > > > > > > the pool can trigger the oom killer or even shut the system down if no
> > > > > > > > oom victim is eligible.
> > > > > > >
> > > > > > > Hi Michal,
> > > > > > >
> > > > > > > I have replied to you about those questions on the other mail thread.
> > > > > > >
> > > > > > > Thanks.
> > > > > > >
> > > > > > > >
> > > > > > > > One thing that I didn't really get to think hard about is what is the
> > > > > > > > effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
> > > > > > > > invalid when racing with the split. How do we enforce that this won't
> > > > > > > > blow up?
> > > > > > >
> > > > > > > This feature depends on the CONFIG_SPARSEMEM_VMEMMAP,
> > > > > > > in this case, the pfn_to_page can work. The return value of the
> > > > > > > pfn_to_page is actually the address of it's struct page struct.
> > > > > > > I can not figure out where the problem is. Can you describe the
> > > > > > > problem in detail please? Thanks.
> > > > > >
> > > > > > struct page returned by pfn_to_page might get invalid right when it is
> > > > > > returned because vmemmap could get freed up and the respective memory
> > > > > > released to the page allocator and reused for something else. See?
> > > > >
> > > > > If the HugeTLB page is already allocated from the buddy allocator,
> > > > > the struct page of the HugeTLB can be freed? Does this exist?
> > > >
> > > > Nope, struct pages only ever get deallocated when the respective memory
> > > > (they describe) is hotremoved via hotplug.
> > > >
> > > > > If yes, how to free the HugeTLB page to the buddy allocator
> > > > > (cannot access the struct page)?
> > > >
> > > > But I do not follow how that relates to my concern above.
> > >
> > > Sorry. I shouldn't understand your concerns.
> > >
> > > vmemmap pages                 page frame
> > > +-----------+   mapping to   +-----------+
> > > |           | -------------> |     0     |
> > > +-----------+                +-----------+
> > > |           | -------------> |     1     |
> > > +-----------+                +-----------+
> > > |           | -------------> |     2     |
> > > +-----------+                +-----------+
> > > |           | -------------> |     3     |
> > > +-----------+                +-----------+
> > > |           | -------------> |     4     |
> > > +-----------+                +-----------+
> > > |           | -------------> |     5     |
> > > +-----------+                +-----------+
> > > |           | -------------> |     6     |
> > > +-----------+                +-----------+
> > > |           | -------------> |     7     |
> > > +-----------+                +-----------+
> > >
> > > In this patch series, we will free the page frame 2-7 to the
> > > buddy allocator. You mean that pfn_to_page can return invalid
> > > value when the pfn is the page frame 2-7? Thanks.
> >
> > No I really mean that pfn_to_page will give you a struct page pointer
> > from pages which you release from the vmemmap page tables. Those pages
> > might get reused as soon sa they are freed to the page allocator.
> 
> We will remap vmemmap pages 2-7 (virtual addresses) to page
> frame 1. And then we free page frame 2-7 to the buddy allocator.

And this doesn't really happen in an atomic fashion from the pfn walker
POV, right? So it is very well possible that 

struct page *page = pfn_to_page();
// remapping happens here
// page content is no longer valid because its backing memory can be
// reused for whatever purpose.
Muchun Song Nov. 23, 2020, 11:16 a.m. UTC | #17
On Mon, Nov 23, 2020 at 6:43 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 23-11-20 18:36:33, Muchun Song wrote:
> > On Mon, Nov 23, 2020 at 5:43 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 23-11-20 16:53:53, Muchun Song wrote:
> > > > On Mon, Nov 23, 2020 at 3:40 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Fri 20-11-20 23:44:26, Muchun Song wrote:
> > > > > > On Fri, Nov 20, 2020 at 9:11 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > >
> > > > > > > On Fri 20-11-20 20:40:46, Muchun Song wrote:
> > > > > > > > On Fri, Nov 20, 2020 at 4:42 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > > > >
> > > > > > > > > On Fri 20-11-20 14:43:04, Muchun Song wrote:
> > > > > > > > > [...]
> > > > > > > > >
> > > > > > > > > Thanks for improving the cover letter and providing some numbers. I have
> > > > > > > > > only glanced through the patchset because I didn't really have more time
> > > > > > > > > to dive depply into them.
> > > > > > > > >
> > > > > > > > > Overall it looks promissing. To summarize. I would prefer to not have
> > > > > > > > > the feature enablement controlled by compile time option and the kernel
> > > > > > > > > command line option should be opt-in. I also do not like that freeing
> > > > > > > > > the pool can trigger the oom killer or even shut the system down if no
> > > > > > > > > oom victim is eligible.
> > > > > > > >
> > > > > > > > Hi Michal,
> > > > > > > >
> > > > > > > > I have replied to you about those questions on the other mail thread.
> > > > > > > >
> > > > > > > > Thanks.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > One thing that I didn't really get to think hard about is what is the
> > > > > > > > > effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
> > > > > > > > > invalid when racing with the split. How do we enforce that this won't
> > > > > > > > > blow up?
> > > > > > > >
> > > > > > > > This feature depends on the CONFIG_SPARSEMEM_VMEMMAP,
> > > > > > > > in this case, the pfn_to_page can work. The return value of the
> > > > > > > > pfn_to_page is actually the address of it's struct page struct.
> > > > > > > > I can not figure out where the problem is. Can you describe the
> > > > > > > > problem in detail please? Thanks.
> > > > > > >
> > > > > > > struct page returned by pfn_to_page might get invalid right when it is
> > > > > > > returned because vmemmap could get freed up and the respective memory
> > > > > > > released to the page allocator and reused for something else. See?
> > > > > >
> > > > > > If the HugeTLB page is already allocated from the buddy allocator,
> > > > > > the struct page of the HugeTLB can be freed? Does this exist?
> > > > >
> > > > > Nope, struct pages only ever get deallocated when the respective memory
> > > > > (they describe) is hotremoved via hotplug.
> > > > >
> > > > > > If yes, how to free the HugeTLB page to the buddy allocator
> > > > > > (cannot access the struct page)?
> > > > >
> > > > > But I do not follow how that relates to my concern above.
> > > >
> > > > Sorry. I shouldn't understand your concerns.
> > > >
> > > > vmemmap pages                 page frame
> > > > +-----------+   mapping to   +-----------+
> > > > |           | -------------> |     0     |
> > > > +-----------+                +-----------+
> > > > |           | -------------> |     1     |
> > > > +-----------+                +-----------+
> > > > |           | -------------> |     2     |
> > > > +-----------+                +-----------+
> > > > |           | -------------> |     3     |
> > > > +-----------+                +-----------+
> > > > |           | -------------> |     4     |
> > > > +-----------+                +-----------+
> > > > |           | -------------> |     5     |
> > > > +-----------+                +-----------+
> > > > |           | -------------> |     6     |
> > > > +-----------+                +-----------+
> > > > |           | -------------> |     7     |
> > > > +-----------+                +-----------+
> > > >
> > > > In this patch series, we will free the page frame 2-7 to the
> > > > buddy allocator. You mean that pfn_to_page can return invalid
> > > > value when the pfn is the page frame 2-7? Thanks.
> > >
> > > No I really mean that pfn_to_page will give you a struct page pointer
> > > from pages which you release from the vmemmap page tables. Those pages
> > > might get reused as soon sa they are freed to the page allocator.
> >
> > We will remap vmemmap pages 2-7 (virtual addresses) to page
> > frame 1. And then we free page frame 2-7 to the buddy allocator.
>
> And this doesn't really happen in an atomic fashion from the pfn walker
> POV, right? So it is very well possible that

Yeah, you are right. But it may not be a problem for HugeTLB pages,
because in most cases we only read the tail struct page and get the
head struct page through compound_head() when the pfn is within
a HugeTLB range. Right?
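
For example, a typical walker follows a pattern like the sketch below
(illustrative only, not any particular in-tree walker):

/*
 * When a tail page of a compound (e.g. HugeTLB) page is hit, the only
 * field consulted is page->compound_head, via compound_head(), and
 * the walker then skips the remainder of the compound page.
 */
static void sketch_walk_pfn_range(unsigned long start_pfn,
                                  unsigned long end_pfn)
{
        unsigned long pfn;

        for (pfn = start_pfn; pfn < end_pfn; pfn++) {
                struct page *page = pfn_to_page(pfn);

                if (PageCompound(page)) {
                        struct page *head = compound_head(page);

                        /* act on the head, then jump past the compound page */
                        pfn = page_to_pfn(head) +
                              (1UL << compound_order(head)) - 1;
                        continue;
                }
                /* ... handle an ordinary base page ... */
        }
}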

>
> struct page *page = pfn_to_page();
> // remapping happens here
> // page content is no longer valid because its backing memory can be

If we only read page->compound_head, the content is still valid,
because the value of compound_head is the same for every tail page
struct of a HugeTLB page.

> // reused for whatever purpose.

> --
> Michal Hocko
> SUSE Labs
Michal Hocko Nov. 23, 2020, 11:32 a.m. UTC | #18
On Mon 23-11-20 19:16:18, Muchun Song wrote:
> On Mon, Nov 23, 2020 at 6:43 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 23-11-20 18:36:33, Muchun Song wrote:
> > > On Mon, Nov 23, 2020 at 5:43 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Mon 23-11-20 16:53:53, Muchun Song wrote:
> > > > > On Mon, Nov 23, 2020 at 3:40 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Fri 20-11-20 23:44:26, Muchun Song wrote:
> > > > > > > On Fri, Nov 20, 2020 at 9:11 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > > >
> > > > > > > > On Fri 20-11-20 20:40:46, Muchun Song wrote:
> > > > > > > > > On Fri, Nov 20, 2020 at 4:42 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Fri 20-11-20 14:43:04, Muchun Song wrote:
> > > > > > > > > > [...]
> > > > > > > > > >
> > > > > > > > > > Thanks for improving the cover letter and providing some numbers. I have
> > > > > > > > > > only glanced through the patchset because I didn't really have more time
> > > > > > > > > > to dive depply into them.
> > > > > > > > > >
> > > > > > > > > > Overall it looks promissing. To summarize. I would prefer to not have
> > > > > > > > > > the feature enablement controlled by compile time option and the kernel
> > > > > > > > > > command line option should be opt-in. I also do not like that freeing
> > > > > > > > > > the pool can trigger the oom killer or even shut the system down if no
> > > > > > > > > > oom victim is eligible.
> > > > > > > > >
> > > > > > > > > Hi Michal,
> > > > > > > > >
> > > > > > > > > I have replied to you about those questions on the other mail thread.
> > > > > > > > >
> > > > > > > > > Thanks.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > One thing that I didn't really get to think hard about is what is the
> > > > > > > > > > effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
> > > > > > > > > > invalid when racing with the split. How do we enforce that this won't
> > > > > > > > > > blow up?
> > > > > > > > >
> > > > > > > > > This feature depends on the CONFIG_SPARSEMEM_VMEMMAP,
> > > > > > > > > in this case, the pfn_to_page can work. The return value of the
> > > > > > > > > pfn_to_page is actually the address of it's struct page struct.
> > > > > > > > > I can not figure out where the problem is. Can you describe the
> > > > > > > > > problem in detail please? Thanks.
> > > > > > > >
> > > > > > > > struct page returned by pfn_to_page might get invalid right when it is
> > > > > > > > returned because vmemmap could get freed up and the respective memory
> > > > > > > > released to the page allocator and reused for something else. See?
> > > > > > >
> > > > > > > If the HugeTLB page is already allocated from the buddy allocator,
> > > > > > > the struct page of the HugeTLB can be freed? Does this exist?
> > > > > >
> > > > > > Nope, struct pages only ever get deallocated when the respective memory
> > > > > > (they describe) is hotremoved via hotplug.
> > > > > >
> > > > > > > If yes, how to free the HugeTLB page to the buddy allocator
> > > > > > > (cannot access the struct page)?
> > > > > >
> > > > > > But I do not follow how that relates to my concern above.
> > > > >
> > > > > Sorry. I shouldn't understand your concerns.
> > > > >
> > > > > vmemmap pages                 page frame
> > > > > +-----------+   mapping to   +-----------+
> > > > > |           | -------------> |     0     |
> > > > > +-----------+                +-----------+
> > > > > |           | -------------> |     1     |
> > > > > +-----------+                +-----------+
> > > > > |           | -------------> |     2     |
> > > > > +-----------+                +-----------+
> > > > > |           | -------------> |     3     |
> > > > > +-----------+                +-----------+
> > > > > |           | -------------> |     4     |
> > > > > +-----------+                +-----------+
> > > > > |           | -------------> |     5     |
> > > > > +-----------+                +-----------+
> > > > > |           | -------------> |     6     |
> > > > > +-----------+                +-----------+
> > > > > |           | -------------> |     7     |
> > > > > +-----------+                +-----------+
> > > > >
> > > > > In this patch series, we will free the page frame 2-7 to the
> > > > > buddy allocator. You mean that pfn_to_page can return invalid
> > > > > value when the pfn is the page frame 2-7? Thanks.
> > > >
> > > > No I really mean that pfn_to_page will give you a struct page pointer
> > > > from pages which you release from the vmemmap page tables. Those pages
> > > > might get reused as soon sa they are freed to the page allocator.
> > >
> > > We will remap vmemmap pages 2-7 (virtual addresses) to page
> > > frame 1. And then we free page frame 2-7 to the buddy allocator.
> >
> > And this doesn't really happen in an atomic fashion from the pfn walker
> > POV, right? So it is very well possible that
> 
> Yeah, you are right. But it may not be a problem for HugeTLB pages.
> Because in most cases, we only read the tail struct page and get the
> head struct page through compound_head() when the pfn is within
> a HugeTLB range. Right?

Many pfn walkers would encounter the head page first and then skip over
the rest. Those should be reasonably safe. But there is no guarantee, and
the fact that you need a valid page->compound_head, which might get
scribbled over once you already hold the struct page, makes this extremely
subtle.
Muchun Song Nov. 23, 2020, 12:07 p.m. UTC | #19
On Mon, Nov 23, 2020 at 7:32 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 23-11-20 19:16:18, Muchun Song wrote:
> > On Mon, Nov 23, 2020 at 6:43 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 23-11-20 18:36:33, Muchun Song wrote:
> > > > On Mon, Nov 23, 2020 at 5:43 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Mon 23-11-20 16:53:53, Muchun Song wrote:
> > > > > > On Mon, Nov 23, 2020 at 3:40 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > >
> > > > > > > On Fri 20-11-20 23:44:26, Muchun Song wrote:
> > > > > > > > On Fri, Nov 20, 2020 at 9:11 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > > > >
> > > > > > > > > On Fri 20-11-20 20:40:46, Muchun Song wrote:
> > > > > > > > > > On Fri, Nov 20, 2020 at 4:42 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Fri 20-11-20 14:43:04, Muchun Song wrote:
> > > > > > > > > > > [...]
> > > > > > > > > > >
> > > > > > > > > > > Thanks for improving the cover letter and providing some numbers. I have
> > > > > > > > > > > only glanced through the patchset because I didn't really have more time
> > > > > > > > > > > to dive depply into them.
> > > > > > > > > > >
> > > > > > > > > > > Overall it looks promissing. To summarize. I would prefer to not have
> > > > > > > > > > > the feature enablement controlled by compile time option and the kernel
> > > > > > > > > > > command line option should be opt-in. I also do not like that freeing
> > > > > > > > > > > the pool can trigger the oom killer or even shut the system down if no
> > > > > > > > > > > oom victim is eligible.
> > > > > > > > > >
> > > > > > > > > > Hi Michal,
> > > > > > > > > >
> > > > > > > > > > I have replied to you about those questions on the other mail thread.
> > > > > > > > > >
> > > > > > > > > > Thanks.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > One thing that I didn't really get to think hard about is what is the
> > > > > > > > > > > effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
> > > > > > > > > > > invalid when racing with the split. How do we enforce that this won't
> > > > > > > > > > > blow up?
> > > > > > > > > >
> > > > > > > > > > This feature depends on the CONFIG_SPARSEMEM_VMEMMAP,
> > > > > > > > > > in this case, the pfn_to_page can work. The return value of the
> > > > > > > > > > pfn_to_page is actually the address of it's struct page struct.
> > > > > > > > > > I can not figure out where the problem is. Can you describe the
> > > > > > > > > > problem in detail please? Thanks.
> > > > > > > > >
> > > > > > > > > struct page returned by pfn_to_page might get invalid right when it is
> > > > > > > > > returned because vmemmap could get freed up and the respective memory
> > > > > > > > > released to the page allocator and reused for something else. See?
> > > > > > > >
> > > > > > > > If the HugeTLB page is already allocated from the buddy allocator,
> > > > > > > > the struct page of the HugeTLB can be freed? Does this exist?
> > > > > > >
> > > > > > > Nope, struct pages only ever get deallocated when the respective memory
> > > > > > > (they describe) is hotremoved via hotplug.
> > > > > > >
> > > > > > > > If yes, how to free the HugeTLB page to the buddy allocator
> > > > > > > > (cannot access the struct page)?
> > > > > > >
> > > > > > > But I do not follow how that relates to my concern above.
> > > > > >
> > > > > > Sorry. I shouldn't understand your concerns.
> > > > > >
> > > > > > vmemmap pages                 page frame
> > > > > > +-----------+   mapping to   +-----------+
> > > > > > |           | -------------> |     0     |
> > > > > > +-----------+                +-----------+
> > > > > > |           | -------------> |     1     |
> > > > > > +-----------+                +-----------+
> > > > > > |           | -------------> |     2     |
> > > > > > +-----------+                +-----------+
> > > > > > |           | -------------> |     3     |
> > > > > > +-----------+                +-----------+
> > > > > > |           | -------------> |     4     |
> > > > > > +-----------+                +-----------+
> > > > > > |           | -------------> |     5     |
> > > > > > +-----------+                +-----------+
> > > > > > |           | -------------> |     6     |
> > > > > > +-----------+                +-----------+
> > > > > > |           | -------------> |     7     |
> > > > > > +-----------+                +-----------+
> > > > > >
> > > > > > In this patch series, we will free the page frame 2-7 to the
> > > > > > buddy allocator. You mean that pfn_to_page can return invalid
> > > > > > value when the pfn is the page frame 2-7? Thanks.
> > > > >
> > > > > No I really mean that pfn_to_page will give you a struct page pointer
> > > > > from pages which you release from the vmemmap page tables. Those pages
> > > > > might get reused as soon sa they are freed to the page allocator.
> > > >
> > > > We will remap vmemmap pages 2-7 (virtual addresses) to page
> > > > frame 1. And then we free page frame 2-7 to the buddy allocator.
> > >
> > > And this doesn't really happen in an atomic fashion from the pfn walker
> > > POV, right? So it is very well possible that
> >
> > Yeah, you are right. But it may not be a problem for HugeTLB pages.
> > Because in most cases, we only read the tail struct page and get the
> > head struct page through compound_head() when the pfn is within
> > a HugeTLB range. Right?
>
> Many pfn walkers would encounter the head page first and then skip over
> the rest. Those should be reasonably safe. But there is no guarantee and
> the fact that you need a valid page->compound_head which might get
> scribbled over once you have the struct page makes this extremely
> subtle.

In this patch series, we can guarantee that page->compound_head
is always valid, because we reuse the first tail page. Maybe you need to
take a closer look at this series. Thanks.


>
> --
>
> SUSE Labs



--
Yours,
Muchun
Michal Hocko Nov. 23, 2020, 12:18 p.m. UTC | #20
On Mon 23-11-20 20:07:23, Muchun Song wrote:
> On Mon, Nov 23, 2020 at 7:32 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > > > > > No I really mean that pfn_to_page will give you a struct page pointer
> > > > > > from pages which you release from the vmemmap page tables. Those pages
> > > > > > might get reused as soon sa they are freed to the page allocator.
> > > > >
> > > > > We will remap vmemmap pages 2-7 (virtual addresses) to page
> > > > > frame 1. And then we free page frame 2-7 to the buddy allocator.
> > > >
> > > > And this doesn't really happen in an atomic fashion from the pfn walker
> > > > POV, right? So it is very well possible that
> > >
> > > Yeah, you are right. But it may not be a problem for HugeTLB pages.
> > > Because in most cases, we only read the tail struct page and get the
> > > head struct page through compound_head() when the pfn is within
> > > a HugeTLB range. Right?
> >
> > Many pfn walkers would encounter the head page first and then skip over
> > the rest. Those should be reasonably safe. But there is no guarantee and
> > the fact that you need a valid page->compound_head which might get
> > scribbled over once you have the struct page makes this extremely
> > subtle.
> 
> In this patch series, we can guarantee that the page->compound_head
> is always valid. Because we reuse the first tail page. Maybe you need to
> look closer at this series. Thanks.

I must be really terrible at explaining my concern. Let me try one last
time. It is really _irrelevant_ what you do with tail pages. The
underlying problem is that you are changing struct pages under users
without any synchronization. What used to be a valid struct page will
turn into garbage as soon as you remap vmemmap page tables.
Muchun Song Nov. 23, 2020, 12:40 p.m. UTC | #21
On Mon, Nov 23, 2020 at 8:18 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 23-11-20 20:07:23, Muchun Song wrote:
> > On Mon, Nov 23, 2020 at 7:32 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > > > > > No I really mean that pfn_to_page will give you a struct page pointer
> > > > > > > from pages which you release from the vmemmap page tables. Those pages
> > > > > > > might get reused as soon sa they are freed to the page allocator.
> > > > > >
> > > > > > We will remap vmemmap pages 2-7 (virtual addresses) to page
> > > > > > frame 1. And then we free page frame 2-7 to the buddy allocator.
> > > > >
> > > > > And this doesn't really happen in an atomic fashion from the pfn walker
> > > > > POV, right? So it is very well possible that
> > > >
> > > > Yeah, you are right. But it may not be a problem for HugeTLB pages.
> > > > Because in most cases, we only read the tail struct page and get the
> > > > head struct page through compound_head() when the pfn is within
> > > > a HugeTLB range. Right?
> > >
> > > Many pfn walkers would encounter the head page first and then skip over
> > > the rest. Those should be reasonably safe. But there is no guarantee and
> > > the fact that you need a valid page->compound_head which might get
> > > scribbled over once you have the struct page makes this extremely
> > > subtle.
> >
> > In this patch series, we can guarantee that the page->compound_head
> > is always valid. Because we reuse the first tail page. Maybe you need to
> > look closer at this series. Thanks.
>
> I must be really terrible exaplaining my concern. Let me try one last
> time. It is really _irrelevant_ what you do with tail pages. The
> underlying problem is that you are changing struct pages under users
> without any synchronization. What used to be a valid struct page will
> turn into garbage as soon as you remap vmemmap page tables.

Thank you very much for your patient explanation. So if the pfn walkers
always try to get the head struct page through compound_head() when they
encounter a tail struct page, there will be no concerns. Do you agree?

> --
> Michal Hocko
> SUSE Labs
Matthew Wilcox Nov. 23, 2020, 12:45 p.m. UTC | #22
On Mon, Nov 23, 2020 at 11:42:58AM +0100, Michal Hocko wrote:
> On Mon 23-11-20 18:36:33, Muchun Song wrote:
> > > No I really mean that pfn_to_page will give you a struct page pointer
> > > from pages which you release from the vmemmap page tables. Those pages
> > > might get reused as soon sa they are freed to the page allocator.
> > 
> > We will remap vmemmap pages 2-7 (virtual addresses) to page
> > frame 1. And then we free page frame 2-7 to the buddy allocator.
> 
> And this doesn't really happen in an atomic fashion from the pfn walker
> POV, right? So it is very well possible that 
> 
> struct page *page = pfn_to_page();
> // remapping happens here
> // page content is no longer valid because its backing memory can be
> // reused for whatever purpose.

pfn_to_page() returns you a virtual address.  That virtual address
remains a valid pointer to exactly the same contents; it's just that
the page tables change to point to a different struct page which has
the same compound_head().
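
With CONFIG_SPARSEMEM_VMEMMAP the lookup is just address arithmetic on
the pfn, something like the following (quoting the generic definitions
from memory, so treat it as a sketch):

/* include/asm-generic/memory_model.h, CONFIG_SPARSEMEM_VMEMMAP case */
#define __pfn_to_page(pfn)      (vmemmap + (pfn))
#define __page_to_pfn(page)     (unsigned long)((page) - vmemmap)

So the pointer a walker holds does not depend on which physical frame
happens to back that vmemmap address at the moment.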
Michal Hocko Nov. 23, 2020, 12:48 p.m. UTC | #23
On Mon 23-11-20 20:40:40, Muchun Song wrote:
> On Mon, Nov 23, 2020 at 8:18 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 23-11-20 20:07:23, Muchun Song wrote:
> > > On Mon, Nov 23, 2020 at 7:32 PM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> > > > > > > > No I really mean that pfn_to_page will give you a struct page pointer
> > > > > > > > from pages which you release from the vmemmap page tables. Those pages
> > > > > > > > might get reused as soon sa they are freed to the page allocator.
> > > > > > >
> > > > > > > We will remap vmemmap pages 2-7 (virtual addresses) to page
> > > > > > > frame 1. And then we free page frame 2-7 to the buddy allocator.
> > > > > >
> > > > > > And this doesn't really happen in an atomic fashion from the pfn walker
> > > > > > POV, right? So it is very well possible that
> > > > >
> > > > > Yeah, you are right. But it may not be a problem for HugeTLB pages.
> > > > > Because in most cases, we only read the tail struct page and get the
> > > > > head struct page through compound_head() when the pfn is within
> > > > > a HugeTLB range. Right?
> > > >
> > > > Many pfn walkers would encounter the head page first and then skip over
> > > > the rest. Those should be reasonably safe. But there is no guarantee and
> > > > the fact that you need a valid page->compound_head which might get
> > > > scribbled over once you have the struct page makes this extremely
> > > > subtle.
> > >
> > > In this patch series, we can guarantee that the page->compound_head
> > > is always valid. Because we reuse the first tail page. Maybe you need to
> > > look closer at this series. Thanks.
> >
> > I must be really terrible exaplaining my concern. Let me try one last
> > time. It is really _irrelevant_ what you do with tail pages. The
> > underlying problem is that you are changing struct pages under users
> > without any synchronization. What used to be a valid struct page will
> > turn into garbage as soon as you remap vmemmap page tables.
> 
> Thank you very much for your patient explanation. So if the pfn walkers
> always try get the head struct page through compound_head() when it
> encounter a tail struct page. There will be no concerns. Do you agree?

No, I do not agree. Please read again. The content of the struct page
might be complete garbage at any time after pfn_to_page returns a
struct page. So there is no valid compound_head anymore.
Muchun Song Nov. 23, 2020, 1:05 p.m. UTC | #24
On Mon, Nov 23, 2020 at 8:45 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Nov 23, 2020 at 11:42:58AM +0100, Michal Hocko wrote:
> > On Mon 23-11-20 18:36:33, Muchun Song wrote:
> > > > No I really mean that pfn_to_page will give you a struct page pointer
> > > > from pages which you release from the vmemmap page tables. Those pages
> > > > might get reused as soon sa they are freed to the page allocator.
> > >
> > > We will remap vmemmap pages 2-7 (virtual addresses) to page
> > > frame 1. And then we free page frame 2-7 to the buddy allocator.
> >
> > And this doesn't really happen in an atomic fashion from the pfn walker
> > POV, right? So it is very well possible that
> >
> > struct page *page = pfn_to_page();
> > // remapping happens here
> > // page content is no longer valid because its backing memory can be
> > // reused for whatever purpose.
>
> pfn_to_page() returns you a virtual address.  That virtual address
> remains a valid pointer to exactly the same contents, it's just that
> the page tables change to point to a different struct page which has
> the same compound_head().

I agree with you.

Hi Michal,

Maybe you need to look at this.
Michal Hocko Nov. 23, 2020, 1:13 p.m. UTC | #25
On Mon 23-11-20 12:45:13, Matthew Wilcox wrote:
> On Mon, Nov 23, 2020 at 11:42:58AM +0100, Michal Hocko wrote:
> > On Mon 23-11-20 18:36:33, Muchun Song wrote:
> > > > No I really mean that pfn_to_page will give you a struct page pointer
> > > > from pages which you release from the vmemmap page tables. Those pages
> > > > might get reused as soon sa they are freed to the page allocator.
> > > 
> > > We will remap vmemmap pages 2-7 (virtual addresses) to page
> > > frame 1. And then we free page frame 2-7 to the buddy allocator.
> > 
> > And this doesn't really happen in an atomic fashion from the pfn walker
> > POV, right? So it is very well possible that 
> > 
> > struct page *page = pfn_to_page();
> > // remapping happens here
> > // page content is no longer valid because its backing memory can be
> > // reused for whatever purpose.
> 
> pfn_to_page() returns you a virtual address.  That virtual address
> remains a valid pointer to exactly the same contents, it's just that
> the page tables change to point to a different struct page which has
> the same compound_head().

You are right. I have managed to completely confuse myself. Sorry about
the noise!
Mike Kravetz Nov. 23, 2020, 9:52 p.m. UTC | #26
On 11/22/20 11:38 PM, Michal Hocko wrote:
> On Fri 20-11-20 09:45:12, Mike Kravetz wrote:
>> On 11/20/20 1:43 AM, David Hildenbrand wrote:
> [...]
>>>>> To keep things easy, maybe simply never allow to free these hugetlb pages
>>>>> again for now? If they were reserved during boot and the vmemmap condensed,
>>>>> then just let them stick around for all eternity.
>>>>
>>>> Not sure I understand. Do you propose to only free those vmemmap pages
>>>> when the pool is initialized during boot time and never allow to free
>>>> them up? That would certainly make it safer and maybe even simpler wrt
>>>> implementation.
>>>
>>> Exactly, let's keep it simple for now. I guess most use cases of this (virtualization, databases, ...) will allocate hugepages during boot and never free them.
>>
>> Not sure if I agree with that last statement.  Database and virtualization
>> use cases from my employer allocate hugetlb pages after boot.  It
>> is shortly after boot, but still not from boot/kernel command line.
> 
> Is there any strong reason for that?
> 

The reason I have been given is that it is preferable to have SW compute
the number of needed huge pages after boot based on total memory, rather
than have a sysadmin calculate the number and add a boot parameter.
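
i.e. roughly something like the example below, run from an init script
shortly after boot (the 90% target and the 2MB page size are purely
illustrative, not what any particular product does):

/*
 * Example only: size the 2MB huge page pool from MemTotal shortly
 * after boot instead of hard-coding a count on the kernel command line.
 */
#include <stdio.h>

int main(void)
{
        unsigned long mem_kib = 0, nr;
        char line[256];
        FILE *f = fopen("/proc/meminfo", "r");

        if (!f)
                return 1;
        while (fgets(line, sizeof(line), f))
                if (sscanf(line, "MemTotal: %lu kB", &mem_kib) == 1)
                        break;
        fclose(f);

        /* 90% of total memory, expressed in 2MB (2048 KiB) huge pages */
        nr = mem_kib * 9 / 10 / 2048;

        f = fopen("/proc/sys/vm/nr_hugepages", "w");
        if (!f)
                return 1;
        fprintf(f, "%lu\n", nr);
        fclose(f);
        return 0;
}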

>> Somewhat related, but not exactly addressing this issue ...
>>
>> One idea discussed in a previous patch set was to disable PMD/huge page
>> mapping of vmemmap if this feature was enabled.  This would eliminate a bunch
>> of the complex code doing page table manipulation.  It does not address
>> the issue of struct page pages going away which is being discussed here,
>> but it could be a way to simply the first version of this code.  If this
>> is going to be an 'opt in' feature as previously suggested, then eliminating
>> the  PMD/huge page vmemmap mapping may be acceptable.  My guess is that
>> sysadmins would only 'opt in' if they expect most of system memory to be used
>> by hugetlb pages.  We certainly have database and virtualization use cases
>> where this is true.
> 
> Would this simplify the code considerably? I mean, the vmemmap page
> tables will need to be updated anyway. So that code has to stay. PMD
> entry split shouldn't be the most complex part of that operation.  On
> the other hand dropping large pages for all vmemmaps will likely have a
> performance impact.

I agree with your points.  This was just one way in which the patch set
could be simplified.
Matthew Wilcox Nov. 23, 2020, 10:01 p.m. UTC | #27
On Mon, Nov 23, 2020 at 01:52:13PM -0800, Mike Kravetz wrote:
> On 11/22/20 11:38 PM, Michal Hocko wrote:
> > On Fri 20-11-20 09:45:12, Mike Kravetz wrote:
> >> Not sure if I agree with that last statement.  Database and virtualization
> >> use cases from my employer allocate hugetlb pages after boot.  It
> >> is shortly after boot, but still not from boot/kernel command line.
> > 
> > Is there any strong reason for that?
> 
> The reason I have been given is that it is preferable to have SW compute
> the number of needed huge pages after boot based on total memory, rather
> than have a sysadmin calculate the number and add a boot parameter.

Oh, I remember this bug!  I think it was posted publicly, even.
If the sysadmin configures, say, 90% of the RAM to be hugepages and
then a DIMM fails and the sysadmin doesn't remember to adjust the boot
parameter, Linux does some pretty horrible things and the symptom is
"Linux doesn't boot".