
[v19,7/8] mm: hugetlb: add a kernel parameter hugetlb_free_vmemmap

Message ID 20210315092015.35396-8-songmuchun@bytedance.com (mailing list archive)
State New, archived
Series Free some vmemmap pages of HugeTLB page

Commit Message

Muchun Song March 15, 2021, 9:20 a.m. UTC
Add a kernel parameter hugetlb_free_vmemmap to enable the feature of
freeing unused vmemmap pages associated with each hugetlb page on boot.

We disable PMD mapping of vmemmap pages for the x86-64 arch when this
feature is enabled, because vmemmap_remap_free() depends on vmemmap
being base-page mapped.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Barry Song <song.bao.hua@hisilicon.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Tested-by: Chen Huang <chenhuang5@huawei.com>
Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com>
---
 Documentation/admin-guide/kernel-parameters.txt | 17 +++++++++++++++++
 Documentation/admin-guide/mm/hugetlbpage.rst    |  3 +++
 arch/x86/mm/init_64.c                           |  8 ++++++--
 include/linux/hugetlb.h                         | 19 +++++++++++++++++++
 mm/hugetlb_vmemmap.c                            | 24 ++++++++++++++++++++++++
 5 files changed, 69 insertions(+), 2 deletions(-)
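
For reference, the savings advertised by this series (6 * PAGE_SIZE for each 2MB
hugetlb page, see the kernel-parameters.txt hunk below) follow from a short
calculation. A minimal userspace sketch, assuming 4KB base pages,
sizeof(struct page) == 64 on x86-64, and the RESERVE_VMEMMAP_NR == 2 reservation
used by this series:

/*
 * Illustrative only: reproduces the "6 * PAGE_SIZE per 2MB hugetlb page"
 * figure quoted in the documentation added by this patch.
 */
#include <stdio.h>

int main(void)
{
	unsigned long page_size = 4096;			/* base page size */
	unsigned long hpage_size = 2UL * 1024 * 1024;	/* 2MB HugeTLB page */
	unsigned long struct_page_size = 64;		/* typical sizeof(struct page) */
	unsigned long reserve = 2;			/* RESERVE_VMEMMAP_NR */

	unsigned long nr_struct_pages = hpage_size / page_size;	/* 512 */
	unsigned long vmemmap_pages =
		nr_struct_pages * struct_page_size / page_size;		/* 8 */

	printf("vmemmap pages per 2MB HugeTLB page: %lu, freed: %lu\n",
	       vmemmap_pages, vmemmap_pages - reserve);			/* 8, 6 */
	return 0;
}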

Comments

Oscar Salvador March 19, 2021, 8:59 a.m. UTC | #1
On Mon, Mar 15, 2021 at 05:20:14PM +0800, Muchun Song wrote:
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -34,6 +34,7 @@
>  #include <linux/gfp.h>
>  #include <linux/kcore.h>
>  #include <linux/bootmem_info.h>
> +#include <linux/hugetlb.h>
>  
>  #include <asm/processor.h>
>  #include <asm/bios_ebda.h>
> @@ -1557,7 +1558,8 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>  {
>  	int err;
>  
> -	if (end - start < PAGES_PER_SECTION * sizeof(struct page))
> +	if ((is_hugetlb_free_vmemmap_enabled()  && !altmap) ||
> +	    end - start < PAGES_PER_SECTION * sizeof(struct page))
>  		err = vmemmap_populate_basepages(start, end, node, NULL);
>  	else if (boot_cpu_has(X86_FEATURE_PSE))
>  		err = vmemmap_populate_hugepages(start, end, node, altmap);

I've been thinking about this some more.

Assume you opt in to the hugetlb-vmemmap feature, and assume you pass a valid altmap
to vmemmap_populate.
This will lead to us populating the vmemmap array with hugepages.

What if then, a HugeTLB gets allocated and falls within that memory range (backed
by hugepages)?
AFAIK, this will get us in trouble as currently the code can only operate on memory
backed by PAGE_SIZE pages, right?

I cannot remember, but I do not think anything prevents that from happening.
Am I missing anything?
Muchun Song March 19, 2021, 12:15 p.m. UTC | #2
On Fri, Mar 19, 2021 at 4:59 PM Oscar Salvador <osalvador@suse.de> wrote:
>
> On Mon, Mar 15, 2021 at 05:20:14PM +0800, Muchun Song wrote:
> > --- a/arch/x86/mm/init_64.c
> > +++ b/arch/x86/mm/init_64.c
> > @@ -34,6 +34,7 @@
> >  #include <linux/gfp.h>
> >  #include <linux/kcore.h>
> >  #include <linux/bootmem_info.h>
> > +#include <linux/hugetlb.h>
> >
> >  #include <asm/processor.h>
> >  #include <asm/bios_ebda.h>
> > @@ -1557,7 +1558,8 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
> >  {
> >       int err;
> >
> > -     if (end - start < PAGES_PER_SECTION * sizeof(struct page))
> > +     if ((is_hugetlb_free_vmemmap_enabled()  && !altmap) ||
> > +         end - start < PAGES_PER_SECTION * sizeof(struct page))
> >               err = vmemmap_populate_basepages(start, end, node, NULL);
> >       else if (boot_cpu_has(X86_FEATURE_PSE))
> >               err = vmemmap_populate_hugepages(start, end, node, altmap);
>
> I've been thinking about this some more.
>
> Assume you opt in to the hugetlb-vmemmap feature, and assume you pass a valid altmap
> to vmemmap_populate.
> This will lead to us populating the vmemmap array with hugepages.

Right.

>
> What if then, a HugeTLB gets allocated and falls within that memory range (backed
> by hugepages)?

I am not sure whether we can allocate the HugeTLB pages from there.
Will only device memory pass a valid altmap parameter to
vmemmap_populate()? If yes, can we allocate HugeTLB pages from
device memory? Sorry, I am not an expert on this.


> AFAIK, this will get us in trouble as currently the code can only operate on memory
> backed by PAGE_SIZE pages, right?
>
> I cannot remember, but I do not think anything prevents that from happening.
> Am I missing anything?

Maybe David H is more familiar with this.

Hi David,

Do you have some suggestions on this?

Thanks.


>
> --
> Oscar Salvador
> SUSE L3
David Hildenbrand March 19, 2021, 12:36 p.m. UTC | #3
On 19.03.21 13:15, Muchun Song wrote:
> On Fri, Mar 19, 2021 at 4:59 PM Oscar Salvador <osalvador@suse.de> wrote:
>>
>> On Mon, Mar 15, 2021 at 05:20:14PM +0800, Muchun Song wrote:
>>> --- a/arch/x86/mm/init_64.c
>>> +++ b/arch/x86/mm/init_64.c
>>> @@ -34,6 +34,7 @@
>>>   #include <linux/gfp.h>
>>>   #include <linux/kcore.h>
>>>   #include <linux/bootmem_info.h>
>>> +#include <linux/hugetlb.h>
>>>
>>>   #include <asm/processor.h>
>>>   #include <asm/bios_ebda.h>
>>> @@ -1557,7 +1558,8 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>>   {
>>>        int err;
>>>
>>> -     if (end - start < PAGES_PER_SECTION * sizeof(struct page))
>>> +     if ((is_hugetlb_free_vmemmap_enabled()  && !altmap) ||
>>> +         end - start < PAGES_PER_SECTION * sizeof(struct page))
>>>                err = vmemmap_populate_basepages(start, end, node, NULL);
>>>        else if (boot_cpu_has(X86_FEATURE_PSE))
>>>                err = vmemmap_populate_hugepages(start, end, node, altmap);
>>
>> I've been thinking about this some more.
>>
>> Assume you opt in to the hugetlb-vmemmap feature, and assume you pass a valid altmap
>> to vmemmap_populate.
>> This will lead to us populating the vmemmap array with hugepages.
> 
> Right.
> 
>>
>> What if then, a HugeTLB gets allocated and falls within that memory range (backed
>> by hugepages)?
> 
> I am not sure whether we can allocate the HugeTLB pages from there.
> Will only device memory pass a valid altmap parameter to
> vmemmap_populate()? If yes, can we allocate HugeTLB pages from
> device memory? Sorry, I am not an expert on this.

I think, right now, yes. System RAM that's applicable for HugePages 
never uses an altmap. But Oscar's patch will change that, maybe before 
your series gets included, from what I've been reading. [1]

[1] https://lkml.kernel.org/r/20210319092635.6214-1-osalvador@suse.de

> 
> 
>> AFAIK, this will get us in trouble as currently the code can only operate on memory
>> backed by PAGE_SIZE pages, right?
>>
>> I cannot remember, but I do not think anything prevents that from happening.
>> Am I missing anything?
> 
> Maybe David H is more familiar with this.
> 
> Hi David,
> 
> Do you have some suggestions on this?

There has to be some way to identify whether we can optimize specific 
vmemmap pages or should just leave them alone. altmap vs. !altmap.

Unfortunately, there is no easy way to detect that - e.g., 
PageReserved() applies also to boot memory.

We could go back to setting a special PageType for these vmemmap pages, 
indicating "this is a page allocated from an altmap, don't touch it".
David Hildenbrand March 19, 2021, 12:42 p.m. UTC | #4
On 19.03.21 13:36, David Hildenbrand wrote:
> On 19.03.21 13:15, Muchun Song wrote:
>> On Fri, Mar 19, 2021 at 4:59 PM Oscar Salvador <osalvador@suse.de> wrote:
>>>
>>> On Mon, Mar 15, 2021 at 05:20:14PM +0800, Muchun Song wrote:
>>>> --- a/arch/x86/mm/init_64.c
>>>> +++ b/arch/x86/mm/init_64.c
>>>> @@ -34,6 +34,7 @@
>>>>    #include <linux/gfp.h>
>>>>    #include <linux/kcore.h>
>>>>    #include <linux/bootmem_info.h>
>>>> +#include <linux/hugetlb.h>
>>>>
>>>>    #include <asm/processor.h>
>>>>    #include <asm/bios_ebda.h>
>>>> @@ -1557,7 +1558,8 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>>>    {
>>>>         int err;
>>>>
>>>> -     if (end - start < PAGES_PER_SECTION * sizeof(struct page))
>>>> +     if ((is_hugetlb_free_vmemmap_enabled()  && !altmap) ||
>>>> +         end - start < PAGES_PER_SECTION * sizeof(struct page))
>>>>                 err = vmemmap_populate_basepages(start, end, node, NULL);
>>>>         else if (boot_cpu_has(X86_FEATURE_PSE))
>>>>                 err = vmemmap_populate_hugepages(start, end, node, altmap);
>>>
>>> I've been thinking about this some more.
>>>
>>> Assume you opt in to the hugetlb-vmemmap feature, and assume you pass a valid altmap
>>> to vmemmap_populate.
>>> This will lead to us populating the vmemmap array with hugepages.
>>
>> Right.
>>
>>>
>>> What if then, a HugeTLB gets allocated and falls within that memory range (backed
>>> by hugepages)?
>>
>> I am not sure whether we can allocate the HugeTLB pages from there.
>> Will only device memory pass a valid altmap parameter to
>> vmemmap_populate()? If yes, can we allocate HugeTLB pages from
>> device memory? Sorry, I am not an expert on this.
> 
> I think, right now, yes. System RAM that's applicable for HugePages
> never uses an altmap. But Oscar's patch will change that, maybe before
> your series gets included, from what I've been reading. [1]
> 
> [1] https://lkml.kernel.org/r/20210319092635.6214-1-osalvador@suse.de
> 
>>
>>
>>> AFAIK, this will get us in trouble as currently the code can only operate on memory
>>> backed by PAGE_SIZE pages, right?
>>>
>>> I cannot remember, but I do not think anything prevents that from happening.
>>> Am I missing anything?
>>
>> Maybe David H is more familiar with this.
>>
>> Hi David,
>>
>> Do you have some suggestions on this?
> 
> There has to be some way to identify whether we can optimize specific
> vmemmap pages or should just leave them alone. altmap vs. !altmap.
> 
> Unfortunately, there is no easy way to detect that - e.g.,
> PageReserved() applies also to boot memory.
> 
> We could go back to setting a special PageType for these vmemmap pages,
> indicating "this is a page allocated from an altmap, don't touch it".
> 

With SPARSEMEM we can use

PageReserved(page) && early_section(): vmemmap from bootmem

PageReserved(page) && !early_section(): vmemmap from altmap

!PageReserved(page): vmemmap from buddy

But it's a bit shaky :)
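
For reference, a minimal sketch of that SPARSEMEM heuristic; the helper name is
hypothetical and not part of this series:

#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/page-flags.h>

/* Hypothetical helper classifying a vmemmap page as bootmem-, altmap- or
 * buddy-allocated, following the PageReserved()/early_section() reasoning
 * above (SPARSEMEM only). */
static bool vmemmap_page_is_from_altmap(struct page *page)
{
	struct mem_section *ms =
		__nr_to_section(pfn_to_section_nr(page_to_pfn(page)));

	if (!PageReserved(page))
		return false;	/* vmemmap came from the buddy allocator */
	if (early_section(ms))
		return false;	/* vmemmap came from bootmem */

	return true;		/* PageReserved() && !early_section(): altmap */
}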

Patch

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 04545725f187..2e6b57207a3d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1557,6 +1557,23 @@ 
 			Documentation/admin-guide/mm/hugetlbpage.rst.
 			Format: size[KMG]
 
+	hugetlb_free_vmemmap=
+			[KNL] Requires CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+			enabled.
+			Allows heavy hugetlb users to free up some more
+			memory (6 * PAGE_SIZE for each 2MB hugetlb page).
+			This feature is not free though. Large page
+			tables are not used to back vmemmap pages which
+			can lead to a performance degradation for some
+			workloads. Also there will be memory allocation
+			required when hugetlb pages are freed from the
+			pool which can lead to corner cases under heavy
+			memory pressure.
+			Format: { on | off (default) }
+
+			on:  enable the feature
+			off: disable the feature
+
 	hung_task_panic=
 			[KNL] Should the hung task detector generate panics.
 			Format: 0 | 1
diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
index 6988895d09a8..8abaeb144e44 100644
--- a/Documentation/admin-guide/mm/hugetlbpage.rst
+++ b/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -153,6 +153,9 @@  default_hugepagesz
 
 	will all result in 256 2M huge pages being allocated.  Valid default
 	huge page size is architecture dependent.
+hugetlb_free_vmemmap
+	When CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is set, this enables freeing
+	unused vmemmap pages associated with each HugeTLB page.
 
 When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``
 indicates the current number of pre-allocated huge pages of the default size.
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 0435bee2e172..39f88c5faadc 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -34,6 +34,7 @@ 
 #include <linux/gfp.h>
 #include <linux/kcore.h>
 #include <linux/bootmem_info.h>
+#include <linux/hugetlb.h>
 
 #include <asm/processor.h>
 #include <asm/bios_ebda.h>
@@ -1557,7 +1558,8 @@  int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 {
 	int err;
 
-	if (end - start < PAGES_PER_SECTION * sizeof(struct page))
+	if ((is_hugetlb_free_vmemmap_enabled()  && !altmap) ||
+	    end - start < PAGES_PER_SECTION * sizeof(struct page))
 		err = vmemmap_populate_basepages(start, end, node, NULL);
 	else if (boot_cpu_has(X86_FEATURE_PSE))
 		err = vmemmap_populate_hugepages(start, end, node, altmap);
@@ -1585,6 +1587,8 @@  void register_page_bootmem_memmap(unsigned long section_nr,
 	pmd_t *pmd;
 	unsigned int nr_pmd_pages;
 	struct page *page;
+	bool base_mapping = !boot_cpu_has(X86_FEATURE_PSE) ||
+			    is_hugetlb_free_vmemmap_enabled();
 
 	for (; addr < end; addr = next) {
 		pte_t *pte = NULL;
@@ -1610,7 +1614,7 @@  void register_page_bootmem_memmap(unsigned long section_nr,
 		}
 		get_page_bootmem(section_nr, pud_page(*pud), MIX_SECTION_INFO);
 
-		if (!boot_cpu_has(X86_FEATURE_PSE)) {
+		if (base_mapping) {
 			next = (addr + PAGE_SIZE) & PAGE_MASK;
 			pmd = pmd_offset(pud, addr);
 			if (pmd_none(*pmd))
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 7f7a0e3405ae..3efc6b9b23f2 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -872,6 +872,20 @@  static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
 }
 #endif
 
+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+extern bool hugetlb_free_vmemmap_enabled;
+
+static inline bool is_hugetlb_free_vmemmap_enabled(void)
+{
+	return hugetlb_free_vmemmap_enabled;
+}
+#else
+static inline bool is_hugetlb_free_vmemmap_enabled(void)
+{
+	return false;
+}
+#endif
+
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 
@@ -1025,6 +1039,11 @@  static inline void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr
 					pte_t *ptep, pte_t pte, unsigned long sz)
 {
 }
+
+static inline bool is_hugetlb_free_vmemmap_enabled(void)
+{
+	return false;
+}
 #endif	/* CONFIG_HUGETLB_PAGE */
 
 static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 0e6835264da3..721258beeb94 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -168,6 +168,8 @@ 
  * (last) level. So this type of HugeTLB page can be optimized only when its
  * size of the struct page structs is greater than 2 pages.
  */
+#define pr_fmt(fmt)	"HugeTLB: " fmt
+
 #include "hugetlb_vmemmap.h"
 
 /*
@@ -180,6 +182,28 @@ 
 #define RESERVE_VMEMMAP_NR		2U
 #define RESERVE_VMEMMAP_SIZE		(RESERVE_VMEMMAP_NR << PAGE_SHIFT)
 
+bool hugetlb_free_vmemmap_enabled;
+
+static int __init early_hugetlb_free_vmemmap_param(char *buf)
+{
+	/* We cannot optimize if a "struct page" crosses page boundaries. */
+	if ((!is_power_of_2(sizeof(struct page)))) {
+		pr_warn("cannot free vmemmap pages because \"struct page\" crosses page boundaries\n");
+		return 0;
+	}
+
+	if (!buf)
+		return -EINVAL;
+
+	if (!strcmp(buf, "on"))
+		hugetlb_free_vmemmap_enabled = true;
+	else if (strcmp(buf, "off"))
+		return -EINVAL;
+
+	return 0;
+}
+early_param("hugetlb_free_vmemmap", early_hugetlb_free_vmemmap_param);
+
 static inline unsigned long free_vmemmap_pages_size_per_hpage(struct hstate *h)
 {
 	return (unsigned long)free_vmemmap_pages_per_hpage(h) << PAGE_SHIFT;