
[v6,4/4] mm: hugetlb_vmemmap: add hugetlb_free_vmemmap sysctl

Message ID 20220330153745.20465-5-songmuchun@bytedance.com (mailing list archive)
State New
Series add hugetlb_free_vmemmap sysctl

Commit Message

Muchun Song March 30, 2022, 3:37 p.m. UTC
We must add "hugetlb_free_vmemmap=on" to the boot cmdline and reboot
the server to enable the feature of freeing vmemmap pages of HugeTLB
pages.  Rebooting usually takes a long time.  Add a sysctl to enable
or disable the feature at runtime without rebooting.

Disabling requires that there are no optimized HugeTLB pages in the
system.  If disabling fails, set "nr_hugepages" to 0 and then retry.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 Documentation/admin-guide/sysctl/vm.rst |  14 +++++
 include/linux/memory_hotplug.h          |   9 +++
 mm/hugetlb_vmemmap.c                    | 101 +++++++++++++++++++++++++-------
 mm/hugetlb_vmemmap.h                    |   4 +-
 mm/memory_hotplug.c                     |   7 +--
 5 files changed, 108 insertions(+), 27 deletions(-)
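
For reference, the runtime toggle and the drain-and-retry flow described
above could look roughly like this from userspace; a minimal sketch, not
part of the series, assuming the sysctl lands as
/proc/sys/vm/hugetlb_free_vmemmap:

	/* Toggle vm.hugetlb_free_vmemmap at runtime; drain the HugeTLB
	 * pool and retry if disabling is refused while optimized pages
	 * still exist (the behaviour the commit message describes). */
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	static int write_knob(const char *path, const char *val)
	{
		int fd = open(path, O_WRONLY);
		ssize_t n;

		if (fd < 0)
			return -1;
		n = write(fd, val, strlen(val));
		close(fd);
		return n == (ssize_t)strlen(val) ? 0 : -1;
	}

	int main(void)
	{
		if (write_knob("/proc/sys/vm/hugetlb_free_vmemmap", "0")) {
			/* Refused: free the pool first, then retry. */
			write_knob("/proc/sys/vm/nr_hugepages", "0");
			if (write_knob("/proc/sys/vm/hugetlb_free_vmemmap", "0"))
				perror("disable hugetlb_free_vmemmap");
		}
		return 0;
	}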

Comments

Andrew Morton March 31, 2022, 2:36 a.m. UTC | #1
On Wed, 30 Mar 2022 23:37:45 +0800 Muchun Song <songmuchun@bytedance.com> wrote:

> We must add "hugetlb_free_vmemmap=on" to the boot cmdline and reboot
> the server to enable the feature of freeing vmemmap pages of HugeTLB
> pages.  Rebooting usually takes a long time.  Add a sysctl to enable
> or disable the feature at runtime without rebooting.

I forget, why did we add the hugetlb_free_vmemmap option in the first
place?  Why not just leave the feature enabled in all cases?

Furthermore, why would anyone want to tweak this at runtime?  What is
the use case?  Where is the end-user value in all of this?

> Disabling requires that there are no optimized HugeTLB pages in the
> system.  If disabling fails, set "nr_hugepages" to 0 and then retry.
> 
> --- a/Documentation/admin-guide/sysctl/vm.rst
> +++ b/Documentation/admin-guide/sysctl/vm.rst
> @@ -561,6 +561,20 @@ Change the minimum size of the hugepage pool.
>  See Documentation/admin-guide/mm/hugetlbpage.rst
>  
>  
> +hugetlb_free_vmemmap
> +====================
> +
> +Enable (set to 1) or disable (set to 0) the feature of optimizing vmemmap
> +pages associated with each HugeTLB page.  Once enabled, the vmemmap pages
> +of subsequently allocated HugeTLB pages from the buddy allocator will be
> +optimized, whereas already allocated HugeTLB pages will not be optimized.
> +If you fail to disable this feature, set "nr_hugepages" to 0 and then
> +retry, since it can only be disabled when there are no optimized HugeTLB
> +pages in the system.
> +

Pity the poor user who is looking at this and wondering whether it will
improve or worsen things.  If we don't tell them, who will?  Are they
supposed to just experiment?

What can we add here to help them understand whether this might be
beneficial?

> +See Documentation/admin-guide/mm/hugetlbpage.rst
Muchun Song March 31, 2022, 3:45 a.m. UTC | #2
On Thu, Mar 31, 2022 at 10:37 AM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Wed, 30 Mar 2022 23:37:45 +0800 Muchun Song <songmuchun@bytedance.com> wrote:
>
> > We must add "hugetlb_free_vmemmap=on" to the boot cmdline and reboot
> > the server to enable the feature of freeing vmemmap pages of HugeTLB
> > pages.  Rebooting usually takes a long time.  Add a sysctl to enable
> > or disable the feature at runtime without rebooting.
>
> I forget, why did we add the hugetlb_free_vmemmap option in the first
> place? Why not just leave the feature enabled in all cases?

The 1st reason is that we disabled PMD/huge page mapping of vmemmap
pages (in the original version), which increased the number of page
table pages.  So if a user/sysadmin only used a small number of
HugeTLB pages (as a percentage of system memory), they could end up
using more memory with hugetlb_free_vmemmap on as opposed to off.
That tradeoff is now gone.
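
To make the original tradeoff concrete, here is a back-of-the-envelope
calculation (a standalone sketch assuming 4KB base pages and a 64-byte
struct page; the exact numbers are configuration dependent):

	/* vmemmap pages per HugeTLB page: one is kept, the rest are
	 * remapped to it and freed back to the buddy allocator. */
	#include <stdio.h>

	int main(void)
	{
		const unsigned long base_page = 4096, struct_page = 64;
		const unsigned long hpage[] = { 2UL << 20, 1UL << 30 };

		for (int i = 0; i < 2; i++) {
			unsigned long nr = (hpage[i] / base_page) *
					   struct_page / base_page;
			printf("%4luMB: %4lu vmemmap pages, %4lu freed\n",
			       hpage[i] >> 20, nr, nr - 1);
		}
		return 0;	/* 2MB: 7 of 8 freed; 1GB: 4095 of 4096 */
	}

So each 2MB HugeTLB page can give back 28KB of vmemmap, but in the
original version that saving could be eaten by the extra page table
pages needed once the PMD mapping of the vmemmap was split.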

The 2nd reason is that this feature adds more overhead to the path of
HugeTLB allocation/freeing from/to the buddy system, as Mike said
in [1]:
"
There are still some instances where huge pages
are allocated 'on the fly' instead of being pulled from the pool.  Michal
pointed out the case of page migration.  It is also possible for someone to
use hugetlbfs without pre-allocating huge pages to the pool.  I remember the
use case pointed out in commit 099730d67417.  It says, "I have a hugetlbfs
user which is never explicitly allocating huge pages with 'nr_hugepages'.
They only set 'nr_overcommit_hugepages' and then let the pages be allocated
from the buddy allocator at fault time."  In this case, I suspect they were
using 'page fault' allocation for initialization much like someone using
/proc/sys/vm/nr_hugepages.  So, the overhead may not be as noticeable.
"

For those different workloads, we introduce hugetlb_free_vmemmap and
expect users to make decisions based on their workloads.

[1] https://patchwork.kernel.org/comment/23752641/
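
The fault-time case Mike describes can be reproduced with a short
program; a hypothetical reproducer, assuming a 2MB default huge page
size and nr_overcommit_hugepages > 0 with an empty pool:

	/* Touch a MAP_HUGETLB mapping so the huge page is allocated
	 * from the buddy allocator at fault time rather than taken
	 * from a pre-allocated pool. */
	#define _GNU_SOURCE
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 2UL << 20;
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
			       -1, 0);

		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		memset(p, 0, len);	/* fault here hits the buddy path */
		munmap(p, len);
		return 0;
	}

With hugetlb_free_vmemmap on, every such fault also pays for the
vmemmap remapping, which is the overhead being discussed.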

>
> Furthermore, why would anyone want to tweak this at runtime?  What is
> the use case?  Where is the end-user value in all of this?

If the workload on a server changes in the future, the users need to
adapt at runtime without rebooting the server.

>
> > Disabling requires that there are no optimized HugeTLB pages in the
> > system.  If disabling fails, set "nr_hugepages" to 0 and then retry.
> >
> > --- a/Documentation/admin-guide/sysctl/vm.rst
> > +++ b/Documentation/admin-guide/sysctl/vm.rst
> > @@ -561,6 +561,20 @@ Change the minimum size of the hugepage pool.
> >  See Documentation/admin-guide/mm/hugetlbpage.rst
> >
> >
> > +hugetlb_free_vmemmap
> > +====================
> > +
> > +Enable (set to 1) or disable (set to 0) the feature of optimizing vmemmap
> > +pages associated with each HugeTLB page.  Once enabled, the vmemmap pages
> > +of subsequently allocated HugeTLB pages from the buddy allocator will be
> > +optimized, whereas already allocated HugeTLB pages will not be optimized.
> > +If you fail to disable this feature, set "nr_hugepages" to 0 and then
> > +retry, since it can only be disabled when there are no optimized HugeTLB
> > +pages in the system.
> > +
>
> Pity the poor user who is looking at this and wondering whether it will
> improve or worsen things.  If we don't tell them, who will?  Are they
> supposed to just experiment?
>
> What can we add here to help them understand whether this might be
> beneficial?
>

My bad.  I should explain this in more detail to let users make better decisions.

Thanks.

Patch

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index f4804ce37c58..9e0e153ed935 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -561,6 +561,20 @@  Change the minimum size of the hugepage pool.
 See Documentation/admin-guide/mm/hugetlbpage.rst
 
 
+hugetlb_free_vmemmap
+====================
+
+Enable (set to 1) or disable (set to 0) the feature of optimizing vmemmap
+pages associated with each HugeTLB page.  Once enabled, the vmemmap pages
+of subsequently allocated HugeTLB pages from the buddy allocator will be
+optimized, whereas already allocated HugeTLB pages will not be optimized.
+If you fail to disable this feature, set "nr_hugepages" to 0 and then
+retry, since it can only be disabled when there are no optimized HugeTLB
+pages in the system.
+
+See Documentation/admin-guide/mm/hugetlbpage.rst
+
+
 nr_hugepages_mempolicy
 ======================
 
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 1ce6f8044f1e..9b015b254e86 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -348,4 +348,13 @@  void arch_remove_linear_mapping(u64 start, u64 size);
 extern bool mhp_supports_memmap_on_memory(unsigned long size);
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
+#ifdef CONFIG_MHP_MEMMAP_ON_MEMORY
+bool mhp_memmap_on_memory(void);
+#else
+static inline bool mhp_memmap_on_memory(void)
+{
+	return false;
+}
+#endif
+
 #endif /* __LINUX_MEMORY_HOTPLUG_H */
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 951cf83010c7..5df41b148e79 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -176,6 +176,7 @@ 
  */
 #define pr_fmt(fmt)	"HugeTLB: " fmt
 
+#include <linux/memory_hotplug.h>
 #include "hugetlb_vmemmap.h"
 
 #ifdef STRUCT_PAGE_SIZE_IS_POWER_OF_2
@@ -193,6 +194,10 @@  DEFINE_STATIC_KEY_MAYBE(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON,
 			hugetlb_free_vmemmap_enabled_key);
 EXPORT_SYMBOL(hugetlb_free_vmemmap_enabled_key);
 
+/* How many HugeTLB pages have their vmemmap pages optimized. */
+static atomic_long_t optimized_pages = ATOMIC_LONG_INIT(0);
+static DECLARE_RWSEM(sysctl_rwsem);
+
 static int __init early_hugetlb_free_vmemmap_param(char *buf)
 {
 	if (!buf)
@@ -209,11 +214,6 @@  static int __init early_hugetlb_free_vmemmap_param(char *buf)
 }
 early_param("hugetlb_free_vmemmap", early_hugetlb_free_vmemmap_param);
 
-static inline unsigned long free_vmemmap_pages_size_per_hpage(struct hstate *h)
-{
-	return (unsigned long)free_vmemmap_pages_per_hpage(h) << PAGE_SHIFT;
-}
-
 /*
  * Previously discarded vmemmap pages will be allocated and remapped
  * after this function returns zero.
@@ -222,14 +222,18 @@  int alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
 {
 	int ret;
 	unsigned long vmemmap_addr = (unsigned long)head;
-	unsigned long vmemmap_end, vmemmap_reuse;
+	unsigned long vmemmap_end, vmemmap_reuse, vmemmap_pages;
 
 	if (!HPageVmemmapOptimized(head))
 		return 0;
 
-	vmemmap_addr += RESERVE_VMEMMAP_SIZE;
-	vmemmap_end = vmemmap_addr + free_vmemmap_pages_size_per_hpage(h);
-	vmemmap_reuse = vmemmap_addr - PAGE_SIZE;
+	vmemmap_addr	+= RESERVE_VMEMMAP_SIZE;
+	vmemmap_pages	= free_vmemmap_pages_per_hpage(h);
+	vmemmap_end	= vmemmap_addr + (vmemmap_pages << PAGE_SHIFT);
+	vmemmap_reuse	= vmemmap_addr - PAGE_SIZE;
+
+	VM_BUG_ON_PAGE(!vmemmap_pages, head);
+
 	/*
 	 * The pages which the vmemmap virtual address range [@vmemmap_addr,
 	 * @vmemmap_end) are mapped to are freed to the buddy allocator, and
@@ -239,8 +243,14 @@  int alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
 	 */
 	ret = vmemmap_remap_alloc(vmemmap_addr, vmemmap_end, vmemmap_reuse,
 				  GFP_KERNEL | __GFP_NORETRY | __GFP_THISNODE);
-	if (!ret)
+	if (!ret) {
 		ClearHPageVmemmapOptimized(head);
+		/*
+		 * Paired with the acquire semantics in
+		 * hugetlb_free_vmemmap_handler().
+		 */
+		atomic_long_dec_return_release(&optimized_pages);
+	}
 
 	return ret;
 }
@@ -248,22 +258,28 @@  int alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
 void free_huge_page_vmemmap(struct hstate *h, struct page *head)
 {
 	unsigned long vmemmap_addr = (unsigned long)head;
-	unsigned long vmemmap_end, vmemmap_reuse;
+	unsigned long vmemmap_end, vmemmap_reuse, vmemmap_pages;
 
-	if (!free_vmemmap_pages_per_hpage(h))
-		return;
+	down_read(&sysctl_rwsem);
+	vmemmap_pages = free_vmemmap_pages_per_hpage(h);
+	if (!vmemmap_pages)
+		goto out;
 
-	vmemmap_addr += RESERVE_VMEMMAP_SIZE;
-	vmemmap_end = vmemmap_addr + free_vmemmap_pages_size_per_hpage(h);
-	vmemmap_reuse = vmemmap_addr - PAGE_SIZE;
+	vmemmap_addr	+= RESERVE_VMEMMAP_SIZE;
+	vmemmap_end	= vmemmap_addr + (vmemmap_pages << PAGE_SHIFT);
+	vmemmap_reuse	= vmemmap_addr - PAGE_SIZE;
 
 	/*
 	 * Remap the vmemmap virtual address range [@vmemmap_addr, @vmemmap_end)
 	 * to the page which @vmemmap_reuse is mapped to, then free the pages
 	 * which the range [@vmemmap_addr, @vmemmap_end] is mapped to.
 	 */
-	if (!vmemmap_remap_free(vmemmap_addr, vmemmap_end, vmemmap_reuse))
+	if (!vmemmap_remap_free(vmemmap_addr, vmemmap_end, vmemmap_reuse)) {
 		SetHPageVmemmapOptimized(head);
+		atomic_long_inc(&optimized_pages);
+	}
+out:
+	up_read(&sysctl_rwsem);
 }
 
 void __init hugetlb_vmemmap_init(struct hstate *h)
@@ -279,9 +295,6 @@  void __init hugetlb_vmemmap_init(struct hstate *h)
 	BUILD_BUG_ON(__NR_USED_SUBPAGE >=
 		     RESERVE_VMEMMAP_SIZE / sizeof(struct page));
 
-	if (!hugetlb_free_vmemmap_enabled())
-		return;
-
 	vmemmap_pages = (nr_pages * sizeof(struct page)) >> PAGE_SHIFT;
 	/*
 	 * The head page is not to be freed to buddy allocator, the other tail
@@ -297,4 +310,52 @@  void __init hugetlb_vmemmap_init(struct hstate *h)
 	pr_info("can free %d vmemmap pages for %s\n", h->nr_free_vmemmap_pages,
 		h->name);
 }
+
+static int hugetlb_free_vmemmap_handler(struct ctl_table *table, int write,
+					void *buffer, size_t *length,
+					loff_t *ppos)
+{
+	int ret;
+
+	down_write(&sysctl_rwsem);
+	/*
+	 * Cannot be disabled when there is at least one optimized
+	 * HugeTLB page in the system.
+	 *
+	 * The acquire semantics are paired with the release semantics
+	 * in alloc_huge_page_vmemmap().  If we see @optimized_pages
+	 * as 0, all vmemmap remapping operations performed by
+	 * alloc_huge_page_vmemmap() are visible as well, so we can
+	 * safely disable the static key.
+	 */
+	table->extra1 = atomic_long_read_acquire(&optimized_pages) ?
+			SYSCTL_ONE : SYSCTL_ZERO;
+	ret = proc_do_static_key(table, write, buffer, length, ppos);
+	up_write(&sysctl_rwsem);
+
+	return ret;
+}
+
+static struct ctl_table hugetlb_vmemmap_sysctls[] = {
+	{
+		.procname	= "hugetlb_free_vmemmap",
+		.data		= &hugetlb_free_vmemmap_enabled_key.key,
+		.mode		= 0644,
+		.proc_handler	= hugetlb_free_vmemmap_handler,
+	},
+	{ }
+};
+
+static __init int hugetlb_vmemmap_sysctls_init(void)
+{
+	/*
+	 * The vmemmap pages cannot be optimized if
+	 * "memory_hotplug.memmap_on_memory" is enabled.
+	 */
+	if (!mhp_memmap_on_memory())
+		register_sysctl_init("vm", hugetlb_vmemmap_sysctls);
+
+	return 0;
+}
+late_initcall(hugetlb_vmemmap_sysctls_init);
 #endif /* STRUCT_PAGE_SIZE_IS_POWER_OF_2 */
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index b137fd8b6ba4..89cd85fb0d26 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -21,7 +21,9 @@  void hugetlb_vmemmap_init(struct hstate *h);
  */
 static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
 {
-	return h->nr_free_vmemmap_pages;
+	if (hugetlb_free_vmemmap_enabled())
+		return h->nr_free_vmemmap_pages;
+	return 0;
 }
 #else
 static inline int alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index da594b382829..793c04cfe46f 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -63,15 +63,10 @@  static bool memmap_on_memory __ro_after_init;
 module_param_cb(memmap_on_memory, &memmap_on_memory_ops, &memmap_on_memory, 0444);
 MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory hotplug");
 
-static inline bool mhp_memmap_on_memory(void)
+bool mhp_memmap_on_memory(void)
 {
 	return memmap_on_memory;
 }
-#else
-static inline bool mhp_memmap_on_memory(void)
-{
-	return false;
-}
 #endif
 
 enum {
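
As a footnote for reviewers, the acquire/release pairing that the
comments in hugetlb_free_vmemmap_handler() describe can be modeled in
userspace; a simplified, illustrative sketch using C11 atomics and a
pthread rwlock standing in for sysctl_rwsem (not kernel code):

	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdio.h>

	static atomic_long optimized_pages;
	static pthread_rwlock_t sysctl_rwsem = PTHREAD_RWLOCK_INITIALIZER;
	static bool feature_enabled = true;	/* models the static key */

	/* Models alloc_huge_page_vmemmap(): restore a page's vmemmap,
	 * then publish the fact with release semantics. */
	static void restore_vmemmap(void)
	{
		/* ... remap vmemmap pages, clear the optimized flag ... */
		atomic_fetch_sub_explicit(&optimized_pages, 1,
					  memory_order_release);
	}

	/* Models hugetlb_free_vmemmap_handler(): only when the acquire
	 * read observes zero is every restore above guaranteed visible,
	 * so the key can be cleared safely. */
	static void try_disable(void)
	{
		pthread_rwlock_wrlock(&sysctl_rwsem);
		if (atomic_load_explicit(&optimized_pages,
					 memory_order_acquire) == 0)
			feature_enabled = false;
		pthread_rwlock_unlock(&sysctl_rwsem);
	}

	int main(void)
	{
		atomic_store(&optimized_pages, 1);
		restore_vmemmap();
		try_disable();
		printf("feature_enabled = %d\n", feature_enabled);
		return 0;
	}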