diff mbox series

[v4,1/5] hugetlb: add demote hugetlb page sysfs interfaces

Message ID 20211007181918.136982-2-mike.kravetz@oracle.com (mailing list archive)
State New
Headers show
Series hugetlb: add demote/split page functionality | expand

Commit Message

Mike Kravetz Oct. 7, 2021, 6:19 p.m. UTC
Two new sysfs files are added to demote hugtlb pages.  These files are
both per-hugetlb page size and per node.  Files are:
  demote_size - The size in Kb that pages are demoted to. (read-write)
  demote - The number of huge pages to demote. (write-only)

By default, demote_size is the next smallest huge page size.  Valid huge
page sizes less than huge page size may be written to this file.  When
huge pages are demoted, they are demoted to this size.

Writing a value to demote will result in an attempt to demote that
number of hugetlb pages to an appropriate number of demote_size pages.

NOTE: Demote interfaces are only provided for huge page sizes if there
is a smaller target demote huge page size.  For example, on x86 1GB huge
pages will have demote interfaces.  2MB huge pages will not have demote
interfaces.

This patch does not provide full demote functionality.  It only provides
the sysfs interfaces.

It also provides documentation for the new interfaces.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 Documentation/admin-guide/mm/hugetlbpage.rst |  30 +++-
 include/linux/hugetlb.h                      |   1 +
 mm/hugetlb.c                                 | 155 ++++++++++++++++++-
 3 files changed, 183 insertions(+), 3 deletions(-)

Comments

Oscar Salvador Oct. 8, 2021, 7:51 a.m. UTC | #1
On Thu, Oct 07, 2021 at 11:19:14AM -0700, Mike Kravetz wrote:
> +static ssize_t demote_store(struct kobject *kobj,
> +	       struct kobj_attribute *attr, const char *buf, size_t len)
> +{
> +	unsigned long nr_demote;
> +	unsigned long nr_available;
> +	nodemask_t nodes_allowed, *n_mask;
> +	struct hstate *h;
> +	int err = 0;
> +	int nid;
> +
> +	err = kstrtoul(buf, 10, &nr_demote);
> +	if (err)
> +		return err;
> +	h = kobj_to_hstate(kobj, &nid);
> +
> +	/* Synchronize with other sysfs operations modifying huge pages */
> +	mutex_lock(&h->resize_lock);
> +
> +	if (nid != NUMA_NO_NODE) {
> +		init_nodemask_of_node(&nodes_allowed, nid);
> +		n_mask = &nodes_allowed;
> +	} else {
> +		n_mask = &node_states[N_MEMORY];
> +	}

Why this needs to be protected by the resize_lock? I do not understand
what are we really protecting here and from what.

Besides that, I did not spot anything wrong.
Mike Kravetz Oct. 8, 2021, 8:24 p.m. UTC | #2
On 10/8/21 12:51 AM, Oscar Salvador wrote:
> On Thu, Oct 07, 2021 at 11:19:14AM -0700, Mike Kravetz wrote:
>> +static ssize_t demote_store(struct kobject *kobj,
>> +	       struct kobj_attribute *attr, const char *buf, size_t len)
>> +{
>> +	unsigned long nr_demote;
>> +	unsigned long nr_available;
>> +	nodemask_t nodes_allowed, *n_mask;
>> +	struct hstate *h;
>> +	int err = 0;
>> +	int nid;
>> +
>> +	err = kstrtoul(buf, 10, &nr_demote);
>> +	if (err)
>> +		return err;
>> +	h = kobj_to_hstate(kobj, &nid);
>> +
>> +	/* Synchronize with other sysfs operations modifying huge pages */
>> +	mutex_lock(&h->resize_lock);
>> +
>> +	if (nid != NUMA_NO_NODE) {
>> +		init_nodemask_of_node(&nodes_allowed, nid);
>> +		n_mask = &nodes_allowed;
>> +	} else {
>> +		n_mask = &node_states[N_MEMORY];
>> +	}
> 
> Why this needs to be protected by the resize_lock? I do not understand
> what are we really protecting here and from what.

In general, the resize_lock prevents unexpected consequences when
multiple users are modifying the number of pages in a pool concurrently
from the proc/sysfs interfaces.  The mutex is acquired here because we
are modifying (decreasing) the pool size.

When the mutex was added, Michal asked about the need.  Theoretically,
all code making pool adjustment should be safe because at a low level
the hugetlb_lock is taken when the hstate is modified.  However, I did
point out that there is a hstate variable 'next_nid_to_alloc' which is
used outside the lock which could result in pages being allocated from
the wrong node.  One could argue that if two (root) sysadmin users are
modifying the pool concurrently, they should not be surprised by such
consequences.  The mutex seemed like a small price to avoid any such
potential issues.  It is taken here to be consistent with this model.

Coincidentally, I was running some stress testing with this series last
night and noticed some unexpected behavior.   As a result, we will also
need to take the resize_mutex of the 'target_hstate'.  This is all in
patch 5 of the series.  I will add details there.
Oscar Salvador Oct. 18, 2021, 7:35 a.m. UTC | #3
On Fri, Oct 08, 2021 at 01:24:28PM -0700, Mike Kravetz wrote:
> In general, the resize_lock prevents unexpected consequences when
> multiple users are modifying the number of pages in a pool concurrently
> from the proc/sysfs interfaces.  The mutex is acquired here because we
> are modifying (decreasing) the pool size.

Yes, I got that. My question was wrt. n_mask initialization:

 +	if (nid != NUMA_NO_NODE) {
 +		init_nodemask_of_node(&nodes_allowed, nid);
 +		n_mask = &nodes_allowed;
 +	} else {
 +		n_mask = &node_states[N_MEMORY];
 +	}

AFAICS, this does not need to be protected.

with that addressed:

Reviewed-by: Oscar Salvador <osalvador@suse.de>
Mike Kravetz Oct. 22, 2021, 6:58 p.m. UTC | #4
On 10/18/21 12:35 AM, Oscar Salvador wrote:
> On Fri, Oct 08, 2021 at 01:24:28PM -0700, Mike Kravetz wrote:
>> In general, the resize_lock prevents unexpected consequences when
>> multiple users are modifying the number of pages in a pool concurrently
>> from the proc/sysfs interfaces.  The mutex is acquired here because we
>> are modifying (decreasing) the pool size.
> 
> Yes, I got that. My question was wrt. n_mask initialization:
> 
>  +	if (nid != NUMA_NO_NODE) {
>  +		init_nodemask_of_node(&nodes_allowed, nid);
>  +		n_mask = &nodes_allowed;
>  +	} else {
>  +		n_mask = &node_states[N_MEMORY];
>  +	}
> 
> AFAICS, this does not need to be protected.
> 

Sorry that I misunderstood your quesion!

You are correct, the n_mask initialization does not need to be protected
by the mutex.  Thanks for pointing that out.

The updated patch below simply moves taking the mutex after this
initialization code.

Andrew, please let me know if you want something else to make this
update simpler for you.

From f9c401323fee234667787a118c74d93aa185fcf6 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Fri, 22 Oct 2021 11:40:57 -0700
Subject: [PATCH v4 1/5] hugetlb: add demote hugetlb page sysfs interfaces

Two new sysfs files are added to demote hugtlb pages.  These files are
both per-hugetlb page size and per node.  Files are:
  demote_size - The size in Kb that pages are demoted to. (read-write)
  demote - The number of huge pages to demote. (write-only)

By default, demote_size is the next smallest huge page size.  Valid huge
page sizes less than huge page size may be written to this file.  When
huge pages are demoted, they are demoted to this size.

Writing a value to demote will result in an attempt to demote that
number of hugetlb pages to an appropriate number of demote_size pages.

NOTE: Demote interfaces are only provided for huge page sizes if there
is a smaller target demote huge page size.  For example, on x86 1GB huge
pages will have demote interfaces.  2MB huge pages will not have demote
interfaces.

This patch does not provide full demote functionality.  It only provides
the sysfs interfaces.

It also provides documentation for the new interfaces.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 Documentation/admin-guide/mm/hugetlbpage.rst |  30 +++-
 include/linux/hugetlb.h                      |   1 +
 mm/hugetlb.c                                 | 155 ++++++++++++++++++-
 3 files changed, 183 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
index 8abaeb144e44..bb90de3885d1 100644
--- a/Documentation/admin-guide/mm/hugetlbpage.rst
+++ b/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -234,8 +234,12 @@ will exist, of the form::
 
 	hugepages-${size}kB
 
-Inside each of these directories, the same set of files will exist::
+Inside each of these directories, the set of files contained in ``/proc``
+will exist.  In addition, two additional interfaces for demoting huge
+pages may exist::
 
+        demote
+        demote_size
 	nr_hugepages
 	nr_hugepages_mempolicy
 	nr_overcommit_hugepages
@@ -243,7 +247,29 @@ Inside each of these directories, the same set of files will exist::
 	resv_hugepages
 	surplus_hugepages
 
-which function as described above for the default huge page-sized case.
+The demote interfaces provide the ability to split a huge page into
+smaller huge pages.  For example, the x86 architecture supports both
+1GB and 2MB huge pages sizes.  A 1GB huge page can be split into 512
+2MB huge pages.  Demote interfaces are not available for the smallest
+huge page size.  The demote interfaces are:
+
+demote_size
+        is the size of demoted pages.  When a page is demoted a corresponding
+        number of huge pages of demote_size will be created.  By default,
+        demote_size is set to the next smaller huge page size.  If there are
+        multiple smaller huge page sizes, demote_size can be set to any of
+        these smaller sizes.  Only huge page sizes less than the current huge
+        pages size are allowed.
+
+demote
+        is used to demote a number of huge pages.  A user with root privileges
+        can write to this file.  It may not be possible to demote the
+        requested number of huge pages.  To determine how many pages were
+        actually demoted, compare the value of nr_hugepages before and after
+        writing to the demote interface.  demote is a write only interface.
+
+The interfaces which are the same as in ``/proc`` (all except demote and
+demote_size) function as described above for the default huge page-sized case.
 
 .. _mem_policy_and_hp_alloc:
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 1faebe1cd0ed..f2c3979efd69 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -596,6 +596,7 @@ struct hstate {
 	int next_nid_to_alloc;
 	int next_nid_to_free;
 	unsigned int order;
+	unsigned int demote_order;
 	unsigned long mask;
 	unsigned long max_huge_pages;
 	unsigned long nr_huge_pages;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 95dc7b83381f..d2262ad4b3ed 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2986,7 +2986,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 
 static void __init hugetlb_init_hstates(void)
 {
-	struct hstate *h;
+	struct hstate *h, *h2;
 
 	for_each_hstate(h) {
 		if (minimum_order > huge_page_order(h))
@@ -2995,6 +2995,22 @@ static void __init hugetlb_init_hstates(void)
 		/* oversize hugepages were init'ed in early boot */
 		if (!hstate_is_gigantic(h))
 			hugetlb_hstate_alloc_pages(h);
+
+		/*
+		 * Set demote order for each hstate.  Note that
+		 * h->demote_order is initially 0.
+		 * - We can not demote gigantic pages if runtime freeing
+		 *   is not supported, so skip this.
+		 */
+		if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
+			continue;
+		for_each_hstate(h2) {
+			if (h2 == h)
+				continue;
+			if (h2->order < h->order &&
+			    h2->order > h->demote_order)
+				h->demote_order = h2->order;
+		}
 	}
 	VM_BUG_ON(minimum_order == UINT_MAX);
 }
@@ -3235,9 +3251,31 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	return 0;
 }
 
+static int demote_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
+	__must_hold(&hugetlb_lock)
+{
+	int rc = 0;
+
+	lockdep_assert_held(&hugetlb_lock);
+
+	/* We should never get here if no demote order */
+	if (!h->demote_order) {
+		pr_warn("HugeTLB: NULL demote order passed to demote_pool_huge_page.\n");
+		return -EINVAL;		/* internal error */
+	}
+
+	/*
+	 * TODO - demote fucntionality will be added in subsequent patch
+	 */
+	return rc;
+}
+
 #define HSTATE_ATTR_RO(_name) \
 	static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
 
+#define HSTATE_ATTR_WO(_name) \
+	static struct kobj_attribute _name##_attr = __ATTR_WO(_name)
+
 #define HSTATE_ATTR(_name) \
 	static struct kobj_attribute _name##_attr = \
 		__ATTR(_name, 0644, _name##_show, _name##_store)
@@ -3433,6 +3471,105 @@ static ssize_t surplus_hugepages_show(struct kobject *kobj,
 }
 HSTATE_ATTR_RO(surplus_hugepages);
 
+static ssize_t demote_store(struct kobject *kobj,
+	       struct kobj_attribute *attr, const char *buf, size_t len)
+{
+	unsigned long nr_demote;
+	unsigned long nr_available;
+	nodemask_t nodes_allowed, *n_mask;
+	struct hstate *h;
+	int err = 0;
+	int nid;
+
+	err = kstrtoul(buf, 10, &nr_demote);
+	if (err)
+		return err;
+	h = kobj_to_hstate(kobj, &nid);
+
+	if (nid != NUMA_NO_NODE) {
+		init_nodemask_of_node(&nodes_allowed, nid);
+		n_mask = &nodes_allowed;
+	} else {
+		n_mask = &node_states[N_MEMORY];
+	}
+
+	/* Synchronize with other sysfs operations modifying huge pages */
+	mutex_lock(&h->resize_lock);
+	spin_lock_irq(&hugetlb_lock);
+
+	while (nr_demote) {
+		/*
+		 * Check for available pages to demote each time thorough the
+		 * loop as demote_pool_huge_page will drop hugetlb_lock.
+		 *
+		 * NOTE: demote_pool_huge_page does not yet drop hugetlb_lock
+		 * but will when full demote functionality is added in a later
+		 * patch.
+		 */
+		if (nid != NUMA_NO_NODE)
+			nr_available = h->free_huge_pages_node[nid];
+		else
+			nr_available = h->free_huge_pages;
+		nr_available -= h->resv_huge_pages;
+		if (!nr_available)
+			break;
+
+		err = demote_pool_huge_page(h, n_mask);
+		if (err)
+			break;
+
+		nr_demote--;
+	}
+
+	spin_unlock_irq(&hugetlb_lock);
+	mutex_unlock(&h->resize_lock);
+
+	if (err)
+		return err;
+	return len;
+}
+HSTATE_ATTR_WO(demote);
+
+static ssize_t demote_size_show(struct kobject *kobj,
+					struct kobj_attribute *attr, char *buf)
+{
+	int nid;
+	struct hstate *h = kobj_to_hstate(kobj, &nid);
+	unsigned long demote_size = (PAGE_SIZE << h->demote_order) / SZ_1K;
+
+	return sysfs_emit(buf, "%lukB\n", demote_size);
+}
+
+static ssize_t demote_size_store(struct kobject *kobj,
+					struct kobj_attribute *attr,
+					const char *buf, size_t count)
+{
+	struct hstate *h, *demote_hstate;
+	unsigned long demote_size;
+	unsigned int demote_order;
+	int nid;
+
+	demote_size = (unsigned long)memparse(buf, NULL);
+
+	demote_hstate = size_to_hstate(demote_size);
+	if (!demote_hstate)
+		return -EINVAL;
+	demote_order = demote_hstate->order;
+
+	/* demote order must be smaller than hstate order */
+	h = kobj_to_hstate(kobj, &nid);
+	if (demote_order >= h->order)
+		return -EINVAL;
+
+	/* resize_lock synchronizes access to demote size and writes */
+	mutex_lock(&h->resize_lock);
+	h->demote_order = demote_order;
+	mutex_unlock(&h->resize_lock);
+
+	return count;
+}
+HSTATE_ATTR(demote_size);
+
 static struct attribute *hstate_attrs[] = {
 	&nr_hugepages_attr.attr,
 	&nr_overcommit_hugepages_attr.attr,
@@ -3449,6 +3586,16 @@ static const struct attribute_group hstate_attr_group = {
 	.attrs = hstate_attrs,
 };
 
+static struct attribute *hstate_demote_attrs[] = {
+	&demote_size_attr.attr,
+	&demote_attr.attr,
+	NULL,
+};
+
+static const struct attribute_group hstate_demote_attr_group = {
+	.attrs = hstate_demote_attrs,
+};
+
 static int hugetlb_sysfs_add_hstate(struct hstate *h, struct kobject *parent,
 				    struct kobject **hstate_kobjs,
 				    const struct attribute_group *hstate_attr_group)
@@ -3466,6 +3613,12 @@ static int hugetlb_sysfs_add_hstate(struct hstate *h, struct kobject *parent,
 		hstate_kobjs[hi] = NULL;
 	}
 
+	if (h->demote_order) {
+		if (sysfs_create_group(hstate_kobjs[hi],
+					&hstate_demote_attr_group))
+			pr_warn("HugeTLB unable to create demote interfaces for %s\n", h->name);
+	}
+
 	return retval;
 }
Oscar Salvador Oct. 25, 2021, 7:24 a.m. UTC | #5
On Fri, Oct 22, 2021 at 11:58:42AM -0700, Mike Kravetz wrote:
> From f9c401323fee234667787a118c74d93aa185fcf6 Mon Sep 17 00:00:00 2001
> From: Mike Kravetz <mike.kravetz@oracle.com>
> Date: Fri, 22 Oct 2021 11:40:57 -0700
> Subject: [PATCH v4 1/5] hugetlb: add demote hugetlb page sysfs interfaces
> 
> Two new sysfs files are added to demote hugtlb pages.  These files are
> both per-hugetlb page size and per node.  Files are:
>   demote_size - The size in Kb that pages are demoted to. (read-write)
>   demote - The number of huge pages to demote. (write-only)
> 
> By default, demote_size is the next smallest huge page size.  Valid huge
> page sizes less than huge page size may be written to this file.  When
> huge pages are demoted, they are demoted to this size.
> 
> Writing a value to demote will result in an attempt to demote that
> number of hugetlb pages to an appropriate number of demote_size pages.
> 
> NOTE: Demote interfaces are only provided for huge page sizes if there
> is a smaller target demote huge page size.  For example, on x86 1GB huge
> pages will have demote interfaces.  2MB huge pages will not have demote
> interfaces.
> 
> This patch does not provide full demote functionality.  It only provides
> the sysfs interfaces.
> 
> It also provides documentation for the new interfaces.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>
 
> ---
>  Documentation/admin-guide/mm/hugetlbpage.rst |  30 +++-
>  include/linux/hugetlb.h                      |   1 +
>  mm/hugetlb.c                                 | 155 ++++++++++++++++++-
>  3 files changed, 183 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
> index 8abaeb144e44..bb90de3885d1 100644
> --- a/Documentation/admin-guide/mm/hugetlbpage.rst
> +++ b/Documentation/admin-guide/mm/hugetlbpage.rst
> @@ -234,8 +234,12 @@ will exist, of the form::
>  
>  	hugepages-${size}kB
>  
> -Inside each of these directories, the same set of files will exist::
> +Inside each of these directories, the set of files contained in ``/proc``
> +will exist.  In addition, two additional interfaces for demoting huge
> +pages may exist::
>  
> +        demote
> +        demote_size
>  	nr_hugepages
>  	nr_hugepages_mempolicy
>  	nr_overcommit_hugepages
> @@ -243,7 +247,29 @@ Inside each of these directories, the same set of files will exist::
>  	resv_hugepages
>  	surplus_hugepages
>  
> -which function as described above for the default huge page-sized case.
> +The demote interfaces provide the ability to split a huge page into
> +smaller huge pages.  For example, the x86 architecture supports both
> +1GB and 2MB huge pages sizes.  A 1GB huge page can be split into 512
> +2MB huge pages.  Demote interfaces are not available for the smallest
> +huge page size.  The demote interfaces are:
> +
> +demote_size
> +        is the size of demoted pages.  When a page is demoted a corresponding
> +        number of huge pages of demote_size will be created.  By default,
> +        demote_size is set to the next smaller huge page size.  If there are
> +        multiple smaller huge page sizes, demote_size can be set to any of
> +        these smaller sizes.  Only huge page sizes less than the current huge
> +        pages size are allowed.
> +
> +demote
> +        is used to demote a number of huge pages.  A user with root privileges
> +        can write to this file.  It may not be possible to demote the
> +        requested number of huge pages.  To determine how many pages were
> +        actually demoted, compare the value of nr_hugepages before and after
> +        writing to the demote interface.  demote is a write only interface.
> +
> +The interfaces which are the same as in ``/proc`` (all except demote and
> +demote_size) function as described above for the default huge page-sized case.
>  
>  .. _mem_policy_and_hp_alloc:
>  
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 1faebe1cd0ed..f2c3979efd69 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -596,6 +596,7 @@ struct hstate {
>  	int next_nid_to_alloc;
>  	int next_nid_to_free;
>  	unsigned int order;
> +	unsigned int demote_order;
>  	unsigned long mask;
>  	unsigned long max_huge_pages;
>  	unsigned long nr_huge_pages;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 95dc7b83381f..d2262ad4b3ed 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2986,7 +2986,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
>  
>  static void __init hugetlb_init_hstates(void)
>  {
> -	struct hstate *h;
> +	struct hstate *h, *h2;
>  
>  	for_each_hstate(h) {
>  		if (minimum_order > huge_page_order(h))
> @@ -2995,6 +2995,22 @@ static void __init hugetlb_init_hstates(void)
>  		/* oversize hugepages were init'ed in early boot */
>  		if (!hstate_is_gigantic(h))
>  			hugetlb_hstate_alloc_pages(h);
> +
> +		/*
> +		 * Set demote order for each hstate.  Note that
> +		 * h->demote_order is initially 0.
> +		 * - We can not demote gigantic pages if runtime freeing
> +		 *   is not supported, so skip this.
> +		 */
> +		if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
> +			continue;
> +		for_each_hstate(h2) {
> +			if (h2 == h)
> +				continue;
> +			if (h2->order < h->order &&
> +			    h2->order > h->demote_order)
> +				h->demote_order = h2->order;
> +		}
>  	}
>  	VM_BUG_ON(minimum_order == UINT_MAX);
>  }
> @@ -3235,9 +3251,31 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  	return 0;
>  }
>  
> +static int demote_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
> +	__must_hold(&hugetlb_lock)
> +{
> +	int rc = 0;
> +
> +	lockdep_assert_held(&hugetlb_lock);
> +
> +	/* We should never get here if no demote order */
> +	if (!h->demote_order) {
> +		pr_warn("HugeTLB: NULL demote order passed to demote_pool_huge_page.\n");
> +		return -EINVAL;		/* internal error */
> +	}
> +
> +	/*
> +	 * TODO - demote fucntionality will be added in subsequent patch
> +	 */
> +	return rc;
> +}
> +
>  #define HSTATE_ATTR_RO(_name) \
>  	static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
>  
> +#define HSTATE_ATTR_WO(_name) \
> +	static struct kobj_attribute _name##_attr = __ATTR_WO(_name)
> +
>  #define HSTATE_ATTR(_name) \
>  	static struct kobj_attribute _name##_attr = \
>  		__ATTR(_name, 0644, _name##_show, _name##_store)
> @@ -3433,6 +3471,105 @@ static ssize_t surplus_hugepages_show(struct kobject *kobj,
>  }
>  HSTATE_ATTR_RO(surplus_hugepages);
>  
> +static ssize_t demote_store(struct kobject *kobj,
> +	       struct kobj_attribute *attr, const char *buf, size_t len)
> +{
> +	unsigned long nr_demote;
> +	unsigned long nr_available;
> +	nodemask_t nodes_allowed, *n_mask;
> +	struct hstate *h;
> +	int err = 0;
> +	int nid;
> +
> +	err = kstrtoul(buf, 10, &nr_demote);
> +	if (err)
> +		return err;
> +	h = kobj_to_hstate(kobj, &nid);
> +
> +	if (nid != NUMA_NO_NODE) {
> +		init_nodemask_of_node(&nodes_allowed, nid);
> +		n_mask = &nodes_allowed;
> +	} else {
> +		n_mask = &node_states[N_MEMORY];
> +	}
> +
> +	/* Synchronize with other sysfs operations modifying huge pages */
> +	mutex_lock(&h->resize_lock);
> +	spin_lock_irq(&hugetlb_lock);
> +
> +	while (nr_demote) {
> +		/*
> +		 * Check for available pages to demote each time thorough the
> +		 * loop as demote_pool_huge_page will drop hugetlb_lock.
> +		 *
> +		 * NOTE: demote_pool_huge_page does not yet drop hugetlb_lock
> +		 * but will when full demote functionality is added in a later
> +		 * patch.
> +		 */
> +		if (nid != NUMA_NO_NODE)
> +			nr_available = h->free_huge_pages_node[nid];
> +		else
> +			nr_available = h->free_huge_pages;
> +		nr_available -= h->resv_huge_pages;
> +		if (!nr_available)
> +			break;
> +
> +		err = demote_pool_huge_page(h, n_mask);
> +		if (err)
> +			break;
> +
> +		nr_demote--;
> +	}
> +
> +	spin_unlock_irq(&hugetlb_lock);
> +	mutex_unlock(&h->resize_lock);
> +
> +	if (err)
> +		return err;
> +	return len;
> +}
> +HSTATE_ATTR_WO(demote);
> +
> +static ssize_t demote_size_show(struct kobject *kobj,
> +					struct kobj_attribute *attr, char *buf)
> +{
> +	int nid;
> +	struct hstate *h = kobj_to_hstate(kobj, &nid);
> +	unsigned long demote_size = (PAGE_SIZE << h->demote_order) / SZ_1K;
> +
> +	return sysfs_emit(buf, "%lukB\n", demote_size);
> +}
> +
> +static ssize_t demote_size_store(struct kobject *kobj,
> +					struct kobj_attribute *attr,
> +					const char *buf, size_t count)
> +{
> +	struct hstate *h, *demote_hstate;
> +	unsigned long demote_size;
> +	unsigned int demote_order;
> +	int nid;
> +
> +	demote_size = (unsigned long)memparse(buf, NULL);
> +
> +	demote_hstate = size_to_hstate(demote_size);
> +	if (!demote_hstate)
> +		return -EINVAL;
> +	demote_order = demote_hstate->order;
> +
> +	/* demote order must be smaller than hstate order */
> +	h = kobj_to_hstate(kobj, &nid);
> +	if (demote_order >= h->order)
> +		return -EINVAL;
> +
> +	/* resize_lock synchronizes access to demote size and writes */
> +	mutex_lock(&h->resize_lock);
> +	h->demote_order = demote_order;
> +	mutex_unlock(&h->resize_lock);
> +
> +	return count;
> +}
> +HSTATE_ATTR(demote_size);
> +
>  static struct attribute *hstate_attrs[] = {
>  	&nr_hugepages_attr.attr,
>  	&nr_overcommit_hugepages_attr.attr,
> @@ -3449,6 +3586,16 @@ static const struct attribute_group hstate_attr_group = {
>  	.attrs = hstate_attrs,
>  };
>  
> +static struct attribute *hstate_demote_attrs[] = {
> +	&demote_size_attr.attr,
> +	&demote_attr.attr,
> +	NULL,
> +};
> +
> +static const struct attribute_group hstate_demote_attr_group = {
> +	.attrs = hstate_demote_attrs,
> +};
> +
>  static int hugetlb_sysfs_add_hstate(struct hstate *h, struct kobject *parent,
>  				    struct kobject **hstate_kobjs,
>  				    const struct attribute_group *hstate_attr_group)
> @@ -3466,6 +3613,12 @@ static int hugetlb_sysfs_add_hstate(struct hstate *h, struct kobject *parent,
>  		hstate_kobjs[hi] = NULL;
>  	}
>  
> +	if (h->demote_order) {
> +		if (sysfs_create_group(hstate_kobjs[hi],
> +					&hstate_demote_attr_group))
> +			pr_warn("HugeTLB unable to create demote interfaces for %s\n", h->name);
> +	}
> +
>  	return retval;
>  }
>  
> -- 
> 2.31.1
>
diff mbox series

Patch

diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
index 8abaeb144e44..bb90de3885d1 100644
--- a/Documentation/admin-guide/mm/hugetlbpage.rst
+++ b/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -234,8 +234,12 @@  will exist, of the form::
 
 	hugepages-${size}kB
 
-Inside each of these directories, the same set of files will exist::
+Inside each of these directories, the set of files contained in ``/proc``
+will exist.  In addition, two additional interfaces for demoting huge
+pages may exist::
 
+        demote
+        demote_size
 	nr_hugepages
 	nr_hugepages_mempolicy
 	nr_overcommit_hugepages
@@ -243,7 +247,29 @@  Inside each of these directories, the same set of files will exist::
 	resv_hugepages
 	surplus_hugepages
 
-which function as described above for the default huge page-sized case.
+The demote interfaces provide the ability to split a huge page into
+smaller huge pages.  For example, the x86 architecture supports both
+1GB and 2MB huge pages sizes.  A 1GB huge page can be split into 512
+2MB huge pages.  Demote interfaces are not available for the smallest
+huge page size.  The demote interfaces are:
+
+demote_size
+        is the size of demoted pages.  When a page is demoted a corresponding
+        number of huge pages of demote_size will be created.  By default,
+        demote_size is set to the next smaller huge page size.  If there are
+        multiple smaller huge page sizes, demote_size can be set to any of
+        these smaller sizes.  Only huge page sizes less than the current huge
+        pages size are allowed.
+
+demote
+        is used to demote a number of huge pages.  A user with root privileges
+        can write to this file.  It may not be possible to demote the
+        requested number of huge pages.  To determine how many pages were
+        actually demoted, compare the value of nr_hugepages before and after
+        writing to the demote interface.  demote is a write only interface.
+
+The interfaces which are the same as in ``/proc`` (all except demote and
+demote_size) function as described above for the default huge page-sized case.
 
 .. _mem_policy_and_hp_alloc:
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 1faebe1cd0ed..f2c3979efd69 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -596,6 +596,7 @@  struct hstate {
 	int next_nid_to_alloc;
 	int next_nid_to_free;
 	unsigned int order;
+	unsigned int demote_order;
 	unsigned long mask;
 	unsigned long max_huge_pages;
 	unsigned long nr_huge_pages;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 95dc7b83381f..44d3c9477635 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2986,7 +2986,7 @@  static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 
 static void __init hugetlb_init_hstates(void)
 {
-	struct hstate *h;
+	struct hstate *h, *h2;
 
 	for_each_hstate(h) {
 		if (minimum_order > huge_page_order(h))
@@ -2995,6 +2995,22 @@  static void __init hugetlb_init_hstates(void)
 		/* oversize hugepages were init'ed in early boot */
 		if (!hstate_is_gigantic(h))
 			hugetlb_hstate_alloc_pages(h);
+
+		/*
+		 * Set demote order for each hstate.  Note that
+		 * h->demote_order is initially 0.
+		 * - We can not demote gigantic pages if runtime freeing
+		 *   is not supported, so skip this.
+		 */
+		if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
+			continue;
+		for_each_hstate(h2) {
+			if (h2 == h)
+				continue;
+			if (h2->order < h->order &&
+			    h2->order > h->demote_order)
+				h->demote_order = h2->order;
+		}
 	}
 	VM_BUG_ON(minimum_order == UINT_MAX);
 }
@@ -3235,9 +3251,31 @@  static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	return 0;
 }
 
+static int demote_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
+	__must_hold(&hugetlb_lock)
+{
+	int rc = 0;
+
+	lockdep_assert_held(&hugetlb_lock);
+
+	/* We should never get here if no demote order */
+	if (!h->demote_order) {
+		pr_warn("HugeTLB: NULL demote order passed to demote_pool_huge_page.\n");
+		return -EINVAL;		/* internal error */
+	}
+
+	/*
+	 * TODO - demote fucntionality will be added in subsequent patch
+	 */
+	return rc;
+}
+
 #define HSTATE_ATTR_RO(_name) \
 	static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
 
+#define HSTATE_ATTR_WO(_name) \
+	static struct kobj_attribute _name##_attr = __ATTR_WO(_name)
+
 #define HSTATE_ATTR(_name) \
 	static struct kobj_attribute _name##_attr = \
 		__ATTR(_name, 0644, _name##_show, _name##_store)
@@ -3433,6 +3471,105 @@  static ssize_t surplus_hugepages_show(struct kobject *kobj,
 }
 HSTATE_ATTR_RO(surplus_hugepages);
 
+static ssize_t demote_store(struct kobject *kobj,
+	       struct kobj_attribute *attr, const char *buf, size_t len)
+{
+	unsigned long nr_demote;
+	unsigned long nr_available;
+	nodemask_t nodes_allowed, *n_mask;
+	struct hstate *h;
+	int err = 0;
+	int nid;
+
+	err = kstrtoul(buf, 10, &nr_demote);
+	if (err)
+		return err;
+	h = kobj_to_hstate(kobj, &nid);
+
+	/* Synchronize with other sysfs operations modifying huge pages */
+	mutex_lock(&h->resize_lock);
+
+	if (nid != NUMA_NO_NODE) {
+		init_nodemask_of_node(&nodes_allowed, nid);
+		n_mask = &nodes_allowed;
+	} else {
+		n_mask = &node_states[N_MEMORY];
+	}
+
+	spin_lock_irq(&hugetlb_lock);
+	while (nr_demote) {
+		/*
+		 * Check for available pages to demote each time thorough the
+		 * loop as demote_pool_huge_page will drop hugetlb_lock.
+		 *
+		 * NOTE: demote_pool_huge_page does not yet drop hugetlb_lock
+		 * but will when full demote functionality is added in a later
+		 * patch.
+		 */
+		if (nid != NUMA_NO_NODE)
+			nr_available = h->free_huge_pages_node[nid];
+		else
+			nr_available = h->free_huge_pages;
+		nr_available -= h->resv_huge_pages;
+		if (!nr_available)
+			break;
+
+		err = demote_pool_huge_page(h, n_mask);
+		if (err)
+			break;
+
+		nr_demote--;
+	}
+
+	spin_unlock_irq(&hugetlb_lock);
+	mutex_unlock(&h->resize_lock);
+
+	if (err)
+		return err;
+	return len;
+}
+HSTATE_ATTR_WO(demote);
+
+static ssize_t demote_size_show(struct kobject *kobj,
+					struct kobj_attribute *attr, char *buf)
+{
+	int nid;
+	struct hstate *h = kobj_to_hstate(kobj, &nid);
+	unsigned long demote_size = (PAGE_SIZE << h->demote_order) / SZ_1K;
+
+	return sysfs_emit(buf, "%lukB\n", demote_size);
+}
+
+static ssize_t demote_size_store(struct kobject *kobj,
+					struct kobj_attribute *attr,
+					const char *buf, size_t count)
+{
+	struct hstate *h, *demote_hstate;
+	unsigned long demote_size;
+	unsigned int demote_order;
+	int nid;
+
+	demote_size = (unsigned long)memparse(buf, NULL);
+
+	demote_hstate = size_to_hstate(demote_size);
+	if (!demote_hstate)
+		return -EINVAL;
+	demote_order = demote_hstate->order;
+
+	/* demote order must be smaller than hstate order */
+	h = kobj_to_hstate(kobj, &nid);
+	if (demote_order >= h->order)
+		return -EINVAL;
+
+	/* resize_lock synchronizes access to demote size and writes */
+	mutex_lock(&h->resize_lock);
+	h->demote_order = demote_order;
+	mutex_unlock(&h->resize_lock);
+
+	return count;
+}
+HSTATE_ATTR(demote_size);
+
 static struct attribute *hstate_attrs[] = {
 	&nr_hugepages_attr.attr,
 	&nr_overcommit_hugepages_attr.attr,
@@ -3449,6 +3586,16 @@  static const struct attribute_group hstate_attr_group = {
 	.attrs = hstate_attrs,
 };
 
+static struct attribute *hstate_demote_attrs[] = {
+	&demote_size_attr.attr,
+	&demote_attr.attr,
+	NULL,
+};
+
+static const struct attribute_group hstate_demote_attr_group = {
+	.attrs = hstate_demote_attrs,
+};
+
 static int hugetlb_sysfs_add_hstate(struct hstate *h, struct kobject *parent,
 				    struct kobject **hstate_kobjs,
 				    const struct attribute_group *hstate_attr_group)
@@ -3466,6 +3613,12 @@  static int hugetlb_sysfs_add_hstate(struct hstate *h, struct kobject *parent,
 		hstate_kobjs[hi] = NULL;
 	}
 
+	if (h->demote_order) {
+		if (sysfs_create_group(hstate_kobjs[hi],
+					&hstate_demote_attr_group))
+			pr_warn("HugeTLB unable to create demote interfaces for %s\n", h->name);
+	}
+
 	return retval;
 }