[v4,0/5] hugetlb: add demote/split page functionality

Message ID 20211007181918.136982-1-mike.kravetz@oracle.com

Mike Kravetz Oct. 7, 2021, 6:19 p.m. UTC
The concurrent use of multiple hugetlb page sizes on a single system
is becoming more common.  One of the reasons is better TLB support for
gigantic page sizes on x86 hardware.  In addition, hugetlb pages are
being used to back VMs in hosting environments.

When using hugetlb pages to back VMs, it is often desirable to
preallocate hugetlb pools.  This avoids the delay and uncertainty of
allocating hugetlb pages at VM startup.  In addition, preallocating
huge pages minimizes the issue of memory fragmentation that increases
the longer the system is up and running.

In such environments, a combination of larger and smaller hugetlb pages
are preallocated in anticipation of backing VMs of various sizes.  Over
time, the preallocated pool of smaller hugetlb pages may become
depleted while larger hugetlb pages still remain.  In such situations,
it is desirable to convert larger hugetlb pages to smaller hugetlb pages.

Converting larger to smaller hugetlb pages can be accomplished today by
first freeing the larger page to the buddy allocator and then allocating
the smaller pages.  For example, to convert 50 1GB pages to 2MB pages on
x86 (each 1GB page corresponds to 512 2MB pages, hence the 25600 below):
gb_pages=`cat .../hugepages-1048576kB/nr_hugepages`
m2_pages=`cat .../hugepages-2048kB/nr_hugepages`
echo $(($gb_pages - 50)) > .../hugepages-1048576kB/nr_hugepages
echo $(($m2_pages + 25600)) > .../hugepages-2048kB/nr_hugepages

On an idle system this operation is fairly reliable: the number of 2MB
pages increases by the expected amount and the operation completes in a
second or two.

However, when there is activity on the system the following issues
arise:
1) This process can take quite some time, especially if allocation of
   the smaller pages is not immediate and requires migration/compaction.
2) There is no guarantee that the total size of smaller pages allocated
   will match the size of the larger page which was freed.  This is
   because the area freed by the larger page could quickly be
   fragmented.
In a test environment with a load that continually fills the page cache
with clean pages, results such as the following can be observed:

Unexpected number of 2MB pages allocated: Expected 25600, have 19944
real    0m42.092s
user    0m0.008s
sys     0m41.467s

To address these issues, introduce the concept of hugetlb page demotion.
Demotion provides a means of 'in place' splitting of a hugetlb page to
pages of a smaller size.  This avoids freeing pages to buddy and then
trying to allocate from buddy.

Page demotion is controlled via sysfs files that reside in the per-hugetlb
page size and per-node directories:
- demote_size   Target page size for demotion, a smaller huge page size.
                The file can be written to choose a smaller huge page size
                if multiple are available (see the sketch following the
                demote example below).
- demote        Writable; the value written is the number of hugetlb pages
                to demote.

To demote 50 1GB huge pages, one would:
cat .../hugepages-1048576kB/free_hugepages   /* optional, verify free pages */
cat .../hugepages-1048576kB/demote_size      /* optional, verify target size */
echo 50 > .../hugepages-1048576kB/demote
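
When more than one smaller huge page size exists, demote_size can be written
first to select the target (on x86 there is no choice to make, as 2MB is the
only hugetlb size below 1GB).  The following is a hypothetical sketch; the
assumption that demote_size accepts the same human-readable form it reports
(e.g. 2048kB) is mine and is not stated in this cover letter:

cat .../hugepages-1048576kB/demote_size        # current target, e.g. 2048kB
echo 2048kB > .../hugepages-1048576kB/demote_size   # assumed write format
echo 50 > .../hugepages-1048576kB/demote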

Only hugetlb pages which are free at the time of the request can be demoted.
Demotion does not add to the complexity of surplus pages and honors reserved
huge pages.  Therefore, when a value is written to the sysfs demote file,
that value is only the maximum number of pages which will be demoted.  It
is possible fewer will actually be demoted.  The recently introduced
per-hstate mutex is used to synchronize demote operations with other
operations that modify hugetlb pools.
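
Because the written value is only a maximum, a caller that needs to know how
many pages were actually produced can compare the target size pool count
before and after the write.  A minimal sketch, assuming the x86 sizes used
above and that nothing else changes the 2MB pool during the check:

m2_before=`cat .../hugepages-2048kB/nr_hugepages`
echo 50 > .../hugepages-1048576kB/demote
m2_after=`cat .../hugepages-2048kB/nr_hugepages`
# each demoted 1GB page adds 512 pages to the 2MB pool
echo "2MB pages gained: $(($m2_after - $m2_before))"
echo "1GB pages demoted: $((($m2_after - $m2_before) / 512))"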

Real world use cases
--------------------
The above scenario describes a real world use case where hugetlb pages are
used to back VMs on x86.  Both issues, long allocation times and receiving
fewer smaller huge pages than expected after a free and allocate cycle, have
been observed in practice.  Whether they occur depends on other activity
within the host and cannot be predicted.

v3 -> v4
  - Fix dead store in demote_size_show and rewrite it for readability
  - Fix documentation typo
  - Make the code setting up per-hstate demote orders clearer
  - Add a warning if demote_pool_huge_page is passed a null demote order
  - Acquire hugetlb_lock later in demote_store
  - Restore cma_release debug message
  - Update commit message explaining the need for HUGETLB_PAGE_ORDER
  - Rename the function that destroys hugetlb pages for demote
  - Make sure error codes are passed all the way back to the user

v2 -> v3
  - Require gigantic_page_runtime_supported for demote
  - Simplify code in demote_store and update comment
  - Remove hugetlb specific cma flag, add cma_pages_valid interface
  - Retain error return code in demote_free_huge_page

RESEND -> v2
  - Removed optimizations for vmemmap optimized pages
  - Make demote_size writable
  - Removed demote interfaces for smallest huge page size
  - Updated documentation and commit messages
  - Fixed build break for !CONFIG_ARCH_HAS_GIGANTIC_PAGE

v1 -> RESEND
  - Rebase on next-20210816
  - Fix a few typos in commit messages
RFC -> v1
  - Provides basic support for vmemmap optimized pages
  - Takes speculative page references into account
  - Updated Documentation file
  - Added optimizations for vmemmap optimized pages

Mike Kravetz (5):
  hugetlb: add demote hugetlb page sysfs interfaces
  mm/cma: add cma_pages_valid to determine if pages are in CMA
  hugetlb: be sure to free demoted CMA pages to CMA
  hugetlb: add demote bool to gigantic page routines
  hugetlb: add hugetlb demote page support

 Documentation/admin-guide/mm/hugetlbpage.rst |  30 +-
 include/linux/cma.h                          |   1 +
 include/linux/hugetlb.h                      |   1 +
 mm/cma.c                                     |  24 +-
 mm/hugetlb.c                                 | 318 ++++++++++++++++++-
 5 files changed, 353 insertions(+), 21 deletions(-)