[132/262] hugetlb: add hugetlb demote page support


Message ID

20211105204133.Or52A3rtJ%akpm@linux-foundation.org (mailing list archive)

State

New

Headers

DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org A6C506126A
Date: Fri, 05 Nov 2021 13:41:33 -0700
From: Andrew Morton <akpm@linux-foundation.org>
To: akpm@linux-foundation.org, aneesh.kumar@linux.ibm.com,
 david@redhat.com, linux-mm@kvack.org, mhocko@suse.com,
 mike.kravetz@oracle.com, mm-commits@vger.kernel.org,
 naoya.horiguchi@linux.dev, nghialm78@gmail.com, osalvador@suse.de,
 rientjes@google.com, songmuchun@bytedance.com,
 torvalds@linux-foundation.org, ziy@nvidia.com
Subject: [patch 132/262] hugetlb: add hugetlb demote page support
Message-ID: <20211105204133.Or52A3rtJ%akpm@linux-foundation.org>
In-Reply-To: <20211105133408.cccbb98b71a77d5e8430aba1@linux-foundation.org>
User-Agent: s-nail v14.8.16
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Commit Message

Andrew Morton Nov. 5, 2021, 8:41 p.m. UTC

From: Mike Kravetz <mike.kravetz@oracle.com>
Subject: hugetlb: add hugetlb demote page support

Demote page functionality splits a huge page into a number of huge
pages of a smaller size.  For example, on x86 a 1GB huge page can be
demoted into 512 2MB huge pages.  Demotion is done 'in place' by simply
splitting the huge page.

Add '*_for_demote' wrappers for remove_hugetlb_page,
destroy_compound_hugetlb_page and prep_compound_gigantic_page for use by
the demote code.

[mike.kravetz@oracle.com: v4]
  Link: https://lkml.kernel.org/r/6ca29b8e-527c-d6ec-900e-e6a43e4f8b73@oracle.com
Link: https://lkml.kernel.org/r/20211007181918.136982-6-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Nghia Le <nghialm78@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |  100 +++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 92 insertions(+), 8 deletions(-)
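
As an illustration of the 1GB -> 512 x 2MB arithmetic in the commit
message, here is a minimal user-space sketch of the stride used when
walking the base pages of the source huge page.  The helper names
pages_per_order() and demote_count() are hypothetical and are not part
of the patch; the orders 18 (1GB) and 9 (2MB) assume x86-64 with 4K
base pages.

```c
#include <assert.h>

/* Hypothetical helper: base pages in a huge page of the given order. */
static unsigned long pages_per_order(unsigned int order)
{
	return 1UL << order;
}

/*
 * Hypothetical model of the demote loop: step through the base pages of
 * the source huge page in strides of the target huge page size and count
 * how many target-sized pages result.
 */
static unsigned long demote_count(unsigned int src_order,
				  unsigned int dst_order)
{
	unsigned long i, nr = 0;

	/* Demotion only makes sense towards a smaller page size. */
	assert(dst_order < src_order);

	for (i = 0; i < pages_per_order(src_order);
	     i += pages_per_order(dst_order))
		nr++;

	return nr;
}
```

With these assumptions, demote_count(18, 9) counts one target page per
2MB stride across the 1GB page, which is the same as
pages_per_order(18) / pages_per_order(9).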

--- a/mm/hugetlb.c~hugetlb-add-hugetlb-demote-page-support
+++ a/mm/hugetlb.c
@@ -1270,7 +1270,7 @@  static int hstate_next_node_to_free(stru
 		((node = hstate_next_node_to_free(hs, mask)) || 1);	\
 		nr_nodes--)
 
-#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
+/* used to demote non-gigantic_huge pages as well */
 static void __destroy_compound_gigantic_page(struct page *page,
 					unsigned int order, bool demote)
 {
@@ -1293,6 +1293,13 @@  static void __destroy_compound_gigantic_
 	__ClearPageHead(page);
 }
 
+static void destroy_compound_hugetlb_page_for_demote(struct page *page,
+					unsigned int order)
+{
+	__destroy_compound_gigantic_page(page, order, true);
+}
+
+#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
 static void destroy_compound_gigantic_page(struct page *page,
 					unsigned int order)
 {
@@ -1438,6 +1445,12 @@  static void remove_hugetlb_page(struct h
 	__remove_hugetlb_page(h, page, adjust_surplus, false);
 }
 
+static void remove_hugetlb_page_for_demote(struct hstate *h, struct page *page,
+							bool adjust_surplus)
+{
+	__remove_hugetlb_page(h, page, adjust_surplus, true);
+}
+
 static void add_hugetlb_page(struct hstate *h, struct page *page,
 			     bool adjust_surplus)
 {
@@ -1779,6 +1792,12 @@  static bool prep_compound_gigantic_page(
 	return __prep_compound_gigantic_page(page, order, false);
 }
 
+static bool prep_compound_gigantic_page_for_demote(struct page *page,
+							unsigned int order)
+{
+	return __prep_compound_gigantic_page(page, order, true);
+}
+
 /*
  * PageHuge() only returns true for hugetlbfs pages, but not for normal or
  * transparent huge pages.  See the PageTransHuge() documentation for more
@@ -3304,9 +3323,72 @@  out:
 	return 0;
 }
 
+static int demote_free_huge_page(struct hstate *h, struct page *page)
+{
+	int i, nid = page_to_nid(page);
+	struct hstate *target_hstate;
+	int rc = 0;
+
+	target_hstate = size_to_hstate(PAGE_SIZE << h->demote_order);
+
+	remove_hugetlb_page_for_demote(h, page, false);
+	spin_unlock_irq(&hugetlb_lock);
+
+	rc = alloc_huge_page_vmemmap(h, page);
+	if (rc) {
+		/* Allocation of vmemmap failed, we cannot demote page */
+		spin_lock_irq(&hugetlb_lock);
+		set_page_refcounted(page);
+		add_hugetlb_page(h, page, false);
+		return rc;
+	}
+
+	/*
+	 * Use destroy_compound_hugetlb_page_for_demote for all huge page
+	 * sizes as it will not ref count pages.
+	 */
+	destroy_compound_hugetlb_page_for_demote(page, huge_page_order(h));
+
+	/*
+	 * Taking target hstate mutex synchronizes with set_max_huge_pages.
+	 * Without the mutex, pages added to target hstate could be marked
+	 * as surplus.
+	 *
+	 * Note that we already hold h->resize_lock.  To prevent deadlock,
+	 * use the convention of always taking larger size hstate mutex first.
+	 */
+	mutex_lock(&target_hstate->resize_lock);
+	for (i = 0; i < pages_per_huge_page(h);
+				i += pages_per_huge_page(target_hstate)) {
+		if (hstate_is_gigantic(target_hstate))
+			prep_compound_gigantic_page_for_demote(page + i,
+							target_hstate->order);
+		else
+			prep_compound_page(page + i, target_hstate->order);
+		set_page_private(page + i, 0);
+		set_page_refcounted(page + i);
+		prep_new_huge_page(target_hstate, page + i, nid);
+		put_page(page + i);
+	}
+	mutex_unlock(&target_hstate->resize_lock);
+
+	spin_lock_irq(&hugetlb_lock);
+
+	/*
+	 * Not required, but keep max_huge_pages consistent with the pool.
+	 */
+	h->max_huge_pages--;
+	target_hstate->max_huge_pages +=
+		pages_per_huge_page(h) / pages_per_huge_page(target_hstate);
+
+	return rc;
+}
+
 static int demote_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
 	__must_hold(&hugetlb_lock)
 {
+	int nr_nodes, node;
+	struct page *page;
 	int rc = 0;
 
 	lockdep_assert_held(&hugetlb_lock);
@@ -3317,9 +3399,15 @@  static int demote_pool_huge_page(struct
 		return -EINVAL;		/* internal error */
 	}
 
-	/*
-	 * TODO - demote fucntionality will be added in subsequent patch
-	 */
+	for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
+		if (!list_empty(&h->hugepage_freelists[node])) {
+			page = list_entry(h->hugepage_freelists[node].next,
+					struct page, lru);
+			rc = demote_free_huge_page(h, page);
+			break;
+		}
+	}
+
 	return rc;
 }
 
@@ -3554,10 +3642,6 @@  static ssize_t demote_store(struct kobje
 		/*
 		 * Check for available pages to demote each time thorough the
 		 * loop as demote_pool_huge_page will drop hugetlb_lock.
-		 *
-		 * NOTE: demote_pool_huge_page does not yet drop hugetlb_lock
-		 * but will when full demote functionality is added in a later
-		 * patch.
 		 */
 		if (nid != NUMA_NO_NODE)
 			nr_available = h->free_huge_pages_node[nid];

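The comment in demote_free_huge_page() describes a lock-ordering
convention: when two hstate resize mutexes are needed, take the one for
the larger page size first.  Below is a minimal user-space sketch of
that convention using pthread mutexes.  struct pool, lock_pools() and
unlock_pools() are hypothetical stand-ins, not kernel code, and the
sketch assumes the two pools always have different orders.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical user-space stand-in for the kernel's struct hstate. */
struct pool {
	unsigned int order;		/* log2 of base pages per huge page */
	pthread_mutex_t resize_lock;
};

/*
 * Lock two distinct pools, larger page size first, regardless of the
 * argument order.  Returns the pool whose mutex was taken first.
 */
static struct pool *lock_pools(struct pool *a, struct pool *b)
{
	struct pool *first = a->order > b->order ? a : b;
	struct pool *second = first == a ? b : a;

	/* Assumption: callers never pass the same pool or equal orders. */
	assert(a != b && a->order != b->order);

	pthread_mutex_lock(&first->resize_lock);
	pthread_mutex_lock(&second->resize_lock);
	return first;
}

static void unlock_pools(struct pool *a, struct pool *b)
{
	/* Unlock order does not matter for deadlock avoidance. */
	pthread_mutex_unlock(&a->resize_lock);
	pthread_mutex_unlock(&b->resize_lock);
}
```

Because every caller acquires the mutexes in the same global order, two
threads can never each hold one mutex while waiting for the other.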