[130/262] hugetlb: be sure to free demoted CMA pages to CMA

Message ID	20211105204127.2cYhr-b8M%akpm@linux-foundation.org (mailing list archive)
State	New
Headers
Return-Path: <SRS0=bSwl=PY=kvack.org=owner-linux-mm@kernel.org>
DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org EC2AF61279
Date: Fri, 05 Nov 2021 13:41:27 -0700
From: Andrew Morton <akpm@linux-foundation.org>
To: akpm@linux-foundation.org, aneesh.kumar@linux.ibm.com,
 david@redhat.com, linux-mm@kvack.org, mhocko@suse.com,
 mike.kravetz@oracle.com, mm-commits@vger.kernel.org,
 naoya.horiguchi@linux.dev, nghialm78@gmail.com, osalvador@suse.de,
 rientjes@google.com, songmuchun@bytedance.com,
 torvalds@linux-foundation.org, ziy@nvidia.com
Subject: [patch 130/262] hugetlb: be sure to free demoted CMA pages to CMA
Message-ID: <20211105204127.2cYhr-b8M%akpm@linux-foundation.org>
In-Reply-To: <20211105133408.cccbb98b71a77d5e8430aba1@linux-foundation.org>
User-Agent: s-nail v14.8.16
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Commit Message

Andrew Morton Nov. 5, 2021, 8:41 p.m. UTC

From: Mike Kravetz <mike.kravetz@oracle.com>
Subject: hugetlb: be sure to free demoted CMA pages to CMA

When huge page demotion is fully implemented, gigantic pages can be
demoted to a smaller huge page size.  For example, on x86 a 1G page can be
demoted to 512 2M pages.  However, gigantic pages can potentially be
allocated from CMA.  If a gigantic page that was allocated from CMA is
demoted, the resulting smaller pages need to be returned to CMA.
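
The 1G-to-2M arithmetic above can be sketched in userspace C.  This is an
illustration only, not kernel code: the EX_* constants assume x86-64 with
4 KiB base pages, and demoted_page_count() and order_to_bytes() are
hypothetical helper names introduced for this example.

```c
#include <assert.h>

/* Assumed x86-64 values with 4 KiB base pages; illustrative only. */
#define EX_PAGE_SHIFT	12u	/* 4 KiB base page */
#define EX_ORDER_2M	9u	/* 2 MiB = 4 KiB << 9 */
#define EX_ORDER_1G	18u	/* 1 GiB = 4 KiB << 18 */

/*
 * Number of pages of order dst_order obtained by splitting one page of
 * order src_order.  Hypothetical helper, not a kernel function.
 */
static unsigned long demoted_page_count(unsigned int src_order,
					unsigned int dst_order)
{
	assert(dst_order <= src_order);
	return 1UL << (src_order - dst_order);
}

/* Size in bytes of a page of the given order. */
static unsigned long long order_to_bytes(unsigned int order)
{
	return 1ULL << (EX_PAGE_SHIFT + order);
}
```

Under these assumptions, demoted_page_count(EX_ORDER_1G, EX_ORDER_2M) gives
the 512 pages mentioned in the text.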

Use the new interface cma_pages_valid() to determine if a non-gigantic
hugetlb page should be freed to CMA.  Also clear the mapping field of
these pages, as cma_release() expects.
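
As a rough userspace model of the kind of range check this relies on, the
sketch below tests whether a run of page frames lies inside a region.  It
is not the kernel's cma_pages_valid(): struct ex_region and both functions
are hypothetical, and requiring the whole range to fit is an assumption
made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a CMA region: a contiguous range of PFNs. */
struct ex_region {
	unsigned long base_pfn;	/* first page frame in the region */
	unsigned long count;	/* number of page frames in the region */
};

/* True if the nr frames starting at pfn all lie inside the region. */
static bool ex_range_in_region(const struct ex_region *r,
			       unsigned long pfn, unsigned long nr)
{
	unsigned long off;

	if (r == NULL || nr == 0)
		return false;
	if (pfn < r->base_pfn)
		return false;
	/* Compare offsets rather than end addresses to avoid overflow. */
	off = pfn - r->base_pfn;
	return off < r->count && nr <= r->count - off;
}

/* Model of the free-path decision: does this page belong to the region? */
static bool ex_should_free_to_region(const struct ex_region *r,
				     unsigned long pfn, unsigned int order)
{
	assert(order < sizeof(unsigned long) * 8);
	return ex_range_in_region(r, pfn, 1UL << order);
}
```

In this model, a 2M page carved out of a 1G region passes the check and
would be handed back to the region, while a 2M page from outside it would
take the normal free path.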

This also requires a change to CMA region creation for gigantic pages. 
CMA uses a per-region bit map to track allocations.  When setting up the
region, you specify how many pages each bit represents.  Currently, only
gigantic pages are allocated/freed from CMA so the region is set up such
that one bit represents a gigantic page size allocation.

With demote, a gigantic page (allocation) could be split into smaller size
pages, and these smaller pages will be freed to CMA.  Because the
per-region bit map must be set up to represent the smallest
allocation/free size, each bit now needs to represent the smallest huge
page size which can be freed to CMA.
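
The effect of the per-bit granularity can be sketched as below.  This is a
userspace model, not the CMA implementation: ex_bitmap_bits() and
ex_pages_to_bits() are hypothetical names, and rounding a request up to
whole bits is an assumption about how such a bitmap behaves.

```c
#include <assert.h>

/* Bits needed to track a region of region_pages base pages. */
static unsigned long ex_bitmap_bits(unsigned long region_pages,
				    unsigned int order_per_bit)
{
	assert(order_per_bit < sizeof(unsigned long) * 8);
	return region_pages >> order_per_bit;
}

/*
 * Bits touched when allocating or freeing nr_pages base pages, rounded
 * up to whole bits.  A request smaller than one bit still occupies a
 * full bit, so it cannot be tracked separately from its neighbours.
 */
static unsigned long ex_pages_to_bits(unsigned long nr_pages,
				      unsigned int order_per_bit)
{
	unsigned long per_bit;

	assert(order_per_bit < sizeof(unsigned long) * 8);
	per_bit = 1UL << order_per_bit;
	return (nr_pages + per_bit - 1) >> order_per_bit;
}
```

With one bit per 1G (order 18 on 4 KiB pages), freeing a single 2M page
(512 base pages) would map to the one bit covering the entire gigantic
page.  With one bit per 2M (order 9), the same free maps to exactly one of
the 512 bits that describe the 1G range.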

Unfortunately, we set up the CMA region for huge pages before we set up
huge page sizes (hstates).  So, technically we do not know the smallest
huge page size as this can change via command line options and
architecture specific code.  Therefore, at region setup time we use
HUGETLB_PAGE_ORDER as the smallest possible huge page size that can be
given back to CMA.  It is possible that this value is sub-optimal for some
architectures/config options.  If needed, this can be addressed in
follow-up work.
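
The resulting demotion limits can be modelled as two predicates.  This is
a simplified sketch of the checks the patch adds, not the kernel code:
EX_HUGETLB_PAGE_ORDER is set to 9 on the assumption of x86-64 with 2M
default huge pages, both function names are hypothetical, and the
separate gigantic-page runtime-support check is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed value for x86-64 (2 MiB huge pages on 4 KiB base pages). */
#define EX_HUGETLB_PAGE_ORDER	9u

/*
 * May pages of this order be demoted at all?  When CMA allocation is
 * possible, anything at or below the per-bit order is excluded, because
 * its pieces could not be returned to the region's bitmap.
 */
static bool ex_can_be_demoted(unsigned int order, bool cma_in_use)
{
	if (cma_in_use && order <= EX_HUGETLB_PAGE_ORDER)
		return false;
	return true;
}

/*
 * Is demote_order an acceptable target size for pages of src_order?
 * Targets below the per-bit order are rejected, and the target must be
 * smaller than the source.
 */
static bool ex_valid_demote_target(unsigned int src_order,
				   unsigned int demote_order)
{
	if (demote_order < EX_HUGETLB_PAGE_ORDER)
		return false;
	return demote_order < src_order;
}
```

Under these assumptions a 1G page (order 18) may be demoted to 2M
(order 9), but no page may be demoted to a size below order 9.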

Link: https://lkml.kernel.org/r/20211007181918.136982-4-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Nghia Le <nghialm78@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |   41 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 39 insertions(+), 2 deletions(-)

--- a/mm/hugetlb.c~hugetlb-be-sure-to-free-demoted-cma-pages-to-cma
+++ a/mm/hugetlb.c
@@ -50,6 +50,16 @@  struct hstate hstates[HUGE_MAX_HSTATE];
 
 #ifdef CONFIG_CMA
 static struct cma *hugetlb_cma[MAX_NUMNODES];
+static bool hugetlb_cma_page(struct page *page, unsigned int order)
+{
+	return cma_pages_valid(hugetlb_cma[page_to_nid(page)], page,
+				1 << order);
+}
+#else
+static bool hugetlb_cma_page(struct page *page, unsigned int order)
+{
+	return false;
+}
 #endif
 static unsigned long hugetlb_cma_size __initdata;
 
@@ -1272,6 +1282,7 @@  static void destroy_compound_gigantic_pa
 	atomic_set(compound_pincount_ptr(page), 0);
 
 	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
+		p->mapping = NULL;
 		clear_compound_head(p);
 		set_page_refcounted(p);
 	}
@@ -1476,7 +1487,13 @@  static void __update_and_free_page(struc
 				1 << PG_active | 1 << PG_private |
 				1 << PG_writeback);
 	}
-	if (hstate_is_gigantic(h)) {
+
+	/*
+	 * Non-gigantic pages demoted from CMA allocated gigantic pages
+	 * need to be given back to CMA in free_gigantic_page.
+	 */
+	if (hstate_is_gigantic(h) ||
+	    hugetlb_cma_page(page, huge_page_order(h))) {
 		destroy_compound_gigantic_page(page, huge_page_order(h));
 		free_gigantic_page(page, huge_page_order(h));
 	} else {
@@ -3001,9 +3018,13 @@  static void __init hugetlb_init_hstates(
 		 * h->demote_order is initially 0.
 		 * - We can not demote gigantic pages if runtime freeing
 		 *   is not supported, so skip this.
+		 * - If CMA allocation is possible, we can not demote
+		 *   HUGETLB_PAGE_ORDER or smaller size pages.
 		 */
 		if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
 			continue;
+		if (hugetlb_cma_size && h->order <= HUGETLB_PAGE_ORDER)
+			continue;
 		for_each_hstate(h2) {
 			if (h2 == h)
 				continue;
@@ -3555,6 +3576,8 @@  static ssize_t demote_size_store(struct
 	if (!demote_hstate)
 		return -EINVAL;
 	demote_order = demote_hstate->order;
+	if (demote_order < HUGETLB_PAGE_ORDER)
+		return -EINVAL;
 
 	/* demote order must be smaller than hstate order */
 	h = kobj_to_hstate(kobj, &nid);
@@ -6543,6 +6566,7 @@  void __init hugetlb_cma_reserve(int orde
 	if (hugetlb_cma_size < (PAGE_SIZE << order)) {
 		pr_warn("hugetlb_cma: cma area should be at least %lu MiB\n",
 			(PAGE_SIZE << order) / SZ_1M);
+		hugetlb_cma_size = 0;
 		return;
 	}
 
@@ -6563,7 +6587,13 @@  void __init hugetlb_cma_reserve(int orde
 		size = round_up(size, PAGE_SIZE << order);
 
 		snprintf(name, sizeof(name), "hugetlb%d", nid);
-		res = cma_declare_contiguous_nid(0, size, 0, PAGE_SIZE << order,
+		/*
+		 * Note that 'order per bit' is based on smallest size that
+		 * may be returned to CMA allocator in the case of
+		 * huge page demotion.
+		 */
+		res = cma_declare_contiguous_nid(0, size, 0,
+						PAGE_SIZE << HUGETLB_PAGE_ORDER,
 						 0, false, name,
 						 &hugetlb_cma[nid], nid);
 		if (res) {
@@ -6579,6 +6609,13 @@  void __init hugetlb_cma_reserve(int orde
 		if (reserved >= hugetlb_cma_size)
 			break;
 	}
+
+	if (!reserved)
+		/*
+		 * hugetlb_cma_size is used to determine if allocations from
+		 * cma are possible.  Set to zero if no cma regions are set up.
+		 */
+		hugetlb_cma_size = 0;
 }
 
 void __init hugetlb_cma_check(void)
