[147/155] hugetlb_cgroup: support noreserve mappings

Message ID	20200402041131.AsrG1azep%akpm@linux-foundation.org (mailing list archive)
State	New, archived
Headers
Return-Path: <SRS0=V2YN=5S=kvack.org=owner-linux-mm@kernel.org>
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 35A33206E9
Date: Wed, 01 Apr 2020 21:11:31 -0700
From: Andrew Morton <akpm@linux-foundation.org>
To: akpm@linux-foundation.org, almasrymina@google.com,
 gthelen@google.com, linux-mm@kvack.org, mike.kravetz@oracle.com,
 mm-commits@vger.kernel.org, rientjes@google.com, sandipan@linux.ibm.com,
 shakeelb@google.com, shuah@kernel.org, torvalds@linux-foundation.org
Subject: [patch 147/155] hugetlb_cgroup: support noreserve mappings
Message-ID: <20200402041131.AsrG1azep%akpm@linux-foundation.org>
In-Reply-To: <20200401210155.09e3b9742e1c6e732f5a7250@linux-foundation.org>
User-Agent: s-nail v14.8.16
Sender: owner-linux-mm@kvack.org
Precedence: bulk


Commit Message

Andrew Morton April 2, 2020, 4:11 a.m. UTC

From: Mina Almasry <almasrymina@google.com>
Subject: hugetlb_cgroup: support noreserve mappings

Support MAP_NORESERVE accounting as part of the new counter.

For each hugepage allocation, we check at allocation time whether the
allocation is backed by a reservation.  If it is, the allocation was
already charged at reservation time and we do not account it again.  If
there is no reservation for the allocation, we charge the appropriate
hugetlb_cgroup.
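
The rule above can be illustrated with a small user-space model. This is only a sketch: the toy_* names, the struct layout, and the limit check are assumptions made for illustration and are not the kernel's hugetlb_cgroup API. It shows that a page consuming an existing reservation adds nothing to the counter, while an unreserved (for example MAP_NORESERVE) allocation is charged when the page is allocated.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a hugetlb_cgroup reservation counter (hypothetical). */
struct toy_cgroup {
	long rsvd_usage;	/* pages currently charged to the counter */
	long rsvd_limit;	/* maximum pages that may be charged */
};

/* Charge nr_pages; return 0 on success, -1 if the limit would be exceeded. */
static int toy_charge_rsvd(struct toy_cgroup *cg, long nr_pages)
{
	assert(nr_pages > 0);
	if (cg->rsvd_usage + nr_pages > cg->rsvd_limit)
		return -1;
	cg->rsvd_usage += nr_pages;
	return 0;
}

/* Reservation time: a mapping that reserves pages is charged up front. */
static int toy_reserve(struct toy_cgroup *cg, long nr_pages)
{
	return toy_charge_rsvd(cg, nr_pages);
}

/*
 * Allocation time: a page that consumes an existing reservation was
 * already charged, so only unreserved allocations are charged here.
 */
static int toy_alloc_page(struct toy_cgroup *cg, bool has_reservation,
			  long nr_pages)
{
	if (has_reservation)
		return 0;
	return toy_charge_rsvd(cg, nr_pages);
}
```

With a limit of 1024 pages, reserving 512 pages and then allocating against that reservation leaves the usage at 512, whereas a later unreserved allocation of 512 pages raises it to 1024 and a further one fails.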

The hugetlb_cgroup to uncharge for this allocation is stored in
page[3].private.  We use new APIs added in an earlier patch to set this
pointer.
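
As a rough picture of that bookkeeping, the sketch below keeps the owning cgroup pointer in the fourth element of a toy page array and uses it at free time. The demo_* names and struct fields are hypothetical; the real helpers operate on struct page and take additional arguments. Treating a missing pointer as "nothing to uncharge" is an assumption of this sketch, chosen because the free path in this patch calls the uncharge helper unconditionally.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reservation counter, reduced to a single field. */
struct demo_cgroup {
	long rsvd_usage;
};

/* Toy page descriptor: only the private field this sketch needs. */
struct demo_page {
	void *private_data;
};

/* Index of the tail page whose private field holds the pointer. */
#define DEMO_RSVD_SLOT 3

/* Remember which cgroup was charged for this huge page. */
static void demo_commit_charge_rsvd(struct demo_page *head,
				    struct demo_cgroup *cg)
{
	head[DEMO_RSVD_SLOT].private_data = cg;
}

/*
 * Free path: uncharge the cgroup recorded on the page, if any.
 * Pages charged at reservation time carry no pointer here, so the
 * call does nothing for them.
 */
static void demo_uncharge_page_rsvd(struct demo_page *head, long nr_pages)
{
	struct demo_cgroup *cg = head[DEMO_RSVD_SLOT].private_data;

	if (!cg)
		return;
	assert(cg->rsvd_usage >= nr_pages);
	cg->rsvd_usage -= nr_pages;
	head[DEMO_RSVD_SLOT].private_data = NULL;
}
```

Clearing the slot after the uncharge makes a second call on the same page harmless in this model.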

Link: http://lkml.kernel.org/r/20200211213128.73302-6-almasrymina@google.com
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |   27 ++++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

--- a/mm/hugetlb.c~hugetlb_cgroup-support-noreserve-mappings
+++ a/mm/hugetlb.c
@@ -1345,6 +1345,8 @@  static void __free_huge_page(struct page
 	clear_page_huge_active(page);
 	hugetlb_cgroup_uncharge_page(hstate_index(h),
 				     pages_per_huge_page(h), page);
+	hugetlb_cgroup_uncharge_page_rsvd(hstate_index(h),
+					  pages_per_huge_page(h), page);
 	if (restore_reserve)
 		h->resv_huge_pages++;
 
@@ -2281,6 +2283,7 @@  struct page *alloc_huge_page(struct vm_a
 	long gbl_chg;
 	int ret, idx;
 	struct hugetlb_cgroup *h_cg;
+	bool deferred_reserve;
 
 	idx = hstate_index(h);
 	/*
@@ -2318,9 +2321,19 @@  struct page *alloc_huge_page(struct vm_a
 			gbl_chg = 1;
 	}
 
+	/* If this allocation is not consuming a reservation, charge it now.
+	 */
+	deferred_reserve = map_chg || avoid_reserve || !vma_resv_map(vma);
+	if (deferred_reserve) {
+		ret = hugetlb_cgroup_charge_cgroup_rsvd(
+			idx, pages_per_huge_page(h), &h_cg);
+		if (ret)
+			goto out_subpool_put;
+	}
+
 	ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
 	if (ret)
-		goto out_subpool_put;
+		goto out_uncharge_cgroup_reservation;
 
 	spin_lock(&hugetlb_lock);
 	/*
@@ -2343,6 +2356,14 @@  struct page *alloc_huge_page(struct vm_a
 		/* Fall through */
 	}
 	hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page);
+	/* If allocation is not consuming a reservation, also store the
+	 * hugetlb_cgroup pointer on the page.
+	 */
+	if (deferred_reserve) {
+		hugetlb_cgroup_commit_charge_rsvd(idx, pages_per_huge_page(h),
+						  h_cg, page);
+	}
+
 	spin_unlock(&hugetlb_lock);
 
 	set_page_private(page, (unsigned long)spool);
@@ -2367,6 +2388,10 @@  struct page *alloc_huge_page(struct vm_a
 
 out_uncharge_cgroup:
 	hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg);
+out_uncharge_cgroup_reservation:
+	if (deferred_reserve)
+		hugetlb_cgroup_uncharge_cgroup_rsvd(idx, pages_per_huge_page(h),
+						    h_cg);
 out_subpool_put:
 	if (map_chg || avoid_reserve)
 		hugepage_subpool_put_pages(spool, 1);
