From patchwork Thu Oct 28 11:56:50 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ning Zhang X-Patchwork-Id: 12589935 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 69F96C433EF for ; Thu, 28 Oct 2021 11:57:07 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 035B560524 for ; Thu, 28 Oct 2021 11:57:06 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 035B560524 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.alibaba.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id 5E0A494000B; Thu, 28 Oct 2021 07:57:04 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 5909194000A; Thu, 28 Oct 2021 07:57:04 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 3E63F94000B; Thu, 28 Oct 2021 07:57:04 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0033.hostedemail.com [216.40.44.33]) by kanga.kvack.org (Postfix) with ESMTP id 0D51D940008 for ; Thu, 28 Oct 2021 07:57:04 -0400 (EDT) Received: from smtpin16.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id 927C71809B39D for ; Thu, 28 Oct 2021 11:57:03 +0000 (UTC) X-FDA: 78745695126.16.681EEEE Received: from out30-131.freemail.mail.aliyun.com (out30-131.freemail.mail.aliyun.com [115.124.30.131]) by imf27.hostedemail.com (Postfix) with ESMTP id CB9CD700024B for ; Thu, 28 Oct 2021 11:57:01 +0000 (UTC) X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R131e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e04407;MF=ningzhang@linux.alibaba.com;NM=1;PH=DS;RN=6;SR=0;TI=SMTPD_---0Uu.SoiE_1635422215; Received: from localhost(mailfrom:ningzhang@linux.alibaba.com fp:SMTPD_---0Uu.SoiE_1635422215) by smtp.aliyun-inc.com(127.0.0.1); Thu, 28 Oct 2021 19:56:56 +0800 From: Ning Zhang To: linux-mm@kvack.org Cc: Andrew Morton , Johannes Weiner , Michal Hocko , Vladimir Davydov , Yu Zhao Subject: [RFC 1/6] mm, thp: introduce thp zero subpages reclaim Date: Thu, 28 Oct 2021 19:56:50 +0800 Message-Id: <1635422215-99394-2-git-send-email-ningzhang@linux.alibaba.com> X-Mailer: git-send-email 1.8.3.1 In-Reply-To: <1635422215-99394-1-git-send-email-ningzhang@linux.alibaba.com> References: <1635422215-99394-1-git-send-email-ningzhang@linux.alibaba.com> X-Stat-Signature: 8xgqc3y1598f6je88z8thh9jn1rwfcw4 Authentication-Results: imf27.hostedemail.com; dkim=none; dmarc=pass (policy=none) header.from=alibaba.com; spf=pass (imf27.hostedemail.com: domain of ningzhang@linux.alibaba.com designates 115.124.30.131 as permitted sender) smtp.mailfrom=ningzhang@linux.alibaba.com X-Rspamd-Server: rspam04 X-Rspamd-Queue-Id: CB9CD700024B X-HE-Tag: 1635422221-96476 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Transparent huge pages could reduce the number of tlb misses, which could improve performance for the applications. But one concern is that thp may lead to memory bloat which may cause OOM. 
The reason is that a huge page may contain zero subpages which the user did not really access. This patch introduces a mechanism to reclaim these zero subpages; it works when memory pressure is high. We first estimate whether a huge page contains enough zero subpages, then try to split it and reclaim the zero subpages.

Through testing with some apps, we found that the zero subpages tend to be concentrated in a few huge pages. Following is a text_classification_rnn case for tensorflow:

  zero_subpages   huge_pages   waste
  [   0,   1)     186          0.00%
  [   1,   2)     23           0.01%
  [   2,   4)     36           0.02%
  [   4,   8)     67           0.08%
  [   8,  16)     80           0.23%
  [  16,  32)     109          0.61%
  [  32,  64)     44           0.49%
  [  64, 128)     12           0.30%
  [ 128, 256)     28           1.54%
  [ 256, 513)     159          18.03%

In this case, a lot of zero subpages are concentrated in 187 (28+159) huge pages, which leads to 19.57% waste of the total rss. It means we can reclaim 19.57% of the memory by splitting the 187 huge pages and reclaiming the zero subpages.

We store the huge pages in a new list in order to find them quickly. We also add an interface 'thp_reclaim' to switch it on or off per memory cgroup:

  echo 1 > memory.thp_reclaim to enable.
  echo 0 > memory.thp_reclaim to disable.

Signed-off-by: Ning Zhang
Signed-off-by: Gang Deng
--- include/linux/huge_mm.h | 9 ++ include/linux/memcontrol.h | 15 +++ include/linux/mm.h | 1 + include/linux/mm_types.h | 6 + include/linux/mmzone.h | 6 + mm/huge_memory.c | 296 ++++++++++++++++++++++++++++++++++++++++++++- mm/memcontrol.c | 107 ++++++++++++++++ mm/vmscan.c | 59 ++++++++- 8 files changed, 496 insertions(+), 3 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index f123e15..e1b3bf9 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -185,6 +185,15 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr, void free_transhuge_page(struct page *page); bool is_transparent_hugepage(struct page *page); +#ifdef CONFIG_MEMCG +int zsr_get_hpage(struct hpage_reclaim *hr_queue, struct page **reclaim_page); +unsigned long zsr_reclaim_hpage(struct lruvec *lruvec, struct page *page); +static inline struct list_head *hpage_reclaim_list(struct page *page) +{ + return &page[3].hpage_reclaim_list; +} +#endif + bool can_split_huge_page(struct page *page, int *pextra_pins); int split_huge_page_to_list(struct page *page, struct list_head *list); static inline int split_huge_page(struct page *page) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 3096c9a..502a6ab 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -150,6 +150,9 @@ struct mem_cgroup_per_node { unsigned long usage_in_excess;/* Set to the value by which */ /* the soft limit is exceeded*/ bool on_tree; +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + struct hpage_reclaim hpage_reclaim_queue; +#endif struct mem_cgroup *memcg; /* Back pointer, we cannot */ /* use container_of */ }; @@ -228,6 +231,13 @@ struct obj_cgroup { }; }; +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +enum thp_reclaim_state { + THP_RECLAIM_DISABLE, + THP_RECLAIM_ENABLE, + THP_RECLAIM_MEMCG, /* For global configure*/ +}; +#endif /* * The memory controller data structure. The memory controller controls both
We would eventually like to provide @@ -345,6 +355,7 @@ struct mem_cgroup { #ifdef CONFIG_TRANSPARENT_HUGEPAGE struct deferred_split deferred_split_queue; + int thp_reclaim; #endif struct mem_cgroup_per_node *nodeinfo[]; @@ -1110,6 +1121,10 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, gfp_t gfp_mask, unsigned long *total_scanned); +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +void del_hpage_from_queue(struct page *page); +#endif + #else /* CONFIG_MEMCG */ #define MEM_CGROUP_ID_SHIFT 0 diff --git a/include/linux/mm.h b/include/linux/mm.h index 73a52ab..39676f9 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3061,6 +3061,7 @@ int drop_caches_sysctl_handler(struct ctl_table *, int, void *, size_t *, void drop_slab(void); void drop_slab_node(int nid); +unsigned int move_pages_to_lru(struct lruvec *lruvec, struct list_head *list); #ifndef CONFIG_MMU #define randomize_va_space 0 diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 7f8ee09..9433987 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -159,6 +159,12 @@ struct page { /* For both global and memcg */ struct list_head deferred_list; }; + struct { /* Third tail page of compound page */ + unsigned long _compound_pad_2; + unsigned long _compound_pad_3; + /* For zero subpages reclaim */ + struct list_head hpage_reclaim_list; + }; struct { /* Page table pages */ unsigned long _pt_pad_1; /* compound_head */ pgtable_t pmd_huge_pte; /* protected by page->ptl */ diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 6a1d79d..222cd4f 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -787,6 +787,12 @@ struct deferred_split { struct list_head split_queue; unsigned long split_queue_len; }; + +struct hpage_reclaim { + spinlock_t reclaim_queue_lock; + struct list_head reclaim_queue; + unsigned long reclaim_queue_len; +}; #endif /* diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 5e9ef0f..21e3c01 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -526,6 +526,9 @@ void prep_transhuge_page(struct page *page) */ INIT_LIST_HEAD(page_deferred_list(page)); +#ifdef CONFIG_MEMCG + INIT_LIST_HEAD(hpage_reclaim_list(page)); +#endif set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR); } @@ -2367,7 +2370,7 @@ static void __split_huge_page_tail(struct page *head, int tail, (1L << PG_dirty))); /* ->mapping in first tail page is compound_mapcount */ - VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING, + VM_BUG_ON_PAGE(tail > 3 && page_tail->mapping != TAIL_MAPPING, page_tail); page_tail->mapping = head->mapping; page_tail->index = head->index + tail; @@ -2620,6 +2623,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) VM_BUG_ON_PAGE(!PageLocked(head), head); VM_BUG_ON_PAGE(!PageCompound(head), head); + del_hpage_from_queue(page); if (PageWriteback(head)) return -EBUSY; @@ -2779,6 +2783,7 @@ void deferred_split_huge_page(struct page *page) set_shrinker_bit(memcg, page_to_nid(page), deferred_split_shrinker.id); #endif + del_hpage_from_queue(page); } spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); } @@ -3203,3 +3208,292 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new) update_mmu_cache_pmd(vma, address, pvmw->pmd); } #endif + +#ifdef CONFIG_MEMCG +static inline bool is_zero_page(struct page *page) +{ + void *addr = kmap(page); + bool ret = true; + + if (memchr_inv(addr, 0, PAGE_SIZE)) + ret = false; + kunmap(page); + + return ret; +} + +/* + * We'll split the huge page iff it 
contains at least 1/32 zeros, + * estimate it by checking some discrete unsigned long values. + */ +static bool hpage_estimate_zero(struct page *page) +{ + unsigned int i, maybe_zero_pages = 0, offset = 0; + void *addr; + +#define BYTES_PER_LONG (BITS_PER_LONG / BITS_PER_BYTE) + for (i = 0; i < HPAGE_PMD_NR; i++, page++, offset++) { + addr = kmap(page); + if (unlikely((offset + 1) * BYTES_PER_LONG > PAGE_SIZE)) + offset = 0; + if (*(const unsigned long *)(addr + offset) == 0UL) { + if (++maybe_zero_pages == HPAGE_PMD_NR >> 5) { + kunmap(page); + return true; + } + } + kunmap(page); + } + + return false; +} + +static bool replace_zero_pte(struct page *page, struct vm_area_struct *vma, + unsigned long addr, void *zero_page) +{ + struct page_vma_mapped_walk pvmw = { + .page = page, + .vma = vma, + .address = addr, + .flags = PVMW_SYNC | PVMW_MIGRATION, + }; + pte_t pte; + + VM_BUG_ON_PAGE(PageTail(page), page); + + while (page_vma_mapped_walk(&pvmw)) { + pte = pte_mkspecial( + pfn_pte(page_to_pfn((struct page *)zero_page), + vma->vm_page_prot)); + dec_mm_counter(vma->vm_mm, MM_ANONPAGES); + set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte); + update_mmu_cache(vma, pvmw.address, pvmw.pte); + } + + return true; +} + +static void replace_zero_ptes_locked(struct page *page) +{ + struct page *zero_page = ZERO_PAGE(0); + struct rmap_walk_control rwc = { + .rmap_one = replace_zero_pte, + .arg = zero_page, + }; + + rmap_walk_locked(page, &rwc); +} + +static bool replace_zero_page(struct page *page) +{ + struct anon_vma *anon_vma = NULL; + bool unmap_success; + bool ret = true; + + anon_vma = page_get_anon_vma(page); + if (!anon_vma) + return false; + + anon_vma_lock_write(anon_vma); + try_to_migrate(page, TTU_RMAP_LOCKED); + unmap_success = !page_mapped(page); + + if (!unmap_success || !is_zero_page(page)) { + /* remap the page */ + remove_migration_ptes(page, page, true); + ret = false; + } else + replace_zero_ptes_locked(page); + + anon_vma_unlock_write(anon_vma); + put_anon_vma(anon_vma); + + return ret; +} + +/* + * reclaim_zero_subpages - reclaim the zero subpages and putback the non-zero + * subpages. + * + * The non-zero subpages are putback to the keep_list, and will be putback to + * the lru list. + * + * Return the number of reclaimed zero subpages. + */ +static unsigned long reclaim_zero_subpages(struct list_head *list, + struct list_head *keep_list) +{ + LIST_HEAD(zero_list); + struct page *page; + unsigned long reclaimed = 0; + + while (!list_empty(list)) { + page = lru_to_page(list); + list_del_init(&page->lru); + if (is_zero_page(page)) { + if (!trylock_page(page)) + goto keep; + + if (!replace_zero_page(page)) { + unlock_page(page); + goto keep; + } + + __ClearPageActive(page); + unlock_page(page); + if (put_page_testzero(page)) { + list_add(&page->lru, &zero_list); + reclaimed++; + } + + /* someone may hold the zero page, we just skip it. */ + + continue; + } +keep: + list_add(&page->lru, keep_list); + } + + mem_cgroup_uncharge_list(&zero_list); + free_unref_page_list(&zero_list); + + return reclaimed; + +} + +#ifdef CONFIG_MMU +#define ZSR_PG_MLOCK(flag) (1UL << flag) +#else +#define ZSR_PG_MLOCK(flag) 0 +#endif + +#ifdef CONFIG_ARCH_USES_PG_UNCACHED +#define ZSR_PG_UNCACHED(flag) (1UL << flag) +#else +#define ZSR_PG_UNCACHED(flag) 0 +#endif + +#ifdef CONFIG_MEMORY_FAILURE +#define ZSR_PG_HWPOISON(flag) (1UL << flag) +#else +#define ZSR_PG_HWPOISON(flag) 0 +#endif + +/* Filter unsupported page flags. 
*/ +#define ZSR_FLAG_CHECK \ + ((1UL << PG_error) | \ + (1UL << PG_owner_priv_1) | \ + (1UL << PG_arch_1) | \ + (1UL << PG_reserved) | \ + (1UL << PG_private) | \ + (1UL << PG_private_2) | \ + (1UL << PG_writeback) | \ + (1UL << PG_swapcache) | \ + (1UL << PG_mappedtodisk) | \ + (1UL << PG_reclaim) | \ + (1UL << PG_unevictable) | \ + ZSR_PG_MLOCK(PG_mlocked) | \ + ZSR_PG_UNCACHED(PG_uncached) | \ + ZSR_PG_HWPOISON(PG_hwpoison)) + +#define hpage_can_reclaim(page) \ + (PageAnon(page) && !PageKsm(page) && !(page->flags & ZSR_FLAG_CHECK)) + +#define hr_queue_list_to_page(head) \ + compound_head(list_entry((head)->prev, struct page,\ + hpage_reclaim_list)) + +/* + * zsr_get_hpage - get one huge page from huge page reclaim queue + * + * Return -EINVAL if the queue is empty; otherwise, return 0. + * If the queue is not empty, it will check whether the tail page of the + * queue can be reclaimed or not. If the page can be reclaimed, it will + * be stored in reclaim_page; otherwise, just delete the page from the + * queue. + */ +int zsr_get_hpage(struct hpage_reclaim *hr_queue, struct page **reclaim_page) +{ + struct page *page = NULL; + unsigned long flags; + int ret = 0; + + spin_lock_irqsave(&hr_queue->reclaim_queue_lock, flags); + if (list_empty(&hr_queue->reclaim_queue)) { + ret = -EINVAL; + goto unlock; + } + + page = hr_queue_list_to_page(&hr_queue->reclaim_queue); + list_del_init(hpage_reclaim_list(page)); + hr_queue->reclaim_queue_len--; + + if (!hpage_can_reclaim(page) || !get_page_unless_zero(page)) + goto unlock; + + if (!trylock_page(page)) { + put_page(page); + goto unlock; + } + + spin_unlock_irqrestore(&hr_queue->reclaim_queue_lock, flags); + + if (hpage_can_reclaim(page) && hpage_estimate_zero(page) && + !isolate_lru_page(page)) { + __mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON, + HPAGE_PMD_NR); + /* + * dec the reference added in + * isolate_lru_page + */ + page_ref_dec(page); + *reclaim_page = page; + } else { + unlock_page(page); + put_page(page); + } + + return ret; + +unlock: + spin_unlock_irqrestore(&hr_queue->reclaim_queue_lock, flags); + return ret; + +} + +unsigned long zsr_reclaim_hpage(struct lruvec *lruvec, struct page *page) +{ + struct pglist_data *pgdat = page_pgdat(page); + unsigned long reclaimed; + unsigned long flags; + LIST_HEAD(split_list); + LIST_HEAD(keep_list); + + /* + * Split the huge page and reclaim the zero subpages. + * And putback the non-zero subpages to the lru list. + */ + if (split_huge_page_to_list(page, &split_list)) { + unlock_page(page); + putback_lru_page(page); + mod_node_page_state(pgdat, NR_ISOLATED_ANON, + -HPAGE_PMD_NR); + return 0; + } + + unlock_page(page); + list_add_tail(&page->lru, &split_list); + reclaimed = reclaim_zero_subpages(&split_list, &keep_list); + + spin_lock_irqsave(&lruvec->lru_lock, flags); + move_pages_to_lru(lruvec, &keep_list); + spin_unlock_irqrestore(&lruvec->lru_lock, flags); + mod_node_page_state(pgdat, NR_ISOLATED_ANON, + -HPAGE_PMD_NR); + + mem_cgroup_uncharge_list(&keep_list); + free_unref_page_list(&keep_list); + + return reclaimed; +} +#endif diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b762215..5df1cdd 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2739,6 +2739,56 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages) } #endif +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +/* Need the page lock if the page is not a newly allocated page. 
*/ +static void add_hpage_to_queue(struct page *page, struct mem_cgroup *memcg) +{ + struct hpage_reclaim *hr_queue; + unsigned long flags; + + if (READ_ONCE(memcg->thp_reclaim) == THP_RECLAIM_DISABLE) + return; + + page = compound_head(page); + /* + * we just want to add the anon page to the queue, but it is not sure + * the page is anon or not when charging to memcg. + * page_mapping return NULL if the page is a anon page or the mapping + * is not yet set. + */ + if (!is_transparent_hugepage(page) || page_mapping(page)) + return; + + hr_queue = &memcg->nodeinfo[page_to_nid(page)]->hpage_reclaim_queue; + spin_lock_irqsave(&hr_queue->reclaim_queue_lock, flags); + if (list_empty(hpage_reclaim_list(page))) { + list_add(hpage_reclaim_list(page), &hr_queue->reclaim_queue); + hr_queue->reclaim_queue_len++; + } + spin_unlock_irqrestore(&hr_queue->reclaim_queue_lock, flags); +} + +void del_hpage_from_queue(struct page *page) +{ + struct mem_cgroup *memcg; + struct hpage_reclaim *hr_queue; + unsigned long flags; + + page = compound_head(page); + memcg = page_memcg(page); + if (!memcg || !is_transparent_hugepage(page)) + return; + + hr_queue = &memcg->nodeinfo[page_to_nid(page)]->hpage_reclaim_queue; + spin_lock_irqsave(&hr_queue->reclaim_queue_lock, flags); + if (!list_empty(hpage_reclaim_list(page))) { + list_del_init(hpage_reclaim_list(page)); + hr_queue->reclaim_queue_len--; + } + spin_unlock_irqrestore(&hr_queue->reclaim_queue_lock, flags); +} +#endif + static void commit_charge(struct page *page, struct mem_cgroup *memcg) { VM_BUG_ON_PAGE(page_memcg(page), page); @@ -2751,6 +2801,10 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg) * - exclusive reference */ page->memcg_data = (unsigned long)memcg; + +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + add_hpage_to_queue(page, memcg); +#endif } static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg) @@ -4425,6 +4479,26 @@ static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css, return 0; } +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +static u64 mem_cgroup_thp_reclaim_read(struct cgroup_subsys_state *css, + struct cftype *cft) +{ + return READ_ONCE(mem_cgroup_from_css(css)->thp_reclaim); +} + +static int mem_cgroup_thp_reclaim_write(struct cgroup_subsys_state *css, + struct cftype *cft, u64 val) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(css); + + if (val != THP_RECLAIM_DISABLE && val != THP_RECLAIM_ENABLE) + return -EINVAL; + + WRITE_ONCE(memcg->thp_reclaim, val); + + return 0; +} +#endif #ifdef CONFIG_CGROUP_WRITEBACK @@ -4988,6 +5062,13 @@ static ssize_t memcg_write_event_control(struct kernfs_open_file *of, .write = mem_cgroup_reset, .read_u64 = mem_cgroup_read_u64, }, +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + { + .name = "thp_reclaim", + .read_u64 = mem_cgroup_thp_reclaim_read, + .write_u64 = mem_cgroup_thp_reclaim_write, + }, +#endif { }, /* terminate */ }; @@ -5088,6 +5169,12 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node) pn->on_tree = false; pn->memcg = memcg; +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + spin_lock_init(&pn->hpage_reclaim_queue.reclaim_queue_lock); + INIT_LIST_HEAD(&pn->hpage_reclaim_queue.reclaim_queue); + pn->hpage_reclaim_queue.reclaim_queue_len = 0; +#endif + memcg->nodeinfo[node] = pn; return 0; } @@ -5176,6 +5263,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void) spin_lock_init(&memcg->deferred_split_queue.split_queue_lock); INIT_LIST_HEAD(&memcg->deferred_split_queue.split_queue); memcg->deferred_split_queue.split_queue_len = 0; + + 
memcg->thp_reclaim = THP_RECLAIM_DISABLE; #endif idr_replace(&mem_cgroup_idr, memcg, memcg->id.id); return memcg; @@ -5209,6 +5298,9 @@ static struct mem_cgroup *mem_cgroup_alloc(void) page_counter_init(&memcg->swap, &parent->swap); page_counter_init(&memcg->kmem, &parent->kmem); page_counter_init(&memcg->tcpmem, &parent->tcpmem); +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + memcg->thp_reclaim = parent->thp_reclaim; +#endif } else { page_counter_init(&memcg->memory, NULL); page_counter_init(&memcg->swap, NULL); @@ -5654,6 +5746,10 @@ static int mem_cgroup_move_account(struct page *page, __mod_lruvec_state(to_vec, NR_WRITEBACK, nr_pages); } +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + del_hpage_from_queue(page); +#endif + /* * All state has been migrated, let's switch to the new memcg. * @@ -5674,6 +5770,10 @@ static int mem_cgroup_move_account(struct page *page, page->memcg_data = (unsigned long)to; +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + add_hpage_to_queue(page, to); +#endif + __unlock_page_memcg(from); ret = 0; @@ -6850,6 +6950,9 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug) VM_BUG_ON_PAGE(PageLRU(page), page); +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + del_hpage_from_queue(page); +#endif /* * Nobody should be changing or seriously looking at * page memcg or objcg at this point, we have fully @@ -7196,6 +7299,10 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry) VM_BUG_ON_PAGE(oldid, page); mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries); +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + del_hpage_from_queue(page); +#endif + page->memcg_data = 0; if (!mem_cgroup_is_root(memcg)) diff --git a/mm/vmscan.c b/mm/vmscan.c index 74296c2..9be136f 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2151,8 +2151,7 @@ static int too_many_isolated(struct pglist_data *pgdat, int file, * * Returns the number of pages moved to the given lruvec. */ -static unsigned int move_pages_to_lru(struct lruvec *lruvec, - struct list_head *list) +unsigned int move_pages_to_lru(struct lruvec *lruvec, struct list_head *list) { int nr_pages, nr_moved = 0; LIST_HEAD(pages_to_free); @@ -2783,6 +2782,57 @@ static bool can_age_anon_pages(struct pglist_data *pgdat, return can_demote(pgdat->node_id, sc); } +#ifdef CONFIG_MEMCG +#define MAX_SCAN_HPAGE 32UL +/* + * Try to reclaim the zero subpages for the transparent huge page. + */ +static unsigned long reclaim_hpage_zero_subpages(struct lruvec *lruvec, + int priority, + unsigned long nr_to_reclaim) +{ + struct mem_cgroup *memcg; + struct hpage_reclaim *hr_queue; + int nid = lruvec->pgdat->node_id; + unsigned long nr_reclaimed = 0, nr_scanned = 0, nr_to_scan; + + memcg = lruvec_memcg(lruvec); + if (!memcg) + goto out; + + hr_queue = &memcg->nodeinfo[nid]->hpage_reclaim_queue; + if (!READ_ONCE(memcg->thp_reclaim)) + goto out; + + /* The last scan loop will scan all the huge pages.*/ + nr_to_scan = priority == 0 ? 
0 : MAX_SCAN_HPAGE; + + do { + struct page *page = NULL; + + if (zsr_get_hpage(hr_queue, &page)) + break; + + if (!page) + continue; + + nr_reclaimed += zsr_reclaim_hpage(lruvec, page); + + cond_resched(); + + } while ((nr_reclaimed < nr_to_reclaim) && (++nr_scanned != nr_to_scan)); +out: + return nr_reclaimed; +} +#else +static unsigned long reclaim_hpage_zero_subpages(struct lruvec *lruvec, + int priority, + unsigned long nr_to_reclaim) +{ + return 0; +} +#endif + static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) { unsigned long nr[NR_LRU_LISTS]; @@ -2886,6 +2936,11 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) scan_adjusted = true; } blk_finish_plug(&plug); +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + if (nr_reclaimed < nr_to_reclaim) + nr_reclaimed += reclaim_hpage_zero_subpages(lruvec, + sc->priority, nr_to_reclaim - nr_reclaimed); +#endif sc->nr_reclaimed += nr_reclaimed; /* From patchwork Thu Oct 28 11:56:51 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ning Zhang X-Patchwork-Id: 12589929 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id D76C1C433EF for ; Thu, 28 Oct 2021 11:57:02 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 529F4610FC for ; Thu, 28 Oct 2021 11:57:02 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 529F4610FC Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.alibaba.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id 851F7940007; Thu, 28 Oct 2021 07:57:01 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 801326B0072; Thu, 28 Oct 2021 07:57:01 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 6C99E940007; Thu, 28 Oct 2021 07:57:01 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id 44FE06B0071 for ; Thu, 28 Oct 2021 07:57:01 -0400 (EDT) Received: from smtpin06.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id CE14218464D1D for ; Thu, 28 Oct 2021 11:57:00 +0000 (UTC) X-FDA: 78745695000.06.CDAE93A Received: from out30-130.freemail.mail.aliyun.com (out30-130.freemail.mail.aliyun.com [115.124.30.130]) by imf12.hostedemail.com (Postfix) with ESMTP id B76AA10003E7 for ; Thu, 28 Oct 2021 11:56:59 +0000 (UTC) X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R211e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e04426;MF=ningzhang@linux.alibaba.com;NM=1;PH=DS;RN=6;SR=0;TI=SMTPD_---0Uu.SoiJ_1635422216; Received: from localhost(mailfrom:ningzhang@linux.alibaba.com fp:SMTPD_---0Uu.SoiJ_1635422216) by smtp.aliyun-inc.com(127.0.0.1); Thu, 28 Oct 2021 19:56:56 +0800 From: Ning Zhang To: linux-mm@kvack.org Cc: Andrew Morton , Johannes Weiner , Michal Hocko , Vladimir Davydov , Yu Zhao Subject: [RFC 2/6] mm, thp: add a global interface for zero subapges reclaim Date: Thu, 28 Oct 2021 19:56:51 +0800 Message-Id: <1635422215-99394-3-git-send-email-ningzhang@linux.alibaba.com> X-Mailer: git-send-email 1.8.3.1 In-Reply-To: 
<1635422215-99394-1-git-send-email-ningzhang@linux.alibaba.com> References: <1635422215-99394-1-git-send-email-ningzhang@linux.alibaba.com> X-Rspamd-Server: rspam02 X-Rspamd-Queue-Id: B76AA10003E7 X-Stat-Signature: grf4eeygog5cpueuxd3chu33cai79ojz Authentication-Results: imf12.hostedemail.com; dkim=none; dmarc=pass (policy=none) header.from=alibaba.com; spf=pass (imf12.hostedemail.com: domain of ningzhang@linux.alibaba.com designates 115.124.30.130 as permitted sender) smtp.mailfrom=ningzhang@linux.alibaba.com X-HE-Tag: 1635422219-431856 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Add a global interface to configure zero subpages reclaim global: /sys/kernel/mm/transparent_hugepage/reclaim It has three modes: memcg, means every memory cgroup will use their own configure. enable, means every mem cgroup will enable reclaim. disable, means every mem cgroup will disable reclaim. The default mode is memcg. Signed-off-by: Ning Zhang --- include/linux/huge_mm.h | 1 + include/linux/memcontrol.h | 8 ++++++++ mm/huge_memory.c | 44 ++++++++++++++++++++++++++++++++++++++++++++ mm/memcontrol.c | 2 +- mm/vmscan.c | 2 +- 5 files changed, 55 insertions(+), 2 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index e1b3bf9..04607b1 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -186,6 +186,7 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr, bool is_transparent_hugepage(struct page *page); #ifdef CONFIG_MEMCG +extern int global_thp_reclaim; int zsr_get_hpage(struct hpage_reclaim *hr_queue, struct page **reclaim_page); unsigned long zsr_reclaim_hpage(struct lruvec *lruvec, struct page *page); static inline struct list_head *hpage_reclaim_list(struct page *page) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 502a6ab..f99f13f 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1123,6 +1123,14 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, #ifdef CONFIG_TRANSPARENT_HUGEPAGE void del_hpage_from_queue(struct page *page); + +static inline int get_thp_reclaim_mode(struct mem_cgroup *memcg) +{ + int reclaim = READ_ONCE(global_thp_reclaim); + + return (reclaim != THP_RECLAIM_MEMCG) ? 
reclaim : + READ_ONCE(memcg->thp_reclaim); +} #endif #else /* CONFIG_MEMCG */ diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 21e3c01..84fd738 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -60,6 +60,10 @@ static struct shrinker deferred_split_shrinker; +#ifdef CONFIG_MEMCG +int global_thp_reclaim = THP_RECLAIM_MEMCG; +#endif + static atomic_t huge_zero_refcount; struct page *huge_zero_page __read_mostly; unsigned long huge_zero_pfn __read_mostly = ~0UL; @@ -330,6 +334,43 @@ static ssize_t hpage_pmd_size_show(struct kobject *kobj, static struct kobj_attribute hpage_pmd_size_attr = __ATTR_RO(hpage_pmd_size); +#ifdef CONFIG_MEMCG +static ssize_t reclaim_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + int thp_reclaim = READ_ONCE(global_thp_reclaim); + + if (thp_reclaim == THP_RECLAIM_MEMCG) + return sprintf(buf, "[memcg] enable disable\n"); + else if (thp_reclaim == THP_RECLAIM_ENABLE) + return sprintf(buf, "memcg [enable] disable\n"); + else + return sprintf(buf, "memcg enable [disable]\n"); +} + +static ssize_t reclaim_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + if (!memcmp("memcg", buf, + min(sizeof("memcg")-1, count))) + WRITE_ONCE(global_thp_reclaim, THP_RECLAIM_MEMCG); + else if (!memcmp("enable", buf, + min(sizeof("enable")-1, count))) + WRITE_ONCE(global_thp_reclaim, THP_RECLAIM_ENABLE); + else if (!memcmp("disable", buf, + min(sizeof("disable")-1, count))) + WRITE_ONCE(global_thp_reclaim, THP_RECLAIM_DISABLE); + else + return -EINVAL; + + return count; +} + +static struct kobj_attribute reclaim_attr = + __ATTR(reclaim, 0644, reclaim_show, reclaim_store); +#endif + static struct attribute *hugepage_attr[] = { &enabled_attr.attr, &defrag_attr.attr, @@ -338,6 +379,9 @@ static ssize_t hpage_pmd_size_show(struct kobject *kobj, #ifdef CONFIG_SHMEM &shmem_enabled_attr.attr, #endif +#ifdef CONFIG_MEMCG + &reclaim_attr.attr, +#endif NULL, }; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 5df1cdd..ae96781 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2746,7 +2746,7 @@ static void add_hpage_to_queue(struct page *page, struct mem_cgroup *memcg) struct hpage_reclaim *hr_queue; unsigned long flags; - if (READ_ONCE(memcg->thp_reclaim) == THP_RECLAIM_DISABLE) + if (get_thp_reclaim_mode(memcg) == THP_RECLAIM_DISABLE) return; page = compound_head(page); diff --git a/mm/vmscan.c b/mm/vmscan.c index 9be136f..f4ff14d 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2801,7 +2801,7 @@ static unsigned long reclaim_hpage_zero_subpages(struct lruvec *lruvec, goto out; hr_queue = &memcg->nodeinfo[nid]->hpage_reclaim_queue; - if (!READ_ONCE(memcg->thp_reclaim)) + if (get_thp_reclaim_mode(memcg) == THP_RECLAIM_DISABLE) goto out; /* The last scan loop will scan all the huge pages.*/ From patchwork Thu Oct 28 11:56:52 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ning Zhang X-Patchwork-Id: 12589931 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id B2360C433F5 for ; Thu, 28 Oct 2021 11:57:03 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 4AA4B60F21 for ; Thu, 28 Oct 2021 11:57:03 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 4AA4B60F21 Authentication-Results: 
mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.alibaba.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id 212F16B0071; Thu, 28 Oct 2021 07:57:02 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 1C085940008; Thu, 28 Oct 2021 07:57:02 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 061FE6B0073; Thu, 28 Oct 2021 07:57:02 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0134.hostedemail.com [216.40.44.134]) by kanga.kvack.org (Postfix) with ESMTP id D5F8F6B0071 for ; Thu, 28 Oct 2021 07:57:01 -0400 (EDT) Received: from smtpin09.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with ESMTP id 7936A31E5E for ; Thu, 28 Oct 2021 11:57:01 +0000 (UTC) X-FDA: 78745695042.09.F944EFD Received: from out30-42.freemail.mail.aliyun.com (out30-42.freemail.mail.aliyun.com [115.124.30.42]) by imf13.hostedemail.com (Postfix) with ESMTP id 53A511049B56 for ; Thu, 28 Oct 2021 11:56:54 +0000 (UTC) X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R951e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e04426;MF=ningzhang@linux.alibaba.com;NM=1;PH=DS;RN=6;SR=0;TI=SMTPD_---0Uu.chCS_1635422216; Received: from localhost(mailfrom:ningzhang@linux.alibaba.com fp:SMTPD_---0Uu.chCS_1635422216) by smtp.aliyun-inc.com(127.0.0.1); Thu, 28 Oct 2021 19:56:57 +0800 From: Ning Zhang To: linux-mm@kvack.org Cc: Andrew Morton , Johannes Weiner , Michal Hocko , Vladimir Davydov , Yu Zhao Subject: [RFC 3/6] mm, thp: introduce zero subpages reclaim threshold Date: Thu, 28 Oct 2021 19:56:52 +0800 Message-Id: <1635422215-99394-4-git-send-email-ningzhang@linux.alibaba.com> X-Mailer: git-send-email 1.8.3.1 In-Reply-To: <1635422215-99394-1-git-send-email-ningzhang@linux.alibaba.com> References: <1635422215-99394-1-git-send-email-ningzhang@linux.alibaba.com> X-Rspamd-Server: rspam03 X-Rspamd-Queue-Id: 53A511049B56 X-Stat-Signature: 5p4urcje94k39eidwngwa88mx8uuzrok Authentication-Results: imf13.hostedemail.com; dkim=none; dmarc=pass (policy=none) header.from=alibaba.com; spf=pass (imf13.hostedemail.com: domain of ningzhang@linux.alibaba.com designates 115.124.30.42 as permitted sender) smtp.mailfrom=ningzhang@linux.alibaba.com X-HE-Tag: 1635422214-879102 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: In this patch, we add memory.thp_reclaim_ctrl for each memory cgroup to control thp reclaim. The first controller "threshold" is to set the reclaim threshold. The default value is 16, which means if a huge page contains over 16 zero subpages (estimated), the huge page can be split and the zero subpages can be reclaimed when the zero subpages reclaim is enable. 
You can change this value by: echo "threshold $v" > /sys/fs/cgroup/memory/{memcg}/thp_reclaim_ctrl Signed-off-by: Ning Zhang --- include/linux/huge_mm.h | 3 ++- include/linux/memcontrol.h | 3 +++ mm/huge_memory.c | 9 ++++--- mm/memcontrol.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++ mm/vmscan.c | 4 ++- 5 files changed, 75 insertions(+), 6 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 04607b1..304e3df 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -187,7 +187,8 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr, #ifdef CONFIG_MEMCG extern int global_thp_reclaim; -int zsr_get_hpage(struct hpage_reclaim *hr_queue, struct page **reclaim_page); +int zsr_get_hpage(struct hpage_reclaim *hr_queue, struct page **reclaim_page, + int threshold); unsigned long zsr_reclaim_hpage(struct lruvec *lruvec, struct page *page); static inline struct list_head *hpage_reclaim_list(struct page *page) { diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index f99f13f..4815c56 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -237,6 +237,8 @@ enum thp_reclaim_state { THP_RECLAIM_ENABLE, THP_RECLAIM_MEMCG, /* For global configure*/ }; + +#define THP_RECLAIM_THRESHOLD_DEFAULT 16 #endif /* * The memory controller data structure. The memory controller controls both @@ -356,6 +358,7 @@ struct mem_cgroup { #ifdef CONFIG_TRANSPARENT_HUGEPAGE struct deferred_split deferred_split_queue; int thp_reclaim; + int thp_reclaim_threshold; #endif struct mem_cgroup_per_node *nodeinfo[]; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 84fd738..40a9879 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3270,7 +3270,7 @@ static inline bool is_zero_page(struct page *page) * We'll split the huge page iff it contains at least 1/32 zeros, * estimate it by checking some discrete unsigned long values. */ -static bool hpage_estimate_zero(struct page *page) +static bool hpage_estimate_zero(struct page *page, int threshold) { unsigned int i, maybe_zero_pages = 0, offset = 0; void *addr; @@ -3281,7 +3281,7 @@ static bool hpage_estimate_zero(struct page *page) if (unlikely((offset + 1) * BYTES_PER_LONG > PAGE_SIZE)) offset = 0; if (*(const unsigned long *)(addr + offset) == 0UL) { - if (++maybe_zero_pages == HPAGE_PMD_NR >> 5) { + if (++maybe_zero_pages == threshold) { kunmap(page); return true; } @@ -3456,7 +3456,8 @@ static unsigned long reclaim_zero_subpages(struct list_head *list, * be stored in reclaim_page; otherwise, just delete the page from the * queue. 
*/ -int zsr_get_hpage(struct hpage_reclaim *hr_queue, struct page **reclaim_page) +int zsr_get_hpage(struct hpage_reclaim *hr_queue, struct page **reclaim_page, + int threshold) { struct page *page = NULL; unsigned long flags; @@ -3482,7 +3483,7 @@ int zsr_get_hpage(struct hpage_reclaim *hr_queue, struct page **reclaim_page) spin_unlock_irqrestore(&hr_queue->reclaim_queue_lock, flags); - if (hpage_can_reclaim(page) && hpage_estimate_zero(page) && + if (hpage_can_reclaim(page) && hpage_estimate_zero(page, threshold) && !isolate_lru_page(page)) { __mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON, HPAGE_PMD_NR); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index ae96781..7ba3c69 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4498,6 +4498,61 @@ static int mem_cgroup_thp_reclaim_write(struct cgroup_subsys_state *css, return 0; } + +static inline char *strsep_s(char **s, const char *ct) +{ + char *p; + + while ((p = strsep(s, ct))) { + if (*p) + return p; + } + + return NULL; +} + +static int memcg_thp_reclaim_ctrl_show(struct seq_file *m, void *v) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); + int thp_reclaim_threshold = READ_ONCE(memcg->thp_reclaim_threshold); + + seq_printf(m, "threshold\t%d\n", thp_reclaim_threshold); + + return 0; +} + +static ssize_t memcg_thp_reclaim_ctrl_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + char *key, *value; + int ret; + + key = strsep_s(&buf, " \t\n"); + if (!key) + return -EINVAL; + + if (!strcmp(key, "threshold")) { + int threshold; + + value = strsep_s(&buf, " \t\n"); + if (!value) + return -EINVAL; + + ret = kstrtouint(value, 0, &threshold); + if (ret) + return ret; + + if (threshold > HPAGE_PMD_NR || threshold < 1) + return -EINVAL; + + xchg(&memcg->thp_reclaim_threshold, threshold); + } else + return -EINVAL; + + return nbytes; +} #endif #ifdef CONFIG_CGROUP_WRITEBACK @@ -5068,6 +5123,11 @@ static ssize_t memcg_write_event_control(struct kernfs_open_file *of, .read_u64 = mem_cgroup_thp_reclaim_read, .write_u64 = mem_cgroup_thp_reclaim_write, }, + { + .name = "thp_reclaim_ctrl", + .seq_show = memcg_thp_reclaim_ctrl_show, + .write = memcg_thp_reclaim_ctrl_write, + }, #endif { }, /* terminate */ }; @@ -5265,6 +5325,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void) memcg->deferred_split_queue.split_queue_len = 0; memcg->thp_reclaim = THP_RECLAIM_DISABLE; + memcg->thp_reclaim_threshold = THP_RECLAIM_THRESHOLD_DEFAULT; #endif idr_replace(&mem_cgroup_idr, memcg, memcg->id.id); return memcg; @@ -5300,6 +5361,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void) page_counter_init(&memcg->tcpmem, &parent->tcpmem); #ifdef CONFIG_TRANSPARENT_HUGEPAGE memcg->thp_reclaim = parent->thp_reclaim; + memcg->thp_reclaim_threshold = parent->thp_reclaim_threshold; #endif } else { page_counter_init(&memcg->memory, NULL); diff --git a/mm/vmscan.c b/mm/vmscan.c index f4ff14d..fcc80a6 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2794,6 +2794,7 @@ static unsigned long reclaim_hpage_zero_subpages(struct lruvec *lruvec, struct mem_cgroup *memcg; struct hpage_reclaim *hr_queue; int nid = lruvec->pgdat->node_id; + int threshold; unsigned long nr_reclaimed = 0, nr_scanned = 0, nr_to_scan; memcg = lruvec_memcg(lruvec); @@ -2806,11 +2807,12 @@ static unsigned long reclaim_hpage_zero_subpages(struct lruvec *lruvec, /* The last scan loop will scan all the huge pages.*/ nr_to_scan = priority == 0 ? 
0 : MAX_SCAN_HPAGE; + threshold = READ_ONCE(memcg->thp_reclaim_threshold); do { struct page *page = NULL; - if (zsr_get_hpage(hr_queue, &page)) + if (zsr_get_hpage(hr_queue, &page, threshold)) break; if (!page) From patchwork Thu Oct 28 11:56:53 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ning Zhang X-Patchwork-Id: 12589933 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 66FC2C433FE for ; Thu, 28 Oct 2021 11:57:05 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 112A260E8C for ; Thu, 28 Oct 2021 11:57:05 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 112A260E8C Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.alibaba.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id 76837940009; Thu, 28 Oct 2021 07:57:03 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 6C9DB940008; Thu, 28 Oct 2021 07:57:03 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 51C87940009; Thu, 28 Oct 2021 07:57:03 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0200.hostedemail.com [216.40.44.200]) by kanga.kvack.org (Postfix) with ESMTP id 2BFF6940008 for ; Thu, 28 Oct 2021 07:57:03 -0400 (EDT) Received: from smtpin25.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay04.hostedemail.com (Postfix) with ESMTP id A63F62FE09 for ; Thu, 28 Oct 2021 11:57:02 +0000 (UTC) X-FDA: 78745695084.25.680E9EE Received: from out30-56.freemail.mail.aliyun.com (out30-56.freemail.mail.aliyun.com [115.124.30.56]) by imf03.hostedemail.com (Postfix) with ESMTP id 2C42330039A7 for ; Thu, 28 Oct 2021 11:56:56 +0000 (UTC) X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R161e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e04400;MF=ningzhang@linux.alibaba.com;NM=1;PH=DS;RN=6;SR=0;TI=SMTPD_---0Uu0GVSZ_1635422217; Received: from localhost(mailfrom:ningzhang@linux.alibaba.com fp:SMTPD_---0Uu0GVSZ_1635422217) by smtp.aliyun-inc.com(127.0.0.1); Thu, 28 Oct 2021 19:56:57 +0800 From: Ning Zhang To: linux-mm@kvack.org Cc: Andrew Morton , Johannes Weiner , Michal Hocko , Vladimir Davydov , Yu Zhao Subject: [RFC 4/6] mm, thp: introduce a controller to trigger zero subpages reclaim Date: Thu, 28 Oct 2021 19:56:53 +0800 Message-Id: <1635422215-99394-5-git-send-email-ningzhang@linux.alibaba.com> X-Mailer: git-send-email 1.8.3.1 In-Reply-To: <1635422215-99394-1-git-send-email-ningzhang@linux.alibaba.com> References: <1635422215-99394-1-git-send-email-ningzhang@linux.alibaba.com> X-Rspamd-Server: rspam02 X-Rspamd-Queue-Id: 2C42330039A7 X-Stat-Signature: cdde144bd3qmeudyc617tariojp3ppo4 Authentication-Results: imf03.hostedemail.com; dkim=none; dmarc=pass (policy=none) header.from=alibaba.com; spf=pass (imf03.hostedemail.com: domain of ningzhang@linux.alibaba.com designates 115.124.30.56 as permitted sender) smtp.mailfrom=ningzhang@linux.alibaba.com X-HE-Tag: 1635422216-912258 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Add a new controller 
named "reclaim" for memory.thp_reclaim_ctrl to trigger thp reclaim immediately: echo "reclaim 1" > memory.thp_reclaim_ctrl echo "reclaim 2" > memory.thp_reclaim_ctrl "reclaim 1" means triggering reclaim only for current memcg. "reclaim 2" means triggering reclaim for current memcg and it's children memcgs. Signed-off-by: Ning Zhang --- include/linux/huge_mm.h | 1 + mm/huge_memory.c | 29 +++++++++++++++++++++++++++++ mm/memcontrol.c | 27 +++++++++++++++++++++++++++ 3 files changed, 57 insertions(+) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 304e3df..f792433 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -190,6 +190,7 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr, int zsr_get_hpage(struct hpage_reclaim *hr_queue, struct page **reclaim_page, int threshold); unsigned long zsr_reclaim_hpage(struct lruvec *lruvec, struct page *page); +void zsr_reclaim_memcg(struct mem_cgroup *memcg); static inline struct list_head *hpage_reclaim_list(struct page *page) { return &page[3].hpage_reclaim_list; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 40a9879..633fd0f 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3541,4 +3541,33 @@ unsigned long zsr_reclaim_hpage(struct lruvec *lruvec, struct page *page) return reclaimed; } + +void zsr_reclaim_memcg(struct mem_cgroup *memcg) +{ + struct lruvec *lruvec; + struct hpage_reclaim *hr_queue; + int threshold, nid; + + if (get_thp_reclaim_mode(memcg) == THP_RECLAIM_DISABLE) + return; + + threshold = READ_ONCE(memcg->thp_reclaim_threshold); + for_each_online_node(nid) { + lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid)); + hr_queue = &memcg->nodeinfo[nid]->hpage_reclaim_queue; + for ( ; ; ) { + struct page *page = NULL; + + if (zsr_get_hpage(hr_queue, &page, threshold)) + break; + + if (!page) + continue; + + zsr_reclaim_hpage(lruvec, page); + + cond_resched(); + } + } +} #endif diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 7ba3c69..a8e3ca1 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4521,6 +4521,8 @@ static int memcg_thp_reclaim_ctrl_show(struct seq_file *m, void *v) return 0; } +#define CTRL_RECLAIM_MEMCG 1 /* only relciam current memcg */ +#define CTRL_RECLAIM_ALL 2 /* reclaim current memcg and all the children memcgs */ static ssize_t memcg_thp_reclaim_ctrl_write(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) @@ -4548,6 +4550,31 @@ static ssize_t memcg_thp_reclaim_ctrl_write(struct kernfs_open_file *of, return -EINVAL; xchg(&memcg->thp_reclaim_threshold, threshold); + } else if (!strcmp(key, "reclaim")) { + struct mem_cgroup *iter; + int mode; + + value = strsep_s(&buf, " \t\n"); + if (!value) + return -EINVAL; + + ret = kstrtouint(value, 0, &mode); + if (ret) + return ret; + + switch (mode) { + case CTRL_RECLAIM_MEMCG: + zsr_reclaim_memcg(memcg); + break; + case CTRL_RECLAIM_ALL: + iter = mem_cgroup_iter(memcg, NULL, NULL); + do { + zsr_reclaim_memcg(iter); + } while ((iter = mem_cgroup_iter(memcg, iter, NULL))); + break; + default: + return -EINVAL; + } } else return -EINVAL; From patchwork Thu Oct 28 11:56:54 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ning Zhang X-Patchwork-Id: 12589939 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7BA12C433EF for ; Thu, 28 Oct 2021 
11:57:12 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 30CA7604DA for ; Thu, 28 Oct 2021 11:57:12 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 30CA7604DA Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.alibaba.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id F0CA894000C; Thu, 28 Oct 2021 07:57:04 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id DCE1F94000A; Thu, 28 Oct 2021 07:57:04 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id C6F1294000C; Thu, 28 Oct 2021 07:57:04 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0179.hostedemail.com [216.40.44.179]) by kanga.kvack.org (Postfix) with ESMTP id A117394000A for ; Thu, 28 Oct 2021 07:57:04 -0400 (EDT) Received: from smtpin03.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with ESMTP id 1B3C22D38D for ; Thu, 28 Oct 2021 11:57:04 +0000 (UTC) X-FDA: 78745695168.03.0146F1F Received: from out30-45.freemail.mail.aliyun.com (out30-45.freemail.mail.aliyun.com [115.124.30.45]) by imf23.hostedemail.com (Postfix) with ESMTP id 98A56900038F for ; Thu, 28 Oct 2021 11:56:54 +0000 (UTC) X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R151e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e04423;MF=ningzhang@linux.alibaba.com;NM=1;PH=DS;RN=6;SR=0;TI=SMTPD_---0Uu0GVSh_1635422217; Received: from localhost(mailfrom:ningzhang@linux.alibaba.com fp:SMTPD_---0Uu0GVSh_1635422217) by smtp.aliyun-inc.com(127.0.0.1); Thu, 28 Oct 2021 19:56:58 +0800 From: Ning Zhang To: linux-mm@kvack.org Cc: Andrew Morton , Johannes Weiner , Michal Hocko , Vladimir Davydov , Yu Zhao Subject: [RFC 5/6] mm, thp: add some statistics for zero subpages reclaim Date: Thu, 28 Oct 2021 19:56:54 +0800 Message-Id: <1635422215-99394-6-git-send-email-ningzhang@linux.alibaba.com> X-Mailer: git-send-email 1.8.3.1 In-Reply-To: <1635422215-99394-1-git-send-email-ningzhang@linux.alibaba.com> References: <1635422215-99394-1-git-send-email-ningzhang@linux.alibaba.com> X-Stat-Signature: 6me8go4tujkkti1iumdu17zktjm46k3x X-Rspamd-Server: rspam01 X-Rspamd-Queue-Id: 98A56900038F Authentication-Results: imf23.hostedemail.com; dkim=none; dmarc=pass (policy=none) header.from=alibaba.com; spf=pass (imf23.hostedemail.com: domain of ningzhang@linux.alibaba.com designates 115.124.30.45 as permitted sender) smtp.mailfrom=ningzhang@linux.alibaba.com X-HE-Tag: 1635422214-250793 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: queue_length show the numbers of huge pages in the queue. split_hpage shows the numbers of huge pages split by thp reclaim. split_failed shows the numbers of huge pages split failed reclaim_subpage shows the numbers of zero subpages reclaimed by thp reclaim. 
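As a rough illustration only (not part of this patch set): a minimal user-space sketch that reads and parses these per-node counters could look like the following. It assumes a cgroup v1 memory controller mounted at /sys/fs/cgroup/memory and an existing cgroup named "test"; each row of memory.thp_reclaim_stat is a counter name followed by one value per node.

/*
 * Illustrative sketch only: parse the per-node rows of
 * memory.thp_reclaim_stat.  The path and cgroup name are assumptions.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/sys/fs/cgroup/memory/test/memory.thp_reclaim_stat", "r");
	char line[1024];

	if (!f) {
		perror("memory.thp_reclaim_stat");
		return 1;
	}

	while (fgets(line, sizeof(line), f)) {
		/* First token is the counter name, the rest are per-node values. */
		char *name = strtok(line, " \t\n");
		char *val;
		int node = 0;

		if (!name)
			continue;
		for (val = strtok(NULL, " \t\n"); val; val = strtok(NULL, " \t\n"))
			printf("node %d %s = %s\n", node++, name, val);
	}
	fclose(f);
	return 0;
}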
Signed-off-by: Ning Zhang --- include/linux/huge_mm.h | 3 ++- include/linux/mmzone.h | 3 +++ mm/huge_memory.c | 8 ++++++-- mm/memcontrol.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++ mm/vmscan.c | 2 +- 5 files changed, 59 insertions(+), 4 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index f792433..5d4a038 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -189,7 +189,8 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr, extern int global_thp_reclaim; int zsr_get_hpage(struct hpage_reclaim *hr_queue, struct page **reclaim_page, int threshold); -unsigned long zsr_reclaim_hpage(struct lruvec *lruvec, struct page *page); +unsigned long zsr_reclaim_hpage(struct hpage_reclaim *hr_queue, + struct lruvec *lruvec, struct page *page); void zsr_reclaim_memcg(struct mem_cgroup *memcg); static inline struct list_head *hpage_reclaim_list(struct page *page) { diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 222cd4f..6ce6890 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -792,6 +792,9 @@ struct hpage_reclaim { spinlock_t reclaim_queue_lock; struct list_head reclaim_queue; unsigned long reclaim_queue_len; + atomic_long_t split_hpage; + atomic_long_t split_failed; + atomic_long_t reclaim_subpage; }; #endif diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 633fd0f..5e737d0 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3506,7 +3506,8 @@ int zsr_get_hpage(struct hpage_reclaim *hr_queue, struct page **reclaim_page, } -unsigned long zsr_reclaim_hpage(struct lruvec *lruvec, struct page *page) +unsigned long zsr_reclaim_hpage(struct hpage_reclaim *hr_queue, + struct lruvec *lruvec, struct page *page) { struct pglist_data *pgdat = page_pgdat(page); unsigned long reclaimed; @@ -3523,12 +3524,15 @@ unsigned long zsr_reclaim_hpage(struct lruvec *lruvec, struct page *page) putback_lru_page(page); mod_node_page_state(pgdat, NR_ISOLATED_ANON, -HPAGE_PMD_NR); + atomic_long_inc(&hr_queue->split_failed); return 0; } unlock_page(page); list_add_tail(&page->lru, &split_list); reclaimed = reclaim_zero_subpages(&split_list, &keep_list); + atomic_long_inc(&hr_queue->split_hpage); + atomic_long_add(reclaimed, &hr_queue->reclaim_subpage); spin_lock_irqsave(&lruvec->lru_lock, flags); move_pages_to_lru(lruvec, &keep_list); @@ -3564,7 +3568,7 @@ void zsr_reclaim_memcg(struct mem_cgroup *memcg) if (!page) continue; - zsr_reclaim_hpage(lruvec, page); + zsr_reclaim_hpage(hr_queue, lruvec, page); cond_resched(); } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index a8e3ca1..f8016ba 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4580,6 +4580,49 @@ static ssize_t memcg_thp_reclaim_ctrl_write(struct kernfs_open_file *of, return nbytes; } + +static int memcg_thp_reclaim_stat_show(struct seq_file *m, void *v) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); + struct mem_cgroup_per_node *mz; + int nid; + unsigned long len; + + seq_puts(m, "queue_length\t"); + for_each_node(nid) { + mz = memcg->nodeinfo[nid]; + len = READ_ONCE(mz->hpage_reclaim_queue.reclaim_queue_len); + seq_printf(m, "%-24lu", len); + } + + seq_puts(m, "\n"); + seq_puts(m, "split_hpage\t"); + for_each_node(nid) { + mz = memcg->nodeinfo[nid]; + len = atomic_long_read(&mz->hpage_reclaim_queue.split_hpage); + seq_printf(m, "%-24lu", len); + } + + seq_puts(m, "\n"); + seq_puts(m, "split_failed\t"); + for_each_node(nid) { + mz = memcg->nodeinfo[nid]; + len = atomic_long_read(&mz->hpage_reclaim_queue.split_failed); + 
seq_printf(m, "%-24lu", len); + } + + seq_puts(m, "\n"); + seq_puts(m, "reclaim_subpage\t"); + for_each_node(nid) { + mz = memcg->nodeinfo[nid]; + len = atomic_long_read(&mz->hpage_reclaim_queue.reclaim_subpage); + seq_printf(m, "%-24lu", len); + } + + seq_puts(m, "\n"); + + return 0; +} #endif #ifdef CONFIG_CGROUP_WRITEBACK @@ -5155,6 +5198,10 @@ static ssize_t memcg_write_event_control(struct kernfs_open_file *of, .seq_show = memcg_thp_reclaim_ctrl_show, .write = memcg_thp_reclaim_ctrl_write, }, + { + .name = "thp_reclaim_stat", + .seq_show = memcg_thp_reclaim_stat_show, + }, #endif { }, /* terminate */ }; diff --git a/mm/vmscan.c b/mm/vmscan.c index fcc80a6..cb5f53d 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2818,7 +2818,7 @@ static unsigned long reclaim_hpage_zero_subpages(struct lruvec *lruvec, if (!page) continue; - nr_reclaimed += zsr_reclaim_hpage(lruvec, page); + nr_reclaimed += zsr_reclaim_hpage(hr_queue, lruvec, page); cond_resched(); From patchwork Thu Oct 28 11:56:55 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ning Zhang X-Patchwork-Id: 12589937 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 21CB0C433F5 for ; Thu, 28 Oct 2021 11:57:10 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id C33B660E8C for ; Thu, 28 Oct 2021 11:57:09 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org C33B660E8C Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.alibaba.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id 8CD31940008; Thu, 28 Oct 2021 07:57:04 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 745CA94000C; Thu, 28 Oct 2021 07:57:04 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 48132940008; Thu, 28 Oct 2021 07:57:04 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0033.hostedemail.com [216.40.44.33]) by kanga.kvack.org (Postfix) with ESMTP id 196BF94000A for ; Thu, 28 Oct 2021 07:57:04 -0400 (EDT) Received: from smtpin25.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with ESMTP id A3C9582499A8 for ; Thu, 28 Oct 2021 11:57:03 +0000 (UTC) X-FDA: 78745695126.25.80EC15F Received: from out30-132.freemail.mail.aliyun.com (out30-132.freemail.mail.aliyun.com [115.124.30.132]) by imf25.hostedemail.com (Postfix) with ESMTP id F34B7B0001A3 for ; Thu, 28 Oct 2021 11:56:55 +0000 (UTC) X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R201e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e04394;MF=ningzhang@linux.alibaba.com;NM=1;PH=DS;RN=6;SR=0;TI=SMTPD_---0Uu.Soim_1635422218; Received: from localhost(mailfrom:ningzhang@linux.alibaba.com fp:SMTPD_---0Uu.Soim_1635422218) by smtp.aliyun-inc.com(127.0.0.1); Thu, 28 Oct 2021 19:56:58 +0800 From: Ning Zhang To: linux-mm@kvack.org Cc: Andrew Morton , Johannes Weiner , Michal Hocko , Vladimir Davydov , Yu Zhao Subject: [RFC 6/6] mm, thp: add document for zero subpages reclaim Date: Thu, 28 Oct 2021 19:56:55 +0800 Message-Id: <1635422215-99394-7-git-send-email-ningzhang@linux.alibaba.com> X-Mailer: git-send-email 1.8.3.1 
In-Reply-To: <1635422215-99394-1-git-send-email-ningzhang@linux.alibaba.com> References: <1635422215-99394-1-git-send-email-ningzhang@linux.alibaba.com> X-Rspamd-Queue-Id: F34B7B0001A3 Authentication-Results: imf25.hostedemail.com; dkim=none; spf=pass (imf25.hostedemail.com: domain of ningzhang@linux.alibaba.com designates 115.124.30.132 as permitted sender) smtp.mailfrom=ningzhang@linux.alibaba.com; dmarc=pass (policy=none) header.from=alibaba.com X-Stat-Signature: 7zxo9x5iz7jkd7nxq7kkj3pf14jfpgee X-Rspamd-Server: rspam06 X-HE-Tag: 1635422215-435215 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Add user guide for thp zero subpages reclaim. Signed-off-by: Ning Zhang --- Documentation/admin-guide/mm/transhuge.rst | 75 ++++++++++++++++++++++++++++++ 1 file changed, 75 insertions(+) diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index c9c37f1..85cd3b7 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -421,3 +421,78 @@ support enabled just fine as always. No difference can be noted in hugetlbfs other than there will be less overall fragmentation. All usual features belonging to hugetlbfs are preserved and unaffected. libhugetlbfs will also work fine as usual. + +THP zero subpages reclaim +========================= +THP may lead to memory bloat which may cause OOM. The reason is a huge +page may contain some zero subpages which users didn't really access them. +To avoid this, a mechanism to reclaim these zero subpages is introduced:: + + echo 1 > /sys/fs/cgroup/memory/{memcg}/memory.thp_reclaim + echo 0 > /sys/fs/cgroup/memory/{memcg}/memory.thp_reclaim + +Echo 1 to enable and echo 0 to disable. +The default value is inherited from its parent. The default mode of root +memcg is disable. + +We also add a global interface, if you don't want to configure it by +configuring every memory cgroup, you can use this one:: + + /sys/kernel/mm/transparent_hugepage/reclaim + +memcg + The default mode. It means every mem cgroup will use their own + configure. + +enable + means every mem cgroup will enable reclaim. + +disable + means every mem cgroup will disable reclaim. + +If zero subpages reclaim is enabled, the new huge page will be add to a +reclaim queue in mem_cgroup, and the queue would be scanned when memory +reclaiming. The queue stat can be checked like this:: + + cat /sys/fs/cgroup/memory/{memcg}/memory.thp_reclaim_stat + +queue_length + means the queue length of each node. + +split_hpage + means the numbers of huge pages split by thp reclaim of each node. + +split_failed + means the numbers of huge pages split failed by thp reclaim of + each node. + +reclaim_subpage + means the numbers of zero subpages reclaimed by thp reclaim of + each node. + +We also add a controller interface to set configs for thp reclaim:: + + /sys/fs/cgroup/memory/{memcg}/memory.thp_reclaim_ctrl + +threshold + means the huge page which contains at least threshold zero pages would + be split (estimate it by checking some discrete unsigned long values). + The default value of threshold is 16, and will inherit from it's parent. + The range of this value is (0, HPAGE_PMD_NR], which means the value must + be less than or equal to HPAGE_PMD_NR (512 in x86), and be greater than 0. 
+ We can set the reclaim threshold to 8 like this:: + + echo "threshold 8" > memory.thp_reclaim_ctrl + +reclaim + triggers the action immediately for the huge pages in the reclaim queue. + The action depends on the thp reclaim config (reclaim, swap or disable; + disable means just remove the huge page from the queue). + This controller takes two values, 1 and 2. 1 means just reclaim the current + memcg, and 2 means reclaim the current memcg and all its child memcgs. + Like this:: + + echo "reclaim 1" > memory.thp_reclaim_ctrl + echo "reclaim 2" > memory.thp_reclaim_ctrl + +Only one of the configs mentioned above can be set at a time.
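To tie the interfaces above together, here is a minimal user-space sketch (illustration only, not part of the patch set) that enables zero subpages reclaim for one memcg, lowers the split threshold, and triggers an immediate reclaim pass. It assumes a cgroup v1 memory controller mounted at /sys/fs/cgroup/memory and an existing cgroup named "test".

/*
 * Illustrative sketch only: exercise the documented control files.
 * The mount point and the "test" cgroup name are assumptions.
 */
#include <stdio.h>

#define MEMCG "/sys/fs/cgroup/memory/test/"

/* Write a value string to one memcg control file; returns 0 on success. */
static int memcg_write(const char *file, const char *val)
{
	char path[256];
	FILE *f;
	int ret = 0;

	snprintf(path, sizeof(path), MEMCG "%s", file);
	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fputs(val, f) == EOF)
		ret = -1;
	if (fclose(f))
		ret = -1;
	return ret;
}

int main(void)
{
	/* Enable zero subpages reclaim for this memcg. */
	if (memcg_write("memory.thp_reclaim", "1"))
		perror("memory.thp_reclaim");

	/* Split huge pages with at least 8 (estimated) zero subpages. */
	if (memcg_write("memory.thp_reclaim_ctrl", "threshold 8"))
		perror("thp_reclaim_ctrl threshold");

	/* Trigger reclaim immediately, for this memcg only. */
	if (memcg_write("memory.thp_reclaim_ctrl", "reclaim 1"))
		perror("thp_reclaim_ctrl reclaim");

	return 0;
}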