From patchwork Mon May 20 03:52:48 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Minchan Kim X-Patchwork-Id: 10949851 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 26A7F14C0 for ; Mon, 20 May 2019 03:53:15 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 13CD0285B5 for ; Mon, 20 May 2019 03:53:15 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 066A0285C7; Mon, 20 May 2019 03:53:15 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.7 required=2.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,MAILING_LIST_MULTI,RCVD_IN_DNSWL_NONE autolearn=ham version=3.3.1 Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id D2DB6285B5 for ; Mon, 20 May 2019 03:53:12 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 8E1696B0006; Sun, 19 May 2019 23:53:11 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 892C26B0007; Sun, 19 May 2019 23:53:11 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 75AAA6B0008; Sun, 19 May 2019 23:53:11 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from mail-pf1-f199.google.com (mail-pf1-f199.google.com [209.85.210.199]) by kanga.kvack.org (Postfix) with ESMTP id 38D156B0006 for ; Sun, 19 May 2019 23:53:11 -0400 (EDT) Received: by mail-pf1-f199.google.com with SMTP id u7so9013551pfh.17 for ; Sun, 19 May 2019 20:53:11 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; 
c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:dkim-signature:sender:from:to:cc:subject:date :message-id:in-reply-to:references:mime-version :content-transfer-encoding; bh=ZGdumR7MSzBXsKYy6968BKcjiPMsnk7KDc5fqeBNwkM=; b=aV0dDdH/7ixP2clnVXk+TmlrA6EhIVLsLVQAl2z4ybyWBof582ZW/e9wtu+Q0J4z0b EiGzvTfzOp99xA1Pri8XFtQBDBF4JIkUFrTjGP/lnomYlmcGbMGx077nbKvCTzIbF/VX OH7RYbFDyDJAY5TXfRz3+9WeWmbKEO5InZEdkf7v1Eqi3MOyMIBjkhS0rlxWS5GYnat3 6I3kKfYf1GF4QXyIq38g2iftJ1cz8C6+xdX11dGvADmQV5eFUDAcGxpMWt5paEeqCGOP g9bx3b7LXX/JKh4y3ZuY8p8cVNo4CoZ6oz10y7V0E0WnCeVf2csGcgB303IqUlUd0QA8 xFfA== X-Gm-Message-State: APjAAAXDl/f42p02YlG0pTLPpFlNU1ATp9Z6cmrhp1wiy+ZnK4tp6L7/ QYA04o3FTLPS2ElPfcqX9PJa95r2iCuIkdiLSTovTJ4S8IK5Kxtftc7ds239qynZC4N6z3gykL3 sb+tTipTLAHik+NN3nekooYhc1+ksVgfsW5MadmTZoUJjGiC1LYxEeavhlmNfYoY= X-Received: by 2002:a62:ea04:: with SMTP id t4mr76318622pfh.47.1558324390773; Sun, 19 May 2019 20:53:10 -0700 (PDT) X-Received: by 2002:a62:ea04:: with SMTP id t4mr76318516pfh.47.1558324389206; Sun, 19 May 2019 20:53:09 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1558324389; cv=none; d=google.com; s=arc-20160816; b=UdpklDB5VTM1Vc6grDHXK9Lwz/O6X1xY/7/g6hqKg+2VdEfYgPC3HsTeZLAXCx1kZn QIz2xBVWaXW+CreXQL42+pdDGfWE51MZUSNa9p6Ge9u+66nGi6j7fsBUIlARH+FMThk/ 3hw6lql1NSyTFJKOHs7X+DWTyheCRl1D7AlrXFzzTkMGKog8AkAgBLRD6uNb9Q5Kl1OQ CCkppmedpEhGE3ix9c/AIJ94hNvRcDDeGy1kuIwhvnCDVxA7zzno8NZAQcoEzewnxSqF c+xNhiZIEltqcjv+Z1foPFsGY5DsGXf/h3DJBLjuDaR4/4T6bDnRzNqMebvLpj8YDAmv A3Uw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:sender:dkim-signature; bh=ZGdumR7MSzBXsKYy6968BKcjiPMsnk7KDc5fqeBNwkM=; b=CBcmCFO25BQZi/Sv0T18IHTZFitYQSjBOZSbRLdPmWcO/DLSAnZWEFK69qNoZ/5+gC Z3OxvT1GIxZ3JV2U4mZPlpqMRJk5B2o2JJDlFkJzSMrK5+l40kuk4cdEXT35iLrrxwWE W3IkXwKbM7CChw0/HSGEvut/nEd3hf8n7Y6ZqahjXNi09tgBg00J398/zcJ5eBREqouE 
QT6txHWKomO2x8EVZum0ImVzxJSocB/98+2OSBpBZhvStiLk9nt3ryksVxgzVATR8x5l 2ye87WexPuuigDRUp3TaSzmXEiWWw5axK/80OXcQA2VIFKfTdkW5A6glFwrPwou77wDW dUfg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=FjGICkkN; spf=pass (google.com: domain of minchan.kim@gmail.com designates 209.85.220.65 as permitted sender) smtp.mailfrom=minchan.kim@gmail.com; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: from mail-sor-f65.google.com (mail-sor-f65.google.com. [209.85.220.65]) by mx.google.com with SMTPS id x11sor16313660pgx.46.2019.05.19.20.53.09 for (Google Transport Security); Sun, 19 May 2019 20:53:09 -0700 (PDT) Received-SPF: pass (google.com: domain of minchan.kim@gmail.com designates 209.85.220.65 as permitted sender) client-ip=209.85.220.65; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=FjGICkkN; spf=pass (google.com: domain of minchan.kim@gmail.com designates 209.85.220.65 as permitted sender) smtp.mailfrom=minchan.kim@gmail.com; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=kernel.org DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=sender:from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=ZGdumR7MSzBXsKYy6968BKcjiPMsnk7KDc5fqeBNwkM=; b=FjGICkkNAHPAt7KjG1vLE/V+sSXuxy882WftAOALleqtB/dYOU3CMCjIDgDwZYKgFU 3sy/1I4Tp75glkroR9BwdxtajZxchrAJmNiQDzCIfBRONzD4Sk5A69HqaGIWkwYLTmDp 4oNo3B6H9DMHdm5kk9sBlnxkIVCEkg41Xd95eUqKZB3s+zeI5iNSKMsjDZdzNO0/StUR r7zwG5UvA/9HfwRp7liBHENYgIFBJYPNdfGsk11RsencgCo9ZNFguk4rW67gKnMyjna/ MPZwC1Mz5npPbiBs95cBHlk9Em9tqej29GDMZwX5u7S31Lee7ALKwxWK3cY3Tg5rlSzh V1AQ== X-Google-Smtp-Source: APXvYqyqDckbd7dbntrF36htEawLZuZ75EEeBqGqTy7menXc3X+zgsp9IMF5ITbm+AhBQk09A6XOsA== X-Received: by 2002:a63:7b1e:: with SMTP id w30mr67834911pgc.406.1558324388822; Sun, 19 May 2019 20:53:08 -0700 (PDT) Received: from bbox-2.seo.corp.google.com 
([2401:fa00:d:0:98f1:8b3d:1f37:3e8]) by smtp.gmail.com with ESMTPSA id x66sm3312779pfx.139.2019.05.19.20.53.04 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 19 May 2019 20:53:07 -0700 (PDT)
From: Minchan Kim
To: Andrew Morton
Cc: LKML, linux-mm, Michal Hocko, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, Minchan Kim
Subject: [RFC 1/7] mm: introduce MADV_COOL
Date: Mon, 20 May 2019 12:52:48 +0900
Message-Id: <20190520035254.57579-2-minchan@kernel.org>
X-Mailer: git-send-email 2.21.0.1020.gf2820cf01a-goog
In-Reply-To: <20190520035254.57579-1-minchan@kernel.org>
References: <20190520035254.57579-1-minchan@kernel.org>
MIME-Version: 1.0
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID:
X-Virus-Scanned: ClamAV using ClamSMTP

When a process expects no accesses to a certain memory range, it can
hint to the kernel that the pages may be reclaimed when memory pressure
happens, but that the data should be preserved for future use. This can
reduce workingset eviction and so improve performance.

This patch introduces the new MADV_COOL hint to the madvise(2) syscall.
MADV_COOL can be used by a process to mark a memory range as not
expected to be used in the near future. The hint helps the kernel decide
which pages to evict early during memory pressure.

Internally, it works by deactivating pages: they are moved from the
active list to the head of the inactive list. When memory pressure
happens, they will be reclaimed earlier than other active pages, unless
they are accessed again before then.
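The list movement described above can be pictured with a small stand-alone model. This is only an illustrative sketch, not kernel code: the toy_* names, the fixed-size arrays and the integer page ids are all hypothetical, and the model assumes that reclaim takes pages from the tail of the inactive list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_LRU_MAX 8

/* A list of page ids; index 0 is the head, the last entry is the tail. */
struct toy_list {
	int pages[TOY_LRU_MAX];
	size_t count;
};

/* Hypothetical two-list LRU: one active list, one inactive list. */
struct toy_lru {
	struct toy_list active;
	struct toy_list inactive;
};

/* Insert @page at the head of @l. */
static void toy_push_head(struct toy_list *l, int page)
{
	assert(l->count < TOY_LRU_MAX);
	memmove(&l->pages[1], &l->pages[0], l->count * sizeof(l->pages[0]));
	l->pages[0] = page;
	l->count++;
}

/* Remove @page from @l; returns 1 if it was present, 0 otherwise. */
static int toy_remove(struct toy_list *l, int page)
{
	for (size_t i = 0; i < l->count; i++) {
		if (l->pages[i] != page)
			continue;
		memmove(&l->pages[i], &l->pages[i + 1],
			(l->count - i - 1) * sizeof(l->pages[0]));
		l->count--;
		return 1;
	}
	return 0;
}

/* Model of the hint: an active page moves to the inactive list's head. */
static void toy_deactivate(struct toy_lru *lru, int page)
{
	if (toy_remove(&lru->active, page))
		toy_push_head(&lru->inactive, page);
}

/* Model of reclaim: take the inactive tail; returns -1 if the list is empty. */
static int toy_reclaim_one(struct toy_lru *lru)
{
	if (lru->inactive.count == 0)
		return -1;
	return lru->inactive.pages[--lru->inactive.count];
}
```

In this model a deactivated page is reclaimed after pages that were already inactive, but before any page still on the active list, which is the ordering the description above aims for.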
* v1r2
  * use clear_page_young in deactivate_page - joelaf

* v1r1
  * Revise the description - surenb
  * Renaming from MADV_WARM to MADV_COOL - surenb

Signed-off-by: Minchan Kim
---
 include/linux/page-flags.h             |   1 +
 include/linux/page_idle.h              |  15 ++++
 include/linux/swap.h                   |   1 +
 include/uapi/asm-generic/mman-common.h |   1 +
 mm/madvise.c                           | 112 +++++++++++++++++++++++++
 mm/swap.c                              |  43 ++++++++++
 6 files changed, 173 insertions(+)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 9f8712a4b1a5..58b06654c8dd 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -424,6 +424,7 @@ static inline bool set_hwpoison_free_buddy_page(struct page *page)
 TESTPAGEFLAG(Young, young, PF_ANY)
 SETPAGEFLAG(Young, young, PF_ANY)
 TESTCLEARFLAG(Young, young, PF_ANY)
+CLEARPAGEFLAG(Young, young, PF_ANY)
 PAGEFLAG(Idle, idle, PF_ANY)
 #endif
 
diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
index 1e894d34bdce..f3f43b317150 100644
--- a/include/linux/page_idle.h
+++ b/include/linux/page_idle.h
@@ -19,6 +19,11 @@ static inline void set_page_young(struct page *page)
 	SetPageYoung(page);
 }
 
+static inline void clear_page_young(struct page *page)
+{
+	ClearPageYoung(page);
+}
+
 static inline bool test_and_clear_page_young(struct page *page)
 {
 	return TestClearPageYoung(page);
@@ -65,6 +70,16 @@ static inline void set_page_young(struct page *page)
 	set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
 }
 
+static void clear_page_young(struct page *page)
+{
+	struct page_ext *page_ext = lookup_page_ext(page);
+
+	if (unlikely(!page_ext))
+		return;
+
+	clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
+}
+
 static inline bool test_and_clear_page_young(struct page *page)
 {
 	struct page_ext *page_ext = lookup_page_ext(page);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4bfb5c4ac108..64795abea003 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -340,6 +340,7 @@ extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
+extern void deactivate_page(struct page *page);
 extern void mark_page_lazyfree(struct page *page);
 extern void swap_setup(void);
 
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index abd238d0f7a4..f7a4a5d4b642 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -42,6 +42,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_COOL	5		/* deactivatie these pages */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_FREE	8		/* free pages only if memory pressure */
diff --git a/mm/madvise.c b/mm/madvise.c
index 628022e674a7..c05817fb570d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -8,6 +8,7 @@
 
 #include
 #include
+#include
 #include
 #include
 #include
@@ -40,6 +41,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_REMOVE:
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
+	case MADV_COOL:
 	case MADV_FREE:
 		return 0;
 	default:
@@ -307,6 +309,113 @@ static long madvise_willneed(struct vm_area_struct *vma,
 	return 0;
 }
 
+static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,
+				unsigned long end, struct mm_walk *walk)
+{
+	pte_t *orig_pte, *pte, ptent;
+	spinlock_t *ptl;
+	struct page *page;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned long next;
+
+	next = pmd_addr_end(addr, end);
+	if (pmd_trans_huge(*pmd)) {
+		spinlock_t *ptl;
+
+		ptl = pmd_trans_huge_lock(pmd, vma);
+		if (!ptl)
+			return 0;
+
+		if (is_huge_zero_pmd(*pmd))
+			goto huge_unlock;
+
+		page = pmd_page(*pmd);
+		if (page_mapcount(page) > 1)
+			goto huge_unlock;
+
+		if (next - addr != HPAGE_PMD_SIZE) {
+			int err;
+
+			get_page(page);
+			spin_unlock(ptl);
+			lock_page(page);
+			err = split_huge_page(page);
+			unlock_page(page);
+			put_page(page);
+			if (!err)
+				goto regular_page;
+			return 0;
+		}
+
+		pmdp_test_and_clear_young(vma, addr, pmd);
+		deactivate_page(page);
+huge_unlock:
+		spin_unlock(ptl);
+		return 0;
+	}
+
+	if (pmd_trans_unstable(pmd))
+		return 0;
+
+regular_page:
+	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
+		ptent = *pte;
+
+		if (pte_none(ptent))
+			continue;
+
+		if (!pte_present(ptent))
+			continue;
+
+		page = vm_normal_page(vma, addr, ptent);
+		if (!page)
+			continue;
+
+		if (page_mapcount(page) > 1)
+			continue;
+
+		ptep_test_and_clear_young(vma, addr, pte);
+		deactivate_page(page);
+	}
+
+	pte_unmap_unlock(orig_pte, ptl);
+	cond_resched();
+
+	return 0;
+}
+
+static void madvise_cool_page_range(struct mmu_gather *tlb,
+			     struct vm_area_struct *vma,
+			     unsigned long addr, unsigned long end)
+{
+	struct mm_walk cool_walk = {
+		.pmd_entry = madvise_cool_pte_range,
+		.mm = vma->vm_mm,
+	};
+
+	tlb_start_vma(tlb, vma);
+	walk_page_range(addr, end, &cool_walk);
+	tlb_end_vma(tlb, vma);
+}
+
+static long madvise_cool(struct vm_area_struct *vma,
+			unsigned long start_addr, unsigned long end_addr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct mmu_gather tlb;
+
+	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
+		return -EINVAL;
+
+	lru_add_drain();
+	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
+	madvise_cool_page_range(&tlb, vma, start_addr, end_addr);
+	tlb_finish_mmu(&tlb, start_addr, end_addr);
+
+	return 0;
+}
+
 static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 				unsigned long end, struct mm_walk *walk)
 
@@ -695,6 +804,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		return madvise_remove(vma, prev, start, end);
 	case MADV_WILLNEED:
 		return madvise_willneed(vma, prev, start, end);
+	case MADV_COOL:
+		return madvise_cool(vma, start, end);
 	case MADV_FREE:
 	case MADV_DONTNEED:
 		return madvise_dontneed_free(vma, prev, start, end, behavior);
@@ -716,6 +827,7 @@ madvise_behavior_valid(int behavior)
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
 	case MADV_FREE:
+	case MADV_COOL:
 #ifdef CONFIG_KSM
 	case MADV_MERGEABLE:
 	case MADV_UNMERGEABLE:
diff --git a/mm/swap.c b/mm/swap.c
index 3a75722e68a9..0f94c3b5397d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -46,6 +46,7 @@ int page_cluster;
 static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);
 #ifdef CONFIG_SMP
 static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
@@ -537,6 +538,23 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 	update_page_reclaim_stat(lruvec, file, 0);
 }
 
+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
+			    void *arg)
+{
+	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+		int file = page_is_file_cache(page);
+		int lru = page_lru_base_type(page);
+
+		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
+		ClearPageActive(page);
+		ClearPageReferenced(page);
+		clear_page_young(page);
+		add_page_to_lru_list(page, lruvec, lru);
+
+		__count_vm_events(PGDEACTIVATE, hpage_nr_pages(page));
+		update_page_reclaim_stat(lruvec, file, 0);
+	}
+}
 
 static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
 			    void *arg)
@@ -589,6 +607,10 @@ void lru_add_drain_cpu(int cpu)
 	if (pagevec_count(pvec))
 		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
 
+	pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+	if (pagevec_count(pvec))
+		pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+
 	pvec = &per_cpu(lru_lazyfree_pvecs, cpu);
 	if (pagevec_count(pvec))
 		pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
@@ -622,6 +644,26 @@ void deactivate_file_page(struct page *page)
 	}
 }
 
+/*
+ * deactivate_page - deactivate a page
+ * @page: page to deactivate
+ *
+ * deactivate_page() moves @page to the inactive list if @page was on the active
+ * list and was not an unevictable page.  This is done to accelerate the reclaim
+ * of @page.
+ */
+void deactivate_page(struct page *page)
+{
+	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+		struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
+
+		get_page(page);
+		if (!pagevec_add(pvec, page) || PageCompound(page))
+			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+		put_cpu_var(lru_deactivate_pvecs);
+	}
+}
+
 /**
  * mark_page_lazyfree - make an anon page lazyfree
  * @page: page to deactivate
@@ -686,6 +728,7 @@ void lru_add_drain_all(void)
 		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
 		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
 		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
 		    pagevec_count(&per_cpu(lru_lazyfree_pvecs, cpu)) ||
 		    need_activate_page_drain(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);

From patchwork Mon May 20 03:52:49 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Minchan Kim
X-Patchwork-Id: 10949853
Return-Path:
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id C1C89912 for ; Mon, 20 May 2019 03:53:17 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id B1845285B7 for ; Mon, 20 May 2019 03:53:17 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id A2FDB285B5; Mon, 20 May 2019 03:53:17 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org
X-Spam-Level:
X-Spam-Status: No, score=-2.7 required=2.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,MAILING_LIST_MULTI,RCVD_IN_DNSWL_NONE autolearn=ham version=3.3.1
Received: from kanga.kvack.org (kanga.kvack.org
[205.233.56.17]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 62C06285B5 for ; Mon, 20 May 2019 03:53:16 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 2C6CE6B0007; Sun, 19 May 2019 23:53:15 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 278A06B0008; Sun, 19 May 2019 23:53:15 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 118DA6B000A; Sun, 19 May 2019 23:53:15 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from mail-pf1-f197.google.com (mail-pf1-f197.google.com [209.85.210.197]) by kanga.kvack.org (Postfix) with ESMTP id CFF616B0007 for ; Sun, 19 May 2019 23:53:14 -0400 (EDT) Received: by mail-pf1-f197.google.com with SMTP id u7so9013608pfh.17 for ; Sun, 19 May 2019 20:53:14 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:dkim-signature:sender:from:to:cc:subject:date :message-id:in-reply-to:references:mime-version :content-transfer-encoding; bh=toPMA/JKL4+JCLITlgbKXa8Kcq8nIXcYZEy8G8dwb38=; b=DUU3BXKKLBLL86ZMZW8VvugCx+r+PNB9IXbNMB+T8d9PBkMcBuYltakZnGRqr8TnIz CHr05fEPRqcR9S+ziwOe+fTJjg+Iuru/DikSCNjglICqQAsilWnEAxnPfP4PG+4qNecW qHdlfYrkWGIraXFyY5+JsEdH5yE9/OHz117xksNMV+dfoNxYKgGgCvScul62ynWei1xQ RxYxxOGwFUHL219auLv7JkawtEfj8WJjBqC5vd6jg4RiKrzdvPfTgc+3vU0knaOFESwp 5h7a02xEKyo33d/75qH88seORA0FpBzPbsTDTZRNs+ccD4swrFvxWQXovV+QOVl95L6P U7LQ== X-Gm-Message-State: APjAAAWKSHigJrYSjrQtl2YJ6V9Kgz44SZHBdrXlFWsPoI24SsLn08q4 WPd+5afhVWJ8zdaFybF7g24rKhF4FMEyZht/2rjl3vimIr4+L+AZXbRSxz1D8z5CF1917eDiOxz QgS/vHMuZZM1+tSVnPC8WjpHkIb9ZylLpY9A+dkkzSrGqLB6xOwVOCsBeO3uJcz4= X-Received: by 2002:a63:1866:: with SMTP id 38mr73187376pgy.123.1558324394512; Sun, 19 May 2019 20:53:14 -0700 (PDT) X-Received: by 2002:a63:1866:: with SMTP id 38mr73187302pgy.123.1558324393371; Sun, 19 May 2019 
20:53:13 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1558324393; cv=none; d=google.com; s=arc-20160816; b=jq/Aieg56v1vzKlYok9KILz5eHPe2jIGh7kf6pdampCap49sxTjhNdJKuYsVvksVzR Q8tkNHEte7U5xUe4OSW92/tK5b6tT7AI3Q4IJ/utQJPDM7itfaLkO4y/68Us6tSrC1Tz RuV/P/6IuoymBDmZsYvgoyNPvuvkVU+hjN77E7ffAOZxE+YiIgmkWVzJdy/UW6onWXnt E8jtwmup2ByRjjIB9ReBDdZXovGyhfS87k+WxEh1lQVI2vJhIS+z5XYT9D7p0wTyArXg AuZ7Bp5RfwBZpqI/6KN+zDcPQQzIOlFWyfM2bgDQ86qGQqCj/vQ/fKTgOGzjNUwmWl3f gymQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:sender:dkim-signature; bh=toPMA/JKL4+JCLITlgbKXa8Kcq8nIXcYZEy8G8dwb38=; b=M8vhVK4IDfzVcifm76C0wCPHvdPUSq4HdE9pv4Y+7wsXbfSW+vQ72OEID7KUm7L0n5 bPJtUqqorgO610w9uun+L4AP6sdG34ylkqKXO7PcKwTRLvbwsIW+u7WDj1EJL7CvO8RX dcMq751n0lXHpr1LSNwirmUGABVdxqmNem3iImmDJFl+VuvOWiKBbw1p9oT9Nh3ja9Qd XUAZKnmgmjxWg8cEJQj7+xM/X7bpHlyzHbV4++crgWKSywv0jMsBYGmyzNXXJsU0RkUg twycW5LjSu4Yb87I/45SQkfc69lYq32lWsTCBDpIzDvRVedKQ7kRwl2gsUFCcF1WKL5n RpMQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=m97MBGV8; spf=pass (google.com: domain of minchan.kim@gmail.com designates 209.85.220.65 as permitted sender) smtp.mailfrom=minchan.kim@gmail.com; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: from mail-sor-f65.google.com (mail-sor-f65.google.com. 
[209.85.220.65]) by mx.google.com with SMTPS id b11sor18141783plz.51.2019.05.19.20.53.13 for (Google Transport Security); Sun, 19 May 2019 20:53:13 -0700 (PDT) Received-SPF: pass (google.com: domain of minchan.kim@gmail.com designates 209.85.220.65 as permitted sender) client-ip=209.85.220.65; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=m97MBGV8; spf=pass (google.com: domain of minchan.kim@gmail.com designates 209.85.220.65 as permitted sender) smtp.mailfrom=minchan.kim@gmail.com; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=kernel.org DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=sender:from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=toPMA/JKL4+JCLITlgbKXa8Kcq8nIXcYZEy8G8dwb38=; b=m97MBGV8NBC+cNZm03BDXSfBTqmp4sA5sC300y9OchQG14lFyRfXZJxdfo2GW5E1Bb c3a2eND70z1kSueovPmNTmqKCpC99rpeTQndsDVUf33inyAsKgA6/RsH6E/gncjywccA vuM7cpJGZEfgjvvX222Uu1QRUDmeJ+PYc75lPHqnU+P/W987PxW3xyUbJChZSE6eMGyx cyN5a8kSJdLmiQhXGuthrx0P9H5qOS/gEWmOhYkfBZffzX36ft3VXIMnJACry2QLWwRS DR3UDJCoEm2n1/IbKW2+5zR3mM4UNJnM5ITqNNd3i5AbZ4bw07uLzmsXqbjkURtuljj3 ug5w== X-Google-Smtp-Source: APXvYqw1v+/XqFbH4qE/picLm8gJqscGg2vCb7a2xfoSbRTwDaEwwBRsuHR2FujTc9592q32a1eZZg== X-Received: by 2002:a17:902:d892:: with SMTP id b18mr29342232plz.216.1558324393052; Sun, 19 May 2019 20:53:13 -0700 (PDT) Received: from bbox-2.seo.corp.google.com ([2401:fa00:d:0:98f1:8b3d:1f37:3e8]) by smtp.gmail.com with ESMTPSA id x66sm3312779pfx.139.2019.05.19.20.53.09 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 19 May 2019 20:53:12 -0700 (PDT) From: Minchan Kim To: Andrew Morton Cc: LKML , linux-mm , Michal Hocko , Johannes Weiner , Tim Murray , Joel Fernandes , Suren Baghdasaryan , Daniel Colascione , Shakeel Butt , Sonny Rao , Brian Geffon , Minchan Kim Subject: [RFC 2/7] mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM Date: Mon, 20 May 2019 12:52:49 +0900 
Message-Id: <20190520035254.57579-3-minchan@kernel.org>
X-Mailer: git-send-email 2.21.0.1020.gf2820cf01a-goog
In-Reply-To: <20190520035254.57579-1-minchan@kernel.org>
References: <20190520035254.57579-1-minchan@kernel.org>
MIME-Version: 1.0
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID:
X-Virus-Scanned: ClamAV using ClamSMTP

The local variable 'references' in shrink_page_list() defaults to
PAGEREF_RECLAIM_CLEAN. That default is meant to prevent reclaiming dirty
pages when CMA tries to migrate pages. Strictly speaking, we don't need
it, because CMA already disallows writeout via .may_writepage = 0 in
reclaim_clean_pages_from_list().

Moreover, the default causes a problem for an upcoming patch: it
prevents anonymous pages from being swapped out even though
force_reclaim = true in shrink_page_list().

So this patch changes the default value of 'references' to
PAGEREF_RECLAIM and renames force_reclaim to skip_reference_check to
make its meaning clearer.

This is preparatory work for the next patch.
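To make the effect of the default concrete, here is a small stand-alone model of the decision described above. It is a simplification written for illustration, not the kernel code: the enum reuses the PAGEREF_* names from mm/vmscan.c, while page_outcome, model_shrink_page(), model_demo() and their parameters are hypothetical. The model assumes that a dirty page whose state is PAGEREF_RECLAIM_CLEAN is kept rather than written out, and that an anonymous page awaiting swap-out behaves like a dirty page.

```c
#include <assert.h>
#include <stdbool.h>

/* Names mirror mm/vmscan.c; the numeric values do not matter here. */
enum page_references {
	PAGEREF_RECLAIM,
	PAGEREF_RECLAIM_CLEAN,
	PAGEREF_KEEP,
	PAGEREF_ACTIVATE,
};

/* Hypothetical outcome labels, used by this model only. */
enum page_outcome {
	OUTCOME_FREED,       /* clean page, dropped directly */
	OUTCOME_WRITTEN_OUT, /* dirty page sent to writeback or swap */
	OUTCOME_KEPT,        /* left on the LRU */
	OUTCOME_ACTIVATED,   /* moved back to the active list */
};

static enum page_outcome model_shrink_page(enum page_references default_refs,
					    bool skip_reference_check,
					    enum page_references checked_refs,
					    bool dirty)
{
	enum page_references references = default_refs;

	/* The reference check overrides the default unless it is skipped. */
	if (!skip_reference_check)
		references = checked_refs;

	switch (references) {
	case PAGEREF_ACTIVATE:
		return OUTCOME_ACTIVATED;
	case PAGEREF_KEEP:
		return OUTCOME_KEPT;
	case PAGEREF_RECLAIM:
	case PAGEREF_RECLAIM_CLEAN:
		break;
	}

	if (!dirty)
		return OUTCOME_FREED;

	/* Only PAGEREF_RECLAIM lets a dirty page be written out. */
	return references == PAGEREF_RECLAIM ? OUTCOME_WRITTEN_OUT
					      : OUTCOME_KEPT;
}

/* The two cases the commit message contrasts, with the check skipped. */
static void model_demo(void)
{
	/* Old default: the dirty page is kept despite forced reclaim. */
	assert(model_shrink_page(PAGEREF_RECLAIM_CLEAN, true,
				 PAGEREF_KEEP, true) == OUTCOME_KEPT);
	/* New default: the same page is written out. */
	assert(model_shrink_page(PAGEREF_RECLAIM, true,
				 PAGEREF_KEEP, true) == OUTCOME_WRITTEN_OUT);
}
```

When the reference check is not skipped, the default never reaches the switch, so in this model the change only affects callers that skip the check.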
Signed-off-by: Minchan Kim
Acked-by: Johannes Weiner
---
 mm/vmscan.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d9c3e873eca6..a28e5d17b495 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1102,7 +1102,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				      struct scan_control *sc,
 				      enum ttu_flags ttu_flags,
 				      struct reclaim_stat *stat,
-				      bool force_reclaim)
+				      bool skip_reference_check)
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
@@ -1116,7 +1116,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		struct address_space *mapping;
 		struct page *page;
 		int may_enter_fs;
-		enum page_references references = PAGEREF_RECLAIM_CLEAN;
+		enum page_references references = PAGEREF_RECLAIM;
 		bool dirty, writeback;
 
 		cond_resched();
@@ -1248,7 +1248,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			}
 		}
 
-		if (!force_reclaim)
+		if (!skip_reference_check)
 			references = page_check_references(page, sc);
 
 		switch (references) {

From patchwork Mon May 20 03:52:50 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Minchan Kim
X-Patchwork-Id: 10949855
Return-Path:
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 0CD49912 for ; Mon, 20 May 2019 03:53:22 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id EF511285B5 for ; Mon, 20 May 2019 03:53:21 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id E34CA285C7; Mon, 20 May 2019 03:53:21 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org
X-Spam-Level:
X-Spam-Status: No, score=-2.7 required=2.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,MAILING_LIST_MULTI,RCVD_IN_DNSWL_NONE autolearn=ham version=3.3.1
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 1C15D285B5 for ; Mon, 20 May 2019 03:53:21 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id F236C6B0008; Sun, 19 May 2019 23:53:19 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id ED3AB6B000A; Sun, 19 May 2019 23:53:19 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id D9D286B000C; Sun, 19 May 2019 23:53:19 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from mail-pf1-f197.google.com (mail-pf1-f197.google.com [209.85.210.197]) by kanga.kvack.org (Postfix) with ESMTP id A0BDD6B0008 for ; Sun, 19 May 2019 23:53:19 -0400 (EDT) Received: by mail-pf1-f197.google.com with SMTP id f9so4181624pfn.6 for ; Sun, 19 May 2019 20:53:19 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:dkim-signature:sender:from:to:cc:subject:date :message-id:in-reply-to:references:mime-version :content-transfer-encoding; bh=+ky+Sp7MhEB6EVoatOc2laxLncqWcKPthCrhwamx588=; b=kjRTKKvNTPHCZ5rMF0nWxvMYWHosgTKE2DGkwbo/m/hcKf/ZkV/7i19Y/wvhjOZuyT 9FIZw/Asc3b0G1yAMZNyhRvk6pjZUdWsFupCzHCQ2OewbSA4LPlu+ClMk7f0eScrCTE4 jCOn87HLg/HsMtrr2FdMxdgiRcX9gOx0LCdqEfGjxzQvgsFUmH4pN78L9fLzsJYcqG7V +BWQ2mRNGa++sKDV5itVSGAYsKfMZdMImIb+YyO6gDFaAwCfs1AAPkGcQMvj7OLVcP9/ RPtWppjVPVYvOE1gKQ+IksYPp73aA9fLcZEglAi0FjNLd17u1MY/7bnWiOSVxu2U9xkf qktA== X-Gm-Message-State: APjAAAUhKwXajTqrtNxP+7B/AZetbynj6K1rWknkssYhB5SKLjQB6k/I xChzdL28Ipa5vOwQgGYquI1cNeqhtddC/tBePb1h1dzO73NDRlKNGWe3ZSdBqABouWPeJrLardd R0O7V2z6RJwb31xX8jVKwzRHjiA6Hk+ECmpF2Nh2I0lcmzXbV0UuzSlR1JsKC4fc= X-Received: by 2002:a63:c14:: with SMTP id b20mr72496670pgl.163.1558324399290; Sun, 19 May 2019 20:53:19 -0700 (PDT) X-Received: by 2002:a63:c14:: with SMTP id 
b20mr72496588pgl.163.1558324398150; Sun, 19 May 2019 20:53:18 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1558324398; cv=none; d=google.com; s=arc-20160816; b=tvVvegr/4TnkCDFz3K/y5Phlwu0vAbxfABoML71E2HhOHA0rcBeei/8cIH6st9k6Jq gVruvy8EOtJFo/tuOMZvF4gmjcHFEpM/Ji0H0Q9K3UriMHdpwRfoMXs3yzbMNTitB3ZH NMfUlfEaRa7vv267/ivADgk2T4YLYeh48+EWz6i27Iw/p4+j5iu2lIqb7f1az2aqTP45 jndDHcr+nCwzDFqdjwi+enQwOX4XaxTNkzVQhPywBagP9aga18MIrPUO10pwCLJdJ5Me mrRB38ppv7cAYqLZ2Mjdh5H3IKut0ynmALg65GX/6yYrl/vihThdDq/DR1LkJhtGhEMi aKLg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:sender:dkim-signature; bh=+ky+Sp7MhEB6EVoatOc2laxLncqWcKPthCrhwamx588=; b=XUSg6nq7c4yjJAjuBp7q+ZVPZiIAndjnjCcf24KkWsIuuxnRKaPbfPsa1Ag8veB4Fh x6/y+r6Z3V9sJe+iZ0WnKRgHd+vCdS4+U+RefRXJX+9p+0yEXOnTGk1nETXM1F1U7R/k 4w0fxmRERkXRfglb6plgvMGLIykqEHEeDos8xRIyPybta0M6kjUyekpswlSKDO7AZSXJ OxVZU4HGYifGqTJsS23UWI2dL8KZpgKdrYddT5d9wiXcZeXktcya+KwBLw9bM3T3UT1F NKJyX3oy3nVeWxZdI8q3IFeNsXLgjqOXAn3JE+7borDVqcn8Jx0bu/MXXWanInZ14VJG VIhQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=t2f8Wzza; spf=pass (google.com: domain of minchan.kim@gmail.com designates 209.85.220.65 as permitted sender) smtp.mailfrom=minchan.kim@gmail.com; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: from mail-sor-f65.google.com (mail-sor-f65.google.com. 
[209.85.220.65]) by mx.google.com with SMTPS id 78sor16509067pgb.30.2019.05.19.20.53.18 for (Google Transport Security); Sun, 19 May 2019 20:53:18 -0700 (PDT) Received-SPF: pass (google.com: domain of minchan.kim@gmail.com designates 209.85.220.65 as permitted sender) client-ip=209.85.220.65; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=t2f8Wzza; spf=pass (google.com: domain of minchan.kim@gmail.com designates 209.85.220.65 as permitted sender) smtp.mailfrom=minchan.kim@gmail.com; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=kernel.org DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=sender:from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=+ky+Sp7MhEB6EVoatOc2laxLncqWcKPthCrhwamx588=; b=t2f8WzzaADAETNhuYxhAeKtM/AHj0gzF4ATQn35I6pm9JPzSen+jglx4e4pZ5fhTwU yfl9in1HjDniapwXdxLyTS0ZiCttr/UomYlRYsPCC8JOpISY3vsZhbsyFNnG12Pup8mF 1EN8H/bhHAbAcKQnL9RTvd7gosK4ZvyIA0nLBhxvSY5SbH+CBcHzMo5VFmQ1LkcUT9S5 L7AbHVKJvY9bgH+SYi+OK1krwIAN0PjhZDZw5tHxl7mxx9GOZPdWIsczCdPmrDdaQsPe aYDc3I3sTfuGtZ+4lUU/zpLoiEIBNE6G5dH/2HVLc4cFyx0rjfs7E5tvAU0USZmN4cRX fOag== X-Google-Smtp-Source: APXvYqzoHwcC2b/fGoVNKZk6bvd4p0rDW0BF5P0YqY4DqaB1uH9QaANxyS77XhPl27ae5RPUynAF4A== X-Received: by 2002:a65:6648:: with SMTP id z8mr23825282pgv.303.1558324397745; Sun, 19 May 2019 20:53:17 -0700 (PDT) Received: from bbox-2.seo.corp.google.com ([2401:fa00:d:0:98f1:8b3d:1f37:3e8]) by smtp.gmail.com with ESMTPSA id x66sm3312779pfx.139.2019.05.19.20.53.13 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 19 May 2019 20:53:16 -0700 (PDT) From: Minchan Kim To: Andrew Morton Cc: LKML , linux-mm , Michal Hocko , Johannes Weiner , Tim Murray , Joel Fernandes , Suren Baghdasaryan , Daniel Colascione , Shakeel Butt , Sonny Rao , Brian Geffon , Minchan Kim Subject: [RFC 3/7] mm: introduce MADV_COLD Date: Mon, 20 May 2019 12:52:50 +0900 Message-Id: 
<20190520035254.57579-4-minchan@kernel.org>
X-Mailer: git-send-email 2.21.0.1020.gf2820cf01a-goog
In-Reply-To: <20190520035254.57579-1-minchan@kernel.org>
References: <20190520035254.57579-1-minchan@kernel.org>
MIME-Version: 1.0
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID:
X-Virus-Scanned: ClamAV using ClamSMTP

When a process expects no accesses to a certain memory range for a long
time, it can hint to the kernel that the pages may be reclaimed
immediately, but that the data should be preserved for future use. This
can reduce workingset eviction and so improve performance.

This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
MADV_COLD can be used by a process to mark a memory range as not
expected to be used for a long time. The hint helps the kernel decide
which pages to evict proactively.

Internally, it works by reclaiming memory in the context of the process
that calls the syscall. If a page is dirty and its backing storage is
not a synchronous device, the page is rotated back to the tail of the
LRU once the write completes, so it can be reclaimed easily when memory
pressure happens. If the backing storage is a synchronous device (e.g.,
zram), the page is reclaimed immediately.
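From user space, issuing the hint might look like the following minimal sketch. It assumes a kernel carrying this RFC, where MADV_COLD is 6 as defined in the mman-common.h hunk of this patch; that value is not a mainline UAPI constant. RFC_MADV_COLD and hint_cold_range() are hypothetical names introduced only for this illustration. A kernel without the series is expected to reject the advice with EINVAL, so the helper reports the error instead of treating it as fatal.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for madvise() under strict -std= modes */
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Assumption: value taken from this RFC, not from mainline headers. */
#define RFC_MADV_COLD 6

/*
 * Hypothetical helper: ask the kernel to reclaim [addr, addr + len) now
 * while preserving its contents. @addr must be page aligned.
 * Returns 0 on success or a negative errno value.
 */
static int hint_cold_range(void *addr, size_t len)
{
	assert(addr != NULL && len > 0);

	if (madvise(addr, len, RFC_MADV_COLD) != 0)
		return -errno;
	return 0;
}
```

Since the data is preserved, a caller can read the range again later without re-initialising it; the expected cost is only that of faulting the pages back in.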
Signed-off-by: Minchan Kim --- include/linux/swap.h | 1 + include/uapi/asm-generic/mman-common.h | 1 + mm/madvise.c | 123 +++++++++++++++++++++++++ mm/vmscan.c | 74 +++++++++++++++ 4 files changed, 199 insertions(+) diff --git a/include/linux/swap.h b/include/linux/swap.h index 64795abea003..7f32a948fc6a 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -365,6 +365,7 @@ extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); extern unsigned long vm_total_pages; +extern unsigned long reclaim_pages(struct list_head *page_list); #ifdef CONFIG_NUMA extern int node_reclaim_mode; extern int sysctl_min_unmapped_ratio; diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h index f7a4a5d4b642..b9b51eeb8e1a 100644 --- a/include/uapi/asm-generic/mman-common.h +++ b/include/uapi/asm-generic/mman-common.h @@ -43,6 +43,7 @@ #define MADV_WILLNEED 3 /* will need these pages */ #define MADV_DONTNEED 4 /* don't need these pages */ #define MADV_COOL 5 /* deactivatie these pages */ +#define MADV_COLD 6 /* reclaim these pages */ /* common parameters: try to keep these consistent across architectures */ #define MADV_FREE 8 /* free pages only if memory pressure */ diff --git a/mm/madvise.c b/mm/madvise.c index c05817fb570d..9a6698b56845 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -42,6 +42,7 @@ static int madvise_need_mmap_write(int behavior) case MADV_WILLNEED: case MADV_DONTNEED: case MADV_COOL: + case MADV_COLD: case MADV_FREE: return 0; default: @@ -416,6 +417,125 @@ static long madvise_cool(struct vm_area_struct *vma, return 0; } +static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr, + unsigned long end, struct mm_walk *walk) +{ + pte_t *orig_pte, *pte, ptent; + spinlock_t *ptl; + LIST_HEAD(page_list); + struct page *page; + int isolated = 0; + struct vm_area_struct *vma = walk->vma; + unsigned long next; + + next = pmd_addr_end(addr, end); + if (pmd_trans_huge(*pmd)) { 
+ spinlock_t *ptl; + + ptl = pmd_trans_huge_lock(pmd, vma); + if (!ptl) + return 0; + + if (is_huge_zero_pmd(*pmd)) + goto huge_unlock; + + page = pmd_page(*pmd); + if (page_mapcount(page) > 1) + goto huge_unlock; + + if (next - addr != HPAGE_PMD_SIZE) { + int err; + + get_page(page); + spin_unlock(ptl); + lock_page(page); + err = split_huge_page(page); + unlock_page(page); + put_page(page); + if (!err) + goto regular_page; + return 0; + } + + if (isolate_lru_page(page)) + goto huge_unlock; + + list_add(&page->lru, &page_list); +huge_unlock: + spin_unlock(ptl); + reclaim_pages(&page_list); + return 0; + } + + if (pmd_trans_unstable(pmd)) + return 0; +regular_page: + orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); + for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) { + ptent = *pte; + if (!pte_present(ptent)) + continue; + + page = vm_normal_page(vma, addr, ptent); + if (!page) + continue; + + if (page_mapcount(page) > 1) + continue; + + if (isolate_lru_page(page)) + continue; + + isolated++; + list_add(&page->lru, &page_list); + if (isolated >= SWAP_CLUSTER_MAX) { + pte_unmap_unlock(orig_pte, ptl); + reclaim_pages(&page_list); + isolated = 0; + pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); + orig_pte = pte; + } + } + + pte_unmap_unlock(orig_pte, ptl); + reclaim_pages(&page_list); + cond_resched(); + + return 0; +} + +static void madvise_cold_page_range(struct mmu_gather *tlb, + struct vm_area_struct *vma, + unsigned long addr, unsigned long end) +{ + struct mm_walk warm_walk = { + .pmd_entry = madvise_cold_pte_range, + .mm = vma->vm_mm, + }; + + tlb_start_vma(tlb, vma); + walk_page_range(addr, end, &warm_walk); + tlb_end_vma(tlb, vma); +} + + +static long madvise_cold(struct vm_area_struct *vma, + unsigned long start_addr, unsigned long end_addr) +{ + struct mm_struct *mm = vma->vm_mm; + struct mmu_gather tlb; + + if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP)) + return -EINVAL; + + lru_add_drain(); + tlb_gather_mmu(&tlb, mm, 
start_addr, end_addr); + madvise_cold_page_range(&tlb, vma, start_addr, end_addr); + tlb_finish_mmu(&tlb, start_addr, end_addr); + + return 0; +} + static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) @@ -806,6 +926,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev, return madvise_willneed(vma, prev, start, end); case MADV_COOL: return madvise_cool(vma, start, end); + case MADV_COLD: + return madvise_cold(vma, start, end); case MADV_FREE: case MADV_DONTNEED: return madvise_dontneed_free(vma, prev, start, end, behavior); @@ -828,6 +950,7 @@ madvise_behavior_valid(int behavior) case MADV_DONTNEED: case MADV_FREE: case MADV_COOL: + case MADV_COLD: #ifdef CONFIG_KSM case MADV_MERGEABLE: case MADV_UNMERGEABLE: diff --git a/mm/vmscan.c b/mm/vmscan.c index a28e5d17b495..1701b31f70a8 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2096,6 +2096,80 @@ static void shrink_active_list(unsigned long nr_to_scan, nr_deactivate, nr_rotated, sc->priority, file); } +unsigned long reclaim_pages(struct list_head *page_list) +{ + int nid = -1; + unsigned long nr_isolated[2] = {0, }; + unsigned long nr_reclaimed = 0; + LIST_HEAD(node_page_list); + struct reclaim_stat dummy_stat; + struct scan_control sc = { + .gfp_mask = GFP_KERNEL, + .priority = DEF_PRIORITY, + .may_writepage = 1, + .may_unmap = 1, + .may_swap = 1, + }; + + while (!list_empty(page_list)) { + struct page *page; + + page = lru_to_page(page_list); + list_del(&page->lru); + + if (nid == -1) { + nid = page_to_nid(page); + INIT_LIST_HEAD(&node_page_list); + nr_isolated[0] = nr_isolated[1] = 0; + } + + if (nid == page_to_nid(page)) { + list_add(&page->lru, &node_page_list); + nr_isolated[!!page_is_file_cache(page)] += + hpage_nr_pages(page); + continue; + } + + nid = page_to_nid(page); + + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON, + nr_isolated[0]); + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE, + nr_isolated[1]); + nr_reclaimed 
+= shrink_page_list(&node_page_list, + NODE_DATA(nid), &sc, TTU_IGNORE_ACCESS, + &dummy_stat, true); + while (!list_empty(&node_page_list)) { + struct page *page = lru_to_page(&node_page_list); + + list_del(&page->lru); + putback_lru_page(page); + } + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON, + -nr_isolated[0]); + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE, + -nr_isolated[1]); + nr_isolated[0] = nr_isolated[1] = 0; + INIT_LIST_HEAD(&node_page_list); + } + + if (!list_empty(&node_page_list)) { + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON, + nr_isolated[0]); + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE, + nr_isolated[1]); + nr_reclaimed += shrink_page_list(&node_page_list, + NODE_DATA(nid), &sc, TTU_IGNORE_ACCESS, + &dummy_stat, true); + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON, + -nr_isolated[0]); + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE, + -nr_isolated[1]); + } + + return nr_reclaimed; +} + /* * The inactive anon list should be small enough that the VM never has * to do too much work. 
From patchwork Mon May 20 03:52:51 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Minchan Kim X-Patchwork-Id: 10949857 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 1914E912 for ; Mon, 20 May 2019 03:53:27 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 06666285B5 for ; Mon, 20 May 2019 03:53:27 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id EEA66285C7; Mon, 20 May 2019 03:53:26 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.7 required=2.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,MAILING_LIST_MULTI,RCVD_IN_DNSWL_NONE autolearn=ham version=3.3.1 Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id F15C1285B5 for ; Mon, 20 May 2019 03:53:25 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id D9C946B000E; Sun, 19 May 2019 23:53:24 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id D4C2C6B0266; Sun, 19 May 2019 23:53:24 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id BF8506B000E; Sun, 19 May 2019 23:53:24 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from mail-pf1-f197.google.com (mail-pf1-f197.google.com [209.85.210.197]) by kanga.kvack.org (Postfix) with ESMTP id 83D846B000E for ; Sun, 19 May 2019 23:53:24 -0400 (EDT) Received: by mail-pf1-f197.google.com with SMTP id x5so9045708pfi.5 for ; Sun, 19 May 2019 20:53:24 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; 
c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:dkim-signature:sender:from:to:cc:subject:date :message-id:in-reply-to:references:mime-version :content-transfer-encoding; bh=89p7CoU4S9UFAIEQ4qQeLvyukcRmWZty6wgd/XamBhs=; b=ukkgyM5mrzM7J/TuQNw8moQsPQW5c8FaGZ5RRBlBbvSjhsEtAwhzoxpqC0uRNeeqgy 8rlrHYGynlFV/4PJHgmuS/zYkTjTNqxatj4ro9Ag6UyFyRSC9ZIgpcmIyoxdu+tcIwSH EynOCH6uqfrhlfz+98wvDQx2wtLtoeQekIZYKljPw34lsArmFw1j0mTXpL8ZLLyQpAaW yEugCW9b/hJOnI2fadITQvk+7oTwgahKpjjzbjZgeZ5guIYvii4X0v5ifsz/FZJAu4Rr IHo+52e+R6zojW6ruQeFBbsdERz2WRxQzM+3+qPX+BB//WKDbh/BMndkfiToTJVm8gKr UBdw== X-Gm-Message-State: APjAAAWwKZlO3gMSe2jz4WNH2ZVsK7xkjItOMukruzJxd1QDg6TZkkoC TgUVP5AZwTze2958Wh5XKILt6od0vMisH7tFajyOcyI2VHP4GmJ1Ft2QQqmua28lm1tZbCS2b7+ PG9RRmtL8yG+UMn0veEIcOdqr5n8bycrcrjqxe7lFPb3dwt4cwCDtaXL04VVa3z0= X-Received: by 2002:a17:902:aa95:: with SMTP id d21mr18674100plr.32.1558324404168; Sun, 19 May 2019 20:53:24 -0700 (PDT) X-Received: by 2002:a17:902:aa95:: with SMTP id d21mr18674006plr.32.1558324402680; Sun, 19 May 2019 20:53:22 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1558324402; cv=none; d=google.com; s=arc-20160816; b=Id7yOzTu61GYbMEbfKoO/Za/V293xlngE4E4jWlsx0z4qdKb1Tn1rXpBzn3eP37k0/ VfhWqCuTwofGTIpvERKcnYLPXPqEzDivUAmZ3bIjcWAt3bnI9wJUyH4BKB5Nd0tcsWZS 9hGnSAfMzuz8tD+OnZjNG/C0Fxv7d/cpvzRuludPeVt/Fig5s1g77bUerd/dNQ0KmFGd SbNJHkssc3i36Ux2y+Bnyu3ivRyrQ28hlfMdKp4tDmj1CFGJrMvLrbA0rR3Q5nij7S7V QPbZU+b+mMHEUMBQJr8JgtT6/8lu/Co1vDXEHT1iGlW5kHkutbUx5oBo1VP4wPXhFvbs pysA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:sender:dkim-signature; bh=89p7CoU4S9UFAIEQ4qQeLvyukcRmWZty6wgd/XamBhs=; b=esdREco1CzRWydRTLTGDOBzUoRAulMwltq2DCLODLFIF2KwjEUJGbcDJOCmGCGNOXs OKvAvWD5R2aoNUYieENm0YTf8yu+Hi3z+e0wGYNeqEKOwW/QfIHJkVjLOwTXQxJ/2xDG E1zzeZejS6145woTgcF+3sU0fRjy7/ke9/ZomonutfgboBGXhYwdHfHoIFRlNF2FxXZU 
kubmADt8gzIWzfjoWNB0TvXfMz4ywPUxk3pdtrb2qqtwz0jbxljcdGDgFpKnIGh7vMtj JvYMgPW3SFnmhXcGXKff48/6PuIck3ItkcpQ0GRGF8pXSKNyR+upXqUG1njfLNBmeNhj C8tQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=Caux5HDn; spf=pass (google.com: domain of minchan.kim@gmail.com designates 209.85.220.65 as permitted sender) smtp.mailfrom=minchan.kim@gmail.com; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: from mail-sor-f65.google.com (mail-sor-f65.google.com. [209.85.220.65]) by mx.google.com with SMTPS id p65sor17823582pfg.64.2019.05.19.20.53.22 for (Google Transport Security); Sun, 19 May 2019 20:53:22 -0700 (PDT) Received-SPF: pass (google.com: domain of minchan.kim@gmail.com designates 209.85.220.65 as permitted sender) client-ip=209.85.220.65; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=Caux5HDn; spf=pass (google.com: domain of minchan.kim@gmail.com designates 209.85.220.65 as permitted sender) smtp.mailfrom=minchan.kim@gmail.com; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=kernel.org DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=sender:from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=89p7CoU4S9UFAIEQ4qQeLvyukcRmWZty6wgd/XamBhs=; b=Caux5HDngteXa5fZ7XLnDhpQOpZUHg9z6OGEBD9kplxhqTCkOkn0nPKyDUnZvR7BEZ hmroxWZ/B5nvgDPgtZNqJsSKb9b48wpHzo1a47mUlARGD2Jq4O84JWPUNl0x8PEJR+XV 8BtzmWqjjyfomQd+a360QgYlvk6RLWze4t749nG67Rk4F65CS66g/47rULD44Uf+xKIl yoFtR9BmF8S2IJxiXuzRUxDzPObzJ+fjAJZSKvoLXNHnVgNVbLwHjr9jOWUkcA/fm6GD zdoHvjTRxs+QnEL/Ly5U+ix/mbq4UGyaHh0a9JDcAaojR+ydrDC6a853uXcm9G0k69il R1WQ== X-Google-Smtp-Source: APXvYqxd8/hhxsSgSvvNWTVzT+1zlZvV711Jl/ozMiRJDFxhSFYMXe5yIGw6TiA9hGerZo479aN1pg== X-Received: by 2002:a62:82c1:: with SMTP id w184mr47287418pfd.171.1558324402213; Sun, 19 May 2019 20:53:22 -0700 (PDT) Received: from bbox-2.seo.corp.google.com 
([2401:fa00:d:0:98f1:8b3d:1f37:3e8]) by smtp.gmail.com with ESMTPSA id x66sm3312779pfx.139.2019.05.19.20.53.17 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 19 May 2019 20:53:21 -0700 (PDT) From: Minchan Kim To: Andrew Morton Cc: LKML , linux-mm , Michal Hocko , Johannes Weiner , Tim Murray , Joel Fernandes , Suren Baghdasaryan , Daniel Colascione , Shakeel Butt , Sonny Rao , Brian Geffon , Minchan Kim Subject: [RFC 4/7] mm: factor out madvise's core functionality Date: Mon, 20 May 2019 12:52:51 +0900 Message-Id: <20190520035254.57579-5-minchan@kernel.org> X-Mailer: git-send-email 2.21.0.1020.gf2820cf01a-goog In-Reply-To: <20190520035254.57579-1-minchan@kernel.org> References: <20190520035254.57579-1-minchan@kernel.org> MIME-Version: 1.0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: X-Virus-Scanned: ClamAV using ClamSMTP

This patch factors out madvise's core functionality so that an upcoming patch can reuse it without duplication. It should not change any behavior.
Signed-off-by: Minchan Kim --- mm/madvise.c | 168 +++++++++++++++++++++++++++------------------------ 1 file changed, 89 insertions(+), 79 deletions(-) diff --git a/mm/madvise.c b/mm/madvise.c index 9a6698b56845..119e82e1f065 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -742,7 +742,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma, return 0; } -static long madvise_dontneed_free(struct vm_area_struct *vma, +static long madvise_dontneed_free(struct task_struct *tsk, + struct vm_area_struct *vma, struct vm_area_struct **prev, unsigned long start, unsigned long end, int behavior) @@ -754,8 +755,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, if (!userfaultfd_remove(vma, start, end)) { *prev = NULL; /* mmap_sem has been dropped, prev is stale */ - down_read(&current->mm->mmap_sem); - vma = find_vma(current->mm, start); + down_read(&tsk->mm->mmap_sem); + vma = find_vma(tsk->mm, start); if (!vma) return -ENOMEM; if (start < vma->vm_start) { @@ -802,7 +803,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, * Application wants to free up the pages and associated backing store. * This is effectively punching a hole into the middle of a file. 
*/ -static long madvise_remove(struct vm_area_struct *vma, +static long madvise_remove(struct task_struct *tsk, + struct vm_area_struct *vma, struct vm_area_struct **prev, unsigned long start, unsigned long end) { @@ -836,13 +838,13 @@ static long madvise_remove(struct vm_area_struct *vma, get_file(f); if (userfaultfd_remove(vma, start, end)) { /* mmap_sem was not released by userfaultfd_remove() */ - up_read(&current->mm->mmap_sem); + up_read(&tsk->mm->mmap_sem); } error = vfs_fallocate(f, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset, end - start); fput(f); - down_read(&current->mm->mmap_sem); + down_read(&tsk->mm->mmap_sem); return error; } @@ -916,12 +918,13 @@ static int madvise_inject_error(int behavior, #endif static long -madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev, - unsigned long start, unsigned long end, int behavior) +madvise_vma(struct task_struct *tsk, struct vm_area_struct *vma, + struct vm_area_struct **prev, unsigned long start, + unsigned long end, int behavior) { switch (behavior) { case MADV_REMOVE: - return madvise_remove(vma, prev, start, end); + return madvise_remove(tsk, vma, prev, start, end); case MADV_WILLNEED: return madvise_willneed(vma, prev, start, end); case MADV_COOL: @@ -930,7 +933,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev, return madvise_cold(vma, start, end); case MADV_FREE: case MADV_DONTNEED: - return madvise_dontneed_free(vma, prev, start, end, behavior); + return madvise_dontneed_free(tsk, vma, prev, start, + end, behavior); default: return madvise_behavior(vma, prev, start, end, behavior); } @@ -974,68 +978,8 @@ madvise_behavior_valid(int behavior) } } -/* - * The madvise(2) system call. - * - * Applications can use madvise() to advise the kernel how it should - * handle paging I/O in this VM area. The idea is to help the kernel - * use appropriate read-ahead and caching techniques. 
The information - * provided is advisory only, and can be safely disregarded by the - * kernel without affecting the correct operation of the application. - * - * behavior values: - * MADV_NORMAL - the default behavior is to read clusters. This - * results in some read-ahead and read-behind. - * MADV_RANDOM - the system should read the minimum amount of data - * on any access, since it is unlikely that the appli- - * cation will need more than what it asks for. - * MADV_SEQUENTIAL - pages in the given range will probably be accessed - * once, so they can be aggressively read ahead, and - * can be freed soon after they are accessed. - * MADV_WILLNEED - the application is notifying the system to read - * some pages ahead. - * MADV_DONTNEED - the application is finished with the given range, - * so the kernel can free resources associated with it. - * MADV_FREE - the application marks pages in the given range as lazy free, - * where actual purges are postponed until memory pressure happens. - * MADV_REMOVE - the application wants to free up the given range of - * pages and associated backing store. - * MADV_DONTFORK - omit this area from child's address space when forking: - * typically, to avoid COWing pages pinned by get_user_pages(). - * MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking. - * MADV_WIPEONFORK - present the child process with zero-filled memory in this - * range after a fork. - * MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK - * MADV_HWPOISON - trigger memory error handler as if the given memory range - * were corrupted by unrecoverable hardware memory failure. - * MADV_SOFT_OFFLINE - try to soft-offline the given range of memory. - * MADV_MERGEABLE - the application recommends that KSM try to merge pages in - * this area with pages of identical content from other such areas. - * MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others. 
- * MADV_HUGEPAGE - the application wants to back the given range by transparent - * huge pages in the future. Existing pages might be coalesced and - * new pages might be allocated as THP. - * MADV_NOHUGEPAGE - mark the given range as not worth being backed by - * transparent huge pages so the existing pages will not be - * coalesced into THP and new pages will not be allocated as THP. - * MADV_DONTDUMP - the application wants to prevent pages in the given range - * from being included in its core dump. - * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump. - * - * return values: - * zero - success - * -EINVAL - start + len < 0, start is not page-aligned, - * "behavior" is not a valid value, or application - * is attempting to release locked or shared pages, - * or the specified address range includes file, Huge TLB, - * MAP_SHARED or VMPFNMAP range. - * -ENOMEM - addresses in the specified range are not currently - * mapped, or are outside the AS of the process. - * -EIO - an I/O error occurred while paging in data. - * -EBADF - map exists, but area maps something that isn't a file. - * -EAGAIN - a kernel resource was temporarily unavailable. - */ -SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) +static int madvise_core(struct task_struct *tsk, unsigned long start, + size_t len_in, int behavior) { unsigned long end, tmp; struct vm_area_struct *vma, *prev; @@ -1071,10 +1015,10 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) write = madvise_need_mmap_write(behavior); if (write) { - if (down_write_killable(&current->mm->mmap_sem)) + if (down_write_killable(&tsk->mm->mmap_sem)) return -EINTR; } else { - down_read(&current->mm->mmap_sem); + down_read(&tsk->mm->mmap_sem); } /* @@ -1082,7 +1026,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) * ranges, just ignore them, but return -ENOMEM at the end. * - different from the way of handling in mlock etc. 
*/ - vma = find_vma_prev(current->mm, start, &prev); + vma = find_vma_prev(tsk->mm, start, &prev); if (vma && start > vma->vm_start) prev = vma; @@ -1107,7 +1051,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) tmp = end; /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */ - error = madvise_vma(vma, &prev, start, tmp, behavior); + error = madvise_vma(tsk, vma, &prev, start, tmp, behavior); if (error) goto out; start = tmp; @@ -1119,14 +1063,80 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) if (prev) vma = prev->vm_next; else /* madvise_remove dropped mmap_sem */ - vma = find_vma(current->mm, start); + vma = find_vma(tsk->mm, start); } out: blk_finish_plug(&plug); if (write) - up_write(&current->mm->mmap_sem); + up_write(&tsk->mm->mmap_sem); else - up_read(&current->mm->mmap_sem); + up_read(&tsk->mm->mmap_sem); return error; } + +/* + * The madvise(2) system call. + * + * Applications can use madvise() to advise the kernel how it should + * handle paging I/O in this VM area. The idea is to help the kernel + * use appropriate read-ahead and caching techniques. The information + * provided is advisory only, and can be safely disregarded by the + * kernel without affecting the correct operation of the application. + * + * behavior values: + * MADV_NORMAL - the default behavior is to read clusters. This + * results in some read-ahead and read-behind. + * MADV_RANDOM - the system should read the minimum amount of data + * on any access, since it is unlikely that the appli- + * cation will need more than what it asks for. + * MADV_SEQUENTIAL - pages in the given range will probably be accessed + * once, so they can be aggressively read ahead, and + * can be freed soon after they are accessed. + * MADV_WILLNEED - the application is notifying the system to read + * some pages ahead. + * MADV_DONTNEED - the application is finished with the given range, + * so the kernel can free resources associated with it. 
+ * MADV_FREE - the application marks pages in the given range as lazy free, + * where actual purges are postponed until memory pressure happens. + * MADV_REMOVE - the application wants to free up the given range of + * pages and associated backing store. + * MADV_DONTFORK - omit this area from child's address space when forking: + * typically, to avoid COWing pages pinned by get_user_pages(). + * MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking. + * MADV_WIPEONFORK - present the child process with zero-filled memory in this + * range after a fork. + * MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK + * MADV_HWPOISON - trigger memory error handler as if the given memory range + * were corrupted by unrecoverable hardware memory failure. + * MADV_SOFT_OFFLINE - try to soft-offline the given range of memory. + * MADV_MERGEABLE - the application recommends that KSM try to merge pages in + * this area with pages of identical content from other such areas. + * MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others. + * MADV_HUGEPAGE - the application wants to back the given range by transparent + * huge pages in the future. Existing pages might be coalesced and + * new pages might be allocated as THP. + * MADV_NOHUGEPAGE - mark the given range as not worth being backed by + * transparent huge pages so the existing pages will not be + * coalesced into THP and new pages will not be allocated as THP. + * MADV_DONTDUMP - the application wants to prevent pages in the given range + * from being included in its core dump. + * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump. + * + * return values: + * zero - success + * -EINVAL - start + len < 0, start is not page-aligned, + * "behavior" is not a valid value, or application + * is attempting to release locked or shared pages, + * or the specified address range includes file, Huge TLB, + * MAP_SHARED or VMPFNMAP range. 
+ * -ENOMEM - addresses in the specified range are not currently + * mapped, or are outside the AS of the process. + * -EIO - an I/O error occurred while paging in data. + * -EBADF - map exists, but area maps something that isn't a file. + * -EAGAIN - a kernel resource was temporarily unavailable. + */ +SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) +{ + return madvise_core(current, start, len_in, behavior); +} From patchwork Mon May 20 03:52:52 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Minchan Kim X-Patchwork-Id: 10949859 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 3EAAC14C0 for ; Mon, 20 May 2019 03:53:31 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 2C1F2285B5 for ; Mon, 20 May 2019 03:53:31 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 1ECF8285C7; Mon, 20 May 2019 03:53:31 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.7 required=2.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,MAILING_LIST_MULTI,RCVD_IN_DNSWL_NONE autolearn=ham version=3.3.1 Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 5D22D285B5 for ; Mon, 20 May 2019 03:53:30 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 253F06B0269; Sun, 19 May 2019 23:53:29 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 1DBA76B026A; Sun, 19 May 2019 23:53:29 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 07D0F6B026B; 
Sun, 19 May 2019 23:53:29 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from mail-pf1-f199.google.com (mail-pf1-f199.google.com [209.85.210.199]) by kanga.kvack.org (Postfix) with ESMTP id BEE9C6B0269 for ; Sun, 19 May 2019 23:53:28 -0400 (EDT) Received: by mail-pf1-f199.google.com with SMTP id d12so9057150pfn.9 for ; Sun, 19 May 2019 20:53:28 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:dkim-signature:sender:from:to:cc:subject:date :message-id:in-reply-to:references:mime-version :content-transfer-encoding; bh=tBNcU7dtCYU00dN9icFSaU4lVFSngUnu9jAK5Jk7lkE=; b=mx9FYVkLnGuvqeNtVcNiguNTB84p7riKFg+ZHXehdFQKEdgpfXoK6ZOgcIo1kTcrxq dNkdE7d+ideKF4dYn1hHusAACQ9sbBgHUt2dKRyuRtpPw9w36yAjnOFLD9IlGIcINakB 7rgWC/WaZOdZkdgKPKRxe8ObGbIy/2dpoByGJ6DwRlOMLsHSLypbGfiJIOQMFCa/tVtB pLuoL65z+8/Prc5SPGmVfH9V5Ic1YlNoAn8nEOjkDTMQr3MeMRUI/24u1pGVFQm4u0ap THuzZesgz6NRa06Hz4ccMyeYYT3/o0jamMZIhcQg5Qu7GUDZfFZkQp8sSZhwSK/pYbh6 04TA== X-Gm-Message-State: APjAAAWVaucu14103Lt4WD/p2pFrmxznVbktwzjedHaQGomLwrj3Rinc yBXc8CIXq8aS7f/6tWaEqe/H4KgcSnkNMd/NtPiDdDw228zJtAjNPcGAK65Q/keDqa9wgGverRG BB0yVOR2abRPVOMhMzhaG35Nq5Ickgjo4VaHes5b3B027HrZBUCwUS+lW/36SpL0= X-Received: by 2002:a17:902:aa95:: with SMTP id d21mr18674364plr.32.1558324408413; Sun, 19 May 2019 20:53:28 -0700 (PDT) X-Received: by 2002:a17:902:aa95:: with SMTP id d21mr18674291plr.32.1558324407133; Sun, 19 May 2019 20:53:27 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1558324407; cv=none; d=google.com; s=arc-20160816; b=YeG8D4TkeBkiELMQhrZPVAOR8ycwXr1Tr/1Uk0C8D7ILJOWLgL3Y2P7HSc4bX4M9Wu afCt6QOlx9AVvG6jW14kLceaLCQyDgiydfnEJvlnidbI2wPWfAWJlPXGzuc8kL5qTDl3 wK3rjTiX1hB+CwfNVBMdEQR4eh35pSl2Fe9L79iPdrPBQ8yL1EuiUCvIGsaOMVei7Ffe 9h+EdeaFYXdd/ivNLvWb9y/hsFUJZbifeWqetInxEou/W9ssIq/CzCJ+s6ztKIE3i39W 1Z7ZQ1lbtniGyeRKIhHc+iECJFn5bDkc8z3XZwsMM/imoJDPd2qRCIdeIVcSuUNewYil OERg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; 
d=google.com; s=arc-20160816; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:sender:dkim-signature; bh=tBNcU7dtCYU00dN9icFSaU4lVFSngUnu9jAK5Jk7lkE=; b=L8fuCmOYLJ4n67JKi8axX6MCNcGe3D/Q0IeUaFUTfYe1MqVU91dSM19MCKxSzj57c4 bH9eDZKfxpMCnsxJNWMAz0IJO3GHYAjopGIAxNE/NrySb8+2O1bXt+EoaGcK6VVMFxRN khHxj3c+msPKX9Yxb/ke7s/J4ajVbpSmk7/owhCt5RX8Gx3qhlqLuZwa0xTzhh3zA8fY OrFFlO+G/IMONQezh91ji38vdf+qC0abxlehce23kAUoRWx3li5vadB3MFR+HXmMN4+m IVeYisYUNvl7vBs17noH4IR5bTB22a8JZVakASGxjx5pnVB7dzj2oW0eaPKXVb0cB28U HSeA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=YQSNJCcK; spf=pass (google.com: domain of minchan.kim@gmail.com designates 209.85.220.65 as permitted sender) smtp.mailfrom=minchan.kim@gmail.com; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: from mail-sor-f65.google.com (mail-sor-f65.google.com. [209.85.220.65]) by mx.google.com with SMTPS id l96sor17798691plb.68.2019.05.19.20.53.27 for (Google Transport Security); Sun, 19 May 2019 20:53:27 -0700 (PDT) Received-SPF: pass (google.com: domain of minchan.kim@gmail.com designates 209.85.220.65 as permitted sender) client-ip=209.85.220.65; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=YQSNJCcK; spf=pass (google.com: domain of minchan.kim@gmail.com designates 209.85.220.65 as permitted sender) smtp.mailfrom=minchan.kim@gmail.com; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=kernel.org DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=sender:from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=tBNcU7dtCYU00dN9icFSaU4lVFSngUnu9jAK5Jk7lkE=; b=YQSNJCcKM7WviKLmz7cKDDQ551lYaqh3ka6GsuFVnMdhbAJcv5YeKKfABVLz6AGuTW lFwJDX545/UN4fmoPbQeKDAZHe7/rEVNOLkNgOXnk53Csp4rl3Oqoo0kakKoxaazkCvK VE3q0cT/ObPibPKiSEqZT2szzLJ0aonUfjZNoXz0b8B6u5vrQ9WKCNKjliUntgNQCH1m 
Zo3b+x+OCfxBYocbwx2ZWCMM9CR6/OqkPVpXyc4meEXyXm8MM3GplEZqHq/yOb81G/8y LBK0MmgHyiXKvDiNOpYqZgcuTxYGuGePMZ2bth3KXF8BznuDHuYyyQT7YwG2mvugxbqV P8gg== X-Google-Smtp-Source: APXvYqz0+fnAnbc+xIx6Hyu1xzjAmscoSixfuxLRBCh1WPSoZBvSNF0PxhRQ0cIpCPfHKg1QdL/lIA== X-Received: by 2002:a17:902:b490:: with SMTP id y16mr44075401plr.161.1558324406751; Sun, 19 May 2019 20:53:26 -0700 (PDT) Received: from bbox-2.seo.corp.google.com ([2401:fa00:d:0:98f1:8b3d:1f37:3e8]) by smtp.gmail.com with ESMTPSA id x66sm3312779pfx.139.2019.05.19.20.53.22 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 19 May 2019 20:53:25 -0700 (PDT) From: Minchan Kim To: Andrew Morton Cc: LKML , linux-mm , Michal Hocko , Johannes Weiner , Tim Murray , Joel Fernandes , Suren Baghdasaryan , Daniel Colascione , Shakeel Butt , Sonny Rao , Brian Geffon , Minchan Kim Subject: [RFC 5/7] mm: introduce external memory hinting API Date: Mon, 20 May 2019 12:52:52 +0900 Message-Id: <20190520035254.57579-6-minchan@kernel.org> X-Mailer: git-send-email 2.21.0.1020.gf2820cf01a-goog In-Reply-To: <20190520035254.57579-1-minchan@kernel.org> References: <20190520035254.57579-1-minchan@kernel.org> MIME-Version: 1.0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: X-Virus-Scanned: ClamAV using ClamSMTP

There are use cases in which a centralized userspace daemon wants to give a memory hint such as MADV_[COOL|COLD] to another process. Android's ActivityManagerService is one of them. It's similar in spirit to madvise(MADV_WONTNEED), but the information required to make the reclaim decision is not known to the app. Instead, it is known to the centralized userspace daemon (ActivityManagerService), and that daemon must be able to initiate reclaim on its own without any app involvement.

To solve the issue, this patch introduces a new syscall, process_madvise(2),
which works on a pidfd so that it can give a hint to an external process.

	int process_madvise(int pidfd, void *addr, size_t length, int advise);

All advice values that madvise provides can be supported in process_madvise,
too. Since it can affect another process's address range, only a privileged
process (CAP_SYS_PTRACE) or one that otherwise has the right to ptrace the
target process (e.g., by having the same UID) can use it successfully.

Please suggest a better idea if you have one about the permission.

* from v1r1
  * use ptrace capability - surenb, dancol

Signed-off-by: Minchan Kim
---
 arch/x86/entry/syscalls/syscall_32.tbl |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 include/linux/proc_fs.h                |  1 +
 include/linux/syscalls.h               |  2 ++
 include/uapi/asm-generic/unistd.h      |  2 ++
 kernel/signal.c                        |  2 +-
 kernel/sys_ni.c                        |  1 +
 mm/madvise.c                           | 45 ++++++++++++++++++++++++++
 8 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 4cd5f982b1e5..5b9dd55d6b57 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -438,3 +438,4 @@
 425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
 426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
 427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
+428	i386	process_madvise		sys_process_madvise		__ia32_sys_process_madvise
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 64ca0d06259a..0e5ee78161c9 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
 425	common	io_uring_setup		__x64_sys_io_uring_setup
 426	common	io_uring_enter		__x64_sys_io_uring_enter
 427	common	io_uring_register	__x64_sys_io_uring_register
+428	common	process_madvise		__x64_sys_process_madvise
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 52a283ba0465..f8545d7c5218 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -122,6 +122,7 @@ static inline struct pid *tgid_pidfd_to_pid(const struct file *file)
 
 #endif /* CONFIG_PROC_FS */
 
+extern struct pid *pidfd_to_pid(const struct file *file);
 struct net;
 
 static inline struct proc_dir_entry *proc_net_mkdir(
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe1be5b..21c6c9a62006 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -872,6 +872,8 @@ asmlinkage long sys_munlockall(void);
 asmlinkage long sys_mincore(unsigned long start, size_t len,
 				unsigned char __user * vec);
 asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
+asmlinkage long sys_process_madvise(int pid_fd, unsigned long start,
+				size_t len, int behavior);
 asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
 			unsigned long prot, unsigned long pgoff,
 			unsigned long flags);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index dee7292e1df6..7ee82ce04620 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -832,6 +832,8 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
 __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
 #define __NR_io_uring_register 427
 __SYSCALL(__NR_io_uring_register, sys_io_uring_register)
+#define __NR_process_madvise 428
+__SYSCALL(__NR_process_madvise, sys_process_madvise)
 
 #undef __NR_syscalls
 #define __NR_syscalls 428
diff --git a/kernel/signal.c b/kernel/signal.c
index 1c86b78a7597..04e75daab1f8 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -3620,7 +3620,7 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
 	return copy_siginfo_from_user(kinfo, info);
 }
 
-static struct pid *pidfd_to_pid(const struct file *file)
+struct pid *pidfd_to_pid(const struct file *file)
 {
 	if (file->f_op == &pidfd_fops)
 		return file->private_data;
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 4d9ae5ea6caf..5277421795ab 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -278,6 +278,7 @@ COND_SYSCALL(mlockall);
 COND_SYSCALL(munlockall);
 COND_SYSCALL(mincore);
 COND_SYSCALL(madvise);
+COND_SYSCALL(process_madvise);
 COND_SYSCALL(remap_file_pages);
 COND_SYSCALL(mbind);
 COND_SYSCALL_COMPAT(mbind);
diff --git a/mm/madvise.c b/mm/madvise.c
index 119e82e1f065..af02aa17e5c1 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -9,6 +9,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -16,6 +17,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -1140,3 +1142,46 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 {
 	return madvise_core(current, start, len_in, behavior);
 }
+
+SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start,
+		size_t, len_in, int, behavior)
+{
+	int ret;
+	struct fd f;
+	struct pid *pid;
+	struct task_struct *tsk;
+	struct mm_struct *mm;
+
+	f = fdget(pidfd);
+	if (!f.file)
+		return -EBADF;
+
+	pid = pidfd_to_pid(f.file);
+	if (IS_ERR(pid)) {
+		ret = PTR_ERR(pid);
+		goto err;
+	}
+
+	ret = -EINVAL;
+	rcu_read_lock();
+	tsk = pid_task(pid, PIDTYPE_PID);
+	if (!tsk) {
+		rcu_read_unlock();
+		goto err;
+	}
+	get_task_struct(tsk);
+	rcu_read_unlock();
+	mm = mm_access(tsk, PTRACE_MODE_ATTACH_REALCREDS);
+	if (!mm || IS_ERR(mm)) {
+		ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
+		if (ret == -EACCES)
+			ret = -EPERM;
+		goto err;
+	}
+	ret = madvise_core(tsk, start, len_in, behavior);
+	mmput(mm);
+	put_task_struct(tsk);
+err:
+	fdput(f);
+	return ret;
+}

From patchwork Mon May 20 03:52:53 2019
X-Patchwork-Submitter: Minchan Kim
X-Patchwork-Id: 10949861
From: Minchan Kim
To: Andrew Morton
Cc: LKML, linux-mm, Michal Hocko, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, Minchan Kim
Subject: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
Date: Mon, 20 May 2019 12:52:53 +0900
Message-Id: <20190520035254.57579-7-minchan@kernel.org>
In-Reply-To: <20190520035254.57579-1-minchan@kernel.org>
References: <20190520035254.57579-1-minchan@kernel.org>

Currently, the process_madvise syscall works on only one address range, so
the user has to call the syscall several times to give hints to multiple
address ranges. This patch extends the process_madvise syscall to support
multiple hints, address ranges and return values so that the user can give
all the hints at once.

	struct pr_madvise_param {
		int size;			/* the size of this structure */
		const struct iovec __user *vec;	/* address range array */
	};

	int process_madvise(int pidfd, ssize_t nr_elem,
			    int *behavior,
			    struct pr_madvise_param *results,
			    struct pr_madvise_param *ranges,
			    unsigned long flags);

- pidfd

  target process fd

- nr_elem

  the number of elements in the behavior, results and ranges arrays

- behavior

  hints for each address range in the remote process, so that the user
  can give a different hint for each range

- results

  array of buffers that receive the result of the action on the
  associated remote address range

- ranges

  array of buffers that hold the remote process's address ranges to be
  processed

- flags

  extra argument for the future. It must be zero for now.

Example)

struct pr_madvise_param {
	int size;
	const struct iovec *vec;
};

int main(int argc, char *argv[])
{
	struct pr_madvise_param retp, rangep;
	struct iovec result_vec[2], range_vec[2];
	int hints[2];
	long ret[2];
	void *addr[2];

	pid_t pid;
	char cmd[64] = {0,};
	addr[0] = mmap(NULL, ALLOC_SIZE, PROT_READ|PROT_WRITE,
			MAP_POPULATE|MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);

	if (MAP_FAILED == addr[0])
		return 1;

	addr[1] = mmap(NULL, ALLOC_SIZE, PROT_READ|PROT_WRITE,
			MAP_POPULATE|MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);

	if (MAP_FAILED == addr[1])
		return 1;

	hints[0] = MADV_COLD;
	range_vec[0].iov_base = addr[0];
	range_vec[0].iov_len = ALLOC_SIZE;
	result_vec[0].iov_base = &ret[0];
	result_vec[0].iov_len = sizeof(long);
	retp.vec = result_vec;
	retp.size = sizeof(struct pr_madvise_param);

	hints[1] = MADV_COOL;
	range_vec[1].iov_base = addr[1];
	range_vec[1].iov_len = ALLOC_SIZE;
	result_vec[1].iov_base = &ret[1];
	result_vec[1].iov_len = sizeof(long);
	rangep.vec = range_vec;
	rangep.size = sizeof(struct pr_madvise_param);

	pid = fork();
	if (!pid) {
		sleep(10);
	} else {
		int pidfd;

		/* the pidfd is an fd on the child's /proc/<pid> directory */
		snprintf(cmd, sizeof(cmd), "/proc/%d", pid);
		pidfd = open(cmd, O_DIRECTORY | O_CLOEXEC);
		if (pidfd < 0)
			return 1;

		/* munmap to make pages private for the child */
		munmap(addr[0], ALLOC_SIZE);
		munmap(addr[1], ALLOC_SIZE);
		system("cat /proc/vmstat | egrep 'pswpout|deactivate'");
		if (syscall(__NR_process_madvise, pidfd, 2, hints,
						&retp, &rangep, 0))
			perror("process_madvise fail\n");
		system("cat /proc/vmstat | egrep 'pswpout|deactivate'");
	}

	return 0;
}

Signed-off-by: Minchan Kim
---
 include/uapi/asm-generic/mman-common.h |   5 +
 mm/madvise.c                           | 184 +++++++++++++++++++++----
 2 files changed, 166 insertions(+), 23 deletions(-)

diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index b9b51eeb8e1a..b8e230de84a6 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -74,4 +74,9 @@
 #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
 				 PKEY_DISABLE_WRITE)
 
+struct pr_madvise_param {
+	int size;			/* the size of this structure */
+	const struct iovec __user *vec;	/* address range array */
+};
+
 #endif /* __ASM_GENERIC_MMAN_COMMON_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index af02aa17e5c1..f4f569dac2bd 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -320,6 +320,7 @@ static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,
 	struct page *page;
 	struct vm_area_struct *vma = walk->vma;
 	unsigned long next;
+	long nr_pages = 0;
 
 	next = pmd_addr_end(addr, end);
 	if (pmd_trans_huge(*pmd)) {
@@ -380,9 +381,12 @@ static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,
 
 		ptep_test_and_clear_young(vma, addr, pte);
 		deactivate_page(page);
+		nr_pages++;
+
 	}
 
 	pte_unmap_unlock(orig_pte, ptl);
+	*(long *)walk->private += nr_pages;
 	cond_resched();
 
 	return 0;
@@ -390,11 +394,13 @@ static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,
 
 static void madvise_cool_page_range(struct mmu_gather *tlb,
 			     struct vm_area_struct *vma,
-			     unsigned long addr, unsigned long end)
+			     unsigned long addr, unsigned long end,
+			     long *nr_pages)
 {
 	struct mm_walk cool_walk = {
 		.pmd_entry = madvise_cool_pte_range,
 		.mm = vma->vm_mm,
+		.private = nr_pages
 	};
 
 	tlb_start_vma(tlb, vma);
@@ -403,7 +409,8 @@ static void madvise_cool_page_range(struct mmu_gather *tlb,
 }
 
 static long madvise_cool(struct vm_area_struct *vma,
-			unsigned long start_addr, unsigned long end_addr)
+			unsigned long start_addr, unsigned long end_addr,
+			long *nr_pages)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_gather tlb;
@@ -413,7 +420,7 @@ static long madvise_cool(struct vm_area_struct *vma,
 	lru_add_drain();
 
 	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
-	madvise_cool_page_range(&tlb, vma, start_addr, end_addr);
+	madvise_cool_page_range(&tlb, vma, start_addr, end_addr, nr_pages);
 	tlb_finish_mmu(&tlb, start_addr, end_addr);
 
 	return 0;
@@ -429,6 +436,7 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
 	int isolated = 0;
 	struct vm_area_struct *vma = walk->vma;
 	unsigned long next;
+	long nr_pages = 0;
 
 	next = pmd_addr_end(addr, end);
 	if (pmd_trans_huge(*pmd)) {
@@ -492,7 +500,7 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
 		list_add(&page->lru, &page_list);
 		if (isolated >= SWAP_CLUSTER_MAX) {
 			pte_unmap_unlock(orig_pte, ptl);
-			reclaim_pages(&page_list);
+			nr_pages += reclaim_pages(&page_list);
 			isolated = 0;
 			pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 			orig_pte = pte;
@@ -500,19 +508,22 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
 	}
 
 	pte_unmap_unlock(orig_pte, ptl);
-	reclaim_pages(&page_list);
+	nr_pages += reclaim_pages(&page_list);
 	cond_resched();
 
+	*(long *)walk->private += nr_pages;
 	return 0;
 }
 
 static void madvise_cold_page_range(struct mmu_gather *tlb,
 			     struct vm_area_struct *vma,
-			     unsigned long addr, unsigned long end)
+			     unsigned long addr, unsigned long end,
+			     long *nr_pages)
 {
 	struct mm_walk warm_walk = {
 		.pmd_entry = madvise_cold_pte_range,
 		.mm = vma->vm_mm,
+		.private = nr_pages,
 	};
 
 	tlb_start_vma(tlb, vma);
@@ -522,7 +533,8 @@ static void madvise_cold_page_range(struct mmu_gather *tlb,
 
 
 static long madvise_cold(struct vm_area_struct *vma,
-			unsigned long start_addr, unsigned long end_addr)
+			unsigned long start_addr, unsigned long end_addr,
+			long *nr_pages)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_gather tlb;
@@ -532,7 +544,7 @@ static long madvise_cold(struct vm_area_struct *vma,
 	lru_add_drain();
 
 	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
-	madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
+	madvise_cold_page_range(&tlb, vma, start_addr, end_addr, nr_pages);
 	tlb_finish_mmu(&tlb, start_addr, end_addr);
 
 	return 0;
@@ -922,7 +934,7 @@ static int madvise_inject_error(int behavior,
 static long
 madvise_vma(struct task_struct *tsk, struct vm_area_struct *vma,
 		struct vm_area_struct **prev, unsigned long start,
-		unsigned long end, int behavior)
+		unsigned long end, int behavior, long *nr_pages)
 {
 	switch (behavior) {
 	case MADV_REMOVE:
@@ -930,9 +942,9 @@ madvise_vma(struct task_struct *tsk, struct vm_area_struct *vma,
 	case MADV_WILLNEED:
 		return madvise_willneed(vma, prev, start, end);
 	case MADV_COOL:
-		return madvise_cool(vma, start, end);
+		return madvise_cool(vma, start, end, nr_pages);
 	case MADV_COLD:
-		return madvise_cold(vma, start, end);
+		return madvise_cold(vma, start, end, nr_pages);
 	case MADV_FREE:
 	case MADV_DONTNEED:
 		return madvise_dontneed_free(tsk, vma, prev, start,
@@ -981,7 +993,7 @@ madvise_behavior_valid(int behavior)
 }
 
 static int madvise_core(struct task_struct *tsk, unsigned long start,
-			size_t len_in, int behavior)
+			size_t len_in, int behavior, long *nr_pages)
 {
 	unsigned long end, tmp;
 	struct vm_area_struct *vma, *prev;
@@ -996,6 +1008,7 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
 
 	if (start & ~PAGE_MASK)
 		return error;
+
 	len = (len_in + ~PAGE_MASK) & PAGE_MASK;
 
 	/* Check to see whether len was rounded up from small -ve to zero */
@@ -1035,6 +1048,8 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
 	blk_start_plug(&plug);
 	for (;;) {
 		/* Still start < end. */
+		long pages = 0;
+
 		error = -ENOMEM;
 		if (!vma)
 			goto out;
@@ -1053,9 +1068,11 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
 			tmp = end;
 
 		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
-		error = madvise_vma(tsk, vma, &prev, start, tmp, behavior);
+		error = madvise_vma(tsk, vma, &prev, start, tmp,
+					behavior, &pages);
 		if (error)
 			goto out;
+		*nr_pages += pages;
 		start = tmp;
 		if (prev && start < prev->vm_end)
 			start = prev->vm_end;
@@ -1140,26 +1157,137 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
  */
 SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 {
-	return madvise_core(current, start, len_in, behavior);
+	unsigned long dummy;
+
+	return madvise_core(current, start, len_in, behavior, &dummy);
 }
 
-SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start,
-		size_t, len_in, int, behavior)
+static int pr_madvise_copy_param(struct pr_madvise_param __user *u_param,
+		struct pr_madvise_param *param)
+{
+	u32 size;
+	int ret;
+
+	memset(param, 0, sizeof(*param));
+
+	ret = get_user(size, &u_param->size);
+	if (ret)
+		return ret;
+
+	if (size > PAGE_SIZE)
+		return -E2BIG;
+
+	if (!size || size > sizeof(struct pr_madvise_param))
+		return -EINVAL;
+
+	ret = copy_from_user(param, u_param, size);
+	if (ret)
+		return -EFAULT;
+
+	return ret;
+}
+
+static int process_madvise_core(struct task_struct *tsk, int *behaviors,
+				struct iov_iter *iter,
+				const struct iovec *range_vec,
+				unsigned long riovcnt,
+				unsigned long flags)
+{
+	int i;
+	long err;
+
+	for (err = 0, i = 0; i < riovcnt && iov_iter_count(iter); i++) {
+		long ret = 0;
+
+		err = madvise_core(tsk, (unsigned long)range_vec[i].iov_base,
+				range_vec[i].iov_len, behaviors[i],
+				&ret);
+		if (err)
+			ret = err;
+
+		if (copy_to_iter(&ret, sizeof(long), iter) !=
+				sizeof(long)) {
+			err = -EFAULT;
+			break;
+		}
+
+		err = 0;
+	}
+
+	return err;
+}
+
+SYSCALL_DEFINE6(process_madvise, int, pidfd, ssize_t, nr_elem,
+			const int __user *, hints,
+			struct pr_madvise_param __user *, results,
+			struct pr_madvise_param __user *, ranges,
+			unsigned long, flags)
 {
 	int ret;
 	struct fd f;
 	struct pid *pid;
 	struct task_struct *tsk;
 	struct mm_struct *mm;
+	struct pr_madvise_param result_p, range_p;
+	const struct iovec __user *result_vec, __user *range_vec;
+	int *behaviors;
+	struct iovec iovstack_result[UIO_FASTIOV];
+	struct iovec iovstack_r[UIO_FASTIOV];
+	struct iovec *iov_l = iovstack_result;
+	struct iovec *iov_r = iovstack_r;
+	struct iov_iter iter;
+
+	if (flags != 0)
+		return -EINVAL;
+
+	ret = pr_madvise_copy_param(results, &result_p);
+	if (ret)
+		return ret;
+
+	ret = pr_madvise_copy_param(ranges, &range_p);
+	if (ret)
+		return ret;
+
+	result_vec = result_p.vec;
+	range_vec = range_p.vec;
+
+	if (result_p.size != sizeof(struct pr_madvise_param) ||
+			range_p.size != sizeof(struct pr_madvise_param))
+		return -EINVAL;
+
+	behaviors = kmalloc_array(nr_elem, sizeof(int), GFP_KERNEL);
+	if (!behaviors)
+		return -ENOMEM;
+
+	ret = copy_from_user(behaviors, hints, sizeof(int) * nr_elem);
+	if (ret < 0)
+		goto free_behavior_vec;
+
+	ret = import_iovec(READ, result_vec, nr_elem, UIO_FASTIOV,
+				&iov_l, &iter);
+	if (ret < 0)
+		goto free_behavior_vec;
+
+	if (!iov_iter_count(&iter)) {
+		ret = -EINVAL;
+		goto free_iovecs;
+	}
+
+	ret = rw_copy_check_uvector(CHECK_IOVEC_ONLY, range_vec, nr_elem,
+				UIO_FASTIOV, iovstack_r, &iov_r);
+	if (ret <= 0)
+		goto free_iovecs;
 
 	f = fdget(pidfd);
-	if (!f.file)
-		return -EBADF;
+	if (!f.file) {
+		ret = -EBADF;
+		goto free_iovecs;
+	}
 
 	pid = pidfd_to_pid(f.file);
 	if (IS_ERR(pid)) {
 		ret = PTR_ERR(pid);
-		goto err;
+		goto put_fd;
 	}
 
 	ret = -EINVAL;
@@ -1167,7 +1295,7 @@ SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start,
 	tsk = pid_task(pid, PIDTYPE_PID);
 	if (!tsk) {
 		rcu_read_unlock();
-		goto err;
+		goto put_fd;
 	}
 	get_task_struct(tsk);
 	rcu_read_unlock();
@@ -1176,12 +1304,22 @@ SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start,
 		ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
 		if (ret == -EACCES)
 			ret = -EPERM;
-		goto err;
+		goto put_task;
 	}
-	ret = madvise_core(tsk, start, len_in, behavior);
+
+	ret = process_madvise_core(tsk, behaviors, &iter, iov_r,
+					nr_elem, flags);
 	mmput(mm);
+put_task:
 	put_task_struct(tsk);
-err:
+put_fd:
 	fdput(f);
+free_iovecs:
+	if (iov_r != iovstack_r)
+		kfree(iov_r);
+	kfree(iov_l);
+free_behavior_vec:
+	kfree(behaviors);
+
 	return ret;
 }

From patchwork Mon May 20 03:52:54 2019
X-Patchwork-Submitter: Minchan Kim
X-Patchwork-Id: 10949863
From: Minchan Kim
To: Andrew Morton
Cc: LKML, linux-mm, Michal Hocko, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, Minchan Kim
Subject: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
Date: Mon, 20 May 2019 12:52:54 +0900
Message-Id: <20190520035254.57579-8-minchan@kernel.org>
In-Reply-To: <20190520035254.57579-1-minchan@kernel.org>
References: <20190520035254.57579-1-minchan@kernel.org>

A system could have a much faster swap device such as zRAM. In that case,
swapping is much cheaper than file I/O on low-end storage.
In this configuration, userspace could apply a different strategy to each
kind of vma. IOW, it may want to reclaim anonymous pages with MADV_COLD
while keeping file-backed pages in the inactive LRU with MADV_COOL, because
file I/O is more expensive in this case, so it wants to keep those pages in
memory until memory pressure happens.

To make such a strategy easier to implement, this patch introduces the
MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options for madvise(2),
similar to the filters that /proc/<pid>/clear_refs already supports.
These filters can be ORed with the existing hints and occupy the top two
bits of (int behavior).

Once either of them is set, the hint affects only the vmas of interest,
either anonymous or file-backed.

With that, a user can simply call the process_madvise syscall with the
entire range (0x0 - 0xFFFFFFFFFFFFFFFF) plus either
MADV_ANONYMOUS_FILTER or MADV_FILE_FILTER, so there is no need to call
the syscall range by range.

* from v1r2
  * use consistent check with clear_refs to identify anon/file vma - surenb

* from v1r1
  * use naming "filter" for new madvise option - dancol

Signed-off-by: Minchan Kim
---
 include/uapi/asm-generic/mman-common.h |  5 +++++
 mm/madvise.c                           | 14 ++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index b8e230de84a6..be59a1b90284 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -66,6 +66,11 @@
 #define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
 #define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
 
+#define MADV_BEHAVIOR_MASK (~(MADV_ANONYMOUS_FILTER|MADV_FILE_FILTER))
+
+#define MADV_ANONYMOUS_FILTER	(1<<31)	/* works for only anonymous vma */
+#define MADV_FILE_FILTER	(1<<30)	/* works for only file-backed vma */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/madvise.c b/mm/madvise.c
index f4f569dac2bd..116131243540 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1002,7 +1002,15 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
 	int write;
 	size_t len;
 	struct blk_plug plug;
+	bool anon_only, file_only;
 
+	anon_only = behavior & MADV_ANONYMOUS_FILTER;
+	file_only = behavior & MADV_FILE_FILTER;
+
+	if (anon_only && file_only)
+		return error;
+
+	behavior = behavior & MADV_BEHAVIOR_MASK;
 	if (!madvise_behavior_valid(behavior))
 		return error;
 
@@ -1067,12 +1075,18 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
 		if (end < tmp)
 			tmp = end;
 
+		if (anon_only && vma->vm_file)
+			goto next;
+		if (file_only && !vma->vm_file)
+			goto next;
+
 		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
 		error = madvise_vma(tsk, vma, &prev, start, tmp,
 					behavior, &pages);
 		if (error)
 			goto out;
 		*nr_pages += pages;
+next:
 		start = tmp;
 		if (prev && start < prev->vm_end)
 			start = prev->vm_end;
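
The filter handling in this patch can be modelled in userspace. The
following is a minimal sketch under stated assumptions, not the kernel
code: every sketch_* name is hypothetical, the boolean file_backed
parameter stands in for the vma->vm_file != NULL check, and the filter
bits are written as unsigned constants so that the sketch does not shift
into the sign bit of an int. It returns -1 where the patch returns its
pending error value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two filter bits and the hint mask. */
#define SKETCH_ANON_FILTER   (1U << 31)  /* apply only to anonymous vmas */
#define SKETCH_FILE_FILTER   (1U << 30)  /* apply only to file-backed vmas */
#define SKETCH_BEHAVIOR_MASK (~(SKETCH_ANON_FILTER | SKETCH_FILE_FILTER))

/* The filter bits must not overlap the bits that carry the hint. */
static_assert((SKETCH_BEHAVIOR_MASK &
               (SKETCH_ANON_FILTER | SKETCH_FILE_FILTER)) == 0,
              "filter bits must lie outside the hint mask");

/* Decoded form of a behavior argument (illustrative type). */
struct sketch_request {
    uint32_t hint;      /* behavior with the filter bits cleared */
    bool anon_only;     /* anonymous filter was requested */
    bool file_only;     /* file-backed filter was requested */
};

/*
 * Split behavior into its filter bits and the plain hint.
 * Returns 0 on success, or -1 when both filters are set at once.
 */
int sketch_decode(uint32_t behavior, struct sketch_request *req)
{
    req->anon_only = (behavior & SKETCH_ANON_FILTER) != 0;
    req->file_only = (behavior & SKETCH_FILE_FILTER) != 0;

    if (req->anon_only && req->file_only)
        return -1;

    req->hint = behavior & SKETCH_BEHAVIOR_MASK;
    return 0;
}

/*
 * Decide whether a mapping is skipped by the active filter.
 * file_backed models "vma->vm_file is non-NULL".
 */
bool sketch_skip_vma(const struct sketch_request *req, bool file_backed)
{
    if (req->anon_only && file_backed)
        return true;
    if (req->file_only && !file_backed)
        return true;
    return false;
}
```

For example, decoding SKETCH_ANON_FILTER | 18 yields hint 18 with
anon_only set, so a file-backed mapping is skipped while an anonymous
one is passed on to the hint. With no filter bit set, nothing is
skipped.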