From patchwork Fri May 31 06:43:08 2019
X-Patchwork-Submitter: Minchan Kim
X-Patchwork-Id: 10969659
From: Minchan Kim
To: Andrew Morton
Cc: linux-mm, LKML, linux-api@vger.kernel.org, Michal Hocko, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh@google.com, oleg@redhat.com, christian@brauner.io, oleksandr@redhat.com, hdanton@sina.com, Minchan Kim
Subject: [RFCv2 1/6] mm: introduce MADV_COLD
Date: Fri, 31 May 2019 15:43:08 +0900
Message-Id: <20190531064313.193437-2-minchan@kernel.org>
In-Reply-To: <20190531064313.193437-1-minchan@kernel.org>
References: <20190531064313.193437-1-minchan@kernel.org>

When a process expects no accesses to a certain memory range, it can hint the kernel that the pages can be reclaimed when memory pressure happens, while their data should be preserved for future use. This can reduce workingset eviction and so end up improving performance.

This patch introduces the new MADV_COLD hint for the madvise(2) syscall. A process can use MADV_COLD to mark a memory range as not expected to be used in the near future, which helps the kernel decide which pages to evict first under memory pressure.

Internally, it deactivates private pages: they are moved from the active list to the head of the inactive list. They go to the head rather than the tail because the inactive list may already be full of used-once pages, which should remain the first candidates for reclaim; this is the same reason MADV_FREE moves pages to the head of the inactive LRU list.
Under memory pressure, these pages will then be reclaimed earlier than other active pages unless they are accessed again in the meantime.

* RFCv1
  * renaming from MADV_COOL to MADV_COLD - hannes
* internal review
  * use clear_page_young in deactivate_page - joelaf
  * Revise the description - surenb
  * Renaming from MADV_WARM to MADV_COOL - surenb

Signed-off-by: Minchan Kim
---
 include/linux/page-flags.h             |   1 +
 include/linux/page_idle.h              |  15 ++++
 include/linux/swap.h                   |   1 +
 include/uapi/asm-generic/mman-common.h |   1 +
 mm/madvise.c                           | 111 +++++++++++++++++++++++++
 mm/swap.c                              |  43 ++++++++++
 6 files changed, 172 insertions(+)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 9f8712a4b1a5..58b06654c8dd 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -424,6 +424,7 @@ static inline bool set_hwpoison_free_buddy_page(struct page *page)
 TESTPAGEFLAG(Young, young, PF_ANY)
 SETPAGEFLAG(Young, young, PF_ANY)
 TESTCLEARFLAG(Young, young, PF_ANY)
+CLEARPAGEFLAG(Young, young, PF_ANY)
 PAGEFLAG(Idle, idle, PF_ANY)
 #endif

diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
index 1e894d34bdce..f3f43b317150 100644
--- a/include/linux/page_idle.h
+++ b/include/linux/page_idle.h
@@ -19,6 +19,11 @@ static inline void set_page_young(struct page *page)
 	SetPageYoung(page);
 }
 
+static inline void clear_page_young(struct page *page)
+{
+	ClearPageYoung(page);
+}
+
 static inline bool test_and_clear_page_young(struct page *page)
 {
 	return TestClearPageYoung(page);
@@ -65,6 +70,16 @@ static inline void set_page_young(struct page *page)
 	set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
 }
 
+static inline void clear_page_young(struct page *page)
+{
+	struct page_ext *page_ext = lookup_page_ext(page);
+
+	if (unlikely(!page_ext))
+		return;
+
+	clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
+}
+
 static inline bool test_and_clear_page_young(struct page *page)
 {
 	struct page_ext *page_ext = lookup_page_ext(page);

diff --git a/include/linux/swap.h b/include/linux/swap.h
index de2c67a33b7e..0ce997edb8bb 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -340,6 +340,7 @@ extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
+extern void deactivate_page(struct page *page);
 extern void mark_page_lazyfree(struct page *page);
 
 extern void swap_setup(void);

diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index bea0278f65ab..1190f4e7f7b9 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -43,6 +43,7 @@
 #define MADV_SEQUENTIAL	2	/* expect sequential page references */
 #define MADV_WILLNEED	3	/* will need these pages */
 #define MADV_DONTNEED	4	/* don't need these pages */
+#define MADV_COLD	5	/* deactivate these pages */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_FREE	8	/* free pages only if memory pressure */

diff --git a/mm/madvise.c b/mm/madvise.c
index 628022e674a7..bff150eab6da 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -40,6 +40,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_REMOVE:
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
+	case MADV_COLD:
 	case MADV_FREE:
 		return 0;
 	default:
@@ -307,6 +308,113 @@ static long madvise_willneed(struct vm_area_struct *vma,
 	return 0;
 }
 
+static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
+				unsigned long end, struct mm_walk *walk)
+{
+	pte_t *orig_pte, *pte, ptent;
+	spinlock_t *ptl;
+	struct page *page;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned long next;
+
+	next = pmd_addr_end(addr, end);
+	if (pmd_trans_huge(*pmd)) {
+		ptl = pmd_trans_huge_lock(pmd, vma);
+		if (!ptl)
+			return 0;
+
+		if (is_huge_zero_pmd(*pmd))
+			goto huge_unlock;
+
+		page = pmd_page(*pmd);
+		if (page_mapcount(page) > 1)
+			goto huge_unlock;
+
+		if (next - addr != HPAGE_PMD_SIZE) {
+			int err;
+
+			get_page(page);
+			spin_unlock(ptl);
+			lock_page(page);
+			err = split_huge_page(page);
+			unlock_page(page);
+			put_page(page);
+			if (!err)
+				goto regular_page;
+			return 0;
+		}
+
+		pmdp_test_and_clear_young(vma, addr, pmd);
+		deactivate_page(page);
+huge_unlock:
+		spin_unlock(ptl);
+		return 0;
+	}
+
+	if (pmd_trans_unstable(pmd))
+		return 0;
+
+regular_page:
+	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
+		ptent = *pte;
+
+		if (pte_none(ptent))
+			continue;
+
+		if (!pte_present(ptent))
+			continue;
+
+		page = vm_normal_page(vma, addr, ptent);
+		if (!page)
+			continue;
+
+		if (page_mapcount(page) > 1)
+			continue;
+
+		ptep_test_and_clear_young(vma, addr, pte);
+		deactivate_page(page);
+	}
+
+	pte_unmap_unlock(orig_pte, ptl);
+	cond_resched();
+
+	return 0;
+}
+
+static void madvise_cold_page_range(struct mmu_gather *tlb,
+			     struct vm_area_struct *vma,
+			     unsigned long addr, unsigned long end)
+{
+	struct mm_walk cool_walk = {
+		.pmd_entry = madvise_cold_pte_range,
+		.mm = vma->vm_mm,
+	};
+
+	tlb_start_vma(tlb, vma);
+	walk_page_range(addr, end, &cool_walk);
+	tlb_end_vma(tlb, vma);
+}
+
+static long madvise_cold(struct vm_area_struct *vma,
+			struct vm_area_struct **prev,
+			unsigned long start_addr, unsigned long end_addr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct mmu_gather tlb;
+
+	*prev = vma;
+	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
+		return -EINVAL;
+
+	lru_add_drain();
+	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
+	madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
+	tlb_finish_mmu(&tlb, start_addr, end_addr);
+
+	return 0;
+}
+
 static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 				unsigned long end, struct mm_walk *walk)
@@ -695,6 +803,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		return madvise_remove(vma, prev, start, end);
 	case MADV_WILLNEED:
 		return madvise_willneed(vma, prev, start, end);
+	case MADV_COLD:
+		return madvise_cold(vma, prev, start, end);
 	case MADV_FREE:
 	case MADV_DONTNEED:
 		return madvise_dontneed_free(vma, prev, start, end, behavior);
@@ -716,6 +826,7 @@ madvise_behavior_valid(int behavior)
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
 	case MADV_FREE:
+	case MADV_COLD:
 #ifdef CONFIG_KSM
 	case MADV_MERGEABLE:
 	case MADV_UNMERGEABLE:

diff --git a/mm/swap.c b/mm/swap.c
index 7b079976cbec..cebedab15aa2 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -47,6 +47,7 @@ int page_cluster;
 static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);
 #ifdef CONFIG_SMP
 static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
@@ -538,6 +539,23 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 	update_page_reclaim_stat(lruvec, file, 0);
 }
 
+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
+			    void *arg)
+{
+	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+		int file = page_is_file_cache(page);
+		int lru = page_lru_base_type(page);
+
+		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
+		ClearPageActive(page);
+		ClearPageReferenced(page);
+		clear_page_young(page);
+		add_page_to_lru_list(page, lruvec, lru);
+
+		__count_vm_events(PGDEACTIVATE, hpage_nr_pages(page));
+		update_page_reclaim_stat(lruvec, file, 0);
+	}
+}
 
 static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
 			    void *arg)
@@ -590,6 +608,10 @@ void lru_add_drain_cpu(int cpu)
 	if (pagevec_count(pvec))
 		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
 
+	pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+	if (pagevec_count(pvec))
+		pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+
 	pvec = &per_cpu(lru_lazyfree_pvecs, cpu);
 	if (pagevec_count(pvec))
 		pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
@@ -623,6 +645,26 @@ void deactivate_file_page(struct page *page)
 	}
 }
 
+/*
+ * deactivate_page - deactivate a page
+ * @page: page to deactivate
+ *
+ * deactivate_page() moves @page to the inactive list if @page was on the
+ * active list and was not an unevictable page. This is done to accelerate
+ * the reclaim of @page.
+ */
+void deactivate_page(struct page *page)
+{
+	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+		struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
+
+		get_page(page);
+		if (!pagevec_add(pvec, page) || PageCompound(page))
+			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+		put_cpu_var(lru_deactivate_pvecs);
+	}
+}
+
 /**
  * mark_page_lazyfree - make an anon page lazyfree
  * @page: page to deactivate
@@ -687,6 +729,7 @@ void lru_add_drain_all(void)
 		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
 		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
 		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
 		    pagevec_count(&per_cpu(lru_lazyfree_pvecs, cpu)) ||
 		    need_activate_page_drain(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);

From patchwork Fri May 31 06:43:09 2019
X-Patchwork-Submitter: Minchan Kim
X-Patchwork-Id: 10969661
From: Minchan Kim
To: Andrew Morton
Cc: linux-mm, LKML, linux-api@vger.kernel.org, Michal Hocko, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh@google.com, oleg@redhat.com, christian@brauner.io, oleksandr@redhat.com, hdanton@sina.com, Minchan Kim
Subject: [RFCv2 2/6] mm: replace PAGEREF_RECLAIM_CLEAN with PAGEREF_RECLAIM
Date: Fri, 31 May 2019 15:43:09 +0900
Message-Id: <20190531064313.193437-3-minchan@kernel.org>
In-Reply-To: <20190531064313.193437-1-minchan@kernel.org>
References: <20190531064313.193437-1-minchan@kernel.org>

In shrink_page_list, the local variable `references` defaults to PAGEREF_RECLAIM_CLEAN. The intent was to avoid reclaiming dirty pages while CMA tries to migrate them, but strictly speaking it is unnecessary: CMA already forbids writeback via .may_writepage = 0 in reclaim_clean_pages_from_list. Worse, the default prevents anonymous pages from being swapped out even when force_reclaim = true, which an upcoming patch relies on.

So this patch changes the default value of `references` to PAGEREF_RECLAIM and renames force_reclaim to ignore_references to make its meaning clearer. This is preparatory work for the next patch.
* RFCv1
  * use ignore_references as parameter name - hannes

Acked-by: Johannes Weiner
Signed-off-by: Minchan Kim
---
 mm/vmscan.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 84dcb651d05c..0973a46a0472 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1102,7 +1102,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				      struct scan_control *sc,
 				      enum ttu_flags ttu_flags,
 				      struct reclaim_stat *stat,
-				      bool force_reclaim)
+				      bool ignore_references)
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
@@ -1116,7 +1116,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		struct address_space *mapping;
 		struct page *page;
 		int may_enter_fs;
-		enum page_references references = PAGEREF_RECLAIM_CLEAN;
+		enum page_references references = PAGEREF_RECLAIM;
 		bool dirty, writeback;
 		unsigned int nr_pages;
@@ -1247,7 +1247,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			}
 		}
 
-		if (!force_reclaim)
+		if (!ignore_references)
 			references = page_check_references(page, sc);
 
 		switch (references) {

From patchwork Fri May 31 06:43:10 2019
X-Patchwork-Submitter: Minchan Kim
X-Patchwork-Id: 10969663
From: Minchan Kim
To: Andrew Morton
Cc: linux-mm, LKML, linux-api@vger.kernel.org, Michal Hocko, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh@google.com, oleg@redhat.com, christian@brauner.io, oleksandr@redhat.com, hdanton@sina.com, Minchan Kim
Subject: [RFCv2 3/6] mm: introduce MADV_PAGEOUT
Date: Fri, 31 May 2019 15:43:10 +0900
Message-Id: <20190531064313.193437-4-minchan@kernel.org>
In-Reply-To: <20190531064313.193437-1-minchan@kernel.org>
References: <20190531064313.193437-1-minchan@kernel.org>

When a process expects no accesses to a certain memory range for a long time, it can hint the kernel that the pages can be reclaimed immediately, while their data should be preserved for future use. This can reduce workingset eviction and so end up improving performance.

This patch introduces the new MADV_PAGEOUT hint for the madvise(2) syscall. A process can use MADV_PAGEOUT to mark a memory range as not expected to be used for a long time, so that the kernel reclaims the memory immediately. The hint helps the kernel decide which pages to evict proactively.
* RFCv1
 * rename from MADV_COLD to MADV_PAGEOUT - hannes
 * bail out if process is being killed - Hillf
 * fix reclaim_pages bugs - Hillf

Signed-off-by: Minchan Kim
---
 include/linux/swap.h                   |   1 +
 include/uapi/asm-generic/mman-common.h |   1 +
 mm/madvise.c                           | 126 +++++++++++++++++++++++++
 mm/vmscan.c                            |  77 +++++++++++++++
 4 files changed, 205 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0ce997edb8bb..063c0c1e112b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -365,6 +365,7 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern unsigned long vm_total_pages;
+extern unsigned long reclaim_pages(struct list_head *page_list);
 #ifdef CONFIG_NUMA
 extern int node_reclaim_mode;
 extern int sysctl_min_unmapped_ratio;
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 1190f4e7f7b9..92e347a89ddc 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -44,6 +44,7 @@
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
 #define MADV_COLD	5		/* deactivatie these pages */
+#define MADV_PAGEOUT	6		/* reclaim these pages */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_FREE	8		/* free pages only if memory pressure */
diff --git a/mm/madvise.c b/mm/madvise.c
index bff150eab6da..9d749a1420b4 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -41,6 +41,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
 	case MADV_COLD:
+	case MADV_PAGEOUT:
 	case MADV_FREE:
 		return 0;
 	default:
@@ -415,6 +416,128 @@ static long madvise_cold(struct vm_area_struct *vma,
 	return 0;
 }
 
+static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
+				unsigned long end, struct mm_walk *walk)
+{
+	pte_t *orig_pte, *pte, ptent;
+	spinlock_t *ptl;
+	LIST_HEAD(page_list);
+	struct page *page;
+	int isolated = 0;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned long next;
+
+	if (fatal_signal_pending(current))
+		return -EINTR;
+
+	next = pmd_addr_end(addr, end);
+	if (pmd_trans_huge(*pmd)) {
+		ptl = pmd_trans_huge_lock(pmd, vma);
+		if (!ptl)
+			return 0;
+
+		if (is_huge_zero_pmd(*pmd))
+			goto huge_unlock;
+
+		page = pmd_page(*pmd);
+		if (page_mapcount(page) > 1)
+			goto huge_unlock;
+
+		if (next - addr != HPAGE_PMD_SIZE) {
+			int err;
+
+			get_page(page);
+			spin_unlock(ptl);
+			lock_page(page);
+			err = split_huge_page(page);
+			unlock_page(page);
+			put_page(page);
+			if (!err)
+				goto regular_page;
+			return 0;
+		}
+
+		if (isolate_lru_page(page))
+			goto huge_unlock;
+
+		list_add(&page->lru, &page_list);
+huge_unlock:
+		spin_unlock(ptl);
+		reclaim_pages(&page_list);
+		return 0;
+	}
+
+	if (pmd_trans_unstable(pmd))
+		return 0;
+regular_page:
+	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
+		ptent = *pte;
+		if (!pte_present(ptent))
+			continue;
+
+		page = vm_normal_page(vma, addr, ptent);
+		if (!page)
+			continue;
+
+		if (page_mapcount(page) > 1)
+			continue;
+
+		if (isolate_lru_page(page))
+			continue;
+
+		isolated++;
+		list_add(&page->lru, &page_list);
+		if (isolated >= SWAP_CLUSTER_MAX) {
+			pte_unmap_unlock(orig_pte, ptl);
+			reclaim_pages(&page_list);
+			isolated = 0;
+			pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+			orig_pte = pte;
+		}
+	}
+
+	pte_unmap_unlock(orig_pte, ptl);
+	reclaim_pages(&page_list);
+	cond_resched();
+
+	return 0;
+}
+
+static void madvise_pageout_page_range(struct mmu_gather *tlb,
+			     struct vm_area_struct *vma,
+			     unsigned long addr, unsigned long end)
+{
+	struct mm_walk warm_walk = {
+		.pmd_entry = madvise_pageout_pte_range,
+		.mm = vma->vm_mm,
+	};
+
+	tlb_start_vma(tlb, vma);
+	walk_page_range(addr, end, &warm_walk);
+	tlb_end_vma(tlb, vma);
+}
+
+
+static long madvise_pageout(struct vm_area_struct *vma,
+			struct vm_area_struct **prev,
+			unsigned long start_addr, unsigned long end_addr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct mmu_gather tlb;
+
+	*prev = vma;
+	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
+		return -EINVAL;
+
+	lru_add_drain();
+	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
+	madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
+	tlb_finish_mmu(&tlb, start_addr, end_addr);
+
+	return 0;
+}
+
 static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 				unsigned long end, struct mm_walk *walk)
 
@@ -805,6 +928,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		return madvise_willneed(vma, prev, start, end);
 	case MADV_COLD:
 		return madvise_cold(vma, prev, start, end);
+	case MADV_PAGEOUT:
+		return madvise_pageout(vma, prev, start, end);
 	case MADV_FREE:
 	case MADV_DONTNEED:
 		return madvise_dontneed_free(vma, prev, start, end, behavior);
@@ -827,6 +952,7 @@ madvise_behavior_valid(int behavior)
 	case MADV_DONTNEED:
 	case MADV_FREE:
 	case MADV_COLD:
+	case MADV_PAGEOUT:
 #ifdef CONFIG_KSM
 	case MADV_MERGEABLE:
 	case MADV_UNMERGEABLE:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0973a46a0472..280dd808fb91 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2126,6 +2126,83 @@ static void shrink_active_list(unsigned long nr_to_scan,
 			nr_deactivate, nr_rotated, sc->priority, file);
 }
 
+unsigned long reclaim_pages(struct list_head *page_list)
+{
+	int nid = -1;
+	unsigned long nr_isolated[2] = {0, };
+	unsigned long nr_reclaimed = 0;
+	LIST_HEAD(node_page_list);
+	struct reclaim_stat dummy_stat;
+	struct scan_control sc = {
+		.gfp_mask = GFP_KERNEL,
+		.priority = DEF_PRIORITY,
+		.may_writepage = 1,
+		.may_unmap = 1,
+		.may_swap = 1,
+	};
+
+	while (!list_empty(page_list)) {
+		struct page *page;
+
+		page = lru_to_page(page_list);
+		if (nid == -1) {
+			nid = page_to_nid(page);
+			INIT_LIST_HEAD(&node_page_list);
+			nr_isolated[0] = nr_isolated[1] = 0;
+		}
+
+		if (nid == page_to_nid(page)) {
+			list_move(&page->lru, &node_page_list);
+			nr_isolated[!!page_is_file_cache(page)] +=
+						hpage_nr_pages(page);
+			continue;
+		}
+
+		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
+					nr_isolated[0]);
+		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
+					nr_isolated[1]);
+		nr_reclaimed += shrink_page_list(&node_page_list,
+				NODE_DATA(nid), &sc, TTU_IGNORE_ACCESS,
+				&dummy_stat, true);
+		while (!list_empty(&node_page_list)) {
+			struct page *page = lru_to_page(&node_page_list);
+
+			list_del(&page->lru);
+			putback_lru_page(page);
+		}
+		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
+					-nr_isolated[0]);
+		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
+					-nr_isolated[1]);
+		nid = -1;
+	}
+
+	if (!list_empty(&node_page_list)) {
+		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
+					nr_isolated[0]);
+		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
+					nr_isolated[1]);
+		nr_reclaimed += shrink_page_list(&node_page_list,
+				NODE_DATA(nid), &sc, TTU_IGNORE_ACCESS,
+				&dummy_stat, true);
+		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
+					-nr_isolated[0]);
+		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
+					-nr_isolated[1]);
+
+		while (!list_empty(&node_page_list)) {
+			struct page *page = lru_to_page(&node_page_list);
+
+			list_del(&page->lru);
+			putback_lru_page(page);
+		}
+
+	}
+
+	return nr_reclaimed;
+}
+
 /*
  * The inactive anon list should be small enough that the VM never has
  * to do too much work.
From patchwork Fri May 31 06:43:11 2019
X-Patchwork-Submitter: Minchan Kim
X-Patchwork-Id: 10969665
From: Minchan Kim
To: Andrew Morton
Cc: linux-mm, LKML, linux-api@vger.kernel.org, Michal Hocko, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh@google.com, oleg@redhat.com, christian@brauner.io, oleksandr@redhat.com, hdanton@sina.com, Minchan Kim
Subject: [RFCv2 4/6] mm: factor out madvise's core functionality
Date: Fri, 31 May 2019 15:43:11 +0900
Message-Id: <20190531064313.193437-5-minchan@kernel.org>
In-Reply-To: <20190531064313.193437-1-minchan@kernel.org>
References: <20190531064313.193437-1-minchan@kernel.org>

This patch factors out madvise's core functionality so that an upcoming
patch can reuse it without duplication. It should not change any
behavior.
Signed-off-by: Minchan Kim
---
 mm/madvise.c | 188 +++++++++++++++++++++++++------------------------
 1 file changed, 101 insertions(+), 87 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 9d749a1420b4..466623ea8c36 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -425,9 +425,10 @@ static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
 	struct page *page;
 	int isolated = 0;
 	struct vm_area_struct *vma = walk->vma;
+	struct task_struct *task = walk->private;
 	unsigned long next;
 
-	if (fatal_signal_pending(current))
+	if (fatal_signal_pending(task))
 		return -EINTR;
 
 	next = pmd_addr_end(addr, end);
@@ -505,12 +506,14 @@ static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
 }
 
 static void madvise_pageout_page_range(struct mmu_gather *tlb,
-			     struct vm_area_struct *vma,
-			     unsigned long addr, unsigned long end)
+			struct task_struct *task,
+			struct vm_area_struct *vma,
+			unsigned long addr, unsigned long end)
 {
 	struct mm_walk warm_walk = {
 		.pmd_entry = madvise_pageout_pte_range,
 		.mm = vma->vm_mm,
+		.private = task,
 	};
 
 	tlb_start_vma(tlb, vma);
@@ -519,9 +522,9 @@ static void madvise_pageout_page_range(struct mmu_gather *tlb,
 }
 
 
-static long madvise_pageout(struct vm_area_struct *vma,
-			struct vm_area_struct **prev,
-			unsigned long start_addr, unsigned long end_addr)
+static long madvise_pageout(struct task_struct *task,
+		struct vm_area_struct *vma, struct vm_area_struct **prev,
+		unsigned long start_addr, unsigned long end_addr)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_gather tlb;
@@ -532,7 +535,7 @@ static long madvise_pageout(struct vm_area_struct *vma,
 
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
-	madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
+	madvise_pageout_page_range(&tlb, task, vma, start_addr, end_addr);
 	tlb_finish_mmu(&tlb, start_addr, end_addr);
 
 	return 0;
@@ -744,7 +747,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
 	return 0;
 }
 
-static long madvise_dontneed_free(struct vm_area_struct *vma,
+static long madvise_dontneed_free(struct mm_struct *mm,
+				  struct vm_area_struct *vma,
 				  struct vm_area_struct **prev,
 				  unsigned long start, unsigned long end,
 				  int behavior)
@@ -756,8 +760,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 	if (!userfaultfd_remove(vma, start, end)) {
 		*prev = NULL; /* mmap_sem has been dropped, prev is stale */
 
-		down_read(&current->mm->mmap_sem);
-		vma = find_vma(current->mm, start);
+		down_read(&mm->mmap_sem);
+		vma = find_vma(mm, start);
 		if (!vma)
 			return -ENOMEM;
 		if (start < vma->vm_start) {
@@ -804,7 +808,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
  * Application wants to free up the pages and associated backing store.
  * This is effectively punching a hole into the middle of a file.
  */
-static long madvise_remove(struct vm_area_struct *vma,
+static long madvise_remove(struct mm_struct *mm,
+				struct vm_area_struct *vma,
 				struct vm_area_struct **prev,
 				unsigned long start, unsigned long end)
 {
@@ -838,13 +843,13 @@ static long madvise_remove(struct vm_area_struct *vma,
 	get_file(f);
 	if (userfaultfd_remove(vma, start, end)) {
 		/* mmap_sem was not released by userfaultfd_remove() */
-		up_read(&current->mm->mmap_sem);
+		up_read(&mm->mmap_sem);
 	}
 	error = vfs_fallocate(f,
 				FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
 				offset, end - start);
 	fput(f);
-	down_read(&current->mm->mmap_sem);
+	down_read(&mm->mmap_sem);
 	return error;
 }
 
@@ -918,21 +923,23 @@ static int madvise_inject_error(int behavior,
 #endif
 
 static long
-madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
+madvise_vma(struct task_struct *task, struct mm_struct *mm,
+		struct vm_area_struct *vma, struct vm_area_struct **prev,
 		unsigned long start, unsigned long end, int behavior)
 {
 	switch (behavior) {
 	case MADV_REMOVE:
-		return madvise_remove(vma, prev, start, end);
+		return madvise_remove(mm, vma, prev, start, end);
 	case MADV_WILLNEED:
 		return madvise_willneed(vma, prev, start, end);
 	case MADV_COLD:
 		return madvise_cold(vma, prev, start, end);
 	case MADV_PAGEOUT:
-		return madvise_pageout(vma, prev, start, end);
+		return madvise_pageout(task, vma, prev, start, end);
 	case MADV_FREE:
 	case MADV_DONTNEED:
-		return madvise_dontneed_free(vma, prev, start, end, behavior);
+		return madvise_dontneed_free(mm, vma, prev, start,
+						end, behavior);
 	default:
 		return madvise_behavior(vma, prev, start, end, behavior);
 	}
@@ -976,68 +983,8 @@ madvise_behavior_valid(int behavior)
 	}
 }
 
-/*
- * The madvise(2) system call.
- *
- * Applications can use madvise() to advise the kernel how it should
- * handle paging I/O in this VM area.  The idea is to help the kernel
- * use appropriate read-ahead and caching techniques.  The information
- * provided is advisory only, and can be safely disregarded by the
- * kernel without affecting the correct operation of the application.
- *
- * behavior values:
- *  MADV_NORMAL - the default behavior is to read clusters.  This
- *		results in some read-ahead and read-behind.
- *  MADV_RANDOM - the system should read the minimum amount of data
- *		on any access, since it is unlikely that the appli-
- *		cation will need more than what it asks for.
- *  MADV_SEQUENTIAL - pages in the given range will probably be accessed
- *		once, so they can be aggressively read ahead, and
- *		can be freed soon after they are accessed.
- *  MADV_WILLNEED - the application is notifying the system to read
- *		some pages ahead.
- *  MADV_DONTNEED - the application is finished with the given range,
- *		so the kernel can free resources associated with it.
- *  MADV_FREE - the application marks pages in the given range as lazy free,
- *		where actual purges are postponed until memory pressure happens.
- *  MADV_REMOVE - the application wants to free up the given range of
- *		pages and associated backing store.
- *  MADV_DONTFORK - omit this area from child's address space when forking:
- *		typically, to avoid COWing pages pinned by get_user_pages().
- *  MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
- *  MADV_WIPEONFORK - present the child process with zero-filled memory in this
- *		range after a fork.
- *  MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
- *  MADV_HWPOISON - trigger memory error handler as if the given memory range
- *		were corrupted by unrecoverable hardware memory failure.
- *  MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
- *  MADV_MERGEABLE - the application recommends that KSM try to merge pages in
- *		this area with pages of identical content from other such areas.
- *  MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
- *  MADV_HUGEPAGE - the application wants to back the given range by transparent
- *		huge pages in the future. Existing pages might be coalesced and
- *		new pages might be allocated as THP.
- *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
- *		transparent huge pages so the existing pages will not be
- *		coalesced into THP and new pages will not be allocated as THP.
- *  MADV_DONTDUMP - the application wants to prevent pages in the given range
- *		from being included in its core dump.
- *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
- *
- * return values:
- *  zero    - success
- *  -EINVAL - start + len < 0, start is not page-aligned,
- *		"behavior" is not a valid value, or application
- *		is attempting to release locked or shared pages,
- *		or the specified address range includes file, Huge TLB,
- *		MAP_SHARED or VMPFNMAP range.
- *  -ENOMEM - addresses in the specified range are not currently
- *		mapped, or are outside the AS of the process.
- *  -EIO    - an I/O error occurred while paging in data.
- *  -EBADF  - map exists, but area maps something that isn't a file.
- *  -EAGAIN - a kernel resource was temporarily unavailable.
- */
-SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
+static int madvise_core(struct task_struct *task, struct mm_struct *mm,
+		unsigned long start, size_t len_in, int behavior)
 {
 	unsigned long end, tmp;
 	struct vm_area_struct *vma, *prev;
@@ -1068,15 +1015,16 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 
 #ifdef CONFIG_MEMORY_FAILURE
 	if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
-		return madvise_inject_error(behavior, start, start + len_in);
+		return madvise_inject_error(behavior,
+					start, start + len_in);
 #endif
 
 	write = madvise_need_mmap_write(behavior);
 	if (write) {
-		if (down_write_killable(&current->mm->mmap_sem))
+		if (down_write_killable(&mm->mmap_sem))
 			return -EINTR;
 	} else {
-		down_read(&current->mm->mmap_sem);
+		down_read(&mm->mmap_sem);
 	}
 
 	/*
@@ -1084,7 +1032,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 	 * ranges, just ignore them, but return -ENOMEM at the end.
 	 * - different from the way of handling in mlock etc.
 	 */
-	vma = find_vma_prev(current->mm, start, &prev);
+	vma = find_vma_prev(mm, start, &prev);
 	if (vma && start > vma->vm_start)
 		prev = vma;
 
@@ -1109,7 +1057,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 		tmp = end;
 
 		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
-		error = madvise_vma(vma, &prev, start, tmp, behavior);
+		error = madvise_vma(task, mm, vma, &prev, start, tmp, behavior);
 		if (error)
 			goto out;
 		start = tmp;
@@ -1121,14 +1069,80 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 		if (prev)
 			vma = prev->vm_next;
 		else	/* madvise_remove dropped mmap_sem */
-			vma = find_vma(current->mm, start);
+			vma = find_vma(mm, start);
 	}
 out:
 	blk_finish_plug(&plug);
 	if (write)
-		up_write(&current->mm->mmap_sem);
+		up_write(&mm->mmap_sem);
 	else
-		up_read(&current->mm->mmap_sem);
+		up_read(&mm->mmap_sem);
 
 	return error;
 }
+
+/*
+ * The madvise(2) system call.
+ *
+ * Applications can use madvise() to advise the kernel how it should
+ * handle paging I/O in this VM area.  The idea is to help the kernel
+ * use appropriate read-ahead and caching techniques.  The information
+ * provided is advisory only, and can be safely disregarded by the
+ * kernel without affecting the correct operation of the application.
+ *
+ * behavior values:
+ *  MADV_NORMAL - the default behavior is to read clusters.  This
+ *		results in some read-ahead and read-behind.
+ *  MADV_RANDOM - the system should read the minimum amount of data
+ *		on any access, since it is unlikely that the appli-
+ *		cation will need more than what it asks for.
+ *  MADV_SEQUENTIAL - pages in the given range will probably be accessed
+ *		once, so they can be aggressively read ahead, and
+ *		can be freed soon after they are accessed.
+ *  MADV_WILLNEED - the application is notifying the system to read
+ *		some pages ahead.
+ *  MADV_DONTNEED - the application is finished with the given range,
+ *		so the kernel can free resources associated with it.
+ *  MADV_FREE - the application marks pages in the given range as lazy free,
+ *		where actual purges are postponed until memory pressure happens.
+ *  MADV_REMOVE - the application wants to free up the given range of
+ *		pages and associated backing store.
+ *  MADV_DONTFORK - omit this area from child's address space when forking:
+ *		typically, to avoid COWing pages pinned by get_user_pages().
+ *  MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
+ *  MADV_WIPEONFORK - present the child process with zero-filled memory in this
+ *		range after a fork.
+ *  MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
+ *  MADV_HWPOISON - trigger memory error handler as if the given memory range
+ *		were corrupted by unrecoverable hardware memory failure.
+ *  MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
+ *  MADV_MERGEABLE - the application recommends that KSM try to merge pages in
+ *		this area with pages of identical content from other such areas.
+ *  MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
+ *  MADV_HUGEPAGE - the application wants to back the given range by transparent
+ *		huge pages in the future. Existing pages might be coalesced and
+ *		new pages might be allocated as THP.
+ *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
+ *		transparent huge pages so the existing pages will not be
+ *		coalesced into THP and new pages will not be allocated as THP.
+ *  MADV_DONTDUMP - the application wants to prevent pages in the given range
+ *		from being included in its core dump.
+ *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
+ *
+ * return values:
+ *  zero    - success
+ *  -EINVAL - start + len < 0, start is not page-aligned,
+ *		"behavior" is not a valid value, or application
+ *		is attempting to release locked or shared pages,
+ *		or the specified address range includes file, Huge TLB,
+ *		MAP_SHARED or VMPFNMAP range.
+ *  -ENOMEM - addresses in the specified range are not currently
+ *		mapped, or are outside the AS of the process.
+ *  -EIO    - an I/O error occurred while paging in data.
+ *  -EBADF  - map exists, but area maps something that isn't a file.
+ *  -EAGAIN - a kernel resource was temporarily unavailable.
+ */
+SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
+{
+	return madvise_core(current, current->mm, start, len_in, behavior);
+}

From patchwork Fri May 31 06:43:12 2019
X-Patchwork-Submitter: Minchan Kim
X-Patchwork-Id: 10969667
g11mr7213277plo.183.1559285031493; Thu, 30 May 2019 23:43:51 -0700 (PDT) Received: from bbox-2.seo.corp.google.com ([2401:fa00:d:0:98f1:8b3d:1f37:3e8]) by smtp.gmail.com with ESMTPSA id f30sm4243340pjg.13.2019.05.30.23.43.46 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 30 May 2019 23:43:50 -0700 (PDT) From: Minchan Kim To: Andrew Morton Cc: linux-mm , LKML , linux-api@vger.kernel.org, Michal Hocko , Johannes Weiner , Tim Murray , Joel Fernandes , Suren Baghdasaryan , Daniel Colascione , Shakeel Butt , Sonny Rao , Brian Geffon , jannh@google.com, oleg@redhat.com, christian@brauner.io, oleksandr@redhat.com, hdanton@sina.com, Minchan Kim Subject: [RFCv2 5/6] mm: introduce external memory hinting API Date: Fri, 31 May 2019 15:43:12 +0900 Message-Id: <20190531064313.193437-6-minchan@kernel.org> X-Mailer: git-send-email 2.22.0.rc1.257.g3120a18244-goog In-Reply-To: <20190531064313.193437-1-minchan@kernel.org> References: <20190531064313.193437-1-minchan@kernel.org> MIME-Version: 1.0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: X-Virus-Scanned: ClamAV using ClamSMTP There is some usecase that centralized userspace daemon want to give a memory hint like MADV_[COLD|PAGEEOUT] to other process. Android's ActivityManagerService is one of them. It's similar in spirit to madvise(MADV_WONTNEED), but the information required to make the reclaim decision is not known to the app. Instead, it is known to the centralized userspace daemon(ActivityManagerService), and that daemon must be able to initiate reclaim on its own without any app involvement. To solve the issue, this patch introduces new syscall process_madvise(2). It could give a hint to the exeternal process of pidfd. 
int process_madvise(int pidfd, void *addr, size_t length, int advice, unsigned long cookie, unsigned long flag);

Since it can affect another process's address range, only a privileged process (CAP_SYS_PTRACE), or one that otherwise has the right to ptrace the target (e.g., by being the same UID), can use it successfully.

The syscall has a cookie argument to provide atomicity: it lets the kernel detect that the target's address space changed after the monitoring process parsed the target's address ranges, so the operation can fail in case of such a race. Although there is no interface to obtain a cookie at this moment, reserving the argument now avoids introducing another new syscall in the future; it would support *atomicity* for disruptive hints (e.g., MADV_DONTNEED|FREE). The flag argument is reserved for future use, in case we need to extend the API.

Supporting in process_madvise every hint madvise has supported or will support seems rather risky. Not all hints make sense when issued from an external process, and the implementation of a hint may rely on the caller being the current context, like MADV_INJECT_ERROR, which would be error-prone when the caller is an external process. Another example is userfaultfd, because userfaultfd_remove needs to release mmap_sem during the operation, which would be an obstacle to implementing atomicity later. So the hints are limited to MADV_[COLD|PAGEOUT] at this moment. If someone wants another hint exposed, we need to hear about their workload/scenario and carefully review the design/implementation of that hint. That is safer for maintenance: once we ship a buggy syscall and later find it hard to fix, we can never roll it back.

TODO: once we agree on the direction, I need to define the syscall for every architecture.

* RFCv1
 * not export pidfd_to_pid.
   Use pidfd_pid - Christian
 * use mm_struct instead of task_struct for madvise_core - Oleg
 * add a cookie variable as a syscall argument to guarantee atomicity - dancol
* internal review
 * use ptrace capability - surenb, dancol

I didn't solve the issue Oleg pointed out (the task we get from pid_task could be a zombie leader, so the syscall will fail in mm_access even though the process is alive with other threads) because it is not a problem only of this new syscall but a general problem for other MM syscalls like process_vm_readv and move_pages, so it needs more discussion.

Cc: Oleg Nesterov
Signed-off-by: Minchan Kim
---
 arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + include/linux/pid.h | 4 ++ include/linux/syscalls.h | 3 + include/uapi/asm-generic/unistd.h | 4 +- kernel/fork.c | 8 +++ kernel/signal.c | 7 ++- kernel/sys_ni.c | 1 + mm/madvise.c | 87 ++++++++++++++++++++++++++ 9 files changed, 113 insertions(+), 3 deletions(-) diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 43e4429a5272..5f44a29b7882 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -439,3 +439,4 @@ 432 i386 fsmount sys_fsmount __ia32_sys_fsmount 433 i386 fspick sys_fspick __ia32_sys_fspick 434 i386 pidfd_open sys_pidfd_open __ia32_sys_pidfd_open +435 i386 process_madvise sys_process_madvise __ia32_sys_process_madvise diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 1bee0a77fdd3..35e91f3e9646 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -356,6 +356,7 @@ 432 common fsmount __x64_sys_fsmount 433 common fspick __x64_sys_fspick 434 common pidfd_open __x64_sys_pidfd_open +435 common process_madvise __x64_sys_process_madvise # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/include/linux/pid.h b/include/linux/pid.h index a261712ac3fe..a49ef789c034
100644 --- a/include/linux/pid.h +++ b/include/linux/pid.h @@ -73,6 +73,10 @@ extern struct pid init_struct_pid; extern const struct file_operations pidfd_fops; extern int pidfd_create(struct pid *pid); +struct file; + +extern struct pid *pidfd_pid(const struct file *file); + static inline struct pid *get_pid(struct pid *pid) { if (pid) diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 1a4dc53f40d9..6ba081c955f6 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -872,6 +872,9 @@ asmlinkage long sys_munlockall(void); asmlinkage long sys_mincore(unsigned long start, size_t len, unsigned char __user * vec); asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior); +asmlinkage long sys_process_madvise(int pidfd, unsigned long start, + size_t len, int behavior, + unsigned long cookie, unsigned long flags); asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size, unsigned long prot, unsigned long pgoff, unsigned long flags); diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index e5684a4512c0..082d1f3fe3a2 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -846,9 +846,11 @@ __SYSCALL(__NR_fsmount, sys_fsmount) __SYSCALL(__NR_fspick, sys_fspick) #define __NR_pidfd_open 434 __SYSCALL(__NR_pidfd_open, sys_pidfd_open) +#define __NR_process_madvise 435 +__SYSCALL(__NR_process_madvise, sys_process_madvise) #undef __NR_syscalls -#define __NR_syscalls 435 +#define __NR_syscalls 436 /* * 32 bit systems traditionally used different diff --git a/kernel/fork.c b/kernel/fork.c index 9f238cdd886e..b76aade51631 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1738,6 +1738,14 @@ const struct file_operations pidfd_fops = { #endif }; +struct pid *pidfd_pid(const struct file *file) +{ + if (file->f_op == &pidfd_fops) + return file->private_data; + + return ERR_PTR(-EBADF); +} + /** * pidfd_create() - Create a new pid file descriptor. 
* diff --git a/kernel/signal.c b/kernel/signal.c index b477e21ecafc..b376870d7565 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -3702,8 +3702,11 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info) static struct pid *pidfd_to_pid(const struct file *file) { - if (file->f_op == &pidfd_fops) - return file->private_data; + struct pid *pid; + + pid = pidfd_pid(file); + if (!IS_ERR(pid)) + return pid; return tgid_pidfd_to_pid(file); } diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 4d9ae5ea6caf..5277421795ab 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -278,6 +278,7 @@ COND_SYSCALL(mlockall); COND_SYSCALL(munlockall); COND_SYSCALL(mincore); COND_SYSCALL(madvise); +COND_SYSCALL(process_madvise); COND_SYSCALL(remap_file_pages); COND_SYSCALL(mbind); COND_SYSCALL_COMPAT(mbind); diff --git a/mm/madvise.c b/mm/madvise.c index 466623ea8c36..fd205e928a1b 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include #include @@ -983,6 +984,31 @@ madvise_behavior_valid(int behavior) } } +static bool +process_madvise_behavior_valid(int behavior) +{ + switch (behavior) { + case MADV_COLD: + case MADV_PAGEOUT: + return true; + + default: + return false; + } +} + +/* + * madvise_core - request a behavior hint for an address range of the target process + * + * @task: the task_struct receiving the behavior hint, not the one giving it + * @mm: the mm_struct receiving the behavior hint, not the one giving it + * @start: base address of the hinted range + * @len_in: length of the hinted range + * @behavior: requested hint + * + * @task could be a zombie leader if it calls sys_exit, so accessing mm_struct + * via task->mm is prohibited. Please use @mm instead of task->mm.
+ */ static int madvise_core(struct task_struct *task, struct mm_struct *mm, unsigned long start, size_t len_in, int behavior) { @@ -1146,3 +1172,64 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) { return madvise_core(current, current->mm, start, len_in, behavior); } + +SYSCALL_DEFINE6(process_madvise, int, pidfd, unsigned long, start, + size_t, len_in, int, behavior, unsigned long, cookie, + unsigned long, flags) +{ + int ret; + struct fd f; + struct pid *pid; + struct task_struct *task; + struct mm_struct *mm; + + if (flags != 0) + return -EINVAL; + + /* + * We don't support cookie to guarantee address space change + * atomicity yet. + */ + if (cookie != 0) + return -EINVAL; + + if (!process_madvise_behavior_valid(behavior)) + return -EINVAL; + + f = fdget(pidfd); + if (!f.file) + return -EBADF; + + pid = pidfd_pid(f.file); + if (IS_ERR(pid)) { + ret = PTR_ERR(pid); + goto err; + } + + rcu_read_lock(); + task = pid_task(pid, PIDTYPE_PID); + if (!task) { + rcu_read_unlock(); + ret = -ESRCH; + goto err; + } + + get_task_struct(task); + rcu_read_unlock(); + + mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS); + if (!mm || IS_ERR(mm)) { + ret = IS_ERR(mm) ?
PTR_ERR(mm) : -ESRCH; + if (ret == -EACCES) + ret = -EPERM; + goto release_task; + } + + ret = madvise_core(task, mm, start, len_in, behavior); + mmput(mm); +release_task: + put_task_struct(task); +err: + fdput(f); + return ret; +}

From patchwork Fri May 31 06:43:13 2019
X-Patchwork-Submitter: Minchan Kim
X-Patchwork-Id: 10969669
From: Minchan Kim
To: Andrew Morton
Cc: linux-mm, LKML, linux-api@vger.kernel.org, Michal Hocko, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh@google.com, oleg@redhat.com, christian@brauner.io, oleksandr@redhat.com, hdanton@sina.com, Minchan Kim
Subject: [RFCv2 6/6] mm: extend process_madvise syscall to support vector array
Date: Fri, 31 May 2019 15:43:13 +0900
Message-Id: <20190531064313.193437-7-minchan@kernel.org>
In-Reply-To: <20190531064313.193437-1-minchan@kernel.org>
References: <20190531064313.193437-1-minchan@kernel.org>

Currently, the process_madvise syscall works on only one address range, so the user has to call the syscall several times to give hints to multiple address ranges. That is inefficient from a performance perspective, and it also makes it hard to support atomicity across a set of address range operations. This patch extends the process_madvise syscall to support multiple hints, address ranges, and return values, so the user can give all the hints at once.
struct pr_madvise_param {
	int size;		/* the size of this structure */
	int cookie;		/* reserved to support atomicity */
	int nr_elem;		/* count of the array fields below */
	int __user *hints;	/* hints for each range */
	/* to store the result of each operation */
	const struct iovec __user *results;
	/* input address ranges */
	const struct iovec __user *ranges;
};

int process_madvise(int pidfd, struct pr_madvise_param *u_param, unsigned long flags);

About the cookie: Daniel Colascione suggested an idea[1] to support atomicity as well as to speed up parsing of the target process's address ranges. A process_getinfo(2) syscall could maintain a vma configuration sequence number (e.g., increased whenever the target process takes mmap_sem for exclusive access) and return that number together with the address ranges in binary form. By calling this vector syscall with the sequence number and address ranges obtained from process_getinfo, we could detect a race with changes to the target process's address space layout, and fail the syscall if the user wants atomicity. It would also speed up address range parsing, because we would no longer need to parse human-friendly strings from /proc.
[1] https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224

struct pr_madvise_param {
	int size;		/* the size of this structure */
	int cookie;		/* reserved to support atomicity */
	int nr_elem;		/* count of the array fields below */
	int *hints;		/* hints for each range */
	/* to store the result of each operation */
	const struct iovec *results;
	/* input address ranges */
	const struct iovec *ranges;
};

int main(int argc, char *argv[])
{
	struct pr_madvise_param param;
	int hints[NR_ADDR_RANGE];
	int ret[NR_ADDR_RANGE];
	struct iovec ret_vec[NR_ADDR_RANGE];
	struct iovec range_vec[NR_ADDR_RANGE];
	void *addr[NR_ADDR_RANGE];
	pid_t pid;

	addr[0] = mmap(NULL, ALLOC_SIZE, PROT_READ|PROT_WRITE,
		       MAP_POPULATE|MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
	if (MAP_FAILED == addr[0]) {
		printf("Fail to alloc\n");
		return 1;
	}

	addr[1] = mmap(NULL, ALLOC_SIZE, PROT_READ|PROT_WRITE,
		       MAP_POPULATE|MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
	if (MAP_FAILED == addr[1]) {
		printf("Fail to alloc\n");
		return 1;
	}

	ret_vec[0].iov_base = &ret[0];
	ret_vec[0].iov_len = sizeof(long);
	ret_vec[1].iov_base = &ret[1];
	ret_vec[1].iov_len = sizeof(long);
	range_vec[0].iov_base = addr[0];
	range_vec[0].iov_len = ALLOC_SIZE;
	range_vec[1].iov_base = addr[1];
	range_vec[1].iov_len = ALLOC_SIZE;
	hints[0] = MADV_COLD;
	hints[1] = MADV_PAGEOUT;

	param.size = sizeof(struct pr_madvise_param);
	param.cookie = 0;
	param.nr_elem = NR_ADDR_RANGE;
	param.hints = hints;
	param.results = ret_vec;
	param.ranges = range_vec;

	pid = fork();
	if (!pid) {
		sleep(10);
	} else {
		int pidfd = syscall(__NR_pidfd_open, pid, 0);

		if (pidfd < 0) {
			printf("Fail to open process file descriptor\n");
			return 1;
		}
		munmap(addr[0], ALLOC_SIZE);
		munmap(addr[1], ALLOC_SIZE);
		system("cat /proc/vmstat | egrep 'pswpout|deactivate'");
		if (syscall(__NR_process_madvise, pidfd, &param, 0))
			perror("process_madvise fail\n");
		system("cat /proc/vmstat | egrep 'pswpout|deactivate'");
	}
	return 0;
}

Signed-off-by: Minchan Kim
---
 include/linux/syscalls.h
| 6 +- include/uapi/asm-generic/mman-common.h | 11 +++ mm/madvise.c | 126 ++++++++++++++++++++++--- 3 files changed, 126 insertions(+), 17 deletions(-) diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 6ba081c955f6..05627718a547 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -872,9 +872,9 @@ asmlinkage long sys_munlockall(void); asmlinkage long sys_mincore(unsigned long start, size_t len, unsigned char __user * vec); asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior); -asmlinkage long sys_process_madvise(int pidfd, unsigned long start, - size_t len, int behavior, - unsigned long cookie, unsigned long flags); +asmlinkage long sys_process_madvise(int pidfd, + struct pr_madvise_param __user *u_params, + unsigned long flags); asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size, unsigned long prot, unsigned long pgoff, unsigned long flags); diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h index 92e347a89ddc..220c2b5eb961 100644 --- a/include/uapi/asm-generic/mman-common.h +++ b/include/uapi/asm-generic/mman-common.h @@ -75,4 +75,15 @@ #define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\ PKEY_DISABLE_WRITE) +struct pr_madvise_param { + int size; /* the size of this structure */ + int cookie; /* reserved to support atomicity */ + int nr_elem; /* count of the array fields below */ + int __user *hints; /* hints for each range */ + /* to store result of each operation */ + const struct iovec __user *results; + /* input address ranges */ + const struct iovec __user *ranges; +}; + #endif /* __ASM_GENERIC_MMAN_COMMON_H */ diff --git a/mm/madvise.c b/mm/madvise.c index fd205e928a1b..94d782097afd 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -1107,6 +1107,56 @@ static int madvise_core(struct task_struct *task, struct mm_struct *mm, return error; } +static int pr_madvise_copy_param(struct pr_madvise_param __user *u_param, + struct pr_madvise_param
*param) +{ + u32 size; + int ret; + + memset(param, 0, sizeof(*param)); + + ret = get_user(size, &u_param->size); + if (ret) + return ret; + + if (size > PAGE_SIZE) + return -E2BIG; + + if (!size || size > sizeof(struct pr_madvise_param)) + return -EINVAL; + + ret = copy_from_user(param, u_param, size); + if (ret) + return -EFAULT; + + return ret; +} + +static int process_madvise_core(struct task_struct *tsk, struct mm_struct *mm, + int *behaviors, + struct iov_iter *iter, + const struct iovec *range_vec, + unsigned long riovcnt) +{ + int i; + long err; + + for (i = 0; i < riovcnt && iov_iter_count(iter); i++) { + err = -EINVAL; + if (process_madvise_behavior_valid(behaviors[i])) + err = madvise_core(tsk, mm, + (unsigned long)range_vec[i].iov_base, + range_vec[i].iov_len, behaviors[i]); + + if (copy_to_iter(&err, sizeof(long), iter) != + sizeof(long)) { + return -EFAULT; + } + } + + return 0; +} + /* * The madvise(2) system call. * @@ -1173,37 +1223,78 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) return madvise_core(current, current->mm, start, len_in, behavior); } -SYSCALL_DEFINE6(process_madvise, int, pidfd, unsigned long, start, - size_t, len_in, int, behavior, unsigned long, cookie, - unsigned long, flags) + +SYSCALL_DEFINE3(process_madvise, int, pidfd, + struct pr_madvise_param __user *, u_params, + unsigned long, flags) { int ret; struct fd f; struct pid *pid; struct task_struct *task; struct mm_struct *mm; + struct pr_madvise_param params; + const struct iovec __user *result_vec, __user *range_vec; + int *behaviors; + struct iovec iovstack_result[UIO_FASTIOV]; + struct iovec iovstack_r[UIO_FASTIOV]; + struct iovec *iov_l = iovstack_result; + struct iovec *iov_r = iovstack_r; + struct iov_iter iter; + int nr_elem; if (flags != 0) return -EINVAL; + ret = pr_madvise_copy_param(u_params, &params); + if (ret) + return ret; + /* - * We don't support cookie to guarantee address space change - atomicity yet.
+ * We don't support cookie to guarantee address space atomicity yet. + * Once we implement cookie, process_madvise_core needs to hold mmap_sem + * during the entire operation to guarantee atomicity. */ - if (cookie != 0) + if (params.cookie != 0) return -EINVAL; - if (!process_madvise_behavior_valid(behavior)) - return -EINVAL; + range_vec = params.ranges; + result_vec = params.results; + nr_elem = params.nr_elem; + + behaviors = kmalloc_array(nr_elem, sizeof(int), GFP_KERNEL); + if (!behaviors) + return -ENOMEM; + + ret = copy_from_user(behaviors, params.hints, sizeof(int) * nr_elem); + if (ret < 0) + goto free_behavior_vec; + + ret = import_iovec(READ, result_vec, params.nr_elem, UIO_FASTIOV, + &iov_l, &iter); + if (ret < 0) + goto free_behavior_vec; + + if (!iov_iter_count(&iter)) { + ret = -EINVAL; + goto free_iovecs; + } + + ret = rw_copy_check_uvector(CHECK_IOVEC_ONLY, range_vec, nr_elem, + UIO_FASTIOV, iovstack_r, &iov_r); + if (ret <= 0) + goto free_iovecs; f = fdget(pidfd); - if (!f.file) - return -EBADF; + if (!f.file) { + ret = -EBADF; + goto free_iovecs; + } pid = pidfd_pid(f.file); if (IS_ERR(pid)) { ret = PTR_ERR(pid); - goto err; + goto put_fd; } rcu_read_lock(); @@ -1211,7 +1302,7 @@ SYSCALL_DEFINE6(process_madvise, int, pidfd, unsigned long, start, if (!task) { rcu_read_unlock(); ret = -ESRCH; - goto err; + goto put_fd; } get_task_struct(task); @@ -1225,11 +1316,18 @@ SYSCALL_DEFINE6(process_madvise, int, pidfd, unsigned long, start, goto release_task; } - ret = madvise_core(task, mm, start, len_in, behavior); + ret = process_madvise_core(task, mm, behaviors, &iter, iov_r, nr_elem); mmput(mm); release_task: put_task_struct(task); -err: +put_fd: fdput(f); +free_iovecs: + if (iov_r != iovstack_r) + kfree(iov_r); + kfree(iov_l); +free_behavior_vec: + kfree(behaviors); + return ret; }