From patchwork Fri Feb 15 22:03:04 2019
X-Patchwork-Submitter: Zi Yan
X-Patchwork-Id: 10815935
From: Zi Yan <ziy@nvidia.com>
To: linux-mm@kvack.org
Cc: Dave Hansen, Michal Hocko, Kirill A. Shutemov, Andrew Morton,
    Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
    Nitin Gupta, David Nellans, Zi Yan
Subject: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.
Date: Fri, 15 Feb 2019 14:03:04 -0800
Message-ID: <20190215220334.29298-2-ziy@nvidia.com>
In-Reply-To: <20190215220334.29298-1-ziy@nvidia.com>
References: <20190215220334.29298-1-ziy@nvidia.com>

Instead of calling migrate_pages() twice, a single exchange_pages() call is
sufficient and avoids allocating any new pages.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/ksm.h |   5 +
 mm/Makefile         |   1 +
 mm/exchange.c       | 846 ++++++++++++++++++++++++++++++++++++++++++++
 mm/internal.h       |   6 +
 mm/ksm.c            |  35 ++
 mm/migrate.c        |   4 +-
 6 files changed, 895 insertions(+), 2 deletions(-)
 create mode 100644 mm/exchange.c

diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index 161e8164abcf..87c5b943a73c 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -53,6 +53,7 @@ struct page *ksm_might_need_to_copy(struct page *page,
 void rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc);
 void ksm_migrate_page(struct page *newpage, struct page *oldpage);
+void ksm_exchange_page(struct page *to_page, struct page *from_page);

 #else  /* !CONFIG_KSM */

@@ -86,6 +87,10 @@ static inline void rmap_walk_ksm(struct page *page,
 static inline void ksm_migrate_page(struct page *newpage, struct page *oldpage)
 {
 }
+static inline void ksm_exchange_page(struct page *to_page,
+				     struct page *from_page)
+{
+}
 #endif /* CONFIG_MMU */
 #endif /* !CONFIG_KSM */
diff --git a/mm/Makefile b/mm/Makefile
index d210cc9d6f80..1574ea5743e4 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -43,6 +43,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
 obj-y += init-mm.o
 obj-y += memblock.o
+obj-y += exchange.o

 ifdef CONFIG_MMU
 	obj-$(CONFIG_ADVISE_SYSCALLS)	+= madvise.o
diff --git a/mm/exchange.c b/mm/exchange.c
new file mode 100644
index 000000000000..a607348cc6f4
--- /dev/null
+++ b/mm/exchange.c
@@ -0,0 +1,846 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2016 NVIDIA, Zi Yan
+ *
+ * Exchange two in-use pages. Page flags and page->mapping are exchanged
+ * as well. Only anonymous pages are supported.
+ */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include /* buffer_migrate_page */ +#include + + +#include "internal.h" + +struct exchange_page_info { + struct page *from_page; + struct page *to_page; + + struct anon_vma *from_anon_vma; + struct anon_vma *to_anon_vma; + + struct list_head list; +}; + +struct page_flags { + unsigned int page_error :1; + unsigned int page_referenced:1; + unsigned int page_uptodate:1; + unsigned int page_active:1; + unsigned int page_unevictable:1; + unsigned int page_checked:1; + unsigned int page_mappedtodisk:1; + unsigned int page_dirty:1; + unsigned int page_is_young:1; + unsigned int page_is_idle:1; + unsigned int page_swapcache:1; + unsigned int page_writeback:1; + unsigned int page_private:1; + unsigned int __pad:3; +}; + + +static void exchange_page(char *to, char *from) +{ + u64 tmp; + int i; + + for (i = 0; i < PAGE_SIZE; i += sizeof(tmp)) { + tmp = *((u64 *)(from + i)); + *((u64 *)(from + i)) = *((u64 *)(to + i)); + *((u64 *)(to + i)) = tmp; + } +} + +static inline void exchange_highpage(struct page *to, struct page *from) +{ + char *vfrom, *vto; + + vfrom = kmap_atomic(from); + vto = kmap_atomic(to); + exchange_page(vto, vfrom); + kunmap_atomic(vto); + kunmap_atomic(vfrom); +} + +static void __exchange_gigantic_page(struct page *dst, struct page *src, + int nr_pages) +{ + int i; + struct page *dst_base = dst; + struct page *src_base = src; + + for (i = 0; i < nr_pages; ) { + cond_resched(); + exchange_highpage(dst, src); + + i++; + dst = mem_map_next(dst, dst_base, i); + src = mem_map_next(src, src_base, i); + } +} + +static void exchange_huge_page(struct page *dst, struct page *src) +{ + int i; + int nr_pages; + + if (PageHuge(src)) { + /* hugetlbfs page */ + struct hstate *h = page_hstate(src); + + nr_pages = pages_per_huge_page(h); + + if (unlikely(nr_pages > MAX_ORDER_NR_PAGES)) { + __exchange_gigantic_page(dst, src, nr_pages); + return; + } + } else { + /* thp page */ + VM_BUG_ON(!PageTransHuge(src)); + nr_pages = hpage_nr_pages(src); + } + + for (i = 0; i < nr_pages; i++) { + cond_resched(); + exchange_highpage(dst + i, src + i); + } +} + +/* + * Copy the page to its new location without polluting cache + */ +static void exchange_page_flags(struct page *to_page, struct page *from_page) +{ + int from_cpupid, to_cpupid; + struct page_flags from_page_flags, to_page_flags; + struct mem_cgroup *to_memcg = page_memcg(to_page), + *from_memcg = page_memcg(from_page); + + from_cpupid = page_cpupid_xchg_last(from_page, -1); + + from_page_flags.page_error = TestClearPageError(from_page); + from_page_flags.page_referenced = TestClearPageReferenced(from_page); + from_page_flags.page_uptodate = PageUptodate(from_page); + ClearPageUptodate(from_page); + from_page_flags.page_active = TestClearPageActive(from_page); + from_page_flags.page_unevictable = TestClearPageUnevictable(from_page); + from_page_flags.page_checked = PageChecked(from_page); + ClearPageChecked(from_page); + from_page_flags.page_mappedtodisk = PageMappedToDisk(from_page); + ClearPageMappedToDisk(from_page); + from_page_flags.page_dirty = PageDirty(from_page); + ClearPageDirty(from_page); + from_page_flags.page_is_young = test_and_clear_page_young(from_page); + from_page_flags.page_is_idle = page_is_idle(from_page); + clear_page_idle(from_page); + from_page_flags.page_swapcache = PageSwapCache(from_page); + from_page_flags.page_writeback = test_clear_page_writeback(from_page); + + + to_cpupid = 
page_cpupid_xchg_last(to_page, -1); + + to_page_flags.page_error = TestClearPageError(to_page); + to_page_flags.page_referenced = TestClearPageReferenced(to_page); + to_page_flags.page_uptodate = PageUptodate(to_page); + ClearPageUptodate(to_page); + to_page_flags.page_active = TestClearPageActive(to_page); + to_page_flags.page_unevictable = TestClearPageUnevictable(to_page); + to_page_flags.page_checked = PageChecked(to_page); + ClearPageChecked(to_page); + to_page_flags.page_mappedtodisk = PageMappedToDisk(to_page); + ClearPageMappedToDisk(to_page); + to_page_flags.page_dirty = PageDirty(to_page); + ClearPageDirty(to_page); + to_page_flags.page_is_young = test_and_clear_page_young(to_page); + to_page_flags.page_is_idle = page_is_idle(to_page); + clear_page_idle(to_page); + to_page_flags.page_swapcache = PageSwapCache(to_page); + to_page_flags.page_writeback = test_clear_page_writeback(to_page); + + /* set to_page */ + if (from_page_flags.page_error) + SetPageError(to_page); + if (from_page_flags.page_referenced) + SetPageReferenced(to_page); + if (from_page_flags.page_uptodate) + SetPageUptodate(to_page); + if (from_page_flags.page_active) { + VM_BUG_ON_PAGE(from_page_flags.page_unevictable, from_page); + SetPageActive(to_page); + } else if (from_page_flags.page_unevictable) + SetPageUnevictable(to_page); + if (from_page_flags.page_checked) + SetPageChecked(to_page); + if (from_page_flags.page_mappedtodisk) + SetPageMappedToDisk(to_page); + + /* Move dirty on pages not done by migrate_page_move_mapping() */ + if (from_page_flags.page_dirty) + SetPageDirty(to_page); + + if (from_page_flags.page_is_young) + set_page_young(to_page); + if (from_page_flags.page_is_idle) + set_page_idle(to_page); + + /* set from_page */ + if (to_page_flags.page_error) + SetPageError(from_page); + if (to_page_flags.page_referenced) + SetPageReferenced(from_page); + if (to_page_flags.page_uptodate) + SetPageUptodate(from_page); + if (to_page_flags.page_active) { + VM_BUG_ON_PAGE(to_page_flags.page_unevictable, from_page); + SetPageActive(from_page); + } else if (to_page_flags.page_unevictable) + SetPageUnevictable(from_page); + if (to_page_flags.page_checked) + SetPageChecked(from_page); + if (to_page_flags.page_mappedtodisk) + SetPageMappedToDisk(from_page); + + /* Move dirty on pages not done by migrate_page_move_mapping() */ + if (to_page_flags.page_dirty) + SetPageDirty(from_page); + + if (to_page_flags.page_is_young) + set_page_young(from_page); + if (to_page_flags.page_is_idle) + set_page_idle(from_page); + + /* + * Copy NUMA information to the new page, to prevent over-eager + * future migrations of this same page. + */ + page_cpupid_xchg_last(to_page, from_cpupid); + page_cpupid_xchg_last(from_page, to_cpupid); + + ksm_exchange_page(to_page, from_page); + /* + * Please do not reorder this without considering how mm/ksm.c's + * get_ksm_page() depends upon ksm_migrate_page() and PageSwapCache(). + */ + ClearPageSwapCache(to_page); + ClearPageSwapCache(from_page); + if (from_page_flags.page_swapcache) + SetPageSwapCache(to_page); + if (to_page_flags.page_swapcache) + SetPageSwapCache(from_page); + + +#ifdef CONFIG_PAGE_OWNER + /* exchange page owner */ + BUILD_BUG(); +#endif + /* exchange mem cgroup */ + to_page->mem_cgroup = from_memcg; + from_page->mem_cgroup = to_memcg; + +} + +/* + * Replace the page in the mapping. 
+ * + * The number of remaining references must be: + * 1 for anonymous pages without a mapping + * 2 for pages with a mapping + * 3 for pages with a mapping and PagePrivate/PagePrivate2 set. + */ + +static int exchange_page_move_mapping(struct address_space *to_mapping, + struct address_space *from_mapping, + struct page *to_page, struct page *from_page, + struct buffer_head *to_head, + struct buffer_head *from_head, + enum migrate_mode mode, + int to_extra_count, int from_extra_count) +{ + int to_expected_count = 1 + to_extra_count, + from_expected_count = 1 + from_extra_count; + unsigned long from_page_index = from_page->index; + unsigned long to_page_index = to_page->index; + int to_swapbacked = PageSwapBacked(to_page), + from_swapbacked = PageSwapBacked(from_page); + struct address_space *to_mapping_value = to_page->mapping; + struct address_space *from_mapping_value = from_page->mapping; + + VM_BUG_ON_PAGE(to_mapping != page_mapping(to_page), to_page); + VM_BUG_ON_PAGE(from_mapping != page_mapping(from_page), from_page); + + if (!to_mapping) { + /* Anonymous page without mapping */ + if (page_count(to_page) != to_expected_count) + return -EAGAIN; + } + + if (!from_mapping) { + /* Anonymous page without mapping */ + if (page_count(from_page) != from_expected_count) + return -EAGAIN; + } + + /* both are anonymous pages */ + if (!from_mapping && !to_mapping) { + /* from_page */ + from_page->index = to_page_index; + from_page->mapping = to_mapping_value; + + ClearPageSwapBacked(from_page); + if (to_swapbacked) + SetPageSwapBacked(from_page); + + + /* to_page */ + to_page->index = from_page_index; + to_page->mapping = from_mapping_value; + + ClearPageSwapBacked(to_page); + if (from_swapbacked) + SetPageSwapBacked(to_page); + } else if (!from_mapping && to_mapping) { + /* from is anonymous, to is file-backed */ + struct zone *from_zone, *to_zone; + void **to_pslot; + int dirty; + + from_zone = page_zone(from_page); + to_zone = page_zone(to_page); + + xa_lock_irq(&to_mapping->i_pages); + + to_pslot = radix_tree_lookup_slot(&to_mapping->i_pages, + page_index(to_page)); + + to_expected_count += 1 + page_has_private(to_page); + if (page_count(to_page) != to_expected_count || + radix_tree_deref_slot_protected(to_pslot, + &to_mapping->i_pages.xa_lock) != to_page) { + xa_unlock_irq(&to_mapping->i_pages); + return -EAGAIN; + } + + if (!page_ref_freeze(to_page, to_expected_count)) { + xa_unlock_irq(&to_mapping->i_pages); + pr_debug("cannot freeze page count\n"); + return -EAGAIN; + } + + if (mode == MIGRATE_ASYNC && to_head && + !buffer_migrate_lock_buffers(to_head, mode)) { + page_ref_unfreeze(to_page, to_expected_count); + xa_unlock_irq(&to_mapping->i_pages); + + pr_debug("cannot lock buffer head\n"); + return -EAGAIN; + } + + if (!page_ref_freeze(from_page, from_expected_count)) { + page_ref_unfreeze(to_page, to_expected_count); + xa_unlock_irq(&to_mapping->i_pages); + + return -EAGAIN; + } + /* + * Now we know that no one else is looking at the page: + * no turning back from here. 
+ */ + ClearPageSwapBacked(from_page); + ClearPageSwapBacked(to_page); + + /* from_page */ + from_page->index = to_page_index; + from_page->mapping = to_mapping_value; + /* to_page */ + to_page->index = from_page_index; + to_page->mapping = from_mapping_value; + + if (to_swapbacked) + __SetPageSwapBacked(from_page); + else + VM_BUG_ON_PAGE(PageSwapCache(to_page), to_page); + + if (from_swapbacked) + __SetPageSwapBacked(to_page); + else + VM_BUG_ON_PAGE(PageSwapCache(from_page), from_page); + + dirty = PageDirty(to_page); + + radix_tree_replace_slot(&to_mapping->i_pages, + to_pslot, from_page); + + /* move cache reference */ + page_ref_unfreeze(to_page, to_expected_count - 1); + page_ref_unfreeze(from_page, from_expected_count + 1); + + xa_unlock(&to_mapping->i_pages); + + /* + * If moved to a different zone then also account + * the page for that zone. Other VM counters will be + * taken care of when we establish references to the + * new page and drop references to the old page. + * + * Note that anonymous pages are accounted for + * via NR_FILE_PAGES and NR_ANON_MAPPED if they + * are mapped to swap space. + */ + if (to_zone != from_zone) { + __dec_node_state(to_zone->zone_pgdat, NR_FILE_PAGES); + __inc_node_state(from_zone->zone_pgdat, NR_FILE_PAGES); + if (PageSwapBacked(to_page) && !PageSwapCache(to_page)) { + __dec_node_state(to_zone->zone_pgdat, NR_SHMEM); + __inc_node_state(from_zone->zone_pgdat, NR_SHMEM); + } + if (dirty && mapping_cap_account_dirty(to_mapping)) { + __dec_node_state(to_zone->zone_pgdat, NR_FILE_DIRTY); + __dec_zone_state(to_zone, NR_ZONE_WRITE_PENDING); + __inc_node_state(from_zone->zone_pgdat, NR_FILE_DIRTY); + __inc_zone_state(from_zone, NR_ZONE_WRITE_PENDING); + } + } + local_irq_enable(); + + } else { + /* from is file-backed to is anonymous: fold this to the case above */ + /* both are file-backed */ + VM_BUG_ON(1); + } + + return MIGRATEPAGE_SUCCESS; +} + +static int exchange_from_to_pages(struct page *to_page, struct page *from_page, + enum migrate_mode mode) +{ + int rc = -EBUSY; + struct address_space *to_page_mapping, *from_page_mapping; + struct buffer_head *to_head = NULL, *to_bh = NULL; + + VM_BUG_ON_PAGE(!PageLocked(from_page), from_page); + VM_BUG_ON_PAGE(!PageLocked(to_page), to_page); + + /* copy page->mapping not use page_mapping() */ + to_page_mapping = page_mapping(to_page); + from_page_mapping = page_mapping(from_page); + + /* from_page has to be anonymous page */ + VM_BUG_ON(from_page_mapping); + VM_BUG_ON(PageWriteback(from_page)); + /* writeback has to finish */ + BUG_ON(PageWriteback(to_page)); + + + /* to_page is anonymous */ + if (!to_page_mapping) { +exchange_mappings: + /* actual page mapping exchange */ + rc = exchange_page_move_mapping(to_page_mapping, from_page_mapping, + to_page, from_page, NULL, NULL, mode, 0, 0); + } else { + if (to_page_mapping->a_ops->migratepage == buffer_migrate_page) { + + if (!page_has_buffers(to_page)) + goto exchange_mappings; + + to_head = page_buffers(to_page); + + rc = exchange_page_move_mapping(to_page_mapping, + from_page_mapping, to_page, from_page, + to_head, NULL, mode, 0, 0); + + if (rc != MIGRATEPAGE_SUCCESS) + return rc; + + /* + * In the async case, migrate_page_move_mapping locked the buffers + * with an IRQ-safe spinlock held. 
In the sync case, the buffers + * need to be locked now + */ + if (mode != MIGRATE_ASYNC) + VM_BUG_ON(!buffer_migrate_lock_buffers(to_head, mode)); + + ClearPagePrivate(to_page); + set_page_private(from_page, page_private(to_page)); + set_page_private(to_page, 0); + /* transfer private page count */ + put_page(to_page); + get_page(from_page); + + to_bh = to_head; + do { + set_bh_page(to_bh, from_page, bh_offset(to_bh)); + to_bh = to_bh->b_this_page; + + } while (to_bh != to_head); + + SetPagePrivate(from_page); + + to_bh = to_head; + } else if (!to_page_mapping->a_ops->migratepage) { + /* fallback_migrate_page */ + if (PageDirty(to_page)) { + if (mode != MIGRATE_SYNC) + return -EBUSY; + return writeout(to_page_mapping, to_page); + } + if (page_has_private(to_page) && + !try_to_release_page(to_page, GFP_KERNEL)) + return -EAGAIN; + + goto exchange_mappings; + } + } + /* actual page data exchange */ + if (rc != MIGRATEPAGE_SUCCESS) + return rc; + + + if (PageHuge(from_page) || PageTransHuge(from_page)) + exchange_huge_page(to_page, from_page); + else + exchange_highpage(to_page, from_page); + rc = 0; + + /* + * 1. buffer_migrate_page: + * private flag should be transferred from to_page to from_page + * + * 2. anon<->anon, fallback_migrate_page: + * both have none private flags or to_page's is cleared. + */ + VM_BUG_ON(!((page_has_private(from_page) && !page_has_private(to_page)) || + (!page_has_private(from_page) && !page_has_private(to_page)))); + + exchange_page_flags(to_page, from_page); + + if (to_bh) { + VM_BUG_ON(to_bh != to_head); + do { + unlock_buffer(to_bh); + put_bh(to_bh); + to_bh = to_bh->b_this_page; + + } while (to_bh != to_head); + } + + return rc; +} + +static int unmap_and_exchange(struct page *from_page, + struct page *to_page, enum migrate_mode mode) +{ + int rc = -EAGAIN; + struct anon_vma *from_anon_vma = NULL; + struct anon_vma *to_anon_vma = NULL; + int from_page_was_mapped = 0; + int to_page_was_mapped = 0; + int from_page_count = 0, to_page_count = 0; + int from_map_count = 0, to_map_count = 0; + unsigned long from_flags, to_flags; + pgoff_t from_index, to_index; + struct address_space *from_mapping, *to_mapping; + + if (!trylock_page(from_page)) { + if (mode == MIGRATE_ASYNC) + goto out; + lock_page(from_page); + } + + if (!trylock_page(to_page)) { + if (mode == MIGRATE_ASYNC) + goto out_unlock; + lock_page(to_page); + } + + /* from_page is supposed to be an anonymous page */ + VM_BUG_ON_PAGE(PageWriteback(from_page), from_page); + + if (PageWriteback(to_page)) { + /* + * Only in the case of a full synchronous migration is it + * necessary to wait for PageWriteback. In the async case, + * the retry loop is too short and in the sync-light case, + * the overhead of stalling is too much + */ + if (mode != MIGRATE_SYNC) { + rc = -EBUSY; + goto out_unlock_both; + } + wait_on_page_writeback(to_page); + } + + if (PageAnon(from_page) && !PageKsm(from_page)) + from_anon_vma = page_get_anon_vma(from_page); + + if (PageAnon(to_page) && !PageKsm(to_page)) + to_anon_vma = page_get_anon_vma(to_page); + + from_page_count = page_count(from_page); + from_map_count = page_mapcount(from_page); + to_page_count = page_count(to_page); + to_map_count = page_mapcount(to_page); + from_flags = from_page->flags; + to_flags = to_page->flags; + from_mapping = from_page->mapping; + to_mapping = to_page->mapping; + from_index = from_page->index; + to_index = to_page->index; + + /* + * Corner case handling: + * 1. 
When a new swap-cache page is read into, it is added to the LRU + * and treated as swapcache but it has no rmap yet. + * Calling try_to_unmap() against a page->mapping==NULL page will + * trigger a BUG. So handle it here. + * 2. An orphaned page (see truncate_complete_page) might have + * fs-private metadata. The page can be picked up due to memory + * offlining. Everywhere else except page reclaim, the page is + * invisible to the vm, so the page can not be migrated. So try to + * free the metadata, so the page can be freed. + */ + if (!from_page->mapping) { + VM_BUG_ON_PAGE(PageAnon(from_page), from_page); + if (page_has_private(from_page)) { + try_to_free_buffers(from_page); + goto out_unlock_both; + } + } else if (page_mapped(from_page)) { + /* Establish migration ptes */ + VM_BUG_ON_PAGE(PageAnon(from_page) && !PageKsm(from_page) && + !from_anon_vma, from_page); + try_to_unmap(from_page, + TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS); + from_page_was_mapped = 1; + } + + if (!to_page->mapping) { + VM_BUG_ON_PAGE(PageAnon(to_page), to_page); + if (page_has_private(to_page)) { + try_to_free_buffers(to_page); + goto out_unlock_both_remove_from_migration_pte; + } + } else if (page_mapped(to_page)) { + /* Establish migration ptes */ + VM_BUG_ON_PAGE(PageAnon(to_page) && !PageKsm(to_page) && + !to_anon_vma, to_page); + try_to_unmap(to_page, + TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS); + to_page_was_mapped = 1; + } + + if (!page_mapped(from_page) && !page_mapped(to_page)) + rc = exchange_from_to_pages(to_page, from_page, mode); + + + if (to_page_was_mapped) { + /* swap back to_page->index to be compatible with + * remove_migration_ptes(), which assumes both from_page and to_page + * below have the same index. + */ + if (rc == MIGRATEPAGE_SUCCESS) + swap(to_page->index, to_index); + + remove_migration_ptes(to_page, + rc == MIGRATEPAGE_SUCCESS ? from_page : to_page, false); + + if (rc == MIGRATEPAGE_SUCCESS) + swap(to_page->index, to_index); + } + +out_unlock_both_remove_from_migration_pte: + if (from_page_was_mapped) { + /* swap back from_page->index to be compatible with + * remove_migration_ptes(), which assumes both from_page and to_page + * below have the same index. + */ + if (rc == MIGRATEPAGE_SUCCESS) + swap(from_page->index, from_index); + + remove_migration_ptes(from_page, + rc == MIGRATEPAGE_SUCCESS ? to_page : from_page, false); + + if (rc == MIGRATEPAGE_SUCCESS) + swap(from_page->index, from_index); + } + +out_unlock_both: + if (to_anon_vma) + put_anon_vma(to_anon_vma); + unlock_page(to_page); +out_unlock: + /* Drop an anon_vma reference if we took one */ + if (from_anon_vma) + put_anon_vma(from_anon_vma); + unlock_page(from_page); +out: + return rc; +} + +/* + * Exchange pages in the exchange_list + * + * Caller should release the exchange_list resource. + * + */ +static int exchange_pages(struct list_head *exchange_list, + enum migrate_mode mode, + int reason) +{ + struct exchange_page_info *one_pair, *one_pair2; + int failed = 0; + + list_for_each_entry_safe(one_pair, one_pair2, exchange_list, list) { + struct page *from_page = one_pair->from_page; + struct page *to_page = one_pair->to_page; + int rc; + int retry = 0; + +again: + if (page_count(from_page) == 1) { + /* page was freed from under us. 
So we are done */ + ClearPageActive(from_page); + ClearPageUnevictable(from_page); + + mod_node_page_state(page_pgdat(from_page), NR_ISOLATED_ANON + + page_is_file_cache(from_page), + -hpage_nr_pages(from_page)); + put_page(from_page); + + if (page_count(to_page) == 1) { + ClearPageActive(to_page); + ClearPageUnevictable(to_page); + put_page(to_page); + mod_node_page_state(page_pgdat(to_page), NR_ISOLATED_ANON + + page_is_file_cache(to_page), + -hpage_nr_pages(to_page)); + } else + goto putback_to_page; + + continue; + } + + if (page_count(to_page) == 1) { + /* page was freed from under us. So we are done */ + ClearPageActive(to_page); + ClearPageUnevictable(to_page); + + mod_node_page_state(page_pgdat(to_page), NR_ISOLATED_ANON + + page_is_file_cache(to_page), + -hpage_nr_pages(to_page)); + put_page(to_page); + + mod_node_page_state(page_pgdat(from_page), NR_ISOLATED_ANON + + page_is_file_cache(from_page), + -hpage_nr_pages(from_page)); + putback_lru_page(from_page); + continue; + } + + /* TODO: compound page not supported */ + /* to_page can be file-backed page */ + if (PageCompound(from_page) || + page_mapping(from_page) + ) { + ++failed; + goto putback; + } + + rc = unmap_and_exchange(from_page, to_page, mode); + + if (rc == -EAGAIN && retry < 3) { + ++retry; + goto again; + } + + if (rc != MIGRATEPAGE_SUCCESS) + ++failed; + +putback: + mod_node_page_state(page_pgdat(from_page), NR_ISOLATED_ANON + + page_is_file_cache(from_page), + -hpage_nr_pages(from_page)); + + putback_lru_page(from_page); +putback_to_page: + mod_node_page_state(page_pgdat(to_page), NR_ISOLATED_ANON + + page_is_file_cache(to_page), + -hpage_nr_pages(to_page)); + + putback_lru_page(to_page); + } + return failed; +} + +int exchange_two_pages(struct page *page1, struct page *page2) +{ + struct exchange_page_info page_info; + LIST_HEAD(exchange_list); + int err = -EFAULT; + int pagevec_flushed = 0; + + VM_BUG_ON_PAGE(PageTail(page1), page1); + VM_BUG_ON_PAGE(PageTail(page2), page2); + + if (!(PageLRU(page1) && PageLRU(page2))) + return -EBUSY; + +retry_isolate1: + if (!get_page_unless_zero(page1)) + return -EBUSY; + err = isolate_lru_page(page1); + put_page(page1); + if (err) { + if (!pagevec_flushed) { + migrate_prep(); + pagevec_flushed = 1; + goto retry_isolate1; + } + return err; + } + mod_node_page_state(page_pgdat(page1), + NR_ISOLATED_ANON + page_is_file_cache(page1), + hpage_nr_pages(page1)); + +retry_isolate2: + if (!get_page_unless_zero(page2)) { + putback_lru_page(page1); + return -EBUSY; + } + err = isolate_lru_page(page2); + put_page(page2); + if (err) { + if (!pagevec_flushed) { + migrate_prep(); + pagevec_flushed = 1; + goto retry_isolate2; + } + return err; + } + mod_node_page_state(page_pgdat(page2), + NR_ISOLATED_ANON + page_is_file_cache(page2), + hpage_nr_pages(page2)); + + page_info.from_page = page1; + page_info.to_page = page2; + INIT_LIST_HEAD(&page_info.list); + list_add(&page_info.list, &exchange_list); + + + return exchange_pages(&exchange_list, MIGRATE_SYNC, 0); + +} diff --git a/mm/internal.h b/mm/internal.h index f4a7bb02decf..77e205c423ce 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -543,4 +543,10 @@ static inline bool is_migrate_highatomic_page(struct page *page) void setup_zone_pageset(struct zone *zone); extern struct page *alloc_new_node_page(struct page *page, unsigned long node); + +bool buffer_migrate_lock_buffers(struct buffer_head *head, + enum migrate_mode mode); +int writeout(struct address_space *mapping, struct page *page); +extern int exchange_two_pages(struct page 
*page1, struct page *page2); + #endif /* __MM_INTERNAL_H */ diff --git a/mm/ksm.c b/mm/ksm.c index 6c48ad13b4c9..dc1ec06b71a0 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -2665,6 +2665,41 @@ void ksm_migrate_page(struct page *newpage, struct page *oldpage) set_page_stable_node(oldpage, NULL); } } + +void ksm_exchange_page(struct page *to_page, struct page *from_page) +{ + struct stable_node *to_stable_node, *from_stable_node; + + VM_BUG_ON_PAGE(!PageLocked(to_page), to_page); + VM_BUG_ON_PAGE(!PageLocked(from_page), from_page); + + to_stable_node = page_stable_node(to_page); + from_stable_node = page_stable_node(from_page); + if (to_stable_node) { + VM_BUG_ON_PAGE(to_stable_node->kpfn != page_to_pfn(from_page), + from_page); + to_stable_node->kpfn = page_to_pfn(to_page); + /* + * newpage->mapping was set in advance; now we need smp_wmb() + * to make sure that the new stable_node->kpfn is visible + * to get_ksm_page() before it can see that oldpage->mapping + * has gone stale (or that PageSwapCache has been cleared). + */ + smp_wmb(); + } + if (from_stable_node) { + VM_BUG_ON_PAGE(from_stable_node->kpfn != page_to_pfn(to_page), + to_page); + from_stable_node->kpfn = page_to_pfn(from_page); + /* + * newpage->mapping was set in advance; now we need smp_wmb() + * to make sure that the new stable_node->kpfn is visible + * to get_ksm_page() before it can see that oldpage->mapping + * has gone stale (or that PageSwapCache has been cleared). + */ + smp_wmb(); + } +} #endif /* CONFIG_MIGRATION */ #ifdef CONFIG_MEMORY_HOTREMOVE diff --git a/mm/migrate.c b/mm/migrate.c index d4fd680be3b0..b8c79aa62134 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -701,7 +701,7 @@ EXPORT_SYMBOL(migrate_page); #ifdef CONFIG_BLOCK /* Returns true if all buffers are successfully locked */ -static bool buffer_migrate_lock_buffers(struct buffer_head *head, +bool buffer_migrate_lock_buffers(struct buffer_head *head, enum migrate_mode mode) { struct buffer_head *bh = head; @@ -849,7 +849,7 @@ int buffer_migrate_page_norefs(struct address_space *mapping, /* * Writeback a page to clean the dirty state */ -static int writeout(struct address_space *mapping, struct page *page) +int writeout(struct address_space *mapping, struct page *page) { struct writeback_control wbc = { .sync_mode = WB_SYNC_NONE, From patchwork Fri Feb 15 22:03:05 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Zi Yan X-Patchwork-Id: 10815933 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 4A2D113B5 for ; Fri, 15 Feb 2019 22:03:58 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 3A2242FF77 for ; Fri, 15 Feb 2019 22:03:58 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 2DD043041C; Fri, 15 Feb 2019 22:03:58 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,MAILING_LIST_MULTI,RCVD_IN_DNSWL_NONE autolearn=ham version=3.3.1 Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 0335A2FF77 for ; Fri, 15 Feb 2019 22:03:57 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 849E68E0003; Fri, 15 Feb 2019 17:03:51 -0500 (EST) 
From: Zi Yan <ziy@nvidia.com>
To: linux-mm@kvack.org
Cc: Dave Hansen, Michal Hocko, Kirill A. Shutemov, Andrew Morton,
    Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
    Nitin Gupta, David Nellans, Zi Yan
Subject: [RFC PATCH 02/31] mm: migrate: Add THP exchange support.
Date: Fri, 15 Feb 2019 14:03:05 -0800
Message-ID: <20190215220334.29298-3-ziy@nvidia.com>
In-Reply-To: <20190215220334.29298-1-ziy@nvidia.com>
References: <20190215220334.29298-1-ziy@nvidia.com>

Add support for exchange between two THPs.
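Both this patch and the previous one rest on the same core mechanism: rather
than allocating a destination page and copying into it (as two migrate_pages()
calls would), the contents of the two pages are swapped in place, and a pair of
THPs is handled by repeating that swap across their subpages once the two
compound pages are known to have the same order. The following is a minimal
userspace sketch of that data path only; the names are illustrative stand-ins
for exchange_page()/exchange_huge_page(), not the kernel code itself, and the
flag/mapping exchange (exchange_page_flags()) is omitted.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096UL

/*
 * In-place exchange of two pages' contents, one 64-bit word at a time, in
 * the spirit of exchange_page() in mm/exchange.c: no third buffer and no
 * new page allocation, unlike a migrate-out/migrate-in pair.
 */
static void exchange_page_data(unsigned char *to, unsigned char *from)
{
	uint64_t tmp;
	size_t i;

	for (i = 0; i < PAGE_SIZE; i += sizeof(tmp)) {
		memcpy(&tmp, from + i, sizeof(tmp));
		memcpy(from + i, to + i, sizeof(tmp));
		memcpy(to + i, &tmp, sizeof(tmp));
	}
}

/*
 * A THP is a run of physically contiguous base pages, so exchanging two
 * THPs of the same order is the same swap applied to each subpage in turn.
 */
static void exchange_huge_page_data(unsigned char *to, unsigned char *from,
				    unsigned int nr_pages)
{
	unsigned int i;

	for (i = 0; i < nr_pages; i++)
		exchange_page_data(to + i * PAGE_SIZE, from + i * PAGE_SIZE);
}

int main(void)
{
	static unsigned char a[2 * PAGE_SIZE], b[2 * PAGE_SIZE];

	memset(a, 0xaa, sizeof(a));
	memset(b, 0xbb, sizeof(b));
	exchange_huge_page_data(a, b, 2);
	printf("a[0]=%#x b[0]=%#x\n", a[0], b[0]);	/* expect 0xbb 0xaa */
	return 0;
}

In the kernel code, the per-subpage loop is gated by the can_be_exchanged()
order check introduced in this patch and is followed by exchange_page_flags(),
which swaps flags, mapping and index between the two pages.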
Signed-off-by: Zi Yan --- mm/exchange.c | 47 ++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 38 insertions(+), 9 deletions(-) diff --git a/mm/exchange.c b/mm/exchange.c index a607348cc6f4..8cf286fc0f10 100644 --- a/mm/exchange.c +++ b/mm/exchange.c @@ -48,7 +48,8 @@ struct page_flags { unsigned int page_swapcache:1; unsigned int page_writeback:1; unsigned int page_private:1; - unsigned int __pad:3; + unsigned int page_doublemap:1; + unsigned int __pad:2; }; @@ -125,20 +126,23 @@ static void exchange_huge_page(struct page *dst, struct page *src) static void exchange_page_flags(struct page *to_page, struct page *from_page) { int from_cpupid, to_cpupid; - struct page_flags from_page_flags, to_page_flags; + struct page_flags from_page_flags = {0}, to_page_flags = {0}; struct mem_cgroup *to_memcg = page_memcg(to_page), *from_memcg = page_memcg(from_page); from_cpupid = page_cpupid_xchg_last(from_page, -1); - from_page_flags.page_error = TestClearPageError(from_page); + from_page_flags.page_error = PageError(from_page); + if (from_page_flags.page_error) + ClearPageError(from_page); from_page_flags.page_referenced = TestClearPageReferenced(from_page); from_page_flags.page_uptodate = PageUptodate(from_page); ClearPageUptodate(from_page); from_page_flags.page_active = TestClearPageActive(from_page); from_page_flags.page_unevictable = TestClearPageUnevictable(from_page); from_page_flags.page_checked = PageChecked(from_page); - ClearPageChecked(from_page); + if (from_page_flags.page_checked) + ClearPageChecked(from_page); from_page_flags.page_mappedtodisk = PageMappedToDisk(from_page); ClearPageMappedToDisk(from_page); from_page_flags.page_dirty = PageDirty(from_page); @@ -148,18 +152,22 @@ static void exchange_page_flags(struct page *to_page, struct page *from_page) clear_page_idle(from_page); from_page_flags.page_swapcache = PageSwapCache(from_page); from_page_flags.page_writeback = test_clear_page_writeback(from_page); + from_page_flags.page_doublemap = PageDoubleMap(from_page); to_cpupid = page_cpupid_xchg_last(to_page, -1); - to_page_flags.page_error = TestClearPageError(to_page); + to_page_flags.page_error = PageError(to_page); + if (to_page_flags.page_error) + ClearPageError(to_page); to_page_flags.page_referenced = TestClearPageReferenced(to_page); to_page_flags.page_uptodate = PageUptodate(to_page); ClearPageUptodate(to_page); to_page_flags.page_active = TestClearPageActive(to_page); to_page_flags.page_unevictable = TestClearPageUnevictable(to_page); to_page_flags.page_checked = PageChecked(to_page); - ClearPageChecked(to_page); + if (to_page_flags.page_checked) + ClearPageChecked(to_page); to_page_flags.page_mappedtodisk = PageMappedToDisk(to_page); ClearPageMappedToDisk(to_page); to_page_flags.page_dirty = PageDirty(to_page); @@ -169,6 +177,7 @@ static void exchange_page_flags(struct page *to_page, struct page *from_page) clear_page_idle(to_page); to_page_flags.page_swapcache = PageSwapCache(to_page); to_page_flags.page_writeback = test_clear_page_writeback(to_page); + to_page_flags.page_doublemap = PageDoubleMap(to_page); /* set to_page */ if (from_page_flags.page_error) @@ -195,6 +204,8 @@ static void exchange_page_flags(struct page *to_page, struct page *from_page) set_page_young(to_page); if (from_page_flags.page_is_idle) set_page_idle(to_page); + if (from_page_flags.page_doublemap) + SetPageDoubleMap(to_page); /* set from_page */ if (to_page_flags.page_error) @@ -221,6 +232,8 @@ static void exchange_page_flags(struct page *to_page, struct page *from_page) 
set_page_young(from_page); if (to_page_flags.page_is_idle) set_page_idle(from_page); + if (to_page_flags.page_doublemap) + SetPageDoubleMap(from_page); /* * Copy NUMA information to the new page, to prevent over-eager @@ -280,6 +293,7 @@ static int exchange_page_move_mapping(struct address_space *to_mapping, VM_BUG_ON_PAGE(to_mapping != page_mapping(to_page), to_page); VM_BUG_ON_PAGE(from_mapping != page_mapping(from_page), from_page); + VM_BUG_ON(PageCompound(from_page) != PageCompound(to_page)); if (!to_mapping) { /* Anonymous page without mapping */ @@ -600,7 +614,6 @@ static int unmap_and_exchange(struct page *from_page, to_mapping = to_page->mapping; from_index = from_page->index; to_index = to_page->index; - /* * Corner case handling: * 1. When a new swap-cache page is read into, it is added to the LRU @@ -691,6 +704,23 @@ static int unmap_and_exchange(struct page *from_page, return rc; } +static bool can_be_exchanged(struct page *from, struct page *to) +{ + if (PageCompound(from) != PageCompound(to)) + return false; + + if (PageHuge(from) != PageHuge(to)) + return false; + + if (PageHuge(from) || PageHuge(to)) + return false; + + if (compound_order(from) != compound_order(to)) + return false; + + return true; +} + /* * Exchange pages in the exchange_list * @@ -751,9 +781,8 @@ static int exchange_pages(struct list_head *exchange_list, continue; } - /* TODO: compound page not supported */ /* to_page can be file-backed page */ - if (PageCompound(from_page) || + if (!can_be_exchanged(from_page, to_page) || page_mapping(from_page) ) { ++failed; From patchwork Fri Feb 15 22:03:06 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Zi Yan X-Patchwork-Id: 10815931 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 138261390 for ; Fri, 15 Feb 2019 22:03:55 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 0301A2FF77 for ; Fri, 15 Feb 2019 22:03:55 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id EA63E3041C; Fri, 15 Feb 2019 22:03:54 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,MAILING_LIST_MULTI,RCVD_IN_DNSWL_NONE autolearn=ham version=3.3.1 Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 905A22FF77 for ; Fri, 15 Feb 2019 22:03:54 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 5143C8E0004; Fri, 15 Feb 2019 17:03:51 -0500 (EST) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 4AA498E0001; Fri, 15 Feb 2019 17:03:51 -0500 (EST) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 191548E0004; Fri, 15 Feb 2019 17:03:51 -0500 (EST) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from mail-yb1-f199.google.com (mail-yb1-f199.google.com [209.85.219.199]) by kanga.kvack.org (Postfix) with ESMTP id DAB8A8E0003 for ; Fri, 15 Feb 2019 17:03:50 -0500 (EST) Received: by mail-yb1-f199.google.com with SMTP id u66so6715729ybb.15 for ; Fri, 15 Feb 2019 14:03:50 -0800 (PST) 
From: Zi Yan <ziy@nvidia.com>
To: linux-mm@kvack.org
Cc: Dave Hansen, Michal Hocko, Kirill A. Shutemov, Andrew Morton,
    Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
    Nitin Gupta, David Nellans, Zi Yan
Subject: [RFC PATCH 03/31] mm: migrate: Add tmpfs exchange support.
Date: Fri, 15 Feb 2019 14:03:06 -0800
Message-ID: <20190215220334.29298-4-ziy@nvidia.com>
In-Reply-To: <20190215220334.29298-1-ziy@nvidia.com>
References: <20190215220334.29298-1-ziy@nvidia.com>

tmpfs uses the same migratepage routine as anonymous pages, so enabling page
exchange for it is straightforward.
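The dispatch this patch adds can be pictured with a small self-contained
model: a mapping advertises its migration callback through an ops table, and
the exchange path recognizes shmem/tmpfs by comparing that pointer against the
generic migrate_page helper. The types and function names below are simplified
stand-ins for illustration, not the kernel's own definitions.

#include <stdio.h>

/* Simplified stand-ins for the kernel types involved (illustration only). */
struct page;
struct address_space;

typedef int (*migratepage_fn)(struct address_space *, struct page *);

/* What shmem/tmpfs installs in the kernel: the generic migration helper. */
static int generic_migrate_page(struct address_space *m, struct page *p)
{
	(void)m; (void)p;
	return 0;
}

/* What block-backed filesystems install: buffer-head aware migration. */
static int buffer_migrate_page_stub(struct address_space *m, struct page *p)
{
	(void)m; (void)p;
	return 0;
}

struct address_space_ops {
	migratepage_fn migratepage;
};

/*
 * Mirrors the dispatch added in this patch: a mapping whose ->migratepage is
 * the generic helper (shmem/tmpfs) can take the plain "exchange the mapping"
 * path, a buffer-head user needs extra buffer handling, and anything else
 * falls back to write-out or fails.
 */
static const char *pick_exchange_path(const struct address_space_ops *ops)
{
	if (!ops)						/* anonymous page */
		return "exchange mappings directly";
	if (ops->migratepage == generic_migrate_page)		/* shmem/tmpfs */
		return "exchange mappings directly";
	if (ops->migratepage == buffer_migrate_page_stub)	/* block-backed fs */
		return "exchange mappings, then move buffer heads";
	return "fall back to writeout or give up";
}

int main(void)
{
	struct address_space_ops shmem_ops = { .migratepage = generic_migrate_page };
	struct address_space_ops blkfs_ops = { .migratepage = buffer_migrate_page_stub };

	printf("anon : %s\n", pick_exchange_path(NULL));
	printf("shmem: %s\n", pick_exchange_path(&shmem_ops));
	printf("blkfs: %s\n", pick_exchange_path(&blkfs_ops));
	return 0;
}

In the patch itself the same comparison simply redirects shmem mappings to the
existing exchange_mappings path.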
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/exchange.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/exchange.c b/mm/exchange.c
index 8cf286fc0f10..851f1a99b48b 100644
--- a/mm/exchange.c
+++ b/mm/exchange.c
@@ -466,7 +466,10 @@ static int exchange_from_to_pages(struct page *to_page, struct page *from_page,
 		rc = exchange_page_move_mapping(to_page_mapping, from_page_mapping,
 					to_page, from_page, NULL, NULL, mode, 0, 0);
 	} else {
-		if (to_page_mapping->a_ops->migratepage == buffer_migrate_page) {
+		/* shmem */
+		if (to_page_mapping->a_ops->migratepage == migrate_page)
+			goto exchange_mappings;
+		else if (to_page_mapping->a_ops->migratepage == buffer_migrate_page) {

 			if (!page_has_buffers(to_page))
 				goto exchange_mappings;

From patchwork Fri Feb 15 22:03:07 2019
X-Patchwork-Submitter: Zi Yan
X-Patchwork-Id: 10815937
From: Zi Yan
To: ,
CC: Dave Hansen, Michal Hocko, "Kirill A . Shutemov", Andrew Morton, Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove, Nitin Gupta, David Nellans, Zi Yan
Subject: [RFC PATCH 04/31] mm: add mem_defrag functionality.
Date: Fri, 15 Feb 2019 14:03:07 -0800
Message-ID: <20190215220334.29298-5-ziy@nvidia.com>
In-Reply-To: <20190215220334.29298-1-ziy@nvidia.com>
References: <20190215220334.29298-1-ziy@nvidia.com>

Create contiguous physical memory regions by migrating/exchanging pages.

1. It scans all VMAs in one process and determines an anchor pair (VPN, PFN) for each VMA.
2. Then, it migrates/exchanges pages in each VMA to align them with the anchor pair.

At the end of the day, a physically contiguous region should be created for each VMA in the physical address range [PFN, PFN + VMA size).

Signed-off-by: Zi Yan
---
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 fs/exec.c | 4 +
 include/linux/mem_defrag.h | 60 +
 include/linux/mm.h | 12 +
 include/linux/mm_types.h | 4 +
 include/linux/sched/coredump.h | 3 +
 include/linux/syscalls.h | 3 +
 include/linux/vm_event_item.h | 23 +
 include/uapi/asm-generic/mman-common.h | 3 +
 kernel/fork.c | 9 +
 kernel/sysctl.c | 79 +-
 mm/Makefile | 1 +
 mm/compaction.c | 17 +-
 mm/huge_memory.c | 4 +
 mm/internal.h | 28 +
 mm/khugepaged.c | 1 +
 mm/madvise.c | 15 +
 mm/mem_defrag.c | 1782 ++++++++++++++++++++++++
 mm/memory.c | 7 +
 mm/mmap.c | 29 +
 mm/page_alloc.c | 4 +-
 mm/vmstat.c | 21 +
 22 files changed, 2096 insertions(+), 14 deletions(-)
 create mode 100644 include/linux/mem_defrag.h
 create mode 100644 mm/mem_defrag.c

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index f0b1709a5ffb..374c11e3cf80 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -343,6 +343,7 @@ 332 common statx __x64_sys_statx 333 common io_pgetevents __x64_sys_io_pgetevents 334 common rseq __x64_sys_rseq +335 common scan_process_memory __x64_sys_scan_process_memory # # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/exec.c b/fs/exec.c index fb72d36f7823..b71b9d305d7d 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1010,7 +1010,11 @@ static int exec_mmap(struct mm_struct *mm) { struct task_struct *tsk; struct mm_struct *old_mm, *active_mm; + int move_mem_defrag = current->mm ?
+ test_bit(MMF_VM_MEM_DEFRAG_ALL, ¤t->mm->flags):0; + if (move_mem_defrag && mm) + set_bit(MMF_VM_MEM_DEFRAG_ALL, &mm->flags); /* Notify parent that we're no longer interested in the old VM */ tsk = current; old_mm = current->mm; diff --git a/include/linux/mem_defrag.h b/include/linux/mem_defrag.h new file mode 100644 index 000000000000..43954a316752 --- /dev/null +++ b/include/linux/mem_defrag.h @@ -0,0 +1,60 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * mem_defrag.h Memory defragmentation function. + * + * Copyright (C) 2019 Zi Yan + * + * + */ +#ifndef _LINUX_KMEM_DEFRAGD_H +#define _LINUX_KMEM_DEFRAGD_H + +#include /* MMF_VM_MEM_DEFRAG */ + +#define MEM_DEFRAG_SCAN 0 +#define MEM_DEFRAG_MARK_SCAN_ALL 1 +#define MEM_DEFRAG_CLEAR_SCAN_ALL 2 +#define MEM_DEFRAG_DEFRAG 3 +#define MEM_DEFRAG_CONTIG_SCAN 5 + +enum mem_defrag_action { + MEM_DEFRAG_FULL_STATS = 0, + MEM_DEFRAG_DO_DEFRAG, + MEM_DEFRAG_CONTIG_STATS, +}; + +extern int kmem_defragd_always; + +extern int __kmem_defragd_enter(struct mm_struct *mm); +extern void __kmem_defragd_exit(struct mm_struct *mm); +extern int memdefrag_madvise(struct vm_area_struct *vma, + unsigned long *vm_flags, int advice); + +static inline int kmem_defragd_fork(struct mm_struct *mm, + struct mm_struct *oldmm) +{ + if (test_bit(MMF_VM_MEM_DEFRAG, &oldmm->flags)) + return __kmem_defragd_enter(mm); + return 0; +} + +static inline void kmem_defragd_exit(struct mm_struct *mm) +{ + if (test_bit(MMF_VM_MEM_DEFRAG, &mm->flags)) + __kmem_defragd_exit(mm); +} + +static inline int kmem_defragd_enter(struct vm_area_struct *vma, + unsigned long vm_flags) +{ + if (!test_bit(MMF_VM_MEM_DEFRAG, &vma->vm_mm->flags)) + if (((kmem_defragd_always || + ((vm_flags & VM_MEMDEFRAG))) && + !(vm_flags & VM_NOMEMDEFRAG)) || + test_bit(MMF_VM_MEM_DEFRAG_ALL, &vma->vm_mm->flags)) + if (__kmem_defragd_enter(vma->vm_mm)) + return -ENOMEM; + return 0; +} + +#endif /* _LINUX_KMEM_DEFRAGD_H */ diff --git a/include/linux/mm.h b/include/linux/mm.h index 80bb6408fe73..5bcc1b03372a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -251,13 +251,20 @@ extern unsigned int kobjsize(const void *objp); #define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */ #define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */ #define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit architectures */ +#define VM_HIGH_ARCH_BIT_5 37 /* bit only usable on 64-bit architectures */ +#define VM_HIGH_ARCH_BIT_6 38 /* bit only usable on 64-bit architectures */ #define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0) #define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1) #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2) #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3) #define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4) +#define VM_HIGH_ARCH_5 BIT(VM_HIGH_ARCH_BIT_5) +#define VM_HIGH_ARCH_6 BIT(VM_HIGH_ARCH_BIT_6) #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */ +#define VM_MEMDEFRAG VM_HIGH_ARCH_5 /* memory defrag */ +#define VM_NOMEMDEFRAG VM_HIGH_ARCH_6 /* no memory defrag */ + #ifdef CONFIG_ARCH_HAS_PKEYS # define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0 # define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 4-bit value */ @@ -487,6 +494,9 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm) vma->vm_mm = mm; vma->vm_ops = &dummy_vm_ops; INIT_LIST_HEAD(&vma->anon_vma_chain); + vma->anchor_page_rb = RB_ROOT_CACHED; + vma->vma_create_jiffies = jiffies; + vma->vma_defrag_jiffies = 0; } static inline void vma_set_anonymous(struct vm_area_struct *vma) @@ -2837,6 +2847,8 @@ static 
inline bool debug_guardpage_enabled(void) { return false; } static inline bool page_is_guard(struct page *page) { return false; } #endif /* CONFIG_DEBUG_PAGEALLOC */ +void free_anchor_pages(struct vm_area_struct *vma); + #if MAX_NUMNODES > 1 void __init setup_nr_node_ids(void); #else diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 2c471a2c43fa..32549b255d25 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -328,6 +328,10 @@ struct vm_area_struct { struct mempolicy *vm_policy; /* NUMA policy for the VMA */ #endif struct vm_userfaultfd_ctx vm_userfaultfd_ctx; + struct rb_root_cached anchor_page_rb; + unsigned long vma_create_jiffies; /* life time of the vma */ + unsigned long vma_modify_jiffies; /* being modified time of the vma */ + unsigned long vma_defrag_jiffies; /* being defragged time of the vma */ } __randomize_layout; struct core_thread { diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h index ecdc6542070f..52ad71db6687 100644 --- a/include/linux/sched/coredump.h +++ b/include/linux/sched/coredump.h @@ -76,5 +76,8 @@ static inline int get_dumpable(struct mm_struct *mm) #define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\ MMF_DISABLE_THP_MASK) +#define MMF_VM_MEM_DEFRAG 26 /* set mm is added to do mem defrag */ +#define MMF_VM_MEM_DEFRAG_ALL 27 /* all vmas in the mm will be mem defrag */ + #endif /* _LINUX_SCHED_COREDUMP_H */ diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 257cccba3062..caa6f043b29a 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -926,6 +926,8 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags, unsigned mask, struct statx __user *buffer); asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len, int flags, uint32_t sig); +asmlinkage long sys_scan_process_memory(pid_t pid, char __user *out_buf, + int buf_len, int action); /* * Architecture-specific system calls @@ -1315,4 +1317,5 @@ static inline unsigned int ksys_personality(unsigned int personality) return old; } + #endif diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index 47a3441cf4c4..6b32c8243616 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -109,6 +109,29 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, #ifdef CONFIG_SWAP SWAP_RA, SWAP_RA_HIT, +#endif + /* MEM_DEFRAG STATS */ + MEM_DEFRAG_DEFRAG_NUM, + MEM_DEFRAG_SCAN_NUM, + MEM_DEFRAG_DST_FREE_PAGES, + MEM_DEFRAG_DST_ANON_PAGES, + MEM_DEFRAG_DST_FILE_PAGES, + MEM_DEFRAG_DST_NONLRU_PAGES, + MEM_DEFRAG_DST_FREE_PAGES_FAILED, + MEM_DEFRAG_DST_FREE_PAGES_OVERFLOW_FAILED, + MEM_DEFRAG_DST_ANON_PAGES_FAILED, + MEM_DEFRAG_DST_FILE_PAGES_FAILED, + MEM_DEFRAG_DST_NONLRU_PAGES_FAILED, + MEM_DEFRAG_SRC_ANON_PAGES_FAILED, + MEM_DEFRAG_SRC_COMP_PAGES_FAILED, + MEM_DEFRAG_DST_SPLIT_HUGEPAGES, +#ifdef CONFIG_COMPACTION + /* memory compaction */ + COMPACT_MIGRATE_PAGES, +#endif +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + /* thp collapse */ + THP_COLLAPSE_MIGRATE_PAGES, #endif NR_VM_EVENT_ITEMS }; diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h index e7ee32861d51..d1ec94a1970d 100644 --- a/include/uapi/asm-generic/mman-common.h +++ b/include/uapi/asm-generic/mman-common.h @@ -66,6 +66,9 @@ #define MADV_WIPEONFORK 18 /* Zero memory on fork, child only */ #define MADV_KEEPONFORK 19 /* Undo MADV_WIPEONFORK */ +#define MADV_MEMDEFRAG 20 /* Worth backing with hugepages */ +#define MADV_NOMEMDEFRAG 21 /* 
Not worth backing with hugepages */ + /* compatibility flags */ #define MAP_FILE 0 diff --git a/kernel/fork.c b/kernel/fork.c index b69248e6f0e0..dcefa978c232 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -92,6 +92,7 @@ #include #include #include +#include #include #include @@ -343,12 +344,16 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) if (new) { *new = *orig; INIT_LIST_HEAD(&new->anon_vma_chain); + new->anchor_page_rb = RB_ROOT_CACHED; + new->vma_create_jiffies = jiffies; + new->vma_defrag_jiffies = 0; } return new; } void vm_area_free(struct vm_area_struct *vma) { + free_anchor_pages(vma); kmem_cache_free(vm_area_cachep, vma); } @@ -496,6 +501,9 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, if (retval) goto out; retval = khugepaged_fork(mm, oldmm); + if (retval) + goto out; + retval = kmem_defragd_fork(mm, oldmm); if (retval) goto out; @@ -1044,6 +1052,7 @@ static inline void __mmput(struct mm_struct *mm) exit_aio(mm); ksm_exit(mm); khugepaged_exit(mm); /* must run before exit_mmap */ + kmem_defragd_exit(mm); exit_mmap(mm); mm_put_huge_zero_page(mm); set_mm_exe_file(mm, NULL); diff --git a/kernel/sysctl.c b/kernel/sysctl.c index ba4d9e85feb8..6bf0be1af7e0 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -115,6 +115,13 @@ extern unsigned int sysctl_nr_open_min, sysctl_nr_open_max; extern int sysctl_nr_trim_pages; #endif +extern int kmem_defragd_always; +extern int vma_scan_percentile; +extern int vma_scan_threshold_type; +extern int vma_no_repeat_defrag; +extern int num_breakout_chunks; +extern int defrag_size_threshold; + /* Constants used for minimum and maximum */ #ifdef CONFIG_LOCKUP_DETECTOR static int sixty = 60; @@ -1303,7 +1310,7 @@ static struct ctl_table vm_table[] = { .proc_handler = overcommit_kbytes_handler, }, { - .procname = "page-cluster", + .procname = "page-cluster", .data = &page_cluster, .maxlen = sizeof(int), .mode = 0644, @@ -1691,6 +1698,58 @@ static struct ctl_table vm_table[] = { .extra2 = (void *)&mmap_rnd_compat_bits_max, }, #endif + { + .procname = "kmem_defragd_always", + .data = &kmem_defragd_always, + .maxlen = sizeof(kmem_defragd_always), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = &zero, + .extra2 = &one, + }, + { + .procname = "vma_scan_percentile", + .data = &vma_scan_percentile, + .maxlen = sizeof(vma_scan_percentile), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = &zero, + .extra2 = &one_hundred, + }, + { + .procname = "vma_scan_threshold_type", + .data = &vma_scan_threshold_type, + .maxlen = sizeof(vma_scan_threshold_type), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = &zero, + .extra2 = &one, + }, + { + .procname = "vma_no_repeat_defrag", + .data = &vma_no_repeat_defrag, + .maxlen = sizeof(vma_no_repeat_defrag), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = &zero, + .extra2 = &one, + }, + { + .procname = "num_breakout_chunks", + .data = &num_breakout_chunks, + .maxlen = sizeof(num_breakout_chunks), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = &zero, + }, + { + .procname = "defrag_size_threshold", + .data = &defrag_size_threshold, + .maxlen = sizeof(defrag_size_threshold), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = &zero, + }, { } }; @@ -1807,7 +1866,7 @@ static struct ctl_table fs_table[] = { .mode = 0555, .child = inotify_table, }, -#endif +#endif #ifdef CONFIG_EPOLL { .procname = "epoll", @@ -2252,12 +2311,12 @@ static int __do_proc_dointvec(void *tbl_data, struct 
ctl_table *table, int *i, vleft, first = 1, err = 0; size_t left; char *kbuf = NULL, *p; - + if (!tbl_data || !table->maxlen || !*lenp || (*ppos && !write)) { *lenp = 0; return 0; } - + i = (int *) tbl_data; vleft = table->maxlen / sizeof(*i); left = *lenp; @@ -2483,7 +2542,7 @@ static int do_proc_douintvec(struct ctl_table *table, int write, * @ppos: file position * * Reads/writes up to table->maxlen/sizeof(unsigned int) integer - * values from/to the user buffer, treated as an ASCII string. + * values from/to the user buffer, treated as an ASCII string. * * Returns 0 on success. */ @@ -2974,7 +3033,7 @@ static int do_proc_dointvec_ms_jiffies_conv(bool *negp, unsigned long *lvalp, * @ppos: file position * * Reads/writes up to table->maxlen/sizeof(unsigned int) integer - * values from/to the user buffer, treated as an ASCII string. + * values from/to the user buffer, treated as an ASCII string. * The values read are assumed to be in seconds, and are converted into * jiffies. * @@ -2996,8 +3055,8 @@ int proc_dointvec_jiffies(struct ctl_table *table, int write, * @ppos: pointer to the file position * * Reads/writes up to table->maxlen/sizeof(unsigned int) integer - * values from/to the user buffer, treated as an ASCII string. - * The values read are assumed to be in 1/USER_HZ seconds, and + * values from/to the user buffer, treated as an ASCII string. + * The values read are assumed to be in 1/USER_HZ seconds, and * are converted into jiffies. * * Returns 0 on success. @@ -3019,8 +3078,8 @@ int proc_dointvec_userhz_jiffies(struct ctl_table *table, int write, * @ppos: the current position in the file * * Reads/writes up to table->maxlen/sizeof(unsigned int) integer - * values from/to the user buffer, treated as an ASCII string. - * The values read are assumed to be in 1/1000 seconds, and + * values from/to the user buffer, treated as an ASCII string. + * The values read are assumed to be in 1/1000 seconds, and * are converted into jiffies. * * Returns 0 on success. 
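The six knobs added to vm_table above (kmem_defragd_always, vma_scan_percentile, vma_scan_threshold_type, vma_no_repeat_defrag, num_breakout_chunks, defrag_size_threshold) are ordinary proc_dointvec entries, so on a kernel carrying this series they should surface as files under /proc/sys/vm/. A minimal userspace sketch of driving them is below; the write_vm_sysctl() helper and the chosen values are illustrative, only the knob names come from the hunk above.

/*
 * Minimal sketch: toggle the mem_defrag sysctls from userspace.
 * Assumes a kernel with this series applied; the /proc/sys/vm/ paths
 * follow from the knobs being registered in vm_table.
 */
#include <stdio.h>

static int write_vm_sysctl(const char *name, const char *value)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/sys/vm/%s", name);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%s\n", value);
	fclose(f);
	return 0;
}

int main(void)
{
	/* treat every VMA as defrag-eligible, not only MADV_MEMDEFRAG ones */
	write_vm_sysctl("kmem_defragd_always", "1");
	/* per scan, only defrag the top 25% of VMAs by lifetime (or size) */
	write_vm_sysctl("vma_scan_percentile", "25");
	/* skip VMAs smaller than 5 PMD-sized huge pages (the series' default) */
	write_vm_sysctl("defrag_size_threshold", "5");
	return 0;
}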
diff --git a/mm/Makefile b/mm/Makefile index 1574ea5743e4..925f21c717db 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -44,6 +44,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \ obj-y += init-mm.o obj-y += memblock.o obj-y += exchange.o +obj-y += mem_defrag.o ifdef CONFIG_MMU obj-$(CONFIG_ADVISE_SYSCALLS) += madvise.o diff --git a/mm/compaction.c b/mm/compaction.c index ef29490b0f46..54c4bfdbdbc3 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -50,7 +50,7 @@ static inline void count_compact_events(enum vm_event_item item, long delta) #define pageblock_start_pfn(pfn) block_start_pfn(pfn, pageblock_order) #define pageblock_end_pfn(pfn) block_end_pfn(pfn, pageblock_order) -static unsigned long release_freepages(struct list_head *freelist) +unsigned long release_freepages(struct list_head *freelist) { struct page *page, *next; unsigned long high_pfn = 0; @@ -58,7 +58,10 @@ static unsigned long release_freepages(struct list_head *freelist) list_for_each_entry_safe(page, next, freelist, lru) { unsigned long pfn = page_to_pfn(page); list_del(&page->lru); - __free_page(page); + if (PageCompound(page)) + __free_pages(page, compound_order(page)); + else + __free_page(page); if (pfn > high_pfn) high_pfn = pfn; } @@ -1593,6 +1596,8 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) { int err; + int num_migrated_pages = 0; + struct page *iter; switch (isolate_migratepages(zone, cc)) { case ISOLATE_ABORT: @@ -1611,6 +1616,9 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro ; } + list_for_each_entry(iter, &cc->migratepages, lru) + num_migrated_pages++; + err = migrate_pages(&cc->migratepages, compaction_alloc, compaction_free, (unsigned long)cc, cc->mode, MR_COMPACTION); @@ -1618,6 +1626,11 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro trace_mm_compaction_migratepages(cc->nr_migratepages, err, &cc->migratepages); + list_for_each_entry(iter, &cc->migratepages, lru) + num_migrated_pages--; + + count_vm_events(COMPACT_MIGRATE_PAGES, num_migrated_pages); + /* All pages were either migrated or will be released */ cc->nr_migratepages = 0; if (err) { diff --git a/mm/huge_memory.c b/mm/huge_memory.c index faf357eaf0ce..ffcae07a87d3 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -33,6 +33,7 @@ #include #include #include +#include #include #include @@ -695,6 +696,9 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf) return VM_FAULT_OOM; if (unlikely(khugepaged_enter(vma, vma->vm_flags))) return VM_FAULT_OOM; + /* Make it defrag */ + if (unlikely(kmem_defragd_enter(vma, vma->vm_flags))) + return VM_FAULT_OOM; if (!(vmf->flags & FAULT_FLAG_WRITE) && !mm_forbids_zeropage(vma->vm_mm) && transparent_hugepage_use_zero_page()) { diff --git a/mm/internal.h b/mm/internal.h index 77e205c423ce..4fe8d1a4d7bb 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -15,6 +15,7 @@ #include #include #include +#include /* * The set of flags that only affect watermark checking and reclaim @@ -549,4 +550,31 @@ bool buffer_migrate_lock_buffers(struct buffer_head *head, int writeout(struct address_space *mapping, struct page *page); extern int exchange_two_pages(struct page *page1, struct page *page2); +struct anchor_page_info { + struct list_head list; + struct page *anchor_page; + unsigned long vaddr; + unsigned long start; + unsigned long end; +}; + +struct anchor_page_node { + struct interval_tree_node node; + unsigned long anchor_pfn; /* 
struct page can be calculated from pfn_to_page() */ + unsigned long anchor_vpn; +}; + +unsigned long release_freepages(struct list_head *freelist); + +void free_anchor_pages(struct vm_area_struct *vma); + +extern int exchange_two_pages(struct page *page1, struct page *page2); + +void expand(struct zone *zone, struct page *page, + int low, int high, struct free_area *area, + int migratetype); + +void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags, + unsigned int alloc_flags); + #endif /* __MM_INTERNAL_H */ diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 4f017339ddb2..aedaa9f75806 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -660,6 +660,7 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page, } else { src_page = pte_page(pteval); copy_user_highpage(page, src_page, address, vma); + count_vm_event(THP_COLLAPSE_MIGRATE_PAGES); VM_BUG_ON_PAGE(page_mapcount(src_page) != 1, src_page); release_pte_page(src_page); /* diff --git a/mm/madvise.c b/mm/madvise.c index 21a7881a2db4..9cef96d633e8 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -24,6 +24,7 @@ #include #include #include +#include #include @@ -616,6 +617,13 @@ static long madvise_remove(struct vm_area_struct *vma, return error; } +static long madvise_memdefrag(struct vm_area_struct *vma, + struct vm_area_struct **prev, + unsigned long start, unsigned long end, int behavior) +{ + *prev = vma; + return memdefrag_madvise(vma, &vma->vm_flags, behavior); +} #ifdef CONFIG_MEMORY_FAILURE /* * Error injection support for memory error handling. @@ -697,6 +705,9 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev, case MADV_FREE: case MADV_DONTNEED: return madvise_dontneed_free(vma, prev, start, end, behavior); + case MADV_MEMDEFRAG: + case MADV_NOMEMDEFRAG: + return madvise_memdefrag(vma, prev, start, end, behavior); default: return madvise_behavior(vma, prev, start, end, behavior); } @@ -731,6 +742,8 @@ madvise_behavior_valid(int behavior) case MADV_SOFT_OFFLINE: case MADV_HWPOISON: #endif + case MADV_MEMDEFRAG: + case MADV_NOMEMDEFRAG: return true; default: @@ -785,6 +798,8 @@ madvise_behavior_valid(int behavior) * MADV_DONTDUMP - the application wants to prevent pages in the given range * from being included in its core dump. * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump. + * MADV_MEMDEFRAG - allow mem defrag running on this region. + * MADV_NOMEMDEFRAG - no mem defrag here. * * return values: * zero - success diff --git a/mm/mem_defrag.c b/mm/mem_defrag.c new file mode 100644 index 000000000000..414909e1c19c --- /dev/null +++ b/mm/mem_defrag.c @@ -0,0 +1,1782 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Memory defragmentation. + * + * Copyright (C) 2019 Zi Yan + * + * Two lists: + * 1) a mm list, representing virtual address spaces + * 2) a anon_vma list, representing the physical address space. 
+ */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include "internal.h" + + +struct contig_stats { + int err; + unsigned long contig_pages; + unsigned long first_vaddr_in_chunk; + unsigned long first_paddr_in_chunk; +}; + +struct defrag_result_stats { + unsigned long aligned; + unsigned long migrated; + unsigned long src_pte_thp_failed; + unsigned long src_thp_dst_not_failed; + unsigned long src_not_present; + unsigned long dst_out_of_bound_failed; + unsigned long dst_pte_thp_failed; + unsigned long dst_thp_src_not_failed; + unsigned long dst_isolate_free_failed; + unsigned long dst_migrate_free_failed; + unsigned long dst_anon_failed; + unsigned long dst_file_failed; + unsigned long dst_non_lru_failed; + unsigned long dst_non_moveable_failed; + unsigned long not_defrag_vpn; +}; + +enum { + VMA_THRESHOLD_TYPE_TIME = 0, + VMA_THRESHOLD_TYPE_SIZE, +}; + +int num_breakout_chunks; +int vma_scan_percentile = 100; +int vma_scan_threshold_type = VMA_THRESHOLD_TYPE_TIME; +int vma_no_repeat_defrag; +int kmem_defragd_always; +int defrag_size_threshold = 5; +static DEFINE_SPINLOCK(kmem_defragd_mm_lock); + +#define MM_SLOTS_HASH_BITS 10 +static __read_mostly DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS); + +static struct kmem_cache *mm_slot_cache __read_mostly; + +struct defrag_scan_control { + struct mm_struct *mm; + unsigned long scan_address; + char __user *out_buf; + int buf_len; + int used_len; + enum mem_defrag_action action; + bool scan_in_vma; + unsigned long vma_scan_threshold; +}; + +/** + * struct mm_slot - hash lookup from mm to mm_slot + * @hash: hash collision list + * @mm_node: kmem_defragd scan list headed in kmem_defragd_scan.mm_head + * @mm: the mm that this information is valid for + */ +struct mm_slot { + struct hlist_node hash; + struct list_head mm_node; + struct mm_struct *mm; +}; + +/** + * struct kmem_defragd_scan - cursor for scanning + * @mm_head: the head of the mm list to scan + * @mm_slot: the current mm_slot we are scanning + * @address: the next address inside that to be scanned + * + * There is only the one kmem_defragd_scan instance of this cursor structure. 
+ */ +struct kmem_defragd_scan { + struct list_head mm_head; + struct mm_slot *mm_slot; + unsigned long address; +}; + +static struct kmem_defragd_scan kmem_defragd_scan = { + .mm_head = LIST_HEAD_INIT(kmem_defragd_scan.mm_head), +}; + + +static inline struct mm_slot *alloc_mm_slot(void) +{ + if (!mm_slot_cache) /* initialization failed */ + return NULL; + return kmem_cache_zalloc(mm_slot_cache, GFP_KERNEL); +} + +static inline void free_mm_slot(struct mm_slot *mm_slot) +{ + kmem_cache_free(mm_slot_cache, mm_slot); +} + +static struct mm_slot *get_mm_slot(struct mm_struct *mm) +{ + struct mm_slot *mm_slot; + + hash_for_each_possible(mm_slots_hash, mm_slot, hash, (unsigned long)mm) + if (mm == mm_slot->mm) + return mm_slot; + + return NULL; +} + +static void insert_to_mm_slots_hash(struct mm_struct *mm, + struct mm_slot *mm_slot) +{ + mm_slot->mm = mm; + hash_add(mm_slots_hash, &mm_slot->hash, (long)mm); +} + +static inline int kmem_defragd_test_exit(struct mm_struct *mm) +{ + return atomic_read(&mm->mm_users) == 0; +} + +int __kmem_defragd_enter(struct mm_struct *mm) +{ + struct mm_slot *mm_slot; + + mm_slot = alloc_mm_slot(); + if (!mm_slot) + return -ENOMEM; + + /* __kmem_defragd_exit() must not run from under us */ + VM_BUG_ON_MM(kmem_defragd_test_exit(mm), mm); + if (unlikely(test_and_set_bit(MMF_VM_MEM_DEFRAG, &mm->flags))) { + free_mm_slot(mm_slot); + return 0; + } + + spin_lock(&kmem_defragd_mm_lock); + insert_to_mm_slots_hash(mm, mm_slot); + /* + * Insert just behind the scanning cursor, to let the area settle + * down a little. + */ + list_add_tail(&mm_slot->mm_node, &kmem_defragd_scan.mm_head); + spin_unlock(&kmem_defragd_mm_lock); + + atomic_inc(&mm->mm_count); + + return 0; +} + +void __kmem_defragd_exit(struct mm_struct *mm) +{ + struct mm_slot *mm_slot; + int free = 0; + + spin_lock(&kmem_defragd_mm_lock); + mm_slot = get_mm_slot(mm); + if (mm_slot && kmem_defragd_scan.mm_slot != mm_slot) { + hash_del(&mm_slot->hash); + list_del(&mm_slot->mm_node); + free = 1; + } + spin_unlock(&kmem_defragd_mm_lock); + + if (free) { + clear_bit(MMF_VM_MEM_DEFRAG, &mm->flags); + free_mm_slot(mm_slot); + mmdrop(mm); + } else if (mm_slot) { + /* + * This is required to serialize against + * kmem_defragd_test_exit() (which is guaranteed to run + * under mmap sem read mode). Stop here (after we + * return all pagetables will be destroyed) until + * kmem_defragd has finished working on the pagetables + * under the mmap_sem. + */ + down_write(&mm->mmap_sem); + up_write(&mm->mmap_sem); + } +} + +static void collect_mm_slot(struct mm_slot *mm_slot) +{ + struct mm_struct *mm = mm_slot->mm; + + VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&kmem_defragd_mm_lock)); + + if (kmem_defragd_test_exit(mm)) { + /* free mm_slot */ + hash_del(&mm_slot->hash); + list_del(&mm_slot->mm_node); + + /* + * Not strictly needed because the mm exited already. 
+ * + * clear_bit(MMF_VM_HUGEPAGE, &mm->flags); + */ + + /* kmem_defragd_mm_lock actually not necessary for the below */ + free_mm_slot(mm_slot); + mmdrop(mm); + } +} + +static bool mem_defrag_vma_check(struct vm_area_struct *vma) +{ + if ((!test_bit(MMF_VM_MEM_DEFRAG_ALL, &vma->vm_mm->flags) && + !(vma->vm_flags & VM_MEMDEFRAG) && !kmem_defragd_always) || + (vma->vm_flags & VM_NOMEMDEFRAG)) + return false; + if (shmem_file(vma->vm_file)) { + if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE)) + return false; + return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff, + HPAGE_PMD_NR); + } + if (is_vm_hugetlb_page(vma)) + return true; + if (!vma->anon_vma || vma->vm_ops) + return false; + if (is_vma_temporary_stack(vma)) + return false; + return true; +} + +static int do_vma_stat(struct mm_struct *mm, struct vm_area_struct *vma, + char *kernel_buf, int pos, int *remain_buf_len) +{ + int used_len; + int init_remain_len = *remain_buf_len; + + if (!*remain_buf_len || !kernel_buf) + return -1; + + used_len = scnprintf(kernel_buf + pos, *remain_buf_len, "%p, 0x%lx, 0x%lx, " + "0x%lx, -1\n", + mm, (unsigned long)vma+vma->vma_create_jiffies, + vma->vm_start, vma->vm_end); + + *remain_buf_len -= used_len; + + if (*remain_buf_len == 1) { + *remain_buf_len = init_remain_len; + kernel_buf[pos] = '\0'; + return -1; + } + + return 0; +} + +static inline int get_contig_page_size(struct page *page) +{ + int page_size = PAGE_SIZE; + + if (PageCompound(page)) { + struct page *head_page = compound_head(page); + int compound_size = PAGE_SIZE<first_vaddr_in_chunk) { + contig_stats->first_vaddr_in_chunk = vaddr; + contig_stats->first_paddr_in_chunk = paddr; + contig_stats->contig_pages = 0; + } + + /* scan_in_vma is set to true if buffer runs out while scanning a + * vma. A corner case happens, when buffer runs out, then vma changes, + * scan_address is reset to vm_start. Then, vma info is printed twice. 
+ */ + if (vaddr == vma->vm_start && !scan_in_vma) { + used_len = scnprintf(kernel_buf + pos, *remain_buf_len, + "%p, 0x%lx, 0x%lx, 0x%lx, ", + mm, (unsigned long)vma+vma->vma_create_jiffies, + vma->vm_start, vma->vm_end); + + *remain_buf_len -= used_len; + + if (*remain_buf_len == 1) { + contig_stats->err = 1; + goto out_of_buf; + } + pos += used_len; + } + + if (page) { + if (contig_stats->first_paddr_in_chunk) { + if (((long long)vaddr - contig_stats->first_vaddr_in_chunk) == + ((long long)paddr - contig_stats->first_paddr_in_chunk)) + contig_stats->contig_pages += num_pages; + else { + /* output present contig chunk */ + contig_pages = contig_stats->contig_pages; + goto output_contig_info; + } + } else { /* the previous chunk is not present pages */ + /* output non-present contig chunk */ + contig_pages = -(long long)contig_stats->contig_pages; + goto output_contig_info; + } + } else { + /* the previous chunk is not present pages */ + if (!contig_stats->first_paddr_in_chunk) { + VM_BUG_ON(contig_stats->first_vaddr_in_chunk + + contig_stats->contig_pages * PAGE_SIZE != + vaddr); + ++contig_stats->contig_pages; + } else { + /* output present contig chunk */ + contig_pages = contig_stats->contig_pages; + + goto output_contig_info; + } + } + +check_last_entry: + /* if vaddr is the last page, we need to dump stats as well */ + if ((vaddr + num_pages * PAGE_SIZE) >= vma->vm_end) { + if (contig_stats->first_paddr_in_chunk) + contig_pages = contig_stats->contig_pages; + else + contig_pages = -(long long)contig_stats->contig_pages; + last_entry = true; + } else + return 0; +output_contig_info: + if (last_entry) + used_len = scnprintf(kernel_buf + pos, *remain_buf_len, + "%lld, -1\n", contig_pages); + else + used_len = scnprintf(kernel_buf + pos, *remain_buf_len, + "%lld, ", contig_pages); + + *remain_buf_len -= used_len; + if (*remain_buf_len == 1) { + contig_stats->err = 1; + goto out_of_buf; + } else { + pos += used_len; + if (last_entry) { + /* clear contig_stats */ + contig_stats->first_vaddr_in_chunk = 0; + contig_stats->first_paddr_in_chunk = 0; + contig_stats->contig_pages = 0; + return 0; + } else { + /* set new contig_stats */ + contig_stats->first_vaddr_in_chunk = vaddr; + contig_stats->first_paddr_in_chunk = paddr; + contig_stats->contig_pages = num_pages; + goto check_last_entry; + } + } + return 0; + } + + used_len = scnprintf(kernel_buf + pos, *remain_buf_len, + "%p, %p, 0x%lx, 0x%lx, 0x%lx, 0x%llx", + mm, vma, vma->vm_start, vma->vm_end, + vaddr, page ? 
PFN_PHYS(page_to_pfn(page)) : 0); + + *remain_buf_len -= used_len; + if (*remain_buf_len == 1) + goto out_of_buf; + pos += used_len; + + if (page && PageAnon(page)) { + /* check page order */ + used_len = scnprintf(kernel_buf + pos, *remain_buf_len, ", %d", + compound_order(page)); + *remain_buf_len -= used_len; + if (*remain_buf_len == 1) + goto out_of_buf; + pos += used_len; + + anon_vma = page_anon_vma(page); + if (!anon_vma) + goto end_of_stat; + anon_vma_lock_read(anon_vma); + + do { + used_len = scnprintf(kernel_buf + pos, *remain_buf_len, ", %p", + anon_vma); + *remain_buf_len -= used_len; + if (*remain_buf_len == 1) { + anon_vma_unlock_read(anon_vma); + goto out_of_buf; + } + pos += used_len; + + anon_vma = anon_vma->parent; + } while (anon_vma != anon_vma->parent); + + anon_vma_unlock_read(anon_vma); + } +end_of_stat: + /* end of one page stat */ + used_len = scnprintf(kernel_buf + pos, *remain_buf_len, ", %d\n", end_note); + *remain_buf_len -= used_len; + if (*remain_buf_len == 1) + goto out_of_buf; + + return 0; +out_of_buf: /* revert incomplete data */ + *remain_buf_len = init_remain_len; + kernel_buf[pos] = '\0'; + return -1; + +} + +static int isolate_free_page_no_wmark(struct page *page, unsigned int order) +{ + struct zone *zone; + int mt; + + VM_BUG_ON(!PageBuddy(page)); + + zone = page_zone(page); + mt = get_pageblock_migratetype(page); + + + /* Remove page from free list */ + list_del(&page->lru); + zone->free_area[order].nr_free--; + __ClearPageBuddy(page); + set_page_private(page, 0); + + /* + * Set the pageblock if the isolated page is at least half of a + * pageblock + */ + if (order >= pageblock_order - 1) { + struct page *endpage = page + (1 << order) - 1; + + for (; page < endpage; page += pageblock_nr_pages) { + int mt = get_pageblock_migratetype(page); + + if (!is_migrate_isolate(mt) && !is_migrate_cma(mt) + && mt != MIGRATE_HIGHATOMIC) + set_pageblock_migratetype(page, + MIGRATE_MOVABLE); + } + } + + return 1UL << order; +} + +struct exchange_alloc_info { + struct list_head list; + struct page *src_page; + struct page *dst_page; +}; + +struct exchange_alloc_head { + struct list_head exchange_list; + struct list_head freelist; + struct list_head migratepage_list; + unsigned long num_freepages; +}; + +static int create_exchange_alloc_info(struct vm_area_struct *vma, + unsigned long scan_address, struct page *first_in_use_page, + int total_free_pages, + struct list_head *freelist, + struct list_head *exchange_list, + struct list_head *migratepage_list) +{ + struct page *in_use_page; + struct page *freepage; + struct exchange_alloc_info *one_pair; + int err; + int pagevec_flushed = 0; + + down_read(&vma->vm_mm->mmap_sem); + in_use_page = follow_page(vma, scan_address, + FOLL_GET|FOLL_MIGRATION | FOLL_REMOTE); + up_read(&vma->vm_mm->mmap_sem); + + freepage = list_first_entry_or_null(freelist, struct page, lru); + + if (first_in_use_page != in_use_page || + !freepage || + PageCompound(in_use_page) != PageCompound(freepage) || + compound_order(in_use_page) != compound_order(freepage)) { + if (in_use_page) + put_page(in_use_page); + return -EBUSY; + } + one_pair = kmalloc(sizeof(struct exchange_alloc_info), + GFP_KERNEL | __GFP_ZERO); + + if (!one_pair) { + put_page(in_use_page); + return -ENOMEM; + } + +retry_isolate: + /* isolate in_use_page */ + err = isolate_lru_page(in_use_page); + if (err) { + if (!pagevec_flushed) { + migrate_prep(); + pagevec_flushed = 1; + goto retry_isolate; + } + put_page(in_use_page); + in_use_page = NULL; + } + + if (in_use_page) { + 
put_page(in_use_page); + mod_node_page_state(page_pgdat(in_use_page), + NR_ISOLATED_ANON + page_is_file_cache(in_use_page), + hpage_nr_pages(in_use_page)); + list_add_tail(&in_use_page->lru, migratepage_list); + } + /* fill info */ + one_pair->src_page = in_use_page; + one_pair->dst_page = freepage; + INIT_LIST_HEAD(&one_pair->list); + + list_add_tail(&one_pair->list, exchange_list); + + return 0; +} + +static void free_alloc_info(struct list_head *alloc_info_list) +{ + struct exchange_alloc_info *item, *item2; + + list_for_each_entry_safe(item, item2, alloc_info_list, list) { + list_del(&item->list); + kfree(item); + } +} + +/* + * migrate callback: give a specific free page when it is called to guarantee + * contiguity after migration. + */ +static struct page *exchange_alloc(struct page *migratepage, + unsigned long data) +{ + struct exchange_alloc_head *head = (struct exchange_alloc_head *)data; + struct page *freepage = NULL; + struct exchange_alloc_info *info; + + list_for_each_entry(info, &head->exchange_list, list) { + if (migratepage == info->src_page) { + freepage = info->dst_page; + /* remove it from freelist */ + list_del(&freepage->lru); + if (PageTransHuge(freepage)) + head->num_freepages -= HPAGE_PMD_NR; + else + head->num_freepages--; + break; + } + } + + return freepage; +} + +static void exchange_free(struct page *freepage, unsigned long data) +{ + struct exchange_alloc_head *head = (struct exchange_alloc_head *)data; + + list_add_tail(&freepage->lru, &head->freelist); + if (PageTransHuge(freepage)) + head->num_freepages += HPAGE_PMD_NR; + else + head->num_freepages++; +} + +int defrag_address_range(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long start_addr, unsigned long end_addr, + struct page *anchor_page, unsigned long page_vaddr, + struct defrag_result_stats *defrag_stats) +{ + /*unsigned long va_pa_page_offset = (unsigned long)-1;*/ + unsigned long scan_address; + unsigned long page_size = PAGE_SIZE; + int failed = 0; + int not_present = 0; + bool src_thp = false; + + for (scan_address = start_addr; scan_address < end_addr; + scan_address += page_size) { + struct page *scan_page; + unsigned long scan_phys_addr; + long long page_dist; + + cond_resched(); + + down_read(&vma->vm_mm->mmap_sem); + scan_page = follow_page(vma, scan_address, FOLL_MIGRATION | FOLL_REMOTE); + up_read(&vma->vm_mm->mmap_sem); + scan_phys_addr = PFN_PHYS(scan_page ? page_to_pfn(scan_page) : 0); + + page_size = PAGE_SIZE; + + if (!scan_phys_addr) { + not_present++; + failed += 1; + defrag_stats->src_not_present += 1; + continue; + } + + page_size = get_contig_page_size(scan_page); + + /* PTE-mapped THP not allowed */ + if ((scan_page == compound_head(scan_page)) && + PageTransHuge(scan_page) && !PageHuge(scan_page)) + src_thp = true; + + /* Allow THPs */ + if (PageCompound(scan_page) && !src_thp) { + count_vm_events(MEM_DEFRAG_SRC_COMP_PAGES_FAILED, page_size/PAGE_SIZE); + failed += (page_size/PAGE_SIZE); + defrag_stats->src_pte_thp_failed += (page_size/PAGE_SIZE); + + defrag_stats->not_defrag_vpn = scan_address + page_size; + goto quit_defrag; + continue; + } + + VM_BUG_ON(!anchor_page); + + page_dist = ((long long)scan_address - page_vaddr) / PAGE_SIZE; + + /* already in the contiguous pos */ + if (page_dist == (long long)(scan_page - anchor_page)) { + defrag_stats->aligned += (page_size/PAGE_SIZE); + continue; + } else { /* migrate pages according to the anchor pages in the vma. 
*/ + struct page *dest_page = anchor_page + page_dist; + int page_drained = 0; + bool dst_thp = false; + int scan_page_order = src_thp?compound_order(scan_page):0; + + if (!zone_spans_pfn(page_zone(anchor_page), + (page_to_pfn(anchor_page) + page_dist))) { + failed += 1; + defrag_stats->dst_out_of_bound_failed += 1; + + defrag_stats->not_defrag_vpn = scan_address + page_size; + goto quit_defrag; + continue; + } + +retry_defrag: + /* migrate */ + if (PageBuddy(dest_page)) { + struct zone *zone = page_zone(dest_page); + spinlock_t *zone_lock = &zone->lock; + unsigned long zone_lock_flags; + unsigned long free_page_order = 0; + int err = 0; + struct exchange_alloc_head exchange_alloc_head = {0}; + int migratetype = get_pageblock_migratetype(dest_page); + + INIT_LIST_HEAD(&exchange_alloc_head.exchange_list); + INIT_LIST_HEAD(&exchange_alloc_head.freelist); + INIT_LIST_HEAD(&exchange_alloc_head.migratepage_list); + + count_vm_events(MEM_DEFRAG_DST_FREE_PAGES, 1<lock */ + spin_lock_irqsave(zone_lock, zone_lock_flags); + + if (!PageBuddy(dest_page)) { + err = -EINVAL; + goto freepage_isolate_fail; + } + + free_page_order = page_order(dest_page); + + /* fail early if not enough free pages */ + if (free_page_order < scan_page_order) { + err = -ENOMEM; + goto freepage_isolate_fail; + } + + /* __isolate_free_page() */ + err = isolate_free_page_no_wmark(dest_page, free_page_order); + if (!err) + goto freepage_isolate_fail; + + expand(zone, dest_page, scan_page_order, free_page_order, + &(zone->free_area[free_page_order]), + migratetype); + + if (!is_migrate_isolate(migratetype)) + __mod_zone_freepage_state(zone, -(1UL << scan_page_order), + migratetype); + + prep_new_page(dest_page, scan_page_order, + __GFP_MOVABLE|(scan_page_order?__GFP_COMP:0), 0); + + if (scan_page_order) { + VM_BUG_ON(!src_thp); + VM_BUG_ON(scan_page_order != HPAGE_PMD_ORDER); + prep_transhuge_page(dest_page); + } + + list_add(&dest_page->lru, &exchange_alloc_head.freelist); + +freepage_isolate_fail: + spin_unlock_irqrestore(zone_lock, zone_lock_flags); + + if (err < 0) { + failed += (page_size/PAGE_SIZE); + defrag_stats->dst_isolate_free_failed += (page_size/PAGE_SIZE); + + defrag_stats->not_defrag_vpn = scan_address + page_size; + goto quit_defrag; + continue; + } + + /* gather in-use pages + * create a exchange_alloc_info structure, a list of + * tuples, each like: + * (in_use_page, free_page) + * + * so in exchange_alloc, the code needs to traverse the list + * and find the tuple from in_use_page. Then return the + * corresponding free page. + * + * This can guarantee the contiguity after migration. 
+ */ + + err = create_exchange_alloc_info(vma, scan_address, scan_page, + 1<dst_migrate_free_failed += + exchange_alloc_head.num_freepages; + } + defrag_stats->migrated += ((1UL<dst_pte_thp_failed += page_size/PAGE_SIZE; + + defrag_stats->not_defrag_vpn = scan_address + page_size; + goto quit_defrag; + } + + if (src_thp != dst_thp) { + failed += get_contig_page_size(scan_page); + if (src_thp && !dst_thp) + defrag_stats->src_thp_dst_not_failed += + page_size/PAGE_SIZE; + else /* !src_thp && dst_thp */ + defrag_stats->dst_thp_src_not_failed += + page_size/PAGE_SIZE; + + defrag_stats->not_defrag_vpn = scan_address + page_size; + goto quit_defrag; + /*continue;*/ + } + + /* free page on pcplist */ + if (page_count(dest_page) == 0) { + /* not managed pages */ + if (!dest_page->flags) { + failed += 1; + defrag_stats->dst_out_of_bound_failed += 1; + + defrag_stats->not_defrag_vpn = scan_address + page_size; + goto quit_defrag; + } + /* spill order-0 pages to buddy allocator from pcplist */ + if (!page_drained) { + drain_all_pages(NULL); + page_drained = 1; + goto retry_defrag; + } + } + + if (PageAnon(dest_page)) { + count_vm_events(MEM_DEFRAG_DST_ANON_PAGES, + 1<dst_anon_failed += 1<dst_file_failed += 1<dst_non_lru_failed += 1<dst_non_moveable_failed += 1<migrated += 1<not_defrag_vpn = scan_address + page_size; + goto quit_defrag; + } + + } + } + } +quit_defrag: + return failed; +} + +struct anchor_page_node *get_anchor_page_node_from_vma(struct vm_area_struct *vma, + unsigned long address) +{ + struct interval_tree_node *prev_vma_node; + + prev_vma_node = interval_tree_iter_first(&vma->anchor_page_rb, + address, address); + + if (!prev_vma_node) + return NULL; + + return container_of(prev_vma_node, struct anchor_page_node, node); +} + +unsigned long get_undefragged_area(struct zone *zone, struct vm_area_struct *vma, + unsigned long start_addr, unsigned long end_addr) +{ + struct mm_struct *mm = vma->vm_mm; + struct vm_area_struct *scan_vma = NULL; + unsigned long vma_size = end_addr - start_addr; + bool first_vma = true; + + for (scan_vma = mm->mmap; scan_vma; scan_vma = scan_vma->vm_next) + if (!RB_EMPTY_ROOT(&scan_vma->anchor_page_rb.rb_root)) + break; + /* no defragged area */ + if (!scan_vma) + return zone->zone_start_pfn; + + scan_vma = mm->mmap; + while (scan_vma) { + if (!RB_EMPTY_ROOT(&scan_vma->anchor_page_rb.rb_root)) { + struct interval_tree_node *node = NULL, *next_node = NULL; + struct anchor_page_node *anchor_node, *next_anchor_node = NULL; + struct vm_area_struct *next_vma = scan_vma->vm_next; + unsigned long end_pfn; + /* each vma get one anchor range */ + node = interval_tree_iter_first(&scan_vma->anchor_page_rb, + scan_vma->vm_start, scan_vma->vm_end - 1); + if (!node) { + scan_vma = scan_vma->vm_next; + continue; + } + + anchor_node = container_of(node, struct anchor_page_node, node); + end_pfn = (anchor_node->anchor_pfn + + ((scan_vma->vm_end - scan_vma->vm_start)>>PAGE_SHIFT)); + + /* check space before first vma */ + if (first_vma) { + first_vma = false; + if (zone->zone_start_pfn + vma_size < anchor_node->anchor_pfn) + return zone->zone_start_pfn; + /* remove existing anchor if new vma is much larger */ + if (vma_size > (scan_vma->vm_end - scan_vma->vm_start)*2) { + first_vma = true; + interval_tree_remove(node, &scan_vma->anchor_page_rb); + kfree(anchor_node); + scan_vma = scan_vma->vm_next; + continue; + } + } + + /* find next vma with anchor range */ + for (next_vma = scan_vma->vm_next; + next_vma && RB_EMPTY_ROOT(&next_vma->anchor_page_rb.rb_root); + next_vma = 
next_vma->vm_next); + + if (!next_vma) + return end_pfn; + else { + next_node = interval_tree_iter_first(&next_vma->anchor_page_rb, + next_vma->vm_start, next_vma->vm_end - 1); + VM_BUG_ON(!next_node); + next_anchor_node = container_of(next_node, + struct anchor_page_node, node); + if (end_pfn + vma_size < next_anchor_node->anchor_pfn) + return end_pfn; + } + scan_vma = next_vma; + } else + scan_vma = scan_vma->vm_next; + } + + return zone->zone_start_pfn; +} + +/* + * anchor pages decide the va pa offset in each vma + * + */ +static int find_anchor_pages_in_vma(struct mm_struct *mm, + struct vm_area_struct *vma, unsigned long start_addr) +{ + struct anchor_page_node *anchor_node; + unsigned long end_addr = vma->vm_end - PAGE_SIZE; + struct interval_tree_node *existing_anchor = NULL; + unsigned long scan_address = start_addr; + struct page *present_page = NULL; + struct zone *present_zone = NULL; + unsigned long alignment_size = PAGE_SIZE; + + /* Out of range query */ + if (start_addr >= vma->vm_end || start_addr < vma->vm_start) + return -1; + + /* + * Clean up unrelated anchor infor + * + * VMA range can change and leave some anchor info out of range, + * so clean it here. + * It should be cleaned when vma is changed, but the code there + * is too complicated. + */ + if (!RB_EMPTY_ROOT(&vma->anchor_page_rb.rb_root) && + !interval_tree_iter_first(&vma->anchor_page_rb, + vma->vm_start, vma->vm_end - 1)) { + struct interval_tree_node *node = NULL; + + for (node = interval_tree_iter_first(&vma->anchor_page_rb, + 0, (unsigned long)-1); + node;) { + struct anchor_page_node *anchor_node = container_of(node, + struct anchor_page_node, node); + interval_tree_remove(node, &vma->anchor_page_rb); + node = interval_tree_iter_next(node, 0, (unsigned long)-1); + kfree(anchor_node); + } + } + + /* no range at all */ + if (RB_EMPTY_ROOT(&vma->anchor_page_rb.rb_root)) + goto insert_new_range; + + /* look for first range has start_addr or after it */ + existing_anchor = interval_tree_iter_first(&vma->anchor_page_rb, + start_addr, end_addr); + + /* first range has start_addr or after it */ + if (existing_anchor) { + /* redundant range, do nothing */ + if (existing_anchor->start == start_addr) + return 0; + else if (existing_anchor->start < start_addr && + existing_anchor->last >= start_addr){ + return 0; + } else { /* a range after start_addr */ + struct anchor_page_node *existing_node = container_of(existing_anchor, + struct anchor_page_node, node); + VM_BUG_ON(!(existing_anchor->start > start_addr)); + /* remove the existing range and insert a new one, since expanding + * forward can make the range go beyond the zone limit + */ + interval_tree_remove(existing_anchor, &vma->anchor_page_rb); + kfree(existing_node); + VM_BUG_ON(!RB_EMPTY_ROOT(&vma->anchor_page_rb.rb_root)); + goto insert_new_range; + } + } else { + struct interval_tree_node *prev_anchor = NULL, *cur_anchor; + /* there is a range before start_addr */ + + /* find the range just before start_addr */ + for (cur_anchor = interval_tree_iter_first(&vma->anchor_page_rb, + vma->vm_start, start_addr - PAGE_SIZE); + cur_anchor; + prev_anchor = cur_anchor, + cur_anchor = interval_tree_iter_next(cur_anchor, + vma->vm_start, start_addr - PAGE_SIZE)); + + interval_tree_remove(prev_anchor, &vma->anchor_page_rb); + prev_anchor->last = vma->vm_end; + interval_tree_insert(prev_anchor, &vma->anchor_page_rb); + + goto out; + } + +insert_new_range: /* start_addr to end_addr */ + down_read(&vma->vm_mm->mmap_sem); + /* find the first present page and use it as the 
anchor page */ + while (!present_page && scan_address < end_addr) { + present_page = follow_page(vma, scan_address, + FOLL_MIGRATION | FOLL_REMOTE); + scan_address += present_page?get_contig_page_size(present_page):PAGE_SIZE; + } + up_read(&vma->vm_mm->mmap_sem); + + if (!present_page) + goto out; + + anchor_node = + kmalloc(sizeof(struct anchor_page_node), GFP_KERNEL | __GFP_ZERO); + if (!anchor_node) + return -ENOMEM; + + present_zone = page_zone(present_page); + + anchor_node->node.start = start_addr; + anchor_node->node.last = end_addr; + + anchor_node->anchor_vpn = start_addr>>PAGE_SHIFT; + anchor_node->anchor_pfn = get_undefragged_area(present_zone, + vma, start_addr, end_addr); + + /* adjust VPN and PFN alignment according to VMA size */ + if (vma->vm_end - vma->vm_start >= HPAGE_PUD_SIZE) { + if ((anchor_node->anchor_vpn & ((HPAGE_PUD_SIZE>>PAGE_SHIFT) - 1)) < + (anchor_node->anchor_pfn & ((HPAGE_PUD_SIZE>>PAGE_SHIFT) - 1))) + anchor_node->anchor_pfn += (HPAGE_PUD_SIZE>>PAGE_SHIFT); + + anchor_node->anchor_pfn = (anchor_node->anchor_pfn & (PUD_MASK>>PAGE_SHIFT)) | + (anchor_node->anchor_vpn & ((HPAGE_PUD_SIZE>>PAGE_SHIFT) - 1)); + + alignment_size = HPAGE_PUD_SIZE; + } else if (vma->vm_end - vma->vm_start >= HPAGE_PMD_SIZE) { + if ((anchor_node->anchor_vpn & ((HPAGE_PMD_SIZE>>PAGE_SHIFT) - 1)) < + (anchor_node->anchor_pfn & ((HPAGE_PMD_SIZE>>PAGE_SHIFT) - 1))) + anchor_node->anchor_pfn += (HPAGE_PMD_SIZE>>PAGE_SHIFT); + + anchor_node->anchor_pfn = (anchor_node->anchor_pfn & (PMD_MASK>>PAGE_SHIFT)) | + (anchor_node->anchor_vpn & ((HPAGE_PMD_SIZE>>PAGE_SHIFT) - 1)); + + alignment_size = HPAGE_PMD_SIZE; + } else + alignment_size = PAGE_SIZE; + + /* move the range into the zone limit */ + if (!(zone_spans_pfn(present_zone, anchor_node->anchor_pfn))) { + while (anchor_node->anchor_pfn >= zone_end_pfn(present_zone)) + anchor_node->anchor_pfn -= alignment_size / PAGE_SHIFT; + while (anchor_node->anchor_pfn < present_zone->zone_start_pfn) + anchor_node->anchor_pfn += alignment_size / PAGE_SHIFT; + } + + interval_tree_insert(&anchor_node->node, &vma->anchor_page_rb); + +out: + return 0; +} + +static inline bool is_stats_collection(enum mem_defrag_action action) +{ + switch (action) { + case MEM_DEFRAG_FULL_STATS: + case MEM_DEFRAG_CONTIG_STATS: + return true; + default: + return false; + } + return false; +} + +/* comparator for sorting vma's lifetime */ +static int unsigned_long_cmp(const void *a, const void *b) +{ + const unsigned long *l = a, *r = b; + + if (*l < *r) + return -1; + if (*l > *r) + return 1; + return 0; +} + +/* + * scan all to-be-defragged VMA lifetime and calculate VMA defragmentation + * threshold. 
+ */ +static void scan_all_vma_lifetime(struct defrag_scan_control *sc) +{ + struct mm_struct *mm = sc->mm; + struct vm_area_struct *vma = NULL; + unsigned long current_jiffies = jiffies; /* fix one jiffies */ + unsigned int num_vma = 0, index = 0; + unsigned long *vma_scan_list = NULL; + + down_read(&mm->mmap_sem); + for (vma = find_vma(mm, 0); vma; vma = vma->vm_next) + /* only care about to-be-defragged vmas */ + if (mem_defrag_vma_check(vma)) + ++num_vma; + + vma_scan_list = kmalloc(num_vma*sizeof(unsigned long), + GFP_KERNEL | __GFP_ZERO); + + if (ZERO_OR_NULL_PTR(vma_scan_list)) { + sc->vma_scan_threshold = 1; + goto out; + } + + for (vma = find_vma(mm, 0); vma; vma = vma->vm_next) + /* only care about to-be-defragged vmas */ + if (mem_defrag_vma_check(vma)) { + if (vma_scan_threshold_type == VMA_THRESHOLD_TYPE_TIME) + vma_scan_list[index] = current_jiffies - vma->vma_create_jiffies; + else if (vma_scan_threshold_type == VMA_THRESHOLD_TYPE_SIZE) + vma_scan_list[index] = vma->vm_end - vma->vm_start; + ++index; + if (index >= num_vma) + break; + } + + sort(vma_scan_list, num_vma, sizeof(unsigned long), + unsigned_long_cmp, NULL); + + index = (100 - vma_scan_percentile) * num_vma / 100; + + sc->vma_scan_threshold = vma_scan_list[index]; + + kfree(vma_scan_list); +out: + up_read(&mm->mmap_sem); +} + +/* + * Scan single mm_struct. + * The function will down_read mmap_sem. + * + */ +static int kmem_defragd_scan_mm(struct defrag_scan_control *sc) +{ + struct mm_struct *mm = sc->mm; + struct vm_area_struct *vma = NULL; + unsigned long *scan_address = &sc->scan_address; + char *stats_buf = NULL; + int remain_buf_len = sc->buf_len; + int err = 0; + struct contig_stats contig_stats; + + + if (sc->out_buf && + sc->buf_len) { + stats_buf = vzalloc(sc->buf_len); + if (!stats_buf) + goto breakouterloop; + } + + /*down_read(&mm->mmap_sem);*/ + if (unlikely(kmem_defragd_test_exit(mm))) + vma = NULL; + else { + /* get vma_scan_threshold */ + if (!sc->vma_scan_threshold) + scan_all_vma_lifetime(sc); + + vma = find_vma(mm, *scan_address); + } + + for (; vma; vma = vma->vm_next) { + unsigned long vstart, vend; + struct anchor_page_node *anchor_node = NULL; + int scanned_chunks = 0; + + + if (unlikely(kmem_defragd_test_exit(mm))) + break; + if (!mem_defrag_vma_check(vma)) { + /* collect contiguity stats for this VMA */ + if (is_stats_collection(sc->action)) + if (do_vma_stat(mm, vma, stats_buf, sc->buf_len - remain_buf_len, + &remain_buf_len)) + goto breakouterloop; + *scan_address = vma->vm_end; + goto done_one_vma; + } + + + vstart = vma->vm_start; + vend = vma->vm_end; + if (vstart >= vend) + goto done_one_vma; + if (*scan_address > vend) + goto done_one_vma; + if (*scan_address < vstart) + *scan_address = vstart; + + if (sc->action == MEM_DEFRAG_DO_DEFRAG) { + /* Check VMA size, skip if below the size threshold */ + if (vma->vm_end - vma->vm_start < + defrag_size_threshold * HPAGE_PMD_SIZE) + goto done_one_vma; + + /* + * Check VMA lifetime or size, skip if below the lifetime/size + * threshold derived from a percentile + */ + if (vma_scan_threshold_type == VMA_THRESHOLD_TYPE_TIME) { + if ((jiffies - vma->vma_create_jiffies) < sc->vma_scan_threshold) + goto done_one_vma; + } else if (vma_scan_threshold_type == VMA_THRESHOLD_TYPE_SIZE) { + if ((vma->vm_end - vma->vm_start) < sc->vma_scan_threshold) + goto done_one_vma; + } + + /* Avoid repeated defrag if the vma has not been changed */ + if (vma_no_repeat_defrag && + vma->vma_defrag_jiffies > vma->vma_modify_jiffies) + goto done_one_vma; + + /* vma 
+			/* vma contiguity stats collection */
+			if (remain_buf_len && stats_buf) {
+				int used_len;
+				int pos = sc->buf_len - remain_buf_len;
+
+				used_len = scnprintf(stats_buf + pos, remain_buf_len,
+					"vma: 0x%lx, 0x%lx, 0x%lx, -1\n",
+					(unsigned long)vma+vma->vma_create_jiffies,
+					vma->vm_start, vma->vm_end);
+
+				remain_buf_len -= used_len;
+
+				if (remain_buf_len == 1) {
+					stats_buf[pos] = '\0';
+					remain_buf_len = 0;
+				}
+			}
+
+			anchor_node = get_anchor_page_node_from_vma(vma, vma->vm_start);
+
+			if (!anchor_node) {
+				find_anchor_pages_in_vma(mm, vma, vma->vm_start);
+				anchor_node = get_anchor_page_node_from_vma(vma, vma->vm_start);
+
+				if (!anchor_node)
+					goto done_one_vma;
+			}
+		}
+
+		contig_stats = (struct contig_stats) {0};
+
+		while (*scan_address < vend) {
+			struct page *page;
+
+			cond_resched();
+
+			if (unlikely(kmem_defragd_test_exit(mm)))
+				goto breakouterloop;
+
+			if (is_stats_collection(sc->action)) {
+				down_read(&vma->vm_mm->mmap_sem);
+				page = follow_page(vma, *scan_address,
+						FOLL_MIGRATION | FOLL_REMOTE);
+
+				if (do_page_stat(mm, vma, page, *scan_address,
+						stats_buf, sc->buf_len - remain_buf_len,
+						&remain_buf_len, sc->action, &contig_stats,
+						sc->scan_in_vma)) {
+					/* reset scan_address to the beginning of the contig.
+					 * So next scan will get the whole contig.
+					 */
+					if (contig_stats.err) {
+						*scan_address = contig_stats.first_vaddr_in_chunk;
+						sc->scan_in_vma = true;
+					}
+					goto breakouterloop;
+				}
+				/* move to next address */
+				if (page)
+					*scan_address += get_contig_page_size(page);
+				else
+					*scan_address += PAGE_SIZE;
+				up_read(&vma->vm_mm->mmap_sem);
+			} else if (sc->action == MEM_DEFRAG_DO_DEFRAG) {
+				/* go to nearest 1GB aligned address */
+				unsigned long defrag_end = min_t(unsigned long,
+						(*scan_address + HPAGE_PUD_SIZE) & HPAGE_PUD_MASK,
+						vend);
+				int defrag_result;
+
+				anchor_node = get_anchor_page_node_from_vma(vma, *scan_address);
+
+				/* in case VMA size changes */
+				if (!anchor_node) {
+					find_anchor_pages_in_vma(mm, vma, *scan_address);
+					anchor_node = get_anchor_page_node_from_vma(vma, *scan_address);
+				}
+
+				if (!anchor_node)
+					goto done_one_vma;
+
+				/*
+				 * looping through the 1GB region and defrag 2MB range in each
+				 * iteration.
+				 */
+				while (*scan_address < defrag_end) {
+					unsigned long defrag_sub_chunk_end = min_t(unsigned long,
+							(*scan_address + HPAGE_PMD_SIZE) & HPAGE_PMD_MASK,
+							defrag_end);
+					struct defrag_result_stats defrag_stats = {0};
+continue_defrag:
+					if (!anchor_node) {
+						anchor_node = get_anchor_page_node_from_vma(vma,
+								*scan_address);
+						if (!anchor_node) {
+							find_anchor_pages_in_vma(mm, vma, *scan_address);
+							anchor_node = get_anchor_page_node_from_vma(vma,
+									*scan_address);
+						}
+						if (!anchor_node)
+							goto done_one_vma;
+					}
+
+					defrag_result = defrag_address_range(mm, vma, *scan_address,
+							defrag_sub_chunk_end,
+							pfn_to_page(anchor_node->anchor_pfn),
+							anchor_node->anchor_vpn<<PAGE_SHIFT,
+							&defrag_stats);
+
+					/* dump per-chunk defrag results into the stats buffer */
+					if (remain_buf_len && stats_buf) {
+						int used_len;
+						int pos = sc->buf_len - remain_buf_len;
+
+						used_len = scnprintf(stats_buf + pos, remain_buf_len,
+							"[0x%lx, 0x%lx):%lu [alig:%lu, migrated:%lu, "
+							"src: not:%lu, src_thp_dst_not:%lu, src_pte_thp:%lu "
+							"dst: out_bound:%lu, dst_thp_src_not:%lu, "
+							"dst_pte_thp:%lu, isolate_free:%lu, "
+							"migrate_free:%lu, anon:%lu, file:%lu, "
+							"non-lru:%lu, non-moveable:%lu], "
+							"anchor: (%lx, %lx), range: [%lx, %lx], "
+							"vma: 0x%lx, not_defrag_vpn: %lx\n",
+							*scan_address, defrag_sub_chunk_end,
+							(defrag_sub_chunk_end - *scan_address)/PAGE_SIZE,
+							defrag_stats.aligned,
+							defrag_stats.migrated,
+							defrag_stats.src_not_present,
+							defrag_stats.src_thp_dst_not_failed,
+							defrag_stats.src_pte_thp_failed,
+							defrag_stats.dst_out_of_bound_failed,
+							defrag_stats.dst_thp_src_not_failed,
+							defrag_stats.dst_pte_thp_failed,
+							defrag_stats.dst_isolate_free_failed,
+							defrag_stats.dst_migrate_free_failed,
+							defrag_stats.dst_anon_failed,
+							defrag_stats.dst_file_failed,
+							defrag_stats.dst_non_lru_failed,
+							defrag_stats.dst_non_moveable_failed,
+							anchor_node->anchor_vpn,
+							anchor_node->anchor_pfn,
+							anchor_node->node.start,
+							anchor_node->node.last,
+							(unsigned long)vma+vma->vma_create_jiffies,
+							defrag_stats.not_defrag_vpn
+							);
+
+						remain_buf_len -= used_len;
+
+						if (remain_buf_len == 1) {
+							stats_buf[pos] = '\0';
+							remain_buf_len = 0;
+						}
+					}
+
+					/*
+					 * skip the page which cannot be defragged and restart
+					 * from the next page
+					 */
+					if (defrag_stats.not_defrag_vpn &&
+					    defrag_stats.not_defrag_vpn < defrag_sub_chunk_end) {
+						VM_BUG_ON(defrag_sub_chunk_end != defrag_end &&
+							defrag_stats.not_defrag_vpn > defrag_sub_chunk_end);
+
+						*scan_address = defrag_stats.not_defrag_vpn;
+						defrag_stats.not_defrag_vpn = 0;
+						goto continue_defrag;
+					}
+
+					/* Done with current 2MB chunk */
+					*scan_address = defrag_sub_chunk_end;
+					scanned_chunks++;
+					/*
+					 * if the knob is set, break out of the defrag loop after
+					 * a preset number of 2MB chunks are defragged
+					 */
+					if (num_breakout_chunks && scanned_chunks >= num_breakout_chunks) {
+						scanned_chunks = 0;
+						goto breakouterloop;
+					}
+				}
+
+			}
+		}
+done_one_vma:
+		sc->scan_in_vma = false;
+		if (sc->action == MEM_DEFRAG_DO_DEFRAG)
+			vma->vma_defrag_jiffies = jiffies;
+	}
+breakouterloop:
+
+	/* copy stats to user space */
+	if (sc->out_buf &&
+	    sc->buf_len) {
+		err = copy_to_user(sc->out_buf, stats_buf,
+				sc->buf_len - remain_buf_len);
+		sc->used_len = sc->buf_len - remain_buf_len;
+	}
+
+	if (stats_buf)
+		vfree(stats_buf);
+
+	/* 0: scanned all vmas of this mm, 1: scan stopped before the last vma */
+	return vma == NULL ? 0 : 1;
+}
+
+SYSCALL_DEFINE4(scan_process_memory, pid_t, pid, char __user *, out_buf,
+		int, buf_len, int, action)
+{
+	const struct cred *cred = current_cred(), *tcred;
+	struct task_struct *task;
+	struct mm_struct *mm;
+	int err = 0;
+	static struct defrag_scan_control defrag_scan_control = {0};
+
+	/* Find the mm_struct */
+	rcu_read_lock();
+	task = pid ? find_task_by_vpid(pid) : current;
+	if (!task) {
+		rcu_read_unlock();
+		return -ESRCH;
+	}
+	get_task_struct(task);
+
+	/*
+	 * Check if this process has the right to modify the specified
+	 * process. The right exists if the process has administrative
+	 * capabilities, superuser privileges or the same
+	 * userid as the target process.
+	 */
+	tcred = __task_cred(task);
+	if (!uid_eq(cred->euid, tcred->suid) && !uid_eq(cred->euid, tcred->uid) &&
+	    !uid_eq(cred->uid, tcred->suid) && !uid_eq(cred->uid, tcred->uid) &&
+	    !capable(CAP_SYS_NICE)) {
+		rcu_read_unlock();
+		err = -EPERM;
+		goto out;
+	}
+	rcu_read_unlock();
+
+	err = security_task_movememory(task);
+	if (err)
+		goto out;
+
+	mm = get_task_mm(task);
+	put_task_struct(task);
+
+	if (!mm)
+		return -EINVAL;
+
+	switch (action) {
+	case MEM_DEFRAG_SCAN:
+	case MEM_DEFRAG_CONTIG_SCAN:
+		count_vm_event(MEM_DEFRAG_SCAN_NUM);
+		/*
+		 * We allow scanning one process's address space for multiple
+		 * iterations. When we change the scanned process, reset
+		 * defrag_scan_control's mm_struct
+		 */
+		if (!defrag_scan_control.mm ||
+		    defrag_scan_control.mm != mm) {
+			defrag_scan_control = (struct defrag_scan_control){0};
+			defrag_scan_control.mm = mm;
+		}
+		defrag_scan_control.out_buf = out_buf;
+		defrag_scan_control.buf_len = buf_len;
+		if (action == MEM_DEFRAG_SCAN)
+			defrag_scan_control.action = MEM_DEFRAG_FULL_STATS;
+		else if (action == MEM_DEFRAG_CONTIG_SCAN)
+			defrag_scan_control.action = MEM_DEFRAG_CONTIG_STATS;
+		else {
+			err = -EINVAL;
+			break;
+		}
+
+		defrag_scan_control.used_len = 0;
+
+		if (unlikely(!access_ok(out_buf, buf_len))) {
+			err = -EFAULT;
+			break;
+		}
+
+		/* clear mm once it is fully scanned */
+		if (!kmem_defragd_scan_mm(&defrag_scan_control) &&
+		    !defrag_scan_control.used_len)
+			defrag_scan_control.mm = NULL;
+
+		err = defrag_scan_control.used_len;
+		break;
+	case MEM_DEFRAG_MARK_SCAN_ALL:
+		set_bit(MMF_VM_MEM_DEFRAG_ALL, &mm->flags);
+		__kmem_defragd_enter(mm);
+		break;
+	case MEM_DEFRAG_CLEAR_SCAN_ALL:
+		clear_bit(MMF_VM_MEM_DEFRAG_ALL, &mm->flags);
+		break;
+	case MEM_DEFRAG_DEFRAG:
+		count_vm_event(MEM_DEFRAG_DEFRAG_NUM);
+
+		if (!defrag_scan_control.mm ||
+		    defrag_scan_control.mm != mm) {
+			defrag_scan_control = (struct defrag_scan_control){0};
+			defrag_scan_control.mm = mm;
+		}
+		defrag_scan_control.action = MEM_DEFRAG_DO_DEFRAG;
+
+		defrag_scan_control.out_buf = out_buf;
+		defrag_scan_control.buf_len = buf_len;
+
+		/* clear mm once it is fully defragged */
+		if (buf_len) {
+			if (!kmem_defragd_scan_mm(&defrag_scan_control) &&
+			    !defrag_scan_control.used_len) {
+				defrag_scan_control.mm = NULL;
+			}
+			err = defrag_scan_control.used_len;
+		} else {
+			err = kmem_defragd_scan_mm(&defrag_scan_control);
+			if (err == 0)
+				defrag_scan_control.mm = NULL;
+		}
+		break;
+	default:
+		err = -EINVAL;
+		break;
+	}
+
+	mmput(mm);
+	return err;
+
+out:
+	put_task_struct(task);
+	return err;
+}
+
+static unsigned int kmem_defragd_scan_mm_slot(void)
+{
+	struct mm_slot *mm_slot;
+	int scan_status = 0;
+	struct defrag_scan_control defrag_scan_control = {0};
+
+	spin_lock(&kmem_defragd_mm_lock);
+	if (kmem_defragd_scan.mm_slot)
+		mm_slot = kmem_defragd_scan.mm_slot;
+	else {
+		mm_slot = list_entry(kmem_defragd_scan.mm_head.next,
+				     struct mm_slot, mm_node);
+		kmem_defragd_scan.address = 0;
+		kmem_defragd_scan.mm_slot = mm_slot;
+	}
+	spin_unlock(&kmem_defragd_mm_lock);
+
+	defrag_scan_control.mm = mm_slot->mm;
+	defrag_scan_control.scan_address = kmem_defragd_scan.address;
+	defrag_scan_control.action = MEM_DEFRAG_DO_DEFRAG;
+
+	scan_status = kmem_defragd_scan_mm(&defrag_scan_control);
+
+	kmem_defragd_scan.address = defrag_scan_control.scan_address;
+
+	spin_lock(&kmem_defragd_mm_lock);
+	VM_BUG_ON(kmem_defragd_scan.mm_slot != mm_slot);
+	/*
+	 * Release the current mm_slot if this mm is about to die, or
+	 * if we scanned all vmas of this mm.
+	 */
+	if (kmem_defragd_test_exit(mm_slot->mm) || !scan_status) {
+		/*
+		 * Make sure that if mm_users is reaching zero while
+		 * kmem_defragd runs here, kmem_defragd_exit will find
+		 * mm_slot not pointing to the exiting mm.
+		 */
+		if (mm_slot->mm_node.next != &kmem_defragd_scan.mm_head) {
+			kmem_defragd_scan.mm_slot = list_first_entry(
+				&mm_slot->mm_node,
+				struct mm_slot, mm_node);
+			kmem_defragd_scan.address = 0;
+		} else
+			kmem_defragd_scan.mm_slot = NULL;
+
+		if (kmem_defragd_test_exit(mm_slot->mm))
+			collect_mm_slot(mm_slot);
+		else if (!scan_status) {
+			list_del(&mm_slot->mm_node);
+			list_add_tail(&mm_slot->mm_node, &kmem_defragd_scan.mm_head);
+		}
+	}
+	spin_unlock(&kmem_defragd_mm_lock);
+
+	return 0;
+}
+
+int memdefrag_madvise(struct vm_area_struct *vma,
+		unsigned long *vm_flags, int advice)
+{
+	switch (advice) {
+	case MADV_MEMDEFRAG:
+		*vm_flags &= ~VM_NOMEMDEFRAG;
+		*vm_flags |= VM_MEMDEFRAG;
+		/*
+		 * If the vma become good for kmem_defragd to scan,
+		 * register it here without waiting a page fault that
+		 * may not happen any time soon.
+		 */
+		if (kmem_defragd_enter(vma, *vm_flags))
+			return -ENOMEM;
+		break;
+	case MADV_NOMEMDEFRAG:
+		*vm_flags &= ~VM_MEMDEFRAG;
+		*vm_flags |= VM_NOMEMDEFRAG;
+		/*
+		 * Setting VM_NOMEMDEFRAG will prevent kmem_defragd from scanning
+		 * this vma even if we leave the mm registered in kmem_defragd if
+		 * it got registered before VM_NOMEMDEFRAG was set.
+		 */
+		break;
+	}
+
+	return 0;
+}
+
+
+void __init kmem_defragd_destroy(void)
+{
+	kmem_cache_destroy(mm_slot_cache);
+}
+
+int __init kmem_defragd_init(void)
+{
+	mm_slot_cache = kmem_cache_create("kmem_defragd_mm_slot",
+					  sizeof(struct mm_slot),
+					  __alignof__(struct mm_slot), 0, NULL);
+	if (!mm_slot_cache)
+		return -ENOMEM;
+
+	return 0;
+}
+
+subsys_initcall(kmem_defragd_init);
\ No newline at end of file
diff --git a/mm/memory.c b/mm/memory.c
index e11ca9dd823f..019036e87088 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -69,6 +69,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 
@@ -2926,6 +2927,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	/* Allocate our own private page. */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
+	/* Make it defrag */
+	if (unlikely(kmem_defragd_enter(vma, vma->vm_flags)))
+		goto oom;
 	page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
 	if (!page)
 		goto oom;
@@ -3844,6 +3848,9 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 	p4d_t *p4d;
 	vm_fault_t ret;
 
+	/* Zi: page faults modify vma */
+	vma->vma_modify_jiffies = jiffies;
+
 	pgd = pgd_offset(mm, address);
 	p4d = p4d_alloc(mm, pgd, address);
 	if (!p4d)
diff --git a/mm/mmap.c b/mm/mmap.c
index f901065c4c64..653dd99d5145 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -169,6 +169,28 @@ void unlink_file_vma(struct vm_area_struct *vma)
 	}
 }
 
+void free_anchor_pages(struct vm_area_struct *vma)
+{
+	struct interval_tree_node *node;
+
+	if (!vma)
+		return;
+
+	if (RB_EMPTY_ROOT(&vma->anchor_page_rb.rb_root))
+		return;
+
+	for (node = interval_tree_iter_first(&vma->anchor_page_rb,
+				0, (unsigned long)-1);
+	     node;) {
+		struct anchor_page_node *anchor_node = container_of(node,
+				struct anchor_page_node, node);
+		interval_tree_remove(node, &vma->anchor_page_rb);
+		node = interval_tree_iter_next(node, 0, (unsigned long)-1);
+		kfree(anchor_node);
+	}
+
+}
+
 /*
  * Close a vm structure and free it, returning the next.
  */
@@ -181,6 +203,7 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 		vma->vm_ops->close(vma);
 	if (vma->vm_file)
 		fput(vma->vm_file);
+	free_anchor_pages(vma);
 	mpol_put(vma_policy(vma));
 	vm_area_free(vma);
 	return next;
@@ -725,10 +748,15 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	long adjust_next = 0;
 	int remove_next = 0;
 
+	/*free_anchor_pages(vma);*/
+
+	vma->vma_modify_jiffies = jiffies;
+
 	if (next && !insert) {
 		struct vm_area_struct *exporter = NULL, *importer = NULL;
 
 		if (end >= next->vm_end) {
+			/*free_anchor_pages(next);*/
 			/*
 			 * vma expands, overlapping all the next, and
 			 * perhaps the one after too (mprotect case 6).
@@ -775,6 +803,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 			exporter = next->vm_next;
 
 		} else if (end > next->vm_start) {
+			/*free_anchor_pages(next);*/
 			/*
 			 * vma expands, overlapping part of the next:
 			 * mprotect case 5 shifting the boundary up.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 35fdde041f5c..a35605e0924a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1828,7 +1828,7 @@ void __init init_cma_reserved_pageblock(struct page *page)
  *
  * -- nyc
  */
-static inline void expand(struct zone *zone, struct page *page,
+inline void expand(struct zone *zone, struct page *page,
 	int low, int high, struct free_area *area,
 	int migratetype)
 {
@@ -1950,7 +1950,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 	set_page_owner(page, order, gfp_flags);
 }
 
-static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
+void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
 							unsigned int alloc_flags)
 {
 	int i;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 83b30edc2f7f..c18a42250a5c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1293,6 +1293,27 @@ const char * const vmstat_text[] = {
 	"swap_ra",
 	"swap_ra_hit",
 #endif
+	"memdefrag_defrag",
+	"memdefrag_scan",
+	"memdefrag_dest_free_pages",
+	"memdefrag_dest_anon_pages",
+	"memdefrag_dest_file_pages",
+	"memdefrag_dest_non_lru_pages",
+	"memdefrag_dest_free_pages_failed",
+	"memdefrag_dest_free_pages_overflow_failed",
+	"memdefrag_dest_anon_pages_failed",
+	"memdefrag_dest_file_pages_failed",
+	"memdefrag_dest_nonlru_pages_failed",
+	"memdefrag_src_anon_pages_failed",
+	"memdefrag_src_compound_pages_failed",
+	"memdefrag_dst_split_hugepage",
+#ifdef CONFIG_COMPACTION
+	"compact_migrate_pages",
+#endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	"thp_collapse_migrate_pages"
+#endif
+
 #endif /* CONFIG_VM_EVENTS_COUNTERS */
 };
 
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA */
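
For anyone who wants to poke at the interface while reviewing, below is a minimal userspace sketch (not part of the patch). The syscall number, the MADV_MEMDEFRAG value and the numeric MEM_DEFRAG_* action values are placeholders for illustration only; the real definitions come from the uapi headers added elsewhere in this series. Only the calling pattern mirrors the kernel code above: opt a range in with madvise(), collect contiguity stats, then request an in-place defrag.

#define _GNU_SOURCE
/*
 * mem_defrag_test.c -- illustrative only; the constants below are
 * assumptions, the real ones live in this series' uapi headers.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>

#ifndef __NR_scan_process_memory
#define __NR_scan_process_memory	400	/* assumed syscall number */
#endif
#ifndef MADV_MEMDEFRAG
#define MADV_MEMDEFRAG			20	/* assumed madvise hint value */
#endif

/* assumed to match the action values accepted by scan_process_memory() */
enum {
	MEM_DEFRAG_SCAN,
	MEM_DEFRAG_MARK_SCAN_ALL,
	MEM_DEFRAG_CLEAR_SCAN_ALL,
	MEM_DEFRAG_DEFRAG,
	MEM_DEFRAG_CONTIG_SCAN,
};

int main(void)
{
	size_t len = 64UL << 20;	/* 64MB anonymous test region */
	static char stats[64 << 10];	/* receives the per-chunk stats text */
	char *buf;
	long used;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	/* opt the region in (sets VM_MEMDEFRAG), then fault it in */
	madvise(buf, len, MADV_MEMDEFRAG);
	memset(buf, 1, len);

	/* contiguity stats: the kernel fills stats[] and returns used_len */
	used = syscall(__NR_scan_process_memory, getpid(), stats,
		       (int)sizeof(stats), MEM_DEFRAG_CONTIG_SCAN);
	if (used > 0)
		fwrite(stats, 1, used, stdout);

	/* defragment the scanned ranges; buf_len == 0 skips the stats copy */
	syscall(__NR_scan_process_memory, getpid(), NULL, 0, MEM_DEFRAG_DEFRAG);

	return 0;
}

With buf_len == 0 the MEM_DEFRAG_DEFRAG path returns the 0/1 completion status of kmem_defragd_scan_mm() rather than a stats length, which is why the sketch ignores that return value; the background kmem_defragd thread drives the same kmem_defragd_scan_mm() path through kmem_defragd_scan_mm_slot().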