From patchwork Wed May 23 08:26:12 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 10420593
From: "Huang, Ying"
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying,
 "Kirill A. Shutemov", Andrea Arcangeli, Michal Hocko, Johannes Weiner,
 Shaohua Li, Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
 Naoya Horiguchi, Zi Yan
Subject: [PATCH -mm -V3 08/21] mm, THP, swap: Support to read a huge swap cluster for swapin a THP
Date: Wed, 23 May 2018 16:26:12 +0800
Message-Id: <20180523082625.6897-9-ying.huang@intel.com>
X-Mailer: git-send-email 2.16.1
In-Reply-To: <20180523082625.6897-1-ying.huang@intel.com>
References: <20180523082625.6897-1-ying.huang@intel.com>

From: Huang Ying

To swap in a THP as a whole, we need to read a huge swap cluster from
the swap device.  This patch revises __read_swap_cache_async() and its
callers and callees to support this.  If __read_swap_cache_async()
finds that the swap cluster of the specified swap entry is huge, it
tries to allocate a THP and add it to the swap cache, so that the
contents of the huge swap cluster can later be read into the THP.

Signed-off-by: "Huang, Ying"
Cc: "Kirill A. Shutemov"
Cc: Andrea Arcangeli
Cc: Michal Hocko
Cc: Johannes Weiner
Cc: Shaohua Li
Cc: Hugh Dickins
Cc: Minchan Kim
Cc: Rik van Riel
Cc: Dave Hansen
Cc: Naoya Horiguchi
Cc: Zi Yan
---
 include/linux/huge_mm.h | 38 ++++++++++++++++++++++++
 include/linux/swap.h    |  4 +--
 mm/huge_memory.c        | 26 -----------------
 mm/swap_state.c         | 77 ++++++++++++++++++++++++++++++++++---------------
 mm/swapfile.c           | 11 ++++---
 5 files changed, 100 insertions(+), 56 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 0d0cfddbf4b7..0dbfbe34b01a 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -250,6 +250,39 @@ static inline bool thp_migration_supported(void)
 	return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
 }
 
+/*
+ * always: directly stall for all thp allocations
+ * defer: wake kswapd and fail if not immediately available
+ * defer+madvise: wake kswapd and directly stall for MADV_HUGEPAGE, otherwise
+ *		  fail if not immediately available
+ * madvise: directly stall for MADV_HUGEPAGE, otherwise fail if not immediately
+ *	    available
+ * never: never stall for any thp allocation
+ */
+static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
+{
+	bool vma_madvised;
+
+	if (!vma)
+		return GFP_TRANSHUGE_LIGHT;
+	vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
+	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
+		     &transparent_hugepage_flags))
+		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
+	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
+		     &transparent_hugepage_flags))
+		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
+	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG,
+		     &transparent_hugepage_flags))
+		return GFP_TRANSHUGE_LIGHT |
+			(vma_madvised ? __GFP_DIRECT_RECLAIM :
+			 __GFP_KSWAPD_RECLAIM);
+	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
+		     &transparent_hugepage_flags))
+		return GFP_TRANSHUGE_LIGHT |
+			(vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
+	return GFP_TRANSHUGE_LIGHT;
+}
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
 #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
@@ -362,6 +395,11 @@ static inline bool thp_migration_supported(void)
 {
 	return false;
 }
+
+static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
+{
+	return 0;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 878f132dabc0..d2e017dd7bbd 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -462,7 +462,7 @@ extern sector_t map_swap_page(struct page *, struct block_device **);
 extern sector_t swapdev_block(int, pgoff_t);
 extern int page_swapcount(struct page *);
 extern int __swap_count(swp_entry_t entry);
-extern int __swp_swapcount(swp_entry_t entry);
+extern int __swp_swapcount(swp_entry_t entry, bool *huge_cluster);
 extern int swp_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
 extern struct swap_info_struct *swp_swap_info(swp_entry_t entry);
@@ -589,7 +589,7 @@ static inline int __swap_count(swp_entry_t entry)
 	return 0;
 }
 
-static inline int __swp_swapcount(swp_entry_t entry)
+static inline int __swp_swapcount(swp_entry_t entry, bool *huge_cluster)
 {
 	return 0;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e363e13f6751..3975d824b4ed 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -620,32 +620,6 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
 
 }
 
-/*
- * always: directly stall for all thp allocations
- * defer: wake kswapd and fail if not immediately available
- * defer+madvise: wake kswapd and directly stall for MADV_HUGEPAGE, otherwise
- *		  fail if not immediately available
- * madvise: directly stall for MADV_HUGEPAGE, otherwise fail if not immediately
- *	    available
- * never: never stall for any thp allocation
- */
-static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
-{
-	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
-
-	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
-	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
-	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     __GFP_KSWAPD_RECLAIM);
-	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     0);
-	return GFP_TRANSHUGE_LIGHT;
-}
-
 /* Caller must hold page table lock. */
 static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
 		struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
diff --git a/mm/swap_state.c b/mm/swap_state.c
index c6b3eab73fde..59b37a84fbd7 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -386,6 +386,9 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 	struct page *found_page = NULL, *new_page = NULL;
 	struct swap_info_struct *si;
 	int err;
+	bool huge_cluster = false;
+	swp_entry_t hentry;
+
 	*new_page_allocated = false;
 
 	do {
@@ -411,14 +414,32 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * as SWAP_HAS_CACHE. That's done in later part of code or
 		 * else swap_off will be aborted if we return NULL.
 		 */
-		if (!__swp_swapcount(entry) && swap_slot_cache_enabled)
+		if (!__swp_swapcount(entry, &huge_cluster) &&
+		    swap_slot_cache_enabled)
 			break;
 
 		/*
 		 * Get a new page to read into from swap.
 		 */
-		if (!new_page) {
-			new_page = alloc_page_vma(gfp_mask, vma, addr);
+		if (!new_page ||
+		    (thp_swap_supported() &&
+		     !!PageTransCompound(new_page) != huge_cluster)) {
+			if (new_page)
+				put_page(new_page);
+			if (thp_swap_supported() && huge_cluster) {
+				gfp_t gfp = alloc_hugepage_direct_gfpmask(vma);
+
+				new_page = alloc_hugepage_vma(gfp, vma,
+						addr, HPAGE_PMD_ORDER);
+				if (new_page)
+					prep_transhuge_page(new_page);
+				hentry = swp_entry(swp_type(entry),
+						   round_down(swp_offset(entry),
+							      HPAGE_PMD_NR));
+			} else {
+				new_page = alloc_page_vma(gfp_mask, vma, addr);
+				hentry = entry;
+			}
 			if (!new_page)
 				break;		/* Out of memory */
 		}
@@ -426,33 +447,37 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		/*
 		 * call radix_tree_preload() while we can wait.
 		 */
-		err = radix_tree_maybe_preload(gfp_mask & GFP_KERNEL);
+		err = radix_tree_maybe_preload_order(gfp_mask & GFP_KERNEL,
+						     compound_order(new_page));
 		if (err)
 			break;
 
 		/*
 		 * Swap entry may have been freed since our caller observed it.
 		 */
-		err = swapcache_prepare(entry);
-		if (err == -EEXIST) {
-			radix_tree_preload_end();
-			/*
-			 * We might race against get_swap_page() and stumble
-			 * across a SWAP_HAS_CACHE swap_map entry whose page
-			 * has not been brought into the swapcache yet.
-			 */
-			cond_resched();
-			continue;
-		}
-		if (err) {		/* swp entry is obsolete ? */
+		err = swapcache_prepare(hentry, huge_cluster);
+		if (err) {
 			radix_tree_preload_end();
-			break;
+			if (err == -EEXIST) {
+				/*
+				 * We might race against get_swap_page() and
+				 * stumble across a SWAP_HAS_CACHE swap_map
+				 * entry whose page has not been brought into
+				 * the swapcache yet.
+				 */
+				cond_resched();
+				continue;
+			} else if (err == -ENOTDIR) {
+				/* huge swap cluster is split under us */
+				continue;
+			} else		/* swp entry is obsolete ? */
+				break;
 		}
 
 		/* May fail (-ENOMEM) if radix-tree node allocation failed. */
 		__SetPageLocked(new_page);
 		__SetPageSwapBacked(new_page);
-		err = __add_to_swap_cache(new_page, entry);
+		err = __add_to_swap_cache(new_page, hentry);
 		if (likely(!err)) {
 			radix_tree_preload_end();
 			/*
@@ -460,6 +485,9 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 			 */
 			lru_cache_add_anon(new_page);
 			*new_page_allocated = true;
+			if (thp_swap_supported() && huge_cluster)
+				new_page += swp_offset(entry) &
+					(HPAGE_PMD_NR - 1);
 			return new_page;
 		}
 		radix_tree_preload_end();
@@ -468,7 +496,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		put_swap_page(new_page, entry);
+		put_swap_page(new_page, hentry);
 	} while (err != -ENOMEM);
 
 	if (new_page)
@@ -490,7 +518,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 			vma, addr, &page_was_allocated);
 
 	if (page_was_allocated)
-		swap_readpage(retpage, do_poll);
+		swap_readpage(compound_head(retpage), do_poll);
 
 	return retpage;
 }
@@ -609,8 +637,9 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 		if (!page)
 			continue;
 		if (page_allocated) {
-			swap_readpage(page, false);
-			if (offset != entry_offset) {
+			swap_readpage(compound_head(page), false);
+			if (offset != entry_offset &&
+			    !PageTransCompound(page)) {
 				SetPageReadahead(page);
 				count_vm_event(SWAP_RA);
 			}
@@ -771,8 +800,8 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
 		if (!page)
 			continue;
 		if (page_allocated) {
-			swap_readpage(page, false);
-			if (i != ra_info.offset) {
+			swap_readpage(compound_head(page), false);
+			if (i != ra_info.offset && !PageTransCompound(page)) {
 				SetPageReadahead(page);
 				count_vm_event(SWAP_RA);
 			}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1e723d3a9a6f..1a62fbc13381 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1501,7 +1501,8 @@ int __swap_count(swp_entry_t entry)
 	return count;
 }
 
-static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
+static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry,
+			  bool *huge_cluster)
 {
 	int count = 0;
 	pgoff_t offset = swp_offset(entry);
@@ -1509,6 +1510,8 @@ static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
 
 	ci = lock_cluster_or_swap_info(si, offset);
 	count = swap_count(si->swap_map[offset]);
+	if (huge_cluster && ci)
+		*huge_cluster = cluster_is_huge(ci);
 	unlock_cluster_or_swap_info(si, ci);
 	return count;
 }
@@ -1518,14 +1521,14 @@ static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
  * This does not give an exact answer when swap count is continued,
  * but does include the high COUNT_CONTINUED flag to allow for that.
  */
-int __swp_swapcount(swp_entry_t entry)
+int __swp_swapcount(swp_entry_t entry, bool *huge_cluster)
 {
 	int count = 0;
 	struct swap_info_struct *si;
 
 	si = get_swap_device(entry);
 	if (si) {
-		count = swap_swapcount(si, entry);
+		count = swap_swapcount(si, entry, huge_cluster);
 		put_swap_device(si);
 	}
 	return count;
@@ -1685,7 +1688,7 @@ static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount,
 	return map_swapcount;
 }
 #else
-#define swap_page_trans_huge_swapped(si, entry)	swap_swapcount(si, entry)
+#define swap_page_trans_huge_swapped(si, entry)	swap_swapcount(si, entry, NULL)
 #define page_swapped(page)			(page_swapcount(page) != 0)
 
 static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount,
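
Note for readers following the hentry and sub-page arithmetic in
__read_swap_cache_async(): the following is a minimal user-space sketch
of it, not kernel code.  It assumes HPAGE_PMD_NR is 512 (as on x86-64
with 4KB base pages), and the helper names huge_cluster_head() and
huge_subpage_index() are hypothetical.  Only the rounding and masking
expressions mirror the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 512 base pages per PMD-sized huge page (x86-64, 4KB pages). */
#define HPAGE_PMD_NR 512u

/*
 * Hypothetical helper: swap offset of the first slot in the huge swap
 * cluster that contains @offset.  This mirrors
 * round_down(swp_offset(entry), HPAGE_PMD_NR), which the patch uses to
 * build hentry.
 */
static uint64_t huge_cluster_head(uint64_t offset)
{
	/* The mask form is only valid for a power-of-two cluster size. */
	assert((HPAGE_PMD_NR & (HPAGE_PMD_NR - 1)) == 0);
	return offset & ~(uint64_t)(HPAGE_PMD_NR - 1);
}

/*
 * Hypothetical helper: index of the sub-page within the THP that backs
 * @offset.  This mirrors swp_offset(entry) & (HPAGE_PMD_NR - 1), which
 * the patch adds to new_page before returning it.
 */
static unsigned int huge_subpage_index(uint64_t offset)
{
	return (unsigned int)(offset & (HPAGE_PMD_NR - 1));
}
```

Under these assumptions, a fault on swap offset 1030 maps to a cluster
head at offset 1024 and to sub-page 6 of the THP, so the whole cluster
is added to the swap cache under the head entry while the caller still
gets the sub-page it asked for.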