From patchwork Fri Dec 7 05:41:06 2018
X-Patchwork-Submitter: "Huang, Ying" <ying.huang@intel.com>
X-Patchwork-Id: 10717451
From: Huang Ying <ying.huang@intel.com>
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying,
    "Kirill A. Shutemov", Andrea Arcangeli, Michal Hocko, Johannes Weiner,
    Shaohua Li, Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
    Naoya Horiguchi, Zi Yan, Daniel Jordan
Subject: [PATCH -V8 06/21] swap: Support PMD swap mapping when splitting huge PMD
Date: Fri, 7 Dec 2018 13:41:06 +0800
Message-Id: <20181207054122.27822-7-ying.huang@intel.com>
X-Mailer: git-send-email 2.18.1
In-Reply-To: <20181207054122.27822-1-ying.huang@intel.com>
References: <20181207054122.27822-1-ying.huang@intel.com>

A huge PMD needs to be split when zapping a part of the PMD mapping,
etc.  If the PMD mapping is a swap mapping, we need to split it too.
This patch implements support for this.  This is similar to splitting
a PMD page mapping, except that we also need to decrease the PMD swap
mapping count of the huge swap cluster.  If the PMD swap mapping count
becomes 0, the huge swap cluster will be split.

Notice: is_huge_zero_pmd() and pmd_page() don't work with a swap PMD,
so pmd_present() is checked before calling them.

Thanks to Daniel Jordan for testing and reporting a data corruption
bug caused by a misaligned address processing issue in
__split_huge_swap_pmd().
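[Editor's illustration] As a rough sketch of the accounting described
above, here is a minimal userspace model.  The struct and helper names
below are invented for the example; the real implementation is
split_swap_cluster_map() in the mm/swapfile.c hunk of this patch, which
additionally looks up the swap device and takes the cluster lock, and
returns -EBUSY when the swap device is gone.

/* Minimal userspace model of the cluster accounting; not kernel code. */
#include <stdbool.h>
#include <stdio.h>

struct cluster_model {
        int pmd_map_count;      /* # of PMD swap mappings of the cluster */
        bool has_swap_cache;    /* THP still in the swap cache? */
        bool huge;              /* cluster still backs a whole THP? */
};

/* Models one call of split_swap_cluster_map() for this cluster */
static void split_one_pmd_swap_mapping(struct cluster_model *ci)
{
        if (!ci->huge)          /* already split by someone else, done */
                return;
        ci->pmd_map_count--;
        /* last PMD mapping gone and THP not in swap cache: split cluster */
        if (ci->pmd_map_count == 0 && !ci->has_swap_cache)
                ci->huge = false;
}

int main(void)
{
        struct cluster_model ci = {
                .pmd_map_count = 2, .has_swap_cache = false, .huge = true,
        };

        split_one_pmd_swap_mapping(&ci);  /* one PMD mapping remains */
        printf("huge=%d count=%d\n", ci.huge, ci.pmd_map_count);  /* 1 1 */
        split_one_pmd_swap_mapping(&ci);  /* last mapping: cluster splits */
        printf("huge=%d count=%d\n", ci.huge, ci.pmd_map_count);  /* 0 0 */
        return 0;
}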
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: "Kirill A. Shutemov"
Cc: Andrea Arcangeli
Cc: Michal Hocko
Cc: Johannes Weiner
Cc: Shaohua Li
Cc: Hugh Dickins
Cc: Minchan Kim
Cc: Rik van Riel
Cc: Dave Hansen
Cc: Naoya Horiguchi
Cc: Zi Yan
Cc: Daniel Jordan
---
 include/linux/huge_mm.h |  4 ++++
 include/linux/swap.h    |  6 +++++
 mm/huge_memory.c        | 49 ++++++++++++++++++++++++++++++++++++-----
 mm/swapfile.c           | 32 +++++++++++++++++++++++++++
 4 files changed, 86 insertions(+), 5 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4663ee96cf59..1c0fda003d6a 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -226,6 +226,10 @@ static inline bool is_huge_zero_page(struct page *page)
         return READ_ONCE(huge_zero_page) == page;
 }
 
+/*
+ * is_huge_zero_pmd() must be called after checking pmd_present(),
+ * otherwise, it may report false positive for PMD swap entry.
+ */
 static inline bool is_huge_zero_pmd(pmd_t pmd)
 {
         return is_huge_zero_page(pmd_page(pmd));
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 24c3014894dd..a24d101b131d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -619,11 +619,17 @@ static inline swp_entry_t get_swap_page(struct page *page)
 
 #ifdef CONFIG_THP_SWAP
 extern int split_swap_cluster(swp_entry_t entry);
+extern int split_swap_cluster_map(swp_entry_t entry);
 #else
 static inline int split_swap_cluster(swp_entry_t entry)
 {
         return 0;
 }
+
+static inline int split_swap_cluster_map(swp_entry_t entry)
+{
+        return 0;
+}
 #endif
 
 #ifdef CONFIG_MEMCG
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2dba2c1c299a..9ec87c2ed1e8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1631,6 +1631,41 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
         return 0;
 }
 
+/* Convert a PMD swap mapping to a set of PTE swap mappings */
+static void __split_huge_swap_pmd(struct vm_area_struct *vma,
+                                  unsigned long addr,
+                                  pmd_t *pmd)
+{
+        struct mm_struct *mm = vma->vm_mm;
+        pgtable_t pgtable;
+        pmd_t _pmd;
+        swp_entry_t entry;
+        int i, soft_dirty;
+
+        addr &= HPAGE_PMD_MASK;
+        entry = pmd_to_swp_entry(*pmd);
+        soft_dirty = pmd_soft_dirty(*pmd);
+
+        split_swap_cluster_map(entry);
+
+        pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+        pmd_populate(mm, &_pmd, pgtable);
+
+        for (i = 0; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, entry.val++) {
+                pte_t *pte, ptent;
+
+                pte = pte_offset_map(&_pmd, addr);
+                VM_BUG_ON(!pte_none(*pte));
+                ptent = swp_entry_to_pte(entry);
+                if (soft_dirty)
+                        ptent = pte_swp_mksoft_dirty(ptent);
+                set_pte_at(mm, addr, pte, ptent);
+                pte_unmap(pte);
+        }
+        smp_wmb(); /* make pte visible before pmd */
+        pmd_populate(mm, pmd, pgtable);
+}
+
 /*
  * Return true if we do MADV_FREE successfully on entire pmd page.
  * Otherwise, return false.
@@ -2095,7 +2130,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
         VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
         VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
         VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
-        VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
+        VM_BUG_ON(!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd)
                                 && !pmd_devmap(*pmd));
 
         count_vm_event(THP_SPLIT_PMD);
@@ -2119,7 +2154,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
                 put_page(page);
                 add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR);
                 return;
-        } else if (is_huge_zero_pmd(*pmd)) {
+        } else if (pmd_present(*pmd) && is_huge_zero_pmd(*pmd)) {
                 /*
                  * FIXME: Do we want to invalidate secondary mmu by calling
                  * mmu_notifier_invalidate_range() see comments below inside
@@ -2163,6 +2198,9 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
                 page = pfn_to_page(swp_offset(entry));
         } else
 #endif
+        if (IS_ENABLED(CONFIG_THP_SWAP) && is_swap_pmd(old_pmd))
+                return __split_huge_swap_pmd(vma, haddr, pmd);
+        else
                 page = pmd_page(old_pmd);
         VM_BUG_ON_PAGE(!page_count(page), page);
         page_ref_add(page, HPAGE_PMD_NR - 1);
@@ -2254,14 +2292,15 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
          * pmd against. Otherwise we can end up replacing wrong page.
          */
         VM_BUG_ON(freeze && !page);
-        if (page && page != pmd_page(*pmd))
-                goto out;
+        /* pmd_page() should be called only if pmd_present() */
+        if (page && (!pmd_present(*pmd) || page != pmd_page(*pmd)))
+                goto out;
 
         if (pmd_trans_huge(*pmd)) {
                 page = pmd_page(*pmd);
                 if (PageMlocked(page))
                         clear_page_mlock(page);
-        } else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd)))
+        } else if (!(pmd_devmap(*pmd) || is_swap_pmd(*pmd)))
                 goto out;
         __split_huge_pmd_locked(vma, pmd, haddr, freeze);
 out:
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3eda4cbd279c..e83e3c93f3b3 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -4041,6 +4041,38 @@ void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
 }
 #endif
 
+#ifdef CONFIG_THP_SWAP
+/*
+ * The corresponding page table shouldn't be changed under us, that
+ * is, the page table lock should be held.
+ */
+int split_swap_cluster_map(swp_entry_t entry)
+{
+        struct swap_info_struct *si;
+        struct swap_cluster_info *ci;
+        unsigned long offset = swp_offset(entry);
+
+        VM_BUG_ON(!IS_ALIGNED(offset, SWAPFILE_CLUSTER));
+        si = _swap_info_get(entry);
+        if (!si)
+                return -EBUSY;
+        ci = lock_cluster(si, offset);
+        /* The swap cluster has been split by someone else, we are done */
+        if (!cluster_is_huge(ci))
+                goto out;
+        cluster_add_swapcount(ci, -1);
+        /*
+         * If the last PMD swap mapping has gone and the THP isn't in
+         * swap cache, the huge swap cluster will be split.
+         */
+        if (!cluster_swapcount(ci) && !(si->swap_map[offset] & SWAP_HAS_CACHE))
+                cluster_clear_huge(ci);
+out:
+        unlock_cluster(ci);
+        return 0;
+}
+#endif
+
 static int __init swapfile_init(void)
 {
         int nid;