From patchwork Fri Jun 22 03:51:33 2018
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 10481127
From: "Huang, Ying"
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying,
 "Kirill A. Shutemov", Andrea Arcangeli, Michal Hocko, Johannes Weiner,
 Shaohua Li, Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
 Naoya Horiguchi, Zi Yan, Daniel Jordan
Subject: [PATCH -mm -v4 03/21] mm, THP, swap: Support PMD swap mapping in
 swap_duplicate()
Date: Fri, 22 Jun 2018 11:51:33 +0800
Message-Id: <20180622035151.6676-4-ying.huang@intel.com>
In-Reply-To: <20180622035151.6676-1-ying.huang@intel.com>
References: <20180622035151.6676-1-ying.huang@intel.com>

From: Huang Ying

To support swapping in a THP as a whole, we need to create PMD swap
mappings during swapout and maintain the PMD swap mapping count.  This
patch implements the support for increasing the PMD swap mapping count
(for swapout, fork, etc.) and setting the SWAP_HAS_CACHE flag (for
swapin, etc.) for a huge swap cluster in the swap_duplicate() function
family.  Although this implements only a part of the design of the swap
reference count with PMD swap mapping, the whole design is described
below to make the patch and the overall picture easier to understand.

A huge swap cluster is used to hold the contents of a swapped-out THP.
After swapout, a PMD page mapping to the THP becomes a PMD swap mapping
to the huge swap cluster via a swap entry in the PMD, while a PTE page
mapping to a subpage of the THP becomes a PTE swap mapping to a swap
slot in the huge swap cluster via a swap entry in the PTE.  If there is
no PMD swap mapping and the corresponding THP is removed from the page
cache (reclaimed), the huge swap cluster will be split and become a
normal swap cluster.

The count (cluster_count()) of the huge swap cluster is
SWAPFILE_CLUSTER (= HPAGE_PMD_NR) + the PMD swap mapping count.
Because every swap slot in the huge swap cluster is mapped by a PTE or a
PMD, or has the SWAP_HAS_CACHE bit set, the usage count of the swap
cluster is HPAGE_PMD_NR.  The PMD swap mapping count is recorded too, to
make it easy to determine whether there are any remaining PMD swap
mappings.

The count in swap_map[offset] is the sum of the PTE and PMD swap
mapping counts.  This means that when we increase the PMD swap mapping
count, we need to increase swap_map[offset] for all swap slots inside
the swap cluster.  An alternative would be to make swap_map[offset]
record the PTE swap map count only, given that the PMD swap mapping
count is already recorded in the count of the huge swap cluster.  But
then swap_map[offset] would need to be increased when splitting a PMD
swap mapping, which may fail because of the memory allocation for swap
count continuation, and that failure is hard to deal with.  So we chose
the current solution.

A PMD swap mapping to a huge swap cluster may be split when unmapping
part of the PMD mapping, etc.  That is easy because only the count of
the huge swap cluster needs to be changed.  When the last PMD swap
mapping is gone and SWAP_HAS_CACHE is unset, we will split the huge
swap cluster (clear the huge flag).  This makes it easy to reason about
the cluster state.

A huge swap cluster will be split when splitting a THP in the swap
cache, or when failing to allocate a THP during swapin, etc.  But when
splitting the huge swap cluster, we will not try to split all PMD swap
mappings, because sometimes we do not have enough information available
to do so.  Later, when the PMD swap mapping is duplicated or swapped
in, etc., it will be split and fall back to the PTE operation.

When a THP is added into the swap cache, the SWAP_HAS_CACHE flag will
be set in swap_map[offset] for all swap slots inside the huge swap
cluster backing the THP.  This huge swap cluster will not be split
unless the THP is split, even if its PMD swap mapping count drops to 0.
Later, when the THP is removed from the swap cache, the SWAP_HAS_CACHE
flag will be cleared in swap_map[offset] for all swap slots inside the
huge swap cluster, and the huge swap cluster will be split if its PMD
swap mapping count is 0.

Signed-off-by: "Huang, Ying"
Cc: "Kirill A. Shutemov"
Cc: Andrea Arcangeli
Cc: Michal Hocko
Cc: Johannes Weiner
Cc: Shaohua Li
Cc: Hugh Dickins
Cc: Minchan Kim
Cc: Rik van Riel
Cc: Dave Hansen
Cc: Naoya Horiguchi
Cc: Zi Yan
Cc: Daniel Jordan
---
 include/linux/huge_mm.h |   5 +
 include/linux/swap.h    |   9 +-
 mm/memory.c             |   2 +-
 mm/rmap.c               |   2 +-
 mm/swap_state.c         |   2 +-
 mm/swapfile.c           | 287 +++++++++++++++++++++++++++++++++----------------
 6 files changed, 214 insertions(+), 93 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index d3bbf6bea9e9..213d32e57c39 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -80,6 +80,11 @@ extern struct kobj_attribute shmem_enabled_attr;
 #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
 #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
diff --git a/mm/swapfile.c b/mm/swapfile.c
 		spin_unlock(&si->lock);
 }
 
+static inline bool is_cluster_offset(unsigned long offset)
+{
+	return !(offset % SWAPFILE_CLUSTER);
+}
+
 static inline bool cluster_list_empty(struct swap_cluster_list *list)
 {
 	return cluster_is_null(&list->head);
@@ -1166,16 +1174,14 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
 	return NULL;
 }
 
-static unsigned char __swap_entry_free(struct swap_info_struct *p,
-				       swp_entry_t entry, unsigned char usage)
+static unsigned char __swap_entry_free_locked(struct swap_info_struct *p,
+					      struct swap_cluster_info *ci,
+					      unsigned long offset,
+					      unsigned char usage)
 {
-	struct swap_cluster_info *ci;
-	unsigned long offset = swp_offset(entry);
 	unsigned char count;
 	unsigned char has_cache;
 
-	ci = lock_cluster_or_swap_info(p, offset);
-
 	count = p->swap_map[offset];
 	has_cache = count & SWAP_HAS_CACHE;
@@ -1203,6 +1209,17 @@ static unsigned char __swap_entry_free(struct swap_info_struct *p,
 	usage = count | has_cache;
 	p->swap_map[offset] = usage ? usage : SWAP_HAS_CACHE;
 
+	return usage;
+}
+
+static unsigned char __swap_entry_free(struct swap_info_struct *p,
+				       swp_entry_t entry, unsigned char usage)
+{
+	struct swap_cluster_info *ci;
+	unsigned long offset = swp_offset(entry);
+
+	ci = lock_cluster_or_swap_info(p, offset);
+	usage = __swap_entry_free_locked(p, ci, offset, usage);
 	unlock_cluster_or_swap_info(p, ci);
 
 	return usage;
@@ -3450,32 +3467,12 @@ void si_swapinfo(struct sysinfo *val)
 	spin_unlock(&swap_lock);
 }
 
-/*
- * Verify that a swap entry is valid and increment its swap map count.
- *
- * Returns error code in following case.
- * - success -> 0
- * - swp_entry is invalid -> EINVAL
- * - swp_entry is migration entry -> EINVAL
- * - swap-cache reference is requested but there is already one. -> EEXIST
- * - swap-cache reference is requested but the entry is not used. -> ENOENT
- * - swap-mapped reference requested but needs continued swap count. -> ENOMEM
- */
-static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
+static int __swap_duplicate_locked(struct swap_info_struct *p,
+				   unsigned long offset, unsigned char usage)
 {
-	struct swap_info_struct *p;
-	struct swap_cluster_info *ci;
-	unsigned long offset;
 	unsigned char count;
 	unsigned char has_cache;
-	int err = -EINVAL;
-
-	p = get_swap_device(entry);
-	if (!p)
-		goto out;
-
-	offset = swp_offset(entry);
-	ci = lock_cluster_or_swap_info(p, offset);
+	int err = 0;
 
 	count = p->swap_map[offset];
 
@@ -3485,12 +3482,11 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 	 */
 	if (unlikely(swap_count(count) == SWAP_MAP_BAD)) {
 		err = -ENOENT;
-		goto unlock_out;
+		goto out;
 	}
 
 	has_cache = count & SWAP_HAS_CACHE;
 	count &= ~SWAP_HAS_CACHE;
-	err = 0;
 
 	if (usage == SWAP_HAS_CACHE) {
 
@@ -3517,11 +3513,39 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 	p->swap_map[offset] = count | has_cache;
 
-unlock_out:
+out:
+	return err;
+}
+
+/*
+ * Verify that a swap entry is valid and increment its swap map count.
+ *
+ * Returns error code in following case.
+ * - success -> 0
+ * - swp_entry is invalid -> EINVAL
+ * - swp_entry is migration entry -> EINVAL
+ * - swap-cache reference is requested but there is already one. -> EEXIST
+ * - swap-cache reference is requested but the entry is not used. -> ENOENT
+ * - swap-mapped reference requested but needs continued swap count. -> ENOMEM
+ */
+static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
+{
+	struct swap_info_struct *p;
+	struct swap_cluster_info *ci;
+	unsigned long offset;
+	int err = -EINVAL;
+
+	p = get_swap_device(entry);
+	if (!p)
+		goto out;
+
+	offset = swp_offset(entry);
+	ci = lock_cluster_or_swap_info(p, offset);
+	err = __swap_duplicate_locked(p, offset, usage);
 	unlock_cluster_or_swap_info(p, ci);
+
+	put_swap_device(p);
 out:
-	if (p)
-		put_swap_device(p);
 	return err;
 }
 
@@ -3534,6 +3558,81 @@ void swap_shmem_alloc(swp_entry_t entry)
 	__swap_duplicate(entry, SWAP_MAP_SHMEM);
 }
 
+#ifdef CONFIG_THP_SWAP
+static int __swap_duplicate_cluster(swp_entry_t *entry, unsigned char usage)
+{
+	struct swap_info_struct *si;
+	struct swap_cluster_info *ci;
+	unsigned long offset;
+	unsigned char *map;
+	int i, err = 0;
+
+	si = get_swap_device(*entry);
+	if (!si) {
+		err = -EINVAL;
+		goto out;
+	}
+	offset = swp_offset(*entry);
+	ci = lock_cluster(si, offset);
+	if (cluster_is_free(ci)) {
+		err = -ENOENT;
+		goto unlock;
+	}
+	if (!cluster_is_huge(ci)) {
+		err = -ENOTDIR;
+		goto unlock;
+	}
+	VM_BUG_ON(!is_cluster_offset(offset));
+	VM_BUG_ON(cluster_count(ci) < SWAPFILE_CLUSTER);
+	map = si->swap_map + offset;
+	if (usage == SWAP_HAS_CACHE) {
+		if (map[0] & SWAP_HAS_CACHE) {
+			err = -EEXIST;
+			goto unlock;
+		}
+		for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+			VM_BUG_ON(map[i] & SWAP_HAS_CACHE);
+			map[i] |= SWAP_HAS_CACHE;
+		}
+	} else {
+		for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+retry:
+			err = __swap_duplicate_locked(si, offset + i, usage);
+			if (err == -ENOMEM) {
+				struct page *page;
+
+				page = alloc_page(GFP_ATOMIC | __GFP_HIGHMEM);
+				err = add_swap_count_continuation_locked(
+					si, offset + i, page);
+				if (err) {
+					*entry = swp_entry(si->type, offset+i);
+					goto undup;
+				}
+				goto retry;
+			} else if (err)
+				goto undup;
+		}
+		cluster_set_count(ci, cluster_count(ci) + usage);
+	}
+unlock:
+	unlock_cluster(ci);
+	put_swap_device(si);
+out:
+	return err;
+undup:
+	for (i--; i >= 0; i--)
+		__swap_entry_free_locked(
+			si, ci, offset + i, usage);
+	goto unlock;
+}
+#else
+static inline int __swap_duplicate_cluster(swp_entry_t *entry,
+					   unsigned char usage)
+{
+	return 0;
+}
+#endif
+
 /*
  * Increase reference count of swap entry by 1.
  * Returns 0 for success, or -ENOMEM if a swap_count_continuation is required
@@ -3541,12 +3640,15 @@ void swap_shmem_alloc(swp_entry_t entry)
  * if __swap_duplicate() fails for another reason (-EINVAL or -ENOENT), which
  * might occur if a page table entry has got corrupted.
  */
-int swap_duplicate(swp_entry_t entry)
+int swap_duplicate(swp_entry_t *entry, bool cluster)
 {
 	int err = 0;
 
-	while (!err && __swap_duplicate(entry, 1) == -ENOMEM)
-		err = add_swap_count_continuation(entry, GFP_ATOMIC);
+	if (thp_swap_supported() && cluster)
+		return __swap_duplicate_cluster(entry, 1);
+
+	while (!err && __swap_duplicate(*entry, 1) == -ENOMEM)
+		err = add_swap_count_continuation(*entry, GFP_ATOMIC);
 	return err;
 }
 
@@ -3558,9 +3660,12 @@ int swap_duplicate(swp_entry_t entry)
 * -EBUSY means there is a swap cache.
 * Note: return code is different from swap_duplicate().
 */
-int swapcache_prepare(swp_entry_t entry)
+int swapcache_prepare(swp_entry_t entry, bool cluster)
 {
-	return __swap_duplicate(entry, SWAP_HAS_CACHE);
+	if (thp_swap_supported() && cluster)
+		return __swap_duplicate_cluster(&entry, SWAP_HAS_CACHE);
+	else
+		return __swap_duplicate(entry, SWAP_HAS_CACHE);
 }
 
 struct swap_info_struct *swp_swap_info(swp_entry_t entry)
@@ -3590,51 +3695,13 @@ pgoff_t __page_file_index(struct page *page)
 }
 EXPORT_SYMBOL_GPL(__page_file_index);
 
-/*
- * add_swap_count_continuation - called when a swap count is duplicated
- * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's
- * page of the original vmalloc'ed swap_map, to hold the continuation count
- * (for that entry and for its neighbouring PAGE_SIZE swap entries). Called
- * again when count is duplicated beyond SWAP_MAP_MAX * SWAP_CONT_MAX, etc.
- *
- * These continuation pages are seldom referenced: the common paths all work
- * on the original swap_map, only referring to a continuation page when the
- * low "digit" of a count is incremented or decremented through SWAP_MAP_MAX.
- *
- * add_swap_count_continuation(, GFP_ATOMIC) can be called while holding
- * page table locks; if it fails, add_swap_count_continuation(, GFP_KERNEL)
- * can be called after dropping locks.
- */
-int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
+static int add_swap_count_continuation_locked(struct swap_info_struct *si,
+					      unsigned long offset,
+					      struct page *page)
 {
-	struct swap_info_struct *si;
-	struct swap_cluster_info *ci;
 	struct page *head;
-	struct page *page;
 	struct page *list_page;
-	pgoff_t offset;
 	unsigned char count;
-	int ret = 0;
-
-	/*
-	 * When debugging, it's easier to use __GFP_ZERO here; but it's better
-	 * for latency not to zero a page while GFP_ATOMIC and holding locks.
-	 */
-	page = alloc_page(gfp_mask | __GFP_HIGHMEM);
-
-	si = get_swap_device(entry);
-	if (!si) {
-		/*
-		 * An acceptable race has occurred since the failing
-		 * __swap_duplicate(): the swap device may be swapoff
-		 */
-		goto outer;
-	}
-	spin_lock(&si->lock);
-
-	offset = swp_offset(entry);
-
-	ci = lock_cluster(si, offset);
 
 	count = si->swap_map[offset] & ~SWAP_HAS_CACHE;
 
@@ -3644,13 +3711,11 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 	 * will race to add swap count continuation: we need to avoid
 	 * over-provisioning.
 	 */
-		goto out;
+		return 0;
 	}
 
-	if (!page) {
-		ret = -ENOMEM;
-		goto out;
-	}
+	if (!page)
+		return -ENOMEM;
 
 	/*
 	 * We are fortunate that although vmalloc_to_page uses pte_offset_map,
@@ -3698,7 +3763,57 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 	page = NULL;			/* now it's attached, don't free it */
 out_unlock_cont:
 	spin_unlock(&si->cont_lock);
-out:
+	if (page)
+		__free_page(page);
+	return 0;
+}
+
+/*
+ * add_swap_count_continuation - called when a swap count is duplicated
+ * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's
+ * page of the original vmalloc'ed swap_map, to hold the continuation count
+ * (for that entry and for its neighbouring PAGE_SIZE swap entries). Called
+ * again when count is duplicated beyond SWAP_MAP_MAX * SWAP_CONT_MAX, etc.
+ *
+ * These continuation pages are seldom referenced: the common paths all work
+ * on the original swap_map, only referring to a continuation page when the
+ * low "digit" of a count is incremented or decremented through SWAP_MAP_MAX.
+ *
+ * add_swap_count_continuation(, GFP_ATOMIC) can be called while holding
+ * page table locks; if it fails, add_swap_count_continuation(, GFP_KERNEL)
+ * can be called after dropping locks.
+ */
+int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
+{
+	struct swap_info_struct *si;
+	struct swap_cluster_info *ci;
+	struct page *page;
+	unsigned long offset;
+	int ret = 0;
+
+	/*
+	 * When debugging, it's easier to use __GFP_ZERO here; but it's better
+	 * for latency not to zero a page while GFP_ATOMIC and holding locks.
+	 */
+	page = alloc_page(gfp_mask | __GFP_HIGHMEM);
+
+	si = get_swap_device(entry);
+	if (!si) {
+		/*
+		 * An acceptable race has occurred since the failing
+		 * __swap_duplicate(): the swap device may be swapoff
+		 */
+		goto outer;
+	}
+	spin_lock(&si->lock);
+
+	offset = swp_offset(entry);
+
+	ci = lock_cluster(si, offset);
+
+	ret = add_swap_count_continuation_locked(si, offset, page);
+	page = NULL;
+
 	unlock_cluster(ci);
 	spin_unlock(&si->lock);
 	put_swap_device(si);