From patchwork Mon Aug 5 23:22:42 2024
X-Patchwork-Submitter: Nhat Pham
X-Patchwork-Id: 13754221
From: Nhat Pham
To: akpm@linux-foundation.org
Cc: hannes@cmpxchg.org, yosryahmed@google.com, shakeel.butt@linux.dev,
    linux-mm@kvack.org, kernel-team@meta.com, linux-kernel@vger.kernel.org,
    flintglass@gmail.com, chengming.zhou@linux.dev
Subject: [PATCH v3 1/2] zswap: implement a second chance algorithm for
 dynamic zswap shrinker
Date: Mon, 5 Aug 2024 16:22:42 -0700
Message-ID: <20240805232243.2896283-2-nphamcs@gmail.com>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20240805232243.2896283-1-nphamcs@gmail.com>
References: <20240805232243.2896283-1-nphamcs@gmail.com>
MIME-Version: 1.0
The current zswap shrinker's heuristics to prevent overshrinking are
brittle and inaccurate, specifically in the way we decay the protection
size (i.e., the way we make pages in the zswap LRU eligible for reclaim).

We currently decay the protection aggressively in zswap_lru_add() calls.
This leads to the following unfortunate effect: when a new batch of pages
enters zswap, the protection size rapidly decays to below 25% of the zswap
LRU size, which is way too low. We have observed this effect in production
when experimenting with the zswap shrinker: the rate of shrinking shoots up
massively right after a new batch of zswap stores. This is the opposite of
what we originally want - when new pages enter zswap, we want to protect
both these new pages AND the pages that are already protected in the zswap
LRU.

Replace the existing heuristics with a second chance algorithm:

1. When a new zswap entry is stored in the zswap pool, its referenced bit
   is set.
2. When the zswap shrinker encounters a zswap entry with the referenced bit
   set, give it a second chance - only flip the referenced bit and rotate
   the entry in the LRU.
3. If the shrinker encounters the entry again, this time with its
   referenced bit unset, then it can reclaim the entry.

In this manner, the aging of the pages in the zswap LRUs is decoupled from
zswap stores, and picks up the pace with increasing memory pressure (which
is what we want).
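The three steps above are the classic second-chance (clock) policy applied
per zswap entry. As a rough userspace illustration only - `struct entry`
and `shrinker_visit` are hypothetical stand-ins for `struct zswap_entry`
and the shrinker callback, not the kernel code - the per-entry decision
looks like this:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct zswap_entry: only the fields the
 * second-chance decision needs, not the kernel's actual layout. */
struct entry {
	bool referenced;	/* set when the entry is stored into the pool */
	bool reclaimed;		/* stand-in for actual writeback to swap */
};

/* One shrinker visit to one entry, mirroring steps 2 and 3 above.
 * Returns true if the entry is reclaimed, false if it is merely given
 * a second chance (bit cleared, entry rotated in the LRU). */
static bool shrinker_visit(struct entry *e)
{
	if (e->referenced) {
		e->referenced = false;	/* consume the second chance */
		return false;		/* rotate, do not reclaim */
	}
	e->reclaimed = true;		/* write back to swap */
	return true;
}
```

A freshly stored entry thus survives exactly one full pass of the shrinker;
only an entry that has sat unreferenced for a whole pass is written back.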
The second chance scheme allows us to modulate the writeback rate based on
recent pool activities. Entries that recently entered the pool will be
protected, so if the pool is dominated by such entries the writeback rate
will reduce proportionally, protecting the workload's workingset. On the
other hand, stale entries will be written back quickly, which increases the
effective writeback rate.

The referenced bit is added at the hole after the `length` field of struct
zswap_entry, so there is no extra space overhead for this algorithm.

We will still maintain the count of swapins, which is consumed and
subtracted from the lru size in zswap_shrinker_count(), to further penalize
past overshrinking that led to disk swapins. The idea is that had we
considered this many more pages in the LRU active/protected, they would not
have been written back and we would not have had to swap them in.

To test this new heuristic, I built the kernel under a cgroup with
memory.max set to 2G, on a host with 36 cores:

With the old shrinker:

real: 263.89s
user: 4318.11s
sys: 673.29s
swapins: 227300.5

With the second chance algorithm:

real: 244.85s
user: 4327.22s
sys: 664.39s
swapins: 94663

(average over 5 runs)

We observe a 1.3% reduction in kernel CPU usage, and around a 7.2%
reduction in real time. Note that the number of swapped in pages dropped
by 58%.

Suggested-by: Johannes Weiner
Signed-off-by: Nhat Pham
Acked-by: Yosry Ahmed
---
 include/linux/zswap.h |  16 +++---
 mm/zswap.c            | 108 ++++++++++++++++++++++++------------
 2 files changed, 70 insertions(+), 54 deletions(-)

diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 6cecb4a4f68b..9cd1beef0654 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -13,17 +13,15 @@ extern atomic_t zswap_stored_pages;
 
 struct zswap_lruvec_state {
 	/*
-	 * Number of pages in zswap that should be protected from the shrinker.
-	 * This number is an estimate of the following counts:
+	 * Number of swapped in pages from disk, i.e not found in the zswap pool.
 	 *
-	 * a) Recent page faults.
-	 * b) Recent insertion to the zswap LRU. This includes new zswap stores,
-	 *    as well as recent zswap LRU rotations.
-	 *
-	 * These pages are likely to be warm, and might incur IO if the are written
-	 * to swap.
+	 * This is consumed and subtracted from the lru size in
+	 * zswap_shrinker_count() to penalize past overshrinking that led to disk
+	 * swapins. The idea is that had we considered this many more pages in the
+	 * LRU active/protected and not written them back, we would not have had to
+	 * swapped them in.
 	 */
-	atomic_long_t nr_zswap_protected;
+	atomic_long_t nr_disk_swapins;
 };
 
 unsigned long zswap_total_pages(void);

diff --git a/mm/zswap.c b/mm/zswap.c
index adeaf9c97fde..fb3d9cb88785 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -187,6 +187,10 @@ static struct shrinker *zswap_shrinker;
  * length - the length in bytes of the compressed page data. Needed during
  *          decompression. For a same value filled page length is 0, and both
  *          pool and lru are invalid and must be ignored.
+ * referenced - true if the entry recently entered the zswap pool. Unset by the
+ *              dynamic shrinker. The entry is only reclaimed by the dynamic
+ *              shrinker if referenced is unset. See comments in the shrinker
+ *              section for context.
 * pool - the zswap_pool the entry's data is in
 * handle - zpool allocation handle that stores the compressed page data
 * value - value of the same-value filled pages which have same content
@@ -196,6 +200,7 @@ static struct shrinker *zswap_shrinker;
 struct zswap_entry {
 	swp_entry_t swpentry;
 	unsigned int length;
+	bool referenced;
 	struct zswap_pool *pool;
 	union {
 		unsigned long handle;
@@ -700,11 +705,8 @@ static inline int entry_to_nid(struct zswap_entry *entry)
 
 static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
 {
-	atomic_long_t *nr_zswap_protected;
-	unsigned long lru_size, old, new;
 	int nid = entry_to_nid(entry);
 	struct mem_cgroup *memcg;
-	struct lruvec *lruvec;
 
 	/*
 	 * Note that it is safe to use rcu_read_lock() here, even in the face of
@@ -722,19 +724,6 @@ static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
 	memcg = mem_cgroup_from_entry(entry);
 	/* will always succeed */
 	list_lru_add(list_lru, &entry->lru, nid, memcg);
-
-	/* Update the protection area */
-	lru_size = list_lru_count_one(list_lru, nid, memcg);
-	lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
-	nr_zswap_protected = &lruvec->zswap_lruvec_state.nr_zswap_protected;
-	old = atomic_long_inc_return(nr_zswap_protected);
-	/*
-	 * Decay to avoid overflow and adapt to changing workloads.
-	 * This is based on LRU reclaim cost decaying heuristics.
-	 */
-	do {
-		new = old > lru_size / 4 ?
-			old / 2 : old;
-	} while (!atomic_long_try_cmpxchg(nr_zswap_protected, &old, new));
 
 	rcu_read_unlock();
 }
@@ -752,7 +741,7 @@ static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
 
 void zswap_lruvec_state_init(struct lruvec *lruvec)
 {
-	atomic_long_set(&lruvec->zswap_lruvec_state.nr_zswap_protected, 0);
+	atomic_long_set(&lruvec->zswap_lruvec_state.nr_disk_swapins, 0);
 }
 
 void zswap_folio_swapin(struct folio *folio)
@@ -761,7 +750,7 @@ void zswap_folio_swapin(struct folio *folio)
 
 	if (folio) {
 		lruvec = folio_lruvec(folio);
-		atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
+		atomic_long_inc(&lruvec->zswap_lruvec_state.nr_disk_swapins);
 	}
 }
 
@@ -1082,6 +1071,28 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 
 /*********************************
 * shrinker functions
 **********************************/
+/*
+ * The dynamic shrinker is modulated by the following factors:
+ *
+ * 1. Each zswap entry has a referenced bit, which the shrinker unsets (giving
+ *    the entry a second chance) before rotating it in the LRU list. If the
+ *    entry is considered again by the shrinker, with its referenced bit unset,
+ *    it is written back. The writeback rate as a result is dynamically
+ *    adjusted by the pool activities - if the pool is dominated by new entries
+ *    (i.e lots of recent zswapouts), these entries will be protected and
+ *    the writeback rate will slow down. On the other hand, if the pool has a
+ *    lot of stagnant entries, these entries will be reclaimed immediately,
+ *    effectively increasing the writeback rate.
+ *
+ * 2. Swapins counter: If we observe swapins, it is a sign that we are
+ *    overshrinking and should slow down. We maintain a swapins counter, which
+ *    is consumed and subtract from the number of eligible objects on the LRU
+ *    in zswap_shrinker_count().
+ *
+ * 3. Compression ratio. The better the workload compresses, the less gains we
+ *    can expect from writeback.
+ *    We scale down the number of objects available
+ *    for reclaim by this ratio.
+ */
 static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l,
				       spinlock_t *lock, void *arg)
 {
@@ -1091,6 +1102,16 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
 	enum lru_status ret = LRU_REMOVED_RETRY;
 	int writeback_result;
 
+	/*
+	 * Second chance algorithm: if the entry has its referenced bit set, give it
+	 * a second chance. Only clear the referenced bit and rotate it in the
+	 * zswap's LRU list.
+	 */
+	if (entry->referenced) {
+		entry->referenced = false;
+		return LRU_ROTATE;
+	}
+
 	/*
 	 * As soon as we drop the LRU lock, the entry can be freed by
 	 * a concurrent invalidation. This means the following:
@@ -1157,8 +1178,7 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
 
 static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
 		struct shrink_control *sc)
 {
-	struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
-	unsigned long shrink_ret, nr_protected, lru_size;
+	unsigned long shrink_ret;
 	bool encountered_page_in_swapcache = false;
 
 	if (!zswap_shrinker_enabled ||
@@ -1167,25 +1187,6 @@ static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
 		return SHRINK_STOP;
 	}
 
-	nr_protected =
-		atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
-	lru_size = list_lru_shrink_count(&zswap_list_lru, sc);
-
-	/*
-	 * Abort if we are shrinking into the protected region.
-	 *
-	 * This short-circuiting is necessary because if we have too many multiple
-	 * concurrent reclaimers getting the freeable zswap object counts at the
-	 * same time (before any of them made reasonable progress), the total
-	 * number of reclaimed objects might be more than the number of unprotected
-	 * objects (i.e the reclaimers will reclaim into the protected area of the
-	 * zswap LRU).
-	 */
-	if (nr_protected >= lru_size - sc->nr_to_scan) {
-		sc->nr_scanned = 0;
-		return SHRINK_STOP;
-	}
-
 	shrink_ret = list_lru_shrink_walk(&zswap_list_lru, sc, &shrink_memcg_cb,
 		&encountered_page_in_swapcache);
 
@@ -1200,7 +1201,10 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
 {
 	struct mem_cgroup *memcg = sc->memcg;
 	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
-	unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;
+	atomic_long_t *nr_disk_swapins =
+		&lruvec->zswap_lruvec_state.nr_disk_swapins;
+	unsigned long nr_backing, nr_stored, nr_freeable, nr_disk_swapins_cur,
+		nr_remain;
 
 	if (!zswap_shrinker_enabled || !mem_cgroup_zswap_writeback_enabled(memcg))
 		return 0;
@@ -1233,14 +1237,27 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
 	if (!nr_stored)
 		return 0;
 
-	nr_protected =
-		atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
 	nr_freeable = list_lru_shrink_count(&zswap_list_lru, sc);
+	if (!nr_freeable)
+		return 0;
+
 	/*
-	 * Subtract the lru size by an estimate of the number of pages
-	 * that should be protected.
+	 * Subtract from the lru size the number of pages that are recently swapped
+	 * in from disk. The idea is that had we protect the zswap's LRU by this
+	 * amount of pages, these disk swapins would not have happened.
 	 */
-	nr_freeable = nr_freeable > nr_protected ? nr_freeable - nr_protected : 0;
+	nr_disk_swapins_cur = atomic_long_read(nr_disk_swapins);
+	do {
+		if (nr_freeable >= nr_disk_swapins_cur)
+			nr_remain = 0;
+		else
+			nr_remain = nr_disk_swapins_cur - nr_freeable;
+	} while (!atomic_long_try_cmpxchg(
+		nr_disk_swapins, &nr_disk_swapins_cur, nr_remain));
+
+	nr_freeable -= nr_disk_swapins_cur - nr_remain;
+	if (!nr_freeable)
+		return 0;
 
 	/*
 	 * Scale the number of freeable pages by the memory saving factor.
@@ -1462,6 +1479,7 @@ bool zswap_store(struct folio *folio)
 store_entry:
 	entry->swpentry = swp;
 	entry->objcg = objcg;
+	entry->referenced = true;
 
 	old = xa_store(tree, offset, entry, GFP_KERNEL);
 	if (xa_is_err(old)) {

From patchwork Mon Aug 5 23:22:43 2024
X-Patchwork-Submitter: Nhat Pham
X-Patchwork-Id: 13754222
From: Nhat Pham
To: akpm@linux-foundation.org
Cc: hannes@cmpxchg.org, yosryahmed@google.com, shakeel.butt@linux.dev,
    linux-mm@kvack.org, kernel-team@meta.com, linux-kernel@vger.kernel.org,
    flintglass@gmail.com, chengming.zhou@linux.dev
Subject: [PATCH v3 2/2] zswap: track swapins from disk more accurately
Date: Mon, 5 Aug 2024 16:22:43 -0700
Message-ID: <20240805232243.2896283-3-nphamcs@gmail.com>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20240805232243.2896283-1-nphamcs@gmail.com>
References: <20240805232243.2896283-1-nphamcs@gmail.com>
MIME-Version: 1.0
Currently, there are a couple of issues with our disk swapin tracking for
dynamic zswap shrinker heuristics:

1. We only increment the swapin counter on pivot pages. This means we are
   not taking into account pages that also need to be swapped in, but are
   already taken care of as part of the readahead window.

2. We are also incrementing when the pages are read from the zswap pool,
   which is inaccurate.

This patch rectifies these issues by incrementing the counter whenever we
need to perform a non-zswap read. Note that we are slightly overcounting,
as a page might be read into memory by the readahead algorithm even though
it will not be needed by users - however, this is an acceptable inaccuracy,
as the readahead logic itself will adapt to these kinds of scenarios.

To test this change, I built the kernel under a cgroup with its memory.max
set to 2 GB:

real: 236.66s
user: 4286.06s
sys: 652.86s
swapins: 81552

For comparison, with just the new second chance algorithm, the build time
is as follows:

real: 244.85s
user: 4327.22s
sys: 664.39s
swapins: 94663

With neither:

real: 263.89s
user: 4318.11s
sys: 673.29s
swapins: 227300.5

(average over 5 runs)

With this change, the kernel CPU time reduces by a further 1.7%, and the
real time is reduced by another 3.3%, compared to just the second chance
algorithm by itself. The swapins count also reduces by another 13.85%.
Combining the two changes, we reduce the real time by 10.32%, kernel CPU
time by 3%, and the number of swapins by 64.12%.

To gauge the new scheme's ability to offload cold data, I ran another
benchmark, in which the kernel was built under a cgroup with memory.max set
to 3 GB, but with 0.5 GB worth of cold data allocated before each build (in
a shmem file).

Under the old scheme:

real: 197.18s
user: 4365.08s
sys: 289.02s
zswpwb: 72115.2

Under the new scheme:

real: 195.8s
user: 4362.25s
sys: 290.14s
zswpwb: 87277.8

(average over 5 runs)

Notice that we actually observe a 21% increase in the number of written
back pages - so the new scheme is just as good, if not better, at
offloading pages from the zswap pool when they are cold. Build time reduces
by around 0.7% as a result.

Fixes: b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")
Suggested-by: Johannes Weiner
Signed-off-by: Nhat Pham
Acked-by: Yosry Ahmed
---
 mm/page_io.c    | 11 ++++++++++-
 mm/swap_state.c |  8 ++------
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index ff8c99ee3af7..0004c9fbf7e8 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -521,7 +521,15 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
 
 	if (zswap_load(folio)) {
 		folio_unlock(folio);
-	} else if (data_race(sis->flags & SWP_FS_OPS)) {
+		goto finish;
+	}
+
+	/*
+	 * We have to read the page from slower devices. Increase zswap protection.
+	 */
+	zswap_folio_swapin(folio);
+
+	if (data_race(sis->flags & SWP_FS_OPS)) {
 		swap_read_folio_fs(folio, plug);
 	} else if (synchronous) {
 		swap_read_folio_bdev_sync(folio, sis);
@@ -529,6 +537,7 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
 		swap_read_folio_bdev_async(folio, sis);
 	}
 
+finish:
 	if (workingset) {
 		delayacct_thrashing_end(&in_thrashing);
 		psi_memstall_leave(&pflags);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index a1726e49a5eb..3a0cf965f32b 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -698,10 +698,8 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	/* The page was likely read above, so no need for plugging here */
 	folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
 					&page_allocated, false);
-	if (unlikely(page_allocated)) {
-		zswap_folio_swapin(folio);
+	if (unlikely(page_allocated))
 		swap_read_folio(folio, NULL);
-	}
 	return folio;
 }
 
@@ -850,10 +848,8 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 	/* The folio was likely read above, so no need for plugging here */
 	folio = __read_swap_cache_async(targ_entry, gfp_mask, mpol, targ_ilx,
 					&page_allocated, false);
-	if (unlikely(page_allocated)) {
-		zswap_folio_swapin(folio);
+	if (unlikely(page_allocated))
 		swap_read_folio(folio, NULL);
-	}
 	return folio;
 }
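Taken together, the series leaves two cooperating pieces: the read path
bumps the disk-swapin counter only on true zswap misses, and
zswap_shrinker_count() consumes that counter as a penalty against the
freeable LRU size. The following is a rough single-threaded userspace
sketch of that interaction only - `swap_read_sketch`,
`shrinker_count_sketch`, and the plain global counter are illustrative
names (the kernel uses an atomic_long_t with a cmpxchg loop), not the
actual kernel API:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-lruvec nr_disk_swapins counter. */
static long nr_disk_swapins;

/* Sketch of swap_read_folio() ordering after patch 2: the counter is
 * bumped exactly when zswap misses and the read must hit the slower
 * device, whichever read path (fs / sync bdev / async bdev) serves it. */
static void swap_read_sketch(bool zswap_hit)
{
	if (zswap_hit)
		return;			/* served from zswap: no penalty */
	nr_disk_swapins++;		/* zswap_folio_swapin() equivalent */
	/* ... read from the swap device ... */
}

/* Sketch of the consuming side in zswap_shrinker_count(): subtract the
 * counter from the freeable LRU size, and deduct whatever was used up
 * from the counter itself. */
static long shrinker_count_sketch(long nr_freeable)
{
	long cur = nr_disk_swapins;
	long remain = nr_freeable >= cur ? 0 : cur - nr_freeable;

	nr_disk_swapins = remain;	/* kernel: atomic_long_try_cmpxchg loop */
	return nr_freeable - (cur - remain);
}
```

So every observed disk swapin shrinks the number of objects the shrinker
reports as freeable on the next count, and the penalty is spent (rather
than accumulated forever) as it is applied.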