From patchwork Mon Sep 30 22:12:20 2024
X-Patchwork-Submitter: Kanchana P Sridhar
X-Patchwork-Id: 13817148
From: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org, hannes@cmpxchg.org,
    yosryahmed@google.com, nphamcs@gmail.com, chengming.zhou@linux.dev,
    usamaarif642@gmail.com, shakeel.butt@linux.dev, ryan.roberts@arm.com,
    ying.huang@intel.com, 21cnbao@gmail.com, akpm@linux-foundation.org,
    willy@infradead.org
Cc: nanhai.zou@intel.com, wajdi.k.feghali@intel.com, vinodh.gopal@intel.com,
    kanchana.p.sridhar@intel.com
Subject: [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store().
Date: Mon, 30 Sep 2024 15:12:20 -0700
Message-Id: <20240930221221.6981-7-kanchana.p.sridhar@intel.com>
X-Mailer: git-send-email 2.27.0
In-Reply-To: <20240930221221.6981-1-kanchana.p.sridhar@intel.com>
References: <20240930221221.6981-1-kanchana.p.sridhar@intel.com>
MIME-Version: 1.0

zswap_store() will store large folios by compressing them page by page.

This patch provides a sequential implementation of storing a large folio
in zswap_store() by iterating through each page in the folio to compress
and store it in the zswap zpool.

zswap_store() calls the newly added zswap_store_page() function for each
page in the folio.
zswap_store_page() handles compressing and storing each page.

We check the global and per-cgroup limits once at the beginning of
zswap_store(), and only check that the limit is not reached yet. This is
racy and inaccurate, but it should be sufficient for now. We also obtain
initial references to the relevant objcg and pool to guarantee that
subsequent references can be acquired by zswap_store_page(). A new
function zswap_pool_get() is added to facilitate this.

If these one-time checks pass, we compress the pages of the folio, while
maintaining a running count of compressed bytes for all the folio's pages.
If all pages are successfully compressed and stored, we do the cgroup
zswap charging with the total compressed bytes, and batch-update the
zswap_stored_pages atomic and the ZSWPOUT event stats with
folio_nr_pages() once, before returning from zswap_store().

If an error is encountered during the store of any page in the folio, all
pages in that folio currently stored in zswap will be invalidated. Thus, a
folio is either entirely stored in zswap, or entirely not stored in zswap.

The most important value provided by this patch is that it enables
swapping out large folios to zswap without splitting them. Furthermore, it
batches some operations while doing so (cgroup charging, stats updates).

This patch also forms the basis for building compress batching of pages in
a large folio in zswap_store() by compressing, say, up to 8 pages of the
folio in parallel in hardware using the Intel In-Memory Analytics
Accelerator (Intel IAA).

This change reuses and adapts the functionality in Ryan Roberts' RFC
patch [1]:

  "[RFC,v1] mm: zswap: Store large folios without splitting"

  [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u

Some of the RFC review comments from the discussion in [1] have also been
addressed.

Co-developed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
---
 mm/zswap.c | 220 +++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 153 insertions(+), 67 deletions(-)
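For review convenience, a condensed outline of the new zswap_store() flow
follows. This is illustrative pseudocode only, not part of the patch; the
"unwind" label and the elided cleanup stand in for the put_pool/put_objcg/
check_old paths in the diff below, which is authoritative:

	bool zswap_store(struct folio *folio)
	{
		/* Folio-level, one-time work: limit checks and references. */
		objcg = get_obj_cgroup_from_folio(folio);
		/* cgroup zswap limit check, then zswap_check_limits() */
		pool = zswap_pool_current_get();

		/* Per-page work: compress and insert one entry per page. */
		for (index = 0; index < folio_nr_pages(folio); ++index)
			if (!zswap_store_page(folio_page(folio, index), objcg,
					      pool, tree, &compressed_bytes))
				goto unwind;

		/* Folio-level, one-time accounting for the whole folio. */
		obj_cgroup_charge_zswap(objcg, compressed_bytes);
		count_objcg_events(objcg, ZSWPOUT, nr_pages);
		atomic_long_add(nr_pages, &zswap_stored_pages);
		count_vm_events(ZSWPOUT, nr_pages);
		return true;

	unwind:
		/* Erase any entries already stored at this folio's offsets. */
		return false;
	}
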
diff --git a/mm/zswap.c b/mm/zswap.c
index 2b8da50f6322..b74c8de99646 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -411,6 +411,12 @@ static int __must_check zswap_pool_tryget(struct zswap_pool *pool)
 	return percpu_ref_tryget(&pool->ref);
 }
 
+/* The caller must already have a reference. */
+static void zswap_pool_get(struct zswap_pool *pool)
+{
+	percpu_ref_get(&pool->ref);
+}
+
 static void zswap_pool_put(struct zswap_pool *pool)
 {
 	percpu_ref_put(&pool->ref);
@@ -1402,68 +1408,52 @@ static void shrink_worker(struct work_struct *w)
 /*********************************
 * main API
 **********************************/
-bool zswap_store(struct folio *folio)
+
+/*
+ * Stores the page at specified "index" in a folio.
+ *
+ * @page: The page to store in zswap.
+ * @objcg: The folio's objcg. Caller has a reference.
+ * @pool: The zswap_pool to store the compressed data for the page.
+ *        The caller should have obtained a reference to a valid
+ *        zswap_pool by calling zswap_pool_tryget(), to pass as this
+ *        argument.
+ * @tree: The xarray for the @page's folio's swap.
+ * @compressed_bytes: The compressed entry->length value is added
+ *                    to this, so that the caller can get the total
+ *                    compressed lengths of all sub-pages in a folio.
+ */
+static bool zswap_store_page(struct page *page,
+			     struct obj_cgroup *objcg,
+			     struct zswap_pool *pool,
+			     struct xarray *tree,
+			     size_t *compressed_bytes)
 {
-	swp_entry_t swp = folio->swap;
-	pgoff_t offset = swp_offset(swp);
-	struct xarray *tree = swap_zswap_tree(swp);
 	struct zswap_entry *entry, *old;
-	struct obj_cgroup *objcg = NULL;
-	struct mem_cgroup *memcg = NULL;
-
-	VM_WARN_ON_ONCE(!folio_test_locked(folio));
-	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
-
-	/* Large folios aren't supported */
-	if (folio_test_large(folio))
-		return false;
-
-	if (!zswap_enabled)
-		goto check_old;
-
-	/* Check cgroup limits */
-	objcg = get_obj_cgroup_from_folio(folio);
-	if (objcg && !obj_cgroup_may_zswap(objcg)) {
-		memcg = get_mem_cgroup_from_objcg(objcg);
-		if (shrink_memcg(memcg)) {
-			mem_cgroup_put(memcg);
-			goto reject;
-		}
-		mem_cgroup_put(memcg);
-	}
-
-	if (zswap_check_limits())
-		goto reject;
 
 	/* allocate entry */
-	entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
+	entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(page_folio(page)));
 	if (!entry) {
 		zswap_reject_kmemcache_fail++;
 		goto reject;
 	}
 
-	/* if entry is successfully added, it keeps the reference */
-	entry->pool = zswap_pool_current_get();
-	if (!entry->pool)
-		goto freepage;
+	/* zswap_store() already holds a ref on 'objcg' and 'pool' */
+	if (objcg)
+		obj_cgroup_get(objcg);
+	zswap_pool_get(pool);
 
-	if (objcg) {
-		memcg = get_mem_cgroup_from_objcg(objcg);
-		if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
-			mem_cgroup_put(memcg);
-			goto put_pool;
-		}
-		mem_cgroup_put(memcg);
-	}
+	/* if entry is successfully added, it keeps the reference */
+	entry->pool = pool;
 
-	if (!zswap_compress(&folio->page, entry))
-		goto put_pool;
+	if (!zswap_compress(page, entry))
+		goto put_pool_objcg;
 
-	entry->swpentry = swp;
+	entry->swpentry = page_swap_entry(page);
 	entry->objcg = objcg;
 	entry->referenced = true;
 
-	old = xa_store(tree, offset, entry, GFP_KERNEL);
+	old = xa_store(tree, swp_offset(entry->swpentry), entry, GFP_KERNEL);
 	if (xa_is_err(old)) {
 		int err = xa_err(old);
 
@@ -1480,11 +1470,6 @@ bool zswap_store(struct folio *folio)
 	if (old)
 		zswap_entry_free(old);
 
-	if (objcg) {
-		obj_cgroup_charge_zswap(objcg, entry->length);
-		count_objcg_events(objcg, ZSWPOUT, 1);
-	}
-
 	/*
 	 * We finish initializing the entry while it's already in xarray.
 	 * This is safe because:
@@ -1496,36 +1481,137 @@ bool zswap_store(struct folio *folio)
 	 * an incoherent entry.
 	 */
 	if (entry->length) {
+		*compressed_bytes += entry->length;
 		INIT_LIST_HEAD(&entry->lru);
 		zswap_lru_add(&zswap_list_lru, entry);
 	}
 
-	/* update stats */
-	atomic_long_inc(&zswap_stored_pages);
-	count_vm_event(ZSWPOUT);
-
+	/*
+	 * We shouldn't have any possibility of failure after the entry is
+	 * added in the xarray. The pool/objcg refs obtained here will only
+	 * be dropped if/when zswap_entry_free() gets called.
+	 */
 	return true;
 
 store_failed:
 	zpool_free(entry->pool->zpool, entry->handle);
-put_pool:
-	zswap_pool_put(entry->pool);
-freepage:
+put_pool_objcg:
+	zswap_pool_put(pool);
+	obj_cgroup_put(objcg);
 	zswap_entry_cache_free(entry);
 reject:
+	return false;
+}
+
+bool zswap_store(struct folio *folio)
+{
+	long nr_pages = folio_nr_pages(folio);
+	swp_entry_t swp = folio->swap;
+	struct xarray *tree = swap_zswap_tree(swp);
+	struct obj_cgroup *objcg = NULL;
+	struct mem_cgroup *memcg = NULL;
+	struct zswap_pool *pool;
+	size_t compressed_bytes = 0;
+	bool ret = false;
+	long index;
+
+	VM_WARN_ON_ONCE(!folio_test_locked(folio));
+	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
+
+	if (!zswap_enabled)
+		goto check_old;
+
+	/*
+	 * Check cgroup zswap limits:
+	 *
+	 * The cgroup zswap limit check is done once at the beginning of
+	 * zswap_store(). The cgroup charging is done once, at the end
+	 * of a successful folio store. What this means is, if the cgroup
+	 * was within the zswap_max limit at the beginning of a large folio
+	 * store, it could go over the limit by at most (HPAGE_PMD_NR - 1)
+	 * pages due to this store.
+	 */
+	objcg = get_obj_cgroup_from_folio(folio);
+	if (objcg && !obj_cgroup_may_zswap(objcg)) {
+		memcg = get_mem_cgroup_from_objcg(objcg);
+		if (shrink_memcg(memcg)) {
+			mem_cgroup_put(memcg);
+			goto put_objcg;
+		}
+		mem_cgroup_put(memcg);
+	}
+
+	/*
+	 * Check zpool utilization against zswap limits:
+	 *
+	 * The zswap zpool utilization is also checked against the limits
+	 * just once, at the start of zswap_store(). If the check passes,
+	 * any breaches of the limits set by zswap_max_pages() or
+	 * zswap_accept_thr_pages() that may happen while storing this
+	 * folio, will only be detected during the next call to
+	 * zswap_store() by any process.
+	 */
+	if (zswap_check_limits())
+		goto put_objcg;
+
+	pool = zswap_pool_current_get();
+	if (!pool)
+		goto put_objcg;
+
+	if (objcg) {
+		memcg = get_mem_cgroup_from_objcg(objcg);
+		if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
+			mem_cgroup_put(memcg);
+			goto put_pool;
+		}
+		mem_cgroup_put(memcg);
+	}
+
+	/*
+	 * Store each page of the folio as a separate entry. If we fail to
+	 * store a page, unwind by deleting all the pages for this folio
+	 * currently in zswap.
+	 */
+	for (index = 0; index < nr_pages; ++index) {
+		if (!zswap_store_page(folio_page(folio, index), objcg, pool, tree, &compressed_bytes))
+			goto put_pool;
+	}
+
+	if (objcg) {
+		obj_cgroup_charge_zswap(objcg, compressed_bytes);
+		count_objcg_events(objcg, ZSWPOUT, nr_pages);
+	}
+
+	atomic_long_add(nr_pages, &zswap_stored_pages);
+	count_vm_events(ZSWPOUT, nr_pages);
+
+	ret = true;
+
+put_pool:
+	zswap_pool_put(pool);
+put_objcg:
 	obj_cgroup_put(objcg);
-	if (zswap_pool_reached_full)
+	if (!ret && zswap_pool_reached_full)
 		queue_work(shrink_wq, &zswap_shrink_work);
 check_old:
 	/*
-	 * If the zswap store fails or zswap is disabled, we must invalidate the
-	 * possibly stale entry which was previously stored at this offset.
-	 * Otherwise, writeback could overwrite the new data in the swapfile.
+	 * If the zswap store fails or zswap is disabled, we must invalidate
+	 * the possibly stale entries which were previously stored at the
+	 * offsets corresponding to each page of the folio. Otherwise,
+	 * writeback could overwrite the new data in the swapfile.
 	 */
-	entry = xa_erase(tree, offset);
-	if (entry)
-		zswap_entry_free(entry);
-	return false;
+	if (!ret) {
+		pgoff_t offset = swp_offset(swp);
+		struct zswap_entry *entry;
+
+		for (index = 0; index < nr_pages; ++index) {
+			entry = xa_erase(tree, offset + index);
+			if (entry)
+				zswap_entry_free(entry);
+		}
+	}
+
+	return ret;
 }
 
 bool zswap_load(struct folio *folio)
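
A brief worked example for the worst-case overshoot described in the cgroup
limit comment above: the limit is checked once per folio and charging happens
only after every page has been stored, so a store that passes the check can
exceed the cgroup's zswap limit by up to (HPAGE_PMD_NR - 1) pages. Assuming
4 KiB base pages and a 2 MiB PMD-sized folio, HPAGE_PMD_NR is 512, so the
overshoot from a single store is bounded by 511 pages, i.e. just under 2 MiB
of stored data (less once compressed).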