
[v9,6/7] mm: zswap: Support large folios in zswap_store().

Message ID 20240930221221.6981-7-kanchana.p.sridhar@intel.com (mailing list archive)
State New
Series mm: zswap swap-out of large folios

Commit Message

Kanchana P Sridhar Sept. 30, 2024, 10:12 p.m. UTC
zswap_store() will store large folios by compressing them page by page.

This patch provides a sequential implementation of storing a large folio
in zswap_store() by iterating through each page in the folio to compress
and store it in the zswap zpool.

zswap_store() calls the newly added zswap_store_page() function for each
page in the folio. zswap_store_page() handles compressing and storing each
page.

We check the global and per-cgroup limits once at the beginning of
zswap_store(), and only check that the limit is not reached yet. This is
racy and inaccurate, but it should be sufficient for now. We also obtain
initial references to the relevant objcg and pool to guarantee that
subsequent references can be acquired by zswap_store_page(). A new function
zswap_pool_get() is added to facilitate this.

If these one-time checks pass, we compress the pages of the folio, while
maintaining a running count of compressed bytes for all the folio's pages.
If all pages are successfully compressed and stored, we do the cgroup
zswap charging with the total compressed bytes, and batch update the
zswap_stored_pages atomic/zswpout event stats with folio_nr_pages() once,
before returning from zswap_store().

If an error is encountered during the store of any page in the folio,
all pages in that folio currently stored in zswap will be invalidated.
Thus, a folio is either entirely stored in zswap, or entirely not stored
in zswap.

The most important value provided by this patch is that it enables swapping out
large folios to zswap without splitting them. Furthermore, it batches some
operations while doing so (cgroup charging, stats updates).
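In outline, the new control flow is roughly the following (a condensed
sketch of the diff below; declarations, the cgroup shrink fallback, the
memcg LRU allocation and the !zswap_enabled/shrink-queue handling are
elided):

  bool zswap_store(struct folio *folio)
  {
          long nr_pages = folio_nr_pages(folio);
          struct xarray *tree = swap_zswap_tree(folio->swap);
          size_t compressed_bytes = 0;
          bool ret = false;
          long index;

          /* One-time (racy) cgroup and global limit checks. */
          objcg = get_obj_cgroup_from_folio(folio);
          if ((objcg && !obj_cgroup_may_zswap(objcg)) || zswap_check_limits())
                  goto put_objcg;

          /* Initial pool reference; zswap_store_page() takes per-page refs. */
          pool = zswap_pool_current_get();

          /* Compress and store each page; bail out on the first failure. */
          for (index = 0; index < nr_pages; ++index) {
                  if (!zswap_store_page(folio_page(folio, index), objcg,
                                        pool, tree, &compressed_bytes))
                          goto put_pool;
          }

          /* Batched cgroup charging and stats updates for the whole folio. */
          if (objcg) {
                  obj_cgroup_charge_zswap(objcg, compressed_bytes);
                  count_objcg_events(objcg, ZSWPOUT, nr_pages);
          }
          atomic_long_add(nr_pages, &zswap_stored_pages);
          count_vm_events(ZSWPOUT, nr_pages);
          ret = true;

  put_pool:
          zswap_pool_put(pool);
  put_objcg:
          obj_cgroup_put(objcg);
          /* On failure, erase any entries of this folio already in the tree. */
          return ret;
  }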

This patch also forms the basis for building compress batching of pages in
a large folio in zswap_store() by compressing up to say, 8 pages of the
folio in parallel in hardware using the Intel In-Memory Analytics
Accelerator (Intel IAA).

This change reuses and adapts the functionality in Ryan Roberts' RFC
patch [1]:

  "[RFC,v1] mm: zswap: Store large folios without splitting"

  [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u

Also, addressed some of the RFC comments from the discussion in [1].

Co-developed-by: Ryan Roberts
Signed-off-by:
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 220 +++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 153 insertions(+), 67 deletions(-)

Comments

Yosry Ahmed Sept. 30, 2024, 11:09 p.m. UTC | #1
On Mon, Sep 30, 2024 at 3:12 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> zswap_store() will store large folios by compressing them page by page.
>
> This patch provides a sequential implementation of storing a large folio
> in zswap_store() by iterating through each page in the folio to compress
> and store it in the zswap zpool.
>
> zswap_store() calls the newly added zswap_store_page() function for each
> page in the folio. zswap_store_page() handles compressing and storing each
> page.
>
> We check the global and per-cgroup limits once at the beginning of
> zswap_store(), and only check that the limit is not reached yet. This is
> racy and inaccurate, but it should be sufficient for now. We also obtain
> initial references to the relevant objcg and pool to guarantee that
> subsequent references can be acquired by zswap_store_page(). A new function
> zswap_pool_get() is added to facilitate this.
>
> If these one-time checks pass, we compress the pages of the folio, while
> maintaining a running count of compressed bytes for all the folio's pages.
> If all pages are successfully compressed and stored, we do the cgroup
> zswap charging with the total compressed bytes, and batch update the
> zswap_stored_pages atomic/zswpout event stats with folio_nr_pages() once,
> before returning from zswap_store().
>
> If an error is encountered during the store of any page in the folio,
> all pages in that folio currently stored in zswap will be invalidated.
> Thus, a folio is either entirely stored in zswap, or entirely not stored
> in zswap.
>
> The most important value provided by this patch is it enables swapping out
> large folios to zswap without splitting them. Furthermore, it batches some
> operations while doing so (cgroup charging, stats updates).
>
> This patch also forms the basis for building compress batching of pages in
> a large folio in zswap_store() by compressing up to say, 8 pages of the
> folio in parallel in hardware using the Intel In-Memory Analytics
> Accelerator (Intel IAA).
>
> This change reuses and adapts the functionality in Ryan Roberts' RFC
> patch [1]:
>
>   "[RFC,v1] mm: zswap: Store large folios without splitting"
>
>   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> Also, addressed some of the RFC comments from the discussion in [1].
>
> Co-developed-by: Ryan Roberts
> Signed-off-by:

We haven't been able to get a signoff from Ryan. Andrew, what's the policy here?

> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  mm/zswap.c | 220 +++++++++++++++++++++++++++++++++++++----------------
>  1 file changed, 153 insertions(+), 67 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 2b8da50f6322..b74c8de99646 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -411,6 +411,12 @@ static int __must_check zswap_pool_tryget(struct zswap_pool *pool)
>         return percpu_ref_tryget(&pool->ref);
>  }
>
> +/* The caller must already have a reference. */
> +static void zswap_pool_get(struct zswap_pool *pool)
> +{
> +       percpu_ref_get(&pool->ref);
> +}
> +
>  static void zswap_pool_put(struct zswap_pool *pool)
>  {
>         percpu_ref_put(&pool->ref);
> @@ -1402,68 +1408,52 @@ static void shrink_worker(struct work_struct *w)
>  /*********************************
>  * main API
>  **********************************/
> -bool zswap_store(struct folio *folio)
> +
> +/*
> + * Stores the page at specified "index" in a folio.
> + *
> + * @page:  The page to store in zswap.
> + * @objcg: The folio's objcg. Caller has a reference.
> + * @pool:  The zswap_pool to store the compressed data for the page.
> + *         The caller should have obtained a reference to a valid
> + *         zswap_pool by calling zswap_pool_tryget(), to pass as this
> + *         argument.
> + * @tree:  The xarray for the @page's folio's swap.
> + * @compressed_bytes: The compressed entry->length value is added
> + *                    to this, so that the caller can get the total
> + *                    compressed lengths of all sub-pages in a folio.
> + */
> +static bool zswap_store_page(struct page *page,
> +                            struct obj_cgroup *objcg,
> +                            struct zswap_pool *pool,
> +                            struct xarray *tree,
> +                            size_t *compressed_bytes)
>  {
> -       swp_entry_t swp = folio->swap;
> -       pgoff_t offset = swp_offset(swp);
> -       struct xarray *tree = swap_zswap_tree(swp);
>         struct zswap_entry *entry, *old;
> -       struct obj_cgroup *objcg = NULL;
> -       struct mem_cgroup *memcg = NULL;
> -
> -       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> -       VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> -
> -       /* Large folios aren't supported */
> -       if (folio_test_large(folio))
> -               return false;
> -
> -       if (!zswap_enabled)
> -               goto check_old;
> -
> -       /* Check cgroup limits */
> -       objcg = get_obj_cgroup_from_folio(folio);
> -       if (objcg && !obj_cgroup_may_zswap(objcg)) {
> -               memcg = get_mem_cgroup_from_objcg(objcg);
> -               if (shrink_memcg(memcg)) {
> -                       mem_cgroup_put(memcg);
> -                       goto reject;
> -               }
> -               mem_cgroup_put(memcg);
> -       }
> -
> -       if (zswap_check_limits())
> -               goto reject;
>
>         /* allocate entry */
> -       entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> +       entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(page_folio(page)));

Can we just use page_to_nid() here? I think the node info exists even
for tail pages, right?

>         if (!entry) {
>                 zswap_reject_kmemcache_fail++;
>                 goto reject;
>         }
>
> -       /* if entry is successfully added, it keeps the reference */
> -       entry->pool = zswap_pool_current_get();
> -       if (!entry->pool)
> -               goto freepage;
> +       /* zswap_store() already holds a ref on 'objcg' and 'pool' */
> +       if (objcg)
> +               obj_cgroup_get(objcg);
> +       zswap_pool_get(pool);
>
> -       if (objcg) {
> -               memcg = get_mem_cgroup_from_objcg(objcg);
> -               if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> -                       mem_cgroup_put(memcg);
> -                       goto put_pool;
> -               }
> -               mem_cgroup_put(memcg);
> -       }
> +       /* if entry is successfully added, it keeps the reference */
> +       entry->pool = pool;
>
> -       if (!zswap_compress(&folio->page, entry))
> -               goto put_pool;
> +       if (!zswap_compress(page, entry))
> +               goto put_pool_objcg;
>
> -       entry->swpentry = swp;
> +       entry->swpentry = page_swap_entry(page);
>         entry->objcg = objcg;
>         entry->referenced = true;
>
> -       old = xa_store(tree, offset, entry, GFP_KERNEL);
> +       old = xa_store(tree, swp_offset(entry->swpentry), entry, GFP_KERNEL);
>         if (xa_is_err(old)) {
>                 int err = xa_err(old);
>
> @@ -1480,11 +1470,6 @@ bool zswap_store(struct folio *folio)
>         if (old)
>                 zswap_entry_free(old);
>
> -       if (objcg) {
> -               obj_cgroup_charge_zswap(objcg, entry->length);
> -               count_objcg_events(objcg, ZSWPOUT, 1);
> -       }
> -
>         /*
>          * We finish initializing the entry while it's already in xarray.
>          * This is safe because:
> @@ -1496,36 +1481,137 @@ bool zswap_store(struct folio *folio)
>          *    an incoherent entry.
>          */
>         if (entry->length) {
> +               *compressed_bytes += entry->length;
>                 INIT_LIST_HEAD(&entry->lru);
>                 zswap_lru_add(&zswap_list_lru, entry);
>         }
>
> -       /* update stats */
> -       atomic_long_inc(&zswap_stored_pages);
> -       count_vm_event(ZSWPOUT);
> -
> +       /*
> +        * We shouldn't have any possibility of failure after the entry is
> +        * added in the xarray. The pool/objcg refs obtained here will only
> +        * be dropped if/when zswap_entry_free() gets called.
> +        */
>         return true;
>
>  store_failed:
>         zpool_free(entry->pool->zpool, entry->handle);
> -put_pool:
> -       zswap_pool_put(entry->pool);
> -freepage:
> +put_pool_objcg:
> +       zswap_pool_put(pool);
> +       obj_cgroup_put(objcg);

I think if we reorder the function we can drop these calls, make the
comments positioned a bit better, and centralize the entry
initializations. I am also not a fan of passing a semi-initialized
entry to zswap_compress() to get the pool pointer.

Does the following diff improve things or did I miss something?

diff --git a/mm/zswap.c b/mm/zswap.c
index b74c8de996468..eac1f846886a6 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -881,7 +881,8 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
        return 0;
 }

-static bool zswap_compress(struct page *page, struct zswap_entry *entry)
+static bool zswap_compress(struct page *page, struct zswap_entry *entry,
+                          struct zswap_pool *pool)
 {
        struct crypto_acomp_ctx *acomp_ctx;
        struct scatterlist input, output;
@@ -893,7 +894,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry)
        gfp_t gfp;
        u8 *dst;

-       acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
+       acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);

        mutex_lock(&acomp_ctx->mutex);

@@ -926,7 +927,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry)
        if (comp_ret)
                goto unlock;

-       zpool = entry->pool->zpool;
+       zpool = pool->zpool;
        gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
        if (zpool_malloc_support_movable(zpool))
                gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
@@ -1435,23 +1436,11 @@ static bool zswap_store_page(struct page *page,
        entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(page_folio(page)));
        if (!entry) {
                zswap_reject_kmemcache_fail++;
-               goto reject;
+               return false;
        }

-       /* zswap_store() already holds a ref on 'objcg' and 'pool' */
-       if (objcg)
-               obj_cgroup_get(objcg);
-       zswap_pool_get(pool);
-
-       /* if entry is successfully added, it keeps the reference */
-       entry->pool = pool;
-
-       if (!zswap_compress(page, entry))
-               goto put_pool_objcg;
-
-       entry->swpentry = page_swap_entry(page);
-       entry->objcg = objcg;
-       entry->referenced = true;
+       if (!zswap_compress(page, entry, pool))
+               goto compress_failed;

        old = xa_store(tree, swp_offset(entry->swpentry), entry, GFP_KERNEL);
        if (xa_is_err(old)) {
@@ -1470,6 +1459,16 @@ static bool zswap_store_page(struct page *page,
        if (old)
                zswap_entry_free(old);

+       /*
+        * The entry is successfully compressed and stored in the tree, there is
+        * no further possibility of failure. Grab refs to the pool and objcg.
+        * These refs will be dropped by zswap_entry_free() when the entry is
+        * removed from the tree.
+        */
+       zswap_pool_get(pool);
+       if (objcg)
+               obj_cgroup_get(objcg);
+
        /*
         * We finish initializing the entry while it's already in xarray.
         * This is safe because:
@@ -1480,26 +1479,22 @@ static bool zswap_store_page(struct page *page,
         *    The publishing order matters to prevent writeback from seeing
         *    an incoherent entry.
         */
+       entry->pool = pool;
+       entry->swpentry = page_swap_entry(page);
+       entry->objcg = objcg;
+       entry->referenced = true;
        if (entry->length) {
                *compressed_bytes += entry->length;
                INIT_LIST_HEAD(&entry->lru);
                zswap_lru_add(&zswap_list_lru, entry);
        }

-       /*
-        * We shouldn't have any possibility of failure after the entry is
-        * added in the xarray. The pool/objcg refs obtained here will only
-        * be dropped if/when zswap_entry_free() gets called.
-        */
        return true;

 store_failed:
-       zpool_free(entry->pool->zpool, entry->handle);
-put_pool_objcg:
-       zswap_pool_put(pool);
-       obj_cgroup_put(objcg);
+       zpool_free(pool->zpool, entry->handle);
+compress_failed:
        zswap_entry_cache_free(entry);
-reject:
        return false;
 }


>         zswap_entry_cache_free(entry);
>  reject:
> +       return false;
> +}
> +
> +bool zswap_store(struct folio *folio)
> +{
> +       long nr_pages = folio_nr_pages(folio);
> +       swp_entry_t swp = folio->swap;
> +       struct xarray *tree = swap_zswap_tree(swp);
> +       struct obj_cgroup *objcg = NULL;
> +       struct mem_cgroup *memcg = NULL;
> +       struct zswap_pool *pool;
> +       size_t compressed_bytes = 0;
> +       bool ret = false;
> +       long index;
> +
> +       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> +       VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> +
> +       if (!zswap_enabled)
> +               goto check_old;
> +
> +       /*
> +        * Check cgroup zswap limits:
> +        *
> +        * The cgroup zswap limit check is done once at the beginning of
> +        * zswap_store(). The cgroup charging is done once, at the end
> +        * of a successful folio store. What this means is, if the cgroup
> +        * was within the zswap_max limit at the beginning of a large folio
> +        * store, it could go over the limit by at most (HPAGE_PMD_NR - 1)
> +        * pages due to this store.
> +        */
> +       objcg = get_obj_cgroup_from_folio(folio);
> +       if (objcg && !obj_cgroup_may_zswap(objcg)) {
> +               memcg = get_mem_cgroup_from_objcg(objcg);
> +               if (shrink_memcg(memcg)) {
> +                       mem_cgroup_put(memcg);
> +                       goto put_objcg;
> +               }
> +               mem_cgroup_put(memcg);
> +       }
> +
> +       /*
> +        * Check zpool utilization against zswap limits:
> +        *
> +        * The zswap zpool utilization is also checked against the limits
> +        * just once, at the start of zswap_store(). If the check passes,
> +        * any breaches of the limits set by zswap_max_pages() or
> +        * zswap_accept_thr_pages() that may happen while storing this
> +        * folio, will only be detected during the next call to
> +        * zswap_store() by any process.
> +        */

This is essentially a rephrased repetition of the last comment, just
refer to it instead. Something like:

         /*
           * Check zpool utilization against zswap limits. The possibility of
           * going overlimit is the same as the cgroup limit check.
           */

> +       if (zswap_check_limits())
> +               goto put_objcg;
> +
> +       pool = zswap_pool_current_get();
> +       if (!pool)
> +               goto put_objcg;
> +
> +       if (objcg) {
> +               memcg = get_mem_cgroup_from_objcg(objcg);
> +               if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> +                       mem_cgroup_put(memcg);
> +                       goto put_pool;
> +               }
> +               mem_cgroup_put(memcg);
> +       }
> +
> +       /*
> +        * Store each page of the folio as a separate entry. If we fail to
> +        * store a page, unwind by deleting all the pages for this folio
> +        * currently in zswap.
> +        */
> +       for (index = 0; index < nr_pages; ++index) {
> +               if (!zswap_store_page(folio_page(folio, index), objcg, pool, tree, &compressed_bytes))
> +                       goto put_pool;
> +       }
> +
> +       if (objcg) {
> +               obj_cgroup_charge_zswap(objcg, compressed_bytes);
> +               count_objcg_events(objcg, ZSWPOUT, nr_pages);
> +       }
> +
> +       atomic_long_add(nr_pages, &zswap_stored_pages);
> +       count_vm_events(ZSWPOUT, nr_pages);
> +
> +       ret = true;
> +
> +put_pool:
> +       zswap_pool_put(pool);
> +put_objcg:
>         obj_cgroup_put(objcg);
> -       if (zswap_pool_reached_full)
> +       if (!ret && zswap_pool_reached_full)
>                 queue_work(shrink_wq, &zswap_shrink_work);
>  check_old:
>         /*
> -        * If the zswap store fails or zswap is disabled, we must invalidate the
> -        * possibly stale entry which was previously stored at this offset.
> -        * Otherwise, writeback could overwrite the new data in the swapfile.
> +        * If the zswap store fails or zswap is disabled, we must invalidate
> +        * the possibly stale entries which were previously stored at the
> +        * offsets corresponding to each page of the folio. Otherwise,
> +        * writeback could overwrite the new data in the swapfile.
>          */
> -       entry = xa_erase(tree, offset);
> -       if (entry)
> -               zswap_entry_free(entry);
> -       return false;
> +       if (!ret) {
> +               pgoff_t offset = swp_offset(swp);
> +               struct zswap_entry *entry;
> +
> +               for (index = 0; index < nr_pages; ++index) {
> +                       entry = xa_erase(tree, offset + index);
> +                       if (entry)
> +                               zswap_entry_free(entry);
> +               }
> +       }
> +
> +       return ret;
>  }
>
>  bool zswap_load(struct folio *folio)
> --
> 2.27.0
>
Nhat Pham Sept. 30, 2024, 11:11 p.m. UTC | #2
On Mon, Sep 30, 2024 at 3:12 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> zswap_store() will store large folios by compressing them page by page.
>
> This patch provides a sequential implementation of storing a large folio
> in zswap_store() by iterating through each page in the folio to compress
> and store it in the zswap zpool.
>
> zswap_store() calls the newly added zswap_store_page() function for each
> page in the folio. zswap_store_page() handles compressing and storing each
> page.
>
> We check the global and per-cgroup limits once at the beginning of
> zswap_store(), and only check that the limit is not reached yet. This is
> racy and inaccurate, but it should be sufficient for now. We also obtain
> initial references to the relevant objcg and pool to guarantee that
> subsequent references can be acquired by zswap_store_page(). A new function
> zswap_pool_get() is added to facilitate this.
>
> If these one-time checks pass, we compress the pages of the folio, while
> maintaining a running count of compressed bytes for all the folio's pages.
> If all pages are successfully compressed and stored, we do the cgroup
> zswap charging with the total compressed bytes, and batch update the
> zswap_stored_pages atomic/zswpout event stats with folio_nr_pages() once,
> before returning from zswap_store().
>
> If an error is encountered during the store of any page in the folio,
> all pages in that folio currently stored in zswap will be invalidated.
> Thus, a folio is either entirely stored in zswap, or entirely not stored
> in zswap.
>
> The most important value provided by this patch is it enables swapping out
> large folios to zswap without splitting them. Furthermore, it batches some
> operations while doing so (cgroup charging, stats updates).
>
> This patch also forms the basis for building compress batching of pages in
> a large folio in zswap_store() by compressing up to say, 8 pages of the
> folio in parallel in hardware using the Intel In-Memory Analytics
> Accelerator (Intel IAA).
>
> This change reuses and adapts the functionality in Ryan Roberts' RFC
> patch [1]:
>
>   "[RFC,v1] mm: zswap: Store large folios without splitting"
>
>   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> Also, addressed some of the RFC comments from the discussion in [1].
>
> Co-developed-by: Ryan Roberts
> Signed-off-by:
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  mm/zswap.c | 220 +++++++++++++++++++++++++++++++++++++----------------
>  1 file changed, 153 insertions(+), 67 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 2b8da50f6322..b74c8de99646 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -411,6 +411,12 @@ static int __must_check zswap_pool_tryget(struct zswap_pool *pool)
>         return percpu_ref_tryget(&pool->ref);
>  }
>
> +/* The caller must already have a reference. */
> +static void zswap_pool_get(struct zswap_pool *pool)
> +{
> +       percpu_ref_get(&pool->ref);
> +}
> +
>  static void zswap_pool_put(struct zswap_pool *pool)
>  {
>         percpu_ref_put(&pool->ref);
> @@ -1402,68 +1408,52 @@ static void shrink_worker(struct work_struct *w)
>  /*********************************
>  * main API
>  **********************************/
> -bool zswap_store(struct folio *folio)
> +
> +/*
> + * Stores the page at specified "index" in a folio.
> + *
> + * @page:  The page to store in zswap.
> + * @objcg: The folio's objcg. Caller has a reference.
> + * @pool:  The zswap_pool to store the compressed data for the page.
> + *         The caller should have obtained a reference to a valid
> + *         zswap_pool by calling zswap_pool_tryget(), to pass as this
> + *         argument.
> + * @tree:  The xarray for the @page's folio's swap.
> + * @compressed_bytes: The compressed entry->length value is added
> + *                    to this, so that the caller can get the total
> + *                    compressed lengths of all sub-pages in a folio.
> + */
> +static bool zswap_store_page(struct page *page,
> +                            struct obj_cgroup *objcg,
> +                            struct zswap_pool *pool,
> +                            struct xarray *tree,
> +                            size_t *compressed_bytes)
>  {
> -       swp_entry_t swp = folio->swap;
> -       pgoff_t offset = swp_offset(swp);
> -       struct xarray *tree = swap_zswap_tree(swp);
>         struct zswap_entry *entry, *old;
> -       struct obj_cgroup *objcg = NULL;
> -       struct mem_cgroup *memcg = NULL;
> -
> -       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> -       VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> -
> -       /* Large folios aren't supported */
> -       if (folio_test_large(folio))
> -               return false;
> -
> -       if (!zswap_enabled)
> -               goto check_old;
> -
> -       /* Check cgroup limits */
> -       objcg = get_obj_cgroup_from_folio(folio);
> -       if (objcg && !obj_cgroup_may_zswap(objcg)) {
> -               memcg = get_mem_cgroup_from_objcg(objcg);
> -               if (shrink_memcg(memcg)) {
> -                       mem_cgroup_put(memcg);
> -                       goto reject;
> -               }
> -               mem_cgroup_put(memcg);
> -       }
> -
> -       if (zswap_check_limits())
> -               goto reject;
>
>         /* allocate entry */
> -       entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> +       entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(page_folio(page)));
>         if (!entry) {
>                 zswap_reject_kmemcache_fail++;
>                 goto reject;
>         }
>
> -       /* if entry is successfully added, it keeps the reference */
> -       entry->pool = zswap_pool_current_get();
> -       if (!entry->pool)
> -               goto freepage;
> +       /* zswap_store() already holds a ref on 'objcg' and 'pool' */
> +       if (objcg)
> +               obj_cgroup_get(objcg);
> +       zswap_pool_get(pool);

Should we also batch-get references to the pool as well? i.e add a
helper function:

/* The caller must already have a reference. */
static void zswap_pool_get_many(struct zswap_pool *pool, unsigned long nr)
{
       percpu_ref_get_many(&pool->ref, nr);
}

then do it in one fell swoop after you're done storing all individual subpages
(near atomic_long_add(nr_pages, &zswap_stored_pages)).

Do double check that it is safe - I think it should be, since we have
the folio locked in swapcache, so there should not be any shenanigans
(for e.g no race with concurrent free or writeback).

Perhaps a fixlet suffices?
Yosry Ahmed Sept. 30, 2024, 11:19 p.m. UTC | #3
On Mon, Sep 30, 2024 at 4:11 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, Sep 30, 2024 at 3:12 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > zswap_store() will store large folios by compressing them page by page.
> >
> > This patch provides a sequential implementation of storing a large folio
> > in zswap_store() by iterating through each page in the folio to compress
> > and store it in the zswap zpool.
> >
> > zswap_store() calls the newly added zswap_store_page() function for each
> > page in the folio. zswap_store_page() handles compressing and storing each
> > page.
> >
> > We check the global and per-cgroup limits once at the beginning of
> > zswap_store(), and only check that the limit is not reached yet. This is
> > racy and inaccurate, but it should be sufficient for now. We also obtain
> > initial references to the relevant objcg and pool to guarantee that
> > subsequent references can be acquired by zswap_store_page(). A new function
> > zswap_pool_get() is added to facilitate this.
> >
> > If these one-time checks pass, we compress the pages of the folio, while
> > maintaining a running count of compressed bytes for all the folio's pages.
> > If all pages are successfully compressed and stored, we do the cgroup
> > zswap charging with the total compressed bytes, and batch update the
> > zswap_stored_pages atomic/zswpout event stats with folio_nr_pages() once,
> > before returning from zswap_store().
> >
> > If an error is encountered during the store of any page in the folio,
> > all pages in that folio currently stored in zswap will be invalidated.
> > Thus, a folio is either entirely stored in zswap, or entirely not stored
> > in zswap.
> >
> > The most important value provided by this patch is it enables swapping out
> > large folios to zswap without splitting them. Furthermore, it batches some
> > operations while doing so (cgroup charging, stats updates).
> >
> > This patch also forms the basis for building compress batching of pages in
> > a large folio in zswap_store() by compressing up to say, 8 pages of the
> > folio in parallel in hardware using the Intel In-Memory Analytics
> > Accelerator (Intel IAA).
> >
> > This change reuses and adapts the functionality in Ryan Roberts' RFC
> > patch [1]:
> >
> >   "[RFC,v1] mm: zswap: Store large folios without splitting"
> >
> >   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> >
> > Also, addressed some of the RFC comments from the discussion in [1].
> >
> > Co-developed-by: Ryan Roberts
> > Signed-off-by:
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  mm/zswap.c | 220 +++++++++++++++++++++++++++++++++++++----------------
> >  1 file changed, 153 insertions(+), 67 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 2b8da50f6322..b74c8de99646 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -411,6 +411,12 @@ static int __must_check zswap_pool_tryget(struct zswap_pool *pool)
> >         return percpu_ref_tryget(&pool->ref);
> >  }
> >
> > +/* The caller must already have a reference. */
> > +static void zswap_pool_get(struct zswap_pool *pool)
> > +{
> > +       percpu_ref_get(&pool->ref);
> > +}
> > +
> >  static void zswap_pool_put(struct zswap_pool *pool)
> >  {
> >         percpu_ref_put(&pool->ref);
> > @@ -1402,68 +1408,52 @@ static void shrink_worker(struct work_struct *w)
> >  /*********************************
> >  * main API
> >  **********************************/
> > -bool zswap_store(struct folio *folio)
> > +
> > +/*
> > + * Stores the page at specified "index" in a folio.
> > + *
> > + * @page:  The page to store in zswap.
> > + * @objcg: The folio's objcg. Caller has a reference.
> > + * @pool:  The zswap_pool to store the compressed data for the page.
> > + *         The caller should have obtained a reference to a valid
> > + *         zswap_pool by calling zswap_pool_tryget(), to pass as this
> > + *         argument.
> > + * @tree:  The xarray for the @page's folio's swap.
> > + * @compressed_bytes: The compressed entry->length value is added
> > + *                    to this, so that the caller can get the total
> > + *                    compressed lengths of all sub-pages in a folio.
> > + */
> > +static bool zswap_store_page(struct page *page,
> > +                            struct obj_cgroup *objcg,
> > +                            struct zswap_pool *pool,
> > +                            struct xarray *tree,
> > +                            size_t *compressed_bytes)
> >  {
> > -       swp_entry_t swp = folio->swap;
> > -       pgoff_t offset = swp_offset(swp);
> > -       struct xarray *tree = swap_zswap_tree(swp);
> >         struct zswap_entry *entry, *old;
> > -       struct obj_cgroup *objcg = NULL;
> > -       struct mem_cgroup *memcg = NULL;
> > -
> > -       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > -       VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > -
> > -       /* Large folios aren't supported */
> > -       if (folio_test_large(folio))
> > -               return false;
> > -
> > -       if (!zswap_enabled)
> > -               goto check_old;
> > -
> > -       /* Check cgroup limits */
> > -       objcg = get_obj_cgroup_from_folio(folio);
> > -       if (objcg && !obj_cgroup_may_zswap(objcg)) {
> > -               memcg = get_mem_cgroup_from_objcg(objcg);
> > -               if (shrink_memcg(memcg)) {
> > -                       mem_cgroup_put(memcg);
> > -                       goto reject;
> > -               }
> > -               mem_cgroup_put(memcg);
> > -       }
> > -
> > -       if (zswap_check_limits())
> > -               goto reject;
> >
> >         /* allocate entry */
> > -       entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> > +       entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(page_folio(page)));
> >         if (!entry) {
> >                 zswap_reject_kmemcache_fail++;
> >                 goto reject;
> >         }
> >
> > -       /* if entry is successfully added, it keeps the reference */
> > -       entry->pool = zswap_pool_current_get();
> > -       if (!entry->pool)
> > -               goto freepage;
> > +       /* zswap_store() already holds a ref on 'objcg' and 'pool' */
> > +       if (objcg)
> > +               obj_cgroup_get(objcg);
> > +       zswap_pool_get(pool);
>
> Should we also batch-get references to the pool as well? i.e add a
> helper function:
>
> /* The caller must already have a reference. */
> static void zswap_pool_get_many(struct zswap_pool *pool, unsigned long nr)
> {
>        percpu_ref_get_many(&pool->ref, nr);
> }
>
> then do it in a fell swoop after you're done storing all individual subpages
> (near atomic_long_add(nr_pages, &zswap_stored_pages)).
>
> Do double check that it is safe - I think it should be, since we have
> the folio locked in swapcache, so there should not be any shenanigans
> (for e.g no race with concurrent free or writeback).
>
> Perhaps a fixlet suffices?

I suggested this in a previous version, and Kanchana faced some
complexities implementing it:
https://lore.kernel.org/lkml/SJ0PR11MB56785027ED6FCF673A84CEE6C96A2@SJ0PR11MB5678.namprd11.prod.outlook.com/

Basically, if we batch get the refs after the store I think it's not
safe, because once an entry is published to writeback it can be
written back and freed, and a ref that we never acquired would be
dropped.

Getting refs before the store would work, but then if the store fails
at an arbitrary page, we need to only drop refs on the pool for pages
that were not added to the tree, as the cleanup loop with
zswap_entry_free() at the end of zswap_store() will drop the ref for
those that were added to the tree.

We agreed to (potentially) do the batching for refcounts as a followup.
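
For illustration only (not something this series does), the "refs before
the store" variant could look roughly like the sketch below. It assumes
zswap_store_page() is changed to not take its own per-page refs, and it
assumes a hypothetical zswap_pool_put_many() alongside Nhat's proposed
zswap_pool_get_many() (thin wrappers around percpu_ref_get_many() /
percpu_ref_put_many()); the objcg reference could be batched the same way
via obj_cgroup_get_many():

        /* Hand one pre-acquired pool ref to each entry as it is stored. */
        zswap_pool_get_many(pool, nr_pages);

        for (index = 0; index < nr_pages; ++index) {
                if (!zswap_store_page(folio_page(folio, index), objcg,
                                      pool, tree, &compressed_bytes)) {
                        /*
                         * Entries [0, index) made it into the tree and keep
                         * their refs (dropped by zswap_entry_free() during
                         * the unwind); return only the refs that were never
                         * handed out.
                         */
                        zswap_pool_put_many(pool, nr_pages - index);
                        goto put_pool;
                }
        }

That failure-path bookkeeping is exactly the complexity being deferred to
the follow-up.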
Nhat Pham Sept. 30, 2024, 11:29 p.m. UTC | #4
On Mon, Sep 30, 2024 at 4:20 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Mon, Sep 30, 2024 at 4:11 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> I suggested this in a previous version, and Kanchana faced some
> complexities implementing it:
> https://lore.kernel.org/lkml/SJ0PR11MB56785027ED6FCF673A84CEE6C96A2@SJ0PR11MB5678.namprd11.prod.outlook.com/

Sorry, I missed that conversation.

>
> Basically, if we batch get the refs after the store I think it's not
> safe, because once an entry is published to writeback it can be
> written back and freed, and a ref that we never acquired would be
> dropped.

Hmmm. I don't think writeback could touch any individual subpage just yet, no?

Before doing any work, zswap writeback would attempt to add the
subpage to the swap cache (via __read_swap_cache_async()). However,
all subpage will have already been added to swap cache, and point to
the (large) folio. So zswap_writeback_entry() should short circuit
here (the if (!page_allocated) case).

>
> Getting refs before the store would work, but then if the store fails
> at an arbitrary page, we need to only drop refs on the pool for pages
> that were not added to the tree, as the cleanup loop with
> zswap_entry_free() at the end of zswap_store() will drop the ref for
> those that were added to the tree.
>
> We agreed to (potentially) do the batching for refcounts as a followup.

But yeah no biggie. Not a dealbreaker for me tbh. I thought it was a
quick change (hence the fixlet suggestion), but if not then let's do
it as a follow-up.
Yosry Ahmed Sept. 30, 2024, 11:33 p.m. UTC | #5
On Mon, Sep 30, 2024 at 4:29 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, Sep 30, 2024 at 4:20 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Mon, Sep 30, 2024 at 4:11 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > I suggested this in a previous version, and Kanchana faced some
> > complexities implementing it:
> > https://lore.kernel.org/lkml/SJ0PR11MB56785027ED6FCF673A84CEE6C96A2@SJ0PR11MB5678.namprd11.prod.outlook.com/
>
> Sorry, I missed that conversation.
>
> >
> > Basically, if we batch get the refs after the store I think it's not
> > safe, because once an entry is published to writeback it can be
> > written back and freed, and a ref that we never acquired would be
> > dropped.
>
> Hmmm. I don't think writeback could touch any individual subpage just yet, no?
>
> Before doing any work, zswap writeback would attempt to add the
> subpage to the swap cache (via __read_swap_cache_async()). However,
> all subpage will have already been added to swap cache, and point to
> the (large) folio. So zswap_writeback_entry() should short circuit
> here (the if (!page_allocated) case).

If it's safe to take the refs after all calls to zswap_store_page()
are successful, then yeah that should be possible, for both the pool
and objcg. I didn't look closely though.

Just to clarify, you mean grab one ref first, then do the
compressions, then grab the remaining refs, right?

The other thing I would be worried about is code clarity, not sure if
it would be confusing to initialize the entries in zswap_store_page(),
then grab the refs later in zswap_store().

>
> >
> > Getting refs before the store would work, but then if the store fails
> > at an arbitrary page, we need to only drop refs on the pool for pages
> > that were not added to the tree, as the cleanup loop with
> > zswap_entry_free() at the end of zswap_store() will drop the ref for
> > those that were added to the tree.
> >
> > We agreed to (potentially) do the batching for refcounts as a followup.
>
> But yeah no biggie. Not a dealbreaker for me tbh. I thought it was a
> quick change (hence the fixlet suggestion), but if not then let's do
> it as a follow-up.

I don't feel strongly either way.
Nhat Pham Sept. 30, 2024, 11:43 p.m. UTC | #6
On Mon, Sep 30, 2024 at 4:34 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Mon, Sep 30, 2024 at 4:29 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Mon, Sep 30, 2024 at 4:20 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Mon, Sep 30, 2024 at 4:11 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > I suggested this in a previous version, and Kanchana faced some
> > > complexities implementing it:
> > > https://lore.kernel.org/lkml/SJ0PR11MB56785027ED6FCF673A84CEE6C96A2@SJ0PR11MB5678.namprd11.prod.outlook.com/
> >
> > Sorry, I missed that conversation.
> >
> > >
> > > Basically, if we batch get the refs after the store I think it's not
> > > safe, because once an entry is published to writeback it can be
> > > written back and freed, and a ref that we never acquired would be
> > > dropped.
> >
> > Hmmm. I don't think writeback could touch any individual subpage just yet, no?
> >
> > Before doing any work, zswap writeback would attempt to add the
> > subpage to the swap cache (via __read_swap_cache_async()). However,
> > all subpage will have already been added to swap cache, and point to
> > the (large) folio. So zswap_writeback_entry() should short circuit
> > here (the if (!page_allocated) case).
>
> If it's safe to take the refs after all calls to zswap_store_page()
> are successful, then yeah that should be possible, for both the pool
> and objcg. I didn't look closely though.
>
> Just to clarify, you mean grab one ref first, then do the
> compressions, then grab the remaining refs, right?

Ah yeah, that's what I meant. We can perform one of the following
sequences: grab 1 -> grab nr -> drop 1, or grab 1 -> grab (nr - 1) if
successful, drop 1 if failed.

Seems straightforward to me, but yeah it seems a bit hair-splitting of
me to die on this hill :) Just thought it was weird seeing the other
parts batchified, and one part wasn't.

The rest LGTM - I'll defer to you and Johannes for further review.

Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Kanchana P Sridhar Oct. 1, 2024, 12:35 a.m. UTC | #7
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Monday, September 30, 2024 4:09 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; willy@infradead.org; Zou, Nanhai <nanhai.zou@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store().
> 
> On Mon, Sep 30, 2024 at 3:12 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > zswap_store() will store large folios by compressing them page by page.
> >
> > This patch provides a sequential implementation of storing a large folio
> > in zswap_store() by iterating through each page in the folio to compress
> > and store it in the zswap zpool.
> >
> > zswap_store() calls the newly added zswap_store_page() function for each
> > page in the folio. zswap_store_page() handles compressing and storing each
> > page.
> >
> > We check the global and per-cgroup limits once at the beginning of
> > zswap_store(), and only check that the limit is not reached yet. This is
> > racy and inaccurate, but it should be sufficient for now. We also obtain
> > initial references to the relevant objcg and pool to guarantee that
> > subsequent references can be acquired by zswap_store_page(). A new
> function
> > zswap_pool_get() is added to facilitate this.
> >
> > If these one-time checks pass, we compress the pages of the folio, while
> > maintaining a running count of compressed bytes for all the folio's pages.
> > If all pages are successfully compressed and stored, we do the cgroup
> > zswap charging with the total compressed bytes, and batch update the
> > zswap_stored_pages atomic/zswpout event stats with folio_nr_pages()
> once,
> > before returning from zswap_store().
> >
> > If an error is encountered during the store of any page in the folio,
> > all pages in that folio currently stored in zswap will be invalidated.
> > Thus, a folio is either entirely stored in zswap, or entirely not stored
> > in zswap.
> >
> > The most important value provided by this patch is it enables swapping out
> > large folios to zswap without splitting them. Furthermore, it batches some
> > operations while doing so (cgroup charging, stats updates).
> >
> > This patch also forms the basis for building compress batching of pages in
> > a large folio in zswap_store() by compressing up to say, 8 pages of the
> > folio in parallel in hardware using the Intel In-Memory Analytics
> > Accelerator (Intel IAA).
> >
> > This change reuses and adapts the functionality in Ryan Roberts' RFC
> > patch [1]:
> >
> >   "[RFC,v1] mm: zswap: Store large folios without splitting"
> >
> >   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-
> ryan.roberts@arm.com/T/#u
> >
> > Also, addressed some of the RFC comments from the discussion in [1].
> >
> > Co-developed-by: Ryan Roberts
> > Signed-off-by:
> 
> We haven't been able to get a signoff from Ryan. Andrew, what's the policy
> here?
> 
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  mm/zswap.c | 220 +++++++++++++++++++++++++++++++++++++-----------
> -----
> >  1 file changed, 153 insertions(+), 67 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 2b8da50f6322..b74c8de99646 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -411,6 +411,12 @@ static int __must_check zswap_pool_tryget(struct
> zswap_pool *pool)
> >         return percpu_ref_tryget(&pool->ref);
> >  }
> >
> > +/* The caller must already have a reference. */
> > +static void zswap_pool_get(struct zswap_pool *pool)
> > +{
> > +       percpu_ref_get(&pool->ref);
> > +}
> > +
> >  static void zswap_pool_put(struct zswap_pool *pool)
> >  {
> >         percpu_ref_put(&pool->ref);
> > @@ -1402,68 +1408,52 @@ static void shrink_worker(struct work_struct
> *w)
> >  /*********************************
> >  * main API
> >  **********************************/
> > -bool zswap_store(struct folio *folio)
> > +
> > +/*
> > + * Stores the page at specified "index" in a folio.
> > + *
> > + * @page:  The page to store in zswap.
> > + * @objcg: The folio's objcg. Caller has a reference.
> > + * @pool:  The zswap_pool to store the compressed data for the page.
> > + *         The caller should have obtained a reference to a valid
> > + *         zswap_pool by calling zswap_pool_tryget(), to pass as this
> > + *         argument.
> > + * @tree:  The xarray for the @page's folio's swap.
> > + * @compressed_bytes: The compressed entry->length value is added
> > + *                    to this, so that the caller can get the total
> > + *                    compressed lengths of all sub-pages in a folio.
> > + */
> > +static bool zswap_store_page(struct page *page,
> > +                            struct obj_cgroup *objcg,
> > +                            struct zswap_pool *pool,
> > +                            struct xarray *tree,
> > +                            size_t *compressed_bytes)
> >  {
> > -       swp_entry_t swp = folio->swap;
> > -       pgoff_t offset = swp_offset(swp);
> > -       struct xarray *tree = swap_zswap_tree(swp);
> >         struct zswap_entry *entry, *old;
> > -       struct obj_cgroup *objcg = NULL;
> > -       struct mem_cgroup *memcg = NULL;
> > -
> > -       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > -       VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > -
> > -       /* Large folios aren't supported */
> > -       if (folio_test_large(folio))
> > -               return false;
> > -
> > -       if (!zswap_enabled)
> > -               goto check_old;
> > -
> > -       /* Check cgroup limits */
> > -       objcg = get_obj_cgroup_from_folio(folio);
> > -       if (objcg && !obj_cgroup_may_zswap(objcg)) {
> > -               memcg = get_mem_cgroup_from_objcg(objcg);
> > -               if (shrink_memcg(memcg)) {
> > -                       mem_cgroup_put(memcg);
> > -                       goto reject;
> > -               }
> > -               mem_cgroup_put(memcg);
> > -       }
> > -
> > -       if (zswap_check_limits())
> > -               goto reject;
> >
> >         /* allocate entry */
> > -       entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> > +       entry = zswap_entry_cache_alloc(GFP_KERNEL,
> folio_nid(page_folio(page)));
> 
> Can we just use page_to_nid() here? I think the node info exists even
> for tail pages, right?

I wasn't sure about this, and figured it would be safer to use folio_nid()
which always gets the node id from the head page. We would need
Johannes/Matthew to give their recommendations.

> 
> >         if (!entry) {
> >                 zswap_reject_kmemcache_fail++;
> >                 goto reject;
> >         }
> >
> > -       /* if entry is successfully added, it keeps the reference */
> > -       entry->pool = zswap_pool_current_get();
> > -       if (!entry->pool)
> > -               goto freepage;
> > +       /* zswap_store() already holds a ref on 'objcg' and 'pool' */
> > +       if (objcg)
> > +               obj_cgroup_get(objcg);
> > +       zswap_pool_get(pool);
> >
> > -       if (objcg) {
> > -               memcg = get_mem_cgroup_from_objcg(objcg);
> > -               if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> > -                       mem_cgroup_put(memcg);
> > -                       goto put_pool;
> > -               }
> > -               mem_cgroup_put(memcg);
> > -       }
> > +       /* if entry is successfully added, it keeps the reference */
> > +       entry->pool = pool;
> >
> > -       if (!zswap_compress(&folio->page, entry))
> > -               goto put_pool;
> > +       if (!zswap_compress(page, entry))
> > +               goto put_pool_objcg;
> >
> > -       entry->swpentry = swp;
> > +       entry->swpentry = page_swap_entry(page);
> >         entry->objcg = objcg;
> >         entry->referenced = true;
> >
> > -       old = xa_store(tree, offset, entry, GFP_KERNEL);
> > +       old = xa_store(tree, swp_offset(entry->swpentry), entry, GFP_KERNEL);
> >         if (xa_is_err(old)) {
> >                 int err = xa_err(old);
> >
> > @@ -1480,11 +1470,6 @@ bool zswap_store(struct folio *folio)
> >         if (old)
> >                 zswap_entry_free(old);
> >
> > -       if (objcg) {
> > -               obj_cgroup_charge_zswap(objcg, entry->length);
> > -               count_objcg_events(objcg, ZSWPOUT, 1);
> > -       }
> > -
> >         /*
> >          * We finish initializing the entry while it's already in xarray.
> >          * This is safe because:
> > @@ -1496,36 +1481,137 @@ bool zswap_store(struct folio *folio)
> >          *    an incoherent entry.
> >          */
> >         if (entry->length) {
> > +               *compressed_bytes += entry->length;
> >                 INIT_LIST_HEAD(&entry->lru);
> >                 zswap_lru_add(&zswap_list_lru, entry);
> >         }
> >
> > -       /* update stats */
> > -       atomic_long_inc(&zswap_stored_pages);
> > -       count_vm_event(ZSWPOUT);
> > -
> > +       /*
> > +        * We shouldn't have any possibility of failure after the entry is
> > +        * added in the xarray. The pool/objcg refs obtained here will only
> > +        * be dropped if/when zswap_entry_free() gets called.
> > +        */
> >         return true;
> >
> >  store_failed:
> >         zpool_free(entry->pool->zpool, entry->handle);
> > -put_pool:
> > -       zswap_pool_put(entry->pool);
> > -freepage:
> > +put_pool_objcg:
> > +       zswap_pool_put(pool);
> > +       obj_cgroup_put(objcg);
> 
> I think if we reorder the function we can drop these calls, make the
> comments positioned a bit better, and centralize the entry
> initializations. I am also not a fan of passing a semi-initialized
> entry to zswap_compress() to get the pool pointer.
> 
> Does the following diff improve things or did I miss something?

We shouldn’t be adding the entry to the xarray before initializing its pool
and objcg, right? Please let me know if I am misunderstanding what you're
proposing in the diff.

> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index b74c8de996468..eac1f846886a6 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -881,7 +881,8 @@ static int zswap_cpu_comp_dead(unsigned int cpu,
> struct hlist_node *node)
>         return 0;
>  }
> 
> -static bool zswap_compress(struct page *page, struct zswap_entry *entry)
> +static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> +                          struct zswap_pool *pool)
>  {
>         struct crypto_acomp_ctx *acomp_ctx;
>         struct scatterlist input, output;
> @@ -893,7 +894,7 @@ static bool zswap_compress(struct page *page,
> struct zswap_entry *entry)
>         gfp_t gfp;
>         u8 *dst;
> 
> -       acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> +       acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> 
>         mutex_lock(&acomp_ctx->mutex);
> 
> @@ -926,7 +927,7 @@ static bool zswap_compress(struct page *page,
> struct zswap_entry *entry)
>         if (comp_ret)
>                 goto unlock;
> 
> -       zpool = entry->pool->zpool;
> +       zpool = pool->zpool;
>         gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
>         if (zpool_malloc_support_movable(zpool))
>                 gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
> @@ -1435,23 +1436,11 @@ static bool zswap_store_page(struct page
> *page,
>         entry = zswap_entry_cache_alloc(GFP_KERNEL,
> folio_nid(page_folio(page)));
>         if (!entry) {
>                 zswap_reject_kmemcache_fail++;
> -               goto reject;
> +               return false;
>         }
> 
> -       /* zswap_store() already holds a ref on 'objcg' and 'pool' */
> -       if (objcg)
> -               obj_cgroup_get(objcg);
> -       zswap_pool_get(pool);
> -
> -       /* if entry is successfully added, it keeps the reference */
> -       entry->pool = pool;
> -
> -       if (!zswap_compress(page, entry))
> -               goto put_pool_objcg;
> -
> -       entry->swpentry = page_swap_entry(page);
> -       entry->objcg = objcg;
> -       entry->referenced = true;
> +       if (!zswap_compress(page, entry, pool))
> +               goto compress_failed;
> 
>         old = xa_store(tree, swp_offset(entry->swpentry), entry, GFP_KERNEL);
>         if (xa_is_err(old)) {
> @@ -1470,6 +1459,16 @@ static bool zswap_store_page(struct page *page,
>         if (old)
>                 zswap_entry_free(old);
> 
> +       /*
> +        * The entry is successfully compressed and stored in the tree, there is
> +        * no further possibility of failure. Grab refs to the pool and objcg.
> +        * These refs will be dropped by zswap_entry_free() when the entry is
> +        * removed from the tree.
> +        */
> +       zswap_pool_get(pool);
> +       if (objcg)
> +               obj_cgroup_get(objcg);
> +
>         /*
>          * We finish initializing the entry while it's already in xarray.
>          * This is safe because:
> @@ -1480,26 +1479,22 @@ static bool zswap_store_page(struct page
> *page,
>          *    The publishing order matters to prevent writeback from seeing
>          *    an incoherent entry.
>          */
> +       entry->pool = pool;
> +       entry->swpentry = page_swap_entry(page);
> +       entry->objcg = objcg;
> +       entry->referenced = true;
>         if (entry->length) {
>                 *compressed_bytes += entry->length;
>                 INIT_LIST_HEAD(&entry->lru);
>                 zswap_lru_add(&zswap_list_lru, entry);
>         }
> 
> -       /*
> -        * We shouldn't have any possibility of failure after the entry is
> -        * added in the xarray. The pool/objcg refs obtained here will only
> -        * be dropped if/when zswap_entry_free() gets called.
> -        */
>         return true;
> 
>  store_failed:
> -       zpool_free(entry->pool->zpool, entry->handle);
> -put_pool_objcg:
> -       zswap_pool_put(pool);
> -       obj_cgroup_put(objcg);
> +       zpool_free(pool->zpool, entry->handle);
> +compress_failed:
>         zswap_entry_cache_free(entry);
> -reject:
>         return false;
>  }
> 
> 
> >         zswap_entry_cache_free(entry);
> >  reject:
> > +       return false;
> > +}
> > +
> > +bool zswap_store(struct folio *folio)
> > +{
> > +       long nr_pages = folio_nr_pages(folio);
> > +       swp_entry_t swp = folio->swap;
> > +       struct xarray *tree = swap_zswap_tree(swp);
> > +       struct obj_cgroup *objcg = NULL;
> > +       struct mem_cgroup *memcg = NULL;
> > +       struct zswap_pool *pool;
> > +       size_t compressed_bytes = 0;
> > +       bool ret = false;
> > +       long index;
> > +
> > +       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > +       VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > +
> > +       if (!zswap_enabled)
> > +               goto check_old;
> > +
> > +       /*
> > +        * Check cgroup zswap limits:
> > +        *
> > +        * The cgroup zswap limit check is done once at the beginning of
> > +        * zswap_store(). The cgroup charging is done once, at the end
> > +        * of a successful folio store. What this means is, if the cgroup
> > +        * was within the zswap_max limit at the beginning of a large folio
> > +        * store, it could go over the limit by at most (HPAGE_PMD_NR - 1)
> > +        * pages due to this store.
> > +        */
> > +       objcg = get_obj_cgroup_from_folio(folio);
> > +       if (objcg && !obj_cgroup_may_zswap(objcg)) {
> > +               memcg = get_mem_cgroup_from_objcg(objcg);
> > +               if (shrink_memcg(memcg)) {
> > +                       mem_cgroup_put(memcg);
> > +                       goto put_objcg;
> > +               }
> > +               mem_cgroup_put(memcg);
> > +       }
> > +
> > +       /*
> > +        * Check zpool utilization against zswap limits:
> > +        *
> > +        * The zswap zpool utilization is also checked against the limits
> > +        * just once, at the start of zswap_store(). If the check passes,
> > +        * any breaches of the limits set by zswap_max_pages() or
> > +        * zswap_accept_thr_pages() that may happen while storing this
> > +        * folio, will only be detected during the next call to
> > +        * zswap_store() by any process.
> > +        */
> 
> This is essentially a rephrased repetition of the last comment, just
> refer to it instead. Something like:
> 
>          /*
>            * Check zpool utilization against zswap limits. The possibility of
>            * going overlimit is the same as the cgroup limit check.
>            */

Sure, I will make this change.

Thanks,
Kanchana

> 
> > +       if (zswap_check_limits())
> > +               goto put_objcg;
> > +
> > +       pool = zswap_pool_current_get();
> > +       if (!pool)
> > +               goto put_objcg;
> > +
> > +       if (objcg) {
> > +               memcg = get_mem_cgroup_from_objcg(objcg);
> > +               if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> > +                       mem_cgroup_put(memcg);
> > +                       goto put_pool;
> > +               }
> > +               mem_cgroup_put(memcg);
> > +       }
> > +
> > +       /*
> > +        * Store each page of the folio as a separate entry. If we fail to
> > +        * store a page, unwind by deleting all the pages for this folio
> > +        * currently in zswap.
> > +        */
> > +       for (index = 0; index < nr_pages; ++index) {
> > +               if (!zswap_store_page(folio_page(folio, index), objcg, pool, tree,
> &compressed_bytes))
> > +                       goto put_pool;
> > +       }
> > +
> > +       if (objcg) {
> > +               obj_cgroup_charge_zswap(objcg, compressed_bytes);
> > +               count_objcg_events(objcg, ZSWPOUT, nr_pages);
> > +       }
> > +
> > +       atomic_long_add(nr_pages, &zswap_stored_pages);
> > +       count_vm_events(ZSWPOUT, nr_pages);
> > +
> > +       ret = true;
> > +
> > +put_pool:
> > +       zswap_pool_put(pool);
> > +put_objcg:
> >         obj_cgroup_put(objcg);
> > -       if (zswap_pool_reached_full)
> > +       if (!ret && zswap_pool_reached_full)
> >                 queue_work(shrink_wq, &zswap_shrink_work);
> >  check_old:
> >         /*
> > -        * If the zswap store fails or zswap is disabled, we must invalidate the
> > -        * possibly stale entry which was previously stored at this offset.
> > -        * Otherwise, writeback could overwrite the new data in the swapfile.
> > +        * If the zswap store fails or zswap is disabled, we must invalidate
> > +        * the possibly stale entries which were previously stored at the
> > +        * offsets corresponding to each page of the folio. Otherwise,
> > +        * writeback could overwrite the new data in the swapfile.
> >          */
> > -       entry = xa_erase(tree, offset);
> > -       if (entry)
> > -               zswap_entry_free(entry);
> > -       return false;
> > +       if (!ret) {
> > +               pgoff_t offset = swp_offset(swp);
> > +               struct zswap_entry *entry;
> > +
> > +               for (index = 0; index < nr_pages; ++index) {
> > +                       entry = xa_erase(tree, offset + index);
> > +                       if (entry)
> > +                               zswap_entry_free(entry);
> > +               }
> > +       }
> > +
> > +       return ret;
> >  }
> >
> >  bool zswap_load(struct folio *folio)
> > --
> > 2.27.0
> >
Kanchana P Sridhar Oct. 1, 2024, 12:47 a.m. UTC | #8
> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Monday, September 30, 2024 4:43 PM
> To: Yosry Ahmed <yosryahmed@google.com>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> willy@infradead.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store().
> 
> On Mon, Sep 30, 2024 at 4:34 PM Yosry Ahmed <yosryahmed@google.com>
> wrote:
> >
> > On Mon, Sep 30, 2024 at 4:29 PM Nhat Pham <nphamcs@gmail.com>
> wrote:
> > >
> > > On Mon, Sep 30, 2024 at 4:20 PM Yosry Ahmed
> <yosryahmed@google.com> wrote:
> > > >
> > > > On Mon, Sep 30, 2024 at 4:11 PM Nhat Pham <nphamcs@gmail.com>
> wrote:
> > > >
> > > > I suggested this in a previous version, and Kanchana faced some
> > > > complexities implementing it:
> > > >
> https://lore.kernel.org/lkml/SJ0PR11MB56785027ED6FCF673A84CEE6C96A2
> @SJ0PR11MB5678.namprd11.prod.outlook.com/
> > >
> > > Sorry, I missed that conversation.
> > >
> > > >
> > > > Basically, if we batch get the refs after the store I think it's not
> > > > safe, because once an entry is published to writeback it can be
> > > > written back and freed, and a ref that we never acquired would be
> > > > dropped.
> > >
> > > Hmmm. I don't think writeback could touch any individual subpage just
> yet, no?
> > >
> > > Before doing any work, zswap writeback would attempt to add the
> > > subpage to the swap cache (via __read_swap_cache_async()). However,
> > > all subpage will have already been added to swap cache, and point to
> > > the (large) folio. So zswap_writeback_entry() should short circuit
> > > here (the if (!page_allocated) case).
> >
> > If it's safe to take the refs after all calls to zswap_store_page()
> > are successful, then yeah that should be possible, for both the pool
> > and objcg. I didn't look closely though.
> >
> > Just to clarify, you mean grab one ref first, then do the
> > compressions, then grab the remaining refs, right?
> 
> Ah yeah, that's what I meant. We can perform either of the following
> sequences: grab 1 -> grab nr -> drop 1; or grab 1 -> grab nr - 1 if
> successful, drop 1 if failed.
> 
> Seems straightforward to me, but yeah it seems a bit hair-splitting of
> me to die on this hill :) Just thought it was weird seeing the other
> parts batchified, and one part wasn't.
> 
> The rest LGTM - I'll defer to you and Johannes for further review.
> 
> Reviewed-by: Nhat Pham <nphamcs@gmail.com>

Thanks Nhat! Sure, this sounds good. I thought I would clarify my current
thinking on this. As Yosry also mentioned, batching the pool refs requires
some more thought. This is tied into the design approach of obtaining the
objcg/pool refs per page before adding them to the entry, which in turn
needs to happen before adding the entry to the xarray. I think there is
some value in doing this incrementally (see the sketch below), because if
storing any page fails:

1) All folio entries already in the xarray hold valid pool/objcg refs
    that can be dropped when we unwind state.
2) There are no excess refs obtained up front by batching that still need
    to be dropped (handling those might make the code a little messy, imho).
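
For illustration, a minimal fragment of the per-page scheme (a sketch only,
condensed from the v9 hunk at the end of this mail; compression, xarray
insertion and error handling are elided):

	/* In zswap_store_page(): take refs that travel with this entry. */
	if (objcg)
		obj_cgroup_get(objcg);
	zswap_pool_get(pool);
	entry->pool = pool;
	entry->objcg = objcg;

	/*
	 * If storing a later page of the folio fails, the entries already
	 * in the xarray drop these refs via zswap_entry_free() when
	 * zswap_store() unwinds in its check_old: section, so there are no
	 * extra batched refs left to clean up.
	 */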

Thanks,
Kanchana
Johannes Weiner Oct. 1, 2024, 1:14 a.m. UTC | #9
On Mon, Sep 30, 2024 at 03:12:20PM -0700, Kanchana P Sridhar wrote:
>  /*********************************
>  * main API
>  **********************************/
> -bool zswap_store(struct folio *folio)
> +
> +/*
> + * Stores the page at specified "index" in a folio.

There is no more index and no folio in this function.

> + *
> + * @page:  The page to store in zswap.
> + * @objcg: The folio's objcg. Caller has a reference.
> + * @pool:  The zswap_pool to store the compressed data for the page.
> + *         The caller should have obtained a reference to a valid
> + *         zswap_pool by calling zswap_pool_tryget(), to pass as this
> + *         argument.
> + * @tree:  The xarray for the @page's folio's swap.

This doesn't look safe.

If the entries were to span a SWAP_ADDRESS_SPACE_SHIFT boundary, the
subpage entries would need to be spread out to different trees also.
Otherwise, it would break loading and writeback down the line.

I *think* it works right now, but at best it's not very future proof.
Please look up the tree inside the function for the specific
swp_entry_t that is being stored.

Same for the unwind/check_old: section.
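
For illustration, something like this inside zswap_store_page() (a sketch
only, reusing helpers already present in this patch):

	swp_entry_t page_swpentry = page_swap_entry(page);
	struct xarray *tree = swap_zswap_tree(page_swpentry);

	/* ... */
	old = xa_store(tree, swp_offset(page_swpentry), entry, GFP_KERNEL);

and, similarly, derive the tree from each page's swp_entry_t in the
check_old: loop before calling xa_erase().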

> + * @compressed_bytes: The compressed entry->length value is added
> + *                    to this, so that the caller can get the total
> + *                    compressed lengths of all sub-pages in a folio.
> + */

With just one caller, IMO the function comment can be dropped...

>  	/* allocate entry */
> -	entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> +	entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(page_folio(page)));

page_to_nid() is safe to use here.

> +bool zswap_store(struct folio *folio)
> +{
> +	long nr_pages = folio_nr_pages(folio);
> +	swp_entry_t swp = folio->swap;
> +	struct xarray *tree = swap_zswap_tree(swp);
> +	struct obj_cgroup *objcg = NULL;
> +	struct mem_cgroup *memcg = NULL;
> +	struct zswap_pool *pool;
> +	size_t compressed_bytes = 0;
> +	bool ret = false;
> +	long index;
> +
> +	VM_WARN_ON_ONCE(!folio_test_locked(folio));
> +	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> +
> +	if (!zswap_enabled)
> +		goto check_old;
> +
> +	/*
> +	 * Check cgroup zswap limits:
> +	 *
> +	 * The cgroup zswap limit check is done once at the beginning of
> +	 * zswap_store(). The cgroup charging is done once, at the end
> +	 * of a successful folio store. What this means is, if the cgroup
> +	 * was within the zswap_max limit at the beginning of a large folio
> +	 * store, it could go over the limit by at most (HPAGE_PMD_NR - 1)
> +	 * pages due to this store.
> +	 */
> +	objcg = get_obj_cgroup_from_folio(folio);
> +	if (objcg && !obj_cgroup_may_zswap(objcg)) {
> +		memcg = get_mem_cgroup_from_objcg(objcg);
> +		if (shrink_memcg(memcg)) {
> +			mem_cgroup_put(memcg);
> +			goto put_objcg;
> +		}
> +		mem_cgroup_put(memcg);
> +	}
> +
> +	/*
> +	 * Check zpool utilization against zswap limits:
> +	 *
> +	 * The zswap zpool utilization is also checked against the limits
> +	 * just once, at the start of zswap_store(). If the check passes,
> +	 * any breaches of the limits set by zswap_max_pages() or
> +	 * zswap_accept_thr_pages() that may happen while storing this
> +	 * folio, will only be detected during the next call to
> +	 * zswap_store() by any process.
> +	 */
> +	if (zswap_check_limits())
> +		goto put_objcg;

There has been some back and forth on those comments. Both checks are
non-atomic and subject to races, so mentioning the HPAGE_PMD_NR - 1
overrun is somewhat misleading - it's much higher in the worst case.

Honestly, I would just get rid of the comments. You're not changing
anything fundamental in this regard, so I don't think there is a
burden to add new comments either.

> +
> +	pool = zswap_pool_current_get();
> +	if (!pool)
> +		goto put_objcg;
> +
> +	if (objcg) {
> +		memcg = get_mem_cgroup_from_objcg(objcg);
> +		if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> +			mem_cgroup_put(memcg);
> +			goto put_pool;
> +		}
> +		mem_cgroup_put(memcg);
> +	}
> +
> +	/*
> +	 * Store each page of the folio as a separate entry. If we fail to
> +	 * store a page, unwind by deleting all the pages for this folio
> +	 * currently in zswap.
> +	 */

The first sentence explains something that is internal to
zswap_store_page(). The second sentence IMO is obvious from the code
itself. I think you can delete this comment.

> +	for (index = 0; index < nr_pages; ++index) {
> +		if (!zswap_store_page(folio_page(folio, index), objcg, pool, tree, &compressed_bytes))
> +			goto put_pool;

Hah, I'm not a stickler for the 80 column line limit, but this is
pushing it ;)

Please grab the page up front.

Yosry had also suggested replacing the compressed_bytes return
parameter with an actual return value. Basically, return compressed
bytes on success, -errno on error. I think this comment was missed
among the page_swap_entry() discussion.

	for (index = 0; index < nr_pages; index++) {
		struct page *page = folio_page(folio, index);
		int bytes;

		bytes = zswap_store_page(page, objcg, pool, tree);
		if (bytes < 0)
			goto put_pool;
		total_bytes += bytes;
	}
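
The matching prototype could then be something along these lines (a sketch
only, exact return type up to you):

	static ssize_t zswap_store_page(struct page *page,
					struct obj_cgroup *objcg,
					struct zswap_pool *pool,
					struct xarray *tree);

i.e. return entry->length on success and a negative errno on failure, so
the caller just accumulates the total.
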
Kanchana P Sridhar Oct. 1, 2024, 1:53 a.m. UTC | #10
[..]

Thanks Johannes! Appreciate your detailed code review comments.
I will incorporate all the comments and submit v10.

Thanks,
Kanchana
Yosry Ahmed Oct. 1, 2024, 6 a.m. UTC | #11
[..]
> > >  store_failed:
> > >         zpool_free(entry->pool->zpool, entry->handle);
> > > -put_pool:
> > > -       zswap_pool_put(entry->pool);
> > > -freepage:
> > > +put_pool_objcg:
> > > +       zswap_pool_put(pool);
> > > +       obj_cgroup_put(objcg);
> >
> > I think if we reorder the function we can drop these calls, make the
> > comments positioned a bit better, and centralize the entry
> > initializations. I am also not a fan of passing a semi-initialized
> > entry to zswap_compress() to get the pool pointer.
> >
> > Does the following diff improve things or did I miss something?
>
> We shouldn’t be adding the entry to the xarray before initializing its pool
> and objcg, right? Please let me know if I am misunderstanding what you're
> proposing in the diff.

It should be safe. We already initialize entry->lru after we insert
the entry in the tree. See the comment above the call to
zswap_lru_add(). Basically we are protected against concurrent
stores/loads through the folio lock, and are protected against
writeback because the entry is not on the LRU yet.
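
Schematically, the ordering I'm suggesting (condensed from the diff below;
"offset" stands for the page's swap offset):

	old = xa_store(tree, offset, entry, GFP_KERNEL);
	/* no failure possible past this point */
	entry->pool = pool;	/* loads/stores excluded by the folio lock */
	entry->objcg = objcg;
	zswap_lru_add(&zswap_list_lru, entry);	/* only now visible to writeback */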

>
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index b74c8de996468..eac1f846886a6 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -881,7 +881,8 @@ static int zswap_cpu_comp_dead(unsigned int cpu,
> > struct hlist_node *node)
> >         return 0;
> >  }
> >
> > -static bool zswap_compress(struct page *page, struct zswap_entry *entry)
> > +static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> > +                          struct zswap_pool *pool)
> >  {
> >         struct crypto_acomp_ctx *acomp_ctx;
> >         struct scatterlist input, output;
> > @@ -893,7 +894,7 @@ static bool zswap_compress(struct page *page,
> > struct zswap_entry *entry)
> >         gfp_t gfp;
> >         u8 *dst;
> >
> > -       acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> > +       acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> >
> >         mutex_lock(&acomp_ctx->mutex);
> >
> > @@ -926,7 +927,7 @@ static bool zswap_compress(struct page *page,
> > struct zswap_entry *entry)
> >         if (comp_ret)
> >                 goto unlock;
> >
> > -       zpool = entry->pool->zpool;
> > +       zpool = pool->zpool;
> >         gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
> >         if (zpool_malloc_support_movable(zpool))
> >                 gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
> > @@ -1435,23 +1436,11 @@ static bool zswap_store_page(struct page
> > *page,
> >         entry = zswap_entry_cache_alloc(GFP_KERNEL,
> > folio_nid(page_folio(page)));
> >         if (!entry) {
> >                 zswap_reject_kmemcache_fail++;
> > -               goto reject;
> > +               return false;
> >         }
> >
> > -       /* zswap_store() already holds a ref on 'objcg' and 'pool' */
> > -       if (objcg)
> > -               obj_cgroup_get(objcg);
> > -       zswap_pool_get(pool);
> > -
> > -       /* if entry is successfully added, it keeps the reference */
> > -       entry->pool = pool;
> > -
> > -       if (!zswap_compress(page, entry))
> > -               goto put_pool_objcg;
> > -
> > -       entry->swpentry = page_swap_entry(page);
> > -       entry->objcg = objcg;
> > -       entry->referenced = true;
> > +       if (!zswap_compress(page, entry, pool))
> > +               goto compress_failed;
> >
> >         old = xa_store(tree, swp_offset(entry->swpentry), entry, GFP_KERNEL);
> >         if (xa_is_err(old)) {
> > @@ -1470,6 +1459,16 @@ static bool zswap_store_page(struct page *page,
> >         if (old)
> >                 zswap_entry_free(old);
> >
> > +       /*
> > +        * The entry is successfully compressed and stored in the tree, there is
> > +        * no further possibility of failure. Grab refs to the pool and objcg.
> > +        * These refs will be dropped by zswap_entry_free() when the entry is
> > +        * removed from the tree.
> > +        */
> > +       zswap_pool_get(pool);
> > +       if (objcg)
> > +               obj_cgroup_get(objcg);
> > +
> >         /*
> >          * We finish initializing the entry while it's already in xarray.
> >          * This is safe because:
> > @@ -1480,26 +1479,22 @@ static bool zswap_store_page(struct page
> > *page,
> >          *    The publishing order matters to prevent writeback from seeing
> >          *    an incoherent entry.
> >          */

I am referring to the comment here ^

> > +       entry->pool = pool;
> > +       entry->swpentry = page_swap_entry(page);
> > +       entry->objcg = objcg;
> > +       entry->referenced = true;
> >         if (entry->length) {
> >                 *compressed_bytes += entry->length;
> >                 INIT_LIST_HEAD(&entry->lru);
> >                 zswap_lru_add(&zswap_list_lru, entry);
> >         }
diff mbox series

Patch

diff --git a/mm/zswap.c b/mm/zswap.c
index 2b8da50f6322..b74c8de99646 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -411,6 +411,12 @@  static int __must_check zswap_pool_tryget(struct zswap_pool *pool)
 	return percpu_ref_tryget(&pool->ref);
 }
 
+/* The caller must already have a reference. */
+static void zswap_pool_get(struct zswap_pool *pool)
+{
+	percpu_ref_get(&pool->ref);
+}
+
 static void zswap_pool_put(struct zswap_pool *pool)
 {
 	percpu_ref_put(&pool->ref);
@@ -1402,68 +1408,52 @@  static void shrink_worker(struct work_struct *w)
 /*********************************
 * main API
 **********************************/
-bool zswap_store(struct folio *folio)
+
+/*
+ * Stores the page at specified "index" in a folio.
+ *
+ * @page:  The page to store in zswap.
+ * @objcg: The folio's objcg. Caller has a reference.
+ * @pool:  The zswap_pool to store the compressed data for the page.
+ *         The caller should have obtained a reference to a valid
+ *         zswap_pool by calling zswap_pool_tryget(), to pass as this
+ *         argument.
+ * @tree:  The xarray for the @page's folio's swap.
+ * @compressed_bytes: The compressed entry->length value is added
+ *                    to this, so that the caller can get the total
+ *                    compressed lengths of all sub-pages in a folio.
+ */
+static bool zswap_store_page(struct page *page,
+			     struct obj_cgroup *objcg,
+			     struct zswap_pool *pool,
+			     struct xarray *tree,
+			     size_t *compressed_bytes)
 {
-	swp_entry_t swp = folio->swap;
-	pgoff_t offset = swp_offset(swp);
-	struct xarray *tree = swap_zswap_tree(swp);
 	struct zswap_entry *entry, *old;
-	struct obj_cgroup *objcg = NULL;
-	struct mem_cgroup *memcg = NULL;
-
-	VM_WARN_ON_ONCE(!folio_test_locked(folio));
-	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
-
-	/* Large folios aren't supported */
-	if (folio_test_large(folio))
-		return false;
-
-	if (!zswap_enabled)
-		goto check_old;
-
-	/* Check cgroup limits */
-	objcg = get_obj_cgroup_from_folio(folio);
-	if (objcg && !obj_cgroup_may_zswap(objcg)) {
-		memcg = get_mem_cgroup_from_objcg(objcg);
-		if (shrink_memcg(memcg)) {
-			mem_cgroup_put(memcg);
-			goto reject;
-		}
-		mem_cgroup_put(memcg);
-	}
-
-	if (zswap_check_limits())
-		goto reject;
 
 	/* allocate entry */
-	entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
+	entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(page_folio(page)));
 	if (!entry) {
 		zswap_reject_kmemcache_fail++;
 		goto reject;
 	}
 
-	/* if entry is successfully added, it keeps the reference */
-	entry->pool = zswap_pool_current_get();
-	if (!entry->pool)
-		goto freepage;
+	/* zswap_store() already holds a ref on 'objcg' and 'pool' */
+	if (objcg)
+		obj_cgroup_get(objcg);
+	zswap_pool_get(pool);
 
-	if (objcg) {
-		memcg = get_mem_cgroup_from_objcg(objcg);
-		if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
-			mem_cgroup_put(memcg);
-			goto put_pool;
-		}
-		mem_cgroup_put(memcg);
-	}
+	/* if entry is successfully added, it keeps the reference */
+	entry->pool = pool;
 
-	if (!zswap_compress(&folio->page, entry))
-		goto put_pool;
+	if (!zswap_compress(page, entry))
+		goto put_pool_objcg;
 
-	entry->swpentry = swp;
+	entry->swpentry = page_swap_entry(page);
 	entry->objcg = objcg;
 	entry->referenced = true;
 
-	old = xa_store(tree, offset, entry, GFP_KERNEL);
+	old = xa_store(tree, swp_offset(entry->swpentry), entry, GFP_KERNEL);
 	if (xa_is_err(old)) {
 		int err = xa_err(old);
 
@@ -1480,11 +1470,6 @@  bool zswap_store(struct folio *folio)
 	if (old)
 		zswap_entry_free(old);
 
-	if (objcg) {
-		obj_cgroup_charge_zswap(objcg, entry->length);
-		count_objcg_events(objcg, ZSWPOUT, 1);
-	}
-
 	/*
 	 * We finish initializing the entry while it's already in xarray.
 	 * This is safe because:
@@ -1496,36 +1481,137 @@  bool zswap_store(struct folio *folio)
 	 *    an incoherent entry.
 	 */
 	if (entry->length) {
+		*compressed_bytes += entry->length;
 		INIT_LIST_HEAD(&entry->lru);
 		zswap_lru_add(&zswap_list_lru, entry);
 	}
 
-	/* update stats */
-	atomic_long_inc(&zswap_stored_pages);
-	count_vm_event(ZSWPOUT);
-
+	/*
+	 * We shouldn't have any possibility of failure after the entry is
+	 * added in the xarray. The pool/objcg refs obtained here will only
+	 * be dropped if/when zswap_entry_free() gets called.
+	 */
 	return true;
 
 store_failed:
 	zpool_free(entry->pool->zpool, entry->handle);
-put_pool:
-	zswap_pool_put(entry->pool);
-freepage:
+put_pool_objcg:
+	zswap_pool_put(pool);
+	obj_cgroup_put(objcg);
 	zswap_entry_cache_free(entry);
 reject:
+	return false;
+}
+
+bool zswap_store(struct folio *folio)
+{
+	long nr_pages = folio_nr_pages(folio);
+	swp_entry_t swp = folio->swap;
+	struct xarray *tree = swap_zswap_tree(swp);
+	struct obj_cgroup *objcg = NULL;
+	struct mem_cgroup *memcg = NULL;
+	struct zswap_pool *pool;
+	size_t compressed_bytes = 0;
+	bool ret = false;
+	long index;
+
+	VM_WARN_ON_ONCE(!folio_test_locked(folio));
+	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
+
+	if (!zswap_enabled)
+		goto check_old;
+
+	/*
+	 * Check cgroup zswap limits:
+	 *
+	 * The cgroup zswap limit check is done once at the beginning of
+	 * zswap_store(). The cgroup charging is done once, at the end
+	 * of a successful folio store. What this means is, if the cgroup
+	 * was within the zswap_max limit at the beginning of a large folio
+	 * store, it could go over the limit by at most (HPAGE_PMD_NR - 1)
+	 * pages due to this store.
+	 */
+	objcg = get_obj_cgroup_from_folio(folio);
+	if (objcg && !obj_cgroup_may_zswap(objcg)) {
+		memcg = get_mem_cgroup_from_objcg(objcg);
+		if (shrink_memcg(memcg)) {
+			mem_cgroup_put(memcg);
+			goto put_objcg;
+		}
+		mem_cgroup_put(memcg);
+	}
+
+	/*
+	 * Check zpool utilization against zswap limits:
+	 *
+	 * The zswap zpool utilization is also checked against the limits
+	 * just once, at the start of zswap_store(). If the check passes,
+	 * any breaches of the limits set by zswap_max_pages() or
+	 * zswap_accept_thr_pages() that may happen while storing this
+	 * folio, will only be detected during the next call to
+	 * zswap_store() by any process.
+	 */
+	if (zswap_check_limits())
+		goto put_objcg;
+
+	pool = zswap_pool_current_get();
+	if (!pool)
+		goto put_objcg;
+
+	if (objcg) {
+		memcg = get_mem_cgroup_from_objcg(objcg);
+		if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
+			mem_cgroup_put(memcg);
+			goto put_pool;
+		}
+		mem_cgroup_put(memcg);
+	}
+
+	/*
+	 * Store each page of the folio as a separate entry. If we fail to
+	 * store a page, unwind by deleting all the pages for this folio
+	 * currently in zswap.
+	 */
+	for (index = 0; index < nr_pages; ++index) {
+		if (!zswap_store_page(folio_page(folio, index), objcg, pool, tree, &compressed_bytes))
+			goto put_pool;
+	}
+
+	if (objcg) {
+		obj_cgroup_charge_zswap(objcg, compressed_bytes);
+		count_objcg_events(objcg, ZSWPOUT, nr_pages);
+	}
+
+	atomic_long_add(nr_pages, &zswap_stored_pages);
+	count_vm_events(ZSWPOUT, nr_pages);
+
+	ret = true;
+
+put_pool:
+	zswap_pool_put(pool);
+put_objcg:
 	obj_cgroup_put(objcg);
-	if (zswap_pool_reached_full)
+	if (!ret && zswap_pool_reached_full)
 		queue_work(shrink_wq, &zswap_shrink_work);
 check_old:
 	/*
-	 * If the zswap store fails or zswap is disabled, we must invalidate the
-	 * possibly stale entry which was previously stored at this offset.
-	 * Otherwise, writeback could overwrite the new data in the swapfile.
+	 * If the zswap store fails or zswap is disabled, we must invalidate
+	 * the possibly stale entries which were previously stored at the
+	 * offsets corresponding to each page of the folio. Otherwise,
+	 * writeback could overwrite the new data in the swapfile.
 	 */
-	entry = xa_erase(tree, offset);
-	if (entry)
-		zswap_entry_free(entry);
-	return false;
+	if (!ret) {
+		pgoff_t offset = swp_offset(swp);
+		struct zswap_entry *entry;
+
+		for (index = 0; index < nr_pages; ++index) {
+			entry = xa_erase(tree, offset + index);
+			if (entry)
+				zswap_entry_free(entry);
+		}
+	}
+
+	return ret;
 }
 
 bool zswap_load(struct folio *folio)