[v8,6/8] mm: zswap: Support large folios in zswap_store().

Message ID	20240928021620.8369-7-kanchana.p.sridhar@intel.com (mailing list archive)
State	New
Headers	show Return-Path: <owner-linux-mm@kvack.org> From: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> To: linux-kernel@vger.kernel.org, linux-mm@kvack.org, hannes@cmpxchg.org, yosryahmed@google.com, nphamcs@gmail.com, chengming.zhou@linux.dev, usamaarif642@gmail.com, shakeel.butt@linux.dev, ryan.roberts@arm.com, ying.huang@intel.com, 21cnbao@gmail.com, akpm@linux-foundation.org Cc: nanhai.zou@intel.com, wajdi.k.feghali@intel.com, vinodh.gopal@intel.com, kanchana.p.sridhar@intel.com Subject: [PATCH v8 6/8] mm: zswap: Support large folios in zswap_store(). Date: Fri, 27 Sep 2024 19:16:18 -0700 Message-Id: <20240928021620.8369-7-kanchana.p.sridhar@intel.com> In-Reply-To: <20240928021620.8369-1-kanchana.p.sridhar@intel.com> References: <20240928021620.8369-1-kanchana.p.sridhar@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	mm: zswap swap-out of large folios \| expand [v8,0/8] mm: zswap swap-out of large folios [v8,1/8] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined. [v8,2/8] mm: zswap: Modify zswap_compress() to accept a page instead of a folio. [v8,3/8] mm: zswap: Rename zswap_pool_get() to zswap_pool_tryget(). [v8,4/8] mm: Provide a new count_objcg_events() API for batch event updates. [v8,5/8] mm: zswap: Modify zswap_stored_pages to be atomic_long_t. [v8,6/8] mm: zswap: Support large folios in zswap_store(). [v8,7/8] mm: swap: Count successful large folio zswap stores in hugepage zswpout stats. [v8,8/8] mm: Document the newly added sysfs large folios zswpout stats.

Sridhar, Kanchana P Sept. 28, 2024, 2:16 a.m. UTC

zswap_store() will store large folios by compressing them page by page.

This patch provides a sequential implementation of storing a large folio
in zswap_store() by iterating through each page in the folio to compress
and store it in the zswap zpool.

Towards this goal, zswap_compress() is modified to take a page instead
of a folio as input.

Each page's swap offset is stored as a separate zswap entry.

We check the cgroup zswap limit and the zpool utilization against
the zswap max/accept_threshold limits once, at the beginning of
zswap_store. We also obtain a percpu_ref_tryget() reference to the
current zswap_pool at the start of zswap_store to prevent it from
being deleted while the folio is being stored.

If these one-time checks pass, we compress the sub-pages of the folio,
while maintaining a running count of compressed bytes for all the folio's
sub-pages. If all pages are successfully compressed and stored, we do the
cgroup zswap charging with the total compressed bytes, and batch update
the zswap_stored_pages atomic/zswpout event stats with folio_nr_pages()
once, before returning from zswap_store.

The patch adds a new zswap_pool_get() function that is called in the
sub-page level "zswap_store_page()" function.

If an error is encountered during the store of any page in the folio,
all pages in that folio currently stored in zswap will be invalidated.
Thus, a folio is either entirely stored in zswap, or entirely not stored
in zswap.

This patch forms the basis for building compress batching of pages in a
large folio in zswap_store by compressing up to say, 8 pages of the folio
in parallel in hardware using the Intel In-Memory Analytics Accelerator
(Intel IAA).

This change reuses and adapts the functionality in Ryan Roberts' RFC
patch [1]:

  "[RFC,v1] mm: zswap: Store large folios without splitting"

  [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u

Also, addressed some of the RFC comments from the discussion in [1].

Co-developed-by: Ryan Roberts
Signed-off-by:
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 227 ++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 165 insertions(+), 62 deletions(-)

Yosry Ahmed Sept. 28, 2024, 3:42 a.m. UTC | #1

On Fri, Sep 27, 2024 at 7:16 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> zswap_store() will store large folios by compressing them page by page.
>
> This patch provides a sequential implementation of storing a large folio
> in zswap_store() by iterating through each page in the folio to compress
> and store it in the zswap zpool.
>
> Towards this goal, zswap_compress() is modified to take a page instead
> of a folio as input.

It is already modified at this point, it's not part of this patch.

>
> Each page's swap offset is stored as a separate zswap entry.

I think "Each page is compressed individually and stored as a separate
zswap entry" is clearer. We do not store the offsets.

>
> We check the cgroup zswap limit and the zpool utilization against
> the zswap max/accept_threshold limits once, at the beginning of
> zswap_store. We also obtain a percpu_ref_tryget() reference to the
> current zswap_pool at the start of zswap_store to prevent it from
> being deleted while the folio is being stored.

This can be clarified further:

We check the global and per-cgroup limits once at the beginning, and
only check that the limit is not reached yet. This is racy and
inaccurate, but it should be sufficient for now. We also obtain
initial references to the relevant objcg and pool to guarantee that
subsequent references can be acquired by zswap_store_page().

>
> If these one-time checks pass, we compress the sub-pages of the folio,
> while maintaining a running count of compressed bytes for all the folio's
> sub-pages. If all pages are successfully compressed and stored, we do the
> cgroup zswap charging with the total compressed bytes, and batch update
> the zswap_stored_pages atomic/zswpout event stats with folio_nr_pages()
> once, before returning from zswap_store.

Please consistently use "page" instead of "sub-page", and add "()" to
the end of function names.

>
> The patch adds a new zswap_pool_get() function that is called in the
> sub-page level "zswap_store_page()" function.

"that is called by zswap_store_page(), which handles compressing and
storing each page in the folio."

>
> If an error is encountered during the store of any page in the folio,
> all pages in that folio currently stored in zswap will be invalidated.
> Thus, a folio is either entirely stored in zswap, or entirely not stored
> in zswap.
>
> This patch forms the basis for building compress batching of pages in a
> large folio in zswap_store by compressing up to say, 8 pages of the folio
> in parallel in hardware using the Intel In-Memory Analytics Accelerator
> (Intel IAA).

This patch also has its own merits, it batches some operations, and
more importantly enables swapping out large folios to zswap without
splitting them.

>
> This change reuses and adapts the functionality in Ryan Roberts' RFC
> patch [1]:
>
>   "[RFC,v1] mm: zswap: Store large folios without splitting"
>
>   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> Also, addressed some of the RFC comments from the discussion in [1].
>
> Co-developed-by: Ryan Roberts
> Signed-off-by:
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  mm/zswap.c | 227 ++++++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 165 insertions(+), 62 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 43e4e216db41..b8395ddf7b7c 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -411,6 +411,16 @@ static int __must_check zswap_pool_tryget(struct zswap_pool *pool)
>         return percpu_ref_tryget(&pool->ref);
>  }
>
> +/*
> + * Note: zswap_pool_get() should only be called after zswap_pool_tryget()
> + * returns success. zswap_pool_tryget() returns success only if the "pool" is
> + * non-NULL and the "&pool->ref" is non-0.

Just use the same statement used by css_get():

"The caller must already have a reference."

> + */
> +static void zswap_pool_get(struct zswap_pool *pool)
> +{
> +       percpu_ref_get(&pool->ref);
> +}
> +
>  static void zswap_pool_put(struct zswap_pool *pool)
>  {
>         percpu_ref_put(&pool->ref);
> @@ -1402,38 +1412,35 @@ static void shrink_worker(struct work_struct *w)
>  /*********************************
>  * main API
>  **********************************/
> -bool zswap_store(struct folio *folio)
> +
> +/*
> + * Stores the page at specified "index" in a folio.
> + *
> + * @folio: The folio to store in zswap.
> + * @index: Index into the page in the folio that this function will store.
> + * @objcg: The folio's objcg.
> + * @pool:  The zswap_pool to store the compressed data for the page.
> + *         The caller should have obtained a reference to a valid
> + *         zswap_pool by calling zswap_pool_tryget(), to pass as this
> + *         argument.
> + * @compressed_bytes: The compressed entry->length value is added
> + *                    to this, so that the caller can get the total
> + *                    compressed lengths of all sub-pages in a folio.
> + */
> +static bool zswap_store_page(struct folio *folio, long index,
> +                            struct obj_cgroup *objcg,
> +                            struct zswap_pool *pool,
> +                            size_t *compressed_bytes)

Would it be clearer to just return the compressed size, and return -1
upon failure?

>  {
> +       struct page *page = folio_page(folio, index);
>         swp_entry_t swp = folio->swap;
> -       pgoff_t offset = swp_offset(swp);
>         struct xarray *tree = swap_zswap_tree(swp);
> +       pgoff_t offset = swp_offset(swp) + index;
>         struct zswap_entry *entry, *old;
> -       struct obj_cgroup *objcg = NULL;
> -       struct mem_cgroup *memcg = NULL;
> -
> -       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> -       VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> +       int type = swp_type(swp);

Why do we need type? We use it when initializing entry->swpentry to
reconstruct the swp_entry_t we already have.

>
> -       /* Large folios aren't supported */
> -       if (folio_test_large(folio))
> -               return false;
> -
> -       if (!zswap_enabled)
> -               goto check_old;
> -
> -       /* Check cgroup limits */
> -       objcg = get_obj_cgroup_from_folio(folio);
> -       if (objcg && !obj_cgroup_may_zswap(objcg)) {
> -               memcg = get_mem_cgroup_from_objcg(objcg);
> -               if (shrink_memcg(memcg)) {
> -                       mem_cgroup_put(memcg);
> -                       goto reject;
> -               }
> -               mem_cgroup_put(memcg);
> -       }
> -
> -       if (zswap_check_limits())
> -               goto reject;
> +       if (objcg)
> +               obj_cgroup_get(objcg);
>
>         /* allocate entry */
>         entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> @@ -1442,24 +1449,21 @@ bool zswap_store(struct folio *folio)
>                 goto reject;
>         }
>
> -       /* if entry is successfully added, it keeps the reference */
> -       entry->pool = zswap_pool_current_get();
> -       if (!entry->pool)
> -               goto freepage;
> +       /*
> +        * We get here only after the call to zswap_pool_tryget() in the
> +        * caller, zswap_store(), has returned success. Hence it is safe
> +        * to call zswap_pool_get().
> +        *
> +        * if entry is successfully added, it keeps the reference
> +        */
> +       zswap_pool_get(pool);

I would move this right above obj_cgroup_get() and have a single
concise comment for both, e.g.:

/* zswap_store() already holds a ref on 'pool' and 'objcg' */

>
> -       if (objcg) {
> -               memcg = get_mem_cgroup_from_objcg(objcg);
> -               if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> -                       mem_cgroup_put(memcg);
> -                       goto put_pool;
> -               }
> -               mem_cgroup_put(memcg);
> -       }
> +       entry->pool = pool;
>
> -       if (!zswap_compress(&folio->page, entry))
> +       if (!zswap_compress(page, entry))
>                 goto put_pool;
>
> -       entry->swpentry = swp;
> +       entry->swpentry = swp_entry(type, offset);
>         entry->objcg = objcg;
>         entry->referenced = true;
>
> @@ -1480,11 +1484,6 @@ bool zswap_store(struct folio *folio)
>         if (old)
>                 zswap_entry_free(old);
>
> -       if (objcg) {
> -               obj_cgroup_charge_zswap(objcg, entry->length);
> -               count_objcg_event(objcg, ZSWPOUT);
> -       }
> -
>         /*
>          * We finish initializing the entry while it's already in xarray.
>          * This is safe because:
> @@ -1496,36 +1495,140 @@ bool zswap_store(struct folio *folio)
>          *    an incoherent entry.
>          */
>         if (entry->length) {
> +               *compressed_bytes += entry->length;
>                 INIT_LIST_HEAD(&entry->lru);
>                 zswap_lru_add(&zswap_list_lru, entry);
>         }
>
> -       /* update stats */
> -       atomic_long_inc(&zswap_stored_pages);
> -       count_vm_event(ZSWPOUT);
> -
>         return true;
>
>  store_failed:
>         zpool_free(entry->pool->zpool, entry->handle);
>  put_pool:
> -       zswap_pool_put(entry->pool);
> -freepage:
> +       zswap_pool_put(pool);
>         zswap_entry_cache_free(entry);
>  reject:

Please rename this to put_objcg, we are no longer "rejecting" here.

>         obj_cgroup_put(objcg);
> -       if (zswap_pool_reached_full)
> -               queue_work(shrink_wq, &zswap_shrink_work);
> -check_old:
> +       return false;
> +}
> +
> +bool zswap_store(struct folio *folio)
> +{
> +       long nr_pages = folio_nr_pages(folio);
> +       swp_entry_t swp = folio->swap;
> +       struct xarray *tree = swap_zswap_tree(swp);
> +       pgoff_t offset = swp_offset(swp);
> +       struct obj_cgroup *objcg = NULL;
> +       struct mem_cgroup *memcg = NULL;
> +       struct zswap_pool *pool;
> +       size_t compressed_bytes = 0;

Why size_t? entry->length is int.

> +       bool ret = false;
> +       long index;
> +
> +       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> +       VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> +
> +       if (!zswap_enabled)
> +               goto reject;
> +
>         /*
> -        * If the zswap store fails or zswap is disabled, we must invalidate the
> -        * possibly stale entry which was previously stored at this offset.
> -        * Otherwise, writeback could overwrite the new data in the swapfile.
> +        * Check cgroup zswap limits:
> +        *
> +        * The cgroup zswap limit check is done once at the beginning of
> +        * zswap_store(). The cgroup charging is done once, at the end
> +        * of a successful folio store. What this means is, if the cgroup
> +        * was within the zswap_max limit at the beginning of a large folio
> +        * store, it could go over the limit by at most (HPAGE_PMD_NR - 1)
> +        * pages.

Add ".. due to this store", as it can be much more over the limit when
stores are racing. Also, this comment can be slightly generalized and
the comment above zswap_check_limits() omitted.

>          */
> -       entry = xa_erase(tree, offset);
> -       if (entry)
> -               zswap_entry_free(entry);
> -       return false;
> +       objcg = get_obj_cgroup_from_folio(folio);
> +       if (objcg && !obj_cgroup_may_zswap(objcg)) {
> +               memcg = get_mem_cgroup_from_objcg(objcg);
> +               if (shrink_memcg(memcg)) {
> +                       mem_cgroup_put(memcg);
> +                       goto put_objcg;
> +               }
> +               mem_cgroup_put(memcg);
> +       }
> +
> +       /*
> +        * Check zpool utilization against zswap limits:
> +        *
> +        * The zswap zpool utilization is also checked against the limits
> +        * just once, at the start of zswap_store(). If the check passes,
> +        * any breaches of the limits set by zswap_max_pages() or
> +        * zswap_accept_thr_pages() that may happen while storing this
> +        * folio, will only be detected during the next call to
> +        * zswap_store() by any process.
> +        */
> +       if (zswap_check_limits())
> +               goto put_objcg;
> +
> +       pool = zswap_pool_current_get();
> +       if (!pool)
> +               goto put_objcg;
> +
> +       if (objcg) {
> +               memcg = get_mem_cgroup_from_objcg(objcg);
> +               if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> +                       mem_cgroup_put(memcg);
> +                       goto put_pool;
> +               }
> +               mem_cgroup_put(memcg);
> +       }
> +
> +       /*
> +        * Store each page of the folio as a separate entry. If we fail to
> +        * store a page, unwind by deleting all the pages for this folio
> +        * currently in zswap.
> +        */
> +       for (index = 0; index < nr_pages; ++index) {
> +               if (!zswap_store_page(folio, index, objcg, pool, &compressed_bytes))
> +                       goto put_pool;
> +       }
> +
> +       /*
> +        * All pages in the folio have been successfully stored.
> +        * Batch update the cgroup zswap charging, zswap_stored_page atomic,
> +        * and ZSWPOUT event stats.
> +        */

This comment seems unnecessary.

> +       if (objcg) {
> +               obj_cgroup_charge_zswap(objcg, compressed_bytes);
> +               count_objcg_events(objcg, ZSWPOUT, nr_pages);
> +       }
> +
> +       /* update stats */

Please delete this comment too.

> +       atomic_long_add(nr_pages, &zswap_stored_pages);
> +       count_vm_events(ZSWPOUT, nr_pages);
> +
> +       ret = true;
> +
> +put_pool:
> +       zswap_pool_put(pool);
> +put_objcg:
> +       obj_cgroup_put(objcg);
> +reject:
> +       /*
> +        * If the zswap store fails or zswap is disabled, we must invalidate
> +        * the possibly stale entries which were previously stored at the
> +        * offsets corresponding to each page of the folio. Otherwise,
> +        * writeback could overwrite the new data in the swapfile.
> +        */
> +       if (!ret) {
> +               struct zswap_entry *entry;
> +               long i;
> +
> +               for (i = 0; i < nr_pages; ++i) {

Just reuse 'index' here.

> +                       entry = xa_erase(tree, offset + i);
> +                       if (entry)
> +                               zswap_entry_free(entry);

I think we need a comment in zswap_store_page() that we shouldn't have
any possibility of failure after the entry is added in the tree.
Otherwise we may duplicate the cleanup work here (e.g. putting refs).
This is already the case in zswap_store(), but it's now more subtle
and easier to break when the code is split and we have two cleanup
paths.

> +               }
> +
> +               if (zswap_pool_reached_full)
> +                       queue_work(shrink_wq, &zswap_shrink_work);

We are now doing this check even when zswap is disabled. I think this
is a problem.

If zswap gets disabled while 'zswap_pool_reached_full' is true, we
will queue 'zswap_shrink_work' every time zswap_store() is called in
the swapout path, but 'zswap_pool_reached_full' will never become
false again become zswap_check_limits() will never be called.

I think we need to maintain two separate "reject" and "check_old"
labels, and avoid this check when zswap is disabled.

> +       }
> +
> +       return ret;
>  }
>
>  bool zswap_load(struct folio *folio)
> --
> 2.27.0
>

Chengming Zhou Sept. 28, 2024, 6:05 a.m. UTC | #2

On 2024/9/28 10:16, Kanchana P Sridhar wrote:
> zswap_store() will store large folios by compressing them page by page.
> 
> This patch provides a sequential implementation of storing a large folio
> in zswap_store() by iterating through each page in the folio to compress
> and store it in the zswap zpool.
> 
> Towards this goal, zswap_compress() is modified to take a page instead
> of a folio as input.
> 
> Each page's swap offset is stored as a separate zswap entry.
> 
> We check the cgroup zswap limit and the zpool utilization against
> the zswap max/accept_threshold limits once, at the beginning of
> zswap_store. We also obtain a percpu_ref_tryget() reference to the
> current zswap_pool at the start of zswap_store to prevent it from
> being deleted while the folio is being stored.
> 
> If these one-time checks pass, we compress the sub-pages of the folio,
> while maintaining a running count of compressed bytes for all the folio's
> sub-pages. If all pages are successfully compressed and stored, we do the
> cgroup zswap charging with the total compressed bytes, and batch update
> the zswap_stored_pages atomic/zswpout event stats with folio_nr_pages()
> once, before returning from zswap_store.
> 
> The patch adds a new zswap_pool_get() function that is called in the
> sub-page level "zswap_store_page()" function.
> 
> If an error is encountered during the store of any page in the folio,
> all pages in that folio currently stored in zswap will be invalidated.
> Thus, a folio is either entirely stored in zswap, or entirely not stored
> in zswap.
> 
> This patch forms the basis for building compress batching of pages in a
> large folio in zswap_store by compressing up to say, 8 pages of the folio
> in parallel in hardware using the Intel In-Memory Analytics Accelerator
> (Intel IAA).
> 
> This change reuses and adapts the functionality in Ryan Roberts' RFC
> patch [1]:
> 
>    "[RFC,v1] mm: zswap: Store large folios without splitting"
> 
>    [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> 
> Also, addressed some of the RFC comments from the discussion in [1].
> 
> Co-developed-by: Ryan Roberts
> Signed-off-by:

This seems not right.

> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>   mm/zswap.c | 227 ++++++++++++++++++++++++++++++++++++++---------------
>   1 file changed, 165 insertions(+), 62 deletions(-)
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 43e4e216db41..b8395ddf7b7c 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -411,6 +411,16 @@ static int __must_check zswap_pool_tryget(struct zswap_pool *pool)
>   	return percpu_ref_tryget(&pool->ref);
>   }
>   
> +/*
> + * Note: zswap_pool_get() should only be called after zswap_pool_tryget()
> + * returns success. zswap_pool_tryget() returns success only if the "pool" is
> + * non-NULL and the "&pool->ref" is non-0.
> + */
> +static void zswap_pool_get(struct zswap_pool *pool)
> +{
> +	percpu_ref_get(&pool->ref);
> +}

percpu_ref_tryget_many() could be considered? But this looks ok too.

Thanks.

> +
>   static void zswap_pool_put(struct zswap_pool *pool)
>   {
>   	percpu_ref_put(&pool->ref);
> @@ -1402,38 +1412,35 @@ static void shrink_worker(struct work_struct *w)
>   /*********************************
>   * main API
>   **********************************/
> -bool zswap_store(struct folio *folio)
> +
> +/*
> + * Stores the page at specified "index" in a folio.
> + *
> + * @folio: The folio to store in zswap.
> + * @index: Index into the page in the folio that this function will store.
> + * @objcg: The folio's objcg.
> + * @pool:  The zswap_pool to store the compressed data for the page.
> + *         The caller should have obtained a reference to a valid
> + *         zswap_pool by calling zswap_pool_tryget(), to pass as this
> + *         argument.
> + * @compressed_bytes: The compressed entry->length value is added
> + *                    to this, so that the caller can get the total
> + *                    compressed lengths of all sub-pages in a folio.
> + */
> +static bool zswap_store_page(struct folio *folio, long index,
> +			     struct obj_cgroup *objcg,
> +			     struct zswap_pool *pool,
> +			     size_t *compressed_bytes)
>   {
> +	struct page *page = folio_page(folio, index);
>   	swp_entry_t swp = folio->swap;
> -	pgoff_t offset = swp_offset(swp);
>   	struct xarray *tree = swap_zswap_tree(swp);
> +	pgoff_t offset = swp_offset(swp) + index;
>   	struct zswap_entry *entry, *old;
> -	struct obj_cgroup *objcg = NULL;
> -	struct mem_cgroup *memcg = NULL;
> -
> -	VM_WARN_ON_ONCE(!folio_test_locked(folio));
> -	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> +	int type = swp_type(swp);
>   
> -	/* Large folios aren't supported */
> -	if (folio_test_large(folio))
> -		return false;
> -
> -	if (!zswap_enabled)
> -		goto check_old;
> -
> -	/* Check cgroup limits */
> -	objcg = get_obj_cgroup_from_folio(folio);
> -	if (objcg && !obj_cgroup_may_zswap(objcg)) {
> -		memcg = get_mem_cgroup_from_objcg(objcg);
> -		if (shrink_memcg(memcg)) {
> -			mem_cgroup_put(memcg);
> -			goto reject;
> -		}
> -		mem_cgroup_put(memcg);
> -	}
> -
> -	if (zswap_check_limits())
> -		goto reject;
> +	if (objcg)
> +		obj_cgroup_get(objcg);
>   
>   	/* allocate entry */
>   	entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> @@ -1442,24 +1449,21 @@ bool zswap_store(struct folio *folio)
>   		goto reject;
>   	}
>   
> -	/* if entry is successfully added, it keeps the reference */
> -	entry->pool = zswap_pool_current_get();
> -	if (!entry->pool)
> -		goto freepage;
> +	/*
> +	 * We get here only after the call to zswap_pool_tryget() in the
> +	 * caller, zswap_store(), has returned success. Hence it is safe
> +	 * to call zswap_pool_get().
> +	 *
> +	 * if entry is successfully added, it keeps the reference
> +	 */
> +	zswap_pool_get(pool);
>   
> -	if (objcg) {
> -		memcg = get_mem_cgroup_from_objcg(objcg);
> -		if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> -			mem_cgroup_put(memcg);
> -			goto put_pool;
> -		}
> -		mem_cgroup_put(memcg);
> -	}
> +	entry->pool = pool;
>   
> -	if (!zswap_compress(&folio->page, entry))
> +	if (!zswap_compress(page, entry))
>   		goto put_pool;
>   
> -	entry->swpentry = swp;
> +	entry->swpentry = swp_entry(type, offset);
>   	entry->objcg = objcg;
>   	entry->referenced = true;
>   
> @@ -1480,11 +1484,6 @@ bool zswap_store(struct folio *folio)
>   	if (old)
>   		zswap_entry_free(old);
>   
> -	if (objcg) {
> -		obj_cgroup_charge_zswap(objcg, entry->length);
> -		count_objcg_event(objcg, ZSWPOUT);
> -	}
> -
>   	/*
>   	 * We finish initializing the entry while it's already in xarray.
>   	 * This is safe because:
> @@ -1496,36 +1495,140 @@ bool zswap_store(struct folio *folio)
>   	 *    an incoherent entry.
>   	 */
>   	if (entry->length) {
> +		*compressed_bytes += entry->length;
>   		INIT_LIST_HEAD(&entry->lru);
>   		zswap_lru_add(&zswap_list_lru, entry);
>   	}
>   
> -	/* update stats */
> -	atomic_long_inc(&zswap_stored_pages);
> -	count_vm_event(ZSWPOUT);
> -
>   	return true;
>   
>   store_failed:
>   	zpool_free(entry->pool->zpool, entry->handle);
>   put_pool:
> -	zswap_pool_put(entry->pool);
> -freepage:
> +	zswap_pool_put(pool);
>   	zswap_entry_cache_free(entry);
>   reject:
>   	obj_cgroup_put(objcg);
> -	if (zswap_pool_reached_full)
> -		queue_work(shrink_wq, &zswap_shrink_work);
> -check_old:
> +	return false;
> +}
> +
> +bool zswap_store(struct folio *folio)
> +{
> +	long nr_pages = folio_nr_pages(folio);
> +	swp_entry_t swp = folio->swap;
> +	struct xarray *tree = swap_zswap_tree(swp);
> +	pgoff_t offset = swp_offset(swp);
> +	struct obj_cgroup *objcg = NULL;
> +	struct mem_cgroup *memcg = NULL;
> +	struct zswap_pool *pool;
> +	size_t compressed_bytes = 0;
> +	bool ret = false;
> +	long index;
> +
> +	VM_WARN_ON_ONCE(!folio_test_locked(folio));
> +	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> +
> +	if (!zswap_enabled)
> +		goto reject;
> +
>   	/*
> -	 * If the zswap store fails or zswap is disabled, we must invalidate the
> -	 * possibly stale entry which was previously stored at this offset.
> -	 * Otherwise, writeback could overwrite the new data in the swapfile.
> +	 * Check cgroup zswap limits:
> +	 *
> +	 * The cgroup zswap limit check is done once at the beginning of
> +	 * zswap_store(). The cgroup charging is done once, at the end
> +	 * of a successful folio store. What this means is, if the cgroup
> +	 * was within the zswap_max limit at the beginning of a large folio
> +	 * store, it could go over the limit by at most (HPAGE_PMD_NR - 1)
> +	 * pages.
>   	 */
> -	entry = xa_erase(tree, offset);
> -	if (entry)
> -		zswap_entry_free(entry);
> -	return false;
> +	objcg = get_obj_cgroup_from_folio(folio);
> +	if (objcg && !obj_cgroup_may_zswap(objcg)) {
> +		memcg = get_mem_cgroup_from_objcg(objcg);
> +		if (shrink_memcg(memcg)) {
> +			mem_cgroup_put(memcg);
> +			goto put_objcg;
> +		}
> +		mem_cgroup_put(memcg);
> +	}
> +
> +	/*
> +	 * Check zpool utilization against zswap limits:
> +	 *
> +	 * The zswap zpool utilization is also checked against the limits
> +	 * just once, at the start of zswap_store(). If the check passes,
> +	 * any breaches of the limits set by zswap_max_pages() or
> +	 * zswap_accept_thr_pages() that may happen while storing this
> +	 * folio, will only be detected during the next call to
> +	 * zswap_store() by any process.
> +	 */
> +	if (zswap_check_limits())
> +		goto put_objcg;
> +
> +	pool = zswap_pool_current_get();
> +	if (!pool)
> +		goto put_objcg;
> +
> +	if (objcg) {
> +		memcg = get_mem_cgroup_from_objcg(objcg);
> +		if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> +			mem_cgroup_put(memcg);
> +			goto put_pool;
> +		}
> +		mem_cgroup_put(memcg);
> +	}
> +
> +	/*
> +	 * Store each page of the folio as a separate entry. If we fail to
> +	 * store a page, unwind by deleting all the pages for this folio
> +	 * currently in zswap.
> +	 */
> +	for (index = 0; index < nr_pages; ++index) {
> +		if (!zswap_store_page(folio, index, objcg, pool, &compressed_bytes))
> +			goto put_pool;
> +	}
> +
> +	/*
> +	 * All pages in the folio have been successfully stored.
> +	 * Batch update the cgroup zswap charging, zswap_stored_page atomic,
> +	 * and ZSWPOUT event stats.
> +	 */
> +	if (objcg) {
> +		obj_cgroup_charge_zswap(objcg, compressed_bytes);
> +		count_objcg_events(objcg, ZSWPOUT, nr_pages);
> +	}
> +
> +	/* update stats */
> +	atomic_long_add(nr_pages, &zswap_stored_pages);
> +	count_vm_events(ZSWPOUT, nr_pages);
> +
> +	ret = true;
> +
> +put_pool:
> +	zswap_pool_put(pool);
> +put_objcg:
> +	obj_cgroup_put(objcg);
> +reject:
> +	/*
> +	 * If the zswap store fails or zswap is disabled, we must invalidate
> +	 * the possibly stale entries which were previously stored at the
> +	 * offsets corresponding to each page of the folio. Otherwise,
> +	 * writeback could overwrite the new data in the swapfile.
> +	 */
> +	if (!ret) {
> +		struct zswap_entry *entry;
> +		long i;
> +
> +		for (i = 0; i < nr_pages; ++i) {
> +			entry = xa_erase(tree, offset + i);
> +			if (entry)
> +				zswap_entry_free(entry);
> +		}
> +
> +		if (zswap_pool_reached_full)
> +			queue_work(shrink_wq, &zswap_shrink_work);
> +	}
> +
> +	return ret;
>   }
>   
>   bool zswap_load(struct folio *folio)

Johannes Weiner Sept. 28, 2024, 2:15 p.m. UTC | #3

On Fri, Sep 27, 2024 at 08:42:16PM -0700, Yosry Ahmed wrote:
> On Fri, Sep 27, 2024 at 7:16 PM Kanchana P Sridhar
> >  {
> > +       struct page *page = folio_page(folio, index);
> >         swp_entry_t swp = folio->swap;
> > -       pgoff_t offset = swp_offset(swp);
> >         struct xarray *tree = swap_zswap_tree(swp);
> > +       pgoff_t offset = swp_offset(swp) + index;
> >         struct zswap_entry *entry, *old;
> > -       struct obj_cgroup *objcg = NULL;
> > -       struct mem_cgroup *memcg = NULL;
> > -
> > -       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > -       VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > +       int type = swp_type(swp);
> 
> Why do we need type? We use it when initializing entry->swpentry to
> reconstruct the swp_entry_t we already have.

It's not the same entry. folio->swap points to the head entry, this
function has to store swap entries with the offsets of each subpage.

Given the name of this function, it might be better to actually pass a
page pointer to it; do the folio_page() inside zswap_store().

Then do

		entry->swpentry = page_swap_entry(page);

below.

> >         obj_cgroup_put(objcg);
> > -       if (zswap_pool_reached_full)
> > -               queue_work(shrink_wq, &zswap_shrink_work);
> > -check_old:
> > +       return false;
> > +}
> > +
> > +bool zswap_store(struct folio *folio)
> > +{
> > +       long nr_pages = folio_nr_pages(folio);
> > +       swp_entry_t swp = folio->swap;
> > +       struct xarray *tree = swap_zswap_tree(swp);
> > +       pgoff_t offset = swp_offset(swp);
> > +       struct obj_cgroup *objcg = NULL;
> > +       struct mem_cgroup *memcg = NULL;
> > +       struct zswap_pool *pool;
> > +       size_t compressed_bytes = 0;
> 
> Why size_t? entry->length is int.

In light of Willy's comment, I think size_t is a good idea.

Yosry Ahmed Sept. 28, 2024, 6:11 p.m. UTC | #4

On Sat, Sep 28, 2024 at 7:15 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Sep 27, 2024 at 08:42:16PM -0700, Yosry Ahmed wrote:
> > On Fri, Sep 27, 2024 at 7:16 PM Kanchana P Sridhar
> > >  {
> > > +       struct page *page = folio_page(folio, index);
> > >         swp_entry_t swp = folio->swap;
> > > -       pgoff_t offset = swp_offset(swp);
> > >         struct xarray *tree = swap_zswap_tree(swp);
> > > +       pgoff_t offset = swp_offset(swp) + index;
> > >         struct zswap_entry *entry, *old;
> > > -       struct obj_cgroup *objcg = NULL;
> > > -       struct mem_cgroup *memcg = NULL;
> > > -
> > > -       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > > -       VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > > +       int type = swp_type(swp);
> >
> > Why do we need type? We use it when initializing entry->swpentry to
> > reconstruct the swp_entry_t we already have.
>
> It's not the same entry. folio->swap points to the head entry, this
> function has to store swap entries with the offsets of each subpage.

Duh, yeah, thanks.

>
> Given the name of this function, it might be better to actually pass a
> page pointer to it; do the folio_page() inside zswap_store().
>
> Then do
>
>                 entry->swpentry = page_swap_entry(page);
>
> below.

That is indeed clearer.

Although this will be adding yet another caller of page_swap_entry()
that already has the folio, yet it calls page_swap_entry() for each
page in the folio, which calls page_folio() inside.

I wonder if we should add (or replace page_swap_entry()) with a
folio_swap_entry(folio, index) helper. This can also be done as a
follow up.

>
> > >         obj_cgroup_put(objcg);
> > > -       if (zswap_pool_reached_full)
> > > -               queue_work(shrink_wq, &zswap_shrink_work);
> > > -check_old:
> > > +       return false;
> > > +}
> > > +
> > > +bool zswap_store(struct folio *folio)
> > > +{
> > > +       long nr_pages = folio_nr_pages(folio);
> > > +       swp_entry_t swp = folio->swap;
> > > +       struct xarray *tree = swap_zswap_tree(swp);
> > > +       pgoff_t offset = swp_offset(swp);
> > > +       struct obj_cgroup *objcg = NULL;
> > > +       struct mem_cgroup *memcg = NULL;
> > > +       struct zswap_pool *pool;
> > > +       size_t compressed_bytes = 0;
> >
> > Why size_t? entry->length is int.
>
> In light of Willy's comment, I think size_t is a good idea.

Agreed.

Sridhar, Kanchana P Sept. 29, 2024, 9:15 p.m. UTC | #5

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Saturday, September 28, 2024 11:11 AM
> To: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v8 6/8] mm: zswap: Support large folios in zswap_store().
> 
> On Sat, Sep 28, 2024 at 7:15 AM Johannes Weiner <hannes@cmpxchg.org>
> wrote:
> >
> > On Fri, Sep 27, 2024 at 08:42:16PM -0700, Yosry Ahmed wrote:
> > > On Fri, Sep 27, 2024 at 7:16 PM Kanchana P Sridhar
> > > >  {
> > > > +       struct page *page = folio_page(folio, index);
> > > >         swp_entry_t swp = folio->swap;
> > > > -       pgoff_t offset = swp_offset(swp);
> > > >         struct xarray *tree = swap_zswap_tree(swp);
> > > > +       pgoff_t offset = swp_offset(swp) + index;
> > > >         struct zswap_entry *entry, *old;
> > > > -       struct obj_cgroup *objcg = NULL;
> > > > -       struct mem_cgroup *memcg = NULL;
> > > > -
> > > > -       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > > > -       VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > > > +       int type = swp_type(swp);
> > >
> > > Why do we need type? We use it when initializing entry->swpentry to
> > > reconstruct the swp_entry_t we already have.
> >
> > It's not the same entry. folio->swap points to the head entry, this
> > function has to store swap entries with the offsets of each subpage.
> 
> Duh, yeah, thanks.
> 
> >
> > Given the name of this function, it might be better to actually pass a
> > page pointer to it; do the folio_page() inside zswap_store().
> >
> > Then do
> >
> >                 entry->swpentry = page_swap_entry(page);
> >
> > below.
> 
> That is indeed clearer.
> 
> Although this will be adding yet another caller of page_swap_entry()
> that already has the folio, yet it calls page_swap_entry() for each
> page in the folio, which calls page_folio() inside.
> 
> I wonder if we should add (or replace page_swap_entry()) with a
> folio_swap_entry(folio, index) helper. This can also be done as a
> follow up.

Thanks Johannes and Yosry for these comments. I was thinking about
this some more. In its current form, zswap_store_page() is called in the context
of the folio by passing in a [folio, index]. This implies a key assumption about
the existing zswap_store() large folios functionality, i.e., we do the per-page
store for the page at a "index * PAGE_SIZE" within the folio, and not for any
arbitrary page. Further, we need the folio for folio_nid(); but this can also be
computed from the page. Another reason why I thought the existing signature
might be preferable is because it seems like it enables getting the entry's
swp_entry_t with fewer computes. Could calling page_swap_entry() add
more computes; which if it is the case, could potentially add up (say 512 times)

I would appreciate your thoughts on whether these are valid considerations,
and can proceed accordingly.

> 
> >
> > > >         obj_cgroup_put(objcg);
> > > > -       if (zswap_pool_reached_full)
> > > > -               queue_work(shrink_wq, &zswap_shrink_work);
> > > > -check_old:
> > > > +       return false;
> > > > +}
> > > > +
> > > > +bool zswap_store(struct folio *folio)
> > > > +{
> > > > +       long nr_pages = folio_nr_pages(folio);
> > > > +       swp_entry_t swp = folio->swap;
> > > > +       struct xarray *tree = swap_zswap_tree(swp);
> > > > +       pgoff_t offset = swp_offset(swp);
> > > > +       struct obj_cgroup *objcg = NULL;
> > > > +       struct mem_cgroup *memcg = NULL;
> > > > +       struct zswap_pool *pool;
> > > > +       size_t compressed_bytes = 0;
> > >
> > > Why size_t? entry->length is int.
> >
> > In light of Willy's comment, I think size_t is a good idea.
> 
> Agreed.

Thanks Yosry, Matthew and Johannes for the resolution on this!

Thanks,
Kanchana

Sridhar, Kanchana P Sept. 29, 2024, 9:24 p.m. UTC | #6

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Friday, September 27, 2024 8:42 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v8 6/8] mm: zswap: Support large folios in zswap_store().
> 
> On Fri, Sep 27, 2024 at 7:16 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > zswap_store() will store large folios by compressing them page by page.
> >
> > This patch provides a sequential implementation of storing a large folio
> > in zswap_store() by iterating through each page in the folio to compress
> > and store it in the zswap zpool.
> >
> > Towards this goal, zswap_compress() is modified to take a page instead
> > of a folio as input.
> 
> It is already modified at this point, it's not part of this patch.
> 
> >
> > Each page's swap offset is stored as a separate zswap entry.
> 
> I think "Each page is compressed individually and stored as a separate
> zswap entry" is clearer. We do not store the offsets.
> 
> >
> > We check the cgroup zswap limit and the zpool utilization against
> > the zswap max/accept_threshold limits once, at the beginning of
> > zswap_store. We also obtain a percpu_ref_tryget() reference to the
> > current zswap_pool at the start of zswap_store to prevent it from
> > being deleted while the folio is being stored.
> 
> This can be clarified further:
> 
> We check the global and per-cgroup limits once at the beginning, and
> only check that the limit is not reached yet. This is racy and
> inaccurate, but it should be sufficient for now. We also obtain
> initial references to the relevant objcg and pool to guarantee that
> subsequent references can be acquired by zswap_store_page().
> 
> >
> > If these one-time checks pass, we compress the sub-pages of the folio,
> > while maintaining a running count of compressed bytes for all the folio's
> > sub-pages. If all pages are successfully compressed and stored, we do the
> > cgroup zswap charging with the total compressed bytes, and batch update
> > the zswap_stored_pages atomic/zswpout event stats with folio_nr_pages()
> > once, before returning from zswap_store.
> 
> Please consistently use "page" instead of "sub-page", and add "()" to
> the end of function names.
> 
> >
> > The patch adds a new zswap_pool_get() function that is called in the
> > sub-page level "zswap_store_page()" function.
> 
> "that is called by zswap_store_page(), which handles compressing and
> storing each page in the folio."
> 
> >
> > If an error is encountered during the store of any page in the folio,
> > all pages in that folio currently stored in zswap will be invalidated.
> > Thus, a folio is either entirely stored in zswap, or entirely not stored
> > in zswap.
> >
> > This patch forms the basis for building compress batching of pages in a
> > large folio in zswap_store by compressing up to say, 8 pages of the folio
> > in parallel in hardware using the Intel In-Memory Analytics Accelerator
> > (Intel IAA).
> 
> This patch also has its own merits, it batches some operations, and
> more importantly enables swapping out large folios to zswap without
> splitting them.

Thanks for the detailed commit log review comments! Sure, I will incorporate
all of these in v9. I will also modify the latency improvement calculation convention
as you've suggested in another comment on the cover letter. Will re-gather all
the "after" data once v9 is ready.

> 
> >
> > This change reuses and adapts the functionality in Ryan Roberts' RFC
> > patch [1]:
> >
> >   "[RFC,v1] mm: zswap: Store large folios without splitting"
> >
> >   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-
> ryan.roberts@arm.com/T/#u
> >
> > Also, addressed some of the RFC comments from the discussion in [1].
> >
> > Co-developed-by: Ryan Roberts
> > Signed-off-by:
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  mm/zswap.c | 227 ++++++++++++++++++++++++++++++++++++++----------
> -----
> >  1 file changed, 165 insertions(+), 62 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 43e4e216db41..b8395ddf7b7c 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -411,6 +411,16 @@ static int __must_check zswap_pool_tryget(struct
> zswap_pool *pool)
> >         return percpu_ref_tryget(&pool->ref);
> >  }
> >
> > +/*
> > + * Note: zswap_pool_get() should only be called after zswap_pool_tryget()
> > + * returns success. zswap_pool_tryget() returns success only if the "pool" is
> > + * non-NULL and the "&pool->ref" is non-0.
> 
> Just use the same statement used by css_get():
> 
> "The caller must already have a reference."
> 
> > + */
> > +static void zswap_pool_get(struct zswap_pool *pool)
> > +{
> > +       percpu_ref_get(&pool->ref);
> > +}
> > +
> >  static void zswap_pool_put(struct zswap_pool *pool)
> >  {
> >         percpu_ref_put(&pool->ref);
> > @@ -1402,38 +1412,35 @@ static void shrink_worker(struct work_struct
> *w)
> >  /*********************************
> >  * main API
> >  **********************************/
> > -bool zswap_store(struct folio *folio)
> > +
> > +/*
> > + * Stores the page at specified "index" in a folio.
> > + *
> > + * @folio: The folio to store in zswap.
> > + * @index: Index into the page in the folio that this function will store.
> > + * @objcg: The folio's objcg.
> > + * @pool:  The zswap_pool to store the compressed data for the page.
> > + *         The caller should have obtained a reference to a valid
> > + *         zswap_pool by calling zswap_pool_tryget(), to pass as this
> > + *         argument.
> > + * @compressed_bytes: The compressed entry->length value is added
> > + *                    to this, so that the caller can get the total
> > + *                    compressed lengths of all sub-pages in a folio.
> > + */
> > +static bool zswap_store_page(struct folio *folio, long index,
> > +                            struct obj_cgroup *objcg,
> > +                            struct zswap_pool *pool,
> > +                            size_t *compressed_bytes)
> 
> Would it be clearer to just return the compressed size, and return -1
> upon failure?
> 
> >  {
> > +       struct page *page = folio_page(folio, index);
> >         swp_entry_t swp = folio->swap;
> > -       pgoff_t offset = swp_offset(swp);
> >         struct xarray *tree = swap_zswap_tree(swp);
> > +       pgoff_t offset = swp_offset(swp) + index;
> >         struct zswap_entry *entry, *old;
> > -       struct obj_cgroup *objcg = NULL;
> > -       struct mem_cgroup *memcg = NULL;
> > -
> > -       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > -       VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > +       int type = swp_type(swp);
> 
> Why do we need type? We use it when initializing entry->swpentry to
> reconstruct the swp_entry_t we already have.
> 
> >
> > -       /* Large folios aren't supported */
> > -       if (folio_test_large(folio))
> > -               return false;
> > -
> > -       if (!zswap_enabled)
> > -               goto check_old;
> > -
> > -       /* Check cgroup limits */
> > -       objcg = get_obj_cgroup_from_folio(folio);
> > -       if (objcg && !obj_cgroup_may_zswap(objcg)) {
> > -               memcg = get_mem_cgroup_from_objcg(objcg);
> > -               if (shrink_memcg(memcg)) {
> > -                       mem_cgroup_put(memcg);
> > -                       goto reject;
> > -               }
> > -               mem_cgroup_put(memcg);
> > -       }
> > -
> > -       if (zswap_check_limits())
> > -               goto reject;
> > +       if (objcg)
> > +               obj_cgroup_get(objcg);
> >
> >         /* allocate entry */
> >         entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> > @@ -1442,24 +1449,21 @@ bool zswap_store(struct folio *folio)
> >                 goto reject;
> >         }
> >
> > -       /* if entry is successfully added, it keeps the reference */
> > -       entry->pool = zswap_pool_current_get();
> > -       if (!entry->pool)
> > -               goto freepage;
> > +       /*
> > +        * We get here only after the call to zswap_pool_tryget() in the
> > +        * caller, zswap_store(), has returned success. Hence it is safe
> > +        * to call zswap_pool_get().
> > +        *
> > +        * if entry is successfully added, it keeps the reference
> > +        */
> > +       zswap_pool_get(pool);
> 
> I would move this right above obj_cgroup_get() and have a single
> concise comment for both, e.g.:
> 
> /* zswap_store() already holds a ref on 'pool' and 'objcg' */
> 
> >
> > -       if (objcg) {
> > -               memcg = get_mem_cgroup_from_objcg(objcg);
> > -               if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> > -                       mem_cgroup_put(memcg);
> > -                       goto put_pool;
> > -               }
> > -               mem_cgroup_put(memcg);
> > -       }
> > +       entry->pool = pool;
> >
> > -       if (!zswap_compress(&folio->page, entry))
> > +       if (!zswap_compress(page, entry))
> >                 goto put_pool;
> >
> > -       entry->swpentry = swp;
> > +       entry->swpentry = swp_entry(type, offset);
> >         entry->objcg = objcg;
> >         entry->referenced = true;
> >
> > @@ -1480,11 +1484,6 @@ bool zswap_store(struct folio *folio)
> >         if (old)
> >                 zswap_entry_free(old);
> >
> > -       if (objcg) {
> > -               obj_cgroup_charge_zswap(objcg, entry->length);
> > -               count_objcg_event(objcg, ZSWPOUT);
> > -       }
> > -
> >         /*
> >          * We finish initializing the entry while it's already in xarray.
> >          * This is safe because:
> > @@ -1496,36 +1495,140 @@ bool zswap_store(struct folio *folio)
> >          *    an incoherent entry.
> >          */
> >         if (entry->length) {
> > +               *compressed_bytes += entry->length;
> >                 INIT_LIST_HEAD(&entry->lru);
> >                 zswap_lru_add(&zswap_list_lru, entry);
> >         }
> >
> > -       /* update stats */
> > -       atomic_long_inc(&zswap_stored_pages);
> > -       count_vm_event(ZSWPOUT);
> > -
> >         return true;
> >
> >  store_failed:
> >         zpool_free(entry->pool->zpool, entry->handle);
> >  put_pool:
> > -       zswap_pool_put(entry->pool);
> > -freepage:
> > +       zswap_pool_put(pool);
> >         zswap_entry_cache_free(entry);
> >  reject:
> 
> Please rename this to put_objcg, we are no longer "rejecting" here.

I have reworked getting the refs on pool and objcg within zswap_store_page(),
as per your suggestion. With these modifications, the "reject" label is only for
the case where we return "false" from zswap_store_page(), so it seems more
appropriate? Anyway, let me know once v9 is posted, and I can make any
necessary changes.

> 
> >         obj_cgroup_put(objcg);
> > -       if (zswap_pool_reached_full)
> > -               queue_work(shrink_wq, &zswap_shrink_work);
> > -check_old:
> > +       return false;
> > +}
> > +
> > +bool zswap_store(struct folio *folio)
> > +{
> > +       long nr_pages = folio_nr_pages(folio);
> > +       swp_entry_t swp = folio->swap;
> > +       struct xarray *tree = swap_zswap_tree(swp);
> > +       pgoff_t offset = swp_offset(swp);
> > +       struct obj_cgroup *objcg = NULL;
> > +       struct mem_cgroup *memcg = NULL;
> > +       struct zswap_pool *pool;
> > +       size_t compressed_bytes = 0;
> 
> Why size_t? entry->length is int.
> 
> > +       bool ret = false;
> > +       long index;
> > +
> > +       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > +       VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > +
> > +       if (!zswap_enabled)
> > +               goto reject;
> > +
> >         /*
> > -        * If the zswap store fails or zswap is disabled, we must invalidate the
> > -        * possibly stale entry which was previously stored at this offset.
> > -        * Otherwise, writeback could overwrite the new data in the swapfile.
> > +        * Check cgroup zswap limits:
> > +        *
> > +        * The cgroup zswap limit check is done once at the beginning of
> > +        * zswap_store(). The cgroup charging is done once, at the end
> > +        * of a successful folio store. What this means is, if the cgroup
> > +        * was within the zswap_max limit at the beginning of a large folio
> > +        * store, it could go over the limit by at most (HPAGE_PMD_NR - 1)
> > +        * pages.
> 
> Add ".. due to this store", as it can be much more over the limit when
> stores are racing. Also, this comment can be slightly generalized and
> the comment above zswap_check_limits() omitted.
> 
> >          */
> > -       entry = xa_erase(tree, offset);
> > -       if (entry)
> > -               zswap_entry_free(entry);
> > -       return false;
> > +       objcg = get_obj_cgroup_from_folio(folio);
> > +       if (objcg && !obj_cgroup_may_zswap(objcg)) {
> > +               memcg = get_mem_cgroup_from_objcg(objcg);
> > +               if (shrink_memcg(memcg)) {
> > +                       mem_cgroup_put(memcg);
> > +                       goto put_objcg;
> > +               }
> > +               mem_cgroup_put(memcg);
> > +       }
> > +
> > +       /*
> > +        * Check zpool utilization against zswap limits:
> > +        *
> > +        * The zswap zpool utilization is also checked against the limits
> > +        * just once, at the start of zswap_store(). If the check passes,
> > +        * any breaches of the limits set by zswap_max_pages() or
> > +        * zswap_accept_thr_pages() that may happen while storing this
> > +        * folio, will only be detected during the next call to
> > +        * zswap_store() by any process.
> > +        */
> > +       if (zswap_check_limits())
> > +               goto put_objcg;
> > +
> > +       pool = zswap_pool_current_get();
> > +       if (!pool)
> > +               goto put_objcg;
> > +
> > +       if (objcg) {
> > +               memcg = get_mem_cgroup_from_objcg(objcg);
> > +               if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> > +                       mem_cgroup_put(memcg);
> > +                       goto put_pool;
> > +               }
> > +               mem_cgroup_put(memcg);
> > +       }
> > +
> > +       /*
> > +        * Store each page of the folio as a separate entry. If we fail to
> > +        * store a page, unwind by deleting all the pages for this folio
> > +        * currently in zswap.
> > +        */
> > +       for (index = 0; index < nr_pages; ++index) {
> > +               if (!zswap_store_page(folio, index, objcg, pool,
> &compressed_bytes))
> > +                       goto put_pool;
> > +       }
> > +
> > +       /*
> > +        * All pages in the folio have been successfully stored.
> > +        * Batch update the cgroup zswap charging, zswap_stored_page
> atomic,
> > +        * and ZSWPOUT event stats.
> > +        */
> 
> This comment seems unnecessary.
> 
> > +       if (objcg) {
> > +               obj_cgroup_charge_zswap(objcg, compressed_bytes);
> > +               count_objcg_events(objcg, ZSWPOUT, nr_pages);
> > +       }
> > +
> > +       /* update stats */
> 
> Please delete this comment too.
> 
> > +       atomic_long_add(nr_pages, &zswap_stored_pages);
> > +       count_vm_events(ZSWPOUT, nr_pages);
> > +
> > +       ret = true;
> > +
> > +put_pool:
> > +       zswap_pool_put(pool);
> > +put_objcg:
> > +       obj_cgroup_put(objcg);
> > +reject:
> > +       /*
> > +        * If the zswap store fails or zswap is disabled, we must invalidate
> > +        * the possibly stale entries which were previously stored at the
> > +        * offsets corresponding to each page of the folio. Otherwise,
> > +        * writeback could overwrite the new data in the swapfile.
> > +        */
> > +       if (!ret) {
> > +               struct zswap_entry *entry;
> > +               long i;
> > +
> > +               for (i = 0; i < nr_pages; ++i) {
> 
> Just reuse 'index' here.
> 
> > +                       entry = xa_erase(tree, offset + i);
> > +                       if (entry)
> > +                               zswap_entry_free(entry);
> 
> I think we need a comment in zswap_store_page() that we shouldn't have
> any possibility of failure after the entry is added in the tree.
> Otherwise we may duplicate the cleanup work here (e.g. putting refs).
> This is already the case in zswap_store(), but it's now more subtle
> and easier to break when the code is split and we have two cleanup
> paths.

I have incorporated all these comments, thanks!

> 
> > +               }
> > +
> > +               if (zswap_pool_reached_full)
> > +                       queue_work(shrink_wq, &zswap_shrink_work);
> 
> We are now doing this check even when zswap is disabled. I think this
> is a problem.
> 
> If zswap gets disabled while 'zswap_pool_reached_full' is true, we
> will queue 'zswap_shrink_work' every time zswap_store() is called in
> the swapout path, but 'zswap_pool_reached_full' will never become
> false again become zswap_check_limits() will never be called.
> 
> I think we need to maintain two separate "reject" and "check_old"
> labels, and avoid this check when zswap is disabled.

Thanks Yosry, great catch! I have fixed this in the v9 that I am working on.

Thanks again,
Kanchana

> 
> > +       }
> > +
> > +       return ret;
> >  }
> >
> >  bool zswap_load(struct folio *folio)
> > --
> > 2.27.0
> >

Sridhar, Kanchana P Sept. 30, 2024, 5:55 p.m. UTC | #7

> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Sent: Sunday, September 29, 2024 2:15 PM
> To: Yosry Ahmed <yosryahmed@google.com>; Johannes Weiner
> <hannes@cmpxchg.org>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>;
> Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Subject: RE: [PATCH v8 6/8] mm: zswap: Support large folios in zswap_store().
> 
> > -----Original Message-----
> > From: Yosry Ahmed <yosryahmed@google.com>
> > Sent: Saturday, September 28, 2024 11:11 AM
> > To: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> > kernel@vger.kernel.org; linux-mm@kvack.org; nphamcs@gmail.com;
> > chengming.zhou@linux.dev; usamaarif642@gmail.com;
> > shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> > <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org;
> > Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v8 6/8] mm: zswap: Support large folios in
> zswap_store().
> >
> > On Sat, Sep 28, 2024 at 7:15 AM Johannes Weiner <hannes@cmpxchg.org>
> > wrote:
> > >
> > > On Fri, Sep 27, 2024 at 08:42:16PM -0700, Yosry Ahmed wrote:
> > > > On Fri, Sep 27, 2024 at 7:16 PM Kanchana P Sridhar
> > > > >  {
> > > > > +       struct page *page = folio_page(folio, index);
> > > > >         swp_entry_t swp = folio->swap;
> > > > > -       pgoff_t offset = swp_offset(swp);
> > > > >         struct xarray *tree = swap_zswap_tree(swp);
> > > > > +       pgoff_t offset = swp_offset(swp) + index;
> > > > >         struct zswap_entry *entry, *old;
> > > > > -       struct obj_cgroup *objcg = NULL;
> > > > > -       struct mem_cgroup *memcg = NULL;
> > > > > -
> > > > > -       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > > > > -       VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > > > > +       int type = swp_type(swp);
> > > >
> > > > Why do we need type? We use it when initializing entry->swpentry to
> > > > reconstruct the swp_entry_t we already have.
> > >
> > > It's not the same entry. folio->swap points to the head entry, this
> > > function has to store swap entries with the offsets of each subpage.
> >
> > Duh, yeah, thanks.
> >
> > >
> > > Given the name of this function, it might be better to actually pass a
> > > page pointer to it; do the folio_page() inside zswap_store().
> > >
> > > Then do
> > >
> > >                 entry->swpentry = page_swap_entry(page);
> > >
> > > below.
> >
> > That is indeed clearer.
> >
> > Although this will be adding yet another caller of page_swap_entry()
> > that already has the folio, yet it calls page_swap_entry() for each
> > page in the folio, which calls page_folio() inside.
> >
> > I wonder if we should add (or replace page_swap_entry()) with a
> > folio_swap_entry(folio, index) helper. This can also be done as a
> > follow up.
> 
> Thanks Johannes and Yosry for these comments. I was thinking about
> this some more. In its current form, zswap_store_page() is called in the
> context
> of the folio by passing in a [folio, index]. This implies a key assumption about
> the existing zswap_store() large folios functionality, i.e., we do the per-page
> store for the page at a "index * PAGE_SIZE" within the folio, and not for any
> arbitrary page. Further, we need the folio for folio_nid(); but this can also be
> computed from the page. Another reason why I thought the existing signature
> might be preferable is because it seems like it enables getting the entry's
> swp_entry_t with fewer computes. Could calling page_swap_entry() add
> more computes; which if it is the case, could potentially add up (say 512
> times)

I went ahead and quantified this with the v8 signature of zswap_store_page()
and the suggested changes for this function to take a page and use
page_swap_entry(). I ran usemem with 2M pmd-mappable folios enabled.
The results indicate that the page_swap_entry() implementation is slightly
better in throughput and latency:

v8:                             run1       run2       run3    average
---------------------------------------------------------------------
Total throughput (KB/s):   6,483,835  6,396,760  6,349,532  6,410,042
Average throughput (KB/s):   216,127    213,225	   211,651    213,889
elapsed time (sec):           107.75     107.06	    109.99     108.87
sys time (sec):             2,476.43   2,453.99	  2,551.52   2,513.98
---------------------------------------------------------------------


page_swap_entry():              run1       run2       run3    average
---------------------------------------------------------------------
Total throughput (KB/s):   6,462,954  6,396,134  6,418,076  6,425,721
Average throughput (KB/s):   215,431    213,204	   213,935    214,683
elapsed time (sec):           108.67     109.46	    107.91     108.29
sys time (sec):             2,473.65   2,493.33	  2,507.82   2,490.74
---------------------------------------------------------------------------

Based on this, I will go ahead and implement the change suggested
by Johannes and submit a v9.

Thanks,
Kanchana

> 
> I would appreciate your thoughts on whether these are valid considerations,
> and can proceed accordingly.
> 
> >
> > >
> > > > >         obj_cgroup_put(objcg);
> > > > > -       if (zswap_pool_reached_full)
> > > > > -               queue_work(shrink_wq, &zswap_shrink_work);
> > > > > -check_old:
> > > > > +       return false;
> > > > > +}
> > > > > +
> > > > > +bool zswap_store(struct folio *folio)
> > > > > +{
> > > > > +       long nr_pages = folio_nr_pages(folio);
> > > > > +       swp_entry_t swp = folio->swap;
> > > > > +       struct xarray *tree = swap_zswap_tree(swp);
> > > > > +       pgoff_t offset = swp_offset(swp);
> > > > > +       struct obj_cgroup *objcg = NULL;
> > > > > +       struct mem_cgroup *memcg = NULL;
> > > > > +       struct zswap_pool *pool;
> > > > > +       size_t compressed_bytes = 0;
> > > >
> > > > Why size_t? entry->length is int.
> > >
> > > In light of Willy's comment, I think size_t is a good idea.
> >
> > Agreed.
> 
> Thanks Yosry, Matthew and Johannes for the resolution on this!
> 
> Thanks,
> Kanchana

[v8,6/8] mm: zswap: Support large folios in zswap_store().

Commit Message

Comments

Patch