
[v7,6/8] mm: zswap: Support mTHP swapout in zswap_store().

Message ID 20240924011709.7037-7-kanchana.p.sridhar@intel.com (mailing list archive)
State New
Series mm: ZSWAP swap-out of mTHP folios

Commit Message

Sridhar, Kanchana P Sept. 24, 2024, 1:17 a.m. UTC
zswap_store() will now store mTHP and PMD-size THP folios by compressing
them page by page.

This patch provides a sequential implementation of storing an mTHP in
zswap_store() by iterating through each page in the folio to compress
and store it in the zswap zpool.

Towards this goal, zswap_compress() is modified to take a page instead
of a folio as input.

Each page's swap offset is stored as a separate zswap entry.

If an error is encountered during the store of any page in the mTHP,
all previous pages/entries stored will be invalidated. Thus, an mTHP
is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.

This forms the basis for building batching of pages during zswap store
of large folios by compressing batches of, say, up to 8 pages in an
mTHP in parallel in hardware, with the Intel In-Memory Analytics
Accelerator (Intel IAA).

A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
will enable/disable zswap storing of (m)THP. The corresponding tunable
zswap module parameter is "mthp_enabled".

This change reuses and adapts the functionality in Ryan Roberts' RFC
patch [1]:

  "[RFC,v1] mm: zswap: Store large folios without splitting"

  [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u

Also, addressed some of the RFC comments from the discussion in [1].

Co-developed-by: Ryan Roberts
Signed-off-by:
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/Kconfig |   8 ++++
 mm/zswap.c | 122 +++++++++++++++++++++++++----------------------------
 2 files changed, 66 insertions(+), 64 deletions(-)
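
At a glance, the control flow this patch gives zswap_store() is sketched
below. This is only a condensed reading of the diff quoted in the replies;
the enablement/limit checks, memcg/LRU handling, reference drops and
statistics are abridged into comments.

bool zswap_store(struct folio *folio)
{
        long nr_pages = folio_nr_pages(folio);
        swp_entry_t swp = folio->swap;
        pgoff_t offset = swp_offset(swp);
        struct xarray *tree = swap_zswap_tree(swp);
        struct obj_cgroup *objcg = NULL;
        struct zswap_pool *pool;
        bool ret = false;
        long index;

        /*
         * ... zswap/mthp enablement checks, cgroup and zswap pool limit
         * checks, objcg and pool references (see the diff for details) ...
         */

        /*
         * Store each page of the folio as a separate zswap entry; bail
         * out on the first failure so the folio is stored all-or-nothing.
         */
        for (index = 0; index < nr_pages; ++index) {
                if (!zswap_store_page(folio, index, objcg, pool))
                        goto put_pool;
        }
        ret = true;

put_pool:
        /* ... drop pool and objcg references, kick the shrinker if full ... */

        /*
         * On failure (or if zswap is disabled), invalidate any entries
         * previously stored at these offsets so writeback cannot clobber
         * the new data in the swapfile.
         */
        if (!ret)
                zswap_delete_stored_offsets(tree, offset, nr_pages);

        return ret;
}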

Comments

Nhat Pham Sept. 24, 2024, 5:33 p.m. UTC | #1
On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> zswap_store() will now store mTHP and PMD-size THP folios by compressing
> them page by page.
>
> This patch provides a sequential implementation of storing an mTHP in
> zswap_store() by iterating through each page in the folio to compress
> and store it in the zswap zpool.
>
> Towards this goal, zswap_compress() is modified to take a page instead
> of a folio as input.
>
> Each page's swap offset is stored as a separate zswap entry.
>
> If an error is encountered during the store of any page in the mTHP,
> all previous pages/entries stored will be invalidated. Thus, an mTHP
> is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.
>
> This forms the basis for building batching of pages during zswap store
> of large folios by compressing batches of up to say, 8 pages in an
> mTHP in parallel in hardware, with the Intel In-Memory Analytics
> Accelerator (Intel IAA).
>
> A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
> will enable/disable zswap storing of (m)THP. The corresponding tunable
> zswap module parameter is "mthp_enabled".
>
> This change reuses and adapts the functionality in Ryan Roberts' RFC
> patch [1]:
>
>   "[RFC,v1] mm: zswap: Store large folios without splitting"
>
>   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> Also, addressed some of the RFC comments from the discussion in [1].
>
> Co-developed-by: Ryan Roberts
> Signed-off-by:
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  mm/Kconfig |   8 ++++
>  mm/zswap.c | 122 +++++++++++++++++++++++++----------------------------
>  2 files changed, 66 insertions(+), 64 deletions(-)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 09aebca1cae3..c659fb732ec4 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -59,6 +59,14 @@ config ZSWAP_SHRINKER_DEFAULT_ON
>           reducing the chance that cold pages will reside in the zswap pool
>           and consume memory indefinitely.
>
> +config ZSWAP_STORE_THP_DEFAULT_ON
> +       bool "Store mTHP and THP folios in zswap"
> +       depends on ZSWAP
> +       default n
> +       help
> +         If selected, zswap will process mTHP and THP folios by
> +         compressing and storing each 4K page in the large folio.
> +
>  choice
>         prompt "Default compressor"
>         depends on ZSWAP
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 8f2e0ab34c84..16ab770546d6 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -127,6 +127,14 @@ static bool zswap_shrinker_enabled = IS_ENABLED(
>                 CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
>  module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
>
> +/*
> + * Enable/disable zswap processing of mTHP folios.
> + * For now, only zswap_store will process mTHP folios.
> + */
> +static bool zswap_mthp_enabled = IS_ENABLED(
> +               CONFIG_ZSWAP_STORE_THP_DEFAULT_ON);
> +module_param_named(mthp_enabled, zswap_mthp_enabled, bool, 0644);
> +

Hmm, so this is a runtime knob. Also, should this be zswap_thp_enabled? :)

>  bool zswap_is_enabled(void)
>  {
>         return zswap_enabled;
> @@ -1471,9 +1479,9 @@ static void zswap_delete_stored_offsets(struct xarray *tree,
>   * @objcg: The folio's objcg.
>   * @pool:  The zswap_pool to store the compressed data for the page.
>   */
> -static bool __maybe_unused zswap_store_page(struct folio *folio, long index,
> -                                           struct obj_cgroup *objcg,
> -                                           struct zswap_pool *pool)
> +static bool zswap_store_page(struct folio *folio, long index,
> +                            struct obj_cgroup *objcg,
> +                            struct zswap_pool *pool)
>  {
>         swp_entry_t swp = folio->swap;
>         int type = swp_type(swp);
> @@ -1551,51 +1559,63 @@ static bool __maybe_unused zswap_store_page(struct folio *folio, long index,
>         return false;
>  }
>
> +/*
> + * Modified to store mTHP folios. Each page in the mTHP will be compressed
> + * and stored sequentially.
> + */
>  bool zswap_store(struct folio *folio)
>  {
>         long nr_pages = folio_nr_pages(folio);
>         swp_entry_t swp = folio->swap;
>         pgoff_t offset = swp_offset(swp);
>         struct xarray *tree = swap_zswap_tree(swp);
> -       struct zswap_entry *entry;
>         struct obj_cgroup *objcg = NULL;
>         struct mem_cgroup *memcg = NULL;
> +       struct zswap_pool *pool;
> +       bool ret = false;
> +       long index;
>
>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
>         VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
>
> -       /* Large folios aren't supported */
> -       if (folio_test_large(folio))
> +       /* Storing large folios isn't enabled */
> +       if (!zswap_mthp_enabled && folio_test_large(folio))
>                 return false;

Hmm can this go wrong somehow? Can we have a case where we enable
zswap_mthp_enabled, have a large folio written to zswap, disable
zswap_mthp_enabled, and attempt to store that folio to zswap again.

Now, we have a stale copy in zswap that is not invalidated...?

Or am I missing something here :)
Yosry Ahmed Sept. 24, 2024, 7:38 p.m. UTC | #2
On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> zswap_store() will now store mTHP and PMD-size THP folios by compressing
> them page by page.
>
> This patch provides a sequential implementation of storing an mTHP in
> zswap_store() by iterating through each page in the folio to compress
> and store it in the zswap zpool.
>
> Towards this goal, zswap_compress() is modified to take a page instead
> of a folio as input.
>
> Each page's swap offset is stored as a separate zswap entry.
>
> If an error is encountered during the store of any page in the mTHP,
> all previous pages/entries stored will be invalidated. Thus, an mTHP
> is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.
>
> This forms the basis for building batching of pages during zswap store
> of large folios by compressing batches of up to say, 8 pages in an
> mTHP in parallel in hardware, with the Intel In-Memory Analytics
> Accelerator (Intel IAA).
>
> A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
> will enable/disable zswap storing of (m)THP. The corresponding tunable
> zswap module parameter is "mthp_enabled".
>
> This change reuses and adapts the functionality in Ryan Roberts' RFC
> patch [1]:
>
>   "[RFC,v1] mm: zswap: Store large folios without splitting"
>
>   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> Also, addressed some of the RFC comments from the discussion in [1].
>
> Co-developed-by: Ryan Roberts
> Signed-off-by:
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  mm/Kconfig |   8 ++++
>  mm/zswap.c | 122 +++++++++++++++++++++++++----------------------------
>  2 files changed, 66 insertions(+), 64 deletions(-)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 09aebca1cae3..c659fb732ec4 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -59,6 +59,14 @@ config ZSWAP_SHRINKER_DEFAULT_ON
>           reducing the chance that cold pages will reside in the zswap pool
>           and consume memory indefinitely.
>
> +config ZSWAP_STORE_THP_DEFAULT_ON
> +       bool "Store mTHP and THP folios in zswap"
> +       depends on ZSWAP
> +       default n
> +       help
> +         If selected, zswap will process mTHP and THP folios by
> +         compressing and storing each 4K page in the large folio.
> +
>  choice
>         prompt "Default compressor"
>         depends on ZSWAP
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 8f2e0ab34c84..16ab770546d6 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -127,6 +127,14 @@ static bool zswap_shrinker_enabled = IS_ENABLED(
>                 CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
>  module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
>
> +/*
> + * Enable/disable zswap processing of mTHP folios.
> + * For now, only zswap_store will process mTHP folios.
> + */
> +static bool zswap_mthp_enabled = IS_ENABLED(
> +               CONFIG_ZSWAP_STORE_THP_DEFAULT_ON);
> +module_param_named(mthp_enabled, zswap_mthp_enabled, bool, 0644);
> +
>  bool zswap_is_enabled(void)
>  {
>         return zswap_enabled;
> @@ -1471,9 +1479,9 @@ static void zswap_delete_stored_offsets(struct xarray *tree,
>   * @objcg: The folio's objcg.
>   * @pool:  The zswap_pool to store the compressed data for the page.
>   */
> -static bool __maybe_unused zswap_store_page(struct folio *folio, long index,
> -                                           struct obj_cgroup *objcg,
> -                                           struct zswap_pool *pool)
> +static bool zswap_store_page(struct folio *folio, long index,
> +                            struct obj_cgroup *objcg,
> +                            struct zswap_pool *pool)

As I mentioned earlier, the patch that introduced zswap_store_page()
should have directly used it in zswap_store(). This would make this
patch much clearer.

>  {
>         swp_entry_t swp = folio->swap;
>         int type = swp_type(swp);
> @@ -1551,51 +1559,63 @@ static bool __maybe_unused zswap_store_page(struct folio *folio, long index,
>         return false;
>  }
>
> +/*
> + * Modified to store mTHP folios. Each page in the mTHP will be compressed
> + * and stored sequentially.
> + */
>  bool zswap_store(struct folio *folio)
>  {
>         long nr_pages = folio_nr_pages(folio);
>         swp_entry_t swp = folio->swap;
>         pgoff_t offset = swp_offset(swp);
>         struct xarray *tree = swap_zswap_tree(swp);
> -       struct zswap_entry *entry;
>         struct obj_cgroup *objcg = NULL;
>         struct mem_cgroup *memcg = NULL;
> +       struct zswap_pool *pool;
> +       bool ret = false;
> +       long index;
>
>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
>         VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
>
> -       /* Large folios aren't supported */
> -       if (folio_test_large(folio))
> +       /* Storing large folios isn't enabled */

The comment is now stating the obvious, please remove it.

> +       if (!zswap_mthp_enabled && folio_test_large(folio))
>                 return false;
>
>         if (!zswap_enabled)
> -               goto check_old;
> +               goto reject;
>
> -       /* Check cgroup limits */
> +       /*
> +        * Check cgroup limits:
> +        *
> +        * The cgroup zswap limit check is done once at the beginning of an
> +        * mTHP store, and not within zswap_store_page() for each page
> +        * in the mTHP. We do however check the zswap pool limits at the
> +        * start of zswap_store_page(). What this means is, the cgroup
> +        * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> +        * However, the per-store-page zswap pool limits check should
> +        * hopefully trigger the cgroup aware and zswap LRU aware global
> +        * reclaim implemented in the shrinker. If this assumption holds,
> +        * the cgroup exceeding the zswap limits could potentially be
> +        * resolved before the next zswap_store, and if it is not, the next
> +        * zswap_store would fail the cgroup zswap limit check at the start.
> +        */

I do not really like this. Allowing going one page above the limit is
one thing, but one THP above the limit seems too much. I also don't
like relying on the repeated limit checking in zswap_store_page(), if
anything I think that should be batched too.

Is it too unreasonable to maintain the average compression ratio and
use that to estimate limit checking for both memcg and global limits?
Johannes, Nhat, any thoughts on this?
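
For illustration, the ratio-based estimate suggested here could look
roughly like the sketch below. None of these names exist in zswap today;
they are made up for the example, and the counters would live in
mm/zswap.c next to the other statistics.

static atomic64_t zswap_stored_raw_bytes;   /* uncompressed bytes stored */
static atomic64_t zswap_stored_comp_bytes;  /* compressed bytes stored */

/* called after each successful zswap_compress() of a page */
static void zswap_ratio_update(size_t compressed_len)
{
        atomic64_add(PAGE_SIZE, &zswap_stored_raw_bytes);
        atomic64_add(compressed_len, &zswap_stored_comp_bytes);
}

/* estimated compressed footprint of nr_pages worth of data */
static u64 zswap_estimate_compressed(long nr_pages)
{
        u64 raw = atomic64_read(&zswap_stored_raw_bytes);
        u64 comp = atomic64_read(&zswap_stored_comp_bytes);

        /* no history yet: assume the worst case of no compression */
        if (!raw || !comp)
                return (u64)nr_pages * PAGE_SIZE;

        return div64_u64((u64)nr_pages * PAGE_SIZE * comp, raw);
}

zswap_store() could then compare this estimate against the global and
per-memcg zswap limits once, before storing any subpage, instead of
re-checking on every page.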

>         objcg = get_obj_cgroup_from_folio(folio);
>         if (objcg && !obj_cgroup_may_zswap(objcg)) {
>                 memcg = get_mem_cgroup_from_objcg(objcg);
>                 if (shrink_memcg(memcg)) {
>                         mem_cgroup_put(memcg);
> -                       goto reject;
> +                       goto put_objcg;
>                 }
>                 mem_cgroup_put(memcg);
>         }
>
>         if (zswap_check_limits())
> -               goto reject;
> -
> -       /* allocate entry */
> -       entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> -       if (!entry) {
> -               zswap_reject_kmemcache_fail++;
> -               goto reject;
> -       }
> +               goto put_objcg;
>
> -       /* if entry is successfully added, it keeps the reference */
> -       entry->pool = zswap_pool_current_get();
> -       if (!entry->pool)
> -               goto freepage;
> +       pool = zswap_pool_current_get();
> +       if (!pool)
> +               goto put_objcg;
>
>         if (objcg) {
>                 memcg = get_mem_cgroup_from_objcg(objcg);
> @@ -1606,60 +1626,34 @@ bool zswap_store(struct folio *folio)
>                 mem_cgroup_put(memcg);
>         }
>
> -       if (!zswap_compress(&folio->page, entry))
> -               goto put_pool;
> -
> -       entry->swpentry = swp;
> -       entry->objcg = objcg;
> -       entry->referenced = true;
> -
> -       if (!zswap_store_entry(tree, entry))
> -               goto store_failed;
> -
> -       if (objcg) {
> -               obj_cgroup_charge_zswap(objcg, entry->length);
> -               count_objcg_event(objcg, ZSWPOUT);
> -       }
> -
>         /*
> -        * We finish initializing the entry while it's already in xarray.
> -        * This is safe because:
> -        *
> -        * 1. Concurrent stores and invalidations are excluded by folio lock.
> -        *
> -        * 2. Writeback is excluded by the entry not being on the LRU yet.
> -        *    The publishing order matters to prevent writeback from seeing
> -        *    an incoherent entry.
> +        * Store each page of the folio as a separate entry. If we fail to store
> +        * a page, unwind by removing all the previous pages we stored.
>          */
> -       if (entry->length) {
> -               INIT_LIST_HEAD(&entry->lru);
> -               zswap_lru_add(&zswap_list_lru, entry);
> +       for (index = 0; index < nr_pages; ++index) {
> +               if (!zswap_store_page(folio, index, objcg, pool))
> +                       goto put_pool;
>         }
>
> -       /* update stats */
> -       atomic_inc(&zswap_stored_pages);
> -       count_vm_event(ZSWPOUT);
> -
> -       return true;
> +       ret = true;
>
> -store_failed:
> -       zpool_free(entry->pool->zpool, entry->handle);
>  put_pool:
> -       zswap_pool_put(entry->pool);
> -freepage:
> -       zswap_entry_cache_free(entry);
> -reject:
> +       zswap_pool_put(pool);
> +put_objcg:
>         obj_cgroup_put(objcg);
>         if (zswap_pool_reached_full)
>                 queue_work(shrink_wq, &zswap_shrink_work);
> -check_old:
> +reject:
>         /*
> -        * If the zswap store fails or zswap is disabled, we must invalidate the
> -        * possibly stale entry which was previously stored at this offset.
> -        * Otherwise, writeback could overwrite the new data in the swapfile.
> +        * If the zswap store fails or zswap is disabled, we must invalidate
> +        * the possibly stale entries which were previously stored at the
> +        * offsets corresponding to each page of the folio. Otherwise,
> +        * writeback could overwrite the new data in the swapfile.
>          */
> -       zswap_delete_stored_offsets(tree, offset, nr_pages);
> -       return false;
> +       if (!ret)
> +               zswap_delete_stored_offsets(tree, offset, nr_pages);
> +
> +       return ret;
>  }
>
>  bool zswap_load(struct folio *folio)
> --
> 2.27.0
>
Nhat Pham Sept. 24, 2024, 8:51 p.m. UTC | #3
On Tue, Sep 24, 2024 at 12:39 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
> > +        * The cgroup zswap limit check is done once at the beginning of an
> > +        * mTHP store, and not within zswap_store_page() for each page
> > +        * in the mTHP. We do however check the zswap pool limits at the
> > +        * start of zswap_store_page(). What this means is, the cgroup
> > +        * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> > +        * However, the per-store-page zswap pool limits check should
> > +        * hopefully trigger the cgroup aware and zswap LRU aware global
> > +        * reclaim implemented in the shrinker. If this assumption holds,
> > +        * the cgroup exceeding the zswap limits could potentially be
> > +        * resolved before the next zswap_store, and if it is not, the next
> > +        * zswap_store would fail the cgroup zswap limit check at the start.
> > +        */
>
> I do not really like this. Allowing going one page above the limit is
> one thing, but one THP above the limit seems too much. I also don't

Hmm what if you have multiple concurrent zswap stores, from different
tasks but the same cgroup? If none of them has charged, they would all
get greenlit, and charge towards the cgroup...

So technically the zswap limit checking is already best-effort only.
But now, instead of one page per violation, it's 512 pages per
violation :)

Yeah this can be bad. I think this is only safe if you only use
zswap.max as a binary knob (0 or max)...

> like relying on the repeated limit checking in zswap_store_page(), if
> anything I think that should be batched too.
>
> Is it too unreasonable to maintain the average compression ratio and
> use that to estimate limit checking for both memcg and global limits?
> Johannes, Nhat, any thoughts on this?

I remember asking about this, but past Nhat might have relented :)

https://lore.kernel.org/linux-mm/CAKEwX=PfAMZ2qJtwKwJsVx3TZWxV5z2ZaU1Epk1UD=DBdMsjFA@mail.gmail.com/

We can do limit checking and charging after compression is done, but
that's a lot of code change (might not even be possible)... It will,
however, allow us to do charging + checking in one go (rather than
doing it 8, 16, or 512 times)
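
Roughly, the "charging + checking in one go" version being described
might look like the sketch below. This is not the patch's code: the
entries[] array and the unwind label are placeholders, and entry
allocation, xarray insertion and the unwind path itself are elided.

        size_t total_compressed = 0;
        long i;

        /* compress every subpage of the folio first */
        for (i = 0; i < nr_pages; i++) {
                if (!zswap_compress(folio_page(folio, i), entries[i]))
                        goto unwind;
                total_compressed += entries[i]->length;
        }

        if (objcg) {
                /* one cgroup limit check ... */
                if (!obj_cgroup_may_zswap(objcg))
                        goto unwind;
                /* ... and one charge for the whole folio */
                obj_cgroup_charge_zswap(objcg, total_compressed);
        }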

Another thing we can do is to register a zswap writeback after the
zswap store attempts to clean up excess capacity. Not sure what will
happen if zswap writeback is disabled for the cgroup though :)

If it's too hard, the average estimate could be a decent compromise,
until we figure something smarter.
Sridhar, Kanchana P Sept. 24, 2024, 8:51 p.m. UTC | #4
> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Tuesday, September 24, 2024 10:34 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
> 
> On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > zswap_store() will now store mTHP and PMD-size THP folios by compressing
> > them page by page.
> >
> > This patch provides a sequential implementation of storing an mTHP in
> > zswap_store() by iterating through each page in the folio to compress
> > and store it in the zswap zpool.
> >
> > Towards this goal, zswap_compress() is modified to take a page instead
> > of a folio as input.
> >
> > Each page's swap offset is stored as a separate zswap entry.
> >
> > If an error is encountered during the store of any page in the mTHP,
> > all previous pages/entries stored will be invalidated. Thus, an mTHP
> > is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.
> >
> > This forms the basis for building batching of pages during zswap store
> > of large folios by compressing batches of up to say, 8 pages in an
> > mTHP in parallel in hardware, with the Intel In-Memory Analytics
> > Accelerator (Intel IAA).
> >
> > A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by
> default)
> > will enable/disable zswap storing of (m)THP. The corresponding tunable
> > zswap module parameter is "mthp_enabled".
> >
> > This change reuses and adapts the functionality in Ryan Roberts' RFC
> > patch [1]:
> >
> >   "[RFC,v1] mm: zswap: Store large folios without splitting"
> >
> >   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> >
> > Also, addressed some of the RFC comments from the discussion in [1].
> >
> > Co-developed-by: Ryan Roberts
> > Signed-off-by:
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  mm/Kconfig |   8 ++++
> >  mm/zswap.c | 122 +++++++++++++++++++++++++----------------------------
> >  2 files changed, 66 insertions(+), 64 deletions(-)
> >
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 09aebca1cae3..c659fb732ec4 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -59,6 +59,14 @@ config ZSWAP_SHRINKER_DEFAULT_ON
> >           reducing the chance that cold pages will reside in the zswap pool
> >           and consume memory indefinitely.
> >
> > +config ZSWAP_STORE_THP_DEFAULT_ON
> > +       bool "Store mTHP and THP folios in zswap"
> > +       depends on ZSWAP
> > +       default n
> > +       help
> > +         If selected, zswap will process mTHP and THP folios by
> > +         compressing and storing each 4K page in the large folio.
> > +
> >  choice
> >         prompt "Default compressor"
> >         depends on ZSWAP
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 8f2e0ab34c84..16ab770546d6 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -127,6 +127,14 @@ static bool zswap_shrinker_enabled =
> IS_ENABLED(
> >                 CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
> >  module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool,
> 0644);
> >
> > +/*
> > + * Enable/disable zswap processing of mTHP folios.
> > + * For now, only zswap_store will process mTHP folios.
> > + */
> > +static bool zswap_mthp_enabled = IS_ENABLED(
> > +               CONFIG_ZSWAP_STORE_THP_DEFAULT_ON);
> > +module_param_named(mthp_enabled, zswap_mthp_enabled, bool,
> 0644);
> > +
> 
> Hmm, so this is a runtime knob. Also, should this be zswap_thp_enabled? :)

Agreed, zswap_thp_enabled is a better name. I will make the change in v8.
More comments below as to the runtime knob.

> 
> >  bool zswap_is_enabled(void)
> >  {
> >         return zswap_enabled;
> > @@ -1471,9 +1479,9 @@ static void zswap_delete_stored_offsets(struct
> xarray *tree,
> >   * @objcg: The folio's objcg.
> >   * @pool:  The zswap_pool to store the compressed data for the page.
> >   */
> > -static bool __maybe_unused zswap_store_page(struct folio *folio, long
> index,
> > -                                           struct obj_cgroup *objcg,
> > -                                           struct zswap_pool *pool)
> > +static bool zswap_store_page(struct folio *folio, long index,
> > +                            struct obj_cgroup *objcg,
> > +                            struct zswap_pool *pool)
> >  {
> >         swp_entry_t swp = folio->swap;
> >         int type = swp_type(swp);
> > @@ -1551,51 +1559,63 @@ static bool __maybe_unused
> zswap_store_page(struct folio *folio, long index,
> >         return false;
> >  }
> >
> > +/*
> > + * Modified to store mTHP folios. Each page in the mTHP will be
> compressed
> > + * and stored sequentially.
> > + */
> >  bool zswap_store(struct folio *folio)
> >  {
> >         long nr_pages = folio_nr_pages(folio);
> >         swp_entry_t swp = folio->swap;
> >         pgoff_t offset = swp_offset(swp);
> >         struct xarray *tree = swap_zswap_tree(swp);
> > -       struct zswap_entry *entry;
> >         struct obj_cgroup *objcg = NULL;
> >         struct mem_cgroup *memcg = NULL;
> > +       struct zswap_pool *pool;
> > +       bool ret = false;
> > +       long index;
> >
> >         VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >         VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> >
> > -       /* Large folios aren't supported */
> > -       if (folio_test_large(folio))
> > +       /* Storing large folios isn't enabled */
> > +       if (!zswap_mthp_enabled && folio_test_large(folio))
> >                 return false;
> 
> Hmm can this go wrong somehow? Can we have a case where we enable
> zswap_mthp_enabled, have a large folio written to zswap, disable
> zswap_mthp_enabled, and attempt to store that folio to zswap again.
> 
> Now, we have a stale copy in zswap that is not invalidated...?
> 
> Or am I missing something here :)

This is an excellent point. Thanks Nhat for catching this! I can see two
options for solving this:

Option 1: If zswap_mthp_enabled is "false", delete all stored offsets
for the mTHP in zswap before exiting. This could race with writeback
(either one or more subpages could be written back before zswap_store
acquires the tree lock); however, I don't think it will cause data inconsistencies.
Any offsets for subpages not written back will be deleted from zswap,
zswap_store() will return false, and the backing swap device's subsequent
swapout will overwrite the data zswap wrote back. Could anything go wrong
with this?

Option 2: Only provide a build config option,
CONFIG_ZSWAP_STORE_THP_DEFAULT_ON, that cannot be dynamically changed.

Would appreciate suggestions on these, and other potential solutions.

Thanks,
Kanchana
Nhat Pham Sept. 24, 2024, 9:08 p.m. UTC | #5
On Tue, Sep 24, 2024 at 1:51 PM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
>
> This is an excellent point. Thanks Nhat for catching this! I can see two
> options to solving this:
>
> Option 1: If zswap_mthp_enabled is "false", delete all stored offsets
> for the mTHP in zswap before exiting. This could race with writeback
> (either one or more subpages could be written back before zswap_store
> acquires the tree lock), however, I don't think it will cause data inconsistencies.
> Any offsets for subpages not written back will be deleted from zswap,
> zswap_store() will return false, and the backing swap device's subsequent
> swapout will over-write the zswap write-back data. Could anything go wrong
> with this?

I think this should be safe, albeit a bit awkward.

At this point (zswap_store()), we should have the folio added to the
swap cache, and locked. All the associated swap entries will point to
this same (large) folio.

Any concurrent zswap writeback attempt, even on a tail page, should
get that folio when it calls __read_swap_cache_async(), and with
page_allocated == false, and should short circuit.

So I don't think we will race with zswap_writeback().

Yosry, Chengming, Johannes, any thoughts?

>
> Option 2: Only provide a build config option,
> CONFIG_ZSWAP_STORE_THP_DEFAULT_ON, that cannot be dynamically changed.

This can be a last resort thing, if the above doesn't work. Not the
end of the world, but not ideal :)

>
> Would appreciate suggestions on these, and other potential solutions.
>
> Thanks,
> Kanchana
Yosry Ahmed Sept. 24, 2024, 9:34 p.m. UTC | #6
On Tue, Sep 24, 2024 at 2:08 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Tue, Sep 24, 2024 at 1:51 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> >
> > This is an excellent point. Thanks Nhat for catching this! I can see two
> > options to solving this:
> >
> > Option 1: If zswap_mthp_enabled is "false", delete all stored offsets
> > for the mTHP in zswap before exiting. This could race with writeback
> > (either one or more subpages could be written back before zswap_store
> > acquires the tree lock), however, I don't think it will cause data inconsistencies.
> > Any offsets for subpages not written back will be deleted from zswap,
> > zswap_store() will return false, and the backing swap device's subsequent
> > swapout will over-write the zswap write-back data. Could anything go wrong
> > with this?
>
> I think this should be safe, albeit a bit awkward.
>
> At this point (zswap_store()), we should have the folio added to the
> swap cache, and locked. All the associated swap entries will point to
> this same (large) folio.
>
> Any concurrent zswap writeback attempt, even on a tail page, should
> get that folio when it calls __read_swap_cache_async(), and with
> page_allocated == false, and should short circuit.
>
> So I don't think we will race with zswap_writeback().
>
> Yosry, Chengming, Johannes, any thoughts?

Why can't we just handle it the same way as we handle zswap
disablement? If it is disabled, we invalidate any old entries for the
offsets and return false to swapout to disk.
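
For illustration, that could be a one-line change against the reworked
zswap_store() in this patch (assuming the knob survives and is renamed
to zswap_thp_enabled as suggested earlier); "reject" is the label that
falls through to zswap_delete_stored_offsets():

        /* Treat "THP storing disabled" like "zswap disabled", so any
         * stale entries for this large folio are still invalidated
         * before falling back to the backing swap device.
         */
        if (!zswap_thp_enabled && folio_test_large(folio))
                goto reject;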

Taking a step back, why do we need the runtime knob and config option?
Are there cases where we think zswapout of mTHPs will perform badly,
or is it just due to lack of confidence in the feature?

>
> >
> > Option 2: Only provide a build config option,
> > CONFIG_ZSWAP_STORE_THP_DEFAULT_ON, that cannot be dynamically changed.
>
> This can be a last resort thing, if the above doesn't work. Not the
> end of the world, but not ideal :)
>
> >
> > Would appreciate suggestions on these, and other potential solutions.
> >
> > Thanks,
> > Kanchana
Yosry Ahmed Sept. 24, 2024, 9:38 p.m. UTC | #7
On Tue, Sep 24, 2024 at 1:51 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Tue, Sep 24, 2024 at 12:39 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
> > > +        * The cgroup zswap limit check is done once at the beginning of an
> > > +        * mTHP store, and not within zswap_store_page() for each page
> > > +        * in the mTHP. We do however check the zswap pool limits at the
> > > +        * start of zswap_store_page(). What this means is, the cgroup
> > > +        * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> > > +        * However, the per-store-page zswap pool limits check should
> > > +        * hopefully trigger the cgroup aware and zswap LRU aware global
> > > +        * reclaim implemented in the shrinker. If this assumption holds,
> > > +        * the cgroup exceeding the zswap limits could potentially be
> > > +        * resolved before the next zswap_store, and if it is not, the next
> > > +        * zswap_store would fail the cgroup zswap limit check at the start.
> > > +        */
> >
> > I do not really like this. Allowing going one page above the limit is
> > one thing, but one THP above the limit seems too much. I also don't
>
> Hmm what if you have multiple concurrent zswap stores, from different
> tasks but the same cgroup? If none of them has charged, they would all
> get greenlit, and charge towards the cgroup...
>
> So technically the zswap limit checking is already best-effort only.
> But now, instead of one page per violation, it's 512 pages per
> violation :)

Yeah good point about concurrent operations, we can go 512 pages above
limit * number of concurrent swapouts. That can be a lot of memory.

>
> Yeah this can be bad. I think this is only safe if you only use
> zswap.max as a binary knob (0 or max)...
>
> > like relying on the repeated limit checking in zswap_store_page(), if
> > anything I think that should be batched too.
> >
> > Is it too unreasonable to maintain the average compression ratio and
> > use that to estimate limit checking for both memcg and global limits?
> > Johannes, Nhat, any thoughts on this?
>
> I remember asking about this, but past Nhat might have relented :)
>
> https://lore.kernel.org/linux-mm/CAKEwX=PfAMZ2qJtwKwJsVx3TZWxV5z2ZaU1Epk1UD=DBdMsjFA@mail.gmail.com/
>
> We can do limit checking and charging after compression is done, but
> that's a lot of code change (might not even be possible)... It will,
> however, allow us to do charging + checking in one go (rather than
> doing it 8, 16, or 512 times)
>
> Another thing we can do is to register a zswap writeback after the
> zswap store attempts to clean up excess capacity. Not sure what will
> happen if zswap writeback is disabled for the cgroup though :)
>
> If it's too hard, the average estimate could be a decent compromise,
> until we figure something smarter.

We can also do what we discussed before about double charging. The
pages that are being reclaimed are already charged, so technically we
don't need to charge them again. We can uncharge the difference
between compressed and uncompressed sizes after compression and call
it a day. This fixes the limit checking and the double charging in one
go.

I am a little bit nervous though about zswap uncharging the pages from
under reclaim, there are likely further accesses of the page memcg
after zswap. Maybe we can plumb the info back to reclaim or set a flag
on the page to avoid uncharging it when it's freed.
Nhat Pham Sept. 24, 2024, 10:16 p.m. UTC | #8
On Tue, Sep 24, 2024 at 2:34 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
>
> Why can't we just handle it the same way as we handle zswap
> disablement? If it is disabled, we invalidate any old entries for the
> offsets and return false to swapout to disk.

I think that was the suggestion.

>
> Taking a step back, why do we need the runtime knob and config option?
> Are there cases where we think zswapout of mTHPs will perform badly,
> or is it just due to lack of confidence in the feature?

Fair point. I think the reason I suggested this knob was that we
observed so many regressions in earlier benchmarks, especially in the
software compressor column.

But now that we've reworked the benchmark + use zstd for software
compressor, I think we can get rid of this knob/config option, and
simplify things.
Sridhar, Kanchana P Sept. 24, 2024, 10:17 p.m. UTC | #9
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Tuesday, September 24, 2024 2:34 PM
> To: Nhat Pham <nphamcs@gmail.com>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
> 
> On Tue, Sep 24, 2024 at 2:08 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Tue, Sep 24, 2024 at 1:51 PM Sridhar, Kanchana P
> > <kanchana.p.sridhar@intel.com> wrote:
> > >
> > >
> > > This is an excellent point. Thanks Nhat for catching this! I can see two
> > > options to solving this:
> > >
> > > Option 1: If zswap_mthp_enabled is "false", delete all stored offsets
> > > for the mTHP in zswap before exiting. This could race with writeback
> > > (either one or more subpages could be written back before zswap_store
> > > acquires the tree lock), however, I don't think it will cause data
> inconsistencies.
> > > Any offsets for subpages not written back will be deleted from zswap,
> > > zswap_store() will return false, and the backing swap device's subsequent
> > > swapout will over-write the zswap write-back data. Could anything go
> wrong
> > > with this?
> >
> > I think this should be safe, albeit a bit awkward.
> >
> > At this point (zswap_store()), we should have the folio added to to
> > swap cache, and locked. All the associated swap entries will point to
> > this same (large) folio.
> >
> > Any concurrent zswap writeback attempt, even on a tail page, should
> > get that folio when it calls __read_swap_cache_async(), and with
> > page_allocated == false, and should short circuit.
> >
> > So I don't think we will race with zswap_writeback().
> >
> > Yosry, Chengming, Johannes, any thoughts?
> 
> Why can't we just handle it the same way as we handle zswap
> disablement? If it is disabled, we invalidate any old entries for the
> offsets and return false to swapout to disk.
> 
> Taking a step back, why do we need the runtime knob and config option?
> Are there cases where we think zswapout of mTHPs will perform badly,
> or is it just due to lack of confidence in the feature?

Thanks Nhat and Yosry for the suggestions/comments.

If I recall correctly, the topic of adding a config option/knob came up
based on earlier data I had collected with a zram backing device setup,
which showed a performance degradation with zstd, but not with deflate-iaa.

Since the v7 data collected with an 823G SSD swap disk partition indicates
that we get good throughput and latency improvements with zswap-mTHP
with zstd and deflate-iaa, I am not sure if the knob is still required (if this
is representative of most of the setups that use mTHP).

I am confident about the zswap-mTHP feature itself, and don’t think the
knob is needed from that perspective. I think the question is really about
having the ability to disable zswap-mTHP in some existing setup where
having mTHP enabled performs worse with this patchset than without.

I am Ok with having the knob and handling it using Option 1, or, not
having a knob.

Thanks,
Kanchana 

> 
> >
> > >
> > > Option 2: Only provide a build config option,
> > > CONFIG_ZSWAP_STORE_THP_DEFAULT_ON, that cannot be dynamically
> changed.
> >
> > This can be a last resort thing, if the above doesn't work. Not the
> > end of the world, but not ideal :)
> >
> > >
> > > Would appreciate suggestions on these, and other potential solutions.
> > >
> > > Thanks,
> > > Kanchana
Sridhar, Kanchana P Sept. 24, 2024, 10:18 p.m. UTC | #10
> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Tuesday, September 24, 2024 3:16 PM
> To: Yosry Ahmed <yosryahmed@google.com>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
> 
> On Tue, Sep 24, 2024 at 2:34 PM Yosry Ahmed <yosryahmed@google.com>
> wrote:
> >
> >
> > Why can't we just handle it the same way as we handle zswap
> > disablement? If it is disabled, we invalidate any old entries for the
> > offsets and return false to swapout to disk.
> 
> I think that was the suggestion.
> 
> >
> > Taking a step back, why do we need the runtime knob and config option?
> > Are there cases where we think zswapout of mTHPs will perform badly,
> > or is it just due to lack of confidence in the feature?
> 
> Fair point. I think the reason why I suggested this knob was because
> we observe so much regressions in earlier benchmarks, and especially
> on the software compressor column.
> 
> But now that we've reworked the benchmark + use zstd for software
> compressor, I think we can get rid of this knob/config option, and
> simplify things.

I agree, thanks Nhat! Will fix this in v8.

Thanks,
Kanchana
Yosry Ahmed Sept. 24, 2024, 10:28 p.m. UTC | #11
On Tue, Sep 24, 2024 at 3:16 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Tue, Sep 24, 2024 at 2:34 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> >
> > Why can't we just handle it the same way as we handle zswap
> > disablement? If it is disabled, we invalidate any old entries for the
> > offsets and return false to swapout to disk.
>
> I think that was the suggestion.


Hmm I may be reading this wrong, but my understanding was that the
suggestion is to synchronously remove all entries of large folios from
zswap when zswap_mthp is disabled. What I am suggesting is to do the
same thing we do in zswap_store() when zswap is disabled.

Anyway, if we are removing the knob this is not relevant anymore.
Sridhar, Kanchana P Sept. 24, 2024, 11:02 p.m. UTC | #12
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Tuesday, September 24, 2024 12:39 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
> 
> On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > zswap_store() will now store mTHP and PMD-size THP folios by compressing
> > them page by page.
> >
> > This patch provides a sequential implementation of storing an mTHP in
> > zswap_store() by iterating through each page in the folio to compress
> > and store it in the zswap zpool.
> >
> > Towards this goal, zswap_compress() is modified to take a page instead
> > of a folio as input.
> >
> > Each page's swap offset is stored as a separate zswap entry.
> >
> > If an error is encountered during the store of any page in the mTHP,
> > all previous pages/entries stored will be invalidated. Thus, an mTHP
> > is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.
> >
> > This forms the basis for building batching of pages during zswap store
> > of large folios by compressing batches of up to say, 8 pages in an
> > mTHP in parallel in hardware, with the Intel In-Memory Analytics
> > Accelerator (Intel IAA).
> >
> > A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by
> default)
> > will enable/disable zswap storing of (m)THP. The corresponding tunable
> > zswap module parameter is "mthp_enabled".
> >
> > This change reuses and adapts the functionality in Ryan Roberts' RFC
> > patch [1]:
> >
> >   "[RFC,v1] mm: zswap: Store large folios without splitting"
> >
> >   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> >
> > Also, addressed some of the RFC comments from the discussion in [1].
> >
> > Co-developed-by: Ryan Roberts
> > Signed-off-by:
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  mm/Kconfig |   8 ++++
> >  mm/zswap.c | 122 +++++++++++++++++++++++++----------------------------
> >  2 files changed, 66 insertions(+), 64 deletions(-)
> >
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 09aebca1cae3..c659fb732ec4 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -59,6 +59,14 @@ config ZSWAP_SHRINKER_DEFAULT_ON
> >           reducing the chance that cold pages will reside in the zswap pool
> >           and consume memory indefinitely.
> >
> > +config ZSWAP_STORE_THP_DEFAULT_ON
> > +       bool "Store mTHP and THP folios in zswap"
> > +       depends on ZSWAP
> > +       default n
> > +       help
> > +         If selected, zswap will process mTHP and THP folios by
> > +         compressing and storing each 4K page in the large folio.
> > +
> >  choice
> >         prompt "Default compressor"
> >         depends on ZSWAP
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 8f2e0ab34c84..16ab770546d6 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -127,6 +127,14 @@ static bool zswap_shrinker_enabled =
> IS_ENABLED(
> >                 CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
> >  module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool,
> 0644);
> >
> > +/*
> > + * Enable/disable zswap processing of mTHP folios.
> > + * For now, only zswap_store will process mTHP folios.
> > + */
> > +static bool zswap_mthp_enabled = IS_ENABLED(
> > +               CONFIG_ZSWAP_STORE_THP_DEFAULT_ON);
> > +module_param_named(mthp_enabled, zswap_mthp_enabled, bool,
> 0644);
> > +
> >  bool zswap_is_enabled(void)
> >  {
> >         return zswap_enabled;
> > @@ -1471,9 +1479,9 @@ static void zswap_delete_stored_offsets(struct
> xarray *tree,
> >   * @objcg: The folio's objcg.
> >   * @pool:  The zswap_pool to store the compressed data for the page.
> >   */
> > -static bool __maybe_unused zswap_store_page(struct folio *folio, long
> index,
> > -                                           struct obj_cgroup *objcg,
> > -                                           struct zswap_pool *pool)
> > +static bool zswap_store_page(struct folio *folio, long index,
> > +                            struct obj_cgroup *objcg,
> > +                            struct zswap_pool *pool)
> 
> As I mentioned earlier, the patch that introduced zswap_store_page()
> should have directly used it in zswap_store(). This would make this
> patch much clearer.

Sure. I will fix this in v8.

> 
> >  {
> >         swp_entry_t swp = folio->swap;
> >         int type = swp_type(swp);
> > @@ -1551,51 +1559,63 @@ static bool __maybe_unused
> zswap_store_page(struct folio *folio, long index,
> >         return false;
> >  }
> >
> > +/*
> > + * Modified to store mTHP folios. Each page in the mTHP will be
> compressed
> > + * and stored sequentially.
> > + */
> >  bool zswap_store(struct folio *folio)
> >  {
> >         long nr_pages = folio_nr_pages(folio);
> >         swp_entry_t swp = folio->swap;
> >         pgoff_t offset = swp_offset(swp);
> >         struct xarray *tree = swap_zswap_tree(swp);
> > -       struct zswap_entry *entry;
> >         struct obj_cgroup *objcg = NULL;
> >         struct mem_cgroup *memcg = NULL;
> > +       struct zswap_pool *pool;
> > +       bool ret = false;
> > +       long index;
> >
> >         VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >         VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> >
> > -       /* Large folios aren't supported */
> > -       if (folio_test_large(folio))
> > +       /* Storing large folios isn't enabled */
> 
> The comment is now stating the obvious, please remove it.

Ok. I suppose this check will also no longer be needed based on the
config knob not being needed.

> 
> > +       if (!zswap_mthp_enabled && folio_test_large(folio))
> >                 return false;
> >
> >         if (!zswap_enabled)
> > -               goto check_old;
> > +               goto reject;
> >
> > -       /* Check cgroup limits */
> > +       /*
> > +        * Check cgroup limits:
> > +        *
> > +        * The cgroup zswap limit check is done once at the beginning of an
> > +        * mTHP store, and not within zswap_store_page() for each page
> > +        * in the mTHP. We do however check the zswap pool limits at the
> > +        * start of zswap_store_page(). What this means is, the cgroup
> > +        * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> > +        * However, the per-store-page zswap pool limits check should
> > +        * hopefully trigger the cgroup aware and zswap LRU aware global
> > +        * reclaim implemented in the shrinker. If this assumption holds,
> > +        * the cgroup exceeding the zswap limits could potentially be
> > +        * resolved before the next zswap_store, and if it is not, the next
> > +        * zswap_store would fail the cgroup zswap limit check at the start.
> > +        */
> 
> I do not really like this. Allowing going one page above the limit is
> one thing, but one THP above the limit seems too much. I also don't
> like relying on the repeated limit checking in zswap_store_page(), if
> anything I think that should be batched too.
> 
> Is it too unreasonable to maintain the average compression ratio and
> use that to estimate limit checking for both memcg and global limits?
> Johannes, Nhat, any thoughts on this?

I see that Nhat has responded. Hopefully we can discuss this
in the follow-up to Nhat's comments.

Thanks,
Kanchana

> 
> >         objcg = get_obj_cgroup_from_folio(folio);
> >         if (objcg && !obj_cgroup_may_zswap(objcg)) {
> >                 memcg = get_mem_cgroup_from_objcg(objcg);
> >                 if (shrink_memcg(memcg)) {
> >                         mem_cgroup_put(memcg);
> > -                       goto reject;
> > +                       goto put_objcg;
> >                 }
> >                 mem_cgroup_put(memcg);
> >         }
> >
> >         if (zswap_check_limits())
> > -               goto reject;
> > -
> > -       /* allocate entry */
> > -       entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> > -       if (!entry) {
> > -               zswap_reject_kmemcache_fail++;
> > -               goto reject;
> > -       }
> > +               goto put_objcg;
> >
> > -       /* if entry is successfully added, it keeps the reference */
> > -       entry->pool = zswap_pool_current_get();
> > -       if (!entry->pool)
> > -               goto freepage;
> > +       pool = zswap_pool_current_get();
> > +       if (!pool)
> > +               goto put_objcg;
> >
> >         if (objcg) {
> >                 memcg = get_mem_cgroup_from_objcg(objcg);
> > @@ -1606,60 +1626,34 @@ bool zswap_store(struct folio *folio)
> >                 mem_cgroup_put(memcg);
> >         }
> >
> > -       if (!zswap_compress(&folio->page, entry))
> > -               goto put_pool;
> > -
> > -       entry->swpentry = swp;
> > -       entry->objcg = objcg;
> > -       entry->referenced = true;
> > -
> > -       if (!zswap_store_entry(tree, entry))
> > -               goto store_failed;
> > -
> > -       if (objcg) {
> > -               obj_cgroup_charge_zswap(objcg, entry->length);
> > -               count_objcg_event(objcg, ZSWPOUT);
> > -       }
> > -
> >         /*
> > -        * We finish initializing the entry while it's already in xarray.
> > -        * This is safe because:
> > -        *
> > -        * 1. Concurrent stores and invalidations are excluded by folio lock.
> > -        *
> > -        * 2. Writeback is excluded by the entry not being on the LRU yet.
> > -        *    The publishing order matters to prevent writeback from seeing
> > -        *    an incoherent entry.
> > +        * Store each page of the folio as a separate entry. If we fail to store
> > +        * a page, unwind by removing all the previous pages we stored.
> >          */
> > -       if (entry->length) {
> > -               INIT_LIST_HEAD(&entry->lru);
> > -               zswap_lru_add(&zswap_list_lru, entry);
> > +       for (index = 0; index < nr_pages; ++index) {
> > +               if (!zswap_store_page(folio, index, objcg, pool))
> > +                       goto put_pool;
> >         }
> >
> > -       /* update stats */
> > -       atomic_inc(&zswap_stored_pages);
> > -       count_vm_event(ZSWPOUT);
> > -
> > -       return true;
> > +       ret = true;
> >
> > -store_failed:
> > -       zpool_free(entry->pool->zpool, entry->handle);
> >  put_pool:
> > -       zswap_pool_put(entry->pool);
> > -freepage:
> > -       zswap_entry_cache_free(entry);
> > -reject:
> > +       zswap_pool_put(pool);
> > +put_objcg:
> >         obj_cgroup_put(objcg);
> >         if (zswap_pool_reached_full)
> >                 queue_work(shrink_wq, &zswap_shrink_work);
> > -check_old:
> > +reject:
> >         /*
> > -        * If the zswap store fails or zswap is disabled, we must invalidate the
> > -        * possibly stale entry which was previously stored at this offset.
> > -        * Otherwise, writeback could overwrite the new data in the swapfile.
> > +        * If the zswap store fails or zswap is disabled, we must invalidate
> > +        * the possibly stale entries which were previously stored at the
> > +        * offsets corresponding to each page of the folio. Otherwise,
> > +        * writeback could overwrite the new data in the swapfile.
> >          */
> > -       zswap_delete_stored_offsets(tree, offset, nr_pages);
> > -       return false;
> > +       if (!ret)
> > +               zswap_delete_stored_offsets(tree, offset, nr_pages);
> > +
> > +       return ret;
> >  }
> >
> >  bool zswap_load(struct folio *folio)
> > --
> > 2.27.0
> >
Nhat Pham Sept. 24, 2024, 11:11 p.m. UTC | #13
On Tue, Sep 24, 2024 at 2:38 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
>
> We can also do what we discussed before about double charging. The
> pages that are being reclaimed are already charged, so technically we
> don't need to charge them again. We can uncharge the difference
> between compressed and uncompressed sizes after compression and call
> it a day. This fixes the limit checking and the double charging in one
> go.
> I am a little bit nervous though about zswap uncharging the pages from
> under reclaim, there are likely further accesses of the page memcg
> after zswap. Maybe we can plumb the info back to reclaim or set a flag
> on the page to avoid uncharging it when it's freed.

Hmm this is just for memory usage charging, no? The problem here is
the zswap usage (zswap.current), and its relation to the limit.

One thing we can do is check the zswap usage against the limit for
every subpage, but that's likely expensive...?

With the new atomic counters Joshua is working on, we can
check-and-charge at the same time, after we have compressed the whole
large folio, like this:

for (memcg = original_memcg; !mem_cgroup_is_root(memcg);
     memcg = parent_mem_cgroup(memcg)) {
     old_usage = atomic_read(&memcg->zswap);

     do {
        new_usage = old_usage + size;
        if (new_usage > limit) {
           /* undo charging of descendants, then return false */
        }
     } while (!atomic_try_cmpxchg(&memcg->zswap, &old_usage, new_usage));
}

But I don't know what we can do in the current design. I gave it some
more thought, and even if we only check after we know the size, we can
still potentially overshoot the limit :(
Sridhar, Kanchana P Sept. 24, 2024, 11:21 p.m. UTC | #14
> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Tuesday, September 24, 2024 1:51 PM
> To: Yosry Ahmed <yosryahmed@google.com>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
> 
> On Tue, Sep 24, 2024 at 12:39 PM Yosry Ahmed <yosryahmed@google.com>
> wrote:
> >
> > On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
> > > +        * The cgroup zswap limit check is done once at the beginning of an
> > > +        * mTHP store, and not within zswap_store_page() for each page
> > > +        * in the mTHP. We do however check the zswap pool limits at the
> > > +        * start of zswap_store_page(). What this means is, the cgroup
> > > +        * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> > > +        * However, the per-store-page zswap pool limits check should
> > > +        * hopefully trigger the cgroup aware and zswap LRU aware global
> > > +        * reclaim implemented in the shrinker. If this assumption holds,
> > > +        * the cgroup exceeding the zswap limits could potentially be
> > > +        * resolved before the next zswap_store, and if it is not, the next
> > > +        * zswap_store would fail the cgroup zswap limit check at the start.
> > > +        */
> >
> > I do not really like this. Allowing going one page above the limit is
> > one thing, but one THP above the limit seems too much. I also don't
> 
> Hmm what if you have multiple concurrent zswap stores, from different
> tasks but the same cgroup? If none of them has charged, they would all
> get greenlit, and charge towards the cgroup...
> 
> So technically the zswap limit checking is already best-effort only.
> But now, instead of one page per violation, it's 512 pages per
> violation :)
> 
> Yeah this can be bad. I think this is only safe if you only use
> zswap.max as a binary knob (0 or max)...
> 
> > like relying on the repeated limit checking in zswap_store_page(), if
> > anything I think that should be batched too.
> >
> > Is it too unreasonable to maintain the average compression ratio and
> > use that to estimate limit checking for both memcg and global limits?
> > Johannes, Nhat, any thoughts on this?
> 
> I remember asking about this, but past Nhat might have relented :)
> 
> https://lore.kernel.org/linux-mm/CAKEwX=PfAMZ2qJtwKwJsVx3TZWxV5z2ZaU1Epk1UD=DBdMsjFA@mail.gmail.com/
> 
> We can do limit checking and charging after compression is done, but
> that's a lot of code change (might not even be possible)... It will,
> however, allow us to do charging + checking in one go (rather than
> doing it 8, 16, or 512 times)
> 
> Another thing we can do is to register a zswap writeback after the
> zswap store attempts to clean up excess capacity. Not sure what will
> happen if zswap writeback is disabled for the cgroup though :)
> 
> If it's too hard, the average estimate could be a decent compromise,
> until we figure something smarter.

Thanks Yosry and Nhat for these insights. This is how I was viewing
this scenario: I thought of incrementally (per subpage store) calling
zswap_pool_get() and doing the limit checks, followed by shrinker
invocations on error, so that different concurrent stores can make
progress, rather than favoring only one process's mTHP store just
because enough zpool space happens to be available (e.g. based on a
compression ratio estimate).

Besides simplicity and no added overhead in the regular cases, I was
thinking this approach would have minimal impact on the process(es)
that see the zswap limit being exceeded, and that this would be better
than preemptively checking for the entire mTHP and failing (this could
also complicate things where no one makes progress because multiple
processes run the batch checks and fail, when realistically one/many
could have triggered the shrinker before erroring out, and at least
one/few could have made progress).

Another potential solution could be based on experimental data about
mTHP swapout failures for a given setup: say, reducing the zswap zpool
max limit and/or the acceptance threshold?

Would appreciate your suggestions on how to proceed as far as
the limit checks.

Thanks,
Kanchana
Sridhar, Kanchana P Sept. 25, 2024, 12:05 a.m. UTC | #15
> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Tuesday, September 24, 2024 4:11 PM
> To: Yosry Ahmed <yosryahmed@google.com>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>;
> joshua.hahnjy@gmail.com
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
> 
> On Tue, Sep 24, 2024 at 2:38 PM Yosry Ahmed <yosryahmed@google.com>
> wrote:
> >
> >
> > We can also do what we discussed before about double charging. The
> > pages that are being reclaimed are already charged, so technically we
> > don't need to charge them again. We can uncharge the difference
> > between compressed and uncompressed sizes after compression and call
> > it a day. This fixes the limit checking and the double charging in one
> > go.
> > I am a little bit nervous though about zswap uncharging the pages from
> > under reclaim, there are likely further accesses of the page memcg
> > after zswap. Maybe we can plumb the info back to reclaim or set a flag
> > on the page to avoid uncharging it when it's freed.
> 
> Hmm this is just for memory usage charging, no? The problem here is
> the zswap usage (zswap.current), and its relation to the limit.
> 
> One thing we can do is check the zswap usage against the limit for
> every subpage, but that's likely expensive...?

This is the approach currently implemented in v7.
Data gathered doesn’t indicate a performance issue with this
specific workload in the two scenarios validated, namely,
zswap-4K vs. zswap-mTHP and SSD-mTHP vs. zswap-mTHP (we only
see performance gains with explainable sys time increase).

Of course, the existing implementation could be a baseline for
validating performance of other approaches, e.g., checking zswap usage
per mTHP. However, these other approaches would also need to be
evaluated for their more global, multi-instance implications, i.e.,
whether all processes can still make progress.

> 
> With the new atomic counters Joshua is working on, we can
> check-and-charge at the same time, after we have compressed the whole
> large folio, like this:
> 
> for (memcg = original_memcg; !mem_cgroup_is_root(memcg);
>      memcg = parent_mem_cgroup(memcg)) {
>      old_usage = atomic_read(&memcg->zswap);
> 
>      do {
>         new_usage = old_usage + size;
>         if (new_usage > limit) {
>            /* undo charging of descendants, then return false */
>         }
>       } while (!atomic_try_cmpxchg(&memcg->zswap, &old_usage, new_usage));
> }
> 
> But I don't know what we can do in the current design. I gave it some
> more thought, and even if we only check after we know the size, we can
> still potentially overshoot the limit :(

I agree. Moreover, these checks based on estimated ratio or compressed size
could also add overhead in the normal case where we are not near the usage
limits.
Yosry Ahmed Sept. 25, 2024, 12:52 a.m. UTC | #16
On Tue, Sep 24, 2024 at 4:11 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Tue, Sep 24, 2024 at 2:38 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> >
> > We can also do what we discussed before about double charging. The
> > pages that are being reclaimed are already charged, so technically we
> > don't need to charge them again. We can uncharge the difference
> > between compressed and uncompressed sizes after compression and call
> > it a day. This fixes the limit checking and the double charging in one
> > go.
> > I am a little bit nervous though about zswap uncharging the pages from
> > under reclaim, there are likely further accesses of the page memcg
> > after zswap. Maybe we can plumb the info back to reclaim or set a flag
> > on the page to avoid uncharging it when it's freed.
>
> Hmm this is just for memory usage charging, no? The problem here is
> the zswap usage (zswap.current), and its relation to the limit.
>
> One thing we can do is check the zswap usage against the limit for
> every subpage, but that's likely expensive...?

Ah yes, I totally missed this.

>
> With the new atomic counters Joshua is working on, we can
> check-and-charge at the same time, after we have compressed the whole
> large folio, like this:
>
> for (memcg = original_memcg; !mem_cgroup_is_root(memcg);
>      memcg = parent_mem_cgroup(memcg)) {
>      old_usage = atomic_read(&memcg->zswap);
>
>      do {
>         new_usage = old_usage + size;
>         if (new_usage > limit) {
>            /* undo charging of descendants, then return false */
>         }
>       } while (!atomic_try_cmpxchg(&memcg->zswap, &old_usage, new_usage));
> }
>
> But I don't know what we can do in the current design. I gave it some
> more thought, and even if we only check after we know the size, we can
> still potentially overshoot the limit :(

Yeah it's difficult because if we check the limit before compressing,
we have to estimate the compressed size or check using the
uncompressed size. If we wait until after compression we will either
overshoot the limit or free the compressed page and fallback to swap.

Maybe a good compromise is to do the check before compression with an
estimate based on historical compression ratio, and then do the actual
charging after the compression and allow overshooting, hopefully it's
not too much if our estimate is good. We can also improve this later
by adding a backoff mechanism where we make more conservative
estimates the more we overshoot the limit.
Johannes Weiner Sept. 25, 2024, 1:40 p.m. UTC | #17
On Tue, Sep 24, 2024 at 12:38:32PM -0700, Yosry Ahmed wrote:
> On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > zswap_store() will now store mTHP and PMD-size THP folios by compressing
> > them page by page.
> >
> > This patch provides a sequential implementation of storing an mTHP in
> > zswap_store() by iterating through each page in the folio to compress
> > and store it in the zswap zpool.
> >
> > Towards this goal, zswap_compress() is modified to take a page instead
> > of a folio as input.
> >
> > Each page's swap offset is stored as a separate zswap entry.
> >
> > If an error is encountered during the store of any page in the mTHP,
> > all previous pages/entries stored will be invalidated. Thus, an mTHP
> > is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.
> >
> > This forms the basis for building batching of pages during zswap store
> > of large folios by compressing batches of up to say, 8 pages in an
> > mTHP in parallel in hardware, with the Intel In-Memory Analytics
> > Accelerator (Intel IAA).
> >
> > A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
> > will enable/disable zswap storing of (m)THP. The corresponding tunable
> > zswap module parameter is "mthp_enabled".
> >
> > This change reuses and adapts the functionality in Ryan Roberts' RFC
> > patch [1]:
> >
> >   "[RFC,v1] mm: zswap: Store large folios without splitting"
> >
> >   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> >
> > Also, addressed some of the RFC comments from the discussion in [1].
> >
> > Co-developed-by: Ryan Roberts
> > Signed-off-by:
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  mm/Kconfig |   8 ++++
> >  mm/zswap.c | 122 +++++++++++++++++++++++++----------------------------
> >  2 files changed, 66 insertions(+), 64 deletions(-)
> >
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 09aebca1cae3..c659fb732ec4 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -59,6 +59,14 @@ config ZSWAP_SHRINKER_DEFAULT_ON
> >           reducing the chance that cold pages will reside in the zswap pool
> >           and consume memory indefinitely.
> >
> > +config ZSWAP_STORE_THP_DEFAULT_ON
> > +       bool "Store mTHP and THP folios in zswap"
> > +       depends on ZSWAP
> > +       default n
> > +       help
> > +         If selected, zswap will process mTHP and THP folios by
> > +         compressing and storing each 4K page in the large folio.
> > +
> >  choice
> >         prompt "Default compressor"
> >         depends on ZSWAP
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 8f2e0ab34c84..16ab770546d6 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -127,6 +127,14 @@ static bool zswap_shrinker_enabled = IS_ENABLED(
> >                 CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
> >  module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
> >
> > +/*
> > + * Enable/disable zswap processing of mTHP folios.
> > + * For now, only zswap_store will process mTHP folios.
> > + */
> > +static bool zswap_mthp_enabled = IS_ENABLED(
> > +               CONFIG_ZSWAP_STORE_THP_DEFAULT_ON);
> > +module_param_named(mthp_enabled, zswap_mthp_enabled, bool, 0644);
> > +
> >  bool zswap_is_enabled(void)
> >  {
> >         return zswap_enabled;
> > @@ -1471,9 +1479,9 @@ static void zswap_delete_stored_offsets(struct xarray *tree,
> >   * @objcg: The folio's objcg.
> >   * @pool:  The zswap_pool to store the compressed data for the page.
> >   */
> > -static bool __maybe_unused zswap_store_page(struct folio *folio, long index,
> > -                                           struct obj_cgroup *objcg,
> > -                                           struct zswap_pool *pool)
> > +static bool zswap_store_page(struct folio *folio, long index,
> > +                            struct obj_cgroup *objcg,
> > +                            struct zswap_pool *pool)
> 
> As I mentioned earlier, the patch that introduced zswap_store_page()
> should have directly used it in zswap_store(). This would make this
> patch much clearer.
> 
> >  {
> >         swp_entry_t swp = folio->swap;
> >         int type = swp_type(swp);
> > @@ -1551,51 +1559,63 @@ static bool __maybe_unused zswap_store_page(struct folio *folio, long index,
> >         return false;
> >  }
> >
> > +/*
> > + * Modified to store mTHP folios. Each page in the mTHP will be compressed
> > + * and stored sequentially.
> > + */
> >  bool zswap_store(struct folio *folio)
> >  {
> >         long nr_pages = folio_nr_pages(folio);
> >         swp_entry_t swp = folio->swap;
> >         pgoff_t offset = swp_offset(swp);
> >         struct xarray *tree = swap_zswap_tree(swp);
> > -       struct zswap_entry *entry;
> >         struct obj_cgroup *objcg = NULL;
> >         struct mem_cgroup *memcg = NULL;
> > +       struct zswap_pool *pool;
> > +       bool ret = false;
> > +       long index;
> >
> >         VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >         VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> >
> > -       /* Large folios aren't supported */
> > -       if (folio_test_large(folio))
> > +       /* Storing large folios isn't enabled */
> 
> The comment is now stating the obvious, please remove it.
> 
> > +       if (!zswap_mthp_enabled && folio_test_large(folio))
> >                 return false;
> >
> >         if (!zswap_enabled)
> > -               goto check_old;
> > +               goto reject;
> >
> > -       /* Check cgroup limits */
> > +       /*
> > +        * Check cgroup limits:
> > +        *
> > +        * The cgroup zswap limit check is done once at the beginning of an
> > +        * mTHP store, and not within zswap_store_page() for each page
> > +        * in the mTHP. We do however check the zswap pool limits at the
> > +        * start of zswap_store_page(). What this means is, the cgroup
> > +        * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> > +        * However, the per-store-page zswap pool limits check should
> > +        * hopefully trigger the cgroup aware and zswap LRU aware global
> > +        * reclaim implemented in the shrinker. If this assumption holds,
> > +        * the cgroup exceeding the zswap limits could potentially be
> > +        * resolved before the next zswap_store, and if it is not, the next
> > +        * zswap_store would fail the cgroup zswap limit check at the start.
> > +        */
> 
> I do not really like this. Allowing going one page above the limit is
> one thing, but one THP above the limit seems too much. I also don't
> like relying on the repeated limit checking in zswap_store_page(), if
> anything I think that should be batched too.
> 
> Is it too unreasonable to maintain the average compression ratio and
> use that to estimate limit checking for both memcg and global limits?
> Johannes, Nhat, any thoughts on this?

I honestly don't think it's much of an issue. The global limit is
huge, and the cgroup limit is to the best of my knowledge only used as
a binary switch. Setting a non-binary limit - global or cgroup - seems
like a bit of an obscure usecase to me, because in the vast majority
of cases it's preferable to keep compresing over declaring OOM.

And even if you do have some granular limit, the workload size scales
with it. It's not like you have a thousand THPs in a 10M cgroup.

If this ever becomes an issue, we can handle it in a fastpath-slowpath
scheme: check the limit up front for fast-path failure if we're
already maxed out, just like now; then make obj_cgroup_charge_zswap()
atomically charge against zswap.max and unwind the store if we raced.

For now, I would just keep the simple version we currently have: check
once in zswap_store() and then just go ahead for the whole folio.
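
(For illustration only: the slow-path half of that scheme could look
roughly like the sketch below. The zswap_bytes / zswap_max_bytes memcg
fields are made-up placeholders, not existing API; the fast path would
stay the cheap upfront check we already do.)

/* Sketch only: zswap_bytes/zswap_max_bytes are placeholder fields. */
static bool zswap_charge_limited(struct mem_cgroup *memcg, size_t size)
{
	long max = READ_ONCE(memcg->zswap_max_bytes);
	long old = atomic_long_read(&memcg->zswap_bytes);
	long new;

	do {
		new = old + size;
		if (new > max)
			return false;	/* raced past the limit; caller unwinds the store */
	} while (!atomic_long_try_cmpxchg(&memcg->zswap_bytes, &old, new));

	return true;
}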
Johannes Weiner Sept. 25, 2024, 2:27 p.m. UTC | #18
On Mon, Sep 23, 2024 at 06:17:07PM -0700, Kanchana P Sridhar wrote:
> zswap_store() will now store mTHP and PMD-size THP folios by compressing

The hugepage terminology throughout the patches is a bit convoluted.

There is no real distinction in this code between PMD-size THPs and
sub-PMD-sized mTHPs e.g. In particular, I think "mTHP" made sense when
they were added, to distinguish them from conventional THPs. But using
this term going forward just causes confusion, IMO.

We're going through a big effort in the codebase to call all of these
things simply "folios" - which stands for "one or more pages". If you
want to emphasize the "more than one page", the convention is to call
it a "large folio". (If you need to emphasize that it's PMD size -
which doesn't apply to these patches, but just for the record - the
convention is "pmd-mappable folio".)

So what this patch set does is "support large folios in zswap".

> @@ -1551,51 +1559,63 @@ static bool __maybe_unused zswap_store_page(struct folio *folio, long index,
>  	return false;
>  }
>  
> +/*
> + * Modified to store mTHP folios. Each page in the mTHP will be compressed
> + * and stored sequentially.
> + */

This is a changelog, not a code comment ;) Please delete it.

>  bool zswap_store(struct folio *folio)
>  {
>  	long nr_pages = folio_nr_pages(folio);
>  	swp_entry_t swp = folio->swap;
>  	pgoff_t offset = swp_offset(swp);
>  	struct xarray *tree = swap_zswap_tree(swp);
> -	struct zswap_entry *entry;
>  	struct obj_cgroup *objcg = NULL;
>  	struct mem_cgroup *memcg = NULL;
> +	struct zswap_pool *pool;
> +	bool ret = false;
> +	long index;
>  
>  	VM_WARN_ON_ONCE(!folio_test_locked(folio));
>  	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
>  
> -	/* Large folios aren't supported */
> -	if (folio_test_large(folio))
> +	/* Storing large folios isn't enabled */
> +	if (!zswap_mthp_enabled && folio_test_large(folio))
>  		return false;
>  
>  	if (!zswap_enabled)
> -		goto check_old;
> +		goto reject;
>  
> -	/* Check cgroup limits */
> +	/*
> +	 * Check cgroup limits:
> +	 *
> +	 * The cgroup zswap limit check is done once at the beginning of an
> +	 * mTHP store, and not within zswap_store_page() for each page
> +	 * in the mTHP. We do however check the zswap pool limits at the

Use "folio" and "large folio" as appropriate here and throughout.
Yosry Ahmed Sept. 25, 2024, 6:17 p.m. UTC | #19
On Wed, Sep 25, 2024 at 7:27 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Sep 23, 2024 at 06:17:07PM -0700, Kanchana P Sridhar wrote:
> > zswap_store() will now store mTHP and PMD-size THP folios by compressing
>
> The hugepage terminology throughout the patches is a bit convoluted.
>
> There is no real distinction in this code between PMD-size THPs and
> sub-PMD-sized mTHPs e.g. In particular, I think "mTHP" made sense when
> they were added, to distinguish them from conventional THPs. But using
> this term going forward just causes confusion, IMO.
>
> We're going through a big effort in the codebase to call all of these
> things simply "folios" - which stands for "one or more pages". If you
> want to emphasize the "more than one page", the convention is to call
> it a "large folio". (If you need to emphasize that it's PMD size -
> which doesn't apply to these patches, but just for the record - the
> convention is "pmd-mappable folio".)
>
> So what this patch set does is "support large folios in zswap".

Agreed on all of this, except it should be "support large folios in
zswap _stores". We don't really support loading large folios.
Yosry Ahmed Sept. 25, 2024, 6:30 p.m. UTC | #20
[..]
> > > +       /*
> > > +        * Check cgroup limits:
> > > +        *
> > > +        * The cgroup zswap limit check is done once at the beginning of an
> > > +        * mTHP store, and not within zswap_store_page() for each page
> > > +        * in the mTHP. We do however check the zswap pool limits at the
> > > +        * start of zswap_store_page(). What this means is, the cgroup
> > > +        * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> > > +        * However, the per-store-page zswap pool limits check should
> > > +        * hopefully trigger the cgroup aware and zswap LRU aware global
> > > +        * reclaim implemented in the shrinker. If this assumption holds,
> > > +        * the cgroup exceeding the zswap limits could potentially be
> > > +        * resolved before the next zswap_store, and if it is not, the next
> > > +        * zswap_store would fail the cgroup zswap limit check at the start.
> > > +        */
> >
> > I do not really like this. Allowing going one page above the limit is
> > one thing, but one THP above the limit seems too much. I also don't
> > like relying on the repeated limit checking in zswap_store_page(), if
> > anything I think that should be batched too.
> >
> > Is it too unreasonable to maintain the average compression ratio and
> > use that to estimate limit checking for both memcg and global limits?
> > Johannes, Nhat, any thoughts on this?
>
> I honestly don't think it's much of an issue. The global limit is
> huge, and the cgroup limit is to the best of my knowledge only used as
> a binary switch. Setting a non-binary limit - global or cgroup - seems
> like a bit of an obscure usecase to me, because in the vast majority
> of cases it's preferable to keep compresing over declaring OOM.
>
> And even if you do have some granular limit, the workload size scales
> with it. It's not like you have a thousand THPs in a 10M cgroup.

The memcg limit and zswap limit can be disproportionate, although that
shouldn't be common.

>
> If this ever becomes an issue, we can handle it in a fastpath-slowpath
> scheme: check the limit up front for fast-path failure if we're
> already maxed out, just like now; then make obj_cgroup_charge_zswap()
> atomically charge against zswap.max and unwind the store if we raced.
>
> For now, I would just keep the simple version we currently have: check
> once in zswap_store() and then just go ahead for the whole folio.

I am not totally against this but I feel like this is too optimistic.
I think we can keep it simple-ish by maintaining an ewma for the
compression ratio, we already have primitives for this (see
DECLARE_EWMA).

Then in zswap_store(), we can use the ewma to estimate the compressed
size and use it to do the memcg and global limit checks once, like we
do today. Instead of just checking if we are below the limits, we
check if we have enough headroom for the estimated compressed size.
Then we call zswap_store_page() to do the per-page stuff, then do
batched charging and stats updates.

If you think that's an overkill we can keep doing the limit checks as
we do today,
but I would still like to see batching of all the limit checks,
charging, and stats updates. It makes little sense otherwise.
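
(Very roughly, and only to illustrate the DECLARE_EWMA part; the
zswap_compr_bytes_avg global and the helpers below are made up, not
something that exists today:)

#include <linux/average.h>
#include <linux/mm.h>

/* EWMA of observed compressed bytes per page: 4 bits precision, 1/16 weight. */
DECLARE_EWMA(zswap_compr_bytes, 4, 16)

static struct ewma_zswap_compr_bytes zswap_compr_bytes_avg;

/* Feed in each page's compressed length after a successful compression. */
static void zswap_note_compressed_size(unsigned int dlen)
{
	ewma_zswap_compr_bytes_add(&zswap_compr_bytes_avg, dlen);
}

/* Estimate a folio's compressed footprint for the upfront headroom checks. */
static unsigned long zswap_estimate_compressed_size(struct folio *folio)
{
	unsigned long per_page = ewma_zswap_compr_bytes_read(&zswap_compr_bytes_avg);

	/* No history yet: assume no compression at all. */
	if (!per_page)
		per_page = PAGE_SIZE;

	return per_page * folio_nr_pages(folio);
}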
Sridhar, Kanchana P Sept. 25, 2024, 6:48 p.m. UTC | #21
> -----Original Message-----
> From: Johannes Weiner <hannes@cmpxchg.org>
> Sent: Wednesday, September 25, 2024 7:28 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
> 
> On Mon, Sep 23, 2024 at 06:17:07PM -0700, Kanchana P Sridhar wrote:
> > zswap_store() will now store mTHP and PMD-size THP folios by compressing
> 
> The hugepage terminology throughout the patches is a bit convoluted.
> 
> There is no real distinction in this code between PMD-size THPs and
> sub-PMD-sized mTHPs e.g. In particular, I think "mTHP" made sense when
> they were added, to distinguish them from conventional THPs. But using
> this term going forward just causes confusion, IMO.
> 
> We're going through a big effort in the codebase to call all of these
> things simply "folios" - which stands for "one or more pages". If you
> want to emphasize the "more than one page", the convention is to call
> it a "large folio". (If you need to emphasize that it's PMD size -
> which doesn't apply to these patches, but just for the record - the
> convention is "pmd-mappable folio".)
> 
> So what this patch set does is "support large folios in zswap".

Sure. Will modify this to be "support large folios in zswap _stores"
as per Yosry's follow-up clarification.

> 
> > @@ -1551,51 +1559,63 @@ static bool __maybe_unused
> zswap_store_page(struct folio *folio, long index,
> >  	return false;
> >  }
> >
> > +/*
> > + * Modified to store mTHP folios. Each page in the mTHP will be
> compressed
> > + * and stored sequentially.
> > + */
> 
> This is a changelog, not a code comment ;) Please delete it.

Ok, sure.

> 
> >  bool zswap_store(struct folio *folio)
> >  {
> >  	long nr_pages = folio_nr_pages(folio);
> >  	swp_entry_t swp = folio->swap;
> >  	pgoff_t offset = swp_offset(swp);
> >  	struct xarray *tree = swap_zswap_tree(swp);
> > -	struct zswap_entry *entry;
> >  	struct obj_cgroup *objcg = NULL;
> >  	struct mem_cgroup *memcg = NULL;
> > +	struct zswap_pool *pool;
> > +	bool ret = false;
> > +	long index;
> >
> >  	VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >  	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> >
> > -	/* Large folios aren't supported */
> > -	if (folio_test_large(folio))
> > +	/* Storing large folios isn't enabled */
> > +	if (!zswap_mthp_enabled && folio_test_large(folio))
> >  		return false;
> >
> >  	if (!zswap_enabled)
> > -		goto check_old;
> > +		goto reject;
> >
> > -	/* Check cgroup limits */
> > +	/*
> > +	 * Check cgroup limits:
> > +	 *
> > +	 * The cgroup zswap limit check is done once at the beginning of an
> > +	 * mTHP store, and not within zswap_store_page() for each page
> > +	 * in the mTHP. We do however check the zswap pool limits at the
> 
> Use "folio" and "large folio" as appropriate here and throughout.

Sounds good.

Thanks,
Kanchana
Sridhar, Kanchana P Sept. 25, 2024, 7:10 p.m. UTC | #22
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Wednesday, September 25, 2024 11:31 AM
> To: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
> 
> [..]
> > > > +       /*
> > > > +        * Check cgroup limits:
> > > > +        *
> > > > +        * The cgroup zswap limit check is done once at the beginning of an
> > > > +        * mTHP store, and not within zswap_store_page() for each page
> > > > +        * in the mTHP. We do however check the zswap pool limits at the
> > > > +        * start of zswap_store_page(). What this means is, the cgroup
> > > > +        * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> > > > +        * However, the per-store-page zswap pool limits check should
> > > > +        * hopefully trigger the cgroup aware and zswap LRU aware global
> > > > +        * reclaim implemented in the shrinker. If this assumption holds,
> > > > +        * the cgroup exceeding the zswap limits could potentially be
> > > > +        * resolved before the next zswap_store, and if it is not, the next
> > > > +        * zswap_store would fail the cgroup zswap limit check at the start.
> > > > +        */
> > >
> > > I do not really like this. Allowing going one page above the limit is
> > > one thing, but one THP above the limit seems too much. I also don't
> > > like relying on the repeated limit checking in zswap_store_page(), if
> > > anything I think that should be batched too.
> > >
> > > Is it too unreasonable to maintain the average compression ratio and
> > > use that to estimate limit checking for both memcg and global limits?
> > > Johannes, Nhat, any thoughts on this?
> >
> > I honestly don't think it's much of an issue. The global limit is
> > huge, and the cgroup limit is to the best of my knowledge only used as
> > a binary switch. Setting a non-binary limit - global or cgroup - seems
> > like a bit of an obscure usecase to me, because in the vast majority
> > of cases it's preferable to keep compressing over declaring OOM.
> >
> > And even if you do have some granular limit, the workload size scales
> > with it. It's not like you have a thousand THPs in a 10M cgroup.
> 
> The memcg limit and zswap limit can be disproportionate, although that
> shouldn't be common.
> 
> >
> > If this ever becomes an issue, we can handle it in a fastpath-slowpath
> > scheme: check the limit up front for fast-path failure if we're
> > already maxed out, just like now; then make obj_cgroup_charge_zswap()
> > atomically charge against zswap.max and unwind the store if we raced.
> >
> > For now, I would just keep the simple version we currently have: check
> > once in zswap_store() and then just go ahead for the whole folio.
> 
> I am not totally against this but I feel like this is too optimistic.
> I think we can keep it simple-ish by maintaining an ewma for the
> compression ratio, we already have primitives for this (see
> DECLARE_EWMA).
> 
> Then in zswap_store(), we can use the ewma to estimate the compressed
> size and use it to do the memcg and global limit checks once, like we
> do today. Instead of just checking if we are below the limits, we
> check if we have enough headroom for the estimated compressed size.
> Then we call zswap_store_page() to do the per-page stuff, then do
> batched charging and stats updates.
> 
> If you think that's an overkill we can keep doing the limit checks as
> we do today,
> but I would still like to see batching of all the limit checks,
> charging, and stats updates. It makes little sense otherwise.

Thanks Johannes and Yosry for these suggestions and pointers.
I think there is general agreement about the batch charging and
zswap_stored_pages/stats updates. Yosry,  does "batching of limit
checks" imply the same as a simple check for being over the cgroup
limit at the start of zswap_store and not doing this check in
zswap_store_page? Does this also imply a zswap_pool_get_many()?
Would appreciate it if you can help clarify.

The main question in my mind about using the EWMA checks is,
will it add overhead to the normal zswap reclaim path; and if so,
would a simple limit check at the start of zswap_store as suggested
by Johannes suffice. I can run a few experiments to quantify this
overhead, and maybe we can revisit this?

Thanks,
Kanchana
Johannes Weiner Sept. 25, 2024, 7:20 p.m. UTC | #23
On Wed, Sep 25, 2024 at 11:30:34AM -0700, Yosry Ahmed wrote:
> Johannes wrote:
> > If this ever becomes an issue, we can handle it in a fastpath-slowpath
> > scheme: check the limit up front for fast-path failure if we're
> > already maxed out, just like now; then make obj_cgroup_charge_zswap()
> > atomically charge against zswap.max and unwind the store if we raced.
> >
> > For now, I would just keep the simple version we currently have: check
> > once in zswap_store() and then just go ahead for the whole folio.
> 
> I am not totally against this but I feel like this is too optimistic.
> I think we can keep it simple-ish by maintaining an ewma for the
> compression ratio, we already have primitives for this (see
> DECLARE_EWMA).
> 
> Then in zswap_store(), we can use the ewma to estimate the compressed
> size and use it to do the memcg and global limit checks once, like we
> do today. Instead of just checking if we are below the limits, we
> check if we have enough headroom for the estimated compressed size.
> Then we call zswap_store_page() to do the per-page stuff, then do
> batched charging and stats updates.

I'm not sure what you gain from making a non-atomic check precise. You
can get a hundred threads determining down precisely that *their*
store will fit exactly into the last 800kB before the limit.

> If you think that's an overkill we can keep doing the limit checks as
> we do today,

I just don't see how it would make a practical difference.

What would make a difference is atomic transactional charging of the
compressed size, and unwinding on failure - with the upfront check to
avoid pointlessly compressing (outside of race conditions).

And I'm not against doing that in general, I am just against doing it
per default.

It's a lot of complexity, and like I said, the practical usecase for
limiting zswap memory to begin with is quite unclear to me. Zswap is
not a limited resource. It's just memory. And you already had the
memory for the uncompressed copy. So it's a bit strange to me to say
"you have compressed your memory enough, so now you get sent to disk
(or we declare OOM)". What would be a reason to limit it?

It sort of makes sense as a binary switch, but I don't get the usecase
for a granular limit. (And I blame my own cowardice for making the
cgroup knob a limit, to keep options open, instead of a switch.)

All that to say, this would be better in a follow-up patch. We allow
overshooting now, it's not clear how overshooting by a larger amount
makes a categorical difference.

> but I would still like to see batching of all the limit checks,
> charging, and stats updates. It makes little sense otherwise.

Definitely. One check, one charge, one stat update per folio.
Yosry Ahmed Sept. 25, 2024, 7:39 p.m. UTC | #24
On Wed, Sep 25, 2024 at 12:20 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Sep 25, 2024 at 11:30:34AM -0700, Yosry Ahmed wrote:
> > Johannes wrote:
> > > If this ever becomes an issue, we can handle it in a fastpath-slowpath
> > > scheme: check the limit up front for fast-path failure if we're
> > > already maxed out, just like now; then make obj_cgroup_charge_zswap()
> > > atomically charge against zswap.max and unwind the store if we raced.
> > >
> > > For now, I would just keep the simple version we currently have: check
> > > once in zswap_store() and then just go ahead for the whole folio.
> >
> > I am not totally against this but I feel like this is too optimistic.
> > I think we can keep it simple-ish by maintaining an ewma for the
> > compression ratio, we already have primitives for this (see
> > DECLARE_EWMA).
> >
> > Then in zswap_store(), we can use the ewma to estimate the compressed
> > size and use it to do the memcg and global limit checks once, like we
> > do today. Instead of just checking if we are below the limits, we
> > check if we have enough headroom for the estimated compressed size.
> > Then we call zswap_store_page() to do the per-page stuff, then do
> > batched charging and stats updates.
>
> I'm not sure what you gain from making a non-atomic check precise. You
> can get a hundred threads determining down precisely that *their*
> store will fit exactly into the last 800kB before the limit.

We just get to avoid overshooting in cases where we know we probably
can't fit it anyway. If we have 4KB left and we are trying to compress
a 2MB THP, for example. It just makes the upfront check to avoid
pointless compression a little bit more meaningful.

>
> > If you think that's an overkill we can keep doing the limit checks as
> > we do today,
>
> I just don't see how it would make a practical difference.
>
> What would make a difference is atomic transactional charging of the
> compressed size, and unwinding on failure - with the upfront check to
> avoid pointlessly compressing (outside of race conditions).
>
> And I'm not against doing that in general, I am just against doing it
> per default.
>
> It's a lot of complexity, and like I said, the practical usecase for
> limiting zswap memory to begin with is quite unclear to me. Zswap is
> not a limited resource. It's just memory. And you already had the
> memory for the uncompressed copy. So it's a bit strange to me to say
> "you have compressed your memory enough, so now you get sent to disk
> (or we declare OOM)". What would be a reason to limit it?

Technically speaking if we have a global zswap limit, it becomes a
limited resource and distributing it across workloads can make sense.
That being said, I am not aware of any existing use cases for that.

The other use case is controlling when writeback kicks in for
different workloads. It may not make sense for limit-based reclaim,
because as you mentioned the memory is limited anyway and workloads
should be free to compress their own memory within their limit as they
please. But it may make sense for proactive reclaim, controlling how
much memory we compress vs how much memory we completely evict to
disk.

Again, not aware of any existing use cases for this as well.

>
> It sort of makes sense as a binary switch, but I don't get the usecase
> for a granular limit. (And I blame my own cowardice for making the
> cgroup knob a limit, to keep options open, instead of a switch.)
>
> All that to say, this would be better in a follow-up patch. We allow
> overshooting now, it's not clear how overshooting by a larger amount
> makes a categorical difference.

I am not against making this a follow-up, if/when the need arises. My
whole point was that using EWMA (or similar) we can make the upfront
check a little bit more meaningful than "We have 1 byte of headroom,
let's go compress a 2MB THP!". I think it's not a lot of complexity to
check for headroom based on an estimated compression size, but I
didn't try to code it, so maybe I am wrong :)

>
> > but I would still like to see batching of all the limit checks,
> > charging, and stats updates. It makes little sense otherwise.
>
> Definitely. One check, one charge, one stat update per folio.
Yosry Ahmed Sept. 25, 2024, 7:49 p.m. UTC | #25
[..]
> > > > > +       /*
> > > > > +        * Check cgroup limits:
> > > > > +        *
> > > > > +        * The cgroup zswap limit check is done once at the beginning of an
> > > > > +        * mTHP store, and not within zswap_store_page() for each page
> > > > > +        * in the mTHP. We do however check the zswap pool limits at the
> > > > > +        * start of zswap_store_page(). What this means is, the cgroup
> > > > > +        * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> > > > > +        * However, the per-store-page zswap pool limits check should
> > > > > +        * hopefully trigger the cgroup aware and zswap LRU aware global
> > > > > +        * reclaim implemented in the shrinker. If this assumption holds,
> > > > > +        * the cgroup exceeding the zswap limits could potentially be
> > > > > +        * resolved before the next zswap_store, and if it is not, the next
> > > > > +        * zswap_store would fail the cgroup zswap limit check at the start.
> > > > > +        */
> > > >
> > > > I do not really like this. Allowing going one page above the limit is
> > > > one thing, but one THP above the limit seems too much. I also don't
> > > > like relying on the repeated limit checking in zswap_store_page(), if
> > > > anything I think that should be batched too.
> > > >
> > > > Is it too unreasonable to maintain the average compression ratio and
> > > > use that to estimate limit checking for both memcg and global limits?
> > > > Johannes, Nhat, any thoughts on this?
> > >
> > > I honestly don't think it's much of an issue. The global limit is
> > > huge, and the cgroup limit is to the best of my knowledge only used as
> > > a binary switch. Setting a non-binary limit - global or cgroup - seems
> > > like a bit of an obscure usecase to me, because in the vast majority
> > > of cases it's preferable to keep compressing over declaring OOM.
> > >
> > > And even if you do have some granular limit, the workload size scales
> > > with it. It's not like you have a thousand THPs in a 10M cgroup.
> >
> > The memcg limit and zswap limit can be disproportionate, although that
> > shouldn't be common.
> >
> > >
> > > If this ever becomes an issue, we can handle it in a fastpath-slowpath
> > > scheme: check the limit up front for fast-path failure if we're
> > > already maxed out, just like now; then make obj_cgroup_charge_zswap()
> > > atomically charge against zswap.max and unwind the store if we raced.
> > >
> > > For now, I would just keep the simple version we currently have: check
> > > once in zswap_store() and then just go ahead for the whole folio.
> >
> > I am not totally against this but I feel like this is too optimistic.
> > I think we can keep it simple-ish by maintaining an ewma for the
> > compression ratio, we already have primitives for this (see
> > DECLARE_EWMA).
> >
> > Then in zswap_store(), we can use the ewma to estimate the compressed
> > size and use it to do the memcg and global limit checks once, like we
> > do today. Instead of just checking if we are below the limits, we
> > check if we have enough headroom for the estimated compressed size.
> > Then we call zswap_store_page() to do the per-page stuff, then do
> > batched charging and stats updates.
> >
> > If you think that's an overkill we can keep doing the limit checks as
> > we do today,
> > but I would still like to see batching of all the limit checks,
> > charging, and stats updates. It makes little sense otherwise.
>
> Thanks Johannes and Yosry for these suggestions and pointers.
> I think there is general agreement about the batch charging and
> zswap_stored_pages/stats updates. Yosry,  does "batching of limit
> checks" imply the same as a simple check for being over the cgroup
> limit at the start of zswap_store and not doing this check in
> zswap_store_page? Does this also imply a zswap_pool_get_many()?
> Would appreciate it if you can help clarify.

Yes I think we should batch as much as possible in zswap_store(), and
only do the things are truly per-page in zswap_store_page(). The limit
checks, stats updates, zswap_pool refs, charging etc. Batching all of
these things should be clear wins.

>
> The main question in my mind about using the EWMA checks is,
> will it add overhead to the normal zswap reclaim path; and if so,
> would a simple limit check at the start of zswap_store as suggested
> by Johannes suffice. I can run a few experiments to quantify this
> overhead, and maybe we can revisit this?

If you look at ewma_##name##_add() in include/linux/average.h, it's
really just a bunch of bit shifts, so I am not concerned about runtime
overhead. My discussion with Johannes is more about if the complexity
is justified, I'd wait for that discussion to settle.

Either way, we should check the limits once in zswap_store().
Johannes Weiner Sept. 25, 2024, 8:13 p.m. UTC | #26
On Wed, Sep 25, 2024 at 12:39:02PM -0700, Yosry Ahmed wrote:
> On Wed, Sep 25, 2024 at 12:20 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Wed, Sep 25, 2024 at 11:30:34AM -0700, Yosry Ahmed wrote:
> > > Johannes wrote:
> > > > If this ever becomes an issue, we can handle it in a fastpath-slowpath
> > > > scheme: check the limit up front for fast-path failure if we're
> > > > already maxed out, just like now; then make obj_cgroup_charge_zswap()
> > > > atomically charge against zswap.max and unwind the store if we raced.
> > > >
> > > > For now, I would just keep the simple version we currently have: check
> > > > once in zswap_store() and then just go ahead for the whole folio.
> > >
> > > I am not totally against this but I feel like this is too optimistic.
> > > I think we can keep it simple-ish by maintaining an ewma for the
> > > compression ratio, we already have primitives for this (see
> > > DECLARE_EWMA).
> > >
> > > Then in zswap_store(), we can use the ewma to estimate the compressed
> > > size and use it to do the memcg and global limit checks once, like we
> > > do today. Instead of just checking if we are below the limits, we
> > > check if we have enough headroom for the estimated compressed size.
> > > Then we call zswap_store_page() to do the per-page stuff, then do
> > > batched charging and stats updates.
> >
> > I'm not sure what you gain from making a non-atomic check precise. You
> > can get a hundred threads determining down precisely that *their*
> > store will fit exactly into the last 800kB before the limit.
> 
> We just get to avoid overshooting in cases where we know we probably
> can't fit it anyway. If we have 4KB left and we are trying to compress
> a 2MB THP, for example. It just makes the upfront check to avoid
> pointless compression a little bit more meaningful.

I think I'm missing something. It's not just an upfront check, it's
the only check. The charge down the line doesn't limit anything, it
just counts. So if this check passes, we WILL store the folio. There
is no pointless compression.

We might overshoot the limit by about one folio in a single-threaded
scenario. But that is negligible in comparison to the overshoot we can
get due to race conditions.

Again, I see no practical, meaningful difference in outcome by
making that limit check any more precise. Just keep it as-is.
Johannes Weiner Sept. 25, 2024, 8:49 p.m. UTC | #27
On Wed, Sep 25, 2024 at 12:49:13PM -0700, Yosry Ahmed wrote:
> Kanchana wrote:
> > The main question in my mind about using the EWMA checks is,
> > will it add overhead to the normal zswap reclaim path; and if so,
> > would a simple limit check at the start of zswap_store as suggested
> > by Johannes suffice. I can run a few experiments to quantify this
> > overhead, and maybe we can revisit this?
> 
> If you look at ewma_##name##_add() in include/linux/average.h, it's
> really just a bunch of bit shifts, so I am not concerned about runtime
> overhead. My discussion with Johannes is more about if the complexity
> is justified, I'd wait for that discussion to settle.

Sorry to be blunt, but "precision" in a non-atomic check like this
makes no sense. The fact that it's not too expensive is irrelevant.
This discussion around this honestly has gone off the rails.

Just leave the limit checks exactly as they are. Check limits and
cgroup_may_zswap() once up front. Compress the subpages. Acquire
references and bump all stats in batches of folio_nr_pages(). You can
add up the subpage compressed bytes in the for-loop and do the
obj_cgroup_charge_zswap() in a single call at the end as well.

That's my suggestion. If that's no good, please ELI5.
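
(For illustration, a rough sketch of that shape. It assumes
zswap_store_page() is reworked to return the compressed size rather
than a bool, and zswap_check_limits() is a made-up stand-in for the
existing global-limit and cgroup checks; this is not the v7 code as
posted.)

bool zswap_store(struct folio *folio)
{
	struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio);
	long nr_pages = folio_nr_pages(folio);
	size_t compressed_bytes = 0;
	struct zswap_pool *pool;
	long index;

	/* Check zswap_enabled, the global limit and the cgroup limit once. */
	if (!zswap_enabled || !zswap_check_limits(objcg))
		goto reject;

	pool = zswap_pool_current_get();
	if (!pool)
		goto reject;

	for (index = 0; index < nr_pages; index++) {
		ssize_t bytes = zswap_store_page(folio, index, objcg, pool);

		if (bytes < 0)
			goto unwind;
		compressed_bytes += bytes;
	}

	/* Batch the charge and the stats: once per folio, not per subpage. */
	if (objcg)
		obj_cgroup_charge_zswap(objcg, compressed_bytes);
	atomic_add(nr_pages, &zswap_stored_pages);
	count_vm_events(ZSWPOUT, nr_pages);

	zswap_pool_put(pool);
	obj_cgroup_put(objcg);
	return true;

unwind:
	/* Invalidate the subpages stored so far, as the patch already does. */
	zswap_delete_stored_offsets(swap_zswap_tree(folio->swap),
				    swp_offset(folio->swap), nr_pages);
	zswap_pool_put(pool);
reject:
	obj_cgroup_put(objcg);
	return false;
}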
Yosry Ahmed Sept. 25, 2024, 9:06 p.m. UTC | #28
On Wed, Sep 25, 2024 at 1:13 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Sep 25, 2024 at 12:39:02PM -0700, Yosry Ahmed wrote:
> > On Wed, Sep 25, 2024 at 12:20 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Wed, Sep 25, 2024 at 11:30:34AM -0700, Yosry Ahmed wrote:
> > > > Johannes wrote:
> > > > > If this ever becomes an issue, we can handle it in a fastpath-slowpath
> > > > > scheme: check the limit up front for fast-path failure if we're
> > > > > already maxed out, just like now; then make obj_cgroup_charge_zswap()
> > > > > atomically charge against zswap.max and unwind the store if we raced.
> > > > >
> > > > > For now, I would just keep the simple version we currently have: check
> > > > > once in zswap_store() and then just go ahead for the whole folio.
> > > >
> > > > I am not totally against this but I feel like this is too optimistic.
> > > > I think we can keep it simple-ish by maintaining an ewma for the
> > > > compression ratio, we already have primitives for this (see
> > > > DECLARE_EWMA).
> > > >
> > > > Then in zswap_store(), we can use the ewma to estimate the compressed
> > > > size and use it to do the memcg and global limit checks once, like we
> > > > do today. Instead of just checking if we are below the limits, we
> > > > check if we have enough headroom for the estimated compressed size.
> > > > Then we call zswap_store_page() to do the per-page stuff, then do
> > > > batched charging and stats updates.
> > >
> > > I'm not sure what you gain from making a non-atomic check precise. You
> > > can get a hundred threads determining down precisely that *their*
> > > store will fit exactly into the last 800kB before the limit.
> >
> > We just get to avoid overshooting in cases where we know we probably
> > can't fit it anyway. If we have 4KB left and we are trying to compress
> > a 2MB THP, for example. It just makes the upfront check to avoid
> > pointless compression a little bit more meaningful.
>
> I think I'm missing something. It's not just an upfront check, it's
> the only check. The charge down the line doesn't limit anything, it
> just counts. So if this check passes, we WILL store the folio. There
> is no pointless compression.

I got confused by what you said about the fast-slow path, I thought
you were suggesting we do this now, so I was saying it's better to use
an estimate of the compressed size in the fast path to avoid pointless
compression.

I missed the second paragraph.

>
> We might overshoot the limit by about one folio in a single-threaded
> scenario. But that is negligible in comparison to the overshoot we can
> get due to race conditions.
>
> Again, I see no practical, meaningful difference in outcome by
> making that limit check any more precise. Just keep it as-is.

> Sorry to be blunt, but "precision" in a non-atomic check like this
> makes no sense. The fact that it's not too expensive is irrelevant.
> This discussion around this honestly has gone off the rails.

Yeah I thought we were talking about the version where we rollback
compressions if we overshoot, my bad. We discussed quite a few things
and I managed to confuse myself.

> Just leave the limit checks exactly as they are. Check limits and
> cgroup_may_zswap() once up front. Compress the subpages. Acquire
> references and bump all stats in batches of folio_nr_pages(). You can
> add up the subpage compressed bytes in the for-loop and do the
> obj_cgroup_charge_zswap() in a single call at the end as well.

We can keep the limit checks as they are for now, and revisit as needed.
Sridhar, Kanchana P Sept. 25, 2024, 10:29 p.m. UTC | #29
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Wednesday, September 25, 2024 2:06 PM
> To: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
> 
> On Wed, Sep 25, 2024 at 1:13 PM Johannes Weiner <hannes@cmpxchg.org>
> wrote:
> >
> > On Wed, Sep 25, 2024 at 12:39:02PM -0700, Yosry Ahmed wrote:
> > > On Wed, Sep 25, 2024 at 12:20 PM Johannes Weiner
> <hannes@cmpxchg.org> wrote:
> > > >
> > > > On Wed, Sep 25, 2024 at 11:30:34AM -0700, Yosry Ahmed wrote:
> > > > > Johannes wrote:
> > > > > > If this ever becomes an issue, we can handle it in a fastpath-
> slowpath
> > > > > > scheme: check the limit up front for fast-path failure if we're
> > > > > > already maxed out, just like now; then make
> obj_cgroup_charge_zswap()
> > > > > > atomically charge against zswap.max and unwind the store if we
> raced.
> > > > > >
> > > > > > For now, I would just keep the simple version we currently have:
> check
> > > > > > once in zswap_store() and then just go ahead for the whole folio.
> > > > >
> > > > > I am not totally against this but I feel like this is too optimistic.
> > > > > I think we can keep it simple-ish by maintaining an ewma for the
> > > > > compression ratio, we already have primitives for this (see
> > > > > DECLARE_EWMA).
> > > > >
> > > > > Then in zswap_store(), we can use the ewma to estimate the
> compressed
> > > > > size and use it to do the memcg and global limit checks once, like we
> > > > > do today. Instead of just checking if we are below the limits, we
> > > > > check if we have enough headroom for the estimated compressed size.
> > > > > Then we call zswap_store_page() to do the per-page stuff, then do
> > > > > batched charging and stats updates.
> > > >
> > > > I'm not sure what you gain from making a non-atomic check precise. You
> > > > can get a hundred threads determining down precisely that *their*
> > > > store will fit exactly into the last 800kB before the limit.
> > >
> > > We just get to avoid overshooting in cases where we know we probably
> > > can't fit it anyway. If we have 4KB left and we are trying to compress
> > > a 2MB THP, for example. It just makes the upfront check to avoid
> > > pointless compression a little bit more meaningful.
> >
> > I think I'm missing something. It's not just an upfront check, it's
> > the only check. The charge down the line doesn't limit anything, it
> > just counts. So if this check passes, we WILL store the folio. There
> > is no pointless compression.
> 
> I got confused by what you said about the fast-slow path, I thought
> you were suggesting we do this now, so I was saying it's better to use
> an estimate of the compressed size in the fast path to avoid pointless
> compression.
> 
> I missed the second paragraph.
> 
> >
> > We might overshoot the limit by about one folio in a single-threaded
> > scenario. But that is negligible in comparison to the overshoot we can
> > get due to race conditions.
> >
> > Again, I see no practical, meaningful difference in outcome by
> > making that limit check any more precise. Just keep it as-is.
> 
> > Sorry to be blunt, but "precision" in a non-atomic check like this
> > makes no sense. The fact that it's not too expensive is irrelevant.
> > This discussion around this honestly has gone off the rails.
> 
> Yeah I thought we were talking about the version where we rollback
> compressions if we overshoot, my bad. We discussed quite a few things
> and I managed to confuse myself.
> 
> > Just leave the limit checks exactly as they are. Check limits and
> > cgroup_may_zswap() once up front. Compress the subpages. Acquire
> > references and bump all stats in batches of folio_nr_pages(). You can
> > add up the subpage compressed bytes in the for-loop and do the
> > obj_cgroup_charge_zswap() in a single call at the end as well.
> 
> We can keep the limit checks as they are for now, and revisit as needed.

Thanks Johannes and Yosry for the discussion! I will proceed as suggested.

Thanks,
Kanchana
Sridhar, Kanchana P Sept. 26, 2024, 3:58 a.m. UTC | #30
> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Sent: Wednesday, September 25, 2024 3:29 PM
> To: Yosry Ahmed <yosryahmed@google.com>; Johannes Weiner
> <hannes@cmpxchg.org>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>;
> Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Subject: RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
> 
> > -----Original Message-----
> > From: Yosry Ahmed <yosryahmed@google.com>
> > Sent: Wednesday, September 25, 2024 2:06 PM
> > To: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> > kernel@vger.kernel.org; linux-mm@kvack.org; nphamcs@gmail.com;
> > chengming.zhou@linux.dev; usamaarif642@gmail.com;
> > shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> > <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org;
> > Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> > zswap_store().
> >
> > On Wed, Sep 25, 2024 at 1:13 PM Johannes Weiner
> <hannes@cmpxchg.org>
> > wrote:
> > >
> > > On Wed, Sep 25, 2024 at 12:39:02PM -0700, Yosry Ahmed wrote:
> > > > On Wed, Sep 25, 2024 at 12:20 PM Johannes Weiner
> > <hannes@cmpxchg.org> wrote:
> > > > >
> > > > > On Wed, Sep 25, 2024 at 11:30:34AM -0700, Yosry Ahmed wrote:
> > > > > > Johannes wrote:
> > > > > > > If this ever becomes an issue, we can handle it in a fastpath-
> > slowpath
> > > > > > > scheme: check the limit up front for fast-path failure if we're
> > > > > > > already maxed out, just like now; then make
> > obj_cgroup_charge_zswap()
> > > > > > > atomically charge against zswap.max and unwind the store if we
> > raced.
> > > > > > >
> > > > > > > For now, I would just keep the simple version we currently have:
> > check
> > > > > > > once in zswap_store() and then just go ahead for the whole folio.
> > > > > >
> > > > > > I am not totally against this but I feel like this is too optimistic.
> > > > > > I think we can keep it simple-ish by maintaining an ewma for the
> > > > > > compression ratio, we already have primitives for this (see
> > > > > > DECLARE_EWMA).
> > > > > >
> > > > > > Then in zswap_store(), we can use the ewma to estimate the
> > compressed
> > > > > > size and use it to do the memcg and global limit checks once, like we
> > > > > > do today. Instead of just checking if we are below the limits, we
> > > > > > check if we have enough headroom for the estimated compressed
> size.
> > > > > > Then we call zswap_store_page() to do the per-page stuff, then do
> > > > > > batched charging and stats updates.
> > > > >
> > > > > I'm not sure what you gain from making a non-atomic check precise.
> You
> > > > > can get a hundred threads determining precisely that *their*
> > > > > store will fit exactly into the last 800kB before the limit.
> > > >
> > > > We just get to avoid overshooting in cases where we know we probably
> > > > can't fit it anyway. If we have 4KB left and we are trying to compress
> > > > a 2MB THP, for example. It just makes the upfront check to avoid
> > > > pointless compression a little bit more meaningful.
> > >
> > > I think I'm missing something. It's not just an upfront check, it's
> > > the only check. The charge down the line doesn't limit anything, it
> > > just counts. So if this check passes, we WILL store the folio. There
> > > is no pointless compression.
> >
> > I got confused by what you said about the fast-slow path, I thought
> > you were suggesting we do this now, so I was saying it's better to use
> > an estimate of the compressed size in the fast path to avoid pointless
> > compression.
> >
> > I missed the second paragraph.
> >
> > >
> > > We might overshoot the limit by about one folio in a single-threaded
> > > scenario. But that is negligible in comparison to the overshoot we can
> > > get due to race conditions.
> > >
> > > Again, I see no practical, meaningful difference in outcome by
> > > making that limit check any more precise. Just keep it as-is.
> >
> > > Sorry to be blunt, but "precision" in a non-atomic check like this
> > > makes no sense. The fact that it's not too expensive is irrelevant.
> > > This discussion around this honestly has gone off the rails.
> >
> > Yeah I thought we were talking about the version where we rollback
> > compressions if we overshoot, my bad. We discussed quite a few things
> > and I managed to confuse myself.
> >
> > > Just leave the limit checks exactly as they are. Check limits and
> > > cgroup_may_zswap() once up front. Compress the subpages. Acquire
> > > references and bump all stats in batches of folio_nr_pages(). You can
> > > add up the subpage compressed bytes in the for-loop and do the
> > > obj_cgroup_charge_zswap() in a single call at the end as well.
> >
> > We can keep the limit checks as they are for now, and revisit as needed.
> 
> Thanks Johannes and Yosry for the discussion! I will proceed as suggested.

One thing I realized while reworking the patches for the batched checks is:
within zswap_store_page(), we set the entry->objcg and entry->pool before
adding it to the xarray. Given this, wouldn't it be safer to get the objcg
and pool reference per sub-page, locally in zswap_store_page(), rather than
obtaining batched references at the end if the store is successful? If we want
zswap_store_page() to be self-contained and correct as far as the entry
being created and added to the xarray, it seems like the right thing to do?
I am a bit apprehensive about the entry being added to the xarray without
a reference obtained on the objcg and pool, because any page-faults/writeback
that occur on sub-pages added to the xarray before the entire folio has been
stored, would run into issues.
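
To make the concern concrete, here is an abbreviated sketch of the
per-page ordering I have in mind (compression and most entry fields are
elided, and a plain, non-try zswap_pool_get() is assumed; only the
reference-before-publish ordering is the point):

static bool zswap_store_page_ref_sketch(struct folio *folio, long index,
					struct obj_cgroup *objcg,
					struct zswap_pool *pool)
{
	struct xarray *tree = swap_zswap_tree(folio->swap);
	struct zswap_entry *entry;

	entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
	if (!entry)
		return false;

	/* take the references this entry will own ... */
	if (objcg)
		obj_cgroup_get(objcg);
	zswap_pool_get(pool);

	entry->objcg = objcg;
	entry->pool = pool;
	/* ... entry->swpentry (from the sub-page at @index), compression,
	 * etc. elided ... */

	/* ... before the entry becomes reachable in the xarray, where a
	 * concurrent fault or writeback on this sub-page could find it */
	if (!zswap_store_entry(tree, entry)) {
		zswap_pool_put(pool);
		if (objcg)
			obj_cgroup_put(objcg);
		zswap_entry_cache_free(entry);
		return false;
	}

	return true;
}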

Just wanted to run this by you. The rest of the batched charging, atomic
and stat updates should be Ok.

Thanks,
Kanchana

> 
> Thanks,
> Kanchana
Yosry Ahmed Sept. 26, 2024, 4:52 a.m. UTC | #31
[..]
>
> One thing I realized while reworking the patches for the batched checks is:
> within zswap_store_page(), we set the entry->objcg and entry->pool before
> adding it to the xarray. Given this, wouldn't it be safer to get the objcg
> and pool reference per sub-page, locally in zswap_store_page(), rather than
> obtaining batched references at the end if the store is successful? If we want
> zswap_store_page() to be self-contained and correct as far as the entry
> being created and added to the xarray, it seems like the right thing to do?
> I am a bit apprehensive about the entry being added to the xarray without
> a reference obtained on the objcg and pool, because any page-faults/writeback
> that occur on sub-pages added to the xarray before the entire folio has been
> stored, would run into issues.

We definitely should not obtain references to the pool and objcg after
initializing the entries with them. We can obtain all references in
zswap_store() before zswap_store_page(). IOW, the batching in this
case should be done before the per-page operations, not after.
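
Roughly like the following (sketch only; the *_many batched helpers are
assumed for illustration and are not existing zswap/objcg APIs, and in
this model zswap_store_page() consumes, rather than takes, the
references):

static bool zswap_store_folio_refs_first(struct folio *folio,
					 struct obj_cgroup *objcg,
					 struct zswap_pool *pool)
{
	long nr_pages = folio_nr_pages(folio);
	long index;

	/* take every reference the entries will record, up front */
	if (!zswap_pool_tryget_many(pool, nr_pages))
		return false;
	if (objcg)
		obj_cgroup_get_many(objcg, nr_pages);

	for (index = 0; index < nr_pages; ++index) {
		if (!zswap_store_page(folio, index, objcg, pool))
			goto unwind;
	}
	return true;

unwind:
	/*
	 * Entries already published keep their references (they are freed
	 * when the caller invalidates the stored offsets); only the unused
	 * references for the remaining sub-pages are dropped here.
	 */
	if (objcg)
		obj_cgroup_put_many(objcg, nr_pages - index);
	zswap_pool_put_many(pool, nr_pages - index);
	return false;
}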

>
> Just wanted to run this by you. The rest of the batched charging, atomic
> and stat updates should be Ok.
>
> Thanks,
> Kanchana
>
> >
> > Thanks,
> > Kanchana
Sridhar, Kanchana P Sept. 26, 2024, 4:40 p.m. UTC | #32
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Wednesday, September 25, 2024 9:52 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>; linux-kernel@vger.kernel.org;
> linux-mm@kvack.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
> 
> [..]
> >
> > One thing I realized while reworking the patches for the batched checks is:
> > within zswap_store_page(), we set the entry->objcg and entry->pool before
> > adding it to the xarray. Given this, wouldn't it be safer to get the objcg
> > and pool reference per sub-page, locally in zswap_store_page(), rather than
> > obtaining batched references at the end if the store is successful? If we
> want
> > zswap_store_page() to be self-contained and correct as far as the entry
> > being created and added to the xarray, it seems like the right thing to do?
> > I am a bit apprehensive about the entry being added to the xarray without
> > a reference obtained on the objcg and pool, because any page-
> faults/writeback
> > that occur on sub-pages added to the xarray before the entire folio has been
> > stored, would run into issues.
> 
> We definitely should not obtain references to the pool and objcg after
> initializing the entries with them. We can obtain all references in
> zswap_store() before zswap_store_page(). IOW, the batching in this
> case should be done before the per-page operations, not after.

Thanks Yosry. IIUC, we should obtain all references to the objcg and to the
zswap_pool at the start of zswap_store.

In the case of error on any sub-page, we will unwind state for potentially
only the stored pages or the entire folio if it happened to already be in zswap
and is being re-written. We might need some additional book-keeping to
keep track of which sub-pages were found in the xarray and zswap_entry_free()
got called (nr_sb). Assuming I define a new "obj_cgroup_put_many()", I would need
to call this with (folio_nr_pages() - nr_sb).

As far as zswap_pool_get(), there is some added complexity if we want to
keep the existing implementation that calls "percpu_ref_tryget()", and assuming
this is extended to provide a new "zswap_pool_get_many()" that calls
"percpu_ref_tryget_many()". Is there a reason we use percpu_ref_tryget() instead
of percpu_ref_get()? Reason I ask is, with tryget(), if for some reason the pool->ref
is 0, no further increments will be made. If so, upon unwinding state in
zswap_store(), I would need to special-case to catch this before calling a new
"zswap_pool_put_many()".

Things could be a little simpler if zswap_pool_get() can use "percpu_ref_get()"
which will always increment the refcount. Since the zswap pool->ref is initialized
to "1", this seems Ok, but I don't know if there will be unintended consequences.

Can you please advise on what is the simplest/cleanest approach:

1) Proceed with the above changes without changing percpu_ref_tryget in
     zswap_pool_get. Needs special-casing in zswap_store to detect pool->ref
    being "0" before calling zswap_pool_put[_many].
2) Modify zswap_pool_get/zswap_pool_get_many to use percpu_ref_get_many
    and avoid special-casing to detect pool->ref being "0" before calling
    zswap_pool_put[_many].
3) Keep the approach in v7 where obj_cgroup_get/put is localized to
    zswap_store_page for both success and error conditions, and any unwinding
    state in zswap_store will take care of dropping references obtained from
    prior successful writes (from this or prior invocations of zswap_store).

Thanks,
Kanchana

> 
> >
> > Just wanted to run this by you. The rest of the batched charging, atomic
> > and stat updates should be Ok.
> >
> > Thanks,
> > Kanchana
> >
> > >
> > > Thanks,
> > > Kanchana
Yosry Ahmed Sept. 26, 2024, 5:19 p.m. UTC | #33
On Thu, Sep 26, 2024 at 9:40 AM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
> > -----Original Message-----
> > From: Yosry Ahmed <yosryahmed@google.com>
> > Sent: Wednesday, September 25, 2024 9:52 PM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>; linux-kernel@vger.kernel.org;
> > linux-mm@kvack.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> > usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> > zswap_store().
> >
> > [..]
> > >
> > > One thing I realized while reworking the patches for the batched checks is:
> > > within zswap_store_page(), we set the entry->objcg and entry->pool before
> > > adding it to the xarray. Given this, wouldn't it be safer to get the objcg
> > > and pool reference per sub-page, locally in zswap_store_page(), rather than
> > > obtaining batched references at the end if the store is successful? If we
> > want
> > > zswap_store_page() to be self-contained and correct as far as the entry
> > > being created and added to the xarray, it seems like the right thing to do?
> > > I am a bit apprehensive about the entry being added to the xarray without
> > > a reference obtained on the objcg and pool, because any page-
> > faults/writeback
> > > that occur on sub-pages added to the xarray before the entire folio has been
> > > stored, would run into issues.
> >
> > We definitely should not obtain references to the pool and objcg after
> > initializing the entries with them. We can obtain all references in
> > zswap_store() before zswap_store_page(). IOW, the batching in this
> > case should be done before the per-page operations, not after.
>
> Thanks Yosry. IIUC, we should obtain all references to the objcg and to the
> zswap_pool at the start of zswap_store.
>
> In the case of error on any sub-page, we will unwind state for potentially
> only the stored pages or the entire folio if it happened to already be in zswap
> and is being re-written. We might need some additional book-keeping to
> keep track of which sub-pages were found in the xarray and zswap_entry_free()
> got called (nr_sb). Assuming I define a new "obj_cgroup_put_many()", I would need
> to call this with (folio_nr_pages() - nr_sb).
>
> As far as zswap_pool_get(), there is some added complexity if we want to
> keep the existing implementation that calls "percpu_ref_tryget()", and assuming
> this is extended to provide a new "zswap_pool_get_many()" that calls
> "percpu_ref_tryget_many()". Is there a reason we use percpu_ref_tryget() instead
> of percpu_ref_get()? Reason I ask is, with tryget(), if for some reason the pool->ref
> is 0, no further increments will be made. If so, upon unwinding state in
> zswap_store(), I would need to special-case to catch this before calling a new
> "zswap_pool_put_many()".
>
> Things could be a little simpler if zswap_pool_get() can use "percpu_ref_get()"
> which will always increment the refcount. Since the zswap pool->ref is initialized
> to "1", this seems Ok, but I don't know if there will be unintended consequences.
>
> Can you please advise on what is the simplest/cleanest approach:
>
> 1) Proceed with the above changes without changing percpu_ref_tryget in
>      zswap_pool_get. Needs special-casing in zswap_store to detect pool->ref
>     being "0" before calling zswap_pool_put[_many].

My assumption is that we can reorder the code such that if
zswap_pool_get_many() fails we don't call zswap_pool_put_many() to
begin with (e.g. jump to a label after zswap_pool_put_many()).

> 2) Modify zswap_pool_get/zswap_pool_get_many to use percpu_ref_get_many
>     and avoid special-casing to detect pool->ref being "0" before calling
>     zswap_pool_put[_many].

I don't think we can simply switch the tryget to a get, as I believe
we can race with the pool being destroyed.

> 3) Keep the approach in v7 where obj_cgroup_get/put is localized to
>     zswap_store_page for both success and error conditions, and any unwinding
>     state in zswap_store will take care of dropping references obtained from
>     prior successful writes (from this or prior invocations of zswap_store).

I am also fine with doing that and doing the reference batching as a follow up.


>
> Thanks,
> Kanchana
>
> >
> > >
> > > Just wanted to run this by you. The rest of the batched charging, atomic
> > > and stat updates should be Ok.
> > >
> > > Thanks,
> > > Kanchana
> > >
> > > >
> > > > Thanks,
> > > > Kanchana
Sridhar, Kanchana P Sept. 26, 2024, 5:29 p.m. UTC | #34
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Thursday, September 26, 2024 10:20 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>; linux-kernel@vger.kernel.org;
> linux-mm@kvack.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
> 
> On Thu, Sep 26, 2024 at 9:40 AM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosryahmed@google.com>
> > > Sent: Wednesday, September 25, 2024 9:52 PM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > Cc: Johannes Weiner <hannes@cmpxchg.org>; linux-
> kernel@vger.kernel.org;
> > > linux-mm@kvack.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> > > usamaarif642@gmail.com; shakeel.butt@linux.dev;
> ryan.roberts@arm.com;
> > > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> > > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> > > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> > > zswap_store().
> > >
> > > [..]
> > > >
> > > > One thing I realized while reworking the patches for the batched checks
> is:
> > > > within zswap_store_page(), we set the entry->objcg and entry->pool
> before
> > > > adding it to the xarray. Given this, wouldn't it be safer to get the objcg
> > > > and pool reference per sub-page, locally in zswap_store_page(), rather
> than
> > > > obtaining batched references at the end if the store is successful? If we
> > > want
> > > > zswap_store_page() to be self-contained and correct as far as the entry
> > > > being created and added to the xarray, it seems like the right thing to
> do?
> > > > I am a bit apprehensive about the entry being added to the xarray
> without
> > > > a reference obtained on the objcg and pool, because any page-
> > > faults/writeback
> > > > that occur on sub-pages added to the xarray before the entire folio has
> been
> > > > stored, would run into issues.
> > >
> > > We definitely should not obtain references to the pool and objcg after
> > > initializing the entries with them. We can obtain all references in
> > > zswap_store() before zswap_store_page(). IOW, the batching in this
> > > case should be done before the per-page operations, not after.
> >
> > Thanks Yosry. IIUC, we should obtain all references to the objcg and to the
> > zswap_pool at the start of zswap_store.
> >
> > In the case of error on any sub-page, we will unwind state for potentially
> > only the stored pages or the entire folio if it happened to already be in
> zswap
> > and is being re-written. We might need some additional book-keeping to
> > keep track of which sub-pages were found in the xarray and
> zswap_entry_free()
> > got called (nr_sb). Assuming I define a new "obj_cgroup_put_many()", I
> would need
> > to call this with (folio_nr_pages() - nr_sb).
> >
> > As far as zswap_pool_get(), there is some added complexity if we want to
> > keep the existing implementation that calls "percpu_ref_tryget()", and
> assuming
> > this is extended to provide a new "zswap_pool_get_many()" that calls
> > "percpu_ref_tryget_many()". Is there a reason we use percpu_ref_tryget()
> instead
> > of percpu_ref_get()? Reason I ask is, with tryget(), if for some reason the
> pool->ref
> > is 0, no further increments will be made. If so, upon unwinding state in
> > zswap_store(), I would need to special-case to catch this before calling a
> new
> > "zswap_pool_put_many()".
> >
> > Things could be a little simpler if zswap_pool_get() can use
> "percpu_ref_get()"
> > which will always increment the refcount. Since the zswap pool->ref is
> initialized
> > to "1", this seems Ok, but I don't know if there will be unintended
> consequences.
> >
> > Can you please advise on what is the simplest/cleanest approach:
> >
> > 1) Proceed with the above changes without changing percpu_ref_tryget in
> >      zswap_pool_get. Needs special-casing in zswap_store to detect pool->ref
> >     being "0" before calling zswap_pool_put[_many].
> 
> My assumption is that we can reorder the code such that if
> zswap_pool_get_many() fails we don't call zswap_pool_put_many() to
> begin with (e.g. jump to a label after zswap_pool_put_many()).

However, the pool refcount could change between the start and end of
zswap_store.

> 
> > 2) Modify zswap_pool_get/zswap_pool_get_many to use
> percpu_ref_get_many
> >     and avoid special-casing to detect pool->ref being "0" before calling
> >     zswap_pool_put[_many].
> 
> I don't think we can simply switch the tryget to a get, as I believe
> we can race with the pool being destroyed.

That was my initial thought as well, but I figured this couldn't happen
since the pool->ref is initialized to "1", and based on the existing
implementation. In any case, I can understand the intent of the use
of "tryget"; it is just that it adds to the considerations for reference
batching.

> 
> > 3) Keep the approach in v7 where obj_cgroup_get/put is localized to
> >     zswap_store_page for both success and error conditions, and any
> unwinding
> >     state in zswap_store will take care of dropping references obtained from
> >     prior successful writes (from this or prior invocations of zswap_store).
> 
> I am also fine with doing that and doing the reference batching as a follow up.

I think so too! We could try and improve upon (3) with reference batching
in a follow-up patch.

Thanks,
Kanchana

> 
> 
> >
> > Thanks,
> > Kanchana
> >
> > >
> > > >
> > > > Just wanted to run this by you. The rest of the batched charging, atomic
> > > > and stat updates should be Ok.
> > > >
> > > > Thanks,
> > > > Kanchana
> > > >
> > > > >
> > > > > Thanks,
> > > > > Kanchana
Yosry Ahmed Sept. 26, 2024, 5:34 p.m. UTC | #35
On Thu, Sep 26, 2024 at 10:29 AM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
> > -----Original Message-----
> > From: Yosry Ahmed <yosryahmed@google.com>
> > Sent: Thursday, September 26, 2024 10:20 AM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>; linux-kernel@vger.kernel.org;
> > linux-mm@kvack.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> > usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> > zswap_store().
> >
> > On Thu, Sep 26, 2024 at 9:40 AM Sridhar, Kanchana P
> > <kanchana.p.sridhar@intel.com> wrote:
> > >
> > > > -----Original Message-----
> > > > From: Yosry Ahmed <yosryahmed@google.com>
> > > > Sent: Wednesday, September 25, 2024 9:52 PM
> > > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > > Cc: Johannes Weiner <hannes@cmpxchg.org>; linux-
> > kernel@vger.kernel.org;
> > > > linux-mm@kvack.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> > > > usamaarif642@gmail.com; shakeel.butt@linux.dev;
> > ryan.roberts@arm.com;
> > > > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> > > > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> > > > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > > > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> > > > zswap_store().
> > > >
> > > > [..]
> > > > >
> > > > > One thing I realized while reworking the patches for the batched checks
> > is:
> > > > > within zswap_store_page(), we set the entry->objcg and entry->pool
> > before
> > > > > adding it to the xarray. Given this, wouldn't it be safer to get the objcg
> > > > > and pool reference per sub-page, locally in zswap_store_page(), rather
> > than
> > > > > obtaining batched references at the end if the store is successful? If we
> > > > want
> > > > > zswap_store_page() to be self-contained and correct as far as the entry
> > > > > being created and added to the xarray, it seems like the right thing to
> > do?
> > > > > I am a bit apprehensive about the entry being added to the xarray
> > without
> > > > > a reference obtained on the objcg and pool, because any page-
> > > > faults/writeback
> > > > > that occur on sub-pages added to the xarray before the entire folio has
> > been
> > > > > stored, would run into issues.
> > > >
> > > > We definitely should not obtain references to the pool and objcg after
> > > > initializing the entries with them. We can obtain all references in
> > > > zswap_store() before zswap_store_page(). IOW, the batching in this
> > > > case should be done before the per-page operations, not after.
> > >
> > > Thanks Yosry. IIUC, we should obtain all references to the objcg and to the
> > > zswap_pool at the start of zswap_store.
> > >
> > > In the case of error on any sub-page, we will unwind state for potentially
> > > only the stored pages or the entire folio if it happened to already be in
> > zswap
> > > and is being re-written. We might need some additional book-keeping to
> > > keep track of which sub-pages were found in the xarray and
> > zswap_entry_free()
> > > got called (nr_sb). Assuming I define a new "obj_cgroup_put_many()", I
> > would need
> > > to call this with (folio_nr_pages() - nr_sb).
> > >
> > > As far as zswap_pool_get(), there is some added complexity if we want to
> > > keep the existing implementation that calls "percpu_ref_tryget()", and
> > assuming
> > > this is extended to provide a new "zswap_pool_get_many()" that calls
> > > "percpu_ref_tryget_many()". Is there a reason we use percpu_ref_tryget()
> > instead
> > > of percpu_ref_get()? Reason I ask is, with tryget(), if for some reason the
> > pool->ref
> > > is 0, no further increments will be made. If so, upon unwinding state in
> > > zswap_store(), I would need to special-case to catch this before calling a
> > new
> > > "zswap_pool_put_many()".
> > >
> > > Things could be a little simpler if zswap_pool_get() can use
> > "percpu_ref_get()"
> > > which will always increment the refcount. Since the zswap pool->ref is
> > initialized
> > > to "1", this seems Ok, but I don't know if there will be unintended
> > consequences.
> > >
> > > Can you please advise on what is the simplest/cleanest approach:
> > >
> > > 1) Proceed with the above changes without changing percpu_ref_tryget in
> > >      zswap_pool_get. Needs special-casing in zswap_store to detect pool->ref
> > >     being "0" before calling zswap_pool_put[_many].
> >
> > My assumption is that we can reorder the code such that if
> > zswap_pool_get_many() fails we don't call zswap_pool_put_many() to
> > begin with (e.g. jump to a label after zswap_pool_put_many()).
>
> However, the pool refcount could change between the start and end of
> zswap_store.

I am not sure what you mean. If zswap_pool_get_many() fails then we
just do not call zswap_pool_put_many() at all and abort.

>
> >
> > > 2) Modify zswap_pool_get/zswap_pool_get_many to use
> > percpu_ref_get_many
> > >     and avoid special-casing to detect pool->ref being "0" before calling
> > >     zswap_pool_put[_many].
> >
> > I don't think we can simply switch the tryget to a get, as I believe
> > we can race with the pool being destroyed.
>
> That was my initial thought as well, but I figured this couldn't happen
> since the pool->ref is initialized to "1", and based on the existing
> implementation. In any case, I can understand the intent of the use
> of "tryget"; it is just that it adds to the considerations for reference
> batching.

The initial ref can be dropped in __zswap_param_set() if a new pool is
created (see the call to percpu_ref_kill()).
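
To sketch the race that makes a plain get unsafe here (illustrative):

  CPU0: __zswap_param_set()            CPU1: zswap_store()
    create new pool                      pool = zswap_pool_current()
    percpu_ref_kill(&old_pool->ref)      percpu_ref_get(&pool->ref)
      -> refcount can now reach zero       -> could take a reference on a
         and release the old pool             pool whose release is underway

percpu_ref_tryget() fails once the refcount has dropped to zero, so the
lookup path has to stay a tryget; a plain get is only safe while we
already hold a reference obtained from that tryget.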

>
> >
> > > 3) Keep the approach in v7 where obj_cgroup_get/put is localized to
> > >     zswap_store_page for both success and error conditions, and any
> > unwinding
> > >     state in zswap_store will take care of dropping references obtained from
> > >     prior successful writes (from this or prior invocations of zswap_store).
> >
> > I am also fine with doing that and doing the reference batching as a follow up.
>
> I think so too! We could try and improve upon (3) with reference batching
> in a follow-up patch.

SGTM.
Johannes Weiner Sept. 26, 2024, 6:43 p.m. UTC | #36
On Thu, Sep 26, 2024 at 05:29:30PM +0000, Sridhar, Kanchana P wrote:
> > > 3) Keep the approach in v7 where obj_cgroup_get/put is localized to
> > >     zswap_store_page for both success and error conditions, and any
> > unwinding
> > >     state in zswap_store will take care of dropping references obtained from
> > >     prior successful writes (from this or prior invocations of zswap_store).
> > 
> > I am also fine with doing that and doing the reference batching as a follow up.
> 
> I think so too! We could try and improve upon (3) with reference batching
> in a follow-up patch.

Yeah, I agree. The percpu-refcounts are not that expensive, we should
be able to live with per-page ops for now.

One thing you *can* do from the start is tryget a pool reference in
zswap_store(), to prevent the pool's untimely demise while you work on
it, and then in zswap_store_page() you can do gets instead of trygets.

You'd have to rename zswap_pool_get() to zswap_pool_tryget() (which is
probably for the best) and implement the trivial new zswap_pool_get().
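
i.e. something along these lines (a sketch of the suggested split, not a
final patch; it assumes the pool refcount stays a percpu_ref as today):

/* formerly zswap_pool_get(): can fail if the pool is being destroyed */
static bool zswap_pool_tryget(struct zswap_pool *pool)
{
	return percpu_ref_tryget(&pool->ref);
}

/*
 * Trivial new get: only valid while the caller already holds a reference,
 * e.g. the tryget taken at the top of zswap_store().
 */
static void zswap_pool_get(struct zswap_pool *pool)
{
	percpu_ref_get(&pool->ref);
}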
Yosry Ahmed Sept. 26, 2024, 6:45 p.m. UTC | #37
On Thu, Sep 26, 2024 at 11:43 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Sep 26, 2024 at 05:29:30PM +0000, Sridhar, Kanchana P wrote:
> > > > 3) Keep the approach in v7 where obj_cgroup_get/put is localized to
> > > >     zswap_store_page for both success and error conditions, and any
> > > unwinding
> > > >     state in zswap_store will take care of dropping references obtained from
> > > >     prior successful writes (from this or prior invocations of zswap_store).
> > >
> > > I am also fine with doing that and doing the reference batching as a follow up.
> >
> > I think so too! We could try and improve upon (3) with reference batching
> > in a follow-up patch.
>
> Yeah, I agree. The percpu-refcounts are not that expensive, we should
> be able to live with per-page ops for now.
>
> One thing you *can* do from the start is tryget a pool reference in
> > zswap_store(), to prevent the pool's untimely demise while you work on
> it, and then in zswap_store_page() you can do gets instead of trygets.
>
> You'd have to rename zswap_pool_get() to zswap_pool_tryget() (which is
> probably for the best) and implement the trivial new zswap_pool_get().

Yeah I was actually planning to send a follow-up patch to do exactly
that until we figure out proper patching for the refcounts. Even
better if Kanchana incorporates it in the next version :)
Sridhar, Kanchana P Sept. 26, 2024, 7:36 p.m. UTC | #38
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Thursday, September 26, 2024 10:35 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>; linux-kernel@vger.kernel.org;
> linux-mm@kvack.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
> 
> On Thu, Sep 26, 2024 at 10:29 AM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosryahmed@google.com>
> > > Sent: Thursday, September 26, 2024 10:20 AM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > Cc: Johannes Weiner <hannes@cmpxchg.org>; linux-
> kernel@vger.kernel.org;
> > > linux-mm@kvack.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> > > usamaarif642@gmail.com; shakeel.butt@linux.dev;
> ryan.roberts@arm.com;
> > > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> > > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> > > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> > > zswap_store().
> > >
> > > On Thu, Sep 26, 2024 at 9:40 AM Sridhar, Kanchana P
> > > <kanchana.p.sridhar@intel.com> wrote:
> > > >
> > > > > -----Original Message-----
> > > > > From: Yosry Ahmed <yosryahmed@google.com>
> > > > > Sent: Wednesday, September 25, 2024 9:52 PM
> > > > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > > > Cc: Johannes Weiner <hannes@cmpxchg.org>; linux-
> > > kernel@vger.kernel.org;
> > > > > linux-mm@kvack.org; nphamcs@gmail.com;
> chengming.zhou@linux.dev;
> > > > > usamaarif642@gmail.com; shakeel.butt@linux.dev;
> > > ryan.roberts@arm.com;
> > > > > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com;
> akpm@linux-
> > > > > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> > > > > <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> > > > > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> > > > > zswap_store().
> > > > >
> > > > > [..]
> > > > > >
> > > > > > One thing I realized while reworking the patches for the batched
> checks
> > > is:
> > > > > > within zswap_store_page(), we set the entry->objcg and entry->pool
> > > before
> > > > > > adding it to the xarray. Given this, wouldn't it be safer to get the
> objcg
> > > > > > and pool reference per sub-page, locally in zswap_store_page(),
> rather
> > > than
> > > > > > obtaining batched references at the end if the store is successful? If
> we
> > > > > want
> > > > > > zswap_store_page() to be self-contained and correct as far as the
> entry
> > > > > > being created and added to the xarray, it seems like the right thing to
> > > do?
> > > > > > I am a bit apprehensive about the entry being added to the xarray
> > > without
> > > > > > a reference obtained on the objcg and pool, because any page-
> > > > > faults/writeback
> > > > > > that occur on sub-pages added to the xarray before the entire folio
> has
> > > been
> > > > > > stored, would run into issues.
> > > > >
> > > > > We definitely should not obtain references to the pool and objcg after
> > > > > initializing the entries with them. We can obtain all references in
> > > > > zswap_store() before zswap_store_page(). IOW, the batching in this
> > > > > case should be done before the per-page operations, not after.
> > > >
> > > > Thanks Yosry. IIUC, we should obtain all references to the objcg and to
> the
> > > > zswap_pool at the start of zswap_store.
> > > >
> > > > In the case of error on any sub-page, we will unwind state for potentially
> > > > only the stored pages or the entire folio if it happened to already be in
> > > zswap
> > > > and is being re-written. We might need some additional book-keeping to
> > > > keep track of which sub-pages were found in the xarray and
> > > zswap_entry_free()
> > > > got called (nr_sb). Assuming I define a new "obj_cgroup_put_many()", I
> > > would need
> > > > to call this with (folio_nr_pages() - nr_sb).
> > > >
> > > > As far as zswap_pool_get(), there is some added complexity if we want
> to
> > > > keep the existing implementation that calls "percpu_ref_tryget()", and
> > > assuming
> > > > this is extended to provide a new "zswap_pool_get_many()" that calls
> > > > "percpu_ref_tryget_many()". Is there a reason we use
> percpu_ref_tryget()
> > > instead
> > > > of percpu_ref_get()? Reason I ask is, with tryget(), if for some reason the
> > > pool->ref
> > > > is 0, no further increments will be made. If so, upon unwinding state in
> > > > zswap_store(), I would need to special-case to catch this before calling a
> > > new
> > > > "zswap_pool_put_many()".
> > > >
> > > > Things could be a little simpler if zswap_pool_get() can use
> > > "percpu_ref_get()"
> > > > which will always increment the refcount. Since the zswap pool->ref is
> > > initialized
> > > > to "1", this seems Ok, but I don't know if there will be unintended
> > > consequences.
> > > >
> > > > Can you please advise on what is the simplest/cleanest approach:
> > > >
> > > > 1) Proceed with the above changes without changing percpu_ref_tryget
> in
> > > >      zswap_pool_get. Needs special-casing in zswap_store to detect pool->ref
> > > >     being "0" before calling zswap_pool_put[_many].
> > >
> > > My assumption is that we can reorder the code such that if
> > > zswap_pool_get_many() fails we don't call zswap_pool_put_many() to
> > > begin with (e.g. jump to a label after zswap_pool_put_many()).
> >
> > However, the pool refcount could change between the start and end of
> > zswap_store.
> 
> I am not sure what you mean. If zswap_pool_get_many() fails then we
> just do not call zswap_pool_put_many() at all and abort.

I guess I was thinking of a scenario where zswap_pool_get_many() returns
true; subsequently, the pool refcount reaches 0 before the zswap_pool_put_many().
I just realized this shouldn’t happen, so I think we are Ok. Will think about this
some more while creating the follow-up patch.

> 
> >
> > >
> > > > 2) Modify zswap_pool_get/zswap_pool_get_many to use
> > > percpu_ref_get_many
> > > >     and avoid special-casing to detect pool->ref being "0" before calling
> > > >     zswap_pool_put[_many].
> > >
> > > I don't think we can simply switch the tryget to a get, as I believe
> > > we can race with the pool being destroyed.
> >
> > That was my initial thought as well, but I figured this couldn't happen
> > since the pool->ref is initialized to "1", and based on the existing
> > implementation. In any case, I can understand the intent of the use
> > of "tryget"; it is just that it adds to the considerations for reference
> > batching.
> 
> The initial ref can be dropped in __zswap_param_set() if a new pool is
> created (see the call to ercpu_ref_kill(()).

I see.. this makes sense, thanks Yosry!

> 
> >
> > >
> > > > 3) Keep the approach in v7 where obj_cgroup_get/put is localized to
> > > >     zswap_store_page for both success and error conditions, and any
> > > unwinding
> > > >     state in zswap_store will take care of dropping references obtained
> from
> > > >     prior successful writes (from this or prior invocations of zswap_store).
> > >
> > > I am also fine with doing that and doing the reference batching as a follow
> up.
> >
> > I think so too! We could try and improve upon (3) with reference batching
> > in a follow-up patch.
> 
> SGTM.

Thanks, will proceed!
Sridhar, Kanchana P Sept. 26, 2024, 7:39 p.m. UTC | #39
> -----Original Message-----
> From: Johannes Weiner <hannes@cmpxchg.org>
> Sent: Thursday, September 26, 2024 11:43 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Yosry Ahmed <yosryahmed@google.com>; linux-kernel@vger.kernel.org;
> linux-mm@kvack.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
> 
> On Thu, Sep 26, 2024 at 05:29:30PM +0000, Sridhar, Kanchana P wrote:
> > > > 3) Keep the approach in v7 where obj_cgroup_get/put is localized to
> > > >     zswap_store_page for both success and error conditions, and any
> > > unwinding
> > > >     state in zswap_store will take care of dropping references obtained
> from
> > > >     prior successful writes (from this or prior invocations of zswap_store).
> > >
> > > I am also fine with doing that and doing the reference batching as a follow
> up.
> >
> > I think so too! We could try and improve upon (3) with reference batching
> > in a follow-up patch.
> 
> Yeah, I agree. The percpu-refcounts are not that expensive, we should
> be able to live with per-page ops for now.
> 
> One thing you *can* do from the start is tryget a pool reference in
> zswap_store(), to prevent the pool's untimely demise while you work on
> it, and then in zswap_store_page() you can do gets instead of trygets.

Sure, this sounds good Johannes, thanks for the suggestion! I already
do a zswap_pool_current_get() at the beginning of zswap_store in the
v7 code, for this purpose.

> 
> You'd have to rename zswap_pool_get() to zswap_pool_tryget() (which is
> probably for the best) and implement the trivial new zswap_pool_get().

Ok, will do so.

Thanks,
Kanchana
Sridhar, Kanchana P Sept. 26, 2024, 7:40 p.m. UTC | #40
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Thursday, September 26, 2024 11:46 AM
> To: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
> 
> On Thu, Sep 26, 2024 at 11:43 AM Johannes Weiner <hannes@cmpxchg.org>
> wrote:
> >
> > On Thu, Sep 26, 2024 at 05:29:30PM +0000, Sridhar, Kanchana P wrote:
> > > > > 3) Keep the approach in v7 where obj_cgroup_get/put is localized to
> > > > >     zswap_store_page for both success and error conditions, and any
> > > > unwinding
> > > > >     state in zswap_store will take care of dropping references obtained
> from
> > > > >     prior successful writes (from this or prior invocations of
> zswap_store).
> > > >
> > > > I am also fine with doing that and doing the reference batching as a
> follow up.
> > >
> > > I think so too! We could try and improve upon (3) with reference batching
> > > in a follow-up patch.
> >
> > Yeah, I agree. The percpu-refcounts are not that expensive, we should
> > be able to live with per-page ops for now.
> >
> > One thing you *can* do from the start is tryget a pool reference in
> > zswap_store(), to prevent the pool's untimely demise while you work on
> > it, and then in zswap_store_page() you can do gets instead of trygets.
> >
> > You'd have to rename zswap_pool_get() to zswap_pool_tryget() (which is
> > probably for the best) and implement the trivial new zswap_pool_get().
> 
> Yeah I was actually planning to send a follow-up patch to do exactly
> that until we figure out proper patching for the refcounts. Even
> better if Kanchana incorporates it in the next version :)

Sure, Yosry, I will incorporate it in the next version!

Thanks again,
Kanchana
diff mbox series

Patch

diff --git a/mm/Kconfig b/mm/Kconfig
index 09aebca1cae3..c659fb732ec4 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -59,6 +59,14 @@  config ZSWAP_SHRINKER_DEFAULT_ON
 	  reducing the chance that cold pages will reside in the zswap pool
 	  and consume memory indefinitely.
 
+config ZSWAP_STORE_THP_DEFAULT_ON
+	bool "Store mTHP and THP folios in zswap"
+	depends on ZSWAP
+	default n
+	help
+	  If selected, zswap will process mTHP and THP folios by
+	  compressing and storing each 4K page in the large folio.
+
 choice
 	prompt "Default compressor"
 	depends on ZSWAP
diff --git a/mm/zswap.c b/mm/zswap.c
index 8f2e0ab34c84..16ab770546d6 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -127,6 +127,14 @@  static bool zswap_shrinker_enabled = IS_ENABLED(
 		CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
 module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
 
+/*
+ * Enable/disable zswap processing of mTHP folios.
+ * For now, only zswap_store will process mTHP folios.
+ */
+static bool zswap_mthp_enabled = IS_ENABLED(
+		CONFIG_ZSWAP_STORE_THP_DEFAULT_ON);
+module_param_named(mthp_enabled, zswap_mthp_enabled, bool, 0644);
+
 bool zswap_is_enabled(void)
 {
 	return zswap_enabled;
@@ -1471,9 +1479,9 @@  static void zswap_delete_stored_offsets(struct xarray *tree,
  * @objcg: The folio's objcg.
  * @pool:  The zswap_pool to store the compressed data for the page.
  */
-static bool __maybe_unused zswap_store_page(struct folio *folio, long index,
-					    struct obj_cgroup *objcg,
-					    struct zswap_pool *pool)
+static bool zswap_store_page(struct folio *folio, long index,
+			     struct obj_cgroup *objcg,
+			     struct zswap_pool *pool)
 {
 	swp_entry_t swp = folio->swap;
 	int type = swp_type(swp);
@@ -1551,51 +1559,63 @@  static bool __maybe_unused zswap_store_page(struct folio *folio, long index,
 	return false;
 }
 
+/*
+ * Modified to store mTHP folios. Each page in the mTHP will be compressed
+ * and stored sequentially.
+ */
 bool zswap_store(struct folio *folio)
 {
 	long nr_pages = folio_nr_pages(folio);
 	swp_entry_t swp = folio->swap;
 	pgoff_t offset = swp_offset(swp);
 	struct xarray *tree = swap_zswap_tree(swp);
-	struct zswap_entry *entry;
 	struct obj_cgroup *objcg = NULL;
 	struct mem_cgroup *memcg = NULL;
+	struct zswap_pool *pool;
+	bool ret = false;
+	long index;
 
 	VM_WARN_ON_ONCE(!folio_test_locked(folio));
 	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
 
-	/* Large folios aren't supported */
-	if (folio_test_large(folio))
+	/* Storing large folios isn't enabled */
+	if (!zswap_mthp_enabled && folio_test_large(folio))
 		return false;
 
 	if (!zswap_enabled)
-		goto check_old;
+		goto reject;
 
-	/* Check cgroup limits */
+	/*
+	 * Check cgroup limits:
+	 *
+	 * The cgroup zswap limit check is done once at the beginning of an
+	 * mTHP store, and not within zswap_store_page() for each page
+	 * in the mTHP. We do however check the zswap pool limits at the
+	 * start of zswap_store_page(). What this means is, the cgroup
+	 * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
+	 * However, the per-store-page zswap pool limits check should
+	 * hopefully trigger the cgroup aware and zswap LRU aware global
+	 * reclaim implemented in the shrinker. If this assumption holds,
+	 * the cgroup exceeding the zswap limits could potentially be
+	 * resolved before the next zswap_store, and if it is not, the next
+	 * zswap_store would fail the cgroup zswap limit check at the start.
+	 */
 	objcg = get_obj_cgroup_from_folio(folio);
 	if (objcg && !obj_cgroup_may_zswap(objcg)) {
 		memcg = get_mem_cgroup_from_objcg(objcg);
 		if (shrink_memcg(memcg)) {
 			mem_cgroup_put(memcg);
-			goto reject;
+			goto put_objcg;
 		}
 		mem_cgroup_put(memcg);
 	}
 
 	if (zswap_check_limits())
-		goto reject;
-
-	/* allocate entry */
-	entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
-	if (!entry) {
-		zswap_reject_kmemcache_fail++;
-		goto reject;
-	}
+		goto put_objcg;
 
-	/* if entry is successfully added, it keeps the reference */
-	entry->pool = zswap_pool_current_get();
-	if (!entry->pool)
-		goto freepage;
+	pool = zswap_pool_current_get();
+	if (!pool)
+		goto put_objcg;
 
 	if (objcg) {
 		memcg = get_mem_cgroup_from_objcg(objcg);
@@ -1606,60 +1626,34 @@  bool zswap_store(struct folio *folio)
 		mem_cgroup_put(memcg);
 	}
 
-	if (!zswap_compress(&folio->page, entry))
-		goto put_pool;
-
-	entry->swpentry = swp;
-	entry->objcg = objcg;
-	entry->referenced = true;
-
-	if (!zswap_store_entry(tree, entry))
-		goto store_failed;
-
-	if (objcg) {
-		obj_cgroup_charge_zswap(objcg, entry->length);
-		count_objcg_event(objcg, ZSWPOUT);
-	}
-
 	/*
-	 * We finish initializing the entry while it's already in xarray.
-	 * This is safe because:
-	 *
-	 * 1. Concurrent stores and invalidations are excluded by folio lock.
-	 *
-	 * 2. Writeback is excluded by the entry not being on the LRU yet.
-	 *    The publishing order matters to prevent writeback from seeing
-	 *    an incoherent entry.
+	 * Store each page of the folio as a separate entry. If we fail to store
+	 * a page, unwind by removing all the previous pages we stored.
 	 */
-	if (entry->length) {
-		INIT_LIST_HEAD(&entry->lru);
-		zswap_lru_add(&zswap_list_lru, entry);
+	for (index = 0; index < nr_pages; ++index) {
+		if (!zswap_store_page(folio, index, objcg, pool))
+			goto put_pool;
 	}
 
-	/* update stats */
-	atomic_inc(&zswap_stored_pages);
-	count_vm_event(ZSWPOUT);
-
-	return true;
+	ret = true;
 
-store_failed:
-	zpool_free(entry->pool->zpool, entry->handle);
 put_pool:
-	zswap_pool_put(entry->pool);
-freepage:
-	zswap_entry_cache_free(entry);
-reject:
+	zswap_pool_put(pool);
+put_objcg:
 	obj_cgroup_put(objcg);
 	if (zswap_pool_reached_full)
 		queue_work(shrink_wq, &zswap_shrink_work);
-check_old:
+reject:
 	/*
-	 * If the zswap store fails or zswap is disabled, we must invalidate the
-	 * possibly stale entry which was previously stored at this offset.
-	 * Otherwise, writeback could overwrite the new data in the swapfile.
+	 * If the zswap store fails or zswap is disabled, we must invalidate
+	 * the possibly stale entries which were previously stored at the
+	 * offsets corresponding to each page of the folio. Otherwise,
+	 * writeback could overwrite the new data in the swapfile.
 	 */
-	zswap_delete_stored_offsets(tree, offset, nr_pages);
-	return false;
+	if (!ret)
+		zswap_delete_stored_offsets(tree, offset, nr_pages);
+
+	return ret;
 }
 
 bool zswap_load(struct folio *folio)