
[v2] zswap: add writeback_time_threshold interface to shrink zswap pool

Message ID 20231025095248.458789-1-hezhongkun.hzk@bytedance.com (mailing list archive)
State New

Commit Message

Zhongkun He Oct. 25, 2023, 9:52 a.m. UTC
zswap has no suitable method for selecting objects that have not been
accessed for a long time; it only shrinks the pool when the limit is
hit. If the limit is set too high, zswap is likely to waste memory on
cold objects.

This patch adds a new interface, writeback_time_threshold, to shrink the
zswap pool proactively based on a time threshold in seconds, e.g.::

echo 600 > /sys/module/zswap/parameters/writeback_time_threshold

Entries (struct zswap_entry) that have not been accessed for more than
600 seconds are written back to swap. If the threshold is set to 0, all
entries are written back.

This patch provides more control by specifying the time at which to
start writing pages out.

Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
---
v2:
   - rename sto_time to last_ac_time (suggested by Nhat Pham)
   - update the access time when a page is read (reported by
     Yosry Ahmed and Nhat Pham)
   - add config option (suggested by Yosry Ahmed)
---
 Documentation/admin-guide/mm/zswap.rst |   9 +++
 mm/Kconfig                             |  11 +++
 mm/zswap.c                             | 104 +++++++++++++++++++++++++
 3 files changed, 124 insertions(+)
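
For readers who want to drive the interface from a program rather than a shell, the following is a minimal userspace sketch in C. It assumes a kernel with this proposed patch applied and CONFIG_ZSWAP_WRITEBACK_TIME_ON enabled; the helper name write_threshold() is hypothetical and is not part of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical helper: write a threshold in seconds to a module-parameter
 * file such as /sys/module/zswap/parameters/writeback_time_threshold.
 * Returns 0 on success or a negative errno value on failure.
 */
static int write_threshold(const char *path, unsigned int seconds)
{
	FILE *f;
	int err = 0;

	assert(path != NULL);

	f = fopen(path, "w");
	if (!f)
		return -errno;

	if (fprintf(f, "%u\n", seconds) < 0)
		err = -EIO;

	/* fclose() flushes the buffer, so a rejected write can surface here. */
	if (fclose(f) != 0 && !err)
		err = -errno;

	return err;
}
```

Calling write_threshold("/sys/module/zswap/parameters/writeback_time_threshold", 600) would be the equivalent of the echo command in the commit message. In the patch as posted, the parameter's set handler runs the writeback itself, so the write returns only after the reclaim loop has finished or given up.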

Comments

Nhat Pham Oct. 28, 2023, 12:04 a.m. UTC | #1
On Wed, Oct 25, 2023 at 2:52 AM Zhongkun He
<hezhongkun.hzk@bytedance.com> wrote:
>
> zswap does not have a suitable method to select objects that have not
> been accessed for a long time, and just shrink the pool when the limit
> is hit. There is a high probability of wasting memory in zswap if the
> limit is too high.
>
> This patch add a new interface writeback_time_threshold to shrink zswap
> pool proactively based on the time threshold in second, e.g.::
>
> echo 600 > /sys/module/zswap/parameters/writeback_time_threshold
>
> If zswap_entrys have not been accessed for more than 600 seconds, they
> will be swapout to swap. if set to 0, all of them will be swapout.
>
> This patch provides more control by specifying the time at which to
> start writing pages out.
>
> Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> ---
> v2:
>    - rename sto_time to last_ac_time (suggested by Nhat Pham)
>    - update the access time when a page is read (reported by
>      Yosry Ahmed and Nhat Pham)
>    - add config option (suggested by Yosry Ahmed)
> ---
>  Documentation/admin-guide/mm/zswap.rst |   9 +++
>  mm/Kconfig                             |  11 +++
>  mm/zswap.c                             | 104 +++++++++++++++++++++++++
>  3 files changed, 124 insertions(+)
>
> diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
> index 45b98390e938..7aec245f89b4 100644
> --- a/Documentation/admin-guide/mm/zswap.rst
> +++ b/Documentation/admin-guide/mm/zswap.rst
> @@ -153,6 +153,15 @@ attribute, e. g.::
>
>  Setting this parameter to 100 will disable the hysteresis.
>
> +When there is a lot of cold memory according to the last accessed time in the
> +zswap, it can be swapout and save memory in userspace proactively. User can
> +write writeback time threshold in second to enable it, e.g.::
> +
> +  echo 600 > /sys/module/zswap/parameters/writeback_time_threshold
> +
> +If zswap_entrys have not been accessed for more than 600 seconds, they will be
> +swapout. if set to 0, all of them will be swapout.

My original concern with this approach (i.e., regarding what value should be
used, and how frequently userspace should trigger this time-based writeback
mechanism) still stands.

If I'm a user of this feature, how would I figure out how long an object should
lie dormant in the zswap pool before it is highly likely to be a cold object?
Users have no clue what the access time stats look like, what their
distribution is, etc., and will have to somehow guesstimate this based purely
on their knowledge of the program's memory access patterns (which, in many
cases, are intentionally abstracted away).

It's rather hard for users to know what value of cutoff makes sense, without
extensive experiments on a realistic workload.

If I may ask, how do you use this feature internally? You don't have to
reveal any NDA-breaking details of course, but just a rough idea of
the kind of procedure used to determine sensible threshold values will
help your case and the future users of this feature a lot.

> +
>  A debugfs interface is provided for various statistic about pool size, number
>  of pages stored, same-value filled pages and various counters for the reasons
>  pages are rejected.
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 89971a894b60..426358d2050b 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -61,6 +61,17 @@ config ZSWAP_EXCLUSIVE_LOADS_DEFAULT_ON
>           The cost is that if the page was never dirtied and needs to be
>           swapped out again, it will be re-compressed.
>
> +config ZSWAP_WRITEBACK_TIME_ON
> +        bool "writeback zswap based on the last accessed time"
> +        depends on ZSWAP
> +        default n
> +        help
> +          If selected, the feature for tracking last accessed time  will be
> +          enabled at boot, otherwise it will be disabled.

nit: looks like there's a double space between "time" and "will"?

> +
> +          The zswap can be swapout and save memory in userspace proactively
> +          by writing writeback_time_threshold in second.

I think we should include a bit more detail in this config option description.
Feel free to just recycle details from the commit log of course, but at least
there should be something along the lines of:

When this is selected, users can proactively trigger writebacks by writing a
value to the writeback_time_threshold file. The pages whose last access time
is older than this value will be written back.

(please beautify that paragraph if you use it).

> +
>  choice
>         prompt "Default compressor"
>         depends on ZSWAP
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 0c5ca896edf2..331ee276afbd 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -144,6 +144,19 @@ static bool zswap_exclusive_loads_enabled = IS_ENABLED(
>                 CONFIG_ZSWAP_EXCLUSIVE_LOADS_DEFAULT_ON);
>  module_param_named(exclusive_loads, zswap_exclusive_loads_enabled, bool, 0644);
>
> +
> +#ifdef CONFIG_ZSWAP_WRITEBACK_TIME_ON
> +/* zswap writeback time threshold in second */
> +static unsigned int  zswap_writeback_time_thr;
> +static int zswap_writeback_time_thr_param_set(const char *, const struct kernel_param *);
> +static const struct kernel_param_ops zswap_writeback_param_ops = {
> +       .set =          zswap_writeback_time_thr_param_set,
> +       .get =          param_get_uint,
> +};
> +module_param_cb(writeback_time_threshold, &zswap_writeback_param_ops,
> +                       &zswap_writeback_time_thr, 0644);
> +#endif
> +
>  /* Number of zpools in zswap_pool (empirically determined for scalability) */
>  #define ZSWAP_NR_ZPOOLS 32
>
> @@ -200,6 +213,7 @@ struct zswap_pool {
>   * value - value of the same-value filled pages which have same content
>   * objcg - the obj_cgroup that the compressed memory is charged to
>   * lru - handle to the pool's lru used to evict pages.
> + * last_ac_time - the last accessed time of zswap_entry.
>   */
>  struct zswap_entry {
>         struct rb_node rbnode;
> @@ -213,6 +227,9 @@ struct zswap_entry {
>         };
>         struct obj_cgroup *objcg;
>         struct list_head lru;
> +#ifdef CONFIG_ZSWAP_WRITEBACK_TIME_ON
> +       ktime_t last_ac_time;
> +#endif
>  };
>
>  /*
> @@ -291,6 +308,27 @@ static void zswap_update_total_size(void)
>         zswap_pool_total_size = total;
>  }
>
> +#ifdef CONFIG_ZSWAP_WRITEBACK_TIME_ON
> +static void zswap_set_access_time(struct zswap_entry *entry)
> +{
> +       entry->last_ac_time = ktime_get_boottime();
> +}
> +
> +static void zswap_clear_access_time(struct zswap_entry *entry)
> +{
> +       entry->last_ac_time = 0;
> +}
> +#else
> +static void zswap_set_access_time(struct zswap_entry *entry)
> +{
> +}
> +
> +static void zswap_clear_access_time(struct zswap_entry *entry)
> +{
> +}
> +#endif
> +
> +
>  /*********************************
>  * zswap entry functions
>  **********************************/
> @@ -398,6 +436,7 @@ static void zswap_free_entry(struct zswap_entry *entry)
>         else {
>                 spin_lock(&entry->pool->lru_lock);
>                 list_del(&entry->lru);
> +               zswap_clear_access_time(entry);
>                 spin_unlock(&entry->pool->lru_lock);
>                 zpool_free(zswap_find_zpool(entry), entry->handle);
>                 zswap_pool_put(entry->pool);
> @@ -712,6 +751,52 @@ static void shrink_worker(struct work_struct *w)
>         zswap_pool_put(pool);
>  }
>
> +#ifdef CONFIG_ZSWAP_WRITEBACK_TIME_ON
> +static bool zswap_reach_timethr(struct zswap_pool *pool)
> +{
> +       struct zswap_entry *entry;
> +       ktime_t expire_time = 0;
> +       bool ret = false;
> +
> +       spin_lock(&pool->lru_lock);
> +
> +       if (list_empty(&pool->lru))
> +               goto out;
> +
> +       entry = list_last_entry(&pool->lru, struct zswap_entry, lru);
> +       expire_time = ktime_add(entry->last_ac_time,
> +                       ns_to_ktime(zswap_writeback_time_thr * NSEC_PER_SEC));
> +
> +       if (ktime_after(ktime_get_boottime(), expire_time))
> +               ret = true;
> +out:
> +       spin_unlock(&pool->lru_lock);
> +       return ret;
> +}
> +
> +static void zswap_reclaim_entry_by_timethr(void)
> +{
> +       struct zswap_pool *pool = zswap_pool_current_get();
> +       int ret, failures = 0;
> +
> +       if (!pool)
> +               return;
> +
> +       while (zswap_reach_timethr(pool)) {
> +               ret = zswap_reclaim_entry(pool);
> +               if (ret) {
> +                       zswap_reject_reclaim_fail++;
> +                       if (ret != -EAGAIN)
> +                               break;
> +                       if (++failures == MAX_RECLAIM_RETRIES)
> +                               break;
> +               }
> +               cond_resched();
> +       }
> +       zswap_pool_put(pool);
> +}
> +#endif
> +
>  static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
>  {
>         int i;
> @@ -1040,6 +1125,23 @@ static int zswap_enabled_param_set(const char *val,
>         return ret;
>  }
>
> +#ifdef CONFIG_ZSWAP_WRITEBACK_TIME_ON
> +static int zswap_writeback_time_thr_param_set(const char *val,
> +                               const struct kernel_param *kp)
> +{
> +       int ret = -ENODEV;
> +
> +       /* if this is load-time (pre-init) param setting, just return. */
> +       if (system_state != SYSTEM_RUNNING)
> +               return ret;
> +
> +       ret = param_set_uint(val, kp);
> +       if (!ret)
> +               zswap_reclaim_entry_by_timethr();
> +       return ret;
> +}
> +#endif
> +
>  /*********************************
>  * writeback code
>  **********************************/
> @@ -1372,6 +1474,7 @@ bool zswap_store(struct folio *folio)
>         if (entry->length) {
>                 spin_lock(&entry->pool->lru_lock);
>                 list_add(&entry->lru, &entry->pool->lru);
> +               zswap_set_access_time(entry);
>                 spin_unlock(&entry->pool->lru_lock);
>         }
>         spin_unlock(&tree->lock);
> @@ -1484,6 +1587,7 @@ bool zswap_load(struct folio *folio)
>                 folio_mark_dirty(folio);
>         } else if (entry->length) {
>                 spin_lock(&entry->pool->lru_lock);
> +               zswap_set_access_time(entry);
>                 list_move(&entry->lru, &entry->pool->lru);
>                 spin_unlock(&entry->pool->lru_lock);
>         }
> --
> 2.25.1
>

Do you have any experimental results/benchmarks that show
the wins from this approach?

Writing back cold pages from zswap is a good idea from a
theoretical and philosophical POV, but all sorts of things could go
wrong, especially if we write back pages that turn out to be needed
later on. Some experimental results would be nice.

Zhongkun He Oct. 30, 2023, 3:34 a.m. UTC | #2
Hi Nhat, thanks for your time.

> My original concern with this approach (i.e., regarding what value should be
> used, and how frequently userspace should trigger this time-based writeback
> mechanism) still stands.
>
> If I'm a user of this feature, how would I figure out how long an object should
> lie dormant in the zswap pool before it is highly likely to be a cold object?
> Users have no clue what the access time stats look like, what their
> distribution is, etc., and will have to somehow guesstimate this based purely
> on their knowledge of the program's memory access patterns (which, in many
> cases, are intentionally abstracted away).
>
> It's rather hard for users to know what value of cutoff makes sense, without
> extensive experiments on a realistic workload.

I understand your concern, and it is indeed a problem. This feature is
currently used in too few contexts, and determining the threshold is very
difficult. So I have a new idea: a new patch that shows the distribution of
dormant times, as you mentioned above. Based on this distribution, users can
make better choices.

>
> If I may ask, how do you use this feature internally? You don't have to
> reveal any NDA-breaking details of course, but just a rough idea of
> the kind of procedure to determine sensible threshold values will
> help your case and the future user of this feature a lot.
>

As you mentioned, we decide this threshold based on knowledge of the
program's memory access patterns and on repeated test results for different
business models; it is just an empirical value.
So the patch to show the dormant-time distribution should be valuable.

>
> nit: looks there's a double space between time and will?
>

Oh, got it, thanks.

> > +
> > +          The zswap can be swapout and save memory in userspace proactively
> > +          by writing writeback_time_threshold in second.
>
> I think we should include a bit more details in this config option description.
> Feel free to just recycle details from the commit log of course, but at least
> there should be something along the line of:
>
> When this is selected, users can proactively trigger writebacks by writing a
> value to the writeback_time_threshold file. The pages whose last access time
> is older than this value will be written back.
>
> (please beautify that paragraph if you use it).
>

Thanks a lot for your description. I will add it.

>
> Do you have any experimental results/benchmarks that show
> the wins from this approach?

OK, I will add them in the next version.
>
> Writing back cold pages from zswap is a good idea from a
> theoretical and philosophical POV, but all sort of things could go
> wrong, especially if we write pages that turn out to be needed
> later on. Some experimental results would be nice.

Thanks, I will add a new patch to show the last accessed time distribution
and add experimental results.
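
To make the "dormant time distribution" idea from this reply concrete, here is a small userspace sketch in C of how idle times could be bucketed into a histogram. It is purely illustrative: the function name, the bucket edges and the array-based input are assumptions, and nothing here reflects what a follow-up patch would actually expose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Count idle times (seconds since last access) into buckets.
 * bounds[] holds ascending, exclusive upper edges; counts[] must have
 * nbounds + 1 slots, the last one catching everything at or above
 * bounds[nbounds - 1].
 */
static void idle_histogram(const uint64_t *idle_secs, size_t n,
			   const uint64_t *bounds, size_t nbounds,
			   uint64_t *counts)
{
	size_t i, b;

	assert(nbounds > 0);

	for (b = 0; b <= nbounds; b++)
		counts[b] = 0;

	for (i = 0; i < n; i++) {
		/* Find the first bucket whose upper edge exceeds the idle time. */
		for (b = 0; b < nbounds; b++)
			if (idle_secs[i] < bounds[b])
				break;
		counts[b]++;
	}
}
```

With edges of 60, 600 and 3600 seconds, for example, a user could see how many entries would be affected by a 600-second threshold before writing it, by summing the buckets at or above that edge.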

Patch

diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
index 45b98390e938..7aec245f89b4 100644
--- a/Documentation/admin-guide/mm/zswap.rst
+++ b/Documentation/admin-guide/mm/zswap.rst
@@ -153,6 +153,15 @@  attribute, e. g.::
 
 Setting this parameter to 100 will disable the hysteresis.
 
+When there is a lot of cold memory according to the last accessed time in the
+zswap, it can be swapout and save memory in userspace proactively. User can
+write writeback time threshold in second to enable it, e.g.::
+
+  echo 600 > /sys/module/zswap/parameters/writeback_time_threshold
+
+If zswap_entrys have not been accessed for more than 600 seconds, they will be
+swapout. if set to 0, all of them will be swapout.
+
 A debugfs interface is provided for various statistic about pool size, number
 of pages stored, same-value filled pages and various counters for the reasons
 pages are rejected.
diff --git a/mm/Kconfig b/mm/Kconfig
index 89971a894b60..426358d2050b 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -61,6 +61,17 @@  config ZSWAP_EXCLUSIVE_LOADS_DEFAULT_ON
 	  The cost is that if the page was never dirtied and needs to be
 	  swapped out again, it will be re-compressed.
 
+config ZSWAP_WRITEBACK_TIME_ON
+        bool "writeback zswap based on the last accessed time"
+        depends on ZSWAP
+        default n
+        help
+          If selected, the feature for tracking last accessed time  will be
+          enabled at boot, otherwise it will be disabled.
+
+          The zswap can be swapout and save memory in userspace proactively
+          by writing writeback_time_threshold in second.
+
 choice
 	prompt "Default compressor"
 	depends on ZSWAP
diff --git a/mm/zswap.c b/mm/zswap.c
index 0c5ca896edf2..331ee276afbd 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -144,6 +144,19 @@  static bool zswap_exclusive_loads_enabled = IS_ENABLED(
 		CONFIG_ZSWAP_EXCLUSIVE_LOADS_DEFAULT_ON);
 module_param_named(exclusive_loads, zswap_exclusive_loads_enabled, bool, 0644);
 
+
+#ifdef CONFIG_ZSWAP_WRITEBACK_TIME_ON
+/* zswap writeback time threshold in second */
+static unsigned int  zswap_writeback_time_thr;
+static int zswap_writeback_time_thr_param_set(const char *, const struct kernel_param *);
+static const struct kernel_param_ops zswap_writeback_param_ops = {
+	.set =		zswap_writeback_time_thr_param_set,
+	.get =          param_get_uint,
+};
+module_param_cb(writeback_time_threshold, &zswap_writeback_param_ops,
+			&zswap_writeback_time_thr, 0644);
+#endif
+
 /* Number of zpools in zswap_pool (empirically determined for scalability) */
 #define ZSWAP_NR_ZPOOLS 32
 
@@ -200,6 +213,7 @@  struct zswap_pool {
  * value - value of the same-value filled pages which have same content
  * objcg - the obj_cgroup that the compressed memory is charged to
  * lru - handle to the pool's lru used to evict pages.
+ * last_ac_time - the last accessed time of zswap_entry.
  */
 struct zswap_entry {
 	struct rb_node rbnode;
@@ -213,6 +227,9 @@  struct zswap_entry {
 	};
 	struct obj_cgroup *objcg;
 	struct list_head lru;
+#ifdef CONFIG_ZSWAP_WRITEBACK_TIME_ON
+	ktime_t last_ac_time;
+#endif
 };
 
 /*
@@ -291,6 +308,27 @@  static void zswap_update_total_size(void)
 	zswap_pool_total_size = total;
 }
 
+#ifdef CONFIG_ZSWAP_WRITEBACK_TIME_ON
+static void zswap_set_access_time(struct zswap_entry *entry)
+{
+	entry->last_ac_time = ktime_get_boottime();
+}
+
+static void zswap_clear_access_time(struct zswap_entry *entry)
+{
+	entry->last_ac_time = 0;
+}
+#else
+static void zswap_set_access_time(struct zswap_entry *entry)
+{
+}
+
+static void zswap_clear_access_time(struct zswap_entry *entry)
+{
+}
+#endif
+
+
 /*********************************
 * zswap entry functions
 **********************************/
@@ -398,6 +436,7 @@  static void zswap_free_entry(struct zswap_entry *entry)
 	else {
 		spin_lock(&entry->pool->lru_lock);
 		list_del(&entry->lru);
+		zswap_clear_access_time(entry);
 		spin_unlock(&entry->pool->lru_lock);
 		zpool_free(zswap_find_zpool(entry), entry->handle);
 		zswap_pool_put(entry->pool);
@@ -712,6 +751,52 @@  static void shrink_worker(struct work_struct *w)
 	zswap_pool_put(pool);
 }
 
+#ifdef CONFIG_ZSWAP_WRITEBACK_TIME_ON
+static bool zswap_reach_timethr(struct zswap_pool *pool)
+{
+	struct zswap_entry *entry;
+	ktime_t expire_time = 0;
+	bool ret = false;
+
+	spin_lock(&pool->lru_lock);
+
+	if (list_empty(&pool->lru))
+		goto out;
+
+	entry = list_last_entry(&pool->lru, struct zswap_entry, lru);
+	expire_time = ktime_add(entry->last_ac_time,
+			ns_to_ktime(zswap_writeback_time_thr * NSEC_PER_SEC));
+
+	if (ktime_after(ktime_get_boottime(), expire_time))
+		ret = true;
+out:
+	spin_unlock(&pool->lru_lock);
+	return ret;
+}
+
+static void zswap_reclaim_entry_by_timethr(void)
+{
+	struct zswap_pool *pool = zswap_pool_current_get();
+	int ret, failures = 0;
+
+	if (!pool)
+		return;
+
+	while (zswap_reach_timethr(pool)) {
+		ret = zswap_reclaim_entry(pool);
+		if (ret) {
+			zswap_reject_reclaim_fail++;
+			if (ret != -EAGAIN)
+				break;
+			if (++failures == MAX_RECLAIM_RETRIES)
+				break;
+		}
+		cond_resched();
+	}
+	zswap_pool_put(pool);
+}
+#endif
+
 static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
 {
 	int i;
@@ -1040,6 +1125,23 @@  static int zswap_enabled_param_set(const char *val,
 	return ret;
 }
 
+#ifdef CONFIG_ZSWAP_WRITEBACK_TIME_ON
+static int zswap_writeback_time_thr_param_set(const char *val,
+				const struct kernel_param *kp)
+{
+	int ret = -ENODEV;
+
+	/* if this is load-time (pre-init) param setting, just return. */
+	if (system_state != SYSTEM_RUNNING)
+		return ret;
+
+	ret = param_set_uint(val, kp);
+	if (!ret)
+		zswap_reclaim_entry_by_timethr();
+	return ret;
+}
+#endif
+
 /*********************************
 * writeback code
 **********************************/
@@ -1372,6 +1474,7 @@  bool zswap_store(struct folio *folio)
 	if (entry->length) {
 		spin_lock(&entry->pool->lru_lock);
 		list_add(&entry->lru, &entry->pool->lru);
+		zswap_set_access_time(entry);
 		spin_unlock(&entry->pool->lru_lock);
 	}
 	spin_unlock(&tree->lock);
@@ -1484,6 +1587,7 @@  bool zswap_load(struct folio *folio)
 		folio_mark_dirty(folio);
 	} else if (entry->length) {
 		spin_lock(&entry->pool->lru_lock);
+		zswap_set_access_time(entry);
 		list_move(&entry->lru, &entry->pool->lru);
 		spin_unlock(&entry->pool->lru_lock);
 	}
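
The expiry test in zswap_reach_timethr() and the loop in zswap_reclaim_entry_by_timethr() can be modelled in userspace as below. This is a simplified sketch, not kernel code: timestamps are plain 64-bit nanosecond counters, the LRU is an array ordered oldest first, and the names entry_expired(), count_expired() and MODEL_NSEC_PER_SEC are made up for the illustration. The sketch widens the threshold to 64 bits before multiplying; whether the kernel expression needs the same care depends on the types involved on a given architecture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NSEC_PER_SEC UINT64_C(1000000000)

/* True when strictly more than thr_secs have passed since last_access_ns. */
static bool entry_expired(uint64_t now_ns, uint64_t last_access_ns,
			  unsigned int thr_secs)
{
	/* Widen before multiplying so large thresholds cannot wrap. */
	uint64_t expire_ns = last_access_ns +
			     (uint64_t)thr_secs * MODEL_NSEC_PER_SEC;

	return now_ns > expire_ns;
}

/*
 * last_access_ns[] is ordered oldest first, standing in for the tail of
 * the pool LRU. Returns how many entries would be written back before
 * the first entry that is still within the threshold.
 */
static size_t count_expired(const uint64_t *last_access_ns, size_t n,
			    uint64_t now_ns, unsigned int thr_secs)
{
	size_t i;

	assert(last_access_ns != NULL || n == 0);

	for (i = 0; i < n; i++)
		if (!entry_expired(now_ns, last_access_ns[i], thr_secs))
			break;

	return i;
}
```

Because only the oldest entry is examined on each iteration, the loop stops at the first entry that has not yet expired, which mirrors how the patch checks list_last_entry() before each reclaim attempt. The model leaves out the retry and error handling around zswap_reclaim_entry().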