[v2,2/2] zswap: increment swapin count for non-pivot swapped in pages

Message ID 20240730222707.2324536-3-nphamcs@gmail.com (mailing list archive)
State New
Series improving dynamic zswap shrinker protection scheme

Commit Message

Nhat Pham July 30, 2024, 10:27 p.m. UTC
Currently, we only increment the swapin counter on pivot pages. This
means we are not taking into account pages that also need to be swapped
in, but are already taken care of as part of the readahead window. We
are also incrementing when the pages are read from the zswap pool, which
is inaccurate.

This patch rectifies this issue by incrementing whenever we need to
perform a non-zswap read.

To test this change, I built the kernel under a cgroup with its
memory.max set to 2 GB:

real: 236.66s
user: 4286.06s
sys: 652.86s
swapins: 81552

For comparison, with just the new second chance algorithm, the build
time is as follows:

real: 244.85s
user: 4327.22s
sys: 664.39s
swapins: 94663

Without either:

real: 263.89s
user: 4318.11s
sys: 673.29s
swapins: 227300.5

(average over 5 runs)

With this change, the kernel CPU time reduces by a further 1.7%, and
the real time is reduced by another 3.3%, compared to just the second
chance algorithm by itself. The swapins count also reduces by another
13.85%.

Combining the two changes, we reduce the real time by 10.32%, kernel CPU
time by 3%, and number of swapins by 64.12%.

To gauge the new scheme's ability to offload cold data, I ran another
benchmark, in which the kernel was built under a cgroup with memory.max
set to 3 GB, but with 0.5 GB worth of cold data allocated before each
build (in a shmem file).

Under the old scheme:

real: 197.18s
user: 4365.08s
sys: 289.02s
zswpwb: 72115.2

Under the new scheme:

real: 195.8s
user: 4362.25s
sys: 290.14s
zswpwb: 87277.8

(average over 5 runs)

Notice that we actually observe a 21% increase in the number of written
back pages, so the new scheme is just as good, if not better, at
offloading pages from the zswap pool when they are cold. Build time
reduces by around 0.7% as a result.
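
The percentages quoted above can be re-derived from the raw averages. Below is a minimal sketch of that arithmetic in C; `pct_reduction` is a helper name made up for this illustration and is not part of the patch or the kernel.

```c
#include <assert.h>

/*
 * Relative reduction from `before` to `after`, in percent.
 * A negative result means the value went up instead of down.
 * Hypothetical helper for illustration only; not part of the patch.
 */
double pct_reduction(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}
```

For instance, `pct_reduction(263.89, 236.66)` corresponds to the combined real-time reduction, and `pct_reduction(72115.2, 87277.8)` comes out negative because the zswpwb count increased under the new scheme.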

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
 mm/page_io.c    | 11 ++++++++++-
 mm/swap_state.c |  8 ++------
 2 files changed, 12 insertions(+), 7 deletions(-)

Comments

Yosry Ahmed Aug. 1, 2024, 8:02 p.m. UTC | #1
On Tue, Jul 30, 2024 at 3:27 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> Currently, we only increment the swapin counter on pivot pages. This
> means we are not taking into account pages that also need to be swapped
> in, but are already taken care of as part of the readahead window. We

Hmm, but there is a chance that these pages are not actually needed,
in which case we will unnecessarily increase the zswap protection.
Does the readahead window self-correct if the pages were not used?

> are also incrementing when the pages are read from the zswap pool, which
> is inaccurate.

I feel like this is the more important part. It should be the focus of
the commit log with more details (i.e. why is it wrong to increment
the zswap protection in this case).

Do we need a Fixes and cc:stable for this one? Maybe it can be moved
first to make backports easy.

Nhat Pham Aug. 2, 2024, 11:46 p.m. UTC | #2
On Thu, Aug 1, 2024 at 1:02 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
>
> Hmm, but there is a chance that these pages are not actually needed,
> in which case we will unnecessarily increase the zswap protection.
> Does the readahead window self-correct if the pages were not used?

Hmm yeah it's kinda hard to predict if a swapped in page is strictly
necessary in this context. We don't have this information at the time
of the read.

That said, I think erring on the side of safety is OK here - my
understanding is that readahead, while predictive in nature, only gets
progressively more aggressive if we get signals that it's helpful (i.e.
the memory access patterns display sequential behavior).

I think we also accept this slight inaccuracy (i.e. for pages in the
readahead window that might not necessarily be needed) in the
workingset refault handling behavior. Could you fact check me,
Johannes?


>
> > are also incrementing when the pages are read from the zswap pool, which
> > is inaccurate.
>
> I feel like this is the more important part. It should be the focus of
> the commit log with more details (i.e. why is it wrong to increment
> the zswap protection in this case).

Yeah this is pretty important too :) Maybe I should make it clearer in
the patch commit.

>
> Do we need a Fixes and cc:stable for this one? Maybe it can be moved
> first to make backports easy.

Hmm.

*Technically*, this is broken in older versions of the shrinker as
well, but it's more of an optimization than a bug that can crash the
kernel, so I don't know if it qualifies for a Fixes tag?

Another factor is, under the old scheme, this does not move the needle
much - at least in my benchmarks. This is because the decaying
behavior is so aggressive that incrementing the counter in a couple
places does not matter, when it will be rapidly divided by half later.
This fix only shows clear improvements when applied on top of the new
second chance scheme.
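
As a rough illustration of the decay argument, here is a toy model in C. It assumes the protection counter is simply halved a fixed number of times; the function name and the sample numbers are made up, and this is not the shrinker's actual code.

```c
#include <assert.h>

/*
 * Toy model only: halve `counter` `halvings` times using integer
 * division. Hypothetical helper; the real shrinker logic differs.
 */
unsigned long protection_after_decay(unsigned long counter,
				     unsigned int halvings)
{
	assert(halvings < 64);	/* keep the toy model bounded */
	while (halvings-- > 0)
		counter /= 2;
	return counter;
}
```

With these made-up inputs, a counter of 1000 and a counter of 1016 (16 extra increments) end up only one apart after four halvings, which is the sense in which a few extra increments do not move the needle under aggressive decay.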

I don't have a strong opinion here, but it doesn't seem worth it to
backport IMHO :)
Yosry Ahmed Aug. 3, 2024, 3:22 a.m. UTC | #3
On Fri, Aug 2, 2024 at 4:46 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Thu, Aug 1, 2024 at 1:02 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> >
> > Hmm, but there is a chance that these pages are not actually needed,
> > in which case we will unnecessarily increase the zswap protection.
> > Does the readahead window self-correct if the pages were not used?
>
> Hmm yeah it's kinda hard to predict if a swapped in page is strictly
> necessary in this context. We don't have this information at the time
> of the read.
>
> That said, I think erring on the side of safety is OK here - my
> understanding is that readahead, while predictive in nature, only gets
> progressively more aggressive if we get signals that it's helpful (i.e.
> the memory access patterns display sequential behavior).

If the readahead logic is expected to adapt in these situations (and I
think it is), then I think we are fine. Perhaps we should just leave a
comment that we may increase the protection more than we should for
those readahead cases.

>
> I think we also accept this slight inaccuracy (i.e. for pages in the
> readahead window that might not necessarily be needed) in the
> workingset refault handling behavior. Could you fact check me,
> Johannes?
>
>
> >
> > > are also incrementing when the pages are read from the zswap pool, which
> > > is inaccurate.
> >
> > I feel like this is the more important part. It should be the focus of
> > the commit log with more details (i.e. why is it wrong to increment
> > the zswap protection in this case).
>
> Yeah this is pretty important too :) Maybe I should make it clearer in
> the patch commit.
>
> >
> > Do we need a Fixes and cc:stable for this one? Maybe it can be moved
> > first to make backports easy.
>
> Hmm.
>
> *Technically*, this is broken in older versions of the shrinker as
> well, but it's more of an optimization than a bug that can crash the
> kernel, so I don't know if it qualifies for a Fixes tag?
>
> Another factor is, under the old scheme, this does not move the needle
> much - at least in my benchmarks. This is because the decaying
> behavior is so aggressive that incrementing the counter in a couple
> places does not matter, when it will be rapidly divided by half later.
> This fix only shows clear improvements when applied on top of the new
> second chance scheme.
>
> I don't have a strong opinion here, but it doesn't seem worth it to
> backport IMHO :)

I thought it's a simple change worth backporting, but if it doesn't
move the needle without the second chance algorithm then it's probably
not worth it.

I would still add the "Fixes" tag because technically the logic is
wrong without this patch: it increases the zswap protection when there
are swapins from zswap, which doesn't make much sense.

Patch

diff --git a/mm/page_io.c b/mm/page_io.c
index ff8c99ee3af7..0004c9fbf7e8 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -521,7 +521,15 @@  void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
 
 	if (zswap_load(folio)) {
 		folio_unlock(folio);
-	} else if (data_race(sis->flags & SWP_FS_OPS)) {
+		goto finish;
+	}
+
+	/*
+	 * We have to read the page from slower devices. Increase zswap protection.
+	 */
+	zswap_folio_swapin(folio);
+
+	if (data_race(sis->flags & SWP_FS_OPS)) {
 		swap_read_folio_fs(folio, plug);
 	} else if (synchronous) {
 		swap_read_folio_bdev_sync(folio, sis);
@@ -529,6 +537,7 @@  void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
 		swap_read_folio_bdev_async(folio, sis);
 	}
 
+finish:
 	if (workingset) {
 		delayacct_thrashing_end(&in_thrashing);
 		psi_memstall_leave(&pflags);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index a1726e49a5eb..3a0cf965f32b 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -698,10 +698,8 @@  struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	/* The page was likely read above, so no need for plugging here */
 	folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
 					&page_allocated, false);
-	if (unlikely(page_allocated)) {
-		zswap_folio_swapin(folio);
+	if (unlikely(page_allocated))
 		swap_read_folio(folio, NULL);
-	}
 	return folio;
 }
 
@@ -850,10 +848,8 @@  static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 	/* The folio was likely read above, so no need for plugging here */
 	folio = __read_swap_cache_async(targ_entry, gfp_mask, mpol, targ_ilx,
 					&page_allocated, false);
-	if (unlikely(page_allocated)) {
-		zswap_folio_swapin(folio);
+	if (unlikely(page_allocated))
 		swap_read_folio(folio, NULL);
-	}
 	return folio;
 }
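
The accounting change in the diff above can be summarized with a small userspace model. This is only a sketch: `struct swapin_event` and both counting functions are hypothetical names, and the model assumes every event is a freshly allocated folio for which `swap_read_folio()` is called.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One folio for which a swap read is issued (illustrative model only). */
struct swapin_event {
	bool is_pivot;	/* the faulting page, not a readahead neighbor */
	bool zswap_hit;	/* zswap_load() would have served this folio */
};

/* Before the patch: counted for pivot pages only, even on a zswap hit. */
unsigned long count_before_patch(const struct swapin_event *ev, size_t n)
{
	unsigned long count = 0;

	assert(ev != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (ev[i].is_pivot)
			count++;
	return count;
}

/* After the patch: counted for every read that misses zswap. */
unsigned long count_after_patch(const struct swapin_event *ev, size_t n)
{
	unsigned long count = 0;

	assert(ev != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (!ev[i].zswap_hit)
			count++;
	return count;
}
```

In this model, a pivot page served from zswap followed by three readahead pages read from the backing device is counted once by the old logic and three times by the new one, so the counter tracks reads from the slower device rather than pivot faults.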