diff mbox series

[5/6] mm/zswap: only support zswap_exclusive_loads_enabled

Message ID 20240201-b4-zswap-invalidate-entry-v1-5-56ed496b6e55@bytedance.com (mailing list archive)
State New
Headers show
Series mm/zswap: optimize zswap lru list | expand

Commit Message

Chengming Zhou Feb. 1, 2024, 3:49 p.m. UTC
The !zswap_exclusive_loads_enabled mode will leave compressed copy in
the zswap tree and lru list after the folio swapin.

There are some disadvantages in this mode:
1. It's a waste of memory since there are two copies of data, one is
   folio, the other one is compressed data in zswap. And it's unlikely
   the compressed data is useful in the near future.

2. If that folio is dirtied, the compressed data must be not useful,
   but we don't know and don't invalidate the trashy memory in zswap.

3. It's not reclaimable from zswap shrinker since zswap_writeback_entry()
   will always return -EEXIST and terminate the shrinking process.

On the other hand, the only downside of zswap_exclusive_loads_enabled
is a little more cpu usage/latency when compression, and the same if
the folio is removed from swapcache or dirtied.

Not sure if we should accept the above disadvantages in the case of
!zswap_exclusive_loads_enabled, so send this out for disscusion.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 mm/Kconfig | 16 ----------------
 mm/zswap.c | 14 +++-----------
 2 files changed, 3 insertions(+), 27 deletions(-)

Comments

Johannes Weiner Feb. 1, 2024, 6:12 p.m. UTC | #1
On Thu, Feb 01, 2024 at 03:49:05PM +0000, Chengming Zhou wrote:
> The !zswap_exclusive_loads_enabled mode will leave compressed copy in
> the zswap tree and lru list after the folio swapin.
> 
> There are some disadvantages in this mode:
> 1. It's a waste of memory since there are two copies of data, one is
>    folio, the other one is compressed data in zswap. And it's unlikely
>    the compressed data is useful in the near future.
> 
> 2. If that folio is dirtied, the compressed data must be not useful,
>    but we don't know and don't invalidate the trashy memory in zswap.
> 
> 3. It's not reclaimable from zswap shrinker since zswap_writeback_entry()
>    will always return -EEXIST and terminate the shrinking process.
> 
> On the other hand, the only downside of zswap_exclusive_loads_enabled
> is a little more cpu usage/latency when compression, and the same if
> the folio is removed from swapcache or dirtied.
> 
> Not sure if we should accept the above disadvantages in the case of
> !zswap_exclusive_loads_enabled, so send this out for disscusion.
> 
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

This is interesting.

First, I will say that I never liked this config option, because it's
nearly impossible for a user to answer this question. Much better to
just pick a reasonable default.

What should the default be?

Caching "swapout work" is helpful when the system is thrashing. Then
recently swapped in pages might get swapped out again very soon. It
certainly makes sense with conventional swap, because keeping a clean
copy on the disk saves IO work and doesn't cost any additional memory.

But with zswap, it's different. It saves some compression work on a
thrashing page. But the act of keeping compressed memory contributes
to a higher rate of thrashing. And that can cause IO in other places
like zswap writeback and file memory.

It would be useful to have an A/B test to confirm that not caching is
better. Can you run your test with and without keeping the cache, and
in addition to the timings also compare the deltas for pgscan_anon,
pgscan_file, workingset_refault_anon, workingset_refault_file?
Yosry Ahmed Feb. 2, 2024, 1:04 a.m. UTC | #2
On Thu, Feb 01, 2024 at 01:12:40PM -0500, Johannes Weiner wrote:
> On Thu, Feb 01, 2024 at 03:49:05PM +0000, Chengming Zhou wrote:
> > The !zswap_exclusive_loads_enabled mode will leave compressed copy in
> > the zswap tree and lru list after the folio swapin.
> > 
> > There are some disadvantages in this mode:
> > 1. It's a waste of memory since there are two copies of data, one is
> >    folio, the other one is compressed data in zswap. And it's unlikely
> >    the compressed data is useful in the near future.
> > 
> > 2. If that folio is dirtied, the compressed data must be not useful,
> >    but we don't know and don't invalidate the trashy memory in zswap.
> > 
> > 3. It's not reclaimable from zswap shrinker since zswap_writeback_entry()
> >    will always return -EEXIST and terminate the shrinking process.
> > 
> > On the other hand, the only downside of zswap_exclusive_loads_enabled
> > is a little more cpu usage/latency when compression, and the same if
> > the folio is removed from swapcache or dirtied.
> > 
> > Not sure if we should accept the above disadvantages in the case of
> > !zswap_exclusive_loads_enabled, so send this out for disscusion.
> > 
> > Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> 
> This is interesting.
> 
> First, I will say that I never liked this config option, because it's
> nearly impossible for a user to answer this question. Much better to
> just pick a reasonable default.
> 
> What should the default be?
> 
> Caching "swapout work" is helpful when the system is thrashing. Then
> recently swapped in pages might get swapped out again very soon. It
> certainly makes sense with conventional swap, because keeping a clean
> copy on the disk saves IO work and doesn't cost any additional memory.
> 
> But with zswap, it's different. It saves some compression work on a
> thrashing page. But the act of keeping compressed memory contributes
> to a higher rate of thrashing. And that can cause IO in other places
> like zswap writeback and file memory.

Agreed.

At Google, we have been using exclusive loads for a very long time in
production, so I have no objections to this. The user interface is also
relatively new, so I don't think it will have accumulated users.

> 
> It would be useful to have an A/B test to confirm that not caching is
> better. Can you run your test with and without keeping the cache, and
> in addition to the timings also compare the deltas for pgscan_anon,
> pgscan_file, workingset_refault_anon, workingset_refault_file?

That would be interesting :)
Chengming Zhou Feb. 2, 2024, 12:57 p.m. UTC | #3
On 2024/2/2 02:12, Johannes Weiner wrote:
> On Thu, Feb 01, 2024 at 03:49:05PM +0000, Chengming Zhou wrote:
>> The !zswap_exclusive_loads_enabled mode will leave compressed copy in
>> the zswap tree and lru list after the folio swapin.
>>
>> There are some disadvantages in this mode:
>> 1. It's a waste of memory since there are two copies of data, one is
>>    folio, the other one is compressed data in zswap. And it's unlikely
>>    the compressed data is useful in the near future.
>>
>> 2. If that folio is dirtied, the compressed data must be not useful,
>>    but we don't know and don't invalidate the trashy memory in zswap.
>>
>> 3. It's not reclaimable from zswap shrinker since zswap_writeback_entry()
>>    will always return -EEXIST and terminate the shrinking process.
>>
>> On the other hand, the only downside of zswap_exclusive_loads_enabled
>> is a little more cpu usage/latency when compression, and the same if
>> the folio is removed from swapcache or dirtied.
>>
>> Not sure if we should accept the above disadvantages in the case of
>> !zswap_exclusive_loads_enabled, so send this out for disscusion.
>>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> 
> This is interesting.
> 
> First, I will say that I never liked this config option, because it's
> nearly impossible for a user to answer this question. Much better to
> just pick a reasonable default.

Agree.

> 
> What should the default be?
> 
> Caching "swapout work" is helpful when the system is thrashing. Then
> recently swapped in pages might get swapped out again very soon. It
> certainly makes sense with conventional swap, because keeping a clean
> copy on the disk saves IO work and doesn't cost any additional memory.
> 
> But with zswap, it's different. It saves some compression work on a
> thrashing page. But the act of keeping compressed memory contributes
> to a higher rate of thrashing. And that can cause IO in other places
> like zswap writeback and file memory.
> 
> It would be useful to have an A/B test to confirm that not caching is
> better. Can you run your test with and without keeping the cache, and
> in addition to the timings also compare the deltas for pgscan_anon,
> pgscan_file, workingset_refault_anon, workingset_refault_file?

I just A/B test kernel building in tmpfs directory, memory.max=2GB.
(zswap writeback enabled and shrinker_enabled, one 50GB swapfile)

From the below results, exclusive mode has fewer scan and refault.

                              zswap-invalidate-entry        zswap-invalidate-entry-exclusive
real                          63.80                         63.01                         
user                          1063.83                       1061.32                       
sys                           290.31                        266.15                        
                              zswap-invalidate-entry        zswap-invalidate-entry-exclusive
workingset_refault_anon       2383084.40                    1976397.40                    
workingset_refault_file       44134.00                      45689.40                      
workingset_activate_anon      837878.00                     728441.20                     
workingset_activate_file      4710.00                       4085.20                       
workingset_restore_anon       732622.60                     639428.40                     
workingset_restore_file       1007.00                       926.80                        
workingset_nodereclaim        0.00                          0.00                          
pgscan                        14343003.40                   12409570.20                   
pgscan_kswapd                 0.00                          0.00                          
pgscan_direct                 14343003.40                   12409570.20                   
pgscan_khugepaged             0.00                          0.00
Johannes Weiner Feb. 2, 2024, 4:26 p.m. UTC | #4
On Fri, Feb 02, 2024 at 08:57:38PM +0800, Chengming Zhou wrote:
> On 2024/2/2 02:12, Johannes Weiner wrote:
> > Caching "swapout work" is helpful when the system is thrashing. Then
> > recently swapped in pages might get swapped out again very soon. It
> > certainly makes sense with conventional swap, because keeping a clean
> > copy on the disk saves IO work and doesn't cost any additional memory.
> > 
> > But with zswap, it's different. It saves some compression work on a
> > thrashing page. But the act of keeping compressed memory contributes
> > to a higher rate of thrashing. And that can cause IO in other places
> > like zswap writeback and file memory.
> 
> [...] A/B test kernel building in tmpfs directory, memory.max=2GB.
> (zswap writeback enabled and shrinker_enabled, one 50GB swapfile)
> 
> From the below results, exclusive mode has fewer scan and refault.
> 
>                               zswap-invalidate-entry        zswap-invalidate-entry-exclusive
> real                          63.80                         63.01                         
> user                          1063.83                       1061.32                       
> sys                           290.31                        266.15                        
>                               zswap-invalidate-entry        zswap-invalidate-entry-exclusive
> workingset_refault_anon       2383084.40                    1976397.40                    
> workingset_refault_file       44134.00                      45689.40                      
> workingset_activate_anon      837878.00                     728441.20                     
> workingset_activate_file      4710.00                       4085.20                       
> workingset_restore_anon       732622.60                     639428.40                     
> workingset_restore_file       1007.00                       926.80                        
> workingset_nodereclaim        0.00                          0.00                          
> pgscan                        14343003.40                   12409570.20                   
> pgscan_kswapd                 0.00                          0.00                          
> pgscan_direct                 14343003.40                   12409570.20                   
> pgscan_khugepaged             0.00                          0.00                         

That's perfect. Thanks!

Would you mind adding all of the above into the changelog?

With that,

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Yosry Ahmed Feb. 2, 2024, 10:15 p.m. UTC | #5
> I just A/B test kernel building in tmpfs directory, memory.max=2GB.
> (zswap writeback enabled and shrinker_enabled, one 50GB swapfile)
>
> From the below results, exclusive mode has fewer scan and refault.
>
>                               zswap-invalidate-entry        zswap-invalidate-entry-exclusive
> real                          63.80                         63.01
> user                          1063.83                       1061.32
> sys                           290.31                        266.15
>                               zswap-invalidate-entry        zswap-invalidate-entry-exclusive
> workingset_refault_anon       2383084.40                    1976397.40
> workingset_refault_file       44134.00                      45689.40
> workingset_activate_anon      837878.00                     728441.20
> workingset_activate_file      4710.00                       4085.20
> workingset_restore_anon       732622.60                     639428.40
> workingset_restore_file       1007.00                       926.80
> workingset_nodereclaim        0.00                          0.00
> pgscan                        14343003.40                   12409570.20
> pgscan_kswapd                 0.00                          0.00
> pgscan_direct                 14343003.40                   12409570.20
> pgscan_khugepaged             0.00                          0.00

I think the numbers look really good, and as I mentioned, we have been
doing this in production for many years now, so:

Acked-by: Yosry Ahmed <yosryahmed@google.com>
Nhat Pham Feb. 2, 2024, 10:31 p.m. UTC | #6
On Fri, Feb 2, 2024 at 4:57 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> On 2024/2/2 02:12, Johannes Weiner wrote:
> > On Thu, Feb 01, 2024 at 03:49:05PM +0000, Chengming Zhou wrote:
> >> The !zswap_exclusive_loads_enabled mode will leave compressed copy in
> >> the zswap tree and lru list after the folio swapin.
> >>
> >> There are some disadvantages in this mode:
> >> 1. It's a waste of memory since there are two copies of data, one is
> >>    folio, the other one is compressed data in zswap. And it's unlikely
> >>    the compressed data is useful in the near future.
> >>
> >> 2. If that folio is dirtied, the compressed data must be not useful,
> >>    but we don't know and don't invalidate the trashy memory in zswap.
> >>
> >> 3. It's not reclaimable from zswap shrinker since zswap_writeback_entry()
> >>    will always return -EEXIST and terminate the shrinking process.
> >>
> >> On the other hand, the only downside of zswap_exclusive_loads_enabled
> >> is a little more cpu usage/latency when compression, and the same if
> >> the folio is removed from swapcache or dirtied.
> >>
> >> Not sure if we should accept the above disadvantages in the case of
> >> !zswap_exclusive_loads_enabled, so send this out for disscusion.
> >>
> >> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> >
> > This is interesting.
> >
> > First, I will say that I never liked this config option, because it's
> > nearly impossible for a user to answer this question. Much better to
> > just pick a reasonable default.
>
> Agree.
>
> >
> > What should the default be?
> >
> > Caching "swapout work" is helpful when the system is thrashing. Then
> > recently swapped in pages might get swapped out again very soon. It
> > certainly makes sense with conventional swap, because keeping a clean
> > copy on the disk saves IO work and doesn't cost any additional memory.
> >
> > But with zswap, it's different. It saves some compression work on a
> > thrashing page. But the act of keeping compressed memory contributes
> > to a higher rate of thrashing. And that can cause IO in other places
> > like zswap writeback and file memory.
> >
> > It would be useful to have an A/B test to confirm that not caching is
> > better. Can you run your test with and without keeping the cache, and
> > in addition to the timings also compare the deltas for pgscan_anon,
> > pgscan_file, workingset_refault_anon, workingset_refault_file?
>
> I just A/B test kernel building in tmpfs directory, memory.max=2GB.
> (zswap writeback enabled and shrinker_enabled, one 50GB swapfile)
>
> From the below results, exclusive mode has fewer scan and refault.
>
>                               zswap-invalidate-entry        zswap-invalidate-entry-exclusive
> real                          63.80                         63.01
> user                          1063.83                       1061.32
> sys                           290.31                        266.15
>                               zswap-invalidate-entry        zswap-invalidate-entry-exclusive

This is one of those cases where something might make sense
conceptually, but does not pan out in practice. Removing
non-invalidate seems to simplify the code a bit, and that's one less
thing to worry about for users, so I like this :)

Reviewed-by: Nhat Pham <nphamcs@gmail.com>

> workingset_refault_anon       2383084.40                    1976397.40
> workingset_refault_file       44134.00                      45689.40
> workingset_activate_anon      837878.00                     728441.20
> workingset_activate_file      4710.00                       4085.20
> workingset_restore_anon       732622.60                     639428.40
> workingset_restore_file       1007.00                       926.80
> workingset_nodereclaim        0.00                          0.00
> pgscan                        14343003.40                   12409570.20
> pgscan_kswapd                 0.00                          0.00
> pgscan_direct                 14343003.40                   12409570.20
> pgscan_khugepaged             0.00                          0.00
Chengming Zhou Feb. 3, 2024, 4:33 a.m. UTC | #7
On 2024/2/3 00:26, Johannes Weiner wrote:
> On Fri, Feb 02, 2024 at 08:57:38PM +0800, Chengming Zhou wrote:
>> On 2024/2/2 02:12, Johannes Weiner wrote:
>>> Caching "swapout work" is helpful when the system is thrashing. Then
>>> recently swapped in pages might get swapped out again very soon. It
>>> certainly makes sense with conventional swap, because keeping a clean
>>> copy on the disk saves IO work and doesn't cost any additional memory.
>>>
>>> But with zswap, it's different. It saves some compression work on a
>>> thrashing page. But the act of keeping compressed memory contributes
>>> to a higher rate of thrashing. And that can cause IO in other places
>>> like zswap writeback and file memory.
>>
>> [...] A/B test kernel building in tmpfs directory, memory.max=2GB.
>> (zswap writeback enabled and shrinker_enabled, one 50GB swapfile)
>>
>> From the below results, exclusive mode has fewer scan and refault.
>>
>>                               zswap-invalidate-entry        zswap-invalidate-entry-exclusive
>> real                          63.80                         63.01                         
>> user                          1063.83                       1061.32                       
>> sys                           290.31                        266.15                        
>>                               zswap-invalidate-entry        zswap-invalidate-entry-exclusive
>> workingset_refault_anon       2383084.40                    1976397.40                    
>> workingset_refault_file       44134.00                      45689.40                      
>> workingset_activate_anon      837878.00                     728441.20                     
>> workingset_activate_file      4710.00                       4085.20                       
>> workingset_restore_anon       732622.60                     639428.40                     
>> workingset_restore_file       1007.00                       926.80                        
>> workingset_nodereclaim        0.00                          0.00                          
>> pgscan                        14343003.40                   12409570.20                   
>> pgscan_kswapd                 0.00                          0.00                          
>> pgscan_direct                 14343003.40                   12409570.20                   
>> pgscan_khugepaged             0.00                          0.00                         
> 
> That's perfect. Thanks!
> 
> Would you mind adding all of the above into the changelog?
Yeah, will do. Thanks!

> 
> With that,
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
diff mbox series

Patch

diff --git a/mm/Kconfig b/mm/Kconfig
index ffc3a2ba3a8c..673b35629074 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -45,22 +45,6 @@  config ZSWAP_DEFAULT_ON
 	  The selection made here can be overridden by using the kernel
 	  command line 'zswap.enabled=' option.
 
-config ZSWAP_EXCLUSIVE_LOADS_DEFAULT_ON
-	bool "Invalidate zswap entries when pages are loaded"
-	depends on ZSWAP
-	help
-	  If selected, exclusive loads for zswap will be enabled at boot,
-	  otherwise it will be disabled.
-
-	  If exclusive loads are enabled, when a page is loaded from zswap,
-	  the zswap entry is invalidated at once, as opposed to leaving it
-	  in zswap until the swap entry is freed.
-
-	  This avoids having two copies of the same page in memory
-	  (compressed and uncompressed) after faulting in a page from zswap.
-	  The cost is that if the page was never dirtied and needs to be
-	  swapped out again, it will be re-compressed.
-
 config ZSWAP_SHRINKER_DEFAULT_ON
 	bool "Shrink the zswap pool on memory pressure"
 	depends on ZSWAP
diff --git a/mm/zswap.c b/mm/zswap.c
index 3fbb7e2c8b8d..cbf379abb6c7 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -139,10 +139,6 @@  static bool zswap_non_same_filled_pages_enabled = true;
 module_param_named(non_same_filled_pages_enabled, zswap_non_same_filled_pages_enabled,
 		   bool, 0644);
 
-static bool zswap_exclusive_loads_enabled = IS_ENABLED(
-		CONFIG_ZSWAP_EXCLUSIVE_LOADS_DEFAULT_ON);
-module_param_named(exclusive_loads, zswap_exclusive_loads_enabled, bool, 0644);
-
 /* Number of zpools in zswap_pool (empirically determined for scalability) */
 #define ZSWAP_NR_ZPOOLS 32
 
@@ -1722,16 +1718,12 @@  bool zswap_load(struct folio *folio)
 		count_objcg_event(entry->objcg, ZSWPIN);
 
 	spin_lock(&tree->lock);
-	if (zswap_exclusive_loads_enabled) {
-		zswap_invalidate_entry(tree, entry);
-		folio_mark_dirty(folio);
-	} else if (entry->length) {
-		zswap_lru_del(&entry->pool->list_lru, entry);
-		zswap_lru_add(&entry->pool->list_lru, entry);
-	}
+	zswap_invalidate_entry(tree, entry);
 	zswap_entry_put(entry);
 	spin_unlock(&tree->lock);
 
+	folio_mark_dirty(folio);
+
 	return true;
 }