
[2/6] mm/zswap: invalidate zswap entry when swap entry free

Message ID 20240201-b4-zswap-invalidate-entry-v1-2-56ed496b6e55@bytedance.com (mailing list archive)
State New
Series mm/zswap: optimize zswap lru list

Commit Message

Chengming Zhou Feb. 1, 2024, 3:49 p.m. UTC
During testing I found that zswap_writeback_entry() sometimes returns
-ENOMEM, which is not what we expect:

bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'
@[-12]: 1563
@[0]: 277221

The reason is that __read_swap_cache_async() returns NULL because
swapcache_prepare() fails. That in turn happens because we don't
invalidate the zswap entry when the swap entry is freed to the per-cpu
pool, so these zswap entries are still on the zswap tree and LRU list.
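
A simplified sketch of the failing path (paraphrased and trimmed for
illustration, not the exact kernel code):

        /*
         * zswap_writeback_entry(), roughly: when the swap entry has already
         * been freed, swapcache_prepare() inside __read_swap_cache_async()
         * finds no remaining users and fails, so no folio comes back and the
         * writeback is reported as -ENOMEM even though no allocation failed.
         */
        folio = __read_swap_cache_async(swpentry, GFP_KERNEL, ...);
        if (!folio)
                return -ENOMEM;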

This patch moves the invalidation ahead to the point where the swap
entry is freed to the per-cpu pool, since there is no benefit in
leaving stale zswap entries on the tree and LRU list.

With this patch:
bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'
@[0]: 259744

Note: a large folio can't have a zswap entry for now, so we don't
bother adding zswap entry invalidation in the large folio swap free
path.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 include/linux/zswap.h | 4 ++--
 mm/swap_slots.c       | 2 ++
 mm/swapfile.c         | 1 -
 mm/zswap.c            | 5 +++--
 4 files changed, 7 insertions(+), 5 deletions(-)

Comments

Johannes Weiner Feb. 1, 2024, 5:49 p.m. UTC | #1
On Thu, Feb 01, 2024 at 03:49:02PM +0000, Chengming Zhou wrote:
> During testing I found that zswap_writeback_entry() sometimes returns
> -ENOMEM, which is not what we expect:
> 
> bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'
> @[-12]: 1563
> @[0]: 277221
> 
> The reason is that __read_swap_cache_async() returns NULL because
> swapcache_prepare() fails. That in turn happens because we don't
> invalidate the zswap entry when the swap entry is freed to the per-cpu
> pool, so these zswap entries are still on the zswap tree and LRU list.
> 
> This patch moves the invalidation ahead to the point where the swap
> entry is freed to the per-cpu pool, since there is no benefit in
> leaving stale zswap entries on the tree and LRU list.
> 
> With this patch:
> bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'
> @[0]: 259744
> 
> Note: a large folio can't have a zswap entry for now, so we don't
> bother adding zswap entry invalidation in the large folio swap free
> path.
> 
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

Great catch.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Nhat Pham Feb. 1, 2024, 8:56 p.m. UTC | #2
On Thu, Feb 1, 2024 at 7:50 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> During testing I found that zswap_writeback_entry() sometimes returns
> -ENOMEM, which is not what we expect:
>
> bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'
> @[-12]: 1563
> @[0]: 277221
>
> The reason is that __read_swap_cache_async() returns NULL because
> swapcache_prepare() fails. That in turn happens because we don't
> invalidate the zswap entry when the swap entry is freed to the per-cpu
> pool, so these zswap entries are still on the zswap tree and LRU list.
>
> This patch moves the invalidation ahead to the point where the swap
> entry is freed to the per-cpu pool, since there is no benefit in
> leaving stale zswap entries on the tree and LRU list.
>
> With this patch:
> bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'
> @[0]: 259744

Nice!
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Yosry Ahmed Feb. 2, 2024, 12:11 a.m. UTC | #3
On Thu, Feb 01, 2024 at 03:49:02PM +0000, Chengming Zhou wrote:
> During testing I found that zswap_writeback_entry() sometimes returns
> -ENOMEM, which is not what we expect:
> 
> bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'
> @[-12]: 1563
> @[0]: 277221
> 
> The reason is that __read_swap_cache_async() returns NULL because
> swapcache_prepare() fails. That in turn happens because we don't
> invalidate the zswap entry when the swap entry is freed to the per-cpu
> pool, so these zswap entries are still on the zswap tree and LRU list.
> 
> This patch moves the invalidation ahead to the point where the swap
> entry is freed to the per-cpu pool, since there is no benefit in
> leaving stale zswap entries on the tree and LRU list.
> 
> With this patch:
> bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'
> @[0]: 259744
> 
> Note: a large folio can't have a zswap entry for now, so we don't
> bother adding zswap entry invalidation in the large folio swap free
> path.

This makes me slightly nervous. Should we add a comment somewhere just
in case this is missed if someone adds large folio support?

Otherwise the patch itself LGTM.
Chengming Zhou Feb. 2, 2024, 8:10 a.m. UTC | #4
On 2024/2/2 08:11, Yosry Ahmed wrote:
> On Thu, Feb 01, 2024 at 03:49:02PM +0000, Chengming Zhou wrote:
>> During testing I found that zswap_writeback_entry() sometimes returns
>> -ENOMEM, which is not what we expect:
>>
>> bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'
>> @[-12]: 1563
>> @[0]: 277221
>>
>> The reason is that __read_swap_cache_async() returns NULL because
>> swapcache_prepare() fails. That in turn happens because we don't
>> invalidate the zswap entry when the swap entry is freed to the per-cpu
>> pool, so these zswap entries are still on the zswap tree and LRU list.
>>
>> This patch moves the invalidation ahead to the point where the swap
>> entry is freed to the per-cpu pool, since there is no benefit in
>> leaving stale zswap entries on the tree and LRU list.
>>
>> With this patch:
>> bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'
>> @[0]: 259744
>>
>> Note: a large folio can't have a zswap entry for now, so we don't
>> bother adding zswap entry invalidation in the large folio swap free
>> path.
> 
> This makes me slightly nervous. Should we add a comment somewhere just
> in case this is missed if someone adds large folio support?

Ok, will add this comment:

+       /* Large folio swap slot is not covered. */
	zswap_invalidate(entry);

> 
> Otherwise the patch itself LGTM.

Thanks!

Patch

diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 91895ce1fdbc..341aea490070 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -29,7 +29,7 @@  struct zswap_lruvec_state {
 
 bool zswap_store(struct folio *folio);
 bool zswap_load(struct folio *folio);
-void zswap_invalidate(int type, pgoff_t offset);
+void zswap_invalidate(swp_entry_t swp);
 int zswap_swapon(int type, unsigned long nr_pages);
 void zswap_swapoff(int type);
 void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
@@ -50,7 +50,7 @@  static inline bool zswap_load(struct folio *folio)
 	return false;
 }
 
-static inline void zswap_invalidate(int type, pgoff_t offset) {}
+static inline void zswap_invalidate(swp_entry_t swp) {}
 static inline int zswap_swapon(int type, unsigned long nr_pages)
 {
 	return 0;
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 0bec1f705f8e..d24cdea26daa 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -273,6 +273,8 @@  void free_swap_slot(swp_entry_t entry)
 {
 	struct swap_slots_cache *cache;
 
+	zswap_invalidate(entry);
+
 	cache = raw_cpu_ptr(&swp_slots);
 	if (likely(use_swap_slot_cache && cache->slots_ret)) {
 		spin_lock_irq(&cache->free_lock);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0580bb3e34d7..65b49db89b36 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -744,7 +744,6 @@  static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 		swap_slot_free_notify = NULL;
 	while (offset <= end) {
 		arch_swap_invalidate_page(si->type, offset);
-		zswap_invalidate(si->type, offset);
 		if (swap_slot_free_notify)
 			swap_slot_free_notify(si->bdev, offset);
 		offset++;
diff --git a/mm/zswap.c b/mm/zswap.c
index 735f1a6ef336..d8bb0e06e2b0 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1738,9 +1738,10 @@  bool zswap_load(struct folio *folio)
 	return true;
 }
 
-void zswap_invalidate(int type, pgoff_t offset)
+void zswap_invalidate(swp_entry_t swp)
 {
-	struct zswap_tree *tree = swap_zswap_tree(swp_entry(type, offset));
+	pgoff_t offset = swp_offset(swp);
+	struct zswap_tree *tree = swap_zswap_tree(swp);
 	struct zswap_entry *entry;
 
 	spin_lock(&tree->lock);
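
For illustration, the change in calling convention (illustrative only, not
part of the patch; swp_entry(), swp_type() and swp_offset() are the existing
helpers from include/linux/swapops.h, and type/offset stand in for whatever
the caller has at hand):

        swp_entry_t entry = swp_entry(type, offset);

        /* old calling convention: type and offset passed separately */
        zswap_invalidate(swp_type(entry), swp_offset(entry));

        /* new calling convention: the swp_entry_t is passed as-is, as
         * free_swap_slot() now does, and zswap_invalidate() derives the
         * offset and zswap tree itself */
        zswap_invalidate(entry);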