diff mbox series

[v2,2/3] mm/zswap: fix race between lru writeback and swapoff

Message ID 20240126-zswap-writeback-race-v2-2-b10479847099@bytedance.com (mailing list archive)
State New
Headers show
Series mm/zswap: fix race between lru writeback and swapoff | expand

Commit Message

Chengming Zhou Jan. 28, 2024, 1:28 p.m. UTC
LRU writeback has race problem with swapoff, as spotted by Yosry [1]:

CPU1			CPU2
shrink_memcg_cb		swap_off
  list_lru_isolate	  zswap_invalidate
			  zswap_swapoff
			    kfree(tree)
  // UAF
  spin_lock(&tree->lock)

The problem is that the entry in lru list can't protect the tree from
being swapoff and freed, and the entry also can be invalidated and freed
concurrently after we unlock the lru lock.

We can fix it by moving the swap cache allocation ahead before
referencing the tree, then check invalidate race with tree lock,
only after that we can safely deref the entry. Note we couldn't
deref entry or tree anymore after we unlock the folio, since we
depend on this to hold on swapoff.

So this patch moves all tree and entry usage to zswap_writeback_entry(),
we only use the copied swpentry on the stack to allocate swap cache
and if returned with folio locked we can reference the tree safely.
Then we can check invalidate race with tree lock, the following things
is much the same like zswap_load().

Since we can't deref the entry after zswap_writeback_entry(), we
can't use zswap_lru_putback() anymore, instead we rotate the entry
in the beginning. And it will be unlinked and freed when invalidated
if writeback success.

Another change is we don't update the memcg nr_zswap_protected in
the -ENOMEM and -EEXIST cases anymore. -EEXIST case means we raced
with swapin or concurrent shrinker action, since swapin already
have memcg nr_zswap_protected updated, don't need double counts here.
For concurrent shrinker, the folio will be writeback and freed anyway.
-ENOMEM case is extremely rare and doesn't happen spuriously either,
so don't bother distinguishing this case.

[1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 mm/zswap.c | 114 ++++++++++++++++++++++++++-----------------------------------
 1 file changed, 49 insertions(+), 65 deletions(-)

Comments

Yosry Ahmed Jan. 30, 2024, 12:22 a.m. UTC | #1
On Sun, Jan 28, 2024 at 01:28:50PM +0000, Chengming Zhou wrote:
> LRU writeback has race problem with swapoff, as spotted by Yosry [1]:
> 
> CPU1			CPU2
> shrink_memcg_cb		swap_off
>   list_lru_isolate	  zswap_invalidate
> 			  zswap_swapoff
> 			    kfree(tree)
>   // UAF
>   spin_lock(&tree->lock)
> 
> The problem is that the entry in lru list can't protect the tree from
> being swapoff and freed, and the entry also can be invalidated and freed
> concurrently after we unlock the lru lock.
> 
> We can fix it by moving the swap cache allocation ahead before
> referencing the tree, then check invalidate race with tree lock,
> only after that we can safely deref the entry. Note we couldn't
> deref entry or tree anymore after we unlock the folio, since we
> depend on this to hold on swapoff.
> 
> So this patch moves all tree and entry usage to zswap_writeback_entry(),
> we only use the copied swpentry on the stack to allocate swap cache
> and if returned with folio locked we can reference the tree safely.
> Then we can check invalidate race with tree lock, the following things
> is much the same like zswap_load().
> 
> Since we can't deref the entry after zswap_writeback_entry(), we
> can't use zswap_lru_putback() anymore, instead we rotate the entry
> in the beginning. And it will be unlinked and freed when invalidated
> if writeback success.

You are also removing one redundant lookup from the zswap writeback path
to check for the invalidation race, and simplifying the code. Give
yourself full credit :)

> 
> Another change is we don't update the memcg nr_zswap_protected in
> the -ENOMEM and -EEXIST cases anymore. -EEXIST case means we raced
> with swapin or concurrent shrinker action, since swapin already
> have memcg nr_zswap_protected updated, don't need double counts here.
> For concurrent shrinker, the folio will be writeback and freed anyway.
> -ENOMEM case is extremely rare and doesn't happen spuriously either,
> so don't bother distinguishing this case.
> 
> [1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Nhat Pham <nphamcs@gmail.com>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> ---
>  mm/zswap.c | 114 ++++++++++++++++++++++++++-----------------------------------
>  1 file changed, 49 insertions(+), 65 deletions(-)
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 81cb3790e0dd..f5cb5a46e4d7 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -277,7 +277,7 @@ static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
>  		 zpool_get_type((p)->zpools[0]))
>  
>  static int zswap_writeback_entry(struct zswap_entry *entry,
> -				 struct zswap_tree *tree);
> +				 swp_entry_t swpentry);
>  static int zswap_pool_get(struct zswap_pool *pool);
>  static void zswap_pool_put(struct zswap_pool *pool);
>  
> @@ -445,27 +445,6 @@ static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
>  	rcu_read_unlock();
>  }
>  
> -static void zswap_lru_putback(struct list_lru *list_lru,
> -		struct zswap_entry *entry)
> -{
> -	int nid = entry_to_nid(entry);
> -	spinlock_t *lock = &list_lru->node[nid].lock;
> -	struct mem_cgroup *memcg;
> -	struct lruvec *lruvec;
> -
> -	rcu_read_lock();
> -	memcg = mem_cgroup_from_entry(entry);
> -	spin_lock(lock);
> -	/* we cannot use list_lru_add here, because it increments node's lru count */
> -	list_lru_putback(list_lru, &entry->lru, nid, memcg);
> -	spin_unlock(lock);
> -
> -	lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(entry_to_nid(entry)));
> -	/* increment the protection area to account for the LRU rotation. */
> -	atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> -	rcu_read_unlock();
> -}
> -
>  /*********************************
>  * rbtree functions
>  **********************************/
> @@ -860,40 +839,47 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>  {
>  	struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
>  	bool *encountered_page_in_swapcache = (bool *)arg;
> -	struct zswap_tree *tree;
> -	pgoff_t swpoffset;
> +	swp_entry_t swpentry;
>  	enum lru_status ret = LRU_REMOVED_RETRY;
>  	int writeback_result;
>  
> +	/*
> +	 * Rotate the entry to the tail before unlocking the LRU,
> +	 * so that in case of an invalidation race concurrent
> +	 * reclaimers don't waste their time on it.
> +	 *
> +	 * If writeback succeeds, or failure is due to the entry
> +	 * being invalidated by the swap subsystem, the invalidation
> +	 * will unlink and free it.
> +	 *
> +	 * Temporary failures, where the same entry should be tried
> +	 * again immediately, almost never happen for this shrinker.
> +	 * We don't do any trylocking; -ENOMEM comes closest,
> +	 * but that's extremely rare and doesn't happen spuriously
> +	 * either. Don't bother distinguishing this case.
> +	 *
> +	 * But since they do exist in theory, the entry cannot just
> +	 * be unlinked, or we could leak it. Hence, rotate.

The entry cannot be unlinked because we cannot get a ref on it without
holding the tree lock, and we cannot deref the tree before we acquire a
swap cache ref in zswap_writeback_entry() -- or if
zswap_writeback_entry() fails. This should be called out explicitly
somewhere. Perhaps we can describe this whole deref dance with added
docs to shrink_memcg_cb().

We could also use a comment in zswap_writeback_entry() (or above it) to
state that we only deref the tree *after* we get the swapcache ref.
Chengming Zhou Jan. 30, 2024, 2:30 a.m. UTC | #2
On 2024/1/30 08:22, Yosry Ahmed wrote:
> On Sun, Jan 28, 2024 at 01:28:50PM +0000, Chengming Zhou wrote:
>> LRU writeback has race problem with swapoff, as spotted by Yosry [1]:
>>
>> CPU1			CPU2
>> shrink_memcg_cb		swap_off
>>   list_lru_isolate	  zswap_invalidate
>> 			  zswap_swapoff
>> 			    kfree(tree)
>>   // UAF
>>   spin_lock(&tree->lock)
>>
>> The problem is that the entry in lru list can't protect the tree from
>> being swapoff and freed, and the entry also can be invalidated and freed
>> concurrently after we unlock the lru lock.
>>
>> We can fix it by moving the swap cache allocation ahead before
>> referencing the tree, then check invalidate race with tree lock,
>> only after that we can safely deref the entry. Note we couldn't
>> deref entry or tree anymore after we unlock the folio, since we
>> depend on this to hold on swapoff.
>>
>> So this patch moves all tree and entry usage to zswap_writeback_entry(),
>> we only use the copied swpentry on the stack to allocate swap cache
>> and if returned with folio locked we can reference the tree safely.
>> Then we can check invalidate race with tree lock, the following things
>> is much the same like zswap_load().
>>
>> Since we can't deref the entry after zswap_writeback_entry(), we
>> can't use zswap_lru_putback() anymore, instead we rotate the entry
>> in the beginning. And it will be unlinked and freed when invalidated
>> if writeback success.
> 
> You are also removing one redundant lookup from the zswap writeback path
> to check for the invalidation race, and simplifying the code. Give
> yourself full credit :)

Ah right, forgot to mention it, I will add this part in the commit message.
Thanks for your reminder!

> 
>>
>> Another change is we don't update the memcg nr_zswap_protected in
>> the -ENOMEM and -EEXIST cases anymore. -EEXIST case means we raced
>> with swapin or concurrent shrinker action, since swapin already
>> have memcg nr_zswap_protected updated, don't need double counts here.
>> For concurrent shrinker, the folio will be writeback and freed anyway.
>> -ENOMEM case is extremely rare and doesn't happen spuriously either,
>> so don't bother distinguishing this case.
>>
>> [1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/
>>
>> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>> Acked-by: Nhat Pham <nphamcs@gmail.com>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>> ---
>>  mm/zswap.c | 114 ++++++++++++++++++++++++++-----------------------------------
>>  1 file changed, 49 insertions(+), 65 deletions(-)
>>
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> index 81cb3790e0dd..f5cb5a46e4d7 100644
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -277,7 +277,7 @@ static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
>>  		 zpool_get_type((p)->zpools[0]))
>>  
>>  static int zswap_writeback_entry(struct zswap_entry *entry,
>> -				 struct zswap_tree *tree);
>> +				 swp_entry_t swpentry);
>>  static int zswap_pool_get(struct zswap_pool *pool);
>>  static void zswap_pool_put(struct zswap_pool *pool);
>>  
>> @@ -445,27 +445,6 @@ static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
>>  	rcu_read_unlock();
>>  }
>>  
>> -static void zswap_lru_putback(struct list_lru *list_lru,
>> -		struct zswap_entry *entry)
>> -{
>> -	int nid = entry_to_nid(entry);
>> -	spinlock_t *lock = &list_lru->node[nid].lock;
>> -	struct mem_cgroup *memcg;
>> -	struct lruvec *lruvec;
>> -
>> -	rcu_read_lock();
>> -	memcg = mem_cgroup_from_entry(entry);
>> -	spin_lock(lock);
>> -	/* we cannot use list_lru_add here, because it increments node's lru count */
>> -	list_lru_putback(list_lru, &entry->lru, nid, memcg);
>> -	spin_unlock(lock);
>> -
>> -	lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(entry_to_nid(entry)));
>> -	/* increment the protection area to account for the LRU rotation. */
>> -	atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
>> -	rcu_read_unlock();
>> -}
>> -
>>  /*********************************
>>  * rbtree functions
>>  **********************************/
>> @@ -860,40 +839,47 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>>  {
>>  	struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
>>  	bool *encountered_page_in_swapcache = (bool *)arg;
>> -	struct zswap_tree *tree;
>> -	pgoff_t swpoffset;
>> +	swp_entry_t swpentry;
>>  	enum lru_status ret = LRU_REMOVED_RETRY;
>>  	int writeback_result;
>>  
>> +	/*
>> +	 * Rotate the entry to the tail before unlocking the LRU,
>> +	 * so that in case of an invalidation race concurrent
>> +	 * reclaimers don't waste their time on it.
>> +	 *
>> +	 * If writeback succeeds, or failure is due to the entry
>> +	 * being invalidated by the swap subsystem, the invalidation
>> +	 * will unlink and free it.
>> +	 *
>> +	 * Temporary failures, where the same entry should be tried
>> +	 * again immediately, almost never happen for this shrinker.
>> +	 * We don't do any trylocking; -ENOMEM comes closest,
>> +	 * but that's extremely rare and doesn't happen spuriously
>> +	 * either. Don't bother distinguishing this case.
>> +	 *
>> +	 * But since they do exist in theory, the entry cannot just
>> +	 * be unlinked, or we could leak it. Hence, rotate.
> 
> The entry cannot be unlinked because we cannot get a ref on it without
> holding the tree lock, and we cannot deref the tree before we acquire a
> swap cache ref in zswap_writeback_entry() -- or if
> zswap_writeback_entry() fails. This should be called out explicitly
> somewhere. Perhaps we can describe this whole deref dance with added
> docs to shrink_memcg_cb().

Maybe we should add some comments before or after zswap_writeback_entry()?
Or do you have some suggestions? I'm not good at this. :)

> 
> We could also use a comment in zswap_writeback_entry() (or above it) to
> state that we only deref the tree *after* we get the swapcache ref.

I just notice there are some comments in zswap_writeback_entry(), should
we add more comments here?

	/*
	 * folio is locked, and the swapcache is now secured against
	 * concurrent swapping to and from the slot. Verify that the
	 * swap entry hasn't been invalidated and recycled behind our
	 * backs (our zswap_entry reference doesn't prevent that), to
	 * avoid overwriting a new swap folio with old compressed data.
	 */
Johannes Weiner Jan. 30, 2024, 3:17 a.m. UTC | #3
On Tue, Jan 30, 2024 at 10:30:20AM +0800, Chengming Zhou wrote:
> On 2024/1/30 08:22, Yosry Ahmed wrote:
> > On Sun, Jan 28, 2024 at 01:28:50PM +0000, Chengming Zhou wrote:
> >> @@ -860,40 +839,47 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
> >>  {
> >>  	struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
> >>  	bool *encountered_page_in_swapcache = (bool *)arg;
> >> -	struct zswap_tree *tree;
> >> -	pgoff_t swpoffset;
> >> +	swp_entry_t swpentry;
> >>  	enum lru_status ret = LRU_REMOVED_RETRY;
> >>  	int writeback_result;
> >>  
> >> +	/*
> >> +	 * Rotate the entry to the tail before unlocking the LRU,
> >> +	 * so that in case of an invalidation race concurrent
> >> +	 * reclaimers don't waste their time on it.
> >> +	 *
> >> +	 * If writeback succeeds, or failure is due to the entry
> >> +	 * being invalidated by the swap subsystem, the invalidation
> >> +	 * will unlink and free it.
> >> +	 *
> >> +	 * Temporary failures, where the same entry should be tried
> >> +	 * again immediately, almost never happen for this shrinker.
> >> +	 * We don't do any trylocking; -ENOMEM comes closest,
> >> +	 * but that's extremely rare and doesn't happen spuriously
> >> +	 * either. Don't bother distinguishing this case.
> >> +	 *
> >> +	 * But since they do exist in theory, the entry cannot just
> >> +	 * be unlinked, or we could leak it. Hence, rotate.
> > 
> > The entry cannot be unlinked because we cannot get a ref on it without
> > holding the tree lock, and we cannot deref the tree before we acquire a
> > swap cache ref in zswap_writeback_entry() -- or if
> > zswap_writeback_entry() fails. This should be called out explicitly
> > somewhere. Perhaps we can describe this whole deref dance with added
> > docs to shrink_memcg_cb().
> 
> Maybe we should add some comments before or after zswap_writeback_entry()?
> Or do you have some suggestions? I'm not good at this. :)

I agree with the suggestion of a central point to document this.

How about something like this:

/*
 * As soon as we drop the LRU lock, the entry can be freed by
 * a concurrent invalidation. This means the following:
 *
 * 1. We extract the swp_entry_t to the stack, allowing
 *    zswap_writeback_entry() to pin the swap entry and
 *    then validate the zwap entry against that swap entry's
 *    tree using pointer value comparison. Only when that
 *    is successful can the entry be dereferenced.
 *
 * 2. Usually, objects are taken off the LRU for reclaim. In
 *    this case this isn't possible, because if reclaim fails
 *    for whatever reason, we have no means of knowing if the
 *    entry is alive to put it back on the LRU.
 *
 *    So rotate it before dropping the lock. If the entry is
 *    written back or invalidated, the free path will unlink
 *    it. For failures, rotation is the right thing as well.
 *
 *    Temporary failures, where the same entry should be tried
 *    again immediately, almost never happen for this shrinker.
 *    We don't do any trylocking; -ENOMEM comes closest,
 *    but that's extremely rare and doesn't happen spuriously
 *    either. Don't bother distinguishing this case.
 */

> > We could also use a comment in zswap_writeback_entry() (or above it) to
> > state that we only deref the tree *after* we get the swapcache ref.
> 
> I just notice there are some comments in zswap_writeback_entry(), should
> we add more comments here?
> 
> 	/*
> 	 * folio is locked, and the swapcache is now secured against
> 	 * concurrent swapping to and from the slot. Verify that the
> 	 * swap entry hasn't been invalidated and recycled behind our
> 	 * backs (our zswap_entry reference doesn't prevent that), to
> 	 * avoid overwriting a new swap folio with old compressed data.
> 	 */

The bit in () is now stale, since we're not even holding a ref ;)

Otherwise, a brief note that entry can not be dereferenced until
validation would be helpful in zswap_writeback_entry(). The core of
the scheme I'd probably describe in shrink_memcg_cb(), though.

Can I ask a favor, though?

For non-critical updates to this patch, can you please make them
follow-up changes? I just sent out 20 cleanup patches on top of this
patch which would be super painful and error prone to rebase. I'd like
to avoid that if at all possible.
Chengming Zhou Jan. 30, 2024, 3:31 a.m. UTC | #4
On 2024/1/30 11:17, Johannes Weiner wrote:
> On Tue, Jan 30, 2024 at 10:30:20AM +0800, Chengming Zhou wrote:
>> On 2024/1/30 08:22, Yosry Ahmed wrote:
>>> On Sun, Jan 28, 2024 at 01:28:50PM +0000, Chengming Zhou wrote:
>>>> @@ -860,40 +839,47 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>>>>  {
>>>>  	struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
>>>>  	bool *encountered_page_in_swapcache = (bool *)arg;
>>>> -	struct zswap_tree *tree;
>>>> -	pgoff_t swpoffset;
>>>> +	swp_entry_t swpentry;
>>>>  	enum lru_status ret = LRU_REMOVED_RETRY;
>>>>  	int writeback_result;
>>>>  
>>>> +	/*
>>>> +	 * Rotate the entry to the tail before unlocking the LRU,
>>>> +	 * so that in case of an invalidation race concurrent
>>>> +	 * reclaimers don't waste their time on it.
>>>> +	 *
>>>> +	 * If writeback succeeds, or failure is due to the entry
>>>> +	 * being invalidated by the swap subsystem, the invalidation
>>>> +	 * will unlink and free it.
>>>> +	 *
>>>> +	 * Temporary failures, where the same entry should be tried
>>>> +	 * again immediately, almost never happen for this shrinker.
>>>> +	 * We don't do any trylocking; -ENOMEM comes closest,
>>>> +	 * but that's extremely rare and doesn't happen spuriously
>>>> +	 * either. Don't bother distinguishing this case.
>>>> +	 *
>>>> +	 * But since they do exist in theory, the entry cannot just
>>>> +	 * be unlinked, or we could leak it. Hence, rotate.
>>>
>>> The entry cannot be unlinked because we cannot get a ref on it without
>>> holding the tree lock, and we cannot deref the tree before we acquire a
>>> swap cache ref in zswap_writeback_entry() -- or if
>>> zswap_writeback_entry() fails. This should be called out explicitly
>>> somewhere. Perhaps we can describe this whole deref dance with added
>>> docs to shrink_memcg_cb().
>>
>> Maybe we should add some comments before or after zswap_writeback_entry()?
>> Or do you have some suggestions? I'm not good at this. :)
> 
> I agree with the suggestion of a central point to document this.
> 
> How about something like this:
> 
> /*
>  * As soon as we drop the LRU lock, the entry can be freed by
>  * a concurrent invalidation. This means the following:
>  *
>  * 1. We extract the swp_entry_t to the stack, allowing
>  *    zswap_writeback_entry() to pin the swap entry and
>  *    then validate the zwap entry against that swap entry's
>  *    tree using pointer value comparison. Only when that
>  *    is successful can the entry be dereferenced.
>  *
>  * 2. Usually, objects are taken off the LRU for reclaim. In
>  *    this case this isn't possible, because if reclaim fails
>  *    for whatever reason, we have no means of knowing if the
>  *    entry is alive to put it back on the LRU.
>  *
>  *    So rotate it before dropping the lock. If the entry is
>  *    written back or invalidated, the free path will unlink
>  *    it. For failures, rotation is the right thing as well.
>  *
>  *    Temporary failures, where the same entry should be tried
>  *    again immediately, almost never happen for this shrinker.
>  *    We don't do any trylocking; -ENOMEM comes closest,
>  *    but that's extremely rare and doesn't happen spuriously
>  *    either. Don't bother distinguishing this case.
>  */
> 

Thanks! I think this document is great enough.

>>> We could also use a comment in zswap_writeback_entry() (or above it) to
>>> state that we only deref the tree *after* we get the swapcache ref.
>>
>> I just notice there are some comments in zswap_writeback_entry(), should
>> we add more comments here?
>>
>> 	/*
>> 	 * folio is locked, and the swapcache is now secured against
>> 	 * concurrent swapping to and from the slot. Verify that the
>> 	 * swap entry hasn't been invalidated and recycled behind our
>> 	 * backs (our zswap_entry reference doesn't prevent that), to
>> 	 * avoid overwriting a new swap folio with old compressed data.
>> 	 */
> 
> The bit in () is now stale, since we're not even holding a ref ;)

Right.

> 
> Otherwise, a brief note that entry can not be dereferenced until
> validation would be helpful in zswap_writeback_entry(). The core of

Ok.

> the scheme I'd probably describe in shrink_memcg_cb(), though.
> 
> Can I ask a favor, though?
> 
> For non-critical updates to this patch, can you please make them
> follow-up changes? I just sent out 20 cleanup patches on top of this
> patch which would be super painful and error prone to rebase. I'd like
> to avoid that if at all possible.

Ok, so these comments changes should be changed on top of your cleanup series
and sent as a follow-up patch.

Thanks.
diff mbox series

Patch

diff --git a/mm/zswap.c b/mm/zswap.c
index 81cb3790e0dd..f5cb5a46e4d7 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -277,7 +277,7 @@  static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
 		 zpool_get_type((p)->zpools[0]))
 
 static int zswap_writeback_entry(struct zswap_entry *entry,
-				 struct zswap_tree *tree);
+				 swp_entry_t swpentry);
 static int zswap_pool_get(struct zswap_pool *pool);
 static void zswap_pool_put(struct zswap_pool *pool);
 
@@ -445,27 +445,6 @@  static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
 	rcu_read_unlock();
 }
 
-static void zswap_lru_putback(struct list_lru *list_lru,
-		struct zswap_entry *entry)
-{
-	int nid = entry_to_nid(entry);
-	spinlock_t *lock = &list_lru->node[nid].lock;
-	struct mem_cgroup *memcg;
-	struct lruvec *lruvec;
-
-	rcu_read_lock();
-	memcg = mem_cgroup_from_entry(entry);
-	spin_lock(lock);
-	/* we cannot use list_lru_add here, because it increments node's lru count */
-	list_lru_putback(list_lru, &entry->lru, nid, memcg);
-	spin_unlock(lock);
-
-	lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(entry_to_nid(entry)));
-	/* increment the protection area to account for the LRU rotation. */
-	atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
-	rcu_read_unlock();
-}
-
 /*********************************
 * rbtree functions
 **********************************/
@@ -860,40 +839,47 @@  static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
 {
 	struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
 	bool *encountered_page_in_swapcache = (bool *)arg;
-	struct zswap_tree *tree;
-	pgoff_t swpoffset;
+	swp_entry_t swpentry;
 	enum lru_status ret = LRU_REMOVED_RETRY;
 	int writeback_result;
 
+	/*
+	 * Rotate the entry to the tail before unlocking the LRU,
+	 * so that in case of an invalidation race concurrent
+	 * reclaimers don't waste their time on it.
+	 *
+	 * If writeback succeeds, or failure is due to the entry
+	 * being invalidated by the swap subsystem, the invalidation
+	 * will unlink and free it.
+	 *
+	 * Temporary failures, where the same entry should be tried
+	 * again immediately, almost never happen for this shrinker.
+	 * We don't do any trylocking; -ENOMEM comes closest,
+	 * but that's extremely rare and doesn't happen spuriously
+	 * either. Don't bother distinguishing this case.
+	 *
+	 * But since they do exist in theory, the entry cannot just
+	 * be unlinked, or we could leak it. Hence, rotate.
+	 */
+	list_move_tail(item, &l->list);
+
 	/*
 	 * Once the lru lock is dropped, the entry might get freed. The
-	 * swpoffset is copied to the stack, and entry isn't deref'd again
+	 * swpentry is copied to the stack, and entry isn't deref'd again
 	 * until the entry is verified to still be alive in the tree.
 	 */
-	swpoffset = swp_offset(entry->swpentry);
-	tree = swap_zswap_tree(entry->swpentry);
-	list_lru_isolate(l, item);
+	swpentry = entry->swpentry;
+
 	/*
 	 * It's safe to drop the lock here because we return either
 	 * LRU_REMOVED_RETRY or LRU_RETRY.
 	 */
 	spin_unlock(lock);
 
-	/* Check for invalidate() race */
-	spin_lock(&tree->lock);
-	if (entry != zswap_rb_search(&tree->rbroot, swpoffset))
-		goto unlock;
-
-	/* Hold a reference to prevent a free during writeback */
-	zswap_entry_get(entry);
-	spin_unlock(&tree->lock);
-
-	writeback_result = zswap_writeback_entry(entry, tree);
+	writeback_result = zswap_writeback_entry(entry, swpentry);
 
-	spin_lock(&tree->lock);
 	if (writeback_result) {
 		zswap_reject_reclaim_fail++;
-		zswap_lru_putback(&entry->pool->list_lru, entry);
 		ret = LRU_RETRY;
 
 		/*
@@ -903,27 +889,10 @@  static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
 		 */
 		if (writeback_result == -EEXIST && encountered_page_in_swapcache)
 			*encountered_page_in_swapcache = true;
-
-		goto put_unlock;
+	} else {
+		zswap_written_back_pages++;
 	}
-	zswap_written_back_pages++;
-
-	if (entry->objcg)
-		count_objcg_event(entry->objcg, ZSWPWB);
 
-	count_vm_event(ZSWPWB);
-	/*
-	 * Writeback started successfully, the page now belongs to the
-	 * swapcache. Drop the entry from zswap - unless invalidate already
-	 * took it out while we had the tree->lock released for IO.
-	 */
-	zswap_invalidate_entry(tree, entry);
-
-put_unlock:
-	/* Drop local reference */
-	zswap_entry_put(entry);
-unlock:
-	spin_unlock(&tree->lock);
 	spin_lock(lock);
 	return ret;
 }
@@ -1408,9 +1377,9 @@  static void __zswap_load(struct zswap_entry *entry, struct page *page)
  * freed.
  */
 static int zswap_writeback_entry(struct zswap_entry *entry,
-				 struct zswap_tree *tree)
+				 swp_entry_t swpentry)
 {
-	swp_entry_t swpentry = entry->swpentry;
+	struct zswap_tree *tree;
 	struct folio *folio;
 	struct mempolicy *mpol;
 	bool folio_was_allocated;
@@ -1426,9 +1395,11 @@  static int zswap_writeback_entry(struct zswap_entry *entry,
 		return -ENOMEM;
 
 	/*
-	 * Found an existing folio, we raced with load/swapin. We generally
-	 * writeback cold folios from zswap, and swapin means the folio just
-	 * became hot. Skip this folio and let the caller find another one.
+	 * Found an existing folio, we raced with swapin or concurrent
+	 * shrinker. We generally writeback cold folios from zswap, and
+	 * swapin means the folio just became hot, so skip this folio.
+	 * For unlikely concurrent shrinker case, it will be unlinked
+	 * and freed when invalidated by the concurrent shrinker anyway.
 	 */
 	if (!folio_was_allocated) {
 		folio_put(folio);
@@ -1442,18 +1413,31 @@  static int zswap_writeback_entry(struct zswap_entry *entry,
 	 * backs (our zswap_entry reference doesn't prevent that), to
 	 * avoid overwriting a new swap folio with old compressed data.
 	 */
+	tree = swap_zswap_tree(swpentry);
 	spin_lock(&tree->lock);
-	if (zswap_rb_search(&tree->rbroot, swp_offset(entry->swpentry)) != entry) {
+	if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
 		spin_unlock(&tree->lock);
 		delete_from_swap_cache(folio);
 		folio_unlock(folio);
 		folio_put(folio);
 		return -ENOMEM;
 	}
+
+	/* Safe to deref entry after the entry is verified above. */
+	zswap_entry_get(entry);
 	spin_unlock(&tree->lock);
 
 	__zswap_load(entry, &folio->page);
 
+	count_vm_event(ZSWPWB);
+	if (entry->objcg)
+		count_objcg_event(entry->objcg, ZSWPWB);
+
+	spin_lock(&tree->lock);
+	zswap_invalidate_entry(tree, entry);
+	zswap_entry_put(entry);
+	spin_unlock(&tree->lock);
+
 	/* folio is up to date */
 	folio_mark_uptodate(folio);