
[v1,1/3] mm: zswap: fix global shrinker memcg iteration

Message ID 20240608155316.451600-2-flintglass@gmail.com (mailing list archive)
State New
Series mm: zswap: global shrinker fix and proactive shrink

Commit Message

Takero Funaki June 8, 2024, 3:53 p.m. UTC
This patch fixes an issue where the zswap global shrinker stopped
iterating through the memcg tree.

The problem was that `shrink_worker()` would stop iterating when a memcg
was being offlined and restart from the tree root.  Now, it properly
handles the offlining memcg and continues shrinking with the next memcg.

This patch also modifies handling of the lock for the offlined memcg
cleaner to adapt to the change in the iteration, and to avoid the
negligibly rare skipping of a memcg from the shrink iteration.

Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware")
Signed-off-by: Takero Funaki <flintglass@gmail.com>
---
 mm/zswap.c | 87 ++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 68 insertions(+), 19 deletions(-)

Comments

Yosry Ahmed June 10, 2024, 7:16 p.m. UTC | #1
On Sat, Jun 8, 2024 at 8:53 AM Takero Funaki <flintglass@gmail.com> wrote:
>
> This patch fixes an issue where the zswap global shrinker stopped
> iterating through the memcg tree.
>
> The problem was that `shrink_worker()` would stop iterating when a memcg
> was being offlined and restart from the tree root.  Now, it properly
> handles the offlining memcg and continues shrinking with the next memcg.
>
> This patch also modifies handling of the lock for the offlined memcg
> cleaner to adapt to the change in the iteration, and to avoid the
> negligibly rare skipping of a memcg from the shrink iteration.
>
> Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware")
> Signed-off-by: Takero Funaki <flintglass@gmail.com>
> ---
>  mm/zswap.c | 87 ++++++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 68 insertions(+), 19 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 80c634acb8d5..d720a42069b6 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -827,12 +827,27 @@ void zswap_folio_swapin(struct folio *folio)
>         }
>  }
>
> +/*
> + * This function should be called when a memcg is being offlined.
> + *
> + * Since the global shrinker shrink_worker() may hold a reference
> + * to the memcg, we must check and release the reference in
> + * zswap_next_shrink.
> + *
> + * shrink_worker() must handle the case where this function releases
> + * the reference to the memcg being shrunk.
> + */
>  void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
>  {
>         /* lock out zswap shrinker walking memcg tree */
>         spin_lock(&zswap_shrink_lock);
> -       if (zswap_next_shrink == memcg)
> -               zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> +
> +       if (READ_ONCE(zswap_next_shrink) == memcg) {
> +               /* put back reference and advance the cursor */
> +               memcg = mem_cgroup_iter(NULL, memcg, NULL);
> +               WRITE_ONCE(zswap_next_shrink, memcg);
> +       }
> +

I am really finding it difficult to understand what the diff is trying
to do. We are holding a lock that protects zswap_next_shrink. We
always access it with the lock held. Why do we need all of this?

Adding READ_ONCE() and WRITE_ONCE() where they are not needed is just
confusing imo.

>         spin_unlock(&zswap_shrink_lock);
>  }
>
> @@ -1401,25 +1416,44 @@ static int shrink_memcg(struct mem_cgroup *memcg)
>
>  static void shrink_worker(struct work_struct *w)
>  {
> -       struct mem_cgroup *memcg;
> +       struct mem_cgroup *memcg = NULL;
> +       struct mem_cgroup *next_memcg;
>         int ret, failures = 0;
>         unsigned long thr;
>
>         /* Reclaim down to the accept threshold */
>         thr = zswap_accept_thr_pages();
>
> -       /* global reclaim will select cgroup in a round-robin fashion. */
> +       /* global reclaim will select cgroup in a round-robin fashion.
> +        *
> +        * We save the iteration cursor memcg into zswap_next_shrink,
> +        * which can be modified by the offline memcg cleaner
> +        * zswap_memcg_offline_cleanup().
> +        *
> +        * Since the offline cleaner is called only once, we cannot abandon
> +        * an offline memcg reference in zswap_next_shrink.
> +        * We can rely on the cleaner only if we get an online memcg under lock.
> +        * If we get an offline memcg, we cannot tell whether the cleaner will
> +        * be called later. We must put it before returning from this function.
> +        */
>         do {
> +iternext:
>                 spin_lock(&zswap_shrink_lock);
> -               zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> -               memcg = zswap_next_shrink;
> +               next_memcg = READ_ONCE(zswap_next_shrink);
> +
> +               if (memcg != next_memcg) {
> +                       /*
> +                        * Ours was released by offlining.
> +                        * Use the saved memcg reference.
> +                        */
> +                       memcg = next_memcg;

'memcg' will always be NULL on the first iteration, so we will always
start by shrinking 'zswap_next_shrink' for a second time before moving
the iterator.

> +               } else {
> +                       /* advance cursor */
> +                       memcg = mem_cgroup_iter(NULL, memcg, NULL);
> +                       WRITE_ONCE(zswap_next_shrink, memcg);

Again, I don't see what this is achieving. The first iteration will
always set 'memcg' to 'zswap_next_shrink', and then we will always
move the iterator forward. The only difference I see is that we shrink
'zswap_next_shrink' twice in a row now (last 'memcg' in prev call, and
first 'memcg' in this call).

> +               }
>
>                 /*
> -                * We need to retry if we have gone through a full round trip, or if we
> -                * got an offline memcg (or else we risk undoing the effect of the
> -                * zswap memcg offlining cleanup callback). This is not catastrophic
> -                * per se, but it will keep the now offlined memcg hostage for a while.
> -                *
>                  * Note that if we got an online memcg, we will keep the extra
>                  * reference in case the original reference obtained by mem_cgroup_iter
>                  * is dropped by the zswap memcg offlining callback, ensuring that the
> @@ -1434,16 +1468,25 @@ static void shrink_worker(struct work_struct *w)
>                 }
>
>                 if (!mem_cgroup_tryget_online(memcg)) {
> -                       /* drop the reference from mem_cgroup_iter() */
> -                       mem_cgroup_iter_break(NULL, memcg);
> -                       zswap_next_shrink = NULL;
> +                       /*
> +                        * It is an offline memcg which we cannot shrink
> +                        * until its pages are reparented.
> +                        *
> +                        * Since we cannot determine if the offline cleaner has
> +                        * already been called or not, the offline memcg must be
> +                        * put back unconditionally. We cannot abort the loop while
> +                        * zswap_next_shrink holds a reference to this offline memcg.
> +                        */

You actually deleted the code that puts the ref to the
offline memcg above.

Why don't you just replace mem_cgroup_iter_break(NULL, memcg) with
mem_cgroup_iter(NULL, memcg, NULL) here? I don't understand what the
patch is trying to do to be honest. This patch is a lot more confusing
than it should be.

Also, I would like Nhat to weigh in here. Perhaps the decision to
reset the iterator instead of advancing it in this case was made for a
reason that we should honor. Maybe cgroups are usually offlined
together so we will keep running into offline cgroups here if we
continue? I am not sure.

>                         spin_unlock(&zswap_shrink_lock);
> -
> -                       if (++failures == MAX_RECLAIM_RETRIES)
> -                               break;
> -
> -                       goto resched;
> +                       goto iternext;
>                 }
> +               /*
> +                * We got an extra memcg reference before unlocking.
> +                * The cleaner cannot free it using zswap_next_shrink.
> +                *
> +                * Our memcg can be offlined after we get the online memcg here.
> +                * In this case, the cleaner is waiting on the lock just behind us.
> +                */
>                 spin_unlock(&zswap_shrink_lock);
>
>                 ret = shrink_memcg(memcg);
> @@ -1457,6 +1500,12 @@ static void shrink_worker(struct work_struct *w)
>  resched:
>                 cond_resched();
>         } while (zswap_total_pages() > thr);
> +
> +       /*
> +        * We can still hold the original memcg reference.
> +        * The reference is stored in zswap_next_shrink, and then reused
> +        * by the next shrink_worker().
> +        */
>  }
>
>  /*********************************
> --
> 2.43.0
>
Takero Funaki June 11, 2024, 2:50 p.m. UTC | #2
On 2024/06/11 4:16, Yosry Ahmed wrote:
>
> I am really finding it difficult to understand what the diff is trying
> to do. We are holding a lock that protects zswap_next_shrink. We
> always access it with the lock held. Why do we need all of this?
>
> Adding READ_ONCE() and WRITE_ONCE() where they are not needed is just
> confusing imo.

I initially thought that reading new values from external variables
inside a loop required protection from compiler optimization. I will
remove the access macros in v2.

> 'memcg' will always be NULL on the first iteration, so we will always
> start by shrinking 'zswap_next_shrink' for a second time before moving
> the iterator.
>
>> +               } else {
>> +                       /* advance cursor */
>> +                       memcg = mem_cgroup_iter(NULL, memcg, NULL);
>> +                       WRITE_ONCE(zswap_next_shrink, memcg);
> Again, I don't see what this is achieving. The first iteration will
> always set 'memcg' to 'zswap_next_shrink', and then we will always
> move the iterator forward. The only difference I see is that we shrink
> 'zswap_next_shrink' twice in a row now (last 'memcg' in prev call, and
> first 'memcg' in this call).

The reason for checking `memcg != next_memcg` was to ensure that we do
not skip a memcg when the cursor has been modified by the cleaner.  For
example, say we get memcg A and save it. If the cleaner advances the
cursor from A to B and we then advance from B to C and shrink C, B is
skipped. We have to check that the A in zswap_next_shrink is untouched
before advancing the cursor.

If this approach is overly complicated and ignoring B is acceptable,
the beginning of the loop can be simplified to:

        do {
+iternext:
                spin_lock(&zswap_shrink_lock);

                zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
                memcg = zswap_next_shrink;



>> @@ -1434,16 +1468,25 @@ static void shrink_worker(struct work_struct *w)
>>                 }
>>
>>                 if (!mem_cgroup_tryget_online(memcg)) {
>> -                       /* drop the reference from mem_cgroup_iter() */
>> -                       mem_cgroup_iter_break(NULL, memcg);
>> -                       zswap_next_shrink = NULL;
>> +                       /*
>> +                        * It is an offline memcg which we cannot shrink
>> +                        * until its pages are reparented.
>> +                        *
>> +                        * Since we cannot determine if the offline cleaner has
>> +                        * already been called or not, the offline memcg must be
>> +                        * put back unconditionally. We cannot abort the loop while
>> +                        * zswap_next_shrink holds a reference to this offline memcg.
>> +                        */
> You actually deleted the code that puts the ref to the
> offline memcg above.
>
> Why don't you just replace mem_cgroup_iter_break(NULL, memcg) with
> mem_cgroup_iter(NULL, memcg, NULL) here? I don't understand what the
> patch is trying to do to be honest. This patch is a lot more confusing
> than it should be.


>>                         spin_unlock(&zswap_shrink_lock);
>> -
>> -                       if (++failures == MAX_RECLAIM_RETRIES)
>> -                               break;
>> -
>> -                       goto resched;
>> +                       goto iternext;
>>                 }

Removing the `break` on max failures from the if-offline branch is
required so that we do not leave a reference to the next memcg behind.

If we just replace the mem_cgroup_iter_break with `memcg =
zswap_next_shrink = mem_cgroup_iter(NULL, memcg, NULL);` and break the
loop on failure, the next memcg will be left in zswap_next_shrink. If
zswap_next_shrink is also offline, the reference will be held
indefinitely.

When we get an offline memcg, we cannot determine if the cleaner has
already been called or will be called later. We have to put back the
offline memcg reference before returning from the worker function.
This potential memcg leak is the reason why I think we cannot break
the loop here.
In this patch, the `goto iternext` ensures the offline memcg is
released in the next iteration (or by the cleaner waiting for our unlock).

>
> Also, I would like Nhat to weigh in here. Perhaps the decision to
> reset the iterator instead of advancing it in this case was made for a
> reason that we should honor. Maybe cgroups are usually offlined
> together so we will keep running into offline cgroups here if we
> continue? I am not sure.

From the comment I removed:

>> -                * We need to retry if we have gone through a full round trip, or if we
>> -                * got an offline memcg (or else we risk undoing the effect of the
>> -                * zswap memcg offlining cleanup callback). This is not catastrophic
>> -                * per se, but it will keep the now offlined memcg hostage for a while.

I think this mentioned the potential memcg leak, which is now resolved
by this patch modifying the offline memcg case.
Nhat Pham June 11, 2024, 6:26 p.m. UTC | #3
On Sat, Jun 8, 2024 at 8:53 AM Takero Funaki <flintglass@gmail.com> wrote:
>
> This patch fixes an issue where the zswap global shrinker stopped
> iterating through the memcg tree.
>
> The problem was that `shrink_worker()` would stop iterating when a memcg
> was being offlined and restart from the tree root.  Now, it properly
> handles the offlining memcg and continues shrinking with the next memcg.
>
> This patch also modifies handling of the lock for the offlined memcg
> cleaner to adapt to the change in the iteration, and to avoid the
> negligibly rare skipping of a memcg from the shrink iteration.
>
> Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware")
> Signed-off-by: Takero Funaki <flintglass@gmail.com>
> ---
>  mm/zswap.c | 87 ++++++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 68 insertions(+), 19 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 80c634acb8d5..d720a42069b6 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -827,12 +827,27 @@ void zswap_folio_swapin(struct folio *folio)
>         }
>  }
>
> +/*
> + * This function should be called when a memcg is being offlined.
> + *
> + * Since the global shrinker shrink_worker() may hold a reference
> + * to the memcg, we must check and release the reference in
> + * zswap_next_shrink.
> + *
> + * shrink_worker() must handle the case where this function releases
> + * the reference to the memcg being shrunk.
> + */
>  void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
>  {
>         /* lock out zswap shrinker walking memcg tree */
>         spin_lock(&zswap_shrink_lock);
> -       if (zswap_next_shrink == memcg)
> -               zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> +
> +       if (READ_ONCE(zswap_next_shrink) == memcg) {
> +               /* put back reference and advance the cursor */
> +               memcg = mem_cgroup_iter(NULL, memcg, NULL);
> +               WRITE_ONCE(zswap_next_shrink, memcg);
> +       }
> +
>         spin_unlock(&zswap_shrink_lock);
>  }

As I have noted in v0, I think this is unnecessary and makes it more confusing.

>
> @@ -1401,25 +1416,44 @@ static int shrink_memcg(struct mem_cgroup *memcg)
>
>  static void shrink_worker(struct work_struct *w)
>  {
> -       struct mem_cgroup *memcg;
> +       struct mem_cgroup *memcg = NULL;
> +       struct mem_cgroup *next_memcg;
>         int ret, failures = 0;
>         unsigned long thr;
>
>         /* Reclaim down to the accept threshold */
>         thr = zswap_accept_thr_pages();
>
> -       /* global reclaim will select cgroup in a round-robin fashion. */
> +       /* global reclaim will select cgroup in a round-robin fashion.
> +        *
> +        * We save the iteration cursor memcg into zswap_next_shrink,
> +        * which can be modified by the offline memcg cleaner
> +        * zswap_memcg_offline_cleanup().
> +        *
> +        * Since the offline cleaner is called only once, we cannot abandon
> +        * an offline memcg reference in zswap_next_shrink.
> +        * We can rely on the cleaner only if we get an online memcg under lock.
> +        * If we get an offline memcg, we cannot tell whether the cleaner will
> +        * be called later. We must put it before returning from this function.
> +        */
>         do {
> +iternext:
>                 spin_lock(&zswap_shrink_lock);
> -               zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> -               memcg = zswap_next_shrink;
> +               next_memcg = READ_ONCE(zswap_next_shrink);
> +
> +               if (memcg != next_memcg) {
> +                       /*
> +                        * Ours was released by offlining.
> +                        * Use the saved memcg reference.
> +                        */
> +                       memcg = next_memcg;
> +               } else {
> +                       /* advance cursor */
> +                       memcg = mem_cgroup_iter(NULL, memcg, NULL);
> +                       WRITE_ONCE(zswap_next_shrink, memcg);
> +               }

I suppose I'm fine with not advancing the memcg when it is already
advanced by the memcg offlining callback.

>
>                 /*
> -                * We need to retry if we have gone through a full round trip, or if we
> -                * got an offline memcg (or else we risk undoing the effect of the
> -                * zswap memcg offlining cleanup callback). This is not catastrophic
> -                * per se, but it will keep the now offlined memcg hostage for a while.
> -                *
>                  * Note that if we got an online memcg, we will keep the extra
>                  * reference in case the original reference obtained by mem_cgroup_iter
>                  * is dropped by the zswap memcg offlining callback, ensuring that the
> @@ -1434,16 +1468,25 @@ static void shrink_worker(struct work_struct *w)
>                 }
>
>                 if (!mem_cgroup_tryget_online(memcg)) {
> -                       /* drop the reference from mem_cgroup_iter() */
> -                       mem_cgroup_iter_break(NULL, memcg);
> -                       zswap_next_shrink = NULL;
> +                       /*
> +                        * It is an offline memcg which we cannot shrink
> +                        * until its pages are reparented.
> +                        *
> +                        * Since we cannot determine if the offline cleaner has
> +                        * already been called or not, the offline memcg must be
> +                        * put back unconditionally. We cannot abort the loop while
> +                        * zswap_next_shrink holds a reference to this offline memcg.
> +                        */
>                         spin_unlock(&zswap_shrink_lock);
> -
> -                       if (++failures == MAX_RECLAIM_RETRIES)
> -                               break;
> -
> -                       goto resched;
> +                       goto iternext;

Hmmm yeah in the past, I set it to NULL to make sure we're not
replacing zswap_next_shrink with an offlined memcg, after the zswap
offlining callback for that memcg has been completed.

I suppose we can just call mem_cgroup_iter(...) on that offlined
cgroup, but I'm not 100% sure what happens when we call this function
on a cgroup that is currently being offlined, and has gone past the
zswap offline callback stage. So I was just playing it safe and
restarting from the top of the tree :)

I think this implementation has that behavior right? We see that the
memcg is offlined, so we drop the lock and go to the beginning of the
loop. We reacquire the lock, and might see that zswap_next_shrink ==
memcg, so we call mem_cgroup_iter(...) on it. Is this safe?

Note that zswap_shrink_lock only serializes this memcg
selection loop with memcg offlining after it - there's no guarantee
what the behavior is for memcg offlining before it (well other than
one reference that we manage to acquire thanks to
mem_cgroup_iter(...), so that memcg has not been freed, but not sure
what we can guarantee regarding its place in the memcg hierarchy
tree?).

Johannes, do you have any idea what we can expect here? Let me also cc Shakeel.





>                 }
> +               /*
> +                * We got an extra memcg reference before unlocking.
> +                * The cleaner cannot free it using zswap_next_shrink.
> +                *
> +                * Our memcg can be offlined after we get the online memcg here.
> +                * In this case, the cleaner is waiting on the lock just behind us.
> +                */
>                 spin_unlock(&zswap_shrink_lock);
>
>                 ret = shrink_memcg(memcg);
> @@ -1457,6 +1500,12 @@ static void shrink_worker(struct work_struct *w)
>  resched:
>                 cond_resched();
>         } while (zswap_total_pages() > thr);
> +
> +       /*
> +        * We can still hold the original memcg reference.
> +        * The reference is stored in zswap_next_shrink, and then reused
> +        * by the next shrink_worker().
> +        */
>  }
>
>  /*********************************
> --
> 2.43.0
>
Shakeel Butt June 11, 2024, 11:03 p.m. UTC | #4
On Tue, Jun 11, 2024 at 11:26:12AM GMT, Nhat Pham wrote:
[...]
> 
> Hmmm yeah in the past, I set it to NULL to make sure we're not
> replacing zswap_next_shrink with an offlined memcg, after the zswap
> offlining callback for that memcg has been completed.
> 
> I suppose we can just call mem_cgroup_iter(...) on that offlined
> cgroup, but I'm not 100% sure what happens when we call this function
> on a cgroup that is currently being offlined, and has gone past the
> zswap offline callback stage. So I was just playing it safe and
> restarting from the top of the tree :)
> 
> I think this implementation has that behavior right? We see that the
> memcg is offlined, so we drop the lock and go to the beginning of the
> loop. We reacquire the lock, and might see that zswap_next_shrink ==
> memcg, so we call mem_cgroup_iter(...) on it. Is this safe?
> 
> Note that zswap_shrink_lock only serializes this memcg
> selection loop with memcg offlining after it - there's no guarantee
> what the behavior is for memcg offlining before it (well other than
> one reference that we manage to acquire thanks to
> mem_cgroup_iter(...), so that memcg has not been freed, but not sure
> what we can guarantee regarding its place in the memcg hierarchy
> tree?).
> 
> Johannes, do you have any idea what we can expect here? Let me also cc Shakeel.
> 
> 
> 

mem_cgroup_iter() does a pre-order traversal, so you can see mixture of
online and offlined memcgs during a traversal.
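
For illustration, the canonical walk looks like this (a minimal sketch;
walk_all_memcgs() is a hypothetical name for this example only):

```c
/*
 * mem_cgroup_iter(root, prev, reclaim) puts the reference on @prev and
 * returns the next memcg of the pre-order walk with a reference held;
 * it returns NULL once a full round trip completes.
 */
static void walk_all_memcgs(void)
{
	struct mem_cgroup *memcg = NULL;

	while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))) {
		/* offline memcgs can show up anywhere in the walk */
		if (!mem_cgroup_online(memcg))
			continue;
		/* ... act on the online memcg ... */
	}
}
```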
Takero Funaki June 12, 2024, 6:16 p.m. UTC | #5
On Wed, Jun 12, 2024 at 3:26 Nhat Pham <nphamcs@gmail.com> wrote:

>
> As I have noted in v0, I think this is unnecessary and makes it more confusing.
>

Does spin_lock() ensure that compiler optimizations do not remove
memory access to an external variable? I think we need to use
READ_ONCE/WRITE_ONCE for shared variable access even under a spinlock.
For example,
https://elixir.bootlin.com/linux/latest/source/mm/mmu_notifier.c#L234

isn't this a common use case of READ_ONCE?
```c
bool shared_flag = false;
spinlock_t flag_lock;

void somefunc(void) {
    for (;;) {
        spin_lock(&flag_lock);
        /* check external updates */
        if (READ_ONCE(shared_flag))
            break;
        /* do something */
        spin_unlock(&flag_lock);
    }
    spin_unlock(&flag_lock);
}
```
Without READ_ONCE, the check could be hoisted out of the loop by the optimizer.

In shrink_worker, zswap_next_shrink is the shared_flag, which can be
updated by concurrent cleaner threads, so it must be re-read every
time we reacquire the lock. Am I badly misunderstanding something?

> >         do {
> > +iternext:
> >                 spin_lock(&zswap_shrink_lock);
> > -               zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> > -               memcg = zswap_next_shrink;
> > +               next_memcg = READ_ONCE(zswap_next_shrink);
> > +
> > +               if (memcg != next_memcg) {
> > +                       /*
> > +                        * Ours was released by offlining.
> > +                        * Use the saved memcg reference.
> > +                        */
> > +                       memcg = next_memcg;
> > +               } else {
> > +                       /* advance cursor */
> > +                       memcg = mem_cgroup_iter(NULL, memcg, NULL);
> > +                       WRITE_ONCE(zswap_next_shrink, memcg);
> > +               }
>
> I suppose I'm fine with not advancing the memcg when it is already
> advanced by the memcg offlining callback.
>

For where to restart the shrinking, as Yosry pointed, my version
starts from the last memcg (=retrying failed memcg or evicting once
more)
I now realize that skipping the next memcg of offlined memcg is less
likely to happen. I am reverting it to restart from the next memcg of
zswap_next_shrink.
Which one could be better?

> >
> >                 /*
> > -                * We need to retry if we have gone through a full round trip, or if we
> > -                * got an offline memcg (or else we risk undoing the effect of the
> > -                * zswap memcg offlining cleanup callback). This is not catastrophic
> > -                * per se, but it will keep the now offlined memcg hostage for a while.
> > -                *
> >                  * Note that if we got an online memcg, we will keep the extra
> >                  * reference in case the original reference obtained by mem_cgroup_iter
> >                  * is dropped by the zswap memcg offlining callback, ensuring that the
> > @@ -1434,16 +1468,25 @@ static void shrink_worker(struct work_struct *w)
> >                 }
> >
> >                 if (!mem_cgroup_tryget_online(memcg)) {
> > -                       /* drop the reference from mem_cgroup_iter() */
> > -                       mem_cgroup_iter_break(NULL, memcg);
> > -                       zswap_next_shrink = NULL;
> > +                       /*
> > +                        * It is an offline memcg which we cannot shrink
> > +                        * until its pages are reparented.
> > +                        *
> > +                        * Since we cannot determine if the offline cleaner has
> > +                        * already been called or not, the offline memcg must be
> > +                        * put back unconditionally. We cannot abort the loop while
> > +                        * zswap_next_shrink holds a reference to this offline memcg.
> > +                        */
> >                         spin_unlock(&zswap_shrink_lock);
> > -
> > -                       if (++failures == MAX_RECLAIM_RETRIES)
> > -                               break;
> > -
> > -                       goto resched;
> > +                       goto iternext;
>
> Hmmm yeah in the past, I set it to NULL to make sure we're not
> replacing zswap_next_shrink with an offlined memcg, after the zswap
> offlining callback for that memcg has been completed.
>
> I suppose we can just call mem_cgroup_iter(...) on that offlined
> cgroup, but I'm not 100% sure what happens when we call this function
> on a cgroup that is currently being offlined, and has gone past the
> zswap offline callback stage. So I was just playing it safe and
> restarting from the top of the tree :)
>
> I think this implementation has that behavior right? We see that the
> memcg is offlined, so we drop the lock and go to the beginning of the
> loop. We reacquire the lock, and might see that zswap_next_shrink ==
> memcg, so we call mem_cgroup_iter(...) on it. Is this safe?
>
> Note that zswap_shrink_lock only serializes this memcg
> selection loop with memcg offlining after it - there's no guarantee
> what the behavior is for memcg offlining before it (well other than
> one reference that we manage to acquire thanks to
> mem_cgroup_iter(...), so that memcg has not been freed, but not sure
> what we can guarantee regarding its place in the memcg hierarchy
> tree?).

The locking mechanism in shrink_worker does not rely on what the next
memcg is; the sorting stability of mem_cgroup_iter does not matter
here.
The expectation for the iterator is that it will walk through all live
memcgs. I believe mem_cgroup_iter uses parent-to-leaf ordering of
cgroups and ensures all live cgroups are walked at least once,
regardless of their onlineness.
https://elixir.bootlin.com/linux/v6.10-rc2/source/mm/memcontrol.c#L1368

Regarding the reference leak, I overlooked a scenario where a leak
might occur in the existing cleaner, although it should be rare.

When the cleaner is called on the memcg in zswap_next_shrink, the next
memcg from mem_cgroup_iter() can be an offline, already-cleaned memcg,
resulting in the cleaner leaking a reference to that next memcg. We
should implement the same online check in the cleaner, like this:


```c
void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
{
        struct mem_cgroup *next;

        /* lock out zswap shrinker walking memcg tree */
        spin_lock(&zswap_shrink_lock);
        if (zswap_next_shrink == memcg) {
                next = zswap_next_shrink;
                do {
                        next = mem_cgroup_iter(NULL, next, NULL);
                        WRITE_ONCE(zswap_next_shrink, next);

                        spin_unlock(&zswap_shrink_lock);
                        /* zswap_next_shrink might be updated here */
                        spin_lock(&zswap_shrink_lock);

                        next = READ_ONCE(zswap_next_shrink);
                        if (!next)
                                break;
                } while (!mem_cgroup_online(next));
                /*
                 * We verified the next memcg is online under lock.
                 * Even if the next memcg is being offlined here, another
                 * cleaner for the next memcg is waiting for our unlock just
                 * behind us.  We can leave the next memcg reference.
                 */
        }
        spin_unlock(&zswap_shrink_lock);
}
```

As in shrink_worker, we must check under the lock that the next memcg
is online before leaving the ref in zswap_next_shrink.
Otherwise, zswap_next_shrink might hold the ref of an offlined,
already-cleaned memcg.

Or, if you are concerned about temporarily storing an unchecked or
offline memcg in zswap_next_shrink, it is safe because:

1. If there is no other cleaner running for zswap_next_shrink, the ref
saved in zswap_next_shrink ensures the liveness of the memcg when the
lock is reacquired.
2. Another cleaner thread may put back and replace zswap_next_shrink
with its next memcg. We will check the onlineness of the new
zswap_next_shrink under the reacquired lock.
3. Even if the verified-online memcg is being offlined concurrently,
another cleaner thread must wait for our unlock. We can leave the
online memcg and rely on its respective cleaner.
Yosry Ahmed June 12, 2024, 6:28 p.m. UTC | #6
On Wed, Jun 12, 2024 at 11:16 AM Takero Funaki <flintglass@gmail.com> wrote:
>
> On Wed, Jun 12, 2024 at 3:26 Nhat Pham <nphamcs@gmail.com> wrote:
>
> >
> > As I have noted in v0, I think this is unnecessary and makes it more confusing.
> >
>
> Does spin_lock() ensure that compiler optimizations do not remove
> memory access to an external variable? I think we need to use
> READ_ONCE/WRITE_ONCE for shared variable access even under a spinlock.
> For example,
> https://elixir.bootlin.com/linux/latest/source/mm/mmu_notifier.c#L234

In this example, it seems like mmu_interval_set_seq() updates
interval_sub->invalidate_seq locklessly using WRITE_ONCE(). I think
this is why READ_ONCE() is required in that particular case.

>
> isn't this a common use case of READ_ONCE?
> ```c
> bool shared_flag = false;
> spinlock_t flag_lock;
>
> void somefunc(void) {
>     for (;;) {
>         spin_lock(&flag_lock);
>         /* check external updates */
>         if (READ_ONCE(shared_flag))
>             break;
>         /* do something */
>         spin_unlock(&flag_lock);
>     }
>     spin_unlock(&flag_lock);
> }
> ```
> Without READ_ONCE, the check can be extracted from the loop by optimization.

According to Documentation/memory-barriers.txt, lock acquiring
functions are implicit memory barriers. Otherwise, the compiler would
be able to pull any memory access outside of the lock critical section
and locking wouldn't be reliable.
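
Concretely, the example above needs no READ_ONCE() as long as
shared_flag is only ever written with flag_lock held (a minimal sketch
under that assumption):

```c
static DEFINE_SPINLOCK(flag_lock);
static bool shared_flag;	/* only written under flag_lock */

void somefunc(void)
{
	for (;;) {
		spin_lock(&flag_lock);	/* implicit barrier: shared_flag is re-read */
		if (shared_flag)
			break;		/* leave the loop with the lock held */
		/* do something */
		spin_unlock(&flag_lock);
	}
	spin_unlock(&flag_lock);	/* release the lock taken before break */
}
```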
Takero Funaki June 13, 2024, 2:13 a.m. UTC | #7
On Thu, Jun 13, 2024 at 3:28 Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Wed, Jun 12, 2024 at 11:16 AM Takero Funaki <flintglass@gmail.com> wrote:
> >
> > On Wed, Jun 12, 2024 at 3:26 Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > >
> > > As I have noted in v0, I think this is unnecessary and makes it more confusing.
> > >
> >
> > Does spin_lock() ensure that compiler optimizations do not remove
> > memory access to an external variable? I think we need to use
> > READ_ONCE/WRITE_ONCE for shared variable access even under a spinlock.
> > For example,
> > https://elixir.bootlin.com/linux/latest/source/mm/mmu_notifier.c#L234
>
> In this example, it seems like mmu_interval_set_seq() updates
> interval_sub->invalidate_seq locklessly using WRITE_ONCE(). I think
> this is why READ_ONCE() is required in that particular case.
>
> >
> > isn't this a common use case of READ_ONCE?
> > ```c
> > bool shared_flag = false;
> > spinlock_t flag_lock;
> >
> > void somefunc(void) {
> >     for (;;) {
> >         spin_lock(&flag_lock);
> >         /* check external updates */
> >         if (READ_ONCE(shared_flag))
> >             break;
> >         /* do something */
> >         spin_unlock(&flag_lock);
> >     }
> >     spin_unlock(&flag_lock);
> > }
> > ```
> > Without READ_ONCE, the check can be extracted from the loop by optimization.
>
> According to Documentation/memory-barriers.txt, lock acquiring
> functions are implicit memory barriers. Otherwise, the compiler would
> be able to pull any memory access outside of the lock critical section
> and locking wouldn't be reliable.

Ah, I understand now. The implicit barrier is sufficient as long as
all memory access occurs within the lock. It’s a fundamental rule of
locking—facepalm.

I misread some module code, like in the link, where a lockless write
could invade a critical section. My example was in the opposite
direction, just wrong. Thank you so much for clarifying my
misunderstanding.

Rechecking the patch for now, I suppose the locking mechanism itself is
not affected by my misunderstanding of READ_ONCE.

The corrected version of the cleaner should be:
```c
void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
{
        /* lock out zswap shrinker walking memcg tree */
        spin_lock(&zswap_shrink_lock);
        if (zswap_next_shrink == memcg) {
                do {
                        zswap_next_shrink = mem_cgroup_iter(NULL,
                                        zswap_next_shrink, NULL);
                        spin_unlock(&zswap_shrink_lock);
                        spin_lock(&zswap_shrink_lock);
                        if (!zswap_next_shrink)
                                break;
                } while (!mem_cgroup_online(zswap_next_shrink));
        }
        spin_unlock(&zswap_shrink_lock);
}
```

Should we have a separate patch to fix the leak scenario?
Yosry Ahmed June 13, 2024, 2:18 a.m. UTC | #8
On Wed, Jun 12, 2024 at 7:13 PM Takero Funaki <flintglass@gmail.com> wrote:
>
> On Thu, Jun 13, 2024 at 3:28 Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Wed, Jun 12, 2024 at 11:16 AM Takero Funaki <flintglass@gmail.com> wrote:
> > >
> > > On Wed, Jun 12, 2024 at 3:26 Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > >
> > > > As I have noted in v0, I think this is unnecessary and makes it more confusing.
> > > >
> > >
> > > Does spin_lock() ensure that compiler optimizations do not remove
> > > memory access to an external variable? I think we need to use
> > > READ_ONCE/WRITE_ONCE for shared variable access even under a spinlock.
> > > For example,
> > > https://elixir.bootlin.com/linux/latest/source/mm/mmu_notifier.c#L234
> >
> > In this example, it seems like mmu_interval_set_seq() updates
> > interval_sub->invalidate_seq locklessly using WRITE_ONCE(). I think
> > this is why READ_ONCE() is required in that particular case.
> >
> > >
> > > isn't this a common use case of READ_ONCE?
> > > ```c
> > > bool shared_flag = false;
> > > spinlock_t flag_lock;
> > >
> > > void somefunc(void) {
> > >     for (;;) {
> > >         spin_lock(&flag_lock);
> > >         /* check external updates */
> > >         if (READ_ONCE(shared_flag))
> > >             break;
> > >         /* do something */
> > >         spin_unlock(&flag_lock);
> > >     }
> > >     spin_unlock(&flag_lock);
> > > }
> > > ```
> > > Without READ_ONCE, the check can be extracted from the loop by optimization.
> >
> > According to Documentation/memory-barriers.txt, lock acquiring
> > functions are implicit memory barriers. Otherwise, the compiler would
> > be able to pull any memory access outside of the lock critical section
> > and locking wouldn't be reliable.
>
> Ah, I understand now. The implicit barrier is sufficient as long as
> all memory access occurs within the lock. It’s a fundamental rule of
> locking—facepalm.
>
> I misread a module code, like in the link, where a lockless write
> could invade a critical section. My example was in the opposite
> direction, just wrong. Thank you so much for clarifying my
> misunderstanding.
>
> For now checking the patch, I suppose the locking mechanism itself is
> not affected by my misunderstanding of READ_ONCE.
>
> The corrected version of the cleaner should be:
> ```c
> void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
> {
>         /* lock out zswap shrinker walking memcg tree */
>         spin_lock(&zswap_shrink_lock);
>         if (zswap_next_shrink == memcg) {
>                 do {
>                         zswap_next_shrink = mem_cgroup_iter(NULL,
>                                         zswap_next_shrink, NULL);
>                         spin_unlock(&zswap_shrink_lock);
>                         spin_lock(&zswap_shrink_lock);
>                         if (!zswap_next_shrink)
>                                 break;
>                 } while (!mem_cgroup_online(zswap_next_shrink));
>         }
>         spin_unlock(&zswap_shrink_lock);
> }
> ```

Is the idea here to avoid moving the iterator to another offline memcg
that zswap_memcg_offline_cleanup() was already called for, to avoid
holding a ref on that memcg until the next run of zswap shrinking?

If yes, I think it's probably worth doing. But why do we need to
release and reacquire the lock in the loop above?
Takero Funaki June 13, 2024, 2:35 a.m. UTC | #9
On Thu, Jun 13, 2024 at 11:18 Yosry Ahmed <yosryahmed@google.com> wrote:

> > The corrected version of the cleaner should be:
> > ```c
> > void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
> > {
> >         /* lock out zswap shrinker walking memcg tree */
> >         spin_lock(&zswap_shrink_lock);
> >         if (zswap_next_shrink == memcg) {
> >                 do {
> >                         zswap_next_shrink = mem_cgroup_iter(NULL,
> >                                         zswap_next_shrink, NULL);
> >                         spin_unlock(&zswap_shrink_lock);
> >                         spin_lock(&zswap_shrink_lock);
> >                         if (!zswap_next_shrink)
> >                                 break;
> >                 } while (!mem_cgroup_online(zswap_next_shrink));
> >         }
> >         spin_unlock(&zswap_shrink_lock);
> > }
> > ```
>
> Is the idea here to avoid moving the iterator to another offline memcg
> that zswap_memcg_offline_cleanup() was already called for, to avoid
> holding a ref on that memcg until the next run of zswap shrinking?
>
> If yes, I think it's probably worth doing. But why do we need to
> release and reacquire the lock in the loop above?

Yes, the existing cleaner might leave the offline, already-cleaned memcg.

Reacquiring the lock is to avoid looping inside the critical section.
In shrink_worker of the v0 patch, the loop was restarted on an offline
memcg without releasing the lock. Nhat pointed out that we should drop
the lock after every mem_cgroup_iter() call. v1 was changed to
reacquire once per iteration, like the cleaner code above.
Yosry Ahmed June 13, 2024, 2:57 a.m. UTC | #10
On Wed, Jun 12, 2024 at 7:36 PM Takero Funaki <flintglass@gmail.com> wrote:
>
> On Thu, Jun 13, 2024 at 11:18 Yosry Ahmed <yosryahmed@google.com> wrote:
>
> > > The corrected version of the cleaner should be:
> > > ```c
> > > void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
> > > {
> > >         /* lock out zswap shrinker walking memcg tree */
> > >         spin_lock(&zswap_shrink_lock);
> > >         if (zswap_next_shrink == memcg) {
> > >                 do {
> > >                         zswap_next_shrink = mem_cgroup_iter(NULL,
> > >                                         zswap_next_shrink, NULL);
> > >                         spin_unlock(&zswap_shrink_lock);
> > >                         spin_lock(&zswap_shrink_lock);
> > >                         if (!zswap_next_shrink)
> > >                                 break;
> > >                 } while (!mem_cgroup_online(zswap_next_shrink));
> > >         }
> > >         spin_unlock(&zswap_shrink_lock);
> > > }
> > > ```
> >
> > Is the idea here to avoid moving the iterator to another offline memcg
> > that zswap_memcg_offline_cleanup() was already called for, to avoid
> > holding a ref on that memcg until the next run of zswap shrinking?
> >
> > If yes, I think it's probably worth doing. But why do we need to
> > release and reacquire the lock in the loop above?
>
> Yes, the existing cleaner might leave the offline, already-cleaned memcg.
>
> Reacquiring the lock is to avoid looping inside the critical section.
> In shrink_worker of the v0 patch, the loop was restarted on an offline
> memcg without releasing the lock. Nhat pointed out that we should drop
> the lock after every mem_cgroup_iter() call. v1 was changed to
> reacquire once per iteration, like the cleaner code above.

I am not sure how often we'll run into a situation where we'll be
holding the lock for too long tbh. It should be unlikely to keep
encountering offline memcgs for a long time.

Nhat, do you think this could cause a problem in practice?
Nhat Pham June 13, 2024, 3:04 p.m. UTC | #11
On Wed, Jun 12, 2024 at 7:58 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Wed, Jun 12, 2024 at 7:36 PM Takero Funaki <flintglass@gmail.com> wrote:
> >
> > On Thu, Jun 13, 2024 at 11:18 Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > > > The corrected version of the cleaner should be:
> > > > ```c
> > > > void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
> > > > {
> > > >         /* lock out zswap shrinker walking memcg tree */
> > > >         spin_lock(&zswap_shrink_lock);
> > > >         if (zswap_next_shrink == memcg) {
> > > >                 do {
> > > >                         zswap_next_shrink = mem_cgroup_iter(NULL,
> > > >                                         zswap_next_shrink, NULL);
> > > >                         spin_unlock(&zswap_shrink_lock);
> > > >                         spin_lock(&zswap_shrink_lock);
> > > >                         if (!zswap_next_shrink)
> > > >                                 break;
> > > >                 } while (!mem_cgroup_online(zswap_next_shrink));
> > > >         }
> > > >         spin_unlock(&zswap_shrink_lock);
> > > > }
> > > > ```
> > >
> > > Is the idea here to avoid moving the iterator to another offline memcg
> > > that zswap_memcg_offline_cleanup() was already called for, to avoid
> > > holding a ref on that memcg until the next run of zswap shrinking?
> > >
> > > If yes, I think it's probably worth doing. But why do we need to
> > > release and reacquire the lock in the loop above?
> >
> > Yes, the existing cleaner might leave the offline, already-cleaned memcg.
> >
> > Reacquiring the lock is to avoid looping inside the critical section.
> > In shrink_worker of the v0 patch, the loop was restarted on an offline
> > memcg without releasing the lock. Nhat pointed out that we should drop
> > the lock after every mem_cgroup_iter() call. v1 was changed to
> > reacquire once per iteration, like the cleaner code above.
>
> I am not sure how often we'll run into a situation where we'll be
> holding the lock for too long tbh. It should be unlikely to keep
> encountering offline memcgs for a long time.
>
> Nhat, do you think this could cause a problem in practice?

I don't remember prescribing anything to be honest :) I think I was
just asking why can't we just drop the lock, then "continue;". This is
mostly for simplicity's sake.

https://lore.kernel.org/linux-mm/CAKEwX=MwrRc43iM2050v5u-TPUK4Yn+a4G7+h6ieKhpQ7WtQ=A@mail.gmail.com/

But I think as Takero pointed out, it would still skip over the memcg
that was (concurrently) updated to zswap_next_shrink by the memcg
offline callback.
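
For reference, that "drop the lock, then continue" shape would be
roughly the following (a hedged sketch only, with the failure
accounting elided; as noted, it can still skip a memcg that the
offline callback just installed in zswap_next_shrink):

```c
	do {
		spin_lock(&zswap_shrink_lock);
		zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
		memcg = zswap_next_shrink;

		if (!memcg) {
			/* full round trip; failure accounting elided */
			spin_unlock(&zswap_shrink_lock);
			break;
		}
		if (!mem_cgroup_tryget_online(memcg)) {
			/* offline: drop the lock and advance to the next memcg */
			spin_unlock(&zswap_shrink_lock);
			continue;
		}
		spin_unlock(&zswap_shrink_lock);

		/* ... reclaim from memcg, then mem_cgroup_put(memcg) ... */
	} while (zswap_total_pages() > thr);
```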
Nhat Pham June 13, 2024, 4:08 p.m. UTC | #12
On Sat, Jun 8, 2024 at 8:53 AM Takero Funaki <flintglass@gmail.com> wrote:
>
> This patch fixes an issue where the zswap global shrinker stopped
> iterating through the memcg tree.
>
> The problem was that `shrink_worker()` would stop iterating when a memcg
> was being offlined and restart from the tree root.  Now, it properly
> handles the offlining memcg and continues shrinking with the next memcg.
>
> This patch also modifies handling of the lock for the offlined memcg
> cleaner to adapt to the change in the iteration, and to avoid the
> negligibly rare skipping of a memcg from the shrink iteration.
>

Honestly, I think this version is even more complicated than v0 :)

Hmmm how about this:

do {
    spin_lock(&zswap_pools_lock);
    do {
        /* no offline callback has run for memcg yet */
        if (memcg == zswap_next_shrink)
            zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);

        memcg = zswap_next_shrink;
        if (!memcg && ++failure > MAX_FAILURE) {
            spin_unlock(&zswap_pools_lock);
            return;
        }

    } while (!zswap_next_shrink || !mem_cgroup_tryget_online(zswap_next_shrink));
    spin_unlock(&zswap_pools_lock);

    /* proceed with reclaim */
} while (...);

This should achieve all the goals, right?

1. No restarting from the top, unless we have traversed the entire hierarchy.

2. No skipping over zswap_next_shrink updated by the memcg offline cleaner.

3. No leaving offlined zswap_next_shrink hanging (and blocking memcg
killing indefinitely).

4. No unnecessarily long-running loop. If you want to be extra safe,
we can put a max number of retries on the offline memcg case too (and
restart from the top).

5. At the end of the inner loop, you are guaranteed to hold at least
one reference to memcg (and perhaps 2, if zswap_next_shrink is not
updated by the cleaner).

6. At the end of the inner loop, the selected memcg is known to not
have its cleaner started yet (since it is online with zswap_pools_lock
held). Our selection will undo the cleaner and hold the memcg hostage
forever.

Is there anything that I'm missing? We are not dropping the lock for
the entirety of the inner loop, but honestly I'm fine with this
trade-off, for the sake of simplicity :)

If you want it to be even more readable, feel free to refactor the
inner loop into a helper function (but it's probably redundant since
it's only used once).
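
For instance, such a helper might look like this (a hedged sketch;
zswap_select_next_memcg() is a hypothetical name, MAX_FAILURE is the
placeholder from the sketch above, and the extra-reference bookkeeping
on the previously selected memcg is left to the caller):

```c
/* Select the next online memcg under the lock, or NULL to give up. */
static struct mem_cgroup *zswap_select_next_memcg(struct mem_cgroup *memcg)
{
	int failure = 0;

	spin_lock(&zswap_pools_lock);
	do {
		/* no offline callback has run for memcg yet */
		if (memcg == zswap_next_shrink)
			zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);

		memcg = zswap_next_shrink;
		if (!memcg && ++failure > MAX_FAILURE) {
			spin_unlock(&zswap_pools_lock);
			return NULL;
		}
	} while (!memcg || !mem_cgroup_tryget_online(memcg));
	spin_unlock(&zswap_pools_lock);

	return memcg;	/* online, with an extra reference held */
}
```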
Nhat Pham June 13, 2024, 4:09 p.m. UTC | #13
On Thu, Jun 13, 2024 at 9:08 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> 6. At the end of the inner loop, the selected memcg is known to not
> have its cleaner started yet (since it is online with zswap_pools_lock
> held). Our selection will undo the cleaner and hold the memcg hostage
> forever.

/s/will undo/will not undo

My apologies - English is hard :)
Shakeel Butt June 13, 2024, 4:49 p.m. UTC | #14
On Thu, Jun 13, 2024 at 08:04:39AM GMT, Nhat Pham wrote:
[...]
> > > >
> > > > Is the idea here to avoid moving the iterator to another offline memcg
> > > > that zswap_memcg_offline_cleanup() was already called for, to avoid
> > > > holding a ref on that memcg until the next run of zswap shrinking?
> > > >
> > > > If yes, I think it's probably worth doing. But why do we need to
> > > > release and reacquire the lock in the loop above?
> > >
> > > Yes, the existing cleaner might leave the offline, already-cleaned memcg.
> > >
> > > The reacquiring lock is to not loop inside the critical section.
> > > In shrink_worker of v0 patch, the loop was restarted on offline memcg
> > > without releasing the lock. Nhat pointed out that we should drop the
> > > lock after every mem_cgroup_iter() call. v1 was changed to reacquire
> > > once per iteration like the cleaner code above.
> >
> > I am not sure how often we'll run into a situation where we'll be
> > holding the lock for too long tbh. It should be unlikely to keep
> > encountering offline memcgs for a long time.
> >
> > Nhat, do you think this could cause a problem in practice?
> 
> I don't remember prescribing anything to be honest :) I think I was
> just asking why can't we just drop the lock, then "continue;". This is
> mostly for simplicity's sake.
> 
> https://lore.kernel.org/linux-mm/CAKEwX=MwrRc43iM2050v5u-TPUK4Yn+a4G7+h6ieKhpQ7WtQ=A@mail.gmail.com/
> 
> But I think as Takero pointed out, it would still skip over the memcg
> that was (concurrently) updated to zswap_next_shrink by the memcg
> offline callback.

What's the issue with continuing to traverse until an online memcg is found?
Something like the following:


	spin_lock(&zswap_shrink_lock);
	do {
		zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
	} while (zswap_next_shrink && !mem_cgroup_online(zswap_next_shrink));

	if (!zswap_next_shrink)
		zswap_next_shrink = mem_cgroup_iter(NULL, NULL, NULL);
	....

Is the concern that there can be a lot of offlined memcgs which may
cause need-resched warnings?
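
If so, one hedged variant (purely a sketch on top of the snippet
above) would be to drop the lock whenever a reschedule is pending:

```c
	spin_lock(&zswap_shrink_lock);
	do {
		zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
		if (need_resched()) {
			/* bound the critical section across long offline runs */
			spin_unlock(&zswap_shrink_lock);
			cond_resched();
			/* the cleaner may move zswap_next_shrink while unlocked */
			spin_lock(&zswap_shrink_lock);
		}
	} while (zswap_next_shrink && !mem_cgroup_online(zswap_next_shrink));
```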
Takero Funaki June 14, 2024, 4:39 a.m. UTC | #15
On Fri, Jun 14, 2024 at 1:49 Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Thu, Jun 13, 2024 at 08:04:39AM GMT, Nhat Pham wrote:
> [...]
> > > > >
> > > > > Is the idea here to avoid moving the iterator to another offline memcg
> > > > > that zswap_memcg_offline_cleanup() was already called for, to avoid
> > > > > holding a ref on that memcg until the next run of zswap shrinking?
> > > > >
> > > > > If yes, I think it's probably worth doing. But why do we need to
> > > > > release and reacquire the lock in the loop above?
> > > >
> > > > Yes, the existing cleaner might leave the offline, already-cleaned memcg.
> > > >
> > > > Reacquiring the lock is to avoid looping inside the critical section.
> > > > In shrink_worker of the v0 patch, the loop was restarted on an offline
> > > > memcg without releasing the lock. Nhat pointed out that we should drop
> > > > the lock after every mem_cgroup_iter() call. v1 was changed to
> > > > reacquire once per iteration, like the cleaner code above.
> > >
> > > I am not sure how often we'll run into a situation where we'll be
> > > holding the lock for too long tbh. It should be unlikely to keep
> > > encountering offline memcgs for a long time.
> > >
> > > Nhat, do you think this could cause a problem in practice?
> >
> > I don't remember prescribing anything to be honest :) I think I was
> > just asking why can't we just drop the lock, then "continue;". This is
> > mostly for simplicity's sake.
> >
> > https://lore.kernel.org/linux-mm/CAKEwX=MwrRc43iM2050v5u-TPUK4Yn+a4G7+h6ieKhpQ7WtQ=A@mail.gmail.com/

I apologize for misinterpreting your comments. I will remove the release/reacquire.

> >
> > But I think as Takero pointed out, it would still skip over the memcg
> > that was (concurrently) updated to zswap_next_shrink by the memcg
> > offline callback.
>
> What's the issue with continuing to traverse until an online memcg is found?
> Something like the following:
>
>
>         spin_lock(&zswap_shrink_lock);
>         do {
>                 zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
>         } while (zswap_next_shrink && !mem_cgroup_online(zswap_next_shrink));
>
>         if (!zswap_next_shrink)
>                 zswap_next_shrink = mem_cgroup_iter(NULL, NULL, NULL);
>         ....
>
> Is the concern that there can be a lot of offlined memcgs which may
> cause need-resched warnings?

To avoid using the goto-based loop, here's the new version, including
Shakeel's suggestion:
```c
    do {
        spin_lock(&zswap_shrink_lock);
        /*
         * Advance the cursor to start shrinking from the next memcg
         * after zswap_next_shrink.  One memcg might be skipped from
         * shrinking if the cleaner also advanced the cursor, but it
         * will happen at most once per offlining memcg.
         */
        do {
            zswap_next_shrink = mem_cgroup_iter(NULL,
                    zswap_next_shrink, NULL);
            memcg = zswap_next_shrink;
        } while (memcg && !mem_cgroup_tryget_online(memcg));

        if (!memcg) {
            spin_unlock(&zswap_shrink_lock);
```
We can add or remove `spin_unlock(); spin_lock();` just after
mem_cgroup_iter(), if needed.
I believe the behavior is identical to v1 except for the starting
point of iteration.

Regarding Nhat's comment 2, "No skipping over zswap_next_shrink updated
by the memcg offline cleaner":
While this was true for v1, I have moved to accepting this skipping as
it's negligibly rare. As Yosry commented, v1 retried the last memcg
from the last shrink_worker() run.

There are several options for where shrink_worker should start:
1. Starting from the next memcg after zswap_next_shrink: It might skip
one memcg, but this is quite rare. It is the current behavior before
the patch.

2. Starting from zswap_next_shrink: It might shrink one more page from
the memcg in addition to the one by the last shrink_worker() run. This
should also be rare, but probably more frequent than option 1. This is
the v0 patch behavior.

3. Addressing both: Save the last memcg as well. The worker checks if
it has been modified by the cleaner and advances only if it hasn't.
Like this:
```c
        do {
            if (zswap_last_shrink == zswap_next_shrink) {
                zswap_next_shrink = mem_cgroup_iter(NULL,
                        zswap_next_shrink, NULL);
            }
            memcg = zswap_next_shrink;
        } while (memcg && !mem_cgroup_tryget_online(memcg));
        zswap_last_shrink = memcg;
```

Which one would be better? Or any other ideas?

Patch

diff --git a/mm/zswap.c b/mm/zswap.c
index 80c634acb8d5..d720a42069b6 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -827,12 +827,27 @@  void zswap_folio_swapin(struct folio *folio)
 	}
 }
 
+/*
+ * This function should be called when a memcg is being offlined.
+ *
+ * Since the global shrinker shrink_worker() may hold a reference
+ * to the memcg, we must check and release the reference in
+ * zswap_next_shrink.
+ *
+ * shrink_worker() must handle the case where this function releases
+ * the reference to the memcg being shrunk.
+ */
 void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
 {
 	/* lock out zswap shrinker walking memcg tree */
 	spin_lock(&zswap_shrink_lock);
-	if (zswap_next_shrink == memcg)
-		zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
+
+	if (READ_ONCE(zswap_next_shrink) == memcg) {
+		/* put back reference and advance the cursor */
+		memcg = mem_cgroup_iter(NULL, memcg, NULL);
+		WRITE_ONCE(zswap_next_shrink, memcg);
+	}
+
 	spin_unlock(&zswap_shrink_lock);
 }
 
@@ -1401,25 +1416,44 @@  static int shrink_memcg(struct mem_cgroup *memcg)
 
 static void shrink_worker(struct work_struct *w)
 {
-	struct mem_cgroup *memcg;
+	struct mem_cgroup *memcg = NULL;
+	struct mem_cgroup *next_memcg;
 	int ret, failures = 0;
 	unsigned long thr;
 
 	/* Reclaim down to the accept threshold */
 	thr = zswap_accept_thr_pages();
 
-	/* global reclaim will select cgroup in a round-robin fashion. */
+	/* global reclaim will select cgroup in a round-robin fashion.
+	 *
+	 * We save the iteration cursor memcg into zswap_next_shrink,
+	 * which can be modified by the offline memcg cleaner
+	 * zswap_memcg_offline_cleanup().
+	 *
+	 * Since the offline cleaner is called only once, we cannot abandon
+	 * an offline memcg reference in zswap_next_shrink.
+	 * We can rely on the cleaner only if we get an online memcg under lock.
+	 * If we get an offline memcg, we cannot tell whether the cleaner will
+	 * be called later. We must put it before returning from this function.
+	 */
 	do {
+iternext:
 		spin_lock(&zswap_shrink_lock);
-		zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
-		memcg = zswap_next_shrink;
+		next_memcg = READ_ONCE(zswap_next_shrink);
+
+		if (memcg != next_memcg) {
+			/*
+			 * Ours was released by offlining.
+			 * Use the saved memcg reference.
+			 */
+			memcg = next_memcg;
+		} else {
+			/* advance cursor */
+			memcg = mem_cgroup_iter(NULL, memcg, NULL);
+			WRITE_ONCE(zswap_next_shrink, memcg);
+		}
 
 		/*
-		 * We need to retry if we have gone through a full round trip, or if we
-		 * got an offline memcg (or else we risk undoing the effect of the
-		 * zswap memcg offlining cleanup callback). This is not catastrophic
-		 * per se, but it will keep the now offlined memcg hostage for a while.
-		 *
 		 * Note that if we got an online memcg, we will keep the extra
 		 * reference in case the original reference obtained by mem_cgroup_iter
 		 * is dropped by the zswap memcg offlining callback, ensuring that the
@@ -1434,16 +1468,25 @@  static void shrink_worker(struct work_struct *w)
 		}
 
 		if (!mem_cgroup_tryget_online(memcg)) {
-			/* drop the reference from mem_cgroup_iter() */
-			mem_cgroup_iter_break(NULL, memcg);
-			zswap_next_shrink = NULL;
+			/*
+			 * It is an offline memcg which we cannot shrink
+			 * until its pages are reparented.
+			 *
+			 * Since we cannot determine if the offline cleaner has
+			 * already been called or not, the offline memcg must be
+			 * put back unconditionally. We cannot abort the loop while
+			 * zswap_next_shrink holds a reference to this offline memcg.
+			 */
 			spin_unlock(&zswap_shrink_lock);
-
-			if (++failures == MAX_RECLAIM_RETRIES)
-				break;
-
-			goto resched;
+			goto iternext;
 		}
+		/*
+		 * We got an extra memcg reference before unlocking.
+		 * The cleaner cannot free it using zswap_next_shrink.
+		 *
+	 * Our memcg can be offlined after we get the online memcg here.
+	 * In this case, the cleaner is waiting on the lock just behind us.
+		 */
 		spin_unlock(&zswap_shrink_lock);
 
 		ret = shrink_memcg(memcg);
@@ -1457,6 +1500,12 @@  static void shrink_worker(struct work_struct *w)
 resched:
 		cond_resched();
 	} while (zswap_total_pages() > thr);
+
+	/*
+	 * We can still hold the original memcg reference.
+	 * The reference is stored in zswap_next_shrink, and then reused
+	 * by the next shrink_worker().
+	 */
 }
 
 /*********************************