Message ID | 20240608155316.451600-2-flintglass@gmail.com (mailing list archive)
---|---
State | New
Series | mm: zswap: global shrinker fix and proactive shrink
On Sat, Jun 8, 2024 at 8:53 AM Takero Funaki <flintglass@gmail.com> wrote: > > This patch fixes an issue where the zswap global shrinker stopped > iterating through the memcg tree. > > The problem was that `shrink_worker()` would stop iterating when a memcg > was being offlined and restart from the tree root. Now, it properly > handles the offlining memcg and continues shrinking with the next memcg. > > This patch also modified handing of the lock for offlined memcg cleaner > to adapt the change in the iteration, and avoid negligibly rare skipping > of a memcg from shrink iteration. > > Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware") > Signed-off-by: Takero Funaki <flintglass@gmail.com> > --- > mm/zswap.c | 87 ++++++++++++++++++++++++++++++++++++++++++------------ > 1 file changed, 68 insertions(+), 19 deletions(-) > > diff --git a/mm/zswap.c b/mm/zswap.c > index 80c634acb8d5..d720a42069b6 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -827,12 +827,27 @@ void zswap_folio_swapin(struct folio *folio) > } > } > > +/* > + * This function should be called when a memcg is being offlined. > + * > + * Since the global shrinker shrink_worker() may hold a reference > + * of the memcg, we must check and release the reference in > + * zswap_next_shrink. > + * > + * shrink_worker() must handle the case where this function releases > + * the reference of memcg being shrunk. > + */ > void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) > { > /* lock out zswap shrinker walking memcg tree */ > spin_lock(&zswap_shrink_lock); > - if (zswap_next_shrink == memcg) > - zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL); > + > + if (READ_ONCE(zswap_next_shrink) == memcg) { > + /* put back reference and advance the cursor */ > + memcg = mem_cgroup_iter(NULL, memcg, NULL); > + WRITE_ONCE(zswap_next_shrink, memcg); > + } > + I am really finding it difficult to understand what the diff is trying to do. We are holding a lock that protects zswap_next_shrink. We always access it with the lock held. Why do we need all of this? Adding READ_ONCE() and WRITE_ONCE() where they are not needed is just confusing imo. > spin_unlock(&zswap_shrink_lock); > } > > @@ -1401,25 +1416,44 @@ static int shrink_memcg(struct mem_cgroup *memcg) > > static void shrink_worker(struct work_struct *w) > { > - struct mem_cgroup *memcg; > + struct mem_cgroup *memcg = NULL; > + struct mem_cgroup *next_memcg; > int ret, failures = 0; > unsigned long thr; > > /* Reclaim down to the accept threshold */ > thr = zswap_accept_thr_pages(); > > - /* global reclaim will select cgroup in a round-robin fashion. */ > + /* global reclaim will select cgroup in a round-robin fashion. > + * > + * We save iteration cursor memcg into zswap_next_shrink, > + * which can be modified by the offline memcg cleaner > + * zswap_memcg_offline_cleanup(). > + * > + * Since the offline cleaner is called only once, we cannot abandone > + * offline memcg reference in zswap_next_shrink. > + * We can rely on the cleaner only if we get online memcg under lock. > + * If we get offline memcg, we cannot determine the cleaner will be > + * called later. We must put it before returning from this function. > + */ > do { > +iternext: > spin_lock(&zswap_shrink_lock); > - zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL); > - memcg = zswap_next_shrink; > + next_memcg = READ_ONCE(zswap_next_shrink); > + > + if (memcg != next_memcg) { > + /* > + * Ours was released by offlining. > + * Use the saved memcg reference. 
> + */ > + memcg = next_memcg; 'memcg' will always be NULL on the first iteration, so we will always start by shrinking 'zswap_next_shrink' for a second time before moving the iterator. > + } else { > + /* advance cursor */ > + memcg = mem_cgroup_iter(NULL, memcg, NULL); > + WRITE_ONCE(zswap_next_shrink, memcg); Again, I don't see what this is achieving. The first iteration will always set 'memcg' to 'zswap_next_shrink', and then we will always move the iterator forward. The only difference I see is that we shrink 'zswap_next_shrink' twice in a row now (last 'memcg' in prev call, and first 'memcg' in this call). > + } > > /* > - * We need to retry if we have gone through a full round trip, or if we > - * got an offline memcg (or else we risk undoing the effect of the > - * zswap memcg offlining cleanup callback). This is not catastrophic > - * per se, but it will keep the now offlined memcg hostage for a while. > - * > * Note that if we got an online memcg, we will keep the extra > * reference in case the original reference obtained by mem_cgroup_iter > * is dropped by the zswap memcg offlining callback, ensuring that the > @@ -1434,16 +1468,25 @@ static void shrink_worker(struct work_struct *w) > } > > if (!mem_cgroup_tryget_online(memcg)) { > - /* drop the reference from mem_cgroup_iter() */ > - mem_cgroup_iter_break(NULL, memcg); > - zswap_next_shrink = NULL; > + /* > + * It is an offline memcg which we cannot shrink > + * until its pages are reparented. > + * > + * Since we cannot determine if the offline cleaner has > + * been already called or not, the offline memcg must be > + * put back unconditonally. We cannot abort the loop while > + * zswap_next_shrink has a reference of this offline memcg. > + */ You actually deleted the code that actually puts the ref to the offline memcg above. Why don't you just replace mem_cgroup_iter_break(NULL, memcg) with mem_cgroup_iter(NULL, memcg, NULL) here? I don't understand what the patch is trying to do to be honest. This patch is a lot more confusing than it should be. Also, I would like Nhat to weigh in here. Perhaps the decision to reset the iterator instead of advancing it in this case was made for a reason that we should honor. Maybe cgroups are usually offlined together so we will keep running into offline cgroups here if we continue? I am not sure. > spin_unlock(&zswap_shrink_lock); > - > - if (++failures == MAX_RECLAIM_RETRIES) > - break; > - > - goto resched; > + goto iternext; > } > + /* > + * We got an extra memcg reference before unlocking. > + * The cleaner cannot free it using zswap_next_shrink. > + * > + * Our memcg can be offlined after we get online memcg here. > + * In this case, the cleaner is waiting the lock just behind us. > + */ > spin_unlock(&zswap_shrink_lock); > > ret = shrink_memcg(memcg); > @@ -1457,6 +1500,12 @@ static void shrink_worker(struct work_struct *w) > resched: > cond_resched(); > } while (zswap_total_pages() > thr); > + > + /* > + * We can still hold the original memcg reference. > + * The reference is stored in zswap_next_shrink, and then reused > + * by the next shrink_worker(). > + */ > } > > /********************************* > -- > 2.43.0 >
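Spelled out, the substitution Yosry is suggesting for the offline branch would be roughly the following (a sketch against the unpatched shrink_worker(), not tested; whether the `break` path can park a reference to an offline memcg in zswap_next_shrink is exactly what the reply below gets into):

```c
	if (!mem_cgroup_tryget_online(memcg)) {
		/*
		 * Advance instead of resetting the cursor to NULL:
		 * mem_cgroup_iter() drops the reference on the offline
		 * memcg and returns the next one with a reference held.
		 */
		zswap_next_shrink = mem_cgroup_iter(NULL, memcg, NULL);
		spin_unlock(&zswap_shrink_lock);

		if (++failures == MAX_RECLAIM_RETRIES)
			break;

		goto resched;
	}
```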
On 2024/06/11 4:16, Yosry Ahmed wrote: > > I am really finding it difficult to understand what the diff is trying > to do. We are holding a lock that protects zswap_next_shrink. We > always access it with the lock held. Why do we need all of this? > > Adding READ_ONCE() and WRITE_ONCE() where they are not needed is just > confusing imo. I initially thought that reading new values from external variables inside a loop required protection from compiler optimization. I will remove the access macros in v2. > 'memcg' will always be NULL on the first iteration, so we will always > start by shrinking 'zswap_next_shrink' for a second time before moving > the iterator. > >> + } else { >> + /* advance cursor */ >> + memcg = mem_cgroup_iter(NULL, memcg, NULL); >> + WRITE_ONCE(zswap_next_shrink, memcg); > Again, I don't see what this is achieving. The first iteration will > always set 'memcg' to 'zswap_next_shrink', and then we will always > move the iterator forward. The only difference I see is that we shrink > 'zswap_next_shrink' twice in a row now (last 'memcg' in prev call, and > first 'memcg' in this call). The reason for checking if `memcg != next_memcg` was to ensure that we do not skip memcg that might be modified by the cleaner. For example, say we get memcg A and save it. When the cleaner advances the cursor from A to B, we then advance from B to C, shrink C. We have to check that A in the zswap_next_shrink is untouched before advancing the cursor. If this approach is overly complicated and ignoring B is acceptable, the beginning of the loop can be simplified to: do { +iternext: spin_lock(&zswap_shrink_lock); zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL); memcg = zswap_next_shrink; >> @@ -1434,16 +1468,25 @@ static void shrink_worker(struct work_struct *w) >> } >> >> if (!mem_cgroup_tryget_online(memcg)) { >> - /* drop the reference from mem_cgroup_iter() */ >> - mem_cgroup_iter_break(NULL, memcg); >> - zswap_next_shrink = NULL; >> + /* >> + * It is an offline memcg which we cannot shrink >> + * until its pages are reparented. >> + * >> + * Since we cannot determine if the offline cleaner has >> + * been already called or not, the offline memcg must be >> + * put back unconditonally. We cannot abort the loop while >> + * zswap_next_shrink has a reference of this offline memcg. >> + */ > You actually deleted the code that actually puts the ref to the > offline memcg above. > > Why don't you just replace mem_cgroup_iter_break(NULL, memcg) with > mem_cgroup_iter(NULL, memcg, NULL) here? I don't understand what the > patch is trying to do to be honest. This patch is a lot more confusing > than it should be. >> spin_unlock(&zswap_shrink_lock); >> - >> - if (++failures == MAX_RECLAIM_RETRIES) >> - break; >> - >> - goto resched; >> + goto iternext; >> } Removing the `break` on max failures from the if-offline branch is required to not leave the reference of the next memcg. If we just replace the mem_cgroup_iter_break with `memcg = zswap_next_shrink = mem_cgroup_iter(NULL, memcg, NULL);` and break the loop on failure, the next memcg will be left in zswap_next_shrink. If zswap_next_shrink is also offline, the reference will be held indefinitely. When we get offline memcg, we cannot determine if the cleaner has already been called or will be called later. We have to put back the offline memcg reference before returning from the worker function. This potential memcg leak is the reason why I think we cannot break the loop here. 
In this patch, the `goto iternext` ensures the offline memcg is released in the next iteration (or by the cleaner waiting for our unlock). > > Also, I would like Nhat to weigh in here. Perhaps the decision to > reset the iterator instead of advancing it in this case was made for a > reason that we should honor. Maybe cgroups are usually offlined > together so we will keep running into offline cgroups here if we > continue? I am not sure. From the comment I removed, >> - * We need to retry if we have gone through a full round trip, or if we >> - * got an offline memcg (or else we risk undoing the effect of the >> - * zswap memcg offlining cleanup callback). This is not catastrophic >> - * per se, but it will keep the now offlined memcg hostage for a while. I think this mentioned the potential memcg leak, which is now resolved by this patch modifying the offline memcg case.
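Formatted out, the simplified loop head Takero offers above as the "ignoring B is acceptable" alternative reads as below (the comment is our gloss on the skip he describes, not part of his snippet):

```c
	do {
iternext:
		spin_lock(&zswap_shrink_lock);
		/*
		 * Advance unconditionally. If the offline cleaner moved
		 * zswap_next_shrink from A to B while we were unlocked,
		 * this advances B to C and B is never shrunk -- the
		 * "ignoring B" case deemed acceptable above.
		 */
		zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
		memcg = zswap_next_shrink;
		...
```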
On Sat, Jun 8, 2024 at 8:53 AM Takero Funaki <flintglass@gmail.com> wrote: > > This patch fixes an issue where the zswap global shrinker stopped > iterating through the memcg tree. > > The problem was that `shrink_worker()` would stop iterating when a memcg > was being offlined and restart from the tree root. Now, it properly > handles the offlining memcg and continues shrinking with the next memcg. > > This patch also modified handing of the lock for offlined memcg cleaner > to adapt the change in the iteration, and avoid negligibly rare skipping > of a memcg from shrink iteration. > > Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware") > Signed-off-by: Takero Funaki <flintglass@gmail.com> > --- > mm/zswap.c | 87 ++++++++++++++++++++++++++++++++++++++++++------------ > 1 file changed, 68 insertions(+), 19 deletions(-) > > diff --git a/mm/zswap.c b/mm/zswap.c > index 80c634acb8d5..d720a42069b6 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -827,12 +827,27 @@ void zswap_folio_swapin(struct folio *folio) > } > } > > +/* > + * This function should be called when a memcg is being offlined. > + * > + * Since the global shrinker shrink_worker() may hold a reference > + * of the memcg, we must check and release the reference in > + * zswap_next_shrink. > + * > + * shrink_worker() must handle the case where this function releases > + * the reference of memcg being shrunk. > + */ > void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) > { > /* lock out zswap shrinker walking memcg tree */ > spin_lock(&zswap_shrink_lock); > - if (zswap_next_shrink == memcg) > - zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL); > + > + if (READ_ONCE(zswap_next_shrink) == memcg) { > + /* put back reference and advance the cursor */ > + memcg = mem_cgroup_iter(NULL, memcg, NULL); > + WRITE_ONCE(zswap_next_shrink, memcg); > + } > + > spin_unlock(&zswap_shrink_lock); > } As I have noted in v0, I think this is unnecessary and makes it more confusing. > > @@ -1401,25 +1416,44 @@ static int shrink_memcg(struct mem_cgroup *memcg) > > static void shrink_worker(struct work_struct *w) > { > - struct mem_cgroup *memcg; > + struct mem_cgroup *memcg = NULL; > + struct mem_cgroup *next_memcg; > int ret, failures = 0; > unsigned long thr; > > /* Reclaim down to the accept threshold */ > thr = zswap_accept_thr_pages(); > > - /* global reclaim will select cgroup in a round-robin fashion. */ > + /* global reclaim will select cgroup in a round-robin fashion. > + * > + * We save iteration cursor memcg into zswap_next_shrink, > + * which can be modified by the offline memcg cleaner > + * zswap_memcg_offline_cleanup(). > + * > + * Since the offline cleaner is called only once, we cannot abandone > + * offline memcg reference in zswap_next_shrink. > + * We can rely on the cleaner only if we get online memcg under lock. > + * If we get offline memcg, we cannot determine the cleaner will be > + * called later. We must put it before returning from this function. > + */ > do { > +iternext: > spin_lock(&zswap_shrink_lock); > - zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL); > - memcg = zswap_next_shrink; > + next_memcg = READ_ONCE(zswap_next_shrink); > + > + if (memcg != next_memcg) { > + /* > + * Ours was released by offlining. > + * Use the saved memcg reference. 
> + */ > + memcg = next_memcg; > + } else { > + /* advance cursor */ > + memcg = mem_cgroup_iter(NULL, memcg, NULL); > + WRITE_ONCE(zswap_next_shrink, memcg); > + } I suppose I'm fine with not advancing the memcg when it is already advanced by the memcg offlining callback. > > /* > - * We need to retry if we have gone through a full round trip, or if we > - * got an offline memcg (or else we risk undoing the effect of the > - * zswap memcg offlining cleanup callback). This is not catastrophic > - * per se, but it will keep the now offlined memcg hostage for a while. > - * > * Note that if we got an online memcg, we will keep the extra > * reference in case the original reference obtained by mem_cgroup_iter > * is dropped by the zswap memcg offlining callback, ensuring that the > @@ -1434,16 +1468,25 @@ static void shrink_worker(struct work_struct *w) > } > > if (!mem_cgroup_tryget_online(memcg)) { > - /* drop the reference from mem_cgroup_iter() */ > - mem_cgroup_iter_break(NULL, memcg); > - zswap_next_shrink = NULL; > + /* > + * It is an offline memcg which we cannot shrink > + * until its pages are reparented. > + * > + * Since we cannot determine if the offline cleaner has > + * been already called or not, the offline memcg must be > + * put back unconditonally. We cannot abort the loop while > + * zswap_next_shrink has a reference of this offline memcg. > + */ > spin_unlock(&zswap_shrink_lock); > - > - if (++failures == MAX_RECLAIM_RETRIES) > - break; > - > - goto resched; > + goto iternext; Hmmm yeah in the past, I set it to NULL to make sure we're not replacing zswap_next_shrink with an offlined memcg, after that zswap offlining callback for that memcg has been completed.. I suppose we can just call mem_cgroup_iter(...) on that offlined cgroup, but I'm not 100% sure what happens when we call this function on a cgroup that is currently being offlined, and has gone past the zswap offline callback stage. So I was just playing it safe and restart from the top of the tree :) I think this implementation has that behavior right? We see that the memcg is offlined, so we drop the lock and go to the beginning of the loop. We reacquire the lock, and might see that zswap_next_shrink == memcg, so we call mem_cgroup_iter(...) on it. Is this safe? Note that zswap_shrink_lock only orders serializes this memcg selection loop with memcg offlining after it - there's no guarantee what's the behavior is for memcg offlining before it (well other than one reference that we manage to acquire thanks to mem_cgroup_iter(...), so that memcg has not been freed, but not sure what we can guarantee regarding its place in the memcg hierarchy tree?). Johannes, do you have any idea what we can expect here? Let me also cc Shakeel. > } > + /* > + * We got an extra memcg reference before unlocking. > + * The cleaner cannot free it using zswap_next_shrink. > + * > + * Our memcg can be offlined after we get online memcg here. > + * In this case, the cleaner is waiting the lock just behind us. > + */ > spin_unlock(&zswap_shrink_lock); > > ret = shrink_memcg(memcg); > @@ -1457,6 +1500,12 @@ static void shrink_worker(struct work_struct *w) > resched: > cond_resched(); > } while (zswap_total_pages() > thr); > + > + /* > + * We can still hold the original memcg reference. > + * The reference is stored in zswap_next_shrink, and then reused > + * by the next shrink_worker(). > + */ > } > > /********************************* > -- > 2.43.0 >
On Tue, Jun 11, 2024 at 11:26:12AM GMT, Nhat Pham wrote: [...] > > Hmmm yeah in the past, I set it to NULL to make sure we're not > replacing zswap_next_shrink with an offlined memcg, after that zswap > offlining callback for that memcg has been completed.. > > I suppose we can just call mem_cgroup_iter(...) on that offlined > cgroup, but I'm not 100% sure what happens when we call this function > on a cgroup that is currently being offlined, and has gone past the > zswap offline callback stage. So I was just playing it safe and > restart from the top of the tree :) > > I think this implementation has that behavior right? We see that the > memcg is offlined, so we drop the lock and go to the beginning of the > loop. We reacquire the lock, and might see that zswap_next_shrink == > memcg, so we call mem_cgroup_iter(...) on it. Is this safe? > > Note that zswap_shrink_lock only orders serializes this memcg > selection loop with memcg offlining after it - there's no guarantee > what's the behavior is for memcg offlining before it (well other than > one reference that we manage to acquire thanks to > mem_cgroup_iter(...), so that memcg has not been freed, but not sure > what we can guarantee regarding its place in the memcg hierarchy > tree?). > > Johannes, do you have any idea what we can expect here? Let me also cc Shakeel. > > > mem_cgroup_iter() does a pre-order traversal, so you can see a mixture of online and offlined memcgs during a traversal.
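For reference, the walk being described, with an explicit online filter, looks roughly like this (a sketch assuming the usual mem_cgroup_iter() contract, where each call returns the next memcg in pre-order with a css reference held and releases the reference on the previous position; the helper name is made up):

```c
static struct mem_cgroup *next_online_memcg(struct mem_cgroup *prev)
{
	struct mem_cgroup *memcg = prev;

	/*
	 * Pre-order walk over the whole hierarchy; offline memcgs can
	 * show up anywhere in the walk, so they have to be skipped
	 * explicitly. Returns NULL when the walk wraps around.
	 */
	do {
		memcg = mem_cgroup_iter(NULL, memcg, NULL);
	} while (memcg && !mem_cgroup_online(memcg));

	return memcg;
}
```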
On Wed, Jun 12, 2024 at 3:26 Nhat Pham <nphamcs@gmail.com> wrote: > > As I have noted in v0, I think this is unnecessary and makes it more confusing. > Does spin_lock() ensure that compiler optimizations do not remove memory access to an external variable? I think we need to use READ_ONCE/WRITE_ONCE for shared variable access even under a spinlock. For example, https://elixir.bootlin.com/linux/latest/source/mm/mmu_notifier.c#L234 isn't this a common use case of READ_ONCE?

```c
bool shared_flag = false;
spinlock_t flag_lock;

void somefunc(void)
{
	for (;;) {
		spin_lock(&flag_lock);
		/* check external updates */
		if (READ_ONCE(shared_flag))
			break;
		/* do something */
		spin_unlock(&flag_lock);
	}
	spin_unlock(&flag_lock);
}
```

Without READ_ONCE, the check can be extracted from the loop by optimization. In shrink_worker, zswap_next_shrink is the shared_flag, which can be updated by concurrent cleaner threads, so it must be re-read every time we reacquire the lock. Am I badly misunderstanding something? > > do { > > +iternext: > > spin_lock(&zswap_shrink_lock); > > - zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL); > > - memcg = zswap_next_shrink; > > + next_memcg = READ_ONCE(zswap_next_shrink); > > + > > + if (memcg != next_memcg) { > > + /* > > + * Ours was released by offlining. > > + * Use the saved memcg reference. > > + */ > > + memcg = next_memcg; > > + } else { > > + /* advance cursor */ > > + memcg = mem_cgroup_iter(NULL, memcg, NULL); > > + WRITE_ONCE(zswap_next_shrink, memcg); > > + } > > I suppose I'm fine with not advancing the memcg when it is already > advanced by the memcg offlining callback. > For where to restart the shrinking, as Yosry pointed out, my version starts from the last memcg (i.e., retrying the failed memcg or evicting once more). I now realize that skipping the memcg next to an offlined memcg is less likely to happen. I am reverting it to restart from the next memcg after zswap_next_shrink. Which one could be better? > > > > /* > > - * We need to retry if we have gone through a full round trip, or if we > > - * got an offline memcg (or else we risk undoing the effect of the > > - * zswap memcg offlining cleanup callback). This is not catastrophic > > - * per se, but it will keep the now offlined memcg hostage for a while. > > - * > > * Note that if we got an online memcg, we will keep the extra > > * reference in case the original reference obtained by mem_cgroup_iter > > * is dropped by the zswap memcg offlining callback, ensuring that the > > @@ -1434,16 +1468,25 @@ static void shrink_worker(struct work_struct *w) > > } > > > > if (!mem_cgroup_tryget_online(memcg)) { > > - /* drop the reference from mem_cgroup_iter() */ > > - mem_cgroup_iter_break(NULL, memcg); > > - zswap_next_shrink = NULL; > > + /* > > + * It is an offline memcg which we cannot shrink > > + * until its pages are reparented. > > + * > > + * Since we cannot determine if the offline cleaner has > > + * been already called or not, the offline memcg must be > > + * put back unconditonally. We cannot abort the loop while > > + * zswap_next_shrink has a reference of this offline memcg. > > + */ > > spin_unlock(&zswap_shrink_lock); > > - > > - if (++failures == MAX_RECLAIM_RETRIES) > > - break; > > - > > - goto resched; > > + goto iternext; > > Hmmm yeah in the past, I set it to NULL to make sure we're not > replacing zswap_next_shrink with an offlined memcg, after that zswap > offlining callback for that memcg has been completed.. > > I suppose we can just call mem_cgroup_iter(...)
on that offlined > cgroup, but I'm not 100% sure what happens when we call this function > on a cgroup that is currently being offlined, and has gone past the > zswap offline callback stage. So I was just playing it safe and > restart from the top of the tree :) > > I think this implementation has that behavior right? We see that the > memcg is offlined, so we drop the lock and go to the beginning of the > loop. We reacquire the lock, and might see that zswap_next_shrink == > memcg, so we call mem_cgroup_iter(...) on it. Is this safe? > > Note that zswap_shrink_lock only orders serializes this memcg > selection loop with memcg offlining after it - there's no guarantee > what's the behavior is for memcg offlining before it (well other than > one reference that we manage to acquire thanks to > mem_cgroup_iter(...), so that memcg has not been freed, but not sure > what we can guarantee regarding its place in the memcg hierarchy > tree?). The locking mechanism in shrink_worker does not rely on what the next memcg is. The sorting stability of mem_cgroup_iter does not matter here. The expectation for the iterator is that it will walk through all live memcgs. I believe mem_cgroup_iter uses parent-to-leaf ordering of cgroups and ensures all live cgroups are walked at least once, regardless of their onlineness. https://elixir.bootlin.com/linux/v6.10-rc2/source/mm/memcontrol.c#L1368 Regarding the reference leak, I overlooked a scenario where a leak might occur in the existing cleaner, although it should be rare. When the cleaner is called on a memcg in zswap_next_shrink, the next memcg from mem_cgroup_iter() can be an offline, already-cleaned memcg, resulting in a reference leak of the next memcg from the cleaner. We should implement the same online check in the cleaner, like this:

```c
void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
{
	struct mem_cgroup *next;

	/* lock out zswap shrinker walking memcg tree */
	spin_lock(&zswap_shrink_lock);
	if (zswap_next_shrink == memcg) {
		next = zswap_next_shrink;
		do {
			next = mem_cgroup_iter(NULL, next, NULL);
			WRITE_ONCE(zswap_next_shrink, next);
			spin_unlock(&zswap_shrink_lock);
			/* zswap_next_shrink might be updated here */
			spin_lock(&zswap_shrink_lock);
			next = READ_ONCE(zswap_next_shrink);
			if (!next)
				break;
		} while (!mem_cgroup_online(next));
		/*
		 * We verified the next memcg is online under lock.
		 * Even if the next memcg is being offlined here, another
		 * cleaner for the next memcg is waiting for our unlock just
		 * behind us. We can leave the next memcg reference.
		 */
	}
	spin_unlock(&zswap_shrink_lock);
}
```

As in shrink_worker, we must check that the next memcg is online under the lock before leaving the ref in zswap_next_shrink. Otherwise, zswap_next_shrink might hold the ref of an offlined, already-cleaned memcg. Or if you are concerned about temporarily storing an unchecked or offlined memcg in zswap_next_shrink, it is safe because:

1. If there is no other cleaner running for zswap_next_shrink, the ref saved in zswap_next_shrink ensures liveness of the memcg when reacquired.

2. Another cleaner thread may put back and replace zswap_next_shrink with its next. We will check onlineness of the new zswap_next_shrink under the reacquired lock.

3. Even if the verified-online memcg is being offlined concurrently, another cleaner thread must wait for our unlock. We can leave the online memcg and rely on its respective cleaner.
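The reference-counting contract all of this leans on, written out as code (our summary of mem_cgroup_iter()'s behavior in mm/memcontrol.c; the helper name is made up):

```c
/* Advance the zswap cursor; caller must hold zswap_shrink_lock. */
static struct mem_cgroup *zswap_advance_cursor(void)
{
	lockdep_assert_held(&zswap_shrink_lock);

	/*
	 * mem_cgroup_iter() returns the next memcg with a css reference
	 * held (or NULL at the end of a round trip) and drops the
	 * reference that was pinning the previous cursor value, so
	 * zswap_next_shrink always holds exactly one reference. An
	 * offline memcg left here after its cleanup callback has run
	 * stays pinned until the cursor moves again -- the leak the
	 * thread keeps circling back to.
	 */
	zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
	return zswap_next_shrink;
}
```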
On Wed, Jun 12, 2024 at 11:16 AM Takero Funaki <flintglass@gmail.com> wrote: > > 2024年6月12日(水) 3:26 Nhat Pham <nphamcs@gmail.com>: > > > > > As I have noted in v0, I think this is unnecessary and makes it more confusing. > > > > Does spin_lock() ensure that compiler optimizations do not remove > memory access to an external variable? I think we need to use > READ_ONCE/WRITE_ONCE for shared variable access even under a spinlock. > For example, > https://elixir.bootlin.com/linux/latest/source/mm/mmu_notifier.c#L234 In this example, it seems like mmu_interval_set_seq() updates interval_sub->invalidate_seq locklessly using WRITE_ONCE(). I think this is why READ_ONCE() is required in that particular case. > > isn't this a common use case of READ_ONCE? > ```c > bool shared_flag = false; > spinlock_t flag_lock; > > void somefunc(void) { > for (;;) { > spin_lock(&flag_lock); > /* check external updates */ > if (READ_ONCE(shared_flag)) > break; > /* do something */ > spin_unlock(&flag_lock); > } > spin_unlock(&flag_lock); > } > ``` > Without READ_ONCE, the check can be extracted from the loop by optimization. According to Documentation/memory-barriers.txt, lock acquiring functions are implicit memory barriers. Otherwise, the compiler would be able to pull any memory access outside of the lock critical section and locking wouldn't be reliable.
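A minimal illustration of the rule being cited here, assuming both sides take the same lock (the _ONCE accessors only become necessary once one side goes lockless, as in the mmu_notifier example):

```c
static bool shared_flag;
static DEFINE_SPINLOCK(flag_lock);

void waiter(void)
{
	for (;;) {
		spin_lock(&flag_lock);
		/*
		 * A plain load is fine here: spin_lock() is an ACQUIRE
		 * and spin_unlock() a RELEASE, so the compiler cannot
		 * hoist this load out of the critical section or cache
		 * it across iterations.
		 */
		if (shared_flag) {
			spin_unlock(&flag_lock);
			return;
		}
		spin_unlock(&flag_lock);
		cpu_relax();
	}
}

void setter(void)
{
	spin_lock(&flag_lock);
	shared_flag = true;	/* plain store: same lock held */
	spin_unlock(&flag_lock);
}
```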
On Thu, Jun 13, 2024 at 3:28 Yosry Ahmed <yosryahmed@google.com> wrote: > > On Wed, Jun 12, 2024 at 11:16 AM Takero Funaki <flintglass@gmail.com> wrote: > > > > On Wed, Jun 12, 2024 at 3:26 Nhat Pham <nphamcs@gmail.com> wrote: > > > > > > > > As I have noted in v0, I think this is unnecessary and makes it more confusing. > > > > > > > Does spin_lock() ensure that compiler optimizations do not remove > > memory access to an external variable? I think we need to use > > READ_ONCE/WRITE_ONCE for shared variable access even under a spinlock. > > For example, > > https://elixir.bootlin.com/linux/latest/source/mm/mmu_notifier.c#L234 > > In this example, it seems like mmu_interval_set_seq() updates > interval_sub->invalidate_seq locklessly using WRITE_ONCE(). I think > this is why READ_ONCE() is required in that particular case. > > > > > isn't this a common use case of READ_ONCE? > > ```c > > bool shared_flag = false; > > spinlock_t flag_lock; > > > > void somefunc(void) { > > for (;;) { > > spin_lock(&flag_lock); > > /* check external updates */ > > if (READ_ONCE(shared_flag)) > > break; > > /* do something */ > > spin_unlock(&flag_lock); > > } > > spin_unlock(&flag_lock); > > } > > ``` > > Without READ_ONCE, the check can be extracted from the loop by optimization. > > According to Documentation/memory-barriers.txt, lock acquiring > functions are implicit memory barriers. Otherwise, the compiler would > be able to pull any memory access outside of the lock critical section > and locking wouldn't be reliable. Ah, I understand now. The implicit barrier is sufficient as long as all memory access occurs within the lock. It's a fundamental rule of locking—facepalm. I misread some module code, like the one in the link, where a lockless write could invade a critical section. My example was in the opposite direction, just wrong. Thank you so much for clarifying my misunderstanding. Checking the patch now, I suppose the locking mechanism itself is not affected by my misunderstanding of READ_ONCE. The corrected version of the cleaner should be:

```c
void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
{
	/* lock out zswap shrinker walking memcg tree */
	spin_lock(&zswap_shrink_lock);
	if (zswap_next_shrink == memcg) {
		do {
			zswap_next_shrink = mem_cgroup_iter(NULL,
					zswap_next_shrink, NULL);
			spin_unlock(&zswap_shrink_lock);
			spin_lock(&zswap_shrink_lock);
			if (!zswap_next_shrink)
				break;
		} while (!mem_cgroup_online(zswap_next_shrink));
	}
	spin_unlock(&zswap_shrink_lock);
}
```

Should we have a separate patch to fix the leak scenario?
On Wed, Jun 12, 2024 at 7:13 PM Takero Funaki <flintglass@gmail.com> wrote: > > 2024年6月13日(木) 3:28 Yosry Ahmed <yosryahmed@google.com>: > > > > On Wed, Jun 12, 2024 at 11:16 AM Takero Funaki <flintglass@gmail.com> wrote: > > > > > > 2024年6月12日(水) 3:26 Nhat Pham <nphamcs@gmail.com>: > > > > > > > > > > > As I have noted in v0, I think this is unnecessary and makes it more confusing. > > > > > > > > > > Does spin_lock() ensure that compiler optimizations do not remove > > > memory access to an external variable? I think we need to use > > > READ_ONCE/WRITE_ONCE for shared variable access even under a spinlock. > > > For example, > > > https://elixir.bootlin.com/linux/latest/source/mm/mmu_notifier.c#L234 > > > > In this example, it seems like mmu_interval_set_seq() updates > > interval_sub->invalidate_seq locklessly using WRITE_ONCE(). I think > > this is why READ_ONCE() is required in that particular case. > > > > > > > > isn't this a common use case of READ_ONCE? > > > ```c > > > bool shared_flag = false; > > > spinlock_t flag_lock; > > > > > > void somefunc(void) { > > > for (;;) { > > > spin_lock(&flag_lock); > > > /* check external updates */ > > > if (READ_ONCE(shared_flag)) > > > break; > > > /* do something */ > > > spin_unlock(&flag_lock); > > > } > > > spin_unlock(&flag_lock); > > > } > > > ``` > > > Without READ_ONCE, the check can be extracted from the loop by optimization. > > > > According to Documentation/memory-barriers.txt, lock acquiring > > functions are implicit memory barriers. Otherwise, the compiler would > > be able to pull any memory access outside of the lock critical section > > and locking wouldn't be reliable. > > Ah, I understand now. The implicit barrier is sufficient as long as > all memory access occurs within the lock. It’s a fundamental rule of > locking—facepalm. > > I misread a module code, like in the link, where a lockless write > could invade a critical section. My example was in the opposite > direction, just wrong. Thank you so much for clarifying my > misunderstanding. > > For now checking the patch, I suppose the locking mechanism itself is > not affected by my misunderstanding of READ_ONCE. > > The corrected version of the cleaner should be: > ```c > void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) > { > /* lock out zswap shrinker walking memcg tree */ > spin_lock(&zswap_shrink_lock); > if (zswap_next_shrink == memcg) { > do { > zswap_next_shrink = mem_cgroup_iter(NULL, > zswap_next_shrink, NULL); > spin_unlock(&zswap_shrink_lock); > spin_lock(&zswap_shrink_lock); > if (!zswap_next_shrink) > break; > } while (!mem_cgroup_online(zswap_next_shrink)); > } > spin_unlock(&zswap_shrink_lock); > } > ``` Is the idea here to avoid moving the iterator to another offline memcg that zswap_memcg_offline_cleanup() was already called for, to avoid holding a ref on that memcg until the next run of zswap shrinking? If yes, I think it's probably worth doing. But why do we need to release and reacquire the lock in the loop above?
On Thu, Jun 13, 2024 at 11:18 Yosry Ahmed <yosryahmed@google.com> wrote: > > The corrected version of the cleaner should be: > > ```c > > void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) > > { > > /* lock out zswap shrinker walking memcg tree */ > > spin_lock(&zswap_shrink_lock); > > if (zswap_next_shrink == memcg) { > > do { > > zswap_next_shrink = mem_cgroup_iter(NULL, > > zswap_next_shrink, NULL); > > spin_unlock(&zswap_shrink_lock); > > spin_lock(&zswap_shrink_lock); > > if (!zswap_next_shrink) > > break; > > } while (!mem_cgroup_online(zswap_next_shrink)); > > } > > spin_unlock(&zswap_shrink_lock); > > } > > ``` > > Is the idea here to avoid moving the iterator to another offline memcg > that zswap_memcg_offline_cleanup() was already called for, to avoid > holding a ref on that memcg until the next run of zswap shrinking? > > If yes, I think it's probably worth doing. But why do we need to > release and reacquire the lock in the loop above? Yes, the existing cleaner might leave the offline, already-cleaned memcg. The lock is released and reacquired to avoid looping inside the critical section. In shrink_worker of the v0 patch, the loop was restarted on an offline memcg without releasing the lock. Nhat pointed out that we should drop the lock after every mem_cgroup_iter() call. v1 was changed to reacquire once per iteration, like the cleaner code above.
On Wed, Jun 12, 2024 at 7:36 PM Takero Funaki <flintglass@gmail.com> wrote: > > 2024年6月13日(木) 11:18 Yosry Ahmed <yosryahmed@google.com>: > > > > The corrected version of the cleaner should be: > > > ```c > > > void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) > > > { > > > /* lock out zswap shrinker walking memcg tree */ > > > spin_lock(&zswap_shrink_lock); > > > if (zswap_next_shrink == memcg) { > > > do { > > > zswap_next_shrink = mem_cgroup_iter(NULL, > > > zswap_next_shrink, NULL); > > > spin_unlock(&zswap_shrink_lock); > > > spin_lock(&zswap_shrink_lock); > > > if (!zswap_next_shrink) > > > break; > > > } while (!mem_cgroup_online(zswap_next_shrink)); > > > } > > > spin_unlock(&zswap_shrink_lock); > > > } > > > ``` > > > > Is the idea here to avoid moving the iterator to another offline memcg > > that zswap_memcg_offline_cleanup() was already called for, to avoid > > holding a ref on that memcg until the next run of zswap shrinking? > > > > If yes, I think it's probably worth doing. But why do we need to > > release and reacquire the lock in the loop above? > > Yes, the existing cleaner might leave the offline, already-cleaned memcg. > > The reacquiring lock is to not loop inside the critical section. > In shrink_worker of v0 patch, the loop was restarted on offline memcg > without releasing the lock. Nhat pointed out that we should drop the > lock after every mem_cgroup_iter() call. v1 was changed to reacquire > once per iteration like the cleaner code above. I am not sure how often we'll run into a situation where we'll be holding the lock for too long tbh. It should be unlikely to keep encountering offline memcgs for a long time. Nhat, do you think this could cause a problem in practice?
On Wed, Jun 12, 2024 at 7:58 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Wed, Jun 12, 2024 at 7:36 PM Takero Funaki <flintglass@gmail.com> wrote: > > > > 2024年6月13日(木) 11:18 Yosry Ahmed <yosryahmed@google.com>: > > > > > > The corrected version of the cleaner should be: > > > > ```c > > > > void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) > > > > { > > > > /* lock out zswap shrinker walking memcg tree */ > > > > spin_lock(&zswap_shrink_lock); > > > > if (zswap_next_shrink == memcg) { > > > > do { > > > > zswap_next_shrink = mem_cgroup_iter(NULL, > > > > zswap_next_shrink, NULL); > > > > spin_unlock(&zswap_shrink_lock); > > > > spin_lock(&zswap_shrink_lock); > > > > if (!zswap_next_shrink) > > > > break; > > > > } while (!mem_cgroup_online(zswap_next_shrink)); > > > > } > > > > spin_unlock(&zswap_shrink_lock); > > > > } > > > > ``` > > > > > > Is the idea here to avoid moving the iterator to another offline memcg > > > that zswap_memcg_offline_cleanup() was already called for, to avoid > > > holding a ref on that memcg until the next run of zswap shrinking? > > > > > > If yes, I think it's probably worth doing. But why do we need to > > > release and reacquire the lock in the loop above? > > > > Yes, the existing cleaner might leave the offline, already-cleaned memcg. > > > > The reacquiring lock is to not loop inside the critical section. > > In shrink_worker of v0 patch, the loop was restarted on offline memcg > > without releasing the lock. Nhat pointed out that we should drop the > > lock after every mem_cgroup_iter() call. v1 was changed to reacquire > > once per iteration like the cleaner code above. > > I am not sure how often we'll run into a situation where we'll be > holding the lock for too long tbh. It should be unlikely to keep > encountering offline memcgs for a long time. > > Nhat, do you think this could cause a problem in practice? I don't remember prescribing anything to be honest :) I think I was just asking why can't we just drop the lock, then "continue;". This is mostly for simplicity's sake. https://lore.kernel.org/linux-mm/CAKEwX=MwrRc43iM2050v5u-TPUK4Yn+a4G7+h6ieKhpQ7WtQ=A@mail.gmail.com/ But I think as Takero pointed out, it would still skip over the memcg that was (concurrently) updated to zswap_next_shrink by the memcg offline callback.
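Reconstructed from that description (not from the linked message itself), the "drop the lock, then continue" shape and the skip it allows would be something like:

```c
		if (!mem_cgroup_tryget_online(memcg)) {
			/*
			 * Leave the offline memcg in zswap_next_shrink;
			 * the mem_cgroup_iter() call at the top of the
			 * loop drops its reference when the cursor moves.
			 */
			spin_unlock(&zswap_shrink_lock);
			/*
			 * Window: the offline cleaner can advance the
			 * cursor to some memcg B right here; the next
			 * iteration then advances B to C, skipping B.
			 */
			continue;
		}
```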
On Sat, Jun 8, 2024 at 8:53 AM Takero Funaki <flintglass@gmail.com> wrote: > > This patch fixes an issue where the zswap global shrinker stopped > iterating through the memcg tree. > > The problem was that `shrink_worker()` would stop iterating when a memcg > was being offlined and restart from the tree root. Now, it properly > handles the offlining memcg and continues shrinking with the next memcg. > > This patch also modified handing of the lock for offlined memcg cleaner > to adapt the change in the iteration, and avoid negligibly rare skipping > of a memcg from shrink iteration. > Honestly, I think this version is even more complicated than v0 :) Hmmm how about this:

do {
	spin_lock(&zswap_pools_lock);
	do {
		/* no offline caller has been called to memcg yet */
		if (memcg == zswap_next_shrink) {
			zswap_next_shrink = mem_cgroup_iter(NULL,
					zswap_next_shrink, NULL);
			memcg = zswap_next_shrink;
			if (!memcg && ++failure > MAX_FAILURE) {
				spin_unlock(&zswap_pools_lock);
				return;
			}
		}
	} while (!zswap_next_shrink ||
		 !mem_cgroup_tryget_online(zswap_next_shrink));
	spin_unlock(&zswap_pools_lock);

	/* proceed with reclaim */
} while (...)

This should achieve all the goals right?

1. No restarting from the top, unless we have traversed the entire hierarchy.

2. No skipping over zswap_next_shrink updated by the memcg offline cleaner.

3. No leaving offlined zswap_next_shrink hanging (and blocking memcg killing indefinitely).

4. No long running loop unnecessarily. If you want to be extra safe, we can put a max number of retries on the offline memcg case too (and restart from the top).

5. At the end of the inner loop, you are guaranteed to hold at least one reference to memcg (and perhaps 2, if zswap_next_shrink is not updated by the cleaner).

6. At the end of the inner loop, the selected memcg is known to not have its cleaner started yet (since it is online with zswap_pools_lock held). Our selection will undo the cleaner and hold the memcg hostage forever.

Is there anything that I'm missing? We are not dropping the lock for the entirety of the inner loop, but honestly I'm fine with this trade off, for the sake of simplicity :) If you want it to be even more readable, feel free to refactor the inner loop into a helper function (but it's probably redundant since it's only used once).
On Thu, Jun 13, 2024 at 9:08 AM Nhat Pham <nphamcs@gmail.com> wrote: > > 5. At the end of the inner loop, the selected memcg is known to not > have its cleaner started yet (since it is online with zswap_pools_lock > held). Our selection will undo the cleaner and hold the memcg hostage > forever. /s/will undo/will not undo My apologies - English is hard :)
On Thu, Jun 13, 2024 at 08:04:39AM GMT, Nhat Pham wrote: [...] > > > > > > > > Is the idea here to avoid moving the iterator to another offline memcg > > > > that zswap_memcg_offline_cleanup() was already called for, to avoid > > > > holding a ref on that memcg until the next run of zswap shrinking? > > > > > > > > If yes, I think it's probably worth doing. But why do we need to > > > > release and reacquire the lock in the loop above? > > > > > > Yes, the existing cleaner might leave the offline, already-cleaned memcg. > > > > > > The lock is released and reacquired to avoid looping inside the critical section. > > > In shrink_worker of the v0 patch, the loop was restarted on an offline memcg > > > without releasing the lock. Nhat pointed out that we should drop the > > > lock after every mem_cgroup_iter() call. v1 was changed to reacquire > > > once per iteration, like the cleaner code above. > > > > I am not sure how often we'll run into a situation where we'll be > > holding the lock for too long tbh. It should be unlikely to keep > > encountering offline memcgs for a long time. > > > > Nhat, do you think this could cause a problem in practice? > > I don't remember prescribing anything to be honest :) I think I was > just asking why can't we just drop the lock, then "continue;". This is > mostly for simplicity's sake. > > https://lore.kernel.org/linux-mm/CAKEwX=MwrRc43iM2050v5u-TPUK4Yn+a4G7+h6ieKhpQ7WtQ=A@mail.gmail.com/ > > But I think as Takero pointed out, it would still skip over the memcg > that was (concurrently) updated to zswap_next_shrink by the memcg > offline callback. What's the issue with continuing to traverse until an online memcg is found? Something like the following:

spin_lock(&zswap_shrink_lock);
do {
	zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
} while (zswap_next_shrink && !mem_cgroup_online(zswap_next_shrink));

if (!zswap_next_shrink)
	zswap_next_shrink = mem_cgroup_iter(NULL, NULL, NULL);
....

Is the concern that there can be a lot of offlined memcgs, which may cause need_resched warnings?
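If lock hold time on huge hierarchies is the worry, the walk can also drop the lock and reschedule periodically; a sketch of the same loop with that added (the need_resched() batching is our illustration, and note that dropping the lock mid-walk reopens the race with the cleaner that the rest of the thread is wrestling with):

```c
	spin_lock(&zswap_shrink_lock);
	do {
		zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
		if (need_resched()) {
			/* bound the critical section */
			spin_unlock(&zswap_shrink_lock);
			cond_resched();
			spin_lock(&zswap_shrink_lock);
		}
	} while (zswap_next_shrink && !mem_cgroup_online(zswap_next_shrink));
	spin_unlock(&zswap_shrink_lock);
```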
On Fri, Jun 14, 2024 at 1:49 Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Thu, Jun 13, 2024 at 08:04:39AM GMT, Nhat Pham wrote: > [...] > > > > > > > > > > Is the idea here to avoid moving the iterator to another offline memcg > > > > > that zswap_memcg_offline_cleanup() was already called for, to avoid > > > > > holding a ref on that memcg until the next run of zswap shrinking? > > > > > > > > > > If yes, I think it's probably worth doing. But why do we need to > > > > > release and reacquire the lock in the loop above? > > > > > > > > Yes, the existing cleaner might leave the offline, already-cleaned memcg. > > > > > > > > The lock is released and reacquired to avoid looping inside the critical section. > > > > In shrink_worker of the v0 patch, the loop was restarted on an offline memcg > > > > without releasing the lock. Nhat pointed out that we should drop the > > > > lock after every mem_cgroup_iter() call. v1 was changed to reacquire > > > > once per iteration, like the cleaner code above. > > > > > > I am not sure how often we'll run into a situation where we'll be > > > holding the lock for too long tbh. It should be unlikely to keep > > > encountering offline memcgs for a long time. > > > > > > Nhat, do you think this could cause a problem in practice? > > > > I don't remember prescribing anything to be honest :) I think I was > > just asking why can't we just drop the lock, then "continue;". This is > > mostly for simplicity's sake. > > > > https://lore.kernel.org/linux-mm/CAKEwX=MwrRc43iM2050v5u-TPUK4Yn+a4G7+h6ieKhpQ7WtQ=A@mail.gmail.com/ I apologize for misinterpreting your comments. Removing the release/reacquire. > > > > But I think as Takero pointed out, it would still skip over the memcg > > that was (concurrently) updated to zswap_next_shrink by the memcg > > offline callback. > > What's the issue with continuing to traverse until an online memcg is found? > Something like the following: > > > spin_lock(&zswap_shrink_lock); > do { > zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL); > } while (zswap_next_shrink && !mem_cgroup_online(zswap_next_shrink)); > > if (!zswap_next_shrink) > zswap_next_shrink = mem_cgroup_iter(NULL, NULL, NULL); > .... > > Is the concern that there can be a lot of offlined memcgs, which may cause > need_resched warnings? To avoid using the goto-based loop, here's the new version, including Shakeel's suggestion:

```c
do {
	spin_lock(&zswap_shrink_lock);
	/*
	 * Advance the cursor to start shrinking from the next memcg
	 * after zswap_next_shrink. One memcg might be skipped from
	 * shrinking if the cleaner also advanced the cursor, but it
	 * will happen at most once per offlining memcg.
	 */
	do {
		zswap_next_shrink = mem_cgroup_iter(NULL,
				zswap_next_shrink, NULL);
		memcg = zswap_next_shrink;
	} while (memcg && !mem_cgroup_tryget_online(memcg));

	if (!memcg) {
		spin_unlock(&zswap_shrink_lock);
```

We can add or remove `spin_unlock();spin_lock();` just after mem_cgroup_iter(), if needed. I believe the behavior is identical to v1 except for the starting point of the iteration. For Nhat's point "2. No skipping over zswap_next_shrink updated by the memcg offline cleaner": while this was true for v1, I am now inclined to accept this skipping as it's negligibly rare. As Yosry commented, v1 retried the last memcg from the last shrink_worker() run. There are several options for where shrink_worker could start:

1. Starting from the next memcg after zswap_next_shrink: It might skip one memcg, but this is quite rare. This is the current behavior before the patch.
2. Starting from zswap_next_shrink: It might shrink one more page from the memcg, in addition to the one shrunk by the last shrink_worker() run. This should also be rare, but probably more frequent than option 1. This is the v0 patch behavior.

3. Addressing both: Save the last memcg as well. The worker checks if it has been modified by the cleaner and advances only if it hasn't. Like this:

```c
do {
	if (zswap_last_shrink == zswap_next_shrink) {
		zswap_next_shrink = mem_cgroup_iter(NULL,
				zswap_next_shrink, NULL);
	}
	memcg = zswap_next_shrink;
} while (memcg && !mem_cgroup_tryget_online(memcg));
zswap_last_shrink = memcg;
```

Which one would be better? Or any other idea?
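Put together with the locking from the earlier sketches, option 3 would come out roughly as below (zswap_last_shrink is the hypothetical variable from the snippet above; untested):

```c
/* hypothetical: the memcg picked by the previous selection pass */
static struct mem_cgroup *zswap_last_shrink;

	spin_lock(&zswap_shrink_lock);
	do {
		/*
		 * Advance only if the cleaner has not moved the cursor
		 * since the last pick; otherwise adopt the cleaner's
		 * choice so no memcg is skipped.
		 */
		if (zswap_last_shrink == zswap_next_shrink)
			zswap_next_shrink = mem_cgroup_iter(NULL,
					zswap_next_shrink, NULL);
		memcg = zswap_next_shrink;
	} while (memcg && !mem_cgroup_tryget_online(memcg));
	zswap_last_shrink = memcg;
	spin_unlock(&zswap_shrink_lock);
```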
diff --git a/mm/zswap.c b/mm/zswap.c index 80c634acb8d5..d720a42069b6 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -827,12 +827,27 @@ void zswap_folio_swapin(struct folio *folio) } } +/* + * This function should be called when a memcg is being offlined. + * + * Since the global shrinker shrink_worker() may hold a reference + * of the memcg, we must check and release the reference in + * zswap_next_shrink. + * + * shrink_worker() must handle the case where this function releases + * the reference of memcg being shrunk. + */ void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) { /* lock out zswap shrinker walking memcg tree */ spin_lock(&zswap_shrink_lock); - if (zswap_next_shrink == memcg) - zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL); + + if (READ_ONCE(zswap_next_shrink) == memcg) { + /* put back reference and advance the cursor */ + memcg = mem_cgroup_iter(NULL, memcg, NULL); + WRITE_ONCE(zswap_next_shrink, memcg); + } + spin_unlock(&zswap_shrink_lock); } @@ -1401,25 +1416,44 @@ static int shrink_memcg(struct mem_cgroup *memcg) static void shrink_worker(struct work_struct *w) { - struct mem_cgroup *memcg; + struct mem_cgroup *memcg = NULL; + struct mem_cgroup *next_memcg; int ret, failures = 0; unsigned long thr; /* Reclaim down to the accept threshold */ thr = zswap_accept_thr_pages(); - /* global reclaim will select cgroup in a round-robin fashion. */ + /* global reclaim will select cgroup in a round-robin fashion. + * + * We save iteration cursor memcg into zswap_next_shrink, + * which can be modified by the offline memcg cleaner + * zswap_memcg_offline_cleanup(). + * + * Since the offline cleaner is called only once, we cannot abandone + * offline memcg reference in zswap_next_shrink. + * We can rely on the cleaner only if we get online memcg under lock. + * If we get offline memcg, we cannot determine the cleaner will be + * called later. We must put it before returning from this function. + */ do { +iternext: spin_lock(&zswap_shrink_lock); - zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL); - memcg = zswap_next_shrink; + next_memcg = READ_ONCE(zswap_next_shrink); + + if (memcg != next_memcg) { + /* + * Ours was released by offlining. + * Use the saved memcg reference. + */ + memcg = next_memcg; + } else { + /* advance cursor */ + memcg = mem_cgroup_iter(NULL, memcg, NULL); + WRITE_ONCE(zswap_next_shrink, memcg); + } /* - * We need to retry if we have gone through a full round trip, or if we - * got an offline memcg (or else we risk undoing the effect of the - * zswap memcg offlining cleanup callback). This is not catastrophic - * per se, but it will keep the now offlined memcg hostage for a while. - * * Note that if we got an online memcg, we will keep the extra * reference in case the original reference obtained by mem_cgroup_iter * is dropped by the zswap memcg offlining callback, ensuring that the @@ -1434,16 +1468,25 @@ static void shrink_worker(struct work_struct *w) } if (!mem_cgroup_tryget_online(memcg)) { - /* drop the reference from mem_cgroup_iter() */ - mem_cgroup_iter_break(NULL, memcg); - zswap_next_shrink = NULL; + /* + * It is an offline memcg which we cannot shrink + * until its pages are reparented. + * + * Since we cannot determine if the offline cleaner has + * been already called or not, the offline memcg must be + * put back unconditonally. We cannot abort the loop while + * zswap_next_shrink has a reference of this offline memcg. 
+ */ spin_unlock(&zswap_shrink_lock); - - if (++failures == MAX_RECLAIM_RETRIES) - break; - - goto resched; + goto iternext; } + /* + * We got an extra memcg reference before unlocking. + * The cleaner cannot free it using zswap_next_shrink. + * + * Our memcg can be offlined after we get online memcg here. + * In this case, the cleaner is waiting the lock just behind us. + */ spin_unlock(&zswap_shrink_lock); ret = shrink_memcg(memcg); @@ -1457,6 +1500,12 @@ static void shrink_worker(struct work_struct *w) resched: cond_resched(); } while (zswap_total_pages() > thr); + + /* + * We can still hold the original memcg reference. + * The reference is stored in zswap_next_shrink, and then reused + * by the next shrink_worker(). + */ } /*********************************
This patch fixes an issue where the zswap global shrinker stopped iterating through the memcg tree. The problem was that `shrink_worker()` would stop iterating when a memcg was being offlined and restart from the tree root. Now, it properly handles the offlining memcg and continues shrinking with the next memcg. This patch also modifies handling of the lock for the offlined memcg cleaner to adapt to the change in the iteration, and avoids negligibly rare skipping of a memcg from the shrink iteration. Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware") Signed-off-by: Takero Funaki <flintglass@gmail.com> --- mm/zswap.c | 87 ++++++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 68 insertions(+), 19 deletions(-)