memcg: allow exiting tasks to write back data to swap

Message ID 20241211105336.380cb545@fangorn

Commit Message

Rik van Riel Dec. 11, 2024, 3:53 p.m. UTC
A task already in exit can get stuck trying to allocate pages, if its
cgroup is at the memory.max limit, the cgroup is using zswap, but
zswap writeback is disabled, and the remaining memory in the cgroup is
not compressible.

This seems like an unlikely confluence of events, but it can happen
quite easily if a cgroup is OOM killed due to exceeding its memory.max
limit, and all the tasks in the cgroup are trying to exit simultaneously.

When this happens, it can sometimes take hours for tasks to exit,
as they are all trying to squeeze things into zswap to bring the group's
memory consumption below memory.max.

Allowing these exiting programs to push some memory from their own
cgroup into swap allows them to quickly bring the cgroup's memory
consumption below memory.max, and exit in seconds rather than hours.

Loading this fix as a live patch on a system where a workload got stuck
exiting allowed the workload to exit within a fraction of a second.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 mm/memcontrol.c | 9 +++++++++
 1 file changed, 9 insertions(+)

Comments

Yosry Ahmed Dec. 11, 2024, 4:26 p.m. UTC | #1
On Wed, Dec 11, 2024 at 7:54 AM Rik van Riel <riel@surriel.com> wrote:
>
> A task already in exit can get stuck trying to allocate pages, if its
> cgroup is at the memory.max limit, the cgroup is using zswap, but
> zswap writeback is disabled, and the remaining memory in the cgroup is
> not compressible.
>
> This seems like an unlikely confluence of events, but it can happen
> quite easily if a cgroup is OOM killed due to exceeding its memory.max
> limit, and all the tasks in the cgroup are trying to exit simultaneously.
>
> When this happens, it can sometimes take hours for tasks to exit,
> as they are all trying to squeeze things into zswap to bring the group's
> memory consumption below memory.max.
>
> Allowing these exiting programs to push some memory from their own
> cgroup into swap allows them to quickly bring the cgroup's memory
> consumption below memory.max, and exit in seconds rather than hours.
>
> Loading this fix as a live patch on a system where a workload got stuck
> exiting allowed the workload to exit within a fraction of a second.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  mm/memcontrol.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7b3503d12aaf..03d77e93087e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5371,6 +5371,15 @@ bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
>         if (!zswap_is_enabled())
>                 return true;
>
> +       /*
> +        * Always allow exiting tasks to push data to swap. A process in
> +        * the middle of exit cannot get OOM killed, but may need to push
> +        * uncompressible data to swap in order to get the cgroup memory
> +        * use below the limit, and make progress with the exit.
> +        */
> +       if ((current->flags & PF_EXITING) && memcg == mem_cgroup_from_task(current))
> +               return true;
> +

I have a few questions:
(a) If the task is being OOM killed it should be able to charge memory
beyond memory.max, so why do we need to get the usage down below the
limit?

Looking at the other thread with Michal, it looks like it's because we
have to go into reclaim first before we get to the point of force
charging for dying tasks, and we spend too much time in reclaim. Is
that correct?

If that's the case, I am wondering if the real problem is that we
check  mem_cgroup_zswap_writeback_enabled() too late in the process.
Reclaim ages the LRUs, isolates pages, unmaps them, allocates swap
entries, only to realize it cannot swap in swap_writepage().

Should we check for this in can_reclaim_anon_pages()? If zswap
writeback is disabled and we are already at the memcg limit (or zswap
limit for that matter), we should avoid scanning anon memory to begin
with. The problem is that if we race with memory being freed we may
have some extra OOM kills, but I am not sure how common this case
would be.
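
Purely as an illustration of that suggestion (not actual kernel code;
mem_cgroup_zswap_limit_hit() is a made-up name for whatever check of
the zswap limit would be needed), the early bail-out might look like:

	/* hypothetical early bail-out inside can_reclaim_anon_pages() */
	if (memcg && !mem_cgroup_zswap_writeback_enabled(memcg) &&
	    mem_cgroup_zswap_limit_hit(memcg))
		return false;	/* anon pages have nowhere to go */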

(b) Should we use mem_cgroup_is_descendant() or mm_match_cgroup() in
case we are reclaiming from an ancestor and we hit the limit of that
ancestor?

(c) mem_cgroup_from_task() should be called in an RCU read section (or
we need something like rcu_access_pointer() if we are not dereferencing
the pointer).
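
As a minimal sketch of how that could look in the patched function
(the local variable name is just for the example):

	bool exiting_in_this_memcg;

	rcu_read_lock();
	exiting_in_this_memcg = (current->flags & PF_EXITING) &&
				mem_cgroup_from_task(current) == memcg;
	rcu_read_unlock();

	if (exiting_in_this_memcg)
		return true;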


>         for (; memcg; memcg = parent_mem_cgroup(memcg))
>                 if (!READ_ONCE(memcg->zswap_writeback))
>                         return false;
> --
> 2.47.0
>
>
>
Rik van Riel Dec. 11, 2024, 4:34 p.m. UTC | #2
On Wed, 2024-12-11 at 08:26 -0800, Yosry Ahmed wrote:
> On Wed, Dec 11, 2024 at 7:54 AM Rik van Riel <riel@surriel.com>
> wrote:
> > 
> > +++ b/mm/memcontrol.c
> > @@ -5371,6 +5371,15 @@ bool
> > mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
> >         if (!zswap_is_enabled())
> >                 return true;
> > 
> > +       /*
> > +        * Always allow exiting tasks to push data to swap. A
> > process in
> > +        * the middle of exit cannot get OOM killed, but may need
> > to push
> > +        * uncompressible data to swap in order to get the cgroup
> > memory
> > +        * use below the limit, and make progress with the exit.
> > +        */
> > +       if ((current->flags & PF_EXITING) && memcg ==
> > mem_cgroup_from_task(current))
> > +               return true;
> > +
> 
> I have a few questions:
> (a) If the task is being OOM killed it should be able to charge
> memory
> beyond memory.max, so why do we need to get the usage down below the
> limit?
> 
If it is a kernel directed memcg OOM kill, that is
true.

However, if the exit comes from somewhere else,
like a userspace oomd kill, we might not hit that
code path.

> Looking at the other thread with Michal, it looks like it's because
> we
> have to go into reclaim first before we get to the point of force
> charging for dying tasks, and we spend too much time in reclaim. Is
> that correct?
> 
> If that's the case, I am wondering if the real problem is that we
> check  mem_cgroup_zswap_writeback_enabled() too late in the process.
> Reclaim ages the LRUs, isolates pages, unmaps them, allocates swap
> entries, only to realize it cannot swap in swap_writepage().
> 
> Should we check for this in can_reclaim_anon_pages()? If zswap
> writeback is disabled and we are already at the memcg limit (or zswap
> limit for that matter), we should avoid scanning anon memory to begin
> with. The problem is that if we race with memory being freed we may
> have some extra OOM kills, but I am not sure how common this case
> would be.

However, we don't know until the attempted zswap write
whether the memory is compressible, and whether doing
a bunch of zswap writes will help us bring our memcg
down below its memory.max limit.

> 
> (b) Should we use mem_cgroup_is_descendant() or mm_match_cgroup() in
> case we are reclaiming from an ancestor and we hit the limit of that
> ancestor?
> 
I don't know if we need or want to reclaim from any
other memcgs than those of the exiting process itself.

A small blast radius seems like it could be desirable,
but I'm open to other ideas :)

> (c) mem_cgroup_from_task() should be called in an RCU read section
> (or
> we need something like rcu_access_pointer() if we are not dereferencing
> the pointer).
> 
I'll add this in v2.
Yosry Ahmed Dec. 11, 2024, 5 p.m. UTC | #3
On Wed, Dec 11, 2024 at 8:34 AM Rik van Riel <riel@surriel.com> wrote:
>
> On Wed, 2024-12-11 at 08:26 -0800, Yosry Ahmed wrote:
> > On Wed, Dec 11, 2024 at 7:54 AM Rik van Riel <riel@surriel.com>
> > wrote:
> > >
> > > +++ b/mm/memcontrol.c
> > > @@ -5371,6 +5371,15 @@ bool
> > > mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
> > >         if (!zswap_is_enabled())
> > >                 return true;
> > >
> > > +       /*
> > > +        * Always allow exiting tasks to push data to swap. A
> > > process in
> > > +        * the middle of exit cannot get OOM killed, but may need
> > > to push
> > > +        * uncompressible data to swap in order to get the cgroup
> > > memory
> > > +        * use below the limit, and make progress with the exit.
> > > +        */
> > > +       if ((current->flags & PF_EXITING) && memcg ==
> > > mem_cgroup_from_task(current))
> > > +               return true;
> > > +
> >
> > I have a few questions:
> > (a) If the task is being OOM killed it should be able to charge
> > memory
> > beyond memory.max, so why do we need to get the usage down below the
> > limit?
> >
> If it is a kernel directed memcg OOM kill, that is
> true.
>
> However, if the exit comes from somewhere else,
> like a userspace oomd kill, we might not hit that
> code path.

Why do we treat dying tasks differently based on the source of the kill?

>
> > Looking at the other thread with Michal, it looks like it's because
> > we
> > have to go into reclaim first before we get to the point of force
> > charging for dying tasks, and we spend too much time in reclaim. Is
> > that correct?
> >
> > If that's the case, I am wondering if the real problem is that we
> > check  mem_cgroup_zswap_writeback_enabled() too late in the process.
> > Reclaim ages the LRUs, isolates pages, unmaps them, allocates swap
> > entries, only to realize it cannot swap in swap_writepage().
> >
> > Should we check for this in can_reclaim_anon_pages()? If zswap
> > writeback is disabled and we are already at the memcg limit (or zswap
> > limit for that matter), we should avoid scanning anon memory to begin
> > with. The problem is that if we race with memory being freed we may
> > have some extra OOM kills, but I am not sure how common this case
> > would be.
>
> However, we don't know until the attempted zswap write
> whether the memory is compressible, and whether doing
> a bunch of zswap writes will help us bring our memcg
> down below its memory.max limit.

If we are at memory.max (or memory.zswap.max), we can't compress pages
into zswap anyway, regardless of their compressibility.

So what I am saying is, if we are already at the limit (pages cannot
go into zswap), and writeback is disabled (pages cannot go into
swapfiles), then we should probably avoid scanning the anon LRUs and
spending all those wasted cycles trying to isolate, unmap, and reclaim
them only to fail at the last step.

>
> >
> > (b) Should we use mem_cgroup_is_descendant() or mm_match_cgroup() in
> > case we are reclaiming from an ancestor and we hit the limit of that
> > ancestor?
> >
> I don't know if we need or want to reclaim from any
> other memcgs than those of the exiting process itself.
>
> A small blast radius seems like it could be desirable,
> but I'm open to other ideas :)

The exiting process is part of all the ancestor cgroups by the hierarchy.

If we have the following hierarchy:
root
   |
  A
   |
  B

Then a process in cgroup B could be getting OOM killed due to hitting
the limit of A, not B. In which case, reclaiming from A helps us get
below the limit. We can check if the cgroup is an ancestor and it hit
its limit, but maybe that's an overkill.

>
> > (c) mem_cgroup_from_task() should be called in an RCU read section
> > (or
> > we need something like rcu_access_pointer() if we are not dereferencing
> > the pointer).
> >
> I'll add this in v2.
>
> --
> All Rights Reversed.
Rik van Riel Dec. 11, 2024, 5:19 p.m. UTC | #4
On Wed, 2024-12-11 at 09:00 -0800, Yosry Ahmed wrote:
> On Wed, Dec 11, 2024 at 8:34 AM Rik van Riel <riel@surriel.com>
> wrote:
> > 
> > On Wed, 2024-12-11 at 08:26 -0800, Yosry Ahmed wrote:
> > > On Wed, Dec 11, 2024 at 7:54 AM Rik van Riel <riel@surriel.com>
> > > wrote:
> > > > 
> > > > +++ b/mm/memcontrol.c
> > > > @@ -5371,6 +5371,15 @@ bool
> > > > mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
> > > >         if (!zswap_is_enabled())
> > > >                 return true;
> > > > 
> > > > +       /*
> > > > +        * Always allow exiting tasks to push data to swap. A
> > > > process in
> > > > +        * the middle of exit cannot get OOM killed, but may
> > > > need
> > > > to push
> > > > +        * uncompressible data to swap in order to get the
> > > > cgroup
> > > > memory
> > > > +        * use below the limit, and make progress with the
> > > > exit.
> > > > +        */
> > > > +       if ((current->flags & PF_EXITING) && memcg ==
> > > > mem_cgroup_from_task(current))
> > > > +               return true;
> > > > +
> > > 
> > > I have a few questions:
> > > (a) If the task is being OOM killed it should be able to charge
> > > memory
> > > beyond memory.max, so why do we need to get the usage down below
> > > the
> > > limit?
> > > 
> > If it is a kernel directed memcg OOM kill, that is
> > true.
> > 
> > However, if the exit comes from somewhere else,
> > like a userspace oomd kill, we might not hit that
> > code path.
> 
> Why do we treat dying tasks differently based on the source of the
> kill?
> 
Are you saying we should fail allocations for
every dying task, and add a check for PF_EXITING
in here?


        if (unlikely(task_in_memcg_oom(current)))
                goto nomem;
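
(For context: this snippet looks like the existing check in
try_charge_memcg() in mm/memcontrol.c, where a task that is already
handling a memcg OOM skips further reclaim for the charge.)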


> > However, we don't know until the attempted zswap write
> > whether the memory is compressible, and whether doing
> > a bunch of zswap writes will help us bring our memcg
> > down below its memory.max limit.
> 
> If we are at memory.max (or memory.zswap.max), we can't compress
> pages
> into zswap anyway, regardless of their compressibility.
> 
Wait, this is news to me.

This seems like something we should fix, rather
than live with, since compressing the data to
a smaller size could bring us below memory.max.

Is this "cannot compress when at memory.max"
behavior intentional, or just a side effect of
how things happen to be?

Won't the allocations made from zswap_store
ignore the memory.max limit because PF_MEMALLOC
is set?

> > > 
> > > (b) Should we use mem_cgroup_is_descendant() or mm_match_cgroup()
> > > in
> > > case we are reclaiming from an ancestor and we hit the limit of
> > > that
> > > ancestor?
> > > 
> > I don't know if we need or want to reclaim from any
> > other memcgs than those of the exiting process itself.
> > 
> > A small blast radius seems like it could be desirable,
> > but I'm open to other ideas :)
> 
> The exiting process is part of all the ancestor cgroups by the
> hierarchy.
> 
> If we have the following hierarchy:
> root
>    |
>   A
>    |
>   B
> 
> Then a process in cgroup B could be getting OOM killed due to hitting
> the limit of A, not B. In which case, reclaiming from A helps us get
> below the limit. We can check if the cgroup is an ancestor and it hit
> its limit, but maybe that's an overkill.

Since we're dealing with a corner case anyway, I
suppose there's no harm in using mm_match_cgroup(),
which also happens to be cleaner than the code
I have right now.
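
A minimal sketch of that variant, assuming current->mm is still valid
at the point where this runs during exit (something a v2 would need to
confirm):

	if ((current->flags & PF_EXITING) &&
	    mm_match_cgroup(current->mm, memcg))
		return true;

mm_match_cgroup() takes the RCU read lock and matches against the whole
ancestry via mem_cgroup_is_descendant(), which would cover points (b)
and (c) above at the same time.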
Yosry Ahmed Dec. 11, 2024, 5:30 p.m. UTC | #5
On Wed, Dec 11, 2024 at 9:20 AM Rik van Riel <riel@surriel.com> wrote:
>
> On Wed, 2024-12-11 at 09:00 -0800, Yosry Ahmed wrote:
> > On Wed, Dec 11, 2024 at 8:34 AM Rik van Riel <riel@surriel.com>
> > wrote:
> > >
> > > On Wed, 2024-12-11 at 08:26 -0800, Yosry Ahmed wrote:
> > > > On Wed, Dec 11, 2024 at 7:54 AM Rik van Riel <riel@surriel.com>
> > > > wrote:
> > > > >
> > > > > +++ b/mm/memcontrol.c
> > > > > @@ -5371,6 +5371,15 @@ bool
> > > > > mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
> > > > >         if (!zswap_is_enabled())
> > > > >                 return true;
> > > > >
> > > > > +       /*
> > > > > +        * Always allow exiting tasks to push data to swap. A
> > > > > process in
> > > > > +        * the middle of exit cannot get OOM killed, but may
> > > > > need
> > > > > to push
> > > > > +        * uncompressible data to swap in order to get the
> > > > > cgroup
> > > > > memory
> > > > > +        * use below the limit, and make progress with the
> > > > > exit.
> > > > > +        */
> > > > > +       if ((current->flags & PF_EXITING) && memcg ==
> > > > > mem_cgroup_from_task(current))
> > > > > +               return true;
> > > > > +
> > > >
> > > > I have a few questions:
> > > > (a) If the task is being OOM killed it should be able to charge
> > > > memory
> > > > beyond memory.max, so why do we need to get the usage down below
> > > > the
> > > > limit?
> > > >
> > > If it is a kernel directed memcg OOM kill, that is
> > > true.
> > >
> > > However, if the exit comes from somewhere else,
> > > like a userspace oomd kill, we might not hit that
> > > code path.
> >
> > Why do we treat dying tasks differently based on the source of the
> > kill?
> >
> Are you saying we should fail allocations for
> every dying task, and add a check for PF_EXITING
> in here?

I am asking, not really suggesting anything :)

Does it matter from the kernel perspective if the task is dying due to
a kernel OOM kill or a userspace SIGKILL?

>
>
>         if (unlikely(task_in_memcg_oom(current)))
>                 goto nomem;
>
>
> > > However, we don't know until the attempted zswap write
> > > whether the memory is compressible, and whether doing
> > > a bunch of zswap writes will help us bring our memcg
> > > down below its memory.max limit.
> >
> > If we are at memory.max (or memory.zswap.max), we can't compress
> > pages
> > into zswap anyway, regardless of their compressibility.
> >
> Wait, this is news to me.
>
> This seems like something we should fix, rather
> than live with, since compressing the data to
> a smaller size could bring us below memory.max.
>
> Is this "cannot compress when at memory.max"
> behavior intentional, or just a side effect of
> how things happen to be?
>
> Won't the allocations made from zswap_store
> ignore the memory.max limit because PF_MEMALLOC
> is set?

My bad, obj_cgroup_may_zswap() only checks the zswap limit, not
memory.max. Please ignore this.

The scenario I described where we scan the LRUs needlessly is if the
*zswap limit* is hit, and writeback is disabled. I am guessing this is
not the case you're running into.

So yeah my only outstanding question is the one above about handling
userspace OOM kills differently.

Thanks for bearing with me.
Rik van Riel Dec. 11, 2024, 5:49 p.m. UTC | #6
On Wed, 2024-12-11 at 09:30 -0800, Yosry Ahmed wrote:
> On Wed, Dec 11, 2024 at 9:20 AM Rik van Riel <riel@surriel.com>
> wrote:
> > 
> > On Wed, 2024-12-11 at 09:00 -0800, Yosry Ahmed wrote:
> > > On Wed, Dec 11, 2024 at 8:34 AM Rik van Riel <riel@surriel.com>
> > > wrote:
> > > > > 
> > > > If it is a kernel directed memcg OOM kill, that is
> > > > true.
> > > > 
> > > > However, if the exit comes from somewhere else,
> > > > like a userspace oomd kill, we might not hit that
> > > > code path.
> > > 
> > > Why do we treat dying tasks differently based on the source of
> > > the
> > > kill?
> > > 
> > Are you saying we should fail allocations for
> > every dying task, and add a check for PF_EXITING
> > in here?
> 
> I am asking, not really suggesting anything :)
> 
> Does it matter from the kernel perspective if the task is dying due
> to
> a kernel OOM kill or a userspace SIGKILL?
> 
Currently, it does. I'm not sure it should, but
currently it does :/

We are dealing with two conflicting demands here.

On the one hand, we want the exit code to be able
to access things like futex memory, so it can
properly clean up everything the program left behind.

On the other hand, we don't want the exiting
program to drive up cgroup memory use, especially
not with memory that won't be reclaimed by the
exit.

My patch is an attempt to satisfy both of these
demands, in situations where we currently exhibit
a rather pathological behavior (glacially slow
exit).
Balbir Singh Dec. 11, 2024, 11:15 p.m. UTC | #7
On 12/12/24 02:53, Rik van Riel wrote:
> A task already in exit can get stuck trying to allocate pages, if its
> cgroup is at the memory.max limit, the cgroup is using zswap, but
> zswap writeback is disabled, and the remaining memory in the cgroup is
> not compressible.
> 
> This seems like an unlikely confluence of events, but it can happen
> quite easily if a cgroup is OOM killed due to exceeding its memory.max
> limit, and all the tasks in the cgroup are trying to exit simultaneously.
> 
> When this happens, it can sometimes take hours for tasks to exit,
> as they are all trying to squeeze things into zswap to bring the group's
> memory consumption below memory.max.
> 
> Allowing these exiting programs to push some memory from their own
> cgroup into swap allows them to quickly bring the cgroup's memory
> consumption below memory.max, and exit in seconds rather than hours.
> 
> Loading this fix as a live patch on a system where a workload got stuck
> exiting allowed the workload to exit within a fraction of a second.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  mm/memcontrol.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7b3503d12aaf..03d77e93087e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5371,6 +5371,15 @@ bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
>  	if (!zswap_is_enabled())
>  		return true;
>  
> +	/*
> +	 * Always allow exiting tasks to push data to swap. A process in
> +	 * the middle of exit cannot get OOM killed, but may need to push
> +	 * uncompressible data to swap in order to get the cgroup memory
> +	 * use below the limit, and make progress with the exit.
> +	 */
> +	if ((current->flags & PF_EXITING) && memcg == mem_cgroup_from_task(current))
> +		return true;
> +
>  	for (; memcg; memcg = parent_mem_cgroup(memcg))
>  		if (!READ_ONCE(memcg->zswap_writeback))
>  			return false;

Rik,

I am unable to understand the motivation here, so we want 
mem_cgroup_zswap_writeback_enabled() to return true, it only
returns false if a memcg in the hierarchy has zswap_writeback
set to 0 (false). In my git-grep I can't seem to find how/why
that may be the case. I can see memcg starts off with the value
set to true, if CONFIG_ZSWAP is enabled.

Your changelog above makes sense, but I am unable to map it to
the code changes.

Balbir
Rik van Riel Dec. 12, 2024, 1:21 a.m. UTC | #8
On Thu, 2024-12-12 at 10:15 +1100, Balbir Singh wrote:
> On 12/12/24 02:53, Rik van Riel wrote:
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 7b3503d12aaf..03d77e93087e 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -5371,6 +5371,15 @@ bool
> > mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
> >  	if (!zswap_is_enabled())
> >  		return true;
> >  
> > +	/*
> > +	 * Always allow exiting tasks to push data to swap. A
> > process in
> > +	 * the middle of exit cannot get OOM killed, but may need
> > to push
> > +	 * uncompressible data to swap in order to get the cgroup
> > memory
> > +	 * use below the limit, and make progress with the exit.
> > +	 */
> > +	if ((current->flags & PF_EXITING) && memcg ==
> > mem_cgroup_from_task(current))
> > +		return true;
> > +
> >  	for (; memcg; memcg = parent_mem_cgroup(memcg))
> >  		if (!READ_ONCE(memcg->zswap_writeback))
> >  			return false;
> 
> Rik,
> 
> I am unable to understand the motivation here, so we want 
> mem_cgroup_zswap_writeback_enabled() to return true, it only
> returns false if a memcg in the hierarchy has zswap_writeback
> set to 0 (false). In my git-grep I can't seem to find how/why
> that may be the case. I can see memcg starts off with the value
> set to true, if CONFIG_ZSWAP is enabled.
> 
> Your changelog above makes sense, but I am unable to map it to
> the code changes.
> 

Wait, are you asking about the code that I'm
adding, or about the code that was already
there?

I want to add the code that allows zswap
writeback if the reclaiming task is exiting,
and in the same cgroup as the to be written
back memory.
Balbir Singh Dec. 12, 2024, 3:25 a.m. UTC | #9
On 12/12/24 12:21, Rik van Riel wrote:
> On Thu, 2024-12-12 at 10:15 +1100, Balbir Singh wrote:
>> On 12/12/24 02:53, Rik van Riel wrote:
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index 7b3503d12aaf..03d77e93087e 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -5371,6 +5371,15 @@ bool
>>> mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
>>>  	if (!zswap_is_enabled())
>>>  		return true;
>>>  
>>> +	/*
>>> +	 * Always allow exiting tasks to push data to swap. A
>>> process in
>>> +	 * the middle of exit cannot get OOM killed, but may need
>>> to push
>>> +	 * uncompressible data to swap in order to get the cgroup
>>> memory
>>> +	 * use below the limit, and make progress with the exit.
>>> +	 */
>>> +	if ((current->flags & PF_EXITING) && memcg ==
>>> mem_cgroup_from_task(current))
>>> +		return true;
>>> +
>>>  	for (; memcg; memcg = parent_mem_cgroup(memcg))
>>>  		if (!READ_ONCE(memcg->zswap_writeback))
>>>  			return false;
>>
>> Rik,
>>
>> I am unable to understand the motivation here, so we want 
>> mem_cgroup_zswap_writeback_enabled() to return true, it only
>> returns false if a memcg in the hierarchy has zswap_writeback
>> set to 0 (false). In my git-grep I can't seem to find how/why
>> that may be the case. I can see memcg starts off with the value
>> set to true, if CONFIG_ZSWAP is enabled.
>>
>> Your changelog above makes sense, but I am unable to map it to
>> the code changes.
>>
> 
> Wait, are you asking about the code that I'm
> adding, or about the code that was already
> there?
> 
> I want to add the code that allows zswap
> writeback if the reclaiming task is exiting,
> and in the same cgroup as the to be written
> back memory.
>

I was asking about this change (this patch), I know that the return
true will help avoid the PAGE_ACTIVATE path, but I am not sure why
this function will return false if CONFIG_ZSWAP is enabled (unless
zswap_writeback is turned off in one of the groups)

Balbir
Rik van Riel Dec. 12, 2024, 2:03 p.m. UTC | #10
On Thu, 2024-12-12 at 14:25 +1100, Balbir Singh wrote:
> 
> I was asking about this change (this patch), I know that the return
> true will help avoid the PAGE_ACTIVATE path, but I am not sure why
> this function will return false if CONFIG_ZSWAP is enabled (unless
> zswap_writeback is turned off in one of the groups)
> 
Some workloads are fine with incurring the latency
from zswap, but do not want the latency of actual
block device backed swap.

Having zswap enabled, with writeback disabled, has
been beneficial to a number of workloads.
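
For reference, the per-cgroup setting involved here is the cgroup v2
memory.zswap.writeback file; writing 0 to it is what makes the loop in
mem_cgroup_zswap_writeback_enabled() return false for that cgroup (and,
via the ancestor walk, for its descendants). A hypothetical example
path:

	echo 0 > /sys/fs/cgroup/<group>/memory.zswap.writeback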
Balbir Singh Dec. 12, 2024, 11:39 p.m. UTC | #11
On 12/13/24 01:03, Rik van Riel wrote:
> On Thu, 2024-12-12 at 14:25 +1100, Balbir Singh wrote:
>>
>> I was asking about this change (this patch), I know that the return
>> true will help avoid the PAGE_ACTIVATE path, but I am not sure why
>> this function will return false if CONFIG_ZSWAP is enabled (unless
>> zswap_writeback is turned off in one of the groups)
>>
> Some workloads are fine with incurring the latency
> from zswap, but do not want the latency of actual
> block device backed swap.
> 
> Having zswap enabled, with writeback disabled, has
> been beneficial to a number of workloads.
> 

Thanks, that was the missing bit

Balbir

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7b3503d12aaf..03d77e93087e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5371,6 +5371,15 @@  bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
 	if (!zswap_is_enabled())
 		return true;
 
+	/*
+	 * Always allow exiting tasks to push data to swap. A process in
+	 * the middle of exit cannot get OOM killed, but may need to push
+	 * uncompressible data to swap in order to get the cgroup memory
+	 * use below the limit, and make progress with the exit.
+	 */
+	if ((current->flags & PF_EXITING) && memcg == mem_cgroup_from_task(current))
+		return true;
+
 	for (; memcg; memcg = parent_mem_cgroup(memcg))
 		if (!READ_ONCE(memcg->zswap_writeback))
 			return false;