mbox series

[RFC,0/2] opportunistic memory reclaim of a killed process

Message ID 20190411014353.113252-1-surenb@google.com (mailing list archive)
Headers show
Series opportunistic memory reclaim of a killed process | expand

Message

Suren Baghdasaryan April 11, 2019, 1:43 a.m. UTC
The time to kill a process and free its memory can be critical when the
killing was done to prevent memory shortages affecting system
responsiveness.

In the case of Android, where processes can be restarted easily, killing a
less important background process is preferred to delaying or throttling
an interactive foreground process. At the same time unnecessary kills
should be avoided as they cause delays when the killed process is needed
again. This requires a balanced decision from the system software about
how long a kill can be postponed in the hope that memory usage will
decrease without such drastic measures.

As killing a process and reclaiming its memory is not an instant operation,
a margin of free memory has to be maintained to prevent system performance
deterioration while memory of the killed process is being reclaimed. The
size of this margin depends on the minimum reclaim rate to cover the
worst-case scenario and this minimum rate should be deterministic.

Note that on asymmetric architectures like ARM big.LITTLE the reclaim rate
can vary dramatically depending on which core it’s performed at (see test
results). It’s a usual scenario that a non-essential victim process is
being restricted to a less performant or throttled CPU for power saving
purposes. This makes the worst-case reclaim rate scenario very probable.

The cases when victim’s memory reclaim can be delayed further due to
process being blocked in an uninterruptible sleep or when it performs a
time-consuming operation makes the reclaim time even more unpredictable.

Increasing memory reclaim rate and making it more deterministic would
allow for a smaller free memory margin and would lead to more opportunities
to avoid killing a process.

Note that while other strategies like throttling memory allocations are
viable and can be employed for other non-essential processes they would
affect user experience if applied towards an interactive process.

Proposed solution uses existing oom-reaper thread to increase memory
reclaim rate of a killed process and to make this rate more deterministic.
By no means the proposed solution is considered the best and was chosen
because it was simple to implement and allowed for test data collection.
The downside of this solution is that it requires additional “expedite”
hint for something which has to be fast in all cases. Would be great to
find a way that does not require additional hints.

Other possible approaches include:
- Implementing a dedicated syscall to perform opportunistic reclaim in the
context of the process waiting for the victim’s death. A natural boost
bonus occurs if the waiting process has high or RT priority and is not
limited by cpuset cgroup in its CPU choices.
- Implement a mechanism that would perform opportunistic reclaim if it’s
possible unconditionally (similar to checks in task_will_free_mem()).
- Implement opportunistic reclaim that uses shrinker interface, PSI or
other memory pressure indications as a hint to engage.

Test details:
Tests are performed on a Qualcomm® Snapdragon™ 845 8-core ARM big.LITTLE
system with 4 little cores (0.3-1.6GHz) and 4 big cores (0.8-2.5GHz)
running Android.
Memory reclaim speed was measured using signal/signal_generate,
kmem/rss_stat and sched/sched_process_exit traces.

Test results:
powersave governor, min freq
                        normal kills      expedited kills
        little          856 MB/sec        3236 MB/sec
        big             5084 MB/sec       6144 MB/sec

performance governor, max freq
                        normal kills      expedited kills
        little          5602 MB/sec       8144 MB/sec
        big             14656 MB/sec      12398 MB/sec

schedutil governor (default)
                        normal kills      expedited kills
        little          2386 MB/sec       3908 MB/sec
        big             7282 MB/sec       6820-16386 MB/sec
=================================================================
min reclaim speed:      856 MB/sec        3236 MB/sec

The patches are based on 5.1-rc1

Suren Baghdasaryan (2):
  mm: oom: expose expedite_reclaim to use oom_reaper outside of
    oom_kill.c
  signal: extend pidfd_send_signal() to allow expedited process killing

 include/linux/oom.h          |  1 +
 include/linux/sched/signal.h |  3 ++-
 include/linux/signal.h       | 11 ++++++++++-
 ipc/mqueue.c                 |  2 +-
 kernel/signal.c              | 37 ++++++++++++++++++++++++++++--------
 kernel/time/itimer.c         |  2 +-
 mm/oom_kill.c                | 15 +++++++++++++++
 7 files changed, 59 insertions(+), 12 deletions(-)

Comments

Michal Hocko April 11, 2019, 10:51 a.m. UTC | #1
On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote:
[...]
> Proposed solution uses existing oom-reaper thread to increase memory
> reclaim rate of a killed process and to make this rate more deterministic.
> By no means the proposed solution is considered the best and was chosen
> because it was simple to implement and allowed for test data collection.
> The downside of this solution is that it requires additional “expedite”
> hint for something which has to be fast in all cases. Would be great to
> find a way that does not require additional hints.

I have to say I do not like this much. It is abusing an implementation
detail of the OOM implementation and makes it an official API. Also
there are some non trivial assumptions to be fullfilled to use the
current oom_reaper. First of all all the process groups that share the
address space have to be killed. How do you want to guarantee/implement
that with a simply kill to a thread/process group?

> Other possible approaches include:
> - Implementing a dedicated syscall to perform opportunistic reclaim in the
> context of the process waiting for the victim’s death. A natural boost
> bonus occurs if the waiting process has high or RT priority and is not
> limited by cpuset cgroup in its CPU choices.
> - Implement a mechanism that would perform opportunistic reclaim if it’s
> possible unconditionally (similar to checks in task_will_free_mem()).
> - Implement opportunistic reclaim that uses shrinker interface, PSI or
> other memory pressure indications as a hint to engage.

I would question whether we really need this at all? Relying on the exit
speed sounds like a fundamental design problem of anything that relies
on it. Sure task exit might be slow, but async mm tear down is just a
mere optimization this is not guaranteed to really help in speading
things up. OOM killer uses it as a guarantee for a forward progress in a
finite time rather than as soon as possible.
Rik van Riel April 11, 2019, 11:51 a.m. UTC | #2
On Wed, 2019-04-10 at 18:43 -0700, Suren Baghdasaryan via Lsf-pc wrote:
> The time to kill a process and free its memory can be critical when
> the
> killing was done to prevent memory shortages affecting system
> responsiveness.

The OOM killer is fickle, and often takes a fairly
long time to trigger. Speeding up what happens after
that seems like the wrong thing to optimize.

Have you considered using something like oomd to
proactively kill tasks when memory gets low, so
you do not have to wait for an OOM kill?
Michal Hocko April 11, 2019, 12:16 p.m. UTC | #3
On Thu 11-04-19 07:51:21, Rik van Riel wrote:
> On Wed, 2019-04-10 at 18:43 -0700, Suren Baghdasaryan via Lsf-pc wrote:
> > The time to kill a process and free its memory can be critical when
> > the
> > killing was done to prevent memory shortages affecting system
> > responsiveness.
> 
> The OOM killer is fickle, and often takes a fairly
> long time to trigger. Speeding up what happens after
> that seems like the wrong thing to optimize.
> 
> Have you considered using something like oomd to
> proactively kill tasks when memory gets low, so
> you do not have to wait for an OOM kill?

AFAIU, this is the point here. They probably have a user space OOM
killer implementation and want to achieve killing to be as swift as
possible.
Joel Fernandes April 11, 2019, 4:18 p.m. UTC | #4
On Thu, Apr 11, 2019 at 6:51 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote:
> [...]
> > Proposed solution uses existing oom-reaper thread to increase memory
> > reclaim rate of a killed process and to make this rate more deterministic.
> > By no means the proposed solution is considered the best and was chosen
> > because it was simple to implement and allowed for test data collection.
> > The downside of this solution is that it requires additional “expedite”
> > hint for something which has to be fast in all cases. Would be great to
> > find a way that does not require additional hints.
>
> I have to say I do not like this much. It is abusing an implementation
> detail of the OOM implementation and makes it an official API. Also
> there are some non trivial assumptions to be fullfilled to use the
> current oom_reaper. First of all all the process groups that share the
> address space have to be killed. How do you want to guarantee/implement
> that with a simply kill to a thread/process group?

Will task_will_free_mem() not bail out in such cases because of
process_shares_mm() returning true? AFAIU, Suren's patch calls that.
Also, if I understand correctly, this patch is opportunistic and knows
what it may not be possible to reap in advance this way in all cases.
        /*
         * Make sure that all tasks which share the mm with the given tasks
         * are dying as well to make sure that a) nobody pins its mm and
         * b) the task is also reapable by the oom reaper.
         */
        rcu_read_lock();
        for_each_process(p) {
                if (!process_shares_mm(p, mm))

> > Other possible approaches include:
> > - Implementing a dedicated syscall to perform opportunistic reclaim in the
> > context of the process waiting for the victim’s death. A natural boost
> > bonus occurs if the waiting process has high or RT priority and is not
> > limited by cpuset cgroup in its CPU choices.
> > - Implement a mechanism that would perform opportunistic reclaim if it’s
> > possible unconditionally (similar to checks in task_will_free_mem()).
> > - Implement opportunistic reclaim that uses shrinker interface, PSI or
> > other memory pressure indications as a hint to engage.
>
> I would question whether we really need this at all? Relying on the exit
> speed sounds like a fundamental design problem of anything that relies
> on it. Sure task exit might be slow, but async mm tear down is just a
> mere optimization this is not guaranteed to really help in speading
> things up. OOM killer uses it as a guarantee for a forward progress in a
> finite time rather than as soon as possible.

Per the data collected by Suren, it does speed things up. It would be
nice if we can reuse this mechanism, or come up with a similar
mechanism.

thanks,

 - Joel
Sandeep Patil April 11, 2019, 4:20 p.m. UTC | #5
On Thu, Apr 11, 2019 at 12:51:11PM +0200, Michal Hocko wrote:
> On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote:
> [...]
> > Proposed solution uses existing oom-reaper thread to increase memory
> > reclaim rate of a killed process and to make this rate more deterministic.
> > By no means the proposed solution is considered the best and was chosen
> > because it was simple to implement and allowed for test data collection.
> > The downside of this solution is that it requires additional “expedite”
> > hint for something which has to be fast in all cases. Would be great to
> > find a way that does not require additional hints.
> 
> I have to say I do not like this much. It is abusing an implementation
> detail of the OOM implementation and makes it an official API. Also
> there are some non trivial assumptions to be fullfilled to use the
> current oom_reaper. First of all all the process groups that share the
> address space have to be killed. How do you want to guarantee/implement
> that with a simply kill to a thread/process group?
> 
> > Other possible approaches include:
> > - Implementing a dedicated syscall to perform opportunistic reclaim in the
> > context of the process waiting for the victim’s death. A natural boost
> > bonus occurs if the waiting process has high or RT priority and is not
> > limited by cpuset cgroup in its CPU choices.
> > - Implement a mechanism that would perform opportunistic reclaim if it’s
> > possible unconditionally (similar to checks in task_will_free_mem()).
> > - Implement opportunistic reclaim that uses shrinker interface, PSI or
> > other memory pressure indications as a hint to engage.
> 
> I would question whether we really need this at all? Relying on the exit
> speed sounds like a fundamental design problem of anything that relies
> on it.

OTOH, we want to keep as many processes around as possible for recency. In which
case, the exit path (particularly the memory reclaim) becomes critical to
maintain interactivity for phones.

Android keeps processes around because cold starting applications is much
slower than simply bringing them up from background. This obviously presents
the problem when a background application _is_ killed, it is almost always to
address sudden spike in memory needs by something else much more important
and user visible. e.g. a foreground application or critical system process.

> Sure task exit might be slow, but async mm tear down is just a
> mere optimization this is not guaranteed to really help in speading
> things up. OOM killer uses it as a guarantee for a forward progress in a
> finite time rather than as soon as possible.

With OOM killer, things are already really bad. When lmkd[1] kills processes,
it is doing so to serve the immediate needs of the system while trying to
avoid the OOM killer.


- ssp

1] https://android.googlesource.com/platform/system/core/+/refs/heads/master/lmkd/
Suren Baghdasaryan April 11, 2019, 4:47 p.m. UTC | #6
Thanks for the feedback!

On Thu, Apr 11, 2019 at 3:51 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote:
> [...]
> > Proposed solution uses existing oom-reaper thread to increase memory
> > reclaim rate of a killed process and to make this rate more deterministic.
> > By no means the proposed solution is considered the best and was chosen
> > because it was simple to implement and allowed for test data collection.
> > The downside of this solution is that it requires additional “expedite”
> > hint for something which has to be fast in all cases. Would be great to
> > find a way that does not require additional hints.
>
> I have to say I do not like this much. It is abusing an implementation
> detail of the OOM implementation and makes it an official API.

I agree with you that this particular implementation is abusing oom
internal machinery and I don't think it is acceptable as is (hence
this is sent as an RFC). I would like to discuss the viability of the
idea of reaping kill victim's mm asynchronously. If we agree that this
is worth our time, only then I would love to get into more details on
how to implement it. The implementation in this RFC is a convenient
way to illustrate the idea and to collect test data.

> Also
> there are some non trivial assumptions to be fullfilled to use the
> current oom_reaper. First of all all the process groups that share the
> address space have to be killed. How do you want to guarantee/implement
> that with a simply kill to a thread/process group?
>

I'm not sure I understood this correctly but if you are asking how do
we know that the mm we are reaping is not shared with processes that
are not being killed then I think your task_will_free_mem() checks for
that. Or have I misunderstood your question?

> > Other possible approaches include:
> > - Implementing a dedicated syscall to perform opportunistic reclaim in the
> > context of the process waiting for the victim’s death. A natural boost
> > bonus occurs if the waiting process has high or RT priority and is not
> > limited by cpuset cgroup in its CPU choices.
> > - Implement a mechanism that would perform opportunistic reclaim if it’s
> > possible unconditionally (similar to checks in task_will_free_mem()).
> > - Implement opportunistic reclaim that uses shrinker interface, PSI or
> > other memory pressure indications as a hint to engage.
>
> I would question whether we really need this at all? Relying on the exit
> speed sounds like a fundamental design problem of anything that relies
> on it.

Relying on it is wrong, I agree. There are protections like allocation
throttling that we can fall back to stop memory depletion. However
having a way to free up resources that are not needed by a dying
process quickly would help to avoid throttling which hurts user
experience.
I agree that this is an optimization which is beneficial in a specific
case - when we kill to free up resources. However this is an important
optimization for systems with low memory resources like embedded
systems, phones, etc. The only way to prevent being cornered into
throttling is to increase the free memory margin that system needs to
maintain (I describe this in my cover letter). And with limited
overall memory resources memory space is at a premium, so we try to
decrease that margin.
I think the other and arguably even more important issue than the
speed of memory reclaim is that this speed depends on what the victim
is doing at the time of a kill. This introduces non-determinism in how
fast we can free up resource and at this point we don't even know how
much safety margin we need.

> Sure task exit might be slow, but async mm tear down is just a
> mere optimization this is not guaranteed to really help in speading
> things up. OOM killer uses it as a guarantee for a forward progress in a
> finite time rather than as soon as possible.
>
> --
> Michal Hocko
> SUSE Labs
>
> --
> You received this message because you are subscribed to the Google Groups "kernel-team" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>
Suren Baghdasaryan April 11, 2019, 4:54 p.m. UTC | #7
On Thu, Apr 11, 2019 at 5:16 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 11-04-19 07:51:21, Rik van Riel wrote:
> > On Wed, 2019-04-10 at 18:43 -0700, Suren Baghdasaryan via Lsf-pc wrote:
> > > The time to kill a process and free its memory can be critical when
> > > the
> > > killing was done to prevent memory shortages affecting system
> > > responsiveness.
> >
> > The OOM killer is fickle, and often takes a fairly
> > long time to trigger. Speeding up what happens after
> > that seems like the wrong thing to optimize.
> >
> > Have you considered using something like oomd to
> > proactively kill tasks when memory gets low, so
> > you do not have to wait for an OOM kill?
>
> AFAIU, this is the point here. They probably have a user space OOM
> killer implementation and want to achieve killing to be as swift as
> possible.

That is correct. Android has a userspace daemon called lmkd (low
memory killer daemon) to respond to memory pressure before things get
bad enough for kernel oom-killer to get involved. So this asynchronous
reclaim optimization would allow lmkd do its job more efficiently.

> --
> Michal Hocko
> SUSE Labs
>
> --
> You received this message because you are subscribed to the Google Groups "kernel-team" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>
Johannes Weiner April 11, 2019, 5:19 p.m. UTC | #8
On Thu, Apr 11, 2019 at 12:51:11PM +0200, Michal Hocko wrote:
> I would question whether we really need this at all? Relying on the exit
> speed sounds like a fundamental design problem of anything that relies
> on it. Sure task exit might be slow, but async mm tear down is just a
> mere optimization this is not guaranteed to really help in speading
> things up. OOM killer uses it as a guarantee for a forward progress in a
> finite time rather than as soon as possible.

I don't think it's flawed, it's just optimizing the user experience as
best as it can. You don't want to kill things prematurely, but once
there is pressure you want to rectify it quickly. That's valid.

We have a tool that does this, side effect or not, so I think it's
fair to try to make use of it when oom killing from userspace (which
we explictily support with oom_control in cgroup1 and memory.high in
cgroup2, and it's not just an Android thing).

The question is how explicit a contract we want to make with
userspace, and I would much prefer to not overpromise on a best-effort
thing like this, or even making the oom reaper ABI.

If unconditionally reaping killed tasks is too expensive, I'd much
prefer a simple kill hint over an explicit task reclaim interface.
Michal Hocko April 11, 2019, 6:12 p.m. UTC | #9
On Thu 11-04-19 12:18:33, Joel Fernandes wrote:
> On Thu, Apr 11, 2019 at 6:51 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote:
> > [...]
> > > Proposed solution uses existing oom-reaper thread to increase memory
> > > reclaim rate of a killed process and to make this rate more deterministic.
> > > By no means the proposed solution is considered the best and was chosen
> > > because it was simple to implement and allowed for test data collection.
> > > The downside of this solution is that it requires additional “expedite”
> > > hint for something which has to be fast in all cases. Would be great to
> > > find a way that does not require additional hints.
> >
> > I have to say I do not like this much. It is abusing an implementation
> > detail of the OOM implementation and makes it an official API. Also
> > there are some non trivial assumptions to be fullfilled to use the
> > current oom_reaper. First of all all the process groups that share the
> > address space have to be killed. How do you want to guarantee/implement
> > that with a simply kill to a thread/process group?
> 
> Will task_will_free_mem() not bail out in such cases because of
> process_shares_mm() returning true?

I am not really sure I understand your question. task_will_free_mem is
just a shortcut to not kill anything if the current process or a victim
is already dying and likely to free memory without killing or spamming
the log. My concern is that this patch allows to invoke the reaper
without guaranteeing the same. So it can only be an optimistic attempt
and then I am wondering how reasonable of an interface this really is.
Userspace send the signal and has no way to find out whether the async
reaping has been scheduled or not.
Michal Hocko April 11, 2019, 6:19 p.m. UTC | #10
On Thu 11-04-19 09:47:31, Suren Baghdasaryan wrote:
[...]
> > I would question whether we really need this at all? Relying on the exit
> > speed sounds like a fundamental design problem of anything that relies
> > on it.
> 
> Relying on it is wrong, I agree. There are protections like allocation
> throttling that we can fall back to stop memory depletion. However
> having a way to free up resources that are not needed by a dying
> process quickly would help to avoid throttling which hurts user
> experience.

I am not opposing speeding up the exit time in general. That is a good
thing. Especially for a very large processes (e.g. a DB). But I do not
really think we want to expose an API to control this specific aspect.
Joel Fernandes April 11, 2019, 7:14 p.m. UTC | #11
On Thu, Apr 11, 2019 at 08:12:43PM +0200, Michal Hocko wrote:
> On Thu 11-04-19 12:18:33, Joel Fernandes wrote:
> > On Thu, Apr 11, 2019 at 6:51 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote:
> > > [...]
> > > > Proposed solution uses existing oom-reaper thread to increase memory
> > > > reclaim rate of a killed process and to make this rate more deterministic.
> > > > By no means the proposed solution is considered the best and was chosen
> > > > because it was simple to implement and allowed for test data collection.
> > > > The downside of this solution is that it requires additional “expedite”
> > > > hint for something which has to be fast in all cases. Would be great to
> > > > find a way that does not require additional hints.
> > >
> > > I have to say I do not like this much. It is abusing an implementation
> > > detail of the OOM implementation and makes it an official API. Also
> > > there are some non trivial assumptions to be fullfilled to use the
> > > current oom_reaper. First of all all the process groups that share the
> > > address space have to be killed. How do you want to guarantee/implement
> > > that with a simply kill to a thread/process group?
> > 
> > Will task_will_free_mem() not bail out in such cases because of
> > process_shares_mm() returning true?
> 
> I am not really sure I understand your question. task_will_free_mem is
> just a shortcut to not kill anything if the current process or a victim
> is already dying and likely to free memory without killing or spamming
> the log. My concern is that this patch allows to invoke the reaper

Got it.

> without guaranteeing the same. So it can only be an optimistic attempt
> and then I am wondering how reasonable of an interface this really is.
> Userspace send the signal and has no way to find out whether the async
> reaping has been scheduled or not.

Could you clarify more what you're asking to guarantee? I cannot picture it.
If you mean guaranteeing that "a task is dying anyway and will free its
memory on its own", we are calling task_will_free_mem() to check that before
invoking the oom reaper.

Could you clarify what is the draback if OOM reaper is invoked in parallel to
an exiting task which will free its memory soon? It looks like the OOM reaper
is taking all the locks necessary (mmap_sem) in particular and is unmapping
pages. It seemed to me to be safe, but I am missing what are the main draw
backs of this - other than the intereference with core dump. One could be
presumably scalability since the since OOM reaper could be bottlenecked by
freeing memory on behalf of potentially several dying tasks.

IIRC this patch is just Ok with being opportunistic and it need not be hidden
behind an API necessarily or need any guarantees. It is just providing a hint
that the OOM reaper could be woken up to expedite things. If a task is going
to be taking a long time to be scheduled and free its memory, the oom reaper
gives a headstart.  Many of the times, background tasks can be killed but
they may not have necessarily sufficient scheduler priority / cpuset (being
in the background) and may be holding onto a lot of memory that needs to be
reclaimed.

I am not saying this the right way to do it, but I also wanted us to
understand the drawbacks so that we can go back to the drawing board and come
up with something better.

Thanks!

 - Joel
Suren Baghdasaryan April 11, 2019, 7:56 p.m. UTC | #12
On Thu, Apr 11, 2019 at 11:19 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 11-04-19 09:47:31, Suren Baghdasaryan wrote:
> [...]
> > > I would question whether we really need this at all? Relying on the exit
> > > speed sounds like a fundamental design problem of anything that relies
> > > on it.
> >
> > Relying on it is wrong, I agree. There are protections like allocation
> > throttling that we can fall back to stop memory depletion. However
> > having a way to free up resources that are not needed by a dying
> > process quickly would help to avoid throttling which hurts user
> > experience.
>
> I am not opposing speeding up the exit time in general. That is a good
> thing. Especially for a very large processes (e.g. a DB). But I do not
> really think we want to expose an API to control this specific aspect.

Great! Thanks for confirming that the intent is not worthless.
There were a number of ideas floating both internally and in the 2/2
of this patchset. I would like to get some input on which
implementation would be preferable. From your answer sounds like you
think it should be a generic feature, should not require any new APIs
or hints from the userspace and should be conducted for all kills
unconditionally (irrespective of memory pressure, who is waiting for
victim's death, etc.). Do I understand correctly that this would be
the preferred solution?

> --
> Michal Hocko
> SUSE Labs
>
> --
> You received this message because you are subscribed to the Google Groups "kernel-team" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>
Michal Hocko April 11, 2019, 8:11 p.m. UTC | #13
On Thu 11-04-19 15:14:30, Joel Fernandes wrote:
> On Thu, Apr 11, 2019 at 08:12:43PM +0200, Michal Hocko wrote:
> > On Thu 11-04-19 12:18:33, Joel Fernandes wrote:
> > > On Thu, Apr 11, 2019 at 6:51 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > > On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote:
> > > > [...]
> > > > > Proposed solution uses existing oom-reaper thread to increase memory
> > > > > reclaim rate of a killed process and to make this rate more deterministic.
> > > > > By no means the proposed solution is considered the best and was chosen
> > > > > because it was simple to implement and allowed for test data collection.
> > > > > The downside of this solution is that it requires additional “expedite”
> > > > > hint for something which has to be fast in all cases. Would be great to
> > > > > find a way that does not require additional hints.
> > > >
> > > > I have to say I do not like this much. It is abusing an implementation
> > > > detail of the OOM implementation and makes it an official API. Also
> > > > there are some non trivial assumptions to be fullfilled to use the
> > > > current oom_reaper. First of all all the process groups that share the
> > > > address space have to be killed. How do you want to guarantee/implement
> > > > that with a simply kill to a thread/process group?
> > > 
> > > Will task_will_free_mem() not bail out in such cases because of
> > > process_shares_mm() returning true?
> > 
> > I am not really sure I understand your question. task_will_free_mem is
> > just a shortcut to not kill anything if the current process or a victim
> > is already dying and likely to free memory without killing or spamming
> > the log. My concern is that this patch allows to invoke the reaper
> 
> Got it.
> 
> > without guaranteeing the same. So it can only be an optimistic attempt
> > and then I am wondering how reasonable of an interface this really is.
> > Userspace send the signal and has no way to find out whether the async
> > reaping has been scheduled or not.
> 
> Could you clarify more what you're asking to guarantee? I cannot picture it.
> If you mean guaranteeing that "a task is dying anyway and will free its
> memory on its own", we are calling task_will_free_mem() to check that before
> invoking the oom reaper.

No, I am talking about the API aspect. Say you kall kill with the flag
to make the async address space tear down. Now you cannot really
guarantee that this is safe to do because the target task might
clone(CLONE_VM) at any time. So this will be known only once the signal
is sent, but the calling process has no way to find out. So the caller
has no way to know what is the actual result of the requested operation.
That is a poor API in my book.

> Could you clarify what is the draback if OOM reaper is invoked in parallel to
> an exiting task which will free its memory soon? It looks like the OOM reaper
> is taking all the locks necessary (mmap_sem) in particular and is unmapping
> pages. It seemed to me to be safe, but I am missing what are the main draw
> backs of this - other than the intereference with core dump. One could be
> presumably scalability since the since OOM reaper could be bottlenecked by
> freeing memory on behalf of potentially several dying tasks.

oom_reaper or any other kernel thread doing the same is a mere
implementation detail I think. The oom killer doesn't really need the
oom_reaper to act swiftly because it is there to act as a last resort if
the oom victim cannot terminate on its own. If you want to offer an
user space API then you can assume users will like to use it and expect
a certain behavior but what that is? E.g. what if there are thousands of
tasks killed this way? Do we care that some of them will not get the
async treatment? If yes why do we need an API to control that at all?

Am I more clear now?
Michal Hocko April 11, 2019, 8:17 p.m. UTC | #14
On Thu 11-04-19 12:56:32, Suren Baghdasaryan wrote:
> On Thu, Apr 11, 2019 at 11:19 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Thu 11-04-19 09:47:31, Suren Baghdasaryan wrote:
> > [...]
> > > > I would question whether we really need this at all? Relying on the exit
> > > > speed sounds like a fundamental design problem of anything that relies
> > > > on it.
> > >
> > > Relying on it is wrong, I agree. There are protections like allocation
> > > throttling that we can fall back to stop memory depletion. However
> > > having a way to free up resources that are not needed by a dying
> > > process quickly would help to avoid throttling which hurts user
> > > experience.
> >
> > I am not opposing speeding up the exit time in general. That is a good
> > thing. Especially for a very large processes (e.g. a DB). But I do not
> > really think we want to expose an API to control this specific aspect.
> 
> Great! Thanks for confirming that the intent is not worthless.
> There were a number of ideas floating both internally and in the 2/2
> of this patchset. I would like to get some input on which
> implementation would be preferable. From your answer sounds like you
> think it should be a generic feature, should not require any new APIs
> or hints from the userspace and should be conducted for all kills
> unconditionally (irrespective of memory pressure, who is waiting for
> victim's death, etc.). Do I understand correctly that this would be
> the preferred solution?

Yes, I think the general tear down solution is much more preferable than
a questionable API. How that solution should look like is an open
question. I am not sure myself to be honest.
Joel Fernandes April 11, 2019, 9:11 p.m. UTC | #15
On Thu, Apr 11, 2019 at 10:11:51PM +0200, Michal Hocko wrote:
> On Thu 11-04-19 15:14:30, Joel Fernandes wrote:
> > On Thu, Apr 11, 2019 at 08:12:43PM +0200, Michal Hocko wrote:
> > > On Thu 11-04-19 12:18:33, Joel Fernandes wrote:
> > > > On Thu, Apr 11, 2019 at 6:51 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > > >
> > > > > On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote:
> > > > > [...]
> > > > > > Proposed solution uses existing oom-reaper thread to increase memory
> > > > > > reclaim rate of a killed process and to make this rate more deterministic.
> > > > > > By no means the proposed solution is considered the best and was chosen
> > > > > > because it was simple to implement and allowed for test data collection.
> > > > > > The downside of this solution is that it requires additional “expedite”
> > > > > > hint for something which has to be fast in all cases. Would be great to
> > > > > > find a way that does not require additional hints.
> > > > >
> > > > > I have to say I do not like this much. It is abusing an implementation
> > > > > detail of the OOM implementation and makes it an official API. Also
> > > > > there are some non trivial assumptions to be fullfilled to use the
> > > > > current oom_reaper. First of all all the process groups that share the
> > > > > address space have to be killed. How do you want to guarantee/implement
> > > > > that with a simply kill to a thread/process group?
> > > > 
> > > > Will task_will_free_mem() not bail out in such cases because of
> > > > process_shares_mm() returning true?
> > > 
> > > I am not really sure I understand your question. task_will_free_mem is
> > > just a shortcut to not kill anything if the current process or a victim
> > > is already dying and likely to free memory without killing or spamming
> > > the log. My concern is that this patch allows to invoke the reaper
> > 
> > Got it.
> > 
> > > without guaranteeing the same. So it can only be an optimistic attempt
> > > and then I am wondering how reasonable of an interface this really is.
> > > Userspace send the signal and has no way to find out whether the async
> > > reaping has been scheduled or not.
> > 
> > Could you clarify more what you're asking to guarantee? I cannot picture it.
> > If you mean guaranteeing that "a task is dying anyway and will free its
> > memory on its own", we are calling task_will_free_mem() to check that before
> > invoking the oom reaper.
> 
> No, I am talking about the API aspect. Say you kall kill with the flag
> to make the async address space tear down. Now you cannot really
> guarantee that this is safe to do because the target task might
> clone(CLONE_VM) at any time. So this will be known only once the signal
> is sent, but the calling process has no way to find out. So the caller
> has no way to know what is the actual result of the requested operation.
> That is a poor API in my book.
> 
> > Could you clarify what is the draback if OOM reaper is invoked in parallel to
> > an exiting task which will free its memory soon? It looks like the OOM reaper
> > is taking all the locks necessary (mmap_sem) in particular and is unmapping
> > pages. It seemed to me to be safe, but I am missing what are the main draw
> > backs of this - other than the intereference with core dump. One could be
> > presumably scalability since the since OOM reaper could be bottlenecked by
> > freeing memory on behalf of potentially several dying tasks.
> 
> oom_reaper or any other kernel thread doing the same is a mere
> implementation detail I think. The oom killer doesn't really need the
> oom_reaper to act swiftly because it is there to act as a last resort if
> the oom victim cannot terminate on its own. If you want to offer an
> user space API then you can assume users will like to use it and expect
> a certain behavior but what that is? E.g. what if there are thousands of
> tasks killed this way? Do we care that some of them will not get the
> async treatment? If yes why do we need an API to control that at all?
> 
> Am I more clear now?

Yes, your concerns are more clear now. We will think more about this and your
other responses, thanks a lot.

 - Joel