Message ID | 20190411014353.113252-1-surenb@google.com (mailing list archive)
Series     | opportunistic memory reclaim of a killed process
On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote:
[...]
> Proposed solution uses existing oom-reaper thread to increase memory
> reclaim rate of a killed process and to make this rate more deterministic.
> By no means the proposed solution is considered the best and was chosen
> because it was simple to implement and allowed for test data collection.
> The downside of this solution is that it requires additional “expedite”
> hint for something which has to be fast in all cases. Would be great to
> find a way that does not require additional hints.

I have to say I do not like this much. It is abusing an implementation
detail of the OOM implementation and makes it an official API. Also
there are some non-trivial assumptions to be fulfilled to use the
current oom_reaper. First of all, all the process groups that share the
address space have to be killed. How do you want to guarantee/implement
that with a simple kill to a thread/process group?

> Other possible approaches include:
> - Implementing a dedicated syscall to perform opportunistic reclaim in the
>   context of the process waiting for the victim’s death. A natural boost
>   bonus occurs if the waiting process has high or RT priority and is not
>   limited by cpuset cgroup in its CPU choices.
> - Implement a mechanism that would perform opportunistic reclaim
>   unconditionally if it’s possible (similar to checks in task_will_free_mem()).
> - Implement opportunistic reclaim that uses the shrinker interface, PSI or
>   other memory pressure indications as a hint to engage.

I would question whether we really need this at all? Relying on the exit
speed sounds like a fundamental design problem of anything that relies
on it. Sure, task exit might be slow, but async mm teardown is just a
mere optimization which is not guaranteed to really help in speeding
things up. The OOM killer uses it as a guarantee for forward progress in
a finite time rather than as soon as possible.
On Wed, 2019-04-10 at 18:43 -0700, Suren Baghdasaryan via Lsf-pc wrote:
> The time to kill a process and free its memory can be critical when the
> killing was done to prevent memory shortages affecting system
> responsiveness.

The OOM killer is fickle, and often takes a fairly long time to trigger.
Speeding up what happens after that seems like the wrong thing to
optimize.

Have you considered using something like oomd to proactively kill tasks
when memory gets low, so you do not have to wait for an OOM kill?
On Thu 11-04-19 07:51:21, Rik van Riel wrote:
> On Wed, 2019-04-10 at 18:43 -0700, Suren Baghdasaryan via Lsf-pc wrote:
> > The time to kill a process and free its memory can be critical when the
> > killing was done to prevent memory shortages affecting system
> > responsiveness.
>
> The OOM killer is fickle, and often takes a fairly long time to trigger.
> Speeding up what happens after that seems like the wrong thing to
> optimize.
>
> Have you considered using something like oomd to proactively kill tasks
> when memory gets low, so you do not have to wait for an OOM kill?

AFAIU, this is the point here. They probably have a userspace OOM
killer implementation and want to achieve killing to be as swift as
possible.
On Thu, Apr 11, 2019 at 6:51 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote:
> [...]
> I have to say I do not like this much. It is abusing an implementation
> detail of the OOM implementation and makes it an official API. Also
> there are some non-trivial assumptions to be fulfilled to use the
> current oom_reaper. First of all, all the process groups that share the
> address space have to be killed. How do you want to guarantee/implement
> that with a simple kill to a thread/process group?

Will task_will_free_mem() not bail out in such cases because of
process_shares_mm() returning true? AFAIU, Suren's patch calls that.
Also, if I understand correctly, this patch is opportunistic and knows
in advance that it may not be possible to reap this way in all cases:

	/*
	 * Make sure that all tasks which share the mm with the given tasks
	 * are dying as well to make sure that a) nobody pins its mm and
	 * b) the task is also reapable by the oom reaper.
	 */
	rcu_read_lock();
	for_each_process(p) {
		if (!process_shares_mm(p, mm))
			continue;
		if (same_thread_group(task, p))
			continue;
		ret = __task_will_free_mem(p);
		if (!ret)
			break;
	}
	rcu_read_unlock();

> > Other possible approaches include:
> > - Implementing a dedicated syscall to perform opportunistic reclaim in the
> >   context of the process waiting for the victim’s death. A natural boost
> >   bonus occurs if the waiting process has high or RT priority and is not
> >   limited by cpuset cgroup in its CPU choices.
> > - Implement a mechanism that would perform opportunistic reclaim
> >   unconditionally if it’s possible (similar to checks in task_will_free_mem()).
> > - Implement opportunistic reclaim that uses the shrinker interface, PSI or
> >   other memory pressure indications as a hint to engage.
>
> I would question whether we really need this at all? Relying on the exit
> speed sounds like a fundamental design problem of anything that relies
> on it. Sure, task exit might be slow, but async mm teardown is just a
> mere optimization which is not guaranteed to really help in speeding
> things up. The OOM killer uses it as a guarantee for forward progress in
> a finite time rather than as soon as possible.

Per the data collected by Suren, it does speed things up. It would be
nice if we can reuse this mechanism, or come up with a similar one.

thanks,

- Joel
On Thu, Apr 11, 2019 at 12:51:11PM +0200, Michal Hocko wrote:
> On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote:
> [...]
> I would question whether we really need this at all? Relying on the exit
> speed sounds like a fundamental design problem of anything that relies
> on it.

OTOH, we want to keep as many processes around as possible for recency,
in which case the exit path (particularly the memory reclaim) becomes
critical for maintaining interactivity on phones. Android keeps
processes around because cold-starting an application is much slower
than simply bringing it up from the background. This obviously presents
a problem: when a background application _is_ killed, it is almost
always to address a sudden spike in memory needs by something else much
more important and user-visible, e.g. a foreground application or a
critical system process.

> Sure, task exit might be slow, but async mm teardown is just a
> mere optimization which is not guaranteed to really help in speeding
> things up. The OOM killer uses it as a guarantee for forward progress in
> a finite time rather than as soon as possible.

With the OOM killer, things are already really bad. When lmkd[1] kills
processes, it is doing so to serve the immediate needs of the system
while trying to avoid the OOM killer.

- ssp

[1] https://android.googlesource.com/platform/system/core/+/refs/heads/master/lmkd/
Thanks for the feedback!

On Thu, Apr 11, 2019 at 3:51 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote:
> [...]
> I have to say I do not like this much. It is abusing an implementation
> detail of the OOM implementation and makes it an official API.

I agree with you that this particular implementation is abusing oom
internal machinery and I don't think it is acceptable as is (hence this
is sent as an RFC). I would like to discuss the viability of the idea
of reaping the kill victim's mm asynchronously. If we agree that this
is worth our time, only then would I love to get into more details on
how to implement it. The implementation in this RFC is a convenient way
to illustrate the idea and to collect test data.

> Also there are some non-trivial assumptions to be fulfilled to use the
> current oom_reaper. First of all, all the process groups that share the
> address space have to be killed. How do you want to guarantee/implement
> that with a simple kill to a thread/process group?

I'm not sure I understood this correctly, but if you are asking how we
know that the mm we are reaping is not shared with processes that are
not being killed, then I think your task_will_free_mem() checks for
that. Or have I misunderstood your question?

> > Other possible approaches include:
> [...]
>
> I would question whether we really need this at all? Relying on the exit
> speed sounds like a fundamental design problem of anything that relies
> on it.

Relying on it is wrong, I agree. There are protections like allocation
throttling that we can fall back on to stop memory depletion. However,
having a way to quickly free up resources that are not needed by a
dying process would help to avoid the throttling, which hurts user
experience.

I agree that this is an optimization which is beneficial in a specific
case: when we kill to free up resources. However, this is an important
optimization for systems with low memory resources like embedded
systems, phones, etc. The only way to prevent being cornered into
throttling is to increase the free memory margin that the system needs
to maintain (I describe this in my cover letter), and with limited
overall memory resources, memory space is at a premium, so we try to
decrease that margin.

I think the other, arguably even more important, issue than the speed
of memory reclaim is that this speed depends on what the victim is
doing at the time of the kill. This introduces non-determinism in how
fast we can free up resources, and at this point we don't even know how
much safety margin we need.

> Sure, task exit might be slow, but async mm teardown is just a
> mere optimization which is not guaranteed to really help in speeding
> things up. The OOM killer uses it as a guarantee for forward progress in
> a finite time rather than as soon as possible.
> --
> Michal Hocko
> SUSE Labs
>
> --
> You received this message because you are subscribed to the Google Groups "kernel-team" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
On Thu, Apr 11, 2019 at 5:16 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 11-04-19 07:51:21, Rik van Riel wrote:
> > [...]
> > Have you considered using something like oomd to proactively kill tasks
> > when memory gets low, so you do not have to wait for an OOM kill?
>
> AFAIU, this is the point here. They probably have a userspace OOM
> killer implementation and want to achieve killing to be as swift as
> possible.

That is correct. Android has a userspace daemon called lmkd (low memory
killer daemon) to respond to memory pressure before things get bad
enough for the kernel oom-killer to get involved. So this asynchronous
reclaim optimization would allow lmkd to do its job more efficiently.
On Thu, Apr 11, 2019 at 12:51:11PM +0200, Michal Hocko wrote:
> I would question whether we really need this at all? Relying on the exit
> speed sounds like a fundamental design problem of anything that relies
> on it. Sure, task exit might be slow, but async mm teardown is just a
> mere optimization which is not guaranteed to really help in speeding
> things up. The OOM killer uses it as a guarantee for forward progress in
> a finite time rather than as soon as possible.

I don't think it's flawed; it's just optimizing the user experience as
best as it can. You don't want to kill things prematurely, but once
there is pressure you want to rectify it quickly. That's valid.

We have a tool that does this, side effect or not, so I think it's fair
to try to make use of it when oom-killing from userspace (which we
explicitly support with oom_control in cgroup1 and memory.high in
cgroup2, and it's not just an Android thing).

The question is how explicit a contract we want to make with userspace,
and I would much prefer not to overpromise on a best-effort thing like
this, or to make the oom reaper ABI. If unconditionally reaping killed
tasks is too expensive, I'd much prefer a simple kill hint over an
explicit task-reclaim interface.
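As a concrete illustration of the userspace-driven killing Johannes mentions — a sketch only, assuming cgroup v2 is mounted at /sys/fs/cgroup and the caller has the needed privileges; the cgroup name and limit are made up for the example:

```shell
# Give the app its own cgroup and set a soft ceiling with memory.high;
# above it the kernel throttles and reclaims instead of OOM-killing,
# buying a userspace daemon time to pick and kill a victim itself.
mkdir -p /sys/fs/cgroup/app
echo 512M > /sys/fs/cgroup/app/memory.high
echo "$APP_PID" > /sys/fs/cgroup/app/cgroup.procs
```

The daemon then watches memory.pressure (PSI) or memory.events for the group and sends SIGKILL on its own policy, which is exactly the point where fast, deterministic teardown matters.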
On Thu 11-04-19 12:18:33, Joel Fernandes wrote:
> On Thu, Apr 11, 2019 at 6:51 AM Michal Hocko <mhocko@kernel.org> wrote:
> > [...]
> > I have to say I do not like this much. It is abusing an implementation
> > detail of the OOM implementation and makes it an official API. Also
> > there are some non-trivial assumptions to be fulfilled to use the
> > current oom_reaper. First of all, all the process groups that share the
> > address space have to be killed. How do you want to guarantee/implement
> > that with a simple kill to a thread/process group?
>
> Will task_will_free_mem() not bail out in such cases because of
> process_shares_mm() returning true?

I am not really sure I understand your question. task_will_free_mem is
just a shortcut to not kill anything if the current process or a victim
is already dying and likely to free memory without killing or spamming
the log. My concern is that this patch allows invoking the reaper
without guaranteeing the same. So it can only be an optimistic attempt,
and then I am wondering how reasonable of an interface this really is.
Userspace sends the signal and has no way to find out whether the async
reaping has been scheduled or not.
On Thu 11-04-19 09:47:31, Suren Baghdasaryan wrote:
[...]
> > I would question whether we really need this at all? Relying on the exit
> > speed sounds like a fundamental design problem of anything that relies
> > on it.
>
> Relying on it is wrong, I agree. There are protections like allocation
> throttling that we can fall back on to stop memory depletion. However,
> having a way to quickly free up resources that are not needed by a
> dying process would help to avoid throttling, which hurts user
> experience.

I am not opposing speeding up the exit time in general. That is a good
thing, especially for very large processes (e.g. a DB). But I do not
really think we want to expose an API to control this specific aspect.
On Thu, Apr 11, 2019 at 08:12:43PM +0200, Michal Hocko wrote:
> On Thu 11-04-19 12:18:33, Joel Fernandes wrote:
> [...]
> I am not really sure I understand your question. task_will_free_mem is
> just a shortcut to not kill anything if the current process or a victim
> is already dying and likely to free memory without killing or spamming
> the log. My concern is that this patch allows invoking the reaper

Got it.

> without guaranteeing the same. So it can only be an optimistic attempt,
> and then I am wondering how reasonable of an interface this really is.
> Userspace sends the signal and has no way to find out whether the async
> reaping has been scheduled or not.

Could you clarify more what you're asking to guarantee? I cannot
picture it. If you mean guaranteeing that "a task is dying anyway and
will free its memory on its own", we are calling task_will_free_mem()
to check that before invoking the oom reaper.

Could you clarify what the drawback is if the OOM reaper is invoked in
parallel to an exiting task which will free its memory soon? It looks
like the OOM reaper takes all the locks necessary (mmap_sem in
particular) and unmaps pages. It seemed safe to me, but I am missing
what the main drawbacks of this are, other than the interference with
core dumps. One could presumably be scalability, since the OOM reaper
could be bottlenecked by freeing memory on behalf of potentially
several dying tasks.

IIRC, this patch is just OK with being opportunistic, and it need not
be hidden behind an API necessarily or need any guarantees. It is just
providing a hint that the OOM reaper could be woken up to expedite
things. If a task is going to take a long time to be scheduled and free
its memory, the oom reaper gives a head start. Many times, background
tasks can be killed, but they may not necessarily have sufficient
scheduler priority / cpuset (being in the background) and may be
holding onto a lot of memory that needs to be reclaimed.

I am not saying this is the right way to do it, but I also wanted us to
understand the drawbacks so that we can go back to the drawing board
and come up with something better.

thanks!

- Joel
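For context, the reaping Joel describes corresponds roughly to the following sequence in mm/oom_kill.c around v5.1 — paraphrased pseudocode, not the literal kernel source, and the exact checks vary by kernel version:

```
oom_reap_task(victim):
    repeat up to MAX_OOM_REAP_RETRIES:
        if down_read_trylock(mm->mmap_sem) fails:
            sleep briefly and retry          # e.g. exiting task holds it
        if MMF_OOM_SKIP is already set:
            stop                             # mm already reaped/exited
        for each vma in mm:
            skip VM_LOCKED, hugetlb and shared/file-backed mappings
            unmap_page_range(vma)            # drop private anonymous pages
        up_read(mm->mmap_sem)
    set MMF_OOM_SKIP on mm->flags            # lets the OOM killer move on
```

Only the read side of mmap_sem is taken, which is why it can race benignly with exit_mmap(); the contended cases (trylock failures, core dumping) are what the retry loop and the skips paper over.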
On Thu, Apr 11, 2019 at 11:19 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 11-04-19 09:47:31, Suren Baghdasaryan wrote:
> [...]
> I am not opposing speeding up the exit time in general. That is a good
> thing, especially for very large processes (e.g. a DB). But I do not
> really think we want to expose an API to control this specific aspect.

Great! Thanks for confirming that the intent is not worthless.
There were a number of ideas floating both internally and in patch 2/2
of this patchset. I would like to get some input on which
implementation would be preferable. From your answer it sounds like you
think it should be a generic feature, should not require any new APIs
or hints from userspace, and should be conducted for all kills
unconditionally (irrespective of memory pressure, who is waiting for
the victim's death, etc.). Do I understand correctly that this would be
the preferred solution?
On Thu 11-04-19 15:14:30, Joel Fernandes wrote:
> On Thu, Apr 11, 2019 at 08:12:43PM +0200, Michal Hocko wrote:
> [...]
> > My concern is that this patch allows invoking the reaper
> > without guaranteeing the same. So it can only be an optimistic attempt,
> > and then I am wondering how reasonable of an interface this really is.
> > Userspace sends the signal and has no way to find out whether the async
> > reaping has been scheduled or not.
>
> Could you clarify more what you're asking to guarantee? I cannot picture
> it. If you mean guaranteeing that "a task is dying anyway and will free
> its memory on its own", we are calling task_will_free_mem() to check
> that before invoking the oom reaper.

No, I am talking about the API aspect. Say you call kill with the flag
to make the async address-space teardown. Now you cannot really
guarantee that this is safe to do, because the target task might
clone(CLONE_VM) at any time. So this will be known only once the signal
is sent, but the calling process has no way to find out. So the caller
has no way to know what the actual result of the requested operation
was. That is a poor API in my book.

> Could you clarify what the drawback is if the OOM reaper is invoked in
> parallel to an exiting task which will free its memory soon? [...]

oom_reaper or any other kernel thread doing the same is a mere
implementation detail, I think. The oom killer doesn't really need the
oom_reaper to act swiftly, because it is there to act as a last resort
if the oom victim cannot terminate on its own. If you want to offer a
userspace API, then you can assume users will like to use it and expect
a certain behavior, but what is that? E.g. what if there are thousands
of tasks killed this way? Do we care that some of them will not get the
async treatment? If yes, why do we need an API to control that at all?

Am I more clear now?
On Thu 11-04-19 12:56:32, Suren Baghdasaryan wrote:
> On Thu, Apr 11, 2019 at 11:19 AM Michal Hocko <mhocko@kernel.org> wrote:
> > [...]
> > I am not opposing speeding up the exit time in general. That is a good
> > thing, especially for very large processes (e.g. a DB). But I do not
> > really think we want to expose an API to control this specific aspect.
>
> Great! Thanks for confirming that the intent is not worthless.
> [...] From your answer it sounds like you think it should be a generic
> feature, should not require any new APIs or hints from userspace, and
> should be conducted for all kills unconditionally (irrespective of
> memory pressure, who is waiting for the victim's death, etc.). Do I
> understand correctly that this would be the preferred solution?

Yes, I think the general teardown solution is much more preferable than
a questionable API. How that solution should look is an open question.
I am not sure myself, to be honest.
On Thu, Apr 11, 2019 at 10:11:51PM +0200, Michal Hocko wrote:
> On Thu 11-04-19 15:14:30, Joel Fernandes wrote:
> [...]
> No, I am talking about the API aspect. Say you call kill with the flag
> to make the async address-space teardown. Now you cannot really
> guarantee that this is safe to do, because the target task might
> clone(CLONE_VM) at any time. So this will be known only once the signal
> is sent, but the calling process has no way to find out. So the caller
> has no way to know what the actual result of the requested operation
> was. That is a poor API in my book.
> [...]
> oom_reaper or any other kernel thread doing the same is a mere
> implementation detail, I think. [...] E.g. what if there are thousands
> of tasks killed this way? Do we care that some of them will not get the
> async treatment? If yes, why do we need an API to control that at all?
>
> Am I more clear now?

Yes, your concerns are more clear now. We will think more about this
and your other responses, thanks a lot.

- Joel