Message ID | 20201113173448.1863419-1-surenb@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [1/1] RFC: add pidfd_send_signal flag to reclaim mm while killing a process | expand |
On Fri, 13 Nov 2020 09:34:48 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > When a process is being killed it might be in an uninterruptible sleep > which leads to an unpredictable delay in its memory reclaim. In low memory > situations, when it's important to free up memory quickly, such delay is > problematic. Kernel solves this problem with oom-reaper thread which > performs memory reclaim even when the victim process is not runnable. > Userspace currently lacks such mechanisms and the need and potential > solutions were discussed before (see links below). > This patch provides a mechanism to perform memory reclaim in the context > of the process that sends SIGKILL signal. New SYNC_REAP_MM flag for > pidfd_send_signal syscall can be used only when sending SIGKILL signal > and will lead to the caller synchronously reclaiming the memory that > belongs to the victim and can be easily reclaimed. hm. Seems to me that the ability to reap another process's memory is a generally useful one, and that it should not be tied to delivering a signal in this fashion. And we do have the new process_madvise(MADV_PAGEOUT). It may need a few changes and tweaks, but can't that be used to solve this problem?
On Fri, Nov 13, 2020 at 3:55 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > On Fri, 13 Nov 2020 09:34:48 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > > > When a process is being killed it might be in an uninterruptible sleep > > which leads to an unpredictable delay in its memory reclaim. In low memory > > situations, when it's important to free up memory quickly, such delay is > > problematic. Kernel solves this problem with oom-reaper thread which > > performs memory reclaim even when the victim process is not runnable. > > Userspace currently lacks such mechanisms and the need and potential > > solutions were discussed before (see links below). > > This patch provides a mechanism to perform memory reclaim in the context > > of the process that sends SIGKILL signal. New SYNC_REAP_MM flag for > > pidfd_send_signal syscall can be used only when sending SIGKILL signal > > and will lead to the caller synchronously reclaiming the memory that > > belongs to the victim and can be easily reclaimed. > > hm. > > Seems to me that the ability to reap another process's memory is a > generally useful one, and that it should not be tied to delivering a > signal in this fashion. > > And we do have the new process_madvise(MADV_PAGEOUT). It may need a > few changes and tweaks, but can't that be used to solve this problem? Thank you for the feedback, Andrew. process_madvise(MADV_DONTNEED) was one of the options recently discussed in https://lore.kernel.org/linux-api/CAJuCfpGz1kPM3G1gZH+09Z7aoWKg05QSAMMisJ7H5MdmRrRhNQ@mail.gmail.com . The thread describes some of the issues with that approach but if we limit it to processes with pending SIGKILL only then I think that would be doable.
On Fri, 13 Nov 2020 16:06:25 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > On Fri, Nov 13, 2020 at 3:55 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > > > On Fri, 13 Nov 2020 09:34:48 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > > > > > When a process is being killed it might be in an uninterruptible sleep > > > which leads to an unpredictable delay in its memory reclaim. In low memory > > > situations, when it's important to free up memory quickly, such delay is > > > problematic. Kernel solves this problem with oom-reaper thread which > > > performs memory reclaim even when the victim process is not runnable. > > > Userspace currently lacks such mechanisms and the need and potential > > > solutions were discussed before (see links below). > > > This patch provides a mechanism to perform memory reclaim in the context > > > of the process that sends SIGKILL signal. New SYNC_REAP_MM flag for > > > pidfd_send_signal syscall can be used only when sending SIGKILL signal > > > and will lead to the caller synchronously reclaiming the memory that > > > belongs to the victim and can be easily reclaimed. > > > > hm. > > > > Seems to me that the ability to reap another process's memory is a > > generally useful one, and that it should not be tied to delivering a > > signal in this fashion. > > > > And we do have the new process_madvise(MADV_PAGEOUT). It may need a > > few changes and tweaks, but can't that be used to solve this problem? > > Thank you for the feedback, Andrew. process_madvise(MADV_DONTNEED) was > one of the options recently discussed in > https://lore.kernel.org/linux-api/CAJuCfpGz1kPM3G1gZH+09Z7aoWKg05QSAMMisJ7H5MdmRrRhNQ@mail.gmail.com > . The thread describes some of the issues with that approach but if we > limit it to processes with pending SIGKILL only then I think that > would be doable. Why would it be necessary to read /proc/pid/maps? I'd have thought that a starting effort would be madvise((void *)0, (void *)-1, MADV_PAGEOUT) (after translation into process_madvise() speak). Which is equivalent to the proposed process_madvise(MADV_DONTNEED_MM)? There may be things which trip this up, such as mlocked regions or whatever, but we could add another madvise `advice' mode to handle this?
On Fri, Nov 13, 2020 at 5:00 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > On Fri, 13 Nov 2020 16:06:25 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > > > On Fri, Nov 13, 2020 at 3:55 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > > > > > On Fri, 13 Nov 2020 09:34:48 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > When a process is being killed it might be in an uninterruptible sleep > > > > which leads to an unpredictable delay in its memory reclaim. In low memory > > > > situations, when it's important to free up memory quickly, such delay is > > > > problematic. Kernel solves this problem with oom-reaper thread which > > > > performs memory reclaim even when the victim process is not runnable. > > > > Userspace currently lacks such mechanisms and the need and potential > > > > solutions were discussed before (see links below). > > > > This patch provides a mechanism to perform memory reclaim in the context > > > > of the process that sends SIGKILL signal. New SYNC_REAP_MM flag for > > > > pidfd_send_signal syscall can be used only when sending SIGKILL signal > > > > and will lead to the caller synchronously reclaiming the memory that > > > > belongs to the victim and can be easily reclaimed. > > > > > > hm. > > > > > > Seems to me that the ability to reap another process's memory is a > > > generally useful one, and that it should not be tied to delivering a > > > signal in this fashion. > > > > > > And we do have the new process_madvise(MADV_PAGEOUT). It may need a > > > few changes and tweaks, but can't that be used to solve this problem? > > > > Thank you for the feedback, Andrew. process_madvise(MADV_DONTNEED) was > > one of the options recently discussed in > > https://lore.kernel.org/linux-api/CAJuCfpGz1kPM3G1gZH+09Z7aoWKg05QSAMMisJ7H5MdmRrRhNQ@mail.gmail.com > > . The thread describes some of the issues with that approach but if we > > limit it to processes with pending SIGKILL only then I think that > > would be doable. > > Why would it be necessary to read /proc/pid/maps? I'd have thought > that a starting effort would be > > madvise((void *)0, (void *)-1, MADV_PAGEOUT) > > (after translation into process_madvise() speak). Which is equivalent > to the proposed process_madvise(MADV_DONTNEED_MM)? Yep, this is very similar to option #3 in https://lore.kernel.org/linux-api/CAJuCfpGz1kPM3G1gZH+09Z7aoWKg05QSAMMisJ7H5MdmRrRhNQ@mail.gmail.com and I actually have a tested prototype for that. If that's the preferred method then I can post it quite quickly. > > There may be things which trip this up, such as mlocked regions or > whatever, but we could add another madvise `advice' mode to handle > this?
On Fri, 13 Nov 2020 17:09:37 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > > > > Seems to me that the ability to reap another process's memory is a > > > > generally useful one, and that it should not be tied to delivering a > > > > signal in this fashion. > > > > > > > > And we do have the new process_madvise(MADV_PAGEOUT). It may need a > > > > few changes and tweaks, but can't that be used to solve this problem? > > > > > > Thank you for the feedback, Andrew. process_madvise(MADV_DONTNEED) was > > > one of the options recently discussed in > > > https://lore.kernel.org/linux-api/CAJuCfpGz1kPM3G1gZH+09Z7aoWKg05QSAMMisJ7H5MdmRrRhNQ@mail.gmail.com > > > . The thread describes some of the issues with that approach but if we > > > limit it to processes with pending SIGKILL only then I think that > > > would be doable. > > > > Why would it be necessary to read /proc/pid/maps? I'd have thought > > that a starting effort would be > > > > madvise((void *)0, (void *)-1, MADV_PAGEOUT) > > > > (after translation into process_madvise() speak). Which is equivalent > > to the proposed process_madvise(MADV_DONTNEED_MM)? > > Yep, this is very similar to option #3 in > https://lore.kernel.org/linux-api/CAJuCfpGz1kPM3G1gZH+09Z7aoWKg05QSAMMisJ7H5MdmRrRhNQ@mail.gmail.com > and I actually have a tested prototype for that. Why is the `vector=NULL' needed? Can't `vector' point at a single iovec which spans the whole address range? > If that's the > preferred method then I can post it quite quickly. I assume you've tested that prototype. How did its usefulness compare with this SIGKILL-based approach?
On Fri, Nov 13, 2020 at 5:18 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > On Fri, 13 Nov 2020 17:09:37 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > Seems to me that the ability to reap another process's memory is a > > > > > generally useful one, and that it should not be tied to delivering a > > > > > signal in this fashion. > > > > > > > > > > And we do have the new process_madvise(MADV_PAGEOUT). It may need a > > > > > few changes and tweaks, but can't that be used to solve this problem? > > > > > > > > Thank you for the feedback, Andrew. process_madvise(MADV_DONTNEED) was > > > > one of the options recently discussed in > > > > https://lore.kernel.org/linux-api/CAJuCfpGz1kPM3G1gZH+09Z7aoWKg05QSAMMisJ7H5MdmRrRhNQ@mail.gmail.com > > > > . The thread describes some of the issues with that approach but if we > > > > limit it to processes with pending SIGKILL only then I think that > > > > would be doable. > > > > > > Why would it be necessary to read /proc/pid/maps? I'd have thought > > > that a starting effort would be > > > > > > madvise((void *)0, (void *)-1, MADV_PAGEOUT) > > > > > > (after translation into process_madvise() speak). Which is equivalent > > > to the proposed process_madvise(MADV_DONTNEED_MM)? > > > > Yep, this is very similar to option #3 in > > https://lore.kernel.org/linux-api/CAJuCfpGz1kPM3G1gZH+09Z7aoWKg05QSAMMisJ7H5MdmRrRhNQ@mail.gmail.com > > and I actually have a tested prototype for that. > > Why is the `vector=NULL' needed? Can't `vector' point at a single iovec > which spans the whole address range? That would be the option #4 from the same discussion and the issues noted there are "process_madvise return value can't handle such a large number of bytes and there is MAX_RW_COUNT limit on max number of bytes one process_madvise call can handle". In my prototype I have a special handling for such "bulk operation" to work around the MAX_RW_COUNT limitation. > > > If that's the > > preferred method then I can post it quite quickly. > > I assume you've tested that prototype. How did its usefulness compare > with this SIGKILL-based approach? Just to make sure I understand correctly your question, you are asking about performance comparison of: // approach in this RFC pidfd_send_signal(SIGKILL, SYNC_REAP_MM) vs // option #4 in the previous RFC kill(SIGKILL); process_madvise(vector=NULL, MADV_DONTNEED); If so, I have results for the current RFC approach but the previous approach was testing on an older device, so don't have apples-to-apples comparison results at the moment. I can collect the data for fair comparison if desired, however I don't expect a noticeable performance difference since they both do pretty much the same thing (even on different devices my results are quite close). I think it's more a question of which API would be more appropriate. >
On Fri, 13 Nov 2020 17:57:02 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > On Fri, Nov 13, 2020 at 5:18 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > > > On Fri, 13 Nov 2020 17:09:37 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > Seems to me that the ability to reap another process's memory is a > > > > > > generally useful one, and that it should not be tied to delivering a > > > > > > signal in this fashion. > > > > > > > > > > > > And we do have the new process_madvise(MADV_PAGEOUT). It may need a > > > > > > few changes and tweaks, but can't that be used to solve this problem? > > > > > > > > > > Thank you for the feedback, Andrew. process_madvise(MADV_DONTNEED) was > > > > > one of the options recently discussed in > > > > > https://lore.kernel.org/linux-api/CAJuCfpGz1kPM3G1gZH+09Z7aoWKg05QSAMMisJ7H5MdmRrRhNQ@mail.gmail.com > > > > > . The thread describes some of the issues with that approach but if we > > > > > limit it to processes with pending SIGKILL only then I think that > > > > > would be doable. > > > > > > > > Why would it be necessary to read /proc/pid/maps? I'd have thought > > > > that a starting effort would be > > > > > > > > madvise((void *)0, (void *)-1, MADV_PAGEOUT) > > > > > > > > (after translation into process_madvise() speak). Which is equivalent > > > > to the proposed process_madvise(MADV_DONTNEED_MM)? > > > > > > Yep, this is very similar to option #3 in > > > https://lore.kernel.org/linux-api/CAJuCfpGz1kPM3G1gZH+09Z7aoWKg05QSAMMisJ7H5MdmRrRhNQ@mail.gmail.com > > > and I actually have a tested prototype for that. > > > > Why is the `vector=NULL' needed? Can't `vector' point at a single iovec > > which spans the whole address range? > > That would be the option #4 from the same discussion and the issues > noted there are "process_madvise return value can't handle such a > large number of bytes and there is MAX_RW_COUNT limit on max number of > bytes one process_madvise call can handle". In my prototype I have a > special handling for such "bulk operation" to work around the > MAX_RW_COUNT limitation. Ah, OK, return value. Maybe process_madvise() shouldn't have done that and should have simply returned 0 on success, like madvise(). I guess a special "nuke whole address space" command is OK. But, again in the search for generality, the ability to nuke very large amounts of address space (but not the entire address space) would be better. The process_madvise() return value issue could be addressed by adding a process_madvise() mode which return 0 on success. And I guess the MAX_RW_COUNT issue is solvable by adding an import_iovec() arg to say "don't check that". Along those lines. It's all sounding a bit painful (but not *too* painful). But to reiterate, I do think that adding the ability for a process to shoot down a large amount of another process's memory is a lot more generally useful than tying it to SIGKILL, agree? > > > > > If that's the > > > preferred method then I can post it quite quickly. > > > > I assume you've tested that prototype. How did its usefulness compare > > with this SIGKILL-based approach? > > Just to make sure I understand correctly your question, you are asking > about performance comparison of: > > // approach in this RFC > pidfd_send_signal(SIGKILL, SYNC_REAP_MM) > > vs > > // option #4 in the previous RFC > kill(SIGKILL); process_madvise(vector=NULL, MADV_DONTNEED); > > If so, I have results for the current RFC approach but the previous > approach was testing on an older device, so don't have > apples-to-apples comparison results at the moment. I can collect the > data for fair comparison if desired, however I don't expect a > noticeable performance difference since they both do pretty much the > same thing (even on different devices my results are quite close). I > think it's more a question of which API would be more appropriate. OK. I wouldn't expect performance to be very different (and things can be sped up if so), but the API usefulness might be an issue. Using process_madvise() (or similar) makes it a two-step operation, whereas tying it to SIGKILL&&TASK_UNINTERRUPTIBLE provides a more precise tool. Any thoughts on this?
On Fri, Nov 13, 2020 at 6:16 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > On Fri, 13 Nov 2020 17:57:02 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > > > On Fri, Nov 13, 2020 at 5:18 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > > > > > On Fri, 13 Nov 2020 17:09:37 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > > > Seems to me that the ability to reap another process's memory is a > > > > > > > generally useful one, and that it should not be tied to delivering a > > > > > > > signal in this fashion. > > > > > > > > > > > > > > And we do have the new process_madvise(MADV_PAGEOUT). It may need a > > > > > > > few changes and tweaks, but can't that be used to solve this problem? > > > > > > > > > > > > Thank you for the feedback, Andrew. process_madvise(MADV_DONTNEED) was > > > > > > one of the options recently discussed in > > > > > > https://lore.kernel.org/linux-api/CAJuCfpGz1kPM3G1gZH+09Z7aoWKg05QSAMMisJ7H5MdmRrRhNQ@mail.gmail.com > > > > > > . The thread describes some of the issues with that approach but if we > > > > > > limit it to processes with pending SIGKILL only then I think that > > > > > > would be doable. > > > > > > > > > > Why would it be necessary to read /proc/pid/maps? I'd have thought > > > > > that a starting effort would be > > > > > > > > > > madvise((void *)0, (void *)-1, MADV_PAGEOUT) > > > > > > > > > > (after translation into process_madvise() speak). Which is equivalent > > > > > to the proposed process_madvise(MADV_DONTNEED_MM)? > > > > > > > > Yep, this is very similar to option #3 in > > > > https://lore.kernel.org/linux-api/CAJuCfpGz1kPM3G1gZH+09Z7aoWKg05QSAMMisJ7H5MdmRrRhNQ@mail.gmail.com > > > > and I actually have a tested prototype for that. > > > > > > Why is the `vector=NULL' needed? Can't `vector' point at a single iovec > > > which spans the whole address range? > > > > That would be the option #4 from the same discussion and the issues > > noted there are "process_madvise return value can't handle such a > > large number of bytes and there is MAX_RW_COUNT limit on max number of > > bytes one process_madvise call can handle". In my prototype I have a > > special handling for such "bulk operation" to work around the > > MAX_RW_COUNT limitation. > > Ah, OK, return value. Maybe process_madvise() shouldn't have done that > and should have simply returned 0 on success, like madvise(). > > I guess a special "nuke whole address space" command is OK. But, again > in the search for generality, the ability to nuke very large amounts of > address space (but not the entire address space) would be better. > > The process_madvise() return value issue could be addressed by adding a > process_madvise() mode which return 0 on success. > > And I guess the MAX_RW_COUNT issue is solvable by adding an > import_iovec() arg to say "don't check that". Along those lines. > > It's all sounding a bit painful (but not *too* painful). But to > reiterate, I do think that adding the ability for a process to shoot > down a large amount of another process's memory is a lot more generally > useful than tying it to SIGKILL, agree? I see. So you are suggesting a mode where process_madvise() can operate on large areas spanning multiple VMAs. This slightly differs from option 4 in the previous RFC which suggested a special mode that operates on the *entire* mm of the process. I agree, your suggestion is more generic. > > > > > > > > If that's the > > > > preferred method then I can post it quite quickly. > > > > > > I assume you've tested that prototype. How did its usefulness compare > > > with this SIGKILL-based approach? > > > > Just to make sure I understand correctly your question, you are asking > > about performance comparison of: > > > > // approach in this RFC > > pidfd_send_signal(SIGKILL, SYNC_REAP_MM) > > > > vs > > > > // option #4 in the previous RFC > > kill(SIGKILL); process_madvise(vector=NULL, MADV_DONTNEED); > > > > If so, I have results for the current RFC approach but the previous > > approach was testing on an older device, so don't have > > apples-to-apples comparison results at the moment. I can collect the > > data for fair comparison if desired, however I don't expect a > > noticeable performance difference since they both do pretty much the > > same thing (even on different devices my results are quite close). I > > think it's more a question of which API would be more appropriate. > > OK. I wouldn't expect performance to be very different (and things can > be sped up if so), but the API usefulness might be an issue. Using > process_madvise() (or similar) makes it a two-step operation, whereas > tying it to SIGKILL&&TASK_UNINTERRUPTIBLE provides a more precise tool. > Any thoughts on this? >
On Fri, Nov 13, 2020 at 06:16:32PM -0800, Andrew Morton wrote: > On Fri, 13 Nov 2020 17:57:02 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > > > On Fri, Nov 13, 2020 at 5:18 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > > > > > On Fri, 13 Nov 2020 17:09:37 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > > > Seems to me that the ability to reap another process's memory is a > > > > > > > generally useful one, and that it should not be tied to delivering a > > > > > > > signal in this fashion. > > > > > > > > > > > > > > And we do have the new process_madvise(MADV_PAGEOUT). It may need a > > > > > > > few changes and tweaks, but can't that be used to solve this problem? > > > > > > > > > > > > Thank you for the feedback, Andrew. process_madvise(MADV_DONTNEED) was > > > > > > one of the options recently discussed in > > > > > > https://lore.kernel.org/linux-api/CAJuCfpGz1kPM3G1gZH+09Z7aoWKg05QSAMMisJ7H5MdmRrRhNQ@mail.gmail.com > > > > > > . The thread describes some of the issues with that approach but if we > > > > > > limit it to processes with pending SIGKILL only then I think that > > > > > > would be doable. > > > > > > > > > > Why would it be necessary to read /proc/pid/maps? I'd have thought > > > > > that a starting effort would be > > > > > > > > > > madvise((void *)0, (void *)-1, MADV_PAGEOUT) > > > > > > > > > > (after translation into process_madvise() speak). Which is equivalent > > > > > to the proposed process_madvise(MADV_DONTNEED_MM)? > > > > > > > > Yep, this is very similar to option #3 in > > > > https://lore.kernel.org/linux-api/CAJuCfpGz1kPM3G1gZH+09Z7aoWKg05QSAMMisJ7H5MdmRrRhNQ@mail.gmail.com > > > > and I actually have a tested prototype for that. > > > > > > Why is the `vector=NULL' needed? Can't `vector' point at a single iovec > > > which spans the whole address range? > > > > That would be the option #4 from the same discussion and the issues > > noted there are "process_madvise return value can't handle such a > > large number of bytes and there is MAX_RW_COUNT limit on max number of > > bytes one process_madvise call can handle". In my prototype I have a > > special handling for such "bulk operation" to work around the > > MAX_RW_COUNT limitation. > > Ah, OK, return value. Maybe process_madvise() shouldn't have done that > and should have simply returned 0 on success, like madvise(). > > I guess a special "nuke whole address space" command is OK. But, again > in the search for generality, the ability to nuke very large amounts of > address space (but not the entire address space) would be better. > > The process_madvise() return value issue could be addressed by adding a > process_madvise() mode which return 0 on success. > > And I guess the MAX_RW_COUNT issue is solvable by adding an > import_iovec() arg to say "don't check that". Along those lines. > > It's all sounding a bit painful (but not *too* painful). But to > reiterate, I do think that adding the ability for a process to shoot > down a large amount of another process's memory is a lot more generally > useful than tying it to SIGKILL, agree? I agree the direction but I think it's the general problem for every APIs have supported iovec and not sure process_madvise is special to chage it. IOW, it wouldn't be a problem to support *entire address space* special mode but not sure to support *large amount of address space* at the cost of breaking existing iovec scheme.
On Fri, Nov 13, 2020 at 03:55:39PM -0800, Andrew Morton wrote: > On Fri, 13 Nov 2020 09:34:48 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > > > When a process is being killed it might be in an uninterruptible sleep > > which leads to an unpredictable delay in its memory reclaim. In low memory > > situations, when it's important to free up memory quickly, such delay is > > problematic. Kernel solves this problem with oom-reaper thread which > > performs memory reclaim even when the victim process is not runnable. > > Userspace currently lacks such mechanisms and the need and potential > > solutions were discussed before (see links below). > > This patch provides a mechanism to perform memory reclaim in the context > > of the process that sends SIGKILL signal. New SYNC_REAP_MM flag for > > pidfd_send_signal syscall can be used only when sending SIGKILL signal > > and will lead to the caller synchronously reclaiming the memory that > > belongs to the victim and can be easily reclaimed. > > hm. > > Seems to me that the ability to reap another process's memory is a > generally useful one, and that it should not be tied to delivering a > signal in this fashion. I agree and I see you've already had some good ideas how to tie this to process_madvise(). If that's workable for your use-case then I'd prefer that approach. Signals are almost always not a great choice. Christian
On Fri 13-11-20 18:16:32, Andrew Morton wrote: [...] > It's all sounding a bit painful (but not *too* painful). But to > reiterate, I do think that adding the ability for a process to shoot > down a large amount of another process's memory is a lot more generally > useful than tying it to SIGKILL, agree? I am not sure TBH. Is there any reasonable usecase where uncoordinated memory tear down is OK and a target process which is able to see the unmapped memory?
On Wed, Nov 18, 2020 at 11:10 AM Michal Hocko <mhocko@suse.com> wrote: > > On Fri 13-11-20 18:16:32, Andrew Morton wrote: > [...] > > It's all sounding a bit painful (but not *too* painful). But to > > reiterate, I do think that adding the ability for a process to shoot > > down a large amount of another process's memory is a lot more generally > > useful than tying it to SIGKILL, agree? > > I am not sure TBH. Is there any reasonable usecase where uncoordinated > memory tear down is OK and a target process which is able to see the > unmapped memory? I think uncoordinated memory tear down is a special case which makes sense only when the target process is being killed (and we can enforce that by allowing MADV_DONTNEED to be used only if the target process has pending SIGKILL). However, the ability to apply other flavors of process_madvise() to large memory areas spanning multiple VMAs can be useful in more cases. For example in Android we will use process_madvise(MADV_PAGEOUT) to "shrink" an inactive background process. Today we have to read /proc/maps and construct the vector of VMAs even when applying this advice to the entire process. With such a special mode we could achieve this more efficiently and with less hussle. > -- > Michal Hocko > SUSE Labs
On Wed 18-11-20 11:22:21, Suren Baghdasaryan wrote: > On Wed, Nov 18, 2020 at 11:10 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Fri 13-11-20 18:16:32, Andrew Morton wrote: > > [...] > > > It's all sounding a bit painful (but not *too* painful). But to > > > reiterate, I do think that adding the ability for a process to shoot > > > down a large amount of another process's memory is a lot more generally > > > useful than tying it to SIGKILL, agree? > > > > I am not sure TBH. Is there any reasonable usecase where uncoordinated > > memory tear down is OK and a target process which is able to see the > > unmapped memory? > > I think uncoordinated memory tear down is a special case which makes > sense only when the target process is being killed (and we can enforce > that by allowing MADV_DONTNEED to be used only if the target process > has pending SIGKILL). That would be safe but then I am wondering whether it makes sense to implement as a madvise call. It is quite strange to expect somebody call a syscall on a killed process. But this is more a detail. I am not a great fan of a more generic MADV_DONTNEED on a remote process. This is just too dangerous IMHO. > However, the ability to apply other flavors of > process_madvise() to large memory areas spanning multiple VMAs can be > useful in more cases. Yes I do agree with that. The error reporting would be more tricky but I am not really sure that the exact reporting is really necessary for advice like interface. > For example in Android we will use > process_madvise(MADV_PAGEOUT) to "shrink" an inactive background > process. That makes sense to me.
On Wed, Nov 18, 2020 at 11:32 AM Michal Hocko <mhocko@suse.com> wrote: > > On Wed 18-11-20 11:22:21, Suren Baghdasaryan wrote: > > On Wed, Nov 18, 2020 at 11:10 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Fri 13-11-20 18:16:32, Andrew Morton wrote: > > > [...] > > > > It's all sounding a bit painful (but not *too* painful). But to > > > > reiterate, I do think that adding the ability for a process to shoot > > > > down a large amount of another process's memory is a lot more generally > > > > useful than tying it to SIGKILL, agree? > > > > > > I am not sure TBH. Is there any reasonable usecase where uncoordinated > > > memory tear down is OK and a target process which is able to see the > > > unmapped memory? > > > > I think uncoordinated memory tear down is a special case which makes > > sense only when the target process is being killed (and we can enforce > > that by allowing MADV_DONTNEED to be used only if the target process > > has pending SIGKILL). > > That would be safe but then I am wondering whether it makes sense to > implement as a madvise call. It is quite strange to expect somebody call > a syscall on a killed process. But this is more a detail. I am not a > great fan of a more generic MADV_DONTNEED on a remote process. This is > just too dangerous IMHO. Agree 100% > > > However, the ability to apply other flavors of > > process_madvise() to large memory areas spanning multiple VMAs can be > > useful in more cases. > > Yes I do agree with that. The error reporting would be more tricky but > I am not really sure that the exact reporting is really necessary for > advice like interface. Andrew's suggestion for this special mode to change return semantics to the usual "0 or error code" seems to me like the most reasonable way to deal with the return value limitation. > > > For example in Android we will use > > process_madvise(MADV_PAGEOUT) to "shrink" an inactive background > > process. > > That makes sense to me. > -- > Michal Hocko > SUSE Labs
On Wed, Nov 18, 2020 at 11:51 AM Suren Baghdasaryan <surenb@google.com> wrote: > > On Wed, Nov 18, 2020 at 11:32 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Wed 18-11-20 11:22:21, Suren Baghdasaryan wrote: > > > On Wed, Nov 18, 2020 at 11:10 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Fri 13-11-20 18:16:32, Andrew Morton wrote: > > > > [...] > > > > > It's all sounding a bit painful (but not *too* painful). But to > > > > > reiterate, I do think that adding the ability for a process to shoot > > > > > down a large amount of another process's memory is a lot more generally > > > > > useful than tying it to SIGKILL, agree? > > > > > > > > I am not sure TBH. Is there any reasonable usecase where uncoordinated > > > > memory tear down is OK and a target process which is able to see the > > > > unmapped memory? > > > > > > I think uncoordinated memory tear down is a special case which makes > > > sense only when the target process is being killed (and we can enforce > > > that by allowing MADV_DONTNEED to be used only if the target process > > > has pending SIGKILL). > > > > That would be safe but then I am wondering whether it makes sense to > > implement as a madvise call. It is quite strange to expect somebody call > > a syscall on a killed process. But this is more a detail. I am not a > > great fan of a more generic MADV_DONTNEED on a remote process. This is > > just too dangerous IMHO. > > Agree 100% I assumed here that by "a more generic MADV_DONTNEED on a remote process" you meant "process_madvise(MADV_DONTNEED) applied to a process that is not being killed". Re-reading your comment I realized that you might have meant "process_madvice() with generic support to large memory areas". I hope I understood you correctly. > > > > > > However, the ability to apply other flavors of > > > process_madvise() to large memory areas spanning multiple VMAs can be > > > useful in more cases. > > > > Yes I do agree with that. The error reporting would be more tricky but > > I am not really sure that the exact reporting is really necessary for > > advice like interface. > > Andrew's suggestion for this special mode to change return semantics > to the usual "0 or error code" seems to me like the most reasonable > way to deal with the return value limitation. > > > > > > For example in Android we will use > > > process_madvise(MADV_PAGEOUT) to "shrink" an inactive background > > > process. > > > > That makes sense to me. > > -- > > Michal Hocko > > SUSE Labs
On Wed, Nov 18, 2020 at 11:55 AM Suren Baghdasaryan <surenb@google.com> wrote: > > On Wed, Nov 18, 2020 at 11:51 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > On Wed, Nov 18, 2020 at 11:32 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Wed 18-11-20 11:22:21, Suren Baghdasaryan wrote: > > > > On Wed, Nov 18, 2020 at 11:10 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > On Fri 13-11-20 18:16:32, Andrew Morton wrote: > > > > > [...] > > > > > > It's all sounding a bit painful (but not *too* painful). But to > > > > > > reiterate, I do think that adding the ability for a process to shoot > > > > > > down a large amount of another process's memory is a lot more generally > > > > > > useful than tying it to SIGKILL, agree? I was looking into how to work around the limitation of MAX_RW_COUNT and the conceptual issue there is the "struct iovec" which has its iov_len as size_t that lacks capacity for expressing ranges like "entire process memory". I would like to check your reaction to the following idea which can be implemented without painful surgeries to the import_iovec and its friends. process_madvise(pidfd, iovec = [ { range_start_addr, 0 }, { range_end_addr, 0 } ], vlen = 2, behavior=MADV_xxx, flags = PMADV_FLAG_RANGE) So, to represent a range we pass a new PMADV_FLAG_RANGE flag and construct a 2-element vector to express range start and range end using iovec.iov_base members. iov_len member of the iovec elements is ignored in this mode. I know it sounds hacky but I think it's the simplest way if we want the ability to express an arbitrarily large range. Another option is to do what Andrew described as "madvise((void *)0, (void *)-1, MADV_PAGEOUT)" which means this mode works only with the entire mm of the process. WDYT? > > > > > > > > > > I am not sure TBH. Is there any reasonable usecase where uncoordinated > > > > > memory tear down is OK and a target process which is able to see the > > > > > unmapped memory? > > > > > > > > I think uncoordinated memory tear down is a special case which makes > > > > sense only when the target process is being killed (and we can enforce > > > > that by allowing MADV_DONTNEED to be used only if the target process > > > > has pending SIGKILL). > > > > > > That would be safe but then I am wondering whether it makes sense to > > > implement as a madvise call. It is quite strange to expect somebody call > > > a syscall on a killed process. But this is more a detail. I am not a > > > great fan of a more generic MADV_DONTNEED on a remote process. This is > > > just too dangerous IMHO. > > > > Agree 100% > > I assumed here that by "a more generic MADV_DONTNEED on a remote > process" you meant "process_madvise(MADV_DONTNEED) applied to a > process that is not being killed". Re-reading your comment I realized > that you might have meant "process_madvice() with generic support to > large memory areas". I hope I understood you correctly. > > > > > > > > > > However, the ability to apply other flavors of > > > > process_madvise() to large memory areas spanning multiple VMAs can be > > > > useful in more cases. > > > > > > Yes I do agree with that. The error reporting would be more tricky but > > > I am not really sure that the exact reporting is really necessary for > > > advice like interface. > > > > Andrew's suggestion for this special mode to change return semantics > > to the usual "0 or error code" seems to me like the most reasonable > > way to deal with the return value limitation. > > > > > > > > > For example in Android we will use > > > > process_madvise(MADV_PAGEOUT) to "shrink" an inactive background > > > > process. > > > > > > That makes sense to me. > > > -- > > > Michal Hocko > > > SUSE Labs
On Wed, Nov 18, 2020 at 4:13 PM Suren Baghdasaryan <surenb@google.com> wrote: > > On Wed, Nov 18, 2020 at 11:55 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > On Wed, Nov 18, 2020 at 11:51 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > On Wed, Nov 18, 2020 at 11:32 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Wed 18-11-20 11:22:21, Suren Baghdasaryan wrote: > > > > > On Wed, Nov 18, 2020 at 11:10 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > On Fri 13-11-20 18:16:32, Andrew Morton wrote: > > > > > > [...] > > > > > > > It's all sounding a bit painful (but not *too* painful). But to > > > > > > > reiterate, I do think that adding the ability for a process to shoot > > > > > > > down a large amount of another process's memory is a lot more generally > > > > > > > useful than tying it to SIGKILL, agree? > > I was looking into how to work around the limitation of MAX_RW_COUNT > and the conceptual issue there is the "struct iovec" which has its > iov_len as size_t that lacks capacity for expressing ranges like > "entire process memory". I would like to check your reaction to the > following idea which can be implemented without painful surgeries to > the import_iovec and its friends. > > process_madvise(pidfd, iovec = [ { range_start_addr, 0 }, { > range_end_addr, 0 } ], vlen = 2, behavior=MADV_xxx, flags = > PMADV_FLAG_RANGE) > > So, to represent a range we pass a new PMADV_FLAG_RANGE flag and > construct a 2-element vector to express range start and range end > using iovec.iov_base members. iov_len member of the iovec elements is > ignored in this mode. I know it sounds hacky but I think it's the > simplest way if we want the ability to express an arbitrarily large > range. > Another option is to do what Andrew described as "madvise((void *)0, > (void *)-1, MADV_PAGEOUT)" which means this mode works only with the > entire mm of the process. > WDYT? > To follow up on this discussion, I posted a patchset to implement process_madvise(MADV_DONTNEED) supporting the entire mm range at https://lkml.org/lkml/2020/11/24/21. > > > > > > > > > > > > I am not sure TBH. Is there any reasonable usecase where uncoordinated > > > > > > memory tear down is OK and a target process which is able to see the > > > > > > unmapped memory? > > > > > > > > > > I think uncoordinated memory tear down is a special case which makes > > > > > sense only when the target process is being killed (and we can enforce > > > > > that by allowing MADV_DONTNEED to be used only if the target process > > > > > has pending SIGKILL). > > > > > > > > That would be safe but then I am wondering whether it makes sense to > > > > implement as a madvise call. It is quite strange to expect somebody call > > > > a syscall on a killed process. But this is more a detail. I am not a > > > > great fan of a more generic MADV_DONTNEED on a remote process. This is > > > > just too dangerous IMHO. > > > > > > Agree 100% > > > > I assumed here that by "a more generic MADV_DONTNEED on a remote > > process" you meant "process_madvise(MADV_DONTNEED) applied to a > > process that is not being killed". Re-reading your comment I realized > > that you might have meant "process_madvice() with generic support to > > large memory areas". I hope I understood you correctly. > > > > > > > > > > > > > > However, the ability to apply other flavors of > > > > > process_madvise() to large memory areas spanning multiple VMAs can be > > > > > useful in more cases. > > > > > > > > Yes I do agree with that. The error reporting would be more tricky but > > > > I am not really sure that the exact reporting is really necessary for > > > > advice like interface. > > > > > > Andrew's suggestion for this special mode to change return semantics > > > to the usual "0 or error code" seems to me like the most reasonable > > > way to deal with the return value limitation. > > > > > > > > > > > > For example in Android we will use > > > > > process_madvise(MADV_PAGEOUT) to "shrink" an inactive background > > > > > process. > > > > > > > > That makes sense to me. > > > > -- > > > > Michal Hocko > > > > SUSE Labs
diff --git a/include/linux/oom.h b/include/linux/oom.h index 2db9a1432511..9a8dcabdfdf1 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -111,6 +111,8 @@ bool __oom_reap_task_mm(struct mm_struct *mm); long oom_badness(struct task_struct *p, unsigned long totalpages); +extern bool task_will_free_mem(struct task_struct *task); + extern bool out_of_memory(struct oom_control *oc); extern void exit_oom_victim(void); diff --git a/include/linux/signal.h b/include/linux/signal.h index b256f9c65661..5deafc99035d 100644 --- a/include/linux/signal.h +++ b/include/linux/signal.h @@ -449,6 +449,13 @@ extern bool unhandled_signal(struct task_struct *tsk, int sig); (!siginmask(signr, SIG_KERNEL_IGNORE_MASK|SIG_KERNEL_STOP_MASK) && \ (t)->sighand->action[(signr)-1].sa.sa_handler == SIG_DFL) +/* + * Flag values used in pidfd_send_signal: + * + * SYNC_REAP_MM indicates request to reclaim mm after SIGKILL. + */ +#define SYNC_REAP_MM 0x1 + void signals_init(void); int restore_altstack(const stack_t __user *); diff --git a/kernel/signal.c b/kernel/signal.c index ef8f2a28d37c..15d4be5600a3 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -46,6 +46,7 @@ #include <linux/livepatch.h> #include <linux/cgroup.h> #include <linux/audit.h> +#include <linux/oom.h> #define CREATE_TRACE_POINTS #include <trace/events/signal.h> @@ -3711,6 +3712,63 @@ static struct pid *pidfd_to_pid(const struct file *file) return tgid_pidfd_to_pid(file); } +static int reap_mm(struct pid *pid) +{ + struct task_struct *task; + struct mm_struct *mm; + int ret = 0; + + /* Get the task_struct */ + task = get_pid_task(pid, PIDTYPE_PID); + if (!task) { + ret = -ESRCH; + goto out; + } + + task_lock(task); + + /* Check if memory can be easily reclaimed */ + if (!task_will_free_mem(task)) { + task_unlock(task); + ret = -EBUSY; + goto release_task; + } + + /* Get mm to prevent exit_mmap */ + mm = task->mm; + mmget(mm); + + /* Ensure no competition with OOM-killer to prevent contention */ + if (unlikely(mm_is_oom_victim(mm)) || + unlikely(test_bit(MMF_OOM_SKIP, &mm->flags))) { + /* Already being reclaimed */ + task_unlock(task); + goto drop_mm; + } + /* + * Prevent OOM-killer or other pidfd_send_signal from considering + * this task + */ + set_bit(MMF_OOM_SKIP, &mm->flags); + + task_unlock(task); + + mmap_read_lock(mm); + if (!__oom_reap_task_mm(mm)) { + /* Failed to reap part of the address space. User can retry */ + ret = -EAGAIN; + clear_bit(MMF_OOM_SKIP, &mm->flags); + } + mmap_read_unlock(mm); + +drop_mm: + mmput(mm); +release_task: + put_task_struct(task); +out: + return ret; +} + /** * sys_pidfd_send_signal - Signal a process through a pidfd * @pidfd: file descriptor of the process @@ -3737,10 +3795,16 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig, struct pid *pid; kernel_siginfo_t kinfo; - /* Enforce flags be set to 0 until we add an extension. */ - if (flags) + /* Enforce only valid flags. */ + if (flags) { + /* Allow SYNC_REAP_MM only with SIGKILL. */ + if (flags == SYNC_REAP_MM && sig == SIGKILL) + goto valid; + return -EINVAL; + } +valid: f = fdget(pidfd); if (!f.file) return -EBADF; @@ -3775,6 +3839,11 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig, } ret = kill_pid_info(sig, &kinfo, pid); + if (unlikely(ret)) + goto err; + + if (flags & SYNC_REAP_MM) + ret = reap_mm(pid); err: fdput(f); diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 8b84661a6410..66c90bca25bc 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -808,7 +808,7 @@ static inline bool __task_will_free_mem(struct task_struct *task) * Caller has to make sure that task->mm is stable (hold task_lock or * it operates on the current). */ -static bool task_will_free_mem(struct task_struct *task) +bool task_will_free_mem(struct task_struct *task) { struct mm_struct *mm = task->mm; struct task_struct *p;
When a process is being killed it might be in an uninterruptible sleep which leads to an unpredictable delay in its memory reclaim. In low memory situations, when it's important to free up memory quickly, such delay is problematic. Kernel solves this problem with oom-reaper thread which performs memory reclaim even when the victim process is not runnable. Userspace currently lacks such mechanisms and the need and potential solutions were discussed before (see links below). This patch provides a mechanism to perform memory reclaim in the context of the process that sends SIGKILL signal. New SYNC_REAP_MM flag for pidfd_send_signal syscall can be used only when sending SIGKILL signal and will lead to the caller synchronously reclaiming the memory that belongs to the victim and can be easily reclaimed. 1. https://patchwork.kernel.org/cover/10894999 2. https://lwn.net/Articles/787217 3. https://lore.kernel.org/linux-api/CAJuCfpGz1kPM3G1gZH+09Z7aoWKg05QSAMMisJ7H5MdmRrRhNQ@mail.gmail.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> --- include/linux/oom.h | 2 ++ include/linux/signal.h | 7 ++++ kernel/signal.c | 73 ++++++++++++++++++++++++++++++++++++++++-- mm/oom_kill.c | 2 +- 4 files changed, 81 insertions(+), 3 deletions(-)