| Message ID | 20180716115058.5559-1-mhocko@kernel.org (mailing list archive) |
|---|---|
| State | Not Applicable |
On Mon, 16 Jul 2018 13:50:58 +0200 Michal Hocko <mhocko@kernel.org> wrote: > From: Michal Hocko <mhocko@suse.com> > > There are several blockable mmu notifiers which might sleep in > mmu_notifier_invalidate_range_start and that is a problem for the > oom_reaper because it needs to guarantee a forward progress so it cannot > depend on any sleepable locks. > > Currently we simply back off and mark an oom victim with blockable mmu > notifiers as done after a short sleep. That can result in selecting a > new oom victim prematurely because the previous one still hasn't torn > its memory down yet. > > We can do much better though. Even if mmu notifiers use sleepable locks > there is no reason to automatically assume those locks are held. > Moreover majority of notifiers only care about a portion of the address > space and there is absolutely zero reason to fail when we are unmapping an > unrelated range. Many notifiers do really block and wait for HW which is > harder to handle and we have to bail out though. > > This patch handles the low hanging fruid. __mmu_notifier_invalidate_range_start > gets a blockable flag and callbacks are not allowed to sleep if the > flag is set to false. This is achieved by using trylock instead of the > sleepable lock for most callbacks and continue as long as we do not > block down the call chain. I assume device driver developers are wondering "what does this mean for me". As I understand it, the only time they will see blockable==false is when their driver is being called in response to an out-of-memory condition, yes? So it is a very rare thing. Any suggestions regarding how the driver developers can test this code path? I don't think we presently have a way to fake an oom-killing event? Perhaps we should add such a thing, given the problems we're having with that feature. > I think we can improve that even further because there is a common > pattern to do a range lookup first and then do something about that. > The first part can be done without a sleeping lock in most cases AFAICS. > > The oom_reaper end then simply retries if there is at least one notifier > which couldn't make any progress in !blockable mode. A retry loop is > already implemented to wait for the mmap_sem and this is basically the > same thing. > > ... > > +static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm, > + unsigned long start, unsigned long end) > +{ > + int ret = 0; > + if (mm_has_notifiers(mm)) > + ret = __mmu_notifier_invalidate_range_start(mm, start, end, false); > + > + return ret; > } nit, { if (mm_has_notifiers(mm)) return __mmu_notifier_invalidate_range_start(mm, start, end, false); return 0; } would suffice. > > ... > > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -3074,7 +3074,7 @@ void exit_mmap(struct mm_struct *mm) > * reliably test it. > */ > mutex_lock(&oom_lock); > - __oom_reap_task_mm(mm); > + (void)__oom_reap_task_mm(mm); > mutex_unlock(&oom_lock); What does this do? > set_bit(MMF_OOM_SKIP, &mm->flags); > > ... > -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
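For driver developers wondering what the new contract looks like in practice, the pattern the changelog describes boils down to something like the sketch below. It is illustrative only: my_dev, its mutex and my_dev_flush_range() are made-up names, and only the handling of the blockable flag reflects the API change discussed in this thread.

```c
/*
 * Minimal sketch of the callback pattern the patch asks notifier users to
 * follow: use trylock when blockable == false and bail out with -EAGAIN so
 * the oom_reaper can retry.  "my_dev" and "my_dev_flush_range" are
 * hypothetical names, not taken from any in-tree driver.
 */
struct my_dev {
	struct mmu_notifier mn;
	struct mutex lock;		/* protects device page-table state */
};

static void my_dev_flush_range(struct my_dev *dev,
			       unsigned long start, unsigned long end);

static int my_dev_invalidate_range_start(struct mmu_notifier *mn,
					 struct mm_struct *mm,
					 unsigned long start,
					 unsigned long end,
					 bool blockable)
{
	struct my_dev *dev = container_of(mn, struct my_dev, mn);

	if (blockable)
		mutex_lock(&dev->lock);
	else if (!mutex_trylock(&dev->lock))
		return -EAGAIN;		/* oom_reaper backs off and retries */

	/* Only the overlapping range needs invalidating. */
	my_dev_flush_range(dev, start, end);

	mutex_unlock(&dev->lock);
	return 0;
}
```

The point is that failing with -EAGAIN in !blockable mode is always an acceptable answer; the caller only uses that mode from a context (currently the oom_reaper) that already has a retry loop.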
On Mon, Jul 16, 2018 at 04:12:49PM -0700, Andrew Morton wrote: > On Mon, 16 Jul 2018 13:50:58 +0200 Michal Hocko <mhocko@kernel.org> wrote: > > > From: Michal Hocko <mhocko@suse.com> > > > > There are several blockable mmu notifiers which might sleep in > > mmu_notifier_invalidate_range_start and that is a problem for the > > oom_reaper because it needs to guarantee a forward progress so it cannot > > depend on any sleepable locks. > > > > Currently we simply back off and mark an oom victim with blockable mmu > > notifiers as done after a short sleep. That can result in selecting a > > new oom victim prematurely because the previous one still hasn't torn > > its memory down yet. > > > > We can do much better though. Even if mmu notifiers use sleepable locks > > there is no reason to automatically assume those locks are held. > > Moreover majority of notifiers only care about a portion of the address > > space and there is absolutely zero reason to fail when we are unmapping an > > unrelated range. Many notifiers do really block and wait for HW which is > > harder to handle and we have to bail out though. > > > > This patch handles the low hanging fruid. __mmu_notifier_invalidate_range_start > > gets a blockable flag and callbacks are not allowed to sleep if the > > flag is set to false. This is achieved by using trylock instead of the > > sleepable lock for most callbacks and continue as long as we do not > > block down the call chain. > > I assume device driver developers are wondering "what does this mean > for me". As I understand it, the only time they will see > blockable==false is when their driver is being called in response to an > out-of-memory condition, yes? So it is a very rare thing. I can't say for everyone, but at least for me (mlx5), it is not rare event. I'm seeing OOM very often while I'm running my tests in low memory VMs. Thanks > > Any suggestions regarding how the driver developers can test this code > path? I don't think we presently have a way to fake an oom-killing > event? Perhaps we should add such a thing, given the problems we're > having with that feature.
On Mon 16-07-18 16:12:49, Andrew Morton wrote: > On Mon, 16 Jul 2018 13:50:58 +0200 Michal Hocko <mhocko@kernel.org> wrote: > > > From: Michal Hocko <mhocko@suse.com> > > > > There are several blockable mmu notifiers which might sleep in > > mmu_notifier_invalidate_range_start and that is a problem for the > > oom_reaper because it needs to guarantee a forward progress so it cannot > > depend on any sleepable locks. > > > > Currently we simply back off and mark an oom victim with blockable mmu > > notifiers as done after a short sleep. That can result in selecting a > > new oom victim prematurely because the previous one still hasn't torn > > its memory down yet. > > > > We can do much better though. Even if mmu notifiers use sleepable locks > > there is no reason to automatically assume those locks are held. > > Moreover majority of notifiers only care about a portion of the address > > space and there is absolutely zero reason to fail when we are unmapping an > > unrelated range. Many notifiers do really block and wait for HW which is > > harder to handle and we have to bail out though. > > > > This patch handles the low hanging fruid. __mmu_notifier_invalidate_range_start > > gets a blockable flag and callbacks are not allowed to sleep if the > > flag is set to false. This is achieved by using trylock instead of the > > sleepable lock for most callbacks and continue as long as we do not > > block down the call chain. > > I assume device driver developers are wondering "what does this mean > for me". As I understand it, the only time they will see > blockable==false is when their driver is being called in response to an > out-of-memory condition, yes? So it is a very rare thing. Yes, this is the case right now. Maybe we will grow other users in future. Those other potential users is the reason why I used blockable rather than oom parameter name. > Any suggestions regarding how the driver developers can test this code > path? I don't think we presently have a way to fake an oom-killing > event? Perhaps we should add such a thing, given the problems we're > having with that feature. The simplest way is to wrap an userspace code which uses these notifiers into a memcg and set the hard limit to hit the oom. This can be done e.g. after the test faults in all the mmu notifier managed memory and set the hard limit to something really small. Then we are looking for a proper process tear down. > > I think we can improve that even further because there is a common > > pattern to do a range lookup first and then do something about that. > > The first part can be done without a sleeping lock in most cases AFAICS. > > > > The oom_reaper end then simply retries if there is at least one notifier > > which couldn't make any progress in !blockable mode. A retry loop is > > already implemented to wait for the mmap_sem and this is basically the > > same thing. > > > > ... > > > > +static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm, > > + unsigned long start, unsigned long end) > > +{ > > + int ret = 0; > > + if (mm_has_notifiers(mm)) > > + ret = __mmu_notifier_invalidate_range_start(mm, start, end, false); > > + > > + return ret; > > } > > nit, > > { > if (mm_has_notifiers(mm)) > return __mmu_notifier_invalidate_range_start(mm, start, end, false); > return 0; > } > > would suffice. Sure. Fixed > > > > ... > > > > --- a/mm/mmap.c > > +++ b/mm/mmap.c > > @@ -3074,7 +3074,7 @@ void exit_mmap(struct mm_struct *mm) > > * reliably test it. 
> > */ > > mutex_lock(&oom_lock); > > - __oom_reap_task_mm(mm); > > + (void)__oom_reap_task_mm(mm); > > mutex_unlock(&oom_lock); > > What does this do? There is no error to be returned here as the comment above explains * Nothing can be holding mm->mmap_sem here and the above call * to mmu_notifier_release(mm) ensures mmu notifier callbacks in * __oom_reap_task_mm() will not block.
Does anybody see any reasons why this should get into mmotm tree? I do not want to rush this in but if general feeling is to push it for the upcoming merge window then I will not object.
On Tue, 17 Jul 2018 10:12:01 +0200 Michal Hocko <mhocko@kernel.org> wrote: > > Any suggestions regarding how the driver developers can test this code > > path? I don't think we presently have a way to fake an oom-killing > > event? Perhaps we should add such a thing, given the problems we're > > having with that feature. > > The simplest way is to wrap an userspace code which uses these notifiers > into a memcg and set the hard limit to hit the oom. This can be done > e.g. after the test faults in all the mmu notifier managed memory and > set the hard limit to something really small. Then we are looking for a > proper process tear down. Chances are, some of the intended audience don't know how to do this and will either have to hunt down a lot of documentation or will just not test it. But we want them to test it, so a little worked step-by-step example would help things along please.
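A rough worked example of the memcg approach Michal describes could look like the following. It assumes cgroup v2 is mounted at /sys/fs/cgroup, and the test binary name and the 16M limit are placeholders; the exact limit just has to be well below what the test faults in.

```sh
# Sketch of the memcg-based OOM test described above (cgroup v2 assumed;
# ./mmu_notifier_test stands in for whatever userspace program exercises
# the driver's mmu notifier paths).
mkdir /sys/fs/cgroup/mn-test
echo $$ > /sys/fs/cgroup/mn-test/cgroup.procs   # move this shell into the group

./mmu_notifier_test &                           # fault in all notifier-managed memory
sleep 10                                        # give it time to do so

# Clamp the hard limit well below the test's RSS to force a memcg OOM kill,
# then check that the victim is torn down properly.
echo 16M > /sys/fs/cgroup/mn-test/memory.max
dmesg | tail -n 20                              # look for oom_reaper: reaped process ...
```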
On Mon, 16 Jul 2018 13:50:58 +0200 Michal Hocko <mhocko@kernel.org> wrote: > From: Michal Hocko <mhocko@suse.com> > > There are several blockable mmu notifiers which might sleep in > mmu_notifier_invalidate_range_start and that is a problem for the > oom_reaper because it needs to guarantee a forward progress so it cannot > depend on any sleepable locks. > > ... > > @@ -571,7 +565,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) > > trace_start_task_reaping(tsk->pid); > > - __oom_reap_task_mm(mm); > + /* failed to reap part of the address space. Try again later */ > + if (!__oom_reap_task_mm(mm)) { > + up_read(&mm->mmap_sem); > + ret = false; > + goto unlock_oom; > + } This function is starting to look a bit screwy. : static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) : { : if (!down_read_trylock(&mm->mmap_sem)) { : trace_skip_task_reaping(tsk->pid); : return false; : } : : /* : * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't : * work on the mm anymore. The check for MMF_OOM_SKIP must run : * under mmap_sem for reading because it serializes against the : * down_write();up_write() cycle in exit_mmap(). : */ : if (test_bit(MMF_OOM_SKIP, &mm->flags)) { : up_read(&mm->mmap_sem); : trace_skip_task_reaping(tsk->pid); : return true; : } : : trace_start_task_reaping(tsk->pid); : : /* failed to reap part of the address space. Try again later */ : if (!__oom_reap_task_mm(mm)) { : up_read(&mm->mmap_sem); : return true; : } : : pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", : task_pid_nr(tsk), tsk->comm, : K(get_mm_counter(mm, MM_ANONPAGES)), : K(get_mm_counter(mm, MM_FILEPAGES)), : K(get_mm_counter(mm, MM_SHMEMPAGES))); : up_read(&mm->mmap_sem); : : trace_finish_task_reaping(tsk->pid); : return true; : } - Undocumented return value. - comment "failed to reap part..." is misleading - sounds like it's referring to something which happened in the past, is in fact referring to something which might happen in the future. - fails to call trace_finish_task_reaping() in one case - code duplication. I'm thinking it wants to be something like this? : /* : * Return true if we successfully acquired (then released) mmap_sem : */ : static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) : { : if (!down_read_trylock(&mm->mmap_sem)) { : trace_skip_task_reaping(tsk->pid); : return false; : } : : /* : * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't : * work on the mm anymore. The check for MMF_OOM_SKIP must run : * under mmap_sem for reading because it serializes against the : * down_write();up_write() cycle in exit_mmap(). : */ : if (test_bit(MMF_OOM_SKIP, &mm->flags)) { : trace_skip_task_reaping(tsk->pid); : goto out; : } : : trace_start_task_reaping(tsk->pid); : : if (!__oom_reap_task_mm(mm)) { : /* Failed to reap part of the address space. Try again later */ : goto finish; : } : : pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", : task_pid_nr(tsk), tsk->comm, : K(get_mm_counter(mm, MM_ANONPAGES)), : K(get_mm_counter(mm, MM_FILEPAGES)), : K(get_mm_counter(mm, MM_SHMEMPAGES))); : finish: : trace_finish_task_reaping(tsk->pid); : out: : up_read(&mm->mmap_sem); : return true; : } - Increases mmap_sem hold time a little by moving trace_finish_task_reaping() inside the locked region. So sue me ;) - Sharing the finish: path means that the trace event won't distinguish between the two sources of finishing. 
Please take a look?
On Fri 20-07-18 17:09:02, Andrew Morton wrote:
[...]
> Please take a look?
Are you OK with having these in a separate patch?
On Mon 23-07-18 09:03:06, Michal Hocko wrote: > On Fri 20-07-18 17:09:02, Andrew Morton wrote: > [...] > > Please take a look? > > Are you OK to have these in a separate patch? Btw. I will rebase this patch once oom stuff in linux-next settles. At least oom_lock removal from oom_reaper will conflict.
On Mon 23-07-18 09:11:54, Michal Hocko wrote: > On Mon 23-07-18 09:03:06, Michal Hocko wrote: > > On Fri 20-07-18 17:09:02, Andrew Morton wrote: > > [...] > > > Please take a look? > > > > Are you OK to have these in a separate patch? > > Btw. I will rebase this patch once oom stuff in linux-next settles. At > least oom_lock removal from oom_reaper will conflict. Hmm, I have just checked Andrew's akpm and the patch is already in and Andrew has resolved the conflict with the oom_lock patch. It just seems that linux-next (next-20180720) doesn't have the newest mmotm tree. Anyway, I will go with the incremental cleanup patch per Andrew's comments as soon as linux-next catches up.
On Fri 20-07-18 16:01:25, Andrew Morton wrote: > On Tue, 17 Jul 2018 10:12:01 +0200 Michal Hocko <mhocko@kernel.org> wrote: > > > > Any suggestions regarding how the driver developers can test this code > > > path? I don't think we presently have a way to fake an oom-killing > > > event? Perhaps we should add such a thing, given the problems we're > > > having with that feature. > > > > The simplest way is to wrap an userspace code which uses these notifiers > > into a memcg and set the hard limit to hit the oom. This can be done > > e.g. after the test faults in all the mmu notifier managed memory and > > set the hard limit to something really small. Then we are looking for a > > proper process tear down. > > Chances are, some of the intended audience don't know how to do this > and will either have to hunt down a lot of documentation or will just > not test it. But we want them to test it, so a little worked step-by-step > example would help things along please. I am willing to give more specific steps. Is anybody interested? From my experience so far this is not something drivers developers using mmu notifiers would be unfamiliar with.
On Fri 20-07-18 17:09:02, Andrew Morton wrote: [...] > - Undocumented return value. > > - comment "failed to reap part..." is misleading - sounds like it's > referring to something which happened in the past, is in fact > referring to something which might happen in the future. > > - fails to call trace_finish_task_reaping() in one case > > - code duplication. > > - Increases mmap_sem hold time a little by moving > trace_finish_task_reaping() inside the locked region. So sue me ;) > > - Sharing the finish: path means that the trace event won't > distinguish between the two sources of finishing. > > Please take a look? oom_reap_task_mm should return false when __oom_reap_task_mm return false. This is what my patch did but it seems this changed by http://www.ozlabs.org/~akpm/mmotm/broken-out/mm-oom-remove-oom_lock-from-oom_reaper.patch so that one should be fixed. diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 104ef4a01a55..88657e018714 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -565,7 +565,7 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) /* failed to reap part of the address space. Try again later */ if (!__oom_reap_task_mm(mm)) { up_read(&mm->mmap_sem); - return true; + return false; } pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", On top of that the proposed cleanup looks as follows: diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 88657e018714..4e185a282b3d 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -541,8 +541,16 @@ bool __oom_reap_task_mm(struct mm_struct *mm) return ret; } +/* + * Reaps the address space of the give task. + * + * Returns true on success and false if none or part of the address space + * has been reclaimed and the caller should retry later. + */ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) { + bool ret = true; + if (!down_read_trylock(&mm->mmap_sem)) { trace_skip_task_reaping(tsk->pid); return false; @@ -555,28 +563,28 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) * down_write();up_write() cycle in exit_mmap(). */ if (test_bit(MMF_OOM_SKIP, &mm->flags)) { - up_read(&mm->mmap_sem); trace_skip_task_reaping(tsk->pid); - return true; + goto out_unlock; } trace_start_task_reaping(tsk->pid); /* failed to reap part of the address space. Try again later */ - if (!__oom_reap_task_mm(mm)) { - up_read(&mm->mmap_sem); - return false; - } + ret = __oom_reap_task_mm(mm); + if (!ret) + goto out_finish; pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", task_pid_nr(tsk), tsk->comm, K(get_mm_counter(mm, MM_ANONPAGES)), K(get_mm_counter(mm, MM_FILEPAGES)), K(get_mm_counter(mm, MM_SHMEMPAGES))); +out_finish: + trace_finish_task_reaping(tsk->pid); +out_unlock: up_read(&mm->mmap_sem); - trace_finish_task_reaping(tsk->pid); - return true; + return ret; } #define MAX_OOM_REAP_RETRIES 10
On Tue, 24 Jul 2018 16:17:47 +0200 Michal Hocko <mhocko@kernel.org> wrote: > On Fri 20-07-18 17:09:02, Andrew Morton wrote: > [...] > > - Undocumented return value. > > > > - comment "failed to reap part..." is misleading - sounds like it's > > referring to something which happened in the past, is in fact > > referring to something which might happen in the future. > > > > - fails to call trace_finish_task_reaping() in one case > > > > - code duplication. > > > > - Increases mmap_sem hold time a little by moving > > trace_finish_task_reaping() inside the locked region. So sue me ;) > > > > - Sharing the finish: path means that the trace event won't > > distinguish between the two sources of finishing. > > > > Please take a look? > > oom_reap_task_mm should return false when __oom_reap_task_mm return > false. This is what my patch did but it seems this changed by > http://www.ozlabs.org/~akpm/mmotm/broken-out/mm-oom-remove-oom_lock-from-oom_reaper.patch > so that one should be fixed. > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 104ef4a01a55..88657e018714 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -565,7 +565,7 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) > /* failed to reap part of the address space. Try again later */ > if (!__oom_reap_task_mm(mm)) { > up_read(&mm->mmap_sem); > - return true; > + return false; > } > > pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", OK, thanks, I added that. > > On top of that the proposed cleanup looks as follows: > Looks good to me. Seems a bit strange that we omit the pr_info() output if the mm was partially reaped - people would still want to know this? Not very important though. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 24 Jul 2018, Michal Hocko wrote: > oom_reap_task_mm should return false when __oom_reap_task_mm return > false. This is what my patch did but it seems this changed by > http://www.ozlabs.org/~akpm/mmotm/broken-out/mm-oom-remove-oom_lock-from-oom_reaper.patch > so that one should be fixed. > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 104ef4a01a55..88657e018714 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -565,7 +565,7 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) > /* failed to reap part of the address space. Try again later */ > if (!__oom_reap_task_mm(mm)) { > up_read(&mm->mmap_sem); > - return true; > + return false; > } > > pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", > > > On top of that the proposed cleanup looks as follows: > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 88657e018714..4e185a282b3d 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -541,8 +541,16 @@ bool __oom_reap_task_mm(struct mm_struct *mm) > return ret; > } > > +/* > + * Reaps the address space of the give task. > + * > + * Returns true on success and false if none or part of the address space > + * has been reclaimed and the caller should retry later. > + */ > static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) > { > + bool ret = true; > + > if (!down_read_trylock(&mm->mmap_sem)) { > trace_skip_task_reaping(tsk->pid); > return false; > @@ -555,28 +563,28 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) > * down_write();up_write() cycle in exit_mmap(). > */ > if (test_bit(MMF_OOM_SKIP, &mm->flags)) { > - up_read(&mm->mmap_sem); > trace_skip_task_reaping(tsk->pid); > - return true; > + goto out_unlock; > } > > trace_start_task_reaping(tsk->pid); > > /* failed to reap part of the address space. Try again later */ > - if (!__oom_reap_task_mm(mm)) { > - up_read(&mm->mmap_sem); > - return false; > - } > + ret = __oom_reap_task_mm(mm); > + if (!ret) > + goto out_finish; > > pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", > task_pid_nr(tsk), tsk->comm, > K(get_mm_counter(mm, MM_ANONPAGES)), > K(get_mm_counter(mm, MM_FILEPAGES)), > K(get_mm_counter(mm, MM_SHMEMPAGES))); > +out_finish: > + trace_finish_task_reaping(tsk->pid); > +out_unlock: > up_read(&mm->mmap_sem); > > - trace_finish_task_reaping(tsk->pid); > - return true; > + return ret; > } > > #define MAX_OOM_REAP_RETRIES 10 I think we still want to trace when reaping was skipped to know that the oom reaper will retry again later. mm/oom_kill.c: clean up oom_reap_task_mm() fix indicate reaping has been partially skipped so we can expect future skips or another start before finish. Signed-off-by: David Rientjes <rientjes@google.com> --- mm/oom_kill.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -569,10 +569,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) trace_start_task_reaping(tsk->pid); - /* failed to reap part of the address space. Try again later */ ret = __oom_reap_task_mm(mm); - if (!ret) + if (!ret) { + /* Failed to reap part of the address space. 
Try again later */ + trace_skip_task_reaping(tsk->pid); goto out_finish; + } pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", task_pid_nr(tsk), tsk->comm,
On Tue 24-07-18 14:07:49, David Rientjes wrote: [...] > mm/oom_kill.c: clean up oom_reap_task_mm() fix > > indicate reaping has been partially skipped so we can expect future skips > or another start before finish. But we are not skipping. This is essentially the same case as mmap_sem trylock fail. Maybe we can add a bool parameter to trace_finish_task_reaping to denote partial success? > Signed-off-by: David Rientjes <rientjes@google.com> > --- > mm/oom_kill.c | 6 ++++-- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -569,10 +569,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) > > trace_start_task_reaping(tsk->pid); > > - /* failed to reap part of the address space. Try again later */ > ret = __oom_reap_task_mm(mm); > - if (!ret) > + if (!ret) { > + /* Failed to reap part of the address space. Try again later */ > + trace_skip_task_reaping(tsk->pid); > goto out_finish; > + } > > pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", > task_pid_nr(tsk), tsk->comm,
On Tue 24-07-18 12:53:07, Andrew Morton wrote: [...] > > On top of that the proposed cleanup looks as follows: > > > > Looks good to me. Seems a bit strange that we omit the pr_info() > output if the mm was partially reaped - people would still want to know > this? Not very important though. I think that having a single output once we are done is better but I do not have a strong opinion on this. Btw. here is the changelog for the cleanup. " Andrew has noticed some inconsistencies in oom_reap_task_mm. Notably - Undocumented return value. - comment "failed to reap part..." is misleading - sounds like it's referring to something which happened in the past, is in fact referring to something which might happen in the future. - fails to call trace_finish_task_reaping() in one case - code duplication. - Increases mmap_sem hold time a little by moving trace_finish_task_reaping() inside the locked region. So sue me ;) - Sharing the finish: path means that the trace event won't distinguish between the two sources of finishing. Add a short explanation for the return value and fix the rest by reorganizing the function a bit to have unified function exit paths. Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michal Hocko <mhocko@suse.com> "
Two more worries for this patch. > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c > @@ -178,12 +178,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn) > * > * @amn: our notifier > */ > -static void amdgpu_mn_read_lock(struct amdgpu_mn *amn) > +static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable) > { > - mutex_lock(&amn->read_lock); > + if (blockable) > + mutex_lock(&amn->read_lock); > + else if (!mutex_trylock(&amn->read_lock)) > + return -EAGAIN; > + > if (atomic_inc_return(&amn->recursion) == 1) > down_read_non_owner(&amn->lock); Why don't we need to use trylock here if blockable == false ? Want comment why it is safe to use blocking lock here. > mutex_unlock(&amn->read_lock); > + > + return 0; > } > > /** > --- a/mm/hmm.c > +++ b/mm/hmm.c > @@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm) > up_write(&hmm->mirrors_sem); > } > > -static void hmm_invalidate_range_start(struct mmu_notifier *mn, > +static int hmm_invalidate_range_start(struct mmu_notifier *mn, > struct mm_struct *mm, > unsigned long start, > - unsigned long end) > + unsigned long end, > + bool blockable) > { > struct hmm *hmm = mm->hmm; > > VM_BUG_ON(!hmm); > > atomic_inc(&hmm->sequence); > + > + return 0; > } > > static void hmm_invalidate_range_end(struct mmu_notifier *mn, This assumes that hmm_invalidate_range_end() does not have memory allocation dependency. But hmm_invalidate_range() from hmm_invalidate_range_end() involves down_read(&hmm->mirrors_sem); list_for_each_entry(mirror, &hmm->mirrors, list) mirror->ops->sync_cpu_device_pagetables(mirror, action, start, end); up_read(&hmm->mirrors_sem); sequence. What is surprising is that there is no in-tree user who assigns sync_cpu_device_pagetables field. $ grep -Fr sync_cpu_device_pagetables * Documentation/vm/hmm.rst: /* sync_cpu_device_pagetables() - synchronize page tables include/linux/hmm.h: * will get callbacks through sync_cpu_device_pagetables() operation (see include/linux/hmm.h: /* sync_cpu_device_pagetables() - synchronize page tables include/linux/hmm.h: void (*sync_cpu_device_pagetables)(struct hmm_mirror *mirror, include/linux/hmm.h: * hmm_mirror_ops.sync_cpu_device_pagetables() callback, so that CPU page mm/hmm.c: mirror->ops->sync_cpu_device_pagetables(mirror, action, That is, this API seems to be currently used by only out-of-tree users. Since we can't check that nobody has memory allocation dependency, I think that hmm_invalidate_range_start() should return -EAGAIN if blockable == false for now.
On Fri 24-08-18 19:54:19, Tetsuo Handa wrote: > Two more worries for this patch. > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c > > @@ -178,12 +178,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn) > > * > > * @amn: our notifier > > */ > > -static void amdgpu_mn_read_lock(struct amdgpu_mn *amn) > > +static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable) > > { > > - mutex_lock(&amn->read_lock); > > + if (blockable) > > + mutex_lock(&amn->read_lock); > > + else if (!mutex_trylock(&amn->read_lock)) > > + return -EAGAIN; > > + > > if (atomic_inc_return(&amn->recursion) == 1) > > down_read_non_owner(&amn->lock); > > Why don't we need to use trylock here if blockable == false ? > Want comment why it is safe to use blocking lock here. Hmm, I am pretty sure I have checked the code but it was quite confusing so I might have missed something. Double checking now, it seems that this read_lock is not used anywhere else and it is not _the_ lock we are interested about. It is the amn->lock (amdgpu_mn_lock) which matters as it is taken in exclusive mode for expensive operations. Is that correct Christian? If this is correct then we need to update the locking here. I am struggling to grasp the ref counting part. Why cannot all readers simply take the lock rather than rely on somebody else to take it? 1ed3d2567c800 didn't really help me to understand the locking scheme here so any help would be appreciated. I am wondering why we cannot do diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c index e55508b39496..93034178673d 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c @@ -180,14 +180,11 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn) */ static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable) { - if (blockable) - mutex_lock(&amn->read_lock); - else if (!mutex_trylock(&amn->read_lock)) - return -EAGAIN; - - if (atomic_inc_return(&amn->recursion) == 1) - down_read_non_owner(&amn->lock); - mutex_unlock(&amn->read_lock); + if (!down_read_trylock(&amn->lock)) { + if (!blockable) + return -EAGAIN; + down_read(amn->lock); + } return 0; } @@ -199,8 +196,7 @@ static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable) */ static void amdgpu_mn_read_unlock(struct amdgpu_mn *amn) { - if (atomic_dec_return(&amn->recursion) == 0) - up_read_non_owner(&amn->lock); + up_read(&amn->lock); } /**
On Fri 24-08-18 19:54:19, Tetsuo Handa wrote: [...] > > --- a/mm/hmm.c > > +++ b/mm/hmm.c > > @@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm) > > up_write(&hmm->mirrors_sem); > > } > > > > -static void hmm_invalidate_range_start(struct mmu_notifier *mn, > > +static int hmm_invalidate_range_start(struct mmu_notifier *mn, > > struct mm_struct *mm, > > unsigned long start, > > - unsigned long end) > > + unsigned long end, > > + bool blockable) > > { > > struct hmm *hmm = mm->hmm; > > > > VM_BUG_ON(!hmm); > > > > atomic_inc(&hmm->sequence); > > + > > + return 0; > > } > > > > static void hmm_invalidate_range_end(struct mmu_notifier *mn, > > This assumes that hmm_invalidate_range_end() does not have memory > allocation dependency. But hmm_invalidate_range() from > hmm_invalidate_range_end() involves > > down_read(&hmm->mirrors_sem); > list_for_each_entry(mirror, &hmm->mirrors, list) > mirror->ops->sync_cpu_device_pagetables(mirror, action, > start, end); > up_read(&hmm->mirrors_sem); > > sequence. What is surprising is that there is no in-tree user who assigns > sync_cpu_device_pagetables field. Yes HMM doesn't have any in tree user yet. > $ grep -Fr sync_cpu_device_pagetables * > Documentation/vm/hmm.rst: /* sync_cpu_device_pagetables() - synchronize page tables > include/linux/hmm.h: * will get callbacks through sync_cpu_device_pagetables() operation (see > include/linux/hmm.h: /* sync_cpu_device_pagetables() - synchronize page tables > include/linux/hmm.h: void (*sync_cpu_device_pagetables)(struct hmm_mirror *mirror, > include/linux/hmm.h: * hmm_mirror_ops.sync_cpu_device_pagetables() callback, so that CPU page > mm/hmm.c: mirror->ops->sync_cpu_device_pagetables(mirror, action, > > That is, this API seems to be currently used by only out-of-tree users. Since > we can't check that nobody has memory allocation dependency, I think that > hmm_invalidate_range_start() should return -EAGAIN if blockable == false for now. The code expects that the invalidate_range_end doesn't block if invalidate_range_start hasn't blocked. That is the reason why the end callback doesn't have blockable parameter. If this doesn't hold then the whole scheme is just fragile because those two calls should pair.
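To make the pairing expectation concrete, a conforming notifier keeps anything that can sleep out of the _end callback entirely, along the lines of the sketch below. The names are illustrative (loosely modelled on the sequence counter hmm bumps in its start callback), not an existing implementation.

```c
/*
 * Illustrative only: a notifier whose _start never needs to sleep can
 * simply succeed in both modes, and its _end is then trivially
 * non-blocking as well, which is what the core code expects.
 */
struct my_mirror {
	struct mmu_notifier mn;
	atomic_t sequence;	/* bumped around every invalidation */
};

static int my_mirror_invalidate_range_start(struct mmu_notifier *mn,
					    struct mm_struct *mm,
					    unsigned long start,
					    unsigned long end,
					    bool blockable)
{
	struct my_mirror *m = container_of(mn, struct my_mirror, mn);

	atomic_inc(&m->sequence);	/* never sleeps, blockable is irrelevant */
	return 0;
}

static void my_mirror_invalidate_range_end(struct mmu_notifier *mn,
					   struct mm_struct *mm,
					   unsigned long start,
					   unsigned long end)
{
	struct my_mirror *m = container_of(mn, struct my_mirror, mn);

	/*
	 * No blockable parameter here: because _start did not block,
	 * _end must not block either, so only non-sleeping work such as
	 * this counter update is allowed.
	 */
	atomic_inc(&m->sequence);
}
```

Readers of the mirrored state would sample the counter before and after their lookup and retry on a change, so neither callback ever has to wait on them.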
Am 24.08.2018 um 13:32 schrieb Michal Hocko: > On Fri 24-08-18 19:54:19, Tetsuo Handa wrote: >> Two more worries for this patch. >> >> >> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c >>> @@ -178,12 +178,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn) >>> * >>> * @amn: our notifier >>> */ >>> -static void amdgpu_mn_read_lock(struct amdgpu_mn *amn) >>> +static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable) >>> { >>> - mutex_lock(&amn->read_lock); >>> + if (blockable) >>> + mutex_lock(&amn->read_lock); >>> + else if (!mutex_trylock(&amn->read_lock)) >>> + return -EAGAIN; >>> + >>> if (atomic_inc_return(&amn->recursion) == 1) >>> down_read_non_owner(&amn->lock); >> Why don't we need to use trylock here if blockable == false ? >> Want comment why it is safe to use blocking lock here. > Hmm, I am pretty sure I have checked the code but it was quite confusing > so I might have missed something. Double checking now, it seems that > this read_lock is not used anywhere else and it is not _the_ lock we are > interested about. It is the amn->lock (amdgpu_mn_lock) which matters as > it is taken in exclusive mode for expensive operations. The write side of the lock is only taken in the command submission IOCTL. So you actually don't need to change anything here (even the proposed changes are overkill) since we can't tear down the struct_mm while an IOCTL is still using. > Is that correct Christian? If this is correct then we need to update the > locking here. I am struggling to grasp the ref counting part. Why cannot > all readers simply take the lock rather than rely on somebody else to > take it? 1ed3d2567c800 didn't really help me to understand the locking > scheme here so any help would be appreciated. That won't work like this there might be multiple invalidate_range_start()/invalidate_range_end() pairs open at the same time. E.g. the lock might be taken recursively and that is illegal for a rw_semaphore. Regards, Christian. > > I am wondering why we cannot do > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c > index e55508b39496..93034178673d 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c > @@ -180,14 +180,11 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn) > */ > static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable) > { > - if (blockable) > - mutex_lock(&amn->read_lock); > - else if (!mutex_trylock(&amn->read_lock)) > - return -EAGAIN; > - > - if (atomic_inc_return(&amn->recursion) == 1) > - down_read_non_owner(&amn->lock); > - mutex_unlock(&amn->read_lock); > + if (!down_read_trylock(&amn->lock)) { > + if (!blockable) > + return -EAGAIN; > + down_read(amn->lock); > + } > > return 0; > } > @@ -199,8 +196,7 @@ static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable) > */ > static void amdgpu_mn_read_unlock(struct amdgpu_mn *amn) > { > - if (atomic_dec_return(&amn->recursion) == 0) > - up_read_non_owner(&amn->lock); > + up_read(&amn->lock); > } > > /** >
On Fri 24-08-18 13:43:16, Christian König wrote: > Am 24.08.2018 um 13:32 schrieb Michal Hocko: > > On Fri 24-08-18 19:54:19, Tetsuo Handa wrote: > > > Two more worries for this patch. > > > > > > > > > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c > > > > @@ -178,12 +178,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn) > > > > * > > > > * @amn: our notifier > > > > */ > > > > -static void amdgpu_mn_read_lock(struct amdgpu_mn *amn) > > > > +static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable) > > > > { > > > > - mutex_lock(&amn->read_lock); > > > > + if (blockable) > > > > + mutex_lock(&amn->read_lock); > > > > + else if (!mutex_trylock(&amn->read_lock)) > > > > + return -EAGAIN; > > > > + > > > > if (atomic_inc_return(&amn->recursion) == 1) > > > > down_read_non_owner(&amn->lock); > > > Why don't we need to use trylock here if blockable == false ? > > > Want comment why it is safe to use blocking lock here. > > Hmm, I am pretty sure I have checked the code but it was quite confusing > > so I might have missed something. Double checking now, it seems that > > this read_lock is not used anywhere else and it is not _the_ lock we are > > interested about. It is the amn->lock (amdgpu_mn_lock) which matters as > > it is taken in exclusive mode for expensive operations. > > The write side of the lock is only taken in the command submission IOCTL. > > So you actually don't need to change anything here (even the proposed > changes are overkill) since we can't tear down the struct_mm while an IOCTL > is still using. I am not so sure. We are not in the mm destruction phase yet. This is mostly about the oom context which might fire right during the IOCTL. If any of the path which is holding the write lock blocks for unbound amount of time or even worse allocates a memory then we are screwed. So we need to back of when blockable = false. > > Is that correct Christian? If this is correct then we need to update the > > locking here. I am struggling to grasp the ref counting part. Why cannot > > all readers simply take the lock rather than rely on somebody else to > > take it? 1ed3d2567c800 didn't really help me to understand the locking > > scheme here so any help would be appreciated. > > That won't work like this there might be multiple > invalidate_range_start()/invalidate_range_end() pairs open at the same time. > E.g. the lock might be taken recursively and that is illegal for a > rw_semaphore. I am not sure I follow. Are you saying that one invalidate_range might trigger another one from the same path?
Am 24.08.2018 um 13:52 schrieb Michal Hocko: > On Fri 24-08-18 13:43:16, Christian König wrote: >> Am 24.08.2018 um 13:32 schrieb Michal Hocko: >>> On Fri 24-08-18 19:54:19, Tetsuo Handa wrote: >>>> Two more worries for this patch. >>>> >>>> >>>> >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c >>>>> @@ -178,12 +178,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn) >>>>> * >>>>> * @amn: our notifier >>>>> */ >>>>> -static void amdgpu_mn_read_lock(struct amdgpu_mn *amn) >>>>> +static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable) >>>>> { >>>>> - mutex_lock(&amn->read_lock); >>>>> + if (blockable) >>>>> + mutex_lock(&amn->read_lock); >>>>> + else if (!mutex_trylock(&amn->read_lock)) >>>>> + return -EAGAIN; >>>>> + >>>>> if (atomic_inc_return(&amn->recursion) == 1) >>>>> down_read_non_owner(&amn->lock); >>>> Why don't we need to use trylock here if blockable == false ? >>>> Want comment why it is safe to use blocking lock here. >>> Hmm, I am pretty sure I have checked the code but it was quite confusing >>> so I might have missed something. Double checking now, it seems that >>> this read_lock is not used anywhere else and it is not _the_ lock we are >>> interested about. It is the amn->lock (amdgpu_mn_lock) which matters as >>> it is taken in exclusive mode for expensive operations. >> The write side of the lock is only taken in the command submission IOCTL. >> >> So you actually don't need to change anything here (even the proposed >> changes are overkill) since we can't tear down the struct_mm while an IOCTL >> is still using. > I am not so sure. We are not in the mm destruction phase yet. This is > mostly about the oom context which might fire right during the IOCTL. If > any of the path which is holding the write lock blocks for unbound > amount of time or even worse allocates a memory then we are screwed. So > we need to back of when blockable = false. Oh, yeah good point. Haven't thought about that possibility. > >>> Is that correct Christian? If this is correct then we need to update the >>> locking here. I am struggling to grasp the ref counting part. Why cannot >>> all readers simply take the lock rather than rely on somebody else to >>> take it? 1ed3d2567c800 didn't really help me to understand the locking >>> scheme here so any help would be appreciated. >> That won't work like this there might be multiple >> invalidate_range_start()/invalidate_range_end() pairs open at the same time. >> E.g. the lock might be taken recursively and that is illegal for a >> rw_semaphore. > I am not sure I follow. Are you saying that one invalidate_range might > trigger another one from the same path? No, but what can happen is: invalidate_range_start(A,B); invalidate_range_start(C,D); ... invalidate_range_end(C,D); invalidate_range_end(A,B); Grabbing the read lock twice would be illegal in this case. Regards, Christian.
On Fri 24-08-18 13:57:52, Christian König wrote: > Am 24.08.2018 um 13:52 schrieb Michal Hocko: > > On Fri 24-08-18 13:43:16, Christian König wrote: [...] > > > That won't work like this there might be multiple > > > invalidate_range_start()/invalidate_range_end() pairs open at the same time. > > > E.g. the lock might be taken recursively and that is illegal for a > > > rw_semaphore. > > I am not sure I follow. Are you saying that one invalidate_range might > > trigger another one from the same path? > > No, but what can happen is: > > invalidate_range_start(A,B); > invalidate_range_start(C,D); > ... > invalidate_range_end(C,D); > invalidate_range_end(A,B); > > Grabbing the read lock twice would be illegal in this case. I am sorry but I still do not follow. What is the context the two are called from? Can you give me an example. I simply do not see it in the code, mostly because I am not familiar with it.
Am 24.08.2018 um 14:03 schrieb Michal Hocko: > On Fri 24-08-18 13:57:52, Christian König wrote: >> Am 24.08.2018 um 13:52 schrieb Michal Hocko: >>> On Fri 24-08-18 13:43:16, Christian König wrote: > [...] >>>> That won't work like this there might be multiple >>>> invalidate_range_start()/invalidate_range_end() pairs open at the same time. >>>> E.g. the lock might be taken recursively and that is illegal for a >>>> rw_semaphore. >>> I am not sure I follow. Are you saying that one invalidate_range might >>> trigger another one from the same path? >> No, but what can happen is: >> >> invalidate_range_start(A,B); >> invalidate_range_start(C,D); >> ... >> invalidate_range_end(C,D); >> invalidate_range_end(A,B); >> >> Grabbing the read lock twice would be illegal in this case. > I am sorry but I still do not follow. What is the context the two are > called from? I don't have the slightest idea. > Can you give me an example. I simply do not see it in the > code, mostly because I am not familiar with it. I'm neither. We stumbled over that by pure observation and after discussing the problem with Jerome came up with this solution. No idea where exactly that case comes from, but I can confirm that it indeed happens. Regards, Christian.
On Fri 24-08-18 14:18:44, Christian König wrote: > Am 24.08.2018 um 14:03 schrieb Michal Hocko: > > On Fri 24-08-18 13:57:52, Christian König wrote: > > > Am 24.08.2018 um 13:52 schrieb Michal Hocko: > > > > On Fri 24-08-18 13:43:16, Christian König wrote: > > [...] > > > > > That won't work like this there might be multiple > > > > > invalidate_range_start()/invalidate_range_end() pairs open at the same time. > > > > > E.g. the lock might be taken recursively and that is illegal for a > > > > > rw_semaphore. > > > > I am not sure I follow. Are you saying that one invalidate_range might > > > > trigger another one from the same path? > > > No, but what can happen is: > > > > > > invalidate_range_start(A,B); > > > invalidate_range_start(C,D); > > > ... > > > invalidate_range_end(C,D); > > > invalidate_range_end(A,B); > > > > > > Grabbing the read lock twice would be illegal in this case. > > I am sorry but I still do not follow. What is the context the two are > > called from? > > I don't have the slightest idea. > > > Can you give me an example. I simply do not see it in the > > code, mostly because I am not familiar with it. > > I'm neither. > > We stumbled over that by pure observation and after discussing the problem > with Jerome came up with this solution. > > No idea where exactly that case comes from, but I can confirm that it indeed > happens. Thiking about it some more, I can imagine that a notifier callback which performs an allocation might trigger a memory reclaim and that in turn might trigger a notifier to be invoked and recurse. But notifier shouldn't really allocate memory. They are called from deep MM code paths and this would be extremely deadlock prone. Maybe Jerome can come up some more realistic scenario. If not then I would propose to simplify the locking here. We have lockdep to catch self deadlocks and it is always better to handle a specific issue rather than having a code without a clear indication how it can recurse.
Am 24.08.2018 um 14:33 schrieb Michal Hocko: > On Fri 24-08-18 14:18:44, Christian König wrote: >> Am 24.08.2018 um 14:03 schrieb Michal Hocko: >>> On Fri 24-08-18 13:57:52, Christian König wrote: >>>> Am 24.08.2018 um 13:52 schrieb Michal Hocko: >>>>> On Fri 24-08-18 13:43:16, Christian König wrote: >>> [...] >>>>>> That won't work like this there might be multiple >>>>>> invalidate_range_start()/invalidate_range_end() pairs open at the same time. >>>>>> E.g. the lock might be taken recursively and that is illegal for a >>>>>> rw_semaphore. >>>>> I am not sure I follow. Are you saying that one invalidate_range might >>>>> trigger another one from the same path? >>>> No, but what can happen is: >>>> >>>> invalidate_range_start(A,B); >>>> invalidate_range_start(C,D); >>>> ... >>>> invalidate_range_end(C,D); >>>> invalidate_range_end(A,B); >>>> >>>> Grabbing the read lock twice would be illegal in this case. >>> I am sorry but I still do not follow. What is the context the two are >>> called from? >> I don't have the slightest idea. >> >>> Can you give me an example. I simply do not see it in the >>> code, mostly because I am not familiar with it. >> I'm neither. >> >> We stumbled over that by pure observation and after discussing the problem >> with Jerome came up with this solution. >> >> No idea where exactly that case comes from, but I can confirm that it indeed >> happens. > Thiking about it some more, I can imagine that a notifier callback which > performs an allocation might trigger a memory reclaim and that in turn > might trigger a notifier to be invoked and recurse. But notifier > shouldn't really allocate memory. They are called from deep MM code > paths and this would be extremely deadlock prone. Maybe Jerome can come > up some more realistic scenario. If not then I would propose to simplify > the locking here. We have lockdep to catch self deadlocks and it is > always better to handle a specific issue rather than having a code > without a clear indication how it can recurse. Well I agree that we should probably fix that, but I have some concerns to remove the existing workaround. See we added that to get rid of a real problem in a customer environment and I don't want to that to show up again. In the meantime I've send out a fix to avoid allocating memory while holding the mn_lock. Thanks for pointing that out, Christian.
On Fri 24-08-18 14:52:26, Christian König wrote: > Am 24.08.2018 um 14:33 schrieb Michal Hocko: [...] > > Thiking about it some more, I can imagine that a notifier callback which > > performs an allocation might trigger a memory reclaim and that in turn > > might trigger a notifier to be invoked and recurse. But notifier > > shouldn't really allocate memory. They are called from deep MM code > > paths and this would be extremely deadlock prone. Maybe Jerome can come > > up some more realistic scenario. If not then I would propose to simplify > > the locking here. We have lockdep to catch self deadlocks and it is > > always better to handle a specific issue rather than having a code > > without a clear indication how it can recurse. > > Well I agree that we should probably fix that, but I have some concerns to > remove the existing workaround. > > See we added that to get rid of a real problem in a customer environment and > I don't want to that to show up again. It would really help to know more about that case and fix it properly rather than workaround it like this. Anyway, let me think how to handle the non-blocking notifier invocation then. I was not able to come up with anything remotely sane yet.
On 2018/08/24 20:36, Michal Hocko wrote: >> That is, this API seems to be currently used by only out-of-tree users. Since >> we can't check that nobody has memory allocation dependency, I think that >> hmm_invalidate_range_start() should return -EAGAIN if blockable == false for now. > > The code expects that the invalidate_range_end doesn't block if > invalidate_range_start hasn't blocked. That is the reason why the end > callback doesn't have blockable parameter. If this doesn't hold then the > whole scheme is just fragile because those two calls should pair. > That is, the more worrisome part in that patch is that I don't know whether using trylock if blockable == false at entry is really sufficient. Since those two calls should pair, I think that we need to determine whether we need to return -EAGAIN at start call by evaluating both calls. Like mn_invl_range_start() involves schedule_delayed_work() which could be blocked on memory allocation under OOM situation, I worry that (currently out-of-tree) users of this API are involving work / recursion. And hmm_release() says that /* * Drop mirrors_sem so callback can wait on any pending * work that might itself trigger mmu_notifier callback * and thus would deadlock with us. */ and keeps "all operations protected by hmm->mirrors_sem held for write are atomic". This suggests that "some operations protected by hmm->mirrors_sem held for read will sleep (and in the worst case involves memory allocation dependency)".
Am 24.08.2018 um 15:01 schrieb Michal Hocko: > On Fri 24-08-18 14:52:26, Christian König wrote: >> Am 24.08.2018 um 14:33 schrieb Michal Hocko: > [...] >>> Thiking about it some more, I can imagine that a notifier callback which >>> performs an allocation might trigger a memory reclaim and that in turn >>> might trigger a notifier to be invoked and recurse. But notifier >>> shouldn't really allocate memory. They are called from deep MM code >>> paths and this would be extremely deadlock prone. Maybe Jerome can come >>> up some more realistic scenario. If not then I would propose to simplify >>> the locking here. We have lockdep to catch self deadlocks and it is >>> always better to handle a specific issue rather than having a code >>> without a clear indication how it can recurse. >> Well I agree that we should probably fix that, but I have some concerns to >> remove the existing workaround. >> >> See we added that to get rid of a real problem in a customer environment and >> I don't want to that to show up again. > It would really help to know more about that case and fix it properly > rather than workaround it like this. Anyway, let me think how to handle > the non-blocking notifier invocation then. I was not able to come up > with anything remotely sane yet. With avoiding allocating memory in the write lock path I don't see an issue any more with that. All what the write lock path does now is adding items to a linked lists, arrays etc.... So there is no more blocking involved here and the read lock side should be able to grab the lock immediately. Christian.
On Fri 24-08-18 15:10:08, Christian König wrote: > Am 24.08.2018 um 15:01 schrieb Michal Hocko: > > On Fri 24-08-18 14:52:26, Christian König wrote: > > > Am 24.08.2018 um 14:33 schrieb Michal Hocko: > > [...] > > > > Thiking about it some more, I can imagine that a notifier callback which > > > > performs an allocation might trigger a memory reclaim and that in turn > > > > might trigger a notifier to be invoked and recurse. But notifier > > > > shouldn't really allocate memory. They are called from deep MM code > > > > paths and this would be extremely deadlock prone. Maybe Jerome can come > > > > up some more realistic scenario. If not then I would propose to simplify > > > > the locking here. We have lockdep to catch self deadlocks and it is > > > > always better to handle a specific issue rather than having a code > > > > without a clear indication how it can recurse. > > > Well I agree that we should probably fix that, but I have some concerns to > > > remove the existing workaround. > > > > > > See we added that to get rid of a real problem in a customer environment and > > > I don't want to that to show up again. > > It would really help to know more about that case and fix it properly > > rather than workaround it like this. Anyway, let me think how to handle > > the non-blocking notifier invocation then. I was not able to come up > > with anything remotely sane yet. > > With avoiding allocating memory in the write lock path I don't see an issue > any more with that. > > All what the write lock path does now is adding items to a linked lists, > arrays etc.... Can we change it to non-sleepable lock then?
Am 24.08.2018 um 15:24 schrieb Michal Hocko: > On Fri 24-08-18 15:10:08, Christian König wrote: >> Am 24.08.2018 um 15:01 schrieb Michal Hocko: >>> On Fri 24-08-18 14:52:26, Christian König wrote: >>>> Am 24.08.2018 um 14:33 schrieb Michal Hocko: >>> [...] >>>>> Thiking about it some more, I can imagine that a notifier callback which >>>>> performs an allocation might trigger a memory reclaim and that in turn >>>>> might trigger a notifier to be invoked and recurse. But notifier >>>>> shouldn't really allocate memory. They are called from deep MM code >>>>> paths and this would be extremely deadlock prone. Maybe Jerome can come >>>>> up some more realistic scenario. If not then I would propose to simplify >>>>> the locking here. We have lockdep to catch self deadlocks and it is >>>>> always better to handle a specific issue rather than having a code >>>>> without a clear indication how it can recurse. >>>> Well I agree that we should probably fix that, but I have some concerns to >>>> remove the existing workaround. >>>> >>>> See we added that to get rid of a real problem in a customer environment and >>>> I don't want to that to show up again. >>> It would really help to know more about that case and fix it properly >>> rather than workaround it like this. Anyway, let me think how to handle >>> the non-blocking notifier invocation then. I was not able to come up >>> with anything remotely sane yet. >> With avoiding allocating memory in the write lock path I don't see an issue >> any more with that. >> >> All what the write lock path does now is adding items to a linked lists, >> arrays etc.... > Can we change it to non-sleepable lock then? No, the write side doesn't sleep any more, but the read side does. See amdgpu_mn_invalidate_node() and that is where you actually need to handle the non-blocking flag correctly. Christian.
On Fri 24-08-18 22:02:23, Tetsuo Handa wrote: > On 2018/08/24 20:36, Michal Hocko wrote: > >> That is, this API seems to be currently used by only out-of-tree users. Since > >> we can't check that nobody has memory allocation dependency, I think that > >> hmm_invalidate_range_start() should return -EAGAIN if blockable == false for now. > > > > The code expects that the invalidate_range_end doesn't block if > > invalidate_range_start hasn't blocked. That is the reason why the end > > callback doesn't have blockable parameter. If this doesn't hold then the > > whole scheme is just fragile because those two calls should pair. > > > That is > > More worrisome part in that patch is that I don't know whether using > trylock if blockable == false at entry is really sufficient. > > . Since those two calls should pair, I think that we need to determine whether > we need to return -EAGAIN at start call by evaluating both calls. Yes, and I believe I have done that audit. Module my misunderstanding of the code. > Like mn_invl_range_start() involves schedule_delayed_work() which could be > blocked on memory allocation under OOM situation, It doesn't because that code path is not invoked for the !blockable case. > I worry that (currently > out-of-tree) users of this API are involving work / recursion. I do not give a slightest about out-of-tree modules. They will have to accomodate to the new API. I have no problems to extend the documentation and be explicit about this expectation. diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 133ba78820ee..698e371aafe3 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -153,7 +153,9 @@ struct mmu_notifier_ops { * * If blockable argument is set to false then the callback cannot * sleep and has to return with -EAGAIN. 0 should be returned - * otherwise. + * otherwise. Please note that if invalidate_range_start approves + * a non-blocking behavior then the same applies to + * invalidate_range_end. * */ int (*invalidate_range_start)(struct mmu_notifier *mn, > And hmm_release() says that > > /* > * Drop mirrors_sem so callback can wait on any pending > * work that might itself trigger mmu_notifier callback > * and thus would deadlock with us. > */ > > and keeps "all operations protected by hmm->mirrors_sem held for write are > atomic". This suggests that "some operations protected by hmm->mirrors_sem held > for read will sleep (and in the worst case involves memory allocation > dependency)". Yes and so what? The clear expectation is that neither of the range notifiers do not sleep in !blocking mode. I really fail to see what you are trying to say.
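To make the pairing argument concrete, this is the shape of the only !blockable call site, abridged from the __oom_reap_task_mm() hunk quoted near the end of this thread (the snippet sits inside the per-vma loop, so vma, start, end, tlb and ret come from that context). A refused _nonblock start skips the whole start/unmap/end sequence for that vma, so invalidate_range_end() only ever runs after a start that already agreed not to block; that is why _end has no blockable parameter and the expectation is documented instead:

	tlb_gather_mmu(&tlb, mm, start, end);
	if (mmu_notifier_invalidate_range_start_nonblock(mm, start, end)) {
		/* some notifier would have to sleep: skip, retry this mm later */
		ret = false;
		continue;
	}
	unmap_page_range(&tlb, vma, start, end, NULL);
	mmu_notifier_invalidate_range_end(mm, start, end);
	tlb_finish_mmu(&tlb, start, end);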
On Fri 24-08-18 15:28:33, Christian König wrote: > Am 24.08.2018 um 15:24 schrieb Michal Hocko: > > On Fri 24-08-18 15:10:08, Christian König wrote: > > > Am 24.08.2018 um 15:01 schrieb Michal Hocko: > > > > On Fri 24-08-18 14:52:26, Christian König wrote: > > > > > Am 24.08.2018 um 14:33 schrieb Michal Hocko: > > > > [...] > > > > > > Thiking about it some more, I can imagine that a notifier callback which > > > > > > performs an allocation might trigger a memory reclaim and that in turn > > > > > > might trigger a notifier to be invoked and recurse. But notifier > > > > > > shouldn't really allocate memory. They are called from deep MM code > > > > > > paths and this would be extremely deadlock prone. Maybe Jerome can come > > > > > > up some more realistic scenario. If not then I would propose to simplify > > > > > > the locking here. We have lockdep to catch self deadlocks and it is > > > > > > always better to handle a specific issue rather than having a code > > > > > > without a clear indication how it can recurse. > > > > > Well I agree that we should probably fix that, but I have some concerns to > > > > > remove the existing workaround. > > > > > > > > > > See we added that to get rid of a real problem in a customer environment and > > > > > I don't want to that to show up again. > > > > It would really help to know more about that case and fix it properly > > > > rather than workaround it like this. Anyway, let me think how to handle > > > > the non-blocking notifier invocation then. I was not able to come up > > > > with anything remotely sane yet. > > > With avoiding allocating memory in the write lock path I don't see an issue > > > any more with that. > > > > > > All what the write lock path does now is adding items to a linked lists, > > > arrays etc.... > > Can we change it to non-sleepable lock then? > > No, the write side doesn't sleep any more, but the read side does. > > See amdgpu_mn_invalidate_node() and that is where you actually need to > handle the non-blocking flag correctly. Ohh, right you are. We already handle that by bailing out before calling amdgpu_mn_invalidate_node in !blockable mode. So does this looks good to you? diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c index e55508b39496..48fa152231be 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c @@ -180,11 +180,15 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn) */ static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable) { - if (blockable) - mutex_lock(&amn->read_lock); - else if (!mutex_trylock(&amn->read_lock)) - return -EAGAIN; - + /* + * We can take sleepable lock even on !blockable mode because + * read_lock is only ever take from this path and the notifier + * lock never really sleeps. In fact the only reason why the + * later is sleepable is because the notifier itself might sleep + * in amdgpu_mn_invalidate_node but blockable mode is handled + * before calling into that path. + */ + mutex_lock(&amn->read_lock); if (atomic_inc_return(&amn->recursion) == 1) down_read_non_owner(&amn->lock); mutex_unlock(&amn->read_lock);
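For context, the bail-out referred to here lives in the range_start callback itself; the following is reconstructed from the amdgpu hunk quoted later in this thread (simplified, and using the older rmn naming from that hunk). Because the !blockable case returns before amdgpu_mn_invalidate_node() is ever reached, the only thing amdgpu_mn_read_lock() protects in that mode is non-sleeping bookkeeping:

static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
						struct mm_struct *mm,
						unsigned long start,
						unsigned long end,
						bool blockable)
{
	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
	struct interval_tree_node *it;

	/* notification is exclusive, but interval is inclusive */
	end -= 1;

	if (amdgpu_mn_read_lock(rmn, blockable))
		return -EAGAIN;

	it = interval_tree_iter_first(&rmn->objects, start, end);
	while (it) {
		struct amdgpu_mn_node *node;

		/*
		 * The potentially sleeping amdgpu_mn_invalidate_node() is
		 * never reached in !blockable mode: drop the lock and ask
		 * the caller to retry instead.
		 */
		if (!blockable) {
			amdgpu_mn_read_unlock(rmn);
			return -EAGAIN;
		}

		node = container_of(it, struct amdgpu_mn_node, it);
		it = interval_tree_iter_next(it, start, end);

		amdgpu_mn_invalidate_node(node, start, end);
	}

	return 0;
}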
Am 24.08.2018 um 15:40 schrieb Michal Hocko: > On Fri 24-08-18 15:28:33, Christian König wrote: >> Am 24.08.2018 um 15:24 schrieb Michal Hocko: >>> On Fri 24-08-18 15:10:08, Christian König wrote: >>>> Am 24.08.2018 um 15:01 schrieb Michal Hocko: >>>>> On Fri 24-08-18 14:52:26, Christian König wrote: >>>>>> Am 24.08.2018 um 14:33 schrieb Michal Hocko: >>>>> [...] >>>>>>> Thiking about it some more, I can imagine that a notifier callback which >>>>>>> performs an allocation might trigger a memory reclaim and that in turn >>>>>>> might trigger a notifier to be invoked and recurse. But notifier >>>>>>> shouldn't really allocate memory. They are called from deep MM code >>>>>>> paths and this would be extremely deadlock prone. Maybe Jerome can come >>>>>>> up some more realistic scenario. If not then I would propose to simplify >>>>>>> the locking here. We have lockdep to catch self deadlocks and it is >>>>>>> always better to handle a specific issue rather than having a code >>>>>>> without a clear indication how it can recurse. >>>>>> Well I agree that we should probably fix that, but I have some concerns to >>>>>> remove the existing workaround. >>>>>> >>>>>> See we added that to get rid of a real problem in a customer environment and >>>>>> I don't want to that to show up again. >>>>> It would really help to know more about that case and fix it properly >>>>> rather than workaround it like this. Anyway, let me think how to handle >>>>> the non-blocking notifier invocation then. I was not able to come up >>>>> with anything remotely sane yet. >>>> With avoiding allocating memory in the write lock path I don't see an issue >>>> any more with that. >>>> >>>> All what the write lock path does now is adding items to a linked lists, >>>> arrays etc.... >>> Can we change it to non-sleepable lock then? >> No, the write side doesn't sleep any more, but the read side does. >> >> See amdgpu_mn_invalidate_node() and that is where you actually need to >> handle the non-blocking flag correctly. > Ohh, right you are. We already handle that by bailing out before calling > amdgpu_mn_invalidate_node in !blockable mode. Yeah, that is sufficient. It could be improved because we have something like 90% chance that amdgpu_mn_invalidate_node() actually doesn't need to do anything. But I can take care of that when the patch set has landed. > So does this looks good to > you? Yeah, that looks perfect to me. Reviewed-by: Christian König <christian.koenig@amd.com> Thanks, Christian. > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c > index e55508b39496..48fa152231be 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c > @@ -180,11 +180,15 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn) > */ > static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable) > { > - if (blockable) > - mutex_lock(&amn->read_lock); > - else if (!mutex_trylock(&amn->read_lock)) > - return -EAGAIN; > - > + /* > + * We can take sleepable lock even on !blockable mode because > + * read_lock is only ever take from this path and the notifier > + * lock never really sleeps. In fact the only reason why the > + * later is sleepable is because the notifier itself might sleep > + * in amdgpu_mn_invalidate_node but blockable mode is handled > + * before calling into that path. > + */ > + mutex_lock(&amn->read_lock); > if (atomic_inc_return(&amn->recursion) == 1) > down_read_non_owner(&amn->lock); > mutex_unlock(&amn->read_lock);
On Fri 24-08-18 15:44:03, Christian König wrote: > Am 24.08.2018 um 15:40 schrieb Michal Hocko: > > On Fri 24-08-18 15:28:33, Christian König wrote: > > > Am 24.08.2018 um 15:24 schrieb Michal Hocko: > > > > On Fri 24-08-18 15:10:08, Christian König wrote: > > > > > Am 24.08.2018 um 15:01 schrieb Michal Hocko: > > > > > > On Fri 24-08-18 14:52:26, Christian König wrote: > > > > > > > Am 24.08.2018 um 14:33 schrieb Michal Hocko: > > > > > > [...] > > > > > > > > Thiking about it some more, I can imagine that a notifier callback which > > > > > > > > performs an allocation might trigger a memory reclaim and that in turn > > > > > > > > might trigger a notifier to be invoked and recurse. But notifier > > > > > > > > shouldn't really allocate memory. They are called from deep MM code > > > > > > > > paths and this would be extremely deadlock prone. Maybe Jerome can come > > > > > > > > up some more realistic scenario. If not then I would propose to simplify > > > > > > > > the locking here. We have lockdep to catch self deadlocks and it is > > > > > > > > always better to handle a specific issue rather than having a code > > > > > > > > without a clear indication how it can recurse. > > > > > > > Well I agree that we should probably fix that, but I have some concerns to > > > > > > > remove the existing workaround. > > > > > > > > > > > > > > See we added that to get rid of a real problem in a customer environment and > > > > > > > I don't want to that to show up again. > > > > > > It would really help to know more about that case and fix it properly > > > > > > rather than workaround it like this. Anyway, let me think how to handle > > > > > > the non-blocking notifier invocation then. I was not able to come up > > > > > > with anything remotely sane yet. > > > > > With avoiding allocating memory in the write lock path I don't see an issue > > > > > any more with that. > > > > > > > > > > All what the write lock path does now is adding items to a linked lists, > > > > > arrays etc.... > > > > Can we change it to non-sleepable lock then? > > > No, the write side doesn't sleep any more, but the read side does. > > > > > > See amdgpu_mn_invalidate_node() and that is where you actually need to > > > handle the non-blocking flag correctly. > > Ohh, right you are. We already handle that by bailing out before calling > > amdgpu_mn_invalidate_node in !blockable mode. > > Yeah, that is sufficient. > > It could be improved because we have something like 90% chance that > amdgpu_mn_invalidate_node() actually doesn't need to do anything. > > But I can take care of that when the patch set has landed. > > > So does this looks good to you? > > Yeah, that looks perfect to me. Reviewed-by: Christian König > <christian.koenig@amd.com> Cool! Thanks for your guidance and patience with me. Here is the full patch. Feel free to take it and route per your preference. From 4e297bf5a55906ee369d70bee9f7beeb3cba74bb Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@suse.com> Date: Fri, 24 Aug 2018 15:45:52 +0200 Subject: [PATCH] drm/amd: clarify amdgpu_mn_read_lock !blocking mode MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Tetsuo has noticed that 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers") !blocking case for amdgpu_mn_read_lock is incomplete because it might sleep on the notifier lock. This is true but as it turned out from the discussion with Christian this doesn't really matter. The amd notifier lock doesn't block in the exclusive mode. 
It only ever sleeps with the read lock inside amdgpu_mn_invalidate_node. That one is not called in !blockable state so while we might sleep on notifier read_lock this will only be for a short while. The same applies on the notifier lock. Therefore remove blockable handling from amdgpu_mn_read_lock and document it properly. Noticed-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Michal Hocko <mhocko@suse.com> --- drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c index e55508b39496..48fa152231be 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c @@ -180,11 +180,15 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn) */ static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable) { - if (blockable) - mutex_lock(&amn->read_lock); - else if (!mutex_trylock(&amn->read_lock)) - return -EAGAIN; - + /* + * We can take sleepable lock even on !blockable mode because + * read_lock is only ever take from this path and the notifier + * lock never really sleeps. In fact the only reason why the + * later is sleepable is because the notifier itself might sleep + * in amdgpu_mn_invalidate_node but blockable mode is handled + * before calling into that path. + */ + mutex_lock(&amn->read_lock); if (atomic_inc_return(&amn->recursion) == 1) down_read_non_owner(&amn->lock); mutex_unlock(&amn->read_lock);
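For completeness, the counterpart that pairs with amdgpu_mn_read_lock() is left untouched by this patch; it is visible in the removal hunks later in the thread and looks roughly like this. The shared lock taken in invalidate_range_start() is only dropped when the last outstanding invalidate_range_end() comes in, possibly from a different task than the one that took it, which is why the _non_owner rwsem primitives are used:

static void amdgpu_mn_read_unlock(struct amdgpu_mn *amn)
{
	/*
	 * Multiple invalidation ranges can be open at once; only the
	 * close of the last one releases the notifier lock.
	 */
	if (atomic_dec_return(&amn->recursion) == 0)
		up_read_non_owner(&amn->lock);
}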
On Fri, Aug 24, 2018 at 07:54:19PM +0900, Tetsuo Handa wrote: > Two more worries for this patch. [...] > > > --- a/mm/hmm.c > > +++ b/mm/hmm.c > > @@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm) > > up_write(&hmm->mirrors_sem); > > } > > > > -static void hmm_invalidate_range_start(struct mmu_notifier *mn, > > +static int hmm_invalidate_range_start(struct mmu_notifier *mn, > > struct mm_struct *mm, > > unsigned long start, > > - unsigned long end) > > + unsigned long end, > > + bool blockable) > > { > > struct hmm *hmm = mm->hmm; > > > > VM_BUG_ON(!hmm); > > > > atomic_inc(&hmm->sequence); > > + > > + return 0; > > } > > > > static void hmm_invalidate_range_end(struct mmu_notifier *mn, > > This assumes that hmm_invalidate_range_end() does not have memory > allocation dependency. But hmm_invalidate_range() from > hmm_invalidate_range_end() involves > > down_read(&hmm->mirrors_sem); > list_for_each_entry(mirror, &hmm->mirrors, list) > mirror->ops->sync_cpu_device_pagetables(mirror, action, > start, end); > up_read(&hmm->mirrors_sem); > > sequence. What is surprising is that there is no in-tree user who assigns > sync_cpu_device_pagetables field. > > $ grep -Fr sync_cpu_device_pagetables * > Documentation/vm/hmm.rst: /* sync_cpu_device_pagetables() - synchronize page tables > include/linux/hmm.h: * will get callbacks through sync_cpu_device_pagetables() operation (see > include/linux/hmm.h: /* sync_cpu_device_pagetables() - synchronize page tables > include/linux/hmm.h: void (*sync_cpu_device_pagetables)(struct hmm_mirror *mirror, > include/linux/hmm.h: * hmm_mirror_ops.sync_cpu_device_pagetables() callback, so that CPU page > mm/hmm.c: mirror->ops->sync_cpu_device_pagetables(mirror, action, > > That is, this API seems to be currently used by only out-of-tree users. Since > we can't check that nobody has memory allocation dependency, I think that > hmm_invalidate_range_start() should return -EAGAIN if blockable == false for now. So you can see update and user of this there: https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-intel-v00 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau-v01 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-radeon-v00 I am still working on Mellanox and AMD GPU patchset. I will post the HMM changes that adapt to Michal shortly as anyway thus have been sufficiently tested by now. https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-4.20&id=78785dcb5ba0924c2c5e7be027793f99ebbc39f3 https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-4.20&id=4fc25571dc893f2b278e90cda9e71e139e01de70 Cheers, Jérôme
On 2018/08/24 22:32, Michal Hocko wrote: > On Fri 24-08-18 22:02:23, Tetsuo Handa wrote: >> I worry that (currently >> out-of-tree) users of this API are involving work / recursion. > > I do not give a slightest about out-of-tree modules. They will have to > accomodate to the new API. I have no problems to extend the > documentation and be explicit about this expectation. You don't need to care about out-of-tree modules. But you need to hear from mm/hmm.c authors/maintainers when making changes for mmu-notifiers. > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > index 133ba78820ee..698e371aafe3 100644 > --- a/include/linux/mmu_notifier.h > +++ b/include/linux/mmu_notifier.h > @@ -153,7 +153,9 @@ struct mmu_notifier_ops { > * > * If blockable argument is set to false then the callback cannot > * sleep and has to return with -EAGAIN. 0 should be returned > - * otherwise. > + * otherwise. Please note that if invalidate_range_start approves > + * a non-blocking behavior then the same applies to > + * invalidate_range_end. Prior to 93065ac753e44438 ("mm, oom: distinguish blockable mode for mmu notifiers"), whether to utilize MMU_INVALIDATE_DOES_NOT_BLOCK was up to mmu-notifiers users. - * If both of these callbacks cannot block, and invalidate_range - * cannot block, mmu_notifier_ops.flags should have - * MMU_INVALIDATE_DOES_NOT_BLOCK set. + * If blockable argument is set to false then the callback cannot + * sleep and has to return with -EAGAIN. 0 should be returned + * otherwise. Even out-of-tree mmu-notifiers users had rights not to accommodate (i.e. make changes) immediately by not setting MMU_INVALIDATE_DOES_NOT_BLOCK. Now we are in a merge window. And we noticed a possibility that out-of-tree mmu-notifiers users might have trouble with making changes immediately in order to follow 93065ac753e44438 if expectation for mm/hmm.c changes immediately. And you are trying to ignore such possibility by just updating expected behavior description instead of giving out-of-tree users a grace period to check and update their code. >> and keeps "all operations protected by hmm->mirrors_sem held for write are >> atomic". This suggests that "some operations protected by hmm->mirrors_sem held >> for read will sleep (and in the worst case involves memory allocation >> dependency)". > > Yes and so what? The clear expectation is that neither of the range > notifiers do not sleep in !blocking mode. I really fail to see what you > are trying to say. I'm saying "Get ACK from Jérôme about mm/hmm.c changes".
On Fri, Aug 24, 2018 at 02:33:41PM +0200, Michal Hocko wrote: > On Fri 24-08-18 14:18:44, Christian König wrote: > > Am 24.08.2018 um 14:03 schrieb Michal Hocko: > > > On Fri 24-08-18 13:57:52, Christian König wrote: > > > > Am 24.08.2018 um 13:52 schrieb Michal Hocko: > > > > > On Fri 24-08-18 13:43:16, Christian König wrote: > > > [...] > > > > > > That won't work like this there might be multiple > > > > > > invalidate_range_start()/invalidate_range_end() pairs open at the same time. > > > > > > E.g. the lock might be taken recursively and that is illegal for a > > > > > > rw_semaphore. > > > > > I am not sure I follow. Are you saying that one invalidate_range might > > > > > trigger another one from the same path? > > > > No, but what can happen is: > > > > > > > > invalidate_range_start(A,B); > > > > invalidate_range_start(C,D); > > > > ... > > > > invalidate_range_end(C,D); > > > > invalidate_range_end(A,B); > > > > > > > > Grabbing the read lock twice would be illegal in this case. > > > I am sorry but I still do not follow. What is the context the two are > > > called from? > > > > I don't have the slightest idea. > > > > > Can you give me an example. I simply do not see it in the > > > code, mostly because I am not familiar with it. > > > > I'm neither. > > > > We stumbled over that by pure observation and after discussing the problem > > with Jerome came up with this solution. > > > > No idea where exactly that case comes from, but I can confirm that it indeed > > happens. > > Thiking about it some more, I can imagine that a notifier callback which > performs an allocation might trigger a memory reclaim and that in turn > might trigger a notifier to be invoked and recurse. But notifier > shouldn't really allocate memory. They are called from deep MM code > paths and this would be extremely deadlock prone. Maybe Jerome can come > up some more realistic scenario. If not then I would propose to simplify > the locking here. We have lockdep to catch self deadlocks and it is > always better to handle a specific issue rather than having a code > without a clear indication how it can recurse. Multiple concurrent mmu notifier, for overlapping range or not, is common (each concurrent threads can trigger some). So you might have multiple invalidate_range_start() in flight for same mm and thus might complete in different order (invalidate_range_end()). IIRC this is what this lock was trying to protect against. I can't think of a reason for recursive mmu notifier call right now. I will ponder see if i remember something about it. Cheers, Jérôme
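The overlapping, possibly out-of-order start/end pairs Jérôme describes are also why some users avoid holding a lock across the pair altogether and track open invalidations with a counter instead. Roughly what virt/kvm/kvm_main.c does, shown here as an illustration rather than a quote and heavily abridged (SRCU, TLB flushing, the sequence counter and the arch hook from the kvm hunk at the end of this thread are all left out):

static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
						   struct mm_struct *mm,
						   unsigned long start,
						   unsigned long end,
						   bool blockable)
{
	struct kvm *kvm = mmu_notifier_to_kvm(mn);

	spin_lock(&kvm->mmu_lock);
	/*
	 * Any number of invalidations may be in flight for this mm; page
	 * faults only check whether the count is non-zero, so the order
	 * in which the matching _end calls arrive does not matter.
	 */
	kvm->mmu_notifier_count++;
	kvm_unmap_hva_range(kvm, start, end);
	spin_unlock(&kvm->mmu_lock);

	/* nothing above sleeps, so the !blockable contract is easy to meet */
	return 0;
}

static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
						  struct mm_struct *mm,
						  unsigned long start,
						  unsigned long end)
{
	struct kvm *kvm = mmu_notifier_to_kvm(mn);

	spin_lock(&kvm->mmu_lock);
	kvm->mmu_notifier_count--;
	spin_unlock(&kvm->mmu_lock);
}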
On Fri, Aug 24, 2018 at 11:52:25PM +0900, Tetsuo Handa wrote: > On 2018/08/24 22:32, Michal Hocko wrote: > > On Fri 24-08-18 22:02:23, Tetsuo Handa wrote: > >> I worry that (currently > >> out-of-tree) users of this API are involving work / recursion. > > > > I do not give a slightest about out-of-tree modules. They will have to > > accomodate to the new API. I have no problems to extend the > > documentation and be explicit about this expectation. > > You don't need to care about out-of-tree modules. But you need to hear from > mm/hmm.c authors/maintainers when making changes for mmu-notifiers. > > > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > > index 133ba78820ee..698e371aafe3 100644 > > --- a/include/linux/mmu_notifier.h > > +++ b/include/linux/mmu_notifier.h > > @@ -153,7 +153,9 @@ struct mmu_notifier_ops { > > * > > * If blockable argument is set to false then the callback cannot > > * sleep and has to return with -EAGAIN. 0 should be returned > > - * otherwise. > > + * otherwise. Please note that if invalidate_range_start approves > > + * a non-blocking behavior then the same applies to > > + * invalidate_range_end. > > Prior to 93065ac753e44438 ("mm, oom: distinguish blockable mode for mmu > notifiers"), whether to utilize MMU_INVALIDATE_DOES_NOT_BLOCK was up to > mmu-notifiers users. > > - * If both of these callbacks cannot block, and invalidate_range > - * cannot block, mmu_notifier_ops.flags should have > - * MMU_INVALIDATE_DOES_NOT_BLOCK set. > + * If blockable argument is set to false then the callback cannot > + * sleep and has to return with -EAGAIN. 0 should be returned > + * otherwise. > > Even out-of-tree mmu-notifiers users had rights not to accommodate (i.e. > make changes) immediately by not setting MMU_INVALIDATE_DOES_NOT_BLOCK. > > Now we are in a merge window. And we noticed a possibility that out-of-tree > mmu-notifiers users might have trouble with making changes immediately in order > to follow 93065ac753e44438 if expectation for mm/hmm.c changes immediately. > And you are trying to ignore such possibility by just updating expected behavior > description instead of giving out-of-tree users a grace period to check and update > their code. Intention is that 99% of HMM users will be upstream as long as they are not people shouldn't worry. We have been working on nouveau to use it for the last year or so. Many bits were added in 4.16, 4.17, 4.18 and i hope it will all be there in 4.20/4.21 timeframe. See my other mail for list of other users. > > >> and keeps "all operations protected by hmm->mirrors_sem held for write are > >> atomic". This suggests that "some operations protected by hmm->mirrors_sem held > >> for read will sleep (and in the worst case involves memory allocation > >> dependency)". > > > > Yes and so what? The clear expectation is that neither of the range > > notifiers do not sleep in !blocking mode. I really fail to see what you > > are trying to say. > > I'm saying "Get ACK from Jérôme about mm/hmm.c changes". I am fine with Michal patch, i already said so couple month ago first time this discussion did pop up, Michal you can add: Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
On Fri 24-08-18 23:52:25, Tetsuo Handa wrote: > On 2018/08/24 22:32, Michal Hocko wrote: > > On Fri 24-08-18 22:02:23, Tetsuo Handa wrote: > >> I worry that (currently > >> out-of-tree) users of this API are involving work / recursion. > > > > I do not give a slightest about out-of-tree modules. They will have to > > accomodate to the new API. I have no problems to extend the > > documentation and be explicit about this expectation. > > You don't need to care about out-of-tree modules. But you need to hear from > mm/hmm.c authors/maintainers when making changes for mmu-notifiers. > > > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > > index 133ba78820ee..698e371aafe3 100644 > > --- a/include/linux/mmu_notifier.h > > +++ b/include/linux/mmu_notifier.h > > @@ -153,7 +153,9 @@ struct mmu_notifier_ops { > > * > > * If blockable argument is set to false then the callback cannot > > * sleep and has to return with -EAGAIN. 0 should be returned > > - * otherwise. > > + * otherwise. Please note that if invalidate_range_start approves > > + * a non-blocking behavior then the same applies to > > + * invalidate_range_end. > > Prior to 93065ac753e44438 ("mm, oom: distinguish blockable mode for mmu > notifiers"), whether to utilize MMU_INVALIDATE_DOES_NOT_BLOCK was up to > mmu-notifiers users. > > - * If both of these callbacks cannot block, and invalidate_range > - * cannot block, mmu_notifier_ops.flags should have > - * MMU_INVALIDATE_DOES_NOT_BLOCK set. > + * If blockable argument is set to false then the callback cannot > + * sleep and has to return with -EAGAIN. 0 should be returned > + * otherwise. > > Even out-of-tree mmu-notifiers users had rights not to accommodate (i.e. > make changes) immediately by not setting MMU_INVALIDATE_DOES_NOT_BLOCK. > > Now we are in a merge window. And we noticed a possibility that out-of-tree > mmu-notifiers users might have trouble with making changes immediately in order > to follow 93065ac753e44438 if expectation for mm/hmm.c changes immediately. > And you are trying to ignore such possibility by just updating expected behavior > description instead of giving out-of-tree users a grace period to check and update > their code. This is just ridiculous. I have no idea what you are trying to achieve here but please read through Documentation/process/stable-api-nonsense.rst before you try to make strong statements again. I have changed an in-kernel interface. I have gone through all users and fixed them up. It is really appreciated to double check after me and I am willing to fix up any fallouts. But that is just about it. I do not get a whit about _any_ out of tree drivers when changing the interface. I am willing to answer any questions regarding this change so developers of those drivers know how to do their change properly but doing so is completely their business. > >> and keeps "all operations protected by hmm->mirrors_sem held for write are > >> atomic". This suggests that "some operations protected by hmm->mirrors_sem held > >> for read will sleep (and in the worst case involves memory allocation > >> dependency)". > > > > Yes and so what? The clear expectation is that neither of the range > > notifiers do not sleep in !blocking mode. I really fail to see what you > > are trying to say. > > I'm saying "Get ACK from Jérôme about mm/hmm.c changes". HMM is a library layer for other driver, until those get merged the same applies for them as well.
On Fri 24-08-18 11:12:40, Jerome Glisse wrote:
[...]
> I am fine with Michal patch, i already said so couple month ago first time
> this discussion did pop up, Michal you can add:
>
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

So I guess the below is the patch you were talking about?

From f7ac75277d526dccd011f343818dc6af627af2af Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 24 Aug 2018 15:32:24 +0200
Subject: [PATCH] mm, mmu_notifier: be explicit about range invalidation
 non-blocking mode

If invalidate_range_start is called for !blocking mode then all
callbacks have to guarantee they will not block/sleep. The same obviously
applies to invalidate_range_end because this operation pairs with the
former and they are called from the same context. Make sure this is
appropriately documented.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mmu_notifier.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 133ba78820ee..698e371aafe3 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -153,7 +153,9 @@ struct mmu_notifier_ops {
 	 *
 	 * If blockable argument is set to false then the callback cannot
 	 * sleep and has to return with -EAGAIN. 0 should be returned
-	 * otherwise.
+	 * otherwise. Please note that if invalidate_range_start approves
+	 * a non-blocking behavior then the same applies to
+	 * invalidate_range_end.
 	 *
 	 */
 	int (*invalidate_range_start)(struct mmu_notifier *mn,
On Fri, Aug 24, 2018 at 06:40:03PM +0200, Michal Hocko wrote:
> On Fri 24-08-18 11:12:40, Jerome Glisse wrote:
> [...]
> > I am fine with Michal patch, i already said so couple month ago first time
> > this discussion did pop up, Michal you can add:
> >
> > Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
>
> So I guess the below is the patch you were talking about?
>
> From f7ac75277d526dccd011f343818dc6af627af2af Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Fri, 24 Aug 2018 15:32:24 +0200
> Subject: [PATCH] mm, mmu_notifier: be explicit about range invalidation
>  non-blocking mode
>
> If invalidate_range_start is called for !blocking mode then all
> callbacks have to guarantee they will not block/sleep. The same obviously
> applies to invalidate_range_end because this operation pairs with the
> former and they are called from the same context. Make sure this is
> appropriately documented.

In my branch I already updated HMM to be like the other existing users,
i.e. all blocking operations happen in the start callback. But yes, it
would be wise to add such comments.

> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/mmu_notifier.h | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 133ba78820ee..698e371aafe3 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -153,7 +153,9 @@ struct mmu_notifier_ops {
>  	 *
>  	 * If blockable argument is set to false then the callback cannot
>  	 * sleep and has to return with -EAGAIN. 0 should be returned
> -	 * otherwise.
> +	 * otherwise. Please note that if invalidate_range_start approves
> +	 * a non-blocking behavior then the same applies to
> +	 * invalidate_range_end.
>  	 *
>  	 */
>  	int (*invalidate_range_start)(struct mmu_notifier *mn,
> --
> 2.18.0
>
> --
> Michal Hocko
> SUSE Labs
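As a sketch of what "all blocking operations in the start callback" means for the !blockable contract: the mirror invalidation loop quoted from hmm_invalidate_range() earlier in the thread moves into range_start, and a !blockable caller is refused before any of it runs. This is illustrative only, not the patch Jérôme actually posted; HMM_UPDATE_INVALIDATE and the mirror list layout are assumed from include/linux/hmm.h of that era:

static int hmm_invalidate_range_start(struct mmu_notifier *mn,
				      struct mm_struct *mm,
				      unsigned long start,
				      unsigned long end,
				      bool blockable)
{
	struct hmm *hmm = mm->hmm;
	struct hmm_mirror *mirror;

	VM_BUG_ON(!hmm);

	atomic_inc(&hmm->sequence);

	/*
	 * Everything below may sleep (mirrors_sem plus the driver
	 * callbacks), so a !blockable caller has to be refused and
	 * will simply retry later.
	 */
	if (!blockable)
		return -EAGAIN;

	down_read(&hmm->mirrors_sem);
	list_for_each_entry(mirror, &hmm->mirrors, list)
		mirror->ops->sync_cpu_device_pagetables(mirror,
							HMM_UPDATE_INVALIDATE,
							start, end);
	up_read(&hmm->mirrors_sem);

	return 0;
}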
On 2018/08/24 22:52, Michal Hocko wrote: > @@ -180,11 +180,15 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn) > */ > static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable) > { > - if (blockable) > - mutex_lock(&amn->read_lock); > - else if (!mutex_trylock(&amn->read_lock)) > - return -EAGAIN; > - > + /* > + * We can take sleepable lock even on !blockable mode because > + * read_lock is only ever take from this path and the notifier > + * lock never really sleeps. In fact the only reason why the > + * later is sleepable is because the notifier itself might sleep > + * in amdgpu_mn_invalidate_node but blockable mode is handled > + * before calling into that path. > + */ > + mutex_lock(&amn->read_lock); > if (atomic_inc_return(&amn->recursion) == 1) > down_read_non_owner(&amn->lock); > mutex_unlock(&amn->read_lock); > I'm not following. Why don't we need to do like below (given that nobody except amdgpu_mn_read_lock() holds ->read_lock) because e.g. drm_sched_fence_create() from drm_sched_job_init() from amdgpu_cs_submit() is doing GFP_KERNEL memory allocation with ->lock held for write? diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c index e55508b..e1cb344 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c @@ -64,8 +64,6 @@ * @node: hash table node to find structure by adev and mn * @lock: rw semaphore protecting the notifier nodes * @objects: interval tree containing amdgpu_mn_nodes - * @read_lock: mutex for recursive locking of @lock - * @recursion: depth of recursion * * Data for each amdgpu device and process address space. */ @@ -85,8 +83,6 @@ struct amdgpu_mn { /* objects protected by lock */ struct rw_semaphore lock; struct rb_root_cached objects; - struct mutex read_lock; - atomic_t recursion; }; /** @@ -181,14 +177,9 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn) static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable) { if (blockable) - mutex_lock(&amn->read_lock); - else if (!mutex_trylock(&amn->read_lock)) + down_read(&amn->lock); + else if (!down_read_trylock(&amn->lock)) return -EAGAIN; - - if (atomic_inc_return(&amn->recursion) == 1) - down_read_non_owner(&amn->lock); - mutex_unlock(&amn->read_lock); - return 0; } @@ -199,8 +190,7 @@ static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable) */ static void amdgpu_mn_read_unlock(struct amdgpu_mn *amn) { - if (atomic_dec_return(&amn->recursion) == 0) - up_read_non_owner(&amn->lock); + up_read(&amn->lock); } /** @@ -410,8 +400,6 @@ struct amdgpu_mn *amdgpu_mn_get(struct amdgpu_device *adev, amn->type = type; amn->mn.ops = &amdgpu_mn_ops[type]; amn->objects = RB_ROOT_CACHED; - mutex_init(&amn->read_lock); - atomic_set(&amn->recursion, 0); r = __mmu_notifier_register(&amn->mn, mm); if (r)
Am 26.08.2018 um 10:40 schrieb Tetsuo Handa: > On 2018/08/24 22:52, Michal Hocko wrote: >> @@ -180,11 +180,15 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn) >> */ >> static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable) >> { >> - if (blockable) >> - mutex_lock(&amn->read_lock); >> - else if (!mutex_trylock(&amn->read_lock)) >> - return -EAGAIN; >> - >> + /* >> + * We can take sleepable lock even on !blockable mode because >> + * read_lock is only ever take from this path and the notifier >> + * lock never really sleeps. In fact the only reason why the >> + * later is sleepable is because the notifier itself might sleep >> + * in amdgpu_mn_invalidate_node but blockable mode is handled >> + * before calling into that path. >> + */ >> + mutex_lock(&amn->read_lock); >> if (atomic_inc_return(&amn->recursion) == 1) >> down_read_non_owner(&amn->lock); >> mutex_unlock(&amn->read_lock); >> > I'm not following. Why don't we need to do like below (given that > nobody except amdgpu_mn_read_lock() holds ->read_lock) because e.g. > drm_sched_fence_create() from drm_sched_job_init() from amdgpu_cs_submit() > is doing GFP_KERNEL memory allocation with ->lock held for write? That's a bug which needs to be fixed separately. Allocating memory with GFP_KERNEL while holding a lock which is also taken in the reclaim code path is illegal not matter what you do. Patches to fix this are already on the appropriate mailing list and will be pushed upstream today. Regards, Christian. > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c > index e55508b..e1cb344 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c > @@ -64,8 +64,6 @@ > * @node: hash table node to find structure by adev and mn > * @lock: rw semaphore protecting the notifier nodes > * @objects: interval tree containing amdgpu_mn_nodes > - * @read_lock: mutex for recursive locking of @lock > - * @recursion: depth of recursion > * > * Data for each amdgpu device and process address space. > */ > @@ -85,8 +83,6 @@ struct amdgpu_mn { > /* objects protected by lock */ > struct rw_semaphore lock; > struct rb_root_cached objects; > - struct mutex read_lock; > - atomic_t recursion; > }; > > /** > @@ -181,14 +177,9 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn) > static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable) > { > if (blockable) > - mutex_lock(&amn->read_lock); > - else if (!mutex_trylock(&amn->read_lock)) > + down_read(&amn->lock); > + else if (!down_read_trylock(&amn->lock)) > return -EAGAIN; > - > - if (atomic_inc_return(&amn->recursion) == 1) > - down_read_non_owner(&amn->lock); > - mutex_unlock(&amn->read_lock); > - > return 0; > } > > @@ -199,8 +190,7 @@ static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable) > */ > static void amdgpu_mn_read_unlock(struct amdgpu_mn *amn) > { > - if (atomic_dec_return(&amn->recursion) == 0) > - up_read_non_owner(&amn->lock); > + up_read(&amn->lock); > } > > /** > @@ -410,8 +400,6 @@ struct amdgpu_mn *amdgpu_mn_get(struct amdgpu_device *adev, > amn->type = type; > amn->mn.ops = &amdgpu_mn_ops[type]; > amn->objects = RB_ROOT_CACHED; > - mutex_init(&amn->read_lock); > - atomic_set(&amn->recursion, 0); > > r = __mmu_notifier_register(&amn->mn, mm); > if (r)
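The bug class Christian describes, as a minimal sketch with made-up names (drv_device, drv_fence and notifier_lock are hypothetical, this is not the actual amdgpu fix): an allocation done under a lock that the driver's MMU notifier also takes can recurse into direct reclaim, reclaim can invoke that very notifier on the same mm, and the notifier then deadlocks on the lock. The fix is to do the allocation before taking the lock, so the locked section itself never sleeps:

static int drv_submit_job(struct drv_device *drv)
{
	struct drv_fence *fence;

	/*
	 * Allocate before taking drv->notifier_lock. If this kzalloc()
	 * were inside the locked section, GFP_KERNEL could enter direct
	 * reclaim, reclaim could call our invalidate_range_start(), and
	 * that callback takes drv->notifier_lock again: deadlock.
	 */
	fence = kzalloc(sizeof(*fence), GFP_KERNEL);
	if (!fence)
		return -ENOMEM;

	down_write(&drv->notifier_lock);
	list_add_tail(&fence->list, &drv->pending);	/* no sleeping here */
	up_write(&drv->notifier_lock);

	return 0;
}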
On 2018/08/27 16:41, Christian König wrote:
> Am 26.08.2018 um 10:40 schrieb Tetsuo Handa:
>> I'm not following. Why don't we need to do like below (given that
>> nobody except amdgpu_mn_read_lock() holds ->read_lock) because e.g.
>> drm_sched_fence_create() from drm_sched_job_init() from amdgpu_cs_submit()
>> is doing GFP_KERNEL memory allocation with ->lock held for write?
>
> That's a bug which needs to be fixed separately.
>
> Allocating memory with GFP_KERNEL while holding a lock which is also taken in the reclaim code path is illegal not matter what you do.
>
> Patches to fix this are already on the appropriate mailing list and will be pushed upstream today.
>
> Regards,
> Christian.

Commit 4a2de54dc1d7668f ("drm/amdgpu: fix holding mn_lock while allocating
memory") seems to call amdgpu_mn_unlock() without amdgpu_mn_lock() when
drm_sched_job_init() fails...

Michal, you are asking me to fix all bugs (including out-of-tree code) and
prevent future bugs, just because you want to avoid using a timeout in order
to avoid OOM lockup
( https://marc.info/?i=55a3fb37-3246-73d7-0f45-5835a3f4831c@i-love.sakura.ne.jp ).
That is too much to ask; it is impossible even for you. The more you count on
the OOM reaper, the more we complicate the dependencies, and the more likely
we are to stumble over unreviewed/untested code...
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 6bcecc325e7e..ac08f5d711be 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu) kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap); } -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, - unsigned long start, unsigned long end) +int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, + unsigned long start, unsigned long end, + bool blockable) { unsigned long apic_address; @@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT); if (start <= apic_address && apic_address < end) kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD); + + return 0; } void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c index 83e344fbb50a..3399a4a927fb 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c @@ -136,12 +136,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn) * * Take the rmn read side lock. */ -static void amdgpu_mn_read_lock(struct amdgpu_mn *rmn) +static int amdgpu_mn_read_lock(struct amdgpu_mn *rmn, bool blockable) { - mutex_lock(&rmn->read_lock); + if (blockable) + mutex_lock(&rmn->read_lock); + else if (!mutex_trylock(&rmn->read_lock)) + return -EAGAIN; + if (atomic_inc_return(&rmn->recursion) == 1) down_read_non_owner(&rmn->lock); mutex_unlock(&rmn->read_lock); + + return 0; } /** @@ -197,10 +203,11 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node, * We block for all BOs between start and end to be idle and * unmap them by move them into system domain again. */ -static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn, +static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn, struct mm_struct *mm, unsigned long start, - unsigned long end) + unsigned long end, + bool blockable) { struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn); struct interval_tree_node *it; @@ -208,17 +215,28 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn, /* notification is exclusive, but interval is inclusive */ end -= 1; - amdgpu_mn_read_lock(rmn); + /* TODO we should be able to split locking for interval tree and + * amdgpu_mn_invalidate_node + */ + if (amdgpu_mn_read_lock(rmn, blockable)) + return -EAGAIN; it = interval_tree_iter_first(&rmn->objects, start, end); while (it) { struct amdgpu_mn_node *node; + if (!blockable) { + amdgpu_mn_read_unlock(rmn); + return -EAGAIN; + } + node = container_of(it, struct amdgpu_mn_node, it); it = interval_tree_iter_next(it, start, end); amdgpu_mn_invalidate_node(node, start, end); } + + return 0; } /** @@ -233,10 +251,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn, * necessitates evicting all user-mode queues of the process. The BOs * are restorted in amdgpu_mn_invalidate_range_end_hsa. 
*/ -static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn, +static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn, struct mm_struct *mm, unsigned long start, - unsigned long end) + unsigned long end, + bool blockable) { struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn); struct interval_tree_node *it; @@ -244,13 +263,19 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn, /* notification is exclusive, but interval is inclusive */ end -= 1; - amdgpu_mn_read_lock(rmn); + if (amdgpu_mn_read_lock(rmn, blockable)) + return -EAGAIN; it = interval_tree_iter_first(&rmn->objects, start, end); while (it) { struct amdgpu_mn_node *node; struct amdgpu_bo *bo; + if (!blockable) { + amdgpu_mn_read_unlock(rmn); + return -EAGAIN; + } + node = container_of(it, struct amdgpu_mn_node, it); it = interval_tree_iter_next(it, start, end); @@ -262,6 +287,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn, amdgpu_amdkfd_evict_userptr(mem, mm); } } + + return 0; } /** diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c index 854bd51b9478..9cbff68f6b41 100644 --- a/drivers/gpu/drm/i915/i915_gem_userptr.c +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo) mo->attached = false; } -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, struct mm_struct *mm, unsigned long start, - unsigned long end) + unsigned long end, + bool blockable) { struct i915_mmu_notifier *mn = container_of(_mn, struct i915_mmu_notifier, mn); @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, LIST_HEAD(cancelled); if (RB_EMPTY_ROOT(&mn->objects.rb_root)) - return; + return 0; /* interval ranges are inclusive, but invalidate range is exclusive */ end--; @@ -132,6 +133,10 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, spin_lock(&mn->lock); it = interval_tree_iter_first(&mn->objects, start, end); while (it) { + if (!blockable) { + spin_unlock(&mn->lock); + return -EAGAIN; + } /* The mmu_object is released late when destroying the * GEM object so it is entirely possible to gain a * reference on an object in the process of being freed @@ -154,6 +159,8 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, if (!list_empty(&cancelled)) flush_workqueue(mn->wq); + + return 0; } static const struct mmu_notifier_ops i915_gem_userptr_notifier = { diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c index abd24975c9b1..f8b35df44c60 100644 --- a/drivers/gpu/drm/radeon/radeon_mn.c +++ b/drivers/gpu/drm/radeon/radeon_mn.c @@ -118,19 +118,27 @@ static void radeon_mn_release(struct mmu_notifier *mn, * We block for all BOs between start and end to be idle and * unmap them by move them into system domain again. 
*/ -static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn, +static int radeon_mn_invalidate_range_start(struct mmu_notifier *mn, struct mm_struct *mm, unsigned long start, - unsigned long end) + unsigned long end, + bool blockable) { struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn); struct ttm_operation_ctx ctx = { false, false }; struct interval_tree_node *it; + int ret = 0; /* notification is exclusive, but interval is inclusive */ end -= 1; - mutex_lock(&rmn->lock); + /* TODO we should be able to split locking for interval tree and + * the tear down. + */ + if (blockable) + mutex_lock(&rmn->lock); + else if (!mutex_trylock(&rmn->lock)) + return -EAGAIN; it = interval_tree_iter_first(&rmn->objects, start, end); while (it) { @@ -138,6 +146,11 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn, struct radeon_bo *bo; long r; + if (!blockable) { + ret = -EAGAIN; + goto out_unlock; + } + node = container_of(it, struct radeon_mn_node, it); it = interval_tree_iter_next(it, start, end); @@ -166,7 +179,10 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn, } } +out_unlock: mutex_unlock(&rmn->lock); + + return ret; } static const struct mmu_notifier_ops radeon_mn_ops = { diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c index 182436b92ba9..6ec748eccff7 100644 --- a/drivers/infiniband/core/umem_odp.c +++ b/drivers/infiniband/core/umem_odp.c @@ -186,6 +186,7 @@ static void ib_umem_notifier_release(struct mmu_notifier *mn, rbt_ib_umem_for_each_in_range(&context->umem_tree, 0, ULLONG_MAX, ib_umem_notifier_release_trampoline, + true, NULL); up_read(&context->umem_rwsem); } @@ -207,22 +208,31 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start, return 0; } -static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn, +static int ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn, struct mm_struct *mm, unsigned long start, - unsigned long end) + unsigned long end, + bool blockable) { struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn); + int ret; if (!context->invalidate_range) - return; + return 0; + + if (blockable) + down_read(&context->umem_rwsem); + else if (!down_read_trylock(&context->umem_rwsem)) + return -EAGAIN; ib_ucontext_notifier_start_account(context); - down_read(&context->umem_rwsem); - rbt_ib_umem_for_each_in_range(&context->umem_tree, start, + ret = rbt_ib_umem_for_each_in_range(&context->umem_tree, start, end, - invalidate_range_start_trampoline, NULL); + invalidate_range_start_trampoline, + blockable, NULL); up_read(&context->umem_rwsem); + + return ret; } static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start, @@ -242,10 +252,15 @@ static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn, if (!context->invalidate_range) return; + /* + * TODO: we currently bail out if there is any sleepable work to be done + * in ib_umem_notifier_invalidate_range_start so we shouldn't really block + * here. But this is ugly and fragile. 
+ */ down_read(&context->umem_rwsem); rbt_ib_umem_for_each_in_range(&context->umem_tree, start, end, - invalidate_range_end_trampoline, NULL); + invalidate_range_end_trampoline, true, NULL); up_read(&context->umem_rwsem); ib_ucontext_notifier_end_account(context); } @@ -798,6 +813,7 @@ EXPORT_SYMBOL(ib_umem_odp_unmap_dma_pages); int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root, u64 start, u64 last, umem_call_back cb, + bool blockable, void *cookie) { int ret_val = 0; @@ -809,6 +825,9 @@ int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root, for (node = rbt_ib_umem_iter_first(root, start, last - 1); node; node = next) { + /* TODO move the blockable decision up to the callback */ + if (!blockable) + return -EAGAIN; next = rbt_ib_umem_iter_next(node, start, last - 1); umem = container_of(node, struct ib_umem_odp, interval_tree); ret_val = cb(umem->umem, start, last, cookie) || ret_val; diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c index 70aceefe14d5..e1c7996c018e 100644 --- a/drivers/infiniband/hw/hfi1/mmu_rb.c +++ b/drivers/infiniband/hw/hfi1/mmu_rb.c @@ -67,9 +67,9 @@ struct mmu_rb_handler { static unsigned long mmu_node_start(struct mmu_rb_node *); static unsigned long mmu_node_last(struct mmu_rb_node *); -static void mmu_notifier_range_start(struct mmu_notifier *, +static int mmu_notifier_range_start(struct mmu_notifier *, struct mm_struct *, - unsigned long, unsigned long); + unsigned long, unsigned long, bool); static struct mmu_rb_node *__mmu_rb_search(struct mmu_rb_handler *, unsigned long, unsigned long); static void do_remove(struct mmu_rb_handler *handler, @@ -284,10 +284,11 @@ void hfi1_mmu_rb_remove(struct mmu_rb_handler *handler, handler->ops->remove(handler->ops_arg, node); } -static void mmu_notifier_range_start(struct mmu_notifier *mn, +static int mmu_notifier_range_start(struct mmu_notifier *mn, struct mm_struct *mm, unsigned long start, - unsigned long end) + unsigned long end, + bool blockable) { struct mmu_rb_handler *handler = container_of(mn, struct mmu_rb_handler, mn); @@ -313,6 +314,8 @@ static void mmu_notifier_range_start(struct mmu_notifier *mn, if (added) queue_work(handler->wq, &handler->del_work); + + return 0; } /* diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c index f1a87a690a4c..d216e0d2921d 100644 --- a/drivers/infiniband/hw/mlx5/odp.c +++ b/drivers/infiniband/hw/mlx5/odp.c @@ -488,7 +488,7 @@ void mlx5_ib_free_implicit_mr(struct mlx5_ib_mr *imr) down_read(&ctx->umem_rwsem); rbt_ib_umem_for_each_in_range(&ctx->umem_tree, 0, ULLONG_MAX, - mr_leaf_free, imr); + mr_leaf_free, true, imr); up_read(&ctx->umem_rwsem); wait_event(imr->q_leaf_free, !atomic_read(&imr->num_leaf_free)); diff --git a/drivers/misc/mic/scif/scif_dma.c b/drivers/misc/mic/scif/scif_dma.c index 63d6246d6dff..6369aeaa7056 100644 --- a/drivers/misc/mic/scif/scif_dma.c +++ b/drivers/misc/mic/scif/scif_dma.c @@ -200,15 +200,18 @@ static void scif_mmu_notifier_release(struct mmu_notifier *mn, schedule_work(&scif_info.misc_work); } -static void scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, +static int scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, struct mm_struct *mm, unsigned long start, - unsigned long end) + unsigned long end, + bool blockable) { struct scif_mmu_notif *mmn; mmn = container_of(mn, struct scif_mmu_notif, ep_mmu_notifier); scif_rma_destroy_tcw(mmn, start, end - start); + + return 0; } static void scif_mmu_notifier_invalidate_range_end(struct 
mmu_notifier *mn, diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c index a3454eb56fbf..be28f05bfafa 100644 --- a/drivers/misc/sgi-gru/grutlbpurge.c +++ b/drivers/misc/sgi-gru/grutlbpurge.c @@ -219,9 +219,10 @@ void gru_flush_all_tlb(struct gru_state *gru) /* * MMUOPS notifier callout functions */ -static void gru_invalidate_range_start(struct mmu_notifier *mn, +static int gru_invalidate_range_start(struct mmu_notifier *mn, struct mm_struct *mm, - unsigned long start, unsigned long end) + unsigned long start, unsigned long end, + bool blockable) { struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct, ms_notifier); @@ -231,6 +232,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn, gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms, start, end, atomic_read(&gms->ms_range_active)); gru_flush_tlb_range(gms, start, end - start); + + return 0; } static void gru_invalidate_range_end(struct mmu_notifier *mn, diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c index bd56653b9bbc..55b4f0e3f4d6 100644 --- a/drivers/xen/gntdev.c +++ b/drivers/xen/gntdev.c @@ -441,18 +441,25 @@ static const struct vm_operations_struct gntdev_vmops = { /* ------------------------------------------------------------------ */ +static bool in_range(struct grant_map *map, + unsigned long start, unsigned long end) +{ + if (!map->vma) + return false; + if (map->vma->vm_start >= end) + return false; + if (map->vma->vm_end <= start) + return false; + + return true; +} + static void unmap_if_in_range(struct grant_map *map, unsigned long start, unsigned long end) { unsigned long mstart, mend; int err; - if (!map->vma) - return; - if (map->vma->vm_start >= end) - return; - if (map->vma->vm_end <= start) - return; mstart = max(start, map->vma->vm_start); mend = min(end, map->vma->vm_end); pr_debug("map %d+%d (%lx %lx), range %lx %lx, mrange %lx %lx\n", @@ -465,21 +472,40 @@ static void unmap_if_in_range(struct grant_map *map, WARN_ON(err); } -static void mn_invl_range_start(struct mmu_notifier *mn, +static int mn_invl_range_start(struct mmu_notifier *mn, struct mm_struct *mm, - unsigned long start, unsigned long end) + unsigned long start, unsigned long end, + bool blockable) { struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn); struct grant_map *map; + int ret = 0; + + /* TODO do we really need a mutex here? 
*/ + if (blockable) + mutex_lock(&priv->lock); + else if (!mutex_trylock(&priv->lock)) + return -EAGAIN; - mutex_lock(&priv->lock); list_for_each_entry(map, &priv->maps, next) { + if (in_range(map, start, end)) { + ret = -EAGAIN; + goto out_unlock; + } unmap_if_in_range(map, start, end); } list_for_each_entry(map, &priv->freeable_maps, next) { + if (in_range(map, start, end)) { + ret = -EAGAIN; + goto out_unlock; + } unmap_if_in_range(map, start, end); } + +out_unlock: mutex_unlock(&priv->lock); + + return ret; } static void mn_release(struct mmu_notifier *mn, diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 4ee7bc548a83..148935085194 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -1275,8 +1275,8 @@ static inline long kvm_arch_vcpu_async_ioctl(struct file *filp, } #endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */ -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, - unsigned long start, unsigned long end); +int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, + unsigned long start, unsigned long end, bool blockable); #ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu); diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 392e6af82701..2eb1a2d01759 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -151,13 +151,15 @@ struct mmu_notifier_ops { * address space but may still be referenced by sptes until * the last refcount is dropped. * - * If both of these callbacks cannot block, and invalidate_range - * cannot block, mmu_notifier_ops.flags should have - * MMU_INVALIDATE_DOES_NOT_BLOCK set. + * If blockable argument is set to false then the callback cannot + * sleep and has to return with -EAGAIN. 0 should be returned + * otherwise. 
+ * */ - void (*invalidate_range_start)(struct mmu_notifier *mn, + int (*invalidate_range_start)(struct mmu_notifier *mn, struct mm_struct *mm, - unsigned long start, unsigned long end); + unsigned long start, unsigned long end, + bool blockable); void (*invalidate_range_end)(struct mmu_notifier *mn, struct mm_struct *mm, unsigned long start, unsigned long end); @@ -229,8 +231,9 @@ extern int __mmu_notifier_test_young(struct mm_struct *mm, unsigned long address); extern void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address, pte_t pte); -extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, - unsigned long start, unsigned long end); +extern int __mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end, + bool blockable); extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, unsigned long start, unsigned long end, bool only_end); @@ -281,7 +284,17 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, unsigned long start, unsigned long end) { if (mm_has_notifiers(mm)) - __mmu_notifier_invalidate_range_start(mm, start, end); + __mmu_notifier_invalidate_range_start(mm, start, end, true); +} + +static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + int ret = 0; + if (mm_has_notifiers(mm)) + ret = __mmu_notifier_invalidate_range_start(mm, start, end, false); + + return ret; } static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, @@ -461,6 +474,12 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, { } +static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + return 0; +} + static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, unsigned long start, unsigned long end) { diff --git a/include/linux/oom.h b/include/linux/oom.h index 6adac113e96d..92f70e4c6252 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -95,7 +95,7 @@ static inline int check_stable_address_space(struct mm_struct *mm) return 0; } -void __oom_reap_task_mm(struct mm_struct *mm); +bool __oom_reap_task_mm(struct mm_struct *mm); extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, const nodemask_t *nodemask, diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h index 6a17f856f841..381cdf5a9bd1 100644 --- a/include/rdma/ib_umem_odp.h +++ b/include/rdma/ib_umem_odp.h @@ -119,7 +119,8 @@ typedef int (*umem_call_back)(struct ib_umem *item, u64 start, u64 end, */ int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root, u64 start, u64 end, - umem_call_back cb, void *cookie); + umem_call_back cb, + bool blockable, void *cookie); /* * Find first region intersecting with address range. 
diff --git a/mm/hmm.c b/mm/hmm.c
index de7b6bf77201..81fd57bd2634 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	up_write(&hmm->mirrors_sem);
 }
 
-static void hmm_invalidate_range_start(struct mmu_notifier *mn,
+static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
 				       unsigned long start,
-				       unsigned long end)
+				       unsigned long end,
+				       bool blockable)
 {
 	struct hmm *hmm = mm->hmm;
 
 	VM_BUG_ON(!hmm);
 
 	atomic_inc(&hmm->sequence);
+
+	return 0;
 }
 
 static void hmm_invalidate_range_end(struct mmu_notifier *mn,
diff --git a/mm/mmap.c b/mm/mmap.c
index d1eb87ef4b1a..336bee8c4e25 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3074,7 +3074,7 @@ void exit_mmap(struct mm_struct *mm)
 	 * reliably test it.
 	 */
 	mutex_lock(&oom_lock);
-	__oom_reap_task_mm(mm);
+	(void)__oom_reap_task_mm(mm);
 	mutex_unlock(&oom_lock);
 
 	set_bit(MMF_OOM_SKIP, &mm->flags);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index eff6b88a993f..103b2b450043 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -174,18 +174,29 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
 	srcu_read_unlock(&srcu, id);
 }
 
-void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+				  unsigned long start, unsigned long end,
+				  bool blockable)
 {
 	struct mmu_notifier *mn;
+	int ret = 0;
 	int id;
 
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
-		if (mn->ops->invalidate_range_start)
-			mn->ops->invalidate_range_start(mn, mm, start, end);
+		if (mn->ops->invalidate_range_start) {
+			int _ret = mn->ops->invalidate_range_start(mn, mm, start, end, blockable);
+			if (_ret) {
+				pr_info("%pS callback failed with %d in %sblockable context.\n",
+						mn->ops->invalidate_range_start, _ret,
+						!blockable ? "non-": "");
+				ret = _ret;
+			}
+		}
 	}
 	srcu_read_unlock(&srcu, id);
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 84081e77bc51..5a936cf24d79 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -479,9 +479,10 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-void __oom_reap_task_mm(struct mm_struct *mm)
+bool __oom_reap_task_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
+	bool ret = true;
 
 	/*
 	 * Tell all users of get_user/copy_from_user etc... that the content
@@ -511,12 +512,17 @@ void __oom_reap_task_mm(struct mm_struct *mm)
 			struct mmu_gather tlb;
 
 			tlb_gather_mmu(&tlb, mm, start, end);
-			mmu_notifier_invalidate_range_start(mm, start, end);
+			if (mmu_notifier_invalidate_range_start_nonblock(mm, start, end)) {
+				ret = false;
+				continue;
+			}
 			unmap_page_range(&tlb, vma, start, end, NULL);
 			mmu_notifier_invalidate_range_end(mm, start, end);
 			tlb_finish_mmu(&tlb, start, end);
 		}
 	}
+
+	return ret;
 }
 
 static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
@@ -545,18 +551,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 		goto unlock_oom;
 	}
 
-	/*
-	 * If the mm has invalidate_{start,end}() notifiers that could block,
-	 * sleep to give the oom victim some more time.
-	 * TODO: we really want to get rid of this ugly hack and make sure that
-	 * notifiers cannot block for unbounded amount of time
-	 */
-	if (mm_has_blockable_invalidate_notifiers(mm)) {
-		up_read(&mm->mmap_sem);
-		schedule_timeout_idle(HZ);
-		goto unlock_oom;
-	}
-
 	/*
 	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
 	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
@@ -571,7 +565,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 
 	trace_start_task_reaping(tsk->pid);
 
-	__oom_reap_task_mm(mm);
+	/* failed to reap part of the address space. Try again later */
+	if (!__oom_reap_task_mm(mm)) {
+		up_read(&mm->mmap_sem);
+		ret = false;
+		goto unlock_oom;
+	}
 
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 			task_pid_nr(tsk), tsk->comm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ada21f47f22b..16ce38f178d1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -135,9 +135,10 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
 static unsigned long long kvm_createvm_count;
 static unsigned long long kvm_active_vms;
 
-__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
-		unsigned long start, unsigned long end)
+__weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+		unsigned long start, unsigned long end, bool blockable)
 {
+	return 0;
 }
 
 bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
@@ -354,13 +355,15 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	srcu_read_unlock(&kvm->srcu, idx);
 }
 
-static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
+static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
-						    unsigned long end)
+						    unsigned long end,
+						    bool blockable)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush = 0, idx;
+	int ret;
 
 	idx = srcu_read_lock(&kvm->srcu);
 	spin_lock(&kvm->mmu_lock);
@@ -378,9 +381,11 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 
 	spin_unlock(&kvm->mmu_lock);
 
-	kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
+	ret = kvm_arch_mmu_notifier_invalidate_range(kvm, start, end, blockable);
 
 	srcu_read_unlock(&kvm->srcu, idx);
+
+	return ret;
 }
 
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,