diff mbox

mm, oom: distinguish blockable mode for mmu notifiers

Message ID 20180716115058.5559-1-mhocko@kernel.org (mailing list archive)
State New, archived
Headers show

Commit Message

Michal Hocko July 16, 2018, 11:50 a.m. UTC
From: Michal Hocko <mhocko@suse.com>

There are several blockable mmu notifiers which might sleep in
mmu_notifier_invalidate_range_start and that is a problem for the
oom_reaper because it needs to guarantee a forward progress so it cannot
depend on any sleepable locks.

Currently we simply back off and mark an oom victim with blockable mmu
notifiers as done after a short sleep. That can result in selecting a
new oom victim prematurely because the previous one still hasn't torn
its memory down yet.

We can do much better though. Even if mmu notifiers use sleepable locks
there is no reason to automatically assume those locks are held.
Moreover majority of notifiers only care about a portion of the address
space and there is absolutely zero reason to fail when we are unmapping an
unrelated range. Many notifiers do really block and wait for HW which is
harder to handle and we have to bail out though.

This patch handles the low hanging fruid. __mmu_notifier_invalidate_range_start
gets a blockable flag and callbacks are not allowed to sleep if the
flag is set to false. This is achieved by using trylock instead of the
sleepable lock for most callbacks and continue as long as we do not
block down the call chain.

I think we can improve that even further because there is a common
pattern to do a range lookup first and then do something about that.
The first part can be done without a sleeping lock in most cases AFAICS.

The oom_reaper end then simply retries if there is at least one notifier
which couldn't make any progress in !blockable mode. A retry loop is
already implemented to wait for the mmap_sem and this is basically the
same thing.

Changes since rfc v1
- gpu notifiers can sleep while waiting for HW (evict_process_queues_cpsch
  on a lock and amdgpu_mn_invalidate_node on unbound timeout) make sure
  we bail out when we have an intersecting range for starter
- note that a notifier failed to the log for easier debugging
- back off early in ib_umem_notifier_invalidate_range_start if the
  callback is called
- mn_invl_range_start waits for completion down the unmap_grant_pages
  path so we have to back off early on overlapping ranges

Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Sudeep Dutt <sudeep.dutt@intel.com>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: linux-rdma@vger.kernel.org
Cc: xen-devel@lists.xenproject.org
Cc: linux-mm@kvack.org
Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers
Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp
Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---

Hi,
there were no major objections when I sent this as an RFC the last time
[1]. I was hoping for more feedback in the drivers land because I am
touching the code I have no way to test. On the other hand the pattern
is quite simple and consistent over all users so there shouldn't be
any large surprises hopefully.

Any further review would be highly appreciate of course. But is this
something to put into the mm tree now?

[1] http://lkml.kernel.org/r/20180627074421.GF32348@dhcp22.suse.cz


 arch/x86/kvm/x86.c                      |  7 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  | 43 +++++++++++++++++++-----
 drivers/gpu/drm/i915/i915_gem_userptr.c | 13 ++++++--
 drivers/gpu/drm/radeon/radeon_mn.c      | 22 +++++++++++--
 drivers/infiniband/core/umem_odp.c      | 33 +++++++++++++++----
 drivers/infiniband/hw/hfi1/mmu_rb.c     | 11 ++++---
 drivers/infiniband/hw/mlx5/odp.c        |  2 +-
 drivers/misc/mic/scif/scif_dma.c        |  7 ++--
 drivers/misc/sgi-gru/grutlbpurge.c      |  7 ++--
 drivers/xen/gntdev.c                    | 44 ++++++++++++++++++++-----
 include/linux/kvm_host.h                |  4 +--
 include/linux/mmu_notifier.h            | 35 +++++++++++++++-----
 include/linux/oom.h                     |  2 +-
 include/rdma/ib_umem_odp.h              |  3 +-
 mm/hmm.c                                |  7 ++--
 mm/mmap.c                               |  2 +-
 mm/mmu_notifier.c                       | 19 ++++++++---
 mm/oom_kill.c                           | 29 ++++++++--------
 virt/kvm/kvm_main.c                     | 15 ++++++---
 19 files changed, 225 insertions(+), 80 deletions(-)

Comments

Andrew Morton July 16, 2018, 11:12 p.m. UTC | #1
On Mon, 16 Jul 2018 13:50:58 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> From: Michal Hocko <mhocko@suse.com>
> 
> There are several blockable mmu notifiers which might sleep in
> mmu_notifier_invalidate_range_start and that is a problem for the
> oom_reaper because it needs to guarantee a forward progress so it cannot
> depend on any sleepable locks.
> 
> Currently we simply back off and mark an oom victim with blockable mmu
> notifiers as done after a short sleep. That can result in selecting a
> new oom victim prematurely because the previous one still hasn't torn
> its memory down yet.
> 
> We can do much better though. Even if mmu notifiers use sleepable locks
> there is no reason to automatically assume those locks are held.
> Moreover majority of notifiers only care about a portion of the address
> space and there is absolutely zero reason to fail when we are unmapping an
> unrelated range. Many notifiers do really block and wait for HW which is
> harder to handle and we have to bail out though.
> 
> This patch handles the low hanging fruid. __mmu_notifier_invalidate_range_start
> gets a blockable flag and callbacks are not allowed to sleep if the
> flag is set to false. This is achieved by using trylock instead of the
> sleepable lock for most callbacks and continue as long as we do not
> block down the call chain.

I assume device driver developers are wondering "what does this mean
for me".  As I understand it, the only time they will see
blockable==false is when their driver is being called in response to an
out-of-memory condition, yes?  So it is a very rare thing.

Any suggestions regarding how the driver developers can test this code
path?  I don't think we presently have a way to fake an oom-killing
event?  Perhaps we should add such a thing, given the problems we're
having with that feature.

> I think we can improve that even further because there is a common
> pattern to do a range lookup first and then do something about that.
> The first part can be done without a sleeping lock in most cases AFAICS.
> 
> The oom_reaper end then simply retries if there is at least one notifier
> which couldn't make any progress in !blockable mode. A retry loop is
> already implemented to wait for the mmap_sem and this is basically the
> same thing.
> 
> ...
>
> +static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end)
> +{
> +	int ret = 0;
> +	if (mm_has_notifiers(mm))
> +		ret = __mmu_notifier_invalidate_range_start(mm, start, end, false);
> +
> +	return ret;
>  }

nit,

{
	if (mm_has_notifiers(mm))
		return __mmu_notifier_invalidate_range_start(mm, start, end, false);
	return 0;
}

would suffice.


> 
> ...
>
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3074,7 +3074,7 @@ void exit_mmap(struct mm_struct *mm)
>  		 * reliably test it.
>  		 */
>  		mutex_lock(&oom_lock);
> -		__oom_reap_task_mm(mm);
> +		(void)__oom_reap_task_mm(mm);
>  		mutex_unlock(&oom_lock);

What does this do?

>  		set_bit(MMF_OOM_SKIP, &mm->flags);
> 
> ...
>
Leon Romanovsky July 17, 2018, 4:03 a.m. UTC | #2
On Mon, Jul 16, 2018 at 04:12:49PM -0700, Andrew Morton wrote:
> On Mon, 16 Jul 2018 13:50:58 +0200 Michal Hocko <mhocko@kernel.org> wrote:
>
> > From: Michal Hocko <mhocko@suse.com>
> >
> > There are several blockable mmu notifiers which might sleep in
> > mmu_notifier_invalidate_range_start and that is a problem for the
> > oom_reaper because it needs to guarantee a forward progress so it cannot
> > depend on any sleepable locks.
> >
> > Currently we simply back off and mark an oom victim with blockable mmu
> > notifiers as done after a short sleep. That can result in selecting a
> > new oom victim prematurely because the previous one still hasn't torn
> > its memory down yet.
> >
> > We can do much better though. Even if mmu notifiers use sleepable locks
> > there is no reason to automatically assume those locks are held.
> > Moreover majority of notifiers only care about a portion of the address
> > space and there is absolutely zero reason to fail when we are unmapping an
> > unrelated range. Many notifiers do really block and wait for HW which is
> > harder to handle and we have to bail out though.
> >
> > This patch handles the low hanging fruid. __mmu_notifier_invalidate_range_start
> > gets a blockable flag and callbacks are not allowed to sleep if the
> > flag is set to false. This is achieved by using trylock instead of the
> > sleepable lock for most callbacks and continue as long as we do not
> > block down the call chain.
>
> I assume device driver developers are wondering "what does this mean
> for me".  As I understand it, the only time they will see
> blockable==false is when their driver is being called in response to an
> out-of-memory condition, yes?  So it is a very rare thing.

I can't say for everyone, but at least for me (mlx5), it is not rare event.
I'm seeing OOM very often while I'm running my tests in low memory VMs.

Thanks

>
> Any suggestions regarding how the driver developers can test this code
> path?  I don't think we presently have a way to fake an oom-killing
> event?  Perhaps we should add such a thing, given the problems we're
> having with that feature.
Michal Hocko July 17, 2018, 8:12 a.m. UTC | #3
On Mon 16-07-18 16:12:49, Andrew Morton wrote:
> On Mon, 16 Jul 2018 13:50:58 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > There are several blockable mmu notifiers which might sleep in
> > mmu_notifier_invalidate_range_start and that is a problem for the
> > oom_reaper because it needs to guarantee a forward progress so it cannot
> > depend on any sleepable locks.
> > 
> > Currently we simply back off and mark an oom victim with blockable mmu
> > notifiers as done after a short sleep. That can result in selecting a
> > new oom victim prematurely because the previous one still hasn't torn
> > its memory down yet.
> > 
> > We can do much better though. Even if mmu notifiers use sleepable locks
> > there is no reason to automatically assume those locks are held.
> > Moreover majority of notifiers only care about a portion of the address
> > space and there is absolutely zero reason to fail when we are unmapping an
> > unrelated range. Many notifiers do really block and wait for HW which is
> > harder to handle and we have to bail out though.
> > 
> > This patch handles the low hanging fruid. __mmu_notifier_invalidate_range_start
> > gets a blockable flag and callbacks are not allowed to sleep if the
> > flag is set to false. This is achieved by using trylock instead of the
> > sleepable lock for most callbacks and continue as long as we do not
> > block down the call chain.
> 
> I assume device driver developers are wondering "what does this mean
> for me".  As I understand it, the only time they will see
> blockable==false is when their driver is being called in response to an
> out-of-memory condition, yes?  So it is a very rare thing.

Yes, this is the case right now. Maybe we will grow other users in
future. Those other potential users is the reason why I used blockable
rather than oom parameter name.

> Any suggestions regarding how the driver developers can test this code
> path?  I don't think we presently have a way to fake an oom-killing
> event?  Perhaps we should add such a thing, given the problems we're
> having with that feature.

The simplest way is to wrap an userspace code which uses these notifiers
into a memcg and set the hard limit to hit the oom. This can be done
e.g. after the test faults in all the mmu notifier managed memory and
set the hard limit to something really small. Then we are looking for a
proper process tear down.

> > I think we can improve that even further because there is a common
> > pattern to do a range lookup first and then do something about that.
> > The first part can be done without a sleeping lock in most cases AFAICS.
> > 
> > The oom_reaper end then simply retries if there is at least one notifier
> > which couldn't make any progress in !blockable mode. A retry loop is
> > already implemented to wait for the mmap_sem and this is basically the
> > same thing.
> > 
> > ...
> >
> > +static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
> > +				  unsigned long start, unsigned long end)
> > +{
> > +	int ret = 0;
> > +	if (mm_has_notifiers(mm))
> > +		ret = __mmu_notifier_invalidate_range_start(mm, start, end, false);
> > +
> > +	return ret;
> >  }
> 
> nit,
> 
> {
> 	if (mm_has_notifiers(mm))
> 		return __mmu_notifier_invalidate_range_start(mm, start, end, false);
> 	return 0;
> }
> 
> would suffice.

Sure. Fixed
 
> > 
> > ...
> >
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -3074,7 +3074,7 @@ void exit_mmap(struct mm_struct *mm)
> >  		 * reliably test it.
> >  		 */
> >  		mutex_lock(&oom_lock);
> > -		__oom_reap_task_mm(mm);
> > +		(void)__oom_reap_task_mm(mm);
> >  		mutex_unlock(&oom_lock);
> 
> What does this do?

There is no error to be returned here as the comment above explains
		 * Nothing can be holding mm->mmap_sem here and the above call
		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
		 * __oom_reap_task_mm() will not block.
Michal Hocko July 19, 2018, 9:12 a.m. UTC | #4
Does anybody see any reasons why this should get into mmotm tree?
I do not want to rush this in but if general feeling is to push it for
the upcoming merge window then I will not object.
Andrew Morton July 20, 2018, 11:01 p.m. UTC | #5
On Tue, 17 Jul 2018 10:12:01 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> > Any suggestions regarding how the driver developers can test this code
> > path?  I don't think we presently have a way to fake an oom-killing
> > event?  Perhaps we should add such a thing, given the problems we're
> > having with that feature.
> 
> The simplest way is to wrap an userspace code which uses these notifiers
> into a memcg and set the hard limit to hit the oom. This can be done
> e.g. after the test faults in all the mmu notifier managed memory and
> set the hard limit to something really small. Then we are looking for a
> proper process tear down.

Chances are, some of the intended audience don't know how to do this
and will either have to hunt down a lot of documentation or will just
not test it.  But we want them to test it, so a little worked step-by-step
example would help things along please.
Andrew Morton July 21, 2018, 12:09 a.m. UTC | #6
On Mon, 16 Jul 2018 13:50:58 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> From: Michal Hocko <mhocko@suse.com>
> 
> There are several blockable mmu notifiers which might sleep in
> mmu_notifier_invalidate_range_start and that is a problem for the
> oom_reaper because it needs to guarantee a forward progress so it cannot
> depend on any sleepable locks.
> 
> ...
>
> @@ -571,7 +565,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  
>  	trace_start_task_reaping(tsk->pid);
>  
> -	__oom_reap_task_mm(mm);
> +	/* failed to reap part of the address space. Try again later */
> +	if (!__oom_reap_task_mm(mm)) {
> +		up_read(&mm->mmap_sem);
> +		ret = false;
> +		goto unlock_oom;
> +	}

This function is starting to look a bit screwy.

: static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
: {
: 	if (!down_read_trylock(&mm->mmap_sem)) {
: 		trace_skip_task_reaping(tsk->pid);
: 		return false;
: 	}
: 
: 	/*
: 	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
: 	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
: 	 * under mmap_sem for reading because it serializes against the
: 	 * down_write();up_write() cycle in exit_mmap().
: 	 */
: 	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
: 		up_read(&mm->mmap_sem);
: 		trace_skip_task_reaping(tsk->pid);
: 		return true;
: 	}
: 
: 	trace_start_task_reaping(tsk->pid);
: 
: 	/* failed to reap part of the address space. Try again later */
: 	if (!__oom_reap_task_mm(mm)) {
: 		up_read(&mm->mmap_sem);
: 		return true;
: 	}
: 
: 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
: 			task_pid_nr(tsk), tsk->comm,
: 			K(get_mm_counter(mm, MM_ANONPAGES)),
: 			K(get_mm_counter(mm, MM_FILEPAGES)),
: 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
: 	up_read(&mm->mmap_sem);
: 
: 	trace_finish_task_reaping(tsk->pid);
: 	return true;
: }

- Undocumented return value.

- comment "failed to reap part..." is misleading - sounds like it's
  referring to something which happened in the past, is in fact
  referring to something which might happen in the future.

- fails to call trace_finish_task_reaping() in one case

- code duplication.


I'm thinking it wants to be something like this?

: /*
:  * Return true if we successfully acquired (then released) mmap_sem
:  */
: static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
: {
: 	if (!down_read_trylock(&mm->mmap_sem)) {
: 		trace_skip_task_reaping(tsk->pid);
: 		return false;
: 	}
: 
: 	/*
: 	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
: 	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
: 	 * under mmap_sem for reading because it serializes against the
: 	 * down_write();up_write() cycle in exit_mmap().
: 	 */
: 	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
: 		trace_skip_task_reaping(tsk->pid);
: 		goto out;
: 	}
: 
: 	trace_start_task_reaping(tsk->pid);
: 
: 	if (!__oom_reap_task_mm(mm)) {
: 		/* Failed to reap part of the address space. Try again later */
: 		goto finish;
: 	}
: 
: 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
: 			task_pid_nr(tsk), tsk->comm,
: 			K(get_mm_counter(mm, MM_ANONPAGES)),
: 			K(get_mm_counter(mm, MM_FILEPAGES)),
: 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
: finish:
: 	trace_finish_task_reaping(tsk->pid);
: out:
: 	up_read(&mm->mmap_sem);
: 	return true;
: }

- Increases mmap_sem hold time a little by moving
  trace_finish_task_reaping() inside the locked region.  So sue me ;)

- Sharing the finish: path means that the trace event won't
  distinguish between the two sources of finishing.

Please take a look?
Michal Hocko July 23, 2018, 7:03 a.m. UTC | #7
On Fri 20-07-18 17:09:02, Andrew Morton wrote:
[...]
> Please take a look?

Are you OK to have these in a separate patch?
Michal Hocko July 23, 2018, 7:11 a.m. UTC | #8
On Mon 23-07-18 09:03:06, Michal Hocko wrote:
> On Fri 20-07-18 17:09:02, Andrew Morton wrote:
> [...]
> > Please take a look?
> 
> Are you OK to have these in a separate patch?

Btw. I will rebase this patch once oom stuff in linux-next settles. At
least oom_lock removal from oom_reaper will conflict.
Michal Hocko July 23, 2018, 8:11 a.m. UTC | #9
On Mon 23-07-18 09:11:54, Michal Hocko wrote:
> On Mon 23-07-18 09:03:06, Michal Hocko wrote:
> > On Fri 20-07-18 17:09:02, Andrew Morton wrote:
> > [...]
> > > Please take a look?
> > 
> > Are you OK to have these in a separate patch?
> 
> Btw. I will rebase this patch once oom stuff in linux-next settles. At
> least oom_lock removal from oom_reaper will conflict.

Hmm, I have just checked Andrew's akpm and the patch is already in and
Andrew has resolved the conflict with the oom_lock patch. It just seems
that linux-next (next-20180720) doesn't have the newest mmotm tree.

Anyway, I will go with the incremental cleanup patch per Andrew's
comments as soon as linux-next catches up.
Michal Hocko July 23, 2018, 8:43 a.m. UTC | #10
On Fri 20-07-18 16:01:25, Andrew Morton wrote:
> On Tue, 17 Jul 2018 10:12:01 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > > Any suggestions regarding how the driver developers can test this code
> > > path?  I don't think we presently have a way to fake an oom-killing
> > > event?  Perhaps we should add such a thing, given the problems we're
> > > having with that feature.
> > 
> > The simplest way is to wrap an userspace code which uses these notifiers
> > into a memcg and set the hard limit to hit the oom. This can be done
> > e.g. after the test faults in all the mmu notifier managed memory and
> > set the hard limit to something really small. Then we are looking for a
> > proper process tear down.
> 
> Chances are, some of the intended audience don't know how to do this
> and will either have to hunt down a lot of documentation or will just
> not test it.  But we want them to test it, so a little worked step-by-step
> example would help things along please.

I am willing to give more specific steps. Is anybody interested? From my
experience so far this is not something drivers developers using mmu
notifiers would be unfamiliar with.
Michal Hocko July 24, 2018, 2:17 p.m. UTC | #11
On Fri 20-07-18 17:09:02, Andrew Morton wrote:
[...]
> - Undocumented return value.
> 
> - comment "failed to reap part..." is misleading - sounds like it's
>   referring to something which happened in the past, is in fact
>   referring to something which might happen in the future.
> 
> - fails to call trace_finish_task_reaping() in one case
> 
> - code duplication.
> 
> - Increases mmap_sem hold time a little by moving
>   trace_finish_task_reaping() inside the locked region.  So sue me ;)
> 
> - Sharing the finish: path means that the trace event won't
>   distinguish between the two sources of finishing.
> 
> Please take a look?

oom_reap_task_mm should return false when __oom_reap_task_mm return
false. This is what my patch did but it seems this changed by
http://www.ozlabs.org/~akpm/mmotm/broken-out/mm-oom-remove-oom_lock-from-oom_reaper.patch
so that one should be fixed.

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 104ef4a01a55..88657e018714 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -565,7 +565,7 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	/* failed to reap part of the address space. Try again later */
 	if (!__oom_reap_task_mm(mm)) {
 		up_read(&mm->mmap_sem);
-		return true;
+		return false;
 	}
 
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",


On top of that the proposed cleanup looks as follows:

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 88657e018714..4e185a282b3d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -541,8 +541,16 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
 	return ret;
 }
 
+/*
+ * Reaps the address space of the give task.
+ *
+ * Returns true on success and false if none or part of the address space
+ * has been reclaimed and the caller should retry later.
+ */
 static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
+	bool ret = true;
+
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		trace_skip_task_reaping(tsk->pid);
 		return false;
@@ -555,28 +563,28 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	 * down_write();up_write() cycle in exit_mmap().
 	 */
 	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
-		up_read(&mm->mmap_sem);
 		trace_skip_task_reaping(tsk->pid);
-		return true;
+		goto out_unlock;
 	}
 
 	trace_start_task_reaping(tsk->pid);
 
 	/* failed to reap part of the address space. Try again later */
-	if (!__oom_reap_task_mm(mm)) {
-		up_read(&mm->mmap_sem);
-		return false;
-	}
+	ret = __oom_reap_task_mm(mm);
+	if (!ret)
+		goto out_finish;
 
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 			task_pid_nr(tsk), tsk->comm,
 			K(get_mm_counter(mm, MM_ANONPAGES)),
 			K(get_mm_counter(mm, MM_FILEPAGES)),
 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
+out_finish:
+	trace_finish_task_reaping(tsk->pid);
+out_unlock:
 	up_read(&mm->mmap_sem);
 
-	trace_finish_task_reaping(tsk->pid);
-	return true;
+	return ret;
 }
 
 #define MAX_OOM_REAP_RETRIES 10
Andrew Morton July 24, 2018, 7:53 p.m. UTC | #12
On Tue, 24 Jul 2018 16:17:47 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> On Fri 20-07-18 17:09:02, Andrew Morton wrote:
> [...]
> > - Undocumented return value.
> > 
> > - comment "failed to reap part..." is misleading - sounds like it's
> >   referring to something which happened in the past, is in fact
> >   referring to something which might happen in the future.
> > 
> > - fails to call trace_finish_task_reaping() in one case
> > 
> > - code duplication.
> > 
> > - Increases mmap_sem hold time a little by moving
> >   trace_finish_task_reaping() inside the locked region.  So sue me ;)
> > 
> > - Sharing the finish: path means that the trace event won't
> >   distinguish between the two sources of finishing.
> > 
> > Please take a look?
> 
> oom_reap_task_mm should return false when __oom_reap_task_mm return
> false. This is what my patch did but it seems this changed by
> http://www.ozlabs.org/~akpm/mmotm/broken-out/mm-oom-remove-oom_lock-from-oom_reaper.patch
> so that one should be fixed.
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 104ef4a01a55..88657e018714 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -565,7 +565,7 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  	/* failed to reap part of the address space. Try again later */
>  	if (!__oom_reap_task_mm(mm)) {
>  		up_read(&mm->mmap_sem);
> -		return true;
> +		return false;
>  	}
>  
>  	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",

OK, thanks, I added that.

> 
> On top of that the proposed cleanup looks as follows:
> 

Looks good to me.  Seems a bit strange that we omit the pr_info()
output if the mm was partially reaped - people would still want to know
this?   Not very important though.
David Rientjes July 24, 2018, 9:07 p.m. UTC | #13
On Tue, 24 Jul 2018, Michal Hocko wrote:

> oom_reap_task_mm should return false when __oom_reap_task_mm return
> false. This is what my patch did but it seems this changed by
> http://www.ozlabs.org/~akpm/mmotm/broken-out/mm-oom-remove-oom_lock-from-oom_reaper.patch
> so that one should be fixed.
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 104ef4a01a55..88657e018714 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -565,7 +565,7 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  	/* failed to reap part of the address space. Try again later */
>  	if (!__oom_reap_task_mm(mm)) {
>  		up_read(&mm->mmap_sem);
> -		return true;
> +		return false;
>  	}
>  
>  	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
> 
> 
> On top of that the proposed cleanup looks as follows:
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 88657e018714..4e185a282b3d 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -541,8 +541,16 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
>  	return ret;
>  }
>  
> +/*
> + * Reaps the address space of the give task.
> + *
> + * Returns true on success and false if none or part of the address space
> + * has been reclaimed and the caller should retry later.
> + */
>  static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  {
> +	bool ret = true;
> +
>  	if (!down_read_trylock(&mm->mmap_sem)) {
>  		trace_skip_task_reaping(tsk->pid);
>  		return false;
> @@ -555,28 +563,28 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  	 * down_write();up_write() cycle in exit_mmap().
>  	 */
>  	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
> -		up_read(&mm->mmap_sem);
>  		trace_skip_task_reaping(tsk->pid);
> -		return true;
> +		goto out_unlock;
>  	}
>  
>  	trace_start_task_reaping(tsk->pid);
>  
>  	/* failed to reap part of the address space. Try again later */
> -	if (!__oom_reap_task_mm(mm)) {
> -		up_read(&mm->mmap_sem);
> -		return false;
> -	}
> +	ret = __oom_reap_task_mm(mm);
> +	if (!ret)
> +		goto out_finish;
>  
>  	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
>  			task_pid_nr(tsk), tsk->comm,
>  			K(get_mm_counter(mm, MM_ANONPAGES)),
>  			K(get_mm_counter(mm, MM_FILEPAGES)),
>  			K(get_mm_counter(mm, MM_SHMEMPAGES)));
> +out_finish:
> +	trace_finish_task_reaping(tsk->pid);
> +out_unlock:
>  	up_read(&mm->mmap_sem);
>  
> -	trace_finish_task_reaping(tsk->pid);
> -	return true;
> +	return ret;
>  }
>  
>  #define MAX_OOM_REAP_RETRIES 10

I think we still want to trace when reaping was skipped to know that the 
oom reaper will retry again later.



mm/oom_kill.c: clean up oom_reap_task_mm() fix

indicate reaping has been partially skipped so we can expect future skips 
or another start before finish.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -569,10 +569,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 
 	trace_start_task_reaping(tsk->pid);
 
-	/* failed to reap part of the address space. Try again later */
 	ret = __oom_reap_task_mm(mm);
-	if (!ret)
+	if (!ret) {
+		/* Failed to reap part of the address space. Try again later */
+		trace_skip_task_reaping(tsk->pid);
 		goto out_finish;
+	}
 
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 			task_pid_nr(tsk), tsk->comm,
Michal Hocko July 25, 2018, 6:13 a.m. UTC | #14
On Tue 24-07-18 14:07:49, David Rientjes wrote:
[...]
> mm/oom_kill.c: clean up oom_reap_task_mm() fix
> 
> indicate reaping has been partially skipped so we can expect future skips 
> or another start before finish.

But we are not skipping. This is essentially the same case as mmap_sem
trylock fail. Maybe we can add a bool parameter to trace_finish_task_reaping
to denote partial success?

> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/oom_kill.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -569,10 +569,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  
>  	trace_start_task_reaping(tsk->pid);
>  
> -	/* failed to reap part of the address space. Try again later */
>  	ret = __oom_reap_task_mm(mm);
> -	if (!ret)
> +	if (!ret) {
> +		/* Failed to reap part of the address space. Try again later */
> +		trace_skip_task_reaping(tsk->pid);
>  		goto out_finish;
> +	}
>  
>  	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
>  			task_pid_nr(tsk), tsk->comm,
Michal Hocko July 25, 2018, 6:17 a.m. UTC | #15
On Tue 24-07-18 12:53:07, Andrew Morton wrote:
[...]
> > On top of that the proposed cleanup looks as follows:
> > 
> 
> Looks good to me.  Seems a bit strange that we omit the pr_info()
> output if the mm was partially reaped - people would still want to know
> this?   Not very important though.

I think that having a single output once we are done is better but I do
not have a strong opinion on this.

Btw. here is the changelog for the cleanup.

"
Andrew has noticed someinconsistencies in oom_reap_task_mm. Notably
 - Undocumented return value.

 - comment "failed to reap part..." is misleading - sounds like it's
   referring to something which happened in the past, is in fact
   referring to something which might happen in the future.

 - fails to call trace_finish_task_reaping() in one case

 - code duplication.

 - Increases mmap_sem hold time a little by moving
   trace_finish_task_reaping() inside the locked region.  So sue me ;)

 - Sharing the finish: path means that the trace event won't
   distinguish between the two sources of finishing.

Add a short explanation for the return value and fix the rest by
reorganizing the function a bit to have unified function exit paths.

Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
"
Tetsuo Handa Aug. 24, 2018, 10:54 a.m. UTC | #16
Two more worries for this patch.



> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -178,12 +178,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>   *
>   * @amn: our notifier
>   */
> -static void amdgpu_mn_read_lock(struct amdgpu_mn *amn)
> +static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
>  {
> -       mutex_lock(&amn->read_lock);
> +       if (blockable)
> +               mutex_lock(&amn->read_lock);
> +       else if (!mutex_trylock(&amn->read_lock))
> +               return -EAGAIN;
> +
>         if (atomic_inc_return(&amn->recursion) == 1)
>                 down_read_non_owner(&amn->lock);

Why don't we need to use trylock here if blockable == false ?
Want comment why it is safe to use blocking lock here.

>         mutex_unlock(&amn->read_lock);
> +
> +       return 0;
>  }
> 
>  /**



> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>         up_write(&hmm->mirrors_sem);
>  }
> 
> -static void hmm_invalidate_range_start(struct mmu_notifier *mn,
> +static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>                                        struct mm_struct *mm,
>                                        unsigned long start,
> -                                      unsigned long end)
> +                                      unsigned long end,
> +                                      bool blockable)
>  {
>         struct hmm *hmm = mm->hmm;
> 
>         VM_BUG_ON(!hmm);
> 
>         atomic_inc(&hmm->sequence);
> +
> +       return 0;
>  }
> 
>  static void hmm_invalidate_range_end(struct mmu_notifier *mn,

This assumes that hmm_invalidate_range_end() does not have memory
allocation dependency. But hmm_invalidate_range() from
hmm_invalidate_range_end() involves

        down_read(&hmm->mirrors_sem);
        list_for_each_entry(mirror, &hmm->mirrors, list)
                mirror->ops->sync_cpu_device_pagetables(mirror, action,
                                                        start, end);
        up_read(&hmm->mirrors_sem);

sequence. What is surprising is that there is no in-tree user who assigns
sync_cpu_device_pagetables field.

  $ grep -Fr sync_cpu_device_pagetables *
  Documentation/vm/hmm.rst:     /* sync_cpu_device_pagetables() - synchronize page tables
  include/linux/hmm.h: * will get callbacks through sync_cpu_device_pagetables() operation (see
  include/linux/hmm.h:    /* sync_cpu_device_pagetables() - synchronize page tables
  include/linux/hmm.h:    void (*sync_cpu_device_pagetables)(struct hmm_mirror *mirror,
  include/linux/hmm.h: * hmm_mirror_ops.sync_cpu_device_pagetables() callback, so that CPU page
  mm/hmm.c:               mirror->ops->sync_cpu_device_pagetables(mirror, action,

That is, this API seems to be currently used by only out-of-tree users. Since
we can't check that nobody has memory allocation dependency, I think that
hmm_invalidate_range_start() should return -EAGAIN if blockable == false for now.
Michal Hocko Aug. 24, 2018, 11:32 a.m. UTC | #17
On Fri 24-08-18 19:54:19, Tetsuo Handa wrote:
> Two more worries for this patch.
> 
> 
> 
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > @@ -178,12 +178,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
> >   *
> >   * @amn: our notifier
> >   */
> > -static void amdgpu_mn_read_lock(struct amdgpu_mn *amn)
> > +static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
> >  {
> > -       mutex_lock(&amn->read_lock);
> > +       if (blockable)
> > +               mutex_lock(&amn->read_lock);
> > +       else if (!mutex_trylock(&amn->read_lock))
> > +               return -EAGAIN;
> > +
> >         if (atomic_inc_return(&amn->recursion) == 1)
> >                 down_read_non_owner(&amn->lock);
> 
> Why don't we need to use trylock here if blockable == false ?
> Want comment why it is safe to use blocking lock here.

Hmm, I am pretty sure I have checked the code but it was quite confusing
so I might have missed something. Double checking now, it seems that
this read_lock is not used anywhere else and it is not _the_ lock we are
interested about. It is the amn->lock (amdgpu_mn_lock) which matters as
it is taken in exclusive mode for expensive operations.

Is that correct Christian? If this is correct then we need to update the
locking here. I am struggling to grasp the ref counting part. Why cannot
all readers simply take the lock rather than rely on somebody else to
take it? 1ed3d2567c800 didn't really help me to understand the locking
scheme here so any help would be appreciated.

I am wondering why we cannot do
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index e55508b39496..93034178673d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -180,14 +180,11 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
  */
 static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
 {
-	if (blockable)
-		mutex_lock(&amn->read_lock);
-	else if (!mutex_trylock(&amn->read_lock))
-		return -EAGAIN;
-
-	if (atomic_inc_return(&amn->recursion) == 1)
-		down_read_non_owner(&amn->lock);
-	mutex_unlock(&amn->read_lock);
+	if (!down_read_trylock(&amn->lock)) {
+		if (!blockable)
+			return -EAGAIN;
+		down_read(amn->lock);
+	}
 
 	return 0;
 }
@@ -199,8 +196,7 @@ static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
  */
 static void amdgpu_mn_read_unlock(struct amdgpu_mn *amn)
 {
-	if (atomic_dec_return(&amn->recursion) == 0)
-		up_read_non_owner(&amn->lock);
+	up_read(&amn->lock);
 }
 
 /**
Michal Hocko Aug. 24, 2018, 11:36 a.m. UTC | #18
On Fri 24-08-18 19:54:19, Tetsuo Handa wrote:
[...]
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
> >         up_write(&hmm->mirrors_sem);
> >  }
> > 
> > -static void hmm_invalidate_range_start(struct mmu_notifier *mn,
> > +static int hmm_invalidate_range_start(struct mmu_notifier *mn,
> >                                        struct mm_struct *mm,
> >                                        unsigned long start,
> > -                                      unsigned long end)
> > +                                      unsigned long end,
> > +                                      bool blockable)
> >  {
> >         struct hmm *hmm = mm->hmm;
> > 
> >         VM_BUG_ON(!hmm);
> > 
> >         atomic_inc(&hmm->sequence);
> > +
> > +       return 0;
> >  }
> > 
> >  static void hmm_invalidate_range_end(struct mmu_notifier *mn,
> 
> This assumes that hmm_invalidate_range_end() does not have memory
> allocation dependency. But hmm_invalidate_range() from
> hmm_invalidate_range_end() involves
> 
>         down_read(&hmm->mirrors_sem);
>         list_for_each_entry(mirror, &hmm->mirrors, list)
>                 mirror->ops->sync_cpu_device_pagetables(mirror, action,
>                                                         start, end);
>         up_read(&hmm->mirrors_sem);
> 
> sequence. What is surprising is that there is no in-tree user who assigns
> sync_cpu_device_pagetables field.

Yes HMM doesn't have any in tree user yet.

>   $ grep -Fr sync_cpu_device_pagetables *
>   Documentation/vm/hmm.rst:     /* sync_cpu_device_pagetables() - synchronize page tables
>   include/linux/hmm.h: * will get callbacks through sync_cpu_device_pagetables() operation (see
>   include/linux/hmm.h:    /* sync_cpu_device_pagetables() - synchronize page tables
>   include/linux/hmm.h:    void (*sync_cpu_device_pagetables)(struct hmm_mirror *mirror,
>   include/linux/hmm.h: * hmm_mirror_ops.sync_cpu_device_pagetables() callback, so that CPU page
>   mm/hmm.c:               mirror->ops->sync_cpu_device_pagetables(mirror, action,
> 
> That is, this API seems to be currently used by only out-of-tree users. Since
> we can't check that nobody has memory allocation dependency, I think that
> hmm_invalidate_range_start() should return -EAGAIN if blockable == false for now.

The code expects that the invalidate_range_end doesn't block if
invalidate_range_start hasn't blocked. That is the reason why the end
callback doesn't have blockable parameter. If this doesn't hold then the
whole scheme is just fragile because those two calls should pair.
Christian König Aug. 24, 2018, 11:43 a.m. UTC | #19
Am 24.08.2018 um 13:32 schrieb Michal Hocko:
> On Fri 24-08-18 19:54:19, Tetsuo Handa wrote:
>> Two more worries for this patch.
>>
>>
>>
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
>>> @@ -178,12 +178,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>>>    *
>>>    * @amn: our notifier
>>>    */
>>> -static void amdgpu_mn_read_lock(struct amdgpu_mn *amn)
>>> +static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
>>>   {
>>> -       mutex_lock(&amn->read_lock);
>>> +       if (blockable)
>>> +               mutex_lock(&amn->read_lock);
>>> +       else if (!mutex_trylock(&amn->read_lock))
>>> +               return -EAGAIN;
>>> +
>>>          if (atomic_inc_return(&amn->recursion) == 1)
>>>                  down_read_non_owner(&amn->lock);
>> Why don't we need to use trylock here if blockable == false ?
>> Want comment why it is safe to use blocking lock here.
> Hmm, I am pretty sure I have checked the code but it was quite confusing
> so I might have missed something. Double checking now, it seems that
> this read_lock is not used anywhere else and it is not _the_ lock we are
> interested about. It is the amn->lock (amdgpu_mn_lock) which matters as
> it is taken in exclusive mode for expensive operations.

The write side of the lock is only taken in the command submission IOCTL.

So you actually don't need to change anything here (even the proposed 
changes are overkill) since we can't tear down the struct_mm while an 
IOCTL is still using.

> Is that correct Christian? If this is correct then we need to update the
> locking here. I am struggling to grasp the ref counting part. Why cannot
> all readers simply take the lock rather than rely on somebody else to
> take it? 1ed3d2567c800 didn't really help me to understand the locking
> scheme here so any help would be appreciated.

That won't work like this there might be multiple 
invalidate_range_start()/invalidate_range_end() pairs open at the same 
time. E.g. the lock might be taken recursively and that is illegal for a 
rw_semaphore.

Regards,
Christian.

>
> I am wondering why we cannot do
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index e55508b39496..93034178673d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -180,14 +180,11 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>    */
>   static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
>   {
> -	if (blockable)
> -		mutex_lock(&amn->read_lock);
> -	else if (!mutex_trylock(&amn->read_lock))
> -		return -EAGAIN;
> -
> -	if (atomic_inc_return(&amn->recursion) == 1)
> -		down_read_non_owner(&amn->lock);
> -	mutex_unlock(&amn->read_lock);
> +	if (!down_read_trylock(&amn->lock)) {
> +		if (!blockable)
> +			return -EAGAIN;
> +		down_read(amn->lock);
> +	}
>   
>   	return 0;
>   }
> @@ -199,8 +196,7 @@ static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
>    */
>   static void amdgpu_mn_read_unlock(struct amdgpu_mn *amn)
>   {
> -	if (atomic_dec_return(&amn->recursion) == 0)
> -		up_read_non_owner(&amn->lock);
> +	up_read(&amn->lock);
>   }
>   
>   /**
>
Michal Hocko Aug. 24, 2018, 11:52 a.m. UTC | #20
On Fri 24-08-18 13:43:16, Christian König wrote:
> Am 24.08.2018 um 13:32 schrieb Michal Hocko:
> > On Fri 24-08-18 19:54:19, Tetsuo Handa wrote:
> > > Two more worries for this patch.
> > > 
> > > 
> > > 
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > > > @@ -178,12 +178,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
> > > >    *
> > > >    * @amn: our notifier
> > > >    */
> > > > -static void amdgpu_mn_read_lock(struct amdgpu_mn *amn)
> > > > +static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
> > > >   {
> > > > -       mutex_lock(&amn->read_lock);
> > > > +       if (blockable)
> > > > +               mutex_lock(&amn->read_lock);
> > > > +       else if (!mutex_trylock(&amn->read_lock))
> > > > +               return -EAGAIN;
> > > > +
> > > >          if (atomic_inc_return(&amn->recursion) == 1)
> > > >                  down_read_non_owner(&amn->lock);
> > > Why don't we need to use trylock here if blockable == false ?
> > > Want comment why it is safe to use blocking lock here.
> > Hmm, I am pretty sure I have checked the code but it was quite confusing
> > so I might have missed something. Double checking now, it seems that
> > this read_lock is not used anywhere else and it is not _the_ lock we are
> > interested about. It is the amn->lock (amdgpu_mn_lock) which matters as
> > it is taken in exclusive mode for expensive operations.
> 
> The write side of the lock is only taken in the command submission IOCTL.
> 
> So you actually don't need to change anything here (even the proposed
> changes are overkill) since we can't tear down the struct_mm while an IOCTL
> is still using.

I am not so sure. We are not in the mm destruction phase yet. This is
mostly about the oom context which might fire right during the IOCTL. If
any of the path which is holding the write lock blocks for unbound
amount of time or even worse allocates a memory then we are screwed. So
we need to back of when blockable = false.

> > Is that correct Christian? If this is correct then we need to update the
> > locking here. I am struggling to grasp the ref counting part. Why cannot
> > all readers simply take the lock rather than rely on somebody else to
> > take it? 1ed3d2567c800 didn't really help me to understand the locking
> > scheme here so any help would be appreciated.
> 
> That won't work like this there might be multiple
> invalidate_range_start()/invalidate_range_end() pairs open at the same time.
> E.g. the lock might be taken recursively and that is illegal for a
> rw_semaphore.

I am not sure I follow. Are you saying that one invalidate_range might
trigger another one from the same path?
Christian König Aug. 24, 2018, 11:57 a.m. UTC | #21
Am 24.08.2018 um 13:52 schrieb Michal Hocko:
> On Fri 24-08-18 13:43:16, Christian König wrote:
>> Am 24.08.2018 um 13:32 schrieb Michal Hocko:
>>> On Fri 24-08-18 19:54:19, Tetsuo Handa wrote:
>>>> Two more worries for this patch.
>>>>
>>>>
>>>>
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
>>>>> @@ -178,12 +178,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>>>>>     *
>>>>>     * @amn: our notifier
>>>>>     */
>>>>> -static void amdgpu_mn_read_lock(struct amdgpu_mn *amn)
>>>>> +static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
>>>>>    {
>>>>> -       mutex_lock(&amn->read_lock);
>>>>> +       if (blockable)
>>>>> +               mutex_lock(&amn->read_lock);
>>>>> +       else if (!mutex_trylock(&amn->read_lock))
>>>>> +               return -EAGAIN;
>>>>> +
>>>>>           if (atomic_inc_return(&amn->recursion) == 1)
>>>>>                   down_read_non_owner(&amn->lock);
>>>> Why don't we need to use trylock here if blockable == false ?
>>>> Want comment why it is safe to use blocking lock here.
>>> Hmm, I am pretty sure I have checked the code but it was quite confusing
>>> so I might have missed something. Double checking now, it seems that
>>> this read_lock is not used anywhere else and it is not _the_ lock we are
>>> interested about. It is the amn->lock (amdgpu_mn_lock) which matters as
>>> it is taken in exclusive mode for expensive operations.
>> The write side of the lock is only taken in the command submission IOCTL.
>>
>> So you actually don't need to change anything here (even the proposed
>> changes are overkill) since we can't tear down the struct_mm while an IOCTL
>> is still using.
> I am not so sure. We are not in the mm destruction phase yet. This is
> mostly about the oom context which might fire right during the IOCTL. If
> any of the path which is holding the write lock blocks for unbound
> amount of time or even worse allocates a memory then we are screwed. So
> we need to back of when blockable = false.

Oh, yeah good point. Haven't thought about that possibility.

>
>>> Is that correct Christian? If this is correct then we need to update the
>>> locking here. I am struggling to grasp the ref counting part. Why cannot
>>> all readers simply take the lock rather than rely on somebody else to
>>> take it? 1ed3d2567c800 didn't really help me to understand the locking
>>> scheme here so any help would be appreciated.
>> That won't work like this there might be multiple
>> invalidate_range_start()/invalidate_range_end() pairs open at the same time.
>> E.g. the lock might be taken recursively and that is illegal for a
>> rw_semaphore.
> I am not sure I follow. Are you saying that one invalidate_range might
> trigger another one from the same path?

No, but what can happen is:

invalidate_range_start(A,B);
invalidate_range_start(C,D);
...
invalidate_range_end(C,D);
invalidate_range_end(A,B);

Grabbing the read lock twice would be illegal in this case.

Regards,
Christian.
Michal Hocko Aug. 24, 2018, 12:03 p.m. UTC | #22
On Fri 24-08-18 13:57:52, Christian König wrote:
> Am 24.08.2018 um 13:52 schrieb Michal Hocko:
> > On Fri 24-08-18 13:43:16, Christian König wrote:
[...]
> > > That won't work like this there might be multiple
> > > invalidate_range_start()/invalidate_range_end() pairs open at the same time.
> > > E.g. the lock might be taken recursively and that is illegal for a
> > > rw_semaphore.
> > I am not sure I follow. Are you saying that one invalidate_range might
> > trigger another one from the same path?
> 
> No, but what can happen is:
> 
> invalidate_range_start(A,B);
> invalidate_range_start(C,D);
> ...
> invalidate_range_end(C,D);
> invalidate_range_end(A,B);
> 
> Grabbing the read lock twice would be illegal in this case.

I am sorry but I still do not follow. What is the context the two are
called from? Can you give me an example. I simply do not see it in the
code, mostly because I am not familiar with it.
Christian König Aug. 24, 2018, 12:18 p.m. UTC | #23
Am 24.08.2018 um 14:03 schrieb Michal Hocko:
> On Fri 24-08-18 13:57:52, Christian König wrote:
>> Am 24.08.2018 um 13:52 schrieb Michal Hocko:
>>> On Fri 24-08-18 13:43:16, Christian König wrote:
> [...]
>>>> That won't work like this there might be multiple
>>>> invalidate_range_start()/invalidate_range_end() pairs open at the same time.
>>>> E.g. the lock might be taken recursively and that is illegal for a
>>>> rw_semaphore.
>>> I am not sure I follow. Are you saying that one invalidate_range might
>>> trigger another one from the same path?
>> No, but what can happen is:
>>
>> invalidate_range_start(A,B);
>> invalidate_range_start(C,D);
>> ...
>> invalidate_range_end(C,D);
>> invalidate_range_end(A,B);
>>
>> Grabbing the read lock twice would be illegal in this case.
> I am sorry but I still do not follow. What is the context the two are
> called from?

I don't have the slightest idea.

> Can you give me an example. I simply do not see it in the
> code, mostly because I am not familiar with it.

I'm neither.

We stumbled over that by pure observation and after discussing the 
problem with Jerome came up with this solution.

No idea where exactly that case comes from, but I can confirm that it 
indeed happens.

Regards,
Christian.
Michal Hocko Aug. 24, 2018, 12:33 p.m. UTC | #24
On Fri 24-08-18 14:18:44, Christian König wrote:
> Am 24.08.2018 um 14:03 schrieb Michal Hocko:
> > On Fri 24-08-18 13:57:52, Christian König wrote:
> > > Am 24.08.2018 um 13:52 schrieb Michal Hocko:
> > > > On Fri 24-08-18 13:43:16, Christian König wrote:
> > [...]
> > > > > That won't work like this there might be multiple
> > > > > invalidate_range_start()/invalidate_range_end() pairs open at the same time.
> > > > > E.g. the lock might be taken recursively and that is illegal for a
> > > > > rw_semaphore.
> > > > I am not sure I follow. Are you saying that one invalidate_range might
> > > > trigger another one from the same path?
> > > No, but what can happen is:
> > > 
> > > invalidate_range_start(A,B);
> > > invalidate_range_start(C,D);
> > > ...
> > > invalidate_range_end(C,D);
> > > invalidate_range_end(A,B);
> > > 
> > > Grabbing the read lock twice would be illegal in this case.
> > I am sorry but I still do not follow. What is the context the two are
> > called from?
> 
> I don't have the slightest idea.
> 
> > Can you give me an example. I simply do not see it in the
> > code, mostly because I am not familiar with it.
> 
> I'm neither.
> 
> We stumbled over that by pure observation and after discussing the problem
> with Jerome came up with this solution.
> 
> No idea where exactly that case comes from, but I can confirm that it indeed
> happens.

Thiking about it some more, I can imagine that a notifier callback which
performs an allocation might trigger a memory reclaim and that in turn
might trigger a notifier to be invoked and recurse. But notifier
shouldn't really allocate memory. They are called from deep MM code
paths and this would be extremely deadlock prone. Maybe Jerome can come
up some more realistic scenario. If not then I would propose to simplify
the locking here. We have lockdep to catch self deadlocks and it is
always better to handle a specific issue rather than having a code
without a clear indication how it can recurse.
Christian König Aug. 24, 2018, 12:52 p.m. UTC | #25
Am 24.08.2018 um 14:33 schrieb Michal Hocko:
> On Fri 24-08-18 14:18:44, Christian König wrote:
>> Am 24.08.2018 um 14:03 schrieb Michal Hocko:
>>> On Fri 24-08-18 13:57:52, Christian König wrote:
>>>> Am 24.08.2018 um 13:52 schrieb Michal Hocko:
>>>>> On Fri 24-08-18 13:43:16, Christian König wrote:
>>> [...]
>>>>>> That won't work like this there might be multiple
>>>>>> invalidate_range_start()/invalidate_range_end() pairs open at the same time.
>>>>>> E.g. the lock might be taken recursively and that is illegal for a
>>>>>> rw_semaphore.
>>>>> I am not sure I follow. Are you saying that one invalidate_range might
>>>>> trigger another one from the same path?
>>>> No, but what can happen is:
>>>>
>>>> invalidate_range_start(A,B);
>>>> invalidate_range_start(C,D);
>>>> ...
>>>> invalidate_range_end(C,D);
>>>> invalidate_range_end(A,B);
>>>>
>>>> Grabbing the read lock twice would be illegal in this case.
>>> I am sorry but I still do not follow. What is the context the two are
>>> called from?
>> I don't have the slightest idea.
>>
>>> Can you give me an example. I simply do not see it in the
>>> code, mostly because I am not familiar with it.
>> I'm neither.
>>
>> We stumbled over that by pure observation and after discussing the problem
>> with Jerome came up with this solution.
>>
>> No idea where exactly that case comes from, but I can confirm that it indeed
>> happens.
> Thiking about it some more, I can imagine that a notifier callback which
> performs an allocation might trigger a memory reclaim and that in turn
> might trigger a notifier to be invoked and recurse. But notifier
> shouldn't really allocate memory. They are called from deep MM code
> paths and this would be extremely deadlock prone. Maybe Jerome can come
> up some more realistic scenario. If not then I would propose to simplify
> the locking here. We have lockdep to catch self deadlocks and it is
> always better to handle a specific issue rather than having a code
> without a clear indication how it can recurse.

Well I agree that we should probably fix that, but I have some concerns 
to remove the existing workaround.

See we added that to get rid of a real problem in a customer environment 
and I don't want to that to show up again.

In the meantime I've send out a fix to avoid allocating memory while 
holding the mn_lock.

Thanks for pointing that out,
Christian.
Michal Hocko Aug. 24, 2018, 1:01 p.m. UTC | #26
On Fri 24-08-18 14:52:26, Christian König wrote:
> Am 24.08.2018 um 14:33 schrieb Michal Hocko:
[...]
> > Thiking about it some more, I can imagine that a notifier callback which
> > performs an allocation might trigger a memory reclaim and that in turn
> > might trigger a notifier to be invoked and recurse. But notifier
> > shouldn't really allocate memory. They are called from deep MM code
> > paths and this would be extremely deadlock prone. Maybe Jerome can come
> > up some more realistic scenario. If not then I would propose to simplify
> > the locking here. We have lockdep to catch self deadlocks and it is
> > always better to handle a specific issue rather than having a code
> > without a clear indication how it can recurse.
> 
> Well I agree that we should probably fix that, but I have some concerns to
> remove the existing workaround.
> 
> See we added that to get rid of a real problem in a customer environment and
> I don't want to that to show up again.

It would really help to know more about that case and fix it properly
rather than workaround it like this. Anyway, let me think how to handle
the non-blocking notifier invocation then. I was not able to come up
with anything remotely sane yet.
Tetsuo Handa Aug. 24, 2018, 1:02 p.m. UTC | #27
On 2018/08/24 20:36, Michal Hocko wrote:
>> That is, this API seems to be currently used by only out-of-tree users. Since
>> we can't check that nobody has memory allocation dependency, I think that
>> hmm_invalidate_range_start() should return -EAGAIN if blockable == false for now.
> 
> The code expects that the invalidate_range_end doesn't block if
> invalidate_range_start hasn't blocked. That is the reason why the end
> callback doesn't have blockable parameter. If this doesn't hold then the
> whole scheme is just fragile because those two calls should pair.
> 
That is

  More worrisome part in that patch is that I don't know whether using
  trylock if blockable == false at entry is really sufficient.

. Since those two calls should pair, I think that we need to determine whether
we need to return -EAGAIN at start call by evaluating both calls.

Like mn_invl_range_start() involves schedule_delayed_work() which could be
blocked on memory allocation under OOM situation, I worry that (currently
out-of-tree) users of this API are involving work / recursion.
And hmm_release() says that

	/*
	 * Drop mirrors_sem so callback can wait on any pending
	 * work that might itself trigger mmu_notifier callback
	 * and thus would deadlock with us.
	 */

and keeps "all operations protected by hmm->mirrors_sem held for write are
atomic". This suggests that "some operations protected by hmm->mirrors_sem held
for read will sleep (and in the worst case involves memory allocation
dependency)".
Christian König Aug. 24, 2018, 1:10 p.m. UTC | #28
Am 24.08.2018 um 15:01 schrieb Michal Hocko:
> On Fri 24-08-18 14:52:26, Christian König wrote:
>> Am 24.08.2018 um 14:33 schrieb Michal Hocko:
> [...]
>>> Thiking about it some more, I can imagine that a notifier callback which
>>> performs an allocation might trigger a memory reclaim and that in turn
>>> might trigger a notifier to be invoked and recurse. But notifier
>>> shouldn't really allocate memory. They are called from deep MM code
>>> paths and this would be extremely deadlock prone. Maybe Jerome can come
>>> up some more realistic scenario. If not then I would propose to simplify
>>> the locking here. We have lockdep to catch self deadlocks and it is
>>> always better to handle a specific issue rather than having a code
>>> without a clear indication how it can recurse.
>> Well I agree that we should probably fix that, but I have some concerns to
>> remove the existing workaround.
>>
>> See we added that to get rid of a real problem in a customer environment and
>> I don't want to that to show up again.
> It would really help to know more about that case and fix it properly
> rather than workaround it like this. Anyway, let me think how to handle
> the non-blocking notifier invocation then. I was not able to come up
> with anything remotely sane yet.

With avoiding allocating memory in the write lock path I don't see an 
issue any more with that.

All what the write lock path does now is adding items to a linked lists, 
arrays etc....

So there is no more blocking involved here and the read lock side should 
be able to grab the lock immediately.

Christian.
Michal Hocko Aug. 24, 2018, 1:24 p.m. UTC | #29
On Fri 24-08-18 15:10:08, Christian König wrote:
> Am 24.08.2018 um 15:01 schrieb Michal Hocko:
> > On Fri 24-08-18 14:52:26, Christian König wrote:
> > > Am 24.08.2018 um 14:33 schrieb Michal Hocko:
> > [...]
> > > > Thiking about it some more, I can imagine that a notifier callback which
> > > > performs an allocation might trigger a memory reclaim and that in turn
> > > > might trigger a notifier to be invoked and recurse. But notifier
> > > > shouldn't really allocate memory. They are called from deep MM code
> > > > paths and this would be extremely deadlock prone. Maybe Jerome can come
> > > > up some more realistic scenario. If not then I would propose to simplify
> > > > the locking here. We have lockdep to catch self deadlocks and it is
> > > > always better to handle a specific issue rather than having a code
> > > > without a clear indication how it can recurse.
> > > Well I agree that we should probably fix that, but I have some concerns to
> > > remove the existing workaround.
> > > 
> > > See we added that to get rid of a real problem in a customer environment and
> > > I don't want to that to show up again.
> > It would really help to know more about that case and fix it properly
> > rather than workaround it like this. Anyway, let me think how to handle
> > the non-blocking notifier invocation then. I was not able to come up
> > with anything remotely sane yet.
> 
> With avoiding allocating memory in the write lock path I don't see an issue
> any more with that.
> 
> All what the write lock path does now is adding items to a linked lists,
> arrays etc....

Can we change it to non-sleepable lock then?
Christian König Aug. 24, 2018, 1:28 p.m. UTC | #30
Am 24.08.2018 um 15:24 schrieb Michal Hocko:
> On Fri 24-08-18 15:10:08, Christian König wrote:
>> Am 24.08.2018 um 15:01 schrieb Michal Hocko:
>>> On Fri 24-08-18 14:52:26, Christian König wrote:
>>>> Am 24.08.2018 um 14:33 schrieb Michal Hocko:
>>> [...]
>>>>> Thiking about it some more, I can imagine that a notifier callback which
>>>>> performs an allocation might trigger a memory reclaim and that in turn
>>>>> might trigger a notifier to be invoked and recurse. But notifier
>>>>> shouldn't really allocate memory. They are called from deep MM code
>>>>> paths and this would be extremely deadlock prone. Maybe Jerome can come
>>>>> up some more realistic scenario. If not then I would propose to simplify
>>>>> the locking here. We have lockdep to catch self deadlocks and it is
>>>>> always better to handle a specific issue rather than having a code
>>>>> without a clear indication how it can recurse.
>>>> Well I agree that we should probably fix that, but I have some concerns to
>>>> remove the existing workaround.
>>>>
>>>> See we added that to get rid of a real problem in a customer environment and
>>>> I don't want to that to show up again.
>>> It would really help to know more about that case and fix it properly
>>> rather than workaround it like this. Anyway, let me think how to handle
>>> the non-blocking notifier invocation then. I was not able to come up
>>> with anything remotely sane yet.
>> With avoiding allocating memory in the write lock path I don't see an issue
>> any more with that.
>>
>> All what the write lock path does now is adding items to a linked lists,
>> arrays etc....
> Can we change it to non-sleepable lock then?

No, the write side doesn't sleep any more, but the read side does.

See amdgpu_mn_invalidate_node() and that is where you actually need to 
handle the non-blocking flag correctly.

Christian.
Michal Hocko Aug. 24, 2018, 1:32 p.m. UTC | #31
On Fri 24-08-18 22:02:23, Tetsuo Handa wrote:
> On 2018/08/24 20:36, Michal Hocko wrote:
> >> That is, this API seems to be currently used by only out-of-tree users. Since
> >> we can't check that nobody has memory allocation dependency, I think that
> >> hmm_invalidate_range_start() should return -EAGAIN if blockable == false for now.
> > 
> > The code expects that the invalidate_range_end doesn't block if
> > invalidate_range_start hasn't blocked. That is the reason why the end
> > callback doesn't have blockable parameter. If this doesn't hold then the
> > whole scheme is just fragile because those two calls should pair.
> > 
> That is
> 
>   More worrisome part in that patch is that I don't know whether using
>   trylock if blockable == false at entry is really sufficient.
> 
> . Since those two calls should pair, I think that we need to determine whether
> we need to return -EAGAIN at start call by evaluating both calls.

Yes, and I believe I have done that audit. Module my misunderstanding of
the code.

> Like mn_invl_range_start() involves schedule_delayed_work() which could be
> blocked on memory allocation under OOM situation,

It doesn't because that code path is not invoked for the !blockable
case.

> I worry that (currently
> out-of-tree) users of this API are involving work / recursion.

I do not give a slightest about out-of-tree modules. They will have to
accomodate to the new API. I have no problems to extend the
documentation and be explicit about this expectation.
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 133ba78820ee..698e371aafe3 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -153,7 +153,9 @@ struct mmu_notifier_ops {
 	 *
 	 * If blockable argument is set to false then the callback cannot
 	 * sleep and has to return with -EAGAIN. 0 should be returned
-	 * otherwise.
+	 * otherwise. Please note that if invalidate_range_start approves
+	 * a non-blocking behavior then the same applies to
+	 * invalidate_range_end.
 	 *
 	 */
 	int (*invalidate_range_start)(struct mmu_notifier *mn,


> And hmm_release() says that
> 
> 	/*
> 	 * Drop mirrors_sem so callback can wait on any pending
> 	 * work that might itself trigger mmu_notifier callback
> 	 * and thus would deadlock with us.
> 	 */
> 
> and keeps "all operations protected by hmm->mirrors_sem held for write are
> atomic". This suggests that "some operations protected by hmm->mirrors_sem held
> for read will sleep (and in the worst case involves memory allocation
> dependency)".

Yes and so what? The clear expectation is that neither of the range
notifiers do not sleep in !blocking mode. I really fail to see what you
are trying to say.
Michal Hocko Aug. 24, 2018, 1:40 p.m. UTC | #32
On Fri 24-08-18 15:28:33, Christian König wrote:
> Am 24.08.2018 um 15:24 schrieb Michal Hocko:
> > On Fri 24-08-18 15:10:08, Christian König wrote:
> > > Am 24.08.2018 um 15:01 schrieb Michal Hocko:
> > > > On Fri 24-08-18 14:52:26, Christian König wrote:
> > > > > Am 24.08.2018 um 14:33 schrieb Michal Hocko:
> > > > [...]
> > > > > > Thiking about it some more, I can imagine that a notifier callback which
> > > > > > performs an allocation might trigger a memory reclaim and that in turn
> > > > > > might trigger a notifier to be invoked and recurse. But notifier
> > > > > > shouldn't really allocate memory. They are called from deep MM code
> > > > > > paths and this would be extremely deadlock prone. Maybe Jerome can come
> > > > > > up some more realistic scenario. If not then I would propose to simplify
> > > > > > the locking here. We have lockdep to catch self deadlocks and it is
> > > > > > always better to handle a specific issue rather than having a code
> > > > > > without a clear indication how it can recurse.
> > > > > Well I agree that we should probably fix that, but I have some concerns to
> > > > > remove the existing workaround.
> > > > > 
> > > > > See we added that to get rid of a real problem in a customer environment and
> > > > > I don't want to that to show up again.
> > > > It would really help to know more about that case and fix it properly
> > > > rather than workaround it like this. Anyway, let me think how to handle
> > > > the non-blocking notifier invocation then. I was not able to come up
> > > > with anything remotely sane yet.
> > > With avoiding allocating memory in the write lock path I don't see an issue
> > > any more with that.
> > > 
> > > All what the write lock path does now is adding items to a linked lists,
> > > arrays etc....
> > Can we change it to non-sleepable lock then?
> 
> No, the write side doesn't sleep any more, but the read side does.
> 
> See amdgpu_mn_invalidate_node() and that is where you actually need to
> handle the non-blocking flag correctly.

Ohh, right you are. We already handle that by bailing out before calling
amdgpu_mn_invalidate_node in !blockable mode. So does this looks good to
you?

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index e55508b39496..48fa152231be 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -180,11 +180,15 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
  */
 static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
 {
-	if (blockable)
-		mutex_lock(&amn->read_lock);
-	else if (!mutex_trylock(&amn->read_lock))
-		return -EAGAIN;
-
+	/*
+	 * We can take sleepable lock even on !blockable mode because
+	 * read_lock is only ever take from this path and the notifier
+	 * lock never really sleeps. In fact the only reason why the
+	 * later is sleepable is because the notifier itself might sleep
+	 * in amdgpu_mn_invalidate_node but blockable mode is handled
+	 * before calling into that path.
+	 */
+	mutex_lock(&amn->read_lock);
 	if (atomic_inc_return(&amn->recursion) == 1)
 		down_read_non_owner(&amn->lock);
 	mutex_unlock(&amn->read_lock);
Christian König Aug. 24, 2018, 1:44 p.m. UTC | #33
Am 24.08.2018 um 15:40 schrieb Michal Hocko:
> On Fri 24-08-18 15:28:33, Christian König wrote:
>> Am 24.08.2018 um 15:24 schrieb Michal Hocko:
>>> On Fri 24-08-18 15:10:08, Christian König wrote:
>>>> Am 24.08.2018 um 15:01 schrieb Michal Hocko:
>>>>> On Fri 24-08-18 14:52:26, Christian König wrote:
>>>>>> Am 24.08.2018 um 14:33 schrieb Michal Hocko:
>>>>> [...]
>>>>>>> Thiking about it some more, I can imagine that a notifier callback which
>>>>>>> performs an allocation might trigger a memory reclaim and that in turn
>>>>>>> might trigger a notifier to be invoked and recurse. But notifier
>>>>>>> shouldn't really allocate memory. They are called from deep MM code
>>>>>>> paths and this would be extremely deadlock prone. Maybe Jerome can come
>>>>>>> up some more realistic scenario. If not then I would propose to simplify
>>>>>>> the locking here. We have lockdep to catch self deadlocks and it is
>>>>>>> always better to handle a specific issue rather than having a code
>>>>>>> without a clear indication how it can recurse.
>>>>>> Well I agree that we should probably fix that, but I have some concerns to
>>>>>> remove the existing workaround.
>>>>>>
>>>>>> See we added that to get rid of a real problem in a customer environment and
>>>>>> I don't want to that to show up again.
>>>>> It would really help to know more about that case and fix it properly
>>>>> rather than workaround it like this. Anyway, let me think how to handle
>>>>> the non-blocking notifier invocation then. I was not able to come up
>>>>> with anything remotely sane yet.
>>>> With avoiding allocating memory in the write lock path I don't see an issue
>>>> any more with that.
>>>>
>>>> All what the write lock path does now is adding items to a linked lists,
>>>> arrays etc....
>>> Can we change it to non-sleepable lock then?
>> No, the write side doesn't sleep any more, but the read side does.
>>
>> See amdgpu_mn_invalidate_node() and that is where you actually need to
>> handle the non-blocking flag correctly.
> Ohh, right you are. We already handle that by bailing out before calling
> amdgpu_mn_invalidate_node in !blockable mode.

Yeah, that is sufficient.

It could be improved because we have something like 90% chance that 
amdgpu_mn_invalidate_node() actually doesn't need to do anything.

But I can take care of that when the patch set has landed.

> So does this looks good to
> you?

Yeah, that looks perfect to me. Reviewed-by: Christian König 
<christian.koenig@amd.com>

Thanks,
Christian.

>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index e55508b39496..48fa152231be 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -180,11 +180,15 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>    */
>   static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
>   {
> -	if (blockable)
> -		mutex_lock(&amn->read_lock);
> -	else if (!mutex_trylock(&amn->read_lock))
> -		return -EAGAIN;
> -
> +	/*
> +	 * We can take sleepable lock even on !blockable mode because
> +	 * read_lock is only ever take from this path and the notifier
> +	 * lock never really sleeps. In fact the only reason why the
> +	 * later is sleepable is because the notifier itself might sleep
> +	 * in amdgpu_mn_invalidate_node but blockable mode is handled
> +	 * before calling into that path.
> +	 */
> +	mutex_lock(&amn->read_lock);
>   	if (atomic_inc_return(&amn->recursion) == 1)
>   		down_read_non_owner(&amn->lock);
>   	mutex_unlock(&amn->read_lock);
Michal Hocko Aug. 24, 2018, 1:52 p.m. UTC | #34
On Fri 24-08-18 15:44:03, Christian König wrote:
> Am 24.08.2018 um 15:40 schrieb Michal Hocko:
> > On Fri 24-08-18 15:28:33, Christian König wrote:
> > > Am 24.08.2018 um 15:24 schrieb Michal Hocko:
> > > > On Fri 24-08-18 15:10:08, Christian König wrote:
> > > > > Am 24.08.2018 um 15:01 schrieb Michal Hocko:
> > > > > > On Fri 24-08-18 14:52:26, Christian König wrote:
> > > > > > > Am 24.08.2018 um 14:33 schrieb Michal Hocko:
> > > > > > [...]
> > > > > > > > Thiking about it some more, I can imagine that a notifier callback which
> > > > > > > > performs an allocation might trigger a memory reclaim and that in turn
> > > > > > > > might trigger a notifier to be invoked and recurse. But notifier
> > > > > > > > shouldn't really allocate memory. They are called from deep MM code
> > > > > > > > paths and this would be extremely deadlock prone. Maybe Jerome can come
> > > > > > > > up some more realistic scenario. If not then I would propose to simplify
> > > > > > > > the locking here. We have lockdep to catch self deadlocks and it is
> > > > > > > > always better to handle a specific issue rather than having a code
> > > > > > > > without a clear indication how it can recurse.
> > > > > > > Well I agree that we should probably fix that, but I have some concerns to
> > > > > > > remove the existing workaround.
> > > > > > > 
> > > > > > > See we added that to get rid of a real problem in a customer environment and
> > > > > > > I don't want to that to show up again.
> > > > > > It would really help to know more about that case and fix it properly
> > > > > > rather than workaround it like this. Anyway, let me think how to handle
> > > > > > the non-blocking notifier invocation then. I was not able to come up
> > > > > > with anything remotely sane yet.
> > > > > With avoiding allocating memory in the write lock path I don't see an issue
> > > > > any more with that.
> > > > > 
> > > > > All what the write lock path does now is adding items to a linked lists,
> > > > > arrays etc....
> > > > Can we change it to non-sleepable lock then?
> > > No, the write side doesn't sleep any more, but the read side does.
> > > 
> > > See amdgpu_mn_invalidate_node() and that is where you actually need to
> > > handle the non-blocking flag correctly.
> > Ohh, right you are. We already handle that by bailing out before calling
> > amdgpu_mn_invalidate_node in !blockable mode.
> 
> Yeah, that is sufficient.
> 
> It could be improved because we have something like 90% chance that
> amdgpu_mn_invalidate_node() actually doesn't need to do anything.
> 
> But I can take care of that when the patch set has landed.
> 
> > So does this looks good to you?
> 
> Yeah, that looks perfect to me. Reviewed-by: Christian König
> <christian.koenig@amd.com>

Cool! Thanks for your guidance and patience with me. Here is the full
patch. Feel free to take it and route per your preference.

From 4e297bf5a55906ee369d70bee9f7beeb3cba74bb Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 24 Aug 2018 15:45:52 +0200
Subject: [PATCH] drm/amd: clarify amdgpu_mn_read_lock !blocking mode
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Tetsuo has noticed that 93065ac753e4 ("mm, oom: distinguish blockable
mode for mmu notifiers") !blocking case for amdgpu_mn_read_lock is
incomplete because it might sleep on the notifier lock. This is true
but as it turned out from the discussion with Christian this doesn't
really matter.

The amd notifier lock doesn't block in the exclusive mode. It only ever
sleeps with the read lock inside amdgpu_mn_invalidate_node. That one
is not called in !blockable state so while we might sleep on notifier
read_lock this will only be for a short while. The same applies on the
notifier lock.

Therefore remove blockable handling from amdgpu_mn_read_lock and
document it properly.

Noticed-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index e55508b39496..48fa152231be 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -180,11 +180,15 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
  */
 static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
 {
-	if (blockable)
-		mutex_lock(&amn->read_lock);
-	else if (!mutex_trylock(&amn->read_lock))
-		return -EAGAIN;
-
+	/*
+	 * We can take sleepable lock even on !blockable mode because
+	 * read_lock is only ever take from this path and the notifier
+	 * lock never really sleeps. In fact the only reason why the
+	 * later is sleepable is because the notifier itself might sleep
+	 * in amdgpu_mn_invalidate_node but blockable mode is handled
+	 * before calling into that path.
+	 */
+	mutex_lock(&amn->read_lock);
 	if (atomic_inc_return(&amn->recursion) == 1)
 		down_read_non_owner(&amn->lock);
 	mutex_unlock(&amn->read_lock);
Jerome Glisse Aug. 24, 2018, 2:40 p.m. UTC | #35
On Fri, Aug 24, 2018 at 07:54:19PM +0900, Tetsuo Handa wrote:
> Two more worries for this patch.

[...]

> 
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
> >         up_write(&hmm->mirrors_sem);
> >  }
> > 
> > -static void hmm_invalidate_range_start(struct mmu_notifier *mn,
> > +static int hmm_invalidate_range_start(struct mmu_notifier *mn,
> >                                        struct mm_struct *mm,
> >                                        unsigned long start,
> > -                                      unsigned long end)
> > +                                      unsigned long end,
> > +                                      bool blockable)
> >  {
> >         struct hmm *hmm = mm->hmm;
> > 
> >         VM_BUG_ON(!hmm);
> > 
> >         atomic_inc(&hmm->sequence);
> > +
> > +       return 0;
> >  }
> > 
> >  static void hmm_invalidate_range_end(struct mmu_notifier *mn,
> 
> This assumes that hmm_invalidate_range_end() does not have memory
> allocation dependency. But hmm_invalidate_range() from
> hmm_invalidate_range_end() involves
> 
>         down_read(&hmm->mirrors_sem);
>         list_for_each_entry(mirror, &hmm->mirrors, list)
>                 mirror->ops->sync_cpu_device_pagetables(mirror, action,
>                                                         start, end);
>         up_read(&hmm->mirrors_sem);
> 
> sequence. What is surprising is that there is no in-tree user who assigns
> sync_cpu_device_pagetables field.
> 
>   $ grep -Fr sync_cpu_device_pagetables *
>   Documentation/vm/hmm.rst:     /* sync_cpu_device_pagetables() - synchronize page tables
>   include/linux/hmm.h: * will get callbacks through sync_cpu_device_pagetables() operation (see
>   include/linux/hmm.h:    /* sync_cpu_device_pagetables() - synchronize page tables
>   include/linux/hmm.h:    void (*sync_cpu_device_pagetables)(struct hmm_mirror *mirror,
>   include/linux/hmm.h: * hmm_mirror_ops.sync_cpu_device_pagetables() callback, so that CPU page
>   mm/hmm.c:               mirror->ops->sync_cpu_device_pagetables(mirror, action,
> 
> That is, this API seems to be currently used by only out-of-tree users. Since
> we can't check that nobody has memory allocation dependency, I think that
> hmm_invalidate_range_start() should return -EAGAIN if blockable == false for now.

So you can see update and user of this there:

https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-intel-v00
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau-v01
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-radeon-v00

I am still working on Mellanox and AMD GPU patchset.

I will post the HMM changes that adapt to Michal shortly as anyway
thus have been sufficiently tested by now.

https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-4.20&id=78785dcb5ba0924c2c5e7be027793f99ebbc39f3
https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-4.20&id=4fc25571dc893f2b278e90cda9e71e139e01de70

Cheers,
Jérôme
Tetsuo Handa Aug. 24, 2018, 2:52 p.m. UTC | #36
On 2018/08/24 22:32, Michal Hocko wrote:
> On Fri 24-08-18 22:02:23, Tetsuo Handa wrote:
>> I worry that (currently
>> out-of-tree) users of this API are involving work / recursion.
> 
> I do not give a slightest about out-of-tree modules. They will have to
> accomodate to the new API. I have no problems to extend the
> documentation and be explicit about this expectation.

You don't need to care about out-of-tree modules. But you need to hear from
mm/hmm.c authors/maintainers when making changes for mmu-notifiers.

> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 133ba78820ee..698e371aafe3 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -153,7 +153,9 @@ struct mmu_notifier_ops {
>  	 *
>  	 * If blockable argument is set to false then the callback cannot
>  	 * sleep and has to return with -EAGAIN. 0 should be returned
> -	 * otherwise.
> +	 * otherwise. Please note that if invalidate_range_start approves
> +	 * a non-blocking behavior then the same applies to
> +	 * invalidate_range_end.

Prior to 93065ac753e44438 ("mm, oom: distinguish blockable mode for mmu
notifiers"), whether to utilize MMU_INVALIDATE_DOES_NOT_BLOCK was up to
mmu-notifiers users.

	-	 * If both of these callbacks cannot block, and invalidate_range
	-	 * cannot block, mmu_notifier_ops.flags should have
	-	 * MMU_INVALIDATE_DOES_NOT_BLOCK set.
	+	 * If blockable argument is set to false then the callback cannot
	+	 * sleep and has to return with -EAGAIN. 0 should be returned
	+	 * otherwise.

Even out-of-tree mmu-notifiers users had rights not to accommodate (i.e.
make changes) immediately by not setting MMU_INVALIDATE_DOES_NOT_BLOCK.

Now we are in a merge window. And we noticed a possibility that out-of-tree
mmu-notifiers users might have trouble with making changes immediately in order
to follow 93065ac753e44438 if expectation for mm/hmm.c changes immediately.
And you are trying to ignore such possibility by just updating expected behavior
description instead of giving out-of-tree users a grace period to check and update
their code.

>> and keeps "all operations protected by hmm->mirrors_sem held for write are
>> atomic". This suggests that "some operations protected by hmm->mirrors_sem held
>> for read will sleep (and in the worst case involves memory allocation
>> dependency)".
> 
> Yes and so what? The clear expectation is that neither of the range
> notifiers do not sleep in !blocking mode. I really fail to see what you
> are trying to say.

I'm saying "Get ACK from Jérôme about mm/hmm.c changes".
Jerome Glisse Aug. 24, 2018, 3:08 p.m. UTC | #37
On Fri, Aug 24, 2018 at 02:33:41PM +0200, Michal Hocko wrote:
> On Fri 24-08-18 14:18:44, Christian König wrote:
> > Am 24.08.2018 um 14:03 schrieb Michal Hocko:
> > > On Fri 24-08-18 13:57:52, Christian König wrote:
> > > > Am 24.08.2018 um 13:52 schrieb Michal Hocko:
> > > > > On Fri 24-08-18 13:43:16, Christian König wrote:
> > > [...]
> > > > > > That won't work like this there might be multiple
> > > > > > invalidate_range_start()/invalidate_range_end() pairs open at the same time.
> > > > > > E.g. the lock might be taken recursively and that is illegal for a
> > > > > > rw_semaphore.
> > > > > I am not sure I follow. Are you saying that one invalidate_range might
> > > > > trigger another one from the same path?
> > > > No, but what can happen is:
> > > > 
> > > > invalidate_range_start(A,B);
> > > > invalidate_range_start(C,D);
> > > > ...
> > > > invalidate_range_end(C,D);
> > > > invalidate_range_end(A,B);
> > > > 
> > > > Grabbing the read lock twice would be illegal in this case.
> > > I am sorry but I still do not follow. What is the context the two are
> > > called from?
> > 
> > I don't have the slightest idea.
> > 
> > > Can you give me an example. I simply do not see it in the
> > > code, mostly because I am not familiar with it.
> > 
> > I'm neither.
> > 
> > We stumbled over that by pure observation and after discussing the problem
> > with Jerome came up with this solution.
> > 
> > No idea where exactly that case comes from, but I can confirm that it indeed
> > happens.
> 
> Thiking about it some more, I can imagine that a notifier callback which
> performs an allocation might trigger a memory reclaim and that in turn
> might trigger a notifier to be invoked and recurse. But notifier
> shouldn't really allocate memory. They are called from deep MM code
> paths and this would be extremely deadlock prone. Maybe Jerome can come
> up some more realistic scenario. If not then I would propose to simplify
> the locking here. We have lockdep to catch self deadlocks and it is
> always better to handle a specific issue rather than having a code
> without a clear indication how it can recurse.

Multiple concurrent mmu notifier, for overlapping range or not, is
common (each concurrent threads can trigger some). So you might have
multiple invalidate_range_start() in flight for same mm and thus might
complete in different order (invalidate_range_end()). IIRC this is
what this lock was trying to protect against.

I can't think of a reason for recursive mmu notifier call right now.
I will ponder see if i remember something about it.

Cheers,
Jérôme
Jerome Glisse Aug. 24, 2018, 3:12 p.m. UTC | #38
On Fri, Aug 24, 2018 at 11:52:25PM +0900, Tetsuo Handa wrote:
> On 2018/08/24 22:32, Michal Hocko wrote:
> > On Fri 24-08-18 22:02:23, Tetsuo Handa wrote:
> >> I worry that (currently
> >> out-of-tree) users of this API are involving work / recursion.
> > 
> > I do not give a slightest about out-of-tree modules. They will have to
> > accomodate to the new API. I have no problems to extend the
> > documentation and be explicit about this expectation.
> 
> You don't need to care about out-of-tree modules. But you need to hear from
> mm/hmm.c authors/maintainers when making changes for mmu-notifiers.
> 
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index 133ba78820ee..698e371aafe3 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -153,7 +153,9 @@ struct mmu_notifier_ops {
> >  	 *
> >  	 * If blockable argument is set to false then the callback cannot
> >  	 * sleep and has to return with -EAGAIN. 0 should be returned
> > -	 * otherwise.
> > +	 * otherwise. Please note that if invalidate_range_start approves
> > +	 * a non-blocking behavior then the same applies to
> > +	 * invalidate_range_end.
> 
> Prior to 93065ac753e44438 ("mm, oom: distinguish blockable mode for mmu
> notifiers"), whether to utilize MMU_INVALIDATE_DOES_NOT_BLOCK was up to
> mmu-notifiers users.
> 
> 	-	 * If both of these callbacks cannot block, and invalidate_range
> 	-	 * cannot block, mmu_notifier_ops.flags should have
> 	-	 * MMU_INVALIDATE_DOES_NOT_BLOCK set.
> 	+	 * If blockable argument is set to false then the callback cannot
> 	+	 * sleep and has to return with -EAGAIN. 0 should be returned
> 	+	 * otherwise.
> 
> Even out-of-tree mmu-notifiers users had rights not to accommodate (i.e.
> make changes) immediately by not setting MMU_INVALIDATE_DOES_NOT_BLOCK.
> 
> Now we are in a merge window. And we noticed a possibility that out-of-tree
> mmu-notifiers users might have trouble with making changes immediately in order
> to follow 93065ac753e44438 if expectation for mm/hmm.c changes immediately.
> And you are trying to ignore such possibility by just updating expected behavior
> description instead of giving out-of-tree users a grace period to check and update
> their code.

Intention is that 99% of HMM users will be upstream as long as they are
not people shouldn't worry. We have been working on nouveau to use it
for the last year or so. Many bits were added in 4.16, 4.17, 4.18 and i
hope it will all be there in 4.20/4.21 timeframe.

See my other mail for list of other users.

> 
> >> and keeps "all operations protected by hmm->mirrors_sem held for write are
> >> atomic". This suggests that "some operations protected by hmm->mirrors_sem held
> >> for read will sleep (and in the worst case involves memory allocation
> >> dependency)".
> > 
> > Yes and so what? The clear expectation is that neither of the range
> > notifiers do not sleep in !blocking mode. I really fail to see what you
> > are trying to say.
> 
> I'm saying "Get ACK from Jérôme about mm/hmm.c changes".

I am fine with Michal patch, i already said so couple month ago first time
this discussion did pop up, Michal you can add:

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Michal Hocko Aug. 24, 2018, 4:38 p.m. UTC | #39
On Fri 24-08-18 23:52:25, Tetsuo Handa wrote:
> On 2018/08/24 22:32, Michal Hocko wrote:
> > On Fri 24-08-18 22:02:23, Tetsuo Handa wrote:
> >> I worry that (currently
> >> out-of-tree) users of this API are involving work / recursion.
> > 
> > I do not give a slightest about out-of-tree modules. They will have to
> > accomodate to the new API. I have no problems to extend the
> > documentation and be explicit about this expectation.
> 
> You don't need to care about out-of-tree modules. But you need to hear from
> mm/hmm.c authors/maintainers when making changes for mmu-notifiers.
> 
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index 133ba78820ee..698e371aafe3 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -153,7 +153,9 @@ struct mmu_notifier_ops {
> >  	 *
> >  	 * If blockable argument is set to false then the callback cannot
> >  	 * sleep and has to return with -EAGAIN. 0 should be returned
> > -	 * otherwise.
> > +	 * otherwise. Please note that if invalidate_range_start approves
> > +	 * a non-blocking behavior then the same applies to
> > +	 * invalidate_range_end.
> 
> Prior to 93065ac753e44438 ("mm, oom: distinguish blockable mode for mmu
> notifiers"), whether to utilize MMU_INVALIDATE_DOES_NOT_BLOCK was up to
> mmu-notifiers users.
> 
> 	-	 * If both of these callbacks cannot block, and invalidate_range
> 	-	 * cannot block, mmu_notifier_ops.flags should have
> 	-	 * MMU_INVALIDATE_DOES_NOT_BLOCK set.
> 	+	 * If blockable argument is set to false then the callback cannot
> 	+	 * sleep and has to return with -EAGAIN. 0 should be returned
> 	+	 * otherwise.
> 
> Even out-of-tree mmu-notifiers users had rights not to accommodate (i.e.
> make changes) immediately by not setting MMU_INVALIDATE_DOES_NOT_BLOCK.
> 
> Now we are in a merge window. And we noticed a possibility that out-of-tree
> mmu-notifiers users might have trouble with making changes immediately in order
> to follow 93065ac753e44438 if expectation for mm/hmm.c changes immediately.
> And you are trying to ignore such possibility by just updating expected behavior
> description instead of giving out-of-tree users a grace period to check and update
> their code.

This is just ridiculous. I have no idea what you are trying to achieve
here but please read through Documentation/process/stable-api-nonsense.rst
before you try to make strong statements again.

I have changed an in-kernel interface. I have gone through all users and
fixed them up. It is really appreciated to double check after me and I
am willing to fix up any fallouts. But that is just about it. I do not
get a whit about _any_ out of tree drivers when changing the interface.
I am willing to answer any questions regarding this change so developers
of those drivers know how to do their change properly but doing so is
completely their business.
 
> >> and keeps "all operations protected by hmm->mirrors_sem held for write are
> >> atomic". This suggests that "some operations protected by hmm->mirrors_sem held
> >> for read will sleep (and in the worst case involves memory allocation
> >> dependency)".
> > 
> > Yes and so what? The clear expectation is that neither of the range
> > notifiers do not sleep in !blocking mode. I really fail to see what you
> > are trying to say.
> 
> I'm saying "Get ACK from Jérôme about mm/hmm.c changes".

HMM is a library layer for other driver, until those get merged the same
applies for them as well.
Michal Hocko Aug. 24, 2018, 4:40 p.m. UTC | #40
On Fri 24-08-18 11:12:40, Jerome Glisse wrote:
[...]
> I am fine with Michal patch, i already said so couple month ago first time
> this discussion did pop up, Michal you can add:
> 
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

So I guess the below is the patch you were talking about?

From f7ac75277d526dccd011f343818dc6af627af2af Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 24 Aug 2018 15:32:24 +0200
Subject: [PATCH] mm, mmu_notifier: be explicit about range invalition
 non-blocking mode

If invalidate_range_start is called for !blocking mode then all
callbacks have to guarantee they will no block/sleep. The same obviously
applies to invalidate_range_end because this operation pairs with the
former and they are called from the same context. Make sure this is
appropriately documented.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mmu_notifier.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 133ba78820ee..698e371aafe3 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -153,7 +153,9 @@ struct mmu_notifier_ops {
 	 *
 	 * If blockable argument is set to false then the callback cannot
 	 * sleep and has to return with -EAGAIN. 0 should be returned
-	 * otherwise.
+	 * otherwise. Please note that if invalidate_range_start approves
+	 * a non-blocking behavior then the same applies to
+	 * invalidate_range_end.
 	 *
 	 */
 	int (*invalidate_range_start)(struct mmu_notifier *mn,
Jerome Glisse Aug. 24, 2018, 5:33 p.m. UTC | #41
On Fri, Aug 24, 2018 at 06:40:03PM +0200, Michal Hocko wrote:
> On Fri 24-08-18 11:12:40, Jerome Glisse wrote:
> [...]
> > I am fine with Michal patch, i already said so couple month ago first time
> > this discussion did pop up, Michal you can add:
> > 
> > Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> 
> So I guess the below is the patch you were talking about?
> 
> From f7ac75277d526dccd011f343818dc6af627af2af Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Fri, 24 Aug 2018 15:32:24 +0200
> Subject: [PATCH] mm, mmu_notifier: be explicit about range invalition
>  non-blocking mode
> 
> If invalidate_range_start is called for !blocking mode then all
> callbacks have to guarantee they will no block/sleep. The same obviously
> applies to invalidate_range_end because this operation pairs with the
> former and they are called from the same context. Make sure this is
> appropriately documented.

In my branch i already updated HMM to be like other existing user
ie all blocking operation in the start callback. But yes it would
be wise to added such comments.


> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/mmu_notifier.h | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 133ba78820ee..698e371aafe3 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -153,7 +153,9 @@ struct mmu_notifier_ops {
>  	 *
>  	 * If blockable argument is set to false then the callback cannot
>  	 * sleep and has to return with -EAGAIN. 0 should be returned
> -	 * otherwise.
> +	 * otherwise. Please note that if invalidate_range_start approves
> +	 * a non-blocking behavior then the same applies to
> +	 * invalidate_range_end.
>  	 *
>  	 */
>  	int (*invalidate_range_start)(struct mmu_notifier *mn,
> -- 
> 2.18.0
> 
> -- 
> Michal Hocko
> SUSE Labs
Tetsuo Handa Aug. 26, 2018, 8:40 a.m. UTC | #42
On 2018/08/24 22:52, Michal Hocko wrote:
> @@ -180,11 +180,15 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>   */
>  static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
>  {
> -	if (blockable)
> -		mutex_lock(&amn->read_lock);
> -	else if (!mutex_trylock(&amn->read_lock))
> -		return -EAGAIN;
> -
> +	/*
> +	 * We can take sleepable lock even on !blockable mode because
> +	 * read_lock is only ever take from this path and the notifier
> +	 * lock never really sleeps. In fact the only reason why the
> +	 * later is sleepable is because the notifier itself might sleep
> +	 * in amdgpu_mn_invalidate_node but blockable mode is handled
> +	 * before calling into that path.
> +	 */
> +	mutex_lock(&amn->read_lock);
>  	if (atomic_inc_return(&amn->recursion) == 1)
>  		down_read_non_owner(&amn->lock);
>  	mutex_unlock(&amn->read_lock);
> 

I'm not following. Why don't we need to do like below (given that
nobody except amdgpu_mn_read_lock() holds ->read_lock) because e.g.
drm_sched_fence_create() from drm_sched_job_init() from amdgpu_cs_submit()
is doing GFP_KERNEL memory allocation with ->lock held for write?

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index e55508b..e1cb344 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -64,8 +64,6 @@
  * @node: hash table node to find structure by adev and mn
  * @lock: rw semaphore protecting the notifier nodes
  * @objects: interval tree containing amdgpu_mn_nodes
- * @read_lock: mutex for recursive locking of @lock
- * @recursion: depth of recursion
  *
  * Data for each amdgpu device and process address space.
  */
@@ -85,8 +83,6 @@ struct amdgpu_mn {
 	/* objects protected by lock */
 	struct rw_semaphore	lock;
 	struct rb_root_cached	objects;
-	struct mutex		read_lock;
-	atomic_t		recursion;
 };
 
 /**
@@ -181,14 +177,9 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
 static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
 {
 	if (blockable)
-		mutex_lock(&amn->read_lock);
-	else if (!mutex_trylock(&amn->read_lock))
+		down_read(&amn->lock);
+	else if (!down_read_trylock(&amn->lock))
 		return -EAGAIN;
-
-	if (atomic_inc_return(&amn->recursion) == 1)
-		down_read_non_owner(&amn->lock);
-	mutex_unlock(&amn->read_lock);
-
 	return 0;
 }
 
@@ -199,8 +190,7 @@ static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
  */
 static void amdgpu_mn_read_unlock(struct amdgpu_mn *amn)
 {
-	if (atomic_dec_return(&amn->recursion) == 0)
-		up_read_non_owner(&amn->lock);
+	up_read(&amn->lock);
 }
 
 /**
@@ -410,8 +400,6 @@ struct amdgpu_mn *amdgpu_mn_get(struct amdgpu_device *adev,
 	amn->type = type;
 	amn->mn.ops = &amdgpu_mn_ops[type];
 	amn->objects = RB_ROOT_CACHED;
-	mutex_init(&amn->read_lock);
-	atomic_set(&amn->recursion, 0);
 
 	r = __mmu_notifier_register(&amn->mn, mm);
 	if (r)
Christian König Aug. 27, 2018, 7:41 a.m. UTC | #43
Am 26.08.2018 um 10:40 schrieb Tetsuo Handa:
> On 2018/08/24 22:52, Michal Hocko wrote:
>> @@ -180,11 +180,15 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>>    */
>>   static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
>>   {
>> -	if (blockable)
>> -		mutex_lock(&amn->read_lock);
>> -	else if (!mutex_trylock(&amn->read_lock))
>> -		return -EAGAIN;
>> -
>> +	/*
>> +	 * We can take sleepable lock even on !blockable mode because
>> +	 * read_lock is only ever take from this path and the notifier
>> +	 * lock never really sleeps. In fact the only reason why the
>> +	 * later is sleepable is because the notifier itself might sleep
>> +	 * in amdgpu_mn_invalidate_node but blockable mode is handled
>> +	 * before calling into that path.
>> +	 */
>> +	mutex_lock(&amn->read_lock);
>>   	if (atomic_inc_return(&amn->recursion) == 1)
>>   		down_read_non_owner(&amn->lock);
>>   	mutex_unlock(&amn->read_lock);
>>
> I'm not following. Why don't we need to do like below (given that
> nobody except amdgpu_mn_read_lock() holds ->read_lock) because e.g.
> drm_sched_fence_create() from drm_sched_job_init() from amdgpu_cs_submit()
> is doing GFP_KERNEL memory allocation with ->lock held for write?

That's a bug which needs to be fixed separately.

Allocating memory with GFP_KERNEL while holding a lock which is also 
taken in the reclaim code path is illegal not matter what you do.

Patches to fix this are already on the appropriate mailing list and will 
be pushed upstream today.

Regards,
Christian.

>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index e55508b..e1cb344 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -64,8 +64,6 @@
>    * @node: hash table node to find structure by adev and mn
>    * @lock: rw semaphore protecting the notifier nodes
>    * @objects: interval tree containing amdgpu_mn_nodes
> - * @read_lock: mutex for recursive locking of @lock
> - * @recursion: depth of recursion
>    *
>    * Data for each amdgpu device and process address space.
>    */
> @@ -85,8 +83,6 @@ struct amdgpu_mn {
>   	/* objects protected by lock */
>   	struct rw_semaphore	lock;
>   	struct rb_root_cached	objects;
> -	struct mutex		read_lock;
> -	atomic_t		recursion;
>   };
>   
>   /**
> @@ -181,14 +177,9 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>   static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
>   {
>   	if (blockable)
> -		mutex_lock(&amn->read_lock);
> -	else if (!mutex_trylock(&amn->read_lock))
> +		down_read(&amn->lock);
> +	else if (!down_read_trylock(&amn->lock))
>   		return -EAGAIN;
> -
> -	if (atomic_inc_return(&amn->recursion) == 1)
> -		down_read_non_owner(&amn->lock);
> -	mutex_unlock(&amn->read_lock);
> -
>   	return 0;
>   }
>   
> @@ -199,8 +190,7 @@ static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
>    */
>   static void amdgpu_mn_read_unlock(struct amdgpu_mn *amn)
>   {
> -	if (atomic_dec_return(&amn->recursion) == 0)
> -		up_read_non_owner(&amn->lock);
> +	up_read(&amn->lock);
>   }
>   
>   /**
> @@ -410,8 +400,6 @@ struct amdgpu_mn *amdgpu_mn_get(struct amdgpu_device *adev,
>   	amn->type = type;
>   	amn->mn.ops = &amdgpu_mn_ops[type];
>   	amn->objects = RB_ROOT_CACHED;
> -	mutex_init(&amn->read_lock);
> -	atomic_set(&amn->recursion, 0);
>   
>   	r = __mmu_notifier_register(&amn->mn, mm);
>   	if (r)
Tetsuo Handa Sept. 6, 2018, 10:46 p.m. UTC | #44
On 2018/08/27 16:41, Christian König wrote:
> Am 26.08.2018 um 10:40 schrieb Tetsuo Handa:
>> I'm not following. Why don't we need to do like below (given that
>> nobody except amdgpu_mn_read_lock() holds ->read_lock) because e.g.
>> drm_sched_fence_create() from drm_sched_job_init() from amdgpu_cs_submit()
>> is doing GFP_KERNEL memory allocation with ->lock held for write?
> 
> That's a bug which needs to be fixed separately.
> 
> Allocating memory with GFP_KERNEL while holding a lock which is also taken in the reclaim code path is illegal not matter what you do.
> 
> Patches to fix this are already on the appropriate mailing list and will be pushed upstream today.
> 
> Regards,
> Christian.

Commit 4a2de54dc1d7668f ("drm/amdgpu: fix holding mn_lock while allocating memory")
seems to be calling amdgpu_mn_unlock() without amdgpu_mn_lock() when
drm_sched_job_init() failed... 



Michal, you are asking me to fix all bugs (including out of tree code) and prevent
future bugs just because you want to avoid using timeout in order to avoid OOM lockup
( https://marc.info/?i=55a3fb37-3246-73d7-0f45-5835a3f4831c@i-love.sakura.ne.jp ).
That is a too much request which is impossible for even you. More you count on
the OOM reaper, we exponentially complicates dependency and more likely to stumble
over unreviewed/untested code...
diff mbox

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6bcecc325e7e..ac08f5d711be 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7203,8 +7203,9 @@  static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
 	kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);
 }
 
-void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
-		unsigned long start, unsigned long end)
+int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+		unsigned long start, unsigned long end,
+		bool blockable)
 {
 	unsigned long apic_address;
 
@@ -7215,6 +7216,8 @@  void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 	apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
 	if (start <= apic_address && apic_address < end)
 		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
+
+	return 0;
 }
 
 void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index 83e344fbb50a..3399a4a927fb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -136,12 +136,18 @@  void amdgpu_mn_unlock(struct amdgpu_mn *mn)
  *
  * Take the rmn read side lock.
  */
-static void amdgpu_mn_read_lock(struct amdgpu_mn *rmn)
+static int amdgpu_mn_read_lock(struct amdgpu_mn *rmn, bool blockable)
 {
-	mutex_lock(&rmn->read_lock);
+	if (blockable)
+		mutex_lock(&rmn->read_lock);
+	else if (!mutex_trylock(&rmn->read_lock))
+		return -EAGAIN;
+
 	if (atomic_inc_return(&rmn->recursion) == 1)
 		down_read_non_owner(&rmn->lock);
 	mutex_unlock(&rmn->read_lock);
+
+	return 0;
 }
 
 /**
@@ -197,10 +203,11 @@  static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
  * We block for all BOs between start and end to be idle and
  * unmap them by move them into system domain again.
  */
-static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
+static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
 						 struct mm_struct *mm,
 						 unsigned long start,
-						 unsigned long end)
+						 unsigned long end,
+						 bool blockable)
 {
 	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
 	struct interval_tree_node *it;
@@ -208,17 +215,28 @@  static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
 	/* notification is exclusive, but interval is inclusive */
 	end -= 1;
 
-	amdgpu_mn_read_lock(rmn);
+	/* TODO we should be able to split locking for interval tree and
+	 * amdgpu_mn_invalidate_node
+	 */
+	if (amdgpu_mn_read_lock(rmn, blockable))
+		return -EAGAIN;
 
 	it = interval_tree_iter_first(&rmn->objects, start, end);
 	while (it) {
 		struct amdgpu_mn_node *node;
 
+		if (!blockable) {
+			amdgpu_mn_read_unlock(rmn);
+			return -EAGAIN;
+		}
+
 		node = container_of(it, struct amdgpu_mn_node, it);
 		it = interval_tree_iter_next(it, start, end);
 
 		amdgpu_mn_invalidate_node(node, start, end);
 	}
+
+	return 0;
 }
 
 /**
@@ -233,10 +251,11 @@  static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
  * necessitates evicting all user-mode queues of the process. The BOs
  * are restorted in amdgpu_mn_invalidate_range_end_hsa.
  */
-static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
+static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
 						 struct mm_struct *mm,
 						 unsigned long start,
-						 unsigned long end)
+						 unsigned long end,
+						 bool blockable)
 {
 	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
 	struct interval_tree_node *it;
@@ -244,13 +263,19 @@  static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
 	/* notification is exclusive, but interval is inclusive */
 	end -= 1;
 
-	amdgpu_mn_read_lock(rmn);
+	if (amdgpu_mn_read_lock(rmn, blockable))
+		return -EAGAIN;
 
 	it = interval_tree_iter_first(&rmn->objects, start, end);
 	while (it) {
 		struct amdgpu_mn_node *node;
 		struct amdgpu_bo *bo;
 
+		if (!blockable) {
+			amdgpu_mn_read_unlock(rmn);
+			return -EAGAIN;
+		}
+
 		node = container_of(it, struct amdgpu_mn_node, it);
 		it = interval_tree_iter_next(it, start, end);
 
@@ -262,6 +287,8 @@  static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
 				amdgpu_amdkfd_evict_userptr(mem, mm);
 		}
 	}
+
+	return 0;
 }
 
 /**
diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 854bd51b9478..9cbff68f6b41 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -112,10 +112,11 @@  static void del_object(struct i915_mmu_object *mo)
 	mo->attached = false;
 }
 
-static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
+static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 						       struct mm_struct *mm,
 						       unsigned long start,
-						       unsigned long end)
+						       unsigned long end,
+						       bool blockable)
 {
 	struct i915_mmu_notifier *mn =
 		container_of(_mn, struct i915_mmu_notifier, mn);
@@ -124,7 +125,7 @@  static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 	LIST_HEAD(cancelled);
 
 	if (RB_EMPTY_ROOT(&mn->objects.rb_root))
-		return;
+		return 0;
 
 	/* interval ranges are inclusive, but invalidate range is exclusive */
 	end--;
@@ -132,6 +133,10 @@  static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 	spin_lock(&mn->lock);
 	it = interval_tree_iter_first(&mn->objects, start, end);
 	while (it) {
+		if (!blockable) {
+			spin_unlock(&mn->lock);
+			return -EAGAIN;
+		}
 		/* The mmu_object is released late when destroying the
 		 * GEM object so it is entirely possible to gain a
 		 * reference on an object in the process of being freed
@@ -154,6 +159,8 @@  static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 
 	if (!list_empty(&cancelled))
 		flush_workqueue(mn->wq);
+
+	return 0;
 }
 
 static const struct mmu_notifier_ops i915_gem_userptr_notifier = {
diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
index abd24975c9b1..f8b35df44c60 100644
--- a/drivers/gpu/drm/radeon/radeon_mn.c
+++ b/drivers/gpu/drm/radeon/radeon_mn.c
@@ -118,19 +118,27 @@  static void radeon_mn_release(struct mmu_notifier *mn,
  * We block for all BOs between start and end to be idle and
  * unmap them by move them into system domain again.
  */
-static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
+static int radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
 					     unsigned long start,
-					     unsigned long end)
+					     unsigned long end,
+					     bool blockable)
 {
 	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
 	struct ttm_operation_ctx ctx = { false, false };
 	struct interval_tree_node *it;
+	int ret = 0;
 
 	/* notification is exclusive, but interval is inclusive */
 	end -= 1;
 
-	mutex_lock(&rmn->lock);
+	/* TODO we should be able to split locking for interval tree and
+	 * the tear down.
+	 */
+	if (blockable)
+		mutex_lock(&rmn->lock);
+	else if (!mutex_trylock(&rmn->lock))
+		return -EAGAIN;
 
 	it = interval_tree_iter_first(&rmn->objects, start, end);
 	while (it) {
@@ -138,6 +146,11 @@  static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
 		struct radeon_bo *bo;
 		long r;
 
+		if (!blockable) {
+			ret = -EAGAIN;
+			goto out_unlock;
+		}
+
 		node = container_of(it, struct radeon_mn_node, it);
 		it = interval_tree_iter_next(it, start, end);
 
@@ -166,7 +179,10 @@  static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
 		}
 	}
 	
+out_unlock:
 	mutex_unlock(&rmn->lock);
+
+	return ret;
 }
 
 static const struct mmu_notifier_ops radeon_mn_ops = {
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 182436b92ba9..6ec748eccff7 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -186,6 +186,7 @@  static void ib_umem_notifier_release(struct mmu_notifier *mn,
 	rbt_ib_umem_for_each_in_range(&context->umem_tree, 0,
 				      ULLONG_MAX,
 				      ib_umem_notifier_release_trampoline,
+				      true,
 				      NULL);
 	up_read(&context->umem_rwsem);
 }
@@ -207,22 +208,31 @@  static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
 	return 0;
 }
 
-static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
+static int ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
-						    unsigned long end)
+						    unsigned long end,
+						    bool blockable)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
+	int ret;
 
 	if (!context->invalidate_range)
-		return;
+		return 0;
+
+	if (blockable)
+		down_read(&context->umem_rwsem);
+	else if (!down_read_trylock(&context->umem_rwsem))
+		return -EAGAIN;
 
 	ib_ucontext_notifier_start_account(context);
-	down_read(&context->umem_rwsem);
-	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
+	ret = rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
 				      end,
-				      invalidate_range_start_trampoline, NULL);
+				      invalidate_range_start_trampoline,
+				      blockable, NULL);
 	up_read(&context->umem_rwsem);
+
+	return ret;
 }
 
 static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
@@ -242,10 +252,15 @@  static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
 	if (!context->invalidate_range)
 		return;
 
+	/*
+	 * TODO: we currently bail out if there is any sleepable work to be done
+	 * in ib_umem_notifier_invalidate_range_start so we shouldn't really block
+	 * here. But this is ugly and fragile.
+	 */
 	down_read(&context->umem_rwsem);
 	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
 				      end,
-				      invalidate_range_end_trampoline, NULL);
+				      invalidate_range_end_trampoline, true, NULL);
 	up_read(&context->umem_rwsem);
 	ib_ucontext_notifier_end_account(context);
 }
@@ -798,6 +813,7 @@  EXPORT_SYMBOL(ib_umem_odp_unmap_dma_pages);
 int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
 				  u64 start, u64 last,
 				  umem_call_back cb,
+				  bool blockable,
 				  void *cookie)
 {
 	int ret_val = 0;
@@ -809,6 +825,9 @@  int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
 
 	for (node = rbt_ib_umem_iter_first(root, start, last - 1);
 			node; node = next) {
+		/* TODO move the blockable decision up to the callback */
+		if (!blockable)
+			return -EAGAIN;
 		next = rbt_ib_umem_iter_next(node, start, last - 1);
 		umem = container_of(node, struct ib_umem_odp, interval_tree);
 		ret_val = cb(umem->umem, start, last, cookie) || ret_val;
diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
index 70aceefe14d5..e1c7996c018e 100644
--- a/drivers/infiniband/hw/hfi1/mmu_rb.c
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
@@ -67,9 +67,9 @@  struct mmu_rb_handler {
 
 static unsigned long mmu_node_start(struct mmu_rb_node *);
 static unsigned long mmu_node_last(struct mmu_rb_node *);
-static void mmu_notifier_range_start(struct mmu_notifier *,
+static int mmu_notifier_range_start(struct mmu_notifier *,
 				     struct mm_struct *,
-				     unsigned long, unsigned long);
+				     unsigned long, unsigned long, bool);
 static struct mmu_rb_node *__mmu_rb_search(struct mmu_rb_handler *,
 					   unsigned long, unsigned long);
 static void do_remove(struct mmu_rb_handler *handler,
@@ -284,10 +284,11 @@  void hfi1_mmu_rb_remove(struct mmu_rb_handler *handler,
 	handler->ops->remove(handler->ops_arg, node);
 }
 
-static void mmu_notifier_range_start(struct mmu_notifier *mn,
+static int mmu_notifier_range_start(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
 				     unsigned long start,
-				     unsigned long end)
+				     unsigned long end,
+				     bool blockable)
 {
 	struct mmu_rb_handler *handler =
 		container_of(mn, struct mmu_rb_handler, mn);
@@ -313,6 +314,8 @@  static void mmu_notifier_range_start(struct mmu_notifier *mn,
 
 	if (added)
 		queue_work(handler->wq, &handler->del_work);
+
+	return 0;
 }
 
 /*
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index f1a87a690a4c..d216e0d2921d 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -488,7 +488,7 @@  void mlx5_ib_free_implicit_mr(struct mlx5_ib_mr *imr)
 
 	down_read(&ctx->umem_rwsem);
 	rbt_ib_umem_for_each_in_range(&ctx->umem_tree, 0, ULLONG_MAX,
-				      mr_leaf_free, imr);
+				      mr_leaf_free, true, imr);
 	up_read(&ctx->umem_rwsem);
 
 	wait_event(imr->q_leaf_free, !atomic_read(&imr->num_leaf_free));
diff --git a/drivers/misc/mic/scif/scif_dma.c b/drivers/misc/mic/scif/scif_dma.c
index 63d6246d6dff..6369aeaa7056 100644
--- a/drivers/misc/mic/scif/scif_dma.c
+++ b/drivers/misc/mic/scif/scif_dma.c
@@ -200,15 +200,18 @@  static void scif_mmu_notifier_release(struct mmu_notifier *mn,
 	schedule_work(&scif_info.misc_work);
 }
 
-static void scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
+static int scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						     struct mm_struct *mm,
 						     unsigned long start,
-						     unsigned long end)
+						     unsigned long end,
+						     bool blockable)
 {
 	struct scif_mmu_notif	*mmn;
 
 	mmn = container_of(mn, struct scif_mmu_notif, ep_mmu_notifier);
 	scif_rma_destroy_tcw(mmn, start, end - start);
+
+	return 0;
 }
 
 static void scif_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index a3454eb56fbf..be28f05bfafa 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -219,9 +219,10 @@  void gru_flush_all_tlb(struct gru_state *gru)
 /*
  * MMUOPS notifier callout functions
  */
-static void gru_invalidate_range_start(struct mmu_notifier *mn,
+static int gru_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end)
+				       unsigned long start, unsigned long end,
+				       bool blockable)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -231,6 +232,8 @@  static void gru_invalidate_range_start(struct mmu_notifier *mn,
 	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms,
 		start, end, atomic_read(&gms->ms_range_active));
 	gru_flush_tlb_range(gms, start, end - start);
+
+	return 0;
 }
 
 static void gru_invalidate_range_end(struct mmu_notifier *mn,
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index bd56653b9bbc..55b4f0e3f4d6 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -441,18 +441,25 @@  static const struct vm_operations_struct gntdev_vmops = {
 
 /* ------------------------------------------------------------------ */
 
+static bool in_range(struct grant_map *map,
+			      unsigned long start, unsigned long end)
+{
+	if (!map->vma)
+		return false;
+	if (map->vma->vm_start >= end)
+		return false;
+	if (map->vma->vm_end <= start)
+		return false;
+
+	return true;
+}
+
 static void unmap_if_in_range(struct grant_map *map,
 			      unsigned long start, unsigned long end)
 {
 	unsigned long mstart, mend;
 	int err;
 
-	if (!map->vma)
-		return;
-	if (map->vma->vm_start >= end)
-		return;
-	if (map->vma->vm_end <= start)
-		return;
 	mstart = max(start, map->vma->vm_start);
 	mend   = min(end,   map->vma->vm_end);
 	pr_debug("map %d+%d (%lx %lx), range %lx %lx, mrange %lx %lx\n",
@@ -465,21 +472,40 @@  static void unmap_if_in_range(struct grant_map *map,
 	WARN_ON(err);
 }
 
-static void mn_invl_range_start(struct mmu_notifier *mn,
+static int mn_invl_range_start(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long start, unsigned long end)
+				unsigned long start, unsigned long end,
+				bool blockable)
 {
 	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
 	struct grant_map *map;
+	int ret = 0;
+
+	/* TODO do we really need a mutex here? */
+	if (blockable)
+		mutex_lock(&priv->lock);
+	else if (!mutex_trylock(&priv->lock))
+		return -EAGAIN;
 
-	mutex_lock(&priv->lock);
 	list_for_each_entry(map, &priv->maps, next) {
+		if (in_range(map, start, end)) {
+			ret = -EAGAIN;
+			goto out_unlock;
+		}
 		unmap_if_in_range(map, start, end);
 	}
 	list_for_each_entry(map, &priv->freeable_maps, next) {
+		if (in_range(map, start, end)) {
+			ret = -EAGAIN;
+			goto out_unlock;
+		}
 		unmap_if_in_range(map, start, end);
 	}
+
+out_unlock:
 	mutex_unlock(&priv->lock);
+
+	return ret;
 }
 
 static void mn_release(struct mmu_notifier *mn,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4ee7bc548a83..148935085194 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1275,8 +1275,8 @@  static inline long kvm_arch_vcpu_async_ioctl(struct file *filp,
 }
 #endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */
 
-void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
-		unsigned long start, unsigned long end);
+int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+		unsigned long start, unsigned long end, bool blockable);
 
 #ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
 int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu);
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 392e6af82701..2eb1a2d01759 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -151,13 +151,15 @@  struct mmu_notifier_ops {
 	 * address space but may still be referenced by sptes until
 	 * the last refcount is dropped.
 	 *
-	 * If both of these callbacks cannot block, and invalidate_range
-	 * cannot block, mmu_notifier_ops.flags should have
-	 * MMU_INVALIDATE_DOES_NOT_BLOCK set.
+	 * If blockable argument is set to false then the callback cannot
+	 * sleep and has to return with -EAGAIN. 0 should be returned
+	 * otherwise.
+	 *
 	 */
-	void (*invalidate_range_start)(struct mmu_notifier *mn,
+	int (*invalidate_range_start)(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end);
+				       unsigned long start, unsigned long end,
+				       bool blockable);
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
 				     unsigned long start, unsigned long end);
@@ -229,8 +231,9 @@  extern int __mmu_notifier_test_young(struct mm_struct *mm,
 				     unsigned long address);
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
 				      unsigned long address, pte_t pte);
-extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+extern int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+				  unsigned long start, unsigned long end,
+				  bool blockable);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 				  unsigned long start, unsigned long end,
 				  bool only_end);
@@ -281,7 +284,17 @@  static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 				  unsigned long start, unsigned long end)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, start, end);
+		__mmu_notifier_invalidate_range_start(mm, start, end, true);
+}
+
+static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
+				  unsigned long start, unsigned long end)
+{
+	int ret = 0;
+	if (mm_has_notifiers(mm))
+		ret = __mmu_notifier_invalidate_range_start(mm, start, end, false);
+
+	return ret;
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
@@ -461,6 +474,12 @@  static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 {
 }
 
+static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
+				  unsigned long start, unsigned long end)
+{
+	return 0;
+}
+
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 				  unsigned long start, unsigned long end)
 {
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 6adac113e96d..92f70e4c6252 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -95,7 +95,7 @@  static inline int check_stable_address_space(struct mm_struct *mm)
 	return 0;
 }
 
-void __oom_reap_task_mm(struct mm_struct *mm);
+bool __oom_reap_task_mm(struct mm_struct *mm);
 
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
index 6a17f856f841..381cdf5a9bd1 100644
--- a/include/rdma/ib_umem_odp.h
+++ b/include/rdma/ib_umem_odp.h
@@ -119,7 +119,8 @@  typedef int (*umem_call_back)(struct ib_umem *item, u64 start, u64 end,
  */
 int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
 				  u64 start, u64 end,
-				  umem_call_back cb, void *cookie);
+				  umem_call_back cb,
+				  bool blockable, void *cookie);
 
 /*
  * Find first region intersecting with address range.
diff --git a/mm/hmm.c b/mm/hmm.c
index de7b6bf77201..81fd57bd2634 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -177,16 +177,19 @@  static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	up_write(&hmm->mirrors_sem);
 }
 
-static void hmm_invalidate_range_start(struct mmu_notifier *mn,
+static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
 				       unsigned long start,
-				       unsigned long end)
+				       unsigned long end,
+				       bool blockable)
 {
 	struct hmm *hmm = mm->hmm;
 
 	VM_BUG_ON(!hmm);
 
 	atomic_inc(&hmm->sequence);
+
+	return 0;
 }
 
 static void hmm_invalidate_range_end(struct mmu_notifier *mn,
diff --git a/mm/mmap.c b/mm/mmap.c
index d1eb87ef4b1a..336bee8c4e25 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3074,7 +3074,7 @@  void exit_mmap(struct mm_struct *mm)
 		 * reliably test it.
 		 */
 		mutex_lock(&oom_lock);
-		__oom_reap_task_mm(mm);
+		(void)__oom_reap_task_mm(mm);
 		mutex_unlock(&oom_lock);
 
 		set_bit(MMF_OOM_SKIP, &mm->flags);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index eff6b88a993f..103b2b450043 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -174,18 +174,29 @@  void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
 	srcu_read_unlock(&srcu, id);
 }
 
-void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+				  unsigned long start, unsigned long end,
+				  bool blockable)
 {
 	struct mmu_notifier *mn;
+	int ret = 0;
 	int id;
 
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
-		if (mn->ops->invalidate_range_start)
-			mn->ops->invalidate_range_start(mn, mm, start, end);
+		if (mn->ops->invalidate_range_start) {
+			int _ret = mn->ops->invalidate_range_start(mn, mm, start, end, blockable);
+			if (_ret) {
+				pr_info("%pS callback failed with %d in %sblockable context.\n",
+						mn->ops->invalidate_range_start, _ret,
+						!blockable ? "non-": "");
+				ret = _ret;
+			}
+		}
 	}
 	srcu_read_unlock(&srcu, id);
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 84081e77bc51..5a936cf24d79 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -479,9 +479,10 @@  static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-void __oom_reap_task_mm(struct mm_struct *mm)
+bool __oom_reap_task_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
+	bool ret = true;
 
 	/*
 	 * Tell all users of get_user/copy_from_user etc... that the content
@@ -511,12 +512,17 @@  void __oom_reap_task_mm(struct mm_struct *mm)
 			struct mmu_gather tlb;
 
 			tlb_gather_mmu(&tlb, mm, start, end);
-			mmu_notifier_invalidate_range_start(mm, start, end);
+			if (mmu_notifier_invalidate_range_start_nonblock(mm, start, end)) {
+				ret = false;
+				continue;
+			}
 			unmap_page_range(&tlb, vma, start, end, NULL);
 			mmu_notifier_invalidate_range_end(mm, start, end);
 			tlb_finish_mmu(&tlb, start, end);
 		}
 	}
+
+	return ret;
 }
 
 static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
@@ -545,18 +551,6 @@  static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 		goto unlock_oom;
 	}
 
-	/*
-	 * If the mm has invalidate_{start,end}() notifiers that could block,
-	 * sleep to give the oom victim some more time.
-	 * TODO: we really want to get rid of this ugly hack and make sure that
-	 * notifiers cannot block for unbounded amount of time
-	 */
-	if (mm_has_blockable_invalidate_notifiers(mm)) {
-		up_read(&mm->mmap_sem);
-		schedule_timeout_idle(HZ);
-		goto unlock_oom;
-	}
-
 	/*
 	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
 	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
@@ -571,7 +565,12 @@  static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 
 	trace_start_task_reaping(tsk->pid);
 
-	__oom_reap_task_mm(mm);
+	/* failed to reap part of the address space. Try again later */
+	if (!__oom_reap_task_mm(mm)) {
+		up_read(&mm->mmap_sem);
+		ret = false;
+		goto unlock_oom;
+	}
 
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 			task_pid_nr(tsk), tsk->comm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ada21f47f22b..16ce38f178d1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -135,9 +135,10 @@  static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
 static unsigned long long kvm_createvm_count;
 static unsigned long long kvm_active_vms;
 
-__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
-		unsigned long start, unsigned long end)
+__weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+		unsigned long start, unsigned long end, bool blockable)
 {
+	return 0;
 }
 
 bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
@@ -354,13 +355,15 @@  static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	srcu_read_unlock(&kvm->srcu, idx);
 }
 
-static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
+static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
-						    unsigned long end)
+						    unsigned long end,
+						    bool blockable)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush = 0, idx;
+	int ret;
 
 	idx = srcu_read_lock(&kvm->srcu);
 	spin_lock(&kvm->mmu_lock);
@@ -378,9 +381,11 @@  static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 
 	spin_unlock(&kvm->mmu_lock);
 
-	kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
+	ret = kvm_arch_mmu_notifier_invalidate_range(kvm, start, end, blockable);
 
 	srcu_read_unlock(&kvm->srcu, idx);
+
+	return ret;
 }
 
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,