diff mbox series

[v2,hmm,11/11] mm/hmm: Remove confusing comment and logic from hmm_release

Message ID 20190606184438.31646-12-jgg@ziepe.ca (mailing list archive)
State New, archived
Headers show
Series Various revisions from a locking/code review | expand

Commit Message

Jason Gunthorpe June 6, 2019, 6:44 p.m. UTC
From: Jason Gunthorpe <jgg@mellanox.com>

hmm_release() is called exactly once per hmm. ops->release() cannot
accidentally trigger any action that would recurse back onto
hmm->mirrors_sem.

This fixes a use after-free race of the form:

       CPU0                                   CPU1
                                           hmm_release()
                                             up_write(&hmm->mirrors_sem);
 hmm_mirror_unregister(mirror)
  down_write(&hmm->mirrors_sem);
  up_write(&hmm->mirrors_sem);
  kfree(mirror)
                                             mirror->ops->release(mirror)

The only user we have today for ops->release is an empty function, so this
is unambiguously safe.

As a consequence of plugging this race drivers are not allowed to
register/unregister mirrors from within a release op.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 mm/hmm.c | 28 +++++++++-------------------
 1 file changed, 9 insertions(+), 19 deletions(-)

Comments

John Hubbard June 7, 2019, 3:47 a.m. UTC | #1
On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> hmm_release() is called exactly once per hmm. ops->release() cannot
> accidentally trigger any action that would recurse back onto
> hmm->mirrors_sem.
> 
> This fixes a use after-free race of the form:
> 
>        CPU0                                   CPU1
>                                            hmm_release()
>                                              up_write(&hmm->mirrors_sem);
>  hmm_mirror_unregister(mirror)
>   down_write(&hmm->mirrors_sem);
>   up_write(&hmm->mirrors_sem);
>   kfree(mirror)
>                                              mirror->ops->release(mirror)
> 
> The only user we have today for ops->release is an empty function, so this
> is unambiguously safe.
> 
> As a consequence of plugging this race drivers are not allowed to
> register/unregister mirrors from within a release op.
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> ---
>  mm/hmm.c | 28 +++++++++-------------------
>  1 file changed, 9 insertions(+), 19 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 709d138dd49027..3a45dd3d778248 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -136,26 +136,16 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>  	WARN_ON(!list_empty(&hmm->ranges));
>  	mutex_unlock(&hmm->lock);
>  
> -	down_write(&hmm->mirrors_sem);
> -	mirror = list_first_entry_or_null(&hmm->mirrors, struct hmm_mirror,
> -					  list);
> -	while (mirror) {
> -		list_del_init(&mirror->list);
> -		if (mirror->ops->release) {
> -			/*
> -			 * Drop mirrors_sem so the release callback can wait
> -			 * on any pending work that might itself trigger a
> -			 * mmu_notifier callback and thus would deadlock with
> -			 * us.
> -			 */
> -			up_write(&hmm->mirrors_sem);
> +	down_read(&hmm->mirrors_sem);

This is cleaner and simpler, but I suspect it is leading to the deadlock
that Ralph Campbell is seeing in his driver testing. (And in general, holding
a lock during a driver callback usually leads to deadlocks.)

Ralph, is this the one? It's the only place in this patchset where I can
see a lock around a callback to driver code, that wasn't there before. So
I'm pretty sure it is the one...


thanks,
Jason Gunthorpe June 7, 2019, 12:58 p.m. UTC | #2
On Thu, Jun 06, 2019 at 08:47:28PM -0700, John Hubbard wrote:
> On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> > 
> > hmm_release() is called exactly once per hmm. ops->release() cannot
> > accidentally trigger any action that would recurse back onto
> > hmm->mirrors_sem.
> > 
> > This fixes a use after-free race of the form:
> > 
> >        CPU0                                   CPU1
> >                                            hmm_release()
> >                                              up_write(&hmm->mirrors_sem);
> >  hmm_mirror_unregister(mirror)
> >   down_write(&hmm->mirrors_sem);
> >   up_write(&hmm->mirrors_sem);
> >   kfree(mirror)
> >                                              mirror->ops->release(mirror)
> > 
> > The only user we have today for ops->release is an empty function, so this
> > is unambiguously safe.
> > 
> > As a consequence of plugging this race drivers are not allowed to
> > register/unregister mirrors from within a release op.
> > 
> > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> >  mm/hmm.c | 28 +++++++++-------------------
> >  1 file changed, 9 insertions(+), 19 deletions(-)
> > 
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index 709d138dd49027..3a45dd3d778248 100644
> > +++ b/mm/hmm.c
> > @@ -136,26 +136,16 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
> >  	WARN_ON(!list_empty(&hmm->ranges));
> >  	mutex_unlock(&hmm->lock);
> >  
> > -	down_write(&hmm->mirrors_sem);
> > -	mirror = list_first_entry_or_null(&hmm->mirrors, struct hmm_mirror,
> > -					  list);
> > -	while (mirror) {
> > -		list_del_init(&mirror->list);
> > -		if (mirror->ops->release) {
> > -			/*
> > -			 * Drop mirrors_sem so the release callback can wait
> > -			 * on any pending work that might itself trigger a
> > -			 * mmu_notifier callback and thus would deadlock with
> > -			 * us.
> > -			 */
> > -			up_write(&hmm->mirrors_sem);
> > +	down_read(&hmm->mirrors_sem);
> 
> This is cleaner and simpler, but I suspect it is leading to the deadlock
> that Ralph Campbell is seeing in his driver testing. (And in general, holding
> a lock during a driver callback usually leads to deadlocks.)

I think Ralph has never seen this patch (it is new), so it must be one
of the earlier patches..

> Ralph, is this the one? It's the only place in this patchset where I can
> see a lock around a callback to driver code, that wasn't there before. So
> I'm pretty sure it is the one...

Can you share the lockdep report please?

Thanks,
Jason
Ralph Campbell June 7, 2019, 9:37 p.m. UTC | #3
On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> hmm_release() is called exactly once per hmm. ops->release() cannot
> accidentally trigger any action that would recurse back onto
> hmm->mirrors_sem.
> 
> This fixes a use after-free race of the form:
> 
>         CPU0                                   CPU1
>                                             hmm_release()
>                                               up_write(&hmm->mirrors_sem);
>   hmm_mirror_unregister(mirror)
>    down_write(&hmm->mirrors_sem);
>    up_write(&hmm->mirrors_sem);
>    kfree(mirror)
>                                               mirror->ops->release(mirror)
> 
> The only user we have today for ops->release is an empty function, so this
> is unambiguously safe.
> 
> As a consequence of plugging this race drivers are not allowed to
> register/unregister mirrors from within a release op.
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

I agree with the analysis above but I'm not sure that release() will
always be an empty function. It might be more efficient to write back
all data migrated to a device "in one pass" instead of relying
on unmap_vmas() calling hmm_start_range_invalidate() per VMA.

I think the bigger issue is potential deadlocks while calling
sync_cpu_device_pagetables() and tasks calling hmm_mirror_unregister():

Say you have three threads:
- Thread A is in try_to_unmap(), either without holding mmap_sem or with
mmap_sem held for read.
- Thread B has some unrelated driver calling hmm_mirror_unregister().
This doesn't require mmap_sem.
- Thread C is about to call migrate_vma().

Thread A                Thread B                 Thread C
try_to_unmap            hmm_mirror_unregister    migrate_vma
----------------------  -----------------------  ----------------------
hmm_invalidate_range_start
down_read(mirrors_sem)
                         down_write(mirrors_sem)
                         // Blocked on A
                                                   device_lock
device_lock
// Blocked on C
                                                   migrate_vma()
                                                   hmm_invalidate_range_s
                                                   down_read(mirrors_sem)
                                                   // Blocked on B
                                                   // Deadlock

Perhaps we should consider using SRCU for walking the mirror->list?

> ---
>   mm/hmm.c | 28 +++++++++-------------------
>   1 file changed, 9 insertions(+), 19 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 709d138dd49027..3a45dd3d778248 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -136,26 +136,16 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>   	WARN_ON(!list_empty(&hmm->ranges));
>   	mutex_unlock(&hmm->lock);
>   
> -	down_write(&hmm->mirrors_sem);
> -	mirror = list_first_entry_or_null(&hmm->mirrors, struct hmm_mirror,
> -					  list);
> -	while (mirror) {
> -		list_del_init(&mirror->list);
> -		if (mirror->ops->release) {
> -			/*
> -			 * Drop mirrors_sem so the release callback can wait
> -			 * on any pending work that might itself trigger a
> -			 * mmu_notifier callback and thus would deadlock with
> -			 * us.
> -			 */
> -			up_write(&hmm->mirrors_sem);
> +	down_read(&hmm->mirrors_sem);
> +	list_for_each_entry(mirror, &hmm->mirrors, list) {
> +		/*
> +		 * Note: The driver is not allowed to trigger
> +		 * hmm_mirror_unregister() from this thread.
> +		 */
> +		if (mirror->ops->release)
>   			mirror->ops->release(mirror);
> -			down_write(&hmm->mirrors_sem);
> -		}
> -		mirror = list_first_entry_or_null(&hmm->mirrors,
> -						  struct hmm_mirror, list);
>   	}
> -	up_write(&hmm->mirrors_sem);
> +	up_read(&hmm->mirrors_sem);
>   
>   	hmm_put(hmm);
>   }
> @@ -287,7 +277,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
>   	struct hmm *hmm = mirror->hmm;
>   
>   	down_write(&hmm->mirrors_sem);
> -	list_del_init(&mirror->list);
> +	list_del(&mirror->list);
>   	up_write(&hmm->mirrors_sem);
>   	hmm_put(hmm);
>   	memset(&mirror->hmm, POISON_INUSE, sizeof(mirror->hmm));
>
Jason Gunthorpe June 8, 2019, 2:12 a.m. UTC | #4
On Fri, Jun 07, 2019 at 02:37:07PM -0700, Ralph Campbell wrote:
> 
> On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> > 
> > hmm_release() is called exactly once per hmm. ops->release() cannot
> > accidentally trigger any action that would recurse back onto
> > hmm->mirrors_sem.
> > 
> > This fixes a use after-free race of the form:
> > 
> >         CPU0                                   CPU1
> >                                             hmm_release()
> >                                               up_write(&hmm->mirrors_sem);
> >   hmm_mirror_unregister(mirror)
> >    down_write(&hmm->mirrors_sem);
> >    up_write(&hmm->mirrors_sem);
> >    kfree(mirror)
> >                                               mirror->ops->release(mirror)
> > 
> > The only user we have today for ops->release is an empty function, so this
> > is unambiguously safe.
> > 
> > As a consequence of plugging this race drivers are not allowed to
> > register/unregister mirrors from within a release op.
> > 
> > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> 
> I agree with the analysis above but I'm not sure that release() will
> always be an empty function. It might be more efficient to write back
> all data migrated to a device "in one pass" instead of relying
> on unmap_vmas() calling hmm_start_range_invalidate() per VMA.

Sure, but it should not be allowed to recurse back to
hmm_mirror_unregister.

> I think the bigger issue is potential deadlocks while calling
> sync_cpu_device_pagetables() and tasks calling hmm_mirror_unregister():
>
> Say you have three threads:
> - Thread A is in try_to_unmap(), either without holding mmap_sem or with
> mmap_sem held for read.
> - Thread B has some unrelated driver calling hmm_mirror_unregister().
> This doesn't require mmap_sem.
> - Thread C is about to call migrate_vma().
>
> Thread A                Thread B                 Thread C
> try_to_unmap            hmm_mirror_unregister    migrate_vma
> hmm_invalidate_range_start
> down_read(mirrors_sem)
>                         down_write(mirrors_sem)
>                         // Blocked on A
>                                                   device_lock
> device_lock
> // Blocked on C
>                                                   migrate_vma()
>                                                   hmm_invalidate_range_s
>                                                   down_read(mirrors_sem)
>                                                   // Blocked on B
>                                                   // Deadlock

Oh... you know I didn't know this about rwsems in linux that they have
a fairness policy for writes to block future reads..

Still, at least as things are designed, the driver cannot hold a lock
it obtains under sync_cpu_device_pagetables() and nest other things in
that lock. It certainly can't recurse back into any mmu notifiers
while holding that lock. (as you point out)

The lock in sync_cpu_device_pagetables() needs to be very narrowly
focused on updating device state only.

So, my first reaction is that the driver in thread C is wrong, and
needs a different locking scheme. I think you'd have to make a really
good case that there is no alternative for a driver..

> Perhaps we should consider using SRCU for walking the mirror->list?

It means the driver has to deal with races like in this patch
description. At that point there is almost no reason to insert hmm
here, just use mmu notifiers directly.

Drivers won't get this right, it is too hard.

Jason
Jason Gunthorpe June 10, 2019, 4:02 p.m. UTC | #5
On Fri, Jun 07, 2019 at 02:37:07PM -0700, Ralph Campbell wrote:
> 
> On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> > 
> > hmm_release() is called exactly once per hmm. ops->release() cannot
> > accidentally trigger any action that would recurse back onto
> > hmm->mirrors_sem.
> > 
> > This fixes a use after-free race of the form:
> > 
> >         CPU0                                   CPU1
> >                                             hmm_release()
> >                                               up_write(&hmm->mirrors_sem);
> >   hmm_mirror_unregister(mirror)
> >    down_write(&hmm->mirrors_sem);
> >    up_write(&hmm->mirrors_sem);
> >    kfree(mirror)
> >                                               mirror->ops->release(mirror)
> > 
> > The only user we have today for ops->release is an empty function, so this
> > is unambiguously safe.
> > 
> > As a consequence of plugging this race drivers are not allowed to
> > register/unregister mirrors from within a release op.
> > 
> > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> 
> I agree with the analysis above but I'm not sure that release() will
> always be an empty function. It might be more efficient to write back
> all data migrated to a device "in one pass" instead of relying
> on unmap_vmas() calling hmm_start_range_invalidate() per VMA.

I think we have to focus on the *current* kernel - and we have two
users of release, nouveau_svm.c is empty and amdgpu_mn.c does
schedule_work() - so I believe we should go ahead with this simple
solution to the actual race today that both of those will suffer from.

If we find a need for a more complex version then it can be debated
and justified with proper context...

Ok?

Jason
Ralph Campbell June 10, 2019, 10:03 p.m. UTC | #6
On 6/10/19 9:02 AM, Jason Gunthorpe wrote:
> On Fri, Jun 07, 2019 at 02:37:07PM -0700, Ralph Campbell wrote:
>>
>> On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
>>> From: Jason Gunthorpe <jgg@mellanox.com>
>>>
>>> hmm_release() is called exactly once per hmm. ops->release() cannot
>>> accidentally trigger any action that would recurse back onto
>>> hmm->mirrors_sem.
>>>
>>> This fixes a use after-free race of the form:
>>>
>>>          CPU0                                   CPU1
>>>                                              hmm_release()
>>>                                                up_write(&hmm->mirrors_sem);
>>>    hmm_mirror_unregister(mirror)
>>>     down_write(&hmm->mirrors_sem);
>>>     up_write(&hmm->mirrors_sem);
>>>     kfree(mirror)
>>>                                                mirror->ops->release(mirror)
>>>
>>> The only user we have today for ops->release is an empty function, so this
>>> is unambiguously safe.
>>>
>>> As a consequence of plugging this race drivers are not allowed to
>>> register/unregister mirrors from within a release op.
>>>
>>> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
>>
>> I agree with the analysis above but I'm not sure that release() will
>> always be an empty function. It might be more efficient to write back
>> all data migrated to a device "in one pass" instead of relying
>> on unmap_vmas() calling hmm_start_range_invalidate() per VMA.
> 
> I think we have to focus on the *current* kernel - and we have two
> users of release, nouveau_svm.c is empty and amdgpu_mn.c does
> schedule_work() - so I believe we should go ahead with this simple
> solution to the actual race today that both of those will suffer from.
> 
> If we find a need for a more complex version then it can be debated
> and justified with proper context...
> 
> Ok?
> 
> Jason

OK.
I guess we have enough on the plate already :-)
diff mbox series

Patch

diff --git a/mm/hmm.c b/mm/hmm.c
index 709d138dd49027..3a45dd3d778248 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -136,26 +136,16 @@  static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	WARN_ON(!list_empty(&hmm->ranges));
 	mutex_unlock(&hmm->lock);
 
-	down_write(&hmm->mirrors_sem);
-	mirror = list_first_entry_or_null(&hmm->mirrors, struct hmm_mirror,
-					  list);
-	while (mirror) {
-		list_del_init(&mirror->list);
-		if (mirror->ops->release) {
-			/*
-			 * Drop mirrors_sem so the release callback can wait
-			 * on any pending work that might itself trigger a
-			 * mmu_notifier callback and thus would deadlock with
-			 * us.
-			 */
-			up_write(&hmm->mirrors_sem);
+	down_read(&hmm->mirrors_sem);
+	list_for_each_entry(mirror, &hmm->mirrors, list) {
+		/*
+		 * Note: The driver is not allowed to trigger
+		 * hmm_mirror_unregister() from this thread.
+		 */
+		if (mirror->ops->release)
 			mirror->ops->release(mirror);
-			down_write(&hmm->mirrors_sem);
-		}
-		mirror = list_first_entry_or_null(&hmm->mirrors,
-						  struct hmm_mirror, list);
 	}
-	up_write(&hmm->mirrors_sem);
+	up_read(&hmm->mirrors_sem);
 
 	hmm_put(hmm);
 }
@@ -287,7 +277,7 @@  void hmm_mirror_unregister(struct hmm_mirror *mirror)
 	struct hmm *hmm = mirror->hmm;
 
 	down_write(&hmm->mirrors_sem);
-	list_del_init(&mirror->list);
+	list_del(&mirror->list);
 	up_write(&hmm->mirrors_sem);
 	hmm_put(hmm);
 	memset(&mirror->hmm, POISON_INUSE, sizeof(mirror->hmm));