[5/5] mm/hmm: Fix mm stale reference use in hmm_free()

Message ID 20190506233514.12795-1-rcampbell@nvidia.com
State New
Series
  • mm/hmm: HMM documentation updates and code fixes

Commit Message

Ralph Campbell May 6, 2019, 11:35 p.m. UTC
From: Ralph Campbell <rcampbell@nvidia.com>

The last reference to struct hmm may be released long after the mm_struct
is destroyed because the struct hmm_mirror memory may be embedded in a
device driver's open-file private data. The file descriptor close usually
happens after the mm_struct is destroyed in do_exit(). This is a good
reason for making struct hmm a reference-counted (kref) object [1] since
its lifetime spans the lifetimes of both mm_struct and struct hmm_mirror.

The fix is to not use hmm->mm in hmm_free() and to clear the mm->hmm and
hmm->mm pointers in hmm_mm_destroy() when the mm_struct is destroyed. By
clearing the pointers at the very last moment, it eliminates the need for
additional locking since the mmu notifier code already handles quiescing
notifier callbacks and unregistering the hmm notifiers. Also, by making
mm_struct hold a reference to struct hmm, there is no need to check for a
zero hmm reference count in mm_get_hmm().

[1] https://marc.info/?l=linux-mm&m=155432001406049&w=2
    ("mm/hmm: use reference counting for HMM struct v3")

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/hmm.h |  10 +----
 mm/hmm.c            | 100 ++++++++++++++++----------------------------
 2 files changed, 37 insertions(+), 73 deletions(-)

Comments

Jason Gunthorpe May 22, 2019, 11:36 p.m. UTC | #1
On Mon, May 06, 2019 at 04:35:14PM -0700, rcampbell@nvidia.com wrote:
> From: Ralph Campbell <rcampbell@nvidia.com>
> 
> The last reference to struct hmm may be released long after the mm_struct
> is destroyed because the struct hmm_mirror memory may be part of a
> device driver open file private data pointer. The file descriptor close
> is usually after the mm_struct is destroyed in do_exit(). This is a good
> reason for making struct hmm a kref_t object [1] since its lifetime spans
> the life time of mm_struct and struct hmm_mirror.

> The fix is to not use hmm->mm in hmm_free() and to clear mm->hmm and
> hmm->mm pointers in hmm_destroy() when the mm_struct is
> destroyed.

I think the right way to fix this is to have the struct hmm hold a
mmgrab() on the mm so its memory cannot go away until all of the hmm
users release the struct hmm, hmm_ranges/etc

Then we can properly use mmget_not_zero() instead of the racy/abnormal
'if (hmm->mm == NULL || hmm->dead)' pattern (see the other
thread). Actually looking at this, all these tests look very
questionable. If we hold the mmget() for the duration of the range
object, as Jerome suggested, then they all get deleted.

That just leaves mmu_notifier_unregister_no_release() as the remaining
user of hmm->mm (everyone else is trying to do range->mm) - and it
looks like it currently tries to call
mmu_notifier_unregister_no_release on a NULL hmm->mm and crashes :(

Holding the mmgrab fixes this as we can safely call
mmu_notifier_unregister_no_release() post exit_mmap on a grab'd mm.

Also we can delete the hmm_mm_destroy() intrusion into fork.c as it
can't be called when the mmgrab is active.

This is the basic pattern we used in ODP when working with mmu
notifiers, I don't know why hmm would need to be different.
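
As a user-space sketch of that lifetime rule (the names mirror the kernel
APIs, but the bodies are toy stand-ins, not kernel code): struct hmm takes
an mmgrab-style count on the mm at creation, and only the final hmm_put()
drops it, so the mm's memory outlives every hmm user.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model of the proposed rule: hmm pins the mm_struct's memory with an
 * mmgrab-style count, released only when the last hmm reference is put. */
struct mm_struct { int mm_count; int freed; };
struct hmm { int kref; struct mm_struct *mm; };

static void mmgrab(struct mm_struct *mm) { mm->mm_count++; }

static void mmdrop(struct mm_struct *mm)
{
	if (--mm->mm_count == 0)
		mm->freed = 1;	/* stand-in for __mmdrop() freeing the mm */
}

static struct hmm *hmm_get_or_create(struct mm_struct *mm)
{
	struct hmm *hmm = malloc(sizeof(*hmm));

	hmm->kref = 1;
	hmm->mm = mm;
	mmgrab(mm);		/* pin the mm_struct's memory */
	return hmm;
}

static void hmm_put(struct hmm *hmm)
{
	if (--hmm->kref == 0) {
		mmdrop(hmm->mm);	/* mm memory is still valid here */
		free(hmm);
	}
}
```

With this shape, hmm_free() may safely dereference hmm->mm even after
do_exit(), because the grab keeps the mm_struct's memory around.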

> index 2aa75dbed04a..4e42c282d334 100644
> +++ b/mm/hmm.c
> @@ -43,8 +43,10 @@ static inline struct hmm *mm_get_hmm(struct mm_struct *mm)
>  {
>  	struct hmm *hmm = READ_ONCE(mm->hmm);
>  
> -	if (hmm && kref_get_unless_zero(&hmm->kref))
> +	if (hmm && !hmm->dead) {
> +		kref_get(&hmm->kref);
>  		return hmm;
> +	}

hmm->dead and mm->hmm are not being read under lock, so this went from
something almost thread safe to something racy :(

> @@ -53,25 +55,28 @@ static inline struct hmm *mm_get_hmm(struct mm_struct *mm)
>   * hmm_get_or_create - register HMM against an mm (HMM internal)
>   *
>   * @mm: mm struct to attach to
> - * Returns: returns an HMM object, either by referencing the existing
> - *          (per-process) object, or by creating a new one.
> + * Return: an HMM object reference, either by referencing the existing
> + *         (per-process) object, or by creating a new one.
>   *
> - * This is not intended to be used directly by device drivers. If mm already
> - * has an HMM struct then it get a reference on it and returns it. Otherwise
> - * it allocates an HMM struct, initializes it, associate it with the mm and
> - * returns it.
> + * If the mm already has an HMM struct then return a new reference to it.
> + * Otherwise, allocate an HMM struct, initialize it, associate it with the mm,
> + * and return a new reference to it. If the return value is not NULL,
> + * the caller is responsible for calling hmm_put().
>   */
>  static struct hmm *hmm_get_or_create(struct mm_struct *mm)
>  {
> -	struct hmm *hmm = mm_get_hmm(mm);
> -	bool cleanup = false;
> +	struct hmm *hmm = mm->hmm;
>  
> -	if (hmm)
> -		return hmm;
> +	if (hmm) {
> +		if (hmm->dead)
> +			goto error;

Create shouldn't fail just because it is racing with something doing
destroy

The flow should be something like:

spin_lock(&mm->page_table_lock); /* or write side mmap_sem if you prefer */
hmm = mm->hmm;
if (hmm) {
	if (kref_get_unless_zero(&hmm->kref)) {
		spin_unlock(&mm->page_table_lock);
		return hmm;
	}
	mm->hmm = NULL;
}


> +		goto out;
> +	}
>  
>  	hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
>  	if (!hmm)
> -		return NULL;
> +		goto error;
> +
>  	init_waitqueue_head(&hmm->wq);
>  	INIT_LIST_HEAD(&hmm->mirrors);
>  	init_rwsem(&hmm->mirrors_sem);
> @@ -83,47 +88,32 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
>  	hmm->dead = false;
>  	hmm->mm = mm;
>  
> -	spin_lock(&mm->page_table_lock);
> -	if (!mm->hmm)
> -		mm->hmm = hmm;
> -	else
> -		cleanup = true;
> -	spin_unlock(&mm->page_table_lock);

BTW, Jerome this needs fixing too, it shouldn't fail the function just
because it lost the race.

More like

spin_lock(&mm->page_table_lock);
old = mm->hmm;
if (old && kref_get_unless_zero(&old->kref)) {
	spin_unlock(&mm->page_table_lock);
	kfree(hmm);
	return old;
}
mm->hmm = hmm;

> -	if (cleanup)
> -		goto error;
> -
>  	/*
> -	 * We should only get here if hold the mmap_sem in write mode ie on
> -	 * registration of first mirror through hmm_mirror_register()
> +	 * The mmap_sem should be held for write so no additional locking

Please let us have proper lockdep assertions for this kind of stuff.
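
For instance, something like this (a user-space stand-in; in the kernel the
real assertion would be lockdep_assert_held_exclusive(&mm->mmap_sem) or a
similar lockdep helper, depending on the kernel version):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for lockdep: record whether the mmap_sem write side is held
 * and assert it where a comment used to merely claim it. */
struct rw_semaphore { bool write_held; };
struct mm_struct { struct rw_semaphore mmap_sem; void *hmm; };

static void lockdep_assert_held_write(struct rw_semaphore *sem)
{
	assert(sem->write_held);
}

static void set_mm_hmm(struct mm_struct *mm, void *hmm)
{
	/* document the locking rule as an assertion, not a comment */
	lockdep_assert_held_write(&mm->mmap_sem);
	mm->hmm = hmm;
}
```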

> +	 * is needed. Note that struct_mm holds a reference to hmm.
> +	 * It is cleared in hmm_release().
>  	 */
> +	mm->hmm = hmm;

Actually using the write side of the mmap_sem seems sort of sane if it is
assured the write side is always held for this call.


Hmm, there is a race with hmm_mm_destroy() touching mm->hmm, which does
not hold the write lock.

> +
>  	hmm->mmu_notifier.ops = &hmm_mmu_notifier_ops;
>  	if (__mmu_notifier_register(&hmm->mmu_notifier, mm))
>  		goto error_mm;

And the error unwind here is problematic as it should do
kref_put. Actually after my patch to use container_of this
mmu_notifier_register should go before the mm->hmm = hmm to avoid
having to do the sketchy error unwind at all.

> +out:
> +	/* Return a separate hmm reference for the caller. */
> +	kref_get(&hmm->kref);
>  	return hmm;
>  
>  error_mm:
> -	spin_lock(&mm->page_table_lock);
> -	if (mm->hmm == hmm)
> -		mm->hmm = NULL;
> -	spin_unlock(&mm->page_table_lock);
> -error:
> +	mm->hmm = NULL;
>  	kfree(hmm);
> +error:
>  	return NULL;
>  }
>  
>  static void hmm_free(struct kref *kref)
>  {
>  	struct hmm *hmm = container_of(kref, struct hmm, kref);
> -	struct mm_struct *mm = hmm->mm;
> -
> -	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, mm);

Where did the unregister go?

> -
> -	spin_lock(&mm->page_table_lock);
> -	if (mm->hmm == hmm)
> -		mm->hmm = NULL;
> -	spin_unlock(&mm->page_table_lock);

Well, we still need to NULL mm->hmm if the hmm was put before the mm
is destroyed.

>  	kfree(hmm);
>  }
> @@ -135,25 +125,18 @@ static inline void hmm_put(struct hmm *hmm)
>  
>  void hmm_mm_destroy(struct mm_struct *mm)
>  {
> -	struct hmm *hmm;
> +	struct hmm *hmm = mm->hmm;
>  
> -	spin_lock(&mm->page_table_lock);
> -	hmm = mm_get_hmm(mm);
> -	mm->hmm = NULL;
>  	if (hmm) {
> +		mm->hmm = NULL;

At this point the kref on mm is 0, so any other thread reading mm->hmm
has a use-after-free bug. Not much point in doing this assignment; it
is just confusing.

>  		hmm->mm = NULL;
> -		hmm->dead = true;
> -		spin_unlock(&mm->page_table_lock);
>  		hmm_put(hmm);
> -		return;
>  	}
> -
> -	spin_unlock(&mm->page_table_lock);
>  }
>  
>  static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>  {
> -	struct hmm *hmm = mm_get_hmm(mm);
> +	struct hmm *hmm = mm->hmm;

container_of is much safer/better
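
i.e. something like the following (a self-contained sketch of the pointer
math; the kernel's container_of() additionally type-checks via typeof):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal container_of(): recover the enclosing structure from a pointer
 * to one of its members using offsetof(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct mmu_notifier { const void *ops; };

struct hmm {
	int kref;
	struct mmu_notifier mmu_notifier;	/* embedded in struct hmm */
};

/* hmm_release() is handed the notifier pointer by the mmu notifier core,
 * so it can derive the hmm directly, without trusting mm->hmm at all. */
static struct hmm *notifier_to_hmm(struct mmu_notifier *mn)
{
	return container_of(mn, struct hmm, mmu_notifier);
}
```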

> @@ -931,20 +909,14 @@ int hmm_range_register(struct hmm_range *range,
>  		return -EINVAL;
>  	if (start >= end)
>  		return -EINVAL;
> +	hmm = mm_get_hmm(mm);
> +	if (!hmm)
> +		return -EFAULT;
>  
>  	range->page_shift = page_shift;
>  	range->start = start;
>  	range->end = end;
> -
> -	range->hmm = mm_get_hmm(mm);
> -	if (!range->hmm)
> -		return -EFAULT;
> -
> -	/* Check if hmm_mm_destroy() was call. */
> -	if (range->hmm->mm == NULL || range->hmm->dead) {

This comment looks bogus too, we can't race with hmm_mm_destroy as the
caller MUST have a mmgrab or mmget on the mm already to call this API -
i.e. it can't be destroyed.

As discussed in the other thread this should probably be
mmget_not_zero.
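
A user-space model of what mmget_not_zero() guards against (the kernel
version works on mm->mm_users via atomic_inc_not_zero(); this only shows
the shape of the pattern):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Model of mmget_not_zero(): take a reference on mm_users only if it has
 * not already dropped to zero (i.e. exit_mmap() has not torn the address
 * space down) - the same inc-not-zero idea as kref_get_unless_zero(). */
struct mm_struct { atomic_int mm_users; };

static bool mmget_not_zero(struct mm_struct *mm)
{
	int users = atomic_load(&mm->mm_users);

	while (users != 0) {
		if (atomic_compare_exchange_weak(&mm->mm_users, &users,
						 users + 1))
			return true;
	}
	return false;
}
```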

Jason
Ralph Campbell May 23, 2019, 12:54 a.m. UTC | #2
On 5/22/19 4:36 PM, Jason Gunthorpe wrote:
> [snip]

I think you missed the main points which are:

1) mm->hmm holds a reference to struct hmm so hmm isn't going away until
   __mmdrop() is called. hmm->mm is not a reference to mm, just a
   "backward" pointer. Trying to make struct hmm hold a *reference* to
   mm seems wrong to me.

2) mm->hmm is only set with mm->mmap_sem held for write.
   mm->hmm is only cleared when __mmdrop() is called.
   hmm->mm is only cleared when __mmdrop() is called, so it is long after
   the call to hmm_release().

3) The mmu notifier unregister happens only as part of exit_mmap().

The hmm->dead and hmm->mm == NULL checks are more for sanity checking
since hmm_mirror_register() shouldn't be called without holding mmap_sem.
A VM_WARN or other check makes sense like you said.

Anyway, I'll wait for Jerome to weigh in as to how to proceed.
Jason Gunthorpe May 23, 2019, 1:25 a.m. UTC | #3
On Wed, May 22, 2019 at 05:54:17PM -0700, Ralph Campbell wrote:
> 
> On 5/22/19 4:36 PM, Jason Gunthorpe wrote:
> > [snip]
> 
> I think you missed the main points which are:

Well, I covered a lot of points in the above, I only suggested the
reverse refcount in the initial block. All the other notes basically
still apply no matter which way the refcount goes.
 
> 1) mm->hmm holds a reference to struct hmm so hmm isn't going away until
>    __mmdrop() is called. hmm->mm is not a reference to mm,
>   just a "backward" pointer.

But this wasn't done completely. If the hmm is created once and lives
until exit_mmap then there is no need for any of the refcounting,
hmm->dead, etc.

The refcounting is all moved to the mm via mmgrab/mmget and all the
places that were doing refcount/dead, etc need to switch to those mm
APIs instead.

In many senses this would be simpler, as we don't have the weird mess
of a hmm coming and going concurrently on the same mm, but I think the
patch needs to do a complete change over (delete the kref, delete
dead, delete the 'null mm/hmm' idea, and document what mmget/grab
every caller must be holding) before we can see what it would look
like.

This design would also solve the srcu race I pointed out.

>   Trying to make struct hmm hold a *reference* to mm seems wrong to me.

Well, this is a fairly standard direction, the caller of
hmm_mirror_register()/hmm_mirror_unregister() should reasonably expect
that hmm or the mm will not going away during the registered period.

What this patch is trying to do is to say that the caller must hold a
mmgrab for the duration of registration, for the sole benefit of
HMM. This seems like it is breaking the encapsulation of the API.

Arguably I would like to remove the mmgrab detail from the ODP code
and just establish a hmm_mirror as enough to guarantee the mm, hmm,
etc is valid memory until the ODP is cleaned up.

> 2) mm->hmm is only set with mm->mmap_sem held for write.
>    mm->hmm is only cleared when __mmdrop() is called.
>    hmm->mm is only cleared when __mmdrop() is called so it is long after the
> call to hmm_release().

Then we don't need any refcounting on struct hmm, all the refcounting
falls to mmgrab/mmget instead..

> 3) The mmu notifier unregister happens only as part of exit_mmap().

> The hmm->dead and hmm->mm == NULL checks are more for sanity
> checking

It is all use-after-free - and it is just confusing what the
lifetime is supposed to be; it should all be deleted.

In this approach all the tests for !mm->hmm should either lockdep
assert that the mmap_sem write side is held or use mmget_not_zero &&
VM_BUG_ON instead (having previously established somehow that the hmm
was created already).

The test !hmm->mm should never exist. If the hmm object is valid
memory then the mm object *MUST* also be valid memory.

> since hmm_mirror_register() shouldn't be called without holding mmap_sem.
> A VM_WARN or other check makes sense like you said.

Stuff like this really should be documented as a lockdep assertion..

Jason
John Hubbard May 23, 2019, 6:32 a.m. UTC | #4
On 5/22/19 6:25 PM, Jason Gunthorpe wrote:
> On Wed, May 22, 2019 at 05:54:17PM -0700, Ralph Campbell wrote:
>>
>> On 5/22/19 4:36 PM, Jason Gunthorpe wrote:
>>> On Mon, May 06, 2019 at 04:35:14PM -0700, rcampbell@nvidia.com wrote:
>>>> [snip]
>>>
>>> I think the right way to fix this is to have the struct hmm hold a
>>> mmgrab() on the mm so its memory cannot go away until all of the hmm
>>> users release the struct hmm, hmm_ranges/etc
>>>
>>> Then we can properly use mmget_not_zero() instead of the racy/abnormal
>>> 'if (hmm->mm == NULL || hmm->dead)' pattern (see the other
>>> thread). Actually looking at this, all these tests look very
>>> questionable. If we hold the mmget() for the duration of the range
>>> object, as Jerome suggested, then they all get deleted.
>>>
>>> That just leaves mmu_notifier_unregister_no_release() as the remaining
>>> user of hmm->mm (everyone else is trying to do range->mm) - and it
>>> looks like it currently tries to call
>>> mmu_notifier_unregister_no_release on a NULL hmm->mm and crashes :(
>>>
>>> Holding the mmgrab fixes this as we can safely call
>>> mmu_notifier_unregister_no_release() post exit_mmap on a grab'd mm.
>>>
>>> Also we can delete the hmm_mm_destroy() intrusion into fork.c as it
>>> can't be called when the mmgrab is active.
>>>
>>> This is the basic pattern we used in ODP when working with mmu
>>> notifiers, I don't know why hmm would need to be different.

+1 for the mmgrab() approach. I have never been able to see how these
various checks can protect anything, and refcounting it into place definitely
sounds like the right answer.


thanks,

Patch

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index fa0671d67269..538867c76906 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -488,15 +488,7 @@  void hmm_mirror_unregister(struct hmm_mirror *mirror);
  */
 static inline bool hmm_mirror_mm_is_alive(struct hmm_mirror *mirror)
 {
-	struct mm_struct *mm;
-
-	if (!mirror || !mirror->hmm)
-		return false;
-	mm = READ_ONCE(mirror->hmm->mm);
-	if (mirror->hmm->dead || !mm)
-		return false;
-
-	return true;
+	return mirror && mirror->hmm && !mirror->hmm->dead;
 }
 
 /*
diff --git a/mm/hmm.c b/mm/hmm.c
index 2aa75dbed04a..4e42c282d334 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -43,8 +43,10 @@  static inline struct hmm *mm_get_hmm(struct mm_struct *mm)
 {
 	struct hmm *hmm = READ_ONCE(mm->hmm);
 
-	if (hmm && kref_get_unless_zero(&hmm->kref))
+	if (hmm && !hmm->dead) {
+		kref_get(&hmm->kref);
 		return hmm;
+	}
 
 	return NULL;
 }
@@ -53,25 +55,28 @@  static inline struct hmm *mm_get_hmm(struct mm_struct *mm)
  * hmm_get_or_create - register HMM against an mm (HMM internal)
  *
  * @mm: mm struct to attach to
- * Returns: returns an HMM object, either by referencing the existing
- *          (per-process) object, or by creating a new one.
+ * Return: an HMM object reference, either by referencing the existing
+ *         (per-process) object, or by creating a new one.
  *
- * This is not intended to be used directly by device drivers. If mm already
- * has an HMM struct then it get a reference on it and returns it. Otherwise
- * it allocates an HMM struct, initializes it, associate it with the mm and
- * returns it.
+ * If the mm already has an HMM struct then return a new reference to it.
+ * Otherwise, allocate an HMM struct, initialize it, associate it with the mm,
+ * and return a new reference to it. If the return value is not NULL,
+ * the caller is responsible for calling hmm_put().
  */
 static struct hmm *hmm_get_or_create(struct mm_struct *mm)
 {
-	struct hmm *hmm = mm_get_hmm(mm);
-	bool cleanup = false;
+	struct hmm *hmm = mm->hmm;
 
-	if (hmm)
-		return hmm;
+	if (hmm) {
+		if (hmm->dead)
+			goto error;
+		goto out;
+	}
 
 	hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
 	if (!hmm)
-		return NULL;
+		goto error;
+
 	init_waitqueue_head(&hmm->wq);
 	INIT_LIST_HEAD(&hmm->mirrors);
 	init_rwsem(&hmm->mirrors_sem);
@@ -83,47 +88,32 @@  static struct hmm *hmm_get_or_create(struct mm_struct *mm)
 	hmm->dead = false;
 	hmm->mm = mm;
 
-	spin_lock(&mm->page_table_lock);
-	if (!mm->hmm)
-		mm->hmm = hmm;
-	else
-		cleanup = true;
-	spin_unlock(&mm->page_table_lock);
-
-	if (cleanup)
-		goto error;
-
 	/*
-	 * We should only get here if hold the mmap_sem in write mode ie on
-	 * registration of first mirror through hmm_mirror_register()
+	 * The mmap_sem should be held for write so no additional locking
+	 * is needed. Note that mm_struct holds a reference to hmm.
+	 * It is cleared in hmm_release().
 	 */
+	mm->hmm = hmm;
+
 	hmm->mmu_notifier.ops = &hmm_mmu_notifier_ops;
 	if (__mmu_notifier_register(&hmm->mmu_notifier, mm))
 		goto error_mm;
 
+out:
+	/* Return a separate hmm reference for the caller. */
+	kref_get(&hmm->kref);
 	return hmm;
 
 error_mm:
-	spin_lock(&mm->page_table_lock);
-	if (mm->hmm == hmm)
-		mm->hmm = NULL;
-	spin_unlock(&mm->page_table_lock);
-error:
+	mm->hmm = NULL;
 	kfree(hmm);
+error:
 	return NULL;
 }
 
 static void hmm_free(struct kref *kref)
 {
 	struct hmm *hmm = container_of(kref, struct hmm, kref);
-	struct mm_struct *mm = hmm->mm;
-
-	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, mm);
-
-	spin_lock(&mm->page_table_lock);
-	if (mm->hmm == hmm)
-		mm->hmm = NULL;
-	spin_unlock(&mm->page_table_lock);
 
 	kfree(hmm);
 }
@@ -135,25 +125,18 @@  static inline void hmm_put(struct hmm *hmm)
 
 void hmm_mm_destroy(struct mm_struct *mm)
 {
-	struct hmm *hmm;
+	struct hmm *hmm = mm->hmm;
 
-	spin_lock(&mm->page_table_lock);
-	hmm = mm_get_hmm(mm);
-	mm->hmm = NULL;
 	if (hmm) {
+		mm->hmm = NULL;
 		hmm->mm = NULL;
-		hmm->dead = true;
-		spin_unlock(&mm->page_table_lock);
 		hmm_put(hmm);
-		return;
 	}
-
-	spin_unlock(&mm->page_table_lock);
 }
 
 static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 {
-	struct hmm *hmm = mm_get_hmm(mm);
+	struct hmm *hmm = mm->hmm;
 	struct hmm_mirror *mirror;
 	struct hmm_range *range;
 
@@ -187,14 +170,12 @@  static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 						  struct hmm_mirror, list);
 	}
 	up_write(&hmm->mirrors_sem);
-
-	hmm_put(hmm);
 }
 
 static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 			const struct mmu_notifier_range *nrange)
 {
-	struct hmm *hmm = mm_get_hmm(nrange->mm);
+	struct hmm *hmm = nrange->mm->hmm;
 	struct hmm_mirror *mirror;
 	struct hmm_update update;
 	struct hmm_range *range;
@@ -238,14 +219,13 @@  static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 	up_read(&hmm->mirrors_sem);
 
 out:
-	hmm_put(hmm);
 	return ret;
 }
 
 static void hmm_invalidate_range_end(struct mmu_notifier *mn,
 			const struct mmu_notifier_range *nrange)
 {
-	struct hmm *hmm = mm_get_hmm(nrange->mm);
+	struct hmm *hmm = nrange->mm->hmm;
 
 	VM_BUG_ON(!hmm);
 
@@ -262,8 +242,6 @@  static void hmm_invalidate_range_end(struct mmu_notifier *mn,
 		wake_up_all(&hmm->wq);
 	}
 	mutex_unlock(&hmm->lock);
-
-	hmm_put(hmm);
 }
 
 static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
@@ -931,20 +909,14 @@  int hmm_range_register(struct hmm_range *range,
 		return -EINVAL;
 	if (start >= end)
 		return -EINVAL;
+	hmm = mm_get_hmm(mm);
+	if (!hmm)
+		return -EFAULT;
 
 	range->page_shift = page_shift;
 	range->start = start;
 	range->end = end;
-
-	range->hmm = mm_get_hmm(mm);
-	if (!range->hmm)
-		return -EFAULT;
-
-	/* Check if hmm_mm_destroy() was call. */
-	if (range->hmm->mm == NULL || range->hmm->dead) {
-		hmm_put(range->hmm);
-		return -EFAULT;
-	}
+	range->hmm = hmm;
 
 	/* Initialize range to track CPU page table updates. */
 	mutex_lock(&range->hmm->lock);