mbox series

[v2,hmm,00/11] Various revisions from a locking/code review

Message ID 20190606184438.31646-1-jgg@ziepe.ca (mailing list archive)
Headers show
Series Various revisions from a locking/code review | expand

Message

Jason Gunthorpe June 6, 2019, 6:44 p.m. UTC
From: Jason Gunthorpe <jgg@mellanox.com>

For hmm.git:

This patch series arised out of discussions with Jerome when looking at the
ODP changes, particularly informed by use after free races we have already
found and fixed in the ODP code (thanks to syzkaller) working with mmu
notifiers, and the discussion with Ralph on how to resolve the lifetime model.

Overall this brings in a simplified locking scheme and easy to explain
lifetime model:

 If a hmm_range is valid, then the hmm is valid, if a hmm is valid then the mm
 is allocated memory.

 If the mm needs to still be alive (ie to lock the mmap_sem, find a vma, etc)
 then the mmget must be obtained via mmget_not_zero().

Locking of mm->hmm is shifted to use the mmap_sem consistently for all
read/write and unlocked accesses are removed.

The use unlocked reads on 'hmm->dead' are also eliminated in favour of using
standard mmget() locking to prevent the mm from being released. Many of the
debugging checks of !range->hmm and !hmm->mm are dropped in favour of poison -
which is much clearer as to the lifetime intent.

The trailing patches are just some random cleanups I noticed when reviewing
this code.

This v2 incorporates alot of the good off list changes & feedback Jerome had,
and all the on-list comments too. However, now that we have the shared git I
have kept the one line change to nouveau_svm.c rather than the compat
funtions.

I believe we can resolve this merge in the DRM tree now and keep the core
mm/hmm.c clean. DRM maintainers, please correct me if I'm wrong.

It is on top of hmm.git, and I have a git tree of this series to ease testing
here:

https://github.com/jgunthorpe/linux/tree/hmm

There are still some open locking issues, as I think this remains unaddressed:

https://lore.kernel.org/linux-mm/20190527195829.GB18019@mellanox.com/

I'm looking for some more acks, reviews and tests so this can move ahead to
hmm.git.

Detailed notes on the v2 changes are in each patch. The big changes:
 - mmget is held so long as the range is registered
 - the last patch 'Remove confusing comment and logic from hmm_release' is new

Thanks everyone,
Jason

Jason Gunthorpe (11):
  mm/hmm: fix use after free with struct hmm in the mmu notifiers
  mm/hmm: Use hmm_mirror not mm as an argument for hmm_range_register
  mm/hmm: Hold a mmgrab from hmm to mm
  mm/hmm: Simplify hmm_get_or_create and make it reliable
  mm/hmm: Remove duplicate condition test before wait_event_timeout
  mm/hmm: Hold on to the mmget for the lifetime of the range
  mm/hmm: Use lockdep instead of comments
  mm/hmm: Remove racy protection against double-unregistration
  mm/hmm: Poison hmm_range during unregister
  mm/hmm: Do not use list*_rcu() for hmm->ranges
  mm/hmm: Remove confusing comment and logic from hmm_release

 drivers/gpu/drm/nouveau/nouveau_svm.c |   2 +-
 include/linux/hmm.h                   |  49 +------
 kernel/fork.c                         |   1 -
 mm/hmm.c                              | 204 ++++++++++----------------
 4 files changed, 87 insertions(+), 169 deletions(-)

Comments

Jason Gunthorpe June 11, 2019, 7:48 p.m. UTC | #1
On Thu, Jun 06, 2019 at 03:44:27PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> For hmm.git:
> 
> This patch series arised out of discussions with Jerome when looking at the
> ODP changes, particularly informed by use after free races we have already
> found and fixed in the ODP code (thanks to syzkaller) working with mmu
> notifiers, and the discussion with Ralph on how to resolve the lifetime model.
> 
> Overall this brings in a simplified locking scheme and easy to explain
> lifetime model:
> 
>  If a hmm_range is valid, then the hmm is valid, if a hmm is valid then the mm
>  is allocated memory.
> 
>  If the mm needs to still be alive (ie to lock the mmap_sem, find a vma, etc)
>  then the mmget must be obtained via mmget_not_zero().
> 
> Locking of mm->hmm is shifted to use the mmap_sem consistently for all
> read/write and unlocked accesses are removed.
> 
> The use unlocked reads on 'hmm->dead' are also eliminated in favour of using
> standard mmget() locking to prevent the mm from being released. Many of the
> debugging checks of !range->hmm and !hmm->mm are dropped in favour of poison -
> which is much clearer as to the lifetime intent.
> 
> The trailing patches are just some random cleanups I noticed when reviewing
> this code.
> 
> This v2 incorporates alot of the good off list changes & feedback Jerome had,
> and all the on-list comments too. However, now that we have the shared git I
> have kept the one line change to nouveau_svm.c rather than the compat
> funtions.
> 
> I believe we can resolve this merge in the DRM tree now and keep the core
> mm/hmm.c clean. DRM maintainers, please correct me if I'm wrong.
> 
> It is on top of hmm.git, and I have a git tree of this series to ease testing
> here:
> 
> https://github.com/jgunthorpe/linux/tree/hmm
> 
> There are still some open locking issues, as I think this remains unaddressed:
> 
> https://lore.kernel.org/linux-mm/20190527195829.GB18019@mellanox.com/
> 
> I'm looking for some more acks, reviews and tests so this can move ahead to
> hmm.git.

AMD Folks, this is looking pretty good now, can you please give at
least a Tested-by for the new driver code using this that I see in
linux-next?

Thanks,
Jason
Felix Kuehling June 12, 2019, 5:54 p.m. UTC | #2
[+Philip]

Hi Jason,

I'm out of the office this week.

Hi Philip, can you give this a go? Not sure how much you've been 
following this patch series review. Message or call me on Skype to 
discuss any questions.

Thanks,
   Felix

On 2019-06-11 12:48, Jason Gunthorpe wrote:
> On Thu, Jun 06, 2019 at 03:44:27PM -0300, Jason Gunthorpe wrote:
>> From: Jason Gunthorpe <jgg@mellanox.com>
>>
>> For hmm.git:
>>
>> This patch series arised out of discussions with Jerome when looking at the
>> ODP changes, particularly informed by use after free races we have already
>> found and fixed in the ODP code (thanks to syzkaller) working with mmu
>> notifiers, and the discussion with Ralph on how to resolve the lifetime model.
>>
>> Overall this brings in a simplified locking scheme and easy to explain
>> lifetime model:
>>
>>   If a hmm_range is valid, then the hmm is valid, if a hmm is valid then the mm
>>   is allocated memory.
>>
>>   If the mm needs to still be alive (ie to lock the mmap_sem, find a vma, etc)
>>   then the mmget must be obtained via mmget_not_zero().
>>
>> Locking of mm->hmm is shifted to use the mmap_sem consistently for all
>> read/write and unlocked accesses are removed.
>>
>> The use unlocked reads on 'hmm->dead' are also eliminated in favour of using
>> standard mmget() locking to prevent the mm from being released. Many of the
>> debugging checks of !range->hmm and !hmm->mm are dropped in favour of poison -
>> which is much clearer as to the lifetime intent.
>>
>> The trailing patches are just some random cleanups I noticed when reviewing
>> this code.
>>
>> This v2 incorporates alot of the good off list changes & feedback Jerome had,
>> and all the on-list comments too. However, now that we have the shared git I
>> have kept the one line change to nouveau_svm.c rather than the compat
>> funtions.
>>
>> I believe we can resolve this merge in the DRM tree now and keep the core
>> mm/hmm.c clean. DRM maintainers, please correct me if I'm wrong.
>>
>> It is on top of hmm.git, and I have a git tree of this series to ease testing
>> here:
>>
>> https://github.com/jgunthorpe/linux/tree/hmm
>>
>> There are still some open locking issues, as I think this remains unaddressed:
>>
>> https://lore.kernel.org/linux-mm/20190527195829.GB18019@mellanox.com/
>>
>> I'm looking for some more acks, reviews and tests so this can move ahead to
>> hmm.git.
> AMD Folks, this is looking pretty good now, can you please give at
> least a Tested-by for the new driver code using this that I see in
> linux-next?
>
> Thanks,
> Jason
Philip Yang June 12, 2019, 9:49 p.m. UTC | #3
Rebase to https://github.com/jgunthorpe/linux.git hmm branch, need some 
changes because of interface hmm_range_register change. Then run a quick 
amdgpu_test. Test is finished, result is ok. But there is below kernel 
BUG message, seems hmm_free_rcu calls down_write.....

[ 1171.919921] BUG: sleeping function called from invalid context at 
/home/yangp/git/compute_staging/kernel/kernel/locking/rwsem.c:65
[ 1171.919933] in_atomic(): 1, irqs_disabled(): 0, pid: 53, name: 
kworker/1:1
[ 1171.919938] 2 locks held by kworker/1:1/53:
[ 1171.919940]  #0: 000000001c7c19d4 ((wq_completion)rcu_gp){+.+.}, at: 
process_one_work+0x20e/0x630
[ 1171.919951]  #1: 00000000923f2cfa 
((work_completion)(&sdp->work)){+.+.}, at: process_one_work+0x20e/0x630
[ 1171.919959] CPU: 1 PID: 53 Comm: kworker/1:1 Tainted: G        W 
    5.2.0-rc1-kfd-yangp #196
[ 1171.919961] Hardware name: ASUS All Series/Z97-PRO(Wi-Fi ac)/USB 3.1, 
BIOS 9001 03/07/2016
[ 1171.919965] Workqueue: rcu_gp srcu_invoke_callbacks
[ 1171.919968] Call Trace:
[ 1171.919974]  dump_stack+0x67/0x9b
[ 1171.919980]  ___might_sleep+0x149/0x230
[ 1171.919985]  down_write+0x1c/0x70
[ 1171.919989]  hmm_free_rcu+0x24/0x80
[ 1171.919993]  srcu_invoke_callbacks+0xc9/0x150
[ 1171.920000]  process_one_work+0x28e/0x630
[ 1171.920008]  worker_thread+0x39/0x3f0
[ 1171.920014]  ? process_one_work+0x630/0x630
[ 1171.920017]  kthread+0x11c/0x140
[ 1171.920021]  ? kthread_park+0x90/0x90
[ 1171.920026]  ret_from_fork+0x24/0x30

Philip

On 2019-06-12 1:54 p.m., Kuehling, Felix wrote:
> [+Philip]
> 
> Hi Jason,
> 
> I'm out of the office this week.
> 
> Hi Philip, can you give this a go? Not sure how much you've been
> following this patch series review. Message or call me on Skype to
> discuss any questions.
> 
> Thanks,
>     Felix
> 
> On 2019-06-11 12:48, Jason Gunthorpe wrote:
>> On Thu, Jun 06, 2019 at 03:44:27PM -0300, Jason Gunthorpe wrote:
>>> From: Jason Gunthorpe <jgg@mellanox.com>
>>>
>>> For hmm.git:
>>>
>>> This patch series arised out of discussions with Jerome when looking at the
>>> ODP changes, particularly informed by use after free races we have already
>>> found and fixed in the ODP code (thanks to syzkaller) working with mmu
>>> notifiers, and the discussion with Ralph on how to resolve the lifetime model.
>>>
>>> Overall this brings in a simplified locking scheme and easy to explain
>>> lifetime model:
>>>
>>>    If a hmm_range is valid, then the hmm is valid, if a hmm is valid then the mm
>>>    is allocated memory.
>>>
>>>    If the mm needs to still be alive (ie to lock the mmap_sem, find a vma, etc)
>>>    then the mmget must be obtained via mmget_not_zero().
>>>
>>> Locking of mm->hmm is shifted to use the mmap_sem consistently for all
>>> read/write and unlocked accesses are removed.
>>>
>>> The use unlocked reads on 'hmm->dead' are also eliminated in favour of using
>>> standard mmget() locking to prevent the mm from being released. Many of the
>>> debugging checks of !range->hmm and !hmm->mm are dropped in favour of poison -
>>> which is much clearer as to the lifetime intent.
>>>
>>> The trailing patches are just some random cleanups I noticed when reviewing
>>> this code.
>>>
>>> This v2 incorporates alot of the good off list changes & feedback Jerome had,
>>> and all the on-list comments too. However, now that we have the shared git I
>>> have kept the one line change to nouveau_svm.c rather than the compat
>>> funtions.
>>>
>>> I believe we can resolve this merge in the DRM tree now and keep the core
>>> mm/hmm.c clean. DRM maintainers, please correct me if I'm wrong.
>>>
>>> It is on top of hmm.git, and I have a git tree of this series to ease testing
>>> here:
>>>
>>> https://github.com/jgunthorpe/linux/tree/hmm
>>>
>>> There are still some open locking issues, as I think this remains unaddressed:
>>>
>>> https://lore.kernel.org/linux-mm/20190527195829.GB18019@mellanox.com/
>>>
>>> I'm looking for some more acks, reviews and tests so this can move ahead to
>>> hmm.git.
>> AMD Folks, this is looking pretty good now, can you please give at
>> least a Tested-by for the new driver code using this that I see in
>> linux-next?
>>
>> Thanks,
>> Jason
Jason Gunthorpe June 13, 2019, 5:50 p.m. UTC | #4
On Wed, Jun 12, 2019 at 09:49:12PM +0000, Yang, Philip wrote:
> Rebase to https://github.com/jgunthorpe/linux.git hmm branch, need some 
> changes because of interface hmm_range_register change. Then run a quick 
> amdgpu_test. Test is finished, result is ok.

Great! Thanks

I'll add your Tested-by to the series

>  But there is below kernel BUG message, seems hmm_free_rcu calls
> down_write.....
> 
> [ 1171.919921] BUG: sleeping function called from invalid context at 
> /home/yangp/git/compute_staging/kernel/kernel/locking/rwsem.c:65
> [ 1171.919933] in_atomic(): 1, irqs_disabled(): 0, pid: 53, name: 
> kworker/1:1
> [ 1171.919938] 2 locks held by kworker/1:1/53:
> [ 1171.919940]  #0: 000000001c7c19d4 ((wq_completion)rcu_gp){+.+.}, at: 
> process_one_work+0x20e/0x630
> [ 1171.919951]  #1: 00000000923f2cfa 
> ((work_completion)(&sdp->work)){+.+.}, at: process_one_work+0x20e/0x630
> [ 1171.919959] CPU: 1 PID: 53 Comm: kworker/1:1 Tainted: G        W 
>     5.2.0-rc1-kfd-yangp #196
> [ 1171.919961] Hardware name: ASUS All Series/Z97-PRO(Wi-Fi ac)/USB 3.1, 
> BIOS 9001 03/07/2016
> [ 1171.919965] Workqueue: rcu_gp srcu_invoke_callbacks
> [ 1171.919968] Call Trace:
> [ 1171.919974]  dump_stack+0x67/0x9b
> [ 1171.919980]  ___might_sleep+0x149/0x230
> [ 1171.919985]  down_write+0x1c/0x70
> [ 1171.919989]  hmm_free_rcu+0x24/0x80
> [ 1171.919993]  srcu_invoke_callbacks+0xc9/0x150
> [ 1171.920000]  process_one_work+0x28e/0x630
> [ 1171.920008]  worker_thread+0x39/0x3f0
> [ 1171.920014]  ? process_one_work+0x630/0x630
> [ 1171.920017]  kthread+0x11c/0x140
> [ 1171.920021]  ? kthread_park+0x90/0x90
> [ 1171.920026]  ret_from_fork+0x24/0x30

Thank you Phillip, it seems the prior tests were not done with
lockdep..

Sigh, I will keep this with the gross pagetable_lock then. I updated
the patches on the git with this modification. I think we have covered
all the bases so I will send another V of the series to the list and
if no more comments then it will move ahead to hmm.git. Thanks to all.

diff --git a/mm/hmm.c b/mm/hmm.c
index 136c812faa2790..4c64d4c32f4825 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -49,16 +49,15 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
 
 	lockdep_assert_held_exclusive(&mm->mmap_sem);
 
+	/* Abuse the page_table_lock to also protect mm->hmm. */
+	spin_lock(&mm->page_table_lock);
 	if (mm->hmm) {
-		if (kref_get_unless_zero(&mm->hmm->kref))
+		if (kref_get_unless_zero(&mm->hmm->kref)) {
+			spin_unlock(&mm->page_table_lock);
 			return mm->hmm;
-		/*
-		 * The hmm is being freed by some other CPU and is pending a
-		 * RCU grace period, but this CPU can NULL now it since we
-		 * have the mmap_sem.
-		 */
-		mm->hmm = NULL;
+		}
 	}
+	spin_unlock(&mm->page_table_lock);
 
 	hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
 	if (!hmm)
@@ -81,7 +80,14 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
 	}
 
 	mmgrab(hmm->mm);
+
+	/*
+	 * We hold the exclusive mmap_sem here so we know that mm->hmm is
+	 * still NULL or 0 kref, and is safe to update.
+	 */
+	spin_lock(&mm->page_table_lock);
 	mm->hmm = hmm;
+	spin_unlock(&mm->page_table_lock);
 	return hmm;
 }
 
@@ -89,10 +95,14 @@ static void hmm_free_rcu(struct rcu_head *rcu)
 {
 	struct hmm *hmm = container_of(rcu, struct hmm, rcu);
 
-	down_write(&hmm->mm->mmap_sem);
+	/*
+	 * The mm->hmm pointer is kept valid while notifier ops can be running
+	 * so they don't have to deal with a NULL mm->hmm value
+	 */
+	spin_lock(&hmm->mm->page_table_lock);
 	if (hmm->mm->hmm == hmm)
 		hmm->mm->hmm = NULL;
-	up_write(&hmm->mm->mmap_sem);
+	spin_unlock(&hmm->mm->page_table_lock);
 	mmdrop(hmm->mm);
 
 	kfree(hmm);