Message ID: 20190606184438.31646-12-jgg@ziepe.ca (mailing list archive)
State: Superseded
Delegated to: Jason Gunthorpe
Series: Various revisions from a locking/code review
On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
>
> hmm_release() is called exactly once per hmm. ops->release() cannot
> accidentally trigger any action that would recurse back onto
> hmm->mirrors_sem.
>
> This fixes a use after-free race of the form:
>
>        CPU0                                   CPU1
>                                            hmm_release()
>                                              up_write(&hmm->mirrors_sem);
>  hmm_mirror_unregister(mirror)
>   down_write(&hmm->mirrors_sem);
>   up_write(&hmm->mirrors_sem);
>   kfree(mirror)
>                                              mirror->ops->release(mirror)
>
> The only user we have today for ops->release is an empty function, so this
> is unambiguously safe.
>
> As a consequence of plugging this race drivers are not allowed to
> register/unregister mirrors from within a release op.
>
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> ---
>  mm/hmm.c | 28 +++++++++-------------------
>  1 file changed, 9 insertions(+), 19 deletions(-)
>
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 709d138dd49027..3a45dd3d778248 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -136,26 +136,16 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>  	WARN_ON(!list_empty(&hmm->ranges));
>  	mutex_unlock(&hmm->lock);
>
> -	down_write(&hmm->mirrors_sem);
> -	mirror = list_first_entry_or_null(&hmm->mirrors, struct hmm_mirror,
> -					  list);
> -	while (mirror) {
> -		list_del_init(&mirror->list);
> -		if (mirror->ops->release) {
> -			/*
> -			 * Drop mirrors_sem so the release callback can wait
> -			 * on any pending work that might itself trigger a
> -			 * mmu_notifier callback and thus would deadlock with
> -			 * us.
> -			 */
> -			up_write(&hmm->mirrors_sem);
> +	down_read(&hmm->mirrors_sem);

This is cleaner and simpler, but I suspect it is leading to the deadlock
that Ralph Campbell is seeing in his driver testing. (And in general,
holding a lock during a driver callback usually leads to deadlocks.)

Ralph, is this the one? It's the only place in this patchset where I can
see a lock around a callback to driver code that wasn't there before. So
I'm pretty sure it is the one...

thanks,
On Thu, Jun 06, 2019 at 08:47:28PM -0700, John Hubbard wrote:
> On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> >
> > hmm_release() is called exactly once per hmm. ops->release() cannot
> > accidentally trigger any action that would recurse back onto
> > hmm->mirrors_sem.
> >
> > This fixes a use after-free race of the form:
> >
> >        CPU0                                   CPU1
> >                                            hmm_release()
> >                                              up_write(&hmm->mirrors_sem);
> >  hmm_mirror_unregister(mirror)
> >   down_write(&hmm->mirrors_sem);
> >   up_write(&hmm->mirrors_sem);
> >   kfree(mirror)
> >                                              mirror->ops->release(mirror)
> >
> > The only user we have today for ops->release is an empty function, so this
> > is unambiguously safe.
> >
> > As a consequence of plugging this race drivers are not allowed to
> > register/unregister mirrors from within a release op.
> >
> > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> >  mm/hmm.c | 28 +++++++++-------------------
> >  1 file changed, 9 insertions(+), 19 deletions(-)
> >
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index 709d138dd49027..3a45dd3d778248 100644
> > +++ b/mm/hmm.c
> > @@ -136,26 +136,16 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
> >  	WARN_ON(!list_empty(&hmm->ranges));
> >  	mutex_unlock(&hmm->lock);
> >
> > -	down_write(&hmm->mirrors_sem);
> > -	mirror = list_first_entry_or_null(&hmm->mirrors, struct hmm_mirror,
> > -					  list);
> > -	while (mirror) {
> > -		list_del_init(&mirror->list);
> > -		if (mirror->ops->release) {
> > -			/*
> > -			 * Drop mirrors_sem so the release callback can wait
> > -			 * on any pending work that might itself trigger a
> > -			 * mmu_notifier callback and thus would deadlock with
> > -			 * us.
> > -			 */
> > -			up_write(&hmm->mirrors_sem);
> > +	down_read(&hmm->mirrors_sem);
>
> This is cleaner and simpler, but I suspect it is leading to the deadlock
> that Ralph Campbell is seeing in his driver testing. (And in general, holding
> a lock during a driver callback usually leads to deadlocks.)

I think Ralph has never seen this patch (it is new), so it must be one
of the earlier patches..

> Ralph, is this the one? It's the only place in this patchset where I can
> see a lock around a callback to driver code, that wasn't there before. So
> I'm pretty sure it is the one...

Can you share the lockdep report please?

Thanks,
Jason
On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
>
> hmm_release() is called exactly once per hmm. ops->release() cannot
> accidentally trigger any action that would recurse back onto
> hmm->mirrors_sem.
>
> This fixes a use after-free race of the form:
>
>        CPU0                                   CPU1
>                                            hmm_release()
>                                              up_write(&hmm->mirrors_sem);
>  hmm_mirror_unregister(mirror)
>   down_write(&hmm->mirrors_sem);
>   up_write(&hmm->mirrors_sem);
>   kfree(mirror)
>                                              mirror->ops->release(mirror)
>
> The only user we have today for ops->release is an empty function, so this
> is unambiguously safe.
>
> As a consequence of plugging this race drivers are not allowed to
> register/unregister mirrors from within a release op.
>
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

I agree with the analysis above but I'm not sure that release() will
always be an empty function. It might be more efficient to write back
all data migrated to a device "in one pass" instead of relying
on unmap_vmas() calling hmm_start_range_invalidate() per VMA.

I think the bigger issue is potential deadlocks while calling
sync_cpu_device_pagetables() and tasks calling hmm_mirror_unregister():

Say you have three threads:
- Thread A is in try_to_unmap(), either without holding mmap_sem or with
  mmap_sem held for read.
- Thread B has some unrelated driver calling hmm_mirror_unregister().
  This doesn't require mmap_sem.
- Thread C is about to call migrate_vma().

Thread A                    Thread B                  Thread C
try_to_unmap                hmm_mirror_unregister     migrate_vma
----------------------      -----------------------   ----------------------
hmm_invalidate_range_start
down_read(mirrors_sem)
                            down_write(mirrors_sem)
                            // Blocked on A
                                                      device_lock
device_lock
// Blocked on C
                                                      migrate_vma()
                                                      hmm_invalidate_range_s
                                                      down_read(mirrors_sem)
                                                      // Blocked on B
// Deadlock

Perhaps we should consider using SRCU for walking the mirror->list?

> ---
>  mm/hmm.c | 28 +++++++++-------------------
>  1 file changed, 9 insertions(+), 19 deletions(-)
>
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 709d138dd49027..3a45dd3d778248 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -136,26 +136,16 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>  	WARN_ON(!list_empty(&hmm->ranges));
>  	mutex_unlock(&hmm->lock);
>
> -	down_write(&hmm->mirrors_sem);
> -	mirror = list_first_entry_or_null(&hmm->mirrors, struct hmm_mirror,
> -					  list);
> -	while (mirror) {
> -		list_del_init(&mirror->list);
> -		if (mirror->ops->release) {
> -			/*
> -			 * Drop mirrors_sem so the release callback can wait
> -			 * on any pending work that might itself trigger a
> -			 * mmu_notifier callback and thus would deadlock with
> -			 * us.
> -			 */
> -			up_write(&hmm->mirrors_sem);
> +	down_read(&hmm->mirrors_sem);
> +	list_for_each_entry(mirror, &hmm->mirrors, list) {
> +		/*
> +		 * Note: The driver is not allowed to trigger
> +		 * hmm_mirror_unregister() from this thread.
> +		 */
> +		if (mirror->ops->release)
>  			mirror->ops->release(mirror);
> -			down_write(&hmm->mirrors_sem);
> -		}
> -		mirror = list_first_entry_or_null(&hmm->mirrors,
> -						  struct hmm_mirror, list);
>  	}
> -	up_write(&hmm->mirrors_sem);
> +	up_read(&hmm->mirrors_sem);
>
>  	hmm_put(hmm);
>  }
> @@ -287,7 +277,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
>  	struct hmm *hmm = mirror->hmm;
>
>  	down_write(&hmm->mirrors_sem);
> -	list_del_init(&mirror->list);
> +	list_del(&mirror->list);
>  	up_write(&hmm->mirrors_sem);
>  	hmm_put(hmm);
>  	memset(&mirror->hmm, POISON_INUSE, sizeof(mirror->hmm));
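[For concreteness, here is a minimal, untested sketch of the SRCU variant Ralph floats above. The mirrors_srcu field is hypothetical; it would have to be added to struct hmm and set up with init_srcu_struct(). This is illustrative only and is not part of the patch under review:]

#include <linux/srcu.h>
#include <linux/rculist.h>

static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
	struct hmm_mirror *mirror;
	int idx;

	/*
	 * SRCU readers never block writers, so no semaphore is held
	 * across the driver callback.
	 */
	idx = srcu_read_lock(&hmm->mirrors_srcu);
	list_for_each_entry_rcu(mirror, &hmm->mirrors, list)
		if (mirror->ops->release)
			mirror->ops->release(mirror);
	srcu_read_unlock(&hmm->mirrors_srcu, idx);

	hmm_put(hmm);
}

void hmm_mirror_unregister(struct hmm_mirror *mirror)
{
	struct hmm *hmm = mirror->hmm;

	down_write(&hmm->mirrors_sem);
	list_del_rcu(&mirror->list);
	up_write(&hmm->mirrors_sem);

	/*
	 * Wait for any in-flight list walker before the caller may
	 * free the mirror; this is what closes the use-after-free.
	 */
	synchronize_srcu(&hmm->mirrors_srcu);
	hmm_put(hmm);
}

[The cost, as the reply below notes, is that a mirror's ->release() can still be executing while hmm_mirror_unregister() waits in synchronize_srcu(), so each driver would have to tolerate the callback racing with its own teardown path.]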
On Fri, Jun 07, 2019 at 02:37:07PM -0700, Ralph Campbell wrote:
>
> On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> >
> > hmm_release() is called exactly once per hmm. ops->release() cannot
> > accidentally trigger any action that would recurse back onto
> > hmm->mirrors_sem.
> >
> > This fixes a use after-free race of the form:
> >
> >        CPU0                                   CPU1
> >                                            hmm_release()
> >                                              up_write(&hmm->mirrors_sem);
> >  hmm_mirror_unregister(mirror)
> >   down_write(&hmm->mirrors_sem);
> >   up_write(&hmm->mirrors_sem);
> >   kfree(mirror)
> >                                              mirror->ops->release(mirror)
> >
> > The only user we have today for ops->release is an empty function, so this
> > is unambiguously safe.
> >
> > As a consequence of plugging this race drivers are not allowed to
> > register/unregister mirrors from within a release op.
> >
> > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
>
> I agree with the analysis above but I'm not sure that release() will
> always be an empty function. It might be more efficient to write back
> all data migrated to a device "in one pass" instead of relying
> on unmap_vmas() calling hmm_start_range_invalidate() per VMA.

Sure, but it should not be allowed to recurse back to
hmm_mirror_unregister.

> I think the bigger issue is potential deadlocks while calling
> sync_cpu_device_pagetables() and tasks calling hmm_mirror_unregister():
>
> Say you have three threads:
> - Thread A is in try_to_unmap(), either without holding mmap_sem or with
>   mmap_sem held for read.
> - Thread B has some unrelated driver calling hmm_mirror_unregister().
>   This doesn't require mmap_sem.
> - Thread C is about to call migrate_vma().
>
> Thread A                    Thread B                  Thread C
> try_to_unmap                hmm_mirror_unregister     migrate_vma
> hmm_invalidate_range_start
> down_read(mirrors_sem)
>                             down_write(mirrors_sem)
>                             // Blocked on A
>                                                       device_lock
> device_lock
> // Blocked on C
>                                                       migrate_vma()
>                                                       hmm_invalidate_range_s
>                                                       down_read(mirrors_sem)
>                                                       // Blocked on B
> // Deadlock

Oh... you know I didn't know this about rwsems in linux that they have
a fairness policy for writes to block future reads..

Still, at least as things are designed, the driver cannot hold a lock it
obtains under sync_cpu_device_pagetables() and nest other things in that
lock. It certainly can't recurse back into any mmu notifiers while
holding that lock. (as you point out)

The lock in sync_cpu_device_pagetables() needs to be very narrowly
focused on updating device state only.

So, my first reaction is that the driver in thread C is wrong, and needs
a different locking scheme. I think you'd have to make a really good
case that there is no alternative for a driver..

> Perhaps we should consider using SRCU for walking the mirror->list?

It means the driver has to deal with races like in this patch
description. At that point there is almost no reason to insert hmm here,
just use mmu notifiers directly.

Drivers won't get this right, it is too hard.

Jason
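[To make the "narrowly focused" rule concrete, a hedged sketch of a driver callback that touches only its device page-table state under a dedicated spinlock. All my_* names are invented for illustration, and the callback signature follows the hmm_mirror_ops definition of this era:]

#include <linux/hmm.h>
#include <linux/spinlock.h>

struct my_device {
	struct hmm_mirror mirror;
	spinlock_t pt_lock;	/* protects only the shadow page table */
};

/* Hypothetical helper that clears device PTEs in [start, end). */
void my_device_clear_ptes(struct my_device *mydev,
			  unsigned long start, unsigned long end);

static int my_sync_cpu_device_pagetables(struct hmm_mirror *mirror,
					 const struct hmm_update *update)
{
	struct my_device *mydev =
		container_of(mirror, struct my_device, mirror);

	/*
	 * Narrow critical section: update device state and nothing
	 * else. No allocation that can reclaim, no waiting on work
	 * that takes other locks, no re-entry into mmu notifiers -
	 * and in particular no "device_lock" held across migrate_vma()
	 * as in thread C above.
	 */
	spin_lock(&mydev->pt_lock);
	my_device_clear_ptes(mydev, update->start, update->end);
	spin_unlock(&mydev->pt_lock);

	return 0;
}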
On Fri, Jun 07, 2019 at 02:37:07PM -0700, Ralph Campbell wrote:
>
> On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> >
> > hmm_release() is called exactly once per hmm. ops->release() cannot
> > accidentally trigger any action that would recurse back onto
> > hmm->mirrors_sem.
> >
> > This fixes a use after-free race of the form:
> >
> >        CPU0                                   CPU1
> >                                            hmm_release()
> >                                              up_write(&hmm->mirrors_sem);
> >  hmm_mirror_unregister(mirror)
> >   down_write(&hmm->mirrors_sem);
> >   up_write(&hmm->mirrors_sem);
> >   kfree(mirror)
> >                                              mirror->ops->release(mirror)
> >
> > The only user we have today for ops->release is an empty function, so this
> > is unambiguously safe.
> >
> > As a consequence of plugging this race drivers are not allowed to
> > register/unregister mirrors from within a release op.
> >
> > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
>
> I agree with the analysis above but I'm not sure that release() will
> always be an empty function. It might be more efficient to write back
> all data migrated to a device "in one pass" instead of relying
> on unmap_vmas() calling hmm_start_range_invalidate() per VMA.

I think we have to focus on the *current* kernel - and we have two users
of release, nouveau_svm.c is empty and amdgpu_mn.c does schedule_work() -
so I believe we should go ahead with this simple solution to the actual
race today that both of those will suffer from.

If we find a need for a more complex version then it can be debated and
justified with proper context...

Ok?

Jason
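[A driver whose release() is not empty can follow the pattern Jason attributes to amdgpu_mn.c: kick the real teardown to a work item rather than calling hmm_mirror_unregister() from the callback thread. A rough sketch, reusing the hypothetical my_device from the earlier example extended with a struct work_struct release_work field, assumed to be initialized at setup with INIT_WORK(&mydev->release_work, my_release_worker):]

#include <linux/workqueue.h>

static void my_release_worker(struct work_struct *work)
{
	struct my_device *mydev =
		container_of(work, struct my_device, release_work);

	/* Safe: this worker thread does not hold hmm->mirrors_sem. */
	hmm_mirror_unregister(&mydev->mirror);
}

static void my_mirror_release(struct hmm_mirror *mirror)
{
	struct my_device *mydev =
		container_of(mirror, struct my_device, mirror);

	/*
	 * Called with hmm->mirrors_sem held for read; unregistering
	 * inline here would deadlock, so defer it to a work item.
	 */
	schedule_work(&mydev->release_work);
}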
On 6/10/19 9:02 AM, Jason Gunthorpe wrote:
> On Fri, Jun 07, 2019 at 02:37:07PM -0700, Ralph Campbell wrote:
>>
>> On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
>>> From: Jason Gunthorpe <jgg@mellanox.com>
>>>
>>> hmm_release() is called exactly once per hmm. ops->release() cannot
>>> accidentally trigger any action that would recurse back onto
>>> hmm->mirrors_sem.
>>>
>>> This fixes a use after-free race of the form:
>>>
>>>        CPU0                                   CPU1
>>>                                            hmm_release()
>>>                                              up_write(&hmm->mirrors_sem);
>>>  hmm_mirror_unregister(mirror)
>>>   down_write(&hmm->mirrors_sem);
>>>   up_write(&hmm->mirrors_sem);
>>>   kfree(mirror)
>>>                                              mirror->ops->release(mirror)
>>>
>>> The only user we have today for ops->release is an empty function, so this
>>> is unambiguously safe.
>>>
>>> As a consequence of plugging this race drivers are not allowed to
>>> register/unregister mirrors from within a release op.
>>>
>>> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
>>
>> I agree with the analysis above but I'm not sure that release() will
>> always be an empty function. It might be more efficient to write back
>> all data migrated to a device "in one pass" instead of relying
>> on unmap_vmas() calling hmm_start_range_invalidate() per VMA.
>
> I think we have to focus on the *current* kernel - and we have two
> users of release, nouveau_svm.c is empty and amdgpu_mn.c does
> schedule_work() - so I believe we should go ahead with this simple
> solution to the actual race today that both of those will suffer from.
>
> If we find a need for a more complex version then it can be debated
> and justified with proper context...
>
> Ok?
>
> Jason

OK. I guess we have enough on the plate already :-)
diff --git a/mm/hmm.c b/mm/hmm.c
index 709d138dd49027..3a45dd3d778248 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -136,26 +136,16 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	WARN_ON(!list_empty(&hmm->ranges));
 	mutex_unlock(&hmm->lock);
 
-	down_write(&hmm->mirrors_sem);
-	mirror = list_first_entry_or_null(&hmm->mirrors, struct hmm_mirror,
-					  list);
-	while (mirror) {
-		list_del_init(&mirror->list);
-		if (mirror->ops->release) {
-			/*
-			 * Drop mirrors_sem so the release callback can wait
-			 * on any pending work that might itself trigger a
-			 * mmu_notifier callback and thus would deadlock with
-			 * us.
-			 */
-			up_write(&hmm->mirrors_sem);
+	down_read(&hmm->mirrors_sem);
+	list_for_each_entry(mirror, &hmm->mirrors, list) {
+		/*
+		 * Note: The driver is not allowed to trigger
+		 * hmm_mirror_unregister() from this thread.
+		 */
+		if (mirror->ops->release)
 			mirror->ops->release(mirror);
-			down_write(&hmm->mirrors_sem);
-		}
-		mirror = list_first_entry_or_null(&hmm->mirrors,
-						  struct hmm_mirror, list);
 	}
-	up_write(&hmm->mirrors_sem);
+	up_read(&hmm->mirrors_sem);
 
 	hmm_put(hmm);
 }
@@ -287,7 +277,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
 	struct hmm *hmm = mirror->hmm;
 
 	down_write(&hmm->mirrors_sem);
-	list_del_init(&mirror->list);
+	list_del(&mirror->list);
 	up_write(&hmm->mirrors_sem);
 	hmm_put(hmm);
 	memset(&mirror->hmm, POISON_INUSE, sizeof(mirror->hmm));