
[00/37] Replace obj->mm.lock with reservation_ww_class

Message ID: 20200805122231.23313-1-chris@chris-wilson.co.uk (mailing list archive)

Message

Chris Wilson Aug. 5, 2020, 12:21 p.m. UTC
Long story short, we need to manage evictions using dma_resv & dma_fence
tracking. The backing storage will then be managed using the ww_mutex
borrowed from (and shared via) obj->base.resv, rather than the current
obj->mm.lock.

Skipping over the breadcrumbs, the first step is to remove the final
crutches of struct_mutex from execbuf and to broaden the hold for the
dma-resv to guard not just publishing the dma-fences, but for the
duration of the execbuf submission (holding all objects and their
backing store from the point of acquisition to publishing of the final
GPU work, after which the guard is delegated to the dma-fences).
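In code, the intended shape of that guard is roughly the following (a
sketch against the bare dma_resv API only; the request plumbing between
lock and publish is elided):

    struct i915_request *rq;
    int err;

    /* acquire: take the object lock before touching its backing store */
    err = dma_resv_lock_interruptible(obj->base.resv, NULL);
    if (err)
        return err;

    /* ... pin the pages and build/emit the request under the lock ... */

    /* publish: from here the dma-fence carries the guard for the work */
    dma_resv_add_excl_fence(obj->base.resv, &rq->fence);
    dma_resv_unlock(obj->base.resv);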

This is of course made complicated by our history. On top of the user's
objects, we also have the HW/kernel objects with their own lifetimes,
and a bunch of auxiliary objects used for working around unhappy HW and
for providing the legacy relocation mechanism. We add every auxiliary
object to the list of user objects required, and attempt to acquire them
en masse. Since all the objects can be known a priori, we can build a
list of those objects and pass that to a routine that can resolve the
-EDEADLK (and evictions). [To avoid relocations imposing a penalty on
sane userspace that avoids them, we do not touch any relocations until
necessary, at which point we have to unroll the state and rebuild a new
list with more auxiliary buffers to accommodate the extra copy_from_user].
More examples are included as to how we can break down operations
involving multiple objects into an acquire phase prior to those
operations, keeping the -EDEADLK handling under control.
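The acquire-and-backoff routine is the standard ww_mutex dance; drm core
already provides a helper of exactly this shape in
drm_gem_lock_reservations(), condensed here as a sketch (error paths
simplified, not the series' actual code):

    int lock_all(struct drm_gem_object **objs, int count,
                 struct ww_acquire_ctx *ctx)
    {
        int contended = -1;
        int i, err;

        ww_acquire_init(ctx, &reservation_ww_class);
    retry:
        if (contended != -1) {
            /* sleep on the contended lock while holding nothing else */
            err = dma_resv_lock_slow_interruptible(objs[contended]->resv,
                                                   ctx);
            if (err) {
                ww_acquire_fini(ctx);
                return err;
            }
        }

        for (i = 0; i < count; i++) {
            if (i == contended) /* already locked by the slowpath above */
                continue;

            err = dma_resv_lock_interruptible(objs[i]->resv, ctx);
            if (err) {
                int j;

                /* back off: drop every lock taken so far */
                for (j = 0; j < i; j++)
                    dma_resv_unlock(objs[j]->resv);
                if (contended >= i)
                    dma_resv_unlock(objs[contended]->resv);

                if (err == -EDEADLK) {
                    contended = i;
                    goto retry;
                }

                ww_acquire_fini(ctx);
                return err;
            }
        }

        ww_acquire_done(ctx); /* no further locks in this acquire context */
        return 0;
    }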

execbuf is the unique interface in that it deals with multiple user
and kernel buffers. After that, we have callers that in principle care
about accessing a single buffer, and so can be migrated over to a helper
that permits only holding one such buffer at a time. That enables us to
swap out obj->mm.lock for obj->base.resv->lock, and use lockdep to spot
illegal nesting, and to throw away the temporary pins by replacing them
with holding the ww_mutex for the duration instead.
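Such a helper needs no ww acquire context at all; a minimal sketch (the
helper names here are illustrative, not the series' actual API):

    static inline int single_lock(struct drm_i915_gem_object *obj)
    {
        /*
         * NULL acquire context: legal only because the caller may never
         * hold a second object lock; lockdep will splat on any attempt
         * to nest another reservation_ww_class lock inside this one.
         */
        return dma_resv_lock_interruptible(obj->base.resv, NULL);
    }

    static inline void single_unlock(struct drm_i915_gem_object *obj)
    {
        dma_resv_unlock(obj->base.resv);
    }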

What's changed? Some patch splitting and we need to pull in Matthew's
patch to map the page directories under the ww_mutex.

Comments

Thomas Hellström (Intel) Aug. 5, 2020, 4:22 p.m. UTC | #1
Hi, Chris,


On 8/5/20 2:21 PM, Chris Wilson wrote:
> Long story short, we need to manage evictions using dma_resv & dma_fence
> tracking. The backing storage will then be managed using the ww_mutex
> borrowed from (and shared via) obj->base.resv, rather than the current
> obj->mm.lock.
>
> Skipping over the breadcrumbs,

While perhaps needed fixes, could we submit them as a separate series, 
since they, from what I can tell, are not a direct part of the locking 
rework, and some of them were actually part of a series that Dave NaK'ed 
and may require additional justification?


>   the first step is to remove the final
> crutches of struct_mutex from execbuf and to broaden the hold for the
> dma-resv to guard not just publishing the dma-fences, but for the
> duration of the execbuf submission (holding all objects and their
> backing store from the point of acquisition to publishing of the final
> GPU work, after which the guard is delegated to the dma-fences).
>
> This is of course made complicated by our history. On top of the user's
> objects, we also have the HW/kernel objects with their own lifetimes,
> and a bunch of auxiliary objects used for working around unhappy HW and
> for providing the legacy relocation mechanism. We add every auxiliary
> object to the list of user objects required, and attempt to acquire them
> en masse. Since all the objects can be known a priori, we can build a
> list of those objects and pass that to a routine that can resolve the
> -EDEADLK (and evictions). [To avoid relocations imposing a penalty on
> sane userspace that avoids them, we do not touch any relocations until
> necessary, at which point we have to unroll the state and rebuild a new
> list with more auxiliary buffers to accommodate the extra copy_from_user].
> More examples are included as to how we can break down operations
> involving multiple objects into an acquire phase prior to those
> operations, keeping the -EDEADLK handling under control.
>
> execbuf is the unique interface in that it deals with multiple user
> and kernel buffers. After that, we have callers that in principle care
> about accessing a single buffer, and so can be migrated over to a helper
> that permits only holding one such buffer at a time. That enables us to
> swap out obj->mm.lock for obj->base.resv->lock, and use lockdep to spot
> illegal nesting, and to throw away the temporary pins by replacing them
> with holding the ww_mutex for the duration instead.
>
> What's changed? Some patch splitting and we need to pull in Matthew's
> patch to map the page directories under the ww_mutex.

I would still like to see a justification for the newly introduced async
work, as opposed to adding it as an optimizing / regression-fixing series
following the locking rework. That async work introduces a bunch of code
complexity, and it would be beneficial to see a discussion of the
tradeoffs and how it aligns with the upstream proposed dma-fence
annotations.

Thanks,

Thomas

Tvrtko Ursulin Aug. 6, 2020, 9:21 a.m. UTC | #2
On 05/08/2020 17:22, Thomas Hellström (Intel) wrote:
> Hi, Chris,
> 
> 
> On 8/5/20 2:21 PM, Chris Wilson wrote:
>> Long story short, we need to manage evictions using dma_resv & dma_fence
>> tracking. The backing storage will then be managed using the ww_mutex
>> borrowed from (and shared via) obj->base.resv, rather than the current
>> obj->mm.lock.
>>
>> Skipping over the breadcrumbs,
> 
> While perhaps needed fixes, could we submit them as a separate series, 
> since they, from what I can tell, are not a direct part of the locking 
> rework, and some of them were actually part of a series that Dave NaK'ed 
> and may require additional justification?
> 
> 
>>   the first step is to remove the final
>> crutches of struct_mutex from execbuf and to broaden the hold for the
>> dma-resv to guard not just publishing the dma-fences, but for the
>> duration of the execbuf submission (holding all objects and their
>> backing store from the point of acquisition to publishing of the final
>> GPU work, after which the guard is delegated to the dma-fences).
>>
>> This is of course made complicated by our history. On top of the user's
>> objects, we also have the HW/kernel objects with their own lifetimes,
>> and a bunch of auxiliary objects used for working around unhappy HW and
>> for providing the legacy relocation mechanism. We add every auxiliary
>> object to the list of user objects required, and attempt to acquire them
>> en masse. Since all the objects can be known a priori, we can build a
>> list of those objects and pass that to a routine that can resolve the
>> -EDEADLK (and evictions). [To avoid relocations imposing a penalty on
>> sane userspace that avoids them, we do not touch any relocations until
>> necessary, at which point we have to unroll the state and rebuild a new
>> list with more auxiliary buffers to accommodate the extra 
>> copy_from_user].
>> More examples are included as to how we can break down operations
>> involving multiple objects into an acquire phase prior to those
>> operations, keeping the -EDEADLK handling under control.
>>
>> execbuf is the unique interface in that it deals with multiple user
>> and kernel buffers. After that, we have callers that in principle care
>> about accessing a single buffer, and so can be migrated over to a helper
>> that permits only holding one such buffer at a time. That enables us to
>> swap out obj->mm.lock for obj->base.resv->lock, and use lockdep to spot
>> illegal nesting, and to throw away the temporary pins by replacing them
>> with holding the ww_mutex for the duration instead.
>>
>> What's changed? Some patch splitting and we need to pull in Matthew's
>> patch to map the page directories under the ww_mutex.
> 
> I would still like to see a justification for the newly introduced async
> work, as opposed to adding it as an optimizing / regression-fixing series
> following the locking rework. That async work introduces a bunch of code
> complexity, and it would be beneficial to see a discussion of the
> tradeoffs and how it aligns with the upstream proposed dma-fence
> annotations.

On the topic of annotations, maybe do a trybot run with them enabled 
with the latest series and then see what pops up.

+Daniel, since last time he did that run I noticed that one of the
splats (possibly the only one?) was actually caused by
dma_fence_is_signaled, which I think comes under the opportunistic
signaling rule in the annotation kerneldoc, so it looked like a false
positive to me. Not sure how to avoid that one, apart from making it
call special, un-annotated flavours of dma_fence_signal(_locked).
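For reference, the annotations bracket the signalling critical section
roughly like this (a sketch of the proposed dma_fence_begin_signalling()
API):

    bool cookie;

    cookie = dma_fence_begin_signalling();
    /* no fence waits or reclaim-tainted allocations allowed in here */
    dma_fence_signal(fence);
    dma_fence_end_signalling(cookie);

dma_fence_is_signaled() can enter such a section implicitly, via the
->signaled() callback opportunistically signalling the fence, which is
presumably what produced the splat.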

Regards,

Tvrtko
Daniel Vetter Aug. 6, 2020, 11:55 a.m. UTC | #3
On Thu, Aug 6, 2020 at 11:21 AM Tvrtko Ursulin
<tvrtko.ursulin@linux.intel.com> wrote:
>
>
> On 05/08/2020 17:22, Thomas Hellström (Intel) wrote:
> > Hi, Chris,
> >
> >
> > On 8/5/20 2:21 PM, Chris Wilson wrote:
> >> Long story short, we need to manage evictions using dma_resv & dma_fence
> >> tracking. The backing storage will then be managed using the ww_mutex
> >> borrowed from (and shared via) obj->base.resv, rather than the current
> >> obj->mm.lock.
> >>
> >> Skipping over the breadcrumbs,
> >
> > While perhaps needed fixes, could we submit them as a separate series,
> > since they, from what I can tell, are not a direct part of the locking
> > rework, and some of them were actually part of a series that Dave NaK'ed
> > and may require additional justification?
> >
> >
> >>   the first step is to remove the final
> >> crutches of struct_mutex from execbuf and to broaden the hold for the
> >> dma-resv to guard not just publishing the dma-fences, but for the
> >> duration of the execbuf submission (holding all objects and their
> >> backing store from the point of acquisition to publishing of the final
> >> GPU work, after which the guard is delegated to the dma-fences).
> >>
> >> This is of course made complicated by our history. On top of the user's
> >> objects, we also have the HW/kernel objects with their own lifetimes,
> >> and a bunch of auxiliary objects used for working around unhappy HW and
> >> for providing the legacy relocation mechanism. We add every auxiliary
> >> object to the list of user objects required, and attempt to acquire them
> >> en masse. Since all the objects can be known a priori, we can build a
> >> list of those objects and pass that to a routine that can resolve the
> >> -EDEADLK (and evictions). [To avoid relocations imposing a penalty on
> >> sane userspace that avoids them, we do not touch any relocations until
> >> necessary, at which point we have to unroll the state and rebuild a new
> >> list with more auxiliary buffers to accommodate the extra
> >> copy_from_user].
> >> More examples are included as to how we can break down operations
> >> involving multiple objects into an acquire phase prior to those
> >> operations, keeping the -EDEADLK handling under control.
> >>
> >> execbuf is the unique interface in that it deals with multiple user
> >> and kernel buffers. After that, we have callers that in principle care
> >> about accessing a single buffer, and so can be migrated over to a helper
> >> that permits only holding one such buffer at a time. That enables us to
> >> swap out obj->mm.lock for obj->base.resv->lock, and use lockdep to spot
> >> illegal nesting, and to throw away the temporary pins by replacing them
> >> with holding the ww_mutex for the duration instead.
> >>
> >> What's changed? Some patch splitting and we need to pull in Matthew's
> >> patch to map the page directories under the ww_mutex.
> >
> > I would still like to see a justification for the newly introduced async
> > work, as opposed to adding it as an optimizing / regression-fixing series
> > following the locking rework. That async work introduces a bunch of code
> > complexity, and it would be beneficial to see a discussion of the
> > tradeoffs and how it aligns with the upstream proposed dma-fence
> > annotations.
>
> On the topic of annotations, maybe do a trybot run with them enabled
> with the latest series and then see what pops up.
>
> +Daniel, since last time he did that run I noticed that one of the
> splats (possibly the only one?) was actually caused by
> dma_fence_is_signaled, which I think comes under the opportunistic
> signaling rule in the annotation kerneldoc, so it looked like a false
> positive to me. Not sure how to avoid that one, apart from making it
> call special, un-annotated flavours of dma_fence_signal(_locked).

Yeah, that became a bit more constrained due to the switch from
recursive locks (which don't catch the actual wait vs signalling
issues, because lockdep is not as good as it should be) to explicitly
recursive read locks (which do catch the wait vs signal side of
things, but get more annoyed about creative locking schemes on the
read, i.e. signalling, side).
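Roughly, and glossing over the nesting/in_atomic() handling in the real
annotations, the mechanics look like this: the signalling side takes a
global lockdep map as a recursive read lock, while every wait pulses it
as a write acquisition, so lockdep reports wait-vs-signal inversions:

    static struct lockdep_map fence_map =
        STATIC_LOCKDEP_MAP_INIT("fence_signalling", &fence_map);

    /* signalling side: recursive read acquisition (read == 2) */
    lock_acquire(&fence_map, 0, 0, 2, 1, NULL, _THIS_IP_);
    /* ... signalling critical section ... */
    lock_release(&fence_map, _THIS_IP_);

    /* wait side: a momentary write acquisition records the dependency */
    lock_map_acquire(&fence_map);
    lock_map_release(&fence_map);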
-Daniel
Tvrtko Ursulin Aug. 6, 2020, 1:10 p.m. UTC | #4
On 06/08/2020 12:55, Daniel Vetter wrote:
> On Thu, Aug 6, 2020 at 11:21 AM Tvrtko Ursulin
> <tvrtko.ursulin@linux.intel.com> wrote:
>> On 05/08/2020 17:22, Thomas Hellström (Intel) wrote:
>>> Hi, Chris,
>>> On 8/5/20 2:21 PM, Chris Wilson wrote:
>>>> Long story short, we need to manage evictions using dma_resv & dma_fence
>>>> tracking. The backing storage will then be managed using the ww_mutex
>>>> borrowed from (and shared via) obj->base.resv, rather than the current
>>>> obj->mm.lock.
>>>>
>>>> Skipping over the breadcrumbs,
>>>
>>> While perhaps needed fixes, could we submit them as a separate series,
>>> since they, from what I can tell, are not a direct part of the locking
>>> rework, and some of them were actually part of a series that Dave NaK'ed
>>> and may require additional justification?
>>>
>>>
>>>>    the first step is to remove the final
>>>> crutches of struct_mutex from execbuf and to broaden the hold for the
>>>> dma-resv to guard not just publishing the dma-fences, but for the
>>>> duration of the execbuf submission (holding all objects and their
>>>> backing store from the point of acquisition to publishing of the final
>>>> GPU work, after which the guard is delegated to the dma-fences).
>>>>
>>>> This is of course made complicated by our history. On top of the user's
>>>> objects, we also have the HW/kernel objects with their own lifetimes,
>>>> and a bunch of auxiliary objects used for working around unhappy HW and
>>>> for providing the legacy relocation mechanism. We add every auxiliary
>>>> object to the list of user objects required, and attempt to acquire them
>>>> en masse. Since all the objects can be known a priori, we can build a
>>>> list of those objects and pass that to a routine that can resolve the
>>>> -EDEADLK (and evictions). [To avoid relocations imposing a penalty on
>>>> sane userspace that avoids them, we do not touch any relocations until
>>>> necessary, at which point we have to unroll the state and rebuild a new
>>>> list with more auxiliary buffers to accommodate the extra
>>>> copy_from_user].
>>>> More examples are included as to how we can break down operations
>>>> involving multiple objects into an acquire phase prior to those
>>>> operations, keeping the -EDEADLK handling under control.
>>>>
>>>> execbuf is the unique interface in that it deals with multiple user
>>>> and kernel buffers. After that, we have callers that in principle care
>>>> about accessing a single buffer, and so can be migrated over to a helper
>>>> that permits only holding one such buffer at a time. That enables us to
>>>> swap out obj->mm.lock for obj->base.resv->lock, and use lockdep to spot
>>>> illegal nesting, and to throw away the temporary pins by replacing them
>>>> with holding the ww_mutex for the duration instead.
>>>>
>>>> What's changed? Some patch splitting and we need to pull in Matthew's
>>>> patch to map the page directories under the ww_mutex.
>>>
>>> I would still like to see a justification for the newly introduced async
>>> work, as opposed to adding it as an optimizing / regression-fixing series
>>> following the locking rework. That async work introduces a bunch of code
>>> complexity, and it would be beneficial to see a discussion of the
>>> tradeoffs and how it aligns with the upstream proposed dma-fence
>>> annotations.
>>
>> On the topic of annotations, maybe do a trybot run with them enabled
>> with the latest series and then see what pops up.
>>
>> +Daniel, since last time he did that run I noticed that one of the
>> splats (possibly the only one?) was actually caused by
>> dma_fence_is_signaled, which I think comes under the opportunistic
>> signaling rule in the annotation kerneldoc, so it looked like a false
>> positive to me. Not sure how to avoid that one, apart from making it
>> call special, un-annotated flavours of dma_fence_signal(_locked).
> 
> Yeah, that became a bit more constrained due to the switch from
> recursive locks (which don't catch the actual wait vs signalling
> issues, because lockdep is not as good as it should be) to explicitly
> recursive read locks (which do catch the wait vs signal side of
> things, but get more annoyed about creative locking schemes on the
> read, i.e. signalling, side).

I did not get you: what became more constrained? Was the allowance for
opportunistic signaling retracted? But dma_fence_is_signaled is not even
that; I don't see how it makes sense to flag it as foul.

Or do you agree dma_fence_is_signaled should be made to call new
non-annotated helpers, as I suggested?

Regards,

Tvrtko
Chris Wilson Aug. 10, 2020, 9:51 a.m. UTC | #5
Quoting Tvrtko Ursulin (2020-08-06 10:21:38)
> 
> On 05/08/2020 17:22, Thomas Hellström (Intel) wrote:
> > Hi, Chris,
> > 
> > 
> > On 8/5/20 2:21 PM, Chris Wilson wrote:
> >> Long story short, we need to manage evictions using dma_resv & dma_fence
> >> tracking. The backing storage will then be managed using the ww_mutex
> >> borrowed from (and shared via) obj->base.resv, rather than the current
> >> obj->mm.lock.
> >>
> >> Skipping over the breadcrumbs,
> > 
> > While perhaps needed fixes, could we submit them as a separate series, 
> > since they, from what I can tell, are not a direct part of the locking 
> > rework, and some of them were actually part of a series that Dave NaK'ed 
> > and may require additional justification?
> > 
> > 
> >>   the first step is to remove the final
> >> crutches of struct_mutex from execbuf and to broaden the hold for the
> >> dma-resv to guard not just publishing the dma-fences, but for the
> >> duration of the execbuf submission (holding all objects and their
> >> backing store from the point of acquisition to publishing of the final
> >> GPU work, after which the guard is delegated to the dma-fences).
> >>
> >> This is of course made complicated by our history. On top of the user's
> >> objects, we also have the HW/kernel objects with their own lifetimes,
> >> and a bunch of auxiliary objects used for working around unhappy HW and
> >> for providing the legacy relocation mechanism. We add every auxiliary
> >> object to the list of user objects required, and attempt to acquire them
> >> en masse. Since all the objects can be known a priori, we can build a
> >> list of those objects and pass that to a routine that can resolve the
> >> -EDEADLK (and evictions). [To avoid relocations imposing a penalty on
> >> sane userspace that avoids them, we do not touch any relocations until
> >> necessary, at which point we have to unroll the state and rebuild a new
> >> list with more auxiliary buffers to accommodate the extra 
> >> copy_from_user].
> >> More examples are included as to how we can break down operations
> >> involving multiple objects into an acquire phase prior to those
> >> operations, keeping the -EDEADLK handling under control.
> >>
> >> execbuf is the unique interface in that it deals with multiple user
> >> and kernel buffers. After that, we have callers that in principle care
> >> about accessing a single buffer, and so can be migrated over to a helper
> >> that permits only holding one such buffer at a time. That enables us to
> >> swap out obj->mm.lock for obj->base.resv->lock, and use lockdep to spot
> >> illegal nesting, and to throw away the temporary pins by replacing them
> >> with holding the ww_mutex for the duration instead.
> >>
> >> What's changed? Some patch splitting and we need to pull in Matthew's
> >> patch to map the page directories under the ww_mutex.
> > 
> > I would still like to see a justification for the newly introduced async
> > work, as opposed to adding it as an optimizing / regression-fixing series
> > following the locking rework. That async work introduces a bunch of code
> > complexity, and it would be beneficial to see a discussion of the
> > tradeoffs and how it aligns with the upstream proposed dma-fence
> > annotations.
> 
> On the topic of annotations, maybe do a trybot run with them enabled 
> with the latest series and then see what pops up.

It didn't change since the run Daniel did. In that run there were two
splats I found:

vm->mutex -> dma_fence_is_signaled/dma_fence_signal -> async work + vm->mutex

This is the overconstraint Daniel mentioned: it's an entirely false
coupling, arising from the assumption that the async work runs inside the
parent's signaling context. Gut feeling says that should be resolvable
by using lockdep's dependency analysis between local execution/signal
contexts.

The other was

userptr mmu_notifier -> i915_vma_unbind + i915_vma_revoke -> mmu_notifier
[with the interplay between the two global lockmaps, two strikes?]

And that is the well-known false positive for userptr, which we break by
never allowing the userptr to be in the dev->mapping.

A path not exercised, but which it should find, is synchronous fence
eviction. While binding a fresh fence ensures the vma is idle before
starting, we do not do the same for the fence we steal. That's why care
was taken for execbuf, and it just means teaching the synchronous path
to similarly move the wait outside of the mutex (the simplest will be to
wrap the async handling).
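i.e. something of this shape for the synchronous path (illustrative
only, not the series' code; 'old' stands for the vma losing its fence):

    long ret;

    /* idle the vma whose fence we are stealing, before taking the mutex */
    ret = dma_resv_wait_timeout_rcu(old->obj->base.resv,
                                    true /* wait_all */, true /* intr */,
                                    MAX_SCHEDULE_TIMEOUT);
    if (ret < 0)
        return ret;

    mutex_lock(&ggtt->vm.mutex);
    /* ... reassign the fence register; its previous user is known idle ... */
    mutex_unlock(&ggtt->vm.mutex);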
-Chris
Tvrtko Ursulin Sept. 3, 2020, 2:25 p.m. UTC | #6
On 10/08/2020 10:51, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2020-08-06 10:21:38)
>>
>> On 05/08/2020 17:22, Thomas Hellström (Intel) wrote:
>>> Hi, Chris,
>>>
>>>
>>> On 8/5/20 2:21 PM, Chris Wilson wrote:
>>>> Long story short, we need to manage evictions using dma_resv & dma_fence
>>>> tracking. The backing storage will then be managed using the ww_mutex
>>>> borrowed from (and shared via) obj->base.resv, rather than the current
>>>> obj->mm.lock.
>>>>
>>>> Skipping over the breadcrumbs,
>>>
>>> While perhaps needed fixes, could we submit them as a separate series,
>>> since they, from what I can tell, are not a direct part of the locking
>>> rework, and some of them were actually part of a series that Dave NaK'ed
>>> and may require additional justification?
>>>
>>>
>>>>    the first step is to remove the final
>>>> crutches of struct_mutex from execbuf and to broaden the hold for the
>>>> dma-resv to guard not just publishing the dma-fences, but for the
>>>> duration of the execbuf submission (holding all objects and their
>>>> backing store from the point of acquisition to publishing of the final
>>>> GPU work, after which the guard is delegated to the dma-fences).
>>>>
>>>> This is of course made complicated by our history. On top of the user's
>>>> objects, we also have the HW/kernel objects with their own lifetimes,
>>>> and a bunch of auxiliary objects used for working around unhappy HW and
>>>> for providing the legacy relocation mechanism. We add every auxiliary
>>>> object to the list of user objects required, and attempt to acquire them
>>>> en masse. Since all the objects can be known a priori, we can build a
>>>> list of those objects and pass that to a routine that can resolve the
>>>> -EDEADLK (and evictions). [To avoid relocations imposing a penalty on
>>>> sane userspace that avoids them, we do not touch any relocations until
>>>> necessary, at which point we have to unroll the state and rebuild a new
>>>> list with more auxiliary buffers to accommodate the extra
>>>> copy_from_user].
>>>> More examples are included as to how we can break down operations
>>>> involving multiple objects into an acquire phase prior to those
>>>> operations, keeping the -EDEADLK handling under control.
>>>>
>>>> execbuf is the unique interface in that it deals with multiple user
>>>> and kernel buffers. After that, we have callers that in principle care
>>>> about accessing a single buffer, and so can be migrated over to a helper
>>>> that permits only holding one such buffer at a time. That enables us to
>>>> swap out obj->mm.lock for obj->base.resv->lock, and use lockdep to spot
>>>> illegal nesting, and to throw away the temporary pins by replacing them
>>>> with holding the ww_mutex for the duration instead.
>>>>
>>>> What's changed? Some patch splitting and we need to pull in Matthew's
>>>> patch to map the page directories under the ww_mutex.
>>>
>>> I would still like to see a justification for the newly introduced async
>>> work, as opposed to adding it as an optimizing / regression-fixing series
>>> following the locking rework. That async work introduces a bunch of code
>>> complexity, and it would be beneficial to see a discussion of the
>>> tradeoffs and how it aligns with the upstream proposed dma-fence
>>> annotations.
>>
>> On the topic of annotations, maybe do a trybot run with them enabled
>> with the latest series and then see what pops up.
> 
> It didn't change since the run Daniel did. In that run there were two
> splats I found:
> 
> vm->mutex -> dma_fence_is_signaled/dma_fence_signal -> async work + vm->mutex

Coming back to this after a few weeks - was there any further discussion
on the dma_fence_is_signaled problem? I don't think you ever replied,
Daniel, or at least I did not see it. Can the signaling section in there
possibly remain and not be obviously incorrect?

Regards,

Tvrtko

> 
> This is the overconstraint Daniel mentioned: it's an entirely false
> coupling, arising from the assumption that the async work runs inside the
> parent's signaling context. Gut feeling says that should be resolvable
> by using lockdep's dependency analysis between local execution/signal
> contexts.
> 
> The other was
> 
> userptr mmu_notifier -> i915_vma_unbind + i915_vma_revoke -> mmu_notifier
> [with the interplay between the two global lockmaps, two strikes?]
> 
> And that is the well-known false positive for userptr, which we break by
> never allowing the userptr to be in the dev->mapping.
> 
> A path not exercised, but which it should find, is synchronous fence
> eviction. While binding a fresh fence ensures the vma is idle before
> starting, we do not do the same for the fence we steal. That's why care
> was taken for execbuf, and it just means teaching the synchronous path
> to similarly move the wait outside of the mutex (the simplest will be to
> wrap the async handling).
> -Chris
>