[1/4] locking/ww_mutex: Fix a deadlock affecting ww_mutexes

Message ID 1479900325-28358-1-git-send-email-nhaehnle@gmail.com (mailing list archive)
State New, archived

Commit Message

Nicolai Hähnle Nov. 23, 2016, 11:25 a.m. UTC
From: Nicolai Hähnle <Nicolai.Haehnle@amd.com>

Fix a race condition involving 4 threads and 2 ww_mutexes, as shown in
the following example. Acquire context stamps are ordered like the thread
numbers, i.e. thread #1 should back off when it encounters a mutex locked
by thread #0, and so on.

Thread #0    Thread #1    Thread #2    Thread #3
---------    ---------    ---------    ---------
                                       lock(ww)
                                       success
             lock(ww')
             success
                          lock(ww)
             lock(ww)        .
                .            .         unlock(ww) part 1
lock(ww)        .            .            .
success         .            .            .
                .            .         unlock(ww) part 2
                .         back off
lock(ww')       .
   .            .
(stuck)      (stuck)

Here, unlock(ww) part 1 is the part that sets lock->base.count to 1
(without being protected by lock->base.wait_lock), meaning that thread #0
can acquire ww in the fast path or, much more likely, the medium path
in mutex_optimistic_spin. Either way the acquisition brings
lock->base.count back to 0, so the count check in
ww_mutex_set_context_fastpath concludes that nobody is waiting, and
thread #0 won't wake up any of the waiters.

Then, unlock(ww) part 2 wakes up _only_ the _first_ waiter of ww. This is
thread #2, since it started waiting before thread #1 and new waiters are
added at the tail of the wait list. Thread #2 wakes up and backs off since
it sees ww owned by a context with a lower stamp.

Meanwhile, thread #1 is never woken up, and so it won't back off its lock
on ww'. So thread #0 gets stuck waiting for ww' to be released.
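
For reference, the threads in the example are assumed to follow the usual
ww_mutex acquire/back-off pattern, roughly as sketched below ("backing
off" is the -EDEADLK branch; my_class, struct my_obj and lock_two() are
illustrative names rather than part of this patch, and the error handling
is simplified):

#include <linux/ww_mutex.h>

/* objects are assumed to be set up with ww_mutex_init(&obj->lock, &my_class) */
static DEFINE_WW_CLASS(my_class);

struct my_obj {
        struct ww_mutex lock;
};

static int lock_two(struct my_obj *a, struct my_obj *b)
{
        struct ww_acquire_ctx ctx;
        int err;

        ww_acquire_init(&ctx, &my_class);

        err = ww_mutex_lock(&a->lock, &ctx);
        if (err)
                goto out_fini;

        err = ww_mutex_lock(&b->lock, &ctx);
        if (err == -EDEADLK) {
                /*
                 * Back off: b is owned by a context with a lower stamp.
                 * Drop everything we hold, sleep on b until we get it,
                 * then retake the other lock.  A waiter that is already
                 * asleep inside ww_mutex_lock() only reaches this branch
                 * if it is woken up when b's owner changes -- the wakeup
                 * this patch fixes.
                 */
                ww_mutex_unlock(&a->lock);
                ww_mutex_lock_slow(&b->lock, &ctx);
                err = ww_mutex_lock(&a->lock, &ctx);
                if (err) {
                        /* could be -EDEADLK again; a full version retries */
                        ww_mutex_unlock(&b->lock);
                        goto out_fini;
                }
        } else if (err) {
                ww_mutex_unlock(&a->lock);
                goto out_fini;
        }

        ww_acquire_done(&ctx);

        /* ... work with a and b ... */

        ww_mutex_unlock(&a->lock);
        ww_mutex_unlock(&b->lock);
out_fini:
        ww_acquire_fini(&ctx);
        return err;
}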

This patch fixes the deadlock by waking up all waiters in the slow path
of ww_mutex_unlock.

We have an internal test case for amdgpu which continuously submits
command streams from tens of threads, where all command streams reference
hundreds of GPU buffer objects with a lot of overlap in the buffer lists
between command streams. This test reliably caused a deadlock, and while I
haven't completely confirmed that it is exactly the scenario outlined
above, this patch does fix the test case.

v2:
- use wake_q_add
- add additional explanations

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Cc: dri-devel@lists.freedesktop.org
Cc: stable@vger.kernel.org
Reviewed-by: Christian König <christian.koenig@amd.com> (v1)
Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
---
 kernel/locking/mutex.c | 33 +++++++++++++++++++++++++++++----
 1 file changed, 29 insertions(+), 4 deletions(-)

Comments

Daniel Vetter Nov. 23, 2016, 12:50 p.m. UTC | #1
On Wed, Nov 23, 2016 at 12:25:22PM +0100, Nicolai Hähnle wrote:
> [...]

Yeah, when the owning ctx changes we need to wake up all waiters, to make
sure we catch all (new) deadlock scenarios. And I tried poking at your
example, and I think it's solid and can't be minimized any further. I
don't have much clue on mutex.c code itself, but the changes seem
reasonable. With that caveat:

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>

Cheers, Daniel

Peter Zijlstra Nov. 23, 2016, 1 p.m. UTC | #2
On Wed, Nov 23, 2016 at 12:25:22PM +0100, Nicolai Hähnle wrote:
> [...]

Completely and utterly fails to apply; I think this patch is based on
code prior to the mutex rewrite.

Please rebase on tip/locking/core.

Also, is this a regression, or has this been a 'feature' of the ww_mutex
code from early on?
Daniel Vetter Nov. 23, 2016, 1:08 p.m. UTC | #3
On Wed, Nov 23, 2016 at 02:00:46PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 23, 2016 at 12:25:22PM +0100, Nicolai Hähnle wrote:
> > [...]
> 
> Completely and utterly fails to apply; I think this patch is based on
> code prior to the mutex rewrite.
> 
> Please rebase on tip/locking/core.
> 
> Also, is this a regression, or has this been a 'feature' of the ww_mutex
> code from early on?

Sorry forgot to mention that, but I checked. Seems to have been broken
since day 1, at least looking at the original code the wake-single-waiter
stuff is as old as the mutex code added in 2006.
-Daniel
Daniel Vetter Nov. 23, 2016, 1:11 p.m. UTC | #4
On Wed, Nov 23, 2016 at 02:08:48PM +0100, Daniel Vetter wrote:
> On Wed, Nov 23, 2016 at 02:00:46PM +0100, Peter Zijlstra wrote:
> > On Wed, Nov 23, 2016 at 12:25:22PM +0100, Nicolai Hähnle wrote:
> > > [...]
> > 
> > Completely and utterly fails to apply; I think this patch is based on
> > code prior to the mutex rewrite.
> > 
> > Please rebase on tip/locking/core.
> > 
> > Also, is this a regression, or has this been a 'feature' of the ww_mutex
> > code from early on?
> 
> Sorry forgot to mention that, but I checked. Seems to have been broken
> since day 1, at least looking at the original code the wake-single-waiter
> stuff is as old as the mutex code added in 2006.

More details: For gpu drivers this was originally working, since the
ww_mutex implementation in ttm did use wake_up_all. So need to add a

Fixes: 5e338405119a ("drm/ttm: convert to the reservation api")
Maarten Lankhorst Nov. 23, 2016, 1:33 p.m. UTC | #5
On 23-11-16 at 14:11, Daniel Vetter wrote:
> On Wed, Nov 23, 2016 at 02:08:48PM +0100, Daniel Vetter wrote:
>> On Wed, Nov 23, 2016 at 02:00:46PM +0100, Peter Zijlstra wrote:
>>> On Wed, Nov 23, 2016 at 12:25:22PM +0100, Nicolai Hähnle wrote:
>>>> [...]
>>> Completely and utterly fails to apply; I think this patch is based on
>>> code prior to the mutex rewrite.
>>>
>>> Please rebase on tip/locking/core.
>>>
>>> Also, is this a regression, or has this been a 'feature' of the ww_mutex
>>> code from early on?
>> Sorry forgot to mention that, but I checked. Seems to have been broken
>> since day 1, at least looking at the original code the wake-single-waiter
>> stuff is as old as the mutex code added in 2006.
> More details: For gpu drivers this was originally working, since the
> ww_mutex implementation in ttm did use wake_up_all. So need to add a
>
> Fixes: 5e338405119a ("drm/ttm: convert to the reservation api")

But it's broken in the original kernel ww_mutex implementation; I guess it doesn't matter much since ttm was the first in-kernel user anyway. :)

~Maarten
Peter Zijlstra Nov. 23, 2016, 2:03 p.m. UTC | #6
On Wed, Nov 23, 2016 at 12:25:22PM +0100, Nicolai Hähnle wrote:
> @@ -473,7 +476,14 @@ void __sched ww_mutex_unlock(struct ww_mutex *lock)
>  	 */
>  	mutex_clear_owner(&lock->base);
>  #endif
> -	__mutex_fastpath_unlock(&lock->base.count, __mutex_unlock_slowpath);
> +	/*
> +	 * A previously _not_ waiting task may acquire the lock via the fast
> +	 * path during our unlock. In that case, already waiting tasks may have
> +	 * to back off to avoid a deadlock. Wake up all waiters so that they
> +	 * can check their acquire context stamp against the new owner.
> +	 */
> +	__mutex_fastpath_unlock(&lock->base.count,
> +				__mutex_unlock_slowpath_wakeall);
>  }

So doing a wake-all has obvious issues with thundering herd etc. Also,
with the new mutex, you'd not be able to do hand-off, which would
introduce starvation cases.

Ideally we'd iterate the blocked list and pick the waiter with the
earliest stamp, or we'd maintain the list in stamp order instead of
FIFO, for ww_mutex.
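
For the second option, something along these lines in kernel/locking/mutex.c
(a rough sketch only: it assumes struct mutex_waiter grows a ww_ctx pointer,
which it does not have today, and it ignores stamp wraparound):

static void ww_mutex_add_waiter_sorted(struct mutex_waiter *waiter,
                                       struct mutex *lock,
                                       struct ww_acquire_ctx *ww_ctx)
{
        struct mutex_waiter *cur;

        waiter->ww_ctx = ww_ctx;        /* hypothetical extra field */

        list_for_each_entry(cur, &lock->wait_list, list) {
                /*
                 * Insert before the first ww waiter with a younger
                 * (higher) stamp, so the oldest context sits at the head
                 * of wait_list and is woken first.  Waiters without an
                 * acquire context (ctx == NULL) are simply skipped here.
                 */
                if (cur->ww_ctx && ww_ctx->stamp < cur->ww_ctx->stamp) {
                        list_add_tail(&waiter->list, &cur->list);
                        return;
                }
        }

        list_add_tail(&waiter->list, &lock->wait_list);
}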
Daniel Vetter Nov. 23, 2016, 2:25 p.m. UTC | #7
On Wed, Nov 23, 2016 at 03:03:36PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 23, 2016 at 12:25:22PM +0100, Nicolai Hähnle wrote:
> > @@ -473,7 +476,14 @@ void __sched ww_mutex_unlock(struct ww_mutex *lock)
> >  	 */
> >  	mutex_clear_owner(&lock->base);
> >  #endif
> > -	__mutex_fastpath_unlock(&lock->base.count, __mutex_unlock_slowpath);
> > +	/*
> > +	 * A previously _not_ waiting task may acquire the lock via the fast
> > +	 * path during our unlock. In that case, already waiting tasks may have
> > +	 * to back off to avoid a deadlock. Wake up all waiters so that they
> > +	 * can check their acquire context stamp against the new owner.
> > +	 */
> > +	__mutex_fastpath_unlock(&lock->base.count,
> > +				__mutex_unlock_slowpath_wakeall);
> >  }
> 
> So doing a wake-all has obvious issues with thundering herd etc. Also,
> with the new mutex, you'd not be able to do hand-off, which would
> introduce starvation cases.
> 
> Ideally we'd iterate the blocked list and pick the waiter with the
> earliest stamp, or we'd maintain the list in stamp order instead of
> FIFO, for ww_mutex.

Not sure we'll win that much, at least I think we still need to wake up
everyone with earlier stamp than the one of the task that just released
the lock. Otherwise there's deadlocks. So just cuts the wakeups in half,
on average.

What we could do is do a handoff-hint with the timestamp of earliest task
we believe should get the lock. Everyone with a later timestamp that gets
woken then knows that they definitely have a deadlock situation and need
to back off (thread 2 in the example).

thread 1 would get woken, and would be able to take the lock, except when
thread 0 successfully raced it and stole the lock. And anyone else racing
in with later timestamps would also immediately back off, ensuring
fairness.

Without thinking it through in detail this is a PI issue, except that we
replace boosting with wakeup&back-off. Could we perhaps steal something
from rt mutexes to make it fair&efficient?
-Daniel
Peter Zijlstra Nov. 23, 2016, 2:32 p.m. UTC | #8
On Wed, Nov 23, 2016 at 03:25:25PM +0100, Daniel Vetter wrote:

> Without thinking it through in detail this is a PI issue, except that we
> replace boosting with wakeup&back-off. Could we perhaps steal something
> from rt mutexes to make it fair&efficient?

rt_mutexes order the waiters by 'priority' and wake the topmost.
Nicolai Hähnle Nov. 24, 2016, 11:26 a.m. UTC | #9
On 23.11.2016 15:25, Daniel Vetter wrote:
> On Wed, Nov 23, 2016 at 03:03:36PM +0100, Peter Zijlstra wrote:
>> On Wed, Nov 23, 2016 at 12:25:22PM +0100, Nicolai Hähnle wrote:
>>> @@ -473,7 +476,14 @@ void __sched ww_mutex_unlock(struct ww_mutex *lock)
>>>  	 */
>>>  	mutex_clear_owner(&lock->base);
>>>  #endif
>>> -	__mutex_fastpath_unlock(&lock->base.count, __mutex_unlock_slowpath);
>>> +	/*
>>> +	 * A previously _not_ waiting task may acquire the lock via the fast
>>> +	 * path during our unlock. In that case, already waiting tasks may have
>>> +	 * to back off to avoid a deadlock. Wake up all waiters so that they
>>> +	 * can check their acquire context stamp against the new owner.
>>> +	 */
>>> +	__mutex_fastpath_unlock(&lock->base.count,
>>> +				__mutex_unlock_slowpath_wakeall);
>>>  }
>>
>> So doing a wake-all has obvious issues with thundering herd etc. Also,
>> with the new mutex, you'd not be able to do hand-off, which would
>> introduce starvation cases.
>>
>> Ideally we'd iterate the blocked list and pick the waiter with the
>> earliest stamp, or we'd maintain the list in stamp order instead of
>> FIFO, for ww_mutex.
>
> Not sure we'll win that much, at least I think we still need to wake up
> everyone with earlier stamp than the one of the task that just released
> the lock. Otherwise there's deadlocks. So just cuts the wakeups in half,
> on average.
>
> What we could do is do a handoff-hint with the timestamp of earliest task
> we believe should get the lock. Everyone with a later timestamp that gets
> woken then knows that they definitely have a deadlock situation and need
> to back off (thread 2 in the example).
>
> thread 1 would get woken, and would be able to take the lock, except when
> thread 0 successfully raced it and stole the lock. And anyone else racing
> in with later timestamps would also immediately back off, ensuring
> fairness.

I did consider maintaining stamp order in the waiter list and originally 
decided against it because I just wanted a simple and conservative fix 
to avoid adding new regressions.

Now that a different approach is needed for >= 4.9 anyway mostly due to 
the hand-off logic, I'm reconsidering this.

I do believe we can win a bit by keeping the wait list sorted, if we 
also make sure that waiters don't add themselves in the first place if 
they see that a deadlock situation cannot be avoided.

I will probably want to extend struct mutex_waiter with 
ww_mutex-specific fields to facilitate this (i.e. ctx pointer, perhaps 
stamp as well to reduce pointer-chasing). That should be fine since it 
lives on the stack.
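
Roughly like this (sketch only; names and exact layout not final):

struct mutex_waiter {
        struct list_head        list;
        struct task_struct      *task;
#ifdef CONFIG_DEBUG_MUTEXES
        void                    *magic;
#endif
        /* proposed ww_mutex-specific additions */
        struct ww_acquire_ctx   *ww_ctx;        /* NULL for plain mutex_lock() */
        unsigned long           stamp;          /* cached ww_ctx->stamp */
};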

In the meantime, I'd appreciate it if patch #1 could be accepted as-is 
for stable updates to <= 4.8. It fixes a real (if rare) bug, and the 
stampede inefficiency isn't a problem in practice at least for GPU 
applications.

Thanks,
Nicolai

>
> Without thinking it through in detail this is a PI issue, except that we
> replace boosting with wakeup&back-off. Could we perhaps steal something
> from rt mutexes to make it fair&efficient?
> -Daniel
>
Peter Zijlstra Nov. 24, 2016, 11:40 a.m. UTC | #10
On Thu, Nov 24, 2016 at 12:26:57PM +0100, Nicolai Hähnle wrote:

> I do believe we can win a bit by keeping the wait list sorted, if we also
> make sure that waiters don't add themselves in the first place if they see
> that a deadlock situation cannot be avoided.
> 
> I will probably want to extend struct mutex_waiter with ww_mutex-specific
> fields to facilitate this (i.e. ctx pointer, perhaps stamp as well to reduce
> pointer-chasing). That should be fine since it lives on the stack.

Right, shouldn't be a problem I think.

The only 'problem' I can see with using that is that it's possible to mix
ww and !ww waiters through ww_mutex_lock(.ctx = NULL). This makes the
list order somewhat tricky.

Ideally we'd remove that feature, although I see it's actually used quite
a bit :/

> In the meantime, I'd appreciate it if patch #1 could be accepted as-is for
> stable updates to <= 4.8. It fixes a real (if rare) bug, and the stampede
> inefficiency isn't a problem in practice at least for GPU applications.

Sorry can't do. We don't do stable patches that don't have anything
upstream.
Daniel Vetter Nov. 24, 2016, 11:52 a.m. UTC | #11
On Thu, Nov 24, 2016 at 12:40 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>
>> I do believe we can win a bit by keeping the wait list sorted, if we also
>> make sure that waiters don't add themselves in the first place if they see
>> that a deadlock situation cannot be avoided.
>>
>> I will probably want to extend struct mutex_waiter with ww_mutex-specific
>> fields to facilitate this (i.e. ctx pointer, perhaps stamp as well to reduce
>> pointer-chasing). That should be fine since it lives on the stack.
>
> Right, shouldn't be a problem I think.
>
> The only 'problem' I can see with using that is that it's possible to mix
> ww and !ww waiters through ww_mutex_lock(.ctx = NULL). This makes the
> list order somewhat tricky.
>
> Ideally we'd remove that feature, although I see it's actually used quite
> a bit :/

I guess we could create a small fake acquire_ctx for single-lock
paths. That way callers still don't need to deal with having an
explicit ctx, but we can assume the timestamp (for ensuring fairness)
is available for all cases. Otherwise there's indeed a problem with
correctly (well fairly) interleaving ctx and non-ctx lockers I think.
-Daniel
Peter Zijlstra Nov. 24, 2016, 11:56 a.m. UTC | #12
On Thu, Nov 24, 2016 at 12:52:25PM +0100, Daniel Vetter wrote:
> On Thu, Nov 24, 2016 at 12:40 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> >> I do believe we can win a bit by keeping the wait list sorted, if we also
> >> make sure that waiters don't add themselves in the first place if they see
> >> that a deadlock situation cannot be avoided.
> >>
> >> I will probably want to extend struct mutex_waiter with ww_mutex-specific
> >> fields to facilitate this (i.e. ctx pointer, perhaps stamp as well to reduce
> >> pointer-chasing). That should be fine since it lives on the stack.
> >
> > Right, shouldn't be a problem I think.
> >
> > The only 'problem' I can see with using that is that it's possible to mix
> > ww and !ww waiters through ww_mutex_lock(.ctx = NULL). This makes the
> > list order somewhat tricky.
> >
> > Ideally we'd remove that feature, although I see it's actually used quite
> > a bit :/
> 
> I guess we could create a small fake acquire_ctx for single-lock
> paths. That way callers still don't need to deal with having an
> explicit ctx, but we can assume the timestamp (for ensuring fairness)
> is available for all cases. Otherwise there's indeed a problem with
> correctly (well fairly) interleaving ctx and non-ctx lockers I think.

Actually tried that, but we need a ww_class to get a stamp from, and
ww_mutex_lock() doesn't have one of those.
Nicolai Hähnle Nov. 24, 2016, 12:05 p.m. UTC | #13
On 24.11.2016 12:56, Peter Zijlstra wrote:
> On Thu, Nov 24, 2016 at 12:52:25PM +0100, Daniel Vetter wrote:
>> On Thu, Nov 24, 2016 at 12:40 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>
>>>> I do believe we can win a bit by keeping the wait list sorted, if we also
>>>> make sure that waiters don't add themselves in the first place if they see
>>>> that a deadlock situation cannot be avoided.
>>>>
>>>> I will probably want to extend struct mutex_waiter with ww_mutex-specific
>>>> fields to facilitate this (i.e. ctx pointer, perhaps stamp as well to reduce
>>>> pointer-chasing). That should be fine since it lives on the stack.
>>>
>>> Right, shouldn't be a problem I think.
>>>
>>> The only 'problem' I can see with using that is that it's possible to mix
>>> ww and !ww waiters through ww_mutex_lock(.ctx = NULL). This makes the
>>> list order somewhat tricky.
>>>
>>> Ideally we'd remove that feature, although I see it's actually used quite
>>> a bit :/
>>
>> I guess we could create a small fake acquire_ctx for single-lock
>> paths. That way callers still don't need to deal with having an
>> explicit ctx, but we can assume the timestamp (for ensuring fairness)
>> is available for all cases. Otherwise there's indeed a problem with
>> correctly (well fairly) interleaving ctx and non-ctx lockers I think.
>
> Actually tried that, but we need a ww_class to get a stamp from, and
> ww_mutex_lock() doesn't have one of those.

The acquire context needs to be live until the unlock anyway, so this is 
something that requires modifying the callers of ww_mutex_lock. Those 
should all have a ww_class available, or something is very wrong :)

Nicolai

Patch

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index a70b90d..7fbf9b4 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -409,6 +409,9 @@  static bool mutex_optimistic_spin(struct mutex *lock,
 __visible __used noinline
 void __sched __mutex_unlock_slowpath(atomic_t *lock_count);
 
+static __used noinline
+void __sched __mutex_unlock_slowpath_wakeall(atomic_t *lock_count);
+
 /**
  * mutex_unlock - release the mutex
  * @lock: the mutex to be released
@@ -473,7 +476,14 @@  void __sched ww_mutex_unlock(struct ww_mutex *lock)
 	 */
 	mutex_clear_owner(&lock->base);
 #endif
-	__mutex_fastpath_unlock(&lock->base.count, __mutex_unlock_slowpath);
+	/*
+	 * A previously _not_ waiting task may acquire the lock via the fast
+	 * path during our unlock. In that case, already waiting tasks may have
+	 * to back off to avoid a deadlock. Wake up all waiters so that they
+	 * can check their acquire context stamp against the new owner.
+	 */
+	__mutex_fastpath_unlock(&lock->base.count,
+				__mutex_unlock_slowpath_wakeall);
 }
 EXPORT_SYMBOL(ww_mutex_unlock);
 
@@ -716,7 +726,7 @@  EXPORT_SYMBOL_GPL(__ww_mutex_lock_interruptible);
  * Release the lock, slowpath:
  */
 static inline void
-__mutex_unlock_common_slowpath(struct mutex *lock, int nested)
+__mutex_unlock_common_slowpath(struct mutex *lock, int nested, int wake_all)
 {
 	unsigned long flags;
 	WAKE_Q(wake_q);
@@ -740,7 +750,14 @@  __mutex_unlock_common_slowpath(struct mutex *lock, int nested)
 	mutex_release(&lock->dep_map, nested, _RET_IP_);
 	debug_mutex_unlock(lock);
 
-	if (!list_empty(&lock->wait_list)) {
+	if (wake_all) {
+		struct mutex_waiter *waiter;
+
+		list_for_each_entry(waiter, &lock->wait_list, list) {
+			debug_mutex_wake_waiter(lock, waiter);
+			wake_q_add(&wake_q, waiter->task);
+		}
+	} else if (!list_empty(&lock->wait_list)) {
 		/* get the first entry from the wait-list: */
 		struct mutex_waiter *waiter =
 				list_entry(lock->wait_list.next,
@@ -762,7 +779,15 @@  __mutex_unlock_slowpath(atomic_t *lock_count)
 {
 	struct mutex *lock = container_of(lock_count, struct mutex, count);
 
-	__mutex_unlock_common_slowpath(lock, 1);
+	__mutex_unlock_common_slowpath(lock, 1, 0);
+}
+
+static void
+__mutex_unlock_slowpath_wakeall(atomic_t *lock_count)
+{
+	struct mutex *lock = container_of(lock_count, struct mutex, count);
+
+	__mutex_unlock_common_slowpath(lock, 1, 1);
 }
 
 #ifndef CONFIG_DEBUG_LOCK_ALLOC