dma-buf/sw_sync: fix lockdep anger

Message ID 1476389045-12535-1-git-send-email-robdclark@gmail.com (mailing list archive)
State New, archived

Commit Message

Rob Clark Oct. 13, 2016, 8:04 p.m. UTC
We were holding the wrong lock to be using fence_is_signaled_locked().
And holding the child_list_lock over something that could end up calling
fence cb's angers lockdep:

Comments

Chris Wilson Oct. 13, 2016, 8:17 p.m. UTC | #1
On Thu, Oct 13, 2016 at 04:04:05PM -0400, Rob Clark wrote:
> We were holding the wrong lock to be using fence_is_signaled_locked().
> And holding the child_list_lock over something that could end up calling
> fence cb's angers lockdep:
> 
> ======================================================
> [ INFO: possible circular locking dependency detected ]
> 4.7.0-rc7+ #489 Not tainted
> -------------------------------------------------------
> surfaceflinger/2034 is trying to acquire lock:
>  (&(&array->lock)->rlock){......}, at: [<ffff00000858cddc>] fence_signal+0x5c/0xf8
> 
> but task is already holding lock:
>  (&(&obj->child_list_lock)->rlock){......}, at: [<ffff0000085902f8>] sw_sync_ioctl+0x228/0x3b0
> 
> which lock already depends on the new lock.
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #1 (&(&obj->child_list_lock)->rlock){......}:
>        [<ffff000008108924>] __lock_acquire+0x173c/0x18d8
>        [<ffff000008108e0c>] lock_acquire+0x4c/0x68
>        [<ffff000008ac6a6c>] _raw_spin_lock_irqsave+0x54/0x70
>        [<ffff00000858d05c>] fence_add_callback+0x3c/0x100
>        [<ffff00000858f100>] fence_array_enable_signaling+0x80/0x170
>        [<ffff00000858d0d8>] fence_add_callback+0xb8/0x100
>        [<ffff00000858f504>] sync_file_poll+0xd4/0xf0
>        [<ffff0000081fd3a0>] do_sys_poll+0x220/0x438
>        [<ffff0000081fd8d0>] SyS_ppoll+0x1b0/0x1d8
>        [<ffff000008084f30>] el0_svc_naked+0x24/0x28
> 
> -> #0 (&(&array->lock)->rlock){......}:
>        [<ffff000008104768>] print_circular_bug+0x80/0x2e0
>        [<ffff0000081089ac>] __lock_acquire+0x17c4/0x18d8
>        [<ffff000008108e0c>] lock_acquire+0x4c/0x68
>        [<ffff000008ac6a6c>] _raw_spin_lock_irqsave+0x54/0x70
>        [<ffff00000858cddc>] fence_signal+0x5c/0xf8
>        [<ffff00000858f268>] fence_array_cb_func+0x78/0x88
>        [<ffff00000858cb28>] fence_signal_locked+0x80/0xe0
>        [<ffff0000085903c8>] sw_sync_ioctl+0x2f8/0x3b0
>        [<ffff0000081faf6c>] do_vfs_ioctl+0xa4/0x790
>        [<ffff0000081fb6e4>] SyS_ioctl+0x8c/0xa0
>        [<ffff000008084f30>] el0_svc_naked+0x24/0x28
> 
> other info that might help us debug this:
> 
>  Possible unsafe locking scenario:
> 
>        CPU0                    CPU1
>        ----                    ----
>   lock(&(&obj->child_list_lock)->rlock);
>                                lock(&(&array->lock)->rlock);
>                                lock(&(&obj->child_list_lock)->rlock);
>   lock(&(&array->lock)->rlock);
> 
>  *** DEADLOCK ***
> 
> 1 lock held by surfaceflinger/2034:
>  #0:  (&(&obj->child_list_lock)->rlock){......}, at: [<ffff0000085902f8>] sw_sync_ioctl+0x228/0x3b0
> 
> Signed-off-by: Rob Clark <robdclark@gmail.com>
> ---
> The fence_get()/_put() might be overkill.. wasn't sure if there was
> any path where the ref could get dropped while child_list_lock was
> released..
> 
>  drivers/dma-buf/sw_sync.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/dma-buf/sw_sync.c b/drivers/dma-buf/sw_sync.c
> index 62e8e6d..3bf8f5c 100644
> --- a/drivers/dma-buf/sw_sync.c
> +++ b/drivers/dma-buf/sw_sync.c
> @@ -146,8 +146,17 @@ static void sync_timeline_signal(struct sync_timeline *obj, unsigned int inc)
>  
>  	list_for_each_entry_safe(pt, next, &obj->active_list_head,
>  				 active_list) {
> -		if (fence_is_signaled_locked(&pt->base))
> +		struct fence *fence = fence_get(&pt->base);
> +		bool signaled;
> +
> +		spin_unlock_irqrestore(&obj->child_list_lock, flags);

Hmm. The obj->child_list_lock and fence->lock are one and the same (
fence->lock = &obj->child_list_lock). The problem lockdep is complaining
about appears to be nesting of identical lockclasses. (The current fence_cb
design allows for unbounded recursion from the signal callbacks.)
Aiui, this patch shouldn't be fixing anything as the fence_signal is
still fired from under the very same obj->child_list_lock.
-Chris
Rob Clark Oct. 13, 2016, 9:01 p.m. UTC | #2
On Thu, Oct 13, 2016 at 4:17 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Thu, Oct 13, 2016 at 04:04:05PM -0400, Rob Clark wrote:
>> We were holding the wrong lock to be using fence_is_signaled_locked().
>> And holding the child_list_lock over something that could end up calling
>> fence cb's angers lockdep:
>>
>> ======================================================
>> [ INFO: possible circular locking dependency detected ]
>> 4.7.0-rc7+ #489 Not tainted
>> -------------------------------------------------------
>> surfaceflinger/2034 is trying to acquire lock:
>>  (&(&array->lock)->rlock){......}, at: [<ffff00000858cddc>] fence_signal+0x5c/0xf8
>>
>> but task is already holding lock:
>>  (&(&obj->child_list_lock)->rlock){......}, at: [<ffff0000085902f8>] sw_sync_ioctl+0x228/0x3b0
>>
>> which lock already depends on the new lock.
>>
>> the existing dependency chain (in reverse order) is:
>>
>> -> #1 (&(&obj->child_list_lock)->rlock){......}:
>>        [<ffff000008108924>] __lock_acquire+0x173c/0x18d8
>>        [<ffff000008108e0c>] lock_acquire+0x4c/0x68
>>        [<ffff000008ac6a6c>] _raw_spin_lock_irqsave+0x54/0x70
>>        [<ffff00000858d05c>] fence_add_callback+0x3c/0x100
>>        [<ffff00000858f100>] fence_array_enable_signaling+0x80/0x170
>>        [<ffff00000858d0d8>] fence_add_callback+0xb8/0x100
>>        [<ffff00000858f504>] sync_file_poll+0xd4/0xf0
>>        [<ffff0000081fd3a0>] do_sys_poll+0x220/0x438
>>        [<ffff0000081fd8d0>] SyS_ppoll+0x1b0/0x1d8
>>        [<ffff000008084f30>] el0_svc_naked+0x24/0x28
>>
>> -> #0 (&(&array->lock)->rlock){......}:
>>        [<ffff000008104768>] print_circular_bug+0x80/0x2e0
>>        [<ffff0000081089ac>] __lock_acquire+0x17c4/0x18d8
>>        [<ffff000008108e0c>] lock_acquire+0x4c/0x68
>>        [<ffff000008ac6a6c>] _raw_spin_lock_irqsave+0x54/0x70
>>        [<ffff00000858cddc>] fence_signal+0x5c/0xf8
>>        [<ffff00000858f268>] fence_array_cb_func+0x78/0x88
>>        [<ffff00000858cb28>] fence_signal_locked+0x80/0xe0
>>        [<ffff0000085903c8>] sw_sync_ioctl+0x2f8/0x3b0
>>        [<ffff0000081faf6c>] do_vfs_ioctl+0xa4/0x790
>>        [<ffff0000081fb6e4>] SyS_ioctl+0x8c/0xa0
>>        [<ffff000008084f30>] el0_svc_naked+0x24/0x28
>>
>> other info that might help us debug this:
>>
>>  Possible unsafe locking scenario:
>>
>>        CPU0                    CPU1
>>        ----                    ----
>>   lock(&(&obj->child_list_lock)->rlock);
>>                                lock(&(&array->lock)->rlock);
>>                                lock(&(&obj->child_list_lock)->rlock);
>>   lock(&(&array->lock)->rlock);
>>
>>  *** DEADLOCK ***
>>
>> 1 lock held by surfaceflinger/2034:
>>  #0:  (&(&obj->child_list_lock)->rlock){......}, at: [<ffff0000085902f8>] sw_sync_ioctl+0x228/0x3b0
>>
>> Signed-off-by: Rob Clark <robdclark@gmail.com>
>> ---
>> The fence_get()/_put() might be overkill.. wasn't sure if there was
>> any path where the ref could get dropped while child_list_lock was
>> released..
>>
>>  drivers/dma-buf/sw_sync.c | 11 ++++++++++-
>>  1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/dma-buf/sw_sync.c b/drivers/dma-buf/sw_sync.c
>> index 62e8e6d..3bf8f5c 100644
>> --- a/drivers/dma-buf/sw_sync.c
>> +++ b/drivers/dma-buf/sw_sync.c
>> @@ -146,8 +146,17 @@ static void sync_timeline_signal(struct sync_timeline *obj, unsigned int inc)
>>
>>       list_for_each_entry_safe(pt, next, &obj->active_list_head,
>>                                active_list) {
>> -             if (fence_is_signaled_locked(&pt->base))
>> +             struct fence *fence = fence_get(&pt->base);
>> +             bool signaled;
>> +
>> +             spin_unlock_irqrestore(&obj->child_list_lock, flags);
>
> Hmm. The obj->child_list_lock and fence->lock are one and the same (
> fence->lock = &obj->child_list_lock).

Oh, right..  I didn't see that..

> The problem lockdep is complaining
> about appears to be nesting of identical lockclasses. (The current fence_cb
> design allows for unbounded recursion from the signal callbacks.)
> Aiui, this patch shouldn't be fixing anything as the fence_signal is
> still fired from under the very same obj->child_list_lock.

hmm, I didn't see the lockdep splat.. but maybe that is coming down
to the order of nesting fences under fence-array.. which might be
what is confusing lockdep?

BR,
-R

> -Chris
>
> --
> Chris Wilson, Intel Open Source Technology Centre
Rob Clark Oct. 16, 2016, 3:11 p.m. UTC | #3
On Thu, Oct 13, 2016 at 4:17 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Thu, Oct 13, 2016 at 04:04:05PM -0400, Rob Clark wrote:
>> We were holding the wrong lock to be using fence_is_signaled_locked().
>> And holding the child_list_lock over something that could end up calling
>> fence cb's angers lockdep:
>>
>> ======================================================
>> [ INFO: possible circular locking dependency detected ]
>> 4.7.0-rc7+ #489 Not tainted
>> -------------------------------------------------------
>> surfaceflinger/2034 is trying to acquire lock:
>>  (&(&array->lock)->rlock){......}, at: [<ffff00000858cddc>] fence_signal+0x5c/0xf8
>>
>> but task is already holding lock:
>>  (&(&obj->child_list_lock)->rlock){......}, at: [<ffff0000085902f8>] sw_sync_ioctl+0x228/0x3b0
>>
>> which lock already depends on the new lock.
>>
>> the existing dependency chain (in reverse order) is:
>>
>> -> #1 (&(&obj->child_list_lock)->rlock){......}:
>>        [<ffff000008108924>] __lock_acquire+0x173c/0x18d8
>>        [<ffff000008108e0c>] lock_acquire+0x4c/0x68
>>        [<ffff000008ac6a6c>] _raw_spin_lock_irqsave+0x54/0x70
>>        [<ffff00000858d05c>] fence_add_callback+0x3c/0x100
>>        [<ffff00000858f100>] fence_array_enable_signaling+0x80/0x170
>>        [<ffff00000858d0d8>] fence_add_callback+0xb8/0x100
>>        [<ffff00000858f504>] sync_file_poll+0xd4/0xf0
>>        [<ffff0000081fd3a0>] do_sys_poll+0x220/0x438
>>        [<ffff0000081fd8d0>] SyS_ppoll+0x1b0/0x1d8
>>        [<ffff000008084f30>] el0_svc_naked+0x24/0x28
>>
>> -> #0 (&(&array->lock)->rlock){......}:
>>        [<ffff000008104768>] print_circular_bug+0x80/0x2e0
>>        [<ffff0000081089ac>] __lock_acquire+0x17c4/0x18d8
>>        [<ffff000008108e0c>] lock_acquire+0x4c/0x68
>>        [<ffff000008ac6a6c>] _raw_spin_lock_irqsave+0x54/0x70
>>        [<ffff00000858cddc>] fence_signal+0x5c/0xf8
>>        [<ffff00000858f268>] fence_array_cb_func+0x78/0x88
>>        [<ffff00000858cb28>] fence_signal_locked+0x80/0xe0
>>        [<ffff0000085903c8>] sw_sync_ioctl+0x2f8/0x3b0
>>        [<ffff0000081faf6c>] do_vfs_ioctl+0xa4/0x790
>>        [<ffff0000081fb6e4>] SyS_ioctl+0x8c/0xa0
>>        [<ffff000008084f30>] el0_svc_naked+0x24/0x28
>>
>> other info that might help us debug this:
>>
>>  Possible unsafe locking scenario:
>>
>>        CPU0                    CPU1
>>        ----                    ----
>>   lock(&(&obj->child_list_lock)->rlock);
>>                                lock(&(&array->lock)->rlock);
>>                                lock(&(&obj->child_list_lock)->rlock);
>>   lock(&(&array->lock)->rlock);
>>
>>  *** DEADLOCK ***
>>
>> 1 lock held by surfaceflinger/2034:
>>  #0:  (&(&obj->child_list_lock)->rlock){......}, at: [<ffff0000085902f8>] sw_sync_ioctl+0x228/0x3b0
>>
>> Signed-off-by: Rob Clark <robdclark@gmail.com>
>> ---
>> The fence_get()/_put() might be overkill.. wasn't sure if there was
>> any path where the ref could get dropped while child_list_lock was
>> released..
>>
>>  drivers/dma-buf/sw_sync.c | 11 ++++++++++-
>>  1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/dma-buf/sw_sync.c b/drivers/dma-buf/sw_sync.c
>> index 62e8e6d..3bf8f5c 100644
>> --- a/drivers/dma-buf/sw_sync.c
>> +++ b/drivers/dma-buf/sw_sync.c
>> @@ -146,8 +146,17 @@ static void sync_timeline_signal(struct sync_timeline *obj, unsigned int inc)
>>
>>       list_for_each_entry_safe(pt, next, &obj->active_list_head,
>>                                active_list) {
>> -             if (fence_is_signaled_locked(&pt->base))
>> +             struct fence *fence = fence_get(&pt->base);
>> +             bool signaled;
>> +
>> +             spin_unlock_irqrestore(&obj->child_list_lock, flags);
>
> Hmm. The obj->child_list_lock and fence->lock are one and the same (
> fence->lock = &obj->child_list_lock). The problem lockdep is complaining
> about appears to be nesting of identical lockclasses. (The current fence_cb
> design allows for unbounded recursion from the signal callbacks.)
> Aiui, this patch shouldn't be fixing anything as the fence_signal is
> still fired from under the very same obj->child_list_lock.

So I looked at this a bit more.. the real problem is actually that we
hold the fence's lock when calling the callback..  so combine that
with array-fences, and you have the situation that when you
fence_add_callback() on the array-fence, the array-fence's lock is
acquired first, and then the array-member's lock is acquired.  But
when the array-member is signalled, its lock is acquired first and
then (when last array-member is signalled) the array-fence's lock is
acquired.

I guess the best thing is to avoid holding the fence's lock when
calling its cb?

BR,
-R

> -Chris
>
> --
> Chris Wilson, Intel Open Source Technology Centre
Patch

======================================================
[ INFO: possible circular locking dependency detected ]
4.7.0-rc7+ #489 Not tainted
-------------------------------------------------------
surfaceflinger/2034 is trying to acquire lock:
 (&(&array->lock)->rlock){......}, at: [<ffff00000858cddc>] fence_signal+0x5c/0xf8

but task is already holding lock:
 (&(&obj->child_list_lock)->rlock){......}, at: [<ffff0000085902f8>] sw_sync_ioctl+0x228/0x3b0

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&(&obj->child_list_lock)->rlock){......}:
       [<ffff000008108924>] __lock_acquire+0x173c/0x18d8
       [<ffff000008108e0c>] lock_acquire+0x4c/0x68
       [<ffff000008ac6a6c>] _raw_spin_lock_irqsave+0x54/0x70
       [<ffff00000858d05c>] fence_add_callback+0x3c/0x100
       [<ffff00000858f100>] fence_array_enable_signaling+0x80/0x170
       [<ffff00000858d0d8>] fence_add_callback+0xb8/0x100
       [<ffff00000858f504>] sync_file_poll+0xd4/0xf0
       [<ffff0000081fd3a0>] do_sys_poll+0x220/0x438
       [<ffff0000081fd8d0>] SyS_ppoll+0x1b0/0x1d8
       [<ffff000008084f30>] el0_svc_naked+0x24/0x28

-> #0 (&(&array->lock)->rlock){......}:
       [<ffff000008104768>] print_circular_bug+0x80/0x2e0
       [<ffff0000081089ac>] __lock_acquire+0x17c4/0x18d8
       [<ffff000008108e0c>] lock_acquire+0x4c/0x68
       [<ffff000008ac6a6c>] _raw_spin_lock_irqsave+0x54/0x70
       [<ffff00000858cddc>] fence_signal+0x5c/0xf8
       [<ffff00000858f268>] fence_array_cb_func+0x78/0x88
       [<ffff00000858cb28>] fence_signal_locked+0x80/0xe0
       [<ffff0000085903c8>] sw_sync_ioctl+0x2f8/0x3b0
       [<ffff0000081faf6c>] do_vfs_ioctl+0xa4/0x790
       [<ffff0000081fb6e4>] SyS_ioctl+0x8c/0xa0
       [<ffff000008084f30>] el0_svc_naked+0x24/0x28

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&(&obj->child_list_lock)->rlock);
                               lock(&(&array->lock)->rlock);
                               lock(&(&obj->child_list_lock)->rlock);
  lock(&(&array->lock)->rlock);

 *** DEADLOCK ***

1 lock held by surfaceflinger/2034:
 #0:  (&(&obj->child_list_lock)->rlock){......}, at: [<ffff0000085902f8>] sw_sync_ioctl+0x228/0x3b0

Signed-off-by: Rob Clark <robdclark@gmail.com>
---
The fence_get()/_put() might be overkill.. wasn't sure if there was
any path where the ref could get dropped while child_list_lock was
released..

 drivers/dma-buf/sw_sync.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/dma-buf/sw_sync.c b/drivers/dma-buf/sw_sync.c
index 62e8e6d..3bf8f5c 100644
--- a/drivers/dma-buf/sw_sync.c
+++ b/drivers/dma-buf/sw_sync.c
@@ -146,8 +146,17 @@  static void sync_timeline_signal(struct sync_timeline *obj, unsigned int inc)
 
 	list_for_each_entry_safe(pt, next, &obj->active_list_head,
 				 active_list) {
-		if (fence_is_signaled_locked(&pt->base))
+		struct fence *fence = fence_get(&pt->base);
+		bool signaled;
+
+		spin_unlock_irqrestore(&obj->child_list_lock, flags);
+		signaled = fence_is_signaled(fence);
+		spin_lock_irqsave(&obj->child_list_lock, flags);
+
+		if (signaled)
 			list_del_init(&pt->active_list);
+
+		fence_put(fence);
 	}
 
 	spin_unlock_irqrestore(&obj->child_list_lock, flags);
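The unlock/check/relock pattern the hunk introduces can be sketched in userspace (hypothetical names; an assert-checked flag stands in for the spinlock so the invariant is visible):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for obj->child_list_lock, assert-checked instead of a
 * real spinlock. */
static bool child_list_locked;

static void list_lock(void)   { assert(!child_list_locked); child_list_locked = true; }
static void list_unlock(void) { assert(child_list_locked);  child_list_locked = false; }

struct pt {
	bool signaled_flag;
	bool on_active_list;	/* stands in for list_del_init() */
	struct pt *next;
};

/* The signaled check may invoke fence callbacks that take other
 * locks, so it must not run under child_list_lock. */
static bool check_signaled(const struct pt *pt)
{
	assert(!child_list_locked);	/* the point of the patch */
	return pt->signaled_flag;
}

static void timeline_signal(struct pt *head)
{
	list_lock();
	for (struct pt *pt = head, *next; pt; pt = next) {
		next = pt->next;	/* _safe variant: grab next first */

		list_unlock();		/* drop lock around the check... */
		bool signaled = check_signaled(pt);
		list_lock();		/* ...retake before touching the list */

		if (signaled)
			pt->on_active_list = false;
	}
	list_unlock();
}
```

As Chris notes upthread, this only helps if the signaled check really takes a *different* lock; in sw_sync, fence->lock is the very same child_list_lock, which is why the thread questions whether this patch is a complete fix.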