diff mbox series

[v1] Bluetooth: hci_sync: Fix disconnect complete event timeout issue

Message ID 20241216080758.3450976-1-quic_chejiang@quicinc.com (mailing list archive)
State New
Headers show
Series [v1] Bluetooth: hci_sync: Fix disconnect complete event timeout issue | expand

Checks

Context Check Description
tedd_an/pre-ci_am success Success
tedd_an/SubjectPrefix success Gitlint PASS
tedd_an/BuildKernel success BuildKernel PASS
tedd_an/CheckAllWarning success CheckAllWarning PASS
tedd_an/CheckSparse warning CheckSparse WARNING net/bluetooth/hci_event.c: note: in included file (through include/net/bluetooth/hci_core.h):
tedd_an/BuildKernel32 success BuildKernel32 PASS
tedd_an/TestRunnerSetup success TestRunnerSetup PASS
tedd_an/TestRunner_l2cap-tester success TestRunner PASS
tedd_an/TestRunner_iso-tester success TestRunner PASS
tedd_an/TestRunner_bnep-tester success TestRunner PASS
tedd_an/TestRunner_mgmt-tester fail TestRunner_mgmt-tester: Total: 490, Passed: 483 (98.6%), Failed: 3, Not Run: 4
tedd_an/TestRunner_rfcomm-tester success TestRunner PASS
tedd_an/TestRunner_sco-tester fail TestRunner_sco-tester: BUG: KASAN: slab-use-after-free in hci_abort_conn_sync+0xc92/0xf30
tedd_an/TestRunner_ioctl-tester success TestRunner PASS
tedd_an/TestRunner_mesh-tester success TestRunner PASS
tedd_an/TestRunner_smp-tester success TestRunner PASS
tedd_an/TestRunner_userchan-tester success TestRunner PASS

Commit Message

Cheng Jiang (IOE) Dec. 16, 2024, 8:07 a.m. UTC
Sometimes, the remote device doesn't acknowledge the LL_TERMINATE_IND
in time, requiring the controller to wait for the supervision timeout,
which may exceed 2 seconds. In the current implementation, the
HCI_EV_DISCONN_COMPLETE event is ignored if it arrives late, since
the hci_abort_conn_sync has cleaned up the connection after 2 seconds.
This causes the mgmt to get stuck, resulting in bluetoothd waiting
indefinitely for the mgmt response to the disconnect. To recover,
restarting bluetoothd is necessary.

bluetoothctl log like this:
[Designer Mouse]# disconnect D9:B5:6C:F2:51:91
Attempting to disconnect from D9:B5:6C:F2:51:91
[Designer Mouse]#
[Designer Mouse]# power off
[Designer Mouse]#
Failed to set power off: org.freedesktop.DBus.Error.NoReply.

Signed-off-by: Cheng Jiang <quic_chejiang@quicinc.com>
---
 include/net/bluetooth/hci_core.h |  2 ++
 net/bluetooth/hci_conn.c         |  9 +++++++++
 net/bluetooth/hci_event.c        |  9 +++++++++
 net/bluetooth/hci_sync.c         | 18 ++++++++++++++++++
 4 files changed, 38 insertions(+)


base-commit: e25c8d66f6786300b680866c0e0139981273feba

Comments

bluez.test.bot@gmail.com Dec. 16, 2024, 8:38 a.m. UTC | #1
This is automated email and please do not reply to this email!

Dear submitter,

Thank you for submitting the patches to the linux bluetooth mailing list.
This is a CI test results with your patch series:
PW Link:https://patchwork.kernel.org/project/bluetooth/list/?series=918106

---Test result---

Test Summary:
CheckPatch                    PENDING   0.40 seconds
GitLint                       PENDING   0.25 seconds
SubjectPrefix                 PASS      0.07 seconds
BuildKernel                   PASS      25.14 seconds
CheckAllWarning               PASS      27.29 seconds
CheckSparse                   WARNING   30.84 seconds
BuildKernel32                 PASS      24.66 seconds
TestRunnerSetup               PASS      437.78 seconds
TestRunner_l2cap-tester       PASS      20.72 seconds
TestRunner_iso-tester         PASS      30.18 seconds
TestRunner_bnep-tester        PASS      4.86 seconds
TestRunner_mgmt-tester        FAIL      122.69 seconds
TestRunner_rfcomm-tester      PASS      7.66 seconds
TestRunner_sco-tester         FAIL      9.40 seconds
TestRunner_ioctl-tester       PASS      8.14 seconds
TestRunner_mesh-tester        PASS      5.96 seconds
TestRunner_smp-tester         PASS      6.98 seconds
TestRunner_userchan-tester    PASS      5.05 seconds
IncrementalBuild              PENDING   0.72 seconds

Details
##############################
Test: CheckPatch - PENDING
Desc: Run checkpatch.pl script
Output:

##############################
Test: GitLint - PENDING
Desc: Run gitlint
Output:

##############################
Test: CheckSparse - WARNING
Desc: Run sparse tool with linux kernel
Output:
net/bluetooth/hci_event.c: note: in included file (through include/net/bluetooth/hci_core.h):
##############################
Test: TestRunner_mgmt-tester - FAIL
Desc: Run mgmt-tester with test-runner
Output:
Total: 490, Passed: 483 (98.6%), Failed: 3, Not Run: 4

Failed Test Cases
LL Privacy - Set Flags 1 (Add to RL)                 Failed       0.153 seconds
LL Privacy - Set Flags 3 (2 Devices to RL)           Failed       0.169 seconds
LL Privacy - Start Discovery 2 (Disable RL)          Failed       0.187 seconds
##############################
Test: TestRunner_sco-tester - FAIL
Desc: Run sco-tester with test-runner
Output:
BUG: KASAN: slab-use-after-free in hci_abort_conn_sync+0xc92/0xf30
Total: 15, Passed: 14 (93.3%), Failed: 0, Not Run: 1
##############################
Test: IncrementalBuild - PENDING
Desc: Incremental build with the patches in the series
Output:



---
Regards,
Linux Bluetooth
Luiz Augusto von Dentz Dec. 16, 2024, 2:32 p.m. UTC | #2
Hi Cheng,

On Mon, Dec 16, 2024 at 3:08 AM Cheng Jiang <quic_chejiang@quicinc.com> wrote:
>
> Sometimes, the remote device doesn't acknowledge the LL_TERMINATE_IND
> in time, requiring the controller to wait for the supervision timeout,
> which may exceed 2 seconds. In the current implementation, the
> HCI_EV_DISCONN_COMPLETE event is ignored if it arrives late, since
> the hci_abort_conn_sync has cleaned up the connection after 2 seconds.
> This causes the mgmt to get stuck, resulting in bluetoothd waiting
> indefinitely for the mgmt response to the disconnect. To recover,
> restarting bluetoothd is necessary.
>
> bluetoothctl log like this:
> [Designer Mouse]# disconnect D9:B5:6C:F2:51:91
> Attempting to disconnect from D9:B5:6C:F2:51:91
> [Designer Mouse]#
> [Designer Mouse]# power off
> [Designer Mouse]#
> Failed to set power off: org.freedesktop.DBus.Error.NoReply.
>
> Signed-off-by: Cheng Jiang <quic_chejiang@quicinc.com>
> ---
>  include/net/bluetooth/hci_core.h |  2 ++
>  net/bluetooth/hci_conn.c         |  9 +++++++++
>  net/bluetooth/hci_event.c        |  9 +++++++++
>  net/bluetooth/hci_sync.c         | 18 ++++++++++++++++++
>  4 files changed, 38 insertions(+)
>
> diff --git a/include/net/bluetooth/hci_core.h b/include/net/bluetooth/hci_core.h
> index 734cd50cd..2ab079dcf 100644
> --- a/include/net/bluetooth/hci_core.h
> +++ b/include/net/bluetooth/hci_core.h
> @@ -753,6 +753,8 @@ struct hci_conn {
>
>         struct bt_codec codec;
>
> +       struct completion disc_ev_comp;
> +
>         void (*connect_cfm_cb)  (struct hci_conn *conn, u8 status);
>         void (*security_cfm_cb) (struct hci_conn *conn, u8 status);
>         void (*disconn_cfm_cb)  (struct hci_conn *conn, u8 reason);
> diff --git a/net/bluetooth/hci_conn.c b/net/bluetooth/hci_conn.c
> index d097e308a..e0244e191 100644
> --- a/net/bluetooth/hci_conn.c
> +++ b/net/bluetooth/hci_conn.c
> @@ -1028,6 +1028,15 @@ static struct hci_conn *__hci_conn_add(struct hci_dev *hdev, int type, bdaddr_t
>
>         hci_conn_init_sysfs(conn);
>
> +       /* This disc_ev_comp is inited when we send a disconnect request to
> +        * the remote device but fail to receive the disconnect complete
> +        * event within the expected time (2 seconds). This occurs because
> +        * the remote device doesn't ack the terminate indication, forcing
> +        * the controller to wait for the supervision timeout.
> +        */
> +       init_completion(&conn->disc_ev_comp);
> +       complete(&conn->disc_ev_comp);
> +
>         return conn;
>  }
>
> diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c
> index 2cc7a9306..60ecb2b18 100644
> --- a/net/bluetooth/hci_event.c
> +++ b/net/bluetooth/hci_event.c
> @@ -3366,6 +3366,15 @@ static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data,
>         if (!conn)
>                 goto unlock;
>
> +       /* Wake up disc_ev_comp here is ok. Since we hold the hdev lock
> +        * hci_abort_conn_sync will wait hdev lock release to continue.
> +        */
> +       if (!completion_done(&conn->disc_ev_comp)) {
> +               complete(&conn->disc_ev_comp);
> +               /* Add some delay for hci_abort_conn_sync to handle the complete */
> +               usleep_range(100, 1000);
> +       }
> +
>         if (ev->status) {
>                 mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
>                                        conn->dst_type, ev->status);
> diff --git a/net/bluetooth/hci_sync.c b/net/bluetooth/hci_sync.c
> index 0badec712..783d04b57 100644
> --- a/net/bluetooth/hci_sync.c
> +++ b/net/bluetooth/hci_sync.c
> @@ -5590,6 +5590,24 @@ int hci_abort_conn_sync(struct hci_dev *hdev, struct hci_conn *conn, u8 reason)
>                 break;
>         }
>
> +       /* Check whether the connection is successfully disconnected.
> +        * Sometimes the remote device doesn't acknowledge the
> +        * LL_TERMINATE_IND in time, requiring the controller to wait
> +        * for the supervision timeout, which may exceed 2 seconds. In
> +        * this case, we need to wait for the HCI_EV_DISCONN_COMPLETE
> +        * event before cleaning up the connection.
> +        */
> +       if (err == -ETIMEDOUT) {
> +               u32 idle_delay = msecs_to_jiffies(10 * conn->le_supv_timeout);
> +
> +               reinit_completion(&conn->disc_ev_comp);
> +               if (!wait_for_completion_timeout(&conn->disc_ev_comp, idle_delay)) {
> +                       bt_dev_warn(hdev, "Failed to get complete");
> +                       mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
> +                                              conn->dst_type, conn->abort_reason);
> +               }
> +       }

Why don't we just set the supervision timeout as timeout then? If we
will have to wait for it anyway just change hci_disconnect_sync to use
10 * conn->le_supv_timeout as timeout instead.

That said, we really need to fix bluetoothd if it is not able to be
cleaned up if SET_POWERED command fails, but it looks like it is
handling errors correctly so it sounds like something else is at play.

>         hci_dev_lock(hdev);
>
>         /* Check if the connection has been cleaned up concurrently */
>
> base-commit: e25c8d66f6786300b680866c0e0139981273feba
> --
> 2.34.1
>
Luiz Augusto von Dentz Dec. 16, 2024, 2:42 p.m. UTC | #3
Hi Cheng,

On Mon, Dec 16, 2024 at 9:32 AM Luiz Augusto von Dentz
<luiz.dentz@gmail.com> wrote:
>
> Hi Cheng,
>
> On Mon, Dec 16, 2024 at 3:08 AM Cheng Jiang <quic_chejiang@quicinc.com> wrote:
> >
> > Sometimes, the remote device doesn't acknowledge the LL_TERMINATE_IND
> > in time, requiring the controller to wait for the supervision timeout,
> > which may exceed 2 seconds. In the current implementation, the
> > HCI_EV_DISCONN_COMPLETE event is ignored if it arrives late, since
> > the hci_abort_conn_sync has cleaned up the connection after 2 seconds.
> > This causes the mgmt to get stuck, resulting in bluetoothd waiting
> > indefinitely for the mgmt response to the disconnect. To recover,
> > restarting bluetoothd is necessary.
> >
> > bluetoothctl log like this:
> > [Designer Mouse]# disconnect D9:B5:6C:F2:51:91
> > Attempting to disconnect from D9:B5:6C:F2:51:91
> > [Designer Mouse]#
> > [Designer Mouse]# power off
> > [Designer Mouse]#
> > Failed to set power off: org.freedesktop.DBus.Error.NoReply.
> >
> > Signed-off-by: Cheng Jiang <quic_chejiang@quicinc.com>
> > ---
> >  include/net/bluetooth/hci_core.h |  2 ++
> >  net/bluetooth/hci_conn.c         |  9 +++++++++
> >  net/bluetooth/hci_event.c        |  9 +++++++++
> >  net/bluetooth/hci_sync.c         | 18 ++++++++++++++++++
> >  4 files changed, 38 insertions(+)
> >
> > diff --git a/include/net/bluetooth/hci_core.h b/include/net/bluetooth/hci_core.h
> > index 734cd50cd..2ab079dcf 100644
> > --- a/include/net/bluetooth/hci_core.h
> > +++ b/include/net/bluetooth/hci_core.h
> > @@ -753,6 +753,8 @@ struct hci_conn {
> >
> >         struct bt_codec codec;
> >
> > +       struct completion disc_ev_comp;
> > +
> >         void (*connect_cfm_cb)  (struct hci_conn *conn, u8 status);
> >         void (*security_cfm_cb) (struct hci_conn *conn, u8 status);
> >         void (*disconn_cfm_cb)  (struct hci_conn *conn, u8 reason);
> > diff --git a/net/bluetooth/hci_conn.c b/net/bluetooth/hci_conn.c
> > index d097e308a..e0244e191 100644
> > --- a/net/bluetooth/hci_conn.c
> > +++ b/net/bluetooth/hci_conn.c
> > @@ -1028,6 +1028,15 @@ static struct hci_conn *__hci_conn_add(struct hci_dev *hdev, int type, bdaddr_t
> >
> >         hci_conn_init_sysfs(conn);
> >
> > +       /* This disc_ev_comp is inited when we send a disconnect request to
> > +        * the remote device but fail to receive the disconnect complete
> > +        * event within the expected time (2 seconds). This occurs because
> > +        * the remote device doesn't ack the terminate indication, forcing
> > +        * the controller to wait for the supervision timeout.
> > +        */
> > +       init_completion(&conn->disc_ev_comp);
> > +       complete(&conn->disc_ev_comp);
> > +
> >         return conn;
> >  }
> >
> > diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c
> > index 2cc7a9306..60ecb2b18 100644
> > --- a/net/bluetooth/hci_event.c
> > +++ b/net/bluetooth/hci_event.c
> > @@ -3366,6 +3366,15 @@ static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data,
> >         if (!conn)
> >                 goto unlock;
> >
> > +       /* Wake up disc_ev_comp here is ok. Since we hold the hdev lock
> > +        * hci_abort_conn_sync will wait hdev lock release to continue.
> > +        */
> > +       if (!completion_done(&conn->disc_ev_comp)) {
> > +               complete(&conn->disc_ev_comp);
> > +               /* Add some delay for hci_abort_conn_sync to handle the complete */
> > +               usleep_range(100, 1000);
> > +       }
> > +
> >         if (ev->status) {
> >                 mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
> >                                        conn->dst_type, ev->status);
> > diff --git a/net/bluetooth/hci_sync.c b/net/bluetooth/hci_sync.c
> > index 0badec712..783d04b57 100644
> > --- a/net/bluetooth/hci_sync.c
> > +++ b/net/bluetooth/hci_sync.c
> > @@ -5590,6 +5590,24 @@ int hci_abort_conn_sync(struct hci_dev *hdev, struct hci_conn *conn, u8 reason)
> >                 break;
> >         }
> >
> > +       /* Check whether the connection is successfully disconnected.
> > +        * Sometimes the remote device doesn't acknowledge the
> > +        * LL_TERMINATE_IND in time, requiring the controller to wait
> > +        * for the supervision timeout, which may exceed 2 seconds. In
> > +        * this case, we need to wait for the HCI_EV_DISCONN_COMPLETE
> > +        * event before cleaning up the connection.
> > +        */
> > +       if (err == -ETIMEDOUT) {
> > +               u32 idle_delay = msecs_to_jiffies(10 * conn->le_supv_timeout);
> > +
> > +               reinit_completion(&conn->disc_ev_comp);
> > +               if (!wait_for_completion_timeout(&conn->disc_ev_comp, idle_delay)) {
> > +                       bt_dev_warn(hdev, "Failed to get complete");
> > +                       mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
> > +                                              conn->dst_type, conn->abort_reason);
> > +               }
> > +       }
>
> Why don't we just set the supervision timeout as timeout then? If we
> will have to wait for it anyway just change hci_disconnect_sync to use
> 10 * conn->le_supv_timeout as timeout instead.
>
> That said, we really need to fix bluetoothd if it is not able to be
> cleaned up if SET_POWERED command fails, but it looks like it is
> handling errors correctly so it sounds like something else is at play.

I double checked this and apparently this could no longer fail:

+               /* Disregard possible errors since hci_conn_del shall have been
+                * called even in case of errors had occurred since it would
+                * then cause hci_conn_failed to be called which calls
+                * hci_conn_del internally.
+                */
+               hci_abort_conn_sync(hdev, conn, reason);

So it will clean up the hci_conn no matter what is the timeout, so
either you don't have this change or it is not working for some
reason.

> >         hci_dev_lock(hdev);
> >
> >         /* Check if the connection has been cleaned up concurrently */
> >
> > base-commit: e25c8d66f6786300b680866c0e0139981273feba
> > --
> > 2.34.1
> >
>
>
> --
> Luiz Augusto von Dentz
Cheng Jiang (IOE) Dec. 17, 2024, 2:49 a.m. UTC | #4
Hi Luiz,

On 12/16/2024 10:42 PM, Luiz Augusto von Dentz wrote:
> Hi Cheng,
> 
> On Mon, Dec 16, 2024 at 9:32 AM Luiz Augusto von Dentz
> <luiz.dentz@gmail.com> wrote:
>>
>> Hi Cheng,
>>
>> On Mon, Dec 16, 2024 at 3:08 AM Cheng Jiang <quic_chejiang@quicinc.com> wrote:
>>>
>>> Sometimes, the remote device doesn't acknowledge the LL_TERMINATE_IND
>>> in time, requiring the controller to wait for the supervision timeout,
>>> which may exceed 2 seconds. In the current implementation, the
>>> HCI_EV_DISCONN_COMPLETE event is ignored if it arrives late, since
>>> the hci_abort_conn_sync has cleaned up the connection after 2 seconds.
>>> This causes the mgmt to get stuck, resulting in bluetoothd waiting
>>> indefinitely for the mgmt response to the disconnect. To recover,
>>> restarting bluetoothd is necessary.
>>>
>>> bluetoothctl log like this:
>>> [Designer Mouse]# disconnect D9:B5:6C:F2:51:91
>>> Attempting to disconnect from D9:B5:6C:F2:51:91
>>> [Designer Mouse]#
>>> [Designer Mouse]# power off
>>> [Designer Mouse]#
>>> Failed to set power off: org.freedesktop.DBus.Error.NoReply.
>>>
>>> Signed-off-by: Cheng Jiang <quic_chejiang@quicinc.com>
>>> ---
>>>  include/net/bluetooth/hci_core.h |  2 ++
>>>  net/bluetooth/hci_conn.c         |  9 +++++++++
>>>  net/bluetooth/hci_event.c        |  9 +++++++++
>>>  net/bluetooth/hci_sync.c         | 18 ++++++++++++++++++
>>>  4 files changed, 38 insertions(+)
>>>
>>> diff --git a/include/net/bluetooth/hci_core.h b/include/net/bluetooth/hci_core.h
>>> index 734cd50cd..2ab079dcf 100644
>>> --- a/include/net/bluetooth/hci_core.h
>>> +++ b/include/net/bluetooth/hci_core.h
>>> @@ -753,6 +753,8 @@ struct hci_conn {
>>>
>>>         struct bt_codec codec;
>>>
>>> +       struct completion disc_ev_comp;
>>> +
>>>         void (*connect_cfm_cb)  (struct hci_conn *conn, u8 status);
>>>         void (*security_cfm_cb) (struct hci_conn *conn, u8 status);
>>>         void (*disconn_cfm_cb)  (struct hci_conn *conn, u8 reason);
>>> diff --git a/net/bluetooth/hci_conn.c b/net/bluetooth/hci_conn.c
>>> index d097e308a..e0244e191 100644
>>> --- a/net/bluetooth/hci_conn.c
>>> +++ b/net/bluetooth/hci_conn.c
>>> @@ -1028,6 +1028,15 @@ static struct hci_conn *__hci_conn_add(struct hci_dev *hdev, int type, bdaddr_t
>>>
>>>         hci_conn_init_sysfs(conn);
>>>
>>> +       /* This disc_ev_comp is inited when we send a disconnect request to
>>> +        * the remote device but fail to receive the disconnect complete
>>> +        * event within the expected time (2 seconds). This occurs because
>>> +        * the remote device doesn't ack the terminate indication, forcing
>>> +        * the controller to wait for the supervision timeout.
>>> +        */
>>> +       init_completion(&conn->disc_ev_comp);
>>> +       complete(&conn->disc_ev_comp);
>>> +
>>>         return conn;
>>>  }
>>>
>>> diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c
>>> index 2cc7a9306..60ecb2b18 100644
>>> --- a/net/bluetooth/hci_event.c
>>> +++ b/net/bluetooth/hci_event.c
>>> @@ -3366,6 +3366,15 @@ static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data,
>>>         if (!conn)
>>>                 goto unlock;
>>>
>>> +       /* Wake up disc_ev_comp here is ok. Since we hold the hdev lock
>>> +        * hci_abort_conn_sync will wait hdev lock release to continue.
>>> +        */
>>> +       if (!completion_done(&conn->disc_ev_comp)) {
>>> +               complete(&conn->disc_ev_comp);
>>> +               /* Add some delay for hci_abort_conn_sync to handle the complete */
>>> +               usleep_range(100, 1000);
>>> +       }
>>> +
>>>         if (ev->status) {
>>>                 mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
>>>                                        conn->dst_type, ev->status);
>>> diff --git a/net/bluetooth/hci_sync.c b/net/bluetooth/hci_sync.c
>>> index 0badec712..783d04b57 100644
>>> --- a/net/bluetooth/hci_sync.c
>>> +++ b/net/bluetooth/hci_sync.c
>>> @@ -5590,6 +5590,24 @@ int hci_abort_conn_sync(struct hci_dev *hdev, struct hci_conn *conn, u8 reason)
>>>                 break;
>>>         }
>>>
>>> +       /* Check whether the connection is successfully disconnected.
>>> +        * Sometimes the remote device doesn't acknowledge the
>>> +        * LL_TERMINATE_IND in time, requiring the controller to wait
>>> +        * for the supervision timeout, which may exceed 2 seconds. In
>>> +        * this case, we need to wait for the HCI_EV_DISCONN_COMPLETE
>>> +        * event before cleaning up the connection.
>>> +        */
>>> +       if (err == -ETIMEDOUT) {
>>> +               u32 idle_delay = msecs_to_jiffies(10 * conn->le_supv_timeout);
>>> +
>>> +               reinit_completion(&conn->disc_ev_comp);
>>> +               if (!wait_for_completion_timeout(&conn->disc_ev_comp, idle_delay)) {
>>> +                       bt_dev_warn(hdev, "Failed to get complete");
>>> +                       mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
>>> +                                              conn->dst_type, conn->abort_reason);
>>> +               }
>>> +       }
>>
>> Why don't we just set the supervision timeout as timeout then? If we
>> will have to wait for it anyway just change hci_disconnect_sync to use
>> 10 * conn->le_supv_timeout as timeout instead.
>>
hci_disconnect_sync uses __hci_cmd_sync_status_sk to wait for the 
HCI_EV_DISCONN_COMPLETE event, which will send the command in hci_cmd_work. 
In hci_cmd_work, it will start a timer with HCI_CMD_TIMEOUT(2s) to wait 
for the event. So even in hci_disconnect_sync we can set the timeout to
supervision timeout, hci_disconnect_sync still timeout after 2s. 

>> That said, we really need to fix bluetoothd if it is not able to be
>> cleaned up if SET_POWERED command fails, but it looks like it is
>> handling errors correctly so it sounds like something else is at play.
> 
The issue arises after a 2-second timeout of hci_disconnect_sync. During 
hci_abort_conn_sync, the connection is cleaned up by hci_conn_failed. 
after supervision timeout, the disconnect complete event arrives, but 
it returns at line 3370 since the connection has already been removed. 
As a result, bluetoothd does not send the mgmt event for MGMT_OP_DISCONNECT 
to the application layer (bluetoothctl), causing bluetoothctl to get stuck 
and unable to perform other mgmt commands.


3355 static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data,
3356              struct sk_buff *skb)
3357 {
3358   struct hci_ev_disconn_complete *ev = data;
3359   u8 reason;
3360   struct hci_conn_params *params;
3361   struct hci_conn *conn;
3362   bool mgmt_connected;
3363
3364   bt_dev_dbg(hdev, "status 0x%2.2x", ev->status);
3365
3366   hci_dev_lock(hdev);
3367
3368   conn = hci_conn_hash_lookup_handle(hdev, __le16_to_cpu(ev->handle));
3369   if (!conn)
3370     goto unlock;
3371
3372   if (ev->status) {
3373     mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
3374                conn->dst_type, ev->status);
3375     goto unlock;
3376   }
3377
3378   conn->state = BT_CLOSED;
3379
3380   mgmt_connected = test_and_clear_bit(HCI_CONN_MGMT_CONNECTED, &conn->flags);
3381

> I double checked this and apparently this could no longer fail:
> 
> +               /* Disregard possible errors since hci_conn_del shall have been
> +                * called even in case of errors had occurred since it would
> +                * then cause hci_conn_failed to be called which calls
> +                * hci_conn_del internally.
> +                */
> +               hci_abort_conn_sync(hdev, conn, reason);
> 
> So it will clean up the hci_conn no matter what is the timeout, so
> either you don't have this change or it is not working for some
> reason.
> 
The issue is mgmt command is not repsonsed by bluetoothd, then the bluetootlctl is 
blocked. It does not happen during the power off stage. It happened when disconnect 
a BLE device, but the disconnect complete event sent from controller to host 2s later.
Then it causes the mgmt in bluetoothctl is blocked as decribed as above. 

>>>         hci_dev_lock(hdev);
>>>
>>>         /* Check if the connection has been cleaned up concurrently */
>>>
>>> base-commit: e25c8d66f6786300b680866c0e0139981273feba
>>> --
>>> 2.34.1
>>>
>>
>>
>> --
>> Luiz Augusto von Dentz
> 
> 
>
Luiz Augusto von Dentz Dec. 17, 2024, 3:15 a.m. UTC | #5
Hi Cheng,

On Mon, Dec 16, 2024 at 9:49 PM Cheng Jiang (IOE)
<quic_chejiang@quicinc.com> wrote:
>
> Hi Luiz,
>
> On 12/16/2024 10:42 PM, Luiz Augusto von Dentz wrote:
> > Hi Cheng,
> >
> > On Mon, Dec 16, 2024 at 9:32 AM Luiz Augusto von Dentz
> > <luiz.dentz@gmail.com> wrote:
> >>
> >> Hi Cheng,
> >>
> >> On Mon, Dec 16, 2024 at 3:08 AM Cheng Jiang <quic_chejiang@quicinc.com> wrote:
> >>>
> >>> Sometimes, the remote device doesn't acknowledge the LL_TERMINATE_IND
> >>> in time, requiring the controller to wait for the supervision timeout,
> >>> which may exceed 2 seconds. In the current implementation, the
> >>> HCI_EV_DISCONN_COMPLETE event is ignored if it arrives late, since
> >>> the hci_abort_conn_sync has cleaned up the connection after 2 seconds.
> >>> This causes the mgmt to get stuck, resulting in bluetoothd waiting
> >>> indefinitely for the mgmt response to the disconnect. To recover,
> >>> restarting bluetoothd is necessary.
> >>>
> >>> bluetoothctl log like this:
> >>> [Designer Mouse]# disconnect D9:B5:6C:F2:51:91
> >>> Attempting to disconnect from D9:B5:6C:F2:51:91
> >>> [Designer Mouse]#
> >>> [Designer Mouse]# power off
> >>> [Designer Mouse]#
> >>> Failed to set power off: org.freedesktop.DBus.Error.NoReply.
> >>>
> >>> Signed-off-by: Cheng Jiang <quic_chejiang@quicinc.com>
> >>> ---
> >>>  include/net/bluetooth/hci_core.h |  2 ++
> >>>  net/bluetooth/hci_conn.c         |  9 +++++++++
> >>>  net/bluetooth/hci_event.c        |  9 +++++++++
> >>>  net/bluetooth/hci_sync.c         | 18 ++++++++++++++++++
> >>>  4 files changed, 38 insertions(+)
> >>>
> >>> diff --git a/include/net/bluetooth/hci_core.h b/include/net/bluetooth/hci_core.h
> >>> index 734cd50cd..2ab079dcf 100644
> >>> --- a/include/net/bluetooth/hci_core.h
> >>> +++ b/include/net/bluetooth/hci_core.h
> >>> @@ -753,6 +753,8 @@ struct hci_conn {
> >>>
> >>>         struct bt_codec codec;
> >>>
> >>> +       struct completion disc_ev_comp;
> >>> +
> >>>         void (*connect_cfm_cb)  (struct hci_conn *conn, u8 status);
> >>>         void (*security_cfm_cb) (struct hci_conn *conn, u8 status);
> >>>         void (*disconn_cfm_cb)  (struct hci_conn *conn, u8 reason);
> >>> diff --git a/net/bluetooth/hci_conn.c b/net/bluetooth/hci_conn.c
> >>> index d097e308a..e0244e191 100644
> >>> --- a/net/bluetooth/hci_conn.c
> >>> +++ b/net/bluetooth/hci_conn.c
> >>> @@ -1028,6 +1028,15 @@ static struct hci_conn *__hci_conn_add(struct hci_dev *hdev, int type, bdaddr_t
> >>>
> >>>         hci_conn_init_sysfs(conn);
> >>>
> >>> +       /* This disc_ev_comp is inited when we send a disconnect request to
> >>> +        * the remote device but fail to receive the disconnect complete
> >>> +        * event within the expected time (2 seconds). This occurs because
> >>> +        * the remote device doesn't ack the terminate indication, forcing
> >>> +        * the controller to wait for the supervision timeout.
> >>> +        */
> >>> +       init_completion(&conn->disc_ev_comp);
> >>> +       complete(&conn->disc_ev_comp);
> >>> +
> >>>         return conn;
> >>>  }
> >>>
> >>> diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c
> >>> index 2cc7a9306..60ecb2b18 100644
> >>> --- a/net/bluetooth/hci_event.c
> >>> +++ b/net/bluetooth/hci_event.c
> >>> @@ -3366,6 +3366,15 @@ static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data,
> >>>         if (!conn)
> >>>                 goto unlock;
> >>>
> >>> +       /* Wake up disc_ev_comp here is ok. Since we hold the hdev lock
> >>> +        * hci_abort_conn_sync will wait hdev lock release to continue.
> >>> +        */
> >>> +       if (!completion_done(&conn->disc_ev_comp)) {
> >>> +               complete(&conn->disc_ev_comp);
> >>> +               /* Add some delay for hci_abort_conn_sync to handle the complete */
> >>> +               usleep_range(100, 1000);
> >>> +       }
> >>> +
> >>>         if (ev->status) {
> >>>                 mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
> >>>                                        conn->dst_type, ev->status);
> >>> diff --git a/net/bluetooth/hci_sync.c b/net/bluetooth/hci_sync.c
> >>> index 0badec712..783d04b57 100644
> >>> --- a/net/bluetooth/hci_sync.c
> >>> +++ b/net/bluetooth/hci_sync.c
> >>> @@ -5590,6 +5590,24 @@ int hci_abort_conn_sync(struct hci_dev *hdev, struct hci_conn *conn, u8 reason)
> >>>                 break;
> >>>         }
> >>>
> >>> +       /* Check whether the connection is successfully disconnected.
> >>> +        * Sometimes the remote device doesn't acknowledge the
> >>> +        * LL_TERMINATE_IND in time, requiring the controller to wait
> >>> +        * for the supervision timeout, which may exceed 2 seconds. In
> >>> +        * this case, we need to wait for the HCI_EV_DISCONN_COMPLETE
> >>> +        * event before cleaning up the connection.
> >>> +        */
> >>> +       if (err == -ETIMEDOUT) {
> >>> +               u32 idle_delay = msecs_to_jiffies(10 * conn->le_supv_timeout);
> >>> +
> >>> +               reinit_completion(&conn->disc_ev_comp);
> >>> +               if (!wait_for_completion_timeout(&conn->disc_ev_comp, idle_delay)) {
> >>> +                       bt_dev_warn(hdev, "Failed to get complete");
> >>> +                       mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
> >>> +                                              conn->dst_type, conn->abort_reason);
> >>> +               }
> >>> +       }
> >>
> >> Why don't we just set the supervision timeout as timeout then? If we
> >> will have to wait for it anyway just change hci_disconnect_sync to use
> >> 10 * conn->le_supv_timeout as timeout instead.
> >>
> hci_disconnect_sync uses __hci_cmd_sync_status_sk to wait for the
> HCI_EV_DISCONN_COMPLETE event, which will send the command in hci_cmd_work.
> In hci_cmd_work, it will start a timer with HCI_CMD_TIMEOUT(2s) to wait
> for the event. So even in hci_disconnect_sync we can set the timeout to
> supervision timeout, hci_disconnect_sync still timeout after 2s.

Wait, why are you talking about HCI_CMD_TIMEOUT when I told you to use
the supervision timeout instead? If it still timeout after to 2
seconds then there is something still forcing HCI_CMD_TIMEOUT which
shouldn't happen.

> >> That said, we really need to fix bluetoothd if it is not able to be
> >> cleaned up if SET_POWERED command fails, but it looks like it is
> >> handling errors correctly so it sounds like something else is at play.
> >
> The issue arises after a 2-second timeout of hci_disconnect_sync. During
> hci_abort_conn_sync, the connection is cleaned up by hci_conn_failed.
> after supervision timeout, the disconnect complete event arrives, but
> it returns at line 3370 since the connection has already been removed.
> As a result, bluetoothd does not send the mgmt event for MGMT_OP_DISCONNECT
> to the application layer (bluetoothctl), causing bluetoothctl to get stuck
> and unable to perform other mgmt commands.

The command shall have completed regardless if disconnect complete has
been received or not, the is the whole point of having a timeout, so
this makes no sense to me, or you are not describing what is happening
here. Also there is no MGMT_OP_DISCONNECT pending, it is
MGMT_OP_SET_POWERED, if you are talking about the DISCONNECTED event
that is a totally different thing and perhaps that is the source of
the problem because if we do cleanup hci_conn even in case the command
fails/times out then we should be generating a disconnected event as
well.

>
> 3355 static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data,
> 3356              struct sk_buff *skb)
> 3357 {
> 3358   struct hci_ev_disconn_complete *ev = data;
> 3359   u8 reason;
> 3360   struct hci_conn_params *params;
> 3361   struct hci_conn *conn;
> 3362   bool mgmt_connected;
> 3363
> 3364   bt_dev_dbg(hdev, "status 0x%2.2x", ev->status);
> 3365
> 3366   hci_dev_lock(hdev);
> 3367
> 3368   conn = hci_conn_hash_lookup_handle(hdev, __le16_to_cpu(ev->handle));
> 3369   if (!conn)
> 3370     goto unlock;
> 3371
> 3372   if (ev->status) {
> 3373     mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
> 3374                conn->dst_type, ev->status);
> 3375     goto unlock;
> 3376   }
> 3377
> 3378   conn->state = BT_CLOSED;
> 3379
> 3380   mgmt_connected = test_and_clear_bit(HCI_CONN_MGMT_CONNECTED, &conn->flags);
> 3381
>
> > I double checked this and apparently this could no longer fail:
> >
> > +               /* Disregard possible errors since hci_conn_del shall have been
> > +                * called even in case of errors had occurred since it would
> > +                * then cause hci_conn_failed to be called which calls
> > +                * hci_conn_del internally.
> > +                */
> > +               hci_abort_conn_sync(hdev, conn, reason);
> >
> > So it will clean up the hci_conn no matter what is the timeout, so
> > either you don't have this change or it is not working for some
> > reason.
> >
> The issue is mgmt command is not repsonsed by bluetoothd, then the bluetootlctl is
> blocked. It does not happen during the power off stage. It happened when disconnect
> a BLE device, but the disconnect complete event sent from controller to host 2s later.
> Then it causes the mgmt in bluetoothctl is blocked as decribed as above.

There is a difference about a MGMT command, initiated by bluetoothd,
versus a MGMT event initiated by the kernel, so the daemon is not
blocked it just don't get a disconnection event, which is different
than a command complete.

What is the command sequence that you use to reproduce the problem?

> >>>         hci_dev_lock(hdev);
> >>>
> >>>         /* Check if the connection has been cleaned up concurrently */
> >>>
> >>> base-commit: e25c8d66f6786300b680866c0e0139981273feba
> >>> --
> >>> 2.34.1
> >>>
> >>
> >>
> >> --
> >> Luiz Augusto von Dentz
> >
> >
> >
>
Cheng Jiang (IOE) Dec. 17, 2024, 5:50 a.m. UTC | #6
Hi Luiz,

On 12/17/2024 11:15 AM, Luiz Augusto von Dentz wrote:
> Hi Cheng,
> 
> On Mon, Dec 16, 2024 at 9:49 PM Cheng Jiang (IOE)
> <quic_chejiang@quicinc.com> wrote:
>>
>> Hi Luiz,
>>
>> On 12/16/2024 10:42 PM, Luiz Augusto von Dentz wrote:
>>> Hi Cheng,
>>>
>>> On Mon, Dec 16, 2024 at 9:32 AM Luiz Augusto von Dentz
>>> <luiz.dentz@gmail.com> wrote:
>>>>
>>>> Hi Cheng,
>>>>
>>>> On Mon, Dec 16, 2024 at 3:08 AM Cheng Jiang <quic_chejiang@quicinc.com> wrote:
>>>>>
>>>>> Sometimes, the remote device doesn't acknowledge the LL_TERMINATE_IND
>>>>> in time, requiring the controller to wait for the supervision timeout,
>>>>> which may exceed 2 seconds. In the current implementation, the
>>>>> HCI_EV_DISCONN_COMPLETE event is ignored if it arrives late, since
>>>>> the hci_abort_conn_sync has cleaned up the connection after 2 seconds.
>>>>> This causes the mgmt to get stuck, resulting in bluetoothd waiting
>>>>> indefinitely for the mgmt response to the disconnect. To recover,
>>>>> restarting bluetoothd is necessary.
>>>>>
>>>>> bluetoothctl log like this:
>>>>> [Designer Mouse]# disconnect D9:B5:6C:F2:51:91
>>>>> Attempting to disconnect from D9:B5:6C:F2:51:91
>>>>> [Designer Mouse]#
>>>>> [Designer Mouse]# power off
>>>>> [Designer Mouse]#
>>>>> Failed to set power off: org.freedesktop.DBus.Error.NoReply.
>>>>>
>>>>> Signed-off-by: Cheng Jiang <quic_chejiang@quicinc.com>
>>>>> ---
>>>>>  include/net/bluetooth/hci_core.h |  2 ++
>>>>>  net/bluetooth/hci_conn.c         |  9 +++++++++
>>>>>  net/bluetooth/hci_event.c        |  9 +++++++++
>>>>>  net/bluetooth/hci_sync.c         | 18 ++++++++++++++++++
>>>>>  4 files changed, 38 insertions(+)
>>>>>
>>>>> diff --git a/include/net/bluetooth/hci_core.h b/include/net/bluetooth/hci_core.h
>>>>> index 734cd50cd..2ab079dcf 100644
>>>>> --- a/include/net/bluetooth/hci_core.h
>>>>> +++ b/include/net/bluetooth/hci_core.h
>>>>> @@ -753,6 +753,8 @@ struct hci_conn {
>>>>>
>>>>>         struct bt_codec codec;
>>>>>
>>>>> +       struct completion disc_ev_comp;
>>>>> +
>>>>>         void (*connect_cfm_cb)  (struct hci_conn *conn, u8 status);
>>>>>         void (*security_cfm_cb) (struct hci_conn *conn, u8 status);
>>>>>         void (*disconn_cfm_cb)  (struct hci_conn *conn, u8 reason);
>>>>> diff --git a/net/bluetooth/hci_conn.c b/net/bluetooth/hci_conn.c
>>>>> index d097e308a..e0244e191 100644
>>>>> --- a/net/bluetooth/hci_conn.c
>>>>> +++ b/net/bluetooth/hci_conn.c
>>>>> @@ -1028,6 +1028,15 @@ static struct hci_conn *__hci_conn_add(struct hci_dev *hdev, int type, bdaddr_t
>>>>>
>>>>>         hci_conn_init_sysfs(conn);
>>>>>
>>>>> +       /* This disc_ev_comp is inited when we send a disconnect request to
>>>>> +        * the remote device but fail to receive the disconnect complete
>>>>> +        * event within the expected time (2 seconds). This occurs because
>>>>> +        * the remote device doesn't ack the terminate indication, forcing
>>>>> +        * the controller to wait for the supervision timeout.
>>>>> +        */
>>>>> +       init_completion(&conn->disc_ev_comp);
>>>>> +       complete(&conn->disc_ev_comp);
>>>>> +
>>>>>         return conn;
>>>>>  }
>>>>>
>>>>> diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c
>>>>> index 2cc7a9306..60ecb2b18 100644
>>>>> --- a/net/bluetooth/hci_event.c
>>>>> +++ b/net/bluetooth/hci_event.c
>>>>> @@ -3366,6 +3366,15 @@ static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data,
>>>>>         if (!conn)
>>>>>                 goto unlock;
>>>>>
>>>>> +       /* Wake up disc_ev_comp here is ok. Since we hold the hdev lock
>>>>> +        * hci_abort_conn_sync will wait hdev lock release to continue.
>>>>> +        */
>>>>> +       if (!completion_done(&conn->disc_ev_comp)) {
>>>>> +               complete(&conn->disc_ev_comp);
>>>>> +               /* Add some delay for hci_abort_conn_sync to handle the complete */
>>>>> +               usleep_range(100, 1000);
>>>>> +       }
>>>>> +
>>>>>         if (ev->status) {
>>>>>                 mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
>>>>>                                        conn->dst_type, ev->status);
>>>>> diff --git a/net/bluetooth/hci_sync.c b/net/bluetooth/hci_sync.c
>>>>> index 0badec712..783d04b57 100644
>>>>> --- a/net/bluetooth/hci_sync.c
>>>>> +++ b/net/bluetooth/hci_sync.c
>>>>> @@ -5590,6 +5590,24 @@ int hci_abort_conn_sync(struct hci_dev *hdev, struct hci_conn *conn, u8 reason)
>>>>>                 break;
>>>>>         }
>>>>>
>>>>> +       /* Check whether the connection is successfully disconnected.
>>>>> +        * Sometimes the remote device doesn't acknowledge the
>>>>> +        * LL_TERMINATE_IND in time, requiring the controller to wait
>>>>> +        * for the supervision timeout, which may exceed 2 seconds. In
>>>>> +        * this case, we need to wait for the HCI_EV_DISCONN_COMPLETE
>>>>> +        * event before cleaning up the connection.
>>>>> +        */
>>>>> +       if (err == -ETIMEDOUT) {
>>>>> +               u32 idle_delay = msecs_to_jiffies(10 * conn->le_supv_timeout);
>>>>> +
>>>>> +               reinit_completion(&conn->disc_ev_comp);
>>>>> +               if (!wait_for_completion_timeout(&conn->disc_ev_comp, idle_delay)) {
>>>>> +                       bt_dev_warn(hdev, "Failed to get complete");
>>>>> +                       mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
>>>>> +                                              conn->dst_type, conn->abort_reason);
>>>>> +               }
>>>>> +       }
>>>>
>>>> Why don't we just set the supervision timeout as timeout then? If we
>>>> will have to wait for it anyway just change hci_disconnect_sync to use
>>>> 10 * conn->le_supv_timeout as timeout instead.
>>>>
>> hci_disconnect_sync uses __hci_cmd_sync_status_sk to wait for the
>> HCI_EV_DISCONN_COMPLETE event, which will send the command in hci_cmd_work.
>> In hci_cmd_work, it will start a timer with HCI_CMD_TIMEOUT(2s) to wait
>> for the event. So even in hci_disconnect_sync we can set the timeout to
>> supervision timeout, hci_disconnect_sync still timeout after 2s.
> 
> Wait, why are you talking about HCI_CMD_TIMEOUT when I told you to use
> the supervision timeout instead? If it still timeout after to 2
> seconds then there is something still forcing HCI_CMD_TIMEOUT which
> shouldn't happen.
> 
Since the lower layer (hci_cmd_work) has set the timeout to HCI_CMD_TIMEOUT, so 
even the upper layer set to a larger timeout value, like supervision timeout, 
it still get the timeout after HCI_CMD_TIMEOUT. The timeout flow is:
hci_disconnect_sync -> __hci_cmd_sync_sk(wait_event_interruptible_timeout) -> 
hci_cmd_work -> hci_cmd_timeout -> hci_cmd_sync_cancel_sync -> wake up the 
wait_event_interruptible_timeout in __hci_cmd_sync_sk -> hci_disconnect_sync timeout.

So even if we set a large timeout value in hci_disconnect_sync, it doesn't work
since it's waked up by other event, not the real timeout. 

What's more, in the hci_disconnect_sync, if the reason it not power_off, it waits
for the disconnect complete event rather than command status event. According to
BT core spec, disconnect complete event has to wait for remote's ack or wait until
supervision timeout. It's a valid case that the disconnect complete event taking 
more than 2s. 

>>>> That said, we really need to fix bluetoothd if it is not able to be
>>>> cleaned up if SET_POWERED command fails, but it looks like it is
>>>> handling errors correctly so it sounds like something else is at play.
>>>
>> The issue arises after a 2-second timeout of hci_disconnect_sync. During
>> hci_abort_conn_sync, the connection is cleaned up by hci_conn_failed.
>> after supervision timeout, the disconnect complete event arrives, but
>> it returns at line 3370 since the connection has already been removed.
>> As a result, bluetoothd does not send the mgmt event for MGMT_OP_DISCONNECT
>> to the application layer (bluetoothctl), causing bluetoothctl to get stuck
>> and unable to perform other mgmt commands.
> 
> The command shall have completed regardless if disconnect complete has
> been received or not, the is the whole point of having a timeout, so
> this makes no sense to me, or you are not describing what is happening
> here. Also there is no MGMT_OP_DISCONNECT pending, it is
> MGMT_OP_SET_POWERED, if you are talking about the DISCONNECTED event
> that is a totally different thing and perhaps that is the source of
> the problem because if we do cleanup hci_conn even in case the command
> fails/times out then we should be generating a disconnected event as
> well.
> 
Here is the flow describe the issue. For normal case:
┌───────────┐          ┌──────────┐           ┌──────┐           ┌──────────┐           ┌──────┐
│bluetootctl│          │bluetoothd│           │kernel│           │controller│           │remote│
└─────┬─────┘          └─────┬────┘           └───┬──┘           └─────┬────┘           └───┬──┘
      │   disconnect cmd     │                    │                    │                    │
      │─────────────────────>│                    │                    │                    │
      │                      │                    │                    │                    │
      │                      │    MGMT_OP         │                    │                    │
      │                      │    _DISCONNECT     │                    │                    │
      │                      │───────────────────>│                    │                    │
      │                      │                    │                    │                    │
      │                      │                    │  HCI_Disconnect    │                    │
      │                      │                    │───────────────────>│                    │
      │                      │                    │                    │                    │
      │                      │                    │                    │   LL_TERMINATE     │
      │                      │                    │                    │   _IND             │
      │                      │                    │                    │───────────────────>│
      │                      │                    │                    │                    │
      │                      │                    │                    │        ACK         │
      │                      │                    │                    │<───────────────────│
      │                      │                    │                    │                    │
      │                      │                    │   Disc_Comp_Evt    │                    │
      │                      │                    │<───────────────────│                    │
      │                      │                    │                    │                    │
      │                      │   MGMT Response    │                    │                    │
      │                      │<───────────────────│                    │                    │
      │                      │                    │                    │                    │
      │      disc succ       │                    │                    │                    │
      │<─────────────────────│                    │                    │                    │
┌─────┴─────┐          ┌─────┴────┐           ┌───┴──┐           ┌─────┴────┐           ┌───┴──┐
│bluetootctl│          │bluetoothd│           │kernel│           │controller│           │remote│
└───────────┘          └──────────┘           └──────┘           └──────────┘           └──────┘


The failure case like this:

┌───────────┐          ┌──────────┐           ┌──────┐            ┌──────────┐           ┌──────┐
│bluetootctl│          │bluetoothd│           │kernel│            │controller│           │remote│
└─────┬─────┘          └─────┬────┘           └───┬──┘            └─────┬────┘           └───┬──┘
      │     disconnect       │                    │                     │                    │
      │     cmd              │                    │                     │                    │
      │─────────────────────>│                    │                     │                    │
      │                      │                    │                     │                    │
      │                      │    MGMT_OP_        │                     │                    │
      │                      │    DISCONNECT      │                     │                    │
      │                      │───────────────────>│                     │                    │
      │                      │                    │                     │                    │
      │                      │                   ┌┴┐     HCI_           │                    │
      │                      │                   │ │     Disconnect     │                    │
      │                      │                   │ │ ──────────────────>│                    │
      │                      │                   │ │                    │                    │
      │                      │                   │ │                    │  LL_TERMINATE     ┌┴┐
      │                      │                   │ │                    │  _IND             │ │
      │                      │                   │ │                    │─────────────────> │ │
      │                      │                   │ │                    │                   │ │
      │                      │                   │ │ 2s                 │                   │ │
      │                      │                   │ │                    │                   │ │
      │                      │                   │ │                    │                   │ │ More
      │                      │                   │ │                    │                   │ │ than
      │                      │                   │ │                    │                   │ │ 2s
      │                      │                   │ │                    │                   │ │
      │                      │                   │ │                    │                   │ │
      │                      │                   │ │                    │                   │ │
      │                      │                   └┬┘                    │                   │ │
      │                      │                    │────┐                │                   │ │
      │                      │                    │    │ hci_disconnect │                   │ │
      │                      │                    │<───┘ sync timeout,  │                   │ │
      │                      │                    │      del 'con' by   │                   │ │
      │                      │                    │      hci_conn_failed│                   │ │
      │                      │                    │                     │                   └┬┘
      │                      │                    │                     │        ACK         │
      │                      │                    │                     │<───────────────────│
      │                      │                    │                     │                    │
      │                      │                    │   Disc_Comp_Evt     │                    │
      │                      │                    │<────────────────────│                    │
      │                      │                    │                     │                    │
      │                      │                    │────┐                │                    │
      │                     \│/                   │    │ ignore the     │                    │
      │                      X                    │<───┘ event since    │                    │
      │                     /│\                   │      'con' has been │                    │
      │                      │     MGMT           │      deleted        │                    │
      │                      │     Response       │                     │                    │
      │                      │<─ ─ ─ ─ ─ ─ ─ ─ ─ ─│                     │                    │
┌─────┴─────┐          ┌─────┴────┐           ┌───┴──┐            ┌─────┴────┐           ┌───┴──┐
│bluetootctl│          │bluetoothd│           │kernel│            │controller│           │remote│
└───────────┘          └──────────┘           └──────┘            └──────────┘           └──────┘


So in the failure case, the bluetoothd is blocked by waiting the mgmt response from kernel. From
our test, bluetoothd can't accept any command related to mgmt from bluetothctl. 

>>
>> 3355 static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data,
>> 3356              struct sk_buff *skb)
>> 3357 {
>> 3358   struct hci_ev_disconn_complete *ev = data;
>> 3359   u8 reason;
>> 3360   struct hci_conn_params *params;
>> 3361   struct hci_conn *conn;
>> 3362   bool mgmt_connected;
>> 3363
>> 3364   bt_dev_dbg(hdev, "status 0x%2.2x", ev->status);
>> 3365
>> 3366   hci_dev_lock(hdev);
>> 3367
>> 3368   conn = hci_conn_hash_lookup_handle(hdev, __le16_to_cpu(ev->handle));
>> 3369   if (!conn)
>> 3370     goto unlock;
>> 3371
>> 3372   if (ev->status) {
>> 3373     mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
>> 3374                conn->dst_type, ev->status);
>> 3375     goto unlock;
>> 3376   }
>> 3377
>> 3378   conn->state = BT_CLOSED;
>> 3379
>> 3380   mgmt_connected = test_and_clear_bit(HCI_CONN_MGMT_CONNECTED, &conn->flags);
>> 3381
>>
>>> I double checked this and apparently this could no longer fail:
>>>
>>> +               /* Disregard possible errors since hci_conn_del shall have been
>>> +                * called even in case of errors had occurred since it would
>>> +                * then cause hci_conn_failed to be called which calls
>>> +                * hci_conn_del internally.
>>> +                */
>>> +               hci_abort_conn_sync(hdev, conn, reason);
>>>
>>> So it will clean up the hci_conn no matter what is the timeout, so
>>> either you don't have this change or it is not working for some
>>> reason.
>>>
>> The issue is mgmt command is not repsonsed by bluetoothd, then the bluetootlctl is
>> blocked. It does not happen during the power off stage. It happened when disconnect
>> a BLE device, but the disconnect complete event sent from controller to host 2s later.
>> Then it causes the mgmt in bluetoothctl is blocked as decribed as above.
> 
> There is a difference about a MGMT command, initiated by bluetoothd,
> versus a MGMT event initiated by the kernel, so the daemon is not
> blocked it just don't get a disconnection event, which is different
> than a command complete.
> 
> What is the command sequence that you use to reproduce the problem?
Here is the command log:
[CHG] Controller 8C:FD:F0:21:81:87 Pairable: yes
[bluetooth]# power on
Changing power on succeeded
[bluetooth]# connect CF:90:67:3C:7A:56
Attempting to connect to CF:90:67:3C:7A:56
[CHG] Device CF:90:67:3C:7A:56 Connected: yes
Connection successful
[CHG] Device CF:90:67:3C:7A:56 ServicesResolved: yes
[Designer Mouse]#
[Designer Mouse]# disconnect D9:B5:6C:F2:51:91
Attempting to disconnect from D9:B5:6C:F2:51:91 ## no disconnection success response
[Designer Mouse]#
[Designer Mouse]# power off
[Designer Mouse]#
Failed to set power off: org.freedesktop.DBus.Error.NoReply 

To easily reproduce this issue, we use a firmware which always send the disconnect
complete event more than 2s. Then the issue occurred 100%.

Actually, the root cause is the hci_disconnect_sync doesn't handle this case since it 
relies on __hci_cmd_sync_status_sk, which maximum timeout value is constrained to 2s.

> 
>>>>>         hci_dev_lock(hdev);
>>>>>
>>>>>         /* Check if the connection has been cleaned up concurrently */
>>>>>
>>>>> base-commit: e25c8d66f6786300b680866c0e0139981273feba
>>>>> --
>>>>> 2.34.1
>>>>>
>>>>
>>>>
>>>> --
>>>> Luiz Augusto von Dentz
>>>
>>>
>>>
>>
> 
>
Luiz Augusto von Dentz Dec. 17, 2024, 3:13 p.m. UTC | #7
Hi Cheng,

On Tue, Dec 17, 2024 at 12:51 AM Cheng Jiang (IOE)
<quic_chejiang@quicinc.com> wrote:
>
> Hi Luiz,
>
> On 12/17/2024 11:15 AM, Luiz Augusto von Dentz wrote:
> > Hi Cheng,
> >
> > On Mon, Dec 16, 2024 at 9:49 PM Cheng Jiang (IOE)
> > <quic_chejiang@quicinc.com> wrote:
> >>
> >> Hi Luiz,
> >>
> >> On 12/16/2024 10:42 PM, Luiz Augusto von Dentz wrote:
> >>> Hi Cheng,
> >>>
> >>> On Mon, Dec 16, 2024 at 9:32 AM Luiz Augusto von Dentz
> >>> <luiz.dentz@gmail.com> wrote:
> >>>>
> >>>> Hi Cheng,
> >>>>
> >>>> On Mon, Dec 16, 2024 at 3:08 AM Cheng Jiang <quic_chejiang@quicinc.com> wrote:
> >>>>>
> >>>>> Sometimes, the remote device doesn't acknowledge the LL_TERMINATE_IND
> >>>>> in time, requiring the controller to wait for the supervision timeout,
> >>>>> which may exceed 2 seconds. In the current implementation, the
> >>>>> HCI_EV_DISCONN_COMPLETE event is ignored if it arrives late, since
> >>>>> the hci_abort_conn_sync has cleaned up the connection after 2 seconds.
> >>>>> This causes the mgmt to get stuck, resulting in bluetoothd waiting
> >>>>> indefinitely for the mgmt response to the disconnect. To recover,
> >>>>> restarting bluetoothd is necessary.
> >>>>>
> >>>>> bluetoothctl log like this:
> >>>>> [Designer Mouse]# disconnect D9:B5:6C:F2:51:91
> >>>>> Attempting to disconnect from D9:B5:6C:F2:51:91
> >>>>> [Designer Mouse]#
> >>>>> [Designer Mouse]# power off
> >>>>> [Designer Mouse]#
> >>>>> Failed to set power off: org.freedesktop.DBus.Error.NoReply.
> >>>>>
> >>>>> Signed-off-by: Cheng Jiang <quic_chejiang@quicinc.com>
> >>>>> ---
> >>>>>  include/net/bluetooth/hci_core.h |  2 ++
> >>>>>  net/bluetooth/hci_conn.c         |  9 +++++++++
> >>>>>  net/bluetooth/hci_event.c        |  9 +++++++++
> >>>>>  net/bluetooth/hci_sync.c         | 18 ++++++++++++++++++
> >>>>>  4 files changed, 38 insertions(+)
> >>>>>
> >>>>> diff --git a/include/net/bluetooth/hci_core.h b/include/net/bluetooth/hci_core.h
> >>>>> index 734cd50cd..2ab079dcf 100644
> >>>>> --- a/include/net/bluetooth/hci_core.h
> >>>>> +++ b/include/net/bluetooth/hci_core.h
> >>>>> @@ -753,6 +753,8 @@ struct hci_conn {
> >>>>>
> >>>>>         struct bt_codec codec;
> >>>>>
> >>>>> +       struct completion disc_ev_comp;
> >>>>> +
> >>>>>         void (*connect_cfm_cb)  (struct hci_conn *conn, u8 status);
> >>>>>         void (*security_cfm_cb) (struct hci_conn *conn, u8 status);
> >>>>>         void (*disconn_cfm_cb)  (struct hci_conn *conn, u8 reason);
> >>>>> diff --git a/net/bluetooth/hci_conn.c b/net/bluetooth/hci_conn.c
> >>>>> index d097e308a..e0244e191 100644
> >>>>> --- a/net/bluetooth/hci_conn.c
> >>>>> +++ b/net/bluetooth/hci_conn.c
> >>>>> @@ -1028,6 +1028,15 @@ static struct hci_conn *__hci_conn_add(struct hci_dev *hdev, int type, bdaddr_t
> >>>>>
> >>>>>         hci_conn_init_sysfs(conn);
> >>>>>
> >>>>> +       /* This disc_ev_comp is inited when we send a disconnect request to
> >>>>> +        * the remote device but fail to receive the disconnect complete
> >>>>> +        * event within the expected time (2 seconds). This occurs because
> >>>>> +        * the remote device doesn't ack the terminate indication, forcing
> >>>>> +        * the controller to wait for the supervision timeout.
> >>>>> +        */
> >>>>> +       init_completion(&conn->disc_ev_comp);
> >>>>> +       complete(&conn->disc_ev_comp);
> >>>>> +
> >>>>>         return conn;
> >>>>>  }
> >>>>>
> >>>>> diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c
> >>>>> index 2cc7a9306..60ecb2b18 100644
> >>>>> --- a/net/bluetooth/hci_event.c
> >>>>> +++ b/net/bluetooth/hci_event.c
> >>>>> @@ -3366,6 +3366,15 @@ static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data,
> >>>>>         if (!conn)
> >>>>>                 goto unlock;
> >>>>>
> >>>>> +       /* Wake up disc_ev_comp here is ok. Since we hold the hdev lock
> >>>>> +        * hci_abort_conn_sync will wait hdev lock release to continue.
> >>>>> +        */
> >>>>> +       if (!completion_done(&conn->disc_ev_comp)) {
> >>>>> +               complete(&conn->disc_ev_comp);
> >>>>> +               /* Add some delay for hci_abort_conn_sync to handle the complete */
> >>>>> +               usleep_range(100, 1000);
> >>>>> +       }
> >>>>> +
> >>>>>         if (ev->status) {
> >>>>>                 mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
> >>>>>                                        conn->dst_type, ev->status);
> >>>>> diff --git a/net/bluetooth/hci_sync.c b/net/bluetooth/hci_sync.c
> >>>>> index 0badec712..783d04b57 100644
> >>>>> --- a/net/bluetooth/hci_sync.c
> >>>>> +++ b/net/bluetooth/hci_sync.c
> >>>>> @@ -5590,6 +5590,24 @@ int hci_abort_conn_sync(struct hci_dev *hdev, struct hci_conn *conn, u8 reason)
> >>>>>                 break;
> >>>>>         }
> >>>>>
> >>>>> +       /* Check whether the connection is successfully disconnected.
> >>>>> +        * Sometimes the remote device doesn't acknowledge the
> >>>>> +        * LL_TERMINATE_IND in time, requiring the controller to wait
> >>>>> +        * for the supervision timeout, which may exceed 2 seconds. In
> >>>>> +        * this case, we need to wait for the HCI_EV_DISCONN_COMPLETE
> >>>>> +        * event before cleaning up the connection.
> >>>>> +        */
> >>>>> +       if (err == -ETIMEDOUT) {
> >>>>> +               u32 idle_delay = msecs_to_jiffies(10 * conn->le_supv_timeout);
> >>>>> +
> >>>>> +               reinit_completion(&conn->disc_ev_comp);
> >>>>> +               if (!wait_for_completion_timeout(&conn->disc_ev_comp, idle_delay)) {
> >>>>> +                       bt_dev_warn(hdev, "Failed to get complete");
> >>>>> +                       mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
> >>>>> +                                              conn->dst_type, conn->abort_reason);
> >>>>> +               }
> >>>>> +       }
> >>>>
> >>>> Why don't we just set the supervision timeout as timeout then? If we
> >>>> will have to wait for it anyway just change hci_disconnect_sync to use
> >>>> 10 * conn->le_supv_timeout as timeout instead.
> >>>>
> >> hci_disconnect_sync uses __hci_cmd_sync_status_sk to wait for the
> >> HCI_EV_DISCONN_COMPLETE event, which will send the command in hci_cmd_work.
> >> In hci_cmd_work, it will start a timer with HCI_CMD_TIMEOUT(2s) to wait
> >> for the event. So even in hci_disconnect_sync we can set the timeout to
> >> supervision timeout, hci_disconnect_sync still timeout after 2s.
> >
> > Wait, why are you talking about HCI_CMD_TIMEOUT when I told you to use
> > the supervision timeout instead? If it still timeout after to 2
> > seconds then there is something still forcing HCI_CMD_TIMEOUT which
> > shouldn't happen.
> >
> Since the lower layer (hci_cmd_work) has set the timeout to HCI_CMD_TIMEOUT, so
> even the upper layer set to a larger timeout value, like supervision timeout,
> it still get the timeout after HCI_CMD_TIMEOUT. The timeout flow is:
> hci_disconnect_sync -> __hci_cmd_sync_sk(wait_event_interruptible_timeout) ->
> hci_cmd_work -> hci_cmd_timeout -> hci_cmd_sync_cancel_sync -> wake up the
> wait_event_interruptible_timeout in __hci_cmd_sync_sk -> hci_disconnect_sync timeout.
>
> So even if we set a large timeout value in hci_disconnect_sync, it doesn't work
> since it's waked up by other event, not the real timeout.
>
> What's more, in the hci_disconnect_sync, if the reason it not power_off, it waits
> for the disconnect complete event rather than command status event. According to
> BT core spec, disconnect complete event has to wait for remote's ack or wait until
> supervision timeout. It's a valid case that the disconnect complete event taking
> more than 2s.

You seems to be confusing the role of 2 different timers:

    err = wait_event_interruptible_timeout(hdev->req_wait_q,
                           hdev->req_status != HCI_REQ_PEND,
                           timeout);

and

            queue_delayed_work(hdev->workqueue, &hdev->cmd_timer,
                       HCI_CMD_TIMEOUT);

The former waits for a specific event, while the later waits for
handle_cmd_cnt_and_timer, each can have a distinct timeout as in the
code bellow:

    return __hci_cmd_sync_status_sk(hdev, HCI_OP_LE_EXT_CREATE_CONN,
                    plen, data,
                    HCI_EV_LE_ENHANCED_CONN_COMPLETE,
                    conn->conn_timeout, NULL);

The reason there are 2 timers is that we need to track the number of
commands outstanding in the controller, and no you can't delay Command
Status:

When the Controller receives the HCI_Disconnect command, it _shall_ send the
HCI_Command_Status event to the Host.

So even if HCI_Disconnection_Complete is delayed the following shall
still work provided the HCI_Command_Status is still being generated
accordingly:

index 0badec7120ab..0ab607fb6433 100644
--- a/net/bluetooth/hci_sync.c
+++ b/net/bluetooth/hci_sync.c
@@ -5444,7 +5444,7 @@ static int hci_disconnect_sync(struct hci_dev
*hdev, struct hci_conn *conn,
                return __hci_cmd_sync_status_sk(hdev, HCI_OP_DISCONNECT,
                                                sizeof(cp), &cp,
                                                HCI_EV_DISCONN_COMPLETE,
-                                               HCI_CMD_TIMEOUT, NULL);
+                                               conn->disc_timeout, NULL);

        return __hci_cmd_sync_status(hdev, HCI_OP_DISCONNECT, sizeof(cp), &cp,
                                     HCI_CMD_TIMEOUT);

Then we need to adjust conn->disc_timeout according to supervision
timeout, that said it appears disc_timeout is actually acting as idle
timeout, so perhaps we need a separate field, also it doesn't look
like we track the supervision timeout for non-LE links.

> >>>> That said, we really need to fix bluetoothd if it is not able to be
> >>>> cleaned up if SET_POWERED command fails, but it looks like it is
> >>>> handling errors correctly so it sounds like something else is at play.
> >>>
> >> The issue arises after a 2-second timeout of hci_disconnect_sync. During
> >> hci_abort_conn_sync, the connection is cleaned up by hci_conn_failed.
> >> after supervision timeout, the disconnect complete event arrives, but
> >> it returns at line 3370 since the connection has already been removed.
> >> As a result, bluetoothd does not send the mgmt event for MGMT_OP_DISCONNECT
> >> to the application layer (bluetoothctl), causing bluetoothctl to get stuck
> >> and unable to perform other mgmt commands.
> >
> > The command shall have completed regardless if disconnect complete has
> > been received or not, the is the whole point of having a timeout, so
> > this makes no sense to me, or you are not describing what is happening
> > here. Also there is no MGMT_OP_DISCONNECT pending, it is
> > MGMT_OP_SET_POWERED, if you are talking about the DISCONNECTED event
> > that is a totally different thing and perhaps that is the source of
> > the problem because if we do cleanup hci_conn even in case the command
> > fails/times out then we should be generating a disconnected event as
> > well.
> >
> Here is the flow describe the issue. For normal case:
> ┌───────────┐          ┌──────────┐           ┌──────┐           ┌──────────┐           ┌──────┐
> │bluetootctl│          │bluetoothd│           │kernel│           │controller│           │remote│
> └─────┬─────┘          └─────┬────┘           └───┬──┘           └─────┬────┘           └───┬──┘
>       │   disconnect cmd     │                    │                    │                    │
>       │─────────────────────>│                    │                    │                    │
>       │                      │                    │                    │                    │
>       │                      │    MGMT_OP         │                    │                    │
>       │                      │    _DISCONNECT     │                    │                    │
>       │                      │───────────────────>│                    │                    │
>       │                      │                    │                    │                    │
>       │                      │                    │  HCI_Disconnect    │                    │
>       │                      │                    │───────────────────>│                    │
>       │                      │                    │                    │                    │
>       │                      │                    │                    │   LL_TERMINATE     │
>       │                      │                    │                    │   _IND             │
>       │                      │                    │                    │───────────────────>│
>       │                      │                    │                    │                    │
>       │                      │                    │                    │        ACK         │
>       │                      │                    │                    │<───────────────────│
>       │                      │                    │                    │                    │
>       │                      │                    │   Disc_Comp_Evt    │                    │
>       │                      │                    │<───────────────────│                    │
>       │                      │                    │                    │                    │
>       │                      │   MGMT Response    │                    │                    │
>       │                      │<───────────────────│                    │                    │
>       │                      │                    │                    │                    │
>       │      disc succ       │                    │                    │                    │
>       │<─────────────────────│                    │                    │                    │
> ┌─────┴─────┐          ┌─────┴────┐           ┌───┴──┐           ┌─────┴────┐           ┌───┴──┐
> │bluetootctl│          │bluetoothd│           │kernel│           │controller│           │remote│
> └───────────┘          └──────────┘           └──────┘           └──────────┘           └──────┘
>
>
> The failure case like this:
>
> ┌───────────┐          ┌──────────┐           ┌──────┐            ┌──────────┐           ┌──────┐
> │bluetootctl│          │bluetoothd│           │kernel│            │controller│           │remote│
> └─────┬─────┘          └─────┬────┘           └───┬──┘            └─────┬────┘           └───┬──┘
>       │     disconnect       │                    │                     │                    │
>       │     cmd              │                    │                     │                    │
>       │─────────────────────>│                    │                     │                    │
>       │                      │                    │                     │                    │
>       │                      │    MGMT_OP_        │                     │                    │
>       │                      │    DISCONNECT      │                     │                    │
>       │                      │───────────────────>│                     │                    │
>       │                      │                    │                     │                    │
>       │                      │                   ┌┴┐     HCI_           │                    │
>       │                      │                   │ │     Disconnect     │                    │
>       │                      │                   │ │ ──────────────────>│                    │
>       │                      │                   │ │                    │                    │
>       │                      │                   │ │                    │  LL_TERMINATE     ┌┴┐
>       │                      │                   │ │                    │  _IND             │ │
>       │                      │                   │ │                    │─────────────────> │ │
>       │                      │                   │ │                    │                   │ │
>       │                      │                   │ │ 2s                 │                   │ │
>       │                      │                   │ │                    │                   │ │
>       │                      │                   │ │                    │                   │ │ More
>       │                      │                   │ │                    │                   │ │ than
>       │                      │                   │ │                    │                   │ │ 2s
>       │                      │                   │ │                    │                   │ │
>       │                      │                   │ │                    │                   │ │
>       │                      │                   │ │                    │                   │ │
>       │                      │                   └┬┘                    │                   │ │
>       │                      │                    │────┐                │                   │ │
>       │                      │                    │    │ hci_disconnect │                   │ │
>       │                      │                    │<───┘ sync timeout,  │                   │ │
>       │                      │                    │      del 'con' by   │                   │ │
>       │                      │                    │      hci_conn_failed│                   │ │
>       │                      │                    │                     │                   └┬┘
>       │                      │                    │                     │        ACK         │
>       │                      │                    │                     │<───────────────────│
>       │                      │                    │                     │                    │
>       │                      │                    │   Disc_Comp_Evt     │                    │
>       │                      │                    │<────────────────────│                    │
>       │                      │                    │                     │                    │
>       │                      │                    │────┐                │                    │
>       │                     \│/                   │    │ ignore the     │                    │
>       │                      X                    │<───┘ event since    │                    │
>       │                     /│\                   │      'con' has been │                    │
>       │                      │     MGMT           │      deleted        │                    │
>       │                      │     Response       │                     │                    │
>       │                      │<─ ─ ─ ─ ─ ─ ─ ─ ─ ─│                     │                    │
> ┌─────┴─────┐          ┌─────┴────┐           ┌───┴──┐            ┌─────┴────┐           ┌───┴──┐
> │bluetootctl│          │bluetoothd│           │kernel│            │controller│           │remote│
> └───────────┘          └──────────┘           └──────┘            └──────────┘           └──────┘
>
>
> So in the failure case, the bluetoothd is blocked by waiting the mgmt response from kernel. From
> our test, bluetoothd can't accept any command related to mgmt from bluetothctl.
>
> >>
> >> 3355 static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data,
> >> 3356              struct sk_buff *skb)
> >> 3357 {
> >> 3358   struct hci_ev_disconn_complete *ev = data;
> >> 3359   u8 reason;
> >> 3360   struct hci_conn_params *params;
> >> 3361   struct hci_conn *conn;
> >> 3362   bool mgmt_connected;
> >> 3363
> >> 3364   bt_dev_dbg(hdev, "status 0x%2.2x", ev->status);
> >> 3365
> >> 3366   hci_dev_lock(hdev);
> >> 3367
> >> 3368   conn = hci_conn_hash_lookup_handle(hdev, __le16_to_cpu(ev->handle));
> >> 3369   if (!conn)
> >> 3370     goto unlock;
> >> 3371
> >> 3372   if (ev->status) {
> >> 3373     mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
> >> 3374                conn->dst_type, ev->status);
> >> 3375     goto unlock;
> >> 3376   }
> >> 3377
> >> 3378   conn->state = BT_CLOSED;
> >> 3379
> >> 3380   mgmt_connected = test_and_clear_bit(HCI_CONN_MGMT_CONNECTED, &conn->flags);
> >> 3381
> >>
> >>> I double checked this and apparently this could no longer fail:
> >>>
> >>> +               /* Disregard possible errors since hci_conn_del shall have been
> >>> +                * called even in case of errors had occurred since it would
> >>> +                * then cause hci_conn_failed to be called which calls
> >>> +                * hci_conn_del internally.
> >>> +                */
> >>> +               hci_abort_conn_sync(hdev, conn, reason);
> >>>
> >>> So it will clean up the hci_conn no matter what is the timeout, so
> >>> either you don't have this change or it is not working for some
> >>> reason.
> >>>
> >> The issue is mgmt command is not repsonsed by bluetoothd, then the bluetootlctl is
> >> blocked. It does not happen during the power off stage. It happened when disconnect
> >> a BLE device, but the disconnect complete event sent from controller to host 2s later.
> >> Then it causes the mgmt in bluetoothctl is blocked as decribed as above.
> >
> > There is a difference about a MGMT command, initiated by bluetoothd,
> > versus a MGMT event initiated by the kernel, so the daemon is not
> > blocked it just don't get a disconnection event, which is different
> > than a command complete.
> >
> > What is the command sequence that you use to reproduce the problem?
> Here is the command log:
> [CHG] Controller 8C:FD:F0:21:81:87 Pairable: yes
> [bluetooth]# power on
> Changing power on succeeded
> [bluetooth]# connect CF:90:67:3C:7A:56
> Attempting to connect to CF:90:67:3C:7A:56
> [CHG] Device CF:90:67:3C:7A:56 Connected: yes
> Connection successful
> [CHG] Device CF:90:67:3C:7A:56 ServicesResolved: yes
> [Designer Mouse]#
> [Designer Mouse]# disconnect D9:B5:6C:F2:51:91
> Attempting to disconnect from D9:B5:6C:F2:51:91 ## no disconnection success response
> [Designer Mouse]#
> [Designer Mouse]# power off
> [Designer Mouse]#
> Failed to set power off: org.freedesktop.DBus.Error.NoReply
>
> To easily reproduce this issue, we use a firmware which always send the disconnect
> complete event more than 2s. Then the issue occurred 100%.
>
> Actually, the root cause is the hci_disconnect_sync doesn't handle this case since it
> relies on __hci_cmd_sync_status_sk, which maximum timeout value is constrained to 2s.
>
> >
> >>>>>         hci_dev_lock(hdev);
> >>>>>
> >>>>>         /* Check if the connection has been cleaned up concurrently */
> >>>>>
> >>>>> base-commit: e25c8d66f6786300b680866c0e0139981273feba
> >>>>> --
> >>>>> 2.34.1
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> Luiz Augusto von Dentz
> >>>
> >>>
> >>>
> >>
> >
> >
>
Cheng Jiang (IOE) Dec. 18, 2024, 8:22 a.m. UTC | #8
Hi Luiz,

On 12/17/2024 11:13 PM, Luiz Augusto von Dentz wrote:
> Hi Cheng,
> 
> On Tue, Dec 17, 2024 at 12:51 AM Cheng Jiang (IOE)
> <quic_chejiang@quicinc.com> wrote:
>>
>> Hi Luiz,
>>
>> On 12/17/2024 11:15 AM, Luiz Augusto von Dentz wrote:
>>> Hi Cheng,
>>>
>>> On Mon, Dec 16, 2024 at 9:49 PM Cheng Jiang (IOE)
>>> <quic_chejiang@quicinc.com> wrote:
>>>>
>>>> Hi Luiz,
>>>>
>>>> On 12/16/2024 10:42 PM, Luiz Augusto von Dentz wrote:
>>>>> Hi Cheng,
>>>>>
>>>>> On Mon, Dec 16, 2024 at 9:32 AM Luiz Augusto von Dentz
>>>>> <luiz.dentz@gmail.com> wrote:
>>>>>>
>>>>>> Hi Cheng,
>>>>>>
>>>>>> On Mon, Dec 16, 2024 at 3:08 AM Cheng Jiang <quic_chejiang@quicinc.com> wrote:
>>>>>>>
>>>>>>> Sometimes, the remote device doesn't acknowledge the LL_TERMINATE_IND
>>>>>>> in time, requiring the controller to wait for the supervision timeout,
>>>>>>> which may exceed 2 seconds. In the current implementation, the
>>>>>>> HCI_EV_DISCONN_COMPLETE event is ignored if it arrives late, since
>>>>>>> the hci_abort_conn_sync has cleaned up the connection after 2 seconds.
>>>>>>> This causes the mgmt to get stuck, resulting in bluetoothd waiting
>>>>>>> indefinitely for the mgmt response to the disconnect. To recover,
>>>>>>> restarting bluetoothd is necessary.
>>>>>>>
>>>>>>> bluetoothctl log like this:
>>>>>>> [Designer Mouse]# disconnect D9:B5:6C:F2:51:91
>>>>>>> Attempting to disconnect from D9:B5:6C:F2:51:91
>>>>>>> [Designer Mouse]#
>>>>>>> [Designer Mouse]# power off
>>>>>>> [Designer Mouse]#
>>>>>>> Failed to set power off: org.freedesktop.DBus.Error.NoReply.
>>>>>>>
>>>>>>> Signed-off-by: Cheng Jiang <quic_chejiang@quicinc.com>
>>>>>>> ---
>>>>>>>  include/net/bluetooth/hci_core.h |  2 ++
>>>>>>>  net/bluetooth/hci_conn.c         |  9 +++++++++
>>>>>>>  net/bluetooth/hci_event.c        |  9 +++++++++
>>>>>>>  net/bluetooth/hci_sync.c         | 18 ++++++++++++++++++
>>>>>>>  4 files changed, 38 insertions(+)
>>>>>>>
>>>>>>> diff --git a/include/net/bluetooth/hci_core.h b/include/net/bluetooth/hci_core.h
>>>>>>> index 734cd50cd..2ab079dcf 100644
>>>>>>> --- a/include/net/bluetooth/hci_core.h
>>>>>>> +++ b/include/net/bluetooth/hci_core.h
>>>>>>> @@ -753,6 +753,8 @@ struct hci_conn {
>>>>>>>
>>>>>>>         struct bt_codec codec;
>>>>>>>
>>>>>>> +       struct completion disc_ev_comp;
>>>>>>> +
>>>>>>>         void (*connect_cfm_cb)  (struct hci_conn *conn, u8 status);
>>>>>>>         void (*security_cfm_cb) (struct hci_conn *conn, u8 status);
>>>>>>>         void (*disconn_cfm_cb)  (struct hci_conn *conn, u8 reason);
>>>>>>> diff --git a/net/bluetooth/hci_conn.c b/net/bluetooth/hci_conn.c
>>>>>>> index d097e308a..e0244e191 100644
>>>>>>> --- a/net/bluetooth/hci_conn.c
>>>>>>> +++ b/net/bluetooth/hci_conn.c
>>>>>>> @@ -1028,6 +1028,15 @@ static struct hci_conn *__hci_conn_add(struct hci_dev *hdev, int type, bdaddr_t
>>>>>>>
>>>>>>>         hci_conn_init_sysfs(conn);
>>>>>>>
>>>>>>> +       /* This disc_ev_comp is inited when we send a disconnect request to
>>>>>>> +        * the remote device but fail to receive the disconnect complete
>>>>>>> +        * event within the expected time (2 seconds). This occurs because
>>>>>>> +        * the remote device doesn't ack the terminate indication, forcing
>>>>>>> +        * the controller to wait for the supervision timeout.
>>>>>>> +        */
>>>>>>> +       init_completion(&conn->disc_ev_comp);
>>>>>>> +       complete(&conn->disc_ev_comp);
>>>>>>> +
>>>>>>>         return conn;
>>>>>>>  }
>>>>>>>
>>>>>>> diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c
>>>>>>> index 2cc7a9306..60ecb2b18 100644
>>>>>>> --- a/net/bluetooth/hci_event.c
>>>>>>> +++ b/net/bluetooth/hci_event.c
>>>>>>> @@ -3366,6 +3366,15 @@ static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data,
>>>>>>>         if (!conn)
>>>>>>>                 goto unlock;
>>>>>>>
>>>>>>> +       /* Wake up disc_ev_comp here is ok. Since we hold the hdev lock
>>>>>>> +        * hci_abort_conn_sync will wait hdev lock release to continue.
>>>>>>> +        */
>>>>>>> +       if (!completion_done(&conn->disc_ev_comp)) {
>>>>>>> +               complete(&conn->disc_ev_comp);
>>>>>>> +               /* Add some delay for hci_abort_conn_sync to handle the complete */
>>>>>>> +               usleep_range(100, 1000);
>>>>>>> +       }
>>>>>>> +
>>>>>>>         if (ev->status) {
>>>>>>>                 mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
>>>>>>>                                        conn->dst_type, ev->status);
>>>>>>> diff --git a/net/bluetooth/hci_sync.c b/net/bluetooth/hci_sync.c
>>>>>>> index 0badec712..783d04b57 100644
>>>>>>> --- a/net/bluetooth/hci_sync.c
>>>>>>> +++ b/net/bluetooth/hci_sync.c
>>>>>>> @@ -5590,6 +5590,24 @@ int hci_abort_conn_sync(struct hci_dev *hdev, struct hci_conn *conn, u8 reason)
>>>>>>>                 break;
>>>>>>>         }
>>>>>>>
>>>>>>> +       /* Check whether the connection is successfully disconnected.
>>>>>>> +        * Sometimes the remote device doesn't acknowledge the
>>>>>>> +        * LL_TERMINATE_IND in time, requiring the controller to wait
>>>>>>> +        * for the supervision timeout, which may exceed 2 seconds. In
>>>>>>> +        * this case, we need to wait for the HCI_EV_DISCONN_COMPLETE
>>>>>>> +        * event before cleaning up the connection.
>>>>>>> +        */
>>>>>>> +       if (err == -ETIMEDOUT) {
>>>>>>> +               u32 idle_delay = msecs_to_jiffies(10 * conn->le_supv_timeout);
>>>>>>> +
>>>>>>> +               reinit_completion(&conn->disc_ev_comp);
>>>>>>> +               if (!wait_for_completion_timeout(&conn->disc_ev_comp, idle_delay)) {
>>>>>>> +                       bt_dev_warn(hdev, "Failed to get complete");
>>>>>>> +                       mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
>>>>>>> +                                              conn->dst_type, conn->abort_reason);
>>>>>>> +               }
>>>>>>> +       }
>>>>>>
>>>>>> Why don't we just set the supervision timeout as timeout then? If we
>>>>>> will have to wait for it anyway just change hci_disconnect_sync to use
>>>>>> 10 * conn->le_supv_timeout as timeout instead.
>>>>>>
>>>> hci_disconnect_sync uses __hci_cmd_sync_status_sk to wait for the
>>>> HCI_EV_DISCONN_COMPLETE event, which will send the command in hci_cmd_work.
>>>> In hci_cmd_work, it will start a timer with HCI_CMD_TIMEOUT(2s) to wait
>>>> for the event. So even in hci_disconnect_sync we can set the timeout to
>>>> supervision timeout, hci_disconnect_sync still timeout after 2s.
>>>
>>> Wait, why are you talking about HCI_CMD_TIMEOUT when I told you to use
>>> the supervision timeout instead? If it still timeout after to 2
>>> seconds then there is something still forcing HCI_CMD_TIMEOUT which
>>> shouldn't happen.
>>>
>> Since the lower layer (hci_cmd_work) has set the timeout to HCI_CMD_TIMEOUT, so
>> even the upper layer set to a larger timeout value, like supervision timeout,
>> it still get the timeout after HCI_CMD_TIMEOUT. The timeout flow is:
>> hci_disconnect_sync -> __hci_cmd_sync_sk(wait_event_interruptible_timeout) ->
>> hci_cmd_work -> hci_cmd_timeout -> hci_cmd_sync_cancel_sync -> wake up the
>> wait_event_interruptible_timeout in __hci_cmd_sync_sk -> hci_disconnect_sync timeout.
>>
>> So even if we set a large timeout value in hci_disconnect_sync, it doesn't work
>> since it's waked up by other event, not the real timeout.
>>
>> What's more, in the hci_disconnect_sync, if the reason it not power_off, it waits
>> for the disconnect complete event rather than command status event. According to
>> BT core spec, disconnect complete event has to wait for remote's ack or wait until
>> supervision timeout. It's a valid case that the disconnect complete event taking
>> more than 2s.
> 
> You seems to be confusing the role of 2 different timers:
> 
>     err = wait_event_interruptible_timeout(hdev->req_wait_q,
>                            hdev->req_status != HCI_REQ_PEND,
>                            timeout);
> 
> and
> 
>             queue_delayed_work(hdev->workqueue, &hdev->cmd_timer,
>                        HCI_CMD_TIMEOUT);
> 
> The former waits for a specific event, while the later waits for
> handle_cmd_cnt_and_timer, each can have a distinct timeout as in the
> code bellow:
> 
>     return __hci_cmd_sync_status_sk(hdev, HCI_OP_LE_EXT_CREATE_CONN,
>                     plen, data,
>                     HCI_EV_LE_ENHANCED_CONN_COMPLETE,
>                     conn->conn_timeout, NULL);
> 
> The reason there are 2 timers is that we need to track the number of
> commands outstanding in the controller, and no you can't delay Command
> Status:
> 
> When the Controller receives the HCI_Disconnect command, it _shall_ send the
> HCI_Command_Status event to the Host.
> 
> So even if HCI_Disconnection_Complete is delayed the following shall
> still work provided the HCI_Command_Status is still being generated
> accordingly:
> 
> index 0badec7120ab..0ab607fb6433 100644
> --- a/net/bluetooth/hci_sync.c
> +++ b/net/bluetooth/hci_sync.c
> @@ -5444,7 +5444,7 @@ static int hci_disconnect_sync(struct hci_dev
> *hdev, struct hci_conn *conn,
>                 return __hci_cmd_sync_status_sk(hdev, HCI_OP_DISCONNECT,
>                                                 sizeof(cp), &cp,
>                                                 HCI_EV_DISCONN_COMPLETE,
> -                                               HCI_CMD_TIMEOUT, NULL);
> +                                               conn->disc_timeout, NULL);
> 
>         return __hci_cmd_sync_status(hdev, HCI_OP_DISCONNECT, sizeof(cp), &cp,
>                                      HCI_CMD_TIMEOUT);
> 
> Then we need to adjust conn->disc_timeout according to supervision
> timeout, that said it appears disc_timeout is actually acting as idle
> timeout, so perhaps we need a separate field, also it doesn't look
> like we track the supervision timeout for non-LE links.
> 
Yes, you are right. I mixed up the two timers. Thank you for the detail
explanation. 
I just found you have fixed this issue by 227a0cdf4a028a73dc256d0f5144b4808d718893.
Link: https://github.com/bluez/bluez/issues/932

We are using an old codebase. Sorry for this. 

BTW, do you think we should change the timeout value here? Set the timeout value 
based on the link type.  

@@ -5419,6 +5419,7 @@ static int hci_disconnect_sync(struct hci_dev *hdev, struct hci_conn *conn,
                               u8 reason)
 {
        struct hci_cp_disconnect cp;
+       u32 disc_timeout;

        if (test_bit(HCI_CONN_BIG_CREATED, &conn->flags)) {
                /* This is a BIS connection, hci_conn_del will
@@ -5440,11 +5441,18 @@ static int hci_disconnect_sync(struct hci_dev *hdev, struct hci_conn *conn,
         * used when suspending or powering off, where we don't want to wait
         * for the peer's response.
         */
-       if (reason != HCI_ERROR_REMOTE_POWER_OFF)
+
+       if (reason != HCI_ERROR_REMOTE_POWER_OFF) {
+               if (conn->type == LE_LINK)
+                       disc_timeout = msecs_to_jiffies(10 * conn->le_supv_timeout);
+               else
+                       disc_timeout = conn->disc_timeout;
+
                return __hci_cmd_sync_status_sk(hdev, HCI_OP_DISCONNECT,
                                                sizeof(cp), &cp,
                                                HCI_EV_DISCONN_COMPLETE,
-                                               HCI_CMD_TIMEOUT, NULL);
+                                               disc_timeout, NULL);
+       }


>>>>>> That said, we really need to fix bluetoothd if it is not able to be
>>>>>> cleaned up if SET_POWERED command fails, but it looks like it is
>>>>>> handling errors correctly so it sounds like something else is at play.
>>>>>
>>>> The issue arises after a 2-second timeout of hci_disconnect_sync. During
>>>> hci_abort_conn_sync, the connection is cleaned up by hci_conn_failed.
>>>> after supervision timeout, the disconnect complete event arrives, but
>>>> it returns at line 3370 since the connection has already been removed.
>>>> As a result, bluetoothd does not send the mgmt event for MGMT_OP_DISCONNECT
>>>> to the application layer (bluetoothctl), causing bluetoothctl to get stuck
>>>> and unable to perform other mgmt commands.
>>>
>>> The command shall have completed regardless if disconnect complete has
>>> been received or not, the is the whole point of having a timeout, so
>>> this makes no sense to me, or you are not describing what is happening
>>> here. Also there is no MGMT_OP_DISCONNECT pending, it is
>>> MGMT_OP_SET_POWERED, if you are talking about the DISCONNECTED event
>>> that is a totally different thing and perhaps that is the source of
>>> the problem because if we do cleanup hci_conn even in case the command
>>> fails/times out then we should be generating a disconnected event as
>>> well.
>>>
>> Here is the flow describe the issue. For normal case:
>> ┌───────────┐          ┌──────────┐           ┌──────┐           ┌──────────┐           ┌──────┐
>> │bluetootctl│          │bluetoothd│           │kernel│           │controller│           │remote│
>> └─────┬─────┘          └─────┬────┘           └───┬──┘           └─────┬────┘           └───┬──┘
>>       │   disconnect cmd     │                    │                    │                    │
>>       │─────────────────────>│                    │                    │                    │
>>       │                      │                    │                    │                    │
>>       │                      │    MGMT_OP         │                    │                    │
>>       │                      │    _DISCONNECT     │                    │                    │
>>       │                      │───────────────────>│                    │                    │
>>       │                      │                    │                    │                    │
>>       │                      │                    │  HCI_Disconnect    │                    │
>>       │                      │                    │───────────────────>│                    │
>>       │                      │                    │                    │                    │
>>       │                      │                    │                    │   LL_TERMINATE     │
>>       │                      │                    │                    │   _IND             │
>>       │                      │                    │                    │───────────────────>│
>>       │                      │                    │                    │                    │
>>       │                      │                    │                    │        ACK         │
>>       │                      │                    │                    │<───────────────────│
>>       │                      │                    │                    │                    │
>>       │                      │                    │   Disc_Comp_Evt    │                    │
>>       │                      │                    │<───────────────────│                    │
>>       │                      │                    │                    │                    │
>>       │                      │   MGMT Response    │                    │                    │
>>       │                      │<───────────────────│                    │                    │
>>       │                      │                    │                    │                    │
>>       │      disc succ       │                    │                    │                    │
>>       │<─────────────────────│                    │                    │                    │
>> ┌─────┴─────┐          ┌─────┴────┐           ┌───┴──┐           ┌─────┴────┐           ┌───┴──┐
>> │bluetootctl│          │bluetoothd│           │kernel│           │controller│           │remote│
>> └───────────┘          └──────────┘           └──────┘           └──────────┘           └──────┘
>>
>>
>> The failure case like this:
>>
>> ┌───────────┐          ┌──────────┐           ┌──────┐            ┌──────────┐           ┌──────┐
>> │bluetootctl│          │bluetoothd│           │kernel│            │controller│           │remote│
>> └─────┬─────┘          └─────┬────┘           └───┬──┘            └─────┬────┘           └───┬──┘
>>       │     disconnect       │                    │                     │                    │
>>       │     cmd              │                    │                     │                    │
>>       │─────────────────────>│                    │                     │                    │
>>       │                      │                    │                     │                    │
>>       │                      │    MGMT_OP_        │                     │                    │
>>       │                      │    DISCONNECT      │                     │                    │
>>       │                      │───────────────────>│                     │                    │
>>       │                      │                    │                     │                    │
>>       │                      │                   ┌┴┐     HCI_           │                    │
>>       │                      │                   │ │     Disconnect     │                    │
>>       │                      │                   │ │ ──────────────────>│                    │
>>       │                      │                   │ │                    │                    │
>>       │                      │                   │ │                    │  LL_TERMINATE     ┌┴┐
>>       │                      │                   │ │                    │  _IND             │ │
>>       │                      │                   │ │                    │─────────────────> │ │
>>       │                      │                   │ │                    │                   │ │
>>       │                      │                   │ │ 2s                 │                   │ │
>>       │                      │                   │ │                    │                   │ │
>>       │                      │                   │ │                    │                   │ │ More
>>       │                      │                   │ │                    │                   │ │ than
>>       │                      │                   │ │                    │                   │ │ 2s
>>       │                      │                   │ │                    │                   │ │
>>       │                      │                   │ │                    │                   │ │
>>       │                      │                   │ │                    │                   │ │
>>       │                      │                   └┬┘                    │                   │ │
>>       │                      │                    │────┐                │                   │ │
>>       │                      │                    │    │ hci_disconnect │                   │ │
>>       │                      │                    │<───┘ sync timeout,  │                   │ │
>>       │                      │                    │      del 'con' by   │                   │ │
>>       │                      │                    │      hci_conn_failed│                   │ │
>>       │                      │                    │                     │                   └┬┘
>>       │                      │                    │                     │        ACK         │
>>       │                      │                    │                     │<───────────────────│
>>       │                      │                    │                     │                    │
>>       │                      │                    │   Disc_Comp_Evt     │                    │
>>       │                      │                    │<────────────────────│                    │
>>       │                      │                    │                     │                    │
>>       │                      │                    │────┐                │                    │
>>       │                     \│/                   │    │ ignore the     │                    │
>>       │                      X                    │<───┘ event since    │                    │
>>       │                     /│\                   │      'con' has been │                    │
>>       │                      │     MGMT           │      deleted        │                    │
>>       │                      │     Response       │                     │                    │
>>       │                      │<─ ─ ─ ─ ─ ─ ─ ─ ─ ─│                     │                    │
>> ┌─────┴─────┐          ┌─────┴────┐           ┌───┴──┐            ┌─────┴────┐           ┌───┴──┐
>> │bluetootctl│          │bluetoothd│           │kernel│            │controller│           │remote│
>> └───────────┘          └──────────┘           └──────┘            └──────────┘           └──────┘
>>
>>
>> So in the failure case, the bluetoothd is blocked by waiting the mgmt response from kernel. From
>> our test, bluetoothd can't accept any command related to mgmt from bluetothctl.
>>
>>>>
>>>> 3355 static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data,
>>>> 3356              struct sk_buff *skb)
>>>> 3357 {
>>>> 3358   struct hci_ev_disconn_complete *ev = data;
>>>> 3359   u8 reason;
>>>> 3360   struct hci_conn_params *params;
>>>> 3361   struct hci_conn *conn;
>>>> 3362   bool mgmt_connected;
>>>> 3363
>>>> 3364   bt_dev_dbg(hdev, "status 0x%2.2x", ev->status);
>>>> 3365
>>>> 3366   hci_dev_lock(hdev);
>>>> 3367
>>>> 3368   conn = hci_conn_hash_lookup_handle(hdev, __le16_to_cpu(ev->handle));
>>>> 3369   if (!conn)
>>>> 3370     goto unlock;
>>>> 3371
>>>> 3372   if (ev->status) {
>>>> 3373     mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
>>>> 3374                conn->dst_type, ev->status);
>>>> 3375     goto unlock;
>>>> 3376   }
>>>> 3377
>>>> 3378   conn->state = BT_CLOSED;
>>>> 3379
>>>> 3380   mgmt_connected = test_and_clear_bit(HCI_CONN_MGMT_CONNECTED, &conn->flags);
>>>> 3381
>>>>
>>>>> I double checked this and apparently this could no longer fail:
>>>>>
>>>>> +               /* Disregard possible errors since hci_conn_del shall have been
>>>>> +                * called even in case of errors had occurred since it would
>>>>> +                * then cause hci_conn_failed to be called which calls
>>>>> +                * hci_conn_del internally.
>>>>> +                */
>>>>> +               hci_abort_conn_sync(hdev, conn, reason);
>>>>>
>>>>> So it will clean up the hci_conn no matter what is the timeout, so
>>>>> either you don't have this change or it is not working for some
>>>>> reason.
>>>>>
>>>> The issue is mgmt command is not repsonsed by bluetoothd, then the bluetootlctl is
>>>> blocked. It does not happen during the power off stage. It happened when disconnect
>>>> a BLE device, but the disconnect complete event sent from controller to host 2s later.
>>>> Then it causes the mgmt in bluetoothctl is blocked as decribed as above.
>>>
>>> There is a difference about a MGMT command, initiated by bluetoothd,
>>> versus a MGMT event initiated by the kernel, so the daemon is not
>>> blocked it just don't get a disconnection event, which is different
>>> than a command complete.
>>>
>>> What is the command sequence that you use to reproduce the problem?
>> Here is the command log:
>> [CHG] Controller 8C:FD:F0:21:81:87 Pairable: yes
>> [bluetooth]# power on
>> Changing power on succeeded
>> [bluetooth]# connect CF:90:67:3C:7A:56
>> Attempting to connect to CF:90:67:3C:7A:56
>> [CHG] Device CF:90:67:3C:7A:56 Connected: yes
>> Connection successful
>> [CHG] Device CF:90:67:3C:7A:56 ServicesResolved: yes
>> [Designer Mouse]#
>> [Designer Mouse]# disconnect D9:B5:6C:F2:51:91
>> Attempting to disconnect from D9:B5:6C:F2:51:91 ## no disconnection success response
>> [Designer Mouse]#
>> [Designer Mouse]# power off
>> [Designer Mouse]#
>> Failed to set power off: org.freedesktop.DBus.Error.NoReply
>>
>> To easily reproduce this issue, we use a firmware which always send the disconnect
>> complete event more than 2s. Then the issue occurred 100%.
>>
>> Actually, the root cause is the hci_disconnect_sync doesn't handle this case since it
>> relies on __hci_cmd_sync_status_sk, which maximum timeout value is constrained to 2s.
>>
>>>
>>>>>>>         hci_dev_lock(hdev);
>>>>>>>
>>>>>>>         /* Check if the connection has been cleaned up concurrently */
>>>>>>>
>>>>>>> base-commit: e25c8d66f6786300b680866c0e0139981273feba
>>>>>>> --
>>>>>>> 2.34.1
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Luiz Augusto von Dentz
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
> 
>
diff mbox series

Patch

diff --git a/include/net/bluetooth/hci_core.h b/include/net/bluetooth/hci_core.h
index 734cd50cd..2ab079dcf 100644
--- a/include/net/bluetooth/hci_core.h
+++ b/include/net/bluetooth/hci_core.h
@@ -753,6 +753,8 @@  struct hci_conn {
 
 	struct bt_codec codec;
 
+	struct completion disc_ev_comp;
+
 	void (*connect_cfm_cb)	(struct hci_conn *conn, u8 status);
 	void (*security_cfm_cb)	(struct hci_conn *conn, u8 status);
 	void (*disconn_cfm_cb)	(struct hci_conn *conn, u8 reason);
diff --git a/net/bluetooth/hci_conn.c b/net/bluetooth/hci_conn.c
index d097e308a..e0244e191 100644
--- a/net/bluetooth/hci_conn.c
+++ b/net/bluetooth/hci_conn.c
@@ -1028,6 +1028,15 @@  static struct hci_conn *__hci_conn_add(struct hci_dev *hdev, int type, bdaddr_t
 
 	hci_conn_init_sysfs(conn);
 
+	/* This disc_ev_comp is inited when we send a disconnect request to
+	 * the remote device but fail to receive the disconnect complete
+	 * event within the expected time (2 seconds). This occurs because
+	 * the remote device doesn't ack the terminate indication, forcing
+	 * the controller to wait for the supervision timeout.
+	 */
+	init_completion(&conn->disc_ev_comp);
+	complete(&conn->disc_ev_comp);
+
 	return conn;
 }
 
diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c
index 2cc7a9306..60ecb2b18 100644
--- a/net/bluetooth/hci_event.c
+++ b/net/bluetooth/hci_event.c
@@ -3366,6 +3366,15 @@  static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data,
 	if (!conn)
 		goto unlock;
 
+	/* Wake up disc_ev_comp here is ok. Since we hold the hdev lock
+	 * hci_abort_conn_sync will wait hdev lock release to continue.
+	 */
+	if (!completion_done(&conn->disc_ev_comp)) {
+		complete(&conn->disc_ev_comp);
+		/* Add some delay for hci_abort_conn_sync to handle the complete */
+		usleep_range(100, 1000);
+	}
+
 	if (ev->status) {
 		mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
 				       conn->dst_type, ev->status);
diff --git a/net/bluetooth/hci_sync.c b/net/bluetooth/hci_sync.c
index 0badec712..783d04b57 100644
--- a/net/bluetooth/hci_sync.c
+++ b/net/bluetooth/hci_sync.c
@@ -5590,6 +5590,24 @@  int hci_abort_conn_sync(struct hci_dev *hdev, struct hci_conn *conn, u8 reason)
 		break;
 	}
 
+	/* Check whether the connection is successfully disconnected.
+	 * Sometimes the remote device doesn't acknowledge the
+	 * LL_TERMINATE_IND in time, requiring the controller to wait
+	 * for the supervision timeout, which may exceed 2 seconds. In
+	 * this case, we need to wait for the HCI_EV_DISCONN_COMPLETE
+	 * event before cleaning up the connection.
+	 */
+	if (err == -ETIMEDOUT) {
+		u32 idle_delay = msecs_to_jiffies(10 * conn->le_supv_timeout);
+
+		reinit_completion(&conn->disc_ev_comp);
+		if (!wait_for_completion_timeout(&conn->disc_ev_comp, idle_delay)) {
+			bt_dev_warn(hdev, "Failed to get complete");
+			mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
+					       conn->dst_type, conn->abort_reason);
+		}
+	}
+
 	hci_dev_lock(hdev);
 
 	/* Check if the connection has been cleaned up concurrently */