Message ID | 20241216080758.3450976-1-quic_chejiang@quicinc.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v1] Bluetooth: hci_sync: Fix disconnect complete event timeout issue | expand |
Context | Check | Description |
---|---|---|
tedd_an/pre-ci_am | success | Success |
tedd_an/SubjectPrefix | success | Gitlint PASS |
tedd_an/BuildKernel | success | BuildKernel PASS |
tedd_an/CheckAllWarning | success | CheckAllWarning PASS |
tedd_an/CheckSparse | warning | CheckSparse WARNING net/bluetooth/hci_event.c: note: in included file (through include/net/bluetooth/hci_core.h): |
tedd_an/BuildKernel32 | success | BuildKernel32 PASS |
tedd_an/TestRunnerSetup | success | TestRunnerSetup PASS |
tedd_an/TestRunner_l2cap-tester | success | TestRunner PASS |
tedd_an/TestRunner_iso-tester | success | TestRunner PASS |
tedd_an/TestRunner_bnep-tester | success | TestRunner PASS |
tedd_an/TestRunner_mgmt-tester | fail | TestRunner_mgmt-tester: Total: 490, Passed: 483 (98.6%), Failed: 3, Not Run: 4 |
tedd_an/TestRunner_rfcomm-tester | success | TestRunner PASS |
tedd_an/TestRunner_sco-tester | fail | TestRunner_sco-tester: BUG: KASAN: slab-use-after-free in hci_abort_conn_sync+0xc92/0xf30 |
tedd_an/TestRunner_ioctl-tester | success | TestRunner PASS |
tedd_an/TestRunner_mesh-tester | success | TestRunner PASS |
tedd_an/TestRunner_smp-tester | success | TestRunner PASS |
tedd_an/TestRunner_userchan-tester | success | TestRunner PASS |
This is automated email and please do not reply to this email! Dear submitter, Thank you for submitting the patches to the linux bluetooth mailing list. This is a CI test results with your patch series: PW Link:https://patchwork.kernel.org/project/bluetooth/list/?series=918106 ---Test result--- Test Summary: CheckPatch PENDING 0.40 seconds GitLint PENDING 0.25 seconds SubjectPrefix PASS 0.07 seconds BuildKernel PASS 25.14 seconds CheckAllWarning PASS 27.29 seconds CheckSparse WARNING 30.84 seconds BuildKernel32 PASS 24.66 seconds TestRunnerSetup PASS 437.78 seconds TestRunner_l2cap-tester PASS 20.72 seconds TestRunner_iso-tester PASS 30.18 seconds TestRunner_bnep-tester PASS 4.86 seconds TestRunner_mgmt-tester FAIL 122.69 seconds TestRunner_rfcomm-tester PASS 7.66 seconds TestRunner_sco-tester FAIL 9.40 seconds TestRunner_ioctl-tester PASS 8.14 seconds TestRunner_mesh-tester PASS 5.96 seconds TestRunner_smp-tester PASS 6.98 seconds TestRunner_userchan-tester PASS 5.05 seconds IncrementalBuild PENDING 0.72 seconds Details ############################## Test: CheckPatch - PENDING Desc: Run checkpatch.pl script Output: ############################## Test: GitLint - PENDING Desc: Run gitlint Output: ############################## Test: CheckSparse - WARNING Desc: Run sparse tool with linux kernel Output: net/bluetooth/hci_event.c: note: in included file (through include/net/bluetooth/hci_core.h): ############################## Test: TestRunner_mgmt-tester - FAIL Desc: Run mgmt-tester with test-runner Output: Total: 490, Passed: 483 (98.6%), Failed: 3, Not Run: 4 Failed Test Cases LL Privacy - Set Flags 1 (Add to RL) Failed 0.153 seconds LL Privacy - Set Flags 3 (2 Devices to RL) Failed 0.169 seconds LL Privacy - Start Discovery 2 (Disable RL) Failed 0.187 seconds ############################## Test: TestRunner_sco-tester - FAIL Desc: Run sco-tester with test-runner Output: BUG: KASAN: slab-use-after-free in hci_abort_conn_sync+0xc92/0xf30 Total: 15, Passed: 14 (93.3%), Failed: 0, Not Run: 1 ############################## Test: IncrementalBuild - PENDING Desc: Incremental build with the patches in the series Output: --- Regards, Linux Bluetooth
Hi Cheng, On Mon, Dec 16, 2024 at 3:08 AM Cheng Jiang <quic_chejiang@quicinc.com> wrote: > > Sometimes, the remote device doesn't acknowledge the LL_TERMINATE_IND > in time, requiring the controller to wait for the supervision timeout, > which may exceed 2 seconds. In the current implementation, the > HCI_EV_DISCONN_COMPLETE event is ignored if it arrives late, since > the hci_abort_conn_sync has cleaned up the connection after 2 seconds. > This causes the mgmt to get stuck, resulting in bluetoothd waiting > indefinitely for the mgmt response to the disconnect. To recover, > restarting bluetoothd is necessary. > > bluetoothctl log like this: > [Designer Mouse]# disconnect D9:B5:6C:F2:51:91 > Attempting to disconnect from D9:B5:6C:F2:51:91 > [Designer Mouse]# > [Designer Mouse]# power off > [Designer Mouse]# > Failed to set power off: org.freedesktop.DBus.Error.NoReply. > > Signed-off-by: Cheng Jiang <quic_chejiang@quicinc.com> > --- > include/net/bluetooth/hci_core.h | 2 ++ > net/bluetooth/hci_conn.c | 9 +++++++++ > net/bluetooth/hci_event.c | 9 +++++++++ > net/bluetooth/hci_sync.c | 18 ++++++++++++++++++ > 4 files changed, 38 insertions(+) > > diff --git a/include/net/bluetooth/hci_core.h b/include/net/bluetooth/hci_core.h > index 734cd50cd..2ab079dcf 100644 > --- a/include/net/bluetooth/hci_core.h > +++ b/include/net/bluetooth/hci_core.h > @@ -753,6 +753,8 @@ struct hci_conn { > > struct bt_codec codec; > > + struct completion disc_ev_comp; > + > void (*connect_cfm_cb) (struct hci_conn *conn, u8 status); > void (*security_cfm_cb) (struct hci_conn *conn, u8 status); > void (*disconn_cfm_cb) (struct hci_conn *conn, u8 reason); > diff --git a/net/bluetooth/hci_conn.c b/net/bluetooth/hci_conn.c > index d097e308a..e0244e191 100644 > --- a/net/bluetooth/hci_conn.c > +++ b/net/bluetooth/hci_conn.c > @@ -1028,6 +1028,15 @@ static struct hci_conn *__hci_conn_add(struct hci_dev *hdev, int type, bdaddr_t > > hci_conn_init_sysfs(conn); > > + /* This disc_ev_comp is inited when we send a disconnect request to > + * the remote device but fail to receive the disconnect complete > + * event within the expected time (2 seconds). This occurs because > + * the remote device doesn't ack the terminate indication, forcing > + * the controller to wait for the supervision timeout. > + */ > + init_completion(&conn->disc_ev_comp); > + complete(&conn->disc_ev_comp); > + > return conn; > } > > diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c > index 2cc7a9306..60ecb2b18 100644 > --- a/net/bluetooth/hci_event.c > +++ b/net/bluetooth/hci_event.c > @@ -3366,6 +3366,15 @@ static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data, > if (!conn) > goto unlock; > > + /* Wake up disc_ev_comp here is ok. Since we hold the hdev lock > + * hci_abort_conn_sync will wait hdev lock release to continue. > + */ > + if (!completion_done(&conn->disc_ev_comp)) { > + complete(&conn->disc_ev_comp); > + /* Add some delay for hci_abort_conn_sync to handle the complete */ > + usleep_range(100, 1000); > + } > + > if (ev->status) { > mgmt_disconnect_failed(hdev, &conn->dst, conn->type, > conn->dst_type, ev->status); > diff --git a/net/bluetooth/hci_sync.c b/net/bluetooth/hci_sync.c > index 0badec712..783d04b57 100644 > --- a/net/bluetooth/hci_sync.c > +++ b/net/bluetooth/hci_sync.c > @@ -5590,6 +5590,24 @@ int hci_abort_conn_sync(struct hci_dev *hdev, struct hci_conn *conn, u8 reason) > break; > } > > + /* Check whether the connection is successfully disconnected. > + * Sometimes the remote device doesn't acknowledge the > + * LL_TERMINATE_IND in time, requiring the controller to wait > + * for the supervision timeout, which may exceed 2 seconds. In > + * this case, we need to wait for the HCI_EV_DISCONN_COMPLETE > + * event before cleaning up the connection. > + */ > + if (err == -ETIMEDOUT) { > + u32 idle_delay = msecs_to_jiffies(10 * conn->le_supv_timeout); > + > + reinit_completion(&conn->disc_ev_comp); > + if (!wait_for_completion_timeout(&conn->disc_ev_comp, idle_delay)) { > + bt_dev_warn(hdev, "Failed to get complete"); > + mgmt_disconnect_failed(hdev, &conn->dst, conn->type, > + conn->dst_type, conn->abort_reason); > + } > + } Why don't we just set the supervision timeout as timeout then? If we will have to wait for it anyway just change hci_disconnect_sync to use 10 * conn->le_supv_timeout as timeout instead. That said, we really need to fix bluetoothd if it is not able to be cleaned up if SET_POWERED command fails, but it looks like it is handling errors correctly so it sounds like something else is at play. > hci_dev_lock(hdev); > > /* Check if the connection has been cleaned up concurrently */ > > base-commit: e25c8d66f6786300b680866c0e0139981273feba > -- > 2.34.1 >
Hi Cheng, On Mon, Dec 16, 2024 at 9:32 AM Luiz Augusto von Dentz <luiz.dentz@gmail.com> wrote: > > Hi Cheng, > > On Mon, Dec 16, 2024 at 3:08 AM Cheng Jiang <quic_chejiang@quicinc.com> wrote: > > > > Sometimes, the remote device doesn't acknowledge the LL_TERMINATE_IND > > in time, requiring the controller to wait for the supervision timeout, > > which may exceed 2 seconds. In the current implementation, the > > HCI_EV_DISCONN_COMPLETE event is ignored if it arrives late, since > > the hci_abort_conn_sync has cleaned up the connection after 2 seconds. > > This causes the mgmt to get stuck, resulting in bluetoothd waiting > > indefinitely for the mgmt response to the disconnect. To recover, > > restarting bluetoothd is necessary. > > > > bluetoothctl log like this: > > [Designer Mouse]# disconnect D9:B5:6C:F2:51:91 > > Attempting to disconnect from D9:B5:6C:F2:51:91 > > [Designer Mouse]# > > [Designer Mouse]# power off > > [Designer Mouse]# > > Failed to set power off: org.freedesktop.DBus.Error.NoReply. > > > > Signed-off-by: Cheng Jiang <quic_chejiang@quicinc.com> > > --- > > include/net/bluetooth/hci_core.h | 2 ++ > > net/bluetooth/hci_conn.c | 9 +++++++++ > > net/bluetooth/hci_event.c | 9 +++++++++ > > net/bluetooth/hci_sync.c | 18 ++++++++++++++++++ > > 4 files changed, 38 insertions(+) > > > > diff --git a/include/net/bluetooth/hci_core.h b/include/net/bluetooth/hci_core.h > > index 734cd50cd..2ab079dcf 100644 > > --- a/include/net/bluetooth/hci_core.h > > +++ b/include/net/bluetooth/hci_core.h > > @@ -753,6 +753,8 @@ struct hci_conn { > > > > struct bt_codec codec; > > > > + struct completion disc_ev_comp; > > + > > void (*connect_cfm_cb) (struct hci_conn *conn, u8 status); > > void (*security_cfm_cb) (struct hci_conn *conn, u8 status); > > void (*disconn_cfm_cb) (struct hci_conn *conn, u8 reason); > > diff --git a/net/bluetooth/hci_conn.c b/net/bluetooth/hci_conn.c > > index d097e308a..e0244e191 100644 > > --- a/net/bluetooth/hci_conn.c > > +++ b/net/bluetooth/hci_conn.c > > @@ -1028,6 +1028,15 @@ static struct hci_conn *__hci_conn_add(struct hci_dev *hdev, int type, bdaddr_t > > > > hci_conn_init_sysfs(conn); > > > > + /* This disc_ev_comp is inited when we send a disconnect request to > > + * the remote device but fail to receive the disconnect complete > > + * event within the expected time (2 seconds). This occurs because > > + * the remote device doesn't ack the terminate indication, forcing > > + * the controller to wait for the supervision timeout. > > + */ > > + init_completion(&conn->disc_ev_comp); > > + complete(&conn->disc_ev_comp); > > + > > return conn; > > } > > > > diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c > > index 2cc7a9306..60ecb2b18 100644 > > --- a/net/bluetooth/hci_event.c > > +++ b/net/bluetooth/hci_event.c > > @@ -3366,6 +3366,15 @@ static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data, > > if (!conn) > > goto unlock; > > > > + /* Wake up disc_ev_comp here is ok. Since we hold the hdev lock > > + * hci_abort_conn_sync will wait hdev lock release to continue. > > + */ > > + if (!completion_done(&conn->disc_ev_comp)) { > > + complete(&conn->disc_ev_comp); > > + /* Add some delay for hci_abort_conn_sync to handle the complete */ > > + usleep_range(100, 1000); > > + } > > + > > if (ev->status) { > > mgmt_disconnect_failed(hdev, &conn->dst, conn->type, > > conn->dst_type, ev->status); > > diff --git a/net/bluetooth/hci_sync.c b/net/bluetooth/hci_sync.c > > index 0badec712..783d04b57 100644 > > --- a/net/bluetooth/hci_sync.c > > +++ b/net/bluetooth/hci_sync.c > > @@ -5590,6 +5590,24 @@ int hci_abort_conn_sync(struct hci_dev *hdev, struct hci_conn *conn, u8 reason) > > break; > > } > > > > + /* Check whether the connection is successfully disconnected. > > + * Sometimes the remote device doesn't acknowledge the > > + * LL_TERMINATE_IND in time, requiring the controller to wait > > + * for the supervision timeout, which may exceed 2 seconds. In > > + * this case, we need to wait for the HCI_EV_DISCONN_COMPLETE > > + * event before cleaning up the connection. > > + */ > > + if (err == -ETIMEDOUT) { > > + u32 idle_delay = msecs_to_jiffies(10 * conn->le_supv_timeout); > > + > > + reinit_completion(&conn->disc_ev_comp); > > + if (!wait_for_completion_timeout(&conn->disc_ev_comp, idle_delay)) { > > + bt_dev_warn(hdev, "Failed to get complete"); > > + mgmt_disconnect_failed(hdev, &conn->dst, conn->type, > > + conn->dst_type, conn->abort_reason); > > + } > > + } > > Why don't we just set the supervision timeout as timeout then? If we > will have to wait for it anyway just change hci_disconnect_sync to use > 10 * conn->le_supv_timeout as timeout instead. > > That said, we really need to fix bluetoothd if it is not able to be > cleaned up if SET_POWERED command fails, but it looks like it is > handling errors correctly so it sounds like something else is at play. I double checked this and apparently this could no longer fail: + /* Disregard possible errors since hci_conn_del shall have been + * called even in case of errors had occurred since it would + * then cause hci_conn_failed to be called which calls + * hci_conn_del internally. + */ + hci_abort_conn_sync(hdev, conn, reason); So it will clean up the hci_conn no matter what is the timeout, so either you don't have this change or it is not working for some reason. > > hci_dev_lock(hdev); > > > > /* Check if the connection has been cleaned up concurrently */ > > > > base-commit: e25c8d66f6786300b680866c0e0139981273feba > > -- > > 2.34.1 > > > > > -- > Luiz Augusto von Dentz
Hi Luiz, On 12/16/2024 10:42 PM, Luiz Augusto von Dentz wrote: > Hi Cheng, > > On Mon, Dec 16, 2024 at 9:32 AM Luiz Augusto von Dentz > <luiz.dentz@gmail.com> wrote: >> >> Hi Cheng, >> >> On Mon, Dec 16, 2024 at 3:08 AM Cheng Jiang <quic_chejiang@quicinc.com> wrote: >>> >>> Sometimes, the remote device doesn't acknowledge the LL_TERMINATE_IND >>> in time, requiring the controller to wait for the supervision timeout, >>> which may exceed 2 seconds. In the current implementation, the >>> HCI_EV_DISCONN_COMPLETE event is ignored if it arrives late, since >>> the hci_abort_conn_sync has cleaned up the connection after 2 seconds. >>> This causes the mgmt to get stuck, resulting in bluetoothd waiting >>> indefinitely for the mgmt response to the disconnect. To recover, >>> restarting bluetoothd is necessary. >>> >>> bluetoothctl log like this: >>> [Designer Mouse]# disconnect D9:B5:6C:F2:51:91 >>> Attempting to disconnect from D9:B5:6C:F2:51:91 >>> [Designer Mouse]# >>> [Designer Mouse]# power off >>> [Designer Mouse]# >>> Failed to set power off: org.freedesktop.DBus.Error.NoReply. >>> >>> Signed-off-by: Cheng Jiang <quic_chejiang@quicinc.com> >>> --- >>> include/net/bluetooth/hci_core.h | 2 ++ >>> net/bluetooth/hci_conn.c | 9 +++++++++ >>> net/bluetooth/hci_event.c | 9 +++++++++ >>> net/bluetooth/hci_sync.c | 18 ++++++++++++++++++ >>> 4 files changed, 38 insertions(+) >>> >>> diff --git a/include/net/bluetooth/hci_core.h b/include/net/bluetooth/hci_core.h >>> index 734cd50cd..2ab079dcf 100644 >>> --- a/include/net/bluetooth/hci_core.h >>> +++ b/include/net/bluetooth/hci_core.h >>> @@ -753,6 +753,8 @@ struct hci_conn { >>> >>> struct bt_codec codec; >>> >>> + struct completion disc_ev_comp; >>> + >>> void (*connect_cfm_cb) (struct hci_conn *conn, u8 status); >>> void (*security_cfm_cb) (struct hci_conn *conn, u8 status); >>> void (*disconn_cfm_cb) (struct hci_conn *conn, u8 reason); >>> diff --git a/net/bluetooth/hci_conn.c b/net/bluetooth/hci_conn.c >>> index d097e308a..e0244e191 100644 >>> --- a/net/bluetooth/hci_conn.c >>> +++ b/net/bluetooth/hci_conn.c >>> @@ -1028,6 +1028,15 @@ static struct hci_conn *__hci_conn_add(struct hci_dev *hdev, int type, bdaddr_t >>> >>> hci_conn_init_sysfs(conn); >>> >>> + /* This disc_ev_comp is inited when we send a disconnect request to >>> + * the remote device but fail to receive the disconnect complete >>> + * event within the expected time (2 seconds). This occurs because >>> + * the remote device doesn't ack the terminate indication, forcing >>> + * the controller to wait for the supervision timeout. >>> + */ >>> + init_completion(&conn->disc_ev_comp); >>> + complete(&conn->disc_ev_comp); >>> + >>> return conn; >>> } >>> >>> diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c >>> index 2cc7a9306..60ecb2b18 100644 >>> --- a/net/bluetooth/hci_event.c >>> +++ b/net/bluetooth/hci_event.c >>> @@ -3366,6 +3366,15 @@ static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data, >>> if (!conn) >>> goto unlock; >>> >>> + /* Wake up disc_ev_comp here is ok. Since we hold the hdev lock >>> + * hci_abort_conn_sync will wait hdev lock release to continue. >>> + */ >>> + if (!completion_done(&conn->disc_ev_comp)) { >>> + complete(&conn->disc_ev_comp); >>> + /* Add some delay for hci_abort_conn_sync to handle the complete */ >>> + usleep_range(100, 1000); >>> + } >>> + >>> if (ev->status) { >>> mgmt_disconnect_failed(hdev, &conn->dst, conn->type, >>> conn->dst_type, ev->status); >>> diff --git a/net/bluetooth/hci_sync.c b/net/bluetooth/hci_sync.c >>> index 0badec712..783d04b57 100644 >>> --- a/net/bluetooth/hci_sync.c >>> +++ b/net/bluetooth/hci_sync.c >>> @@ -5590,6 +5590,24 @@ int hci_abort_conn_sync(struct hci_dev *hdev, struct hci_conn *conn, u8 reason) >>> break; >>> } >>> >>> + /* Check whether the connection is successfully disconnected. >>> + * Sometimes the remote device doesn't acknowledge the >>> + * LL_TERMINATE_IND in time, requiring the controller to wait >>> + * for the supervision timeout, which may exceed 2 seconds. In >>> + * this case, we need to wait for the HCI_EV_DISCONN_COMPLETE >>> + * event before cleaning up the connection. >>> + */ >>> + if (err == -ETIMEDOUT) { >>> + u32 idle_delay = msecs_to_jiffies(10 * conn->le_supv_timeout); >>> + >>> + reinit_completion(&conn->disc_ev_comp); >>> + if (!wait_for_completion_timeout(&conn->disc_ev_comp, idle_delay)) { >>> + bt_dev_warn(hdev, "Failed to get complete"); >>> + mgmt_disconnect_failed(hdev, &conn->dst, conn->type, >>> + conn->dst_type, conn->abort_reason); >>> + } >>> + } >> >> Why don't we just set the supervision timeout as timeout then? If we >> will have to wait for it anyway just change hci_disconnect_sync to use >> 10 * conn->le_supv_timeout as timeout instead. >> hci_disconnect_sync uses __hci_cmd_sync_status_sk to wait for the HCI_EV_DISCONN_COMPLETE event, which will send the command in hci_cmd_work. In hci_cmd_work, it will start a timer with HCI_CMD_TIMEOUT(2s) to wait for the event. So even in hci_disconnect_sync we can set the timeout to supervision timeout, hci_disconnect_sync still timeout after 2s. >> That said, we really need to fix bluetoothd if it is not able to be >> cleaned up if SET_POWERED command fails, but it looks like it is >> handling errors correctly so it sounds like something else is at play. > The issue arises after a 2-second timeout of hci_disconnect_sync. During hci_abort_conn_sync, the connection is cleaned up by hci_conn_failed. after supervision timeout, the disconnect complete event arrives, but it returns at line 3370 since the connection has already been removed. As a result, bluetoothd does not send the mgmt event for MGMT_OP_DISCONNECT to the application layer (bluetoothctl), causing bluetoothctl to get stuck and unable to perform other mgmt commands. 3355 static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data, 3356 struct sk_buff *skb) 3357 { 3358 struct hci_ev_disconn_complete *ev = data; 3359 u8 reason; 3360 struct hci_conn_params *params; 3361 struct hci_conn *conn; 3362 bool mgmt_connected; 3363 3364 bt_dev_dbg(hdev, "status 0x%2.2x", ev->status); 3365 3366 hci_dev_lock(hdev); 3367 3368 conn = hci_conn_hash_lookup_handle(hdev, __le16_to_cpu(ev->handle)); 3369 if (!conn) 3370 goto unlock; 3371 3372 if (ev->status) { 3373 mgmt_disconnect_failed(hdev, &conn->dst, conn->type, 3374 conn->dst_type, ev->status); 3375 goto unlock; 3376 } 3377 3378 conn->state = BT_CLOSED; 3379 3380 mgmt_connected = test_and_clear_bit(HCI_CONN_MGMT_CONNECTED, &conn->flags); 3381 > I double checked this and apparently this could no longer fail: > > + /* Disregard possible errors since hci_conn_del shall have been > + * called even in case of errors had occurred since it would > + * then cause hci_conn_failed to be called which calls > + * hci_conn_del internally. > + */ > + hci_abort_conn_sync(hdev, conn, reason); > > So it will clean up the hci_conn no matter what is the timeout, so > either you don't have this change or it is not working for some > reason. > The issue is mgmt command is not repsonsed by bluetoothd, then the bluetootlctl is blocked. It does not happen during the power off stage. It happened when disconnect a BLE device, but the disconnect complete event sent from controller to host 2s later. Then it causes the mgmt in bluetoothctl is blocked as decribed as above. >>> hci_dev_lock(hdev); >>> >>> /* Check if the connection has been cleaned up concurrently */ >>> >>> base-commit: e25c8d66f6786300b680866c0e0139981273feba >>> -- >>> 2.34.1 >>> >> >> >> -- >> Luiz Augusto von Dentz > > >
Hi Cheng, On Mon, Dec 16, 2024 at 9:49 PM Cheng Jiang (IOE) <quic_chejiang@quicinc.com> wrote: > > Hi Luiz, > > On 12/16/2024 10:42 PM, Luiz Augusto von Dentz wrote: > > Hi Cheng, > > > > On Mon, Dec 16, 2024 at 9:32 AM Luiz Augusto von Dentz > > <luiz.dentz@gmail.com> wrote: > >> > >> Hi Cheng, > >> > >> On Mon, Dec 16, 2024 at 3:08 AM Cheng Jiang <quic_chejiang@quicinc.com> wrote: > >>> > >>> Sometimes, the remote device doesn't acknowledge the LL_TERMINATE_IND > >>> in time, requiring the controller to wait for the supervision timeout, > >>> which may exceed 2 seconds. In the current implementation, the > >>> HCI_EV_DISCONN_COMPLETE event is ignored if it arrives late, since > >>> the hci_abort_conn_sync has cleaned up the connection after 2 seconds. > >>> This causes the mgmt to get stuck, resulting in bluetoothd waiting > >>> indefinitely for the mgmt response to the disconnect. To recover, > >>> restarting bluetoothd is necessary. > >>> > >>> bluetoothctl log like this: > >>> [Designer Mouse]# disconnect D9:B5:6C:F2:51:91 > >>> Attempting to disconnect from D9:B5:6C:F2:51:91 > >>> [Designer Mouse]# > >>> [Designer Mouse]# power off > >>> [Designer Mouse]# > >>> Failed to set power off: org.freedesktop.DBus.Error.NoReply. > >>> > >>> Signed-off-by: Cheng Jiang <quic_chejiang@quicinc.com> > >>> --- > >>> include/net/bluetooth/hci_core.h | 2 ++ > >>> net/bluetooth/hci_conn.c | 9 +++++++++ > >>> net/bluetooth/hci_event.c | 9 +++++++++ > >>> net/bluetooth/hci_sync.c | 18 ++++++++++++++++++ > >>> 4 files changed, 38 insertions(+) > >>> > >>> diff --git a/include/net/bluetooth/hci_core.h b/include/net/bluetooth/hci_core.h > >>> index 734cd50cd..2ab079dcf 100644 > >>> --- a/include/net/bluetooth/hci_core.h > >>> +++ b/include/net/bluetooth/hci_core.h > >>> @@ -753,6 +753,8 @@ struct hci_conn { > >>> > >>> struct bt_codec codec; > >>> > >>> + struct completion disc_ev_comp; > >>> + > >>> void (*connect_cfm_cb) (struct hci_conn *conn, u8 status); > >>> void (*security_cfm_cb) (struct hci_conn *conn, u8 status); > >>> void (*disconn_cfm_cb) (struct hci_conn *conn, u8 reason); > >>> diff --git a/net/bluetooth/hci_conn.c b/net/bluetooth/hci_conn.c > >>> index d097e308a..e0244e191 100644 > >>> --- a/net/bluetooth/hci_conn.c > >>> +++ b/net/bluetooth/hci_conn.c > >>> @@ -1028,6 +1028,15 @@ static struct hci_conn *__hci_conn_add(struct hci_dev *hdev, int type, bdaddr_t > >>> > >>> hci_conn_init_sysfs(conn); > >>> > >>> + /* This disc_ev_comp is inited when we send a disconnect request to > >>> + * the remote device but fail to receive the disconnect complete > >>> + * event within the expected time (2 seconds). This occurs because > >>> + * the remote device doesn't ack the terminate indication, forcing > >>> + * the controller to wait for the supervision timeout. > >>> + */ > >>> + init_completion(&conn->disc_ev_comp); > >>> + complete(&conn->disc_ev_comp); > >>> + > >>> return conn; > >>> } > >>> > >>> diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c > >>> index 2cc7a9306..60ecb2b18 100644 > >>> --- a/net/bluetooth/hci_event.c > >>> +++ b/net/bluetooth/hci_event.c > >>> @@ -3366,6 +3366,15 @@ static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data, > >>> if (!conn) > >>> goto unlock; > >>> > >>> + /* Wake up disc_ev_comp here is ok. Since we hold the hdev lock > >>> + * hci_abort_conn_sync will wait hdev lock release to continue. > >>> + */ > >>> + if (!completion_done(&conn->disc_ev_comp)) { > >>> + complete(&conn->disc_ev_comp); > >>> + /* Add some delay for hci_abort_conn_sync to handle the complete */ > >>> + usleep_range(100, 1000); > >>> + } > >>> + > >>> if (ev->status) { > >>> mgmt_disconnect_failed(hdev, &conn->dst, conn->type, > >>> conn->dst_type, ev->status); > >>> diff --git a/net/bluetooth/hci_sync.c b/net/bluetooth/hci_sync.c > >>> index 0badec712..783d04b57 100644 > >>> --- a/net/bluetooth/hci_sync.c > >>> +++ b/net/bluetooth/hci_sync.c > >>> @@ -5590,6 +5590,24 @@ int hci_abort_conn_sync(struct hci_dev *hdev, struct hci_conn *conn, u8 reason) > >>> break; > >>> } > >>> > >>> + /* Check whether the connection is successfully disconnected. > >>> + * Sometimes the remote device doesn't acknowledge the > >>> + * LL_TERMINATE_IND in time, requiring the controller to wait > >>> + * for the supervision timeout, which may exceed 2 seconds. In > >>> + * this case, we need to wait for the HCI_EV_DISCONN_COMPLETE > >>> + * event before cleaning up the connection. > >>> + */ > >>> + if (err == -ETIMEDOUT) { > >>> + u32 idle_delay = msecs_to_jiffies(10 * conn->le_supv_timeout); > >>> + > >>> + reinit_completion(&conn->disc_ev_comp); > >>> + if (!wait_for_completion_timeout(&conn->disc_ev_comp, idle_delay)) { > >>> + bt_dev_warn(hdev, "Failed to get complete"); > >>> + mgmt_disconnect_failed(hdev, &conn->dst, conn->type, > >>> + conn->dst_type, conn->abort_reason); > >>> + } > >>> + } > >> > >> Why don't we just set the supervision timeout as timeout then? If we > >> will have to wait for it anyway just change hci_disconnect_sync to use > >> 10 * conn->le_supv_timeout as timeout instead. > >> > hci_disconnect_sync uses __hci_cmd_sync_status_sk to wait for the > HCI_EV_DISCONN_COMPLETE event, which will send the command in hci_cmd_work. > In hci_cmd_work, it will start a timer with HCI_CMD_TIMEOUT(2s) to wait > for the event. So even in hci_disconnect_sync we can set the timeout to > supervision timeout, hci_disconnect_sync still timeout after 2s. Wait, why are you talking about HCI_CMD_TIMEOUT when I told you to use the supervision timeout instead? If it still timeout after to 2 seconds then there is something still forcing HCI_CMD_TIMEOUT which shouldn't happen. > >> That said, we really need to fix bluetoothd if it is not able to be > >> cleaned up if SET_POWERED command fails, but it looks like it is > >> handling errors correctly so it sounds like something else is at play. > > > The issue arises after a 2-second timeout of hci_disconnect_sync. During > hci_abort_conn_sync, the connection is cleaned up by hci_conn_failed. > after supervision timeout, the disconnect complete event arrives, but > it returns at line 3370 since the connection has already been removed. > As a result, bluetoothd does not send the mgmt event for MGMT_OP_DISCONNECT > to the application layer (bluetoothctl), causing bluetoothctl to get stuck > and unable to perform other mgmt commands. The command shall have completed regardless if disconnect complete has been received or not, the is the whole point of having a timeout, so this makes no sense to me, or you are not describing what is happening here. Also there is no MGMT_OP_DISCONNECT pending, it is MGMT_OP_SET_POWERED, if you are talking about the DISCONNECTED event that is a totally different thing and perhaps that is the source of the problem because if we do cleanup hci_conn even in case the command fails/times out then we should be generating a disconnected event as well. > > 3355 static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data, > 3356 struct sk_buff *skb) > 3357 { > 3358 struct hci_ev_disconn_complete *ev = data; > 3359 u8 reason; > 3360 struct hci_conn_params *params; > 3361 struct hci_conn *conn; > 3362 bool mgmt_connected; > 3363 > 3364 bt_dev_dbg(hdev, "status 0x%2.2x", ev->status); > 3365 > 3366 hci_dev_lock(hdev); > 3367 > 3368 conn = hci_conn_hash_lookup_handle(hdev, __le16_to_cpu(ev->handle)); > 3369 if (!conn) > 3370 goto unlock; > 3371 > 3372 if (ev->status) { > 3373 mgmt_disconnect_failed(hdev, &conn->dst, conn->type, > 3374 conn->dst_type, ev->status); > 3375 goto unlock; > 3376 } > 3377 > 3378 conn->state = BT_CLOSED; > 3379 > 3380 mgmt_connected = test_and_clear_bit(HCI_CONN_MGMT_CONNECTED, &conn->flags); > 3381 > > > I double checked this and apparently this could no longer fail: > > > > + /* Disregard possible errors since hci_conn_del shall have been > > + * called even in case of errors had occurred since it would > > + * then cause hci_conn_failed to be called which calls > > + * hci_conn_del internally. > > + */ > > + hci_abort_conn_sync(hdev, conn, reason); > > > > So it will clean up the hci_conn no matter what is the timeout, so > > either you don't have this change or it is not working for some > > reason. > > > The issue is mgmt command is not repsonsed by bluetoothd, then the bluetootlctl is > blocked. It does not happen during the power off stage. It happened when disconnect > a BLE device, but the disconnect complete event sent from controller to host 2s later. > Then it causes the mgmt in bluetoothctl is blocked as decribed as above. There is a difference about a MGMT command, initiated by bluetoothd, versus a MGMT event initiated by the kernel, so the daemon is not blocked it just don't get a disconnection event, which is different than a command complete. What is the command sequence that you use to reproduce the problem? > >>> hci_dev_lock(hdev); > >>> > >>> /* Check if the connection has been cleaned up concurrently */ > >>> > >>> base-commit: e25c8d66f6786300b680866c0e0139981273feba > >>> -- > >>> 2.34.1 > >>> > >> > >> > >> -- > >> Luiz Augusto von Dentz > > > > > > >
Hi Luiz, On 12/17/2024 11:15 AM, Luiz Augusto von Dentz wrote: > Hi Cheng, > > On Mon, Dec 16, 2024 at 9:49 PM Cheng Jiang (IOE) > <quic_chejiang@quicinc.com> wrote: >> >> Hi Luiz, >> >> On 12/16/2024 10:42 PM, Luiz Augusto von Dentz wrote: >>> Hi Cheng, >>> >>> On Mon, Dec 16, 2024 at 9:32 AM Luiz Augusto von Dentz >>> <luiz.dentz@gmail.com> wrote: >>>> >>>> Hi Cheng, >>>> >>>> On Mon, Dec 16, 2024 at 3:08 AM Cheng Jiang <quic_chejiang@quicinc.com> wrote: >>>>> >>>>> Sometimes, the remote device doesn't acknowledge the LL_TERMINATE_IND >>>>> in time, requiring the controller to wait for the supervision timeout, >>>>> which may exceed 2 seconds. In the current implementation, the >>>>> HCI_EV_DISCONN_COMPLETE event is ignored if it arrives late, since >>>>> the hci_abort_conn_sync has cleaned up the connection after 2 seconds. >>>>> This causes the mgmt to get stuck, resulting in bluetoothd waiting >>>>> indefinitely for the mgmt response to the disconnect. To recover, >>>>> restarting bluetoothd is necessary. >>>>> >>>>> bluetoothctl log like this: >>>>> [Designer Mouse]# disconnect D9:B5:6C:F2:51:91 >>>>> Attempting to disconnect from D9:B5:6C:F2:51:91 >>>>> [Designer Mouse]# >>>>> [Designer Mouse]# power off >>>>> [Designer Mouse]# >>>>> Failed to set power off: org.freedesktop.DBus.Error.NoReply. >>>>> >>>>> Signed-off-by: Cheng Jiang <quic_chejiang@quicinc.com> >>>>> --- >>>>> include/net/bluetooth/hci_core.h | 2 ++ >>>>> net/bluetooth/hci_conn.c | 9 +++++++++ >>>>> net/bluetooth/hci_event.c | 9 +++++++++ >>>>> net/bluetooth/hci_sync.c | 18 ++++++++++++++++++ >>>>> 4 files changed, 38 insertions(+) >>>>> >>>>> diff --git a/include/net/bluetooth/hci_core.h b/include/net/bluetooth/hci_core.h >>>>> index 734cd50cd..2ab079dcf 100644 >>>>> --- a/include/net/bluetooth/hci_core.h >>>>> +++ b/include/net/bluetooth/hci_core.h >>>>> @@ -753,6 +753,8 @@ struct hci_conn { >>>>> >>>>> struct bt_codec codec; >>>>> >>>>> + struct completion disc_ev_comp; >>>>> + >>>>> void (*connect_cfm_cb) (struct hci_conn *conn, u8 status); >>>>> void (*security_cfm_cb) (struct hci_conn *conn, u8 status); >>>>> void (*disconn_cfm_cb) (struct hci_conn *conn, u8 reason); >>>>> diff --git a/net/bluetooth/hci_conn.c b/net/bluetooth/hci_conn.c >>>>> index d097e308a..e0244e191 100644 >>>>> --- a/net/bluetooth/hci_conn.c >>>>> +++ b/net/bluetooth/hci_conn.c >>>>> @@ -1028,6 +1028,15 @@ static struct hci_conn *__hci_conn_add(struct hci_dev *hdev, int type, bdaddr_t >>>>> >>>>> hci_conn_init_sysfs(conn); >>>>> >>>>> + /* This disc_ev_comp is inited when we send a disconnect request to >>>>> + * the remote device but fail to receive the disconnect complete >>>>> + * event within the expected time (2 seconds). This occurs because >>>>> + * the remote device doesn't ack the terminate indication, forcing >>>>> + * the controller to wait for the supervision timeout. >>>>> + */ >>>>> + init_completion(&conn->disc_ev_comp); >>>>> + complete(&conn->disc_ev_comp); >>>>> + >>>>> return conn; >>>>> } >>>>> >>>>> diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c >>>>> index 2cc7a9306..60ecb2b18 100644 >>>>> --- a/net/bluetooth/hci_event.c >>>>> +++ b/net/bluetooth/hci_event.c >>>>> @@ -3366,6 +3366,15 @@ static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data, >>>>> if (!conn) >>>>> goto unlock; >>>>> >>>>> + /* Wake up disc_ev_comp here is ok. Since we hold the hdev lock >>>>> + * hci_abort_conn_sync will wait hdev lock release to continue. >>>>> + */ >>>>> + if (!completion_done(&conn->disc_ev_comp)) { >>>>> + complete(&conn->disc_ev_comp); >>>>> + /* Add some delay for hci_abort_conn_sync to handle the complete */ >>>>> + usleep_range(100, 1000); >>>>> + } >>>>> + >>>>> if (ev->status) { >>>>> mgmt_disconnect_failed(hdev, &conn->dst, conn->type, >>>>> conn->dst_type, ev->status); >>>>> diff --git a/net/bluetooth/hci_sync.c b/net/bluetooth/hci_sync.c >>>>> index 0badec712..783d04b57 100644 >>>>> --- a/net/bluetooth/hci_sync.c >>>>> +++ b/net/bluetooth/hci_sync.c >>>>> @@ -5590,6 +5590,24 @@ int hci_abort_conn_sync(struct hci_dev *hdev, struct hci_conn *conn, u8 reason) >>>>> break; >>>>> } >>>>> >>>>> + /* Check whether the connection is successfully disconnected. >>>>> + * Sometimes the remote device doesn't acknowledge the >>>>> + * LL_TERMINATE_IND in time, requiring the controller to wait >>>>> + * for the supervision timeout, which may exceed 2 seconds. In >>>>> + * this case, we need to wait for the HCI_EV_DISCONN_COMPLETE >>>>> + * event before cleaning up the connection. >>>>> + */ >>>>> + if (err == -ETIMEDOUT) { >>>>> + u32 idle_delay = msecs_to_jiffies(10 * conn->le_supv_timeout); >>>>> + >>>>> + reinit_completion(&conn->disc_ev_comp); >>>>> + if (!wait_for_completion_timeout(&conn->disc_ev_comp, idle_delay)) { >>>>> + bt_dev_warn(hdev, "Failed to get complete"); >>>>> + mgmt_disconnect_failed(hdev, &conn->dst, conn->type, >>>>> + conn->dst_type, conn->abort_reason); >>>>> + } >>>>> + } >>>> >>>> Why don't we just set the supervision timeout as timeout then? If we >>>> will have to wait for it anyway just change hci_disconnect_sync to use >>>> 10 * conn->le_supv_timeout as timeout instead. >>>> >> hci_disconnect_sync uses __hci_cmd_sync_status_sk to wait for the >> HCI_EV_DISCONN_COMPLETE event, which will send the command in hci_cmd_work. >> In hci_cmd_work, it will start a timer with HCI_CMD_TIMEOUT(2s) to wait >> for the event. So even in hci_disconnect_sync we can set the timeout to >> supervision timeout, hci_disconnect_sync still timeout after 2s. > > Wait, why are you talking about HCI_CMD_TIMEOUT when I told you to use > the supervision timeout instead? If it still timeout after to 2 > seconds then there is something still forcing HCI_CMD_TIMEOUT which > shouldn't happen. > Since the lower layer (hci_cmd_work) has set the timeout to HCI_CMD_TIMEOUT, so even the upper layer set to a larger timeout value, like supervision timeout, it still get the timeout after HCI_CMD_TIMEOUT. The timeout flow is: hci_disconnect_sync -> __hci_cmd_sync_sk(wait_event_interruptible_timeout) -> hci_cmd_work -> hci_cmd_timeout -> hci_cmd_sync_cancel_sync -> wake up the wait_event_interruptible_timeout in __hci_cmd_sync_sk -> hci_disconnect_sync timeout. So even if we set a large timeout value in hci_disconnect_sync, it doesn't work since it's waked up by other event, not the real timeout. What's more, in the hci_disconnect_sync, if the reason it not power_off, it waits for the disconnect complete event rather than command status event. According to BT core spec, disconnect complete event has to wait for remote's ack or wait until supervision timeout. It's a valid case that the disconnect complete event taking more than 2s. >>>> That said, we really need to fix bluetoothd if it is not able to be >>>> cleaned up if SET_POWERED command fails, but it looks like it is >>>> handling errors correctly so it sounds like something else is at play. >>> >> The issue arises after a 2-second timeout of hci_disconnect_sync. During >> hci_abort_conn_sync, the connection is cleaned up by hci_conn_failed. >> after supervision timeout, the disconnect complete event arrives, but >> it returns at line 3370 since the connection has already been removed. >> As a result, bluetoothd does not send the mgmt event for MGMT_OP_DISCONNECT >> to the application layer (bluetoothctl), causing bluetoothctl to get stuck >> and unable to perform other mgmt commands. > > The command shall have completed regardless if disconnect complete has > been received or not, the is the whole point of having a timeout, so > this makes no sense to me, or you are not describing what is happening > here. Also there is no MGMT_OP_DISCONNECT pending, it is > MGMT_OP_SET_POWERED, if you are talking about the DISCONNECTED event > that is a totally different thing and perhaps that is the source of > the problem because if we do cleanup hci_conn even in case the command > fails/times out then we should be generating a disconnected event as > well. > Here is the flow describe the issue. For normal case: ┌───────────┐ ┌──────────┐ ┌──────┐ ┌──────────┐ ┌──────┐ │bluetootctl│ │bluetoothd│ │kernel│ │controller│ │remote│ └─────┬─────┘ └─────┬────┘ └───┬──┘ └─────┬────┘ └───┬──┘ │ disconnect cmd │ │ │ │ │─────────────────────>│ │ │ │ │ │ │ │ │ │ │ MGMT_OP │ │ │ │ │ _DISCONNECT │ │ │ │ │───────────────────>│ │ │ │ │ │ │ │ │ │ │ HCI_Disconnect │ │ │ │ │───────────────────>│ │ │ │ │ │ │ │ │ │ │ LL_TERMINATE │ │ │ │ │ _IND │ │ │ │ │───────────────────>│ │ │ │ │ │ │ │ │ │ ACK │ │ │ │ │<───────────────────│ │ │ │ │ │ │ │ │ Disc_Comp_Evt │ │ │ │ │<───────────────────│ │ │ │ │ │ │ │ │ MGMT Response │ │ │ │ │<───────────────────│ │ │ │ │ │ │ │ │ disc succ │ │ │ │ │<─────────────────────│ │ │ │ ┌─────┴─────┐ ┌─────┴────┐ ┌───┴──┐ ┌─────┴────┐ ┌───┴──┐ │bluetootctl│ │bluetoothd│ │kernel│ │controller│ │remote│ └───────────┘ └──────────┘ └──────┘ └──────────┘ └──────┘ The failure case like this: ┌───────────┐ ┌──────────┐ ┌──────┐ ┌──────────┐ ┌──────┐ │bluetootctl│ │bluetoothd│ │kernel│ │controller│ │remote│ └─────┬─────┘ └─────┬────┘ └───┬──┘ └─────┬────┘ └───┬──┘ │ disconnect │ │ │ │ │ cmd │ │ │ │ │─────────────────────>│ │ │ │ │ │ │ │ │ │ │ MGMT_OP_ │ │ │ │ │ DISCONNECT │ │ │ │ │───────────────────>│ │ │ │ │ │ │ │ │ │ ┌┴┐ HCI_ │ │ │ │ │ │ Disconnect │ │ │ │ │ │ ──────────────────>│ │ │ │ │ │ │ │ │ │ │ │ │ LL_TERMINATE ┌┴┐ │ │ │ │ │ _IND │ │ │ │ │ │ │─────────────────> │ │ │ │ │ │ │ │ │ │ │ │ │ 2s │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ More │ │ │ │ │ │ │ than │ │ │ │ │ │ │ 2s │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ └┬┘ │ │ │ │ │ │────┐ │ │ │ │ │ │ │ hci_disconnect │ │ │ │ │ │<───┘ sync timeout, │ │ │ │ │ │ del 'con' by │ │ │ │ │ │ hci_conn_failed│ │ │ │ │ │ │ └┬┘ │ │ │ │ ACK │ │ │ │ │<───────────────────│ │ │ │ │ │ │ │ │ Disc_Comp_Evt │ │ │ │ │<────────────────────│ │ │ │ │ │ │ │ │ │────┐ │ │ │ \│/ │ │ ignore the │ │ │ X │<───┘ event since │ │ │ /│\ │ 'con' has been │ │ │ │ MGMT │ deleted │ │ │ │ Response │ │ │ │ │<─ ─ ─ ─ ─ ─ ─ ─ ─ ─│ │ │ ┌─────┴─────┐ ┌─────┴────┐ ┌───┴──┐ ┌─────┴────┐ ┌───┴──┐ │bluetootctl│ │bluetoothd│ │kernel│ │controller│ │remote│ └───────────┘ └──────────┘ └──────┘ └──────────┘ └──────┘ So in the failure case, the bluetoothd is blocked by waiting the mgmt response from kernel. From our test, bluetoothd can't accept any command related to mgmt from bluetothctl. >> >> 3355 static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data, >> 3356 struct sk_buff *skb) >> 3357 { >> 3358 struct hci_ev_disconn_complete *ev = data; >> 3359 u8 reason; >> 3360 struct hci_conn_params *params; >> 3361 struct hci_conn *conn; >> 3362 bool mgmt_connected; >> 3363 >> 3364 bt_dev_dbg(hdev, "status 0x%2.2x", ev->status); >> 3365 >> 3366 hci_dev_lock(hdev); >> 3367 >> 3368 conn = hci_conn_hash_lookup_handle(hdev, __le16_to_cpu(ev->handle)); >> 3369 if (!conn) >> 3370 goto unlock; >> 3371 >> 3372 if (ev->status) { >> 3373 mgmt_disconnect_failed(hdev, &conn->dst, conn->type, >> 3374 conn->dst_type, ev->status); >> 3375 goto unlock; >> 3376 } >> 3377 >> 3378 conn->state = BT_CLOSED; >> 3379 >> 3380 mgmt_connected = test_and_clear_bit(HCI_CONN_MGMT_CONNECTED, &conn->flags); >> 3381 >> >>> I double checked this and apparently this could no longer fail: >>> >>> + /* Disregard possible errors since hci_conn_del shall have been >>> + * called even in case of errors had occurred since it would >>> + * then cause hci_conn_failed to be called which calls >>> + * hci_conn_del internally. >>> + */ >>> + hci_abort_conn_sync(hdev, conn, reason); >>> >>> So it will clean up the hci_conn no matter what is the timeout, so >>> either you don't have this change or it is not working for some >>> reason. >>> >> The issue is mgmt command is not repsonsed by bluetoothd, then the bluetootlctl is >> blocked. It does not happen during the power off stage. It happened when disconnect >> a BLE device, but the disconnect complete event sent from controller to host 2s later. >> Then it causes the mgmt in bluetoothctl is blocked as decribed as above. > > There is a difference about a MGMT command, initiated by bluetoothd, > versus a MGMT event initiated by the kernel, so the daemon is not > blocked it just don't get a disconnection event, which is different > than a command complete. > > What is the command sequence that you use to reproduce the problem? Here is the command log: [CHG] Controller 8C:FD:F0:21:81:87 Pairable: yes [bluetooth]# power on Changing power on succeeded [bluetooth]# connect CF:90:67:3C:7A:56 Attempting to connect to CF:90:67:3C:7A:56 [CHG] Device CF:90:67:3C:7A:56 Connected: yes Connection successful [CHG] Device CF:90:67:3C:7A:56 ServicesResolved: yes [Designer Mouse]# [Designer Mouse]# disconnect D9:B5:6C:F2:51:91 Attempting to disconnect from D9:B5:6C:F2:51:91 ## no disconnection success response [Designer Mouse]# [Designer Mouse]# power off [Designer Mouse]# Failed to set power off: org.freedesktop.DBus.Error.NoReply To easily reproduce this issue, we use a firmware which always send the disconnect complete event more than 2s. Then the issue occurred 100%. Actually, the root cause is the hci_disconnect_sync doesn't handle this case since it relies on __hci_cmd_sync_status_sk, which maximum timeout value is constrained to 2s. > >>>>> hci_dev_lock(hdev); >>>>> >>>>> /* Check if the connection has been cleaned up concurrently */ >>>>> >>>>> base-commit: e25c8d66f6786300b680866c0e0139981273feba >>>>> -- >>>>> 2.34.1 >>>>> >>>> >>>> >>>> -- >>>> Luiz Augusto von Dentz >>> >>> >>> >> > >
Hi Cheng, On Tue, Dec 17, 2024 at 12:51 AM Cheng Jiang (IOE) <quic_chejiang@quicinc.com> wrote: > > Hi Luiz, > > On 12/17/2024 11:15 AM, Luiz Augusto von Dentz wrote: > > Hi Cheng, > > > > On Mon, Dec 16, 2024 at 9:49 PM Cheng Jiang (IOE) > > <quic_chejiang@quicinc.com> wrote: > >> > >> Hi Luiz, > >> > >> On 12/16/2024 10:42 PM, Luiz Augusto von Dentz wrote: > >>> Hi Cheng, > >>> > >>> On Mon, Dec 16, 2024 at 9:32 AM Luiz Augusto von Dentz > >>> <luiz.dentz@gmail.com> wrote: > >>>> > >>>> Hi Cheng, > >>>> > >>>> On Mon, Dec 16, 2024 at 3:08 AM Cheng Jiang <quic_chejiang@quicinc.com> wrote: > >>>>> > >>>>> Sometimes, the remote device doesn't acknowledge the LL_TERMINATE_IND > >>>>> in time, requiring the controller to wait for the supervision timeout, > >>>>> which may exceed 2 seconds. In the current implementation, the > >>>>> HCI_EV_DISCONN_COMPLETE event is ignored if it arrives late, since > >>>>> the hci_abort_conn_sync has cleaned up the connection after 2 seconds. > >>>>> This causes the mgmt to get stuck, resulting in bluetoothd waiting > >>>>> indefinitely for the mgmt response to the disconnect. To recover, > >>>>> restarting bluetoothd is necessary. > >>>>> > >>>>> bluetoothctl log like this: > >>>>> [Designer Mouse]# disconnect D9:B5:6C:F2:51:91 > >>>>> Attempting to disconnect from D9:B5:6C:F2:51:91 > >>>>> [Designer Mouse]# > >>>>> [Designer Mouse]# power off > >>>>> [Designer Mouse]# > >>>>> Failed to set power off: org.freedesktop.DBus.Error.NoReply. > >>>>> > >>>>> Signed-off-by: Cheng Jiang <quic_chejiang@quicinc.com> > >>>>> --- > >>>>> include/net/bluetooth/hci_core.h | 2 ++ > >>>>> net/bluetooth/hci_conn.c | 9 +++++++++ > >>>>> net/bluetooth/hci_event.c | 9 +++++++++ > >>>>> net/bluetooth/hci_sync.c | 18 ++++++++++++++++++ > >>>>> 4 files changed, 38 insertions(+) > >>>>> > >>>>> diff --git a/include/net/bluetooth/hci_core.h b/include/net/bluetooth/hci_core.h > >>>>> index 734cd50cd..2ab079dcf 100644 > >>>>> --- a/include/net/bluetooth/hci_core.h > >>>>> +++ b/include/net/bluetooth/hci_core.h > >>>>> @@ -753,6 +753,8 @@ struct hci_conn { > >>>>> > >>>>> struct bt_codec codec; > >>>>> > >>>>> + struct completion disc_ev_comp; > >>>>> + > >>>>> void (*connect_cfm_cb) (struct hci_conn *conn, u8 status); > >>>>> void (*security_cfm_cb) (struct hci_conn *conn, u8 status); > >>>>> void (*disconn_cfm_cb) (struct hci_conn *conn, u8 reason); > >>>>> diff --git a/net/bluetooth/hci_conn.c b/net/bluetooth/hci_conn.c > >>>>> index d097e308a..e0244e191 100644 > >>>>> --- a/net/bluetooth/hci_conn.c > >>>>> +++ b/net/bluetooth/hci_conn.c > >>>>> @@ -1028,6 +1028,15 @@ static struct hci_conn *__hci_conn_add(struct hci_dev *hdev, int type, bdaddr_t > >>>>> > >>>>> hci_conn_init_sysfs(conn); > >>>>> > >>>>> + /* This disc_ev_comp is inited when we send a disconnect request to > >>>>> + * the remote device but fail to receive the disconnect complete > >>>>> + * event within the expected time (2 seconds). This occurs because > >>>>> + * the remote device doesn't ack the terminate indication, forcing > >>>>> + * the controller to wait for the supervision timeout. > >>>>> + */ > >>>>> + init_completion(&conn->disc_ev_comp); > >>>>> + complete(&conn->disc_ev_comp); > >>>>> + > >>>>> return conn; > >>>>> } > >>>>> > >>>>> diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c > >>>>> index 2cc7a9306..60ecb2b18 100644 > >>>>> --- a/net/bluetooth/hci_event.c > >>>>> +++ b/net/bluetooth/hci_event.c > >>>>> @@ -3366,6 +3366,15 @@ static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data, > >>>>> if (!conn) > >>>>> goto unlock; > >>>>> > >>>>> + /* Wake up disc_ev_comp here is ok. Since we hold the hdev lock > >>>>> + * hci_abort_conn_sync will wait hdev lock release to continue. > >>>>> + */ > >>>>> + if (!completion_done(&conn->disc_ev_comp)) { > >>>>> + complete(&conn->disc_ev_comp); > >>>>> + /* Add some delay for hci_abort_conn_sync to handle the complete */ > >>>>> + usleep_range(100, 1000); > >>>>> + } > >>>>> + > >>>>> if (ev->status) { > >>>>> mgmt_disconnect_failed(hdev, &conn->dst, conn->type, > >>>>> conn->dst_type, ev->status); > >>>>> diff --git a/net/bluetooth/hci_sync.c b/net/bluetooth/hci_sync.c > >>>>> index 0badec712..783d04b57 100644 > >>>>> --- a/net/bluetooth/hci_sync.c > >>>>> +++ b/net/bluetooth/hci_sync.c > >>>>> @@ -5590,6 +5590,24 @@ int hci_abort_conn_sync(struct hci_dev *hdev, struct hci_conn *conn, u8 reason) > >>>>> break; > >>>>> } > >>>>> > >>>>> + /* Check whether the connection is successfully disconnected. > >>>>> + * Sometimes the remote device doesn't acknowledge the > >>>>> + * LL_TERMINATE_IND in time, requiring the controller to wait > >>>>> + * for the supervision timeout, which may exceed 2 seconds. In > >>>>> + * this case, we need to wait for the HCI_EV_DISCONN_COMPLETE > >>>>> + * event before cleaning up the connection. > >>>>> + */ > >>>>> + if (err == -ETIMEDOUT) { > >>>>> + u32 idle_delay = msecs_to_jiffies(10 * conn->le_supv_timeout); > >>>>> + > >>>>> + reinit_completion(&conn->disc_ev_comp); > >>>>> + if (!wait_for_completion_timeout(&conn->disc_ev_comp, idle_delay)) { > >>>>> + bt_dev_warn(hdev, "Failed to get complete"); > >>>>> + mgmt_disconnect_failed(hdev, &conn->dst, conn->type, > >>>>> + conn->dst_type, conn->abort_reason); > >>>>> + } > >>>>> + } > >>>> > >>>> Why don't we just set the supervision timeout as timeout then? If we > >>>> will have to wait for it anyway just change hci_disconnect_sync to use > >>>> 10 * conn->le_supv_timeout as timeout instead. > >>>> > >> hci_disconnect_sync uses __hci_cmd_sync_status_sk to wait for the > >> HCI_EV_DISCONN_COMPLETE event, which will send the command in hci_cmd_work. > >> In hci_cmd_work, it will start a timer with HCI_CMD_TIMEOUT(2s) to wait > >> for the event. So even in hci_disconnect_sync we can set the timeout to > >> supervision timeout, hci_disconnect_sync still timeout after 2s. > > > > Wait, why are you talking about HCI_CMD_TIMEOUT when I told you to use > > the supervision timeout instead? If it still timeout after to 2 > > seconds then there is something still forcing HCI_CMD_TIMEOUT which > > shouldn't happen. > > > Since the lower layer (hci_cmd_work) has set the timeout to HCI_CMD_TIMEOUT, so > even the upper layer set to a larger timeout value, like supervision timeout, > it still get the timeout after HCI_CMD_TIMEOUT. The timeout flow is: > hci_disconnect_sync -> __hci_cmd_sync_sk(wait_event_interruptible_timeout) -> > hci_cmd_work -> hci_cmd_timeout -> hci_cmd_sync_cancel_sync -> wake up the > wait_event_interruptible_timeout in __hci_cmd_sync_sk -> hci_disconnect_sync timeout. > > So even if we set a large timeout value in hci_disconnect_sync, it doesn't work > since it's waked up by other event, not the real timeout. > > What's more, in the hci_disconnect_sync, if the reason it not power_off, it waits > for the disconnect complete event rather than command status event. According to > BT core spec, disconnect complete event has to wait for remote's ack or wait until > supervision timeout. It's a valid case that the disconnect complete event taking > more than 2s. You seems to be confusing the role of 2 different timers: err = wait_event_interruptible_timeout(hdev->req_wait_q, hdev->req_status != HCI_REQ_PEND, timeout); and queue_delayed_work(hdev->workqueue, &hdev->cmd_timer, HCI_CMD_TIMEOUT); The former waits for a specific event, while the later waits for handle_cmd_cnt_and_timer, each can have a distinct timeout as in the code bellow: return __hci_cmd_sync_status_sk(hdev, HCI_OP_LE_EXT_CREATE_CONN, plen, data, HCI_EV_LE_ENHANCED_CONN_COMPLETE, conn->conn_timeout, NULL); The reason there are 2 timers is that we need to track the number of commands outstanding in the controller, and no you can't delay Command Status: When the Controller receives the HCI_Disconnect command, it _shall_ send the HCI_Command_Status event to the Host. So even if HCI_Disconnection_Complete is delayed the following shall still work provided the HCI_Command_Status is still being generated accordingly: index 0badec7120ab..0ab607fb6433 100644 --- a/net/bluetooth/hci_sync.c +++ b/net/bluetooth/hci_sync.c @@ -5444,7 +5444,7 @@ static int hci_disconnect_sync(struct hci_dev *hdev, struct hci_conn *conn, return __hci_cmd_sync_status_sk(hdev, HCI_OP_DISCONNECT, sizeof(cp), &cp, HCI_EV_DISCONN_COMPLETE, - HCI_CMD_TIMEOUT, NULL); + conn->disc_timeout, NULL); return __hci_cmd_sync_status(hdev, HCI_OP_DISCONNECT, sizeof(cp), &cp, HCI_CMD_TIMEOUT); Then we need to adjust conn->disc_timeout according to supervision timeout, that said it appears disc_timeout is actually acting as idle timeout, so perhaps we need a separate field, also it doesn't look like we track the supervision timeout for non-LE links. > >>>> That said, we really need to fix bluetoothd if it is not able to be > >>>> cleaned up if SET_POWERED command fails, but it looks like it is > >>>> handling errors correctly so it sounds like something else is at play. > >>> > >> The issue arises after a 2-second timeout of hci_disconnect_sync. During > >> hci_abort_conn_sync, the connection is cleaned up by hci_conn_failed. > >> after supervision timeout, the disconnect complete event arrives, but > >> it returns at line 3370 since the connection has already been removed. > >> As a result, bluetoothd does not send the mgmt event for MGMT_OP_DISCONNECT > >> to the application layer (bluetoothctl), causing bluetoothctl to get stuck > >> and unable to perform other mgmt commands. > > > > The command shall have completed regardless if disconnect complete has > > been received or not, the is the whole point of having a timeout, so > > this makes no sense to me, or you are not describing what is happening > > here. Also there is no MGMT_OP_DISCONNECT pending, it is > > MGMT_OP_SET_POWERED, if you are talking about the DISCONNECTED event > > that is a totally different thing and perhaps that is the source of > > the problem because if we do cleanup hci_conn even in case the command > > fails/times out then we should be generating a disconnected event as > > well. > > > Here is the flow describe the issue. For normal case: > ┌───────────┐ ┌──────────┐ ┌──────┐ ┌──────────┐ ┌──────┐ > │bluetootctl│ │bluetoothd│ │kernel│ │controller│ │remote│ > └─────┬─────┘ └─────┬────┘ └───┬──┘ └─────┬────┘ └───┬──┘ > │ disconnect cmd │ │ │ │ > │─────────────────────>│ │ │ │ > │ │ │ │ │ > │ │ MGMT_OP │ │ │ > │ │ _DISCONNECT │ │ │ > │ │───────────────────>│ │ │ > │ │ │ │ │ > │ │ │ HCI_Disconnect │ │ > │ │ │───────────────────>│ │ > │ │ │ │ │ > │ │ │ │ LL_TERMINATE │ > │ │ │ │ _IND │ > │ │ │ │───────────────────>│ > │ │ │ │ │ > │ │ │ │ ACK │ > │ │ │ │<───────────────────│ > │ │ │ │ │ > │ │ │ Disc_Comp_Evt │ │ > │ │ │<───────────────────│ │ > │ │ │ │ │ > │ │ MGMT Response │ │ │ > │ │<───────────────────│ │ │ > │ │ │ │ │ > │ disc succ │ │ │ │ > │<─────────────────────│ │ │ │ > ┌─────┴─────┐ ┌─────┴────┐ ┌───┴──┐ ┌─────┴────┐ ┌───┴──┐ > │bluetootctl│ │bluetoothd│ │kernel│ │controller│ │remote│ > └───────────┘ └──────────┘ └──────┘ └──────────┘ └──────┘ > > > The failure case like this: > > ┌───────────┐ ┌──────────┐ ┌──────┐ ┌──────────┐ ┌──────┐ > │bluetootctl│ │bluetoothd│ │kernel│ │controller│ │remote│ > └─────┬─────┘ └─────┬────┘ └───┬──┘ └─────┬────┘ └───┬──┘ > │ disconnect │ │ │ │ > │ cmd │ │ │ │ > │─────────────────────>│ │ │ │ > │ │ │ │ │ > │ │ MGMT_OP_ │ │ │ > │ │ DISCONNECT │ │ │ > │ │───────────────────>│ │ │ > │ │ │ │ │ > │ │ ┌┴┐ HCI_ │ │ > │ │ │ │ Disconnect │ │ > │ │ │ │ ──────────────────>│ │ > │ │ │ │ │ │ > │ │ │ │ │ LL_TERMINATE ┌┴┐ > │ │ │ │ │ _IND │ │ > │ │ │ │ │─────────────────> │ │ > │ │ │ │ │ │ │ > │ │ │ │ 2s │ │ │ > │ │ │ │ │ │ │ > │ │ │ │ │ │ │ More > │ │ │ │ │ │ │ than > │ │ │ │ │ │ │ 2s > │ │ │ │ │ │ │ > │ │ │ │ │ │ │ > │ │ │ │ │ │ │ > │ │ └┬┘ │ │ │ > │ │ │────┐ │ │ │ > │ │ │ │ hci_disconnect │ │ │ > │ │ │<───┘ sync timeout, │ │ │ > │ │ │ del 'con' by │ │ │ > │ │ │ hci_conn_failed│ │ │ > │ │ │ │ └┬┘ > │ │ │ │ ACK │ > │ │ │ │<───────────────────│ > │ │ │ │ │ > │ │ │ Disc_Comp_Evt │ │ > │ │ │<────────────────────│ │ > │ │ │ │ │ > │ │ │────┐ │ │ > │ \│/ │ │ ignore the │ │ > │ X │<───┘ event since │ │ > │ /│\ │ 'con' has been │ │ > │ │ MGMT │ deleted │ │ > │ │ Response │ │ │ > │ │<─ ─ ─ ─ ─ ─ ─ ─ ─ ─│ │ │ > ┌─────┴─────┐ ┌─────┴────┐ ┌───┴──┐ ┌─────┴────┐ ┌───┴──┐ > │bluetootctl│ │bluetoothd│ │kernel│ │controller│ │remote│ > └───────────┘ └──────────┘ └──────┘ └──────────┘ └──────┘ > > > So in the failure case, the bluetoothd is blocked by waiting the mgmt response from kernel. From > our test, bluetoothd can't accept any command related to mgmt from bluetothctl. > > >> > >> 3355 static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data, > >> 3356 struct sk_buff *skb) > >> 3357 { > >> 3358 struct hci_ev_disconn_complete *ev = data; > >> 3359 u8 reason; > >> 3360 struct hci_conn_params *params; > >> 3361 struct hci_conn *conn; > >> 3362 bool mgmt_connected; > >> 3363 > >> 3364 bt_dev_dbg(hdev, "status 0x%2.2x", ev->status); > >> 3365 > >> 3366 hci_dev_lock(hdev); > >> 3367 > >> 3368 conn = hci_conn_hash_lookup_handle(hdev, __le16_to_cpu(ev->handle)); > >> 3369 if (!conn) > >> 3370 goto unlock; > >> 3371 > >> 3372 if (ev->status) { > >> 3373 mgmt_disconnect_failed(hdev, &conn->dst, conn->type, > >> 3374 conn->dst_type, ev->status); > >> 3375 goto unlock; > >> 3376 } > >> 3377 > >> 3378 conn->state = BT_CLOSED; > >> 3379 > >> 3380 mgmt_connected = test_and_clear_bit(HCI_CONN_MGMT_CONNECTED, &conn->flags); > >> 3381 > >> > >>> I double checked this and apparently this could no longer fail: > >>> > >>> + /* Disregard possible errors since hci_conn_del shall have been > >>> + * called even in case of errors had occurred since it would > >>> + * then cause hci_conn_failed to be called which calls > >>> + * hci_conn_del internally. > >>> + */ > >>> + hci_abort_conn_sync(hdev, conn, reason); > >>> > >>> So it will clean up the hci_conn no matter what is the timeout, so > >>> either you don't have this change or it is not working for some > >>> reason. > >>> > >> The issue is mgmt command is not repsonsed by bluetoothd, then the bluetootlctl is > >> blocked. It does not happen during the power off stage. It happened when disconnect > >> a BLE device, but the disconnect complete event sent from controller to host 2s later. > >> Then it causes the mgmt in bluetoothctl is blocked as decribed as above. > > > > There is a difference about a MGMT command, initiated by bluetoothd, > > versus a MGMT event initiated by the kernel, so the daemon is not > > blocked it just don't get a disconnection event, which is different > > than a command complete. > > > > What is the command sequence that you use to reproduce the problem? > Here is the command log: > [CHG] Controller 8C:FD:F0:21:81:87 Pairable: yes > [bluetooth]# power on > Changing power on succeeded > [bluetooth]# connect CF:90:67:3C:7A:56 > Attempting to connect to CF:90:67:3C:7A:56 > [CHG] Device CF:90:67:3C:7A:56 Connected: yes > Connection successful > [CHG] Device CF:90:67:3C:7A:56 ServicesResolved: yes > [Designer Mouse]# > [Designer Mouse]# disconnect D9:B5:6C:F2:51:91 > Attempting to disconnect from D9:B5:6C:F2:51:91 ## no disconnection success response > [Designer Mouse]# > [Designer Mouse]# power off > [Designer Mouse]# > Failed to set power off: org.freedesktop.DBus.Error.NoReply > > To easily reproduce this issue, we use a firmware which always send the disconnect > complete event more than 2s. Then the issue occurred 100%. > > Actually, the root cause is the hci_disconnect_sync doesn't handle this case since it > relies on __hci_cmd_sync_status_sk, which maximum timeout value is constrained to 2s. > > > > >>>>> hci_dev_lock(hdev); > >>>>> > >>>>> /* Check if the connection has been cleaned up concurrently */ > >>>>> > >>>>> base-commit: e25c8d66f6786300b680866c0e0139981273feba > >>>>> -- > >>>>> 2.34.1 > >>>>> > >>>> > >>>> > >>>> -- > >>>> Luiz Augusto von Dentz > >>> > >>> > >>> > >> > > > > >
Hi Luiz, On 12/17/2024 11:13 PM, Luiz Augusto von Dentz wrote: > Hi Cheng, > > On Tue, Dec 17, 2024 at 12:51 AM Cheng Jiang (IOE) > <quic_chejiang@quicinc.com> wrote: >> >> Hi Luiz, >> >> On 12/17/2024 11:15 AM, Luiz Augusto von Dentz wrote: >>> Hi Cheng, >>> >>> On Mon, Dec 16, 2024 at 9:49 PM Cheng Jiang (IOE) >>> <quic_chejiang@quicinc.com> wrote: >>>> >>>> Hi Luiz, >>>> >>>> On 12/16/2024 10:42 PM, Luiz Augusto von Dentz wrote: >>>>> Hi Cheng, >>>>> >>>>> On Mon, Dec 16, 2024 at 9:32 AM Luiz Augusto von Dentz >>>>> <luiz.dentz@gmail.com> wrote: >>>>>> >>>>>> Hi Cheng, >>>>>> >>>>>> On Mon, Dec 16, 2024 at 3:08 AM Cheng Jiang <quic_chejiang@quicinc.com> wrote: >>>>>>> >>>>>>> Sometimes, the remote device doesn't acknowledge the LL_TERMINATE_IND >>>>>>> in time, requiring the controller to wait for the supervision timeout, >>>>>>> which may exceed 2 seconds. In the current implementation, the >>>>>>> HCI_EV_DISCONN_COMPLETE event is ignored if it arrives late, since >>>>>>> the hci_abort_conn_sync has cleaned up the connection after 2 seconds. >>>>>>> This causes the mgmt to get stuck, resulting in bluetoothd waiting >>>>>>> indefinitely for the mgmt response to the disconnect. To recover, >>>>>>> restarting bluetoothd is necessary. >>>>>>> >>>>>>> bluetoothctl log like this: >>>>>>> [Designer Mouse]# disconnect D9:B5:6C:F2:51:91 >>>>>>> Attempting to disconnect from D9:B5:6C:F2:51:91 >>>>>>> [Designer Mouse]# >>>>>>> [Designer Mouse]# power off >>>>>>> [Designer Mouse]# >>>>>>> Failed to set power off: org.freedesktop.DBus.Error.NoReply. >>>>>>> >>>>>>> Signed-off-by: Cheng Jiang <quic_chejiang@quicinc.com> >>>>>>> --- >>>>>>> include/net/bluetooth/hci_core.h | 2 ++ >>>>>>> net/bluetooth/hci_conn.c | 9 +++++++++ >>>>>>> net/bluetooth/hci_event.c | 9 +++++++++ >>>>>>> net/bluetooth/hci_sync.c | 18 ++++++++++++++++++ >>>>>>> 4 files changed, 38 insertions(+) >>>>>>> >>>>>>> diff --git a/include/net/bluetooth/hci_core.h b/include/net/bluetooth/hci_core.h >>>>>>> index 734cd50cd..2ab079dcf 100644 >>>>>>> --- a/include/net/bluetooth/hci_core.h >>>>>>> +++ b/include/net/bluetooth/hci_core.h >>>>>>> @@ -753,6 +753,8 @@ struct hci_conn { >>>>>>> >>>>>>> struct bt_codec codec; >>>>>>> >>>>>>> + struct completion disc_ev_comp; >>>>>>> + >>>>>>> void (*connect_cfm_cb) (struct hci_conn *conn, u8 status); >>>>>>> void (*security_cfm_cb) (struct hci_conn *conn, u8 status); >>>>>>> void (*disconn_cfm_cb) (struct hci_conn *conn, u8 reason); >>>>>>> diff --git a/net/bluetooth/hci_conn.c b/net/bluetooth/hci_conn.c >>>>>>> index d097e308a..e0244e191 100644 >>>>>>> --- a/net/bluetooth/hci_conn.c >>>>>>> +++ b/net/bluetooth/hci_conn.c >>>>>>> @@ -1028,6 +1028,15 @@ static struct hci_conn *__hci_conn_add(struct hci_dev *hdev, int type, bdaddr_t >>>>>>> >>>>>>> hci_conn_init_sysfs(conn); >>>>>>> >>>>>>> + /* This disc_ev_comp is inited when we send a disconnect request to >>>>>>> + * the remote device but fail to receive the disconnect complete >>>>>>> + * event within the expected time (2 seconds). This occurs because >>>>>>> + * the remote device doesn't ack the terminate indication, forcing >>>>>>> + * the controller to wait for the supervision timeout. >>>>>>> + */ >>>>>>> + init_completion(&conn->disc_ev_comp); >>>>>>> + complete(&conn->disc_ev_comp); >>>>>>> + >>>>>>> return conn; >>>>>>> } >>>>>>> >>>>>>> diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c >>>>>>> index 2cc7a9306..60ecb2b18 100644 >>>>>>> --- a/net/bluetooth/hci_event.c >>>>>>> +++ b/net/bluetooth/hci_event.c >>>>>>> @@ -3366,6 +3366,15 @@ static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data, >>>>>>> if (!conn) >>>>>>> goto unlock; >>>>>>> >>>>>>> + /* Wake up disc_ev_comp here is ok. Since we hold the hdev lock >>>>>>> + * hci_abort_conn_sync will wait hdev lock release to continue. >>>>>>> + */ >>>>>>> + if (!completion_done(&conn->disc_ev_comp)) { >>>>>>> + complete(&conn->disc_ev_comp); >>>>>>> + /* Add some delay for hci_abort_conn_sync to handle the complete */ >>>>>>> + usleep_range(100, 1000); >>>>>>> + } >>>>>>> + >>>>>>> if (ev->status) { >>>>>>> mgmt_disconnect_failed(hdev, &conn->dst, conn->type, >>>>>>> conn->dst_type, ev->status); >>>>>>> diff --git a/net/bluetooth/hci_sync.c b/net/bluetooth/hci_sync.c >>>>>>> index 0badec712..783d04b57 100644 >>>>>>> --- a/net/bluetooth/hci_sync.c >>>>>>> +++ b/net/bluetooth/hci_sync.c >>>>>>> @@ -5590,6 +5590,24 @@ int hci_abort_conn_sync(struct hci_dev *hdev, struct hci_conn *conn, u8 reason) >>>>>>> break; >>>>>>> } >>>>>>> >>>>>>> + /* Check whether the connection is successfully disconnected. >>>>>>> + * Sometimes the remote device doesn't acknowledge the >>>>>>> + * LL_TERMINATE_IND in time, requiring the controller to wait >>>>>>> + * for the supervision timeout, which may exceed 2 seconds. In >>>>>>> + * this case, we need to wait for the HCI_EV_DISCONN_COMPLETE >>>>>>> + * event before cleaning up the connection. >>>>>>> + */ >>>>>>> + if (err == -ETIMEDOUT) { >>>>>>> + u32 idle_delay = msecs_to_jiffies(10 * conn->le_supv_timeout); >>>>>>> + >>>>>>> + reinit_completion(&conn->disc_ev_comp); >>>>>>> + if (!wait_for_completion_timeout(&conn->disc_ev_comp, idle_delay)) { >>>>>>> + bt_dev_warn(hdev, "Failed to get complete"); >>>>>>> + mgmt_disconnect_failed(hdev, &conn->dst, conn->type, >>>>>>> + conn->dst_type, conn->abort_reason); >>>>>>> + } >>>>>>> + } >>>>>> >>>>>> Why don't we just set the supervision timeout as timeout then? If we >>>>>> will have to wait for it anyway just change hci_disconnect_sync to use >>>>>> 10 * conn->le_supv_timeout as timeout instead. >>>>>> >>>> hci_disconnect_sync uses __hci_cmd_sync_status_sk to wait for the >>>> HCI_EV_DISCONN_COMPLETE event, which will send the command in hci_cmd_work. >>>> In hci_cmd_work, it will start a timer with HCI_CMD_TIMEOUT(2s) to wait >>>> for the event. So even in hci_disconnect_sync we can set the timeout to >>>> supervision timeout, hci_disconnect_sync still timeout after 2s. >>> >>> Wait, why are you talking about HCI_CMD_TIMEOUT when I told you to use >>> the supervision timeout instead? If it still timeout after to 2 >>> seconds then there is something still forcing HCI_CMD_TIMEOUT which >>> shouldn't happen. >>> >> Since the lower layer (hci_cmd_work) has set the timeout to HCI_CMD_TIMEOUT, so >> even the upper layer set to a larger timeout value, like supervision timeout, >> it still get the timeout after HCI_CMD_TIMEOUT. The timeout flow is: >> hci_disconnect_sync -> __hci_cmd_sync_sk(wait_event_interruptible_timeout) -> >> hci_cmd_work -> hci_cmd_timeout -> hci_cmd_sync_cancel_sync -> wake up the >> wait_event_interruptible_timeout in __hci_cmd_sync_sk -> hci_disconnect_sync timeout. >> >> So even if we set a large timeout value in hci_disconnect_sync, it doesn't work >> since it's waked up by other event, not the real timeout. >> >> What's more, in the hci_disconnect_sync, if the reason it not power_off, it waits >> for the disconnect complete event rather than command status event. According to >> BT core spec, disconnect complete event has to wait for remote's ack or wait until >> supervision timeout. It's a valid case that the disconnect complete event taking >> more than 2s. > > You seems to be confusing the role of 2 different timers: > > err = wait_event_interruptible_timeout(hdev->req_wait_q, > hdev->req_status != HCI_REQ_PEND, > timeout); > > and > > queue_delayed_work(hdev->workqueue, &hdev->cmd_timer, > HCI_CMD_TIMEOUT); > > The former waits for a specific event, while the later waits for > handle_cmd_cnt_and_timer, each can have a distinct timeout as in the > code bellow: > > return __hci_cmd_sync_status_sk(hdev, HCI_OP_LE_EXT_CREATE_CONN, > plen, data, > HCI_EV_LE_ENHANCED_CONN_COMPLETE, > conn->conn_timeout, NULL); > > The reason there are 2 timers is that we need to track the number of > commands outstanding in the controller, and no you can't delay Command > Status: > > When the Controller receives the HCI_Disconnect command, it _shall_ send the > HCI_Command_Status event to the Host. > > So even if HCI_Disconnection_Complete is delayed the following shall > still work provided the HCI_Command_Status is still being generated > accordingly: > > index 0badec7120ab..0ab607fb6433 100644 > --- a/net/bluetooth/hci_sync.c > +++ b/net/bluetooth/hci_sync.c > @@ -5444,7 +5444,7 @@ static int hci_disconnect_sync(struct hci_dev > *hdev, struct hci_conn *conn, > return __hci_cmd_sync_status_sk(hdev, HCI_OP_DISCONNECT, > sizeof(cp), &cp, > HCI_EV_DISCONN_COMPLETE, > - HCI_CMD_TIMEOUT, NULL); > + conn->disc_timeout, NULL); > > return __hci_cmd_sync_status(hdev, HCI_OP_DISCONNECT, sizeof(cp), &cp, > HCI_CMD_TIMEOUT); > > Then we need to adjust conn->disc_timeout according to supervision > timeout, that said it appears disc_timeout is actually acting as idle > timeout, so perhaps we need a separate field, also it doesn't look > like we track the supervision timeout for non-LE links. > Yes, you are right. I mixed up the two timers. Thank you for the detail explanation. I just found you have fixed this issue by 227a0cdf4a028a73dc256d0f5144b4808d718893. Link: https://github.com/bluez/bluez/issues/932 We are using an old codebase. Sorry for this. BTW, do you think we should change the timeout value here? Set the timeout value based on the link type. @@ -5419,6 +5419,7 @@ static int hci_disconnect_sync(struct hci_dev *hdev, struct hci_conn *conn, u8 reason) { struct hci_cp_disconnect cp; + u32 disc_timeout; if (test_bit(HCI_CONN_BIG_CREATED, &conn->flags)) { /* This is a BIS connection, hci_conn_del will @@ -5440,11 +5441,18 @@ static int hci_disconnect_sync(struct hci_dev *hdev, struct hci_conn *conn, * used when suspending or powering off, where we don't want to wait * for the peer's response. */ - if (reason != HCI_ERROR_REMOTE_POWER_OFF) + + if (reason != HCI_ERROR_REMOTE_POWER_OFF) { + if (conn->type == LE_LINK) + disc_timeout = msecs_to_jiffies(10 * conn->le_supv_timeout); + else + disc_timeout = conn->disc_timeout; + return __hci_cmd_sync_status_sk(hdev, HCI_OP_DISCONNECT, sizeof(cp), &cp, HCI_EV_DISCONN_COMPLETE, - HCI_CMD_TIMEOUT, NULL); + disc_timeout, NULL); + } >>>>>> That said, we really need to fix bluetoothd if it is not able to be >>>>>> cleaned up if SET_POWERED command fails, but it looks like it is >>>>>> handling errors correctly so it sounds like something else is at play. >>>>> >>>> The issue arises after a 2-second timeout of hci_disconnect_sync. During >>>> hci_abort_conn_sync, the connection is cleaned up by hci_conn_failed. >>>> after supervision timeout, the disconnect complete event arrives, but >>>> it returns at line 3370 since the connection has already been removed. >>>> As a result, bluetoothd does not send the mgmt event for MGMT_OP_DISCONNECT >>>> to the application layer (bluetoothctl), causing bluetoothctl to get stuck >>>> and unable to perform other mgmt commands. >>> >>> The command shall have completed regardless if disconnect complete has >>> been received or not, the is the whole point of having a timeout, so >>> this makes no sense to me, or you are not describing what is happening >>> here. Also there is no MGMT_OP_DISCONNECT pending, it is >>> MGMT_OP_SET_POWERED, if you are talking about the DISCONNECTED event >>> that is a totally different thing and perhaps that is the source of >>> the problem because if we do cleanup hci_conn even in case the command >>> fails/times out then we should be generating a disconnected event as >>> well. >>> >> Here is the flow describe the issue. For normal case: >> ┌───────────┐ ┌──────────┐ ┌──────┐ ┌──────────┐ ┌──────┐ >> │bluetootctl│ │bluetoothd│ │kernel│ │controller│ │remote│ >> └─────┬─────┘ └─────┬────┘ └───┬──┘ └─────┬────┘ └───┬──┘ >> │ disconnect cmd │ │ │ │ >> │─────────────────────>│ │ │ │ >> │ │ │ │ │ >> │ │ MGMT_OP │ │ │ >> │ │ _DISCONNECT │ │ │ >> │ │───────────────────>│ │ │ >> │ │ │ │ │ >> │ │ │ HCI_Disconnect │ │ >> │ │ │───────────────────>│ │ >> │ │ │ │ │ >> │ │ │ │ LL_TERMINATE │ >> │ │ │ │ _IND │ >> │ │ │ │───────────────────>│ >> │ │ │ │ │ >> │ │ │ │ ACK │ >> │ │ │ │<───────────────────│ >> │ │ │ │ │ >> │ │ │ Disc_Comp_Evt │ │ >> │ │ │<───────────────────│ │ >> │ │ │ │ │ >> │ │ MGMT Response │ │ │ >> │ │<───────────────────│ │ │ >> │ │ │ │ │ >> │ disc succ │ │ │ │ >> │<─────────────────────│ │ │ │ >> ┌─────┴─────┐ ┌─────┴────┐ ┌───┴──┐ ┌─────┴────┐ ┌───┴──┐ >> │bluetootctl│ │bluetoothd│ │kernel│ │controller│ │remote│ >> └───────────┘ └──────────┘ └──────┘ └──────────┘ └──────┘ >> >> >> The failure case like this: >> >> ┌───────────┐ ┌──────────┐ ┌──────┐ ┌──────────┐ ┌──────┐ >> │bluetootctl│ │bluetoothd│ │kernel│ │controller│ │remote│ >> └─────┬─────┘ └─────┬────┘ └───┬──┘ └─────┬────┘ └───┬──┘ >> │ disconnect │ │ │ │ >> │ cmd │ │ │ │ >> │─────────────────────>│ │ │ │ >> │ │ │ │ │ >> │ │ MGMT_OP_ │ │ │ >> │ │ DISCONNECT │ │ │ >> │ │───────────────────>│ │ │ >> │ │ │ │ │ >> │ │ ┌┴┐ HCI_ │ │ >> │ │ │ │ Disconnect │ │ >> │ │ │ │ ──────────────────>│ │ >> │ │ │ │ │ │ >> │ │ │ │ │ LL_TERMINATE ┌┴┐ >> │ │ │ │ │ _IND │ │ >> │ │ │ │ │─────────────────> │ │ >> │ │ │ │ │ │ │ >> │ │ │ │ 2s │ │ │ >> │ │ │ │ │ │ │ >> │ │ │ │ │ │ │ More >> │ │ │ │ │ │ │ than >> │ │ │ │ │ │ │ 2s >> │ │ │ │ │ │ │ >> │ │ │ │ │ │ │ >> │ │ │ │ │ │ │ >> │ │ └┬┘ │ │ │ >> │ │ │────┐ │ │ │ >> │ │ │ │ hci_disconnect │ │ │ >> │ │ │<───┘ sync timeout, │ │ │ >> │ │ │ del 'con' by │ │ │ >> │ │ │ hci_conn_failed│ │ │ >> │ │ │ │ └┬┘ >> │ │ │ │ ACK │ >> │ │ │ │<───────────────────│ >> │ │ │ │ │ >> │ │ │ Disc_Comp_Evt │ │ >> │ │ │<────────────────────│ │ >> │ │ │ │ │ >> │ │ │────┐ │ │ >> │ \│/ │ │ ignore the │ │ >> │ X │<───┘ event since │ │ >> │ /│\ │ 'con' has been │ │ >> │ │ MGMT │ deleted │ │ >> │ │ Response │ │ │ >> │ │<─ ─ ─ ─ ─ ─ ─ ─ ─ ─│ │ │ >> ┌─────┴─────┐ ┌─────┴────┐ ┌───┴──┐ ┌─────┴────┐ ┌───┴──┐ >> │bluetootctl│ │bluetoothd│ │kernel│ │controller│ │remote│ >> └───────────┘ └──────────┘ └──────┘ └──────────┘ └──────┘ >> >> >> So in the failure case, the bluetoothd is blocked by waiting the mgmt response from kernel. From >> our test, bluetoothd can't accept any command related to mgmt from bluetothctl. >> >>>> >>>> 3355 static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data, >>>> 3356 struct sk_buff *skb) >>>> 3357 { >>>> 3358 struct hci_ev_disconn_complete *ev = data; >>>> 3359 u8 reason; >>>> 3360 struct hci_conn_params *params; >>>> 3361 struct hci_conn *conn; >>>> 3362 bool mgmt_connected; >>>> 3363 >>>> 3364 bt_dev_dbg(hdev, "status 0x%2.2x", ev->status); >>>> 3365 >>>> 3366 hci_dev_lock(hdev); >>>> 3367 >>>> 3368 conn = hci_conn_hash_lookup_handle(hdev, __le16_to_cpu(ev->handle)); >>>> 3369 if (!conn) >>>> 3370 goto unlock; >>>> 3371 >>>> 3372 if (ev->status) { >>>> 3373 mgmt_disconnect_failed(hdev, &conn->dst, conn->type, >>>> 3374 conn->dst_type, ev->status); >>>> 3375 goto unlock; >>>> 3376 } >>>> 3377 >>>> 3378 conn->state = BT_CLOSED; >>>> 3379 >>>> 3380 mgmt_connected = test_and_clear_bit(HCI_CONN_MGMT_CONNECTED, &conn->flags); >>>> 3381 >>>> >>>>> I double checked this and apparently this could no longer fail: >>>>> >>>>> + /* Disregard possible errors since hci_conn_del shall have been >>>>> + * called even in case of errors had occurred since it would >>>>> + * then cause hci_conn_failed to be called which calls >>>>> + * hci_conn_del internally. >>>>> + */ >>>>> + hci_abort_conn_sync(hdev, conn, reason); >>>>> >>>>> So it will clean up the hci_conn no matter what is the timeout, so >>>>> either you don't have this change or it is not working for some >>>>> reason. >>>>> >>>> The issue is mgmt command is not repsonsed by bluetoothd, then the bluetootlctl is >>>> blocked. It does not happen during the power off stage. It happened when disconnect >>>> a BLE device, but the disconnect complete event sent from controller to host 2s later. >>>> Then it causes the mgmt in bluetoothctl is blocked as decribed as above. >>> >>> There is a difference about a MGMT command, initiated by bluetoothd, >>> versus a MGMT event initiated by the kernel, so the daemon is not >>> blocked it just don't get a disconnection event, which is different >>> than a command complete. >>> >>> What is the command sequence that you use to reproduce the problem? >> Here is the command log: >> [CHG] Controller 8C:FD:F0:21:81:87 Pairable: yes >> [bluetooth]# power on >> Changing power on succeeded >> [bluetooth]# connect CF:90:67:3C:7A:56 >> Attempting to connect to CF:90:67:3C:7A:56 >> [CHG] Device CF:90:67:3C:7A:56 Connected: yes >> Connection successful >> [CHG] Device CF:90:67:3C:7A:56 ServicesResolved: yes >> [Designer Mouse]# >> [Designer Mouse]# disconnect D9:B5:6C:F2:51:91 >> Attempting to disconnect from D9:B5:6C:F2:51:91 ## no disconnection success response >> [Designer Mouse]# >> [Designer Mouse]# power off >> [Designer Mouse]# >> Failed to set power off: org.freedesktop.DBus.Error.NoReply >> >> To easily reproduce this issue, we use a firmware which always send the disconnect >> complete event more than 2s. Then the issue occurred 100%. >> >> Actually, the root cause is the hci_disconnect_sync doesn't handle this case since it >> relies on __hci_cmd_sync_status_sk, which maximum timeout value is constrained to 2s. >> >>> >>>>>>> hci_dev_lock(hdev); >>>>>>> >>>>>>> /* Check if the connection has been cleaned up concurrently */ >>>>>>> >>>>>>> base-commit: e25c8d66f6786300b680866c0e0139981273feba >>>>>>> -- >>>>>>> 2.34.1 >>>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Luiz Augusto von Dentz >>>>> >>>>> >>>>> >>>> >>> >>> >> > >
diff --git a/include/net/bluetooth/hci_core.h b/include/net/bluetooth/hci_core.h index 734cd50cd..2ab079dcf 100644 --- a/include/net/bluetooth/hci_core.h +++ b/include/net/bluetooth/hci_core.h @@ -753,6 +753,8 @@ struct hci_conn { struct bt_codec codec; + struct completion disc_ev_comp; + void (*connect_cfm_cb) (struct hci_conn *conn, u8 status); void (*security_cfm_cb) (struct hci_conn *conn, u8 status); void (*disconn_cfm_cb) (struct hci_conn *conn, u8 reason); diff --git a/net/bluetooth/hci_conn.c b/net/bluetooth/hci_conn.c index d097e308a..e0244e191 100644 --- a/net/bluetooth/hci_conn.c +++ b/net/bluetooth/hci_conn.c @@ -1028,6 +1028,15 @@ static struct hci_conn *__hci_conn_add(struct hci_dev *hdev, int type, bdaddr_t hci_conn_init_sysfs(conn); + /* This disc_ev_comp is inited when we send a disconnect request to + * the remote device but fail to receive the disconnect complete + * event within the expected time (2 seconds). This occurs because + * the remote device doesn't ack the terminate indication, forcing + * the controller to wait for the supervision timeout. + */ + init_completion(&conn->disc_ev_comp); + complete(&conn->disc_ev_comp); + return conn; } diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c index 2cc7a9306..60ecb2b18 100644 --- a/net/bluetooth/hci_event.c +++ b/net/bluetooth/hci_event.c @@ -3366,6 +3366,15 @@ static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data, if (!conn) goto unlock; + /* Wake up disc_ev_comp here is ok. Since we hold the hdev lock + * hci_abort_conn_sync will wait hdev lock release to continue. + */ + if (!completion_done(&conn->disc_ev_comp)) { + complete(&conn->disc_ev_comp); + /* Add some delay for hci_abort_conn_sync to handle the complete */ + usleep_range(100, 1000); + } + if (ev->status) { mgmt_disconnect_failed(hdev, &conn->dst, conn->type, conn->dst_type, ev->status); diff --git a/net/bluetooth/hci_sync.c b/net/bluetooth/hci_sync.c index 0badec712..783d04b57 100644 --- a/net/bluetooth/hci_sync.c +++ b/net/bluetooth/hci_sync.c @@ -5590,6 +5590,24 @@ int hci_abort_conn_sync(struct hci_dev *hdev, struct hci_conn *conn, u8 reason) break; } + /* Check whether the connection is successfully disconnected. + * Sometimes the remote device doesn't acknowledge the + * LL_TERMINATE_IND in time, requiring the controller to wait + * for the supervision timeout, which may exceed 2 seconds. In + * this case, we need to wait for the HCI_EV_DISCONN_COMPLETE + * event before cleaning up the connection. + */ + if (err == -ETIMEDOUT) { + u32 idle_delay = msecs_to_jiffies(10 * conn->le_supv_timeout); + + reinit_completion(&conn->disc_ev_comp); + if (!wait_for_completion_timeout(&conn->disc_ev_comp, idle_delay)) { + bt_dev_warn(hdev, "Failed to get complete"); + mgmt_disconnect_failed(hdev, &conn->dst, conn->type, + conn->dst_type, conn->abort_reason); + } + } + hci_dev_lock(hdev); /* Check if the connection has been cleaned up concurrently */
Sometimes, the remote device doesn't acknowledge the LL_TERMINATE_IND in time, requiring the controller to wait for the supervision timeout, which may exceed 2 seconds. In the current implementation, the HCI_EV_DISCONN_COMPLETE event is ignored if it arrives late, since the hci_abort_conn_sync has cleaned up the connection after 2 seconds. This causes the mgmt to get stuck, resulting in bluetoothd waiting indefinitely for the mgmt response to the disconnect. To recover, restarting bluetoothd is necessary. bluetoothctl log like this: [Designer Mouse]# disconnect D9:B5:6C:F2:51:91 Attempting to disconnect from D9:B5:6C:F2:51:91 [Designer Mouse]# [Designer Mouse]# power off [Designer Mouse]# Failed to set power off: org.freedesktop.DBus.Error.NoReply. Signed-off-by: Cheng Jiang <quic_chejiang@quicinc.com> --- include/net/bluetooth/hci_core.h | 2 ++ net/bluetooth/hci_conn.c | 9 +++++++++ net/bluetooth/hci_event.c | 9 +++++++++ net/bluetooth/hci_sync.c | 18 ++++++++++++++++++ 4 files changed, 38 insertions(+) base-commit: e25c8d66f6786300b680866c0e0139981273feba