diff mbox series

[BlueZ] Bluetooth: hci_core: Skip hci_cmd_work if hci_request is pending

Message ID 20240214012328.BlueZ.1.I10eda6f375bc8dfedf4eef3a8cf6572c65803afc@changeid (mailing list archive)
State New, archived
Headers show
Series [BlueZ] Bluetooth: hci_core: Skip hci_cmd_work if hci_request is pending | expand

Commit Message

Hsin-chen Chuang Feb. 13, 2024, 5:23 p.m. UTC
hci_cmd_work overwrites the hdev->sent_cmd which contains the required
info for a hci_request to work. In the real world, it's observed that
a request from hci_le_ext_create_conn_sync could be interrupted by
the authentication (hci_conn_auth) caused by rfcomm_sock_connect. When
it happends, hci_le_ext_create_conn_sync hangs until timeout; If the
LE connection is triggered by MGMT, it freezes the whole MGMT interface.

Signed-off-by: Hsin-chen Chuang <chharry@chromium.org>
---

 net/bluetooth/hci_core.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Comments

Luiz Augusto von Dentz Feb. 15, 2024, 9:37 p.m. UTC | #1
Hi Hsin-chen,

On Tue, Feb 13, 2024 at 12:24 PM Hsin-chen Chuang <chharry@chromium.org> wrote:
>
> hci_cmd_work overwrites the hdev->sent_cmd which contains the required
> info for a hci_request to work. In the real world, it's observed that
> a request from hci_le_ext_create_conn_sync could be interrupted by
> the authentication (hci_conn_auth) caused by rfcomm_sock_connect. When
> it happends, hci_le_ext_create_conn_sync hangs until timeout; If the
> LE connection is triggered by MGMT, it freezes the whole MGMT interface.
>
> Signed-off-by: Hsin-chen Chuang <chharry@chromium.org>
> ---
>
>  net/bluetooth/hci_core.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/net/bluetooth/hci_core.c b/net/bluetooth/hci_core.c
> index 34c8dca2069f..e3706889976d 100644
> --- a/net/bluetooth/hci_core.c
> +++ b/net/bluetooth/hci_core.c
> @@ -4213,8 +4213,11 @@ static void hci_cmd_work(struct work_struct *work)
>         BT_DBG("%s cmd_cnt %d cmd queued %d", hdev->name,
>                atomic_read(&hdev->cmd_cnt), skb_queue_len(&hdev->cmd_q));
>
> -       /* Send queued commands */
> -       if (atomic_read(&hdev->cmd_cnt)) {
> +       /* Send queued commands. Don't send the command when there is a pending
> +        * hci_request because the request callbacks would be overwritten.
> +        */
> +       if (atomic_read(&hdev->cmd_cnt) &&
> +           !hci_dev_test_flag(hdev, HCI_CMD_PENDING)) {
>                 skb = skb_dequeue(&hdev->cmd_q);
>                 if (!skb)
>                         return;
> --
> 2.43.0.687.g38aa6559b0-goog


This seems to be causing some mgmt-tester failures:

Pair Device - Sec Mode 3 Success 1                   Timed out   22.753 seconds
Pair Device - Sec Mode 3 Reject 1                    Timed out   22.533 seconds
Pair Device - Sec Mode 3 Reject 2                    Timed out   22.526 seconds

I think this is because we need to respond to an event with a command like:

< HCI Command: Create Conn.. (0x01|0x0005) plen 13  #241 [hci0] 16:25:38.699066
        Address: 00:AA:01:01:00:00 (Intel Corporation)
        Packet type: 0x0018
          DM1 may be used
          DH1 may be used
        Page scan repetition mode: R2 (0x02)
        Page scan mode: Mandatory (0x00)
        Clock offset: 0x0000
        Role switch: Allow peripheral (0x01)
> HCI Event: Command Status (0x0f) plen 4           #242 [hci0] 16:25:38.701881
      Create Connection (0x01|0x0005) ncmd 1
        Status: Success (0x00)
> HCI Event: Link Key Request (0x17) plen 6         #243 [hci0] 16:25:38.702375
        Address: 00:AA:01:01:00:00 (Intel Corporation)

But because Create Connection is pending we cannot respond to Link Key
Request, so it is actually a design problem if we cannot send commands
because something is pending so perhaps we need to redesign how we
store cmd_sent so we can have multiple outstanding commands rather
than just one.
Luiz Augusto von Dentz Feb. 16, 2024, 3:54 p.m. UTC | #2
Hi Hsin-chen,

On Thu, Feb 15, 2024 at 11:21 PM Hsin-chen Chuang <chharry@google.com> wrote:
>
> +Some Googlers who would be interested in
>
> Hi Luiz,
>
> How about moving the hci_req-related data out from sent_cmd? This allows sending HCI commands while hci_req data would not be overwritten.

I have something like the following in the works:

https://gist.github.com/Vudentz/251275bb688fac32585f90ac0076c407

It is not stable yet, but I think we can get away with it since it
just means we can keep the pending request stored in the req_skb, that
said we might need to overhaul this design since it is not very clean
in my opinion.

> On Fri, Feb 16, 2024 at 5:37 AM Luiz Augusto von Dentz <luiz.dentz@gmail.com> wrote:
>>
>> Hi Hsin-chen,
>>
>> On Tue, Feb 13, 2024 at 12:24 PM Hsin-chen Chuang <chharry@chromium.org> wrote:
>> >
>> > hci_cmd_work overwrites the hdev->sent_cmd which contains the required
>> > info for a hci_request to work. In the real world, it's observed that
>> > a request from hci_le_ext_create_conn_sync could be interrupted by
>> > the authentication (hci_conn_auth) caused by rfcomm_sock_connect. When
>> > it happends, hci_le_ext_create_conn_sync hangs until timeout; If the
>> > LE connection is triggered by MGMT, it freezes the whole MGMT interface.
>> >
>> > Signed-off-by: Hsin-chen Chuang <chharry@chromium.org>
>> > ---
>> >
>> >  net/bluetooth/hci_core.c | 7 +++++--
>> >  1 file changed, 5 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/net/bluetooth/hci_core.c b/net/bluetooth/hci_core.c
>> > index 34c8dca2069f..e3706889976d 100644
>> > --- a/net/bluetooth/hci_core.c
>> > +++ b/net/bluetooth/hci_core.c
>> > @@ -4213,8 +4213,11 @@ static void hci_cmd_work(struct work_struct *work)
>> >         BT_DBG("%s cmd_cnt %d cmd queued %d", hdev->name,
>> >                atomic_read(&hdev->cmd_cnt), skb_queue_len(&hdev->cmd_q));
>> >
>> > -       /* Send queued commands */
>> > -       if (atomic_read(&hdev->cmd_cnt)) {
>> > +       /* Send queued commands. Don't send the command when there is a pending
>> > +        * hci_request because the request callbacks would be overwritten.
>> > +        */
>> > +       if (atomic_read(&hdev->cmd_cnt) &&
>> > +           !hci_dev_test_flag(hdev, HCI_CMD_PENDING)) {
>> >                 skb = skb_dequeue(&hdev->cmd_q);
>> >                 if (!skb)
>> >                         return;
>> > --
>> > 2.43.0.687.g38aa6559b0-goog
>>
>>
>> This seems to be causing some mgmt-tester failures:
>>
>> Pair Device - Sec Mode 3 Success 1                   Timed out   22.753 seconds
>> Pair Device - Sec Mode 3 Reject 1                    Timed out   22.533 seconds
>> Pair Device - Sec Mode 3 Reject 2                    Timed out   22.526 seconds
>>
>> I think this is because we need to respond to an event with a command like:
>>
>> < HCI Command: Create Conn.. (0x01|0x0005) plen 13  #241 [hci0] 16:25:38.699066
>>         Address: 00:AA:01:01:00:00 (Intel Corporation)
>>         Packet type: 0x0018
>>           DM1 may be used
>>           DH1 may be used
>>         Page scan repetition mode: R2 (0x02)
>>         Page scan mode: Mandatory (0x00)
>>         Clock offset: 0x0000
>>         Role switch: Allow peripheral (0x01)
>> > HCI Event: Command Status (0x0f) plen 4           #242 [hci0] 16:25:38.701881
>>       Create Connection (0x01|0x0005) ncmd 1
>>         Status: Success (0x00)
>> > HCI Event: Link Key Request (0x17) plen 6         #243 [hci0] 16:25:38.702375
>>         Address: 00:AA:01:01:00:00 (Intel Corporation)
>>
>> But because Create Connection is pending we cannot respond to Link Key
>> Request, so it is actually a design problem if we cannot send commands
>> because something is pending so perhaps we need to redesign how we
>> store cmd_sent so we can have multiple outstanding commands rather
>> than just one.
>>
>> --
>> Luiz Augusto von Dentz
>
>
>
> --
> Best Regards,
> Hsin-chen
Hsin-chen Chuang Feb. 16, 2024, 4:47 p.m. UTC | #3
Hi Luiz,

Good point. I found the hdev->req_skb is actually not read at all in the
current code base, so perhaps we could remove the legacy code entirely,
and reuse it for solving the problem we're facing now. This approach
seems cleaner to me. WDYT?

OTOH the below snippet in hci_event_packet (also another similar one in
hci_le_meta_evt) needs redesign as well. It distinguishes LE/Classic
event per hdev->sent_cmd, so the hci_req_cmd_complete for LE command
would never be called after a classic command is sent, vice versa.

/* Only match event if command OGF is not for LE */
if (hdev->sent_cmd &&
    hci_opcode_ogf(hci_skb_opcode(hdev->sent_cmd)) != 0x08 &&
    hci_skb_event(hdev->sent_cmd) == event) {
        hci_req_cmd_complete(hdev, hci_skb_opcode(hdev->sent_cmd),
                             status, &req_complete, &req_complete_skb);
        req_evt = event;
}


On Fri, Feb 16, 2024 at 11:54 PM Luiz Augusto von Dentz
<luiz.dentz@gmail.com> wrote:
>
> Hi Hsin-chen,
>
> On Thu, Feb 15, 2024 at 11:21 PM Hsin-chen Chuang <chharry@google.com> wrote:
> >
> > +Some Googlers who would be interested in
> >
> > Hi Luiz,
> >
> > How about moving the hci_req-related data out from sent_cmd? This allows sending HCI commands while hci_req data would not be overwritten.
>
> I have something like the following in the works:
>
> https://gist.github.com/Vudentz/251275bb688fac32585f90ac0076c407
>
> It is not stable yet, but I think we can get away with it since it
> just means we can keep the pending request stored in the req_skb, that
> said we might need to overhaul this design since it is not very clean
> in my opinion.
>
> > On Fri, Feb 16, 2024 at 5:37 AM Luiz Augusto von Dentz <luiz.dentz@gmail.com> wrote:
> >>
> >> Hi Hsin-chen,
> >>
> >> On Tue, Feb 13, 2024 at 12:24 PM Hsin-chen Chuang <chharry@chromium.org> wrote:
> >> >
> >> > hci_cmd_work overwrites the hdev->sent_cmd which contains the required
> >> > info for a hci_request to work. In the real world, it's observed that
> >> > a request from hci_le_ext_create_conn_sync could be interrupted by
> >> > the authentication (hci_conn_auth) caused by rfcomm_sock_connect. When
> >> > it happends, hci_le_ext_create_conn_sync hangs until timeout; If the
> >> > LE connection is triggered by MGMT, it freezes the whole MGMT interface.
> >> >
> >> > Signed-off-by: Hsin-chen Chuang <chharry@chromium.org>
> >> > ---
> >> >
> >> >  net/bluetooth/hci_core.c | 7 +++++--
> >> >  1 file changed, 5 insertions(+), 2 deletions(-)
> >> >
> >> > diff --git a/net/bluetooth/hci_core.c b/net/bluetooth/hci_core.c
> >> > index 34c8dca2069f..e3706889976d 100644
> >> > --- a/net/bluetooth/hci_core.c
> >> > +++ b/net/bluetooth/hci_core.c
> >> > @@ -4213,8 +4213,11 @@ static void hci_cmd_work(struct work_struct *work)
> >> >         BT_DBG("%s cmd_cnt %d cmd queued %d", hdev->name,
> >> >                atomic_read(&hdev->cmd_cnt), skb_queue_len(&hdev->cmd_q));
> >> >
> >> > -       /* Send queued commands */
> >> > -       if (atomic_read(&hdev->cmd_cnt)) {
> >> > +       /* Send queued commands. Don't send the command when there is a pending
> >> > +        * hci_request because the request callbacks would be overwritten.
> >> > +        */
> >> > +       if (atomic_read(&hdev->cmd_cnt) &&
> >> > +           !hci_dev_test_flag(hdev, HCI_CMD_PENDING)) {
> >> >                 skb = skb_dequeue(&hdev->cmd_q);
> >> >                 if (!skb)
> >> >                         return;
> >> > --
> >> > 2.43.0.687.g38aa6559b0-goog
> >>
> >>
> >> This seems to be causing some mgmt-tester failures:
> >>
> >> Pair Device - Sec Mode 3 Success 1                   Timed out   22.753 seconds
> >> Pair Device - Sec Mode 3 Reject 1                    Timed out   22.533 seconds
> >> Pair Device - Sec Mode 3 Reject 2                    Timed out   22.526 seconds
> >>
> >> I think this is because we need to respond to an event with a command like:
> >>
> >> < HCI Command: Create Conn.. (0x01|0x0005) plen 13  #241 [hci0] 16:25:38.699066
> >>         Address: 00:AA:01:01:00:00 (Intel Corporation)
> >>         Packet type: 0x0018
> >>           DM1 may be used
> >>           DH1 may be used
> >>         Page scan repetition mode: R2 (0x02)
> >>         Page scan mode: Mandatory (0x00)
> >>         Clock offset: 0x0000
> >>         Role switch: Allow peripheral (0x01)
> >> > HCI Event: Command Status (0x0f) plen 4           #242 [hci0] 16:25:38.701881
> >>       Create Connection (0x01|0x0005) ncmd 1
> >>         Status: Success (0x00)
> >> > HCI Event: Link Key Request (0x17) plen 6         #243 [hci0] 16:25:38.702375
> >>         Address: 00:AA:01:01:00:00 (Intel Corporation)
> >>
> >> But because Create Connection is pending we cannot respond to Link Key
> >> Request, so it is actually a design problem if we cannot send commands
> >> because something is pending so perhaps we need to redesign how we
> >> store cmd_sent so we can have multiple outstanding commands rather
> >> than just one.
> >>
> >> --
> >> Luiz Augusto von Dentz
> >
> >
> >
> > --
> > Best Regards,
> > Hsin-chen
>
>
>
> --
> Luiz Augusto von Dentz
Luiz Augusto von Dentz Feb. 16, 2024, 9:28 p.m. UTC | #4
Hi Hsin-chen,

On Fri, Feb 16, 2024 at 11:47 AM Hsin-chen Chuang <chharry@google.com> wrote:
>
> Hi Luiz,
>
> Good point. I found the hdev->req_skb is actually not read at all in the
> current code base, so perhaps we could remove the legacy code entirely,
> and reuse it for solving the problem we're facing now. This approach
> seems cleaner to me. WDYT?
>
> OTOH the below snippet in hci_event_packet (also another similar one in
> hci_le_meta_evt) needs redesign as well. It distinguishes LE/Classic
> event per hdev->sent_cmd, so the hci_req_cmd_complete for LE command
> would never be called after a classic command is sent, vice versa.
>
> /* Only match event if command OGF is not for LE */
> if (hdev->sent_cmd &&
>     hci_opcode_ogf(hci_skb_opcode(hdev->sent_cmd)) != 0x08 &&
>     hci_skb_event(hdev->sent_cmd) == event) {
>         hci_req_cmd_complete(hdev, hci_skb_opcode(hdev->sent_cmd),
>                              status, &req_complete, &req_complete_skb);
>         req_evt = event;
> }

Yeah, Ive fixed that, but there are some mgmt-tester test that don't
seem to be running for some reason, anyway this actually fixes tests
like Pair Device - Sec Mode 3 Reject 1 since that happens to receive
events that requires commands as responses which was taking 20+
seconds to complete since that would be overwrite the sent_cmd.


>
> On Fri, Feb 16, 2024 at 11:54 PM Luiz Augusto von Dentz
> <luiz.dentz@gmail.com> wrote:
> >
> > Hi Hsin-chen,
> >
> > On Thu, Feb 15, 2024 at 11:21 PM Hsin-chen Chuang <chharry@google.com> wrote:
> > >
> > > +Some Googlers who would be interested in
> > >
> > > Hi Luiz,
> > >
> > > How about moving the hci_req-related data out from sent_cmd? This allows sending HCI commands while hci_req data would not be overwritten.
> >
> > I have something like the following in the works:
> >
> > https://gist.github.com/Vudentz/251275bb688fac32585f90ac0076c407
> >
> > It is not stable yet, but I think we can get away with it since it
> > just means we can keep the pending request stored in the req_skb, that
> > said we might need to overhaul this design since it is not very clean
> > in my opinion.
> >
> > > On Fri, Feb 16, 2024 at 5:37 AM Luiz Augusto von Dentz <luiz.dentz@gmail.com> wrote:
> > >>
> > >> Hi Hsin-chen,
> > >>
> > >> On Tue, Feb 13, 2024 at 12:24 PM Hsin-chen Chuang <chharry@chromium.org> wrote:
> > >> >
> > >> > hci_cmd_work overwrites the hdev->sent_cmd which contains the required
> > >> > info for a hci_request to work. In the real world, it's observed that
> > >> > a request from hci_le_ext_create_conn_sync could be interrupted by
> > >> > the authentication (hci_conn_auth) caused by rfcomm_sock_connect. When
> > >> > it happends, hci_le_ext_create_conn_sync hangs until timeout; If the
> > >> > LE connection is triggered by MGMT, it freezes the whole MGMT interface.
> > >> >
> > >> > Signed-off-by: Hsin-chen Chuang <chharry@chromium.org>
> > >> > ---
> > >> >
> > >> >  net/bluetooth/hci_core.c | 7 +++++--
> > >> >  1 file changed, 5 insertions(+), 2 deletions(-)
> > >> >
> > >> > diff --git a/net/bluetooth/hci_core.c b/net/bluetooth/hci_core.c
> > >> > index 34c8dca2069f..e3706889976d 100644
> > >> > --- a/net/bluetooth/hci_core.c
> > >> > +++ b/net/bluetooth/hci_core.c
> > >> > @@ -4213,8 +4213,11 @@ static void hci_cmd_work(struct work_struct *work)
> > >> >         BT_DBG("%s cmd_cnt %d cmd queued %d", hdev->name,
> > >> >                atomic_read(&hdev->cmd_cnt), skb_queue_len(&hdev->cmd_q));
> > >> >
> > >> > -       /* Send queued commands */
> > >> > -       if (atomic_read(&hdev->cmd_cnt)) {
> > >> > +       /* Send queued commands. Don't send the command when there is a pending
> > >> > +        * hci_request because the request callbacks would be overwritten.
> > >> > +        */
> > >> > +       if (atomic_read(&hdev->cmd_cnt) &&
> > >> > +           !hci_dev_test_flag(hdev, HCI_CMD_PENDING)) {
> > >> >                 skb = skb_dequeue(&hdev->cmd_q);
> > >> >                 if (!skb)
> > >> >                         return;
> > >> > --
> > >> > 2.43.0.687.g38aa6559b0-goog
> > >>
> > >>
> > >> This seems to be causing some mgmt-tester failures:
> > >>
> > >> Pair Device - Sec Mode 3 Success 1                   Timed out   22.753 seconds
> > >> Pair Device - Sec Mode 3 Reject 1                    Timed out   22.533 seconds
> > >> Pair Device - Sec Mode 3 Reject 2                    Timed out   22.526 seconds
> > >>
> > >> I think this is because we need to respond to an event with a command like:
> > >>
> > >> < HCI Command: Create Conn.. (0x01|0x0005) plen 13  #241 [hci0] 16:25:38.699066
> > >>         Address: 00:AA:01:01:00:00 (Intel Corporation)
> > >>         Packet type: 0x0018
> > >>           DM1 may be used
> > >>           DH1 may be used
> > >>         Page scan repetition mode: R2 (0x02)
> > >>         Page scan mode: Mandatory (0x00)
> > >>         Clock offset: 0x0000
> > >>         Role switch: Allow peripheral (0x01)
> > >> > HCI Event: Command Status (0x0f) plen 4           #242 [hci0] 16:25:38.701881
> > >>       Create Connection (0x01|0x0005) ncmd 1
> > >>         Status: Success (0x00)
> > >> > HCI Event: Link Key Request (0x17) plen 6         #243 [hci0] 16:25:38.702375
> > >>         Address: 00:AA:01:01:00:00 (Intel Corporation)
> > >>
> > >> But because Create Connection is pending we cannot respond to Link Key
> > >> Request, so it is actually a design problem if we cannot send commands
> > >> because something is pending so perhaps we need to redesign how we
> > >> store cmd_sent so we can have multiple outstanding commands rather
> > >> than just one.
> > >>
> > >> --
> > >> Luiz Augusto von Dentz
> > >
> > >
> > >
> > > --
> > > Best Regards,
> > > Hsin-chen
> >
> >
> >
> > --
> > Luiz Augusto von Dentz
diff mbox series

Patch

diff --git a/net/bluetooth/hci_core.c b/net/bluetooth/hci_core.c
index 34c8dca2069f..e3706889976d 100644
--- a/net/bluetooth/hci_core.c
+++ b/net/bluetooth/hci_core.c
@@ -4213,8 +4213,11 @@  static void hci_cmd_work(struct work_struct *work)
 	BT_DBG("%s cmd_cnt %d cmd queued %d", hdev->name,
 	       atomic_read(&hdev->cmd_cnt), skb_queue_len(&hdev->cmd_q));
 
-	/* Send queued commands */
-	if (atomic_read(&hdev->cmd_cnt)) {
+	/* Send queued commands. Don't send the command when there is a pending
+	 * hci_request because the request callbacks would be overwritten.
+	 */
+	if (atomic_read(&hdev->cmd_cnt) &&
+	    !hci_dev_test_flag(hdev, HCI_CMD_PENDING)) {
 		skb = skb_dequeue(&hdev->cmd_q);
 		if (!skb)
 			return;