diff mbox series

wifi: ath10k: poll service ready message before failing

Message ID 20240221031729.2707-1-quic_bqiang@quicinc.com (mailing list archive)
State Changes Requested
Delegated to: Kalle Valo
Headers show
Series wifi: ath10k: poll service ready message before failing | expand

Commit Message

Baochen Qiang Feb. 21, 2024, 3:17 a.m. UTC
Currently host relies on CE interrupts to get notified that
the service ready message is ready. This results in timeout
issue if the interrupt is not fired, due to some unknown
reasons. See below logs:

[76321.937866] ath10k_pci 0000:02:00.0: wmi service ready event not received
...
[76322.016738] ath10k_pci 0000:02:00.0: Could not init core: -110

And finally it causes WLAN interface bring up failure.

Change to give it one more chance here by polling CE rings,
before failing directly.

Tested-on: QCA6174 hw3.2 PCI WLAN.RM.4.4.1-00157-QCARMSWPZ-1

Fixes: 5e3dd157d7e7 ("ath10k: mac80211 driver for Qualcomm Atheros 802.11ac CQA98xx devices")
Reported-by: James Prestwood <prestwoj@gmail.com>
Link: https://lore.kernel.org/linux-wireless/304ce305-fbe6-420e-ac2a-d61ae5e6ca1a@gmail.com/
Signed-off-by: Baochen Qiang <quic_bqiang@quicinc.com>
---
 drivers/net/wireless/ath/ath10k/wmi.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)


base-commit: 707e306f3573fa321ae197d77366578e4566cff5

Comments

James Prestwood Feb. 21, 2024, 12:38 p.m. UTC | #1
Hi Baochen,

On 2/20/24 7:17 PM, Baochen Qiang wrote:
> Currently host relies on CE interrupts to get notified that
> the service ready message is ready. This results in timeout
> issue if the interrupt is not fired, due to some unknown
> reasons. See below logs:
>
> [76321.937866] ath10k_pci 0000:02:00.0: wmi service ready event not received
> ...
> [76322.016738] ath10k_pci 0000:02:00.0: Could not init core: -110
>
> And finally it causes WLAN interface bring up failure.
>
> Change to give it one more chance here by polling CE rings,
> before failing directly.
>
> Tested-on: QCA6174 hw3.2 PCI WLAN.RM.4.4.1-00157-QCARMSWPZ-1
>
> Fixes: 5e3dd157d7e7 ("ath10k: mac80211 driver for Qualcomm Atheros 802.11ac CQA98xx devices")
> Reported-by: James Prestwood <prestwoj@gmail.com>
> Link: https://lore.kernel.org/linux-wireless/304ce305-fbe6-420e-ac2a-d61ae5e6ca1a@gmail.com/
> Signed-off-by: Baochen Qiang <quic_bqiang@quicinc.com>
> ---
>   drivers/net/wireless/ath/ath10k/wmi.c | 22 +++++++++++++++++++---
>   1 file changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/wireless/ath/ath10k/wmi.c b/drivers/net/wireless/ath/ath10k/wmi.c
> index ddf15717d504..bf6cb2c73128 100644
> --- a/drivers/net/wireless/ath/ath10k/wmi.c
> +++ b/drivers/net/wireless/ath/ath10k/wmi.c
> @@ -1763,12 +1763,28 @@ void ath10k_wmi_put_wmi_channel(struct ath10k *ar, struct wmi_channel *ch,
>   
>   int ath10k_wmi_wait_for_service_ready(struct ath10k *ar)
>   {
> -	unsigned long time_left;
> +	unsigned long time_left, i;
>   
>   	time_left = wait_for_completion_timeout(&ar->wmi.service_ready,
>   						WMI_SERVICE_READY_TIMEOUT_HZ);
> -	if (!time_left)
> -		return -ETIMEDOUT;
> +	if (!time_left) {
> +		/* Sometimes the PCI HIF doesn't receive interrupt
> +		 * for the service ready message even if the buffer
> +		 * was completed. PCIe sniffer shows that it's
> +		 * because the corresponding CE ring doesn't fires
> +		 * it. Workaround here by polling CE rings once.
> +		 */
> +		ath10k_warn(ar, "failed to receive service ready completion, polling..\n");
> +
> +		for (i = 0; i < CE_COUNT; i++)
> +			ath10k_hif_send_complete_check(ar, i, 1);
> +
> +		time_left = wait_for_completion_timeout(&ar->wmi.service_ready,
> +							WMI_SERVICE_READY_TIMEOUT_HZ);
> +		if (!time_left)
> +			return -ETIMEDOUT;
> +	}
> +
>   	return 0;
>   }
>   
>
> base-commit: 707e306f3573fa321ae197d77366578e4566cff5

Thank you for looking at this I will test this and see if it resolves 
the problem we're seeing but since its somewhat rare it may take me a 
bit to validate.

Is this any different than just trying to bring up the interface again 
from userspace? I could be wrong, but my concern with this is that when 
I retried in userspace things got into a very odd state:

  - IWD starts

  - ifdown interface

  - ifup interface, timeout -110

  - Retry ifup, success

  - Authenticate/associate succeed

  - 4-way handshake fails because the device never received the 1/4 frame.

IWD would then retry indefinitely with auth/assoc succeeding but never 
receiving any 4-way handshake frames. The only way to get things working 
again was reloading the ath10k driver/reboot. Maybe this patch is 
different because its waiting for the initial request and no issuing a 
second one? Just wanted to point that out in case it sheds any light.

Thanks,

James
Baochen Qiang Feb. 22, 2024, 2:18 a.m. UTC | #2
On 2/21/2024 8:38 PM, James Prestwood wrote:
> Hi Baochen,
> 
> On 2/20/24 7:17 PM, Baochen Qiang wrote:
>> Currently host relies on CE interrupts to get notified that
>> the service ready message is ready. This results in timeout
>> issue if the interrupt is not fired, due to some unknown
>> reasons. See below logs:
>>
>> [76321.937866] ath10k_pci 0000:02:00.0: wmi service ready event not 
>> received
>> ...
>> [76322.016738] ath10k_pci 0000:02:00.0: Could not init core: -110
>>
>> And finally it causes WLAN interface bring up failure.
>>
>> Change to give it one more chance here by polling CE rings,
>> before failing directly.
>>
>> Tested-on: QCA6174 hw3.2 PCI WLAN.RM.4.4.1-00157-QCARMSWPZ-1
>>
>> Fixes: 5e3dd157d7e7 ("ath10k: mac80211 driver for Qualcomm Atheros 
>> 802.11ac CQA98xx devices")
>> Reported-by: James Prestwood <prestwoj@gmail.com>
>> Link: 
>> https://lore.kernel.org/linux-wireless/304ce305-fbe6-420e-ac2a-d61ae5e6ca1a@gmail.com/
>> Signed-off-by: Baochen Qiang <quic_bqiang@quicinc.com>
>> ---
>>   drivers/net/wireless/ath/ath10k/wmi.c | 22 +++++++++++++++++++---
>>   1 file changed, 19 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/net/wireless/ath/ath10k/wmi.c 
>> b/drivers/net/wireless/ath/ath10k/wmi.c
>> index ddf15717d504..bf6cb2c73128 100644
>> --- a/drivers/net/wireless/ath/ath10k/wmi.c
>> +++ b/drivers/net/wireless/ath/ath10k/wmi.c
>> @@ -1763,12 +1763,28 @@ void ath10k_wmi_put_wmi_channel(struct ath10k 
>> *ar, struct wmi_channel *ch,
>>   int ath10k_wmi_wait_for_service_ready(struct ath10k *ar)
>>   {
>> -    unsigned long time_left;
>> +    unsigned long time_left, i;
>>       time_left = wait_for_completion_timeout(&ar->wmi.service_ready,
>>                           WMI_SERVICE_READY_TIMEOUT_HZ);
>> -    if (!time_left)
>> -        return -ETIMEDOUT;
>> +    if (!time_left) {
>> +        /* Sometimes the PCI HIF doesn't receive interrupt
>> +         * for the service ready message even if the buffer
>> +         * was completed. PCIe sniffer shows that it's
>> +         * because the corresponding CE ring doesn't fires
>> +         * it. Workaround here by polling CE rings once.
>> +         */
>> +        ath10k_warn(ar, "failed to receive service ready completion, 
>> polling..\n");
>> +
>> +        for (i = 0; i < CE_COUNT; i++)
>> +            ath10k_hif_send_complete_check(ar, i, 1);
>> +
>> +        time_left = wait_for_completion_timeout(&ar->wmi.service_ready,
>> +                            WMI_SERVICE_READY_TIMEOUT_HZ);
>> +        if (!time_left)
>> +            return -ETIMEDOUT;
>> +    }
>> +
>>       return 0;
>>   }
>>
>> base-commit: 707e306f3573fa321ae197d77366578e4566cff5
> 
> Thank you for looking at this I will test this and see if it resolves 
> the problem we're seeing but since its somewhat rare it may take me a 
> bit to validate.
> 
> Is this any different than just trying to bring up the interface again 
> from userspace? I could be wrong, but my concern with this is that when 
> I retried in userspace things got into a very odd state:
> 
>   - IWD starts
> 
>   - ifdown interface
> 
>   - ifup interface, timeout -110
> 
>   - Retry ifup, success
> 
>   - Authenticate/associate succeed
> 
>   - 4-way handshake fails because the device never received the 1/4 frame.
> 
Don't get time to look into this case, but I suppose there might be some 
issues in error handling when interface up fails, kind of incorrect irq 
enable/disable or something else impacting data path, so no data frame 
received even after a second interface up retry succeeds,

Anyway please test this patch, which is supposed to be the right fix to 
this issue.

> IWD would then retry indefinitely with auth/assoc succeeding but never 
> receiving any 4-way handshake frames. The only way to get things working 
> again was reloading the ath10k driver/reboot. Maybe this patch is 
> different because its waiting for the initial request and no issuing a 
> second one? Just wanted to point that out in case it sheds any light.
> 
> Thanks,
> 
> James
>
Kalle Valo Feb. 26, 2024, 10:45 a.m. UTC | #3
Baochen Qiang <quic_bqiang@quicinc.com> writes:

> Currently host relies on CE interrupts to get notified that
> the service ready message is ready. This results in timeout
> issue if the interrupt is not fired, due to some unknown
> reasons. See below logs:
>
> [76321.937866] ath10k_pci 0000:02:00.0: wmi service ready event not received
> ...
> [76322.016738] ath10k_pci 0000:02:00.0: Could not init core: -110
>
> And finally it causes WLAN interface bring up failure.
>
> Change to give it one more chance here by polling CE rings,
> before failing directly.
>
> Tested-on: QCA6174 hw3.2 PCI WLAN.RM.4.4.1-00157-QCARMSWPZ-1
>
> Fixes: 5e3dd157d7e7 ("ath10k: mac80211 driver for Qualcomm Atheros
> 802.11ac CQA98xx devices")
> Reported-by: James Prestwood <prestwoj@gmail.com>
> Link:
> https://lore.kernel.org/linux-wireless/304ce305-fbe6-420e-ac2a-d61ae5e6ca1a@gmail.com/
> Signed-off-by: Baochen Qiang <quic_bqiang@quicinc.com>

[...]

> +		/* Sometimes the PCI HIF doesn't receive interrupt
> +		 * for the service ready message even if the buffer
> +		 * was completed. PCIe sniffer shows that it's
> +		 * because the corresponding CE ring doesn't fires
> +		 * it. Workaround here by polling CE rings once.
> +		 */
> +		ath10k_warn(ar, "failed to receive service ready completion, polling..\n");
> +
> +		for (i = 0; i < CE_COUNT; i++)
> +			ath10k_hif_send_complete_check(ar, i, 1);
> +
> +		time_left = wait_for_completion_timeout(&ar->wmi.service_ready,
> +							WMI_SERVICE_READY_TIMEOUT_HZ);
> +		if (!time_left)
> +			return -ETIMEDOUT;

I think it would be user friendly to also print the end result for the
polling, for example something like this:

if (!time_left) {
        ath10k_warn(ar, "polling timed out");
	return -ETIMEDOUT;
}

ath10k_warn(ar, "service ready completion received, continuing normally");
James Prestwood Feb. 28, 2024, 6:46 p.m. UTC | #4
Hi Baochen,

On 2/21/24 6:18 PM, Baochen Qiang wrote:
>
>
> On 2/21/2024 8:38 PM, James Prestwood wrote:
>> Hi Baochen,
>>
>> On 2/20/24 7:17 PM, Baochen Qiang wrote:
>>> Currently host relies on CE interrupts to get notified that
>>> the service ready message is ready. This results in timeout
>>> issue if the interrupt is not fired, due to some unknown
>>> reasons. See below logs:
>>>
>>> [76321.937866] ath10k_pci 0000:02:00.0: wmi service ready event not 
>>> received
>>> ...
>>> [76322.016738] ath10k_pci 0000:02:00.0: Could not init core: -110
>>>
>>> And finally it causes WLAN interface bring up failure.
>>>
>>> Change to give it one more chance here by polling CE rings,
>>> before failing directly.
>>>
>>> Tested-on: QCA6174 hw3.2 PCI WLAN.RM.4.4.1-00157-QCARMSWPZ-1
>>>
>>> Fixes: 5e3dd157d7e7 ("ath10k: mac80211 driver for Qualcomm Atheros 
>>> 802.11ac CQA98xx devices")
>>> Reported-by: James Prestwood <prestwoj@gmail.com>
>>> Link: 
>>> https://lore.kernel.org/linux-wireless/304ce305-fbe6-420e-ac2a-d61ae5e6ca1a@gmail.com/
>>> Signed-off-by: Baochen Qiang <quic_bqiang@quicinc.com>
>>> ---
>>>   drivers/net/wireless/ath/ath10k/wmi.c | 22 +++++++++++++++++++---
>>>   1 file changed, 19 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/net/wireless/ath/ath10k/wmi.c 
>>> b/drivers/net/wireless/ath/ath10k/wmi.c
>>> index ddf15717d504..bf6cb2c73128 100644
>>> --- a/drivers/net/wireless/ath/ath10k/wmi.c
>>> +++ b/drivers/net/wireless/ath/ath10k/wmi.c
>>> @@ -1763,12 +1763,28 @@ void ath10k_wmi_put_wmi_channel(struct 
>>> ath10k *ar, struct wmi_channel *ch,
>>>   int ath10k_wmi_wait_for_service_ready(struct ath10k *ar)
>>>   {
>>> -    unsigned long time_left;
>>> +    unsigned long time_left, i;
>>>       time_left = wait_for_completion_timeout(&ar->wmi.service_ready,
>>>                           WMI_SERVICE_READY_TIMEOUT_HZ);
>>> -    if (!time_left)
>>> -        return -ETIMEDOUT;
>>> +    if (!time_left) {
>>> +        /* Sometimes the PCI HIF doesn't receive interrupt
>>> +         * for the service ready message even if the buffer
>>> +         * was completed. PCIe sniffer shows that it's
>>> +         * because the corresponding CE ring doesn't fires
>>> +         * it. Workaround here by polling CE rings once.
>>> +         */
>>> +        ath10k_warn(ar, "failed to receive service ready 
>>> completion, polling..\n");
>>> +
>>> +        for (i = 0; i < CE_COUNT; i++)
>>> +            ath10k_hif_send_complete_check(ar, i, 1);
>>> +
>>> +        time_left = 
>>> wait_for_completion_timeout(&ar->wmi.service_ready,
>>> +                            WMI_SERVICE_READY_TIMEOUT_HZ);
>>> +        if (!time_left)
>>> +            return -ETIMEDOUT;
>>> +    }
>>> +
>>>       return 0;
>>>   }
>>>
>>> base-commit: 707e306f3573fa321ae197d77366578e4566cff5
>>
>> Thank you for looking at this I will test this and see if it resolves 
>> the problem we're seeing but since its somewhat rare it may take me a 
>> bit to validate.
>>
>> Is this any different than just trying to bring up the interface 
>> again from userspace? I could be wrong, but my concern with this is 
>> that when I retried in userspace things got into a very odd state:
>>
>>   - IWD starts
>>
>>   - ifdown interface
>>
>>   - ifup interface, timeout -110
>>
>>   - Retry ifup, success
>>
>>   - Authenticate/associate succeed
>>
>>   - 4-way handshake fails because the device never received the 1/4 
>> frame.
>>
> Don't get time to look into this case, but I suppose there might be 
> some issues in error handling when interface up fails, kind of 
> incorrect irq enable/disable or something else impacting data path, so 
> no data frame received even after a second interface up retry succeeds,
>
> Anyway please test this patch, which is supposed to be the right fix 
> to this issue.

This does appear to have fixed it! For reference this was my test:

  for i in $(seq 1 100000); do sudo ip link set wlan0 down; sudo ip link 
set wlan0 up; echo $?; done

I never saw the up command fail, and after a while I noticed one of the 
iterations took a bit longer to complete. Checked dmesg and saw:

[ 1006.017198] ath10k_pci 0000:02:00.0: failed to receive service ready 
completion, polling..
[ 1006.017295] ath10k_pci 0000:02:00.0: service ready completion 
received, continuing normally

I then started IWD and it was able to connect fine (data frames were 
being passed). I was able to trigger this 3 times relatively quickly, 
each time IWD connected afterwards. So from my end this appears fixed.

You can add tested-by me if you like:

Tested-By: James Prestwood <prestwoj@gmail.com> # on QCA6174 hw3.2


>
>> IWD would then retry indefinitely with auth/assoc succeeding but 
>> never receiving any 4-way handshake frames. The only way to get 
>> things working again was reloading the ath10k driver/reboot. Maybe 
>> this patch is different because its waiting for the initial request 
>> and no issuing a second one? Just wanted to point that out in case it 
>> sheds any light.
>>
>> Thanks,
>>
>> James
>>
Jeff Johnson Feb. 28, 2024, 9:31 p.m. UTC | #5
On 2/28/2024 10:46 AM, James Prestwood wrote:
> This does appear to have fixed it! For reference this was my test:
> 
>   for i in $(seq 1 100000); do sudo ip link set wlan0 down; sudo ip link 
> set wlan0 up; echo $?; done
> 
> I never saw the up command fail, and after a while I noticed one of the 
> iterations took a bit longer to complete. Checked dmesg and saw:
> 
> [ 1006.017198] ath10k_pci 0000:02:00.0: failed to receive service ready 
> completion, polling..
> [ 1006.017295] ath10k_pci 0000:02:00.0: service ready completion 
> received, continuing normally
> 
> I then started IWD and it was able to connect fine (data frames were 
> being passed). I was able to trigger this 3 times relatively quickly, 
> each time IWD connected afterwards. So from my end this appears fixed.
> 
> You can add tested-by me if you like:
> 
> Tested-By: James Prestwood <prestwoj@gmail.com> # on QCA6174 hw3.2

Thanks for the initial report and the testing :)

/jeff
diff mbox series

Patch

diff --git a/drivers/net/wireless/ath/ath10k/wmi.c b/drivers/net/wireless/ath/ath10k/wmi.c
index ddf15717d504..bf6cb2c73128 100644
--- a/drivers/net/wireless/ath/ath10k/wmi.c
+++ b/drivers/net/wireless/ath/ath10k/wmi.c
@@ -1763,12 +1763,28 @@  void ath10k_wmi_put_wmi_channel(struct ath10k *ar, struct wmi_channel *ch,
 
 int ath10k_wmi_wait_for_service_ready(struct ath10k *ar)
 {
-	unsigned long time_left;
+	unsigned long time_left, i;
 
 	time_left = wait_for_completion_timeout(&ar->wmi.service_ready,
 						WMI_SERVICE_READY_TIMEOUT_HZ);
-	if (!time_left)
-		return -ETIMEDOUT;
+	if (!time_left) {
+		/* Sometimes the PCI HIF doesn't receive interrupt
+		 * for the service ready message even if the buffer
+		 * was completed. PCIe sniffer shows that it's
+		 * because the corresponding CE ring doesn't fires
+		 * it. Workaround here by polling CE rings once.
+		 */
+		ath10k_warn(ar, "failed to receive service ready completion, polling..\n");
+
+		for (i = 0; i < CE_COUNT; i++)
+			ath10k_hif_send_complete_check(ar, i, 1);
+
+		time_left = wait_for_completion_timeout(&ar->wmi.service_ready,
+							WMI_SERVICE_READY_TIMEOUT_HZ);
+		if (!time_left)
+			return -ETIMEDOUT;
+	}
+
 	return 0;
 }