diff mbox series

[v2,ath-next,2/2] wifi: ath11k: fix HTC rx insufficient length

Message ID 20250310010217.3845141-3-quic_miaoqing@quicinc.com (mailing list archive)
State New
Delegated to: Jeff Johnson
Headers show
Series wifi: ath11k: fix HTC rx insufficient length | expand

Checks

Context Check Description
wifibot/fixes_present success Fixes tag not required for -next series
wifibot/series_format success Posting correctly formatted
wifibot/tree_selection success Clearly marked for ath-next
wifibot/ynl success Generated files up to date; no warnings/errors; no diff in generated;
wifibot/build_32bit success Errors and warnings before: 0 this patch: 0
wifibot/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
wifibot/build_clang success Errors and warnings before: 0 this patch: 0
wifibot/build_clang_rust success No Rust files in patch. Skipping build
wifibot/build_tools success No tools touched, skip
wifibot/check_selftest success No net selftest shell script
wifibot/checkpatch warning WARNING: line length of 86 exceeds 80 columns
wifibot/deprecated_api success None detected
wifibot/header_inline success No static functions without inline keyword in header files
wifibot/kdoc success Errors and warnings before: 0 this patch: 0
wifibot/source_inline success Was 0 now: 0
wifibot/verify_fixes success No Fixes tag
wifibot/verify_signedoff success Signed-off-by tag matches author and committer

Commit Message

Miaoqing Pan March 10, 2025, 1:02 a.m. UTC
A relatively unusual race condition occurs between host software
and hardware, where the host sees the updated destination ring head
pointer before the hardware updates the corresponding descriptor.
When this situation occurs, the length of the descriptor returns 0.

The current error handling method is to increment descriptor tail
pointer by 1, but 'sw_index' is not updated, causing descriptor and
skb to not correspond one-to-one, resulting in the following error:

ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1488, expected 1492
ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1460, expected 1484

To address this problem, temporarily skip processing the current
descriptor and handle it again next time. However, to prevent this
descriptor from continuously returning 0, use skb cb to set a flag.
If the length returns 0 again, this descriptor will be discarded.

Tested-on: QCA6698AQ hw2.1 PCI WLAN.HSP.1.1-04546-QCAHSPSWPL_V1_V2_SILICONZ_IOE-1

Reported-by: Johan Hovold <johan+linaro@kernel.org>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218623
Signed-off-by: Miaoqing Pan <quic_miaoqing@quicinc.com>
---
 drivers/net/wireless/ath/ath11k/ce.c   | 32 ++++++++++++++++++++------
 drivers/net/wireless/ath/ath11k/core.h |  1 +
 2 files changed, 26 insertions(+), 7 deletions(-)

Comments

Johan Hovold March 10, 2025, 10:09 a.m. UTC | #1
On Mon, Mar 10, 2025 at 09:02:17AM +0800, Miaoqing Pan wrote:
> A relatively unusual race condition occurs between host software
> and hardware, where the host sees the updated destination ring head
> pointer before the hardware updates the corresponding descriptor.
> When this situation occurs, the length of the descriptor returns 0.

I still think this description is too vague and it doesn't explain how
this race is even possible. It sounds like there's a bug somewhere in
the driver or firmware, but if this really is an indication the hardware
is broken as your reply here seems to suggest:

	https://lore.kernel.org/lkml/bc187777-588c-4fa0-ba8c-847e91c78d43@quicinc.com/

then that too should be highlighted in the commit message (e.g. by
describing this as "working around broken hardware").

> The current error handling method is to increment descriptor tail
> pointer by 1, but 'sw_index' is not updated, causing descriptor and
> skb to not correspond one-to-one, resulting in the following error:
> 
> ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1488, expected 1492
> ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1460, expected 1484
> 
> To address this problem, temporarily skip processing the current
> descriptor and handle it again next time. However, to prevent this
> descriptor from continuously returning 0, use skb cb to set a flag.
> If the length returns 0 again, this descriptor will be discarded.

The ath12k ring-buffer handling looks very similar. Do you need a
corresponding workaround in ath12k_ce_completed_recv_next()? Or are you
sure that this (hardware) bug only affects ath11k devices?
 
>  	*nbytes = ath11k_hal_ce_dst_status_get_length(desc);
> -	if (*nbytes == 0) {
> -		ret = -EIO;
> -		goto err;
> +	if (unlikely(*nbytes == 0)) {
> +		struct ath11k_skb_rxcb *rxcb =
> +			ATH11K_SKB_RXCB(pipe->dest_ring->skb[sw_index]);
> +
> +		/* A relatively unusual race condition occurs between host
> +		 * software and hardware, where the host sees the updated
> +		 * destination ring head pointer before the hardware updates
> +		 * the corresponding descriptor.
> +		 *
> +		 * Temporarily skip processing the current descriptor and handle
> +		 * it again next time. However, to prevent this descriptor from
> +		 * continuously returning 0, set 'is_desc_len0' flag. If the
> +		 * length returns 0 again, this descriptor will be discarded.
> +		 */
> +		if (!rxcb->is_desc_len0) {
> +			rxcb->is_desc_len0 = true;
> +			ret = -EIO;
> +			goto err;
> +		}
>  	}

I'm still waiting for feedback from one user that can reproduce the
ring-buffer corruption very easily, but another user mentioned seeing
multiple zero-length descriptor warnings over the weekend when running
with this patch:

	ath11k_pci 0006:01:00.0: rxed invalid length (nbytes 0, max 2048)

Are there ever any valid reasons for seeing a zero-length descriptor
(i.e. unrelated to the race at hand)? IIUC the warning would only be
printed when processing such descriptors a second time (i.e. when
is_desc_len0 is set).

Johan
Miaoqing Pan March 11, 2025, 8:29 a.m. UTC | #2
On 3/10/2025 6:09 PM, Johan Hovold wrote:
> On Mon, Mar 10, 2025 at 09:02:17AM +0800, Miaoqing Pan wrote:
>> A relatively unusual race condition occurs between host software
>> and hardware, where the host sees the updated destination ring head
>> pointer before the hardware updates the corresponding descriptor.
>> When this situation occurs, the length of the descriptor returns 0.
> 
> I still think this description is too vague and it doesn't explain how
> this race is even possible. It sounds like there's a bug somewhere in
> the driver or firmware, but if this really is an indication the hardware
> is broken as your reply here seems to suggest:
> 
> 	https://lore.kernel.org/lkml/bc187777-588c-4fa0-ba8c-847e91c78d43@quicinc.com/
> 
> then that too should be highlighted in the commit message (e.g. by
> describing this as "working around broken hardware").
> 
Ok, will do.

>> The current error handling method is to increment descriptor tail
>> pointer by 1, but 'sw_index' is not updated, causing descriptor and
>> skb to not correspond one-to-one, resulting in the following error:
>>
>> ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1488, expected 1492
>> ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1460, expected 1484
>>
>> To address this problem, temporarily skip processing the current
>> descriptor and handle it again next time. However, to prevent this
>> descriptor from continuously returning 0, use skb cb to set a flag.
>> If the length returns 0 again, this descriptor will be discarded.
> 
> The ath12k ring-buffer handling looks very similar. Do you need a
> corresponding workaround in ath12k_ce_completed_recv_next()? Or are you
> sure that this (hardware) bug only affects ath11k devices?
>   
Yes, will submit a similar patch.

>>   	*nbytes = ath11k_hal_ce_dst_status_get_length(desc);
>> -	if (*nbytes == 0) {
>> -		ret = -EIO;
>> -		goto err;
>> +	if (unlikely(*nbytes == 0)) {
>> +		struct ath11k_skb_rxcb *rxcb =
>> +			ATH11K_SKB_RXCB(pipe->dest_ring->skb[sw_index]);
>> +
>> +		/* A relatively unusual race condition occurs between host
>> +		 * software and hardware, where the host sees the updated
>> +		 * destination ring head pointer before the hardware updates
>> +		 * the corresponding descriptor.
>> +		 *
>> +		 * Temporarily skip processing the current descriptor and handle
>> +		 * it again next time. However, to prevent this descriptor from
>> +		 * continuously returning 0, set 'is_desc_len0' flag. If the
>> +		 * length returns 0 again, this descriptor will be discarded.
>> +		 */
>> +		if (!rxcb->is_desc_len0) {
>> +			rxcb->is_desc_len0 = true;
>> +			ret = -EIO;
>> +			goto err;
>> +		}
>>   	}
> 
> I'm still waiting for feedback from one user that can reproduce the
> ring-buffer corruption very easily, but another user mentioned seeing
> multiple zero-length descriptor warnings over the weekend when running
> with this patch:
> 
> 	ath11k_pci 0006:01:00.0: rxed invalid length (nbytes 0, max 2048)
> 
> Are there ever any valid reasons for seeing a zero-length descriptor
> (i.e. unrelated to the race at hand)? IIUC the warning would only be
> printed when processing such descriptors a second time (i.e. when
> is_desc_len0 is set).
> 

That's exactly the logic, only can see the warning in a second time. For 
the first time, ath12k_ce_completed_recv_next() returns '-EIO'.
Jeff Johnson March 11, 2025, 3:20 p.m. UTC | #3
On 3/11/2025 1:29 AM, Miaoqing Pan wrote:
> On 3/10/2025 6:09 PM, Johan Hovold wrote:
>> I'm still waiting for feedback from one user that can reproduce the
>> ring-buffer corruption very easily, but another user mentioned seeing
>> multiple zero-length descriptor warnings over the weekend when running
>> with this patch:
>>
>> 	ath11k_pci 0006:01:00.0: rxed invalid length (nbytes 0, max 2048)
>>
>> Are there ever any valid reasons for seeing a zero-length descriptor
>> (i.e. unrelated to the race at hand)? IIUC the warning would only be
>> printed when processing such descriptors a second time (i.e. when
>> is_desc_len0 is set).
>>
> 
> That's exactly the logic, only can see the warning in a second time. For 
> the first time, ath12k_ce_completed_recv_next() returns '-EIO'.

That didn't answer Johan's first question:
Are there ever any valid reasons for seeing a zero-length descriptor?

We have an issue that there is a race condition where hardware updates the
pointer before it has filled all the data. The current solution is just to
read the data a second time. But if that second read also occurs before
hardware has updated the data, then the issue isn't fixed.

So should there be some forced delay before we read a second time?
Or should we attempt to read more times?
Miaoqing Pan March 12, 2025, 1:11 a.m. UTC | #4
On 3/11/2025 11:20 PM, Jeff Johnson wrote:
> On 3/11/2025 1:29 AM, Miaoqing Pan wrote:
>> On 3/10/2025 6:09 PM, Johan Hovold wrote:
>>> I'm still waiting for feedback from one user that can reproduce the
>>> ring-buffer corruption very easily, but another user mentioned seeing
>>> multiple zero-length descriptor warnings over the weekend when running
>>> with this patch:
>>>
>>> 	ath11k_pci 0006:01:00.0: rxed invalid length (nbytes 0, max 2048)
>>>
>>> Are there ever any valid reasons for seeing a zero-length descriptor
>>> (i.e. unrelated to the race at hand)? IIUC the warning would only be
>>> printed when processing such descriptors a second time (i.e. when
>>> is_desc_len0 is set).
>>>
>>
>> That's exactly the logic, only can see the warning in a second time. For
>> the first time, ath12k_ce_completed_recv_next() returns '-EIO'.
> 
> That didn't answer Johan's first question:
> Are there ever any valid reasons for seeing a zero-length descriptor?
> 
The events currently observed are all firmware logs. The discarded 
packets will not affect normal operation. I will adjust the logging to 
debug level.

> We have an issue that there is a race condition where hardware updates the
> pointer before it has filled all the data. The current solution is just to
> read the data a second time. But if that second read also occurs before
> hardware has updated the data, then the issue isn't fixed.
> 
Thanks for the addition.

> So should there be some forced delay before we read a second time?
> Or should we attempt to read more times?
> 

The initial fix was to keep waiting for the data to be ready. The 
observed phenomenon is that when the second read fails, subsequent reads 
will continue to fail until the firmware's CE2 ring is full and triggers 
an assert after timeout. However, this situation is relatively rare, and 
in most cases, the second read will succeed. Therefore, adding a delay 
or multiple read attempts is not useful.
diff mbox series

Patch

diff --git a/drivers/net/wireless/ath/ath11k/ce.c b/drivers/net/wireless/ath/ath11k/ce.c
index e66e86bdec20..2573f8c7a994 100644
--- a/drivers/net/wireless/ath/ath11k/ce.c
+++ b/drivers/net/wireless/ath/ath11k/ce.c
@@ -1,7 +1,7 @@ 
 // SPDX-License-Identifier: BSD-3-Clause-Clear
 /*
  * Copyright (c) 2018-2019 The Linux Foundation. All rights reserved.
- * Copyright (c) 2021-2023 Qualcomm Innovation Center, Inc. All rights reserved.
+ * Copyright (c) 2021-2023, 2025 Qualcomm Innovation Center, Inc. All rights reserved.
  */
 
 #include "dp_rx.h"
@@ -387,18 +387,36 @@  static int ath11k_ce_completed_recv_next(struct ath11k_ce_pipe *pipe,
 
 	ath11k_hal_srng_access_begin(ab, srng);
 
-	desc = ath11k_hal_srng_dst_get_next_entry(ab, srng);
+	desc = ath11k_hal_srng_dst_peek(ab, srng);
 	if (!desc) {
 		ret = -EIO;
 		goto err;
 	}
 
 	*nbytes = ath11k_hal_ce_dst_status_get_length(desc);
-	if (*nbytes == 0) {
-		ret = -EIO;
-		goto err;
+	if (unlikely(*nbytes == 0)) {
+		struct ath11k_skb_rxcb *rxcb =
+			ATH11K_SKB_RXCB(pipe->dest_ring->skb[sw_index]);
+
+		/* A relatively unusual race condition occurs between host
+		 * software and hardware, where the host sees the updated
+		 * destination ring head pointer before the hardware updates
+		 * the corresponding descriptor.
+		 *
+		 * Temporarily skip processing the current descriptor and handle
+		 * it again next time. However, to prevent this descriptor from
+		 * continuously returning 0, set 'is_desc_len0' flag. If the
+		 * length returns 0 again, this descriptor will be discarded.
+		 */
+		if (!rxcb->is_desc_len0) {
+			rxcb->is_desc_len0 = true;
+			ret = -EIO;
+			goto err;
+		}
 	}
 
+	ath11k_hal_srng_dst_next(ab, srng);
+
 	*skb = pipe->dest_ring->skb[sw_index];
 	pipe->dest_ring->skb[sw_index] = NULL;
 
@@ -430,8 +448,8 @@  static void ath11k_ce_recv_process_cb(struct ath11k_ce_pipe *pipe)
 		dma_unmap_single(ab->dev, ATH11K_SKB_RXCB(skb)->paddr,
 				 max_nbytes, DMA_FROM_DEVICE);
 
-		if (unlikely(max_nbytes < nbytes)) {
-			ath11k_warn(ab, "rxed more than expected (nbytes %d, max %d)",
+		if (unlikely(max_nbytes < nbytes || !nbytes)) {
+			ath11k_warn(ab, "rxed invalid length (nbytes %d, max %d)",
 				    nbytes, max_nbytes);
 			dev_kfree_skb_any(skb);
 			continue;
diff --git a/drivers/net/wireless/ath/ath11k/core.h b/drivers/net/wireless/ath/ath11k/core.h
index 1a3d0de4afde..c8614c5c6493 100644
--- a/drivers/net/wireless/ath/ath11k/core.h
+++ b/drivers/net/wireless/ath/ath11k/core.h
@@ -128,6 +128,7 @@  struct ath11k_skb_rxcb {
 	bool is_continuation;
 	bool is_mcbc;
 	bool is_eapol;
+	bool is_desc_len0;
 	struct hal_rx_desc *rx_desc;
 	u8 err_rel_src;
 	u8 err_code;