Message ID | 20250310010217.3845141-3-quic_miaoqing@quicinc.com (mailing list archive) |
---|---|
State | New |
Delegated to: | Jeff Johnson |
Headers | show |
Series | wifi: ath11k: fix HTC rx insufficient length | expand |
On Mon, Mar 10, 2025 at 09:02:17AM +0800, Miaoqing Pan wrote: > A relatively unusual race condition occurs between host software > and hardware, where the host sees the updated destination ring head > pointer before the hardware updates the corresponding descriptor. > When this situation occurs, the length of the descriptor returns 0. I still think this description is too vague and it doesn't explain how this race is even possible. It sounds like there's a bug somewhere in the driver or firmware, but if this really is an indication the hardware is broken as your reply here seems to suggest: https://lore.kernel.org/lkml/bc187777-588c-4fa0-ba8c-847e91c78d43@quicinc.com/ then that too should be highlighted in the commit message (e.g. by describing this as "working around broken hardware"). > The current error handling method is to increment descriptor tail > pointer by 1, but 'sw_index' is not updated, causing descriptor and > skb to not correspond one-to-one, resulting in the following error: > > ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1488, expected 1492 > ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1460, expected 1484 > > To address this problem, temporarily skip processing the current > descriptor and handle it again next time. However, to prevent this > descriptor from continuously returning 0, use skb cb to set a flag. > If the length returns 0 again, this descriptor will be discarded. The ath12k ring-buffer handling looks very similar. Do you need a corresponding workaround in ath12k_ce_completed_recv_next()? Or are you sure that this (hardware) bug only affects ath11k devices? > *nbytes = ath11k_hal_ce_dst_status_get_length(desc); > - if (*nbytes == 0) { > - ret = -EIO; > - goto err; > + if (unlikely(*nbytes == 0)) { > + struct ath11k_skb_rxcb *rxcb = > + ATH11K_SKB_RXCB(pipe->dest_ring->skb[sw_index]); > + > + /* A relatively unusual race condition occurs between host > + * software and hardware, where the host sees the updated > + * destination ring head pointer before the hardware updates > + * the corresponding descriptor. > + * > + * Temporarily skip processing the current descriptor and handle > + * it again next time. However, to prevent this descriptor from > + * continuously returning 0, set 'is_desc_len0' flag. If the > + * length returns 0 again, this descriptor will be discarded. > + */ > + if (!rxcb->is_desc_len0) { > + rxcb->is_desc_len0 = true; > + ret = -EIO; > + goto err; > + } > } I'm still waiting for feedback from one user that can reproduce the ring-buffer corruption very easily, but another user mentioned seeing multiple zero-length descriptor warnings over the weekend when running with this patch: ath11k_pci 0006:01:00.0: rxed invalid length (nbytes 0, max 2048) Are there ever any valid reasons for seeing a zero-length descriptor (i.e. unrelated to the race at hand)? IIUC the warning would only be printed when processing such descriptors a second time (i.e. when is_desc_len0 is set). Johan
On 3/10/2025 6:09 PM, Johan Hovold wrote: > On Mon, Mar 10, 2025 at 09:02:17AM +0800, Miaoqing Pan wrote: >> A relatively unusual race condition occurs between host software >> and hardware, where the host sees the updated destination ring head >> pointer before the hardware updates the corresponding descriptor. >> When this situation occurs, the length of the descriptor returns 0. > > I still think this description is too vague and it doesn't explain how > this race is even possible. It sounds like there's a bug somewhere in > the driver or firmware, but if this really is an indication the hardware > is broken as your reply here seems to suggest: > > https://lore.kernel.org/lkml/bc187777-588c-4fa0-ba8c-847e91c78d43@quicinc.com/ > > then that too should be highlighted in the commit message (e.g. by > describing this as "working around broken hardware"). > Ok, will do. >> The current error handling method is to increment descriptor tail >> pointer by 1, but 'sw_index' is not updated, causing descriptor and >> skb to not correspond one-to-one, resulting in the following error: >> >> ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1488, expected 1492 >> ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1460, expected 1484 >> >> To address this problem, temporarily skip processing the current >> descriptor and handle it again next time. However, to prevent this >> descriptor from continuously returning 0, use skb cb to set a flag. >> If the length returns 0 again, this descriptor will be discarded. > > The ath12k ring-buffer handling looks very similar. Do you need a > corresponding workaround in ath12k_ce_completed_recv_next()? Or are you > sure that this (hardware) bug only affects ath11k devices? > Yes, will submit a similar patch. >> *nbytes = ath11k_hal_ce_dst_status_get_length(desc); >> - if (*nbytes == 0) { >> - ret = -EIO; >> - goto err; >> + if (unlikely(*nbytes == 0)) { >> + struct ath11k_skb_rxcb *rxcb = >> + ATH11K_SKB_RXCB(pipe->dest_ring->skb[sw_index]); >> + >> + /* A relatively unusual race condition occurs between host >> + * software and hardware, where the host sees the updated >> + * destination ring head pointer before the hardware updates >> + * the corresponding descriptor. >> + * >> + * Temporarily skip processing the current descriptor and handle >> + * it again next time. However, to prevent this descriptor from >> + * continuously returning 0, set 'is_desc_len0' flag. If the >> + * length returns 0 again, this descriptor will be discarded. >> + */ >> + if (!rxcb->is_desc_len0) { >> + rxcb->is_desc_len0 = true; >> + ret = -EIO; >> + goto err; >> + } >> } > > I'm still waiting for feedback from one user that can reproduce the > ring-buffer corruption very easily, but another user mentioned seeing > multiple zero-length descriptor warnings over the weekend when running > with this patch: > > ath11k_pci 0006:01:00.0: rxed invalid length (nbytes 0, max 2048) > > Are there ever any valid reasons for seeing a zero-length descriptor > (i.e. unrelated to the race at hand)? IIUC the warning would only be > printed when processing such descriptors a second time (i.e. when > is_desc_len0 is set). > That's exactly the logic, only can see the warning in a second time. For the first time, ath12k_ce_completed_recv_next() returns '-EIO'.
On 3/11/2025 1:29 AM, Miaoqing Pan wrote: > On 3/10/2025 6:09 PM, Johan Hovold wrote: >> I'm still waiting for feedback from one user that can reproduce the >> ring-buffer corruption very easily, but another user mentioned seeing >> multiple zero-length descriptor warnings over the weekend when running >> with this patch: >> >> ath11k_pci 0006:01:00.0: rxed invalid length (nbytes 0, max 2048) >> >> Are there ever any valid reasons for seeing a zero-length descriptor >> (i.e. unrelated to the race at hand)? IIUC the warning would only be >> printed when processing such descriptors a second time (i.e. when >> is_desc_len0 is set). >> > > That's exactly the logic, only can see the warning in a second time. For > the first time, ath12k_ce_completed_recv_next() returns '-EIO'. That didn't answer Johan's first question: Are there ever any valid reasons for seeing a zero-length descriptor? We have an issue that there is a race condition where hardware updates the pointer before it has filled all the data. The current solution is just to read the data a second time. But if that second read also occurs before hardware has updated the data, then the issue isn't fixed. So should there be some forced delay before we read a second time? Or should we attempt to read more times?
On 3/11/2025 11:20 PM, Jeff Johnson wrote: > On 3/11/2025 1:29 AM, Miaoqing Pan wrote: >> On 3/10/2025 6:09 PM, Johan Hovold wrote: >>> I'm still waiting for feedback from one user that can reproduce the >>> ring-buffer corruption very easily, but another user mentioned seeing >>> multiple zero-length descriptor warnings over the weekend when running >>> with this patch: >>> >>> ath11k_pci 0006:01:00.0: rxed invalid length (nbytes 0, max 2048) >>> >>> Are there ever any valid reasons for seeing a zero-length descriptor >>> (i.e. unrelated to the race at hand)? IIUC the warning would only be >>> printed when processing such descriptors a second time (i.e. when >>> is_desc_len0 is set). >>> >> >> That's exactly the logic, only can see the warning in a second time. For >> the first time, ath12k_ce_completed_recv_next() returns '-EIO'. > > That didn't answer Johan's first question: > Are there ever any valid reasons for seeing a zero-length descriptor? > The events currently observed are all firmware logs. The discarded packets will not affect normal operation. I will adjust the logging to debug level. > We have an issue that there is a race condition where hardware updates the > pointer before it has filled all the data. The current solution is just to > read the data a second time. But if that second read also occurs before > hardware has updated the data, then the issue isn't fixed. > Thanks for the addition. > So should there be some forced delay before we read a second time? > Or should we attempt to read more times? > The initial fix was to keep waiting for the data to be ready. The observed phenomenon is that when the second read fails, subsequent reads will continue to fail until the firmware's CE2 ring is full and triggers an assert after timeout. However, this situation is relatively rare, and in most cases, the second read will succeed. Therefore, adding a delay or multiple read attempts is not useful.
diff --git a/drivers/net/wireless/ath/ath11k/ce.c b/drivers/net/wireless/ath/ath11k/ce.c index e66e86bdec20..2573f8c7a994 100644 --- a/drivers/net/wireless/ath/ath11k/ce.c +++ b/drivers/net/wireless/ath/ath11k/ce.c @@ -1,7 +1,7 @@ // SPDX-License-Identifier: BSD-3-Clause-Clear /* * Copyright (c) 2018-2019 The Linux Foundation. All rights reserved. - * Copyright (c) 2021-2023 Qualcomm Innovation Center, Inc. All rights reserved. + * Copyright (c) 2021-2023, 2025 Qualcomm Innovation Center, Inc. All rights reserved. */ #include "dp_rx.h" @@ -387,18 +387,36 @@ static int ath11k_ce_completed_recv_next(struct ath11k_ce_pipe *pipe, ath11k_hal_srng_access_begin(ab, srng); - desc = ath11k_hal_srng_dst_get_next_entry(ab, srng); + desc = ath11k_hal_srng_dst_peek(ab, srng); if (!desc) { ret = -EIO; goto err; } *nbytes = ath11k_hal_ce_dst_status_get_length(desc); - if (*nbytes == 0) { - ret = -EIO; - goto err; + if (unlikely(*nbytes == 0)) { + struct ath11k_skb_rxcb *rxcb = + ATH11K_SKB_RXCB(pipe->dest_ring->skb[sw_index]); + + /* A relatively unusual race condition occurs between host + * software and hardware, where the host sees the updated + * destination ring head pointer before the hardware updates + * the corresponding descriptor. + * + * Temporarily skip processing the current descriptor and handle + * it again next time. However, to prevent this descriptor from + * continuously returning 0, set 'is_desc_len0' flag. If the + * length returns 0 again, this descriptor will be discarded. + */ + if (!rxcb->is_desc_len0) { + rxcb->is_desc_len0 = true; + ret = -EIO; + goto err; + } } + ath11k_hal_srng_dst_next(ab, srng); + *skb = pipe->dest_ring->skb[sw_index]; pipe->dest_ring->skb[sw_index] = NULL; @@ -430,8 +448,8 @@ static void ath11k_ce_recv_process_cb(struct ath11k_ce_pipe *pipe) dma_unmap_single(ab->dev, ATH11K_SKB_RXCB(skb)->paddr, max_nbytes, DMA_FROM_DEVICE); - if (unlikely(max_nbytes < nbytes)) { - ath11k_warn(ab, "rxed more than expected (nbytes %d, max %d)", + if (unlikely(max_nbytes < nbytes || !nbytes)) { + ath11k_warn(ab, "rxed invalid length (nbytes %d, max %d)", nbytes, max_nbytes); dev_kfree_skb_any(skb); continue; diff --git a/drivers/net/wireless/ath/ath11k/core.h b/drivers/net/wireless/ath/ath11k/core.h index 1a3d0de4afde..c8614c5c6493 100644 --- a/drivers/net/wireless/ath/ath11k/core.h +++ b/drivers/net/wireless/ath/ath11k/core.h @@ -128,6 +128,7 @@ struct ath11k_skb_rxcb { bool is_continuation; bool is_mcbc; bool is_eapol; + bool is_desc_len0; struct hal_rx_desc *rx_desc; u8 err_rel_src; u8 err_code;
A relatively unusual race condition occurs between host software and hardware, where the host sees the updated destination ring head pointer before the hardware updates the corresponding descriptor. When this situation occurs, the length of the descriptor returns 0. The current error handling method is to increment descriptor tail pointer by 1, but 'sw_index' is not updated, causing descriptor and skb to not correspond one-to-one, resulting in the following error: ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1488, expected 1492 ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1460, expected 1484 To address this problem, temporarily skip processing the current descriptor and handle it again next time. However, to prevent this descriptor from continuously returning 0, use skb cb to set a flag. If the length returns 0 again, this descriptor will be discarded. Tested-on: QCA6698AQ hw2.1 PCI WLAN.HSP.1.1-04546-QCAHSPSWPL_V1_V2_SILICONZ_IOE-1 Reported-by: Johan Hovold <johan+linaro@kernel.org> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218623 Signed-off-by: Miaoqing Pan <quic_miaoqing@quicinc.com> --- drivers/net/wireless/ath/ath11k/ce.c | 32 ++++++++++++++++++++------ drivers/net/wireless/ath/ath11k/core.h | 1 + 2 files changed, 26 insertions(+), 7 deletions(-)