Message ID | 20190228155823.24749-1-greearb@candelatech.com (mailing list archive) |
---|---|
State | Not Applicable |
Delegated to: | Kalle Valo |
Headers | show |
Series | [RFC] ath10k: Fix DMA errors related to beacons (CT FW only) | expand |
On Thu, 28 Feb 2019 at 16:59, <greearb@candelatech.com> wrote: > > From: Ben Greear <greearb@candelatech.com> > > I often saw the ath10k-ct wave-1 firmware spit DMA errors and > hang the entire system, requiring a hard power-cycle to revoer. > > It appears the issue is that there is no beacon-tx callback in > stock firmware, so the driver can delete the beacon DMA buffer > while firmware is still trying to access it. > > So, wave-1 ath10k-ct firmware now sends a beacon-tx-complete > wmi message and that allows the driver to safely know when it > can clean up the buffer. > > Signed-off-by: Ben Greear <greearb@candelatech.com> > --- > > NOTE: This will not apply or work in upstream kernels since the > rest of the CT fw support will not be accepted. But, I'd appreciate > any technical feedback on this in case I missed any corner cases > on locking or similar. For the record this patch seems to be based on code with "ath10k: Free beacon buf later in vdev teardown" included. [...] > diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c > index 154dcdabc48a..02a8efa2e783 100644 > --- a/drivers/net/wireless/ath/ath10k/mac.c > +++ b/drivers/net/wireless/ath/ath10k/mac.c [...] > @@ -6147,6 +6151,16 @@ static void ath10k_remove_interface(struct ieee80211_hw *hw, > ath10k_warn(ar, "failed to stop spectral for vdev %i: %d\n", > arvif->vdev_id, ret); > > + if (test_bit(ATH10K_FW_FEATURE_BEACON_TX_CB_CT, > + ar->running_fw->fw_file.fw_features)) { > + int time_left; > + > + time_left = wait_for_completion_timeout(&arvif->beacon_tx_done, (3 * HZ)); > + if (!time_left) > + ath10k_warn(ar, "WARNING: failed to wait for beacon tx callback for vdev %i: %d\n", > + arvif->vdev_id, ret); > + } I think this can race against wmi tx credits replenishment and maybe other wmi commands if they get issued (not many of them I suppose though). The ordering would need to be something like this: cpu0 cpu1 ath10k_wmi_op_ep_tx_credits ieee80211_iterate_active_interfaces_atomic ath10k_wmi_beacon_send_ref_nowait # preempted ieee80211_do_stop clear sdata running drv_remove_interface wait_for_completion_timeout ath10k_mac_vif_beacon_cleanup free/unmap gen_beacon_dma(bcn->xxx) reinit_completion() ath10k_wmi_cmd_send_nowait() Interestingly, this also shows it has been a possible use-after-free without your patch. This is probably a unlikely scenario because there's a disproportionate amount of code between these flows for this to happen. There's also a memory leak because ath10k_mac_vif_beacon_free() exits without freeing arvif->beacon if beacon_state is SENDING, which it will be in in the above race case. With the per-skb beaconing case (which is dead code by the way it seems..) iommu wouldn't be happy either because we'd hand over an unmapped paddr to ath10k_wmi_cmd_send_nowait(). That's assuming kernel didn't crash due to use-after-free on arvif->beacon (stored in `bcn` on stack) before that. This feels like a candidate for rcu if you wanted to have it fixed neatly. It would fix both use-after-free and the locking conundrum - ath10k_remove_interface() could NULL the pointer, call_rcu() and wait_for_completion() in that call handler, and when its called back unmap/free the beacon. This would require a refactor to how beacon stuff is stored/maintained - a helper structure that would need carry the beacon stuff + rcu_head, and this structure would be allocated and assigned with rcu_assign_pointer() to a pointer in ath10k_vif. But I feel like it's asking for unbound allocations or other bugs, especially since I'm... Not sure if it makes any sense. I would probably just rip out the arvif->beacon code completely instead and keep arvif->beacon_buf only (memcpy on swba and immediatelly free the skbuff). No point in fixing dead code, is there? Since you have beacon tx completion you can consider preventing beacon corruption, similar to how disabled vsync causes frame tearing, by avoiding memcpy() if completion for previous one didn't come. Not sure how relevant that is though because if fw/hw are having hard time transmitting a beacon then the RF condition must be really harsh and unusably bad anyway. Michał
On 2/28/19 10:54 AM, Michał Kazior wrote: > On Thu, 28 Feb 2019 at 16:59, <greearb@candelatech.com> wrote: >> >> From: Ben Greear <greearb@candelatech.com> >> >> I often saw the ath10k-ct wave-1 firmware spit DMA errors and >> hang the entire system, requiring a hard power-cycle to revoer. >> >> It appears the issue is that there is no beacon-tx callback in >> stock firmware, so the driver can delete the beacon DMA buffer >> while firmware is still trying to access it. >> >> So, wave-1 ath10k-ct firmware now sends a beacon-tx-complete >> wmi message and that allows the driver to safely know when it >> can clean up the buffer. >> >> Signed-off-by: Ben Greear <greearb@candelatech.com> >> --- >> >> NOTE: This will not apply or work in upstream kernels since the >> rest of the CT fw support will not be accepted. But, I'd appreciate >> any technical feedback on this in case I missed any corner cases >> on locking or similar. > > For the record this patch seems to be based on code with "ath10k: Free > beacon buf later in vdev teardown" included. > > > [...] >> diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c >> index 154dcdabc48a..02a8efa2e783 100644 >> --- a/drivers/net/wireless/ath/ath10k/mac.c >> +++ b/drivers/net/wireless/ath/ath10k/mac.c > [...] >> @@ -6147,6 +6151,16 @@ static void ath10k_remove_interface(struct ieee80211_hw *hw, >> ath10k_warn(ar, "failed to stop spectral for vdev %i: %d\n", >> arvif->vdev_id, ret); >> >> + if (test_bit(ATH10K_FW_FEATURE_BEACON_TX_CB_CT, >> + ar->running_fw->fw_file.fw_features)) { >> + int time_left; >> + >> + time_left = wait_for_completion_timeout(&arvif->beacon_tx_done, (3 * HZ)); >> + if (!time_left) >> + ath10k_warn(ar, "WARNING: failed to wait for beacon tx callback for vdev %i: %d\n", >> + arvif->vdev_id, ret); >> + } > > I think this can race against wmi tx credits replenishment and maybe > other wmi commands if they get issued (not many of them I suppose > though). The ordering would need to be something like this: > > cpu0 cpu1 > ath10k_wmi_op_ep_tx_credits > ieee80211_iterate_active_interfaces_atomic > ath10k_wmi_beacon_send_ref_nowait > # preempted > ieee80211_do_stop > clear sdata running > drv_remove_interface > wait_for_completion_timeout > ath10k_mac_vif_beacon_cleanup > free/unmap > gen_beacon_dma(bcn->xxx) > reinit_completion() > ath10k_wmi_cmd_send_nowait() > > Interestingly, this also shows it has been a possible use-after-free > without your patch. This is probably a unlikely scenario because > there's a disproportionate amount of code between these flows for this > to happen. Thanks for the thorough review. One thing, the firmware cannot guarantee it can send the event (due to potential WMI event buffer exhaustion). It would be rare that you got unlucky enough to hit the case where FW ran out of msg buffers right as you were deleting an interface though. As far as I can tell, my patch at least does not make the use-after-free scenario worse, so I'm tempted to ignore that for now. > > There's also a memory leak because ath10k_mac_vif_beacon_free() exits > without freeing arvif->beacon if beacon_state is SENDING, which it > will be in in the above race case. Ok, this could be fixed by definitely freeing the mem in the remove_interface logic, and is not new to my patch, right? > > With the per-skb beaconing case (which is dead code by the way it > seems..) iommu wouldn't be happy either because we'd hand over an > unmapped paddr to ath10k_wmi_cmd_send_nowait(). That's assuming kernel > didn't crash due to use-after-free on arvif->beacon (stored in `bcn` > on stack) before that > > This feels like a candidate for rcu if you wanted to have it fixed > neatly. It would fix both use-after-free and the locking conundrum - > ath10k_remove_interface() could NULL the pointer, call_rcu() and > wait_for_completion() in that call handler, and when its called back > unmap/free the beacon. This would require a refactor to how beacon > stuff is stored/maintained - a helper structure that would need carry > the beacon stuff + rcu_head, and this structure would be allocated and > assigned with rcu_assign_pointer() to a pointer in ath10k_vif. But I > feel like it's asking for unbound allocations or other bugs, > especially since I'm... This callback only happens at all with CT fw, and I definitely don't want to deal with an out-of-tree rcu patch. And, you are not guaranteed to get the callback as previously mentioned. So I think rcu might be overkill for this? > > Not sure if it makes any sense. I would probably just rip out the > arvif->beacon code completely instead and keep arvif->beacon_buf only > (memcpy on swba and immediatelly free the skbuff). No point in fixing > dead code, is there? > > Since you have beacon tx completion you can consider preventing beacon > corruption, similar to how disabled vsync causes frame tearing, by > avoiding memcpy() if completion for previous one didn't come. Not sure > how relevant that is though because if fw/hw are having hard time > transmitting a beacon then the RF condition must be really harsh and > unusably bad anyway. Thanks, Ben
On Thu, 28 Feb 2019 at 22:11, Ben Greear <greearb@candelatech.com> wrote: > On 2/28/19 10:54 AM, Michał Kazior wrote: > > On Thu, 28 Feb 2019 at 16:59, <greearb@candelatech.com> wrote: > >> > >> From: Ben Greear <greearb@candelatech.com> [...] > >> diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c > >> index 154dcdabc48a..02a8efa2e783 100644 > >> --- a/drivers/net/wireless/ath/ath10k/mac.c > >> +++ b/drivers/net/wireless/ath/ath10k/mac.c > > [...] > >> @@ -6147,6 +6151,16 @@ static void ath10k_remove_interface(struct ieee80211_hw *hw, > >> ath10k_warn(ar, "failed to stop spectral for vdev %i: %d\n", > >> arvif->vdev_id, ret); > >> > >> + if (test_bit(ATH10K_FW_FEATURE_BEACON_TX_CB_CT, > >> + ar->running_fw->fw_file.fw_features)) { > >> + int time_left; > >> + > >> + time_left = wait_for_completion_timeout(&arvif->beacon_tx_done, (3 * HZ)); > >> + if (!time_left) > >> + ath10k_warn(ar, "WARNING: failed to wait for beacon tx callback for vdev %i: %d\n", > >> + arvif->vdev_id, ret); > >> + } > > > > I think this can race against wmi tx credits replenishment and maybe > > other wmi commands if they get issued (not many of them I suppose > > though). The ordering would need to be something like this: > > > > cpu0 cpu1 > > ath10k_wmi_op_ep_tx_credits > > ieee80211_iterate_active_interfaces_atomic > > ath10k_wmi_beacon_send_ref_nowait > > # preempted > > ieee80211_do_stop > > clear sdata running > > drv_remove_interface > > wait_for_completion_timeout > > ath10k_mac_vif_beacon_cleanup > > free/unmap > > gen_beacon_dma(bcn->xxx) > > reinit_completion() > > ath10k_wmi_cmd_send_nowait() > > > > Interestingly, this also shows it has been a possible use-after-free > > without your patch. This is probably a unlikely scenario because > > there's a disproportionate amount of code between these flows for this > > to happen. > > Thanks for the thorough review. One thing, the firmware > cannot guarantee it can send the event (due to potential WMI event buffer > exhaustion). It would be rare that you got unlucky enough to hit the > case where FW ran out of msg buffers right as you were deleting an > interface though. Right. > As far as I can tell, my patch at least does not make the use-after-free > scenario worse, so I'm tempted to ignore that for now. Correct. Your patch doesn't make it worse. > > There's also a memory leak because ath10k_mac_vif_beacon_free() exits > > without freeing arvif->beacon if beacon_state is SENDING, which it > > will be in in the above race case. > > Ok, this could be fixed by definitely freeing the mem in the remove_interface > logic, and is not new to my patch, right? It's not new to your patch. It's just something I've noticed in the codebase while reviewing your patch. Freeing it is going to be tricky with the current design though. > > With the per-skb beaconing case (which is dead code by the way it > > seems..) iommu wouldn't be happy either because we'd hand over an > > unmapped paddr to ath10k_wmi_cmd_send_nowait(). That's assuming kernel > > didn't crash due to use-after-free on arvif->beacon (stored in `bcn` > > on stack) before that > > > This feels like a candidate for rcu if you wanted to have it fixed > > neatly. It would fix both use-after-free and the locking conundrum - > > ath10k_remove_interface() could NULL the pointer, call_rcu() and > > wait_for_completion() in that call handler, and when its called back > > unmap/free the beacon. This would require a refactor to how beacon > > stuff is stored/maintained - a helper structure that would need carry > > the beacon stuff + rcu_head, and this structure would be allocated and > > assigned with rcu_assign_pointer() to a pointer in ath10k_vif. But I > > feel like it's asking for unbound allocations or other bugs, > > especially since I'm... > > This callback only happens at all with CT fw, and I definitely don't want to > deal with an out-of-tree rcu patch. And, you are not guaranteed to get > the callback as previously mentioned. So I think rcu might be overkill > for this? You'd still need wait_for_completion with a timeout so even if you don't get the beacon tx completion it'd still (have to) work. However *it is* an overkill because the arvif->beacon can be simply removed making the entire problem go away without adding extra logic. Michał
diff --git a/drivers/net/wireless/ath/ath10k/core.c b/drivers/net/wireless/ath/ath10k/core.c index e6664131fdee..929e0c059adc 100644 --- a/drivers/net/wireless/ath/ath10k/core.c +++ b/drivers/net/wireless/ath/ath10k/core.c @@ -623,6 +623,7 @@ static const char *const ath10k_core_fw_feature_str[] = { [ATH10K_FW_FEATURE_RETRY_GT2_CT] = "retry-gt2-CT", [ATH10K_FW_FEATURE_CT_STA] = "CT-STA", [ATH10K_FW_FEATURE_TXRATE2_CT] = "txrate2-CT", + [ATH10K_FW_FEATURE_BEACON_TX_CB_CT] = "beacon-cb-CT", }; static unsigned int ath10k_core_get_fw_feature_str(char *buf, diff --git a/drivers/net/wireless/ath/ath10k/core.h b/drivers/net/wireless/ath/ath10k/core.h index 15621a2af557..a70771170007 100644 --- a/drivers/net/wireless/ath/ath10k/core.h +++ b/drivers/net/wireless/ath/ath10k/core.h @@ -563,6 +563,7 @@ enum ath10k_beacon_state { struct ath10k_vif { struct list_head list; + struct completion beacon_tx_done; u32 vdev_id; u16 peer_id; @@ -944,6 +945,11 @@ enum ath10k_fw_features { /* TX-Rate v2 is reported. */ ATH10K_FW_FEATURE_TXRATE2_CT = 49, + /* Firmware will send a beacon-tx-callback message so driver knows when + * beacon buffer can be released. + */ + ATH10K_FW_FEATURE_BEACON_TX_CB_CT = 50, + /* keep last */ ATH10K_FW_FEATURE_COUNT, }; diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c index 154dcdabc48a..02a8efa2e783 100644 --- a/drivers/net/wireless/ath/ath10k/mac.c +++ b/drivers/net/wireless/ath/ath10k/mac.c @@ -5811,6 +5811,10 @@ static int ath10k_add_interface(struct ieee80211_hw *hw, arvif->ar = ar; arvif->vif = vif; + init_completion(&arvif->beacon_tx_done); + /* start completed since we have not sent any beacons yet */ + complete(&arvif->beacon_tx_done); + INIT_LIST_HEAD(&arvif->list); INIT_WORK(&arvif->ap_csa_work, ath10k_mac_vif_ap_csa_work); INIT_DELAYED_WORK(&arvif->connection_loss_work, @@ -6147,6 +6151,16 @@ static void ath10k_remove_interface(struct ieee80211_hw *hw, ath10k_warn(ar, "failed to stop spectral for vdev %i: %d\n", arvif->vdev_id, ret); + if (test_bit(ATH10K_FW_FEATURE_BEACON_TX_CB_CT, + ar->running_fw->fw_file.fw_features)) { + int time_left; + + time_left = wait_for_completion_timeout(&arvif->beacon_tx_done, (3 * HZ)); + if (!time_left) + ath10k_warn(ar, "WARNING: failed to wait for beacon tx callback for vdev %i: %d\n", + arvif->vdev_id, ret); + } + ar->free_vdev_map |= 1LL << arvif->vdev_id; spin_lock_bh(&ar->data_lock); list_del(&arvif->list); diff --git a/drivers/net/wireless/ath/ath10k/wmi-ops.h b/drivers/net/wireless/ath/ath10k/wmi-ops.h index 6c6f7db5b20b..dcba53575dfc 100644 --- a/drivers/net/wireless/ath/ath10k/wmi-ops.h +++ b/drivers/net/wireless/ath/ath10k/wmi-ops.h @@ -1000,25 +1000,33 @@ ath10k_wmi_peer_assoc(struct ath10k *ar, } static inline int -ath10k_wmi_beacon_send_ref_nowait(struct ath10k *ar, u32 vdev_id, +ath10k_wmi_beacon_send_ref_nowait(struct ath10k_vif *arvif, const void *bcn, size_t bcn_len, u32 bcn_paddr, bool dtim_zero, bool deliver_cab) { struct sk_buff *skb; int ret; + struct ath10k *ar = arvif->ar; if (!ar->wmi.ops->gen_beacon_dma) return -EOPNOTSUPP; - skb = ar->wmi.ops->gen_beacon_dma(ar, vdev_id, bcn, bcn_len, bcn_paddr, + skb = ar->wmi.ops->gen_beacon_dma(ar, arvif->vdev_id, bcn, bcn_len, bcn_paddr, dtim_zero, deliver_cab); if (IS_ERR(skb)) return PTR_ERR(skb); + spin_lock_bh(&ar->data_lock); + reinit_completion(&arvif->beacon_tx_done); + spin_unlock_bh(&ar->data_lock); + ret = ath10k_wmi_cmd_send_nowait(ar, skb, ar->wmi.cmd->pdev_send_bcn_cmdid); if (ret) { + spin_lock_bh(&ar->data_lock); + complete(&arvif->beacon_tx_done); + spin_unlock_bh(&ar->data_lock); dev_kfree_skb(skb); return ret; } diff --git a/drivers/net/wireless/ath/ath10k/wmi.c b/drivers/net/wireless/ath/ath10k/wmi.c index 5d89482ffca8..6e9705d3a967 100644 --- a/drivers/net/wireless/ath/ath10k/wmi.c +++ b/drivers/net/wireless/ath/ath10k/wmi.c @@ -1846,8 +1846,7 @@ static void ath10k_wmi_tx_beacon_nowait(struct ath10k_vif *arvif) dtim_zero = !!(cb->flags & ATH10K_SKB_F_DTIM_ZERO); deliver_cab = !!(cb->flags & ATH10K_SKB_F_DELIVER_CAB); - ret = ath10k_wmi_beacon_send_ref_nowait(arvif->ar, - arvif->vdev_id, + ret = ath10k_wmi_beacon_send_ref_nowait(arvif, bcn->data, bcn->len, cb->paddr, dtim_zero, @@ -5941,6 +5940,38 @@ static int ath10k_wmi_event_temperature(struct ath10k *ar, struct sk_buff *skb) return 0; } +static void ath10k_wmi_event_beacon_tx(struct ath10k *ar, struct sk_buff *skb) +{ + struct ath10k_vif *arvif; + const struct wmi_beacon_tx_event *ev; + u32 vdev_id; + + spin_lock_bh(&ar->data_lock); + + ev = (struct wmi_beacon_tx_event *)skb->data; + + if (WARN_ON_ONCE(skb->len < sizeof(*ev))) + goto exit; + + vdev_id = __le32_to_cpu(ev->vdev_id); + + /*ath10k_dbg(ar, ATH10K_DBG_WMI, + "wmi event beacon-tx-complete, vdev-id: %u completion-status: 0x%x\n", + vdev_id, __le32_to_cpu(ev->tx_status));*/ + + arvif = ath10k_get_arvif(ar, vdev_id); + if (!arvif) { + ath10k_warn(ar, "wmi-event-beacon-tx, could not find vdev for id: %u\n", + vdev_id); + goto exit; + } + + complete(&arvif->beacon_tx_done); + +exit: + spin_unlock_bh(&ar->data_lock); +} + static int ath10k_wmi_event_pdev_bss_chan_info(struct ath10k *ar, struct sk_buff *skb) { @@ -6269,6 +6300,9 @@ static void ath10k_wmi_10_1_op_rx(struct ath10k *ar, struct sk_buff *skb) case WMI_10_1_PDEV_BSS_CHAN_INFO_EVENTID: /* Newer CT firmware supports this */ ath10k_wmi_event_pdev_bss_chan_info(ar, skb); break; + case WMI_10_1_BEACON_TX_EVENTID: /* Feb 28, 2019 CT firmware supports this */ + ath10k_wmi_event_beacon_tx(ar, skb); + break; default: ath10k_warn(ar, "Unknown (10.1) eventid: %d\n", id); break; diff --git a/drivers/net/wireless/ath/ath10k/wmi.h b/drivers/net/wireless/ath/ath10k/wmi.h index 49466123b8fe..96066cc57f83 100644 --- a/drivers/net/wireless/ath/ath10k/wmi.h +++ b/drivers/net/wireless/ath/ath10k/wmi.h @@ -1533,6 +1533,7 @@ enum wmi_10x_event_id { WMI_10_1_PDEV_TEMPERATURE_EVENTID = 36898, /* Newer CT firmware */ WMI_10_1_PDEV_BSS_CHAN_INFO_EVENTID = 36900, /* Newer CT firmware */ + WMI_10_1_BEACON_TX_EVENTID = WMI_10X_END_EVENTID - 4, /* CT FW, beacon tx completed */ WMI_10X_PDEV_UTF_EVENTID = WMI_10X_END_EVENTID - 1, }; @@ -6223,6 +6224,13 @@ struct wmi_bcn_info { struct wmi_p2p_noa_info p2p_noa_info; } __packed; +/* CT FW only */ +struct wmi_beacon_tx_event { + __le32 vdev_id; + __le32 tx_status; + __le32 future[4]; +}; + struct wmi_host_swba_event { __le32 vdev_map; struct wmi_bcn_info bcn_info[0];