diff mbox

[1/3] ath10k: Ensure there are no stale ar->txqs entries.

Message ID 1471569995-10028-1-git-send-email-greearb@candelatech.com (mailing list archive)
State Changes Requested
Delegated to: Kalle Valo
Headers show

Commit Message

Ben Greear Aug. 19, 2016, 1:26 a.m. UTC
From: Ben Greear <greearb@candelatech.com>

I was seeing kernel crashes due to accessing freed memory
while debugging a 9984 firmware that was crashing often.

This patch fixes the crashes.  I am not certain if there
is a better way or not.

Signed-off-by: Ben Greear <greearb@candelatech.com>
---
 drivers/net/wireless/ath/ath10k/mac.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

Comments

Michal Kazior Aug. 19, 2016, 6:59 a.m. UTC | #1
On 19 August 2016 at 03:26,  <greearb@candelatech.com> wrote:
> From: Ben Greear <greearb@candelatech.com>
>
> I was seeing kernel crashes due to accessing freed memory
> while debugging a 9984 firmware that was crashing often.
>
> This patch fixes the crashes.  I am not certain if there
> is a better way or not.
>
> Signed-off-by: Ben Greear <greearb@candelatech.com>
> ---
>  drivers/net/wireless/ath/ath10k/mac.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
> index 5659ef1..916119c 100644
> --- a/drivers/net/wireless/ath/ath10k/mac.c
> +++ b/drivers/net/wireless/ath/ath10k/mac.c
> @@ -4172,8 +4172,10 @@ static void ath10k_mac_txq_init(struct ieee80211_txq *txq)
>  static void ath10k_mac_txq_unref(struct ath10k *ar, struct ieee80211_txq *txq)
>  {
>         struct ath10k_txq *artxq = (void *)txq->drv_priv;
> +       struct ath10k_txq *tmp, *walker;
>         struct ath10k_skb_cb *cb;
>         struct sk_buff *msdu;
> +       struct ieee80211_txq *txq_tmp;
>         int msdu_id;
>
>         if (!txq)
> @@ -4182,6 +4184,14 @@ static void ath10k_mac_txq_unref(struct ath10k *ar, struct ieee80211_txq *txq)
>         spin_lock_bh(&ar->txqs_lock);
>         if (!list_empty(&artxq->list))
>                 list_del_init(&artxq->list);
> +
> +       /* Remove from ar->txqs in case it still exists there. */
> +       list_for_each_entry_safe(walker, tmp, &ar->txqs, list) {
> +               txq_tmp = container_of((void *)walker, struct ieee80211_txq,
> +                                      drv_priv);
> +               if (txq_tmp == txq)
> +                       list_del(&walker->list);
> +       }

How could this even happen? All artxq->list accesses (add/del) are
protected by txqs_lock so this shouldn't happen, no?

Do you perhaps have the logic around txqs reworked in your tree?


Michał
Ben Greear Aug. 19, 2016, 1:34 p.m. UTC | #2
On 08/18/2016 11:59 PM, Michal Kazior wrote:
> On 19 August 2016 at 03:26,  <greearb@candelatech.com> wrote:
>> From: Ben Greear <greearb@candelatech.com>
>>
>> I was seeing kernel crashes due to accessing freed memory
>> while debugging a 9984 firmware that was crashing often.
>>
>> This patch fixes the crashes.  I am not certain if there
>> is a better way or not.
>>
>> Signed-off-by: Ben Greear <greearb@candelatech.com>
>> ---
>>   drivers/net/wireless/ath/ath10k/mac.c | 10 ++++++++++
>>   1 file changed, 10 insertions(+)
>>
>> diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
>> index 5659ef1..916119c 100644
>> --- a/drivers/net/wireless/ath/ath10k/mac.c
>> +++ b/drivers/net/wireless/ath/ath10k/mac.c
>> @@ -4172,8 +4172,10 @@ static void ath10k_mac_txq_init(struct ieee80211_txq *txq)
>>   static void ath10k_mac_txq_unref(struct ath10k *ar, struct ieee80211_txq *txq)
>>   {
>>          struct ath10k_txq *artxq = (void *)txq->drv_priv;
>> +       struct ath10k_txq *tmp, *walker;
>>          struct ath10k_skb_cb *cb;
>>          struct sk_buff *msdu;
>> +       struct ieee80211_txq *txq_tmp;
>>          int msdu_id;
>>
>>          if (!txq)
>> @@ -4182,6 +4184,14 @@ static void ath10k_mac_txq_unref(struct ath10k *ar, struct ieee80211_txq *txq)
>>          spin_lock_bh(&ar->txqs_lock);
>>          if (!list_empty(&artxq->list))
>>                  list_del_init(&artxq->list);
>> +
>> +       /* Remove from ar->txqs in case it still exists there. */
>> +       list_for_each_entry_safe(walker, tmp, &ar->txqs, list) {
>> +               txq_tmp = container_of((void *)walker, struct ieee80211_txq,
>> +                                      drv_priv);
>> +               if (txq_tmp == txq)
>> +                       list_del(&walker->list);
>> +       }
>
> How could this even happen? All artxq->list accesses (add/del) are
> protected by txqs_lock so this shouldn't happen, no?
>
> Do you perhaps have the logic around txqs reworked in your tree?

I don't have any significant changes as far as I can tell.

I can build you a buggy 9984 firmware to reproduce the problem if you want...

Maybe the upstream patch could WARN_ON in this case to see if anyone else
ever hits it?

I did see a comment in the mac80211 about some assumptions on the driver with
regard to station teardown...I am not 100% sure ath10k meets that assumption,
so maybe that is why I could see this problem.

Thanks,
Ben
Felix Fietkau Sept. 9, 2016, 5:25 p.m. UTC | #3
On 2016-08-19 03:26, greearb@candelatech.com wrote:
> From: Ben Greear <greearb@candelatech.com>
> 
> I was seeing kernel crashes due to accessing freed memory
> while debugging a 9984 firmware that was crashing often.
> 
> This patch fixes the crashes.  I am not certain if there
> is a better way or not.
> 
> Signed-off-by: Ben Greear <greearb@candelatech.com>
> ---
>  drivers/net/wireless/ath/ath10k/mac.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
> index 5659ef1..916119c 100644
> --- a/drivers/net/wireless/ath/ath10k/mac.c
> +++ b/drivers/net/wireless/ath/ath10k/mac.c
> @@ -4172,8 +4172,10 @@ static void ath10k_mac_txq_init(struct ieee80211_txq *txq)
>  static void ath10k_mac_txq_unref(struct ath10k *ar, struct ieee80211_txq *txq)
>  {
>  	struct ath10k_txq *artxq = (void *)txq->drv_priv;
> +	struct ath10k_txq *tmp, *walker;
>  	struct ath10k_skb_cb *cb;
>  	struct sk_buff *msdu;
> +	struct ieee80211_txq *txq_tmp;
>  	int msdu_id;
>  
>  	if (!txq)
> @@ -4182,6 +4184,14 @@ static void ath10k_mac_txq_unref(struct ath10k *ar, struct ieee80211_txq *txq)
>  	spin_lock_bh(&ar->txqs_lock);
>  	if (!list_empty(&artxq->list))
>  		list_del_init(&artxq->list);
> +
> +	/* Remove from ar->txqs in case it still exists there. */
> +	list_for_each_entry_safe(walker, tmp, &ar->txqs, list) {
> +		txq_tmp = container_of((void *)walker, struct ieee80211_txq,
> +				       drv_priv);
> +		if (txq_tmp == txq)
> +			list_del(&walker->list);
> +	}
This makes no sense at all. From txq_tmp == txq we can deduce that
walker == artxq. In the context above, it already does a
list_del_init(&artxq->list).

- Felix
Ben Greear Sept. 9, 2016, 5:46 p.m. UTC | #4
On 09/09/2016 10:25 AM, Felix Fietkau wrote:
> On 2016-08-19 03:26, greearb@candelatech.com wrote:
>> From: Ben Greear <greearb@candelatech.com>
>>
>> I was seeing kernel crashes due to accessing freed memory
>> while debugging a 9984 firmware that was crashing often.
>>
>> This patch fixes the crashes.  I am not certain if there
>> is a better way or not.
>>
>> Signed-off-by: Ben Greear <greearb@candelatech.com>
>> ---
>>  drivers/net/wireless/ath/ath10k/mac.c | 10 ++++++++++
>>  1 file changed, 10 insertions(+)
>>
>> diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
>> index 5659ef1..916119c 100644
>> --- a/drivers/net/wireless/ath/ath10k/mac.c
>> +++ b/drivers/net/wireless/ath/ath10k/mac.c
>> @@ -4172,8 +4172,10 @@ static void ath10k_mac_txq_init(struct ieee80211_txq *txq)
>>  static void ath10k_mac_txq_unref(struct ath10k *ar, struct ieee80211_txq *txq)
>>  {
>>  	struct ath10k_txq *artxq = (void *)txq->drv_priv;
>> +	struct ath10k_txq *tmp, *walker;
>>  	struct ath10k_skb_cb *cb;
>>  	struct sk_buff *msdu;
>> +	struct ieee80211_txq *txq_tmp;
>>  	int msdu_id;
>>
>>  	if (!txq)
>> @@ -4182,6 +4184,14 @@ static void ath10k_mac_txq_unref(struct ath10k *ar, struct ieee80211_txq *txq)
>>  	spin_lock_bh(&ar->txqs_lock);
>>  	if (!list_empty(&artxq->list))
>>  		list_del_init(&artxq->list);
>> +
>> +	/* Remove from ar->txqs in case it still exists there. */
>> +	list_for_each_entry_safe(walker, tmp, &ar->txqs, list) {
>> +		txq_tmp = container_of((void *)walker, struct ieee80211_txq,
>> +				       drv_priv);
>> +		if (txq_tmp == txq)
>> +			list_del(&walker->list);
>> +	}
> This makes no sense at all. From txq_tmp == txq we can deduce that
> walker == artxq. In the context above, it already does a
> list_del_init(&artxq->list).

This fixed my problem, so something about this matters.

Possibly it works around some other race, just possibly it
is because of some other regression/bug in my driver/kernel.

I thought maybe the issue was that flushing doesn't really work
for ath10k, so when the upper stack tried to flush and delete
a station there were still skbs in the driver that were referencing
the txqs up in mac80211.

Also, this bug was triggered by firmware that crashed very often on
transmit of an skb, so in general there were skbs that were not properly
transmitted and maybe that also triggers some other bug/race in the
driver.

Thanks,
Ben
Ben Greear Dec. 1, 2016, 10:52 p.m. UTC | #5
On 08/19/2016 06:34 AM, Ben Greear wrote:
>
>
> On 08/18/2016 11:59 PM, Michal Kazior wrote:
>> On 19 August 2016 at 03:26,  <greearb@candelatech.com> wrote:
>>> From: Ben Greear <greearb@candelatech.com>
>>>
>>> I was seeing kernel crashes due to accessing freed memory
>>> while debugging a 9984 firmware that was crashing often.
>>>
>>> This patch fixes the crashes.  I am not certain if there
>>> is a better way or not.


I did some more hacking on this today.  I think I found some better clue on this.

I added this code:

static void ath10k_mac_txq_init(struct ath10k *ar, struct ieee80211_txq *txq)
{
	struct ath10k_txq *artxq = (void *)txq->drv_priv;
	struct ath10k_txq *tmp, *walker;
	struct ieee80211_txq *txq_tmp;
	int i = 0;

	if (!txq)
		return;

	spin_lock_bh(&ar->txqs_lock);
	ar->txqs_lock.rlock.dbg1 = 104;

	/* Remove from ar->txqs in case it still exists there. */
	list_for_each_entry_safe(walker, tmp, &ar->txqs, list) {
		txq_tmp = container_of((void *)walker, struct ieee80211_txq,
				       drv_priv);
		if ((++i % 10000) == 0) {
			ath10k_err(ar, "txq-init: Checking txq_tmp: %p i: %d\n", txq_tmp, i);
			ath10k_err(ar, "txq-init: txqs: %p walker->list: %p w->next: %p  w->prev: %p ar->txqs: %p\n",
				   &ar->txqs, &(walker->list), walker->list.next, walker->list.prev, &ar->txqs);
		}

		if (txq_tmp == txq) {
			WARN_ON_ONCE(1);
			ath10k_err(ar, "txq-init: Found txq when it should be deleted, txq_tmp: %p  txq: %p\n",
				   txq_tmp, txq);
			list_del(&walker->list);
		}
	}
	spin_unlock_bh(&ar->txqs_lock);

	INIT_LIST_HEAD(&artxq->list);
}


[firmware has just crashed]

Dec 01 14:43:06 wave2 kernel: ------------[ cut here ]------------
Dec 01 14:43:06 wave2 kernel: WARNING: CPU: 0 PID: 193 at /home/greearb/git/linux-4.7.dev.y/drivers/net/wireless/ath/ath10k/mac.c:4217 
ath10k_mac_txq_init+0x1a7/0x1b0 [ath10k_core]
Dec 01 14:43:06 wave2 kernel: Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 bridge 8021q garp mrp stp llc bnep bluetooth fuse 
macvlan pktgen rpcsec_gss_krb5 nfsv4 nfs fscache coretemp hwmon intel_rapl x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_hdmi snd_hda_codec_realtek 
snd_hda_codec_generic kvm iTCO_wdt irqbypass iTCO_vendor_support ath10k_pci ath10k_core joydev ath snd_hda_intel mac80211 snd_hda_codec snd_hda_core snd_hwdep 
snd_seq snd_seq_device pcspkr cfg80211 snd_pcm snd_timer snd i2c_i801 lpc_ich shpchp soundcore tpm_tis tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc serio_raw 
i915 i2c_algo_bit ata_generic drm_kms_helper pata_acpi e1000e ptp drm pps_core i2c_core fjes video ipv6 [last unloaded: nf_conntrack]
Dec 01 14:43:06 wave2 kernel: CPU: 0 PID: 193 Comm: kworker/0:1 Not tainted 4.7.10+ #14
Dec 01 14:43:06 wave2 kernel: Hardware name: To be filled by O.E.M. To be filled by O.E.M./ChiefRiver, BIOS 4.6.5 06/07/2013
Dec 01 14:43:06 wave2 kernel: Workqueue: events_freezable ieee80211_restart_work [mac80211]
Dec 01 14:43:06 wave2 kernel:  ffffffffa0e29507 ffff8801d14f7920 ffffffff8169ed08 0000000000000000
Dec 01 14:43:06 wave2 kernel:  0000000000000000 ffff8801d14f7968 ffffffff811569bc ffff8801d14e4f00
Dec 01 14:43:06 wave2 kernel:  0000107900000001 ffff8800c43ec9a0 0000000000000027 ffff8800c43ec988
Dec 01 14:43:06 wave2 kernel: Call Trace:
Dec 01 14:43:06 wave2 kernel:  [<ffffffffa0e29507>] ? ath10k_mac_txq_init+0x1a7/0x1b0 [ath10k_core]
Dec 01 14:43:06 wave2 kernel:  [<ffffffff8169ed08>] dump_stack+0x85/0xcd
Dec 01 14:43:06 wave2 kernel:  [<ffffffff811569bc>] __warn+0x10c/0x130
Dec 01 14:43:06 wave2 kernel:  [<ffffffff81156b58>] warn_slowpath_null+0x18/0x20
Dec 01 14:43:06 wave2 kernel:  [<ffffffffa0e29507>] ath10k_mac_txq_init+0x1a7/0x1b0 [ath10k_core]
Dec 01 14:43:06 wave2 kernel:  [<ffffffffa0e3766f>] ath10k_sta_state+0x4ef/0x1350 [ath10k_core]
Dec 01 14:43:06 wave2 kernel:  [<ffffffff811e10ed>] ? mark_lock+0x6d/0x8a0
Dec 01 14:43:06 wave2 kernel:  [<ffffffffa0e37180>] ? ath10k_station_assoc+0x920/0x920 [ath10k_core]
Dec 01 14:43:06 wave2 kernel:  [<ffffffff811ddd14>] ? __lock_is_held+0x84/0xc0
Dec 01 14:43:06 wave2 kernel:  [<ffffffffa0c0095f>] drv_sta_state+0xef/0xc50 [mac80211]
Dec 01 14:43:06 wave2 kernel:  [<ffffffffa0c6b1b0>] ieee80211_reconfig+0x10a0/0x2890 [mac80211]
Dec 01 14:43:06 wave2 kernel:  [<ffffffffa0bf8361>] ieee80211_restart_work+0xb1/0xf0 [mac80211]
Dec 01 14:43:06 wave2 kernel:  [<ffffffff81184dad>] process_one_work+0x42d/0xac0
Dec 01 14:43:06 wave2 kernel:  [<ffffffff81184cf4>] ? process_one_work+0x374/0xac0
Dec 01 14:43:06 wave2 kernel:  [<ffffffff81184980>] ? pwq_dec_nr_in_flight+0x110/0x110
Dec 01 14:43:06 wave2 kernel:  [<ffffffff811854c6>] worker_thread+0x86/0x730
Dec 01 14:43:06 wave2 kernel:  [<ffffffff81df25aa>] ? _raw_spin_unlock_irqrestore+0x5a/0x70
Dec 01 14:43:06 wave2 kernel:  [<ffffffff81185440>] ? process_one_work+0xac0/0xac0
Dec 01 14:43:06 wave2 kernel:  [<ffffffff8118f381>] kthread+0x191/0x1b0
Dec 01 14:43:06 wave2 kernel:  [<ffffffff8118f1f0>] ? kthread_create_on_node+0x320/0x320
Dec 01 14:43:06 wave2 kernel:  [<ffffffff8119baa3>] ? preempt_count_sub+0x13/0xd0
Dec 01 14:43:06 wave2 kernel:  [<ffffffff81df2f8f>] ret_from_fork+0x1f/0x40
Dec 01 14:43:06 wave2 kernel:  [<ffffffff8118f1f0>] ? kthread_create_on_node+0x320/0x320
Dec 01 14:43:06 wave2 kernel: ---[ end trace e64bc8f0c1a2531b ]---
Dec 01 14:43:06 wave2 kernel: ath10k_pci 0000:05:00.0: txq-init: Found txq when it should be deleted, txq_tmp: ffff8800c43ec988  txq: ffff8800c43ec988
Dec 01 14:43:07 wave2 kernel: ath10k_pci 0000:05:00.0: dropping dbg buffer due to crash since read


Thanks,
Ben
Ben Greear Dec. 2, 2016, 12:24 a.m. UTC | #6
On 12/01/2016 02:52 PM, Ben Greear wrote:
> On 08/19/2016 06:34 AM, Ben Greear wrote:
>>
>>
>> On 08/18/2016 11:59 PM, Michal Kazior wrote:
>>> On 19 August 2016 at 03:26,  <greearb@candelatech.com> wrote:
>>>> From: Ben Greear <greearb@candelatech.com>
>>>>
>>>> I was seeing kernel crashes due to accessing freed memory
>>>> while debugging a 9984 firmware that was crashing often.
>>>>
>>>> This patch fixes the crashes.  I am not certain if there
>>>> is a better way or not.

Michal, thanks for the help on IRC.  I added this logic:

static void ieee80211_drv_tx(struct ieee80211_local *local,
			     struct ieee80211_vif *vif,
			     struct ieee80211_sta *pubsta,
			     struct sk_buff *skb)
{
	struct ieee80211_hdr *hdr = (struct ieee80211_hdr *) skb->data;
	struct ieee80211_sub_if_data *sdata = vif_to_sdata(vif);
	struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
	struct ieee80211_tx_control control = {
		.sta = pubsta,
	};
	struct ieee80211_txq *txq = NULL;
	struct txq_info *txqi;
	u8 ac;

	if ((info->flags & IEEE80211_TX_CTL_SEND_AFTER_DTIM) ||
	    (info->control.flags & IEEE80211_TX_CTRL_PS_RESPONSE))
		goto tx_normal;

	if (!ieee80211_is_data(hdr->frame_control))
		goto tx_normal;

	if (unlikely(!ieee80211_sdata_running(sdata))) {
		WARN_ON_ONCE(1);
		goto delete_and_return;
	}

...

	if (atomic_read(&sdata->txqs_len[ac]) >=
	    (local->hw.txq_ac_max_pending * 2)) {
		/* Must be that something is not paying attention to
		 * max-pending, like pktgen, so just drop this frame.
		 */
delete_and_return:
		ieee80211_free_txskb(&local->hw, skb);
		return;
	}


But, I still see the txq entries on the ar->txqs list in the ath10k_mac_txq_init
after firmware crash in my test case.  Is this the test you were suggesting?


Thanks,
Ben

>
>
> I did some more hacking on this today.  I think I found some better clue on this.
>
> I added this code:
>
> static void ath10k_mac_txq_init(struct ath10k *ar, struct ieee80211_txq *txq)
> {
>     struct ath10k_txq *artxq = (void *)txq->drv_priv;
>     struct ath10k_txq *tmp, *walker;
>     struct ieee80211_txq *txq_tmp;
>     int i = 0;
>
>     if (!txq)
>         return;
>
>     spin_lock_bh(&ar->txqs_lock);
>     ar->txqs_lock.rlock.dbg1 = 104;
>
>     /* Remove from ar->txqs in case it still exists there. */
>     list_for_each_entry_safe(walker, tmp, &ar->txqs, list) {
>         txq_tmp = container_of((void *)walker, struct ieee80211_txq,
>                        drv_priv);
>         if ((++i % 10000) == 0) {
>             ath10k_err(ar, "txq-init: Checking txq_tmp: %p i: %d\n", txq_tmp, i);
>             ath10k_err(ar, "txq-init: txqs: %p walker->list: %p w->next: %p  w->prev: %p ar->txqs: %p\n",
>                    &ar->txqs, &(walker->list), walker->list.next, walker->list.prev, &ar->txqs);
>         }
>
>         if (txq_tmp == txq) {
>             WARN_ON_ONCE(1);
>             ath10k_err(ar, "txq-init: Found txq when it should be deleted, txq_tmp: %p  txq: %p\n",
>                    txq_tmp, txq);
>             list_del(&walker->list);
>         }
>     }
>     spin_unlock_bh(&ar->txqs_lock);
>
>     INIT_LIST_HEAD(&artxq->list);
> }
>
>
> [firmware has just crashed]
>
> Dec 01 14:43:06 wave2 kernel: ------------[ cut here ]------------
> Dec 01 14:43:06 wave2 kernel: WARNING: CPU: 0 PID: 193 at /home/greearb/git/linux-4.7.dev.y/drivers/net/wireless/ath/ath10k/mac.c:4217
> ath10k_mac_txq_init+0x1a7/0x1b0 [ath10k_core]
> Dec 01 14:43:06 wave2 kernel: Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 bridge 8021q garp mrp stp llc bnep bluetooth fuse
> macvlan pktgen rpcsec_gss_krb5 nfsv4 nfs fscache coretemp hwmon intel_rapl x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_hdmi snd_hda_codec_realtek
> snd_hda_codec_generic kvm iTCO_wdt irqbypass iTCO_vendor_support ath10k_pci ath10k_core joydev ath snd_hda_intel mac80211 snd_hda_codec snd_hda_core snd_hwdep
> snd_seq snd_seq_device pcspkr cfg80211 snd_pcm snd_timer snd i2c_i801 lpc_ich shpchp soundcore tpm_tis tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc serio_raw
> i915 i2c_algo_bit ata_generic drm_kms_helper pata_acpi e1000e ptp drm pps_core i2c_core fjes video ipv6 [last unloaded: nf_conntrack]
> Dec 01 14:43:06 wave2 kernel: CPU: 0 PID: 193 Comm: kworker/0:1 Not tainted 4.7.10+ #14
> Dec 01 14:43:06 wave2 kernel: Hardware name: To be filled by O.E.M. To be filled by O.E.M./ChiefRiver, BIOS 4.6.5 06/07/2013
> Dec 01 14:43:06 wave2 kernel: Workqueue: events_freezable ieee80211_restart_work [mac80211]
> Dec 01 14:43:06 wave2 kernel:  ffffffffa0e29507 ffff8801d14f7920 ffffffff8169ed08 0000000000000000
> Dec 01 14:43:06 wave2 kernel:  0000000000000000 ffff8801d14f7968 ffffffff811569bc ffff8801d14e4f00
> Dec 01 14:43:06 wave2 kernel:  0000107900000001 ffff8800c43ec9a0 0000000000000027 ffff8800c43ec988
> Dec 01 14:43:06 wave2 kernel: Call Trace:
> Dec 01 14:43:06 wave2 kernel:  [<ffffffffa0e29507>] ? ath10k_mac_txq_init+0x1a7/0x1b0 [ath10k_core]
> Dec 01 14:43:06 wave2 kernel:  [<ffffffff8169ed08>] dump_stack+0x85/0xcd
> Dec 01 14:43:06 wave2 kernel:  [<ffffffff811569bc>] __warn+0x10c/0x130
> Dec 01 14:43:06 wave2 kernel:  [<ffffffff81156b58>] warn_slowpath_null+0x18/0x20
> Dec 01 14:43:06 wave2 kernel:  [<ffffffffa0e29507>] ath10k_mac_txq_init+0x1a7/0x1b0 [ath10k_core]
> Dec 01 14:43:06 wave2 kernel:  [<ffffffffa0e3766f>] ath10k_sta_state+0x4ef/0x1350 [ath10k_core]
> Dec 01 14:43:06 wave2 kernel:  [<ffffffff811e10ed>] ? mark_lock+0x6d/0x8a0
> Dec 01 14:43:06 wave2 kernel:  [<ffffffffa0e37180>] ? ath10k_station_assoc+0x920/0x920 [ath10k_core]
> Dec 01 14:43:06 wave2 kernel:  [<ffffffff811ddd14>] ? __lock_is_held+0x84/0xc0
> Dec 01 14:43:06 wave2 kernel:  [<ffffffffa0c0095f>] drv_sta_state+0xef/0xc50 [mac80211]
> Dec 01 14:43:06 wave2 kernel:  [<ffffffffa0c6b1b0>] ieee80211_reconfig+0x10a0/0x2890 [mac80211]
> Dec 01 14:43:06 wave2 kernel:  [<ffffffffa0bf8361>] ieee80211_restart_work+0xb1/0xf0 [mac80211]
> Dec 01 14:43:06 wave2 kernel:  [<ffffffff81184dad>] process_one_work+0x42d/0xac0
> Dec 01 14:43:06 wave2 kernel:  [<ffffffff81184cf4>] ? process_one_work+0x374/0xac0
> Dec 01 14:43:06 wave2 kernel:  [<ffffffff81184980>] ? pwq_dec_nr_in_flight+0x110/0x110
> Dec 01 14:43:06 wave2 kernel:  [<ffffffff811854c6>] worker_thread+0x86/0x730
> Dec 01 14:43:06 wave2 kernel:  [<ffffffff81df25aa>] ? _raw_spin_unlock_irqrestore+0x5a/0x70
> Dec 01 14:43:06 wave2 kernel:  [<ffffffff81185440>] ? process_one_work+0xac0/0xac0
> Dec 01 14:43:06 wave2 kernel:  [<ffffffff8118f381>] kthread+0x191/0x1b0
> Dec 01 14:43:06 wave2 kernel:  [<ffffffff8118f1f0>] ? kthread_create_on_node+0x320/0x320
> Dec 01 14:43:06 wave2 kernel:  [<ffffffff8119baa3>] ? preempt_count_sub+0x13/0xd0
> Dec 01 14:43:06 wave2 kernel:  [<ffffffff81df2f8f>] ret_from_fork+0x1f/0x40
> Dec 01 14:43:06 wave2 kernel:  [<ffffffff8118f1f0>] ? kthread_create_on_node+0x320/0x320
> Dec 01 14:43:06 wave2 kernel: ---[ end trace e64bc8f0c1a2531b ]---
> Dec 01 14:43:06 wave2 kernel: ath10k_pci 0000:05:00.0: txq-init: Found txq when it should be deleted, txq_tmp: ffff8800c43ec988  txq: ffff8800c43ec988
> Dec 01 14:43:07 wave2 kernel: ath10k_pci 0000:05:00.0: dropping dbg buffer due to crash since read
>
>
> Thanks,
> Ben
>
Michal Kazior Dec. 5, 2016, 8:50 a.m. UTC | #7
On 2 December 2016 at 01:24, Ben Greear <greearb@candelatech.com> wrote:
> On 12/01/2016 02:52 PM, Ben Greear wrote:
>>
>> On 08/19/2016 06:34 AM, Ben Greear wrote:
>>>
>>>
>>>
>>> On 08/18/2016 11:59 PM, Michal Kazior wrote:
>>>>
>>>> On 19 August 2016 at 03:26,  <greearb@candelatech.com> wrote:
>>>>>
>>>>> From: Ben Greear <greearb@candelatech.com>
>>>>>
>>>>> I was seeing kernel crashes due to accessing freed memory
>>>>> while debugging a 9984 firmware that was crashing often.
>>>>>
>>>>> This patch fixes the crashes.  I am not certain if there
>>>>> is a better way or not.
>
>
> Michal, thanks for the help on IRC.  I added this logic:
>
> static void ieee80211_drv_tx(struct ieee80211_local *local,
>                              struct ieee80211_vif *vif,
>                              struct ieee80211_sta *pubsta,
>                              struct sk_buff *skb)
> {
>         struct ieee80211_hdr *hdr = (struct ieee80211_hdr *) skb->data;
>         struct ieee80211_sub_if_data *sdata = vif_to_sdata(vif);
>         struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
>         struct ieee80211_tx_control control = {
>                 .sta = pubsta,
>         };
>         struct ieee80211_txq *txq = NULL;
>         struct txq_info *txqi;
>         u8 ac;
>
>         if ((info->flags & IEEE80211_TX_CTL_SEND_AFTER_DTIM) ||
>             (info->control.flags & IEEE80211_TX_CTRL_PS_RESPONSE))
>                 goto tx_normal;
>
>         if (!ieee80211_is_data(hdr->frame_control))
>                 goto tx_normal;
>
>         if (unlikely(!ieee80211_sdata_running(sdata))) {
>                 WARN_ON_ONCE(1);
>                 goto delete_and_return;
>         }
>
> ...
>
>         if (atomic_read(&sdata->txqs_len[ac]) >=
>             (local->hw.txq_ac_max_pending * 2)) {
>                 /* Must be that something is not paying attention to
>                  * max-pending, like pktgen, so just drop this frame.
>                  */
> delete_and_return:
>                 ieee80211_free_txskb(&local->hw, skb);
>                 return;
>         }
>
>
> But, I still see the txq entries on the ar->txqs list in the
> ath10k_mac_txq_init
> after firmware crash in my test case.  Is this the test you were suggesting?

Yes.

Now that I think about it mac80211 doesn't call anything in driver
during hw_restart that would unref txqs. This means you'll have them
still linked when add_interface/sta_state is called, no?

This means that either:
 (a) txq (re-)init should be smarter in ath10k
 (b) txqs should be purged during hw_restart in ath10k


Michał
Ben Greear Dec. 5, 2016, 6:19 p.m. UTC | #8
On 12/05/2016 12:50 AM, Michal Kazior wrote:
> On 2 December 2016 at 01:24, Ben Greear <greearb@candelatech.com> wrote:
>> On 12/01/2016 02:52 PM, Ben Greear wrote:
>>>
>>> On 08/19/2016 06:34 AM, Ben Greear wrote:
>>>>
>>>>
>>>>
>>>> On 08/18/2016 11:59 PM, Michal Kazior wrote:
>>>>>
>>>>> On 19 August 2016 at 03:26,  <greearb@candelatech.com> wrote:
>>>>>>
>>>>>> From: Ben Greear <greearb@candelatech.com>
>>>>>>
>>>>>> I was seeing kernel crashes due to accessing freed memory
>>>>>> while debugging a 9984 firmware that was crashing often.
>>>>>>
>>>>>> This patch fixes the crashes.  I am not certain if there
>>>>>> is a better way or not.
>>
>>
>> Michal, thanks for the help on IRC.  I added this logic:
>>
>> static void ieee80211_drv_tx(struct ieee80211_local *local,
>>                              struct ieee80211_vif *vif,
>>                              struct ieee80211_sta *pubsta,
>>                              struct sk_buff *skb)
>> {
>>         struct ieee80211_hdr *hdr = (struct ieee80211_hdr *) skb->data;
>>         struct ieee80211_sub_if_data *sdata = vif_to_sdata(vif);
>>         struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
>>         struct ieee80211_tx_control control = {
>>                 .sta = pubsta,
>>         };
>>         struct ieee80211_txq *txq = NULL;
>>         struct txq_info *txqi;
>>         u8 ac;
>>
>>         if ((info->flags & IEEE80211_TX_CTL_SEND_AFTER_DTIM) ||
>>             (info->control.flags & IEEE80211_TX_CTRL_PS_RESPONSE))
>>                 goto tx_normal;
>>
>>         if (!ieee80211_is_data(hdr->frame_control))
>>                 goto tx_normal;
>>
>>         if (unlikely(!ieee80211_sdata_running(sdata))) {
>>                 WARN_ON_ONCE(1);
>>                 goto delete_and_return;
>>         }
>>
>> ...
>>
>>         if (atomic_read(&sdata->txqs_len[ac]) >=
>>             (local->hw.txq_ac_max_pending * 2)) {
>>                 /* Must be that something is not paying attention to
>>                  * max-pending, like pktgen, so just drop this frame.
>>                  */
>> delete_and_return:
>>                 ieee80211_free_txskb(&local->hw, skb);
>>                 return;
>>         }
>>
>>
>> But, I still see the txq entries on the ar->txqs list in the
>> ath10k_mac_txq_init
>> after firmware crash in my test case.  Is this the test you were suggesting?
>
> Yes.
>
> Now that I think about it mac80211 doesn't call anything in driver
> during hw_restart that would unref txqs. This means you'll have them
> still linked when add_interface/sta_state is called, no?
>
> This means that either:
>  (a) txq (re-)init should be smarter in ath10k
>  (b) txqs should be purged during hw_restart in ath10k

I posted a patch that does (a) last Friday:

"ath10k:  work-around for stale txq in ar->txqs"

I realized it will not apply upstream because it is also patching the previous
work-around I had in the patch that originated this email thread.

With these patches and the iterate hack to mac80211, then I no longer
see crashes in my test case that previously crashed very readily.

Thanks,
Ben
diff mbox

Patch

diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
index 5659ef1..916119c 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -4172,8 +4172,10 @@  static void ath10k_mac_txq_init(struct ieee80211_txq *txq)
 static void ath10k_mac_txq_unref(struct ath10k *ar, struct ieee80211_txq *txq)
 {
 	struct ath10k_txq *artxq = (void *)txq->drv_priv;
+	struct ath10k_txq *tmp, *walker;
 	struct ath10k_skb_cb *cb;
 	struct sk_buff *msdu;
+	struct ieee80211_txq *txq_tmp;
 	int msdu_id;
 
 	if (!txq)
@@ -4182,6 +4184,14 @@  static void ath10k_mac_txq_unref(struct ath10k *ar, struct ieee80211_txq *txq)
 	spin_lock_bh(&ar->txqs_lock);
 	if (!list_empty(&artxq->list))
 		list_del_init(&artxq->list);
+
+	/* Remove from ar->txqs in case it still exists there. */
+	list_for_each_entry_safe(walker, tmp, &ar->txqs, list) {
+		txq_tmp = container_of((void *)walker, struct ieee80211_txq,
+				       drv_priv);
+		if (txq_tmp == txq)
+			list_del(&walker->list);
+	}
 	spin_unlock_bh(&ar->txqs_lock);
 
 	spin_lock_bh(&ar->htt.tx_lock);