Help debugging iwldvm / ath10k stalls

Message ID	CALCETrWdQwvMcgC=PESyECzkyMD8-vzjNEHOCkm5z=ENH7coHg@mail.gmail.com (mailing list archive)
State	Not Applicable, archived
Headers	show Return-Path: <linux-wireless-owner@kernel.org> MIME-Version: 1.0 In-Reply-To: <CANUX_P2_GZGhd7X4Ntzi5Jeng52i9Jam22DQTTU2akESPL1ffQ@mail.gmail.com> References: <CALCETrUffJLp+AM2-13HEE5+cKtyMMSjJFsJnhcd5kMzebs3NA@mail.gmail.com> <CALCETrU58grNoqQh5ZutHa8CEt4rqS6FrMWwpC6DaQ0cXqxXhw@mail.gmail.com> <CANUX_P22a2eC0VAw-NCKrnVQRs9t=uy1VHrBF-2dAsW+s=YfUQ@mail.gmail.com> <CALCETrUXGKpHfTwYti9Rzgv-ZM--_d9pW8_Jx9oR1L+GsBeZcQ@mail.gmail.com> <CANUX_P2sH1yzx5aeLbHXBGHAsXmkZ70ioEhH1Ad_xNrCEJbJRg@mail.gmail.com> <CANUX_P2XpeDOmNOX=p4yMUe5ZAZv-UQ97M0ayFcFSsTrEDS3CA@mail.gmail.com> <CALCETrVpvFk+qhq3PFgK+5kTnBcbtv8H_Ko+kHV0jay0ZTfmcQ@mail.gmail.com> <CANUX_P3NrA2GgZ-Nqgvk7do7Kzqa9kgjtpn4t9TuLKzhmqTqjQ@mail.gmail.com> <CALCETrX1s504Ej7sQV292NfHgrKrV-Wigk0GP7wo6GyALaM9xA@mail.gmail.com> <CANUX_P3jMcV=7stRtPAFZBJVG9bE10iTC0sRK_o-7vTxpy9+9w@mail.gmail.com> <CALCETrVPXwedhPGtDJQDSK5Kxa4NpeSDck4-nHyox_xriPBcZQ@mail.gmail.com> <CALCETrW0NQ6aW06WD6ZAi8-o7fZSiKPOf0xYCSEw2egYBfqyQQ@mail.gmail.com> <CALCETrUTsKCw_G7V=YDeP-v_GF8cK8ezY-wqcpxeuw3dzBQhQg@mail.gmail.com> <CANUX_P2_GZGhd7X4Ntzi5Jeng52i9Jam22DQTTU2akESPL1ffQ@mail.gmail.com> From: Andy Lutomirski <luto@amacapital.net> Date: Tue, 27 May 2014 20:07:41 -0700 Message-ID: <CALCETrWdQwvMcgC=PESyECzkyMD8-vzjNEHOCkm5z=ENH7coHg@mail.gmail.com> Subject: Re: Help debugging iwldvm / ath10k stalls To: Emmanuel Grumbach <egrumbach@gmail.com> Cc: Linux Wireless List <linux-wireless@vger.kernel.org> Content-Type: multipart/mixed; boundary=001a1133c548ca02a004fa6d1e66 Sender: linux-wireless-owner@vger.kernel.org Precedence: bulk

Andy Lutomirski May 28, 2014, 3:07 a.m. UTC

On Tue, May 27, 2014 at 1:30 PM, Emmanuel Grumbach <egrumbach@gmail.com> wrote:
>>>
>>> It looks like your patch allows both:
>>>
>>> iwl_static_sleep_cmd(priv, cmd, IWL_POWER_INDEX_5, dtimper);
>>>
>>> and
>>>
>>> iwl_static_sleep_cmd(priv, cmd, IWL_POWER_INDEX_5, 20);
>>>
>>> I don't know whether that's a problem.
>
> I'd expect enable to be false... But I might have got lost in
> mac80211's PS code... Can  you check the value of enable? Thanks
> Anyway - I want to call the cam one. IOW, I want to disable power save
> completely. This is what is causing trouble.
> We want to add a frame on the ring, and since the NIC is asleep, we
> schedule a wakeup of the NIC so that it can fetch the packet when it
> wakes up... But it doesn't wake up...
> Now - This is firmware / hardware related... which means that there
> isn't much I can do for this old device.
>  Bottom line, I guess we want to disable power save here.
> If you can bisect this, this can be really helpful. This might allow
> us to keep the feature.

I doubt I can bisect -- the trigger was a new AP, not a new kernel.  I
can't exactly cut the AP in half :)


Pre-suspend, i.e., working:

[   20.949900] enabled = 1, wowlan = 0
[   20.950177] enabled = 1, wowlan = 0
[   21.614016] enabled = 1, wowlan = 0
[   21.614658] enabled = 1, wowlan = 0
[   42.667586] enabled = 0, wowlan = 0
[   42.672514] enabled = 1, wowlan = 0
[   53.088165] fuse init (API version 7.23)
[   53.102082] SELinux: initialized (dev fuse, type fuse), uses genfs_contexts
[   53.130945] SELinux: initialized (dev fusectl, type fusectl), uses
genfs_contexts
[   85.627558] enabled = 0, wowlan = 0
[   85.631686] enabled = 1, wowlan = 0
[  134.649346] e1000e: em1 NIC Link is Down
[  137.682277] wlan0: deauthenticating from 02:c6:26:cc:b4:c7 by local
choice (Reason: 3=DEAUTH_LEAVING)
[  137.682780] enabled = 0, wowlan = 0
[  137.693889] enabled = 0, wowlan = 0

Post-suspend, i.e., not working:

[  144.406303] enabled = 1, wowlan = 0
[  144.406496] enabled = 1, wowlan = 0
[  145.026827] enabled = 1, wowlan = 0
[  145.028211] enabled = 1, wowlan = 0
[  165.688632] enabled = 0, wowlan = 0
[  165.689960] enabled = 0, wowlan = 0
[  165.693988] enabled = 1, wowlan = 0
[  165.694245] enabled = 1, wowlan = 0
[  208.641426] enabled = 0, wowlan = 0
[  208.641786] enabled = 0, wowlan = 0
[  208.647499] enabled = 1, wowlan = 0
[  208.647639] enabled = 1, wowlan = 0
[  271.435558] enabled = 0, wowlan = 0
[  271.435767] enabled = 0, wowlan = 0
[  271.440125] enabled = 1, wowlan = 0
[  271.440405] enabled = 1, wowlan = 0

With even more instrumentation added, I did get a glitch before
suspend/resume, but it came with more than two power setting updates.
Logs and patch attached, complete with call stacks.

Is it possible that part of the problem is that the firmware doesn't
like being spammed with power config changes so frequently?

--Andy

Emmanuel Grumbach May 28, 2014, 12:09 p.m. UTC | #1

>
> I doubt I can bisect -- the trigger was a new AP, not a new kernel.  I
> can't exactly cut the AP in half :)

I see.. This is really weird though. Anyway.

>
>
> Pre-suspend, i.e., working:
>
> [   20.949900] enabled = 1, wowlan = 0
> [   20.950177] enabled = 1, wowlan = 0
> [   21.614016] enabled = 1, wowlan = 0
> [   21.614658] enabled = 1, wowlan = 0
> [   42.667586] enabled = 0, wowlan = 0
> [   42.672514] enabled = 1, wowlan = 0
> [   53.088165] fuse init (API version 7.23)
> [   53.102082] SELinux: initialized (dev fuse, type fuse), uses genfs_contexts
> [   53.130945] SELinux: initialized (dev fusectl, type fusectl), uses
> genfs_contexts
> [   85.627558] enabled = 0, wowlan = 0
> [   85.631686] enabled = 1, wowlan = 0
> [  134.649346] e1000e: em1 NIC Link is Down
> [  137.682277] wlan0: deauthenticating from 02:c6:26:cc:b4:c7 by local
> choice (Reason: 3=DEAUTH_LEAVING)
> [  137.682780] enabled = 0, wowlan = 0
> [  137.693889] enabled = 0, wowlan = 0
>
> Post-suspend, i.e., not working:
>
> [  144.406303] enabled = 1, wowlan = 0
> [  144.406496] enabled = 1, wowlan = 0
> [  145.026827] enabled = 1, wowlan = 0
> [  145.028211] enabled = 1, wowlan = 0
> [  165.688632] enabled = 0, wowlan = 0
> [  165.689960] enabled = 0, wowlan = 0
> [  165.693988] enabled = 1, wowlan = 0
> [  165.694245] enabled = 1, wowlan = 0
> [  208.641426] enabled = 0, wowlan = 0
> [  208.641786] enabled = 0, wowlan = 0
> [  208.647499] enabled = 1, wowlan = 0
> [  208.647639] enabled = 1, wowlan = 0
> [  271.435558] enabled = 0, wowlan = 0
> [  271.435767] enabled = 0, wowlan = 0
> [  271.440125] enabled = 1, wowlan = 0
> [  271.440405] enabled = 1, wowlan = 0
>
> With even more instrumentation added, I did get a glitch before
> suspend/resume, but it came with more than two power setting updates.
> Logs and patch attached, complete with call stacks.
>

I don't see any callstacks?
Doesn't matter though.

> Is it possible that part of the problem is that the firmware doesn't
> like being spammed with power config changes so frequently?
>
Don't think so.
But the suspend / resume thing is not the kind of things that are fun to debug.
I am afraid that I will disable power save completely since I don't
really have the time to dig into that.
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Andy Lutomirski June 2, 2014, 4:54 p.m. UTC | #2

On Wed, May 28, 2014 at 5:09 AM, Emmanuel Grumbach <egrumbach@gmail.com> wrote:
>>
>> I doubt I can bisect -- the trigger was a new AP, not a new kernel.  I
>> can't exactly cut the AP in half :)
>
> I see.. This is really weird though. Anyway.
>
>>
>>
>> Pre-suspend, i.e., working:
>>
>> [   20.949900] enabled = 1, wowlan = 0
>> [   20.950177] enabled = 1, wowlan = 0
>> [   21.614016] enabled = 1, wowlan = 0
>> [   21.614658] enabled = 1, wowlan = 0
>> [   42.667586] enabled = 0, wowlan = 0
>> [   42.672514] enabled = 1, wowlan = 0
>> [   53.088165] fuse init (API version 7.23)
>> [   53.102082] SELinux: initialized (dev fuse, type fuse), uses genfs_contexts
>> [   53.130945] SELinux: initialized (dev fusectl, type fusectl), uses
>> genfs_contexts
>> [   85.627558] enabled = 0, wowlan = 0
>> [   85.631686] enabled = 1, wowlan = 0
>> [  134.649346] e1000e: em1 NIC Link is Down
>> [  137.682277] wlan0: deauthenticating from 02:c6:26:cc:b4:c7 by local
>> choice (Reason: 3=DEAUTH_LEAVING)
>> [  137.682780] enabled = 0, wowlan = 0
>> [  137.693889] enabled = 0, wowlan = 0
>>
>> Post-suspend, i.e., not working:
>>
>> [  144.406303] enabled = 1, wowlan = 0
>> [  144.406496] enabled = 1, wowlan = 0
>> [  145.026827] enabled = 1, wowlan = 0
>> [  145.028211] enabled = 1, wowlan = 0
>> [  165.688632] enabled = 0, wowlan = 0
>> [  165.689960] enabled = 0, wowlan = 0
>> [  165.693988] enabled = 1, wowlan = 0
>> [  165.694245] enabled = 1, wowlan = 0
>> [  208.641426] enabled = 0, wowlan = 0
>> [  208.641786] enabled = 0, wowlan = 0
>> [  208.647499] enabled = 1, wowlan = 0
>> [  208.647639] enabled = 1, wowlan = 0
>> [  271.435558] enabled = 0, wowlan = 0
>> [  271.435767] enabled = 0, wowlan = 0
>> [  271.440125] enabled = 1, wowlan = 0
>> [  271.440405] enabled = 1, wowlan = 0
>>
>> With even more instrumentation added, I did get a glitch before
>> suspend/resume, but it came with more than two power setting updates.
>> Logs and patch attached, complete with call stacks.
>>
>
> I don't see any callstacks?
> Doesn't matter though.

I think the callstacks were in the attachment.  I could have messed up, though.

Anyway, I don't buy the theory that this is caused by the firmware
going out to lunch.  The queues files in debugfs show the rx queue
chugging along and all of the tx queues have read_ptr == write_ptr.
Wireshark shows incoming broadcast traffic, too.  I'd guess that the
problem is more likely to be that the card is failing to wake up and
notice pending data in the TIM.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Andy Lutomirski June 2, 2014, 6:11 p.m. UTC | #3

On Mon, Jun 2, 2014 at 9:54 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Wed, May 28, 2014 at 5:09 AM, Emmanuel Grumbach <egrumbach@gmail.com> wrote:
>>>
>>> I doubt I can bisect -- the trigger was a new AP, not a new kernel.  I
>>> can't exactly cut the AP in half :)
>>
>> I see.. This is really weird though. Anyway.
>>
>>>
>>>
>>> Pre-suspend, i.e., working:
>>>
>>> [   20.949900] enabled = 1, wowlan = 0
>>> [   20.950177] enabled = 1, wowlan = 0
>>> [   21.614016] enabled = 1, wowlan = 0
>>> [   21.614658] enabled = 1, wowlan = 0
>>> [   42.667586] enabled = 0, wowlan = 0
>>> [   42.672514] enabled = 1, wowlan = 0
>>> [   53.088165] fuse init (API version 7.23)
>>> [   53.102082] SELinux: initialized (dev fuse, type fuse), uses genfs_contexts
>>> [   53.130945] SELinux: initialized (dev fusectl, type fusectl), uses
>>> genfs_contexts
>>> [   85.627558] enabled = 0, wowlan = 0
>>> [   85.631686] enabled = 1, wowlan = 0
>>> [  134.649346] e1000e: em1 NIC Link is Down
>>> [  137.682277] wlan0: deauthenticating from 02:c6:26:cc:b4:c7 by local
>>> choice (Reason: 3=DEAUTH_LEAVING)
>>> [  137.682780] enabled = 0, wowlan = 0
>>> [  137.693889] enabled = 0, wowlan = 0
>>>
>>> Post-suspend, i.e., not working:
>>>
>>> [  144.406303] enabled = 1, wowlan = 0
>>> [  144.406496] enabled = 1, wowlan = 0
>>> [  145.026827] enabled = 1, wowlan = 0
>>> [  145.028211] enabled = 1, wowlan = 0
>>> [  165.688632] enabled = 0, wowlan = 0
>>> [  165.689960] enabled = 0, wowlan = 0
>>> [  165.693988] enabled = 1, wowlan = 0
>>> [  165.694245] enabled = 1, wowlan = 0
>>> [  208.641426] enabled = 0, wowlan = 0
>>> [  208.641786] enabled = 0, wowlan = 0
>>> [  208.647499] enabled = 1, wowlan = 0
>>> [  208.647639] enabled = 1, wowlan = 0
>>> [  271.435558] enabled = 0, wowlan = 0
>>> [  271.435767] enabled = 0, wowlan = 0
>>> [  271.440125] enabled = 1, wowlan = 0
>>> [  271.440405] enabled = 1, wowlan = 0
>>>
>>> With even more instrumentation added, I did get a glitch before
>>> suspend/resume, but it came with more than two power setting updates.
>>> Logs and patch attached, complete with call stacks.
>>>
>>
>> I don't see any callstacks?
>> Doesn't matter though.
>
> I think the callstacks were in the attachment.  I could have messed up, though.
>
> Anyway, I don't buy the theory that this is caused by the firmware
> going out to lunch.  The queues files in debugfs show the rx queue
> chugging along and all of the tx queues have read_ptr == write_ptr.
> Wireshark shows incoming broadcast traffic, too.  I'd guess that the
> problem is more likely to be that the card is failing to wake up and
> notice pending data in the TIM.

OTOH, with iwlwifi.11n_disable=4 (no rx A-MPDU), I seem to be doing
pretty well.  I'll test a stock kernel configured like that for the
next few days.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Emmanuel Grumbach June 2, 2014, 6:40 p.m. UTC | #4

On 06/02/2014 09:11 PM, Andy Lutomirski wrote:
> On Mon, Jun 2, 2014 at 9:54 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Wed, May 28, 2014 at 5:09 AM, Emmanuel Grumbach <egrumbach@gmail.com> wrote:
>>>>
>>>> I doubt I can bisect -- the trigger was a new AP, not a new kernel.  I
>>>> can't exactly cut the AP in half :)
>>>
>>> I see.. This is really weird though. Anyway.
>>>
>>>>
>>>>
>>>> Pre-suspend, i.e., working:
>>>>
>>>> [   20.949900] enabled = 1, wowlan = 0
>>>> [   20.950177] enabled = 1, wowlan = 0
>>>> [   21.614016] enabled = 1, wowlan = 0
>>>> [   21.614658] enabled = 1, wowlan = 0
>>>> [   42.667586] enabled = 0, wowlan = 0
>>>> [   42.672514] enabled = 1, wowlan = 0
>>>> [   53.088165] fuse init (API version 7.23)
>>>> [   53.102082] SELinux: initialized (dev fuse, type fuse), uses genfs_contexts
>>>> [   53.130945] SELinux: initialized (dev fusectl, type fusectl), uses
>>>> genfs_contexts
>>>> [   85.627558] enabled = 0, wowlan = 0
>>>> [   85.631686] enabled = 1, wowlan = 0
>>>> [  134.649346] e1000e: em1 NIC Link is Down
>>>> [  137.682277] wlan0: deauthenticating from 02:c6:26:cc:b4:c7 by local
>>>> choice (Reason: 3=DEAUTH_LEAVING)
>>>> [  137.682780] enabled = 0, wowlan = 0
>>>> [  137.693889] enabled = 0, wowlan = 0
>>>>
>>>> Post-suspend, i.e., not working:
>>>>
>>>> [  144.406303] enabled = 1, wowlan = 0
>>>> [  144.406496] enabled = 1, wowlan = 0
>>>> [  145.026827] enabled = 1, wowlan = 0
>>>> [  145.028211] enabled = 1, wowlan = 0
>>>> [  165.688632] enabled = 0, wowlan = 0
>>>> [  165.689960] enabled = 0, wowlan = 0
>>>> [  165.693988] enabled = 1, wowlan = 0
>>>> [  165.694245] enabled = 1, wowlan = 0
>>>> [  208.641426] enabled = 0, wowlan = 0
>>>> [  208.641786] enabled = 0, wowlan = 0
>>>> [  208.647499] enabled = 1, wowlan = 0
>>>> [  208.647639] enabled = 1, wowlan = 0
>>>> [  271.435558] enabled = 0, wowlan = 0
>>>> [  271.435767] enabled = 0, wowlan = 0
>>>> [  271.440125] enabled = 1, wowlan = 0
>>>> [  271.440405] enabled = 1, wowlan = 0
>>>>
>>>> With even more instrumentation added, I did get a glitch before
>>>> suspend/resume, but it came with more than two power setting updates.
>>>> Logs and patch attached, complete with call stacks.
>>>>
>>>
>>> I don't see any callstacks?
>>> Doesn't matter though.
>>
>> I think the callstacks were in the attachment.  I could have messed up, though.
>>
>> Anyway, I don't buy the theory that this is caused by the firmware
>> going out to lunch.  The queues files in debugfs show the rx queue
>> chugging along and all of the tx queues have read_ptr == write_ptr.
>> Wireshark shows incoming broadcast traffic, too.  I'd guess that the
>> problem is more likely to be that the card is failing to wake up and
>> notice pending data in the TIM.

Well... I might have been unclear here (I never know how much detail I should share with the recipient :)).
From your log it appears that the NIC is in power save. So we can't increment the write pointer of the Tx ring (add a packet for transmission). So we simply remember that we need to do so (increment the write pointer) and request a wakeup so that we will update the write pointer in the wakeup interrupt... which doesn't happen.
No power save - no need for wakeup interrupt.

> 
> OTOH, with iwlwifi.11n_disable=4 (no rx A-MPDU), I seem to be doing
> pretty well.  I'll test a stock kernel configured like that for the
> next few days.
> 

That's interesting...
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Andy Lutomirski June 2, 2014, 6:45 p.m. UTC | #5

On Mon, Jun 2, 2014 at 11:40 AM, Emmanuel Grumbach <egrumbach@gmail.com> wrote:
>
>
> On 06/02/2014 09:11 PM, Andy Lutomirski wrote:
>> On Mon, Jun 2, 2014 at 9:54 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Wed, May 28, 2014 at 5:09 AM, Emmanuel Grumbach <egrumbach@gmail.com> wrote:
>>>>>
>>>>> I doubt I can bisect -- the trigger was a new AP, not a new kernel.  I
>>>>> can't exactly cut the AP in half :)
>>>>
>>>> I see.. This is really weird though. Anyway.
>>>>
>>>>>
>>>>>
>>>>> Pre-suspend, i.e., working:
>>>>>
>>>>> [   20.949900] enabled = 1, wowlan = 0
>>>>> [   20.950177] enabled = 1, wowlan = 0
>>>>> [   21.614016] enabled = 1, wowlan = 0
>>>>> [   21.614658] enabled = 1, wowlan = 0
>>>>> [   42.667586] enabled = 0, wowlan = 0
>>>>> [   42.672514] enabled = 1, wowlan = 0
>>>>> [   53.088165] fuse init (API version 7.23)
>>>>> [   53.102082] SELinux: initialized (dev fuse, type fuse), uses genfs_contexts
>>>>> [   53.130945] SELinux: initialized (dev fusectl, type fusectl), uses
>>>>> genfs_contexts
>>>>> [   85.627558] enabled = 0, wowlan = 0
>>>>> [   85.631686] enabled = 1, wowlan = 0
>>>>> [  134.649346] e1000e: em1 NIC Link is Down
>>>>> [  137.682277] wlan0: deauthenticating from 02:c6:26:cc:b4:c7 by local
>>>>> choice (Reason: 3=DEAUTH_LEAVING)
>>>>> [  137.682780] enabled = 0, wowlan = 0
>>>>> [  137.693889] enabled = 0, wowlan = 0
>>>>>
>>>>> Post-suspend, i.e., not working:
>>>>>
>>>>> [  144.406303] enabled = 1, wowlan = 0
>>>>> [  144.406496] enabled = 1, wowlan = 0
>>>>> [  145.026827] enabled = 1, wowlan = 0
>>>>> [  145.028211] enabled = 1, wowlan = 0
>>>>> [  165.688632] enabled = 0, wowlan = 0
>>>>> [  165.689960] enabled = 0, wowlan = 0
>>>>> [  165.693988] enabled = 1, wowlan = 0
>>>>> [  165.694245] enabled = 1, wowlan = 0
>>>>> [  208.641426] enabled = 0, wowlan = 0
>>>>> [  208.641786] enabled = 0, wowlan = 0
>>>>> [  208.647499] enabled = 1, wowlan = 0
>>>>> [  208.647639] enabled = 1, wowlan = 0
>>>>> [  271.435558] enabled = 0, wowlan = 0
>>>>> [  271.435767] enabled = 0, wowlan = 0
>>>>> [  271.440125] enabled = 1, wowlan = 0
>>>>> [  271.440405] enabled = 1, wowlan = 0
>>>>>
>>>>> With even more instrumentation added, I did get a glitch before
>>>>> suspend/resume, but it came with more than two power setting updates.
>>>>> Logs and patch attached, complete with call stacks.
>>>>>
>>>>
>>>> I don't see any callstacks?
>>>> Doesn't matter though.
>>>
>>> I think the callstacks were in the attachment.  I could have messed up, though.
>>>
>>> Anyway, I don't buy the theory that this is caused by the firmware
>>> going out to lunch.  The queues files in debugfs show the rx queue
>>> chugging along and all of the tx queues have read_ptr == write_ptr.
>>> Wireshark shows incoming broadcast traffic, too.  I'd guess that the
>>> problem is more likely to be that the card is failing to wake up and
>>> notice pending data in the TIM.
>
> Well... I might have been unclear here (I never know how much detail I should share with the recipient :)).
> From your log it appears that the NIC is in power save. So we can't increment the write pointer of the Tx ring (add a packet for transmission). So we simply remember that we need to do so (increment the write pointer) and request a wakeup so that we will update the write pointer in the wakeup interrupt... which doesn't happen.
> No power save - no need for wakeup interrupt.

I'm still unconvinced.  One of the tx queues actually has a both
read_ptr and write_ptr incrementing once or twice per second even when
I can't ping the gateway.  Can you point me at the right code or log
stuff to look at?

>
>>
>> OTOH, with iwlwifi.11n_disable=4 (no rx A-MPDU), I seem to be doing
>> pretty well.  I'll test a stock kernel configured like that for the
>> next few days.
>>
>
> That's interesting...

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Emmanuel Grumbach June 2, 2014, 6:53 p.m. UTC | #6

>>>>
>>>> Anyway, I don't buy the theory that this is caused by the firmware
>>>> going out to lunch.  The queues files in debugfs show the rx queue
>>>> chugging along and all of the tx queues have read_ptr == write_ptr.
>>>> Wireshark shows incoming broadcast traffic, too.  I'd guess that the
>>>> problem is more likely to be that the card is failing to wake up and
>>>> notice pending data in the TIM.
>>
>> Well... I might have been unclear here (I never know how much detail I should share with the recipient :)).
>> From your log it appears that the NIC is in power save. So we can't increment the write pointer of the Tx ring (add a packet for transmission). So we simply remember that we need to do so (increment the write pointer) and request a wakeup so that we will update the write pointer in the wakeup interrupt... which doesn't happen.
>> No power save - no need for wakeup interrupt.
> 
> I'm still unconvinced.  One of the tx queues actually has a both
> read_ptr and write_ptr incrementing once or twice per second even when
> I can't ping the gateway.  Can you point me at the right code or log
> stuff to look at?

drivers/net/wireless/iwlwifi/pcie/tx.c:
static void iwl_pcie_txq_inc_wr_ptr(struct iwl_trans *trans,
                                    struct iwl_txq *txq)
[snip]

                if (reg & CSR_UCODE_DRV_GP1_BIT_MAC_SLEEP) {
                        IWL_DEBUG_INFO(trans, "Tx queue %d requesting wakeup, GP1 = 0x%x\n",
                                       txq_id, reg);
                        iwl_set_bit(trans, CSR_GP_CNTRL,
                                    CSR_GP_CNTRL_REG_FLAG_MAC_ACCESS_REQ);
                        txq->need_update = true;
                        return;
                }
[snip]
}

ISR - drivers/net/wireless/iwlwifi/pcie/tx.c:
irqreturn_t iwl_pcie_irq_handler(int irq, void *dev_id)
{
[snip]
        /* uCode wakes up after power-down sleep */
        if (inta & CSR_INT_BIT_WAKEUP) {
                IWL_DEBUG_ISR(trans, "Wakeup interrupt\n");
                iwl_pcie_rxq_check_wrptr(trans);
                iwl_pcie_txq_check_wrptrs(trans);

                isr_stats->wakeup++;

                handled |= CSR_INT_BIT_WAKEUP;
        }
[snip]
}

drivers/net/wireless/iwlwifi/pcie/tx.c:
void iwl_pcie_txq_check_wrptrs(struct iwl_trans *trans)
{
        struct iwl_trans_pcie *trans_pcie = IWL_TRANS_GET_PCIE_TRANS(trans);
        int i;

        for (i = 0; i < trans->cfg->base_params->num_of_queues; i++) {
                struct iwl_txq *txq = &trans_pcie->txq[i];

                spin_lock_bh(&txq->lock);
                if (trans_pcie->txq[i].need_update) {
                        iwl_pcie_txq_inc_wr_ptr(trans, txq);
                        trans_pcie->txq[i].need_update = false;
                }
                spin_unlock_bh(&txq->lock);
        }
}

Note that the CMD queue (9 or 4 depending on your configuration) - uses another mechanism that is safer but consumes more power. This queue is intended for commands and not for real Tx packets.


> 
>>
>>>
>>> OTOH, with iwlwifi.11n_disable=4 (no rx A-MPDU), I seem to be doing
>>> pretty well.  I'll test a stock kernel configured like that for the
>>> next few days.
>>>
>>
>> That's interesting...
> 
> --Andy
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Andy Lutomirski June 2, 2014, 6:56 p.m. UTC | #7

On Mon, Jun 2, 2014 at 11:53 AM, Emmanuel Grumbach <egrumbach@gmail.com> wrote:
>
>>>>>
>>>>> Anyway, I don't buy the theory that this is caused by the firmware
>>>>> going out to lunch.  The queues files in debugfs show the rx queue
>>>>> chugging along and all of the tx queues have read_ptr == write_ptr.
>>>>> Wireshark shows incoming broadcast traffic, too.  I'd guess that the
>>>>> problem is more likely to be that the card is failing to wake up and
>>>>> notice pending data in the TIM.
>>>
>>> Well... I might have been unclear here (I never know how much detail I should share with the recipient :)).
>>> From your log it appears that the NIC is in power save. So we can't increment the write pointer of the Tx ring (add a packet for transmission). So we simply remember that we need to do so (increment the write pointer) and request a wakeup so that we will update the write pointer in the wakeup interrupt... which doesn't happen.
>>> No power save - no need for wakeup interrupt.
>>
>> I'm still unconvinced.  One of the tx queues actually has a both
>> read_ptr and write_ptr incrementing once or twice per second even when
>> I can't ping the gateway.  Can you point me at the right code or log
>> stuff to look at?
>
> drivers/net/wireless/iwlwifi/pcie/tx.c:
> static void iwl_pcie_txq_inc_wr_ptr(struct iwl_trans *trans,
>                                     struct iwl_txq *txq)
> [snip]
>
>                 if (reg & CSR_UCODE_DRV_GP1_BIT_MAC_SLEEP) {
>                         IWL_DEBUG_INFO(trans, "Tx queue %d requesting wakeup, GP1 = 0x%x\n",
>                                        txq_id, reg);
>                         iwl_set_bit(trans, CSR_GP_CNTRL,
>                                     CSR_GP_CNTRL_REG_FLAG_MAC_ACCESS_REQ);
>                         txq->need_update = true;
>                         return;
>                 }
> [snip]
> }
>
> ISR - drivers/net/wireless/iwlwifi/pcie/tx.c:
> irqreturn_t iwl_pcie_irq_handler(int irq, void *dev_id)
> {
> [snip]
>         /* uCode wakes up after power-down sleep */
>         if (inta & CSR_INT_BIT_WAKEUP) {
>                 IWL_DEBUG_ISR(trans, "Wakeup interrupt\n");
>                 iwl_pcie_rxq_check_wrptr(trans);
>                 iwl_pcie_txq_check_wrptrs(trans);
>
>                 isr_stats->wakeup++;
>
>                 handled |= CSR_INT_BIT_WAKEUP;
>         }
> [snip]
> }
>
> drivers/net/wireless/iwlwifi/pcie/tx.c:
> void iwl_pcie_txq_check_wrptrs(struct iwl_trans *trans)
> {
>         struct iwl_trans_pcie *trans_pcie = IWL_TRANS_GET_PCIE_TRANS(trans);
>         int i;
>
>         for (i = 0; i < trans->cfg->base_params->num_of_queues; i++) {
>                 struct iwl_txq *txq = &trans_pcie->txq[i];
>
>                 spin_lock_bh(&txq->lock);
>                 if (trans_pcie->txq[i].need_update) {
>                         iwl_pcie_txq_inc_wr_ptr(trans, txq);
>                         trans_pcie->txq[i].need_update = false;
>                 }
>                 spin_unlock_bh(&txq->lock);
>         }
> }
>
> Note that the CMD queue (9 or 4 depending on your configuration) - uses another mechanism that is safer but consumes more power. This queue is intended for commands and not for real Tx packets.

If that's queue 4 in the debugfs tx_queues file, then I think that
that's the queue that keeps moving when the card is dead.  I'll see if
I can improve the tx_queues file to show pending updates.

Yay troubleshooting old hardware.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Emmanuel Grumbach June 2, 2014, 7:32 p.m. UTC | #8

>>
>>>>>>
>>>>>> Anyway, I don't buy the theory that this is caused by the firmware
>>>>>> going out to lunch.  The queues files in debugfs show the rx queue
>>>>>> chugging along and all of the tx queues have read_ptr == write_ptr.
>>>>>> Wireshark shows incoming broadcast traffic, too.  I'd guess that the
>>>>>> problem is more likely to be that the card is failing to wake up and
>>>>>> notice pending data in the TIM.
>>>>
>>>> Well... I might have been unclear here (I never know how much detail I should share with the recipient :)).
>>>> From your log it appears that the NIC is in power save. So we can't increment the write pointer of the Tx ring (add a packet for transmission). So we simply remember that we need to do so (increment the write pointer) and request a wakeup so that we will update the write pointer in the wakeup interrupt... which doesn't happen.
>>>> No power save - no need for wakeup interrupt.
>>>
>>> I'm still unconvinced.  One of the tx queues actually has a both
>>> read_ptr and write_ptr incrementing once or twice per second even when
>>> I can't ping the gateway.  Can you point me at the right code or log
>>> stuff to look at?
>>
>> drivers/net/wireless/iwlwifi/pcie/tx.c:
>> static void iwl_pcie_txq_inc_wr_ptr(struct iwl_trans *trans,
>>                                     struct iwl_txq *txq)
>> [snip]
>>
>>                 if (reg & CSR_UCODE_DRV_GP1_BIT_MAC_SLEEP) {
>>                         IWL_DEBUG_INFO(trans, "Tx queue %d requesting wakeup, GP1 = 0x%x\n",
>>                                        txq_id, reg);
>>                         iwl_set_bit(trans, CSR_GP_CNTRL,
>>                                     CSR_GP_CNTRL_REG_FLAG_MAC_ACCESS_REQ);
>>                         txq->need_update = true;
>>                         return;
>>                 }
>> [snip]
>> }
>>
>> ISR - drivers/net/wireless/iwlwifi/pcie/tx.c:
>> irqreturn_t iwl_pcie_irq_handler(int irq, void *dev_id)
>> {
>> [snip]
>>         /* uCode wakes up after power-down sleep */
>>         if (inta & CSR_INT_BIT_WAKEUP) {
>>                 IWL_DEBUG_ISR(trans, "Wakeup interrupt\n");
>>                 iwl_pcie_rxq_check_wrptr(trans);
>>                 iwl_pcie_txq_check_wrptrs(trans);
>>
>>                 isr_stats->wakeup++;
>>
>>                 handled |= CSR_INT_BIT_WAKEUP;
>>         }
>> [snip]
>> }
>>
>> drivers/net/wireless/iwlwifi/pcie/tx.c:
>> void iwl_pcie_txq_check_wrptrs(struct iwl_trans *trans)
>> {
>>         struct iwl_trans_pcie *trans_pcie = IWL_TRANS_GET_PCIE_TRANS(trans);
>>         int i;
>>
>>         for (i = 0; i < trans->cfg->base_params->num_of_queues; i++) {
>>                 struct iwl_txq *txq = &trans_pcie->txq[i];
>>
>>                 spin_lock_bh(&txq->lock);
>>                 if (trans_pcie->txq[i].need_update) {
>>                         iwl_pcie_txq_inc_wr_ptr(trans, txq);
>>                         trans_pcie->txq[i].need_update = false;
>>                 }
>>                 spin_unlock_bh(&txq->lock);
>>         }
>> }
>>
>> Note that the CMD queue (9 or 4 depending on your configuration) - uses another mechanism that is safer but consumes more power. This queue is intended for commands and not for real Tx packets.
>
> If that's queue 4 in the debugfs tx_queues file, then I think that
> that's the queue that keeps moving when the card is dead.  I'll see if
> I can improve the tx_queues file to show pending updates.
>
> Yay troubleshooting old hardware.
>
If you plan to spend some time hacking on the code, please take into
account this code has been changed in 3.14 (mostly cleanups by
Johannes and myself).
Also note that long ago, I disabled shadow registers (you'll see ifs
all over the code). Shadow registers makes the whole wakup thing much
easier since it is handled internally in the HW. BUT we had tons of
issues with it so we disabled it. Most of the issues should have gone
away with the cmd_in_flight thing. So I would recommend you to give it
a try with shadow register enabled. Just check the relevant iwl-XXXX.c
file.
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Andy Lutomirski June 2, 2014, 7:57 p.m. UTC | #9

On Mon, Jun 2, 2014 at 12:32 PM, Emmanuel Grumbach <egrumbach@gmail.com> wrote:
>>> Note that the CMD queue (9 or 4 depending on your configuration) - uses another mechanism that is safer but consumes more power. This queue is intended for commands and not for real Tx packets.
>>
>> If that's queue 4 in the debugfs tx_queues file, then I think that
>> that's the queue that keeps moving when the card is dead.  I'll see if
>> I can improve the tx_queues file to show pending updates.
>>
>> Yay troubleshooting old hardware.
>>
> If you plan to spend some time hacking on the code, please take into
> account this code has been changed in 3.14 (mostly cleanups by
> Johannes and myself).
> Also note that long ago, I disabled shadow registers (you'll see ifs
> all over the code). Shadow registers makes the whole wakup thing much
> easier since it is handled internally in the HW. BUT we had tons of
> issues with it so we disabled it. Most of the issues should have gone
> away with the cmd_in_flight thing. So I would recommend you to give it
> a try with shadow register enabled. Just check the relevant iwl-XXXX.c
> file.

Will do.

FWIW, I looked at the cmd_in_flight thing and, unless there's a bunch
of locking magic going on, it looks racy.  It seems like some code
expects it to be protected by reg_lock, and other code expects it to
be protected by txq_lock (and there are multiple txq->lock instances,
I think).

Also, what stops the card from going to sleep while processing a
wakeup interrupt?  For example, what prevents this chain of events?
The only thing I can see is that the only thing that clears ACCESS_REQ
is the rx interrupt processing, which happens after wakeup processing,
but this seems rather fragile.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Emmanuel Grumbach June 2, 2014, 8:15 p.m. UTC | #10

>>> If that's queue 4 in the debugfs tx_queues file, then I think that
>>> that's the queue that keeps moving when the card is dead.  I'll see if
>>> I can improve the tx_queues file to show pending updates.
>>>
>>> Yay troubleshooting old hardware.
>>>
>> If you plan to spend some time hacking on the code, please take into
>> account this code has been changed in 3.14 (mostly cleanups by
>> Johannes and myself).
>> Also note that long ago, I disabled shadow registers (you'll see ifs
>> all over the code). Shadow registers makes the whole wakup thing much
>> easier since it is handled internally in the HW. BUT we had tons of
>> issues with it so we disabled it. Most of the issues should have gone
>> away with the cmd_in_flight thing. So I would recommend you to give it
>> a try with shadow register enabled. Just check the relevant iwl-XXXX.c
>> file.
>
> Will do.
>
> FWIW, I looked at the cmd_in_flight thing and, unless there's a bunch
> of locking magic going on, it looks racy.  It seems like some code
> expects it to be protected by reg_lock, and other code expects it to
> be protected by txq_lock (and there are multiple txq->lock instances,
> I think).

:)
Yeah - there is lockdep magic there.
In any case, we set it when we enqueue and host command and clear it
when the last command is being responded.
When we set it, we request access, and vice versa.

>
> Also, what stops the card from going to sleep while processing a
> wakeup interrupt?  For example, what prevents this chain of events?
> The only thing I can see is that the only thing that clears ACCESS_REQ
> is the rx interrupt processing, which happens after wakeup processing,
> but this seems rather fragile.
>

As long as ACCESS_REQ isn't cleared, the NIC is awake. We set it when
we we want to wakeup the NIC.
I can't really find where we clear it right now, but I am half sleeping :)
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Help debugging iwldvm / ath10k stalls

Commit Message

Comments

Patch