diff mbox

ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue...?

Message ID 20180528125039.GA22122@redhat.com (mailing list archive)
State RFC
Headers show

Commit Message

Stanislaw Gruszka May 28, 2018, 12:50 p.m. UTC
I have some updates here.

First thing is that printing messages by printk can cause few seconds 
of cpu hang on my mt7620 hardware, what result on wifi disconnects.
I realized that when added debug patch like this:


solved the problem and improved performance when connecting with this
older rt2800 usb device. Even if other rt2800 usb devices respond with
correct solicit BA SSN this change is beneficial when connecting two
rt2800 devices. Unfortunately sending BAR seems to be needed for
some stations, so this need to be investigated a bit more.

Another issue is that sometimes we do not sent rate probe frames,
I posted fix for that (revert of my older change):
https://marc.info/?l=linux-wireless&m=152750671715369&w=2 
This patch improve performance when mt7620 is connected with Intel iwl7265
device. With the patch I have 70 Mbits/sec compared to about 17 Mbits/sec
before. Perhaps it also fix stability issues.

Stanislaw

Comments

Daniel Golle May 28, 2018, 1:54 p.m. UTC | #1
Hi!

On Mon, May 28, 2018 at 02:50:40PM +0200, Stanislaw Gruszka wrote:
> I have some updates here.
> 
> ...
> Another issue is that sometimes we do not sent rate probe frames,
> I posted fix for that (revert of my older change):
> https://marc.info/?l=linux-wireless&m=152750671715369&w=2 
> This patch improve performance when mt7620 is connected with Intel iwl7265
> device. With the patch I have 70 Mbits/sec compared to about 17 Mbits/sec
> before. Perhaps it also fix stability issues.

Oh goodness, I didn't expect anything like that to still happen at
this stage. Thanks a lot for investigating the issue!
I've imported the patch to OpenWrt git:
https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=f4a639a3d7d40b4f63c431c2d554c479fbcc6b74

Once we got some feedback from users (please test!) we can push that
also to still be part of the openwrt-18.06 release.

Cheers

Daniel
Stanislaw Gruszka Aug. 15, 2018, 11:40 a.m. UTC | #2
On Mon, May 28, 2018 at 02:50:39PM +0200, Stanislaw Gruszka wrote:
> I have some updates here.

And now more updates. I have 5 patches to test here:
https://github.com/sgruszka/wireless-drivers-next/commits/rt2800-draft-v2

as reported by T-Bone here:
https://bugzilla.kernel.org/show_bug.cgi?id=82751
first 3 patches made Netgear WN3000RPv3 router workable.
Another 2 should be further improvement, but were not tested broadly
and can have some bugs.

Daniel, could you apply this on your staging tree for testing?
Please also remove:
600-23-rt2x00-rt2800mmio-add-a-workaround-for-spurious-TX_F.patch
991-rt2800_change_rx_ampdu_factor.patch
992-rt2800_change_ba_size.patch
993-rt2800_change_rx_ampdu_density.patch
Those are not needed and can be harmful with the test patches,
(especially spurious interrupt one, patches will not apply cleanly 
with it).

Patches should make "Dropping frame due to full tx queue" and
"Queue 2 failed to flush" errors gone. However if device is somewhat
wrongly configured by driver at low level (i.e. wrong MAC/BBP/RF
registers programing) there still will be troubles to have stable
and fast wireless connection.

On TODO site, the proper watchdog should be added, but I'm not sure
how to exactly detect device/firmware hung and how exactly device should
be rested (by some HW reset by register or by doing full reinitialization).

Another thing is fixing RATE_PROBE frames which are aggregated with
other frames and not sent at requested rate. I implemented qsel queue patch
similar to mt76, but this not work as expected on older Ralink chips.  
https://github.com/sgruszka/wireless-drivers-next/commit/846d205edd8c36d1b7828fee54bf4cf40bf8cb1a

Any help with those 2 problems are welcome.
 
Thanks
Stanislaw
Daniel Golle Aug. 15, 2018, 10:35 p.m. UTC | #3
Hi Stanislaw,

On Wed, Aug 15, 2018 at 01:40:30PM +0200, Stanislaw Gruszka wrote:
> On Mon, May 28, 2018 at 02:50:39PM +0200, Stanislaw Gruszka wrote:
> > I have some updates here.
> 
> And now more updates. I have 5 patches to test here:
> https://github.com/sgruszka/wireless-drivers-next/commits/rt2800-draft-v2
> 
> as reported by T-Bone here:
> https://bugzilla.kernel.org/show_bug.cgi?id=82751
> first 3 patches made Netgear WN3000RPv3 router workable.
> Another 2 should be further improvement, but were not tested broadly
> and can have some bugs.
> 
> Daniel, could you apply this on your staging tree for testing?

Done and rebased on top of blogic's tree with mac80211 from v4.18-rc7
instead of (outdated) wireless-testing.
Only compile-tested by myself for now, maybe someone can run a build on
actual MT7620 hardware?

Clone the master branch of
https://git.openwrt.org/openwrt/staging/dangole.git
and give it shot and let us know the results.

> Please also remove:
> 600-23-rt2x00-rt2800mmio-add-a-workaround-for-spurious-TX_F.patch
> 991-rt2800_change_rx_ampdu_factor.patch
> 992-rt2800_change_ba_size.patch
> 993-rt2800_change_rx_ampdu_density.patch
> Those are not needed and can be harmful with the test patches,
> (especially spurious interrupt one, patches will not apply cleanly 
> with it).

Ack. Hope it doesn't break Rt3883 and/or Rt3663 for which that patch
was added by Gabor Juhos a decade ago...

> 
> Patches should make "Dropping frame due to full tx queue" and
> "Queue 2 failed to flush" errors gone. However if device is somewhat
> wrongly configured by driver at low level (i.e. wrong MAC/BBP/RF
> registers programing) there still will be troubles to have stable
> and fast wireless connection.
> 
> On TODO site, the proper watchdog should be added, but I'm not sure
> how to exactly detect device/firmware hung and how exactly device should
> be rested (by some HW reset by register or by doing full reinitialization).
> 
> Another thing is fixing RATE_PROBE frames which are aggregated with
> other frames and not sent at requested rate. I implemented qsel queue patch
> similar to mt76, but this not work as expected on older Ralink chips.  
> https://github.com/sgruszka/wireless-drivers-next/commit/846d205edd8c36d1b7828fee54bf4cf40bf8cb1a

Which hardware did you try? Just so I can reproduce what's going on
and maybe help fixing it...


Cheers

Daniel

> 
> Any help with those 2 problems are welcome.
>  
> Thanks
> Stanislaw
>
Stanislaw Gruszka Aug. 16, 2018, 11:01 a.m. UTC | #4
Hello

On Thu, Aug 16, 2018 at 12:35:29AM +0200, Daniel Golle wrote:
> Clone the master branch of
> https://git.openwrt.org/openwrt/staging/dangole.git
> and give it shot and let us know the results.

Thanks!

> > Please also remove:
> > 600-23-rt2x00-rt2800mmio-add-a-workaround-for-spurious-TX_F.patch
> > 991-rt2800_change_rx_ampdu_factor.patch
> > 992-rt2800_change_ba_size.patch
> > 993-rt2800_change_rx_ampdu_density.patch
> > Those are not needed and can be harmful with the test patches,
> > (especially spurious interrupt one, patches will not apply cleanly 
> > with it).
> 
> Ack. Hope it doesn't break Rt3883 and/or Rt3663 for which that patch
> was added by Gabor Juhos a decade ago...

The new patches change the way we handle TX status interrupt. I think
spurious interrupts were a problem because we do not disable the IRQ 
and read statuses, so we could get interrupt after we empty TX_STA_FIFO
register. With current implementation this should not be an issue. 

> > Another thing is fixing RATE_PROBE frames which are aggregated with
> > other frames and not sent at requested rate. I implemented qsel queue patch
> > similar to mt76, but this not work as expected on older Ralink chips.  
> > https://github.com/sgruszka/wireless-drivers-next/commit/846d205edd8c36d1b7828fee54bf4cf40bf8cb1a
> 
> Which hardware did you try? Just so I can reproduce what's going on
> and maybe help fixing it...

I tested on RT3062 PCI and some USB dongles, don't remember chip version.
Basically HW become unresponsive after sent some traffic.

I debugged the problem by printing sequence number and rate of RATE PROBE
frames by something like this:

diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00queue.c b/drivers/net/wireless/ralink/rt2x00/rt2x00queue.c
index 92ddc19..d7f250b 100644
--- a/drivers/net/wireless/ralink/rt2x00/rt2x00queue.c
+++ b/drivers/net/wireless/ralink/rt2x00/rt2x00queue.c
@@ -334,6 +334,12 @@ static void rt2x00queue_create_tx_descriptor_ht(struct rt2x00_dev *rt2x00dev,
 			txdesc->u.ht.mcs |= 0x08;
 	}
 
+	if (tx_info->flags & IEEE80211_TX_CTL_RATE_CTRL_PROBE) {
+		int ssn = le16_to_cpu(hdr->seq_ctrl) >> 4;
+		printk("RATE PROBE %d %d\n", ssn, txdesc->u.ht.mcs);
+	}
+
+
 	if (test_bit(CONFIG_HT_DISABLED, &rt2x00dev->flags)) {
 		if (!(tx_info->flags & IEEE80211_TX_CTL_FIRST_FRAGMENT))
 			txdesc->u.ht.txop = TXOP_SIFS;

and compared that with on air traffic captured in monitor mode on
different device in wireshark. On air traffic showed different 
probe frames rate than requested.

Regards
Stanislaw
Daniel Golle Aug. 18, 2018, 4:08 p.m. UTC | #5
Still seeing a lot of those

Sat Aug 18 16:05:51 2018 kern.err kernel: [ 8364.972151] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
Sat Aug 18 16:05:51 2018 kern.err kernel: [ 8364.981622] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
Sat Aug 18 16:05:51 2018 kern.err kernel: [ 8364.991070] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
Sat Aug 18 16:05:51 2018 kern.err kernel: [ 8365.000506] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
Sat Aug 18 16:05:51 2018 kern.err kernel: [ 8365.009935] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2

while wifi somehow keeps working at degraded level once the error
occured.

On Thu, Aug 16, 2018 at 07:30:40PM +0000, Craig Matsuura wrote:
> FWIW we would use iperf and push a lot of data to a STA and would cause lock ups and issues with the queues.
> 
> 
> Craig
> 
> 
> Craig Matsuura • Technical Director, Embedded Software Architecture
> 
> cmatsuura@vivint.com<mailto:cmatsuura@vivint.com> • P: 801.229.6005
> 
> 
> 
> simply smarter • vivint.com
> 
> 3401 N Ashton Blvd. Lehi, UT 84043
> 
> 
> 
> [1497369905956_vivint-logo-orange.png]
> 
> ________________________________
> From: linux-wireless-owner@vger.kernel.org <linux-wireless-owner@vger.kernel.org> on behalf of Stanislaw Gruszka <sgruszka@redhat.com>
> Sent: Thursday, August 16, 2018 5:01:04 AM
> To: Daniel Golle
> Cc: Kofi Agor; Enrico Mioso; Craig Matsuura; Mathias Kresin; Tom Psyborg; linux-wireless; John Crispin; Felix Fietkau; Jamie Stuart
> Subject: Re: ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue...?
> 
> Hello
> 
> On Thu, Aug 16, 2018 at 12:35:29AM +0200, Daniel Golle wrote:
> > Clone the master branch of
> > https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgit.openwrt.org%2Fopenwrt%2Fstaging%2Fdangole.git&amp;data=01%7C01%7Ccmatsuura%40vivint.com%7C259b5ae9949e445dd9fe08d603680486%7C54cc98ca024a470185483741e3b8d59d%7C0&amp;sdata=8mOOSNRnqvjDVnCNHH15c%2BfEzr69yPpCkbSdD6Q8ZuA%3D&amp;reserved=0
> > and give it shot and let us know the results.
> 
> Thanks!
> 
> > > Please also remove:
> > > 600-23-rt2x00-rt2800mmio-add-a-workaround-for-spurious-TX_F.patch
> > > 991-rt2800_change_rx_ampdu_factor.patch
> > > 992-rt2800_change_ba_size.patch
> > > 993-rt2800_change_rx_ampdu_density.patch
> > > Those are not needed and can be harmful with the test patches,
> > > (especially spurious interrupt one, patches will not apply cleanly
> > > with it).
> >
> > Ack. Hope it doesn't break Rt3883 and/or Rt3663 for which that patch
> > was added by Gabor Juhos a decade ago...
> 
> The new patches change the way we handle TX status interrupt. I think
> spurious interrupts were a problem because we do not disable the IRQ
> and read statuses, so we could get interrupt after we empty TX_STA_FIFO
> register. With current implementation this should not be an issue.
> 
> > > Another thing is fixing RATE_PROBE frames which are aggregated with
> > > other frames and not sent at requested rate. I implemented qsel queue patch
> > > similar to mt76, but this not work as expected on older Ralink chips.
> > > https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fsgruszka%2Fwireless-drivers-next%2Fcommit%2F846d205edd8c36d1b7828fee54bf4cf40bf8cb1a&amp;data=01%7C01%7Ccmatsuura%40vivint.com%7C259b5ae9949e445dd9fe08d603680486%7C54cc98ca024a470185483741e3b8d59d%7C0&amp;sdata=mI%2BPAh7OalmB1hxzjF%2FL6GqxwHDPi7Hy5P9IWADu5gI%3D&amp;reserved=0
> >
> > Which hardware did you try? Just so I can reproduce what's going on
> > and maybe help fixing it...
> 
> I tested on RT3062 PCI and some USB dongles, don't remember chip version.
> Basically HW become unresponsive after sent some traffic.
> 
> I debugged the problem by printing sequence number and rate of RATE PROBE
> frames by something like this:
> 
> diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00queue.c b/drivers/net/wireless/ralink/rt2x00/rt2x00queue.c
> index 92ddc19..d7f250b 100644
> --- a/drivers/net/wireless/ralink/rt2x00/rt2x00queue.c
> +++ b/drivers/net/wireless/ralink/rt2x00/rt2x00queue.c
> @@ -334,6 +334,12 @@ static void rt2x00queue_create_tx_descriptor_ht(struct rt2x00_dev *rt2x00dev,
>                          txdesc->u.ht.mcs |= 0x08;
>          }
> 
> +       if (tx_info->flags & IEEE80211_TX_CTL_RATE_CTRL_PROBE) {
> +               int ssn = le16_to_cpu(hdr->seq_ctrl) >> 4;
> +               printk("RATE PROBE %d %d\n", ssn, txdesc->u.ht.mcs);
> +       }
> +
> +
>          if (test_bit(CONFIG_HT_DISABLED, &rt2x00dev->flags)) {
>                  if (!(tx_info->flags & IEEE80211_TX_CTL_FIRST_FRAGMENT))
>                          txdesc->u.ht.txop = TXOP_SIFS;
> 
> and compared that with on air traffic captured in monitor mode on
> different device in wireshark. On air traffic showed different
> probe frames rate than requested.
> 
> Regards
> Stanislaw
Stanislaw Gruszka Aug. 20, 2018, 12:20 p.m. UTC | #6
Hi Daniel

On Sat, Aug 18, 2018 at 06:08:40PM +0200, Daniel Golle wrote:
> Still seeing a lot of those
> 
> Sat Aug 18 16:05:51 2018 kern.err kernel: [ 8364.972151] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
> Sat Aug 18 16:05:51 2018 kern.err kernel: [ 8364.981622] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
> Sat Aug 18 16:05:51 2018 kern.err kernel: [ 8364.991070] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
> Sat Aug 18 16:05:51 2018 kern.err kernel: [ 8365.000506] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
> Sat Aug 18 16:05:51 2018 kern.err kernel: [ 8365.009935] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
> 
> while wifi somehow keeps working at degraded level once the error
> occured.

I would like to know why we get those errors at all. Perhaps there is
something wrong with queue pause/unpasue logic. Or with locking.
Or maybe this is normal, we pausue queue in mac80211 but it still have
some pending skbs which are scheduled to the driver.

Let's check if pausing is correct. Please apply this patch:
https://github.com/sgruszka/wireless-drivers-next/commit/525c50486e17446793b21ac7a8498cb48b3bb210.patch
and provide dmesg output.

Note that adding printk messages can make itself somehow AP
unresponsive, as I pointed earlier in this thread. However
at this point I would like to check if queue pause works as it
should in the driver. 

Also, how do test? I can not reproduce those errors.

Thanks
Stanislaw
Stanislaw Gruszka Aug. 24, 2018, 1:02 p.m. UTC | #7
Hello

On Mon, Aug 20, 2018 at 02:20:41PM +0200, Stanislaw Gruszka wrote:
> On Sat, Aug 18, 2018 at 06:08:40PM +0200, Daniel Golle wrote:
> > Still seeing a lot of those
> > 
> > Sat Aug 18 16:05:51 2018 kern.err kernel: [ 8364.972151] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
> > Sat Aug 18 16:05:51 2018 kern.err kernel: [ 8364.981622] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
> > Sat Aug 18 16:05:51 2018 kern.err kernel: [ 8364.991070] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
> > Sat Aug 18 16:05:51 2018 kern.err kernel: [ 8365.000506] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
> > Sat Aug 18 16:05:51 2018 kern.err kernel: [ 8365.009935] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
> > 
> > while wifi somehow keeps working at degraded level once the error
> > occured.
> 
> I would like to know why we get those errors at all. Perhaps there is
> something wrong with queue pause/unpasue logic. Or with locking.
> Or maybe this is normal, we pausue queue in mac80211 but it still have
> some pending skbs which are scheduled to the driver.
> 
> Let's check if pausing is correct. Please apply this patch:
> https://github.com/sgruszka/wireless-drivers-next/commit/525c50486e17446793b21ac7a8498cb48b3bb210.patch
> and provide dmesg output.
> 
> Note that adding printk messages can make itself somehow AP
> unresponsive, as I pointed earlier in this thread. However
> at this point I would like to check if queue pause works as it
> should in the driver. 

I get testing results from T-Bone user in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=82751

And get this:

[  781.644185] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[  781.662620] 2 d1 s1 p1 ms1

Looks like rt2x00 correctly stop queue in mac80211, but sill get skb's. 

So we can do 3 things: increase queue size to 128, increase threshold
to 16 and make the massage debug one instead of error (I checked
some other drivers and looks most of them silently drop the frame
in case queue is full). Especially removing the message can be
useful since printk can somehow make mt7620 router unresponsive
for some time resulting in connection drops.

Thoughts ?

Regards
Stanislaw
diff mbox

Patch

diff --git a/net/mac80211/agg-tx.c b/net/mac80211/agg-tx.c
index 595c662a61e8..de2643f7189f 100644
--- a/net/mac80211/agg-tx.c
+++ b/net/mac80211/agg-tx.c
@@ -135,6 +135,8 @@  void ieee80211_send_bar(struct ieee80211_vif *vif, u8 *ra, u16 tid, u16 ssn)
 	bar->control = cpu_to_le16(bar_control);
 	bar->start_seq_num = cpu_to_le16(ssn);
 
+	printk("%s tid %d ssn %04x %d\n", __func__, tid, ssn , ssn >> 4);
+
 	IEEE80211_SKB_CB(skb)->flags |= IEEE80211_TX_INTFL_DONT_ENCRYPT |
 					IEEE80211_TX_CTL_REQ_TX_STATUS;
 	ieee80211_tx_skb_tid(sdata, skb, tid);

This patch alone cause wifi disconnects. I don't know where problem is, but
remove printing i.e. rt2x00_warn() messages can be thing to check if you
observe wifi disconnects.

Another issue is BA ssn mishmashes which I can observe. Those were
causing by older rt2800 usb hardware not responding correctly to BAR
frame. I.e for BAR frame with SSN 0x0060 (6), it responded BA with
SSN 7fe0 (2046). Removing of posting BAR by this patch:

diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00dev.c b/drivers/net/wireless/ralink/rt2x00/rt2x00dev.c
index 357c0941aaad..7e2786464bcf 100644
--- a/drivers/net/wireless/ralink/rt2x00/rt2x00dev.c
+++ b/drivers/net/wireless/ralink/rt2x00/rt2x00dev.c
@@ -382,9 +382,6 @@  static void rt2x00lib_fill_tx_status(struct rt2x00_dev *rt2x00dev,
 				  IEEE80211_TX_CTL_AMPDU;
 		tx_info->status.ampdu_len = 1;
 		tx_info->status.ampdu_ack_len = success ? 1 : 0;
-
-		if (!success)
-			tx_info->flags |= IEEE80211_TX_STAT_AMPDU_NO_BACK;
 	}
 
 	if (rate_flags & IEEE80211_TX_RC_USE_RTS_CTS) {