[v2,3/3] rt2x00: do not print error when queue is full
diff mbox series

Message ID 1545318971-28351-3-git-send-email-sgruszka@redhat.com
State Changes Requested
Delegated to: Kalle Valo
Headers show
Series
  • [v2,1/3] rt2x00: use ratelimited variants dev_warn/dev_err
Related show

Commit Message

Stanislaw Gruszka Dec. 20, 2018, 3:16 p.m. UTC
For unknown reasons printk() on some context can cause CPU hung on
embedded MT7620 AP/router MIPS platforms. What can result on wifi
disconnects.

This patch move queue full messages to debug level what is consistent
with other mac80211 drivers which drop packet silently if tx queue is
full. This make MT7620 OpenWRT routers more stable, what was reported
by various users.

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
---
 drivers/net/wireless/ralink/rt2x00/rt2x00queue.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Tom Psyborg Dec. 20, 2018, 5:52 p.m. UTC | #1
Shouldn't you now also revert the commits from
https://git.openwrt.org/?p=openwrt/staging/dangole.git;a=commitdiff;h=de1c58a64bd66319e770d2587da07d8c9c90174a
since they caused throughput regression?

On 20/12/2018, Stanislaw Gruszka <sgruszka@redhat.com> wrote:
> For unknown reasons printk() on some context can cause CPU hung on
> embedded MT7620 AP/router MIPS platforms. What can result on wifi
> disconnects.
>
> This patch move queue full messages to debug level what is consistent
> with other mac80211 drivers which drop packet silently if tx queue is
> full. This make MT7620 OpenWRT routers more stable, what was reported
> by various users.
>
> Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
> ---
>  drivers/net/wireless/ralink/rt2x00/rt2x00queue.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00queue.c
> b/drivers/net/wireless/ralink/rt2x00/rt2x00queue.c
> index 92ddc19e7bf7..947fc8964e9a 100644
> --- a/drivers/net/wireless/ralink/rt2x00/rt2x00queue.c
> +++ b/drivers/net/wireless/ralink/rt2x00/rt2x00queue.c
> @@ -671,7 +671,7 @@ int rt2x00queue_write_tx_frame(struct data_queue *queue,
> struct sk_buff *skb,
>  	spin_lock(&queue->tx_lock);
>
>  	if (unlikely(rt2x00queue_full(queue))) {
> -		rt2x00_err(queue->rt2x00dev, "Dropping frame due to full tx queue %d\n",
> +		rt2x00_dbg(queue->rt2x00dev, "Dropping frame due to full tx queue %d\n",
>  			   queue->qid);
>  		ret = -ENOBUFS;
>  		goto out;
> --
> 2.7.5
>
>
Stanislaw Gruszka Dec. 21, 2018, 9:59 a.m. UTC | #2
On Thu, Dec 20, 2018 at 06:52:03PM +0100, Tom Psyborg wrote:
> Shouldn't you now also revert the commits from
> https://git.openwrt.org/?p=openwrt/staging/dangole.git;a=commitdiff;h=de1c58a64bd66319e770d2587da07d8c9c90174a
> since they caused throughput regression?

For the record: this is not about currently posted patch and you
reference this:
https://lore.kernel.org/linux-wireless/1538697102-3764-1-git-send-email-pozega.tomislav@gmail.com/

I'm not convinced for reverting. I would rater fix the problem
on top of the patches.

Looking at those images, the performance vary also on "good" case and
there is entry 12.6 Mbits/s there, where the lowest entry on "bad" case
is 32.5 Mbits/s . However, yes, if performance do not drop for unknown
reason, it is 51.4 Mbits/s with patches and 56.6 Mbits/s on good case.

Could you specify after applying which one of those 5 patches
throughput regression starts ?

Thanks
Stanislaw
Tom Psyborg Dec. 22, 2018, 1:12 p.m. UTC | #3
On 21/12/2018, Stanislaw Gruszka <sgruszka@redhat.com> wrote:
> On Thu, Dec 20, 2018 at 06:52:03PM +0100, Tom Psyborg wrote:
>> Shouldn't you now also revert the commits from
>> https://git.openwrt.org/?p=openwrt/staging/dangole.git;a=commitdiff;h=de1c58a64bd66319e770d2587da07d8c9c90174a
>> since they caused throughput regression?
>
> For the record: this is not about currently posted patch and you
> reference this:
> https://lore.kernel.org/linux-wireless/1538697102-3764-1-git-send-email-pozega.tomislav@gmail.com/
>
> I'm not convinced for reverting. I would rater fix the problem
> on top of the patches.
>

Let me get this straight: You made these patches to fix wlan stalls
when "arrived at non-free entry" error message appears, but in turn
they result in lower throughput. Now you are moving err printks to dbg
level since the printks themselfs are causing wlan problems. Why would
you waste life fixing patches that don't do any good? Just revert all
of them!

> Looking at those images, the performance vary also on "good" case and
> there is entry 12.6 Mbits/s there, where the lowest entry on "bad" case
> is 32.5 Mbits/s . However, yes, if performance do not drop for unknown
> reason, it is 51.4 Mbits/s with patches and 56.6 Mbits/s on good case.
>
> Could you specify after applying which one of those 5 patches
> throughput regression starts ?
>
> Thanks
> Stanislaw
>

I cannot test that ATM but the best would be to revert all of them.
And, for the future reference, please try sending them to the
openwrt-devel list first for testing before deciding to go upstream.

Regards, Tom
Tom Psyborg Dec. 25, 2018, 10:43 p.m. UTC | #4
Even with your patches that are currently being tested i had interface
frozen yesterday. 2 android pads, 2 android phones and laptop with
intel card on win10. Wlan frozen about half an hour after bootup, had
to restart interface that revelaed more queue problems:

[  238.715836] IPv6: ADDRCONF(NETDEV_CHANGE): wlan1: link becomes ready
[  238.722681] br-lan: port 2(wlan1) entered blocking state
[  238.728107] br-lan: port 2(wlan1) entered forwarding state
[ 2403.051342] device wlan1 left promiscuous mode
[ 2403.056021] br-lan: port 2(wlan1) entered disabled state
[ 2403.213039] ieee80211 phy1: rt2800_config_channel: Warning - Using
incomplete support for external PA
[ 2403.522622] ieee80211 phy1: rt2x00queue_flush_queue: Warning -
Queue 0 failed to flush
[ 2403.817143] ieee80211 phy1: rt2x00queue_flush_queue: Warning -
Queue 1 failed to flush
[ 2404.058047] ieee80211 phy1: rt2x00queue_flush_queue: Warning -
Queue 2 failed to flush
[ 2404.299509] ieee80211 phy1: rt2x00queue_flush_queue: Warning -
Queue 3 failed to flush
[ 2404.566195] ieee80211 phy1: rt2x00queue_flush_queue: Warning -
Queue 0 failed to flush
[ 2404.806880] ieee80211 phy1: rt2x00queue_flush_queue: Warning -
Queue 1 failed to flush
[ 2405.047624] ieee80211 phy1: rt2x00queue_flush_queue: Warning -
Queue 2 failed to flush
[ 2405.288195] ieee80211 phy1: rt2x00queue_flush_queue: Warning -
Queue 3 failed to flush
Stanislaw Gruszka Dec. 27, 2018, 10:25 a.m. UTC | #5
On Sat, Dec 22, 2018 at 02:12:13PM +0100, Tom Psyborg wrote:
> On 21/12/2018, Stanislaw Gruszka <sgruszka@redhat.com> wrote:
> > On Thu, Dec 20, 2018 at 06:52:03PM +0100, Tom Psyborg wrote:
> >> Shouldn't you now also revert the commits from
> >> https://git.openwrt.org/?p=openwrt/staging/dangole.git;a=commitdiff;h=de1c58a64bd66319e770d2587da07d8c9c90174a
> >> since they caused throughput regression?
> >
> > For the record: this is not about currently posted patch and you
> > reference this:
> > https://lore.kernel.org/linux-wireless/1538697102-3764-1-git-send-email-pozega.tomislav@gmail.com/
> >
> > I'm not convinced for reverting. I would rater fix the problem
> > on top of the patches.
> >
> 
> Let me get this straight: You made these patches to fix wlan stalls
> when "arrived at non-free entry" error message appears, but in turn
> they result in lower throughput. Now you are moving err printks to dbg
> level since the printks themselfs are causing wlan problems. Why would
> you waste life fixing patches that don't do any good? Just revert all
> of them!

I move prints to debug level for for "arrived at non-free entry" from
"Dropping frame due to full tx queue" . For some users those 5 patches
improve things, they fix problem of router connection hung completely.
Printk problem is connection stall due to CPU being busy, but 
is possible to reconnect to AP after that. It's different issue.

Also 'result in lower throughput' is over-generalized statement. More
precise would be 'it lower throughput on some test cases', but even
in those specific cases, throughput vary randomly with and without
patches.

> > Looking at those images, the performance vary also on "good" case and
> > there is entry 12.6 Mbits/s there, where the lowest entry on "bad" case
> > is 32.5 Mbits/s . However, yes, if performance do not drop for unknown
> > reason, it is 51.4 Mbits/s with patches and 56.6 Mbits/s on good case.
> >
> > Could you specify after applying which one of those 5 patches
> > throughput regression starts ?
> >
> > Thanks
> > Stanislaw
> >
> 
> I cannot test that ATM but the best would be to revert all of them.

Ok, could anyone else confirm this throughput regression and point
offending commit ?

> And, for the future reference, please try sending them to the
> openwrt-devel list first for testing before deciding to go upstream.

Sure, I can send patches to openwrt-devel. However, for the record, 
I informed about those patches interested people including you
in August:
https://lore.kernel.org/linux-wireless/20180815114029.GA1862@redhat.com/
month before I submitted them upstream.

Thanks
Stanislaw
Stanislaw Gruszka Dec. 27, 2018, 10:32 a.m. UTC | #6
On Tue, Dec 25, 2018 at 11:43:09PM +0100, Tom Psyborg wrote:
> Even with your patches that are currently being tested i had interface
> frozen yesterday. 2 android pads, 2 android phones and laptop with
> intel card on win10. Wlan frozen about half an hour after bootup, had
> to restart interface that revelaed more queue problems:
>
> [  238.715836] IPv6: ADDRCONF(NETDEV_CHANGE): wlan1: link becomes ready
> [  238.722681] br-lan: port 2(wlan1) entered blocking state
> [  238.728107] br-lan: port 2(wlan1) entered forwarding state
> [ 2403.051342] device wlan1 left promiscuous mode
> [ 2403.056021] br-lan: port 2(wlan1) entered disabled state
> [ 2403.213039] ieee80211 phy1: rt2800_config_channel: Warning - Using
> incomplete support for external PA

Yes, patches will not help if device is not programed properly,
i.e. when external PA is not well configured. I already pointed
this here:
https://lore.kernel.org/linux-wireless/20180815114029.GA1862@redhat.com/

Register programming should be fixed there and also whould be good
to implement watchdog to recover from wifi hung it it happens.

> [ 2403.522622] ieee80211 phy1: rt2x00queue_flush_queue: Warning -
> Queue 0 failed to flush
> [ 2403.817143] ieee80211 phy1: rt2x00queue_flush_queue: Warning -
> Queue 1 failed to flush
> [ 2404.058047] ieee80211 phy1: rt2x00queue_flush_queue: Warning -
> Queue 2 failed to flush
> [ 2404.299509] ieee80211 phy1: rt2x00queue_flush_queue: Warning -
> Queue 3 failed to flush
> [ 2404.566195] ieee80211 phy1: rt2x00queue_flush_queue: Warning -
> Queue 0 failed to flush
> [ 2404.806880] ieee80211 phy1: rt2x00queue_flush_queue: Warning -
> Queue 1 failed to flush
> [ 2405.047624] ieee80211 phy1: rt2x00queue_flush_queue: Warning -
> Queue 2 failed to flush
> [ 2405.288195] ieee80211 phy1: rt2x00queue_flush_queue: Warning -
> Queue 3 failed to flush

Hmm, this one should be gone. Could you check attached patch?

Thanks
Stanislaw
diff --git a/drivers/net/wireless/ralink/rt2x00/rt2800soc.c b/drivers/net/wireless/ralink/rt2x00/rt2800soc.c
index a502816214ab..05a801774713 100644
--- a/drivers/net/wireless/ralink/rt2x00/rt2800soc.c
+++ b/drivers/net/wireless/ralink/rt2x00/rt2800soc.c
@@ -203,7 +203,7 @@ static const struct rt2x00lib_ops rt2800soc_rt2x00_ops = {
 	.start_queue		= rt2800mmio_start_queue,
 	.kick_queue		= rt2800mmio_kick_queue,
 	.stop_queue		= rt2800mmio_stop_queue,
-	.flush_queue		= rt2x00mmio_flush_queue,
+	.flush_queue		= rt2800mmio_flush_queue,
 	.write_tx_desc		= rt2800mmio_write_tx_desc,
 	.write_tx_data		= rt2800_write_tx_data,
 	.write_beacon		= rt2800_write_beacon,
Tom Psyborg Dec. 28, 2018, 12:45 a.m. UTC | #7
>
> I move prints to debug level for for "arrived at non-free entry" from
> "Dropping frame due to full tx queue" . For some users those 5 patches
> improve things, they fix problem of router connection hung completely.
> Printk problem is connection stall due to CPU being busy, but
> is possible to reconnect to AP after that. It's different issue.
>
yes they fix hung. same hung that does this single patch fix by moving
to dbg so they're not printed at all. FYI "Dropping frame due to full
tx queue" causes "arrived at non-free entry" eventually that freezes
interface so only reboot helps, removing "Dropping frame due to full
tx queue" from printing to logs prevents this kind of hung and is
better solution than 5 patches. oh and don't forget to add note that
rt2x00_err/rt2x00_dbg prints can cause serious bug so you don't
acidentally enable them again in a year or two!

> Also 'result in lower throughput' is over-generalized statement. More
> precise would be 'it lower throughput on some test cases', but even
> in those specific cases, throughput vary randomly with and without
> patches.
>

100Mbps TX 70Mbps RX without and 80Mbps TX 55 Mbps RX with 5
patches--> EA2750 HT40 CH2 4.14.82+4.19-rc5-1-1
Stanislaw Gruszka Jan. 2, 2019, 8:19 a.m. UTC | #8
On Fri, Dec 28, 2018 at 01:45:27AM +0100, Tom Psyborg wrote:
> >
> > I move prints to debug level for for "arrived at non-free entry" from
> > "Dropping frame due to full tx queue" . For some users those 5 patches
> > improve things, they fix problem of router connection hung completely.
> > Printk problem is connection stall due to CPU being busy, but
> > is possible to reconnect to AP after that. It's different issue.
> >
> yes they fix hung. same hung that does this single patch fix by moving
> to dbg so they're not printed at all. FYI "Dropping frame due to full
> tx queue" causes "arrived at non-free entry" eventually that freezes
> interface so only reboot helps, removing "Dropping frame due to full
> tx queue" from printing to logs prevents this kind of hung and is
> better solution than 5 patches.

I wonder why are you confident about this ? And also why 5 patches
fixed the hung if there were still "Dropping frame ..." prints ?

Anyway I'll ask people on the bz for confirmation, if stop printing
is sufficient to fix the wifi hung.

> oh and don't forget to add note that
> rt2x00_err/rt2x00_dbg prints can cause serious bug so you don't
> acidentally enable them again in a year or two!

I think I will remember that, for others git changelog should
be sufficient.
 
> > Also 'result in lower throughput' is over-generalized statement. More
> > precise would be 'it lower throughput on some test cases', but even
> > in those specific cases, throughput vary randomly with and without
> > patches.
> >
> 
> 100Mbps TX 70Mbps RX without and 80Mbps TX 55 Mbps RX with 5
> patches--> EA2750 HT40 CH2 4.14.82+4.19-rc5-1-1

Ok, this is different case that you presented before on USB devices.
And I'm more concerned about it, because changes with MMIO ware
much bigger (for USB only some code was moved and some flush
changed were done). Please check attached patch, if it fixes
throughput regression for you ?

Thanks
Stanislaw
diff --git a/drivers/net/wireless/ralink/rt2x00/rt2800lib.c b/drivers/net/wireless/ralink/rt2x00/rt2800lib.c
index fe365cda0841..17e87e9b3d19 100644
--- a/drivers/net/wireless/ralink/rt2x00/rt2800lib.c
+++ b/drivers/net/wireless/ralink/rt2x00/rt2800lib.c
@@ -961,8 +961,7 @@ static bool rt2800_txdone_entry_check(struct queue_entry *entry, u32 reg)
 {
 	__le32 *txwi;
 	u32 word;
-	int wcid, ack, pid;
-	int tx_wcid, tx_ack, tx_pid, is_agg;
+	int wcid, tx_wcid;
 
 	/*
 	 * This frames has returned with an IO error,
@@ -972,23 +971,17 @@ static bool rt2800_txdone_entry_check(struct queue_entry *entry, u32 reg)
 	if (test_bit(ENTRY_DATA_IO_FAILED, &entry->flags))
 		return false;
 
-	wcid	= rt2x00_get_field32(reg, TX_STA_FIFO_WCID);
-	ack	= rt2x00_get_field32(reg, TX_STA_FIFO_TX_ACK_REQUIRED);
-	pid	= rt2x00_get_field32(reg, TX_STA_FIFO_PID_TYPE);
-	is_agg	= rt2x00_get_field32(reg, TX_STA_FIFO_TX_AGGRE);
-
 	/*
 	 * Validate if this TX status report is intended for
-	 * this entry by comparing the WCID/ACK/PID fields.
+	 * this entry by comparing the WCID field.
 	 */
 	txwi = rt2800_drv_get_txwi(entry);
-
 	word = rt2x00_desc_read(txwi, 1);
 	tx_wcid = rt2x00_get_field32(word, TXWI_W1_WIRELESS_CLI_ID);
-	tx_ack  = rt2x00_get_field32(word, TXWI_W1_ACK);
-	tx_pid  = rt2x00_get_field32(word, TXWI_W1_PACKETID);
 
-	if (wcid != tx_wcid || ack != tx_ack || (!is_agg && pid != tx_pid)) {
+	wcid = rt2x00_get_field32(reg, TX_STA_FIFO_WCID);
+
+	if (wcid != tx_wcid) {
 		rt2x00_dbg(entry->queue->rt2x00dev,
 			   "TX status report missed for queue %d entry %d\n",
 			   entry->queue->qid, entry->entry_idx);
Stanislaw Gruszka Feb. 9, 2019, 11:03 a.m. UTC | #9
On Wed, Jan 02, 2019 at 09:19:34AM +0100, Stanislaw Gruszka wrote:
> On Fri, Dec 28, 2018 at 01:45:27AM +0100, Tom Psyborg wrote:
> > >
> > > I move prints to debug level for for "arrived at non-free entry" from
> > > "Dropping frame due to full tx queue" . For some users those 5 patches
> > > improve things, they fix problem of router connection hung completely.
> > > Printk problem is connection stall due to CPU being busy, but
> > > is possible to reconnect to AP after that. It's different issue.
> > >
> > yes they fix hung. same hung that does this single patch fix by moving
> > to dbg so they're not printed at all. FYI "Dropping frame due to full
> > tx queue" causes "arrived at non-free entry" eventually that freezes
> > interface so only reboot helps, removing "Dropping frame due to full
> > tx queue" from printing to logs prevents this kind of hung and is
> > better solution than 5 patches.
> 
> I wonder why are you confident about this ? And also why 5 patches
> fixed the hung if there were still "Dropping frame ..." prints ?
> 
> Anyway I'll ask people on the bz for confirmation, if stop printing
> is sufficient to fix the wifi hung.
> 
> > oh and don't forget to add note that
> > rt2x00_err/rt2x00_dbg prints can cause serious bug so you don't
> > acidentally enable them again in a year or two!
> 
> I think I will remember that, for others git changelog should
> be sufficient.
>  
> > > Also 'result in lower throughput' is over-generalized statement. More
> > > precise would be 'it lower throughput on some test cases', but even
> > > in those specific cases, throughput vary randomly with and without
> > > patches.
> > >
> > 
> > 100Mbps TX 70Mbps RX without and 80Mbps TX 55 Mbps RX with 5
> > patches--> EA2750 HT40 CH2 4.14.82+4.19-rc5-1-1
> 
> Ok, this is different case that you presented before on USB devices.
> And I'm more concerned about it, because changes with MMIO ware
> much bigger (for USB only some code was moved and some flush
> changed were done). Please check attached patch, if it fixes
> throughput regression for you ?

So what is the point complaining that I do not provide patches for
testing and when I do, ignore that ?

Anyway I have now MT7620A device and was capable to reproduce the
performance regression. Will provide patches to fix it and still
prevent tx queues hungs.

Stanislaw
Tom Psyborg Feb. 9, 2019, 11:11 a.m. UTC | #10
Can you reproduce interface freeze with this patch if you enable debug
options to print all dbg messages?
Stanislaw Gruszka Feb. 9, 2019, 11:56 a.m. UTC | #11
On Sat, Feb 09, 2019 at 12:11:03PM +0100, Tom Psyborg wrote:
> Can you reproduce interface freeze with this patch if you enable debug
> options to print all dbg messages?

Haven't tried that, but I expect enabling debug messages will
cause troubles at hardware where printk make CPU hog.

Stanislaw
Tom Psyborg Feb. 9, 2019, 12:28 p.m. UTC | #12
On 09/02/2019, Stanislaw Gruszka <sgruszka@redhat.com> wrote:
> On Sat, Feb 09, 2019 at 12:11:03PM +0100, Tom Psyborg wrote:
>> Can you reproduce interface freeze with this patch if you enable debug
>> options to print all dbg messages?
>
> Haven't tried that, but I expect enabling debug messages will
> cause troubles at hardware where printk make CPU hog.
>
> Stanislaw
>

Take a look at my comment from 23.01.2019. on this link:
https://bugs.openwrt.org/index.php?do=details&task_id=2018
There might be a chance dbg printks do not cause problems.
Daniel Golle Feb. 9, 2019, 3:38 p.m. UTC | #13
Hi Tom,

On Sat, Feb 09, 2019 at 01:28:32PM +0100, Tom Psyborg wrote:
> On 09/02/2019, Stanislaw Gruszka <sgruszka@redhat.com> wrote:
> > On Sat, Feb 09, 2019 at 12:11:03PM +0100, Tom Psyborg wrote:
> >> Can you reproduce interface freeze with this patch if you enable debug
> >> options to print all dbg messages?
> >
> > Haven't tried that, but I expect enabling debug messages will
> > cause troubles at hardware where printk make CPU hog.
> >
> > Stanislaw
> >
> 
> Take a look at my comment from 23.01.2019. on this link:
> https://bugs.openwrt.org/index.php?do=details&task_id=2018
> There might be a chance dbg printks do not cause problems.

This is very likely just overheat because of missing TSSI. It happends
on boards where MT7620A is used without heatsink (and easy to
reproduce, I've tried. We just need TSSI and for bad board design it
will mean limiting to one TX-queue soon after serious traffic starts).
We should tell those users to glue heatsink on the chip for $0.05 they
can fix this in hardware -- and implement TSSI to mitigate at least the
very worst effects (such as hardware stuck until reset).
So dkg printks can make this worse, of course, because logging overhead
also burns CPU cycles and increases system load.

I remember in Rt5350 (and other predecessors of Rt6352 aka. MT7620)
also had problems like that and it was possible to improve it by all
sorts of things like removing power from unused ports of the Ethernet
switch -- so for those boards (like Xiaomi MiWiFi mini) even this maybe
the trick, because they also only use 3 out of 5 ports, and maybe we
need to try harder to disable the power of unused ports (and also those
without cable plugged in only periodicly power them up and check for
link)....

Cheers

Daniel
Tom Psyborg Feb. 9, 2019, 4:29 p.m. UTC | #14
On 09/02/2019, Daniel Golle <daniel@makrotopia.org> wrote:
> Hi Tom,
>
> On Sat, Feb 09, 2019 at 01:28:32PM +0100, Tom Psyborg wrote:
>> On 09/02/2019, Stanislaw Gruszka <sgruszka@redhat.com> wrote:
>> > On Sat, Feb 09, 2019 at 12:11:03PM +0100, Tom Psyborg wrote:
>> >> Can you reproduce interface freeze with this patch if you enable debug
>> >> options to print all dbg messages?
>> >
>> > Haven't tried that, but I expect enabling debug messages will
>> > cause troubles at hardware where printk make CPU hog.
>> >
>> > Stanislaw
>> >
>>
>> Take a look at my comment from 23.01.2019. on this link:
>> https://bugs.openwrt.org/index.php?do=details&task_id=2018
>> There might be a chance dbg printks do not cause problems.
>
> This is very likely just overheat because of missing TSSI. It happends
> on boards where MT7620A is used without heatsink (and easy to
> reproduce, I've tried. We just need TSSI and for bad board design it
> will mean limiting to one TX-queue soon after serious traffic starts).
> We should tell those users to glue heatsink on the chip for $0.05 they
> can fix this in hardware -- and implement TSSI to mitigate at least the
> very worst effects (such as hardware stuck until reset).
> So dkg printks can make this worse, of course, because logging overhead
> also burns CPU cycles and increases system load.
>
> I remember in Rt5350 (and other predecessors of Rt6352 aka. MT7620)
> also had problems like that and it was possible to improve it by all
> sorts of things like removing power from unused ports of the Ethernet
> switch -- so for those boards (like Xiaomi MiWiFi mini) even this maybe
> the trick, because they also only use 3 out of 5 ports, and maybe we
> need to try harder to disable the power of unused ports (and also those
> without cable plugged in only periodicly power them up and check for
> link)....
>
> Cheers
>
> Daniel
>

Disagree. MT7620 in Xiaomi MiWiFi mini has thermal pad sticked to
enough large metal shielding that is sticked with another thermal pad
to even bigger metal part of case. So thermal dissipation in that case
is very good.
Also with pandorabox firmware 200Mbps of stable throughput is possible
without any throttling which I believe would not be the case if the
cooling system wasn't reliable.

Back on subject, this patch is tested and proven to work, so ACK from me.
Daniel Golle Feb. 9, 2019, 5:28 p.m. UTC | #15
Hi Tom,

On Sat, Feb 09, 2019 at 05:29:33PM +0100, Tom Psyborg wrote:
> On 09/02/2019, Daniel Golle <daniel@makrotopia.org> wrote:
> > Hi Tom,
> >
> > On Sat, Feb 09, 2019 at 01:28:32PM +0100, Tom Psyborg wrote:
> >> On 09/02/2019, Stanislaw Gruszka <sgruszka@redhat.com> wrote:
> >> > On Sat, Feb 09, 2019 at 12:11:03PM +0100, Tom Psyborg wrote:
> >> >> Can you reproduce interface freeze with this patch if you enable debug
> >> >> options to print all dbg messages?
> >> >
> >> > Haven't tried that, but I expect enabling debug messages will
> >> > cause troubles at hardware where printk make CPU hog.
> >> >
> >> > Stanislaw
> >> >
> >>
> >> Take a look at my comment from 23.01.2019. on this link:
> >> https://bugs.openwrt.org/index.php?do=details&task_id=2018
> >> There might be a chance dbg printks do not cause problems.
> >
> > This is very likely just overheat because of missing TSSI. It happends
> > on boards where MT7620A is used without heatsink (and easy to
> > reproduce, I've tried. We just need TSSI and for bad board design it
> > will mean limiting to one TX-queue soon after serious traffic starts).
> > We should tell those users to glue heatsink on the chip for $0.05 they
> > can fix this in hardware -- and implement TSSI to mitigate at least the
> > very worst effects (such as hardware stuck until reset).
> > So dkg printks can make this worse, of course, because logging overhead
> > also burns CPU cycles and increases system load.
> >
> > I remember in Rt5350 (and other predecessors of Rt6352 aka. MT7620)
> > also had problems like that and it was possible to improve it by all
> > sorts of things like removing power from unused ports of the Ethernet
> > switch -- so for those boards (like Xiaomi MiWiFi mini) even this maybe
> > the trick, because they also only use 3 out of 5 ports, and maybe we
> > need to try harder to disable the power of unused ports (and also those
> > without cable plugged in only periodicly power them up and check for
> > link)....
> >
> > Cheers
> >
> > Daniel
> >
> 
> Disagree. MT7620 in Xiaomi MiWiFi mini has thermal pad sticked to
> enough large metal shielding that is sticked with another thermal pad
> to even bigger metal part of case. So thermal dissipation in that case
> is very good.

Good to hear that, and yes it may still be, see below.

> Also with pandorabox firmware 200Mbps of stable throughput is possible

>20MiB/s download rate is very impressive for 2.4GHz WiFi, I never saw
anything even near that (using BitTorrent I get up to 8MiB/s when next
to the router in low-noise environment with proprietry ubiquiti gear).

> without any throttling which I believe would not be the case if the
> cooling system wasn't reliable.

pandorabox uses MediaTek proprietary driver and that limits to single
TX chain operation once temperature limit is hit (read the code, you
know where), same as in Xiaomi proprietary firmware.

> 
> Back on subject, this patch is tested and proven to work, so ACK from me.

Thanks for testing. I will push Stanislavs complete series into OpenWrt
tomorrow.


Cheers


Daniel
Stanislaw Gruszka Feb. 10, 2019, 9:57 a.m. UTC | #16
On Sat, Feb 09, 2019 at 01:28:32PM +0100, Tom Psyborg wrote:
> On 09/02/2019, Stanislaw Gruszka <sgruszka@redhat.com> wrote:
> > On Sat, Feb 09, 2019 at 12:11:03PM +0100, Tom Psyborg wrote:
> >> Can you reproduce interface freeze with this patch if you enable debug
> >> options to print all dbg messages?
> >
> > Haven't tried that, but I expect enabling debug messages will
> > cause troubles at hardware where printk make CPU hog.
> >
> > Stanislaw
> >
> 
> Take a look at my comment from 23.01.2019. on this link:
> https://bugs.openwrt.org/index.php?do=details&task_id=2018
> There might be a chance dbg printks do not cause problems.

I'm sure printk cause problems on some hardware i.e. Wt3020 .
There is hypothesis that is because kernel try to emit
messages by misconfigured 56k serial line, but who knows.
This should be debugged and fixed but I have not idea how to do this.

Another issue not related with printk is tx queue hung, when
queue is full but we do not get any TX status interrupt. This was
mitigated in my previous set that caused throughput regression and
should be now fixed by latest RFC/RTF set.

But there are other issues, that could cause troubles.

Stanislaw
Tom Psyborg March 19, 2019, 2:37 a.m. UTC | #17
On 09/02/2019, Daniel Golle <daniel@makrotopia.org> wrote:

>>20MiB/s download rate is very impressive for 2.4GHz WiFi, I never saw
> anything even near that (using BitTorrent I get up to 8MiB/s when next
> to the router in low-noise environment with proprietry ubiquiti gear).

various apps and data transfer stacks may add significant overhead.
for real throughput testing i rely on iperf2 (tried iperf3 it was
mess). if interested take a look at bandwidth achieved on OCed QCA9531
device: https://forum.openwrt.org/t/different-wifi-speed-tx-rx-archer-c7-v2/18860/10?u=psyborg

Patch
diff mbox series

diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00queue.c b/drivers/net/wireless/ralink/rt2x00/rt2x00queue.c
index 92ddc19e7bf7..947fc8964e9a 100644
--- a/drivers/net/wireless/ralink/rt2x00/rt2x00queue.c
+++ b/drivers/net/wireless/ralink/rt2x00/rt2x00queue.c
@@ -671,7 +671,7 @@  int rt2x00queue_write_tx_frame(struct data_queue *queue, struct sk_buff *skb,
 	spin_lock(&queue->tx_lock);
 
 	if (unlikely(rt2x00queue_full(queue))) {
-		rt2x00_err(queue->rt2x00dev, "Dropping frame due to full tx queue %d\n",
+		rt2x00_dbg(queue->rt2x00dev, "Dropping frame due to full tx queue %d\n",
 			   queue->qid);
 		ret = -ENOBUFS;
 		goto out;