diff mbox series

[v2,2/3] rt2x00: check number of EPROTO errors

Message ID 1545318971-28351-2-git-send-email-sgruszka@redhat.com (mailing list archive)
State Changes Requested
Delegated to: Kalle Valo
Headers show
Series [v2,1/3] rt2x00: use ratelimited variants dev_warn/dev_err | expand

Commit Message

Stanislaw Gruszka Dec. 20, 2018, 3:16 p.m. UTC
Some USB host devices/drivers on some conditions can always return
EPROTO error on submitted URBs. That can cause infinity loop in the
rt2x00 driver.

Since we can have single EPROTO errors we can not mark as device as
removed to avoid infinity loop. However we can count consecutive
EPROTO errors and mark device as removed if get lot of it.
I choose number 10 as threshold.

Reported-and-tested-by: Randy Oostdyk <linux-kernel@oostdyk.com>
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
---
 drivers/net/wireless/ralink/rt2x00/rt2x00.h    |  1 +
 drivers/net/wireless/ralink/rt2x00/rt2x00usb.c | 22 +++++++++++++++++++---
 2 files changed, 20 insertions(+), 3 deletions(-)

Comments

Jeroen Roovers Jan. 7, 2019, 12:47 p.m. UTC | #1
On Thu, 20 Dec 2018 at 16:16, Stanislaw Gruszka <sgruszka@redhat.com> wrote:
>
> Some USB host devices/drivers on some conditions can always return
> EPROTO error on submitted URBs. That can cause infinity loop in the
> rt2x00 driver.
>
> Since we can have single EPROTO errors we can not mark as device as
> removed to avoid infinity loop. However we can count consecutive
> EPROTO errors and mark device as removed if get lot of it.
> I choose number 10 as threshold.
>
> Reported-and-tested-by: Randy Oostdyk <linux-kernel@oostdyk.com>
> Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>

> diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c b/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c
> index 086aad22743d..60b8bccab83d 100644
> --- a/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c
> +++ b/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c
> @@ -31,6 +31,22 @@
>  #include "rt2x00.h"
>  #include "rt2x00usb.h"
>
> +static bool rt2x00usb_check_usb_error(struct rt2x00_dev *rt2x00dev, int status)
> +{
> +       if (status == -ENODEV || status == -ENOENT)

I am not sure about this, but it looks to me like you would never see
ENOENT, but ETIMEDOUT:

https://github.com/torvalds/linux/blob/master/drivers/usb/core/message.c#L64

In usb_start_wait_urb ETIMEDOUT is returned instead of ENOENT and
passed up the chain.

     retval = (ctx.status == -ENOENT ? -ETIMEDOUT : ctx.status);


Maybe I am wrong about this, but then again I have neither ever seen
the driver respond to an ENOENT like this when an RT5592
"disappeared".


Kind regards,
     jer
Stanislaw Gruszka Jan. 7, 2019, 3:09 p.m. UTC | #2
On Mon, Jan 07, 2019 at 01:47:19PM +0100, Jeroen Roovers wrote:
> On Thu, 20 Dec 2018 at 16:16, Stanislaw Gruszka <sgruszka@redhat.com> wrote:
> >
> > Some USB host devices/drivers on some conditions can always return
> > EPROTO error on submitted URBs. That can cause infinity loop in the
> > rt2x00 driver.
> >
> > Since we can have single EPROTO errors we can not mark as device as
> > removed to avoid infinity loop. However we can count consecutive
> > EPROTO errors and mark device as removed if get lot of it.
> > I choose number 10 as threshold.
> >
> > Reported-and-tested-by: Randy Oostdyk <linux-kernel@oostdyk.com>
> > Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
> 
> > diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c b/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c
> > index 086aad22743d..60b8bccab83d 100644
> > --- a/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c
> > +++ b/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c
> > @@ -31,6 +31,22 @@
> >  #include "rt2x00.h"
> >  #include "rt2x00usb.h"
> >
> > +static bool rt2x00usb_check_usb_error(struct rt2x00_dev *rt2x00dev, int status)
> > +{
> > +       if (status == -ENODEV || status == -ENOENT)
> 
> I am not sure about this, but it looks to me like you would never see
> ENOENT, but ETIMEDOUT:
> 
> https://github.com/torvalds/linux/blob/master/drivers/usb/core/message.c#L64
> 
> In usb_start_wait_urb ETIMEDOUT is returned instead of ENOENT and
> passed up the chain.
> 
>      retval = (ctx.status == -ENOENT ? -ETIMEDOUT : ctx.status);
> 
> 
> Maybe I am wrong about this, but then again I have neither ever seen
> the driver respond to an ENOENT like this when an RT5592
> "disappeared".

According to Documentation/driver-api/usb/error-codes.rst 
ENOENT is valid error code when interface does not exist or
when URB was unlinked.

Regards
Stanislaw
Jeroen Roovers Jan. 8, 2019, 9:30 a.m. UTC | #3
Hi Stanislaw,


On Mon, 7 Jan 2019 at 16:09, Stanislaw Gruszka <sgruszka@redhat.com> wrote:
> On Mon, Jan 07, 2019 at 01:47:19PM +0100, Jeroen Roovers wrote:
> > Maybe I am wrong about this, but then again I have neither ever seen
> > the driver respond to an ENOENT like this when an RT5592
> > "disappeared".

By "neither ever seen" I mean I have never seen rt2x00usb respond
properly like that because no -ENOENT was ever seen. I have many
system logs collected over the years showing the exact error that the
code was trying to catch and still does try to catch: with USB debug
messages enabled, I tend to see

usb: kworker.* timed out on ep.*

followed by rt2x00 specific warnings about the queues filling up:
rt2800usb_txdone: Warning - Data pending for entry N in queue N
[may repeat a few times]

followed by the fatal:

rt2x00usb_vendor_request: Error - Vendor Request X failed for offset X
with error -110
[many of these, system is slowly locking up]

So the only clue that I had was that perhaps rt2x00usb_vendor_request
wasn't catching the correct return value.

> According to Documentation/driver-api/usb/error-codes.rst
> ENOENT is valid error code when interface does not exist or
> when URB was unlinked.

rt2x00usb_vendor_request calls usb_control_msg.
usb_control_msg according to that document can return -ETIMEDOUT.

1. rt2x00usb_vendor_request calls usb_control_msg.
2. usb_control_msg calls usb_internal_control_msg and passes on its
return value.
3. usb_internal_control_msg calls usb_start_wait_urb and passes on its
return value.
4. usb_start_wait_urb very specifically substitutes -ETIMEDOUT for
-ENOENT as the return value of usb_kill_urb and
5. that is passed back all the way up to rt2x00usb_vendor_request.

Kind regards,
     jer
Jeroen Roovers Jan. 8, 2019, 10:09 a.m. UTC | #4
On Tue, 8 Jan 2019 at 10:30, Jeroen Roovers <jer@airfi.aero> wrote:
> I tend to see
>
> usb: kworker.* timed out on ep.*
>
> followed by rt2x00 specific warnings about the queues filling up:

Sorry that should read "preceded by".

> rt2800usb_txdone: Warning - Data pending for entry N in queue N
> [may repeat a few times]


Kind regards,
     jer
Tom Psyborg Jan. 8, 2019, 11:04 a.m. UTC | #5
On 08/01/2019, Jeroen Roovers <jer@airfi.aero> wrote:
>      Hi Stanislaw,
>
>
> On Mon, 7 Jan 2019 at 16:09, Stanislaw Gruszka <sgruszka@redhat.com> wrote:
>> On Mon, Jan 07, 2019 at 01:47:19PM +0100, Jeroen Roovers wrote:
>> > Maybe I am wrong about this, but then again I have neither ever seen
>> > the driver respond to an ENOENT like this when an RT5592
>> > "disappeared".
>
> By "neither ever seen" I mean I have never seen rt2x00usb respond
> properly like that because no -ENOENT was ever seen. I have many
> system logs collected over the years showing the exact error that the
> code was trying to catch and still does try to catch: with USB debug
> messages enabled, I tend to see
>
> usb: kworker.* timed out on ep.*
>
> followed by rt2x00 specific warnings about the queues filling up:
> rt2800usb_txdone: Warning - Data pending for entry N in queue N
> [may repeat a few times]
>
> followed by the fatal:
>
> rt2x00usb_vendor_request: Error - Vendor Request X failed for offset X
> with error -110
> [many of these, system is slowly locking up]
>
> So the only clue that I had was that perhaps rt2x00usb_vendor_request
> wasn't catching the correct return value.

Hi

error message vendor request failed - do you get it on a real hardware
or in virtualized environment?

noticed it only happens when running attached USB device in
virtualbox, even in such case I was able once to hit the right
combination of ubuntu, vbox and vbox extensions and could run the USB
wifi without lockups. so you might consider fixing this bug at
different level. oh, and I'd also pay attention to hardware connection
at the USB port, at least one of the devices I tested seems flaky when
plugged in (edup mini adapter/RT53xx)

regards, Tom
Jeroen Roovers Jan. 9, 2019, 6:17 a.m. UTC | #6
Hi Tom,

On Tue, 8 Jan 2019 at 12:04, Tom Psyborg <pozega.tomislav@gmail.com> wrote:
> > rt2x00usb_vendor_request: Error - Vendor Request X failed for offset X
> > with error -110
> > [many of these, system is slowly locking up]
> >
> > So the only clue that I had was that perhaps rt2x00usb_vendor_request
> > wasn't catching the correct return value.
>
> Hi
>
> error message vendor request failed - do you get it on a real hardware
> or in virtualized environment?

I only run these on bare metal. What I assume so far is that when
rt2x00usb_vendor_request starts failing like this, the MCU has failed.
Power cycling the system helps but is undesirable, and sometimes so
does a forced removal of rt2800usb, a short recovery period (cooling
down, reloading the firmware?) and then loading the module again.

But the problem I am looking to solve is not a hardware problem, it is
recovering gracefully from a failure in the RT5592, so I have been
looking intently at rt2x00usb_vendor_request because that's the
function that complains loudly and kills the entire kernel when the
RT5592 sees this failure.


Kind regards,
     jer
Stanislaw Gruszka Jan. 9, 2019, 11:33 a.m. UTC | #7
On Wed, Jan 09, 2019 at 07:17:59AM +0100, Jeroen Roovers wrote:
>      Hi Tom,
> 
> On Tue, 8 Jan 2019 at 12:04, Tom Psyborg <pozega.tomislav@gmail.com> wrote:
> > > rt2x00usb_vendor_request: Error - Vendor Request X failed for offset X
> > > with error -110
> > > [many of these, system is slowly locking up]
> > >
> > > So the only clue that I had was that perhaps rt2x00usb_vendor_request
> > > wasn't catching the correct return value.
> >
> > Hi
> >
> > error message vendor request failed - do you get it on a real hardware
> > or in virtualized environment?
> 
> I only run these on bare metal. What I assume so far is that when
> rt2x00usb_vendor_request starts failing like this, the MCU has failed.
> Power cycling the system helps but is undesirable, and sometimes so
> does a forced removal of rt2800usb, a short recovery period (cooling
> down, reloading the firmware?) and then loading the module again.
> 
> But the problem I am looking to solve is not a hardware problem, it is
> recovering gracefully from a failure in the RT5592, so I have been
> looking intently at rt2x00usb_vendor_request because that's the
> function that complains loudly and kills the entire kernel when the
> RT5592 sees this failure.

So would be below patch (on top of this set) be a solution for at 
least to not kill the kernel. Or we looking for something better
i.e. watchdog ?

Regards
Stanislaw

diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c b/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c
index 60b8bccab83d..ee5bd0efd2c7 100644
--- a/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c
+++ b/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c
@@ -36,7 +36,7 @@ static bool rt2x00usb_check_usb_error(struct rt2x00_dev *rt2x00dev, int status)
 	if (status == -ENODEV || status == -ENOENT)
 		return true;
 	
-	if (status == -EPROTO)
+	if (status == -EPROTO || status == -ETIMEDOUT)
 		rt2x00dev->num_proto_errs++;
 	else
 		rt2x00dev->num_proto_errs = 0;
Jeroen Roovers Jan. 10, 2019, 7:49 a.m. UTC | #8
Hi Stanislaw,

On Wed, 9 Jan 2019 at 12:33, Stanislaw Gruszka <sgruszka@redhat.com> wrote:

> So would be below patch (on top of this set) be a solution for at
> least to not kill the kernel. Or we looking for something better
> i.e. watchdog ?

I'll give it a spin. Thanks!


Kind regards,
     jer
Jeroen Roovers Jan. 10, 2019, 2:29 p.m. UTC | #9
Aaaaaaand the results are in.

On Thu, 10 Jan 2019 at 08:49, Jeroen Roovers <jer@airfi.aero> wrote:
>
>      Hi Stanislaw,
>
> On Wed, 9 Jan 2019 at 12:33, Stanislaw Gruszka <sgruszka@redhat.com> wrote:
>
> > So would be below patch (on top of this set) be a solution for at
> > least to not kill the kernel. Or we looking for something better
> > i.e. watchdog ?
>
> I'll give it a spin. Thanks!

hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due
to inactivity
kernel: [  500.782266] ieee80211 phy0: rt2x00usb_vendor_request: Error
- Vendor Request 0x07 failed for offset 0x6888 with error -110
kernel: [  500.912237] ieee80211 phy0: rt2x00usb_vendor_request: Error
- Vendor Request 0x06 failed for offset 0x6888 with error -110
kernel: [  501.042235] ieee80211 phy0: rt2x00usb_vendor_request: Error
- Vendor Request 0x06 failed for offset 0x6110 with error -110
hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due
to inactivity (timer DEAUTH/REMOVE)
kernel: [  501.772201] ieee80211 phy0: rt2x00usb_vendor_request: Error
- Vendor Request 0x07 failed for offset 0x1018 with error -110
kernel: [  501.902177] ieee80211 phy0: rt2x00usb_vendor_request: Error
- Vendor Request 0x06 failed for offset 0x1018 with error -110
kernel: [  501.972186] ieee80211 phy0: rt2x00usb_vendor_request: Error
- Vendor Request 0x06 failed for offset 0x1910 with error -110
hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due
to inactivity
hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due
to inactivity (timer DEAUTH/REMOVE)
hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due
to inactivity
hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due
to inactivity (timer DEAUTH/REMOVE)
hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due
to inactivity
hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due
to inactivity (timer DEAUTH/REMOVE)
hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due
to inactivity
hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due
to inactivity (timer DEAUTH/REMOVE)

rt2x00usb_check_usb_error in your patch is set to
clearDEVICE_STATE_PRESENT after ten errors, but in this case only 6
errors were seen. Maybe I should set it to 1 as I have never seen an
RT5592 recover from this.

The system remained relatively stable until after I tried forcibly
removing and then loading the rt2800usb module. A simple `ifconfig`
then triggered a kernel panic. Sadly I couldn't capture it in time but
I did spot that more phyN (up to phy4) devices had been added.


Kind regards,
      jer
Stanislaw Gruszka Jan. 16, 2019, 11:11 a.m. UTC | #10
On Thu, Jan 10, 2019 at 03:29:44PM +0100, Jeroen Roovers wrote:
> Aaaaaaand the results are in.
> 
> On Thu, 10 Jan 2019 at 08:49, Jeroen Roovers <jer@airfi.aero> wrote:
> >
> >      Hi Stanislaw,
> >
> > On Wed, 9 Jan 2019 at 12:33, Stanislaw Gruszka <sgruszka@redhat.com> wrote:
> >
> > > So would be below patch (on top of this set) be a solution for at
> > > least to not kill the kernel. Or we looking for something better
> > > i.e. watchdog ?
> >
> > I'll give it a spin. Thanks!
> 
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due
> to inactivity
> kernel: [  500.782266] ieee80211 phy0: rt2x00usb_vendor_request: Error
> - Vendor Request 0x07 failed for offset 0x6888 with error -110
> kernel: [  500.912237] ieee80211 phy0: rt2x00usb_vendor_request: Error
> - Vendor Request 0x06 failed for offset 0x6888 with error -110
> kernel: [  501.042235] ieee80211 phy0: rt2x00usb_vendor_request: Error
> - Vendor Request 0x06 failed for offset 0x6110 with error -110
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due
> to inactivity (timer DEAUTH/REMOVE)
> kernel: [  501.772201] ieee80211 phy0: rt2x00usb_vendor_request: Error
> - Vendor Request 0x07 failed for offset 0x1018 with error -110
> kernel: [  501.902177] ieee80211 phy0: rt2x00usb_vendor_request: Error
> - Vendor Request 0x06 failed for offset 0x1018 with error -110
> kernel: [  501.972186] ieee80211 phy0: rt2x00usb_vendor_request: Error
> - Vendor Request 0x06 failed for offset 0x1910 with error -110
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due
> to inactivity
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due
> to inactivity (timer DEAUTH/REMOVE)
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due
> to inactivity
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due
> to inactivity (timer DEAUTH/REMOVE)
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due
> to inactivity
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due
> to inactivity (timer DEAUTH/REMOVE)
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due
> to inactivity
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due
> to inactivity (timer DEAUTH/REMOVE)
> 
> rt2x00usb_check_usb_error in your patch is set to
> clearDEVICE_STATE_PRESENT after ten errors, but in this case only 6
> errors were seen. Maybe I should set it to 1 as I have never seen an
> RT5592 recover from this.

I know cases where we have only single USB error and device work after
that. But if I remember correctly it was only for EPROTO not ETIMEDOUT.

> The system remained relatively stable until after I tried forcibly
> removing and then loading the rt2800usb module. A simple `ifconfig`
> then triggered a kernel panic. Sadly I couldn't capture it in time but
> I did spot that more phyN (up to phy4) devices had been added.

Apparently we should not crash and that should be fixed, but without
logs there is nothing we can do.

Let try to attack problem from other side. RT5592 mises TSSI
calibration, so perhaps device overheat for you because of that.
Please apply attached patch which fixes txpower setting in mac80211
then set some low txpower value i.e:

iw dev wlan0 set txpower fixed 1500

and check if device still stop working with USB errors.

Regards
Stanislaw
diff --git a/net/mac80211/cfg.c b/net/mac80211/cfg.c
index 818aa0060349..ca513e94be9e 100644
--- a/net/mac80211/cfg.c
+++ b/net/mac80211/cfg.c
@@ -2415,6 +2415,8 @@ static int ieee80211_set_tx_power(struct wiphy *wiphy,
 	bool update_txp_type = false;
 	bool has_monitor = false;
 
+	printk("%s mbm %d\n", __func__, mbm);
+
 	if (wdev) {
 		sdata = IEEE80211_WDEV_TO_SUB_IF(wdev);
 
diff --git a/net/mac80211/iface.c b/net/mac80211/iface.c
index 3a0171a65db3..022874bf49f3 100644
--- a/net/mac80211/iface.c
+++ b/net/mac80211/iface.c
@@ -49,19 +49,23 @@ static void ieee80211_iface_work(struct work_struct *work);
 
 bool __ieee80211_recalc_txpower(struct ieee80211_sub_if_data *sdata)
 {
+	struct ieee80211_local *local = sdata->local;
 	struct ieee80211_chanctx_conf *chanctx_conf;
 	int power;
 
-	rcu_read_lock();
-	chanctx_conf = rcu_dereference(sdata->vif.chanctx_conf);
-	if (!chanctx_conf) {
+	if (local->use_chanctx) {
+		rcu_read_lock();
+		chanctx_conf = rcu_dereference(sdata->vif.chanctx_conf);
+		if (!chanctx_conf) {
+			rcu_read_unlock();
+			return false;
+		}
+		power = ieee80211_chandef_max_power(&chanctx_conf->def);
 		rcu_read_unlock();
-		return false;
+	} else {
+		power = ieee80211_chandef_max_power(&local->hw.conf.chandef);
 	}
 
-	power = ieee80211_chandef_max_power(&chanctx_conf->def);
-	rcu_read_unlock();
-
 	if (sdata->user_power_level != IEEE80211_UNSET_POWER_LEVEL)
 		power = min(power, sdata->user_power_level);
 
diff --git a/net/mac80211/main.c b/net/mac80211/main.c
index 7b8320d4a8e4..6005c5e09be6 100644
--- a/net/mac80211/main.c
+++ b/net/mac80211/main.c
@@ -144,10 +144,14 @@ static u32 ieee80211_hw_conf_chan(struct ieee80211_local *local)
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(sdata, &local->interfaces, list) {
-		if (!rcu_access_pointer(sdata->vif.chanctx_conf))
-			continue;
 		if (sdata->vif.type == NL80211_IFTYPE_AP_VLAN)
 			continue;
+		if (sdata->user_power_level != IEEE80211_UNSET_POWER_LEVEL)
+			power = min(power, sdata->user_power_level);
+		if (sdata->ap_power_level != IEEE80211_UNSET_POWER_LEVEL)
+			power = min(power, sdata->ap_power_level);
+		if (!rcu_access_pointer(sdata->vif.chanctx_conf))
+			continue;
 		power = min(power, sdata->vif.bss_conf.txpower);
 	}
 	rcu_read_unlock();
Jeroen Roovers Jan. 22, 2019, 5:32 p.m. UTC | #11
On Thu, 10 Jan 2019 at 15:29, Jeroen Roovers <jer@airfi.aero> wrote:
>
> Aaaaaaand the results are in.
>
> On Thu, 10 Jan 2019 at 08:49, Jeroen Roovers <jer@airfi.aero> wrote:
> >
> >      Hi Stanislaw,
> >
> > On Wed, 9 Jan 2019 at 12:33, Stanislaw Gruszka <sgruszka@redhat.com> wrote:
> >
> > > So would be below patch (on top of this set) be a solution for at
> > > least to not kill the kernel. Or we looking for something better
> > > i.e. watchdog ?
> >
> > I'll give it a spin. Thanks!
>
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due
> to inactivity
> kernel: [  500.782266] ieee80211 phy0: rt2x00usb_vendor_request: Error
> - Vendor Request 0x07 failed for offset 0x6888 with error -110
> kernel: [  500.912237] ieee80211 phy0: rt2x00usb_vendor_request: Error
> - Vendor Request 0x06 failed for offset 0x6888 with error -110
> kernel: [  501.042235] ieee80211 phy0: rt2x00usb_vendor_request: Error
> - Vendor Request 0x06 failed for offset 0x6110 with error -110
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due
> to inactivity (timer DEAUTH/REMOVE)
> kernel: [  501.772201] ieee80211 phy0: rt2x00usb_vendor_request: Error
> - Vendor Request 0x07 failed for offset 0x1018 with error -110
> kernel: [  501.902177] ieee80211 phy0: rt2x00usb_vendor_request: Error
> - Vendor Request 0x06 failed for offset 0x1018 with error -110
> kernel: [  501.972186] ieee80211 phy0: rt2x00usb_vendor_request: Error
> - Vendor Request 0x06 failed for offset 0x1910 with error -110
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due
> to inactivity
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due
> to inactivity (timer DEAUTH/REMOVE)
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due
> to inactivity
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due
> to inactivity (timer DEAUTH/REMOVE)
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due
> to inactivity
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due
> to inactivity (timer DEAUTH/REMOVE)
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due
> to inactivity
> hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due
> to inactivity (timer DEAUTH/REMOVE)
>
> rt2x00usb_check_usb_error in your patch is set to
> clearDEVICE_STATE_PRESENT after ten errors, but in this case only 6
> errors were seen. Maybe I should set it to 1 as I have never seen an
> RT5592 recover from this.
>
> The system remained relatively stable until after I tried forcibly
> removing and then loading the rt2800usb module. A simple `ifconfig`
> then triggered a kernel panic. Sadly I couldn't capture it in time but
> I did spot that more phyN (up to phy4) devices had been added.

Yes, even when some rt2x00usb_vendor_request starts to fail but keeps
on trying, the WLAN modules remain unrecoverable except by removing
power, so the 10 retries are usually not reached and the device is
never removed. Could it be that some operations do succeed, perhaps
because the MCU was reset and is now capable of responding to some
"vendor requests" but not others?

Checking for num_proto_errs > 1 would then make as much sense as
num_proto_errs > 10 when running an AP. Maybe it's different for an
STA?


Kind regards,
       jer
Stanislaw Gruszka Feb. 12, 2019, 3:02 p.m. UTC | #12
On Tue, Jan 22, 2019 at 06:32:05PM +0100, Jeroen Roovers wrote:
> > then triggered a kernel panic. Sadly I couldn't capture it in time but
> > I did spot that more phyN (up to phy4) devices had been added.
> 
> Yes, even when some rt2x00usb_vendor_request starts to fail but keeps
> on trying, the WLAN modules remain unrecoverable except by removing
> power, so the 10 retries are usually not reached and the device is
> never removed. Could it be that some operations do succeed, perhaps
> because the MCU was reset and is now capable of responding to some
> "vendor requests" but not others?
> 
> Checking for num_proto_errs > 1 would then make as much sense as
> num_proto_errs > 10 when running an AP. Maybe it's different for an
> STA?

Does it make sense to do num_proto_err > 3 check. Would that
be helpful on your problem with rt2800usb device? I don't want
make num_proto_err > 1 since this will remove device just after
2 errors.

Also have you tested tx_power patch with lowering txpower ?
Did it improve things ?

Stanislaw
diff mbox series

Patch

diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00.h b/drivers/net/wireless/ralink/rt2x00/rt2x00.h
index 94459d5ee01b..88e0dc803227 100644
--- a/drivers/net/wireless/ralink/rt2x00/rt2x00.h
+++ b/drivers/net/wireless/ralink/rt2x00/rt2x00.h
@@ -1022,6 +1022,7 @@  struct rt2x00_dev {
 	unsigned int extra_tx_headroom;
 
 	struct usb_anchor *anchor;
+	unsigned int num_proto_errs;
 
 	/* Clock for System On Chip devices. */
 	struct clk *clk;
diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c b/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c
index 086aad22743d..60b8bccab83d 100644
--- a/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c
+++ b/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c
@@ -31,6 +31,22 @@ 
 #include "rt2x00.h"
 #include "rt2x00usb.h"
 
+static bool rt2x00usb_check_usb_error(struct rt2x00_dev *rt2x00dev, int status)
+{
+	if (status == -ENODEV || status == -ENOENT)
+		return true;
+	
+	if (status == -EPROTO)
+		rt2x00dev->num_proto_errs++;
+	else
+		rt2x00dev->num_proto_errs = 0;
+
+	if (rt2x00dev->num_proto_errs > 10)
+		return true;
+
+	return false;
+}
+
 /*
  * Interfacing with the HW.
  */
@@ -57,7 +73,7 @@  int rt2x00usb_vendor_request(struct rt2x00_dev *rt2x00dev,
 		if (status >= 0)
 			return 0;
 
-		if (status == -ENODEV || status == -ENOENT) {
+		if (rt2x00usb_check_usb_error(rt2x00dev, status)) {
 			/* Device has disappeared. */
 			clear_bit(DEVICE_STATE_PRESENT, &rt2x00dev->flags);
 			break;
@@ -321,7 +337,7 @@  static bool rt2x00usb_kick_tx_entry(struct queue_entry *entry, void *data)
 
 	status = usb_submit_urb(entry_priv->urb, GFP_ATOMIC);
 	if (status) {
-		if (status == -ENODEV || status == -ENOENT)
+		if (rt2x00usb_check_usb_error(rt2x00dev, status))
 			clear_bit(DEVICE_STATE_PRESENT, &rt2x00dev->flags);
 		set_bit(ENTRY_DATA_IO_FAILED, &entry->flags);
 		rt2x00lib_dmadone(entry);
@@ -410,7 +426,7 @@  static bool rt2x00usb_kick_rx_entry(struct queue_entry *entry, void *data)
 
 	status = usb_submit_urb(entry_priv->urb, GFP_ATOMIC);
 	if (status) {
-		if (status == -ENODEV || status == -ENOENT)
+		if (rt2x00usb_check_usb_error(rt2x00dev, status))
 			clear_bit(DEVICE_STATE_PRESENT, &rt2x00dev->flags);
 		set_bit(ENTRY_DATA_IO_FAILED, &entry->flags);
 		rt2x00lib_dmadone(entry);