[PATCHv2] ath9k_hw: Handle AR_INTR_SYNC_HOST1_FATAL on AR9003

Message ID 1349174009-16496-1-git-send-email-sven@narfation.org (mailing list archive)
State Not Applicable, archived

Commit Message

Sven Eckelmann Oct. 2, 2012, 10:33 a.m. UTC
Interrupts with the sync_cause AR_INTR_SYNC_HOST1_FATAL have to be handled
using a chip reset. Otherwise an interrupt storm with unhandled interrupts
will cause a hang or crash of the machine.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
---
I was informed that AR_INTR_SYNC_HOST1_PERR should not be handled this way
because it can create system freezes after an adhoc interface was created.

I really need some Atheros developer who can check the documentation to
verify the interpretation of these flags. Otherwise this is just guessing
and may lead to even bigger problems.

 drivers/net/wireless/ath/ath9k/ar9003_mac.c |    5 +++++
 1 file changed, 5 insertions(+)

Comments

Adrian Chadd Oct. 2, 2012, 1:13 p.m. UTC | #1
.. well, the rule here is "You shouldn't get PERR/FATAL interrupts."

Haven't I posted a summary of what those errors are?

Ok. So they're signals from the PCIe core (named host1_fatal and
host1_perr. Helpfully.) Those errors occurred during a DMA transfer.

So the question is why you're seeing PERR interrupts when creating an
adhoc interface. That hints to me that something odd is going on..

I've seen these issues creep up when the NIC is in some way behaving
very, very badly (lots of timeouts and sync errors with little to no
traffic at all), which resulted in all kinds of odd and weird,
unstable behaviour. After replacing the NIC with another NIC (in my
case, an AR9280 -> AR9280 NIC :-) the errors went away and things
continued swimmingly.

I'd have to go digging through the PCIe core source to figure out
exactly what host1_perr and host1_fatal mean. I can if you'd like,
it'll just take some time as I'm not familiar at all with the PCIe
host interface.

Thanks,



Adrian

On 2 October 2012 03:33, Sven Eckelmann <sven@narfation.org> wrote:
> Interrupts with the sync_cause AR_INTR_SYNC_HOST1_FATAL have to be handled
> using a chip reset. Otherwise an interrupt storm with unhandled interrupts
> will cause a hang or crash of the machine.
>
> Signed-off-by: Sven Eckelmann <sven@narfation.org>
> ---
> I was informed that AR_INTR_SYNC_HOST1_PERR should not be handled this way
> because it can create system freezes after an adhoc interface was created.
>
> I really need some Atheros developer who can check the documentation to
> verify the interpretation of these flags. Otherwise this is just guessing
> and may lead to even bigger problems.
>
>  drivers/net/wireless/ath/ath9k/ar9003_mac.c |    5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/drivers/net/wireless/ath/ath9k/ar9003_mac.c b/drivers/net/wireless/ath/ath9k/ar9003_mac.c
> index d5b2e0e..6031bdf 100644
> --- a/drivers/net/wireless/ath/ath9k/ar9003_mac.c
> +++ b/drivers/net/wireless/ath/ath9k/ar9003_mac.c
> @@ -311,6 +311,11 @@ static bool ar9003_hw_get_isr(struct ath_hw *ah, enum ath9k_int *masked)
>         if (sync_cause) {
>                 ath9k_debug_sync_cause(common, sync_cause);
>
> +               if (sync_cause & AR_INTR_SYNC_HOST1_FATAL) {
> +                       ath_dbg(common, ANY, "received PCI FATAL interrupt\n");
> +                       *masked |= ATH9K_INT_FATAL;
> +               }
> +
>                 if (sync_cause & AR_INTR_SYNC_RADM_CPL_TIMEOUT) {
>                         REG_WRITE(ah, AR_RC, AR_RC_HOSTIF);
>                         REG_WRITE(ah, AR_RC, 0);
> --
> 1.7.10.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Felix Fietkau Oct. 2, 2012, 1:33 p.m. UTC | #2
On 2012-10-02 3:13 PM, Adrian Chadd wrote:
> .. well, the rule here is "You shouldn't get PERR/FATAL interrupts."
> 
> Haven't I posted a summary of what those errors are?
> 
> Ok. So they're signals from the PCIe core (named host1_fatal and
> host1_perr. Helpfully.) Those errors occurred during a DMA transfer.
> 
> So the question is why you're seeing PERR interrupts when creating an
> adhoc interface. That hints to me that something odd is going on..
> 
> I've seen these issues creep up when the NIC is in some way behaving
> very, very badly (lots of timeouts and sync errors with little to no
> traffic at all), which resulted in all kinds of odd and weird,
> unstable behaviour. After replacing the NIC with another NIC (in my
> case, an AR9280 -> AR9280 NIC :-) the errors went away and things
> continued swimmingly.
> 
> I'd have to go digging through the PCIe core source to figure out
> exactly what host1_perr and host1_fatal mean. I can if you'd like,
> it'll just take some time as I'm not familiar at all with the PCIe
> host interface.
According to the datasheet, AR_INTR_SYNC_HOST1_PERR is triggered by an
invalid register access, and AR_INTR_SYNC_HOST1_FATAL is triggered by
corrupt descriptors or other DMA issues.
Maybe you can get some information on the source of this PERR error if
you record the last register accesses outside of the irq context and
print them once this IRQ comes in.
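
The suggestion above could be prototyped roughly as follows. This is only a
sketch under assumptions: reg_trace_record(), reg_trace_dump(), struct
reg_access and REG_TRACE_LEN are hypothetical names that do not exist in
ath9k, and the code is written as plain user-space C so that it can be tried
in isolation. A real driver would also have to make the buffer safe against
concurrent access, which is left out here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define REG_TRACE_LEN 8u	/* hypothetical depth of the history */

struct reg_access {
	uint32_t reg;	/* register offset */
	uint32_t val;	/* value read or written */
	int is_write;	/* 1 for a write, 0 for a read */
};

static struct reg_access reg_trace[REG_TRACE_LEN];
static unsigned int reg_trace_count;	/* total accesses recorded so far */

/* Record one access; meant to be called from the register read/write
 * wrappers, i.e. outside of the irq context. Overwrites the oldest entry. */
static void reg_trace_record(uint32_t reg, uint32_t val, int is_write)
{
	struct reg_access *e = &reg_trace[reg_trace_count % REG_TRACE_LEN];

	e->reg = reg;
	e->val = val;
	e->is_write = is_write;
	reg_trace_count++;
}

/* Print the recorded accesses, oldest first; meant to be called once the
 * PERR cause is seen. Returns the number of entries printed. Wrap-around
 * of reg_trace_count is ignored in this sketch. */
static unsigned int reg_trace_dump(FILE *out)
{
	unsigned int n = reg_trace_count < REG_TRACE_LEN ?
			 reg_trace_count : REG_TRACE_LEN;
	unsigned int first = reg_trace_count - n;
	unsigned int i;

	assert(out != NULL);
	for (i = 0; i < n; i++) {
		const struct reg_access *e =
			&reg_trace[(first + i) % REG_TRACE_LEN];

		fprintf(out, "%s reg 0x%04x val 0x%08x\n",
			e->is_write ? "write" : "read ",
			(unsigned int)e->reg, (unsigned int)e->val);
	}
	return n;
}
```

Only the last REG_TRACE_LEN accesses are kept, so the dump shows what the
driver touched immediately before the error was raised.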

- Felix

Simon Wunderlich Oct. 2, 2012, 1:35 p.m. UTC | #3
Hey Adrian,

On Tue, Oct 02, 2012 at 06:13:37AM -0700, Adrian Chadd wrote:
> .. well, the rule here is "You shouldn't get PERR/FATAL interrupts."
> 
> Haven't I posted a summary of what those errors are?
> 
> Ok. So they're signals from the PCIe core (named host1_fatal and
> host1_perr. Helpfully.) Those errors occurred during a DMA transfer.
> 
> So the question is why you're seeing PERR interrupts when creating an
> adhoc interface. That hints to me that something odd is going on..

thanks for the explanation!

> 
> I've seen these issues creep up when the NIC is in some way behaving
> very, very badly (lots of timeouts and sync errors with little to no
> traffic at all), which resulted in all kinds of odd and weird,
> unstable behaviour. After replacing the NIC with another NIC (in my
> case, an AR9280 -> AR9280 NIC :-) the errors went away and things
> continued swimmingly.

Sounds like a good solution, but I'm afraid it won't work for us. We
are using AR9330 SoCs (Hornet), and unless we have a very sharp knife
we won't be able to replace the NIC ... And cutting a few thousand of
them won't be much fun either.

I'm starting to lose a little bit of confidence in these insects ... :/

> 
> I'd have to go digging through the PCIe core source to figure out
> exactly what host1_perr and host1_fatal mean. I can if you'd like,
> it'll just take some time as I'm not familiar at all with the PCIe
> host interface.

It would at least be interesting to know whether we are supposed to handle
the interrupt somehow instead of resetting the chip.

Thanks,
	Simon
> 
> Thanks,
> 
> 
> 
> Adrian
> 
> On 2 October 2012 03:33, Sven Eckelmann <sven@narfation.org> wrote:
> > Interrupts with the sync_cause AR_INTR_SYNC_HOST1_FATAL have to be handled
> > using a chip reset. Otherwise an interrupt storm with unhandled interrupts
> > will cause a hang or crash of the machine.
> >
> > Signed-off-by: Sven Eckelmann <sven@narfation.org>
> > ---
> > I was informed that AR_INTR_SYNC_HOST1_PERR should not be handled this way
> > because it can create system freezes after an adhoc interface was created.
> >
> > I really need some Atheros developer who can check the documentation to
> > verify the interpretation of these flags. Otherwise this is just guessing
> > and may lead to even bigger problems.
> >
> >  drivers/net/wireless/ath/ath9k/ar9003_mac.c |    5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/drivers/net/wireless/ath/ath9k/ar9003_mac.c b/drivers/net/wireless/ath/ath9k/ar9003_mac.c
> > index d5b2e0e..6031bdf 100644
> > --- a/drivers/net/wireless/ath/ath9k/ar9003_mac.c
> > +++ b/drivers/net/wireless/ath/ath9k/ar9003_mac.c
> > @@ -311,6 +311,11 @@ static bool ar9003_hw_get_isr(struct ath_hw *ah, enum ath9k_int *masked)
> >         if (sync_cause) {
> >                 ath9k_debug_sync_cause(common, sync_cause);
> >
> > +               if (sync_cause & AR_INTR_SYNC_HOST1_FATAL) {
> > +                       ath_dbg(common, ANY, "received PCI FATAL interrupt\n");
> > +                       *masked |= ATH9K_INT_FATAL;
> > +               }
> > +
> >                 if (sync_cause & AR_INTR_SYNC_RADM_CPL_TIMEOUT) {
> >                         REG_WRITE(ah, AR_RC, AR_RC_HOSTIF);
> >                         REG_WRITE(ah, AR_RC, 0);
> > --
> > 1.7.10.4
> >
Adrian Chadd Oct. 2, 2012, 2:06 p.m. UTC | #4
Hm, there are still issues on Hornet?

And no, you're not supposed to handle the interrupt per se.. it's a
sign that things got upset and you can't trust anything from that
point forward.

Felix is right, it'd be good to log the register accesses that lead up to this.



Adrian
Sven Eckelmann Oct. 2, 2012, 3:02 p.m. UTC | #5
On Tuesday 02 October 2012 07:06:03 Adrian Chadd wrote:
> Hm, there are still issues on Hornet?

Yes, we still have problems with Hornet. The issue I am trying to "fix" with
this patch is an interrupt storm on AR9330 devices with sta interface(s).
Random devices crash after printing a stack trace that reports
__report_bad_irq. The crash results in either a reboot or a hang of the device:

[  952.950000] irq 2: nobody cared (try booting with the "irqpoll" option)
[  952.950000] Call Trace:
[  952.950000] [<8026ade8>] dump_stack+0x8/0x34
[  952.950000] [<800a75d0>] __report_bad_irq+0x44/0xf4
[  952.950000] [<800a78ec>] note_interrupt+0x200/0x2a4
[  952.950000] [<800a58c8>] handle_irq_event_percpu+0x19c/0x1e0
[  952.950000] [<800a86cc>] handle_percpu_irq+0x54/0x88
[  952.950000] [<800a501c>] generic_handle_irq+0x3c/0x4c
[  952.950000] [<80064748>] do_IRQ+0x1c/0x34
[  952.950000] [<80062d6c>] ret_from_irq+0x0/0x4
[  952.950000] [<8007673c>] tasklet_action+0xb8/0xd4
[  952.950000] [<80076c24>] __do_softirq+0xa0/0x154
[  952.950000] [<80076e30>] do_softirq+0x48/0x68
[  952.950000] [<80076f94>] local_bh_enable+0x94/0xb0
[  952.950000] [<83406d60>] cfg80211_scan_done+0x670/0x6d0 [cfg80211]
[  952.950000] 
[  952.950000] handlers:
[  952.950000] [<83564d48>] ath_isr
[  952.950000] Disabling IRQ #2

The test setup uses 30 AR9330 devices running OpenWRT 32727/33559. 32727
uses compat-wireless-2012-04-17 (+ many OpenWRT patches) and 33559
runs compat-wireless-2012-09-07 (+ many more patches from Felix). One device
runs an open AP (standard OpenWRT settings) and 29 devices
try to connect. Random devices then fail. To debug this problem, I used
one device with 8 vifs and restarted the network script again and
again to force the recreation of the vifs and a reconnect.

The stack trace doesn't seem to be very helpful. Therefore, I checked ath_isr
and noticed that the interrupts right before the device crash get the status 0
from ar9003_hw_get_isr. Digging a little bit further also revealed that the
interrupts in the interrupt storm also have async_cause 0 and sync_cause 0x20.

This sync cause 0x20 isn't handled anywhere and may be the cause of the 
hang/crash. At least this is the symptom which can be fixed without crashing 
the system.
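
The chain described above can be modelled in a few lines. This is a
simplified sketch and not the driver code: all sketch_* names and the
SKETCH_* values are made up for illustration, and it assumes that the
handler reports the interrupt as not handled whenever the status is 0.
With cause 0x20 unmapped the handler keeps reporting "not handled", which
is what the kernel complains about with "nobody cared" once it has happened
often enough. With the cause mapped to a fatal status bit the handler can
claim the interrupt and request a reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative values only; these are not the driver's definitions. */
#define SKETCH_SYNC_CAUSE_BIT5	(1u << 5)	/* 0x20, seen during the storm */
#define SKETCH_INT_OTHER	(1u << 0)	/* hypothetical: any async event */
#define SKETCH_INT_FATAL	(1u << 1)	/* hypothetical: reset required */

enum sketch_irqreturn { SKETCH_IRQ_NONE, SKETCH_IRQ_HANDLED };

/* Translate raw cause values into a status mask, loosely modelled on the
 * shape of ar9003_hw_get_isr(). handle_bit5 selects whether cause 0x20 is
 * mapped to a status bit. Returns true if any status bit was set. */
static bool sketch_get_isr(uint32_t async_cause, uint32_t sync_cause,
			   bool handle_bit5, uint32_t *masked)
{
	assert(masked != NULL);
	*masked = 0;

	if (async_cause)
		*masked |= SKETCH_INT_OTHER;
	if (handle_bit5 && (sync_cause & SKETCH_SYNC_CAUSE_BIT5))
		*masked |= SKETCH_INT_FATAL;

	return *masked != 0;
}

/* Stand-in for the interrupt handler: status 0 means "not ours". */
static enum sketch_irqreturn sketch_isr(uint32_t async_cause,
					uint32_t sync_cause, bool handle_bit5)
{
	uint32_t status;

	if (!sketch_get_isr(async_cause, sync_cause, handle_bit5, &status))
		return SKETCH_IRQ_NONE;	/* interrupt stays unclaimed */

	return SKETCH_IRQ_HANDLED;	/* caller can now schedule a reset */
}
```

Calling sketch_isr(0, 0x20, false) models the storm (async_cause 0,
sync_cause 0x20, status 0), while sketch_isr(0, 0x20, true) models the
behaviour with the cause mapped to a fatal status.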

I hope that helps to track down the problem.

Kind regards,
        Sven
Felix Fietkau Oct. 2, 2012, 3:20 p.m. UTC | #6
On 2012-10-02 5:02 PM, Sven Eckelmann wrote:
> On Tuesday 02 October 2012 07:06:03 Adrian Chadd wrote:
>> Hm, there are still issues on Hornet?
> 
> Yes, we still have problems with Hornet. The issue I am trying to "fix" with
> this patch is an interrupt storm on AR9330 devices with sta interface(s).
> Random devices crash after printing a stack trace that reports
> __report_bad_irq. The crash results in either a reboot or a hang of the device:
> 
> [  952.950000] irq 2: nobody cared (try booting with the "irqpoll" option)
> [  952.950000] Call Trace:
> [  952.950000] [<8026ade8>] dump_stack+0x8/0x34
> [  952.950000] [<800a75d0>] __report_bad_irq+0x44/0xf4
> [  952.950000] [<800a78ec>] note_interrupt+0x200/0x2a4
> [  952.950000] [<800a58c8>] handle_irq_event_percpu+0x19c/0x1e0
> [  952.950000] [<800a86cc>] handle_percpu_irq+0x54/0x88
> [  952.950000] [<800a501c>] generic_handle_irq+0x3c/0x4c
> [  952.950000] [<80064748>] do_IRQ+0x1c/0x34
> [  952.950000] [<80062d6c>] ret_from_irq+0x0/0x4
> [  952.950000] [<8007673c>] tasklet_action+0xb8/0xd4
> [  952.950000] [<80076c24>] __do_softirq+0xa0/0x154
> [  952.950000] [<80076e30>] do_softirq+0x48/0x68
> [  952.950000] [<80076f94>] local_bh_enable+0x94/0xb0
> [  952.950000] [<83406d60>] cfg80211_scan_done+0x670/0x6d0 [cfg80211]
> [  952.950000] 
> [  952.950000] handlers:
> [  952.950000] [<83564d48>] ath_isr
> [  952.950000] Disabling IRQ #2
> 
> The test setup uses 30 AR9330 devices running OpenWRT 32727/33559. 32727
> uses compat-wireless-2012-04-17 (+ many OpenWRT patches) and 33559
> runs compat-wireless-2012-09-07 (+ many more patches from Felix). One device
> runs an open AP (standard OpenWRT settings) and 29 devices
> try to connect. Random devices then fail. To debug this problem, I used
> one device with 8 vifs and restarted the network script again and
> again to force the recreation of the vifs and a reconnect.
> 
> The stack trace doesn't seem to be very helpful. Therefore, I checked ath_isr
> and noticed that the interrupts right before the device crash get the status 0
> from ar9003_hw_get_isr. Digging a little bit further also revealed that the
> interrupts in the interrupt storm also have async_cause 0 and sync_cause 0x20.
> 
> This sync cause 0x20 isn't handled anywhere and may be the cause of the 
> hang/crash. At least this is the symptom which can be fixed without crashing 
> the system.
I checked the AR933x datasheet, and it says that cause 0x20 is tx
descriptor corruption.

- Felix

Adrian Chadd Oct. 3, 2012, 2:51 p.m. UTC | #7
On 2 October 2012 08:20, Felix Fietkau <nbd@openwrt.org> wrote:

>> This sync cause 0x20 isn't handled anywhere and may be the cause of the
>> hang/crash. At least this is the symptom which can be fixed without crashing
>> the system.
> I checked the AR933x datasheet, and it says that cause 0x20 is tx
> descriptor corruption.

Ah hey, for Hornet they redefined those bits:

5: MAC_TXC_CORRUPTION_FLAG_SYNC (TX descriptor integrity flag)
6: INVALID_ADDRESS_ACCESS (invalid register access)
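
To make the mapping concrete: bit 5 corresponds to the value 0x20 reported
for the interrupt storm, and bit 6 to 0x40. The following is a minimal
decoding sketch, assuming only the two bit meanings quoted above; the
hornet_* function names are made up for illustration and are not ath9k
identifiers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Name of a Hornet sync cause bit, limited to the two bits quoted above. */
static const char *hornet_sync_bit_name(unsigned int bit)
{
	switch (bit) {
	case 5:
		return "MAC_TXC_CORRUPTION_FLAG_SYNC";	/* mask 0x20 */
	case 6:
		return "INVALID_ADDRESS_ACCESS";	/* mask 0x40 */
	default:
		return "unknown";
	}
}

/* Print one line per set bit in sync_cause; returns the number of set bits. */
static unsigned int hornet_print_sync_cause(FILE *out, uint32_t sync_cause)
{
	unsigned int bit, count = 0;

	assert(out != NULL);
	for (bit = 0; bit < 32; bit++) {
		if (sync_cause & (1u << bit)) {
			fprintf(out, "sync_cause bit %u (0x%08x): %s\n",
				bit, 1u << bit, hornet_sync_bit_name(bit));
			count++;
		}
	}
	return count;
}
```

Passing 0x20 to hornet_print_sync_cause() reports a single set bit, bit 5,
named MAC_TXC_CORRUPTION_FLAG_SYNC.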

Good catch. That's definitely something in the right direction.



Adrian
Patch

diff --git a/drivers/net/wireless/ath/ath9k/ar9003_mac.c b/drivers/net/wireless/ath/ath9k/ar9003_mac.c
index d5b2e0e..6031bdf 100644
--- a/drivers/net/wireless/ath/ath9k/ar9003_mac.c
+++ b/drivers/net/wireless/ath/ath9k/ar9003_mac.c
@@ -311,6 +311,11 @@  static bool ar9003_hw_get_isr(struct ath_hw *ah, enum ath9k_int *masked)
 	if (sync_cause) {
 		ath9k_debug_sync_cause(common, sync_cause);
 
+		if (sync_cause & AR_INTR_SYNC_HOST1_FATAL) {
+			ath_dbg(common, ANY, "received PCI FATAL interrupt\n");
+			*masked |= ATH9K_INT_FATAL;
+		}
+
 		if (sync_cause & AR_INTR_SYNC_RADM_CPL_TIMEOUT) {
 			REG_WRITE(ah, AR_RC, AR_RC_HOSTIF);
 			REG_WRITE(ah, AR_RC, 0);