diff mbox

[1/3] irq / PM: New driver interface for wakeup interrupts

Message ID 5808955.xYYC2DJrBV@vostro.rjw.lan (mailing list archive)
State RFC, archived
Headers show

Commit Message

Rafael J. Wysocki July 30, 2014, 9:51 p.m. UTC
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Device drivers currently use enable_irq_wake() to configure their
interrupts for system wakeup, but that API is not particularly
well suited for this purpose, because it goes directly all the
way to the hardware and attempts to change the IRQ configuration
at the chip level.

The first problem with this approach is that the IRQ subsystem
is not told which interrupt handler is supposed to handle
interrupts from the wakeup line should they occur during system
suspend or resume.  That is problematic if the IRQ is shared
and the other devices sharing it with the wakeup device in question
are not wakeup devices.  In that case their drivers may not be
prepared to handle interrupts after the devices have been powered
down and they may expect suspend_device_irqs() to disable the
interrupt.  For this reason, the IRQ should not be left enabled
by suspend_device_irqs() in that case.  On the other hand, though,
it needs to be left enabled to prevent wakeup events occuring
after suspend_device_irqs() has returned from being lost.

The second problem is that on some platforms enable_irq_wake()
results in moving the IRQ over to a special interrupt controller
whose voltage is not removed in the final platform state.  That
allows the platform to react to wakeup signals from the IRQ while
suspended, but the IRQ stops generating regular interrupts at that
point.  That may lead to the loss of wakeup interrupts if they
come in after calling enable_irq_wake() and before the platform
is put into the final state.  Moreover, if the IRQ is shared
and enable_irq_wake() is called from a device driver's .suspend()
callback, for example, it may prevent interrupts generated by
the other devices sharing the line from being handled.

To address the above issues introduce a new interface that can be
used by drivers to request that IRQs be configured for system
wakeup.  That interface doesn't actually change the hardware
state, but tells the IRQ subsystem that the given interrupt should
or should not be configured for system wakeup at the right time.

First, enable_device_irq_wake() takes two arguments, the IRQ
number and the device cookie used when requesting the IRQ.  The
cookie is used to identify the irqaction that should be used for
handling interrups during system suspend and resume.  Namely,
the (new) IRQF_WAKEUP flag is set for that irqaction (if not
set already) and the (new) wake_needed field of the irq_desc
identified by the first argument is incremented.

Second, suspend_device_irqs() is modified to treat irqactions with
IRQF_WAKEUP set in the same way as irqactions with IRQF_NO_SUSPEND
set.  That is, the handlers of those irqactions are left enabled
for the entire duration of system suspend/resume.

Next, the (new) syscore suspend routine for the IRQ subsystem,
irq_pm_syscore_suspend(), browses all irq_descs and calls
enable_irq_wake() for the ones with wake_needed set.  If that is
successful, the (new) IRQS_WAKE_READY flag is set for the given
irq_desc to indicate that the IRQ should be switched back from
the wakeup mode during resume.

The IRQ subsystem's syscore resume routine, irq_pm_syscore_resume(),
is modified to call disable_irq_wake() for each irq_desc with
IRQS_WAKE_READY set and clears that flag for all of them.

Finally, disable_device_irq_wake() takes the same arguments as
enable_device_irq_wake(), finds the irqaction identified by the
second argument, clears IRQF_WAKEUP for it and decrements
wake_needed for the irq_desc identified by the first argument.

This organization of code guarantees that suspend_device_irqs()
will leave wakeup IRQs enabled, but also will block the execution
of interrupt handlers that should not be invoked going forward,
which allows wakeup interrupts to be handled and prevents possible
driver bugs from being tripped over at the same time.  It also
ensures that IRQs will be reconfigured for wakeup at the point
where that should not disturb any legitimate functionality.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 include/linux/interrupt.h |   12 ++++++++++
 include/linux/irqdesc.h   |    1 
 kernel/irq/internals.h    |    1 
 kernel/irq/manage.c       |   53 ++++++++++++++++++++++++++++++++++++++++++++++
 kernel/irq/pm.c           |   47 ++++++++++++++++++++++++++++++++++++----
 5 files changed, 109 insertions(+), 5 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Thomas Gleixner July 30, 2014, 10:56 p.m. UTC | #1
On Wed, 30 Jul 2014, Rafael J. Wysocki wrote:

> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Device drivers currently use enable_irq_wake() to configure their
> interrupts for system wakeup, but that API is not particularly
> well suited for this purpose, because it goes directly all the
> way to the hardware and attempts to change the IRQ configuration
> at the chip level.
>
> The first problem with this approach is that the IRQ subsystem
> is not told which interrupt handler is supposed to handle
> interrupts from the wakeup line should they occur during system
> suspend or resume.  That is problematic if the IRQ is shared
> and the other devices sharing it with the wakeup device in question
> are not wakeup devices.  In that case their drivers may not be
> prepared to handle interrupts after the devices have been powered
> down and they may expect suspend_device_irqs() to disable the
> interrupt.  For this reason, the IRQ should not be left enabled
> by suspend_device_irqs() in that case.  On the other hand, though,
> it needs to be left enabled to prevent wakeup events occuring
> after suspend_device_irqs() has returned from being lost.

I really disagree here. The API was designed at a point where it was
very well suited for the purpose. At least from the POV of the
hardware which caused that infrastructure to be built. Looking at it
10 years later with a different set of hardware requirements in mind
does not make it invalid.

That's really not the way it works. x86 didn't give a rats ass 10
years ago when this was introduced, simply because there was no x86
hardware which could support this or if there was hardware nobody was
interested to do so. Coming in 10 years after the fact and telling
those who designed and used this for 10 years, that it's a design
failure is more than inappropriate.

There is nothing wrong to point out that existing infrastructure is
not able to handle the new requirements of differently (and partially
ass backwards) designed hardware. But that's different from stating:

   "that API is not particularly well suited for the purpose"

What's worse is that you are merily fiddling around in the existing
code without doing a proper analysis of the existing semantics and a
proper description of your new semantics.

Before this code changes in any way I want to see:

 1) a description of the existing semantics and their background

 2) a description of the short comings of the existing semantics w/o
    considering the new fangled (hardware) use cases

 3) a description how to mitigate the short comings described in #2
    w/o breaking the world and some more and introducing hard to
    decode regressions

 4) a description why the improvements introduced by #3 are not
    sufficient for the new fangled (hardware) use cases

 5) a description how to mitigate the short comings described in #4
    w/o breaking the world and some more and introducing hard to
    decode regressions

Thanks,

	tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Gleixner July 31, 2014, 12:12 a.m. UTC | #2
On Thu, 31 Jul 2014, Thomas Gleixner wrote:
> On Wed, 30 Jul 2014, Rafael J. Wysocki wrote:
> Before this code changes in any way I want to see:
> 
>  1) a description of the existing semantics and their background
> 
>  2) a description of the short comings of the existing semantics w/o
>     considering the new fangled (hardware) use cases
> 
>  3) a description how to mitigate the short comings described in #2
>     w/o breaking the world and some more and introducing hard to
>     decode regressions
> 
>  4) a description why the improvements introduced by #3 are not
>     sufficient for the new fangled (hardware) use cases
> 
>  5) a description how to mitigate the short comings described in #4
>     w/o breaking the world and some more and introducing hard to
>     decode regressions

Aside of that I want to see a coherent explanation why a shared MSI
interrupt makes any sense at all.

25:  1 <....> 0   PCI-MSI-edge      aerdrv, PCIe PME

AFAICT, that's the primary reason why you insist to support wakeup
sources on shared irq lines. And to be honest, it's utter bullshit.

The whole purpose of MSI is to avoid interrupt sharing, but of course
if that particular interrupt source has two potential causes, i.e. the
AER and the PME one and the stupid hardware does not support different
vectors or the drivers are not able to do so, it's conveniant to make
them shared instead of thinking about them what they really are:

 Separate interrupts on a secondary interrupt controller connected to
 the primary (MSI) one.

That's what is in fact. Simply because you can enable/disable them at
the same IP block quite contrary to stuff which is physically shared
by hard wired electrical connections.

Slapping IRQF_SHARED on that and then whining about shortcomings of
the core code is way simpler than sitting down and think about
actual topologies, right?

Most architectures aside of x86 have long ago realized that and the
core has all infrastructure available to deal with irq topologies. You
just have to utilize it.

Thanks,

	tglx




--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki July 31, 2014, 2:14 a.m. UTC | #3
On Thursday, July 31, 2014 02:12:11 AM Thomas Gleixner wrote:
> On Thu, 31 Jul 2014, Thomas Gleixner wrote:
> > On Wed, 30 Jul 2014, Rafael J. Wysocki wrote:
> > Before this code changes in any way I want to see:
> > 
> >  1) a description of the existing semantics and their background

On that one, the meaning of IRQF_NO_SUSPEND is quite simple to me.

If it is set (for the first irqaction in a given irq_desc)
suspend_device_irqs() will leave that IRQ alone (that is, will not
disable it and will not mark it as IRQS_SUSPENDED).

As a result, if an interrupt for that irq_desc happens after
suspend_device_irqs(), the interrupt handlers from all of its irqactions
will be invoked.

So this basically means "ignore that IRQ" to suspend_device_irqs() and
that's its *only* meaning.

It's primary users are timers, per-CPU IRQs, ACPI SCI, via-pmu, Xen.
There also is a bunch of drivers that use it for wakeup stuff, but they
shouldn't in my opinion.

The background I'm aware of was primarily timers (we wanted to be able
to msleep() during PCI PM transitions in the noirq phases of system suspend
and resume among other things), and we want per-CPU stuff to work all the
time too.  ACPI uses it for signaling various types of events (including
battery and thermal) that need to be handled all the time.  I'm not sure
why Xen needs it exactly, but that seems to be IPI-related.

The current handling of IRQF_NO_SUSPEND has a problem vs shared interrupts
which I've tried to address by https://patchwork.kernel.org/patch/4643871/
and which is described in the changelog of that patch.  Unfortunately, some
existing users pass IRQF_SHARED along with IRQF_NO_SUSPEND which is the main
motivation for that.  Many of them use it for wakeup purposes, but for one
example (that doesn't use it for wakeup only) the ACPI SCI is shareable by
design.

To be continued.


> >  2) a description of the short comings of the existing semantics w/o
> >     considering the new fangled (hardware) use cases
> > 
> >  3) a description how to mitigate the short comings described in #2
> >     w/o breaking the world and some more and introducing hard to
> >     decode regressions
> > 
> >  4) a description why the improvements introduced by #3 are not
> >     sufficient for the new fangled (hardware) use cases
> > 
> >  5) a description how to mitigate the short comings described in #4
> >     w/o breaking the world and some more and introducing hard to
> >     decode regressions
> 
> Aside of that I want to see a coherent explanation why a shared MSI
> interrupt makes any sense at all.
> 
> 25:  1 <....> 0   PCI-MSI-edge      aerdrv, PCIe PME
>
> AFAICT, that's the primary reason why you insist to support wakeup
> sources on shared irq lines. And to be honest, it's utter bullshit.

No, this isn't the primary reason.

Here's /proc/interrupts from my MSI Wind system and, as you can see,
PCIe PME are (a) not MSI and (b) shared with some interesting things
(USB, WiFi and the GPU).

           CPU0       CPU1       
  0:      26321          0   IO-APIC-edge      timer
  1:        379          0   IO-APIC-edge      i8042
  7:          6          0   IO-APIC-edge    
  9:         59          0   IO-APIC-fasteoi   acpi
 12:       2824          0   IO-APIC-edge      i8042
 14:       9074          0   IO-APIC-edge      ata_piix
 15:          0          0   IO-APIC-edge      ata_piix
 16:       5217          0   IO-APIC-fasteoi   PCIe PME, uhci_hcd:usb4, i915
 17:      13964          0   IO-APIC-fasteoi   PCIe PME, rtl818x_pci
 18:          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
 19:          0          0   IO-APIC-fasteoi   uhci_hcd:usb2
 23:       9402          0   IO-APIC-fasteoi   uhci_hcd:usb1, ehci_hcd:usb5
 40:         61          0   PCI-MSI-edge      eth0
 41:        102          0   PCI-MSI-edge      snd_hda_intel
NMI:          0          0   Non-maskable interrupts
LOC:      18538      30831   Local timer interrupts
SPU:          0          0   Spurious interrupts
PMI:          0          0   Performance monitoring interrupts
IWI:          6          0   IRQ work interrupts
RTR:          0          0   APIC ICR read retries
RES:       5277       6121   Rescheduling interrupts
CAL:         92        106   Function call interrupts
TLB:        317        302   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
MCE:          0          0   Machine check exceptions
MCP:          5          5   Machine check polls
ERR:          6
MIS:          0


> The whole purpose of MSI is to avoid interrupt sharing, but of course
> if that particular interrupt source has two potential causes, i.e. the
> AER and the PME one and the stupid hardware does not support different
> vectors or the drivers are not able to do so, it's conveniant to make
> them shared instead of thinking about them what they really are:
> 
>  Separate interrupts on a secondary interrupt controller connected to
>  the primary (MSI) one.
> 
> That's what is in fact. Simply because you can enable/disable them at
> the same IP block quite contrary to stuff which is physically shared
> by hard wired electrical connections.

I agree with the above.

Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Gleixner July 31, 2014, 10:44 a.m. UTC | #4
On Thu, 31 Jul 2014, Rafael J. Wysocki wrote:

> On Thursday, July 31, 2014 02:12:11 AM Thomas Gleixner wrote:
> > On Thu, 31 Jul 2014, Thomas Gleixner wrote:
> > > On Wed, 30 Jul 2014, Rafael J. Wysocki wrote:
> > > Before this code changes in any way I want to see:
> > > 
> > >  1) a description of the existing semantics and their background
> 
> On that one, the meaning of IRQF_NO_SUSPEND is quite simple to me.
> 
> If it is set (for the first irqaction in a given irq_desc)
> suspend_device_irqs() will leave that IRQ alone (that is, will not
> disable it and will not mark it as IRQS_SUSPENDED).
> 
> As a result, if an interrupt for that irq_desc happens after
> suspend_device_irqs(), the interrupt handlers from all of its irqactions
> will be invoked.
> 
> So this basically means "ignore that IRQ" to suspend_device_irqs() and
> that's its *only* meaning.
> 
> It's primary users are timers, per-CPU IRQs, ACPI SCI, via-pmu, Xen.
> There also is a bunch of drivers that use it for wakeup stuff, but they
> shouldn't in my opinion.

Indeed.

> The background I'm aware of was primarily timers (we wanted to be able
> to msleep() during PCI PM transitions in the noirq phases of system suspend
> and resume among other things), and we want per-CPU stuff to work all the
> time too.  ACPI uses it for signaling various types of events (including
> battery and thermal) that need to be handled all the time.  I'm not sure
> why Xen needs it exactly, but that seems to be IPI-related.

So none of these interrupts is used to abort suspend or wakeup. They
are kept functional because they are required for the suspend/resume
transitions itself.

What's this PCIe PME handler doing? Is it required functionality for
the suspend/resume path or is it a wakeup/abort mechanism.

> The current handling of IRQF_NO_SUSPEND has a problem vs shared interrupts
> which I've tried to address by https://patchwork.kernel.org/patch/4643871/
> and which is described in the changelog of that patch.  Unfortunately, some
> existing users pass IRQF_SHARED along with IRQF_NO_SUSPEND which is the main
> motivation for that.  Many of them use it for wakeup purposes, but for one
> example (that doesn't use it for wakeup only) the ACPI SCI is shareable by
> design.

So many of them use it for wakeup purposes. Why so and how is that
supposed to work?

The mechanism for wakeup sources is:

    1) Lazy disable the interrupt

    2) Do the transition into suspend with interrupts enabled

    3) Check whether one of the wakeup sources has triggered. If yes,
       abort. Otherwise suspend.

The ones marked IRQF_NO_SUSPEND are not part of the above scheme,
because they are not checked. So they use different mechanisms to
abort the suspend?

> > Aside of that I want to see a coherent explanation why a shared MSI
> > interrupt makes any sense at all.
> > 
> > 25:  1 <....> 0   PCI-MSI-edge      aerdrv, PCIe PME
> >
> > AFAICT, that's the primary reason why you insist to support wakeup
> > sources on shared irq lines. And to be honest, it's utter bullshit.
> 
> No, this isn't the primary reason.
> 
> Here's /proc/interrupts from my MSI Wind system and, as you can see,
> PCIe PME are (a) not MSI and (b) shared with some interesting things
> (USB, WiFi and the GPU).

>  16:       5217          0   IO-APIC-fasteoi   PCIe PME, uhci_hcd:usb4, i915
>  17:      13964          0   IO-APIC-fasteoi   PCIe PME, rtl818x_pci

So the obvious question is: WHY are they not using MSI?

Just because MSI fucked up the MSI configuration of the device or is
there any sane explanation for it?
 
Thanks,

	tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki July 31, 2014, 6:36 p.m. UTC | #5
On Thursday, July 31, 2014 12:44:24 PM Thomas Gleixner wrote:
> On Thu, 31 Jul 2014, Rafael J. Wysocki wrote:
> 
> > On Thursday, July 31, 2014 02:12:11 AM Thomas Gleixner wrote:
> > > On Thu, 31 Jul 2014, Thomas Gleixner wrote:
> > > > On Wed, 30 Jul 2014, Rafael J. Wysocki wrote:
> > > > Before this code changes in any way I want to see:
> > > > 
> > > >  1) a description of the existing semantics and their background
> > 
> > On that one, the meaning of IRQF_NO_SUSPEND is quite simple to me.
> > 
> > If it is set (for the first irqaction in a given irq_desc)
> > suspend_device_irqs() will leave that IRQ alone (that is, will not
> > disable it and will not mark it as IRQS_SUSPENDED).
> > 
> > As a result, if an interrupt for that irq_desc happens after
> > suspend_device_irqs(), the interrupt handlers from all of its irqactions
> > will be invoked.
> > 
> > So this basically means "ignore that IRQ" to suspend_device_irqs() and
> > that's its *only* meaning.
> > 
> > It's primary users are timers, per-CPU IRQs, ACPI SCI, via-pmu, Xen.
> > There also is a bunch of drivers that use it for wakeup stuff, but they
> > shouldn't in my opinion.
> 
> Indeed.
> 
> > The background I'm aware of was primarily timers (we wanted to be able
> > to msleep() during PCI PM transitions in the noirq phases of system suspend
> > and resume among other things), and we want per-CPU stuff to work all the
> > time too.  ACPI uses it for signaling various types of events (including
> > battery and thermal) that need to be handled all the time.  I'm not sure
> > why Xen needs it exactly, but that seems to be IPI-related.
> 
> So none of these interrupts is used to abort suspend or wakeup.

ACPI can do that too, but it's just one of its functions.

> They are kept functional because they are required for the suspend/resume
> transitions itself.

They are for things that need to work throughout system suspend/resume and
which are not wakeup.

> What's this PCIe PME handler doing? Is it required functionality for
> the suspend/resume path or is it a wakeup/abort mechanism.

It is a wakeup/abort mechanism.

> > The current handling of IRQF_NO_SUSPEND has a problem vs shared interrupts
> > which I've tried to address by https://patchwork.kernel.org/patch/4643871/
> > and which is described in the changelog of that patch.

And before we enter the wakeup handling slippery slope, let me make a note
that this problem is bothering me quite a bit at the moment.  In my opinion
we need to address it somehow regardless of the wakeup issues and I'm not sure
if failing __setup_irq() when there's a mismatch (that is, there are existing
actions for the given irq_desc and their IRQF_NO_SUSPEND settings are not
consistent with the new one) is the right way to do that, because it may make
things behave a bit randomly (it will always fail the second guy, but that need
not be the one who's requested IRQF_NO_SUSPEND and it depends on the ordering
between them).

I had a couple of ideas, but none of them was particularly clean.  Ideally,
IRQF_NO_SUSPEND should always be requested without IRQF_SHARED, but I'm
afraid that we can't really do that for the ACPI SCI, because that may
cause problems to happen on some older systems where that interrupt is
actually shared.  On all systems I have immediate access to it isn't shared,
but I remember seeing some where it was.  On those systems the ACPI SCI itself
would not be affected, because it is requested quite early during system init,
but the other guys wanting to share the line with it would take a hit.

One thing I was thinking about was to return an error from suspend_device_irqs()
if there was a mismatch between IRQF_NO_SUSPEND settings for different irqactions
in the same irq_desc.  That would make system suspend fail on systems where it
is potentially unsafe, but at least any other functionality would not be affected.

> > Unfortunately, some
> > existing users pass IRQF_SHARED along with IRQF_NO_SUSPEND which is the main
> > motivation for that.  Many of them use it for wakeup purposes, but for one
> > example (that doesn't use it for wakeup only) the ACPI SCI is shareable by
> > design.
> 
> So many of them use it for wakeup purposes. Why so and how is that
> supposed to work?

Quite frankly, I'm not sure why they use it.  These are mostly drivers I'm not
familiar with on platforms I'm not familiar with.  My guess is that the lazy
disable mechanism is not sufficient for them for some reason.

> The mechanism for wakeup sources is:
> 
>     1) Lazy disable the interrupt
> 
>     2) Do the transition into suspend with interrupts enabled
> 
>     3) Check whether one of the wakeup sources has triggered. If yes,
>        abort. Otherwise suspend.
> 
> The ones marked IRQF_NO_SUSPEND are not part of the above scheme,
> because they are not checked. So they use different mechanisms to
> abort the suspend?

Well, if you look at the tegra_kbc driver, for example, it uses both
enable_irq_wake() and IRQF_NO_SUSPEND.  Why it does that, I don't know.

Other ones seem to be using pm_wakeup_event(), but that will only abort
suspend when it is enabled upfront (it need not be).  Moreover, it wasn't
intended to be used that way.

It generally looks like things are used not as intended in the wakeup
area, sadly.  Perhaps that's my fault, because I wasn't looking carefully
enough every time, but I wasn't directly involved in any of them IIRC.

I guess that's an opportunity to clean that up ...

And now there's one more piece of it which is suspend-to-idle (aka "freeze").
That doesn't go all the way to syscore_suspend(), but basically stops after
finishing the "noirq" phase of suspending devices.  Then, it idles the CPUs
and waits for interrupts that will take them out of idle.  Only some of those
interrupts are wakeup events, so it only resumes when __pm_wakeup_event() or
__pm_relax() is called in the process of handling the interrupts.

Of course, it could be implemented differently, but that was the simplest
way to do that.  It still can be changed, but I'd really like it not to have
to go through all of the disabling nonboot CPUs and sysfore_suspend(), because
that simply isn't necessary (and it avoids a metric ton of problems with CPU
offline/online).  And of course, it has to work on x86 too.

On x86, however, enable_irq_wake() always returns -ENXIO and nothing happens,
because the chip doesn't have an ->irq_set_wake callback and doesn't flag
itself as IRQCHIP_SKIP_SET_WAKE, so even if we found a way to do something
equivalent to check_wakeup_irqs() for suspend-to-idle, it still wouldn't work
on x86 for that reason.

Also, I wouldn't like to make suspend-to-idle more special than it really has
to be, so it would be ideal if it could use as much of the same mechanics as
regular platform-supported suspend as reasonably possible.  The handling of
wakeup events is crucial here, because it's needed to make suspend-to-idle
really work and I'd like to make it as consistent as possible with the
"regular" suspend.  Now, that can be done in a couple of ways and some of
them may be better than others for reasons that aren't known to me.

My current case at hand is to make PCIe MSI wake systems up from suspend-to-idle
(it actually works for the "regular" suspend most of the time), but that's part
of a bigger picture, of course.

> > > Aside of that I want to see a coherent explanation why a shared MSI
> > > interrupt makes any sense at all.
> > > 
> > > 25:  1 <....> 0   PCI-MSI-edge      aerdrv, PCIe PME
> > >
> > > AFAICT, that's the primary reason why you insist to support wakeup
> > > sources on shared irq lines. And to be honest, it's utter bullshit.
> > 
> > No, this isn't the primary reason.
> > 
> > Here's /proc/interrupts from my MSI Wind system and, as you can see,
> > PCIe PME are (a) not MSI and (b) shared with some interesting things
> > (USB, WiFi and the GPU).
> 
> >  16:       5217          0   IO-APIC-fasteoi   PCIe PME, uhci_hcd:usb4, i915
> >  17:      13964          0   IO-APIC-fasteoi   PCIe PME, rtl818x_pci
> 
> So the obvious question is: WHY are they not using MSI?
> 
> Just because MSI fucked up the MSI configuration of the device or is
> there any sane explanation for it?

It looks like they don't use MSI on that machine at all, so perhaps the chipset
is not capable of that or similar.  I'm not sure why it matters, though.  The
system shipped like that and with Linux on it, so we should be able to
handle it regardless.

Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alan Stern July 31, 2014, 8:12 p.m. UTC | #6
On Thu, 31 Jul 2014, Rafael J. Wysocki wrote:

> And before we enter the wakeup handling slippery slope, let me make a note
> that this problem is bothering me quite a bit at the moment.  In my opinion
> we need to address it somehow regardless of the wakeup issues and I'm not sure
> if failing __setup_irq() when there's a mismatch (that is, there are existing
> actions for the given irq_desc and their IRQF_NO_SUSPEND settings are not
> consistent with the new one) is the right way to do that, because it may make
> things behave a bit randomly (it will always fail the second guy, but that need
> not be the one who's requested IRQF_NO_SUSPEND and it depends on the ordering
> between them).

Pardon me for sticking my nose into the middle of the conversation, but
here's what it looks like to me:

The entire no_irq phase of suspend/resume is starting to seem like a
mistake.  We should never have done it.  Alternatively, it might be
okay to disable _all_ interrupts during the no_irq phase provided they
are then _all_ enabled again before entering the sysdev and
platform-specific parts of suspend (or the final freeze).

As I understand it, the idea behind the no_irq phase was to make life 
easier for drivers.  They wouldn't have to worry about strange things 
cropping up while they're in the middle of powering down their devices 
or afterward.

Well, guess what?  It turns out that they do have to worry about it
after all.  Timers can still fire during suspend transitions, and if an
IRQ line is shared with a wakeup source then it won't be disabled.

The fact is, drivers should not rely on disabled interrupts to prevent
untimely interrupt requests or wakeups.  They should configure their
devices not to generate any interrupt requests at all, apart from
wakeup requests.  And their interrupt handlers shouldn't mind being
invoked for a shared IRQ, even after their devices are turned off.

Any driver that leaves its device capable of generating non-wakeup
interrupt requests during suspend, and relies on interrupts being
globally disabled to avoid problems, is most likely broken.  Yes, it
may be acceptable in cases where the IRQ line isn't shared, but it's
still a bad design.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki July 31, 2014, 9:04 p.m. UTC | #7
On Thursday, July 31, 2014 04:12:55 PM Alan Stern wrote:
> On Thu, 31 Jul 2014, Rafael J. Wysocki wrote:
> 
> > And before we enter the wakeup handling slippery slope, let me make a note
> > that this problem is bothering me quite a bit at the moment.  In my opinion
> > we need to address it somehow regardless of the wakeup issues and I'm not sure
> > if failing __setup_irq() when there's a mismatch (that is, there are existing
> > actions for the given irq_desc and their IRQF_NO_SUSPEND settings are not
> > consistent with the new one) is the right way to do that, because it may make
> > things behave a bit randomly (it will always fail the second guy, but that need
> > not be the one who's requested IRQF_NO_SUSPEND and it depends on the ordering
> > between them).
> 
> Pardon me for sticking my nose into the middle of the conversation, but
> here's what it looks like to me:
> 
> The entire no_irq phase of suspend/resume is starting to seem like a
> mistake.  We should never have done it.

In hindsight, I totally agree.  Question is what we can do about it now.

> Alternatively, it might be
> okay to disable _all_ interrupts during the no_irq phase provided they
> are then _all_ enabled again before entering the sysdev and
> platform-specific parts of suspend (or the final freeze).
> 
> As I understand it, the idea behind the no_irq phase was to make life 
> easier for drivers.  They wouldn't have to worry about strange things 
> cropping up while they're in the middle of powering down their devices 
> or afterward.

Actually, it was done to prevent bugs in PCI drivers from crashing boxes
during suspend and resume IIRC.

When we were debugging PME with Peter a couple of days ago I asked him to
comment out suspend/resume_device_irqs() experimentally and see what
happens and it turned out that the system didn't resume any more.  It looks
like it still prevents problems from happening, then.

> Well, guess what?  It turns out that they do have to worry about it
> after all.  Timers can still fire during suspend transitions, and if an
> IRQ line is shared with a wakeup source then it won't be disabled.

OK, that's where it starts to get really messy.  If people were not using
IRQF_NO_SUSPEND excessively, the situation would be that almost all IRQ
lines, including the ones of wakeup sources, would be disabled by
suspend_device_irqs() (the exceptions being really low-level stuff that
needs to run during suspend/resume and not sharing IRQ lines).

The wakeup would then be handled with the help of the lazy disable mechanism,
that is, enable_irq_wake() from the driver and then check_wakeup_irqs() in
syscore_suspend().

However, what we have now is a mixture of different mechanisms often used
for things they had not been intended for.

But I agree that we'd probably have avoided it if suspend_device_irqs() hadn't
existed.

> The fact is, drivers should not rely on disabled interrupts to prevent
> untimely interrupt requests or wakeups.  They should configure their
> devices not to generate any interrupt requests at all, apart from
> wakeup requests.  And their interrupt handlers shouldn't mind being
> invoked for a shared IRQ, even after their devices are turned off.
> 
> Any driver that leaves its device capable of generating non-wakeup
> interrupt requests during suspend, and relies on interrupts being
> globally disabled to avoid problems, is most likely broken.  Yes, it
> may be acceptable in cases where the IRQ line isn't shared, but it's
> still a bad design.

Totally agreed.

So how can we eliminate the noirq phase in a workable way?

Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Gleixner July 31, 2014, 10:16 p.m. UTC | #8
On Thu, 31 Jul 2014, Rafael J. Wysocki wrote:
> On Thursday, July 31, 2014 12:44:24 PM Thomas Gleixner wrote:
> > What's this PCIe PME handler doing? Is it required functionality for
> > the suspend/resume path or is it a wakeup/abort mechanism.
> 
> It is a wakeup/abort mechanism.

So why is it using IRQF_NO_SUSPEND in the first place? Just because x86
does not have irq wake implemented or flagged it's irq chips with
IRQCHIP_SKIP_SET_WAKE. Right, so instead of thinking about a proper
solution driver folks just slap the next available thing on it w/o
thinking about the consequences. But, thats partly our own fault due
to lack of proper documentation.
 
> And before we enter the wakeup handling slippery slope, let me make a note
> that this problem is bothering me quite a bit at the moment.  In my opinion
> we need to address it somehow regardless of the wakeup issues and I'm not sure
> if failing __setup_irq() when there's a mismatch (that is, there are existing
> actions for the given irq_desc and their IRQF_NO_SUSPEND settings are not
> consistent with the new one) is the right way to do that, because it may make
> things behave a bit randomly (it will always fail the second guy, but that need
> not be the one who's requested IRQF_NO_SUSPEND and it depends on the ordering
> between them).

I totally agree that we want to fix it and I'm going to help. Though I
really wanted to have a clear picture of the stuff before making
decisions.
 
> I had a couple of ideas, but none of them was particularly clean.  Ideally,
> IRQF_NO_SUSPEND should always be requested without IRQF_SHARED, but I'm
> afraid that we can't really do that for the ACPI SCI, because that may
> cause problems to happen on some older systems where that interrupt is
> actually shared.  On all systems I have immediate access to it isn't shared,
> but I remember seeing some where it was.  On those systems the ACPI SCI itself
> would not be affected, because it is requested quite early during system init,
> but the other guys wanting to share the line with it would take a hit.
> 
> One thing I was thinking about was to return an error from suspend_device_irqs()
> if there was a mismatch between IRQF_NO_SUSPEND settings for different irqactions
> in the same irq_desc.  That would make system suspend fail on systems where it
> is potentially unsafe, but at least any other functionality would not be affected.
>

That's one possible solution. See below.

> > So many of them use it for wakeup purposes. Why so and how is that
> > supposed to work?
> 
> Quite frankly, I'm not sure why they use it.  These are mostly drivers I'm not
> familiar with on platforms I'm not familiar with.  My guess is that the lazy
> disable mechanism is not sufficient for them for some reason.

Looking at a few of them I fear the reason is that the developer did
not understand the wakeup mechanism at all. Again that's probably our
fault, because the whole business including the irq part lacks proper
in depth documentation.

> > The mechanism for wakeup sources is:
> > 
> >     1) Lazy disable the interrupt
> > 
> >     2) Do the transition into suspend with interrupts enabled
> > 
> >     3) Check whether one of the wakeup sources has triggered. If yes,
> >        abort. Otherwise suspend.
> > 
> > The ones marked IRQF_NO_SUSPEND are not part of the above scheme,
> > because they are not checked. So they use different mechanisms to
> > abort the suspend?
> 
> Well, if you look at the tegra_kbc driver, for example, it uses both
> enable_irq_wake() and IRQF_NO_SUSPEND.  Why it does that, I don't know.

That doesn't make sense at all.
 
> Other ones seem to be using pm_wakeup_event(), but that will only abort
> suspend when it is enabled upfront (it need not be).  Moreover, it wasn't
> intended to be used that way.

Right. We should kill that before more people copy it blindly.
 
> It generally looks like things are used not as intended in the wakeup
> area, sadly.  Perhaps that's my fault, because I wasn't looking carefully
> enough every time, but I wasn't directly involved in any of them IIRC.
> 
> I guess that's an opportunity to clean that up ...

Agreed. I'm not frightened to do a tree wide sweep. Seems to become a
habit :)
 
> And now there's one more piece of it which is suspend-to-idle (aka "freeze").
> That doesn't go all the way to syscore_suspend(), but basically stops after
> finishing the "noirq" phase of suspending devices.  Then, it idles the CPUs
> and waits for interrupts that will take them out of idle.  Only some of those
> interrupts are wakeup events, so it only resumes when __pm_wakeup_event() or
> __pm_relax() is called in the process of handling the interrupts.
>
> Of course, it could be implemented differently, but that was the simplest
> way to do that.  It still can be changed, but I'd really like it not to have
> to go through all of the disabling nonboot CPUs and sysfore_suspend(), because
> that simply isn't necessary (and it avoids a metric ton of problems with CPU
> offline/online).  And of course, it has to work on x86 too.

Right. So one thing which would make the situation more clear is that
the interrupt handler needs to tell the core code about this by
returning something like IRQ_HANDLED_PMWAKE and the core kicks the
suspend-to-idle logic back to life. I'm not sure whether the extra
return value is actually necessary, we might even map it to
IRQ_HANDLED in the core as we know that we are in the suspend
state.
 
> On x86, however, enable_irq_wake() always returns -ENXIO and nothing happens,
> because the chip doesn't have an ->irq_set_wake callback and doesn't flag
> itself as IRQCHIP_SKIP_SET_WAKE, so even if we found a way to do something
> equivalent to check_wakeup_irqs() for suspend-to-idle, it still wouldn't work
> on x86 for that reason.

There is no reason, why we can't flag the affected irq chips with
IRQCHIP_SKIP_SET_WAKE. That was introduced to handle irq controllers
which provide wakeup functionality but do not have this special extra
magic which is implemented in ->irq_set_wake, i.e. talking to some PM
controller directly.

> Also, I wouldn't like to make suspend-to-idle more special than it really has
> to be, so it would be ideal if it could use as much of the same mechanics as
> regular platform-supported suspend as reasonably possible.  The handling of
> wakeup events is crucial here, because it's needed to make suspend-to-idle
> really work and I'd like to make it as consistent as possible with the
> "regular" suspend.  Now, that can be done in a couple of ways and some of
> them may be better than others for reasons that aren't known to me.

Sure. From the recent discussion I can see the following things we
need to address:

1) IRQF_NO_SUSPEND:

    - Investigate the (ab)use
    - Fix the offending call sites 
    - Provide proper documentation when and why it is sane to use


2) check_wakeup_irqs()

   Call it in the suspend-to-idle case and deal with the detected
   suspend abort from there. If we don't do that, we basically force
   people to hack creative workarounds into their driver, be it
   abusing IRQF_NO_SUSPEND or other completely brainwrecked
   modifications.

   The approach of doing:

    1) lazy disable
    2) transition into suspend
    3) check abort conditions

   is the only sane way to keep the real suspend and the
   suspend-to-idle stuff in line.

   Any other approach will create major headache, simply because by
   avoiding the conditional call in the PM core you force the driver
   developers to work around it and have conditional handling for that
   case at the driver level without having proper information about
   the actual suspend target, i.e. idle vs. real. I can predict the
   mess caused by that halfways precisely without using a crystal
   ball: With the expected popularity of the suspend-to-idle approach
   this is going to approach infinite level rapidly.

   Seriously we cannot tell people to use A for real suspend and B for
   suspend-to-idle. It must be the same mechanism. If that costs the
   price of some core code restructuring, we better pay that than
   dealing with the fallout of the other variant.


3) IRQF_SHARED

   - The ideal solution for this would be to provide seperate virtual
     irq numbers for each device which sits on a shared interrupt line
     and have a pseudo interrupt chip handling the devices and the
     real interrupt line providing the "demux" function.

     That would make life way simpler as we'd always deal with a
     single interrupt number per device. So the whole chained action
     pain would simply go away and the wake/no suspend functionality
     would not have to care at all about this whole shared interrupt
     trainwreck. You'd get per device interrupt stats for free as a
     bonus.

     Unfortunately we can't pull that off w/o major surgery to a
     gazillion of device drivers, but I'm tempted to provide the
     infrastructure for this sooner than later.

     Though for the PCIe root hub thing which has a shared MSI handler
     !?! we really want to implement that at the device level
     today. Sharing a MSI interrupt is just ass backwards.


   - So we need to bite the bullet and handle the action chain as it
     is.

     And we need to support it for the IRQF_NO_SUSPEND and the wake
     functionality.

     For that we have two choices:

     1) Use the separate pointer in the irq desc which stores the
        actions which are either not flagged IRQF_NO_SUSPEND or not
        declared as actual wakeup sources.

     2) Have a separate handler function pointer in struct irqaction,
     	which is used to store the actual device handler and have a
     	core stub handler as the primary handler.

	That stub handler would basically do:

	if (!irqs_suspended)
	   return action->real_handler();
	return IRQ_NONE;

	The switch over of the handler can be done at
	setup/request_irq time when a shared non consistent flagged
	handler is detected or at enable_wake() time.

     There is an advantage to #2:

        The suspend/resume path does not care about any of this. It
	only penalizes the few shared handlers which are affected by
	this nonsense with the extra conditional in the hot handling
	path. Bad luck for the device which has to share the interrupt
	line: complain to the morons who still think that shared
	interrupt lines are a brilliant idea.

     Now for the enable/disable_wake() stuff we really need to have
     the mechanism which has the extra *dev_id argument to identify
     the proper handler. The existing interface should simply hand in
     NULL and if it's used on a shared interrupt line it should yell
     loudly and maybe even prevent suspend. Combined with the above we
     burden all the nonsense of handling shared interrupts in that
     context to the slow path.

     Of course we need handling for the stupid case where the non
     wakeup device decides to fire interrupts on the shared line.

     We certainly need a detection mechanism for this, but your
     current proposal of accelerating the spurious detection has a
     serious downside:

       If there is a spurious interrupt on a shared edge type
       interrupt, which is crap by definition, but unfortunately
       reality, we never trigger the spurious detector, but we might
       render the device useless in the following case:

         disk irq shares line with wakeup button irq

         suspend
		spurious interrupt caused by disk device

	 resume from a differnt wakeup source

	 Disk device interrupt is lost and therefor the disk is
	 disfunctional until someone presses the wakeup button.

       Admittedly a constructed scenario, but we had our share of
       pleasure to deal with lost edge type interrupts in the past and
       I can assure you it's not fun to debug.

       So we rather error out on the safe side and disable the
       interrupt line right away, mark it pending and resume or abort
       suspend and tell why. If we don't do that upfront we'll add
       that feature after a painful debug session which starts with
       this information: Sporadically after suspend I have to hard
       reset the machine by pressing the power button for 10 seconds
       or removing the battery....

       Note: This only applies to shared interrupt lines. We still
       have a sane spurious handling for proper designed systems which
       avoid the shared brainfarts.


4) Move the handling of resume/abort to the core

   We really want either an explicit return value from the device
   interrupt handler which indicates that we should resume or abort
   the suspend transition or handle IRQ_HANDLED automatically at the
   core level in case of suspend.

   If you let drivers do that, they will create a gazillion of
   solutions to abort/resume which will kill your ability to change
   any of the PM core internal mechanisms w/o doing a tree wide
   cleanup. Let's do it now before we get into the unavoidable
   exponential copy and paste phase.


5) Documentation
   
   Add a metric ton of it!

Thoughts?

Thanks,

	tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Gleixner July 31, 2014, 10:54 p.m. UTC | #9
On Thu, 31 Jul 2014, Rafael J. Wysocki wrote:
> On Thursday, July 31, 2014 12:44:24 PM Thomas Gleixner wrote:
> > > > Aside of that I want to see a coherent explanation why a shared MSI
> > > > interrupt makes any sense at all.
> > > > 
> > > > 25:  1 <....> 0   PCI-MSI-edge      aerdrv, PCIe PME
> > > >
> > > > AFAICT, that's the primary reason why you insist to support wakeup
> > > > sources on shared irq lines. And to be honest, it's utter bullshit.
> > > 
> > > No, this isn't the primary reason.
> > > 
> > > Here's /proc/interrupts from my MSI Wind system and, as you can see,
> > > PCIe PME are (a) not MSI and (b) shared with some interesting things
> > > (USB, WiFi and the GPU).
> > 
> > >  16:       5217          0   IO-APIC-fasteoi   PCIe PME, uhci_hcd:usb4, i915
> > >  17:      13964          0   IO-APIC-fasteoi   PCIe PME, rtl818x_pci
> > 
> > So the obvious question is: WHY are they not using MSI?
> > 
> > Just because MSI fucked up the MSI configuration of the device or is
> > there any sane explanation for it?
> 
> It looks like they don't use MSI on that machine at all, so perhaps the chipset
> is not capable of that or similar.  I'm not sure why it matters, though.  The
> system shipped like that and with Linux on it, so we should be able to
> handle it regardless.

Sorry, but this is outright hillarious.

#1 "so perhaps the chipset is not capable of that or similar."

   This is a fricking Intel chipset. All Intel chipsets which have a
   PCIe root hub are MSI capable at least to my restricted knowledge.

   As an Intel employee you could have had the decency to RTFM of the
   chipset which is in the machine you care about.

#2 "I'm not sure why it matters"

   It matters a lot.
   
   You're just fostering the industry mentality of "it works for us,
   lets ship it" simply because you tell them:

     No matter what brainfarts you have or what level of complete
     ignorance and incompetence you have, we're going to cope with it.

   You are sending out the wrong message. The message needs to be:

     Despite your completely idiotic implementation which renders a
     perfectly well designed piece of modern hardware into a last
     century piece of shit, we went out on a limb to provide the
     victims of your incompetence, i.e. your paying customers, a state
     of the art user experience.

  See the difference ?

The only point where I agree is that we want to handle it, but not
without pointing out that this stuff is a major piece of shite, which
should not ever had been offered to customers who simply are not able
to judge the shit level before spending their hard earned bucks for
it.

I'm well aware that neither Intel nor any other silicon vendor gives a
rats ass about this as long as they sell enough chips. But that's no
excuse for wasting the brains of community developers for no value,
simply because that could have been avoided by applying a fraction of
the brain power, which is necessary to cope with the damage, in the
first place.

I don't care how you feel about that, but I very much care about the
waste of my own precious time.

Thanks,

	tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Gleixner July 31, 2014, 11:41 p.m. UTC | #10
On Thu, 31 Jul 2014, Rafael J. Wysocki wrote:
> On Thursday, July 31, 2014 04:12:55 PM Alan Stern wrote:
> > Pardon me for sticking my nose into the middle of the conversation, but
> > here's what it looks like to me:
> > 
> > The entire no_irq phase of suspend/resume is starting to seem like a
> > mistake.  We should never have done it.
> 
> In hindsight, I totally agree.  Question is what we can do about it now.

<SNIP>

> So how can we eliminate the noirq phase in a workable way?

The straight way to do that is breaking the world and some more and
then fix up a gazillion of device drivers by doing a massive voodoo
debugging effort simply because in most cases we do not get any useful
information out of the system once the shit hits the fan.

We could add instrumentation to the core code about interrupts which
are coming in unexpectedly during suspend, but that does not solve
anything.

We really cannot call any device handler at that point as clocks might
be turned off already and any access to a device register might simply
cause a full undebuggable stall of the CPU.

And there is no way to prove that there is no chance of a spurious
interrupt for a given device. 

So if we cannot handle it at the infrastructure level, we need to make
sure that every fricking device driver interrupt handler has a 

     if (dev->suspended)
     	return CRAP;

conditional as the first line of code in it.

What is that buying us? 

Nothing than a shitload of hard to understand problems, really. The
only sensible way to handle this is at the core level.

#1 There is no way that you can rely on random drivers to do the Right
   Thing. 

#2 There is no way that all hardware is implemented in a sane way.

#3 You CANNOT educate the people who are tasked to implement something
   which "does the job" to understand all the subtle details of
   suspend/resume or whatever.

In fact such an approach would take the general aims of consolidating
repeating patterns into core infrastructure and hiding complexity from
the driver developers ad absurdum. No thanks. We have enough
uncomprehensible shite in drivers/* already. We really can do without
adding more reasons for voodoo programming.

This is a classic core infrastructure problem and we need to get the
semantics and the implementation straight by considering the
challenges of new fangled hardware and the incompentent usage of
that. Once we have that we need to fix the few offending drivers, but
that's a task which can be handled with grep and some brain applied.

Anyone who thinks that this can and should be solved at the driver
level is simply taking the wrong drugs or ran out of supply of the
proper ones. Either call your shrink or your drug dealer to get out of
that.

Thanks,

	tglx




--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Aug. 1, 2014, 12:08 a.m. UTC | #11
On Friday, August 01, 2014 12:16:23 AM Thomas Gleixner wrote:
> On Thu, 31 Jul 2014, Rafael J. Wysocki wrote:
> > On Thursday, July 31, 2014 12:44:24 PM Thomas Gleixner wrote:
> > > What's this PCIe PME handler doing? Is it required functionality for
> > > the suspend/resume path or is it a wakeup/abort mechanism.
> > 
> > It is a wakeup/abort mechanism.
> 
> So why is it using IRQF_NO_SUSPEND in the first place?

It isn't in the current code.

It did in *some* of the prototype patches floating around, but they were
withdrawn.

In the last one ([2/3] in this series) it doesn't any more.

> Just because x86 does not have irq wake implemented or flagged it's irq
> chips with IRQCHIP_SKIP_SET_WAKE.

The reason for using IRQF_NO_SUSPEND was to make it wake up from suspend-to-idle,
but that was a sledgehammer approach.

> Right, so instead of thinking about a proper solution driver folks just slap
> the next available thing on it w/o thinking about the consequences. But,
> thats partly our own fault due to lack of proper documentation.

Prototyping with the next available thing is not generally wrong in my view
as long as this doesn't get to the code base without consideration.

> > And before we enter the wakeup handling slippery slope, let me make a note
> > that this problem is bothering me quite a bit at the moment.  In my opinion
> > we need to address it somehow regardless of the wakeup issues and I'm not sure
> > if failing __setup_irq() when there's a mismatch (that is, there are existing
> > actions for the given irq_desc and their IRQF_NO_SUSPEND settings are not
> > consistent with the new one) is the right way to do that, because it may make
> > things behave a bit randomly (it will always fail the second guy, but that need
> > not be the one who's requested IRQF_NO_SUSPEND and it depends on the ordering
> > between them).
> 
> I totally agree that we want to fix it and I'm going to help. Though I
> really wanted to have a clear picture of the stuff before making
> decisions.
>  
> > I had a couple of ideas, but none of them was particularly clean.  Ideally,
> > IRQF_NO_SUSPEND should always be requested without IRQF_SHARED, but I'm
> > afraid that we can't really do that for the ACPI SCI, because that may
> > cause problems to happen on some older systems where that interrupt is
> > actually shared.  On all systems I have immediate access to it isn't shared,
> > but I remember seeing some where it was.  On those systems the ACPI SCI itself
> > would not be affected, because it is requested quite early during system init,
> > but the other guys wanting to share the line with it would take a hit.
> > 
> > One thing I was thinking about was to return an error from suspend_device_irqs()
> > if there was a mismatch between IRQF_NO_SUSPEND settings for different irqactions
> > in the same irq_desc.  That would make system suspend fail on systems where it
> > is potentially unsafe, but at least any other functionality would not be affected.
> >
> 
> That's one possible solution. See below.
> 
> > > So many of them use it for wakeup purposes. Why so and how is that
> > > supposed to work?
> > 
> > Quite frankly, I'm not sure why they use it.  These are mostly drivers I'm not
> > familiar with on platforms I'm not familiar with.  My guess is that the lazy
> > disable mechanism is not sufficient for them for some reason.
> 
> Looking at a few of them I fear the reason is that the developer did
> not understand the wakeup mechanism at all. Again that's probably our
> fault, because the whole business including the irq part lacks proper
> in depth documentation.
> 
> > > The mechanism for wakeup sources is:
> > > 
> > >     1) Lazy disable the interrupt
> > > 
> > >     2) Do the transition into suspend with interrupts enabled
> > > 
> > >     3) Check whether one of the wakeup sources has triggered. If yes,
> > >        abort. Otherwise suspend.
> > > 
> > > The ones marked IRQF_NO_SUSPEND are not part of the above scheme,
> > > because they are not checked. So they use different mechanisms to
> > > abort the suspend?
> > 
> > Well, if you look at the tegra_kbc driver, for example, it uses both
> > enable_irq_wake() and IRQF_NO_SUSPEND.  Why it does that, I don't know.
> 
> That doesn't make sense at all.
>  
> > Other ones seem to be using pm_wakeup_event(), but that will only abort
> > suspend when it is enabled upfront (it need not be).  Moreover, it wasn't
> > intended to be used that way.
> 
> Right. We should kill that before more people copy it blindly.
>  
> > It generally looks like things are used not as intended in the wakeup
> > area, sadly.  Perhaps that's my fault, because I wasn't looking carefully
> > enough every time, but I wasn't directly involved in any of them IIRC.
> > 
> > I guess that's an opportunity to clean that up ...
> 
> Agreed. I'm not frightened to do a tree wide sweep. Seems to become a
> habit :)
>  
> > And now there's one more piece of it which is suspend-to-idle (aka "freeze").
> > That doesn't go all the way to syscore_suspend(), but basically stops after
> > finishing the "noirq" phase of suspending devices.  Then, it idles the CPUs
> > and waits for interrupts that will take them out of idle.  Only some of those
> > interrupts are wakeup events, so it only resumes when __pm_wakeup_event() or
> > __pm_relax() is called in the process of handling the interrupts.
> >
> > Of course, it could be implemented differently, but that was the simplest
> > way to do that.  It still can be changed, but I'd really like it not to have
> > to go through all of the disabling nonboot CPUs and sysfore_suspend(), because
> > that simply isn't necessary (and it avoids a metric ton of problems with CPU
> > offline/online).  And of course, it has to work on x86 too.
> 
> Right. So one thing which would make the situation more clear is that
> the interrupt handler needs to tell the core code about this by
> returning something like IRQ_HANDLED_PMWAKE and the core kicks the
> suspend-to-idle logic back to life. I'm not sure whether the extra
> return value is actually necessary, we might even map it to
> IRQ_HANDLED in the core as we know that we are in the suspend
> state.

I'm not sure I'm following you here.  Do you mean that upon receiving
IRQ_HANDLED_PMWAKE from an interrupt handler the core will know that
this was a wakeup event and will trigger a resume from suspend-to-idle?

> > On x86, however, enable_irq_wake() always returns -ENXIO and nothing happens,
> > because the chip doesn't have an ->irq_set_wake callback and doesn't flag
> > itself as IRQCHIP_SKIP_SET_WAKE, so even if we found a way to do something
> > equivalent to check_wakeup_irqs() for suspend-to-idle, it still wouldn't work
> > on x86 for that reason.
> 
> There is no reason, why we can't flag the affected irq chips with
> IRQCHIP_SKIP_SET_WAKE. That was introduced to handle irq controllers
> which provide wakeup functionality but do not have this special extra
> magic which is implemented in ->irq_set_wake, i.e. talking to some PM
> controller directly.

That may help, but then it won't prevent wakeup IRQs from being disabled by
suspend_device_irqs().  Do I think correctly that we can re-enable them
in the suspend-to-idle loop, before idling the CPUs?

> > Also, I wouldn't like to make suspend-to-idle more special than it really has
> > to be, so it would be ideal if it could use as much of the same mechanics as
> > regular platform-supported suspend as reasonably possible.  The handling of
> > wakeup events is crucial here, because it's needed to make suspend-to-idle
> > really work and I'd like to make it as consistent as possible with the
> > "regular" suspend.  Now, that can be done in a couple of ways and some of
> > them may be better than others for reasons that aren't known to me.
> 
> Sure. From the recent discussion I can see the following things we
> need to address:
> 
> 1) IRQF_NO_SUSPEND:
> 
>     - Investigate the (ab)use
>     - Fix the offending call sites 
>     - Provide proper documentation when and why it is sane to use

Totally agreed.

> 2) check_wakeup_irqs()
> 
>    Call it in the suspend-to-idle case and deal with the detected
>    suspend abort from there. If we don't do that, we basically force
>    people to hack creative workarounds into their driver, be it
>    abusing IRQF_NO_SUSPEND or other completely brainwrecked
>    modifications.
> 
>    The approach of doing:
> 
>     1) lazy disable
>     2) transition into suspend
>     3) check abort conditions
> 
>    is the only sane way to keep the real suspend and the
>    suspend-to-idle stuff in line.

Well, suspend-to-idle is a loop of the type

 (1) wait for an interrupt
 (2) iterrupt happens -> if wakeup, resume; else goto (1)

so I'm not sure how to match that with lazy disable?  We basically would need to
do something like:

 (1) enable all wakeup interrupts
 (2) wait for an interrupt
 (3) interrupt happens
 (4) disable all interrupts enabled in (1)
 (5) if wakeup, resume (we'll resume interrupts later); else goto (1)

Is that what you mean?

>    Any other approach will create major headache, simply because by
>    avoiding the conditional call in the PM core you force the driver
>    developers to work around it and have conditional handling for that
>    case at the driver level without having proper information about
>    the actual suspend target, i.e. idle vs. real. I can predict the
>    mess caused by that halfways precisely without using a crystal
>    ball: With the expected popularity of the suspend-to-idle approach
>    this is going to approach infinite level rapidly.

Yeah, exponential expansion.

>    Seriously we cannot tell people to use A for real suspend and B for
>    suspend-to-idle. It must be the same mechanism. If that costs the
>    price of some core code restructuring, we better pay that than
>    dealing with the fallout of the other variant.

Totally agreed.

> 3) IRQF_SHARED
> 
>    - The ideal solution for this would be to provide seperate virtual
>      irq numbers for each device which sits on a shared interrupt line
>      and have a pseudo interrupt chip handling the devices and the
>      real interrupt line providing the "demux" function.
> 
>      That would make life way simpler as we'd always deal with a
>      single interrupt number per device. So the whole chained action
>      pain would simply go away and the wake/no suspend functionality
>      would not have to care at all about this whole shared interrupt
>      trainwreck. You'd get per device interrupt stats for free as a
>      bonus.
> 
>      Unfortunately we can't pull that off w/o major surgery to a
>      gazillion of device drivers, but I'm tempted to provide the
>      infrastructure for this sooner than later.

Sounds interesting.

>      Though for the PCIe root hub thing which has a shared MSI handler
>      !?! we really want to implement that at the device level
>      today. Sharing a MSI interrupt is just ass backwards.

There are more reasons for doing that.  A PCIe port is really one device
with multiple things to handle and they really need more coordination with
each other than we have in the current code.  In my opinion we should
just use one PCIe port driver doing all of these things together instead of
three separate drivers handling multiple artificially cut out devices on a
virtual bus.

Problem is, people either have no time or are too scared to do that.

>    - So we need to bite the bullet and handle the action chain as it
>      is.
> 
>      And we need to support it for the IRQF_NO_SUSPEND and the wake
>      functionality.
> 
>      For that we have two choices:
> 
>      1) Use the separate pointer in the irq desc which stores the
>         actions which are either not flagged IRQF_NO_SUSPEND or not
>         declared as actual wakeup sources.
> 
>      2) Have a separate handler function pointer in struct irqaction,
>      	which is used to store the actual device handler and have a
>      	core stub handler as the primary handler.
> 
> 	That stub handler would basically do:
> 
> 	if (!irqs_suspended)
> 	   return action->real_handler();
> 	return IRQ_NONE;
> 
> 	The switch over of the handler can be done at
> 	setup/request_irq time when a shared non consistent flagged
> 	handler is detected or at enable_wake() time.

I had a similar idea at one point.

>      There is an advantage to #2:
> 
>         The suspend/resume path does not care about any of this. It
> 	only penalizes the few shared handlers which are affected by
> 	this nonsense with the extra conditional in the hot handling
> 	path. Bad luck for the device which has to share the interrupt
> 	line: complain to the morons who still think that shared
> 	interrupt lines are a brilliant idea.
> 
>      Now for the enable/disable_wake() stuff we really need to have
>      the mechanism which has the extra *dev_id argument to identify
>      the proper handler. The existing interface should simply hand in
>      NULL and if it's used on a shared interrupt line it should yell
>      loudly and maybe even prevent suspend. Combined with the above we
>      burden all the nonsense of handling shared interrupts in that
>      context to the slow path.
> 
>      Of course we need handling for the stupid case where the non
>      wakeup device decides to fire interrupts on the shared line.
> 
>      We certainly need a detection mechanism for this, but your
>      current proposal of accelerating the spurious detection has a
>      serious downside:
> 
>        If there is a spurious interrupt on a shared edge type
>        interrupt, which is crap by definition, but unfortunately
>        reality, we never trigger the spurious detector, but we might
>        render the device useless in the following case:
> 
>          disk irq shares line with wakeup button irq
> 
>          suspend
> 		spurious interrupt caused by disk device
> 
> 	 resume from a differnt wakeup source
> 
> 	 Disk device interrupt is lost and therefor the disk is
> 	 disfunctional until someone presses the wakeup button.

But the disk was supposed to be suspended, right?  So its driver should
not expect to receive any interrupts at that point anyway.  Or am I missing
anything?

>        Admittedly a constructed scenario, but we had our share of
>        pleasure to deal with lost edge type interrupts in the past and
>        I can assure you it's not fun to debug.
> 
>        So we rather error out on the safe side and disable the
>        interrupt line right away, mark it pending and resume or abort
>        suspend and tell why.

I thought about the same.

>        If we don't do that upfront we'll add
>        that feature after a painful debug session which starts with
>        this information: Sporadically after suspend I have to hard
>        reset the machine by pressing the power button for 10 seconds
>        or removing the battery....
>
>        Note: This only applies to shared interrupt lines. We still
>        have a sane spurious handling for proper designed systems which
>        avoid the shared brainfarts.
>

Agreed.

> 4) Move the handling of resume/abort to the core
> 
>    We really want either an explicit return value from the device
>    interrupt handler which indicates that we should resume or abort
>    the suspend transition or handle IRQ_HANDLED automatically at the
>    core level in case of suspend.
> 
>    If you let drivers do that, they will create a gazillion of
>    solutions to abort/resume which will kill your ability to change
>    any of the PM core internal mechanisms w/o doing a tree wide
>    cleanup. Let's do it now before we get into the unavoidable
>    exponential copy and paste phase.

OK

> 5) Documentation
>    
>    Add a metric ton of it!
> 
> Thoughts?

Except for a couple of points where I'm not sure I understand you correctly
(commented above), all of that sounds good to me.

I'm not sure about the ordering, though.  It would be good to have a working
replacement for the IRQF_NO_SUSPEND things that we'll be removing in 1, for
example.  So since we need to do 3) IRQF_SHARED for both IRQF_NO_SUSPEND and
wakeup, as you said, would it be practical to start with that one?

Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Aug. 1, 2014, 12:51 a.m. UTC | #12
On Friday, August 01, 2014 01:41:31 AM Thomas Gleixner wrote:
> On Thu, 31 Jul 2014, Rafael J. Wysocki wrote:
> > On Thursday, July 31, 2014 04:12:55 PM Alan Stern wrote:
> > > Pardon me for sticking my nose into the middle of the conversation, but
> > > here's what it looks like to me:
> > > 
> > > The entire no_irq phase of suspend/resume is starting to seem like a
> > > mistake.  We should never have done it.
> > 
> > In hindsight, I totally agree.  Question is what we can do about it now.
> 
> <SNIP>
> 
> > So how can we eliminate the noirq phase in a workable way?
> 
> The straight way to do that is breaking the world and some more and
> then fix up a gazillion of device drivers by doing a massive voodoo
> debugging effort simply because in most cases we do not get any useful
> information out of the system once the shit hits the fan.
> 
> We could add instrumentation to the core code about interrupts which
> are coming in unexpectedly during suspend, but that does not solve
> anything.
> 
> We really cannot call any device handler at that point as clocks might
> be turned off already and any access to a device register might simply
> cause a full undebuggable stall of the CPU.
> 
> And there is no way to prove that there is no chance of a spurious
> interrupt for a given device. 
> 
> So if we cannot handle it at the infrastructure level, we need to make
> sure that every fricking device driver interrupt handler has a 
> 
>      if (dev->suspended)
>      	return CRAP;
> 
> conditional as the first line of code in it.
> 
> What is that buying us? 
> 
> Nothing than a shitload of hard to understand problems, really. The
> only sensible way to handle this is at the core level.
> 
> #1 There is no way that you can rely on random drivers to do the Right
>    Thing. 
> 
> #2 There is no way that all hardware is implemented in a sane way.
> 
> #3 You CANNOT educate the people who are tasked to implement something
>    which "does the job" to understand all the subtle details of
>    suspend/resume or whatever.

These are fair points.

However, if the driver implements ->runtime_suspend, it has to handle
the "my device is suspended" condition in its interrupt handler regardless.

For such a driver doing the same over system suspend/resume shouldn't
be a real problem.

Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Aug. 1, 2014, 1:24 a.m. UTC | #13
On Friday, August 01, 2014 02:08:12 AM Rafael J. Wysocki wrote:
> On Friday, August 01, 2014 12:16:23 AM Thomas Gleixner wrote:
> > On Thu, 31 Jul 2014, Rafael J. Wysocki wrote:
> > > On Thursday, July 31, 2014 12:44:24 PM Thomas Gleixner wrote:

[cut]

> Except for a couple of points where I'm not sure I understand you correctly
> (commented above), all of that sounds good to me.
> 
> I'm not sure about the ordering, though.  It would be good to have a working
> replacement for the IRQF_NO_SUSPEND things that we'll be removing in 1, for
> example.  So since we need to do 3) IRQF_SHARED for both IRQF_NO_SUSPEND and
> wakeup, as you said, would it be practical to start with that one?

I forgot about one case which in my opinion would be good to take into account
from the outset.  That is the case of runtime-suspended devices that we don't
want to touch during system suspend/resume.  If those are system wakeup
devices, their drivers should be able to configure them for system wakeup
at the runtime suspend time rather than during system suspend.  Also if their
interrupts are going to be used as system wakeup interrupts, the interface
that we're going to provide needs to handle that case.  That is, it should
be possible to use that interface at the runtime suspend time and it should
take care of all things going forward.

Triggering a lazy disable right at the runtime suspend time may not work,
because the interrupt will also be used for the device's runtime remote wakeup
in that case, so it has to be functional until system suspend is started and
the core decides to leave the device in its current state.  This means that
the core will need to trigger the lazy disable for it at one point during
system suspend.

Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Gleixner Aug. 1, 2014, 9:40 a.m. UTC | #14
On Fri, 1 Aug 2014, Rafael J. Wysocki wrote:
> On Friday, August 01, 2014 12:16:23 AM Thomas Gleixner wrote:
> > Right, so instead of thinking about a proper solution driver folks just slap
> > the next available thing on it w/o thinking about the consequences. But,
> > thats partly our own fault due to lack of proper documentation.
> 
> Prototyping with the next available thing is not generally wrong in my view
> as long as this doesn't get to the code base without consideration.

Prototyping is one thing, but we have real users which are simply wrong
AFAICT.
 
> > > And now there's one more piece of it which is suspend-to-idle (aka "freeze").
> > > That doesn't go all the way to syscore_suspend(), but basically stops after
> > > finishing the "noirq" phase of suspending devices.  Then, it idles the CPUs
> > > and waits for interrupts that will take them out of idle.  Only some of those
> > > interrupts are wakeup events, so it only resumes when __pm_wakeup_event() or
> > > __pm_relax() is called in the process of handling the interrupts.
> > >
> > > Of course, it could be implemented differently, but that was the simplest
> > > way to do that.  It still can be changed, but I'd really like it not to have
> > > to go through all of the disabling nonboot CPUs and sysfore_suspend(), because
> > > that simply isn't necessary (and it avoids a metric ton of problems with CPU
> > > offline/online).  And of course, it has to work on x86 too.
> > 
> > Right. So one thing which would make the situation more clear is that
> > the interrupt handler needs to tell the core code about this by
> > returning something like IRQ_HANDLED_PMWAKE and the core kicks the
> > suspend-to-idle logic back to life. I'm not sure whether the extra
> > return value is actually necessary, we might even map it to
> > IRQ_HANDLED in the core as we know that we are in the suspend
> > state.
> 
> I'm not sure I'm following you here.  Do you mean that upon receiving
> IRQ_HANDLED_PMWAKE from an interrupt handler the core will know that
> this was a wakeup event and will trigger a resume from suspend-to-idle?

Correct. Whether we need the extra return code is debatable. But yes,
we want to talk to the PM/suspend/resume thing at core level instead
of letting drivers use random interfaces which happen to be exported
for one reason or the other, but definitely not for the purpose of
random driver.

> > There is no reason, why we can't flag the affected irq chips with
> > IRQCHIP_SKIP_SET_WAKE. That was introduced to handle irq controllers
> > which provide wakeup functionality but do not have this special extra
> > magic which is implemented in ->irq_set_wake, i.e. talking to some PM
> > controller directly.
> 
> That may help, but then it won't prevent wakeup IRQs from being disabled by
> suspend_device_irqs().  Do I think correctly that we can re-enable them
> in the suspend-to-idle loop, before idling the CPUs?

Yes.
 
> > 2) check_wakeup_irqs()
> > 
> >    Call it in the suspend-to-idle case and deal with the detected
> >    suspend abort from there. If we don't do that, we basically force
> >    people to hack creative workarounds into their driver, be it
> >    abusing IRQF_NO_SUSPEND or other completely brainwrecked
> >    modifications.
> > 
> >    The approach of doing:
> > 
> >     1) lazy disable
> >     2) transition into suspend
> >     3) check abort conditions
> > 
> >    is the only sane way to keep the real suspend and the
> >    suspend-to-idle stuff in line.
> 
> Well, suspend-to-idle is a loop of the type
> 
>  (1) wait for an interrupt
>  (2) iterrupt happens -> if wakeup, resume; else goto (1)
> 
> so I'm not sure how to match that with lazy disable?  We basically would need to
> do something like:
> 
>  (1) enable all wakeup interrupts
>  (2) wait for an interrupt
>  (3) interrupt happens
>  (4) disable all interrupts enabled in (1)
>  (5) if wakeup, resume (we'll resume interrupts later); else goto (1)
> 
> Is that what you mean?

More or less, yes. Whether we do the explicit enable or the implicit
leave it enabled thing is an implementation detail. We just need to
chose one.

> >      Though for the PCIe root hub thing which has a shared MSI handler
> >      !?! we really want to implement that at the device level
> >      today. Sharing a MSI interrupt is just ass backwards.
> 
> There are more reasons for doing that.  A PCIe port is really one device
> with multiple things to handle and they really need more coordination with
> each other than we have in the current code.  In my opinion we should
> just use one PCIe port driver doing all of these things together instead of
> three separate drivers handling multiple artificially cut out devices on a
> virtual bus.
> 
> Problem is, people either have no time or are too scared to do that.

Well, if we don't want to end up with a complete unmaintainable mess,
we need to find the time and the people who have enough brains to do
so.
 
> >        If there is a spurious interrupt on a shared edge type
> >        interrupt, which is crap by definition, but unfortunately
> >        reality, we never trigger the spurious detector, but we might
> >        render the device useless in the following case:
> > 
> >          disk irq shares line with wakeup button irq
> > 
> >          suspend
> > 		spurious interrupt caused by disk device
> > 
> > 	 resume from a differnt wakeup source
> > 
> > 	 Disk device interrupt is lost and therefor the disk is
> > 	 disfunctional until someone presses the wakeup button.
> 
> But the disk was supposed to be suspended, right?  So its driver should
> not expect to receive any interrupts at that point anyway.  Or am I missing
> anything?

As I said it's a conctructed scenario, but we need to deal with that
kind of malfunction in devices. Right now if the driver disabled
interrupts at the device level and calls disable_irq() the spurious
edge from the broken IP block will be recorded and the interrupt
reinjected at enable_irq() time. I'm aware of at least two devices
which rely on that behaviour.
 
> I'm not sure about the ordering, though.  It would be good to have a working
> replacement for the IRQF_NO_SUSPEND things that we'll be removing in 1, for
> example.  So since we need to do 3) IRQF_SHARED for both IRQF_NO_SUSPEND and
> wakeup, as you said, would it be practical to start with that one?

The numbering was not meant as ordering, it was just to seperate the
various issues which we need to look at.

Thanks,

	tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Gleixner Aug. 1, 2014, 1:43 p.m. UTC | #15
On Fri, 1 Aug 2014, Rafael J. Wysocki wrote:
> OK, I guess "IRQ_HANDLED from a wakeup interrupt" may be interpreted as
> IRQ_HANDLED_PMWAKE.  On the other hand, if that's going to be handled in
> handle_irq_event_percpu(), then using a special return code would save us
> a brach for IRQ_HANDLED interrupts.  We could convert it to IRQ_HANDLED
> immediately then.

We can handle it at the end of the function by calling
note_interrupt() unconditionally do the following there:

      if (suspended) {
      	 if (ret == IRQ_NONE) {
	    if (shared)
	       yell_and_abort_or_resume();
         } else {
	    abort_or_resume();
         }
      }
      if (noirqdebug)
      	 return;
 
> OK, I'll take a stab at the IRQF_SHARED thing if you don't mind.

Definitely not :)

> Here's my current understanding of what can be done for IRQF_NO_SUSPEND.
> 
> In suspend_device_irqs():
> 
> (1) If all actions in the list have the same setting (eg. IRQF_NO_SUSPEND unset),
>     keep the current behavior.
> (2) If the actions have different settings:
>     - Actions with IRQF_NO_SUSPEND set are not modified.
>     - Actions with IRQF_NO_SUSPEND unset are switched over to a stub handler.
>     - IRQS_SUSPEND_MODE (new flag) is set for the IRQ.

Can we please do that in setup_irq() and let the shared ones always
run through the stub? That keeps suspend/resume_device_irqs() simple.

Thanks,

	tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Aug. 1, 2014, 1:45 p.m. UTC | #16
On Friday, August 01, 2014 11:40:55 AM Thomas Gleixner wrote:
> On Fri, 1 Aug 2014, Rafael J. Wysocki wrote:
> > On Friday, August 01, 2014 12:16:23 AM Thomas Gleixner wrote:

[cut]

> > > > And now there's one more piece of it which is suspend-to-idle (aka "freeze").
> > > > That doesn't go all the way to syscore_suspend(), but basically stops after
> > > > finishing the "noirq" phase of suspending devices.  Then, it idles the CPUs
> > > > and waits for interrupts that will take them out of idle.  Only some of those
> > > > interrupts are wakeup events, so it only resumes when __pm_wakeup_event() or
> > > > __pm_relax() is called in the process of handling the interrupts.
> > > >
> > > > Of course, it could be implemented differently, but that was the simplest
> > > > way to do that.  It still can be changed, but I'd really like it not to have
> > > > to go through all of the disabling nonboot CPUs and sysfore_suspend(), because
> > > > that simply isn't necessary (and it avoids a metric ton of problems with CPU
> > > > offline/online).  And of course, it has to work on x86 too.
> > > 
> > > Right. So one thing which would make the situation more clear is that
> > > the interrupt handler needs to tell the core code about this by
> > > returning something like IRQ_HANDLED_PMWAKE and the core kicks the
> > > suspend-to-idle logic back to life. I'm not sure whether the extra
> > > return value is actually necessary, we might even map it to
> > > IRQ_HANDLED in the core as we know that we are in the suspend
> > > state.
> > 
> > I'm not sure I'm following you here.  Do you mean that upon receiving
> > IRQ_HANDLED_PMWAKE from an interrupt handler the core will know that
> > this was a wakeup event and will trigger a resume from suspend-to-idle?
> 
> Correct. Whether we need the extra return code is debatable. But yes,
> we want to talk to the PM/suspend/resume thing at core level instead
> of letting drivers use random interfaces which happen to be exported
> for one reason or the other, but definitely not for the purpose of
> random driver.

OK, I guess "IRQ_HANDLED from a wakeup interrupt" may be interpreted as
IRQ_HANDLED_PMWAKE.  On the other hand, if that's going to be handled in
handle_irq_event_percpu(), then using a special return code would save us
a brach for IRQ_HANDLED interrupts.  We could convert it to IRQ_HANDLED
immediately then.

[cut]

> > I'm not sure about the ordering, though.  It would be good to have a working
> > replacement for the IRQF_NO_SUSPEND things that we'll be removing in 1, for
> > example.  So since we need to do 3) IRQF_SHARED for both IRQF_NO_SUSPEND and
> > wakeup, as you said, would it be practical to start with that one?
> 
> The numbering was not meant as ordering, it was just to seperate the
> various issues which we need to look at.

OK, I'll take a stab at the IRQF_SHARED thing if you don't mind.

Here's my current understanding of what can be done for IRQF_NO_SUSPEND.

In suspend_device_irqs():

(1) If all actions in the list have the same setting (eg. IRQF_NO_SUSPEND unset),
    keep the current behavior.
(2) If the actions have different settings:
    - Actions with IRQF_NO_SUSPEND set are not modified.
    - Actions with IRQF_NO_SUSPEND unset are switched over to a stub handler.
    - IRQS_SUSPEND_MODE (new flag) is set for the IRQ.

In note_interrupt():

  If action_ret is IRQ_NONE and IRQS_SUSPEND_MODE is set for the IRQ, disable the
  IRQ, set IRQS_SUSPENDED for it and call system_wakeup(BAD_INTERRUPT) (that will
  abort suspend if still in progress or break the suspend-to-idle loop).

In resume_device_irqs():

(1) If IRQS_SUSPEND_MODE is set, switch over actions with IRQF_NO_SUSPEND unset
    to their original handlers and clear the flag.  Fall through.
(2) If IRQS_SUSPENDED is set, clear the flag and enable the interrupt.

The stub handler only needs to return IRQ_NONE unconditionally in that case.

Does that make sense?

Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Aug. 1, 2014, 2:29 p.m. UTC | #17
On Friday, August 01, 2014 03:43:21 PM Thomas Gleixner wrote:
> On Fri, 1 Aug 2014, Rafael J. Wysocki wrote:
> > OK, I guess "IRQ_HANDLED from a wakeup interrupt" may be interpreted as
> > IRQ_HANDLED_PMWAKE.  On the other hand, if that's going to be handled in
> > handle_irq_event_percpu(), then using a special return code would save us
> > a brach for IRQ_HANDLED interrupts.  We could convert it to IRQ_HANDLED
> > immediately then.
> 
> We can handle it at the end of the function by calling
> note_interrupt() unconditionally do the following there:
> 
>       if (suspended) {
>       	 if (ret == IRQ_NONE) {
> 	    if (shared)
> 	       yell_and_abort_or_resume();
>          } else {
> 	    abort_or_resume();
>          }
>       }
>       if (noirqdebug)
>       	 return;

I see.

> > OK, I'll take a stab at the IRQF_SHARED thing if you don't mind.
> 
> Definitely not :)
> 
> > Here's my current understanding of what can be done for IRQF_NO_SUSPEND.
> > 
> > In suspend_device_irqs():
> > 
> > (1) If all actions in the list have the same setting (eg. IRQF_NO_SUSPEND unset),
> >     keep the current behavior.
> > (2) If the actions have different settings:
> >     - Actions with IRQF_NO_SUSPEND set are not modified.
> >     - Actions with IRQF_NO_SUSPEND unset are switched over to a stub handler.
> >     - IRQS_SUSPEND_MODE (new flag) is set for the IRQ.
> 
> Can we please do that in setup_irq() and let the shared ones always
> run through the stub? That keeps suspend/resume_device_irqs() simple.

OK

Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alan Stern Aug. 1, 2014, 2:41 p.m. UTC | #18
On Fri, 1 Aug 2014, Thomas Gleixner wrote:

> Anyone who thinks that this can and should be solved at the driver
> level is simply taking the wrong drugs or ran out of supply of the
> proper ones. Either call your shrink or your drug dealer to get out of
> that.

That's absolutely true, but there is a counterpoint: Some of the 
problems in this area _must_ be fixed at the driver level.

In particular, any driver that:

	doesn't prevent its device from generating interrupt requests 
	while the system is suspended

	and allows its device to share an IRQ line with a wakeup source

has to be fixed.  The situation is even worse if the driver's interrupt 
handler blows up when called at an inopportune time (such as when 
clocks or power supplies are off).

Unfortunately, the only reasonable way to find such drivers is to wait
and see when the bad combination of events happens.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Aug. 5, 2014, 3:22 p.m. UTC | #19
On Friday, August 01, 2014 04:29:40 PM Rafael J. Wysocki wrote:
> On Friday, August 01, 2014 03:43:21 PM Thomas Gleixner wrote:
> > On Fri, 1 Aug 2014, Rafael J. Wysocki wrote:
> > > OK, I guess "IRQ_HANDLED from a wakeup interrupt" may be interpreted as
> > > IRQ_HANDLED_PMWAKE.  On the other hand, if that's going to be handled in
> > > handle_irq_event_percpu(), then using a special return code would save us
> > > a brach for IRQ_HANDLED interrupts.  We could convert it to IRQ_HANDLED
> > > immediately then.
> > 
> > We can handle it at the end of the function by calling
> > note_interrupt() unconditionally do the following there:
> > 
> >       if (suspended) {
> >       	 if (ret == IRQ_NONE) {
> > 	    if (shared)
> > 	       yell_and_abort_or_resume();
> >          } else {
> > 	    abort_or_resume();
> >          }
> >       }
> >       if (noirqdebug)
> >       	 return;
> 
> I see.
> 
> > > OK, I'll take a stab at the IRQF_SHARED thing if you don't mind.
> > 
> > Definitely not :)
> > 
> > > Here's my current understanding of what can be done for IRQF_NO_SUSPEND.
> > > 
> > > In suspend_device_irqs():
> > > 
> > > (1) If all actions in the list have the same setting (eg. IRQF_NO_SUSPEND unset),
> > >     keep the current behavior.
> > > (2) If the actions have different settings:
> > >     - Actions with IRQF_NO_SUSPEND set are not modified.
> > >     - Actions with IRQF_NO_SUSPEND unset are switched over to a stub handler.
> > >     - IRQS_SUSPEND_MODE (new flag) is set for the IRQ.
> > 
> > Can we please do that in setup_irq() and let the shared ones always
> > run through the stub? That keeps suspend/resume_device_irqs() simple.
> 
> OK

Here's a patch series based on what we talked about.

[1/5] Mechanism to wake up the system or abort suspend in progress automatically.
[2/5] Fix for shared IRQs vs IRQF_NO_SUSPEND (with wakeup in mind).
[3/5] Wakeup interrupts support for suspend-to-idle.
[4/5] Set IRQCHIP_SKIP_SET_WAKE for x86 IOAPIC IRQ chips.
[5/5] Make PCIe PME wake up from suspend to idle.

All tested on MSI Wind that has a couple of issues being addressed.

Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Zijlstra Aug. 5, 2014, 4:12 p.m. UTC | #20
On Tue, Aug 05, 2014 at 05:22:57PM +0200, Rafael J. Wysocki wrote:
> Here's a patch series based on what we talked about.
> 
> [1/5] Mechanism to wake up the system or abort suspend in progress automatically.
> [2/5] Fix for shared IRQs vs IRQF_NO_SUSPEND (with wakeup in mind).
> [3/5] Wakeup interrupts support for suspend-to-idle.
> [4/5] Set IRQCHIP_SKIP_SET_WAKE for x86 IOAPIC IRQ chips.
> [5/5] Make PCIe PME wake up from suspend to idle.
> 
> All tested on MSI Wind that has a couple of issues being addressed.

I can confirm it works on my WSM-EP which has that shared interrupt
stuff.
diff mbox

Patch

Index: linux-pm/include/linux/interrupt.h
===================================================================
--- linux-pm.orig/include/linux/interrupt.h
+++ linux-pm/include/linux/interrupt.h
@@ -70,8 +70,10 @@ 
 #define IRQF_FORCE_RESUME	0x00008000
 #define IRQF_NO_THREAD		0x00010000
 #define IRQF_EARLY_RESUME	0x00020000
+#define IRQF_WAKEUP		0x00040000
 
 #define IRQF_TIMER		(__IRQF_TIMER | IRQF_NO_SUSPEND | IRQF_NO_THREAD)
+#define IRQF_INHIBIT_SUSPEND	(IRQF_NO_SUSPEND | IRQF_WAKEUP)
 
 /*
  * These values can be returned by request_any_context_irq() and
@@ -350,6 +352,7 @@  static inline void enable_irq_lockdep_ir
 
 /* IRQ wakeup (PM) control: */
 extern int irq_set_irq_wake(unsigned int irq, unsigned int on);
+extern int device_irq_wake(unsigned int irq, void *dev_id, bool enable);
 
 static inline int enable_irq_wake(unsigned int irq)
 {
@@ -361,6 +364,15 @@  static inline int disable_irq_wake(unsig
 	return irq_set_irq_wake(irq, 0);
 }
 
+static inline int enable_device_irq_wake(unsigned int irq, void *dev_id)
+{
+	return device_irq_wake(irq, dev_id, true);
+}
+
+static inline int disable_device_irq_wake(unsigned int irq, void *dev_id)
+{
+	return device_irq_wake(irq, dev_id, false);
+}
 
 #ifdef CONFIG_IRQ_FORCED_THREADING
 extern bool force_irqthreads;
Index: linux-pm/include/linux/irqdesc.h
===================================================================
--- linux-pm.orig/include/linux/irqdesc.h
+++ linux-pm/include/linux/irqdesc.h
@@ -53,6 +53,7 @@  struct irq_desc {
 	unsigned int		core_internal_state__do_not_mess_with_it;
 	unsigned int		depth;		/* nested irq disables */
 	unsigned int		wake_depth;	/* nested wake enables */
+	unsigned int		wake_needed;
 	unsigned int		irq_count;	/* For detecting broken IRQs */
 	unsigned long		last_unhandled;	/* Aging timer for unhandled count */
 	unsigned int		irqs_unhandled;
Index: linux-pm/kernel/irq/internals.h
===================================================================
--- linux-pm.orig/kernel/irq/internals.h
+++ linux-pm/kernel/irq/internals.h
@@ -53,6 +53,7 @@  enum {
 	IRQS_REPLAY		= 0x00000040,
 	IRQS_WAITING		= 0x00000080,
 	IRQS_PENDING		= 0x00000200,
+	IRQS_WAKE_READY		= 0x00000400,
 	IRQS_SUSPENDED		= 0x00000800,
 };
 
Index: linux-pm/kernel/irq/pm.c
===================================================================
--- linux-pm.orig/kernel/irq/pm.c
+++ linux-pm/kernel/irq/pm.c
@@ -21,7 +21,7 @@  static void suspend_irq(struct irq_desc
 	if (!action)
 		return;
 
-	no_suspend = IRQF_NO_SUSPEND;
+	no_suspend = IRQF_INHIBIT_SUSPEND;
 	flags = 0;
 	do {
 		no_suspend &= action->flags;
@@ -33,7 +33,7 @@  static void suspend_irq(struct irq_desc
 
 	desc->istate |= IRQS_SUSPENDED;
 
-	if ((flags & IRQF_NO_SUSPEND) &&
+	if ((flags & IRQF_INHIBIT_SUSPEND) &&
 	    !(desc->istate & IRQS_SPURIOUS_DISABLED)) {
 		struct irqaction *active = NULL;
 		struct irqaction *suspended = NULL;
@@ -42,7 +42,7 @@  static void suspend_irq(struct irq_desc
 		do {
 			action = head;
 			head = action->next;
-			if (action->flags & IRQF_NO_SUSPEND) {
+			if (action->flags & IRQF_INHIBIT_SUSPEND) {
 				action->next = active;
 				active = action;
 			} else {
@@ -138,16 +138,53 @@  static void resume_irqs(bool want_early)
 }
 
 /**
- * irq_pm_syscore_ops - enable interrupt lines early
+ * irq_pm_syscore_suspend - configure interrupts for system wakeup
  *
- * Enable all interrupt lines with %IRQF_EARLY_RESUME set.
+ * Configure all interrupt lines with %wake_needed set for system wakeup.
+ */
+static int irq_pm_syscore_suspend(void)
+{
+	struct irq_desc *desc;
+	int irq;
+
+	for_each_irq_desc(irq, desc)
+		if (desc->wake_needed) {
+			int error = enable_irq_wake(irq);
+
+			if (error) {
+				/* Ignore missing callbacks. */
+				if (error != -ENXIO)
+					return error;
+			} else {
+				desc->istate |= IRQS_WAKE_READY;
+			}
+		}
+
+	return 0;
+}
+
+/**
+ * irq_pm_syscore_resume - enable interrupt lines early
+ *
+ * Switch wakeup interrupt lines back to the normal mode of operation and
+ * enable all interrupt lines with %IRQF_EARLY_RESUME set.
  */
 static void irq_pm_syscore_resume(void)
 {
+	struct irq_desc *desc;
+	int irq;
+
+	for_each_irq_desc(irq, desc)
+		if (desc->istate & IRQS_WAKE_READY) {
+			disable_irq_wake(irq);
+			desc->istate &= ~IRQS_WAKE_READY;
+		}
+
 	resume_irqs(true);
 }
 
 static struct syscore_ops irq_pm_syscore_ops = {
+	.suspend	= irq_pm_syscore_suspend,
 	.resume		= irq_pm_syscore_resume,
 };
 
Index: linux-pm/kernel/irq/manage.c
===================================================================
--- linux-pm.orig/kernel/irq/manage.c
+++ linux-pm/kernel/irq/manage.c
@@ -547,6 +547,59 @@  int irq_set_irq_wake(unsigned int irq, u
 }
 EXPORT_SYMBOL(irq_set_irq_wake);
 
+/**
+ *	device_irq_wake - set/unset device irq PM wakeup requirement
+ *	@irq:		interrupt to control
+ *	@dev_id:	interrupt handler device cookie
+ *	@enable:	whether or not the interrupt should wake up the system
+ *
+ *	Tell the IRQ subsystem whether or not the given interrupt should be used
+ *	for system wakeup from sleep states (like suspend-to-RAM).  This doesn't
+ *	change the hardware configuration, but notes whether or not to change it
+ *	at the syscore stage of system suspend and resume.
+ *
+ *	@dev_id may be NULL if the IRQ has not been requested as a shared one.
+ *	Otherwise, it must be the same as the one used when requesting the IRQ.
+ */
+int device_irq_wake(unsigned int irq, void *dev_id, bool enable)
+{
+	unsigned long flags;
+	struct irq_desc *desc = irq_get_desc_buslock(irq, &flags, IRQ_GET_DESC_CHECK_GLOBAL);
+	struct irqaction *action;
+
+	if (!desc || !desc->action)
+		return -EINVAL;
+
+	if (dev_id) {
+		for (action = desc->action; action; action = action->next)
+			if (action->dev_id == dev_id)
+				break;
+	} else {
+		action = desc->action;
+		if (action->flags & IRQF_SHARED)
+			action = NULL;
+	}
+	if (!action) {
+		irq_put_desc_busunlock(desc, flags);
+		return -ENODEV;
+	}
+	if (enable) {
+		if (!(action->flags & IRQF_WAKEUP)) {
+			action->flags |= IRQF_WAKEUP;
+			desc->wake_needed++;
+		}
+	} else {
+		if (action->flags & IRQF_WAKEUP) {
+			action->flags &= ~IRQF_WAKEUP;
+			desc->wake_needed--;
+		}
+	}
+
+	irq_put_desc_busunlock(desc, flags);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(device_irq_wake);
+
 /*
  * Internal function that tells the architecture code whether a
  * particular irq has been exclusively allocated or is available