[v3,1/2] IOMMU/spinlock: Fix a bug found in AMD IOMMU initialization

Message ID 1457529455-38314-2-git-send-email-quan.xu@intel.com (mailing list archive)
State New, archived

Commit Message

Quan Xu March 9, 2016, 1:17 p.m. UTC
pcidevs_lock should be held with interrupt enabled. However there remains
an exception in AMD IOMMU code, where the lock is acquired with interrupt
disabled. This inconsistency might lead to deadlock.

The fix is straightforward to use spin_lock instead. Also interrupt has been
enabled when this function is invoked, so we're sure consistency around
pcidevs_lock can be guaranteed after this fix.

Signed-off-by: Quan Xu <quan.xu@intel.com>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
CC: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
CC: Dario Faggioli <dario.faggioli@citrix.com>
CC: Jan Beulich <jbeulich@suse.com>
---
 xen/drivers/passthrough/amd/iommu_init.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

Comments

Dario Faggioli March 9, 2016, 2:59 p.m. UTC | #1
On Wed, 2016-03-09 at 21:17 +0800, Quan Xu wrote:
> pcidevs_lock should be held with interrupt enabled. 
>
There's a message from Jan where he says:
<<Well, I'd say something like "pcidevs_lock doesn't require interrupts
to be disabled while being acquired".>>

:-O

> However there remains
> an exception in AMD IOMMU code, where the lock is acquired with
> interrupt
> disabled. This inconsistency might lead to deadlock.
> 
I appreciate that 'might' is weaker than 'can'. Personally, I still
don't find this correct, or at least clear enough, with reference to this
patch, but I won't be in the way because of this.

> The fix is straightforward to use spin_lock instead. Also interrupt
> has been
> enabled when this function is invoked, so we're sure consistency
> around
> pcidevs_lock can be guaranteed after this fix.
> 
And I also can't really get what "so we're sure consistency
around pcidevs_lock can be guaranteed after this fix" actually means,
but again, that's probably me, and it's fine.

However,

> Signed-off-by: Quan Xu <quan.xu@intel.com>
> Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
>
This still stands **only** if the very first sentence "pcidevs_lock
should be held with interrupt enabled" is changed to "pcidevs_lock
doesn't require interrupts to be disabled while being acquired".

Regards,
Dario
Quan Xu March 10, 2016, 6:12 a.m. UTC | #2
CC Kevin, 

On March 09, 2016 11:00pm, <dario.faggioli@citrix.com> wrote:
> On Wed, 2016-03-09 at 21:17 +0800, Quan Xu wrote:
> > pcidevs_lock should be held with interrupt enabled.
> >
> There's a message from Jan when he says:
> <<Well, I'd say something like "pcidevs_lock doesn't require interrupts to be
> disabled while being acquired".>>
>
> :-O
>
> > However there remains
> > an exception in AMD IOMMU code, where the lock is acquired with
> > interrupt disabled. This inconsistency might lead to deadlock.
> >
> I appreciate that 'might' is weaker than 'can'. Personally, I still don't find this
> correct, or at least clear enough, referred to this patch, but I won't be in the
> way because of this.
>
> > The fix is straightforward to use spin_lock instead. Also interrupt
> > has been enabled when this function is invoked, so we're sure
> > consistency around pcidevs_lock can be guaranteed after this fix.
> >
> And I also can't really get what "so we're sure consistency around pcidevs_lock
> can be guaranteed after this fix" actually means, but again, that's probably me,
> and it's fine.
>
> However,
>
> > Signed-off-by: Quan Xu <quan.xu@intel.com>
> > Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
> >
> This still stands **only** if the very first sentence "pcidevs_lock should be held
> with interrupt enabled" is changed to "pcidevs_lock doesn't require interrupts to
> be disabled while being acquired".
>

Dario,
I am open to this change. :-)

When I read:
1. (look at "Lesson 3: spinlocks revisited.")
  https://www.kernel.org/doc/Documentation/locking/spinlocks.txt
2. comment inside check_lock(), in xen/common/spinlock.c, in Xen's codebase.
3. the "Linux Device Drivers, 3rd Edition" book , http://www.makelinux.net/ldd3/chp-5-sect-5 

I found that:
    - "We partition locks into IRQ-safe (__always__ held with IRQs disabled) and IRQ-unsafe (__always__ held with IRQs enabled) types".

It looks like Kevin's description is better. 

I also found that you mentioned in this thread:
"...  __except for very special situations__".
If it is true, I think your description is better.

Thanks for your time..:):)


Quan
Meng Xu March 11, 2016, 3:24 a.m. UTC | #3
Hi Quan,

Can I ask a dumb question?

On Wed, Mar 9, 2016 at 8:17 AM, Quan Xu <quan.xu@intel.com> wrote:
> pcidevs_lock should be held with interrupt enabled. However there remains
> an exception in AMD IOMMU code, where the lock is acquired with interrupt
> disabled. This inconsistency might lead to deadlock.

Why will this inconsistency lead to deadlock?
I understand the difference between spin_lock_irqsave(), which disables
interrupts, and spin_lock(), which allows interrupts; however, I didn't
get why misusing spin_lock_irqsave() in the place of spin_lock()
could potentially lead to a deadlock.
Would you mind pointing me to somewhere I can find the reason, or
enlighten me?

Thank you very much!

Best,

Meng

-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania
http://www.cis.upenn.edu/~mengxu/
Quan Xu March 11, 2016, 6:54 a.m. UTC | #4
On March 11, 2016 11:25am, <mengxu@cis.upenn.edu> wrote:
> On Wed, Mar 9, 2016 at 8:17 AM, Quan Xu <quan.xu@intel.com> wrote:
> > pcidevs_lock should be held with interrupt enabled. However there
> > remains an exception in AMD IOMMU code, where the lock is acquired
> > with interrupt disabled. This inconsistency might lead to deadlock.
>
> Why will this inconsistency lead to deadlock?
> I understand the difference between spin_lock_irqsave(), which disable interrupt,
> and spin_lock(), which allows interrupt, however, I didn't get why misuse the
> spin_lock_irqsave() at the place of spin_lock() could potentially lead to a
> deadlock?

For the above 2 questions:
Mixing is just one of the _possible_ scenarios (IMO, not 100% leading to a deadlock):

- where an interrupt tries to lock an already locked variable. This is ok if
the interrupt happens on another CPU, but it is _not_ ok if the
interrupt happens on the same CPU that already holds the lock

I got it from the below 2 points:
 1). As Jan mentioned, the implication from disabling interrupts while acquiring a lock is that the lock is also being acquired by some interrupt handler.
  If you mix acquire types, the one not disabling interrupts is prone to be interrupted, with the interrupt trying to get hold of the lock the same CPU already owns.

 2). Comment inside check_lock(), 
we partition locks into IRQ-safe (always held with IRQs disabled) and
IRQ-unsafe (always held with IRQs enabled) types. The convention for
every lock must be consistently observed else we can deadlock in
IRQ-context rendezvous functions (__a rendezvous which gets every CPU
into IRQ context before any CPU is released from the rendezvous__).
If we can mix IRQ-disabled and IRQ-enabled callers, the following can
happen:
 * Lock is held by CPU A, with IRQs enabled
 * CPU B is spinning on same lock, with IRQs disabled
 * Rendezvous starts -- CPU A takes interrupt and enters rendezbous spin
 * DEADLOCK -- CPU B will never enter rendezvous, CPU A will never exit
               the rendezvous, and will hence never release the lock.

To guard against this subtle bug we latch the IRQ safety of every
spinlock in the system, on first use.
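
Just to illustrate the "latch the IRQ safety on first use" idea from that
comment, here is a minimal sketch (illustrative only -- this is not the real
check_lock() from xen/common/spinlock.c; demo_check_lock(), struct
demo_lock_debug and irqs_disabled() are made-up names):

    #include <assert.h>
    #include <stdbool.h>

    extern bool irqs_disabled(void);   /* assumed helper: are IRQs off right now? */

    struct demo_lock_debug {
        int irq_safe;                  /* -1 = not latched yet, 1 = IRQ-safe, 0 = IRQ-unsafe */
    };

    static void demo_check_lock(struct demo_lock_debug *dbg)
    {
        int now = irqs_disabled() ? 1 : 0;

        /* The first acquisition latches the lock's type... */
        if ( dbg->irq_safe == -1 )
            dbg->irq_safe = now;

        /* ...and any later acquisition with the other IRQ state is the bug. */
        assert(dbg->irq_safe == now);
    }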

> Would you minding pointing me to somewhere I can find the reason or
> enlighten me?
>

Some reference material:

   * (look at "Lesson 3: spinlocks revisited.")
     https://www.kernel.org/doc/Documentation/locking/spinlocks.txt
   * comment inside check_lock(), in xen/common/spinlock.c, in Xen's codebase.
   * the "Linux Device Drivers, 3rd Edition" book , http://www.makelinux.net/ldd3/chp-5-sect-5

Quan
Dario Faggioli March 11, 2016, 10:35 a.m. UTC | #5
On Fri, 2016-03-11 at 06:54 +0000, Xu, Quan wrote:
> On March 11, 2016 11:25am, <mengxu@cis.upenn.edu> wrote:
> > 
> > On Wed, Mar 9, 2016 at 8:17 AM, Quan Xu <quan.xu@intel.com> wrote:
> > > 
> > > pcidevs_lock should be held with interrupt enabled. However there
> > > remains an exception in AMD IOMMU code, where the lock is
> > > acquired
> > > with interrupt disabled. This inconsistency might lead to
> > > deadlock.
> > Why will this inconsistency lead to deadlock?
> > I understand the difference between spin_lock_irqsave(), which
> > disable interrupt,
> > and spin_lock(), which allows interrupt, however, I didn't get why
> > misuse the
> > spin_lock_irqsave() at the place of spin_lock() could potentially
> > lead to a
> > deadlock?

>  1).As Jan mentioned, The implication from disabling interrupts while
> acquiring a lock is that the lock is also being acquired by some
> interrupt handler.
>   If you mix acquire types, the one not disabling interrupts is prone
> to be interrupted, and the interrupt trying to get hold of the lock
> the same CPU already owns.
> 
The key issue is whether or not a lock can be acquired from IRQ context
(i.e., in an interrupt handler, or a function called by that, as a
result of an interrupt occurring).

For lock that can, IRQ disabling variants must be used *everywhere* the
lock is taken (so, e.g., not only when it is actually taken from IRQ
context, just *always*!).

If that rule is not honored, we are ok if the lock is taken on CPUA,
and the interrupt handled on CPUB:

  CPUA              CPUB
  .                 .
  .                 .
  spin_lock(l)      .
  .                 .
  .                 . <-- Interrupt!
  .                        .
  .                       spin_lock(l); //spins on l
  .                       x             //spins on l
  .                       x             //spins on l
  spin_unlock(l)          .             //takes l
  .                       .
  .                       spin_unlock(l);
  .                 . <-- .
  .                 .

but the following can happen, if the interrupt is delivered to the CPU
that has already taken the lock:

     CPU
     .
     .
[1]  spin_lock(l);       //takes l
     .
     . <-- Interrupt!
           .
           spin_lock(l) // CPU deadlocks

If, in the latter case, spin_lock_irq(l) were used at [1], things would
have been fine, as the interrupt wouldn't have occurred until l weren't
released:

  CPU
  .
  .
  spin_lock_irq(l)                        //takes l
  .
  . <---------------- Interrupt!
  .                   -                   //IRQs are disabled
  .                   -                   //IRQs are disabled
  .                   -                   //IRQs are disabled
  spin_unlock_irq(l)  .                   //IRQs re-enabled
                      spin_lock_irq(l);   //takes l
                      .
                      .
                      spin_unlock_irq(l);
 . <----------------- .
 .

Note that whether spin_lock_irq() or spin_lock_irqsave() should be
used, is completely independent from this, and it must be decided
according to whether IRQs are disabled already or not when taking the
lock. And (quite obviously, but since we're here) it is fine to mix
spin_lock_irq() and spin_lock_irqsave(), as they both disable
interrupts, which is what matters.
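
As a small sketch of that rule (assuming Xen's xen/spinlock.h; demo_lock and
the two helper functions are made-up names, and spin_lock_irqsave() is a macro
that takes 'flags' by name, as in the hunk being removed at the bottom of this
thread):

    #include <xen/spinlock.h>

    static DEFINE_SPINLOCK(demo_lock);

    /* Caller that knows IRQs are enabled on entry. */
    static void take_lock_irqs_enabled(void)
    {
        spin_lock_irq(&demo_lock);      /* disables IRQs ... */
        /* ... critical section ... */
        spin_unlock_irq(&demo_lock);    /* ... and unconditionally re-enables them */
    }

    /* Caller that may already have IRQs disabled. */
    static void take_lock_any_irq_state(void)
    {
        unsigned long flags;

        spin_lock_irqsave(&demo_lock, flags);      /* disables IRQs, saves previous state */
        /* ... critical section ... */
        spin_unlock_irqrestore(&demo_lock, flags); /* restores the saved state */
    }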

So, now, if we know for sure that a lock is _never_ever_ever_ taken
from interrupt context, can we mix spin_lock() and spin_lock_irq() on
it (for whatever reason)? Well, as far as the above reasoning is
concerned, yes.

In fact, the deadlock arises because IRQs interrupt asynchronously what
the CPU is doing, and that can happen when the CPU has taken the lock
already. But if the 'asynchronous' part goes away, we really don't care
whether a lock is taken at time t1 with IRQ enabled, and at time t2 with
IRQ disabled, don't you think?

Well, here it is where what the comment inside check_lock() describes
comes into play. As quoted by Quan already:

>  2). Comment inside check_lock(), 
> we partition locks into IRQ-safe (always held with IRQs disabled) and
> IRQ-unsafe (always held with IRQs enabled) types. The convention for
> every lock must be consistently observed else we can deadlock in
> IRQ-context rendezvous functions (__a rendezvous which gets every CPU
> into IRQ context before any CPU is released from the rendezvous__).
> If we can mix IRQ-disabled and IRQ-enabled callers, the following can
> happen:
>  * Lock is held by CPU A, with IRQs enabled
>  * CPU B is spinning on same lock, with IRQs disabled
>  * Rendezvous starts -- CPU A takes interrupt and enters rendezbous
> spin
>  * DEADLOCK -- CPU B will never enter rendezvous, CPU A will never
> exit
>                the rendezvous, and will hence never release the lock.
> 
> To guard against this subtle bug we latch the IRQ safety of every
> spinlock in the system, on first use.
> 
This is a very good safety measure, but it can be quite a restriction
in some (luckily limited) cases. And that's why it is possible --
although absolutely discouraged-- to relax it. See, for an example of
this, the comment in start_secondary(), in xen/arch/x86/smpboot.c:

    ...
    /*
     * Just as during early bootstrap, it is convenient here to disable
     * spinlock checking while we have IRQs disabled. This allows us to
     * acquire IRQ-unsafe locks when it would otherwise be disallowed.
     * 
     * It is safe because the race we are usually trying to avoid involves
     * a group of CPUs rendezvousing in an IPI handler, where one cannot
     * join because it is spinning with IRQs disabled waiting to acquire a
     * lock held by another in the rendezvous group (the lock must be an
     * IRQ-unsafe lock since the CPU took the IPI after acquiring it, and
     * hence had IRQs enabled). This is a deadlock scenario.
     * 
     * However, no CPU can be involved in rendezvous until it is online,
     * hence no such group can be waiting for this CPU until it is
     * visible in cpu_online_map. Hence such a deadlock is not possible.
     */
    spin_debug_disable();
    ...

Just FTR, at least as far as I can remember, Linux does not enforce
anything like that.

Hope this clarifies things.

Regards,
Dario
Quan Xu March 11, 2016, 12:36 p.m. UTC | #6
On March 11, 2016 6:36pm, <Dario Faggioli> wrote:
> On Fri, 2016-03-11 at 06:54 +0000, Xu, Quan wrote:
> > On March 11, 2016 11:25am, <mengxu@cis.upenn.edu> wrote:
> > >
> > > On Wed, Mar 9, 2016 at 8:17 AM, Quan Xu <quan.xu@intel.com> wrote:
> > > >
> > > > pcidevs_lock should be held with interrupt enabled. However there
> > > > remains an exception in AMD IOMMU code, where the lock is acquired
> > > > with interrupt disabled. This inconsistency might lead to
> > > > deadlock.
> > > Why will this inconsistency lead to deadlock?
> > > I understand the difference between spin_lock_irqsave(), which
> > > disable interrupt, and spin_lock(), which allows interrupt, however,
> > > I didn't get why misuse the
> > > spin_lock_irqsave() at the place of spin_lock() could potentially
> > > lead to a deadlock?
> >
> >  1).As Jan mentioned, The implication from disabling interrupts while
> > acquiring a lock is that the lock is also being acquired by some
> > interrupt handler.
> >   If you mix acquire types, the one not disabling interrupts is prone
> > to be interrupted, and the interrupt trying to get hold of the lock
> > the same CPU already owns.
> >
> The key issue is whether or not a lock can be acquired from IRQ context (i.e., in
> an interrupt handler, or a function called by that, as a result of an interrupt
> occurring).
>
> For lock that can, IRQ disabling variants must be used *everywhere* the lock is
> taken (so, e.g., not only when it is actually taken from IRQ context, just
> *always*!).
>
> If that rule is not honored, we are ok if the lock is taken on CPUA, and the
> interrupt handled on CPUB:
>
>   CPUA              CPUB
>   .                 .
>   .                 .
>   spin_lock(l)      .
>   .                 .
>   .                 . <-- Interrupt!
>   .                        .
>   .                       spin_lock(l); //spins on l
>   .                       x             //spins on l
>   .                       x             //spins on l
>   spin_unlock(l)          .             //takes l
>   .                       .
>   .                       spin_unlock(l);
>   .                 . <-- .
>   .                 .
>
> but the following can happen, if the interrupt is delivered to the CPU that has
> already taken the lock:
>
>      CPU
>      .
>      .
> [1]  spin_lock(l);       //takes l
>      .
>      . <-- Interrupt!
>            .
>            spin_lock(l) // CPU deadlocks
>
> If, in the latter case, spin_lock_irq(l) were used at [1], things would have been
> fine, as the interrupt wouldn't have occurred until l weren't
> released:
>
>   CPU
>   .
>   .
>   spin_lock_irq(l)                        //takes l
>   .
>   . <---------------- Interrupt!
>   .                   -                   //IRQs are disabled
>   .                   -                   //IRQs are disabled
>   .                   -                   //IRQs are disabled
>   spin_unlock_irq(l)  .                   //IRQs re-enabled
>                       spin_lock_irq(l);   //takes l
>                       .
>                       .
>                       spin_unlock_irq(l);
>  . <----------------- .
>  .
>
> Note that whether spin_lock_irq() or spin_lock_irqsave() should be used, is
> completely independent from this, and it must be decided according to whether
> IRQs are disabled already or not when taking the lock. And (quite obviously, but
> since we're here) it is fine to mix
> spin_lock_irq() and spin_lock_irqsave(), as they both disable interrupts, which is
> what matters.
>
> So, now, if we know for sure that a lock is _never_ever_ever_ taken from
> interrupt context, can we mix spin_lock() and spin_lock_irq() on it (for whatever
> reason)? Well, as far as the above reasoning is concerned, yes.
>

I appreciate Dario's explanation.
BTW, be careful of spin_lock_irq(), which includes 'ASSERT(local_irq_is_enabled())'.
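
For instance, something like this sketch (assuming Xen's spinlock primitives
and the usual local_irq_save()/local_irq_restore() helpers; demo_lock and
demo() are made-up names) would trip that ASSERT in a debug build if
spin_lock_irq() were used:

    #include <xen/spinlock.h>

    static DEFINE_SPINLOCK(demo_lock);

    static void demo(void)
    {
        unsigned long flags;

        local_irq_save(flags);          /* IRQs are now disabled */

        /*
         * spin_lock_irq(&demo_lock) here would hit the
         * ASSERT(local_irq_is_enabled()) mentioned above, and its matching
         * spin_unlock_irq() would wrongly re-enable IRQs at this point.
         */
        spin_lock(&demo_lock);          /* fine: IRQs are already off */
        /* ... critical section ... */
        spin_unlock(&demo_lock);

        local_irq_restore(flags);
    }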

> In fact, the deadlock arises because IRQs interrupt asynchronously what the
> CPU is doing, and that can happen when the CPU has taken the lock already. But
> if the 'asynchronous' part goes away, we really don't care whether a lock is take
> at time t1 with IRQ enabled, and at time t2 with IRQ disabled, don't you think?
>

Yes.
Consistency may be helpful to avoid some easy-to-avoid lock errors.
Moreover, without my fix, I think it would not lead to deadlock, as pcidevs_lock is not being taken
in IRQ context. Right?


For deadlock, I think the key problems are:
  - A lock can be acquired from IRQ context
  - The interrupt is delivered to the _same_CPU_ that already holds the lock.

Quan

> Well, here it is where what the comment inside check_lock() describes comes
> into play. As quoted by Quan already:
>
> >  2). Comment inside check_lock(),
> > we partition locks into IRQ-safe (always held with IRQs disabled) and
> > IRQ-unsafe (always held with IRQs enabled) types. The convention for
> > every lock must be consistently observed else we can deadlock in
> > IRQ-context rendezvous functions (__a rendezvous which gets every CPU
> > into IRQ context before any CPU is released from the rendezvous__).
> > If we can mix IRQ-disabled and IRQ-enabled callers, the following can
> > happen:
> >  * Lock is held by CPU A, with IRQs enabled
> >  * CPU B is spinning on same lock, with IRQs disabled
> >  * Rendezvous starts -- CPU A takes interrupt and enters rendezbous
> > spin
> >  * DEADLOCK -- CPU B will never enter rendezvous, CPU A will never
> > exit
> >                the rendezvous, and will hence never release the lock.
> >
> > To guard against this subtle bug we latch the IRQ safety of every
> > spinlock in the system, on first use.
> >
> This is a very good safety measure, but it can be quite a restriction in some
> (luckily limited) cases. And that's why it is possible -- although absolutely
> discouraged-- to relax it. See, for an example of this, the comment in
> start_secondary(), in xen/arch/x86/smpboot.c:
>
>     ...
>     /*
>      * Just as during early bootstrap, it is convenient here to disable
>      * spinlock checking while we have IRQs disabled. This allows us to
>      * acquire IRQ-unsafe locks when it would otherwise be disallowed.
>      *
>      * It is safe because the race we are usually trying to avoid involves
>      * a group of CPUs rendezvousing in an IPI handler, where one cannot
>      * join because it is spinning with IRQs disabled waiting to acquire a
>      * lock held by another in the rendezvous group (the lock must be an
>      * IRQ-unsafe lock since the CPU took the IPI after acquiring it, and
>      * hence had IRQs enabled). This is a deadlock scenario.
>      *
>      * However, no CPU can be involved in rendezvous until it is online,
>      * hence no such group can be waiting for this CPU until it is
>      * visible in cpu_online_map. Hence such a deadlock is not possible.
>      */
>     spin_debug_disable();
>     ...
>
> Just FTR, at least as far as I can remember, Linux does not enforce anything like
> that.
>
> Hope this clarifies things.
Dario Faggioli March 11, 2016, 1:58 p.m. UTC | #7
On Fri, 2016-03-11 at 12:36 +0000, Xu, Quan wrote:
> On March 11, 2016 6:36pm, <Dario Faggioli> wrote:
> > On Fri, 2016-03-11 at 06:54 +0000, Xu, Quan wrote:
> > > 
> > So, now, if we know for sure that a lock is _never_ever_ever_ taken
> > from
> > interrupt context, can we mix spin_lock() and spin_lock_irq() on it
> > (for whatever
> > reason)? Well, as far as the above reasoning is concerned, yes.
> > 
> I appreciate Dario's explanation.
> btw, be careful of spin_lock_irq(), which includes
> 'ASSERT(local_irq_is_enabled()'..
> 
It does. What about it? That is exactly what marks the difference
between spin_lock_irq() and spin_lock_irqsave(). In fact, the former
should not be used if interrupts are already disabled because then,
when calling the corresponding spin_unlock_irq(), they'd be re-enabled
while instead they shouldn't.

But as said, from the point of view of preventing deadlock, for locks
taken both from inside and outside IRQ context, they're equivalent.

> > 
> > In fact, the deadlock arises because IRQs interrupt asynchronously
> > what the
> > CPU is doing, and that can happen when the CPU has taken the lock
> > already. But
> > if the 'asynchronous' part goes away, we really don't care whether
> > a lock is take
> > at time t1 with IRQ enabled, and at time t2 with IRQ disabled,
> > don't you think?
> > 
> Yes.
> Consistency may be helpful to avoid some easy-to-avoid lock errors.
> Moreover, without my fix, I think it would not lead dead lock, as the
> pcidevs_lock is not being taken
> In IRQ context. Right? 
> 
> 
> For deadlock, I think the key problems are:
>   - A lock can be acquired from IRQ context
>   -The interrupt is delivered to the _same_CPU_ that already holds
> the lock.
> 
In your case, pcidevs_lock is certainly not being taken from both
inside and outside IRQ context. If it were, using spin_lock() --as it
happens basically everywhere in the code-- would be wrong, and using
spin_lock_irq[save]() --as it happens in the only case you're
patching-- would be what's right!

Regards,
Dario
Meng Xu March 11, 2016, 2:41 p.m. UTC | #8
Hi Quan and Dario,

On Fri, Mar 11, 2016 at 5:35 AM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> On Fri, 2016-03-11 at 06:54 +0000, Xu, Quan wrote:
>> On March 11, 2016 11:25am, <mengxu@cis.upenn.edu> wrote:
>> >
>> > On Wed, Mar 9, 2016 at 8:17 AM, Quan Xu <quan.xu@intel.com> wrote:
>> > >
>> > > pcidevs_lock should be held with interrupt enabled. However there
>> > > remains an exception in AMD IOMMU code, where the lock is
>> > > acquired
>> > > with interrupt disabled. This inconsistency might lead to
>> > > deadlock.
>> > Why will this inconsistency lead to deadlock?
>> > I understand the difference between spin_lock_irqsave(), which
>> > disable interrupt,
>> > and spin_lock(), which allows interrupt, however, I didn't get why
>> > misuse the
>> > spin_lock_irqsave() at the place of spin_lock() could potentially
>> > lead to a
>> > deadlock?
>>
>>  1).As Jan mentioned, The implication from disabling interrupts while
>> acquiring a lock is that the lock is also being acquired by some
>> interrupt handler.
>>   If you mix acquire types, the one not disabling interrupts is prone
>> to be interrupted, and the interrupt trying to get hold of the lock
>> the same CPU already owns.
>>
> The key issue is whether or not a lock can be acquired from IRQ context
> (i.e., in an interrupt handler, or a function called by that, as a
> result of an interrupt occurring).
>
> For lock that can, IRQ disabling variants must be used *everywhere* the
> lock is taken (so, e.g., not only when it is actually taken from IRQ
> context, just *always*!).
>
> If that rule is not honored, we are ok if the lock is taken on CPUA,
> and the interrupt handled on CPUB:
>
>   CPUA              CPUB
>   .                 .
>   .                 .
>   spin_lock(l)      .
>   .                 .
>   .                 . <-- Interrupt!
>   .                        .
>   .                       spin_lock(l); //spins on l
>   .                       x             //spins on l
>   .                       x             //spins on l
>   spin_unlock(l)          .             //takes l
>   .                       .
>   .                       spin_unlock(l);
>   .                 . <-- .
>   .                 .
>
> but the following can happen, if the interrupt is delivered to the CPU
> that has already taken the lock:
>
>      CPU
>      .
>      .
> [1]  spin_lock(l);       //takes l
>      .
>      . <-- Interrupt!
>            .
>            spin_lock(l) // CPU deadlocks
>
> If, in the latter case, spin_lock_irq(l) were used at [1], things would
> have been fine, as the interrupt wouldn't have occurred until l weren't
> released:
>
>   CPU
>   .
>   .
>   spin_lock_irq(l)                        //takes l
>   .
>   . <---------------- Interrupt!
>   .                   -                   //IRQs are disabled
>   .                   -                   //IRQs are disabled
>   .                   -                   //IRQs are disabled
>   spin_unlock_irq(l)  .                   //IRQs re-hanbled
>                       spin_lock_irq(l);   //takes l
>                       .
>                       .
>                       spin_unlock_irq(l);
>  . <----------------- .
>  .
>
> Note that whether spin_lock_irq() or spin_lock_irqsave() should be
> used, is completely independent from this, and it must be decided
> according to whether IRQs are disabled already or not when taking the
> lock. And (quite obviously, but since wre're here) it is fine to mix
> spin_lock_irq() and spin_lock_irqsave(), as they both disable
> interrupts, which is what matters.
>
> So, now, if we know for sure that a lock is _never_ever_ever_ taken
> from interrupt context, can we mix spin_lock() and spin_lock_irq() on
> it (for whatever reason)? Well, as far as the above reasoning is
> concerned, yes.
>
> In fact, the deadlock arises because IRQs interrupt asynchronously what
> the CPU is doing, and that can happen when the CPU has taken the lock
> already. But if the 'asynchronous' part goes away, we really don't care
> whether a lock is take at time t1 with IRQ enabled, and at time t2 with
> IRQ disabled, don't you think?
>
> Well, here it is where what the comment inside check_lock() describes
> comes into play. As quoted by Qaun already:
>
>>  2). Comment inside check_lock(),
>> we partition locks into IRQ-safe (always held with IRQs disabled) and
>> IRQ-unsafe (always held with IRQs enabled) types. The convention for
>> every lock must be consistently observed else we can deadlock in
>> IRQ-context rendezvous functions (__a rendezvous which gets every CPU
>> into IRQ context before any CPU is released from the rendezvous__).
>> If we can mix IRQ-disabled and IRQ-enabled callers, the following can
>> happen:
>>  * Lock is held by CPU A, with IRQs enabled
>>  * CPU B is spinning on same lock, with IRQs disabled
>>  * Rendezvous starts -- CPU A takes interrupt and enters rendezbous
>> spin
>>  * DEADLOCK -- CPU B will never enter rendezvous, CPU A will never
>> exit
>>                the rendezvous, and will hence never release the lock.
>>
>> To guard against this subtle bug we latch the IRQ safety of every
>> spinlock in the system, on first use.
>>
> This is a very good safety measure, but it can be quite a restriction
> in some (luckily limited) cases. And that's why it is possible --
> although absolutely discouraged-- to relax it. See, for an example of
> this, the comment in start_secondary(), in xen/arch/x86/smpboot.c:
>
>     ...
>     /*
>      * Just as during early bootstrap, it is convenient here to disable
>      * spinlock checking while we have IRQs disabled. This allows us to
>      * acquire IRQ-unsafe locks when it would otherwise be disallowed.
>      *
>      * It is safe because the race we are usually trying to avoid involves
>      * a group of CPUs rendezvousing in an IPI handler, where one cannot
>      * join because it is spinning with IRQs disabled waiting to acquire a
>      * lock held by another in the rendezvous group (the lock must be an
>      * IRQ-unsafe lock since the CPU took the IPI after acquiring it, and
>      * hence had IRQs enabled). This is a deadlock scenario.
>      *
>      * However, no CPU can be involved in rendezvous until it is online,
>      * hence no such group can be waiting for this CPU until it is
>      * visible in cpu_online_map. Hence such a deadlock is not possible.
>      */
>     spin_debug_disable();
>     ...
>
> Just FTR, at least as far as I can remember, Linux does not enforce
> anything like that.
>
> Hope this clarifies things.

Thank you very much for your explanation and education! They are
really helpful! :-D

Let me summarize: ;-)

                                 | spin_lock | spin_lock_irq | spin_lock_irqsave
Can run in irq context?          | No        | Yes           | Yes
Can run in irq disabled context? | No        | No            | Yes

Why deadlock may occur if we mix the spin_lock and spin_lock_irq(save)?
If we mix the spin_lock and spin_lock_irq(save), and a group of CPUs
rendezvousing in an IPI handler, we will have deadlock. Because the
CPU A that takes spin_lock will wait for the rendezvousing condition
to be satisfied, while the CPU B that takes th spin_lock_irq(save)
will not enter into the rendezvousing condition (since it disable the
interrupt). Then,
CPU A waits for CPU B to handle the IPI to get out of the
rendezvousing condition (kind of synchrous point), which won't until
it gets the spin_lock.
CPU B waits for CPU A to release the spin_lock, which won't until it
get out of the rendezvousing condition;

Are my understanding and summary correct?

Thanks and Best Regards,

Meng

-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania
http://www.cis.upenn.edu/~mengxu/
Meng Xu March 11, 2016, 2:49 p.m. UTC | #9
>
>> In fact, the deadlock arises because IRQs interrupt asynchronously what the
>> CPU is doing, and that can happen when the CPU has taken the lock already. But
>> if the 'asynchronous' part goes away, we really don't care whether a lock is take
>> at time t1 with IRQ enabled, and at time t2 with IRQ disabled, don't you think?
>>
>
> Yes.
> Consistency may be helpful to avoid some easy-to-avoid lock errors.
> Moreover, without my fix, I think it would not lead dead lock, as the pcidevs_lock is not being taken
> In IRQ context. Right?

I think without your fix, the deadlock may still happen due to the
rendezvous condition.
       | CPU A                        | CPU B                | CPU C
Step 1 | spin_lock                    |                      |
Step 2 |                              | spin_lock_irq        |
Step 3 |                              | wait for A to unlock |
Step 4 |                              |                      | send rendezvous IPI to A and B
Step 5 | receive IPI                  | wait for A to unlock |
Step 6 | wait for B to handle the IPI | wait for A to unlock |
Step 7 | spin_unlock                  |                      |


Deadlock occurs at Step 6, IMO.

Unless we can prove that rendezvous won't happen while
spin_lock_irqsave is taken, we have the deadlock hazard.

Actually, I'm not quite sure when the rendezvous condition may happen
in  this situation. So probably, in reality, we don't have the
rendezvous condition.

>
>
> For deadlock, I think the key problems are:
>   - A lock can be acquired from IRQ context
>   -The interrupt is delivered to the _same_CPU_ that already holds the lock.
>
>

This is one type of deadlock, not the one due to rendezvous condition,
I think. :-)

Thanks and Best Regards,

Meng

-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania
http://www.cis.upenn.edu/~mengxu/
Dario Faggioli March 11, 2016, 3:55 p.m. UTC | #10
On Fri, 2016-03-11 at 09:49 -0500, Meng Xu wrote:
> > Yes.
> > Consistency may be helpful to avoid some easy-to-avoid lock errors.
> > Moreover, without my fix, I think it would not lead dead lock, as
> > the pcidevs_lock is not being taken
> > In IRQ context. Right?
> I think without your fix, the deadlock may still happen due to the
> rendezvous condition.
>        | CPU A                        | CPU B                | CPU C
> Step 1 | spin_lock                    |                      |
> Step 2 |                              | spin_lock_irq        |
> Step 3 |                              | wait for A to unlock |
> Step 4 |                              |                      | send rendezvous IPI to A and B
> Step 5 | receive IPI                  | wait for A to unlock |
> Step 6 | wait for B to handle the IPI | wait for A to unlock |
> Step 7 | spin_unlock                  |                      |
> 
> 
> Deadlock occurs at Step 6, IMO.
> 
> Unless we can prove that rendezvous won't happen while
> spin_lock_irqsave is taken, we have the deadlock hazard.
> 
Yes. But, in the case of Quan's patch (without it, I mean), have you
seen where in the code it is that we use spin_lock_irqsave()?

It's inside a function that is called during Xen boot, whose callchain
starts with iommu_setup(), from __start_xen(). Here's a (big, sorry)
code snippet of what is around iommu_setup():

    ...
    init_idle_domain();

    this_cpu(stubs.addr) = alloc_stub_page(smp_processor_id(),
                                           &this_cpu(stubs).mfn);
    BUG_ON(!this_cpu(stubs.addr));

    trap_init();
    rcu_init();

    early_time_init();

    arch_init_memory();

    alternative_instructions();

    local_irq_enable();

    pt_pci_init();

    vesa_mtrr_init();

    acpi_mmcfg_init();

    early_msi_init();

    iommu_setup();    /* setup iommu if available */

    smp_prepare_cpus(max_cpus);

    spin_debug_enable();

    /*
     * Initialise higher-level timer functions. We do this fairly late
     * (after interrupts got enabled) because the time bases and scale
     * factors need to be updated regularly.
     */
    init_xen_time();
    initialize_keytable();
    console_init_postirq();

    system_state = SYS_STATE_smp_boot;
    do_presmp_initcalls();
    for_each_present_cpu ( i )
    {
        /* Set up cpu_to_node[]. */
        srat_detect_node(i);
        /* Set up node_to_cpumask based on cpu_to_node[]. */
        numa_add_cpu(i);

        if ( (num_online_cpus() < max_cpus) && !cpu_online(i) )
        {
            int ret = cpu_up(i);
            if ( ret != 0 )
                printk("Failed to bring up CPU %u (error %d)\n", i, ret);
        }
    }
    printk("Brought up %ld CPUs\n", (long)num_online_cpus());
    ...

As you can see, it is only *after* iommu_setup() that we call functions
like smp_prepare_cpus(), do_presmp_initcalls(), and then the loop that
waits for all the present CPUs to come online.

What that means is that, at iommu_setup() time, there still is only one
CPU online, and there is not much chance that one single CPU deadlocks
in a rendezvous!

Honestly, the biggest issue that I think Quan's patch solves, is that
if we ever want/manage to move spin_debug_enable() up above it, then
the BUG_ON in check_lock() would trigger the first time that
pcidevs_lock would be taken with interrupts enabled.

Until then, code is technically fine, and, as a matter of fact, I think
that removing the locking from that particular instance would be an
equally effective fix!

All that being said, consistency is indeed important. For the sake of
it, and for other reasons too, even if, strictly speaking, there
isn't any actual buggy behavior to be fixed here, it is worthwhile
conforming to a locking pattern that is consistent with the rules that
we set ourselves, unless there are specific reasons not to.

Regards,
Dario
Dario Faggioli March 11, 2016, 4:12 p.m. UTC | #11
On Fri, 2016-03-11 at 09:41 -0500, Meng Xu wrote:
> Thank you very much for your explanation and education! They are
> really helpful! :-D
> 
> Let me summarize: ;-)
>
>                                   | spin_lock | spin_lock_irq | spin_lock_irqsave
> Can run in irq context?           | No        | Yes           | Yes
> Can run in irq disabled context?  | No        | No            | Yes
>
The table came out mangled, at least in my mail reader. For what I can
see, I think I see the point you're trying to make, and it's hard to
say whether the table itself is right or wrong...

Probably, a table like this, is just not the best way with which to try
to represent things.

For instance, you seem to be saying that spin_lock() can't be used if
interrupts are disabled, which is not true.

For instance, it is perfectly fine to do the following:

    CPU:
    .
    spin_lock_irq(l1);
    .
    .
[1] spin_lock(l2);
    .
    .
[2] spin_unlock(l2);
    .
    .
    spin_unlock_irq(l1);
    .

Even if l2 is also taken in an interrupt handler. In fact, what we want
is to make sure that such an interrupt handler that may take l2 does not
interrupt the CPU in the [1]--[2] time frame... But that can't happen
because interrupts are already disabled, so just using spin_lock() is
ok, and should be done, as that's cheaper than spin_lock_irqsave().
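
In (sketchy, Xen-flavoured) code, the pattern in the diagram above would look
something like this; l1 and l2 are the made-up lock names from the diagram:

    #include <xen/spinlock.h>

    static DEFINE_SPINLOCK(l1);   /* never taken from IRQ context */
    static DEFINE_SPINLOCK(l2);   /* also taken from an interrupt handler */

    static void outer_path(void)
    {
        spin_lock_irq(&l1);       /* IRQs are disabled from here on */

        spin_lock(&l2);           /* plain spin_lock() is enough: the handler
                                   * that also takes l2 cannot preempt us on
                                   * this CPU while IRQs are off */
        /* ... critical section protected by l2 ... */
        spin_unlock(&l2);

        spin_unlock_irq(&l1);     /* IRQs re-enabled */
    }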

Fact is, what really matters is whether or not a lock is taken both
inside and outside of IRQ context. As a matter of fact, it is mostly
the case that, if a lock is ever taken inside an interrupt handler,
then spin_lock_irqsave() is what you want... but this does not justify
coming up with a table like the one above and claiming it to be
general.

> Why deadlock may occur if we mix the spin_lock and
> spin_lock_irq(save)?
> If we mix the spin_lock and spin_lock_irq(save), and a group of CPUs
> rendezvousing in an IPI handler, we will have deadlock. Because the
> CPU A that takes spin_lock will wait for the rendezvousing condition
> to be satisfied, while the CPU B that takes th spin_lock_irq(save)
> will not enter into the rendezvousing condition (since it disable the
> interrupt). Then,
> CPU A waits for CPU B to handle the IPI to get out of the
> rendezvousing condition (kind of synchrous point), which won't until
> it gets the spin_lock.
> CPU B waits for CPU A to release the spin_lock, which won't until it
> get out of the rendezvousing condition;
> 
> Are my understanding and summary correct?
> 
Yes, this looks like a decent explanation of the rendezvous-related spinlock
problem... Although I prefer the wording that we do have already in
check_lock() and in start_secondary(). :-D

Regards,
Dario
Meng Xu March 11, 2016, 5:17 p.m. UTC | #12
On Fri, Mar 11, 2016 at 10:55 AM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> On Fri, 2016-03-11 at 09:49 -0500, Meng Xu wrote:
>> > Yes.
>> > Consistency may be helpful to avoid some easy-to-avoid lock errors.
>> > Moreover, without my fix, I think it would not lead dead lock, as
>> > the pcidevs_lock is not being taken
>> > In IRQ context. Right?
>> I think without your fix, the deadlock may still happen due to the
>> rendezvous condition.
>>        | CPU A                        | CPU B                | CPU C
>> Step 1 | spin_lock                    |                      |
>> Step 2 |                              | spin_lock_irq        |
>> Step 3 |                              | wait for A to unlock |
>> Step 4 |                              |                      | send rendezvous IPI to A and B
>> Step 5 | receive IPI                  | wait for A to unlock |
>> Step 6 | wait for B to handle the IPI | wait for A to unlock |
>> Step 7 | spin_unlock                  |                      |
>>
>>
>> Deadlock occurs at Step 6, IMO.
>>
>> Unless we can prove that rendezvous won't happen while
>> spin_lock_irqsave is taken, we have the deadlock hazard.
>>
> Yes. But, in the case of Quan's patch (without it, I mean), have you
> seen where in the code it is that we use spin_lock_irqsave()?
>
> It's inside a function that is called during Xen boot, whose callchain
> starts with iommu_setup(), from __start_xen(). Here's a (big, sorry)
> code snippet of what is around iommu_setup():
>
>     ...
>     init_idle_domain();
>
>     this_cpu(stubs.addr) = alloc_stub_page(smp_processor_id(),
>                                            &this_cpu(stubs).mfn);
>     BUG_ON(!this_cpu(stubs.addr));
>
>     trap_init();
>     rcu_init();
>
>     early_time_init();
>
>     arch_init_memory();
>
>     alternative_instructions();
>
>     local_irq_enable();
>
>     pt_pci_init();
>
>     vesa_mtrr_init();
>
>     acpi_mmcfg_init();
>
>     early_msi_init();
>
>     iommu_setup();    /* setup iommu if available */
>
>     smp_prepare_cpus(max_cpus);
>
>     spin_debug_enable();
>
>     /*
>      * Initialise higher-level timer functions. We do this fairly late
>      * (after interrupts got enabled) because the time bases and scale
>      * factors need to be updated regularly.
>      */
>     init_xen_time();
>     initialize_keytable();
>     console_init_postirq();
>
>     system_state = SYS_STATE_smp_boot;
>     do_presmp_initcalls();
>     for_each_present_cpu ( i )
>     {
>         /* Set up cpu_to_node[]. */
>         srat_detect_node(i);
>         /* Set up node_to_cpumask based on cpu_to_node[]. */
>         numa_add_cpu(i);
>
>         if ( (num_online_cpus() < max_cpus) && !cpu_online(i) )
>         {
>             int ret = cpu_up(i);
>             if ( ret != 0 )
>                 printk("Failed to bring up CPU %u (error %d)\n", i, ret);
>         }
>     }
>     printk("Brought up %ld CPUs\n", (long)num_online_cpus());
>     ...
>
> As you can see, it is only *after* iommu_setup() that we call functions
> like smp_prepare_cpus(), do_presmp_initcalls(), and then the loop that
> waits for all the present CPUs to come online.
>
> What that means is that, at iommu_setup() time, there still is only one
> CPU online, and there is not much chances that one single CPU deadlocks
> in a rendezvous!
>
> Honestly, the biggest issue that I think Quan's patch solves, is that
> if we ever want/manage to move spin_debug_enable() up above it, then
> the BUG_ON in check_lock() would trigger the first time that
> pcidevs_lock would be taken with interrupts enabled.

Right! I got it now.. :-)
The lock is initialized as SPIN_LOCK_UNLOCKED.
check_lock() will latch the lock as IRQ-safe when it is first taken in an
interrupt-disabled situation, and the BUG_ON will then trigger when it is
later taken with interrupts enabled.

>
> Until then, code is technically fine, and, as a matter of fact, I think
> that removing the locking from that particular instance would be an
> equally effective fix!
>
> All that being said, consistency is indeed important, and for the sake
> of it and for other reasons too, even if, strictly speaking, there
> isn't any actual buggy behavior to be fixed here, and it is worthwhile
> conforming to a locking pattern that is consistent with the rules that
> we sat ourselves, unless there's specific reasons not to.

I see. Even though no deadlock may happen, consistency can be the
reason to modify. :-)
But it's good to know the real reason for making the change, which
could be avoiding the deadlock, consistency, or both.
It seems to me that this is more for consistency, and to avoid the
potential deadlock (although it's not clear it will eventually happen,
it could be a hazard).

Thank you, Dario and Quan, very much for your explanation! :-) They
are really helpful! :-D

BTW, I'm sorry for the messed format of the table. Let me reformat it here:

Condition 1 (C1): Can run in irq context?
Condition 2 (C2): Can run in irq-disabled context?

   | spin_lock | spin_lock_irq | spin_lock_irqsave
C1 | No        | Yes           | Yes
C2 | No        | No            | Yes

It may be too hasty (or incorrect) to draw a generalized conclusion, but
it should give some overview of the differences between these three locks.
Just for reference later. ;-)

Thanks,

Meng

-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania
http://www.cis.upenn.edu/~mengxu/

Patch

diff --git a/xen/drivers/passthrough/amd/iommu_init.c b/xen/drivers/passthrough/amd/iommu_init.c
index d90a2d2..a400497 100644
--- a/xen/drivers/passthrough/amd/iommu_init.c
+++ b/xen/drivers/passthrough/amd/iommu_init.c
@@ -778,7 +778,6 @@  static bool_t __init set_iommu_interrupt_handler(struct amd_iommu *iommu)
 {
     int irq, ret;
     hw_irq_controller *handler;
-    unsigned long flags;
     u16 control;
 
     irq = create_irq(NUMA_NO_NODE);
@@ -788,10 +787,10 @@  static bool_t __init set_iommu_interrupt_handler(struct amd_iommu *iommu)
         return 0;
     }
 
-    spin_lock_irqsave(&pcidevs_lock, flags);
+    spin_lock(&pcidevs_lock);
     iommu->msi.dev = pci_get_pdev(iommu->seg, PCI_BUS(iommu->bdf),
                                   PCI_DEVFN2(iommu->bdf));
-    spin_unlock_irqrestore(&pcidevs_lock, flags);
+    spin_unlock(&pcidevs_lock);
     if ( !iommu->msi.dev )
     {
         AMD_IOMMU_DEBUG("IOMMU: no pdev for %04x:%02x:%02x.%u\n",