
xen/arm: fix rank/vgic locks inversion bug

Message ID alpine.DEB.2.10.1612141524350.9245@sstabellini-ThinkPad-X260 (mailing list archive)
State New, archived

Commit Message

Stefano Stabellini Dec. 15, 2016, 1:04 a.m. UTC
The locking order is: first rank lock, then vgic lock. The order is
respected everywhere, except for gic_update_one_lr.

gic_update_one_lr is called with the vgic lock held, but it calls
vgic_get_target_vcpu, which tries to obtain the rank lock. This can
cause deadlocks.

We already have a version of vgic_get_target_vcpu that doesn't take the
rank lock: __vgic_get_target_vcpu. Also, the only routines that modify
the target vcpu are vgic_store_itargetsr and vgic_store_irouter. They
call vgic_migrate_irq, which already takes the vgic lock.

Solve the lock inversion problem by not taking the rank lock in
gic_update_one_lr (calling __vgic_get_target_vcpu instead of
vgic_get_target_vcpu). We make this safe by placing modifications to
rank->vcpu within regions protected by the vgic lock.

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>

---
There are supposed to be 2 Coverity IDs associated with this, but the
system is currently down.
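
For illustration, here is a minimal standalone model of the two conflicting
lock orders described above, using pthread mutexes to stand in for the rank
lock and the per-vcpu vgic lock. This is only a sketch of the AB/BA pattern,
not Xen code; the Xen call paths are indicated in the comments.

    #include <pthread.h>

    /* Stand-ins for the two locks involved in the inversion. */
    static pthread_mutex_t rank_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t vgic_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Path 1: vgic_store_itargetsr/vgic_store_irouter -> vgic_migrate_irq.
     * Rank lock first, then vgic lock: the documented order. */
    static void *mmio_emulation_path(void *arg)
    {
        pthread_mutex_lock(&rank_lock);
        pthread_mutex_lock(&vgic_lock);
        /* ... update the target vcpu ... */
        pthread_mutex_unlock(&vgic_lock);
        pthread_mutex_unlock(&rank_lock);
        return NULL;
    }

    /* Path 2: gic_update_one_lr before this patch.
     * Entered with the vgic lock held, then vgic_get_target_vcpu takes
     * the rank lock: the inverted order. */
    static void *lr_maintenance_path(void *arg)
    {
        pthread_mutex_lock(&vgic_lock);
        pthread_mutex_lock(&rank_lock);   /* can deadlock against path 1 */
        /* ... read rank->vcpu ... */
        pthread_mutex_unlock(&rank_lock);
        pthread_mutex_unlock(&vgic_lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;

        /* With an unlucky interleaving each thread ends up holding one
         * lock while waiting forever for the other. */
        pthread_create(&t1, NULL, mmio_emulation_path, NULL);
        pthread_create(&t2, NULL, lr_maintenance_path, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }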

Comments

Julien Grall Dec. 15, 2016, 10:55 p.m. UTC | #1
Hi Stefano,

On 15/12/2016 01:04, Stefano Stabellini wrote:
> The locking order is: first rank lock, then vgic lock. The order is
> respected everywhere, except for gic_update_one_lr.
>
> gic_update_one_lr is called with the vgic lock held, but it calls
> vgic_get_target_vcpu, which tries to obtain the rank lock. This can
> cause deadlocks.
>
> We already have a version of vgic_get_target_vcpu that doesn't take the
> rank lock: __vgic_get_target_vcpu. Also the only routine that modify
> the target vcpu are vgic_store_itargetsr and vgic_store_irouter. They
> call vgic_migrate_irq, which already take the vgic lock.
>
> Solve the lock inversion problem, by not taking the rank lock in
> gic_update_one_lr (calling __vgic_get_target_vcpu instead of
> vgic_get_target_vcpu).

If I look at the callers of gic_update_one_lr, the function 
gic_clear_lrs will not take the rank lock. So from my understanding 
nobody will take the rank lock here.

However __vgic_get_target_vcpu has an ASSERT to check whether the rank 
lock has been taken. So who is taking the lock for gic_update_one_lr now?

>  We make this safe, by placing modifications to
> rank->vcpu within regions protected by the vgic lock.

This looks unsafe to me. The vgic lock is per-vcpu, but rank->vcpu could 
be written/read by any vCPU. So you will never protect rank->vcpu with 
this lock. Did I miss anything?

>
> Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
>
> ---
> There are supposed to be 2 Coverity IDs associated with this, but the
> system the currently down.

I think the 2 Coverity IDs are 1381855 and 1381853.

Cheers,
Stefano Stabellini Dec. 16, 2016, 12:08 a.m. UTC | #2
On Thu, 15 Dec 2016, Julien Grall wrote:
> Hi Stefano,
> 
> On 15/12/2016 01:04, Stefano Stabellini wrote:
> > The locking order is: first rank lock, then vgic lock. The order is
> > respected everywhere, except for gic_update_one_lr.
> > 
> > gic_update_one_lr is called with the vgic lock held, but it calls
> > vgic_get_target_vcpu, which tries to obtain the rank lock. This can
> > cause deadlocks.
> > 
> > We already have a version of vgic_get_target_vcpu that doesn't take the
> > rank lock: __vgic_get_target_vcpu. Also the only routine that modify
> > the target vcpu are vgic_store_itargetsr and vgic_store_irouter. They
> > call vgic_migrate_irq, which already take the vgic lock.
> > 
> > Solve the lock inversion problem, by not taking the rank lock in
> > gic_update_one_lr (calling __vgic_get_target_vcpu instead of
> > vgic_get_target_vcpu).
> 
> If I look at the callers of gic_update_one_lr, the function gic_clear_lrs will
> not take the rank lock. So from my understanding nobody will take the rank
> here.
>
> However __vgic_get_target_vcpu has an ASSERT to check whether the rank lock
> has been taken. So who is taking lock for gic_update_one_lr now?

I should have removed the ASSERT - nobody is taking the rank lock now on
the gic_update_one_lr code path.


>  We make this safe, by placing modifications to
> > rank->vcpu within regions protected by the vgic lock.
> 
> This look unsafe to me. The vgic lock is per-vcpu, but rank->vcpu could be
> written/read by any vCPU. So you will never protect rank->vcpu with this lock.
> Did I miss anything?

The vgic lock is per-vcpu, but it is always the vgic lock of the old
vcpu (old in the sense of: before the interrupt is migrated) that is taken.

On one hand, any vcpu can read/write itargetsr. Those operations are
protected by the rank lock. However vgic_migrate_irq and writes to
rank->vcpu are also protected by the vgic lock of the old vcpu (first
the rank lock is taken, then the vgic lock of the old vcpu).

gic_update_one_lr doesn't take the rank lock, but it takes the vgic lock
of the old vcpu. We know that it is the old vcpu because it is the one
with the interrupt in question in an LR register. gic_update_one_lr's
accesses to rank->vcpu are safe because they are protected by the same
vgic lock.

Does it make sense?
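
To make the above concrete, this is a condensed sketch of the intended
nesting on the writer side, using the lock and field names from the patch
at the bottom of this page (error handling and the real ITARGETSR decoding
are omitted; it is an outline, not a complete function):

    /* Writer side (vgic_store_itargetsr / vgic_store_irouter), sketched:
     * rank lock first, then the vgic lock of the *old* vcpu. */
    spin_lock(&rank->lock);

    old = d->vcpu[rank->vcpu[offset]];                /* old target */
    spin_lock_irqsave(&old->arch.vgic.lock, flags);

    if ( new_target != old_target )
        vgic_migrate_irq(old, d->vcpu[new_target], virq);
    rank->vcpu[offset] = new_target;                  /* written under the old vgic lock */

    spin_unlock_irqrestore(&old->arch.vgic.lock, flags);
    spin_unlock(&rank->lock);

    /* Reader side (gic_update_one_lr), sketched: it already holds the vgic
     * lock of the vcpu owning the LR, i.e. the old vcpu, so the read of
     * rank->vcpu via __vgic_get_target_vcpu is serialised against the write
     * above without taking the rank lock. */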



> > 
> > Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
> > 
> > ---
> > There are supposed to be 2 Coverity IDs associated with this, but the
> > system the currently down.
> 
> I think the 2 Coverity IDs are 1381855 and 1381853.
Julien Grall Dec. 16, 2016, 11:51 a.m. UTC | #3
Hi Stefano,

On 16/12/16 00:08, Stefano Stabellini wrote:
> On Thu, 15 Dec 2016, Julien Grall wrote:
>> On 15/12/2016 01:04, Stefano Stabellini wrote:
>>> The locking order is: first rank lock, then vgic lock. The order is
>>> respected everywhere, except for gic_update_one_lr.
>>>
>>> gic_update_one_lr is called with the vgic lock held, but it calls
>>> vgic_get_target_vcpu, which tries to obtain the rank lock. This can
>>> cause deadlocks.
>>>
>>> We already have a version of vgic_get_target_vcpu that doesn't take the
>>> rank lock: __vgic_get_target_vcpu. Also the only routine that modify
>>> the target vcpu are vgic_store_itargetsr and vgic_store_irouter. They
>>> call vgic_migrate_irq, which already take the vgic lock.
>>>
>>> Solve the lock inversion problem, by not taking the rank lock in
>>> gic_update_one_lr (calling __vgic_get_target_vcpu instead of
>>> vgic_get_target_vcpu).
>>
>> If I look at the callers of gic_update_one_lr, the function gic_clear_lrs will
>> not take the rank lock. So from my understanding nobody will take the rank
>> here.
>>
>> However __vgic_get_target_vcpu has an ASSERT to check whether the rank lock
>> has been taken. So who is taking lock for gic_update_one_lr now?
>
> I should have removed the ASSERT - nobody is taking the rank lock now on
> the gic_update_one_lr code path.

Please either keep this ASSERT or update it. But do not drop it: we 
need to prevent anyone from calling this function without any lock taken, 
at least in debug builds.

The current callers (see vgic_{disable,enable}_irqs) are calling the 
function with the rank lock taken.

So we need to keep this behavior at least for them.

>
>
>>  We make this safe, by placing modifications to
>>> rank->vcpu within regions protected by the vgic lock.
>>
>> This look unsafe to me. The vgic lock is per-vcpu, but rank->vcpu could be
>> written/read by any vCPU. So you will never protect rank->vcpu with this lock.
>> Did I miss anything?
>
> The vgic lock is per-vcpu, but it is always the vgic lock of the old (old
> in the sense, before the interrupt is migrated) vcpu that is taken.
>
> On one hand any vcpu can read/write to itargetsr. Those operations are
> protected by the rank lock. However vgic_migrate_irq and writes to
> rank->vcpu are protected also by the vgic lock of the old vcpu (first
> the rank lock is taken, then the vgic lock of the old vcpu).

I don't think this is true. Any vCPU read/write to itargetsr will be 
protected by the vgic lock of the vCPU in rank->vcpu[offset].

For an inflight IRQ, the migration will happen when the guest has EOIed 
the IRQ. Until this happens, the guest may have written to ITARGETSR 
multiple times.

For instance, let's say the guest is requesting to migrate the IRQ from 
vCPU A to vCPU B. The vgic lock of A will be taken in vgic_store_itargetsr; 
if the interrupt is inflight, vgic_migrate_irq will set a flag. Now the 
guest could decide to migrate the interrupt to vCPU C before the 
interrupt has been EOIed. In this case, the vgic lock of B will be taken.

So we have a mismatch between the lock taken in gic_update_one_lr and 
vgic_migrate_irq.

I think the only safe way to fully protect rank->vcpu is taking the rank 
lock. It never changes and is accessible from anywhere.

>
> gic_update_one_lr doesn't take the rank lock, but it takes the vgic lock
> of the old vcpu. We know that it is the old vcpu because it is the one
> with the interrupt in question in an LR register. gic_update_one_lr
> accesses to rank->vcpu are safe because they are protected by the same
> vgic lock.
>
> Does it make sense?

This needs more documentation in the code. Regardless of the 
documentation, I think the code is becoming really complex to understand 
because, with your changes, rank->vcpu is protected by both the rank lock 
and the old vgic lock. And depending on the caller, only one of the locks 
will be taken.

Note that it is not possible to add an 
ASSERT(spin_is_locked(&v->arch.vgic.lock)) in __vgic_get_target_vcpu, 
because v may not be the old vCPU (see the scenario I mentioned yesterday).

>>>
>>> Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
>>>
>>> ---
>>> There are supposed to be 2 Coverity IDs associated with this, but the
>>> system the currently down.
>>
>> I think the 2 Coverity IDs are 1381855 and 1381853.
>
>

Regards,
Stefano Stabellini Dec. 17, 2016, 1:08 a.m. UTC | #4
On Fri, 16 Dec 2016, Julien Grall wrote:
> Hi Stefano,
> 
> On 16/12/16 00:08, Stefano Stabellini wrote:
> > On Thu, 15 Dec 2016, Julien Grall wrote:
> > > On 15/12/2016 01:04, Stefano Stabellini wrote:
> > > > The locking order is: first rank lock, then vgic lock. The order is
> > > > respected everywhere, except for gic_update_one_lr.
> > > > 
> > > > gic_update_one_lr is called with the vgic lock held, but it calls
> > > > vgic_get_target_vcpu, which tries to obtain the rank lock. This can
> > > > cause deadlocks.
> > > > 
> > > > We already have a version of vgic_get_target_vcpu that doesn't take the
> > > > rank lock: __vgic_get_target_vcpu. Also the only routine that modify
> > > > the target vcpu are vgic_store_itargetsr and vgic_store_irouter. They
> > > > call vgic_migrate_irq, which already take the vgic lock.
> > > > 
> > > > Solve the lock inversion problem, by not taking the rank lock in
> > > > gic_update_one_lr (calling __vgic_get_target_vcpu instead of
> > > > vgic_get_target_vcpu).
> > > 
> > > If I look at the callers of gic_update_one_lr, the function gic_clear_lrs
> > > will
> > > not take the rank lock. So from my understanding nobody will take the rank
> > > here.
> > > 
> > > However __vgic_get_target_vcpu has an ASSERT to check whether the rank
> > > lock
> > > has been taken. So who is taking lock for gic_update_one_lr now?
> > 
> > I should have removed the ASSERT - nobody is taking the rank lock now on
> > the gic_update_one_lr code path.
> 
> Please either keep this ASSERT or update it. But not dropping it, we need to
> prevent anyone calling this function without any lock taken at least in debug
> build.
> 
> The current callers (see vgic_{disable,enable}_irqs) are calling the function
> with rank lock taken.
> 
> So we need to keep this behavior at least for them.
> 
> > >  We make this safe, by placing modifications to
> > > > rank->vcpu within regions protected by the vgic lock.
> > > 
> > > This look unsafe to me. The vgic lock is per-vcpu, but rank->vcpu could be
> > > written/read by any vCPU. So you will never protect rank->vcpu with this
> > > lock.
> > > Did I miss anything?
> > 
> > The vgic lock is per-vcpu, but it is always the vgic lock of the old (old
> > in the sense, before the interrupt is migrated) vcpu that is taken.
> > 
> > On one hand any vcpu can read/write to itargetsr. Those operations are
> > protected by the rank lock. However vgic_migrate_irq and writes to
> > rank->vcpu are protected also by the vgic lock of the old vcpu (first
> > the rank lock is taken, then the vgic lock of the old vcpu).
> 
> I don't think this is true. Any vCPU read/write to itargetsr will be protected
> by the vgic lock of the vCPU in rank->vcpu[offset].
> 
> For inflight IRQ, the migration will happen when the guest has EOIed the IRQ.
> Until this is happening, the guest may have written multiple time in
> ITARGETSR.
> 
> For instance, let say the guest is requesting to migrate the IRQ from vCPU A
> to vCPU B. The vgic lock A will be taken in vgic_store_itargetsr, if the
> interrupt is inflight vgic_migrate_irq will set a flag. Now the guest could
> decide to migrate the interrupt to vCPU C before the interrupt has been EOIed.
> In this case, vgic lock B will be taken.
> 
> So we have a mismatch between the lock taken in gic_update_one_lr and
> vgic_migrate_irq.
> 
> I think the only safe way to fully protect rank->vcpu is taking the rank lock.
> It will never change and be accessible from anywhere.

When I submitted this series, I considered this scenario, and I thought
that the existing GIC_IRQ_GUEST_MIGRATING flag should be enough to
protect it, because when GIC_IRQ_GUEST_MIGRATING is set,
vgic_migrate_irq actually returns immediately, without doing anything.
Either GIC_IRQ_GUEST_MIGRATING is not set, and it's the first irq
migration, or GIC_IRQ_GUEST_MIGRATING is set, and vgic_migrate_irq just
returns.

However, as I was writing this explanation to you in more detail, I
realized that it is not actually safe. Concurrent accesses to
rank->vcpu, and concurrent calls to irq_set_affinity, can result in the
physical affinity actually being set to the intermediate pcpu rather than
the final one. As a matter of fact, one of these races affects the code
today, even without this patch series.

I considered various possible ways to solve both this problem and the
lock inversion problem at once. These are the potential solutions I came
up with:

1) We release the vgic lock at the end of gic_update_one_lr, before
test_and_clear_bit(GIC_IRQ_GUEST_MIGRATING, &p->status). Then we take
the rank lock. We don't need the per-vcpu vgic lock to execute the
following code safely, the rank lock is enough:

            if ( test_and_clear_bit(GIC_IRQ_GUEST_MIGRATING, &p->status) )
            {
                struct vcpu *v_target = __vgic_get_target_vcpu(v, irq);
                irq_set_affinity(p->desc, cpumask_of(v_target->processor));
            }

This sounds simple in theory, but in practice it is very ugly. We would
end up with a function (gic_update_one_lr) that is called with the vgic
lock held but actually releases it before returning.

2) We run gic_update_one_lr and vgic_store_itargetsr in parallel safely
and locklessly. There might be a way to do it, but it is not easy and I
haven't found it yet.

3) Use the vgic lock on the old vcpu to protect read accesses to
rank->vcpu. This is the same strategy as implemented in this series.
However, to actually work correctly, we also have to store the original
vcpu id in struct pending_irq. That way, we know for sure which vgic
lock we need to take.


Let me premise that I don't like any of these approaches very much. But
I think that 1) is the least bad. If that's OK for you, I'll do that.
Stefano Stabellini Dec. 17, 2016, 1:13 a.m. UTC | #5
On Fri, 16 Dec 2016, Stefano Stabellini wrote:
> On Fri, 16 Dec 2016, Julien Grall wrote:
> > Hi Stefano,
> > 
> > On 16/12/16 00:08, Stefano Stabellini wrote:
> > > On Thu, 15 Dec 2016, Julien Grall wrote:
> > > > On 15/12/2016 01:04, Stefano Stabellini wrote:
> > > > > The locking order is: first rank lock, then vgic lock. The order is
> > > > > respected everywhere, except for gic_update_one_lr.
> > > > > 
> > > > > gic_update_one_lr is called with the vgic lock held, but it calls
> > > > > vgic_get_target_vcpu, which tries to obtain the rank lock. This can
> > > > > cause deadlocks.
> > > > > 
> > > > > We already have a version of vgic_get_target_vcpu that doesn't take the
> > > > > rank lock: __vgic_get_target_vcpu. Also the only routine that modify
> > > > > the target vcpu are vgic_store_itargetsr and vgic_store_irouter. They
> > > > > call vgic_migrate_irq, which already take the vgic lock.
> > > > > 
> > > > > Solve the lock inversion problem, by not taking the rank lock in
> > > > > gic_update_one_lr (calling __vgic_get_target_vcpu instead of
> > > > > vgic_get_target_vcpu).
> > > > 
> > > > If I look at the callers of gic_update_one_lr, the function gic_clear_lrs
> > > > will
> > > > not take the rank lock. So from my understanding nobody will take the rank
> > > > here.
> > > > 
> > > > However __vgic_get_target_vcpu has an ASSERT to check whether the rank
> > > > lock
> > > > has been taken. So who is taking lock for gic_update_one_lr now?
> > > 
> > > I should have removed the ASSERT - nobody is taking the rank lock now on
> > > the gic_update_one_lr code path.
> > 
> > Please either keep this ASSERT or update it. But not dropping it, we need to
> > prevent anyone calling this function without any lock taken at least in debug
> > build.
> > 
> > The current callers (see vgic_{disable,enable}_irqs) are calling the function
> > with rank lock taken.
> > 
> > So we need to keep this behavior at least for them.
> > 
> > > >  We make this safe, by placing modifications to
> > > > > rank->vcpu within regions protected by the vgic lock.
> > > > 
> > > > This look unsafe to me. The vgic lock is per-vcpu, but rank->vcpu could be
> > > > written/read by any vCPU. So you will never protect rank->vcpu with this
> > > > lock.
> > > > Did I miss anything?
> > > 
> > > The vgic lock is per-vcpu, but it is always the vgic lock of the old (old
> > > in the sense, before the interrupt is migrated) vcpu that is taken.
> > > 
> > > On one hand any vcpu can read/write to itargetsr. Those operations are
> > > protected by the rank lock. However vgic_migrate_irq and writes to
> > > rank->vcpu are protected also by the vgic lock of the old vcpu (first
> > > the rank lock is taken, then the vgic lock of the old vcpu).
> > 
> > I don't think this is true. Any vCPU read/write to itargetsr will be protected
> > by the vgic lock of the vCPU in rank->vcpu[offset].
> > 
> > For inflight IRQ, the migration will happen when the guest has EOIed the IRQ.
> > Until this is happening, the guest may have written multiple time in
> > ITARGETSR.
> > 
> > For instance, let say the guest is requesting to migrate the IRQ from vCPU A
> > to vCPU B. The vgic lock A will be taken in vgic_store_itargetsr, if the
> > interrupt is inflight vgic_migrate_irq will set a flag. Now the guest could
> > decide to migrate the interrupt to vCPU C before the interrupt has been EOIed.
> > In this case, vgic lock B will be taken.
> > 
> > So we have a mismatch between the lock taken in gic_update_one_lr and
> > vgic_migrate_irq.
> > 
> > I think the only safe way to fully protect rank->vcpu is taking the rank lock.
> > It will never change and be accessible from anywhere.
> 
> When I submitted this series, I considered this scenario, and I thought
> that the existing GIC_IRQ_GUEST_MIGRATING flag should be enough to
> protect it, because when GIC_IRQ_GUEST_MIGRATING is set,
> vgic_migrate_irq actually returns immediately, without doing anything.
> Either GIC_IRQ_GUEST_MIGRATING is not set, and it's the first irq
> migration, or GIC_IRQ_GUEST_MIGRATING is set, and vgic_migrate_irq just
> returns.
> 
> However, as I was writing this explanation to you in more details, I
> realized that it is not actually safe. Concurrent accesses to
> rank->vcpu, and concurrent calls to irq_set_affinity, can result in the
> physical affinity actually set to the intermediate pcpu rather than the
> final. As a matter of fact, one of these races affect the code today,
> even without this patch series.
> 
> I considered various possible ways to solve this problem, and the lock
> inversion problem at once. These are the potential solutions I came up
> with:
> 
> 1) We release the vgic lock at the end of gic_update_one_lr, before
> test_and_clear_bit(GIC_IRQ_GUEST_MIGRATING, &p->status). Then we take
> the rank lock. We don't need the per-vcpu vgic lock to execute the
> following code safely, the rank lock is enough:
> 
>             if ( test_and_clear_bit(GIC_IRQ_GUEST_MIGRATING, &p->status) )
>             {
>                 struct vcpu *v_target = __vgic_get_target_vcpu(v, irq);
>                 irq_set_affinity(p->desc, cpumask_of(v_target->processor));
>             }
> 
> This sounds simple in theory, but in practice it is very ugly. We would
> end up with a function (gic_update_one_lr) that is called with the vgic
> lock but actually releases the lock before returning.
> 
> 2) We run gic_update_one_lr and vgic_store_itargetsr in parallel safely
> and locklessly. There might be a way to do it, but it is not easy I
> haven't found it yet.
> 
> 3) Use the vgic lock on the old vcpu to protect read accesses to
> rank->vcpu. This is the same strategy as implemented in this series.
> However, to actually work correctly, we also have to store the original
> vcpu id in struct pending_irq. That way, we know for sure which vgic
> lock we need to take.
> 
> 
> Let me premise that I don't like any of these approaches very much. But
> I think that 1) is the least bad. If that's OK for you, I'll do that.

Sorry I meant that 3) is the least bad :-)
Julien Grall Dec. 19, 2016, 5:46 p.m. UTC | #6
Hi Stefano,

On 17/12/2016 02:13, Stefano Stabellini wrote:
> On Fri, 16 Dec 2016, Stefano Stabellini wrote:
>> On Fri, 16 Dec 2016, Julien Grall wrote:
>>> Hi Stefano,
>>>
>>> On 16/12/16 00:08, Stefano Stabellini wrote:
>>>> On Thu, 15 Dec 2016, Julien Grall wrote:
>>>>> On 15/12/2016 01:04, Stefano Stabellini wrote:
>>>>>> The locking order is: first rank lock, then vgic lock. The order is
>>>>>> respected everywhere, except for gic_update_one_lr.
>>>>>>
>>>>>> gic_update_one_lr is called with the vgic lock held, but it calls
>>>>>> vgic_get_target_vcpu, which tries to obtain the rank lock. This can
>>>>>> cause deadlocks.
>>>>>>
>>>>>> We already have a version of vgic_get_target_vcpu that doesn't take the
>>>>>> rank lock: __vgic_get_target_vcpu. Also the only routine that modify
>>>>>> the target vcpu are vgic_store_itargetsr and vgic_store_irouter. They
>>>>>> call vgic_migrate_irq, which already take the vgic lock.
>>>>>>
>>>>>> Solve the lock inversion problem, by not taking the rank lock in
>>>>>> gic_update_one_lr (calling __vgic_get_target_vcpu instead of
>>>>>> vgic_get_target_vcpu).
>>>>>
>>>>> If I look at the callers of gic_update_one_lr, the function gic_clear_lrs
>>>>> will
>>>>> not take the rank lock. So from my understanding nobody will take the rank
>>>>> here.
>>>>>
>>>>> However __vgic_get_target_vcpu has an ASSERT to check whether the rank
>>>>> lock
>>>>> has been taken. So who is taking lock for gic_update_one_lr now?
>>>>
>>>> I should have removed the ASSERT - nobody is taking the rank lock now on
>>>> the gic_update_one_lr code path.
>>>
>>> Please either keep this ASSERT or update it. But not dropping it, we need to
>>> prevent anyone calling this function without any lock taken at least in debug
>>> build.
>>>
>>> The current callers (see vgic_{disable,enable}_irqs) are calling the function
>>> with rank lock taken.
>>>
>>> So we need to keep this behavior at least for them.
>>>
>>>>>  We make this safe, by placing modifications to
>>>>>> rank->vcpu within regions protected by the vgic lock.
>>>>>
>>>>> This look unsafe to me. The vgic lock is per-vcpu, but rank->vcpu could be
>>>>> written/read by any vCPU. So you will never protect rank->vcpu with this
>>>>> lock.
>>>>> Did I miss anything?
>>>>
>>>> The vgic lock is per-vcpu, but it is always the vgic lock of the old (old
>>>> in the sense, before the interrupt is migrated) vcpu that is taken.
>>>>
>>>> On one hand any vcpu can read/write to itargetsr. Those operations are
>>>> protected by the rank lock. However vgic_migrate_irq and writes to
>>>> rank->vcpu are protected also by the vgic lock of the old vcpu (first
>>>> the rank lock is taken, then the vgic lock of the old vcpu).
>>>
>>> I don't think this is true. Any vCPU read/write to itargetsr will be protected
>>> by the vgic lock of the vCPU in rank->vcpu[offset].
>>>
>>> For inflight IRQ, the migration will happen when the guest has EOIed the IRQ.
>>> Until this is happening, the guest may have written multiple time in
>>> ITARGETSR.
>>>
>>> For instance, let say the guest is requesting to migrate the IRQ from vCPU A
>>> to vCPU B. The vgic lock A will be taken in vgic_store_itargetsr, if the
>>> interrupt is inflight vgic_migrate_irq will set a flag. Now the guest could
>>> decide to migrate the interrupt to vCPU C before the interrupt has been EOIed.
>>> In this case, vgic lock B will be taken.
>>>
>>> So we have a mismatch between the lock taken in gic_update_one_lr and
>>> vgic_migrate_irq.
>>>
>>> I think the only safe way to fully protect rank->vcpu is taking the rank lock.
>>> It will never change and be accessible from anywhere.
>>
>> When I submitted this series, I considered this scenario, and I thought
>> that the existing GIC_IRQ_GUEST_MIGRATING flag should be enough to
>> protect it, because when GIC_IRQ_GUEST_MIGRATING is set,
>> vgic_migrate_irq actually returns immediately, without doing anything.
>> Either GIC_IRQ_GUEST_MIGRATING is not set, and it's the first irq
>> migration, or GIC_IRQ_GUEST_MIGRATING is set, and vgic_migrate_irq just
>> returns.
>>
>> However, as I was writing this explanation to you in more details, I
>> realized that it is not actually safe. Concurrent accesses to
>> rank->vcpu, and concurrent calls to irq_set_affinity, can result in the
>> physical affinity actually set to the intermediate pcpu rather than the
>> final. As a matter of fact, one of these races affect the code today,
>> even without this patch series.

Because irq_set_affinity is not called under the same lock (e.g. the 
rank or vgic lock), right?

[...]

>> 2) We run gic_update_one_lr and vgic_store_itargetsr in parallel safely
>> and locklessly. There might be a way to do it, but it is not easy I
>> haven't found it yet.

Correct me if I am wrong. There is no restriction on writing to 
IROUTER/ITARGETSR while an IRQ is pending. So irq_set_affinity could 
be called once at the beginning of vgic_migrate_irq.

We may receive the interrupt on the wrong physical CPU at the beginning. 
But it would be the same when writing into IROUTER/ITARGETSR.

This would remove the need to get the rank lock in gic_update_one_lr.

Did I miss anything?

Cheers,
Stefano Stabellini Dec. 19, 2016, 10:30 p.m. UTC | #7
On Mon, 19 Dec 2016, Julien Grall wrote:
> Hi Stefano,
> 
> On 17/12/2016 02:13, Stefano Stabellini wrote:
> > On Fri, 16 Dec 2016, Stefano Stabellini wrote:
> > > On Fri, 16 Dec 2016, Julien Grall wrote:
> > > > Hi Stefano,
> > > > 
> > > > On 16/12/16 00:08, Stefano Stabellini wrote:
> > > > > On Thu, 15 Dec 2016, Julien Grall wrote:
> > > > > > On 15/12/2016 01:04, Stefano Stabellini wrote:
> > > > > > > The locking order is: first rank lock, then vgic lock. The order
> > > > > > > is
> > > > > > > respected everywhere, except for gic_update_one_lr.
> > > > > > > 
> > > > > > > gic_update_one_lr is called with the vgic lock held, but it calls
> > > > > > > vgic_get_target_vcpu, which tries to obtain the rank lock. This
> > > > > > > can
> > > > > > > cause deadlocks.
> > > > > > > 
> > > > > > > We already have a version of vgic_get_target_vcpu that doesn't
> > > > > > > take the
> > > > > > > rank lock: __vgic_get_target_vcpu. Also the only routine that
> > > > > > > modify
> > > > > > > the target vcpu are vgic_store_itargetsr and vgic_store_irouter.
> > > > > > > They
> > > > > > > call vgic_migrate_irq, which already take the vgic lock.
> > > > > > > 
> > > > > > > Solve the lock inversion problem, by not taking the rank lock in
> > > > > > > gic_update_one_lr (calling __vgic_get_target_vcpu instead of
> > > > > > > vgic_get_target_vcpu).
> > > > > > 
> > > > > > If I look at the callers of gic_update_one_lr, the function
> > > > > > gic_clear_lrs
> > > > > > will
> > > > > > not take the rank lock. So from my understanding nobody will take
> > > > > > the rank
> > > > > > here.
> > > > > > 
> > > > > > However __vgic_get_target_vcpu has an ASSERT to check whether the
> > > > > > rank
> > > > > > lock
> > > > > > has been taken. So who is taking lock for gic_update_one_lr now?
> > > > > 
> > > > > I should have removed the ASSERT - nobody is taking the rank lock now
> > > > > on
> > > > > the gic_update_one_lr code path.
> > > > 
> > > > Please either keep this ASSERT or update it. But not dropping it, we
> > > > need to
> > > > prevent anyone calling this function without any lock taken at least in
> > > > debug
> > > > build.
> > > > 
> > > > The current callers (see vgic_{disable,enable}_irqs) are calling the
> > > > function
> > > > with rank lock taken.
> > > > 
> > > > So we need to keep this behavior at least for them.
> > > > 
> > > > > >  We make this safe, by placing modifications to
> > > > > > > rank->vcpu within regions protected by the vgic lock.
> > > > > > 
> > > > > > This look unsafe to me. The vgic lock is per-vcpu, but rank->vcpu
> > > > > > could be
> > > > > > written/read by any vCPU. So you will never protect rank->vcpu with
> > > > > > this
> > > > > > lock.
> > > > > > Did I miss anything?
> > > > > 
> > > > > The vgic lock is per-vcpu, but it is always the vgic lock of the old
> > > > > (old
> > > > > in the sense, before the interrupt is migrated) vcpu that is taken.
> > > > > 
> > > > > On one hand any vcpu can read/write to itargetsr. Those operations are
> > > > > protected by the rank lock. However vgic_migrate_irq and writes to
> > > > > rank->vcpu are protected also by the vgic lock of the old vcpu (first
> > > > > the rank lock is taken, then the vgic lock of the old vcpu).
> > > > 
> > > > I don't think this is true. Any vCPU read/write to itargetsr will be
> > > > protected
> > > > by the vgic lock of the vCPU in rank->vcpu[offset].
> > > > 
> > > > For inflight IRQ, the migration will happen when the guest has EOIed the
> > > > IRQ.
> > > > Until this is happening, the guest may have written multiple time in
> > > > ITARGETSR.
> > > > 
> > > > For instance, let say the guest is requesting to migrate the IRQ from
> > > > vCPU A
> > > > to vCPU B. The vgic lock A will be taken in vgic_store_itargetsr, if the
> > > > interrupt is inflight vgic_migrate_irq will set a flag. Now the guest
> > > > could
> > > > decide to migrate the interrupt to vCPU C before the interrupt has been
> > > > EOIed.
> > > > In this case, vgic lock B will be taken.
> > > > 
> > > > So we have a mismatch between the lock taken in gic_update_one_lr and
> > > > vgic_migrate_irq.
> > > > 
> > > > I think the only safe way to fully protect rank->vcpu is taking the rank
> > > > lock.
> > > > It will never change and be accessible from anywhere.
> > > 
> > > When I submitted this series, I considered this scenario, and I thought
> > > that the existing GIC_IRQ_GUEST_MIGRATING flag should be enough to
> > > protect it, because when GIC_IRQ_GUEST_MIGRATING is set,
> > > vgic_migrate_irq actually returns immediately, without doing anything.
> > > Either GIC_IRQ_GUEST_MIGRATING is not set, and it's the first irq
> > > migration, or GIC_IRQ_GUEST_MIGRATING is set, and vgic_migrate_irq just
> > > returns.
> > > 
> > > However, as I was writing this explanation to you in more details, I
> > > realized that it is not actually safe. Concurrent accesses to
> > > rank->vcpu, and concurrent calls to irq_set_affinity, can result in the
> > > physical affinity actually set to the intermediate pcpu rather than the
> > > final. As a matter of fact, one of these races affect the code today,
> > > even without this patch series.
> 
> Because irq_set_affinity is not called protected by the same lock (e.g rank or
> vgic), right?

Yes. This is an example of failure that affects the code as it is today:

  CPU0                                  CPU1
  <GIC_IRQ_GUEST_MIGRATING is set>      <GIC_IRQ_GUEST_MIGRATING is set>
  ----------------------------------------------------------------------
                                        vgic_migrate_irq (does nothing)
  clear GIC_IRQ_GUEST_MIGRATING
  read rank->vcpu (intermediate)
                                        set rank->vcpu (final)
  irq_set_affinity (intermediate)


Result: the irq physical affinity is the one of the intermediate vcpu,
not the final vcpu.



> [...]
> 
> > > 2) We run gic_update_one_lr and vgic_store_itargetsr in parallel safely
> > > and locklessly. There might be a way to do it, but it is not easy I
> > > haven't found it yet.
> 
> Correct me if I am wrong. There are no restriction to write into
> IROUTER/ITARGETSR while an IRQ is pending. So the irq_set_affinity could be
> called once at the beginning of vgic_irq_migrate.
> 
> We may receive the interrupt on the wrong physical CPU at the beginning. But
> it would be the same when writing into IROUTER/ITARGETSR.
> 
> This would remove the need to get the rank lock in gic_update_one_lr.
> 
> Did I miss anything?

I am not sure if we can do that: the virtual interrupt might not be
EOI'd yet at that time. The guest EOI is used to deactivate the
corresponding physical interrupt. Migrating the physical IRQ at that
time could have unintended consequences. I am not sure what the spec
says about this, but even if it is correct on paper, I would prefer not
to put it to the test: I bet that not many interrupt controllers have
been heavily tested in this scenario. What do you think?

Thinking outside the box, another way to solve this problem is to reject
any interrupt affinity changes while an interrupt migration is still in
progress. In fact, that scenario is very uncommon. As far as I know
operating systems deactivate interrupts before migrating them.

Otherwise, I think it is very reasonable to store the vcpu id (or the
pcpu id) in struct pending_irq: we already store the lr register number
there. The current code can tell in which lr register an interrupt has
been written, but it cannot tell to which cpu that lr register belongs.
It's a paradox. Once we know the vcpu id for any inflight irqs, we can
make sure to take the right vcpu.vgic lock from vgic_migrate_irq.
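
A rough sketch of what storing the vcpu id in struct pending_irq could look
like; the vcpu_id field and its use below are hypothetical additions for
illustration, not something in the current tree:

    /* Hypothetical extra field in struct pending_irq, next to the LR
     * number that is already recorded there. */
    struct pending_irq {
        /* ... existing fields ... */
        uint8_t lr;        /* LR register used, as today */
        uint8_t vcpu_id;   /* hypothetical: vcpu on which the irq is inflight */
        /* ... existing fields ... */
    };

    /* In vgic_migrate_irq (sketch): for an inflight interrupt, take the
     * vgic lock of the vcpu recorded at injection time, so it always
     * matches the lock that gic_update_one_lr will hold on EOI. */
    old = d->vcpu[p->vcpu_id];
    spin_lock_irqsave(&old->arch.vgic.lock, flags);
    /* ... set GIC_IRQ_GUEST_MIGRATING, update rank->vcpu, etc. ... */
    spin_unlock_irqrestore(&old->arch.vgic.lock, flags);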
Julien Grall Dec. 19, 2016, 10:56 p.m. UTC | #8
Hi Stefano,

On 19/12/2016 23:30, Stefano Stabellini wrote:
> On Mon, 19 Dec 2016, Julien Grall wrote:
>>>> 2) We run gic_update_one_lr and vgic_store_itargetsr in parallel safely
>>>> and locklessly. There might be a way to do it, but it is not easy I
>>>> haven't found it yet.
>>
>> Correct me if I am wrong. There are no restriction to write into
>> IROUTER/ITARGETSR while an IRQ is pending. So the irq_set_affinity could be
>> called once at the beginning of vgic_irq_migrate.
>>
>> We may receive the interrupt on the wrong physical CPU at the beginning. But
>> it would be the same when writing into IROUTER/ITARGETSR.
>>
>> This would remove the need to get the rank lock in gic_update_one_lr.
>>
>> Did I miss anything?
>
> I am not sure if we can do that: the virtual interrupt might not be
> EOI'd yet at that time. The guest EOI is used to deactivate the
> corresponding physical interrupt. Migrating the physical IRQ at that
> time, could have unintended consequences? I am not sure what the spec
> says about this, but even if it is correct on paper, I would prefer not
> to put it to the test: I bet that not many interrupt controllers have
> been heavily tested in this scenario. What do you think?

I don't think this is an issue. An interrupt could have its priority 
dropped on pCPU A and be deactivated on pCPU B if vCPU A has been 
migrated while the interrupt is inflight. So if it is fine here, why 
would it not be when the guest is specifically requesting the routing?

>
> Thinking outside the box, another way to solve this problem is to reject
> any interrupt affinity changes while an interrupt migration is still in
> progress. In fact, that scenario is very uncommon. As far as I know
> operating systems deactivate interrupts before migrating them.

I would rather avoid differing from the specification, even if it looks 
sensible for a guest to only migrate the interrupt whilst it is not active.

>
> Otherwise, I think it is very reasonable to store the vcpu id (or the
> pcpu id) in struct pending_irq: we already store the lr register number
> there. The current code can tell in which lr register an interrupt has
> been written to, but it cannot tell to which cpu the lr register belongs
> to. It's a paradox. Once we know the vcpu id for any inflight irqs, then
> we can make sure to take the right vcpu.vgic lock from vgic_migrate_irq.

This is a good point. I was concerned about the size of pending_irq (we 
have to allocate it per-IRQ) but it looks like we have some padding in 
the structure between priority and inflight.

Cheers,
Stefano Stabellini Dec. 19, 2016, 11:22 p.m. UTC | #9
On Mon, 19 Dec 2016, Julien Grall wrote:
> Hi Stefano,
> 
> On 19/12/2016 23:30, Stefano Stabellini wrote:
> > On Mon, 19 Dec 2016, Julien Grall wrote:
> > > > > 2) We run gic_update_one_lr and vgic_store_itargetsr in parallel
> > > > > safely
> > > > > and locklessly. There might be a way to do it, but it is not easy I
> > > > > haven't found it yet.
> > > 
> > > Correct me if I am wrong. There are no restriction to write into
> > > IROUTER/ITARGETSR while an IRQ is pending. So the irq_set_affinity could
> > > be
> > > called once at the beginning of vgic_irq_migrate.
> > > 
> > > We may receive the interrupt on the wrong physical CPU at the beginning.
> > > But
> > > it would be the same when writing into IROUTER/ITARGETSR.
> > > 
> > > This would remove the need to get the rank lock in gic_update_one_lr.
> > > 
> > > Did I miss anything?
> > 
> > I am not sure if we can do that: the virtual interrupt might not be
> > EOI'd yet at that time. The guest EOI is used to deactivate the
> > corresponding physical interrupt. Migrating the physical IRQ at that
> > time, could have unintended consequences? I am not sure what the spec
> > says about this, but even if it is correct on paper, I would prefer not
> > to put it to the test: I bet that not many interrupt controllers have
> > been heavily tested in this scenario. What do you think?
> 
> I don't think this is an issue. An interrupt could have the priority drop
> happening on pCPU A and be deactivated on pCPU B if the vCPU A is been
> migrated when the interrupt is inflight. So if it is fine here, why would not
> it be when the guest is specifically requesting the routing?

That is true, but this is not exactly the same. This is changing the
physical irq affinity while both the physical and virtual irqs are still
active. As I wrote, usually operating systems only change affinity after
deactivating an irq, so I thought it would be wise in Xen to wait at
least for the EOI. If we tried to inject the same virtual interrupt on a
different vcpu while the interrupt is still inflight, we could get into
trouble. But we could avoid that by testing for GIC_IRQ_GUEST_MIGRATING
in vgic_vcpu_inject_irq. Maybe I am worrying about nothing.
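
A minimal sketch of the check being suggested; the exact placement inside
vgic_vcpu_inject_irq and the early-return handling are assumptions made
purely for illustration:

    /* In vgic_vcpu_inject_irq (sketch), with v->arch.vgic.lock held and
     * n = irq_to_pending(v, virq): do not make the interrupt pending on
     * a second vcpu while a migration of it is still in progress. */
    if ( test_bit(GIC_IRQ_GUEST_MIGRATING, &n->status) )
    {
        /* Hypothetical handling: leave it alone; it will be re-evaluated
         * once gic_update_one_lr clears the flag after the guest EOI. */
        spin_unlock_irqrestore(&v->arch.vgic.lock, flags);
        return;
    }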


> > Otherwise, I think it is very reasonable to store the vcpu id (or the
> > pcpu id) in struct pending_irq: we already store the lr register number
> > there. The current code can tell in which lr register an interrupt has
> > been written to, but it cannot tell to which cpu the lr register belongs
> > to. It's a paradox. Once we know the vcpu id for any inflight irqs, then
> > we can make sure to take the right vcpu.vgic lock from vgic_migrate_irq.
> 
> This is a good point. I was concerned about the size of pending_irq (we have
> to allocate it per-IRQ) but it looks like we have some padding in the
> structure between priority and inflight.

That's right, that's exactly what I intended to use. This solution comes
at no cost, but we would have to remember that it's the vcpu vgic lock
that protects reads of rank->vcpu rather than the rank lock.
Julien Grall Dec. 20, 2016, 11:28 a.m. UTC | #10
Hi Stefano,

On 20/12/2016 00:22, Stefano Stabellini wrote:
> On Mon, 19 Dec 2016, Julien Grall wrote:
>> Hi Stefano,
>>
>> On 19/12/2016 23:30, Stefano Stabellini wrote:
>>> On Mon, 19 Dec 2016, Julien Grall wrote:
>>>>>> 2) We run gic_update_one_lr and vgic_store_itargetsr in parallel
>>>>>> safely
>>>>>> and locklessly. There might be a way to do it, but it is not easy I
>>>>>> haven't found it yet.
>>>>
>>>> Correct me if I am wrong. There are no restriction to write into
>>>> IROUTER/ITARGETSR while an IRQ is pending. So the irq_set_affinity could
>>>> be
>>>> called once at the beginning of vgic_irq_migrate.
>>>>
>>>> We may receive the interrupt on the wrong physical CPU at the beginning.
>>>> But
>>>> it would be the same when writing into IROUTER/ITARGETSR.
>>>>
>>>> This would remove the need to get the rank lock in gic_update_one_lr.
>>>>
>>>> Did I miss anything?
>>>
>>> I am not sure if we can do that: the virtual interrupt might not be
>>> EOI'd yet at that time. The guest EOI is used to deactivate the
>>> corresponding physical interrupt. Migrating the physical IRQ at that
>>> time, could have unintended consequences? I am not sure what the spec
>>> says about this, but even if it is correct on paper, I would prefer not
>>> to put it to the test: I bet that not many interrupt controllers have
>>> been heavily tested in this scenario. What do you think?
>>
>> I don't think this is an issue. An interrupt could have the priority drop
>> happening on pCPU A and be deactivated on pCPU B if the vCPU A is been
>> migrated when the interrupt is inflight. So if it is fine here, why would not
>> it be when the guest is specifically requesting the routing?
>
> That is true, but this is not exactly the same.

My example was to show you that an IRQ can have its priority dropped on 
pCPU A and be deactivated on pCPU B. Another example is when only the 
IRQ is being migrated. The spec does not promise that you will receive 
the next interrupt on the CPU you asked for, because it may take time to 
update the GIC state. So the priority drop and deactivation could be 
done on separate physical CPUs here too.

 > This is changing the
> physical irq affinity while both physical and virtual irqs are still
> active.

Physical IRQ state and virtual IRQ state are completely dissociated in 
the GIC. The only possible interaction is the virtual interface sending 
a deactivate request to the distributor when the virtual interrupt has 
been deactivated and corresponds to a hardware interrupt.

> As I wrote, usually operating systems only change affinity after
> deactivating an irq, so I thought it would be wise in Xen to wait at
> least for the EOI.

I looked at the Linux code and did not see such a requirement when 
setting the affinity (see irq_set_affinity) of an IRQ.

> If we tried to inject the same virtual interrupt on a
> different vcpu, while the interrupt is still inflight, we could get in
> troubles. But we could avoid that by testing for GIC_IRQ_GUEST_MIGRATING
> in vgic_vcpu_inject_irq. Maybe I am worrying about nothing.

The only interrupts that can be routed to a guest in Xen are SPIs, which 
are shared between all CPUs. The active bit is handled by the 
distributor. It is not possible to receive the same SPI again until it 
has been deactivated.

Cheers,
Stefano Stabellini Dec. 20, 2016, 7:38 p.m. UTC | #11
On Tue, 20 Dec 2016, Julien Grall wrote:
> Hi Stefano,
> 
> On 20/12/2016 00:22, Stefano Stabellini wrote:
> > On Mon, 19 Dec 2016, Julien Grall wrote:
> > > Hi Stefano,
> > > 
> > > On 19/12/2016 23:30, Stefano Stabellini wrote:
> > > > On Mon, 19 Dec 2016, Julien Grall wrote:
> > > > > > > 2) We run gic_update_one_lr and vgic_store_itargetsr in parallel
> > > > > > > safely
> > > > > > > and locklessly. There might be a way to do it, but it is not easy
> > > > > > > I
> > > > > > > haven't found it yet.
> > > > > 
> > > > > Correct me if I am wrong. There are no restriction to write into
> > > > > IROUTER/ITARGETSR while an IRQ is pending. So the irq_set_affinity
> > > > > could
> > > > > be
> > > > > called once at the beginning of vgic_irq_migrate.
> > > > > 
> > > > > We may receive the interrupt on the wrong physical CPU at the
> > > > > beginning.
> > > > > But
> > > > > it would be the same when writing into IROUTER/ITARGETSR.
> > > > > 
> > > > > This would remove the need to get the rank lock in gic_update_one_lr.
> > > > > 
> > > > > Did I miss anything?
> > > > 
> > > > I am not sure if we can do that: the virtual interrupt might not be
> > > > EOI'd yet at that time. The guest EOI is used to deactivate the
> > > > corresponding physical interrupt. Migrating the physical IRQ at that
> > > > time, could have unintended consequences? I am not sure what the spec
> > > > says about this, but even if it is correct on paper, I would prefer not
> > > > to put it to the test: I bet that not many interrupt controllers have
> > > > been heavily tested in this scenario. What do you think?
> > > 
> > > I don't think this is an issue. An interrupt could have the priority drop
> > > happening on pCPU A and be deactivated on pCPU B if the vCPU A is been
> > > migrated when the interrupt is inflight. So if it is fine here, why would
> > > not
> > > it be when the guest is specifically requesting the routing?
> > 
> > That is true, but this is not exactly the same.
> 
> My example was to show you that an IRQ can have its priority dropped in pCPU A
> and been deactivated to pCPU B. Another example is when only the IRQ is been
> migrated. The spec does not promise you to receive the next interrupt on the
> CPU you asked because it may take time to update the GIC state. So the
> priority drop and deactivation could be done on separate physical CPU here
> too.
> 
> > This is changing the
> > physical irq affinity while both physical and virtual irqs are still
> > active.
> 
> Physical IRQ state and virtual IRQ state are completely dissociated in the
> GIC. The only interaction possible is the virtual interface to send a
> deactivate request to the distributor when the virtual interrupt has been
> deactivated and correspond to a hardware interrupt.
> 
> > As I wrote, usually operating systems only change affinity after
> > deactivating an irq, so I thought it would be wise in Xen to wait at
> > least for the EOI.
> 
> I looked at the Linux code and did not see a such requirement when setting the
> affinity (see irq_set_affinity) of an IRQ.
> 
> > If we tried to inject the same virtual interrupt on a
> > different vcpu, while the interrupt is still inflight, we could get in
> > troubles. But we could avoid that by testing for GIC_IRQ_GUEST_MIGRATING
> > in vgic_vcpu_inject_irq. Maybe I am worrying about nothing.
> 
> The only interrupt that can be routed to a guest in Xen are SPI which are
> shared between all CPUs. The active bit is handled by the distributor. It is
> not possible to receive the same SPI until it has been deactivated.

True, but keep in mind that we don't get any interrupt when the vcpu
issues an EOI. We lazily clean the data structures the first time we get
back into Xen. So there is a window where the interrupt has already been
EOI'ed on the first vcpu, but struct pending_irq still shows the
interrupt as inflight and would mess up today's checks in
vgic_vcpu_inject_irq on other vcpus. Also, they wouldn't be protected by
the right vgic lock either. Maybe this is the real reason why I didn't
go for this route originally. Sorry I didn't bring this up earlier; the
irq migration stuff is extremely difficult to get right.
Stefano Stabellini Dec. 22, 2016, 2:12 a.m. UTC | #12
On Tue, 20 Dec 2016, Stefano Stabellini wrote:
> On Tue, 20 Dec 2016, Julien Grall wrote:
> > Hi Stefano,
> > 
> > On 20/12/2016 00:22, Stefano Stabellini wrote:
> > > On Mon, 19 Dec 2016, Julien Grall wrote:
> > > > Hi Stefano,
> > > > 
> > > > On 19/12/2016 23:30, Stefano Stabellini wrote:
> > > > > On Mon, 19 Dec 2016, Julien Grall wrote:
> > > > > > > > 2) We run gic_update_one_lr and vgic_store_itargetsr in parallel
> > > > > > > > safely
> > > > > > > > and locklessly. There might be a way to do it, but it is not easy
> > > > > > > > I
> > > > > > > > haven't found it yet.
> > > > > > 
> > > > > > Correct me if I am wrong. There are no restriction to write into
> > > > > > IROUTER/ITARGETSR while an IRQ is pending. So the irq_set_affinity
> > > > > > could
> > > > > > be
> > > > > > called once at the beginning of vgic_irq_migrate.
> > > > > > 
> > > > > > We may receive the interrupt on the wrong physical CPU at the
> > > > > > beginning.
> > > > > > But
> > > > > > it would be the same when writing into IROUTER/ITARGETSR.
> > > > > > 
> > > > > > This would remove the need to get the rank lock in gic_update_one_lr.
> > > > > > 
> > > > > > Did I miss anything?
> > > > > 
> > > > > I am not sure if we can do that: the virtual interrupt might not be
> > > > > EOI'd yet at that time. The guest EOI is used to deactivate the
> > > > > corresponding physical interrupt. Migrating the physical IRQ at that
> > > > > time, could have unintended consequences? I am not sure what the spec
> > > > > says about this, but even if it is correct on paper, I would prefer not
> > > > > to put it to the test: I bet that not many interrupt controllers have
> > > > > been heavily tested in this scenario. What do you think?
> > > > 
> > > > I don't think this is an issue. An interrupt could have the priority drop
> > > > happening on pCPU A and be deactivated on pCPU B if the vCPU A is been
> > > > migrated when the interrupt is inflight. So if it is fine here, why would
> > > > not
> > > > it be when the guest is specifically requesting the routing?
> > > 
> > > That is true, but this is not exactly the same.
> > 
> > My example was to show you that an IRQ can have its priority dropped in pCPU A
> > and been deactivated to pCPU B. Another example is when only the IRQ is been
> > migrated. The spec does not promise you to receive the next interrupt on the
> > CPU you asked because it may take time to update the GIC state. So the
> > priority drop and deactivation could be done on separate physical CPU here
> > too.
> > 
> > > This is changing the
> > > physical irq affinity while both physical and virtual irqs are still
> > > active.
> > 
> > Physical IRQ state and virtual IRQ state are completely dissociated in the
> > GIC. The only interaction possible is the virtual interface to send a
> > deactivate request to the distributor when the virtual interrupt has been
> > deactivated and correspond to a hardware interrupt.
> > 
> > > As I wrote, usually operating systems only change affinity after
> > > deactivating an irq, so I thought it would be wise in Xen to wait at
> > > least for the EOI.
> > 
> > I looked at the Linux code and did not see a such requirement when setting the
> > affinity (see irq_set_affinity) of an IRQ.
> > 
> > > If we tried to inject the same virtual interrupt on a
> > > different vcpu, while the interrupt is still inflight, we could get in
> > > troubles. But we could avoid that by testing for GIC_IRQ_GUEST_MIGRATING
> > > in vgic_vcpu_inject_irq. Maybe I am worrying about nothing.
> > 
> > The only interrupt that can be routed to a guest in Xen are SPI which are
> > shared between all CPUs. The active bit is handled by the distributor. It is
> > not possible to receive the same SPI until it has been deactivated.
> 
> True, but keep in mind that we don't get any interruptions when the vcpu
> issues an EOI. We lazily clean the data structures the first time we get
> back to Xen. So there is a window, where the interrupt has already been
> EOI'ed on the first vcpu, but struct pending_irq still shows the
> interrupt as inflight and would mess up today's checks in
> vgic_vcpu_inject_irq on other vcpus. Also they wouldn't be protected by
> the right vgic lock either. Maybe this is the real reason why I didn't
> go for this route originally. Sorry I didn't bring this up earlier, the
> irq migration stuff is extremely difficult to get right.

I couldn't figure out a good way to solve the problem going down this
route. Thus, I wrote a small patch series to solve the lock inversion
problem using the vgic lock technique (same as the previous patch).

However, if you can come up with a better approach, I'll be happy to
rewrite my patches.
Julien Grall Dec. 28, 2016, 3:48 p.m. UTC | #13
Hi Stefano,

On 20/12/16 19:38, Stefano Stabellini wrote:
> On Tue, 20 Dec 2016, Julien Grall wrote:
>> On 20/12/2016 00:22, Stefano Stabellini wrote:
>>> On Mon, 19 Dec 2016, Julien Grall wrote:
>>>> On 19/12/2016 23:30, Stefano Stabellini wrote:
>>>>> On Mon, 19 Dec 2016, Julien Grall wrote:
>>>>>>>> 2) We run gic_update_one_lr and vgic_store_itargetsr in parallel
>>>>>>>> safely
>>>>>>>> and locklessly. There might be a way to do it, but it is not easy
>>>>>>>> I
>>>>>>>> haven't found it yet.
>>>>>>
>>>>>> Correct me if I am wrong. There are no restriction to write into
>>>>>> IROUTER/ITARGETSR while an IRQ is pending. So the irq_set_affinity
>>>>>> could
>>>>>> be
>>>>>> called once at the beginning of vgic_irq_migrate.
>>>>>>
>>>>>> We may receive the interrupt on the wrong physical CPU at the
>>>>>> beginning.
>>>>>> But
>>>>>> it would be the same when writing into IROUTER/ITARGETSR.
>>>>>>
>>>>>> This would remove the need to get the rank lock in gic_update_one_lr.
>>>>>>
>>>>>> Did I miss anything?
>>>>>
>>>>> I am not sure if we can do that: the virtual interrupt might not be
>>>>> EOI'd yet at that time. The guest EOI is used to deactivate the
>>>>> corresponding physical interrupt. Migrating the physical IRQ at that
>>>>> time, could have unintended consequences? I am not sure what the spec
>>>>> says about this, but even if it is correct on paper, I would prefer not
>>>>> to put it to the test: I bet that not many interrupt controllers have
>>>>> been heavily tested in this scenario. What do you think?
>>>>
>>>> I don't think this is an issue. An interrupt could have the priority drop
>>>> happening on pCPU A and be deactivated on pCPU B if the vCPU A is been
>>>> migrated when the interrupt is inflight. So if it is fine here, why would
>>>> not
>>>> it be when the guest is specifically requesting the routing?
>>>
>>> That is true, but this is not exactly the same.
>>
>> My example was to show you that an IRQ can have its priority dropped in pCPU A
>> and been deactivated to pCPU B. Another example is when only the IRQ is been
>> migrated. The spec does not promise you to receive the next interrupt on the
>> CPU you asked because it may take time to update the GIC state. So the
>> priority drop and deactivation could be done on separate physical CPU here
>> too.
>>
>>> This is changing the
>>> physical irq affinity while both physical and virtual irqs are still
>>> active.
>>
>> Physical IRQ state and virtual IRQ state are completely dissociated in the
>> GIC. The only interaction possible is the virtual interface to send a
>> deactivate request to the distributor when the virtual interrupt has been
>> deactivated and correspond to a hardware interrupt.
>>
>>> As I wrote, usually operating systems only change affinity after
>>> deactivating an irq, so I thought it would be wise in Xen to wait at
>>> least for the EOI.
>>
>> I looked at the Linux code and did not see a such requirement when setting the
>> affinity (see irq_set_affinity) of an IRQ.
>>
>>> If we tried to inject the same virtual interrupt on a
>>> different vcpu, while the interrupt is still inflight, we could get in
>>> troubles. But we could avoid that by testing for GIC_IRQ_GUEST_MIGRATING
>>> in vgic_vcpu_inject_irq. Maybe I am worrying about nothing.
>>
>> The only interrupt that can be routed to a guest in Xen are SPI which are
>> shared between all CPUs. The active bit is handled by the distributor. It is
>> not possible to receive the same SPI until it has been deactivated.
>
> True, but keep in mind that we don't get any interruptions when the vcpu
> issues an EOI. We lazily clean the data structures the first time we get
> back to Xen. So there is a window, where the interrupt has already been
> EOI'ed on the first vcpu, but struct pending_irq still shows the
> interrupt as inflight and would mess up today's checks in
> vgic_vcpu_inject_irq on other vcpus.

I am aware that LRs are cleaned lazily. However, you continue to assume 
that the GIC will fire the next IRQ on the right pCPU as soon as ITARGETSR 
or IROUTER is written.

In the case of GICv2, there is no system barrier (dsb sy) in 
gicv2_irq_set_affinity. So nothing ensures that the GIC has received the 
write when the function returns, because it takes time for the information 
to propagate.

So no matter which lock you try to take, there will be a window where the 
IRQ is received on the wrong physical CPU.
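
As an illustration only (the mapping below is made up; this is not the real 
gicv2 driver code, just the shape of the write involved):

    /* Made-up MMIO mapping of the distributor's ITARGETSR registers. */
    extern volatile uint8_t *itargetsr;

    static void example_set_affinity(unsigned int irq, uint8_t cpu_mask)
    {
        /* Posted byte write selecting the target CPU interfaces for this SPI. */
        itargetsr[irq] = cpu_mask;

        /*
         * No dsb(sy) follows, and even with one the distributor still needs
         * time before the next interrupt actually lands on the new pCPU.
         */
    }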

That's why the status field in pending_irq is accessible without any lock. 
gic_update_one_lr works only on p->status, so it is fine here.

The only place where it might be an issue is list_empty(&n->inflight) in 
vgic_vcpu_inject_irq. But I think it is perfectly fine, because if the IRQ 
has not been received on the correct vCPU, it will be taken care of later, 
for instance when the vCPU is kicked (see the end of vgic_vcpu_inject_irq).
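
To illustrate why, here is a stripped-down sketch of that path (not the 
real vgic_vcpu_inject_irq, which also deals with priorities, the lr_queue 
and the currently-running vCPU; only the existing helpers and fields are 
used, the function name is made up):

    void sketch_vgic_inject(struct vcpu *v, unsigned int virq)
    {
        struct pending_irq *n = irq_to_pending(v, virq);
        unsigned long flags;

        /* The vgic lock only guards the list manipulation. */
        spin_lock_irqsave(&v->arch.vgic.lock, flags);
        if ( list_empty(&n->inflight) )
            list_add_tail(&n->inflight, &v->arch.vgic.inflight_irqs);
        spin_unlock_irqrestore(&v->arch.vgic.lock, flags);

        /*
         * Even if the IRQ arrived on the wrong pCPU, kicking the target
         * vCPU makes sure it will pick the pending IRQ up on entry.
         */
        vcpu_kick(v);
    }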

> Also they wouldn't be protected by
> the right vgic lock either. Maybe this is the real reason why I didn't
> go for this route originally. Sorry I didn't bring this up earlier; the
> irq migration stuff is extremely difficult to get right.

I don't think it is difficult to get the core right. The problem is that 
there is no documentation of this code, so we have to figure out why it 
was written that way in the first place.

All this code was meant to be lockless, thanks to p->status and the *_bit 
helpers. The vgic lock taken in vgic_vcpu_inject_irq protects 
inflight_irqs, not the rest.
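
As a sketch of that split (the function names are made up; the helpers and 
fields are the existing ones):

    /* Lockless: atomic bit operations on p->status, safe from any pCPU. */
    static void sketch_mark_migrating(struct pending_irq *p)
    {
        set_bit(GIC_IRQ_GUEST_MIGRATING, &p->status);
    }

    /* Locked: the per-vCPU vgic lock is what protects the inflight list. */
    static void sketch_remove_inflight(struct vcpu *v, struct pending_irq *p)
    {
        unsigned long flags;

        spin_lock_irqsave(&v->arch.vgic.lock, flags);
        list_del_init(&p->inflight);
        spin_unlock_irqrestore(&v->arch.vgic.lock, flags);
    }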

Regards,

Patch

diff --git a/xen/arch/arm/gic.c b/xen/arch/arm/gic.c
index a5348f2..3483192 100644
--- a/xen/arch/arm/gic.c
+++ b/xen/arch/arm/gic.c
@@ -506,7 +506,7 @@  static void gic_update_one_lr(struct vcpu *v, int i)
             list_del_init(&p->inflight);
             if ( test_and_clear_bit(GIC_IRQ_GUEST_MIGRATING, &p->status) )
             {
-                struct vcpu *v_target = vgic_get_target_vcpu(v, irq);
+                struct vcpu *v_target = __vgic_get_target_vcpu(v, irq);
                 irq_set_affinity(p->desc, cpumask_of(v_target->processor));
             }
         }
diff --git a/xen/arch/arm/vgic-v2.c b/xen/arch/arm/vgic-v2.c
index 3dbcfe8..666ebdb 100644
--- a/xen/arch/arm/vgic-v2.c
+++ b/xen/arch/arm/vgic-v2.c
@@ -95,6 +95,7 @@  static void vgic_store_itargetsr(struct domain *d, struct vgic_irq_rank *rank,
 {
     unsigned int i;
     unsigned int virq;
+    unsigned long flags;
 
     ASSERT(spin_is_locked(&rank->lock));
 
@@ -116,6 +117,7 @@  static void vgic_store_itargetsr(struct domain *d, struct vgic_irq_rank *rank,
     {
         unsigned int new_target, old_target;
         uint8_t new_mask;
+        struct vcpu *old;
 
         /*
          * Don't need to mask as we rely on new_mask to fit for only one
@@ -153,16 +155,18 @@  static void vgic_store_itargetsr(struct domain *d, struct vgic_irq_rank *rank,
         new_target--;
 
         old_target = rank->vcpu[offset];
+        old = d->vcpu[old_target];
 
+        spin_lock_irqsave(&old->arch.vgic.lock, flags);
         /* Only migrate the vIRQ if the target vCPU has changed */
         if ( new_target != old_target )
         {
-            vgic_migrate_irq(d->vcpu[old_target],
+            vgic_migrate_irq(old,
                              d->vcpu[new_target],
                              virq);
         }
-
         rank->vcpu[offset] = new_target;
+        spin_unlock_irqrestore(&old->arch.vgic.lock, flags);
     }
 }
 
diff --git a/xen/arch/arm/vgic-v3.c b/xen/arch/arm/vgic-v3.c
index d61479d..880c643 100644
--- a/xen/arch/arm/vgic-v3.c
+++ b/xen/arch/arm/vgic-v3.c
@@ -122,6 +122,7 @@  static void vgic_store_irouter(struct domain *d, struct vgic_irq_rank *rank,
 {
     struct vcpu *new_vcpu, *old_vcpu;
     unsigned int virq;
+    unsigned long flags;
 
     /* There is 1 vIRQ per IROUTER */
     virq = offset / NR_BYTES_PER_IROUTER;
@@ -150,11 +151,13 @@  static void vgic_store_irouter(struct domain *d, struct vgic_irq_rank *rank,
     if ( !new_vcpu )
         return;
 
+    spin_lock_irqsave(&old_vcpu->arch.vgic.lock, flags);
     /* Only migrate the IRQ if the target vCPU has changed */
     if ( new_vcpu != old_vcpu )
         vgic_migrate_irq(old_vcpu, new_vcpu, virq);
 
     rank->vcpu[offset] = new_vcpu->vcpu_id;
+    spin_unlock_irqrestore(&old_vcpu->arch.vgic.lock, flags);
 }
 
 static inline bool vgic_reg64_check_access(struct hsr_dabt dabt)
diff --git a/xen/arch/arm/vgic.c b/xen/arch/arm/vgic.c
index 364d5f0..4f7971f 100644
--- a/xen/arch/arm/vgic.c
+++ b/xen/arch/arm/vgic.c
@@ -219,7 +219,7 @@  int vcpu_vgic_free(struct vcpu *v)
 }
 
 /* The function should be called by rank lock taken. */
-static struct vcpu *__vgic_get_target_vcpu(struct vcpu *v, unsigned int virq)
+struct vcpu *__vgic_get_target_vcpu(struct vcpu *v, unsigned int virq)
 {
     struct vgic_irq_rank *rank = vgic_rank_irq(v, virq);
 
@@ -257,9 +257,11 @@  static int vgic_get_virq_priority(struct vcpu *v, unsigned int virq)
 
 void vgic_migrate_irq(struct vcpu *old, struct vcpu *new, unsigned int irq)
 {
-    unsigned long flags;
     struct pending_irq *p = irq_to_pending(old, irq);
 
+    ASSERT(spin_is_locked(&old->arch.vgic.lock));
+    ASSERT(!local_irq_is_enabled());
+
     /* nothing to do for virtual interrupts */
     if ( p->desc == NULL )
         return;
@@ -270,12 +272,9 @@  void vgic_migrate_irq(struct vcpu *old, struct vcpu *new, unsigned int irq)
 
     perfc_incr(vgic_irq_migrates);
 
-    spin_lock_irqsave(&old->arch.vgic.lock, flags);
-
     if ( list_empty(&p->inflight) )
     {
         irq_set_affinity(p->desc, cpumask_of(new->processor));
-        spin_unlock_irqrestore(&old->arch.vgic.lock, flags);
         return;
     }
     /* If the IRQ is still lr_pending, re-inject it to the new vcpu */
@@ -285,7 +284,6 @@  void vgic_migrate_irq(struct vcpu *old, struct vcpu *new, unsigned int irq)
         list_del_init(&p->lr_queue);
         list_del_init(&p->inflight);
         irq_set_affinity(p->desc, cpumask_of(new->processor));
-        spin_unlock_irqrestore(&old->arch.vgic.lock, flags);
         vgic_vcpu_inject_irq(new, irq);
         return;
     }
@@ -293,8 +291,6 @@  void vgic_migrate_irq(struct vcpu *old, struct vcpu *new, unsigned int irq)
      * and wait for the EOI */
     if ( !list_empty(&p->inflight) )
         set_bit(GIC_IRQ_GUEST_MIGRATING, &p->status);
-
-    spin_unlock_irqrestore(&old->arch.vgic.lock, flags);
 }
 
 void arch_move_irqs(struct vcpu *v)
diff --git a/xen/include/asm-arm/vgic.h b/xen/include/asm-arm/vgic.h
index 672f649..c911b33 100644
--- a/xen/include/asm-arm/vgic.h
+++ b/xen/include/asm-arm/vgic.h
@@ -293,6 +293,7 @@  extern int domain_vgic_init(struct domain *d, unsigned int nr_spis);
 extern void domain_vgic_free(struct domain *d);
 extern int vcpu_vgic_init(struct vcpu *v);
 extern struct vcpu *vgic_get_target_vcpu(struct vcpu *v, unsigned int virq);
+extern struct vcpu *__vgic_get_target_vcpu(struct vcpu *v, unsigned int virq);
 extern void vgic_vcpu_inject_irq(struct vcpu *v, unsigned int virq);
 extern void vgic_vcpu_inject_spi(struct domain *d, unsigned int virq);
 extern void vgic_clear_pending_irqs(struct vcpu *v);