[RFC] pass-through: sync pir to irr after msix vector been updated

Message ID 70457d4e-068f-0160-532b-e00dd295b3b1@oracle.com
State New, archived

Commit Message

Joe Jin Sept. 12, 2019, 6:03 p.m. UTC
With the below test case, the guest kernel reported "No irq handler for vector":
  1). Pass through a mlx IB VF to 2 PVHVM guests.
  2). Start rds-stress between the 2 guests.
  3). Scale down both guests' vCPUs from 32 to 6 at the same time.

After repeating the above test for several iterations, the guest kernel reported
"No irq handler for vector", and IB traffic dropped to zero due to the lost interrupt.

When a vCPU goes offline, the guest kernel disables local IRQs, migrates IRQs
to other CPUs, updates the MSI-X table, and re-enables IRQs. If a new interrupt
arrives after local IRQs are disabled but before the MSI-X table is updated,
the interrupt still uses the old vector and destination CPU info, and when
local IRQs are enabled again it is delivered to the wrong CPU and vector.

Syncing PIR to IRR after the MSI-X table has been updated appears to help
with this issue.

BTW, I could not reproduce this issue with apicv disabled.

Signed-off-by: Joe Jin <joe.jin@oracle.com>
---
 xen/drivers/passthrough/io.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Jan Beulich Sept. 13, 2019, 7:14 a.m. UTC | #1
On 12.09.2019 20:03, Joe Jin wrote:
> With below testcase, guest kernel reported "No irq handler for vector":
>   1). Passthrough mlx ib VF to 2 pvhvm guests.
>   2). Start rds-stress between 2 guests.
>   3). Scale down 2 guests vcpu from 32 to 6 at the same time.
> 
> Repeat above test several iteration, guest kernel reported "No irq handler
> for vector", and IB traffic downed to zero which caused by interrupt lost.
> 
> When vcpu offline, kernel disabled local IRQ, migrate IRQ to other cpu,
> update MSI-X table, enable IRQ. If any new interrupt arrived after
> local IRQ disabled also before MSI-X table been updated, interrupt still 
> used old vector and dest cpu info, and when local IRQ enabled again, 
> interrupt been sent to wrong cpu and vector.
> 
> Looks sync PIR to IRR after MSI-X been updated is help for this issue.

I'm having trouble making the connection, which quite possibly simply
means the description needs to be further extended: Sync-ing PIR to
IRR has nothing to do with a vector change. It would help if nothing
else caused this bitmap propagation, and an interrupt was lost (rather
than delivered through the wrong vector, or to the wrong CPU).
Furthermore with vector and destination being coupled, after a CPU has
been offlined this would generally mean
- if there was just a single destination permitted, lack of delivery
  altogether,
- if there were multiple destinations permitted, delivery to one of
  the other CPUs, at which point the vector would still be valid.

An interesting aspect would be on which CPU the log message was
observed, and how this correlates with the destination sets of the
CPUs that have got offlined. From there it would then further be
interesting to understand why the interrupt made it to that CPU,
since - as said - destination and vector get changed together, and
hence with things going wrong it would be of interest to know whether
the CPU receiving the IRQ is within the new destination set, or some
(random?) other one.

> BTW, I could not reproduced this issue if I disabled apicv.

Which, I agree, is a fair hint of something APIC-V-specific to be
amiss, but which (due to the vastly different timing) isn't a
reliable indicator.

> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -412,6 +412,9 @@ int pt_irq_create_bind(
>                  pirq_dpci->gmsi.gvec = pt_irq_bind->u.msi.gvec;
>                  pirq_dpci->gmsi.gflags = gflags;
>              }
> +
> +            if ( hvm_funcs.sync_pir_to_irr )
> +                hvm_funcs.sync_pir_to_irr(d->vcpu[pirq_dpci->gmsi.dest_vcpu_id]);

If the need for this change can be properly explained, then it
still wants converting to alternative_vcall(), like the other
caller of this hook. Or perhaps even better move vlapic.c's
wrapper (suitably renamed) into hvm.h, and use it here.

Additionally, the code setting pirq_dpci->gmsi.dest_vcpu_id
(right after your code insertion) allows for the field to be
invalid, which I think you need to guard against.

Also, just to remind you: Please follow patch submission rules:
They get sent _to_ the list, with maintainers etc _cc_-ed.

Jan
Roger Pau Monné Sept. 13, 2019, 10:33 a.m. UTC | #2
On Thu, Sep 12, 2019 at 11:03:14AM -0700, Joe Jin wrote:
> With below testcase, guest kernel reported "No irq handler for vector":
>   1). Passthrough mlx ib VF to 2 pvhvm guests.
>   2). Start rds-stress between 2 guests.
>   3). Scale down 2 guests vcpu from 32 to 6 at the same time.
> 
> Repeat above test several iteration, guest kernel reported "No irq handler
> for vector", and IB traffic downed to zero which caused by interrupt lost.
> 
> When vcpu offline, kernel disabled local IRQ, migrate IRQ to other cpu,
> update MSI-X table, enable IRQ. If any new interrupt arrived after
> local IRQ disabled also before MSI-X table been updated, interrupt still 
> used old vector and dest cpu info, and when local IRQ enabled again, 
> interrupt been sent to wrong cpu and vector.

Yes, but that's something Linux should be able to handle; according
to your description there's a window where interrupts can be delivered
to the old CPU, but that's expected.

> 
> Looks sync PIR to IRR after MSI-X been updated is help for this issue.

AFAICT the sync that you do still uses the old vcpu id, as
pirq_dpci->gmsi.dest_vcpu_id gets updated a little bit below. I'm
unsure why this helps; I would expect the sync between pir and irr
to happen anyway.

Maybe you need to force such syncing so that no stale pir vector gets
injected later on when the guest assumes the new MSI address has been
successfully written, and no more interrupts should appear on the
previous vCPU?

PIR to IRR sync happens on vmentry, so if the old vCPU doesn't take a
vmentry and drain any pending pir vectors at more or less the same
time as the new MSI address gets written, it's possible for such pir
vectors to get injected well past the update of the MSI fields.

> BTW, I could not reproduced this issue if I disabled apicv.

IIRC Linux won't route interrupts from non-pv devices over event
channels when apicv is in use.

Thanks, Roger.
Joe Jin Sept. 13, 2019, 4:38 p.m. UTC | #3
Hi Jan,

Thanks for your reply, see my reply in line please.

On 9/13/19 12:14 AM, Jan Beulich wrote:
> On 12.09.2019 20:03, Joe Jin wrote:
>> With below testcase, guest kernel reported "No irq handler for vector":
>>   1). Passthrough mlx ib VF to 2 pvhvm guests.
>>   2). Start rds-stress between 2 guests.
>>   3). Scale down 2 guests vcpu from 32 to 6 at the same time.
>>
>> Repeat above test several iteration, guest kernel reported "No irq handler
>> for vector", and IB traffic downed to zero which caused by interrupt lost.
>>
>> When vcpu offline, kernel disabled local IRQ, migrate IRQ to other cpu,
>> update MSI-X table, enable IRQ. If any new interrupt arrived after
>> local IRQ disabled also before MSI-X table been updated, interrupt still 
>> used old vector and dest cpu info, and when local IRQ enabled again, 
>> interrupt been sent to wrong cpu and vector.
>>
>> Looks sync PIR to IRR after MSI-X been updated is help for this issue.
> 
> I'm having trouble making the connection, which quite possibly simply
> means the description needs to be further extended: Sync-ing PIR to
> IRR has nothing to do with a vector change. It would help if nothing
> else caused this bitmap propagation, and an interrupt was lost (rather
> than delivered through the wrong vector, or to the wrong CPU).
> Furthermore with vector and destination being coupled, after a CPU has
> been offlined this would generally mean
> - if there was just a single destination permitted, lack of delivery
>   altogether,
> - if there were multiple destinations permitted, delivery to one of
>   the other CPUs, at which point the vector would still be valid.

When a CPU goes offline in the guest kernel, the kernel only migrates
IRQs whose affinity is set to that CPU alone; for IRQs with multiple
destinations, the kernel does not do the migration, which would include
updating the MSI-X table with the new destination and vector.

After IRQ migration, the kernel checks each vector's IRR; if the APIC
IRR bit is set, it retriggers the IRQ to the new destination. This is
intended to avoid losing any interrupt.

But on Xen, after the MSI-X table is updated, Xen never checks for a
pending IRQ and notifies the guest kernel.

> 
> An interesting aspect would be on which CPU the log message was
> observed, and how this correlates with the destination sets of the
> CPUs that have got offlined. From there it would then further be
> interesting to understand why the interrupt made it to that CPU,
> since - as said - destination and vector get changed together, and
> hence with things going wrong it would be of interest to know whether
> the CPU receiving the IRQ is within the new destination set, or some
> (random?) other one.

irq_retrigger() is called after the kernel updates the vector;
irq_retrigger() sends any pending IRQ(s) to the new destination.

Here is the kernel log from when the issue happened. The guest kernel
is 4.1; on a 4.14 guest it's almost the same, except there is no
"(irq -1)" due to kernel changes; the IRQ migration workflow is the
same (fixup_irqs()):

Sep 12 20:26:46 localhost kernel: smpboot: CPU 17 is now offline
Sep 12 20:26:46 localhost kernel: smpboot: CPU 18 is now offline
Sep 12 20:26:46 localhost kernel: smpboot: CPU 19 is now offline
Sep 12 20:26:47 localhost kernel: Broke affinity for irq 251
Sep 12 20:26:47 localhost kernel: do_IRQ: 20.178 No irq handler for vector (irq -1)
Sep 12 20:26:47 localhost kernel: smpboot: CPU 20 is now offline
Sep 12 20:26:47 localhost kernel: smpboot: CPU 21 is now offline

From the above, you can see the IRQ was sent to cpu 20, which was
going offline.

The IRQ arrived at that CPU immediately when IRQs were re-enabled;
after the CPU was offlined, the kernel printed "smpboot: CPU 20 is
now offline".

The call path in the kernel is as below:
cpu_down()
  |-> cpu_down_maps_locked()
  |     _cpu_down
  |       |-> __stop_machine
  |             |-> stop_cpus()
  |                   |->__stop_cpus()
  |                        |- queue_stop_cpus_work() ---+
  |-> __cpu_die()                                       |
         |-> pr_info("CPU %u is now offline\n", cpu);   |
     +--------------------------------------------------+
     |
     +
multi_cpu_stop()
  |-> local_save_flags
  |-> take_cpu_down()
  |      |-> __cpu_disable
  |            |-> smp_ops.cpu_disable = xen_cpu_disable
  |                  |-> cpu_disable_common
  |                        |-> fixup_irqs <== IRQ migration.
  |-> local_irq_restore()

> 
>> BTW, I could not reproduced this issue if I disabled apicv.
> 
> Which, I agree, is a fair hint of something APIC-V-specific to be
> amiss, but which (due to the vastly different timing) isn't a
> reliable indicator.

It seems that with apicv enabled vs. disabled, a different code path is taken.
> 
>> --- a/xen/drivers/passthrough/io.c
>> +++ b/xen/drivers/passthrough/io.c
>> @@ -412,6 +412,9 @@ int pt_irq_create_bind(
>>                  pirq_dpci->gmsi.gvec = pt_irq_bind->u.msi.gvec;
>>                  pirq_dpci->gmsi.gflags = gflags;
>>              }
>> +
>> +            if ( hvm_funcs.sync_pir_to_irr )
>> +                hvm_funcs.sync_pir_to_irr(d->vcpu[pirq_dpci->gmsi.dest_vcpu_id]);
> 
> If the need for this change can be properly explained, then it
> still wants converting to alternative_vcall() - the the other
> caller of this hook. Or perhaps even better move vlapic.c's
> wrapper (suitably renamed) into hvm.h, and use it here.

Yes, I agree. I'm not 100% sure, which is why I marked it as RFC.

> 
> Additionally, the code setting pirq_dpci->gmsi.dest_vcpu_id
> (right after your code insertion) allows for the field to be
> invalid, which I think you need to guard against.

I think you mean multiple destinations, in which case it's -1?

> 
> Also, just to remind you: Please follow patch submission rules:
> They get sent _to_ the list, with maintainers etc _cc_-ed.

Got it, thanks,
Joe

> 
> Jan
>
Joe Jin Sept. 13, 2019, 4:50 p.m. UTC | #4
On 9/13/19 3:33 AM, Roger Pau Monné wrote:
> On Thu, Sep 12, 2019 at 11:03:14AM -0700, Joe Jin wrote:
>> With below testcase, guest kernel reported "No irq handler for vector":
>>   1). Passthrough mlx ib VF to 2 pvhvm guests.
>>   2). Start rds-stress between 2 guests.
>>   3). Scale down 2 guests vcpu from 32 to 6 at the same time.
>>
>> Repeat above test several iteration, guest kernel reported "No irq handler
>> for vector", and IB traffic downed to zero which caused by interrupt lost.
>>
>> When vcpu offline, kernel disabled local IRQ, migrate IRQ to other cpu,
>> update MSI-X table, enable IRQ. If any new interrupt arrived after
>> local IRQ disabled also before MSI-X table been updated, interrupt still 
>> used old vector and dest cpu info, and when local IRQ enabled again, 
>> interrupt been sent to wrong cpu and vector.
> 
> Yes, but that's something Linux shoulkd be able to handle, according
> to your description there's a window where interrupts can be delivered
> to the old CPU, but that's something expected.

Actually, the kernel checks the APIC IRR when doing the migration; if
there is any pending IRQ it retriggers that IRQ to the new destination,
but Xen does not set the bit.

> 
>>
>> Looks sync PIR to IRR after MSI-X been updated is help for this issue.
> 
> AFAICT the sync that you do is still using the old vcpu id, as
> pirq_dpci->gmsi.dest_vcpu_id gets updated a little bit below. I'm
> unsure about why does this help, I would expect the sync between pir
> and irr to happen anyway, and hence I'm not sure why is this helping.

As per my update above, the IRQ is retriggered by the old cpu, so Xen
needs to set IRR for the old cpu, not the new one.

Thanks,
Joe
Jan Beulich Sept. 16, 2019, 8:01 a.m. UTC | #5
On 13.09.2019 18:38, Joe Jin wrote:
> Hi Jan,
> 
> Thanks for your reply, see my reply in line please.
> 
> On 9/13/19 12:14 AM, Jan Beulich wrote:
>> On 12.09.2019 20:03, Joe Jin wrote:
>>> With below testcase, guest kernel reported "No irq handler for vector":
>>>   1). Passthrough mlx ib VF to 2 pvhvm guests.
>>>   2). Start rds-stress between 2 guests.
>>>   3). Scale down 2 guests vcpu from 32 to 6 at the same time.
>>>
>>> Repeat above test several iteration, guest kernel reported "No irq handler
>>> for vector", and IB traffic downed to zero which caused by interrupt lost.
>>>
>>> When vcpu offline, kernel disabled local IRQ, migrate IRQ to other cpu,
>>> update MSI-X table, enable IRQ. If any new interrupt arrived after
>>> local IRQ disabled also before MSI-X table been updated, interrupt still 
>>> used old vector and dest cpu info, and when local IRQ enabled again, 
>>> interrupt been sent to wrong cpu and vector.
>>>
>>> Looks sync PIR to IRR after MSI-X been updated is help for this issue.
>>
>> I'm having trouble making the connection, which quite possibly simply
>> means the description needs to be further extended: Sync-ing PIR to
>> IRR has nothing to do with a vector change. It would help if nothing
>> else caused this bitmap propagation, and an interrupt was lost (rather
>> than delivered through the wrong vector, or to the wrong CPU).
>> Furthermore with vector and destination being coupled, after a CPU has
>> been offlined this would generally mean
>> - if there was just a single destination permitted, lack of delivery
>>   altogether,
>> - if there were multiple destinations permitted, delivery to one of
>>   the other CPUs, at which point the vector would still be valid.
> 
> When cpu offline on guest kernel, it only migrates IRQs which affinity set
> to the cpu only, if multiple destinations, kernel does not do migration
> which included update msi-x table with new destination also vector.
> 
> After IRQ migration, kernel will check all vector's IRR, if APIC IRR
> been set, retrigger the IRQ to new destination. This intend to avoid
> to lost any interrupt.
> 
> But on Xen, after msi-x table updated, it never tried to check and notify
> guest kernel there was pending IRQ.
> 
>>
>> An interesting aspect would be on which CPU the log message was
>> observed, and how this correlates with the destination sets of the
>> CPUs that have got offlined. From there it would then further be
>> interesting to understand why the interrupt made it to that CPU,
>> since - as said - destination and vector get changed together, and
>> hence with things going wrong it would be of interest to know whether
>> the CPU receiving the IRQ is within the new destination set, or some
>> (random?) other one.
> 
> irq_retrigger() been called after kernel updated vector, irq_retrigger()
> will send pending IRQ(s) to new destination.
> 
> Here are kernel log when issue happened, guest kernel is 4.1, on 4.14
> guest, it's almost same, but no "(irq -1)" for kernel changes, IRQ
> migrations workflow are same(fixup_irqs()):
> 
> Sep 12 20:26:46 localhost kernel: smpboot: CPU 17 is now offline
> Sep 12 20:26:46 localhost kernel: smpboot: CPU 18 is now offline
> Sep 12 20:26:46 localhost kernel: smpboot: CPU 19 is now offline
> Sep 12 20:26:47 localhost kernel: Broke affinity for irq 251
> Sep 12 20:26:47 localhost kernel: do_IRQ: 20.178 No irq handler for vector (irq -1)
> Sep 12 20:26:47 localhost kernel: smpboot: CPU 20 is now offline
> Sep 12 20:26:47 localhost kernel: smpboot: CPU 21 is now offline
> 
> From above, you can see IRQ sent to cpu 20, which is offlining.
> 
> IRQ arrived to the cpu immediately when IRQ enabled, after CPU offlined,
> it prints log "smpboot: CPU 20 is now offline".
> 
> Call path in kernel as below:
> cpu_down()
>   |-> cpu_down_maps_locked()
>   |     _cpu_down
>   |       |-> __stop_machine
>   |             |-> stop_cpus()
>   |                   |->__stop_cpus()
>   |                        |- queue_stop_cpus_work() ---+
>   |-> __cpu_die()                                       |
>          |-> pr_info("CPU %u is now offline\n", cpu);   |
>      +--------------------------------------------------+
>      |
>      +
> multi_cpu_stop()
>   |-> local_save_flags
>   |-> take_cpu_down()
>   |      |-> __cpu_disable
>   |            |-> smp_ops.cpu_disable = xen_cpu_disable
>   |                  |-> cpu_disable_common
>   |                        |-> fixup_irqs <== IRQ migration.
>   |-> local_irq_restore()

Ah yes, this makes sense. You want to extend the description to
reflect some of the further explanation above.

>>> --- a/xen/drivers/passthrough/io.c
>>> +++ b/xen/drivers/passthrough/io.c
>>> @@ -412,6 +412,9 @@ int pt_irq_create_bind(
>>>                  pirq_dpci->gmsi.gvec = pt_irq_bind->u.msi.gvec;
>>>                  pirq_dpci->gmsi.gflags = gflags;
>>>              }
>>> +
>>> +            if ( hvm_funcs.sync_pir_to_irr )
>>> +                hvm_funcs.sync_pir_to_irr(d->vcpu[pirq_dpci->gmsi.dest_vcpu_id]);
>>
>> If the need for this change can be properly explained, then it
>> still wants converting to alternative_vcall() - the the other
>> caller of this hook. Or perhaps even better move vlapic.c's
>> wrapper (suitably renamed) into hvm.h, and use it here.
> 
> Yes I agree, I'm not 100% sure, so I set it to RFC.

And btw, please also attach a brief comment here, to clarify
why the syncing is needed precisely at this point.

>> Additionally, the code setting pirq_dpci->gmsi.dest_vcpu_id
>> (right after your code insertion) allows for the field to be
>> invalid, which I think you need to guard against.
> 
> I think you means multiple destination, then it's -1?

The reasons why it might be -1 are irrelevant here, I think.
You need to handle the case both to avoid an out-of-bounds
array access and to make sure an IRR bit wouldn't still get
propagated too late in some special case.

Also - what about the respective other path in the function,
dealing with PT_IRQ_TYPE_PCI and PT_IRQ_TYPE_MSI_TRANSLATE? It
seems to me that there's the same chance of deferring IRR
propagation for too long?

Jan
Joe Jin Sept. 16, 2019, 10:20 p.m. UTC | #6
On 9/16/19 1:01 AM, Jan Beulich wrote:
> On 13.09.2019 18:38, Joe Jin wrote:
>> Hi Jan,
>>
>> Thanks for your reply, see my reply in line please.
>>
>> On 9/13/19 12:14 AM, Jan Beulich wrote:
>>> On 12.09.2019 20:03, Joe Jin wrote:
>>>> With below testcase, guest kernel reported "No irq handler for vector":
>>>>   1). Passthrough mlx ib VF to 2 pvhvm guests.
>>>>   2). Start rds-stress between 2 guests.
>>>>   3). Scale down 2 guests vcpu from 32 to 6 at the same time.
>>>>
>>>> Repeat above test several iteration, guest kernel reported "No irq handler
>>>> for vector", and IB traffic downed to zero which caused by interrupt lost.
>>>>
>>>> When vcpu offline, kernel disabled local IRQ, migrate IRQ to other cpu,
>>>> update MSI-X table, enable IRQ. If any new interrupt arrived after
>>>> local IRQ disabled also before MSI-X table been updated, interrupt still 
>>>> used old vector and dest cpu info, and when local IRQ enabled again, 
>>>> interrupt been sent to wrong cpu and vector.
>>>>
>>>> Looks sync PIR to IRR after MSI-X been updated is help for this issue.
>>>
>>> I'm having trouble making the connection, which quite possibly simply
>>> means the description needs to be further extended: Sync-ing PIR to
>>> IRR has nothing to do with a vector change. It would help if nothing
>>> else caused this bitmap propagation, and an interrupt was lost (rather
>>> than delivered through the wrong vector, or to the wrong CPU).
>>> Furthermore with vector and destination being coupled, after a CPU has
>>> been offlined this would generally mean
>>> - if there was just a single destination permitted, lack of delivery
>>>   altogether,
>>> - if there were multiple destinations permitted, delivery to one of
>>>   the other CPUs, at which point the vector would still be valid.
>>
>> When cpu offline on guest kernel, it only migrates IRQs which affinity set
>> to the cpu only, if multiple destinations, kernel does not do migration
>> which included update msi-x table with new destination also vector.
>>
>> After IRQ migration, kernel will check all vector's IRR, if APIC IRR
>> been set, retrigger the IRQ to new destination. This intend to avoid
>> to lost any interrupt.
>>
>> But on Xen, after msi-x table updated, it never tried to check and notify
>> guest kernel there was pending IRQ.
>>
>>>
>>> An interesting aspect would be on which CPU the log message was
>>> observed, and how this correlates with the destination sets of the
>>> CPUs that have got offlined. From there it would then further be
>>> interesting to understand why the interrupt made it to that CPU,
>>> since - as said - destination and vector get changed together, and
>>> hence with things going wrong it would be of interest to know whether
>>> the CPU receiving the IRQ is within the new destination set, or some
>>> (random?) other one.
>>
>> irq_retrigger() been called after kernel updated vector, irq_retrigger()
>> will send pending IRQ(s) to new destination.
>>
>> Here are kernel log when issue happened, guest kernel is 4.1, on 4.14
>> guest, it's almost same, but no "(irq -1)" for kernel changes, IRQ
>> migrations workflow are same(fixup_irqs()):
>>
>> Sep 12 20:26:46 localhost kernel: smpboot: CPU 17 is now offline
>> Sep 12 20:26:46 localhost kernel: smpboot: CPU 18 is now offline
>> Sep 12 20:26:46 localhost kernel: smpboot: CPU 19 is now offline
>> Sep 12 20:26:47 localhost kernel: Broke affinity for irq 251
>> Sep 12 20:26:47 localhost kernel: do_IRQ: 20.178 No irq handler for vector (irq -1)
>> Sep 12 20:26:47 localhost kernel: smpboot: CPU 20 is now offline
>> Sep 12 20:26:47 localhost kernel: smpboot: CPU 21 is now offline
>>
>> From above, you can see IRQ sent to cpu 20, which is offlining.
>>
>> IRQ arrived to the cpu immediately when IRQ enabled, after CPU offlined,
>> it prints log "smpboot: CPU 20 is now offline".
>>
>> Call path in kernel as below:
>> cpu_down()
>>   |-> cpu_down_maps_locked()
>>   |     _cpu_down
>>   |       |-> __stop_machine
>>   |             |-> stop_cpus()
>>   |                   |->__stop_cpus()
>>   |                        |- queue_stop_cpus_work() ---+
>>   |-> __cpu_die()                                       |
>>          |-> pr_info("CPU %u is now offline\n", cpu);   |
>>      +--------------------------------------------------+
>>      |
>>      +
>> multi_cpu_stop()
>>   |-> local_save_flags
>>   |-> take_cpu_down()
>>   |      |-> __cpu_disable
>>   |            |-> smp_ops.cpu_disable = xen_cpu_disable
>>   |                  |-> cpu_disable_common
>>   |                        |-> fixup_irqs <== IRQ migration.
>>   |-> local_irq_restore()
> 
> Ah yes, this makes sense. You want to extend the description to
> reflect some of the further explanation above.

I will add more explanation later.

> 
>>>> --- a/xen/drivers/passthrough/io.c
>>>> +++ b/xen/drivers/passthrough/io.c
>>>> @@ -412,6 +412,9 @@ int pt_irq_create_bind(
>>>>                  pirq_dpci->gmsi.gvec = pt_irq_bind->u.msi.gvec;
>>>>                  pirq_dpci->gmsi.gflags = gflags;
>>>>              }
>>>> +
>>>> +            if ( hvm_funcs.sync_pir_to_irr )
>>>> +                hvm_funcs.sync_pir_to_irr(d->vcpu[pirq_dpci->gmsi.dest_vcpu_id]);
>>>
>>> If the need for this change can be properly explained, then it
>>> still wants converting to alternative_vcall() - the the other
>>> caller of this hook. Or perhaps even better move vlapic.c's
>>> wrapper (suitably renamed) into hvm.h, and use it here.
>>
>> Yes I agree, I'm not 100% sure, so I set it to RFC.
> 
> And btw, please also attach a brief comment here, to clarify
> why the syncing is needed precisely at this point.
> 
>>> Additionally, the code setting pirq_dpci->gmsi.dest_vcpu_id
>>> (right after your code insertion) allows for the field to be
>>> invalid, which I think you need to guard against.
>>
>> I think you means multiple destination, then it's -1?
> 
> The reason for why it might be -1 are irrelevant here, I think.
> You need to handle the case both to avoid an out-of-bounds
> array access and to make sure an IRR bit wouldn't still get
> propagated too late in some special case.

Add the following checks?
            if ( dest_vcpu_id >= 0 && dest_vcpu_id < d->max_vcpus &&
                 d->vcpu[dest_vcpu_id]->runstate.state <= RUNSTATE_blocked )

> 
> Also - what about the respective other path in the function,
> dealing with PT_IRQ_TYPE_PCI and PT_IRQ_TYPE_MSI_TRANSLATE? It
> seems to me that there's the same chance of deferring IRR
> propagation for too long?

This is possible. Can you please help with how to get which vcpu is
associated with the IRQ? I did not find any such helper in current Xen.

Thanks,
Joe
Jan Beulich Sept. 17, 2019, 6:48 a.m. UTC | #7
On 17.09.2019 00:20, Joe Jin wrote:
> On 9/16/19 1:01 AM, Jan Beulich wrote:
>> On 13.09.2019 18:38, Joe Jin wrote:
>>> On 9/13/19 12:14 AM, Jan Beulich wrote:
>>>> On 12.09.2019 20:03, Joe Jin wrote:
>>>>> --- a/xen/drivers/passthrough/io.c
>>>>> +++ b/xen/drivers/passthrough/io.c
>>>>> @@ -412,6 +412,9 @@ int pt_irq_create_bind(
>>>>>                  pirq_dpci->gmsi.gvec = pt_irq_bind->u.msi.gvec;
>>>>>                  pirq_dpci->gmsi.gflags = gflags;
>>>>>              }
>>>>> +
>>>>> +            if ( hvm_funcs.sync_pir_to_irr )
>>>>> +                hvm_funcs.sync_pir_to_irr(d->vcpu[pirq_dpci->gmsi.dest_vcpu_id]);
>>>>
>>>> If the need for this change can be properly explained, then it
>>>> still wants converting to alternative_vcall() - the the other
>>>> caller of this hook. Or perhaps even better move vlapic.c's
>>>> wrapper (suitably renamed) into hvm.h, and use it here.
>>>
>>> Yes I agree, I'm not 100% sure, so I set it to RFC.
>>
>> And btw, please also attach a brief comment here, to clarify
>> why the syncing is needed precisely at this point.
>>
>>>> Additionally, the code setting pirq_dpci->gmsi.dest_vcpu_id
>>>> (right after your code insertion) allows for the field to be
>>>> invalid, which I think you need to guard against.
>>>
>>> I think you means multiple destination, then it's -1?
>>
>> The reason for why it might be -1 are irrelevant here, I think.
>> You need to handle the case both to avoid an out-of-bounds
>> array access and to make sure an IRR bit wouldn't still get
>> propagated too late in some special case.
> 
> Add following checks?
>             if ( dest_vcpu_id >= 0 && dest_vcpu_id < d->max_vcpus &&
>                  d->vcpu[dest_vcpu_id]->runstate.state <= RUNSTATE_blocked )

Just the >= part should suffice; without an explanation I don't
see why you want the runstate check (which after all is racy
anyway afaict).

>> Also - what about the respective other path in the function,
>> dealing with PT_IRQ_TYPE_PCI and PT_IRQ_TYPE_MSI_TRANSLATE? It
>> seems to me that there's the same chance of deferring IRR
>> propagation for too long?
> 
> This is possible, can you please help on how to get which vcpu associate the IRQ?
> I did not found any helper on current Xen.

There's no such helper, I'm afraid. Looking at hvm_migrate_pirq()
and hvm_girq_dest_2_vcpu_id() I notice that the former does nothing
if pirq_dpci->gmsi.posted is set. Hence pirq_dpci->gmsi.dest_vcpu_id
isn't really used in this case (please double check), and so you may
want to update the field alongside setting pirq_dpci->gmsi.posted in
pt_irq_create_bind(), covering the multi destination case.

Your code addition still visible in context above may then want to
be further conditionalized upon iommu_intpost or (perhaps better)
pirq_dpci->gmsi.posted being set.

Jan
Joe Jin Sept. 18, 2019, 9:16 p.m. UTC | #8
On 9/16/19 11:48 PM, Jan Beulich wrote:
> On 17.09.2019 00:20, Joe Jin wrote:
>> On 9/16/19 1:01 AM, Jan Beulich wrote:
>>> On 13.09.2019 18:38, Joe Jin wrote:
>>>> On 9/13/19 12:14 AM, Jan Beulich wrote:
>>>>> On 12.09.2019 20:03, Joe Jin wrote:
>>>>>> --- a/xen/drivers/passthrough/io.c
>>>>>> +++ b/xen/drivers/passthrough/io.c
>>>>>> @@ -412,6 +412,9 @@ int pt_irq_create_bind(
>>>>>>                  pirq_dpci->gmsi.gvec = pt_irq_bind->u.msi.gvec;
>>>>>>                  pirq_dpci->gmsi.gflags = gflags;
>>>>>>              }
>>>>>> +
>>>>>> +            if ( hvm_funcs.sync_pir_to_irr )
>>>>>> +                hvm_funcs.sync_pir_to_irr(d->vcpu[pirq_dpci->gmsi.dest_vcpu_id]);
>>>>>
>>>>> If the need for this change can be properly explained, then it
>>>>> still wants converting to alternative_vcall() - the the other
>>>>> caller of this hook. Or perhaps even better move vlapic.c's
>>>>> wrapper (suitably renamed) into hvm.h, and use it here.
>>>>
>>>> Yes I agree, I'm not 100% sure, so I set it to RFC.
>>>
>>> And btw, please also attach a brief comment here, to clarify
>>> why the syncing is needed precisely at this point.
>>>
>>>>> Additionally, the code setting pirq_dpci->gmsi.dest_vcpu_id
>>>>> (right after your code insertion) allows for the field to be
>>>>> invalid, which I think you need to guard against.
>>>>
>>>> I think you means multiple destination, then it's -1?
>>>
>>> The reason for why it might be -1 are irrelevant here, I think.
>>> You need to handle the case both to avoid an out-of-bounds
>>> array access and to make sure an IRR bit wouldn't still get
>>> propagated too late in some special case.
>>
>> Add following checks?
>>             if ( dest_vcpu_id >= 0 && dest_vcpu_id < d->max_vcpus &&
>>                  d->vcpu[dest_vcpu_id]->runstate.state <= RUNSTATE_blocked )
> 
> Just the >= part should suffice; without an explanation I don't
> see why you want the runstate check (which after all is racy
> anyway afaict).
> 
>>> Also - what about the respective other path in the function,
>>> dealing with PT_IRQ_TYPE_PCI and PT_IRQ_TYPE_MSI_TRANSLATE? It
>>> seems to me that there's the same chance of deferring IRR
>>> propagation for too long?
>>
>> This is possible, can you please help on how to get which vcpu associate the IRQ?
>> I did not found any helper on current Xen.
> 
> There's no such helper, I'm afraid. Looking at hvm_migrate_pirq()
> and hvm_girq_dest_2_vcpu_id() I notice that the former does nothing
> if pirq_dpci->gmsi.posted is set. Hence pirq_dpci->gmsi.dest_vcpu_id
> isn't really used in this case (please double check), and so you may
> want to update the field alongside setting pirq_dpci->gmsi.posted in
> pt_irq_create_bind(), covering the multi destination case.
> 
> Your code addition still visible in context above may then want to
> be further conditionalized upon iommu_intpost or (perhaps better)
> pirq_dpci->gmsi.posted being set.
> 

Sorry, this is new to me, and I have to study the code.
Do you think the check below covers all conditions?

diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index 4290c7c710..90c3da441d 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -412,6 +412,10 @@ int pt_irq_create_bind(
                 pirq_dpci->gmsi.gvec = pt_irq_bind->u.msi.gvec;
                 pirq_dpci->gmsi.gflags = gflags;
             }
+
+            /* Notify guest of pending interrupts if necessary */
+            if ( dest_vcpu_id >= 0 && iommu_intpost && pirq_dpci->gmsi.posted )
+                vlapic_sync_pir_to_irr(d->vcpu[pirq_dpci->gmsi.dest_vcpu_id]);
         }
         /* Calculate dest_vcpu_id for MSI-type pirq migration. */
         dest = MASK_EXTR(pirq_dpci->gmsi.gflags,


Thanks,
Joe
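[Editor's note: for readers following the PIR/IRR discussion above, here is a minimal self-contained sketch of what the proposed vlapic_sync_pir_to_irr() call accomplishes. The types and helpers are simplified stand-ins, not Xen's real structures:]

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified model (not Xen's actual types) of syncing PIR to IRR:
 * with APICv, the posted-interrupt descriptor accumulates pending
 * vectors in PIR; syncing ORs those bits into the vLAPIC IRR and
 * clears PIR, so a later IRR read (e.g. the guest's IRQ-migration
 * path) can observe the pending vector.
 */
#define NR_VECTORS 256

struct pi_desc_model {
    uint32_t pir[NR_VECTORS / 32];   /* posted interrupt requests */
};

struct vlapic_model {
    uint32_t irr[NR_VECTORS / 32];   /* interrupt request register */
};

static void set_pir(struct pi_desc_model *pi, unsigned int vec)
{
    pi->pir[vec / 32] |= 1u << (vec % 32);
}

static void sync_pir_to_irr_model(struct pi_desc_model *pi,
                                  struct vlapic_model *vlapic)
{
    for ( unsigned int i = 0; i < NR_VECTORS / 32; i++ )
    {
        /* In real code each PIR word is drained with an atomic xchg. */
        uint32_t pending = pi->pir[i];

        pi->pir[i] = 0;
        vlapic->irr[i] |= pending;
    }
}

static int irr_test(const struct vlapic_model *vlapic, unsigned int vec)
{
    return !!(vlapic->irr[vec / 32] & (1u << (vec % 32)));
}
```

Until the sync runs, a vector posted into PIR is invisible to any IRR check, which is the window the patch is trying to close.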
Jan Beulich Sept. 19, 2019, 10:24 a.m. UTC | #9
On 18.09.2019 23:16, Joe Jin wrote:
> On 9/16/19 11:48 PM, Jan Beulich wrote:
>> On 17.09.2019 00:20, Joe Jin wrote:
>>> On 9/16/19 1:01 AM, Jan Beulich wrote:
>>>> On 13.09.2019 18:38, Joe Jin wrote:
>>>>> On 9/13/19 12:14 AM, Jan Beulich wrote:
>>>>>> On 12.09.2019 20:03, Joe Jin wrote:
>>>>>>> --- a/xen/drivers/passthrough/io.c
>>>>>>> +++ b/xen/drivers/passthrough/io.c
>>>>>>> @@ -412,6 +412,9 @@ int pt_irq_create_bind(
>>>>>>>                  pirq_dpci->gmsi.gvec = pt_irq_bind->u.msi.gvec;
>>>>>>>                  pirq_dpci->gmsi.gflags = gflags;
>>>>>>>              }
>>>>>>> +
>>>>>>> +            if ( hvm_funcs.sync_pir_to_irr )
>>>>>>> +                hvm_funcs.sync_pir_to_irr(d->vcpu[pirq_dpci->gmsi.dest_vcpu_id]);
>>>>>>
>>>>>> If the need for this change can be properly explained, then it
>>>>>> still wants converting to alternative_vcall() - the the other
>>>>>> caller of this hook. Or perhaps even better move vlapic.c's
>>>>>> wrapper (suitably renamed) into hvm.h, and use it here.
>>>>>
>>>>> Yes I agree, I'm not 100% sure, so I set it to RFC.
>>>>
>>>> And btw, please also attach a brief comment here, to clarify
>>>> why the syncing is needed precisely at this point.
>>>>
>>>>>> Additionally, the code setting pirq_dpci->gmsi.dest_vcpu_id
>>>>>> (right after your code insertion) allows for the field to be
>>>>>> invalid, which I think you need to guard against.
>>>>>
>>>>> I think you means multiple destination, then it's -1?
>>>>
>>>> The reason for why it might be -1 are irrelevant here, I think.
>>>> You need to handle the case both to avoid an out-of-bounds
>>>> array access and to make sure an IRR bit wouldn't still get
>>>> propagated too late in some special case.
>>>
>>> Add following checks?
>>>             if ( dest_vcpu_id >= 0 && dest_vcpu_id < d->max_vcpus &&
>>>                  d->vcpu[dest_vcpu_id]->runstate.state <= RUNSTATE_blocked )
>>
>> Just the >= part should suffice; without an explanation I don't
>> see why you want the runstate check (which after all is racy
>> anyway afaict).
>>
>>>> Also - what about the respective other path in the function,
>>>> dealing with PT_IRQ_TYPE_PCI and PT_IRQ_TYPE_MSI_TRANSLATE? It
>>>> seems to me that there's the same chance of deferring IRR
>>>> propagation for too long?
>>>
>>> This is possible, can you please help on how to get which vcpu associate the IRQ?
>>> I did not found any helper on current Xen.
>>
>> There's no such helper, I'm afraid. Looking at hvm_migrate_pirq()
>> and hvm_girq_dest_2_vcpu_id() I notice that the former does nothing
>> if pirq_dpci->gmsi.posted is set. Hence pirq_dpci->gmsi.dest_vcpu_id
>> isn't really used in this case (please double check), and so you may
>> want to update the field alongside setting pirq_dpci->gmsi.posted in
>> pt_irq_create_bind(), covering the multi destination case.
>>
>> Your code addition still visible in context above may then want to
>> be further conditionalized upon iommu_intpost or (perhaps better)
>> pirq_dpci->gmsi.posted being set.
>>
> 
> Sorry this is new to me, and I have to study from code.
> Do you think below check cover all conditions?

It does afaict; I don't think you need to check both iommu_intpost and
pirq_dpci->gmsi.posted - just the latter ought to be enough. What's
still missing is the further updating of pirq_dpci->gmsi.dest_vcpu_id
(as explained before, still visible in context above).

Jan
Joe Jin Sept. 19, 2019, 9:38 p.m. UTC | #10
On 9/19/19 3:24 AM, Jan Beulich wrote:
> On 18.09.2019 23:16, Joe Jin wrote:
>> On 9/16/19 11:48 PM, Jan Beulich wrote:
>>> On 17.09.2019 00:20, Joe Jin wrote:
>>>> On 9/16/19 1:01 AM, Jan Beulich wrote:
>>>>> On 13.09.2019 18:38, Joe Jin wrote:
>>>>>> On 9/13/19 12:14 AM, Jan Beulich wrote:
>>>>>>> On 12.09.2019 20:03, Joe Jin wrote:
>>>>>>>> --- a/xen/drivers/passthrough/io.c
>>>>>>>> +++ b/xen/drivers/passthrough/io.c
>>>>>>>> @@ -412,6 +412,9 @@ int pt_irq_create_bind(
>>>>>>>>                  pirq_dpci->gmsi.gvec = pt_irq_bind->u.msi.gvec;
>>>>>>>>                  pirq_dpci->gmsi.gflags = gflags;
>>>>>>>>              }
>>>>>>>> +
>>>>>>>> +            if ( hvm_funcs.sync_pir_to_irr )
>>>>>>>> +                hvm_funcs.sync_pir_to_irr(d->vcpu[pirq_dpci->gmsi.dest_vcpu_id]);
>>>>>>>
>>>>>>> If the need for this change can be properly explained, then it
>>>>>>> still wants converting to alternative_vcall() - the the other
>>>>>>> caller of this hook. Or perhaps even better move vlapic.c's
>>>>>>> wrapper (suitably renamed) into hvm.h, and use it here.
>>>>>>
>>>>>> Yes I agree, I'm not 100% sure, so I set it to RFC.
>>>>>
>>>>> And btw, please also attach a brief comment here, to clarify
>>>>> why the syncing is needed precisely at this point.
>>>>>
>>>>>>> Additionally, the code setting pirq_dpci->gmsi.dest_vcpu_id
>>>>>>> (right after your code insertion) allows for the field to be
>>>>>>> invalid, which I think you need to guard against.
>>>>>>
>>>>>> I think you means multiple destination, then it's -1?
>>>>>
>>>>> The reason for why it might be -1 are irrelevant here, I think.
>>>>> You need to handle the case both to avoid an out-of-bounds
>>>>> array access and to make sure an IRR bit wouldn't still get
>>>>> propagated too late in some special case.
>>>>
>>>> Add following checks?
>>>>             if ( dest_vcpu_id >= 0 && dest_vcpu_id < d->max_vcpus &&
>>>>                  d->vcpu[dest_vcpu_id]->runstate.state <= RUNSTATE_blocked )
>>>
>>> Just the >= part should suffice; without an explanation I don't
>>> see why you want the runstate check (which after all is racy
>>> anyway afaict).
>>>
>>>>> Also - what about the respective other path in the function,
>>>>> dealing with PT_IRQ_TYPE_PCI and PT_IRQ_TYPE_MSI_TRANSLATE? It
>>>>> seems to me that there's the same chance of deferring IRR
>>>>> propagation for too long?
>>>>
>>>> This is possible, can you please help on how to get which vcpu associate the IRQ?
>>>> I did not found any helper on current Xen.
>>>
>>> There's no such helper, I'm afraid. Looking at hvm_migrate_pirq()
>>> and hvm_girq_dest_2_vcpu_id() I notice that the former does nothing
>>> if pirq_dpci->gmsi.posted is set. Hence pirq_dpci->gmsi.dest_vcpu_id
>>> isn't really used in this case (please double check), and so you may
>>> want to update the field alongside setting pirq_dpci->gmsi.posted in
>>> pt_irq_create_bind(), covering the multi destination case.
>>>
>>> Your code addition still visible in context above may then want to
>>> be further conditionalized upon iommu_intpost or (perhaps better)
>>> pirq_dpci->gmsi.posted being set.
>>>
>>
>> Sorry this is new to me, and I have to study from code.
>> Do you think below check cover all conditions?
> 
> I does afaict; I don't think you need to check both iommu_intpost and
> pirq_dpci->gmsi.posted - just the latter ought to be enough. What's
> still missing is the further updating of pirq_dpci->gmsi.dest_vcpu_id
> (as explained before, still visible in context above).
> 

 422
 423         dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
 424         pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;

dest_vcpu_id is updated later by the above code; did I miss something?

Thanks,
Joe
Jan Beulich Sept. 20, 2019, 8:28 a.m. UTC | #11
On 19.09.2019 23:38, Joe Jin wrote:
> On 9/19/19 3:24 AM, Jan Beulich wrote:
>> What's
>> still missing is the further updating of pirq_dpci->gmsi.dest_vcpu_id
>> (as explained before, still visible in context above).
>>
> 
>  422
>  423         dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
>  424         pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
> 
> dest_vcpu_id updated later by above code, do I missed something?

This piece of code

        if ( iommu_intpost )
        {
            if ( delivery_mode == dest_LowestPrio )
                vcpu = vector_hashing_dest(d, dest, dest_mode,
                                           pirq_dpci->gmsi.gvec);
            if ( vcpu )
                pirq_dpci->gmsi.posted = true;
        }

updates the vCPU to be delivered to. Right now, when the "posted"
flag is set, the dest_vcpu_id field is unused (as far as I was
able to tell), and hence didn't need setting. The way you intend
to change things, you want to use the field also in the "posted"
case, and hence you also should update it in the code path above.

But please double check all of this.

Jan
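[Editor's note: the lowest-priority destination selection Jan quotes can be illustrated with a toy model. Names here are hypothetical; the real vector_hashing_dest() additionally restricts itself to online vCPUs in the destination set:]

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical simplification of vector_hashing_dest(): for
 * lowest-priority delivery with posted interrupts, one vCPU is picked
 * out of the destination candidates by hashing the guest vector, so
 * each bound MSI gets a single, stable posting target.
 */
static int vector_hash_dest_model(const int *dest_vcpus, size_t n,
                                  unsigned int gvec)
{
    if ( n == 0 )
        return -1;     /* no candidate: caller keeps dest_vcpu_id = -1 */

    return dest_vcpus[gvec % n];
}
```

This is why, when "posted" gets set on this path, dest_vcpu_id would also need updating to the hashed target if the sync code is to use it.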
Chao Gao Sept. 23, 2019, 8:31 a.m. UTC | #12
On Wed, Sep 18, 2019 at 02:16:13PM -0700, Joe Jin wrote:
>On 9/16/19 11:48 PM, Jan Beulich wrote:
>> On 17.09.2019 00:20, Joe Jin wrote:
>>> On 9/16/19 1:01 AM, Jan Beulich wrote:
>>>> On 13.09.2019 18:38, Joe Jin wrote:
>>>>> On 9/13/19 12:14 AM, Jan Beulich wrote:
>>>>>> On 12.09.2019 20:03, Joe Jin wrote:
>>>>>>> --- a/xen/drivers/passthrough/io.c
>>>>>>> +++ b/xen/drivers/passthrough/io.c
>>>>>>> @@ -412,6 +412,9 @@ int pt_irq_create_bind(
>>>>>>>                  pirq_dpci->gmsi.gvec = pt_irq_bind->u.msi.gvec;
>>>>>>>                  pirq_dpci->gmsi.gflags = gflags;
>>>>>>>              }
>>>>>>> +
>>>>>>> +            if ( hvm_funcs.sync_pir_to_irr )
>>>>>>> +                hvm_funcs.sync_pir_to_irr(d->vcpu[pirq_dpci->gmsi.dest_vcpu_id]);
>>>>>>
>>>>>> If the need for this change can be properly explained, then it
>>>>>> still wants converting to alternative_vcall() - the the other
>>>>>> caller of this hook. Or perhaps even better move vlapic.c's
>>>>>> wrapper (suitably renamed) into hvm.h, and use it here.
>>>>>
>>>>> Yes I agree, I'm not 100% sure, so I set it to RFC.
>>>>
>>>> And btw, please also attach a brief comment here, to clarify
>>>> why the syncing is needed precisely at this point.
>>>>
>>>>>> Additionally, the code setting pirq_dpci->gmsi.dest_vcpu_id
>>>>>> (right after your code insertion) allows for the field to be
>>>>>> invalid, which I think you need to guard against.
>>>>>
>>>>> I think you means multiple destination, then it's -1?
>>>>
>>>> The reason for why it might be -1 are irrelevant here, I think.
>>>> You need to handle the case both to avoid an out-of-bounds
>>>> array access and to make sure an IRR bit wouldn't still get
>>>> propagated too late in some special case.
>>>
>>> Add following checks?
>>>             if ( dest_vcpu_id >= 0 && dest_vcpu_id < d->max_vcpus &&
>>>                  d->vcpu[dest_vcpu_id]->runstate.state <= RUNSTATE_blocked )
>> 
>> Just the >= part should suffice; without an explanation I don't
>> see why you want the runstate check (which after all is racy
>> anyway afaict).
>> 
>>>> Also - what about the respective other path in the function,
>>>> dealing with PT_IRQ_TYPE_PCI and PT_IRQ_TYPE_MSI_TRANSLATE? It
>>>> seems to me that there's the same chance of deferring IRR
>>>> propagation for too long?
>>>
>>> This is possible, can you please help on how to get which vcpu associate the IRQ?
>>> I did not found any helper on current Xen.
>> 
>> There's no such helper, I'm afraid. Looking at hvm_migrate_pirq()
>> and hvm_girq_dest_2_vcpu_id() I notice that the former does nothing
>> if pirq_dpci->gmsi.posted is set. Hence pirq_dpci->gmsi.dest_vcpu_id
>> isn't really used in this case (please double check), and so you may
>> want to update the field alongside setting pirq_dpci->gmsi.posted in
>> pt_irq_create_bind(), covering the multi destination case.
>> 
>> Your code addition still visible in context above may then want to
>> be further conditionalized upon iommu_intpost or (perhaps better)
>> pirq_dpci->gmsi.posted being set.
>> 
>
>Sorry this is new to me, and I have to study from code.
>Do you think below check cover all conditions?
>
>diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
>index 4290c7c710..90c3da441d 100644
>--- a/xen/drivers/passthrough/io.c
>+++ b/xen/drivers/passthrough/io.c
>@@ -412,6 +412,10 @@ int pt_irq_create_bind(
>                 pirq_dpci->gmsi.gvec = pt_irq_bind->u.msi.gvec;
>                 pirq_dpci->gmsi.gflags = gflags;
>             }
>+
>+            /* Notify guest of pending interrupts if necessary */
>+            if ( dest_vcpu_id >= 0 && iommu_intpost && pirq_dpci->gmsi.posted )

Hi Joe,

Did you enable VT-d posted interrupts in the Xen boot options? I don't see
why this would be specific to VT-d posted interrupts. If only CPU-side
posted interrupts are enabled, it is also possible that interrupts are not
propagated from PIR to IRR in time.

Thanks
Chao
Joe Jin Sept. 23, 2019, 3:49 p.m. UTC | #13
On 9/23/19 1:31 AM, Chao Gao wrote:
> On Wed, Sep 18, 2019 at 02:16:13PM -0700, Joe Jin wrote:
>> On 9/16/19 11:48 PM, Jan Beulich wrote:
>>> On 17.09.2019 00:20, Joe Jin wrote:
>>>> On 9/16/19 1:01 AM, Jan Beulich wrote:
>>>>> On 13.09.2019 18:38, Joe Jin wrote:
>>>>>> On 9/13/19 12:14 AM, Jan Beulich wrote:
>>>>>>> On 12.09.2019 20:03, Joe Jin wrote:
>>>>>>>> --- a/xen/drivers/passthrough/io.c
>>>>>>>> +++ b/xen/drivers/passthrough/io.c
>>>>>>>> @@ -412,6 +412,9 @@ int pt_irq_create_bind(
>>>>>>>>                  pirq_dpci->gmsi.gvec = pt_irq_bind->u.msi.gvec;
>>>>>>>>                  pirq_dpci->gmsi.gflags = gflags;
>>>>>>>>              }
>>>>>>>> +
>>>>>>>> +            if ( hvm_funcs.sync_pir_to_irr )
>>>>>>>> +                hvm_funcs.sync_pir_to_irr(d->vcpu[pirq_dpci->gmsi.dest_vcpu_id]);
>>>>>>>
>>>>>>> If the need for this change can be properly explained, then it
>>>>>>> still wants converting to alternative_vcall() - the the other
>>>>>>> caller of this hook. Or perhaps even better move vlapic.c's
>>>>>>> wrapper (suitably renamed) into hvm.h, and use it here.
>>>>>>
>>>>>> Yes I agree, I'm not 100% sure, so I set it to RFC.
>>>>>
>>>>> And btw, please also attach a brief comment here, to clarify
>>>>> why the syncing is needed precisely at this point.
>>>>>
>>>>>>> Additionally, the code setting pirq_dpci->gmsi.dest_vcpu_id
>>>>>>> (right after your code insertion) allows for the field to be
>>>>>>> invalid, which I think you need to guard against.
>>>>>>
>>>>>> I think you means multiple destination, then it's -1?
>>>>>
>>>>> The reason for why it might be -1 are irrelevant here, I think.
>>>>> You need to handle the case both to avoid an out-of-bounds
>>>>> array access and to make sure an IRR bit wouldn't still get
>>>>> propagated too late in some special case.
>>>>
>>>> Add following checks?
>>>>             if ( dest_vcpu_id >= 0 && dest_vcpu_id < d->max_vcpus &&
>>>>                  d->vcpu[dest_vcpu_id]->runstate.state <= RUNSTATE_blocked )
>>>
>>> Just the >= part should suffice; without an explanation I don't
>>> see why you want the runstate check (which after all is racy
>>> anyway afaict).
>>>
>>>>> Also - what about the respective other path in the function,
>>>>> dealing with PT_IRQ_TYPE_PCI and PT_IRQ_TYPE_MSI_TRANSLATE? It
>>>>> seems to me that there's the same chance of deferring IRR
>>>>> propagation for too long?
>>>>
>>>> This is possible, can you please help on how to get which vcpu associate the IRQ?
>>>> I did not found any helper on current Xen.
>>>
>>> There's no such helper, I'm afraid. Looking at hvm_migrate_pirq()
>>> and hvm_girq_dest_2_vcpu_id() I notice that the former does nothing
>>> if pirq_dpci->gmsi.posted is set. Hence pirq_dpci->gmsi.dest_vcpu_id
>>> isn't really used in this case (please double check), and so you may
>>> want to update the field alongside setting pirq_dpci->gmsi.posted in
>>> pt_irq_create_bind(), covering the multi destination case.
>>>
>>> Your code addition still visible in context above may then want to
>>> be further conditionalized upon iommu_intpost or (perhaps better)
>>> pirq_dpci->gmsi.posted being set.
>>>
>>
>> Sorry this is new to me, and I have to study from code.
>> Do you think below check cover all conditions?
>>
>> diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
>> index 4290c7c710..90c3da441d 100644
>> --- a/xen/drivers/passthrough/io.c
>> +++ b/xen/drivers/passthrough/io.c
>> @@ -412,6 +412,10 @@ int pt_irq_create_bind(
>>                 pirq_dpci->gmsi.gvec = pt_irq_bind->u.msi.gvec;
>>                 pirq_dpci->gmsi.gflags = gflags;
>>             }
>> +
>> +            /* Notify guest of pending interrupts if necessary */
>> +            if ( dest_vcpu_id >= 0 && iommu_intpost && pirq_dpci->gmsi.posted )
> 
> Hi Joe,
> 
> Do you enable vt-d posted interrupt in Xen boot options? I don't see
> why it is specific to vt-d posted interrupt. If only CPU side posted
> interrupt is enabled, it is also possible that interrupts are not
> propagated from PIR to IRR in time.

Hi Chao,

Yes, VT-d posted interrupts are enabled at boot:

(XEN) VMX: Supported advanced features:
(XEN)  - APIC MMIO access virtualisation
(XEN)  - APIC TPR shadow
(XEN)  - Extended Page Tables (EPT)
(XEN)  - Virtual-Processor Identifiers (VPID)
(XEN)  - Virtual NMI
(XEN)  - MSR direct-access bitmap
(XEN)  - Unrestricted Guest
(XEN)  - APIC Register Virtualization
(XEN)  - Virtual Interrupt Delivery
(XEN)  - Posted Interrupt Processing
(XEN)  - VMCS shadowing

Looking at vlapic_set_irq(), it seems that if posted interrupts are
enabled, it sets the PIR but not the IRR?

 170     if ( hvm_funcs.deliver_posted_intr )
 171         alternative_vcall(hvm_funcs.deliver_posted_intr, target, vec);
 172     else if ( !vlapic_test_and_set_irr(vec, vlapic) )
 173         vcpu_kick(target);

Thanks,
Joe
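[Editor's note: the vlapic_set_irq() behaviour Joe points out is the crux of the race: with posted interrupts the vector lands in PIR, while the guest's migration path only inspects IRR. A toy model of the lost-interrupt scenario, with hypothetical names rather than Xen's or Linux's real APIs:]

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the race: a posted interrupt sets PIR, not IRR.  The
 * guest's IRQ-migration code retriggers only what it sees in IRR, so
 * a vector still sitting in PIR is silently lost unless PIR is
 * drained into IRR first.
 */
struct vcpu_model {
    bool pir_pending;   /* posted-interrupt request outstanding */
    bool irr_pending;   /* visible in the vLAPIC IRR */
};

static void post_interrupt(struct vcpu_model *v)
{
    v->pir_pending = true;      /* what vlapic_set_irq() does with APICv */
}

static void sync_pir_model(struct vcpu_model *v)
{
    if ( v->pir_pending )
    {
        v->pir_pending = false;
        v->irr_pending = true;
    }
}

/* Guest-side migration check: retrigger only if IRR shows the vector. */
static bool migrate_irq(struct vcpu_model *src, struct vcpu_model *dst)
{
    if ( src->irr_pending )
    {
        src->irr_pending = false;
        dst->irr_pending = true;    /* retriggered at new destination */
        return true;
    }

    return false;                   /* nothing visible: interrupt lost */
}
```

Without the sync the migration check finds an empty IRR and drops the interrupt; with it, the pending vector is retriggered at the new destination, matching the "No irq handler for vector" symptom and its fix.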
Joe Jin Sept. 23, 2019, 10:29 p.m. UTC | #14
On 9/20/19 1:28 AM, Jan Beulich wrote:
> On 19.09.2019 23:38, Joe Jin wrote:
>> On 9/19/19 3:24 AM, Jan Beulich wrote:
>>> What's
>>> still missing is the further updating of pirq_dpci->gmsi.dest_vcpu_id
>>> (as explained before, still visible in context above).
>>>
>>
>>  422
>>  423         dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
>>  424         pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
>>
>> dest_vcpu_id updated later by above code, do I missed something?
> 
> This piece of code
> 
>         if ( iommu_intpost )
>         {
>             if ( delivery_mode == dest_LowestPrio )
>                 vcpu = vector_hashing_dest(d, dest, dest_mode,
>                                            pirq_dpci->gmsi.gvec);
>             if ( vcpu )
>                 pirq_dpci->gmsi.posted = true;
>         }
> 
> updates the vCPU to be delivered to. Right now, when the "posted"
> flag is set, the dest_vcpu_id field is unused (as far as I was
> able to tell), and hence didn't need setting. The way you intend
> to change things, you want to use the field also in the "posted"
> case, and hence you also should update it in the code path above.
> 
> But please double check all of this.

Thanks for your explanation.

I did not find any other possible condition which may lead to an invalid
setting; it seems ( dest_vcpu_id >= 0 && pirq_dpci->gmsi.posted ) is enough?

Thanks,
Joe
Jan Beulich Sept. 24, 2019, 7:26 a.m. UTC | #15
On 24.09.2019 00:29, Joe Jin wrote:
> On 9/20/19 1:28 AM, Jan Beulich wrote:
>> On 19.09.2019 23:38, Joe Jin wrote:
>>> On 9/19/19 3:24 AM, Jan Beulich wrote:
>>>> What's
>>>> still missing is the further updating of pirq_dpci->gmsi.dest_vcpu_id
>>>> (as explained before, still visible in context above).
>>>>
>>>
>>>  422
>>>  423         dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
>>>  424         pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
>>>
>>> dest_vcpu_id updated later by above code, do I missed something?
>>
>> This piece of code
>>
>>         if ( iommu_intpost )
>>         {
>>             if ( delivery_mode == dest_LowestPrio )
>>                 vcpu = vector_hashing_dest(d, dest, dest_mode,
>>                                            pirq_dpci->gmsi.gvec);
>>             if ( vcpu )
>>                 pirq_dpci->gmsi.posted = true;
>>         }
>>
>> updates the vCPU to be delivered to. Right now, when the "posted"
>> flag is set, the dest_vcpu_id field is unused (as far as I was
>> able to tell), and hence didn't need setting. The way you intend
>> to change things, you want to use the field also in the "posted"
>> case, and hence you also should update it in the code path above.
>>
>> But please double check all of this.
> 
> Thanks for your explanation.
> 
> I did not found any other possible conditions which may lead invalid
> setting, seems ( dest_vcpu_id >=0 && pirq_dpci->gmsi.posted ) is enough?

I thought we had settled on this side of things a number of roundtrips
ago.

Jan
Roger Pau Monné Sept. 24, 2019, 3:42 p.m. UTC | #16
On Fri, Sep 13, 2019 at 09:50:34AM -0700, Joe Jin wrote:
> On 9/13/19 3:33 AM, Roger Pau Monné wrote:
> > On Thu, Sep 12, 2019 at 11:03:14AM -0700, Joe Jin wrote:
> >> With below testcase, guest kernel reported "No irq handler for vector":
> >>   1). Passthrough mlx ib VF to 2 pvhvm guests.
> >>   2). Start rds-stress between 2 guests.
> >>   3). Scale down 2 guests vcpu from 32 to 6 at the same time.
> >>
> >> Repeat above test several iteration, guest kernel reported "No irq handler
> >> for vector", and IB traffic downed to zero which caused by interrupt lost.
> >>
> >> When vcpu offline, kernel disabled local IRQ, migrate IRQ to other cpu,
> >> update MSI-X table, enable IRQ. If any new interrupt arrived after
> >> local IRQ disabled also before MSI-X table been updated, interrupt still 
> >> used old vector and dest cpu info, and when local IRQ enabled again, 
> >> interrupt been sent to wrong cpu and vector.
> > 
> > Yes, but that's something Linux shoulkd be able to handle, according
> > to your description there's a window where interrupts can be delivered
> > to the old CPU, but that's something expected.
> 
> Actually, kernel will check APIC IRR when do migration, if any pending
> IRQ, will retrigger IRQ to new destination, but Xen does not set the
> bit.

Right, because the interrupt is pending in the PIRR posted descriptor
field, it has not yet reached the vlapic IRR.

> > 
> >>
> >> Looks sync PIR to IRR after MSI-X been updated is help for this issue.
> > 
> > AFAICT the sync that you do is still using the old vcpu id, as
> > pirq_dpci->gmsi.dest_vcpu_id gets updated a little bit below. I'm
> > unsure about why does this help, I would expect the sync between pir
> > and irr to happen anyway, and hence I'm not sure why is this helping.
> 
> As my above update, IRQ retriggered by old cpu, so Xen need to set IRR
> for old cpu but not of new.

AFAICT you are draining any pending data from the posted interrupt
PIRR field into the IRR vlapic field, so that no stale interrupts are
left behind after the MSI fields have been updated by the guest. I
think this is correct, I wonder however whether you also need to
trigger a vcpu re-scheduling (pause/unpause the vpcu), so that pending
interrupts in IRR are injected by vmx_intr_assist.

Also, I think you should do this syncing after the pi_update_irte
call, or else there's still a window (albeit small) where you can
still get posted interrupts delivered to the old vcpu.

Thanks, Roger.
Joe Jin Sept. 26, 2019, 8:33 p.m. UTC | #17
On 9/24/19 8:42 AM, Roger Pau Monné wrote:
> On Fri, Sep 13, 2019 at 09:50:34AM -0700, Joe Jin wrote:
>> On 9/13/19 3:33 AM, Roger Pau Monné wrote:
>>> On Thu, Sep 12, 2019 at 11:03:14AM -0700, Joe Jin wrote:
>>>> With below testcase, guest kernel reported "No irq handler for vector":
>>>>   1). Passthrough mlx ib VF to 2 pvhvm guests.
>>>>   2). Start rds-stress between 2 guests.
>>>>   3). Scale down 2 guests vcpu from 32 to 6 at the same time.
>>>>
>>>> Repeat above test several iteration, guest kernel reported "No irq handler
>>>> for vector", and IB traffic downed to zero which caused by interrupt lost.
>>>>
>>>> When vcpu offline, kernel disabled local IRQ, migrate IRQ to other cpu,
>>>> update MSI-X table, enable IRQ. If any new interrupt arrived after
>>>> local IRQ disabled also before MSI-X table been updated, interrupt still 
>>>> used old vector and dest cpu info, and when local IRQ enabled again, 
>>>> interrupt been sent to wrong cpu and vector.
>>>
>>> Yes, but that's something Linux shoulkd be able to handle, according
>>> to your description there's a window where interrupts can be delivered
>>> to the old CPU, but that's something expected.
>>
>> Actually, kernel will check APIC IRR when do migration, if any pending
>> IRQ, will retrigger IRQ to new destination, but Xen does not set the
>> bit.
> 
> Right, because the interrupt is pending in the PIRR posted descriptor
> field, it has not yet reached the vlapic IRR.
> 
>>>
>>>>
>>>> Looks sync PIR to IRR after MSI-X been updated is help for this issue.
>>>
>>> AFAICT the sync that you do is still using the old vcpu id, as
>>> pirq_dpci->gmsi.dest_vcpu_id gets updated a little bit below. I'm
>>> unsure about why does this help, I would expect the sync between pir
>>> and irr to happen anyway, and hence I'm not sure why is this helping.
>>
>> As my above update, IRQ retriggered by old cpu, so Xen need to set IRR
>> for old cpu but not of new.
> 
> AFAICT you are draining any pending data from the posted interrupt
> PIRR field into the IRR vlapic field, so that no stale interrupts are
> left behind after the MSI fields have been updated by the guest. I
> think this is correct, I wonder however whether you also need to
> trigger a vcpu re-scheduling (pause/unpause the vpcu), so that pending
> interrupts in IRR are injected by vmx_intr_assist.
> 
> Also, I think you should do this syncing after the pi_update_irte
> call, or else there's still a window (albeit small) where you can
> still get posted interrupts delivered to the old vcpu.

I agree with you that we need to take care of this issue as well.

I created an additional patch, but it is not tested yet because the test
env was broken. Posting it here for your comments first; I'll follow up
with the test results when my test env is back:

diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index 3f43b9d5a9..4596110a17 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -23,6 +23,7 @@
 #include <xen/irq.h>
 #include <asm/hvm/irq.h>
 #include <asm/hvm/support.h>
+#include <asm/hvm/vmx/vmx.h>
 #include <asm/io_apic.h>
 
 static DEFINE_PER_CPU(struct list_head, dpci_list);
@@ -443,9 +444,21 @@ int pt_irq_create_bind(
             hvm_migrate_pirq(pirq_dpci, vcpu);
 
         /* Use interrupt posting if it is supported. */
-        if ( iommu_intpost )
-            pi_update_irte(vcpu ? &vcpu->arch.hvm.vmx.pi_desc : NULL,
-                           info, pirq_dpci->gmsi.gvec);
+        if ( iommu_intpost ) {
+            unsigned int ndst = APIC_INVALID_DEST;
+            struct irq_desc *desc;
+
+            desc = pirq_spin_lock_irq_desc(info, NULL);
+            if ( irq_desc ) {
+                ndst = irq_desc->msi_desc->pi_desc->ndst;
+                spin_unlock_irq(&desc->lock);
+            }
+
+            if ( pi_update_irte(vcpu ? &vcpu->arch.hvm.vmx.pi_desc : NULL,
+                                    info, pirq_dpci->gmsi.gvec) == 0
+                            && ndst != APIC_INVALID_DEST )
+                vlapic_sync_pir_to_irr(d->vcpu[ndst]);
+        }
 
         if ( pt_irq_bind->u.msi.gflags & XEN_DOMCTL_VMSI_X86_UNMASKED )
         {
Jan Beulich Sept. 27, 2019, 8:18 a.m. UTC | #18
On 26.09.2019 22:33, Joe Jin wrote:
> On 9/24/19 8:42 AM, Roger Pau Monné wrote:
>> AFAICT you are draining any pending data from the posted interrupt
>> PIRR field into the IRR vlapic field, so that no stale interrupts are
>> left behind after the MSI fields have been updated by the guest. I
>> think this is correct, I wonder however whether you also need to
>> trigger a vcpu re-scheduling (pause/unpause the vpcu), so that pending
>> interrupts in IRR are injected by vmx_intr_assist.

You didn't address (perhaps just verbally) this remark from Roger
at all. I'm not convinced that a pause/unpause is an appropriate
action here - there surely should be a more direct way.

>> Also, I think you should do this syncing after the pi_update_irte
>> call, or else there's still a window (albeit small) where you can
>> still get posted interrupts delivered to the old vcpu.
> 
> I agree with you that we need to take care of this issue as well.
> 
> I created an additional patch but not tested yet for the test env was
> broken, post here for your comment firstly, I'll update later of test
> result when my test env back:

It would have been nice if you had at least build-tested it. In
its current shape it's hard to tell what value it is. Anyway, ...

> @@ -23,6 +23,7 @@
>  #include <xen/irq.h>
>  #include <asm/hvm/irq.h>
>  #include <asm/hvm/support.h>
> +#include <asm/hvm/vmx/vmx.h>

Why is this needed? It's not a good idea to include it here.

> @@ -443,9 +444,21 @@ int pt_irq_create_bind(
>              hvm_migrate_pirq(pirq_dpci, vcpu);
>  
>          /* Use interrupt posting if it is supported. */
> -        if ( iommu_intpost )
> -            pi_update_irte(vcpu ? &vcpu->arch.hvm.vmx.pi_desc : NULL,
> -                           info, pirq_dpci->gmsi.gvec);
> +        if ( iommu_intpost ) {
> +            unsigned int ndst = APIC_INVALID_DEST;
> +            struct irq_desc *desc;
> +
> +            desc = pirq_spin_lock_irq_desc(info, NULL);
> +            if ( irq_desc ) {
> +                ndst = irq_desc->msi_desc->pi_desc->ndst;
> +                spin_unlock_irq(&desc->lock);
> +            }

This redoes (in a much less careful way, i.e. prone to a crash)
what pi_update_irte() does. It would seem far easier if you
simply made the function return the value, or even make it do
the call right away (when needed).

> +            if ( pi_update_irte(vcpu ? &vcpu->arch.hvm.vmx.pi_desc : NULL,
> +                                    info, pirq_dpci->gmsi.gvec) == 0
> +                            && ndst != APIC_INVALID_DEST )
> +                vlapic_sync_pir_to_irr(d->vcpu[ndst]);

Aiui "ndst" is an APIC ID and hence can't be used to index
d->vcpu[]. Without a description it's not really clear to me
why you found it necessary to go via APIC ID here - in your
earlier patch variant this wasn't the case iirc.

Before you actually (re)post this patch you'll also want to
take care of numerous style issues.

Jan
Roger Pau Monné Sept. 27, 2019, 8:42 a.m. UTC | #19
On Thu, Sep 26, 2019 at 01:33:42PM -0700, Joe Jin wrote:
> On 9/24/19 8:42 AM, Roger Pau Monné wrote:
> > On Fri, Sep 13, 2019 at 09:50:34AM -0700, Joe Jin wrote:
> >> On 9/13/19 3:33 AM, Roger Pau Monné wrote:
> >>> On Thu, Sep 12, 2019 at 11:03:14AM -0700, Joe Jin wrote:
> >>>> With below testcase, guest kernel reported "No irq handler for vector":
> >>>>   1). Passthrough mlx ib VF to 2 pvhvm guests.
> >>>>   2). Start rds-stress between 2 guests.
> >>>>   3). Scale down 2 guests vcpu from 32 to 6 at the same time.
> >>>>
> >>>> Repeat above test several iteration, guest kernel reported "No irq handler
> >>>> for vector", and IB traffic downed to zero which caused by interrupt lost.
> >>>>
> >>>> When vcpu offline, kernel disabled local IRQ, migrate IRQ to other cpu,
> >>>> update MSI-X table, enable IRQ. If any new interrupt arrived after
> >>>> local IRQ disabled also before MSI-X table been updated, interrupt still 
> >>>> used old vector and dest cpu info, and when local IRQ enabled again, 
> >>>> interrupt been sent to wrong cpu and vector.
> >>>
> >>> Yes, but that's something Linux should be able to handle, according
> >>> to your description there's a window where interrupts can be delivered
> >>> to the old CPU, but that's something expected.
> >>
> >> Actually, the kernel checks the APIC IRR when doing the migration; if
> >> any IRQ is pending it retriggers the IRQ at the new destination, but
> >> Xen does not set the bit.
> > 
> > Right, because the interrupt is pending in the PIRR posted descriptor
> > field, it has not yet reached the vlapic IRR.
> > 
> >>>
> >>>>
> >>>> Looks sync PIR to IRR after MSI-X been updated is help for this issue.
> >>>
> >>> AFAICT the sync that you do is still using the old vcpu id, as
> >>> pirq_dpci->gmsi.dest_vcpu_id gets updated a little bit below. I'm
> >>> unsure about why does this help, I would expect the sync between pir
> >>> and irr to happen anyway, and hence I'm not sure why is this helping.
> >>
> >> As per my update above, the IRQ is retriggered by the old cpu, so Xen
> >> needs to set the IRR of the old cpu, not the new one.
> > 
> > AFAICT you are draining any pending data from the posted interrupt
> > PIRR field into the IRR vlapic field, so that no stale interrupts are
> > left behind after the MSI fields have been updated by the guest. I
> > think this is correct, I wonder however whether you also need to
> > trigger a vcpu re-scheduling (pause/unpause the vcpu), so that pending
> > interrupts in IRR are injected by vmx_intr_assist.
> > 
> > Also, I think you should do this syncing after the pi_update_irte
> > call, or else there's still a window (albeit small) where you can
> > still get posted interrupts delivered to the old vcpu.
> 
> I agree with you that we need to take care of this issue as well.
> 
> I created an additional patch, though it is not tested yet because the
> test env was broken. Posting it here for your comments first; I'll follow
> up with test results once my env is back:
> 
> diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
> index 3f43b9d5a9..4596110a17 100644
> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -23,6 +23,7 @@
>  #include <xen/irq.h>
>  #include <asm/hvm/irq.h>
>  #include <asm/hvm/support.h>
> +#include <asm/hvm/vmx/vmx.h>

Why do you need this include?

>  #include <asm/io_apic.h>
>  
>  static DEFINE_PER_CPU(struct list_head, dpci_list);
> @@ -443,9 +444,21 @@ int pt_irq_create_bind(
>              hvm_migrate_pirq(pirq_dpci, vcpu);
>  
>          /* Use interrupt posting if it is supported. */
> -        if ( iommu_intpost )
> -            pi_update_irte(vcpu ? &vcpu->arch.hvm.vmx.pi_desc : NULL,
> -                           info, pirq_dpci->gmsi.gvec);
> +        if ( iommu_intpost ) {
> +            unsigned int ndst = APIC_INVALID_DEST;
> +            struct irq_desc *desc;
> +
> +            desc = pirq_spin_lock_irq_desc(info, NULL);
> +            if ( irq_desc ) {
> +                ndst = irq_desc->msi_desc->pi_desc->ndst;

I think this is not correct. NDST is the APIC ID of the physical CPU
used as the destination for notifications, it's not the ID of the
vcpu where the events are to be delivered.

Also, I think I'm still confused by this, I've just realized that the
PI descriptor seems to be moved from one vCPU to another without
clearing PIRR, and hence I'm not sure why you are losing interrupts in
that case. I need to look deeper in order to figure out what's going
on there.

Thanks, Roger.
Roger Pau Monné Sept. 27, 2019, 9:07 a.m. UTC | #20
On Fri, Sep 27, 2019 at 10:42:02AM +0200, Roger Pau Monné wrote:
> Also, I think I'm still confused by this, I've just realized that the
> PI descriptor seems to be moved from one vCPU to another without
> clearing PIRR, and hence I'm not sure why you are losing interrupts in
> that case. I need to look deeper in order to figure out what's going
> on there.

Forget about the above paragraph, it's completely bogus. The vector on
the new vCPU might be completely different, and hence the PIRR must
be flushed before moving. Let me try to come up with a patch for you
to test.

Thanks, Roger.
Roger Pau Monné Oct. 1, 2019, 4:01 p.m. UTC | #21
On Thu, Sep 26, 2019 at 01:33:42PM -0700, Joe Jin wrote:
> On 9/24/19 8:42 AM, Roger Pau Monné wrote:
> > On Fri, Sep 13, 2019 at 09:50:34AM -0700, Joe Jin wrote:
> >> On 9/13/19 3:33 AM, Roger Pau Monné wrote:
> >>> On Thu, Sep 12, 2019 at 11:03:14AM -0700, Joe Jin wrote:
> >>>> With below testcase, guest kernel reported "No irq handler for vector":
> >>>>   1). Passthrough mlx ib VF to 2 pvhvm guests.
> >>>>   2). Start rds-stress between 2 guests.
> >>>>   3). Scale down 2 guests vcpu from 32 to 6 at the same time.
> >>>>
> >>>> Repeat above test several iteration, guest kernel reported "No irq handler
> >>>> for vector", and IB traffic downed to zero which caused by interrupt lost.
> >>>>
> >>>> When vcpu offline, kernel disabled local IRQ, migrate IRQ to other cpu,
> >>>> update MSI-X table, enable IRQ. If any new interrupt arrived after
> >>>> local IRQ disabled also before MSI-X table been updated, interrupt still 
> >>>> used old vector and dest cpu info, and when local IRQ enabled again, 
> >>>> interrupt been sent to wrong cpu and vector.
> >>>
> >>> Yes, but that's something Linux should be able to handle, according
> >>> to your description there's a window where interrupts can be delivered
> >>> to the old CPU, but that's something expected.
> >>
> >> Actually, the kernel checks the APIC IRR when doing the migration; if
> >> any IRQ is pending it retriggers the IRQ at the new destination, but
> >> Xen does not set the bit.
> > 
> > Right, because the interrupt is pending in the PIRR posted descriptor
> > field, it has not yet reached the vlapic IRR.
> > 
> >>>
> >>>>
> >>>> Looks sync PIR to IRR after MSI-X been updated is help for this issue.
> >>>
> >>> AFAICT the sync that you do is still using the old vcpu id, as
> >>> pirq_dpci->gmsi.dest_vcpu_id gets updated a little bit below. I'm
> >>> unsure about why does this help, I would expect the sync between pir
> >>> and irr to happen anyway, and hence I'm not sure why is this helping.
> >>
> >> As per my update above, the IRQ is retriggered by the old cpu, so Xen
> >> needs to set the IRR of the old cpu, not the new one.
> > 
> > AFAICT you are draining any pending data from the posted interrupt
> > PIRR field into the IRR vlapic field, so that no stale interrupts are
> > left behind after the MSI fields have been updated by the guest. I
> > think this is correct, I wonder however whether you also need to
> > trigger a vcpu re-scheduling (pause/unpause the vcpu), so that pending
> > interrupts in IRR are injected by vmx_intr_assist.
> > 
> > Also, I think you should do this syncing after the pi_update_irte
> > call, or else there's still a window (albeit small) where you can
> > still get posted interrupts delivered to the old vcpu.
> 
> I agree with you that we need to take care of this issue as well.

Can you give a try to the patch below? I don't have the hardware to
test any of this ATM, so your help would be appreciated.

Thanks, Roger.
---8<---
diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
index 9466258d6f..d255ad8db7 100644
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -106,7 +106,7 @@ static void vlapic_clear_irr(int vector, struct vlapic *vlapic)
     vlapic_clear_vector(vector, &vlapic->regs->data[APIC_IRR]);
 }
 
-static void sync_pir_to_irr(struct vcpu *v)
+void vlapic_sync_pir_to_irr(struct vcpu *v)
 {
     if ( hvm_funcs.sync_pir_to_irr )
         alternative_vcall(hvm_funcs.sync_pir_to_irr, v);
@@ -114,7 +114,7 @@ static void sync_pir_to_irr(struct vcpu *v)
 
 static int vlapic_find_highest_irr(struct vlapic *vlapic)
 {
-    sync_pir_to_irr(vlapic_vcpu(vlapic));
+    vlapic_sync_pir_to_irr(vlapic_vcpu(vlapic));
 
     return vlapic_find_highest_vector(&vlapic->regs->data[APIC_IRR]);
 }
@@ -1493,7 +1493,7 @@ static int lapic_save_regs(struct vcpu *v, hvm_domain_context_t *h)
     if ( !has_vlapic(v->domain) )
         return 0;
 
-    sync_pir_to_irr(v);
+    vlapic_sync_pir_to_irr(v);
 
     return hvm_save_entry(LAPIC_REGS, v->vcpu_id, h, vcpu_vlapic(v)->regs);
 }
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index b292e79382..88188d2d7f 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -341,7 +341,7 @@ int pt_irq_create_bind(
     {
         uint8_t dest, delivery_mode;
         bool dest_mode;
-        int dest_vcpu_id;
+        int dest_vcpu_id, prev_vcpu_id = -1;
         const struct vcpu *vcpu;
         uint32_t gflags = pt_irq_bind->u.msi.gflags &
                           ~XEN_DOMCTL_VMSI_X86_UNMASKED;
@@ -411,6 +411,7 @@ int pt_irq_create_bind(
 
                 pirq_dpci->gmsi.gvec = pt_irq_bind->u.msi.gvec;
                 pirq_dpci->gmsi.gflags = gflags;
+                prev_vcpu_id = pirq_dpci->gmsi.dest_vcpu_id;
             }
         }
         /* Calculate dest_vcpu_id for MSI-type pirq migration. */
@@ -440,7 +441,8 @@ int pt_irq_create_bind(
         /* Use interrupt posting if it is supported. */
         if ( iommu_intpost )
             pi_update_irte(vcpu ? &vcpu->arch.hvm.vmx.pi_desc : NULL,
-                           info, pirq_dpci->gmsi.gvec);
+                           info, pirq_dpci->gmsi.gvec,
+                           prev_vcpu_id >= 0 ? d->vcpu[prev_vcpu_id] : NULL);
 
         if ( pt_irq_bind->u.msi.gflags & XEN_DOMCTL_VMSI_X86_UNMASKED )
         {
@@ -729,7 +731,9 @@ int pt_irq_destroy_bind(
             what = "bogus";
     }
     else if ( pirq_dpci && pirq_dpci->gmsi.posted )
-        pi_update_irte(NULL, pirq, 0);
+        pi_update_irte(NULL, pirq, 0,
+                       pirq_dpci->gmsi.dest_vcpu_id >= 0
+                       ? d->vcpu[pirq_dpci->gmsi.dest_vcpu_id] : NULL);
 
     if ( pirq_dpci && (pirq_dpci->flags & HVM_IRQ_DPCI_MAPPED) &&
          list_empty(&pirq_dpci->digl_list) )
diff --git a/xen/drivers/passthrough/vtd/intremap.c b/xen/drivers/passthrough/vtd/intremap.c
index bf846195c4..ad03abb4da 100644
--- a/xen/drivers/passthrough/vtd/intremap.c
+++ b/xen/drivers/passthrough/vtd/intremap.c
@@ -951,7 +951,7 @@ void intel_iommu_disable_eim(void)
  * when guest changes MSI/MSI-X information.
  */
 int pi_update_irte(const struct pi_desc *pi_desc, const struct pirq *pirq,
-    const uint8_t gvec)
+    const uint8_t gvec, struct vcpu *prev)
 {
     struct irq_desc *desc;
     struct msi_desc *msi_desc;
@@ -964,8 +964,8 @@ int pi_update_irte(const struct pi_desc *pi_desc, const struct pirq *pirq,
     msi_desc = desc->msi_desc;
     if ( !msi_desc )
     {
-        rc = -ENODEV;
-        goto unlock_out;
+        spin_unlock_irq(&desc->lock);
+        return -ENODEV;
     }
     msi_desc->pi_desc = pi_desc;
     msi_desc->gvec = gvec;
@@ -974,10 +974,9 @@ int pi_update_irte(const struct pi_desc *pi_desc, const struct pirq *pirq,
 
     ASSERT(pcidevs_locked());
 
-    return msi_msg_write_remap_rte(msi_desc, &msi_desc->msg);
-
- unlock_out:
-    spin_unlock_irq(&desc->lock);
+    rc = msi_msg_write_remap_rte(msi_desc, &msi_desc->msg);
+    if ( !rc && prev )
+         vlapic_sync_pir_to_irr(prev);
 
     return rc;
 }
diff --git a/xen/include/asm-x86/hvm/vlapic.h b/xen/include/asm-x86/hvm/vlapic.h
index dde66b4f0f..b0017d1dae 100644
--- a/xen/include/asm-x86/hvm/vlapic.h
+++ b/xen/include/asm-x86/hvm/vlapic.h
@@ -150,4 +150,6 @@ bool_t vlapic_match_dest(
     const struct vlapic *target, const struct vlapic *source,
     int short_hand, uint32_t dest, bool_t dest_mode);
 
+void vlapic_sync_pir_to_irr(struct vcpu *v);
+
 #endif /* __ASM_X86_HVM_VLAPIC_H__ */
diff --git a/xen/include/asm-x86/iommu.h b/xen/include/asm-x86/iommu.h
index 85741f7c96..314dcfbe47 100644
--- a/xen/include/asm-x86/iommu.h
+++ b/xen/include/asm-x86/iommu.h
@@ -119,7 +119,7 @@ static inline void iommu_disable_x2apic(void)
 extern bool untrusted_msi;
 
 int pi_update_irte(const struct pi_desc *pi_desc, const struct pirq *pirq,
-                   const uint8_t gvec);
+                   const uint8_t gvec, struct vcpu *prev);
 
 #endif /* !__ARCH_X86_IOMMU_H__ */
 /*
Joe Jin Oct. 1, 2019, 4:22 p.m. UTC | #22
On 10/1/19 9:01 AM, Roger Pau Monné wrote:
> Can you give a try to the patch below? I don't have the hardware to
> test any of this ATM, so your help would be appreciated.

I'd like to test this patch, but there is a hardware issue right now; once
my test env is available I'll test it and update you with the result.

Thanks,
Joe

Patch

diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index 4290c7c710..10c5b5d1e1 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -412,6 +412,9 @@  int pt_irq_create_bind(
                 pirq_dpci->gmsi.gvec = pt_irq_bind->u.msi.gvec;
                 pirq_dpci->gmsi.gflags = gflags;
             }
+
+            if ( hvm_funcs.sync_pir_to_irr )
+                hvm_funcs.sync_pir_to_irr(d->vcpu[pirq_dpci->gmsi.dest_vcpu_id]);
         }
         /* Calculate dest_vcpu_id for MSI-type pirq migration. */
         dest = MASK_EXTR(pirq_dpci->gmsi.gflags,