[XEN,v13,3/6] x86/pvh: Add PHYSDEVOP_setup_gsi for PVH dom0

Message ID 20240816110820.75672-4-Jiqian.Chen@amd.com (mailing list archive)
State Superseded
Series Support device passthrough when dom0 is PVH on Xen

Commit Message

Chen, Jiqian Aug. 16, 2024, 11:08 a.m. UTC
The GSI of a passthrough device must be configured before it can be
mapped into an HVM domU.
But when dom0 is PVH, the GSIs may not get registered (see the
clarification below). As a result, the apic, pin and irq information
is not added to the irq_2_pin list and the handler of the irq_desc is
not set, so setting the IO-APIC affinity and vector fails when a
device is passed through.

To fix this, new code on the Linux kernel side will need to call
PHYSDEVOP_setup_gsi for passthrough devices, so that the GSI gets
registered when dom0 is PVH.

So, add PHYSDEVOP_setup_gsi to hvm_physdev_op for that purpose.

Two clarifications:
First, why do the GSIs of devices belonging to PVH dom0 work?
Because when a driver is probed for a normal device, the normal PCI
probe function is used. In its call stack it requests the irq and
unmasks the corresponding IO-APIC pin of the GSI, which traps into
Xen and finally registers the GSI.
The call stack is (on the Linux kernel side) pci_device_probe->
request_threaded_irq-> irq_startup-> __unmask_ioapic->
io_apic_write, which then traps into Xen: hvmemul_do_io->
hvm_io_intercept-> hvm_process_io_intercept->
vioapic_write_indirect-> vioapic_hwdom_map_gsi-> mp_register_gsi.
That is how the GSI gets registered.

Second, why doesn't the GSI of a passthrough device work when dom0
is PVH?
Because when a device is assigned for passthrough, the probe function
specific to pciback is used. In its call stack no fake irq handler is
installed, because the ISR is not running. So mp_register_gsi on the
Xen side is never called, and the GSI is not registered.
The call stack is (on the Linux kernel side) pcistub_probe->
pcistub_seize-> pcistub_init_device-> xen_pcibk_reset_device->
xen_pcibk_control_isr->isr_on==0.

Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com>
Signed-off-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com>
---
 xen/arch/x86/hvm/hypercall.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Jan Beulich Aug. 19, 2024, 9:16 a.m. UTC | #1
On 16.08.2024 13:08, Jiqian Chen wrote:
> The gsi of a passthrough device must be configured for it to be
> able to be mapped into a hvm domU.
> But When dom0 is PVH, the gsis may not get registered(see below
> clarification), it causes the info of apic, pin and irq not be
> added into irq_2_pin list, and the handler of irq_desc is not set,
> then when passthrough a device, setting ioapic affinity and vector
> will fail.
> 
> To fix above problem, on Linux kernel side, a new code will
> need to call PHYSDEVOP_setup_gsi for passthrough devices to
> register gsi when dom0 is PVH.
> 
> So, add PHYSDEVOP_setup_gsi into hvm_physdev_op for above
> purpose.
> 
> Clarify two questions:
> First, why the gsi of devices belong to PVH dom0 can work?
> Because when probe a driver to a normal device, it uses the normal
> probe function of pci device, in its callstack, it requests irq
> and unmask corresponding ioapic of gsi, then trap into xen and
> register gsi finally.
> Callstack is(on linux kernel side) pci_device_probe->
> request_threaded_irq-> irq_startup-> __unmask_ioapic->
> io_apic_write, then trap into xen hvmemul_do_io->
> hvm_io_intercept-> hvm_process_io_intercept->
> vioapic_write_indirect-> vioapic_hwdom_map_gsi-> mp_register_gsi.
> So that the gsi can be registered.
> 
> Second, why the gsi of passthrough device can't work when dom0
> is PVH?
> Because when assign a device to passthrough, it uses the specific
> probe function of pciback, in its callstack, it doesn't install a
> fake irq handler due to the ISR is not running. So that
> mp_register_gsi on Xen side is never called, then the gsi is not
> registered.
> Callstack is(on linux kernel side) pcistub_probe->pcistub_seize->
> pcistub_init_device-> xen_pcibk_reset_device->
> xen_pcibk_control_isr->isr_on==0.

So: Underlying XSA-461 was the observation that the very limited set of
cases where this fake IRQ handler is installed is an issue. The problem
of dealing with "false" IRQs when a line-based interrupt is shared
between devices affects all parties, not just Dom0 and not just PV
guests. Therefore an alternative to the introduction of a new hypercall
would be to simply leverage that the installation of such a handler
will need widening anyway.

However, the installation of said handler presently also occurs in
cases where it's not really needed - when the line isn't shared. Thus,
if the handler registration would also be eliminated when it's not
really needed, we'd be back to needing a separate hypercall.

So I think first of all it needs deciding what is going to be done in
Linux, at least in pciback (as here we care about the Dom0 case only).

Jan
Chen, Jiqian Aug. 20, 2024, 6:33 a.m. UTC | #2
On 2024/8/19 17:16, Jan Beulich wrote:
> On 16.08.2024 13:08, Jiqian Chen wrote:
>> The gsi of a passthrough device must be configured for it to be
>> able to be mapped into a hvm domU.
>> But When dom0 is PVH, the gsis may not get registered(see below
>> clarification), it causes the info of apic, pin and irq not be
>> added into irq_2_pin list, and the handler of irq_desc is not set,
>> then when passthrough a device, setting ioapic affinity and vector
>> will fail.
>>
>> To fix above problem, on Linux kernel side, a new code will
>> need to call PHYSDEVOP_setup_gsi for passthrough devices to
>> register gsi when dom0 is PVH.
>>
>> So, add PHYSDEVOP_setup_gsi into hvm_physdev_op for above
>> purpose.
>>
>> Clarify two questions:
>> First, why the gsi of devices belong to PVH dom0 can work?
>> Because when probe a driver to a normal device, it uses the normal
>> probe function of pci device, in its callstack, it requests irq
>> and unmask corresponding ioapic of gsi, then trap into xen and
>> register gsi finally.
>> Callstack is(on linux kernel side) pci_device_probe->
>> request_threaded_irq-> irq_startup-> __unmask_ioapic->
>> io_apic_write, then trap into xen hvmemul_do_io->
>> hvm_io_intercept-> hvm_process_io_intercept->
>> vioapic_write_indirect-> vioapic_hwdom_map_gsi-> mp_register_gsi.
>> So that the gsi can be registered.
>>
>> Second, why the gsi of passthrough device can't work when dom0
>> is PVH?
>> Because when assign a device to passthrough, it uses the specific
>> probe function of pciback, in its callstack, it doesn't install a
>> fake irq handler due to the ISR is not running. So that
>> mp_register_gsi on Xen side is never called, then the gsi is not
>> registered.
>> Callstack is(on linux kernel side) pcistub_probe->pcistub_seize->
>> pcistub_init_device-> xen_pcibk_reset_device->
>> xen_pcibk_control_isr->isr_on==0.
> 
> So: Underlying XSA-461 was the observation that the very limited set of
> cases where this fake IRQ handler is installed is an issue. The problem
> of dealing with "false" IRQs when a line-based interrupt is shared
> between devices affects all parties, not just Dom0 and not just PV
> guests. Therefore an alternative to the introduction of a new hypercall
> would be to simply leverage that the installation of such a handler
> will need widening anyway.
> 
> However, the installation of said handler presently also occurs in
> cases where it's not really needed - when the line isn't shared. Thus,
> if the handler registration would also be eliminated when it's not
> really needed, we'd be back to needing a separate hypercall.
> 
> So I think first of all it needs deciding what is going to be done in
> Linux, at least in pciback (as here we care about the Dom0 case only).
Agreed. So the current options are either to use a hypercall (PHYSDEVOP_setup_gsi) or to install the fake IRQ handler in pciback.
So we may need input from the maintainers on the Linux side.
Hi Stefano and Juergen, what are your opinions?

> 
> Jan
Stefano Stabellini Aug. 21, 2024, 12:16 a.m. UTC | #3
On Tue, 20 Aug 2024, Chen, Jiqian wrote:
> On 2024/8/19 17:16, Jan Beulich wrote:
> > On 16.08.2024 13:08, Jiqian Chen wrote:
> >> The gsi of a passthrough device must be configured for it to be
> >> able to be mapped into a hvm domU.
> >> But When dom0 is PVH, the gsis may not get registered(see below
> >> clarification), it causes the info of apic, pin and irq not be
> >> added into irq_2_pin list, and the handler of irq_desc is not set,
> >> then when passthrough a device, setting ioapic affinity and vector
> >> will fail.
> >>
> >> To fix above problem, on Linux kernel side, a new code will
> >> need to call PHYSDEVOP_setup_gsi for passthrough devices to
> >> register gsi when dom0 is PVH.
> >>
> >> So, add PHYSDEVOP_setup_gsi into hvm_physdev_op for above
> >> purpose.
> >>
> >> Clarify two questions:
> >> First, why the gsi of devices belong to PVH dom0 can work?
> >> Because when probe a driver to a normal device, it uses the normal
> >> probe function of pci device, in its callstack, it requests irq
> >> and unmask corresponding ioapic of gsi, then trap into xen and
> >> register gsi finally.
> >> Callstack is(on linux kernel side) pci_device_probe->
> >> request_threaded_irq-> irq_startup-> __unmask_ioapic->
> >> io_apic_write, then trap into xen hvmemul_do_io->
> >> hvm_io_intercept-> hvm_process_io_intercept->
> >> vioapic_write_indirect-> vioapic_hwdom_map_gsi-> mp_register_gsi.
> >> So that the gsi can be registered.
> >>
> >> Second, why the gsi of passthrough device can't work when dom0
> >> is PVH?
> >> Because when assign a device to passthrough, it uses the specific
> >> probe function of pciback, in its callstack, it doesn't install a
> >> fake irq handler due to the ISR is not running. So that
> >> mp_register_gsi on Xen side is never called, then the gsi is not
> >> registered.
> >> Callstack is(on linux kernel side) pcistub_probe->pcistub_seize->
> >> pcistub_init_device-> xen_pcibk_reset_device->
> >> xen_pcibk_control_isr->isr_on==0.
> > 
> > So: Underlying XSA-461 was the observation that the very limited set of
> > cases where this fake IRQ handler is installed is an issue. The problem
> > of dealing with "false" IRQs when a line-based interrupt is shared
> > between devices affects all parties, not just Dom0 and not just PV
> > guests. Therefore an alternative to the introduction of a new hypercall
> > would be to simply leverage that the installation of such a handler
> > will need widening anyway.
> > 
> > However, the installation of said handler presently also occurs in
> > cases where it's not really needed - when the line isn't shared. Thus,
> > if the handler registration would also be eliminated when it's not
> > really needed, we'd be back to needing a separate hypercall.
> > 
> > So I think first of all it needs deciding what is going to be done in
> > Linux, at least in pciback (as here we care about the Dom0 case only).
> Agree, so the current options are either to use hypercall (PHYSDEVOP_setup_gsi) or to install fake IRQ handler in pciback.
> So, we may need the inputs from the Maintainers on Linux side.
> Hi Stefano and Juergen, what about your opinions?

I would go with the PHYSDEVOP_setup_gsi solution
Chen, Jiqian Aug. 26, 2024, 6:05 a.m. UTC | #4
On 2024/8/21 08:16, Stefano Stabellini wrote:
> On Tue, 20 Aug 2024, Chen, Jiqian wrote:
>> On 2024/8/19 17:16, Jan Beulich wrote:
>>> On 16.08.2024 13:08, Jiqian Chen wrote:
>>>> The gsi of a passthrough device must be configured for it to be
>>>> able to be mapped into a hvm domU.
>>>> But When dom0 is PVH, the gsis may not get registered(see below
>>>> clarification), it causes the info of apic, pin and irq not be
>>>> added into irq_2_pin list, and the handler of irq_desc is not set,
>>>> then when passthrough a device, setting ioapic affinity and vector
>>>> will fail.
>>>>
>>>> To fix above problem, on Linux kernel side, a new code will
>>>> need to call PHYSDEVOP_setup_gsi for passthrough devices to
>>>> register gsi when dom0 is PVH.
>>>>
>>>> So, add PHYSDEVOP_setup_gsi into hvm_physdev_op for above
>>>> purpose.
>>>>
>>>> Clarify two questions:
>>>> First, why the gsi of devices belong to PVH dom0 can work?
>>>> Because when probe a driver to a normal device, it uses the normal
>>>> probe function of pci device, in its callstack, it requests irq
>>>> and unmask corresponding ioapic of gsi, then trap into xen and
>>>> register gsi finally.
>>>> Callstack is(on linux kernel side) pci_device_probe->
>>>> request_threaded_irq-> irq_startup-> __unmask_ioapic->
>>>> io_apic_write, then trap into xen hvmemul_do_io->
>>>> hvm_io_intercept-> hvm_process_io_intercept->
>>>> vioapic_write_indirect-> vioapic_hwdom_map_gsi-> mp_register_gsi.
>>>> So that the gsi can be registered.
>>>>
>>>> Second, why the gsi of passthrough device can't work when dom0
>>>> is PVH?
>>>> Because when assign a device to passthrough, it uses the specific
>>>> probe function of pciback, in its callstack, it doesn't install a
>>>> fake irq handler due to the ISR is not running. So that
>>>> mp_register_gsi on Xen side is never called, then the gsi is not
>>>> registered.
>>>> Callstack is(on linux kernel side) pcistub_probe->pcistub_seize->
>>>> pcistub_init_device-> xen_pcibk_reset_device->
>>>> xen_pcibk_control_isr->isr_on==0.
>>>
>>> So: Underlying XSA-461 was the observation that the very limited set of
>>> cases where this fake IRQ handler is installed is an issue. The problem
>>> of dealing with "false" IRQs when a line-based interrupt is shared
>>> between devices affects all parties, not just Dom0 and not just PV
>>> guests. Therefore an alternative to the introduction of a new hypercall
>>> would be to simply leverage that the installation of such a handler
>>> will need widening anyway.
>>>
>>> However, the installation of said handler presently also occurs in
>>> cases where it's not really needed - when the line isn't shared. Thus,
>>> if the handler registration would also be eliminated when it's not
>>> really needed, we'd be back to needing a separate hypercall.
>>>
>>> So I think first of all it needs deciding what is going to be done in
>>> Linux, at least in pciback (as here we care about the Dom0 case only).
>> Agree, so the current options are either to use hypercall (PHYSDEVOP_setup_gsi) or to install fake IRQ handler in pciback.
>> So, we may need the inputs from the Maintainers on Linux side.
>> Hi Stefano and Juergen, what about your opinions?
> 
> I would go with the PHYSDEVOP_setup_gsi solution

Thanks, Stefano.

If we use the PHYSDEVOP_setup_gsi solution, it needs the opinion of the maintainers of this file.
Hi Jan, Andrew and Roger, is that okay with you?
Jan Beulich Aug. 26, 2024, 7:35 a.m. UTC | #5
On 16.08.2024 13:08, Jiqian Chen wrote:
> The gsi of a passthrough device must be configured for it to be
> able to be mapped into a hvm domU.
> But When dom0 is PVH, the gsis may not get registered(see below
> clarification), it causes the info of apic, pin and irq not be
> added into irq_2_pin list, and the handler of irq_desc is not set,
> then when passthrough a device, setting ioapic affinity and vector
> will fail.
> 
> To fix above problem, on Linux kernel side, a new code will
> need to call PHYSDEVOP_setup_gsi for passthrough devices to
> register gsi when dom0 is PVH.
> 
> So, add PHYSDEVOP_setup_gsi into hvm_physdev_op for above
> purpose.
> 
> Clarify two questions:
> First, why the gsi of devices belong to PVH dom0 can work?
> Because when probe a driver to a normal device, it uses the normal
> probe function of pci device, in its callstack, it requests irq
> and unmask corresponding ioapic of gsi, then trap into xen and
> register gsi finally.
> Callstack is(on linux kernel side) pci_device_probe->
> request_threaded_irq-> irq_startup-> __unmask_ioapic->
> io_apic_write, then trap into xen hvmemul_do_io->
> hvm_io_intercept-> hvm_process_io_intercept->
> vioapic_write_indirect-> vioapic_hwdom_map_gsi-> mp_register_gsi.
> So that the gsi can be registered.
> 
> Second, why the gsi of passthrough device can't work when dom0
> is PVH?
> Because when assign a device to passthrough, it uses the specific
> probe function of pciback, in its callstack, it doesn't install a
> fake irq handler due to the ISR is not running. So that
> mp_register_gsi on Xen side is never called, then the gsi is not
> registered.
> Callstack is(on linux kernel side) pcistub_probe->pcistub_seize->
> pcistub_init_device-> xen_pcibk_reset_device->
> xen_pcibk_control_isr->isr_on==0.
> 
> Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com>
> Signed-off-by: Huang Rui <ray.huang@amd.com>
> Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com>

Acked-by: Jan Beulich <jbeulich@suse.com>

Patch

diff --git a/xen/arch/x86/hvm/hypercall.c b/xen/arch/x86/hvm/hypercall.c
index 0b7fc060b4e2..81883c8d4f60 100644
--- a/xen/arch/x86/hvm/hypercall.c
+++ b/xen/arch/x86/hvm/hypercall.c
@@ -82,6 +82,7 @@  long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
             return -ENOSYS;
         break;
 
+    case PHYSDEVOP_setup_gsi:
     case PHYSDEVOP_pci_mmcfg_reserved:
     case PHYSDEVOP_pci_device_add:
     case PHYSDEVOP_pci_device_remove:
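
The one-line change works because the new label falls through into the
group of cases that follows it. Below is a simplified user-space model
of that allow-list pattern, not the Xen function itself. The check that
the grouped cases share is not visible in the hunk; the model assumes
it is a hardware-domain check returning -ENOSYS, which is a guess based
on the patch title. The MOCK_* command numbers and
mock_hvm_physdev_op() are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>

/* Placeholder command numbers for the sketch, not Xen's real values. */
enum mock_physdev_cmd {
    MOCK_setup_gsi,
    MOCK_pci_mmcfg_reserved,
    MOCK_pci_device_add,
    MOCK_pci_device_remove,
    MOCK_unlisted,
};

/*
 * Model of an allow-list dispatcher: listed commands are forwarded only
 * for the hardware domain, everything else is rejected.
 */
static long mock_hvm_physdev_op(int cmd, bool is_hardware_domain)
{
    switch (cmd) {
    case MOCK_setup_gsi:            /* the label the patch adds */
    case MOCK_pci_mmcfg_reserved:
    case MOCK_pci_device_add:
    case MOCK_pci_device_remove:
        /* Assumed shared check; not shown in the hunk above. */
        if (!is_hardware_domain)
            return -ENOSYS;
        break;

    default:
        return -ENOSYS;
    }

    return 0; /* stands in for forwarding to the common handler */
}
```

Under these assumptions the new command is accepted from dom0 and
still rejected for any other HVM or PVH caller, the same as the PCI
commands it is grouped with.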