
[RFC,2/6] xen/public: arch-arm: reserve resources for virtio-pci

Message ID 20231115112611.3865905-3-Sergiy_Kibrik@epam.com (mailing list archive)
State Superseded
Series ARM virtio-pci initial support

Commit Message

Sergiy Kibrik Nov. 15, 2023, 11:26 a.m. UTC
From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>

In order to enable more use-cases such as having multiple
device-models (Qemu) running in different backend domains which provide
virtio-pci devices for the same guest, we allocate and expose one
PCI host bridge for every virtio backend domain for that guest.

For that purpose, reserve separate virtio-pci resources (memory and an SPI
range for Legacy PCI interrupts) for up to 8 possible PCI hosts (to be
aligned with MAX_NR_IOREQ_SERVERS) and allocate one host per backend domain.
The PCI host details, including its host_id, are written to a dedicated
Xenstore node for the device-model to retrieve.

Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Signed-off-by: Sergiy Kibrik <Sergiy_Kibrik@epam.com>
---
 xen/include/public/arch-arm.h | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

Comments

Julien Grall Nov. 15, 2023, 12:33 p.m. UTC | #1
Hi,

Thanks for adding support for virtio-pci in Xen. I have some questions.

On 15/11/2023 11:26, Sergiy Kibrik wrote:
> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> 
> In order to enable more use-cases such as having multiple
> device-models (Qemu) running in different backend domains which provide
> virtio-pci devices for the same guest, we allocate and expose one
> PCI host bridge for every virtio backend domain for that guest.

OOI, why do you need to expose one PCI host bridge for every stubdomain?

In fact, looking at the next patch, it seems you are handling some of the 
hostbridge requests in Xen. This adds a bit more confusion.

I was expecting the virtual PCI device would be in the vPCI and each 
Device emulator would advertise which BDF they are covering.

> 
> For that purpose, reserve separate virtio-pci resources (memory and SPI range
> for Legacy PCI interrupts) up to 8 possible PCI hosts (to be aligned with

Do you mean host bridge rather than host?

> MAX_NR_IOREQ_SERVERS) and allocate a host per backend domain. The PCI host
> details including its host_id to be written to dedicated Xenstore node for
> the device-model to retrieve.

So with this approach, who decides which BDF will be used for a given 
virtio PCI device?

> 
> Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> Signed-off-by: Sergiy Kibrik <Sergiy_Kibrik@epam.com>
> ---
>   xen/include/public/arch-arm.h | 21 +++++++++++++++++++++
>   1 file changed, 21 insertions(+)
> 
> diff --git a/xen/include/public/arch-arm.h b/xen/include/public/arch-arm.h
> index a25e87dbda..e6c9cd5335 100644
> --- a/xen/include/public/arch-arm.h
> +++ b/xen/include/public/arch-arm.h
> @@ -466,6 +466,19 @@ typedef uint64_t xen_callback_t;
>   #define GUEST_VPCI_MEM_ADDR                 xen_mk_ullong(0x23000000)
>   #define GUEST_VPCI_MEM_SIZE                 xen_mk_ullong(0x10000000)
>   
> +/*
> + * 16 MB is reserved for virtio-pci configuration space based on calculation
> + * 8 bridges * 2 buses x 32 devices x 8 functions x 4 KB = 16 MB

Can you explain how you decided on the "2"?

> + */
> +#define GUEST_VIRTIO_PCI_ECAM_BASE          xen_mk_ullong(0x33000000)
> +#define GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE    xen_mk_ullong(0x01000000)
> +#define GUEST_VIRTIO_PCI_HOST_ECAM_SIZE     xen_mk_ullong(0x00200000)
> +
> +/* 64 MB is reserved for virtio-pci memory */
> +#define GUEST_VIRTIO_PCI_ADDR_TYPE_MEM    xen_mk_ullong(0x02000000)
> +#define GUEST_VIRTIO_PCI_MEM_ADDR         xen_mk_ullong(0x34000000)
> +#define GUEST_VIRTIO_PCI_MEM_SIZE         xen_mk_ullong(0x04000000)
> +
>   /*
>    * 16MB == 4096 pages reserved for guest to use as a region to map its
>    * grant table in.
> @@ -476,6 +489,11 @@ typedef uint64_t xen_callback_t;
>   #define GUEST_MAGIC_BASE  xen_mk_ullong(0x39000000)
>   #define GUEST_MAGIC_SIZE  xen_mk_ullong(0x01000000)
>   
> +/* 64 MB is reserved for virtio-pci Prefetch memory */

This doesn't seem a lot depending on your use case. Can you detail how 
you came up with "64 MB"?

> +#define GUEST_VIRTIO_PCI_ADDR_TYPE_PREFETCH_MEM    xen_mk_ullong(0x42000000)
> +#define GUEST_VIRTIO_PCI_PREFETCH_MEM_ADDR         xen_mk_ullong(0x3a000000)
> +#define GUEST_VIRTIO_PCI_PREFETCH_MEM_SIZE         xen_mk_ullong(0x04000000)
> +
>   #define GUEST_RAM_BANKS   2
>   
>   /*
> @@ -515,6 +533,9 @@ typedef uint64_t xen_callback_t;
>   #define GUEST_VIRTIO_MMIO_SPI_FIRST   33
>   #define GUEST_VIRTIO_MMIO_SPI_LAST    43
>   
> +#define GUEST_VIRTIO_PCI_SPI_FIRST   44
> +#define GUEST_VIRTIO_PCI_SPI_LAST    76

Just to confirm, is this 4 interrupts per PCI host bridge? If so, can 
this be clarified in a comment?

Also, above you said that the host ID will be written to Xenstore. But 
who will decide which SPI belongs to which host bridge?

Cheers,
Oleksandr Tyshchenko Nov. 15, 2023, 4:51 p.m. UTC | #2
On 15.11.23 14:33, Julien Grall wrote:
> Hi,


Hello Julien

Let me please try to explain some bits.


> 
> Thanks for adding support for virtio-pci in Xen. I have some questions.
> 
> On 15/11/2023 11:26, Sergiy Kibrik wrote:
>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>
>> In order to enable more use-cases such as having multiple
>> device-models (Qemu) running in different backend domains which provide
>> virtio-pci devices for the same guest, we allocate and expose one
>> PCI host bridge for every virtio backend domain for that guest.
> 
> OOI, why do you need to expose one PCI host bridge for every stubdomain?
> 
> In fact looking at the next patch, it seems you are handling some of the 
> hostbridge request in Xen. This is adds a bit more confusion.
> 
> I was expecting the virtual PCI device would be in the vPCI and each 
> Device emulator would advertise which BDF they are covering.


This patch series only covers use-cases where the device emulator 
handles the *entire* PCI Host bridge and the PCI (virtio-pci) devices 
behind it (i.e. Qemu). Also, this patch series doesn't touch vPCI/PCI 
pass-through resources, handling, accounting, nothing. From the 
hypervisor we only need help to intercept the config space accesses that 
happen in the range [GUEST_VIRTIO_PCI_ECAM_BASE ... 
GUEST_VIRTIO_PCI_ECAM_BASE + GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE] and 
forward them to the linked device emulator (if any), that's all.
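
As an illustration of the interception described above, here is a minimal 
sketch (hypothetical helper name, not code taken from patch #5) of how a 
trapped ECAM access could be mapped to the owning host bridge, and hence 
to the linked ioreq server / device emulator:

    /* Illustrative only: each backend domain owns one 2 MB ECAM slice. */
    static int virtio_pci_host_from_gpa(paddr_t gpa, unsigned int *host_id)
    {
        if ( gpa < GUEST_VIRTIO_PCI_ECAM_BASE ||
             gpa >= GUEST_VIRTIO_PCI_ECAM_BASE +
                    GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE )
            return -EINVAL;

        *host_id = (gpa - GUEST_VIRTIO_PCI_ECAM_BASE) /
                   GUEST_VIRTIO_PCI_HOST_ECAM_SIZE;
        return 0;
    }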

It is not possible (with the current series) to run device emulators that
emulate only separate PCI (virtio-pci) devices. For it to be possible, I 
think, many more changes are required than the current patch series makes. 
There should at least be special PCI Host bridge emulation in Xen (or 
vPCI should be reused) for the integration. Also, Xen should be in charge 
of forming the resulting PCI interrupt based on each PCI device's level 
signaling (if we use legacy interrupts), some kind of x86's 
XEN_DMOP_set_pci_intx_level, etc. Please note, I am not saying this is 
not possible in general (likely it is possible), but the initial patch 
series doesn't cover these use-cases.

We expose one PCI host bridge per virtio backend domain. This is a 
separate PCI host bridge to combine all virtio-pci devices running in 
the same backend domain (in the same device emulator currently).
Some examples:
- if only one domain runs Qemu which serves virtio-blk, virtio-net and 
virtio-console devices for DomU - only a single PCI Host bridge will be 
exposed for DomU
- if we add another domain running Qemu to additionally serve virtio-gpu, 
virtio-input and virtio-snd for the *same* DomU - we expose a second PCI 
Host bridge for DomU

I am afraid we cannot end up exposing only a single PCI Host bridge with 
the current model (if we use device emulators running in different domains 
that handle the *entire* PCI Host bridges), this won't work.

Please note, I might be missing some bits since this is enabling work.


> 
>>
>> For that purpose, reserve separate virtio-pci resources (memory and 
>> SPI range
>> for Legacy PCI interrupts) up to 8 possible PCI hosts (to be aligned with
> 
> Do you mean host bridge rather than host?

yes


> 
>> MAX_NR_IOREQ_SERVERS) and allocate a host per backend domain. The PCI 
>> host
>> details including its host_id to be written to dedicated Xenstore node 
>> for
>> the device-model to retrieve.
> 
> So which with approach, who is decide which BDF will be used for a given 
> virtio PCI device?


toolstack (via configuration file)

> 
>>
>> Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>> Signed-off-by: Sergiy Kibrik <Sergiy_Kibrik@epam.com>
>> ---
>>   xen/include/public/arch-arm.h | 21 +++++++++++++++++++++
>>   1 file changed, 21 insertions(+)
>>
>> diff --git a/xen/include/public/arch-arm.h 
>> b/xen/include/public/arch-arm.h
>> index a25e87dbda..e6c9cd5335 100644
>> --- a/xen/include/public/arch-arm.h
>> +++ b/xen/include/public/arch-arm.h
>> @@ -466,6 +466,19 @@ typedef uint64_t xen_callback_t;
>>   #define GUEST_VPCI_MEM_ADDR                 xen_mk_ullong(0x23000000)
>>   #define GUEST_VPCI_MEM_SIZE                 xen_mk_ullong(0x10000000)
>> +/*
>> + * 16 MB is reserved for virtio-pci configuration space based on 
>> calculation
>> + * 8 bridges * 2 buses x 32 devices x 8 functions x 4 KB = 16 MB
> 
> Can you explain how youd ecided the "2"?

good question, we have limited free space available in the memory layout 
(we had difficulties finding suitable holes) and we also don't expect a 
lot of virtio-pci devices, so the "256" buses used by vPCI would be too 
much. It was decided to reduce this significantly, but to select the 
maximum that fits into the free space; with "2" buses we still fit into 
the chosen holes.
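
For reference, the sizing above can be cross-checked against the constants 
introduced by this patch; a sketch (the check function itself is 
illustrative and would live in some init path):

    /* 2 buses x 32 devices x 8 functions x 4 KB = 2 MB per host bridge;
     * 8 host bridges x 2 MB = 16 MB total ECAM window. */
    static void __init virtio_pci_layout_check(void)
    {
        BUILD_BUG_ON(2UL * 32 * 8 * 0x1000 !=
                     GUEST_VIRTIO_PCI_HOST_ECAM_SIZE);
        BUILD_BUG_ON(8 * GUEST_VIRTIO_PCI_HOST_ECAM_SIZE !=
                     GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE);
    }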


> 
>> + */
>> +#define GUEST_VIRTIO_PCI_ECAM_BASE          xen_mk_ullong(0x33000000)
>> +#define GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE    xen_mk_ullong(0x01000000)
>> +#define GUEST_VIRTIO_PCI_HOST_ECAM_SIZE     xen_mk_ullong(0x00200000)
>> +
>> +/* 64 MB is reserved for virtio-pci memory */
>> +#define GUEST_VIRTIO_PCI_ADDR_TYPE_MEM    xen_mk_ullong(0x02000000)
>> +#define GUEST_VIRTIO_PCI_MEM_ADDR         xen_mk_ullong(0x34000000)
>> +#define GUEST_VIRTIO_PCI_MEM_SIZE         xen_mk_ullong(0x04000000)
>> +
>>   /*
>>    * 16MB == 4096 pages reserved for guest to use as a region to map its
>>    * grant table in.
>> @@ -476,6 +489,11 @@ typedef uint64_t xen_callback_t;
>>   #define GUEST_MAGIC_BASE  xen_mk_ullong(0x39000000)
>>   #define GUEST_MAGIC_SIZE  xen_mk_ullong(0x01000000)
>> +/* 64 MB is reserved for virtio-pci Prefetch memory */
> 
> This doesn't seem a lot depending on your use case. Can you details how 
> you can up with "64 MB"?

the same calculation as was done for the configuration space. It was 
observed that only 16K is used per virtio-pci device (maybe it can be 
bigger for a usual PCI device, I don't know). Please look at the example 
DomU log below (at the lines that contain "*BAR 4: assigned*"):


> root@h3ulcb:~# dmesg | grep pci
> [    0.444163] pci-host-generic 33000000.pcie: host bridge /pcie@33000000 ranges:
> [    0.444257] pci-host-generic 33000000.pcie:      MEM 0x0034000000..0x00347fffff -> 0x0034000000
> [    0.444304] pci-host-generic 33000000.pcie:      MEM 0x003a000000..0x003a7fffff -> 0x003a000000
> [    0.444396] pci-host-generic 33000000.pcie: ECAM at [mem 0x33000000-0x331fffff] for [bus 00-01]
> [    0.444595] pci-host-generic 33000000.pcie: PCI host bridge to bus 0000:00
> [    0.444635] pci_bus 0000:00: root bus resource [bus 00-01]
> [    0.444662] pci_bus 0000:00: root bus resource [mem 0x34000000-0x347fffff]
> [    0.444692] pci_bus 0000:00: root bus resource [mem 0x3a000000-0x3a7fffff pref]
> [    0.445193] pci 0000:00:00.0: [1b36:0008] type 00 class 0x060000
> [    0.449493] pci 0000:00:08.0: [1af4:1042] type 00 class 0x010000                                                                                                                                        
> [    0.451760] pci 0000:00:08.0: reg 0x20: [mem 0x00000000-0x00003fff 64bit pref]                                                                                                                          
> [    0.455985] pci 0000:00:08.0: BAR 4: assigned [mem 0x3a000000-0x3a003fff 64bit pref]   
> [    0.456678] pci-host-generic 33200000.pcie: host bridge /pcie@33200000 ranges:
> [    0.456748] pci-host-generic 33200000.pcie:      MEM 0x0034800000..0x0034ffffff -> 0x0034800000
> [    0.456793] pci-host-generic 33200000.pcie:      MEM 0x003a800000..0x003affffff -> 0x003a800000
> [    0.456879] pci-host-generic 33200000.pcie: ECAM at [mem 0x33200000-0x333fffff] for [bus 00-01]
> [    0.457038] pci-host-generic 33200000.pcie: PCI host bridge to bus 0001:00
> [    0.457079] pci_bus 0001:00: root bus resource [bus 00-01]
> [    0.457106] pci_bus 0001:00: root bus resource [mem 0x34800000-0x34ffffff]
> [    0.457136] pci_bus 0001:00: root bus resource [mem 0x3a800000-0x3affffff pref]
> [    0.457808] pci 0001:00:00.0: [1b36:0008] type 00 class 0x060000
> [    0.461401] pci 0001:00:01.0: [1af4:1041] type 00 class 0x020000
> [    0.463537] pci 0001:00:01.0: reg 0x20: [mem 0x00000000-0x00003fff 64bit pref]
> [    0.468135] pci 0001:00:02.0: [1af4:1042] type 00 class 0x010000
> [    0.470185] pci 0001:00:02.0: reg 0x20: [mem 0x00000000-0x00003fff 64bit pref]
> [    0.474575] pci 0001:00:03.0: [1af4:1050] type 00 class 0x038000
> [    0.476534] pci 0001:00:03.0: reg 0x20: [mem 0x00000000-0x00003fff 64bit pref]
> [    0.480942] pci 0001:00:04.0: [1af4:1052] type 00 class 0x090000
> [    0.483096] pci 0001:00:04.0: reg 0x20: [mem 0x00000000-0x00003fff 64bit pref]
> [    0.487631] pci 0001:00:06.0: [1af4:1053] type 00 class 0x078000
> [    0.489693] pci 0001:00:06.0: reg 0x20: [mem 0x00000000-0x00003fff 64bit pref]
> [    0.493669] pci 0001:00:01.0: BAR 4: assigned [mem 0x3a800000-0x3a803fff 64bit pref]                                                                                                                    
> [    0.495840] pci 0001:00:02.0: BAR 4: assigned [mem 0x3a804000-0x3a807fff 64bit pref]                                                                                                                    
> [    0.496656] pci 0001:00:03.0: BAR 4: assigned [mem 0x3a808000-0x3a80bfff 64bit pref]                                                                                                                    
> [    0.497671] pci 0001:00:04.0: BAR 4: assigned [mem 0x3a80c000-0x3a80ffff 64bit pref]                                                                                                                    
> [    0.498320] pci 0001:00:06.0: BAR 4: assigned [mem 0x3a810000-0x3a813fff 64bit pref]      
> [    0.500732] virtio-pci 0000:00:08.0: enabling device (0000 -> 0002)
> [    0.509728] virtio-pci 0001:00:01.0: enabling device (0000 -> 0002)
> [    0.529757] virtio-pci 0001:00:02.0: enabling device (0000 -> 0002)
> [    0.550208] virtio-pci 0001:00:03.0: enabling device (0000 -> 0002)
> [    0.571012] virtio-pci 0001:00:04.0: enabling device (0000 -> 0002)
> [    0.591042] virtio-pci 0001:00:06.0: enabling device (0000 -> 0002)



> 
>> +#define GUEST_VIRTIO_PCI_ADDR_TYPE_PREFETCH_MEM    
>> xen_mk_ullong(0x42000000)
>> +#define GUEST_VIRTIO_PCI_PREFETCH_MEM_ADDR         
>> xen_mk_ullong(0x3a000000)
>> +#define GUEST_VIRTIO_PCI_PREFETCH_MEM_SIZE         
>> xen_mk_ullong(0x04000000)
>> +
>>   #define GUEST_RAM_BANKS   2
>>   /*
>> @@ -515,6 +533,9 @@ typedef uint64_t xen_callback_t;
>>   #define GUEST_VIRTIO_MMIO_SPI_FIRST   33
>>   #define GUEST_VIRTIO_MMIO_SPI_LAST    43
>> +#define GUEST_VIRTIO_PCI_SPI_FIRST   44
>> +#define GUEST_VIRTIO_PCI_SPI_LAST    76
> 
> Just to confirm this is 4 interrupts per PCI host bridge? If so, can 
> this be clarified in a comment?

yes (INTA - INTD)
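
A plausible per-host-bridge mapping, assuming 4 legacy interrupts 
(INTA - INTD) per host bridge indexed by host_id; the actual assignment 
is made by the toolstack, see the answer to the next question:

    /* Hypothetical: host_id is 0..7, intx is 0..3 (INTA..INTD). */
    static inline unsigned int virtio_pci_spi(unsigned int host_id,
                                              unsigned int intx)
    {
        return GUEST_VIRTIO_PCI_SPI_FIRST + host_id * 4 + intx;
    }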


> 
> Also, above you said that the host ID will be written to Xenstore. But 
> who will decide which SPI belongs to which host bridge?


the toolstack, which is in charge of PCI host bridge resource allocation 
(as we need to both properly create the PCI Host bridge device tree node 
and inform the device emulator about the PCI host bridge details). Please 
take a look at:

https://patchwork.kernel.org/project/xen-devel/patch/20231115112611.3865905-4-Sergiy_Kibrik@epam.com/
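
A sketch of the per-host-bridge carve-out the toolstack could derive from 
host_id under this layout (helper and structure names are hypothetical; 
the split is consistent with the example DomU log earlier in the thread, 
where each host bridge gets a 2 MB ECAM slice and 8 MB of each memory 
window):

    #include <stdint.h>
    /* GUEST_VIRTIO_PCI_* come from the public arch-arm.h above. */

    struct virtio_pci_host_res {
        uint64_t ecam_base, ecam_size;
        uint64_t mem_base, mem_size;
        uint64_t prefetch_base, prefetch_size;
    };

    /* Hypothetical toolstack-side helper: host_id is 0..7. */
    static void virtio_pci_host_resources(unsigned int host_id,
                                          struct virtio_pci_host_res *res)
    {
        res->ecam_size     = GUEST_VIRTIO_PCI_HOST_ECAM_SIZE;         /* 2 MB */
        res->ecam_base     = GUEST_VIRTIO_PCI_ECAM_BASE +
                             host_id * res->ecam_size;
        res->mem_size      = GUEST_VIRTIO_PCI_MEM_SIZE / 8;           /* 8 MB */
        res->mem_base      = GUEST_VIRTIO_PCI_MEM_ADDR +
                             host_id * res->mem_size;
        res->prefetch_size = GUEST_VIRTIO_PCI_PREFETCH_MEM_SIZE / 8;  /* 8 MB */
        res->prefetch_base = GUEST_VIRTIO_PCI_PREFETCH_MEM_ADDR +
                             host_id * res->prefetch_size;
    }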

> 
> Cheers,
>
Julien Grall Nov. 15, 2023, 5:31 p.m. UTC | #3
Hi Oleksandr,

On 15/11/2023 16:51, Oleksandr Tyshchenko wrote:
> 
> 
> On 15.11.23 14:33, Julien Grall wrote:
>> Hi,
> 
> 
> Hello Julien
> 
> Let me please try to explain some bits.
> 
> 
>>
>> Thanks for adding support for virtio-pci in Xen. I have some questions.
>>
>> On 15/11/2023 11:26, Sergiy Kibrik wrote:
>>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>>
>>> In order to enable more use-cases such as having multiple
>>> device-models (Qemu) running in different backend domains which provide
>>> virtio-pci devices for the same guest, we allocate and expose one
>>> PCI host bridge for every virtio backend domain for that guest.
>>
>> OOI, why do you need to expose one PCI host bridge for every stubdomain?
>>
>> In fact looking at the next patch, it seems you are handling some of the
>> hostbridge request in Xen. This is adds a bit more confusion.
>>
>> I was expecting the virtual PCI device would be in the vPCI and each
>> Device emulator would advertise which BDF they are covering.
> 
> 
> This patch series only covers use-cases where the device emulator
> handles the *entire* PCI Host bridge and PCI (virtio-pci) devices behind
> it (i.e. Qemu). Also this patch series doesn't touch vPCI/PCI
> pass-through resources, handling, accounting, nothing. 

I understood you want one Device Emulator to handle the entire PCI 
host bridge. But...

> From the
> hypervisor we only need a help to intercept the config space accesses
> happen in a range [GUEST_VIRTIO_PCI_ECAM_BASE ...
> GUEST_VIRTIO_PCI_ECAM_BASE + GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE] and
> forward them to the linked device emulator (if any), that's all.

... I really don't see why you need to add code in Xen to trap the 
region. If QEMU is dealing with the hostbridge, then it should be able 
to register the MMIO region and then do the translation.
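
For context, a device model can already ask Xen to forward an MMIO range 
to its ioreq server through libxendevicemodel; a minimal sketch of the 
approach suggested above (the wrapper name and error handling are 
illustrative):

    #include <xendevicemodel.h>

    /* The device model claims the whole ECAM window of its host bridge,
     * so all config space accesses get forwarded to it by the generic
     * ioreq machinery, without virtio-pci specific trapping code in Xen. */
    static int claim_ecam(xendevicemodel_handle *xdm, domid_t domid,
                          ioservid_t id, uint64_t base, uint64_t size)
    {
        return xendevicemodel_map_io_range_to_ioreq_server(xdm, domid, id,
                                                           1 /* MMIO */,
                                                           base,
                                                           base + size - 1);
    }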

> 
> It is not possible (with current series) to run device emulators what
> emulate only separate PCI (virtio-pci) devices. For it to be possible, I
> think, much more changes are required than current patch series does.
> There at least should be special PCI Host bridge emulation in Xen (or
> reuse vPCI) for the integration. Also Xen should be in charge of forming
> resulting PCI interrupt based on each PCI device level signaling (if we
> use legacy interrupts), some kind of x86's XEN_DMOP_set_pci_intx_level,
> etc. Please note, I am not saying this is not possible in general,
> likely it is possible, but initial patch series doesn't cover these
> use-cases)
> 
> We expose one PCI host bridge per virtio backend domain. This is a
> separate PCI host bridge to combine all virtio-pci devices running in
> the same backend domain (in the same device emulator currently).
> The examples:
> - if only one domain runs Qemu which servers virtio-blk, virtio-net,
> virtio-console devices for DomU - only single PCI Host bridge will be
> exposed for DomU
> - if we add another domain to run Qemu to serve additionally virtio-gpu,
> virtio-input and virtio-snd for the *same* DomU - we expose second PCI
> Host bridge for DomU
> 
> I am afraid, we cannot end up exposing only single PCI Host bridge with
> current model (if we use device emulators running in different domains
> that handles the *entire* PCI Host bridges), this won't work.

That makes sense and it is fine. But see above, I think only patch #2 is 
necessary for the hypervisor. Patch #5 should not be necessary at all.

[...]

>>> Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>> Signed-off-by: Sergiy Kibrik <Sergiy_Kibrik@epam.com>
>>> ---
>>>    xen/include/public/arch-arm.h | 21 +++++++++++++++++++++
>>>    1 file changed, 21 insertions(+)
>>>
>>> diff --git a/xen/include/public/arch-arm.h
>>> b/xen/include/public/arch-arm.h
>>> index a25e87dbda..e6c9cd5335 100644
>>> --- a/xen/include/public/arch-arm.h
>>> +++ b/xen/include/public/arch-arm.h
>>> @@ -466,6 +466,19 @@ typedef uint64_t xen_callback_t;
>>>    #define GUEST_VPCI_MEM_ADDR                 xen_mk_ullong(0x23000000)
>>>    #define GUEST_VPCI_MEM_SIZE                 xen_mk_ullong(0x10000000)
>>> +/*
>>> + * 16 MB is reserved for virtio-pci configuration space based on
>>> calculation
>>> + * 8 bridges * 2 buses x 32 devices x 8 functions x 4 KB = 16 MB
>>
>> Can you explain how youd ecided the "2"?
> 
> good question, we have a limited free space available in memory layout
> (we had difficulties to find a suitable holes) also we don't expect a
> lot of virtio-pci devices, so "256" used vPCI would be too much. It was
> decided to reduce significantly, but select maximum to fit into free
> space, with having "2" buses we still fit into the chosen holes.

If you don't expect a lot of virtio devices, then why do you need two 
buses? Wouldn't one be sufficient?

> 
> 
>>
>>> + */
>>> +#define GUEST_VIRTIO_PCI_ECAM_BASE          xen_mk_ullong(0x33000000)
>>> +#define GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE    xen_mk_ullong(0x01000000)
>>> +#define GUEST_VIRTIO_PCI_HOST_ECAM_SIZE     xen_mk_ullong(0x00200000)
>>> +
>>> +/* 64 MB is reserved for virtio-pci memory */
>>> +#define GUEST_VIRTIO_PCI_ADDR_TYPE_MEM    xen_mk_ullong(0x02000000)
>>> +#define GUEST_VIRTIO_PCI_MEM_ADDR         xen_mk_ullong(0x34000000)
>>> +#define GUEST_VIRTIO_PCI_MEM_SIZE         xen_mk_ullong(0x04000000)
>>> +
>>>    /*
>>>     * 16MB == 4096 pages reserved for guest to use as a region to map its
>>>     * grant table in.
>>> @@ -476,6 +489,11 @@ typedef uint64_t xen_callback_t;
>>>    #define GUEST_MAGIC_BASE  xen_mk_ullong(0x39000000)
>>>    #define GUEST_MAGIC_SIZE  xen_mk_ullong(0x01000000)
>>> +/* 64 MB is reserved for virtio-pci Prefetch memory */
>>
>> This doesn't seem a lot depending on your use case. Can you details how
>> you can up with "64 MB"?
> 
> the same calculation as it was done configuration space. It was observed
> that only 16K is used per virtio-pci device (maybe it can be bigger for
> usual PCI device, I don't know). Please look at the example of DomU log
> below (to strings that contain "*BAR 4: assigned*"):

What about virtio-gpu? I would expect a bit more memory is necessary for 
that use case.

In any case, what I am looking for is some explanation of the limits in 
the commit message. I don't particularly care about the exact limit 
because this is not part of a stable ABI.

Cheers,
Oleksandr Tyshchenko Nov. 15, 2023, 6:14 p.m. UTC | #4
On 15.11.23 19:31, Julien Grall wrote:
> Hi Oleksandr,


Hello Julien

> 
> On 15/11/2023 16:51, Oleksandr Tyshchenko wrote:
>>
>>
>> On 15.11.23 14:33, Julien Grall wrote:
>>> Hi,
>>
>>
>> Hello Julien
>>
>> Let me please try to explain some bits.
>>
>>
>>>
>>> Thanks for adding support for virtio-pci in Xen. I have some questions.
>>>
>>> On 15/11/2023 11:26, Sergiy Kibrik wrote:
>>>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>>>
>>>> In order to enable more use-cases such as having multiple
>>>> device-models (Qemu) running in different backend domains which provide
>>>> virtio-pci devices for the same guest, we allocate and expose one
>>>> PCI host bridge for every virtio backend domain for that guest.
>>>
>>> OOI, why do you need to expose one PCI host bridge for every stubdomain?
>>>
>>> In fact looking at the next patch, it seems you are handling some of the
>>> hostbridge request in Xen. This is adds a bit more confusion.
>>>
>>> I was expecting the virtual PCI device would be in the vPCI and each
>>> Device emulator would advertise which BDF they are covering.
>>
>>
>> This patch series only covers use-cases where the device emulator
>> handles the *entire* PCI Host bridge and PCI (virtio-pci) devices behind
>> it (i.e. Qemu). Also this patch series doesn't touch vPCI/PCI
>> pass-through resources, handling, accounting, nothing. 
> 
> I understood you want to one Device Emulator to handle the entire PCI 
> host bridge. But...
> 
>  From the
>> hypervisor we only need a help to intercept the config space accesses
>> happen in a range [GUEST_VIRTIO_PCI_ECAM_BASE ...
>> GUEST_VIRTIO_PCI_ECAM_BASE + GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE] and
>> forward them to the linked device emulator (if any), that's all.
> 
> ... I really don't see why you need to add code in Xen to trap the 
> region. If QEMU is dealing with the hostbridge, then it should be able 
> to register the MMIO region and then do the translation.


Hmm, sounds surprising I would say. Are you saying that unmodified Qemu 
will work if we drop #5? I think this wants to be re-checked (@Sergiy, 
can you please investigate?). If indeed so, then #5 will of course be 
dropped from the series (I would say, postponed until there are more 
use-cases).



> 
>>
>> It is not possible (with current series) to run device emulators what
>> emulate only separate PCI (virtio-pci) devices. For it to be possible, I
>> think, much more changes are required than current patch series does.
>> There at least should be special PCI Host bridge emulation in Xen (or
>> reuse vPCI) for the integration. Also Xen should be in charge of forming
>> resulting PCI interrupt based on each PCI device level signaling (if we
>> use legacy interrupts), some kind of x86's XEN_DMOP_set_pci_intx_level,
>> etc. Please note, I am not saying this is not possible in general,
>> likely it is possible, but initial patch series doesn't cover these
>> use-cases)
>>
>> We expose one PCI host bridge per virtio backend domain. This is a
>> separate PCI host bridge to combine all virtio-pci devices running in
>> the same backend domain (in the same device emulator currently).
>> The examples:
>> - if only one domain runs Qemu which servers virtio-blk, virtio-net,
>> virtio-console devices for DomU - only single PCI Host bridge will be
>> exposed for DomU
>> - if we add another domain to run Qemu to serve additionally virtio-gpu,
>> virtio-input and virtio-snd for the *same* DomU - we expose second PCI
>> Host bridge for DomU
>>
>> I am afraid, we cannot end up exposing only single PCI Host bridge with
>> current model (if we use device emulators running in different domains
>> that handles the *entire* PCI Host bridges), this won't work.
> 
> That makes sense and it is fine. But see above, I think only the #2 is 
> necessary for the hypervisor. Patch #5 should not be necessary at all.


Good, it should be re-checked without #5, for sure.


> 
> [...]
> 
>>>> Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>>> Signed-off-by: Sergiy Kibrik <Sergiy_Kibrik@epam.com>
>>>> ---
>>>>    xen/include/public/arch-arm.h | 21 +++++++++++++++++++++
>>>>    1 file changed, 21 insertions(+)
>>>>
>>>> diff --git a/xen/include/public/arch-arm.h
>>>> b/xen/include/public/arch-arm.h
>>>> index a25e87dbda..e6c9cd5335 100644
>>>> --- a/xen/include/public/arch-arm.h
>>>> +++ b/xen/include/public/arch-arm.h
>>>> @@ -466,6 +466,19 @@ typedef uint64_t xen_callback_t;
>>>>    #define GUEST_VPCI_MEM_ADDR                 
>>>> xen_mk_ullong(0x23000000)
>>>>    #define GUEST_VPCI_MEM_SIZE                 
>>>> xen_mk_ullong(0x10000000)
>>>> +/*
>>>> + * 16 MB is reserved for virtio-pci configuration space based on
>>>> calculation
>>>> + * 8 bridges * 2 buses x 32 devices x 8 functions x 4 KB = 16 MB
>>>
>>> Can you explain how youd ecided the "2"?
>>
>> good question, we have a limited free space available in memory layout
>> (we had difficulties to find a suitable holes) also we don't expect a
>> lot of virtio-pci devices, so "256" used vPCI would be too much. It was
>> decided to reduce significantly, but select maximum to fit into free
>> space, with having "2" buses we still fit into the chosen holes.
> 
> If you don't expect a lot of virtio devices, then why do you need two 
> buses? Wouldn't one be sufficient?


For now one would be sufficient I think. @Sergiy, if you reduce to a single 
bus here, don't forget to update the "bus-range" property in the 
device-tree node.




> 
>>
>>
>>>
>>>> + */
>>>> +#define GUEST_VIRTIO_PCI_ECAM_BASE          xen_mk_ullong(0x33000000)
>>>> +#define GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE    xen_mk_ullong(0x01000000)
>>>> +#define GUEST_VIRTIO_PCI_HOST_ECAM_SIZE     xen_mk_ullong(0x00200000)
>>>> +
>>>> +/* 64 MB is reserved for virtio-pci memory */
>>>> +#define GUEST_VIRTIO_PCI_ADDR_TYPE_MEM    xen_mk_ullong(0x02000000)
>>>> +#define GUEST_VIRTIO_PCI_MEM_ADDR         xen_mk_ullong(0x34000000)
>>>> +#define GUEST_VIRTIO_PCI_MEM_SIZE         xen_mk_ullong(0x04000000)
>>>> +
>>>>    /*
>>>>     * 16MB == 4096 pages reserved for guest to use as a region to 
>>>> map its
>>>>     * grant table in.
>>>> @@ -476,6 +489,11 @@ typedef uint64_t xen_callback_t;
>>>>    #define GUEST_MAGIC_BASE  xen_mk_ullong(0x39000000)
>>>>    #define GUEST_MAGIC_SIZE  xen_mk_ullong(0x01000000)
>>>> +/* 64 MB is reserved for virtio-pci Prefetch memory */
>>>
>>> This doesn't seem a lot depending on your use case. Can you details how
>>> you can up with "64 MB"?
>>
>> the same calculation as it was done configuration space. It was observed
>> that only 16K is used per virtio-pci device (maybe it can be bigger for
>> usual PCI device, I don't know). Please look at the example of DomU log
>> below (to strings that contain "*BAR 4: assigned*"):
> 
> What about virtio-gpu? I would expect a bit more memory is necessary for 
> that use case.


In the DomU log I provided in my last response, virtio-gpu was also 
present among the 5 virtio-pci devices and it worked at runtime:

[    0.474575] pci 0001:00:03.0: [1af4:1050] type 00 class 0x038000
[    0.476534] pci 0001:00:03.0: reg 0x20: [mem 0x00000000-0x00003fff 
64bit pref]
....
[    0.496656] pci 0001:00:03.0: BAR 4: assigned [mem 
0x3a808000-0x3a80bfff 64bit pref]
....
[    0.550208] virtio-pci 0001:00:03.0: enabling device (0000 -> 0002)
....
0001:00:03.0 Display controller: Red Hat, Inc. Virtio GPU (rev 01)

I guess it indeed needs more memory, but that is related to the I/O 
descriptors passed via the virtqueue at runtime.



> 
> Any case, what I am looking for is for some explanation in the commit 
> message of the limits. I don't particularly care about the exact limit 
> because this is not part of a stable ABI.

ok, sure


> 
> Cheers,
>
Julien Grall Nov. 15, 2023, 6:33 p.m. UTC | #5
Hi Oleksandr,

On 15/11/2023 18:14, Oleksandr Tyshchenko wrote:
> On 15.11.23 19:31, Julien Grall wrote:
>> On 15/11/2023 16:51, Oleksandr Tyshchenko wrote:
>>> On 15.11.23 14:33, Julien Grall wrote:
>>>> Thanks for adding support for virtio-pci in Xen. I have some questions.
>>>>
>>>> On 15/11/2023 11:26, Sergiy Kibrik wrote:
>>>>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>>>>
>>>>> In order to enable more use-cases such as having multiple
>>>>> device-models (Qemu) running in different backend domains which provide
>>>>> virtio-pci devices for the same guest, we allocate and expose one
>>>>> PCI host bridge for every virtio backend domain for that guest.
>>>>
>>>> OOI, why do you need to expose one PCI host bridge for every stubdomain?
>>>>
>>>> In fact looking at the next patch, it seems you are handling some of the
>>>> hostbridge request in Xen. This is adds a bit more confusion.
>>>>
>>>> I was expecting the virtual PCI device would be in the vPCI and each
>>>> Device emulator would advertise which BDF they are covering.
>>>
>>>
>>> This patch series only covers use-cases where the device emulator
>>> handles the *entire* PCI Host bridge and PCI (virtio-pci) devices behind
>>> it (i.e. Qemu). Also this patch series doesn't touch vPCI/PCI
>>> pass-through resources, handling, accounting, nothing.
>>
>> I understood you want to one Device Emulator to handle the entire PCI
>> host bridge. But...
>>
>>   From the
>>> hypervisor we only need a help to intercept the config space accesses
>>> happen in a range [GUEST_VIRTIO_PCI_ECAM_BASE ...
>>> GUEST_VIRTIO_PCI_ECAM_BASE + GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE] and
>>> forward them to the linked device emulator (if any), that's all.
>>
>> ... I really don't see why you need to add code in Xen to trap the
>> region. If QEMU is dealing with the hostbridge, then it should be able
>> to register the MMIO region and then do the translation.
> 
> 
> Hmm, sounds surprising I would say. Are you saying that unmodified Qemu
> will work if we drop #5?

I don't know if an unmodified QEMU will work. My point is I don't view 
the patch in Xen as necessary. You should be able to tell QEMU "here is the 
ECAM region, please emulate a hostbridge". QEMU will then register the 
region with Xen and all the accesses will be forwarded.

In the future we may need a patch similar to #5 if we want to have 
multiple DM using the same hostbridge. But this is a different 
discussion and the patch would need some rework.

The ioreq.c code was always meant to be generic and is used for every 
emulated MMIO. So you want to limit any change in it. Checking that the 
MMIO region belongs to the hostbridge and doing the translation is IMHO 
not a good idea to do in ioreq.c. Instead you want to do the conversion 
from MMIO to (sbdf, offset) in virtio_pci_mmio{read, write}(). So the job 
of ioreq.c is to simply find the correct Device Model and forward it.
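
To make that concrete, here is a sketch of the translation that could live 
in virtio_pci_mmio{read, write}() rather than in ioreq.c, using the 
standard ECAM encoding (helper name hypothetical):

    /* Standard ECAM: offset = bus << 20 | dev << 15 | fn << 12 | reg. */
    static void virtio_pci_decode_ecam(paddr_t gpa, unsigned int host_id,
                                       pci_sbdf_t *sbdf, unsigned int *reg)
    {
        paddr_t off = gpa - GUEST_VIRTIO_PCI_ECAM_BASE -
                      host_id * GUEST_VIRTIO_PCI_HOST_ECAM_SIZE;

        *sbdf = PCI_SBDF(0, off >> 20, (off >> 15) & 0x1f, (off >> 12) & 0x7);
        *reg  = off & 0xfff;
    }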

I also don't see why the feature is gated by has_vpci(). They are two 
distinct features (at least in your current model).

Cheers,
Oleksandr Tyshchenko Nov. 15, 2023, 7:38 p.m. UTC | #6
On 15.11.23 20:33, Julien Grall wrote:
> Hi Oleksandr,

Hello Julien


> 
> On 15/11/2023 18:14, Oleksandr Tyshchenko wrote:
>> On 15.11.23 19:31, Julien Grall wrote:
>>> On 15/11/2023 16:51, Oleksandr Tyshchenko wrote:
>>>> On 15.11.23 14:33, Julien Grall wrote:
>>>>> Thanks for adding support for virtio-pci in Xen. I have some 
>>>>> questions.
>>>>>
>>>>> On 15/11/2023 11:26, Sergiy Kibrik wrote:
>>>>>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>>>>>
>>>>>> In order to enable more use-cases such as having multiple
>>>>>> device-models (Qemu) running in different backend domains which 
>>>>>> provide
>>>>>> virtio-pci devices for the same guest, we allocate and expose one
>>>>>> PCI host bridge for every virtio backend domain for that guest.
>>>>>
>>>>> OOI, why do you need to expose one PCI host bridge for every 
>>>>> stubdomain?
>>>>>
>>>>> In fact looking at the next patch, it seems you are handling some 
>>>>> of the
>>>>> hostbridge request in Xen. This is adds a bit more confusion.
>>>>>
>>>>> I was expecting the virtual PCI device would be in the vPCI and each
>>>>> Device emulator would advertise which BDF they are covering.
>>>>
>>>>
>>>> This patch series only covers use-cases where the device emulator
>>>> handles the *entire* PCI Host bridge and PCI (virtio-pci) devices 
>>>> behind
>>>> it (i.e. Qemu). Also this patch series doesn't touch vPCI/PCI
>>>> pass-through resources, handling, accounting, nothing.
>>>
>>> I understood you want to one Device Emulator to handle the entire PCI
>>> host bridge. But...
>>>
>>>   From the
>>>> hypervisor we only need a help to intercept the config space accesses
>>>> happen in a range [GUEST_VIRTIO_PCI_ECAM_BASE ...
>>>> GUEST_VIRTIO_PCI_ECAM_BASE + GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE] and
>>>> forward them to the linked device emulator (if any), that's all.
>>>
>>> ... I really don't see why you need to add code in Xen to trap the
>>> region. If QEMU is dealing with the hostbridge, then it should be able
>>> to register the MMIO region and then do the translation.
>>
>>
>> Hmm, sounds surprising I would say. Are you saying that unmodified Qemu
>> will work if we drop #5?
> 
> I don't know if an unmodified QEMU will work. My point is I don't view 
> the patch in Xen necessary. You should be able to tell QEMU "here is the 
> ECAM region, please emulate an hostbridge". QEMU will then register the 
> region to Xen and all the accesses will be forwarded.
>
> In the future we may need a patch similar to #5 if we want to have 
> multiple DM using the same hostbridge. But this is a different 
> discussion and the patch would need some rework.


ok

> 
> The ioreq.c code was always meant to be generic and is always for every 
> emulated MMIO. So you want to limit any change in it. Checking the MMIO 
> region belongs to the hostbridge and doing the translation is IMHO not a 
> good idea to do in ioreq.c. Instead you want to do the conversion from 
> MMIO to (sbdf, offset) in virtio_pci_mmio{read, write}(). So the job of 
> ioreq.c is to simply find the correct Device Model and forward it.



Are you talking about virtio_pci_ioreq_server_get_addr() called from 
arch_ioreq_server_get_type_addr()? If so, and if I am not mistaken, the 
x86 code also checks which PCI device is targeted there.

But I am not against the suggestion, I agree with it.


> 
> I also don't see why the feature is gated by has_vcpi(). They are two 
> distinct features (at least in your current model).

yes, you are correct. In #5 the virtio-pci MMIO handlers are still 
registered in domain_vpci_init() (which is gated by has_vpci()), etc.


> 
> Cheers,
>
Julien Grall Nov. 15, 2023, 7:56 p.m. UTC | #7
Hi Oleksandr,

On 15/11/2023 19:38, Oleksandr Tyshchenko wrote:
>> The ioreq.c code was always meant to be generic and is always for every
>> emulated MMIO. So you want to limit any change in it. Checking the MMIO
>> region belongs to the hostbridge and doing the translation is IMHO not a
>> good idea to do in ioreq.c. Instead you want to do the conversion from
>> MMIO to (sbdf, offset) in virtio_pci_mmio{read, write}(). So the job of
>> ioreq.c is to simply find the correct Device Model and forward it.
> 
> 
> 
> Are you about virtio_pci_ioreq_server_get_addr() called from
> arch_ioreq_server_get_type_addr()? If so and if I am not mistaken the
> x86 also check what PCI device is targeted there.
Well yes. We can do better to avoid extra complexity for each MMIO.

Note that the x86 version is somewhat more light-weight.

Cheers,
Stefano Stabellini Nov. 15, 2023, 11:28 p.m. UTC | #8
+ Stewart, Vikram

On Wed, 15 Nov 2023, Oleksandr Tyshchenko wrote:
> On 15.11.23 14:33, Julien Grall wrote:
> > Thanks for adding support for virtio-pci in Xen. I have some questions.
> > 
> > On 15/11/2023 11:26, Sergiy Kibrik wrote:
> >> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> >>
> >> In order to enable more use-cases such as having multiple
> >> device-models (Qemu) running in different backend domains which provide
> >> virtio-pci devices for the same guest, we allocate and expose one
> >> PCI host bridge for every virtio backend domain for that guest.
> > 
> > OOI, why do you need to expose one PCI host bridge for every stubdomain?
> > 
> > In fact looking at the next patch, it seems you are handling some of the 
> > hostbridge request in Xen. This is adds a bit more confusion.
> > 
> > I was expecting the virtual PCI device would be in the vPCI and each 
> > Device emulator would advertise which BDF they are covering.
> 
> 
> This patch series only covers use-cases where the device emulator 
> handles the *entire* PCI Host bridge and PCI (virtio-pci) devices behind 
> it (i.e. Qemu). Also this patch series doesn't touch vPCI/PCI 
> pass-through resources, handling, accounting, nothing. From the 
> hypervisor we only need a help to intercept the config space accesses 
> happen in a range [GUEST_VIRTIO_PCI_ECAM_BASE ... 
> GUEST_VIRTIO_PCI_ECAM_BASE + GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE] and 
> forward them to the linked device emulator (if any), that's all.
> 
> It is not possible (with current series) to run device emulators what
> emulate only separate PCI (virtio-pci) devices. For it to be possible, I 
> think, much more changes are required than current patch series does. 
> There at least should be special PCI Host bridge emulation in Xen (or 
> reuse vPCI) for the integration. Also Xen should be in charge of forming 
> resulting PCI interrupt based on each PCI device level signaling (if we 
> use legacy interrupts), some kind of x86's XEN_DMOP_set_pci_intx_level, 
> etc. Please note, I am not saying this is not possible in general, 
> likely it is possible, but initial patch series doesn't cover these 
> use-cases)
>
> We expose one PCI host bridge per virtio backend domain. This is a 
> separate PCI host bridge to combine all virtio-pci devices running in 
> the same backend domain (in the same device emulator currently).
> The examples:
> - if only one domain runs Qemu which servers virtio-blk, virtio-net, 
> virtio-console devices for DomU - only single PCI Host bridge will be 
> exposed for DomU
> - if we add another domain to run Qemu to serve additionally virtio-gpu, 
> virtio-input and virtio-snd for the *same* DomU - we expose second PCI 
> Host bridge for DomU
> 
> I am afraid, we cannot end up exposing only single PCI Host bridge with 
> current model (if we use device emulators running in different domains 
> that handles the *entire* PCI Host bridges), this won't work.
 

We were discussing the topic of vPCI and Virtio PCI just this morning
with Stewart and Vikram. We also intend to make them work well together
in the next couple of months (great timing!!)

However, our thinking is to go with the other approach Julien
suggested: a single PCI Root Complex emulated in Xen by vPCI. QEMU would
register individual PCI devices against it.

Vikram, Stewart, please comment. Our understanding is that it should be
possible to make QEMU virtio-pci work with vPCI with relatively minor
efforts and AMD volunteers to do the work in the next couple of months
on the vPCI side.


Although it should be possible to make both approaches work at the same
time, given that it would seem that EPAM and AMD have very similar
requirements, I suggest we work together and collaborate on a single
approach going forward that works best for everyone.


Let me start by saying that if we can get away with it, I think that a
single PCI Root Complex in Xen would be best because it requires less
complexity. Why emulate 2/3 PCI Root Complexes if we can emulate only
one?

Stewart, you are deep into vPCI, what's your thinking?
Volodymyr Babchuk Nov. 16, 2023, 3:07 p.m. UTC | #9
Hi Stefano,

Stefano Stabellini <sstabellini@kernel.org> writes:

> + Stewart, Vikram
>
> On Wed, 15 Nov 2023, Oleksandr Tyshchenko wrote:
>> On 15.11.23 14:33, Julien Grall wrote:
>> > Thanks for adding support for virtio-pci in Xen. I have some questions.
>> > 
>> > On 15/11/2023 11:26, Sergiy Kibrik wrote:
>> >> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>> >>
>> >> In order to enable more use-cases such as having multiple
>> >> device-models (Qemu) running in different backend domains which provide
>> >> virtio-pci devices for the same guest, we allocate and expose one
>> >> PCI host bridge for every virtio backend domain for that guest.
>> > 
>> > OOI, why do you need to expose one PCI host bridge for every stubdomain?
>> > 
>> > In fact looking at the next patch, it seems you are handling some of the 
>> > hostbridge request in Xen. This is adds a bit more confusion.
>> > 
>> > I was expecting the virtual PCI device would be in the vPCI and each 
>> > Device emulator would advertise which BDF they are covering.
>> 
>> 
>> This patch series only covers use-cases where the device emulator 
>> handles the *entire* PCI Host bridge and PCI (virtio-pci) devices behind 
>> it (i.e. Qemu). Also this patch series doesn't touch vPCI/PCI 
>> pass-through resources, handling, accounting, nothing. From the 
>> hypervisor we only need a help to intercept the config space accesses 
>> happen in a range [GUEST_VIRTIO_PCI_ECAM_BASE ... 
>> GUEST_VIRTIO_PCI_ECAM_BASE + GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE] and 
>> forward them to the linked device emulator (if any), that's all.
>> 
>> It is not possible (with current series) to run device emulators what
>> emulate only separate PCI (virtio-pci) devices. For it to be possible, I 
>> think, much more changes are required than current patch series does. 
>> There at least should be special PCI Host bridge emulation in Xen (or 
>> reuse vPCI) for the integration. Also Xen should be in charge of forming 
>> resulting PCI interrupt based on each PCI device level signaling (if we 
>> use legacy interrupts), some kind of x86's XEN_DMOP_set_pci_intx_level, 
>> etc. Please note, I am not saying this is not possible in general, 
>> likely it is possible, but initial patch series doesn't cover these 
>> use-cases)
>>
>> We expose one PCI host bridge per virtio backend domain. This is a 
>> separate PCI host bridge to combine all virtio-pci devices running in 
>> the same backend domain (in the same device emulator currently).
>> The examples:
>> - if only one domain runs Qemu which servers virtio-blk, virtio-net, 
>> virtio-console devices for DomU - only single PCI Host bridge will be 
>> exposed for DomU
>> - if we add another domain to run Qemu to serve additionally virtio-gpu, 
>> virtio-input and virtio-snd for the *same* DomU - we expose second PCI 
>> Host bridge for DomU
>> 
>> I am afraid, we cannot end up exposing only single PCI Host bridge with 
>> current model (if we use device emulators running in different domains 
>> that handles the *entire* PCI Host bridges), this won't work.
>  
>
> We were discussing the topic of vPCI and Virtio PCI just this morning
> with Stewart and Vikram. We also intend to make them work well together
> in the next couple of months (great timing!!)
>
> However, our thinking is to go with the other approach Julien
> suggested: a single PCI Root Complex emulated in Xen by vPCI. QEMU would
> register individual PCI devices against it.
>
> Vikram, Stewart, please comment. Our understanding is that it should be
> possible to make QEMU virtio-pci work with vPCI with relatively minor
> efforts and AMD volunteers to do the work in the next couple of months
> on the vPCI side.
>
>
> Although it should be possible to make both approaches work at the same
> time, given that it would seem that EPAM and AMD have very similar
> requirements, I suggest we work together and collaborate on a single
> approach going forward that works best for everyone.
>
>
> Let me start by saying that if we can get away with it, I think that a
> single PCI Root Complex in Xen would be best because it requires less
> complexity. Why emulate 2/3 PCI Root Complexes if we can emulate only
> one?

Well, in fact we tried a similar setup; this was in the first version of
virtio-pci support. But we had a couple of issues with it. For
instance, this might conflict with PCI passthrough devices, with virtio
devices that have back-ends in different domains, etc. I am not saying
that this is impossible, but it just involves more moving parts.

With my vPCI patch series in place, the hypervisor itself assigns BDFs for
passed-through devices. The toolstack needs to get this information to know
which BDFs are free and can be used by virtio-pci.
Julien Grall Nov. 16, 2023, 3:12 p.m. UTC | #10
Hi Volodymyr,

On 16/11/2023 15:07, Volodymyr Babchuk wrote:
> With my vPCI patch series in place, hypervisor itself assigns BDFs for
> passed-through devices. Toolstack needs to get this information to know
> which BDFs are free and can be used by virtio-pci.

It sounds a bit odd to let the hypervisor assign the BDFs. At least 
because there might be cases where you want a specific vBDF (for 
instance, this is the case with some Intel graphics cards). It should be 
the toolstack's job to say "I want to assign this pBDF to this vBDF".

Do you have a link to the patch adding the logic in the hypervisor? I 
will comment there as well.

Cheers,
Stewart Hildebrand Nov. 16, 2023, 3:26 p.m. UTC | #11
On 11/16/23 10:12, Julien Grall wrote:
> Hi Volodymyr,
> 
> On 16/11/2023 15:07, Volodymyr Babchuk wrote:
>> With my vPCI patch series in place, hypervisor itself assigns BDFs for
>> passed-through devices. Toolstack needs to get this information to know
>> which BDFs are free and can be used by virtio-pci.
> 
> It sounds a bit odd to let the hypervisor to assign the BDFs. At least because there might be case where you want to specific vBDF (for instance this is the case with some intel graphic cards). This should be the toolstack job to say "I want to assign the pBDF to this vBDF".

Keep in mind we are also supporting dom0less PCI passthrough.

> 
> Do you have a link to the patch adding the logic in the hypervisor? I will comment there as well.

See add_virtual_device() in [1], specifically the line:

    pdev->vpci->guest_sbdf = PCI_SBDF(0, 0, new_dev_number, 0);

[1] https://lists.xenproject.org/archives/html/xen-devel/2023-10/msg00673.html
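
For reference, PCI_SBDF(0, 0, new_dev_number, 0) above encodes segment 0, 
bus 0, device new_dev_number, function 0; the conventional packing is 
sketched below (standard PCI layout, not necessarily the exact Xen macro 
expansion):

    /* Standard SBDF packing: seg[31:16] bus[15:8] dev[7:3] fn[2:0]. */
    static inline uint32_t sbdf_pack(unsigned int seg, unsigned int bus,
                                     unsigned int dev, unsigned int fn)
    {
        return (seg << 16) | (bus << 8) | (dev << 3) | fn;
    }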
Julien Grall Nov. 16, 2023, 3:58 p.m. UTC | #12
On 16/11/2023 15:26, Stewart Hildebrand wrote:
> On 11/16/23 10:12, Julien Grall wrote:
>> Hi Volodymyr,
>>
>> On 16/11/2023 15:07, Volodymyr Babchuk wrote:
>>> With my vPCI patch series in place, hypervisor itself assigns BDFs for
>>> passed-through devices. Toolstack needs to get this information to know
>>> which BDFs are free and can be used by virtio-pci.
>>
>> It sounds a bit odd to let the hypervisor to assign the BDFs. At least because there might be case where you want to specific vBDF (for instance this is the case with some intel graphic cards). This should be the toolstack job to say "I want to assign the pBDF to this vBDF".
> 
> Keep in mind we are also supporting dom0less PCI passthrough.
Right, but even with that in mind, I expect the Device-Tree to provide 
the details of where a given PCI device is assigned.

You could have logic for assigning the BDF automagically. But that 
should be part of dom0less, not deep into the vPCI code.

Cheers,
Volodymyr Babchuk Nov. 16, 2023, 4:53 p.m. UTC | #13
Hi Julien,

Julien Grall <julien@xen.org> writes:

> On 16/11/2023 15:26, Stewart Hildebrand wrote:
>> On 11/16/23 10:12, Julien Grall wrote:
>>> Hi Volodymyr,
>>>
>>> On 16/11/2023 15:07, Volodymyr Babchuk wrote:
>>>> With my vPCI patch series in place, hypervisor itself assigns BDFs for
>>>> passed-through devices. Toolstack needs to get this information to know
>>>> which BDFs are free and can be used by virtio-pci.
>>>
>>> It sounds a bit odd to let the hypervisor to assign the BDFs. At
>>> least because there might be case where you want to specific vBDF
>>> (for instance this is the case with some intel graphic cards). This
>>> should be the toolstack job to say "I want to assign the pBDF to
>>> this vBDF".
>> Keep in mind we are also supporting dom0less PCI passthrough.
> Right, but even with that in mind, I expect the Device-Tree to provide
> the details where a given PCI is assigned.
>
> You could have logic for assigning the BDF automagically. But that
> should be part of dom0less, not deep into the vPCI code.

As far as I know, right now the toolstack does not allow you to assign a
BDF in any form. For x86 there are two options, and they are controlled by
the "passthrough" option of the xen-pciback driver in Linux:

  "Option to specify how to export PCI topology to guest:"
  " 0 - (default) Hide the true PCI topology and makes the frontend"
  "   there is a single PCI bus with only the exported devices on it."
  "   For example, a device at 03:05.0 will be re-assigned to 00:00.0"
  "   while second device at 02:1a.1 will be re-assigned to 00:01.1."
  " 1 - Passthrough provides a real view of the PCI topology to the"
  "   frontend (for example, a device at 06:01.b will still appear at"
  "   06:01.b to the frontend). This is similar to how Xen 2.0.x"
  "   exposed PCI devices to its driver domains. This may be required"
  "   for drivers which depend on finding their hardward in certain"
  "   bus/slot locations.");

Also, doesn't a strict dependency on the BDF break the PCI specification?
I mean, of course, you can assign the Function at random, but what about
the Bus and Device parts?


I mean, we can make the toolstack responsible for assigning BDFs, but this
might break existing x86 setups...
Julien Grall Nov. 16, 2023, 5:27 p.m. UTC | #14
Hi Volodymyr,

On 16/11/2023 16:53, Volodymyr Babchuk wrote:
> Julien Grall <julien@xen.org> writes:
>> On 16/11/2023 15:26, Stewart Hildebrand wrote:
>>> On 11/16/23 10:12, Julien Grall wrote:
>>>> Hi Volodymyr,
>>>>
>>>> On 16/11/2023 15:07, Volodymyr Babchuk wrote:
>>>>> With my vPCI patch series in place, hypervisor itself assigns BDFs for
>>>>> passed-through devices. Toolstack needs to get this information to know
>>>>> which BDFs are free and can be used by virtio-pci.
>>>>
>>>> It sounds a bit odd to let the hypervisor to assign the BDFs. At
>>>> least because there might be case where you want to specific vBDF
>>>> (for instance this is the case with some intel graphic cards). This
>>>> should be the toolstack job to say "I want to assign the pBDF to
>>>> this vBDF".
>>> Keep in mind we are also supporting dom0less PCI passthrough.
>> Right, but even with that in mind, I expect the Device-Tree to provide
>> the details where a given PCI is assigned.
>>
>> You could have logic for assigning the BDF automagically. But that
>> should be part of dom0less, not deep into the vPCI code.
> 
> As far as I know, right now toolstack does not allow you to assign BDF
> in any form. For x86 there are two options, and they are controlled by
> "passthrough" option of xen-pciback driver in Linux:

Are you talking about HVM or PV? I am not very familiar with the latter, 
but for the former, QEMU is today responsible for finding a free slot.

You can also specify which virtual slot you want to use in XL:

=item B<vslot>=I<NUMBER>

=over 4

=item Description

Specifies the virtual slot (device) number where the guest will see this
device. For example, running L<lspci(1)> in a Linux guest where B<vslot>
was specified as C<8> would identify the device as C<00:08.0>. Virtual 
domain
and bus numbers are always 0.
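
For instance (based on the B<vslot> description above), a guest config 
fragment pinning a passed-through device to a fixed virtual slot might 
look like the sketch below (the BDF is illustrative):

    # xl domain config: expose host device 01:00.0 at guest 00:08.0
    pci = [ "01:00.0,vslot=8" ]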

> 
>    "Option to specify how to export PCI topology to guest:"
>    " 0 - (default) Hide the true PCI topology and makes the frontend"
>    "   there is a single PCI bus with only the exported devices on it."
>    "   For example, a device at 03:05.0 will be re-assigned to 00:00.0"
>    "   while second device at 02:1a.1 will be re-assigned to 00:01.1."
>    " 1 - Passthrough provides a real view of the PCI topology to the"
>    "   frontend (for example, a device at 06:01.b will still appear at"
>    "   06:01.b to the frontend). This is similar to how Xen 2.0.x"
>    "   exposed PCI devices to its driver domains. This may be required"
>    "   for drivers which depend on finding their hardward in certain"
>    "   bus/slot locations.");
> 
> Also, isn't strict dependency on BDF breaks the PCI specification? I
> mean, of course, you can assign Function on random, but what about Bus
> and Device parts?

I am not well versed in the PCI specification. However, what the spec 
says and what users do tend to be different :).

I know a few cases where you need to specify the slot. I have mentioned 
one earlier with the Intel graphics cards.

> I mean, we can make toolstack responsible of assigning BDFs, but this
> might break existing x86 setups...

It is not clear to me how letting the toolstack select the vslot 
would be a problem for HVM (see above). However, it is clear that you 
may break some x86 setups if you always allocate a "random" slot.

Cheers,
Stefano Stabellini Nov. 16, 2023, 11:04 p.m. UTC | #15
On Thu, 16 Nov 2023, Volodymyr Babchuk wrote:
> Hi Stefano,
> 
> Stefano Stabellini <sstabellini@kernel.org> writes:
> 
> > + Stewart, Vikram
> >
> > On Wed, 15 Nov 2023, Oleksandr Tyshchenko wrote:
> >> On 15.11.23 14:33, Julien Grall wrote:
> >> > Thanks for adding support for virtio-pci in Xen. I have some questions.
> >> > 
> >> > On 15/11/2023 11:26, Sergiy Kibrik wrote:
> >> >> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> >> >>
> >> >> In order to enable more use-cases such as having multiple
> >> >> device-models (Qemu) running in different backend domains which provide
> >> >> virtio-pci devices for the same guest, we allocate and expose one
> >> >> PCI host bridge for every virtio backend domain for that guest.
> >> > 
> >> > OOI, why do you need to expose one PCI host bridge for every stubdomain?
> >> > 
> >> > In fact looking at the next patch, it seems you are handling some of the 
> >> > hostbridge request in Xen. This is adds a bit more confusion.
> >> > 
> >> > I was expecting the virtual PCI device would be in the vPCI and each 
> >> > Device emulator would advertise which BDF they are covering.
> >> 
> >> 
> >> This patch series only covers use-cases where the device emulator 
> >> handles the *entire* PCI Host bridge and PCI (virtio-pci) devices behind 
> >> it (i.e. Qemu). Also this patch series doesn't touch vPCI/PCI 
> >> pass-through resources, handling, accounting, nothing. From the 
> >> hypervisor we only need a help to intercept the config space accesses 
> >> happen in a range [GUEST_VIRTIO_PCI_ECAM_BASE ... 
> >> GUEST_VIRTIO_PCI_ECAM_BASE + GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE] and 
> >> forward them to the linked device emulator (if any), that's all.
> >> 
> >> It is not possible (with current series) to run device emulators what
> >> emulate only separate PCI (virtio-pci) devices. For it to be possible, I 
> >> think, much more changes are required than current patch series does. 
> >> There at least should be special PCI Host bridge emulation in Xen (or 
> >> reuse vPCI) for the integration. Also Xen should be in charge of forming 
> >> resulting PCI interrupt based on each PCI device level signaling (if we 
> >> use legacy interrupts), some kind of x86's XEN_DMOP_set_pci_intx_level, 
> >> etc. Please note, I am not saying this is not possible in general, 
> >> likely it is possible, but initial patch series doesn't cover these 
> >> use-cases)
> >>
> >> We expose one PCI host bridge per virtio backend domain. This is a 
> >> separate PCI host bridge to combine all virtio-pci devices running in 
> >> the same backend domain (in the same device emulator currently).
> >> The examples:
> >> - if only one domain runs Qemu which servers virtio-blk, virtio-net, 
> >> virtio-console devices for DomU - only single PCI Host bridge will be 
> >> exposed for DomU
> >> - if we add another domain to run Qemu to serve additionally virtio-gpu, 
> >> virtio-input and virtio-snd for the *same* DomU - we expose second PCI 
> >> Host bridge for DomU
> >> 
> >> I am afraid, we cannot end up exposing only single PCI Host bridge with 
> >> current model (if we use device emulators running in different domains 
> >> that handles the *entire* PCI Host bridges), this won't work.
> >  
> >
> > We were discussing the topic of vPCI and Virtio PCI just this morning
> > with Stewart and Vikram. We also intend to make them work well together
> > in the next couple of months (great timing!!)
> >
> > However, our thinking is to go with the other approach Julien
> > suggested: a single PCI Root Complex emulated in Xen by vPCI. QEMU would
> > register individual PCI devices against it.
> >
> > Vikram, Stewart, please comment. Our understanding is that it should be
> > possible to make QEMU virtio-pci work with vPCI with relatively minor
> > efforts and AMD volunteers to do the work in the next couple of months
> > on the vPCI side.
> >
> >
> > Although it should be possible to make both approaches work at the same
> > time, given that it would seem that EPAM and AMD have very similar
> > requirements, I suggest we work together and collaborate on a single
> > approach going forward that works best for everyone.
> >
> >
> > Let me start by saying that if we can get away with it, I think that a
> > single PCI Root Complex in Xen would be best because it requires less
> > complexity. Why emulate 2/3 PCI Root Complexes if we can emulate only
> > one?
> 
> Well, in fact we tried similar setup, this was in the first version of
> virtio-pci support. But we had a couple of issues with this. For
> instance, this might conflict with PCI passthrough devices, with virtio
> devices that have back-ends in different domains, etc. I am no saying
> that this is impossible, but this just involves more moving parts.
> 
> With my vPCI patch series in place, hypervisor itself assigns BDFs for
> passed-through devices. Toolstack needs to get this information to know
> which BDFs are free and can be used by virtio-pci.

I'll premise that I don't really have an opinion on how the virtual BDF
allocation should happen.

But I'll ask the opposite of the question Julien asked: if it is fine
for Xen to do the allocation, then couldn't we arrange for Xen to also
do the allocation in the toolstack case (simply by picking the first
available virtual BDF)?

One way or the other it should be OK as long as we are consistent.
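
(To illustrate: picking the first available virtual BDF is cheap either
way. A minimal sketch with hypothetical names, assuming a single bus and
function 0 only -- not code from Xen or the toolstack:)

#include <stdint.h>
#include <stdbool.h>

#define VBDF_NR_DEVS 32                    /* one bus, 32 device slots */

struct vbdf_allocator {
    uint32_t used;                         /* bit n set => slot 00:n.0 taken */
};

/* Reserve a specific slot (e.g. the user asked for vslot=8). */
static bool vbdf_reserve(struct vbdf_allocator *a, unsigned int slot)
{
    if ( slot >= VBDF_NR_DEVS || (a->used & (1u << slot)) )
        return false;
    a->used |= 1u << slot;
    return true;
}

/* Pick the first available slot.  Returns -1 if the bus is full. */
static int vbdf_alloc_first_free(struct vbdf_allocator *a)
{
    for ( unsigned int slot = 0; slot < VBDF_NR_DEVS; slot++ )
        if ( vbdf_reserve(a, slot) )
            return slot;
    return -1;
}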
Volodymyr Babchuk Nov. 17, 2023, 12:23 a.m. UTC | #16
Hi Stefano,

Stefano Stabellini <sstabellini@kernel.org> writes:

> On Thu, 16 Nov 2023, Volodymyr Babchuk wrote:
>> Hi Stefano,
>> 
>> Stefano Stabellini <sstabellini@kernel.org> writes:
>> 
>> > + Stewart, Vikram
>> >
>> > On Wed, 15 Nov 2023, Oleksandr Tyshchenko wrote:
>> >> On 15.11.23 14:33, Julien Grall wrote:
>> >> > Thanks for adding support for virtio-pci in Xen. I have some questions.
>> >> > 
>> >> > On 15/11/2023 11:26, Sergiy Kibrik wrote:
>> >> >> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>> >> >>
>> >> >> In order to enable more use-cases such as having multiple
>> >> >> device-models (Qemu) running in different backend domains which provide
>> >> >> virtio-pci devices for the same guest, we allocate and expose one
>> >> >> PCI host bridge for every virtio backend domain for that guest.
>> >> > 
>> >> > OOI, why do you need to expose one PCI host bridge for every stubdomain?
>> >> > 
>> >> > In fact looking at the next patch, it seems you are handling some of the 
>> >> > hostbridge request in Xen. This is adds a bit more confusion.
>> >> > 
>> >> > I was expecting the virtual PCI device would be in the vPCI and each 
>> >> > Device emulator would advertise which BDF they are covering.
>> >> 
>> >> 
>> >> This patch series only covers use-cases where the device emulator 
>> >> handles the *entire* PCI Host bridge and PCI (virtio-pci) devices behind 
>> >> it (i.e. Qemu). Also this patch series doesn't touch vPCI/PCI 
>> >> pass-through resources, handling, accounting, nothing. From the 
>> >> hypervisor we only need a help to intercept the config space accesses 
>> >> happen in a range [GUEST_VIRTIO_PCI_ECAM_BASE ... 
>> >> GUEST_VIRTIO_PCI_ECAM_BASE + GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE] and 
>> >> forward them to the linked device emulator (if any), that's all.
>> >> 
>> >> It is not possible (with current series) to run device emulators what
>> >> emulate only separate PCI (virtio-pci) devices. For it to be possible, I 
>> >> think, much more changes are required than current patch series does. 
>> >> There at least should be special PCI Host bridge emulation in Xen (or 
>> >> reuse vPCI) for the integration. Also Xen should be in charge of forming 
>> >> resulting PCI interrupt based on each PCI device level signaling (if we 
>> >> use legacy interrupts), some kind of x86's XEN_DMOP_set_pci_intx_level, 
>> >> etc. Please note, I am not saying this is not possible in general, 
>> >> likely it is possible, but initial patch series doesn't cover these 
>> >> use-cases)
>> >>
>> >> We expose one PCI host bridge per virtio backend domain. This is a 
>> >> separate PCI host bridge to combine all virtio-pci devices running in 
>> >> the same backend domain (in the same device emulator currently).
>> >> The examples:
>> >> - if only one domain runs Qemu which servers virtio-blk, virtio-net, 
>> >> virtio-console devices for DomU - only single PCI Host bridge will be 
>> >> exposed for DomU
>> >> - if we add another domain to run Qemu to serve additionally virtio-gpu, 
>> >> virtio-input and virtio-snd for the *same* DomU - we expose second PCI 
>> >> Host bridge for DomU
>> >> 
>> >> I am afraid, we cannot end up exposing only single PCI Host bridge with 
>> >> current model (if we use device emulators running in different domains 
>> >> that handles the *entire* PCI Host bridges), this won't work.
>> >  
>> >
>> > We were discussing the topic of vPCI and Virtio PCI just this morning
>> > with Stewart and Vikram. We also intend to make them work well together
>> > in the next couple of months (great timing!!)
>> >
>> > However, our thinking is to go with the other approach Julien
>> > suggested: a single PCI Root Complex emulated in Xen by vPCI. QEMU would
>> > register individual PCI devices against it.
>> >
>> > Vikram, Stewart, please comment. Our understanding is that it should be
>> > possible to make QEMU virtio-pci work with vPCI with relatively minor
>> > efforts and AMD volunteers to do the work in the next couple of months
>> > on the vPCI side.
>> >
>> >
>> > Although it should be possible to make both approaches work at the same
>> > time, given that it would seem that EPAM and AMD have very similar
>> > requirements, I suggest we work together and collaborate on a single
>> > approach going forward that works best for everyone.
>> >
>> >
>> > Let me start by saying that if we can get away with it, I think that a
>> > single PCI Root Complex in Xen would be best because it requires less
>> > complexity. Why emulate 2/3 PCI Root Complexes if we can emulate only
>> > one?
>> 
>> Well, in fact we tried similar setup, this was in the first version of
>> virtio-pci support. But we had a couple of issues with this. For
>> instance, this might conflict with PCI passthrough devices, with virtio
>> devices that have back-ends in different domains, etc. I am no saying
>> that this is impossible, but this just involves more moving parts.
>> 
>> With my vPCI patch series in place, hypervisor itself assigns BDFs for
>> passed-through devices. Toolstack needs to get this information to know
>> which BDFs are free and can be used by virtio-pci.
>
> I'll premise that I don't really have an opinion on how the virtual BDF
> allocation should happen.
>
> But I'll ask the opposite question that Julien asked: if it is Xen that
> does the allocation, that's fine, then couldn't we arrange so that Xen
> also does the allocation in the toolstack case too (simply by picking
> the first available virtual BDF)?

Actually, this was my intention as well. As I said in another email,
we just need to extend or add another domctl to manage vBDFs.
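
Purely to illustrate what such a domctl could look like -- this
interface does not exist today, all names and the layout below are made
up for the sake of the discussion:

#include <stdint.h>

/* Hypothetical sketch -- not an existing Xen domctl. */
struct xen_domctl_vbdf_op {
#define XEN_DOMCTL_VBDF_OP_RESERVE 0   /* toolstack asks for a specific vBDF */
#define XEN_DOMCTL_VBDF_OP_ALLOC   1   /* Xen picks the first free vBDF */
#define XEN_DOMCTL_VBDF_OP_QUERY   2   /* return the vBDF assigned to sbdf */
    uint32_t op;
    uint32_t sbdf;   /* physical SBDF for passthrough, unused for virtio-pci */
    uint32_t vbdf;   /* IN for RESERVE, OUT for ALLOC/QUERY */
};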
Stewart Hildebrand Nov. 17, 2023, 3:31 a.m. UTC | #17
On 11/15/23 18:28, Stefano Stabellini wrote:
> + Stewart, Vikram
> 
> On Wed, 15 Nov 2023, Oleksandr Tyshchenko wrote:
>> On 15.11.23 14:33, Julien Grall wrote:
>>> Thanks for adding support for virtio-pci in Xen. I have some questions.
>>>
>>> On 15/11/2023 11:26, Sergiy Kibrik wrote:
>>>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>>>
>>>> In order to enable more use-cases such as having multiple
>>>> device-models (Qemu) running in different backend domains which provide
>>>> virtio-pci devices for the same guest, we allocate and expose one
>>>> PCI host bridge for every virtio backend domain for that guest.
>>>
>>> OOI, why do you need to expose one PCI host bridge for every stubdomain?
>>>
>>> In fact looking at the next patch, it seems you are handling some of the 
>>> hostbridge request in Xen. This is adds a bit more confusion.
>>>
>>> I was expecting the virtual PCI device would be in the vPCI and each 
>>> Device emulator would advertise which BDF they are covering.
>>
>>
>> This patch series only covers use-cases where the device emulator 
>> handles the *entire* PCI Host bridge and PCI (virtio-pci) devices behind 
>> it (i.e. Qemu). Also this patch series doesn't touch vPCI/PCI 
>> pass-through resources, handling, accounting, nothing. From the 
>> hypervisor we only need a help to intercept the config space accesses 
>> happen in a range [GUEST_VIRTIO_PCI_ECAM_BASE ... 
>> GUEST_VIRTIO_PCI_ECAM_BASE + GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE] and 
>> forward them to the linked device emulator (if any), that's all.
>>
>> It is not possible (with current series) to run device emulators what
>> emulate only separate PCI (virtio-pci) devices. For it to be possible, I 
>> think, much more changes are required than current patch series does. 
>> There at least should be special PCI Host bridge emulation in Xen (or 
>> reuse vPCI) for the integration. Also Xen should be in charge of forming 
>> resulting PCI interrupt based on each PCI device level signaling (if we 
>> use legacy interrupts), some kind of x86's XEN_DMOP_set_pci_intx_level, 
>> etc. Please note, I am not saying this is not possible in general, 
>> likely it is possible, but initial patch series doesn't cover these 
>> use-cases)
>>
>> We expose one PCI host bridge per virtio backend domain. This is a 
>> separate PCI host bridge to combine all virtio-pci devices running in 
>> the same backend domain (in the same device emulator currently).
>> The examples:
>> - if only one domain runs Qemu which servers virtio-blk, virtio-net, 
>> virtio-console devices for DomU - only single PCI Host bridge will be 
>> exposed for DomU
>> - if we add another domain to run Qemu to serve additionally virtio-gpu, 
>> virtio-input and virtio-snd for the *same* DomU - we expose second PCI 
>> Host bridge for DomU
>>
>> I am afraid, we cannot end up exposing only single PCI Host bridge with 
>> current model (if we use device emulators running in different domains 
>> that handles the *entire* PCI Host bridges), this won't work.
>  
> 
> We were discussing the topic of vPCI and Virtio PCI just this morning
> with Stewart and Vikram. We also intend to make them work well together
> in the next couple of months (great timing!!)
> 
> However, our thinking is to go with the other approach Julien
> suggested: a single PCI Root Complex emulated in Xen by vPCI. QEMU would
> register individual PCI devices against it.
> 
> Vikram, Stewart, please comment. Our understanding is that it should be
> possible to make QEMU virtio-pci work with vPCI with relatively minor
> efforts and AMD volunteers to do the work in the next couple of months
> on the vPCI side.
> 
> 
> Although it should be possible to make both approaches work at the same
> time, given that it would seem that EPAM and AMD have very similar
> requirements, I suggest we work together and collaborate on a single
> approach going forward that works best for everyone.
> 
> 
> Let me start by saying that if we can get away with it, I think that a
> single PCI Root Complex in Xen would be best because it requires less
> complexity. Why emulate 2/3 PCI Root Complexes if we can emulate only
> one?
> 
> Stewart, you are deep into vPCI, what's your thinking?

First allow me to explain the moving pieces in a bit more detail (skip ahead to "Back to the question: " if you don't want to be bored with the details). I played around with this series, and I passed through a PCI device (with vPCI) and enabled virtio-pci:

virtio = [ "type=virtio,device,transport=pci,bdf=0000:00:00.0,backend_type=qemu" ]
device_model_args = [ "-device", "virtio-serial-pci" ]
pci = [ "01:00.0" ]

Indeed we get two root complexes (2 ECAM ranges, 2 sets of interrupts, etc.) from the domU point of view:

    pcie@10000000 {
        compatible = "pci-host-ecam-generic";
        device_type = "pci";
        reg = <0x00 0x10000000 0x00 0x10000000>;
        bus-range = <0x00 0xff>;
        #address-cells = <0x03>;
        #size-cells = <0x02>;
        status = "okay";
        ranges = <0x2000000 0x00 0x23000000 0x00 0x23000000 0x00 0x10000000 0x42000000 0x01 0x00 0x01 0x00 0x01 0x00>;
        #interrupt-cells = <0x01>;
        interrupt-map = <0x00 0x00 0x00 0x01 0xfde8 0x00 0x74 0x04>;
        interrupt-map-mask = <0x00 0x00 0x00 0x07>;
    };

    pcie@33000000 {
        compatible = "pci-host-ecam-generic";
        device_type = "pci";
        reg = <0x00 0x33000000 0x00 0x200000>;
        bus-range = <0x00 0x01>;
        #address-cells = <0x03>;
        #size-cells = <0x02>;
        status = "okay";
        ranges = <0x2000000 0x00 0x34000000 0x00 0x34000000 0x00 0x800000 0x42000000 0x00 0x3a000000 0x00 0x3a000000 0x00 0x800000>;
        dma-coherent;
        #interrupt-cells = <0x01>;
        interrupt-map = <0x00 0x00 0x00 0x01 0xfde8 0x00 0x0c 0x04 0x00 0x00 0x00 0x02 0xfde8 0x00 0x0d 0x04 0x00 0x00 0x00 0x03 0xfde8 0x00 0x0e 0x04 0x00 0x00 0x00 0x04 0xfde8 0x00 0x0f 0x04 0x800 0x00 0x00 0x01 0xfde8 0x00 0x0d 0x04 0x800 0x00 0x00 0x02 0xfde8 0x00 0x0e 0x04 0x800 0x00 0x00 0x03 0xfde8 0x00 0x0f 0x04 0x800 0x00 0x00 0x04 0xfde8 0x00 0x0c 0x04 0x1000 0x00 0x00 0x01 0xfde8 0x00 0x0e 0x04 0x1000 0x00 0x00 0x02 0xfde8 0x00 0x0f 0x04 0x1000 0x00 0x00 0x03 0xfde8 0x00 0x0c 0x04 0x1000 0x00 0x00 0x04 0xfde8 0x00 0x0d 0x04 0x1800 0x00 0x00 0x01 0xfde8 0x00 0x0f 0x04 0x1800 0x00 0x00 0x02 0xfde8 0x00 0x0c 0x04 0x1800 0x00 0x00 0x03 0xfde8 0x00 0x0d 0x04 0x1800 0x00 0x00 0x04 0xfde8 0x00 0x0e 0x04>;
        interrupt-map-mask = <0x1800 0x00 0x00 0x07>;
    };

Xen vPCI doesn't currently expose a host bridge (i.e. a device with base class 0x06). As an aside, we may eventually want to expose a virtual/emulated host bridge in vPCI, because Linux's x86 PCI probe expects one [0].
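
(Rough sketch of the handful of identification fields such a virtual host bridge would need to report -- the IDs below are made up for illustration, not a proposal:)

#include <stdint.h>

/* Sketch only: the few config header fields a virtual host bridge in
 * vPCI would need to expose so that a guest probe finds "something"
 * with base class 0x06.  Not the full config space layout. */
struct vhostbridge_id {
    uint16_t vendor_id;
    uint16_t device_id;
    uint8_t  subclass;      /* 0x00: host bridge */
    uint8_t  base_class;    /* 0x06: bridge device */
    uint8_t  header_type;   /* 0x00: type 0, single function */
};

static const struct vhostbridge_id vhb = {
    .vendor_id   = 0x5853,  /* XenSource vendor ID, purely for illustration */
    .device_id   = 0x0001,
    .subclass    = 0x00,
    .base_class  = 0x06,
    .header_type = 0x00,
};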

Qemu exposes an emulated host bridge, along with any requested emulated devices.

Running lspci -v in the domU yields the following:

0000:00:00.0 Network controller: Ralink corp. RT2790 Wireless 802.11n 1T/2R PCIe
        Subsystem: ASUSTeK Computer Inc. RT2790 Wireless 802.11n 1T/2R PCIe
        Flags: bus master, fast devsel, latency 0, IRQ 13
        Memory at 23000000 (32-bit, non-prefetchable) [size=64K]
        Capabilities: [50] MSI: Enable- Count=1/128 Maskable- 64bit+
        Kernel driver in use: rt2800pci

0001:00:00.0 Host bridge: Red Hat, Inc. QEMU PCIe Host bridge
        Subsystem: Red Hat, Inc. QEMU PCIe Host bridge
        Flags: fast devsel

0001:00:01.0 Communication controller: Red Hat, Inc. Virtio console
        Subsystem: Red Hat, Inc. Virtio console
        Flags: bus master, fast devsel, latency 0, IRQ 14
        Memory at 3a000000 (64-bit, prefetchable) [size=16K]
        Capabilities: [84] Vendor Specific Information: VirtIO: <unknown>
        Capabilities: [70] Vendor Specific Information: VirtIO: Notify
        Capabilities: [60] Vendor Specific Information: VirtIO: DeviceCfg
        Capabilities: [50] Vendor Specific Information: VirtIO: ISR
        Capabilities: [40] Vendor Specific Information: VirtIO: CommonCfg
        Kernel driver in use: virtio-pci

0000:00:00.0 is a real passed through device (corresponding to 0000:01:00.0 in dom0).
0001:00:00.0 is the qemu host bridge (base class 0x06).
They are on different segments because they are associated with different root complexes.


Back to the question: Sure, avoiding reserving more memory from the preciously small lowmem virtual memory layout is probably a good idea. With everything in a single virtual root complex (segment), it's probably possible to come up with some vBDF-picking algorithm (+ user ability to specify) that works for most use cases as discussed elsewhere. It will always be in a single fixed segment as far as I can tell.

Some more observations assuming a single virtual root complex:

We should probably hide the qemu host bridge(s) from the guest. In other words, hide all devices with base class 0x06, except eventually vPCI's own virtual host bridge. If we don't hide them, we would likely end up with multiple emulated host bridges on a single root complex (segment). That sounds messy and hard to manage.
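
A sketch of the kind of filter that implies (the config accessor is a stand-in, not an existing function; the base class byte sits at config offset 0x0b):

#include <stdint.h>
#include <stdbool.h>

#define PCI_CLASS_REG    0x0b   /* base class byte in the config header */
#define PCI_CLASS_BRIDGE 0x06   /* base class of all bridges, incl. host bridges */

/*
 * Sketch only: decide whether a QEMU-emulated device should be hidden
 * from the guest.  pci_conf_read8() stands in for whatever config space
 * accessor the real implementation would use.
 */
static bool hide_from_guest(uint8_t (*pci_conf_read8)(uint32_t sbdf,
                                                      unsigned int reg),
                            uint32_t sbdf, bool is_vpci_own_bridge)
{
    uint8_t base_class = pci_conf_read8(sbdf, PCI_CLASS_REG);

    /* Hide every bridge-class device except vPCI's own virtual host bridge. */
    return base_class == PCI_CLASS_BRIDGE && !is_vpci_own_bridge;
}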

We have a need to control the vBDF exposed to the guest - can we force qemu to use particular BDFs for its emulated devices? If not, we will need to do SBDF translation in vPCI for virtio-pci devices. Meaning virtio-pci would then be dependent on the vPCI series [1]. I think that's okay. Hmm - if a qemu SBDF clashes with a real hardware SBDF, is that an issue?

vPCI and virtio-pci devices will be sharing a single ECAM range, so it looks like we must rely on vPCI to intercept the accesses that need to be forwarded to ioreq.
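
A sketch of what that interception boils down to, assuming the standard ECAM offset encoding (the routing helper is a placeholder, not an existing function):

#include <stdint.h>
#include <stdbool.h>

/* Standard ECAM encoding: offset = bus << 20 | dev << 15 | fn << 12 | reg */
static void ecam_decode(uint64_t offset, uint8_t *bus, uint8_t *dev,
                        uint8_t *fn, uint16_t *reg)
{
    *bus = (offset >> 20) & 0xff;
    *dev = (offset >> 15) & 0x1f;
    *fn  = (offset >> 12) & 0x07;
    *reg = offset & 0xfff;
}

enum cfg_route { ROUTE_VPCI, ROUTE_IOREQ };

/* Placeholder: "is this vBDF backed by an external device model?" */
extern bool vbdf_owned_by_ioreq(uint8_t bus, uint8_t dev, uint8_t fn);

static enum cfg_route route_cfg_access(uint64_t ecam_offset)
{
    uint8_t bus, dev, fn;
    uint16_t reg;

    ecam_decode(ecam_offset, &bus, &dev, &fn, &reg);
    (void)reg;   /* the register offset travels with the access itself */

    /* Virtio-pci vBDFs go to the ioreq server (QEMU); everything else
     * (passthrough devices, vPCI's own host bridge) stays in vPCI. */
    return vbdf_owned_by_ioreq(bus, dev, fn) ? ROUTE_IOREQ : ROUTE_VPCI;
}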

Initially, it looks like we would be adding some sort of support for legacy interrupts in vPCI, but for virtio-pci devices only. Eventually, we may also want to consider virtio-pci with MSI/MSI-X, but it's probably okay to wait.

[0] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/x86/pci/direct.c?h=v6.6.1#n186
[1] https://lists.xenproject.org/archives/html/xen-devel/2023-10/msg00660.html
Oleksandr Tyshchenko Nov. 17, 2023, 8:11 a.m. UTC | #18
On 17.11.23 05:31, Stewart Hildebrand wrote:

Hello Stewart

[answering only for the virtio-pci bits, as for vPCI I am only familiar
with the code responsible for trapping config space accesses]

[snip]

>>
>>
>> Let me start by saying that if we can get away with it, I think that a
>> single PCI Root Complex in Xen would be best because it requires less
>> complexity. Why emulate 2/3 PCI Root Complexes if we can emulate only
>> one?
>>
>> Stewart, you are deep into vPCI, what's your thinking?
> 
> First allow me explain the moving pieces in a bit more detail (skip ahead to "Back to the question: " if you don't want to be bored with the details). I played around with this series, and I passed through a PCI device (with vPCI) and enabled virtio-pci:
> 
> virtio = [ "type=virtio,device,transport=pci,bdf=0000:00:00.0,backend_type=qemu" ]
> device_model_args = [ "-device", "virtio-serial-pci" ]
> pci = [ "01:00.0" ]
> 
> Indeed we get two root complexes (2 ECAM ranges, 2 sets of interrupts, etc.) from the domU point of view:
> 
>      pcie@10000000 {
>          compatible = "pci-host-ecam-generic";
>          device_type = "pci";
>          reg = <0x00 0x10000000 0x00 0x10000000>;
>          bus-range = <0x00 0xff>;
>          #address-cells = <0x03>;
>          #size-cells = <0x02>;
>          status = "okay";
>          ranges = <0x2000000 0x00 0x23000000 0x00 0x23000000 0x00 0x10000000 0x42000000 0x01 0x00 0x01 0x00 0x01 0x00>;
>          #interrupt-cells = <0x01>;
>          interrupt-map = <0x00 0x00 0x00 0x01 0xfde8 0x00 0x74 0x04>;
>          interrupt-map-mask = <0x00 0x00 0x00 0x07>;


I am wondering how you got interrupt-map here? AFAIR the upstream
toolstack doesn't add that property for the vPCI DT node.

>      };
> 
>      pcie@33000000 {
>          compatible = "pci-host-ecam-generic";
>          device_type = "pci";
>          reg = <0x00 0x33000000 0x00 0x200000>;
>          bus-range = <0x00 0x01>;
>          #address-cells = <0x03>;
>          #size-cells = <0x02>;
>          status = "okay";
>          ranges = <0x2000000 0x00 0x34000000 0x00 0x34000000 0x00 0x800000 0x42000000 0x00 0x3a000000 0x00 0x3a000000 0x00 0x800000>;
>          dma-coherent;
>          #interrupt-cells = <0x01>;
>          interrupt-map = <0x00 0x00 0x00 0x01 0xfde8 0x00 0x0c 0x04 0x00 0x00 0x00 0x02 0xfde8 0x00 0x0d 0x04 0x00 0x00 0x00 0x03 0xfde8 0x00 0x0e 0x04 0x00 0x00 0x00 0x04 0xfde8 0x00 0x0f 0x04 0x800 0x00 0x00 0x01 0xfde8 0x00 0x0d 0x04 0x800 0x00 0x00 0x02 0xfde8 0x00 0x0e 0x04 0x800 0x00 0x00 0x03 0xfde8 0x00 0x0f 0x04 0x800 0x00 0x00 0x04 0xfde8 0x00 0x0c 0x04 0x1000 0x00 0x00 0x01 0xfde8 0x00 0x0e 0x04 0x1000 0x00 0x00 0x02 0xfde8 0x00 0x0f 0x04 0x1000 0x00 0x00 0x03 0xfde8 0x00 0x0c 0x04 0x1000 0x00 0x00 0x04 0xfde8 0x00 0x0d 0x04 0x1800 0x00 0x00 0x01 0xfde8 0x00 0x0f 0x04 0x1800 0x00 0x00 0x02 0xfde8 0x00 0x0c 0x04 0x1800 0x00 0x00 0x03 0xfde8 0x00 0x0d 0x04 0x1800 0x00 0x00 0x04 0xfde8 0x00 0x0e 0x04>;
>          interrupt-map-mask = <0x1800 0x00 0x00 0x07>;


that is a correct dump.

BTW, if you added "grant_usage=1" (it is disabled by default for dom0)
to the virtio configuration you would get an iommu-map property here as
well [1]. This is another point to think about when considering the
combined approach (single PCI Host bridge node -> single virtual root
complex): I guess a usual PCI device doesn't want grant-based DMA
addresses, correct? If so, it shouldn't be covered by the property.


>      };
> 
> Xen vPCI doesn't currently expose a host bridge (i.e. a device with base class 0x06). As an aside, we may eventually want to expose a virtual/emulated host bridge in vPCI, because Linux's x86 PCI probe expects one [0].
> 
> Qemu exposes an emulated host bridge, along with any requested emulated devices.
> 
> Running lspci -v in the domU yields the following:
> 
> 0000:00:00.0 Network controller: Ralink corp. RT2790 Wireless 802.11n 1T/2R PCIe
>          Subsystem: ASUSTeK Computer Inc. RT2790 Wireless 802.11n 1T/2R PCIe
>          Flags: bus master, fast devsel, latency 0, IRQ 13
>          Memory at 23000000 (32-bit, non-prefetchable) [size=64K]
>          Capabilities: [50] MSI: Enable- Count=1/128 Maskable- 64bit+
>          Kernel driver in use: rt2800pci
> 
> 0001:00:00.0 Host bridge: Red Hat, Inc. QEMU PCIe Host bridge
>          Subsystem: Red Hat, Inc. QEMU PCIe Host bridge
>          Flags: fast devsel
> 
> 0001:00:01.0 Communication controller: Red Hat, Inc. Virtio console
>          Subsystem: Red Hat, Inc. Virtio console
>          Flags: bus master, fast devsel, latency 0, IRQ 14
>          Memory at 3a000000 (64-bit, prefetchable) [size=16K]
>          Capabilities: [84] Vendor Specific Information: VirtIO: <unknown>
>          Capabilities: [70] Vendor Specific Information: VirtIO: Notify
>          Capabilities: [60] Vendor Specific Information: VirtIO: DeviceCfg
>          Capabilities: [50] Vendor Specific Information: VirtIO: ISR
>          Capabilities: [40] Vendor Specific Information: VirtIO: CommonCfg
>          Kernel driver in use: virtio-pci
> 
> 0000:00:00.0 is a real passed through device (corresponding to 0000:01:00.0 in dom0).
> 0001:00:00.0 is the qemu host bridge (base class 0x06).
> They are on different segments because they are associated with different root complexes.


Glad to hear this patch series doesn't seem to break PCI passthrough in 
your environment.


> 
> 
> Back to the question: Sure, avoiding reserving more memory from the preciously small lowmem virtual memory layout is probably a good idea. With everything in a single virtual root complex (segment), it's probably possible to come up with some vBDF-picking algorithm (+ user ability to specify) that works for most use cases as discussed elsewhere. It will always be in a single fixed segment as far as I can tell.
> 
> Some more observations assuming a single virtual root complex:
> 
> We should probably hide the qemu host bridge(s) from the guest. In other words, hide all devices with base class 0x06, except eventually vPCI's own virtual host bridge. If we don't hide them, we would likely end up with multiple emulated host bridges on a single root complex (segment). That sounds messy and hard to manage.
> 
> We have a need to control the vBDF exposed to the guest - can we force qemu to use particular BDFs for its emulated devices?


Yes, it is possible. Maybe there is a better way, but at
least *bus* and *addr* can be specified and Qemu indeed follows that.

device_model_args=[ '-device', 
'virtio-blk-pci,scsi=off,disable-legacy=on,iommu_platform=on,bus=pcie.0,addr=2,drive=image', 
'-drive', 'if=none,id=image,format=raw,file=/dev/mmcblk1p3' ]

virtio=[ "backend=Domain-0, type=virtio,device, transport=pci, 
bdf=0000:00:02.0, grant_usage=1, backend_type=qemu" ]

root@h3ulcb-domd:~# dmesg | grep virtio
[   0.660789] virtio-pci 0000:00:02.0: enabling device (0000 -> 0002)
[   0.715876] virtio_blk virtio0: [vda] 4096 512-byte logical blocks 
(2.10 MB/2.00 MiB)

root@h3ulcb-domd:~# lspci
00:00.0 Host bridge: Red Hat, Inc. QEMU PCIe Host bridge
00:02.0 SCSI storage controller: Red Hat, Inc. Virtio block device (rev 01)

Also, one point to note about the current series: the bdf specified for
a virtio-pci device only makes sense for the iommu-map property. So
bdf=0000:00:02.0 in the virtio property and bus=pcie.0,addr=2 in the
device_model_args property should be in sync.

[1] 
https://patchwork.kernel.org/project/xen-devel/patch/20231115112611.3865905-5-Sergiy_Kibrik@epam.com/


[snip]
Sergiy Kibrik Nov. 17, 2023, 1:19 p.m. UTC | #19
hi Julien, Oleksandr,

[..]
>>
>>
>> This patch series only covers use-cases where the device emulator
>> handles the *entire* PCI Host bridge and PCI (virtio-pci) devices behind
>> it (i.e. Qemu). Also this patch series doesn't touch vPCI/PCI
>> pass-through resources, handling, accounting, nothing. 
> 
> I understood you want to one Device Emulator to handle the entire PCI 
> host bridge. But...
> 
>  From the
>> hypervisor we only need a help to intercept the config space accesses
>> happen in a range [GUEST_VIRTIO_PCI_ECAM_BASE ...
>> GUEST_VIRTIO_PCI_ECAM_BASE + GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE] and
>> forward them to the linked device emulator (if any), that's all.
> 
> ... I really don't see why you need to add code in Xen to trap the 
> region. If QEMU is dealing with the hostbridge, then it should be able 
> to register the MMIO region and then do the translation.
> 
>>
[..]
>>
>> I am afraid, we cannot end up exposing only single PCI Host bridge with
>> current model (if we use device emulators running in different domains
>> that handles the *entire* PCI Host bridges), this won't work.
> 
> That makes sense and it is fine. But see above, I think only the #2 is 
> necessary for the hypervisor. Patch #5 should not be necessary at all.
> 
> [...]

I did checks without patch #5 and can confirm that indeed QEMU & Xen can
do this work without additional modifications to the QEMU code. So I'll
drop this patch from the series.

[..]
>>>> +/*
>>>> + * 16 MB is reserved for virtio-pci configuration space based on
>>>> calculation
>>>> + * 8 bridges * 2 buses x 32 devices x 8 functions x 4 KB = 16 MB
>>>
>>> Can you explain how youd ecided the "2"?
>>
>> good question, we have a limited free space available in memory layout
>> (we had difficulties to find a suitable holes) also we don't expect a
>> lot of virtio-pci devices, so "256" used vPCI would be too much. It was
>> decided to reduce significantly, but select maximum to fit into free
>> space, with having "2" buses we still fit into the chosen holes.
> 
> If you don't expect a lot of virtio devices, then why do you need two 
> buses? Wouldn't one be sufficient?
> 

one should be reasonably sufficient, I agree
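
For the record, redoing the arithmetic from the comment with a single
bus per host bridge -- a self-contained sketch, the macro names are
local to the example; the 8 MB result is just what falls out of the
calculation, not a proposal for the final value:

#include <stdint.h>

#define NR_HOST_BRIDGES       8        /* aligned with MAX_NR_IOREQ_SERVERS */
#define BUSES_PER_BRIDGE      1        /* reduced from 2 */
#define DEVS_PER_BUS          32
#define FUNCS_PER_DEV         8
#define ECAM_BYTES_PER_FUNC   4096     /* 4 KB of config space per function */

#define HOST_ECAM_SIZE  ((uint64_t)BUSES_PER_BRIDGE * DEVS_PER_BUS * \
                         FUNCS_PER_DEV * ECAM_BYTES_PER_FUNC)
#define TOTAL_ECAM_SIZE (HOST_ECAM_SIZE * NR_HOST_BRIDGES)

_Static_assert(HOST_ECAM_SIZE  == 0x00100000, "1 MB per host bridge");
_Static_assert(TOTAL_ECAM_SIZE == 0x00800000, "8 MB total, down from 16 MB");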

>>
>>
>>>
>>>> + */
>>>> +#define GUEST_VIRTIO_PCI_ECAM_BASE          xen_mk_ullong(0x33000000)
>>>> +#define GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE    xen_mk_ullong(0x01000000)
>>>> +#define GUEST_VIRTIO_PCI_HOST_ECAM_SIZE     xen_mk_ullong(0x00200000)
>>>> +
>>>> +/* 64 MB is reserved for virtio-pci memory */
>>>> +#define GUEST_VIRTIO_PCI_ADDR_TYPE_MEM    xen_mk_ullong(0x02000000)
>>>> +#define GUEST_VIRTIO_PCI_MEM_ADDR         xen_mk_ullong(0x34000000)
>>>> +#define GUEST_VIRTIO_PCI_MEM_SIZE         xen_mk_ullong(0x04000000)
>>>> +
>>>>    /*
>>>>     * 16MB == 4096 pages reserved for guest to use as a region to 
>>>> map its
>>>>     * grant table in.
>>>> @@ -476,6 +489,11 @@ typedef uint64_t xen_callback_t;
>>>>    #define GUEST_MAGIC_BASE  xen_mk_ullong(0x39000000)
>>>>    #define GUEST_MAGIC_SIZE  xen_mk_ullong(0x01000000)
>>>> +/* 64 MB is reserved for virtio-pci Prefetch memory */
>>>
>>> This doesn't seem a lot depending on your use case. Can you details how
>>> you can up with "64 MB"?
>>
>> the same calculation as it was done configuration space. It was observed
>> that only 16K is used per virtio-pci device (maybe it can be bigger for
>> usual PCI device, I don't know). Please look at the example of DomU log
>> below (to strings that contain "*BAR 4: assigned*"):
> 
> What about virtio-gpu? I would expect a bit more memory is necessary for 
> that use case.
> 
> Any case, what I am looking for is for some explanation in the commit 
> message of the limits. I don't particularly care about the exact limit 
> because this is not part of a stable ABI.

sure, I'll put a bit more explanation in both comment and commit 
message. Should I post updated patch series, with updated resources and 
without patch #5, or shall we wait for some more comments here?

--
regards,
  Sergiy
Julien Grall Nov. 17, 2023, 6:24 p.m. UTC | #20
Hi Sergiy,

On 17/11/2023 13:19, Sergiy Kibrik wrote:
>>>>> + */
>>>>> +#define GUEST_VIRTIO_PCI_ECAM_BASE          xen_mk_ullong(0x33000000)
>>>>> +#define GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE    xen_mk_ullong(0x01000000)
>>>>> +#define GUEST_VIRTIO_PCI_HOST_ECAM_SIZE     xen_mk_ullong(0x00200000)
>>>>> +
>>>>> +/* 64 MB is reserved for virtio-pci memory */
>>>>> +#define GUEST_VIRTIO_PCI_ADDR_TYPE_MEM    xen_mk_ullong(0x02000000)
>>>>> +#define GUEST_VIRTIO_PCI_MEM_ADDR         xen_mk_ullong(0x34000000)
>>>>> +#define GUEST_VIRTIO_PCI_MEM_SIZE         xen_mk_ullong(0x04000000)
>>>>> +
>>>>>    /*
>>>>>     * 16MB == 4096 pages reserved for guest to use as a region to 
>>>>> map its
>>>>>     * grant table in.
>>>>> @@ -476,6 +489,11 @@ typedef uint64_t xen_callback_t;
>>>>>    #define GUEST_MAGIC_BASE  xen_mk_ullong(0x39000000)
>>>>>    #define GUEST_MAGIC_SIZE  xen_mk_ullong(0x01000000)
>>>>> +/* 64 MB is reserved for virtio-pci Prefetch memory */
>>>>
>>>> This doesn't seem a lot depending on your use case. Can you details how
>>>> you can up with "64 MB"?
>>>
>>> the same calculation as it was done configuration space. It was observed
>>> that only 16K is used per virtio-pci device (maybe it can be bigger for
>>> usual PCI device, I don't know). Please look at the example of DomU log
>>> below (to strings that contain "*BAR 4: assigned*"):
>>
>> What about virtio-gpu? I would expect a bit more memory is necessary 
>> for that use case.
>>
>> Any case, what I am looking for is for some explanation in the commit 
>> message of the limits. I don't particularly care about the exact limit 
>> because this is not part of a stable ABI.
> 
> sure, I'll put a bit more explanation in both comment and commit 
> message. Should I post updated patch series, with updated resources and 
> without patch #5, or shall we wait for some more comments here?

I would wait for comments before posting, in particular if you haven't
yet received any comments on the tools side.

Cheers,
Stewart Hildebrand Nov. 21, 2023, 7:12 p.m. UTC | #21
On 11/17/23 03:11, Oleksandr Tyshchenko wrote:
> 
> 
> On 17.11.23 05:31, Stewart Hildebrand wrote:
> 
> Hello Stewart
> 
> [answering only for virtio-pci bits as for vPCI I am only familiar with 
> code responsible for trapping config space accesses]
> 
> [snip]
> 
>>>
>>>
>>> Let me start by saying that if we can get away with it, I think that a
>>> single PCI Root Complex in Xen would be best because it requires less
>>> complexity. Why emulate 2/3 PCI Root Complexes if we can emulate only
>>> one?
>>>
>>> Stewart, you are deep into vPCI, what's your thinking?
>>
>> First allow me explain the moving pieces in a bit more detail (skip ahead to "Back to the question: " if you don't want to be bored with the details). I played around with this series, and I passed through a PCI device (with vPCI) and enabled virtio-pci:
>>
>> virtio = [ "type=virtio,device,transport=pci,bdf=0000:00:00.0,backend_type=qemu" ]
>> device_model_args = [ "-device", "virtio-serial-pci" ]
>> pci = [ "01:00.0" ]
>>
>> Indeed we get two root complexes (2 ECAM ranges, 2 sets of interrupts, etc.) from the domU point of view:
>>
>>      pcie@10000000 {
>>          compatible = "pci-host-ecam-generic";
>>          device_type = "pci";
>>          reg = <0x00 0x10000000 0x00 0x10000000>;
>>          bus-range = <0x00 0xff>;
>>          #address-cells = <0x03>;
>>          #size-cells = <0x02>;
>>          status = "okay";
>>          ranges = <0x2000000 0x00 0x23000000 0x00 0x23000000 0x00 0x10000000 0x42000000 0x01 0x00 0x01 0x00 0x01 0x00>;
>>          #interrupt-cells = <0x01>;
>>          interrupt-map = <0x00 0x00 0x00 0x01 0xfde8 0x00 0x74 0x04>;
>>          interrupt-map-mask = <0x00 0x00 0x00 0x07>;
> 
> 
> I am wondering how you got interrupt-map here? AFAIR upstream toolstack 
> doesn't add that property for vpci dt node.

You are correct. I'm airing my dirty laundry here: this is a hack to get a legacy interrupt passed through on a ZCU102 board (Zynq UltraScale+), which doesn't have GICv3/ITS. In a system with a GICv3/ITS, we would have an msi-map property (this is also not yet upstream, but is in a couple of downstreams).

> 
>>      };
>>
>>      pcie@33000000 {
>>          compatible = "pci-host-ecam-generic";
>>          device_type = "pci";
>>          reg = <0x00 0x33000000 0x00 0x200000>;
>>          bus-range = <0x00 0x01>;
>>          #address-cells = <0x03>;
>>          #size-cells = <0x02>;
>>          status = "okay";
>>          ranges = <0x2000000 0x00 0x34000000 0x00 0x34000000 0x00 0x800000 0x42000000 0x00 0x3a000000 0x00 0x3a000000 0x00 0x800000>;
>>          dma-coherent;
>>          #interrupt-cells = <0x01>;
>>          interrupt-map = <0x00 0x00 0x00 0x01 0xfde8 0x00 0x0c 0x04 0x00 0x00 0x00 0x02 0xfde8 0x00 0x0d 0x04 0x00 0x00 0x00 0x03 0xfde8 0x00 0x0e 0x04 0x00 0x00 0x00 0x04 0xfde8 0x00 0x0f 0x04 0x800 0x00 0x00 0x01 0xfde8 0x00 0x0d 0x04 0x800 0x00 0x00 0x02 0xfde8 0x00 0x0e 0x04 0x800 0x00 0x00 0x03 0xfde8 0x00 0x0f 0x04 0x800 0x00 0x00 0x04 0xfde8 0x00 0x0c 0x04 0x1000 0x00 0x00 0x01 0xfde8 0x00 0x0e 0x04 0x1000 0x00 0x00 0x02 0xfde8 0x00 0x0f 0x04 0x1000 0x00 0x00 0x03 0xfde8 0x00 0x0c 0x04 0x1000 0x00 0x00 0x04 0xfde8 0x00 0x0d 0x04 0x1800 0x00 0x00 0x01 0xfde8 0x00 0x0f 0x04 0x1800 0x00 0x00 0x02 0xfde8 0x00 0x0c 0x04 0x1800 0x00 0x00 0x03 0xfde8 0x00 0x0d 0x04 0x1800 0x00 0x00 0x04 0xfde8 0x00 0x0e 0x04>;
>>          interrupt-map-mask = <0x1800 0x00 0x00 0x07>;
> 
> 
> that is correct dump.
> 
> BTW, if you added "grant_usage=1" (it is disabled by default for dom0) 
> to virtio configuration you would get iommu-map property here as well 
> [1]. This is another point to think about when considering combined 
> approach (single PCI Host bridge node -> single virtual root complex), I 
> guess usual PCI device doesn't want grant based DMA addresses, correct? 
> If so, it shouldn't be specified in the property.

Right, a usual PCI device doesn't want/need grant based DMA addresses. The iommu-map property [2] is flexible enough to make it apply only to certain vBDFs or ranges of vBDFs. Actually, it looks like ("libxl/arm: Reuse generic PCI-IOMMU bindings for virtio-pci devices") already has the logic for doing exactly this. So it should be fine to have the iommu-map property in the single root complex and still do regular PCI passthrough.

Presumably, we would also want msi-map [3] to apply to some vBDFs, not others. msi-map has the same flexible BDF matching as iommu-map (these two bindings actually share the same parsing function), so we should be able to use a similar strategy here if needed.
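
For reference, the per-RID matching both bindings rely on boils down to the lookup below -- a sketch of the semantics described in [2]/[3], not code taken from Linux or Xen (the phandle part of each map entry is omitted):

#include <stdint.h>
#include <stdbool.h>

/* One (iommu|msi)-map entry: <rid-base  phandle  out-base  length> */
struct map_entry {
    uint16_t rid_base;   /* first matching requester ID (vBDF) */
    uint32_t out_base;   /* first output ID on the IOMMU/MSI side */
    uint16_t length;     /* number of consecutive RIDs covered */
};

/*
 * Translate a requester ID (vBDF) through a map.  Returns false if the
 * RID is not covered at all -- which is exactly how grant-based DMA can
 * be limited to virtio-pci vBDFs while passthrough vBDFs are left alone.
 */
static bool map_translate(const struct map_entry *map, unsigned int nr,
                          uint16_t map_mask, uint16_t rid, uint32_t *out)
{
    uint16_t masked = rid & map_mask;

    for ( unsigned int i = 0; i < nr; i++ )
        if ( masked >= map[i].rid_base &&
             masked < map[i].rid_base + map[i].length )
        {
            *out = map[i].out_base + (masked - map[i].rid_base);
            return true;
        }

    return false;
}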

We would also want interrupt-map [4] to apply to some vBDFs, not others. The BDF matching with interrupt-map behaves slightly differently, but I believe it is still flexible enough to achieve this. In this case, the function create_virtio_pci_irq_map(), added in ("libxl/arm: Add basic virtio-pci support"), would need some re-thinking.

In summary, with a single virtual root complex, the toolstack needs to know which vBDFs are virtio-pci, and which vBDFs are passthrough, so it can create the [iommu,msi,interrupt]-map properties accordingly.

[2] https://www.kernel.org/doc/Documentation/devicetree/bindings/pci/pci-iommu.txt
[3] https://www.kernel.org/doc/Documentation/devicetree/bindings/pci/pci-msi.txt
[4] https://www.kernel.org/doc/Documentation/devicetree/bindings/pci/host-generic-pci.yaml

> 
> 
>>      };
>>
>> Xen vPCI doesn't currently expose a host bridge (i.e. a device with base class 0x06). As an aside, we may eventually want to expose a virtual/emulated host bridge in vPCI, because Linux's x86 PCI probe expects one [0].
>>
>> Qemu exposes an emulated host bridge, along with any requested emulated devices.
>>
>> Running lspci -v in the domU yields the following:
>>
>> 0000:00:00.0 Network controller: Ralink corp. RT2790 Wireless 802.11n 1T/2R PCIe
>>          Subsystem: ASUSTeK Computer Inc. RT2790 Wireless 802.11n 1T/2R PCIe
>>          Flags: bus master, fast devsel, latency 0, IRQ 13
>>          Memory at 23000000 (32-bit, non-prefetchable) [size=64K]
>>          Capabilities: [50] MSI: Enable- Count=1/128 Maskable- 64bit+
>>          Kernel driver in use: rt2800pci
>>
>> 0001:00:00.0 Host bridge: Red Hat, Inc. QEMU PCIe Host bridge
>>          Subsystem: Red Hat, Inc. QEMU PCIe Host bridge
>>          Flags: fast devsel
>>
>> 0001:00:01.0 Communication controller: Red Hat, Inc. Virtio console
>>          Subsystem: Red Hat, Inc. Virtio console
>>          Flags: bus master, fast devsel, latency 0, IRQ 14
>>          Memory at 3a000000 (64-bit, prefetchable) [size=16K]
>>          Capabilities: [84] Vendor Specific Information: VirtIO: <unknown>
>>          Capabilities: [70] Vendor Specific Information: VirtIO: Notify
>>          Capabilities: [60] Vendor Specific Information: VirtIO: DeviceCfg
>>          Capabilities: [50] Vendor Specific Information: VirtIO: ISR
>>          Capabilities: [40] Vendor Specific Information: VirtIO: CommonCfg
>>          Kernel driver in use: virtio-pci
>>
>> 0000:00:00.0 is a real passed through device (corresponding to 0000:01:00.0 in dom0).
>> 0001:00:00.0 is the qemu host bridge (base class 0x06).
>> They are on different segments because they are associated with different root complexes.
> 
> 
> Glad to hear this patch series doesn't seem to break PCI passthrough in 
> your environment.
> 
> 
>>
>>
>> Back to the question: Sure, avoiding reserving more memory from the preciously small lowmem virtual memory layout is probably a good idea. With everything in a single virtual root complex (segment), it's probably possible to come up with some vBDF-picking algorithm (+ user ability to specify) that works for most use cases as discussed elsewhere. It will always be in a single fixed segment as far as I can tell.
>>
>> Some more observations assuming a single virtual root complex:
>>
>> We should probably hide the qemu host bridge(s) from the guest. In other words, hide all devices with base class 0x06, except eventually vPCI's own virtual host bridge. If we don't hide them, we would likely end up with multiple emulated host bridges on a single root complex (segment). That sounds messy and hard to manage.
>>
>> We have a need to control the vBDF exposed to the guest - can we force qemu to use particular BDFs for its emulated devices?
> 
> 
> Yes, it is possible. Maybe there is a better way, but at
> least *bus* and *addr* can be specified and Qemu indeed follows that.
> 
> device_model_args=[ '-device', 
> 'virtio-blk-pci,scsi=off,disable-legacy=on,iommu_platform=on,bus=pcie.0,addr=2,drive=image', 
> '-drive', 'if=none,id=image,format=raw,file=/dev/mmcblk1p3' ]
> 
> virtio=[ "backend=Domain-0, type=virtio,device, transport=pci, 
> bdf=0000:00:02.0, grant_usage=1, backend_type=qemu" ]
> 
> root@h3ulcb-domd:~# dmesg | grep virtio
> [   0.660789] virtio-pci 0000:00:02.0: enabling device (0000 -> 0002)
> [   0.715876] virtio_blk virtio0: [vda] 4096 512-byte logical blocks 
> (2.10 MB/2.00 MiB)
> 
> root@h3ulcb-domd:~# lspci
> 00:00.0 Host bridge: Red Hat, Inc. QEMU PCIe Host bridge
> 00:02.0 SCSI storage controller: Red Hat, Inc. Virtio block device (rev 01)
> 
> Also there is one moment for current series: bdf specified for 
> virtio-pci device only makes sense for iommu-map property. So 
> bdf=0000:00:02.0 in virtio property and bus=pcie.0,addr=2 in 
> device_model_args property should be in sync.
> 
> [1] 
> https://patchwork.kernel.org/project/xen-devel/patch/20231115112611.3865905-5-Sergiy_Kibrik@epam.com/
> 
> 
> [snip]
Stefano Stabellini Nov. 22, 2023, 1:14 a.m. UTC | #22
On Tue, 21 Nov 2023, Stewart Hildebrand wrote:
> On 11/17/23 03:11, Oleksandr Tyshchenko wrote:
> > 
> > 
> > On 17.11.23 05:31, Stewart Hildebrand wrote:
> > 
> > Hello Stewart
> > 
> > [answering only for virtio-pci bits as for vPCI I am only familiar with 
> > code responsible for trapping config space accesses]
> > 
> > [snip]
> > 
> >>>
> >>>
> >>> Let me start by saying that if we can get away with it, I think that a
> >>> single PCI Root Complex in Xen would be best because it requires less
> >>> complexity. Why emulate 2/3 PCI Root Complexes if we can emulate only
> >>> one?
> >>>
> >>> Stewart, you are deep into vPCI, what's your thinking?
> >>
> >> First allow me explain the moving pieces in a bit more detail (skip ahead to "Back to the question: " if you don't want to be bored with the details). I played around with this series, and I passed through a PCI device (with vPCI) and enabled virtio-pci:
> >>
> >> virtio = [ "type=virtio,device,transport=pci,bdf=0000:00:00.0,backend_type=qemu" ]
> >> device_model_args = [ "-device", "virtio-serial-pci" ]
> >> pci = [ "01:00.0" ]
> >>
> >> Indeed we get two root complexes (2 ECAM ranges, 2 sets of interrupts, etc.) from the domU point of view:
> >>
> >>      pcie@10000000 {
> >>          compatible = "pci-host-ecam-generic";
> >>          device_type = "pci";
> >>          reg = <0x00 0x10000000 0x00 0x10000000>;
> >>          bus-range = <0x00 0xff>;
> >>          #address-cells = <0x03>;
> >>          #size-cells = <0x02>;
> >>          status = "okay";
> >>          ranges = <0x2000000 0x00 0x23000000 0x00 0x23000000 0x00 0x10000000 0x42000000 0x01 0x00 0x01 0x00 0x01 0x00>;
> >>          #interrupt-cells = <0x01>;
> >>          interrupt-map = <0x00 0x00 0x00 0x01 0xfde8 0x00 0x74 0x04>;
> >>          interrupt-map-mask = <0x00 0x00 0x00 0x07>;
> > 
> > 
> > I am wondering how you got interrupt-map here? AFAIR upstream toolstack 
> > doesn't add that property for vpci dt node.
> 
> You are correct. I'm airing my dirty laundry here: this is a hack to get a legacy interrupt passed through on a ZCU102 board (Zynq UltraScale+), which doesn't have GICv3/ITS. In a system with a GICv3/ITS, we would have an msi-map property (this is also not yet upstream, but is in a couple of downstreams).
> 
> > 
> >>      };
> >>
> >>      pcie@33000000 {
> >>          compatible = "pci-host-ecam-generic";
> >>          device_type = "pci";
> >>          reg = <0x00 0x33000000 0x00 0x200000>;
> >>          bus-range = <0x00 0x01>;
> >>          #address-cells = <0x03>;
> >>          #size-cells = <0x02>;
> >>          status = "okay";
> >>          ranges = <0x2000000 0x00 0x34000000 0x00 0x34000000 0x00 0x800000 0x42000000 0x00 0x3a000000 0x00 0x3a000000 0x00 0x800000>;
> >>          dma-coherent;
> >>          #interrupt-cells = <0x01>;
> >>          interrupt-map = <0x00 0x00 0x00 0x01 0xfde8 0x00 0x0c 0x04 0x00 0x00 0x00 0x02 0xfde8 0x00 0x0d 0x04 0x00 0x00 0x00 0x03 0xfde8 0x00 0x0e 0x04 0x00 0x00 0x00 0x04 0xfde8 0x00 0x0f 0x04 0x800 0x00 0x00 0x01 0xfde8 0x00 0x0d 0x04 0x800 0x00 0x00 0x02 0xfde8 0x00 0x0e 0x04 0x800 0x00 0x00 0x03 0xfde8 0x00 0x0f 0x04 0x800 0x00 0x00 0x04 0xfde8 0x00 0x0c 0x04 0x1000 0x00 0x00 0x01 0xfde8 0x00 0x0e 0x04 0x1000 0x00 0x00 0x02 0xfde8 0x00 0x0f 0x04 0x1000 0x00 0x00 0x03 0xfde8 0x00 0x0c 0x04 0x1000 0x00 0x00 0x04 0xfde8 0x00 0x0d 0x04 0x1800 0x00 0x00 0x01 0xfde8 0x00 0x0f 0x04 0x1800 0x00 0x00 0x02 0xfde8 0x00 0x0c 0x04 0x1800 0x00 0x00 0x03 0xfde8 0x00 0x0d 0x04 0x1800 0x00 0x00 0x04 0xfde8 0x00 0x0e 0x04>;
> >>          interrupt-map-mask = <0x1800 0x00 0x00 0x07>;
> > 
> > 
> > that is correct dump.
> > 
> > BTW, if you added "grant_usage=1" (it is disabled by default for dom0) 
> > to virtio configuration you would get iommu-map property here as well 
> > [1]. This is another point to think about when considering combined 
> > approach (single PCI Host bridge node -> single virtual root complex), I 
> > guess usual PCI device doesn't want grant based DMA addresses, correct? 
> > If so, it shouldn't be specified in the property.
> 
> Right, a usual PCI device doesn't want/need grant based DMA addresses. The iommu-map property [2] is flexible enough to make it apply only to certain vBDFs or ranges of vBDFs. Actually, it looks like ("libxl/arm: Reuse generic PCI-IOMMU bindings for virtio-pci devices") already has the logic for doing exactly this. So it should be fine to have the iommu-map property in the single root complex and still do regular PCI passthrough.
> 
> Presumably, we would also want msi-map [3] to apply to some vBDFs, not others. msi-map has the same flexible BDF matching as iommu-map (these two bindings actually share the same parsing function), so we should be able to use a similar strategy here if needed.
> 
> We would also want interrupt-map [4] to apply to some vBDFs, not others. The BDF matching with interrupt-map behaves slightly differently, but I believe it is still flexible enough to achieve this. In this case, the function create_virtio_pci_irq_map(), added in ("libxl/arm: Add basic virtio-pci support"), would need some re-thinking.
> 
> In summary, with a single virtual root complex, the toolstack needs to know which vBDFs are virtio-pci, and which vBDFs are passthrough, so it can create the [iommu,msi,interrupt]-map properties accordingly.

That should be fine given that the toolstack also knows all the PCI
passthrough devices.
diff mbox series

Patch

diff --git a/xen/include/public/arch-arm.h b/xen/include/public/arch-arm.h
index a25e87dbda..e6c9cd5335 100644
--- a/xen/include/public/arch-arm.h
+++ b/xen/include/public/arch-arm.h
@@ -466,6 +466,19 @@  typedef uint64_t xen_callback_t;
 #define GUEST_VPCI_MEM_ADDR                 xen_mk_ullong(0x23000000)
 #define GUEST_VPCI_MEM_SIZE                 xen_mk_ullong(0x10000000)
 
+/*
+ * 16 MB is reserved for virtio-pci configuration space based on calculation
+ * 8 bridges * 2 buses x 32 devices x 8 functions x 4 KB = 16 MB
+ */
+#define GUEST_VIRTIO_PCI_ECAM_BASE          xen_mk_ullong(0x33000000)
+#define GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE    xen_mk_ullong(0x01000000)
+#define GUEST_VIRTIO_PCI_HOST_ECAM_SIZE     xen_mk_ullong(0x00200000)
+
+/* 64 MB is reserved for virtio-pci memory */
+#define GUEST_VIRTIO_PCI_ADDR_TYPE_MEM    xen_mk_ullong(0x02000000)
+#define GUEST_VIRTIO_PCI_MEM_ADDR         xen_mk_ullong(0x34000000)
+#define GUEST_VIRTIO_PCI_MEM_SIZE         xen_mk_ullong(0x04000000)
+
 /*
  * 16MB == 4096 pages reserved for guest to use as a region to map its
  * grant table in.
@@ -476,6 +489,11 @@  typedef uint64_t xen_callback_t;
 #define GUEST_MAGIC_BASE  xen_mk_ullong(0x39000000)
 #define GUEST_MAGIC_SIZE  xen_mk_ullong(0x01000000)
 
+/* 64 MB is reserved for virtio-pci Prefetch memory */
+#define GUEST_VIRTIO_PCI_ADDR_TYPE_PREFETCH_MEM    xen_mk_ullong(0x42000000)
+#define GUEST_VIRTIO_PCI_PREFETCH_MEM_ADDR         xen_mk_ullong(0x3a000000)
+#define GUEST_VIRTIO_PCI_PREFETCH_MEM_SIZE         xen_mk_ullong(0x04000000)
+
 #define GUEST_RAM_BANKS   2
 
 /*
@@ -515,6 +533,9 @@  typedef uint64_t xen_callback_t;
 #define GUEST_VIRTIO_MMIO_SPI_FIRST   33
 #define GUEST_VIRTIO_MMIO_SPI_LAST    43
 
+#define GUEST_VIRTIO_PCI_SPI_FIRST   44
+#define GUEST_VIRTIO_PCI_SPI_LAST    76
+
 /* PSCI functions */
 #define PSCI_cpu_suspend 0
 #define PSCI_cpu_off     1