
[2/2] xen/virtio: Avoid use of the dom0 backend in dom0

Message ID 20230621131214.9398-3-petr.pavlu@suse.com (mailing list archive)
State New, archived
Headers show
Series Fix Linux dom0 boot on a QEMU/KVM virtual machine

Commit Message

Petr Pavlu June 21, 2023, 1:12 p.m. UTC
When attempting to run Xen on a QEMU/KVM virtual machine with virtio
devices (all x86_64), dom0 tries to establish a grant for itself, which
eventually results in a hang during boot.

The backtrace looks as follows; the while loop in __send_control_msg()
makes no progress:

  #0  virtqueue_get_buf_ctx (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94, ctx=ctx@entry=0x0 <fixed_percpu_data>) at ../drivers/virtio/virtio_ring.c:2326
  #1  0xffffffff817086b7 in virtqueue_get_buf (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94) at ../drivers/virtio/virtio_ring.c:2333
  #2  0xffffffff8175f6b2 in __send_control_msg (portdev=<optimized out>, port_id=0xffffffff, event=0x0, value=0x1) at ../drivers/char/virtio_console.c:562
  #3  0xffffffff8175f6ee in __send_control_msg (portdev=<optimized out>, port_id=<optimized out>, event=<optimized out>, value=<optimized out>) at ../drivers/char/virtio_console.c:569
  #4  0xffffffff817618b1 in virtcons_probe (vdev=0xffff88800585e800) at ../drivers/char/virtio_console.c:2098
  #5  0xffffffff81707117 in virtio_dev_probe (_d=0xffff88800585e810) at ../drivers/virtio/virtio.c:305
  #6  0xffffffff8198e348 in call_driver_probe (drv=0xffffffff82be40c0 <virtio_console>, drv=0xffffffff82be40c0 <virtio_console>, dev=0xffff88800585e810) at ../drivers/base/dd.c:579
  #7  really_probe (dev=dev@entry=0xffff88800585e810, drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:658
  #8  0xffffffff8198e58f in __driver_probe_device (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:800
  #9  0xffffffff8198e65a in driver_probe_device (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:830
  #10 0xffffffff8198e832 in __driver_attach (dev=0xffff88800585e810, data=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1216
  #11 0xffffffff8198bfb2 in bus_for_each_dev (bus=<optimized out>, start=start@entry=0x0 <fixed_percpu_data>, data=data@entry=0xffffffff82be40c0 <virtio_console>,
      fn=fn@entry=0xffffffff8198e7b0 <__driver_attach>) at ../drivers/base/bus.c:368
  #12 0xffffffff8198db65 in driver_attach (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1233
  #13 0xffffffff8198d207 in bus_add_driver (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/bus.c:673
  #14 0xffffffff8198f550 in driver_register (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/driver.c:246
  #15 0xffffffff81706b47 in register_virtio_driver (driver=driver@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/virtio/virtio.c:357
  #16 0xffffffff832cd34b in virtio_console_init () at ../drivers/char/virtio_console.c:2258
  #17 0xffffffff8100105c in do_one_initcall (fn=0xffffffff832cd2e0 <virtio_console_init>) at ../init/main.c:1246
  #18 0xffffffff83277293 in do_initcall_level (command_line=0xffff888003e2f900 "root", level=0x6) at ../init/main.c:1319
  #19 do_initcalls () at ../init/main.c:1335
  #20 do_basic_setup () at ../init/main.c:1354
  #21 kernel_init_freeable () at ../init/main.c:1571
  #22 0xffffffff81f64be1 in kernel_init (unused=<optimized out>) at ../init/main.c:1462
  #23 0xffffffff81001f49 in ret_from_fork () at ../arch/x86/entry/entry_64.S:308
  #24 0x0000000000000000 in ?? ()

Fix the problem by preventing xen_grant_init_backend_domid() from
setting dom0 as a backend when running in dom0.

Fixes: 035e3a4321f7 ("xen/virtio: Optimize the setup of "xen-grant-dma" devices")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
---
 drivers/xen/grant-dma-ops.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
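
For reference, the core of the change (the full hunk is quoted in the
comments below) is to skip the dom0-as-backend fallback in
xen_grant_init_backend_domid() when running as the initial domain. A
minimal sketch of the resulting branch:

	} else if ((IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) ||
		    xen_pv_domain()) &&
		   !xen_initial_domain()) {
		/* Only domains other than dom0 fall back to dom0 as the backend. */
		dev_info(dev, "Using dom0 as backend\n");
		*backend_domid = 0;
		ret = 0;
	}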

Comments

Oleksandr Tyshchenko June 21, 2023, 5:58 p.m. UTC | #1
On 21.06.23 16:12, Petr Pavlu wrote:


Hello Petr


> When attempting to run Xen on a QEMU/KVM virtual machine with virtio
> devices (all x86_64), dom0 tries to establish a grant for itself which
> eventually results in a hang during the boot.
> 
> The backtrace looks as follows, the while loop in __send_control_msg()
> makes no progress:
> 
>    #0  virtqueue_get_buf_ctx (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94, ctx=ctx@entry=0x0 <fixed_percpu_data>) at ../drivers/virtio/virtio_ring.c:2326
>    #1  0xffffffff817086b7 in virtqueue_get_buf (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94) at ../drivers/virtio/virtio_ring.c:2333
>    #2  0xffffffff8175f6b2 in __send_control_msg (portdev=<optimized out>, port_id=0xffffffff, event=0x0, value=0x1) at ../drivers/char/virtio_console.c:562
>    #3  0xffffffff8175f6ee in __send_control_msg (portdev=<optimized out>, port_id=<optimized out>, event=<optimized out>, value=<optimized out>) at ../drivers/char/virtio_console.c:569
>    #4  0xffffffff817618b1 in virtcons_probe (vdev=0xffff88800585e800) at ../drivers/char/virtio_console.c:2098
>    #5  0xffffffff81707117 in virtio_dev_probe (_d=0xffff88800585e810) at ../drivers/virtio/virtio.c:305
>    #6  0xffffffff8198e348 in call_driver_probe (drv=0xffffffff82be40c0 <virtio_console>, drv=0xffffffff82be40c0 <virtio_console>, dev=0xffff88800585e810) at ../drivers/base/dd.c:579
>    #7  really_probe (dev=dev@entry=0xffff88800585e810, drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:658
>    #8  0xffffffff8198e58f in __driver_probe_device (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:800
>    #9  0xffffffff8198e65a in driver_probe_device (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:830
>    #10 0xffffffff8198e832 in __driver_attach (dev=0xffff88800585e810, data=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1216
>    #11 0xffffffff8198bfb2 in bus_for_each_dev (bus=<optimized out>, start=start@entry=0x0 <fixed_percpu_data>, data=data@entry=0xffffffff82be40c0 <virtio_console>,
>        fn=fn@entry=0xffffffff8198e7b0 <__driver_attach>) at ../drivers/base/bus.c:368
>    #12 0xffffffff8198db65 in driver_attach (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1233
>    #13 0xffffffff8198d207 in bus_add_driver (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/bus.c:673
>    #14 0xffffffff8198f550 in driver_register (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/driver.c:246
>    #15 0xffffffff81706b47 in register_virtio_driver (driver=driver@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/virtio/virtio.c:357
>    #16 0xffffffff832cd34b in virtio_console_init () at ../drivers/char/virtio_console.c:2258
>    #17 0xffffffff8100105c in do_one_initcall (fn=0xffffffff832cd2e0 <virtio_console_init>) at ../init/main.c:1246
>    #18 0xffffffff83277293 in do_initcall_level (command_line=0xffff888003e2f900 "root", level=0x6) at ../init/main.c:1319
>    #19 do_initcalls () at ../init/main.c:1335
>    #20 do_basic_setup () at ../init/main.c:1354
>    #21 kernel_init_freeable () at ../init/main.c:1571
>    #22 0xffffffff81f64be1 in kernel_init (unused=<optimized out>) at ../init/main.c:1462
>    #23 0xffffffff81001f49 in ret_from_fork () at ../arch/x86/entry/entry_64.S:308
>    #24 0x0000000000000000 in ?? ()
> 
> Fix the problem by preventing xen_grant_init_backend_domid() from
> setting dom0 as a backend when running in dom0.
> 
> Fixes: 035e3a4321f7 ("xen/virtio: Optimize the setup of "xen-grant-dma" devices")


I am not 100% sure whether the Fixes tag points to the precise commit. If I 
am not mistaken, the said commit just moves the code in the context 
without changing the logic of CONFIG_XEN_VIRTIO_FORCE_GRANT, which was 
introduced before.


> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
> ---
>   drivers/xen/grant-dma-ops.c | 4 +++-
>   1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c
> index 76f6f26265a3..29ed27ac450e 100644
> --- a/drivers/xen/grant-dma-ops.c
> +++ b/drivers/xen/grant-dma-ops.c
> @@ -362,7 +362,9 @@ static int xen_grant_init_backend_domid(struct device *dev,
>   	if (np) {
>   		ret = xen_dt_grant_init_backend_domid(dev, np, backend_domid);
>   		of_node_put(np);
> -	} else if (IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || xen_pv_domain()) {
> +	} else if ((IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) ||
> +		    xen_pv_domain()) &&
> +		   !xen_initial_domain()) {

The commit lgtm, just one note:


I would even bail out early in xen_virtio_restricted_mem_acc() instead,
as I assume the same issue could happen on Arm with DT (although there 
we don't guess the backend's domid, we read it from DT, and it is quite 
unlikely we would get Dom0 as the backend in Dom0 with a correct DT).

Something like:

@@ -416,6 +421,10 @@ bool xen_virtio_restricted_mem_acc(struct virtio_device *dev)
 {
        domid_t backend_domid;

+       /* Xen grant DMA ops are not used when running as initial domain */
+       if (xen_initial_domain())
+               return false;
+
        if (!xen_grant_init_backend_domid(dev->dev.parent, &backend_domid)) {
                xen_grant_setup_dma_ops(dev->dev.parent, backend_domid);
                return true;



If so, that commit subject would need to be updated accordingly.

Let's see what other reviewers will say.





>   		dev_info(dev, "Using dom0 as backend\n");
>   		*backend_domid = 0;
>   		ret = 0;
Petr Pavlu June 26, 2023, 1:17 p.m. UTC | #2
On 6/21/23 19:58, Oleksandr Tyshchenko wrote:
> On 21.06.23 16:12, Petr Pavlu wrote:
>> When attempting to run Xen on a QEMU/KVM virtual machine with virtio
>> devices (all x86_64), dom0 tries to establish a grant for itself which
>> eventually results in a hang during the boot.
>>
>> The backtrace looks as follows, the while loop in __send_control_msg()
>> makes no progress:
>>
>>    #0  virtqueue_get_buf_ctx (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94, ctx=ctx@entry=0x0 <fixed_percpu_data>) at ../drivers/virtio/virtio_ring.c:2326
>>    #1  0xffffffff817086b7 in virtqueue_get_buf (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94) at ../drivers/virtio/virtio_ring.c:2333
>>    #2  0xffffffff8175f6b2 in __send_control_msg (portdev=<optimized out>, port_id=0xffffffff, event=0x0, value=0x1) at ../drivers/char/virtio_console.c:562
>>    #3  0xffffffff8175f6ee in __send_control_msg (portdev=<optimized out>, port_id=<optimized out>, event=<optimized out>, value=<optimized out>) at ../drivers/char/virtio_console.c:569
>>    #4  0xffffffff817618b1 in virtcons_probe (vdev=0xffff88800585e800) at ../drivers/char/virtio_console.c:2098
>>    #5  0xffffffff81707117 in virtio_dev_probe (_d=0xffff88800585e810) at ../drivers/virtio/virtio.c:305
>>    #6  0xffffffff8198e348 in call_driver_probe (drv=0xffffffff82be40c0 <virtio_console>, drv=0xffffffff82be40c0 <virtio_console>, dev=0xffff88800585e810) at ../drivers/base/dd.c:579
>>    #7  really_probe (dev=dev@entry=0xffff88800585e810, drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:658
>>    #8  0xffffffff8198e58f in __driver_probe_device (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:800
>>    #9  0xffffffff8198e65a in driver_probe_device (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:830
>>    #10 0xffffffff8198e832 in __driver_attach (dev=0xffff88800585e810, data=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1216
>>    #11 0xffffffff8198bfb2 in bus_for_each_dev (bus=<optimized out>, start=start@entry=0x0 <fixed_percpu_data>, data=data@entry=0xffffffff82be40c0 <virtio_console>,
>>        fn=fn@entry=0xffffffff8198e7b0 <__driver_attach>) at ../drivers/base/bus.c:368
>>    #12 0xffffffff8198db65 in driver_attach (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1233
>>    #13 0xffffffff8198d207 in bus_add_driver (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/bus.c:673
>>    #14 0xffffffff8198f550 in driver_register (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/driver.c:246
>>    #15 0xffffffff81706b47 in register_virtio_driver (driver=driver@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/virtio/virtio.c:357
>>    #16 0xffffffff832cd34b in virtio_console_init () at ../drivers/char/virtio_console.c:2258
>>    #17 0xffffffff8100105c in do_one_initcall (fn=0xffffffff832cd2e0 <virtio_console_init>) at ../init/main.c:1246
>>    #18 0xffffffff83277293 in do_initcall_level (command_line=0xffff888003e2f900 "root", level=0x6) at ../init/main.c:1319
>>    #19 do_initcalls () at ../init/main.c:1335
>>    #20 do_basic_setup () at ../init/main.c:1354
>>    #21 kernel_init_freeable () at ../init/main.c:1571
>>    #22 0xffffffff81f64be1 in kernel_init (unused=<optimized out>) at ../init/main.c:1462
>>    #23 0xffffffff81001f49 in ret_from_fork () at ../arch/x86/entry/entry_64.S:308
>>    #24 0x0000000000000000 in ?? ()
>>
>> Fix the problem by preventing xen_grant_init_backend_domid() from
>> setting dom0 as a backend when running in dom0.
>>
>> Fixes: 035e3a4321f7 ("xen/virtio: Optimize the setup of "xen-grant-dma" devices")
> 
> 
> I am not 100% sure whether the Fixes tag points to precise commit. If I 
> am not mistaken, the said commit just moves the code in the context 
> without changing the logic of CONFIG_XEN_VIRTIO_FORCE_GRANT, this was 
> introduced before.

I see, the tag should rather point to 7228113d1fa0 ("xen/virtio: use
dom0 as default backend for CONFIG_XEN_VIRTIO_FORCE_GRANT"), which
introduced the original logic of using dom0 as the backend.

Commit 035e3a4321f7 ("xen/virtio: Optimize the setup of "xen-grant-dma"
devices") is relevant in the sense that it extended when this logic is
active by adding an OR check for xen_pv_domain().

> 
> 
>> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
>> ---
>>   drivers/xen/grant-dma-ops.c | 4 +++-
>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c
>> index 76f6f26265a3..29ed27ac450e 100644
>> --- a/drivers/xen/grant-dma-ops.c
>> +++ b/drivers/xen/grant-dma-ops.c
>> @@ -362,7 +362,9 @@ static int xen_grant_init_backend_domid(struct device *dev,
>>   	if (np) {
>>   		ret = xen_dt_grant_init_backend_domid(dev, np, backend_domid);
>>   		of_node_put(np);
>> -	} else if (IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || xen_pv_domain()) {
>> +	} else if ((IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) ||
>> +		    xen_pv_domain()) &&
>> +		   !xen_initial_domain()) {
> 
> The commit lgtm, just one note:
> 
> 
> I would even bail out early in xen_virtio_restricted_mem_acc() instead,
> as I assume the same issue could happen on Arm with DT (although there 
> we don't guess the backend's domid, we read it from DT and quite 
> unlikely we get Dom0 being in Dom0 with correct DT).
> 
> Something like:
> 
> @@ -416,6 +421,10 @@ bool xen_virtio_restricted_mem_acc(struct 
> virtio_device *dev)
>   {
>          domid_t backend_domid;
> 
> +       /* Xen grant DMA ops are not used when running as initial domain */
> +       if (xen_initial_domain())
> +               return false;
> +
>          if (!xen_grant_init_backend_domid(dev->dev.parent, 
> &backend_domid)) {
>                  xen_grant_setup_dma_ops(dev->dev.parent, backend_domid);
>                  return true;
> (END)
> 
> 
> 
> If so, that commit subject would need to be updated accordingly.
> 
> Let's see what other reviewers will say.

Ok, makes sense.

Thanks,
Petr
Stefano Stabellini June 29, 2023, 1 a.m. UTC | #3
On Wed, 21 Jun 2023, Oleksandr Tyshchenko wrote:
> On 21.06.23 16:12, Petr Pavlu wrote:
> 
> 
> Hello Petr
> 
> 
> > When attempting to run Xen on a QEMU/KVM virtual machine with virtio
> > devices (all x86_64), dom0 tries to establish a grant for itself which
> > eventually results in a hang during the boot.
> > 
> > The backtrace looks as follows, the while loop in __send_control_msg()
> > makes no progress:
> > 
> >    #0  virtqueue_get_buf_ctx (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94, ctx=ctx@entry=0x0 <fixed_percpu_data>) at ../drivers/virtio/virtio_ring.c:2326
> >    #1  0xffffffff817086b7 in virtqueue_get_buf (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94) at ../drivers/virtio/virtio_ring.c:2333
> >    #2  0xffffffff8175f6b2 in __send_control_msg (portdev=<optimized out>, port_id=0xffffffff, event=0x0, value=0x1) at ../drivers/char/virtio_console.c:562
> >    #3  0xffffffff8175f6ee in __send_control_msg (portdev=<optimized out>, port_id=<optimized out>, event=<optimized out>, value=<optimized out>) at ../drivers/char/virtio_console.c:569
> >    #4  0xffffffff817618b1 in virtcons_probe (vdev=0xffff88800585e800) at ../drivers/char/virtio_console.c:2098
> >    #5  0xffffffff81707117 in virtio_dev_probe (_d=0xffff88800585e810) at ../drivers/virtio/virtio.c:305
> >    #6  0xffffffff8198e348 in call_driver_probe (drv=0xffffffff82be40c0 <virtio_console>, drv=0xffffffff82be40c0 <virtio_console>, dev=0xffff88800585e810) at ../drivers/base/dd.c:579
> >    #7  really_probe (dev=dev@entry=0xffff88800585e810, drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:658
> >    #8  0xffffffff8198e58f in __driver_probe_device (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:800
> >    #9  0xffffffff8198e65a in driver_probe_device (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:830
> >    #10 0xffffffff8198e832 in __driver_attach (dev=0xffff88800585e810, data=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1216
> >    #11 0xffffffff8198bfb2 in bus_for_each_dev (bus=<optimized out>, start=start@entry=0x0 <fixed_percpu_data>, data=data@entry=0xffffffff82be40c0 <virtio_console>,
> >        fn=fn@entry=0xffffffff8198e7b0 <__driver_attach>) at ../drivers/base/bus.c:368
> >    #12 0xffffffff8198db65 in driver_attach (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1233
> >    #13 0xffffffff8198d207 in bus_add_driver (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/bus.c:673
> >    #14 0xffffffff8198f550 in driver_register (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/driver.c:246
> >    #15 0xffffffff81706b47 in register_virtio_driver (driver=driver@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/virtio/virtio.c:357
> >    #16 0xffffffff832cd34b in virtio_console_init () at ../drivers/char/virtio_console.c:2258
> >    #17 0xffffffff8100105c in do_one_initcall (fn=0xffffffff832cd2e0 <virtio_console_init>) at ../init/main.c:1246
> >    #18 0xffffffff83277293 in do_initcall_level (command_line=0xffff888003e2f900 "root", level=0x6) at ../init/main.c:1319
> >    #19 do_initcalls () at ../init/main.c:1335
> >    #20 do_basic_setup () at ../init/main.c:1354
> >    #21 kernel_init_freeable () at ../init/main.c:1571
> >    #22 0xffffffff81f64be1 in kernel_init (unused=<optimized out>) at ../init/main.c:1462
> >    #23 0xffffffff81001f49 in ret_from_fork () at ../arch/x86/entry/entry_64.S:308
> >    #24 0x0000000000000000 in ?? ()
> > 
> > Fix the problem by preventing xen_grant_init_backend_domid() from
> > setting dom0 as a backend when running in dom0.
> > 
> > Fixes: 035e3a4321f7 ("xen/virtio: Optimize the setup of "xen-grant-dma" devices")
> 
> 
> I am not 100% sure whether the Fixes tag points to precise commit. If I 
> am not mistaken, the said commit just moves the code in the context 
> without changing the logic of CONFIG_XEN_VIRTIO_FORCE_GRANT, this was 
> introduced before.
> 
> 
> > Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
> > ---
> >   drivers/xen/grant-dma-ops.c | 4 +++-
> >   1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c
> > index 76f6f26265a3..29ed27ac450e 100644
> > --- a/drivers/xen/grant-dma-ops.c
> > +++ b/drivers/xen/grant-dma-ops.c
> > @@ -362,7 +362,9 @@ static int xen_grant_init_backend_domid(struct device *dev,
> >   	if (np) {
> >   		ret = xen_dt_grant_init_backend_domid(dev, np, backend_domid);
> >   		of_node_put(np);
> > -	} else if (IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || xen_pv_domain()) {
> > +	} else if ((IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) ||
> > +		    xen_pv_domain()) &&
> > +		   !xen_initial_domain()) {
> 
> The commit lgtm, just one note:
> 
> 
> I would even bail out early in xen_virtio_restricted_mem_acc() instead,
> as I assume the same issue could happen on Arm with DT (although there 
> we don't guess the backend's domid, we read it from DT and quite 
> unlikely we get Dom0 being in Dom0 with correct DT).
> 
> Something like:
> 
> @@ -416,6 +421,10 @@ bool xen_virtio_restricted_mem_acc(struct 
> virtio_device *dev)
>   {
>          domid_t backend_domid;
> 
> +       /* Xen grant DMA ops are not used when running as initial domain */
> +       if (xen_initial_domain())
> +               return false;
> +
>          if (!xen_grant_init_backend_domid(dev->dev.parent, 
> &backend_domid)) {
>                  xen_grant_setup_dma_ops(dev->dev.parent, backend_domid);
>                  return true;
> (END)
> 
> 
> 
> If so, that commit subject would need to be updated accordingly.
> 
> Let's see what other reviewers will say.

This doesn't work in all cases. Imagine using PCI Passthrough to assign
a "physical" virtio device to a domU. The domU will run into the same
error, right?

The problem is that we need a way for the virtio backend to advertise
its ability to handle grants. Right now we only have a way to do that
with device tree on ARM. On x86, we only have
CONFIG_XEN_VIRTIO_FORCE_GRANT, and if we take
CONFIG_XEN_VIRTIO_FORCE_GRANT at face value, it also enables grants for
"physical" virtio devices. Note that in this case we are fixing a
nested-virtualization bug, but there are actually physical
virtio-compatible devices out there. CONFIG_XEN_VIRTIO_FORCE_GRANT will
break those too.

I think we need to add a second way? It could be anything that can help
us distinguish between a non-grants-capable virtio backend and a
grants-capable virtio backend, such as:
- a string on xenstore
- a xen param
- a special PCI configuration register value
- something in the ACPI tables
- the QEMU machine type

Or at least should we change CONFIG_XEN_VIRTIO_FORCE_GRANT into a
command line parameter so that it can be disabled in cases like this
one?

I realize that fixing this problem properly takes a lot longer than
adding a trivial if (dom0) return; check in the code. If you cannot find
a good way to solve the problem or you don't have time to do that now
and you need this bug fixed quickly, then I would be OK with the if
(dom0) return; check, but please add a detailed TODO in-code comment to
explain that this is just a hack and we are still looking for a real
solution.

As for the check itself, I prefer the original position because I want
to retain the ability to use virtio frontends with grants on ARM in Dom0
(the DomD case).
Oleksandr Tyshchenko June 29, 2023, 8:29 p.m. UTC | #4
On 29.06.23 04:00, Stefano Stabellini wrote:

Hello Stefano

> On Wed, 21 Jun 2023, Oleksandr Tyshchenko wrote:
>> On 21.06.23 16:12, Petr Pavlu wrote:
>>
>>
>> Hello Petr
>>
>>
>>> When attempting to run Xen on a QEMU/KVM virtual machine with virtio
>>> devices (all x86_64), dom0 tries to establish a grant for itself which
>>> eventually results in a hang during the boot.
>>>
>>> The backtrace looks as follows, the while loop in __send_control_msg()
>>> makes no progress:
>>>
>>>     #0  virtqueue_get_buf_ctx (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94, ctx=ctx@entry=0x0 <fixed_percpu_data>) at ../drivers/virtio/virtio_ring.c:2326
>>>     #1  0xffffffff817086b7 in virtqueue_get_buf (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94) at ../drivers/virtio/virtio_ring.c:2333
>>>     #2  0xffffffff8175f6b2 in __send_control_msg (portdev=<optimized out>, port_id=0xffffffff, event=0x0, value=0x1) at ../drivers/char/virtio_console.c:562
>>>     #3  0xffffffff8175f6ee in __send_control_msg (portdev=<optimized out>, port_id=<optimized out>, event=<optimized out>, value=<optimized out>) at ../drivers/char/virtio_console.c:569
>>>     #4  0xffffffff817618b1 in virtcons_probe (vdev=0xffff88800585e800) at ../drivers/char/virtio_console.c:2098
>>>     #5  0xffffffff81707117 in virtio_dev_probe (_d=0xffff88800585e810) at ../drivers/virtio/virtio.c:305
>>>     #6  0xffffffff8198e348 in call_driver_probe (drv=0xffffffff82be40c0 <virtio_console>, drv=0xffffffff82be40c0 <virtio_console>, dev=0xffff88800585e810) at ../drivers/base/dd.c:579
>>>     #7  really_probe (dev=dev@entry=0xffff88800585e810, drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:658
>>>     #8  0xffffffff8198e58f in __driver_probe_device (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:800
>>>     #9  0xffffffff8198e65a in driver_probe_device (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:830
>>>     #10 0xffffffff8198e832 in __driver_attach (dev=0xffff88800585e810, data=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1216
>>>     #11 0xffffffff8198bfb2 in bus_for_each_dev (bus=<optimized out>, start=start@entry=0x0 <fixed_percpu_data>, data=data@entry=0xffffffff82be40c0 <virtio_console>,
>>>         fn=fn@entry=0xffffffff8198e7b0 <__driver_attach>) at ../drivers/base/bus.c:368
>>>     #12 0xffffffff8198db65 in driver_attach (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1233
>>>     #13 0xffffffff8198d207 in bus_add_driver (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/bus.c:673
>>>     #14 0xffffffff8198f550 in driver_register (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/driver.c:246
>>>     #15 0xffffffff81706b47 in register_virtio_driver (driver=driver@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/virtio/virtio.c:357
>>>     #16 0xffffffff832cd34b in virtio_console_init () at ../drivers/char/virtio_console.c:2258
>>>     #17 0xffffffff8100105c in do_one_initcall (fn=0xffffffff832cd2e0 <virtio_console_init>) at ../init/main.c:1246
>>>     #18 0xffffffff83277293 in do_initcall_level (command_line=0xffff888003e2f900 "root", level=0x6) at ../init/main.c:1319
>>>     #19 do_initcalls () at ../init/main.c:1335
>>>     #20 do_basic_setup () at ../init/main.c:1354
>>>     #21 kernel_init_freeable () at ../init/main.c:1571
>>>     #22 0xffffffff81f64be1 in kernel_init (unused=<optimized out>) at ../init/main.c:1462
>>>     #23 0xffffffff81001f49 in ret_from_fork () at ../arch/x86/entry/entry_64.S:308
>>>     #24 0x0000000000000000 in ?? ()
>>>
>>> Fix the problem by preventing xen_grant_init_backend_domid() from
>>> setting dom0 as a backend when running in dom0.
>>>
>>> Fixes: 035e3a4321f7 ("xen/virtio: Optimize the setup of "xen-grant-dma" devices")
>>
>>
>> I am not 100% sure whether the Fixes tag points to precise commit. If I
>> am not mistaken, the said commit just moves the code in the context
>> without changing the logic of CONFIG_XEN_VIRTIO_FORCE_GRANT, this was
>> introduced before.
>>
>>
>>> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
>>> ---
>>>    drivers/xen/grant-dma-ops.c | 4 +++-
>>>    1 file changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c
>>> index 76f6f26265a3..29ed27ac450e 100644
>>> --- a/drivers/xen/grant-dma-ops.c
>>> +++ b/drivers/xen/grant-dma-ops.c
>>> @@ -362,7 +362,9 @@ static int xen_grant_init_backend_domid(struct device *dev,
>>>    	if (np) {
>>>    		ret = xen_dt_grant_init_backend_domid(dev, np, backend_domid);
>>>    		of_node_put(np);
>>> -	} else if (IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || xen_pv_domain()) {
>>> +	} else if ((IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) ||
>>> +		    xen_pv_domain()) &&
>>> +		   !xen_initial_domain()) {
>>
>> The commit lgtm, just one note:
>>
>>
>> I would even bail out early in xen_virtio_restricted_mem_acc() instead,
>> as I assume the same issue could happen on Arm with DT (although there
>> we don't guess the backend's domid, we read it from DT and quite
>> unlikely we get Dom0 being in Dom0 with correct DT).
>>
>> Something like:
>>
>> @@ -416,6 +421,10 @@ bool xen_virtio_restricted_mem_acc(struct
>> virtio_device *dev)
>>    {
>>           domid_t backend_domid;
>>
>> +       /* Xen grant DMA ops are not used when running as initial domain */
>> +       if (xen_initial_domain())
>> +               return false;
>> +
>>           if (!xen_grant_init_backend_domid(dev->dev.parent,
>> &backend_domid)) {
>>                   xen_grant_setup_dma_ops(dev->dev.parent, backend_domid);
>>                   return true;
>> (END)
>>
>>
>>
>> If so, that commit subject would need to be updated accordingly.
>>
>> Let's see what other reviewers will say.
> 
> This doesn't work in all cases. Imagine using PCI Passthrough to assign
> a "physical" virtio device to a domU. The domU will run into the same
> error, right?
> 
> The problem is that we need a way for the virtio backend to advertise
> its ability of handling grants. Right now we only have a way to do with
> that with device tree on ARM. On x86, we only have
> CONFIG_XEN_VIRTIO_FORCE_GRANT, and if we take
> CONFIG_XEN_VIRTIO_FORCE_GRANT at face value, it also enables grants for
> "physical" virtio devices. Note that in this case we are fixing a
> nested-virtualization bug, but there are actually physical
> virtio-compatible devices out there. CONFIG_XEN_VIRTIO_FORCE_GRANT will
> break those too.


If these "physical" virtio devices are also spawned by
drivers/virtio/virtio.c:virtio_dev_probe(), then yes, otherwise I don't 
see how this could even be possible, but I might miss something here.

xen_virtio_restricted_mem_acc() gets called indirectly from 
virtio_dev_probe()->virtio_features_ok()->
virtio_check_mem_acc_cb(). So the Xen grant DMA ops are only installed 
for devices probed that way.
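
Roughly, that probe-time check boils down to something like this
(simplified sketch, not the verbatim kernel code):

	static int virtio_features_ok(struct virtio_device *dev)
	{
		/*
		 * The registered callback decides whether restricted memory
		 * access applies; under Xen it is xen_virtio_restricted_mem_acc(),
		 * which installs the grant DMA ops when it returns true.
		 */
		if (virtio_check_mem_acc_cb(dev)) {
			/* Grant-restricted devices must offer VIRTIO_F_ACCESS_PLATFORM. */
			if (!virtio_has_feature(dev, VIRTIO_F_ACCESS_PLATFORM))
				return -ENODEV;
		}
		return 0;
	}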

> 
> I think we need to add a second way? It could be anything that can help
> us distinguish between a non-grants-capable virtio backend and a
> grants-capable virtio backend, such as:
> - a string on xenstore
> - a xen param
> - a special PCI configuration register value
> - something in the ACPI tables
> - the QEMU machine type


Yes, I remember there was a discussion regarding that. The point is to 
choose a solution that is functional for both PV and HVM *and* able to 
support hotplug. IIRC, xenstore could be a possible candidate.


> 
> Or at least should we change CONFIG_XEN_VIRTIO_FORCE_GRANT into a
> command line parameter so that it can be disabled in cases like this
> one?

IIUC, this will help with HVM only.


> 
> I realize that fixing this problem properly takes a lot longer than
> adding a trivial if (dom0) return; check in the code. If you cannot find
> a good way to solve the problem or you don't have time to do that now
> and you need this bug fixed quickly, then I would be OK with the if
> (dom0) return; check but please add a detailed TODO in-code comment to
> explain that this is just a hack and we are still looking for a real
> solution.
> 
> The check itself I prefer the original position because I want to retain
> the ability of using virtio frontends with grant on ARM in Dom0 (DomD
> case).

Makes sense, agree.
Stefano Stabellini June 29, 2023, 10:44 p.m. UTC | #5
On Thu, 29 Jun 2023, Oleksandr Tyshchenko wrote:
> On 29.06.23 04:00, Stefano Stabellini wrote:
> 
> Hello Stefano
> 
> > On Wed, 21 Jun 2023, Oleksandr Tyshchenko wrote:
> >> On 21.06.23 16:12, Petr Pavlu wrote:
> >>
> >>
> >> Hello Petr
> >>
> >>
> >>> When attempting to run Xen on a QEMU/KVM virtual machine with virtio
> >>> devices (all x86_64), dom0 tries to establish a grant for itself which
> >>> eventually results in a hang during the boot.
> >>>
> >>> The backtrace looks as follows, the while loop in __send_control_msg()
> >>> makes no progress:
> >>>
> >>>     #0  virtqueue_get_buf_ctx (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94, ctx=ctx@entry=0x0 <fixed_percpu_data>) at ../drivers/virtio/virtio_ring.c:2326
> >>>     #1  0xffffffff817086b7 in virtqueue_get_buf (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94) at ../drivers/virtio/virtio_ring.c:2333
> >>>     #2  0xffffffff8175f6b2 in __send_control_msg (portdev=<optimized out>, port_id=0xffffffff, event=0x0, value=0x1) at ../drivers/char/virtio_console.c:562
> >>>     #3  0xffffffff8175f6ee in __send_control_msg (portdev=<optimized out>, port_id=<optimized out>, event=<optimized out>, value=<optimized out>) at ../drivers/char/virtio_console.c:569
> >>>     #4  0xffffffff817618b1 in virtcons_probe (vdev=0xffff88800585e800) at ../drivers/char/virtio_console.c:2098
> >>>     #5  0xffffffff81707117 in virtio_dev_probe (_d=0xffff88800585e810) at ../drivers/virtio/virtio.c:305
> >>>     #6  0xffffffff8198e348 in call_driver_probe (drv=0xffffffff82be40c0 <virtio_console>, drv=0xffffffff82be40c0 <virtio_console>, dev=0xffff88800585e810) at ../drivers/base/dd.c:579
> >>>     #7  really_probe (dev=dev@entry=0xffff88800585e810, drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:658
> >>>     #8  0xffffffff8198e58f in __driver_probe_device (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:800
> >>>     #9  0xffffffff8198e65a in driver_probe_device (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:830
> >>>     #10 0xffffffff8198e832 in __driver_attach (dev=0xffff88800585e810, data=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1216
> >>>     #11 0xffffffff8198bfb2 in bus_for_each_dev (bus=<optimized out>, start=start@entry=0x0 <fixed_percpu_data>, data=data@entry=0xffffffff82be40c0 <virtio_console>,
> >>>         fn=fn@entry=0xffffffff8198e7b0 <__driver_attach>) at ../drivers/base/bus.c:368
> >>>     #12 0xffffffff8198db65 in driver_attach (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1233
> >>>     #13 0xffffffff8198d207 in bus_add_driver (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/bus.c:673
> >>>     #14 0xffffffff8198f550 in driver_register (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/driver.c:246
> >>>     #15 0xffffffff81706b47 in register_virtio_driver (driver=driver@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/virtio/virtio.c:357
> >>>     #16 0xffffffff832cd34b in virtio_console_init () at ../drivers/char/virtio_console.c:2258
> >>>     #17 0xffffffff8100105c in do_one_initcall (fn=0xffffffff832cd2e0 <virtio_console_init>) at ../init/main.c:1246
> >>>     #18 0xffffffff83277293 in do_initcall_level (command_line=0xffff888003e2f900 "root", level=0x6) at ../init/main.c:1319
> >>>     #19 do_initcalls () at ../init/main.c:1335
> >>>     #20 do_basic_setup () at ../init/main.c:1354
> >>>     #21 kernel_init_freeable () at ../init/main.c:1571
> >>>     #22 0xffffffff81f64be1 in kernel_init (unused=<optimized out>) at ../init/main.c:1462
> >>>     #23 0xffffffff81001f49 in ret_from_fork () at ../arch/x86/entry/entry_64.S:308
> >>>     #24 0x0000000000000000 in ?? ()
> >>>
> >>> Fix the problem by preventing xen_grant_init_backend_domid() from
> >>> setting dom0 as a backend when running in dom0.
> >>>
> >>> Fixes: 035e3a4321f7 ("xen/virtio: Optimize the setup of "xen-grant-dma" devices")
> >>
> >>
> >> I am not 100% sure whether the Fixes tag points to precise commit. If I
> >> am not mistaken, the said commit just moves the code in the context
> >> without changing the logic of CONFIG_XEN_VIRTIO_FORCE_GRANT, this was
> >> introduced before.
> >>
> >>
> >>> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
> >>> ---
> >>>    drivers/xen/grant-dma-ops.c | 4 +++-
> >>>    1 file changed, 3 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c
> >>> index 76f6f26265a3..29ed27ac450e 100644
> >>> --- a/drivers/xen/grant-dma-ops.c
> >>> +++ b/drivers/xen/grant-dma-ops.c
> >>> @@ -362,7 +362,9 @@ static int xen_grant_init_backend_domid(struct device *dev,
> >>>    	if (np) {
> >>>    		ret = xen_dt_grant_init_backend_domid(dev, np, backend_domid);
> >>>    		of_node_put(np);
> >>> -	} else if (IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || xen_pv_domain()) {
> >>> +	} else if ((IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) ||
> >>> +		    xen_pv_domain()) &&
> >>> +		   !xen_initial_domain()) {
> >>
> >> The commit lgtm, just one note:
> >>
> >>
> >> I would even bail out early in xen_virtio_restricted_mem_acc() instead,
> >> as I assume the same issue could happen on Arm with DT (although there
> >> we don't guess the backend's domid, we read it from DT and quite
> >> unlikely we get Dom0 being in Dom0 with correct DT).
> >>
> >> Something like:
> >>
> >> @@ -416,6 +421,10 @@ bool xen_virtio_restricted_mem_acc(struct
> >> virtio_device *dev)
> >>    {
> >>           domid_t backend_domid;
> >>
> >> +       /* Xen grant DMA ops are not used when running as initial domain */
> >> +       if (xen_initial_domain())
> >> +               return false;
> >> +
> >>           if (!xen_grant_init_backend_domid(dev->dev.parent,
> >> &backend_domid)) {
> >>                   xen_grant_setup_dma_ops(dev->dev.parent, backend_domid);
> >>                   return true;
> >> (END)
> >>
> >>
> >>
> >> If so, that commit subject would need to be updated accordingly.
> >>
> >> Let's see what other reviewers will say.
> > 
> > This doesn't work in all cases. Imagine using PCI Passthrough to assign
> > a "physical" virtio device to a domU. The domU will run into the same
> > error, right?
> > 
> > The problem is that we need a way for the virtio backend to advertise
> > its ability of handling grants. Right now we only have a way to do with
> > that with device tree on ARM. On x86, we only have
> > CONFIG_XEN_VIRTIO_FORCE_GRANT, and if we take
> > CONFIG_XEN_VIRTIO_FORCE_GRANT at face value, it also enables grants for
> > "physical" virtio devices. Note that in this case we are fixing a
> > nested-virtualization bug, but there are actually physical
> > virtio-compatible devices out there. CONFIG_XEN_VIRTIO_FORCE_GRANT will
> > break those too.
> 
> 
> If these "physical" virtio devices are also spawned by
> drivers/virtio/virtio.c:virtio_dev_probe(), then yes, otherwise I don't 
> see how this could even be possible, but I might miss something here.

Yes, I would imagine virtio_dev_probe() would be called for them too



> xen_virtio_restricted_mem_acc() gets called indirectly from 
> virtio_dev_probe()->virtio_features_ok()->
> virtio_check_mem_acc_cb(). So the Xen grant DMA ops are only installed 
> for those.
>
>
> > 
> > I think we need to add a second way? It could be anything that can help
> > us distinguish between a non-grants-capable virtio backend and a
> > grants-capable virtio backend, such as:
> > - a string on xenstore
> > - a xen param
> > - a special PCI configuration register value
> > - something in the ACPI tables
> > - the QEMU machine type
> 
> 
> Yes, I remember there was a discussion regarding that. The point is to 
> choose a solution to be functional for both PV and HVM *and* to be able 
> to support a hotplug. IIRC, the xenstore could be a possible candidate.

xenstore would be among the easiest to make work. The only downside is
the dependency on xenstore which otherwise virtio+grants doesn't have.

Vikram is working on virtio with grants support in QEMU as we speak.
Maybe we could find a way to add a flag in QEMU so that we can detect at
runtime if a given virtio device supports grants or not.

 
> > Or at least should we change CONFIG_XEN_VIRTIO_FORCE_GRANT into a
> > command line parameter so that it can be disabled in cases like this
> > one?
> 
> IIUC, this will help with HVM only.

For sure this is the least attractive solution, only marginally better
than the fix proposed in this patch.
 

> > I realize that fixing this problem properly takes a lot longer than
> > adding a trivial if (dom0) return; check in the code. If you cannot find
> > a good way to solve the problem or you don't have time to do that now
> > and you need this bug fixed quickly, then I would be OK with the if
> > (dom0) return; check but please add a detailed TODO in-code comment to
> > explain that this is just a hack and we are still looking for a real
> > solution.
> > 
> > The check itself I prefer the original position because I want to retain
> > the ability of using virtio frontends with grant on ARM in Dom0 (DomD
> > case).
> 
> Makes sense, agree.
Jürgen Groß July 4, 2023, 6:25 a.m. UTC | #6
On 30.06.23 00:44, Stefano Stabellini wrote:
> On Thu, 29 Jun 2023, Oleksandr Tyshchenko wrote:
>> On 29.06.23 04:00, Stefano Stabellini wrote:
>>
>> Hello Stefano
>>
>>> On Wed, 21 Jun 2023, Oleksandr Tyshchenko wrote:
>>>> On 21.06.23 16:12, Petr Pavlu wrote:
>>>>
>>>>
>>>> Hello Petr
>>>>
>>>>
>>>>> When attempting to run Xen on a QEMU/KVM virtual machine with virtio
>>>>> devices (all x86_64), dom0 tries to establish a grant for itself which
>>>>> eventually results in a hang during the boot.
>>>>>
>>>>> The backtrace looks as follows, the while loop in __send_control_msg()
>>>>> makes no progress:
>>>>>
>>>>>      #0  virtqueue_get_buf_ctx (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94, ctx=ctx@entry=0x0 <fixed_percpu_data>) at ../drivers/virtio/virtio_ring.c:2326
>>>>>      #1  0xffffffff817086b7 in virtqueue_get_buf (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94) at ../drivers/virtio/virtio_ring.c:2333
>>>>>      #2  0xffffffff8175f6b2 in __send_control_msg (portdev=<optimized out>, port_id=0xffffffff, event=0x0, value=0x1) at ../drivers/char/virtio_console.c:562
>>>>>      #3  0xffffffff8175f6ee in __send_control_msg (portdev=<optimized out>, port_id=<optimized out>, event=<optimized out>, value=<optimized out>) at ../drivers/char/virtio_console.c:569
>>>>>      #4  0xffffffff817618b1 in virtcons_probe (vdev=0xffff88800585e800) at ../drivers/char/virtio_console.c:2098
>>>>>      #5  0xffffffff81707117 in virtio_dev_probe (_d=0xffff88800585e810) at ../drivers/virtio/virtio.c:305
>>>>>      #6  0xffffffff8198e348 in call_driver_probe (drv=0xffffffff82be40c0 <virtio_console>, drv=0xffffffff82be40c0 <virtio_console>, dev=0xffff88800585e810) at ../drivers/base/dd.c:579
>>>>>      #7  really_probe (dev=dev@entry=0xffff88800585e810, drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:658
>>>>>      #8  0xffffffff8198e58f in __driver_probe_device (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:800
>>>>>      #9  0xffffffff8198e65a in driver_probe_device (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:830
>>>>>      #10 0xffffffff8198e832 in __driver_attach (dev=0xffff88800585e810, data=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1216
>>>>>      #11 0xffffffff8198bfb2 in bus_for_each_dev (bus=<optimized out>, start=start@entry=0x0 <fixed_percpu_data>, data=data@entry=0xffffffff82be40c0 <virtio_console>,
>>>>>          fn=fn@entry=0xffffffff8198e7b0 <__driver_attach>) at ../drivers/base/bus.c:368
>>>>>      #12 0xffffffff8198db65 in driver_attach (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1233
>>>>>      #13 0xffffffff8198d207 in bus_add_driver (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/bus.c:673
>>>>>      #14 0xffffffff8198f550 in driver_register (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/driver.c:246
>>>>>      #15 0xffffffff81706b47 in register_virtio_driver (driver=driver@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/virtio/virtio.c:357
>>>>>      #16 0xffffffff832cd34b in virtio_console_init () at ../drivers/char/virtio_console.c:2258
>>>>>      #17 0xffffffff8100105c in do_one_initcall (fn=0xffffffff832cd2e0 <virtio_console_init>) at ../init/main.c:1246
>>>>>      #18 0xffffffff83277293 in do_initcall_level (command_line=0xffff888003e2f900 "root", level=0x6) at ../init/main.c:1319
>>>>>      #19 do_initcalls () at ../init/main.c:1335
>>>>>      #20 do_basic_setup () at ../init/main.c:1354
>>>>>      #21 kernel_init_freeable () at ../init/main.c:1571
>>>>>      #22 0xffffffff81f64be1 in kernel_init (unused=<optimized out>) at ../init/main.c:1462
>>>>>      #23 0xffffffff81001f49 in ret_from_fork () at ../arch/x86/entry/entry_64.S:308
>>>>>      #24 0x0000000000000000 in ?? ()
>>>>>
>>>>> Fix the problem by preventing xen_grant_init_backend_domid() from
>>>>> setting dom0 as a backend when running in dom0.
>>>>>
>>>>> Fixes: 035e3a4321f7 ("xen/virtio: Optimize the setup of "xen-grant-dma" devices")
>>>>
>>>>
>>>> I am not 100% sure whether the Fixes tag points to precise commit. If I
>>>> am not mistaken, the said commit just moves the code in the context
>>>> without changing the logic of CONFIG_XEN_VIRTIO_FORCE_GRANT, this was
>>>> introduced before.
>>>>
>>>>
>>>>> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
>>>>> ---
>>>>>     drivers/xen/grant-dma-ops.c | 4 +++-
>>>>>     1 file changed, 3 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c
>>>>> index 76f6f26265a3..29ed27ac450e 100644
>>>>> --- a/drivers/xen/grant-dma-ops.c
>>>>> +++ b/drivers/xen/grant-dma-ops.c
>>>>> @@ -362,7 +362,9 @@ static int xen_grant_init_backend_domid(struct device *dev,
>>>>>     	if (np) {
>>>>>     		ret = xen_dt_grant_init_backend_domid(dev, np, backend_domid);
>>>>>     		of_node_put(np);
>>>>> -	} else if (IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || xen_pv_domain()) {
>>>>> +	} else if ((IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) ||
>>>>> +		    xen_pv_domain()) &&
>>>>> +		   !xen_initial_domain()) {
>>>>
>>>> The commit lgtm, just one note:
>>>>
>>>>
>>>> I would even bail out early in xen_virtio_restricted_mem_acc() instead,
>>>> as I assume the same issue could happen on Arm with DT (although there
>>>> we don't guess the backend's domid, we read it from DT and quite
>>>> unlikely we get Dom0 being in Dom0 with correct DT).
>>>>
>>>> Something like:
>>>>
>>>> @@ -416,6 +421,10 @@ bool xen_virtio_restricted_mem_acc(struct
>>>> virtio_device *dev)
>>>>     {
>>>>            domid_t backend_domid;
>>>>
>>>> +       /* Xen grant DMA ops are not used when running as initial domain */
>>>> +       if (xen_initial_domain())
>>>> +               return false;
>>>> +
>>>>            if (!xen_grant_init_backend_domid(dev->dev.parent,
>>>> &backend_domid)) {
>>>>                    xen_grant_setup_dma_ops(dev->dev.parent, backend_domid);
>>>>                    return true;
>>>> (END)
>>>>
>>>>
>>>>
>>>> If so, that commit subject would need to be updated accordingly.
>>>>
>>>> Let's see what other reviewers will say.
>>>
>>> This doesn't work in all cases. Imagine using PCI Passthrough to assign
>>> a "physical" virtio device to a domU. The domU will run into the same
>>> error, right?
>>>
>>> The problem is that we need a way for the virtio backend to advertise
>>> its ability of handling grants. Right now we only have a way to do with
>>> that with device tree on ARM. On x86, we only have
>>> CONFIG_XEN_VIRTIO_FORCE_GRANT, and if we take
>>> CONFIG_XEN_VIRTIO_FORCE_GRANT at face value, it also enables grants for
>>> "physical" virtio devices. Note that in this case we are fixing a
>>> nested-virtualization bug, but there are actually physical
>>> virtio-compatible devices out there. CONFIG_XEN_VIRTIO_FORCE_GRANT will
>>> break those too.
>>
>>
>> If these "physical" virtio devices are also spawned by
>> drivers/virtio/virtio.c:virtio_dev_probe(), then yes, otherwise I don't
>> see how this could even be possible, but I might miss something here.
> 
> Yes, I would imagine virtio_dev_probe() would be called for them too
> 
> 
> 
>> xen_virtio_restricted_mem_acc() gets called indirectly from
>> virtio_dev_probe()->virtio_features_ok()->
>> virtio_check_mem_acc_cb(). So the Xen grant DMA ops are only installed
>> for those.
>>
>>
>>>
>>> I think we need to add a second way? It could be anything that can help
>>> us distinguish between a non-grants-capable virtio backend and a
>>> grants-capable virtio backend, such as:
>>> - a string on xenstore
>>> - a xen param
>>> - a special PCI configuration register value
>>> - something in the ACPI tables
>>> - the QEMU machine type
>>
>>
>> Yes, I remember there was a discussion regarding that. The point is to
>> choose a solution to be functional for both PV and HVM *and* to be able
>> to support a hotplug. IIRC, the xenstore could be a possible candidate.
> 
> xenstore would be among the easiest to make work. The only downside is
> the dependency on xenstore which otherwise virtio+grants doesn't have.

I'm in favor of using xenstore. Especially the hotplug support would be
much easier using xenstore (the alternative would be ACPI, which I don't
think we want for PV guests or many Arm configurations).


Juergen
Roger Pau Monné July 4, 2023, 7:48 a.m. UTC | #7
On Thu, Jun 29, 2023 at 03:44:04PM -0700, Stefano Stabellini wrote:
> On Thu, 29 Jun 2023, Oleksandr Tyshchenko wrote:
> > On 29.06.23 04:00, Stefano Stabellini wrote:
> > > I think we need to add a second way? It could be anything that can help
> > > us distinguish between a non-grants-capable virtio backend and a
> > > grants-capable virtio backend, such as:
> > > - a string on xenstore
> > > - a xen param
> > > - a special PCI configuration register value
> > > - something in the ACPI tables
> > > - the QEMU machine type
> > 
> > 
> > Yes, I remember there was a discussion regarding that. The point is to 
> > choose a solution to be functional for both PV and HVM *and* to be able 
> > to support a hotplug. IIRC, the xenstore could be a possible candidate.
> 
> xenstore would be among the easiest to make work. The only downside is
> the dependency on xenstore which otherwise virtio+grants doesn't have.

I would avoid introducing a dependency on xenstore; if nothing else, we
know it's a performance bottleneck.

We would also need to map the virtio device topology into xenstore, so
that we can pass different options for each device.

> Vikram is working on virtio with grants support in QEMU as we speak.
> Maybe we could find a way to add a flag in QEMU so that we can detect at
> runtime if a given virtio device support grants or not.

Isn't there a way for the device to expose capabilities already?  For
example how does a virtio-blk backend expose support for indirect
descriptors?
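
(For reference, capabilities of that sort are negotiated as virtio
feature bits; indirect descriptor support, for instance, is the
VIRTIO_RING_F_INDIRECT_DESC bit, which a driver can test roughly like
this:)

	/* Illustration: existing virtio capabilities are exposed as feature bits. */
	static bool backend_supports_indirect(struct virtio_device *vdev)
	{
		return virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC);
	}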

Thanks, Roger.
Jürgen Groß July 4, 2023, 10:39 a.m. UTC | #8
On 04.07.23 09:48, Roger Pau Monné wrote:
> On Thu, Jun 29, 2023 at 03:44:04PM -0700, Stefano Stabellini wrote:
>> On Thu, 29 Jun 2023, Oleksandr Tyshchenko wrote:
>>> On 29.06.23 04:00, Stefano Stabellini wrote:
>>>> I think we need to add a second way? It could be anything that can help
>>>> us distinguish between a non-grants-capable virtio backend and a
>>>> grants-capable virtio backend, such as:
>>>> - a string on xenstore
>>>> - a xen param
>>>> - a special PCI configuration register value
>>>> - something in the ACPI tables
>>>> - the QEMU machine type
>>>
>>>
>>> Yes, I remember there was a discussion regarding that. The point is to
>>> choose a solution to be functional for both PV and HVM *and* to be able
>>> to support a hotplug. IIRC, the xenstore could be a possible candidate.
>>
>> xenstore would be among the easiest to make work. The only downside is
>> the dependency on xenstore which otherwise virtio+grants doesn't have.
> 
> I would avoid introducing a dependency on xenstore, if nothing else we
> know it's a performance bottleneck.
> 
> We would also need to map the virtio device topology into xenstore, so
> that we can pass different options for each device.

This aspect (different options) is important. How do you want to pass virtio
device configuration parameters from dom0 to the virtio backend domain? You
probably need something like Xenstore (a virtio-based alternative like virtiofs
would work, too) for that purpose.

Mapping the topology should be rather easy via the PCI-Id, e.g.:

/local/domain/42/device/virtio/0000:00:1c.0/backend
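
A purely illustrative frontend-side lookup against such a layout (the
path and the "backend" key are hypothetical, not an agreed interface)
could look roughly like:

	/* Sketch only: read a hypothetical per-device backend domid from xenstore. */
	static int example_virtio_backend_domid(struct pci_dev *pdev, domid_t *domid)
	{
		char path[32], *val;
		unsigned int v;
		int ret;

		/* e.g. "device/virtio/0000:00:1c.0", relative to the frontend's home dir */
		snprintf(path, sizeof(path), "device/virtio/%s", pci_name(pdev));

		val = xenbus_read(XBT_NIL, path, "backend", NULL);
		if (IS_ERR(val))
			return PTR_ERR(val);

		ret = kstrtouint(val, 10, &v);
		kfree(val);
		if (ret)
			return ret;

		*domid = v;
		return 0;
	}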

> 
>> Vikram is working on virtio with grants support in QEMU as we speak.
>> Maybe we could find a way to add a flag in QEMU so that we can detect at
>> runtime if a given virtio device support grants or not.
> 
> Isn't there a way for the device to expose capabilities already?  For
> example how does a virtio-blk backend expose support for indirect
> descriptors?

Those capabilities are defined in the virtio spec [1]. Adding the backend
domid would be possible, but it probably wouldn't be that easy (it requires
changing the virtio spec by either expanding an existing config area or
adding a new one). I'm not sure the handling in the specific frontends is
generic enough to provide a central place where the backend domid could be
retrieved without requiring any change to the frontends.
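
To make "expanding an existing config area or adding a new one" concrete,
a purely hypothetical sketch (nothing like this exists in the spec today;
the name, layout and semantics are invented for illustration) could be a
new virtio-pci capability carrying the backend domid:

	struct virtio_pci_xen_backend_cap {
		struct virtio_pci_cap cap;	/* standard virtio-pci capability header */
		__le16 backend_domid;		/* domid of the grant-capable backend */
	};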


Juergen

[1]: http://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.html
Marek Marczykowski-Górecki July 4, 2023, 11:43 a.m. UTC | #9
Hi,

FWIW, I ran into this issue some time ago too. I run Xen on top of
KVM and then pass through some of the virtio devices (the network one
specifically) into a (PV) guest. So, I hit both cases, the dom0 one and
the domU one. As a temporary workaround I needed to disable
CONFIG_XEN_VIRTIO completely (just disabling
CONFIG_XEN_VIRTIO_FORCE_GRANT was not enough to fix it).
With that context in place, the actual response is below.

On Tue, Jul 04, 2023 at 12:39:40PM +0200, Juergen Gross wrote:
> On 04.07.23 09:48, Roger Pau Monné wrote:
> > On Thu, Jun 29, 2023 at 03:44:04PM -0700, Stefano Stabellini wrote:
> > > On Thu, 29 Jun 2023, Oleksandr Tyshchenko wrote:
> > > > On 29.06.23 04:00, Stefano Stabellini wrote:
> > > > > I think we need to add a second way? It could be anything that can help
> > > > > us distinguish between a non-grants-capable virtio backend and a
> > > > > grants-capable virtio backend, such as:
> > > > > - a string on xenstore
> > > > > - a xen param
> > > > > - a special PCI configuration register value
> > > > > - something in the ACPI tables
> > > > > - the QEMU machine type
> > > > 
> > > > 
> > > > Yes, I remember there was a discussion regarding that. The point is to
> > > > choose a solution to be functional for both PV and HVM *and* to be able
> > > > to support a hotplug. IIRC, the xenstore could be a possible candidate.
> > > 
> > > xenstore would be among the easiest to make work. The only downside is
> > > the dependency on xenstore which otherwise virtio+grants doesn't have.
> > 
> > I would avoid introducing a dependency on xenstore, if nothing else we
> > know it's a performance bottleneck.
> > 
> > We would also need to map the virtio device topology into xenstore, so
> > that we can pass different options for each device.
> 
> This aspect (different options) is important. How do you want to pass virtio
> device configuration parameters from dom0 to the virtio backend domain? You
> probably need something like Xenstore (a virtio based alternative like virtiofs
> would work, too) for that purpose.
> 
> Mapping the topology should be rather easy via the PCI-Id, e.g.:
> 
> /local/domain/42/device/virtio/0000:00:1c.0/backend

While I agree this would probably be the simplest to implement, I don't
like introducing a xenstore dependency into the virtio frontend either.
Toolstack -> backend communication is probably easier to solve, as it's
much more flexible (it could use the qemu cmdline, QMP, or other similar
mechanisms for non-qemu backends, etc.).

> > > Vikram is working on virtio with grants support in QEMU as we speak.
> > > Maybe we could find a way to add a flag in QEMU so that we can detect at
> > > runtime if a given virtio device support grants or not.
> > 
> > Isn't there a way for the device to expose capabilities already?  For
> > example how does a virtio-blk backend expose support for indirect
> > descriptors?
> 
> Those capabilities are defined in the virtio spec [1]. Adding the backend
> domid would be possible, but it probably wouldn't be that easy (requires
> changing the virtio spec by either expanding an existing config area or by
> adding a new one). I'm not sure handling in the specific frontends is
> generic enough for being able to have a central place where the backend
> domid could be retrieved, without requiring any change of the frontends.

IMHO the proper solution is to extend the spec. I don't have much
experience with virtio code, but reading the spec it looks like a new
config area would be better for compatibility/uniform handling in a
frontend-agnostic way. Since it will definitely take time, some
transitional solution (maybe even xenstore...) might be warranted.

> Juergen
> 
> [1]: http://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.html
Roger Pau Monné July 4, 2023, 2:49 p.m. UTC | #10
On Tue, Jul 04, 2023 at 01:43:46PM +0200, Marek Marczykowski-Górecki wrote:
> Hi,
> 
> FWIW, I have ran into this issue some time ago too. I run Xen on top of
> KVM and then passthrough some of the virtio devices (network one
> specifically) into a (PV) guest. So, I hit both cases, the dom0 one and
> domU one. As a temporary workaround I needed to disable
> CONFIG_XEN_VIRTIO completely (just disabling
> CONFIG_XEN_VIRTIO_FORCE_GRANT was not enough to fix it).
> With that context in place, the actual response below.
> 
> On Tue, Jul 04, 2023 at 12:39:40PM +0200, Juergen Gross wrote:
> > On 04.07.23 09:48, Roger Pau Monné wrote:
> > > On Thu, Jun 29, 2023 at 03:44:04PM -0700, Stefano Stabellini wrote:
> > > > On Thu, 29 Jun 2023, Oleksandr Tyshchenko wrote:
> > > > > On 29.06.23 04:00, Stefano Stabellini wrote:
> > > > > > I think we need to add a second way? It could be anything that can help
> > > > > > us distinguish between a non-grants-capable virtio backend and a
> > > > > > grants-capable virtio backend, such as:
> > > > > > - a string on xenstore
> > > > > > - a xen param
> > > > > > - a special PCI configuration register value
> > > > > > - something in the ACPI tables
> > > > > > - the QEMU machine type
> > > > > 
> > > > > 
> > > > > Yes, I remember there was a discussion regarding that. The point is to
> > > > > choose a solution to be functional for both PV and HVM *and* to be able
> > > > > to support a hotplug. IIRC, the xenstore could be a possible candidate.
> > > > 
> > > > xenstore would be among the easiest to make work. The only downside is
> > > > the dependency on xenstore which otherwise virtio+grants doesn't have.
> > > 
> > > I would avoid introducing a dependency on xenstore, if nothing else we
> > > know it's a performance bottleneck.
> > > 
> > > We would also need to map the virtio device topology into xenstore, so
> > > that we can pass different options for each device.
> > 
> > This aspect (different options) is important. How do you want to pass virtio
> > device configuration parameters from dom0 to the virtio backend domain? You
> > probably need something like Xenstore (a virtio based alternative like virtiofs
> > would work, too) for that purpose.
> > 
> > Mapping the topology should be rather easy via the PCI-Id, e.g.:
> > 
> > /local/domain/42/device/virtio/0000:00:1c.0/backend
> 
> While I agree this would probably be the simplest to implement, I don't
> like introducing xenstore dependency into virtio frontend either.
> Toolstack -> backend communication is probably easier to solve, as it's
> much more flexible (could use qemu cmdline, QMP, other similar
> mechanisms for non-qemu backends etc).

I also think features should be exposed uniformly for devices; it's at
least weird to have certain features exposed in the PCI config space
while other features are exposed in xenstore.

For virtio-mmio this might get a bit confusing: are we going to add
xenstore entries based on the position of the device config mmio
region?

I think on Arm PCI enumeration is not (usually?) done by the firmware,
at which point the SBDF expected by the tools/backend might be
different from the value assigned by the guest OS.

I think there are two slightly different issues. One is how to pass
information to virtio backends; doing this initially based on
xenstore is not that bad, because it's an internal detail of the
backend implementation. However, passing information to virtio
frontends using xenstore is IMO a bad idea: there's already a way to
negotiate features between virtio frontends and backends, and Xen
should just expand and use that.


> > > > Vikram is working on virtio with grants support in QEMU as we speak.
> > > > Maybe we could find a way to add a flag in QEMU so that we can detect at
> > > > runtime if a given virtio device support grants or not.
> > > 
> > > Isn't there a way for the device to expose capabilities already?  For
> > > example how does a virtio-blk backend expose support for indirect
> > > descriptors?
> > 
> > Those capabilities are defined in the virtio spec [1]. Adding the backend
> > domid would be possible, but it probably wouldn't be that easy (requires
> > changing the virtio spec by either expanding an existing config area or by
> > adding a new one). I'm not sure handling in the specific frontends is
> > generic enough for being able to have a central place where the backend
> > domid could be retrieved, without requiring any change of the frontends.
> 
> IMHO the proper solution is to extend the spec. I don't have much
> experience with virtio code, but reading the spec it looks like new
> config area will be better for compatibility/uniform handling in a
> frontent-agnostic way. Since it will definitely take time, some
> transitional solution (maybe even xenstore...) might be warranted.

My fear is that such 'transitional' solutions end up being definitive
ones, as people move to other stuff once things work.  Might be better
to implement a transitional approach based on a future virtio
specification change rather than something out of band based on a
completely different communication channel.

Thanks, Roger.
Oleksandr Tyshchenko July 4, 2023, 5:14 p.m. UTC | #11
On Tue, Jul 4, 2023 at 5:49 PM Roger Pau Monné <roger.pau@citrix.com> wrote:

Hello all.

[sorry for the possible format issues]


On Tue, Jul 04, 2023 at 01:43:46PM +0200, Marek Marczykowski-Górecki wrote:
> > Hi,
> >
> > FWIW, I have ran into this issue some time ago too. I run Xen on top of
> > KVM and then passthrough some of the virtio devices (network one
> > specifically) into a (PV) guest. So, I hit both cases, the dom0 one and
> > domU one. As a temporary workaround I needed to disable
> > CONFIG_XEN_VIRTIO completely (just disabling
> > CONFIG_XEN_VIRTIO_FORCE_GRANT was not enough to fix it).
> > With that context in place, the actual response below.
> >
> > On Tue, Jul 04, 2023 at 12:39:40PM +0200, Juergen Gross wrote:
> > > On 04.07.23 09:48, Roger Pau Monné wrote:
> > > > On Thu, Jun 29, 2023 at 03:44:04PM -0700, Stefano Stabellini wrote:
> > > > > On Thu, 29 Jun 2023, Oleksandr Tyshchenko wrote:
> > > > > > On 29.06.23 04:00, Stefano Stabellini wrote:
> > > > > > > I think we need to add a second way? It could be anything that
> can help
> > > > > > > us distinguish between a non-grants-capable virtio backend and
> a
> > > > > > > grants-capable virtio backend, such as:
> > > > > > > - a string on xenstore
> > > > > > > - a xen param
> > > > > > > - a special PCI configuration register value
> > > > > > > - something in the ACPI tables
> > > > > > > - the QEMU machine type
> > > > > >
> > > > > >
> > > > > > Yes, I remember there was a discussion regarding that. The point
> is to
> > > > > > choose a solution to be functional for both PV and HVM *and* to
> be able
> > > > > > to support a hotplug. IIRC, the xenstore could be a possible
> candidate.
> > > > >
> > > > > xenstore would be among the easiest to make work. The only
> downside is
> > > > > the dependency on xenstore which otherwise virtio+grants doesn't
> have.
> > > >
> > > > I would avoid introducing a dependency on xenstore, if nothing else
> we
> > > > know it's a performance bottleneck.
> > > >
> > > > We would also need to map the virtio device topology into xenstore,
> so
> > > > that we can pass different options for each device.
> > >
> > > This aspect (different options) is important. How do you want to pass
> virtio
> > > device configuration parameters from dom0 to the virtio backend
> domain? You
> > > probably need something like Xenstore (a virtio based alternative like
> virtiofs
> > > would work, too) for that purpose.
> > >
> > > Mapping the topology should be rather easy via the PCI-Id, e.g.:
> > >
> > > /local/domain/42/device/virtio/0000:00:1c.0/backend
> >
> > While I agree this would probably be the simplest to implement, I don't
> > like introducing xenstore dependency into virtio frontend either.
> > Toolstack -> backend communication is probably easier to solve, as it's
> > much more flexible (could use qemu cmdline, QMP, other similar
> > mechanisms for non-qemu backends etc).
>
> I also think features should be exposed uniformly for devices, it's at
> least weird to have certain features exposed in the PCI config space
> while other features exposed in xenstore.
>
> For virtio-mmio this might get a bit confusing, are we going to add
> xenstore entries based on the position of the device config mmio
> region?
>
> I think on Arm PCI enumeration is not (usually?) done by the firmware,
> at which point the SBDF expected by the tools/backend might be
> different than the value assigned by the guest OS.
>
> I think there are two slightly different issues, one is how to pass
> information to virtio backends, I think doing this initially based on
> xenstore is not that bad, because it's an internal detail of the
> backend implementation. However passing information to virtio
> frontends using xenstore is IMO a bad idea, there's already a way to
> negotiate features between virtio frontends and backends, and Xen
> should just expand and use that.
>
>

On Arm with device-tree we have special bindings whose purpose is to
inform us whether we need to use grants for virtio, and what the backend
domid is, for a particular device. Here on x86, we don't have a device
tree, so we cannot (easily?) reuse this logic.

I have just recalled one idea suggested by Stefano some time ago [1].
The context of the discussion was what to do when device-tree and ACPI
cannot be reused (or something like that). The idea won't cover virtio-mmio,
but I have heard that virtio-mmio usage with x86 Xen is a rather unusual case.

I will paste the text below for convenience.

**********

Part 1 (intro):

We could reuse a PCI config space register to expose the backend id.
However this solution requires a backend change (QEMU) to expose the
backend id via an emulated register for each emulated device.

To avoid having to introduce a special config space register in all
emulated PCI devices (virtio-net, virtio-block, etc) I wonder if we
could add a special PCI config space register at the emulated PCI Root
Complex level.

Basically the workflow would be as follow:

- Linux recognizes the PCI Root Complex as a Xen PCI Root Complex
- Linux writes to special PCI config space register of the Xen PCI Root
  Complex the PCI device id (basically the BDF)
- The Xen PCI Root Complex emulated by Xen answers by writing back to
  the same location the backend id (domid of the backend)
- Linux reads back the same PCI config space register of the Xen PCI
  Root Complex and learn the relevant domid

Part 2 (clarification):

I think using a special config space register in the root complex would
not be terrible in terms of guest changes because it is easy to
introduce a new root complex driver in Linux and other OSes. The root
complex would still be ECAM compatible so the regular ECAM driver would
still work. A new driver would only be necessary if you want to be able
to access the special config space register.


**********
What do you think about it? Are there any pitfalls, etc? This also requires
system changes, but at least it avoids virtio spec changes.

[1]
https://lore.kernel.org/xen-devel/alpine.DEB.2.22.394.2210061747590.3690179@ubuntu-linux-20-04-desktop/
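
For illustration only, a minimal sketch of what the Part 1 workflow above
could look like on the Linux side. The register offset
XEN_RC_BACKEND_DOMID_REG is purely hypothetical (no such register exists
today), and the sketch assumes the emulated Xen PCI Root Complex shows up as
device 00:00.0:

#include <linux/errno.h>
#include <linux/pci.h>

#define XEN_RC_BACKEND_DOMID_REG        0x40    /* hypothetical offset */

static int xen_rc_query_backend_domid(struct pci_dev *virtio_dev, u16 *domid)
{
        struct pci_dev *rc;
        u32 val;
        int ret;

        rc = pci_get_domain_bus_and_slot(pci_domain_nr(virtio_dev->bus), 0,
                                         PCI_DEVFN(0, 0));
        if (!rc)
                return -ENODEV;

        /* Step 1: tell the root complex which device we are asking about. */
        ret = pci_write_config_dword(rc, XEN_RC_BACKEND_DOMID_REG,
                                     pci_dev_id(virtio_dev));
        if (ret)
                goto out;

        /* Step 2: the emulator answers by replacing the BDF with the
         * backend's domid at the same location. */
        ret = pci_read_config_dword(rc, XEN_RC_BACKEND_DOMID_REG, &val);
        if (!ret)
                *domid = (u16)val;
out:
        pci_dev_put(rc);
        return ret;
}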
Jürgen Groß July 5, 2023, 4:46 a.m. UTC | #12
On 04.07.23 19:14, Oleksandr Tyshchenko wrote:
> 
> 
> On Tue, Jul 4, 2023 at 5:49 PM Roger Pau Monné <roger.pau@citrix.com> wrote:
> 
> Hello all.
> 
> [sorry for the possible format issues]
> 
> 
>     On Tue, Jul 04, 2023 at 01:43:46PM +0200, Marek Marczykowski-Górecki wrote:
>      > Hi,
>      >
>      > FWIW, I have ran into this issue some time ago too. I run Xen on top of
>      > KVM and then passthrough some of the virtio devices (network one
>      > specifically) into a (PV) guest. So, I hit both cases, the dom0 one and
>      > domU one. As a temporary workaround I needed to disable
>      > CONFIG_XEN_VIRTIO completely (just disabling
>      > CONFIG_XEN_VIRTIO_FORCE_GRANT was not enough to fix it).
>      > With that context in place, the actual response below.
>      >
>      > On Tue, Jul 04, 2023 at 12:39:40PM +0200, Juergen Gross wrote:
>      > > On 04.07.23 09:48, Roger Pau Monné wrote:
>      > > > On Thu, Jun 29, 2023 at 03:44:04PM -0700, Stefano Stabellini wrote:
>      > > > > On Thu, 29 Jun 2023, Oleksandr Tyshchenko wrote:
>      > > > > > On 29.06.23 04:00, Stefano Stabellini wrote:
>      > > > > > > I think we need to add a second way? It could be anything that
>     can help
>      > > > > > > us distinguish between a non-grants-capable virtio backend and a
>      > > > > > > grants-capable virtio backend, such as:
>      > > > > > > - a string on xenstore
>      > > > > > > - a xen param
>      > > > > > > - a special PCI configuration register value
>      > > > > > > - something in the ACPI tables
>      > > > > > > - the QEMU machine type
>      > > > > >
>      > > > > >
>      > > > > > Yes, I remember there was a discussion regarding that. The point
>     is to
>      > > > > > choose a solution to be functional for both PV and HVM *and* to
>     be able
>      > > > > > to support a hotplug. IIRC, the xenstore could be a possible
>     candidate.
>      > > > >
>      > > > > xenstore would be among the easiest to make work. The only downside is
>      > > > > the dependency on xenstore which otherwise virtio+grants doesn't have.
>      > > >
>      > > > I would avoid introducing a dependency on xenstore, if nothing else we
>      > > > know it's a performance bottleneck.
>      > > >
>      > > > We would also need to map the virtio device topology into xenstore, so
>      > > > that we can pass different options for each device.
>      > >
>      > > This aspect (different options) is important. How do you want to pass
>     virtio
>      > > device configuration parameters from dom0 to the virtio backend domain? You
>      > > probably need something like Xenstore (a virtio based alternative like
>     virtiofs
>      > > would work, too) for that purpose.
>      > >
>      > > Mapping the topology should be rather easy via the PCI-Id, e.g.:
>      > >
>      > > /local/domain/42/device/virtio/0000:00:1c.0/backend
>      >
>      > While I agree this would probably be the simplest to implement, I don't
>      > like introducing xenstore dependency into virtio frontend either.
>      > Toolstack -> backend communication is probably easier to solve, as it's
>      > much more flexible (could use qemu cmdline, QMP, other similar
>      > mechanisms for non-qemu backends etc).
> 
>     I also think features should be exposed uniformly for devices, it's at
>     least weird to have certain features exposed in the PCI config space
>     while other features exposed in xenstore.
> 
>     For virtio-mmio this might get a bit confusing, are we going to add
>     xenstore entries based on the position of the device config mmio
>     region?
> 
>     I think on Arm PCI enumeration is not (usually?) done by the firmware,
>     at which point the SBDF expected by the tools/backend might be
>     different than the value assigned by the guest OS.
> 
>     I think there are two slightly different issues, one is how to pass
>     information to virtio backends, I think doing this initially based on
>     xenstore is not that bad, because it's an internal detail of the
>     backend implementation. However passing information to virtio
>     frontends using xenstore is IMO a bad idea, there's already a way to
>     negotiate features between virtio frontends and backends, and Xen
>     should just expand and use that.
> 
> 
> 
> On Arm with device-tree we have a special bindings which purpose is to inform us 
> whether we need to use grants for virtio and backend domid for a particular 
> device.Here on x86, we don't have a device tree, so cannot (easily?) reuse this 
> logic.
> 
> I have just recollected one idea suggested by Stefano some time ago [1]. The 
> context of discussion was about what to do when device-tree and ACPI cannot be 
> reused (or something like that).The idea won't cover virtio-mmio, but I have 
> heard that virtio-mmio usage with x86 Xen is rather unusual case.
> 
> I will paste the text below for convenience.
> 
> **********
> 
> Part 1 (intro):
> 
> We could reuse a PCI config space register to expose the backend id.
> However this solution requires a backend change (QEMU) to expose the
> backend id via an emulated register for each emulated device.
> 
> To avoid having to introduce a special config space register in all
> emulated PCI devices (virtio-net, virtio-block, etc) I wonder if we
> could add a special PCI config space register at the emulated PCI Root
> Complex level.
> 
> Basically the workflow would be as follow:
> 
> - Linux recognizes the PCI Root Complex as a Xen PCI Root Complex
> - Linux writes to special PCI config space register of the Xen PCI Root
>    Complex the PCI device id (basically the BDF)
> - The Xen PCI Root Complex emulated by Xen answers by writing back to
>    the same location the backend id (domid of the backend)
> - Linux reads back the same PCI config space register of the Xen PCI
>    Root Complex and learn the relevant domid
> 
> Part 2 (clarification):
> 
> I think using a special config space register in the root complex would
> not be terrible in terms of guest changes because it is easy to
> introduce a new root complex driver in Linux and other OSes. The root
> complex would still be ECAM compatible so the regular ECAM driver would
> still work. A new driver would only be necessary if you want to be able
> to access the special config space register.
> 
> 
> **********
> What do you think about it? Are there any pitfalls, etc? This also requires 
> system changes, but at least without virtio spec changes.
> 
> [1] 
> https://lore.kernel.org/xen-devel/alpine.DEB.2.22.394.2210061747590.3690179@ubuntu-linux-20-04-desktop/

Sounds like a good idea. There would need to be one PCI root per backend
domain, but that should be possible.


Juergen
Roger Pau Monné July 5, 2023, 8:32 a.m. UTC | #13
On Tue, Jul 04, 2023 at 08:14:59PM +0300, Oleksandr Tyshchenko wrote:
> On Tue, Jul 4, 2023 at 5:49 PM Roger Pau Monné <roger.pau@citrix.com> wrote:
> 
> Hello all.
> 
> [sorry for the possible format issues]
> 
> 
> On Tue, Jul 04, 2023 at 01:43:46PM +0200, Marek Marczykowski-Górecki wrote:
> > > Hi,
> > >
> > > FWIW, I have ran into this issue some time ago too. I run Xen on top of
> > > KVM and then passthrough some of the virtio devices (network one
> > > specifically) into a (PV) guest. So, I hit both cases, the dom0 one and
> > > domU one. As a temporary workaround I needed to disable
> > > CONFIG_XEN_VIRTIO completely (just disabling
> > > CONFIG_XEN_VIRTIO_FORCE_GRANT was not enough to fix it).
> > > With that context in place, the actual response below.
> > >
> > > On Tue, Jul 04, 2023 at 12:39:40PM +0200, Juergen Gross wrote:
> > > > On 04.07.23 09:48, Roger Pau Monné wrote:
> > > > > On Thu, Jun 29, 2023 at 03:44:04PM -0700, Stefano Stabellini wrote:
> > > > > > On Thu, 29 Jun 2023, Oleksandr Tyshchenko wrote:
> > > > > > > On 29.06.23 04:00, Stefano Stabellini wrote:
> > > > > > > > I think we need to add a second way? It could be anything that
> > can help
> > > > > > > > us distinguish between a non-grants-capable virtio backend and
> > a
> > > > > > > > grants-capable virtio backend, such as:
> > > > > > > > - a string on xenstore
> > > > > > > > - a xen param
> > > > > > > > - a special PCI configuration register value
> > > > > > > > - something in the ACPI tables
> > > > > > > > - the QEMU machine type
> > > > > > >
> > > > > > >
> > > > > > > Yes, I remember there was a discussion regarding that. The point
> > is to
> > > > > > > choose a solution to be functional for both PV and HVM *and* to
> > be able
> > > > > > > to support a hotplug. IIRC, the xenstore could be a possible
> > candidate.
> > > > > >
> > > > > > xenstore would be among the easiest to make work. The only
> > downside is
> > > > > > the dependency on xenstore which otherwise virtio+grants doesn't
> > have.
> > > > >
> > > > > I would avoid introducing a dependency on xenstore, if nothing else
> > we
> > > > > know it's a performance bottleneck.
> > > > >
> > > > > We would also need to map the virtio device topology into xenstore,
> > so
> > > > > that we can pass different options for each device.
> > > >
> > > > This aspect (different options) is important. How do you want to pass
> > virtio
> > > > device configuration parameters from dom0 to the virtio backend
> > domain? You
> > > > probably need something like Xenstore (a virtio based alternative like
> > virtiofs
> > > > would work, too) for that purpose.
> > > >
> > > > Mapping the topology should be rather easy via the PCI-Id, e.g.:
> > > >
> > > > /local/domain/42/device/virtio/0000:00:1c.0/backend
> > >
> > > While I agree this would probably be the simplest to implement, I don't
> > > like introducing xenstore dependency into virtio frontend either.
> > > Toolstack -> backend communication is probably easier to solve, as it's
> > > much more flexible (could use qemu cmdline, QMP, other similar
> > > mechanisms for non-qemu backends etc).
> >
> > I also think features should be exposed uniformly for devices, it's at
> > least weird to have certain features exposed in the PCI config space
> > while other features exposed in xenstore.
> >
> > For virtio-mmio this might get a bit confusing, are we going to add
> > xenstore entries based on the position of the device config mmio
> > region?
> >
> > I think on Arm PCI enumeration is not (usually?) done by the firmware,
> > at which point the SBDF expected by the tools/backend might be
> > different than the value assigned by the guest OS.
> >
> > I think there are two slightly different issues, one is how to pass
> > information to virtio backends, I think doing this initially based on
> > xenstore is not that bad, because it's an internal detail of the
> > backend implementation. However passing information to virtio
> > frontends using xenstore is IMO a bad idea, there's already a way to
> > negotiate features between virtio frontends and backends, and Xen
> > should just expand and use that.
> >
> >
> 
> On Arm with device-tree we have a special bindings which purpose is to
> inform us whether we need to use grants for virtio and backend domid for a
> particular device.Here on x86, we don't have a device tree, so cannot
> (easily?) reuse this logic.
> 
> I have just recollected one idea suggested by Stefano some time ago [1].
> The context of discussion was about what to do when device-tree and ACPI
> cannot be reused (or something like that).The idea won't cover virtio-mmio,
> but I have heard that virtio-mmio usage with x86 Xen is rather unusual case.
> 
> I will paste the text below for convenience.
> 
> **********
> 
> Part 1 (intro):
> 
> We could reuse a PCI config space register to expose the backend id.
> However this solution requires a backend change (QEMU) to expose the
> backend id via an emulated register for each emulated device.
> 
> To avoid having to introduce a special config space register in all
> emulated PCI devices (virtio-net, virtio-block, etc) I wonder if we
> could add a special PCI config space register at the emulated PCI Root
> Complex level.
> 
> Basically the workflow would be as follow:
> 
> - Linux recognizes the PCI Root Complex as a Xen PCI Root Complex
> - Linux writes to special PCI config space register of the Xen PCI Root
>   Complex the PCI device id (basically the BDF)
> - The Xen PCI Root Complex emulated by Xen answers by writing back to
>   the same location the backend id (domid of the backend)
> - Linux reads back the same PCI config space register of the Xen PCI
>   Root Complex and learn the relevant domid

IMO this seems awfully complex.  I'm not familiar with the VirtIO
spec, but I see there's a Vendor data capability, could we possibly
expose Xen-specific information on that capability?

> Part 2 (clarification):
> 
> I think using a special config space register in the root complex would
> not be terrible in terms of guest changes because it is easy to
> introduce a new root complex driver in Linux and other OSes. The root
> complex would still be ECAM compatible so the regular ECAM driver would
> still work. A new driver would only be necessary if you want to be able
> to access the special config space register.

I'm slightly worried about this approach: we end up modifying the root
complex emulation in order to avoid modifying a PCI device emulation
in QEMU, and I'm not sure that's a good trade-off.

Note also that different architectures will likely have different root
complexes, and so you might need to modify several of them, plus then
arrange the PCI layout correctly in order to have the proper hierarchy
so that devices belonging to different driver domains are assigned to
different bridges.

> 
> 
> **********
> What do you think about it? Are there any pitfalls, etc? This also requires
> system changes, but at least without virtio spec changes.

Why are we so reluctant to add spec changes?  I understand this might
take time and effort, but it's the only way IMO to build a sustainable
VirtIO Xen implementation.  Did we already attempt to negotiate
Xen-related spec changes with Oasis, and were those refused?

Thanks, Roger.
Stefano Stabellini July 5, 2023, 10:41 p.m. UTC | #14
On Wed, 5 Jul 2023, Roger Pau Monné wrote:
> On Tue, Jul 04, 2023 at 08:14:59PM +0300, Oleksandr Tyshchenko wrote:
> > On Tue, Jul 4, 2023 at 5:49 PM Roger Pau Monné <roger.pau@citrix.com> wrote:
> > 
> > Hello all.
> > 
> > [sorry for the possible format issues]
> > 
> > 
> > On Tue, Jul 04, 2023 at 01:43:46PM +0200, Marek Marczykowski-Górecki wrote:
> > > > Hi,
> > > >
> > > > FWIW, I have ran into this issue some time ago too. I run Xen on top of
> > > > KVM and then passthrough some of the virtio devices (network one
> > > > specifically) into a (PV) guest. So, I hit both cases, the dom0 one and
> > > > domU one. As a temporary workaround I needed to disable
> > > > CONFIG_XEN_VIRTIO completely (just disabling
> > > > CONFIG_XEN_VIRTIO_FORCE_GRANT was not enough to fix it).
> > > > With that context in place, the actual response below.
> > > >
> > > > On Tue, Jul 04, 2023 at 12:39:40PM +0200, Juergen Gross wrote:
> > > > > On 04.07.23 09:48, Roger Pau Monné wrote:
> > > > > > On Thu, Jun 29, 2023 at 03:44:04PM -0700, Stefano Stabellini wrote:
> > > > > > > On Thu, 29 Jun 2023, Oleksandr Tyshchenko wrote:
> > > > > > > > On 29.06.23 04:00, Stefano Stabellini wrote:
> > > > > > > > > I think we need to add a second way? It could be anything that
> > > can help
> > > > > > > > > us distinguish between a non-grants-capable virtio backend and
> > > a
> > > > > > > > > grants-capable virtio backend, such as:
> > > > > > > > > - a string on xenstore
> > > > > > > > > - a xen param
> > > > > > > > > - a special PCI configuration register value
> > > > > > > > > - something in the ACPI tables
> > > > > > > > > - the QEMU machine type
> > > > > > > >
> > > > > > > >
> > > > > > > > Yes, I remember there was a discussion regarding that. The point
> > > is to
> > > > > > > > choose a solution to be functional for both PV and HVM *and* to
> > > be able
> > > > > > > > to support a hotplug. IIRC, the xenstore could be a possible
> > > candidate.
> > > > > > >
> > > > > > > xenstore would be among the easiest to make work. The only
> > > downside is
> > > > > > > the dependency on xenstore which otherwise virtio+grants doesn't
> > > have.
> > > > > >
> > > > > > I would avoid introducing a dependency on xenstore, if nothing else
> > > we
> > > > > > know it's a performance bottleneck.
> > > > > >
> > > > > > We would also need to map the virtio device topology into xenstore,
> > > so
> > > > > > that we can pass different options for each device.
> > > > >
> > > > > This aspect (different options) is important. How do you want to pass
> > > virtio
> > > > > device configuration parameters from dom0 to the virtio backend
> > > domain? You
> > > > > probably need something like Xenstore (a virtio based alternative like
> > > virtiofs
> > > > > would work, too) for that purpose.
> > > > >
> > > > > Mapping the topology should be rather easy via the PCI-Id, e.g.:
> > > > >
> > > > > /local/domain/42/device/virtio/0000:00:1c.0/backend
> > > >
> > > > While I agree this would probably be the simplest to implement, I don't
> > > > like introducing xenstore dependency into virtio frontend either.
> > > > Toolstack -> backend communication is probably easier to solve, as it's
> > > > much more flexible (could use qemu cmdline, QMP, other similar
> > > > mechanisms for non-qemu backends etc).
> > >
> > > I also think features should be exposed uniformly for devices, it's at
> > > least weird to have certain features exposed in the PCI config space
> > > while other features exposed in xenstore.
> > >
> > > For virtio-mmio this might get a bit confusing, are we going to add
> > > xenstore entries based on the position of the device config mmio
> > > region?
> > >
> > > I think on Arm PCI enumeration is not (usually?) done by the firmware,
> > > at which point the SBDF expected by the tools/backend might be
> > > different than the value assigned by the guest OS.
> > >
> > > I think there are two slightly different issues, one is how to pass
> > > information to virtio backends, I think doing this initially based on
> > > xenstore is not that bad, because it's an internal detail of the
> > > backend implementation. However passing information to virtio
> > > frontends using xenstore is IMO a bad idea, there's already a way to
> > > negotiate features between virtio frontends and backends, and Xen
> > > should just expand and use that.
> > >
> > >
> > 
> > On Arm with device-tree we have a special bindings which purpose is to
> > inform us whether we need to use grants for virtio and backend domid for a
> > particular device.Here on x86, we don't have a device tree, so cannot
> > (easily?) reuse this logic.
> > 
> > I have just recollected one idea suggested by Stefano some time ago [1].
> > The context of discussion was about what to do when device-tree and ACPI
> > cannot be reused (or something like that).The idea won't cover virtio-mmio,
> > but I have heard that virtio-mmio usage with x86 Xen is rather unusual case.
> > 
> > I will paste the text below for convenience.

Thanks Oleksandr! I had forgotten about this, but in retrospect it was a
good suggestion :-D


> > **********
> > 
> > Part 1 (intro):
> > 
> > We could reuse a PCI config space register to expose the backend id.
> > However this solution requires a backend change (QEMU) to expose the
> > backend id via an emulated register for each emulated device.
> > 
> > To avoid having to introduce a special config space register in all
> > emulated PCI devices (virtio-net, virtio-block, etc) I wonder if we
> > could add a special PCI config space register at the emulated PCI Root
> > Complex level.
> > 
> > Basically the workflow would be as follow:
> > 
> > - Linux recognizes the PCI Root Complex as a Xen PCI Root Complex
> > - Linux writes to special PCI config space register of the Xen PCI Root
> >   Complex the PCI device id (basically the BDF)
> > - The Xen PCI Root Complex emulated by Xen answers by writing back to
> >   the same location the backend id (domid of the backend)
> > - Linux reads back the same PCI config space register of the Xen PCI
> >   Root Complex and learn the relevant domid
> 
> IMO this seems awfully complex.  I'm not familiar with the VirtIO
> spec, but I see there's a Vendor data capability, could we possibly
> expose Xen-specific information on that capability?

That is a possibility too. Also, we could use a PCI conf register
which is known to be unused in the Virtio spec to expose the grant
capability and backend domid.


> > Part 2 (clarification):
> > 
> > I think using a special config space register in the root complex would
> > not be terrible in terms of guest changes because it is easy to
> > introduce a new root complex driver in Linux and other OSes. The root
> > complex would still be ECAM compatible so the regular ECAM driver would
> > still work. A new driver would only be necessary if you want to be able
> > to access the special config space register.
> 
> I'm slightly worry of this approach, we end up modifying a root
> complex emulation in order to avoid modifying a PCI device emulation
> on QEMU, not sure that's a good trade off.
> 
> Note also that different architectures will likely have different root
> complex, and so you might need to modify several of them, plus then
> arrange the PCI layout correctly in order to have the proper hierarchy
> so that devices belonging to different driver domains are assigned to
> different bridges.

I do think that adding something to the PCI conf register somewhere is
the best option because it is not dependent on ACPI and it is not
dependent on xenstore both of which are very undesirable.

I am not sure specifically where the best place is. These are 4 ideas
we came up with:
1. PCI root complex
2. a register on the device itself
3. a new capability of the device
4. add one extra dummy PCI device for the sole purpose of exposing the
   grants capability


Looking at the spec, there is a way to add a vendor-specific capability
(cap_vndr = 0x9). Could we use that? It doesn't look like it is used
today; Linux doesn't parse it.
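
For illustration only, a minimal sketch of how a guest could scan a virtio
PCI device for such a Xen payload in a vendor capability. The cfg_type value
and the domid field offset are purely hypothetical placeholders; nothing like
this is defined in the spec today:

#include <linux/errno.h>
#include <linux/pci.h>

#define VIRTIO_PCI_CAP_XEN_CFG          0xfe    /* hypothetical cfg_type */
#define VIRTIO_PCI_CAP_XEN_DOMID_OFF    8       /* hypothetical field offset */

static int virtio_pci_find_xen_domid(struct pci_dev *pdev, u16 *domid)
{
        u8 pos, cfg_type;

        for (pos = pci_find_capability(pdev, PCI_CAP_ID_VNDR); pos;
             pos = pci_find_next_capability(pdev, pos, PCI_CAP_ID_VNDR)) {
                /* Byte 3 of every virtio PCI capability is its cfg_type. */
                pci_read_config_byte(pdev, pos + 3, &cfg_type);
                if (cfg_type != VIRTIO_PCI_CAP_XEN_CFG)
                        continue;

                pci_read_config_word(pdev, pos + VIRTIO_PCI_CAP_XEN_DOMID_OFF,
                                     domid);
                return 0;
        }
        return -ENODEV;
}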


> > 
> > 
> > **********
> > What do you think about it? Are there any pitfalls, etc? This also requires
> > system changes, but at least without virtio spec changes.
> 
> Why are we so reluctant to add spec changes?  I understand this might
> take time an effort, but it's the only way IMO to build a sustainable
> VirtIO Xen implementation.  Did we already attempt to negotiate with
> Oasis Xen related spec changes and those where refused?

That's because spec changes can be very slow. This is a bug that we need
a relatively quick solution for and waiting 12-24 months for a spec
update is not realistic.

I think a spec change would be best as a long term solution. We also
need a short term solution. The short term solution doesn't have to be
ideal but it has to work now.
Roger Pau Monné July 6, 2023, 8:17 a.m. UTC | #15
On Wed, Jul 05, 2023 at 03:41:10PM -0700, Stefano Stabellini wrote:
> On Wed, 5 Jul 2023, Roger Pau Monné wrote:
> > On Tue, Jul 04, 2023 at 08:14:59PM +0300, Oleksandr Tyshchenko wrote:
> > > **********
> > > 
> > > Part 1 (intro):
> > > 
> > > We could reuse a PCI config space register to expose the backend id.
> > > However this solution requires a backend change (QEMU) to expose the
> > > backend id via an emulated register for each emulated device.
> > > 
> > > To avoid having to introduce a special config space register in all
> > > emulated PCI devices (virtio-net, virtio-block, etc) I wonder if we
> > > could add a special PCI config space register at the emulated PCI Root
> > > Complex level.
> > > 
> > > Basically the workflow would be as follow:
> > > 
> > > - Linux recognizes the PCI Root Complex as a Xen PCI Root Complex
> > > - Linux writes to special PCI config space register of the Xen PCI Root
> > >   Complex the PCI device id (basically the BDF)
> > > - The Xen PCI Root Complex emulated by Xen answers by writing back to
> > >   the same location the backend id (domid of the backend)
> > > - Linux reads back the same PCI config space register of the Xen PCI
> > >   Root Complex and learn the relevant domid
> > 
> > IMO this seems awfully complex.  I'm not familiar with the VirtIO
> > spec, but I see there's a Vendor data capability, could we possibly
> > expose Xen-specific information on that capability?
> 
> That is also a possibility too. Also we could use a PCI conf register
> which is known to be unused in the Virtio spec to expose the grant
> capability and backend domid.

Capabilities don't have a fixed config space register; they are a
linked list, and so capabilities end up at different positions
depending on the specific device layout.  The only fixed part is the
range from [0, 0x3F), and that's fully defined in the specification.

Trying to define a fixed address for Xen use after the 0x3f boundary
seems like a bad idea, as it's going to be hard to make sure that such
an address is not already used on some device.  IMO the only way is to
place such information in a capability, whether that's an existing
capability or a new one I don't really know.

> 
> > > Part 2 (clarification):
> > > 
> > > I think using a special config space register in the root complex would
> > > not be terrible in terms of guest changes because it is easy to
> > > introduce a new root complex driver in Linux and other OSes. The root
> > > complex would still be ECAM compatible so the regular ECAM driver would
> > > still work. A new driver would only be necessary if you want to be able
> > > to access the special config space register.
> > 
> > I'm slightly worry of this approach, we end up modifying a root
> > complex emulation in order to avoid modifying a PCI device emulation
> > on QEMU, not sure that's a good trade off.
> > 
> > Note also that different architectures will likely have different root
> > complex, and so you might need to modify several of them, plus then
> > arrange the PCI layout correctly in order to have the proper hierarchy
> > so that devices belonging to different driver domains are assigned to
> > different bridges.
> 
> I do think that adding something to the PCI conf register somewhere is
> the best option because it is not dependent on ACPI and it is not
> dependent on xenstore both of which are very undesirable.
> 
> I am not sure where specifically is the best place. These are 3 ideas
> we came up with:
> 1. PCI root complex
> 2. a register on the device itself
> 3. a new capability of the device
> 4. add one extra dummy PCI device for the sole purpose of exposing the
>    grants capability
> 
> 
> Looking at the spec, there is a way to add a vendor-specific capability
> (cap_vndr = 0x9). Could we use that? It doesn't look like it is used
> today, Linux doesn't parse it.

I did wonder the same from a quick look at the spec.  There's however
a text in the specification that says:

"The driver SHOULD NOT use the Vendor data capability except for
debugging and reporting purposes."

So we would at least need to change that, because the capability would
then be used for purposes other than debugging and reporting.

Seems like a minor adjustment, so it might be worth asking upstream for
their opinion and getting a conversation started.

> 
> > > 
> > > 
> > > **********
> > > What do you think about it? Are there any pitfalls, etc? This also requires
> > > system changes, but at least without virtio spec changes.
> > 
> > Why are we so reluctant to add spec changes?  I understand this might
> > take time an effort, but it's the only way IMO to build a sustainable
> > VirtIO Xen implementation.  Did we already attempt to negotiate with
> > Oasis Xen related spec changes and those where refused?
> 
> That's because spec changes can be very slow. This is a bug that we need
> a relatively quick solution for and waiting 12-24 months for a spec
> update is not realistic.
> 
> I think a spec change would be best as a long term solution. We also
> need a short term solution. The short term solution doesn't have to be
> ideal but it has to work now.

My fear with such an approach is that once a bodge is in place people
move on to other stuff and this never gets properly fixed.

I know this might not be a well received opinion, but it would be
better if such a bodge were kept in each interested party's patch queue
for the time being, until a proper solution is implemented.  That way
there's an interest from all parties in properly fixing it upstream.

Thanks, Roger.
Stefano Stabellini July 6, 2023, 9:49 p.m. UTC | #16
On Thu, 6 Jul 2023, Roger Pau Monné wrote:
> On Wed, Jul 05, 2023 at 03:41:10PM -0700, Stefano Stabellini wrote:
> > On Wed, 5 Jul 2023, Roger Pau Monné wrote:
> > > On Tue, Jul 04, 2023 at 08:14:59PM +0300, Oleksandr Tyshchenko wrote:
> > > > **********
> > > > 
> > > > Part 1 (intro):
> > > > 
> > > > We could reuse a PCI config space register to expose the backend id.
> > > > However this solution requires a backend change (QEMU) to expose the
> > > > backend id via an emulated register for each emulated device.
> > > > 
> > > > To avoid having to introduce a special config space register in all
> > > > emulated PCI devices (virtio-net, virtio-block, etc) I wonder if we
> > > > could add a special PCI config space register at the emulated PCI Root
> > > > Complex level.
> > > > 
> > > > Basically the workflow would be as follow:
> > > > 
> > > > - Linux recognizes the PCI Root Complex as a Xen PCI Root Complex
> > > > - Linux writes to special PCI config space register of the Xen PCI Root
> > > >   Complex the PCI device id (basically the BDF)
> > > > - The Xen PCI Root Complex emulated by Xen answers by writing back to
> > > >   the same location the backend id (domid of the backend)
> > > > - Linux reads back the same PCI config space register of the Xen PCI
> > > >   Root Complex and learn the relevant domid
> > > 
> > > IMO this seems awfully complex.  I'm not familiar with the VirtIO
> > > spec, but I see there's a Vendor data capability, could we possibly
> > > expose Xen-specific information on that capability?
> > 
> > That is also a possibility too. Also we could use a PCI conf register
> > which is known to be unused in the Virtio spec to expose the grant
> > capability and backend domid.
> 
> Capabilities don't have a fixed config space register, they are a
> linked list, and so capabilities end up at different positions
> depending on the specific device layout.  The only fixed part is the
> range from [0, 0x3F), and that's fully defined in the specification.
> 
> Trying to define a fixed address for Xen use after the 3f boundary
> seems like a bad idea, as it's going to be hard to make sure that such
> address is not used on all possible devices.  IMO the only way is to
> place such information in a capability, whether that's an existing
> capability or a new one I don't really know.

That seems like a good idea
 
 
> > > > Part 2 (clarification):
> > > > 
> > > > I think using a special config space register in the root complex would
> > > > not be terrible in terms of guest changes because it is easy to
> > > > introduce a new root complex driver in Linux and other OSes. The root
> > > > complex would still be ECAM compatible so the regular ECAM driver would
> > > > still work. A new driver would only be necessary if you want to be able
> > > > to access the special config space register.
> > > 
> > > I'm slightly worry of this approach, we end up modifying a root
> > > complex emulation in order to avoid modifying a PCI device emulation
> > > on QEMU, not sure that's a good trade off.
> > > 
> > > Note also that different architectures will likely have different root
> > > complex, and so you might need to modify several of them, plus then
> > > arrange the PCI layout correctly in order to have the proper hierarchy
> > > so that devices belonging to different driver domains are assigned to
> > > different bridges.
> > 
> > I do think that adding something to the PCI conf register somewhere is
> > the best option because it is not dependent on ACPI and it is not
> > dependent on xenstore both of which are very undesirable.
> > 
> > I am not sure where specifically is the best place. These are 3 ideas
> > we came up with:
> > 1. PCI root complex
> > 2. a register on the device itself
> > 3. a new capability of the device
> > 4. add one extra dummy PCI device for the sole purpose of exposing the
> >    grants capability
> > 
> > 
> > Looking at the spec, there is a way to add a vendor-specific capability
> > (cap_vndr = 0x9). Could we use that? It doesn't look like it is used
> > today, Linux doesn't parse it.
> 
> I did wonder the same from a quick look at the spec.  There's however
> a text in the specification that says:
> 
> "The driver SHOULD NOT use the Vendor data capability except for
> debugging and reporting purposes."
> 
> So we would at least need to change that because the capability would
> then be used by other purposes different than debugging and reporting.
> 
> Seems like a minor adjustment, so might we worth asking upstream about
> their opinion, and to get a conversation started.

Wait, wouldn't this use-case fall under "reporting"? It is exactly what
we are doing, right?

 
> > > > 
> > > > 
> > > > **********
> > > > What do you think about it? Are there any pitfalls, etc? This also requires
> > > > system changes, but at least without virtio spec changes.
> > > 
> > > Why are we so reluctant to add spec changes?  I understand this might
> > > take time an effort, but it's the only way IMO to build a sustainable
> > > VirtIO Xen implementation.  Did we already attempt to negotiate with
> > > Oasis Xen related spec changes and those where refused?
> > 
> > That's because spec changes can be very slow. This is a bug that we need
> > a relatively quick solution for and waiting 12-24 months for a spec
> > update is not realistic.
> > 
> > I think a spec change would be best as a long term solution. We also
> > need a short term solution. The short term solution doesn't have to be
> > ideal but it has to work now.
> 
> My fear with such approach is that once a bodge is in place people
> move on to other stuff and this never gets properly fixed.
> 
> I know this might not be a well received opinion, but it would be
> better if such bodge is kept in each interested party patchqueue for
> the time being, until a proper solution is implemented.  That way
> there's an interest from parties into properly fixing it upstream.

Unfortunately we are in the situation where we have an outstanding
upstream bug, so we have to take action one way or the other.
Jürgen Groß July 7, 2023, 4:38 a.m. UTC | #17
On 06.07.23 23:49, Stefano Stabellini wrote:
> On Thu, 6 Jul 2023, Roger Pau Monné wrote:
>> On Wed, Jul 05, 2023 at 03:41:10PM -0700, Stefano Stabellini wrote:
>>> On Wed, 5 Jul 2023, Roger Pau Monné wrote:
>>>> On Tue, Jul 04, 2023 at 08:14:59PM +0300, Oleksandr Tyshchenko wrote:
>>>>> **********
>>>>>
>>>>> Part 1 (intro):
>>>>>
>>>>> We could reuse a PCI config space register to expose the backend id.
>>>>> However this solution requires a backend change (QEMU) to expose the
>>>>> backend id via an emulated register for each emulated device.
>>>>>
>>>>> To avoid having to introduce a special config space register in all
>>>>> emulated PCI devices (virtio-net, virtio-block, etc) I wonder if we
>>>>> could add a special PCI config space register at the emulated PCI Root
>>>>> Complex level.
>>>>>
>>>>> Basically the workflow would be as follow:
>>>>>
>>>>> - Linux recognizes the PCI Root Complex as a Xen PCI Root Complex
>>>>> - Linux writes to special PCI config space register of the Xen PCI Root
>>>>>    Complex the PCI device id (basically the BDF)
>>>>> - The Xen PCI Root Complex emulated by Xen answers by writing back to
>>>>>    the same location the backend id (domid of the backend)
>>>>> - Linux reads back the same PCI config space register of the Xen PCI
>>>>>    Root Complex and learn the relevant domid
>>>>
>>>> IMO this seems awfully complex.  I'm not familiar with the VirtIO
>>>> spec, but I see there's a Vendor data capability, could we possibly
>>>> expose Xen-specific information on that capability?
>>>
>>> That is also a possibility too. Also we could use a PCI conf register
>>> which is known to be unused in the Virtio spec to expose the grant
>>> capability and backend domid.
>>
>> Capabilities don't have a fixed config space register, they are a
>> linked list, and so capabilities end up at different positions
>> depending on the specific device layout.  The only fixed part is the
>> range from [0, 0x3F), and that's fully defined in the specification.
>>
>> Trying to define a fixed address for Xen use after the 3f boundary
>> seems like a bad idea, as it's going to be hard to make sure that such
>> address is not used on all possible devices.  IMO the only way is to
>> place such information in a capability, whether that's an existing
>> capability or a new one I don't really know.
> 
> That seems like a good idea
>   
>   
>>>>> Part 2 (clarification):
>>>>>
>>>>> I think using a special config space register in the root complex would
>>>>> not be terrible in terms of guest changes because it is easy to
>>>>> introduce a new root complex driver in Linux and other OSes. The root
>>>>> complex would still be ECAM compatible so the regular ECAM driver would
>>>>> still work. A new driver would only be necessary if you want to be able
>>>>> to access the special config space register.
>>>>
>>>> I'm slightly worry of this approach, we end up modifying a root
>>>> complex emulation in order to avoid modifying a PCI device emulation
>>>> on QEMU, not sure that's a good trade off.
>>>>
>>>> Note also that different architectures will likely have different root
>>>> complex, and so you might need to modify several of them, plus then
>>>> arrange the PCI layout correctly in order to have the proper hierarchy
>>>> so that devices belonging to different driver domains are assigned to
>>>> different bridges.
>>>
>>> I do think that adding something to the PCI conf register somewhere is
>>> the best option because it is not dependent on ACPI and it is not
>>> dependent on xenstore both of which are very undesirable.
>>>
>>> I am not sure where specifically is the best place. These are 3 ideas
>>> we came up with:
>>> 1. PCI root complex
>>> 2. a register on the device itself
>>> 3. a new capability of the device
>>> 4. add one extra dummy PCI device for the sole purpose of exposing the
>>>     grants capability
>>>
>>>
>>> Looking at the spec, there is a way to add a vendor-specific capability
>>> (cap_vndr = 0x9). Could we use that? It doesn't look like it is used
>>> today, Linux doesn't parse it.
>>
>> I did wonder the same from a quick look at the spec.  There's however
>> a text in the specification that says:
>>
>> "The driver SHOULD NOT use the Vendor data capability except for
>> debugging and reporting purposes."
>>
>> So we would at least need to change that because the capability would
>> then be used by other purposes different than debugging and reporting.
>>
>> Seems like a minor adjustment, so might we worth asking upstream about
>> their opinion, and to get a conversation started.
> 
> Wait, wouldn't this use-case fall under "reporting" ? It is exactly what
> we are doing, right?

I'd understand "reporting" as e.g. logging, transferring statistics, ...

We'd like to use it for configuration purposes.

Another idea would be to enhance the virtio IOMMU device to suit our needs:
we could add the domid as another virtio IOMMU device capability and (for now)
use bypass mode for all "productive" devices.

Later we could even add grant-V3 support to Xen and to the virtio IOMMU device
(see my Xen Summit design session from last year). This could be usable for
disaggregated KVM setups, too, so I believe there is a chance of getting this
accepted.
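
For illustration only, a rough sketch of what such a virtio-IOMMU extension
could look like on the guest side. The feature bit, the extra config field
and the helper are all hypothetical; neither exists in the virtio spec nor in
Linux today:

#include <linux/virtio.h>
#include <linux/virtio_config.h>
#include <uapi/linux/virtio_iommu.h>

#define VIRTIO_IOMMU_F_XEN_DOMID        24      /* hypothetical feature bit */

struct virtio_iommu_config_xen {
        struct virtio_iommu_config base;        /* layout as in the current spec */
        __le16 backend_domid;                   /* hypothetical new field */
};

static u16 viommu_xen_backend_domid(struct virtio_device *vdev)
{
        u16 domid = 0;  /* fall back to dom0 if the field is not offered */

        if (virtio_has_feature(vdev, VIRTIO_IOMMU_F_XEN_DOMID))
                virtio_cread_le(vdev, struct virtio_iommu_config_xen,
                                backend_domid, &domid);
        return domid;
}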

>>>>> **********
>>>>> What do you think about it? Are there any pitfalls, etc? This also requires
>>>>> system changes, but at least without virtio spec changes.
>>>>
>>>> Why are we so reluctant to add spec changes?  I understand this might
>>>> take time an effort, but it's the only way IMO to build a sustainable
>>>> VirtIO Xen implementation.  Did we already attempt to negotiate with
>>>> Oasis Xen related spec changes and those where refused?
>>>
>>> That's because spec changes can be very slow. This is a bug that we need
>>> a relatively quick solution for and waiting 12-24 months for a spec
>>> update is not realistic.
>>>
>>> I think a spec change would be best as a long term solution. We also
>>> need a short term solution. The short term solution doesn't have to be
>>> ideal but it has to work now.
>>
>> My fear with such approach is that once a bodge is in place people
>> move on to other stuff and this never gets properly fixed.
>>
>> I know this might not be a well received opinion, but it would be
>> better if such bodge is kept in each interested party patchqueue for
>> the time being, until a proper solution is implemented.  That way
>> there's an interest from parties into properly fixing it upstream.
> 
> Unfortunately we are in the situation where we have an outstanding
> upstream bug, so we have to take action one way or the other.

The required virtio IOMMU device modification would be rather small, so
adding it maybe under a CONFIG option defaulting to off might be
acceptable.


Juergen
Jürgen Groß July 7, 2023, 7:04 a.m. UTC | #18
Re-reading the whole thread again ...

On 29.06.23 03:00, Stefano Stabellini wrote:
> On Wed, 21 Jun 2023, Oleksandr Tyshchenko wrote:
>> On 21.06.23 16:12, Petr Pavlu wrote:
>>
>>
>> Hello Petr
>>
>>
>>> When attempting to run Xen on a QEMU/KVM virtual machine with virtio
>>> devices (all x86_64), dom0 tries to establish a grant for itself which
>>> eventually results in a hang during the boot.
>>>
>>> The backtrace looks as follows, the while loop in __send_control_msg()
>>> makes no progress:
>>>
>>> [backtrace snipped]
>>>
>>> Fix the problem by preventing xen_grant_init_backend_domid() from
>>> setting dom0 as a backend when running in dom0.
>>>
>>> Fixes: 035e3a4321f7 ("xen/virtio: Optimize the setup of "xen-grant-dma" devices")
>>
>>
>> I am not 100% sure whether the Fixes tag points to precise commit. If I
>> am not mistaken, the said commit just moves the code in the context
>> without changing the logic of CONFIG_XEN_VIRTIO_FORCE_GRANT, this was
>> introduced before.
>>
>>
>>> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
>>> ---
>>>    drivers/xen/grant-dma-ops.c | 4 +++-
>>>    1 file changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c
>>> index 76f6f26265a3..29ed27ac450e 100644
>>> --- a/drivers/xen/grant-dma-ops.c
>>> +++ b/drivers/xen/grant-dma-ops.c
>>> @@ -362,7 +362,9 @@ static int xen_grant_init_backend_domid(struct device *dev,
>>>    	if (np) {
>>>    		ret = xen_dt_grant_init_backend_domid(dev, np, backend_domid);
>>>    		of_node_put(np);
>>> -	} else if (IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || xen_pv_domain()) {
>>> +	} else if ((IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) ||
>>> +		    xen_pv_domain()) &&
>>> +		   !xen_initial_domain()) {
>>
>> The commit lgtm, just one note:
>>
>>
>> I would even bail out early in xen_virtio_restricted_mem_acc() instead,
>> as I assume the same issue could happen on Arm with DT (although there
>> we don't guess the backend's domid, we read it from DT and quite
>> unlikely we get Dom0 being in Dom0 with correct DT).
>>
>> Something like:
>>
>> @@ -416,6 +421,10 @@ bool xen_virtio_restricted_mem_acc(struct
>> virtio_device *dev)
>>    {
>>           domid_t backend_domid;
>>
>> +       /* Xen grant DMA ops are not used when running as initial domain */
>> +       if (xen_initial_domain())
>> +               return false;
>> +
>>           if (!xen_grant_init_backend_domid(dev->dev.parent,
>> &backend_domid)) {
>>                   xen_grant_setup_dma_ops(dev->dev.parent, backend_domid);
>>                   return true;
>> (END)
>>
>>
>>
>> If so, that commit subject would need to be updated accordingly.
>>
>> Let's see what other reviewers will say.
> 
> This doesn't work in all cases. Imagine using PCI Passthrough to assign
> a "physical" virtio device to a domU. The domU will run into the same
> error, right?
> 
> The problem is that we need a way for the virtio backend to advertise
> its ability of handling grants. Right now we only have a way to do with
> that with device tree on ARM. On x86, we only have
> CONFIG_XEN_VIRTIO_FORCE_GRANT, and if we take
> CONFIG_XEN_VIRTIO_FORCE_GRANT at face value, it also enables grants for
> "physical" virtio devices. Note that in this case we are fixing a
> nested-virtualization bug, but there are actually physical
> virtio-compatible devices out there. CONFIG_XEN_VIRTIO_FORCE_GRANT will
> break those too.

In case you want virtio device passthrough, you shouldn't use a kernel
built with CONFIG_XEN_VIRTIO_FORCE_GRANT.

And supporting passing through virtio devices of the host to pv-domUs is
a security risk anyway.

We _could_ drop the requirement of the backend needing to set
VIRTIO_F_ACCESS_PLATFORM for PV guests and allow grant-less virtio
handling for all guests. For this to work xen_virtio_restricted_mem_acc()
would need to check for VIRTIO_F_ACCESS_PLATFORM and return true if set.
Maybe we'd want to enable that possibility via a boot parameter?
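
A minimal sketch of that idea (the "xen_grant_dma=off" parameter name and its
handling are made up here, and this assumes the callback is invoked late
enough that virtio_has_feature() reflects the device's offer of
VIRTIO_F_ACCESS_PLATFORM; xen_grant_init_backend_domid() and
xen_grant_setup_dma_ops() are the existing helpers from grant-dma-ops.c):

static bool xen_virtio_grants_optional;

/* Hypothetical parameter: "xen_grant_dma=off" relaxes the grant requirement. */
static int __init parse_xen_grant_dma(char *arg)
{
        if (arg && !strcmp(arg, "off"))
                xen_virtio_grants_optional = true;
        return 0;
}
early_param("xen_grant_dma", parse_xen_grant_dma);

bool xen_virtio_restricted_mem_acc(struct virtio_device *dev)
{
        domid_t backend_domid;

        /*
         * Trust the device instead of forcing grants for PV guests: if it
         * does not offer VIRTIO_F_ACCESS_PLATFORM, allow grant-less operation.
         */
        if (xen_virtio_grants_optional &&
            !virtio_has_feature(dev, VIRTIO_F_ACCESS_PLATFORM))
                return false;

        if (!xen_grant_init_backend_domid(dev->dev.parent, &backend_domid)) {
                xen_grant_setup_dma_ops(dev->dev.parent, backend_domid);
                return true;
        }

        return false;
}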


Juergen
Jürgen Groß July 7, 2023, 7:46 a.m. UTC | #19
On 26.06.23 15:17, Petr Pavlu wrote:
> On 6/21/23 19:58, Oleksandr Tyshchenko wrote:
>> On 21.06.23 16:12, Petr Pavlu wrote:
>>> When attempting to run Xen on a QEMU/KVM virtual machine with virtio
>>> devices (all x86_64), dom0 tries to establish a grant for itself which
>>> eventually results in a hang during the boot.
>>>
>>> The backtrace looks as follows, the while loop in __send_control_msg()
>>> makes no progress:
>>>
>>> [backtrace snipped]
>>>
>>> Fix the problem by preventing xen_grant_init_backend_domid() from
>>> setting dom0 as a backend when running in dom0.
>>>
>>> Fixes: 035e3a4321f7 ("xen/virtio: Optimize the setup of "xen-grant-dma" devices")
>>
>>
>> I am not 100% sure whether the Fixes tag points to precise commit. If I
>> am not mistaken, the said commit just moves the code in the context
>> without changing the logic of CONFIG_XEN_VIRTIO_FORCE_GRANT, this was
>> introduced before.
> 
> I see, the tag should better point to 7228113d1fa0 ("xen/virtio: use
> dom0 as default backend for CONFIG_XEN_VIRTIO_FORCE_GRANT") which
> introduced the original logic to use dom0 as backend.
> 
> Commit 035e3a4321f7 ("xen/virtio: Optimize the setup of "xen-grant-dma"
> devices") is relevant in sense that it extended when this logic is
> active by adding an OR check for xen_pv_domain().
> 
>>
>>
>>> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
>>> ---
>>>    drivers/xen/grant-dma-ops.c | 4 +++-
>>>    1 file changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c
>>> index 76f6f26265a3..29ed27ac450e 100644
>>> --- a/drivers/xen/grant-dma-ops.c
>>> +++ b/drivers/xen/grant-dma-ops.c
>>> @@ -362,7 +362,9 @@ static int xen_grant_init_backend_domid(struct device *dev,
>>>    	if (np) {
>>>    		ret = xen_dt_grant_init_backend_domid(dev, np, backend_domid);
>>>    		of_node_put(np);
>>> -	} else if (IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || xen_pv_domain()) {
>>> +	} else if ((IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) ||
>>> +		    xen_pv_domain()) &&
>>> +		   !xen_initial_domain()) {
>>
>> The commit lgtm, just one note:
>>
>>
>> I would even bail out early in xen_virtio_restricted_mem_acc() instead,
>> as I assume the same issue could happen on Arm with DT (although there
>> we don't guess the backend's domid, we read it from DT and quite
>> unlikely we get Dom0 being in Dom0 with correct DT).
>>
>> Something like:
>>
>> @@ -416,6 +421,10 @@ bool xen_virtio_restricted_mem_acc(struct
>> virtio_device *dev)
>>    {
>>           domid_t backend_domid;
>>
>> +       /* Xen grant DMA ops are not used when running as initial domain */
>> +       if (xen_initial_domain())
>> +               return false;
>> +
>>           if (!xen_grant_init_backend_domid(dev->dev.parent,
>> &backend_domid)) {
>>                   xen_grant_setup_dma_ops(dev->dev.parent, backend_domid);
>>                   return true;
>> (END)
>>
>>
>>
>> If so, that commit subject would need to be updated accordingly.
>>
>> Let's see what other reviewers will say.
> 
> Ok, makes sense.

I think this is okay for a fix of the current problem.

Passing through virtio devices to a PV domU is not covered by this fix, but this
should be a rather rare configuration, which doesn't work today either. So the
suggested patch would fix the current issue without introducing a regression.

Anything else can be done later.


Juergen
Oleksandr Tyshchenko July 7, 2023, 8 a.m. UTC | #20
On 07.07.23 10:04, Juergen Gross wrote:

Hello Juergen


> Re-reading the whole thread again ...
> 
> On 29.06.23 03:00, Stefano Stabellini wrote:
>> On Wed, 21 Jun 2023, Oleksandr Tyshchenko wrote:
>>> On 21.06.23 16:12, Petr Pavlu wrote:
>>>
>>>
>>> Hello Petr
>>>
>>>
>>>> When attempting to run Xen on a QEMU/KVM virtual machine with virtio
>>>> devices (all x86_64), dom0 tries to establish a grant for itself which
>>>> eventually results in a hang during the boot.
>>>>
>>>> The backtrace looks as follows, the while loop in __send_control_msg()
>>>> makes no progress:
>>>>
>>>> [backtrace snipped]
>>>>
>>>> Fix the problem by preventing xen_grant_init_backend_domid() from
>>>> setting dom0 as a backend when running in dom0.
>>>>
>>>> Fixes: 035e3a4321f7 ("xen/virtio: Optimize the setup of 
>>>> "xen-grant-dma" devices")
>>>
>>>
>>> I am not 100% sure whether the Fixes tag points to precise commit. If I
>>> am not mistaken, the said commit just moves the code in the context
>>> without changing the logic of CONFIG_XEN_VIRTIO_FORCE_GRANT, this was
>>> introduced before.
>>>
>>>
>>>> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
>>>> ---
>>>>    drivers/xen/grant-dma-ops.c | 4 +++-
>>>>    1 file changed, 3 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c
>>>> index 76f6f26265a3..29ed27ac450e 100644
>>>> --- a/drivers/xen/grant-dma-ops.c
>>>> +++ b/drivers/xen/grant-dma-ops.c
>>>> @@ -362,7 +362,9 @@ static int xen_grant_init_backend_domid(struct 
>>>> device *dev,
>>>>        if (np) {
>>>>            ret = xen_dt_grant_init_backend_domid(dev, np, 
>>>> backend_domid);
>>>>            of_node_put(np);
>>>> -    } else if (IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || 
>>>> xen_pv_domain()) {
>>>> +    } else if ((IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) ||
>>>> +            xen_pv_domain()) &&
>>>> +           !xen_initial_domain()) {
>>>
>>> The commit lgtm, just one note:
>>>
>>>
>>> I would even bail out early in xen_virtio_restricted_mem_acc() instead,
>>> as I assume the same issue could happen on Arm with DT (although there
>>> we don't guess the backend's domid, we read it from DT and quite
>>> unlikely we get Dom0 being in Dom0 with correct DT).
>>>
>>> Something like:
>>>
>>> @@ -416,6 +421,10 @@ bool xen_virtio_restricted_mem_acc(struct
>>> virtio_device *dev)
>>>    {
>>>           domid_t backend_domid;
>>>
>>> +       /* Xen grant DMA ops are not used when running as initial 
>>> domain */
>>> +       if (xen_initial_domain())
>>> +               return false;
>>> +
>>>           if (!xen_grant_init_backend_domid(dev->dev.parent,
>>> &backend_domid)) {
>>>                   xen_grant_setup_dma_ops(dev->dev.parent, 
>>> backend_domid);
>>>                   return true;
>>> (END)
>>>
>>>
>>>
>>> If so, that commit subject would need to be updated accordingly.
>>>
>>> Let's see what other reviewers will say.
>>
>> This doesn't work in all cases. Imagine using PCI Passthrough to assign
>> a "physical" virtio device to a domU. The domU will run into the same
>> error, right?
>>
>> The problem is that we need a way for the virtio backend to advertise
>> its ability of handling grants. Right now we only have a way to do with
>> that with device tree on ARM. On x86, we only have
>> CONFIG_XEN_VIRTIO_FORCE_GRANT, and if we take
>> CONFIG_XEN_VIRTIO_FORCE_GRANT at face value, it also enables grants for
>> "physical" virtio devices. Note that in this case we are fixing a
>> nested-virtualization bug, but there are actually physical
>> virtio-compatible devices out there. CONFIG_XEN_VIRTIO_FORCE_GRANT will
>> break those too.
> 
> In case you want virtio device passthrough, you shouldn't use a kernel
> built with CONFIG_XEN_VIRTIO_FORCE_GRANT.
> 
> And supporting passing through virtio devices of the host to pv-domUs is
> a security risk anyway.
> 
> We _could_ drop the requirement of the backend needing to set
> VIRTIO_F_ACCESS_PLATFORM for PV guests and allow grant-less virtio
> handling for all guests. For this to work xen_virtio_restricted_mem_acc()
> would need to check for VIRTIO_F_ACCESS_PLATFORM and return true if set.
> Maybe we'd want to enable that possibility via a boot parameter?


Maybe, yes. I don't see at the moment why this won't work.

At the same time I wonder, could we just modify xen_pv_init_platform()
to call virtio_no_restricted_mem_acc() if restricted memory access is
forcibly disabled by a boot parameter, irrespective of
VIRTIO_F_ACCESS_PLATFORM presence?
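
A rough sketch of that variant (the parameter name is made up, and rather
than calling virtio_no_restricted_mem_acc() directly, the same effect can be
had by simply not registering Xen's callback, so the default grant-less
behaviour stays in place):

static bool xen_no_virtio_grants;

static int __init parse_xen_no_virtio_grants(char *arg)
{
        xen_no_virtio_grants = true;
        return 0;
}
early_param("xen_no_virtio_grants", parse_xen_no_virtio_grants);

/*
 * At the spot where the callback is registered today (PV/HVM platform
 * init), skip the registration when grants were disabled on the command
 * line, leaving the default "no restricted access" callback in effect.
 */
if (!xen_no_virtio_grants)
        virtio_set_mem_acc_cb(xen_virtio_restricted_mem_acc);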


> 
> 
> Juergen
Jürgen Groß July 7, 2023, 8:11 a.m. UTC | #21
On 07.07.23 10:00, Oleksandr Tyshchenko wrote:
> 
> 
> On 07.07.23 10:04, Juergen Gross wrote:
> 
> Hello Juergen
> 
> 
>> Re-reading the whole thread again ...
>>
>> On 29.06.23 03:00, Stefano Stabellini wrote:
>>> On Wed, 21 Jun 2023, Oleksandr Tyshchenko wrote:
>>>> On 21.06.23 16:12, Petr Pavlu wrote:
>>>>
>>>>
>>>> Hello Petr
>>>>
>>>>
>>>>> When attempting to run Xen on a QEMU/KVM virtual machine with virtio
>>>>> devices (all x86_64), dom0 tries to establish a grant for itself which
>>>>> eventually results in a hang during the boot.
>>>>>
>>>>> The backtrace looks as follows, the while loop in __send_control_msg()
>>>>> makes no progress:
>>>>>
>>>>> [backtrace snipped]
>>>>>
>>>>> Fix the problem by preventing xen_grant_init_backend_domid() from
>>>>> setting dom0 as a backend when running in dom0.
>>>>>
>>>>> Fixes: 035e3a4321f7 ("xen/virtio: Optimize the setup of
>>>>> "xen-grant-dma" devices")
>>>>
>>>>
>>>> I am not 100% sure whether the Fixes tag points to precise commit. If I
>>>> am not mistaken, the said commit just moves the code in the context
>>>> without changing the logic of CONFIG_XEN_VIRTIO_FORCE_GRANT, this was
>>>> introduced before.
>>>>
>>>>
>>>>> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
>>>>> ---
>>>>>     drivers/xen/grant-dma-ops.c | 4 +++-
>>>>>     1 file changed, 3 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c
>>>>> index 76f6f26265a3..29ed27ac450e 100644
>>>>> --- a/drivers/xen/grant-dma-ops.c
>>>>> +++ b/drivers/xen/grant-dma-ops.c
>>>>> @@ -362,7 +362,9 @@ static int xen_grant_init_backend_domid(struct
>>>>> device *dev,
>>>>>         if (np) {
>>>>>             ret = xen_dt_grant_init_backend_domid(dev, np,
>>>>> backend_domid);
>>>>>             of_node_put(np);
>>>>> -    } else if (IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) ||
>>>>> xen_pv_domain()) {
>>>>> +    } else if ((IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) ||
>>>>> +            xen_pv_domain()) &&
>>>>> +           !xen_initial_domain()) {
>>>>
>>>> The commit lgtm, just one note:
>>>>
>>>>
>>>> I would even bail out early in xen_virtio_restricted_mem_acc() instead,
>>>> as I assume the same issue could happen on Arm with DT (although there
>>>> we don't guess the backend's domid, we read it from DT and quite
>>>> unlikely we get Dom0 being in Dom0 with correct DT).
>>>>
>>>> Something like:
>>>>
>>>> @@ -416,6 +421,10 @@ bool xen_virtio_restricted_mem_acc(struct
>>>> virtio_device *dev)
>>>>     {
>>>>            domid_t backend_domid;
>>>>
>>>> +       /* Xen grant DMA ops are not used when running as initial
>>>> domain */
>>>> +       if (xen_initial_domain())
>>>> +               return false;
>>>> +
>>>>            if (!xen_grant_init_backend_domid(dev->dev.parent,
>>>> &backend_domid)) {
>>>>                    xen_grant_setup_dma_ops(dev->dev.parent,
>>>> backend_domid);
>>>>                    return true;
>>>> (END)
>>>>
>>>>
>>>>
>>>> If so, that commit subject would need to be updated accordingly.
>>>>
>>>> Let's see what other reviewers will say.
>>>
>>> This doesn't work in all cases. Imagine using PCI Passthrough to assign
>>> a "physical" virtio device to a domU. The domU will run into the same
>>> error, right?
>>>
>>> The problem is that we need a way for the virtio backend to advertise
>>> its ability of handling grants. Right now we only have a way to do with
>>> that with device tree on ARM. On x86, we only have
>>> CONFIG_XEN_VIRTIO_FORCE_GRANT, and if we take
>>> CONFIG_XEN_VIRTIO_FORCE_GRANT at face value, it also enables grants for
>>> "physical" virtio devices. Note that in this case we are fixing a
>>> nested-virtualization bug, but there are actually physical
>>> virtio-compatible devices out there. CONFIG_XEN_VIRTIO_FORCE_GRANT will
>>> break those too.
>>
>> In case you want virtio device passthrough, you shouldn't use a kernel
>> built with CONFIG_XEN_VIRTIO_FORCE_GRANT.
>>
>> And supporting passing through virtio devices of the host to pv-domUs is
>> a security risk anyway.
>>
>> We _could_ drop the requirement of the backend needing to set
>> VIRTIO_F_ACCESS_PLATFORM for PV guests and allow grant-less virtio
>> handling for all guests. For this to work xen_virtio_restricted_mem_acc()
>> would need to check for VIRTIO_F_ACCESS_PLATFORM and return true if set.
>> Maybe we'd want to enable that possibility via a boot parameter?
> 
> 
> Maybe, yes. I don't see at the moment why this won't work.
> 
> At the same time I wonder, could we just modify xen_pv_init_platform()
> to call virtio_no_restricted_mem_acc() if restricted memory access is
> forcibly disabled by a boot parameter, irrespective of
> VIRTIO_F_ACCESS_PLATFORM presence?

This wouldn't work for the case where a host virtio device is passed through
to the pv domU and at the same time another virtio device is using dom0 as a
backend. I think we should use grants if possible.


Juergen
Oleksandr Tyshchenko July 7, 2023, 8:23 a.m. UTC | #22
On 07.07.23 11:11, Juergen Gross wrote:

Hello Juergen


> On 07.07.23 10:00, Oleksandr Tyshchenko wrote:
>>
>>
>> On 07.07.23 10:04, Juergen Gross wrote:
>>
>> Hello Juergen
>>
>>
>>> Re-reading the whole thread again ...
>>>
>>> On 29.06.23 03:00, Stefano Stabellini wrote:
>>>> On Wed, 21 Jun 2023, Oleksandr Tyshchenko wrote:
>>>>> On 21.06.23 16:12, Petr Pavlu wrote:
>>>>>
>>>>>
>>>>> Hello Petr
>>>>>
>>>>>
>>>>>> When attempting to run Xen on a QEMU/KVM virtual machine with virtio
>>>>>> devices (all x86_64), dom0 tries to establish a grant for itself 
>>>>>> which
>>>>>> eventually results in a hang during the boot.
>>>>>>
>>>>>> The backtrace looks as follows, the while loop in 
>>>>>> __send_control_msg()
>>>>>> makes no progress:
>>>>>>
>>>>>> [backtrace snipped]
>>>>>>
>>>>>> Fix the problem by preventing xen_grant_init_backend_domid() from
>>>>>> setting dom0 as a backend when running in dom0.
>>>>>>
>>>>>> Fixes: 035e3a4321f7 ("xen/virtio: Optimize the setup of
>>>>>> "xen-grant-dma" devices")
>>>>>
>>>>>
>>>>> I am not 100% sure whether the Fixes tag points to precise commit. 
>>>>> If I
>>>>> am not mistaken, the said commit just moves the code in the context
>>>>> without changing the logic of CONFIG_XEN_VIRTIO_FORCE_GRANT, this was
>>>>> introduced before.
>>>>>
>>>>>
>>>>>> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
>>>>>> ---
>>>>>>     drivers/xen/grant-dma-ops.c | 4 +++-
>>>>>>     1 file changed, 3 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/xen/grant-dma-ops.c 
>>>>>> b/drivers/xen/grant-dma-ops.c
>>>>>> index 76f6f26265a3..29ed27ac450e 100644
>>>>>> --- a/drivers/xen/grant-dma-ops.c
>>>>>> +++ b/drivers/xen/grant-dma-ops.c
>>>>>> @@ -362,7 +362,9 @@ static int xen_grant_init_backend_domid(struct
>>>>>> device *dev,
>>>>>>         if (np) {
>>>>>>             ret = xen_dt_grant_init_backend_domid(dev, np,
>>>>>> backend_domid);
>>>>>>             of_node_put(np);
>>>>>> -    } else if (IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) ||
>>>>>> xen_pv_domain()) {
>>>>>> +    } else if ((IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) ||
>>>>>> +            xen_pv_domain()) &&
>>>>>> +           !xen_initial_domain()) {
>>>>>
>>>>> The commit lgtm, just one note:
>>>>>
>>>>>
>>>>> I would even bail out early in xen_virtio_restricted_mem_acc() 
>>>>> instead,
>>>>> as I assume the same issue could happen on Arm with DT (although there
>>>>> we don't guess the backend's domid, we read it from DT and quite
>>>>> unlikely we get Dom0 being in Dom0 with correct DT).
>>>>>
>>>>> Something like:
>>>>>
>>>>> @@ -416,6 +421,10 @@ bool xen_virtio_restricted_mem_acc(struct
>>>>> virtio_device *dev)
>>>>>     {
>>>>>            domid_t backend_domid;
>>>>>
>>>>> +       /* Xen grant DMA ops are not used when running as initial
>>>>> domain */
>>>>> +       if (xen_initial_domain())
>>>>> +               return false;
>>>>> +
>>>>>            if (!xen_grant_init_backend_domid(dev->dev.parent,
>>>>> &backend_domid)) {
>>>>>                    xen_grant_setup_dma_ops(dev->dev.parent,
>>>>> backend_domid);
>>>>>                    return true;
>>>>> (END)
>>>>>
>>>>>
>>>>>
>>>>> If so, that commit subject would need to be updated accordingly.
>>>>>
>>>>> Let's see what other reviewers will say.
>>>>
>>>> This doesn't work in all cases. Imagine using PCI Passthrough to assign
>>>> a "physical" virtio device to a domU. The domU will run into the same
>>>> error, right?
>>>>
>>>> The problem is that we need a way for the virtio backend to advertise
>>>> its ability of handling grants. Right now we only have a way to do with
>>>> that with device tree on ARM. On x86, we only have
>>>> CONFIG_XEN_VIRTIO_FORCE_GRANT, and if we take
>>>> CONFIG_XEN_VIRTIO_FORCE_GRANT at face value, it also enables grants for
>>>> "physical" virtio devices. Note that in this case we are fixing a
>>>> nested-virtualization bug, but there are actually physical
>>>> virtio-compatible devices out there. CONFIG_XEN_VIRTIO_FORCE_GRANT will
>>>> break those too.
>>>
>>> In case you want virtio device passthrough, you shouldn't use a kernel
>>> built with CONFIG_XEN_VIRTIO_FORCE_GRANT.
>>>
>>> And supporting passing through virtio devices of the host to pv-domUs is
>>> a security risk anyway.
>>>
>>> We _could_ drop the requirement of the backend needing to set
>>> VIRTIO_F_ACCESS_PLATFORM for PV guests and allow grant-less virtio
>>> handling for all guests. For this to work 
>>> xen_virtio_restricted_mem_acc()
>>> would need to check for VIRTIO_F_ACCESS_PLATFORM and return true if set.
>>> Maybe we'd want to enable that possibility via a boot parameter?
>>
>>
>> Maybe, yes. I don't see at the moment why this won't work.
>>
>> At the same time I wonder, could we just modify xen_pv_init_platform()
>> to call virtio_no_restricted_mem_acc() if restricted memory access is
>> forcibly disabled by a boot parameter, irrespective of
>> VIRTIO_F_ACCESS_PLATFORM presence?
> 
> This wouldn't work for the case where a host virtio device is passed 
> through
> to the pv domU and at the same time another virtio device is using dom0 
> as a
> backend. I think we should use grants if possible.

Indeed, I missed that possible scenario. I agree with the explanations, 
thanks.


> 
> 
> Juergen
>
Roger Pau Monné July 7, 2023, 9:50 a.m. UTC | #23
On Fri, Jul 07, 2023 at 06:38:48AM +0200, Juergen Gross wrote:
> On 06.07.23 23:49, Stefano Stabellini wrote:
> > On Thu, 6 Jul 2023, Roger Pau Monné wrote:
> > > On Wed, Jul 05, 2023 at 03:41:10PM -0700, Stefano Stabellini wrote:
> > > > On Wed, 5 Jul 2023, Roger Pau Monné wrote:
> > > > > On Tue, Jul 04, 2023 at 08:14:59PM +0300, Oleksandr Tyshchenko wrote:
> > > > > > Part 2 (clarification):
> > > > > > 
> > > > > > I think using a special config space register in the root complex would
> > > > > > not be terrible in terms of guest changes because it is easy to
> > > > > > introduce a new root complex driver in Linux and other OSes. The root
> > > > > > complex would still be ECAM compatible so the regular ECAM driver would
> > > > > > still work. A new driver would only be necessary if you want to be able
> > > > > > to access the special config space register.
> > > > > 
> > > > > I'm slightly worry of this approach, we end up modifying a root
> > > > > complex emulation in order to avoid modifying a PCI device emulation
> > > > > on QEMU, not sure that's a good trade off.
> > > > > 
> > > > > Note also that different architectures will likely have different root
> > > > > complex, and so you might need to modify several of them, plus then
> > > > > arrange the PCI layout correctly in order to have the proper hierarchy
> > > > > so that devices belonging to different driver domains are assigned to
> > > > > different bridges.
> > > > 
> > > > I do think that adding something to the PCI conf register somewhere is
> > > > the best option because it is not dependent on ACPI and it is not
> > > > dependent on xenstore both of which are very undesirable.
> > > > 
> > > > I am not sure where specifically is the best place. These are 3 ideas
> > > > we came up with:
> > > > 1. PCI root complex
> > > > 2. a register on the device itself
> > > > 3. a new capability of the device
> > > > 4. add one extra dummy PCI device for the sole purpose of exposing the
> > > >     grants capability
> > > > 
> > > > 
> > > > Looking at the spec, there is a way to add a vendor-specific capability
> > > > (cap_vndr = 0x9). Could we use that? It doesn't look like it is used
> > > > today, Linux doesn't parse it.
> > > 
> > > I did wonder the same from a quick look at the spec.  There's however
> > > a text in the specification that says:
> > > 
> > > "The driver SHOULD NOT use the Vendor data capability except for
> > > debugging and reporting purposes."
> > > 
> > > So we would at least need to change that because the capability would
> > > then be used by other purposes different than debugging and reporting.
> > > 
> > > Seems like a minor adjustment, so might we worth asking upstream about
> > > their opinion, and to get a conversation started.
> > 
> > Wait, wouldn't this use-case fall under "reporting" ? It is exactly what
> > we are doing, right?
> 
> I'd understand "reporting" as e.g. logging, transferring statistics, ...
> 
> We'd like to use it for configuration purposes.

I've also read it that way.

> Another idea would be to enhance the virtio IOMMU device to suit our needs:
> we could add the domid as another virtio IOMMU device capability and (for now)
> use bypass mode for all "productive" devices.

If we have to start adding capabilities, won't it be easier to just add
it to each device instead of adding it to virtio IOMMU?  Or is the
parsing of capabilities device specific, and hence we would have to
implement such parsing for each device?  I would expect some
capabilities are shared between all devices, and a Xen capability could
be one of those.

> Later we could even add grant-V3 support to Xen and to the virtio IOMMU device
> (see my last year Xen Summit design session). This could be usable for
> disaggregated KVM setups, too, so I believe there is a chance to get this
> accepted.
> 
> > > > > > **********
> > > > > > What do you think about it? Are there any pitfalls, etc? This also requires
> > > > > > system changes, but at least without virtio spec changes.
> > > > > 
> > > > > Why are we so reluctant to add spec changes?  I understand this might
> > > > > take time an effort, but it's the only way IMO to build a sustainable
> > > > > VirtIO Xen implementation.  Did we already attempt to negotiate with
> > > > > Oasis Xen related spec changes and those where refused?
> > > > 
> > > > That's because spec changes can be very slow. This is a bug that we need
> > > > a relatively quick solution for and waiting 12-24 months for a spec
> > > > update is not realistic.
> > > > 
> > > > I think a spec change would be best as a long term solution. We also
> > > > need a short term solution. The short term solution doesn't have to be
> > > > ideal but it has to work now.
> > > 
> > > My fear with such approach is that once a bodge is in place people
> > > move on to other stuff and this never gets properly fixed.
> > > 
> > > I know this might not be a well received opinion, but it would be
> > > better if such bodge is kept in each interested party patchqueue for
> > > the time being, until a proper solution is implemented.  That way
> > > there's an interest from parties into properly fixing it upstream.
> > 
> > Unfortunately we are in the situation where we have an outstanding
> > upstream bug, so we have to take action one way or the other.
> 
> The required virtio IOMMU device modification would be rather small, so
> adding it maybe under a CONFIG option defaulting to off might be
> acceptable.

Would you then do the grant allocation as part of virtio IOMMU?

Thanks, Roger.
Jürgen Groß July 7, 2023, 2:10 p.m. UTC | #24
On 07.07.23 11:50, Roger Pau Monné wrote:
> On Fri, Jul 07, 2023 at 06:38:48AM +0200, Juergen Gross wrote:
>> On 06.07.23 23:49, Stefano Stabellini wrote:
>>> On Thu, 6 Jul 2023, Roger Pau Monné wrote:
>>>> On Wed, Jul 05, 2023 at 03:41:10PM -0700, Stefano Stabellini wrote:
>>>>> On Wed, 5 Jul 2023, Roger Pau Monné wrote:
>>>>>> On Tue, Jul 04, 2023 at 08:14:59PM +0300, Oleksandr Tyshchenko wrote:
>>>>>>> Part 2 (clarification):
>>>>>>>
>>>>>>> I think using a special config space register in the root complex would
>>>>>>> not be terrible in terms of guest changes because it is easy to
>>>>>>> introduce a new root complex driver in Linux and other OSes. The root
>>>>>>> complex would still be ECAM compatible so the regular ECAM driver would
>>>>>>> still work. A new driver would only be necessary if you want to be able
>>>>>>> to access the special config space register.
>>>>>>
>>>>>> I'm slightly worry of this approach, we end up modifying a root
>>>>>> complex emulation in order to avoid modifying a PCI device emulation
>>>>>> on QEMU, not sure that's a good trade off.
>>>>>>
>>>>>> Note also that different architectures will likely have different root
>>>>>> complex, and so you might need to modify several of them, plus then
>>>>>> arrange the PCI layout correctly in order to have the proper hierarchy
>>>>>> so that devices belonging to different driver domains are assigned to
>>>>>> different bridges.
>>>>>
>>>>> I do think that adding something to the PCI conf register somewhere is
>>>>> the best option because it is not dependent on ACPI and it is not
>>>>> dependent on xenstore both of which are very undesirable.
>>>>>
>>>>> I am not sure where specifically is the best place. These are 3 ideas
>>>>> we came up with:
>>>>> 1. PCI root complex
>>>>> 2. a register on the device itself
>>>>> 3. a new capability of the device
>>>>> 4. add one extra dummy PCI device for the sole purpose of exposing the
>>>>>      grants capability
>>>>>
>>>>>
>>>>> Looking at the spec, there is a way to add a vendor-specific capability
>>>>> (cap_vndr = 0x9). Could we use that? It doesn't look like it is used
>>>>> today, Linux doesn't parse it.
>>>>
>>>> I did wonder the same from a quick look at the spec.  There's however
>>>> a text in the specification that says:
>>>>
>>>> "The driver SHOULD NOT use the Vendor data capability except for
>>>> debugging and reporting purposes."
>>>>
>>>> So we would at least need to change that because the capability would
>>>> then be used by other purposes different than debugging and reporting.
>>>>
>>>> Seems like a minor adjustment, so might we worth asking upstream about
>>>> their opinion, and to get a conversation started.
>>>
>>> Wait, wouldn't this use-case fall under "reporting" ? It is exactly what
>>> we are doing, right?
>>
>> I'd understand "reporting" as e.g. logging, transferring statistics, ...
>>
>> We'd like to use it for configuration purposes.
> 
> I've also read it that way.
> 
>> Another idea would be to enhance the virtio IOMMU device to suit our needs:
>> we could add the domid as another virtio IOMMU device capability and (for now)
>> use bypass mode for all "productive" devices.
> 
> If we have to start adding capabilities, won't it be easier to just add
> it to each device instead of adding it to virtio IOMMU?  Or is the
> parsing of capabilities device specific, and hence we would have to
> implement such parsing for each device?  I would expect some
> capabilities are shared between all devices, and a Xen capability could
> be one of those.

Have a look at [1], which is describing the common device config layout.
The problem here is that we'd need to add the domid after the queue specific
data, resulting in a mess if further queue fields would be added later.

We could try that, of course.
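
For reference, the layout in question (abbreviated from the spec /
include/uapi/linux/virtio_pci.h) ends with the per-queue registers, so a
hypothetical domid field could only be appended after them:

struct virtio_pci_common_cfg {
        /* About the whole device. */
        __le32 device_feature_select;   /* read-write */
        __le32 device_feature;          /* read-only */
        ...                             /* guest features, MSI-X vector, status */

        /* About a specific virtqueue. */
        __le16 queue_select;            /* read-write */
        ...
        __le32 queue_used_lo;           /* read-write */
        __le32 queue_used_hi;           /* read-write */
        ...                             /* newer revisions already append more
                                           queue fields here, e.g. queue_reset */

        /*
         * A Xen backend domid would have to go after all of the above
         * (hypothetical, not part of the spec), and every further queue
         * field added by the spec would then collide with it.
         */
        __le16 xen_backend_domid;
};
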

> 
>> Later we could even add grant-V3 support to Xen and to the virtio IOMMU device
>> (see my last year Xen Summit design session). This could be usable for
>> disaggregated KVM setups, too, so I believe there is a chance to get this
>> accepted.
>>
>>>>>>> **********
>>>>>>> What do you think about it? Are there any pitfalls, etc? This also requires
>>>>>>> system changes, but at least without virtio spec changes.
>>>>>>
>>>>>> Why are we so reluctant to add spec changes?  I understand this might
>>>>>> take time an effort, but it's the only way IMO to build a sustainable
>>>>>> VirtIO Xen implementation.  Did we already attempt to negotiate with
>>>>>> Oasis Xen related spec changes and those where refused?
>>>>>
>>>>> That's because spec changes can be very slow. This is a bug that we need
>>>>> a relatively quick solution for and waiting 12-24 months for a spec
>>>>> update is not realistic.
>>>>>
>>>>> I think a spec change would be best as a long term solution. We also
>>>>> need a short term solution. The short term solution doesn't have to be
>>>>> ideal but it has to work now.
>>>>
>>>> My fear with such approach is that once a bodge is in place people
>>>> move on to other stuff and this never gets properly fixed.
>>>>
>>>> I know this might not be a well received opinion, but it would be
>>>> better if such bodge is kept in each interested party patchqueue for
>>>> the time being, until a proper solution is implemented.  That way
>>>> there's an interest from parties into properly fixing it upstream.
>>>
>>> Unfortunately we are in the situation where we have an outstanding
>>> upstream bug, so we have to take action one way or the other.
>>
>> The required virtio IOMMU device modification would be rather small, so
>> adding it maybe under a CONFIG option defaulting to off might be
>> acceptable.
> 
> Would you then do the grant allocation as part of virtio IOMMU?

Long term, maybe. Do you remember my Grant-V3 design session last year? Being
able to reuse the same layout for virtio IOMMU was one of the basic ideas for
that layout (this would need some heavy work on the virtio IOMMU frontend and
backend, of course).


Juergen

[1]: 
http://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.pdf#subsubsection.4.1.4.3
Jürgen Groß July 7, 2023, 2:27 p.m. UTC | #25
On 07.07.23 16:10, Juergen Gross wrote:
> On 07.07.23 11:50, Roger Pau Monné wrote:
>> On Fri, Jul 07, 2023 at 06:38:48AM +0200, Juergen Gross wrote:
>>> On 06.07.23 23:49, Stefano Stabellini wrote:
>>>> On Thu, 6 Jul 2023, Roger Pau Monné wrote:
>>>>> On Wed, Jul 05, 2023 at 03:41:10PM -0700, Stefano Stabellini wrote:
>>>>>> On Wed, 5 Jul 2023, Roger Pau Monné wrote:
>>>>>>> On Tue, Jul 04, 2023 at 08:14:59PM +0300, Oleksandr Tyshchenko wrote:
>>>>>>>> Part 2 (clarification):
>>>>>>>>
>>>>>>>> I think using a special config space register in the root complex would
>>>>>>>> not be terrible in terms of guest changes because it is easy to
>>>>>>>> introduce a new root complex driver in Linux and other OSes. The root
>>>>>>>> complex would still be ECAM compatible so the regular ECAM driver would
>>>>>>>> still work. A new driver would only be necessary if you want to be able
>>>>>>>> to access the special config space register.
>>>>>>>
>>>>>>> I'm slightly worry of this approach, we end up modifying a root
>>>>>>> complex emulation in order to avoid modifying a PCI device emulation
>>>>>>> on QEMU, not sure that's a good trade off.
>>>>>>>
>>>>>>> Note also that different architectures will likely have different root
>>>>>>> complex, and so you might need to modify several of them, plus then
>>>>>>> arrange the PCI layout correctly in order to have the proper hierarchy
>>>>>>> so that devices belonging to different driver domains are assigned to
>>>>>>> different bridges.
>>>>>>
>>>>>> I do think that adding something to the PCI conf register somewhere is
>>>>>> the best option because it is not dependent on ACPI and it is not
>>>>>> dependent on xenstore both of which are very undesirable.
>>>>>>
>>>>>> I am not sure where specifically is the best place. These are 3 ideas
>>>>>> we came up with:
>>>>>> 1. PCI root complex
>>>>>> 2. a register on the device itself
>>>>>> 3. a new capability of the device
>>>>>> 4. add one extra dummy PCI device for the sole purpose of exposing the
>>>>>>      grants capability
>>>>>>
>>>>>>
>>>>>> Looking at the spec, there is a way to add a vendor-specific capability
>>>>>> (cap_vndr = 0x9). Could we use that? It doesn't look like it is used
>>>>>> today, Linux doesn't parse it.
>>>>>
>>>>> I did wonder the same from a quick look at the spec.  There's however
>>>>> a text in the specification that says:
>>>>>
>>>>> "The driver SHOULD NOT use the Vendor data capability except for
>>>>> debugging and reporting purposes."
>>>>>
>>>>> So we would at least need to change that because the capability would
>>>>> then be used by other purposes different than debugging and reporting.
>>>>>
>>>>> Seems like a minor adjustment, so might we worth asking upstream about
>>>>> their opinion, and to get a conversation started.
>>>>
>>>> Wait, wouldn't this use-case fall under "reporting" ? It is exactly what
>>>> we are doing, right?
>>>
>>> I'd understand "reporting" as e.g. logging, transferring statistics, ...
>>>
>>> We'd like to use it for configuration purposes.
>>
>> I've also read it that way.
>>
>>> Another idea would be to enhance the virtio IOMMU device to suit our needs:
>>> we could add the domid as another virtio IOMMU device capability and (for now)
>>> use bypass mode for all "productive" devices.
>>
>> If we have to start adding capabilities, won't it be easier to just add
>> it to each device instead of adding it to virtio IOMMU?  Or is the
>> parsing of capabilities device specific, and hence we would have to
>> implement such parsing for each device?  I would expect some
>> capabilities are shared between all devices, and a Xen capability could
>> be one of those.
> 
> Have a look at [1], which is describing the common device config layout.
> The problem here is that we'd need to add the domid after the queue specific
> data, resulting in a mess if further queue fields would be added later.
> 
> We could try that, of course.

Thinking more about it, the virtio IOMMU device seems to be a better fit:

In case we'd add the domid to the device's PCI config space, the value would
be controlled by the backend domain. IMO the domid passed to the frontend
should be controlled by a trusted entity (dom0 or the hypervisor), which
would be the natural backend of the virtio IOMMU device.
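
Purely as an illustration (neither the field nor a corresponding feature bit
exist in the virtio IOMMU spec), the device's config space could be extended
along these lines:

struct virtio_iommu_config {
        ...                     /* existing fields, including the bypass knob */

        /*
         * Hypothetical Xen extension: the backend domid to use for grant
         * mappings, advertised only when a new feature bit is offered by
         * the (dom0/hypervisor controlled) virtio IOMMU device.
         */
        __le16 xen_backend_domid;
};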


Juergen
Roger Pau Monné July 7, 2023, 2:42 p.m. UTC | #26
On Fri, Jul 07, 2023 at 04:10:14PM +0200, Juergen Gross wrote:
> On 07.07.23 11:50, Roger Pau Monné wrote:
> > On Fri, Jul 07, 2023 at 06:38:48AM +0200, Juergen Gross wrote:
> > > On 06.07.23 23:49, Stefano Stabellini wrote:
> > > > On Thu, 6 Jul 2023, Roger Pau Monné wrote:
> > > > > On Wed, Jul 05, 2023 at 03:41:10PM -0700, Stefano Stabellini wrote:
> > > > > > On Wed, 5 Jul 2023, Roger Pau Monné wrote:
> > > > > > > On Tue, Jul 04, 2023 at 08:14:59PM +0300, Oleksandr Tyshchenko wrote:
> > > > > > > > Part 2 (clarification):
> > > > > > > > 
> > > > > > > > I think using a special config space register in the root complex would
> > > > > > > > not be terrible in terms of guest changes because it is easy to
> > > > > > > > introduce a new root complex driver in Linux and other OSes. The root
> > > > > > > > complex would still be ECAM compatible so the regular ECAM driver would
> > > > > > > > still work. A new driver would only be necessary if you want to be able
> > > > > > > > to access the special config space register.
> > > > > > > 
> > > > > > > I'm slightly worry of this approach, we end up modifying a root
> > > > > > > complex emulation in order to avoid modifying a PCI device emulation
> > > > > > > on QEMU, not sure that's a good trade off.
> > > > > > > 
> > > > > > > Note also that different architectures will likely have different root
> > > > > > > complex, and so you might need to modify several of them, plus then
> > > > > > > arrange the PCI layout correctly in order to have the proper hierarchy
> > > > > > > so that devices belonging to different driver domains are assigned to
> > > > > > > different bridges.
> > > > > > 
> > > > > > I do think that adding something to the PCI conf register somewhere is
> > > > > > the best option because it is not dependent on ACPI and it is not
> > > > > > dependent on xenstore both of which are very undesirable.
> > > > > > 
> > > > > > I am not sure where specifically is the best place. These are 3 ideas
> > > > > > we came up with:
> > > > > > 1. PCI root complex
> > > > > > 2. a register on the device itself
> > > > > > 3. a new capability of the device
> > > > > > 4. add one extra dummy PCI device for the sole purpose of exposing the
> > > > > >      grants capability
> > > > > > 
> > > > > > 
> > > > > > Looking at the spec, there is a way to add a vendor-specific capability
> > > > > > (cap_vndr = 0x9). Could we use that? It doesn't look like it is used
> > > > > > today, Linux doesn't parse it.
> > > > > 
> > > > > I did wonder the same from a quick look at the spec.  There's however
> > > > > a text in the specification that says:
> > > > > 
> > > > > "The driver SHOULD NOT use the Vendor data capability except for
> > > > > debugging and reporting purposes."
> > > > > 
> > > > > So we would at least need to change that because the capability would
> > > > > then be used by other purposes different than debugging and reporting.
> > > > > 
> > > > > Seems like a minor adjustment, so might we worth asking upstream about
> > > > > their opinion, and to get a conversation started.
> > > > 
> > > > Wait, wouldn't this use-case fall under "reporting" ? It is exactly what
> > > > we are doing, right?
> > > 
> > > I'd understand "reporting" as e.g. logging, transferring statistics, ...
> > > 
> > > We'd like to use it for configuration purposes.
> > 
> > I've also read it that way.
> > 
> > > Another idea would be to enhance the virtio IOMMU device to suit our needs:
> > > we could add the domid as another virtio IOMMU device capability and (for now)
> > > use bypass mode for all "productive" devices.
> > 
> > If we have to start adding capabilities, won't it be easier to just add
> > it to each device instead of adding it to virtio IOMMU?  Or is the
> > parsing of capabilities device specific, and hence we would have to
> > implement such parsing for each device?  I would expect some
> > capabilities are shared between all devices, and a Xen capability could
> > be one of those.
> 
> Have a look at [1], which is describing the common device config layout.
> The problem here is that we'd need to add the domid after the queue specific
> data, resulting in a mess if further queue fields would be added later.
> 
> We could try that, of course.

Right, we must make it part of the standard if we modify
virtio_pci_common_cfg, or else newly added fields would overlap the
Xen specific one.

Would it be possible to signal Xen-grants support in the
`device_feature` field, and then expose it from a vendor capability?
IOW, would it be possible to add a Xen-specific hook in the parsing of
virtio_pci_common_cfg that would then fetch additional data from a
capability?

That would likely be less intrusive than adding a new Xen-specific
field to virtio_pci_common_cfg while still allowing us to do Xen
specific configuration for all VirtIO devices.
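
Just to make that concrete, a minimal sketch of what such a vendor
capability could carry (the feature bit number, the vendor id and the
field layout below are all made up for illustration; none of this is in
the virtio spec today):

  #include <linux/types.h>

  /* Hypothetical feature bit the device would offer and the driver
   * would have to negotiate before looking for the capability below. */
  #define VIRTIO_F_XEN_GRANTS        40    /* made-up bit number */

  /* cfg_type 9 is the "Vendor data" capability type in virtio 1.2. */
  #define VIRTIO_PCI_CAP_VENDOR_CFG  9

  /* Made-up vendor data capability exposing the backend domid. */
  struct virtio_pci_xen_cap {
      __u8   cap_vndr;      /* PCI_CAP_ID_VNDR (0x09) */
      __u8   cap_next;      /* next capability in the list */
      __u8   cap_len;       /* sizeof(struct virtio_pci_xen_cap) */
      __u8   cfg_type;      /* VIRTIO_PCI_CAP_VENDOR_CFG */
      __le16 vendor_id;     /* some Xen project identifier */
      __le16 backend_domid; /* domid the frontend should grant to */
  };

A driver that did not negotiate the feature bit would simply ignore the
capability, so existing devices and drivers would stay unaffected.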

> > 
> > > Later we could even add grant-V3 support to Xen and to the virtio IOMMU device
> > > (see my last year Xen Summit design session). This could be usable for
> > > disaggregated KVM setups, too, so I believe there is a chance to get this
> > > accepted.
> > > 
> > > > > > > > **********
> > > > > > > > What do you think about it? Are there any pitfalls, etc? This also requires
> > > > > > > > system changes, but at least without virtio spec changes.
> > > > > > > 
> > > > > > > Why are we so reluctant to add spec changes?  I understand this might
> > > > > > > take time an effort, but it's the only way IMO to build a sustainable
> > > > > > > VirtIO Xen implementation.  Did we already attempt to negotiate with
> > > > > > > Oasis Xen related spec changes and those where refused?
> > > > > > 
> > > > > > That's because spec changes can be very slow. This is a bug that we need
> > > > > > a relatively quick solution for and waiting 12-24 months for a spec
> > > > > > update is not realistic.
> > > > > > 
> > > > > > I think a spec change would be best as a long term solution. We also
> > > > > > need a short term solution. The short term solution doesn't have to be
> > > > > > ideal but it has to work now.
> > > > > 
> > > > > My fear with such approach is that once a bodge is in place people
> > > > > move on to other stuff and this never gets properly fixed.
> > > > > 
> > > > > I know this might not be a well received opinion, but it would be
> > > > > better if such bodge is kept in each interested party patchqueue for
> > > > > the time being, until a proper solution is implemented.  That way
> > > > > there's an interest from parties into properly fixing it upstream.
> > > > 
> > > > Unfortunately we are in the situation where we have an outstanding
> > > > upstream bug, so we have to take action one way or the other.
> > > 
> > > The required virtio IOMMU device modification would be rather small, so
> > > adding it maybe under a CONFIG option defaulting to off might be
> > > acceptable.
> > 
> > Would you then do the grant allocation as part of virtio IOMMU?
> 
> Long term, maybe. Do you remember my Grant-V3 design session last year? Being
> able to reuse the same layout for virtio IOMMU was one of the basic ideas for
> that layout (this would need some heavy work on the virtio IOMMU frontend and
> backend, of course).

While this might well be the best option, do we have anyone with the
time and expertise to work on this?  I might be wrong, but it seems
like a huge task.

Thanks, Roger.
Roger Pau Monné July 7, 2023, 2:48 p.m. UTC | #27
On Fri, Jul 07, 2023 at 04:27:59PM +0200, Juergen Gross wrote:
> On 07.07.23 16:10, Juergen Gross wrote:
> > On 07.07.23 11:50, Roger Pau Monné wrote:
> > > On Fri, Jul 07, 2023 at 06:38:48AM +0200, Juergen Gross wrote:
> > > > On 06.07.23 23:49, Stefano Stabellini wrote:
> > > > > On Thu, 6 Jul 2023, Roger Pau Monné wrote:
> > > > > > On Wed, Jul 05, 2023 at 03:41:10PM -0700, Stefano Stabellini wrote:
> > > > > > > On Wed, 5 Jul 2023, Roger Pau Monné wrote:
> > > > > > > > On Tue, Jul 04, 2023 at 08:14:59PM +0300, Oleksandr Tyshchenko wrote:
> > > > > > > > > Part 2 (clarification):
> > > > > > > > > 
> > > > > > > > > I think using a special config space register in the root complex would
> > > > > > > > > not be terrible in terms of guest changes because it is easy to
> > > > > > > > > introduce a new root complex driver in Linux and other OSes. The root
> > > > > > > > > complex would still be ECAM compatible so the regular ECAM driver would
> > > > > > > > > still work. A new driver would only be necessary if you want to be able
> > > > > > > > > to access the special config space register.
> > > > > > > > 
> > > > > > > > I'm slightly worry of this approach, we end up modifying a root
> > > > > > > > complex emulation in order to avoid modifying a PCI device emulation
> > > > > > > > on QEMU, not sure that's a good trade off.
> > > > > > > > 
> > > > > > > > Note also that different architectures will likely have different root
> > > > > > > > complex, and so you might need to modify several of them, plus then
> > > > > > > > arrange the PCI layout correctly in order to have the proper hierarchy
> > > > > > > > so that devices belonging to different driver domains are assigned to
> > > > > > > > different bridges.
> > > > > > > 
> > > > > > > I do think that adding something to the PCI conf register somewhere is
> > > > > > > the best option because it is not dependent on ACPI and it is not
> > > > > > > dependent on xenstore both of which are very undesirable.
> > > > > > > 
> > > > > > > I am not sure where specifically is the best place. These are 3 ideas
> > > > > > > we came up with:
> > > > > > > 1. PCI root complex
> > > > > > > 2. a register on the device itself
> > > > > > > 3. a new capability of the device
> > > > > > > 4. add one extra dummy PCI device for the sole purpose of exposing the
> > > > > > >      grants capability
> > > > > > > 
> > > > > > > 
> > > > > > > Looking at the spec, there is a way to add a vendor-specific capability
> > > > > > > (cap_vndr = 0x9). Could we use that? It doesn't look like it is used
> > > > > > > today, Linux doesn't parse it.
> > > > > > 
> > > > > > I did wonder the same from a quick look at the spec.  There's however
> > > > > > a text in the specification that says:
> > > > > > 
> > > > > > "The driver SHOULD NOT use the Vendor data capability except for
> > > > > > debugging and reporting purposes."
> > > > > > 
> > > > > > So we would at least need to change that because the capability would
> > > > > > then be used by other purposes different than debugging and reporting.
> > > > > > 
> > > > > > Seems like a minor adjustment, so might we worth asking upstream about
> > > > > > their opinion, and to get a conversation started.
> > > > > 
> > > > > Wait, wouldn't this use-case fall under "reporting" ? It is exactly what
> > > > > we are doing, right?
> > > > 
> > > > I'd understand "reporting" as e.g. logging, transferring statistics, ...
> > > > 
> > > > We'd like to use it for configuration purposes.
> > > 
> > > I've also read it that way.
> > > 
> > > > Another idea would be to enhance the virtio IOMMU device to suit our needs:
> > > > we could add the domid as another virtio IOMMU device capability and (for now)
> > > > use bypass mode for all "productive" devices.
> > > 
> > > If we have to start adding capabilties, won't it be easier to just add
> > > it to the each device instead of adding it to virtio IOMMU.  Or is the
> > > parsing of capabilities device specific, and hence we would have to
> > > implement such parsing for each device?  I would expect some
> > > capabilities are shared between all devices, and a Xen capability could
> > > be one of those.
> > 
> > Have a look at [1], which is describing the common device config layout.
> > The problem here is that we'd need to add the domid after the queue specific
> > data, resulting in a mess if further queue fields would be added later.
> > 
> > We could try that, of course.
> 
> Thinking more about it, the virtio IOMMU device seems to be a better fit:
> 
> In case we'd add the domid to the device's PCI config space, the value would
> be controlled by the backend domain. IMO the domid passed to the frontend
> should be controlled by a trusted entity (dom0 or the hypervisor), which
> would be the natural backend of the virtio IOMMU device.

Hm, yes.  I'm however failing to see how a backend could exploit that.

The guest would be granting memory to a different domain than the one
running the backend, but otherwise that memory would be granted to the
backend domain, which could then also make it available to other
domains (without having to play with the reported backend domid).

Thanks, Roger.
Jürgen Groß July 7, 2023, 3:01 p.m. UTC | #28
On 07.07.23 16:42, Roger Pau Monné wrote:
> On Fri, Jul 07, 2023 at 04:10:14PM +0200, Juergen Gross wrote:
>> On 07.07.23 11:50, Roger Pau Monné wrote:
>>> On Fri, Jul 07, 2023 at 06:38:48AM +0200, Juergen Gross wrote:
>>>> On 06.07.23 23:49, Stefano Stabellini wrote:
>>>>> On Thu, 6 Jul 2023, Roger Pau Monné wrote:
>>>>>> On Wed, Jul 05, 2023 at 03:41:10PM -0700, Stefano Stabellini wrote:
>>>>>>> On Wed, 5 Jul 2023, Roger Pau Monné wrote:
>>>>>>>> On Tue, Jul 04, 2023 at 08:14:59PM +0300, Oleksandr Tyshchenko wrote:
>>>>>>>>> Part 2 (clarification):
>>>>>>>>>
>>>>>>>>> I think using a special config space register in the root complex would
>>>>>>>>> not be terrible in terms of guest changes because it is easy to
>>>>>>>>> introduce a new root complex driver in Linux and other OSes. The root
>>>>>>>>> complex would still be ECAM compatible so the regular ECAM driver would
>>>>>>>>> still work. A new driver would only be necessary if you want to be able
>>>>>>>>> to access the special config space register.
>>>>>>>>
>>>>>>>> I'm slightly worry of this approach, we end up modifying a root
>>>>>>>> complex emulation in order to avoid modifying a PCI device emulation
>>>>>>>> on QEMU, not sure that's a good trade off.
>>>>>>>>
>>>>>>>> Note also that different architectures will likely have different root
>>>>>>>> complex, and so you might need to modify several of them, plus then
>>>>>>>> arrange the PCI layout correctly in order to have the proper hierarchy
>>>>>>>> so that devices belonging to different driver domains are assigned to
>>>>>>>> different bridges.
>>>>>>>
>>>>>>> I do think that adding something to the PCI conf register somewhere is
>>>>>>> the best option because it is not dependent on ACPI and it is not
>>>>>>> dependent on xenstore both of which are very undesirable.
>>>>>>>
>>>>>>> I am not sure where specifically is the best place. These are 3 ideas
>>>>>>> we came up with:
>>>>>>> 1. PCI root complex
>>>>>>> 2. a register on the device itself
>>>>>>> 3. a new capability of the device
>>>>>>> 4. add one extra dummy PCI device for the sole purpose of exposing the
>>>>>>>       grants capability
>>>>>>>
>>>>>>>
>>>>>>> Looking at the spec, there is a way to add a vendor-specific capability
>>>>>>> (cap_vndr = 0x9). Could we use that? It doesn't look like it is used
>>>>>>> today, Linux doesn't parse it.
>>>>>>
>>>>>> I did wonder the same from a quick look at the spec.  There's however
>>>>>> a text in the specification that says:
>>>>>>
>>>>>> "The driver SHOULD NOT use the Vendor data capability except for
>>>>>> debugging and reporting purposes."
>>>>>>
>>>>>> So we would at least need to change that because the capability would
>>>>>> then be used by other purposes different than debugging and reporting.
>>>>>>
>>>>>> Seems like a minor adjustment, so might we worth asking upstream about
>>>>>> their opinion, and to get a conversation started.
>>>>>
>>>>> Wait, wouldn't this use-case fall under "reporting" ? It is exactly what
>>>>> we are doing, right?
>>>>
>>>> I'd understand "reporting" as e.g. logging, transferring statistics, ...
>>>>
>>>> We'd like to use it for configuration purposes.
>>>
>>> I've also read it that way.
>>>
>>>> Another idea would be to enhance the virtio IOMMU device to suit our needs:
>>>> we could add the domid as another virtio IOMMU device capability and (for now)
>>>> use bypass mode for all "productive" devices.
>>>
>>> If we have to start adding capabilties, won't it be easier to just add
>>> it to the each device instead of adding it to virtio IOMMU.  Or is the
>>> parsing of capabilities device specific, and hence we would have to
>>> implement such parsing for each device?  I would expect some
>>> capabilities are shared between all devices, and a Xen capability could
>>> be one of those.
>>
>> Have a look at [1], which is describing the common device config layout.
>> The problem here is that we'd need to add the domid after the queue specific
>> data, resulting in a mess if further queue fields would be added later.
>>
>> We could try that, of course.
> 
> Right, we must make it part of the standard if we modify
> virtio_pci_common_cfg, or else newly added fields would overlap the
> Xen specific one.
> 
> Would it be possible to signal Xen-grants support in the
> `device_feature` field, and then expose it from a vendor capability?
> IOW, would it be possible to add a Xen-specific hook in the parsing of
> virtio_pci_common_cfg that would then fetch additional data from a
> capability?

TBH, I don't know. It might require some changes in the central parsing
logic, but this shouldn't be too hard to do.

> That would likely be less intrusive than adding a new Xen-specific
> field to virtio_pci_common_cfg while still allowing us to do Xen
> specific configuration for all VirtIO devices.

In case we want to go that route, this should be in a new "platform config"
capability, which might be just another form of a vendor capability.
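
To give a rough idea of the frontend side (a sketch only; the cfg_type
value, the vendor id and the field offsets are assumptions about such a
hypothetical capability, not anything defined in the spec), the parsing
hook could walk the vendor capabilities with the standard PCI helpers:

  #include <linux/errno.h>
  #include <linux/pci.h>

  /* Assumed values for the hypothetical Xen capability: cfg_type 9 is
   * the virtio "Vendor data" capability, the vendor id is made up. */
  #define VIRTIO_PCI_CAP_VENDOR_CFG  9
  #define XEN_VIRTIO_VENDOR_ID       0x5845  /* made up */

  /*
   * Sketch: look for a vendor-specific PCI capability with the assumed
   * layout (cap_vndr/cap_next/cap_len/cfg_type at offsets 0-3, vendor
   * id at offset 4, backend domid at offset 6) and read the domid.
   */
  static int xen_virtio_find_backend_domid(struct pci_dev *dev, u16 *domid)
  {
      u8 pos, cfg_type;
      u16 vendor;

      for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR); pos;
           pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_VNDR)) {
          pci_read_config_byte(dev, pos + 3, &cfg_type);
          if (cfg_type != VIRTIO_PCI_CAP_VENDOR_CFG)
              continue;
          pci_read_config_word(dev, pos + 4, &vendor);
          if (vendor != XEN_VIRTIO_VENDOR_ID)
              continue;
          pci_read_config_word(dev, pos + 6, domid);
          return 0;
      }

      return -ENODEV;
  }

Such a helper would only be called when the (equally hypothetical) Xen
feature bit has been negotiated, next to the existing capability parsing
in virtio_pci_modern.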

> 
>>>
>>>> Later we could even add grant-V3 support to Xen and to the virtio IOMMU device
>>>> (see my last year Xen Summit design session). This could be usable for
>>>> disaggregated KVM setups, too, so I believe there is a chance to get this
>>>> accepted.
>>>>
>>>>>>>>> **********
>>>>>>>>> What do you think about it? Are there any pitfalls, etc? This also requires
>>>>>>>>> system changes, but at least without virtio spec changes.
>>>>>>>>
>>>>>>>> Why are we so reluctant to add spec changes?  I understand this might
>>>>>>>> take time an effort, but it's the only way IMO to build a sustainable
>>>>>>>> VirtIO Xen implementation.  Did we already attempt to negotiate with
>>>>>>>> Oasis Xen related spec changes and those where refused?
>>>>>>>
>>>>>>> That's because spec changes can be very slow. This is a bug that we need
>>>>>>> a relatively quick solution for and waiting 12-24 months for a spec
>>>>>>> update is not realistic.
>>>>>>>
>>>>>>> I think a spec change would be best as a long term solution. We also
>>>>>>> need a short term solution. The short term solution doesn't have to be
>>>>>>> ideal but it has to work now.
>>>>>>
>>>>>> My fear with such approach is that once a bodge is in place people
>>>>>> move on to other stuff and this never gets properly fixed.
>>>>>>
>>>>>> I know this might not be a well received opinion, but it would be
>>>>>> better if such bodge is kept in each interested party patchqueue for
>>>>>> the time being, until a proper solution is implemented.  That way
>>>>>> there's an interest from parties into properly fixing it upstream.
>>>>>
>>>>> Unfortunately we are in the situation where we have an outstanding
>>>>> upstream bug, so we have to take action one way or the other.
>>>>
>>>> The required virtio IOMMU device modification would be rather small, so
>>>> adding it maybe under a CONFIG option defaulting to off might be
>>>> acceptable.
>>>
>>> Would you then do the grant allocation as part of virtio IOMMU?
>>
>> Long term, maybe. Do you remember my Grant-V3 design session last year? Being
>> able to reuse the same layout for virtio IOMMU was one of the basic ideas for
>> that layout (this would need some heavy work on the virtio IOMMU frontend and
>> backend, of course).
> 
> While this might well be the best option, do we have anyone with the
> time and expertise to work on this?  I might be wrong, but it seems
> like a huge task.

As a background project I'd like to pursue it. OTOH I'm not sure how much time
I could spend on it.


Juergen
Roger Pau Monné July 7, 2023, 3:14 p.m. UTC | #29
On Fri, Jul 07, 2023 at 05:01:38PM +0200, Juergen Gross wrote:
> On 07.07.23 16:42, Roger Pau Monné wrote:
> > On Fri, Jul 07, 2023 at 04:10:14PM +0200, Juergen Gross wrote:
> > > On 07.07.23 11:50, Roger Pau Monné wrote:
> > > > On Fri, Jul 07, 2023 at 06:38:48AM +0200, Juergen Gross wrote:
> > > > > On 06.07.23 23:49, Stefano Stabellini wrote:
> > > > > > On Thu, 6 Jul 2023, Roger Pau Monné wrote:
> > > > > > > On Wed, Jul 05, 2023 at 03:41:10PM -0700, Stefano Stabellini wrote:
> > > > > > > > On Wed, 5 Jul 2023, Roger Pau Monné wrote:
> > > > > > > > > On Tue, Jul 04, 2023 at 08:14:59PM +0300, Oleksandr Tyshchenko wrote:
> > > > > > > > > > Part 2 (clarification):
> > > > > > > > > > 
> > > > > > > > > > I think using a special config space register in the root complex would
> > > > > > > > > > not be terrible in terms of guest changes because it is easy to
> > > > > > > > > > introduce a new root complex driver in Linux and other OSes. The root
> > > > > > > > > > complex would still be ECAM compatible so the regular ECAM driver would
> > > > > > > > > > still work. A new driver would only be necessary if you want to be able
> > > > > > > > > > to access the special config space register.
> > > > > > > > > 
> > > > > > > > > I'm slightly worry of this approach, we end up modifying a root
> > > > > > > > > complex emulation in order to avoid modifying a PCI device emulation
> > > > > > > > > on QEMU, not sure that's a good trade off.
> > > > > > > > > 
> > > > > > > > > Note also that different architectures will likely have different root
> > > > > > > > > complex, and so you might need to modify several of them, plus then
> > > > > > > > > arrange the PCI layout correctly in order to have the proper hierarchy
> > > > > > > > > so that devices belonging to different driver domains are assigned to
> > > > > > > > > different bridges.
> > > > > > > > 
> > > > > > > > I do think that adding something to the PCI conf register somewhere is
> > > > > > > > the best option because it is not dependent on ACPI and it is not
> > > > > > > > dependent on xenstore both of which are very undesirable.
> > > > > > > > 
> > > > > > > > I am not sure where specifically is the best place. These are 3 ideas
> > > > > > > > we came up with:
> > > > > > > > 1. PCI root complex
> > > > > > > > 2. a register on the device itself
> > > > > > > > 3. a new capability of the device
> > > > > > > > 4. add one extra dummy PCI device for the sole purpose of exposing the
> > > > > > > >       grants capability
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Looking at the spec, there is a way to add a vendor-specific capability
> > > > > > > > (cap_vndr = 0x9). Could we use that? It doesn't look like it is used
> > > > > > > > today, Linux doesn't parse it.
> > > > > > > 
> > > > > > > I did wonder the same from a quick look at the spec.  There's however
> > > > > > > a text in the specification that says:
> > > > > > > 
> > > > > > > "The driver SHOULD NOT use the Vendor data capability except for
> > > > > > > debugging and reporting purposes."
> > > > > > > 
> > > > > > > So we would at least need to change that because the capability would
> > > > > > > then be used by other purposes different than debugging and reporting.
> > > > > > > 
> > > > > > > Seems like a minor adjustment, so might we worth asking upstream about
> > > > > > > their opinion, and to get a conversation started.
> > > > > > 
> > > > > > Wait, wouldn't this use-case fall under "reporting" ? It is exactly what
> > > > > > we are doing, right?
> > > > > 
> > > > > I'd understand "reporting" as e.g. logging, transferring statistics, ...
> > > > > 
> > > > > We'd like to use it for configuration purposes.
> > > > 
> > > > I've also read it that way.
> > > > 
> > > > > Another idea would be to enhance the virtio IOMMU device to suit our needs:
> > > > > we could add the domid as another virtio IOMMU device capability and (for now)
> > > > > use bypass mode for all "productive" devices.
> > > > 
> > > > If we have to start adding capabilties, won't it be easier to just add
> > > > it to the each device instead of adding it to virtio IOMMU.  Or is the
> > > > parsing of capabilities device specific, and hence we would have to
> > > > implement such parsing for each device?  I would expect some
> > > > capabilities are shared between all devices, and a Xen capability could
> > > > be one of those.
> > > 
> > > Have a look at [1], which is describing the common device config layout.
> > > The problem here is that we'd need to add the domid after the queue specific
> > > data, resulting in a mess if further queue fields would be added later.
> > > 
> > > We could try that, of course.
> > 
> > Right, we must make it part of the standard if we modify
> > virtio_pci_common_cfg, or else newly added fields would overlap the
> > Xen specific one.
> > 
> > Would it be possible to signal Xen-grants support in the
> > `device_feature` field, and then expose it from a vendor capability?
> > IOW, would it be possible to add a Xen-specific hook in the parsing of
> > virtio_pci_common_cfg that would then fetch additional data from a
> > capability?
> 
> TBH, I don't know. It might require some changes in the central parsing
> logic, but this shouldn't be too hard to do.
> 
> > That would likely be less intrusive than adding a new Xen-specific
> > field to virtio_pci_common_cfg while still allowing us to do Xen
> > specific configuration for all VirtIO devices.
> 
> In case we want to go that route, this should be in a new "platform config"
> capability, which might be just another form of a vendor capability.

I think telling people that they will need to implement grants-v3 in
order to solve this might be too much.  I would prefer a more concrete
solution that doesn't have so many loose ends.

Anyway, it's up to the person doing the job, but starting with "you
will have to implement grants-v3" is quite likely to deter anyone from
attempting to solve this, I'm afraid.

Thanks, Roger.
Jürgen Groß July 7, 2023, 3:14 p.m. UTC | #30
On 07.07.23 16:48, Roger Pau Monné wrote:
> On Fri, Jul 07, 2023 at 04:27:59PM +0200, Juergen Gross wrote:
>> On 07.07.23 16:10, Juergen Gross wrote:
>>> On 07.07.23 11:50, Roger Pau Monné wrote:
>>>> On Fri, Jul 07, 2023 at 06:38:48AM +0200, Juergen Gross wrote:
>>>>> On 06.07.23 23:49, Stefano Stabellini wrote:
>>>>>> On Thu, 6 Jul 2023, Roger Pau Monné wrote:
>>>>>>> On Wed, Jul 05, 2023 at 03:41:10PM -0700, Stefano Stabellini wrote:
>>>>>>>> On Wed, 5 Jul 2023, Roger Pau Monné wrote:
>>>>>>>>> On Tue, Jul 04, 2023 at 08:14:59PM +0300, Oleksandr Tyshchenko wrote:
>>>>>>>>>> Part 2 (clarification):
>>>>>>>>>>
>>>>>>>>>> I think using a special config space register in the root complex would
>>>>>>>>>> not be terrible in terms of guest changes because it is easy to
>>>>>>>>>> introduce a new root complex driver in Linux and other OSes. The root
>>>>>>>>>> complex would still be ECAM compatible so the regular ECAM driver would
>>>>>>>>>> still work. A new driver would only be necessary if you want to be able
>>>>>>>>>> to access the special config space register.
>>>>>>>>>
>>>>>>>>> I'm slightly worry of this approach, we end up modifying a root
>>>>>>>>> complex emulation in order to avoid modifying a PCI device emulation
>>>>>>>>> on QEMU, not sure that's a good trade off.
>>>>>>>>>
>>>>>>>>> Note also that different architectures will likely have different root
>>>>>>>>> complex, and so you might need to modify several of them, plus then
>>>>>>>>> arrange the PCI layout correctly in order to have the proper hierarchy
>>>>>>>>> so that devices belonging to different driver domains are assigned to
>>>>>>>>> different bridges.
>>>>>>>>
>>>>>>>> I do think that adding something to the PCI conf register somewhere is
>>>>>>>> the best option because it is not dependent on ACPI and it is not
>>>>>>>> dependent on xenstore both of which are very undesirable.
>>>>>>>>
>>>>>>>> I am not sure where specifically is the best place. These are 3 ideas
>>>>>>>> we came up with:
>>>>>>>> 1. PCI root complex
>>>>>>>> 2. a register on the device itself
>>>>>>>> 3. a new capability of the device
>>>>>>>> 4. add one extra dummy PCI device for the sole purpose of exposing the
>>>>>>>>       grants capability
>>>>>>>>
>>>>>>>>
>>>>>>>> Looking at the spec, there is a way to add a vendor-specific capability
>>>>>>>> (cap_vndr = 0x9). Could we use that? It doesn't look like it is used
>>>>>>>> today, Linux doesn't parse it.
>>>>>>>
>>>>>>> I did wonder the same from a quick look at the spec.  There's however
>>>>>>> a text in the specification that says:
>>>>>>>
>>>>>>> "The driver SHOULD NOT use the Vendor data capability except for
>>>>>>> debugging and reporting purposes."
>>>>>>>
>>>>>>> So we would at least need to change that because the capability would
>>>>>>> then be used by other purposes different than debugging and reporting.
>>>>>>>
>>>>>>> Seems like a minor adjustment, so might we worth asking upstream about
>>>>>>> their opinion, and to get a conversation started.
>>>>>>
>>>>>> Wait, wouldn't this use-case fall under "reporting" ? It is exactly what
>>>>>> we are doing, right?
>>>>>
>>>>> I'd understand "reporting" as e.g. logging, transferring statistics, ...
>>>>>
>>>>> We'd like to use it for configuration purposes.
>>>>
>>>> I've also read it that way.
>>>>
>>>>> Another idea would be to enhance the virtio IOMMU device to suit our needs:
>>>>> we could add the domid as another virtio IOMMU device capability and (for now)
>>>>> use bypass mode for all "productive" devices.
>>>>
>>>> If we have to start adding capabilties, won't it be easier to just add
>>>> it to the each device instead of adding it to virtio IOMMU.  Or is the
>>>> parsing of capabilities device specific, and hence we would have to
>>>> implement such parsing for each device?  I would expect some
>>>> capabilities are shared between all devices, and a Xen capability could
>>>> be one of those.
>>>
>>> Have a look at [1], which is describing the common device config layout.
>>> The problem here is that we'd need to add the domid after the queue specific
>>> data, resulting in a mess if further queue fields would be added later.
>>>
>>> We could try that, of course.
>>
>> Thinking more about it, the virtio IOMMU device seems to be a better fit:
>>
>> In case we'd add the domid to the device's PCI config space, the value would
>> be controlled by the backend domain. IMO the domid passed to the frontend
>> should be controlled by a trusted entity (dom0 or the hypervisor), which
>> would be the natural backend of the virtio IOMMU device.
> 
> Hm, yes.  I'm however failing to see how a backed could exploit that.
> 
> The guest would be granting memory to a different domain than the one
> running the backend, but otherwise that memory would be granted to the
> backend domain, which could then also make it available to other
> domains (without having to play with the reported backend domid).

I agree that an exploit is at least not obvious.

It is still not a clean solution, though.

Giving the wrong domain direct access to some of the guest's memory is worse
than the ability to pass the contents indirectly to the wrong domain, IMHO.


Juergen
Jürgen Groß July 7, 2023, 3:15 p.m. UTC | #31
On 07.07.23 17:14, Roger Pau Monné wrote:
> On Fri, Jul 07, 2023 at 05:01:38PM +0200, Juergen Gross wrote:
>> On 07.07.23 16:42, Roger Pau Monné wrote:
>>> On Fri, Jul 07, 2023 at 04:10:14PM +0200, Juergen Gross wrote:
>>>> On 07.07.23 11:50, Roger Pau Monné wrote:
>>>>> On Fri, Jul 07, 2023 at 06:38:48AM +0200, Juergen Gross wrote:
>>>>>> On 06.07.23 23:49, Stefano Stabellini wrote:
>>>>>>> On Thu, 6 Jul 2023, Roger Pau Monné wrote:
>>>>>>>> On Wed, Jul 05, 2023 at 03:41:10PM -0700, Stefano Stabellini wrote:
>>>>>>>>> On Wed, 5 Jul 2023, Roger Pau Monné wrote:
>>>>>>>>>> On Tue, Jul 04, 2023 at 08:14:59PM +0300, Oleksandr Tyshchenko wrote:
>>>>>>>>>>> Part 2 (clarification):
>>>>>>>>>>>
>>>>>>>>>>> I think using a special config space register in the root complex would
>>>>>>>>>>> not be terrible in terms of guest changes because it is easy to
>>>>>>>>>>> introduce a new root complex driver in Linux and other OSes. The root
>>>>>>>>>>> complex would still be ECAM compatible so the regular ECAM driver would
>>>>>>>>>>> still work. A new driver would only be necessary if you want to be able
>>>>>>>>>>> to access the special config space register.
>>>>>>>>>>
>>>>>>>>>> I'm slightly worry of this approach, we end up modifying a root
>>>>>>>>>> complex emulation in order to avoid modifying a PCI device emulation
>>>>>>>>>> on QEMU, not sure that's a good trade off.
>>>>>>>>>>
>>>>>>>>>> Note also that different architectures will likely have different root
>>>>>>>>>> complex, and so you might need to modify several of them, plus then
>>>>>>>>>> arrange the PCI layout correctly in order to have the proper hierarchy
>>>>>>>>>> so that devices belonging to different driver domains are assigned to
>>>>>>>>>> different bridges.
>>>>>>>>>
>>>>>>>>> I do think that adding something to the PCI conf register somewhere is
>>>>>>>>> the best option because it is not dependent on ACPI and it is not
>>>>>>>>> dependent on xenstore both of which are very undesirable.
>>>>>>>>>
>>>>>>>>> I am not sure where specifically is the best place. These are 3 ideas
>>>>>>>>> we came up with:
>>>>>>>>> 1. PCI root complex
>>>>>>>>> 2. a register on the device itself
>>>>>>>>> 3. a new capability of the device
>>>>>>>>> 4. add one extra dummy PCI device for the sole purpose of exposing the
>>>>>>>>>        grants capability
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Looking at the spec, there is a way to add a vendor-specific capability
>>>>>>>>> (cap_vndr = 0x9). Could we use that? It doesn't look like it is used
>>>>>>>>> today, Linux doesn't parse it.
>>>>>>>>
>>>>>>>> I did wonder the same from a quick look at the spec.  There's however
>>>>>>>> a text in the specification that says:
>>>>>>>>
>>>>>>>> "The driver SHOULD NOT use the Vendor data capability except for
>>>>>>>> debugging and reporting purposes."
>>>>>>>>
>>>>>>>> So we would at least need to change that because the capability would
>>>>>>>> then be used by other purposes different than debugging and reporting.
>>>>>>>>
>>>>>>>> Seems like a minor adjustment, so might we worth asking upstream about
>>>>>>>> their opinion, and to get a conversation started.
>>>>>>>
>>>>>>> Wait, wouldn't this use-case fall under "reporting" ? It is exactly what
>>>>>>> we are doing, right?
>>>>>>
>>>>>> I'd understand "reporting" as e.g. logging, transferring statistics, ...
>>>>>>
>>>>>> We'd like to use it for configuration purposes.
>>>>>
>>>>> I've also read it that way.
>>>>>
>>>>>> Another idea would be to enhance the virtio IOMMU device to suit our needs:
>>>>>> we could add the domid as another virtio IOMMU device capability and (for now)
>>>>>> use bypass mode for all "productive" devices.
>>>>>
>>>>> If we have to start adding capabilties, won't it be easier to just add
>>>>> it to the each device instead of adding it to virtio IOMMU.  Or is the
>>>>> parsing of capabilities device specific, and hence we would have to
>>>>> implement such parsing for each device?  I would expect some
>>>>> capabilities are shared between all devices, and a Xen capability could
>>>>> be one of those.
>>>>
>>>> Have a look at [1], which is describing the common device config layout.
>>>> The problem here is that we'd need to add the domid after the queue specific
>>>> data, resulting in a mess if further queue fields would be added later.
>>>>
>>>> We could try that, of course.
>>>
>>> Right, we must make it part of the standard if we modify
>>> virtio_pci_common_cfg, or else newly added fields would overlap the
>>> Xen specific one.
>>>
>>> Would it be possible to signal Xen-grants support in the
>>> `device_feature` field, and then expose it from a vendor capability?
>>> IOW, would it be possible to add a Xen-specific hook in the parsing of
>>> virtio_pci_common_cfg that would then fetch additional data from a
>>> capability?
>>
>> TBH, I don't know. It might require some changes in the central parsing
>> logic, but this shouldn't be too hard to do.
>>
>>> That would likely be less intrusive than adding a new Xen-specific
>>> field to virtio_pci_common_cfg while still allowing us to do Xen
>>> specific configuration for all VirtIO devices.
>>
>> In case we want to go that route, this should be in a new "platform config"
>> capability, which might be just another form of a vendor capability.
> 
> I think telling people that they will need to implement grants-v3 in
> order to solve this might be too much.  I would rather prefer a more
> concrete solution that doesn't have so many loose ends.
> 
> Anyway, it's up to the person doing the job, but starting with "you
> will have to implement grants-v3" is quite likely to deter anyone from
> attempting to solve this I'm afraid.

Fair enough. :-)


Juergen
Stefano Stabellini July 7, 2023, 9:02 p.m. UTC | #32
On Fri, 7 Jul 2023, Juergen Gross wrote:
> On 26.06.23 15:17, Petr Pavlu wrote:
> > On 6/21/23 19:58, Oleksandr Tyshchenko wrote:
> > > On 21.06.23 16:12, Petr Pavlu wrote:
> > > > When attempting to run Xen on a QEMU/KVM virtual machine with virtio
> > > > devices (all x86_64), dom0 tries to establish a grant for itself which
> > > > eventually results in a hang during the boot.
> > > > 
> > > > The backtrace looks as follows, the while loop in __send_control_msg()
> > > > makes no progress:
> > > > 
> > > >     #0  virtqueue_get_buf_ctx (_vq=_vq@entry=0xffff8880074a8400,
> > > > len=len@entry=0xffffc90000413c94, ctx=ctx@entry=0x0 <fixed_percpu_data>)
> > > > at ../drivers/virtio/virtio_ring.c:2326
> > > >     #1  0xffffffff817086b7 in virtqueue_get_buf
> > > > (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94) at
> > > > ../drivers/virtio/virtio_ring.c:2333
> > > >     #2  0xffffffff8175f6b2 in __send_control_msg (portdev=<optimized
> > > > out>, port_id=0xffffffff, event=0x0, value=0x1) at
> > > > ../drivers/char/virtio_console.c:562
> > > >     #3  0xffffffff8175f6ee in __send_control_msg (portdev=<optimized
> > > > out>, port_id=<optimized out>, event=<optimized out>, value=<optimized
> > > > out>) at ../drivers/char/virtio_console.c:569
> > > >     #4  0xffffffff817618b1 in virtcons_probe (vdev=0xffff88800585e800)
> > > > at ../drivers/char/virtio_console.c:2098
> > > >     #5  0xffffffff81707117 in virtio_dev_probe (_d=0xffff88800585e810)
> > > > at ../drivers/virtio/virtio.c:305
> > > >     #6  0xffffffff8198e348 in call_driver_probe (drv=0xffffffff82be40c0
> > > > <virtio_console>, drv=0xffffffff82be40c0 <virtio_console>,
> > > > dev=0xffff88800585e810) at ../drivers/base/dd.c:579
> > > >     #7  really_probe (dev=dev@entry=0xffff88800585e810,
> > > > drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at
> > > > ../drivers/base/dd.c:658
> > > >     #8  0xffffffff8198e58f in __driver_probe_device
> > > > (drv=drv@entry=0xffffffff82be40c0 <virtio_console>,
> > > > dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:800
> > > >     #9  0xffffffff8198e65a in driver_probe_device
> > > > (drv=drv@entry=0xffffffff82be40c0 <virtio_console>,
> > > > dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:830
> > > >     #10 0xffffffff8198e832 in __driver_attach (dev=0xffff88800585e810,
> > > > data=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1216
> > > >     #11 0xffffffff8198bfb2 in bus_for_each_dev (bus=<optimized out>,
> > > > start=start@entry=0x0 <fixed_percpu_data>,
> > > > data=data@entry=0xffffffff82be40c0 <virtio_console>,
> > > >         fn=fn@entry=0xffffffff8198e7b0 <__driver_attach>) at
> > > > ../drivers/base/bus.c:368
> > > >     #12 0xffffffff8198db65 in driver_attach
> > > > (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at
> > > > ../drivers/base/dd.c:1233
> > > >     #13 0xffffffff8198d207 in bus_add_driver
> > > > (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at
> > > > ../drivers/base/bus.c:673
> > > >     #14 0xffffffff8198f550 in driver_register
> > > > (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at
> > > > ../drivers/base/driver.c:246
> > > >     #15 0xffffffff81706b47 in register_virtio_driver
> > > > (driver=driver@entry=0xffffffff82be40c0 <virtio_console>) at
> > > > ../drivers/virtio/virtio.c:357
> > > >     #16 0xffffffff832cd34b in virtio_console_init () at
> > > > ../drivers/char/virtio_console.c:2258
> > > >     #17 0xffffffff8100105c in do_one_initcall (fn=0xffffffff832cd2e0
> > > > <virtio_console_init>) at ../init/main.c:1246
> > > >     #18 0xffffffff83277293 in do_initcall_level
> > > > (command_line=0xffff888003e2f900 "root", level=0x6) at
> > > > ../init/main.c:1319
> > > >     #19 do_initcalls () at ../init/main.c:1335
> > > >     #20 do_basic_setup () at ../init/main.c:1354
> > > >     #21 kernel_init_freeable () at ../init/main.c:1571
> > > >     #22 0xffffffff81f64be1 in kernel_init (unused=<optimized out>) at
> > > > ../init/main.c:1462
> > > >     #23 0xffffffff81001f49 in ret_from_fork () at
> > > > ../arch/x86/entry/entry_64.S:308
> > > >     #24 0x0000000000000000 in ?? ()
> > > > 
> > > > Fix the problem by preventing xen_grant_init_backend_domid() from
> > > > setting dom0 as a backend when running in dom0.
> > > > 
> > > > Fixes: 035e3a4321f7 ("xen/virtio: Optimize the setup of "xen-grant-dma"
> > > > devices")
> > > 
> > > 
> > > I am not 100% sure whether the Fixes tag points to precise commit. If I
> > > am not mistaken, the said commit just moves the code in the context
> > > without changing the logic of CONFIG_XEN_VIRTIO_FORCE_GRANT, this was
> > > introduced before.
> > 
> > I see, the tag should better point to 7228113d1fa0 ("xen/virtio: use
> > dom0 as default backend for CONFIG_XEN_VIRTIO_FORCE_GRANT") which
> > introduced the original logic to use dom0 as backend.
> > 
> > Commit 035e3a4321f7 ("xen/virtio: Optimize the setup of "xen-grant-dma"
> > devices") is relevant in sense that it extended when this logic is
> > active by adding an OR check for xen_pv_domain().
> > 
> > > 
> > > 
> > > > Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
> > > > ---
> > > >    drivers/xen/grant-dma-ops.c | 4 +++-
> > > >    1 file changed, 3 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c
> > > > index 76f6f26265a3..29ed27ac450e 100644
> > > > --- a/drivers/xen/grant-dma-ops.c
> > > > +++ b/drivers/xen/grant-dma-ops.c
> > > > @@ -362,7 +362,9 @@ static int xen_grant_init_backend_domid(struct
> > > > device *dev,
> > > >    	if (np) {
> > > >    		ret = xen_dt_grant_init_backend_domid(dev, np,
> > > > backend_domid);
> > > >    		of_node_put(np);
> > > > -	} else if (IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) ||
> > > > xen_pv_domain()) {
> > > > +	} else if ((IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) ||
> > > > +		    xen_pv_domain()) &&
> > > > +		   !xen_initial_domain()) {
> > > 
> > > The commit lgtm, just one note:
> > > 
> > > 
> > > I would even bail out early in xen_virtio_restricted_mem_acc() instead,
> > > as I assume the same issue could happen on Arm with DT (although there
> > > we don't guess the backend's domid, we read it from DT and quite
> > > unlikely we get Dom0 being in Dom0 with correct DT).
> > > 
> > > Something like:
> > > 
> > > @@ -416,6 +421,10 @@ bool xen_virtio_restricted_mem_acc(struct
> > > virtio_device *dev)
> > >    {
> > >           domid_t backend_domid;
> > > 
> > > +       /* Xen grant DMA ops are not used when running as initial domain
> > > */
> > > +       if (xen_initial_domain())
> > > +               return false;
> > > +
> > >           if (!xen_grant_init_backend_domid(dev->dev.parent,
> > > &backend_domid)) {
> > >                   xen_grant_setup_dma_ops(dev->dev.parent, backend_domid);
> > >                   return true;
> > > (END)
> > > 
> > > 
> > > 
> > > If so, that commit subject would need to be updated accordingly.
> > > 
> > > Let's see what other reviewers will say.
> > 
> > Ok, makes sense.
> 
> I think this is okay for a fix of the current problem.
> 
> Passing through virtio devices to a PV domU is not covered by this fix, but
> this
> should be a rather rare configuration, which doesn't work today either. So the
> suggested patch would fix the current issue without introducing a regression.
> 
> Anything else can be done later.

Why do you say that passing through virtio devices to a PV domU doesn't
work today anyway? Also, as you know, many people use Xen outside of
datacenter deployments (laptops, embedded etc.) where driver domains
and device assignment are very common. You could assign a virtio network
card to a domU and use PV networking to share the network with other
guests. Physical virtio devices, especially virtio-net devices, exist. I
could probably reproduce this problem today in a domU just by installing
QubesOS inside QEMU. QubesOS uses network driver domains, and if QEMU
provides a virtio-net network card, this would break even with this
patch.
Jürgen Groß July 8, 2023, 10:54 a.m. UTC | #33
On 07.07.23 23:02, Stefano Stabellini wrote:
> On Fri, 7 Jul 2023, Juergen Gross wrote:
>> On 26.06.23 15:17, Petr Pavlu wrote:
>>> On 6/21/23 19:58, Oleksandr Tyshchenko wrote:
>>>> On 21.06.23 16:12, Petr Pavlu wrote:
>>>>> When attempting to run Xen on a QEMU/KVM virtual machine with virtio
>>>>> devices (all x86_64), dom0 tries to establish a grant for itself which
>>>>> eventually results in a hang during the boot.
>>>>>
>>>>> The backtrace looks as follows, the while loop in __send_control_msg()
>>>>> makes no progress:
>>>>>
>>>>>      #0  virtqueue_get_buf_ctx (_vq=_vq@entry=0xffff8880074a8400,
>>>>> len=len@entry=0xffffc90000413c94, ctx=ctx@entry=0x0 <fixed_percpu_data>)
>>>>> at ../drivers/virtio/virtio_ring.c:2326
>>>>>      #1  0xffffffff817086b7 in virtqueue_get_buf
>>>>> (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94) at
>>>>> ../drivers/virtio/virtio_ring.c:2333
>>>>>      #2  0xffffffff8175f6b2 in __send_control_msg (portdev=<optimized
>>>>> out>, port_id=0xffffffff, event=0x0, value=0x1) at
>>>>> ../drivers/char/virtio_console.c:562
>>>>>      #3  0xffffffff8175f6ee in __send_control_msg (portdev=<optimized
>>>>> out>, port_id=<optimized out>, event=<optimized out>, value=<optimized
>>>>> out>) at ../drivers/char/virtio_console.c:569
>>>>>      #4  0xffffffff817618b1 in virtcons_probe (vdev=0xffff88800585e800)
>>>>> at ../drivers/char/virtio_console.c:2098
>>>>>      #5  0xffffffff81707117 in virtio_dev_probe (_d=0xffff88800585e810)
>>>>> at ../drivers/virtio/virtio.c:305
>>>>>      #6  0xffffffff8198e348 in call_driver_probe (drv=0xffffffff82be40c0
>>>>> <virtio_console>, drv=0xffffffff82be40c0 <virtio_console>,
>>>>> dev=0xffff88800585e810) at ../drivers/base/dd.c:579
>>>>>      #7  really_probe (dev=dev@entry=0xffff88800585e810,
>>>>> drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at
>>>>> ../drivers/base/dd.c:658
>>>>>      #8  0xffffffff8198e58f in __driver_probe_device
>>>>> (drv=drv@entry=0xffffffff82be40c0 <virtio_console>,
>>>>> dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:800
>>>>>      #9  0xffffffff8198e65a in driver_probe_device
>>>>> (drv=drv@entry=0xffffffff82be40c0 <virtio_console>,
>>>>> dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:830
>>>>>      #10 0xffffffff8198e832 in __driver_attach (dev=0xffff88800585e810,
>>>>> data=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1216
>>>>>      #11 0xffffffff8198bfb2 in bus_for_each_dev (bus=<optimized out>,
>>>>> start=start@entry=0x0 <fixed_percpu_data>,
>>>>> data=data@entry=0xffffffff82be40c0 <virtio_console>,
>>>>>          fn=fn@entry=0xffffffff8198e7b0 <__driver_attach>) at
>>>>> ../drivers/base/bus.c:368
>>>>>      #12 0xffffffff8198db65 in driver_attach
>>>>> (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at
>>>>> ../drivers/base/dd.c:1233
>>>>>      #13 0xffffffff8198d207 in bus_add_driver
>>>>> (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at
>>>>> ../drivers/base/bus.c:673
>>>>>      #14 0xffffffff8198f550 in driver_register
>>>>> (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at
>>>>> ../drivers/base/driver.c:246
>>>>>      #15 0xffffffff81706b47 in register_virtio_driver
>>>>> (driver=driver@entry=0xffffffff82be40c0 <virtio_console>) at
>>>>> ../drivers/virtio/virtio.c:357
>>>>>      #16 0xffffffff832cd34b in virtio_console_init () at
>>>>> ../drivers/char/virtio_console.c:2258
>>>>>      #17 0xffffffff8100105c in do_one_initcall (fn=0xffffffff832cd2e0
>>>>> <virtio_console_init>) at ../init/main.c:1246
>>>>>      #18 0xffffffff83277293 in do_initcall_level
>>>>> (command_line=0xffff888003e2f900 "root", level=0x6) at
>>>>> ../init/main.c:1319
>>>>>      #19 do_initcalls () at ../init/main.c:1335
>>>>>      #20 do_basic_setup () at ../init/main.c:1354
>>>>>      #21 kernel_init_freeable () at ../init/main.c:1571
>>>>>      #22 0xffffffff81f64be1 in kernel_init (unused=<optimized out>) at
>>>>> ../init/main.c:1462
>>>>>      #23 0xffffffff81001f49 in ret_from_fork () at
>>>>> ../arch/x86/entry/entry_64.S:308
>>>>>      #24 0x0000000000000000 in ?? ()
>>>>>
>>>>> Fix the problem by preventing xen_grant_init_backend_domid() from
>>>>> setting dom0 as a backend when running in dom0.
>>>>>
>>>>> Fixes: 035e3a4321f7 ("xen/virtio: Optimize the setup of "xen-grant-dma"
>>>>> devices")
>>>>
>>>>
>>>> I am not 100% sure whether the Fixes tag points to precise commit. If I
>>>> am not mistaken, the said commit just moves the code in the context
>>>> without changing the logic of CONFIG_XEN_VIRTIO_FORCE_GRANT, this was
>>>> introduced before.
>>>
>>> I see, the tag should better point to 7228113d1fa0 ("xen/virtio: use
>>> dom0 as default backend for CONFIG_XEN_VIRTIO_FORCE_GRANT") which
>>> introduced the original logic to use dom0 as backend.
>>>
>>> Commit 035e3a4321f7 ("xen/virtio: Optimize the setup of "xen-grant-dma"
>>> devices") is relevant in sense that it extended when this logic is
>>> active by adding an OR check for xen_pv_domain().
>>>
>>>>
>>>>
>>>>> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
>>>>> ---
>>>>>     drivers/xen/grant-dma-ops.c | 4 +++-
>>>>>     1 file changed, 3 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c
>>>>> index 76f6f26265a3..29ed27ac450e 100644
>>>>> --- a/drivers/xen/grant-dma-ops.c
>>>>> +++ b/drivers/xen/grant-dma-ops.c
>>>>> @@ -362,7 +362,9 @@ static int xen_grant_init_backend_domid(struct
>>>>> device *dev,
>>>>>     	if (np) {
>>>>>     		ret = xen_dt_grant_init_backend_domid(dev, np,
>>>>> backend_domid);
>>>>>     		of_node_put(np);
>>>>> -	} else if (IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) ||
>>>>> xen_pv_domain()) {
>>>>> +	} else if ((IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) ||
>>>>> +		    xen_pv_domain()) &&
>>>>> +		   !xen_initial_domain()) {
>>>>
>>>> The commit lgtm, just one note:
>>>>
>>>>
>>>> I would even bail out early in xen_virtio_restricted_mem_acc() instead,
>>>> as I assume the same issue could happen on Arm with DT (although there
>>>> we don't guess the backend's domid, we read it from DT and quite
>>>> unlikely we get Dom0 being in Dom0 with correct DT).
>>>>
>>>> Something like:
>>>>
>>>> @@ -416,6 +421,10 @@ bool xen_virtio_restricted_mem_acc(struct
>>>> virtio_device *dev)
>>>>     {
>>>>            domid_t backend_domid;
>>>>
>>>> +       /* Xen grant DMA ops are not used when running as initial domain
>>>> */
>>>> +       if (xen_initial_domain())
>>>> +               return false;
>>>> +
>>>>            if (!xen_grant_init_backend_domid(dev->dev.parent,
>>>> &backend_domid)) {
>>>>                    xen_grant_setup_dma_ops(dev->dev.parent, backend_domid);
>>>>                    return true;
>>>> (END)
>>>>
>>>>
>>>>
>>>> If so, that commit subject would need to be updated accordingly.
>>>>
>>>> Let's see what other reviewers will say.
>>>
>>> Ok, makes sense.
>>
>> I think this is okay for a fix of the current problem.
>>
>> Passing through virtio devices to a PV domU is not covered by this fix, but
>> this
>> should be a rather rare configuration, which doesn't work today either. So the
>> suggested patch would fix the current issue without introducing a regression.
>>
>> Anything else can be done later.
> 
> Why do you say that passing through virtio devices to a PV domU doesn't
> work today anyway? Also, as you know many people use Xen outside of
> datacenter deployments (laptops, embedded etc.) where drivers domains
> and device assignment are very common. You could assign a virtio network
> card to a domU and use PV network to share the network with other
> guests. Physical virtio devices, especially virtio-net devices, exist. I
> could probably repro this problem today in a domU just installing
> QubesOS inside QEMU. QubesOS uses network driver domains and if QEMU
> provides a virtio-net network card, this would break even with this
> patch.

I might be wrong, but I don't think all virtio frontends will work in that
scenario. The main reason is the PFN/MFN difference: a frontend using
guest-contiguous memory for large I/Os will fail miserably. This was the
main reason why I had to add the functionality of consecutive grants for
large I/O buffers. The same goes for multi-page virtio rings.
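
Just to illustrate that PFN/MFN point (purely an illustration, using the
x86 PV pfn_to_mfn() helper): a buffer that is contiguous in
guest-physical space is in general scattered in machine memory, so a
device given its guest-physical address as one large DMA region would
touch the wrong machine pages.

  #include <linux/types.h>
  #include <asm/xen/page.h>

  /*
   * Sketch: check whether nr_pages guest frames starting at first_pfn
   * are also consecutive machine frames.  On a PV guest this will
   * typically be false for anything larger than a single page.
   */
  static bool buffer_machine_contiguous(unsigned long first_pfn,
                                        unsigned long nr_pages)
  {
      unsigned long i, first_mfn = pfn_to_mfn(first_pfn);

      for (i = 1; i < nr_pages; i++)
          if (pfn_to_mfn(first_pfn + i) != first_mfn + i)
              return false;

      return true;
  }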


Juergen
Stefano Stabellini July 8, 2023, 6:13 p.m. UTC | #34
On Sat, 8 Jul 2023, Juergen Gross wrote:
> On 07.07.23 23:02, Stefano Stabellini wrote:
> > On Fri, 7 Jul 2023, Juergen Gross wrote:
> > > On 26.06.23 15:17, Petr Pavlu wrote:
> > > > On 6/21/23 19:58, Oleksandr Tyshchenko wrote:
> > > > > On 21.06.23 16:12, Petr Pavlu wrote:
> > > > > > When attempting to run Xen on a QEMU/KVM virtual machine with virtio
> > > > > > devices (all x86_64), dom0 tries to establish a grant for itself
> > > > > > which
> > > > > > eventually results in a hang during the boot.
> > > > > > 
> > > > > > The backtrace looks as follows, the while loop in
> > > > > > __send_control_msg()
> > > > > > makes no progress:
> > > > > > 
> > > > > >      #0  virtqueue_get_buf_ctx (_vq=_vq@entry=0xffff8880074a8400,
> > > > > > len=len@entry=0xffffc90000413c94, ctx=ctx@entry=0x0
> > > > > > <fixed_percpu_data>)
> > > > > > at ../drivers/virtio/virtio_ring.c:2326
> > > > > >      #1  0xffffffff817086b7 in virtqueue_get_buf
> > > > > > (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94)
> > > > > > at
> > > > > > ../drivers/virtio/virtio_ring.c:2333
> > > > > >      #2  0xffffffff8175f6b2 in __send_control_msg
> > > > > > (portdev=<optimized
> > > > > > out>, port_id=0xffffffff, event=0x0, value=0x1) at
> > > > > > ../drivers/char/virtio_console.c:562
> > > > > >      #3  0xffffffff8175f6ee in __send_control_msg
> > > > > > (portdev=<optimized
> > > > > > out>, port_id=<optimized out>, event=<optimized out>,
> > > > > > value=<optimized
> > > > > > out>) at ../drivers/char/virtio_console.c:569
> > > > > >      #4  0xffffffff817618b1 in virtcons_probe
> > > > > > (vdev=0xffff88800585e800)
> > > > > > at ../drivers/char/virtio_console.c:2098
> > > > > >      #5  0xffffffff81707117 in virtio_dev_probe
> > > > > > (_d=0xffff88800585e810)
> > > > > > at ../drivers/virtio/virtio.c:305
> > > > > >      #6  0xffffffff8198e348 in call_driver_probe
> > > > > > (drv=0xffffffff82be40c0
> > > > > > <virtio_console>, drv=0xffffffff82be40c0 <virtio_console>,
> > > > > > dev=0xffff88800585e810) at ../drivers/base/dd.c:579
> > > > > >      #7  really_probe (dev=dev@entry=0xffff88800585e810,
> > > > > > drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at
> > > > > > ../drivers/base/dd.c:658
> > > > > >      #8  0xffffffff8198e58f in __driver_probe_device
> > > > > > (drv=drv@entry=0xffffffff82be40c0 <virtio_console>,
> > > > > > dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:800
> > > > > >      #9  0xffffffff8198e65a in driver_probe_device
> > > > > > (drv=drv@entry=0xffffffff82be40c0 <virtio_console>,
> > > > > > dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:830
> > > > > >      #10 0xffffffff8198e832 in __driver_attach
> > > > > > (dev=0xffff88800585e810,
> > > > > > data=0xffffffff82be40c0 <virtio_console>) at
> > > > > > ../drivers/base/dd.c:1216
> > > > > >      #11 0xffffffff8198bfb2 in bus_for_each_dev (bus=<optimized
> > > > > > out>,
> > > > > > start=start@entry=0x0 <fixed_percpu_data>,
> > > > > > data=data@entry=0xffffffff82be40c0 <virtio_console>,
> > > > > >          fn=fn@entry=0xffffffff8198e7b0 <__driver_attach>) at
> > > > > > ../drivers/base/bus.c:368
> > > > > >      #12 0xffffffff8198db65 in driver_attach
> > > > > > (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at
> > > > > > ../drivers/base/dd.c:1233
> > > > > >      #13 0xffffffff8198d207 in bus_add_driver
> > > > > > (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at
> > > > > > ../drivers/base/bus.c:673
> > > > > >      #14 0xffffffff8198f550 in driver_register
> > > > > > (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at
> > > > > > ../drivers/base/driver.c:246
> > > > > >      #15 0xffffffff81706b47 in register_virtio_driver
> > > > > > (driver=driver@entry=0xffffffff82be40c0 <virtio_console>) at
> > > > > > ../drivers/virtio/virtio.c:357
> > > > > >      #16 0xffffffff832cd34b in virtio_console_init () at
> > > > > > ../drivers/char/virtio_console.c:2258
> > > > > >      #17 0xffffffff8100105c in do_one_initcall
> > > > > > (fn=0xffffffff832cd2e0
> > > > > > <virtio_console_init>) at ../init/main.c:1246
> > > > > >      #18 0xffffffff83277293 in do_initcall_level
> > > > > > (command_line=0xffff888003e2f900 "root", level=0x6) at
> > > > > > ../init/main.c:1319
> > > > > >      #19 do_initcalls () at ../init/main.c:1335
> > > > > >      #20 do_basic_setup () at ../init/main.c:1354
> > > > > >      #21 kernel_init_freeable () at ../init/main.c:1571
> > > > > >      #22 0xffffffff81f64be1 in kernel_init (unused=<optimized out>)
> > > > > > at
> > > > > > ../init/main.c:1462
> > > > > >      #23 0xffffffff81001f49 in ret_from_fork () at
> > > > > > ../arch/x86/entry/entry_64.S:308
> > > > > >      #24 0x0000000000000000 in ?? ()
> > > > > > 
> > > > > > Fix the problem by preventing xen_grant_init_backend_domid() from
> > > > > > setting dom0 as a backend when running in dom0.
> > > > > > 
> > > > > > Fixes: 035e3a4321f7 ("xen/virtio: Optimize the setup of
> > > > > > "xen-grant-dma"
> > > > > > devices")
> > > > > 
> > > > > 
> > > > > I am not 100% sure whether the Fixes tag points to precise commit. If
> > > > > I
> > > > > am not mistaken, the said commit just moves the code in the context
> > > > > without changing the logic of CONFIG_XEN_VIRTIO_FORCE_GRANT, this was
> > > > > introduced before.
> > > > 
> > > > I see, the tag should better point to 7228113d1fa0 ("xen/virtio: use
> > > > dom0 as default backend for CONFIG_XEN_VIRTIO_FORCE_GRANT") which
> > > > introduced the original logic to use dom0 as backend.
> > > > 
> > > > Commit 035e3a4321f7 ("xen/virtio: Optimize the setup of "xen-grant-dma"
> > > > devices") is relevant in sense that it extended when this logic is
> > > > active by adding an OR check for xen_pv_domain().
> > > > 
> > > > > 
> > > > > 
> > > > > > Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
> > > > > > ---
> > > > > >     drivers/xen/grant-dma-ops.c | 4 +++-
> > > > > >     1 file changed, 3 insertions(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/drivers/xen/grant-dma-ops.c
> > > > > > b/drivers/xen/grant-dma-ops.c
> > > > > > index 76f6f26265a3..29ed27ac450e 100644
> > > > > > --- a/drivers/xen/grant-dma-ops.c
> > > > > > +++ b/drivers/xen/grant-dma-ops.c
> > > > > > @@ -362,7 +362,9 @@ static int xen_grant_init_backend_domid(struct
> > > > > > device *dev,
> > > > > >     	if (np) {
> > > > > >     		ret = xen_dt_grant_init_backend_domid(dev, np,
> > > > > > backend_domid);
> > > > > >     		of_node_put(np);
> > > > > > -	} else if (IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) ||
> > > > > > xen_pv_domain()) {
> > > > > > +	} else if ((IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) ||
> > > > > > +		    xen_pv_domain()) &&
> > > > > > +		   !xen_initial_domain()) {
> > > > > 
> > > > > The commit lgtm, just one note:
> > > > > 
> > > > > 
> > > > > I would even bail out early in xen_virtio_restricted_mem_acc()
> > > > > instead,
> > > > > as I assume the same issue could happen on Arm with DT (although there
> > > > > we don't guess the backend's domid, we read it from DT and quite
> > > > > unlikely we get Dom0 being in Dom0 with correct DT).
> > > > > 
> > > > > Something like:
> > > > > 
> > > > > @@ -416,6 +421,10 @@ bool xen_virtio_restricted_mem_acc(struct
> > > > > virtio_device *dev)
> > > > >     {
> > > > >            domid_t backend_domid;
> > > > > 
> > > > > +       /* Xen grant DMA ops are not used when running as initial
> > > > > domain
> > > > > */
> > > > > +       if (xen_initial_domain())
> > > > > +               return false;
> > > > > +
> > > > >            if (!xen_grant_init_backend_domid(dev->dev.parent,
> > > > > &backend_domid)) {
> > > > >                    xen_grant_setup_dma_ops(dev->dev.parent,
> > > > > backend_domid);
> > > > >                    return true;
> > > > > (END)
> > > > > 
> > > > > 
> > > > > 
> > > > > If so, that commit subject would need to be updated accordingly.
> > > > > 
> > > > > Let's see what other reviewers will say.
> > > > 
> > > > Ok, makes sense.
> > > 
> > > I think this is okay for a fix of the current problem.
> > > 
> > > Passing through virtio devices to a PV domU is not covered by this fix,
> > > but
> > > this
> > > should be a rather rare configuration, which doesn't work today either. So
> > > the
> > > suggested patch would fix the current issue without introducing a
> > > regression.
> > > 
> > > Anything else can be done later.
> > 
> > Why do you say that passing through virtio devices to a PV domU doesn't
> > work today anyway? Also, as you know many people use Xen outside of
> > datacenter deployments (laptops, embedded etc.) where drivers domains
> > and device assignment are very common. You could assign a virtio network
> > card to a domU and use PV network to share the network with other
> > guests. Physical virtio devices, especially virtio-net devices, exist. I
> > could probably repro this problem today in a domU just installing
> > QubesOS inside QEMU. QubesOS uses network driver domains and if QEMU
> > provides a virtio-net network card, this would break even with this
> > patch.
> 
> I might be wrong, but I don't think all virtio frontends will work in that
> scenario. The main reason is the PFN/MFN difference: a frontend using guest
> consecutive memory for doing large I/Os will fail miserably. This was the
> main reason why I had to add the functionality of consecutive grants for
> large I/O buffers. The same goes for multi-page virtio ring pages.

I think for PV guests the virtio frontends would work, but they might be
slow as they would have to bounce every multi-page buffer through
swiotlb-xen. The virtio frontends should work OK for PVH and HVM guests,
though.
diff mbox series

Patch

diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c
index 76f6f26265a3..29ed27ac450e 100644
--- a/drivers/xen/grant-dma-ops.c
+++ b/drivers/xen/grant-dma-ops.c
@@ -362,7 +362,9 @@  static int xen_grant_init_backend_domid(struct device *dev,
 	if (np) {
 		ret = xen_dt_grant_init_backend_domid(dev, np, backend_domid);
 		of_node_put(np);
-	} else if (IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || xen_pv_domain()) {
+	} else if ((IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) ||
+		    xen_pv_domain()) &&
+		   !xen_initial_domain()) {
 		dev_info(dev, "Using dom0 as backend\n");
 		*backend_domid = 0;
 		ret = 0;