| Message ID | 20230621131214.9398-3-petr.pavlu@suse.com |
| --- | --- |
| State | New, archived |
| Series | Fix Linux dom0 boot on a QEMU/KVM virtual machine |
On 21.06.23 16:12, Petr Pavlu wrote: Hello Petr > When attempting to run Xen on a QEMU/KVM virtual machine with virtio > devices (all x86_64), dom0 tries to establish a grant for itself which > eventually results in a hang during the boot. > > The backtrace looks as follows, the while loop in __send_control_msg() > makes no progress: > > #0 virtqueue_get_buf_ctx (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94, ctx=ctx@entry=0x0 <fixed_percpu_data>) at ../drivers/virtio/virtio_ring.c:2326 > #1 0xffffffff817086b7 in virtqueue_get_buf (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94) at ../drivers/virtio/virtio_ring.c:2333 > #2 0xffffffff8175f6b2 in __send_control_msg (portdev=<optimized out>, port_id=0xffffffff, event=0x0, value=0x1) at ../drivers/char/virtio_console.c:562 > #3 0xffffffff8175f6ee in __send_control_msg (portdev=<optimized out>, port_id=<optimized out>, event=<optimized out>, value=<optimized out>) at ../drivers/char/virtio_console.c:569 > #4 0xffffffff817618b1 in virtcons_probe (vdev=0xffff88800585e800) at ../drivers/char/virtio_console.c:2098 > #5 0xffffffff81707117 in virtio_dev_probe (_d=0xffff88800585e810) at ../drivers/virtio/virtio.c:305 > #6 0xffffffff8198e348 in call_driver_probe (drv=0xffffffff82be40c0 <virtio_console>, drv=0xffffffff82be40c0 <virtio_console>, dev=0xffff88800585e810) at ../drivers/base/dd.c:579 > #7 really_probe (dev=dev@entry=0xffff88800585e810, drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:658 > #8 0xffffffff8198e58f in __driver_probe_device (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:800 > #9 0xffffffff8198e65a in driver_probe_device (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:830 > #10 0xffffffff8198e832 in __driver_attach (dev=0xffff88800585e810, data=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1216 > #11 0xffffffff8198bfb2 in bus_for_each_dev (bus=<optimized out>, start=start@entry=0x0 <fixed_percpu_data>, data=data@entry=0xffffffff82be40c0 <virtio_console>, > fn=fn@entry=0xffffffff8198e7b0 <__driver_attach>) at ../drivers/base/bus.c:368 > #12 0xffffffff8198db65 in driver_attach (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1233 > #13 0xffffffff8198d207 in bus_add_driver (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/bus.c:673 > #14 0xffffffff8198f550 in driver_register (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/driver.c:246 > #15 0xffffffff81706b47 in register_virtio_driver (driver=driver@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/virtio/virtio.c:357 > #16 0xffffffff832cd34b in virtio_console_init () at ../drivers/char/virtio_console.c:2258 > #17 0xffffffff8100105c in do_one_initcall (fn=0xffffffff832cd2e0 <virtio_console_init>) at ../init/main.c:1246 > #18 0xffffffff83277293 in do_initcall_level (command_line=0xffff888003e2f900 "root", level=0x6) at ../init/main.c:1319 > #19 do_initcalls () at ../init/main.c:1335 > #20 do_basic_setup () at ../init/main.c:1354 > #21 kernel_init_freeable () at ../init/main.c:1571 > #22 0xffffffff81f64be1 in kernel_init (unused=<optimized out>) at ../init/main.c:1462 > #23 0xffffffff81001f49 in ret_from_fork () at ../arch/x86/entry/entry_64.S:308 > #24 0x0000000000000000 in ?? () > > Fix the problem by preventing xen_grant_init_backend_domid() from > setting dom0 as a backend when running in dom0. 
> > Fixes: 035e3a4321f7 ("xen/virtio: Optimize the setup of "xen-grant-dma" devices") I am not 100% sure whether the Fixes tag points to precise commit. If I am not mistaken, the said commit just moves the code in the context without changing the logic of CONFIG_XEN_VIRTIO_FORCE_GRANT, this was introduced before. > Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> > --- > drivers/xen/grant-dma-ops.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c > index 76f6f26265a3..29ed27ac450e 100644 > --- a/drivers/xen/grant-dma-ops.c > +++ b/drivers/xen/grant-dma-ops.c > @@ -362,7 +362,9 @@ static int xen_grant_init_backend_domid(struct device *dev, > if (np) { > ret = xen_dt_grant_init_backend_domid(dev, np, backend_domid); > of_node_put(np); > - } else if (IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || xen_pv_domain()) { > + } else if ((IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || > + xen_pv_domain()) && > + !xen_initial_domain()) { The commit lgtm, just one note: I would even bail out early in xen_virtio_restricted_mem_acc() instead, as I assume the same issue could happen on Arm with DT (although there we don't guess the backend's domid, we read it from DT and quite unlikely we get Dom0 being in Dom0 with correct DT). Something like: @@ -416,6 +421,10 @@ bool xen_virtio_restricted_mem_acc(struct virtio_device *dev) { domid_t backend_domid; + /* Xen grant DMA ops are not used when running as initial domain */ + if (xen_initial_domain()) + return false; + if (!xen_grant_init_backend_domid(dev->dev.parent, &backend_domid)) { xen_grant_setup_dma_ops(dev->dev.parent, backend_domid); return true; (END) If so, that commit subject would need to be updated accordingly. Let's see what other reviewers will say. > dev_info(dev, "Using dom0 as backend\n"); > *backend_domid = 0; > ret = 0;
On 6/21/23 19:58, Oleksandr Tyshchenko wrote: > On 21.06.23 16:12, Petr Pavlu wrote: >> When attempting to run Xen on a QEMU/KVM virtual machine with virtio >> devices (all x86_64), dom0 tries to establish a grant for itself which >> eventually results in a hang during the boot. >> >> The backtrace looks as follows, the while loop in __send_control_msg() >> makes no progress: >> >> #0 virtqueue_get_buf_ctx (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94, ctx=ctx@entry=0x0 <fixed_percpu_data>) at ../drivers/virtio/virtio_ring.c:2326 >> #1 0xffffffff817086b7 in virtqueue_get_buf (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94) at ../drivers/virtio/virtio_ring.c:2333 >> #2 0xffffffff8175f6b2 in __send_control_msg (portdev=<optimized out>, port_id=0xffffffff, event=0x0, value=0x1) at ../drivers/char/virtio_console.c:562 >> #3 0xffffffff8175f6ee in __send_control_msg (portdev=<optimized out>, port_id=<optimized out>, event=<optimized out>, value=<optimized out>) at ../drivers/char/virtio_console.c:569 >> #4 0xffffffff817618b1 in virtcons_probe (vdev=0xffff88800585e800) at ../drivers/char/virtio_console.c:2098 >> #5 0xffffffff81707117 in virtio_dev_probe (_d=0xffff88800585e810) at ../drivers/virtio/virtio.c:305 >> #6 0xffffffff8198e348 in call_driver_probe (drv=0xffffffff82be40c0 <virtio_console>, drv=0xffffffff82be40c0 <virtio_console>, dev=0xffff88800585e810) at ../drivers/base/dd.c:579 >> #7 really_probe (dev=dev@entry=0xffff88800585e810, drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:658 >> #8 0xffffffff8198e58f in __driver_probe_device (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:800 >> #9 0xffffffff8198e65a in driver_probe_device (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:830 >> #10 0xffffffff8198e832 in __driver_attach (dev=0xffff88800585e810, data=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1216 >> #11 0xffffffff8198bfb2 in bus_for_each_dev (bus=<optimized out>, start=start@entry=0x0 <fixed_percpu_data>, data=data@entry=0xffffffff82be40c0 <virtio_console>, >> fn=fn@entry=0xffffffff8198e7b0 <__driver_attach>) at ../drivers/base/bus.c:368 >> #12 0xffffffff8198db65 in driver_attach (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1233 >> #13 0xffffffff8198d207 in bus_add_driver (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/bus.c:673 >> #14 0xffffffff8198f550 in driver_register (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/driver.c:246 >> #15 0xffffffff81706b47 in register_virtio_driver (driver=driver@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/virtio/virtio.c:357 >> #16 0xffffffff832cd34b in virtio_console_init () at ../drivers/char/virtio_console.c:2258 >> #17 0xffffffff8100105c in do_one_initcall (fn=0xffffffff832cd2e0 <virtio_console_init>) at ../init/main.c:1246 >> #18 0xffffffff83277293 in do_initcall_level (command_line=0xffff888003e2f900 "root", level=0x6) at ../init/main.c:1319 >> #19 do_initcalls () at ../init/main.c:1335 >> #20 do_basic_setup () at ../init/main.c:1354 >> #21 kernel_init_freeable () at ../init/main.c:1571 >> #22 0xffffffff81f64be1 in kernel_init (unused=<optimized out>) at ../init/main.c:1462 >> #23 0xffffffff81001f49 in ret_from_fork () at ../arch/x86/entry/entry_64.S:308 >> #24 0x0000000000000000 in ?? 
() >> >> Fix the problem by preventing xen_grant_init_backend_domid() from >> setting dom0 as a backend when running in dom0. >> >> Fixes: 035e3a4321f7 ("xen/virtio: Optimize the setup of "xen-grant-dma" devices") > > > I am not 100% sure whether the Fixes tag points to precise commit. If I > am not mistaken, the said commit just moves the code in the context > without changing the logic of CONFIG_XEN_VIRTIO_FORCE_GRANT, this was > introduced before. I see, the tag should better point to 7228113d1fa0 ("xen/virtio: use dom0 as default backend for CONFIG_XEN_VIRTIO_FORCE_GRANT") which introduced the original logic to use dom0 as backend. Commit 035e3a4321f7 ("xen/virtio: Optimize the setup of "xen-grant-dma" devices") is relevant in sense that it extended when this logic is active by adding an OR check for xen_pv_domain(). > > >> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> >> --- >> drivers/xen/grant-dma-ops.c | 4 +++- >> 1 file changed, 3 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c >> index 76f6f26265a3..29ed27ac450e 100644 >> --- a/drivers/xen/grant-dma-ops.c >> +++ b/drivers/xen/grant-dma-ops.c >> @@ -362,7 +362,9 @@ static int xen_grant_init_backend_domid(struct device *dev, >> if (np) { >> ret = xen_dt_grant_init_backend_domid(dev, np, backend_domid); >> of_node_put(np); >> - } else if (IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || xen_pv_domain()) { >> + } else if ((IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || >> + xen_pv_domain()) && >> + !xen_initial_domain()) { > > The commit lgtm, just one note: > > > I would even bail out early in xen_virtio_restricted_mem_acc() instead, > as I assume the same issue could happen on Arm with DT (although there > we don't guess the backend's domid, we read it from DT and quite > unlikely we get Dom0 being in Dom0 with correct DT). > > Something like: > > @@ -416,6 +421,10 @@ bool xen_virtio_restricted_mem_acc(struct > virtio_device *dev) > { > domid_t backend_domid; > > + /* Xen grant DMA ops are not used when running as initial domain */ > + if (xen_initial_domain()) > + return false; > + > if (!xen_grant_init_backend_domid(dev->dev.parent, > &backend_domid)) { > xen_grant_setup_dma_ops(dev->dev.parent, backend_domid); > return true; > (END) > > > > If so, that commit subject would need to be updated accordingly. > > Let's see what other reviewers will say. Ok, makes sense. Thanks, Petr
On Wed, 21 Jun 2023, Oleksandr Tyshchenko wrote: > On 21.06.23 16:12, Petr Pavlu wrote: > > > Hello Petr > > > > When attempting to run Xen on a QEMU/KVM virtual machine with virtio > > devices (all x86_64), dom0 tries to establish a grant for itself which > > eventually results in a hang during the boot. > > > > The backtrace looks as follows, the while loop in __send_control_msg() > > makes no progress: > > > > #0 virtqueue_get_buf_ctx (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94, ctx=ctx@entry=0x0 <fixed_percpu_data>) at ../drivers/virtio/virtio_ring.c:2326 > > #1 0xffffffff817086b7 in virtqueue_get_buf (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94) at ../drivers/virtio/virtio_ring.c:2333 > > #2 0xffffffff8175f6b2 in __send_control_msg (portdev=<optimized out>, port_id=0xffffffff, event=0x0, value=0x1) at ../drivers/char/virtio_console.c:562 > > #3 0xffffffff8175f6ee in __send_control_msg (portdev=<optimized out>, port_id=<optimized out>, event=<optimized out>, value=<optimized out>) at ../drivers/char/virtio_console.c:569 > > #4 0xffffffff817618b1 in virtcons_probe (vdev=0xffff88800585e800) at ../drivers/char/virtio_console.c:2098 > > #5 0xffffffff81707117 in virtio_dev_probe (_d=0xffff88800585e810) at ../drivers/virtio/virtio.c:305 > > #6 0xffffffff8198e348 in call_driver_probe (drv=0xffffffff82be40c0 <virtio_console>, drv=0xffffffff82be40c0 <virtio_console>, dev=0xffff88800585e810) at ../drivers/base/dd.c:579 > > #7 really_probe (dev=dev@entry=0xffff88800585e810, drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:658 > > #8 0xffffffff8198e58f in __driver_probe_device (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:800 > > #9 0xffffffff8198e65a in driver_probe_device (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:830 > > #10 0xffffffff8198e832 in __driver_attach (dev=0xffff88800585e810, data=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1216 > > #11 0xffffffff8198bfb2 in bus_for_each_dev (bus=<optimized out>, start=start@entry=0x0 <fixed_percpu_data>, data=data@entry=0xffffffff82be40c0 <virtio_console>, > > fn=fn@entry=0xffffffff8198e7b0 <__driver_attach>) at ../drivers/base/bus.c:368 > > #12 0xffffffff8198db65 in driver_attach (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1233 > > #13 0xffffffff8198d207 in bus_add_driver (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/bus.c:673 > > #14 0xffffffff8198f550 in driver_register (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/driver.c:246 > > #15 0xffffffff81706b47 in register_virtio_driver (driver=driver@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/virtio/virtio.c:357 > > #16 0xffffffff832cd34b in virtio_console_init () at ../drivers/char/virtio_console.c:2258 > > #17 0xffffffff8100105c in do_one_initcall (fn=0xffffffff832cd2e0 <virtio_console_init>) at ../init/main.c:1246 > > #18 0xffffffff83277293 in do_initcall_level (command_line=0xffff888003e2f900 "root", level=0x6) at ../init/main.c:1319 > > #19 do_initcalls () at ../init/main.c:1335 > > #20 do_basic_setup () at ../init/main.c:1354 > > #21 kernel_init_freeable () at ../init/main.c:1571 > > #22 0xffffffff81f64be1 in kernel_init (unused=<optimized out>) at ../init/main.c:1462 > > #23 0xffffffff81001f49 in ret_from_fork () at ../arch/x86/entry/entry_64.S:308 > > #24 0x0000000000000000 in ?? 
() > > > > Fix the problem by preventing xen_grant_init_backend_domid() from > > setting dom0 as a backend when running in dom0. > > > > Fixes: 035e3a4321f7 ("xen/virtio: Optimize the setup of "xen-grant-dma" devices") > > > I am not 100% sure whether the Fixes tag points to precise commit. If I > am not mistaken, the said commit just moves the code in the context > without changing the logic of CONFIG_XEN_VIRTIO_FORCE_GRANT, this was > introduced before. > > > > Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> > > --- > > drivers/xen/grant-dma-ops.c | 4 +++- > > 1 file changed, 3 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c > > index 76f6f26265a3..29ed27ac450e 100644 > > --- a/drivers/xen/grant-dma-ops.c > > +++ b/drivers/xen/grant-dma-ops.c > > @@ -362,7 +362,9 @@ static int xen_grant_init_backend_domid(struct device *dev, > > if (np) { > > ret = xen_dt_grant_init_backend_domid(dev, np, backend_domid); > > of_node_put(np); > > - } else if (IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || xen_pv_domain()) { > > + } else if ((IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || > > + xen_pv_domain()) && > > + !xen_initial_domain()) { > > The commit lgtm, just one note: > > > I would even bail out early in xen_virtio_restricted_mem_acc() instead, > as I assume the same issue could happen on Arm with DT (although there > we don't guess the backend's domid, we read it from DT and quite > unlikely we get Dom0 being in Dom0 with correct DT). > > Something like: > > @@ -416,6 +421,10 @@ bool xen_virtio_restricted_mem_acc(struct > virtio_device *dev) > { > domid_t backend_domid; > > + /* Xen grant DMA ops are not used when running as initial domain */ > + if (xen_initial_domain()) > + return false; > + > if (!xen_grant_init_backend_domid(dev->dev.parent, > &backend_domid)) { > xen_grant_setup_dma_ops(dev->dev.parent, backend_domid); > return true; > (END) > > > > If so, that commit subject would need to be updated accordingly. > > Let's see what other reviewers will say. This doesn't work in all cases. Imagine using PCI Passthrough to assign a "physical" virtio device to a domU. The domU will run into the same error, right? The problem is that we need a way for the virtio backend to advertise its ability of handling grants. Right now we only have a way to do with that with device tree on ARM. On x86, we only have CONFIG_XEN_VIRTIO_FORCE_GRANT, and if we take CONFIG_XEN_VIRTIO_FORCE_GRANT at face value, it also enables grants for "physical" virtio devices. Note that in this case we are fixing a nested-virtualization bug, but there are actually physical virtio-compatible devices out there. CONFIG_XEN_VIRTIO_FORCE_GRANT will break those too. I think we need to add a second way? It could be anything that can help us distinguish between a non-grants-capable virtio backend and a grants-capable virtio backend, such as: - a string on xenstore - a xen param - a special PCI configuration register value - something in the ACPI tables - the QEMU machine type Or at least should we change CONFIG_XEN_VIRTIO_FORCE_GRANT into a command line parameter so that it can be disabled in cases like this one? I realize that fixing this problem properly takes a lot longer than adding a trivial if (dom0) return; check in the code. 
If you cannot find a good way to solve the problem, or you don't have time to do that now and you need this bug fixed quickly, then I would be OK with the if (dom0) return; check, but please add a detailed TODO in-code comment to explain that this is just a hack and we are still looking for a real solution. As for the check itself, I prefer the original position because I want to retain the ability to use virtio frontends with grants on ARM in Dom0 (the DomD case).
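For illustration only, a minimal sketch of what such a TODO comment could look like if the check stays in xen_grant_init_backend_domid() as in the patch above (this is not the applied change, just the shape of the requested comment):

```c
	} else if ((IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) ||
		    xen_pv_domain()) &&
		   !xen_initial_domain()) {
		/*
		 * TODO: The !xen_initial_domain() part is a stopgap. There is
		 * currently no generic way for a virtio backend to advertise
		 * grant support (xenstore entry, Xen param, PCI config space
		 * register, ACPI, ...), so for now dom0 simply never installs
		 * the Xen grant DMA ops to avoid granting pages to itself.
		 */
		dev_info(dev, "Using dom0 as backend\n");
		*backend_domid = 0;
		ret = 0;
	}
```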
On 29.06.23 04:00, Stefano Stabellini wrote: Hello Stefano > On Wed, 21 Jun 2023, Oleksandr Tyshchenko wrote: >> On 21.06.23 16:12, Petr Pavlu wrote: >> >> >> Hello Petr >> >> >>> When attempting to run Xen on a QEMU/KVM virtual machine with virtio >>> devices (all x86_64), dom0 tries to establish a grant for itself which >>> eventually results in a hang during the boot. >>> >>> The backtrace looks as follows, the while loop in __send_control_msg() >>> makes no progress: >>> >>> #0 virtqueue_get_buf_ctx (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94, ctx=ctx@entry=0x0 <fixed_percpu_data>) at ../drivers/virtio/virtio_ring.c:2326 >>> #1 0xffffffff817086b7 in virtqueue_get_buf (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94) at ../drivers/virtio/virtio_ring.c:2333 >>> #2 0xffffffff8175f6b2 in __send_control_msg (portdev=<optimized out>, port_id=0xffffffff, event=0x0, value=0x1) at ../drivers/char/virtio_console.c:562 >>> #3 0xffffffff8175f6ee in __send_control_msg (portdev=<optimized out>, port_id=<optimized out>, event=<optimized out>, value=<optimized out>) at ../drivers/char/virtio_console.c:569 >>> #4 0xffffffff817618b1 in virtcons_probe (vdev=0xffff88800585e800) at ../drivers/char/virtio_console.c:2098 >>> #5 0xffffffff81707117 in virtio_dev_probe (_d=0xffff88800585e810) at ../drivers/virtio/virtio.c:305 >>> #6 0xffffffff8198e348 in call_driver_probe (drv=0xffffffff82be40c0 <virtio_console>, drv=0xffffffff82be40c0 <virtio_console>, dev=0xffff88800585e810) at ../drivers/base/dd.c:579 >>> #7 really_probe (dev=dev@entry=0xffff88800585e810, drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:658 >>> #8 0xffffffff8198e58f in __driver_probe_device (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:800 >>> #9 0xffffffff8198e65a in driver_probe_device (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:830 >>> #10 0xffffffff8198e832 in __driver_attach (dev=0xffff88800585e810, data=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1216 >>> #11 0xffffffff8198bfb2 in bus_for_each_dev (bus=<optimized out>, start=start@entry=0x0 <fixed_percpu_data>, data=data@entry=0xffffffff82be40c0 <virtio_console>, >>> fn=fn@entry=0xffffffff8198e7b0 <__driver_attach>) at ../drivers/base/bus.c:368 >>> #12 0xffffffff8198db65 in driver_attach (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1233 >>> #13 0xffffffff8198d207 in bus_add_driver (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/bus.c:673 >>> #14 0xffffffff8198f550 in driver_register (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/driver.c:246 >>> #15 0xffffffff81706b47 in register_virtio_driver (driver=driver@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/virtio/virtio.c:357 >>> #16 0xffffffff832cd34b in virtio_console_init () at ../drivers/char/virtio_console.c:2258 >>> #17 0xffffffff8100105c in do_one_initcall (fn=0xffffffff832cd2e0 <virtio_console_init>) at ../init/main.c:1246 >>> #18 0xffffffff83277293 in do_initcall_level (command_line=0xffff888003e2f900 "root", level=0x6) at ../init/main.c:1319 >>> #19 do_initcalls () at ../init/main.c:1335 >>> #20 do_basic_setup () at ../init/main.c:1354 >>> #21 kernel_init_freeable () at ../init/main.c:1571 >>> #22 0xffffffff81f64be1 in kernel_init (unused=<optimized out>) at ../init/main.c:1462 >>> #23 0xffffffff81001f49 in ret_from_fork () at 
../arch/x86/entry/entry_64.S:308 >>> #24 0x0000000000000000 in ?? () >>> >>> Fix the problem by preventing xen_grant_init_backend_domid() from >>> setting dom0 as a backend when running in dom0. >>> >>> Fixes: 035e3a4321f7 ("xen/virtio: Optimize the setup of "xen-grant-dma" devices") >> >> >> I am not 100% sure whether the Fixes tag points to precise commit. If I >> am not mistaken, the said commit just moves the code in the context >> without changing the logic of CONFIG_XEN_VIRTIO_FORCE_GRANT, this was >> introduced before. >> >> >>> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> >>> --- >>> drivers/xen/grant-dma-ops.c | 4 +++- >>> 1 file changed, 3 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c >>> index 76f6f26265a3..29ed27ac450e 100644 >>> --- a/drivers/xen/grant-dma-ops.c >>> +++ b/drivers/xen/grant-dma-ops.c >>> @@ -362,7 +362,9 @@ static int xen_grant_init_backend_domid(struct device *dev, >>> if (np) { >>> ret = xen_dt_grant_init_backend_domid(dev, np, backend_domid); >>> of_node_put(np); >>> - } else if (IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || xen_pv_domain()) { >>> + } else if ((IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || >>> + xen_pv_domain()) && >>> + !xen_initial_domain()) { >> >> The commit lgtm, just one note: >> >> >> I would even bail out early in xen_virtio_restricted_mem_acc() instead, >> as I assume the same issue could happen on Arm with DT (although there >> we don't guess the backend's domid, we read it from DT and quite >> unlikely we get Dom0 being in Dom0 with correct DT). >> >> Something like: >> >> @@ -416,6 +421,10 @@ bool xen_virtio_restricted_mem_acc(struct >> virtio_device *dev) >> { >> domid_t backend_domid; >> >> + /* Xen grant DMA ops are not used when running as initial domain */ >> + if (xen_initial_domain()) >> + return false; >> + >> if (!xen_grant_init_backend_domid(dev->dev.parent, >> &backend_domid)) { >> xen_grant_setup_dma_ops(dev->dev.parent, backend_domid); >> return true; >> (END) >> >> >> >> If so, that commit subject would need to be updated accordingly. >> >> Let's see what other reviewers will say. > > This doesn't work in all cases. Imagine using PCI Passthrough to assign > a "physical" virtio device to a domU. The domU will run into the same > error, right? > > The problem is that we need a way for the virtio backend to advertise > its ability of handling grants. Right now we only have a way to do with > that with device tree on ARM. On x86, we only have > CONFIG_XEN_VIRTIO_FORCE_GRANT, and if we take > CONFIG_XEN_VIRTIO_FORCE_GRANT at face value, it also enables grants for > "physical" virtio devices. Note that in this case we are fixing a > nested-virtualization bug, but there are actually physical > virtio-compatible devices out there. CONFIG_XEN_VIRTIO_FORCE_GRANT will > break those too. If these "physical" virtio devices are also spawned by drivers/virtio/virtio.c:virtio_dev_probe(), then yes, otherwise I don't see how this could even be possible, but I might miss something here. xen_virtio_restricted_mem_acc() gets called indirectly from virtio_dev_probe()->virtio_features_ok()-> virtio_check_mem_acc_cb(). So the Xen grant DMA ops are only installed for those. > > I think we need to add a second way? 
It could be anything that can help > us distinguish between a non-grants-capable virtio backend and a > grants-capable virtio backend, such as: > - a string on xenstore > - a xen param > - a special PCI configuration register value > - something in the ACPI tables > - the QEMU machine type Yes, I remember there was a discussion regarding that. The point is to choose a solution to be functional for both PV and HVM *and* to be able to support a hotplug. IIRC, the xenstore could be a possible candidate. > > Or at least should we change CONFIG_XEN_VIRTIO_FORCE_GRANT into a > command line parameter so that it can be disabled in cases like this > one? IIUC, this will help with HVM only. > > I realize that fixing this problem properly takes a lot longer than > adding a trivial if (dom0) return; check in the code. If you cannot find > a good way to solve the problem or you don't have time to do that now > and you need this bug fixed quickly, then I would be OK with the if > (dom0) return; check but please add a detailed TODO in-code comment to > explain that this is just a hack and we are still looking for a real > solution. > > The check itself I prefer the original position because I want to retain > the ability of using virtio frontends with grant on ARM in Dom0 (DomD > case). Makes sense, agree.
On Thu, 29 Jun 2023, Oleksandr Tyshchenko wrote: > On 29.06.23 04:00, Stefano Stabellini wrote: > > Hello Stefano > > > On Wed, 21 Jun 2023, Oleksandr Tyshchenko wrote: > >> On 21.06.23 16:12, Petr Pavlu wrote: > >> > >> > >> Hello Petr > >> > >> > >>> When attempting to run Xen on a QEMU/KVM virtual machine with virtio > >>> devices (all x86_64), dom0 tries to establish a grant for itself which > >>> eventually results in a hang during the boot. > >>> > >>> The backtrace looks as follows, the while loop in __send_control_msg() > >>> makes no progress: > >>> > >>> #0 virtqueue_get_buf_ctx (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94, ctx=ctx@entry=0x0 <fixed_percpu_data>) at ../drivers/virtio/virtio_ring.c:2326 > >>> #1 0xffffffff817086b7 in virtqueue_get_buf (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94) at ../drivers/virtio/virtio_ring.c:2333 > >>> #2 0xffffffff8175f6b2 in __send_control_msg (portdev=<optimized out>, port_id=0xffffffff, event=0x0, value=0x1) at ../drivers/char/virtio_console.c:562 > >>> #3 0xffffffff8175f6ee in __send_control_msg (portdev=<optimized out>, port_id=<optimized out>, event=<optimized out>, value=<optimized out>) at ../drivers/char/virtio_console.c:569 > >>> #4 0xffffffff817618b1 in virtcons_probe (vdev=0xffff88800585e800) at ../drivers/char/virtio_console.c:2098 > >>> #5 0xffffffff81707117 in virtio_dev_probe (_d=0xffff88800585e810) at ../drivers/virtio/virtio.c:305 > >>> #6 0xffffffff8198e348 in call_driver_probe (drv=0xffffffff82be40c0 <virtio_console>, drv=0xffffffff82be40c0 <virtio_console>, dev=0xffff88800585e810) at ../drivers/base/dd.c:579 > >>> #7 really_probe (dev=dev@entry=0xffff88800585e810, drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:658 > >>> #8 0xffffffff8198e58f in __driver_probe_device (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:800 > >>> #9 0xffffffff8198e65a in driver_probe_device (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:830 > >>> #10 0xffffffff8198e832 in __driver_attach (dev=0xffff88800585e810, data=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1216 > >>> #11 0xffffffff8198bfb2 in bus_for_each_dev (bus=<optimized out>, start=start@entry=0x0 <fixed_percpu_data>, data=data@entry=0xffffffff82be40c0 <virtio_console>, > >>> fn=fn@entry=0xffffffff8198e7b0 <__driver_attach>) at ../drivers/base/bus.c:368 > >>> #12 0xffffffff8198db65 in driver_attach (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1233 > >>> #13 0xffffffff8198d207 in bus_add_driver (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/bus.c:673 > >>> #14 0xffffffff8198f550 in driver_register (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/driver.c:246 > >>> #15 0xffffffff81706b47 in register_virtio_driver (driver=driver@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/virtio/virtio.c:357 > >>> #16 0xffffffff832cd34b in virtio_console_init () at ../drivers/char/virtio_console.c:2258 > >>> #17 0xffffffff8100105c in do_one_initcall (fn=0xffffffff832cd2e0 <virtio_console_init>) at ../init/main.c:1246 > >>> #18 0xffffffff83277293 in do_initcall_level (command_line=0xffff888003e2f900 "root", level=0x6) at ../init/main.c:1319 > >>> #19 do_initcalls () at ../init/main.c:1335 > >>> #20 do_basic_setup () at ../init/main.c:1354 > >>> #21 kernel_init_freeable () at ../init/main.c:1571 > >>> #22 
0xffffffff81f64be1 in kernel_init (unused=<optimized out>) at ../init/main.c:1462 > >>> #23 0xffffffff81001f49 in ret_from_fork () at ../arch/x86/entry/entry_64.S:308 > >>> #24 0x0000000000000000 in ?? () > >>> > >>> Fix the problem by preventing xen_grant_init_backend_domid() from > >>> setting dom0 as a backend when running in dom0. > >>> > >>> Fixes: 035e3a4321f7 ("xen/virtio: Optimize the setup of "xen-grant-dma" devices") > >> > >> > >> I am not 100% sure whether the Fixes tag points to precise commit. If I > >> am not mistaken, the said commit just moves the code in the context > >> without changing the logic of CONFIG_XEN_VIRTIO_FORCE_GRANT, this was > >> introduced before. > >> > >> > >>> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> > >>> --- > >>> drivers/xen/grant-dma-ops.c | 4 +++- > >>> 1 file changed, 3 insertions(+), 1 deletion(-) > >>> > >>> diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c > >>> index 76f6f26265a3..29ed27ac450e 100644 > >>> --- a/drivers/xen/grant-dma-ops.c > >>> +++ b/drivers/xen/grant-dma-ops.c > >>> @@ -362,7 +362,9 @@ static int xen_grant_init_backend_domid(struct device *dev, > >>> if (np) { > >>> ret = xen_dt_grant_init_backend_domid(dev, np, backend_domid); > >>> of_node_put(np); > >>> - } else if (IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || xen_pv_domain()) { > >>> + } else if ((IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || > >>> + xen_pv_domain()) && > >>> + !xen_initial_domain()) { > >> > >> The commit lgtm, just one note: > >> > >> > >> I would even bail out early in xen_virtio_restricted_mem_acc() instead, > >> as I assume the same issue could happen on Arm with DT (although there > >> we don't guess the backend's domid, we read it from DT and quite > >> unlikely we get Dom0 being in Dom0 with correct DT). > >> > >> Something like: > >> > >> @@ -416,6 +421,10 @@ bool xen_virtio_restricted_mem_acc(struct > >> virtio_device *dev) > >> { > >> domid_t backend_domid; > >> > >> + /* Xen grant DMA ops are not used when running as initial domain */ > >> + if (xen_initial_domain()) > >> + return false; > >> + > >> if (!xen_grant_init_backend_domid(dev->dev.parent, > >> &backend_domid)) { > >> xen_grant_setup_dma_ops(dev->dev.parent, backend_domid); > >> return true; > >> (END) > >> > >> > >> > >> If so, that commit subject would need to be updated accordingly. > >> > >> Let's see what other reviewers will say. > > > > This doesn't work in all cases. Imagine using PCI Passthrough to assign > > a "physical" virtio device to a domU. The domU will run into the same > > error, right? > > > > The problem is that we need a way for the virtio backend to advertise > > its ability of handling grants. Right now we only have a way to do with > > that with device tree on ARM. On x86, we only have > > CONFIG_XEN_VIRTIO_FORCE_GRANT, and if we take > > CONFIG_XEN_VIRTIO_FORCE_GRANT at face value, it also enables grants for > > "physical" virtio devices. Note that in this case we are fixing a > > nested-virtualization bug, but there are actually physical > > virtio-compatible devices out there. CONFIG_XEN_VIRTIO_FORCE_GRANT will > > break those too. > > > If these "physical" virtio devices are also spawned by > drivers/virtio/virtio.c:virtio_dev_probe(), then yes, otherwise I don't > see how this could even be possible, but I might miss something here. 
Yes, I would imagine virtio_dev_probe() would be called for them too > xen_virtio_restricted_mem_acc() gets called indirectly from > virtio_dev_probe()->virtio_features_ok()-> > virtio_check_mem_acc_cb(). So the Xen grant DMA ops are only installed > for those. > > > > > > I think we need to add a second way? It could be anything that can help > > us distinguish between a non-grants-capable virtio backend and a > > grants-capable virtio backend, such as: > > - a string on xenstore > > - a xen param > > - a special PCI configuration register value > > - something in the ACPI tables > > - the QEMU machine type > > > Yes, I remember there was a discussion regarding that. The point is to > choose a solution to be functional for both PV and HVM *and* to be able > to support a hotplug. IIRC, the xenstore could be a possible candidate. xenstore would be among the easiest to make work. The only downside is the dependency on xenstore which otherwise virtio+grants doesn't have. Vikram is working on virtio with grants support in QEMU as we speak. Maybe we could find a way to add a flag in QEMU so that we can detect at runtime if a given virtio device support grants or not. > > Or at least should we change CONFIG_XEN_VIRTIO_FORCE_GRANT into a > > command line parameter so that it can be disabled in cases like this > > one? > > IIUC, this will help with HVM only. For sure this is the least attractive solution, only marginally better than the fix proposed in this patch > > I realize that fixing this problem properly takes a lot longer than > > adding a trivial if (dom0) return; check in the code. If you cannot find > > a good way to solve the problem or you don't have time to do that now > > and you need this bug fixed quickly, then I would be OK with the if > > (dom0) return; check but please add a detailed TODO in-code comment to > > explain that this is just a hack and we are still looking for a real > > solution. > > > > The check itself I prefer the original position because I want to retain > > the ability of using virtio frontends with grant on ARM in Dom0 (DomD > > case). > > Makes sense, agree.
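A rough sketch of the "command line parameter" idea mentioned above, assuming a boot parameter that overrides the CONFIG_XEN_VIRTIO_FORCE_GRANT default; the parameter name and the variable are hypothetical, not an existing interface:

```c
#include <linux/init.h>
#include <linux/kernel.h>

/* Hypothetical boot parameter, e.g. "xen_virtio_force_grant=off", letting the
 * Kconfig default be overridden at runtime on setups like nested KVM. */
static bool xen_virtio_force_grant = IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT);

static int __init parse_xen_virtio_force_grant(char *arg)
{
	return kstrtobool(arg, &xen_virtio_force_grant);
}
early_param("xen_virtio_force_grant", parse_xen_virtio_force_grant);

/* xen_grant_init_backend_domid() would then test xen_virtio_force_grant
 * instead of IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT). */
```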
On 30.06.23 00:44, Stefano Stabellini wrote: > On Thu, 29 Jun 2023, Oleksandr Tyshchenko wrote: >> On 29.06.23 04:00, Stefano Stabellini wrote: >> >> Hello Stefano >> >>> On Wed, 21 Jun 2023, Oleksandr Tyshchenko wrote: >>>> On 21.06.23 16:12, Petr Pavlu wrote: >>>> >>>> >>>> Hello Petr >>>> >>>> >>>>> When attempting to run Xen on a QEMU/KVM virtual machine with virtio >>>>> devices (all x86_64), dom0 tries to establish a grant for itself which >>>>> eventually results in a hang during the boot. >>>>> >>>>> The backtrace looks as follows, the while loop in __send_control_msg() >>>>> makes no progress: >>>>> >>>>> #0 virtqueue_get_buf_ctx (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94, ctx=ctx@entry=0x0 <fixed_percpu_data>) at ../drivers/virtio/virtio_ring.c:2326 >>>>> #1 0xffffffff817086b7 in virtqueue_get_buf (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94) at ../drivers/virtio/virtio_ring.c:2333 >>>>> #2 0xffffffff8175f6b2 in __send_control_msg (portdev=<optimized out>, port_id=0xffffffff, event=0x0, value=0x1) at ../drivers/char/virtio_console.c:562 >>>>> #3 0xffffffff8175f6ee in __send_control_msg (portdev=<optimized out>, port_id=<optimized out>, event=<optimized out>, value=<optimized out>) at ../drivers/char/virtio_console.c:569 >>>>> #4 0xffffffff817618b1 in virtcons_probe (vdev=0xffff88800585e800) at ../drivers/char/virtio_console.c:2098 >>>>> #5 0xffffffff81707117 in virtio_dev_probe (_d=0xffff88800585e810) at ../drivers/virtio/virtio.c:305 >>>>> #6 0xffffffff8198e348 in call_driver_probe (drv=0xffffffff82be40c0 <virtio_console>, drv=0xffffffff82be40c0 <virtio_console>, dev=0xffff88800585e810) at ../drivers/base/dd.c:579 >>>>> #7 really_probe (dev=dev@entry=0xffff88800585e810, drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:658 >>>>> #8 0xffffffff8198e58f in __driver_probe_device (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:800 >>>>> #9 0xffffffff8198e65a in driver_probe_device (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:830 >>>>> #10 0xffffffff8198e832 in __driver_attach (dev=0xffff88800585e810, data=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1216 >>>>> #11 0xffffffff8198bfb2 in bus_for_each_dev (bus=<optimized out>, start=start@entry=0x0 <fixed_percpu_data>, data=data@entry=0xffffffff82be40c0 <virtio_console>, >>>>> fn=fn@entry=0xffffffff8198e7b0 <__driver_attach>) at ../drivers/base/bus.c:368 >>>>> #12 0xffffffff8198db65 in driver_attach (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1233 >>>>> #13 0xffffffff8198d207 in bus_add_driver (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/bus.c:673 >>>>> #14 0xffffffff8198f550 in driver_register (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/driver.c:246 >>>>> #15 0xffffffff81706b47 in register_virtio_driver (driver=driver@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/virtio/virtio.c:357 >>>>> #16 0xffffffff832cd34b in virtio_console_init () at ../drivers/char/virtio_console.c:2258 >>>>> #17 0xffffffff8100105c in do_one_initcall (fn=0xffffffff832cd2e0 <virtio_console_init>) at ../init/main.c:1246 >>>>> #18 0xffffffff83277293 in do_initcall_level (command_line=0xffff888003e2f900 "root", level=0x6) at ../init/main.c:1319 >>>>> #19 do_initcalls () at ../init/main.c:1335 >>>>> #20 do_basic_setup () at ../init/main.c:1354 >>>>> #21 
kernel_init_freeable () at ../init/main.c:1571 >>>>> #22 0xffffffff81f64be1 in kernel_init (unused=<optimized out>) at ../init/main.c:1462 >>>>> #23 0xffffffff81001f49 in ret_from_fork () at ../arch/x86/entry/entry_64.S:308 >>>>> #24 0x0000000000000000 in ?? () >>>>> >>>>> Fix the problem by preventing xen_grant_init_backend_domid() from >>>>> setting dom0 as a backend when running in dom0. >>>>> >>>>> Fixes: 035e3a4321f7 ("xen/virtio: Optimize the setup of "xen-grant-dma" devices") >>>> >>>> >>>> I am not 100% sure whether the Fixes tag points to precise commit. If I >>>> am not mistaken, the said commit just moves the code in the context >>>> without changing the logic of CONFIG_XEN_VIRTIO_FORCE_GRANT, this was >>>> introduced before. >>>> >>>> >>>>> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> >>>>> --- >>>>> drivers/xen/grant-dma-ops.c | 4 +++- >>>>> 1 file changed, 3 insertions(+), 1 deletion(-) >>>>> >>>>> diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c >>>>> index 76f6f26265a3..29ed27ac450e 100644 >>>>> --- a/drivers/xen/grant-dma-ops.c >>>>> +++ b/drivers/xen/grant-dma-ops.c >>>>> @@ -362,7 +362,9 @@ static int xen_grant_init_backend_domid(struct device *dev, >>>>> if (np) { >>>>> ret = xen_dt_grant_init_backend_domid(dev, np, backend_domid); >>>>> of_node_put(np); >>>>> - } else if (IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || xen_pv_domain()) { >>>>> + } else if ((IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || >>>>> + xen_pv_domain()) && >>>>> + !xen_initial_domain()) { >>>> >>>> The commit lgtm, just one note: >>>> >>>> >>>> I would even bail out early in xen_virtio_restricted_mem_acc() instead, >>>> as I assume the same issue could happen on Arm with DT (although there >>>> we don't guess the backend's domid, we read it from DT and quite >>>> unlikely we get Dom0 being in Dom0 with correct DT). >>>> >>>> Something like: >>>> >>>> @@ -416,6 +421,10 @@ bool xen_virtio_restricted_mem_acc(struct >>>> virtio_device *dev) >>>> { >>>> domid_t backend_domid; >>>> >>>> + /* Xen grant DMA ops are not used when running as initial domain */ >>>> + if (xen_initial_domain()) >>>> + return false; >>>> + >>>> if (!xen_grant_init_backend_domid(dev->dev.parent, >>>> &backend_domid)) { >>>> xen_grant_setup_dma_ops(dev->dev.parent, backend_domid); >>>> return true; >>>> (END) >>>> >>>> >>>> >>>> If so, that commit subject would need to be updated accordingly. >>>> >>>> Let's see what other reviewers will say. >>> >>> This doesn't work in all cases. Imagine using PCI Passthrough to assign >>> a "physical" virtio device to a domU. The domU will run into the same >>> error, right? >>> >>> The problem is that we need a way for the virtio backend to advertise >>> its ability of handling grants. Right now we only have a way to do with >>> that with device tree on ARM. On x86, we only have >>> CONFIG_XEN_VIRTIO_FORCE_GRANT, and if we take >>> CONFIG_XEN_VIRTIO_FORCE_GRANT at face value, it also enables grants for >>> "physical" virtio devices. Note that in this case we are fixing a >>> nested-virtualization bug, but there are actually physical >>> virtio-compatible devices out there. CONFIG_XEN_VIRTIO_FORCE_GRANT will >>> break those too. >> >> >> If these "physical" virtio devices are also spawned by >> drivers/virtio/virtio.c:virtio_dev_probe(), then yes, otherwise I don't >> see how this could even be possible, but I might miss something here. 
> > Yes, I would imagine virtio_dev_probe() would be called for them too > > > >> xen_virtio_restricted_mem_acc() gets called indirectly from >> virtio_dev_probe()->virtio_features_ok()-> >> virtio_check_mem_acc_cb(). So the Xen grant DMA ops are only installed >> for those. >> >> >>> >>> I think we need to add a second way? It could be anything that can help >>> us distinguish between a non-grants-capable virtio backend and a >>> grants-capable virtio backend, such as: >>> - a string on xenstore >>> - a xen param >>> - a special PCI configuration register value >>> - something in the ACPI tables >>> - the QEMU machine type >> >> >> Yes, I remember there was a discussion regarding that. The point is to >> choose a solution to be functional for both PV and HVM *and* to be able >> to support a hotplug. IIRC, the xenstore could be a possible candidate. > > xenstore would be among the easiest to make work. The only downside is > the dependency on xenstore which otherwise virtio+grants doesn't have. I'm in favor of using xenstore. Especially the hotplug support would be much easier using xenstore (the alternative would be ACPI, which I don't think we want for PV guests or many Arm configurations). Juergen
On Thu, Jun 29, 2023 at 03:44:04PM -0700, Stefano Stabellini wrote: > On Thu, 29 Jun 2023, Oleksandr Tyshchenko wrote: > > On 29.06.23 04:00, Stefano Stabellini wrote: > > > I think we need to add a second way? It could be anything that can help > > > us distinguish between a non-grants-capable virtio backend and a > > > grants-capable virtio backend, such as: > > > - a string on xenstore > > > - a xen param > > > - a special PCI configuration register value > > > - something in the ACPI tables > > > - the QEMU machine type > > > > > > Yes, I remember there was a discussion regarding that. The point is to > > choose a solution to be functional for both PV and HVM *and* to be able > > to support a hotplug. IIRC, the xenstore could be a possible candidate. > > xenstore would be among the easiest to make work. The only downside is > the dependency on xenstore which otherwise virtio+grants doesn't have. I would avoid introducing a dependency on xenstore, if nothing else we know it's a performance bottleneck. We would also need to map the virtio device topology into xenstore, so that we can pass different options for each device. > Vikram is working on virtio with grants support in QEMU as we speak. > Maybe we could find a way to add a flag in QEMU so that we can detect at > runtime if a given virtio device support grants or not. Isn't there a way for the device to expose capabilities already? For example how does a virtio-blk backend expose support for indirect descriptors? Thanks, Roger.
On 04.07.23 09:48, Roger Pau Monné wrote: > On Thu, Jun 29, 2023 at 03:44:04PM -0700, Stefano Stabellini wrote: >> On Thu, 29 Jun 2023, Oleksandr Tyshchenko wrote: >>> On 29.06.23 04:00, Stefano Stabellini wrote: >>>> I think we need to add a second way? It could be anything that can help >>>> us distinguish between a non-grants-capable virtio backend and a >>>> grants-capable virtio backend, such as: >>>> - a string on xenstore >>>> - a xen param >>>> - a special PCI configuration register value >>>> - something in the ACPI tables >>>> - the QEMU machine type >>> >>> >>> Yes, I remember there was a discussion regarding that. The point is to >>> choose a solution to be functional for both PV and HVM *and* to be able >>> to support a hotplug. IIRC, the xenstore could be a possible candidate. >> >> xenstore would be among the easiest to make work. The only downside is >> the dependency on xenstore which otherwise virtio+grants doesn't have. > > I would avoid introducing a dependency on xenstore, if nothing else we > know it's a performance bottleneck. > > We would also need to map the virtio device topology into xenstore, so > that we can pass different options for each device. This aspect (different options) is important. How do you want to pass virtio device configuration parameters from dom0 to the virtio backend domain? You probably need something like Xenstore (a virtio based alternative like virtiofs would work, too) for that purpose. Mapping the topology should be rather easy via the PCI-Id, e.g.: /local/domain/42/device/virtio/0000:00:1c.0/backend > >> Vikram is working on virtio with grants support in QEMU as we speak. >> Maybe we could find a way to add a flag in QEMU so that we can detect at >> runtime if a given virtio device support grants or not. > > Isn't there a way for the device to expose capabilities already? For > example how does a virtio-blk backend expose support for indirect > descriptors? Those capabilities are defined in the virtio spec [1]. Adding the backend domid would be possible, but it probably wouldn't be that easy (requires changing the virtio spec by either expanding an existing config area or by adding a new one). I'm not sure handling in the specific frontends is generic enough for being able to have a central place where the backend domid could be retrieved, without requiring any change of the frontends. Juergen [1]: http://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.html
Hi, FWIW, I have ran into this issue some time ago too. I run Xen on top of KVM and then passthrough some of the virtio devices (network one specifically) into a (PV) guest. So, I hit both cases, the dom0 one and domU one. As a temporary workaround I needed to disable CONFIG_XEN_VIRTIO completely (just disabling CONFIG_XEN_VIRTIO_FORCE_GRANT was not enough to fix it). With that context in place, the actual response below. On Tue, Jul 04, 2023 at 12:39:40PM +0200, Juergen Gross wrote: > On 04.07.23 09:48, Roger Pau Monné wrote: > > On Thu, Jun 29, 2023 at 03:44:04PM -0700, Stefano Stabellini wrote: > > > On Thu, 29 Jun 2023, Oleksandr Tyshchenko wrote: > > > > On 29.06.23 04:00, Stefano Stabellini wrote: > > > > > I think we need to add a second way? It could be anything that can help > > > > > us distinguish between a non-grants-capable virtio backend and a > > > > > grants-capable virtio backend, such as: > > > > > - a string on xenstore > > > > > - a xen param > > > > > - a special PCI configuration register value > > > > > - something in the ACPI tables > > > > > - the QEMU machine type > > > > > > > > > > > > Yes, I remember there was a discussion regarding that. The point is to > > > > choose a solution to be functional for both PV and HVM *and* to be able > > > > to support a hotplug. IIRC, the xenstore could be a possible candidate. > > > > > > xenstore would be among the easiest to make work. The only downside is > > > the dependency on xenstore which otherwise virtio+grants doesn't have. > > > > I would avoid introducing a dependency on xenstore, if nothing else we > > know it's a performance bottleneck. > > > > We would also need to map the virtio device topology into xenstore, so > > that we can pass different options for each device. > > This aspect (different options) is important. How do you want to pass virtio > device configuration parameters from dom0 to the virtio backend domain? You > probably need something like Xenstore (a virtio based alternative like virtiofs > would work, too) for that purpose. > > Mapping the topology should be rather easy via the PCI-Id, e.g.: > > /local/domain/42/device/virtio/0000:00:1c.0/backend While I agree this would probably be the simplest to implement, I don't like introducing xenstore dependency into virtio frontend either. Toolstack -> backend communication is probably easier to solve, as it's much more flexible (could use qemu cmdline, QMP, other similar mechanisms for non-qemu backends etc). > > > Vikram is working on virtio with grants support in QEMU as we speak. > > > Maybe we could find a way to add a flag in QEMU so that we can detect at > > > runtime if a given virtio device support grants or not. > > > > Isn't there a way for the device to expose capabilities already? For > > example how does a virtio-blk backend expose support for indirect > > descriptors? > > Those capabilities are defined in the virtio spec [1]. Adding the backend > domid would be possible, but it probably wouldn't be that easy (requires > changing the virtio spec by either expanding an existing config area or by > adding a new one). I'm not sure handling in the specific frontends is > generic enough for being able to have a central place where the backend > domid could be retrieved, without requiring any change of the frontends. IMHO the proper solution is to extend the spec. I don't have much experience with virtio code, but reading the spec it looks like new config area will be better for compatibility/uniform handling in a frontent-agnostic way. 
Since it will definitely take time, some transitional solution (maybe even xenstore...) might be warranted. > Juergen > > [1]: http://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.html
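Purely as an illustration of the "new config area" direction (nothing like this exists in the virtio spec today), such an extension could take the shape of an additional virtio PCI capability advertising grant support and the backend domid; the cfg_type value, structure name and fields below are invented:

```c
#include <linux/types.h>

#define VIRTIO_PCI_CAP_XEN_GRANT_CFG	10	/* hypothetical cfg_type */

/* Hypothetical capability a grants-capable backend could expose in the
 * device's PCI config space, discoverable without xenstore. */
struct virtio_pci_xen_grant_cap {
	__u8	cap_vndr;		/* PCI_CAP_ID_VNDR */
	__u8	cap_next;		/* next capability pointer */
	__u8	cap_len;		/* length of this structure */
	__u8	cfg_type;		/* VIRTIO_PCI_CAP_XEN_GRANT_CFG */
	__u8	grants_supported;	/* backend can handle grant refs */
	__u8	padding;
	__le16	backend_domid;		/* domid serving this device */
};
```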
On Tue, Jul 04, 2023 at 01:43:46PM +0200, Marek Marczykowski-Górecki wrote: > Hi, > > FWIW, I have ran into this issue some time ago too. I run Xen on top of > KVM and then passthrough some of the virtio devices (network one > specifically) into a (PV) guest. So, I hit both cases, the dom0 one and > domU one. As a temporary workaround I needed to disable > CONFIG_XEN_VIRTIO completely (just disabling > CONFIG_XEN_VIRTIO_FORCE_GRANT was not enough to fix it). > With that context in place, the actual response below. > > On Tue, Jul 04, 2023 at 12:39:40PM +0200, Juergen Gross wrote: > > On 04.07.23 09:48, Roger Pau Monné wrote: > > > On Thu, Jun 29, 2023 at 03:44:04PM -0700, Stefano Stabellini wrote: > > > > On Thu, 29 Jun 2023, Oleksandr Tyshchenko wrote: > > > > > On 29.06.23 04:00, Stefano Stabellini wrote: > > > > > > I think we need to add a second way? It could be anything that can help > > > > > > us distinguish between a non-grants-capable virtio backend and a > > > > > > grants-capable virtio backend, such as: > > > > > > - a string on xenstore > > > > > > - a xen param > > > > > > - a special PCI configuration register value > > > > > > - something in the ACPI tables > > > > > > - the QEMU machine type > > > > > > > > > > > > > > > Yes, I remember there was a discussion regarding that. The point is to > > > > > choose a solution to be functional for both PV and HVM *and* to be able > > > > > to support a hotplug. IIRC, the xenstore could be a possible candidate. > > > > > > > > xenstore would be among the easiest to make work. The only downside is > > > > the dependency on xenstore which otherwise virtio+grants doesn't have. > > > > > > I would avoid introducing a dependency on xenstore, if nothing else we > > > know it's a performance bottleneck. > > > > > > We would also need to map the virtio device topology into xenstore, so > > > that we can pass different options for each device. > > > > This aspect (different options) is important. How do you want to pass virtio > > device configuration parameters from dom0 to the virtio backend domain? You > > probably need something like Xenstore (a virtio based alternative like virtiofs > > would work, too) for that purpose. > > > > Mapping the topology should be rather easy via the PCI-Id, e.g.: > > > > /local/domain/42/device/virtio/0000:00:1c.0/backend > > While I agree this would probably be the simplest to implement, I don't > like introducing xenstore dependency into virtio frontend either. > Toolstack -> backend communication is probably easier to solve, as it's > much more flexible (could use qemu cmdline, QMP, other similar > mechanisms for non-qemu backends etc). I also think features should be exposed uniformly for devices, it's at least weird to have certain features exposed in the PCI config space while other features exposed in xenstore. For virtio-mmio this might get a bit confusing, are we going to add xenstore entries based on the position of the device config mmio region? I think on Arm PCI enumeration is not (usually?) done by the firmware, at which point the SBDF expected by the tools/backend might be different than the value assigned by the guest OS. I think there are two slightly different issues, one is how to pass information to virtio backends, I think doing this initially based on xenstore is not that bad, because it's an internal detail of the backend implementation. 
However passing information to virtio frontends using xenstore is IMO a bad idea, there's already a way to negotiate features between virtio frontends and backends, and Xen should just expand and use that. > > > > Vikram is working on virtio with grants support in QEMU as we speak. > > > > Maybe we could find a way to add a flag in QEMU so that we can detect at > > > > runtime if a given virtio device support grants or not. > > > > > > Isn't there a way for the device to expose capabilities already? For > > > example how does a virtio-blk backend expose support for indirect > > > descriptors? > > > > Those capabilities are defined in the virtio spec [1]. Adding the backend > > domid would be possible, but it probably wouldn't be that easy (requires > > changing the virtio spec by either expanding an existing config area or by > > adding a new one). I'm not sure handling in the specific frontends is > > generic enough for being able to have a central place where the backend > > domid could be retrieved, without requiring any change of the frontends. > > IMHO the proper solution is to extend the spec. I don't have much > experience with virtio code, but reading the spec it looks like new > config area will be better for compatibility/uniform handling in a > frontent-agnostic way. Since it will definitely take time, some > transitional solution (maybe even xenstore...) might be warranted. My fear is that such 'transitional' solutions end up being definitive ones, as people move to other stuff once things work. Might be better to implement a transitional approach based on a future virtio specification change rather than something out of band based on a completely different communication channel. Thanks, Roger.
On Tue, Jul 4, 2023 at 5:49 PM Roger Pau Monné <roger.pau@citrix.com> wrote: Hello all. [sorry for the possible format issues] On Tue, Jul 04, 2023 at 01:43:46PM +0200, Marek Marczykowski-Górecki wrote: > > Hi, > > > > FWIW, I have ran into this issue some time ago too. I run Xen on top of > > KVM and then passthrough some of the virtio devices (network one > > specifically) into a (PV) guest. So, I hit both cases, the dom0 one and > > domU one. As a temporary workaround I needed to disable > > CONFIG_XEN_VIRTIO completely (just disabling > > CONFIG_XEN_VIRTIO_FORCE_GRANT was not enough to fix it). > > With that context in place, the actual response below. > > > > On Tue, Jul 04, 2023 at 12:39:40PM +0200, Juergen Gross wrote: > > > On 04.07.23 09:48, Roger Pau Monné wrote: > > > > On Thu, Jun 29, 2023 at 03:44:04PM -0700, Stefano Stabellini wrote: > > > > > On Thu, 29 Jun 2023, Oleksandr Tyshchenko wrote: > > > > > > On 29.06.23 04:00, Stefano Stabellini wrote: > > > > > > > I think we need to add a second way? It could be anything that > can help > > > > > > > us distinguish between a non-grants-capable virtio backend and > a > > > > > > > grants-capable virtio backend, such as: > > > > > > > - a string on xenstore > > > > > > > - a xen param > > > > > > > - a special PCI configuration register value > > > > > > > - something in the ACPI tables > > > > > > > - the QEMU machine type > > > > > > > > > > > > > > > > > > Yes, I remember there was a discussion regarding that. The point > is to > > > > > > choose a solution to be functional for both PV and HVM *and* to > be able > > > > > > to support a hotplug. IIRC, the xenstore could be a possible > candidate. > > > > > > > > > > xenstore would be among the easiest to make work. The only > downside is > > > > > the dependency on xenstore which otherwise virtio+grants doesn't > have. > > > > > > > > I would avoid introducing a dependency on xenstore, if nothing else > we > > > > know it's a performance bottleneck. > > > > > > > > We would also need to map the virtio device topology into xenstore, > so > > > > that we can pass different options for each device. > > > > > > This aspect (different options) is important. How do you want to pass > virtio > > > device configuration parameters from dom0 to the virtio backend > domain? You > > > probably need something like Xenstore (a virtio based alternative like > virtiofs > > > would work, too) for that purpose. > > > > > > Mapping the topology should be rather easy via the PCI-Id, e.g.: > > > > > > /local/domain/42/device/virtio/0000:00:1c.0/backend > > > > While I agree this would probably be the simplest to implement, I don't > > like introducing xenstore dependency into virtio frontend either. > > Toolstack -> backend communication is probably easier to solve, as it's > > much more flexible (could use qemu cmdline, QMP, other similar > > mechanisms for non-qemu backends etc). > > I also think features should be exposed uniformly for devices, it's at > least weird to have certain features exposed in the PCI config space > while other features exposed in xenstore. > > For virtio-mmio this might get a bit confusing, are we going to add > xenstore entries based on the position of the device config mmio > region? > > I think on Arm PCI enumeration is not (usually?) done by the firmware, > at which point the SBDF expected by the tools/backend might be > different than the value assigned by the guest OS. 
> > I think there are two slightly different issues, one is how to pass > information to virtio backends, I think doing this initially based on > xenstore is not that bad, because it's an internal detail of the > backend implementation. However passing information to virtio > frontends using xenstore is IMO a bad idea, there's already a way to > negotiate features between virtio frontends and backends, and Xen > should just expand and use that. > >

On Arm with device-tree we have special bindings whose purpose is to inform us whether we need to use grants for virtio and what the backend domid is for a particular device. Here on x86, we don't have a device tree, so we cannot (easily?) reuse this logic.

I have just recollected one idea suggested by Stefano some time ago [1]. The context of that discussion was what to do when device-tree and ACPI cannot be reused (or something like that). The idea won't cover virtio-mmio, but I have heard that virtio-mmio usage with x86 Xen is a rather unusual case.

I will paste the text below for convenience.

**********

Part 1 (intro):

We could reuse a PCI config space register to expose the backend id. However this solution requires a backend change (QEMU) to expose the backend id via an emulated register for each emulated device.

To avoid having to introduce a special config space register in all emulated PCI devices (virtio-net, virtio-block, etc) I wonder if we could add a special PCI config space register at the emulated PCI Root Complex level.

Basically the workflow would be as follows:

- Linux recognizes the PCI Root Complex as a Xen PCI Root Complex
- Linux writes to a special PCI config space register of the Xen PCI Root Complex the PCI device id (basically the BDF)
- The Xen PCI Root Complex emulated by Xen answers by writing back to the same location the backend id (domid of the backend)
- Linux reads back the same PCI config space register of the Xen PCI Root Complex and learns the relevant domid

Part 2 (clarification):

I think using a special config space register in the root complex would not be terrible in terms of guest changes because it is easy to introduce a new root complex driver in Linux and other OSes. The root complex would still be ECAM compatible so the regular ECAM driver would still work. A new driver would only be necessary if you want to be able to access the special config space register.

**********

What do you think about it? Are there any pitfalls, etc? This also requires system changes, but at least without virtio spec changes.

[1] https://lore.kernel.org/xen-devel/alpine.DEB.2.22.394.2210061747590.3690179@ubuntu-linux-20-04-desktop/
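As a purely hypothetical illustration of the frontend side of the workflow pasted above: the register offset, its semantics and the assumption that the Xen root complex is visible at 00:00.0 are all invented here; only the write-BDF / Xen-answers / read-back flow comes from the text. All PCI helpers used are existing kernel APIs.

#include <linux/pci.h>

#define XEN_RC_BACKEND_QUERY_REG	0x40	/* assumed offset above the standard 0x3F header */

static int example_rc_query_backend_domid(struct pci_dev *vdev, u16 *backend)
{
	struct pci_dev *rc;
	u32 val;
	int ret;

	/* Assume the Xen PCI Root Complex is the host bridge at 00:00.0. */
	rc = pci_get_domain_bus_and_slot(pci_domain_nr(vdev->bus), 0,
					 PCI_DEVFN(0, 0));
	if (!rc)
		return -ENODEV;

	/* Step 1: tell the root complex which device we are asking about. */
	ret = pci_write_config_dword(rc, XEN_RC_BACKEND_QUERY_REG,
				     pci_dev_id(vdev));
	if (ret)
		goto out;

	/* Steps 2/3: Xen answers in the same register; read the domid back. */
	ret = pci_read_config_dword(rc, XEN_RC_BACKEND_QUERY_REG, &val);
	if (!ret)
		*backend = val & 0xffff;
out:
	pci_dev_put(rc);
	return ret;
}

The sketch also shows where the later objections bite: a per-root-complex register implies one Xen root complex per backend domain and a matching PCI hierarchy.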
On 04.07.23 19:14, Oleksandr Tyshchenko wrote: > > > On Tue, Jul 4, 2023 at 5:49 PM Roger Pau Monné <roger.pau@citrix.com > <mailto:roger.pau@citrix.com>> wrote: > > Hello all. > > [sorry for the possible format issues] > > > On Tue, Jul 04, 2023 at 01:43:46PM +0200, Marek Marczykowski-Górecki wrote: > > Hi, > > > > FWIW, I have ran into this issue some time ago too. I run Xen on top of > > KVM and then passthrough some of the virtio devices (network one > > specifically) into a (PV) guest. So, I hit both cases, the dom0 one and > > domU one. As a temporary workaround I needed to disable > > CONFIG_XEN_VIRTIO completely (just disabling > > CONFIG_XEN_VIRTIO_FORCE_GRANT was not enough to fix it). > > With that context in place, the actual response below. > > > > On Tue, Jul 04, 2023 at 12:39:40PM +0200, Juergen Gross wrote: > > > On 04.07.23 09:48, Roger Pau Monné wrote: > > > > On Thu, Jun 29, 2023 at 03:44:04PM -0700, Stefano Stabellini wrote: > > > > > On Thu, 29 Jun 2023, Oleksandr Tyshchenko wrote: > > > > > > On 29.06.23 04:00, Stefano Stabellini wrote: > > > > > > > I think we need to add a second way? It could be anything that > can help > > > > > > > us distinguish between a non-grants-capable virtio backend and a > > > > > > > grants-capable virtio backend, such as: > > > > > > > - a string on xenstore > > > > > > > - a xen param > > > > > > > - a special PCI configuration register value > > > > > > > - something in the ACPI tables > > > > > > > - the QEMU machine type > > > > > > > > > > > > > > > > > > Yes, I remember there was a discussion regarding that. The point > is to > > > > > > choose a solution to be functional for both PV and HVM *and* to > be able > > > > > > to support a hotplug. IIRC, the xenstore could be a possible > candidate. > > > > > > > > > > xenstore would be among the easiest to make work. The only downside is > > > > > the dependency on xenstore which otherwise virtio+grants doesn't have. > > > > > > > > I would avoid introducing a dependency on xenstore, if nothing else we > > > > know it's a performance bottleneck. > > > > > > > > We would also need to map the virtio device topology into xenstore, so > > > > that we can pass different options for each device. > > > > > > This aspect (different options) is important. How do you want to pass > virtio > > > device configuration parameters from dom0 to the virtio backend domain? You > > > probably need something like Xenstore (a virtio based alternative like > virtiofs > > > would work, too) for that purpose. > > > > > > Mapping the topology should be rather easy via the PCI-Id, e.g.: > > > > > > /local/domain/42/device/virtio/0000:00:1c.0/backend > > > > While I agree this would probably be the simplest to implement, I don't > > like introducing xenstore dependency into virtio frontend either. > > Toolstack -> backend communication is probably easier to solve, as it's > > much more flexible (could use qemu cmdline, QMP, other similar > > mechanisms for non-qemu backends etc). > > I also think features should be exposed uniformly for devices, it's at > least weird to have certain features exposed in the PCI config space > while other features exposed in xenstore. > > For virtio-mmio this might get a bit confusing, are we going to add > xenstore entries based on the position of the device config mmio > region? > > I think on Arm PCI enumeration is not (usually?) done by the firmware, > at which point the SBDF expected by the tools/backend might be > different than the value assigned by the guest OS. 
> > I think there are two slightly different issues, one is how to pass > information to virtio backends, I think doing this initially based on > xenstore is not that bad, because it's an internal detail of the > backend implementation. However passing information to virtio > frontends using xenstore is IMO a bad idea, there's already a way to > negotiate features between virtio frontends and backends, and Xen > should just expand and use that. > > > > On Arm with device-tree we have a special bindings which purpose is to inform us > whether we need to use grants for virtio and backend domid for a particular > device.Here on x86, we don't have a device tree, so cannot (easily?) reuse this > logic. > > I have just recollected one idea suggested by Stefano some time ago [1]. The > context of discussion was about what to do when device-tree and ACPI cannot be > reused (or something like that).The idea won't cover virtio-mmio, but I have > heard that virtio-mmio usage with x86 Xen is rather unusual case. > > I will paste the text below for convenience. > > ********** > > Part 1 (intro): > > We could reuse a PCI config space register to expose the backend id. > However this solution requires a backend change (QEMU) to expose the > backend id via an emulated register for each emulated device. > > To avoid having to introduce a special config space register in all > emulated PCI devices (virtio-net, virtio-block, etc) I wonder if we > could add a special PCI config space register at the emulated PCI Root > Complex level. > > Basically the workflow would be as follow: > > - Linux recognizes the PCI Root Complex as a Xen PCI Root Complex > - Linux writes to special PCI config space register of the Xen PCI Root > Complex the PCI device id (basically the BDF) > - The Xen PCI Root Complex emulated by Xen answers by writing back to > the same location the backend id (domid of the backend) > - Linux reads back the same PCI config space register of the Xen PCI > Root Complex and learn the relevant domid > > Part 2 (clarification): > > I think using a special config space register in the root complex would > not be terrible in terms of guest changes because it is easy to > introduce a new root complex driver in Linux and other OSes. The root > complex would still be ECAM compatible so the regular ECAM driver would > still work. A new driver would only be necessary if you want to be able > to access the special config space register. > > > ********** > What do you think about it? Are there any pitfalls, etc? This also requires > system changes, but at least without virtio spec changes. > > [1] > https://lore.kernel.org/xen-devel/alpine.DEB.2.22.394.2210061747590.3690179@ubuntu-linux-20-04-desktop/ <https://lore.kernel.org/xen-devel/alpine.DEB.2.22.394.2210061747590.3690179@ubuntu-linux-20-04-desktop/> Sounds like a good idea. There would be one PCI root per backend domain needed, but that should be possible. Juergen
On Tue, Jul 04, 2023 at 08:14:59PM +0300, Oleksandr Tyshchenko wrote: > On Tue, Jul 4, 2023 at 5:49 PM Roger Pau Monné <roger.pau@citrix.com> wrote: > > Hello all. > > [sorry for the possible format issues] > > > On Tue, Jul 04, 2023 at 01:43:46PM +0200, Marek Marczykowski-Górecki wrote: > > > Hi, > > > > > > FWIW, I have ran into this issue some time ago too. I run Xen on top of > > > KVM and then passthrough some of the virtio devices (network one > > > specifically) into a (PV) guest. So, I hit both cases, the dom0 one and > > > domU one. As a temporary workaround I needed to disable > > > CONFIG_XEN_VIRTIO completely (just disabling > > > CONFIG_XEN_VIRTIO_FORCE_GRANT was not enough to fix it). > > > With that context in place, the actual response below. > > > > > > On Tue, Jul 04, 2023 at 12:39:40PM +0200, Juergen Gross wrote: > > > > On 04.07.23 09:48, Roger Pau Monné wrote: > > > > > On Thu, Jun 29, 2023 at 03:44:04PM -0700, Stefano Stabellini wrote: > > > > > > On Thu, 29 Jun 2023, Oleksandr Tyshchenko wrote: > > > > > > > On 29.06.23 04:00, Stefano Stabellini wrote: > > > > > > > > I think we need to add a second way? It could be anything that > > can help > > > > > > > > us distinguish between a non-grants-capable virtio backend and > > a > > > > > > > > grants-capable virtio backend, such as: > > > > > > > > - a string on xenstore > > > > > > > > - a xen param > > > > > > > > - a special PCI configuration register value > > > > > > > > - something in the ACPI tables > > > > > > > > - the QEMU machine type > > > > > > > > > > > > > > > > > > > > > Yes, I remember there was a discussion regarding that. The point > > is to > > > > > > > choose a solution to be functional for both PV and HVM *and* to > > be able > > > > > > > to support a hotplug. IIRC, the xenstore could be a possible > > candidate. > > > > > > > > > > > > xenstore would be among the easiest to make work. The only > > downside is > > > > > > the dependency on xenstore which otherwise virtio+grants doesn't > > have. > > > > > > > > > > I would avoid introducing a dependency on xenstore, if nothing else > > we > > > > > know it's a performance bottleneck. > > > > > > > > > > We would also need to map the virtio device topology into xenstore, > > so > > > > > that we can pass different options for each device. > > > > > > > > This aspect (different options) is important. How do you want to pass > > virtio > > > > device configuration parameters from dom0 to the virtio backend > > domain? You > > > > probably need something like Xenstore (a virtio based alternative like > > virtiofs > > > > would work, too) for that purpose. > > > > > > > > Mapping the topology should be rather easy via the PCI-Id, e.g.: > > > > > > > > /local/domain/42/device/virtio/0000:00:1c.0/backend > > > > > > While I agree this would probably be the simplest to implement, I don't > > > like introducing xenstore dependency into virtio frontend either. > > > Toolstack -> backend communication is probably easier to solve, as it's > > > much more flexible (could use qemu cmdline, QMP, other similar > > > mechanisms for non-qemu backends etc). > > > > I also think features should be exposed uniformly for devices, it's at > > least weird to have certain features exposed in the PCI config space > > while other features exposed in xenstore. > > > > For virtio-mmio this might get a bit confusing, are we going to add > > xenstore entries based on the position of the device config mmio > > region? > > > > I think on Arm PCI enumeration is not (usually?) 
done by the firmware, > > at which point the SBDF expected by the tools/backend might be > > different than the value assigned by the guest OS. > > > > I think there are two slightly different issues, one is how to pass > > information to virtio backends, I think doing this initially based on > > xenstore is not that bad, because it's an internal detail of the > > backend implementation. However passing information to virtio > > frontends using xenstore is IMO a bad idea, there's already a way to > > negotiate features between virtio frontends and backends, and Xen > > should just expand and use that. > > > > > > On Arm with device-tree we have a special bindings which purpose is to > inform us whether we need to use grants for virtio and backend domid for a > particular device.Here on x86, we don't have a device tree, so cannot > (easily?) reuse this logic. > > I have just recollected one idea suggested by Stefano some time ago [1]. > The context of discussion was about what to do when device-tree and ACPI > cannot be reused (or something like that).The idea won't cover virtio-mmio, > but I have heard that virtio-mmio usage with x86 Xen is rather unusual case. > > I will paste the text below for convenience. > > ********** > > Part 1 (intro): > > We could reuse a PCI config space register to expose the backend id. > However this solution requires a backend change (QEMU) to expose the > backend id via an emulated register for each emulated device. > > To avoid having to introduce a special config space register in all > emulated PCI devices (virtio-net, virtio-block, etc) I wonder if we > could add a special PCI config space register at the emulated PCI Root > Complex level. > > Basically the workflow would be as follow: > > - Linux recognizes the PCI Root Complex as a Xen PCI Root Complex > - Linux writes to special PCI config space register of the Xen PCI Root > Complex the PCI device id (basically the BDF) > - The Xen PCI Root Complex emulated by Xen answers by writing back to > the same location the backend id (domid of the backend) > - Linux reads back the same PCI config space register of the Xen PCI > Root Complex and learn the relevant domid IMO this seems awfully complex. I'm not familiar with the VirtIO spec, but I see there's a Vendor data capability, could we possibly expose Xen-specific information on that capability? > Part 2 (clarification): > > I think using a special config space register in the root complex would > not be terrible in terms of guest changes because it is easy to > introduce a new root complex driver in Linux and other OSes. The root > complex would still be ECAM compatible so the regular ECAM driver would > still work. A new driver would only be necessary if you want to be able > to access the special config space register. I'm slightly worry of this approach, we end up modifying a root complex emulation in order to avoid modifying a PCI device emulation on QEMU, not sure that's a good trade off. Note also that different architectures will likely have different root complex, and so you might need to modify several of them, plus then arrange the PCI layout correctly in order to have the proper hierarchy so that devices belonging to different driver domains are assigned to different bridges. > > > ********** > What do you think about it? Are there any pitfalls, etc? This also requires > system changes, but at least without virtio spec changes. Why are we so reluctant to add spec changes? 
I understand this might take time and effort, but it's the only way IMO to build a sustainable VirtIO Xen implementation. Did we already attempt to negotiate Xen-related spec changes with Oasis, and were those refused? Thanks, Roger.
On Wed, 5 Jul 2023, Roger Pau Monné wrote: > On Tue, Jul 04, 2023 at 08:14:59PM +0300, Oleksandr Tyshchenko wrote: > > On Tue, Jul 4, 2023 at 5:49 PM Roger Pau Monné <roger.pau@citrix.com> wrote: > > > > Hello all. > > > > [sorry for the possible format issues] > > > > > > On Tue, Jul 04, 2023 at 01:43:46PM +0200, Marek Marczykowski-Górecki wrote: > > > > Hi, > > > > > > > > FWIW, I have ran into this issue some time ago too. I run Xen on top of > > > > KVM and then passthrough some of the virtio devices (network one > > > > specifically) into a (PV) guest. So, I hit both cases, the dom0 one and > > > > domU one. As a temporary workaround I needed to disable > > > > CONFIG_XEN_VIRTIO completely (just disabling > > > > CONFIG_XEN_VIRTIO_FORCE_GRANT was not enough to fix it). > > > > With that context in place, the actual response below. > > > > > > > > On Tue, Jul 04, 2023 at 12:39:40PM +0200, Juergen Gross wrote: > > > > > On 04.07.23 09:48, Roger Pau Monné wrote: > > > > > > On Thu, Jun 29, 2023 at 03:44:04PM -0700, Stefano Stabellini wrote: > > > > > > > On Thu, 29 Jun 2023, Oleksandr Tyshchenko wrote: > > > > > > > > On 29.06.23 04:00, Stefano Stabellini wrote: > > > > > > > > > I think we need to add a second way? It could be anything that > > > can help > > > > > > > > > us distinguish between a non-grants-capable virtio backend and > > > a > > > > > > > > > grants-capable virtio backend, such as: > > > > > > > > > - a string on xenstore > > > > > > > > > - a xen param > > > > > > > > > - a special PCI configuration register value > > > > > > > > > - something in the ACPI tables > > > > > > > > > - the QEMU machine type > > > > > > > > > > > > > > > > > > > > > > > > Yes, I remember there was a discussion regarding that. The point > > > is to > > > > > > > > choose a solution to be functional for both PV and HVM *and* to > > > be able > > > > > > > > to support a hotplug. IIRC, the xenstore could be a possible > > > candidate. > > > > > > > > > > > > > > xenstore would be among the easiest to make work. The only > > > downside is > > > > > > > the dependency on xenstore which otherwise virtio+grants doesn't > > > have. > > > > > > > > > > > > I would avoid introducing a dependency on xenstore, if nothing else > > > we > > > > > > know it's a performance bottleneck. > > > > > > > > > > > > We would also need to map the virtio device topology into xenstore, > > > so > > > > > > that we can pass different options for each device. > > > > > > > > > > This aspect (different options) is important. How do you want to pass > > > virtio > > > > > device configuration parameters from dom0 to the virtio backend > > > domain? You > > > > > probably need something like Xenstore (a virtio based alternative like > > > virtiofs > > > > > would work, too) for that purpose. > > > > > > > > > > Mapping the topology should be rather easy via the PCI-Id, e.g.: > > > > > > > > > > /local/domain/42/device/virtio/0000:00:1c.0/backend > > > > > > > > While I agree this would probably be the simplest to implement, I don't > > > > like introducing xenstore dependency into virtio frontend either. > > > > Toolstack -> backend communication is probably easier to solve, as it's > > > > much more flexible (could use qemu cmdline, QMP, other similar > > > > mechanisms for non-qemu backends etc). > > > > > > I also think features should be exposed uniformly for devices, it's at > > > least weird to have certain features exposed in the PCI config space > > > while other features exposed in xenstore. 
> > > > > > For virtio-mmio this might get a bit confusing, are we going to add > > > xenstore entries based on the position of the device config mmio > > > region? > > > > > > I think on Arm PCI enumeration is not (usually?) done by the firmware, > > > at which point the SBDF expected by the tools/backend might be > > > different than the value assigned by the guest OS. > > > > > > I think there are two slightly different issues, one is how to pass > > > information to virtio backends, I think doing this initially based on > > > xenstore is not that bad, because it's an internal detail of the > > > backend implementation. However passing information to virtio > > > frontends using xenstore is IMO a bad idea, there's already a way to > > > negotiate features between virtio frontends and backends, and Xen > > > should just expand and use that. > > > > > > > > > > On Arm with device-tree we have a special bindings which purpose is to > > inform us whether we need to use grants for virtio and backend domid for a > > particular device.Here on x86, we don't have a device tree, so cannot > > (easily?) reuse this logic. > > > > I have just recollected one idea suggested by Stefano some time ago [1]. > > The context of discussion was about what to do when device-tree and ACPI > > cannot be reused (or something like that).The idea won't cover virtio-mmio, > > but I have heard that virtio-mmio usage with x86 Xen is rather unusual case. > > > > I will paste the text below for convenience. Thanks Oleksandr! I had forgotten about this, but in retrospect it was a good suggestion :-D > > ********** > > > > Part 1 (intro): > > > > We could reuse a PCI config space register to expose the backend id. > > However this solution requires a backend change (QEMU) to expose the > > backend id via an emulated register for each emulated device. > > > > To avoid having to introduce a special config space register in all > > emulated PCI devices (virtio-net, virtio-block, etc) I wonder if we > > could add a special PCI config space register at the emulated PCI Root > > Complex level. > > > > Basically the workflow would be as follow: > > > > - Linux recognizes the PCI Root Complex as a Xen PCI Root Complex > > - Linux writes to special PCI config space register of the Xen PCI Root > > Complex the PCI device id (basically the BDF) > > - The Xen PCI Root Complex emulated by Xen answers by writing back to > > the same location the backend id (domid of the backend) > > - Linux reads back the same PCI config space register of the Xen PCI > > Root Complex and learn the relevant domid > > IMO this seems awfully complex. I'm not familiar with the VirtIO > spec, but I see there's a Vendor data capability, could we possibly > expose Xen-specific information on that capability? That is also a possibility too. Also we could use a PCI conf register which is known to be unused in the Virtio spec to expose the grant capability and backend domid. > > Part 2 (clarification): > > > > I think using a special config space register in the root complex would > > not be terrible in terms of guest changes because it is easy to > > introduce a new root complex driver in Linux and other OSes. The root > > complex would still be ECAM compatible so the regular ECAM driver would > > still work. A new driver would only be necessary if you want to be able > > to access the special config space register. 
> > I'm slightly worry of this approach, we end up modifying a root > complex emulation in order to avoid modifying a PCI device emulation > on QEMU, not sure that's a good trade off. > > Note also that different architectures will likely have different root > complex, and so you might need to modify several of them, plus then > arrange the PCI layout correctly in order to have the proper hierarchy > so that devices belonging to different driver domains are assigned to > different bridges.

I do think that adding something in PCI config space somewhere is the best option because it is not dependent on ACPI and it is not dependent on xenstore, both of which are very undesirable. I am not sure where specifically is the best place. These are 4 ideas we came up with:

1. PCI root complex
2. a register on the device itself
3. a new capability of the device
4. add one extra dummy PCI device for the sole purpose of exposing the grants capability

Looking at the spec, there is a way to add a vendor-specific capability (cap_vndr = 0x9). Could we use that? It doesn't look like it is used today, Linux doesn't parse it.

> > > > > > ********** > > What do you think about it? Are there any pitfalls, etc? This also requires > > system changes, but at least without virtio spec changes. > > Why are we so reluctant to add spec changes? I understand this might > take time an effort, but it's the only way IMO to build a sustainable > VirtIO Xen implementation. Did we already attempt to negotiate with > Oasis Xen related spec changes and those where refused?

That's because spec changes can be very slow. This is a bug that we need a relatively quick solution for, and waiting 12-24 months for a spec update is not realistic.

I think a spec change would be best as a long-term solution. We also need a short-term solution. The short-term solution doesn't have to be ideal, but it has to work now.
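To make option 3 from the list above concrete, a frontend could walk the device's vendor-specific PCI capabilities, roughly the way a virtio transport already walks its standard capabilities. This is a sketch only: the XEN_VIRTIO_CFG_TYPE value and the placement of the domid field are invented, and the real layout would have to be agreed with QEMU and, ideally, the virtio spec. PCI_CAP_ID_VNDR (0x09) and the pci_find_*capability() helpers are existing kernel definitions.

#include <linux/pci.h>

#define XEN_VIRTIO_CFG_TYPE	0x0e	/* assumed cfg_type, not defined in the virtio spec */

static int example_find_xen_backend_cap(struct pci_dev *dev, u16 *backend)
{
	u8 pos, cfg_type;
	u16 val;

	for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR);
	     pos;
	     pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_VNDR)) {
		/* Byte 3 of a virtio PCI capability holds its cfg_type. */
		pci_read_config_byte(dev, pos + 3, &cfg_type);
		if (cfg_type != XEN_VIRTIO_CFG_TYPE)
			continue;

		/* Assume a 16-bit backend domid right after the cfg_type. */
		pci_read_config_word(dev, pos + 4, &val);
		*backend = val;
		return 0;
	}

	return -ENODEV;
}

Unlike a fixed register offset, a capability found by walking the list avoids the problem, raised below, that nothing outside the first 0x40 bytes of config space has a stable location.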
On Wed, Jul 05, 2023 at 03:41:10PM -0700, Stefano Stabellini wrote: > On Wed, 5 Jul 2023, Roger Pau Monné wrote: > > On Tue, Jul 04, 2023 at 08:14:59PM +0300, Oleksandr Tyshchenko wrote: > > > ********** > > > > > > Part 1 (intro): > > > > > > We could reuse a PCI config space register to expose the backend id. > > > However this solution requires a backend change (QEMU) to expose the > > > backend id via an emulated register for each emulated device. > > > > > > To avoid having to introduce a special config space register in all > > > emulated PCI devices (virtio-net, virtio-block, etc) I wonder if we > > > could add a special PCI config space register at the emulated PCI Root > > > Complex level. > > > > > > Basically the workflow would be as follow: > > > > > > - Linux recognizes the PCI Root Complex as a Xen PCI Root Complex > > > - Linux writes to special PCI config space register of the Xen PCI Root > > > Complex the PCI device id (basically the BDF) > > > - The Xen PCI Root Complex emulated by Xen answers by writing back to > > > the same location the backend id (domid of the backend) > > > - Linux reads back the same PCI config space register of the Xen PCI > > > Root Complex and learn the relevant domid > > > > IMO this seems awfully complex. I'm not familiar with the VirtIO > > spec, but I see there's a Vendor data capability, could we possibly > > expose Xen-specific information on that capability? > > That is also a possibility too. Also we could use a PCI conf register > which is known to be unused in the Virtio spec to expose the grant > capability and backend domid. Capabilities don't have a fixed config space register, they are a linked list, and so capabilities end up at different positions depending on the specific device layout. The only fixed part is the range from [0, 0x3F), and that's fully defined in the specification. Trying to define a fixed address for Xen use after the 3f boundary seems like a bad idea, as it's going to be hard to make sure that such address is not used on all possible devices. IMO the only way is to place such information in a capability, whether that's an existing capability or a new one I don't really know. > > > > Part 2 (clarification): > > > > > > I think using a special config space register in the root complex would > > > not be terrible in terms of guest changes because it is easy to > > > introduce a new root complex driver in Linux and other OSes. The root > > > complex would still be ECAM compatible so the regular ECAM driver would > > > still work. A new driver would only be necessary if you want to be able > > > to access the special config space register. > > > > I'm slightly worry of this approach, we end up modifying a root > > complex emulation in order to avoid modifying a PCI device emulation > > on QEMU, not sure that's a good trade off. > > > > Note also that different architectures will likely have different root > > complex, and so you might need to modify several of them, plus then > > arrange the PCI layout correctly in order to have the proper hierarchy > > so that devices belonging to different driver domains are assigned to > > different bridges. > > I do think that adding something to the PCI conf register somewhere is > the best option because it is not dependent on ACPI and it is not > dependent on xenstore both of which are very undesirable. > > I am not sure where specifically is the best place. These are 3 ideas > we came up with: > 1. PCI root complex > 2. a register on the device itself > 3. 
a new capability of the device > 4. add one extra dummy PCI device for the sole purpose of exposing the > grants capability > > > > Looking at the spec, there is a way to add a vendor-specific capability > (cap_vndr = 0x9). Could we use that? It doesn't look like it is used > today, Linux doesn't parse it.

I did wonder the same from a quick look at the spec. There's however text in the specification that says:

"The driver SHOULD NOT use the Vendor data capability except for debugging and reporting purposes."

So we would at least need to change that, because the capability would then be used for purposes other than debugging and reporting. Seems like a minor adjustment, so it might be worth asking upstream for their opinion and getting a conversation started.

> > > > > > > > > ********** > > > What do you think about it? Are there any pitfalls, etc? This also requires > > > system changes, but at least without virtio spec changes. > > > > Why are we so reluctant to add spec changes? I understand this might > > take time an effort, but it's the only way IMO to build a sustainable > > VirtIO Xen implementation. Did we already attempt to negotiate with > > Oasis Xen related spec changes and those where refused? > > That's because spec changes can be very slow. This is a bug that we need > a relatively quick solution for and waiting 12-24 months for a spec > update is not realistic. > > I think a spec change would be best as a long term solution. We also > need a short term solution. The short term solution doesn't have to be > ideal but it has to work now.

My fear with such an approach is that once a bodge is in place, people move on to other stuff and this never gets properly fixed.

I know this might not be a well received opinion, but it would be better if such a bodge is kept in each interested party's patch queue for the time being, until a proper solution is implemented. That way there's an interest from the parties in properly fixing it upstream.

Thanks, Roger.
On Thu, 6 Jul 2023, Roger Pau Monné wrote: > On Wed, Jul 05, 2023 at 03:41:10PM -0700, Stefano Stabellini wrote: > > On Wed, 5 Jul 2023, Roger Pau Monné wrote: > > > On Tue, Jul 04, 2023 at 08:14:59PM +0300, Oleksandr Tyshchenko wrote: > > > > ********** > > > > > > > > Part 1 (intro): > > > > > > > > We could reuse a PCI config space register to expose the backend id. > > > > However this solution requires a backend change (QEMU) to expose the > > > > backend id via an emulated register for each emulated device. > > > > > > > > To avoid having to introduce a special config space register in all > > > > emulated PCI devices (virtio-net, virtio-block, etc) I wonder if we > > > > could add a special PCI config space register at the emulated PCI Root > > > > Complex level. > > > > > > > > Basically the workflow would be as follow: > > > > > > > > - Linux recognizes the PCI Root Complex as a Xen PCI Root Complex > > > > - Linux writes to special PCI config space register of the Xen PCI Root > > > > Complex the PCI device id (basically the BDF) > > > > - The Xen PCI Root Complex emulated by Xen answers by writing back to > > > > the same location the backend id (domid of the backend) > > > > - Linux reads back the same PCI config space register of the Xen PCI > > > > Root Complex and learn the relevant domid > > > > > > IMO this seems awfully complex. I'm not familiar with the VirtIO > > > spec, but I see there's a Vendor data capability, could we possibly > > > expose Xen-specific information on that capability? > > > > That is also a possibility too. Also we could use a PCI conf register > > which is known to be unused in the Virtio spec to expose the grant > > capability and backend domid. > > Capabilities don't have a fixed config space register, they are a > linked list, and so capabilities end up at different positions > depending on the specific device layout. The only fixed part is the > range from [0, 0x3F), and that's fully defined in the specification. > > Trying to define a fixed address for Xen use after the 3f boundary > seems like a bad idea, as it's going to be hard to make sure that such > address is not used on all possible devices. IMO the only way is to > place such information in a capability, whether that's an existing > capability or a new one I don't really know. That seems like a good idea > > > > Part 2 (clarification): > > > > > > > > I think using a special config space register in the root complex would > > > > not be terrible in terms of guest changes because it is easy to > > > > introduce a new root complex driver in Linux and other OSes. The root > > > > complex would still be ECAM compatible so the regular ECAM driver would > > > > still work. A new driver would only be necessary if you want to be able > > > > to access the special config space register. > > > > > > I'm slightly worry of this approach, we end up modifying a root > > > complex emulation in order to avoid modifying a PCI device emulation > > > on QEMU, not sure that's a good trade off. > > > > > > Note also that different architectures will likely have different root > > > complex, and so you might need to modify several of them, plus then > > > arrange the PCI layout correctly in order to have the proper hierarchy > > > so that devices belonging to different driver domains are assigned to > > > different bridges. 
> > > > I do think that adding something to the PCI conf register somewhere is > > the best option because it is not dependent on ACPI and it is not > > dependent on xenstore both of which are very undesirable. > > > > I am not sure where specifically is the best place. These are 3 ideas > > we came up with: > > 1. PCI root complex > > 2. a register on the device itself > > 3. a new capability of the device > > 4. add one extra dummy PCI device for the sole purpose of exposing the > > grants capability > > > > > > Looking at the spec, there is a way to add a vendor-specific capability > > (cap_vndr = 0x9). Could we use that? It doesn't look like it is used > > today, Linux doesn't parse it. > > I did wonder the same from a quick look at the spec. There's however > a text in the specification that says: > > "The driver SHOULD NOT use the Vendor data capability except for > debugging and reporting purposes." > > So we would at least need to change that because the capability would > then be used by other purposes different than debugging and reporting. > > Seems like a minor adjustment, so might we worth asking upstream about > their opinion, and to get a conversation started. Wait, wouldn't this use-case fall under "reporting" ? It is exactly what we are doing, right? > > > > > > > > > > > > ********** > > > > What do you think about it? Are there any pitfalls, etc? This also requires > > > > system changes, but at least without virtio spec changes. > > > > > > Why are we so reluctant to add spec changes? I understand this might > > > take time an effort, but it's the only way IMO to build a sustainable > > > VirtIO Xen implementation. Did we already attempt to negotiate with > > > Oasis Xen related spec changes and those where refused? > > > > That's because spec changes can be very slow. This is a bug that we need > > a relatively quick solution for and waiting 12-24 months for a spec > > update is not realistic. > > > > I think a spec change would be best as a long term solution. We also > > need a short term solution. The short term solution doesn't have to be > > ideal but it has to work now. > > My fear with such approach is that once a bodge is in place people > move on to other stuff and this never gets properly fixed. > > I know this might not be a well received opinion, but it would be > better if such bodge is kept in each interested party patchqueue for > the time being, until a proper solution is implemented. That way > there's an interest from parties into properly fixing it upstream. Unfortunately we are in the situation where we have an outstanding upstream bug, so we have to take action one way or the other.
On 06.07.23 23:49, Stefano Stabellini wrote: > On Thu, 6 Jul 2023, Roger Pau Monné wrote: >> On Wed, Jul 05, 2023 at 03:41:10PM -0700, Stefano Stabellini wrote: >>> On Wed, 5 Jul 2023, Roger Pau Monné wrote: >>>> On Tue, Jul 04, 2023 at 08:14:59PM +0300, Oleksandr Tyshchenko wrote: >>>>> ********** >>>>> >>>>> Part 1 (intro): >>>>> >>>>> We could reuse a PCI config space register to expose the backend id. >>>>> However this solution requires a backend change (QEMU) to expose the >>>>> backend id via an emulated register for each emulated device. >>>>> >>>>> To avoid having to introduce a special config space register in all >>>>> emulated PCI devices (virtio-net, virtio-block, etc) I wonder if we >>>>> could add a special PCI config space register at the emulated PCI Root >>>>> Complex level. >>>>> >>>>> Basically the workflow would be as follow: >>>>> >>>>> - Linux recognizes the PCI Root Complex as a Xen PCI Root Complex >>>>> - Linux writes to special PCI config space register of the Xen PCI Root >>>>> Complex the PCI device id (basically the BDF) >>>>> - The Xen PCI Root Complex emulated by Xen answers by writing back to >>>>> the same location the backend id (domid of the backend) >>>>> - Linux reads back the same PCI config space register of the Xen PCI >>>>> Root Complex and learn the relevant domid >>>> >>>> IMO this seems awfully complex. I'm not familiar with the VirtIO >>>> spec, but I see there's a Vendor data capability, could we possibly >>>> expose Xen-specific information on that capability? >>> >>> That is also a possibility too. Also we could use a PCI conf register >>> which is known to be unused in the Virtio spec to expose the grant >>> capability and backend domid. >> >> Capabilities don't have a fixed config space register, they are a >> linked list, and so capabilities end up at different positions >> depending on the specific device layout. The only fixed part is the >> range from [0, 0x3F), and that's fully defined in the specification. >> >> Trying to define a fixed address for Xen use after the 3f boundary >> seems like a bad idea, as it's going to be hard to make sure that such >> address is not used on all possible devices. IMO the only way is to >> place such information in a capability, whether that's an existing >> capability or a new one I don't really know. > > That seems like a good idea > > >>>>> Part 2 (clarification): >>>>> >>>>> I think using a special config space register in the root complex would >>>>> not be terrible in terms of guest changes because it is easy to >>>>> introduce a new root complex driver in Linux and other OSes. The root >>>>> complex would still be ECAM compatible so the regular ECAM driver would >>>>> still work. A new driver would only be necessary if you want to be able >>>>> to access the special config space register. >>>> >>>> I'm slightly worry of this approach, we end up modifying a root >>>> complex emulation in order to avoid modifying a PCI device emulation >>>> on QEMU, not sure that's a good trade off. >>>> >>>> Note also that different architectures will likely have different root >>>> complex, and so you might need to modify several of them, plus then >>>> arrange the PCI layout correctly in order to have the proper hierarchy >>>> so that devices belonging to different driver domains are assigned to >>>> different bridges. 
>>> >>> I do think that adding something to the PCI conf register somewhere is >>> the best option because it is not dependent on ACPI and it is not >>> dependent on xenstore both of which are very undesirable. >>> >>> I am not sure where specifically is the best place. These are 3 ideas >>> we came up with: >>> 1. PCI root complex >>> 2. a register on the device itself >>> 3. a new capability of the device >>> 4. add one extra dummy PCI device for the sole purpose of exposing the >>> grants capability >>> >>> >>> Looking at the spec, there is a way to add a vendor-specific capability >>> (cap_vndr = 0x9). Could we use that? It doesn't look like it is used >>> today, Linux doesn't parse it. >> >> I did wonder the same from a quick look at the spec. There's however >> a text in the specification that says: >> >> "The driver SHOULD NOT use the Vendor data capability except for >> debugging and reporting purposes." >> >> So we would at least need to change that because the capability would >> then be used by other purposes different than debugging and reporting. >> >> Seems like a minor adjustment, so might we worth asking upstream about >> their opinion, and to get a conversation started. > > Wait, wouldn't this use-case fall under "reporting" ? It is exactly what > we are doing, right? I'd understand "reporting" as e.g. logging, transferring statistics, ... We'd like to use it for configuration purposes. Another idea would be to enhance the virtio IOMMU device to suit our needs: we could add the domid as another virtio IOMMU device capability and (for now) use bypass mode for all "productive" devices. Later we could even add grant-V3 support to Xen and to the virtio IOMMU device (see my last year Xen Summit design session). This could be usable for disaggregated KVM setups, too, so I believe there is a chance to get this accepted. >>>>> ********** >>>>> What do you think about it? Are there any pitfalls, etc? This also requires >>>>> system changes, but at least without virtio spec changes. >>>> >>>> Why are we so reluctant to add spec changes? I understand this might >>>> take time an effort, but it's the only way IMO to build a sustainable >>>> VirtIO Xen implementation. Did we already attempt to negotiate with >>>> Oasis Xen related spec changes and those where refused? >>> >>> That's because spec changes can be very slow. This is a bug that we need >>> a relatively quick solution for and waiting 12-24 months for a spec >>> update is not realistic. >>> >>> I think a spec change would be best as a long term solution. We also >>> need a short term solution. The short term solution doesn't have to be >>> ideal but it has to work now. >> >> My fear with such approach is that once a bodge is in place people >> move on to other stuff and this never gets properly fixed. >> >> I know this might not be a well received opinion, but it would be >> better if such bodge is kept in each interested party patchqueue for >> the time being, until a proper solution is implemented. That way >> there's an interest from parties into properly fixing it upstream. > > Unfortunately we are in the situation where we have an outstanding > upstream bug, so we have to take action one way or the other. The required virtio IOMMU device modification would be rather small, so adding it maybe under a CONFIG option defaulting to off might be acceptable. Juergen
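For illustration, the virtio-IOMMU variant suggested above could look roughly like the sketch below on the guest side: a hypothetical feature bit gates a hypothetical extra config field carrying the backend domid, while the IOMMU stays in bypass mode for the actual virtio devices. Neither VIRTIO_IOMMU_F_XEN_GRANTS nor the extended config layout exists today; virtio_has_feature() and virtio_cread_le() are existing kernel helpers.

#include <linux/virtio.h>
#include <linux/virtio_config.h>
#include <uapi/linux/virtio_iommu.h>

#define VIRTIO_IOMMU_F_XEN_GRANTS	62	/* assumed feature bit */

/* Assumed extension of the standard virtio-iommu config space. */
struct virtio_iommu_config_xen {
	struct virtio_iommu_config base;
	__le16 backend_domid;
};

static int example_iommu_backend_domid(struct virtio_device *vdev, u16 *domid)
{
	if (!virtio_has_feature(vdev, VIRTIO_IOMMU_F_XEN_GRANTS))
		return -ENODEV;

	virtio_cread_le(vdev, struct virtio_iommu_config_xen,
			backend_domid, domid);
	return 0;
}

The appeal of this route is that the negotiation stays entirely inside existing virtio machinery, which is why it is framed as potentially acceptable upstream for non-Xen disaggregated setups as well.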
Re-reading the whole thread again ... On 29.06.23 03:00, Stefano Stabellini wrote: > On Wed, 21 Jun 2023, Oleksandr Tyshchenko wrote: >> On 21.06.23 16:12, Petr Pavlu wrote: >> >> >> Hello Petr >> >> >>> When attempting to run Xen on a QEMU/KVM virtual machine with virtio >>> devices (all x86_64), dom0 tries to establish a grant for itself which >>> eventually results in a hang during the boot. >>> >>> The backtrace looks as follows, the while loop in __send_control_msg() >>> makes no progress: >>> >>> #0 virtqueue_get_buf_ctx (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94, ctx=ctx@entry=0x0 <fixed_percpu_data>) at ../drivers/virtio/virtio_ring.c:2326 >>> #1 0xffffffff817086b7 in virtqueue_get_buf (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94) at ../drivers/virtio/virtio_ring.c:2333 >>> #2 0xffffffff8175f6b2 in __send_control_msg (portdev=<optimized out>, port_id=0xffffffff, event=0x0, value=0x1) at ../drivers/char/virtio_console.c:562 >>> #3 0xffffffff8175f6ee in __send_control_msg (portdev=<optimized out>, port_id=<optimized out>, event=<optimized out>, value=<optimized out>) at ../drivers/char/virtio_console.c:569 >>> #4 0xffffffff817618b1 in virtcons_probe (vdev=0xffff88800585e800) at ../drivers/char/virtio_console.c:2098 >>> #5 0xffffffff81707117 in virtio_dev_probe (_d=0xffff88800585e810) at ../drivers/virtio/virtio.c:305 >>> #6 0xffffffff8198e348 in call_driver_probe (drv=0xffffffff82be40c0 <virtio_console>, drv=0xffffffff82be40c0 <virtio_console>, dev=0xffff88800585e810) at ../drivers/base/dd.c:579 >>> #7 really_probe (dev=dev@entry=0xffff88800585e810, drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:658 >>> #8 0xffffffff8198e58f in __driver_probe_device (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:800 >>> #9 0xffffffff8198e65a in driver_probe_device (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:830 >>> #10 0xffffffff8198e832 in __driver_attach (dev=0xffff88800585e810, data=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1216 >>> #11 0xffffffff8198bfb2 in bus_for_each_dev (bus=<optimized out>, start=start@entry=0x0 <fixed_percpu_data>, data=data@entry=0xffffffff82be40c0 <virtio_console>, >>> fn=fn@entry=0xffffffff8198e7b0 <__driver_attach>) at ../drivers/base/bus.c:368 >>> #12 0xffffffff8198db65 in driver_attach (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1233 >>> #13 0xffffffff8198d207 in bus_add_driver (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/bus.c:673 >>> #14 0xffffffff8198f550 in driver_register (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/driver.c:246 >>> #15 0xffffffff81706b47 in register_virtio_driver (driver=driver@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/virtio/virtio.c:357 >>> #16 0xffffffff832cd34b in virtio_console_init () at ../drivers/char/virtio_console.c:2258 >>> #17 0xffffffff8100105c in do_one_initcall (fn=0xffffffff832cd2e0 <virtio_console_init>) at ../init/main.c:1246 >>> #18 0xffffffff83277293 in do_initcall_level (command_line=0xffff888003e2f900 "root", level=0x6) at ../init/main.c:1319 >>> #19 do_initcalls () at ../init/main.c:1335 >>> #20 do_basic_setup () at ../init/main.c:1354 >>> #21 kernel_init_freeable () at ../init/main.c:1571 >>> #22 0xffffffff81f64be1 in kernel_init (unused=<optimized out>) at ../init/main.c:1462 >>> #23 0xffffffff81001f49 
in ret_from_fork () at ../arch/x86/entry/entry_64.S:308 >>> #24 0x0000000000000000 in ?? () >>> >>> Fix the problem by preventing xen_grant_init_backend_domid() from >>> setting dom0 as a backend when running in dom0. >>> >>> Fixes: 035e3a4321f7 ("xen/virtio: Optimize the setup of "xen-grant-dma" devices") >> >> >> I am not 100% sure whether the Fixes tag points to precise commit. If I >> am not mistaken, the said commit just moves the code in the context >> without changing the logic of CONFIG_XEN_VIRTIO_FORCE_GRANT, this was >> introduced before. >> >> >>> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> >>> --- >>> drivers/xen/grant-dma-ops.c | 4 +++- >>> 1 file changed, 3 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c >>> index 76f6f26265a3..29ed27ac450e 100644 >>> --- a/drivers/xen/grant-dma-ops.c >>> +++ b/drivers/xen/grant-dma-ops.c >>> @@ -362,7 +362,9 @@ static int xen_grant_init_backend_domid(struct device *dev, >>> if (np) { >>> ret = xen_dt_grant_init_backend_domid(dev, np, backend_domid); >>> of_node_put(np); >>> - } else if (IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || xen_pv_domain()) { >>> + } else if ((IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || >>> + xen_pv_domain()) && >>> + !xen_initial_domain()) { >> >> The commit lgtm, just one note: >> >> >> I would even bail out early in xen_virtio_restricted_mem_acc() instead, >> as I assume the same issue could happen on Arm with DT (although there >> we don't guess the backend's domid, we read it from DT and quite >> unlikely we get Dom0 being in Dom0 with correct DT). >> >> Something like: >> >> @@ -416,6 +421,10 @@ bool xen_virtio_restricted_mem_acc(struct >> virtio_device *dev) >> { >> domid_t backend_domid; >> >> + /* Xen grant DMA ops are not used when running as initial domain */ >> + if (xen_initial_domain()) >> + return false; >> + >> if (!xen_grant_init_backend_domid(dev->dev.parent, >> &backend_domid)) { >> xen_grant_setup_dma_ops(dev->dev.parent, backend_domid); >> return true; >> (END) >> >> >> >> If so, that commit subject would need to be updated accordingly. >> >> Let's see what other reviewers will say. > > This doesn't work in all cases. Imagine using PCI Passthrough to assign > a "physical" virtio device to a domU. The domU will run into the same > error, right? > > The problem is that we need a way for the virtio backend to advertise > its ability of handling grants. Right now we only have a way to do with > that with device tree on ARM. On x86, we only have > CONFIG_XEN_VIRTIO_FORCE_GRANT, and if we take > CONFIG_XEN_VIRTIO_FORCE_GRANT at face value, it also enables grants for > "physical" virtio devices. Note that in this case we are fixing a > nested-virtualization bug, but there are actually physical > virtio-compatible devices out there. CONFIG_XEN_VIRTIO_FORCE_GRANT will > break those too. In case you want virtio device passthrough, you shouldn't use a kernel built with CONFIG_XEN_VIRTIO_FORCE_GRANT. And supporting passing through virtio devices of the host to pv-domUs is a security risk anyway. We _could_ drop the requirement of the backend needing to set VIRTIO_F_ACCESS_PLATFORM for PV guests and allow grant-less virtio handling for all guests. For this to work xen_virtio_restricted_mem_acc() would need to check for VIRTIO_F_ACCESS_PLATFORM and return true if set. Maybe we'd want to enable that possibility via a boot parameter? Juergen
On 26.06.23 15:17, Petr Pavlu wrote: > On 6/21/23 19:58, Oleksandr Tyshchenko wrote: >> On 21.06.23 16:12, Petr Pavlu wrote: >>> When attempting to run Xen on a QEMU/KVM virtual machine with virtio >>> devices (all x86_64), dom0 tries to establish a grant for itself which >>> eventually results in a hang during the boot. >>> >>> The backtrace looks as follows, the while loop in __send_control_msg() >>> makes no progress: >>> >>> #0 virtqueue_get_buf_ctx (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94, ctx=ctx@entry=0x0 <fixed_percpu_data>) at ../drivers/virtio/virtio_ring.c:2326 >>> #1 0xffffffff817086b7 in virtqueue_get_buf (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94) at ../drivers/virtio/virtio_ring.c:2333 >>> #2 0xffffffff8175f6b2 in __send_control_msg (portdev=<optimized out>, port_id=0xffffffff, event=0x0, value=0x1) at ../drivers/char/virtio_console.c:562 >>> #3 0xffffffff8175f6ee in __send_control_msg (portdev=<optimized out>, port_id=<optimized out>, event=<optimized out>, value=<optimized out>) at ../drivers/char/virtio_console.c:569 >>> #4 0xffffffff817618b1 in virtcons_probe (vdev=0xffff88800585e800) at ../drivers/char/virtio_console.c:2098 >>> #5 0xffffffff81707117 in virtio_dev_probe (_d=0xffff88800585e810) at ../drivers/virtio/virtio.c:305 >>> #6 0xffffffff8198e348 in call_driver_probe (drv=0xffffffff82be40c0 <virtio_console>, drv=0xffffffff82be40c0 <virtio_console>, dev=0xffff88800585e810) at ../drivers/base/dd.c:579 >>> #7 really_probe (dev=dev@entry=0xffff88800585e810, drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:658 >>> #8 0xffffffff8198e58f in __driver_probe_device (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:800 >>> #9 0xffffffff8198e65a in driver_probe_device (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:830 >>> #10 0xffffffff8198e832 in __driver_attach (dev=0xffff88800585e810, data=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1216 >>> #11 0xffffffff8198bfb2 in bus_for_each_dev (bus=<optimized out>, start=start@entry=0x0 <fixed_percpu_data>, data=data@entry=0xffffffff82be40c0 <virtio_console>, >>> fn=fn@entry=0xffffffff8198e7b0 <__driver_attach>) at ../drivers/base/bus.c:368 >>> #12 0xffffffff8198db65 in driver_attach (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1233 >>> #13 0xffffffff8198d207 in bus_add_driver (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/bus.c:673 >>> #14 0xffffffff8198f550 in driver_register (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/driver.c:246 >>> #15 0xffffffff81706b47 in register_virtio_driver (driver=driver@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/virtio/virtio.c:357 >>> #16 0xffffffff832cd34b in virtio_console_init () at ../drivers/char/virtio_console.c:2258 >>> #17 0xffffffff8100105c in do_one_initcall (fn=0xffffffff832cd2e0 <virtio_console_init>) at ../init/main.c:1246 >>> #18 0xffffffff83277293 in do_initcall_level (command_line=0xffff888003e2f900 "root", level=0x6) at ../init/main.c:1319 >>> #19 do_initcalls () at ../init/main.c:1335 >>> #20 do_basic_setup () at ../init/main.c:1354 >>> #21 kernel_init_freeable () at ../init/main.c:1571 >>> #22 0xffffffff81f64be1 in kernel_init (unused=<optimized out>) at ../init/main.c:1462 >>> #23 0xffffffff81001f49 in ret_from_fork () at ../arch/x86/entry/entry_64.S:308 >>> #24 
0x0000000000000000 in ?? () >>> >>> Fix the problem by preventing xen_grant_init_backend_domid() from >>> setting dom0 as a backend when running in dom0. >>> >>> Fixes: 035e3a4321f7 ("xen/virtio: Optimize the setup of "xen-grant-dma" devices") >> >> >> I am not 100% sure whether the Fixes tag points to precise commit. If I >> am not mistaken, the said commit just moves the code in the context >> without changing the logic of CONFIG_XEN_VIRTIO_FORCE_GRANT, this was >> introduced before. > > I see, the tag should better point to 7228113d1fa0 ("xen/virtio: use > dom0 as default backend for CONFIG_XEN_VIRTIO_FORCE_GRANT") which > introduced the original logic to use dom0 as backend. > > Commit 035e3a4321f7 ("xen/virtio: Optimize the setup of "xen-grant-dma" > devices") is relevant in sense that it extended when this logic is > active by adding an OR check for xen_pv_domain(). > >> >> >>> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> >>> --- >>> drivers/xen/grant-dma-ops.c | 4 +++- >>> 1 file changed, 3 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c >>> index 76f6f26265a3..29ed27ac450e 100644 >>> --- a/drivers/xen/grant-dma-ops.c >>> +++ b/drivers/xen/grant-dma-ops.c >>> @@ -362,7 +362,9 @@ static int xen_grant_init_backend_domid(struct device *dev, >>> if (np) { >>> ret = xen_dt_grant_init_backend_domid(dev, np, backend_domid); >>> of_node_put(np); >>> - } else if (IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || xen_pv_domain()) { >>> + } else if ((IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || >>> + xen_pv_domain()) && >>> + !xen_initial_domain()) { >> >> The commit lgtm, just one note: >> >> >> I would even bail out early in xen_virtio_restricted_mem_acc() instead, >> as I assume the same issue could happen on Arm with DT (although there >> we don't guess the backend's domid, we read it from DT and quite >> unlikely we get Dom0 being in Dom0 with correct DT). >> >> Something like: >> >> @@ -416,6 +421,10 @@ bool xen_virtio_restricted_mem_acc(struct >> virtio_device *dev) >> { >> domid_t backend_domid; >> >> + /* Xen grant DMA ops are not used when running as initial domain */ >> + if (xen_initial_domain()) >> + return false; >> + >> if (!xen_grant_init_backend_domid(dev->dev.parent, >> &backend_domid)) { >> xen_grant_setup_dma_ops(dev->dev.parent, backend_domid); >> return true; >> (END) >> >> >> >> If so, that commit subject would need to be updated accordingly. >> >> Let's see what other reviewers will say. > > Ok, makes sense. I think this is okay for a fix of the current problem. Passing through virtio devices to a PV domU is not covered by this fix, but this should be a rather rare configuration, which doesn't work today either. So the suggested patch would fix the current issue without introducing a regression. Anything else can be done later. Juergen
On 07.07.23 10:04, Juergen Gross wrote: Hello Juergen > Re-reading the whole thread again ... > > On 29.06.23 03:00, Stefano Stabellini wrote: >> On Wed, 21 Jun 2023, Oleksandr Tyshchenko wrote: >>> On 21.06.23 16:12, Petr Pavlu wrote: >>> >>> >>> Hello Petr >>> >>> >>>> When attempting to run Xen on a QEMU/KVM virtual machine with virtio >>>> devices (all x86_64), dom0 tries to establish a grant for itself which >>>> eventually results in a hang during the boot. >>>> >>>> The backtrace looks as follows, the while loop in __send_control_msg() >>>> makes no progress: >>>> >>>> #0 virtqueue_get_buf_ctx (_vq=_vq@entry=0xffff8880074a8400, >>>> len=len@entry=0xffffc90000413c94, ctx=ctx@entry=0x0 >>>> <fixed_percpu_data>) at ../drivers/virtio/virtio_ring.c:2326 >>>> #1 0xffffffff817086b7 in virtqueue_get_buf >>>> (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94) >>>> at ../drivers/virtio/virtio_ring.c:2333 >>>> #2 0xffffffff8175f6b2 in __send_control_msg (portdev=<optimized >>>> out>, port_id=0xffffffff, event=0x0, value=0x1) at >>>> ../drivers/char/virtio_console.c:562 >>>> #3 0xffffffff8175f6ee in __send_control_msg (portdev=<optimized >>>> out>, port_id=<optimized out>, event=<optimized out>, >>>> value=<optimized out>) at ../drivers/char/virtio_console.c:569 >>>> #4 0xffffffff817618b1 in virtcons_probe >>>> (vdev=0xffff88800585e800) at ../drivers/char/virtio_console.c:2098 >>>> #5 0xffffffff81707117 in virtio_dev_probe >>>> (_d=0xffff88800585e810) at ../drivers/virtio/virtio.c:305 >>>> #6 0xffffffff8198e348 in call_driver_probe >>>> (drv=0xffffffff82be40c0 <virtio_console>, drv=0xffffffff82be40c0 >>>> <virtio_console>, dev=0xffff88800585e810) at ../drivers/base/dd.c:579 >>>> #7 really_probe (dev=dev@entry=0xffff88800585e810, >>>> drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at >>>> ../drivers/base/dd.c:658 >>>> #8 0xffffffff8198e58f in __driver_probe_device >>>> (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, >>>> dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:800 >>>> #9 0xffffffff8198e65a in driver_probe_device >>>> (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, >>>> dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:830 >>>> #10 0xffffffff8198e832 in __driver_attach >>>> (dev=0xffff88800585e810, data=0xffffffff82be40c0 <virtio_console>) >>>> at ../drivers/base/dd.c:1216 >>>> #11 0xffffffff8198bfb2 in bus_for_each_dev (bus=<optimized out>, >>>> start=start@entry=0x0 <fixed_percpu_data>, >>>> data=data@entry=0xffffffff82be40c0 <virtio_console>, >>>> fn=fn@entry=0xffffffff8198e7b0 <__driver_attach>) at >>>> ../drivers/base/bus.c:368 >>>> #12 0xffffffff8198db65 in driver_attach >>>> (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at >>>> ../drivers/base/dd.c:1233 >>>> #13 0xffffffff8198d207 in bus_add_driver >>>> (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at >>>> ../drivers/base/bus.c:673 >>>> #14 0xffffffff8198f550 in driver_register >>>> (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at >>>> ../drivers/base/driver.c:246 >>>> #15 0xffffffff81706b47 in register_virtio_driver >>>> (driver=driver@entry=0xffffffff82be40c0 <virtio_console>) at >>>> ../drivers/virtio/virtio.c:357 >>>> #16 0xffffffff832cd34b in virtio_console_init () at >>>> ../drivers/char/virtio_console.c:2258 >>>> #17 0xffffffff8100105c in do_one_initcall (fn=0xffffffff832cd2e0 >>>> <virtio_console_init>) at ../init/main.c:1246 >>>> #18 0xffffffff83277293 in do_initcall_level >>>> (command_line=0xffff888003e2f900 "root", level=0x6) at >>>> ../init/main.c:1319 
>>>> #19 do_initcalls () at ../init/main.c:1335 >>>> #20 do_basic_setup () at ../init/main.c:1354 >>>> #21 kernel_init_freeable () at ../init/main.c:1571 >>>> #22 0xffffffff81f64be1 in kernel_init (unused=<optimized out>) >>>> at ../init/main.c:1462 >>>> #23 0xffffffff81001f49 in ret_from_fork () at >>>> ../arch/x86/entry/entry_64.S:308 >>>> #24 0x0000000000000000 in ?? () >>>> >>>> Fix the problem by preventing xen_grant_init_backend_domid() from >>>> setting dom0 as a backend when running in dom0. >>>> >>>> Fixes: 035e3a4321f7 ("xen/virtio: Optimize the setup of >>>> "xen-grant-dma" devices") >>> >>> >>> I am not 100% sure whether the Fixes tag points to precise commit. If I >>> am not mistaken, the said commit just moves the code in the context >>> without changing the logic of CONFIG_XEN_VIRTIO_FORCE_GRANT, this was >>> introduced before. >>> >>> >>>> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> >>>> --- >>>> drivers/xen/grant-dma-ops.c | 4 +++- >>>> 1 file changed, 3 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c >>>> index 76f6f26265a3..29ed27ac450e 100644 >>>> --- a/drivers/xen/grant-dma-ops.c >>>> +++ b/drivers/xen/grant-dma-ops.c >>>> @@ -362,7 +362,9 @@ static int xen_grant_init_backend_domid(struct >>>> device *dev, >>>> if (np) { >>>> ret = xen_dt_grant_init_backend_domid(dev, np, >>>> backend_domid); >>>> of_node_put(np); >>>> - } else if (IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || >>>> xen_pv_domain()) { >>>> + } else if ((IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || >>>> + xen_pv_domain()) && >>>> + !xen_initial_domain()) { >>> >>> The commit lgtm, just one note: >>> >>> >>> I would even bail out early in xen_virtio_restricted_mem_acc() instead, >>> as I assume the same issue could happen on Arm with DT (although there >>> we don't guess the backend's domid, we read it from DT and quite >>> unlikely we get Dom0 being in Dom0 with correct DT). >>> >>> Something like: >>> >>> @@ -416,6 +421,10 @@ bool xen_virtio_restricted_mem_acc(struct >>> virtio_device *dev) >>> { >>> domid_t backend_domid; >>> >>> + /* Xen grant DMA ops are not used when running as initial >>> domain */ >>> + if (xen_initial_domain()) >>> + return false; >>> + >>> if (!xen_grant_init_backend_domid(dev->dev.parent, >>> &backend_domid)) { >>> xen_grant_setup_dma_ops(dev->dev.parent, >>> backend_domid); >>> return true; >>> (END) >>> >>> >>> >>> If so, that commit subject would need to be updated accordingly. >>> >>> Let's see what other reviewers will say. >> >> This doesn't work in all cases. Imagine using PCI Passthrough to assign >> a "physical" virtio device to a domU. The domU will run into the same >> error, right? >> >> The problem is that we need a way for the virtio backend to advertise >> its ability of handling grants. Right now we only have a way to do with >> that with device tree on ARM. On x86, we only have >> CONFIG_XEN_VIRTIO_FORCE_GRANT, and if we take >> CONFIG_XEN_VIRTIO_FORCE_GRANT at face value, it also enables grants for >> "physical" virtio devices. Note that in this case we are fixing a >> nested-virtualization bug, but there are actually physical >> virtio-compatible devices out there. CONFIG_XEN_VIRTIO_FORCE_GRANT will >> break those too. > > In case you want virtio device passthrough, you shouldn't use a kernel > built with CONFIG_XEN_VIRTIO_FORCE_GRANT. > > And supporting passing through virtio devices of the host to pv-domUs is > a security risk anyway. 
> > We _could_ drop the requirement of the backend needing to set > VIRTIO_F_ACCESS_PLATFORM for PV guests and allow grant-less virtio > handling for all guests. For this to work xen_virtio_restricted_mem_acc() > would need to check for VIRTIO_F_ACCESS_PLATFORM and return true if set. > Maybe we'd want to enable that possibility via a boot parameter? Maybe, yes. I don't see at the moment why this wouldn't work. At the same time I wonder, could we just modify xen_pv_init_platform() to call virtio_no_restricted_mem_acc() when grant usage is forcibly disabled by a boot parameter, irrespective of VIRTIO_F_ACCESS_PLATFORM presence? > > > Juergen
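A sketch of that idea, assuming a new (hypothetical) boot parameter has been parsed into a flag and assuming virtio_no_restricted_mem_acc() is visible to the Xen PV setup code:

/*
 * Sketch only.  "xen_force_no_virtio_grants" is a hypothetical flag set
 * from a new boot parameter; virtio_no_restricted_mem_acc() is assumed
 * to be callable from arch code here.
 */
static void __init xen_pv_init_platform(void)
{
	if (IS_ENABLED(CONFIG_XEN_VIRTIO)) {
		if (xen_force_no_virtio_grants)
			/* Never force grants, regardless of VIRTIO_F_ACCESS_PLATFORM. */
			virtio_set_mem_acc_cb(virtio_no_restricted_mem_acc);
		else
			virtio_set_mem_acc_cb(xen_virtio_restricted_mem_acc);
	}

	/* ... existing platform setup continues unchanged ... */
}

As the follow-up below points out, a global switch like this cannot cover the mixed case of a passed-through virtio device and a dom0-backed virtio device in the same guest.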
On 07.07.23 10:00, Oleksandr Tyshchenko wrote: > > > On 07.07.23 10:04, Juergen Gross wrote: > > Hello Juergen > > >> Re-reading the whole thread again ... >> >> On 29.06.23 03:00, Stefano Stabellini wrote: >>> On Wed, 21 Jun 2023, Oleksandr Tyshchenko wrote: >>>> On 21.06.23 16:12, Petr Pavlu wrote: >>>> >>>> >>>> Hello Petr >>>> >>>> >>>>> When attempting to run Xen on a QEMU/KVM virtual machine with virtio >>>>> devices (all x86_64), dom0 tries to establish a grant for itself which >>>>> eventually results in a hang during the boot. >>>>> >>>>> The backtrace looks as follows, the while loop in __send_control_msg() >>>>> makes no progress: >>>>> >>>>> #0 virtqueue_get_buf_ctx (_vq=_vq@entry=0xffff8880074a8400, >>>>> len=len@entry=0xffffc90000413c94, ctx=ctx@entry=0x0 >>>>> <fixed_percpu_data>) at ../drivers/virtio/virtio_ring.c:2326 >>>>> #1 0xffffffff817086b7 in virtqueue_get_buf >>>>> (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94) >>>>> at ../drivers/virtio/virtio_ring.c:2333 >>>>> #2 0xffffffff8175f6b2 in __send_control_msg (portdev=<optimized >>>>> out>, port_id=0xffffffff, event=0x0, value=0x1) at >>>>> ../drivers/char/virtio_console.c:562 >>>>> #3 0xffffffff8175f6ee in __send_control_msg (portdev=<optimized >>>>> out>, port_id=<optimized out>, event=<optimized out>, >>>>> value=<optimized out>) at ../drivers/char/virtio_console.c:569 >>>>> #4 0xffffffff817618b1 in virtcons_probe >>>>> (vdev=0xffff88800585e800) at ../drivers/char/virtio_console.c:2098 >>>>> #5 0xffffffff81707117 in virtio_dev_probe >>>>> (_d=0xffff88800585e810) at ../drivers/virtio/virtio.c:305 >>>>> #6 0xffffffff8198e348 in call_driver_probe >>>>> (drv=0xffffffff82be40c0 <virtio_console>, drv=0xffffffff82be40c0 >>>>> <virtio_console>, dev=0xffff88800585e810) at ../drivers/base/dd.c:579 >>>>> #7 really_probe (dev=dev@entry=0xffff88800585e810, >>>>> drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at >>>>> ../drivers/base/dd.c:658 >>>>> #8 0xffffffff8198e58f in __driver_probe_device >>>>> (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, >>>>> dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:800 >>>>> #9 0xffffffff8198e65a in driver_probe_device >>>>> (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, >>>>> dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:830 >>>>> #10 0xffffffff8198e832 in __driver_attach >>>>> (dev=0xffff88800585e810, data=0xffffffff82be40c0 <virtio_console>) >>>>> at ../drivers/base/dd.c:1216 >>>>> #11 0xffffffff8198bfb2 in bus_for_each_dev (bus=<optimized out>, >>>>> start=start@entry=0x0 <fixed_percpu_data>, >>>>> data=data@entry=0xffffffff82be40c0 <virtio_console>, >>>>> fn=fn@entry=0xffffffff8198e7b0 <__driver_attach>) at >>>>> ../drivers/base/bus.c:368 >>>>> #12 0xffffffff8198db65 in driver_attach >>>>> (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at >>>>> ../drivers/base/dd.c:1233 >>>>> #13 0xffffffff8198d207 in bus_add_driver >>>>> (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at >>>>> ../drivers/base/bus.c:673 >>>>> #14 0xffffffff8198f550 in driver_register >>>>> (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at >>>>> ../drivers/base/driver.c:246 >>>>> #15 0xffffffff81706b47 in register_virtio_driver >>>>> (driver=driver@entry=0xffffffff82be40c0 <virtio_console>) at >>>>> ../drivers/virtio/virtio.c:357 >>>>> #16 0xffffffff832cd34b in virtio_console_init () at >>>>> ../drivers/char/virtio_console.c:2258 >>>>> #17 0xffffffff8100105c in do_one_initcall (fn=0xffffffff832cd2e0 >>>>> <virtio_console_init>) at ../init/main.c:1246 
>>>>> #18 0xffffffff83277293 in do_initcall_level >>>>> (command_line=0xffff888003e2f900 "root", level=0x6) at >>>>> ../init/main.c:1319 >>>>> #19 do_initcalls () at ../init/main.c:1335 >>>>> #20 do_basic_setup () at ../init/main.c:1354 >>>>> #21 kernel_init_freeable () at ../init/main.c:1571 >>>>> #22 0xffffffff81f64be1 in kernel_init (unused=<optimized out>) >>>>> at ../init/main.c:1462 >>>>> #23 0xffffffff81001f49 in ret_from_fork () at >>>>> ../arch/x86/entry/entry_64.S:308 >>>>> #24 0x0000000000000000 in ?? () >>>>> >>>>> Fix the problem by preventing xen_grant_init_backend_domid() from >>>>> setting dom0 as a backend when running in dom0. >>>>> >>>>> Fixes: 035e3a4321f7 ("xen/virtio: Optimize the setup of >>>>> "xen-grant-dma" devices") >>>> >>>> >>>> I am not 100% sure whether the Fixes tag points to precise commit. If I >>>> am not mistaken, the said commit just moves the code in the context >>>> without changing the logic of CONFIG_XEN_VIRTIO_FORCE_GRANT, this was >>>> introduced before. >>>> >>>> >>>>> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> >>>>> --- >>>>> drivers/xen/grant-dma-ops.c | 4 +++- >>>>> 1 file changed, 3 insertions(+), 1 deletion(-) >>>>> >>>>> diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c >>>>> index 76f6f26265a3..29ed27ac450e 100644 >>>>> --- a/drivers/xen/grant-dma-ops.c >>>>> +++ b/drivers/xen/grant-dma-ops.c >>>>> @@ -362,7 +362,9 @@ static int xen_grant_init_backend_domid(struct >>>>> device *dev, >>>>> if (np) { >>>>> ret = xen_dt_grant_init_backend_domid(dev, np, >>>>> backend_domid); >>>>> of_node_put(np); >>>>> - } else if (IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || >>>>> xen_pv_domain()) { >>>>> + } else if ((IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || >>>>> + xen_pv_domain()) && >>>>> + !xen_initial_domain()) { >>>> >>>> The commit lgtm, just one note: >>>> >>>> >>>> I would even bail out early in xen_virtio_restricted_mem_acc() instead, >>>> as I assume the same issue could happen on Arm with DT (although there >>>> we don't guess the backend's domid, we read it from DT and quite >>>> unlikely we get Dom0 being in Dom0 with correct DT). >>>> >>>> Something like: >>>> >>>> @@ -416,6 +421,10 @@ bool xen_virtio_restricted_mem_acc(struct >>>> virtio_device *dev) >>>> { >>>> domid_t backend_domid; >>>> >>>> + /* Xen grant DMA ops are not used when running as initial >>>> domain */ >>>> + if (xen_initial_domain()) >>>> + return false; >>>> + >>>> if (!xen_grant_init_backend_domid(dev->dev.parent, >>>> &backend_domid)) { >>>> xen_grant_setup_dma_ops(dev->dev.parent, >>>> backend_domid); >>>> return true; >>>> (END) >>>> >>>> >>>> >>>> If so, that commit subject would need to be updated accordingly. >>>> >>>> Let's see what other reviewers will say. >>> >>> This doesn't work in all cases. Imagine using PCI Passthrough to assign >>> a "physical" virtio device to a domU. The domU will run into the same >>> error, right? >>> >>> The problem is that we need a way for the virtio backend to advertise >>> its ability of handling grants. Right now we only have a way to do with >>> that with device tree on ARM. On x86, we only have >>> CONFIG_XEN_VIRTIO_FORCE_GRANT, and if we take >>> CONFIG_XEN_VIRTIO_FORCE_GRANT at face value, it also enables grants for >>> "physical" virtio devices. Note that in this case we are fixing a >>> nested-virtualization bug, but there are actually physical >>> virtio-compatible devices out there. CONFIG_XEN_VIRTIO_FORCE_GRANT will >>> break those too. 
>> >> In case you want virtio device passthrough, you shouldn't use a kernel >> built with CONFIG_XEN_VIRTIO_FORCE_GRANT. >> >> And supporting passing through virtio devices of the host to pv-domUs is >> a security risk anyway. >> >> We _could_ drop the requirement of the backend needing to set >> VIRTIO_F_ACCESS_PLATFORM for PV guests and allow grant-less virtio >> handling for all guests. For this to work xen_virtio_restricted_mem_acc() >> would need to check for VIRTIO_F_ACCESS_PLATFORM and return true if set. >> Maybe we'd want to enable that possibility via a boot parameter? > > > Maybe, yes. I don't see at the moment why this won't work. > > At the same time I wonder, could we just modify xen_pv_init_platform() > to call virtio_no_restricted_mem_acc() if forcibly disabled by boot > parameter irrespective of VIRTIO_F_ACCESS_PLATFORM presence? This wouldn't work for the case where a host virtio device is passed through to the pv domU and at the same time another virtio device is using dom0 as a backend. I think we should use grants if possible. Juergen
On 07.07.23 11:11, Juergen Gross wrote: Hello Juergen > On 07.07.23 10:00, Oleksandr Tyshchenko wrote: >> >> >> On 07.07.23 10:04, Juergen Gross wrote: >> >> Hello Juergen >> >> >>> Re-reading the whole thread again ... >>> >>> On 29.06.23 03:00, Stefano Stabellini wrote: >>>> On Wed, 21 Jun 2023, Oleksandr Tyshchenko wrote: >>>>> On 21.06.23 16:12, Petr Pavlu wrote: >>>>> >>>>> >>>>> Hello Petr >>>>> >>>>> >>>>>> When attempting to run Xen on a QEMU/KVM virtual machine with virtio >>>>>> devices (all x86_64), dom0 tries to establish a grant for itself >>>>>> which >>>>>> eventually results in a hang during the boot. >>>>>> >>>>>> The backtrace looks as follows, the while loop in >>>>>> __send_control_msg() >>>>>> makes no progress: >>>>>> >>>>>> #0 virtqueue_get_buf_ctx (_vq=_vq@entry=0xffff8880074a8400, >>>>>> len=len@entry=0xffffc90000413c94, ctx=ctx@entry=0x0 >>>>>> <fixed_percpu_data>) at ../drivers/virtio/virtio_ring.c:2326 >>>>>> #1 0xffffffff817086b7 in virtqueue_get_buf >>>>>> (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94) >>>>>> at ../drivers/virtio/virtio_ring.c:2333 >>>>>> #2 0xffffffff8175f6b2 in __send_control_msg (portdev=<optimized >>>>>> out>, port_id=0xffffffff, event=0x0, value=0x1) at >>>>>> ../drivers/char/virtio_console.c:562 >>>>>> #3 0xffffffff8175f6ee in __send_control_msg (portdev=<optimized >>>>>> out>, port_id=<optimized out>, event=<optimized out>, >>>>>> value=<optimized out>) at ../drivers/char/virtio_console.c:569 >>>>>> #4 0xffffffff817618b1 in virtcons_probe >>>>>> (vdev=0xffff88800585e800) at ../drivers/char/virtio_console.c:2098 >>>>>> #5 0xffffffff81707117 in virtio_dev_probe >>>>>> (_d=0xffff88800585e810) at ../drivers/virtio/virtio.c:305 >>>>>> #6 0xffffffff8198e348 in call_driver_probe >>>>>> (drv=0xffffffff82be40c0 <virtio_console>, drv=0xffffffff82be40c0 >>>>>> <virtio_console>, dev=0xffff88800585e810) at ../drivers/base/dd.c:579 >>>>>> #7 really_probe (dev=dev@entry=0xffff88800585e810, >>>>>> drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at >>>>>> ../drivers/base/dd.c:658 >>>>>> #8 0xffffffff8198e58f in __driver_probe_device >>>>>> (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, >>>>>> dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:800 >>>>>> #9 0xffffffff8198e65a in driver_probe_device >>>>>> (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, >>>>>> dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:830 >>>>>> #10 0xffffffff8198e832 in __driver_attach >>>>>> (dev=0xffff88800585e810, data=0xffffffff82be40c0 <virtio_console>) >>>>>> at ../drivers/base/dd.c:1216 >>>>>> #11 0xffffffff8198bfb2 in bus_for_each_dev (bus=<optimized out>, >>>>>> start=start@entry=0x0 <fixed_percpu_data>, >>>>>> data=data@entry=0xffffffff82be40c0 <virtio_console>, >>>>>> fn=fn@entry=0xffffffff8198e7b0 <__driver_attach>) at >>>>>> ../drivers/base/bus.c:368 >>>>>> #12 0xffffffff8198db65 in driver_attach >>>>>> (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at >>>>>> ../drivers/base/dd.c:1233 >>>>>> #13 0xffffffff8198d207 in bus_add_driver >>>>>> (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at >>>>>> ../drivers/base/bus.c:673 >>>>>> #14 0xffffffff8198f550 in driver_register >>>>>> (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at >>>>>> ../drivers/base/driver.c:246 >>>>>> #15 0xffffffff81706b47 in register_virtio_driver >>>>>> (driver=driver@entry=0xffffffff82be40c0 <virtio_console>) at >>>>>> ../drivers/virtio/virtio.c:357 >>>>>> #16 0xffffffff832cd34b in virtio_console_init () at >>>>>> 
../drivers/char/virtio_console.c:2258 >>>>>> #17 0xffffffff8100105c in do_one_initcall (fn=0xffffffff832cd2e0 >>>>>> <virtio_console_init>) at ../init/main.c:1246 >>>>>> #18 0xffffffff83277293 in do_initcall_level >>>>>> (command_line=0xffff888003e2f900 "root", level=0x6) at >>>>>> ../init/main.c:1319 >>>>>> #19 do_initcalls () at ../init/main.c:1335 >>>>>> #20 do_basic_setup () at ../init/main.c:1354 >>>>>> #21 kernel_init_freeable () at ../init/main.c:1571 >>>>>> #22 0xffffffff81f64be1 in kernel_init (unused=<optimized out>) >>>>>> at ../init/main.c:1462 >>>>>> #23 0xffffffff81001f49 in ret_from_fork () at >>>>>> ../arch/x86/entry/entry_64.S:308 >>>>>> #24 0x0000000000000000 in ?? () >>>>>> >>>>>> Fix the problem by preventing xen_grant_init_backend_domid() from >>>>>> setting dom0 as a backend when running in dom0. >>>>>> >>>>>> Fixes: 035e3a4321f7 ("xen/virtio: Optimize the setup of >>>>>> "xen-grant-dma" devices") >>>>> >>>>> >>>>> I am not 100% sure whether the Fixes tag points to precise commit. >>>>> If I >>>>> am not mistaken, the said commit just moves the code in the context >>>>> without changing the logic of CONFIG_XEN_VIRTIO_FORCE_GRANT, this was >>>>> introduced before. >>>>> >>>>> >>>>>> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> >>>>>> --- >>>>>> drivers/xen/grant-dma-ops.c | 4 +++- >>>>>> 1 file changed, 3 insertions(+), 1 deletion(-) >>>>>> >>>>>> diff --git a/drivers/xen/grant-dma-ops.c >>>>>> b/drivers/xen/grant-dma-ops.c >>>>>> index 76f6f26265a3..29ed27ac450e 100644 >>>>>> --- a/drivers/xen/grant-dma-ops.c >>>>>> +++ b/drivers/xen/grant-dma-ops.c >>>>>> @@ -362,7 +362,9 @@ static int xen_grant_init_backend_domid(struct >>>>>> device *dev, >>>>>> if (np) { >>>>>> ret = xen_dt_grant_init_backend_domid(dev, np, >>>>>> backend_domid); >>>>>> of_node_put(np); >>>>>> - } else if (IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || >>>>>> xen_pv_domain()) { >>>>>> + } else if ((IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || >>>>>> + xen_pv_domain()) && >>>>>> + !xen_initial_domain()) { >>>>> >>>>> The commit lgtm, just one note: >>>>> >>>>> >>>>> I would even bail out early in xen_virtio_restricted_mem_acc() >>>>> instead, >>>>> as I assume the same issue could happen on Arm with DT (although there >>>>> we don't guess the backend's domid, we read it from DT and quite >>>>> unlikely we get Dom0 being in Dom0 with correct DT). >>>>> >>>>> Something like: >>>>> >>>>> @@ -416,6 +421,10 @@ bool xen_virtio_restricted_mem_acc(struct >>>>> virtio_device *dev) >>>>> { >>>>> domid_t backend_domid; >>>>> >>>>> + /* Xen grant DMA ops are not used when running as initial >>>>> domain */ >>>>> + if (xen_initial_domain()) >>>>> + return false; >>>>> + >>>>> if (!xen_grant_init_backend_domid(dev->dev.parent, >>>>> &backend_domid)) { >>>>> xen_grant_setup_dma_ops(dev->dev.parent, >>>>> backend_domid); >>>>> return true; >>>>> (END) >>>>> >>>>> >>>>> >>>>> If so, that commit subject would need to be updated accordingly. >>>>> >>>>> Let's see what other reviewers will say. >>>> >>>> This doesn't work in all cases. Imagine using PCI Passthrough to assign >>>> a "physical" virtio device to a domU. The domU will run into the same >>>> error, right? >>>> >>>> The problem is that we need a way for the virtio backend to advertise >>>> its ability of handling grants. Right now we only have a way to do with >>>> that with device tree on ARM. 
On x86, we only have >>>> CONFIG_XEN_VIRTIO_FORCE_GRANT, and if we take >>>> CONFIG_XEN_VIRTIO_FORCE_GRANT at face value, it also enables grants for >>>> "physical" virtio devices. Note that in this case we are fixing a >>>> nested-virtualization bug, but there are actually physical >>>> virtio-compatible devices out there. CONFIG_XEN_VIRTIO_FORCE_GRANT will >>>> break those too. >>> >>> In case you want virtio device passthrough, you shouldn't use a kernel >>> built with CONFIG_XEN_VIRTIO_FORCE_GRANT. >>> >>> And supporting passing through virtio devices of the host to pv-domUs is >>> a security risk anyway. >>> >>> We _could_ drop the requirement of the backend needing to set >>> VIRTIO_F_ACCESS_PLATFORM for PV guests and allow grant-less virtio >>> handling for all guests. For this to work >>> xen_virtio_restricted_mem_acc() >>> would need to check for VIRTIO_F_ACCESS_PLATFORM and return true if set. >>> Maybe we'd want to enable that possibility via a boot parameter? >> >> >> Maybe, yes. I don't see at the moment why this won't work. >> >> At the same time I wonder, could we just modify xen_pv_init_platform() >> to call virtio_no_restricted_mem_acc() if forcibly disabled by boot >> parameter irrespective of VIRTIO_F_ACCESS_PLATFORM presence? > > This wouldn't work for the case where a host virtio device is passed > through > to the pv domU and at the same time another virtio device is using dom0 > as a > backend. I think we should use grants if possible. Indeed, I missed that possible scenario. I agree with the explanations, thanks. > > > Juergen >
On Fri, Jul 07, 2023 at 06:38:48AM +0200, Juergen Gross wrote: > On 06.07.23 23:49, Stefano Stabellini wrote: > > On Thu, 6 Jul 2023, Roger Pau Monné wrote: > > > On Wed, Jul 05, 2023 at 03:41:10PM -0700, Stefano Stabellini wrote: > > > > On Wed, 5 Jul 2023, Roger Pau Monné wrote: > > > > > On Tue, Jul 04, 2023 at 08:14:59PM +0300, Oleksandr Tyshchenko wrote: > > > > > > Part 2 (clarification): > > > > > > > > > > > > I think using a special config space register in the root complex would > > > > > > not be terrible in terms of guest changes because it is easy to > > > > > > introduce a new root complex driver in Linux and other OSes. The root > > > > > > complex would still be ECAM compatible so the regular ECAM driver would > > > > > > still work. A new driver would only be necessary if you want to be able > > > > > > to access the special config space register. > > > > > > > > > > I'm slightly worry of this approach, we end up modifying a root > > > > > complex emulation in order to avoid modifying a PCI device emulation > > > > > on QEMU, not sure that's a good trade off. > > > > > > > > > > Note also that different architectures will likely have different root > > > > > complex, and so you might need to modify several of them, plus then > > > > > arrange the PCI layout correctly in order to have the proper hierarchy > > > > > so that devices belonging to different driver domains are assigned to > > > > > different bridges. > > > > > > > > I do think that adding something to the PCI conf register somewhere is > > > > the best option because it is not dependent on ACPI and it is not > > > > dependent on xenstore both of which are very undesirable. > > > > > > > > I am not sure where specifically is the best place. These are 3 ideas > > > > we came up with: > > > > 1. PCI root complex > > > > 2. a register on the device itself > > > > 3. a new capability of the device > > > > 4. add one extra dummy PCI device for the sole purpose of exposing the > > > > grants capability > > > > > > > > > > > > Looking at the spec, there is a way to add a vendor-specific capability > > > > (cap_vndr = 0x9). Could we use that? It doesn't look like it is used > > > > today, Linux doesn't parse it. > > > > > > I did wonder the same from a quick look at the spec. There's however > > > a text in the specification that says: > > > > > > "The driver SHOULD NOT use the Vendor data capability except for > > > debugging and reporting purposes." > > > > > > So we would at least need to change that because the capability would > > > then be used by other purposes different than debugging and reporting. > > > > > > Seems like a minor adjustment, so might we worth asking upstream about > > > their opinion, and to get a conversation started. > > > > Wait, wouldn't this use-case fall under "reporting" ? It is exactly what > > we are doing, right? > > I'd understand "reporting" as e.g. logging, transferring statistics, ... > > We'd like to use it for configuration purposes. I've also read it that way. > Another idea would be to enhance the virtio IOMMU device to suit our needs: > we could add the domid as another virtio IOMMU device capability and (for now) > use bypass mode for all "productive" devices. If we have to start adding capabilties, won't it be easier to just add it to the each device instead of adding it to virtio IOMMU. Or is the parsing of capabilities device specific, and hence we would have to implement such parsing for each device? 
I would expect some capabilities are shared between all devices, and a Xen capability could be one of those. > Later we could even add grant-V3 support to Xen and to the virtio IOMMU device > (see my last year Xen Summit design session). This could be usable for > disaggregated KVM setups, too, so I believe there is a chance to get this > accepted. > > > > > > > ********** > > > > > > What do you think about it? Are there any pitfalls, etc? This also requires > > > > > > system changes, but at least without virtio spec changes. > > > > > > > > > > Why are we so reluctant to add spec changes? I understand this might > > > > > take time an effort, but it's the only way IMO to build a sustainable > > > > > VirtIO Xen implementation. Did we already attempt to negotiate with > > > > > Oasis Xen related spec changes and those where refused? > > > > > > > > That's because spec changes can be very slow. This is a bug that we need > > > > a relatively quick solution for and waiting 12-24 months for a spec > > > > update is not realistic. > > > > > > > > I think a spec change would be best as a long term solution. We also > > > > need a short term solution. The short term solution doesn't have to be > > > > ideal but it has to work now. > > > > > > My fear with such approach is that once a bodge is in place people > > > move on to other stuff and this never gets properly fixed. > > > > > > I know this might not be a well received opinion, but it would be > > > better if such bodge is kept in each interested party patchqueue for > > > the time being, until a proper solution is implemented. That way > > > there's an interest from parties into properly fixing it upstream. > > > > Unfortunately we are in the situation where we have an outstanding > > upstream bug, so we have to take action one way or the other. > > The required virtio IOMMU device modification would be rather small, so > adding it maybe under a CONFIG option defaulting to off might be > acceptable. Would you then do the grant allocation as part of virtio IOMMU? Thanks, Roger.
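For reference, the "Vendor data" capability being discussed is an ordinary vendor-specific PCI capability (PCI_CAP_ID_VNDR, 0x09) with a virtio-defined header; a Xen-specific payload could look roughly like the sketch below. The xen_* fields and their meaning are purely illustrative and not part of any spec.

/* Purely illustrative layout of a Xen vendor-data capability. */
struct xen_virtio_vendor_cap {
	__u8	cap_vndr;		/* PCI_CAP_ID_VNDR (0x09) */
	__u8	cap_next;		/* next capability pointer */
	__u8	cap_len;		/* length of this capability */
	__u8	cfg_type;		/* virtio vendor-data cfg type */
	__le16	vendor_id;		/* would identify the Xen-specific format */
	__le16	xen_flags;		/* hypothetical: bit 0 = "grants supported" */
	__le16	xen_backend_domid;	/* hypothetical: backend domain id */
};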
On 07.07.23 11:50, Roger Pau Monné wrote: > On Fri, Jul 07, 2023 at 06:38:48AM +0200, Juergen Gross wrote: >> On 06.07.23 23:49, Stefano Stabellini wrote: >>> On Thu, 6 Jul 2023, Roger Pau Monné wrote: >>>> On Wed, Jul 05, 2023 at 03:41:10PM -0700, Stefano Stabellini wrote: >>>>> On Wed, 5 Jul 2023, Roger Pau Monné wrote: >>>>>> On Tue, Jul 04, 2023 at 08:14:59PM +0300, Oleksandr Tyshchenko wrote: >>>>>>> Part 2 (clarification): >>>>>>> >>>>>>> I think using a special config space register in the root complex would >>>>>>> not be terrible in terms of guest changes because it is easy to >>>>>>> introduce a new root complex driver in Linux and other OSes. The root >>>>>>> complex would still be ECAM compatible so the regular ECAM driver would >>>>>>> still work. A new driver would only be necessary if you want to be able >>>>>>> to access the special config space register. >>>>>> >>>>>> I'm slightly worry of this approach, we end up modifying a root >>>>>> complex emulation in order to avoid modifying a PCI device emulation >>>>>> on QEMU, not sure that's a good trade off. >>>>>> >>>>>> Note also that different architectures will likely have different root >>>>>> complex, and so you might need to modify several of them, plus then >>>>>> arrange the PCI layout correctly in order to have the proper hierarchy >>>>>> so that devices belonging to different driver domains are assigned to >>>>>> different bridges. >>>>> >>>>> I do think that adding something to the PCI conf register somewhere is >>>>> the best option because it is not dependent on ACPI and it is not >>>>> dependent on xenstore both of which are very undesirable. >>>>> >>>>> I am not sure where specifically is the best place. These are 3 ideas >>>>> we came up with: >>>>> 1. PCI root complex >>>>> 2. a register on the device itself >>>>> 3. a new capability of the device >>>>> 4. add one extra dummy PCI device for the sole purpose of exposing the >>>>> grants capability >>>>> >>>>> >>>>> Looking at the spec, there is a way to add a vendor-specific capability >>>>> (cap_vndr = 0x9). Could we use that? It doesn't look like it is used >>>>> today, Linux doesn't parse it. >>>> >>>> I did wonder the same from a quick look at the spec. There's however >>>> a text in the specification that says: >>>> >>>> "The driver SHOULD NOT use the Vendor data capability except for >>>> debugging and reporting purposes." >>>> >>>> So we would at least need to change that because the capability would >>>> then be used by other purposes different than debugging and reporting. >>>> >>>> Seems like a minor adjustment, so might we worth asking upstream about >>>> their opinion, and to get a conversation started. >>> >>> Wait, wouldn't this use-case fall under "reporting" ? It is exactly what >>> we are doing, right? >> >> I'd understand "reporting" as e.g. logging, transferring statistics, ... >> >> We'd like to use it for configuration purposes. > > I've also read it that way. > >> Another idea would be to enhance the virtio IOMMU device to suit our needs: >> we could add the domid as another virtio IOMMU device capability and (for now) >> use bypass mode for all "productive" devices. > > If we have to start adding capabilties, won't it be easier to just add > it to the each device instead of adding it to virtio IOMMU. Or is the > parsing of capabilities device specific, and hence we would have to > implement such parsing for each device? I would expect some > capabilities are shared between all devices, and a Xen capability could > be one of those. 
Have a look at [1], which is describing the common device config layout. The problem here is that we'd need to add the domid after the queue specific data, resulting in a mess if further queue fields would be added later. We could try that, of course. > >> Later we could even add grant-V3 support to Xen and to the virtio IOMMU device >> (see my last year Xen Summit design session). This could be usable for >> disaggregated KVM setups, too, so I believe there is a chance to get this >> accepted. >> >>>>>>> ********** >>>>>>> What do you think about it? Are there any pitfalls, etc? This also requires >>>>>>> system changes, but at least without virtio spec changes. >>>>>> >>>>>> Why are we so reluctant to add spec changes? I understand this might >>>>>> take time an effort, but it's the only way IMO to build a sustainable >>>>>> VirtIO Xen implementation. Did we already attempt to negotiate with >>>>>> Oasis Xen related spec changes and those where refused? >>>>> >>>>> That's because spec changes can be very slow. This is a bug that we need >>>>> a relatively quick solution for and waiting 12-24 months for a spec >>>>> update is not realistic. >>>>> >>>>> I think a spec change would be best as a long term solution. We also >>>>> need a short term solution. The short term solution doesn't have to be >>>>> ideal but it has to work now. >>>> >>>> My fear with such approach is that once a bodge is in place people >>>> move on to other stuff and this never gets properly fixed. >>>> >>>> I know this might not be a well received opinion, but it would be >>>> better if such bodge is kept in each interested party patchqueue for >>>> the time being, until a proper solution is implemented. That way >>>> there's an interest from parties into properly fixing it upstream. >>> >>> Unfortunately we are in the situation where we have an outstanding >>> upstream bug, so we have to take action one way or the other. >> >> The required virtio IOMMU device modification would be rather small, so >> adding it maybe under a CONFIG option defaulting to off might be >> acceptable. > > Would you then do the grant allocation as part of virtio IOMMU? Long term, maybe. Do you remember my Grant-V3 design session last year? Being able to reuse the same layout for virtio IOMMU was one of the basic ideas for that layout (this would need some heavy work on the virtio IOMMU frontend and backend, of course). Juergen [1]: http://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.pdf#subsubsection.4.1.4.3
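For readers without the spec open, the common configuration layout referenced in [1] (mirrored by struct virtio_pci_common_cfg in the Linux uapi headers) ends with the per-virtqueue fields, so any new device-wide field would have to be appended after them, which is exactly the ordering problem described above:

/* Abridged from include/uapi/linux/virtio_pci.h */
struct virtio_pci_common_cfg {
	/* About the whole device. */
	__le32 device_feature_select;	/* read-write */
	__le32 device_feature;		/* read-only */
	__le32 guest_feature_select;	/* read-write */
	__le32 guest_feature;		/* read-write */
	__le16 msix_config;		/* read-write */
	__le16 num_queues;		/* read-only */
	__u8   device_status;		/* read-write */
	__u8   config_generation;	/* read-only */

	/* About a specific virtqueue. */
	__le16 queue_select;		/* read-write */
	/* ... queue_size, queue_msix_vector, queue_enable,
	 * queue_notify_off, queue descriptor/avail/used addresses ... */

	/*
	 * A hypothetical Xen field could only be appended here, after the
	 * queue block, and would then interleave with any queue fields
	 * added by later spec revisions.
	 */
};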
On 07.07.23 16:10, Juergen Gross wrote: > On 07.07.23 11:50, Roger Pau Monné wrote: >> On Fri, Jul 07, 2023 at 06:38:48AM +0200, Juergen Gross wrote: >>> On 06.07.23 23:49, Stefano Stabellini wrote: >>>> On Thu, 6 Jul 2023, Roger Pau Monné wrote: >>>>> On Wed, Jul 05, 2023 at 03:41:10PM -0700, Stefano Stabellini wrote: >>>>>> On Wed, 5 Jul 2023, Roger Pau Monné wrote: >>>>>>> On Tue, Jul 04, 2023 at 08:14:59PM +0300, Oleksandr Tyshchenko wrote: >>>>>>>> Part 2 (clarification): >>>>>>>> >>>>>>>> I think using a special config space register in the root complex would >>>>>>>> not be terrible in terms of guest changes because it is easy to >>>>>>>> introduce a new root complex driver in Linux and other OSes. The root >>>>>>>> complex would still be ECAM compatible so the regular ECAM driver would >>>>>>>> still work. A new driver would only be necessary if you want to be able >>>>>>>> to access the special config space register. >>>>>>> >>>>>>> I'm slightly worry of this approach, we end up modifying a root >>>>>>> complex emulation in order to avoid modifying a PCI device emulation >>>>>>> on QEMU, not sure that's a good trade off. >>>>>>> >>>>>>> Note also that different architectures will likely have different root >>>>>>> complex, and so you might need to modify several of them, plus then >>>>>>> arrange the PCI layout correctly in order to have the proper hierarchy >>>>>>> so that devices belonging to different driver domains are assigned to >>>>>>> different bridges. >>>>>> >>>>>> I do think that adding something to the PCI conf register somewhere is >>>>>> the best option because it is not dependent on ACPI and it is not >>>>>> dependent on xenstore both of which are very undesirable. >>>>>> >>>>>> I am not sure where specifically is the best place. These are 3 ideas >>>>>> we came up with: >>>>>> 1. PCI root complex >>>>>> 2. a register on the device itself >>>>>> 3. a new capability of the device >>>>>> 4. add one extra dummy PCI device for the sole purpose of exposing the >>>>>> grants capability >>>>>> >>>>>> >>>>>> Looking at the spec, there is a way to add a vendor-specific capability >>>>>> (cap_vndr = 0x9). Could we use that? It doesn't look like it is used >>>>>> today, Linux doesn't parse it. >>>>> >>>>> I did wonder the same from a quick look at the spec. There's however >>>>> a text in the specification that says: >>>>> >>>>> "The driver SHOULD NOT use the Vendor data capability except for >>>>> debugging and reporting purposes." >>>>> >>>>> So we would at least need to change that because the capability would >>>>> then be used by other purposes different than debugging and reporting. >>>>> >>>>> Seems like a minor adjustment, so might we worth asking upstream about >>>>> their opinion, and to get a conversation started. >>>> >>>> Wait, wouldn't this use-case fall under "reporting" ? It is exactly what >>>> we are doing, right? >>> >>> I'd understand "reporting" as e.g. logging, transferring statistics, ... >>> >>> We'd like to use it for configuration purposes. >> >> I've also read it that way. >> >>> Another idea would be to enhance the virtio IOMMU device to suit our needs: >>> we could add the domid as another virtio IOMMU device capability and (for now) >>> use bypass mode for all "productive" devices. >> >> If we have to start adding capabilties, won't it be easier to just add >> it to the each device instead of adding it to virtio IOMMU. Or is the >> parsing of capabilities device specific, and hence we would have to >> implement such parsing for each device? 
I would expect some >> capabilities are shared between all devices, and a Xen capability could >> be one of those. > > Have a look at [1], which is describing the common device config layout. > The problem here is that we'd need to add the domid after the queue specific > data, resulting in a mess if further queue fields would be added later. > > We could try that, of course. Thinking more about it, the virtio IOMMU device seems to be a better fit: In case we'd add the domid to the device's PCI config space, the value would be controlled by the backend domain. IMO the domid passed to the frontend should be controlled by a trusted entity (dom0 or the hypervisor), which would be the natural backend of the virtio IOMMU device. Juergen
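As an illustration of that direction (not an existing or proposed spec layout), the virtio-iommu config space could grow a feature-gated field carrying the backend domid; the feature bit and field names below are made up:

/* Hypothetical sketch only; feature bit and field names are invented. */
#define VIRTIO_IOMMU_F_XEN_GRANTS	8	/* hypothetical feature bit */

struct virtio_iommu_config {
	__le64	page_size_mask;
	/* ... existing fields: input/domain ranges, probe_size, bypass ... */

	/*
	 * New, hypothetical field: only meaningful when
	 * VIRTIO_IOMMU_F_XEN_GRANTS was negotiated; filled in by the
	 * trusted entity (dom0 or the hypervisor) providing the virtio
	 * IOMMU device, as argued above.
	 */
	__le16	xen_backend_domid;
};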
On Fri, Jul 07, 2023 at 04:10:14PM +0200, Juergen Gross wrote: > On 07.07.23 11:50, Roger Pau Monné wrote: > > On Fri, Jul 07, 2023 at 06:38:48AM +0200, Juergen Gross wrote: > > > On 06.07.23 23:49, Stefano Stabellini wrote: > > > > On Thu, 6 Jul 2023, Roger Pau Monné wrote: > > > > > On Wed, Jul 05, 2023 at 03:41:10PM -0700, Stefano Stabellini wrote: > > > > > > On Wed, 5 Jul 2023, Roger Pau Monné wrote: > > > > > > > On Tue, Jul 04, 2023 at 08:14:59PM +0300, Oleksandr Tyshchenko wrote: > > > > > > > > Part 2 (clarification): > > > > > > > > > > > > > > > > I think using a special config space register in the root complex would > > > > > > > > not be terrible in terms of guest changes because it is easy to > > > > > > > > introduce a new root complex driver in Linux and other OSes. The root > > > > > > > > complex would still be ECAM compatible so the regular ECAM driver would > > > > > > > > still work. A new driver would only be necessary if you want to be able > > > > > > > > to access the special config space register. > > > > > > > > > > > > > > I'm slightly worry of this approach, we end up modifying a root > > > > > > > complex emulation in order to avoid modifying a PCI device emulation > > > > > > > on QEMU, not sure that's a good trade off. > > > > > > > > > > > > > > Note also that different architectures will likely have different root > > > > > > > complex, and so you might need to modify several of them, plus then > > > > > > > arrange the PCI layout correctly in order to have the proper hierarchy > > > > > > > so that devices belonging to different driver domains are assigned to > > > > > > > different bridges. > > > > > > > > > > > > I do think that adding something to the PCI conf register somewhere is > > > > > > the best option because it is not dependent on ACPI and it is not > > > > > > dependent on xenstore both of which are very undesirable. > > > > > > > > > > > > I am not sure where specifically is the best place. These are 3 ideas > > > > > > we came up with: > > > > > > 1. PCI root complex > > > > > > 2. a register on the device itself > > > > > > 3. a new capability of the device > > > > > > 4. add one extra dummy PCI device for the sole purpose of exposing the > > > > > > grants capability > > > > > > > > > > > > > > > > > > Looking at the spec, there is a way to add a vendor-specific capability > > > > > > (cap_vndr = 0x9). Could we use that? It doesn't look like it is used > > > > > > today, Linux doesn't parse it. > > > > > > > > > > I did wonder the same from a quick look at the spec. There's however > > > > > a text in the specification that says: > > > > > > > > > > "The driver SHOULD NOT use the Vendor data capability except for > > > > > debugging and reporting purposes." > > > > > > > > > > So we would at least need to change that because the capability would > > > > > then be used by other purposes different than debugging and reporting. > > > > > > > > > > Seems like a minor adjustment, so might we worth asking upstream about > > > > > their opinion, and to get a conversation started. > > > > > > > > Wait, wouldn't this use-case fall under "reporting" ? It is exactly what > > > > we are doing, right? > > > > > > I'd understand "reporting" as e.g. logging, transferring statistics, ... > > > > > > We'd like to use it for configuration purposes. > > > > I've also read it that way. 
> > > > > Another idea would be to enhance the virtio IOMMU device to suit our needs: > > > we could add the domid as another virtio IOMMU device capability and (for now) > > > use bypass mode for all "productive" devices. > > > > If we have to start adding capabilties, won't it be easier to just add > > it to the each device instead of adding it to virtio IOMMU. Or is the > > parsing of capabilities device specific, and hence we would have to > > implement such parsing for each device? I would expect some > > capabilities are shared between all devices, and a Xen capability could > > be one of those. > > Have a look at [1], which is describing the common device config layout. > The problem here is that we'd need to add the domid after the queue specific > data, resulting in a mess if further queue fields would be added later. > > We could try that, of course. Right, we must make it part of the standard if we modify virtio_pci_common_cfg, or else newly added fields would overlap the Xen specific one. Would it be possible to signal Xen-grants support in the `device_feature` field, and then expose it from a vendor capability? IOW, would it be possible to add a Xen-specific hook in the parsing of virtio_pci_common_cfg that would then fetch additional data from a capability? That would likely be less intrusive than adding a new Xen-specific field to virtio_pci_common_cfg while still allowing us to do Xen specific configuration for all VirtIO devices. > > > > > Later we could even add grant-V3 support to Xen and to the virtio IOMMU device > > > (see my last year Xen Summit design session). This could be usable for > > > disaggregated KVM setups, too, so I believe there is a chance to get this > > > accepted. > > > > > > > > > > > ********** > > > > > > > > What do you think about it? Are there any pitfalls, etc? This also requires > > > > > > > > system changes, but at least without virtio spec changes. > > > > > > > > > > > > > > Why are we so reluctant to add spec changes? I understand this might > > > > > > > take time an effort, but it's the only way IMO to build a sustainable > > > > > > > VirtIO Xen implementation. Did we already attempt to negotiate with > > > > > > > Oasis Xen related spec changes and those where refused? > > > > > > > > > > > > That's because spec changes can be very slow. This is a bug that we need > > > > > > a relatively quick solution for and waiting 12-24 months for a spec > > > > > > update is not realistic. > > > > > > > > > > > > I think a spec change would be best as a long term solution. We also > > > > > > need a short term solution. The short term solution doesn't have to be > > > > > > ideal but it has to work now. > > > > > > > > > > My fear with such approach is that once a bodge is in place people > > > > > move on to other stuff and this never gets properly fixed. > > > > > > > > > > I know this might not be a well received opinion, but it would be > > > > > better if such bodge is kept in each interested party patchqueue for > > > > > the time being, until a proper solution is implemented. That way > > > > > there's an interest from parties into properly fixing it upstream. > > > > > > > > Unfortunately we are in the situation where we have an outstanding > > > > upstream bug, so we have to take action one way or the other. > > > > > > The required virtio IOMMU device modification would be rather small, so > > > adding it maybe under a CONFIG option defaulting to off might be > > > acceptable. 
> > > > Would you then do the grant allocation as part of virtio IOMMU? > > Long term, maybe. Do you remember my Grant-V3 design session last year? Being > able to reuse the same layout for virtio IOMMU was one of the basic ideas for > that layout (this would need some heavy work on the virtio IOMMU frontend and > backend, of course). While this might well be the best option, do we have anyone with the time and expertise to work on this? I might be wrong, but it seems like a huge task. Thanks, Roger.
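A rough sketch of the guest-side hook suggested above, assuming a standardized feature bit plus vendor capability existed; the capability layout, offsets and the XEN_VIRTIO_VENDOR_ID constant are invented for illustration, only the generic PCI helpers are real:

/*
 * Sketch only: look up a hypothetical Xen vendor capability on a virtio
 * PCI device and read the backend domid out of it.
 */
static int virtio_pci_read_xen_domid(struct pci_dev *pci_dev, u16 *domid)
{
	u8 pos;

	for (pos = pci_find_capability(pci_dev, PCI_CAP_ID_VNDR);
	     pos;
	     pos = pci_find_next_capability(pci_dev, pos, PCI_CAP_ID_VNDR)) {
		u16 vendor;

		/* Hypothetical layout: vendor id at offset 4, domid at 6. */
		pci_read_config_word(pci_dev, pos + 4, &vendor);
		if (vendor != XEN_VIRTIO_VENDOR_ID)	/* made-up constant */
			continue;

		pci_read_config_word(pci_dev, pos + 6, domid);
		return 0;
	}

	return -ENODEV;
}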
On Fri, Jul 07, 2023 at 04:27:59PM +0200, Juergen Gross wrote: > On 07.07.23 16:10, Juergen Gross wrote: > > On 07.07.23 11:50, Roger Pau Monné wrote: > > > On Fri, Jul 07, 2023 at 06:38:48AM +0200, Juergen Gross wrote: > > > > On 06.07.23 23:49, Stefano Stabellini wrote: > > > > > On Thu, 6 Jul 2023, Roger Pau Monné wrote: > > > > > > On Wed, Jul 05, 2023 at 03:41:10PM -0700, Stefano Stabellini wrote: > > > > > > > On Wed, 5 Jul 2023, Roger Pau Monné wrote: > > > > > > > > On Tue, Jul 04, 2023 at 08:14:59PM +0300, Oleksandr Tyshchenko wrote: > > > > > > > > > Part 2 (clarification): > > > > > > > > > > > > > > > > > > I think using a special config space register in the root complex would > > > > > > > > > not be terrible in terms of guest changes because it is easy to > > > > > > > > > introduce a new root complex driver in Linux and other OSes. The root > > > > > > > > > complex would still be ECAM compatible so the regular ECAM driver would > > > > > > > > > still work. A new driver would only be necessary if you want to be able > > > > > > > > > to access the special config space register. > > > > > > > > > > > > > > > > I'm slightly worry of this approach, we end up modifying a root > > > > > > > > complex emulation in order to avoid modifying a PCI device emulation > > > > > > > > on QEMU, not sure that's a good trade off. > > > > > > > > > > > > > > > > Note also that different architectures will likely have different root > > > > > > > > complex, and so you might need to modify several of them, plus then > > > > > > > > arrange the PCI layout correctly in order to have the proper hierarchy > > > > > > > > so that devices belonging to different driver domains are assigned to > > > > > > > > different bridges. > > > > > > > > > > > > > > I do think that adding something to the PCI conf register somewhere is > > > > > > > the best option because it is not dependent on ACPI and it is not > > > > > > > dependent on xenstore both of which are very undesirable. > > > > > > > > > > > > > > I am not sure where specifically is the best place. These are 3 ideas > > > > > > > we came up with: > > > > > > > 1. PCI root complex > > > > > > > 2. a register on the device itself > > > > > > > 3. a new capability of the device > > > > > > > 4. add one extra dummy PCI device for the sole purpose of exposing the > > > > > > > grants capability > > > > > > > > > > > > > > > > > > > > > Looking at the spec, there is a way to add a vendor-specific capability > > > > > > > (cap_vndr = 0x9). Could we use that? It doesn't look like it is used > > > > > > > today, Linux doesn't parse it. > > > > > > > > > > > > I did wonder the same from a quick look at the spec. There's however > > > > > > a text in the specification that says: > > > > > > > > > > > > "The driver SHOULD NOT use the Vendor data capability except for > > > > > > debugging and reporting purposes." > > > > > > > > > > > > So we would at least need to change that because the capability would > > > > > > then be used by other purposes different than debugging and reporting. > > > > > > > > > > > > Seems like a minor adjustment, so might we worth asking upstream about > > > > > > their opinion, and to get a conversation started. > > > > > > > > > > Wait, wouldn't this use-case fall under "reporting" ? It is exactly what > > > > > we are doing, right? > > > > > > > > I'd understand "reporting" as e.g. logging, transferring statistics, ... > > > > > > > > We'd like to use it for configuration purposes. > > > > > > I've also read it that way. 
> > > > > Another idea would be to enhance the virtio IOMMU device to suit our needs: > > > > we could add the domid as another virtio IOMMU device capability and (for now) > > > > use bypass mode for all "productive" devices. > > > > > > If we have to start adding capabilties, won't it be easier to just add > > > it to the each device instead of adding it to virtio IOMMU. Or is the > > > parsing of capabilities device specific, and hence we would have to > > > implement such parsing for each device? I would expect some > > > capabilities are shared between all devices, and a Xen capability could > > > be one of those. > > > > Have a look at [1], which is describing the common device config layout. > > The problem here is that we'd need to add the domid after the queue specific > > data, resulting in a mess if further queue fields would be added later. > > > > We could try that, of course. > > Thinking more about it, the virtio IOMMU device seems to be a better fit: > > In case we'd add the domid to the device's PCI config space, the value would > be controlled by the backend domain. IMO the domid passed to the frontend > should be controlled by a trusted entity (dom0 or the hypervisor), which > would be the natural backend of the virtio IOMMU device. Hm, yes. I'm however failing to see how a backend could exploit that. The guest would be granting memory to a different domain than the one running the backend, but otherwise that memory would be granted to the backend domain, which could then also make it available to other domains (without having to play with the reported backend domid). Thanks, Roger.
On 07.07.23 16:42, Roger Pau Monné wrote: > On Fri, Jul 07, 2023 at 04:10:14PM +0200, Juergen Gross wrote: >> On 07.07.23 11:50, Roger Pau Monné wrote: >>> On Fri, Jul 07, 2023 at 06:38:48AM +0200, Juergen Gross wrote: >>>> On 06.07.23 23:49, Stefano Stabellini wrote: >>>>> On Thu, 6 Jul 2023, Roger Pau Monné wrote: >>>>>> On Wed, Jul 05, 2023 at 03:41:10PM -0700, Stefano Stabellini wrote: >>>>>>> On Wed, 5 Jul 2023, Roger Pau Monné wrote: >>>>>>>> On Tue, Jul 04, 2023 at 08:14:59PM +0300, Oleksandr Tyshchenko wrote: >>>>>>>>> Part 2 (clarification): >>>>>>>>> >>>>>>>>> I think using a special config space register in the root complex would >>>>>>>>> not be terrible in terms of guest changes because it is easy to >>>>>>>>> introduce a new root complex driver in Linux and other OSes. The root >>>>>>>>> complex would still be ECAM compatible so the regular ECAM driver would >>>>>>>>> still work. A new driver would only be necessary if you want to be able >>>>>>>>> to access the special config space register. >>>>>>>> >>>>>>>> I'm slightly worry of this approach, we end up modifying a root >>>>>>>> complex emulation in order to avoid modifying a PCI device emulation >>>>>>>> on QEMU, not sure that's a good trade off. >>>>>>>> >>>>>>>> Note also that different architectures will likely have different root >>>>>>>> complex, and so you might need to modify several of them, plus then >>>>>>>> arrange the PCI layout correctly in order to have the proper hierarchy >>>>>>>> so that devices belonging to different driver domains are assigned to >>>>>>>> different bridges. >>>>>>> >>>>>>> I do think that adding something to the PCI conf register somewhere is >>>>>>> the best option because it is not dependent on ACPI and it is not >>>>>>> dependent on xenstore both of which are very undesirable. >>>>>>> >>>>>>> I am not sure where specifically is the best place. These are 3 ideas >>>>>>> we came up with: >>>>>>> 1. PCI root complex >>>>>>> 2. a register on the device itself >>>>>>> 3. a new capability of the device >>>>>>> 4. add one extra dummy PCI device for the sole purpose of exposing the >>>>>>> grants capability >>>>>>> >>>>>>> >>>>>>> Looking at the spec, there is a way to add a vendor-specific capability >>>>>>> (cap_vndr = 0x9). Could we use that? It doesn't look like it is used >>>>>>> today, Linux doesn't parse it. >>>>>> >>>>>> I did wonder the same from a quick look at the spec. There's however >>>>>> a text in the specification that says: >>>>>> >>>>>> "The driver SHOULD NOT use the Vendor data capability except for >>>>>> debugging and reporting purposes." >>>>>> >>>>>> So we would at least need to change that because the capability would >>>>>> then be used by other purposes different than debugging and reporting. >>>>>> >>>>>> Seems like a minor adjustment, so might we worth asking upstream about >>>>>> their opinion, and to get a conversation started. >>>>> >>>>> Wait, wouldn't this use-case fall under "reporting" ? It is exactly what >>>>> we are doing, right? >>>> >>>> I'd understand "reporting" as e.g. logging, transferring statistics, ... >>>> >>>> We'd like to use it for configuration purposes. >>> >>> I've also read it that way. >>> >>>> Another idea would be to enhance the virtio IOMMU device to suit our needs: >>>> we could add the domid as another virtio IOMMU device capability and (for now) >>>> use bypass mode for all "productive" devices. 
>>> >>> If we have to start adding capabilties, won't it be easier to just add >>> it to the each device instead of adding it to virtio IOMMU. Or is the >>> parsing of capabilities device specific, and hence we would have to >>> implement such parsing for each device? I would expect some >>> capabilities are shared between all devices, and a Xen capability could >>> be one of those. >> >> Have a look at [1], which is describing the common device config layout. >> The problem here is that we'd need to add the domid after the queue specific >> data, resulting in a mess if further queue fields would be added later. >> >> We could try that, of course. > > Right, we must make it part of the standard if we modify > virtio_pci_common_cfg, or else newly added fields would overlap the > Xen specific one. > > Would it be possible to signal Xen-grants support in the > `device_feature` field, and then expose it from a vendor capability? > IOW, would it be possible to add a Xen-specific hook in the parsing of > virtio_pci_common_cfg that would then fetch additional data from a > capability? TBH, I don't know. It might require some changes in the central parsing logic, but this shouldn't be too hard to do. > That would likely be less intrusive than adding a new Xen-specific > field to virtio_pci_common_cfg while still allowing us to do Xen > specific configuration for all VirtIO devices. In case we want to go that route, this should be in a new "platform config" capability, which might be just another form of a vendor capability. > >>> >>>> Later we could even add grant-V3 support to Xen and to the virtio IOMMU device >>>> (see my last year Xen Summit design session). This could be usable for >>>> disaggregated KVM setups, too, so I believe there is a chance to get this >>>> accepted. >>>> >>>>>>>>> ********** >>>>>>>>> What do you think about it? Are there any pitfalls, etc? This also requires >>>>>>>>> system changes, but at least without virtio spec changes. >>>>>>>> >>>>>>>> Why are we so reluctant to add spec changes? I understand this might >>>>>>>> take time an effort, but it's the only way IMO to build a sustainable >>>>>>>> VirtIO Xen implementation. Did we already attempt to negotiate with >>>>>>>> Oasis Xen related spec changes and those where refused? >>>>>>> >>>>>>> That's because spec changes can be very slow. This is a bug that we need >>>>>>> a relatively quick solution for and waiting 12-24 months for a spec >>>>>>> update is not realistic. >>>>>>> >>>>>>> I think a spec change would be best as a long term solution. We also >>>>>>> need a short term solution. The short term solution doesn't have to be >>>>>>> ideal but it has to work now. >>>>>> >>>>>> My fear with such approach is that once a bodge is in place people >>>>>> move on to other stuff and this never gets properly fixed. >>>>>> >>>>>> I know this might not be a well received opinion, but it would be >>>>>> better if such bodge is kept in each interested party patchqueue for >>>>>> the time being, until a proper solution is implemented. That way >>>>>> there's an interest from parties into properly fixing it upstream. >>>>> >>>>> Unfortunately we are in the situation where we have an outstanding >>>>> upstream bug, so we have to take action one way or the other. >>>> >>>> The required virtio IOMMU device modification would be rather small, so >>>> adding it maybe under a CONFIG option defaulting to off might be >>>> acceptable. >>> >>> Would you then do the grant allocation as part of virtio IOMMU? >> >> Long term, maybe. 
Do you remember my Grant-V3 design session last year? Being >> able to reuse the same layout for virtio IOMMU was one of the basic ideas for >> that layout (this would need some heavy work on the virtio IOMMU frontend and >> backend, of course). > > While this might well be the best option, do we have anyone with the > time and expertise to work on this? I might be wrong, but it seems > like a huge task. As a background project I'd like to pursue it. OTOH I'm not sure how much time I could spend on it. Juergen
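For reference, the common device config layout referred to as [1] above is struct virtio_pci_common_cfg. The difficulty described is that the per-queue registers sit at the end of the structure, so a Xen domid could only be appended after them and would then collide with any queue field a later spec revision adds. A sketch follows; the field list is recalled from include/uapi/linux/virtio_pci.h and may be incomplete, and the trailing xen_backend_domid field is hypothetical, present only to illustrate the problem.

#include <linux/types.h>

/* Sketch of the virtio-pci common configuration structure; some fields
 * may be missing or named slightly differently in the real header. */
struct virtio_pci_common_cfg {
        /* About the whole device. */
        __le32 device_feature_select;   /* read-write */
        __le32 device_feature;          /* read-only */
        __le32 guest_feature_select;    /* read-write */
        __le32 guest_feature;           /* read-write */
        __le16 msix_config;             /* read-write */
        __le16 num_queues;              /* read-only */
        __u8   device_status;           /* read-write */
        __u8   config_generation;       /* read-only */

        /* About a specific virtqueue, selected via queue_select. */
        __le16 queue_select;
        __le16 queue_size;
        __le16 queue_msix_vector;
        __le16 queue_enable;
        __le16 queue_notify_off;
        __le32 queue_desc_lo;
        __le32 queue_desc_hi;
        __le32 queue_avail_lo;
        __le32 queue_avail_hi;
        __le32 queue_used_lo;
        __le32 queue_used_hi;

        /* Hypothetical Xen addition: it could only go here, after the
         * queue-specific registers, so any queue field added by a later
         * spec revision would have to be placed behind it -- the "mess"
         * mentioned above. Not part of the virtio spec. */
        __le16 xen_backend_domid;
};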
On Fri, Jul 07, 2023 at 05:01:38PM +0200, Juergen Gross wrote: > On 07.07.23 16:42, Roger Pau Monné wrote: > > On Fri, Jul 07, 2023 at 04:10:14PM +0200, Juergen Gross wrote: > > > On 07.07.23 11:50, Roger Pau Monné wrote: > > > > On Fri, Jul 07, 2023 at 06:38:48AM +0200, Juergen Gross wrote: > > > > > On 06.07.23 23:49, Stefano Stabellini wrote: > > > > > > On Thu, 6 Jul 2023, Roger Pau Monné wrote: > > > > > > > On Wed, Jul 05, 2023 at 03:41:10PM -0700, Stefano Stabellini wrote: > > > > > > > > On Wed, 5 Jul 2023, Roger Pau Monné wrote: > > > > > > > > > On Tue, Jul 04, 2023 at 08:14:59PM +0300, Oleksandr Tyshchenko wrote: > > > > > > > > > > Part 2 (clarification): > > > > > > > > > > > > > > > > > > > > I think using a special config space register in the root complex would > > > > > > > > > > not be terrible in terms of guest changes because it is easy to > > > > > > > > > > introduce a new root complex driver in Linux and other OSes. The root > > > > > > > > > > complex would still be ECAM compatible so the regular ECAM driver would > > > > > > > > > > still work. A new driver would only be necessary if you want to be able > > > > > > > > > > to access the special config space register. > > > > > > > > > > > > > > > > > > I'm slightly worry of this approach, we end up modifying a root > > > > > > > > > complex emulation in order to avoid modifying a PCI device emulation > > > > > > > > > on QEMU, not sure that's a good trade off. > > > > > > > > > > > > > > > > > > Note also that different architectures will likely have different root > > > > > > > > > complex, and so you might need to modify several of them, plus then > > > > > > > > > arrange the PCI layout correctly in order to have the proper hierarchy > > > > > > > > > so that devices belonging to different driver domains are assigned to > > > > > > > > > different bridges. > > > > > > > > > > > > > > > > I do think that adding something to the PCI conf register somewhere is > > > > > > > > the best option because it is not dependent on ACPI and it is not > > > > > > > > dependent on xenstore both of which are very undesirable. > > > > > > > > > > > > > > > > I am not sure where specifically is the best place. These are 3 ideas > > > > > > > > we came up with: > > > > > > > > 1. PCI root complex > > > > > > > > 2. a register on the device itself > > > > > > > > 3. a new capability of the device > > > > > > > > 4. add one extra dummy PCI device for the sole purpose of exposing the > > > > > > > > grants capability > > > > > > > > > > > > > > > > > > > > > > > > Looking at the spec, there is a way to add a vendor-specific capability > > > > > > > > (cap_vndr = 0x9). Could we use that? It doesn't look like it is used > > > > > > > > today, Linux doesn't parse it. > > > > > > > > > > > > > > I did wonder the same from a quick look at the spec. There's however > > > > > > > a text in the specification that says: > > > > > > > > > > > > > > "The driver SHOULD NOT use the Vendor data capability except for > > > > > > > debugging and reporting purposes." > > > > > > > > > > > > > > So we would at least need to change that because the capability would > > > > > > > then be used by other purposes different than debugging and reporting. > > > > > > > > > > > > > > Seems like a minor adjustment, so might we worth asking upstream about > > > > > > > their opinion, and to get a conversation started. > > > > > > > > > > > > Wait, wouldn't this use-case fall under "reporting" ? It is exactly what > > > > > > we are doing, right? 
> > > > > > > > > > I'd understand "reporting" as e.g. logging, transferring statistics, ... > > > > > > > > > > We'd like to use it for configuration purposes. > > > > > > > > I've also read it that way. > > > > > > > > > Another idea would be to enhance the virtio IOMMU device to suit our needs: > > > > > we could add the domid as another virtio IOMMU device capability and (for now) > > > > > use bypass mode for all "productive" devices. > > > > > > > > If we have to start adding capabilties, won't it be easier to just add > > > > it to the each device instead of adding it to virtio IOMMU. Or is the > > > > parsing of capabilities device specific, and hence we would have to > > > > implement such parsing for each device? I would expect some > > > > capabilities are shared between all devices, and a Xen capability could > > > > be one of those. > > > > > > Have a look at [1], which is describing the common device config layout. > > > The problem here is that we'd need to add the domid after the queue specific > > > data, resulting in a mess if further queue fields would be added later. > > > > > > We could try that, of course. > > > > Right, we must make it part of the standard if we modify > > virtio_pci_common_cfg, or else newly added fields would overlap the > > Xen specific one. > > > > Would it be possible to signal Xen-grants support in the > > `device_feature` field, and then expose it from a vendor capability? > > IOW, would it be possible to add a Xen-specific hook in the parsing of > > virtio_pci_common_cfg that would then fetch additional data from a > > capability? > > TBH, I don't know. It might require some changes in the central parsing > logic, but this shouldn't be too hard to do. > > > That would likely be less intrusive than adding a new Xen-specific > > field to virtio_pci_common_cfg while still allowing us to do Xen > > specific configuration for all VirtIO devices. > > In case we want to go that route, this should be in a new "platform config" > capability, which might be just another form of a vendor capability. I think telling people that they will need to implement grants-v3 in order to solve this might be too much. I would rather prefer a more concrete solution that doesn't have so many loose ends. Anyway, it's up to the person doing the job, but starting with "you will have to implement grants-v3" is quite likely to deter anyone from attempting to solve this I'm afraid. Thanks, Roger.
On 07.07.23 16:48, Roger Pau Monné wrote: > On Fri, Jul 07, 2023 at 04:27:59PM +0200, Juergen Gross wrote: >> On 07.07.23 16:10, Juergen Gross wrote: >>> On 07.07.23 11:50, Roger Pau Monné wrote: >>>> On Fri, Jul 07, 2023 at 06:38:48AM +0200, Juergen Gross wrote: >>>>> On 06.07.23 23:49, Stefano Stabellini wrote: >>>>>> On Thu, 6 Jul 2023, Roger Pau Monné wrote: >>>>>>> On Wed, Jul 05, 2023 at 03:41:10PM -0700, Stefano Stabellini wrote: >>>>>>>> On Wed, 5 Jul 2023, Roger Pau Monné wrote: >>>>>>>>> On Tue, Jul 04, 2023 at 08:14:59PM +0300, Oleksandr Tyshchenko wrote: >>>>>>>>>> Part 2 (clarification): >>>>>>>>>> >>>>>>>>>> I think using a special config space register in the root complex would >>>>>>>>>> not be terrible in terms of guest changes because it is easy to >>>>>>>>>> introduce a new root complex driver in Linux and other OSes. The root >>>>>>>>>> complex would still be ECAM compatible so the regular ECAM driver would >>>>>>>>>> still work. A new driver would only be necessary if you want to be able >>>>>>>>>> to access the special config space register. >>>>>>>>> >>>>>>>>> I'm slightly worry of this approach, we end up modifying a root >>>>>>>>> complex emulation in order to avoid modifying a PCI device emulation >>>>>>>>> on QEMU, not sure that's a good trade off. >>>>>>>>> >>>>>>>>> Note also that different architectures will likely have different root >>>>>>>>> complex, and so you might need to modify several of them, plus then >>>>>>>>> arrange the PCI layout correctly in order to have the proper hierarchy >>>>>>>>> so that devices belonging to different driver domains are assigned to >>>>>>>>> different bridges. >>>>>>>> >>>>>>>> I do think that adding something to the PCI conf register somewhere is >>>>>>>> the best option because it is not dependent on ACPI and it is not >>>>>>>> dependent on xenstore both of which are very undesirable. >>>>>>>> >>>>>>>> I am not sure where specifically is the best place. These are 3 ideas >>>>>>>> we came up with: >>>>>>>> 1. PCI root complex >>>>>>>> 2. a register on the device itself >>>>>>>> 3. a new capability of the device >>>>>>>> 4. add one extra dummy PCI device for the sole purpose of exposing the >>>>>>>> grants capability >>>>>>>> >>>>>>>> >>>>>>>> Looking at the spec, there is a way to add a vendor-specific capability >>>>>>>> (cap_vndr = 0x9). Could we use that? It doesn't look like it is used >>>>>>>> today, Linux doesn't parse it. >>>>>>> >>>>>>> I did wonder the same from a quick look at the spec. There's however >>>>>>> a text in the specification that says: >>>>>>> >>>>>>> "The driver SHOULD NOT use the Vendor data capability except for >>>>>>> debugging and reporting purposes." >>>>>>> >>>>>>> So we would at least need to change that because the capability would >>>>>>> then be used by other purposes different than debugging and reporting. >>>>>>> >>>>>>> Seems like a minor adjustment, so might we worth asking upstream about >>>>>>> their opinion, and to get a conversation started. >>>>>> >>>>>> Wait, wouldn't this use-case fall under "reporting" ? It is exactly what >>>>>> we are doing, right? >>>>> >>>>> I'd understand "reporting" as e.g. logging, transferring statistics, ... >>>>> >>>>> We'd like to use it for configuration purposes. >>>> >>>> I've also read it that way. >>>> >>>>> Another idea would be to enhance the virtio IOMMU device to suit our needs: >>>>> we could add the domid as another virtio IOMMU device capability and (for now) >>>>> use bypass mode for all "productive" devices. 
>>>> >>>> If we have to start adding capabilties, won't it be easier to just add >>>> it to the each device instead of adding it to virtio IOMMU. Or is the >>>> parsing of capabilities device specific, and hence we would have to >>>> implement such parsing for each device? I would expect some >>>> capabilities are shared between all devices, and a Xen capability could >>>> be one of those. >>> >>> Have a look at [1], which is describing the common device config layout. >>> The problem here is that we'd need to add the domid after the queue specific >>> data, resulting in a mess if further queue fields would be added later. >>> >>> We could try that, of course. >> >> Thinking more about it, the virtio IOMMU device seems to be a better fit: >> >> In case we'd add the domid to the device's PCI config space, the value would >> be controlled by the backend domain. IMO the domid passed to the frontend >> should be controlled by a trusted entity (dom0 or the hypervisor), which >> would be the natural backend of the virtio IOMMU device. > > Hm, yes. I'm however failing to see how a backed could exploit that. > > The guest would be granting memory to a different domain than the one > running the backend, but otherwise that memory would be granted to the > backend domain, which could then also make it available to other > domains (without having to play with the reported backend domid). I agree that an exploit is at least not obvious. It is still not a clean solution, though. Giving the wrong domain direct access to some of the guest's memory is worse than the ability to pass the contents indirectly to the wrong domain IMHO. Juergen
On 07.07.23 17:14, Roger Pau Monné wrote: > On Fri, Jul 07, 2023 at 05:01:38PM +0200, Juergen Gross wrote: >> On 07.07.23 16:42, Roger Pau Monné wrote: >>> On Fri, Jul 07, 2023 at 04:10:14PM +0200, Juergen Gross wrote: >>>> On 07.07.23 11:50, Roger Pau Monné wrote: >>>>> On Fri, Jul 07, 2023 at 06:38:48AM +0200, Juergen Gross wrote: >>>>>> On 06.07.23 23:49, Stefano Stabellini wrote: >>>>>>> On Thu, 6 Jul 2023, Roger Pau Monné wrote: >>>>>>>> On Wed, Jul 05, 2023 at 03:41:10PM -0700, Stefano Stabellini wrote: >>>>>>>>> On Wed, 5 Jul 2023, Roger Pau Monné wrote: >>>>>>>>>> On Tue, Jul 04, 2023 at 08:14:59PM +0300, Oleksandr Tyshchenko wrote: >>>>>>>>>>> Part 2 (clarification): >>>>>>>>>>> >>>>>>>>>>> I think using a special config space register in the root complex would >>>>>>>>>>> not be terrible in terms of guest changes because it is easy to >>>>>>>>>>> introduce a new root complex driver in Linux and other OSes. The root >>>>>>>>>>> complex would still be ECAM compatible so the regular ECAM driver would >>>>>>>>>>> still work. A new driver would only be necessary if you want to be able >>>>>>>>>>> to access the special config space register. >>>>>>>>>> >>>>>>>>>> I'm slightly worry of this approach, we end up modifying a root >>>>>>>>>> complex emulation in order to avoid modifying a PCI device emulation >>>>>>>>>> on QEMU, not sure that's a good trade off. >>>>>>>>>> >>>>>>>>>> Note also that different architectures will likely have different root >>>>>>>>>> complex, and so you might need to modify several of them, plus then >>>>>>>>>> arrange the PCI layout correctly in order to have the proper hierarchy >>>>>>>>>> so that devices belonging to different driver domains are assigned to >>>>>>>>>> different bridges. >>>>>>>>> >>>>>>>>> I do think that adding something to the PCI conf register somewhere is >>>>>>>>> the best option because it is not dependent on ACPI and it is not >>>>>>>>> dependent on xenstore both of which are very undesirable. >>>>>>>>> >>>>>>>>> I am not sure where specifically is the best place. These are 3 ideas >>>>>>>>> we came up with: >>>>>>>>> 1. PCI root complex >>>>>>>>> 2. a register on the device itself >>>>>>>>> 3. a new capability of the device >>>>>>>>> 4. add one extra dummy PCI device for the sole purpose of exposing the >>>>>>>>> grants capability >>>>>>>>> >>>>>>>>> >>>>>>>>> Looking at the spec, there is a way to add a vendor-specific capability >>>>>>>>> (cap_vndr = 0x9). Could we use that? It doesn't look like it is used >>>>>>>>> today, Linux doesn't parse it. >>>>>>>> >>>>>>>> I did wonder the same from a quick look at the spec. There's however >>>>>>>> a text in the specification that says: >>>>>>>> >>>>>>>> "The driver SHOULD NOT use the Vendor data capability except for >>>>>>>> debugging and reporting purposes." >>>>>>>> >>>>>>>> So we would at least need to change that because the capability would >>>>>>>> then be used by other purposes different than debugging and reporting. >>>>>>>> >>>>>>>> Seems like a minor adjustment, so might we worth asking upstream about >>>>>>>> their opinion, and to get a conversation started. >>>>>>> >>>>>>> Wait, wouldn't this use-case fall under "reporting" ? It is exactly what >>>>>>> we are doing, right? >>>>>> >>>>>> I'd understand "reporting" as e.g. logging, transferring statistics, ... >>>>>> >>>>>> We'd like to use it for configuration purposes. >>>>> >>>>> I've also read it that way. 
>>>>> >>>>>> Another idea would be to enhance the virtio IOMMU device to suit our needs: >>>>>> we could add the domid as another virtio IOMMU device capability and (for now) >>>>>> use bypass mode for all "productive" devices. >>>>> >>>>> If we have to start adding capabilties, won't it be easier to just add >>>>> it to the each device instead of adding it to virtio IOMMU. Or is the >>>>> parsing of capabilities device specific, and hence we would have to >>>>> implement such parsing for each device? I would expect some >>>>> capabilities are shared between all devices, and a Xen capability could >>>>> be one of those. >>>> >>>> Have a look at [1], which is describing the common device config layout. >>>> The problem here is that we'd need to add the domid after the queue specific >>>> data, resulting in a mess if further queue fields would be added later. >>>> >>>> We could try that, of course. >>> >>> Right, we must make it part of the standard if we modify >>> virtio_pci_common_cfg, or else newly added fields would overlap the >>> Xen specific one. >>> >>> Would it be possible to signal Xen-grants support in the >>> `device_feature` field, and then expose it from a vendor capability? >>> IOW, would it be possible to add a Xen-specific hook in the parsing of >>> virtio_pci_common_cfg that would then fetch additional data from a >>> capability? >> >> TBH, I don't know. It might require some changes in the central parsing >> logic, but this shouldn't be too hard to do. >> >>> That would likely be less intrusive than adding a new Xen-specific >>> field to virtio_pci_common_cfg while still allowing us to do Xen >>> specific configuration for all VirtIO devices. >> >> In case we want to go that route, this should be in a new "platform config" >> capability, which might be just another form of a vendor capability. > > I think telling people that they will need to implement grants-v3 in > order to solve this might be too much. I would rather prefer a more > concrete solution that doesn't have so many loose ends. > > Anyway, it's up to the person doing the job, but starting with "you > will have to implement grants-v3" is quite likely to deter anyone from > attempting to solve this I'm afraid. Fair enough. :-) Juergen
On Fri, 7 Jul 2023, Juergen Gross wrote: > On 26.06.23 15:17, Petr Pavlu wrote: > > On 6/21/23 19:58, Oleksandr Tyshchenko wrote: > > > On 21.06.23 16:12, Petr Pavlu wrote: > > > > When attempting to run Xen on a QEMU/KVM virtual machine with virtio > > > > devices (all x86_64), dom0 tries to establish a grant for itself which > > > > eventually results in a hang during the boot. > > > > > > > > The backtrace looks as follows, the while loop in __send_control_msg() > > > > makes no progress: > > > > > > > > #0 virtqueue_get_buf_ctx (_vq=_vq@entry=0xffff8880074a8400, > > > > len=len@entry=0xffffc90000413c94, ctx=ctx@entry=0x0 <fixed_percpu_data>) > > > > at ../drivers/virtio/virtio_ring.c:2326 > > > > #1 0xffffffff817086b7 in virtqueue_get_buf > > > > (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94) at > > > > ../drivers/virtio/virtio_ring.c:2333 > > > > #2 0xffffffff8175f6b2 in __send_control_msg (portdev=<optimized > > > > out>, port_id=0xffffffff, event=0x0, value=0x1) at > > > > ../drivers/char/virtio_console.c:562 > > > > #3 0xffffffff8175f6ee in __send_control_msg (portdev=<optimized > > > > out>, port_id=<optimized out>, event=<optimized out>, value=<optimized > > > > out>) at ../drivers/char/virtio_console.c:569 > > > > #4 0xffffffff817618b1 in virtcons_probe (vdev=0xffff88800585e800) > > > > at ../drivers/char/virtio_console.c:2098 > > > > #5 0xffffffff81707117 in virtio_dev_probe (_d=0xffff88800585e810) > > > > at ../drivers/virtio/virtio.c:305 > > > > #6 0xffffffff8198e348 in call_driver_probe (drv=0xffffffff82be40c0 > > > > <virtio_console>, drv=0xffffffff82be40c0 <virtio_console>, > > > > dev=0xffff88800585e810) at ../drivers/base/dd.c:579 > > > > #7 really_probe (dev=dev@entry=0xffff88800585e810, > > > > drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at > > > > ../drivers/base/dd.c:658 > > > > #8 0xffffffff8198e58f in __driver_probe_device > > > > (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, > > > > dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:800 > > > > #9 0xffffffff8198e65a in driver_probe_device > > > > (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, > > > > dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:830 > > > > #10 0xffffffff8198e832 in __driver_attach (dev=0xffff88800585e810, > > > > data=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1216 > > > > #11 0xffffffff8198bfb2 in bus_for_each_dev (bus=<optimized out>, > > > > start=start@entry=0x0 <fixed_percpu_data>, > > > > data=data@entry=0xffffffff82be40c0 <virtio_console>, > > > > fn=fn@entry=0xffffffff8198e7b0 <__driver_attach>) at > > > > ../drivers/base/bus.c:368 > > > > #12 0xffffffff8198db65 in driver_attach > > > > (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at > > > > ../drivers/base/dd.c:1233 > > > > #13 0xffffffff8198d207 in bus_add_driver > > > > (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at > > > > ../drivers/base/bus.c:673 > > > > #14 0xffffffff8198f550 in driver_register > > > > (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at > > > > ../drivers/base/driver.c:246 > > > > #15 0xffffffff81706b47 in register_virtio_driver > > > > (driver=driver@entry=0xffffffff82be40c0 <virtio_console>) at > > > > ../drivers/virtio/virtio.c:357 > > > > #16 0xffffffff832cd34b in virtio_console_init () at > > > > ../drivers/char/virtio_console.c:2258 > > > > #17 0xffffffff8100105c in do_one_initcall (fn=0xffffffff832cd2e0 > > > > <virtio_console_init>) at ../init/main.c:1246 > > > > #18 0xffffffff83277293 in do_initcall_level > > > > 
(command_line=0xffff888003e2f900 "root", level=0x6) at > > > > ../init/main.c:1319 > > > > #19 do_initcalls () at ../init/main.c:1335 > > > > #20 do_basic_setup () at ../init/main.c:1354 > > > > #21 kernel_init_freeable () at ../init/main.c:1571 > > > > #22 0xffffffff81f64be1 in kernel_init (unused=<optimized out>) at > > > > ../init/main.c:1462 > > > > #23 0xffffffff81001f49 in ret_from_fork () at > > > > ../arch/x86/entry/entry_64.S:308 > > > > #24 0x0000000000000000 in ?? () > > > > > > > > Fix the problem by preventing xen_grant_init_backend_domid() from > > > > setting dom0 as a backend when running in dom0. > > > > > > > > Fixes: 035e3a4321f7 ("xen/virtio: Optimize the setup of "xen-grant-dma" > > > > devices") > > > > > > > > > I am not 100% sure whether the Fixes tag points to precise commit. If I > > > am not mistaken, the said commit just moves the code in the context > > > without changing the logic of CONFIG_XEN_VIRTIO_FORCE_GRANT, this was > > > introduced before. > > > > I see, the tag should better point to 7228113d1fa0 ("xen/virtio: use > > dom0 as default backend for CONFIG_XEN_VIRTIO_FORCE_GRANT") which > > introduced the original logic to use dom0 as backend. > > > > Commit 035e3a4321f7 ("xen/virtio: Optimize the setup of "xen-grant-dma" > > devices") is relevant in sense that it extended when this logic is > > active by adding an OR check for xen_pv_domain(). > > > > > > > > > > > > Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> > > > > --- > > > > drivers/xen/grant-dma-ops.c | 4 +++- > > > > 1 file changed, 3 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c > > > > index 76f6f26265a3..29ed27ac450e 100644 > > > > --- a/drivers/xen/grant-dma-ops.c > > > > +++ b/drivers/xen/grant-dma-ops.c > > > > @@ -362,7 +362,9 @@ static int xen_grant_init_backend_domid(struct > > > > device *dev, > > > > if (np) { > > > > ret = xen_dt_grant_init_backend_domid(dev, np, > > > > backend_domid); > > > > of_node_put(np); > > > > - } else if (IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || > > > > xen_pv_domain()) { > > > > + } else if ((IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || > > > > + xen_pv_domain()) && > > > > + !xen_initial_domain()) { > > > > > > The commit lgtm, just one note: > > > > > > > > > I would even bail out early in xen_virtio_restricted_mem_acc() instead, > > > as I assume the same issue could happen on Arm with DT (although there > > > we don't guess the backend's domid, we read it from DT and quite > > > unlikely we get Dom0 being in Dom0 with correct DT). > > > > > > Something like: > > > > > > @@ -416,6 +421,10 @@ bool xen_virtio_restricted_mem_acc(struct > > > virtio_device *dev) > > > { > > > domid_t backend_domid; > > > > > > + /* Xen grant DMA ops are not used when running as initial domain > > > */ > > > + if (xen_initial_domain()) > > > + return false; > > > + > > > if (!xen_grant_init_backend_domid(dev->dev.parent, > > > &backend_domid)) { > > > xen_grant_setup_dma_ops(dev->dev.parent, backend_domid); > > > return true; > > > (END) > > > > > > > > > > > > If so, that commit subject would need to be updated accordingly. > > > > > > Let's see what other reviewers will say. > > > > Ok, makes sense. > > I think this is okay for a fix of the current problem. > > Passing through virtio devices to a PV domU is not covered by this fix, but > this > should be a rather rare configuration, which doesn't work today either. 
So the > suggested patch would fix the current issue without introducing a regression. > > Anything else can be done later. Why do you say that passing through virtio devices to a PV domU doesn't work today anyway? Also, as you know many people use Xen outside of datacenter deployments (laptops, embedded etc.) where driver domains and device assignment are very common. You could assign a virtio network card to a domU and use PV network to share the network with other guests. Physical virtio devices, especially virtio-net devices, exist. I could probably repro this problem today in a domU just by installing QubesOS inside QEMU. QubesOS uses network driver domains and if QEMU provides a virtio-net network card, this would break even with this patch.
On 07.07.23 23:02, Stefano Stabellini wrote: > On Fri, 7 Jul 2023, Juergen Gross wrote: >> On 26.06.23 15:17, Petr Pavlu wrote: >>> On 6/21/23 19:58, Oleksandr Tyshchenko wrote: >>>> On 21.06.23 16:12, Petr Pavlu wrote: >>>>> When attempting to run Xen on a QEMU/KVM virtual machine with virtio >>>>> devices (all x86_64), dom0 tries to establish a grant for itself which >>>>> eventually results in a hang during the boot. >>>>> >>>>> The backtrace looks as follows, the while loop in __send_control_msg() >>>>> makes no progress: >>>>> >>>>> #0 virtqueue_get_buf_ctx (_vq=_vq@entry=0xffff8880074a8400, >>>>> len=len@entry=0xffffc90000413c94, ctx=ctx@entry=0x0 <fixed_percpu_data>) >>>>> at ../drivers/virtio/virtio_ring.c:2326 >>>>> #1 0xffffffff817086b7 in virtqueue_get_buf >>>>> (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94) at >>>>> ../drivers/virtio/virtio_ring.c:2333 >>>>> #2 0xffffffff8175f6b2 in __send_control_msg (portdev=<optimized >>>>> out>, port_id=0xffffffff, event=0x0, value=0x1) at >>>>> ../drivers/char/virtio_console.c:562 >>>>> #3 0xffffffff8175f6ee in __send_control_msg (portdev=<optimized >>>>> out>, port_id=<optimized out>, event=<optimized out>, value=<optimized >>>>> out>) at ../drivers/char/virtio_console.c:569 >>>>> #4 0xffffffff817618b1 in virtcons_probe (vdev=0xffff88800585e800) >>>>> at ../drivers/char/virtio_console.c:2098 >>>>> #5 0xffffffff81707117 in virtio_dev_probe (_d=0xffff88800585e810) >>>>> at ../drivers/virtio/virtio.c:305 >>>>> #6 0xffffffff8198e348 in call_driver_probe (drv=0xffffffff82be40c0 >>>>> <virtio_console>, drv=0xffffffff82be40c0 <virtio_console>, >>>>> dev=0xffff88800585e810) at ../drivers/base/dd.c:579 >>>>> #7 really_probe (dev=dev@entry=0xffff88800585e810, >>>>> drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at >>>>> ../drivers/base/dd.c:658 >>>>> #8 0xffffffff8198e58f in __driver_probe_device >>>>> (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, >>>>> dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:800 >>>>> #9 0xffffffff8198e65a in driver_probe_device >>>>> (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, >>>>> dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:830 >>>>> #10 0xffffffff8198e832 in __driver_attach (dev=0xffff88800585e810, >>>>> data=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1216 >>>>> #11 0xffffffff8198bfb2 in bus_for_each_dev (bus=<optimized out>, >>>>> start=start@entry=0x0 <fixed_percpu_data>, >>>>> data=data@entry=0xffffffff82be40c0 <virtio_console>, >>>>> fn=fn@entry=0xffffffff8198e7b0 <__driver_attach>) at >>>>> ../drivers/base/bus.c:368 >>>>> #12 0xffffffff8198db65 in driver_attach >>>>> (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at >>>>> ../drivers/base/dd.c:1233 >>>>> #13 0xffffffff8198d207 in bus_add_driver >>>>> (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at >>>>> ../drivers/base/bus.c:673 >>>>> #14 0xffffffff8198f550 in driver_register >>>>> (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at >>>>> ../drivers/base/driver.c:246 >>>>> #15 0xffffffff81706b47 in register_virtio_driver >>>>> (driver=driver@entry=0xffffffff82be40c0 <virtio_console>) at >>>>> ../drivers/virtio/virtio.c:357 >>>>> #16 0xffffffff832cd34b in virtio_console_init () at >>>>> ../drivers/char/virtio_console.c:2258 >>>>> #17 0xffffffff8100105c in do_one_initcall (fn=0xffffffff832cd2e0 >>>>> <virtio_console_init>) at ../init/main.c:1246 >>>>> #18 0xffffffff83277293 in do_initcall_level >>>>> (command_line=0xffff888003e2f900 "root", level=0x6) at >>>>> 
../init/main.c:1319 >>>>> #19 do_initcalls () at ../init/main.c:1335 >>>>> #20 do_basic_setup () at ../init/main.c:1354 >>>>> #21 kernel_init_freeable () at ../init/main.c:1571 >>>>> #22 0xffffffff81f64be1 in kernel_init (unused=<optimized out>) at >>>>> ../init/main.c:1462 >>>>> #23 0xffffffff81001f49 in ret_from_fork () at >>>>> ../arch/x86/entry/entry_64.S:308 >>>>> #24 0x0000000000000000 in ?? () >>>>> >>>>> Fix the problem by preventing xen_grant_init_backend_domid() from >>>>> setting dom0 as a backend when running in dom0. >>>>> >>>>> Fixes: 035e3a4321f7 ("xen/virtio: Optimize the setup of "xen-grant-dma" >>>>> devices") >>>> >>>> >>>> I am not 100% sure whether the Fixes tag points to precise commit. If I >>>> am not mistaken, the said commit just moves the code in the context >>>> without changing the logic of CONFIG_XEN_VIRTIO_FORCE_GRANT, this was >>>> introduced before. >>> >>> I see, the tag should better point to 7228113d1fa0 ("xen/virtio: use >>> dom0 as default backend for CONFIG_XEN_VIRTIO_FORCE_GRANT") which >>> introduced the original logic to use dom0 as backend. >>> >>> Commit 035e3a4321f7 ("xen/virtio: Optimize the setup of "xen-grant-dma" >>> devices") is relevant in sense that it extended when this logic is >>> active by adding an OR check for xen_pv_domain(). >>> >>>> >>>> >>>>> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> >>>>> --- >>>>> drivers/xen/grant-dma-ops.c | 4 +++- >>>>> 1 file changed, 3 insertions(+), 1 deletion(-) >>>>> >>>>> diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c >>>>> index 76f6f26265a3..29ed27ac450e 100644 >>>>> --- a/drivers/xen/grant-dma-ops.c >>>>> +++ b/drivers/xen/grant-dma-ops.c >>>>> @@ -362,7 +362,9 @@ static int xen_grant_init_backend_domid(struct >>>>> device *dev, >>>>> if (np) { >>>>> ret = xen_dt_grant_init_backend_domid(dev, np, >>>>> backend_domid); >>>>> of_node_put(np); >>>>> - } else if (IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || >>>>> xen_pv_domain()) { >>>>> + } else if ((IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || >>>>> + xen_pv_domain()) && >>>>> + !xen_initial_domain()) { >>>> >>>> The commit lgtm, just one note: >>>> >>>> >>>> I would even bail out early in xen_virtio_restricted_mem_acc() instead, >>>> as I assume the same issue could happen on Arm with DT (although there >>>> we don't guess the backend's domid, we read it from DT and quite >>>> unlikely we get Dom0 being in Dom0 with correct DT). >>>> >>>> Something like: >>>> >>>> @@ -416,6 +421,10 @@ bool xen_virtio_restricted_mem_acc(struct >>>> virtio_device *dev) >>>> { >>>> domid_t backend_domid; >>>> >>>> + /* Xen grant DMA ops are not used when running as initial domain >>>> */ >>>> + if (xen_initial_domain()) >>>> + return false; >>>> + >>>> if (!xen_grant_init_backend_domid(dev->dev.parent, >>>> &backend_domid)) { >>>> xen_grant_setup_dma_ops(dev->dev.parent, backend_domid); >>>> return true; >>>> (END) >>>> >>>> >>>> >>>> If so, that commit subject would need to be updated accordingly. >>>> >>>> Let's see what other reviewers will say. >>> >>> Ok, makes sense. >> >> I think this is okay for a fix of the current problem. >> >> Passing through virtio devices to a PV domU is not covered by this fix, but >> this >> should be a rather rare configuration, which doesn't work today either. So the >> suggested patch would fix the current issue without introducing a regression. >> >> Anything else can be done later. > > Why do you say that passing through virtio devices to a PV domU doesn't > work today anyway? 
Also, as you know many people use Xen outside of > datacenter deployments (laptops, embedded etc.) where drivers domains > and device assignment are very common. You could assign a virtio network > card to a domU and use PV network to share the network with other > guests. Physical virtio devices, especially virtio-net devices, exist. I > could probably repro this problem today in a domU just installing > QubesOS inside QEMU. QubesOS uses network driver domains and if QEMU > provides a virtio-net network card, this would break even with this > patch. I might be wrong, but I don't think all virtio frontends will work in that scenario. The main reason is the PFN/MFN difference: a frontend using guest consecutive memory for doing large I/Os will fail miserably. This was the main reason why I had to add the functionality of consecutive grants for large I/O buffers. The same goes for multi-page virtio ring pages. Juergen
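To illustrate the PFN/MFN point: in a PV guest, consecutive guest page frame numbers generally map to unrelated machine frames, so a buffer that looks physically contiguous to the frontend need not be contiguous from the device's point of view. A minimal sketch of that check, assuming a pfn_to_mfn()-style P2M lookup as used by classic PV guests (names simplified, not actual kernel code):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t xen_pfn_t;

/* Stand-in for the PV guest's P2M lookup (guest PFN -> machine frame). */
extern xen_pfn_t pfn_to_mfn(xen_pfn_t pfn);

/* A multi-page buffer that is contiguous in guest PFN space is only
 * usable as one large DMA region if the underlying MFNs happen to be
 * contiguous as well -- which PV gives no guarantee of. */
static bool buffer_is_machine_contiguous(xen_pfn_t first_pfn,
                                         unsigned int nr_pages)
{
        xen_pfn_t first_mfn = pfn_to_mfn(first_pfn);
        unsigned int i;

        for (i = 1; i < nr_pages; i++)
                if (pfn_to_mfn(first_pfn + i) != first_mfn + i)
                        return false;

        return true;
}

This is why large virtio I/O buffers and multi-page rings in a PV guest either have to be bounced (swiotlb-xen) or described through grants that can cover non-contiguous backing, as mentioned above.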
On Sat, 8 Jul 2023, Juergen Gross wrote: > On 07.07.23 23:02, Stefano Stabellini wrote: > > On Fri, 7 Jul 2023, Juergen Gross wrote: > > > On 26.06.23 15:17, Petr Pavlu wrote: > > > > On 6/21/23 19:58, Oleksandr Tyshchenko wrote: > > > > > On 21.06.23 16:12, Petr Pavlu wrote: > > > > > > When attempting to run Xen on a QEMU/KVM virtual machine with virtio > > > > > > devices (all x86_64), dom0 tries to establish a grant for itself > > > > > > which > > > > > > eventually results in a hang during the boot. > > > > > > > > > > > > The backtrace looks as follows, the while loop in > > > > > > __send_control_msg() > > > > > > makes no progress: > > > > > > > > > > > > #0 virtqueue_get_buf_ctx (_vq=_vq@entry=0xffff8880074a8400, > > > > > > len=len@entry=0xffffc90000413c94, ctx=ctx@entry=0x0 > > > > > > <fixed_percpu_data>) > > > > > > at ../drivers/virtio/virtio_ring.c:2326 > > > > > > #1 0xffffffff817086b7 in virtqueue_get_buf > > > > > > (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94) > > > > > > at > > > > > > ../drivers/virtio/virtio_ring.c:2333 > > > > > > #2 0xffffffff8175f6b2 in __send_control_msg > > > > > > (portdev=<optimized > > > > > > out>, port_id=0xffffffff, event=0x0, value=0x1) at > > > > > > ../drivers/char/virtio_console.c:562 > > > > > > #3 0xffffffff8175f6ee in __send_control_msg > > > > > > (portdev=<optimized > > > > > > out>, port_id=<optimized out>, event=<optimized out>, > > > > > > value=<optimized > > > > > > out>) at ../drivers/char/virtio_console.c:569 > > > > > > #4 0xffffffff817618b1 in virtcons_probe > > > > > > (vdev=0xffff88800585e800) > > > > > > at ../drivers/char/virtio_console.c:2098 > > > > > > #5 0xffffffff81707117 in virtio_dev_probe > > > > > > (_d=0xffff88800585e810) > > > > > > at ../drivers/virtio/virtio.c:305 > > > > > > #6 0xffffffff8198e348 in call_driver_probe > > > > > > (drv=0xffffffff82be40c0 > > > > > > <virtio_console>, drv=0xffffffff82be40c0 <virtio_console>, > > > > > > dev=0xffff88800585e810) at ../drivers/base/dd.c:579 > > > > > > #7 really_probe (dev=dev@entry=0xffff88800585e810, > > > > > > drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at > > > > > > ../drivers/base/dd.c:658 > > > > > > #8 0xffffffff8198e58f in __driver_probe_device > > > > > > (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, > > > > > > dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:800 > > > > > > #9 0xffffffff8198e65a in driver_probe_device > > > > > > (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, > > > > > > dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:830 > > > > > > #10 0xffffffff8198e832 in __driver_attach > > > > > > (dev=0xffff88800585e810, > > > > > > data=0xffffffff82be40c0 <virtio_console>) at > > > > > > ../drivers/base/dd.c:1216 > > > > > > #11 0xffffffff8198bfb2 in bus_for_each_dev (bus=<optimized > > > > > > out>, > > > > > > start=start@entry=0x0 <fixed_percpu_data>, > > > > > > data=data@entry=0xffffffff82be40c0 <virtio_console>, > > > > > > fn=fn@entry=0xffffffff8198e7b0 <__driver_attach>) at > > > > > > ../drivers/base/bus.c:368 > > > > > > #12 0xffffffff8198db65 in driver_attach > > > > > > (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at > > > > > > ../drivers/base/dd.c:1233 > > > > > > #13 0xffffffff8198d207 in bus_add_driver > > > > > > (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at > > > > > > ../drivers/base/bus.c:673 > > > > > > #14 0xffffffff8198f550 in driver_register > > > > > > (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at > > > > > > 
../drivers/base/driver.c:246 > > > > > > #15 0xffffffff81706b47 in register_virtio_driver > > > > > > (driver=driver@entry=0xffffffff82be40c0 <virtio_console>) at > > > > > > ../drivers/virtio/virtio.c:357 > > > > > > #16 0xffffffff832cd34b in virtio_console_init () at > > > > > > ../drivers/char/virtio_console.c:2258 > > > > > > #17 0xffffffff8100105c in do_one_initcall > > > > > > (fn=0xffffffff832cd2e0 > > > > > > <virtio_console_init>) at ../init/main.c:1246 > > > > > > #18 0xffffffff83277293 in do_initcall_level > > > > > > (command_line=0xffff888003e2f900 "root", level=0x6) at > > > > > > ../init/main.c:1319 > > > > > > #19 do_initcalls () at ../init/main.c:1335 > > > > > > #20 do_basic_setup () at ../init/main.c:1354 > > > > > > #21 kernel_init_freeable () at ../init/main.c:1571 > > > > > > #22 0xffffffff81f64be1 in kernel_init (unused=<optimized out>) > > > > > > at > > > > > > ../init/main.c:1462 > > > > > > #23 0xffffffff81001f49 in ret_from_fork () at > > > > > > ../arch/x86/entry/entry_64.S:308 > > > > > > #24 0x0000000000000000 in ?? () > > > > > > > > > > > > Fix the problem by preventing xen_grant_init_backend_domid() from > > > > > > setting dom0 as a backend when running in dom0. > > > > > > > > > > > > Fixes: 035e3a4321f7 ("xen/virtio: Optimize the setup of > > > > > > "xen-grant-dma" > > > > > > devices") > > > > > > > > > > > > > > > I am not 100% sure whether the Fixes tag points to precise commit. If > > > > > I > > > > > am not mistaken, the said commit just moves the code in the context > > > > > without changing the logic of CONFIG_XEN_VIRTIO_FORCE_GRANT, this was > > > > > introduced before. > > > > > > > > I see, the tag should better point to 7228113d1fa0 ("xen/virtio: use > > > > dom0 as default backend for CONFIG_XEN_VIRTIO_FORCE_GRANT") which > > > > introduced the original logic to use dom0 as backend. > > > > > > > > Commit 035e3a4321f7 ("xen/virtio: Optimize the setup of "xen-grant-dma" > > > > devices") is relevant in sense that it extended when this logic is > > > > active by adding an OR check for xen_pv_domain(). > > > > > > > > > > > > > > > > > > > > Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> > > > > > > --- > > > > > > drivers/xen/grant-dma-ops.c | 4 +++- > > > > > > 1 file changed, 3 insertions(+), 1 deletion(-) > > > > > > > > > > > > diff --git a/drivers/xen/grant-dma-ops.c > > > > > > b/drivers/xen/grant-dma-ops.c > > > > > > index 76f6f26265a3..29ed27ac450e 100644 > > > > > > --- a/drivers/xen/grant-dma-ops.c > > > > > > +++ b/drivers/xen/grant-dma-ops.c > > > > > > @@ -362,7 +362,9 @@ static int xen_grant_init_backend_domid(struct > > > > > > device *dev, > > > > > > if (np) { > > > > > > ret = xen_dt_grant_init_backend_domid(dev, np, > > > > > > backend_domid); > > > > > > of_node_put(np); > > > > > > - } else if (IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || > > > > > > xen_pv_domain()) { > > > > > > + } else if ((IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || > > > > > > + xen_pv_domain()) && > > > > > > + !xen_initial_domain()) { > > > > > > > > > > The commit lgtm, just one note: > > > > > > > > > > > > > > > I would even bail out early in xen_virtio_restricted_mem_acc() > > > > > instead, > > > > > as I assume the same issue could happen on Arm with DT (although there > > > > > we don't guess the backend's domid, we read it from DT and quite > > > > > unlikely we get Dom0 being in Dom0 with correct DT). 
> > > > > > > > > > Something like: > > > > > > > > > > @@ -416,6 +421,10 @@ bool xen_virtio_restricted_mem_acc(struct > > > > > virtio_device *dev) > > > > > { > > > > > domid_t backend_domid; > > > > > > > > > > + /* Xen grant DMA ops are not used when running as initial > > > > > domain > > > > > */ > > > > > + if (xen_initial_domain()) > > > > > + return false; > > > > > + > > > > > if (!xen_grant_init_backend_domid(dev->dev.parent, > > > > > &backend_domid)) { > > > > > xen_grant_setup_dma_ops(dev->dev.parent, > > > > > backend_domid); > > > > > return true; > > > > > (END) > > > > > > > > > > > > > > > > > > > > If so, that commit subject would need to be updated accordingly. > > > > > > > > > > Let's see what other reviewers will say. > > > > > > > > Ok, makes sense. > > > > > > I think this is okay for a fix of the current problem. > > > > > > Passing through virtio devices to a PV domU is not covered by this fix, > > > but > > > this > > > should be a rather rare configuration, which doesn't work today either. So > > > the > > > suggested patch would fix the current issue without introducing a > > > regression. > > > > > > Anything else can be done later. > > > > Why do you say that passing through virtio devices to a PV domU doesn't > > work today anyway? Also, as you know many people use Xen outside of > > datacenter deployments (laptops, embedded etc.) where drivers domains > > and device assignment are very common. You could assign a virtio network > > card to a domU and use PV network to share the network with other > > guests. Physical virtio devices, especially virtio-net devices, exist. I > > could probably repro this problem today in a domU just installing > > QubesOS inside QEMU. QubesOS uses network driver domains and if QEMU > > provides a virtio-net network card, this would break even with this > > patch. > > I might be wrong, but I don't think all virtio frontends will work in that > scenario. The main reason is the PFN/MFN difference: a frontend using guest > consecutive memory for doing large I/Os will fail miserably. This was the > main reason why I had to add the functionality of consecutive grants for > large I/O buffers. The same goes for multi-page virtio ring pages. I think for PV guests the virtio frontends would work but they might be slow as they would have to bounce over swiotlb-xen every multi-page buffer. But the virtio frontends should work OK for PVH and HVM guests.
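For completeness, the alternative Oleksandr suggested earlier in the thread -- bailing out before any backend domid is looked up -- would make xen_virtio_restricted_mem_acc() look roughly like this (a sketch against drivers/xen/grant-dma-ops.c, assuming the surrounding kernel context; not the patch that was actually posted):

/* Sketch: skip the Xen grant-DMA setup entirely when running as the
 * initial domain, so no backend domid is ever guessed for dom0. */
bool xen_virtio_restricted_mem_acc(struct virtio_device *dev)
{
        domid_t backend_domid;

        /* Xen grant DMA ops are not used when running as initial domain. */
        if (xen_initial_domain())
                return false;

        if (!xen_grant_init_backend_domid(dev->dev.parent, &backend_domid)) {
                xen_grant_setup_dma_ops(dev->dev.parent, backend_domid);
                return true;
        }

        return false;
}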
diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c
index 76f6f26265a3..29ed27ac450e 100644
--- a/drivers/xen/grant-dma-ops.c
+++ b/drivers/xen/grant-dma-ops.c
@@ -362,7 +362,9 @@ static int xen_grant_init_backend_domid(struct device *dev,
 	if (np) {
 		ret = xen_dt_grant_init_backend_domid(dev, np, backend_domid);
 		of_node_put(np);
-	} else if (IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) || xen_pv_domain()) {
+	} else if ((IS_ENABLED(CONFIG_XEN_VIRTIO_FORCE_GRANT) ||
+		    xen_pv_domain()) &&
+		   !xen_initial_domain()) {
 		dev_info(dev, "Using dom0 as backend\n");
 		*backend_domid = 0;
 		ret = 0;
When attempting to run Xen on a QEMU/KVM virtual machine with virtio devices (all x86_64), dom0 tries to establish a grant for itself which eventually results in a hang during the boot. The backtrace looks as follows, the while loop in __send_control_msg() makes no progress: #0 virtqueue_get_buf_ctx (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94, ctx=ctx@entry=0x0 <fixed_percpu_data>) at ../drivers/virtio/virtio_ring.c:2326 #1 0xffffffff817086b7 in virtqueue_get_buf (_vq=_vq@entry=0xffff8880074a8400, len=len@entry=0xffffc90000413c94) at ../drivers/virtio/virtio_ring.c:2333 #2 0xffffffff8175f6b2 in __send_control_msg (portdev=<optimized out>, port_id=0xffffffff, event=0x0, value=0x1) at ../drivers/char/virtio_console.c:562 #3 0xffffffff8175f6ee in __send_control_msg (portdev=<optimized out>, port_id=<optimized out>, event=<optimized out>, value=<optimized out>) at ../drivers/char/virtio_console.c:569 #4 0xffffffff817618b1 in virtcons_probe (vdev=0xffff88800585e800) at ../drivers/char/virtio_console.c:2098 #5 0xffffffff81707117 in virtio_dev_probe (_d=0xffff88800585e810) at ../drivers/virtio/virtio.c:305 #6 0xffffffff8198e348 in call_driver_probe (drv=0xffffffff82be40c0 <virtio_console>, drv=0xffffffff82be40c0 <virtio_console>, dev=0xffff88800585e810) at ../drivers/base/dd.c:579 #7 really_probe (dev=dev@entry=0xffff88800585e810, drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:658 #8 0xffffffff8198e58f in __driver_probe_device (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:800 #9 0xffffffff8198e65a in driver_probe_device (drv=drv@entry=0xffffffff82be40c0 <virtio_console>, dev=dev@entry=0xffff88800585e810) at ../drivers/base/dd.c:830 #10 0xffffffff8198e832 in __driver_attach (dev=0xffff88800585e810, data=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1216 #11 0xffffffff8198bfb2 in bus_for_each_dev (bus=<optimized out>, start=start@entry=0x0 <fixed_percpu_data>, data=data@entry=0xffffffff82be40c0 <virtio_console>, fn=fn@entry=0xffffffff8198e7b0 <__driver_attach>) at ../drivers/base/bus.c:368 #12 0xffffffff8198db65 in driver_attach (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/dd.c:1233 #13 0xffffffff8198d207 in bus_add_driver (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/bus.c:673 #14 0xffffffff8198f550 in driver_register (drv=drv@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/base/driver.c:246 #15 0xffffffff81706b47 in register_virtio_driver (driver=driver@entry=0xffffffff82be40c0 <virtio_console>) at ../drivers/virtio/virtio.c:357 #16 0xffffffff832cd34b in virtio_console_init () at ../drivers/char/virtio_console.c:2258 #17 0xffffffff8100105c in do_one_initcall (fn=0xffffffff832cd2e0 <virtio_console_init>) at ../init/main.c:1246 #18 0xffffffff83277293 in do_initcall_level (command_line=0xffff888003e2f900 "root", level=0x6) at ../init/main.c:1319 #19 do_initcalls () at ../init/main.c:1335 #20 do_basic_setup () at ../init/main.c:1354 #21 kernel_init_freeable () at ../init/main.c:1571 #22 0xffffffff81f64be1 in kernel_init (unused=<optimized out>) at ../init/main.c:1462 #23 0xffffffff81001f49 in ret_from_fork () at ../arch/x86/entry/entry_64.S:308 #24 0x0000000000000000 in ?? () Fix the problem by preventing xen_grant_init_backend_domid() from setting dom0 as a backend when running in dom0. 
Fixes: 035e3a4321f7 ("xen/virtio: Optimize the setup of "xen-grant-dma" devices")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
---
 drivers/xen/grant-dma-ops.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)