Message ID: b62fbc602a629941c1acaad3b93d250a3eba33c0.1647222184.git.brchuckz@netscape.net (mailing list archive)
State: New, archived
Series: [XEN] tools/libs/light/libxl_pci.c: explicitly grant access to Intel IGD opregion
On 14.03.2022 04:41, Chuck Zmudzinski wrote: > Fixes: abfb006f1ff4 (tools/libxl: explicitly grant access to needed I/O-memory ranges) > Fixes: 0561e1f01e87 (xen/common: do not implicitly permit access to mapped I/O memory) > Backport: 4.12+ Just fyi: This is fine to have as a tag, but it wouldn't be backported farther than to 4.15. Apart from this largely some style issues (see below), but please realize that I'm not a libxl maintainer and hence I may not have good enough knowledge of, in particular, potential unwritten conventions. > @@ -610,6 +612,45 @@ out: > return ret; > } > > +static uint32_t sysfs_dev_get_igd_opregion(libxl__gc *gc, > + libxl_device_pci *pci) > +{ > + char *pci_device_config_path = > + GCSPRINTF(SYSFS_PCI_DEV"/"PCI_BDF"/config", > + pci->domain, pci->bus, pci->dev, pci->func); > + size_t read_items; > + uint32_t igd_opregion; > + uint32_t error = 0xffffffff; I think this constant wants to gain a #define, to be able to correlate the use sites. I'm also not sure of the value - in principle the register can hold this value, but of course then it won't be 3 pages. Maybe the error check further down should be to see whether adding 2 to the value would overflow in 32 bits? (In that case a #define may not be needed anymore, as there wouldn't be multiple instances of the constant in the code.) > + > + FILE *f = fopen(pci_device_config_path, "r"); > + if (!f) { While libxl has some special style rules, I think it still wants a blank line between declaration(s) and statement(s), just like we expect elsewhere. Effectively you want to simply move the blank line you have one line down. > @@ -2531,6 +2572,37 @@ int libxl__grant_vga_iomem_permission(libxl__gc *gc, const uint32_t domid, > domid, vga_iomem_start, (vga_iomem_start + 0x20 - 1)); > return ret; > } > + > + /* If this is an Intel IGD, allow access to the IGD opregion */ > + if (!libxl__is_igd_vga_passthru(gc, d_config)) return 0; Despite the provision for "return" or alike to go on the same line as an error code check, I don't think this is okay here. It would be if, as iirc generally expected in libxl, you latched the function return value into a local variable named "rc" (I think). > + uint32_t igd_opregion = sysfs_dev_get_igd_opregion(gc, pci); > + uint32_t error = 0xffffffff; Please don't mix declarations and statements. I also don't think "error" is really necessary as a local variable, but with the change suggested above it might disappear anyway. > + if (igd_opregion == error) break; Like above I'm not sure this is okay to all live on one line. I also think it would be nice if you used "return 0" or "break" consistently. Of course a related question is whether failure here should actually be reported to the caller. > + vga_iomem_start = ( (uint64_t) igd_opregion ) >> XC_PAGE_SHIFT; There's no need for a cast here, as you're right-shifting. Also (just fyi) there would have been three too many spaces here. I'm additionally not certain whether re-using a variable for a purpose not matching its name is deemed acceptable by libxl maintainers. > + ret = xc_domain_iomem_permission(CTX->xch, stubdom_domid, > + vga_iomem_start, > + IGD_OPREGION_PAGES, 1); > + if (ret < 0) { > + LOGED(ERROR, domid, > + "failed to give stubdom%d access to iomem range " > + "%"PRIx64"-%"PRIx64" for IGD passthru", > + stubdom_domid, vga_iomem_start, (vga_iomem_start + > + IGD_OPREGION_PAGES - 1)); > + return ret; > + } I have to admit that I find it odd that this is done unconditionally, but I notice the same is done in pre-existing code.
I would have expected this to happen only when there actually is a device model stub domain. Jan > + ret = xc_domain_iomem_permission(CTX->xch, domid, > + vga_iomem_start, > + IGD_OPREGION_PAGES, 1); > + if (ret < 0) { > + LOGED(ERROR, domid, > + "failed to give dom%d access to iomem range " > + "%"PRIx64"-%"PRIx64" for IGD passthru", > + domid, vga_iomem_start, (vga_iomem_start + > + IGD_OPREGION_PAGES - 1)); > + return ret; > + } > break; > } >
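A minimal sketch of the overflow-style check suggested in the review above, written as a standalone helper (the function name is hypothetical; only the "adding 2 to the value would overflow in 32 bits" test comes from Jan's comment):

    #include <stdint.h>
    #include <stdbool.h>

    /* Sketch of the suggestion: instead of comparing against a
     * 0xffffffff sentinel, treat the register value as invalid whenever
     * adding 2 to it would overflow 32 bits. This also rejects the
     * all-ones error pattern without needing a named error constant. */
    static bool igd_opregion_valid(uint32_t igd_opregion)
    {
        return igd_opregion <= UINT32_MAX - 2;
    }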
On 3/15/22 7:38 AM, Jan Beulich wrote: > On 14.03.2022 04:41, Chuck Zmudzinski wrote: >> Fixes: abfb006f1ff4 (tools/libxl: explicitly grant access to needed I/O-memory ranges) >> Fixes: 0561e1f01e87 (xen/common: do not implicitly permit access to mapped I/O memory) >> Backport: 4.12+ > Just fyi: This is fine to have as a tag, but it wouldn't be backported > farther than to 4.15. That's entirely reasonable. I understand this is a bug fix, not a security issue. > > Apart from this largely some style issues (see below), but please > realize that I'm not a libxl maintainer and hence I may not have good > enough knowledge of, in particular, potential unwritten conventions. I will take your comments into consideration regarding style before writing the next version of the patch, and carefully check libxl's coding style file. > >> @@ -610,6 +612,45 @@ out: >> return ret; >> } >> >> +static uint32_t sysfs_dev_get_igd_opregion(libxl__gc *gc, >> + libxl_device_pci *pci) >> +{ >> + char *pci_device_config_path = >> + GCSPRINTF(SYSFS_PCI_DEV"/"PCI_BDF"/config", >> + pci->domain, pci->bus, pci->dev, pci->func); >> + size_t read_items; >> + uint32_t igd_opregion; >> + uint32_t error = 0xffffffff; > I think this constant wants to gain a #define, to be able to correlate > the use sites. I'm also not sure of the value - in principle the > register can hold this value, but of course then it won't be 3 pages. What we are storing as the return value is the starting address, not the size, of the opregion, and that is a 32-bit value. If we cannot read it, we return 0xffffffff instead to indicate the error that we were unable to read the starting address of the opregion from the config attribute in sysfs. The 32-bit value we are looking for is read at offset 0xfc from the start of the config attribute of the Intel IGD in sysfs. The offset 0xfc is defined by PCI_INTEL_OPREGION both here and in hvmloader (and also in Qemu). The data that is read at this offset from the start of the config attribute of the Intel IGD in sysfs is the 32-bit address of the start of the opregion. > Maybe the error check further down should be to see whether adding 2 > to the value would overflow in 32 bits? (In that case a #define may > not be needed anymore, as there wouldn't be multiple instances of the > constant in the code.) That would work also. Please note that I chose that value for an error value consistent with the way the current function sysfs_dev_get_vendor does it. While that function does not use a named 'error' variable for its error return value, which in that case is 0xffff because that function returns uint16_t instead of uint32_t, I chose to explicitly assign the error value to a variable to help make the code more readable. Your comment that this could be a #define instead is good. I also think we should use a #define for the error return value of the sysfs_dev_get_vendor function. Something like: #define ERROR_16 0xffff #define ERROR_32 0xffffffff might be appropriate. But that would be touching code unrelated to this bug fix. I think again the libxl maintainers should weigh in about what to do here. They might let me take this opportunity to update and improve the style of other functions in the file not related to this bug fix, but I am not inclined to do that without an explicit request from them to do so.
So I am not sure yet what I will do in the next version of the patch, but I will address your concerns here and try to explain my reasoning for the changes in the changelog for version 2 of the patch. > >> + >> + FILE *f = fopen(pci_device_config_path, "r"); >> + if (!f) { > While libxl has some special style rules, I think it still wants a > blank line between declaration(s) and statement(s), just like we > expect elsewhere. Effectively you want to simply move the blank line > you have one line down. I think I followed the same style here as the existing sysfs_dev_get_xxx functions. I will double check that and use the same style the other functions use unless they clearly violate the rules, and note that I deviated from the style of the existing functions to conform to current coding style and suggest a subsequent patch to update the style of the other functions. > >> @@ -2531,6 +2572,37 @@ int libxl__grant_vga_iomem_permission(libxl__gc *gc, const uint32_t domid, >> domid, vga_iomem_start, (vga_iomem_start + 0x20 - 1)); >> return ret; >> } >> + >> + /* If this is an Intel IGD, allow access to the IGD opregion */ >> + if (!libxl__is_igd_vga_passthru(gc, d_config)) return 0; > Despite the provision for "return" or alike to go on the same line > as an error code check, I don't think this is okay here. It would be > if, as iirc generally expected in libxl, you latched the function > return value into a local variable named "rc" (I think). I will double check how the function being patched handles the return value. I don't even remember if it has a local variable named rc for a return value. IIRC it was either ret or 0. I understand that libxl expects rc to be used these days, though. This might be another candidate for updating the file to libxl's current standards. > >> + uint32_t igd_opregion = sysfs_dev_get_igd_opregion(gc, pci); >> + uint32_t error = 0xffffffff; > Please don't mix declarations and statements. I presume you are saying these two lines should be re-written as: uint32_t igd_opregion; uint32_t error; igd_opregion = sysfs_dev_get_igd_opregion(gc, pci); error = 0xffffffff; Please reply if my understanding here is not correct. > I also don't think > "error" is really necessary as a local variable, but with the change > suggested above it might disappear anyway. I do plan for the next version of the patch to use a #define for this instead of the error variable (or add 2 to overflow it), so it will disappear in the next version. > >> + if (igd_opregion == error) break; > Like above I'm not sure this is okay to all live on one line. I also > think it would be nice if you used "return 0" or "break" consistently. > Of course a related question is whether failure here should actually > be reported to the caller. Good points here. I agree about consistency with break and return 0. I will change this to return 0 and move it to the next line. I do not want to change the current meaning of the return value without knowledge of how the caller uses the return value. IIRC, currently the function always returns 0 unless it encounters a negative return value from xc_domain_iomem_permission, in which case it returns that negative value to indicate an error to the caller. So if we return anything other than 0 here, we might be returning an error code that the caller does not expect or interpret correctly. I will also consider putting an error message here before returning 0.
A message something like "dom%d: Intel IGD detected, but could not find IGD opregion" would explain the error that happens here. I don't think a non-zero error code to the caller is appropriate here, though, because, as already mentioned, IIRC this might return a value the caller does not interpret correctly. If it is necessary to return an error to the caller here instead of 0, it will be necessary to ensure all callers of this function will interpret it correctly. I would suggest an error return value greater than 0 to distinguish it from the return value < 0 which indicates an error from xc_domain_iomem_permission, but I hope libxl maintainers will accept a return value of 0 here, at least for this patch. A later patch could re-work the return value of this function which would also probably require touching the caller(s) of this function to properly respond to this particular error which is different from an error from xc_domain_iomem_permission. In any case, I will double-check to see if my current understanding of the meaning of the return value is correct before writing the next version of the patch. For now, I will use return 0 instead of break here and move it to the next line, unless I hear otherwise from the libxl maintainers. > >> + vga_iomem_start = ( (uint64_t) igd_opregion ) >> XC_PAGE_SHIFT; > There's no need for a cast here, as you're right-shifting. Also > (just fyi) there would have been three too many spaces here. I'm > additionally not certain whether re-using a variable for a purpose > not matching its name is deemed acceptable by libxl maintainers. I wrote it that way expecting a compiler error if I didn't do the cast. I have not checked if the cast is necessary, though, and maybe you are right. I will check and see if it is necessary by removing the cast and seeing if the compiler complains. If the cast is not needed, I will just use the 32-bit igd_opregion variable when calling xc_domain_iomem_permission instead of the 64-bit vga_iomem_start variable. I will remove the three spaces and use a more descriptive variable instead of re-using vga_iomem_start if the compiler insists on the cast from 32-bit to 64-bit. > >> + ret = xc_domain_iomem_permission(CTX->xch, stubdom_domid, >> + vga_iomem_start, >> + IGD_OPREGION_PAGES, 1); >> + if (ret < 0) { >> + LOGED(ERROR, domid, >> + "failed to give stubdom%d access to iomem range " >> + "%"PRIx64"-%"PRIx64" for IGD passthru", >> + stubdom_domid, vga_iomem_start, (vga_iomem_start + >> + IGD_OPREGION_PAGES - 1)); >> + return ret; >> + } > I have to admit that I find it odd that this is done unconditionally, > but I notice the same is done in pre-existing code. I would have > expected this to happen only when there actually is a device model > stub domain. I don't understand how that works either. All my tests have been with the device model running as a process in dom0. I am thinking maybe in that case it just uses dom0 for the stub domain, but I have not checked that. I will check it by dumping the value of stubdom_domid to a log in my next test. Thank you for responding promptly. Now I have some work to do writing the next version of the patch and documenting it clearly in its changelog. It will take me a while - I will spend enough time on it so hopefully the libxl maintainers don't have to spend so much time on it. Chuck N.B. I forgot to send this reply to xen-devel and cc the libxl maintainers, so I am doing so here. I also re-formatted my replies to avoid lines with too many characters. Sorry for the confusion.
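For reference, a standalone sketch of the sysfs read described in the reply above — fetching the 32-bit opregion address at config-space offset 0xfc (PCI_INTEL_OPREGION). The error-sentinel name is hypothetical and the error handling is simplified relative to the actual patch:

    #include <stdio.h>
    #include <stdint.h>

    #define PCI_INTEL_OPREGION 0xfc         /* offset of the opregion address */
    #define IGD_OPREGION_ERROR 0xffffffffu  /* hypothetical ERROR_32 sentinel */

    /* Read the 32-bit IGD opregion address from a device's sysfs config
     * attribute, e.g. /sys/bus/pci/devices/0000:00:02.0/config. */
    static uint32_t read_igd_opregion(const char *config_path)
    {
        FILE *f = fopen(config_path, "rb");
        uint32_t igd_opregion;

        if (!f)
            return IGD_OPREGION_ERROR;

        if (fseek(f, PCI_INTEL_OPREGION, SEEK_SET) != 0 ||
            fread(&igd_opregion, sizeof(igd_opregion), 1, f) != 1) {
            fclose(f);
            return IGD_OPREGION_ERROR;
        }

        fclose(f);
        return igd_opregion;
    }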
On 3/15/22 7:38 AM, Jan Beulich wrote: > On 14.03.2022 04:41, Chuck Zmudzinski wrote: > >> @@ -610,6 +612,45 @@ out: >> return ret; >> } >> >> +static uint32_t sysfs_dev_get_igd_opregion(libxl__gc *gc, >> + libxl_device_pci *pci) >> +{ >> + char *pci_device_config_path = >> + GCSPRINTF(SYSFS_PCI_DEV"/"PCI_BDF"/config", >> + pci->domain, pci->bus, pci->dev, pci->func); >> + size_t read_items; >> + uint32_t igd_opregion; >> + uint32_t error = 0xffffffff; > I think this constant wants to gain a #define, to be able to correlate > the use sites. I'm also not sure of the value - in principle the > register can hold this value, but of course then it won't be 3 pages. > I have one more comment to add here. I am not intending to define igd_opregion as a data structure 3 pages (12k) long, much less as a pointer to such a structure. However, it would be nice to have access to the actual data structure in libxl, because we could use it to validate its contents. I looked in the code for the i915 Linux kernel module, and the IGD opregion does have a signature that we could check if we have access to it. That would mitigate my concerns expressed in my first version of the patch about a false positive when identifying an Intel IGD. Hvmloader should probably also do this check before it maps the Intel IGD opregion into guest memory if that is possible. However, I expect that it is not a memory region that libxl has access to. It is probably a structure in kernel space, but it might be possible for libxl to ask the hypervisor for access to it. Perhaps the libxl maintainers can shed some light on that possibility. If this is possible, I will include such a check for the validity of the contents of the IGD opregion in version 2 of the patch. Regards, Chuck
On 3/15/2022 9:27 PM, Chuck Zmudzinski wrote: > > > On 3/15/22 7:38 AM, Jan Beulich wrote: >> On 14.03.2022 04:41, Chuck Zmudzinski wrote: >> >>> @@ -610,6 +612,45 @@ out: >>> return ret; >>> } >>> +static uint32_t sysfs_dev_get_igd_opregion(libxl__gc *gc, >>> + libxl_device_pci *pci) >>> +{ >>> + char *pci_device_config_path = >>> + GCSPRINTF(SYSFS_PCI_DEV"/"PCI_BDF"/config", >>> + pci->domain, pci->bus, pci->dev, pci->func); >>> + size_t read_items; >>> + uint32_t igd_opregion; >>> + uint32_t error = 0xffffffff; >> I think this constant wants to gain a #define, to be able to correlate >> the use sites. I'm also not sure of the value - in principle the >> register can hold this value, but of course then it won't be 3 pages. >> > > I have one more comment to add here. I am not intending > to define igd_opregion as a data structure 3 pages (12k) Correction: Actually, the igd_opregion itself would be 2 pages. The three pages come from the fact that it is not guaranteed to be page aligned, so it will take three pages to ensure that it will be fully mapped to the guest. From the commit message in hvmloader that increased it from two to three pages: From: Keir Fraser <keir@xxxxxxx> Date: Thu, 10 Jan 2013 17:26:24 +0000 (+0000) Subject: hvmloader: Allocate 3 pages for Intel GPU OpRegion passthrough. X-Git-Tag: 4.3.0-rc1~546 X-Git-Url: https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff_plain;h=408a9e56343b006c9e58a334f0b97dd2deedf9ac hvmloader: Allocate 3 pages for Intel GPU OpRegion passthrough. The 8kB region may not be page aligned, hence requiring 3 pages to be mapped through. Signed-off-by: Keir Fraser <keir@xxxxxxx> In tests on my system, this is true. The opregion was, IIRC, offset 0x18 (24) bytes from a page boundary. This has an unfortunate side effect of granting access to one page that the guest does not really need access to. My well-behaved and trusted Linux and Windows guests only request the two pages of the igd_opregion, but a guest could have accessed the 24 bytes before it or the (4k - 24) bytes after it. I don't think that greatly increases the security risk of including this patch, because I think with passthrough of PCI devices, it must always be to a trusted guest for it to be secure. I don't think an attacker who gained control over a guest that has PCI devices passed through to it would need this exploit to successfully attack the dom0 or control domain from the guest. The damage could be done whether or not the attacker has access to that extra page if the attacker gained full control over a guest with PCI devices passed through to it. Regards, Chuck
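The page arithmetic behind that correction can be checked directly; a small standalone example using the 0x18-byte misalignment observed above (the base address is made up for illustration):

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12
    #define PAGE_SIZE  (1u << PAGE_SHIFT)

    int main(void)
    {
        uint32_t opregion_addr = 0x00ae0000 + 0x18;  /* 0x18 past a boundary */
        uint32_t opregion_size = 2 * PAGE_SIZE;      /* the 8kB opregion */

        uint32_t first_pfn = opregion_addr >> PAGE_SHIFT;
        uint32_t last_pfn  = (opregion_addr + opregion_size - 1) >> PAGE_SHIFT;

        /* Prints 3: a misaligned 8kB region straddles three pages. */
        printf("pages spanned: %u\n", (unsigned)(last_pfn - first_pfn + 1));
        return 0;
    }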
On 14.03.2022 04:41, Chuck Zmudzinski wrote: > When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD > opregion to the guest but libxl does not grant the guest permission to > access the mapped memory region. This results in a crash of the i915.ko > kernel module in a Linux HVM guest when it needs to access the IGD > opregion: > > Oct 23 11:36:33 domU kernel: Call Trace: > Oct 23 11:36:33 domU kernel: ? idr_alloc+0x39/0x70 > Oct 23 11:36:33 domU kernel: drm_get_last_vbltimestamp+0xaa/0xc0 [drm] > Oct 23 11:36:33 domU kernel: drm_reset_vblank_timestamp+0x5b/0xd0 [drm] > Oct 23 11:36:33 domU kernel: drm_crtc_vblank_on+0x7b/0x130 [drm] > Oct 23 11:36:33 domU kernel: intel_modeset_setup_hw_state+0xbd4/0x1900 [i915] > Oct 23 11:36:33 domU kernel: ? _cond_resched+0x16/0x40 > Oct 23 11:36:33 domU kernel: ? ww_mutex_lock+0x15/0x80 > Oct 23 11:36:33 domU kernel: intel_modeset_init_nogem+0x867/0x1d30 [i915] > Oct 23 11:36:33 domU kernel: ? gen6_write32+0x4b/0x1c0 [i915] > Oct 23 11:36:33 domU kernel: ? intel_irq_postinstall+0xb9/0x670 [i915] > Oct 23 11:36:33 domU kernel: i915_driver_probe+0x5c2/0xc90 [i915] > Oct 23 11:36:33 domU kernel: ? vga_switcheroo_client_probe_defer+0x1f/0x40 > Oct 23 11:36:33 domU kernel: ? i915_pci_probe+0x3f/0x150 [i915] > Oct 23 11:36:33 domU kernel: local_pci_probe+0x42/0x80 > Oct 23 11:36:33 domU kernel: ? _cond_resched+0x16/0x40 > Oct 23 11:36:33 domU kernel: pci_device_probe+0xfd/0x1b0 > Oct 23 11:36:33 domU kernel: really_probe+0x222/0x480 > Oct 23 11:36:33 domU kernel: driver_probe_device+0xe1/0x150 > Oct 23 11:36:33 domU kernel: device_driver_attach+0xa1/0xb0 > Oct 23 11:36:33 domU kernel: __driver_attach+0x8a/0x150 > Oct 23 11:36:33 domU kernel: ? device_driver_attach+0xb0/0xb0 > Oct 23 11:36:33 domU kernel: ? device_driver_attach+0xb0/0xb0 > Oct 23 11:36:33 domU kernel: bus_for_each_dev+0x78/0xc0 > Oct 23 11:36:33 domU kernel: bus_add_driver+0x12b/0x1e0 > Oct 23 11:36:33 domU kernel: driver_register+0x8b/0xe0 > Oct 23 11:36:33 domU kernel: ? 0xffffffffc06b8000 > Oct 23 11:36:33 domU kernel: i915_init+0x5d/0x70 [i915] > Oct 23 11:36:33 domU kernel: do_one_initcall+0x44/0x1d0 > Oct 23 11:36:33 domU kernel: ? do_init_module+0x23/0x260 > Oct 23 11:36:33 domU kernel: ? kmem_cache_alloc_trace+0xf5/0x200 > Oct 23 11:36:33 domU kernel: do_init_module+0x5c/0x260 > Oct 23 11:36:33 domU kernel: __do_sys_finit_module+0xb1/0x110 > Oct 23 11:36:33 domU kernel: do_syscall_64+0x33/0x80 > Oct 23 11:36:33 domU kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9 The call trace alone leaves open where exactly the crash occurred. Looking at 5.17 I notice that the first thing the driver does after mapping the range is to check the signature (both in intel_opregion_setup()). As the signature can't possibly match with no access granted to the underlying mappings, there shouldn't be any further attempts to use the region in the driver; if there are, I'd view this as a driver bug. Furthermore I've found indications (e.g. in the Core Gen11 doc) that the register may not hold an address in the first place, but instead a set of bitfields. I can't help the impression that the driver would still try to map the range pointed at by the value (as long as it's non-zero), which would imply unpredictable behavior. And then, looking further at intel_opregion_setup(), there's yet one more memory range which the guest may need access to: opregion->asle->rvda (or a value derived from it) also is handed to memremap() under certain conditions. Jan
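For reference, the signature check Jan refers to has roughly this shape (a simplified sketch based on i915's intel_opregion.c, where OPREGION_SIGNATURE is "IntelGraphicsMem"; not the verbatim kernel code):

    #include <string.h>

    #define OPREGION_SIGNATURE "IntelGraphicsMem"  /* 16 bytes */

    /* Simplified shape of the check in intel_opregion_setup(): after
     * memremap()ing the region, the first 16 bytes must match the
     * signature. With no access granted to the underlying pages the
     * read-back bytes cannot match, so setup should fail early. */
    static int opregion_signature_ok(const void *base)
    {
        return memcmp(base, OPREGION_SIGNATURE, 16) == 0;
    }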
On 3/15/22 7:38 AM, Jan Beulich wrote: > On 14.03.2022 04:41, Chuck Zmudzinski wrote: > >> + ret = xc_domain_iomem_permission(CTX->xch, stubdom_domid, >> + vga_iomem_start, >> + IGD_OPREGION_PAGES, 1); >> + if (ret < 0) { >> + LOGED(ERROR, domid, >> + "failed to give stubdom%d access to iomem range " >> + "%"PRIx64"-%"PRIx64" for IGD passthru", >> + stubdom_domid, vga_iomem_start, (vga_iomem_start + >> + IGD_OPREGION_PAGES - 1)); >> + return ret; >> + } > I have to admit that I find it odd that this is done unconditionally, > but I notice the same is done in pre-existing code. I would have > expected this to happen only when there actually is a device model > stub domain. > > Jan I dumped the value of stubdom_id for my tests with the device model running in dom0: libxl: info: libxl_pci.c:2556:libxl__grant_vga_iomem_permission: Domain 3: stubdom id: 0 As I expected, when there is not a device model stub domain and the device model runs in dom0, the stubdom_id is 0. I will now do some tests to see if this is necessary when the device model runs in dom0. I would like to know if the device model running in dom0 needs to have access granted here or not. When there is a device model stub domain, I presume it is necessary, and I can check that also and write the next version of the patch accordingly. Chuck
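If those tests show the grant is only needed when a stub domain exists, the conditional Jan expects would look roughly like the following fragment in the context of the quoted patch (a sketch under that assumption, not the submitted code):

    /* Only grant the stub domain access when one actually exists;
     * stubdom_domid is 0 when the device model runs in dom0, per the
     * log above. */
    if (stubdom_domid != 0) {
        ret = xc_domain_iomem_permission(CTX->xch, stubdom_domid,
                                         vga_iomem_start,
                                         IGD_OPREGION_PAGES, 1);
        if (ret < 0) {
            LOGED(ERROR, domid,
                  "failed to give stubdom%d access to iomem range "
                  "%"PRIx64"-%"PRIx64" for IGD passthru",
                  stubdom_domid, vga_iomem_start,
                  vga_iomem_start + IGD_OPREGION_PAGES - 1);
            return ret;
        }
    }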
Hi Chuck, On Sun, Mar 13, 2022 at 11:41:37PM -0400, Chuck Zmudzinski wrote: > When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD > opregion to the guest but libxl does not grant the guest permission to I'm not reading the same thing when looking at code in hvmloader. It seems that hvmloader allocates some memory for the IGD opregion rather than mapping it. tools/firmware/hvmloader/pci.c:184 if ( vendor_id == 0x8086 ) { igd_opregion_pgbase = mem_hole_alloc(IGD_OPREGION_PAGES); /* * Write the the OpRegion offset to give the opregion * address to the device model. The device model will trap * and map the OpRegion at the give address. */ pci_writel(vga_devfn, PCI_INTEL_OPREGION, igd_opregion_pgbase << PAGE_SHIFT); } I think this write would go through QEMU, does it do something with it? (I kind of reply to this question at the end of the mail.) Is this code in hvmloader actually run in your case? > Currently, because of another bug in Qemu upstream, this crash can > only be reproduced using the traditional Qemu device model, and of qemu-traditional is a bit old... What is the bug in QEMU? Do you have a link to a patch/mail? > diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c > index 4bbbfe9f16..a4fc473de9 100644 > --- a/tools/libs/light/libxl_pci.c > +++ b/tools/libs/light/libxl_pci.c > @@ -2531,6 +2572,37 @@ int libxl__grant_vga_iomem_permission(libxl__gc *gc, const uint32_t domid, > domid, vga_iomem_start, (vga_iomem_start + 0x20 - 1)); > return ret; > } > + > + /* If this is an Intel IGD, allow access to the IGD opregion */ > + if (!libxl__is_igd_vga_passthru(gc, d_config)) return 0; > + > + uint32_t igd_opregion = sysfs_dev_get_igd_opregion(gc, pci); > + uint32_t error = 0xffffffff; > + if (igd_opregion == error) break; > + > + vga_iomem_start = ( (uint64_t) igd_opregion ) >> XC_PAGE_SHIFT; > + ret = xc_domain_iomem_permission(CTX->xch, stubdom_domid, > + vga_iomem_start, > + IGD_OPREGION_PAGES, 1); > + if (ret < 0) { > + LOGED(ERROR, domid, > + "failed to give stubdom%d access to iomem range " > + "%"PRIx64"-%"PRIx64" for IGD passthru", > + stubdom_domid, vga_iomem_start, (vga_iomem_start + > + IGD_OPREGION_PAGES - 1)); > + return ret; > + } > + ret = xc_domain_iomem_permission(CTX->xch, domid, > + vga_iomem_start, > + IGD_OPREGION_PAGES, 1); Here, you seem to add permission to an address that is read from the pci config space of the device, but as I've pointed above hvmloader seems to overwrite this address. Does this call to xc_domain_iomem_permission() actually do anything useful? Or is it by luck that the address returned by sysfs_dev_get_igd_opregion() happened to be the address that hvmloader is going to write? Or maybe hvmloader doesn't actually do anything? Some more thought on that, looking at QEMU, it seems that there's already a call to xc_domain_iomem_permission(), in igd_write_opregion(). So adding one in libxl would seem redundant, or maybe it's the one for the device model's domain that's needed (named 'stubdom_domid' here)? Thanks,
On 30/03/2022 18:15, Anthony PERARD wrote: > Hi Chuck, > > On Sun, Mar 13, 2022 at 11:41:37PM -0400, Chuck Zmudzinski wrote: >> When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD >> opregion to the guest but libxl does not grant the guest permission to > I'm not reading the same thing when looking at code in hvmloader. It > seems that hvmloader allocates some memory for the IGD opregion rather > than mapping it. > > [... hvmloader excerpt, questions, and patch hunk snipped; quoted in full in Anthony's mail above ...] > > Some more thought on that, looking at QEMU, it seems that there's already > a call to xc_domain_iomem_permission(), in igd_write_opregion(). So > adding one in libxl would seem redundant, or maybe it's the one for the > device model's domain that's needed (named 'stubdom_domid' here)? This has been discussed before, but noone's done anything about it. It's a massive layering violation for QEMU to issue xc_domain_iomem_permission()/etc hypercalls.
It should be the toolstack, and only the toolstack, which makes permissions hypercalls, which in turn will fix a slew of "QEMU doesn't work when it doesn't have dom0 superpowers" bugs with stubdomains. In this case specifically, the opregion is a magic Intel graphics specific bodge. The i915 driver in the guest really does need to access part of the real PCH during init, which (in Xen's permission model) really does require permitting access to the MMIO range (8k iirc) so it can be mapped as a BAR in QEMU's emulated PCH. ~Andrew
On Fri, Mar 18, 2022 at 4:13 AM Jan Beulich <jbeulich@suse.com> wrote: > > On 14.03.2022 04:41, Chuck Zmudzinski wrote: > > When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD > > opregion to the guest but libxl does not grant the guest permission to > > access the mapped memory region. This results in a crash of the i915.ko > > kernel module in a Linux HVM guest when it needs to access the IGD > > opregion: > > > > [... i915 call trace snipped; quoted in full earlier in the thread ...] > > The call trace alone leaves open where exactly the crash occurred. > Looking at 5.17 I notice that the first thing the driver does > after mapping the range is to check the signature (both in > intel_opregion_setup()). As the signature can't possibly match > with no access granted to the underlying mappings, there shouldn't > be any further attempts to use the region in the driver; if there > are, I'd view this as a driver bug. Yes. i915_driver_hw_probe does not check the return value of intel_opregion_setup(dev_priv) and just continues on. Chuck, the attached patch may help if you want to test it. Regards, Jason
On 3/30/22 1:15 PM, Anthony PERARD wrote: > Hi Chuck, > > On Sun, Mar 13, 2022 at 11:41:37PM -0400, Chuck Zmudzinski wrote: >> When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD >> opregion to the guest but libxl does not grant the guest permission to > I'm not reading the same thing when looking at code in hvmloader. It > seems that hvmloader allocates some memory for the IGD opregion rather > than mapping it. > > > tools/firmware/hvmloader/pci.c:184 > if ( vendor_id == 0x8086 ) > { > igd_opregion_pgbase = mem_hole_alloc(IGD_OPREGION_PAGES); > /* > * Write the the OpRegion offset to give the opregion > * address to the device model. The device model will trap > * and map the OpRegion at the give address. > */ > pci_writel(vga_devfn, PCI_INTEL_OPREGION, > igd_opregion_pgbase << PAGE_SHIFT); > } > > I think this write would go through QEMU, does it do something with it? AFAICT Qemu does do something with it: In the upstream Qemu, at hw/xen/xen_pt_config_init.c and in hw/xen/xen_pt_graphics.c, we see where Qemu implements functions to read and write the OpRegion, and that means it reads and writes the value for the mapped OpRegion address that is passed to it from hvmloader. It is a 32-bit address that is stored in the config attribute in sysfs for the Intel IGD on Linux guests. I have examined the config attribute of the Intel IGD in both the control domain (dom0) and the Linux guest, and in each case the address of the IGD OpRegion is consistent with custom logs I have examined from xen/common/domctl.c, which show the same machine and guest addresses for the OpRegion as the config attribute does. > (I kind of reply to this question at the end of the mail.) > > Is this code in hvmloader actually run in your case? I admit I have not verified with certainty that Qemu and the guest are getting the OpRegion address from hvmloader. I will do a test to verify it, such as removing the code from hvmloader that writes the address in the pci config attribute and seeing if that prevents the guest from finding the address where the OpRegion is mapped to in the guest. That would prove that the hvmloader code here is run in my case. > >> Currently, because of another bug in Qemu upstream, this crash can >> only be reproduced using the traditional Qemu device model, and of > qemu-traditional is a bit old... What is the bug in QEMU? Do you have a > link to a patch/mail? Not yet. I am working on it now. The bug is related to the passthrough of the IRQ to the guest. So far I have compared the logs in the Linux guest using Qemu upstream with Qemu traditional, and I have found that with the upstream Qemu, the Linux kernel in the guest reports that it cannot obtain IRQ 24: Mar 29 18:31:39 debian kernel: xen: --> pirq=16 -> irq=24 (gsi=24) Mar 29 18:31:39 debian kernel: i915 0000:00:02.0: [drm] VT-d active for gfx access ... Mar 29 18:31:39 debian kernel: xen:events: Failed to obtain physical IRQ 24 When using the traditional device model, this failure is not reported. The failure of the system to handle the IRQ prevents the guest from booting normally with the upstream Qemu. Comparing Qemu upstream code with Qemu traditional code, I notice that when the traditional Qemu sets up the pci-isa bridge at slot 1f, it configures an IRQ, but the upstream Qemu does not, so I suspect that is where the problem is, but I have not found the fix yet.
It is well known that the pci-isa bridge at slot 1f needs to be specially configured for the Intel IGD to function properly. >> diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c >> index 4bbbfe9f16..a4fc473de9 100644 >> --- a/tools/libs/light/libxl_pci.c >> +++ b/tools/libs/light/libxl_pci.c >> @@ -2531,6 +2572,37 @@ int libxl__grant_vga_iomem_permission(libxl__gc *gc, const uint32_t domid, >> domid, vga_iomem_start, (vga_iomem_start + 0x20 - 1)); >> return ret; >> } >> + >> + /* If this is an Intel IGD, allow access to the IGD opregion */ >> + if (!libxl__is_igd_vga_passthru(gc, d_config)) return 0; >> + >> + uint32_t igd_opregion = sysfs_dev_get_igd_opregion(gc, pci); >> + uint32_t error = 0xffffffff; >> + if (igd_opregion == error) break; >> + >> + vga_iomem_start = ( (uint64_t) igd_opregion ) >> XC_PAGE_SHIFT; >> + ret = xc_domain_iomem_permission(CTX->xch, stubdom_domid, >> + vga_iomem_start, >> + IGD_OPREGION_PAGES, 1); >> + if (ret < 0) { >> + LOGED(ERROR, domid, >> + "failed to give stubdom%d access to iomem range " >> + "%"PRIx64"-%"PRIx64" for IGD passthru", >> + stubdom_domid, vga_iomem_start, (vga_iomem_start + >> + IGD_OPREGION_PAGES - 1)); >> + return ret; >> + } >> + ret = xc_domain_iomem_permission(CTX->xch, domid, >> + vga_iomem_start, >> + IGD_OPREGION_PAGES, 1); > Here, you seem to add permission to an address that is read from the > pci config space of the device, but as I've pointed above hvmloader > seems to overwrite this address. I do not think hvmloader overwrites this address. I think of it as hvmloader passing the mapped address to the device model and guest, but as I said above, I will do more tests to verify with certainty what hvmloader is actually doing. > Does this call to > xc_domain_iomem_permission() actually do anything useful? I am certain this call to xc_domain_iomem_permission() is necessary with the Qemu traditional device model to obtain permission for the guest (domid) to access the OpRegion. Without it, the bug is not fixed and I get the crash in the i915 kernel module in the Linux guest and a dark screen instead of a guest with output to the screen. Moreover, I have verified with custom logs from xen/common/domctl.c that access to the opregion is denied to domid, but not to dom0, when this patch is not applied. > Or is it by luck that the address returned by > sysfs_dev_get_igd_opregion() happened to be the address that hvmloader > is going to write? > > Or maybe hvmloader doesn't actually do anything? I mentioned earlier my plans to verify that hvmloader does provide the device model and the guest with the mapped address of the Intel IGD OpRegion. > > > Some more thought on that, looking at QEMU, it seems that there's already > a call to xc_domain_iomem_permission(), in igd_write_opregion(). So > adding one in libxl would seem redundant, You may be right that with Qemu upstream it is not needed. I will be able to check it once I have a patch for the IRQ problem in upstream Qemu. When I wrote this patch a couple of weeks ago, though, I did not yet know that Qemu upstream also calls xc_domain_iomem_permission() and since Qemu upstream seems to obtain the necessary permission, the call here to xc_domain_iomem_permission() can be done only when the device model is Qemu traditional. > or maybe it's the one for the > device model's domain that's needed (named 'stubdom_domid' here)?
Well, I am not using a device model stub domain but running the device model in dom0, and my tests indicate it is not necessary to obtain permission for dom0, but I do admit I need to do more tests before submitting the next version of a patch. I plan to do some tests with a device model stub domain and see what configurations require permission for the stub domain. I expect it will only be needed when the device model is Qemu traditional because Qemu upstream obtains the necessary permission. I have no experience running the device model in a stub domain, and IIRC the Xen wiki explains that the supported configuration from the Xen Project is with the traditional device model and mini os in the stub domain, and others, such as Qubes OS, have done some work on the upstream Qemu and Linux running in a stub domain. Please correct me if this is not correct. Thank you for taking the time to look at this patch. Chuck
On 3/30/22 1:27 PM, Andrew Cooper wrote: > On 30/03/2022 18:15, Anthony PERARD wrote: >> >> Some more thought on that, looking at QEMU, it seems that there's already >> a call to xc_domain_iomem_permission(), in igd_write_opregion(). > This has been discussed before, but noone's done anything about it. > It's a massive layering violation for QEMU to issue > xc_domain_iomem_permission()/etc hypercalls. > > It should be the toolstack, and only the toolstack, which makes > permissions hypercalls, which in turn will fix a slew of "QEMU doesn't > work when it doesn't have dom0 superpowers" bugs with stubdomains. How much say does the Xen project have over the code in Qemu under hw/xen? I would not be against having libxl do the permissions hypercalls in this case instead of Qemu doing it. My test with Qemu traditional and this patch proves the permission can be granted by libxl instead of the device model. > In this case specifically, the opregion is a magic Intel graphics > specific bodge. The i915 driver in the guest really does need to access > part of the real PCH during init, which (in Xen's permission model) > really does require permitting access to the MMIO range (8k iirc) so it > can be mapped as a BAR in QEMU's emulated PCH. That is exactly what my testing confirmed, but in my tests only Linux guests need access to the magic Intel opregion. The proprietary Windows Intel graphics drivers are apparently able to work around lack of access to the opregion, at least on my system, and Windows guests with Intel IGD passthrough function very well without needing this patch. So the problem could be fixed in the Linux i915 kernel driver, but Intel has not contributed the magic sauce to the Linux kernel that would enable Linux guests to work without access to the Intel opregion. And you are correct, the opregion is 8k (2 pages) in size, but it is not always page aligned, as mentioned earlier, so the IGD_OPREGION_PAGES constant is set to 3 instead of 2. Chuck
On 3/30/22 2:45 PM, Jason Andryuk wrote: > On Fri, Mar 18, 2022 at 4:13 AM Jan Beulich <jbeulich@suse.com> wrote: >> On 14.03.2022 04:41, Chuck Zmudzinski wrote: >>> When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD >>> opregion to the guest but libxl does not grant the guest permission to >>> access the mapped memory region. This results in a crash of the i915.ko >>> kernel module in a Linux HVM guest when it needs to access the IGD >>> opregion: >>> >>> [... i915 call trace snipped; quoted in full earlier in the thread ...] >> The call trace alone leaves open where exactly the crash occurred. >> Looking at 5.17 I notice that the first thing the driver does >> after mapping the range is to check the signature (both in >> intel_opregion_setup()). As the signature can't possibly match >> with no access granted to the underlying mappings, there shouldn't >> be any further attempts to use the region in the driver; if there >> are, I'd view this as a driver bug. > Yes. i915_driver_hw_probe does not check the return value of > intel_opregion_setup(dev_priv) and just continues on. > > Chuck, the attached patch may help if you want to test it. > > Regards, > Jason Thanks for the patch, I will try it when I get a chance and report if it prevents the crash and enables video output to my screen. Has your patch been committed to Linux?
I just checked on the gitlab Linux master branch and didn't see it there. Regards, Chuck
On Thu, Mar 31, 2022 at 12:34 AM Chuck Zmudzinski <brchuckz@netscape.net> wrote: > > On 3/30/22 2:45 PM, Jason Andryuk wrote: > > On Fri, Mar 18, 2022 at 4:13 AM Jan Beulich <jbeulich@suse.com> wrote: > >> On 14.03.2022 04:41, Chuck Zmudzinski wrote: > >>> When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD > >>> opregion to the guest but libxl does not grant the guest permission to > >>> access the mapped memory region. This results in a crash of the i915.ko > >>> kernel module in a Linux HVM guest when it needs to access the IGD > >>> opregion: > >>> > >>> [... i915 call trace snipped; quoted in full earlier in the thread ...] > >> The call trace alone leaves open where exactly the crash occurred. > >> Looking at 5.17 I notice that the first thing the driver does > >> after mapping the range is to check the signature (both in > >> intel_opregion_setup()). As the signature can't possibly match > >> with no access granted to the underlying mappings, there shouldn't > >> be any further attempts to use the region in the driver; if there > >> are, I'd view this as a driver bug. > > Yes. i915_driver_hw_probe does not check the return value of > > intel_opregion_setup(dev_priv) and just continues on. > > > > Chuck, the attached patch may help if you want to test it.
> > > > Regards, > > Jason > > Thanks for the patch, I will try it when I get a chance > and report if it prevents the crash and enables video > output to my screen. Has your patch been committed > to Linux? I just checked on the gitlab Linux master > branch and didn't see it there. This patch should just make the i915 probe error out properly inside the domU when the opregion cannot be mapped properly. It would avoid triggering the domU trace you posted above, but it wouldn't solve any other issues. I have not yet submitted it upstream. Regards, Jason
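The shape of such a fix would be to propagate the failure in i915_driver_hw_probe rather than ignore it, roughly as in the following fragment (a sketch of the idea only; Jason's actual attachment is not reproduced in the archive, and the cleanup label is hypothetical):

    /* In i915_driver_hw_probe(): check the result instead of
     * discarding it, so a failed opregion setup aborts the probe. */
    ret = intel_opregion_setup(dev_priv);
    if (ret)
        goto err_opregion;  /* hypothetical unwind label */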
On Wed, Mar 30, 2022 at 11:54 PM Chuck Zmudzinski <brchuckz@netscape.net> wrote: > > On 3/30/22 1:27 PM, Andrew Cooper wrote: > > On 30/03/2022 18:15, Anthony PERARD wrote: > >> > >> Some more though on that, looking at QEMU, it seems that there's already > >> a call to xc_domain_iomem_permission(), in igd_write_opregion(). > > This has been discussed before, but noone's done anything about it. > > It's a massive layering violation for QEMU to issue > > xc_domain_iomem_permission()/etc hypercalls. > > > > It should be the toolstack, and only the toolstack, which makes > > permissions hypercalls, which in turn will fix a slew of "QEMU doesn't > > work when it doesn't have dom0 superpowers" bugs with stubdomains. > > How much say does the Xen project have over the code > in Qemu under hw/xen? I would not be against having libxl > do the permissions hypercalls in this case instead of Qemu > doing it. My test with Qemu traditional and this patch proves > the permission can be granted by libxl instead of the device > model. Qubes patches libxl and QEMU, and they move the hypercalls to the toolstack. They are using linux stubdoms, and I think it works for them. QEMU: https://github.com/QubesOS/qubes-vmm-xen-stubdom-linux/blob/master/qemu/patches/0015-IGD-fix-undefined-behaviour.patch https://github.com/QubesOS/qubes-vmm-xen-stubdom-linux/blob/master/qemu/patches/0016-IGD-improve-legacy-vbios-handling.patch https://github.com/QubesOS/qubes-vmm-xen-stubdom-linux/blob/master/qemu/patches/0017-IGD-move-enabling-opregion-access-to-libxl.patch libxl: https://github.com/QubesOS/qubes-vmm-xen/blob/xen-4.14/patch-fix-igd-passthrough-with-linux-stubdomain.patch maybe this one, too: https://github.com/QubesOS/qubes-vmm-xen/blob/xen-4.14/patch-libxl-automatically-enable-gfx_passthru-if-IGD-is-as.patch Regards, Jason
On 3/31/2022 8:23 AM, Jason Andryuk wrote: > On Thu, Mar 31, 2022 at 12:34 AM Chuck Zmudzinski <brchuckz@netscape.net> wrote: >> On 3/30/22 2:45 PM, Jason Andryuk wrote: >>> On Fri, Mar 18, 2022 at 4:13 AM Jan Beulich <jbeulich@suse.com> wrote: >>>> On 14.03.2022 04:41, Chuck Zmudzinski wrote: >>>>> When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD >>>>> opregion to the guest but libxl does not grant the guest permission to >>>>> access the mapped memory region. This results in a crash of the i915.ko >>>>> kernel module in a Linux HVM guest when it needs to access the IGD >>>>> opregion: >>>>> >>>>> [... i915 call trace snipped; quoted in full earlier in the thread ...] >>>> The call trace alone leaves open where exactly the crash occurred. >>>> Looking at 5.17 I notice that the first thing the driver does >>>> after mapping the range is to check the signature (both in >>>> intel_opregion_setup()). As the signature can't possibly match >>>> with no access granted to the underlying mappings, there shouldn't >>>> be any further attempts to use the region in the driver; if there >>>> are, I'd view this as a driver bug. >>> Yes. i915_driver_hw_probe does not check the return value of >>> intel_opregion_setup(dev_priv) and just continues on.
>>>
>>> Chuck, the attached patch may help if you want to test it.
>>>
>>> Regards,
>>> Jason
>> Thanks for the patch, I will try it when I get a chance
>> and report if it prevents the crash and enables video
>> output to my screen. Has your patch been committed
>> to Linux? I just checked the gitlab Linux master
>> branch and didn't see it there.
> This patch should just make the i915 probe error out properly inside
> the domU when the opregion cannot be mapped properly. It would avoid
> triggering the domU trace you posted above, but it wouldn't solve any
> other issue.
>
> I have not yet submitted it upstream.
>
> Regards,
> Jason

I understand the limitations of this patch, that the guest will still not have access to the opregion. Still, I can test it - I do remember some configurations in which I could get output on the VGA port but not the HDMI port, and maybe with this patch at least the VGA port will work. In fact, I am not even sure the VGA port does not currently work without your patch, but I know the HDMI port does not work without your patch and an unpatched Xen tool stack. So the patch might help somewhat, and if it does help, it is probably suitable for upstream.

Chuck
On 3/31/2022 8:29 AM, Jason Andryuk wrote: > On Wed, Mar 30, 2022 at 11:54 PM Chuck Zmudzinski <brchuckz@netscape.net> wrote: >> On 3/30/22 1:27 PM, Andrew Cooper wrote: >>> >>> This has been discussed before, but noone's done anything about it. >>> It's a massive layering violation for QEMU to issue >>> xc_domain_iomem_permission()/etc hypercalls. >>> >>> It should be the toolstack, and only the toolstack, which makes >>> permissions hypercalls, which in turn will fix a slew of "QEMU doesn't >>> work when it doesn't have dom0 superpowers" bugs with stubdomains. >> How much say does the Xen project have over the code >> in Qemu under hw/xen? I would not be against having libxl >> do the permissions hypercalls in this case instead of Qemu >> doing it. My test with Qemu traditional and this patch proves >> the permission can be granted by libxl instead of the device >> model. > Qubes patches libxl and QEMU, and they move the hypercalls to the > toolstack. They are using linux stubdoms, and I think it works for > them. That still doesn't answer my question - will the Qemu upstream accept the patches that move the hypercalls to the toolstack? If not, we have to live with what we have now, which is that the hypercalls are done in Qemu. Regards, Chuck
On Thu, Mar 31, 2022 at 10:05 AM Chuck Zmudzinski <brchuckz@netscape.net> wrote: > > On 3/31/2022 8:29 AM, Jason Andryuk wrote: > > On Wed, Mar 30, 2022 at 11:54 PM Chuck Zmudzinski <brchuckz@netscape.net> wrote: > >> On 3/30/22 1:27 PM, Andrew Cooper wrote: > >>> > >>> This has been discussed before, but noone's done anything about it. > >>> It's a massive layering violation for QEMU to issue > >>> xc_domain_iomem_permission()/etc hypercalls. > >>> > >>> It should be the toolstack, and only the toolstack, which makes > >>> permissions hypercalls, which in turn will fix a slew of "QEMU doesn't > >>> work when it doesn't have dom0 superpowers" bugs with stubdomains. > >> How much say does the Xen project have over the code > >> in Qemu under hw/xen? I would not be against having libxl > >> do the permissions hypercalls in this case instead of Qemu > >> doing it. My test with Qemu traditional and this patch proves > >> the permission can be granted by libxl instead of the device > >> model. > > Qubes patches libxl and QEMU, and they move the hypercalls to the > > toolstack. They are using linux stubdoms, and I think it works for > > them. > > That still doesn't answer my question - will the Qemu upstream > accept the patches that move the hypercalls to the toolstack? If > not, we have to live with what we have now, which is that the > hypercalls are done in Qemu. Xen-associated people maintain hw/xen code in QEMU, so yes it could be accepted. Maybe it would need to be backwards compatible to have libxl check the QEMU version to decide who makes the hypercall? Unless it is broken today, in which case just make it work. Regards, Jason
On 3/30/22 1:15 PM, Anthony PERARD wrote:
> Hi Chuck,
>
> On Sun, Mar 13, 2022 at 11:41:37PM -0400, Chuck Zmudzinski wrote:
>> When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD
>> opregion to the guest but libxl does not grant the guest permission to
> I'm not reading the same thing when looking at code in hvmloader. It
> seems that hvmloader allocate some memory for the IGD opregion rather
> than mapping it.
>
> tools/firmware/hvmloader/pci.c:184
>     if ( vendor_id == 0x8086 )
>     {
>         igd_opregion_pgbase = mem_hole_alloc(IGD_OPREGION_PAGES);
>         /*
>          * Write the the OpRegion offset to give the opregion
>          * address to the device model. The device model will trap
>          * and map the OpRegion at the give address.
>          */
>         pci_writel(vga_devfn, PCI_INTEL_OPREGION,
>                    igd_opregion_pgbase << PAGE_SHIFT);
>     }
>
> I think this write would go through QEMU, does it do something with it?
> (I kind of replying to this question at the end of the mail.)
>
> Is this code in hvmloader actually run in your case?

Hi Anthony,

To test your theory that this hvmloader code is not run in my case and does nothing, I did the following tests:

I wrote a little C program to read the mapped IGD opregion address in the guest from the config attribute of the IGD in sysfs:

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdint.h>

/* Read the IGD opregion address from the config attribute of the
 * IGD at 00:02.0 in the guest. */
int main()
{
    int fd = open("/sys/devices/pci0000:00/0000:00:02.0/config", O_RDONLY);
    uint32_t buf;
    off_t offset = 0xfc; /* PCI_INTEL_OPREGION */

    if (fd < 0 || pread(fd, &buf, 4, offset) != 4) {
        perror("reading the IGD config attribute failed");
        return 1;
    }
    printf("opregion = %x\n", buf); /* buf is 32 bits wide: %x, not %lx */
    return close(fd);
}

------------------------ end of C program -----------------

Since the config attribute cannot be read by an ordinary user, the program must be run as the super user to successfully read the opregion address from sysfs.

I also grabbed the BIOS-provided physical RAM map in the guest, which shows the 3 pages mapped by hvmloader for the opregion (it is the second to last entry in the table):

Mar 31 13:20:16 kolbe kernel: BIOS-provided physical RAM map:
Mar 31 13:20:16 kolbe kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009dfff] usable
Mar 31 13:20:16 kolbe kernel: BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] reserved
Mar 31 13:20:16 kolbe kernel: BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
Mar 31 13:20:16 kolbe kernel: BIOS-e820: [mem 0x0000000000100000-0x00000000bfbfffff] usable
Mar 31 13:20:16 kolbe kernel: BIOS-e820: [mem 0x00000000cc346000-0x00000000cc352fff] reserved
Mar 31 13:20:16 kolbe kernel: BIOS-e820: [mem 0x00000000cf800000-0x00000000df9fffff] reserved
Mar 31 13:20:16 kolbe kernel: BIOS-e820: [mem 0x00000000fc000000-0x00000000fc009fff] ACPI NVS
Mar 31 13:20:16 kolbe kernel: BIOS-e820: [mem 0x00000000fc00a000-0x00000000fdffbfff] reserved
Mar 31 13:20:16 kolbe kernel: BIOS-e820: [mem 0x00000000fdffc000-0x00000000fdffefff] ACPI NVS
Mar 31 13:20:16 kolbe kernel: BIOS-e820: [mem 0x00000000fdfff000-0x00000000ffffffff] reserved

Now with the current code and no patches, the output of the little C program in the guest is wrong: instead of the opregion address (it should be fdffc018, because the offset of the opregion from the page boundary is 0x18 (24) bytes on my hardware), it displays an error value (ffffffff):

opregion = ffffffff

I would expect it to be correct if nothing overwrites the value hvmloader wrote.
In fact, I think the value hvmloader wrote might be the page-aligned address fdffc000 instead of fdffc018. I am not sure where this error value comes from, so I need to investigate, because I think you are correct that the value hvmloader wrote was overwritten somewhere.

Now when I apply my patch to libxl, I get the same physical RAM map in the guest:

Mar 31 13:12:35 kolbe kernel: BIOS-provided physical RAM map:
Mar 31 13:12:35 kolbe kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009dfff] usable
Mar 31 13:12:35 kolbe kernel: BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] reserved
Mar 31 13:12:35 kolbe kernel: BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
Mar 31 13:12:35 kolbe kernel: BIOS-e820: [mem 0x0000000000100000-0x00000000bfbfffff] usable
Mar 31 13:12:35 kolbe kernel: BIOS-e820: [mem 0x00000000cc346000-0x00000000cc352fff] reserved
Mar 31 13:12:35 kolbe kernel: BIOS-e820: [mem 0x00000000cf800000-0x00000000df9fffff] reserved
Mar 31 13:12:35 kolbe kernel: BIOS-e820: [mem 0x00000000fc000000-0x00000000fc009fff] ACPI NVS
Mar 31 13:12:35 kolbe kernel: BIOS-e820: [mem 0x00000000fc00a000-0x00000000fdffbfff] reserved
Mar 31 13:12:35 kolbe kernel: BIOS-e820: [mem 0x00000000fdffc000-0x00000000fdffefff] ACPI NVS
Mar 31 13:12:35 kolbe kernel: BIOS-e820: [mem 0x00000000fdfff000-0x00000000ffffffff] reserved

And now the little C program reads the correct opregion address from sysfs in the guest - the actual opregion address, not the page-aligned address that hvmloader appears to have written:

opregion = fdffc018

Next I removed the code snippet from hvmloader that allocates three pages in the guest for the opregion and writes the opregion address to the pci config attribute, and now there is no memory hole allocated for the opregion in the guest:

Mar 31 12:47:34 kolbe kernel: BIOS-provided physical RAM map:
Mar 31 12:47:34 kolbe kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009dfff] usable
Mar 31 12:47:34 kolbe kernel: BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] reserved
Mar 31 12:47:34 kolbe kernel: BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
Mar 31 12:47:34 kolbe kernel: BIOS-e820: [mem 0x0000000000100000-0x00000000bfbfffff] usable
Mar 31 12:47:34 kolbe kernel: BIOS-e820: [mem 0x00000000cc346000-0x00000000cc352fff] reserved
Mar 31 12:47:34 kolbe kernel: BIOS-e820: [mem 0x00000000cf800000-0x00000000df9fffff] reserved
Mar 31 12:47:34 kolbe kernel: BIOS-e820: [mem 0x00000000fc000000-0x00000000fc009fff] ACPI NVS
Mar 31 12:47:34 kolbe kernel: BIOS-e820: [mem 0x00000000fc00a000-0x00000000ffffffff] reserved

I ran this case without my patch to libxl, for safety reasons, because the opregion address in sysfs in the guest is wrong; the little C program returns the same error value:

opregion = ffffffff

So the conclusion is that hvmloader does allocate the three pages in the guest for the opregion and writes a value for the opregion address, but it appears that value is overwritten later: with the error value when the guest cannot access the opregion, and with the correct value when the guest can access the opregion.

So I agree that I should understand what is going on here better. I tentatively think the call to pci_writel in hvmloader can be safely removed, because that value seems to be changed later on somewhere. But we do need to keep the call that allocates the memory hole for the opregion in hvmloader, or maybe move that call to the toolstack.
So I will need to have a better answer to your questions before I propose the next version of the patch / patch series. Regards, Chuck
On 3/31/22 10:19 AM, Jason Andryuk wrote: > On Thu, Mar 31, 2022 at 10:05 AM Chuck Zmudzinski <brchuckz@netscape.net> wrote: >> >> That still doesn't answer my question - will the Qemu upstream >> accept the patches that move the hypercalls to the toolstack? If >> not, we have to live with what we have now, which is that the >> hypercalls are done in Qemu. > Xen-associated people maintain hw/xen code in QEMU, so yes it could be accepted. > > Maybe it would need to be backwards compatible to have libxl check the > QEMU version to decide who makes the hypercall? Unless it is broken > today, in which case just make it work. > > Regards, > Jason I know of another reason to check the Qemu upstream version, and that is dealing with deprecated / removed device model options that xl.cfg still uses. I looked at that a few years ago with regard to the deprecated 'usbdevice tablet' Qemu option, but I did not see anything in libxl to distinguish Qemu versions except for upstream vs. traditional. AFAICT, detecting traditional vs. upstream Qemu depends solely on the device_model_version xl.cfg setting. So it might be useful for libxl to add the capability to detect the upstream Qemu version or at least create an xl.cfg setting to inform libxl about the Qemu version. Regards, Chuck
On 3/30/22 1:15 PM, Anthony PERARD wrote:
> Hi Chuck,
>
> On Sun, Mar 13, 2022 at 11:41:37PM -0400, Chuck Zmudzinski wrote:
>> When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD
>> opregion to the guest but libxl does not grant the guest permission to
> I'm not reading the same thing when looking at code in hvmloader. It
> seems that hvmloader allocate some memory for the IGD opregion rather
> than mapping it.
>
> tools/firmware/hvmloader/pci.c:184
>     if ( vendor_id == 0x8086 )
>     {
>         igd_opregion_pgbase = mem_hole_alloc(IGD_OPREGION_PAGES);
>         /*
>          * Write the the OpRegion offset to give the opregion
>          * address to the device model. The device model will trap
>          * and map the OpRegion at the give address.
>          */
>         pci_writel(vga_devfn, PCI_INTEL_OPREGION,
>                    igd_opregion_pgbase << PAGE_SHIFT);
>     }
>
> I think this write would go through QEMU, does it do something with it?
> (I kind of replying to this question at the end of the mail.)
>
> Is this code in hvmloader actually run in your case?

Hi Anthony,

Let me try to answer your question again. My tests indicate that this code in hvmloader is actually run in my case. As I mentioned in an earlier message, the allocation of the three pages for the opregion is not done for the guest if I remove this code from hvmloader.

The only concern I had was about the difference between what I was reading for the opregion address in sysfs (fdffc018 in my case) and the address that hvmloader wrote (fdffc000 in my case). The change is easily explained by what the Qemu device model (both upstream and traditional) does when it writes the opregion address into the Intel IGD config attribute. This is the traditional Qemu code in hw/pt_graphics.c:66:

void igd_write_opregion(struct pt_dev *real_dev, uint32_t val)
{
    uint32_t host_opregion = 0;
    int ret;

    if ( igd_guest_opregion )
    {
        PT_LOG("opregion register already been set, ignoring %x\n", val);
        return;
    }

    host_opregion = pt_pci_host_read(real_dev->pci_dev, PCI_INTEL_OPREGION, 4);
    igd_guest_opregion = (val & ~0xfff) | (host_opregion & 0xfff);

------------------------ End of code snippet -----------------------------------

The offset of the opregion in the guest (0x18 in my case) is recovered by that last statement. The upstream device model does the same thing, using the constant XEN_PCI_INTEL_OPREGION_MASK, set to 0xfff, to recover the offset. So what we have in hvmloader is correct and necessary.
>
>> diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c
>> index 4bbbfe9f16..a4fc473de9 100644
>> --- a/tools/libs/light/libxl_pci.c
>> +++ b/tools/libs/light/libxl_pci.c
>> @@ -2531,6 +2572,37 @@ int libxl__grant_vga_iomem_permission(libxl__gc *gc, const uint32_t domid,
>>                domid, vga_iomem_start, (vga_iomem_start + 0x20 - 1));
>>          return ret;
>>      }
>> +
>> +    /* If this is an Intel IGD, allow access to the IGD opregion */
>> +    if (!libxl__is_igd_vga_passthru(gc, d_config)) return 0;
>> +
>> +    uint32_t igd_opregion = sysfs_dev_get_igd_opregion(gc, pci);
>> +    uint32_t error = 0xffffffff;
>> +    if (igd_opregion == error) break;
>> +
>> +    vga_iomem_start = ( (uint64_t) igd_opregion ) >> XC_PAGE_SHIFT;
>> +    ret = xc_domain_iomem_permission(CTX->xch, stubdom_domid,
>> +                                     vga_iomem_start,
>> +                                     IGD_OPREGION_PAGES, 1);
>> +    if (ret < 0) {
>> +        LOGED(ERROR, domid,
>> +              "failed to give stubdom%d access to iomem range "
>> +              "%"PRIx64"-%"PRIx64" for IGD passthru",
>> +              stubdom_domid, vga_iomem_start, (vga_iomem_start +
>> +              IGD_OPREGION_PAGES - 1));
>> +        return ret;
>> +    }
>> +    ret = xc_domain_iomem_permission(CTX->xch, domid,
>> +                                     vga_iomem_start,
>> +                                     IGD_OPREGION_PAGES, 1);
> Here, you seems to add permission to an address that is read from the
> pci config space of the device, but as I've pointed above hvmloader
> seems to overwrite this address.

No, hvmloader wrote the mapped address; here we are reading the opregion address of the host, not the mapped address of the guest. There is no problem here.

> It this call to
> xc_domain_iomem_permission() does actually anything useful?
> Or is it by luck that the address returned by
> sysfs_dev_get_igd_opregion() happened to be the address that hvmloader
> is going to write?

No, it is not luck. We use the same constant in hvmloader, Qemu, and here in this patch to properly map the opregion to the guest: the constant is PCI_INTEL_OPREGION, set to 0xfc, the offset within the config attribute at which the opregion address is stored.

>
> Or maybe hvmloader doesn't actually do anything?

It does do something, and what it does is necessary.

Regards,

Chuck
On 3/31/22 2:32 PM, Chuck Zmudzinski wrote:
> On 3/30/22 1:15 PM, Anthony PERARD wrote:
>> Hi Chuck,
>>
>> On Sun, Mar 13, 2022 at 11:41:37PM -0400, Chuck Zmudzinski wrote:
>>> When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD
>>> opregion to the guest but libxl does not grant the guest permission to
>> I'm not reading the same thing when looking at code in hvmloader. It
>> seems that hvmloader allocate some memory for the IGD opregion rather
>> than mapping it.
>>
>> tools/firmware/hvmloader/pci.c:184
>>     if ( vendor_id == 0x8086 )
>>     {
>>         igd_opregion_pgbase = mem_hole_alloc(IGD_OPREGION_PAGES);
>>         /*
>>          * Write the the OpRegion offset to give the opregion
>>          * address to the device model. The device model will trap
>>          * and map the OpRegion at the give address.
>>          */
>>         pci_writel(vga_devfn, PCI_INTEL_OPREGION,
>>                    igd_opregion_pgbase << PAGE_SHIFT);
>>     }
>>
>> I think this write would go through QEMU, does it do something with it?
>> (I kind of replying to this question at the end of the mail.)
>>
>> Is this code in hvmloader actually run in your case?
>
> Hi Anthony,
> ...
> So the conclusion is that hvmloader does allocate the three pages in the
> guest for the opregion and writes a value for the opregion address, but
> it appears it is overwritten later with the error value when the guest
> cannot access the opregion and with the correct value when the guest can
> access the opregion.
>
> So I agree that I should understand what is going on here better. I
> tentatively think the call to pci_writel in hvmloader can be safely
> removed because that value seems to be changed later on somewhere.

After discovering how the device model recovers the offset of the opregion from the page boundary, I am now certain that what we currently have in hvmloader is necessary. We do need the pci_writel call in hvmloader, because that is where the mapped, page-aligned address of the opregion in the guest is written. The device model then reads that value, recovers the offset of the opregion from the page boundary, and writes the address of the beginning of the opregion - not the page-aligned address that hvmloader wrote - into the config attribute of the Intel IGD that is passed through to the guest.

That is why it seemed to me that the address was changed somewhere: the device model modifies it so it is the actual address of the opregion and not the address of the page boundary that immediately precedes the opregion.

I hope this is an acceptable explanation of what we currently have in hvmloader.

Regards,

Chuck
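To make the offset recovery concrete, here is a short stand-alone sketch of the arithmetic in that last statement of igd_write_opregion(), using the page-aligned guest address from this thread (fdffc000); the host opregion address below is hypothetical, except for its low 12 bits (0x018), which are the offset observed on this hardware:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* page-aligned guest address written by hvmloader (from this thread) */
    uint32_t val = 0xfdffc000;
    /* hypothetical host opregion address; only the low 12 bits matter here */
    uint32_t host_opregion = 0x12345018;

    /* page base from the guest value, offset within the page from the host */
    uint32_t igd_guest_opregion = (val & ~0xfffu) | (host_opregion & 0xfffu);

    printf("igd_guest_opregion = %x\n", igd_guest_opregion); /* fdffc018 */
    return 0;
}

This is why the little C program reads fdffc018 from the guest's config attribute even though hvmloader wrote the page-aligned fdffc000.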
On 3/30/22 2:45 PM, Jason Andryuk wrote: > On Fri, Mar 18, 2022 at 4:13 AM Jan Beulich <jbeulich@suse.com> wrote: >> On 14.03.2022 04:41, Chuck Zmudzinski wrote: >>> When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD >>> opregion to the guest but libxl does not grant the guest permission to >>> access the mapped memory region. This results in a crash of the i915.ko >>> kernel module in a Linux HVM guest when it needs to access the IGD >>> opregion: >>> >>> Oct 23 11:36:33 domU kernel: Call Trace: >>> Oct 23 11:36:33 domU kernel: ? idr_alloc+0x39/0x70 >>> Oct 23 11:36:33 domU kernel: drm_get_last_vbltimestamp+0xaa/0xc0 [drm] >>> Oct 23 11:36:33 domU kernel: drm_reset_vblank_timestamp+0x5b/0xd0 [drm] >>> Oct 23 11:36:33 domU kernel: drm_crtc_vblank_on+0x7b/0x130 [drm] >>> Oct 23 11:36:33 domU kernel: intel_modeset_setup_hw_state+0xbd4/0x1900 [i915] >>> Oct 23 11:36:33 domU kernel: ? _cond_resched+0x16/0x40 >>> Oct 23 11:36:33 domU kernel: ? ww_mutex_lock+0x15/0x80 >>> Oct 23 11:36:33 domU kernel: intel_modeset_init_nogem+0x867/0x1d30 [i915] >>> Oct 23 11:36:33 domU kernel: ? gen6_write32+0x4b/0x1c0 [i915] >>> Oct 23 11:36:33 domU kernel: ? intel_irq_postinstall+0xb9/0x670 [i915] >>> Oct 23 11:36:33 domU kernel: i915_driver_probe+0x5c2/0xc90 [i915] >>> Oct 23 11:36:33 domU kernel: ? vga_switcheroo_client_probe_defer+0x1f/0x40 >>> Oct 23 11:36:33 domU kernel: ? i915_pci_probe+0x3f/0x150 [i915] >>> Oct 23 11:36:33 domU kernel: local_pci_probe+0x42/0x80 >>> Oct 23 11:36:33 domU kernel: ? _cond_resched+0x16/0x40 >>> Oct 23 11:36:33 domU kernel: pci_device_probe+0xfd/0x1b0 >>> Oct 23 11:36:33 domU kernel: really_probe+0x222/0x480 >>> Oct 23 11:36:33 domU kernel: driver_probe_device+0xe1/0x150 >>> Oct 23 11:36:33 domU kernel: device_driver_attach+0xa1/0xb0 >>> Oct 23 11:36:33 domU kernel: __driver_attach+0x8a/0x150 >>> Oct 23 11:36:33 domU kernel: ? device_driver_attach+0xb0/0xb0 >>> Oct 23 11:36:33 domU kernel: ? device_driver_attach+0xb0/0xb0 >>> Oct 23 11:36:33 domU kernel: bus_for_each_dev+0x78/0xc0 >>> Oct 23 11:36:33 domU kernel: bus_add_driver+0x12b/0x1e0 >>> Oct 23 11:36:33 domU kernel: driver_register+0x8b/0xe0 >>> Oct 23 11:36:33 domU kernel: ? 0xffffffffc06b8000 >>> Oct 23 11:36:33 domU kernel: i915_init+0x5d/0x70 [i915] >>> Oct 23 11:36:33 domU kernel: do_one_initcall+0x44/0x1d0 >>> Oct 23 11:36:33 domU kernel: ? do_init_module+0x23/0x260 >>> Oct 23 11:36:33 domU kernel: ? kmem_cache_alloc_trace+0xf5/0x200 >>> Oct 23 11:36:33 domU kernel: do_init_module+0x5c/0x260 >>> Oct 23 11:36:33 domU kernel: __do_sys_finit_module+0xb1/0x110 >>> Oct 23 11:36:33 domU kernel: do_syscall_64+0x33/0x80 >>> Oct 23 11:36:33 domU kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9 >> The call trace alone leaves open where exactly the crash occurred. >> Looking at 5.17 I notice that the first thing the driver does >> after mapping the range it to check the signature (both in >> intel_opregion_setup()). As the signature can't possibly match >> with no access granted to the underlying mappings, there shouldn't >> be any further attempts to use the region in the driver; if there >> are, I'd view this as a driver bug. > Yes. i915_driver_hw_probe does not check the return value of > intel_opregion_setup(dev_priv) and just continues on. > > Chuck, the attached patch may help if you want to test it. > > Regards, > Jason I tested the patch - it made no noticeable difference. I still get the same crash and call trace with the patch. 
Actually, the call trace I posted here is only the first of three call traces, and I still see all three call traces with the patch. The patch is harmless, and the i915 module with the patch works normally when it can access the Intel opregion.

Regards,

Chuck
On 4/1/22 9:21 AM, Chuck Zmudzinski wrote: > On 3/30/22 2:45 PM, Jason Andryuk wrote: >> On Fri, Mar 18, 2022 at 4:13 AM Jan Beulich <jbeulich@suse.com> wrote: >>> On 14.03.2022 04:41, Chuck Zmudzinski wrote: >>>> When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD >>>> opregion to the guest but libxl does not grant the guest permission to >>>> access the mapped memory region. This results in a crash of the >>>> i915.ko >>>> kernel module in a Linux HVM guest when it needs to access the IGD >>>> opregion: >>>> >>>> Oct 23 11:36:33 domU kernel: Call Trace: >>>> Oct 23 11:36:33 domU kernel: ? idr_alloc+0x39/0x70 >>>> Oct 23 11:36:33 domU kernel: drm_get_last_vbltimestamp+0xaa/0xc0 [drm] >>>> Oct 23 11:36:33 domU kernel: drm_reset_vblank_timestamp+0x5b/0xd0 >>>> [drm] >>>> Oct 23 11:36:33 domU kernel: drm_crtc_vblank_on+0x7b/0x130 [drm] >>>> Oct 23 11:36:33 domU kernel: >>>> intel_modeset_setup_hw_state+0xbd4/0x1900 [i915] >>>> Oct 23 11:36:33 domU kernel: ? _cond_resched+0x16/0x40 >>>> Oct 23 11:36:33 domU kernel: ? ww_mutex_lock+0x15/0x80 >>>> Oct 23 11:36:33 domU kernel: intel_modeset_init_nogem+0x867/0x1d30 >>>> [i915] >>>> Oct 23 11:36:33 domU kernel: ? gen6_write32+0x4b/0x1c0 [i915] >>>> Oct 23 11:36:33 domU kernel: ? intel_irq_postinstall+0xb9/0x670 >>>> [i915] >>>> Oct 23 11:36:33 domU kernel: i915_driver_probe+0x5c2/0xc90 [i915] >>>> Oct 23 11:36:33 domU kernel: ? >>>> vga_switcheroo_client_probe_defer+0x1f/0x40 >>>> Oct 23 11:36:33 domU kernel: ? i915_pci_probe+0x3f/0x150 [i915] >>>> Oct 23 11:36:33 domU kernel: local_pci_probe+0x42/0x80 >>>> Oct 23 11:36:33 domU kernel: ? _cond_resched+0x16/0x40 >>>> Oct 23 11:36:33 domU kernel: pci_device_probe+0xfd/0x1b0 >>>> Oct 23 11:36:33 domU kernel: really_probe+0x222/0x480 >>>> Oct 23 11:36:33 domU kernel: driver_probe_device+0xe1/0x150 >>>> Oct 23 11:36:33 domU kernel: device_driver_attach+0xa1/0xb0 >>>> Oct 23 11:36:33 domU kernel: __driver_attach+0x8a/0x150 >>>> Oct 23 11:36:33 domU kernel: ? device_driver_attach+0xb0/0xb0 >>>> Oct 23 11:36:33 domU kernel: ? device_driver_attach+0xb0/0xb0 >>>> Oct 23 11:36:33 domU kernel: bus_for_each_dev+0x78/0xc0 >>>> Oct 23 11:36:33 domU kernel: bus_add_driver+0x12b/0x1e0 >>>> Oct 23 11:36:33 domU kernel: driver_register+0x8b/0xe0 >>>> Oct 23 11:36:33 domU kernel: ? 0xffffffffc06b8000 >>>> Oct 23 11:36:33 domU kernel: i915_init+0x5d/0x70 [i915] >>>> Oct 23 11:36:33 domU kernel: do_one_initcall+0x44/0x1d0 >>>> Oct 23 11:36:33 domU kernel: ? do_init_module+0x23/0x260 >>>> Oct 23 11:36:33 domU kernel: ? kmem_cache_alloc_trace+0xf5/0x200 >>>> Oct 23 11:36:33 domU kernel: do_init_module+0x5c/0x260 >>>> Oct 23 11:36:33 domU kernel: __do_sys_finit_module+0xb1/0x110 >>>> Oct 23 11:36:33 domU kernel: do_syscall_64+0x33/0x80 >>>> Oct 23 11:36:33 domU kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9 >>> The call trace alone leaves open where exactly the crash occurred. >>> Looking at 5.17 I notice that the first thing the driver does >>> after mapping the range it to check the signature (both in >>> intel_opregion_setup()). As the signature can't possibly match >>> with no access granted to the underlying mappings, there shouldn't >>> be any further attempts to use the region in the driver; if there >>> are, I'd view this as a driver bug. >> Yes. i915_driver_hw_probe does not check the return value of >> intel_opregion_setup(dev_priv) and just continues on. >> >> Chuck, the attached patch may help if you want to test it. 
>>
>> Regards,
>> Jason
>
> I tested the patch - it made no noticeable difference. I still
> get the same crash and call trace with the patch. Actually,
> the call trace I posted here is only the first of three call
> traces, and I still see all three call traces with the patch.

It is probably necessary to patch intel_opregion_setup to return an error sooner if the goal is to suppress the call traces that occur when the driver cannot access the opregion.

Regards,

Chuck
On Fri, Apr 1, 2022 at 9:41 AM Chuck Zmudzinski <brchuckz@netscape.net> wrote: > > On 4/1/22 9:21 AM, Chuck Zmudzinski wrote: > > On 3/30/22 2:45 PM, Jason Andryuk wrote: > >> On Fri, Mar 18, 2022 at 4:13 AM Jan Beulich <jbeulich@suse.com> wrote: > >>> On 14.03.2022 04:41, Chuck Zmudzinski wrote: > >>>> When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD > >>>> opregion to the guest but libxl does not grant the guest permission to > >>>> access the mapped memory region. This results in a crash of the > >>>> i915.ko > >>>> kernel module in a Linux HVM guest when it needs to access the IGD > >>>> opregion: > >>>> > >>>> Oct 23 11:36:33 domU kernel: Call Trace: > >>>> Oct 23 11:36:33 domU kernel: ? idr_alloc+0x39/0x70 > >>>> Oct 23 11:36:33 domU kernel: drm_get_last_vbltimestamp+0xaa/0xc0 [drm] > >>>> Oct 23 11:36:33 domU kernel: drm_reset_vblank_timestamp+0x5b/0xd0 > >>>> [drm] > >>>> Oct 23 11:36:33 domU kernel: drm_crtc_vblank_on+0x7b/0x130 [drm] > >>>> Oct 23 11:36:33 domU kernel: > >>>> intel_modeset_setup_hw_state+0xbd4/0x1900 [i915] > >>>> Oct 23 11:36:33 domU kernel: ? _cond_resched+0x16/0x40 > >>>> Oct 23 11:36:33 domU kernel: ? ww_mutex_lock+0x15/0x80 > >>>> Oct 23 11:36:33 domU kernel: intel_modeset_init_nogem+0x867/0x1d30 > >>>> [i915] > >>>> Oct 23 11:36:33 domU kernel: ? gen6_write32+0x4b/0x1c0 [i915] > >>>> Oct 23 11:36:33 domU kernel: ? intel_irq_postinstall+0xb9/0x670 > >>>> [i915] > >>>> Oct 23 11:36:33 domU kernel: i915_driver_probe+0x5c2/0xc90 [i915] > >>>> Oct 23 11:36:33 domU kernel: ? > >>>> vga_switcheroo_client_probe_defer+0x1f/0x40 > >>>> Oct 23 11:36:33 domU kernel: ? i915_pci_probe+0x3f/0x150 [i915] > >>>> Oct 23 11:36:33 domU kernel: local_pci_probe+0x42/0x80 > >>>> Oct 23 11:36:33 domU kernel: ? _cond_resched+0x16/0x40 > >>>> Oct 23 11:36:33 domU kernel: pci_device_probe+0xfd/0x1b0 > >>>> Oct 23 11:36:33 domU kernel: really_probe+0x222/0x480 > >>>> Oct 23 11:36:33 domU kernel: driver_probe_device+0xe1/0x150 > >>>> Oct 23 11:36:33 domU kernel: device_driver_attach+0xa1/0xb0 > >>>> Oct 23 11:36:33 domU kernel: __driver_attach+0x8a/0x150 > >>>> Oct 23 11:36:33 domU kernel: ? device_driver_attach+0xb0/0xb0 > >>>> Oct 23 11:36:33 domU kernel: ? device_driver_attach+0xb0/0xb0 > >>>> Oct 23 11:36:33 domU kernel: bus_for_each_dev+0x78/0xc0 > >>>> Oct 23 11:36:33 domU kernel: bus_add_driver+0x12b/0x1e0 > >>>> Oct 23 11:36:33 domU kernel: driver_register+0x8b/0xe0 > >>>> Oct 23 11:36:33 domU kernel: ? 0xffffffffc06b8000 > >>>> Oct 23 11:36:33 domU kernel: i915_init+0x5d/0x70 [i915] > >>>> Oct 23 11:36:33 domU kernel: do_one_initcall+0x44/0x1d0 > >>>> Oct 23 11:36:33 domU kernel: ? do_init_module+0x23/0x260 > >>>> Oct 23 11:36:33 domU kernel: ? kmem_cache_alloc_trace+0xf5/0x200 > >>>> Oct 23 11:36:33 domU kernel: do_init_module+0x5c/0x260 > >>>> Oct 23 11:36:33 domU kernel: __do_sys_finit_module+0xb1/0x110 > >>>> Oct 23 11:36:33 domU kernel: do_syscall_64+0x33/0x80 > >>>> Oct 23 11:36:33 domU kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9 > >>> The call trace alone leaves open where exactly the crash occurred. > >>> Looking at 5.17 I notice that the first thing the driver does > >>> after mapping the range it to check the signature (both in > >>> intel_opregion_setup()). As the signature can't possibly match > >>> with no access granted to the underlying mappings, there shouldn't > >>> be any further attempts to use the region in the driver; if there > >>> are, I'd view this as a driver bug. > >> Yes. 
i915_driver_hw_probe does not check the return value of
> >> intel_opregion_setup(dev_priv) and just continues on.
> >>
> >> Chuck, the attached patch may help if you want to test it.
> >>
> >> Regards,
> >> Jason
> >
> > I tested the patch - it made no noticeable difference. I still
> > get the same crash and call trace with the patch. Actually,
> > the call trace I posted here is only the first of three call
> > traces, and I still see all three call traces with the patch.

Thanks for testing. Sorry it didn't help.

> It is probably necessary to patch intel_opregion_setup to
> return an error sooner if the goal is to suppress
> the call traces that occur when the driver cannot access
> the opregion.

It looks correct for 5.17 running in your domU. I thought the opregion signature check would fail. A failure in intel_opregion_setup would percolate up through i915_driver_hw_probe to i915_driver_probe. In i915_driver_probe the error should goto out_cleanup_mmio and skip intel_modeset_init_nogem, which is in your backtrace.

Regards,
Jason
On 4/1/22 9:21 AM, Chuck Zmudzinski wrote: > On 3/30/22 2:45 PM, Jason Andryuk wrote: >> On Fri, Mar 18, 2022 at 4:13 AM Jan Beulich <jbeulich@suse.com> wrote: >>> On 14.03.2022 04:41, Chuck Zmudzinski wrote: >>>> When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD >>>> opregion to the guest but libxl does not grant the guest permission to >>>> access the mapped memory region. This results in a crash of the >>>> i915.ko >>>> kernel module in a Linux HVM guest when it needs to access the IGD >>>> opregion: >>>> >>>> Oct 23 11:36:33 domU kernel: Call Trace: >>>> Oct 23 11:36:33 domU kernel: ? idr_alloc+0x39/0x70 >>>> Oct 23 11:36:33 domU kernel: drm_get_last_vbltimestamp+0xaa/0xc0 [drm] >>>> Oct 23 11:36:33 domU kernel: drm_reset_vblank_timestamp+0x5b/0xd0 >>>> [drm] >>>> Oct 23 11:36:33 domU kernel: drm_crtc_vblank_on+0x7b/0x130 [drm] >>>> Oct 23 11:36:33 domU kernel: >>>> intel_modeset_setup_hw_state+0xbd4/0x1900 [i915] >>>> Oct 23 11:36:33 domU kernel: ? _cond_resched+0x16/0x40 >>>> Oct 23 11:36:33 domU kernel: ? ww_mutex_lock+0x15/0x80 >>>> Oct 23 11:36:33 domU kernel: intel_modeset_init_nogem+0x867/0x1d30 >>>> [i915] >>>> Oct 23 11:36:33 domU kernel: ? gen6_write32+0x4b/0x1c0 [i915] >>>> Oct 23 11:36:33 domU kernel: ? intel_irq_postinstall+0xb9/0x670 >>>> [i915] >>>> Oct 23 11:36:33 domU kernel: i915_driver_probe+0x5c2/0xc90 [i915] >>>> Oct 23 11:36:33 domU kernel: ? >>>> vga_switcheroo_client_probe_defer+0x1f/0x40 >>>> Oct 23 11:36:33 domU kernel: ? i915_pci_probe+0x3f/0x150 [i915] >>>> Oct 23 11:36:33 domU kernel: local_pci_probe+0x42/0x80 >>>> Oct 23 11:36:33 domU kernel: ? _cond_resched+0x16/0x40 >>>> Oct 23 11:36:33 domU kernel: pci_device_probe+0xfd/0x1b0 >>>> Oct 23 11:36:33 domU kernel: really_probe+0x222/0x480 >>>> Oct 23 11:36:33 domU kernel: driver_probe_device+0xe1/0x150 >>>> Oct 23 11:36:33 domU kernel: device_driver_attach+0xa1/0xb0 >>>> Oct 23 11:36:33 domU kernel: __driver_attach+0x8a/0x150 >>>> Oct 23 11:36:33 domU kernel: ? device_driver_attach+0xb0/0xb0 >>>> Oct 23 11:36:33 domU kernel: ? device_driver_attach+0xb0/0xb0 >>>> Oct 23 11:36:33 domU kernel: bus_for_each_dev+0x78/0xc0 >>>> Oct 23 11:36:33 domU kernel: bus_add_driver+0x12b/0x1e0 >>>> Oct 23 11:36:33 domU kernel: driver_register+0x8b/0xe0 >>>> Oct 23 11:36:33 domU kernel: ? 0xffffffffc06b8000 >>>> Oct 23 11:36:33 domU kernel: i915_init+0x5d/0x70 [i915] >>>> Oct 23 11:36:33 domU kernel: do_one_initcall+0x44/0x1d0 >>>> Oct 23 11:36:33 domU kernel: ? do_init_module+0x23/0x260 >>>> Oct 23 11:36:33 domU kernel: ? kmem_cache_alloc_trace+0xf5/0x200 >>>> Oct 23 11:36:33 domU kernel: do_init_module+0x5c/0x260 >>>> Oct 23 11:36:33 domU kernel: __do_sys_finit_module+0xb1/0x110 >>>> Oct 23 11:36:33 domU kernel: do_syscall_64+0x33/0x80 >>>> Oct 23 11:36:33 domU kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9 >>> The call trace alone leaves open where exactly the crash occurred. >>> Looking at 5.17 I notice that the first thing the driver does >>> after mapping the range it to check the signature (both in >>> intel_opregion_setup()). As the signature can't possibly match >>> with no access granted to the underlying mappings, there shouldn't >>> be any further attempts to use the region in the driver; if there >>> are, I'd view this as a driver bug. >> Yes. i915_driver_hw_probe does not check the return value of >> intel_opregion_setup(dev_priv) and just continues on. >> >> Chuck, the attached patch may help if you want to test it. 
>>
>> Regards,
>> Jason
>
> I tested the patch - it made no noticeable difference.

Correction (sorry for the confusion):

I didn't know I needed to replace more than just a re-built i915.ko module to enable the patch for testing. When I updated the entire Debian kernel package, including all the modules and the kernel image, with the patched kernel package, it made quite a difference.

With Jason's patch, the three call traces just became a much shorter error message:

Apr 05 20:46:18 debian kernel: xen: --> pirq=16 -> irq=24 (gsi=24)
Apr 05 20:46:18 debian kernel: i915 0000:00:02.0: [drm] VT-d active for gfx access
Apr 05 20:46:18 debian kernel: i915 0000:00:02.0: vgaarb: deactivate vga console
Apr 05 20:46:18 debian kernel: Console: switching to colour dummy device 80x25
Apr 05 20:46:18 debian kernel: i915 0000:00:02.0: [drm] DMAR active, disabling use of stolen memory
Apr 05 20:46:18 debian kernel: resource sanity check: requesting [mem 0xffffffff-0x100001ffe], which spans more than Reserved [mem 0xfdfff000-0xffffffff]
Apr 05 20:46:18 debian kernel: caller memremap+0xeb/0x1c0 mapping multiple BARs
Apr 05 20:46:18 debian kernel: i915 0000:00:02.0: Device initialization failed (-22)
Apr 05 20:46:18 debian kernel: i915 0000:00:02.0: Please file a bug on drm/i915; see https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs for details.
Apr 05 20:46:18 debian kernel: i915: probe of 0000:00:02.0 failed with error -22
--------------------- End of Kernel Error Log ----------------------

So I think the patch does propagate the error up the stack and bails out before producing the call traces, and...

I even had output after booting - the gdm3 Gnome display manager login page was displayed. But when I tried to log in to the Gnome desktop, the screen went dark, and after that I could not even log in to the headless Xen Dom0 control domain via ssh; I had to use the reset button on the machine to reboot it. So the patch causes some trouble with the Dom0 when the guest cannot access the opregion. The patch works fine when the guest can access the opregion, and in that case I was able to log in to the Gnome session. But when the guest cannot access the opregion, trying to log in to the Gnome session in the guest apparently crashed the Dom0, or at least caused networking in the Dom0 to stop working. So I would not recommend Jason's patch as is for the Linux kernel. The main reason is that it looks like it is working at first, with a login screen displayed, but when a user tries to log in, the whole system crashes.

Regards,

Chuck
On Tue, Apr 5, 2022 at 9:31 PM Chuck Zmudzinski <brchuckz@netscape.net> wrote: > Correction (sorry for the confusion): > > I didn't know I needed to replace more than just a > re-built i915.ko module to enable the patch > for testing. When I updated the entire Debian kernel > package including all the modules and the kernel > image with the patched kernel package, it made > quite a difference. > > With Jason's patch, the three call traces just became a > much shorter error message: > > Apr 05 20:46:18 debian kernel: xen: --> pirq=16 -> irq=24 (gsi=24) > Apr 05 20:46:18 debian kernel: i915 0000:00:02.0: [drm] VT-d active for > gfx access > Apr 05 20:46:18 debian kernel: i915 0000:00:02.0: vgaarb: deactivate vga > console > Apr 05 20:46:18 debian kernel: Console: switching to colour dummy device > 80x25 > Apr 05 20:46:18 debian kernel: i915 0000:00:02.0: [drm] DMAR active, > disabling use of stolen memory > Apr 05 20:46:18 debian kernel: resource sanity check: requesting [mem > 0xffffffff-0x100001ffe], which spans more than Reserved [mem > 0xfdfff000-0xffffffff] > Apr 05 20:46:18 debian kernel: caller memremap+0xeb/0x1c0 mapping > multiple BARs > Apr 05 20:46:18 debian kernel: i915 0000:00:02.0: Device initialization > failed (-22) > Apr 05 20:46:18 debian kernel: i915 0000:00:02.0: Please file a bug on > drm/i915; see > https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs > for details. > Apr 05 20:46:18 debian kernel: i915: probe of 0000:00:02.0 failed with > error -22 > --------------------- End of Kernel Error Log ---------------------- > > So I think the patch does propagate the error up the > stack and bails out before producing the Call traces, Thanks for re-testing. > and... > > I even had output after booting - the gdm3 Gnome display > manager login page displayed, but when I tried to login to > the Gnome desktop, the screen went dark and I could > not even login to the headless Xen Dom0 control domain > via ssh after that and I just used the reset button on the > machine to reboot it, so the patch causes some trouble > with the Dom0 when the guest cannot access the > opregion. The patch works fine when the guest can > access the opregion and in that case I was able to > login to the Gnome session, but it caused quite a bit of > trouble and apparently crashed the Dom0 or at > least caused networking in the Dom0 to stop working > when I tried to login to the Gnome session in the > guest for the case when the guest cannot access > the opregion. So I would not recommend Jason's > patch as is for the Linux kernel. The main reason > is that it looks like it is working at first with a > login screen displayed, but when a user tries to login, > the whole system crashes. I'm a little surprised you still had output from the VM & display with the i915 driver not binding. I guess Linux fell back to another VGA or Framebuffer driver for the display. However, locking up the host isn't good. You didn't happen to catch any Xen or dom0 output when that happened? Regards, Jason
On 4/6/22 9:10 AM, Jason Andryuk wrote: > On Tue, Apr 5, 2022 at 9:31 PM Chuck Zmudzinski <brchuckz@netscape.net> wrote: >> Correction (sorry for the confusion): >> >> I didn't know I needed to replace more than just a >> re-built i915.ko module to enable the patch >> for testing. When I updated the entire Debian kernel >> package including all the modules and the kernel >> image with the patched kernel package, it made >> quite a difference. >> >> With Jason's patch, the three call traces just became a >> much shorter error message: >> >> Apr 05 20:46:18 debian kernel: xen: --> pirq=16 -> irq=24 (gsi=24) >> Apr 05 20:46:18 debian kernel: i915 0000:00:02.0: [drm] VT-d active for >> gfx access >> Apr 05 20:46:18 debian kernel: i915 0000:00:02.0: vgaarb: deactivate vga >> console >> Apr 05 20:46:18 debian kernel: Console: switching to colour dummy device >> 80x25 >> Apr 05 20:46:18 debian kernel: i915 0000:00:02.0: [drm] DMAR active, >> disabling use of stolen memory >> Apr 05 20:46:18 debian kernel: resource sanity check: requesting [mem >> 0xffffffff-0x100001ffe], which spans more than Reserved [mem >> 0xfdfff000-0xffffffff] >> Apr 05 20:46:18 debian kernel: caller memremap+0xeb/0x1c0 mapping >> multiple BARs >> Apr 05 20:46:18 debian kernel: i915 0000:00:02.0: Device initialization >> failed (-22) >> Apr 05 20:46:18 debian kernel: i915 0000:00:02.0: Please file a bug on >> drm/i915; see >> https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs >> for details. >> Apr 05 20:46:18 debian kernel: i915: probe of 0000:00:02.0 failed with >> error -22 >> --------------------- End of Kernel Error Log ---------------------- >> >> So I think the patch does propagate the error up the >> stack and bails out before producing the Call traces, > Thanks for re-testing. > >> and... >> >> I even had output after booting - the gdm3 Gnome display >> manager login page displayed, but when I tried to login to >> the Gnome desktop, the screen went dark and I could >> not even login to the headless Xen Dom0 control domain >> via ssh after that and I just used the reset button on the >> machine to reboot it, so the patch causes some trouble >> with the Dom0 when the guest cannot access the >> opregion. The patch works fine when the guest can >> access the opregion and in that case I was able to >> login to the Gnome session, but it caused quite a bit of >> trouble and apparently crashed the Dom0 or at >> least caused networking in the Dom0 to stop working >> when I tried to login to the Gnome session in the >> guest for the case when the guest cannot access >> the opregion. So I would not recommend Jason's >> patch as is for the Linux kernel. The main reason >> is that it looks like it is working at first with a >> login screen displayed, but when a user tries to login, >> the whole system crashes. > I'm a little surprised you still had output from the VM & display with > the i915 driver not binding. I guess Linux fell back to another VGA > or Framebuffer driver for the display. > > However, locking up the host isn't good. You didn't happen to catch > any Xen or dom0 output when that happened? > > Regards, > Jason I just looked at Dom0's systemd journal and it did not capture anything. 
The bad things happened during the six-minute gap between Apr 05 20:46 and Apr 05 20:52, which is when I rebooted Dom0 after the crash:

Apr 05 20:46:01 Dom0 kernel: pciback 0000:00:1b.0: xen_pciback: vpci: assign to virtual slot 0
Apr 05 20:46:01 Dom0 kernel: pciback 0000:00:1b.0: registering for 18
Apr 05 20:46:01 Dom0 kernel: pciback 0000:00:14.0: xen_pciback: vpci: assign to virtual slot 1
Apr 05 20:46:01 Dom0 kernel: pciback 0000:00:14.0: registering for 18
Apr 05 20:46:01 Dom0 kernel: pciback 0000:00:02.0: xen_pciback: vpci: assign to virtual slot 2
Apr 05 20:46:01 Dom0 kernel: pciback 0000:00:02.0: registering for 18
Apr 05 20:46:01 Dom0 sudo[9639]: pam_unix(sudo:session): session closed for user root
Apr 05 20:46:13 Dom0 sshd[9541]: Received disconnect from <redacted> port 60294:11: disconnected by user
Apr 05 20:46:13 Dom0 sshd[9541]: Disconnected from user <redacted> <redacted> port 60294
Apr 05 20:46:13 Dom0 sshd[9521]: pam_unix(sshd:session): session closed for user <redacted>
Apr 05 20:46:13 Dom0 systemd-logind[497]: Session 27 logged out. Waiting for processes to exit.
Apr 05 20:46:17 Dom0 kernel: xen-blkback: backend/vbd/18/51712: using 4 queues, protocol 1 (x86_64-abi) persistent grants
Apr 05 20:46:17 Dom0 kernel: xen-blkback: backend/vbd/18/51728: using 4 queues, protocol 1 (x86_64-abi) persistent grants
Apr 05 20:46:17 Dom0 kernel: vif vif-18-0 vif18.0: Guest Rx ready
Apr 05 20:46:17 Dom0 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vif18.0: link becomes ready
Apr 05 20:46:19 Dom0 dhcpd[9852]: DHCPDISCOVER from <redacted> via vif18.0
Apr 05 20:46:19 Dom0 dhcpd[9852]: DHCPOFFER on <redacted> to <redacted> via vif18.0
Apr 05 20:46:19 Dom0 dhcpd[9852]: DHCPREQUEST for <redacted> (<redacted>) from <redacted> via vif18.0
Apr 05 20:46:19 Dom0 dhcpd[9852]: DHCPACK on <redacted> to <redacted> via vif18.0
Apr 05 20:52:34 Dom0 kernel: Linux version 5.16.0-6-amd64 (debian-kernel@lists.debian.org) (gcc-11 (Debian 11.2.0-19) 11.2.0, GNU ld (GNU Binutils for Debian) 2.38) #1 SMP PREEMPT Debian 5.16.18-1 (2022-03-29)
Apr 05 20:52:34 Dom0 kernel: Command line: placeholder root=/dev/mapper/systems-unstable ro reboot=bios quiet console=hvc0

I would probably need to connect Dom0 to a serial console to capture something from Dom0 or Xen. I have done that in the past using a serial cable connected to a Windows 8 laptop through a usb-to-serial adapter, but the last time I tried it the adapter did not work, I think because of the laptop's upgrade to Windows 10.

Regards,

Chuck
On 4/6/22 9:10 AM, Jason Andryuk wrote:
> On Tue, Apr 5, 2022 at 9:31 PM Chuck Zmudzinski <brchuckz@netscape.net> wrote:
>> Correction (sorry for the confusion):
>>
>> I didn't know I needed to replace more than just a
>> re-built i915.ko module to enable the patch
>> for testing. When I updated the entire Debian kernel
>> package including all the modules and the kernel
>> image with the patched kernel package, it made
>> quite a difference.
>>
>> With Jason's patch, the three call traces just became a
>> much shorter error message:
>>
>> Apr 05 20:46:18 debian kernel: xen: --> pirq=16 -> irq=24 (gsi=24)
>> Apr 05 20:46:18 debian kernel: i915 0000:00:02.0: [drm] VT-d active for gfx access
>> Apr 05 20:46:18 debian kernel: i915 0000:00:02.0: vgaarb: deactivate vga console
>> Apr 05 20:46:18 debian kernel: Console: switching to colour dummy device 80x25
>> Apr 05 20:46:18 debian kernel: i915 0000:00:02.0: [drm] DMAR active, disabling use of stolen memory
>> Apr 05 20:46:18 debian kernel: resource sanity check: requesting [mem 0xffffffff-0x100001ffe], which spans more than Reserved [mem 0xfdfff000-0xffffffff]
>> Apr 05 20:46:18 debian kernel: caller memremap+0xeb/0x1c0 mapping multiple BARs
>> Apr 05 20:46:18 debian kernel: i915 0000:00:02.0: Device initialization failed (-22)
>> Apr 05 20:46:18 debian kernel: i915 0000:00:02.0: Please file a bug on drm/i915; see https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs for details.
>> Apr 05 20:46:18 debian kernel: i915: probe of 0000:00:02.0 failed with error -22
>> --------------------- End of Kernel Error Log ----------------------
>>
>> So I think the patch does propagate the error up the
>> stack and bails out before producing the Call traces,
> Thanks for re-testing.
>
> I'm a little surprised you still had output from the VM & display with
> the i915 driver not binding. I guess Linux fell back to another VGA
> or Framebuffer driver for the display.

By looking at journal entries in the guest, it is clear the Xorg driver fell back from the kms modesetting driver to the vesa driver, as shown by the following journal entries.

When the guest can access the opregion:

Apr 05 20:42:45 debian /usr/libexec/gdm-x-session[1226]: (II) modeset(0): Serial No: LX1AA0044210
Apr 05 20:42:45 debian /usr/libexec/gdm-x-session[1226]: (II) modeset(0): Monitor name: Acer H236HL

When the guest cannot access the opregion:

Apr 05 20:46:22 debian /usr/libexec/gdm-x-session[1164]: (II) VESA(0): Serial No: LX1AA0044210
Apr 05 20:46:22 debian /usr/libexec/gdm-x-session[1164]: (II) VESA(0): Monitor name: Acer H236HL

But as I said, when I tried to log in to a Gnome session, the system hung, and no journal entries were captured in either the Dom0 or the guest, so it is hard to tell what happened. I think the full Gnome session, as opposed to the gdm3 display manager, did not fall back to the Xorg vesa driver; when it tried to use the Xorg modesetting driver, it caused the system to hang, because the modesetting driver uses KMS and probably tried to use the i915 module, which was not initialized correctly due to the inability to access the opregion.

I also noted in an earlier message in this thread that when the guest cannot access the opregion, the guest overwrites the register that contains the mapped opregion address (provided for the guest by the Qemu device model) with the invalid value 0xffffffff.
When the Gnome session manager started the session, it apparently caused the i915 module to try to access the opregion at the invalid address 0xffffffff, and thus caused the system to hang, as shown in the journal entry I posted yesterday:

Apr 05 20:46:18 debian kernel: resource sanity check: requesting [mem 0xffffffff-0x100001ffe], which spans more than Reserved [mem 0xfdfff000-0xffffffff]

This is a request by the guest for 2 pages, which is the size of the opregion, but it uses the invalid address 0xffffffff as the opregion address. So although this resource sanity check failed, the system still hung later on, when the user tried to log in to the Gnome session.

Regards,

Chuck
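As a quick sanity check of that journal entry, the failed request is exactly the bogus base address plus the 8 KB (2-page) opregion. A stand-alone sketch of the arithmetic (assuming the driver simply maps the 8 KB opregion starting at the invalid base):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
    uint64_t base = 0xffffffff; /* the invalid opregion address in the guest */
    uint64_t size = 2 * 4096;   /* 2 pages: the 8 KB opregion */

    /* prints [mem 0xffffffff-0x100001ffe], matching the failed request */
    printf("[mem 0x%" PRIx64 "-0x%" PRIx64 "]\n", base, base + size - 1);
    return 0;
}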
On 3/30/22 1:15 PM, Anthony PERARD wrote: > Hi Chuck, > > On Sun, Mar 13, 2022 at 11:41:37PM -0400, Chuck Zmudzinski wrote: >> When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD >> opregion to the guest but libxl does not grant the guest permission to > I'm not reading the same thing when looking at code in hvmloader. It > seems that hvmloader allocate some memory for the IGD opregion rather > than mapping it. > > > tools/firmware/hvmloader/pci.c:184 > if ( vendor_id == 0x8086 ) > { > igd_opregion_pgbase = mem_hole_alloc(IGD_OPREGION_PAGES); > /* > * Write the the OpRegion offset to give the opregion > * address to the device model. The device model will trap > * and map the OpRegion at the give address. > */ > pci_writel(vga_devfn, PCI_INTEL_OPREGION, > igd_opregion_pgbase << PAGE_SHIFT); > } > > I think this write would go through QEMU, does it do something with it? > (I kind of replying to this question at the end of the mail.) > > Is this code in hvmloader actually run in your case? > >> Currently, because of another bug in Qemu upstream, this crash can >> only be reproduced using the traditional Qemu device model, and of > qemu-traditional is a bit old... What is the bug in QEMU? Do you have a > link to a patch/mail? I finally found a patch for the other bug in Qemu upstream. The patch is currently being used in QubesOS, and they first added it to their version of Qemu way back in 2017: https://github.com/QubesOS/qubes-vmm-xen-stubdom-linux/pull/3/commits/ab2b4c2ad02827a73c52ba561e9a921cc4bb227c Although this patch is advertised as applying to the device model running in a Linux stub domain, it is also needed (at least on my system) with the device model running in Dom0. Here is the story: The patch is titled "qemu: fix wrong mask in pci capability registers handling" There is scant information in the commit message about the nature of the problem, but I discovered the following in my testing: On my Intel Haswell system configured for PCI passthrough to the Xen HVM guest, Qemu does indeed incorrectly set the emulated register because it uses the wrong mask that disables instead of enables the PCI_STATUS_CAP_LIST bit of the PCI_STATUS register. This disabled the MSI-x capability of two of the three PCI devices passed through to the Xen HVM guest. The problem only manifests in a bad way in a Linux guest, not in a Windows guest. One possible reason that only Linux guests are affected is that I discovered in the Xen xl-dmesg verbose logs that Windows and Linux use different callbacks for interrupts: (XEN) Dom1 callback via changed to GSI 28 ... (XEN) Dom3 callback via changed to Direct Vector 0xf3 Dom1 is a Windows Xen HVM and Dom3 is a Linux HVM Apparently the Direct Vector callback that Linux uses requires MSI or MSI-x capability of the passed through devices, but the wrong mask in Qemu disables that capability. After applying the QubesOS patch to Qemu upstream, the PCI_STATUS_CAP_LIST bit is set correctly for the guest and PCI and Intel IGD passthrough works normally because the Linux guest can make use of the MSI-x capability of the PCI devices. The problem was discovered almost five years ago. I don't know why the fix has not been committed to Qemu upstream yet. After this, I was able to discover that the need for libxl to explicitly grant permission for access to the Intel OpRegion is only needed for the old traditional device model because the Xen hypercall to gain permission is included in upstream Qemu, but it is omitted from the old traditional device model. 
So this patch is not needed for users of the upstream device model who also add the QubesOS patch to fix the PCI_STATUS_CAP_LIST bit in the PCI_STATUS register. This all assumes the device model is running in Dom0. The permission for access to the Intel OpRegion might still be needed if the upstream device model is running in a stub domain. There are other problems in addition to this problem of access to the Intel OpRegion that are discussed here: https://www.qubes-os.org/news/2017/10/18/msi-support/ As old as that post is, the feature of allowing PCI and VGA passthrough to HVM domains is still not well supported, especially for the case when the device model runs in a stub domain. Since my proposed patch only applies to the very insecure case of the old traditional device model running in Dom0, I will not pursue it further. I will look for this feature in future versions of Xen. Currently, Xen 4.16 advertises support for Linux-based stub domains as "Tech Preview." So future versions of Xen might handle this problem in libxl or perhaps in some other way, and also hopefully the patch to Qemu to fix the PCI capabilities mask can be committed to Qemu upstream soon so this feature of Intel IGD passthrough can at least work with Linux guests and the upstream Qemu running in Dom0. Regards, Chuck > >> diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c >> index 4bbbfe9f16..a4fc473de9 100644 >> --- a/tools/libs/light/libxl_pci.c >> +++ b/tools/libs/light/libxl_pci.c >> @@ -2531,6 +2572,37 @@ int libxl__grant_vga_iomem_permission(libxl__gc *gc, const uint32_t domid, >> domid, vga_iomem_start, (vga_iomem_start + 0x20 - 1)); >> return ret; >> } >> + >> + /* If this is an Intel IGD, allow access to the IGD opregion */ >> + if (!libxl__is_igd_vga_passthru(gc, d_config)) return 0; >> + >> + uint32_t igd_opregion = sysfs_dev_get_igd_opregion(gc, pci); >> + uint32_t error = 0xffffffff; >> + if (igd_opregion == error) break; >> + >> + vga_iomem_start = ( (uint64_t) igd_opregion ) >> XC_PAGE_SHIFT; >> + ret = xc_domain_iomem_permission(CTX->xch, stubdom_domid, >> + vga_iomem_start, >> + IGD_OPREGION_PAGES, 1); >> + if (ret < 0) { >> + LOGED(ERROR, domid, >> + "failed to give stubdom%d access to iomem range " >> + "%"PRIx64"-%"PRIx64" for IGD passthru", >> + stubdom_domid, vga_iomem_start, (vga_iomem_start + >> + IGD_OPREGION_PAGES - 1)); >> + return ret; >> + } >> + ret = xc_domain_iomem_permission(CTX->xch, domid, >> + vga_iomem_start, >> + IGD_OPREGION_PAGES, 1); > Here, you seems to add permission to an address that is read from the > pci config space of the device, but as I've pointed above hvmloader > seems to overwrite this address. It this call to > xc_domain_iomem_permission() does actually anything useful? > Or is it by luck that the address returned by > sysfs_dev_get_igd_opregion() happened to be the address that hvmloader > is going to write? > > Or maybe hvmloader doesn't actually do anything? > > > Some more though on that, looking at QEMU, it seems that there's already > a call to xc_domain_iomem_permission(), in igd_write_opregion(). So > adding one in libxl would seems redundant, or maybe it the one for the > device model's domain that's needed (named 'stubdom_domid' here)? > > Thanks, >
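The mask-polarity bug described in the message above can be illustrated with a minimal stand-alone sketch. This is not the actual Qemu code: the merge rule and the register values here are simplified assumptions, and only PCI_STATUS_CAP_LIST (0x10) comes from the PCI specification. The guest-visible PCI_STATUS is built from emulated bits where the mask is set and host bits elsewhere, so a mask that wrongly covers the capabilities-list bit hands the guest a cleared bit and hides the capability list - and with it MSI/MSI-X:

#include <stdio.h>
#include <stdint.h>

#define PCI_STATUS_CAP_LIST 0x10 /* capabilities-list bit (PCI spec) */

/* Simplified merge rule: emulated bits where emu_mask is set,
 * host bits everywhere else. */
static uint16_t guest_view(uint16_t host, uint16_t emu, uint16_t emu_mask)
{
    return (uint16_t)((host & ~emu_mask) | (emu & emu_mask));
}

int main(void)
{
    uint16_t host = 0x0290; /* hypothetical host PCI_STATUS, CAP_LIST set */
    uint16_t emu  = 0x0000; /* the emulated value never sets CAP_LIST */

    /* wrong mask: CAP_LIST is emulated away; the guest sees 0x0280 and
     * never walks the capability list */
    printf("wrong mask: %04x\n", guest_view(host, emu, PCI_STATUS_CAP_LIST));

    /* fixed mask: CAP_LIST passes through from the host; the guest sees
     * 0x0290 and can discover MSI/MSI-X */
    printf("fixed mask: %04x\n", guest_view(host, emu, 0));
    return 0;
}

With the capabilities-list bit taken from the host, a Linux guest can walk the capability list and set up MSI-X and its Direct Vector callback, which matches the behavior described above after applying the QubesOS patch.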
On 6/3/22 9:10 PM, Chuck Zmudzinski wrote:
> On 3/30/22 1:15 PM, Anthony PERARD wrote:
>> Hi Chuck,
>>
>> On Sun, Mar 13, 2022 at 11:41:37PM -0400, Chuck Zmudzinski wrote:
>>> When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD
>>> opregion to the guest but libxl does not grant the guest permission to
>> I'm not reading the same thing when looking at code in hvmloader. It
>> seems that hvmloader allocates some memory for the IGD opregion rather
>> than mapping it.
>>
>> tools/firmware/hvmloader/pci.c:184
>>     if ( vendor_id == 0x8086 )
>>     {
>>         igd_opregion_pgbase = mem_hole_alloc(IGD_OPREGION_PAGES);
>>         /*
>>          * Write the the OpRegion offset to give the opregion
>>          * address to the device model. The device model will trap
>>          * and map the OpRegion at the give address.
>>          */
>>         pci_writel(vga_devfn, PCI_INTEL_OPREGION,
>>                    igd_opregion_pgbase << PAGE_SHIFT);
>>     }
>>
>> I think this write would go through QEMU, does it do something with it?
>> (I kind of reply to this question at the end of the mail.)
>>
>> Is this code in hvmloader actually run in your case?
>>
>>> Currently, because of another bug in Qemu upstream, this crash can
>>> only be reproduced using the traditional Qemu device model, and of
>> qemu-traditional is a bit old... What is the bug in QEMU? Do you have a
>> link to a patch/mail?

I finally found a patch that fixes the upstream bug on my system, but
I am not sure it is the best fix. It is a patch that QubesOS has been
using since 2017. I just opened an issue titled "Incorrect register
mask for PCI passthrough on Xen" with Qemu upstream about this
problem, which gives all the details:

https://gitlab.com/qemu-project/qemu/-/issues/1061

When you get a chance, can you take a look at it? Not being an
official Xen or Qemu developer, I would appreciate any suggestions you
might have for me before I formally submit a patch to Qemu upstream.
Please reply here or on the Qemu issue tracker.

Best Regards,

Chuck

> I finally found a patch for the other bug in Qemu upstream. The
> patch is currently being used in QubesOS, and they first added
> it to their version of Qemu way back in 2017:
>
> https://github.com/QubesOS/qubes-vmm-xen-stubdom-linux/pull/3/commits/ab2b4c2ad02827a73c52ba561e9a921cc4bb227c
>
> Although this patch is advertised as applying to the device model
> running in a Linux stub domain, it is also needed (at least on my
> system) with the device model running in Dom0.
>
> Here is the story:
>
> The patch is titled "qemu: fix wrong mask in pci capability registers
> handling".
>
> There is scant information in the commit message about the nature of
> the problem, but I discovered the following in my testing:
>
> On my Intel Haswell system configured for PCI passthrough to the
> Xen HVM guest, Qemu does indeed incorrectly set the emulated
> register because it uses the wrong mask that disables instead of
> enables the PCI_STATUS_CAP_LIST bit of the PCI_STATUS register.
>
> This disabled the MSI-x capability of two of the three PCI devices
> passed through to the Xen HVM guest. The problem only
> manifests in a bad way in a Linux guest, not in a Windows guest.
>
> One possible reason that only Linux guests are affected is that
> I discovered in the Xen xl-dmesg verbose logs that Windows and
> Linux use different callbacks for interrupts:
>
> (XEN) Dom1 callback via changed to GSI 28
> ...
> (XEN) Dom3 callback via changed to Direct Vector 0xf3
>
> Dom1 is a Windows Xen HVM and Dom3 is a Linux HVM.
>
> Apparently the Direct Vector callback that Linux uses requires
> MSI or MSI-x capability of the passed through devices, but the
> wrong mask in Qemu disables that capability.
>
> After applying the QubesOS patch to Qemu upstream, the
> PCI_STATUS_CAP_LIST bit is set correctly for the guest and
> PCI and Intel IGD passthrough works normally because the
> Linux guest can make use of the MSI-x capability of the
> PCI devices.
>
> The problem was discovered almost five years ago. I don't
> know why the fix has not been committed to Qemu
> upstream yet.
>
> After this, I was able to discover that libxl only needs to
> explicitly grant permission for access to the Intel OpRegion
> for the old traditional device model, because the Xen hypercall
> to gain permission is included in upstream Qemu, but it is
> omitted from the old traditional device model.
>
> So this patch is not needed for users of the upstream device
> model who also add the QubesOS patch to fix the
> PCI_STATUS_CAP_LIST bit in the PCI_STATUS register.
>
> This all assumes the device model is running in Dom0. The
> permission for access to the Intel OpRegion might still be
> needed if the upstream device model is running in a stub
> domain.
>
> There are other problems in addition to this problem of access
> to the Intel OpRegion that are discussed here:
>
> https://www.qubes-os.org/news/2017/10/18/msi-support/
>
> As old as that post is, the feature of allowing PCI and VGA
> passthrough to HVM domains is still not well supported,
> especially for the case when the device model runs in a
> stub domain.
>
> Since my proposed patch only applies to the very insecure
> case of the old traditional device model running in Dom0,
> I will not pursue it further.
>
> I will look for this feature in future versions of Xen. Currently,
> Xen 4.16 advertises support for Linux-based stub domains
> as "Tech Preview." So future versions of Xen might handle
> this problem in libxl or perhaps in some other way, and also
> hopefully the patch to Qemu to fix the PCI capabilities mask
> can be committed to Qemu upstream soon so this feature
> of Intel IGD passthrough can at least work with Linux
> guests and the upstream Qemu running in Dom0.
>
> Regards,
>
> Chuck
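[Editorial note: the register-mask bug described in the quoted message can be shown with a minimal, self-contained sketch. This is an illustration, not the actual QEMU or QubesOS code; merge_reg() is an invented stand-in for the config-register merging that Xen PCI passthrough performs in QEMU.]

    /* Illustrative sketch of the mask bug: a mask with inverted polarity
     * lets the emulated zero win and hides PCI_STATUS_CAP_LIST. */
    #include <stdint.h>
    #include <stdio.h>

    #define PCI_STATUS_CAP_LIST 0x0010  /* "capability list present" bit */

    /* Bits set in emu_mask come from the emulated value; all other bits
     * pass through from the physical device. */
    static uint16_t merge_reg(uint16_t host_val, uint16_t emu_val,
                              uint16_t emu_mask)
    {
        return (uint16_t)((emu_val & emu_mask) | (host_val & ~emu_mask));
    }

    int main(void)
    {
        uint16_t host_val = PCI_STATUS_CAP_LIST; /* device has capabilities */
        uint16_t emu_val  = 0;                   /* emulated copy: bit clear */

        /* Intended mask: CAP_LIST is not emulated, host bit shows through. */
        printf("correct mask: %#06x\n",
               merge_reg(host_val, emu_val, (uint16_t)~PCI_STATUS_CAP_LIST));

        /* Inverted mask: emulated zero overrides, so the guest sees no
         * capability list and MSI/MSI-x become unreachable. */
        printf("wrong mask:   %#06x\n",
               merge_reg(host_val, emu_val, PCI_STATUS_CAP_LIST));
        return 0;
    }

With the inverted mask the guest reads a status register whose capability-list bit is clear, which is consistent with the MSI-x loss described above.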
On Thu, Mar 31, 2022 at 03:44:33PM -0400, Chuck Zmudzinski wrote:
> On 3/31/22 10:19 AM, Jason Andryuk wrote:
> > On Thu, Mar 31, 2022 at 10:05 AM Chuck Zmudzinski <brchuckz@netscape.net> wrote:
> > >
> > > That still doesn't answer my question - will the Qemu upstream
> > > accept the patches that move the hypercalls to the toolstack? If
> > > not, we have to live with what we have now, which is that the
> > > hypercalls are done in Qemu.
> >
> > Xen-associated people maintain hw/xen code in QEMU, so yes it could
> > be accepted.
> >
> > Maybe it would need to be backwards compatible to have libxl check the
> > QEMU version to decide who makes the hypercall? Unless it is broken
> > today, in which case just make it work.
> >
> > Regards,
> > Jason
>
> I know of another reason to check the Qemu upstream version,
> and that is dealing with deprecated / removed device model
> options that xl.cfg still uses. I looked at that a few years ago
> with regard to the deprecated 'usbdevice tablet' Qemu option,
> but I did not see anything in libxl to distinguish Qemu versions
> except for upstream vs. traditional. AFAICT, detecting traditional
> vs. upstream Qemu depends solely on the device_model_version
> xl.cfg setting. So it might be useful for libxl to add the capability
> to detect the upstream Qemu version or at least create an xl.cfg
> setting to inform libxl about the Qemu version.

Hi,

There's already some code to deal with QEMU's version (QEMU = upstream
Qemu) in libxl, but this code deals with an already running QEMU.
When libxl interacts with QEMU through QMP to set up some more devices,
QEMU also advertises its version, which we can use on follow-up QMP
commands.

I think adding passthrough pci devices is all done via QMP, so we can
potentially move a hypercall from QEMU to libxl, and tell libxl that,
depending on which version of QEMU it is talking to, it does or does
not need to do the hypercall. Also, we could probably add a parameter
so that QEMU knows whether it has to do the hypercall or not, and thus
newer versions of QEMU could also deal with older versions of libxl.

(There may be an example like that in both QEMU and libxl for live
migration; see dm_state_save_to_fdset() in libxl as a pointer.)

Cheers,
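[Editorial note: a rough, self-contained sketch of the version gating Anthony suggests follows. None of these names exist in libxl today; the cutover version is an arbitrary placeholder, and a real implementation would compare the version carried in the QMP greeting of the running QEMU.]

    /* Hypothetical sketch: decide whether the toolstack or the device
     * model performs the opregion hypercall, based on the version QEMU
     * advertises over QMP. */
    #include <stdbool.h>
    #include <stdio.h>

    struct qmp_version { int major, minor, micro; }; /* from QMP greeting */

    static bool version_at_least(struct qmp_version v,
                                 int maj, int min, int mic)
    {
        if (v.major != maj) return v.major > maj;
        if (v.minor != min) return v.minor > min;
        return v.micro >= mic;
    }

    int main(void)
    {
        struct qmp_version qemu = { 7, 0, 0 }; /* example advertised version */

        /* Hypothetical cutover release in which QEMU stops making the
         * hypercall itself and relies on libxl instead. */
        if (version_at_least(qemu, 8, 0, 0))
            printf("libxl grants the opregion iomem permission\n");
        else
            printf("QEMU keeps doing the hypercall itself\n");
        return 0;
    }

The extra QMP parameter Anthony mentions would cover the reverse direction: a newer QEMU could be told explicitly whether the toolstack already made the grant, so it also works with an older libxl.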
On 6/17/22 9:46 AM, Anthony PERARD wrote:
> On Thu, Mar 31, 2022 at 03:44:33PM -0400, Chuck Zmudzinski wrote:
>> On 3/31/22 10:19 AM, Jason Andryuk wrote:
>>> On Thu, Mar 31, 2022 at 10:05 AM Chuck Zmudzinski <brchuckz@netscape.net> wrote:
>>>> That still doesn't answer my question - will the Qemu upstream
>>>> accept the patches that move the hypercalls to the toolstack? If
>>>> not, we have to live with what we have now, which is that the
>>>> hypercalls are done in Qemu.
>>> [...]
>> I know of another reason to check the Qemu upstream version,
>> and that is dealing with deprecated / removed device model
>> options that xl.cfg still uses. I looked at that a few years ago
>> with regard to the deprecated 'usbdevice tablet' Qemu option,
>> but I did not see anything in libxl to distinguish Qemu versions
>> except for upstream vs. traditional. AFAICT, detecting traditional
>> vs. upstream Qemu depends solely on the device_model_version
>> xl.cfg setting. So it might be useful for libxl to add the capability
>> to detect the upstream Qemu version or at least create an xl.cfg
>> setting to inform libxl about the Qemu version.
> Hi,
>
> There's already some code to deal with QEMU's version (QEMU = upstream
> Qemu) in libxl, but this code deals with an already running QEMU.
> When libxl interacts with QEMU through QMP to set up some more devices,
> QEMU also advertises its version, which we can use on follow-up QMP
> commands.
>
> I think adding passthrough pci devices is all done via QMP, so we can
> potentially move a hypercall from QEMU to libxl, and tell libxl that,
> depending on which version of QEMU it is talking to, it does or does
> not need to do the hypercall. Also, we could probably add a parameter
> so that QEMU knows whether it has to do the hypercall or not, and thus
> newer versions of QEMU could also deal with older versions of libxl.
>
> (There may be an example like that in both QEMU and libxl for live
> migration; see dm_state_save_to_fdset() in libxl as a pointer.)
>
> Cheers,
>

Hi Anthony,

Thanks for this information. It is useful because I plan to work on
improved Xen / Qemu interactions to better support features such as
running the device model in a stub domain, in which case it is probably
better to do some hypercalls in libxl, or maybe in hvmloader, that are
currently done in Qemu.

I also would like to see Xen HVM using Q35 instead of i440fx emulation,
which also requires improved Xen / Qemu interactions. I know of the
patch set for Q35 emulation that was proposed back in 2018, but I have
not tried anything like that yet. That looks like a complex problem.
Has there been any more recent work in that area? I couldn't find much
recent work on that by searching the Internet.

I have quite a bit to learn before I can make contributions to support
Q35 as a replacement for i440fx, but I plan to slowly work on it. Any
suggestions anyone has are welcome.

Best Regards,

Chuck
diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c
index 4bbbfe9f16..a4fc473de9 100644
--- a/tools/libs/light/libxl_pci.c
+++ b/tools/libs/light/libxl_pci.c
@@ -24,6 +24,8 @@
 #define PCI_OPTIONS            "msitranslate=%d,power_mgmt=%d"
 #define PCI_BDF_XSPATH         "%04x-%02x-%02x-%01x"
 #define PCI_PT_QDEV_ID         "pci-pt-%02x_%02x.%01x"
+#define PCI_INTEL_OPREGION 0xfc
+#define IGD_OPREGION_PAGES 3
 
 static unsigned int pci_encode_bdf(libxl_device_pci *pci)
 {
@@ -610,6 +612,45 @@ out:
     return ret;
 }
 
+static uint32_t sysfs_dev_get_igd_opregion(libxl__gc *gc,
+                                           libxl_device_pci *pci)
+{
+    char *pci_device_config_path =
+        GCSPRINTF(SYSFS_PCI_DEV"/"PCI_BDF"/config",
+                  pci->domain, pci->bus, pci->dev, pci->func);
+    size_t read_items;
+    uint32_t igd_opregion;
+    uint32_t error = 0xffffffff;
+
+    FILE *f = fopen(pci_device_config_path, "r");
+    if (!f) {
+        LOGE(ERROR,
+             "pci device "PCI_BDF" does not have config attribute",
+             pci->domain, pci->bus, pci->dev, pci->func);
+        goto out;
+    }
+    if (fseek(f, PCI_INTEL_OPREGION, SEEK_SET)) {
+        LOGE(ERROR,
+             "cannot find igd opregion address of pci device "PCI_BDF,
+             pci->domain, pci->bus, pci->dev, pci->func);
+        goto out;
+    }
+    read_items = fread(&igd_opregion, 4, 1, f);
+    if (read_items != 1) {
+        LOGE(ERROR,
+             "cannot read igd opregion address of pci device "PCI_BDF,
+             pci->domain, pci->bus, pci->dev, pci->func);
+        goto out;
+    }
+    fclose(f);
+    return igd_opregion;
+
+out:
+    if (f)
+        fclose(f);
+    return error;
+}
+
 /*
  * Some devices may need some ways to work well. Here like IGD,
  * we have to pass a specific option to qemu.
@@ -2531,6 +2572,37 @@ int libxl__grant_vga_iomem_permission(libxl__gc *gc, const uint32_t domid,
                   domid, vga_iomem_start, (vga_iomem_start + 0x20 - 1));
             return ret;
         }
+
+        /* If this is an Intel IGD, allow access to the IGD opregion */
+        if (!libxl__is_igd_vga_passthru(gc, d_config)) return 0;
+
+        uint32_t igd_opregion = sysfs_dev_get_igd_opregion(gc, pci);
+        uint32_t error = 0xffffffff;
+        if (igd_opregion == error) break;
+
+        vga_iomem_start = ( (uint64_t) igd_opregion ) >> XC_PAGE_SHIFT;
+        ret = xc_domain_iomem_permission(CTX->xch, stubdom_domid,
+                                         vga_iomem_start,
+                                         IGD_OPREGION_PAGES, 1);
+        if (ret < 0) {
+            LOGED(ERROR, domid,
+                  "failed to give stubdom%d access to iomem range "
+                  "%"PRIx64"-%"PRIx64" for IGD passthru",
+                  stubdom_domid, vga_iomem_start, (vga_iomem_start +
+                  IGD_OPREGION_PAGES - 1));
+            return ret;
+        }
+        ret = xc_domain_iomem_permission(CTX->xch, domid,
+                                         vga_iomem_start,
+                                         IGD_OPREGION_PAGES, 1);
+        if (ret < 0) {
+            LOGED(ERROR, domid,
+                  "failed to give dom%d access to iomem range "
+                  "%"PRIx64"-%"PRIx64" for IGD passthru",
+                  domid, vga_iomem_start, (vga_iomem_start +
+                  IGD_OPREGION_PAGES - 1));
+            return ret;
+        }
         break;
     }
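[Editorial note: for reference, a guest configuration that exercises the patched code path might look like the fragment below. The BDF is an example, and per the discussion above the crash is currently only reproducible with the traditional device model.]

    # xl.cfg fragment (illustrative values)
    builder = "hvm"
    gfx_passthru = 1                              # enable IGD passthrough
    pci = [ '00:02.0' ]                           # host IGD, example BDF
    device_model_version = "qemu-xen-traditional" # DM that needs the libxl grant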
When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD
opregion to the guest but libxl does not grant the guest permission to
access the mapped memory region. This results in a crash of the i915.ko
kernel module in a Linux HVM guest when it needs to access the IGD
opregion:

Oct 23 11:36:33 domU kernel: Call Trace:
Oct 23 11:36:33 domU kernel:  ? idr_alloc+0x39/0x70
Oct 23 11:36:33 domU kernel:  drm_get_last_vbltimestamp+0xaa/0xc0 [drm]
Oct 23 11:36:33 domU kernel:  drm_reset_vblank_timestamp+0x5b/0xd0 [drm]
Oct 23 11:36:33 domU kernel:  drm_crtc_vblank_on+0x7b/0x130 [drm]
Oct 23 11:36:33 domU kernel:  intel_modeset_setup_hw_state+0xbd4/0x1900 [i915]
Oct 23 11:36:33 domU kernel:  ? _cond_resched+0x16/0x40
Oct 23 11:36:33 domU kernel:  ? ww_mutex_lock+0x15/0x80
Oct 23 11:36:33 domU kernel:  intel_modeset_init_nogem+0x867/0x1d30 [i915]
Oct 23 11:36:33 domU kernel:  ? gen6_write32+0x4b/0x1c0 [i915]
Oct 23 11:36:33 domU kernel:  ? intel_irq_postinstall+0xb9/0x670 [i915]
Oct 23 11:36:33 domU kernel:  i915_driver_probe+0x5c2/0xc90 [i915]
Oct 23 11:36:33 domU kernel:  ? vga_switcheroo_client_probe_defer+0x1f/0x40
Oct 23 11:36:33 domU kernel:  ? i915_pci_probe+0x3f/0x150 [i915]
Oct 23 11:36:33 domU kernel:  local_pci_probe+0x42/0x80
Oct 23 11:36:33 domU kernel:  ? _cond_resched+0x16/0x40
Oct 23 11:36:33 domU kernel:  pci_device_probe+0xfd/0x1b0
Oct 23 11:36:33 domU kernel:  really_probe+0x222/0x480
Oct 23 11:36:33 domU kernel:  driver_probe_device+0xe1/0x150
Oct 23 11:36:33 domU kernel:  device_driver_attach+0xa1/0xb0
Oct 23 11:36:33 domU kernel:  __driver_attach+0x8a/0x150
Oct 23 11:36:33 domU kernel:  ? device_driver_attach+0xb0/0xb0
Oct 23 11:36:33 domU kernel:  ? device_driver_attach+0xb0/0xb0
Oct 23 11:36:33 domU kernel:  bus_for_each_dev+0x78/0xc0
Oct 23 11:36:33 domU kernel:  bus_add_driver+0x12b/0x1e0
Oct 23 11:36:33 domU kernel:  driver_register+0x8b/0xe0
Oct 23 11:36:33 domU kernel:  ? 0xffffffffc06b8000
Oct 23 11:36:33 domU kernel:  i915_init+0x5d/0x70 [i915]
Oct 23 11:36:33 domU kernel:  do_one_initcall+0x44/0x1d0
Oct 23 11:36:33 domU kernel:  ? do_init_module+0x23/0x260
Oct 23 11:36:33 domU kernel:  ? kmem_cache_alloc_trace+0xf5/0x200
Oct 23 11:36:33 domU kernel:  do_init_module+0x5c/0x260
Oct 23 11:36:33 domU kernel:  __do_sys_finit_module+0xb1/0x110
Oct 23 11:36:33 domU kernel:  do_syscall_64+0x33/0x80
Oct 23 11:36:33 domU kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Oct 23 11:36:33 domU kernel: RIP: 0033:0x7f188e1aa9b9

This bug first appeared with commits abfb006f1ff4 and 0561e1f01e87
during the development of Xen 4.5 in 2014 and is present in all Xen
versions 4.5 and higher.

Currently, because of another bug in Qemu upstream, this crash can only
be reproduced using the traditional Qemu device model, and of course it
can only be reproduced on a system with an Intel IGD and compatible
hardware and system BIOS that supports Intel VT-d. It also only occurs
with Linux Intel graphics drivers (i915) in a Linux guest. It does not
occur in a Windows guest with proprietary Windows Intel graphics
drivers. This testing was done with Qemu traditional running as a
process in dom0.

The commit abfb006f1ff4 was intended to explicitly grant access to all
needed I/O memory ranges so access requests by guests would not fail
after commit 0561e1f01e87 was applied, but it failed to handle the case
when the Intel IGD is being passed to an HVM guest for VGA passthrough.
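[Editorial note: as a concrete illustration (a hypothetical dom0 session; the printed value is invented), the opregion base that sysfs_dev_get_igd_opregion() reads via sysfs can also be inspected with setpci at config-space offset 0xFC of the IGD:]

    # Read the IGD opregion base from dom0 (example BDF and output)
    setpci -s 00:02.0 FC.L
    # -> e.g. "cafd9018"; the patch shifts this value right by
    #    XC_PAGE_SHIFT (12) and grants access to IGD_OPREGION_PAGES (3)
    #    pages starting from the resulting page frame.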
This patch grants access to the Intel IGD opregion if an Intel IGD is
passed to an HVM guest and gfx_passthru is enabled in the xl.cfg guest
configuration file, in addition to the other memory that was configured
for access in commit abfb006f1ff4.

The fix is implemented as follows:

The first hunk of the patch adds two macros, which determine the
starting address and size of the Intel IGD opregion. PCI_INTEL_OPREGION
matches the value in tools/firmware/hvmloader/pci_regs.h, and
IGD_OPREGION_PAGES matches the value in
tools/firmware/hvmloader/config.h.

The second hunk is the new sysfs_dev_get_igd_opregion function, using
the same coding style as the other sysfs_dev_get_xxx functions in the
patched file. It returns the start address of the Intel IGD opregion
from dom0's point of view if there are no errors, and it is called by
code in the third and final hunk of the patch.

The third hunk extends the libxl__grant_vga_iomem_permission function,
which was newly added in one of the commits being fixed here
(abfb006f1ff4). In addition to what it did before, that function now
checks whether we have an Intel IGD and, if so, calls the new
sysfs_dev_get_igd_opregion function to get the starting address of the
memory to grant access to.

This problem was discovered by building and testing versions of Xen
4.5-unstable until the aforementioned patches that broke passthrough of
the Intel IGD to a Linux HVM guest were found. That alone, though, was
not enough information to arrive at this fix. After identifying
abfb006f1ff4 and 0561e1f01e87 as the commits that were causing the
trouble, it was necessary to find out what memory was being denied that
previously was allowed. By examining verbose logs from both Qemu and
Xen, and the logs from a custom build of Xen that added a custom log
entry to indicate the address of the memory that was being denied, it
was possible to determine without doubt that the memory being denied
was the Intel IGD opregion.

Windows HVM guests are not affected by this issue, presumably because
the proprietary Windows Intel graphics drivers do not need to access
the Intel IGD opregion if for some reason it is inaccessible.

Guidelines for maintaining this code:

This code needs to be kept consistent with how hvmloader maps the Intel
IGD opregion to the guest, how hvmloader and libxenlight detect an
Intel IGD on the PCI bus, and how Qemu sets up the mapped IGD opregion
in the guest and detects an Intel IGD. For example, all these
components should agree on the size of the IGD opregion.

The most important point for the future is accurate detection of the
Intel IGD, which libxl__is_igd_vga_passthru currently provides. This
patch does not modify that function, but it does use it. It will be
important to ensure that function accurately detects an Intel IGD,
because that is the only way we validate the contents of the Intel IGD
opregion that we are permitting the guest to access. Currently, if we
have a VGA device, the vendor is Intel, and the gfx_passthru option is
enabled, we assume the contents of the memory we are mapping to and
sharing with the guest are valid. The libxl__is_igd_vga_passthru
function also reads the device id, but currently it assumes every Intel
GPU device has an opregion.
So, for example, if it is discovered that newer discrete Intel GPUs do
not have an opregion, the libxl__is_igd_vga_passthru function should be
modified to return false for those devices.

Fixes: abfb006f1ff4 (tools/libxl: explicitly grant access to needed I/O-memory ranges)
Fixes: 0561e1f01e87 (xen/common: do not implicitly permit access to mapped I/O memory)
Backport: 4.12+

Signed-off-by: Chuck Zmudzinski <brchuckz@netscape.net>
---
 tools/libs/light/libxl_pci.c | 72 ++++++++++++++++++++++++++++++++++++
 1 file changed, 72 insertions(+)