
[XEN] tools/libs/light/libxl_pci.c: explicitly grant access to Intel IGD opregion

Message ID b62fbc602a629941c1acaad3b93d250a3eba33c0.1647222184.git.brchuckz@netscape.net
State New, archived
Series [XEN] tools/libs/light/libxl_pci.c: explicitly grant access to Intel IGD opregion

Commit Message

Chuck Zmudzinski March 14, 2022, 3:41 a.m. UTC
When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD
opregion to the guest but libxl does not grant the guest permission to
access the mapped memory region. This results in a crash of the i915.ko
kernel module in a Linux HVM guest when it needs to access the IGD
opregion:

Oct 23 11:36:33 domU kernel: Call Trace:
Oct 23 11:36:33 domU kernel:  ? idr_alloc+0x39/0x70
Oct 23 11:36:33 domU kernel:  drm_get_last_vbltimestamp+0xaa/0xc0 [drm]
Oct 23 11:36:33 domU kernel:  drm_reset_vblank_timestamp+0x5b/0xd0 [drm]
Oct 23 11:36:33 domU kernel:  drm_crtc_vblank_on+0x7b/0x130 [drm]
Oct 23 11:36:33 domU kernel:  intel_modeset_setup_hw_state+0xbd4/0x1900 [i915]
Oct 23 11:36:33 domU kernel:  ? _cond_resched+0x16/0x40
Oct 23 11:36:33 domU kernel:  ? ww_mutex_lock+0x15/0x80
Oct 23 11:36:33 domU kernel:  intel_modeset_init_nogem+0x867/0x1d30 [i915]
Oct 23 11:36:33 domU kernel:  ? gen6_write32+0x4b/0x1c0 [i915]
Oct 23 11:36:33 domU kernel:  ? intel_irq_postinstall+0xb9/0x670 [i915]
Oct 23 11:36:33 domU kernel:  i915_driver_probe+0x5c2/0xc90 [i915]
Oct 23 11:36:33 domU kernel:  ? vga_switcheroo_client_probe_defer+0x1f/0x40
Oct 23 11:36:33 domU kernel:  ? i915_pci_probe+0x3f/0x150 [i915]
Oct 23 11:36:33 domU kernel:  local_pci_probe+0x42/0x80
Oct 23 11:36:33 domU kernel:  ? _cond_resched+0x16/0x40
Oct 23 11:36:33 domU kernel:  pci_device_probe+0xfd/0x1b0
Oct 23 11:36:33 domU kernel:  really_probe+0x222/0x480
Oct 23 11:36:33 domU kernel:  driver_probe_device+0xe1/0x150
Oct 23 11:36:33 domU kernel:  device_driver_attach+0xa1/0xb0
Oct 23 11:36:33 domU kernel:  __driver_attach+0x8a/0x150
Oct 23 11:36:33 domU kernel:  ? device_driver_attach+0xb0/0xb0
Oct 23 11:36:33 domU kernel:  ? device_driver_attach+0xb0/0xb0
Oct 23 11:36:33 domU kernel:  bus_for_each_dev+0x78/0xc0
Oct 23 11:36:33 domU kernel:  bus_add_driver+0x12b/0x1e0
Oct 23 11:36:33 domU kernel:  driver_register+0x8b/0xe0
Oct 23 11:36:33 domU kernel:  ? 0xffffffffc06b8000
Oct 23 11:36:33 domU kernel:  i915_init+0x5d/0x70 [i915]
Oct 23 11:36:33 domU kernel:  do_one_initcall+0x44/0x1d0
Oct 23 11:36:33 domU kernel:  ? do_init_module+0x23/0x260
Oct 23 11:36:33 domU kernel:  ? kmem_cache_alloc_trace+0xf5/0x200
Oct 23 11:36:33 domU kernel:  do_init_module+0x5c/0x260
Oct 23 11:36:33 domU kernel:  __do_sys_finit_module+0xb1/0x110
Oct 23 11:36:33 domU kernel:  do_syscall_64+0x33/0x80
Oct 23 11:36:33 domU kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Oct 23 11:36:33 domU kernel: RIP: 0033:0x7f188e1aa9b9

This bug first appeared with commits abfb006f1ff4 and 0561e1f01e87
during the development of Xen 4.5 in 2014 and is present in all Xen
versions 4.5 and higher.

Currently, because of another bug in Qemu upstream, this crash can
only be reproduced using the traditional Qemu device model, and of
course it can only be reproduced on a system with an Intel IGD and
compatible hardware and system BIOS that supports Intel VT-d. It also
only occurs with Linux Intel graphics drivers (i915) in a Linux guest.
It does not occur in a Windows guest with proprietary Windows Intel
graphics drivers. This testing was done with Qemu traditional running as
a process in dom0.

The commit abfb006f1ff4 was intended to explicitly grant access to all
needed I/O memory ranges so access requests by guests would not fail
after commit 0561e1f01e87 was applied, but it failed to handle the case
when the Intel IGD is being passed to an HVM guest for VGA passthrough.
This patch grants access to the Intel IGD opregion if an Intel IGD is
passed to an HVM guest and gfx_passthru is enabled in the xl.cfg guest
configuration file, in addition to the other memory that was configured
for access in commit abfb006f1ff4.
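
For illustration, a minimal xl.cfg fragment for the affected configuration might look like this (the BDF shown is an assumption; only an Intel IGD passed through with gfx_passthru enabled is required to hit the bug):

    gfx_passthru = 1
    pci = [ '00:02.0' ]   # the Intel IGD
    device_model_version = 'qemu-xen-traditional'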

The fix is implemented as follows:

The first hunk of the patch adds the two macros that determine the
starting address and size of the Intel IGD opregion:
PCI_INTEL_OPREGION matches the value in tools/firmware/hvmloader/pci_regs.h,
and IGD_OPREGION_PAGES matches the value in tools/firmware/hvmloader/config.h.
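
As a sketch, the two macros would look like this (the 0xfc offset and the three-page size are both confirmed later in this thread; this is a reconstruction, not the exact hunk):

    #define PCI_INTEL_OPREGION 0xfc /* config space offset of the OpRegion address register */
    #define IGD_OPREGION_PAGES 3    /* 8kB region, not guaranteed to be page aligned */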

The second hunk is the new sysfs_dev_get_igd_opregion function, using
the same coding style as the other sysfs_dev_get_xxx functions in
the patched file. It returns the start address of the Intel IGD opregion
from dom0's point of view if there are no errors, and it is called by
code in the third and final hunk of the patch.
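
A minimal sketch of that function, reconstructed from the hunks quoted in the review below (error handling condensed; as in the patch, 0xffffffff is the error value):

    static uint32_t sysfs_dev_get_igd_opregion(libxl__gc *gc,
                                               libxl_device_pci *pci)
    {
        char *path = GCSPRINTF(SYSFS_PCI_DEV"/"PCI_BDF"/config",
                               pci->domain, pci->bus, pci->dev, pci->func);
        uint32_t igd_opregion = 0xffffffff;
        FILE *f = fopen(path, "r");

        if (!f)
            return igd_opregion;
        /* The 32-bit OpRegion address is read from config space offset
         * 0xfc (PCI_INTEL_OPREGION). */
        if (fseek(f, PCI_INTEL_OPREGION, SEEK_SET) == 0 &&
            fread(&igd_opregion, sizeof(igd_opregion), 1, f) != 1)
            igd_opregion = 0xffffffff;
        fclose(f);
        return igd_opregion;
    }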

The third hunk extends the libxl__grant_vga_iomem_permission function,
which was newly added in one of the commits being fixed here
(abfb006f1ff4). In addition to what it did before, that function now
checks whether we have an Intel IGD and, if so, calls the new
sysfs_dev_get_igd_opregion function to get the starting address of the
memory to grant access to.
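
Condensed, the added logic (the full hunk appears in the review quotes below, with error handling elided here) amounts to:

    /* If this is an Intel IGD, also allow access to the IGD opregion. */
    if (!libxl__is_igd_vga_passthru(gc, d_config))
        return 0;

    igd_opregion = sysfs_dev_get_igd_opregion(gc, pci);
    if (igd_opregion == 0xffffffff)
        break;

    /* Grant the stub domain (if any) and the guest access to the
     * three-page range containing the opregion. */
    vga_iomem_start = igd_opregion >> XC_PAGE_SHIFT;
    ret = xc_domain_iomem_permission(CTX->xch, stubdom_domid,
                                     vga_iomem_start, IGD_OPREGION_PAGES, 1);
    ret = xc_domain_iomem_permission(CTX->xch, domid,
                                     vga_iomem_start, IGD_OPREGION_PAGES, 1);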

This problem was discovered by building and testing versions of
Xen 4.5-unstable until the aforementioned patches that broke passthrough
of the Intel IGD to a Linux HVM guest were found.

That alone, though, was not enough information to arrive at this fix.
After identifying abfb006f1ff4 and 0561e1f01e87 as the commits that were
causing the trouble, it was necessary to find out what memory was being
denied that previously was allowed. By examining verbose logs from both
Qemu and Xen, plus the logs from a custom build of Xen with an added
log entry recording the address of the memory being denied, it was
possible to determine beyond doubt that the memory being denied was
the Intel IGD opregion.

Windows HVM guests are not affected by this issue, presumably because
the proprietary Windows Intel graphics drivers do not need to access the
Intel IGD opregion if for some reason it is inaccessible.

Guidelines for maintaining this code: This code needs to be kept
consistent with how hvmloader maps the Intel IGD opregion to the guest,
how hvmloader and libxenlight detect an Intel IGD on the PCI bus, and
how Qemu sets up the mapped IGD opregion in the guest and detects an
Intel IGD. For example, all these components should agree on the size of
the IGD opregion.

The most important point for the future is accurate detection of the
Intel IGD, which libxl__is_igd_vga_passthru currently provides. This
patch does not modify that function, but it does use it. It will be
important to ensure that function accurately detects an Intel IGD,
because that is the only way we validate the contents of the Intel
IGD opregion that we are permitting the guest to access. Currently, if
we have a VGA device, the vendor is Intel, and the gfx_passthru option
is enabled, we assume the contents of the memory we are mapping into
and sharing with the guest are valid. The libxl__is_igd_vga_passthru
function also reads the device id, but currently it assumes every
Intel GPU device has an opregion. So, for example, if it is discovered
that newer discrete Intel GPUs do not have an opregion, the
libxl__is_igd_vga_passthru function should be modified to return false
for those devices.

Fixes: abfb006f1ff4 (tools/libxl: explicitly grant access to needed I/O-memory ranges)
Fixes: 0561e1f01e87 (xen/common: do not implicitly permit access to mapped I/O memory)
Backport: 4.12+

Signed-off-by: Chuck Zmudzinski <brchuckz@netscape.net>
---
 tools/libs/light/libxl_pci.c | 72 ++++++++++++++++++++++++++++++++++++
 1 file changed, 72 insertions(+)

Comments

Jan Beulich March 15, 2022, 11:38 a.m. UTC | #1
On 14.03.2022 04:41, Chuck Zmudzinski wrote:
> Fixes: abfb006f1ff4 (tools/libxl: explicitly grant access to needed I/O-memory ranges)
> Fixes: 0561e1f01e87 (xen/common: do not implicitly permit access to mapped I/O memory)
> Backport: 4.12+

Just fyi: This is fine to have as a tag, but it wouldn't be backported
farther than to 4.15.

Apart from this largely some style issues (see below), but please
realize that I'm not a libxl maintainer and hence I may not have good
enough knowledge of, in particular, potential unwritten conventions.

> @@ -610,6 +612,45 @@ out:
>      return ret;
>  }
>  
> +static uint32_t sysfs_dev_get_igd_opregion(libxl__gc *gc,
> +                                           libxl_device_pci *pci)
> +{
> +    char *pci_device_config_path =
> +            GCSPRINTF(SYSFS_PCI_DEV"/"PCI_BDF"/config",
> +                      pci->domain, pci->bus, pci->dev, pci->func);
> +    size_t read_items;
> +    uint32_t igd_opregion;
> +    uint32_t error = 0xffffffff;

I think this constant wants to gain a #define, to be able to correlate
the use sites. I'm also not sure of the value - in principle the
register can hold this value, but of course then it won't be 3 pages.
Maybe the error check further down should be to see whether adding 2
to the value would overflow in 32 bits? (In that case a #define may
not be needed anymore, as there wouldn't be multiple instances of the
constant in the code.)
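
One possible form of that check, assuming "adding 2" means the two further pages of the three-page window, would be:

    /* Treat the register value as unusable if the 3-page window it
     * implies would wrap past the end of the 32-bit address space. */
    if (igd_opregion > UINT32_MAX - 2 * XC_PAGE_SIZE)
        return 0; /* no usable OpRegion address */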

> +
> +    FILE *f = fopen(pci_device_config_path, "r");
> +    if (!f) {

While libxl has some special style rules, I think it still wants a
blank line between declaration(s) and statement(s), just like we
expect elsewhere. Effectively you want to simply move the blank line
you have one line down.

> @@ -2531,6 +2572,37 @@ int libxl__grant_vga_iomem_permission(libxl__gc *gc, const uint32_t domid,
>                    domid, vga_iomem_start, (vga_iomem_start + 0x20 - 1));
>              return ret;
>          }
> +
> +        /* If this is an Intel IGD, allow access to the IGD opregion */
> +        if (!libxl__is_igd_vga_passthru(gc, d_config)) return 0;

Despite the provision for "return" or alike to go on the same line
as an error code check, I don't think this is okay here. It would be
if, as iirc generally expected in libxl, you latched the function
return value into a local variable named "rc" (I think).

> +        uint32_t igd_opregion = sysfs_dev_get_igd_opregion(gc, pci);
> +        uint32_t error = 0xffffffff;

Please don't mix declarations and statements. I also don't think
"error" is really necessary as a local variable, but with the change
suggested above it might disappear anyway.

> +        if (igd_opregion == error) break;

Like above I'm not sure this is okay to all live on one line. I also
think it would be nice if you used "return 0" or "break" consistently.
Of course a related question is whether failure here should actually
be reported to the caller.

> +        vga_iomem_start = ( (uint64_t) igd_opregion ) >> XC_PAGE_SHIFT;

There's no need for a cast here, as you're right-shifting. Also
(just fyi) there would have been three too many spaces here. I'm
additionally not certain whether re-using a variable for a purpose
not matching its name is deemed acceptable by libxl maintainers.

> +        ret = xc_domain_iomem_permission(CTX->xch, stubdom_domid,
> +                                         vga_iomem_start,
> +                                         IGD_OPREGION_PAGES, 1);
> +        if (ret < 0) {
> +            LOGED(ERROR, domid,
> +                  "failed to give stubdom%d access to iomem range "
> +                  "%"PRIx64"-%"PRIx64" for IGD passthru",
> +                  stubdom_domid, vga_iomem_start, (vga_iomem_start +
> +                                                IGD_OPREGION_PAGES - 1));
> +            return ret;
> +        }

I have to admit that I find it odd that this is done unconditionally,
but I notice the same is done in pre-existing code. I would have
expected this to happen only when there actually is a device model
stub domain.

Jan

> +        ret = xc_domain_iomem_permission(CTX->xch, domid,
> +                                         vga_iomem_start,
> +                                         IGD_OPREGION_PAGES, 1);
> +        if (ret < 0) {
> +            LOGED(ERROR, domid,
> +                  "failed to give dom%d access to iomem range "
> +                  "%"PRIx64"-%"PRIx64" for IGD passthru",
> +                  domid, vga_iomem_start, (vga_iomem_start +
> +                                           IGD_OPREGION_PAGES - 1));
> +            return ret;
> +        }
>          break;
>      }
>
Chuck Zmudzinski March 15, 2022, 6:31 p.m. UTC | #2
On 3/15/22 7:38 AM, Jan Beulich wrote:
> On 14.03.2022 04:41, Chuck Zmudzinski wrote:
>> Fixes: abfb006f1ff4 (tools/libxl: explicitly grant access to needed I/O-memory ranges)
>> Fixes: 0561e1f01e87 (xen/common: do not implicitly permit access to mapped I/O memory)
>> Backport: 4.12+
> Just fyi: This is fine to have as a tag, but it wouldn't be backported
> farther than to 4.15.

That's entirely reasonable. I understand this is a bug fix, not a
security issue.
>
> Apart from this largely some style issues (see below), but please
> realize that I'm not a libxl maintainer and hence I may not have good
> enough knowledge of, in particular, potential unwritten conventions.

I will take your comments into consideration regarding style before
writing the next version of the patch, and carefully check libxl's
coding style file.
>
>> @@ -610,6 +612,45 @@ out:
>>       return ret;
>>   }
>>   
>> +static uint32_t sysfs_dev_get_igd_opregion(libxl__gc *gc,
>> +                                           libxl_device_pci *pci)
>> +{
>> +    char *pci_device_config_path =
>> +            GCSPRINTF(SYSFS_PCI_DEV"/"PCI_BDF"/config",
>> +                      pci->domain, pci->bus, pci->dev, pci->func);
>> +    size_t read_items;
>> +    uint32_t igd_opregion;
>> +    uint32_t error = 0xffffffff;
> I think this constant wants to gain a #define, to be able to correlate
> the use sites. I'm also not sure of the value - in principle the
> register can hold this value, but of course then it won't be 3 pages.

What we are storing as the return value is the starting address,
not the size, of the opregion, and that is a 32-bit value. If we
cannot read it, we return 0xffffffff instead to indicate the error
that we were unable to read the starting address of the opregion
from the config attribute in sysfs. The 32-bit value we are looking for
is read at offset 0xfc from the start of the config attribute of the
Intel IGD in sysfs. The offset 0xfc is defined by PCI_INTEL_OPREGION
both here and in hvmloader (and also in Qemu). The data that is
read at this offset from the start of the config attribute of the Intel
IGD in sysfs is the 32-bit address of the start of the opregion.
> Maybe the error check further down should be to see whether adding 2
> to the value would overflow in 32 bits? (In that case a #define may
> not be needed anymore, as there wouldn't be multiple instances of the
> constant in the code.)

That would work also. Please note that I chose that error value to be
consistent with the way the current sysfs_dev_get_vendor function does
it. That function does not assign a variable named 'error' for its
error return value, which in its case is 0xffff because it returns
uint16_t instead of uint32_t; I chose to explicitly assign the error
value to a variable to help make the code more readable. Your comment
that this could be a #define instead is good. I also think we should
use a #define for the error return value of the sysfs_dev_get_vendor
function. Something like:

#define ERROR_16    0xffff
#define ERROR_32    0xffffffff

might be appropriate. But that would be touching code unrelated to
this bug fix, so I think the libxl maintainers should weigh in about
what to do here. They might let me take this opportunity to update
and improve the style of other functions in the file not related to
this bug fix, but I am not inclined to do that without an explicit
request from them. So I am not sure yet what I will do in the next
version of the patch, but I will address your concerns here and
explain my reasoning for the changes in the changelog for version 2
of the patch.
>
>> +
>> +    FILE *f = fopen(pci_device_config_path, "r");
>> +    if (!f) {
> While libxl has some special style rules, I think it still wants a
> blank line between declaration(s) and statement(s), just like we
> expect elsewhere. Effectively you want to simply move the blank line
> you have one line down.

I think I followed the same style here as the existing sysfs_dev_get_xxx
functions. I will double check that and use the same style the other
functions use unless they clearly violate the rules. If I deviate from
the style of the existing functions to conform to the current coding
style, I will note that and suggest a subsequent patch to update the
style of the other functions.
>
>> @@ -2531,6 +2572,37 @@ int libxl__grant_vga_iomem_permission(libxl__gc *gc, const uint32_t domid,
>>                     domid, vga_iomem_start, (vga_iomem_start + 0x20 - 1));
>>               return ret;
>>           }
>> +
>> +        /* If this is an Intel IGD, allow access to the IGD opregion */
>> +        if (!libxl__is_igd_vga_passthru(gc, d_config)) return 0;
> Despite the provision for "return" or alike to go on the same line
> as an error code check, I don't think this is okay here. It would be
> if, as iirc generally expected in libxl, you latched the function
> return value into a local variable named "rc" (I think).

I will double check how the function being patched handles the return
value. I don't even remember if it has a local variable named rc for a
return value; IIRC it was either ret or 0. I understand that libxl
expects rc to be used these days, though. This might be another
candidate for updating the file to libxl's current standards.
>
>> +        uint32_t igd_opregion = sysfs_dev_get_igd_opregion(gc, pci);
>> +        uint32_t error = 0xffffffff;
> Please don't mix declarations and statements.

I presume you are saying these two lines should be re-written as:

uint32_t igd_opregion;
uint32_t error;

igd_opregion = sysfs_dev_get_igd_opregion(gc, pci);
error = 0xffffffff;

Please reply if my understanding here is not correct.
> I also don't think
> "error" is really necessary as a local variable, but with the change
> suggested above it might disappear anyway.

I do plan for the next version of the patch to use a #define for
this instead of the error variable (or add 2 to overflow it), so it
will disappear in the next version.
>
>> +        if (igd_opregion == error) break;
> Like above I'm not sure this is okay to all live on one line. I also
> think it would be nice if you used "return 0" or "break" consistently.
> Of course a related question is whether failure here should actually
> be reported to the caller.

Good points here. I agree about consistency with break and return 0.
I will change this to return 0 and move it to the next line.

I do not want to change the current meaning of the return value
without knowing how the caller uses it. IIRC, the function currently
always returns 0 unless it encounters a negative return value from
xc_domain_iomem_permission, in which case it returns that negative
value to indicate an error to the caller. So if we return anything
other than 0 here, we might be returning an error code that the
caller does not expect or interpret correctly.

I will also consider logging an error message here before returning 0.
A message something like "dom%d: Intel IGD detected, but could
not find IGD opregion" would explain the failure. I don't think a
non-zero error code to the caller is appropriate here, though, for
the reason just given. If it is necessary to return an error to the
caller instead of 0, all callers of this function would need to
interpret it correctly; I would suggest an error return value greater
than 0 to distinguish it from the return value < 0 that indicates an
error from xc_domain_iomem_permission. But I hope the libxl
maintainers will accept a return value of 0 here, at least for this
patch; a later patch could rework the return value of this function,
which would probably also require touching its caller(s) to respond
properly to this particular error.

In any case, I will double-check my understanding of the meaning of
the return value before writing the next version of the patch. For
now, I will use return 0 instead of break here and move it to the
next line, unless I hear otherwise from the libxl maintainers.
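
A sketch of what logging before returning 0 might look like, using libxl's LOGD convention (exact wording still to be settled):

    LOGD(ERROR, domid,
         "Intel IGD detected, but could not find IGD opregion");
    return 0;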
>
>> +        vga_iomem_start = ( (uint64_t) igd_opregion ) >> XC_PAGE_SHIFT;
> There's no need for a cast here, as you're right-shifting. Also
> (just fyi) there would have been three to many spaces here. I'm
> additionally not certain whether re-using a variable for a purpose
> not matching its name is deemed acceptable by libxl maintainers.

I wrote it that way expecting a compiler error if I didn't do the cast.
I have not checked whether the cast is necessary, though, and maybe you
are right. I will check by removing the cast and seeing if the compiler
complains.

If the cast is not needed, I will just use the 32-bit igd_opregion variable
when calling xc_domain_iomem_permission instead of the 64-bit
vga_iomem_start variable. I will remove the three spaces and use a
more descriptive variable instead of re-using vga_iomem_start if the
compiler insists on the cast from 32-bit to 64-bit.
>
>> +        ret = xc_domain_iomem_permission(CTX->xch, stubdom_domid,
>> +                                         vga_iomem_start,
>> +                                         IGD_OPREGION_PAGES, 1);
>> +        if (ret < 0) {
>> +            LOGED(ERROR, domid,
>> +                  "failed to give stubdom%d access to iomem range "
>> +                  "%"PRIx64"-%"PRIx64" for IGD passthru",
>> +                  stubdom_domid, vga_iomem_start, (vga_iomem_start +
>> +                                                IGD_OPREGION_PAGES - 1));
>> +            return ret;
>> +        }
> I have to admit that I find it odd that this is done unconditionally,
> but I notice the same is done in pre-existing code. I would have
> expected this to happen only when there actually is a device model
> stub domain.

I don't understand how that works either. All my tests have been with
the device model running as a process in dom0. I am thinking maybe
in that case it just uses dom0 for the stub domain, but I have not checked
that. I will check it by dumping the value of stubdom_domid to a log in my
next test.

Thank you for responding promptly. Now I have some work to do writing
the next version of the patch and documenting it clearly in its changelog.
It will take me a while - I will spend enough time on it so hopefully the
libxl maintainers don't have to spend so much time on it.

Chuck

N.B. I forgot to send this reply to xen-devel and cc the libxl
maintainers, so I am doing so here. I also re-formatted my replies
to avoid lines with too many characters. Sorry for the
confusion.
Chuck Zmudzinski March 16, 2022, 1:27 a.m. UTC | #3
On 3/15/22 7:38 AM, Jan Beulich wrote:
> On 14.03.2022 04:41, Chuck Zmudzinski wrote:
>
>> @@ -610,6 +612,45 @@ out:
>>       return ret;
>>   }
>>   
>> +static uint32_t sysfs_dev_get_igd_opregion(libxl__gc *gc,
>> +                                           libxl_device_pci *pci)
>> +{
>> +    char *pci_device_config_path =
>> +            GCSPRINTF(SYSFS_PCI_DEV"/"PCI_BDF"/config",
>> +                      pci->domain, pci->bus, pci->dev, pci->func);
>> +    size_t read_items;
>> +    uint32_t igd_opregion;
>> +    uint32_t error = 0xffffffff;
> I think this constant wants to gain a #define, to be able to correlate
> the use sites. I'm also not sure of the value - in principle the
> register can hold this value, but of course then it won't be 3 pages.
>

I have one more comment to add here. I am not intending
to define igd_opregion as a data structure 3 pages (12k)
long, much less as a pointer to such a structure. However,
it would be nice to have access to the actual data structure
in libxl, because we could use it to validate its contents.
I looked in the code for the i915 Linux kernel module, and
the IGD opregion does have a signature that we could check
if we have access to it. That would mitigate my concerns
expressed in my first version of the patch about a false
positive when identifying an Intel IGD. Hvmloader should
probably also do this check before it maps the Intel IGD
opregion into guest memory, if that is possible. However, I
expect this is not memory that libxl has access to. It is
probably a structure in kernel space, but it might be
possible for libxl to ask the hypervisor for access to it.
Perhaps the libxl maintainers can shed some light on that
possibility. If it is possible, I will include such a check of
the validity of the IGD opregion's contents in version 2 of
the patch.
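
For reference, the signature the i915 driver checks is the 16-byte string at the start of the OpRegion header. A hypothetical helper for such a check (the constant is taken from the driver's intel_opregion.c; the rest is an assumption about how libxl might use it, if the mapping problem could be solved):

    #include <stdbool.h>
    #include <string.h>

    #define OPREGION_SIGNATURE "IntelGraphicsMem"

    /* Check the first 16 bytes of a mapped OpRegion header. */
    static bool igd_opregion_signature_ok(const void *opregion_base)
    {
        return memcmp(opregion_base, OPREGION_SIGNATURE, 16) == 0;
    }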

Regards,

Chuck
Chuck Zmudzinski March 16, 2022, 12:18 p.m. UTC | #4
On 3/15/2022 9:27 PM, Chuck Zmudzinski wrote:
>
>
> On 3/15/22 7:38 AM, Jan Beulich wrote:
>> On 14.03.2022 04:41, Chuck Zmudzinski wrote:
>>
>>> @@ -610,6 +612,45 @@ out:
>>>       return ret;
>>>   }
>>>   +static uint32_t sysfs_dev_get_igd_opregion(libxl__gc *gc,
>>> +                                           libxl_device_pci *pci)
>>> +{
>>> +    char *pci_device_config_path =
>>> +            GCSPRINTF(SYSFS_PCI_DEV"/"PCI_BDF"/config",
>>> +                      pci->domain, pci->bus, pci->dev, pci->func);
>>> +    size_t read_items;
>>> +    uint32_t igd_opregion;
>>> +    uint32_t error = 0xffffffff;
>> I think this constant wants to gain a #define, to be able to correlate
>> the use sites. I'm also not sure of the value - in principle the
>> register can hold this value, but of course then it won't be 3 pages.
>>
>
> I have one more comment to add here. I am not intending
> to define igd_opregion as a data structure 3 pages (12k)

Correction: Actually, the igd_opregion itself would be 2 pages.
The third page comes from the fact that the region is not
guaranteed to be page aligned, so it can take three pages to
ensure that it is fully mapped to the guest. From the commit
message in hvmloader that increased it from two to three
pages:

From: Keir Fraser <keir@xxxxxxx>
Date: Thu, 10 Jan 2013 17:26:24 +0000 (+0000)
Subject: hvmloader: Allocate 3 pages for Intel GPU OpRegion passthrough.
X-Git-Tag: 4.3.0-rc1~546
X-Git-Url:
https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff_plain;h=408a9e56343b006c9e58a334f0b97dd2deedf9ac

hvmloader: Allocate 3 pages for Intel GPU OpRegion passthrough.

The 8kB region may not be page aligned, hence requiring 3 pages to
be mapped through.

Signed-off-by: Keir Fraser <keir@xxxxxxx>

In tests on my system, this is true: the opregion was, IIRC,
offset 0x18 (24) bytes from a page boundary.
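
Worked through with that offset: an 8kB (0x2000-byte) region starting 0x18 bytes into a page ends at byte 0x2017 and therefore touches three 4kB pages:

    /* pages touched = last pfn - first pfn + 1 */
    uint64_t start = 0x18, size = 0x2000;
    unsigned pages = ((start + size - 1) >> 12) - (start >> 12) + 1; /* == 3 */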

This has an unfortunate side effect of granting
access to one page that the guest does not really
need access to. My well-behaved and trusted Linux
and Windows guests only request the two pages of
the igd_opregion, but they could have accessed the 24
bytes before it or the (4k - 24) bytes after it. I don't
think that greatly increases the security risk of including
this patch, because PCI passthrough must always be to a
trusted guest for it to be secure anyway. An attacker who
gained full control over a guest with PCI devices passed
through to it would not need access to that extra page to
successfully attack dom0 or the control domain; the damage
could be done either way.

Regards,

Chuck
Jan Beulich March 18, 2022, 8:13 a.m. UTC | #5
On 14.03.2022 04:41, Chuck Zmudzinski wrote:
> When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD
> opregion to the guest but libxl does not grant the guest permission to
> access the mapped memory region. This results in a crash of the i915.ko
> kernel module in a Linux HVM guest when it needs to access the IGD
> opregion:
> 
> Oct 23 11:36:33 domU kernel: Call Trace:
> Oct 23 11:36:33 domU kernel:  ? idr_alloc+0x39/0x70
> Oct 23 11:36:33 domU kernel:  drm_get_last_vbltimestamp+0xaa/0xc0 [drm]
> Oct 23 11:36:33 domU kernel:  drm_reset_vblank_timestamp+0x5b/0xd0 [drm]
> Oct 23 11:36:33 domU kernel:  drm_crtc_vblank_on+0x7b/0x130 [drm]
> Oct 23 11:36:33 domU kernel:  intel_modeset_setup_hw_state+0xbd4/0x1900 [i915]
> Oct 23 11:36:33 domU kernel:  ? _cond_resched+0x16/0x40
> Oct 23 11:36:33 domU kernel:  ? ww_mutex_lock+0x15/0x80
> Oct 23 11:36:33 domU kernel:  intel_modeset_init_nogem+0x867/0x1d30 [i915]
> Oct 23 11:36:33 domU kernel:  ? gen6_write32+0x4b/0x1c0 [i915]
> Oct 23 11:36:33 domU kernel:  ? intel_irq_postinstall+0xb9/0x670 [i915]
> Oct 23 11:36:33 domU kernel:  i915_driver_probe+0x5c2/0xc90 [i915]
> Oct 23 11:36:33 domU kernel:  ? vga_switcheroo_client_probe_defer+0x1f/0x40
> Oct 23 11:36:33 domU kernel:  ? i915_pci_probe+0x3f/0x150 [i915]
> Oct 23 11:36:33 domU kernel:  local_pci_probe+0x42/0x80
> Oct 23 11:36:33 domU kernel:  ? _cond_resched+0x16/0x40
> Oct 23 11:36:33 domU kernel:  pci_device_probe+0xfd/0x1b0
> Oct 23 11:36:33 domU kernel:  really_probe+0x222/0x480
> Oct 23 11:36:33 domU kernel:  driver_probe_device+0xe1/0x150
> Oct 23 11:36:33 domU kernel:  device_driver_attach+0xa1/0xb0
> Oct 23 11:36:33 domU kernel:  __driver_attach+0x8a/0x150
> Oct 23 11:36:33 domU kernel:  ? device_driver_attach+0xb0/0xb0
> Oct 23 11:36:33 domU kernel:  ? device_driver_attach+0xb0/0xb0
> Oct 23 11:36:33 domU kernel:  bus_for_each_dev+0x78/0xc0
> Oct 23 11:36:33 domU kernel:  bus_add_driver+0x12b/0x1e0
> Oct 23 11:36:33 domU kernel:  driver_register+0x8b/0xe0
> Oct 23 11:36:33 domU kernel:  ? 0xffffffffc06b8000
> Oct 23 11:36:33 domU kernel:  i915_init+0x5d/0x70 [i915]
> Oct 23 11:36:33 domU kernel:  do_one_initcall+0x44/0x1d0
> Oct 23 11:36:33 domU kernel:  ? do_init_module+0x23/0x260
> Oct 23 11:36:33 domU kernel:  ? kmem_cache_alloc_trace+0xf5/0x200
> Oct 23 11:36:33 domU kernel:  do_init_module+0x5c/0x260
> Oct 23 11:36:33 domU kernel:  __do_sys_finit_module+0xb1/0x110
> Oct 23 11:36:33 domU kernel:  do_syscall_64+0x33/0x80
> Oct 23 11:36:33 domU kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9

The call trace alone leaves open where exactly the crash occurred.
Looking at 5.17 I notice that the first thing the driver does
after mapping the range is to check the signature (both in
intel_opregion_setup()). As the signature can't possibly match
with no access granted to the underlying mappings, there shouldn't
be any further attempts to use the region in the driver; if there
are, I'd view this as a driver bug.

Furthermore I've found indications (e.g. in the Core Gen11 doc)
that the register may not hold an address in the first place, but
instead a set of bitfields. I can't help the impression that the
driver would still try to map the range pointed at by the value
(as long as it's non-zero), which would imply unpredictable
behavior.

And then, looking further at intel_opregion_setup(), there's yet
one more memory range which the guest may need access to:
opregion->asle->rvda (or a value derived from it) also is handed
to memremap() under certain conditions.

Jan
Chuck Zmudzinski March 28, 2022, 9:31 p.m. UTC | #6
On 3/15/22 7:38 AM, Jan Beulich wrote:
> On 14.03.2022 04:41, Chuck Zmudzinski wrote:
>
>> +        ret = xc_domain_iomem_permission(CTX->xch, stubdom_domid,
>> +                                         vga_iomem_start,
>> +                                         IGD_OPREGION_PAGES, 1);
>> +        if (ret < 0) {
>> +            LOGED(ERROR, domid,
>> +                  "failed to give stubdom%d access to iomem range "
>> +                  "%"PRIx64"-%"PRIx64" for IGD passthru",
>> +                  stubdom_domid, vga_iomem_start, (vga_iomem_start +
>> +                                                IGD_OPREGION_PAGES - 1));
>> +            return ret;
>> +        }
> I have to admit that I find it odd that this is done unconditionally,
> but I notice the same is done in pre-existing code. I would have
> expected this to happen only when there actually is a device model
> stub domain.
>
> Jan

I dumped the value of stubdom_domid for my tests with the
device model running in dom0:

libxl: info: libxl_pci.c:2556:libxl__grant_vga_iomem_permission: Domain 
3: stubdom id: 0

As I expected, when there is not a device model stub domain
and the device model runs in dom0, the stubdom_id is 0.

I will now do some tests to see if this is necessary when the
device model runs in dom0. I would like to know if the device
model running in dom0 needs to have access granted here
or not. When there is a device model stub domain, I presume
it is necessary, and I can check that also and write the
next version of the patch accordingly.

Chuck
Anthony PERARD March 30, 2022, 5:15 p.m. UTC | #7
Hi Chuck,

On Sun, Mar 13, 2022 at 11:41:37PM -0400, Chuck Zmudzinski wrote:
> When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD
> opregion to the guest but libxl does not grant the guest permission to

I'm not reading the same thing when looking at the code in hvmloader.
It seems that hvmloader allocates some memory for the IGD opregion
rather than mapping it.


tools/firmware/hvmloader/pci.c:184
    if ( vendor_id == 0x8086 )
    {
        igd_opregion_pgbase = mem_hole_alloc(IGD_OPREGION_PAGES);
        /*
         * Write the the OpRegion offset to give the opregion
         * address to the device model. The device model will trap
         * and map the OpRegion at the give address.
         */
        pci_writel(vga_devfn, PCI_INTEL_OPREGION,
                   igd_opregion_pgbase << PAGE_SHIFT);
    }

I think this write would go through QEMU; does it do something with it?
(I sort of reply to this question at the end of the mail.)

Is this code in hvmloader actually run in your case?

> Currently, because of another bug in Qemu upstream, this crash can
> only be reproduced using the traditional Qemu device model, and of

qemu-traditional is a bit old... What is the bug in QEMU? Do you have a
link to a patch/mail?

> diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c
> index 4bbbfe9f16..a4fc473de9 100644
> --- a/tools/libs/light/libxl_pci.c
> +++ b/tools/libs/light/libxl_pci.c
> @@ -2531,6 +2572,37 @@ int libxl__grant_vga_iomem_permission(libxl__gc *gc, const uint32_t domid,
>                    domid, vga_iomem_start, (vga_iomem_start + 0x20 - 1));
>              return ret;
>          }
> +
> +        /* If this is an Intel IGD, allow access to the IGD opregion */
> +        if (!libxl__is_igd_vga_passthru(gc, d_config)) return 0;
> +
> +        uint32_t igd_opregion = sysfs_dev_get_igd_opregion(gc, pci);
> +        uint32_t error = 0xffffffff;
> +        if (igd_opregion == error) break;
> +
> +        vga_iomem_start = ( (uint64_t) igd_opregion ) >> XC_PAGE_SHIFT;
> +        ret = xc_domain_iomem_permission(CTX->xch, stubdom_domid,
> +                                         vga_iomem_start,
> +                                         IGD_OPREGION_PAGES, 1);
> +        if (ret < 0) {
> +            LOGED(ERROR, domid,
> +                  "failed to give stubdom%d access to iomem range "
> +                  "%"PRIx64"-%"PRIx64" for IGD passthru",
> +                  stubdom_domid, vga_iomem_start, (vga_iomem_start +
> +                                                IGD_OPREGION_PAGES - 1));
> +            return ret;
> +        }
> +        ret = xc_domain_iomem_permission(CTX->xch, domid,
> +                                         vga_iomem_start,
> +                                         IGD_OPREGION_PAGES, 1);

Here, you seem to add permission to an address that is read from the
pci config space of the device, but as I've pointed out above hvmloader
seems to overwrite this address. Does this call to
xc_domain_iomem_permission() actually do anything useful?
Or is it by luck that the address returned by
sysfs_dev_get_igd_opregion() happened to be the address that hvmloader
is going to write?

Or maybe hvmloader doesn't actually do anything?


Some more thought on that: looking at QEMU, it seems that there's already
a call to xc_domain_iomem_permission(), in igd_write_opregion(). So
adding one in libxl would seem redundant, or maybe it's the one for the
device model's domain that's needed (named 'stubdom_domid' here)?

Thanks,
Andrew Cooper March 30, 2022, 5:27 p.m. UTC | #8
On 30/03/2022 18:15, Anthony PERARD wrote:
> Hi Chuck,
>
> On Sun, Mar 13, 2022 at 11:41:37PM -0400, Chuck Zmudzinski wrote:
>> When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD
>> opregion to the guest but libxl does not grant the guest permission to
> I'm not reading the same thing when looking at code in hvmloader. It
> seems that hvmloader allocate some memory for the IGD opregion rather
> than mapping it.
>
>
> tools/firmware/hvmloader/pci.c:184
>     if ( vendor_id == 0x8086 )
>     {
>         igd_opregion_pgbase = mem_hole_alloc(IGD_OPREGION_PAGES);
>         /*
>          * Write the the OpRegion offset to give the opregion
>          * address to the device model. The device model will trap
>          * and map the OpRegion at the give address.
>          */
>         pci_writel(vga_devfn, PCI_INTEL_OPREGION,
>                    igd_opregion_pgbase << PAGE_SHIFT);
>     }
>
> I think this write would go through QEMU, does it do something with it?
> (I kind of replying to this question at the end of the mail.)
>
> Is this code in hvmloader actually run in your case?
>
>> Currently, because of another bug in Qemu upstream, this crash can
>> only be reproduced using the traditional Qemu device model, and of
> qemu-traditional is a bit old... What is the bug in QEMU? Do you have a
> link to a patch/mail?
>
>> diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c
>> index 4bbbfe9f16..a4fc473de9 100644
>> --- a/tools/libs/light/libxl_pci.c
>> +++ b/tools/libs/light/libxl_pci.c
>> @@ -2531,6 +2572,37 @@ int libxl__grant_vga_iomem_permission(libxl__gc *gc, const uint32_t domid,
>>                    domid, vga_iomem_start, (vga_iomem_start + 0x20 - 1));
>>              return ret;
>>          }
>> +
>> +        /* If this is an Intel IGD, allow access to the IGD opregion */
>> +        if (!libxl__is_igd_vga_passthru(gc, d_config)) return 0;
>> +
>> +        uint32_t igd_opregion = sysfs_dev_get_igd_opregion(gc, pci);
>> +        uint32_t error = 0xffffffff;
>> +        if (igd_opregion == error) break;
>> +
>> +        vga_iomem_start = ( (uint64_t) igd_opregion ) >> XC_PAGE_SHIFT;
>> +        ret = xc_domain_iomem_permission(CTX->xch, stubdom_domid,
>> +                                         vga_iomem_start,
>> +                                         IGD_OPREGION_PAGES, 1);
>> +        if (ret < 0) {
>> +            LOGED(ERROR, domid,
>> +                  "failed to give stubdom%d access to iomem range "
>> +                  "%"PRIx64"-%"PRIx64" for IGD passthru",
>> +                  stubdom_domid, vga_iomem_start, (vga_iomem_start +
>> +                                                IGD_OPREGION_PAGES - 1));
>> +            return ret;
>> +        }
>> +        ret = xc_domain_iomem_permission(CTX->xch, domid,
>> +                                         vga_iomem_start,
>> +                                         IGD_OPREGION_PAGES, 1);
> Here, you seem to add permission to an address that is read from the
> pci config space of the device, but as I've pointed out above hvmloader
> seems to overwrite this address. Does this call to
> xc_domain_iomem_permission() actually do anything useful?
> Or is it by luck that the address returned by
> sysfs_dev_get_igd_opregion() happened to be the address that hvmloader
> is going to write?
>
> Or maybe hvmloader doesn't actually do anything?
>
>
> Some more thought on that: looking at QEMU, it seems that there's already
> a call to xc_domain_iomem_permission(), in igd_write_opregion(). So
> adding one in libxl would seem redundant, or maybe it's the one for the
> device model's domain that's needed (named 'stubdom_domid' here)?

This has been discussed before, but no one's done anything about it.
It's a massive layering violation for QEMU to issue
xc_domain_iomem_permission()/etc hypercalls.

It should be the toolstack, and only the toolstack, which makes
permissions hypercalls, which in turn will fix a slew of "QEMU doesn't
work when it doesn't have dom0 superpowers" bugs with stubdomains.

In this case specifically, the opregion is a magic Intel graphics
specific bodge.  The i915 driver in the guest really does need to access
part of the real PCH during init, which (in Xen's permission model)
really does require permitting access to the MMIO range (8k iirc) so it
can be mapped as a BAR in QEMU's emulated PCH.

~Andrew
Jason Andryuk March 30, 2022, 6:45 p.m. UTC | #9
On Fri, Mar 18, 2022 at 4:13 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 14.03.2022 04:41, Chuck Zmudzinski wrote:
> > When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD
> > opregion to the guest but libxl does not grant the guest permission to
> > access the mapped memory region. This results in a crash of the i915.ko
> > kernel module in a Linux HVM guest when it needs to access the IGD
> > opregion:
> >
> > Oct 23 11:36:33 domU kernel: Call Trace:
> > Oct 23 11:36:33 domU kernel:  ? idr_alloc+0x39/0x70
> > Oct 23 11:36:33 domU kernel:  drm_get_last_vbltimestamp+0xaa/0xc0 [drm]
> > Oct 23 11:36:33 domU kernel:  drm_reset_vblank_timestamp+0x5b/0xd0 [drm]
> > Oct 23 11:36:33 domU kernel:  drm_crtc_vblank_on+0x7b/0x130 [drm]
> > Oct 23 11:36:33 domU kernel:  intel_modeset_setup_hw_state+0xbd4/0x1900 [i915]
> > Oct 23 11:36:33 domU kernel:  ? _cond_resched+0x16/0x40
> > Oct 23 11:36:33 domU kernel:  ? ww_mutex_lock+0x15/0x80
> > Oct 23 11:36:33 domU kernel:  intel_modeset_init_nogem+0x867/0x1d30 [i915]
> > Oct 23 11:36:33 domU kernel:  ? gen6_write32+0x4b/0x1c0 [i915]
> > Oct 23 11:36:33 domU kernel:  ? intel_irq_postinstall+0xb9/0x670 [i915]
> > Oct 23 11:36:33 domU kernel:  i915_driver_probe+0x5c2/0xc90 [i915]
> > Oct 23 11:36:33 domU kernel:  ? vga_switcheroo_client_probe_defer+0x1f/0x40
> > Oct 23 11:36:33 domU kernel:  ? i915_pci_probe+0x3f/0x150 [i915]
> > Oct 23 11:36:33 domU kernel:  local_pci_probe+0x42/0x80
> > Oct 23 11:36:33 domU kernel:  ? _cond_resched+0x16/0x40
> > Oct 23 11:36:33 domU kernel:  pci_device_probe+0xfd/0x1b0
> > Oct 23 11:36:33 domU kernel:  really_probe+0x222/0x480
> > Oct 23 11:36:33 domU kernel:  driver_probe_device+0xe1/0x150
> > Oct 23 11:36:33 domU kernel:  device_driver_attach+0xa1/0xb0
> > Oct 23 11:36:33 domU kernel:  __driver_attach+0x8a/0x150
> > Oct 23 11:36:33 domU kernel:  ? device_driver_attach+0xb0/0xb0
> > Oct 23 11:36:33 domU kernel:  ? device_driver_attach+0xb0/0xb0
> > Oct 23 11:36:33 domU kernel:  bus_for_each_dev+0x78/0xc0
> > Oct 23 11:36:33 domU kernel:  bus_add_driver+0x12b/0x1e0
> > Oct 23 11:36:33 domU kernel:  driver_register+0x8b/0xe0
> > Oct 23 11:36:33 domU kernel:  ? 0xffffffffc06b8000
> > Oct 23 11:36:33 domU kernel:  i915_init+0x5d/0x70 [i915]
> > Oct 23 11:36:33 domU kernel:  do_one_initcall+0x44/0x1d0
> > Oct 23 11:36:33 domU kernel:  ? do_init_module+0x23/0x260
> > Oct 23 11:36:33 domU kernel:  ? kmem_cache_alloc_trace+0xf5/0x200
> > Oct 23 11:36:33 domU kernel:  do_init_module+0x5c/0x260
> > Oct 23 11:36:33 domU kernel:  __do_sys_finit_module+0xb1/0x110
> > Oct 23 11:36:33 domU kernel:  do_syscall_64+0x33/0x80
> > Oct 23 11:36:33 domU kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> The call trace alone leaves open where exactly the crash occurred.
> Looking at 5.17 I notice that the first thing the driver does
> after mapping the range it to check the signature (both in
> intel_opregion_setup()). As the signature can't possibly match
> with no access granted to the underlying mappings, there shouldn't
> be any further attempts to use the region in the driver; if there
> are, I'd view this as a driver bug.

Yes.  i915_driver_hw_probe does not check the return value of
intel_opregion_setup(dev_priv) and just continues on.

Chuck, the attached patch may help if you want to test it.
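
(The attachment itself is not preserved in this archive; a sketch of the kind of change described, with the error label name assumed, would be:)

    /* In i915_driver_hw_probe(): propagate an OpRegion setup failure
     * instead of ignoring it. */
    ret = intel_opregion_setup(dev_priv);
    if (ret)
        goto err_out; /* hypothetical label */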

Regards,
Jason
Chuck Zmudzinski March 31, 2022, 2:41 a.m. UTC | #10
On 3/30/22 1:15 PM, Anthony PERARD wrote:
> Hi Chuck,
>
> On Sun, Mar 13, 2022 at 11:41:37PM -0400, Chuck Zmudzinski wrote:
>> When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD
>> opregion to the guest but libxl does not grant the guest permission to
> I'm not reading the same thing when looking at code in hvmloader. It
> seems that hvmloader allocate some memory for the IGD opregion rather
> than mapping it.
>
>
> tools/firmware/hvmloader/pci.c:184
>      if ( vendor_id == 0x8086 )
>      {
>          igd_opregion_pgbase = mem_hole_alloc(IGD_OPREGION_PAGES);
>          /*
>           * Write the the OpRegion offset to give the opregion
>           * address to the device model. The device model will trap
>           * and map the OpRegion at the give address.
>           */
>          pci_writel(vga_devfn, PCI_INTEL_OPREGION,
>                     igd_opregion_pgbase << PAGE_SHIFT);
>      }
>
> I think this write would go through QEMU, does it do something with it?

AFAICT Qemu does do something with it:

In the upstream Qemu, in hw/xen/xen_pt_config_init.c
and hw/xen/xen_pt_graphics.c, we see where Qemu
implements functions to read and write the OpRegion
register, which means it reads and writes the value of
the mapped OpRegion address that is passed to it from
hvmloader. It is a 32-bit address that is stored in the
config attribute in sysfs for the Intel IGD on Linux guests.

I have examined the config attribute of the Intel IGD in
the control domain (dom0) and in the Linux guest and
what I see is that in both dom0 and in the guest, the
address of the IGD OpRegion is consistent with custom logs
I have examined from xen/common/domctl.c that show the
same machine and guest address for the OpRegion that
the config attribute has for the machine and the guest.

> (I kind of replying to this question at the end of the mail.)
>
> Is this code in hvmloader actually run in your case?

I admit I have not verified with certainty that Qemu and
the guest are getting the OpRegion address from hvmloader.
I will do a test to verify it, such as removing the code
from hvmloader that writes the address to the pci config
attribute and seeing if that prevents the guest from finding
the address where the OpRegion is mapped in the guest. That
would prove that the hvmloader code here runs in my case.
>
>> Currently, because of another bug in Qemu upstream, this crash can
>> only be reproduced using the traditional Qemu device model, and of
> qemu-traditional is a bit old... What is the bug in QEMU? Do you have a
> link to a patch/mail?

Not yet. I am working on it now. The bug is related to the
passthrough of the IRQ to the guest. So far I have compared
the logs in the Linux guest using Qemu upstream with Qemu
traditional, and I have found that with the upstream Qemu,
the Linux kernel in the guest reports that it cannot obtain
IRQ 24:

Mar 29 18:31:39 debian kernel: xen: --> pirq=16 -> irq=24 (gsi=24)
Mar 29 18:31:39 debian kernel: i915 0000:00:02.0: [drm] VT-d active for 
gfx access
...
Mar 29 18:31:39 debian kernel: xen:events: Failed to obtain physical IRQ 24

When using the traditional device model, this failure is not
reported. The failure of the system to handle the IRQ prevents
the guest from booting normally with the upstream Qemu.

Comparing Qemu upstream code with Qemu traditional code,
I notice that when the traditional Qemu sets up the pci-isa
bridge at slot 1f, it configures an IRQ, but the upstream
Qemu does not, so I suspect that is where the problem is; I
have not found the fix yet. It is well known that the pci-isa
bridge at slot 1f needs to be specially configured for the
Intel IGD to function properly.
>> diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c
>> index 4bbbfe9f16..a4fc473de9 100644
>> --- a/tools/libs/light/libxl_pci.c
>> +++ b/tools/libs/light/libxl_pci.c
>> @@ -2531,6 +2572,37 @@ int libxl__grant_vga_iomem_permission(libxl__gc *gc, const uint32_t domid,
>>                     domid, vga_iomem_start, (vga_iomem_start + 0x20 - 1));
>>               return ret;
>>           }
>> +
>> +        /* If this is an Intel IGD, allow access to the IGD opregion */
>> +        if (!libxl__is_igd_vga_passthru(gc, d_config)) return 0;
>> +
>> +        uint32_t igd_opregion = sysfs_dev_get_igd_opregion(gc, pci);
>> +        uint32_t error = 0xffffffff;
>> +        if (igd_opregion == error) break;
>> +
>> +        vga_iomem_start = ( (uint64_t) igd_opregion ) >> XC_PAGE_SHIFT;
>> +        ret = xc_domain_iomem_permission(CTX->xch, stubdom_domid,
>> +                                         vga_iomem_start,
>> +                                         IGD_OPREGION_PAGES, 1);
>> +        if (ret < 0) {
>> +            LOGED(ERROR, domid,
>> +                  "failed to give stubdom%d access to iomem range "
>> +                  "%"PRIx64"-%"PRIx64" for IGD passthru",
>> +                  stubdom_domid, vga_iomem_start, (vga_iomem_start +
>> +                                                IGD_OPREGION_PAGES - 1));
>> +            return ret;
>> +        }
>> +        ret = xc_domain_iomem_permission(CTX->xch, domid,
>> +                                         vga_iomem_start,
>> +                                         IGD_OPREGION_PAGES, 1);
> Here, you seem to add permission to an address that is read from the
> pci config space of the device, but as I've pointed out above hvmloader
> seems to overwrite this address.

I do not think hvmloader overwrites this address. I think of
it as hvmloader passing the mapped address to the device
model and guest, but as I said above, I will do more tests
to verify with certainty what hvmloader is actually doing.
> Does this call to
> xc_domain_iomem_permission() actually do anything useful?

I am certain this call to xc_domain_iomem_permission()
is necessary with the Qemu traditional device model to
obtain permission for the guest (domid) to access the
OpRegion. Without it, the bug is not fixed and I get the
crash in the i915 kernel module in the Linux guest and a dark
screen instead of a guest with output to the screen.

Moreover, I have verified with custom logs from
xen/common/domctl.c that access to the opregion
is denied to domid, but not to dom0, when this
patch is not applied.

> Or is it by luck that the address returned by
> sysfs_dev_get_igd_opregion() happened to be the address that hvmloader
> is going to write?
>
> Or maybe hvmloader doesn't actually do anything?

I mentioned earlier my plans to verify that
hvmloader does provide the device model and
the guest with the mapped address of the
Intel IGD OpRegion.
>
>
> Some more thought on that: looking at QEMU, it seems that there's already
> a call to xc_domain_iomem_permission(), in igd_write_opregion(). So
> adding one in libxl would seem redundant,

You may be right that with Qemu upstream it is not
needed. I will be able to check it once I have a patch
for the IRQ problem in upstream Qemu. When I wrote
this patch a couple of weeks ago, though, I did not yet
know that Qemu upstream also calls
xc_domain_iomem_permission() and since Qemu
upstream seems to obtain the necessary permission,
the call here to xc_domain_iomem_permission()
can be done only when the device model is Qemu
traditional.
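
A hedged sketch of such a gate in libxl, using helper and enum names that exist elsewhere in libxl (whether this is the right condition is exactly what remains to be tested):

    /* Only grant the opregion from libxl when qemu-traditional is in
     * use; upstream QEMU's igd_write_opregion() already grants it. */
    if (libxl__device_model_version_running(gc, domid) ==
        LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN_TRADITIONAL) {
        /* ... xc_domain_iomem_permission() calls as in this patch ... */
    }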
> or maybe it's the one for the
> device model's domain that's needed (named 'stubdom_domid' here)?

Well, I am not using a device model stub domain but
running the device model in dom0, and my tests indicate
it is not necessary to obtain permission for dom0. I do
admit I need to do more tests before submitting the next
version of the patch. I plan to do some tests with a device
model stub domain and see which configurations require
permission for the stub domain. I expect it will only be
needed when the device model is Qemu traditional, because
Qemu upstream obtains the necessary permission itself.

I have no experience running the device model
in a stub domain, and IIRC the Xen wiki explains
that the supported configuration from the Xen Project
is with the traditional device model and mini os in
the stub domain, and others, such as Qubes OS,
have done some work on the upstream Qemu and
Linux running in a stub domain. Please correct me
if this is not correct.

Thank you for taking the time to look at this patch.

Chuck
Chuck Zmudzinski March 31, 2022, 3:54 a.m. UTC | #11
On 3/30/22 1:27 PM, Andrew Cooper wrote:
> On 30/03/2022 18:15, Anthony PERARD wrote:
>>
>> Some more thought on that: looking at QEMU, it seems that there's already
>> a call to xc_domain_iomem_permission(), in igd_write_opregion().
> This has been discussed before, but noone's done anything about it.
> It's a massive layering violation for QEMU to issue
> xc_domain_iomem_permission()/etc hypercalls.
>
> It should be the toolstack, and only the toolstack, which makes
> permissions hypercalls, which in turn will fix a slew of "QEMU doesn't
> work when it doesn't have dom0 superpowers" bugs with stubdomains.

How much say does the Xen project have over the code
in Qemu under hw/xen? I would not be against having libxl
do the permissions hypercalls in this case instead of Qemu
doing it. My test with Qemu traditional and this patch proves
the permission can be granted by libxl instead of the device
model.
> In this case specifically, the opregion is a magic Intel graphics
> specific bodge.  The i915 driver in the guest really does need to access
> part of the real PCH during init, which (in Xen's permission model)
> really does require permitting access to the MMIO range (8k iirc) so it
> can be mapped as a BAR in QEMU's emulated PCH.

That is exactly what my testing confirmed, but in my
tests only Linux guests need access to the magic Intel
opregion. The proprietary Windows Intel graphics
drivers are apparently able to work around lack of
access to the opregion, at least on my system, and
Windows guests with Intel IGD passthrough function
very well without needing this patch. So the problem
could be fixed in the Linux i915 kernel driver, but Intel
has not contributed the magic sauce to the Linux kernel
that would enable Linux guests to work without access
to the Intel opregion.

And you are correct, the opregion is 8k (2 pages) in size,
but it is not always page aligned, as mentioned earlier,
so the IGD_OPREGION_PAGES constant is set to 3 instead
of 2.

Chuck
Chuck Zmudzinski March 31, 2022, 4:34 a.m. UTC | #12
On 3/30/22 2:45 PM, Jason Andryuk wrote:
> On Fri, Mar 18, 2022 at 4:13 AM Jan Beulich <jbeulich@suse.com> wrote:
>> On 14.03.2022 04:41, Chuck Zmudzinski wrote:
>>> When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD
>>> opregion to the guest but libxl does not grant the guest permission to
>>> access the mapped memory region. This results in a crash of the i915.ko
>>> kernel module in a Linux HVM guest when it needs to access the IGD
>>> opregion:
>>>
>>> [call trace snipped]
>> The call trace alone leaves open where exactly the crash occurred.
>> Looking at 5.17 I notice that the first thing the driver does
>> after mapping the range is to check the signature (both in
>> intel_opregion_setup()). As the signature can't possibly match
>> with no access granted to the underlying mappings, there shouldn't
>> be any further attempts to use the region in the driver; if there
>> are, I'd view this as a driver bug.
> Yes.  i915_driver_hw_probe does not check the return value of
> intel_opregion_setup(dev_priv) and just continues on.
>
> Chuck, the attached patch may help if you want to test it.
>
> Regards,
> Jason

Thanks for the patch, I will try it when I get a chance
and report if it prevents the crash and enables video
output to my screen. Has your patch been committed
to Linux? I just checked on the gitlab Linux master
branch and didn't see it there.

Regards,

Chuck
Jason Andryuk March 31, 2022, 12:23 p.m. UTC | #13
On Thu, Mar 31, 2022 at 12:34 AM Chuck Zmudzinski <brchuckz@netscape.net> wrote:
>
> On 3/30/22 2:45 PM, Jason Andryuk wrote:
> > On Fri, Mar 18, 2022 at 4:13 AM Jan Beulich <jbeulich@suse.com> wrote:
> >> On 14.03.2022 04:41, Chuck Zmudzinski wrote:
> >>> When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD
> >>> opregion to the guest but libxl does not grant the guest permission to
> >>> access the mapped memory region. This results in a crash of the i915.ko
> >>> kernel module in a Linux HVM guest when it needs to access the IGD
> >>> opregion:
> >>>
> >>> [call trace snipped]
> >> The call trace alone leaves open where exactly the crash occurred.
> >> Looking at 5.17 I notice that the first thing the driver does
> >> after mapping the range is to check the signature (both in
> >> intel_opregion_setup()). As the signature can't possibly match
> >> with no access granted to the underlying mappings, there shouldn't
> >> be any further attempts to use the region in the driver; if there
> >> are, I'd view this as a driver bug.
> > Yes.  i915_driver_hw_probe does not check the return value of
> > intel_opregion_setup(dev_priv) and just continues on.
> >
> > Chuck, the attached patch may help if you want to test it.
> >
> > Regards,
> > Jason
>
> Thanks for the patch, I will try it when I get a chance
> and report if it prevents the crash and enables video
> output to my screen. Has your patch been committed
> to Linux? I just checked on the gitlab Linux master
> branch and didn't see it there.

This patch should just make the i915 probe error out properly inside
the domU when the opregion cannot be mapped. It would avoid
triggering the domU trace you posted above, but it wouldn't solve any
other issue.

I have not yet submitted upstream.

Regards,
Jason
Jason Andryuk March 31, 2022, 12:29 p.m. UTC | #14
On Wed, Mar 30, 2022 at 11:54 PM Chuck Zmudzinski <brchuckz@netscape.net> wrote:
>
> On 3/30/22 1:27 PM, Andrew Cooper wrote:
> > On 30/03/2022 18:15, Anthony PERARD wrote:
> >>
> >> Some more though on that, looking at QEMU, it seems that there's already
> >> a call to xc_domain_iomem_permission(), in igd_write_opregion().
> > This has been discussed before, but no one's done anything about it.
> > It's a massive layering violation for QEMU to issue
> > xc_domain_iomem_permission()/etc hypercalls.
> >
> > It should be the toolstack, and only the toolstack, which makes
> > permissions hypercalls, which in turn will fix a slew of "QEMU doesn't
> > work when it doesn't have dom0 superpowers" bugs with stubdomains.
>
> How much say does the Xen project have over the code
> in Qemu under hw/xen? I would not be against having libxl
> do the permissions hypercalls in this case instead of Qemu
> doing it. My test with Qemu traditional and this patch proves
> the permission can be granted by libxl instead of the device
> model.

Qubes patches libxl and QEMU, and they move the hypercalls to the
toolstack.  They are using linux stubdoms, and I think it works for
them.

QEMU:
https://github.com/QubesOS/qubes-vmm-xen-stubdom-linux/blob/master/qemu/patches/0015-IGD-fix-undefined-behaviour.patch
https://github.com/QubesOS/qubes-vmm-xen-stubdom-linux/blob/master/qemu/patches/0016-IGD-improve-legacy-vbios-handling.patch
https://github.com/QubesOS/qubes-vmm-xen-stubdom-linux/blob/master/qemu/patches/0017-IGD-move-enabling-opregion-access-to-libxl.patch
libxl:
https://github.com/QubesOS/qubes-vmm-xen/blob/xen-4.14/patch-fix-igd-passthrough-with-linux-stubdomain.patch
maybe this one, too:
https://github.com/QubesOS/qubes-vmm-xen/blob/xen-4.14/patch-libxl-automatically-enable-gfx_passthru-if-IGD-is-as.patch

Regards,
Jason
Chuck Zmudzinski March 31, 2022, 1:57 p.m. UTC | #15
On 3/31/2022 8:23 AM, Jason Andryuk wrote:
> On Thu, Mar 31, 2022 at 12:34 AM Chuck Zmudzinski <brchuckz@netscape.net> wrote:
>> On 3/30/22 2:45 PM, Jason Andryuk wrote:
>>> On Fri, Mar 18, 2022 at 4:13 AM Jan Beulich <jbeulich@suse.com> wrote:
>>>> On 14.03.2022 04:41, Chuck Zmudzinski wrote:
>>>>> When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD
>>>>> opregion to the guest but libxl does not grant the guest permission to
>>>>> access the mapped memory region. This results in a crash of the i915.ko
>>>>> kernel module in a Linux HVM guest when it needs to access the IGD
>>>>> opregion:
>>>>>
>>>>> [call trace snipped]
>>>> The call trace alone leaves open where exactly the crash occurred.
>>>> Looking at 5.17 I notice that the first thing the driver does
>>>> after mapping the range is to check the signature (both in
>>>> intel_opregion_setup()). As the signature can't possibly match
>>>> with no access granted to the underlying mappings, there shouldn't
>>>> be any further attempts to use the region in the driver; if there
>>>> are, I'd view this as a driver bug.
>>> Yes.  i915_driver_hw_probe does not check the return value of
>>> intel_opregion_setup(dev_priv) and just continues on.
>>>
>>> Chuck, the attached patch may help if you want to test it.
>>>
>>> Regards,
>>> Jason
>> Thanks for the patch, I will try it when I get a chance
>> and report if it prevents the crash and enables video
>> output to my screen. Has your patch been committed
>> to Linux? I just checked on the gitlab Linux master
>> branch and didn't see it there.
> This patch should just make the i915 probe error out properly inside
> the domU when the opregion cannot be mapped. It would avoid
> triggering the domU trace you posted above, but it wouldn't solve any
> other issue.
>
> I have not yet submitted upstream.
>
> Regards,
> Jason

I understand the limitations of this patch, that the guest will still
not have access to the opregion. Still, I can test it - I do remember
some configurations when I could get output on the VGA port
but not the HDMI port, and maybe with this patch at least
the VGA port will work. In fact, I am not even sure the VGA port
does not currently work without your patch, but I know the HDMI
port does not work without your patch and an unpatched Xen
tool stack. So the patch might help some and if it does help it
probably is suitable for upstream.

Chuck
Chuck Zmudzinski March 31, 2022, 2:05 p.m. UTC | #16
On 3/31/2022 8:29 AM, Jason Andryuk wrote:
> On Wed, Mar 30, 2022 at 11:54 PM Chuck Zmudzinski <brchuckz@netscape.net> wrote:
>> On 3/30/22 1:27 PM, Andrew Cooper wrote:
>>>
>>> This has been discussed before, but no one's done anything about it.
>>> It's a massive layering violation for QEMU to issue
>>> xc_domain_iomem_permission()/etc hypercalls.
>>>
>>> It should be the toolstack, and only the toolstack, which makes
>>> permissions hypercalls, which in turn will fix a slew of "QEMU doesn't
>>> work when it doesn't have dom0 superpowers" bugs with stubdomains.
>> How much say does the Xen project have over the code
>> in Qemu under hw/xen? I would not be against having libxl
>> do the permissions hypercalls in this case instead of Qemu
>> doing it. My test with Qemu traditional and this patch proves
>> the permission can be granted by libxl instead of the device
>> model.
> Qubes patches libxl and QEMU, and they move the hypercalls to the
> toolstack.  They are using linux stubdoms, and I think it works for
> them.

That still doesn't answer my question - will the Qemu upstream
accept the patches that move the hypercalls to the toolstack? If
not, we have to live with what we have now, which is that the
hypercalls are done in Qemu.

Regards,

Chuck
Jason Andryuk March 31, 2022, 2:19 p.m. UTC | #17
On Thu, Mar 31, 2022 at 10:05 AM Chuck Zmudzinski <brchuckz@netscape.net> wrote:
>
> On 3/31/2022 8:29 AM, Jason Andryuk wrote:
> > On Wed, Mar 30, 2022 at 11:54 PM Chuck Zmudzinski <brchuckz@netscape.net> wrote:
> >> On 3/30/22 1:27 PM, Andrew Cooper wrote:
> >>>
> >>> This has been discussed before, but no one's done anything about it.
> >>> It's a massive layering violation for QEMU to issue
> >>> xc_domain_iomem_permission()/etc hypercalls.
> >>>
> >>> It should be the toolstack, and only the toolstack, which makes
> >>> permissions hypercalls, which in turn will fix a slew of "QEMU doesn't
> >>> work when it doesn't have dom0 superpowers" bugs with stubdomains.
> >> How much say does the Xen project have over the code
> >> in Qemu under hw/xen? I would not be against having libxl
> >> do the permissions hypercalls in this case instead of Qemu
> >> doing it. My test with Qemu traditional and this patch proves
> >> the permission can be granted by libxl instead of the device
> >> model.
> > Qubes patches libxl and QEMU, and they move the hypercalls to the
> > toolstack.  They are using linux stubdoms, and I think it works for
> > them.
>
> That still doesn't answer my question - will the Qemu upstream
> accept the patches that move the hypercalls to the toolstack? If
> not, we have to live with what we have now, which is that the
> hypercalls are done in Qemu.

Xen-associated people maintain hw/xen code in QEMU, so yes it could be accepted.

Maybe it would need to be backwards compatible to have libxl check the
QEMU version to decide who makes the hypercall?  Unless it is broken
today, in which case just make it work.

Regards,
Jason
Chuck Zmudzinski March 31, 2022, 6:32 p.m. UTC | #18
On 3/30/22 1:15 PM, Anthony PERARD wrote:
> Hi Chuck,
>
> On Sun, Mar 13, 2022 at 11:41:37PM -0400, Chuck Zmudzinski wrote:
>> When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD
>> opregion to the guest but libxl does not grant the guest permission to
> I'm not reading the same thing when looking at code in hvmloader. It
> seems that hvmloader allocates some memory for the IGD opregion rather
> than mapping it.
>
>
> tools/firmware/hvmloader/pci.c:184
>      if ( vendor_id == 0x8086 )
>      {
>          igd_opregion_pgbase = mem_hole_alloc(IGD_OPREGION_PAGES);
>          /*
>           * Write the the OpRegion offset to give the opregion
>           * address to the device model. The device model will trap
>           * and map the OpRegion at the give address.
>           */
>          pci_writel(vga_devfn, PCI_INTEL_OPREGION,
>                     igd_opregion_pgbase << PAGE_SHIFT);
>      }
>
> I think this write would go through QEMU, does it do something with it?
> (I kind of replying to this question at the end of the mail.)
>
> Is this code in hvmloader actually run in your case?

Hi Anthony,

To test your theory that hvmloader is not called in my case and
does nothing, I did the following tests:

I wrote a little C program to read the mapped IGD opregion
address in the guest from the config attribute of the
IGD in sysfs:

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
    /* The IGD's PCI config space as exposed by sysfs; reading past
     * the first 64 bytes requires root. */
    int fd = open("/sys/devices/pci0000:00/0000:00:02.0/config",
                  O_RDONLY);
    uint32_t buf;
    off_t offset = 0xfc; /* PCI_INTEL_OPREGION: opregion address register */

    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (pread(fd, &buf, 4, offset) != 4) {
        perror("pread");
        close(fd);
        return 1;
    }
    printf("opregion = %" PRIx32 "\n", buf);
    return close(fd);
}
------------------------ end of C program -----------------
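
(For reference, the program can be built and run with something like
gcc opregion.c -o opregion && sudo ./opregion; the file name is just
an example.)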

Since the config attribute cannot be read in full by an ordinary
user, the program must be run as the superuser to successfully read
the opregion address from sysfs.

I also grabbed the BIOS-provided physical RAM map in the guest, which
shows the 3 pages mapped by hvmloader for the opregion (the
second-to-last entry in the table):

Mar 31 13:20:16 kolbe kernel: BIOS-provided physical RAM map:
Mar 31 13:20:16 kolbe kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009dfff] usable
Mar 31 13:20:16 kolbe kernel: BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] reserved
Mar 31 13:20:16 kolbe kernel: BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
Mar 31 13:20:16 kolbe kernel: BIOS-e820: [mem 0x0000000000100000-0x00000000bfbfffff] usable
Mar 31 13:20:16 kolbe kernel: BIOS-e820: [mem 0x00000000cc346000-0x00000000cc352fff] reserved
Mar 31 13:20:16 kolbe kernel: BIOS-e820: [mem 0x00000000cf800000-0x00000000df9fffff] reserved
Mar 31 13:20:16 kolbe kernel: BIOS-e820: [mem 0x00000000fc000000-0x00000000fc009fff] ACPI NVS
Mar 31 13:20:16 kolbe kernel: BIOS-e820: [mem 0x00000000fc00a000-0x00000000fdffbfff] reserved
Mar 31 13:20:16 kolbe kernel: BIOS-e820: [mem 0x00000000fdffc000-0x00000000fdffefff] ACPI NVS
Mar 31 13:20:16 kolbe kernel: BIOS-e820: [mem 0x00000000fdfff000-0x00000000ffffffff] reserved

Now, with the current code and no patches, the output of the little C
program in the guest is wrong: instead of the expected opregion
address (fdffc018, since on my hardware the opregion starts 0x18 (24)
bytes past the page boundary), it displays an error value (ffffffff):

opregion = ffffffff

I would expect the read to be correct if nothing overwrote the value
hvmloader wrote. In fact, I think the value hvmloader wrote might be
the page-aligned address fdffc000 rather than fdffc018. I am not sure
where this error value comes from, so I need to investigate further;
I think you are correct that the value hvmloader wrote was overwritten
somewhere.

Now when I apply my patch to libxl, I get the same physical RAM map
in the guest:

Mar 31 13:12:35 kolbe kernel: BIOS-provided physical RAM map:
Mar 31 13:12:35 kolbe kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009dfff] usable
Mar 31 13:12:35 kolbe kernel: BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] reserved
Mar 31 13:12:35 kolbe kernel: BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
Mar 31 13:12:35 kolbe kernel: BIOS-e820: [mem 0x0000000000100000-0x00000000bfbfffff] usable
Mar 31 13:12:35 kolbe kernel: BIOS-e820: [mem 0x00000000cc346000-0x00000000cc352fff] reserved
Mar 31 13:12:35 kolbe kernel: BIOS-e820: [mem 0x00000000cf800000-0x00000000df9fffff] reserved
Mar 31 13:12:35 kolbe kernel: BIOS-e820: [mem 0x00000000fc000000-0x00000000fc009fff] ACPI NVS
Mar 31 13:12:35 kolbe kernel: BIOS-e820: [mem 0x00000000fc00a000-0x00000000fdffbfff] reserved
Mar 31 13:12:35 kolbe kernel: BIOS-e820: [mem 0x00000000fdffc000-0x00000000fdffefff] ACPI NVS
Mar 31 13:12:35 kolbe kernel: BIOS-e820: [mem 0x00000000fdfff000-0x00000000ffffffff] reserved

And now the little C program reads the correct opregion address from
sysfs in the guest, not merely the page-aligned address that
hvmloader appears to have written:

opregion = fdffc018

Next I removed the code snippet from hvmloader that allocates three
pages in the guest for the opregion and writes the opregion address to
the pci config attribute, and now there is no memory hole allocated for
the opregion in the guest:

Mar 31 12:47:34 kolbe kernel: BIOS-provided physical RAM map:
Mar 31 12:47:34 kolbe kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009dfff] usable
Mar 31 12:47:34 kolbe kernel: BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] reserved
Mar 31 12:47:34 kolbe kernel: BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
Mar 31 12:47:34 kolbe kernel: BIOS-e820: [mem 0x0000000000100000-0x00000000bfbfffff] usable
Mar 31 12:47:34 kolbe kernel: BIOS-e820: [mem 0x00000000cc346000-0x00000000cc352fff] reserved
Mar 31 12:47:34 kolbe kernel: BIOS-e820: [mem 0x00000000cf800000-0x00000000df9fffff] reserved
Mar 31 12:47:34 kolbe kernel: BIOS-e820: [mem 0x00000000fc000000-0x00000000fc009fff] ACPI NVS
Mar 31 12:47:34 kolbe kernel: BIOS-e820: [mem 0x00000000fc00a000-0x00000000ffffffff] reserved

I ran this case without my patch to libxl, for safety reasons,
because the opregion address in sysfs in the guest is wrong; the
little C program again returns the same error value:

opregion = ffffffff

So the conclusion is that hvmloader does allocate the three pages in
the guest for the opregion and writes a value for the opregion
address, but that value appears to be overwritten later: with the
error value when the guest cannot access the opregion, and with the
correct value when it can.

So I agree that I should understand what is going on here better. I
tentatively think the call to pci_writel in hvmloader could be safely
removed, because that value seems to be changed later on somewhere.
But we do need to keep the call that allocates the memory hole for the
opregion in hvmloader, or perhaps move that call to the toolstack.

So I will need to have a better answer to your questions before I
propose the next version of the patch / patch series.

Regards,

Chuck
Chuck Zmudzinski March 31, 2022, 7:44 p.m. UTC | #19
On 3/31/22 10:19 AM, Jason Andryuk wrote:
> On Thu, Mar 31, 2022 at 10:05 AM Chuck Zmudzinski <brchuckz@netscape.net> wrote:
>>
>> That still doesn't answer my question - will the Qemu upstream
>> accept the patches that move the hypercalls to the toolstack? If
>> not, we have to live with what we have now, which is that the
>> hypercalls are done in Qemu.
> Xen-associated people maintain hw/xen code in QEMU, so yes it could be accepted.
>
> Maybe it would need to be backwards compatible to have libxl check the
> QEMU version to decide who makes the hypercall?  Unless it is broken
> today, in which case just make it work.
>
> Regards,
> Jason

I know of another reason to check the Qemu upstream version,
and that is dealing with deprecated / removed device model
options that xl.cfg still uses. I looked at that a few years ago
with regard to the deprecated 'usbdevice tablet' Qemu option,
but I did not see anything in libxl to distinguish Qemu versions
except for upstream vs. traditional. AFAICT, detecting traditional
vs. upstream Qemu depends solely on the device_model_version
xl.cfg setting. So it might be useful for libxl to add the capability
to detect the upstream Qemu version or at least create an xl.cfg
setting to inform libxl about the Qemu version.
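
For reference, a sketch of the only device-model distinction that
xl.cfg currently gives libxl (both values are real xl.cfg values;
the comments are mine):

# xl.cfg: libxl distinguishes these two device models, but not
# individual upstream Qemu versions.
device_model_version = "qemu-xen"               # upstream Qemu
#device_model_version = "qemu-xen-traditional"  # traditional Qemu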

Regards,

Chuck
Chuck Zmudzinski March 31, 2022, 11:23 p.m. UTC | #20
On 3/30/22 1:15 PM, Anthony PERARD wrote:
> Hi Chuck,
>
> On Sun, Mar 13, 2022 at 11:41:37PM -0400, Chuck Zmudzinski wrote:
>> When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD
>> opregion to the guest but libxl does not grant the guest permission to
> I'm not reading the same thing when looking at code in hvmloader. It
> seems that hvmloader allocates some memory for the IGD opregion rather
> than mapping it.
>
>
> tools/firmware/hvmloader/pci.c:184
>      if ( vendor_id == 0x8086 )
>      {
>          igd_opregion_pgbase = mem_hole_alloc(IGD_OPREGION_PAGES);
>          /*
>           * Write the the OpRegion offset to give the opregion
>           * address to the device model. The device model will trap
>           * and map the OpRegion at the give address.
>           */
>          pci_writel(vga_devfn, PCI_INTEL_OPREGION,
>                     igd_opregion_pgbase << PAGE_SHIFT);
>      }
>
> I think this write would go through QEMU, does it do something with it?
> (I kind of replying to this question at the end of the mail.)
>
> Is this code in hvmloader actually run in your case?

Hi Anthony,

Let me try to answer your question again. My tests indicate
that this code in hvmloader is actually run in my case. As I
mentioned in an earlier message, the three pages for the
opregion are not allocated for the guest if I remove this
code from hvmloader. The only concern I had was the
difference between the opregion address I read from sysfs
(fdffc018 in my case) and the address that hvmloader wrote
(fdffc000 in my case). The difference is easily explained by
what the Qemu device model (both upstream and traditional)
does when it writes the opregion address into the Intel IGD
config attribute:

This is the traditional Qemu code in hw/pt_graphics.c:66

void igd_write_opregion(struct pt_dev *real_dev, uint32_t val)
{
     uint32_t host_opregion = 0;
     int ret;

     if ( igd_guest_opregion )
     {
         PT_LOG("opregion register already been set, ignoring %x\n", val);
         return;
     }

     host_opregion = pt_pci_host_read(real_dev->pci_dev,
                                      PCI_INTEL_OPREGION, 4);
     igd_guest_opregion = (val & ~0xfff) | (host_opregion & 0xfff);

------------------------ End of code snippet -----------------------------------

The offset of the opregion in the guest (0x18 in my case) is
recovered by that last statement. The upstream model does the
same thing using the constant XEN_PCI_INTEL_OPREGION_MASK
set to 0xfff to recover the offset.
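
For illustration, a minimal C sketch of that recovery, using the
page base from the guest e820 above and a hypothetical host
opregion address:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

#define XEN_PCI_INTEL_OPREGION_MASK 0xfff

int main(void)
{
    uint32_t val = 0xfdffc000;            /* page base written by hvmloader */
    uint32_t host_opregion = 0xcc346018;  /* hypothetical host address with
                                             a 0x18-byte sub-page offset */
    /* Keep the page base from the guest write and recover the
     * sub-page offset from the host value. */
    uint32_t igd_guest_opregion =
        (val & ~XEN_PCI_INTEL_OPREGION_MASK) |
        (host_opregion & XEN_PCI_INTEL_OPREGION_MASK);

    /* Prints "fdffc018", matching what the little C program read
     * from sysfs in the guest. */
    printf("%" PRIx32 "\n", igd_guest_opregion);
    return 0;
}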

So what we have in hvmloader is correct and necessary.

>
>> diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c
>> index 4bbbfe9f16..a4fc473de9 100644
>> --- a/tools/libs/light/libxl_pci.c
>> +++ b/tools/libs/light/libxl_pci.c
>> @@ -2531,6 +2572,37 @@ int libxl__grant_vga_iomem_permission(libxl__gc *gc, const uint32_t domid,
>>                     domid, vga_iomem_start, (vga_iomem_start + 0x20 - 1));
>>               return ret;
>>           }
>> +
>> +        /* If this is an Intel IGD, allow access to the IGD opregion */
>> +        if (!libxl__is_igd_vga_passthru(gc, d_config)) return 0;
>> +
>> +        uint32_t igd_opregion = sysfs_dev_get_igd_opregion(gc, pci);
>> +        uint32_t error = 0xffffffff;
>> +        if (igd_opregion == error) break;
>> +
>> +        vga_iomem_start = ( (uint64_t) igd_opregion ) >> XC_PAGE_SHIFT;
>> +        ret = xc_domain_iomem_permission(CTX->xch, stubdom_domid,
>> +                                         vga_iomem_start,
>> +                                         IGD_OPREGION_PAGES, 1);
>> +        if (ret < 0) {
>> +            LOGED(ERROR, domid,
>> +                  "failed to give stubdom%d access to iomem range "
>> +                  "%"PRIx64"-%"PRIx64" for IGD passthru",
>> +                  stubdom_domid, vga_iomem_start, (vga_iomem_start +
>> +                                                IGD_OPREGION_PAGES - 1));
>> +            return ret;
>> +        }
>> +        ret = xc_domain_iomem_permission(CTX->xch, domid,
>> +                                         vga_iomem_start,
>> +                                         IGD_OPREGION_PAGES, 1);
> Here, you seem to add permission to an address that is read from the
> pci config space of the device, but as I've pointed out above,
> hvmloader seems to overwrite this address.

No, hvmloader wrote the mapped address and here we are
reading the opregion address of the host, not the mapped
address of the guest. There is no problem here.
> Is this call to
> xc_domain_iomem_permission() actually doing anything useful?
> Or is it by luck that the address returned by
> sysfs_dev_get_igd_opregion() happened to be the address that hvmloader
> is going to write?

No, it is not luck. We use the same constant in hvmloader,
Qemu, and here in this patch to properly map the opregion
to the guest: PCI_INTEL_OPREGION, set to 0xfc, the offset
within the config attribute where the opregion address is
stored.
>
> Or maybe hvmloader doesn't actually do anything?

It does do something, and what it does is necessary.

Regards,

Chuck
Chuck Zmudzinski April 1, 2022, 12:27 a.m. UTC | #21
On 3/31/22 2:32 PM, Chuck Zmudzinski wrote:
> On 3/30/22 1:15 PM, Anthony PERARD wrote:
>> Hi Chuck,
>>
>> On Sun, Mar 13, 2022 at 11:41:37PM -0400, Chuck Zmudzinski wrote:
>>> When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD
>>> opregion to the guest but libxl does not grant the guest permission to
>> I'm not reading the same thing when looking at code in hvmloader. It
>> seems that hvmloader allocates some memory for the IGD opregion rather
>> than mapping it.
>>
>>
>> tools/firmware/hvmloader/pci.c:184
>>      if ( vendor_id == 0x8086 )
>>      {
>>          igd_opregion_pgbase = mem_hole_alloc(IGD_OPREGION_PAGES);
>>          /*
>>           * Write the the OpRegion offset to give the opregion
>>           * address to the device model. The device model will trap
>>           * and map the OpRegion at the give address.
>>           */
>>          pci_writel(vga_devfn, PCI_INTEL_OPREGION,
>>                     igd_opregion_pgbase << PAGE_SHIFT);
>>      }
>>
>> I think this write would go through QEMU, does it do something with it?
>> (I kind of replying to this question at the end of the mail.)
>>
>> Is this code in hvmloader actually run in your case?
>
> Hi Anthony,
> ...
> So the conclusion is that hvmloader does allocate the three pages in
> the guest for the opregion and writes a value for the opregion
> address, but that value appears to be overwritten later: with the
> error value when the guest cannot access the opregion, and with the
> correct value when it can.
>
> So I agree that I should understand what is going on here better. I
> tentatively think the call to pci_writel in hvmloader could be safely
> removed, because that value seems to be changed later on somewhere.

After discovering how the device model recovers the offset
of the opregion from the page boundary, I am now certain
that what we currently have in hvmloader is necessary. We
do need the pci_writel call in hvmloader, because that is
where we write the page-aligned address at which the
opregion is mapped into the guest. The device model then
reads that value, recovers the offset of the opregion from
the page boundary, and writes the address of the actual
beginning of the opregion, not the page-aligned address
that hvmloader wrote, into the config attribute of the
Intel IGD that is passed through to the guest. That is why
it seemed to me that the address was changed somewhere: the
device model modifies it so it is the actual address of the
opregion rather than the address of the page boundary that
immediately precedes it. I hope this is an acceptable
explanation of what we currently have in hvmloader.

Regards,

Chuck
Chuck Zmudzinski April 1, 2022, 1:21 p.m. UTC | #22
On 3/30/22 2:45 PM, Jason Andryuk wrote:
> On Fri, Mar 18, 2022 at 4:13 AM Jan Beulich <jbeulich@suse.com> wrote:
>> On 14.03.2022 04:41, Chuck Zmudzinski wrote:
>>> When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD
>>> opregion to the guest but libxl does not grant the guest permission to
>>> access the mapped memory region. This results in a crash of the i915.ko
>>> kernel module in a Linux HVM guest when it needs to access the IGD
>>> opregion:
>>>
>>> [call trace snipped]
>> The call trace alone leaves open where exactly the crash occurred.
>> Looking at 5.17 I notice that the first thing the driver does
>> after mapping the range is to check the signature (both in
>> intel_opregion_setup()). As the signature can't possibly match
>> with no access granted to the underlying mappings, there shouldn't
>> be any further attempts to use the region in the driver; if there
>> are, I'd view this as a driver bug.
> Yes.  i915_driver_hw_probe does not check the return value of
> intel_opregion_setup(dev_priv) and just continues on.
>
> Chuck, the attached patch may help if you want to test it.
>
> Regards,
> Jason

I tested the patch - it made no noticeable difference. I still
get the same crash and call trace with the patch. Actually,
the call trace I posted here is only the first of three call
traces, and I still see all three with the patch.
The patch is harmless, and the i915 module with the patch
works normally when it can access the Intel opregion.

Regards,

Chuck
Chuck Zmudzinski April 1, 2022, 1:41 p.m. UTC | #23
On 4/1/22 9:21 AM, Chuck Zmudzinski wrote:
> On 3/30/22 2:45 PM, Jason Andryuk wrote:
>> On Fri, Mar 18, 2022 at 4:13 AM Jan Beulich <jbeulich@suse.com> wrote:
>>> On 14.03.2022 04:41, Chuck Zmudzinski wrote:
>>>> When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD
>>>> opregion to the guest but libxl does not grant the guest permission to
>>>> access the mapped memory region. This results in a crash of the 
>>>> i915.ko
>>>> kernel module in a Linux HVM guest when it needs to access the IGD
>>>> opregion:
>>>>
>>>> [call trace snipped]
>>> The call trace alone leaves open where exactly the crash occurred.
>>> Looking at 5.17 I notice that the first thing the driver does
>>> after mapping the range is to check the signature (both in
>>> intel_opregion_setup()). As the signature can't possibly match
>>> with no access granted to the underlying mappings, there shouldn't
>>> be any further attempts to use the region in the driver; if there
>>> are, I'd view this as a driver bug.
>> Yes.  i915_driver_hw_probe does not check the return value of
>> intel_opregion_setup(dev_priv) and just continues on.
>>
>> Chuck, the attached patch may help if you want to test it.
>>
>> Regards,
>> Jason
>
> I tested the patch - it made no noticeable difference. I still
> get the same crash and call trace with the patch. Actually,
> the call trace I posted here is only the first of three call
> traces, and I still see all three with the patch.

It is probably necessary to patch intel_opregion_setup to
return an error sooner if the goal is to suppress
the call traces that occur when the driver cannot access
the opregion.

Regards,

Chuck
Jason Andryuk April 1, 2022, 1:49 p.m. UTC | #24
On Fri, Apr 1, 2022 at 9:41 AM Chuck Zmudzinski <brchuckz@netscape.net> wrote:
>
> On 4/1/22 9:21 AM, Chuck Zmudzinski wrote:
> > On 3/30/22 2:45 PM, Jason Andryuk wrote:
> >> On Fri, Mar 18, 2022 at 4:13 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>> On 14.03.2022 04:41, Chuck Zmudzinski wrote:
> >>>> When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD
> >>>> opregion to the guest but libxl does not grant the guest permission to
> >>>> access the mapped memory region. This results in a crash of the
> >>>> i915.ko
> >>>> kernel module in a Linux HVM guest when it needs to access the IGD
> >>>> opregion:
> >>>>
> >>>> [call trace snipped]
> >>> The call trace alone leaves open where exactly the crash occurred.
> >>> Looking at 5.17 I notice that the first thing the driver does
> >>> after mapping the range is to check the signature (both in
> >>> intel_opregion_setup()). As the signature can't possibly match
> >>> with no access granted to the underlying mappings, there shouldn't
> >>> be any further attempts to use the region in the driver; if there
> >>> are, I'd view this as a driver bug.
> >> Yes.  i915_driver_hw_probe does not check the return value of
> >> intel_opregion_setup(dev_priv) and just continues on.
> >>
> >> Chuck, the attached patch may help if you want to test it.
> >>
> >> Regards,
> >> Jason
> >
> > I tested the patch - it made no noticeable difference. I still
> > get the same crash and call trace with the patch. Actually,
> > the call trace I posted here is only the first of three call
> > traces, and I still see all three with the patch.

Thanks for testing.  Sorry it didn't help.

> It is probably necessary to patch intel_opregion_setup to
> return an error sooner if the goal is to suppress
> the call traces that occur when the driver cannot access
> the opregion.

It looks correct for 5.17 running in your domU.  I thought the
opregion signature check would fail.  A failure in
intel_opregion_setup would percolate up through i915_driver_hw_probe
to i915_driver_probe.  In i915_driver_probe the error should goto
out_cleanup_mmio and skip intel_modeset_init_nogem which is in your
backtrace.
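
As a rough sketch of that control flow (the i915 function names come
from this thread; the bodies below are simplified illustrative
assumptions, not the real driver code):

#include <stdio.h>
#include <errno.h>

/* Stub standing in for the real check; pretend the opregion
 * signature check failed because the guest cannot access the pages. */
static int intel_opregion_setup(void)
{
    return -EINVAL;
}

static int i915_driver_hw_probe(void)
{
    int ret = intel_opregion_setup();
    if (ret)
        return ret;   /* the proposed fix: propagate the failure */
    /* ... the rest of the hw probe ... */
    return 0;
}

static int i915_driver_probe(void)
{
    int ret = i915_driver_hw_probe();
    if (ret)
        goto out_cleanup_mmio;   /* skips intel_modeset_init_nogem() */
    /* ... intel_modeset_init_nogem() and the rest of probe ... */
    return 0;

out_cleanup_mmio:
    /* ... unmap MMIO and bail out with the error ... */
    return ret;
}

int main(void)
{
    /* Prints "probe returned -22" (-EINVAL). */
    printf("probe returned %d\n", i915_driver_probe());
    return 0;
}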

Regards,
Jason
Chuck Zmudzinski April 6, 2022, 1:31 a.m. UTC | #25
On 4/1/22 9:21 AM, Chuck Zmudzinski wrote:
> On 3/30/22 2:45 PM, Jason Andryuk wrote:
>> On Fri, Mar 18, 2022 at 4:13 AM Jan Beulich <jbeulich@suse.com> wrote:
>>> On 14.03.2022 04:41, Chuck Zmudzinski wrote:
>>>> When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD
>>>> opregion to the guest but libxl does not grant the guest permission to
>>>> access the mapped memory region. This results in a crash of the 
>>>> i915.ko
>>>> kernel module in a Linux HVM guest when it needs to access the IGD
>>>> opregion:
>>>>
>>>> [call trace snipped]
>>> The call trace alone leaves open where exactly the crash occurred.
>>> Looking at 5.17 I notice that the first thing the driver does
>>> after mapping the range is to check the signature (both in
>>> intel_opregion_setup()). As the signature can't possibly match
>>> with no access granted to the underlying mappings, there shouldn't
>>> be any further attempts to use the region in the driver; if there
>>> are, I'd view this as a driver bug.
>> Yes.  i915_driver_hw_probe does not check the return value of
>> intel_opregion_setup(dev_priv) and just continues on.
>>
>> Chuck, the attached patch may help if you want to test it.
>>
>> Regards,
>> Jason
>
> I tested the patch - it made no noticeable difference.

Correction (sorry for the confusion):

I didn't know I needed to replace more than just a
re-built i915.ko module to enable the patch
for testing. When I updated the entire Debian kernel
package including all the modules and the kernel
image with the patched kernel package, it made
quite a difference.

With Jason's patch, the three call traces just became a
much shorter error message:

Apr 05 20:46:18 debian kernel: xen: --> pirq=16 -> irq=24 (gsi=24)
Apr 05 20:46:18 debian kernel: i915 0000:00:02.0: [drm] VT-d active for gfx access
Apr 05 20:46:18 debian kernel: i915 0000:00:02.0: vgaarb: deactivate vga console
Apr 05 20:46:18 debian kernel: Console: switching to colour dummy device 80x25
Apr 05 20:46:18 debian kernel: i915 0000:00:02.0: [drm] DMAR active, disabling use of stolen memory
Apr 05 20:46:18 debian kernel: resource sanity check: requesting [mem 0xffffffff-0x100001ffe], which spans more than Reserved [mem 0xfdfff000-0xffffffff]
Apr 05 20:46:18 debian kernel: caller memremap+0xeb/0x1c0 mapping multiple BARs
Apr 05 20:46:18 debian kernel: i915 0000:00:02.0: Device initialization failed (-22)
Apr 05 20:46:18 debian kernel: i915 0000:00:02.0: Please file a bug on drm/i915; see https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs for details.
Apr 05 20:46:18 debian kernel: i915: probe of 0000:00:02.0 failed with error -22
--------------------- End of Kernel Error Log ----------------------

So I think the patch does propagate the error up the stack and
bail out before producing the call traces,

and...

I even had output after booting: the gdm3 Gnome display
manager login page displayed. But when I tried to log in to
the Gnome desktop, the screen went dark, and after that I
could not even log in to the headless Xen Dom0 control
domain via ssh; I just used the reset button on the machine
to reboot it. So the patch causes some trouble with the
Dom0 when the guest cannot access the opregion. The patch
works fine when the guest can access the opregion, and in
that case I was able to log in to the Gnome session. But
when the guest cannot access the opregion, trying to log in
to the Gnome session in the guest apparently crashed the
Dom0, or at least caused networking in the Dom0 to stop
working. So I would not recommend Jason's patch as is for
the Linux kernel: it looks like it is working at first,
with a login screen displayed, but when a user tries to
log in, the whole system crashes.

Regards,

Chuck
Jason Andryuk April 6, 2022, 1:10 p.m. UTC | #26
On Tue, Apr 5, 2022 at 9:31 PM Chuck Zmudzinski <brchuckz@netscape.net> wrote:
> Correction (sorry for the confusion):
>
> I didn't know I needed to replace more than just a
> re-built i915.ko module to enable the patch
> for testing. When I updated the entire Debian kernel
> package including all the modules and the kernel
> image with the patched kernel package, it made
> quite a difference.
>
> With Jason's patch, the three call traces just became a
> much shorter error message:
>
> [kernel error log snipped]
>
> So I think the patch does propagate the error up the stack and
> bail out before producing the call traces,

Thanks for re-testing.

> and...
>
> I even had output after booting: the gdm3 Gnome display
> manager login page displayed. But when I tried to log in to
> the Gnome desktop, the screen went dark, and after that I
> could not even log in to the headless Xen Dom0 control
> domain via ssh; I just used the reset button on the machine
> to reboot it. So the patch causes some trouble with the
> Dom0 when the guest cannot access the opregion. The patch
> works fine when the guest can access the opregion, and in
> that case I was able to log in to the Gnome session. But
> when the guest cannot access the opregion, trying to log in
> to the Gnome session in the guest apparently crashed the
> Dom0, or at least caused networking in the Dom0 to stop
> working. So I would not recommend Jason's patch as is for
> the Linux kernel: it looks like it is working at first,
> with a login screen displayed, but when a user tries to
> log in, the whole system crashes.

I'm a little surprised you still had output from the VM & display with
the i915 driver not binding.  I guess Linux fell back to another VGA
or Framebuffer driver for the display.

However, locking up the host isn't good.  You didn't happen to catch
any Xen or dom0 output when that happened?

Regards,
Jason
Chuck Zmudzinski April 6, 2022, 3:24 p.m. UTC | #27
On 4/6/22 9:10 AM, Jason Andryuk wrote:
> On Tue, Apr 5, 2022 at 9:31 PM Chuck Zmudzinski <brchuckz@netscape.net> wrote:
>> Correction (sorry for the confusion):
>>
>> I didn't know I needed to replace more than just a
>> re-built i915.ko module to enable the patch
>> for testing. When I updated the entire Debian kernel
>> package including all the modules and the kernel
>> image with the patched kernel package, it made
>> quite a difference.
>>
>> With Jason's patch, the three call traces just became a
>> much shorter error message:
>>
>> [kernel error log snipped]
>>
>> So I think the patch does propagate the error up the stack and
>> bail out before producing the call traces,
> Thanks for re-testing.
>
>> and...
>>
>> I even had output after booting: the gdm3 Gnome display
>> manager login page displayed. But when I tried to log in to
>> the Gnome desktop, the screen went dark, and after that I
>> could not even log in to the headless Xen Dom0 control
>> domain via ssh; I just used the reset button on the machine
>> to reboot it. So the patch causes some trouble with the
>> Dom0 when the guest cannot access the opregion. The patch
>> works fine when the guest can access the opregion, and in
>> that case I was able to log in to the Gnome session. But
>> when the guest cannot access the opregion, trying to log in
>> to the Gnome session in the guest apparently crashed the
>> Dom0, or at least caused networking in the Dom0 to stop
>> working. So I would not recommend Jason's patch as is for
>> the Linux kernel: it looks like it is working at first,
>> with a login screen displayed, but when a user tries to
>> log in, the whole system crashes.
> I'm a little surprised you still had output from the VM & display with
> the i915 driver not binding.  I guess Linux fell back to another VGA
> or Framebuffer driver for the display.
>
> However, locking up the host isn't good.  You didn't happen to catch
> any Xen or dom0 output when that happened?
>
> Regards,
> Jason

I just looked at Dom0's systemd journal and it did not
capture anything. The bad things happened during the
six-minute gap between Apr 05 20:46 and Apr 05 20:52,
which is when I rebooted Dom0 after the crash:

Apr 05 20:46:01 Dom0 kernel: pciback 0000:00:1b.0: xen_pciback: vpci: 
assign to virtual slot 0
Apr 05 20:46:01 Dom0 kernel: pciback 0000:00:1b.0: registering for 18
Apr 05 20:46:01 Dom0 kernel: pciback 0000:00:14.0: xen_pciback: vpci: 
assign to virtual slot 1
Apr 05 20:46:01 Dom0 kernel: pciback 0000:00:14.0: registering for 18
Apr 05 20:46:01 Dom0 kernel: pciback 0000:00:02.0: xen_pciback: vpci: 
assign to virtual slot 2
Apr 05 20:46:01 Dom0 kernel: pciback 0000:00:02.0: registering for 18
Apr 05 20:46:01 Dom0 sudo[9639]: pam_unix(sudo:session): session closed 
for user root
Apr 05 20:46:13 Dom0 sshd[9541]: Received disconnect from <redacted> 
port 60294:11: disconnected by user
Apr 05 20:46:13 Dom0 sshd[9541]: Disconnected from user <redacted> 
<redacted> port 60294
Apr 05 20:46:13 Dom0 sshd[9521]: pam_unix(sshd:session): session closed 
for user <redacted>
Apr 05 20:46:13 Dom0 systemd-logind[497]: Session 27 logged out. Waiting 
for processes to exit.
Apr 05 20:46:17 Dom0 kernel: xen-blkback: backend/vbd/18/51712: using 4 
queues, protocol 1 (x86_64-abi) persistent grants
Apr 05 20:46:17 Dom0 kernel: xen-blkback: backend/vbd/18/51728: using 4 
queues, protocol 1 (x86_64-abi) persistent grants
Apr 05 20:46:17 Dom0 kernel: vif vif-18-0 vif18.0: Guest Rx ready
Apr 05 20:46:17 Dom0 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vif18.0: 
link becomes ready
Apr 05 20:46:19 Dom0 dhcpd[9852]: DHCPDISCOVER from <redacted> via vif18.0
Apr 05 20:46:19 Dom0 dhcpd[9852]: DHCPOFFER on <redacted> to <redacted> 
via vif18.0
Apr 05 20:46:19 Dom0 dhcpd[9852]: DHCPREQUEST for <redacted> 
(<redacted>) from <redacted> via vif18.0
Apr 05 20:46:19 Dom0 dhcpd[9852]: DHCPACK on <redacted> to <redacted> 
via vif18.0
Apr 05 20:52:34 Dom0 kernel: Linux version 5.16.0-6-amd64 
(debian-kernel@lists.debian.org) (gcc-11 (Debian 11.2.0-19) 11.2.0, GNU 
ld (GNU Binutils for Debian) 2.38) #1 SMP PREEMPT Debian 5.16.18-1 
(2022-03-29)
Apr 05 20:52:34 Dom0 kernel: Command line: placeholder 
root=/dev/mapper/systems-unstable ro reboot=bios quiet console=hvc0

I would probably need to connect Dom0 to a serial
console to capture anything from Dom0 or Xen. I have
done that in the past with a serial cable connected to
a Windows 8 laptop through a USB-to-serial adapter, but
the last time I tried it the adapter did not work, I
think because of the laptop's upgrade to Windows 10.
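
For reference, the setup I would use is something like this (from
memory, so treat it as a sketch rather than a tested recipe; the
Xen options go on the hypervisor line in grub, and the console
option on the Dom0 kernel line):

  # Xen command line:
  com1=115200,8n1 console=com1,vga loglvl=all guest_loglvl=all
  # Dom0 kernel command line:
  console=hvc0

The boot log above shows Dom0 here already uses console=hvc0.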

Regards,

Chuck
Chuck Zmudzinski April 6, 2022, 5:39 p.m. UTC | #28
On 4/6/22 9:10 AM, Jason Andryuk wrote:
> On Tue, Apr 5, 2022 at 9:31 PM Chuck Zmudzinski <brchuckz@netscape.net> wrote:
>> Correction (sorry for the confusion):
>>
>> I didn't know I needed to replace more than just a
>> re-built i915.ko module to enable the patch
>> for testing. When I updated the entire Debian kernel
>> package including all the modules and the kernel
>> image with the patched kernel package, it made
>> quite a difference.
>>
>> With Jason's patch, the three call traces just became a
>> much shorter error message:
>>
>> Apr 05 20:46:18 debian kernel: xen: --> pirq=16 -> irq=24 (gsi=24)
>> Apr 05 20:46:18 debian kernel: i915 0000:00:02.0: [drm] VT-d active for
>> gfx access
>> Apr 05 20:46:18 debian kernel: i915 0000:00:02.0: vgaarb: deactivate vga
>> console
>> Apr 05 20:46:18 debian kernel: Console: switching to colour dummy device
>> 80x25
>> Apr 05 20:46:18 debian kernel: i915 0000:00:02.0: [drm] DMAR active,
>> disabling use of stolen memory
>> Apr 05 20:46:18 debian kernel: resource sanity check: requesting [mem
>> 0xffffffff-0x100001ffe], which spans more than Reserved [mem
>> 0xfdfff000-0xffffffff]
>> Apr 05 20:46:18 debian kernel: caller memremap+0xeb/0x1c0 mapping
>> multiple BARs
>> Apr 05 20:46:18 debian kernel: i915 0000:00:02.0: Device initialization
>> failed (-22)
>> Apr 05 20:46:18 debian kernel: i915 0000:00:02.0: Please file a bug on
>> drm/i915; see
>> https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs
>> for details.
>> Apr 05 20:46:18 debian kernel: i915: probe of 0000:00:02.0 failed with
>> error -22
>> --------------------- End of Kernel Error Log ----------------------
>>
>> So I think the patch does propagate the error up the
>> stack and bails out before producing the Call traces,
> Thanks for re-testing.
>
> I'm a little surprised you still had output from the VM & display with
> the i915 driver not binding.  I guess Linux fell back to another VGA
> or Framebuffer driver for the display.

Looking at the journal entries in the guest, it is clear the
Xorg driver fell back from the KMS modesetting driver to the
vesa driver, as the following entries show.

When the guest can access the opregion:

Apr 05 20:42:45 debian /usr/libexec/gdm-x-session[1226]: (II) modeset(0):
  Serial No: LX1AA0044210
Apr 05 20:42:45 debian /usr/libexec/gdm-x-session[1226]: (II) modeset(0)
: Monitor name: Acer H236HL

When the guest cannot access the opregion:

Apr 05 20:46:22 debian /usr/libexec/gdm-x-session[1164]: (II) VESA(0):
  Serial No: LX1AA0044210
Apr 05 20:46:22 debian /usr/libexec/gdm-x-session[1164]: (II) VESA(0):
  Monitor name: Acer H236HL

But as I said, when I tried to log in to a Gnome session
the system hung, and no journal entries were captured in
either the Dom0 or the guest, so it is hard to tell what
happened. My guess is that the full Gnome session, as
opposed to the gdm3 display manager, did not fall back to
the Xorg vesa driver; when it tried to use the Xorg
modesetting driver, which uses KMS, it probably touched the
i915 module that had not initialized correctly due to the
inability to access the opregion, and that hung the system.

I also noted earlier in this thread that when the guest
cannot access the opregion, the guest overwrites the
register that holds the mapped opregion address for the
guest, which the Qemu device model provides, with the
invalid value 0xffffffff.

When the Gnome session manager started the session, it
apparently caused the i915 module to try to access the
opregion at the invalid address 0xffffffff and thus caused
the system to hang, as shown in the journal entry I posted
yesterday:

Apr 05 20:46:18 debian kernel: resource sanity check: requesting
[mem 0xffffffff-0x100001ffe], which spans more than Reserved
[mem 0xfdfff000-0xffffffff]

This is a request by the guest for 2 pages, which is the
size of the opregion, but it uses the invalid address
0xffffffff as the opregion address. So although this
resource sanity check failed, the system still hung later
on when the user tried to log in to the Gnome session.
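
As a sanity check on those numbers (my own arithmetic, assuming
the usual 4 KiB page size):

  start = 0xffffffff
  size  = 2 * 0x1000       = 0x2000
  end   = start + size - 1 = 0x100001ffe

which is exactly the [mem 0xffffffff-0x100001ffe] range in the
resource sanity check above.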

Regards,

Chuck
Chuck Zmudzinski June 4, 2022, 1:10 a.m. UTC | #29
On 3/30/22 1:15 PM, Anthony PERARD wrote:
> Hi Chuck,
>
> On Sun, Mar 13, 2022 at 11:41:37PM -0400, Chuck Zmudzinski wrote:
>> When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD
>> opregion to the guest but libxl does not grant the guest permission to
> I'm not reading the same thing when looking at the code in hvmloader. It
> seems that hvmloader allocates some memory for the IGD opregion rather
> than mapping it.
>
>
> tools/firmware/hvmloader/pci.c:184
>      if ( vendor_id == 0x8086 )
>      {
>          igd_opregion_pgbase = mem_hole_alloc(IGD_OPREGION_PAGES);
>          /*
>           * Write the OpRegion offset to give the opregion
>           * address to the device model. The device model will trap
>           * and map the OpRegion at the given address.
>           */
>          pci_writel(vga_devfn, PCI_INTEL_OPREGION,
>                     igd_opregion_pgbase << PAGE_SHIFT);
>      }
>
> I think this write would go through QEMU; does it do something with it?
> (I kind of reply to this question at the end of the mail.)
>
> Is this code in hvmloader actually run in your case?
>
>> Currently, because of another bug in Qemu upstream, this crash can
>> only be reproduced using the traditional Qemu device model, and of
> qemu-traditional is a bit old... What is the bug in QEMU? Do you have a
> link to a patch/mail?

I finally found a patch for the other bug in Qemu upstream. The
patch is currently being used in QubesOS, and they first added
it to their version of Qemu way back in 2017:

https://github.com/QubesOS/qubes-vmm-xen-stubdom-linux/pull/3/commits/ab2b4c2ad02827a73c52ba561e9a921cc4bb227c

Although this patch is advertised as applying to the device model
running in a Linux stub domain, it is also needed (at least on my
system) with the device model running in Dom0.

Here is the story:

The patch is titled "qemu: fix wrong mask in pci capability registers 
handling"

There is scant information in the commit message about the nature of
the problem, but I discovered the following in my testing:

On my Intel Haswell system configured for PCI passthrough to the
Xen HVM guest, Qemu does indeed set the emulated register
incorrectly: it uses a wrong mask that disables, instead of
enables, the PCI_STATUS_CAP_LIST bit of the PCI_STATUS register.

This disabled the MSI-x capability of two of the three PCI devices
passed through to the Xen HVM guest. The problem only
manifests in a bad way in a Linux guest, not in a Windows guest.

One possible reason that only Linux guests are affected is
something I noticed in the verbose Xen xl-dmesg logs: Windows
and Linux use different callbacks for interrupts:

(XEN) Dom1 callback via changed to GSI 28
...
(XEN) Dom3 callback via changed to Direct Vector 0xf3

Dom1 is a Windows Xen HVM and Dom3 is a Linux HVM

Apparently the Direct Vector callback that Linux uses requires
MSI or MSI-x capability of the passed through devices, but the
wrong mask in Qemu disables that capability.

After applying the QubesOS patch to Qemu upstream, the
PCI_STATUS_CAP_LIST bit is set correctly for the guest and
PCI and Intel IGD passthrough works normally because the
Linux guest can make use of the MSI-x capability of the
PCI devices.
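
To illustrate the kind of bug this is, here is a minimal sketch of
the masking logic (my own illustration, not the actual Qemu code;
PCI_STATUS_CAP_LIST is the standard 0x10 bit of the PCI_STATUS
register):

  #include <stdint.h>

  #define PCI_STATUS_CAP_LIST 0x10  /* capabilities list present */

  /* Bits set in emu_mask are hidden from the guest. */
  static uint16_t emulated_status(uint16_t hw_status, uint16_t emu_mask)
  {
      return hw_status & ~emu_mask;
  }

  /* If emu_mask wrongly includes PCI_STATUS_CAP_LIST, the guest
   * reads the bit as 0, never walks the capability list, and so
   * never finds the device's MSI/MSI-X capabilities. */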

The problem was discovered almost five years ago. I don't
know why the fix has not been committed to Qemu
upstream yet.

After this, I was able to determine that libxl only needs to
explicitly grant permission to access the Intel OpRegion for
the old traditional device model: the Xen hypercall to gain
permission is included in upstream Qemu, but it is omitted
from the old traditional device model.
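
For reference, the upstream Qemu path boils down to something like
this (paraphrased from igd_write_opregion() in
hw/xen/xen_pt_graphics.c, not the verbatim code):

  /* When the guest writes the opregion register, upstream Qemu
   * maps the opregion and grants the domain access to it: */
  ret = xc_domain_iomem_permission(xen_xc, xen_domid,
                                   igd_guest_opregion >> XC_PAGE_SHIFT,
                                   XEN_PCI_INTEL_OPREGION_PAGES, 1);

The traditional device model has no equivalent call, which is why
libxl would have to grant the permission itself in that case.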

So this patch is not needed for users of the upstream device
model who also add the QubesOS patch to fix the
PCI_STATUS_CAP_LIST bit in the PCI_STATUS register.

This all assumes the device model is running in Dom0. The
permission for access to the Intel OpRegion might still be
needed if the upstream device model is running in a stub
domain.

There are other problems in addition to this problem of access
to the Intel OpRegion that are discussed here:

https://www.qubes-os.org/news/2017/10/18/msi-support/

As old as that post is, the feature of allowing PCI and VGA
passthrough to HVM domains is still not well supported,
especially for the case when the device model runs in a
stub domain.

Since my proposed patch only applies to the very insecure
case of the old traditional device model running in Dom0,
I will not pursue it further.

I will look for this feature in future versions of Xen.
Currently, Xen 4.16 advertises support for Linux-based stub
domains as "Tech Preview," so future versions of Xen might
handle this problem in libxl or perhaps in some other way.
Hopefully the patch to Qemu to fix the PCI capabilities mask
can also be committed to Qemu upstream soon, so that this
feature of Intel IGD passthrough can at least work with
Linux guests and the upstream Qemu running in Dom0.

Regards,

Chuck

>
>> diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c
>> index 4bbbfe9f16..a4fc473de9 100644
>> --- a/tools/libs/light/libxl_pci.c
>> +++ b/tools/libs/light/libxl_pci.c
>> @@ -2531,6 +2572,37 @@ int libxl__grant_vga_iomem_permission(libxl__gc *gc, const uint32_t domid,
>>                     domid, vga_iomem_start, (vga_iomem_start + 0x20 - 1));
>>               return ret;
>>           }
>> +
>> +        /* If this is an Intel IGD, allow access to the IGD opregion */
>> +        if (!libxl__is_igd_vga_passthru(gc, d_config)) return 0;
>> +
>> +        uint32_t igd_opregion = sysfs_dev_get_igd_opregion(gc, pci);
>> +        uint32_t error = 0xffffffff;
>> +        if (igd_opregion == error) break;
>> +
>> +        vga_iomem_start = ( (uint64_t) igd_opregion ) >> XC_PAGE_SHIFT;
>> +        ret = xc_domain_iomem_permission(CTX->xch, stubdom_domid,
>> +                                         vga_iomem_start,
>> +                                         IGD_OPREGION_PAGES, 1);
>> +        if (ret < 0) {
>> +            LOGED(ERROR, domid,
>> +                  "failed to give stubdom%d access to iomem range "
>> +                  "%"PRIx64"-%"PRIx64" for IGD passthru",
>> +                  stubdom_domid, vga_iomem_start, (vga_iomem_start +
>> +                                                IGD_OPREGION_PAGES - 1));
>> +            return ret;
>> +        }
>> +        ret = xc_domain_iomem_permission(CTX->xch, domid,
>> +                                         vga_iomem_start,
>> +                                         IGD_OPREGION_PAGES, 1);
> Here, you seem to add permission to an address that is read from the
> pci config space of the device, but as I've pointed out above hvmloader
> seems to overwrite this address. Does this call to
> xc_domain_iomem_permission() actually do anything useful?
> Or is it by luck that the address returned by
> sysfs_dev_get_igd_opregion() happened to be the address that hvmloader
> is going to write?
>
> Or maybe hvmloader doesn't actually do anything?
>
>
> Some more thoughts on that: looking at QEMU, it seems that there's already
> a call to xc_domain_iomem_permission(), in igd_write_opregion(). So
> adding one in libxl would seem redundant, or maybe it's the one for the
> device model's domain (named 'stubdom_domid' here) that's needed?
>
> Thanks,
>
Chuck Zmudzinski June 6, 2022, 6:47 p.m. UTC | #30
On 6/3/22 9:10 PM, Chuck Zmudzinski wrote:
> On 3/30/22 1:15 PM, Anthony PERARD wrote:
>> Hi Chuck,
>>
>> On Sun, Mar 13, 2022 at 11:41:37PM -0400, Chuck Zmudzinski wrote:
>>> When gfx_passthru is enabled for the Intel IGD, hvmloader maps the IGD
>>> opregion to the guest but libxl does not grant the guest permission to
>> I'm not reading the same thing when looking at the code in hvmloader. It
>> seems that hvmloader allocates some memory for the IGD opregion rather
>> than mapping it.
>>
>>
>> tools/firmware/hvmloader/pci.c:184
>>      if ( vendor_id == 0x8086 )
>>      {
>>          igd_opregion_pgbase = mem_hole_alloc(IGD_OPREGION_PAGES);
>>          /*
>>           * Write the OpRegion offset to give the opregion
>>           * address to the device model. The device model will trap
>>           * and map the OpRegion at the given address.
>>           */
>>          pci_writel(vga_devfn, PCI_INTEL_OPREGION,
>>                     igd_opregion_pgbase << PAGE_SHIFT);
>>      }
>>
>> I think this write would go through QEMU; does it do something with it?
>> (I kind of reply to this question at the end of the mail.)
>>
>> Is this code in hvmloader actually run in your case?
>>
>>> Currently, because of another bug in Qemu upstream, this crash can
>>> only be reproduced using the traditional Qemu device model, and of
>> qemu-traditional is a bit old... What is the bug in QEMU? Do you have a
>> link to a patch/mail?

I finally found a patch that fixes the upstream bug on my system,
but I am not sure it is the best fix. It is a patch that QubesOS
has been using since 2017.

I just opened an issue with Qemu upstream about this problem,
titled "Incorrect register mask for PCI passthrough on Xen",
which gives all the details:

https://gitlab.com/qemu-project/qemu/-/issues/1061

When you get a chance, can you take a look at it?

Not being an official Xen or Qemu developer, I would appreciate
any suggestions you might have for me before I formally submit
a patch to Qemu upstream. Please reply here or on the Qemu issue
tracker.

Best Regards,

Chuck

>
> I finally found a patch for the other bug in Qemu upstream. The
> patch is currently being used in QubesOS, and they first added
> it to their version of Qemu way back in 2017:
>
> https://github.com/QubesOS/qubes-vmm-xen-stubdom-linux/pull/3/commits/ab2b4c2ad02827a73c52ba561e9a921cc4bb227c 
>
>
> Although this patch is advertised as applying to the device model
> running in a Linux stub domain, it is also needed (at least on my
> system) with the device model running in Dom0.
>
> Here is the story:
>
> The patch is titled "qemu: fix wrong mask in pci capability registers 
> handling"
>
> There is scant information in the commit message about the nature of
> the problem, but I discovered the following in my testing:
>
> On my Intel Haswell system configured for PCI passthrough to the
> Xen HVM guest, Qemu does indeed set the emulated register
> incorrectly: it uses a wrong mask that disables, instead of
> enables, the PCI_STATUS_CAP_LIST bit of the PCI_STATUS register.
>
> This disabled the MSI-x capability of two of the three PCI devices
> passed through to the Xen HVM guest. The problem only
> manifests in a bad way in a Linux guest, not in a Windows guest.
>
> One possible reason that only Linux guests are affected is
> something I noticed in the verbose Xen xl-dmesg logs: Windows
> and Linux use different callbacks for interrupts:
>
> (XEN) Dom1 callback via changed to GSI 28
> ...
> (XEN) Dom3 callback via changed to Direct Vector 0xf3
>
> Dom1 is a Windows Xen HVM and Dom3 is a Linux HVM
>
> Apparently the Direct Vector callback that Linux uses requires
> MSI or MSI-x capability of the passed through devices, but the
> wrong mask in Qemu disables that capability.
>
> After applying the QubesOS patch to Qemu upstream, the
> PCI_STATUS_CAP_LIST bit is set correctly for the guest and
> PCI and Intel IGD passthrough works normally because the
> Linux guest can make use of the MSI-x capability of the
> PCI devices.
>
> The problem was discovered almost five years ago. I don't
> know why the fix has not been committed to Qemu
> upstream yet.
>
> After this, I was able to determine that libxl only needs to
> explicitly grant permission to access the Intel OpRegion for
> the old traditional device model: the Xen hypercall to gain
> permission is included in upstream Qemu, but it is omitted
> from the old traditional device model.
>
> So this patch is not needed for users of the upstream device
> model who also add the QubesOS patch to fix the
> PCI_STATUS_CAP_LIST bit in the PCI_STATUS register.
>
> This all assumes the device model is running in Dom0. The
> permission for access to the Intel OpRegion might still be
> needed if the upstream device model is running in a stub
> domain.
>
> There are other problems in addition to this problem of access
> to the Intel OpRegion that are discussed here:
>
> https://www.qubes-os.org/news/2017/10/18/msi-support/
>
> As old as that post is, the feature of allowing PCI and VGA
> passthrough to HVM domains is still not well supported,
> especially for the case when the device model runs in a
> stub domain.
>
> Since my proposed patch only applies to the very insecure
> case of the old traditional device model running in Dom0,
> I will not pursue it further.
>
> I will look for this feature in future versions of Xen.
> Currently, Xen 4.16 advertises support for Linux-based stub
> domains as "Tech Preview," so future versions of Xen might
> handle this problem in libxl or perhaps in some other way.
> Hopefully the patch to Qemu to fix the PCI capabilities mask
> can also be committed to Qemu upstream soon, so that this
> feature of Intel IGD passthrough can at least work with
> Linux guests and the upstream Qemu running in Dom0.
>
> Regards,
>
> Chuck
Anthony PERARD June 17, 2022, 1:46 p.m. UTC | #31
On Thu, Mar 31, 2022 at 03:44:33PM -0400, Chuck Zmudzinski wrote:
> On 3/31/22 10:19 AM, Jason Andryuk wrote:
> > On Thu, Mar 31, 2022 at 10:05 AM Chuck Zmudzinski <brchuckz@netscape.net> wrote:
> > > 
> > > That still doesn't answer my question - will the Qemu upstream
> > > accept the patches that move the hypercalls to the toolstack? If
> > > not, we have to live with what we have now, which is that the
> > > hypercalls are done in Qemu.
> > Xen-associated people maintain hw/xen code in QEMU, so yes it could be accepted.
> > 
> > Maybe it would need to be backwards compatible to have libxl check the
> > QEMU version to decide who makes the hypercall?  Unless it is broken
> > today, in which case just make it work.
> > 
> > Regards,
> > Jason
> 
> I know of another reason to check the Qemu upstream version,
> and that is dealing with deprecated / removed device model
> options that xl.cfg still uses. I looked at that a few years ago
> with regard to the deprecated 'usbdevice tablet' Qemu option,
> but I did not see anything in libxl to distinguish Qemu versions
> except for upstream vs. traditional. AFAICT, detecting traditional
> vs. upstream Qemu depends solely on the device_model_version
> xl.cfg setting. So it might be useful for libxl to add the capability
> to detect the upstream Qemu version or at least create an xl.cfg
> setting to inform libxl about the Qemu version.

Hi,

There's already some code to deal with QEMU's version (QEMU = upstream
Qemu) in libxl, but this code deals with an already running QEMU.
When libxl interacts with QEMU through QMP to set up some more devices,
QEMU also advertises its version, which we can use on follow-up QMP
commands.

I think adding passthrough pci devices is all done via QMP, so we can
potentially move a hypercall from QEMU to libxl, and tell libxl,
depending on which version of QEMU it is talking to, whether or not it
needs to do the hypercall. Also, we could probably add a parameter so
that QEMU knows if it has to do the hypercall or not, and thus newer
versions of QEMU could also deal with older versions of libxl.

(There may be some examples like that in both QEMU and libxl, when
doing live migration; dm_state_save_to_fdset() in libxl is a pointer.)
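
As a very rough sketch of that idea (hypothetical helper name, just
to show the shape of the version gate):

  /* libxl side, once the QMP greeting has told us the DM version: */
  if (qmp_dm_version_at_least(qmp, 7, 1)) {  /* hypothetical helper */
      /* newer QEMU: libxl makes the iomem-permission hypercall and
       * passes a parameter telling QEMU not to repeat it */
  } else {
      /* older QEMU: QEMU keeps making the hypercall itself */
  }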

Cheers,
Chuck Zmudzinski June 17, 2022, 8:39 p.m. UTC | #32
On 6/17/22 9:46 AM, Anthony PERARD wrote:
> On Thu, Mar 31, 2022 at 03:44:33PM -0400, Chuck Zmudzinski wrote:
>> On 3/31/22 10:19 AM, Jason Andryuk wrote:
>>> On Thu, Mar 31, 2022 at 10:05 AM Chuck Zmudzinski <brchuckz@netscape.net> wrote:
>>>> That still doesn't answer my question - will the Qemu upstream
>>>> accept the patches that move the hypercalls to the toolstack? If
>>>> not, we have to live with what we have now, which is that the
>>>> hypercalls are done in Qemu.
>>> [...]
>> I know of another reason to check the Qemu upstream version,
>> and that is dealing with deprecated / removed device model
>> options that xl.cfg still uses. I looked at that a few years ago
>> with regard to the deprecated 'usbdevice tablet' Qemu option,
>> but I did not see anything in libxl to distinguish Qemu versions
>> except for upstream vs. traditional. AFAICT, detecting traditional
>> vs. upstream Qemu depends solely on the device_model_version
>> xl.cfg setting. So it might be useful for libxl to add the capability
>> to detect the upstream Qemu version or at least create an xl.cfg
>> setting to inform libxl about the Qemu version.
> Hi,
>
> There's already some code to deal with QEMU's version (QEMU = upstream
> Qemu) in libxl, but this code deals with an already running QEMU.
> When libxl interacts with QEMU through QMP to set up some more devices,
> QEMU also advertises its version, which we can use on follow-up QMP
> commands.
>
> I think adding passthrough pci devices is all done via QMP, so we can
> potentially move a hypercall from QEMU to libxl, and tell libxl,
> depending on which version of QEMU it is talking to, whether or not it
> needs to do the hypercall. Also, we could probably add a parameter so
> that QEMU knows if it has to do the hypercall or not, and thus newer
> versions of QEMU could also deal with older versions of libxl.
>
> (There may be some examples like that in both QEMU and libxl, when
> doing live migration; dm_state_save_to_fdset() in libxl is a pointer.)
>
> Cheers,
>

Hi Anthony,

Thanks for this information; it is useful because I plan to work on
improved Xen / Qemu interactions to better support features
such as running the device model in a stub domain, in which case
it is probably better for libxl, or maybe hvmloader, to make some
of the hypercalls that are currently done in Qemu.

I would also like to see Xen HVM using Q35 instead of i440fx
emulation, which also requires improved Xen / Qemu interactions.
I know of the patch set for Q35 emulation that was proposed back
in 2018, but I have not tried anything like that yet. That looks
like a complex problem. Has there been any more recent work in
that area? I couldn't find much by searching the Internet. I have
quite a bit to learn before I can make contributions to support
Q35 as a replacement for i440fx, but I plan to slowly work on it.
Any suggestions anyone has are welcome.

Best Regards,

Chuck
diff mbox series

Patch

diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c
index 4bbbfe9f16..a4fc473de9 100644
--- a/tools/libs/light/libxl_pci.c
+++ b/tools/libs/light/libxl_pci.c
@@ -24,6 +24,8 @@ 
 #define PCI_OPTIONS            "msitranslate=%d,power_mgmt=%d"
 #define PCI_BDF_XSPATH         "%04x-%02x-%02x-%01x"
 #define PCI_PT_QDEV_ID         "pci-pt-%02x_%02x.%01x"
+#define PCI_INTEL_OPREGION     0xfc
+#define IGD_OPREGION_PAGES     3
 
 static unsigned int pci_encode_bdf(libxl_device_pci *pci)
 {
@@ -610,6 +612,45 @@  out:
     return ret;
 }
 
+static uint32_t sysfs_dev_get_igd_opregion(libxl__gc *gc,
+                                           libxl_device_pci *pci)
+{
+    char *pci_device_config_path =
+            GCSPRINTF(SYSFS_PCI_DEV"/"PCI_BDF"/config",
+                      pci->domain, pci->bus, pci->dev, pci->func);
+    size_t read_items;
+    uint32_t igd_opregion;
+    uint32_t error = 0xffffffff;
+
+    FILE *f = fopen(pci_device_config_path, "r");
+    if (!f) {
+        LOGE(ERROR,
+             "pci device "PCI_BDF" does not have config attribute",
+             pci->domain, pci->bus, pci->dev, pci->func);
+        goto out;
+    }
+    if (fseek(f, PCI_INTEL_OPREGION, SEEK_SET)) {
+        LOGE(ERROR,
+             "cannot find igd opregion address of pci device "PCI_BDF,
+             pci->domain, pci->bus, pci->dev, pci->func);
+        goto out;
+    }
+    read_items = fread(&igd_opregion, 4, 1, f);
+    if (read_items != 1) {
+        LOGE(ERROR,
+             "cannot read igd opregion address of pci device "PCI_BDF,
+             pci->domain, pci->bus, pci->dev, pci->func);
+        goto out;
+    }
+    fclose(f);
+    return igd_opregion;
+
+out:
+    if (f)
+        fclose(f);
+    return error;
+}
+
 /*
  * Some devices may need some ways to work well. Here like IGD,
  * we have to pass a specific option to qemu.
@@ -2531,6 +2572,37 @@  int libxl__grant_vga_iomem_permission(libxl__gc *gc, const uint32_t domid,
                   domid, vga_iomem_start, (vga_iomem_start + 0x20 - 1));
             return ret;
         }
+
+        /* If this is an Intel IGD, allow access to the IGD opregion */
+        if (!libxl__is_igd_vga_passthru(gc, d_config)) return 0;
+
+        uint32_t igd_opregion = sysfs_dev_get_igd_opregion(gc, pci);
+        uint32_t error = 0xffffffff;
+        if (igd_opregion == error) break;
+
+        vga_iomem_start = ( (uint64_t) igd_opregion ) >> XC_PAGE_SHIFT;
+        ret = xc_domain_iomem_permission(CTX->xch, stubdom_domid,
+                                         vga_iomem_start,
+                                         IGD_OPREGION_PAGES, 1);
+        if (ret < 0) {
+            LOGED(ERROR, domid,
+                  "failed to give stubdom%d access to iomem range "
+                  "%"PRIx64"-%"PRIx64" for IGD passthru",
+                  stubdom_domid, vga_iomem_start, (vga_iomem_start +
+                                                IGD_OPREGION_PAGES - 1));
+            return ret;
+        }
+        ret = xc_domain_iomem_permission(CTX->xch, domid,
+                                         vga_iomem_start,
+                                         IGD_OPREGION_PAGES, 1);
+        if (ret < 0) {
+            LOGED(ERROR, domid,
+                  "failed to give dom%d access to iomem range "
+                  "%"PRIx64"-%"PRIx64" for IGD passthru",
+                  domid, vga_iomem_start, (vga_iomem_start +
+                                           IGD_OPREGION_PAGES - 1));
+            return ret;
+        }
         break;
     }