
[v4,6/6] x86/HVM: report the set of enabled emulated devices through CPUID

Message ID 1453395092-88090-7-git-send-email-roger.pau@citrix.com
State New, archived

Commit Message

Roger Pau Monné Jan. 21, 2016, 4:51 p.m. UTC
Add a new HVM-specific feature flag that signals the presence of a bitmap
that contains the current set of enabled emulated devices. The bitmap is
placed in the ecx register. The bit fields used in the bitmap are the same
as the ones used in the xen_arch_domainconfig emulation_flags field, and
their meaning can be found at arch-x86/xen.h.

This will allow Xen to enable emulated devices for HVMlite guests in the
future, by having a proper ABI for reporting which devices are enabled.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/x86/hvm/hvm.c              | 4 ++++
 xen/include/public/arch-x86/cpuid.h | 5 +++++
 2 files changed, 9 insertions(+)
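
For illustration, a guest could decode such a bitmap roughly as follows. This is a minimal sketch and not code from the patch: the EMU_* constants are local stand-ins that assume the bit order of the XEN_X86_EMU_* flags in arch-x86/xen.h, and describe_emulated_devices() is a hypothetical helper. The ECX value would come from the HVM-specific CPUID leaf, after checking the new feature flag in EAX.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Local stand-ins; assumed to mirror XEN_X86_EMU_* in arch-x86/xen.h. */
#define EMU_LAPIC  (1u << 0)
#define EMU_HPET   (1u << 1)
#define EMU_PM     (1u << 2)
#define EMU_RTC    (1u << 3)
#define EMU_IOAPIC (1u << 4)
#define EMU_PIC    (1u << 5)
#define EMU_VGA    (1u << 6)
#define EMU_IOMMU  (1u << 7)
#define EMU_PIT    (1u << 8)

struct emu_name {
    uint32_t bit;
    const char *name;
};

static const struct emu_name emu_names[] = {
    { EMU_LAPIC, "LAPIC" },   { EMU_HPET, "HPET" }, { EMU_PM, "PM" },
    { EMU_RTC, "RTC" },       { EMU_IOAPIC, "IOAPIC" },
    { EMU_PIC, "PIC" },       { EMU_VGA, "VGA" },
    { EMU_IOMMU, "IOMMU" },   { EMU_PIT, "PIT" },
};

/*
 * Write a space-separated list of the enabled devices into buf and
 * return how many were written. Bits without a known name are ignored,
 * and the loop stops early if buf is too small.
 */
static unsigned int describe_emulated_devices(uint32_t ecx, char *buf,
                                              size_t len)
{
    unsigned int count = 0;
    size_t i, used = 0;

    if (len == 0)
        return 0;
    buf[0] = '\0';

    for (i = 0; i < sizeof(emu_names) / sizeof(emu_names[0]); i++) {
        int n;

        if (!(ecx & emu_names[i].bit))
            continue;
        n = snprintf(buf + used, len - used, "%s%s",
                     count ? " " : "", emu_names[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break;
        used += (size_t)n;
        count++;
    }
    return count;
}
```

Ignoring unknown bits keeps such a decoder usable if further bits are assigned later.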

Comments

Jan Beulich Jan. 22, 2016, 10:57 a.m. UTC | #1
>>> On 21.01.16 at 17:51, <roger.pau@citrix.com> wrote:
> Add a new HVM-specific feature flag that signals the presence of a bitmap
> that contains the current set of enabled emulated devices. The bitmap is
> placed in the ecx register. The bit fields used in the bitmap are the same
> as the ones used in the xen_arch_domainconfig emulation_flags field, and
> their meaning can be found at arch-x86/xen.h.
> 
> This will allow Xen to enable emulated devices for HVMlite guests in the
> future, by having a proper ABI for reporting which devices are enabled.

The idea is certainly nice and appreciated, but ...

> --- a/xen/include/public/arch-x86/cpuid.h
> +++ b/xen/include/public/arch-x86/cpuid.h
> @@ -78,12 +78,17 @@
>   * HVM-specific features
>   * EAX: Features
>   * EBX: vcpu id (iff EAX has XEN_HVM_CPUID_VCPU_ID_PRESENT flag)
> + * ECX: bitmap of enabled devices, according to the bit fields defined in
> + *      arch-x86/xen.h.

... this set of definitions is not currently a stable ABI (limited to
hypervisor and tool stack), and if we wanted to make it stable
we'd first need to think a little about the complications that may
arise if the granularity chosen (think about the PM bit and the
discussion around it before your changes went in) turns out to
be a problem later on.

Also at least some of the features can be determined by other
means (CPUID, ACPI tables), so I'm not even sure we need all
of this, and I'd really prefer to avoid multiple distinct ways to
learn of a certain feature, as it's too easy for the two (or more)
mechanisms to get out of sync.

> All unused bits have undefined values.

Nor is this an option, but maybe this is just a wording issue:
Perhaps you mean to say that they're reserved for future use?
Truly unused bits are guaranteed to have the value zero; it is
just that the set of used bits varies.

Jan
Roger Pau Monné Jan. 22, 2016, 12:43 p.m. UTC | #2
El 22/01/16 a les 11.57, Jan Beulich ha escrit:
>>>> On 21.01.16 at 17:51, <roger.pau@citrix.com> wrote:
>> Add a new HVM-specific feature flag that signals the presence of a bitmap
>> that contains the current set of enabled emulated devices. The bitmap is
>> placed in the ecx register. The bit fields used in the bitmap are the same
>> as the ones used in the xen_arch_domainconfig emulation_flags field, and
>> their meaning can be found at arch-x86/xen.h.
>>
>> This will allow Xen to enable emulated devices for HVMlite guests in the
>> future, by having a proper ABI for reporting which devices are enabled.
> 
> The idea is certainly nice and appreciated, but ...
> 
>> --- a/xen/include/public/arch-x86/cpuid.h
>> +++ b/xen/include/public/arch-x86/cpuid.h
>> @@ -78,12 +78,17 @@
>>   * HVM-specific features
>>   * EAX: Features
>>   * EBX: vcpu id (iff EAX has XEN_HVM_CPUID_VCPU_ID_PRESENT flag)
>> + * ECX: bitmap of enabled devices, according to the bit fields defined in
>> + *      arch-x86/xen.h.
> 
> ... this set of definitions is not currently a stable ABI (limited to
> hypervisor and tool stack), and if we wanted to make it stable
> we'd first need to think a little about the complications that may
> arise if the granularity chosen (think about the PM bit and the
> discussion around it before your changes went in) turns out to
> be a problem later on.

Yes, in fact I'm having second thoughts on the PM flag, and I think I
should have split it into ACPI_PM and ACPI_TIMER instead.

> Also at least some of the features can be determined by other
> means (CPUID, ACPI tables), so I'm not even sure we need all
> of this, and I'd really prefer to avoid multiple distinct ways to
> learn of a certain feature, as it's too easy for the two (or more)
> mechanisms to get out of sync.

So let's look at the flags and whether there's an existing way to signal
their presence:

LAPIC: CPUID.01h:EDX[bit 9]
IOAPIC: tied to LAPIC (so either both enabled or none).

HPET: can only be enabled from/with ACPI, since its base memory address
is not fixed, and we would need to find a way to pass its address to
the OS in the absence of ACPI.

RTC: I don't know of any way to signal the RTC presence, AFAICT it's
always assumed to be there in the PC architecture. Could maybe return ~0
when reading from IO port 0x71, but that's meh..., not the best way IMHO.

PIC: same as RTC, I don't know of any way to signal its presence since
it's assumed to be there.

VGA: again I don't think there's an easy way to signal its presence,
apart from returning ~0 from the multiple IO ports it uses. The fact
that the 0xA0000-0xBFFFF memory range is also marked as RAM in the e820
map in HVMlite DomUs should also trigger OSes into disabling VGA due to
the lack of proper MMIO range, but sadly I think most OSes just assume
it's there.

PIT: assumed to be always present in the PC architecture.

PM: I'm leaning towards splitting this into ACPI_PM and ACPI_TIMER as said
before. ACPI_TIMER presence is reported in the ACPI tables, and
the availability of ACPI_PM (power management) can be inferred from the
presence of ACPI itself.

AMD guest IOMMU: AFAICT this seems to be currently disabled, since the
MMIO range it checks is [~0ULL, ~0ULL + 0x8000]. There is a function to
change the base address ~0ULL to something else, but it doesn't seem to
be reachable from any path. In any case, I guess the presence of this
device will be reported from ACPI.
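
To show why a base address of ~0ULL effectively disables the device, here is a minimal sketch. It assumes the range check has the usual half-open form; in_mmio_range() and MMIO_SIZE are illustrative names, not the actual Xen code.

```c
#include <stdbool.h>
#include <stdint.h>

#define MMIO_SIZE 0x8000ULL /* size of the range quoted above */

/* Hypothetical half-open range check: base <= addr < base + MMIO_SIZE. */
static bool in_mmio_range(uint64_t base, uint64_t addr)
{
    /*
     * With base == ~0ULL the sum wraps around to 0x7fff, so no address
     * can be both >= base and < base + MMIO_SIZE.
     */
    return addr >= base && addr < base + MMIO_SIZE;
}
```

With any base that does not wrap, the same check behaves as expected.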

So, we have the following devices that are assumed to be there: RTC,
PIC, PIT. Everything else I think can be signalled by other means
already available.

IMHO, I think we could say that the PIC is never going to be available
to HVMlite guests (in any case we would enable the lapic/ioapic), and
maybe enable the RTC and PIT by default?

Then I think we could get away without any Xen-specific way of reporting
enabled devices.

>> All unused bits have undefined values.
> 
> Nor is this an option, but maybe this is just a wording issue:
> Perhaps you mean to say that they're reserved for future use?
> Since truly unused bits have are guaranteed to have the value
> zero, just that the set of bits varies.

Yes, that's exactly what I meant to say, thanks.

Roger.
Jan Beulich Jan. 22, 2016, 1:24 p.m. UTC | #3
>>> On 22.01.16 at 13:43, <roger.pau@citrix.com> wrote:
> RTC: I don't know of any way to signal the RTC presence, AFAICT it's
> always assumed to be there in the PC architecture. Could maybe return ~0
> when reading from IO port 0x71, but that's meh..., not the best way IMHO.

There actually is an RTC-absent flag in the FADT, which the
hypervisor itself actually looks at (ACPI_FADT_NO_CMOS_RTC).

> PIC: same as RTC, I don't know of any way to signal it's presence since
> it's assumed to be there.

I think PIC absence can also be gathered from ACPI
(ACPI_FADT_HW_REDUCED).
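
As a rough sketch of how a guest might consume these two FADT flags: the bit positions below follow the ACPI specification (ACPICA naming), while struct legacy_hint and fadt_legacy_hint() are hypothetical names used only for illustration. The caller is assumed to have already located the FADT and read its boot architecture flags and fixed feature flags.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit in the FADT IA-PC boot architecture flags. */
#define ACPI_FADT_NO_CMOS_RTC (1u << 5)
/* Bit in the FADT fixed feature flags. */
#define ACPI_FADT_HW_REDUCED  (1u << 20)

/* Hypothetical summary of what the FADT says about legacy hardware. */
struct legacy_hint {
    bool rtc_absent; /* no CMOS RTC at the legacy I/O ports */
    bool hw_reduced; /* hardware-reduced ACPI platform */
};

static struct legacy_hint fadt_legacy_hint(uint16_t boot_arch_flags,
                                           uint32_t fixed_flags)
{
    struct legacy_hint hint;

    hint.rtc_absent = (boot_arch_flags & ACPI_FADT_NO_CMOS_RTC) != 0;
    hint.hw_reduced = (fixed_flags & ACPI_FADT_HW_REDUCED) != 0;
    return hint;
}
```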

> VGA: again I don't think there's an easy way to signal it's presence,
> apart from returning ~0 from the multiple IO ports it uses. The fact
> that the 0xA0000-0xBFFFF memory range is also marked as RAM in the e820
> map in HVMlite DomUs should also trigger OSes into disabling VGA due to
> the lack of proper MMIO range, but sadly I think most OSes just assume
> it's there.

Yes, VGA is kind of more difficult. Looking at all PCI devices'
command words may provide a hint, as may looking at all PCI
bridges' bridge control words.

> PIT: assumed to be always present in the PC architecture.

See PIC above.

> PM: I'm leaning to split this into ACPI_PM and ACPI_TIMER as said
> before. ACPI_TIMER presence it's contained inside of ACPI tables, and
> the availability of ACPI_PM (power management) can be inferred from the
> presence of ACPI itself.

As indicated in the original discussion, I don't think these should
have been separate flags anyway - either you have ACPI (then
you have indication of both in FADT), or there is no ACPI.

> AMD guest IOMMU: AFAICT this seems to be currently disabled, since the
> MMIO range it checks is [~0ULL, ~0ULL + 0x8000]. There is a function to
> change the base address ~0ULL to something else, but it doesn't seem to
> be reachable from any path. In any case, I guess the presence of this
> device will be reported from ACPI.

Yes, the implementation is incomplete (abandoned?), but IOMMU
presence can always be determined by the guest through
inspecting its ACPI tables.

> So, we have the following devices that are assumed to be there: RTC,
> PIC, PIT. Everything else I think can be signalled by other means
> already available.
> 
> IMHO, I think we could say that the PIC is never going to be available
> to HVMlite guests (in any case we would enable the lapic/ioapic), and
> maybe enable the RTC and PIT by default?

That may be a sane initial setup, but with the ACPI flags named
above we may be able to express even their absence.

> Then I think we could get away without any Xen-specific way of reporting
> enabled devices.

Indeed - that should be the preferred goal.

Jan
Andrew Cooper Jan. 22, 2016, 1:34 p.m. UTC | #4
On 22/01/16 12:43, Roger Pau Monné wrote:
> El 22/01/16 a les 11.57, Jan Beulich ha escrit:
>>>>> On 21.01.16 at 17:51, <roger.pau@citrix.com> wrote:
>>> Add a new HVM-specific feature flag that signals the presence of a bitmap
>>> that contains the current set of enabled emulated devices. The bitmap is
>>> placed in the ecx register. The bit fields used in the bitmap are the same
>>> as the ones used in the xen_arch_domainconfig emulation_flags field, and
>>> their meaning can be found at arch-x86/xen.h.
>>>
>>> This will allow Xen to enable emulated devices for HVMlite guests in the
>>> future, by having a proper ABI for reporting which devices are enabled.
>> The idea is certainly nice and appreciated, but ...
>>
>>> --- a/xen/include/public/arch-x86/cpuid.h
>>> +++ b/xen/include/public/arch-x86/cpuid.h
>>> @@ -78,12 +78,17 @@
>>>   * HVM-specific features
>>>   * EAX: Features
>>>   * EBX: vcpu id (iff EAX has XEN_HVM_CPUID_VCPU_ID_PRESENT flag)
>>> + * ECX: bitmap of enabled devices, according to the bit fields defined in
>>> + *      arch-x86/xen.h.
>> ... this set of definitions is not currently a stable ABI (limited to
>> hypervisor and tool stack), and if we wanted to make it stable
>> we'd first need to think a little about the complications that may
>> arise if the granularity chosen (think about the PM bit and the
>> discussion around it before your changes went in) turns out to
>> be a problem later on.
> Yes, in fact I'm having second thoughts on the PM flag, and I think I
> should have split it into ACPI_PM and ACPI_TIMER instead.
>
>> Also at least some of the features can be determined by other
>> means (CPUID, ACPI tables), so I'm not even sure we need all
>> of this, and I'd really prefer to avoid multiple distinct ways to
>> learn of a certain feature, as it's too easy for the two (or more)
>> mechanisms to get out of sync.
> So let's look at the flags and whether there's an existing way to signal
> it's presence:
>
> LAPIC: CPUID.01h:EDX[bit 9]
> IOAPIC: tied to LAPIC (so either both enabled or none).

An IOAPIC is by no means required - they are only for turning legacy
interrupts into MSIs.  It would be perfectly fine for a PVH domain to
have an LAPIC and an SRIOV virtual function, without an IOAPIC at all.

The presence of LAPICs and IOAPICs is described in the MADT ACPI table.

Note also that the cpuid bit is a fast-forward of the hardware enable bit
in the APIC_BASE MSR.  The cpuid bit will disappear from view if you
hardware-disable the LAPIC.
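
The relationship between the two bits can be sketched as below. Bit 9 of CPUID.01h:EDX and bit 11 of the APIC_BASE MSR are architectural; guest_visible_edx() is a hypothetical helper that only models the behaviour described above and is not taken from Xen.

```c
#include <stdint.h>

#define CPUID1_EDX_APIC  (1u << 9)   /* APIC feature bit, CPUID.01h:EDX */
#define APIC_BASE_ENABLE (1ull << 11) /* global enable bit in APIC_BASE */

/*
 * Return the EDX value a guest would observe for CPUID leaf 1, given the
 * nominal feature bits and the current APIC_BASE MSR value: the APIC
 * bit is hidden while the LAPIC is hardware-disabled.
 */
static uint32_t guest_visible_edx(uint32_t edx, uint64_t apic_base)
{
    if (!(apic_base & APIC_BASE_ENABLE))
        edx &= ~CPUID1_EDX_APIC;
    return edx;
}
```

This is why the CPUID bit alone tells an OS whether the LAPIC is currently enabled, not whether one exists.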

>
> HPET: can only be enabled from/with ACPI, since it's base memory address
> is not fixed, and we would need to find a way to pass it's address to
> the OS in the absence of ACPI.

In reality, there are heuristics to guess if an HPET is present.  The
legacy HPET traditionally always resides at pfn fed000.  Linux even has
heuristics to find the legacy HPET based on the IOH, for when the BIOS
doesn't present the HPET properly in ACPI.

This leads to an awkward bug where Linux is able to turn off legacy
timer interrupts behind Xen's back, causing carnage in kdump
environments, as Xen doesn't know to re-enable legacy interrupts on the
crash path.

>
> RTC: I don't know of any way to signal the RTC presence, AFAICT it's
> always assumed to be there in the PC architecture. Could maybe return ~0
> when reading from IO port 0x71, but that's meh..., not the best way IMHO.
>
> PIC: same as RTC, I don't know of any way to signal it's presence since
> it's assumed to be there.
>
> VGA: again I don't think there's an easy way to signal it's presence,
> apart from returning ~0 from the multiple IO ports it uses. The fact
> that the 0xA0000-0xBFFFF memory range is also marked as RAM in the e820
> map in HVMlite DomUs should also trigger OSes into disabling VGA due to
> the lack of proper MMIO range, but sadly I think most OSes just assume
> it's there.

VGA can be found by following the VGA routing bit in PCI config space. 
This is how real hardware makes the legacy IO ranges reach the graphics
card configured as the primary vga device.
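
A minimal sketch of that check, assuming the OS has already walked the PCI hierarchy and collected the relevant config space fields. The register bits come from the PCI and PCI-to-PCI bridge specifications; struct vga_candidate and legacy_vga_routed() are hypothetical names, not an existing API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_COMMAND_IO        0x0001u /* command register: I/O space enable */
#define PCI_COMMAND_MEMORY    0x0002u /* command register: memory space enable */
#define PCI_BRIDGE_CTL_VGA    0x0008u /* bridge control register: VGA enable */
#define PCI_CLASS_DISPLAY_VGA 0x0300u /* base class 0x03, sub-class 0x00 */

/* Hypothetical snapshot of the fields that matter for one device. */
struct vga_candidate {
    uint16_t class_code;            /* base class and sub-class */
    uint16_t command;               /* device command register */
    const uint16_t *bridge_control; /* one entry per upstream bridge */
    size_t nr_bridges;
};

/*
 * The legacy VGA ranges reach a device only if it is a VGA-class device
 * that decodes I/O and memory, and every bridge above it forwards the
 * VGA ranges.
 */
static bool legacy_vga_routed(const struct vga_candidate *dev)
{
    const uint16_t decode = PCI_COMMAND_IO | PCI_COMMAND_MEMORY;
    size_t i;

    if (dev->class_code != PCI_CLASS_DISPLAY_VGA)
        return false;
    if ((dev->command & decode) != decode)
        return false;
    for (i = 0; i < dev->nr_bridges; i++)
        if (!(dev->bridge_control[i] & PCI_BRIDGE_CTL_VGA))
            return false;
    return true;
}
```

If no device passes this test, nothing is claiming the legacy VGA ranges.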

>
> PIT: assumed to be always present in the PC architecture.

PIT, RTC and PIC have their presence always assumed, but returning ~0 on
reads is completely fine.  A DMLite OS knows it is booting in a
virtualised environment.
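
A probe based on reads returning ~0 could look roughly like this for the RTC. It is only a sketch: rtc_looks_absent() and the two reader functions are hypothetical, and the choice of registers to sample is an assumption. On real hardware the reader would write the register index to port 0x70 and read the value back from port 0x71.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Reads one CMOS/RTC register (index via port 0x70, data via port 0x71). */
typedef uint8_t (*cmos_read_fn)(uint8_t reg);

/*
 * Treat the RTC as absent when every sampled register reads back as
 * 0xff, which is what an unclaimed I/O port returns.
 */
static bool rtc_looks_absent(cmos_read_fn read_reg)
{
    /* Seconds, status A, status B, status D. */
    static const uint8_t regs[] = { 0x00, 0x0a, 0x0b, 0x0d };
    size_t i;

    for (i = 0; i < sizeof(regs) / sizeof(regs[0]); i++)
        if (read_reg(regs[i]) != 0xff)
            return false;
    return true;
}

/* Stand-ins for the port accessors, for illustration only. */
static uint8_t read_unclaimed(uint8_t reg)
{
    (void)reg;
    return 0xff;
}

static uint8_t read_fake_rtc(uint8_t reg)
{
    /* Pretend status D reports valid time and everything else is zero. */
    return reg == 0x0d ? 0x80 : 0x00;
}
```

Sampling several registers reduces, but does not remove, the chance of mistaking a live device for an absent one.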

>
> PM: I'm leaning to split this into ACPI_PM and ACPI_TIMER as said
> before. ACPI_TIMER presence it's contained inside of ACPI tables, and
> the availability of ACPI_PM (power management) can be inferred from the
> presence of ACPI itself.
>
> AMD guest IOMMU: AFAICT this seems to be currently disabled, since the
> MMIO range it checks is [~0ULL, ~0ULL + 0x8000]. There is a function to
> change the base address ~0ULL to something else, but it doesn't seem to
> be reachable from any path. In any case, I guess the presence of this
> device will be reported from ACPI.

It is indeed currently disabled  (See
https://bugs.xenserver.org/browse/XSO-132 if you want to see why.  It
manifested as a very curious bug).

It will be available via an IVRS ACPI table when implemented.

>
> So, we have the following devices that are assumed to be there: RTC,
> PIC, PIT. Everything else I think can be signalled by other means
> already available.
>
> IMHO, I think we could say that the PIC is never going to be available
> to HVMlite guests (in any case we would enable the lapic/ioapic), and
> maybe enable the RTC and PIT by default?
>
> Then I think we could get away without any Xen-specific way of reporting
> enabled devices.

DMLite is a new container type.  I would far rather it was assumed that
there was no legacy hardware at all.

~Andrew
Roger Pau Monné Jan. 22, 2016, 2:41 p.m. UTC | #5
El 22/01/16 a les 14.24, Jan Beulich ha escrit:
>>>> On 22.01.16 at 13:43, <roger.pau@citrix.com> wrote:
>> RTC: I don't know of any way to signal the RTC presence, AFAICT it's
>> always assumed to be there in the PC architecture. Could maybe return ~0
>> when reading from IO port 0x71, but that's meh..., not the best way IMHO.
> 
> There actually is an RTC-absent flag in the FADT, which the
> hypervisor itself actually looks at (ACPI_FADT_NO_CMOS_RTC).

So most of this assumes that if we ever want to enable any of those
devices we will provide ACPI tables to the guest?

The RTC can also be used as an interrupt source, which I think is not
covered by the ACPI_FADT_NO_CMOS_RTC flag.

> 
>> PIC: same as RTC, I don't know of any way to signal it's presence since
>> it's assumed to be there.
> 
> I think PIC absence can also be gathered from ACPI
> (ACPI_FADT_HW_REDUCED).
> 
>> VGA: again I don't think there's an easy way to signal it's presence,
>> apart from returning ~0 from the multiple IO ports it uses. The fact
>> that the 0xA0000-0xBFFFF memory range is also marked as RAM in the e820
>> map in HVMlite DomUs should also trigger OSes into disabling VGA due to
>> the lack of proper MMIO range, but sadly I think most OSes just assume
>> it's there.
> 
> Yes, VGA is kind of more difficult. Looking at all PCI devices'
> command words may provide a hint, as may looking at all PCI
> bridges' bridge control words.

Hm that seems like a rather convoluted procedure, and this needs to be
available very early on during the boot process usually.

>> PIT: assumed to be always present in the PC architecture.
> 
> See PIC above.

At least on FreeBSD PIT is used much earlier than parsing any ACPI
tables (it's used to implement a busy-wait DELAY routine), so I don't
think it's sensible to tie this device to ACPI. Also see my note above
about requiring ACPI in order to signal all of this.

> 
>> PM: I'm leaning to split this into ACPI_PM and ACPI_TIMER as said
>> before. ACPI_TIMER presence it's contained inside of ACPI tables, and
>> the availability of ACPI_PM (power management) can be inferred from the
>> presence of ACPI itself.
> 
> As indicated in the original discussion, I don't think these should
> have been separate flags anyway - either you have ACPI (then
> you have indication of both in FADT), or there is no ACPI.

Right. I think this is something internal that's used between the
toolstack and Xen in order to tell Xen whether it should add those
handlers or not, because AFAIK Xen doesn't know if a certain domain has
ACPI or not.

It should not be exposed in any way to the guest except when using ACPI
tables, that will contain the appropriate value.

>> AMD guest IOMMU: AFAICT this seems to be currently disabled, since the
>> MMIO range it checks is [~0ULL, ~0ULL + 0x8000]. There is a function to
>> change the base address ~0ULL to something else, but it doesn't seem to
>> be reachable from any path. In any case, I guess the presence of this
>> device will be reported from ACPI.
> 
> Yes, the implementation is incomplete (abandoned?), but IOMMU
> presence can always be determined by the guest through
> inspecting its ACPI tables.
> 
>> So, we have the following devices that are assumed to be there: RTC,
>> PIC, PIT. Everything else I think can be signalled by other means
>> already available.
>>
>> IMHO, I think we could say that the PIC is never going to be available
>> to HVMlite guests (in any case we would enable the lapic/ioapic), and
>> maybe enable the RTC and PIT by default?
> 
> That may be a sane initial setup, but with the ACPI flags named
> above we may be able to expressed even their absence.

I still think we should probably enable those, because they tend to be
used very early on boot, before parsing ACPI tables, and in general are
considered to be always there on PCs.

Also, in order to be able to express their absence we would have to
provide ACPI tables to DomUs, which IMHO, is not something we wish to do
right now.

>> Then I think we could get away without any Xen-specific way of reporting
>> enabled devices.
> 
> Indeed - that should be the preferred goal.

Thanks for the feedback, Roger.
Roger Pau Monné Jan. 22, 2016, 2:59 p.m. UTC | #6
El 22/01/16 a les 14.34, Andrew Cooper ha escrit:
> On 22/01/16 12:43, Roger Pau Monné wrote:
>> El 22/01/16 a les 11.57, Jan Beulich ha escrit:
>>>>>> On 21.01.16 at 17:51, <roger.pau@citrix.com> wrote:
>>>> Add a new HVM-specific feature flag that signals the presence of a bitmap
>>>> that contains the current set of enabled emulated devices. The bitmap is
>>>> placed in the ecx register. The bit fields used in the bitmap are the same
>>>> as the ones used in the xen_arch_domainconfig emulation_flags field, and
>>>> their meaning can be found at arch-x86/xen.h.
>>>>
>>>> This will allow Xen to enable emulated devices for HVMlite guests in the
>>>> future, by having a proper ABI for reporting which devices are enabled.
>>> The idea is certainly nice and appreciated, but ...
>>>
>>>> --- a/xen/include/public/arch-x86/cpuid.h
>>>> +++ b/xen/include/public/arch-x86/cpuid.h
>>>> @@ -78,12 +78,17 @@
>>>>   * HVM-specific features
>>>>   * EAX: Features
>>>>   * EBX: vcpu id (iff EAX has XEN_HVM_CPUID_VCPU_ID_PRESENT flag)
>>>> + * ECX: bitmap of enabled devices, according to the bit fields defined in
>>>> + *      arch-x86/xen.h.
>>> ... this set of definitions is not currently a stable ABI (limited to
>>> hypervisor and tool stack), and if we wanted to make it stable
>>> we'd first need to think a little about the complications that may
>>> arise if the granularity chosen (think about the PM bit and the
>>> discussion around it before your changes went in) turns out to
>>> be a problem later on.
>> Yes, in fact I'm having second thoughts on the PM flag, and I think I
>> should have split it into ACPI_PM and ACPI_TIMER instead.
>>
>>> Also at least some of the features can be determined by other
>>> means (CPUID, ACPI tables), so I'm not even sure we need all
>>> of this, and I'd really prefer to avoid multiple distinct ways to
>>> learn of a certain feature, as it's too easy for the two (or more)
>>> mechanisms to get out of sync.
>> So let's look at the flags and whether there's an existing way to signal
>> it's presence:
>>
>> LAPIC: CPUID.01h:EDX[bit 9]
>> IOAPIC: tied to LAPIC (so either both enabled or none).
> 
> An IOAPIC is by no means required - they are only for turning legacy
> interrupts into MSIs.  It would be perfectly fine for a PVH domain to
> have an LAPIC and an SRIOV virtual function, without an IOAPIC at all.
> 
> The presence of LAPICs and IOAPICs reside in the MADT ACPI table.

Right, so as said in the reply to Jan, we will require ACPI in order to
enable any of these pieces. I don't have a problem with that, I just wasn't
sure if this requirement was desired.

If that's the plan, then I think we would also need to fixup the tables
provided to Dom0 in order to match what's available, but that can be
discussed later.

> Note also that the cpuid bit is a fastforward of the hardware enable bit
> in the APIC_BASE MSR.  The cpuid bit will disappear from view if you
> hardware-disable the LAPIC.

Right, it looks like ACPI is the best way to decide.

>>
>> HPET: can only be enabled from/with ACPI, since it's base memory address
>> is not fixed, and we would need to find a way to pass it's address to
>> the OS in the absence of ACPI.
> 
> In reality, there are heuristics to guess if an HPET is present.  The
> legacy HPET traditionally always resides at pfn fed000.  Linux even has
> heuristics to find the legacy HPET based on the IOH, for when the BIOS
> doesn't present the HPET properly in ACPI.

Heh, if we already require ACPI in order to discover local APICs and IO
APICs, I don't think it hurts to also require it in order to discover HPET.

>> RTC: I don't know of any way to signal the RTC presence, AFAICT it's
>> always assumed to be there in the PC architecture. Could maybe return ~0
>> when reading from IO port 0x71, but that's meh..., not the best way IMHO.
>>
>> PIC: same as RTC, I don't know of any way to signal it's presence since
>> it's assumed to be there.
>>
>> VGA: again I don't think there's an easy way to signal it's presence,
>> apart from returning ~0 from the multiple IO ports it uses. The fact
>> that the 0xA0000-0xBFFFF memory range is also marked as RAM in the e820
>> map in HVMlite DomUs should also trigger OSes into disabling VGA due to
>> the lack of proper MMIO range, but sadly I think most OSes just assume
>> it's there.
> 
> VGA can be found by following the VGA routing bit in PCI config space. 
> This is how real hardware makes the legacy IO ranges reach the graphics
> card configured as the primary vga device.

Hm, I have to look into this, are there any examples of this mechanism
out there?

>>
>> PIT: assumed to be always present in the PC architecture.
> 
> PIT, RTC and PIC have their presence always assumed, but returning ~0 on
> reads is completely fine.  A DMLite OS knows it is booting in a
> virtualised environment.

Yes, that's fine, I'm completely disabling the attachment of those
devices when entering from the Xen entry point ATM on FreeBSD, but how
are we going to notify the OS when they are actually available?

Just by returning something different from ~0 when poking at their IO
ports? Doesn't look like a very robust way IMHO.

> 
>>
>> PM: I'm leaning to split this into ACPI_PM and ACPI_TIMER as said
>> before. ACPI_TIMER presence it's contained inside of ACPI tables, and
>> the availability of ACPI_PM (power management) can be inferred from the
>> presence of ACPI itself.
>>
>> AMD guest IOMMU: AFAICT this seems to be currently disabled, since the
>> MMIO range it checks is [~0ULL, ~0ULL + 0x8000]. There is a function to
>> change the base address ~0ULL to something else, but it doesn't seem to
>> be reachable from any path. In any case, I guess the presence of this
>> device will be reported from ACPI.
> 
> It is indeed currently disabled  (See
> https://bugs.xenserver.org/browse/XSO-132 if you want to see why.  It
> manifested as a very curious bug).
> 
> It will be available via an IVRS ACPI table when implemented.
> 
>>
>> So, we have the following devices that are assumed to be there: RTC,
>> PIC, PIT. Everything else I think can be signalled by other means
>> already available.
>>
>> IMHO, I think we could say that the PIC is never going to be available
>> to HVMlite guests (in any case we would enable the lapic/ioapic), and
>> maybe enable the RTC and PIT by default?
>>
>> Then I think we could get away without any Xen-specific way of reporting
>> enabled devices.
> 
> DMLite is a new container type.  I would far rather it was assumed that
> there was no legacy hardware at all.

So I take it that you are in favour of only considering enabling the local
APIC and IO APIC maybe for HVMlite, because of the performance benefits,
while the other devices are _never_ going to be available to HVMlite
guests/hosts at all? (Dom0 already gets the hw VGA)

IMHO, I would like to be able to eventually enable them in order to
provide an environment that's as close as possible to a compatible PC,
in order to reduce the amount of changes required in order to port an OS
to run in this mode.
Jan Beulich Jan. 22, 2016, 3:02 p.m. UTC | #7
>>> On 22.01.16 at 15:41, <roger.pau@citrix.com> wrote:
> El 22/01/16 a les 14.24, Jan Beulich ha escrit:
>>>>> On 22.01.16 at 13:43, <roger.pau@citrix.com> wrote:
>>> RTC: I don't know of any way to signal the RTC presence, AFAICT it's
>>> always assumed to be there in the PC architecture. Could maybe return ~0
>>> when reading from IO port 0x71, but that's meh..., not the best way IMHO.
>> 
>> There actually is an RTC-absent flag in the FADT, which the
>> hypervisor itself actually looks at (ACPI_FADT_NO_CMOS_RTC).
> 
> So most of this assumes that if we ever want to enable any of those
> devices we will provide ACPI tables to the guest?

We could check whether exposing SFI tables to the guest would
be a simpler mechanism allowing enough control.

In the absence of ACPI we need to settle on defaults: As Andrew
has said, contemporary logic would imply no legacy devices for an
environment that can be (made) aware of such.

> The RTC can also be used as an interrupt source, which I think it's not
> covered by the ACPI_FADT_NO_CMOS_RTC flag.

Certainly if there's no RTC device, then there's also no legacy
IRQ8.

>>> VGA: again I don't think there's an easy way to signal it's presence,
>>> apart from returning ~0 from the multiple IO ports it uses. The fact
>>> that the 0xA0000-0xBFFFF memory range is also marked as RAM in the e820
>>> map in HVMlite DomUs should also trigger OSes into disabling VGA due to
>>> the lack of proper MMIO range, but sadly I think most OSes just assume
>>> it's there.
>> 
>> Yes, VGA is kind of more difficult. Looking at all PCI devices'
>> command words may provide a hint, as may looking at all PCI
>> bridges' bridge control words.
> 
> Hm that seems like a rather convoluted procedure, and this needs to be
> available very early on during the boot process usually.

As long as the legacy MMIO address range isn't re-used by some
other device, having an OS blindly write to that range is quite okay
(and common practice) I think. Iiuc you think about getting log
messages out early?

>>> PIT: assumed to be always present in the PC architecture.
>> 
>> See PIC above.
> 
> At least on FreeBSD PIT is used much earlier than parsing any ACPI
> tables (it's used to implement a busy-wait DELAY routine), so I don't
> think it's sensible to tie this device to ACPI. Also see my note above
> about requiring ACPI in order to signal all of this.

See above: Quite likely you will need to do away with using PIT when
run as HVMlite guest.

>>> So, we have the following devices that are assumed to be there: RTC,
>>> PIC, PIT. Everything else I think can be signalled by other means
>>> already available.
>>>
>>> IMHO, I think we could say that the PIC is never going to be available
>>> to HVMlite guests (in any case we would enable the lapic/ioapic), and
>>> maybe enable the RTC and PIT by default?
>> 
>> That may be a sane initial setup, but with the ACPI flags named
>> above we may be able to expressed even their absence.
> 
> I still think we should probably enable those, because they tend to be
> used very early on boot, before parsing ACPI tables, and in general are
> considered to be always there on PCs.

Not if you think in the modern, legacy-free way.

Jan
Jan Beulich Jan. 22, 2016, 3:31 p.m. UTC | #8
>>> On 22.01.16 at 15:59, <roger.pau@citrix.com> wrote:
> El 22/01/16 a les 14.34, Andrew Cooper ha escrit:
>> On 22/01/16 12:43, Roger Pau Monné wrote:
>>> IOAPIC: tied to LAPIC (so either both enabled or none).
>> 
>> An IOAPIC is by no means required - they are only for turning legacy
>> interrupts into MSIs.  It would be perfectly fine for a PVH domain to
>> have an LAPIC and an SRIOV virtual function, without an IOAPIC at all.
>> 
>> The presence of LAPICs and IOAPICs reside in the MADT ACPI table.
> 
> Right, so as said in the reply to Jan, we will require ACPI in order to
> enable any of this pieces. I don't have a problem with that, just wasn't
> sure if this requirement was desired.

The question is whether a non-Dom0 HVMlite guest needs any
of these in the first place. Because if it doesn't (and that's the
mode we provide right now), making them dependent on ACPI
availability should be quite fine: If we need them for some
purpose in the guest, we'd need to make ACPI tables available.

> If that's the plan, then I think we would also need to fixup the tables
> provided to Dom0 in order to match what's available, but that can be
> discussed later.

I don't see what you're getting at here: The IO-APIC information
should be usable as is for Dom0 (as is the case for PV).

>>> So, we have the following devices that are assumed to be there: RTC,
>>> PIC, PIT. Everything else I think can be signalled by other means
>>> already available.
>>>
>>> IMHO, I think we could say that the PIC is never going to be available
>>> to HVMlite guests (in any case we would enable the lapic/ioapic), and
>>> maybe enable the RTC and PIT by default?
>>>
>>> Then I think we could get away without any Xen-specific way of reporting
>>> enabled devices.
>> 
>> DMLite is a new container type.  I would far rather it was assumed that
>> there was no legacy hardware at all.
> 
> So I take it that you are in favour of only considering enabling the local
> APIC and IO APIC maybe for HVMlite, because of the performance benefits,
> while the other devices are _never_ going to be available to HVMlite
> guests/hosts at all? (Dom0 already gets the hw VGA)
> 
> IMHO, I would like to be able to eventually enable them in order to
> provide an environment that's as close as possible to a compatible PC,
> in order to reduce the amount of changes required in order to port an OS
> to run in this mode.

I didn't think that's among the goals for HVMlite. Just like PV and PVH,
HVMlite requires OS awareness.

Jan
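
For reference, Andrew's point that LAPIC and IOAPIC presence is described by the MADT can be sketched as a walk over the table's entries. This is only an illustration under stated assumptions: the helper name `madt_summarize` and the `madt_summary` struct are hypothetical, the layout constants (44-byte fixed part, entry type 0 for a processor local APIC, type 1 for an I/O APIC) are taken from the ACPI specification, and locating or checksumming the table is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed from the ACPI spec: 36-byte table header, then the 32-bit
 * local APIC address and 32-bit flags, then the variable-length entries. */
#define MADT_ENTRIES_OFFSET 44u
#define MADT_TYPE_LAPIC     0u  /* Processor Local APIC entry */
#define MADT_TYPE_IOAPIC    1u  /* I/O APIC entry */

/* Hypothetical result type, not part of any Xen or ACPICA header. */
struct madt_summary {
    unsigned int lapics;
    unsigned int ioapics;
};

/*
 * Hypothetical helper: count LAPIC and IOAPIC entries in a raw MADT of
 * `len` bytes. Returns 0 on success, -1 if the table is malformed.
 * It does not look at the per-entry "enabled" flag.
 */
static int madt_summarize(const uint8_t *madt, size_t len,
                          struct madt_summary *out)
{
    size_t off = MADT_ENTRIES_OFFSET;

    out->lapics = 0;
    out->ioapics = 0;
    if (len < MADT_ENTRIES_OFFSET)
        return -1;

    /* Each entry starts with a one-byte type and a one-byte length. */
    while (off + 2 <= len) {
        uint8_t type = madt[off];
        uint8_t elen = madt[off + 1];

        if (elen < 2 || off + elen > len)
            return -1;
        if (type == MADT_TYPE_LAPIC)
            out->lapics++;
        else if (type == MADT_TYPE_IOAPIC)
            out->ioapics++;
        off += elen;
    }

    /* A stray trailing byte means the table was truncated. */
    return off == len ? 0 : -1;
}
```

A guest seeing `lapics > 0` and `ioapics == 0` would match the LAPIC-without-IOAPIC configuration described above.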
Roger Pau Monné Jan. 22, 2016, 3:41 p.m. UTC | #9
On 22/01/16 at 16:02, Jan Beulich wrote:
>>>> On 22.01.16 at 15:41, <roger.pau@citrix.com> wrote:
>> On 22/01/16 at 14:24, Jan Beulich wrote:
>>>>>> On 22.01.16 at 13:43, <roger.pau@citrix.com> wrote:
>>>> RTC: I don't know of any way to signal the RTC presence, AFAICT it's
>>>> always assumed to be there in the PC architecture. Could maybe return ~0
>>>> when reading from IO port 0x71, but that's meh..., not the best way IMHO.
>>>
>>> There actually is an RTC-absent flag in the FADT, which the
>>> hypervisor itself actually looks at (ACPI_FADT_NO_CMOS_RTC).
>>
>> So most of this assumes that if we ever want to enable any of those
>> devices we will provide ACPI tables to the guest?
> 
> We could check whether exposing SFI tables to the guest would
> be a simpler mechanism allowing enough control.
> 
> In the absence of ACPI we need to settle on defaults: As Andrew
> has said, contemporary logic would imply no legacy devices for an
> environment that can be (made) aware of such.
> 
>> The RTC can also be used as an interrupt source, which I think is not
>> covered by the ACPI_FADT_NO_CMOS_RTC flag.
> 
> Certainly if there's no RTC device, then there's also no legacy
> IRQ8.
> 
>>>> VGA: again I don't think there's an easy way to signal its presence,
>>>> apart from returning ~0 from the multiple IO ports it uses. The fact
>>>> that the 0xA0000-0xBFFFF memory range is also marked as RAM in the e820
>>>> map in HVMlite DomUs should also trigger OSes into disabling VGA due to
>>>> the lack of proper MMIO range, but sadly I think most OSes just assume
>>>> it's there.
>>>
>>> Yes, VGA is kind of more difficult. Looking at all PCI devices'
>>> command words may provide a hint, as may looking at all PCI
>>> bridges' bridge control words.
>>
>> Hm that seems like a rather convoluted procedure, and this needs to be
>> available very early on during the boot process usually.
> 
> As long as the legacy MMIO address range isn't re-used by some
> other device, having an OS blindly write to that range is quite okay
> (and common practice) I think. Iiuc you think about getting log
> messages out early?
> 
>>>> PIT: assumed to be always present in the PC architecture.
>>>
>>> See PIC above.
>>
>> At least on FreeBSD PIT is used much earlier than parsing any ACPI
>> tables (it's used to implement a busy-wait DELAY routine), so I don't
>> think it's sensible to tie this device to ACPI. Also see my note above
>> about requiring ACPI in order to signal all of this.
> 
> See above: Quite likely you will need to do away with using PIT when
> run as HVMlite guest.

Oh yes, that's certainly not a problem for FreeBSD. I've already
replaced the usage of the PIT during early boot with the PV timer. I was
just pointing this out in order to make it easier for other OSes to
adopt HVMlite; I wouldn't be surprised to find out that other OSes also
use the PIT as a universally available early delay source.

>>>> So, we have the following devices that are assumed to be there: RTC,
>>>> PIC, PIT. Everything else I think can be signalled by other means
>>>> already available.
>>>>
>>>> IMHO, I think we could say that the PIC is never going to be available
>>>> to HVMlite guests (in any case we would enable the lapic/ioapic), and
>>>> maybe enable the RTC and PIT by default?
>>>
>>> That may be a sane initial setup, but with the ACPI flags named
>>> above we may be able to express even their absence.
>>
>> I still think we should probably enable those, because they tend to be
>> used very early on boot, before parsing ACPI tables, and in general are
>> considered to be always there on PCs.
> 
> Not if you think in the modern, legacy-free way.

Ack.

Roger.
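
For reference, the FADT check mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the helper name `fadt_rtc_absent` is hypothetical, and the bit value mirrors the ACPI specification's "CMOS RTC Not Present" bit (bit 5 of the IA-PC Boot Architecture Flags field), which is what ACPI_FADT_NO_CMOS_RTC names in ACPICA-derived headers. Locating and validating the FADT is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match the ACPI spec: bit 5 of the FADT IA-PC Boot
 * Architecture Flags field ("CMOS RTC Not Present"). Redefined here
 * only to keep the sketch self-contained. */
#define ACPI_FADT_NO_CMOS_RTC (1u << 5)

/*
 * Hypothetical helper: boot_flags is the 16-bit IA-PC Boot Architecture
 * Flags value read from the FADT. Returns true when the firmware says
 * that no CMOS RTC is present.
 */
static bool fadt_rtc_absent(uint16_t boot_flags)
{
    return (boot_flags & ACPI_FADT_NO_CMOS_RTC) != 0;
}
```

An OS that probes the RTC before parsing any ACPI tables cannot use this check, which is the ordering concern raised in the thread.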
Roger Pau Monné Jan. 22, 2016, 3:51 p.m. UTC | #10
On 22/01/16 at 16:31, Jan Beulich wrote:
>>>> On 22.01.16 at 15:59, <roger.pau@citrix.com> wrote:
>> On 22/01/16 at 14:34, Andrew Cooper wrote:
>>> On 22/01/16 12:43, Roger Pau Monné wrote:
>>>> IOAPIC: tied to LAPIC (so either both enabled or none).
>>>
>>> An IOAPIC is by no means required - they are only for turning legacy
>>> interrupts into MSIs.  It would be perfectly fine for a PVH domain to
>>> have an LAPIC and an SRIOV virtual function, without an IOAPIC at all.
>>>
>>> The presence of LAPICs and IOAPICs is described in the MADT ACPI table.
>>
>> Right, so as said in the reply to Jan, we will require ACPI in order to
>> enable any of these pieces. I don't have a problem with that, just wasn't
>> sure if this requirement was desired.
> 
> The question is whether a non-Dom0 HVMlite guest needs any
> of these in the first place. Because if it doesn't (and that's the
> mode we provide right now), making them dependent on ACPI
> availability should be quite fine: If we need them for some
> purpose in the guest, we'd need to make ACPI tables available.

I only foresee non-Dom0 HVMlite guests with PCI passthrough ever
requiring any of those, but that's still a long shot...

>> If that's the plan, then I think we would also need to fixup the tables
>> provided to Dom0 in order to match what's available, but that can be
>> discussed later.
> 
> I don't see what you're getting at here: The IO-APIC information
> should be usable as is for Dom0 (as is the case for PV).

Well, the information is usable, but the IO-APIC is not accessible by
Dom0; you need to use PHYSDEV ops in order to manage it. One of my aims
would be to get rid of quite a few of the PHYSDEV/PIRQ ops for Dom0, and
instead provide emulated local and IO APICs. That would also allow us
to make use of posted-interrupts I assume (although I certainly need to
look more deeply into this).

In this case, how are we going to signal to Dom0 that the local and IO
APICs reported by ACPI are usable using the bare metal interfaces?

>>>> So, we have the following devices that are assumed to be there: RTC,
>>>> PIC, PIT. Everything else I think can be signalled by other means
>>>> already available.
>>>>
>>>> IMHO, I think we could say that the PIC is never going to be available
>>>> to HVMlite guests (in any case we would enable the lapic/ioapic), and
>>>> maybe enable the RTC and PIT by default?
>>>>
>>>> Then I think we could get away without any Xen-specific way of reporting
>>>> enabled devices.
>>>
>>> DMLite is a new container type.  I would far rather it was assumed that
>>> there was no legacy hardware at all.
>>
>> So I take it that you are in favour of only considering enabling the local
>> APIC and IO APIC maybe for HVMlite, because of the performance benefits,
>> while the other devices are _never_ going to be available to HVMlite
>> guests/hosts at all? (Dom0 already gets the hw VGA)
>>
>> IMHO, I would like to be able to eventually enable them in order to
>> provide an environment that's as close as possible to a compatible PC,
>> in order to reduce the amount of changes required in order to port an OS
>> to run in this mode.
> 
> I didn't think that's among the goals for HVMlite. Just like PV and PVH,
> HVMlite requires OS awareness.

Right, just getting rid of the PHYSDEV/PIRQ operations would be a big
win IMHO.

Roger.
Jan Beulich Jan. 25, 2016, 11:23 a.m. UTC | #11
>>> On 22.01.16 at 16:51, <roger.pau@citrix.com> wrote:
> On 22/01/16 at 16:31, Jan Beulich wrote:
>>>>> On 22.01.16 at 15:59, <roger.pau@citrix.com> wrote:
>>> If that's the plan, then I think we would also need to fixup the tables
>>> provided to Dom0 in order to match what's available, but that can be
>>> discussed later.
>> 
>> I don't see what you're getting at here: The IO-APIC information
>> should be usable as is for Dom0 (as is the case for PV).
> 
> Well, the information is usable, but the IO-APIC is not accessible by
> Dom0; you need to use PHYSDEV ops in order to manage it. One of my aims
> would be to get rid of quite a few of the PHYSDEV/PIRQ ops for Dom0, and
> instead provide emulated local and IO APICs. That would also allow us
> to make use of posted-interrupts I assume (although I certainly need to
> look more deeply into this).

Wait - you're aware that some of the IO-APIC setup in the
hypervisor requires Dom0's help? Presenting it with other than
the real ACPI tables would seem to make this impossible.

> In this case, how are we going to signal to Dom0 that the local and IO
> APICs reported by ACPI are usable using the bare metal interfaces?

With the above I'm not sure the answer to this would really matter.

Jan

Patch

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 674feea..f0145f6 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -4536,6 +4536,10 @@  void hvm_hypervisor_cpuid_leaf(uint32_t sub_idx,
         /* Indicate presence of vcpu id and set it in ebx */
         *eax |= XEN_HVM_CPUID_VCPU_ID_PRESENT;
         *ebx = current->vcpu_id;
+
+        /* Indicate the presence of the devices bitmap in ecx. */
+        *eax |= XEN_HVM_CPUID_DEVICES_BITMAP;
+        *ecx = current->domain->arch.emulation_flags;
     }
 }
 
diff --git a/xen/include/public/arch-x86/cpuid.h b/xen/include/public/arch-x86/cpuid.h
index d709340..7222483 100644
--- a/xen/include/public/arch-x86/cpuid.h
+++ b/xen/include/public/arch-x86/cpuid.h
@@ -78,12 +78,17 @@ 
  * HVM-specific features
  * EAX: Features
  * EBX: vcpu id (iff EAX has XEN_HVM_CPUID_VCPU_ID_PRESENT flag)
+ * ECX: bitmap of enabled devices, according to the bit fields defined in
+ *      arch-x86/xen.h. All unused bits have undefined values. The contents
+ *      of this register are only valid if EAX has the
+ *      XEN_HVM_CPUID_DEVICES_BITMAP flag set.
  */
 #define XEN_HVM_CPUID_APIC_ACCESS_VIRT (1u << 0) /* Virtualized APIC registers */
 #define XEN_HVM_CPUID_X2APIC_VIRT      (1u << 1) /* Virtualized x2APIC accesses */
 /* Memory mapped from other domains has valid IOMMU entries */
 #define XEN_HVM_CPUID_IOMMU_MAPPINGS   (1u << 2)
 #define XEN_HVM_CPUID_VCPU_ID_PRESENT  (1u << 3) /* vcpu id is present in EBX */
+#define XEN_HVM_CPUID_DEVICES_BITMAP   (1u << 4) /* device bitmap in ECX */
 
 #define XEN_CPUID_MAX_NUM_LEAVES 4
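
For illustration, a guest could consume the interface proposed by this patch roughly as sketched below. The helper name `xen_hvm_devices_bitmap` is hypothetical, and the two XEN_X86_EMU_* bit positions are assumptions that should be checked against arch-x86/xen.h; only XEN_HVM_CPUID_DEVICES_BITMAP is taken from the hunk above. The sketch decodes register values the caller has already read from the HVM-specific CPUID leaf; executing CPUID and locating the Xen leaf base are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Taken from the patch above. */
#define XEN_HVM_CPUID_DEVICES_BITMAP (1u << 4)

/* Assumed bit positions of two emulation_flags bits; verify against
 * arch-x86/xen.h before relying on them. */
#define XEN_X86_EMU_LAPIC (1u << 0)
#define XEN_X86_EMU_RTC   (1u << 3)

/*
 * Hypothetical helper: eax and ecx are the values returned by the
 * HVM-specific features leaf. Returns false, leaving *devices untouched,
 * when the hypervisor does not advertise the bitmap. Unused bits in ecx
 * are undefined, so callers should only test bits they know about.
 */
static bool xen_hvm_devices_bitmap(uint32_t eax, uint32_t ecx,
                                   uint32_t *devices)
{
    if (!(eax & XEN_HVM_CPUID_DEVICES_BITMAP))
        return false;

    *devices = ecx;
    return true;
}
```

A caller would then test individual bits, for example `devices & XEN_X86_EMU_RTC`, and fall back to its own defaults when the helper returns false.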