Message ID | 20170529172948.3883-1-roger.pau@citrix.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
(Using the correct address for Julien) On Mon, May 29, 2017 at 06:29:48PM +0100, Roger Pau Monne wrote: > The current misc/pvh.markdown document refers to PVHv1, remove it to > avoid confusion with PVHv2 since the PVHv1 code has already been > removed. > > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> > --- > Cc: Andrew Cooper <andrew.cooper3@citrix.com> > Cc: George Dunlap <George.Dunlap@eu.citrix.com> > Cc: Ian Jackson <ian.jackson@eu.citrix.com> > Cc: Jan Beulich <jbeulich@suse.com> > Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > Cc: Stefano Stabellini <sstabellini@kernel.org> > Cc: Tim Deegan <tim@xen.org> > Cc: Wei Liu <wei.liu2@citrix.com> > Cc: Julien Grall <julien.grall@citrix.com> > --- > docs/misc/pvh.markdown | 377 ------------------------------------------------- > 1 file changed, 377 deletions(-) > delete mode 100644 docs/misc/pvh.markdown > > diff --git a/docs/misc/pvh.markdown b/docs/misc/pvh.markdown > deleted file mode 100644 > index 52d8e743e7..0000000000 > --- a/docs/misc/pvh.markdown > +++ /dev/null > @@ -1,377 +0,0 @@ > -# PVH Specification # > - > -## Rationale ## > - > -PVH is a new kind of guest that has been introduced on Xen 4.4 as a DomU, and > -on Xen 4.5 as a Dom0. The aim of PVH is to make use of the hardware > -virtualization extensions present in modern x86 CPUs in order to > -improve performance. > - > -PVH is considered a mix between PV and HVM, and can be seen as a PV guest > -that runs inside of an HVM container, or as a PVHVM guest without any emulated > -devices. The design goal of PVH is to provide the best performance possible and > -to reduce the amount of modifications needed for a guest OS to run in this mode > -(compared to pure PV). > - > -This document tries to describe the interfaces used by PVH guests, focusing > -on how an OS should make use of them in order to support PVH. > - > -## Early boot ## > - > -PVH guests use the PV boot mechanism, that means that the kernel is loaded and > -directly launched by Xen (by jumping into the entry point). In order to do this > -Xen ELF Notes need to be added to the guest kernel, so that they contain the > -information needed by Xen. Here is an example of the ELF Notes added to the > -FreeBSD amd64 kernel in order to boot as PVH: > - > - ELFNOTE(Xen, XEN_ELFNOTE_GUEST_OS, .asciz, "FreeBSD") > - ELFNOTE(Xen, XEN_ELFNOTE_GUEST_VERSION, .asciz, __XSTRING(__FreeBSD_version)) > - ELFNOTE(Xen, XEN_ELFNOTE_XEN_VERSION, .asciz, "xen-3.0") > - ELFNOTE(Xen, XEN_ELFNOTE_VIRT_BASE, .quad, KERNBASE) > - ELFNOTE(Xen, XEN_ELFNOTE_PADDR_OFFSET, .quad, KERNBASE) > - ELFNOTE(Xen, XEN_ELFNOTE_ENTRY, .quad, xen_start) > - ELFNOTE(Xen, XEN_ELFNOTE_HYPERCALL_PAGE, .quad, hypercall_page) > - ELFNOTE(Xen, XEN_ELFNOTE_HV_START_LOW, .quad, HYPERVISOR_VIRT_START) > - ELFNOTE(Xen, XEN_ELFNOTE_FEATURES, .asciz, "writable_descriptor_tables|auto_translated_physmap|supervisor_mode_kernel|hvm_callback_vector") > - ELFNOTE(Xen, XEN_ELFNOTE_PAE_MODE, .asciz, "yes") > - ELFNOTE(Xen, XEN_ELFNOTE_L1_MFN_VALID, .long, PG_V, PG_V) > - ELFNOTE(Xen, XEN_ELFNOTE_LOADER, .asciz, "generic") > - ELFNOTE(Xen, XEN_ELFNOTE_SUSPEND_CANCEL, .long, 0) > - ELFNOTE(Xen, XEN_ELFNOTE_BSD_SYMTAB, .asciz, "yes") > - > -On the Linux side, the above can be found in `arch/x86/xen/xen-head.S`. > - > -It is important to highlight the following notes: > - > - * `XEN_ELFNOTE_ENTRY`: contains the virtual memory address of the kernel entry > - point. > - * `XEN_ELFNOTE_HYPERCALL_PAGE`: contains the virtual memory address of the > - hypercal page inside of the guest kernel (this memory region will be filled > - by Xen prior to booting). > - * `XEN_ELFNOTE_FEATURES`: contains the list of features supported by the kernel. > - In the example above the kernel is only able to boot as a PVH guest, but > - those options can be mixed with the ones used by pure PV guests in order to > - have a kernel that supports both PV and PVH (like Linux). The list of > - options available can be found in the `features.h` public header. Note that > - in the example above `hvm_callback_vector` is in `XEN_ELFNOTE_FEATURES`. > - Older hypervisors will balk at this being part of it, so it can also be put > - in `XEN_ELFNOTE_SUPPORTED_FEATURES` which older hypervisors will ignore. > - > -Xen will jump into the kernel entry point defined in `XEN_ELFNOTE_ENTRY` with > -paging enabled (either long mode or protected mode with paging turned on > -depending on the kernel bitness) and some basic page tables setup. An important > -distinction for a 64bit PVH is that it is launched at privilege level 0 as > -opposed to a 64bit PV guest which is launched at privilege level 3. > - > -Also, the `rsi` (`esi` on 32bits) register is going to contain the virtual > -memory address where Xen has placed the `start_info` structure. The `rsp` (`esp` > -on 32bits) will point to the top of an initial single page stack, that can be > -used by the guest kernel. The `start_info` structure contains all the info the > -guest needs in order to initialize. More information about the contents can be > -found in the `xen.h` public header. > - > -### Initial amd64 control registers values ### > - > -Initial values for the control registers are set up by Xen before booting the > -guest kernel. The guest kernel can expect to find the following features > -enabled by Xen. > - > -`CR0` has the following bits set by Xen: > - > - * PE (bit 0): protected mode enable. > - * ET (bit 4): 387 or newer processor. > - * PG (bit 31): paging enabled. > - > -`CR4` has the following bits set by Xen: > - > - * PAE (bit 5): PAE enabled. > - > -And finally in `EFER` the following features are enabled: > - > - * LME (bit 8): Long mode enable. > - * LMA (bit 10): Long mode active. > - > -At least the following flags in `EFER` are guaranteed to be disabled: > - > - * SCE (bit 0): System call extensions disabled. > - * NXE (bit 11): No-Execute disabled. > - > -There's no guarantee about the state of the other bits in the `EFER` register. > - > -All the segments selectors are set with a flat base at zero. > - > -The `cs` segment selector attributes are set to 0x0a09b, which describes an > -executable and readable code segment only accessible by the most privileged > -level. The segment is also set as a 64-bit code segment (`L` flag set, `D` flag > -unset). > - > -The remaining segment selectors (`ds`, `ss`, `es`, `fs` and `gs`) are all set > -to the same values. The attributes are set to 0x0c093, which implies a read and > -write data segment only accessible by the most privileged level. > - > -The `FS.base`, `GS.base` and `KERNEL_GS.base` MSRs are zeroed out. > - > -The `IDT` and `GDT` are also zeroed, so the guest must be specially careful to > -not trigger a fault until after they have been properly set. The way of setting > -the IDT and the GDT is using the native instructions as would be done on bare > -metal. > - > -The `RFLAGS` register is guaranteed to be clear when jumping into the kernel > -entry point, with the exception of the reserved bit 1 set. > - > -## Memory ## > - > -Since PVH guests rely on virtualization extensions provided by the CPU, they > -have access to a hardware virtualized MMU, which means page-table related > -operations should use the same instructions used on native. > - > -There are however some differences with native. The usage of native MTRR > -operations is forbidden, and `XENPF_*_memtype` hypercalls should be used > -instead. This can be avoided by simply not using MTRR and setting all the > -memory attributes using PAT, which doesn't require the usage of any hypercalls. > - > -Since PVH doesn't use a BIOS in order to boot, the physical memory map has > -to be retrieved using the `XENMEM_memory_map` hypercall, which will return > -an e820 map. This memory map might contain holes that describe MMIO regions, > -that will be already setup by Xen. > - > -*TODO*: we need to figure out what to do with MMIO regions, right now Xen > -sets all the holes in the native e820 to MMIO regions for Dom0 up to 4GB. We > -need to decide what to do with MMIO regions above 4GB on Dom0, and what to do > -for PVH DomUs with pci-passthrough. > - > -In the case of a guest started with memory != maxmem, the e820 memory map > -returned by Xen will contain the memory up to maxmem. The guest has to be very > -careful to only use the lower memory pages up to the value contained in > -`start_info->nr_pages` because any memory page above that value will not be > -populated. > - > -## Physical devices ## > - > -When running as Dom0 the guest OS has the ability to interact with the physical > -devices present in the system. A note should be made that PVH guests require > -a working IOMMU in order to interact with physical devices. > - > -The first step in order to manipulate the devices is to make Xen aware of > -them. Due to the fact that all the hardware description on x86 comes from > -ACPI, Dom0 is responsible for parsing the ACPI tables and notifying Xen about > -the devices it finds. This is done with the `PHYSDEVOP_pci_device_add` > -hypercall. > - > -*TODO*: explain the way to register the different kinds of PCI devices, like > -devices with virtual functions. > - > -## Interrupts ## > - > -All interrupts on PVH guests are routed over event channels, see > -[Event Channel Internals][event_channels] for more detailed information about > -event channels. In order to inject interrupts into the guest an IDT vector is > -used. This is the same mechanism used on PVHVM guests, and allows having > -per-cpu interrupts that can be used to deliver timers or IPIs. > - > -In order to register the callback IDT vector the `HVMOP_set_param` hypercall > -is used with the following values: > - > - domid = DOMID_SELF > - index = HVM_PARAM_CALLBACK_IRQ > - value = (0x2 << 56) | vector_value > - > -The OS has to program the IDT for the `vector_value` using the baremetal > -mechanism. > - > -In order to know which event channel has fired, we need to look into the > -information provided in the `shared_info` structure. The `evtchn_pending` > -array is used as a bitmap in order to find out which event channel has > -fired. Event channels can also be masked by setting it's port value in the > -`shared_info->evtchn_mask` bitmap. > - > -### Interrupts from physical devices ### > - > -When running as Dom0 (or when using pci-passthrough) interrupts from physical > -devices are routed over event channels. There are 3 different kind of > -physical interrupts that can be routed over event channels by Xen: IO APIC, > -MSI and MSI-X interrupts. > - > -Since physical interrupts usually need EOI (End Of Interrupt), Xen allows the > -registration of a memory region that will contain whether a physical interrupt > -needs EOI from the guest or not. This is done with the > -`PHYSDEVOP_pirq_eoi_gmfn_v2` hypercall that takes a parameter containing the > -physical address of the memory page that will act as a bitmap. Then in order to > -find out if an IRQ needs EOI or not, the OS can perform a simple bit test on the > -memory page using the PIRQ value. > - > -### IO APIC interrupt routing ### > - > -IO APIC interrupts can be routed over event channels using `PHYSDEVOP` > -hypercalls. First the IRQ is registered using the `PHYSDEVOP_map_pirq` > -hypercall, as an example IRQ#9 is used here: > - > - domid = DOMID_SELF > - type = MAP_PIRQ_TYPE_GSI > - index = 9 > - pirq = 9 > - > -The IRQ#9 is now registered as PIRQ#9. The triggering and polarity can also > -be configured using the `PHYSDEVOP_setup_gsi` hypercall: > - > - gsi = 9 # This is the IRQ value. > - triggering = 0 > - polarity = 0 > - > -In this example the IRQ would be configured to use edge triggering and high > -polarity. > - > -Finally the PIRQ can be bound to an event channel using the > -`EVTCHNOP_bind_pirq`, that will return the event channel port the PIRQ has been > -assigned. After this the event channel will be ready for delivery. > - > -*NOTE*: when running as Dom0, the guest has to parse the interrupt overrides > -found on the ACPI tables and notify Xen about them. > - > -### MSI ### > - > -In order to configure MSI interrupts for a device, Xen must be made aware of > -it's presence first by using the `PHYSDEVOP_pci_device_add` as described above. > -Then the `PHYSDEVOP_map_pirq` hypercall is used: > - > - domid = DOMID_SELF > - type = MAP_PIRQ_TYPE_MSI_SEG or MAP_PIRQ_TYPE_MULTI_MSI > - index = -1 > - pirq = -1 > - bus = pci_device_bus > - devfn = pci_device_function > - entry_nr = number of MSI interrupts > - > -The type has to be set to `MAP_PIRQ_TYPE_MSI_SEG` if only one MSI interrupt > -source is being configured. On devices that support MSI interrupt groups > -`MAP_PIRQ_TYPE_MULTI_MSI` can be used to configure them by also placing the > -number of MSI interrupts in the `entry_nr` field. > - > -The values in the `bus` and `devfn` field should be the same as the ones used > -when registering the device with `PHYSDEVOP_pci_device_add`. > - > -### MSI-X ### > - > -*TODO*: how to register/use them. > - > -## Event timers and timecounters ## > - > -Since some hardware is not available on PVH (like the local APIC), Xen provides > -the OS with suitable replacements in order to get the same functionality. One > -of them is the timer interface. Using a set of hypercalls, a guest OS can set > -event timers that will deliver and event channel interrupt to the guest. > - > -In order to use the timer provided by Xen the guest OS first needs to register > -a VIRQ event channel to be used by the timer to deliver the interrupts. The > -event channel is registered using the `EVTCHNOP_bind_virq` hypercall, that > -only takes two parameters: > - > - virq = VIRQ_TIMER > - vcpu = vcpu_id > - > -The port that's going to be used by Xen in order to deliver the interrupt is > -returned in the `port` field. Once the interrupt is set, the timer can be > -programmed using the `VCPUOP_set_singleshot_timer` hypercall. > - > - flags = VCPU_SSHOTTMR_future > - timeout_abs_ns = absolute value when the timer should fire > - > -It is important to notice that the `VCPUOP_set_singleshot_timer` hypercall must > -be executed from the same vCPU where the timer should fire, or else Xen will > -refuse to set it. This is a single-shot timer, so it must be set by the OS > -every time it fires if a periodic timer is desired. > - > -Xen also shares a memory region with the guest OS that contains time related > -values that are updated periodically. This values can be used to implement a > -timecounter or to obtain the current time. This information is placed inside of > -`shared_info->vcpu_info[vcpu_id].time`. The uptime (time since the guest has > -been launched) can be calculated using the following expression and the values > -stored in the `vcpu_time_info` struct: > - > - system_time + ((((tsc - tsc_timestamp) << tsc_shift) * tsc_to_system_mul) >> 32) > - > -The timeout that is passed to `VCPUOP_set_singleshot_timer` has to be > -calculated using the above value, plus the timeout the system wants to set. > - > -If the OS also wants to obtain the current wallclock time, the value calculated > -above has to be added to the values found in `shared_info->wc_sec` and > -`shared_info->wc_nsec`. > - > -## SMP discover and bring up ## > - > -The process of bringing up secondary CPUs is obviously different from native, > -since PVH doesn't have a local APIC. The first thing to do is to figure out > -how many vCPUs the guest has. This is done using the `VCPUOP_is_up` hypercall, > -using for example this simple loop: > - > - for (i = 0; i < MAXCPU; i++) { > - ret = HYPERVISOR_vcpu_op(VCPUOP_is_up, i, NULL); > - if (ret >= 0) > - /* vCPU#i is present */ > - } > - > -Note than when running as Dom0, the ACPI tables might report a different number > -of available CPUs. This is because the value on the ACPI tables is the > -number of physical CPUs the host has, and it might bear no resemblance with the > -number of vCPUs Dom0 actually has so it should be ignored. > - > -In order to bring up the secondary vCPUs they must be configured first. This is > -achieved using the `VCPUOP_initialise` hypercall. A valid context has to be > -passed to the vCPU in order to boot. The relevant fields for PVH guests are > -the following: > - > - * `flags`: contains `VGCF_*` flags (see `arch-x86/xen.h` public header). > - * `user_regs`: struct that contains the register values that will be set on > - the vCPU before booting. All GPRs are available to be set, however, the > - most relevant ones are `rip` and `rsp` in order to set the start address > - and the stack. Please note, all selectors must be null. > - * `ctrlreg[3]`: contains the address of the page tables that will be used by > - the vCPU. Other control registers should be set to zero, or else the > - hypercall will fail with -EINVAL. > - > -After the vCPU is initialized with the proper values, it can be started by > -using the `VCPUOP_up` hypercall. The values of the other control registers of > -the vCPU will be the same as the ones described in the `control registers` > -section. > - > -Examples about how to bring up secondary CPUs can be found on the FreeBSD > -code base in `sys/x86/xen/pv.c` and on Linux `arch/x86/xen/smp.c`. > - > -## Control operations (reboot/shutdown) ## > - > -Reboot and shutdown operations on PVH guests are performed using hypercalls. > -In order to issue a reboot, a guest must use the `SHUTDOWN_reboot` hypercall. > -In order to perform a power off from a guest DomU, the `SHUTDOWN_poweroff` > -hypercall should be used. > - > -The way to perform a full system power off from Dom0 is different than what's > -done in a DomU guest. In order to perform a power off from Dom0 the native > -ACPI path should be followed, but the guest should not write the `SLP_EN` > -bit to the Pm1Control register. Instead the `XENPF_enter_acpi_sleep` hypercall > -should be used, filling the following data in the `xen_platform_op` struct: > - > - cmd = XENPF_enter_acpi_sleep > - interface_version = XENPF_INTERFACE_VERSION > - u.enter_acpi_sleep.pm1a_cnt_val = Pm1aControlValue > - u.enter_acpi_sleep.pm1b_cnt_val = Pm1bControlValue > - > -This will allow Xen to do it's clean up and to power off the system. If the > -host is using hardware reduced ACPI, the following field should also be set: > - > - u.enter_acpi_sleep.flags = XENPF_ACPI_SLEEP_EXTENDED (0x1) > - > -## CPUID ## > - > -The cpuid instruction that should be used is the normal `cpuid`, not the > -emulated `cpuid` that PV guests usually require. > - > -*TDOD*: describe which cpuid flags a guest should ignore and also which flags > -describe features can be used. It would also be good to describe the set of > -cpuid flags that will always be present when running as PVH. > - > -## Final notes ## > - > -All the other hardware functionality not described in this document should be > -assumed to be performed in the same way as native. > - > -[event_channels]: http://wiki.xen.org/wiki/Event_Channel_Internals > -- > 2.11.0 (Apple Git-81) >
PVHv1 is gone so:
Acked-by: Wei Liu <wei.liu2@citrix.com>
diff --git a/docs/misc/pvh.markdown b/docs/misc/pvh.markdown deleted file mode 100644 index 52d8e743e7..0000000000 --- a/docs/misc/pvh.markdown +++ /dev/null @@ -1,377 +0,0 @@ -# PVH Specification # - -## Rationale ## - -PVH is a new kind of guest that has been introduced on Xen 4.4 as a DomU, and -on Xen 4.5 as a Dom0. The aim of PVH is to make use of the hardware -virtualization extensions present in modern x86 CPUs in order to -improve performance. - -PVH is considered a mix between PV and HVM, and can be seen as a PV guest -that runs inside of an HVM container, or as a PVHVM guest without any emulated -devices. The design goal of PVH is to provide the best performance possible and -to reduce the amount of modifications needed for a guest OS to run in this mode -(compared to pure PV). - -This document tries to describe the interfaces used by PVH guests, focusing -on how an OS should make use of them in order to support PVH. - -## Early boot ## - -PVH guests use the PV boot mechanism, that means that the kernel is loaded and -directly launched by Xen (by jumping into the entry point). In order to do this -Xen ELF Notes need to be added to the guest kernel, so that they contain the -information needed by Xen. Here is an example of the ELF Notes added to the -FreeBSD amd64 kernel in order to boot as PVH: - - ELFNOTE(Xen, XEN_ELFNOTE_GUEST_OS, .asciz, "FreeBSD") - ELFNOTE(Xen, XEN_ELFNOTE_GUEST_VERSION, .asciz, __XSTRING(__FreeBSD_version)) - ELFNOTE(Xen, XEN_ELFNOTE_XEN_VERSION, .asciz, "xen-3.0") - ELFNOTE(Xen, XEN_ELFNOTE_VIRT_BASE, .quad, KERNBASE) - ELFNOTE(Xen, XEN_ELFNOTE_PADDR_OFFSET, .quad, KERNBASE) - ELFNOTE(Xen, XEN_ELFNOTE_ENTRY, .quad, xen_start) - ELFNOTE(Xen, XEN_ELFNOTE_HYPERCALL_PAGE, .quad, hypercall_page) - ELFNOTE(Xen, XEN_ELFNOTE_HV_START_LOW, .quad, HYPERVISOR_VIRT_START) - ELFNOTE(Xen, XEN_ELFNOTE_FEATURES, .asciz, "writable_descriptor_tables|auto_translated_physmap|supervisor_mode_kernel|hvm_callback_vector") - ELFNOTE(Xen, XEN_ELFNOTE_PAE_MODE, .asciz, "yes") - ELFNOTE(Xen, XEN_ELFNOTE_L1_MFN_VALID, .long, PG_V, PG_V) - ELFNOTE(Xen, XEN_ELFNOTE_LOADER, .asciz, "generic") - ELFNOTE(Xen, XEN_ELFNOTE_SUSPEND_CANCEL, .long, 0) - ELFNOTE(Xen, XEN_ELFNOTE_BSD_SYMTAB, .asciz, "yes") - -On the Linux side, the above can be found in `arch/x86/xen/xen-head.S`. - -It is important to highlight the following notes: - - * `XEN_ELFNOTE_ENTRY`: contains the virtual memory address of the kernel entry - point. - * `XEN_ELFNOTE_HYPERCALL_PAGE`: contains the virtual memory address of the - hypercal page inside of the guest kernel (this memory region will be filled - by Xen prior to booting). - * `XEN_ELFNOTE_FEATURES`: contains the list of features supported by the kernel. - In the example above the kernel is only able to boot as a PVH guest, but - those options can be mixed with the ones used by pure PV guests in order to - have a kernel that supports both PV and PVH (like Linux). The list of - options available can be found in the `features.h` public header. Note that - in the example above `hvm_callback_vector` is in `XEN_ELFNOTE_FEATURES`. - Older hypervisors will balk at this being part of it, so it can also be put - in `XEN_ELFNOTE_SUPPORTED_FEATURES` which older hypervisors will ignore. - -Xen will jump into the kernel entry point defined in `XEN_ELFNOTE_ENTRY` with -paging enabled (either long mode or protected mode with paging turned on -depending on the kernel bitness) and some basic page tables setup. An important -distinction for a 64bit PVH is that it is launched at privilege level 0 as -opposed to a 64bit PV guest which is launched at privilege level 3. - -Also, the `rsi` (`esi` on 32bits) register is going to contain the virtual -memory address where Xen has placed the `start_info` structure. The `rsp` (`esp` -on 32bits) will point to the top of an initial single page stack, that can be -used by the guest kernel. The `start_info` structure contains all the info the -guest needs in order to initialize. More information about the contents can be -found in the `xen.h` public header. - -### Initial amd64 control registers values ### - -Initial values for the control registers are set up by Xen before booting the -guest kernel. The guest kernel can expect to find the following features -enabled by Xen. - -`CR0` has the following bits set by Xen: - - * PE (bit 0): protected mode enable. - * ET (bit 4): 387 or newer processor. - * PG (bit 31): paging enabled. - -`CR4` has the following bits set by Xen: - - * PAE (bit 5): PAE enabled. - -And finally in `EFER` the following features are enabled: - - * LME (bit 8): Long mode enable. - * LMA (bit 10): Long mode active. - -At least the following flags in `EFER` are guaranteed to be disabled: - - * SCE (bit 0): System call extensions disabled. - * NXE (bit 11): No-Execute disabled. - -There's no guarantee about the state of the other bits in the `EFER` register. - -All the segments selectors are set with a flat base at zero. - -The `cs` segment selector attributes are set to 0x0a09b, which describes an -executable and readable code segment only accessible by the most privileged -level. The segment is also set as a 64-bit code segment (`L` flag set, `D` flag -unset). - -The remaining segment selectors (`ds`, `ss`, `es`, `fs` and `gs`) are all set -to the same values. The attributes are set to 0x0c093, which implies a read and -write data segment only accessible by the most privileged level. - -The `FS.base`, `GS.base` and `KERNEL_GS.base` MSRs are zeroed out. - -The `IDT` and `GDT` are also zeroed, so the guest must be specially careful to -not trigger a fault until after they have been properly set. The way of setting -the IDT and the GDT is using the native instructions as would be done on bare -metal. - -The `RFLAGS` register is guaranteed to be clear when jumping into the kernel -entry point, with the exception of the reserved bit 1 set. - -## Memory ## - -Since PVH guests rely on virtualization extensions provided by the CPU, they -have access to a hardware virtualized MMU, which means page-table related -operations should use the same instructions used on native. - -There are however some differences with native. The usage of native MTRR -operations is forbidden, and `XENPF_*_memtype` hypercalls should be used -instead. This can be avoided by simply not using MTRR and setting all the -memory attributes using PAT, which doesn't require the usage of any hypercalls. - -Since PVH doesn't use a BIOS in order to boot, the physical memory map has -to be retrieved using the `XENMEM_memory_map` hypercall, which will return -an e820 map. This memory map might contain holes that describe MMIO regions, -that will be already setup by Xen. - -*TODO*: we need to figure out what to do with MMIO regions, right now Xen -sets all the holes in the native e820 to MMIO regions for Dom0 up to 4GB. We -need to decide what to do with MMIO regions above 4GB on Dom0, and what to do -for PVH DomUs with pci-passthrough. - -In the case of a guest started with memory != maxmem, the e820 memory map -returned by Xen will contain the memory up to maxmem. The guest has to be very -careful to only use the lower memory pages up to the value contained in -`start_info->nr_pages` because any memory page above that value will not be -populated. - -## Physical devices ## - -When running as Dom0 the guest OS has the ability to interact with the physical -devices present in the system. A note should be made that PVH guests require -a working IOMMU in order to interact with physical devices. - -The first step in order to manipulate the devices is to make Xen aware of -them. Due to the fact that all the hardware description on x86 comes from -ACPI, Dom0 is responsible for parsing the ACPI tables and notifying Xen about -the devices it finds. This is done with the `PHYSDEVOP_pci_device_add` -hypercall. - -*TODO*: explain the way to register the different kinds of PCI devices, like -devices with virtual functions. - -## Interrupts ## - -All interrupts on PVH guests are routed over event channels, see -[Event Channel Internals][event_channels] for more detailed information about -event channels. In order to inject interrupts into the guest an IDT vector is -used. This is the same mechanism used on PVHVM guests, and allows having -per-cpu interrupts that can be used to deliver timers or IPIs. - -In order to register the callback IDT vector the `HVMOP_set_param` hypercall -is used with the following values: - - domid = DOMID_SELF - index = HVM_PARAM_CALLBACK_IRQ - value = (0x2 << 56) | vector_value - -The OS has to program the IDT for the `vector_value` using the baremetal -mechanism. - -In order to know which event channel has fired, we need to look into the -information provided in the `shared_info` structure. The `evtchn_pending` -array is used as a bitmap in order to find out which event channel has -fired. Event channels can also be masked by setting it's port value in the -`shared_info->evtchn_mask` bitmap. - -### Interrupts from physical devices ### - -When running as Dom0 (or when using pci-passthrough) interrupts from physical -devices are routed over event channels. There are 3 different kind of -physical interrupts that can be routed over event channels by Xen: IO APIC, -MSI and MSI-X interrupts. - -Since physical interrupts usually need EOI (End Of Interrupt), Xen allows the -registration of a memory region that will contain whether a physical interrupt -needs EOI from the guest or not. This is done with the -`PHYSDEVOP_pirq_eoi_gmfn_v2` hypercall that takes a parameter containing the -physical address of the memory page that will act as a bitmap. Then in order to -find out if an IRQ needs EOI or not, the OS can perform a simple bit test on the -memory page using the PIRQ value. - -### IO APIC interrupt routing ### - -IO APIC interrupts can be routed over event channels using `PHYSDEVOP` -hypercalls. First the IRQ is registered using the `PHYSDEVOP_map_pirq` -hypercall, as an example IRQ#9 is used here: - - domid = DOMID_SELF - type = MAP_PIRQ_TYPE_GSI - index = 9 - pirq = 9 - -The IRQ#9 is now registered as PIRQ#9. The triggering and polarity can also -be configured using the `PHYSDEVOP_setup_gsi` hypercall: - - gsi = 9 # This is the IRQ value. - triggering = 0 - polarity = 0 - -In this example the IRQ would be configured to use edge triggering and high -polarity. - -Finally the PIRQ can be bound to an event channel using the -`EVTCHNOP_bind_pirq`, that will return the event channel port the PIRQ has been -assigned. After this the event channel will be ready for delivery. - -*NOTE*: when running as Dom0, the guest has to parse the interrupt overrides -found on the ACPI tables and notify Xen about them. - -### MSI ### - -In order to configure MSI interrupts for a device, Xen must be made aware of -it's presence first by using the `PHYSDEVOP_pci_device_add` as described above. -Then the `PHYSDEVOP_map_pirq` hypercall is used: - - domid = DOMID_SELF - type = MAP_PIRQ_TYPE_MSI_SEG or MAP_PIRQ_TYPE_MULTI_MSI - index = -1 - pirq = -1 - bus = pci_device_bus - devfn = pci_device_function - entry_nr = number of MSI interrupts - -The type has to be set to `MAP_PIRQ_TYPE_MSI_SEG` if only one MSI interrupt -source is being configured. On devices that support MSI interrupt groups -`MAP_PIRQ_TYPE_MULTI_MSI` can be used to configure them by also placing the -number of MSI interrupts in the `entry_nr` field. - -The values in the `bus` and `devfn` field should be the same as the ones used -when registering the device with `PHYSDEVOP_pci_device_add`. - -### MSI-X ### - -*TODO*: how to register/use them. - -## Event timers and timecounters ## - -Since some hardware is not available on PVH (like the local APIC), Xen provides -the OS with suitable replacements in order to get the same functionality. One -of them is the timer interface. Using a set of hypercalls, a guest OS can set -event timers that will deliver and event channel interrupt to the guest. - -In order to use the timer provided by Xen the guest OS first needs to register -a VIRQ event channel to be used by the timer to deliver the interrupts. The -event channel is registered using the `EVTCHNOP_bind_virq` hypercall, that -only takes two parameters: - - virq = VIRQ_TIMER - vcpu = vcpu_id - -The port that's going to be used by Xen in order to deliver the interrupt is -returned in the `port` field. Once the interrupt is set, the timer can be -programmed using the `VCPUOP_set_singleshot_timer` hypercall. - - flags = VCPU_SSHOTTMR_future - timeout_abs_ns = absolute value when the timer should fire - -It is important to notice that the `VCPUOP_set_singleshot_timer` hypercall must -be executed from the same vCPU where the timer should fire, or else Xen will -refuse to set it. This is a single-shot timer, so it must be set by the OS -every time it fires if a periodic timer is desired. - -Xen also shares a memory region with the guest OS that contains time related -values that are updated periodically. This values can be used to implement a -timecounter or to obtain the current time. This information is placed inside of -`shared_info->vcpu_info[vcpu_id].time`. The uptime (time since the guest has -been launched) can be calculated using the following expression and the values -stored in the `vcpu_time_info` struct: - - system_time + ((((tsc - tsc_timestamp) << tsc_shift) * tsc_to_system_mul) >> 32) - -The timeout that is passed to `VCPUOP_set_singleshot_timer` has to be -calculated using the above value, plus the timeout the system wants to set. - -If the OS also wants to obtain the current wallclock time, the value calculated -above has to be added to the values found in `shared_info->wc_sec` and -`shared_info->wc_nsec`. - -## SMP discover and bring up ## - -The process of bringing up secondary CPUs is obviously different from native, -since PVH doesn't have a local APIC. The first thing to do is to figure out -how many vCPUs the guest has. This is done using the `VCPUOP_is_up` hypercall, -using for example this simple loop: - - for (i = 0; i < MAXCPU; i++) { - ret = HYPERVISOR_vcpu_op(VCPUOP_is_up, i, NULL); - if (ret >= 0) - /* vCPU#i is present */ - } - -Note than when running as Dom0, the ACPI tables might report a different number -of available CPUs. This is because the value on the ACPI tables is the -number of physical CPUs the host has, and it might bear no resemblance with the -number of vCPUs Dom0 actually has so it should be ignored. - -In order to bring up the secondary vCPUs they must be configured first. This is -achieved using the `VCPUOP_initialise` hypercall. A valid context has to be -passed to the vCPU in order to boot. The relevant fields for PVH guests are -the following: - - * `flags`: contains `VGCF_*` flags (see `arch-x86/xen.h` public header). - * `user_regs`: struct that contains the register values that will be set on - the vCPU before booting. All GPRs are available to be set, however, the - most relevant ones are `rip` and `rsp` in order to set the start address - and the stack. Please note, all selectors must be null. - * `ctrlreg[3]`: contains the address of the page tables that will be used by - the vCPU. Other control registers should be set to zero, or else the - hypercall will fail with -EINVAL. - -After the vCPU is initialized with the proper values, it can be started by -using the `VCPUOP_up` hypercall. The values of the other control registers of -the vCPU will be the same as the ones described in the `control registers` -section. - -Examples about how to bring up secondary CPUs can be found on the FreeBSD -code base in `sys/x86/xen/pv.c` and on Linux `arch/x86/xen/smp.c`. - -## Control operations (reboot/shutdown) ## - -Reboot and shutdown operations on PVH guests are performed using hypercalls. -In order to issue a reboot, a guest must use the `SHUTDOWN_reboot` hypercall. -In order to perform a power off from a guest DomU, the `SHUTDOWN_poweroff` -hypercall should be used. - -The way to perform a full system power off from Dom0 is different than what's -done in a DomU guest. In order to perform a power off from Dom0 the native -ACPI path should be followed, but the guest should not write the `SLP_EN` -bit to the Pm1Control register. Instead the `XENPF_enter_acpi_sleep` hypercall -should be used, filling the following data in the `xen_platform_op` struct: - - cmd = XENPF_enter_acpi_sleep - interface_version = XENPF_INTERFACE_VERSION - u.enter_acpi_sleep.pm1a_cnt_val = Pm1aControlValue - u.enter_acpi_sleep.pm1b_cnt_val = Pm1bControlValue - -This will allow Xen to do it's clean up and to power off the system. If the -host is using hardware reduced ACPI, the following field should also be set: - - u.enter_acpi_sleep.flags = XENPF_ACPI_SLEEP_EXTENDED (0x1) - -## CPUID ## - -The cpuid instruction that should be used is the normal `cpuid`, not the -emulated `cpuid` that PV guests usually require. - -*TDOD*: describe which cpuid flags a guest should ignore and also which flags -describe features can be used. It would also be good to describe the set of -cpuid flags that will always be present when running as PVH. - -## Final notes ## - -All the other hardware functionality not described in this document should be -assumed to be performed in the same way as native. - -[event_channels]: http://wiki.xen.org/wiki/Event_Channel_Internals
The current misc/pvh.markdown document refers to PVHv1, remove it to avoid confusion with PVHv2 since the PVHv1 code has already been removed. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: George Dunlap <George.Dunlap@eu.citrix.com> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Tim Deegan <tim@xen.org> Cc: Wei Liu <wei.liu2@citrix.com> Cc: Julien Grall <julien.grall@citrix.com> --- docs/misc/pvh.markdown | 377 ------------------------------------------------- 1 file changed, 377 deletions(-) delete mode 100644 docs/misc/pvh.markdown