Message ID | 20230906043333.448244-16-harshpb@linux.ibm.com (mailing list archive) |
---|---|
State | New, archived |
Series | Nested PAPR API (KVM on PowerVM) |
On Wed Sep 6, 2023 at 2:33 PM AEST, Harsh Prateek Bora wrote:
> Adding initial documentation about Nested PAPR API to describe the set
> of APIs and its usage. Also talks about the Guest State Buffer elements
> and its format which is used between L0/L1 to communicate L2 state.

I would move this patch first (well, behind any cleanup and preparation
patches, but before any new API additions).

Thanks,
Nick

>
> Signed-off-by: Michael Neuling <mikey@neuling.org>
> Signed-off-by: Harsh Prateek Bora <harshpb@linux.ibm.com>
> ---
>  docs/devel/nested-papr.txt | 500 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 500 insertions(+)
>  create mode 100644 docs/devel/nested-papr.txt
>
> diff --git a/docs/devel/nested-papr.txt b/docs/devel/nested-papr.txt
> new file mode 100644
> index 0000000000..c5c2ba7e50
> --- /dev/null
> +++ b/docs/devel/nested-papr.txt
> @@ -0,0 +1,500 @@
> +Nested PAPR API (aka KVM on PowerVM)
> +====================================
> +
> +This API aims at providing support to enable nested virtualization with
> +KVM on PowerVM. Existing support for nested KVM on PowerNV was introduced
> +with the cap-nested-hv option. Enabling nested KVM on papr/pseries involves
> +a slight design change, so a new cap-nested-papr option is added, e.g.:
> +
> +  qemu-system-ppc64 -cpu POWER10 -machine pseries,cap-nested-papr=true ...
> +
> +Work by:
> +    Michael Neuling <mikey@neuling.org>
> +    Vaibhav Jain <vaibhav@linux.ibm.com>
> +    Jordan Niethe <jniethe5@gmail.com>
> +    Harsh Prateek Bora <harshpb@linux.ibm.com>
> +    Shivaprasad G Bhat <sbhat@linux.ibm.com>
> +    Kautuk Consul <kconsul@linux.vnet.ibm.com>
> +
> +Below taken from the kernel documentation:
> +
> +Introduction
> +============
> +
> +This document explains how a guest operating system can act as a
> +hypervisor and run nested guests through the use of hypercalls, if the
> +hypervisor has implemented them. The terms L0, L1, and L2 are used to
> +refer to different software entities.
> +L0 is the hypervisor mode entity that would normally be called the
> +"host" or "hypervisor". L1 is a guest virtual machine that is directly
> +run under L0 and is initiated and controlled by L0. L2 is a guest
> +virtual machine that is initiated and controlled by L1 acting as a
> +hypervisor. A significant design change with respect to the existing API
> +is that the entire L2 state is now maintained within L0.
> +
> +Existing Nested-HV API
> +======================
> +
> +Linux/KVM has had support for nesting as an L0 or L1 since 2018.
> +
> +The L0 code was added::
> +
> +   commit 8e3f5fc1045dc49fd175b978c5457f5f51e7a2ce
> +   Author: Paul Mackerras <paulus@ozlabs.org>
> +   Date:   Mon Oct 8 16:31:03 2018 +1100
> +   KVM: PPC: Book3S HV: Framework and hcall stubs for nested virtualization
> +
> +The L1 code was added::
> +
> +   commit 360cae313702cdd0b90f82c261a8302fecef030a
> +   Author: Paul Mackerras <paulus@ozlabs.org>
> +   Date:   Mon Oct 8 16:31:04 2018 +1100
> +   KVM: PPC: Book3S HV: Nested guest entry via hypercall
> +
> +This API works primarily using a single hcall, h_enter_nested(). This
> +call is made by the L1 to tell the L0 to start an L2 vCPU with the given
> +state. The L0 then starts this L2 and runs until an L2 exit condition
> +is reached. Once the L2 exits, the state of the L2 is given back to
> +the L1 by the L0. The full L2 vCPU state is always transferred from
> +and to L1 when the L2 is run. The L0 doesn't keep any state on the L2
> +vCPU (except in the short sequence in the L0 on L1 -> L2 entry and L2
> +-> L1 exit).
> +
> +The only state kept by the L0 is the partition table. The L1 registers
> +its partition table using the h_set_partition_table() hcall. All
> +other state held by the L0 about the L2s is cached state (such as
> +shadow page tables).
> +
> +The L1 may run any L2 or vCPU without first informing the L0. It
> +simply starts the vCPU using h_enter_nested(). The creation of L2s and
> +vCPUs is done implicitly whenever h_enter_nested() is called.
> +
> +In this document, we call this existing API the v1 API.
> +
> +New PAPR API
> +============
> +
> +The new PAPR API changes from the v1 API such that the creation of an L2
> +and its associated vCPUs is explicit. In this document, we call this the
> +v2 API.
> +
> +h_enter_nested() is replaced with H_GUEST_RUN_VCPU(). Before this can
> +be called, the L1 must explicitly create the L2 using H_GUEST_CREATE()
> +and create any associated vCPUs with H_GUEST_CREATE_VCPU(). Getting
> +and setting vCPU state can also be performed using the
> +H_GUEST_{G,S}ET_STATE() hcalls.
> +
> +The basic execution flow for an L1 to create an L2, run it, and
> +delete it is:
> +
> +- L1 and L0 negotiate capabilities with H_GUEST_{G,S}ET_CAPABILITIES()
> +  (normally at L1 boot time).
> +
> +- L1 requests the L0 to create an L2 with H_GUEST_CREATE() and receives a token
> +
> +- L1 requests the L0 to create an L2 vCPU with H_GUEST_CREATE_VCPU()
> +
> +- L1 and L0 communicate the vCPU state using the H_GUEST_{G,S}ET_STATE() hcalls
> +
> +- L1 requests the L0 to run the vCPU using the H_GUEST_RUN_VCPU() hcall
> +
> +- L1 deletes L2 with H_GUEST_DELETE()
> +
> +More details of the individual hcalls follow:
> +
> +HCALL Details
> +=============
> +
> +This documentation is provided to give an overall understanding of the
> +API. It doesn't aim to provide the full details required to implement
> +an L1 or L0. Refer to the latest PAPR spec for more details.
> +
> +All these HCALLs are made by the L1 to the L0.
> +
> +H_GUEST_GET_CAPABILITIES()
> +--------------------------
> +
> +This is called to get the capabilities of the L0 nested
> +hypervisor. This includes capabilities such as the CPU versions (eg
> +POWER9, POWER10) that are supported as L2s.
> +
> +H_GUEST_SET_CAPABILITIES()
> +--------------------------
> +
> +This is called to inform the L0 of the capabilities of the L1
> +hypervisor.
> +The set of flags passed here is the same as for
> +H_GUEST_GET_CAPABILITIES().
> +
> +Typically, GET will be called first and then SET will be called with a
> +subset of the flags returned from GET. This process allows the L0 and
> +L1 to negotiate an agreed set of capabilities.
> +
> +H_GUEST_CREATE()
> +----------------
> +
> +This is called to create an L2. The ID of the created L2 is returned
> +(similar to an LPID) and can be used on subsequent HCALLs to
> +identify the L2.
> +
> +H_GUEST_CREATE_VCPU()
> +---------------------
> +
> +This is called to create a vCPU associated with an L2. The L2 ID
> +(returned from H_GUEST_CREATE()) should be passed in. Also passed in
> +is a unique (for this L2) vCPU ID. This vCPU ID is allocated by the
> +L1.
> +
> +H_GUEST_SET_STATE()
> +-------------------
> +
> +This is called to set L2 wide or vCPU specific L2 state. This info is
> +passed via the Guest State Buffer (GSB), details below.
> +
> +This can set either L2 wide or vCPU specific information. Examples of
> +L2 wide state are the timebase offset or process scoped page table
> +info. Examples of vCPU specific state are GPRs or VSRs. A bit in the
> +flags parameter specifies if this call is L2 wide or vCPU specific and
> +the IDs in the GSB must match this.
> +
> +The L1 provides a pointer to the GSB as a parameter to this call. Also
> +provided are the L2 and vCPU IDs associated with the state to set.
> +
> +The L1 writes all values in the GSB and the L0 only reads the GSB for
> +this call.
> +
> +H_GUEST_GET_STATE()
> +-------------------
> +
> +This is called to get state associated with an L2 or L2 vCPU. This info
> +is passed via the GSB (details below).
> +
> +This can get either L2 wide or vCPU specific information. Examples of
> +L2 wide state are the timebase offset or process scoped page table
> +info. Examples of vCPU specific state are GPRs or VSRs. A bit in the
> +flags parameter specifies if this call is L2 wide or vCPU specific and
> +the IDs in the GSB must match this.
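
The rule that the IDs in a GSB must match the scope selected by the flags
bit could be checked on the L1 side before issuing the hcall. What follows
is only a minimal sketch: the ID ranges and scopes are transcribed from the
GSB element table later in this document, while the function and constant
names are hypothetical and not part of any existing QEMU or kernel API.

```python
# ID ranges per scope, transcribed from the GSB element table in this
# document. Names below are illustrative assumptions, not a real API.
GUEST_WIDE = [(0x0001, 0x0006)]          # scope "G"
VCPU_SPECIFIC = [                        # scope "T"
    (0x0C00, 0x0C02),
    (0x1000, 0x1052),
    (0x2000, 0x200E),
    (0x3000, 0x303F),
    (0xF000, 0xF003),
]
NOP_ID = 0x0000  # listed with scope "TG", so accepted for either scope


def _in_ranges(elem_id, ranges):
    """True if elem_id falls inside any inclusive (first, last) range."""
    return any(first <= elem_id <= last for first, last in ranges)


def ids_match_scope(elem_ids, vcpu_specific):
    """Return True if every element ID is valid for the requested scope.

    Reserved IDs match neither scope and therefore fail the check.
    """
    allowed = VCPU_SPECIFIC if vcpu_specific else GUEST_WIDE
    return all(i == NOP_ID or _in_ranges(i, allowed) for i in elem_ids)
```

For example, a guest wide call carrying the TB offset (0x0004) and the
process table (0x0006) passes, while mixing in GPR 0 (0x1000) does not.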
> +
> +The L1 provides a pointer to the GSB as a parameter to this call. Also
> +provided are the L2 and vCPU IDs associated with the state to get.
> +
> +The L1 writes only the IDs and sizes in the GSB. L0 writes the
> +associated values for each ID in the GSB.
> +
> +H_GUEST_RUN_VCPU()
> +------------------
> +
> +This is called to run an L2 vCPU. The L2 and vCPU IDs are passed in as
> +parameters. The vCPU runs with the state set previously using
> +H_GUEST_SET_STATE(). When the L2 exits, the L1 will resume from this
> +hcall.
> +
> +This hcall also has associated input and output GSBs. Unlike
> +H_GUEST_{S,G}ET_STATE(), these GSB pointers are not passed in as
> +parameters to the hcall (this was done in the interest of
> +performance). The locations of these GSBs must be preregistered using
> +the H_GUEST_SET_STATE() call with IDs 0x0C00 and 0x0C01 (see the table
> +below).
> +
> +The input GSB may contain only vCPU specific elements to be set. This GSB
> +may also contain zero elements (ie 0 in the first 4 bytes of the GSB)
> +if nothing needs to be set.
> +
> +On exit from the hcall, the output buffer is filled with elements
> +determined by the L0. The reason for the exit is contained in GPR4 (ie
> +NIP is put in GPR4). The elements returned depend on the exit
> +type. For example, if the exit reason is the L2 doing a hcall (GPR4 =
> +0xc00), then GPR3-12 are provided in the output GSB as this is the
> +state likely needed to service the hcall. If additional state is
> +needed, H_GUEST_GET_STATE() may be called by the L1.
> +
> +To synthesize interrupts in the L2, when calling H_GUEST_RUN_VCPU()
> +the L1 may set a flag (as a hcall parameter) and the L0 will
> +synthesize the interrupt in the L2. Alternatively, the L1 may
> +synthesize the interrupt itself using H_GUEST_SET_STATE() or the
> +H_GUEST_RUN_VCPU() input GSB to set the state appropriately.
> +
> +H_GUEST_DELETE()
> +----------------
> +
> +This is called to delete an L2.
> +All associated vCPUs are also deleted. No specific vCPU delete call is
> +provided.
> +
> +A flag may be provided to delete all guests. This is used to reset the
> +L0 in the case of kdump/kexec.
> +
> +Guest State Buffer (GSB)
> +========================
> +
> +The Guest State Buffer (GSB) is the main method of communicating state
> +about the L2 between the L1 and L0 via the H_GUEST_{G,S}ET_STATE() and
> +H_GUEST_RUN_VCPU() calls.
> +
> +State may be associated with a whole L2 (eg timebase offset) or a
> +specific L2 vCPU (eg GPR state). Only L2 vCPU state may be set by
> +H_GUEST_RUN_VCPU().
> +
> +All data in the GSB is big endian (as is standard in PAPR).
> +
> +The Guest State Buffer has a header which gives the number of
> +elements, followed by the GSB elements themselves.
> +
> +GSB header:
> +
> ++----------+----------+-------------------------------------------+
> +|  Offset  |  Size    | Purpose                                   |
> +|  Bytes   |  Bytes   |                                           |
> ++==========+==========+===========================================+
> +|    0     |    4     | Number of elements                        |
> ++----------+----------+-------------------------------------------+
> +|    4     |          | Guest state buffer elements               |
> ++----------+----------+-------------------------------------------+
> +
> +GSB element:
> +
> ++----------+----------+-------------------------------------------+
> +|  Offset  |  Size    | Purpose                                   |
> +|  Bytes   |  Bytes   |                                           |
> ++==========+==========+===========================================+
> +|    0     |    2     | ID                                        |
> ++----------+----------+-------------------------------------------+
> +|    2     |    2     | Size of Value                             |
> ++----------+----------+-------------------------------------------+
> +|    4     | As above | Value                                     |
> ++----------+----------+-------------------------------------------+
> +
> +The ID in the GSB element specifies what is to be set. This includes
> +architected state like GPRs, VSRs, SPRs, plus some metadata about
> +the partition like the timebase offset and partition scoped page
> +table information.
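
The header and element layouts above map directly onto a byte-level
encoding. As a rough illustration (not the QEMU or kernel implementation),
the sketch below packs and unpacks a GSB with Python's struct module. The
helper names are hypothetical; the example IDs 0x1000 (GPR 0, 8 bytes) and
0x2000 (CR, 4 bytes) come from the element table in this document.

```python
import struct


def pack_element(elem_id, value):
    """One GSB element: 2-byte ID, 2-byte value size, then the value."""
    return struct.pack(">HH", elem_id, len(value)) + value


def pack_gsb(elements):
    """A GSB: 4-byte element count followed by the packed elements.

    `elements` is a list of (id, value_bytes) pairs; everything is
    big endian, as the document states for all GSB data.
    """
    body = b"".join(pack_element(i, v) for i, v in elements)
    return struct.pack(">I", len(elements)) + body


def unpack_gsb(buf):
    """Inverse of pack_gsb(): return a list of (id, value_bytes) pairs."""
    (count,) = struct.unpack_from(">I", buf, 0)
    offset = 4
    elements = []
    for _ in range(count):
        elem_id, size = struct.unpack_from(">HH", buf, offset)
        offset += 4
        elements.append((elem_id, buf[offset:offset + size]))
        offset += size
    return elements


# Example: GPR 0 (ID 0x1000, 8 bytes) and CR (ID 0x2000, 4 bytes).
gsb = pack_gsb([
    (0x1000, struct.pack(">Q", 0x1234)),
    (0x2000, struct.pack(">I", 0)),
])
```

An empty buffer, as allowed for the H_GUEST_RUN_VCPU() input GSB, is just
the four zero bytes produced by pack_gsb([]).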
> +
> ++--------+-------+----+--------+----------------------------------+
> +|   ID   | Size  | RW | Thread | Details                          |
> +|        | Bytes |    | Guest  |                                  |
> +|        |       |    | Scope  |                                  |
> ++========+=======+====+========+==================================+
> +| 0x0000 |       | RW |   TG   | NOP element                      |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x0001 | 0x08  | R  |   G    | Size of L0 vCPU state            |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x0002 | 0x08  | R  |   G    | Size Run vCPU out buffer         |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x0003 | 0x04  | RW |   G    | Logical PVR                      |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x0004 | 0x08  | RW |   G    | TB Offset (L1 relative)          |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x0005 | 0x18  | RW |   G    |Partition scoped page tbl info:   |
> +|        |       |    |        |                                  |
> +|        |       |    |        |- 0x00 Addr part scope table      |
> +|        |       |    |        |- 0x08 Num addr bits              |
> +|        |       |    |        |- 0x10 Size root dir              |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x0006 | 0x10  | RW |   G    |Process Table Information:        |
> +|        |       |    |        |                                  |
> +|        |       |    |        |- 0x0 Addr proc scope table       |
> +|        |       |    |        |- 0x8 Table size.                 |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x0007-|       |    |        | Reserved                         |
> +| 0x0BFF |       |    |        |                                  |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x0C00 | 0x10  | RW |   T    |Run vCPU Input Buffer:            |
> +|        |       |    |        |                                  |
> +|        |       |    |        |- 0x0 Addr of buffer              |
> +|        |       |    |        |- 0x8 Buffer Size.                |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x0C01 | 0x10  | RW |   T    |Run vCPU Output Buffer:           |
> +|        |       |    |        |                                  |
> +|        |       |    |        |- 0x0 Addr of buffer              |
> +|        |       |    |        |- 0x8 Buffer Size.                |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x0C02 | 0x08  | RW |   T    | vCPU VPA Address                 |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x0C03-|       |    |        | Reserved                         |
> +| 0x0FFF |       |    |        |                                  |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x1000-| 0x08  | RW |   T    | GPR 0-31                         |
> +| 0x101F |       |    |        |                                  |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x1020 | 0x08  | T  |   T    | HDEC expiry TB                   |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x1021 | 0x08  | RW |   T    | NIA                              |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x1022 | 0x08  | RW |   T    | MSR                              |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x1023 | 0x08  | RW |   T    | LR                               |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x1024 | 0x08  | RW |   T    | XER                              |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x1025 | 0x08  | RW |   T    | CTR                              |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x1026 | 0x08  | RW |   T    | CFAR                             |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x1027 | 0x08  | RW |   T    | SRR0                             |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x1028 | 0x08  | RW |   T    | SRR1                             |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x1029 | 0x08  | RW |   T    | DAR                              |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x102A | 0x08  | RW |   T    | DEC expiry TB                    |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x102B | 0x08  | RW |   T    | VTB                              |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x102C | 0x08  | RW |   T    | LPCR                             |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x102D | 0x08  | RW |   T    | HFSCR                            |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x102E | 0x08  | RW |   T    | FSCR                             |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x102F | 0x08  | RW |   T    | FPSCR                            |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x1030 | 0x08  | RW |   T    | DAWR0                            |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x1031 | 0x08  | RW |   T    | DAWR1                            |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x1032 | 0x08  | RW |   T    | CIABR                            |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x1033 | 0x08  | RW |   T    | PURR                             |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x1034 | 0x08  | RW |   T    | SPURR                            |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x1035 | 0x08  | RW |   T    | IC                               |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x1036-| 0x08  | RW |   T    | SPRG 0-3                         |
> +| 0x1039 |       |    |        |                                  |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x103A | 0x08  | W  |   T    | PPR                              |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x103B-| 0x08  | RW |   T    | MMCR 0-3                         |
> +| 0x103E |       |    |        |                                  |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x103F | 0x08  | RW |   T    | MMCRA                            |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x1040 | 0x08  | RW |   T    | SIER                             |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x1041 | 0x08  | RW |   T    | SIER 2                           |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x1042 | 0x08  | RW |   T    | SIER 3                           |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x1043 | 0x08  | RW |   T    | BESCR                            |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x1044 | 0x08  | RW |   T    | EBBHR                            |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x1045 | 0x08  | RW |   T    | EBBRR                            |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x1046 | 0x08  | RW |   T    | AMR                              |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x1047 | 0x08  | RW |   T    | IAMR                             |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x1048 | 0x08  | RW |   T    | AMOR                             |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x1049 | 0x08  | RW |   T    | UAMOR                            |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x104A | 0x08  | RW |   T    | SDAR                             |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x104B | 0x08  | RW |   T    | SIAR                             |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x104C | 0x08  | RW |   T    | DSCR                             |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x104D | 0x08  | RW |   T    | TAR                              |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x104E | 0x08  | RW |   T    | DEXCR                            |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x104F | 0x08  | RW |   T    | HDEXCR                           |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x1050 | 0x08  | RW |   T    | HASHKEYR                         |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x1051 | 0x08  | RW |   T    | HASHPKEYR                        |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x1052 | 0x08  | RW |   T    | CTRL                             |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x1053-|       |    |        | Reserved                         |
> +| 0x1FFF |       |    |        |                                  |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x2000 | 0x04  | RW |   T    | CR                               |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x2001 | 0x04  | RW |   T    | PIDR                             |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x2002 | 0x04  | RW |   T    | DSISR                            |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x2003 | 0x04  | RW |   T    | VSCR                             |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x2004 | 0x04  | RW |   T    | VRSAVE                           |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x2005 | 0x04  | RW |   T    | DAWRX0                           |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x2006 | 0x04  | RW |   T    | DAWRX1                           |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x2007-| 0x04  | RW |   T    | PMC 1-6                          |
> +| 0x200C |       |    |        |                                  |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x200D | 0x04  | RW |   T    | WORT                             |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x200E | 0x04  | RW |   T    | PSPB                             |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x200F-|       |    |        | Reserved                         |
> +| 0x2FFF |       |    |        |                                  |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x3000-| 0x10  | RW |   T    | VSR 0-63                         |
> +| 0x303F |       |    |        |                                  |
> ++--------+-------+----+--------+----------------------------------+
> +| 0x3040-|       |    |        | Reserved                         |
> +| 0xEFFF |       |    |        |                                  |
> ++--------+-------+----+--------+----------------------------------+
> +| 0xF000 | 0x08  | R  |   T    | HDAR                             |
> ++--------+-------+----+--------+----------------------------------+
> +| 0xF001 | 0x04  | R  |   T    | HDSISR                           |
> ++--------+-------+----+--------+----------------------------------+
> +| 0xF002 | 0x04  | R  |   T    | HEIR                             |
> ++--------+-------+----+--------+----------------------------------+
> +| 0xF003 | 0x08  | R  |   T    | ASDR                             |
> ++--------+-------+----+--------+----------------------------------+
> +
> +Miscellaneous info
> +==================
> +
> +State not in ptregs/hvregs
> +--------------------------
> +
> +In the v1 API, some state is not in the ptregs/hvstate. This includes
> +the vector registers and some SPRs. For the L1 to set this state for
> +the L2, the L1 loads up these hardware registers before the
> +h_enter_nested() call and the L0 ensures they end up as the L2 state
> +(by not touching them).
> +
> +The v2 API removes this and explicitly sets this state via the GSB.
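
Since vector registers and SPRs now travel as ordinary GSB elements, an L1
has to know the value size that goes with each ID. One way to keep that in
one place is a small range table; the sketch below is an illustration only,
with sizes transcribed from the element table above and hypothetical names.

```python
# (first ID, last ID, value size in bytes), transcribed from the GSB
# element table in this document. SIZE_TABLE and expected_size() are
# illustrative names, not part of any existing API.
SIZE_TABLE = [
    (0x0001, 0x0002, 0x08),  # L0 vCPU state size, run vCPU out buffer size
    (0x0003, 0x0003, 0x04),  # Logical PVR
    (0x0004, 0x0004, 0x08),  # TB offset
    (0x0005, 0x0005, 0x18),  # Partition scoped page table info
    (0x0006, 0x0006, 0x10),  # Process table information
    (0x0C00, 0x0C01, 0x10),  # Run vCPU input/output buffers
    (0x0C02, 0x0C02, 0x08),  # vCPU VPA address
    (0x1000, 0x1052, 0x08),  # GPRs and 64-bit SPRs
    (0x2000, 0x200E, 0x04),  # 32-bit registers
    (0x3000, 0x303F, 0x10),  # VSR 0-63
    (0xF000, 0xF000, 0x08),  # HDAR
    (0xF001, 0xF002, 0x04),  # HDSISR, HEIR
    (0xF003, 0xF003, 0x08),  # ASDR
]


def expected_size(elem_id):
    """Return the value size for an ID, or None if the table lists none.

    None covers the NOP element (0x0000, no size given) and reserved IDs.
    """
    for first, last, size in SIZE_TABLE:
        if first <= elem_id <= last:
            return size
    return None
```

With this, expected_size(0x3000) gives 16 for VSR 0, so a vector register
that the v1 API passed implicitly in hardware becomes a 16-byte GSB value.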
> +
> +L1 Implementation details: Caching state
> +----------------------------------------
> +
> +In the v1 API, all state is sent from the L1 to the L0 and vice versa
> +on every h_enter_nested() hcall. If the L0 is not currently running
> +any L2s, the L0 has no state information about them. The only
> +exception to this is the location of the partition table, registered
> +via h_set_partition_table().
> +
> +The v2 API changes this so that the L0 retains the L2 state even when
> +its vCPUs are no longer running. This means that the L1 only needs to
> +communicate with the L0 about L2 state when it needs to modify the L2
> +state, or when its value is out of date. This provides an opportunity
> +for performance optimisation.
> +
> +When a vCPU exits from a H_GUEST_RUN_VCPU() call, the L1 internally
> +marks all L2 state as invalid. This means that if the L1 wants to know
> +the L2 state (say via a kvm_get_one_reg() call), it needs to call
> +H_GUEST_GET_STATE() to get that state. Once it's read, it's marked as
> +valid in L1 until the L2 is run again.
> +
> +Also, when an L1 modifies L2 vCPU state, it doesn't need to write it
> +to the L0 until that L2 vCPU runs again. Hence when the L1 updates
> +state (say via a kvm_set_one_reg() call), it writes to an internal L1
> +copy and only flushes this copy to the L0 when the L2 runs again via
> +the H_GUEST_RUN_VCPU() input buffer.
> +
> +This lazy updating of state by the L1 avoids unnecessary
> +H_GUEST_{G,S}ET_STATE() calls.
> +
> +References
> +==========
> +
> +For more details, please refer to:
> +
> +[1] Kernel documentation (currently v4 on mailing list):
> +    - https://lore.kernel.org/linuxppc-dev/20230905034658.82835-1-jniethe5@gmail.com/
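
The lazy state handling described under "L1 Implementation details:
Caching state" can be pictured as a small cache on the L1 side. This is a
sketch under stated assumptions, not the kernel's code: the class and
method names are hypothetical, and the two callables passed in merely stand
in for H_GUEST_GET_STATE() and H_GUEST_RUN_VCPU().

```python
class L2VcpuStateCache:
    """Toy model of an L1's lazily synchronised view of L2 vCPU state."""

    def __init__(self, get_state, run_vcpu):
        self._get_state = get_state  # stand-in for H_GUEST_GET_STATE()
        self._run_vcpu = run_vcpu    # stand-in for H_GUEST_RUN_VCPU()
        self._valid = {}  # values read from the L0 since the last run
        self._dirty = {}  # values written by the L1, not yet sent to the L0

    def get(self, elem_id):
        """Read state, asking the L0 only when no current copy exists."""
        if elem_id in self._dirty:
            return self._dirty[elem_id]
        if elem_id not in self._valid:
            self._valid[elem_id] = self._get_state(elem_id)
        return self._valid[elem_id]

    def set(self, elem_id, value):
        """Record a modification locally; nothing is sent to the L0 yet."""
        self._dirty[elem_id] = value

    def run(self):
        """Flush pending writes via the run input buffer, then invalidate."""
        pending, self._dirty = self._dirty, {}
        self._run_vcpu(pending)
        self._valid.clear()  # the L2 ran, so every cached value is stale
```

Two reads of the same register between runs therefore cost one get-state
call, and any number of writes cost no extra hcall beyond the next run.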
On 9/7/23 09:26, Nicholas Piggin wrote:
> On Wed Sep 6, 2023 at 2:33 PM AEST, Harsh Prateek Bora wrote:
>> Adding initial documentation about Nested PAPR API to describe the set
>> of APIs and its usage. Also talks about the Guest State Buffer elements
>> and its format which is used between L0/L1 to communicate L2 state.
>
> I would move this patch first (well, behind any cleanup and preparation
> patches, but before any new API additions).
>

Sure, moving this patch before introducing nested PAPR API code.

regards,
Harsh

> Thanks,
> Nick
>> + >> ++--------+-------+----+--------+----------------------------------+ >> +| ID | Size | RW | Thread | Details | >> +| | Bytes | | Guest | | >> +| | | | Scope | | >> ++========+=======+====+========+==================================+ >> +| 0x0000 | | RW | TG | NOP element | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x0001 | 0x08 | R | G | Size of L0 vCPU state | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x0002 | 0x08 | R | G | Size Run vCPU out buffer | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x0003 | 0x04 | RW | G | Logical PVR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x0004 | 0x08 | RW | G | TB Offset (L1 relative) | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x0005 | 0x18 | RW | G |Partition scoped page tbl info: | >> +| | | | | | >> +| | | | |- 0x00 Addr part scope table | >> +| | | | |- 0x08 Num addr bits | >> +| | | | |- 0x10 Size root dir | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x0006 | 0x10 | RW | G |Process Table Information: | >> +| | | | | | >> +| | | | |- 0x0 Addr proc scope table | >> +| | | | |- 0x8 Table size. | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x0007-| | | | Reserved | >> +| 0x0BFF | | | | | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x0C00 | 0x10 | RW | T |Run vCPU Input Buffer: | >> +| | | | | | >> +| | | | |- 0x0 Addr of buffer | >> +| | | | |- 0x8 Buffer Size. | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x0C01 | 0x10 | RW | T |Run vCPU Output Buffer: | >> +| | | | | | >> +| | | | |- 0x0 Addr of buffer | >> +| | | | |- 0x8 Buffer Size. 
| >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x0C02 | 0x08 | RW | T | vCPU VPA Address | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x0C03-| | | | Reserved | >> +| 0x0FFF | | | | | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x1000-| 0x08 | RW | T | GPR 0-31 | >> +| 0x101F | | | | | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x1020 | 0x08 | T | T | HDEC expiry TB | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x1021 | 0x08 | RW | T | NIA | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x1022 | 0x08 | RW | T | MSR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x1023 | 0x08 | RW | T | LR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x1024 | 0x08 | RW | T | XER | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x1025 | 0x08 | RW | T | CTR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x1026 | 0x08 | RW | T | CFAR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x1027 | 0x08 | RW | T | SRR0 | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x1028 | 0x08 | RW | T | SRR1 | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x1029 | 0x08 | RW | T | DAR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x102A | 0x08 | RW | T | DEC expiry TB | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x102B | 0x08 | RW | T | VTB | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x102C | 0x08 | RW | T | LPCR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x102D | 0x08 | RW | T | HFSCR | >> 
++--------+-------+----+--------+----------------------------------+ >> +| 0x102E | 0x08 | RW | T | FSCR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x102F | 0x08 | RW | T | FPSCR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x1030 | 0x08 | RW | T | DAWR0 | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x1031 | 0x08 | RW | T | DAWR1 | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x1032 | 0x08 | RW | T | CIABR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x1033 | 0x08 | RW | T | PURR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x1034 | 0x08 | RW | T | SPURR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x1035 | 0x08 | RW | T | IC | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x1036-| 0x08 | RW | T | SPRG 0-3 | >> +| 0x1039 | | | | | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x103A | 0x08 | W | T | PPR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x103B | 0x08 | RW | T | MMCR 0-3 | >> +| 0x103E | | | | | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x103F | 0x08 | RW | T | MMCRA | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x1040 | 0x08 | RW | T | SIER | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x1041 | 0x08 | RW | T | SIER 2 | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x1042 | 0x08 | RW | T | SIER 3 | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x1043 | 0x08 | RW | T | BESCR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x1044 | 0x08 | RW | T | EBBHR | >> 
++--------+-------+----+--------+----------------------------------+ >> +| 0x1045 | 0x08 | RW | T | EBBRR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x1046 | 0x08 | RW | T | AMR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x1047 | 0x08 | RW | T | IAMR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x1048 | 0x08 | RW | T | AMOR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x1049 | 0x08 | RW | T | UAMOR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x104A | 0x08 | RW | T | SDAR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x104B | 0x08 | RW | T | SIAR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x104C | 0x08 | RW | T | DSCR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x104D | 0x08 | RW | T | TAR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x104E | 0x08 | RW | T | DEXCR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x104F | 0x08 | RW | T | HDEXCR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x1050 | 0x08 | RW | T | HASHKEYR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x1051 | 0x08 | RW | T | HASHPKEYR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x1052 | 0x08 | RW | T | CTRL | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x1053-| | | | Reserved | >> +| 0x1FFF | | | | | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x2000 | 0x04 | RW | T | CR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x2001 | 0x04 | RW | T | PIDR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x2002 | 0x04 | RW | T | 
DSISR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x2003 | 0x04 | RW | T | VSCR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x2004 | 0x04 | RW | T | VRSAVE | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x2005 | 0x04 | RW | T | DAWRX0 | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x2006 | 0x04 | RW | T | DAWRX1 | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x2007-| 0x04 | RW | T | PMC 1-6 | >> +| 0x200c | | | | | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x200D | 0x04 | RW | T | WORT | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x200E | 0x04 | RW | T | PSPB | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x200F-| | | | Reserved | >> +| 0x2FFF | | | | | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x3000-| 0x10 | RW | T | VSR 0-63 | >> +| 0x303F | | | | | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0x3040-| | | | Reserved | >> +| 0xEFFF | | | | | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0xF000 | 0x08 | R | T | HDAR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0xF001 | 0x04 | R | T | HDSISR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0xF002 | 0x04 | R | T | HEIR | >> ++--------+-------+----+--------+----------------------------------+ >> +| 0xF003 | 0x08 | R | T | ASDR | >> ++--------+-------+----+--------+----------------------------------+ >> + >> +Miscellaneous info >> +================== >> + >> +State not in ptregs/hvregs >> +-------------------------- >> + >> +In the v1 API, some state is not in the ptregs/hvstate. This includes >> +the vector register and some SPRs. 
For the L1 to set this state for >> +the L2, the L1 loads up these hardware registers before the >> +h_enter_nested() call and the L0 ensures they end up as the L2 state >> +(by not touching them). >> + >> +The v2 API removes this and explicitly sets this state via the GSB. >> + >> +L1 Implementation details: Caching state >> +---------------------------------------- >> + >> +In the v1 API, all state is sent from the L1 to the L0 and vice versa >> +on every h_enter_nested() hcall. If the L0 is not currently running >> +any L2s, the L0 has no state information about them. The only >> +exception to this is the location of the partition table, registered >> +via h_set_partition_table(). >> + >> +The v2 API changes this so that the L0 retains the L2 state even when >> +it's vCPUs are no longer running. This means that the L1 only needs to >> +communicate with the L0 about L2 state when it needs to modify the L2 >> +state, or when it's value is out of date. This provides an opportunity >> +for performance optimisation. >> + >> +When a vCPU exits from a H_GUEST_RUN_VCPU() call, the L1 internally >> +marks all L2 state as invalid. This means that if the L1 wants to know >> +the L2 state (say via a kvm_get_one_reg() call), it needs to call >> +H_GUEST_GET_STATE() to get that state. Once it's read, it's marked as >> +valid in L1 until the L2 is run again. >> + >> +Also, when an L1 modifies L2 vcpu state, it doesn't need to write it >> +to the L0 until that L2 vcpu runs again. Hence when the L1 updates >> +state (say via a kvm_set_one_reg() call), it writes to an internal L1 >> +copy and only flushes this copy to the L0 when the L2 runs again via >> +the H_GUEST_VCPU_RUN() input buffer. >> + >> +This lazy updating of state by the L1 avoids unnecessary >> +H_GUEST_{G|S}ET_STATE() calls. 
>> + >> +References >> +========== >> + >> +For more details, please refer: >> + >> +[1] Kernel documentation (currently v4 on mailing list): >> + - https://lore.kernel.org/linuxppc-dev/20230905034658.82835-1-jniethe5@gmail.com/ >
diff --git a/docs/devel/nested-papr.txt b/docs/devel/nested-papr.txt
new file mode 100644
index 0000000000..c5c2ba7e50
--- /dev/null
+++ b/docs/devel/nested-papr.txt
@@ -0,0 +1,500 @@
+Nested PAPR API (aka KVM on PowerVM)
+====================================
+
+This API aims to enable nested virtualization with KVM on PowerVM. The
+existing support for nested KVM on PowerNV was introduced with the
+cap-nested-hv option. With a slight design change, a new cap-nested-papr
+option is added to enable this on papr/pseries, e.g.:
+
+  qemu-system-ppc64 -cpu POWER10 -machine pseries,cap-nested-papr=true ...
+
+Work by:
+    Michael Neuling <mikey@neuling.org>
+    Vaibhav Jain <vaibhav@linux.ibm.com>
+    Jordan Niethe <jniethe5@gmail.com>
+    Harsh Prateek Bora <harshpb@linux.ibm.com>
+    Shivaprasad G Bhat <sbhat@linux.ibm.com>
+    Kautuk Consul <kconsul@linux.vnet.ibm.com>
+
+The text below is taken from the kernel documentation:
+
+Introduction
+============
+
+This document explains how a guest operating system can act as a
+hypervisor and run nested guests through the use of hypercalls, if the
+hypervisor has implemented them. The terms L0, L1, and L2 are used to
+refer to different software entities. L0 is the hypervisor mode entity
+that would normally be called the "host" or "hypervisor". L1 is a
+guest virtual machine that is directly run under L0 and is initiated
+and controlled by L0. L2 is a guest virtual machine that is initiated
+and controlled by L1 acting as a hypervisor. A significant design change
+from the existing API is that the entire L2 state is now maintained
+within L0.
+
+Existing Nested-HV API
+======================
+
+Linux/KVM has had support for nesting as an L0 or L1 since 2018.
+
+The L0 code was added::
+
+   commit 8e3f5fc1045dc49fd175b978c5457f5f51e7a2ce
+   Author: Paul Mackerras <paulus@ozlabs.org>
+   Date:   Mon Oct 8 16:31:03 2018 +1100
+   KVM: PPC: Book3S HV: Framework and hcall stubs for nested virtualization
+
+The L1 code was added::
+
+   commit 360cae313702cdd0b90f82c261a8302fecef030a
+   Author: Paul Mackerras <paulus@ozlabs.org>
+   Date:   Mon Oct 8 16:31:04 2018 +1100
+   KVM: PPC: Book3S HV: Nested guest entry via hypercall
+
+This API works primarily using a single hcall, h_enter_nested(). This
+call is made by the L1 to tell the L0 to start an L2 vCPU with the given
+state. The L0 then starts this L2 and runs until an L2 exit condition
+is reached. Once the L2 exits, the state of the L2 is given back to
+the L1 by the L0. The full L2 vCPU state is always transferred from
+and to L1 when the L2 is run. The L0 doesn't keep any state on the L2
+vCPU (except in the short sequence in the L0 on L1 -> L2 entry and L2
+-> L1 exit).
+
+The only state kept by the L0 is the partition table. The L1 registers
+its partition table using the h_set_partition_table() hcall. All
+other state held by the L0 about the L2s is cached state (such as
+shadow page tables).
+
+The L1 may run any L2 or vCPU without first informing the L0. It
+simply starts the vCPU using h_enter_nested(). The creation of L2s and
+vCPUs is done implicitly whenever h_enter_nested() is called.
+
+In this document, we call this existing API the v1 API.
+
+New PAPR API
+============
+
+The new PAPR API changes from the v1 API such that the creation of an
+L2 and its associated vCPUs is explicit. In this document, we call this
+the v2 API.
+
+h_enter_nested() is replaced with H_GUEST_RUN_VCPU(). Before this can
+be called, the L1 must explicitly create the L2 using H_GUEST_CREATE()
+and create any associated vCPUs with H_GUEST_CREATE_VCPU(). Getting
+and setting vCPU state can also be performed using the
+H_GUEST_{G,S}ET_STATE() hcalls.
+
+The basic execution flow for an L1 to create an L2, run it, and
+delete it is:
+
+- L1 and L0 negotiate capabilities with H_GUEST_{G,S}ET_CAPABILITIES()
+  (normally at L1 boot time).
+
+- L1 requests the L0 to create an L2 with H_GUEST_CREATE() and receives a token
+
+- L1 requests the L0 to create an L2 vCPU with H_GUEST_CREATE_VCPU()
+
+- L1 and L0 communicate the vCPU state using the H_GUEST_{G,S}ET_STATE() hcalls
+
+- L1 requests the L0 to run the vCPU using the H_GUEST_RUN_VCPU() hcall
+
+- L1 deletes L2 with H_GUEST_DELETE()
+
+More details of the individual hcalls follow:
+
+HCALL Details
+=============
+
+This documentation is provided to give an overall understanding of the
+API. It doesn't aim to provide the full details required to implement
+an L1 or L0. Refer to the latest PAPR spec for more details.
+
+All these HCALLs are made by the L1 to the L0.
+
+H_GUEST_GET_CAPABILITIES()
+--------------------------
+
+This is called to get the capabilities of the L0 nested
+hypervisor. This includes capabilities such as the CPU versions (e.g.
+POWER9, POWER10) that are supported as L2s.
+
+H_GUEST_SET_CAPABILITIES()
+--------------------------
+
+This is called to inform the L0 of the capabilities of the L1
+hypervisor. The set of flags passed here is the same as for
+H_GUEST_GET_CAPABILITIES().
+
+Typically, GET will be called first and then SET will be called with a
+subset of the flags returned from GET. This process allows the L0 and
+L1 to negotiate an agreed set of capabilities.
+
+H_GUEST_CREATE()
+----------------
+
+This is called to create an L2. It returns the ID of the L2 created
+(similar to an LPID), which can be used on subsequent HCALLs to
+identify the L2.
+
+H_GUEST_CREATE_VCPU()
+---------------------
+
+This is called to create a vCPU associated with an L2. The L2 ID
+(returned from H_GUEST_CREATE()) should be passed in. Also passed in
+is a unique (for this L2) vCPU ID. This vCPU ID is allocated by the
+L1.
+
+H_GUEST_SET_STATE()
+-------------------
+
+This is called to set L2-wide or vCPU-specific L2 state. This info is
+passed via the Guest State Buffer (GSB), details below.
+
+This can set either L2-wide or vCPU-specific information. Examples of
+L2-wide state are the timebase offset and process-scoped page table
+info. Examples of vCPU-specific state are GPRs and VSRs. A bit in the
+flags parameter specifies if this call is L2-wide or vCPU-specific and
+the IDs in the GSB must match this.
+
+The L1 provides a pointer to the GSB as a parameter to this call. Also
+provided are the L2 and vCPU IDs associated with the state to set.
+
+The L1 writes all values in the GSB and the L0 only reads the GSB for
+this call.
+
+H_GUEST_GET_STATE()
+-------------------
+
+This is called to get state associated with an L2 or L2 vCPU. This info
+is passed via the GSB (details below).
+
+This can get either L2-wide or vCPU-specific information. Examples of
+L2-wide state are the timebase offset and process-scoped page table
+info. Examples of vCPU-specific state are GPRs and VSRs. A bit in the
+flags parameter specifies if this call is L2-wide or vCPU-specific and
+the IDs in the GSB must match this.
+
+The L1 provides a pointer to the GSB as a parameter to this call. Also
+provided are the L2 and vCPU IDs associated with the state to get.
+
+The L1 writes only the IDs and sizes in the GSB. L0 writes the
+associated values for each ID in the GSB.
+
+H_GUEST_RUN_VCPU()
+------------------
+
+This is called to run an L2 vCPU. The L2 and vCPU IDs are passed in as
+parameters. The vCPU runs with the state set previously using
+H_GUEST_SET_STATE(). When the L2 exits, the L1 will resume from this
+hcall.
+
+This hcall also has associated input and output GSBs. Unlike
+H_GUEST_{S,G}ET_STATE(), these GSB pointers are not passed in as
+parameters to the hcall (this was done in the interest of
+performance). The locations of these GSBs must be preregistered using
+the H_GUEST_SET_STATE() call with IDs 0x0C00 and 0x0C01 (see the table
+below).
+
+The input GSB may contain only vCPU-specific elements to be set. This
+GSB may also contain zero elements (i.e. 0 in the first 4 bytes of the
+GSB) if nothing needs to be set.
+
+On exit from the hcall, the output buffer is filled with elements
+determined by the L0. The reason for the exit is contained in GPR4 (i.e.
+NIP is put in GPR4). The elements returned depend on the exit
+type. For example, if the exit reason is the L2 doing a hcall (GPR4 =
+0xc00), then GPR3-12 are provided in the output GSB as this is the
+state likely needed to service the hcall. If additional state is
+needed, H_GUEST_GET_STATE() may be called by the L1.
+
+To synthesize interrupts in the L2, when calling H_GUEST_RUN_VCPU()
+the L1 may set a flag (as a hcall parameter) and the L0 will
+synthesize the interrupt in the L2. Alternatively, the L1 may
+synthesize the interrupt itself using H_GUEST_SET_STATE() or the
+H_GUEST_RUN_VCPU() input GSB to set the state appropriately.
+
+H_GUEST_DELETE()
+----------------
+
+This is called to delete an L2. All associated vCPUs are also
+deleted. No specific vCPU delete call is provided.
+
+A flag may be provided to delete all guests. This is used to reset the
+L0 in the case of kdump/kexec.
+
+Guest State Buffer (GSB)
+========================
+
+The Guest State Buffer (GSB) is the main method of communicating state
+about the L2 between the L1 and L0 via the H_GUEST_{G,S}ET_STATE() and
+H_GUEST_RUN_VCPU() calls.
+
+State may be associated with a whole L2 (e.g. timebase offset) or a
+specific L2 vCPU (e.g. GPR state). Only L2 vCPU state may be set by
+H_GUEST_RUN_VCPU().
+
+All data in the GSB is big endian (as is standard in PAPR).
+
+The Guest State Buffer has a header which gives the number of
+elements, followed by the GSB elements themselves.
+
+GSB header:
+
++----------+----------+-------------------------------------------+
+|  Offset  |  Size    | Purpose                                   |
+|  Bytes   |  Bytes   |                                           |
++==========+==========+===========================================+
+|    0     |    4     | Number of elements                        |
++----------+----------+-------------------------------------------+
+|    4     |          | Guest state buffer elements               |
++----------+----------+-------------------------------------------+
+
+GSB element:
+
++----------+----------+-------------------------------------------+
+|  Offset  |  Size    | Purpose                                   |
+|  Bytes   |  Bytes   |                                           |
++==========+==========+===========================================+
+|    0     |    2     | ID                                        |
++----------+----------+-------------------------------------------+
+|    2     |    2     | Size of Value                             |
++----------+----------+-------------------------------------------+
+|    4     | As above | Value                                     |
++----------+----------+-------------------------------------------+
+
+The ID in the GSB element specifies what is to be set. This includes
+architected state like GPRs, VSRs and SPRs, plus some metadata about
+the partition like the timebase offset and partition-scoped page
+table information.
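As an illustration of the header and element layouts above, the following Python sketch packs and walks a GSB. It is not taken from QEMU or the kernel: the helper names `pack_gsb` and `unpack_gsb` are hypothetical, and the sketch assumes only what the tables state, namely a 4-byte big-endian element count followed by elements made of a 2-byte ID, a 2-byte value size and the value itself.

```python
import struct


def pack_gsb(elements):
    """Serialize (element_id, value_bytes) pairs into a GSB (hypothetical helper).

    Layout assumed from the tables above: a 4-byte big-endian element
    count, then for each element a 2-byte ID, a 2-byte value size and
    the value bytes.
    """
    out = bytearray(struct.pack(">I", len(elements)))
    for elem_id, value in elements:
        out += struct.pack(">HH", elem_id, len(value))
        out += value
    return bytes(out)


def unpack_gsb(buf):
    """Walk a GSB and return a list of (element_id, value_bytes) pairs."""
    (count,) = struct.unpack_from(">I", buf, 0)
    offset = 4
    elements = []
    for _ in range(count):
        elem_id, size = struct.unpack_from(">HH", buf, offset)
        offset += 4
        if offset + size > len(buf):
            raise ValueError("GSB element overruns the buffer")
        elements.append((elem_id, bytes(buf[offset:offset + size])))
        offset += size
    return elements
```

With IDs taken from the ID table in this document, `pack_gsb([(0x1003, (42).to_bytes(8, "big"))])` would describe a single 8-byte GPR3 element, and `pack_gsb([])` yields the four zero bytes that the H_GUEST_RUN_VCPU() section describes for an input GSB with nothing to set.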
+ ++--------+-------+----+--------+----------------------------------+ +| ID | Size | RW | Thread | Details | +| | Bytes | | Guest | | +| | | | Scope | | ++========+=======+====+========+==================================+ +| 0x0000 | | RW | TG | NOP element | ++--------+-------+----+--------+----------------------------------+ +| 0x0001 | 0x08 | R | G | Size of L0 vCPU state | ++--------+-------+----+--------+----------------------------------+ +| 0x0002 | 0x08 | R | G | Size Run vCPU out buffer | ++--------+-------+----+--------+----------------------------------+ +| 0x0003 | 0x04 | RW | G | Logical PVR | ++--------+-------+----+--------+----------------------------------+ +| 0x0004 | 0x08 | RW | G | TB Offset (L1 relative) | ++--------+-------+----+--------+----------------------------------+ +| 0x0005 | 0x18 | RW | G |Partition scoped page tbl info: | +| | | | | | +| | | | |- 0x00 Addr part scope table | +| | | | |- 0x08 Num addr bits | +| | | | |- 0x10 Size root dir | ++--------+-------+----+--------+----------------------------------+ +| 0x0006 | 0x10 | RW | G |Process Table Information: | +| | | | | | +| | | | |- 0x0 Addr proc scope table | +| | | | |- 0x8 Table size. | ++--------+-------+----+--------+----------------------------------+ +| 0x0007-| | | | Reserved | +| 0x0BFF | | | | | ++--------+-------+----+--------+----------------------------------+ +| 0x0C00 | 0x10 | RW | T |Run vCPU Input Buffer: | +| | | | | | +| | | | |- 0x0 Addr of buffer | +| | | | |- 0x8 Buffer Size. | ++--------+-------+----+--------+----------------------------------+ +| 0x0C01 | 0x10 | RW | T |Run vCPU Output Buffer: | +| | | | | | +| | | | |- 0x0 Addr of buffer | +| | | | |- 0x8 Buffer Size. 
| ++--------+-------+----+--------+----------------------------------+ +| 0x0C02 | 0x08 | RW | T | vCPU VPA Address | ++--------+-------+----+--------+----------------------------------+ +| 0x0C03-| | | | Reserved | +| 0x0FFF | | | | | ++--------+-------+----+--------+----------------------------------+ +| 0x1000-| 0x08 | RW | T | GPR 0-31 | +| 0x101F | | | | | ++--------+-------+----+--------+----------------------------------+ +| 0x1020 | 0x08 | T | T | HDEC expiry TB | ++--------+-------+----+--------+----------------------------------+ +| 0x1021 | 0x08 | RW | T | NIA | ++--------+-------+----+--------+----------------------------------+ +| 0x1022 | 0x08 | RW | T | MSR | ++--------+-------+----+--------+----------------------------------+ +| 0x1023 | 0x08 | RW | T | LR | ++--------+-------+----+--------+----------------------------------+ +| 0x1024 | 0x08 | RW | T | XER | ++--------+-------+----+--------+----------------------------------+ +| 0x1025 | 0x08 | RW | T | CTR | ++--------+-------+----+--------+----------------------------------+ +| 0x1026 | 0x08 | RW | T | CFAR | ++--------+-------+----+--------+----------------------------------+ +| 0x1027 | 0x08 | RW | T | SRR0 | ++--------+-------+----+--------+----------------------------------+ +| 0x1028 | 0x08 | RW | T | SRR1 | ++--------+-------+----+--------+----------------------------------+ +| 0x1029 | 0x08 | RW | T | DAR | ++--------+-------+----+--------+----------------------------------+ +| 0x102A | 0x08 | RW | T | DEC expiry TB | ++--------+-------+----+--------+----------------------------------+ +| 0x102B | 0x08 | RW | T | VTB | ++--------+-------+----+--------+----------------------------------+ +| 0x102C | 0x08 | RW | T | LPCR | ++--------+-------+----+--------+----------------------------------+ +| 0x102D | 0x08 | RW | T | HFSCR | ++--------+-------+----+--------+----------------------------------+ +| 0x102E | 0x08 | RW | T | FSCR | 
++--------+-------+----+--------+----------------------------------+
+| 0x102F | 0x08  | RW | T      | FPSCR                            |
++--------+-------+----+--------+----------------------------------+
+| 0x1030 | 0x08  | RW | T      | DAWR0                            |
++--------+-------+----+--------+----------------------------------+
+| 0x1031 | 0x08  | RW | T      | DAWR1                            |
++--------+-------+----+--------+----------------------------------+
+| 0x1032 | 0x08  | RW | T      | CIABR                            |
++--------+-------+----+--------+----------------------------------+
+| 0x1033 | 0x08  | RW | T      | PURR                             |
++--------+-------+----+--------+----------------------------------+
+| 0x1034 | 0x08  | RW | T      | SPURR                            |
++--------+-------+----+--------+----------------------------------+
+| 0x1035 | 0x08  | RW | T      | IC                               |
++--------+-------+----+--------+----------------------------------+
+| 0x1036-| 0x08  | RW | T      | SPRG 0-3                         |
+| 0x1039 |       |    |        |                                  |
++--------+-------+----+--------+----------------------------------+
+| 0x103A | 0x08  | W  | T      | PPR                              |
++--------+-------+----+--------+----------------------------------+
+| 0x103B-| 0x08  | RW | T      | MMCR 0-3                         |
+| 0x103E |       |    |        |                                  |
++--------+-------+----+--------+----------------------------------+
+| 0x103F | 0x08  | RW | T      | MMCRA                            |
++--------+-------+----+--------+----------------------------------+
+| 0x1040 | 0x08  | RW | T      | SIER                             |
++--------+-------+----+--------+----------------------------------+
+| 0x1041 | 0x08  | RW | T      | SIER 2                           |
++--------+-------+----+--------+----------------------------------+
+| 0x1042 | 0x08  | RW | T      | SIER 3                           |
++--------+-------+----+--------+----------------------------------+
+| 0x1043 | 0x08  | RW | T      | BESCR                            |
++--------+-------+----+--------+----------------------------------+
+| 0x1044 | 0x08  | RW | T      | EBBHR                            |
++--------+-------+----+--------+----------------------------------+
+| 0x1045 | 0x08  | RW | T      | EBBRR                            |
++--------+-------+----+--------+----------------------------------+
+| 0x1046 | 0x08  | RW | T      | AMR                              |
++--------+-------+----+--------+----------------------------------+
+| 0x1047 | 0x08  | RW | T      | IAMR                             |
++--------+-------+----+--------+----------------------------------+
+| 0x1048 | 0x08  | RW | T      | AMOR                             |
++--------+-------+----+--------+----------------------------------+
+| 0x1049 | 0x08  | RW | T      | UAMOR                            |
++--------+-------+----+--------+----------------------------------+
+| 0x104A | 0x08  | RW | T      | SDAR                             |
++--------+-------+----+--------+----------------------------------+
+| 0x104B | 0x08  | RW | T      | SIAR                             |
++--------+-------+----+--------+----------------------------------+
+| 0x104C | 0x08  | RW | T      | DSCR                             |
++--------+-------+----+--------+----------------------------------+
+| 0x104D | 0x08  | RW | T      | TAR                              |
++--------+-------+----+--------+----------------------------------+
+| 0x104E | 0x08  | RW | T      | DEXCR                            |
++--------+-------+----+--------+----------------------------------+
+| 0x104F | 0x08  | RW | T      | HDEXCR                           |
++--------+-------+----+--------+----------------------------------+
+| 0x1050 | 0x08  | RW | T      | HASHKEYR                         |
++--------+-------+----+--------+----------------------------------+
+| 0x1051 | 0x08  | RW | T      | HASHPKEYR                        |
++--------+-------+----+--------+----------------------------------+
+| 0x1052 | 0x08  | RW | T      | CTRL                             |
++--------+-------+----+--------+----------------------------------+
+| 0x1053-|       |    |        | Reserved                         |
+| 0x1FFF |       |    |        |                                  |
++--------+-------+----+--------+----------------------------------+
+| 0x2000 | 0x04  | RW | T      | CR                               |
++--------+-------+----+--------+----------------------------------+
+| 0x2001 | 0x04  | RW | T      | PIDR                             |
++--------+-------+----+--------+----------------------------------+
+| 0x2002 | 0x04  | RW | T      | DSISR                            |
++--------+-------+----+--------+----------------------------------+
+| 0x2003 | 0x04  | RW | T      | VSCR                             |
++--------+-------+----+--------+----------------------------------+
+| 0x2004 | 0x04  | RW | T      | VRSAVE                           |
++--------+-------+----+--------+----------------------------------+
+| 0x2005 | 0x04  | RW | T      | DAWRX0                           |
++--------+-------+----+--------+----------------------------------+
+| 0x2006 | 0x04  | RW | T      | DAWRX1                           |
++--------+-------+----+--------+----------------------------------+
+| 0x2007-| 0x04  | RW | T      | PMC 1-6                          |
+| 0x200C |       |    |        |                                  |
++--------+-------+----+--------+----------------------------------+
+| 0x200D | 0x04  | RW | T      | WORT                             |
++--------+-------+----+--------+----------------------------------+
+| 0x200E | 0x04  | RW | T      | PSPB                             |
++--------+-------+----+--------+----------------------------------+
+| 0x200F-|       |    |        | Reserved                         |
+| 0x2FFF |       |    |        |                                  |
++--------+-------+----+--------+----------------------------------+
+| 0x3000-| 0x10  | RW | T      | VSR 0-63                         |
+| 0x303F |       |    |        |                                  |
++--------+-------+----+--------+----------------------------------+
+| 0x3040-|       |    |        | Reserved                         |
+| 0xEFFF |       |    |        |                                  |
++--------+-------+----+--------+----------------------------------+
+| 0xF000 | 0x08  | R  | T      | HDAR                             |
++--------+-------+----+--------+----------------------------------+
+| 0xF001 | 0x04  | R  | T      | HDSISR                           |
++--------+-------+----+--------+----------------------------------+
+| 0xF002 | 0x04  | R  | T      | HEIR                             |
++--------+-------+----+--------+----------------------------------+
+| 0xF003 | 0x08  | R  | T      | ASDR                             |
++--------+-------+----+--------+----------------------------------+
+
+Miscellaneous info
+==================
+
+State not in ptregs/hvregs
+--------------------------
+
+In the v1 API, some state is not in the ptregs/hvstate. This includes
+the vector registers and some SPRs. For the L1 to set this state for
+the L2, the L1 loads up these hardware registers before the
+h_enter_nested() call and the L0 ensures they end up as the L2 state
+(by not touching them).
+
+The v2 API removes this and explicitly sets this state via the GSB.
+
+L1 Implementation details: Caching state
+----------------------------------------
+
+In the v1 API, all state is sent from the L1 to the L0 and vice versa
+on every h_enter_nested() hcall.
+If the L0 is not currently running
+any L2s, the L0 has no state information about them. The only
+exception to this is the location of the partition table, registered
+via h_set_partition_table().
+
+The v2 API changes this so that the L0 retains the L2 state even when
+its vCPUs are no longer running. This means that the L1 only needs to
+communicate with the L0 about L2 state when it needs to modify the L2
+state, or when its value is out of date. This provides an opportunity
+for performance optimisation.
+
+When a vCPU exits from an H_GUEST_RUN_VCPU() call, the L1 internally
+marks all L2 state as invalid. This means that if the L1 wants to know
+the L2 state (say via a kvm_get_one_reg() call), it needs to call
+H_GUEST_GET_STATE() to get that state. Once it's read, it's marked as
+valid in L1 until the L2 is run again.
+
+Also, when an L1 modifies L2 vCPU state, it doesn't need to write it
+to the L0 until that L2 vCPU runs again. Hence when the L1 updates
+state (say via a kvm_set_one_reg() call), it writes to an internal L1
+copy and only flushes this copy to the L0 when the L2 runs again via
+the H_GUEST_RUN_VCPU() input buffer.
+
+This lazy updating of state by the L1 avoids unnecessary
+H_GUEST_{G|S}ET_STATE() calls.
+
+References
+==========
+
+For more details, please refer to:
+
+[1] Kernel documentation (currently v4 on mailing list):
+    - https://lore.kernel.org/linuxppc-dev/20230905034658.82835-1-jniethe5@gmail.com/
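
The lazy caching scheme described under "L1 Implementation details: Caching state" can be illustrated with a small, self-contained sketch. It is not taken from the patch or from KVM: the four-entry register file, the `l0_state` array standing in for the state the L0 retains, and every function name below are hypothetical, and the real hcalls exchange Guest State Buffers rather than single values. The sketch only shows the valid/dirty bookkeeping: reads fetch from the L0 at most once per run, writes stay local, and running the vCPU flushes dirty entries and invalidates everything.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_REGS 4                 /* hypothetical: tiny register file */

/* Hypothetical stand-in for the L2 state retained by the L0. */
static uint64_t l0_state[NUM_REGS];
static unsigned get_state_calls;   /* counts simulated H_GUEST_GET_STATE calls */

/* Hypothetical per-vCPU cache kept by the L1. */
struct l1_vcpu_cache {
    uint64_t val[NUM_REGS];
    bool valid[NUM_REGS];          /* L1 copy is usable without asking the L0 */
    bool dirty[NUM_REGS];          /* L1 copy modified, not yet sent to the L0 */
};

/* Simulated H_GUEST_GET_STATE for a single element. */
static uint64_t fake_h_guest_get_state(unsigned id)
{
    get_state_calls++;
    return l0_state[id];
}

/* Read path (think kvm_get_one_reg): ask the L0 only if the copy is invalid. */
static uint64_t l1_get_reg(struct l1_vcpu_cache *c, unsigned id)
{
    assert(id < NUM_REGS);
    if (!c->valid[id]) {
        c->val[id] = fake_h_guest_get_state(id);
        c->valid[id] = true;
    }
    return c->val[id];
}

/* Write path (think kvm_set_one_reg): update the L1 copy, defer the flush. */
static void l1_set_reg(struct l1_vcpu_cache *c, unsigned id, uint64_t v)
{
    assert(id < NUM_REGS);
    c->val[id] = v;
    c->valid[id] = true;
    c->dirty[id] = true;
}

/*
 * Simulated H_GUEST_RUN_VCPU: dirty entries travel to the L0 (in the real
 * API, via the input buffer); on exit all cached state is marked invalid
 * because the L2 may have changed it while running.
 */
static void l1_run_vcpu(struct l1_vcpu_cache *c)
{
    for (unsigned i = 0; i < NUM_REGS; i++) {
        if (c->dirty[i]) {
            l0_state[i] = c->val[i];
        }
        c->dirty[i] = false;
        c->valid[i] = false;
    }
}
```

In this sketch, two consecutive reads of the same register between runs trigger a single simulated H_GUEST_GET_STATE call, and a write followed by a run needs no separate H_GUEST_SET_STATE call at all, which is the saving the text describes.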