Message ID | 20200410085820.758686-1-gshan@redhat.com (mailing list archive) |
---|---|
Series | Support Async Page Fault |
Hi Gavin,

On 2020-04-10 09:58, Gavin Shan wrote:
> There are two stages of page faults and the stage one page fault is
> handled by guest itself. The guest is trapped to host when the page
> fault is caused by stage 2 page table, for example missing. The guest
> is suspended until the requested page is populated. To populate the
> requested page can be costly and might be related to IO activities
> if the page was swapped out previously. In this case, the guest has
> to suspend for a few milliseconds at least, regardless of the
> overall system load.
>
> The series adds support for asynchronous page fault to improve above
> situation. If it's costly to populate the requested page, a signal
> (PAGE_NOT_PRESENT) is sent to guest so that the faulting process can
> be rescheduled if it can be. Otherwise, it is put into power-saving
> mode. Another signal (PAGE_READY) is sent to guest once the requested
> page is populated so that the faulting process can be woken up
> from either waiting state or power-saving state.
>
> In order to fulfil the control flow and convey signals between host
> and guest, an IMPDEF system register (SYS_ASYNC_PF_EL1) is introduced.
> The register accepts control block's physical address, plus requested
> features. Also, the signal is sent using data abort with the specific
> IMPDEF Data Fault Status Code (DFSC). The specific signal is stored
> in the control block by host, to be consumed by guest.
>
> Todo
> ====
> * CONFIG_KVM_ASYNC_PF_SYNC is disabled for now because the exception
>   injection can't work in nested mode. It might be something to be
>   improved in future.
> * KVM_ASYNC_PF_SEND_ALWAYS is disabled even with CONFIG_PREEMPTION
>   because it's simply not working reliably.
> * Tracepoints, which should be done in the short term.
> * kvm-unit-test cases.
> * More testing and debugging are needed. Sometimes, the guest can be
>   stuck and the root cause needs to be figured out.

Let me add another few things:

- KVM/arm is (supposed to be) an architectural hypervisor. It means
  that one of the design goals is to have as few differences as possible
  from the actual hardware. I'm not keen on deviating from it (next
  thing you know, you'll add all the PV horror from Xen, HV, VMware...).

- The idea of butchering the arm64 mm subsystem to handle a new exotic
  style of exceptions is not something I am looking forward to. We
  might as well PV the whole MMU, Xen style, and be done with it. I'll
  let the arm64 maintainers comment on this though.

- We don't add IMPDEF sysregs, period. That's reserved for the HW. If
  you want to trap, there's the HVC instruction to that effect.

- If this is such a great improvement, where are the performance
  numbers?

- The fact that it apparently cannot work with nesting nor with
  preemption tends to indicate that it isn't future proof.

Thanks,

        M.
Hi Marc,

On 4/10/20 10:52 PM, Marc Zyngier wrote:
> Hi Gavin,
>
> On 2020-04-10 09:58, Gavin Shan wrote:
>> There are two stages of page faults and the stage one page fault is
>> handled by guest itself. The guest is trapped to host when the page
>> fault is caused by stage 2 page table, for example missing. The guest
>> is suspended until the requested page is populated. To populate the
>> requested page can be costly and might be related to IO activities
>> if the page was swapped out previously. In this case, the guest has
>> to suspend for a few milliseconds at least, regardless of the
>> overall system load.
>>
>> The series adds support for asynchronous page fault to improve above
>> situation. If it's costly to populate the requested page, a signal
>> (PAGE_NOT_PRESENT) is sent to guest so that the faulting process can
>> be rescheduled if it can be. Otherwise, it is put into power-saving
>> mode. Another signal (PAGE_READY) is sent to guest once the requested
>> page is populated so that the faulting process can be woken up
>> from either waiting state or power-saving state.
>>
>> In order to fulfil the control flow and convey signals between host
>> and guest, an IMPDEF system register (SYS_ASYNC_PF_EL1) is introduced.
>> The register accepts control block's physical address, plus requested
>> features. Also, the signal is sent using data abort with the specific
>> IMPDEF Data Fault Status Code (DFSC). The specific signal is stored
>> in the control block by host, to be consumed by guest.
>>
>> Todo
>> ====
>> * CONFIG_KVM_ASYNC_PF_SYNC is disabled for now because the exception
>>   injection can't work in nested mode. It might be something to be
>>   improved in future.
>> * KVM_ASYNC_PF_SEND_ALWAYS is disabled even with CONFIG_PREEMPTION
>>   because it's simply not working reliably.
>> * Tracepoints, which should be done in the short term.
>> * kvm-unit-test cases.
>> * More testing and debugging are needed. Sometimes, the guest can be
>>   stuck and the root cause needs to be figured out.
>
> Let me add another few things:
>
> - KVM/arm is (supposed to be) an architectural hypervisor. It means
>   that one of the design goals is to have as few differences as possible
>   from the actual hardware. I'm not keen on deviating from it (next
>   thing you know, you'll add all the PV horror from Xen, HV, VMware...).
>
> - The idea of butchering the arm64 mm subsystem to handle a new exotic
>   style of exceptions is not something I am looking forward to. We
>   might as well PV the whole MMU, Xen style, and be done with it. I'll
>   let the arm64 maintainers comment on this though.
>

Thanks for your comments. The feature won't be enabled on guest side
until CONFIG_KVM_GUEST is enabled. More details can be found from
PATCH[7/7]. So it would be one specific feature supported by KVM.

I'm not familiar with xen and would like to learn how MMU is
para-virtualized there. Do you have documents recommended to start
with? Otherwise, I will try google later.

> - We don't add IMPDEF sysregs, period. That's reserved for the HW. If
>   you want to trap, there's the HVC instruction to that effect.
>

Yes, HVC can be used for trapping as PV stolen time did. However, I
guess it's guarded by specification? For example, the para-virtualized
time calls are specified by DEN0057A, as highlighted in
include/linux/arm-smccc.h.

    /* Paravirtualised time calls (defined by ARM DEN0057A) */
    #define ARM_SMCCC_HV_PV_TIME_FEATURES                       \
            ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL,             \
                               ARM_SMCCC_SMC_64,                \
                               ARM_SMCCC_OWNER_STANDARD_HYP,    \
                               0x20)

    #define ARM_SMCCC_HV_PV_TIME_ST                             \
            ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL,             \
                               ARM_SMCCC_SMC_64,                \
                               ARM_SMCCC_OWNER_STANDARD_HYP,    \
                               0x21)

I really don't understand how IMPDEF sysreg is used by hardware vendors.
Do we have an existing functionality, which depends on IMPDEF sysreg?
I was thinking the IMPDEF sysreg can be used by software too, but
it seems I'm wrong.

> - If this is such a great improvement, where are the performance
>   numbers?
>

Yep, indeed. I'm still looking for appropriate workload currently and
hopefully, I can share performance data in RFCv2 :)

> - The fact that it apparently cannot work with nesting nor with
>   preemption tends to indicate that it isn't future proof.
>

I didn't make myself clear about the nesting. The data abort exception
is injected by tweaking ELR_EL1/SPSR_EL1 if the guest is running in
64-bit and EL1 mode. These registers are loaded when the guest gets
chance to run. However, it's impossible to inject two (nested) data
abort exceptions at once. It's something different from nested VM.

There was a hot discussion about the preemption support. It's something
in the TODO list and needs to be sorted out in future.

https://lore.kernel.org/patchwork/patch/1206121/

> Thanks,
>
>          M.
>

Thanks,
Gavin
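As a side note on the macros quoted above: ARM_SMCCC_CALL_VAL() packs
four fields into a 32-bit function ID. The sketch below reproduces that
packing in standalone C so the resulting values can be inspected. The
helper and constant names are made up for this illustration (they are
not the kernel's); the bit positions and the owner value 5 for standard
hypervisor services follow include/linux/arm-smccc.h and the SMCCC
document.

```c
#include <stdint.h>

/* Illustrative names only; the field positions mirror arm-smccc.h. */
#define SKETCH_FAST_CALL           1u  /* bit 31: fast call */
#define SKETCH_SMC_64              1u  /* bit 30: 64-bit calling convention */
#define SKETCH_OWNER_STANDARD_HYP  5u  /* bits 29:24: standard hypervisor service */

/* Pack type, calling convention, owner and function number into one ID. */
static inline uint32_t sketch_smccc_call_val(uint32_t type, uint32_t conv,
                                             uint32_t owner, uint32_t func)
{
    return (type << 31) | (conv << 30) |
           ((owner & 0x3Fu) << 24) | (func & 0xFFFFu);
}

/* The two PV time calls from the quoted header, rebuilt with the helper. */
static inline uint32_t sketch_pv_time_features(void)
{
    return sketch_smccc_call_val(SKETCH_FAST_CALL, SKETCH_SMC_64,
                                 SKETCH_OWNER_STANDARD_HYP, 0x20);
}

static inline uint32_t sketch_pv_time_st(void)
{
    return sketch_smccc_call_val(SKETCH_FAST_CALL, SKETCH_SMC_64,
                                 SKETCH_OWNER_STANDARD_HYP, 0x21);
}
```

With these definitions the two PV time IDs come out as 0xC5000020 and
0xC5000021, which is how they appear in DEN0057A.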
Hi Gavin,

On Tue, Apr 14, 2020 at 03:39:56PM +1000, Gavin Shan wrote:
> On 4/10/20 10:52 PM, Marc Zyngier wrote:
> > On 2020-04-10 09:58, Gavin Shan wrote:
> > > In order to fulfil the control flow and convey signals between host
> > > and guest, an IMPDEF system register (SYS_ASYNC_PF_EL1) is introduced.
> > > The register accepts control block's physical address, plus requested
> > > features. Also, the signal is sent using data abort with the specific
> > > IMPDEF Data Fault Status Code (DFSC). The specific signal is stored
> > > in the control block by host, to be consumed by guest.

> > - We don't add IMPDEF sysregs, period. That's reserved for the HW. If
> >   you want to trap, there's the HVC instruction to that effect.

> I really don't understand how IMPDEF sysreg is used by hardware vendors.
> Do we have an existing functionality, which depends on IMPDEF sysreg?
> I was thinking the IMPDEF sysreg can be used by software too, but
> it seems I'm wrong.

The key is in the name: an IMPLEMENTATION DEFINED register is defined by
the implementation (i.e. the specific CPU microarchitecture), so it's
wrong for software to come up with an arbitrary semantic as this will
differ from the implementation's defined semantic for the register.

Typically, IMP DEF registers are used for things that firmware needs to
do (e.g. enter/exit coherency), or for bringup-time debug (e.g. poking
into TLB/cache internals), and are not usually intended for general
purpose software.

Linux generally avoids the use of IMP DEF registers, but does so in some
cases (e.g. for PMUs) after FW explicitly describes that those are safe
to access.

Thanks,
Mark.
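To make "reserved for the implementation" concrete: the Arm architecture
sets aside part of the system register encoding space for
IMPLEMENTATION DEFINED registers, namely op0 == 3 with CRn == 0b1x11
(CRn 11 or 15). The sketch below is a minimal classifier for that rule.
The struct and function names are invented for this illustration, and
the encoding that the series picked for SYS_ASYNC_PF_EL1 is not shown in
this thread, so it is deliberately not reproduced here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the five encoding fields of a sysreg access. */
struct sketch_sysreg_enc {
    uint8_t op0, op1, crn, crm, op2;
};

/*
 * True when the encoding falls in the space the architecture reserves for
 * IMPLEMENTATION DEFINED registers: op0 == 3 and CRn == 0b1x11.
 */
static inline bool sketch_is_impdef_sysreg(struct sketch_sysreg_enc e)
{
    return e.op0 == 3 && (e.crn & 0xBu) == 0xBu;
}
```

For example, SCTLR_EL1 (op0=3, op1=0, CRn=1, CRm=0, op2=0) is an
architected register and is not matched, while anything of the form
S3_<op1>_C15_<Cm>_<op2> is.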
Hi Mark,

On 4/14/20 9:05 PM, Mark Rutland wrote:
> On Tue, Apr 14, 2020 at 03:39:56PM +1000, Gavin Shan wrote:
>> On 4/10/20 10:52 PM, Marc Zyngier wrote:
>>> On 2020-04-10 09:58, Gavin Shan wrote:
>>>> In order to fulfil the control flow and convey signals between host
>>>> and guest, an IMPDEF system register (SYS_ASYNC_PF_EL1) is introduced.
>>>> The register accepts control block's physical address, plus requested
>>>> features. Also, the signal is sent using data abort with the specific
>>>> IMPDEF Data Fault Status Code (DFSC). The specific signal is stored
>>>> in the control block by host, to be consumed by guest.
>
>>> - We don't add IMPDEF sysregs, period. That's reserved for the HW. If
>>>   you want to trap, there's the HVC instruction to that effect.
>
>> I really don't understand how IMPDEF sysreg is used by hardware vendors.
>> Do we have an existing functionality, which depends on IMPDEF sysreg?
>> I was thinking the IMPDEF sysreg can be used by software too, but
>> it seems I'm wrong.
>
> The key is in the name: an IMPLEMENTATION DEFINED register is defined by
> the implementation (i.e. the specific CPU microarchitecture), so it's
> wrong for software to come up with an arbitrary semantic as this will
> differ from the implementation's defined semantic for the register.
>
> Typically, IMP DEF registers are used for things that firmware needs to
> do (e.g. enter/exit coherency), or for bringup-time debug (e.g. poking
> into TLB/cache internals), and are not usually intended for general
> purpose software.
>
> Linux generally avoids the use of IMP DEF registers, but does so in some
> cases (e.g. for PMUs) after FW explicitly describes that those are safe
> to access.
>

Thanks for the explanation and details, which make things much clearer.
Since the IMPDEF system register can't be used this way, hypercall (HVC)
would be considered to serve same purpose - deliver signals from host to
guest. However, the hypercall number and behaviors are guarded by
specification. For example, the hypercalls used by para-virtualized
stolen time, which are defined in include/linux/arm-smccc.h, are
specified by ARM DEN0057A [1]. So I need a specification to be created,
where the hypercalls used by this feature are defined? If it's not
needed, can I pick hypercalls that aren't used and define their
behaviors by myself?

[1] http://infocenter.arm.com/help/topic/com.arm.doc.den0057a/DEN0057A_Paravirtualized_Time_for_Arm_based_Systems_v1_0.pdf

Another thing I want to check is about the ESR_EL1[DFSC]. In this
series, the asynchronous page fault is identified by IMPDEF DFSC (Data
Fault Status Code) in ESR_EL1. According to what we discussed, the
IMPDEF DFSC shouldn't be fired (produced) by software. It should only be
produced by hardware? What I understood is IMPDEF is hardware behavior.
If this is true, I need to avoid using IMPDEF DFSC in next revision :)

Thanks,
Gavin
On Thu, Apr 16, 2020 at 05:59:33PM +1000, Gavin Shan wrote:
> On 4/14/20 9:05 PM, Mark Rutland wrote:
> > On Tue, Apr 14, 2020 at 03:39:56PM +1000, Gavin Shan wrote:
> > > On 4/10/20 10:52 PM, Marc Zyngier wrote:
> > > > On 2020-04-10 09:58, Gavin Shan wrote:
> > > > > In order to fulfil the control flow and convey signals between host
> > > > > and guest, an IMPDEF system register (SYS_ASYNC_PF_EL1) is introduced.
> > > > > The register accepts control block's physical address, plus requested
> > > > > features. Also, the signal is sent using data abort with the specific
> > > > > IMPDEF Data Fault Status Code (DFSC). The specific signal is stored
> > > > > in the control block by host, to be consumed by guest.
> >
> > > > - We don't add IMPDEF sysregs, period. That's reserved for the HW. If
> > > >   you want to trap, there's the HVC instruction to that effect.
> >
> > > I really don't understand how IMPDEF sysreg is used by hardware vendors.
> > > Do we have an existing functionality, which depends on IMPDEF sysreg?
> > > I was thinking the IMPDEF sysreg can be used by software too, but
> > > it seems I'm wrong.
> >
> > The key is in the name: an IMPLEMENTATION DEFINED register is defined by
> > the implementation (i.e. the specific CPU microarchitecture), so it's
> > wrong for software to come up with an arbitrary semantic as this will
> > differ from the implementation's defined semantic for the register.
> >
> > Typically, IMP DEF registers are used for things that firmware needs to
> > do (e.g. enter/exit coherency), or for bringup-time debug (e.g. poking
> > into TLB/cache internals), and are not usually intended for general
> > purpose software.
> >
> > Linux generally avoids the use of IMP DEF registers, but does so in some
> > cases (e.g. for PMUs) after FW explicitly describes that those are safe
> > to access.
>
> Thanks for the explanation and details, which make things much clearer.
> Since the IMPDEF system register can't be used this way, hypercall (HVC)
> would be considered to serve same purpose - deliver signals from host to
> guest.

I'm not sure I follow how you'd use HVC to inject a signal into a guest;
the HVC would have to be issued by the guest to the host. Unless you're
injecting the signal via some other mechanism (e.g. an interrupt), and
the guest issues the HVC in response to that?

> However, the hypercall number and behaviors are guarded by
> specification. For example, the hypercalls used by para-virtualized
> stolen time, which are defined in include/linux/arm-smccc.h, are
> specified by ARM DEN0057A [1]. So I need a specification to be
> created, where the hypercalls used by this feature are defined? If
> it's not needed, can I pick hypercalls that aren't used and define
> their behaviors by myself?
>
> [1] http://infocenter.arm.com/help/topic/com.arm.doc.den0057a/DEN0057A_Paravirtualized_Time_for_Arm_based_Systems_v1_0.pdf

Take a look at the SMCCC / SMC Calling Convention:

https://developer.arm.com/docs/den0028/c

... that defines ranges set aside for hypervisor-specific usage, and
despite its name it also applies to HVC calls.

There's been intermittent work to add a probing story for that, so that
part is subject to change, but for prototyping you can just choose an
arbitrary number in that range -- just be sure to mention in the commit
and cover letter that this part isn't complete.

> Another thing I want to check is about the ESR_EL1[DFSC]. In this series,
> the asynchronous page fault is identified by IMPDEF DFSC (Data Fault Status
> Code) in ESR_EL1. According to what we discussed, the IMPDEF DFSC shouldn't
> be fired (produced) by software. It should only be produced by hardware?
> What I understood is IMPDEF is hardware behavior. If this is true, I need
> to avoid using IMPDEF DFSC in next revision :)

Yes, similar applies here.

If the guest is making a hypercall, you can return the fault info as the
response in GPRs, so I don't think you need to touch any architectural
fault registers.

Thanks,
Mark.
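For the "choose an arbitrary number in that range" advice, a prototype
can at least check that the ID it picked really decodes as a
hypervisor-specific call. The sketch below assumes the SMCCC function ID
layout (bit 31 fast call, bit 30 64-bit convention, bits 29:24 owner)
and owner value 6 for vendor-specific hypervisor services. The ID
0xC6000010 and all names are placeholders invented for this example, not
an allocation made by the series or by any specification.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_OWNER_VENDOR_HYP  6u           /* vendor-specific hypervisor service */
#define SKETCH_APF_PROTO_ID      0xC6000010u  /* hypothetical prototype function ID */

/* Extract the owner field (bits 29:24) from an SMCCC function ID. */
static inline uint32_t sketch_smccc_owner(uint32_t id)
{
    return (id >> 24) & 0x3Fu;
}

/* A fast call has bit 31 set. */
static inline bool sketch_smccc_is_fast(uint32_t id)
{
    return (id >> 31) != 0;
}

/* True when the ID is a fast call in the vendor-specific hypervisor range. */
static inline bool sketch_is_vendor_hyp_call(uint32_t id)
{
    return sketch_smccc_is_fast(id) &&
           sketch_smccc_owner(id) == SKETCH_OWNER_VENDOR_HYP;
}
```

The PV time ID 0xC5000020 has owner 5 (standard hypervisor service), so
it is rejected by sketch_is_vendor_hyp_call(), while the placeholder
0xC6000010 is accepted.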
On Thu, Apr 16, 2020 at 10:16:22AM +0100, Mark Rutland wrote:
> On Thu, Apr 16, 2020 at 05:59:33PM +1000, Gavin Shan wrote:
> > However, the hypercall number and behaviors are guarded by
> > specification. For example, the hypercalls used by para-virtualized
> > stolen time, which are defined in include/linux/arm-smccc.h, are
> > specified by ARM DEN0057A [1]. So I need a specification to be
> > created, where the hypercalls used by this feature are defined? If
> > it's not needed, can I pick hypercalls that aren't used and define
> > their behaviors by myself?
> >
> > [1] http://infocenter.arm.com/help/topic/com.arm.doc.den0057a/DEN0057A_Paravirtualized_Time_for_Arm_based_Systems_v1_0.pdf
>
> Take a look at the SMCCC / SMC Calling Convention:
>
> https://developer.arm.com/docs/den0028/c
>
> ... that defines ranges set aside for hypervisor-specific usage, and
> despite its name it also applies to HVC calls.
>
> There's been intermittent work to add a probing story for that, so that
> part is subject to change, but for prototyping you can just choose an
> arbitrary number in that range -- just be sure to mention in the commit
> and cover letter that this part isn't complete.

Right, might be simplest to start off with:

https://android-kvm.googlesource.com/linux/+/refs/heads/willdeacon/hvc

Will
Hi Mark,

On 4/16/20 7:16 PM, Mark Rutland wrote:
> On Thu, Apr 16, 2020 at 05:59:33PM +1000, Gavin Shan wrote:
>> On 4/14/20 9:05 PM, Mark Rutland wrote:
>>> On Tue, Apr 14, 2020 at 03:39:56PM +1000, Gavin Shan wrote:
>>>> On 4/10/20 10:52 PM, Marc Zyngier wrote:
>>>>> On 2020-04-10 09:58, Gavin Shan wrote:
>>>>>> In order to fulfil the control flow and convey signals between host
>>>>>> and guest, an IMPDEF system register (SYS_ASYNC_PF_EL1) is introduced.
>>>>>> The register accepts control block's physical address, plus requested
>>>>>> features. Also, the signal is sent using data abort with the specific
>>>>>> IMPDEF Data Fault Status Code (DFSC). The specific signal is stored
>>>>>> in the control block by host, to be consumed by guest.
>>>
>>>>> - We don't add IMPDEF sysregs, period. That's reserved for the HW. If
>>>>>   you want to trap, there's the HVC instruction to that effect.
>>>
>>>> I really don't understand how IMPDEF sysreg is used by hardware vendors.
>>>> Do we have an existing functionality, which depends on IMPDEF sysreg?
>>>> I was thinking the IMPDEF sysreg can be used by software too, but
>>>> it seems I'm wrong.
>>>
>>> The key is in the name: an IMPLEMENTATION DEFINED register is defined by
>>> the implementation (i.e. the specific CPU microarchitecture), so it's
>>> wrong for software to come up with an arbitrary semantic as this will
>>> differ from the implementation's defined semantic for the register.
>>>
>>> Typically, IMP DEF registers are used for things that firmware needs to
>>> do (e.g. enter/exit coherency), or for bringup-time debug (e.g. poking
>>> into TLB/cache internals), and are not usually intended for general
>>> purpose software.
>>>
>>> Linux generally avoids the use of IMP DEF registers, but does so in some
>>> cases (e.g. for PMUs) after FW explicitly describes that those are safe
>>> to access.
>>
>> Thanks for the explanation and details, which make things much clearer.
>> Since the IMPDEF system register can't be used this way, hypercall (HVC)
>> would be considered to serve same purpose - deliver signals from host to
>> guest.
>
> I'm not sure I follow how you'd use HVC to inject a signal into a guest;
> the HVC would have to be issued by the guest to the host. Unless you're
> injecting the signal via some other mechanism (e.g. an interrupt), and
> the guest issues the HVC in response to that?
>

Yeah, I expressed it in the wrong way. It should be - HVC is used by
guest to inject signal to host. Sorry for the confusion.

>> However, the hypercall number and behaviors are guarded by
>> specification. For example, the hypercalls used by para-virtualized
>> stolen time, which are defined in include/linux/arm-smccc.h, are
>> specified by ARM DEN0057A [1]. So I need a specification to be
>> created, where the hypercalls used by this feature are defined? If
>> it's not needed, can I pick hypercalls that aren't used and define
>> their behaviors by myself?
>>
>> [1] http://infocenter.arm.com/help/topic/com.arm.doc.den0057a/DEN0057A_Paravirtualized_Time_for_Arm_based_Systems_v1_0.pdf
>
> Take a look at the SMCCC / SMC Calling Convention:
>
> https://developer.arm.com/docs/den0028/c
>
> ... that defines ranges set aside for hypervisor-specific usage, and
> despite its name it also applies to HVC calls.
>
> There's been intermittent work to add a probing story for that, so that
> part is subject to change, but for prototyping you can just choose an
> arbitrary number in that range -- just be sure to mention in the commit
> and cover letter that this part isn't complete.
>

Sure, thanks for the pointer, which is very useful. Will already shared
the git repo link about the probing story. I'll take a look and come
back to you if I have more questions.

Yes, arbitrary numbers in the range is ok for prototyping.

>> Another thing I want to check is about the ESR_EL1[DFSC]. In this series,
>> the asynchronous page fault is identified by IMPDEF DFSC (Data Fault Status
>> Code) in ESR_EL1. According to what we discussed, the IMPDEF DFSC shouldn't
>> be fired (produced) by software. It should only be produced by hardware?
>> What I understood is IMPDEF is hardware behavior. If this is true, I need
>> to avoid using IMPDEF DFSC in next revision :)
>
> Yes, similar applies here.
>
> If the guest is making a hypercall, you can return the fault info as the
> response in GPRs, so I don't think you need to touch any architectural
> fault registers.
>

The guest passively receives the async page fault from the host. It
means there is no hypercall issued by guest. I think the asynchronous
property can be stored in control block by host and it's retrieved by
guest when the async page fault is handled. In this way, I don't need a
specific (IMPDEF) DFSC. Note the physical address of the control block
is passed to host when the functionality is enabled by HVC.

Thanks,
Gavin
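The control-block idea in the last paragraph could look roughly like the
sketch below: the host writes a reason into shared memory before
injecting the abort, and the guest's fault handler reads and clears it
to tell an async page fault from an ordinary one. Everything here is
hypothetical: the struct layout, the token field and the names are
invented for illustration, and the numeric values of the two signals
are not given in this thread. A real implementation would also need
proper memory ordering between host and guest, which this sketch does
not attempt.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reason codes for the two signals named in the cover letter. */
enum sketch_apf_reason {
    SKETCH_APF_NONE             = 0,
    SKETCH_APF_PAGE_NOT_PRESENT = 1,
    SKETCH_APF_PAGE_READY       = 2,
};

/* Hypothetical control block shared between host and guest. */
struct sketch_apf_control_block {
    uint32_t reason;  /* written by host, consumed by guest */
    uint32_t token;   /* assumed: pairs a PAGE_READY with its PAGE_NOT_PRESENT */
};

/*
 * Guest side: fetch the pending reason and clear it so the host can post
 * the next one. Returns SKETCH_APF_NONE for an ordinary data abort.
 */
static inline uint32_t sketch_apf_consume(volatile struct sketch_apf_control_block *cb)
{
    uint32_t reason = cb->reason;

    cb->reason = SKETCH_APF_NONE;
    return reason;
}

/* True when the abort being handled was an async page fault notification. */
static inline bool sketch_apf_is_async(uint32_t reason)
{
    return reason == SKETCH_APF_PAGE_NOT_PRESENT ||
           reason == SKETCH_APF_PAGE_READY;
}
```

In this arrangement the data abort itself carries an ordinary,
architected fault status code, and the "async" property lives only in
the control block, which is what removes the need for an IMPDEF DFSC.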