[RFCv1,0/7] Support Async Page Fault

Message ID 20200410085820.758686-1-gshan@redhat.com (mailing list archive)

Message

Gavin Shan April 10, 2020, 8:58 a.m. UTC
Page faults happen at two stages. A stage 1 page fault is handled by
the guest itself. The guest traps to the host when the fault is caused
by the stage 2 page table, for example a missing entry. The guest is
then suspended until the requested page is populated. Populating the
page can be costly and might involve I/O if the page was swapped out
previously. In that case, the guest is suspended for at least a few
milliseconds, regardless of the overall system load.

The series adds support for asynchronous page faults to improve the
above situation. If it's costly to populate the requested page, a
signal (PAGE_NOT_PRESENT) is sent to the guest so that the faulting
process can be rescheduled if possible. Otherwise, it is put into
power-saving mode. Another signal (PAGE_READY) is sent to the guest
once the requested page is populated so that the faulting process can
be woken up from either the waiting state or the power-saving state.

To implement the control flow and convey signals between host and
guest, an IMPDEF system register (SYS_ASYNC_PF_EL1) is introduced.
The register accepts the control block's physical address plus the
requested features. The signal is sent using a data abort with a
specific IMPDEF Data Fault Status Code (DFSC). The specific signal is
stored in the control block by the host, to be consumed by the guest.
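
As an illustration of "physical address plus requested features", the
following is a minimal sketch of how such a register value could be
packed and unpacked. The layout is an assumption, not taken from the
series: the feature bit names, the 64-byte alignment of the control
block, and the helper names are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: the cover letter only says the register carries
 * the control block's physical address plus the requested features. */
#define APF_ENABLE      (UINT64_C(1) << 0)  /* assumed feature bit */
#define APF_SEND_ALWAYS (UINT64_C(1) << 1)  /* assumed feature bit */
#define APF_FLAG_MASK   UINT64_C(0x3f)      /* low bits are free if the
                                             * block is 64-byte aligned */

/* Pack an aligned physical address and feature flags into one value. */
static inline uint64_t apf_encode(uint64_t block_pa, uint64_t features)
{
    assert((block_pa & APF_FLAG_MASK) == 0);   /* address must be aligned */
    assert((features & ~APF_FLAG_MASK) == 0);  /* flags must fit low bits */
    return block_pa | features;
}

/* Host-side decode: split the value back into address and flags. */
static inline uint64_t apf_block_pa(uint64_t reg)
{
    return reg & ~APF_FLAG_MASK;
}

static inline uint64_t apf_features(uint64_t reg)
{
    return reg & APF_FLAG_MASK;
}
```

Sharing one 64-bit value between the address and the flags means that
the guest can enable the feature and pass the address in one write.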

Todo
====
* CONFIG_KVM_ASYNC_PF_SYNC is disabled for now because the exception
  injection can't work in nested mode. It might be something to be
  improved in future.
* KVM_ASYNC_PF_SEND_ALWAYS is disabled even with CONFIG_PREEMPTION
  because it's simply not working reliably.
* Tracepoints, which should be done in the short term.
* kvm-unit-test cases.
* More testing and debugging are needed. Sometimes, the guest can be
  stuck and the root cause needs to be figured out.

PATCH[01] renames kvm_vcpu_get_hsr() to kvm_vcpu_get_esr() since the
          aarch32 host isn't supported.
PATCH[02] allows various helper functions to access ESR value from
          somewhere other than vCPU struct.
PATCH[03] replaces @hsr with @esr as aarch32 host isn't supported.
PATCH[04] exports kvm_handle_user_mem_abort(), which is used by the
          subsequent patch.
PATCH[05] introduces API to inject data abort with IMPDEF DFSC
PATCH[06] supports asynchronous page fault for host
PATCH[07] supports asynchronous page fault for guest

Testing
=======

Start a VM and put its QEMU process into a specific memory cgroup.
The cgroup's memory limit is less than the total amount of memory
assigned to the VM. For example, the VM is assigned 4GB of memory, but
the cgroup's limit is 2GB. After the VM boots, a program is run to
allocate (and access) all free memory. No system hang is found.
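
The test program itself is not part of the posting. A minimal sketch of
an "allocate and access" helper might look like the following; the
function name and the choice of touching one byte per page are
assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Allocate `bytes` of memory and write one byte in every page so that
 * each page is really faulted in, not just reserved. Returns the number
 * of pages touched, or 0 if the allocation failed.
 */
static size_t touch_memory(size_t bytes, size_t page_size)
{
    assert(page_size > 0);

    unsigned char *buf = malloc(bytes);
    if (buf == NULL)
        return 0;

    /* volatile keeps the compiler from dropping the stores */
    volatile unsigned char *p = buf;
    size_t pages = 0;

    for (size_t off = 0; off < bytes; off += page_size) {
        p[off] = (unsigned char)off;
        pages++;
    }

    free(buf);
    return pages;
}
```

Calling it in the guest with a size close to the free memory forces the
host to populate (and, under the cgroup limit, swap) the backing pages.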

Gavin Shan (7):
  kvm/arm64: Rename kvm_vcpu_get_hsr() to kvm_vcpu_get_esr()
  kvm/arm64: Detach ESR operator from vCPU struct
  kvm/arm64: Replace hsr with esr
  kvm/arm64: Export kvm_handle_user_mem_abort() with prefault mode
  kvm/arm64: Allow inject data abort with specified DFSC
  kvm/arm64: Support async page fault
  arm64: Support async page fault

 arch/arm64/Kconfig                       |  11 +
 arch/arm64/include/asm/exception.h       |   5 +
 arch/arm64/include/asm/kvm_emulate.h     |  87 +++----
 arch/arm64/include/asm/kvm_host.h        |  46 ++++
 arch/arm64/include/asm/kvm_para.h        |  55 +++++
 arch/arm64/include/asm/sysreg.h          |   3 +
 arch/arm64/include/uapi/asm/Kbuild       |   3 -
 arch/arm64/include/uapi/asm/kvm_para.h   |  22 ++
 arch/arm64/kernel/smp.c                  |  47 ++++
 arch/arm64/kvm/Kconfig                   |   1 +
 arch/arm64/kvm/Makefile                  |   2 +
 arch/arm64/kvm/handle_exit.c             |  48 ++--
 arch/arm64/kvm/hyp/switch.c              |  33 +--
 arch/arm64/kvm/hyp/vgic-v2-cpuif-proxy.c |   7 +-
 arch/arm64/kvm/inject_fault.c            |  38 ++-
 arch/arm64/kvm/sys_regs.c                |  91 +++++--
 arch/arm64/mm/fault.c                    | 239 ++++++++++++++++++-
 virt/kvm/arm/aarch32.c                   |  27 ++-
 virt/kvm/arm/arm.c                       |  36 ++-
 virt/kvm/arm/async_pf.c                  | 290 +++++++++++++++++++++++
 virt/kvm/arm/hyp/aarch32.c               |   4 +-
 virt/kvm/arm/hyp/vgic-v3-sr.c            |   7 +-
 virt/kvm/arm/mmio.c                      |  27 ++-
 virt/kvm/arm/mmu.c                       |  69 ++++--
 24 files changed, 1040 insertions(+), 158 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_para.h
 delete mode 100644 arch/arm64/include/uapi/asm/Kbuild
 create mode 100644 arch/arm64/include/uapi/asm/kvm_para.h
 create mode 100644 virt/kvm/arm/async_pf.c

Comments

Marc Zyngier April 10, 2020, 12:52 p.m. UTC | #1
Hi Gavin,

On 2020-04-10 09:58, Gavin Shan wrote:
> There are two stages of page faults and the stage one page fault is
> handled by guest itself. The guest is trapped to host when the page
> fault is caused by stage 2 page table, for example missing. The guest
> is suspended until the requested page is populated. To populate the
> requested page can be costly and might be related to IO activities
> if the page was swapped out previously. In this case, the guest has
> to suspend for a few of milliseconds at least, regardless of the
> overall system load.
> 
> The series adds support to asychornous page fault to improve above
> situation. If it's costly to populate the requested page, a signal
> (PAGE_NOT_PRESENT) is sent to guest so that the faulting process can
> be rescheduled if it can be. Otherwise, it is put into power-saving
> mode. Another signal (PAGE_READY) is sent to guest once the requested
> page is populated so that the faulting process can be waken up either
> from either waiting state or power-saving state.
> 
> In order to fulfil the control flow and convey signals between host
> and guest. A IMPDEF system register (SYS_ASYNC_PF_EL1) is introduced.
> The register accepts control block's physical address, plus requested
> features. Also, the signal is sent using data abort with the specific
> IMPDEF Data Fault Status Code (DFSC). The specific signal is stored
> in the control block by host, to be consumed by guest.
> 
> Todo
> ====
> * CONFIG_KVM_ASYNC_PF_SYNC is disabled for now because the exception
>   injection can't work in nested mode. It might be something to be
>   improved in future.
> * KVM_ASYNC_PF_SEND_ALWAYS is disabled even with CONFIG_PREEMPTION
>   because it's simply not working reliably.
> * Tracepoints, which should something to be done in short term.
> * kvm-unit-test cases.
> * More testing and debugging are needed. Sometimes, the guest can be
>   stuck and the root cause needs to be figured out.

Let me add another few things:

- KVM/arm is (supposed to be) an architectural hypervisor. It means
  that one of the design goals is to have as few differences as possible
  from the actual hardware. I'm not keen on deviating from it (next
  thing you know, you'll add all the PV horror from Xen, HV, VMware...). 

- The idea of butchering the arm64 mm subsystem to handle a new exotic
  style of exceptions is not something I am looking forward to. We
  might as well PV the whole MMU, Xen style, and be done with it. I'll
  let the arm64 maintainers comment on this though.

- We don't add IMPDEF sysregs, period. That's reserved for the HW. If
  you want to trap, there's the HVC instruction to that effect.

- If this is such a great improvement, where are the performance
  numbers?

- The fact that it apparently cannot work with nesting nor with
  preemption tends to indicate that it isn't future proof.

Thanks,

	M.
Gavin Shan April 14, 2020, 5:39 a.m. UTC | #2
Hi Marc,

On 4/10/20 10:52 PM, Marc Zyngier wrote:
> Hi Gavin,
> 
> On 2020-04-10 09:58, Gavin Shan wrote:
>> There are two stages of page faults and the stage one page fault is
>> handled by guest itself. The guest is trapped to host when the page
>> fault is caused by stage 2 page table, for example missing. The guest
>> is suspended until the requested page is populated. To populate the
>> requested page can be costly and might be related to IO activities
>> if the page was swapped out previously. In this case, the guest has
>> to suspend for a few of milliseconds at least, regardless of the
>> overall system load.
>>
>> The series adds support to asychornous page fault to improve above
>> situation. If it's costly to populate the requested page, a signal
>> (PAGE_NOT_PRESENT) is sent to guest so that the faulting process can
>> be rescheduled if it can be. Otherwise, it is put into power-saving
>> mode. Another signal (PAGE_READY) is sent to guest once the requested
>> page is populated so that the faulting process can be waken up either
>> from either waiting state or power-saving state.
>>
>> In order to fulfil the control flow and convey signals between host
>> and guest. A IMPDEF system register (SYS_ASYNC_PF_EL1) is introduced.
>> The register accepts control block's physical address, plus requested
>> features. Also, the signal is sent using data abort with the specific
>> IMPDEF Data Fault Status Code (DFSC). The specific signal is stored
>> in the control block by host, to be consumed by guest.
>>
>> Todo
>> ====
>> * CONFIG_KVM_ASYNC_PF_SYNC is disabled for now because the exception
>>    injection can't work in nested mode. It might be something to be
>>    improved in future.
>> * KVM_ASYNC_PF_SEND_ALWAYS is disabled even with CONFIG_PREEMPTION
>>    because it's simply not working reliably.
>> * Tracepoints, which should something to be done in short term.
>> * kvm-unit-test cases.
>> * More testing and debugging are needed. Sometimes, the guest can be
>>    stuck and the root cause needs to be figured out.
> 
> Let me add another few things:
> 
> - KVM/arm is (supposed to be) an architectural hypervisor. It means
>    that one of the design goal is to have as few differences as possible
>    from the actual hardware. I'm not keen on deviating from it (next
>    thing you know, you'll add all the PV horror from Xen, HV, VMware...).
> 
> - The idea of butchering the arm64 mm subsystem to handle a new exotic
>    style of exceptions is not something I am looking forward to. We
>    might as well PV the whole MMU, Xen style, and be done with it. I'll
>    let the arm64 maintainers comment on this though.
> 

Thanks for your comments. The feature won't be enabled on the guest side
unless CONFIG_KVM_GUEST is enabled. More details can be found in PATCH[7/7].
So it would be a specific feature supported by KVM. I'm not familiar with
Xen and would like to learn how the MMU is para-virtualized there. Do you
have documents you'd recommend to start with? Otherwise, I will try Google
later.

> - We don't add IMPDEF sysregs, period. That's reserved for the HW. If
>    you want to trap, there's the HVC instruction to that effect.
> 

Yes, HVC can be used for trapping, as PV stolen time does. However, I guess
it's governed by a specification? For example, the para-virtualized time calls
are specified by DEN0057A, as highlighted in include/linux/arm-smccc.h.

/* Paravirtualised time calls (defined by ARM DEN0057A) */
#define ARM_SMCCC_HV_PV_TIME_FEATURES                           \
         ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL,                 \
                            ARM_SMCCC_SMC_64,                    \
                            ARM_SMCCC_OWNER_STANDARD_HYP,        \
                            0x20)

#define ARM_SMCCC_HV_PV_TIME_ST                                 \
         ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL,                 \
                            ARM_SMCCC_SMC_64,                    \
                            ARM_SMCCC_OWNER_STANDARD_HYP,        \
                            0x21)
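
For reference, the sketch below shows how such a function ID is
composed from its fields: bit 31 selects a fast call, bit 30 the 64-bit
calling convention, bits 29:24 the owner, and bits 15:0 the function
number. The names here are local stand-ins that mirror the kernel
macros; they are not copied from the header.

```c
#include <assert.h>
#include <stdint.h>

#define SMCCC_FAST_CALL          1u  /* bit 31 */
#define SMCCC_SMC_64             1u  /* bit 30 */
#define SMCCC_OWNER_STANDARD_HYP 5u  /* bits 29:24 */

/* Compose an SMCCC function ID from its four fields. */
static inline uint32_t smccc_call_val(uint32_t type, uint32_t conv,
                                      uint32_t owner, uint32_t func)
{
    assert(type <= 1 && conv <= 1);
    return (type << 31) | (conv << 30) |
           ((owner & 0x3fu) << 24) | (func & 0xffffu);
}
```

With these fields, function numbers 0x20 and 0x21 come out as
0xC5000020 and 0xC5000021.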

I really don't understand how IMPDEF sysregs are used by hardware vendors.
Do we have existing functionality that depends on an IMPDEF sysreg?
I was thinking an IMPDEF sysreg could be used by software as well, but
it seems I'm wrong.

> - If this is such a great improvement, where are the performance
>    numbers?
> 

Yep, indeed. I'm still looking for an appropriate workload and hopefully
I can share performance data in RFCv2 :)

> - The fact that it apparently cannot work with nesting nor with
>    preemption tends to indicate that it isn't future proof.
> 

I didn't make myself clear about the nesting. The data abort exception is injected
by tweaking ELR_EL1/SPSR_EL1 if the guest is running in 64-bit EL1 mode. These
registers are loaded when the guest gets a chance to run. However, it's impossible to
inject two (nested) data abort exceptions at once. That is different from a nested
VM.
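
A simplified model of that injection shows why two injections cannot be
stacked. This is only a sketch, not the KVM code: the struct is
hypothetical, and it assumes the guest was running at EL1h so the
synchronous vector at offset 0x200 from VBAR_EL1 is used.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced view of the guest state touched by injection. */
struct vcpu_model {
    uint64_t pc, pstate;
    uint64_t elr_el1, spsr_el1, vbar_el1;
};

#define PSTATE_EL1H_MASKED UINT64_C(0x3c5) /* EL1h with D, A, I, F masked */
#define VECTOR_CUR_EL_SYNC UINT64_C(0x200) /* sync, current EL, SP_ELx */

/* Emulate taking a data abort to EL1 on behalf of the guest. */
static void inject_data_abort(struct vcpu_model *v)
{
    assert(v);
    v->elr_el1  = v->pc;      /* return address for the handler */
    v->spsr_el1 = v->pstate;  /* interrupted processor state */
    v->pc       = v->vbar_el1 + VECTOR_CUR_EL_SYNC;
    v->pstate   = PSTATE_EL1H_MASKED;
}
```

There is a single ELR_EL1/SPSR_EL1 pair, so a second call before the
guest has run its handler overwrites the saved return address with the
vector address, and the original context is lost.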

There was a heated discussion about preemption support. It's on the TODO list
and needs to be sorted out in the future.

https://lore.kernel.org/patchwork/patch/1206121/


> Thanks,
> 
> 	M.
> 

Thanks,
Gavin
Mark Rutland April 14, 2020, 11:05 a.m. UTC | #3
Hi Gavin,

On Tue, Apr 14, 2020 at 03:39:56PM +1000, Gavin Shan wrote:
> On 4/10/20 10:52 PM, Marc Zyngier wrote:
> > On 2020-04-10 09:58, Gavin Shan wrote:
> > > In order to fulfil the control flow and convey signals between host
> > > and guest. A IMPDEF system register (SYS_ASYNC_PF_EL1) is introduced.
> > > The register accepts control block's physical address, plus requested
> > > features. Also, the signal is sent using data abort with the specific
> > > IMPDEF Data Fault Status Code (DFSC). The specific signal is stored
> > > in the control block by host, to be consumed by guest.

> > - We don't add IMPDEF sysregs, period. That's reserved for the HW. If
> >    you want to trap, there's the HVC instruction to that effect.

> I really don't understand how IMPDEF sysreg is used by hardware vendors.
> Do we have an existing functionality, which depends on IMPDEF sysreg?
> I was thinking the IMPDEF sysreg can be used by software either, but
> it seems I'm wrong.

The key is in the name: an IMPLEMENTATION DEFINED register is defined by
the implementation (i.e. the specific CPU microarchitecture), so it's
wrong for software to come up with an arbitrary semantic as this will
differ from the implementation's defined semantic for the register.

Typically, IMP DEF registers are used for things that firmware needs to
do (e.g. enter/exit coherency), or for bringup-time debug (e.g. poking
into TLB/cache internals), and are not usually intended for general
purpose software.

Linux generally avoids the use of IMP DEF registers, but does so in some
cases (e.g. for PMUs) after FW explicitly describes that those are safe
to access.

Thanks,
Mark.
Gavin Shan April 16, 2020, 7:59 a.m. UTC | #4
Hi Mark,

On 4/14/20 9:05 PM, Mark Rutland wrote:
> On Tue, Apr 14, 2020 at 03:39:56PM +1000, Gavin Shan wrote:
>> On 4/10/20 10:52 PM, Marc Zyngier wrote:
>>> On 2020-04-10 09:58, Gavin Shan wrote:
>>>> In order to fulfil the control flow and convey signals between host
>>>> and guest. A IMPDEF system register (SYS_ASYNC_PF_EL1) is introduced.
>>>> The register accepts control block's physical address, plus requested
>>>> features. Also, the signal is sent using data abort with the specific
>>>> IMPDEF Data Fault Status Code (DFSC). The specific signal is stored
>>>> in the control block by host, to be consumed by guest.
> 
>>> - We don't add IMPDEF sysregs, period. That's reserved for the HW. If
>>>     you want to trap, there's the HVC instruction to that effect.
> 
>> I really don't understand how IMPDEF sysreg is used by hardware vendors.
>> Do we have an existing functionality, which depends on IMPDEF sysreg?
>> I was thinking the IMPDEF sysreg can be used by software either, but
>> it seems I'm wrong.
> 
> The key is in the name: an IMPLEMENTATION DEFINED register is defined by
> the implementation (i.e. the specific CPU microarchitecture), so it's
> wrong for software to come up with an arbitrary semantic as this will
> differ from the implementation's defined semantic for the register.
> 
> Typically, IMP DEF resgisters are used for things that firmware needs to
> do (e.g. enter/exit coherency), or for bringup-time debug (e.g. poking
> into TLB/cache internals), and are not usually intended for general
> purpose software.
> 
> Linux generally avoids the use of IMP DEF registers, but does so in some
> cases (e.g. for PMUs) after FW explicitly describes that those are safe
> to access.
> 

Thanks for the explanation and details, which make things much clearer. Since
the IMPDEF system register can't be used this way, a hypercall (HVC) would
be considered to serve the same purpose: delivering signals from host to guest.
However, the hypercall numbers and behaviors are governed by specification. For
example, the hypercalls used by para-virtualized stolen time, which are defined
in include/linux/arm-smccc.h, are specified by ARM DEN0057A [1]. So do I need a
specification to be created, where the hypercalls used by this feature are
defined? If it's not needed, can I pick hypercalls that aren't used and define
their behaviors myself?

[1] http://infocenter.arm.com/help/topic/com.arm.doc.den0057a/DEN0057A_Paravirtualized_Time_for_Arm_based_Systems_v1_0.pdf

Another thing I want to check is ESR_EL1[DFSC]. In this series, the
asynchronous page fault is identified by an IMPDEF DFSC (Data Fault Status
Code) in ESR_EL1. According to what we discussed, the IMPDEF DFSC shouldn't
be raised (produced) by software. Should it be produced only by hardware?
My understanding is that IMPDEF is hardware behavior. If this is true, I need
to avoid using the IMPDEF DFSC in the next revision :)


Thanks,
Gavin
Mark Rutland April 16, 2020, 9:16 a.m. UTC | #5
On Thu, Apr 16, 2020 at 05:59:33PM +1000, Gavin Shan wrote:
> On 4/14/20 9:05 PM, Mark Rutland wrote:
> > On Tue, Apr 14, 2020 at 03:39:56PM +1000, Gavin Shan wrote:
> > > On 4/10/20 10:52 PM, Marc Zyngier wrote:
> > > > On 2020-04-10 09:58, Gavin Shan wrote:
> > > > > In order to fulfil the control flow and convey signals between host
> > > > > and guest. A IMPDEF system register (SYS_ASYNC_PF_EL1) is introduced.
> > > > > The register accepts control block's physical address, plus requested
> > > > > features. Also, the signal is sent using data abort with the specific
> > > > > IMPDEF Data Fault Status Code (DFSC). The specific signal is stored
> > > > > in the control block by host, to be consumed by guest.
> > 
> > > > - We don't add IMPDEF sysregs, period. That's reserved for the HW. If
> > > >     you want to trap, there's the HVC instruction to that effect.
> > 
> > > I really don't understand how IMPDEF sysreg is used by hardware vendors.
> > > Do we have an existing functionality, which depends on IMPDEF sysreg?
> > > I was thinking the IMPDEF sysreg can be used by software either, but
> > > it seems I'm wrong.
> > 
> > The key is in the name: an IMPLEMENTATION DEFINED register is defined by
> > the implementation (i.e. the specific CPU microarchitecture), so it's
> > wrong for software to come up with an arbitrary semantic as this will
> > differ from the implementation's defined semantic for the register.
> > 
> > Typically, IMP DEF resgisters are used for things that firmware needs to
> > do (e.g. enter/exit coherency), or for bringup-time debug (e.g. poking
> > into TLB/cache internals), and are not usually intended for general
> > purpose software.
> > 
> > Linux generally avoids the use of IMP DEF registers, but does so in some
> > cases (e.g. for PMUs) after FW explicitly describes that those are safe
> > to access.
> 
> Thanks for the explanation and details, which make things much clear. Since
> the IMPDEF system register can't be used like this way, hypercall (HVC) would
> be considered to serve same purpose - deliver signals from host to guest.

I'm not sure I follow how you'd use HVC to inject a signal into a guest;
the HVC would have to be issued by the guest to the host. Unless you're
injecting the signal via some other mechanism (e.g. an interrupt), and
the guest issues the HVC in response to that?

> However, the hypercall number and behaviors are guarded by
> specification. For example, the hypercalls used by para-virtualized
> stolen time, which are defined in include/linux/arm-smccc.h, are
> specified by ARM DEN0057A [1]. So I need a specification to be
> created, where the hypercalls used by this feature are defined? If
> it's not needed, can I pick hypercalls that aren't used and define
> their behaviors by myself?
> 
> [1] http://infocenter.arm.com/help/topic/com.arm.doc.den0057a/DEN0057A_Paravirtualized_Time_for_Arm_based_Systems_v1_0.pdf

Take a look at the SMCCC / SMC Calling Convention:

 https://developer.arm.com/docs/den0028/c

... that defines ranges set aside for hypervisor-specific usage, and
despite its name it also applies to HVC calls.

There's been intermittent work to add a probing story for that, so that
part is subject to change, but for prototyping you can just choose an
arbitrary number in that range -- just be sure to mention in the commit
and cover letter that this part isn't complete.
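
A minimal sketch of picking a prototype ID follows. It assumes the
range meant here is the SMCCC "vendor specific hypervisor service"
owner (value 6 in bits 29:24); the function number 0x10 and the helper
names are arbitrary and hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SMCCC_OWNER_SHIFT      24
#define SMCCC_OWNER_MASK       0x3fu
#define SMCCC_OWNER_VENDOR_HYP 6u     /* assumed to be the intended range */
#define APF_PROTO_FUNC_NUM     0x10u  /* arbitrary, prototype only */

/* True if `fid` is a fast call owned by the vendor hypervisor range. */
static inline bool smccc_is_vendor_hyp(uint32_t fid)
{
    bool fast = (fid >> 31) != 0;
    uint32_t owner = (fid >> SMCCC_OWNER_SHIFT) & SMCCC_OWNER_MASK;

    return fast && owner == SMCCC_OWNER_VENDOR_HYP;
}

/* Build a 64-bit fast-call ID in that range for prototyping. */
static inline uint32_t apf_prototype_fid(void)
{
    uint32_t fid = (1u << 31) | (1u << 30) |
                   (SMCCC_OWNER_VENDOR_HYP << SMCCC_OWNER_SHIFT) |
                   APF_PROTO_FUNC_NUM;

    assert(smccc_is_vendor_hyp(fid));
    return fid;
}
```

A standard hypervisor ID such as 0xC5000020 (owner 5) does not pass the
check, which keeps prototype numbers clear of the specified calls.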

> Another thing I want to check is about the ESR_EL1[DFSC]. In this series,
> the asynchronous page fault is identified by IMPDEF DFSC (Data Fault Status
> Code) in ESR_EL1. According to what we discussed, the IMPDEF DFSC shouldn't
> be fired (produced) by software. It should be produced by hardware either?
> What I understood is IMPDEF is hardware behavior. If this is true, I need
> to avoid using IMPDEF DFSC in next revision :)

Yes, similar applies here.

If the guest is making a hypercall, you can return the fault info as the
response in GPRs, so I don't think you need to touch any architectural
fault registers.

Thanks,
Mark.
Will Deacon April 16, 2020, 9:21 a.m. UTC | #6
On Thu, Apr 16, 2020 at 10:16:22AM +0100, Mark Rutland wrote:
> On Thu, Apr 16, 2020 at 05:59:33PM +1000, Gavin Shan wrote:
> > However, the hypercall number and behaviors are guarded by
> > specification. For example, the hypercalls used by para-virtualized
> > stolen time, which are defined in include/linux/arm-smccc.h, are
> > specified by ARM DEN0057A [1]. So I need a specification to be
> > created, where the hypercalls used by this feature are defined? If
> > it's not needed, can I pick hypercalls that aren't used and define
> > their behaviors by myself?
> > 
> > [1] http://infocenter.arm.com/help/topic/com.arm.doc.den0057a/DEN0057A_Paravirtualized_Time_for_Arm_based_Systems_v1_0.pdf
> 
> Take a look at the SMCCC / SMC Calling Convention:
> 
>  https://developer.arm.com/docs/den0028/c
> 
> ... that defines ranges set aside for hypervisor-specific usage, and
> despite its name it also applies to HVC calls.
> 
> There's been intermittent work to add a probing story for that, so that
> part is subject to change, but for prototyping you can just choose an
> arbitray number in that range -- just be suere to mention in the commit
> and cover letter that this part isn't complete.

Right, might be simplest to start off with:

https://android-kvm.googlesource.com/linux/+/refs/heads/willdeacon/hvc

Will
Gavin Shan April 17, 2020, 10:34 a.m. UTC | #7
Hi Mark,

On 4/16/20 7:16 PM, Mark Rutland wrote:
> On Thu, Apr 16, 2020 at 05:59:33PM +1000, Gavin Shan wrote:
>> On 4/14/20 9:05 PM, Mark Rutland wrote:
>>> On Tue, Apr 14, 2020 at 03:39:56PM +1000, Gavin Shan wrote:
>>>> On 4/10/20 10:52 PM, Marc Zyngier wrote:
>>>>> On 2020-04-10 09:58, Gavin Shan wrote:
>>>>>> In order to fulfil the control flow and convey signals between host
>>>>>> and guest. A IMPDEF system register (SYS_ASYNC_PF_EL1) is introduced.
>>>>>> The register accepts control block's physical address, plus requested
>>>>>> features. Also, the signal is sent using data abort with the specific
>>>>>> IMPDEF Data Fault Status Code (DFSC). The specific signal is stored
>>>>>> in the control block by host, to be consumed by guest.
>>>
>>>>> - We don't add IMPDEF sysregs, period. That's reserved for the HW. If
>>>>>      you want to trap, there's the HVC instruction to that effect.
>>>
>>>> I really don't understand how IMPDEF sysreg is used by hardware vendors.
>>>> Do we have an existing functionality, which depends on IMPDEF sysreg?
>>>> I was thinking the IMPDEF sysreg can be used by software either, but
>>>> it seems I'm wrong.
>>>
>>> The key is in the name: an IMPLEMENTATION DEFINED register is defined by
>>> the implementation (i.e. the specific CPU microarchitecture), so it's
>>> wrong for software to come up with an arbitrary semantic as this will
>>> differ from the implementation's defined semantic for the register.
>>>
>>> Typically, IMP DEF resgisters are used for things that firmware needs to
>>> do (e.g. enter/exit coherency), or for bringup-time debug (e.g. poking
>>> into TLB/cache internals), and are not usually intended for general
>>> purpose software.
>>>
>>> Linux generally avoids the use of IMP DEF registers, but does so in some
>>> cases (e.g. for PMUs) after FW explicitly describes that those are safe
>>> to access.
>>
>> Thanks for the explanation and details, which make things much clear. Since
>> the IMPDEF system register can't be used like this way, hypercall (HVC) would
>> be considered to serve same purpose - deliver signals from host to guest.
> 
> I'm not sure I follow how you'd use HVC to inject a signal into a guest;
> the HVC would have to be issued by the guest to the host. Unless you're
> injecting the signal via some other mechanism (e.g. an interrupt), and
> the guest issues the HVC in response to that?
> 

Yeah, I expressed it the wrong way. It should be: HVC is used by the guest
to send a signal to the host. Sorry for the confusion.

>> However, the hypercall number and behaviors are guarded by
>> specification. For example, the hypercalls used by para-virtualized
>> stolen time, which are defined in include/linux/arm-smccc.h, are
>> specified by ARM DEN0057A [1]. So I need a specification to be
>> created, where the hypercalls used by this feature are defined? If
>> it's not needed, can I pick hypercalls that aren't used and define
>> their behaviors by myself?
>>
>> [1] http://infocenter.arm.com/help/topic/com.arm.doc.den0057a/DEN0057A_Paravirtualized_Time_for_Arm_based_Systems_v1_0.pdf
> 
> Take a look at the SMCCC / SMC Calling Convention:
> 
>   https://developer.arm.com/docs/den0028/c
> 
> ... that defines ranges set aside for hypervisor-specific usage, and
> despite its name it also applies to HVC calls.
> 
> There's been intermittent work to add a probing story for that, so that
> part is subject to change, but for prototyping you can just choose an
> arbitray number in that range -- just be suere to mention in the commit
> and cover letter that this part isn't complete.
> 

Sure, thanks for the pointer, which is very useful. Will has already shared
the git repo link about the probing story. I'll take a look and come back
to you if I have more questions. Yes, an arbitrary number in the range is
OK for prototyping.

>> Another thing I want to check is about the ESR_EL1[DFSC]. In this series,
>> the asynchronous page fault is identified by IMPDEF DFSC (Data Fault Status
>> Code) in ESR_EL1. According to what we discussed, the IMPDEF DFSC shouldn't
>> be fired (produced) by software. It should be produced by hardware either?
>> What I understood is IMPDEF is hardware behavior. If this is true, I need
>> to avoid using IMPDEF DFSC in next revision :)
> 
> Yes, similar applies here.
> 
> If the guest is making a hypercall, you can return the fault info as the
> response in GPRs, so I don't think you need to touch any architectural
> fault registers.
> 

The guest passively receives the async page fault from the host, which means
there is no hypercall issued by the guest. I think the asynchronous property can
be stored in the control block by the host and retrieved by the guest when the
async page fault is handled. That way, I don't need a specific (IMPDEF) DFSC.
Note that the physical address of the control block is passed to the host when
the functionality is enabled by HVC.
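
The control-block handshake described above could be sketched as
follows. The structure layout, the reason values, and the token field
are assumptions for illustration; the thread does not define them.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reason codes stored by the host. */
enum apf_reason {
    APF_NONE             = 0,
    APF_PAGE_NOT_PRESENT = 1,
    APF_PAGE_READY       = 2,
};

/* Hypothetical control block shared between host and guest. */
struct apf_control_block {
    volatile uint32_t reason;  /* written by host, cleared by guest */
    uint32_t token;            /* assumed: pairs NOT_PRESENT with READY */
};

/* Host side: publish a signal before injecting the data abort. */
static void host_post(struct apf_control_block *cb,
                      enum apf_reason reason, uint32_t token)
{
    assert(cb->reason == APF_NONE);  /* previous signal was consumed */
    cb->token  = token;
    cb->reason = (uint32_t)reason;
}

/* Guest side: called from the fault handler; reads and clears. */
static enum apf_reason guest_consume(struct apf_control_block *cb,
                                     uint32_t *token)
{
    enum apf_reason reason = (enum apf_reason)cb->reason;

    if (reason != APF_NONE) {
        *token = cb->token;
        cb->reason = APF_NONE;  /* let the host post the next signal */
    }
    return reason;
}
```

If the guest sees APF_NONE, the abort is an ordinary fault and goes
down the normal path, so no special DFSC is needed to tell them apart.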

Thanks,
Gavin