mbox series

[v10,00/59] KVM: arm64: ARMv8.3/8.4 Nested Virtualization support

Message ID 20230515173103.1017669-1-maz@kernel.org (mailing list archive)
Headers show
Series KVM: arm64: ARMv8.3/8.4 Nested Virtualization support | expand

Message

Marc Zyngier May 15, 2023, 5:30 p.m. UTC
This is the 4th drop of NV support on arm64 for this year.

For the previous episodes, see [1].

What's changed:

- New framework to track system register traps that are reinjected in
  guest EL2. It is expected to replace the discrete handling we have
  enjoyed so far, which didn't scale at all. This has already fixed a
  number of bugs that were hidden (a bunch of traps were never
  forwarded...). Still a work in progress, but this is going in the
  right direction.

- Allow the L1 hypervisor to have a S2 that has an input larger than
  the L0 IPA space. This fixes a number of subtle issues, depending on
  how the initial guest was created.

- Consequently, the patch series has gone longer again. Boo. But
  hopefully some of it is easier to review...

[1] https://lore.kernel.org/r/20230405154008.3552854-1-maz@kernel.org

Andre Przywara (1):
  KVM: arm64: nv: vgic: Allow userland to set VGIC maintenance IRQ

Christoffer Dall (5):
  KVM: arm64: nv: Trap EL1 VM register accesses in virtual EL2
  KVM: arm64: nv: Implement nested Stage-2 page table walk logic
  KVM: arm64: nv: Unmap/flush shadow stage 2 page tables
  KVM: arm64: nv: vgic: Emulate the HW bit in software
  KVM: arm64: nv: Sync nested timer state with FEAT_NV2

Jintack Lim (7):
  KVM: arm64: nv: Trap CPACR_EL1 access in virtual EL2
  KVM: arm64: nv: Respect virtual HCR_EL2.TWX setting
  KVM: arm64: nv: Respect virtual CPTR_EL2.{TFP,FPEN} settings
  KVM: arm64: nv: Respect virtual HCR_EL2.{NV,TSC) settings
  KVM: arm64: nv: Configure HCR_EL2 for nested virtualization
  KVM: arm64: nv: Trap and emulate TLBI instructions from virtual EL2
  KVM: arm64: nv: Nested GICv3 Support

Marc Zyngier (46):
  KVM: arm64: Move VTCR_EL2 into struct s2_mmu
  arm64: Add missing Set/Way CMO encodings
  arm64: Add missing VA CMO encodings
  arm64: Add missing ERXMISCx_EL1 encodings
  arm64: Add missing DC ZVA/GVA/GZVA encodings
  arm64: Add TLBI operation encodings
  arm64: Add AT operation encodings
  KVM: arm64: Add missing HCR_EL2 trap bits
  KVM: arm64: nv: Add trap forwarding infrastructure
  KVM: arm64: nv: Add trap forwarding for HCR_EL2
  KVM: arm64: nv: Expose FEAT_EVT to nested guests
  KVM: arm64: nv: Add trap forwarding for MDCR_EL2
  KVM: arm64: nv: Add trap forwarding for CNTHCTL_EL2
  KVM: arm64: nv: Add non-VHE-EL2->EL1 translation helpers
  KVM: arm64: nv: Handle virtual EL2 registers in
    vcpu_read/write_sys_reg()
  KVM: arm64: nv: Handle SPSR_EL2 specially
  KVM: arm64: nv: Handle HCR_EL2.E2H specially
  KVM: arm64: nv: Save/Restore vEL2 sysregs
  KVM: arm64: nv: Support multiple nested Stage-2 mmu structures
  KVM: arm64: nv: Handle shadow stage 2 page faults
  KVM: arm64: nv: Restrict S2 RD/WR permissions to match the guest's
  KVM: arm64: nv: Set a handler for the system instruction traps
  KVM: arm64: nv: Trap and emulate AT instructions from virtual EL2
  KVM: arm64: nv: Fold guest's HCR_EL2 configuration into the host's
  KVM: arm64: nv: Hide RAS from nested guests
  KVM: arm64: nv: Add handling of EL2-specific timer registers
  KVM: arm64: nv: Load timer before the GIC
  KVM: arm64: nv: Don't load the GICv4 context on entering a nested
    guest
  KVM: arm64: nv: Implement maintenance interrupt forwarding
  KVM: arm64: nv: Deal with broken VGIC on maintenance interrupt
    delivery
  KVM: arm64: nv: Allow userspace to request KVM_ARM_VCPU_NESTED_VIRT
  KVM: arm64: nv: Add handling of FEAT_TTL TLB invalidation
  KVM: arm64: nv: Invalidate TLBs based on shadow S2 TTL-like
    information
  KVM: arm64: nv: Tag shadow S2 entries with nested level
  KVM: arm64: nv: Add include containing the VNCR_EL2 offsets
  KVM: arm64: nv: Map VNCR-capable registers to a separate page
  KVM: arm64: nv: Move nested vgic state into the sysreg file
  KVM: arm64: Add FEAT_NV2 cpu feature
  KVM: arm64: nv: Fold GICv3 host trapping requirements into guest setup
  KVM: arm64: nv: Publish emulated timer interrupt state in the
    in-memory state
  KVM: arm64: nv: Allocate VNCR page when required
  KVM: arm64: nv: Enable ARMv8.4-NV support
  KVM: arm64: nv: Fast-track 'InHost' exception returns
  KVM: arm64: nv: Fast-track EL1 TLBIs for VHE guests
  KVM: arm64: nv: Use FEAT_ECV to trap access to EL0 timers
  KVM: arm64: nv: Accelerate EL0 timer read accesses when FEAT_ECV is on

 .../virt/kvm/devices/arm-vgic-v3.rst          |  12 +-
 arch/arm64/include/asm/esr.h                  |   1 +
 arch/arm64/include/asm/kvm_arm.h              |  14 +
 arch/arm64/include/asm/kvm_asm.h              |   4 +
 arch/arm64/include/asm/kvm_emulate.h          |  93 +-
 arch/arm64/include/asm/kvm_host.h             | 181 +++-
 arch/arm64/include/asm/kvm_hyp.h              |   2 +
 arch/arm64/include/asm/kvm_mmu.h              |  20 +-
 arch/arm64/include/asm/kvm_nested.h           | 133 +++
 arch/arm64/include/asm/stage2_pgtable.h       |   4 +-
 arch/arm64/include/asm/sysreg.h               | 196 ++++
 arch/arm64/include/asm/vncr_mapping.h         |  74 ++
 arch/arm64/include/uapi/asm/kvm.h             |   1 +
 arch/arm64/kernel/cpufeature.c                |  11 +
 arch/arm64/kvm/Makefile                       |   4 +-
 arch/arm64/kvm/arch_timer.c                   |  98 +-
 arch/arm64/kvm/arm.c                          |  33 +-
 arch/arm64/kvm/at.c                           | 219 ++++
 arch/arm64/kvm/emulate-nested.c               | 934 ++++++++++++++++-
 arch/arm64/kvm/handle_exit.c                  |  29 +-
 arch/arm64/kvm/hyp/include/hyp/switch.h       |   8 +-
 arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h    |   5 +-
 arch/arm64/kvm/hyp/nvhe/mem_protect.c         |   8 +-
 arch/arm64/kvm/hyp/nvhe/pkvm.c                |   4 +-
 arch/arm64/kvm/hyp/nvhe/switch.c              |   2 +-
 arch/arm64/kvm/hyp/nvhe/sysreg-sr.c           |   2 +-
 arch/arm64/kvm/hyp/pgtable.c                  |   2 +-
 arch/arm64/kvm/hyp/vgic-v3-sr.c               |   6 +-
 arch/arm64/kvm/hyp/vhe/switch.c               | 206 +++-
 arch/arm64/kvm/hyp/vhe/sysreg-sr.c            | 124 ++-
 arch/arm64/kvm/hyp/vhe/tlb.c                  |  83 ++
 arch/arm64/kvm/mmu.c                          | 255 ++++-
 arch/arm64/kvm/nested.c                       | 799 ++++++++++++++-
 arch/arm64/kvm/pkvm.c                         |   2 +-
 arch/arm64/kvm/reset.c                        |   7 +
 arch/arm64/kvm/sys_regs.c                     | 958 +++++++++++++++++-
 arch/arm64/kvm/trace_arm.h                    |  19 +
 arch/arm64/kvm/vgic/vgic-init.c               |  33 +
 arch/arm64/kvm/vgic/vgic-kvm-device.c         |  32 +-
 arch/arm64/kvm/vgic/vgic-v3-nested.c          | 248 +++++
 arch/arm64/kvm/vgic/vgic-v3.c                 |  43 +-
 arch/arm64/kvm/vgic/vgic.c                    |  29 +
 arch/arm64/kvm/vgic/vgic.h                    |  10 +
 arch/arm64/tools/cpucaps                      |   1 +
 include/clocksource/arm_arch_timer.h          |   4 +
 include/kvm/arm_arch_timer.h                  |   1 +
 include/kvm/arm_vgic.h                        |  17 +
 include/uapi/linux/kvm.h                      |   1 +
 tools/arch/arm/include/uapi/asm/kvm.h         |   1 +
 49 files changed, 4790 insertions(+), 183 deletions(-)
 create mode 100644 arch/arm64/include/asm/vncr_mapping.h
 create mode 100644 arch/arm64/kvm/at.c
 create mode 100644 arch/arm64/kvm/vgic/vgic-v3-nested.c

Comments

Eric Auger May 16, 2023, 4:53 p.m. UTC | #1
Hi Marc,

On 5/15/23 19:30, Marc Zyngier wrote:
> This is the 4th drop of NV support on arm64 for this year.
> 
> For the previous episodes, see [1].
> 
> What's changed:
> 
> - New framework to track system register traps that are reinjected in
>   guest EL2. It is expected to replace the discrete handling we have
>   enjoyed so far, which didn't scale at all. This has already fixed a
>   number of bugs that were hidden (a bunch of traps were never
>   forwarded...). Still a work in progress, but this is going in the
>   right direction.
> 
> - Allow the L1 hypervisor to have a S2 that has an input larger than
>   the L0 IPA space. This fixes a number of subtle issues, depending on
>   how the initial guest was created.
> 
> - Consequently, the patch series has gone longer again. Boo. But
>   hopefully some of it is easier to review...
> 
> [1] https://lore.kernel.org/r/20230405154008.3552854-1-maz@kernel.org

I have started testing this and when booting my fedora guest I get

[  151.796544] kvm [7617]: Unsupported guest sys_reg access at:
23f425fd0 [80000209]
[  151.796544]  { Op0( 3), Op1( 3), CRn(14), CRm( 3), Op2( 1), func_write },

as soon as the host has kvm-arm.mode=nested

This seems to be triggered very early by EDK2
(ArmPkg/Drivers/TimerDxe/TimerDxe.c).

If I am not wrong this CNTV_CTL_EL0. Do you have any idea?

By the way I got this already with your v9
(git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git
kvm-arm64/nv-6.4-WIP) and with your v10 (I cherry-picked your
maz/kvm-arm64/nv-6.5-WIP branch).

Thanks

Eric





> 
> Andre Przywara (1):
>   KVM: arm64: nv: vgic: Allow userland to set VGIC maintenance IRQ
> 
> Christoffer Dall (5):
>   KVM: arm64: nv: Trap EL1 VM register accesses in virtual EL2
>   KVM: arm64: nv: Implement nested Stage-2 page table walk logic
>   KVM: arm64: nv: Unmap/flush shadow stage 2 page tables
>   KVM: arm64: nv: vgic: Emulate the HW bit in software
>   KVM: arm64: nv: Sync nested timer state with FEAT_NV2
> 
> Jintack Lim (7):
>   KVM: arm64: nv: Trap CPACR_EL1 access in virtual EL2
>   KVM: arm64: nv: Respect virtual HCR_EL2.TWX setting
>   KVM: arm64: nv: Respect virtual CPTR_EL2.{TFP,FPEN} settings
>   KVM: arm64: nv: Respect virtual HCR_EL2.{NV,TSC) settings
>   KVM: arm64: nv: Configure HCR_EL2 for nested virtualization
>   KVM: arm64: nv: Trap and emulate TLBI instructions from virtual EL2
>   KVM: arm64: nv: Nested GICv3 Support
> 
> Marc Zyngier (46):
>   KVM: arm64: Move VTCR_EL2 into struct s2_mmu
>   arm64: Add missing Set/Way CMO encodings
>   arm64: Add missing VA CMO encodings
>   arm64: Add missing ERXMISCx_EL1 encodings
>   arm64: Add missing DC ZVA/GVA/GZVA encodings
>   arm64: Add TLBI operation encodings
>   arm64: Add AT operation encodings
>   KVM: arm64: Add missing HCR_EL2 trap bits
>   KVM: arm64: nv: Add trap forwarding infrastructure
>   KVM: arm64: nv: Add trap forwarding for HCR_EL2
>   KVM: arm64: nv: Expose FEAT_EVT to nested guests
>   KVM: arm64: nv: Add trap forwarding for MDCR_EL2
>   KVM: arm64: nv: Add trap forwarding for CNTHCTL_EL2
>   KVM: arm64: nv: Add non-VHE-EL2->EL1 translation helpers
>   KVM: arm64: nv: Handle virtual EL2 registers in
>     vcpu_read/write_sys_reg()
>   KVM: arm64: nv: Handle SPSR_EL2 specially
>   KVM: arm64: nv: Handle HCR_EL2.E2H specially
>   KVM: arm64: nv: Save/Restore vEL2 sysregs
>   KVM: arm64: nv: Support multiple nested Stage-2 mmu structures
>   KVM: arm64: nv: Handle shadow stage 2 page faults
>   KVM: arm64: nv: Restrict S2 RD/WR permissions to match the guest's
>   KVM: arm64: nv: Set a handler for the system instruction traps
>   KVM: arm64: nv: Trap and emulate AT instructions from virtual EL2
>   KVM: arm64: nv: Fold guest's HCR_EL2 configuration into the host's
>   KVM: arm64: nv: Hide RAS from nested guests
>   KVM: arm64: nv: Add handling of EL2-specific timer registers
>   KVM: arm64: nv: Load timer before the GIC
>   KVM: arm64: nv: Don't load the GICv4 context on entering a nested
>     guest
>   KVM: arm64: nv: Implement maintenance interrupt forwarding
>   KVM: arm64: nv: Deal with broken VGIC on maintenance interrupt
>     delivery
>   KVM: arm64: nv: Allow userspace to request KVM_ARM_VCPU_NESTED_VIRT
>   KVM: arm64: nv: Add handling of FEAT_TTL TLB invalidation
>   KVM: arm64: nv: Invalidate TLBs based on shadow S2 TTL-like
>     information
>   KVM: arm64: nv: Tag shadow S2 entries with nested level
>   KVM: arm64: nv: Add include containing the VNCR_EL2 offsets
>   KVM: arm64: nv: Map VNCR-capable registers to a separate page
>   KVM: arm64: nv: Move nested vgic state into the sysreg file
>   KVM: arm64: Add FEAT_NV2 cpu feature
>   KVM: arm64: nv: Fold GICv3 host trapping requirements into guest setup
>   KVM: arm64: nv: Publish emulated timer interrupt state in the
>     in-memory state
>   KVM: arm64: nv: Allocate VNCR page when required
>   KVM: arm64: nv: Enable ARMv8.4-NV support
>   KVM: arm64: nv: Fast-track 'InHost' exception returns
>   KVM: arm64: nv: Fast-track EL1 TLBIs for VHE guests
>   KVM: arm64: nv: Use FEAT_ECV to trap access to EL0 timers
>   KVM: arm64: nv: Accelerate EL0 timer read accesses when FEAT_ECV is on
> 
>  .../virt/kvm/devices/arm-vgic-v3.rst          |  12 +-
>  arch/arm64/include/asm/esr.h                  |   1 +
>  arch/arm64/include/asm/kvm_arm.h              |  14 +
>  arch/arm64/include/asm/kvm_asm.h              |   4 +
>  arch/arm64/include/asm/kvm_emulate.h          |  93 +-
>  arch/arm64/include/asm/kvm_host.h             | 181 +++-
>  arch/arm64/include/asm/kvm_hyp.h              |   2 +
>  arch/arm64/include/asm/kvm_mmu.h              |  20 +-
>  arch/arm64/include/asm/kvm_nested.h           | 133 +++
>  arch/arm64/include/asm/stage2_pgtable.h       |   4 +-
>  arch/arm64/include/asm/sysreg.h               | 196 ++++
>  arch/arm64/include/asm/vncr_mapping.h         |  74 ++
>  arch/arm64/include/uapi/asm/kvm.h             |   1 +
>  arch/arm64/kernel/cpufeature.c                |  11 +
>  arch/arm64/kvm/Makefile                       |   4 +-
>  arch/arm64/kvm/arch_timer.c                   |  98 +-
>  arch/arm64/kvm/arm.c                          |  33 +-
>  arch/arm64/kvm/at.c                           | 219 ++++
>  arch/arm64/kvm/emulate-nested.c               | 934 ++++++++++++++++-
>  arch/arm64/kvm/handle_exit.c                  |  29 +-
>  arch/arm64/kvm/hyp/include/hyp/switch.h       |   8 +-
>  arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h    |   5 +-
>  arch/arm64/kvm/hyp/nvhe/mem_protect.c         |   8 +-
>  arch/arm64/kvm/hyp/nvhe/pkvm.c                |   4 +-
>  arch/arm64/kvm/hyp/nvhe/switch.c              |   2 +-
>  arch/arm64/kvm/hyp/nvhe/sysreg-sr.c           |   2 +-
>  arch/arm64/kvm/hyp/pgtable.c                  |   2 +-
>  arch/arm64/kvm/hyp/vgic-v3-sr.c               |   6 +-
>  arch/arm64/kvm/hyp/vhe/switch.c               | 206 +++-
>  arch/arm64/kvm/hyp/vhe/sysreg-sr.c            | 124 ++-
>  arch/arm64/kvm/hyp/vhe/tlb.c                  |  83 ++
>  arch/arm64/kvm/mmu.c                          | 255 ++++-
>  arch/arm64/kvm/nested.c                       | 799 ++++++++++++++-
>  arch/arm64/kvm/pkvm.c                         |   2 +-
>  arch/arm64/kvm/reset.c                        |   7 +
>  arch/arm64/kvm/sys_regs.c                     | 958 +++++++++++++++++-
>  arch/arm64/kvm/trace_arm.h                    |  19 +
>  arch/arm64/kvm/vgic/vgic-init.c               |  33 +
>  arch/arm64/kvm/vgic/vgic-kvm-device.c         |  32 +-
>  arch/arm64/kvm/vgic/vgic-v3-nested.c          | 248 +++++
>  arch/arm64/kvm/vgic/vgic-v3.c                 |  43 +-
>  arch/arm64/kvm/vgic/vgic.c                    |  29 +
>  arch/arm64/kvm/vgic/vgic.h                    |  10 +
>  arch/arm64/tools/cpucaps                      |   1 +
>  include/clocksource/arm_arch_timer.h          |   4 +
>  include/kvm/arm_arch_timer.h                  |   1 +
>  include/kvm/arm_vgic.h                        |  17 +
>  include/uapi/linux/kvm.h                      |   1 +
>  tools/arch/arm/include/uapi/asm/kvm.h         |   1 +
>  49 files changed, 4790 insertions(+), 183 deletions(-)
>  create mode 100644 arch/arm64/include/asm/vncr_mapping.h
>  create mode 100644 arch/arm64/kvm/at.c
>  create mode 100644 arch/arm64/kvm/vgic/vgic-v3-nested.c
>
Marc Zyngier May 16, 2023, 6:47 p.m. UTC | #2
Hey Eric,

On Tue, 16 May 2023 17:53:14 +0100,
Eric Auger <eauger@redhat.com> wrote:
> 
> Hi Marc,
> 
> On 5/15/23 19:30, Marc Zyngier wrote:
> > This is the 4th drop of NV support on arm64 for this year.
> > 
> > For the previous episodes, see [1].
> > 
> > What's changed:
> > 
> > - New framework to track system register traps that are reinjected in
> >   guest EL2. It is expected to replace the discrete handling we have
> >   enjoyed so far, which didn't scale at all. This has already fixed a
> >   number of bugs that were hidden (a bunch of traps were never
> >   forwarded...). Still a work in progress, but this is going in the
> >   right direction.
> > 
> > - Allow the L1 hypervisor to have a S2 that has an input larger than
> >   the L0 IPA space. This fixes a number of subtle issues, depending on
> >   how the initial guest was created.
> > 
> > - Consequently, the patch series has gone longer again. Boo. But
> >   hopefully some of it is easier to review...
> > 
> > [1] https://lore.kernel.org/r/20230405154008.3552854-1-maz@kernel.org
> 
> I have started testing this and when booting my fedora guest I get
> 
> [  151.796544] kvm [7617]: Unsupported guest sys_reg access at:
> 23f425fd0 [80000209]
> [  151.796544]  { Op0( 3), Op1( 3), CRn(14), CRm( 3), Op2( 1), func_write },
> 
> as soon as the host has kvm-arm.mode=nested

Very odd. A write to CNTV_CTL_EL0 that fails, meaning that we have ECV
traps for the virtual timer set, and so this is probably done from
(virtual) EL2. Which of course we're not properly handling. Duh.

> This seems to be triggered very early by EDK2
> (ArmPkg/Drivers/TimerDxe/TimerDxe.c).
> 
> If I am not wrong this CNTV_CTL_EL0. Do you have any idea?

I have a good idea, but I could use some more information:

- Is this a nested guest? Or a non nested guest?

- Can you stash your EDK2 build somewhere so that I can try to
  reproduce it?

- What do you use for VMM on the host?

I'm running EDK2 on both L1 and L2 guests (some Debian builds), and I
don't see any issue. But when running EDK2 on L1, I run it non-nested.

> By the way I got this already with your v9
> (git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git
> kvm-arm64/nv-6.4-WIP) and with your v10 (I cherry-picked your
> maz/kvm-arm64/nv-6.5-WIP branch).

That's an interesting data point. At least, it hasn't regressed
further... I'll try and cook a test tomorrow and run it on my own
rig. Which may or may not behave like yours, you never know...

Thanks a lot for the report!

	M.
Marc Zyngier May 16, 2023, 8:28 p.m. UTC | #3
On Tue, 16 May 2023 17:53:14 +0100,
Eric Auger <eauger@redhat.com> wrote:
> 
> Hi Marc,
> 
> On 5/15/23 19:30, Marc Zyngier wrote:
> > This is the 4th drop of NV support on arm64 for this year.
> > 
> > For the previous episodes, see [1].
> > 
> > What's changed:
> > 
> > - New framework to track system register traps that are reinjected in
> >   guest EL2. It is expected to replace the discrete handling we have
> >   enjoyed so far, which didn't scale at all. This has already fixed a
> >   number of bugs that were hidden (a bunch of traps were never
> >   forwarded...). Still a work in progress, but this is going in the
> >   right direction.
> > 
> > - Allow the L1 hypervisor to have a S2 that has an input larger than
> >   the L0 IPA space. This fixes a number of subtle issues, depending on
> >   how the initial guest was created.
> > 
> > - Consequently, the patch series has gone longer again. Boo. But
> >   hopefully some of it is easier to review...
> > 
> > [1] https://lore.kernel.org/r/20230405154008.3552854-1-maz@kernel.org
> 
> I have started testing this and when booting my fedora guest I get
> 
> [  151.796544] kvm [7617]: Unsupported guest sys_reg access at:
> 23f425fd0 [80000209]
> [  151.796544]  { Op0( 3), Op1( 3), CRn(14), CRm( 3), Op2( 1), func_write },
> 
> as soon as the host has kvm-arm.mode=nested
> 
> This seems to be triggered very early by EDK2
> (ArmPkg/Drivers/TimerDxe/TimerDxe.c).
> 
> If I am not wrong this CNTV_CTL_EL0. Do you have any idea?

So here's my current analysis:

I assume you are running EDK2 as the L1 guest in a nested
configuration. I also assume that you are not running on an Apple
CPU. If these assumptions are correct, then EDK2 runs at vEL2, and is
in nVHE mode.

Finally, I'm going to assume that your implementation has FEAT_ECV and
FEAT_NV2, because I can't see how it could fail otherwise.

In these precise conditions, KVM sets the CNTHCTL_EL2.EL1TVT bit so
that we can trap the EL0 virtual timer and faithfully emulate it (it
is otherwise written to memory, which isn't very helpful).

As it turns out, we don't handle these traps. I didn't spot it because
my test machines are all Apple boxes that don't have a nVHE mode, so
nothing on the nVHE path is getting *ANY* coverage. Hint: having
access to such a machine would help (shipping address on request!).
Otherwise, I'll eventually kill the nVHE support altogether.

I have written the following patch, which compiles, but that I cannot
test with my current setup. Could you please give it a go?

Thanks again,

	M.

From feb03b57de0bcb83254a2d6a3ce320f5e39434b6 Mon Sep 17 00:00:00 2001
From: Marc Zyngier <maz@kernel.org>
Date: Tue, 16 May 2023 21:06:20 +0100
Subject: [PATCH] KVM: arm64: Handle virtual timer traps when
 CNTHCTL_EL2.EL1TVT is set

Signed-off-by: Marc Zyngier <maz@kernel.org>
---
 arch/arm64/include/asm/sysreg.h |  1 +
 arch/arm64/kvm/sys_regs.c       | 28 ++++++++++++++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index 72ff6df5d75b..77a61179ea37 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -436,6 +436,7 @@
 #define SYS_CNTP_CTL_EL0		sys_reg(3, 3, 14, 2, 1)
 #define SYS_CNTP_CVAL_EL0		sys_reg(3, 3, 14, 2, 2)
 
+#define SYS_CNTV_TVAL_EL0		sys_reg(3, 3, 14, 3, 0)
 #define SYS_CNTV_CTL_EL0		sys_reg(3, 3, 14, 3, 1)
 #define SYS_CNTV_CVAL_EL0		sys_reg(3, 3, 14, 3, 2)
 
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 27a29dcbfcd2..9aa9c4e4b4d6 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1328,6 +1328,14 @@ static bool access_arch_timer(struct kvm_vcpu *vcpu,
 		treg = TIMER_REG_TVAL;
 		break;
 
+	case SYS_CNTV_TVAL_EL0:
+		if (is_hyp_ctxt(vcpu) && vcpu_el2_e2h_is_set(vcpu))
+			tmr = TIMER_HVTIMER;
+		else
+			tmr = TIMER_VTIMER;
+		treg = TIMER_REG_TVAL;
+		break;
+
 	case SYS_AARCH32_CNTP_TVAL:
 	case SYS_CNTP_TVAL_EL02:
 		tmr = TIMER_PTIMER;
@@ -1357,6 +1365,14 @@ static bool access_arch_timer(struct kvm_vcpu *vcpu,
 		treg = TIMER_REG_CTL;
 		break;
 
+	case SYS_CNTV_CTL_EL0:
+		if (is_hyp_ctxt(vcpu) && vcpu_el2_e2h_is_set(vcpu))
+			tmr = TIMER_HVTIMER;
+		else
+			tmr = TIMER_VTIMER;
+		treg = TIMER_REG_CTL;
+		break;
+
 	case SYS_AARCH32_CNTP_CTL:
 	case SYS_CNTP_CTL_EL02:
 		tmr = TIMER_PTIMER;
@@ -1386,6 +1402,14 @@ static bool access_arch_timer(struct kvm_vcpu *vcpu,
 		treg = TIMER_REG_CVAL;
 		break;
 
+	case SYS_CNTV_CVAL_EL0:
+		if (is_hyp_ctxt(vcpu) && vcpu_el2_e2h_is_set(vcpu))
+			tmr = TIMER_HVTIMER;
+		else
+			tmr = TIMER_VTIMER;
+		treg = TIMER_REG_CVAL;
+		break;
+
 	case SYS_AARCH32_CNTP_CVAL:
 	case SYS_CNTP_CVAL_EL02:
 		tmr = TIMER_PTIMER;
@@ -2510,6 +2534,10 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 	{ SYS_DESC(SYS_CNTP_CTL_EL0), access_arch_timer },
 	{ SYS_DESC(SYS_CNTP_CVAL_EL0), access_arch_timer },
 
+	{ SYS_DESC(SYS_CNTV_TVAL_EL0), access_arch_timer },
+	{ SYS_DESC(SYS_CNTV_CTL_EL0), access_arch_timer },
+	{ SYS_DESC(SYS_CNTV_CVAL_EL0), access_arch_timer },
+
 	/* PMEVCNTRn_EL0 */
 	PMU_PMEVCNTR_EL0(0),
 	PMU_PMEVCNTR_EL0(1),
Eric Auger May 17, 2023, 8:59 a.m. UTC | #4
Hi Marc,

On 5/16/23 22:28, Marc Zyngier wrote:
> On Tue, 16 May 2023 17:53:14 +0100,
> Eric Auger <eauger@redhat.com> wrote:
>>
>> Hi Marc,
>>
>> On 5/15/23 19:30, Marc Zyngier wrote:
>>> This is the 4th drop of NV support on arm64 for this year.
>>>
>>> For the previous episodes, see [1].
>>>
>>> What's changed:
>>>
>>> - New framework to track system register traps that are reinjected in
>>>   guest EL2. It is expected to replace the discrete handling we have
>>>   enjoyed so far, which didn't scale at all. This has already fixed a
>>>   number of bugs that were hidden (a bunch of traps were never
>>>   forwarded...). Still a work in progress, but this is going in the
>>>   right direction.
>>>
>>> - Allow the L1 hypervisor to have a S2 that has an input larger than
>>>   the L0 IPA space. This fixes a number of subtle issues, depending on
>>>   how the initial guest was created.
>>>
>>> - Consequently, the patch series has gone longer again. Boo. But
>>>   hopefully some of it is easier to review...
>>>
>>> [1] https://lore.kernel.org/r/20230405154008.3552854-1-maz@kernel.org
>>
>> I have started testing this and when booting my fedora guest I get
>>
>> [  151.796544] kvm [7617]: Unsupported guest sys_reg access at:
>> 23f425fd0 [80000209]
>> [  151.796544]  { Op0( 3), Op1( 3), CRn(14), CRm( 3), Op2( 1), func_write },
>>
>> as soon as the host has kvm-arm.mode=nested
>>
>> This seems to be triggered very early by EDK2
>> (ArmPkg/Drivers/TimerDxe/TimerDxe.c).
>>
>> If I am not wrong this CNTV_CTL_EL0. Do you have any idea?
> 
> So here's my current analysis:
> 
> I assume you are running EDK2 as the L1 guest in a nested
> configuration. I also assume that you are not running on an Apple
> CPU. If these assumptions are correct, then EDK2 runs at vEL2, and is
> in nVHE mode.
> 
> Finally, I'm going to assume that your implementation has FEAT_ECV and
> FEAT_NV2, because I can't see how it could fail otherwise.
all the above is correct.
> 
> In these precise conditions, KVM sets the CNTHCTL_EL2.EL1TVT bit so
> that we can trap the EL0 virtual timer and faithfully emulate it (it
> is otherwise written to memory, which isn't very helpful).

indeed
> 
> As it turns out, we don't handle these traps. I didn't spot it because
> my test machines are all Apple boxes that don't have a nVHE mode, so
> nothing on the nVHE path is getting *ANY* coverage. Hint: having
> access to such a machine would help (shipping address on request!).
> Otherwise, I'll eventually kill the nVHE support altogether.
> 
> I have written the following patch, which compiles, but that I cannot
> test with my current setup. Could you please give it a go?

with the patch below, my guest boots nicely. You did it great on the 1st
shot!!! So this fixes my issue. I will continue testing the v10.

Thank you again!

Eric


> 
> Thanks again,
> 
> 	M.
> 
> From feb03b57de0bcb83254a2d6a3ce320f5e39434b6 Mon Sep 17 00:00:00 2001
> From: Marc Zyngier <maz@kernel.org>
> Date: Tue, 16 May 2023 21:06:20 +0100
> Subject: [PATCH] KVM: arm64: Handle virtual timer traps when
>  CNTHCTL_EL2.EL1TVT is set
> 
> Signed-off-by: Marc Zyngier <maz@kernel.org>
> ---
>  arch/arm64/include/asm/sysreg.h |  1 +
>  arch/arm64/kvm/sys_regs.c       | 28 ++++++++++++++++++++++++++++
>  2 files changed, 29 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
> index 72ff6df5d75b..77a61179ea37 100644
> --- a/arch/arm64/include/asm/sysreg.h
> +++ b/arch/arm64/include/asm/sysreg.h
> @@ -436,6 +436,7 @@
>  #define SYS_CNTP_CTL_EL0		sys_reg(3, 3, 14, 2, 1)
>  #define SYS_CNTP_CVAL_EL0		sys_reg(3, 3, 14, 2, 2)
>  
> +#define SYS_CNTV_TVAL_EL0		sys_reg(3, 3, 14, 3, 0)
>  #define SYS_CNTV_CTL_EL0		sys_reg(3, 3, 14, 3, 1)
>  #define SYS_CNTV_CVAL_EL0		sys_reg(3, 3, 14, 3, 2)
>  
> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index 27a29dcbfcd2..9aa9c4e4b4d6 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -1328,6 +1328,14 @@ static bool access_arch_timer(struct kvm_vcpu *vcpu,
>  		treg = TIMER_REG_TVAL;
>  		break;
>  
> +	case SYS_CNTV_TVAL_EL0:
> +		if (is_hyp_ctxt(vcpu) && vcpu_el2_e2h_is_set(vcpu))
> +			tmr = TIMER_HVTIMER;
> +		else
> +			tmr = TIMER_VTIMER;
> +		treg = TIMER_REG_TVAL;
> +		break;
> +
>  	case SYS_AARCH32_CNTP_TVAL:
>  	case SYS_CNTP_TVAL_EL02:
>  		tmr = TIMER_PTIMER;
> @@ -1357,6 +1365,14 @@ static bool access_arch_timer(struct kvm_vcpu *vcpu,
>  		treg = TIMER_REG_CTL;
>  		break;
>  
> +	case SYS_CNTV_CTL_EL0:
> +		if (is_hyp_ctxt(vcpu) && vcpu_el2_e2h_is_set(vcpu))
> +			tmr = TIMER_HVTIMER;
> +		else
> +			tmr = TIMER_VTIMER;
> +		treg = TIMER_REG_CTL;
> +		break;
> +
>  	case SYS_AARCH32_CNTP_CTL:
>  	case SYS_CNTP_CTL_EL02:
>  		tmr = TIMER_PTIMER;
> @@ -1386,6 +1402,14 @@ static bool access_arch_timer(struct kvm_vcpu *vcpu,
>  		treg = TIMER_REG_CVAL;
>  		break;
>  
> +	case SYS_CNTV_CVAL_EL0:
> +		if (is_hyp_ctxt(vcpu) && vcpu_el2_e2h_is_set(vcpu))
> +			tmr = TIMER_HVTIMER;
> +		else
> +			tmr = TIMER_VTIMER;
> +		treg = TIMER_REG_CVAL;
> +		break;
> +
>  	case SYS_AARCH32_CNTP_CVAL:
>  	case SYS_CNTP_CVAL_EL02:
>  		tmr = TIMER_PTIMER;
> @@ -2510,6 +2534,10 @@ static const struct sys_reg_desc sys_reg_descs[] = {
>  	{ SYS_DESC(SYS_CNTP_CTL_EL0), access_arch_timer },
>  	{ SYS_DESC(SYS_CNTP_CVAL_EL0), access_arch_timer },
>  
> +	{ SYS_DESC(SYS_CNTV_TVAL_EL0), access_arch_timer },
> +	{ SYS_DESC(SYS_CNTV_CTL_EL0), access_arch_timer },
> +	{ SYS_DESC(SYS_CNTV_CVAL_EL0), access_arch_timer },
> +
>  	/* PMEVCNTRn_EL0 */
>  	PMU_PMEVCNTR_EL0(0),
>  	PMU_PMEVCNTR_EL0(1),
Marc Zyngier May 17, 2023, 2:12 p.m. UTC | #5
On Wed, 17 May 2023 09:59:45 +0100,
Eric Auger <eauger@redhat.com> wrote:
> 
> Hi Marc,
> 
> On 5/16/23 22:28, Marc Zyngier wrote:
> > On Tue, 16 May 2023 17:53:14 +0100,
> > Eric Auger <eauger@redhat.com> wrote:
> >>
> >> Hi Marc,
> >>
> >> On 5/15/23 19:30, Marc Zyngier wrote:
> >>> This is the 4th drop of NV support on arm64 for this year.
> >>>
> >>> For the previous episodes, see [1].
> >>>
> >>> What's changed:
> >>>
> >>> - New framework to track system register traps that are reinjected in
> >>>   guest EL2. It is expected to replace the discrete handling we have
> >>>   enjoyed so far, which didn't scale at all. This has already fixed a
> >>>   number of bugs that were hidden (a bunch of traps were never
> >>>   forwarded...). Still a work in progress, but this is going in the
> >>>   right direction.
> >>>
> >>> - Allow the L1 hypervisor to have a S2 that has an input larger than
> >>>   the L0 IPA space. This fixes a number of subtle issues, depending on
> >>>   how the initial guest was created.
> >>>
> >>> - Consequently, the patch series has gone longer again. Boo. But
> >>>   hopefully some of it is easier to review...
> >>>
> >>> [1] https://lore.kernel.org/r/20230405154008.3552854-1-maz@kernel.org
> >>
> >> I have started testing this and when booting my fedora guest I get
> >>
> >> [  151.796544] kvm [7617]: Unsupported guest sys_reg access at:
> >> 23f425fd0 [80000209]
> >> [  151.796544]  { Op0( 3), Op1( 3), CRn(14), CRm( 3), Op2( 1), func_write },
> >>
> >> as soon as the host has kvm-arm.mode=nested
> >>
> >> This seems to be triggered very early by EDK2
> >> (ArmPkg/Drivers/TimerDxe/TimerDxe.c).
> >>
> >> If I am not wrong this CNTV_CTL_EL0. Do you have any idea?
> > 
> > So here's my current analysis:
> > 
> > I assume you are running EDK2 as the L1 guest in a nested
> > configuration. I also assume that you are not running on an Apple
> > CPU. If these assumptions are correct, then EDK2 runs at vEL2, and is
> > in nVHE mode.
> > 
> > Finally, I'm going to assume that your implementation has FEAT_ECV and
> > FEAT_NV2, because I can't see how it could fail otherwise.
> all the above is correct.
> > 
> > In these precise conditions, KVM sets the CNTHCTL_EL2.EL1TVT bit so
> > that we can trap the EL0 virtual timer and faithfully emulate it (it
> > is otherwise written to memory, which isn't very helpful).
> 
> indeed
> > 
> > As it turns out, we don't handle these traps. I didn't spot it because
> > my test machines are all Apple boxes that don't have a nVHE mode, so
> > nothing on the nVHE path is getting *ANY* coverage. Hint: having
> > access to such a machine would help (shipping address on request!).
> > Otherwise, I'll eventually kill the nVHE support altogether.
> > 
> > I have written the following patch, which compiles, but that I cannot
> > test with my current setup. Could you please give it a go?
> 
> with the patch below, my guest boots nicely. You did it great on the 1st
> shot!!! So this fixes my issue. I will continue testing the v10.

Thanks a lot for reporting the issue and testing my hacks. I'll
eventually fold it into the rest of the series.

By the way, what are you using as your VMM? I'd really like to
reproduce your setup.

Cheers,

	M.
Eric Auger June 5, 2023, 11:28 a.m. UTC | #6
Hi Marc,

On 5/15/23 19:30, Marc Zyngier wrote:
> This is the 4th drop of NV support on arm64 for this year.
> 
> For the previous episodes, see [1].
> 
> What's changed:
> 
> - New framework to track system register traps that are reinjected in
>   guest EL2. It is expected to replace the discrete handling we have
>   enjoyed so far, which didn't scale at all. This has already fixed a
>   number of bugs that were hidden (a bunch of traps were never
>   forwarded...). Still a work in progress, but this is going in the
>   right direction.
> 
> - Allow the L1 hypervisor to have a S2 that has an input larger than
>   the L0 IPA space. This fixes a number of subtle issues, depending on
>   how the initial guest was created.
> 
> - Consequently, the patch series has gone longer again. Boo. But
>   hopefully some of it is easier to review...
> 
> [1] https://lore.kernel.org/r/20230405154008.3552854-1-maz@kernel.org
> 
> Andre Przywara (1):
>   KVM: arm64: nv: vgic: Allow userland to set VGIC maintenance IRQ

I guess you have executed kselftests on L1 guests. Have all the tests
passed there? On my end it stalls in the KVM_RUN.

for instance
tools/testing/selftests/kvm/aarch64/aarch32_id_regs.c fails in
test_guest_raz(vcpu) on the KVM_RUN. Even with a basic

static void guest_main(void)
{
GUEST_DONE();
}

I get
 aarch32_id_regs-768     [002] .....   410.544665: kvm_exit: IRQ:
HSR_EC: 0x0000 (UNKNOWN), PC: 0x0000000000401ec4
 aarch32_id_regs-768     [002] d....   410.544666: kvm_entry: PC:
0x0000000000401ec4
 aarch32_id_regs-768     [002] .....   410.544675: kvm_exit: IRQ:
HSR_EC: 0x0000 (UNKNOWN), PC: 0x0000000000401ec4
 aarch32_id_regs-768     [002] d....   410.544676: kvm_entry: PC:
0x0000000000401ec4
 aarch32_id_regs-768     [002] .....   410.544685: kvm_exit: IRQ:
HSR_EC: 0x0000 (UNKNOWN), PC: 0x0000000000401ec4

looping forever instead of

aarch32_id_regs-1085576 [079] d..1. 1401295.068739: kvm_entry: PC:
0x0000000000401ec4
 aarch32_id_regs-1085576 [079] ...1. 1401295.068745: kvm_exit: TRAP:
HSR_EC: 0x0020 (IABT_LOW), PC: 0x0000000000401ec4
 aarch32_id_regs-1085576 [079] d..1. 1401295.068790: kvm_entry: PC:
0x0000000000401ec4
 aarch32_id_regs-1085576 [079] ...1. 1401295.068792: kvm_exit: TRAP:
HSR_EC: 0x0020 (IABT_LOW), PC: 0x0000000000401ec4
 aarch32_id_regs-1085576 [079] d..1. 1401295.068794: kvm_entry: PC:
0x0000000000401ec4
 aarch32_id_regs-1085576 [079] ...1. 1401295.068795: kvm_exit: TRAP:
HSR_EC: 0x0020 (IABT_LOW), PC: 0x0000000000401ec4
 aarch32_id_regs-1085576 [079] d..1. 1401295.068797: kvm_entry: PC:
0x0000000000401ec4
../..

Any idea or any known restriction wrt kselftests?

Thanks

Eric




> 
> Christoffer Dall (5):
>   KVM: arm64: nv: Trap EL1 VM register accesses in virtual EL2
>   KVM: arm64: nv: Implement nested Stage-2 page table walk logic
>   KVM: arm64: nv: Unmap/flush shadow stage 2 page tables
>   KVM: arm64: nv: vgic: Emulate the HW bit in software
>   KVM: arm64: nv: Sync nested timer state with FEAT_NV2
> 
> Jintack Lim (7):
>   KVM: arm64: nv: Trap CPACR_EL1 access in virtual EL2
>   KVM: arm64: nv: Respect virtual HCR_EL2.TWX setting
>   KVM: arm64: nv: Respect virtual CPTR_EL2.{TFP,FPEN} settings
>   KVM: arm64: nv: Respect virtual HCR_EL2.{NV,TSC) settings
>   KVM: arm64: nv: Configure HCR_EL2 for nested virtualization
>   KVM: arm64: nv: Trap and emulate TLBI instructions from virtual EL2
>   KVM: arm64: nv: Nested GICv3 Support
> 
> Marc Zyngier (46):
>   KVM: arm64: Move VTCR_EL2 into struct s2_mmu
>   arm64: Add missing Set/Way CMO encodings
>   arm64: Add missing VA CMO encodings
>   arm64: Add missing ERXMISCx_EL1 encodings
>   arm64: Add missing DC ZVA/GVA/GZVA encodings
>   arm64: Add TLBI operation encodings
>   arm64: Add AT operation encodings
>   KVM: arm64: Add missing HCR_EL2 trap bits
>   KVM: arm64: nv: Add trap forwarding infrastructure
>   KVM: arm64: nv: Add trap forwarding for HCR_EL2
>   KVM: arm64: nv: Expose FEAT_EVT to nested guests
>   KVM: arm64: nv: Add trap forwarding for MDCR_EL2
>   KVM: arm64: nv: Add trap forwarding for CNTHCTL_EL2
>   KVM: arm64: nv: Add non-VHE-EL2->EL1 translation helpers
>   KVM: arm64: nv: Handle virtual EL2 registers in
>     vcpu_read/write_sys_reg()
>   KVM: arm64: nv: Handle SPSR_EL2 specially
>   KVM: arm64: nv: Handle HCR_EL2.E2H specially
>   KVM: arm64: nv: Save/Restore vEL2 sysregs
>   KVM: arm64: nv: Support multiple nested Stage-2 mmu structures
>   KVM: arm64: nv: Handle shadow stage 2 page faults
>   KVM: arm64: nv: Restrict S2 RD/WR permissions to match the guest's
>   KVM: arm64: nv: Set a handler for the system instruction traps
>   KVM: arm64: nv: Trap and emulate AT instructions from virtual EL2
>   KVM: arm64: nv: Fold guest's HCR_EL2 configuration into the host's
>   KVM: arm64: nv: Hide RAS from nested guests
>   KVM: arm64: nv: Add handling of EL2-specific timer registers
>   KVM: arm64: nv: Load timer before the GIC
>   KVM: arm64: nv: Don't load the GICv4 context on entering a nested
>     guest
>   KVM: arm64: nv: Implement maintenance interrupt forwarding
>   KVM: arm64: nv: Deal with broken VGIC on maintenance interrupt
>     delivery
>   KVM: arm64: nv: Allow userspace to request KVM_ARM_VCPU_NESTED_VIRT
>   KVM: arm64: nv: Add handling of FEAT_TTL TLB invalidation
>   KVM: arm64: nv: Invalidate TLBs based on shadow S2 TTL-like
>     information
>   KVM: arm64: nv: Tag shadow S2 entries with nested level
>   KVM: arm64: nv: Add include containing the VNCR_EL2 offsets
>   KVM: arm64: nv: Map VNCR-capable registers to a separate page
>   KVM: arm64: nv: Move nested vgic state into the sysreg file
>   KVM: arm64: Add FEAT_NV2 cpu feature
>   KVM: arm64: nv: Fold GICv3 host trapping requirements into guest setup
>   KVM: arm64: nv: Publish emulated timer interrupt state in the
>     in-memory state
>   KVM: arm64: nv: Allocate VNCR page when required
>   KVM: arm64: nv: Enable ARMv8.4-NV support
>   KVM: arm64: nv: Fast-track 'InHost' exception returns
>   KVM: arm64: nv: Fast-track EL1 TLBIs for VHE guests
>   KVM: arm64: nv: Use FEAT_ECV to trap access to EL0 timers
>   KVM: arm64: nv: Accelerate EL0 timer read accesses when FEAT_ECV is on
> 
>  .../virt/kvm/devices/arm-vgic-v3.rst          |  12 +-
>  arch/arm64/include/asm/esr.h                  |   1 +
>  arch/arm64/include/asm/kvm_arm.h              |  14 +
>  arch/arm64/include/asm/kvm_asm.h              |   4 +
>  arch/arm64/include/asm/kvm_emulate.h          |  93 +-
>  arch/arm64/include/asm/kvm_host.h             | 181 +++-
>  arch/arm64/include/asm/kvm_hyp.h              |   2 +
>  arch/arm64/include/asm/kvm_mmu.h              |  20 +-
>  arch/arm64/include/asm/kvm_nested.h           | 133 +++
>  arch/arm64/include/asm/stage2_pgtable.h       |   4 +-
>  arch/arm64/include/asm/sysreg.h               | 196 ++++
>  arch/arm64/include/asm/vncr_mapping.h         |  74 ++
>  arch/arm64/include/uapi/asm/kvm.h             |   1 +
>  arch/arm64/kernel/cpufeature.c                |  11 +
>  arch/arm64/kvm/Makefile                       |   4 +-
>  arch/arm64/kvm/arch_timer.c                   |  98 +-
>  arch/arm64/kvm/arm.c                          |  33 +-
>  arch/arm64/kvm/at.c                           | 219 ++++
>  arch/arm64/kvm/emulate-nested.c               | 934 ++++++++++++++++-
>  arch/arm64/kvm/handle_exit.c                  |  29 +-
>  arch/arm64/kvm/hyp/include/hyp/switch.h       |   8 +-
>  arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h    |   5 +-
>  arch/arm64/kvm/hyp/nvhe/mem_protect.c         |   8 +-
>  arch/arm64/kvm/hyp/nvhe/pkvm.c                |   4 +-
>  arch/arm64/kvm/hyp/nvhe/switch.c              |   2 +-
>  arch/arm64/kvm/hyp/nvhe/sysreg-sr.c           |   2 +-
>  arch/arm64/kvm/hyp/pgtable.c                  |   2 +-
>  arch/arm64/kvm/hyp/vgic-v3-sr.c               |   6 +-
>  arch/arm64/kvm/hyp/vhe/switch.c               | 206 +++-
>  arch/arm64/kvm/hyp/vhe/sysreg-sr.c            | 124 ++-
>  arch/arm64/kvm/hyp/vhe/tlb.c                  |  83 ++
>  arch/arm64/kvm/mmu.c                          | 255 ++++-
>  arch/arm64/kvm/nested.c                       | 799 ++++++++++++++-
>  arch/arm64/kvm/pkvm.c                         |   2 +-
>  arch/arm64/kvm/reset.c                        |   7 +
>  arch/arm64/kvm/sys_regs.c                     | 958 +++++++++++++++++-
>  arch/arm64/kvm/trace_arm.h                    |  19 +
>  arch/arm64/kvm/vgic/vgic-init.c               |  33 +
>  arch/arm64/kvm/vgic/vgic-kvm-device.c         |  32 +-
>  arch/arm64/kvm/vgic/vgic-v3-nested.c          | 248 +++++
>  arch/arm64/kvm/vgic/vgic-v3.c                 |  43 +-
>  arch/arm64/kvm/vgic/vgic.c                    |  29 +
>  arch/arm64/kvm/vgic/vgic.h                    |  10 +
>  arch/arm64/tools/cpucaps                      |   1 +
>  include/clocksource/arm_arch_timer.h          |   4 +
>  include/kvm/arm_arch_timer.h                  |   1 +
>  include/kvm/arm_vgic.h                        |  17 +
>  include/uapi/linux/kvm.h                      |   1 +
>  tools/arch/arm/include/uapi/asm/kvm.h         |   1 +
>  49 files changed, 4790 insertions(+), 183 deletions(-)
>  create mode 100644 arch/arm64/include/asm/vncr_mapping.h
>  create mode 100644 arch/arm64/kvm/at.c
>  create mode 100644 arch/arm64/kvm/vgic/vgic-v3-nested.c
>
Marc Zyngier June 6, 2023, 7:30 a.m. UTC | #7
Hey Eric,

On Mon, 05 Jun 2023 12:28:12 +0100,
Eric Auger <eauger@redhat.com> wrote:
> 
> Hi Marc,
> 
> On 5/15/23 19:30, Marc Zyngier wrote:
> > This is the 4th drop of NV support on arm64 for this year.
> > 
> > For the previous episodes, see [1].
> > 
> > What's changed:
> > 
> > - New framework to track system register traps that are reinjected in
> >   guest EL2. It is expected to replace the discrete handling we have
> >   enjoyed so far, which didn't scale at all. This has already fixed a
> >   number of bugs that were hidden (a bunch of traps were never
> >   forwarded...). Still a work in progress, but this is going in the
> >   right direction.
> > 
> > - Allow the L1 hypervisor to have a S2 that has an input larger than
> >   the L0 IPA space. This fixes a number of subtle issues, depending on
> >   how the initial guest was created.
> > 
> > - Consequently, the patch series has gone longer again. Boo. But
> >   hopefully some of it is easier to review...
> > 
> > [1] https://lore.kernel.org/r/20230405154008.3552854-1-maz@kernel.org
> > 
> > Andre Przywara (1):
> >   KVM: arm64: nv: vgic: Allow userland to set VGIC maintenance IRQ
> 
> I guess you have executed kselftests on L1 guests. Have all the tests
> passed there? On my end it stalls in the KVM_RUN.

No, I hardly run any kselftest, because they are just not designed to
run at EL2 at all. There's some work to be done there, but I just
don't have the bandwidth for that (hint, wink...)

> 
> for instance
> tools/testing/selftests/kvm/aarch64/aarch32_id_regs.c fails in
> test_guest_raz(vcpu) on the KVM_RUN. Even with a basic
> 
> static void guest_main(void)
> {
> GUEST_DONE();
> }

My guess is that the test harness expects things to run at EL1.
Depending on the value you get for HCR_EL2, you could get all sort of
odd behaviours. Also, the harness configures EL1 only, which is
unlikely to work at EL2. My conclusion is that "processor.c" needs to
be taught about EL2, at the very least.

> 
> I get
>  aarch32_id_regs-768     [002] .....   410.544665: kvm_exit: IRQ:
> HSR_EC: 0x0000 (UNKNOWN), PC: 0x0000000000401ec4
>  aarch32_id_regs-768     [002] d....   410.544666: kvm_entry: PC:
> 0x0000000000401ec4
>  aarch32_id_regs-768     [002] .....   410.544675: kvm_exit: IRQ:
> HSR_EC: 0x0000 (UNKNOWN), PC: 0x0000000000401ec4
>  aarch32_id_regs-768     [002] d....   410.544676: kvm_entry: PC:
> 0x0000000000401ec4
>  aarch32_id_regs-768     [002] .....   410.544685: kvm_exit: IRQ:
> HSR_EC: 0x0000 (UNKNOWN), PC: 0x0000000000401ec4
> 
> looping forever instead of
> 
> aarch32_id_regs-1085576 [079] d..1. 1401295.068739: kvm_entry: PC:
> 0x0000000000401ec4
>  aarch32_id_regs-1085576 [079] ...1. 1401295.068745: kvm_exit: TRAP:
> HSR_EC: 0x0020 (IABT_LOW), PC: 0x0000000000401ec4
>  aarch32_id_regs-1085576 [079] d..1. 1401295.068790: kvm_entry: PC:
> 0x0000000000401ec4
>  aarch32_id_regs-1085576 [079] ...1. 1401295.068792: kvm_exit: TRAP:
> HSR_EC: 0x0020 (IABT_LOW), PC: 0x0000000000401ec4
>  aarch32_id_regs-1085576 [079] d..1. 1401295.068794: kvm_entry: PC:
> 0x0000000000401ec4
>  aarch32_id_regs-1085576 [079] ...1. 1401295.068795: kvm_exit: TRAP:
> HSR_EC: 0x0020 (IABT_LOW), PC: 0x0000000000401ec4
>  aarch32_id_regs-1085576 [079] d..1. 1401295.068797: kvm_entry: PC:
> 0x0000000000401ec4
> ../..
> 
> Any idea or any known restriction wrt kselftests?

See above. I'd love someone to actually start looking into it and
devise a testing harness that would run both at EL{0,1,2} *at the same
time* so that we can start exercising some of the trap behaviours that
the architecture mandates.

Also, Alexandru had a some KUT tests a few years ago, but I don't know
what happened of them.

Thanks,

	M.
Eric Auger June 6, 2023, 9:29 a.m. UTC | #8
Hi Marc,
On 6/6/23 09:30, Marc Zyngier wrote:
> Hey Eric,
> 
> On Mon, 05 Jun 2023 12:28:12 +0100,
> Eric Auger <eauger@redhat.com> wrote:
>>
>> Hi Marc,
>>
>> On 5/15/23 19:30, Marc Zyngier wrote:
>>> This is the 4th drop of NV support on arm64 for this year.
>>>
>>> For the previous episodes, see [1].
>>>
>>> What's changed:
>>>
>>> - New framework to track system register traps that are reinjected in
>>>   guest EL2. It is expected to replace the discrete handling we have
>>>   enjoyed so far, which didn't scale at all. This has already fixed a
>>>   number of bugs that were hidden (a bunch of traps were never
>>>   forwarded...). Still a work in progress, but this is going in the
>>>   right direction.
>>>
>>> - Allow the L1 hypervisor to have a S2 that has an input larger than
>>>   the L0 IPA space. This fixes a number of subtle issues, depending on
>>>   how the initial guest was created.
>>>
>>> - Consequently, the patch series has gone longer again. Boo. But
>>>   hopefully some of it is easier to review...
>>>
>>> [1] https://lore.kernel.org/r/20230405154008.3552854-1-maz@kernel.org
>>>
>>> Andre Przywara (1):
>>>   KVM: arm64: nv: vgic: Allow userland to set VGIC maintenance IRQ
>>
>> I guess you have executed kselftests on L1 guests. Have all the tests
>> passed there? On my end it stalls in the KVM_RUN.
> 
> No, I hardly run any kselftest, because they are just not designed to
> run at EL2 at all. There's some work to be done there, but I just
> don't have the bandwidth for that (hint, wink...)

oh OK, I missed that point. If nobody is working on this I can start
looking at it. Would be interesting to run them on nested guest too.
> 
>>
>> for instance
>> tools/testing/selftests/kvm/aarch64/aarch32_id_regs.c fails in
>> test_guest_raz(vcpu) on the KVM_RUN. Even with a basic
>>
>> static void guest_main(void)
>> {
>> GUEST_DONE();
>> }
> 
> My guess is that the test harness expects things to run at EL1.
> Depending on the value you get for HCR_EL2, you could get all sort of
> odd behaviours. Also, the harness configures EL1 only, which is
> unlikely to work at EL2. My conclusion is that "processor.c" needs to
> be taught about EL2, at the very least.
> 
>>
>> I get
>>  aarch32_id_regs-768     [002] .....   410.544665: kvm_exit: IRQ:
>> HSR_EC: 0x0000 (UNKNOWN), PC: 0x0000000000401ec4
>>  aarch32_id_regs-768     [002] d....   410.544666: kvm_entry: PC:
>> 0x0000000000401ec4
>>  aarch32_id_regs-768     [002] .....   410.544675: kvm_exit: IRQ:
>> HSR_EC: 0x0000 (UNKNOWN), PC: 0x0000000000401ec4
>>  aarch32_id_regs-768     [002] d....   410.544676: kvm_entry: PC:
>> 0x0000000000401ec4
>>  aarch32_id_regs-768     [002] .....   410.544685: kvm_exit: IRQ:
>> HSR_EC: 0x0000 (UNKNOWN), PC: 0x0000000000401ec4
>>
>> looping forever instead of
>>
>> aarch32_id_regs-1085576 [079] d..1. 1401295.068739: kvm_entry: PC:
>> 0x0000000000401ec4
>>  aarch32_id_regs-1085576 [079] ...1. 1401295.068745: kvm_exit: TRAP:
>> HSR_EC: 0x0020 (IABT_LOW), PC: 0x0000000000401ec4
>>  aarch32_id_regs-1085576 [079] d..1. 1401295.068790: kvm_entry: PC:
>> 0x0000000000401ec4
>>  aarch32_id_regs-1085576 [079] ...1. 1401295.068792: kvm_exit: TRAP:
>> HSR_EC: 0x0020 (IABT_LOW), PC: 0x0000000000401ec4
>>  aarch32_id_regs-1085576 [079] d..1. 1401295.068794: kvm_entry: PC:
>> 0x0000000000401ec4
>>  aarch32_id_regs-1085576 [079] ...1. 1401295.068795: kvm_exit: TRAP:
>> HSR_EC: 0x0020 (IABT_LOW), PC: 0x0000000000401ec4
>>  aarch32_id_regs-1085576 [079] d..1. 1401295.068797: kvm_entry: PC:
>> 0x0000000000401ec4
>> ../..
>>
>> Any idea or any known restriction wrt kselftests?
> 
> See above. I'd love someone to actually start looking into it and
> devise a testing harness that would run both at EL{0,1,2} *at the same
> time* so that we can start exercising some of the trap behaviours that
> the architecture mandates.
> 
> Also, Alexandru had a some KUT tests a few years ago, but I don't know
> what happened of them.

Yeah I remember that one too. Alexandru, do you have plans to revive it?
That would be also interesting to run kuts on nested, giving a chance to
have incremental and more unitary testing.

Thanks!

Eric

> 
> Thanks,
> 
> 	M.
>
Eric Auger June 6, 2023, 9:33 a.m. UTC | #9
Hi Marc,

On 5/17/23 16:12, Marc Zyngier wrote:
> On Wed, 17 May 2023 09:59:45 +0100,
> Eric Auger <eauger@redhat.com> wrote:
>>
>> Hi Marc,
>>Hi Marc,
>> On 5/16/23 22:28, Marc Zyngier wrote:
>>> On Tue, 16 May 2023 17:53:14 +0100,
>>> Eric Auger <eauger@redhat.com> wrote:
>>>>
>>>> Hi Marc,
>>>>
>>>> On 5/15/23 19:30, Marc Zyngier wrote:
>>>>> This is the 4th drop of NV support on arm64 for this year.
>>>>>
>>>>> For the previous episodes, see [1].
>>>>>
>>>>> What's changed:
>>>>>
>>>>> - New framework to track system register traps that are reinjected in
>>>>>   guest EL2. It is expected to replace the discrete handling we have
>>>>>   enjoyed so far, which didn't scale at all. This has already fixed a
>>>>>   number of bugs that were hidden (a bunch of traps were never
>>>>>   forwarded...). Still a work in progress, but this is going in the
>>>>>   right direction.
>>>>>
>>>>> - Allow the L1 hypervisor to have a S2 that has an input larger than
>>>>>   the L0 IPA space. This fixes a number of subtle issues, depending on
>>>>>   how the initial guest was created.
>>>>>
>>>>> - Consequently, the patch series has gone longer again. Boo. But
>>>>>   hopefully some of it is easier to review...
>>>>>
>>>>> [1] https://lore.kernel.org/r/20230405154008.3552854-1-maz@kernel.org
>>>>
>>>> I have started testing this and when booting my fedora guest I get
>>>>
>>>> [  151.796544] kvm [7617]: Unsupported guest sys_reg access at:
>>>> 23f425fd0 [80000209]
>>>> [  151.796544]  { Op0( 3), Op1( 3), CRn(14), CRm( 3), Op2( 1), func_write },
>>>>
>>>> as soon as the host has kvm-arm.mode=nested
>>>>
>>>> This seems to be triggered very early by EDK2
>>>> (ArmPkg/Drivers/TimerDxe/TimerDxe.c).
>>>>
>>>> If I am not wrong this CNTV_CTL_EL0. Do you have any idea?
>>>
>>> So here's my current analysis:
>>>
>>> I assume you are running EDK2 as the L1 guest in a nested
>>> configuration. I also assume that you are not running on an Apple
>>> CPU. If these assumptions are correct, then EDK2 runs at vEL2, and is
>>> in nVHE mode.
>>>
>>> Finally, I'm going to assume that your implementation has FEAT_ECV and
>>> FEAT_NV2, because I can't see how it could fail otherwise.
>> all the above is correct.
>>>
>>> In these precise conditions, KVM sets the CNTHCTL_EL2.EL1TVT bit so
>>> that we can trap the EL0 virtual timer and faithfully emulate it (it
>>> is otherwise written to memory, which isn't very helpful).
>>
>> indeed
>>>
>>> As it turns out, we don't handle these traps. I didn't spot it because
>>> my test machines are all Apple boxes that don't have a nVHE mode, so
>>> nothing on the nVHE path is getting *ANY* coverage. Hint: having
>>> access to such a machine would help (shipping address on request!).
>>> Otherwise, I'll eventually kill the nVHE support altogether.
>>>
>>> I have written the following patch, which compiles, but that I cannot
>>> test with my current setup. Could you please give it a go?
>>
>> with the patch below, my guest boots nicely. You did it great on the 1st
>> shot!!! So this fixes my issue. I will continue testing the v10.
> 
> Thanks a lot for reporting the issue and testing my hacks. I'll
> eventually fold it into the rest of the series.
> 
> By the way, what are you using as your VMM? I'd really like to
> reproduce your setup.
Sorry I missed your reply. I am using libvirt + qemu (feat Miguel's RFC)
and fedora L1 guest.

Thanks to your fix, this boots fine. But at the moment it does not
reboot and hangs in edk2 I think. Unfortunately this time I have no
trace on host :-( While looking at your series I will add some traces.

Eric
> 
> Cheers,
> 
> 	M.
>
Marc Zyngier June 6, 2023, 4:22 p.m. UTC | #10
On Tue, 06 Jun 2023 10:29:36 +0100,
Eric Auger <eauger@redhat.com> wrote:
> 
> Hi Marc,
> On 6/6/23 09:30, Marc Zyngier wrote:
> > Hey Eric,
> > 
> > On Mon, 05 Jun 2023 12:28:12 +0100,
> > Eric Auger <eauger@redhat.com> wrote:
> >>
> >> Hi Marc,
> >>
> >> On 5/15/23 19:30, Marc Zyngier wrote:
> >>> This is the 4th drop of NV support on arm64 for this year.
> >>>
> >>> For the previous episodes, see [1].
> >>>
> >>> What's changed:
> >>>
> >>> - New framework to track system register traps that are reinjected in
> >>>   guest EL2. It is expected to replace the discrete handling we have
> >>>   enjoyed so far, which didn't scale at all. This has already fixed a
> >>>   number of bugs that were hidden (a bunch of traps were never
> >>>   forwarded...). Still a work in progress, but this is going in the
> >>>   right direction.
> >>>
> >>> - Allow the L1 hypervisor to have a S2 that has an input larger than
> >>>   the L0 IPA space. This fixes a number of subtle issues, depending on
> >>>   how the initial guest was created.
> >>>
> >>> - Consequently, the patch series has gone longer again. Boo. But
> >>>   hopefully some of it is easier to review...
> >>>
> >>> [1] https://lore.kernel.org/r/20230405154008.3552854-1-maz@kernel.org
> >>>
> >>> Andre Przywara (1):
> >>>   KVM: arm64: nv: vgic: Allow userland to set VGIC maintenance IRQ
> >>
> >> I guess you have executed kselftests on L1 guests. Have all the tests
> >> passed there? On my end it stalls in the KVM_RUN.
> > 
> > No, I hardly run any kselftest, because they are just not designed to
> > run at EL2 at all. There's some work to be done there, but I just
> > don't have the bandwidth for that (hint, wink...)
> 
> oh OK, I missed that point. If nobody is working on this I can start
> looking at it. Would be interesting to run them on nested guest too.

If you want to pick this up, it would be extremely helpful. And no,
nobody is really looking into it at the moment, so it's all yours!

Thanks,

	M.
Marc Zyngier June 6, 2023, 4:30 p.m. UTC | #11
On Tue, 06 Jun 2023 10:33:27 +0100,
Eric Auger <eauger@redhat.com> wrote:
> 
> > By the way, what are you using as your VMM? I'd really like to
> > reproduce your setup.
> Sorry I missed your reply. I am using libvirt + qemu (feat Miguel's RFC)
> and fedora L1 guest.

OK, so that's *very* different for what I'm using, which is good! Do
you have a QEMU branch somewhere?

> Thanks to your fix, this boots fine. But at the moment it does not
> reboot and hangs in edk2 I think. Unfortunately this time I have no
> trace on host :-( While looking at your series I will add some traces.

I've been able to run EDK2 compiled for kvmtool with only a couple of
change (such as using SMCs instead of HVCs, and disabling the broken
DT-to-ACPI convertion). However, reboot isn't something I've played
with, as kvmtool doesn't even try to reboot the guest (reboot is
handled as power-off).

I'm pretty sure this is related to tearing down the shadow S2 MMU
contexts when QEMU reinit the vcpus, and we may fail to clean things
up.

I'll try and have a look once I'm back from holiday (and we can have a
look at KVM Forum).

Thanks,

	M.
Miguel Luis June 6, 2023, 5:52 p.m. UTC | #12
Hello Eric, Marc,

> On 6 Jun 2023, at 09:33, Eric Auger <eauger@redhat.com> wrote:
> 
> Hi Marc,
> 
> On 5/17/23 16:12, Marc Zyngier wrote:
>> On Wed, 17 May 2023 09:59:45 +0100,
>> Eric Auger <eauger@redhat.com> wrote:
>>> 
>>> Hi Marc,
>>> Hi Marc,
>>> On 5/16/23 22:28, Marc Zyngier wrote:
>>>> On Tue, 16 May 2023 17:53:14 +0100,
>>>> Eric Auger <eauger@redhat.com> wrote:
>>>>> 
>>>>> Hi Marc,
>>>>> 
>>>>> On 5/15/23 19:30, Marc Zyngier wrote:
>>>>>> This is the 4th drop of NV support on arm64 for this year.
>>>>>> 
>>>>>> For the previous episodes, see [1].
>>>>>> 
>>>>>> What's changed:
>>>>>> 
>>>>>> - New framework to track system register traps that are reinjected in
>>>>>>  guest EL2. It is expected to replace the discrete handling we have
>>>>>>  enjoyed so far, which didn't scale at all. This has already fixed a
>>>>>>  number of bugs that were hidden (a bunch of traps were never
>>>>>>  forwarded...). Still a work in progress, but this is going in the
>>>>>>  right direction.
>>>>>> 
>>>>>> - Allow the L1 hypervisor to have a S2 that has an input larger than
>>>>>>  the L0 IPA space. This fixes a number of subtle issues, depending on
>>>>>>  how the initial guest was created.
>>>>>> 
>>>>>> - Consequently, the patch series has gone longer again. Boo. But
>>>>>>  hopefully some of it is easier to review...
>>>>>> 
>>>>>> [1] https://lore.kernel.org/r/20230405154008.3552854-1-maz@kernel.org
>>>>> 
>>>>> I have started testing this and when booting my fedora guest I get
>>>>> 
>>>>> [  151.796544] kvm [7617]: Unsupported guest sys_reg access at:
>>>>> 23f425fd0 [80000209]
>>>>> [  151.796544]  { Op0( 3), Op1( 3), CRn(14), CRm( 3), Op2( 1), func_write },
>>>>> 
>>>>> as soon as the host has kvm-arm.mode=nested
>>>>> 
>>>>> This seems to be triggered very early by EDK2
>>>>> (ArmPkg/Drivers/TimerDxe/TimerDxe.c).
>>>>> 
>>>>> If I am not wrong this CNTV_CTL_EL0. Do you have any idea?
>>>> 
>>>> So here's my current analysis:
>>>> 
>>>> I assume you are running EDK2 as the L1 guest in a nested
>>>> configuration. I also assume that you are not running on an Apple
>>>> CPU. If these assumptions are correct, then EDK2 runs at vEL2, and is
>>>> in nVHE mode.
>>>> 
>>>> Finally, I'm going to assume that your implementation has FEAT_ECV and
>>>> FEAT_NV2, because I can't see how it could fail otherwise.
>>> all the above is correct.
>>>> 
>>>> In these precise conditions, KVM sets the CNTHCTL_EL2.EL1TVT bit so
>>>> that we can trap the EL0 virtual timer and faithfully emulate it (it
>>>> is otherwise written to memory, which isn't very helpful).
>>> 
>>> indeed
>>>> 
>>>> As it turns out, we don't handle these traps. I didn't spot it because
>>>> my test machines are all Apple boxes that don't have a nVHE mode, so
>>>> nothing on the nVHE path is getting *ANY* coverage. Hint: having
>>>> access to such a machine would help (shipping address on request!).
>>>> Otherwise, I'll eventually kill the nVHE support altogether.
>>>> 
>>>> I have written the following patch, which compiles, but that I cannot
>>>> test with my current setup. Could you please give it a go?
>>> 
>>> with the patch below, my guest boots nicely. You did it great on the 1st
>>> shot!!! So this fixes my issue. I will continue testing the v10.
>> 
>> Thanks a lot for reporting the issue and testing my hacks. I'll
>> eventually fold it into the rest of the series.
>> 
>> By the way, what are you using as your VMM? I'd really like to
>> reproduce your setup.
> Sorry I missed your reply. I am using libvirt + qemu (feat Miguel's RFC)
> and fedora L1 guest.
> 

Following this subject, I’ve forward ported Alexandru’s KUT patches
( and I encourage others to do it also =) ) which expose an EL2 test that
does three checks:

- whether VHE is supported and enabled
- disable VHE
- re-enable VHE  

I’m running qemu with virtualization=on as well to run this test and it is passing although
problems seem to happen when running with virtualization=off, which I’m still looking into it.

Thanks
Miguel

> Thanks to your fix, this boots fine. But at the moment it does not
> reboot and hangs in edk2 I think. Unfortunately this time I have no
> trace on host :-( While looking at your series I will add some traces.
> 
> Eric
>> 
>> Cheers,
>> 
>> M.
Eric Auger June 7, 2023, 4:30 p.m. UTC | #13
Hi Marc,

On 6/6/23 18:22, Marc Zyngier wrote:
> On Tue, 06 Jun 2023 10:29:36 +0100,
> Eric Auger <eauger@redhat.com> wrote:
>>
>> Hi Marc,
>> On 6/6/23 09:30, Marc Zyngier wrote:
>>> Hey Eric,
>>>
>>> On Mon, 05 Jun 2023 12:28:12 +0100,
>>> Eric Auger <eauger@redhat.com> wrote:
>>>>
>>>> Hi Marc,
>>>>
>>>> On 5/15/23 19:30, Marc Zyngier wrote:
>>>>> This is the 4th drop of NV support on arm64 for this year.
>>>>>
>>>>> For the previous episodes, see [1].
>>>>>
>>>>> What's changed:
>>>>>
>>>>> - New framework to track system register traps that are reinjected in
>>>>>   guest EL2. It is expected to replace the discrete handling we have
>>>>>   enjoyed so far, which didn't scale at all. This has already fixed a
>>>>>   number of bugs that were hidden (a bunch of traps were never
>>>>>   forwarded...). Still a work in progress, but this is going in the
>>>>>   right direction.
>>>>>
>>>>> - Allow the L1 hypervisor to have a S2 that has an input larger than
>>>>>   the L0 IPA space. This fixes a number of subtle issues, depending on
>>>>>   how the initial guest was created.
>>>>>
>>>>> - Consequently, the patch series has gone longer again. Boo. But
>>>>>   hopefully some of it is easier to review...
>>>>>
>>>>> [1] https://lore.kernel.org/r/20230405154008.3552854-1-maz@kernel.org
>>>>>
>>>>> Andre Przywara (1):
>>>>>   KVM: arm64: nv: vgic: Allow userland to set VGIC maintenance IRQ
>>>>
>>>> I guess you have executed kselftests on L1 guests. Have all the tests
>>>> passed there? On my end it stalls in the KVM_RUN.
>>>
>>> No, I hardly run any kselftest, because they are just not designed to
>>> run at EL2 at all. There's some work to be done there, but I just
>>> don't have the bandwidth for that (hint, wink...)
>>
>> oh OK, I missed that point. If nobody is working on this I can start
>> looking at it. Would be interesting to run them on nested guest too.
> 
> If you want to pick this up, it would be extremely helpful. And no,
> nobody is really looking into it at the moment, so it's all yours!

OK I will study that then :-)

Eric
> 
> Thanks,
> 
> 	M.
>
Eric Auger June 7, 2023, 4:39 p.m. UTC | #14
Hi Marc,

On 6/6/23 18:30, Marc Zyngier wrote:
> On Tue, 06 Jun 2023 10:33:27 +0100,
> Eric Auger <eauger@redhat.com> wrote:
>>
>>> By the way, what are you using as your VMM? I'd really like to
>>> reproduce your setup.
>> Sorry I missed your reply. I am using libvirt + qemu (feat Miguel's RFC)
>> and fedora L1 guest.
> 
> OK, so that's *very* different for what I'm using, which is good! Do
> you have a QEMU branch somewhere?

here it is:
https://github.com/eauger/qemu/tree/nv_rfc_rebase
> 
>> Thanks to your fix, this boots fine. But at the moment it does not
>> reboot and hangs in edk2 I think. Unfortunately this time I have no
>> trace on host :-( While looking at your series I will add some traces.
> 
> I've been able to run EDK2 compiled for kvmtool with only a couple of
> change (such as using SMCs instead of HVCs, and disabling the broken
> DT-to-ACPI convertion). However, reboot isn't something I've played
> with, as kvmtool doesn't even try to reboot the guest (reboot is
> handled as power-off).
> 
> I'm pretty sure this is related to tearing down the shadow S2 MMU
> contexts when QEMU reinit the vcpus, and we may fail to clean things
> up.


> 
> I'll try and have a look once I'm back from holiday (and we can have a
> look at KVM Forum).

I won't be there this year. But enjoy the event and enjoy your vacation!

Thanks

Eric
> 
> Thanks,
> 
> 	M.
>
Eric Auger June 7, 2023, 4:40 p.m. UTC | #15
Hi Miguel,

On 6/6/23 19:52, Miguel Luis wrote:
> Hello Eric, Marc,
> 
>> On 6 Jun 2023, at 09:33, Eric Auger <eauger@redhat.com> wrote:
>>
>> Hi Marc,
>>
>> On 5/17/23 16:12, Marc Zyngier wrote:
>>> On Wed, 17 May 2023 09:59:45 +0100,
>>> Eric Auger <eauger@redhat.com> wrote:
>>>>
>>>> Hi Marc,
>>>> Hi Marc,
>>>> On 5/16/23 22:28, Marc Zyngier wrote:
>>>>> On Tue, 16 May 2023 17:53:14 +0100,
>>>>> Eric Auger <eauger@redhat.com> wrote:
>>>>>>
>>>>>> Hi Marc,
>>>>>>
>>>>>> On 5/15/23 19:30, Marc Zyngier wrote:
>>>>>>> This is the 4th drop of NV support on arm64 for this year.
>>>>>>>
>>>>>>> For the previous episodes, see [1].
>>>>>>>
>>>>>>> What's changed:
>>>>>>>
>>>>>>> - New framework to track system register traps that are reinjected in
>>>>>>>  guest EL2. It is expected to replace the discrete handling we have
>>>>>>>  enjoyed so far, which didn't scale at all. This has already fixed a
>>>>>>>  number of bugs that were hidden (a bunch of traps were never
>>>>>>>  forwarded...). Still a work in progress, but this is going in the
>>>>>>>  right direction.
>>>>>>>
>>>>>>> - Allow the L1 hypervisor to have a S2 that has an input larger than
>>>>>>>  the L0 IPA space. This fixes a number of subtle issues, depending on
>>>>>>>  how the initial guest was created.
>>>>>>>
>>>>>>> - Consequently, the patch series has gone longer again. Boo. But
>>>>>>>  hopefully some of it is easier to review...
>>>>>>>
>>>>>>> [1] https://lore.kernel.org/r/20230405154008.3552854-1-maz@kernel.org
>>>>>>
>>>>>> I have started testing this and when booting my fedora guest I get
>>>>>>
>>>>>> [  151.796544] kvm [7617]: Unsupported guest sys_reg access at:
>>>>>> 23f425fd0 [80000209]
>>>>>> [  151.796544]  { Op0( 3), Op1( 3), CRn(14), CRm( 3), Op2( 1), func_write },
>>>>>>
>>>>>> as soon as the host has kvm-arm.mode=nested
>>>>>>
>>>>>> This seems to be triggered very early by EDK2
>>>>>> (ArmPkg/Drivers/TimerDxe/TimerDxe.c).
>>>>>>
>>>>>> If I am not wrong this CNTV_CTL_EL0. Do you have any idea?
>>>>>
>>>>> So here's my current analysis:
>>>>>
>>>>> I assume you are running EDK2 as the L1 guest in a nested
>>>>> configuration. I also assume that you are not running on an Apple
>>>>> CPU. If these assumptions are correct, then EDK2 runs at vEL2, and is
>>>>> in nVHE mode.
>>>>>
>>>>> Finally, I'm going to assume that your implementation has FEAT_ECV and
>>>>> FEAT_NV2, because I can't see how it could fail otherwise.
>>>> all the above is correct.
>>>>>
>>>>> In these precise conditions, KVM sets the CNTHCTL_EL2.EL1TVT bit so
>>>>> that we can trap the EL0 virtual timer and faithfully emulate it (it
>>>>> is otherwise written to memory, which isn't very helpful).
>>>>
>>>> indeed
>>>>>
>>>>> As it turns out, we don't handle these traps. I didn't spot it because
>>>>> my test machines are all Apple boxes that don't have a nVHE mode, so
>>>>> nothing on the nVHE path is getting *ANY* coverage. Hint: having
>>>>> access to such a machine would help (shipping address on request!).
>>>>> Otherwise, I'll eventually kill the nVHE support altogether.
>>>>>
>>>>> I have written the following patch, which compiles, but that I cannot
>>>>> test with my current setup. Could you please give it a go?
>>>>
>>>> with the patch below, my guest boots nicely. You did it great on the 1st
>>>> shot!!! So this fixes my issue. I will continue testing the v10.
>>>
>>> Thanks a lot for reporting the issue and testing my hacks. I'll
>>> eventually fold it into the rest of the series.
>>>
>>> By the way, what are you using as your VMM? I'd really like to
>>> reproduce your setup.
>> Sorry I missed your reply. I am using libvirt + qemu (feat Miguel's RFC)
>> and fedora L1 guest.
>>
> 
> Following this subject, I’ve forward ported Alexandru’s KUT patches
> ( and I encourage others to do it also =) ) which expose an EL2 test that

Do you have a branch available with Alexandru's rebased kut series?

Thanks

Eric
> does three checks:
> 
> - whether VHE is supported and enabled
> - disable VHE
> - re-enable VHE  
> 
> I’m running qemu with virtualization=on as well to run this test and it is passing although
> problems seem to happen when running with virtualization=off, which I’m still looking into it.
> 
> Thanks
> Miguel
> 
>> Thanks to your fix, this boots fine. But at the moment it does not
>> reboot and hangs in edk2 I think. Unfortunately this time I have no
>> trace on host :-( While looking at your series I will add some traces.
>>
>> Eric
>>>
>>> Cheers,
>>>
>>> M.
> 
>
Miguel Luis June 10, 2023, 8:25 a.m. UTC | #16
Hi Eric,

> On 7 Jun 2023, at 16:40, Eric Auger <eauger@redhat.com> wrote:
> 
> Hi Miguel,
> 
> On 6/6/23 19:52, Miguel Luis wrote:
>> Hello Eric, Marc,
>> 
>>> On 6 Jun 2023, at 09:33, Eric Auger <eauger@redhat.com> wrote:
>>> 
>>> Hi Marc,
>>> 
>>> On 5/17/23 16:12, Marc Zyngier wrote:
>>>> On Wed, 17 May 2023 09:59:45 +0100,
>>>> Eric Auger <eauger@redhat.com> wrote:
>>>>> 
>>>>> Hi Marc,
>>>>> Hi Marc,
>>>>> On 5/16/23 22:28, Marc Zyngier wrote:
>>>>>> On Tue, 16 May 2023 17:53:14 +0100,
>>>>>> Eric Auger <eauger@redhat.com> wrote:
>>>>>>> 
>>>>>>> Hi Marc,
>>>>>>> 
>>>>>>> On 5/15/23 19:30, Marc Zyngier wrote:
>>>>>>>> This is the 4th drop of NV support on arm64 for this year.
>>>>>>>> 
>>>>>>>> For the previous episodes, see [1].
>>>>>>>> 
>>>>>>>> What's changed:
>>>>>>>> 
>>>>>>>> - New framework to track system register traps that are reinjected in
>>>>>>>> guest EL2. It is expected to replace the discrete handling we have
>>>>>>>> enjoyed so far, which didn't scale at all. This has already fixed a
>>>>>>>> number of bugs that were hidden (a bunch of traps were never
>>>>>>>> forwarded...). Still a work in progress, but this is going in the
>>>>>>>> right direction.
>>>>>>>> 
>>>>>>>> - Allow the L1 hypervisor to have a S2 that has an input larger than
>>>>>>>> the L0 IPA space. This fixes a number of subtle issues, depending on
>>>>>>>> how the initial guest was created.
>>>>>>>> 
>>>>>>>> - Consequently, the patch series has gone longer again. Boo. But
>>>>>>>> hopefully some of it is easier to review...
>>>>>>>> 
>>>>>>>> [1] https://lore.kernel.org/r/20230405154008.3552854-1-maz@kernel.org
>>>>>>> 
>>>>>>> I have started testing this and when booting my fedora guest I get
>>>>>>> 
>>>>>>> [  151.796544] kvm [7617]: Unsupported guest sys_reg access at:
>>>>>>> 23f425fd0 [80000209]
>>>>>>> [  151.796544]  { Op0( 3), Op1( 3), CRn(14), CRm( 3), Op2( 1), func_write },
>>>>>>> 
>>>>>>> as soon as the host has kvm-arm.mode=nested
>>>>>>> 
>>>>>>> This seems to be triggered very early by EDK2
>>>>>>> (ArmPkg/Drivers/TimerDxe/TimerDxe.c).
>>>>>>> 
>>>>>>> If I am not wrong this CNTV_CTL_EL0. Do you have any idea?
>>>>>> 
>>>>>> So here's my current analysis:
>>>>>> 
>>>>>> I assume you are running EDK2 as the L1 guest in a nested
>>>>>> configuration. I also assume that you are not running on an Apple
>>>>>> CPU. If these assumptions are correct, then EDK2 runs at vEL2, and is
>>>>>> in nVHE mode.
>>>>>> 
>>>>>> Finally, I'm going to assume that your implementation has FEAT_ECV and
>>>>>> FEAT_NV2, because I can't see how it could fail otherwise.
>>>>> all the above is correct.
>>>>>> 
>>>>>> In these precise conditions, KVM sets the CNTHCTL_EL2.EL1TVT bit so
>>>>>> that we can trap the EL0 virtual timer and faithfully emulate it (it
>>>>>> is otherwise written to memory, which isn't very helpful).
>>>>> 
>>>>> indeed
>>>>>> 
>>>>>> As it turns out, we don't handle these traps. I didn't spot it because
>>>>>> my test machines are all Apple boxes that don't have a nVHE mode, so
>>>>>> nothing on the nVHE path is getting *ANY* coverage. Hint: having
>>>>>> access to such a machine would help (shipping address on request!).
>>>>>> Otherwise, I'll eventually kill the nVHE support altogether.
>>>>>> 
>>>>>> I have written the following patch, which compiles, but that I cannot
>>>>>> test with my current setup. Could you please give it a go?
>>>>> 
>>>>> with the patch below, my guest boots nicely. You did it great on the 1st
>>>>> shot!!! So this fixes my issue. I will continue testing the v10.
>>>> 
>>>> Thanks a lot for reporting the issue and testing my hacks. I'll
>>>> eventually fold it into the rest of the series.
>>>> 
>>>> By the way, what are you using as your VMM? I'd really like to
>>>> reproduce your setup.
>>> Sorry I missed your reply. I am using libvirt + qemu (feat Miguel's RFC)
>>> and fedora L1 guest.
>>> 
>> 
>> Following this subject, I’ve forward ported Alexandru’s KUT patches
>> ( and I encourage others to do it also =) ) which expose an EL2 test that
> 
> Do you have a branch available with Alexandru's rebased kut series?

Now I do :)

https://github.com/mluis/kvm-unit-tests/tree/nv-WIP

Thanks,
Miguel

> 
> Thanks
> 
> Eric
>> does three checks:
>> 
>> - whether VHE is supported and enabled
>> - disable VHE
>> - re-enable VHE  
>> 
>> I’m running qemu with virtualization=on as well to run this test and it is passing although
>> problems seem to happen when running with virtualization=off, which I’m still looking into it.
>> 
>> Thanks
>> Miguel
>> 
>>> Thanks to your fix, this boots fine. But at the moment it does not
>>> reboot and hangs in edk2 I think. Unfortunately this time I have no
>>> trace on host :-( While looking at your series I will add some traces.
>>> 
>>> Eric
>>>> 
>>>> Cheers,
>>>> 
>>>> M.
Eric Auger June 11, 2023, 4:28 p.m. UTC | #17
Hi Miguel,

On 6/10/23 10:25, Miguel Luis wrote:
> Hi Eric,
> 
>> On 7 Jun 2023, at 16:40, Eric Auger <eauger@redhat.com> wrote:
>>
>> Hi Miguel,
>>
>> On 6/6/23 19:52, Miguel Luis wrote:
>>> Hello Eric, Marc,
>>>
>>>> On 6 Jun 2023, at 09:33, Eric Auger <eauger@redhat.com> wrote:
>>>>
>>>> Hi Marc,
>>>>
>>>> On 5/17/23 16:12, Marc Zyngier wrote:
>>>>> On Wed, 17 May 2023 09:59:45 +0100,
>>>>> Eric Auger <eauger@redhat.com> wrote:
>>>>>>
>>>>>> Hi Marc,
>>>>>> Hi Marc,
>>>>>> On 5/16/23 22:28, Marc Zyngier wrote:
>>>>>>> On Tue, 16 May 2023 17:53:14 +0100,
>>>>>>> Eric Auger <eauger@redhat.com> wrote:
>>>>>>>>
>>>>>>>> Hi Marc,
>>>>>>>>
>>>>>>>> On 5/15/23 19:30, Marc Zyngier wrote:
>>>>>>>>> This is the 4th drop of NV support on arm64 for this year.
>>>>>>>>>
>>>>>>>>> For the previous episodes, see [1].
>>>>>>>>>
>>>>>>>>> What's changed:
>>>>>>>>>
>>>>>>>>> - New framework to track system register traps that are reinjected in
>>>>>>>>> guest EL2. It is expected to replace the discrete handling we have
>>>>>>>>> enjoyed so far, which didn't scale at all. This has already fixed a
>>>>>>>>> number of bugs that were hidden (a bunch of traps were never
>>>>>>>>> forwarded...). Still a work in progress, but this is going in the
>>>>>>>>> right direction.
>>>>>>>>>
>>>>>>>>> - Allow the L1 hypervisor to have a S2 that has an input larger than
>>>>>>>>> the L0 IPA space. This fixes a number of subtle issues, depending on
>>>>>>>>> how the initial guest was created.
>>>>>>>>>
>>>>>>>>> - Consequently, the patch series has gone longer again. Boo. But
>>>>>>>>> hopefully some of it is easier to review...
>>>>>>>>>
>>>>>>>>> [1] https://lore.kernel.org/r/20230405154008.3552854-1-maz@kernel.org
>>>>>>>>
>>>>>>>> I have started testing this and when booting my fedora guest I get
>>>>>>>>
>>>>>>>> [  151.796544] kvm [7617]: Unsupported guest sys_reg access at:
>>>>>>>> 23f425fd0 [80000209]
>>>>>>>> [  151.796544]  { Op0( 3), Op1( 3), CRn(14), CRm( 3), Op2( 1), func_write },
>>>>>>>>
>>>>>>>> as soon as the host has kvm-arm.mode=nested
>>>>>>>>
>>>>>>>> This seems to be triggered very early by EDK2
>>>>>>>> (ArmPkg/Drivers/TimerDxe/TimerDxe.c).
>>>>>>>>
>>>>>>>> If I am not wrong this CNTV_CTL_EL0. Do you have any idea?
>>>>>>>
>>>>>>> So here's my current analysis:
>>>>>>>
>>>>>>> I assume you are running EDK2 as the L1 guest in a nested
>>>>>>> configuration. I also assume that you are not running on an Apple
>>>>>>> CPU. If these assumptions are correct, then EDK2 runs at vEL2, and is
>>>>>>> in nVHE mode.
>>>>>>>
>>>>>>> Finally, I'm going to assume that your implementation has FEAT_ECV and
>>>>>>> FEAT_NV2, because I can't see how it could fail otherwise.
>>>>>> all the above is correct.
>>>>>>>
>>>>>>> In these precise conditions, KVM sets the CNTHCTL_EL2.EL1TVT bit so
>>>>>>> that we can trap the EL0 virtual timer and faithfully emulate it (it
>>>>>>> is otherwise written to memory, which isn't very helpful).
>>>>>>
>>>>>> indeed
>>>>>>>
>>>>>>> As it turns out, we don't handle these traps. I didn't spot it because
>>>>>>> my test machines are all Apple boxes that don't have a nVHE mode, so
>>>>>>> nothing on the nVHE path is getting *ANY* coverage. Hint: having
>>>>>>> access to such a machine would help (shipping address on request!).
>>>>>>> Otherwise, I'll eventually kill the nVHE support altogether.
>>>>>>>
>>>>>>> I have written the following patch, which compiles, but that I cannot
>>>>>>> test with my current setup. Could you please give it a go?
>>>>>>
>>>>>> with the patch below, my guest boots nicely. You did it great on the 1st
>>>>>> shot!!! So this fixes my issue. I will continue testing the v10.
>>>>>
>>>>> Thanks a lot for reporting the issue and testing my hacks. I'll
>>>>> eventually fold it into the rest of the series.
>>>>>
>>>>> By the way, what are you using as your VMM? I'd really like to
>>>>> reproduce your setup.
>>>> Sorry I missed your reply. I am using libvirt + qemu (feat Miguel's RFC)
>>>> and fedora L1 guest.
>>>>
>>>
>>> Following this subject, I’ve forward ported Alexandru’s KUT patches
>>> ( and I encourage others to do it also =) ) which expose an EL2 test that
>>
>> Do you have a branch available with Alexandru's rebased kut series?
> 
> Now I do :)
> 
> https://github.com/mluis/kvm-unit-tests/tree/nv-WIP

Cool, thanks!

Eric
> 
> Thanks,
> Miguel
> 
>>
>> Thanks
>>
>> Eric
>>> does three checks:
>>>
>>> - whether VHE is supported and enabled
>>> - disable VHE
>>> - re-enable VHE  
>>>
>>> I’m running qemu with virtualization=on as well to run this test and it is passing although
>>> problems seem to happen when running with virtualization=off, which I’m still looking into it.
>>>
>>> Thanks
>>> Miguel
>>>
>>>> Thanks to your fix, this boots fine. But at the moment it does not
>>>> reboot and hangs in edk2 I think. Unfortunately this time I have no
>>>> trace on host :-( While looking at your series I will add some traces.
>>>>
>>>> Eric
>>>>>
>>>>> Cheers,
>>>>>
>>>>> M.
> 
>
Ganapatrao Kulkarni June 28, 2023, 6:45 a.m. UTC | #18
Hi Marc,


On 15-05-2023 11:00 pm, Marc Zyngier wrote:
> This is the 4th drop of NV support on arm64 for this year.
> 
> For the previous episodes, see [1].
> 
> What's changed:
> 
> - New framework to track system register traps that are reinjected in
>    guest EL2. It is expected to replace the discrete handling we have
>    enjoyed so far, which didn't scale at all. This has already fixed a
>    number of bugs that were hidden (a bunch of traps were never
>    forwarded...). Still a work in progress, but this is going in the
>    right direction.
> 
> - Allow the L1 hypervisor to have a S2 that has an input larger than
>    the L0 IPA space. This fixes a number of subtle issues, depending on
>    how the initial guest was created.
> 
> - Consequently, the patch series has gone longer again. Boo. But
>    hopefully some of it is easier to review...
> 

I am facing issue in booting NestedVM with V9 as well with 10 patchset.

I have tried V9/V10 on Ampere platform using kvmtool and I could boot
Guest-Hypervisor and then NestedVM without any issue.
However when I try to boot using QEMU(not using EDK2/EFI), 
Guest-Hypervisor is booted with Fedora 37 using virtio disk. From 
Guest-Hypervisor console(or ssh shell), If I try to boot NestedVM, boot 
hangs very early stage of the boot.

I did some debug using ftrace and it seems the Guest-Hypervisor is
getting very high rate of arch-timer interrupts,
due to that all CPU time is going on in serving the Guest-Hypervisor
and it is never going back to NestedVM.

I am using QEMU vanilla version v7.2.0 with top-up patches for NV [1]

[1] 
https://lore.kernel.org/all/20230227163718.62003-1-miguel.luis@oracle.com/

> [1] https://lore.kernel.org/r/20230405154008.3552854-1-maz@kernel.org
> 
> Andre Przywara (1):
>    KVM: arm64: nv: vgic: Allow userland to set VGIC maintenance IRQ
> 
> Christoffer Dall (5):
>    KVM: arm64: nv: Trap EL1 VM register accesses in virtual EL2
>    KVM: arm64: nv: Implement nested Stage-2 page table walk logic
>    KVM: arm64: nv: Unmap/flush shadow stage 2 page tables
>    KVM: arm64: nv: vgic: Emulate the HW bit in software
>    KVM: arm64: nv: Sync nested timer state with FEAT_NV2
> 
> Jintack Lim (7):
>    KVM: arm64: nv: Trap CPACR_EL1 access in virtual EL2
>    KVM: arm64: nv: Respect virtual HCR_EL2.TWX setting
>    KVM: arm64: nv: Respect virtual CPTR_EL2.{TFP,FPEN} settings
>    KVM: arm64: nv: Respect virtual HCR_EL2.{NV,TSC) settings
>    KVM: arm64: nv: Configure HCR_EL2 for nested virtualization
>    KVM: arm64: nv: Trap and emulate TLBI instructions from virtual EL2
>    KVM: arm64: nv: Nested GICv3 Support
> 
> Marc Zyngier (46):
>    KVM: arm64: Move VTCR_EL2 into struct s2_mmu
>    arm64: Add missing Set/Way CMO encodings
>    arm64: Add missing VA CMO encodings
>    arm64: Add missing ERXMISCx_EL1 encodings
>    arm64: Add missing DC ZVA/GVA/GZVA encodings
>    arm64: Add TLBI operation encodings
>    arm64: Add AT operation encodings
>    KVM: arm64: Add missing HCR_EL2 trap bits
>    KVM: arm64: nv: Add trap forwarding infrastructure
>    KVM: arm64: nv: Add trap forwarding for HCR_EL2
>    KVM: arm64: nv: Expose FEAT_EVT to nested guests
>    KVM: arm64: nv: Add trap forwarding for MDCR_EL2
>    KVM: arm64: nv: Add trap forwarding for CNTHCTL_EL2
>    KVM: arm64: nv: Add non-VHE-EL2->EL1 translation helpers
>    KVM: arm64: nv: Handle virtual EL2 registers in
>      vcpu_read/write_sys_reg()
>    KVM: arm64: nv: Handle SPSR_EL2 specially
>    KVM: arm64: nv: Handle HCR_EL2.E2H specially
>    KVM: arm64: nv: Save/Restore vEL2 sysregs
>    KVM: arm64: nv: Support multiple nested Stage-2 mmu structures
>    KVM: arm64: nv: Handle shadow stage 2 page faults
>    KVM: arm64: nv: Restrict S2 RD/WR permissions to match the guest's
>    KVM: arm64: nv: Set a handler for the system instruction traps
>    KVM: arm64: nv: Trap and emulate AT instructions from virtual EL2
>    KVM: arm64: nv: Fold guest's HCR_EL2 configuration into the host's
>    KVM: arm64: nv: Hide RAS from nested guests
>    KVM: arm64: nv: Add handling of EL2-specific timer registers
>    KVM: arm64: nv: Load timer before the GIC
>    KVM: arm64: nv: Don't load the GICv4 context on entering a nested
>      guest
>    KVM: arm64: nv: Implement maintenance interrupt forwarding
>    KVM: arm64: nv: Deal with broken VGIC on maintenance interrupt
>      delivery
>    KVM: arm64: nv: Allow userspace to request KVM_ARM_VCPU_NESTED_VIRT
>    KVM: arm64: nv: Add handling of FEAT_TTL TLB invalidation
>    KVM: arm64: nv: Invalidate TLBs based on shadow S2 TTL-like
>      information
>    KVM: arm64: nv: Tag shadow S2 entries with nested level
>    KVM: arm64: nv: Add include containing the VNCR_EL2 offsets
>    KVM: arm64: nv: Map VNCR-capable registers to a separate page
>    KVM: arm64: nv: Move nested vgic state into the sysreg file
>    KVM: arm64: Add FEAT_NV2 cpu feature
>    KVM: arm64: nv: Fold GICv3 host trapping requirements into guest setup
>    KVM: arm64: nv: Publish emulated timer interrupt state in the
>      in-memory state
>    KVM: arm64: nv: Allocate VNCR page when required
>    KVM: arm64: nv: Enable ARMv8.4-NV support
>    KVM: arm64: nv: Fast-track 'InHost' exception returns
>    KVM: arm64: nv: Fast-track EL1 TLBIs for VHE guests
>    KVM: arm64: nv: Use FEAT_ECV to trap access to EL0 timers
>    KVM: arm64: nv: Accelerate EL0 timer read accesses when FEAT_ECV is on
> 
>   .../virt/kvm/devices/arm-vgic-v3.rst          |  12 +-
>   arch/arm64/include/asm/esr.h                  |   1 +
>   arch/arm64/include/asm/kvm_arm.h              |  14 +
>   arch/arm64/include/asm/kvm_asm.h              |   4 +
>   arch/arm64/include/asm/kvm_emulate.h          |  93 +-
>   arch/arm64/include/asm/kvm_host.h             | 181 +++-
>   arch/arm64/include/asm/kvm_hyp.h              |   2 +
>   arch/arm64/include/asm/kvm_mmu.h              |  20 +-
>   arch/arm64/include/asm/kvm_nested.h           | 133 +++
>   arch/arm64/include/asm/stage2_pgtable.h       |   4 +-
>   arch/arm64/include/asm/sysreg.h               | 196 ++++
>   arch/arm64/include/asm/vncr_mapping.h         |  74 ++
>   arch/arm64/include/uapi/asm/kvm.h             |   1 +
>   arch/arm64/kernel/cpufeature.c                |  11 +
>   arch/arm64/kvm/Makefile                       |   4 +-
>   arch/arm64/kvm/arch_timer.c                   |  98 +-
>   arch/arm64/kvm/arm.c                          |  33 +-
>   arch/arm64/kvm/at.c                           | 219 ++++
>   arch/arm64/kvm/emulate-nested.c               | 934 ++++++++++++++++-
>   arch/arm64/kvm/handle_exit.c                  |  29 +-
>   arch/arm64/kvm/hyp/include/hyp/switch.h       |   8 +-
>   arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h    |   5 +-
>   arch/arm64/kvm/hyp/nvhe/mem_protect.c         |   8 +-
>   arch/arm64/kvm/hyp/nvhe/pkvm.c                |   4 +-
>   arch/arm64/kvm/hyp/nvhe/switch.c              |   2 +-
>   arch/arm64/kvm/hyp/nvhe/sysreg-sr.c           |   2 +-
>   arch/arm64/kvm/hyp/pgtable.c                  |   2 +-
>   arch/arm64/kvm/hyp/vgic-v3-sr.c               |   6 +-
>   arch/arm64/kvm/hyp/vhe/switch.c               | 206 +++-
>   arch/arm64/kvm/hyp/vhe/sysreg-sr.c            | 124 ++-
>   arch/arm64/kvm/hyp/vhe/tlb.c                  |  83 ++
>   arch/arm64/kvm/mmu.c                          | 255 ++++-
>   arch/arm64/kvm/nested.c                       | 799 ++++++++++++++-
>   arch/arm64/kvm/pkvm.c                         |   2 +-
>   arch/arm64/kvm/reset.c                        |   7 +
>   arch/arm64/kvm/sys_regs.c                     | 958 +++++++++++++++++-
>   arch/arm64/kvm/trace_arm.h                    |  19 +
>   arch/arm64/kvm/vgic/vgic-init.c               |  33 +
>   arch/arm64/kvm/vgic/vgic-kvm-device.c         |  32 +-
>   arch/arm64/kvm/vgic/vgic-v3-nested.c          | 248 +++++
>   arch/arm64/kvm/vgic/vgic-v3.c                 |  43 +-
>   arch/arm64/kvm/vgic/vgic.c                    |  29 +
>   arch/arm64/kvm/vgic/vgic.h                    |  10 +
>   arch/arm64/tools/cpucaps                      |   1 +
>   include/clocksource/arm_arch_timer.h          |   4 +
>   include/kvm/arm_arch_timer.h                  |   1 +
>   include/kvm/arm_vgic.h                        |  17 +
>   include/uapi/linux/kvm.h                      |   1 +
>   tools/arch/arm/include/uapi/asm/kvm.h         |   1 +
>   49 files changed, 4790 insertions(+), 183 deletions(-)
>   create mode 100644 arch/arm64/include/asm/vncr_mapping.h
>   create mode 100644 arch/arm64/kvm/at.c
>   create mode 100644 arch/arm64/kvm/vgic/vgic-v3-nested.c
> 

Thanks,
Ganapat
Marc Zyngier June 29, 2023, 7:03 a.m. UTC | #19
Hi Ganapatrao,

On Wed, 28 Jun 2023 07:45:55 +0100,
Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote:
> 
> 
> Hi Marc,
> 
> 
> On 15-05-2023 11:00 pm, Marc Zyngier wrote:
> > This is the 4th drop of NV support on arm64 for this year.
> > 
> > For the previous episodes, see [1].
> > 
> > What's changed:
> > 
> > - New framework to track system register traps that are reinjected in
> >    guest EL2. It is expected to replace the discrete handling we have
> >    enjoyed so far, which didn't scale at all. This has already fixed a
> >    number of bugs that were hidden (a bunch of traps were never
> >    forwarded...). Still a work in progress, but this is going in the
> >    right direction.
> > 
> > - Allow the L1 hypervisor to have a S2 that has an input larger than
> >    the L0 IPA space. This fixes a number of subtle issues, depending on
> >    how the initial guest was created.
> > 
> > - Consequently, the patch series has gone longer again. Boo. But
> >    hopefully some of it is easier to review...
> > 
> 
> I am facing issue in booting NestedVM with V9 as well with 10 patchset.
> 
> I have tried V9/V10 on Ampere platform using kvmtool and I could boot
> Guest-Hypervisor and then NestedVM without any issue.
> However when I try to boot using QEMU(not using EDK2/EFI),
> Guest-Hypervisor is booted with Fedora 37 using virtio disk. From
> Guest-Hypervisor console(or ssh shell), If I try to boot NestedVM,
> boot hangs very early stage of the boot.
> 
> I did some debug using ftrace and it seems the Guest-Hypervisor is
> getting very high rate of arch-timer interrupts,
> due to that all CPU time is going on in serving the Guest-Hypervisor
> and it is never going back to NestedVM.
> 
> I am using QEMU vanilla version v7.2.0 with top-up patches for NV [1]

So I went ahead and gave QEMU a go. On my systems, *nothing* works (I
cannot even boot a L1 with 'virtualization=on" (the guest is stuck at
the point where virtio gets probed and waits for its first interrupt).

Worse, booting a hVHE guest results in QEMU generating an assert as it
tries to inject an interrupt using the QEMU GICv3 model, something
that should *NEVER* be in use with KVM.

With help from Eric, I got to a point where the hVHE guest could boot
as long as I kept injecting console interrupts, which is again a
symptom of the vGIC not being used.

So something is *majorly* wrong with the QEMU patches. I don't know
what makes it possible for you to even boot the L1 - if the GIC is
external, injecting an interrupt in the L2 is simply impossible.

Miguel, can you please investigate this?

In the meantime, I'll add some code to the kernel side to refuse the
external interrupt controller configuration with NV. Hopefully that
will lead to some clues about what is going on.

Thanks,

	M.
Ganapatrao Kulkarni July 4, 2023, 12:31 p.m. UTC | #20
Hi Marc,

On 29-06-2023 12:33 pm, Marc Zyngier wrote:
> Hi Ganapatrao,
> 
> On Wed, 28 Jun 2023 07:45:55 +0100,
> Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote:
>>
>>
>> Hi Marc,
>>
>>
>> On 15-05-2023 11:00 pm, Marc Zyngier wrote:
>>> This is the 4th drop of NV support on arm64 for this year.
>>>
>>> For the previous episodes, see [1].
>>>
>>> What's changed:
>>>
>>> - New framework to track system register traps that are reinjected in
>>>     guest EL2. It is expected to replace the discrete handling we have
>>>     enjoyed so far, which didn't scale at all. This has already fixed a
>>>     number of bugs that were hidden (a bunch of traps were never
>>>     forwarded...). Still a work in progress, but this is going in the
>>>     right direction.
>>>
>>> - Allow the L1 hypervisor to have a S2 that has an input larger than
>>>     the L0 IPA space. This fixes a number of subtle issues, depending on
>>>     how the initial guest was created.
>>>
>>> - Consequently, the patch series has gone longer again. Boo. But
>>>     hopefully some of it is easier to review...
>>>
>>
>> I am facing issue in booting NestedVM with V9 as well with 10 patchset.
>>
>> I have tried V9/V10 on Ampere platform using kvmtool and I could boot
>> Guest-Hypervisor and then NestedVM without any issue.
>> However when I try to boot using QEMU(not using EDK2/EFI),
>> Guest-Hypervisor is booted with Fedora 37 using virtio disk. From
>> Guest-Hypervisor console(or ssh shell), If I try to boot NestedVM,
>> boot hangs very early stage of the boot.
>>
>> I did some debug using ftrace and it seems the Guest-Hypervisor is
>> getting very high rate of arch-timer interrupts,
>> due to that all CPU time is going on in serving the Guest-Hypervisor
>> and it is never going back to NestedVM.
>>
>> I am using QEMU vanilla version v7.2.0 with top-up patches for NV [1]
> 
> So I went ahead and gave QEMU a go. On my systems, *nothing* works (I
> cannot even boot a L1 with 'virtualization=on" (the guest is stuck at
> the point where virtio gets probed and waits for its first interrupt).
> 
> Worse, booting a hVHE guest results in QEMU generating an assert as it
> tries to inject an interrupt using the QEMU GICv3 model, something
> that should *NEVER* be in use with KVM.
> 
> With help from Eric, I got to a point where the hVHE guest could boot
> as long as I kept injecting console interrupts, which is again a
> symptom of the vGIC not being used.
> 
> So something is *majorly* wrong with the QEMU patches. I don't know
> what makes it possible for you to even boot the L1 - if the GIC is
> external, injecting an interrupt in the L2 is simply impossible.
> 
> Miguel, can you please investigate this?
> 
> In the meantime, I'll add some code to the kernel side to refuse the
> external interrupt controller configuration with NV. Hopefully that
> will lead to some clues about what is going on.

Continued debugging of the issue and it seems the endless ptimer 
interrupts on Ampere platform is due to some mess up of CVAL of ptimer, 
resulting in interrupt triggered always when it is enabled.

I see function "timer_set_offset" called from kvm_arm_timer_set_reg in 
QEMU case but there is no such calls in kvmtool boot.

If I comment the timer_set_offset calls in kvm_arm_timer_set_reg 
function, then I could boot the Guest-Hypervisor then NestedVM from GH/L1.

I also observed in QEMU case, kvm_arm_timer_set_reg is called to set 
CNT, CVAL and CTL of both vtimer and ptimer.
Not sure why QEMU is setting these registers explicitly? need to dig.

> 
> Thanks,
> 
> 	M.
> 

Thanks,
Ganapat
Ganapatrao Kulkarni July 7, 2023, 9:46 a.m. UTC | #21
On 04-07-2023 06:01 pm, Ganapatrao Kulkarni wrote:
> 
> Hi Marc,
> 
> On 29-06-2023 12:33 pm, Marc Zyngier wrote:
>> Hi Ganapatrao,
>>
>> On Wed, 28 Jun 2023 07:45:55 +0100,
>> Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote:
>>>
>>>
>>> Hi Marc,
>>>
>>>
>>> On 15-05-2023 11:00 pm, Marc Zyngier wrote:
>>>> This is the 4th drop of NV support on arm64 for this year.
>>>>
>>>> For the previous episodes, see [1].
>>>>
>>>> What's changed:
>>>>
>>>> - New framework to track system register traps that are reinjected in
>>>>     guest EL2. It is expected to replace the discrete handling we have
>>>>     enjoyed so far, which didn't scale at all. This has already fixed a
>>>>     number of bugs that were hidden (a bunch of traps were never
>>>>     forwarded...). Still a work in progress, but this is going in the
>>>>     right direction.
>>>>
>>>> - Allow the L1 hypervisor to have a S2 that has an input larger than
>>>>     the L0 IPA space. This fixes a number of subtle issues, 
>>>> depending on
>>>>     how the initial guest was created.
>>>>
>>>> - Consequently, the patch series has gone longer again. Boo. But
>>>>     hopefully some of it is easier to review...
>>>>
>>>
>>> I am facing issue in booting NestedVM with V9 as well with 10 patchset.
>>>
>>> I have tried V9/V10 on Ampere platform using kvmtool and I could boot
>>> Guest-Hypervisor and then NestedVM without any issue.
>>> However when I try to boot using QEMU(not using EDK2/EFI),
>>> Guest-Hypervisor is booted with Fedora 37 using virtio disk. From
>>> Guest-Hypervisor console(or ssh shell), If I try to boot NestedVM,
>>> boot hangs very early stage of the boot.
>>>
>>> I did some debug using ftrace and it seems the Guest-Hypervisor is
>>> getting very high rate of arch-timer interrupts,
>>> due to that all CPU time is going on in serving the Guest-Hypervisor
>>> and it is never going back to NestedVM.
>>>
>>> I am using QEMU vanilla version v7.2.0 with top-up patches for NV [1]
>>
>> So I went ahead and gave QEMU a go. On my systems, *nothing* works (I
>> cannot even boot a L1 with 'virtualization=on" (the guest is stuck at
>> the point where virtio gets probed and waits for its first interrupt).
>>
>> Worse, booting a hVHE guest results in QEMU generating an assert as it
>> tries to inject an interrupt using the QEMU GICv3 model, something
>> that should *NEVER* be in use with KVM.
>>
>> With help from Eric, I got to a point where the hVHE guest could boot
>> as long as I kept injecting console interrupts, which is again a
>> symptom of the vGIC not being used.
>>
>> So something is *majorly* wrong with the QEMU patches. I don't know
>> what makes it possible for you to even boot the L1 - if the GIC is
>> external, injecting an interrupt in the L2 is simply impossible.
>>
>> Miguel, can you please investigate this?
>>
>> In the meantime, I'll add some code to the kernel side to refuse the
>> external interrupt controller configuration with NV. Hopefully that
>> will lead to some clues about what is going on.
> 
> Continued debugging of the issue and it seems the endless ptimer 
> interrupts on Ampere platform is due to some mess up of CVAL of ptimer, 
> resulting in interrupt triggered always when it is enabled.
> 
> I see function "timer_set_offset" called from kvm_arm_timer_set_reg in 
> QEMU case but there is no such calls in kvmtool boot.
> 
> If I comment the timer_set_offset calls in kvm_arm_timer_set_reg 
> function, then I could boot the Guest-Hypervisor then NestedVM from GH/L1.
> 
> I also observed in QEMU case, kvm_arm_timer_set_reg is called to set 
> CNT, CVAL and CTL of both vtimer and ptimer.
> Not sure why QEMU is setting these registers explicitly? need to dig.
> 

I don't see any direct ioctl calls to change any timer registers. Looks 
like it is happening from the emulation code(target/arm/helper.c)?

>>
>> Thanks,
>>
>>     M.
>>
> 
> Thanks,
> Ganapat
> 

Thanks,
Ganapat
Miguel Luis July 10, 2023, 12:56 p.m. UTC | #22
Hi Marc,

> On 29 Jun 2023, at 07:03, Marc Zyngier <maz@kernel.org> wrote:
> 
> Hi Ganapatrao,
> 
> On Wed, 28 Jun 2023 07:45:55 +0100,
> Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote:
>> 
>> 
>> Hi Marc,
>> 
>> 
>> On 15-05-2023 11:00 pm, Marc Zyngier wrote:
>>> This is the 4th drop of NV support on arm64 for this year.
>>> 
>>> For the previous episodes, see [1].
>>> 
>>> What's changed:
>>> 
>>> - New framework to track system register traps that are reinjected in
>>>   guest EL2. It is expected to replace the discrete handling we have
>>>   enjoyed so far, which didn't scale at all. This has already fixed a
>>>   number of bugs that were hidden (a bunch of traps were never
>>>   forwarded...). Still a work in progress, but this is going in the
>>>   right direction.
>>> 
>>> - Allow the L1 hypervisor to have a S2 that has an input larger than
>>>   the L0 IPA space. This fixes a number of subtle issues, depending on
>>>   how the initial guest was created.
>>> 
>>> - Consequently, the patch series has gone longer again. Boo. But
>>>   hopefully some of it is easier to review...
>>> 
>> 
>> I am facing issue in booting NestedVM with V9 as well with 10 patchset.
>> 
>> I have tried V9/V10 on Ampere platform using kvmtool and I could boot
>> Guest-Hypervisor and then NestedVM without any issue.
>> However when I try to boot using QEMU(not using EDK2/EFI),
>> Guest-Hypervisor is booted with Fedora 37 using virtio disk. From
>> Guest-Hypervisor console(or ssh shell), If I try to boot NestedVM,
>> boot hangs very early stage of the boot.
>> 
>> I did some debug using ftrace and it seems the Guest-Hypervisor is
>> getting very high rate of arch-timer interrupts,
>> due to that all CPU time is going on in serving the Guest-Hypervisor
>> and it is never going back to NestedVM.
>> 
>> I am using QEMU vanilla version v7.2.0 with top-up patches for NV [1]
> 
> So I went ahead and gave QEMU a go. On my systems, *nothing* works (I
> cannot even boot a L1 with 'virtualization=on" (the guest is stuck at
> the point where virtio gets probed and waits for its first interrupt).
> 
> Worse, booting a hVHE guest results in QEMU generating an assert as it
> tries to inject an interrupt using the QEMU GICv3 model, something
> that should *NEVER* be in use with KVM.
> 
> With help from Eric, I got to a point where the hVHE guest could boot
> as long as I kept injecting console interrupts, which is again a
> symptom of the vGIC not being used.
> 
> So something is *majorly* wrong with the QEMU patches. I don't know
> what makes it possible for you to even boot the L1 - if the GIC is
> external, injecting an interrupt in the L2 is simply impossible.
> 
> Miguel, can you please investigate this?

Yes, I will investigate it. Sorry for the delay in replying as I took a break
short after KVM forum and I’ve just started to sync up.

Thanks,
Miguel

> 
> In the meantime, I'll add some code to the kernel side to refuse the
> external interrupt controller configuration with NV. Hopefully that
> will lead to some clues about what is going on.
> 
> Thanks,
> 
> M.
> 
> -- 
> Without deviation from the norm, progress is not possible.
Ganapatrao Kulkarni July 11, 2023, 11:56 a.m. UTC | #23
On 07-07-2023 03:16 pm, Ganapatrao Kulkarni wrote:
> 
> 
> On 04-07-2023 06:01 pm, Ganapatrao Kulkarni wrote:
>>
>> Hi Marc,
>>
>> On 29-06-2023 12:33 pm, Marc Zyngier wrote:
>>> Hi Ganapatrao,
>>>
>>> On Wed, 28 Jun 2023 07:45:55 +0100,
>>> Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote:
>>>>
>>>>
>>>> Hi Marc,
>>>>
>>>>
>>>> On 15-05-2023 11:00 pm, Marc Zyngier wrote:
>>>>> This is the 4th drop of NV support on arm64 for this year.
>>>>>
>>>>> For the previous episodes, see [1].
>>>>>
>>>>> What's changed:
>>>>>
>>>>> - New framework to track system register traps that are reinjected in
>>>>>     guest EL2. It is expected to replace the discrete handling we have
>>>>>     enjoyed so far, which didn't scale at all. This has already 
>>>>> fixed a
>>>>>     number of bugs that were hidden (a bunch of traps were never
>>>>>     forwarded...). Still a work in progress, but this is going in the
>>>>>     right direction.
>>>>>
>>>>> - Allow the L1 hypervisor to have a S2 that has an input larger than
>>>>>     the L0 IPA space. This fixes a number of subtle issues, 
>>>>> depending on
>>>>>     how the initial guest was created.
>>>>>
>>>>> - Consequently, the patch series has gone longer again. Boo. But
>>>>>     hopefully some of it is easier to review...
>>>>>
>>>>
>>>> I am facing issue in booting NestedVM with V9 as well with 10 patchset.
>>>>
>>>> I have tried V9/V10 on Ampere platform using kvmtool and I could boot
>>>> Guest-Hypervisor and then NestedVM without any issue.
>>>> However when I try to boot using QEMU(not using EDK2/EFI),
>>>> Guest-Hypervisor is booted with Fedora 37 using virtio disk. From
>>>> Guest-Hypervisor console(or ssh shell), If I try to boot NestedVM,
>>>> boot hangs very early stage of the boot.
>>>>
>>>> I did some debug using ftrace and it seems the Guest-Hypervisor is
>>>> getting very high rate of arch-timer interrupts,
>>>> due to that all CPU time is going on in serving the Guest-Hypervisor
>>>> and it is never going back to NestedVM.
>>>>
>>>> I am using QEMU vanilla version v7.2.0 with top-up patches for NV [1]
>>>
>>> So I went ahead and gave QEMU a go. On my systems, *nothing* works (I
>>> cannot even boot a L1 with 'virtualization=on" (the guest is stuck at
>>> the point where virtio gets probed and waits for its first interrupt).
>>>
>>> Worse, booting a hVHE guest results in QEMU generating an assert as it
>>> tries to inject an interrupt using the QEMU GICv3 model, something
>>> that should *NEVER* be in use with KVM.
>>>
>>> With help from Eric, I got to a point where the hVHE guest could boot
>>> as long as I kept injecting console interrupts, which is again a
>>> symptom of the vGIC not being used.
>>>
>>> So something is *majorly* wrong with the QEMU patches. I don't know
>>> what makes it possible for you to even boot the L1 - if the GIC is
>>> external, injecting an interrupt in the L2 is simply impossible.
>>>
>>> Miguel, can you please investigate this?
>>>
>>> In the meantime, I'll add some code to the kernel side to refuse the
>>> external interrupt controller configuration with NV. Hopefully that
>>> will lead to some clues about what is going on.
>>
>> Continued debugging of the issue and it seems the endless ptimer 
>> interrupts on Ampere platform is due to some mess up of CVAL of 
>> ptimer, resulting in interrupt triggered always when it is enabled.
>>
>> I see function "timer_set_offset" called from kvm_arm_timer_set_reg in 
>> QEMU case but there is no such calls in kvmtool boot.
>>
>> If I comment the timer_set_offset calls in kvm_arm_timer_set_reg 
>> function, then I could boot the Guest-Hypervisor then NestedVM from 
>> GH/L1.
>>
>> I also observed in QEMU case, kvm_arm_timer_set_reg is called to set 
>> CNT, CVAL and CTL of both vtimer and ptimer.
>> Not sure why QEMU is setting these registers explicitly? need to dig.
>>
> 
> I don't see any direct ioctl calls to change any timer registers. Looks 
> like it is happening from the emulation code(target/arm/helper.c)?

function "write_list_to_kvmstate(target/arm/kvm.c)" is issuing ioctl to 
write timer registers. PTIMER_CNT and TIMER_CNT writes from this 
function is resulting in offsets change.

> 
>>>
>>> Thanks,
>>>
>>>     M.
>>>
>>
>> Thanks,
>> Ganapat
>>
> 
> Thanks,
> Ganapat

Thanks,
Ganapat
Marc Zyngier July 11, 2023, 12:30 p.m. UTC | #24
On Tue, 11 Jul 2023 12:56:48 +0100,
Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote:
> 
> 
> 
> On 07-07-2023 03:16 pm, Ganapatrao Kulkarni wrote:
> > 
> > 
> > On 04-07-2023 06:01 pm, Ganapatrao Kulkarni wrote:
> >> 
> >> Hi Marc,
> >> 
> >> On 29-06-2023 12:33 pm, Marc Zyngier wrote:
> >>> Hi Ganapatrao,
> >>> 
> >>> On Wed, 28 Jun 2023 07:45:55 +0100,
> >>> Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote:
> >>>> 
> >>>> 
> >>>> Hi Marc,
> >>>> 
> >>>> 
> >>>> On 15-05-2023 11:00 pm, Marc Zyngier wrote:
> >>>>> This is the 4th drop of NV support on arm64 for this year.
> >>>>> 
> >>>>> For the previous episodes, see [1].
> >>>>> 
> >>>>> What's changed:
> >>>>> 
> >>>>> - New framework to track system register traps that are reinjected in
> >>>>>     guest EL2. It is expected to replace the discrete handling we have
> >>>>>     enjoyed so far, which didn't scale at all. This has already
> >>>>> fixed a
> >>>>>     number of bugs that were hidden (a bunch of traps were never
> >>>>>     forwarded...). Still a work in progress, but this is going in the
> >>>>>     right direction.
> >>>>> 
> >>>>> - Allow the L1 hypervisor to have a S2 that has an input larger than
> >>>>>     the L0 IPA space. This fixes a number of subtle issues,
> >>>>> depending on
> >>>>>     how the initial guest was created.
> >>>>> 
> >>>>> - Consequently, the patch series has gone longer again. Boo. But
> >>>>>     hopefully some of it is easier to review...
> >>>>> 
> >>>> 
> >>>> I am facing issue in booting NestedVM with V9 as well with 10 patchset.
> >>>> 
> >>>> I have tried V9/V10 on Ampere platform using kvmtool and I could boot
> >>>> Guest-Hypervisor and then NestedVM without any issue.
> >>>> However when I try to boot using QEMU(not using EDK2/EFI),
> >>>> Guest-Hypervisor is booted with Fedora 37 using virtio disk. From
> >>>> Guest-Hypervisor console(or ssh shell), If I try to boot NestedVM,
> >>>> boot hangs very early stage of the boot.
> >>>> 
> >>>> I did some debug using ftrace and it seems the Guest-Hypervisor is
> >>>> getting very high rate of arch-timer interrupts,
> >>>> due to that all CPU time is going on in serving the Guest-Hypervisor
> >>>> and it is never going back to NestedVM.
> >>>> 
> >>>> I am using QEMU vanilla version v7.2.0 with top-up patches for NV [1]
> >>> 
> >>> So I went ahead and gave QEMU a go. On my systems, *nothing* works (I
> >>> cannot even boot a L1 with 'virtualization=on" (the guest is stuck at
> >>> the point where virtio gets probed and waits for its first interrupt).
> >>> 
> >>> Worse, booting a hVHE guest results in QEMU generating an assert as it
> >>> tries to inject an interrupt using the QEMU GICv3 model, something
> >>> that should *NEVER* be in use with KVM.
> >>> 
> >>> With help from Eric, I got to a point where the hVHE guest could boot
> >>> as long as I kept injecting console interrupts, which is again a
> >>> symptom of the vGIC not being used.
> >>> 
> >>> So something is *majorly* wrong with the QEMU patches. I don't know
> >>> what makes it possible for you to even boot the L1 - if the GIC is
> >>> external, injecting an interrupt in the L2 is simply impossible.
> >>> 
> >>> Miguel, can you please investigate this?
> >>> 
> >>> In the meantime, I'll add some code to the kernel side to refuse the
> >>> external interrupt controller configuration with NV. Hopefully that
> >>> will lead to some clues about what is going on.
> >> 
> >> Continued debugging of the issue and it seems the endless ptimer
> >> interrupts on Ampere platform is due to some mess up of CVAL of
> >> ptimer, resulting in interrupt triggered always when it is enabled.
> >> 
> >> I see function "timer_set_offset" called from kvm_arm_timer_set_reg
> >> in QEMU case but there is no such calls in kvmtool boot.
> >> 
> >> If I comment the timer_set_offset calls in kvm_arm_timer_set_reg
> >> function, then I could boot the Guest-Hypervisor then NestedVM from
> >> GH/L1.
> >> 
> >> I also observed in QEMU case, kvm_arm_timer_set_reg is called to
> >> set CNT, CVAL and CTL of both vtimer and ptimer.
> >> Not sure why QEMU is setting these registers explicitly? need to dig.
> >> 
> > 
> > I don't see any direct ioctl calls to change any timer
> > registers. Looks like it is happening from the emulation
> > code(target/arm/helper.c)?
> 
> function "write_list_to_kvmstate(target/arm/kvm.c)" is issuing ioctl
> to write timer registers. PTIMER_CNT and TIMER_CNT writes from this
> function is resulting in offsets change.

Madness. Why is QEMU doing this? It has no business writing to the
timer at any stage, at least with KVM.

This confirms my suspicions that QEMU is confused about what mode it
runs in, and does all sort of funky things it was never expected to
do.

	M.
Miguel Luis July 18, 2023, 10:29 a.m. UTC | #25
Hi Marc,

> On 10 Jul 2023, at 12:56, Miguel Luis <miguel.luis@oracle.com> wrote:
> 
> Hi Marc,
> 
>> On 29 Jun 2023, at 07:03, Marc Zyngier <maz@kernel.org> wrote:
>> 
>> Hi Ganapatrao,
>> 
>> On Wed, 28 Jun 2023 07:45:55 +0100,
>> Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote:
>>> 
>>> 
>>> Hi Marc,
>>> 
>>> 
>>> On 15-05-2023 11:00 pm, Marc Zyngier wrote:
>>>> This is the 4th drop of NV support on arm64 for this year.
>>>> 
>>>> For the previous episodes, see [1].
>>>> 
>>>> What's changed:
>>>> 
>>>> - New framework to track system register traps that are reinjected in
>>>>  guest EL2. It is expected to replace the discrete handling we have
>>>>  enjoyed so far, which didn't scale at all. This has already fixed a
>>>>  number of bugs that were hidden (a bunch of traps were never
>>>>  forwarded...). Still a work in progress, but this is going in the
>>>>  right direction.
>>>> 
>>>> - Allow the L1 hypervisor to have a S2 that has an input larger than
>>>>  the L0 IPA space. This fixes a number of subtle issues, depending on
>>>>  how the initial guest was created.
>>>> 
>>>> - Consequently, the patch series has gone longer again. Boo. But
>>>>  hopefully some of it is easier to review...
>>>> 
>>> 
>>> I am facing issue in booting NestedVM with V9 as well with 10 patchset.
>>> 
>>> I have tried V9/V10 on Ampere platform using kvmtool and I could boot
>>> Guest-Hypervisor and then NestedVM without any issue.
>>> However when I try to boot using QEMU(not using EDK2/EFI),
>>> Guest-Hypervisor is booted with Fedora 37 using virtio disk. From
>>> Guest-Hypervisor console(or ssh shell), If I try to boot NestedVM,
>>> boot hangs very early stage of the boot.
>>> 
>>> I did some debug using ftrace and it seems the Guest-Hypervisor is
>>> getting very high rate of arch-timer interrupts,
>>> due to that all CPU time is going on in serving the Guest-Hypervisor
>>> and it is never going back to NestedVM.
>>> 
>>> I am using QEMU vanilla version v7.2.0 with top-up patches for NV [1]
>> 
>> So I went ahead and gave QEMU a go. On my systems, *nothing* works (I
>> cannot even boot a L1 with 'virtualization=on" (the guest is stuck at
>> the point where virtio gets probed and waits for its first interrupt).

In order to use the previous patches you need to update the linux headers
of QEMU according to the target kernel you’re testing. So you would want to run
./scripts/update-linux-headers.sh <kernel src dir> <qemu src dir> in the place of
patch 1 then you should be able to boot the L1 guest with virtualization=on.

Regarding the L2 guest, it does not boot and I’m in the process of
understanding why. The previous patches had some improvements to make
but I couldn’t relate them to not booting the L2 guest, yet.

Eric stated an issue with NV and SVE enablement which I’m still looking at.
( which is similar to the commit
5b578f088ada3c4319f7274c0221b5d92143fe6a in your kvmtool branch arm64/nv-5.16 )

As a test I’ve disabled SVE on the kernel side with 'arm64.nosve’ and
the output matches the one from your kvmtool:

[    0.000000] CPU features: SYS_ID_AA64PFR0_EL1[35:32]: already set to 0
[    0.000000] CPU features: SYS_ID_AA64ZFR0_EL1[59:56]: already set to 0
[    0.000000] CPU features: SYS_ID_AA64ZFR0_EL1[55:52]: already set to 0
[    0.000000] CPU features: SYS_ID_AA64ZFR0_EL1[47:44]: already set to 0
[    0.000000] CPU features: SYS_ID_AA64ZFR0_EL1[43:40]: already set to 0
[    0.000000] CPU features: SYS_ID_AA64ZFR0_EL1[35:32]: already set to 0
[    0.000000] CPU features: SYS_ID_AA64ZFR0_EL1[23:20]: already set to 0
[    0.000000] CPU features: SYS_ID_AA64ZFR0_EL1[19:16]: already set to 0
[    0.000000] CPU features: SYS_ID_AA64ZFR0_EL1[7:4]: already set to 0
[    0.000000] CPU features: SYS_ID_AA64ZFR0_EL1[3:0]: already set to 0

That didn’t enabled the L2 guest to boot on QEMU so the issue feels still in a grey area.

As a baseline I tested your kvmtool branch for 5.16 after updating the includes,
and as I expected both L1 and L2 guests boot.

Miguel

>> 
>> Worse, booting a hVHE guest results in QEMU generating an assert as it
>> tries to inject an interrupt using the QEMU GICv3 model, something
>> that should *NEVER* be in use with KVM.
>> 
>> With help from Eric, I got to a point where the hVHE guest could boot
>> as long as I kept injecting console interrupts, which is again a
>> symptom of the vGIC not being used.
>> 
>> So something is *majorly* wrong with the QEMU patches. I don't know
>> what makes it possible for you to even boot the L1 - if the GIC is
>> external, injecting an interrupt in the L2 is simply impossible.
>> 
>> Miguel, can you please investigate this?
> 
> Yes, I will investigate it. Sorry for the delay in replying as I took a break
> short after KVM forum and I’ve just started to sync up.
> 
> Thanks,
> Miguel
> 
>> 
>> In the meantime, I'll add some code to the kernel side to refuse the
>> external interrupt controller configuration with NV. Hopefully that
>> will lead to some clues about what is going on.
>> 
>> Thanks,
>> 
>> M.
>> 
>> -- 
>> Without deviation from the norm, progress is not possible.