[v3,00/14] arm64: Support for running as a guest in Arm CCA

Message ID 20240605093006.145492-1-steven.price@arm.com (mailing list archive)

Message

Steven Price June 5, 2024, 9:29 a.m. UTC
This series adds support for running Linux in a protected VM under the
Arm Confidential Compute Architecture (CCA). This has been updated
following the feedback from the v2 posting[1]. Thanks for the feedback!
Individual patches have a change log for v3.

The biggest change from v2 is fixing set_memory_{en,de}crypted() to
perform a break-before-make sequence. Note that only the virtual address
supplied is flipped between shared and protected, so if, for example, a
vmalloc() address is passed, the linear map will still point to the (now
invalid) previous IPA. Attempts to access the wrong address may trigger a
Synchronous External Abort. However, any code which attempts to access
the 'encrypted' alias after set_memory_decrypted() is already likely to
be broken on platforms that implement memory encryption, so I don't
expect problems.
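For clarity, the expected usage pattern is the linear-map one sketched
below. This is illustrative only (not code from the series), but
set_memory_decrypted() has the usual <linux/set_memory.h> signature:

  #include <linux/gfp.h>
  #include <linux/set_memory.h>

  static void *alloc_shared_buffer(size_t size)
  {
          unsigned int order = get_order(size);
          struct page *page = alloc_pages(GFP_KERNEL | __GFP_ZERO, order);
          void *va;

          if (!page)
                  return NULL;

          va = page_address(page);        /* linear-map alias */
          /* Only this alias is switched to the shared IPA */
          if (set_memory_decrypted((unsigned long)va, 1 << order)) {
                  __free_pages(page, order);
                  return NULL;
          }
          return va;      /* access the buffer only through 'va' from here on */
  }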

The ABI to the RMM from a realm (the RSI) is based on the final RMM v1.0
(EAC 5) specification[2]. Future RMM specifications will be backwards
compatible so a guest using the v1.0 specification (i.e. this series)
will be able to run on future versions of the RMM without modification.

Arm plans to set up a CI system to perform, at a minimum, boot testing
of Linux as a guest within a realm.

This series is based on v6.10-rc1. It is also available as a git
repository:

https://gitlab.arm.com/linux-arm/linux-cca cca-guest/v3

This series (the guest side) should be in a good state, so please review
with the intention that this could be merged soon. The host side (KVM
changes) is likely to require some more iteration and I'll post that as
a separate series shortly - note that there is no tie between the two
series (i.e. you can mix and match v2 and v3 postings of the host and guest).

Introduction (unchanged from v2 posting)
============
A more general introduction to Arm CCA is available on the Arm
website[3], and links to the other components involved are available in
the overall cover letter.

Arm Confidential Compute Architecture adds two new 'worlds' to the
architecture: Root and Realm. A new software component known as the RMM
(Realm Management Monitor) runs in Realm EL2 and is trusted by both the
Normal World and VMs running within Realms. This enables mutual
distrust between the Realm VMs and the Normal World.

Virtual machines running within a Realm can decide, at a (4k) page
granularity, whether to share a page with the (Normal World) host or to
keep it private (protected). This protection is provided by the
hardware, and attempts by the Normal World to access a page which isn't
shared will trigger a Granule Protection Fault.

Realm VMs can communicate with the RMM via another SMC interface known
as RSI (Realm Services Interface). This series adds wrappers for the
full set of RSI commands and uses them to manage the Realm IPA State
(RIPAS) and to discover the configuration of the realm.
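Each wrapper is a thin SMCCC call, roughly along the lines of the sketch
below (the function ID value and error handling here are illustrative;
the real definitions live in the rsi_smc.h/rsi_cmds.h patches):

  #include <linux/arm-smccc.h>
  #include <linux/errno.h>
  #include <linux/types.h>

  #define SMC_RSI_REALM_CONFIG    0xC4000196      /* illustrative value */

  static int rsi_get_realm_config(phys_addr_t cfg_pa)
  {
          struct arm_smccc_res res;

          /* cfg_pa: granule-aligned buffer in protected memory */
          arm_smccc_smc(SMC_RSI_REALM_CONFIG, cfg_pa, 0, 0, 0, 0, 0, 0, &res);

          return res.a0 == 0 ? 0 : -EIO;          /* RSI_SUCCESS is 0 */
  }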

The VM running within the Realm needs to ensure that memory it is
going to use is marked as 'RIPAS_RAM' (i.e. protected memory accessible
only to the guest). This could be done by the VMM (and subject to
measurement to ensure it is set up correctly) or the VM can set it
itself. This series includes a patch which will iterate over all
described RAM and set the RIPAS. This is a relatively cheap operation
and doesn't require memory donation from the host; instead, memory can
be dynamically provided by the host on fault. An alternative would be to
update booting.rst and state this as a requirement, but that would
reduce the flexibility of the VMM to manage the memory available to the
guest (as the initial RIPAS state is part of the guest's measurement).
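Concretely, the patch does something like the sketch below early in boot
(rsi_set_addr_range_ram() stands in for whatever helper the patch
actually adds):

  #include <linux/memblock.h>

  static void __init arm64_rsi_init_ram(void)
  {
          phys_addr_t start, end;
          u64 i;

          /*
           * Everything the memory map describes as RAM becomes RIPAS_RAM;
           * the host still only has to provide the pages on demand.
           */
          for_each_mem_range(i, &start, &end)
                  rsi_set_addr_range_ram(start, end);  /* hypothetical helper */
  }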

Within the Realm the most-significant active bit of the IPA is used to
select whether the access is to protected memory or to memory shared
with the host. This series treats this bit as if it were an attribute
bit in the page tables and modifies it when sharing/unsharing memory
with the host.

This top bit usage also necessitates that the IPA width is made more
dynamic in the guest. The VMM will choose a width (and therefore which
bit controls the shared flag) and the guest must be able to identify
this bit to mask it out when necessary. PHYS_MASK_SHIFT/PHYS_MASK are
therefore made dynamic.
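Conceptually the shared-bit handling reduces to the sketch below;
prot_ns_shared stands for the kind of runtime-discovered value the
series introduces (the names and placement are illustrative, not lifted
from the patches):

  /* Bit chosen by the VMM, discovered via RSI at boot, e.g. BIT(ipa_width - 1) */
  static unsigned long prot_ns_shared __ro_after_init;

  static inline pte_t pte_set_shared(pte_t pte)
  {
          return __pte(pte_val(pte) | prot_ns_shared);    /* share with host */
  }

  static inline pte_t pte_clear_shared(pte_t pte)
  {
          return __pte(pte_val(pte) & ~prot_ns_shared);   /* back to protected */
  }

PHYS_MASK is then derived from the discovered IPA width so that the
shared bit never leaks into a physical address.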

To allow virtio to communicate with the host, the shared buffers must be
placed in memory which has this top IPA bit set. This is achieved by
implementing the set_memory_{encrypted,decrypted} APIs for arm64 and
forcing the use of bounce buffers. For now all device access is
considered to require the memory to be shared; at this stage there is
no support for real devices to be assigned to a realm guest - obviously
if device assignment is added this will have to change.
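In practice bounce buffering is forced whenever we are running in a
realm, roughly like the mem_init() fragment below (is_realm_world() is a
placeholder for whatever predicate the series actually uses):

  /* arch/arm64/mm/init.c: mem_init(), illustrative fragment */
  unsigned int flags = SWIOTLB_VERBOSE;
  bool swiotlb = max_pfn > PFN_DOWN(arm64_dma_phys_limit);

  if (is_realm_world()) {                 /* placeholder predicate */
          swiotlb = true;
          flags |= SWIOTLB_FORCE;         /* bounce all DMA via shared memory */
  }
  swiotlb_init(swiotlb, flags);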

Finally, the GIC is (largely) emulated by the (untrusted) host. The RMM
provides some management (including register save/restore) but the
ITS buffers must be placed into shared memory for the host to emulate.
There is likely to be future work to harden the GIC driver against a
malicious host (along with any other drivers used within a Realm guest).

[1] https://lore.kernel.org/r/20240412084213.1733764-1-steven.price%40arm.com
[2] https://developer.arm.com/documentation/den0137/1-0eac5/
[3] https://www.arm.com/architecture/security-features/arm-confidential-compute-architecture

Sami Mujawar (2):
  arm64: rsi: Interfaces to query attestation token
  virt: arm-cca-guest: TSM_REPORT support for realms

Steven Price (5):
  arm64: realm: Query IPA size from the RMM
  arm64: Mark all I/O as non-secure shared
  arm64: Make the PHYS_MASK_SHIFT dynamic
  arm64: Enforce bounce buffers for realm DMA
  arm64: realm: Support nonsecure ITS emulation shared

Suzuki K Poulose (7):
  arm64: rsi: Add RSI definitions
  arm64: Detect if in a realm and set RIPAS RAM
  fixmap: Allow architecture overriding set_fixmap_io
  arm64: Override set_fixmap_io
  arm64: Enable memory encrypt for Realms
  arm64: Force device mappings to be non-secure shared
  efi: arm64: Map Device with Prot Shared

 arch/arm64/Kconfig                            |   3 +
 arch/arm64/include/asm/fixmap.h               |   4 +-
 arch/arm64/include/asm/io.h                   |   6 +-
 arch/arm64/include/asm/mem_encrypt.h          |  17 ++
 arch/arm64/include/asm/pgtable-hwdef.h        |   6 -
 arch/arm64/include/asm/pgtable-prot.h         |   3 +
 arch/arm64/include/asm/pgtable.h              |   7 +-
 arch/arm64/include/asm/rsi.h                  |  48 ++++
 arch/arm64/include/asm/rsi_cmds.h             | 143 ++++++++++++
 arch/arm64/include/asm/rsi_smc.h              | 142 ++++++++++++
 arch/arm64/include/asm/set_memory.h           |   3 +
 arch/arm64/kernel/Makefile                    |   3 +-
 arch/arm64/kernel/efi.c                       |   2 +-
 arch/arm64/kernel/rsi.c                       |  96 ++++++++
 arch/arm64/kernel/setup.c                     |   8 +
 arch/arm64/mm/init.c                          |  10 +-
 arch/arm64/mm/mmu.c                           |  13 ++
 arch/arm64/mm/pageattr.c                      |  65 +++++-
 drivers/irqchip/irq-gic-v3-its.c              |  90 ++++++--
 drivers/virt/coco/Kconfig                     |   2 +
 drivers/virt/coco/Makefile                    |   1 +
 drivers/virt/coco/arm-cca-guest/Kconfig       |  11 +
 drivers/virt/coco/arm-cca-guest/Makefile      |   2 +
 .../virt/coco/arm-cca-guest/arm-cca-guest.c   | 211 ++++++++++++++++++
 include/asm-generic/fixmap.h                  |   2 +
 25 files changed, 858 insertions(+), 40 deletions(-)
 create mode 100644 arch/arm64/include/asm/mem_encrypt.h
 create mode 100644 arch/arm64/include/asm/rsi.h
 create mode 100644 arch/arm64/include/asm/rsi_cmds.h
 create mode 100644 arch/arm64/include/asm/rsi_smc.h
 create mode 100644 arch/arm64/kernel/rsi.c
 create mode 100644 drivers/virt/coco/arm-cca-guest/Kconfig
 create mode 100644 drivers/virt/coco/arm-cca-guest/Makefile
 create mode 100644 drivers/virt/coco/arm-cca-guest/arm-cca-guest.c

Comments

Itaru Kitayama June 5, 2024, 8:37 a.m. UTC | #1
Hi Steven,
On Wed, Jun 05, 2024 at 10:29:52AM +0100, Steven Price wrote:
> [...]

The v3 guest built with clang booted fine on FVP backed by v2 host kernel.

Tested-by: Itaru Kitayama <itaru.kitayama@fujitsu.com>

Thanks,
Itaru.

Steven Price June 6, 2024, 9:03 a.m. UTC | #2
On 05/06/2024 09:37, Itaru Kitayama wrote:
> Hi Steven,
> On Wed, Jun 05, 2024 at 10:29:52AM +0100, Steven Price wrote:
>> [...]
> 
> The v3 guest built with clang booted fine on FVP backed by v2 host kernel.
> 
> Tested-by: Itaru Kitayama <itaru.kitayama@fujitsu.com>

Thanks for testing!

Steve
Michael Kelley June 7, 2024, 1:38 a.m. UTC | #3
From: Steven Price <steven.price@arm.com> Sent: Wednesday, June 5, 2024 2:30 AM
> 
> This series adds support for running Linux in a protected VM under the
> Arm Confidential Compute Architecture (CCA). This has been updated
> following the feedback from the v2 posting[1]. Thanks for the feedback!
> Individual patches have a change log for v3.
> 
> The biggest change from v2 is fixing set_memory_{en,de}crypted() to
> perform a break-before-make sequence. Note that only the virtual address
> supplied is flipped between shared and protected, so if e.g. a vmalloc()
> address is passed the linear map will still point to the (now invalid)
> previous IPA. Attempts to access the wrong address may trigger a
> Synchronous External Abort. However any code which attempts to access
> the 'encrypted' alias after set_memory_decrypted() is already likely to
> be broken on platforms that implement memory encryption, so I don't
> expect problems.

In the case of a vmalloc() address, load_unaligned_zeropad() could still
make an access to the underlying pages through the linear address. In
CoCo guests on x86, both the vmalloc PTE and the linear map PTE are
flipped, so the load_unaligned_zeropad() problem can occur only during
the transition between decrypted and encrypted. But even then, the
exception handlers have code to fixup this case and allow everything to
proceed normally.

I haven't looked at the code in your patches, but do you handle that case,
or somehow prevent it?

Thanks,

Michael
Catalin Marinas June 7, 2024, 3:12 p.m. UTC | #4
On Fri, Jun 07, 2024 at 01:38:15AM +0000, Michael Kelley wrote:
> From: Steven Price <steven.price@arm.com> Sent: Wednesday, June 5, 2024 2:30 AM
> > This series adds support for running Linux in a protected VM under the
> > Arm Confidential Compute Architecture (CCA). This has been updated
> > following the feedback from the v2 posting[1]. Thanks for the feedback!
> > Individual patches have a change log for v3.
> > 
> > The biggest change from v2 is fixing set_memory_{en,de}crypted() to
> > perform a break-before-make sequence. Note that only the virtual address
> > supplied is flipped between shared and protected, so if e.g. a vmalloc()
> > address is passed the linear map will still point to the (now invalid)
> > previous IPA. Attempts to access the wrong address may trigger a
> > Synchronous External Abort. However any code which attempts to access
> > the 'encrypted' alias after set_memory_decrypted() is already likely to
> > be broken on platforms that implement memory encryption, so I don't
> > expect problems.
> 
> In the case of a vmalloc() address, load_unaligned_zeropad() could still
> make an access to the underlying pages through the linear address. In
> CoCo guests on x86, both the vmalloc PTE and the linear map PTE are
> flipped, so the load_unaligned_zeropad() problem can occur only during
> the transition between decrypted and encrypted. But even then, the
> exception handlers have code to fixup this case and allow everything to
> proceed normally.
> 
> I haven't looked at the code in your patches, but do you handle that case,
> or somehow prevent it?

If we can guarantee that only a full vm_struct area is changed at a
time, the vmap guard page would prevent this issue (not sure we can
though). Otherwise I think we either change the set_memory_*() code to
deal with the other mappings or we handle the exception.

We also have potential user mappings, do we need to do anything about
them?
Michael Kelley June 7, 2024, 4:36 p.m. UTC | #5
From: Catalin Marinas <catalin.marinas@arm.com> Sent: Friday, June 7, 2024 8:13 AM
> 
> On Fri, Jun 07, 2024 at 01:38:15AM +0000, Michael Kelley wrote:
> > From: Steven Price <steven.price@arm.com> Sent: Wednesday, June 5, 2024 2:30 AM
> > > This series adds support for running Linux in a protected VM under the
> > > Arm Confidential Compute Architecture (CCA). This has been updated
> > > following the feedback from the v2 posting[1]. Thanks for the feedback!
> > > Individual patches have a change log for v3.
> > >
> > > The biggest change from v2 is fixing set_memory_{en,de}crypted() to
> > > perform a break-before-make sequence. Note that only the virtual address
> > > supplied is flipped between shared and protected, so if e.g. a vmalloc()
> > > address is passed the linear map will still point to the (now invalid)
> > > previous IPA. Attempts to access the wrong address may trigger a
> > > Synchronous External Abort. However any code which attempts to access
> > > the 'encrypted' alias after set_memory_decrypted() is already likely to
> > > be broken on platforms that implement memory encryption, so I don't
> > > expect problems.
> >
> > In the case of a vmalloc() address, load_unaligned_zeropad() could still
> > make an access to the underlying pages through the linear address. In
> > CoCo guests on x86, both the vmalloc PTE and the linear map PTE are
> > flipped, so the load_unaligned_zeropad() problem can occur only during
> > the transition between decrypted and encrypted. But even then, the
> > exception handlers have code to fixup this case and allow everything to
> > proceed normally.
> >
> > I haven't looked at the code in your patches, but do you handle that case,
> > or somehow prevent it?
> 
> If we can guarantee that only full a vm_struct area is changed at a
> time, the vmap guard page would prevent this issue (not sure we can
> though). Otherwise I think we either change the set_memory_*() code to
> deal with the other mappings or we handle the exception.

I don't think the vmap guard pages help. The vmalloc() memory consists
of individual pages that are scattered throughout the direct map. The stray
reference from load_unaligned_zeropad() will originate in a kmalloc'ed
memory page that precedes one of these scattered individual pages, and
will use a direct map kernel vaddr. So the guard pages in vmalloc space don't
come into play. At least in the Hyper-V use case, an entire vmalloc allocation
*is* flipped as a unit, so the guard pages do prevent a stray reference from
load_unaligned_zeropad() that originates in vmalloc space. At one
point I looked to see if load_unaligned_zeropad() is ever used on vmalloc
addresses.  I think the answer was "no",  making the guard page question
moot, but I'm not sure. :-(
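For anyone not familiar with it, the pattern in question is word-at-a-time
string handling roughly like this (simplified):

  #include <asm/word-at-a-time.h>

  /*
   * Loads 8 bytes in one go, possibly reading past the end of the string
   * and across a page boundary. If that crossing faults, the exception
   * fixup falls back to an aligned load and zero-pads the bytes from the
   * faulting page -- and that fixup is exactly what gets tricky when the
   * next page has been flipped protected/shared behind our back.
   */
  static unsigned long first_word(const char *name)
  {
          return load_unaligned_zeropad(name);
  }

So the stray access is always a read just past a legitimate buffer.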

Another thought: The use of load_unaligned_zeropad() is conditional on
CONFIG_DCACHE_WORD_ACCESS. There are #ifdef'ed alternate
implementations that don't use load_unaligned_zeropad() if it is not
enabled. I looked at just disabling it in CoCo VMs, but I don't know the
performance impact. I speculated that the benefits were more noticeable
in processors from a decade or more ago, and perhaps less so now, but
never did any measurements. There was also a snag in that x86-only
code has a usage of load_unaligned_zeropad() without an alternate
implementation, so I never went fully down that path. But arm64 would
probably "just work" if it were disabled.

> 
> We also have potential user mappings, do we need to do anything about
> them?

I'm unclear on the scenario here.  Would memory with a user mapping
ever be flipped between decrypted and encrypted while the user mapping
existed? I don't recall being concerned about user mappings, so maybe I
had ruled out that scenario. On x86, flipping between decrypted and
encrypted may effectively change the contents of the memory, so doing
a flip while mapped into user space seems problematic. But maybe I'm
missing something.

Michael
Catalin Marinas June 10, 2024, 10:34 a.m. UTC | #6
On Fri, Jun 07, 2024 at 04:36:18PM +0000, Michael Kelley wrote:
> From: Catalin Marinas <catalin.marinas@arm.com> Sent: Friday, June 7, 2024 8:13 AM
> > On Fri, Jun 07, 2024 at 01:38:15AM +0000, Michael Kelley wrote:
> > > In the case of a vmalloc() address, load_unaligned_zeropad() could still
> > > make an access to the underlying pages through the linear address. In
> > > CoCo guests on x86, both the vmalloc PTE and the linear map PTE are
> > > flipped, so the load_unaligned_zeropad() problem can occur only during
> > > the transition between decrypted and encrypted. But even then, the
> > > exception handlers have code to fixup this case and allow everything to
> > > proceed normally.
> > >
> > > I haven't looked at the code in your patches, but do you handle that case,
> > > or somehow prevent it?
> > 
> > If we can guarantee that only full a vm_struct area is changed at a
> > time, the vmap guard page would prevent this issue (not sure we can
> > though). Otherwise I think we either change the set_memory_*() code to
> > deal with the other mappings or we handle the exception.
> 
> I don't think the vmap guard pages help. The vmalloc() memory consists
> of individual pages that are scattered throughout the direct map. The stray
> reference from load_unaligned_zeropad() will originate in a kmalloc'ed
> memory page that precedes one of these scattered individual pages, and
> will use a direct map kernel vaddr.  So the guard page in vmalloc space don't
> come into play. At least in the Hyper-V use case, an entire vmalloc allocation
> *is* flipped as a unit, so the guard pages do prevent a stray reference from
> load_unaligned_zeropad() that originates in vmalloc space. At one
> point I looked to see if load_unaligned_zeropad() is ever used on vmalloc
> addresses.  I think the answer was "no",  making the guard page question
> moot, but I'm not sure. :-(

My point was about load_unaligned_zeropad() originating in the vmalloc
space. What I had in mind is changing the underlying linear map via
set_memory_*() while we have live vmalloc() mappings. But I forgot about
the case you mentioned in a previous thread: set_memory_*() being called
on vmalloc()'ed memory directly:

https://lore.kernel.org/r/SN6PR02MB41578D7BFEDE33BD2E8246EFD4E92@SN6PR02MB4157.namprd02.prod.outlook.com/

I wonder whether something like __GFP_DECRYPTED could be used to get
shared memory from the allocation time and avoid having to change the
vmalloc() ranges. This way functions like netvsc_init_buf() would get
decrypted memory from the start and vmbus_establish_gpadl() would not
need to call set_memory_decrypted() on a vmalloc() address.
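Something like the below on the driver side, i.e. entirely hypothetical
since neither the flag nor this exact call exists today:

  /* netvsc_init_buf(), hypothetical: buffer is shared from allocation */
  net_device->recv_buf = __vmalloc(buf_size,
                                   GFP_KERNEL | __GFP_ZERO | __GFP_DECRYPTED);
  if (!net_device->recv_buf)
          return -ENOMEM;

  /*
   * ... and vmbus_establish_gpadl() would no longer need to call
   * set_memory_decrypted() on the vmalloc address.
   */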

> Another thought: The use of load_unaligned_zeropad() is conditional on
> CONFIG_DCACHE_WORD_ACCESS. There are #ifdef'ed alternate
> implementations that don't use load_unaligned_zeropad() if it is not
> enabled. I looked at just disabling it in CoCo VMs, but I don't know the
> performance impact. I speculated that the benefits were more noticeable
> in processors from a decade or more ago, and perhaps less so now, but
> never did any measurements. There was also a snag in that x86-only
> code has a usage of load_unaligned_zeropad() without an alternate
> implementation, so I never went fully down that path. But arm64 would
> probably "just work" if it were disabled.

We shouldn't penalise the performance, especially as I expect a single
image to run both as a guest or a host. However, I think now the linear
map is handled correctly since we make the PTE invalid before making the
page shared and this would force the fault path through the one that
safely handles load_unaligned_zeropad(). Steven's patches also currently
reject non-linear-map addresses, I guess this would be a separate
addition.

> > We also have potential user mappings, do we need to do anything about
> > them?
> 
> I'm unclear on the scenario here.  Would memory with a user mapping
> ever be flipped between decrypted and encrypted while the user mapping
> existed? 

Maybe it doesn't matter. Do we expect the underlying pages to be
flipped while live mappings other than the linear map exist? I assume
not; one would first allocate and configure the memory in the kernel
before some remap_pfn_range() to user with the appropriate pgprot.
Michael Kelley June 10, 2024, 5:03 p.m. UTC | #7
From: Catalin Marinas <catalin.marinas@arm.com> Sent: Monday, June 10, 2024 3:34 AM
> 
> On Fri, Jun 07, 2024 at 04:36:18PM +0000, Michael Kelley wrote:
> > From: Catalin Marinas <catalin.marinas@arm.com> Sent: Friday, June 7, 2024 8:13 AM
> > > On Fri, Jun 07, 2024 at 01:38:15AM +0000, Michael Kelley wrote:
> > > > In the case of a vmalloc() address, load_unaligned_zeropad() could still
> > > > make an access to the underlying pages through the linear address. In
> > > > CoCo guests on x86, both the vmalloc PTE and the linear map PTE are
> > > > flipped, so the load_unaligned_zeropad() problem can occur only during
> > > > the transition between decrypted and encrypted. But even then, the
> > > > exception handlers have code to fixup this case and allow everything to
> > > > proceed normally.
> > > >
> > > > I haven't looked at the code in your patches, but do you handle that case,
> > > > or somehow prevent it?
> > >
> > > If we can guarantee that only full a vm_struct area is changed at a
> > > time, the vmap guard page would prevent this issue (not sure we can
> > > though). Otherwise I think we either change the set_memory_*() code to
> > > deal with the other mappings or we handle the exception.
> >
> > I don't think the vmap guard pages help. The vmalloc() memory consists
> > of individual pages that are scattered throughout the direct map. The stray
> > reference from load_unaligned_zeropad() will originate in a kmalloc'ed
> > memory page that precedes one of these scattered individual pages, and
> > will use a direct map kernel vaddr.  So the guard page in vmalloc space don't
> > come into play. At least in the Hyper-V use case, an entire vmalloc allocation
> > *is* flipped as a unit, so the guard pages do prevent a stray reference from
> > load_unaligned_zeropad() that originates in vmalloc space. At one
> > point I looked to see if load_unaligned_zeropad() is ever used on vmalloc
> > addresses.  I think the answer was "no",  making the guard page question
> > moot, but I'm not sure. :-(
> 
> My point was about load_unaligned_zeropad() originating in the vmalloc
> space. What I had in mind is changing the underlying linear map via
> set_memory_*() while we have live vmalloc() mappings. But I forgot about
> the case you mentioned in a previous thread: set_memory_*() being called
> on vmalloc()'ed memory directly:
> 
> https://lore.kernel.org/all/SN6PR02MB41578D7BFEDE33BD2E8246EFD4E92@SN6PR02MB4157.namprd02.prod.outlook.com/
> 

OK, right.  You and I were thinking about different cases.

> I wonder whether something like __GFP_DECRYPTED could be used to get
> shared memory from the allocation time and avoid having to change the
> vmalloc() ranges. This way functions like netvsc_init_buf() would get
> decrypted memory from the start and vmbus_establish_gpadl() would not
> need to call set_memory_decrypted() on a vmalloc() address.

I would not have any conceptual objections to such an approach. But I'm
certainly not an expert in that area so I'm not sure what it would take
to make that work for vmalloc(). I presume that __GFP_DECRYPTED
should also work for kmalloc()?

I've seen the separate discussion about a designated pool of decrypted
memory, to avoid always allocating a new page and decrypting when a
smaller allocation is sufficient. If such a pool could also work for page size
or larger allocations, it would have the additional benefit of concentrating
decrypted allocations in fewer 2 Meg large pages vs. scattering wherever
and forcing the break-up of more large page mappings in the direct map.

I'll note that netvsc devices can be added or removed from a running VM.
The vmalloc() memory allocated by netvsc_init_buf() can be freed, and/or
additional calls to netvsc_init_buf() can be made at any time -- they aren't
limited to initial Linux boot.  So the mechanism for getting decrypted
memory at allocation time must be reasonably dynamic.

> 
> > Another thought: The use of load_unaligned_zeropad() is conditional on
> > CONFIG_DCACHE_WORD_ACCESS. There are #ifdef'ed alternate
> > implementations that don't use load_unaligned_zeropad() if it is not
> > enabled. I looked at just disabling it in CoCo VMs, but I don't know the
> > performance impact. I speculated that the benefits were more noticeable
> > in processors from a decade or more ago, and perhaps less so now, but
> > never did any measurements. There was also a snag in that x86-only
> > code has a usage of load_unaligned_zeropad() without an alternate
> > implementation, so I never went fully down that path. But arm64 would
> > probably "just work" if it were disabled.
> 
> We shouldn't penalise the performance, especially as I expect a single
> image to run both as a guest or a host. However, I think now the linear
> map is handled correctly since we make the PTE invalid before making the
> page shared and this would force the fault path through the one that
> safely handles load_unaligned_zeropad(). Steven's patches also currently
> reject non-linear-map addresses, I guess this would be a separate
> addition.

Rejecting vmalloc() addresses may work for the moment -- I don't know
when CCA guests might be tried on Hyper-V.  The original SEV-SNP and TDX
work started that way as well. :-) Handling the vmalloc() case was added
later, though I think on x86 the machinery to also flip all the alias PTEs was
already mostly or completely in place, probably for other reasons. So
fixing the vmalloc() case was more about not assuming that the underlying
physical address range is contiguous. Instead, each page must be processed
independently, which was straightforward.

> 
> > > We also have potential user mappings, do we need to do anything about
> > > them?
> >
> > I'm unclear on the scenario here.  Would memory with a user mapping
> > ever be flipped between decrypted and encrypted while the user mapping
> > existed?
> 
> Maybe it doesn't matter. Do we expect it the underlying pages to be
> flipped while live mappings other than the linear map exist? I assume
> not, one would first allocate and configure the memory in the kernel
> before some remap_pfn_range() to user with the appropriate pgprot.

Yes, for user space mappings I also assume not.

Michael
Catalin Marinas June 10, 2024, 5:46 p.m. UTC | #8
On Mon, Jun 10, 2024 at 05:03:44PM +0000, Michael Kelley wrote:
> From: Catalin Marinas <catalin.marinas@arm.com> Sent: Monday, June 10, 2024 3:34 AM
> > I wonder whether something like __GFP_DECRYPTED could be used to get
> > shared memory from the allocation time and avoid having to change the
> > vmalloc() ranges. This way functions like netvsc_init_buf() would get
> > decrypted memory from the start and vmbus_establish_gpadl() would not
> > need to call set_memory_decrypted() on a vmalloc() address.
> 
> I would not have any conceptual objections to such an approach. But I'm
> certainly not an expert in that area so I'm not sure what it would take
> to make that work for vmalloc(). I presume that __GFP_DECRYPTED
> should also work for kmalloc()?
> 
> I've seen the separate discussion about a designated pool of decrypted
> memory, to avoid always allocating a new page and decrypting when a
> smaller allocation is sufficient. If such a pool could also work for page size
> or larger allocations, it would have the additional benefit of concentrating
> decrypted allocations in fewer 2 Meg large pages vs. scattering wherever
> and forcing the break-up of more large page mappings in the direct map.

Yeah, my quick, not fully tested hack here:

https://lore.kernel.org/linux-arm-kernel/ZmNJdSxSz-sYpVgI@arm.com/

It's the underlying page allocator that gives back decrypted pages when
the flag is passed, so it should work for alloc_pages() and friends. The
kmalloc() changes only ensure that we have separate caches for this
memory and they are not merged. It needs some more work on kmem_cache,
maybe introducing a SLAB_DECRYPTED flag as well, so as not to rely on the
GFP flag.

For vmalloc(), we'd need a pgprot_decrypted() macro to ensure the
decrypted pages are marked with the appropriate attributes (arch
specific), otherwise it's fairly easy to wire up if alloc_pages() gives
back decrypted memory.

> I'll note that netvsc devices can be added or removed from a running VM.
> The vmalloc() memory allocated by netvsc_init_buf() can be freed, and/or
> additional calls to netvsc_init_buf() can be made at any time -- they aren't
> limited to initial Linux boot.  So the mechanism for getting decrypted
> memory at allocation time must be reasonably dynamic.

I think the above should work. But, of course, we'd have to get this
past the mm maintainers, it's likely that I missed something.

> Rejecting vmalloc() addresses may work for the moment -- I don't know
> when CCA guests might be tried on Hyper-V.  The original SEV-SNP and TDX
> work started that way as well. :-) Handling the vmalloc() case was added
> later, though I think on x86 the machinery to also flip all the alias PTEs was
> already mostly or completely in place, probably for other reasons. So
> fixing the vmalloc() case was more about not assuming that the underlying
> physical address range is contiguous. Instead, each page must be processed
> independently, which was straightforward.

There may be a slight performance impact but I guess that's not on a
critical path. Walking the page tables and changing the vmalloc ptes
should be fine, but for each page we'd have to break the linear map,
flush the TLBs, and re-create the linear map. Those TLB flushes may
become a bottleneck, especially on hardware with lots of CPUs, depending
on the microarchitecture. Note that even with a __GFP_DECRYPTED attribute, we'd
still need to go for individual pages in the linear map.
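Hand-waving a bit, per page it would look something like the below (not
tested, and rsi_set_memory_shared() is made up):

  struct page *page = vmalloc_to_page(addr);
  unsigned long lm = (unsigned long)page_address(page);

  /* break-before-make on the linear map alias */
  set_direct_map_invalid_noflush(page);
  flush_tlb_kernel_range(lm, lm + PAGE_SIZE);

  /* rsi_set_memory_shared(page_to_phys(page), PAGE_SIZE); -- hypothetical */

  /* re-create the linear map alias; the real code needs the shared prot here */
  set_direct_map_default_noflush(page);

i.e. one TLB maintenance operation per page.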
Michael Kelley June 17, 2024, 4:06 a.m. UTC | #9
From: Catalin Marinas <catalin.marinas@arm.com> Sent: Monday, June 10, 2024 10:46 AM
> 
> On Mon, Jun 10, 2024 at 05:03:44PM +0000, Michael Kelley wrote:
> > From: Catalin Marinas <catalin.marinas@arm.com> Sent: Monday, June 10, 2024 3:34 AM
> > > I wonder whether something like __GFP_DECRYPTED could be used to get
> > > shared memory from the allocation time and avoid having to change the
> > > vmalloc() ranges. This way functions like netvsc_init_buf() would get
> > > decrypted memory from the start and vmbus_establish_gpadl() would not
> > > need to call set_memory_decrypted() on a vmalloc() address.
> >
> > I would not have any conceptual objections to such an approach. But I'm
> > certainly not an expert in that area so I'm not sure what it would take
> > to make that work for vmalloc(). I presume that __GFP_DECRYPTED
> > should also work for kmalloc()?
> >
> > I've seen the separate discussion about a designated pool of decrypted
> > memory, to avoid always allocating a new page and decrypting when a
> > smaller allocation is sufficient. If such a pool could also work for page size
> > or larger allocations, it would have the additional benefit of concentrating
> > decrypted allocations in fewer 2 Meg large pages vs. scattering wherever
> > and forcing the break-up of more large page mappings in the direct map.
> 
> Yeah, my quick, not fully tested hack here:
> 
> https://lore.kernel.org/linux-arm-kernel/ZmNJdSxSz-sYpVgI@arm.com/ 
> 
> It's the underlying page allocator that gives back decrypted pages when
> the flag is passed, so it should work for alloc_pages() and friends. The
> kmalloc() changes only ensure that we have separate caches for this
> memory and they are not merged. It needs some more work on kmem_cache,
> maybe introducing a SLAB_DECRYPTED flag as well as not to rely on the
> GFP flag.
> 
> For vmalloc(), we'd need a pgprot_decrypted() macro to ensure the
> decrypted pages are marked with the appropriate attributes (arch
> specific), otherwise it's fairly easy to wire up if alloc_pages() gives
> back decrypted memory.
> 
> > I'll note that netvsc devices can be added or removed from a running VM.
> > The vmalloc() memory allocated by netvsc_init_buf() can be freed, and/or
> > additional calls to netvsc_init_buf() can be made at any time -- they aren't
> > limited to initial Linux boot.  So the mechanism for getting decrypted
> > memory at allocation time must be reasonably dynamic.
> 
> I think the above should work. But, of course, we'd have to get this
> past the mm maintainers, it's likely that I missed something.

Having thought about this for a few days, I like the model of telling the
memory allocators to decrypt/re-encrypt the memory, instead of the
caller having to explicitly do set_memory_decrypted()/encrypted().
I'll add some further comments to the thread with your initial
implementation.

> 
> > Rejecting vmalloc() addresses may work for the moment -- I don't know
> > when CCA guests might be tried on Hyper-V.  The original SEV-SNP and TDX
> > work started that way as well. :-) Handling the vmalloc() case was added
> > later, though I think on x86 the machinery to also flip all the alias PTEs was
> > already mostly or completely in place, probably for other reasons. So
> > fixing the vmalloc() case was more about not assuming that the underlying
> > physical address range is contiguous. Instead, each page must be processed
> > independently, which was straightforward.
> 
> There may be a slight performance impact but I guess that's not on a
> critical path. Walking the page tables and changing the vmalloc ptes
> should be fine but for each page, we'd have to break the linear map,
> flush the TLBs, re-create the linear map. Those TLBs may become a
> bottleneck, especially on hardware with lots of CPUs and the
> microarchitecture. Note that even with a __GFP_DECRYPTED attribute, we'd
> still need to go for individual pages in the linear map.

Agreed. While synthetic devices can come and go at any time, it's pretty
rare in the grand scheme of things. I guess we would have to try it on
a system with high CPU count, but even if the code needed to "pace
itself" to avoid hammering the TLBs too hard, that would be OK.

Michael