Message ID: 20200421184150.68011-1-andraprs@amazon.com (mailing list archive)
Series: Add support for Nitro Enclaves
On 21/04/20 20:41, Andra Paraschiv wrote:
> An enclave communicates with the primary VM via a local communication channel,
> using virtio-vsock [2]. An enclave does not have a disk or a network device
> attached.

Is it possible to have a sample of this in the samples/ directory? I am interested especially in:

- the initial CPU state: CPL0 vs. CPL3, initial program counter, etc.

- the communication channel; does the enclave see the usual local APIC and IOAPIC interfaces in order to get interrupts from virtio-vsock, and where is the virtio-vsock device (virtio-mmio I suppose) placed in memory?

- what the enclave is allowed to do: can it change privilege levels, what happens if the enclave performs an access to nonexistent memory, etc.

- whether there are special hypercall interfaces for the enclave

> The proposed solution is following the KVM model and uses the KVM API to be able
> to create and set resources for enclaves. An additional ioctl command, besides
> the ones provided by KVM, is used to start an enclave and setup the addressing
> for the communication channel and an enclave unique id.

Reusing some KVM ioctls is definitely a good idea, but I wouldn't really say it's the KVM API since the VCPU file descriptor is basically non-functional (without KVM_RUN and mmap it's not really the KVM API).

Paolo
On 22/04/2020 00:46, Paolo Bonzini wrote:
> On 21/04/20 20:41, Andra Paraschiv wrote:
>> An enclave communicates with the primary VM via a local communication channel,
>> using virtio-vsock [2]. An enclave does not have a disk or a network device
>> attached.
> Is it possible to have a sample of this in the samples/ directory?

I can add in v2 a sample file including the basic flow of how to use the ioctl interface to create / terminate an enclave. Then we can update / build on top of it based on the ongoing discussions on the patch series and the received feedback.

> I am interested especially in:
>
> - the initial CPU state: CPL0 vs. CPL3, initial program counter, etc.
>
> - the communication channel; does the enclave see the usual local APIC
> and IOAPIC interfaces in order to get interrupts from virtio-vsock, and
> where is the virtio-vsock device (virtio-mmio I suppose) placed in memory?
>
> - what the enclave is allowed to do: can it change privilege levels,
> what happens if the enclave performs an access to nonexistent memory, etc.
>
> - whether there are special hypercall interfaces for the enclave

An enclave is a VM, running on the same host as the primary VM that launched the enclave. They are siblings.

Here we need to think of two components:

1. An enclave abstraction process - a process running in the primary VM guest, that uses the provided ioctl interface of the Nitro Enclaves kernel driver to spawn an enclave VM (that's 2 below).

How does all of this get to an enclave VM running on the host?

There is a Nitro Enclaves emulated PCI device exposed to the primary VM. The driver for this new PCI device is included in the current patch series.

The ioctl logic is mapped to PCI device commands e.g. the NE_ENCLAVE_START ioctl maps to an enclave start PCI command, or KVM_SET_USER_MEMORY_REGION maps to an add memory PCI command. The PCI device commands are then translated into actions taken on the hypervisor side; that's the Nitro hypervisor running on the host where the primary VM is running.

2. The enclave itself - a VM running on the same host as the primary VM that spawned it.

The enclave VM has no persistent storage or network interface attached; it uses its own memory and CPUs + its virtio-vsock emulated device for communication with the primary VM.

The memory and CPUs are carved out of the primary VM; they are dedicated for the enclave. The Nitro hypervisor running on the host ensures memory and CPU isolation between the primary VM and the enclave VM.

These two components need to reflect the same state e.g. when the enclave abstraction process (1) is terminated, the enclave VM (2) is terminated as well.

With regard to the communication channel, the primary VM has its own emulated virtio-vsock PCI device. The enclave VM has its own emulated virtio-vsock device as well. This channel is used, for example, to fetch data in the enclave and then process it. An application that sets up the vsock socket and connects or listens, depending on the use case, is then developed to use this channel; this happens on both ends - primary VM and enclave VM.

Let me know if further clarifications are needed.

>> The proposed solution is following the KVM model and uses the KVM API to be able
>> to create and set resources for enclaves. An additional ioctl command, besides
>> the ones provided by KVM, is used to start an enclave and setup the addressing
>> for the communication channel and an enclave unique id.
> Reusing some KVM ioctls is definitely a good idea, but I wouldn't really
> say it's the KVM API since the VCPU file descriptor is basically non
> functional (without KVM_RUN and mmap it's not really the KVM API).

It uses part of the KVM API or a set of KVM ioctls to model the way a VM is created / terminated. That's true, KVM_RUN and mmap-ing the vcpu fd are not included.

Thanks for the feedback regarding the reuse of KVM ioctls.

Andra

Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
On 23/04/20 15:19, Paraschiv, Andra-Irina wrote:
> 2. The enclave itself - a VM running on the same host as the primary VM
> that spawned it.
>
> The enclave VM has no persistent storage or network interface attached,
> it uses its own memory and CPUs + its virtio-vsock emulated device for
> communication with the primary VM.
>
> The memory and CPUs are carved out of the primary VM, they are dedicated
> for the enclave. The Nitro hypervisor running on the host ensures memory
> and CPU isolation between the primary VM and the enclave VM.
>
> These two components need to reflect the same state e.g. when the
> enclave abstraction process (1) is terminated, the enclave VM (2) is
> terminated as well.
>
> With regard to the communication channel, the primary VM has its own
> emulated virtio-vsock PCI device. The enclave VM has its own emulated
> virtio-vsock device as well. This channel is used, for example, to fetch
> data in the enclave and then process it. An application that sets up the
> vsock socket and connects or listens, depending on the use case, is then
> developed to use this channel; this happens on both ends - primary VM
> and enclave VM.
>
> Let me know if further clarifications are needed.

Thanks, this is all useful. However, can you please clarify the low-level details here?

>> - the initial CPU state: CPL0 vs. CPL3, initial program counter, etc.
>> - the communication channel; does the enclave see the usual local APIC
>> and IOAPIC interfaces in order to get interrupts from virtio-vsock, and
>> where is the virtio-vsock device (virtio-mmio I suppose) placed in
>> memory?
>> - what the enclave is allowed to do: can it change privilege levels,
>> what happens if the enclave performs an access to nonexistent memory,
>> etc.
>> - whether there are special hypercall interfaces for the enclave

Thanks,

Paolo
On 23/04/2020 16:42, Paolo Bonzini wrote:
> On 23/04/20 15:19, Paraschiv, Andra-Irina wrote:
> [...]
>> Let me know if further clarifications are needed.
> Thanks, this is all useful. However can you please clarify the
> low-level details here?
>
>>> - the initial CPU state: CPL0 vs. CPL3, initial program counter, etc.

The enclave VM has its own kernel and follows the well-known Linux boot protocol, in the end getting to the user application after init finishes its work, so that's CPL3.

>>> - the communication channel; does the enclave see the usual local APIC
>>> and IOAPIC interfaces in order to get interrupts from virtio-vsock, and
>>> where is the virtio-vsock device (virtio-mmio I suppose) placed in
>>> memory?

vsock is using eventfd for signalling; wrt the enclave VM, it sees the usual interfaces to get interrupts from the virtio dev. It's placed below the typical 4GB; in general, it may depend on the architecture.

>>> - what the enclave is allowed to do: can it change privilege levels,
>>> what happens if the enclave performs an access to nonexistent memory,
>>> etc.

If talking about the enclave abstraction process, it is running in the primary VM as a user space process, so it will get into the primary VM guest kernel if privileged instructions need to be executed. The same happens with the user space application running in the enclave VM. And the VM itself will get to the hypervisor running on the host for privileged instructions. The Nitro hypervisor is based on core KVM technology. Access to nonexistent memory results in faults.

>>> - whether there are special hypercall interfaces for the enclave

The path towards creating / setting resources / terminating an enclave (here referring to the enclave VM) is via the ioctl interface, with the corresponding misc device, and the emulated PCI device. That's the interface used to manage enclaves. Once booted, the enclave resources setup is not modified anymore. And the way to communicate with the enclave after booting, with the application running in the enclave, is via the vsock comm channel.

Thanks,

Andra
On 23/04/20 19:42, Paraschiv, Andra-Irina wrote:
>>>> - the initial CPU state: CPL0 vs. CPL3, initial program counter, etc.
>
> The enclave VM has its own kernel and follows the well-known Linux boot
> protocol, in the end getting to the user application after init finishes
> its work, so that's CPL3.

CPL3 is how the user application runs, but does the enclave's Linux boot process start in real mode at the reset vector (0xfffffff0), in 16-bit protected mode at the Linux bzImage entry point, or at the ELF entry point?

Paolo
On 23.04.20 19:51, Paolo Bonzini wrote:
> On 23/04/20 19:42, Paraschiv, Andra-Irina wrote:
>>>>> - the initial CPU state: CPL0 vs. CPL3, initial program counter, etc.
>>
>> The enclave VM has its own kernel and follows the well-known Linux boot
>> protocol, in the end getting to the user application after init finishes
>> its work, so that's CPL3.
>
> CPL3 is how the user application runs, but does the enclave's Linux boot
> process start in real mode at the reset vector (0xfffffff0), in 16-bit
> protected mode at the Linux bzImage entry point, or at the ELF entry point?

There is no "entry point" per se. You prepopulate a target bzImage into the enclave memory on boot, which then follows the standard boot protocol. Everything before that (enclave firmware, etc.) is provided by the enclave environment.

Think of it like a mechanism to launch a second QEMU instance on the host, but all you can actually control are the -smp, -m, -kernel and -initrd parameters. The only I/O channel you have between your VM and that new VM is a vsock channel which is configured by the host on your behalf.

Alex

Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
On 23/04/20 22:56, Alexander Graf wrote:
>> CPL3 is how the user application runs, but does the enclave's Linux boot
>> process start in real mode at the reset vector (0xfffffff0), in 16-bit
>> protected mode at the Linux bzImage entry point, or at the ELF entry
>> point?
>
> There is no "entry point" per se. You prepopulate a target bzImage into
> the enclave memory on boot which then follows the standard boot
> protocol. Everything

There's still a "where" missing in that sentence. :) I assume you put it at 0x10000 (and so the entry point at 0x10200)? That should be documented because that is absolutely not what the KVM API looks like.

> before that (enclave firmware, etc.) is provided by
> the enclave environment.
>
> Think of it like a mechanism to launch a second QEMU instance on the
> host, but all you can actually control are the -smp, -m, -kernel and
> -initrd parameters.

Are there requirements on how to populate the memory to ensure that the host firmware doesn't crash and burn? E.g. some free memory right below 4GiB (for the firmware, the LAPIC/IOAPIC or any other special MMIO devices you have, PCI BARs, and the like)?

> The only I/O channel you have between your VM and
> that new VM is a vsock channel which is configured by the host on your
> behalf.

Is this virtio-mmio or virtio-pci, and what other emulated devices are there and how do you discover them? Are there any ISA devices (RTC/PIC/PIT), and are there SMBIOS/RSDP/MP tables in the F segment?

Thanks,

Paolo
On 2020/4/23 21:19, Paraschiv, Andra-Irina wrote:
> [...]
> There is a Nitro Enclaves emulated PCI device exposed to the primary VM. The
> driver for this new PCI device is included in the current patch series.

Hi Paraschiv,

The new PCI device is emulated in QEMU? If so, is there any plan to send the QEMU code?

> [...]
On 24/04/2020 06:04, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
> [...]
> Hi Paraschiv,
>
> The new PCI device is emulated in QEMU? If so, is there any plan to send
> the QEMU code?

Hi,

Nope, not that I know of so far.

Thanks,

Andra
On 24/04/2020 11:19, Paraschiv, Andra-Irina wrote:
> On 24/04/2020 06:04, Longpeng (Mike, Cloud Infrastructure Service
> Product Dept.) wrote:
>> The new PCI device is emulated in QEMU? If so, is there any plan to
>> send the QEMU code?
> Nope, not that I know of so far.

And just to be a bit more clear, the reply above takes into consideration that it's not emulated in QEMU.

Thanks,

Andra
> From: Paraschiv, Andra-Irina
> Sent: Thursday, April 23, 2020 9:20 PM
> [...]
> 2. The enclave itself - a VM running on the same host as the primary VM
> that spawned it.
>
> The enclave VM has no persistent storage or network interface attached,
> it uses its own memory and CPUs + its virtio-vsock emulated device for
> communication with the primary VM.

Sounds like a Firecracker VM?

> The memory and CPUs are carved out of the primary VM, they are dedicated
> for the enclave. The Nitro hypervisor running on the host ensures memory
> and CPU isolation between the primary VM and the enclave VM.

In the last paragraph, you said that the enclave VM uses its own memory and CPUs. Then here, you said the memory/CPUs are carved out of and dedicated from the primary VM. Can you elaborate on which one is accurate? Or is it a mixed model?

> With regard to the communication channel, the primary VM has its own
> emulated virtio-vsock PCI device. The enclave VM has its own emulated
> virtio-vsock device as well. This channel is used, for example, to fetch
> data in the enclave and then process it. An application that sets up the
> vsock socket and connects or listens, depending on the use case, is then
> developed to use this channel; this happens on both ends - primary VM
> and enclave VM.

How does the application in the primary VM assign a task to be executed in the enclave VM? I didn't see such a command in this series, so I suppose it is also communicated through virtio-vsock?

> [...]

Thanks
Kevin
On 23.04.20 23:18, Paolo Bonzini wrote: > > > On 23/04/20 22:56, Alexander Graf wrote: >>> >>> CPL3 is how the user application runs, but does the enclave's Linux boot >>> process start in real mode at the reset vector (0xfffffff0), in 16-bit >>> protected mode at the Linux bzImage entry point, or at the ELF entry >>> point? >> >> There is no "entry point" per se. You prepopulate a target bzImage into >> the enclave memory on boot, which then follows the standard boot >> protocol. Everything > > There's still a "where" missing in that sentence. :) I assume you put > it at 0x10000 (and so the entry point at 0x10200)? That should be > documented because that is absolutely not what the KVM API looks like. Yes, that part is not documented in the patch set, correct. I would personally just make an example user space binary the documentation for now. Later we will publish a proper device specification outside of the Linux ecosystem which will describe the register layout and image loading semantics verbatim, so that other OSs can implement the driver too. To answer the question though, the target file is in a newly invented file format called "EIF" and it needs to be loaded at offset 0x800000 of the address space donated to the enclave. 
While we do check that guest_phys_addr is contiguous, the underlying device API does not have any notion of a "guest address" - all it gets is a scatter-gather sliced bucket of memory. >> The only I/O channel you have between your VM and >> that new VM is a vsock channel which is configured by the host on your >> behalf. > > Is this virtio-mmio or virtio-pci, and what other emulated devices are > there and how do you discover them? Are there any ISA devices > (RTC/PIC/PIT), and are there SMBIOS/RSDP/MP tables in the F segment? It is virtio-mmio for the enclave and virtio-pci for the parent. The enclave is a microvm. For more details on the enclave device topology, we'll have to wait for the public documentation that describes the enclave view of the world though. I don't think that one's public quite yet. This patch set is about the parent's view. Alex Amazon Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879
On 24/04/2020 12:59, Tian, Kevin wrote: > >> From: Paraschiv, Andra-Irina >> Sent: Thursday, April 23, 2020 9:20 PM >> >> On 22/04/2020 00:46, Paolo Bonzini wrote: >>> On 21/04/20 20:41, Andra Paraschiv wrote: >>>> An enclave communicates with the primary VM via a local communication >> channel, >>>> using virtio-vsock [2]. An enclave does not have a disk or a network device >>>> attached. >>> Is it possible to have a sample of this in the samples/ directory? >> I can add in v2 a sample file including the basic flow of how to use the >> ioctl interface to create / terminate an enclave. >> >> Then we can update / build on top it based on the ongoing discussions on >> the patch series and the received feedback. >> >>> I am interested especially in: >>> >>> - the initial CPU state: CPL0 vs. CPL3, initial program counter, etc. >>> >>> - the communication channel; does the enclave see the usual local APIC >>> and IOAPIC interfaces in order to get interrupts from virtio-vsock, and >>> where is the virtio-vsock device (virtio-mmio I suppose) placed in memory? >>> >>> - what the enclave is allowed to do: can it change privilege levels, >>> what happens if the enclave performs an access to nonexistent memory, >> etc. >>> - whether there are special hypercall interfaces for the enclave >> An enclave is a VM, running on the same host as the primary VM, that >> launched the enclave. They are siblings. >> >> Here we need to think of two components: >> >> 1. An enclave abstraction process - a process running in the primary VM >> guest, that uses the provided ioctl interface of the Nitro Enclaves >> kernel driver to spawn an enclave VM (that's 2 below). >> >> How does all gets to an enclave VM running on the host? >> >> There is a Nitro Enclaves emulated PCI device exposed to the primary VM. >> The driver for this new PCI device is included in the current patch series. >> >> The ioctl logic is mapped to PCI device commands e.g. 
the >> NE_ENCLAVE_START ioctl maps to an enclave start PCI command or the >> KVM_SET_USER_MEMORY_REGION maps to an add memory PCI command. >> The PCI >> device commands are then translated into actions taken on the hypervisor >> side; that's the Nitro hypervisor running on the host where the primary >> VM is running. >> >> 2. The enclave itself - a VM running on the same host as the primary VM >> that spawned it. >> >> The enclave VM has no persistent storage or network interface attached, >> it uses its own memory and CPUs + its virtio-vsock emulated device for >> communication with the primary VM. > sounds like a firecracker VM? It's a VM crafted for enclave needs. > >> The memory and CPUs are carved out of the primary VM, they are dedicated >> for the enclave. The Nitro hypervisor running on the host ensures memory >> and CPU isolation between the primary VM and the enclave VM. > In last paragraph, you said that the enclave VM uses its own memory and > CPUs. Then here, you said the memory/CPUs are carved out and dedicated > from the primary VM. Can you elaborate which one is accurate? or a mixed > model? Memory and CPUs are carved out of the primary VM and are dedicated for the enclave VM. I mentioned above as "its own" in the sense that the primary VM doesn't use these carved out resources while the enclave is running, as they are dedicated to the enclave. Hope that now it's more clear. > >> >> These two components need to reflect the same state e.g. when the >> enclave abstraction process (1) is terminated, the enclave VM (2) is >> terminated as well. >> >> With regard to the communication channel, the primary VM has its own >> emulated virtio-vsock PCI device. The enclave VM has its own emulated >> virtio-vsock device as well. This channel is used, for example, to fetch >> data in the enclave and then process it. 
An application that sets up the >> vsock socket and connects or listens, depending on the use case, is then >> developed to use this channel; this happens on both ends - primary VM >> and enclave VM. > How does the application in the primary VM assign task to be executed > in the enclave VM? I didn't see such command in this series, so suppose > it is also communicated through virtio-vsock? The application that runs in the enclave needs to be packaged in an enclave image together with the OS ( e.g. kernel, ramdisk, init ) that will run in the enclave VM. Then the enclave image is loaded in memory. After booting is finished, the application starts. Now, depending on the app implementation and use case, one example can be that the app in the enclave waits for data to be fetched in via the vsock channel. Thanks, Andra > >> Let me know if further clarifications are needed. >> >>>> The proposed solution is following the KVM model and uses the KVM API >> to be able >>>> to create and set resources for enclaves. An additional ioctl command, >> besides >>>> the ones provided by KVM, is used to start an enclave and setup the >> addressing >>>> for the communication channel and an enclave unique id. >>> Reusing some KVM ioctls is definitely a good idea, but I wouldn't really >>> say it's the KVM API since the VCPU file descriptor is basically non >>> functional (without KVM_RUN and mmap it's not really the KVM API). >> It uses part of the KVM API or a set of KVM ioctls to model the way a VM >> is created / terminated. That's true, KVM_RUN and mmap-ing the vcpu fd >> are not included. >> >> Thanks for the feedback regarding the reuse of KVM ioctls. >> >> Andra >> > Thanks > Kevin Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
On 24/04/20 14:56, Alexander Graf wrote: > Yes, that part is not documented in the patch set, correct. I would > personally just make an example user space binary the documentation for > now. Later we will publish a proper device specification outside of the > Linux ecosystem which will describe the register layout and image > loading semantics in verbatim, so that other OSs can implement the > driver too. But this is not part of the device specification, it's part of the child enclave view. And in my opinion, understanding the way the child enclave is programmed is very important to understand if Linux should at all support this new device. > To answer the question though, the target file is in a newly invented > file format called "EIF" and it needs to be loaded at offset 0x800000 of > the address space donated to the enclave. What is this EIF? * a new Linux kernel format? If so, are there patches in flight to compile Linux in this new format (and I would be surprised if they were accepted, since we already have PVH as a standard way to boot uncompressed Linux kernels)? * a userspace binary (the CPL3 that Andra was referring to)? In that case what is the rationale to prefer it over a statically linked ELF binary? * something completely different like WebAssembly? Again, I cannot provide a sensible review without explaining how to use all this. I understand that Amazon needs to do part of the design behind closed doors, but this seems to have resulted in issues that remind me of Intel's SGX misadventures. If Amazon has designed NE in a way that is incompatible with open standards, it's up to Amazon to fix it for the patches to be accepted. I'm very saddened to have to say this, because I do love the idea. Thanks, Paolo
On 24.04.20 18:27, Paolo Bonzini wrote: > > On 24/04/20 14:56, Alexander Graf wrote: >> Yes, that part is not documented in the patch set, correct. I would >> personally just make an example user space binary the documentation for >> now. Later we will publish a proper device specification outside of the >> Linux ecosystem which will describe the register layout and image >> loading semantics in verbatim, so that other OSs can implement the >> driver too. > > But this is not part of the device specification, it's part of the child > enclave view. And in my opinion, understanding the way the child > enclave is programmed is very important to understand if Linux should at > all support this new device. Oh, absolutely. All of the "how do I load an enclave image, run it and interact with it" bits need to be explained. What I was saying above is that maybe code is easier to transfer that than a .txt file that gets lost somewhere in the Documentation directory :). I'm more than happy to hear of other suggestions though. > >> To answer the question though, the target file is in a newly invented >> file format called "EIF" and it needs to be loaded at offset 0x800000 of >> the address space donated to the enclave. > > What is this EIF? It's just a very dumb container format that has a trivial header, a section with the bzImage and one to many sections of initramfs. As mentioned earlier in this thread, it really is just "-kernel" and "-initrd", packed into a single binary for transmission to the host. > > * a new Linux kernel format? If so, are there patches in flight to > compile Linux in this new format (and I would be surprised if they were > accepted, since we already have PVH as a standard way to boot > uncompressed Linux kernels)? > > * a userspace binary (the CPL3 that Andra was referring to)? In that > case what is the rationale to prefer it over a statically linked ELF binary? > > * something completely different like WebAssembly? 
> > Again, I cannot provide a sensible review without explaining how to use > all this. I understand that Amazon needs to do part of the design > behind closed doors, but this seems to have the resulted in issues that > reminds me of Intel's SGX misadventures. If Amazon has designed NE in a > way that is incompatible with open standards, it's up to Amazon to fix Oh, if there's anything that conflicts with open standards here, I would love to hear it immediately. I do not believe in security by obscurity :). Alex
On 23/04/2020 16:19, Paraschiv, Andra-Irina wrote: > > The memory and CPUs are carved out of the primary VM, they are > dedicated for the enclave. The Nitro hypervisor running on the host > ensures memory and CPU isolation between the primary VM and the > enclave VM. I hope you properly take into consideration Hyper-Threading speculative side-channel vulnerabilities here. i.e. Usually cloud providers designate each CPU core to be assigned to run only vCPUs of specific guest. To avoid sharing a single CPU core between multiple guests. To handle this properly, you need to use some kind of core-scheduling mechanism (Such that each CPU core either runs only vCPUs of enclave or only vCPUs of primary VM at any given point in time). In addition, can you elaborate more on how the enclave memory is carved out of the primary VM? Does this involve performing a memory hot-unplug operation from primary VM or just unmap enclave-assigned guest physical pages from primary VM's SLAT (EPT/NPT) and map them now only in enclave's SLAT? > > Let me know if further clarifications are needed. > I don't quite understand why Enclave VM needs to be provisioned/teardown during primary VM's runtime. For example, an alternative could have been to just provision both primary VM and Enclave VM on primary VM startup. Then, wait for primary VM to setup a communication channel with Enclave VM (E.g. via virtio-vsock). Then, primary VM is free to request Enclave VM to perform various tasks when required on the isolated environment. Such setup will mimic a common Enclave setup. Such as Microsoft Windows VBS EPT-based Enclaves (That all runs on VTL1). It is also similar to TEEs running on ARM TrustZone. i.e. In my alternative proposed solution, the Enclave VM is similar to VTL1/TrustZone. It will also avoid requiring introducing a new PCI device and driver. -Liran
On 24/04/20 21:11, Alexander Graf wrote: > What I was saying above is that maybe code is easier to transfer that > than a .txt file that gets lost somewhere in the Documentation directory > :). whynotboth.jpg :D >>> To answer the question though, the target file is in a newly invented >>> file format called "EIF" and it needs to be loaded at offset 0x800000 of >>> the address space donated to the enclave. >> >> What is this EIF? > > It's just a very dumb container format that has a trivial header, a > section with the bzImage and one to many sections of initramfs. > > As mentioned earlier in this thread, it really is just "-kernel" and > "-initrd", packed into a single binary for transmission to the host. Okay, got it. So, correct me if this is wrong, the information that is needed to boot the enclave is: * the kernel, in bzImage format * the initrd * a consecutive amount of memory, to be mapped with KVM_SET_USER_MEMORY_REGION Off list, Alex and I discussed having a struct that points to kernel and initrd off enclave memory, and have the driver build EIF at the appropriate point in enclave memory (the 8 MiB offset that you mentioned). This however has two disadvantages: 1) having the kernel and initrd loaded by the parent VM in enclave memory has the advantage that you save memory outside the enclave memory for something that is only needed inside the enclave 2) it is less extensible (what if you want to use PVH in the future for example) and puts policy in the driver that should be in userspace. So why not just start running the enclave at 0xfffffff0 in real mode? Yes everybody hates it, but that's what OSes are written against. In the simplest example, the parent VM can load bzImage and initrd at 0x10000 and place firmware tables (MPTable and DMI) somewhere at 0xf0000; the firmware would just be a few movs to segment registers followed by a long jmp. 
If you want to keep EIF, we measured in QEMU that there is no measurable difference between loading the kernel in the host and doing it in the guest, so Amazon could provide an EIF loader stub at 0xfffffff0 for backwards compatibility. >> Again, I cannot provide a sensible review without explaining how to use >> all this. I understand that Amazon needs to do part of the design >> behind closed doors, but this seems to have the resulted in issues that >> reminds me of Intel's SGX misadventures. If Amazon has designed NE in a >> way that is incompatible with open standards, it's up to Amazon to fix > > Oh, if there's anything that conflicts with open standards here, I would > love to hear it immediately. I do not believe in security by obscurity :). That's great to hear! Paolo
On 2020/4/24 17:54, Paraschiv, Andra-Irina wrote: > > > On 24/04/2020 11:19, Paraschiv, Andra-Irina wrote: >> >> >> On 24/04/2020 06:04, Longpeng (Mike, Cloud Infrastructure Service Product >> Dept.) wrote: >>> On 2020/4/23 21:19, Paraschiv, Andra-Irina wrote: >>>> >>>> On 22/04/2020 00:46, Paolo Bonzini wrote: >>>>> On 21/04/20 20:41, Andra Paraschiv wrote: >>>>>> An enclave communicates with the primary VM via a local communication >>>>>> channel, >>>>>> using virtio-vsock [2]. An enclave does not have a disk or a network device >>>>>> attached. >>>>> Is it possible to have a sample of this in the samples/ directory? >>>> I can add in v2 a sample file including the basic flow of how to use the ioctl >>>> interface to create / terminate an enclave. >>>> >>>> Then we can update / build on top it based on the ongoing discussions on the >>>> patch series and the received feedback. >>>> >>>>> I am interested especially in: >>>>> >>>>> - the initial CPU state: CPL0 vs. CPL3, initial program counter, etc. >>>>> >>>>> - the communication channel; does the enclave see the usual local APIC >>>>> and IOAPIC interfaces in order to get interrupts from virtio-vsock, and >>>>> where is the virtio-vsock device (virtio-mmio I suppose) placed in memory? >>>>> >>>>> - what the enclave is allowed to do: can it change privilege levels, >>>>> what happens if the enclave performs an access to nonexistent memory, etc. >>>>> >>>>> - whether there are special hypercall interfaces for the enclave >>>> An enclave is a VM, running on the same host as the primary VM, that launched >>>> the enclave. They are siblings. >>>> >>>> Here we need to think of two components: >>>> >>>> 1. An enclave abstraction process - a process running in the primary VM guest, >>>> that uses the provided ioctl interface of the Nitro Enclaves kernel driver to >>>> spawn an enclave VM (that's 2 below). >>>> >>>> How does all gets to an enclave VM running on the host? 
>>>> >>>> There is a Nitro Enclaves emulated PCI device exposed to the primary VM. The >>>> driver for this new PCI device is included in the current patch series. >>>> >>> Hi Paraschiv, >>> >>> The new PCI device is emulated in QEMU ? If so, is there any plan to send the >>> QEMU code ? >> >> Hi, >> >> Nope, not that I know of so far. > > And just to be a bit more clear, the reply above takes into consideration that > it's not emulated in QEMU. > Thanks. Guys in this thread are much more interested in the design of the enclave VM and the new device, but there's no document about this device yet, so I think the emulation code is a good alternative. However, Alex said the device specification will be published later, so I'll wait for it. > > Thanks, > Andra > >> >>> >>>> The ioctl logic is mapped to PCI device commands e.g. the NE_ENCLAVE_START >>>> ioctl >>>> maps to an enclave start PCI command or the KVM_SET_USER_MEMORY_REGION maps to >>>> an add memory PCI command. The PCI device commands are then translated into >>>> actions taken on the hypervisor side; that's the Nitro hypervisor running on >>>> the >>>> host where the primary VM is running. >>>> >>>> 2. The enclave itself - a VM running on the same host as the primary VM that >>>> spawned it. >>>> >>>> The enclave VM has no persistent storage or network interface attached, it uses >>>> its own memory and CPUs + its virtio-vsock emulated device for communication >>>> with the primary VM. >>>> >>>> The memory and CPUs are carved out of the primary VM, they are dedicated for >>>> the >>>> enclave. The Nitro hypervisor running on the host ensures memory and CPU >>>> isolation between the primary VM and the enclave VM. >>>> >>>> >>>> These two components need to reflect the same state e.g. when the enclave >>>> abstraction process (1) is terminated, the enclave VM (2) is terminated as >>>> well. >>>> >>>> With regard to the communication channel, the primary VM has its own emulated >>>> virtio-vsock PCI device. 
The enclave VM has its own emulated virtio-vsock >>>> device >>>> as well. This channel is used, for example, to fetch data in the enclave and >>>> then process it. An application that sets up the vsock socket and connects or >>>> listens, depending on the use case, is then developed to use this channel; this >>>> happens on both ends - primary VM and enclave VM. >>>> >>>> Let me know if further clarifications are needed. >>>> >>>>>> The proposed solution is following the KVM model and uses the KVM API to >>>>>> be able >>>>>> to create and set resources for enclaves. An additional ioctl command, >>>>>> besides >>>>>> the ones provided by KVM, is used to start an enclave and setup the >>>>>> addressing >>>>>> for the communication channel and an enclave unique id. >>>>> Reusing some KVM ioctls is definitely a good idea, but I wouldn't really >>>>> say it's the KVM API since the VCPU file descriptor is basically non >>>>> functional (without KVM_RUN and mmap it's not really the KVM API). >>>> It uses part of the KVM API or a set of KVM ioctls to model the way a VM is >>>> created / terminated. That's true, KVM_RUN and mmap-ing the vcpu fd are not >>>> included. >>>> >>>> Thanks for the feedback regarding the reuse of KVM ioctls. >>>> >>>> Andra --- Regards, Longpeng(Mike)
> From: Paraschiv, Andra-Irina <andraprs@amazon.com> > Sent: Friday, April 24, 2020 9:59 PM > > > On 24/04/2020 12:59, Tian, Kevin wrote: > > > >> From: Paraschiv, Andra-Irina > >> Sent: Thursday, April 23, 2020 9:20 PM > >> > >> On 22/04/2020 00:46, Paolo Bonzini wrote: > >>> On 21/04/20 20:41, Andra Paraschiv wrote: > >>>> An enclave communicates with the primary VM via a local > communication > >> channel, > >>>> using virtio-vsock [2]. An enclave does not have a disk or a network > device > >>>> attached. > >>> Is it possible to have a sample of this in the samples/ directory? > >> I can add in v2 a sample file including the basic flow of how to use the > >> ioctl interface to create / terminate an enclave. > >> > >> Then we can update / build on top it based on the ongoing discussions on > >> the patch series and the received feedback. > >> > >>> I am interested especially in: > >>> > >>> - the initial CPU state: CPL0 vs. CPL3, initial program counter, etc. > >>> > >>> - the communication channel; does the enclave see the usual local APIC > >>> and IOAPIC interfaces in order to get interrupts from virtio-vsock, and > >>> where is the virtio-vsock device (virtio-mmio I suppose) placed in > memory? > >>> > >>> - what the enclave is allowed to do: can it change privilege levels, > >>> what happens if the enclave performs an access to nonexistent memory, > >> etc. > >>> - whether there are special hypercall interfaces for the enclave > >> An enclave is a VM, running on the same host as the primary VM, that > >> launched the enclave. They are siblings. > >> > >> Here we need to think of two components: > >> > >> 1. An enclave abstraction process - a process running in the primary VM > >> guest, that uses the provided ioctl interface of the Nitro Enclaves > >> kernel driver to spawn an enclave VM (that's 2 below). > >> > >> How does all gets to an enclave VM running on the host? > >> > >> There is a Nitro Enclaves emulated PCI device exposed to the primary VM. 
> >> The driver for this new PCI device is included in the current patch series. > >> > >> The ioctl logic is mapped to PCI device commands e.g. the > >> NE_ENCLAVE_START ioctl maps to an enclave start PCI command or the > >> KVM_SET_USER_MEMORY_REGION maps to an add memory PCI > command. > >> The PCI > >> device commands are then translated into actions taken on the hypervisor > >> side; that's the Nitro hypervisor running on the host where the primary > >> VM is running. > >> > >> 2. The enclave itself - a VM running on the same host as the primary VM > >> that spawned it. > >> > >> The enclave VM has no persistent storage or network interface attached, > >> it uses its own memory and CPUs + its virtio-vsock emulated device for > >> communication with the primary VM. > > sounds like a firecracker VM? > > It's a VM crafted for enclave needs. > > > > >> The memory and CPUs are carved out of the primary VM, they are > dedicated > >> for the enclave. The Nitro hypervisor running on the host ensures memory > >> and CPU isolation between the primary VM and the enclave VM. > > In last paragraph, you said that the enclave VM uses its own memory and > > CPUs. Then here, you said the memory/CPUs are carved out and dedicated > > from the primary VM. Can you elaborate which one is accurate? or a mixed > > model? > > Memory and CPUs are carved out of the primary VM and are dedicated for > the enclave VM. I mentioned above as "its own" in the sense that the > primary VM doesn't use these carved out resources while the enclave is > running, as they are dedicated to the enclave. > > Hope that now it's more clear. yes, it's clearer. > > > > >> > >> These two components need to reflect the same state e.g. when the > >> enclave abstraction process (1) is terminated, the enclave VM (2) is > >> terminated as well. > >> > >> With regard to the communication channel, the primary VM has its own > >> emulated virtio-vsock PCI device. 
The enclave VM has its own emulated > >> virtio-vsock device as well. This channel is used, for example, to fetch > >> data in the enclave and then process it. An application that sets up the > >> vsock socket and connects or listens, depending on the use case, is then > >> developed to use this channel; this happens on both ends - primary VM > >> and enclave VM. > > How does the application in the primary VM assign task to be executed > > in the enclave VM? I didn't see such command in this series, so suppose > > it is also communicated through virtio-vsock? > > The application that runs in the enclave needs to be packaged in an > enclave image together with the OS ( e.g. kernel, ramdisk, init ) that > will run in the enclave VM. > > Then the enclave image is loaded in memory. After booting is finished, > the application starts. Now, depending on the app implementation and use > case, one example can be that the app in the enclave waits for data to > be fetched in via the vsock channel. > OK, I thought the code/data was dynamically injected from the primary VM and then run in the enclave. From your description it sounds like a servicing model where an auto-running application waits for and responds to service requests from the application in the primary VM. Thanks Kevin
On 25/04/2020 18:25, Liran Alon wrote: > > On 23/04/2020 16:19, Paraschiv, Andra-Irina wrote: >> >> The memory and CPUs are carved out of the primary VM, they are >> dedicated for the enclave. The Nitro hypervisor running on the host >> ensures memory and CPU isolation between the primary VM and the >> enclave VM. > I hope you properly take into consideration Hyper-Threading > speculative side-channel vulnerabilities here. > i.e. Usually cloud providers designate each CPU core to be assigned to > run only vCPUs of specific guest. To avoid sharing a single CPU core > between multiple guests. > To handle this properly, you need to use some kind of core-scheduling > mechanism (Such that each CPU core either runs only vCPUs of enclave > or only vCPUs of primary VM at any given point in time). > > In addition, can you elaborate more on how the enclave memory is > carved out of the primary VM? > Does this involve performing a memory hot-unplug operation from > primary VM or just unmap enclave-assigned guest physical pages from > primary VM's SLAT (EPT/NPT) and map them now only in enclave's SLAT? Correct, we take into consideration the HT setup. The enclave gets dedicated physical cores. The primary VM and the enclave VM don't run on CPU siblings of a physical core. Regarding the memory carve out, the logic includes page table entries handling. IIRC, memory hot-unplug can be used for the memory blocks that were previously hot-plugged. https://www.kernel.org/doc/html/latest/admin-guide/mm/memory-hotplug.html > >> >> Let me know if further clarifications are needed. >> > I don't quite understand why Enclave VM needs to be > provisioned/teardown during primary VM's runtime. > > For example, an alternative could have been to just provision both > primary VM and Enclave VM on primary VM startup. > Then, wait for primary VM to setup a communication channel with > Enclave VM (E.g. via virtio-vsock). 
> Then, primary VM is free to request Enclave VM to perform various > tasks when required on the isolated environment. > > Such setup will mimic a common Enclave setup. Such as Microsoft > Windows VBS EPT-based Enclaves (That all runs on VTL1). It is also > similar to TEEs running on ARM TrustZone. > i.e. In my alternative proposed solution, the Enclave VM is similar to > VTL1/TrustZone. > It will also avoid requiring introducing a new PCI device and driver. True, this can be another option, to provision the primary VM and the enclave VM at launch time. In the proposed setup, the primary VM starts with the initial allocated resources (memory, CPUs). The launch path of the enclave VM, as it's spawned on the same host, is done via the ioctl interface - PCI device - host hypervisor path. Short-running or long-running enclaves can be bootstrapped during primary VM lifetime. Depending on the use case, a custom set of resources (memory and CPUs) is set for an enclave and then given back when the enclave is terminated; these resources can be used for another enclave spawned later on or the primary VM tasks. Thanks, Andra
On 25/04/2020 19:05, Paolo Bonzini wrote: > On 24/04/20 21:11, Alexander Graf wrote: >> What I was saying above is that maybe code is easier to transfer that >> than a .txt file that gets lost somewhere in the Documentation directory >> :). > whynotboth.jpg :D :) Alright, I added it to the list, in addition to the sample we've been talking about before, with the basic flow of the ioctl interface usage. > >>>> To answer the question though, the target file is in a newly invented >>>> file format called "EIF" and it needs to be loaded at offset 0x800000 of >>>> the address space donated to the enclave. >>> What is this EIF? >> It's just a very dumb container format that has a trivial header, a >> section with the bzImage and one to many sections of initramfs. >> >> As mentioned earlier in this thread, it really is just "-kernel" and >> "-initrd", packed into a single binary for transmission to the host. > Okay, got it. So, correct me if this is wrong, the information that is > needed to boot the enclave is: > > * the kernel, in bzImage format > > * the initrd > > * a consecutive amount of memory, to be mapped with > KVM_SET_USER_MEMORY_REGION Yes, the kernel bzImage, the kernel command line, the ramdisk(s) are part of the Enclave Image Format (EIF); plus an EIF header including metadata such as magic number, EIF version, image size and CRC. > > Off list, Alex and I discussed having a struct that points to kernel and > initrd off enclave memory, and have the driver build EIF at the > appropriate point in enclave memory (the 8 MiB offset that you mentioned). > > This however has two disadvantages: > > 1) having the kernel and initrd loaded by the parent VM in enclave > memory has the advantage that you save memory outside the enclave memory > for something that is only needed inside the enclave Here you wanted to say disadvantage? :) Wrt saving memory, it's about additional memory from the parent / primary VM needed for handling the enclave image sections (such as the kernel, ramdisk) and setting the EIF at a certain offset in enclave memory? > > 2) it is less extensible (what if you want to use PVH in the future for > example) and puts in the driver policy that should be in userspace. > > > So why not just start running the enclave at 0xfffffff0 in real mode? > Yes everybody hates it, but that's what OSes are written against. In > the simplest example, the parent VM can load bzImage and initrd at > 0x10000 and place firmware tables (MPTable and DMI) somewhere at > 0xf0000; the firmware would just be a few movs to segment registers > followed by a long jmp. > > If you want to keep EIF, we measured in QEMU that there is no measurable > difference between loading the kernel in the host and doing it in the > guest, so Amazon could provide an EIF loader stub at 0xfffffff0 for > backwards compatibility. Thanks for the info. Andra > >>> Again, I cannot provide a sensible review without explaining how to use >>> all this. I understand that Amazon needs to do part of the design >>> behind closed doors, but this seems to have resulted in issues that >>> remind me of Intel's SGX misadventures. If Amazon has designed NE in a >>> way that is incompatible with open standards, it's up to Amazon to fix >> Oh, if there's anything that conflicts with open standards here, I would >> love to hear it immediately. I do not believe in security by obscurity :). > That's great to hear! > > Paolo
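To make the EIF description in the exchange above concrete, here is a toy sketch of such a container: a trivial header (the thread only says it carries a magic number, an EIF version, the image size and a CRC), followed by a bzImage section and an initramfs section. The field order, widths and magic value below are illustrative assumptions, not the real EIF layout.

```python
import struct
import zlib

# Hypothetical EIF header layout -- field order, widths and the magic value
# are illustrative assumptions; only the presence of magic/version/size/CRC
# is stated in the thread.
EIF_MAGIC = 0x4549462E   # assumed placeholder, not the real magic
HEADER_FMT = "<IIQI"     # magic, version, image_size, crc32 (little-endian)

def build_eif(kernel: bytes, initramfs: bytes, version: int = 1) -> bytes:
    """Pack a toy EIF: header + bzImage section + initramfs section."""
    payload = kernel + initramfs
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    header = struct.pack(HEADER_FMT, EIF_MAGIC, version, len(payload), crc)
    return header + payload

def parse_eif(blob: bytes):
    """Unpack the toy header and verify the payload CRC."""
    hdr_len = struct.calcsize(HEADER_FMT)
    magic, version, size, crc = struct.unpack(HEADER_FMT, blob[:hdr_len])
    payload = blob[hdr_len:hdr_len + size]
    if zlib.crc32(payload) & 0xFFFFFFFF != crc:
        raise ValueError("EIF payload CRC mismatch")
    return magic, version, payload

eif = build_eif(b"bzImage-bytes", b"initramfs-bytes")
magic, version, payload = parse_eif(eif)
```

Per the thread, the resulting blob is what gets loaded at offset 0x800000 of the address space donated to the enclave; for the user space API it can be treated as an opaque blob.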
On 27/04/20 11:22, Paraschiv, Andra-Irina wrote: >> >> >> 1) having the kernel and initrd loaded by the parent VM in enclave >> memory has the advantage that you save memory outside the enclave memory >> for something that is only needed inside the enclave > > Here you wanted to say disadvantage? :) Wrt saving memory, it's about > additional memory from the parent / primary VM needed for handling the > enclave image sections (such as the kernel, ramdisk) and setting the EIF > at a certain offset in enclave memory? No, it's an advantage. If the parent VM can load everything in enclave memory, it can read() into it directly. It doesn't have to waste its own memory for a kernel and initrd, whose only reason to exist is to be copied into enclave memory. Paolo
On 27/04/2020 12:46, Paolo Bonzini wrote: > On 27/04/20 11:22, Paraschiv, Andra-Irina wrote: >>> >>> 1) having the kernel and initrd loaded by the parent VM in enclave >>> memory has the advantage that you save memory outside the enclave memory >>> for something that is only needed inside the enclave >> Here you wanted to say disadvantage? :) Wrt saving memory, it's about >> additional memory from the parent / primary VM needed for handling the >> enclave image sections (such as the kernel, ramdisk) and setting the EIF >> at a certain offset in enclave memory? > No, it's an advantage. If the parent VM can load everything in enclave > memory, it can read() into it directly. It doesn't have to waste its own > memory for a kernel and initrd, whose only reason to exist is to be > copied into enclave memory. Ok, got it, the saving refers to not using additional memory at all. Thank you for the clarification. Andra
On 27/04/2020 10:56, Paraschiv, Andra-Irina wrote: > > On 25/04/2020 18:25, Liran Alon wrote: >> >> On 23/04/2020 16:19, Paraschiv, Andra-Irina wrote: >>> >>> The memory and CPUs are carved out of the primary VM, they are >>> dedicated for the enclave. The Nitro hypervisor running on the host >>> ensures memory and CPU isolation between the primary VM and the >>> enclave VM. >> I hope you properly take into consideration Hyper-Threading >> speculative side-channel vulnerabilities here. >> i.e. Usually cloud providers designate each CPU core to be assigned >> to run only vCPUs of specific guest. To avoid sharing a single CPU >> core between multiple guests. >> To handle this properly, you need to use some kind of core-scheduling >> mechanism (Such that each CPU core either runs only vCPUs of enclave >> or only vCPUs of primary VM at any given point in time). >> >> In addition, can you elaborate more on how the enclave memory is >> carved out of the primary VM? >> Does this involve performing a memory hot-unplug operation from >> primary VM or just unmap enclave-assigned guest physical pages from >> primary VM's SLAT (EPT/NPT) and map them now only in enclave's SLAT? > > Correct, we take into consideration the HT setup. The enclave gets > dedicated physical cores. The primary VM and the enclave VM don't run > on CPU siblings of a physical core. The way I would imagine this to work is that Primary-VM just specifies how many vCPUs will the Enclave-VM have and those vCPUs will be set with affinity to run on same physical CPU cores as Primary-VM. But with the exception that scheduler is modified to not run vCPUs of Primary-VM and Enclave-VM as sibling on the same physical CPU core (core-scheduling). i.e. This is different than primary-VM losing physical CPU cores permanently as long as the Enclave-VM is running. 
Or maybe this should even be controlled by a knob in the virtual PCI device interface to allow the customer the flexibility to decide whether the Enclave-VM needs dedicated CPU cores or whether it is ok to share them with the Primary-VM as long as core-scheduling is used to guarantee proper isolation. > > Regarding the memory carve out, the logic includes page table entries > handling. As I thought. Thanks for the confirmation. > > IIRC, memory hot-unplug can be used for the memory blocks that were > previously hot-plugged. > > https://www.kernel.org/doc/html/latest/admin-guide/mm/memory-hotplug.html > >> >> I don't quite understand why Enclave VM needs to be >> provisioned/teardown during primary VM's runtime. >> >> For example, an alternative could have been to just provision both >> primary VM and Enclave VM on primary VM startup. >> Then, wait for primary VM to setup a communication channel with >> Enclave VM (E.g. via virtio-vsock). >> Then, primary VM is free to request Enclave VM to perform various >> tasks when required on the isolated environment. >> >> Such setup will mimic a common Enclave setup. Such as Microsoft >> Windows VBS EPT-based Enclaves (That all runs on VTL1). It is also >> similar to TEEs running on ARM TrustZone. >> i.e. In my alternative proposed solution, the Enclave VM is similar >> to VTL1/TrustZone. >> It will also avoid requiring introducing a new PCI device and driver. > > True, this can be another option, to provision the primary VM and the > enclave VM at launch time. > > In the proposed setup, the primary VM starts with the initial > allocated resources (memory, CPUs). The launch path of the enclave VM, > as it's spawned on the same host, is done via the ioctl interface - > PCI device - host hypervisor path. Short-running or long-running > enclave can be bootstrapped during primary VM lifetime.
Depending on > the use case, a custom set of resources (memory and CPUs) is set for > an enclave and then given back when the enclave is terminated; these > resources can be used for another enclave spawned later on or the > primary VM tasks. > Yes, I already understood this is how the mechanism works. I'm questioning whether this is indeed a good approach that should also be taken by upstream. The use-case of using Nitro Enclaves is for a Confidential-Computing service. i.e. The ability to provision a compute instance that can be trusted to perform a bunch of computation on sensitive information with high confidence that it cannot be compromised as it's highly isolated. Some technologies such as Intel SGX and AMD SEV attempted to achieve this even with guarantees that the computation is isolated from the hardware and hypervisor itself. I would have expected that for the vast majority of real customer use-cases, the customer will provision a compute instance that runs some confidential-computing task in an enclave which it keeps running for the entire life-time of the compute instance. As the sole purpose of the compute instance is to just expose a service that performs some confidential-computing task. For those cases, it should have been sufficient to just pre-provision a single Enclave-VM that performs this task, together with the compute instance, and connect them via virtio-vsock. Without introducing any new virtual PCI device, guest PCI driver and unique semantics of stealing resources (CPUs and Memory) from primary-VM at runtime. In this Nitro Enclave architecture, we de-facto put Compute control-plane abilities in the hands of the guest VM. Instead of introducing new control-plane primitives that allow building the data-plane architecture desired by the customer in a flexible manner. * What if the customer prefers to have its Enclave-VM polling an S3 bucket for new tasks and produce results to S3 as well?
Without having any "Primary-VM" or virtio-vsock connection of any kind?

* What if for some use-cases the customer wants the Enclave-VM to have dedicated compute power (i.e. not share physical CPU cores with the primary-VM, not even with core-scheduling) but for other use-cases the customer prefers to share physical CPU cores with the Primary-VM (together with core-scheduling guarantees)? (Although this could be addressed by extending the virtual PCI device interface with a knob to control this.)

An alternative would have been to have the following new control-plane primitives:

* Ability to provision a VM without a boot-volume, but instead from an image that is used to boot from memory. Allowing to provision disk-less VMs. (E.g. can be useful for other use-cases such as VMs not requiring EBS at all, which could allow a cheaper compute instance.)
* Ability to provision a group of VMs together such that they are guaranteed to launch as sibling VMs on the same host.
* Ability to create a fast-path connection between sibling VMs on the same host with virtio-vsock. Or even also other shared-memory mechanisms.
* Extend AWS Fargate with the ability to run multiple microVMs as a group (similar to above) connected with virtio-vsock. To allow on-demand scaling of confidential-computing tasks.

Having said that, I do see a similar architecture to the Nitro Enclaves virtual PCI device used for a different purpose: for hypervisor-based security isolation (such as Windows VBS). E.g. a Linux boot-loader can detect the presence of this virtual PCI device and use it to provision multiple VM security domains. Such that when a security domain is created, it is specified what hardware resources it has access to (guest memory pages, IOPorts, MSRs, etc.) and the blob it should run to bootstrap. Similar to, but superior to, Hyper-V VSM.
In addition, some security domains will be given special abilities to control other security domains (For example, to control the +XS,+XU EPT bits of other security domains to enforce code-integrity. Similar to Windows VBS HVCI). Just an idea... :) -Liran
On 26/04/2020 04:55, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote: > > On 2020/4/24 17:54, Paraschiv, Andra-Irina wrote: >> >> On 24/04/2020 11:19, Paraschiv, Andra-Irina wrote: >>> >>> On 24/04/2020 06:04, Longpeng (Mike, Cloud Infrastructure Service Product >>> Dept.) wrote: >>>> On 2020/4/23 21:19, Paraschiv, Andra-Irina wrote: >>>>> On 22/04/2020 00:46, Paolo Bonzini wrote: >>>>>> On 21/04/20 20:41, Andra Paraschiv wrote: >>>>>>> An enclave communicates with the primary VM via a local communication >>>>>>> channel, >>>>>>> using virtio-vsock [2]. An enclave does not have a disk or a network device >>>>>>> attached. >>>>>> Is it possible to have a sample of this in the samples/ directory? >>>>> I can add in v2 a sample file including the basic flow of how to use the ioctl >>>>> interface to create / terminate an enclave. >>>>> >>>>> Then we can update / build on top it based on the ongoing discussions on the >>>>> patch series and the received feedback. >>>>> >>>>>> I am interested especially in: >>>>>> >>>>>> - the initial CPU state: CPL0 vs. CPL3, initial program counter, etc. >>>>>> >>>>>> - the communication channel; does the enclave see the usual local APIC >>>>>> and IOAPIC interfaces in order to get interrupts from virtio-vsock, and >>>>>> where is the virtio-vsock device (virtio-mmio I suppose) placed in memory? >>>>>> >>>>>> - what the enclave is allowed to do: can it change privilege levels, >>>>>> what happens if the enclave performs an access to nonexistent memory, etc. >>>>>> >>>>>> - whether there are special hypercall interfaces for the enclave >>>>> An enclave is a VM, running on the same host as the primary VM, that launched >>>>> the enclave. They are siblings. >>>>> >>>>> Here we need to think of two components: >>>>> >>>>> 1. 
An enclave abstraction process - a process running in the primary VM guest, >>>>> that uses the provided ioctl interface of the Nitro Enclaves kernel driver to >>>>> spawn an enclave VM (that's 2 below). >>>>> >>>>> How does all gets to an enclave VM running on the host? >>>>> >>>>> There is a Nitro Enclaves emulated PCI device exposed to the primary VM. The >>>>> driver for this new PCI device is included in the current patch series. >>>>> >>>> Hi Paraschiv, >>>> >>>> The new PCI device is emulated in QEMU ? If so, is there any plan to send the >>>> QEMU code ? >>> Hi, >>> >>> Nope, not that I know of so far. >> And just to be a bit more clear, the reply above takes into consideration that >> it's not emulated in QEMU. >> > Thanks. > > Guys in this thread are much more interested in the design of the enclave VM and the > new device, but there's no document about this device yet, so I think the > emulation code is a good alternative. However, Alex said the device spec will > be published later, so I'll wait for it. True, that was mentioned wrt the device spec. The device interface could also be updated based on the ongoing discussions on the patch series. Refs to the device spec should be included e.g. in the .h file of the PCI device, once it's available. Thanks, Andra
>>>>> >>>>> The enclave VM has no persistent storage or network interface attached, it uses >>>>> its own memory and CPUs + its virtio-vsock emulated device for communication >>>>> with the primary VM. >>>>> >>>>> The memory and CPUs are carved out of the primary VM, they are dedicated for >>>>> the >>>>> enclave. The Nitro hypervisor running on the host ensures memory and CPU >>>>> isolation between the primary VM and the enclave VM. >>>>> >>>>> >>>>> These two components need to reflect the same state e.g. when the enclave >>>>> abstraction process (1) is terminated, the enclave VM (2) is terminated as >>>>> well. >>>>> >>>>> With regard to the communication channel, the primary VM has its own emulated >>>>> virtio-vsock PCI device. The enclave VM has its own emulated virtio-vsock >>>>> device >>>>> as well. This channel is used, for example, to fetch data in the enclave and >>>>> then process it. An application that sets up the vsock socket and connects or >>>>> listens, depending on the use case, is then developed to use this channel; this >>>>> happens on both ends - primary VM and enclave VM. >>>>> >>>>> Let me know if further clarifications are needed. >>>>> >>>>>>> The proposed solution is following the KVM model and uses the KVM API to >>>>>>> be able >>>>>>> to create and set resources for enclaves. An additional ioctl command, >>>>>>> besides >>>>>>> the ones provided by KVM, is used to start an enclave and setup the >>>>>>> addressing >>>>>>> for the communication channel and an enclave unique id. >>>>>> Reusing some KVM ioctls is definitely a good idea, but I wouldn't really >>>>>> say it's the KVM API since the VCPU file descriptor is basically non >>>>>> functional (without KVM_RUN and mmap it's not really the KVM API). >>>>> It uses part of the KVM API or a set of KVM ioctls to model the way a VM is >>>>> created / terminated. That's true, KVM_RUN and mmap-ing the vcpu fd are not >>>>> included. 
>>>>> >>>>> Thanks for the feedback regarding the reuse of KVM ioctls. >>>>> >>>>> Andra > --- > Regards, > Longpeng(Mike)
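The ioctl-to-PCI-command mapping discussed in the thread above can be sketched as a dry run, from the point of view of the enclave abstraction process. Only the two mappings stated in the thread (KVM_SET_USER_MEMORY_REGION maps to an add memory command, NE_ENCLAVE_START maps to an enclave start command) come from the source; the other command names and the fake device below are illustrative assumptions, and no real /dev/kvm or PCI device is touched.

```python
# Dry-run sketch of the described launch flow: KVM-style ioctls issued by
# the enclave abstraction process, and the PCI device command each one is
# said to translate into. ENCLAVE_ALLOC and ADD_VCPU are assumed names.
IOCTL_TO_PCI_CMD = {
    "KVM_CREATE_VM": "ENCLAVE_ALLOC",            # assumed command name
    "KVM_SET_USER_MEMORY_REGION": "ADD_MEMORY",  # mapping stated in thread
    "KVM_CREATE_VCPU": "ADD_VCPU",               # assumed command name
    "NE_ENCLAVE_START": "ENCLAVE_START",         # mapping stated in thread
}

class FakeEnclaveDevice:
    """Records the PCI commands an enclave launch would issue."""
    def __init__(self):
        self.commands = []

    def ioctl(self, request: str) -> None:
        self.commands.append(IOCTL_TO_PCI_CMD[request])

def launch_enclave(dev, mem_regions: int, vcpus: int) -> None:
    """Create the enclave VM, donate memory and vCPUs, then start it."""
    dev.ioctl("KVM_CREATE_VM")
    for _ in range(mem_regions):
        dev.ioctl("KVM_SET_USER_MEMORY_REGION")
    for _ in range(vcpus):
        dev.ioctl("KVM_CREATE_VCPU")
    dev.ioctl("NE_ENCLAVE_START")

dev = FakeEnclaveDevice()
launch_enclave(dev, mem_regions=2, vcpus=2)
```

The real driver would forward each command to the Nitro hypervisor through the emulated PCI device; the two components then reflect the same state, e.g. terminating the abstraction process terminates the enclave VM.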
On 26/04/2020 11:16, Tian, Kevin wrote: >> From: Paraschiv, Andra-Irina <andraprs@amazon.com> >> Sent: Friday, April 24, 2020 9:59 PM >> >> >> On 24/04/2020 12:59, Tian, Kevin wrote: >>>> From: Paraschiv, Andra-Irina >>>> Sent: Thursday, April 23, 2020 9:20 PM >>>> >>>> On 22/04/2020 00:46, Paolo Bonzini wrote: >>>>> On 21/04/20 20:41, Andra Paraschiv wrote: >>>>>> An enclave communicates with the primary VM via a local >> communication >>>> channel, >>>>>> using virtio-vsock [2]. An enclave does not have a disk or a network >> device >>>>>> attached. >>>>> Is it possible to have a sample of this in the samples/ directory? >>>> I can add in v2 a sample file including the basic flow of how to use the >>>> ioctl interface to create / terminate an enclave. >>>> >>>> Then we can update / build on top it based on the ongoing discussions on >>>> the patch series and the received feedback. >>>> >>>>> I am interested especially in: >>>>> >>>>> - the initial CPU state: CPL0 vs. CPL3, initial program counter, etc. >>>>> >>>>> - the communication channel; does the enclave see the usual local APIC >>>>> and IOAPIC interfaces in order to get interrupts from virtio-vsock, and >>>>> where is the virtio-vsock device (virtio-mmio I suppose) placed in >> memory? >>>>> - what the enclave is allowed to do: can it change privilege levels, >>>>> what happens if the enclave performs an access to nonexistent memory, >>>> etc. >>>>> - whether there are special hypercall interfaces for the enclave >>>> An enclave is a VM, running on the same host as the primary VM, that >>>> launched the enclave. They are siblings. >>>> >>>> Here we need to think of two components: >>>> >>>> 1. An enclave abstraction process - a process running in the primary VM >>>> guest, that uses the provided ioctl interface of the Nitro Enclaves >>>> kernel driver to spawn an enclave VM (that's 2 below). >>>> >>>> How does all gets to an enclave VM running on the host? 
>>>> >>>> There is a Nitro Enclaves emulated PCI device exposed to the primary VM. >>>> The driver for this new PCI device is included in the current patch series. >>>> >>>> The ioctl logic is mapped to PCI device commands e.g. the >>>> NE_ENCLAVE_START ioctl maps to an enclave start PCI command or the >>>> KVM_SET_USER_MEMORY_REGION maps to an add memory PCI >> command. >>>> The PCI >>>> device commands are then translated into actions taken on the hypervisor >>>> side; that's the Nitro hypervisor running on the host where the primary >>>> VM is running. >>>> >>>> 2. The enclave itself - a VM running on the same host as the primary VM >>>> that spawned it. >>>> >>>> The enclave VM has no persistent storage or network interface attached, >>>> it uses its own memory and CPUs + its virtio-vsock emulated device for >>>> communication with the primary VM. >>> sounds like a firecracker VM? >> It's a VM crafted for enclave needs. >> >>>> The memory and CPUs are carved out of the primary VM, they are >> dedicated >>>> for the enclave. The Nitro hypervisor running on the host ensures memory >>>> and CPU isolation between the primary VM and the enclave VM. >>> In last paragraph, you said that the enclave VM uses its own memory and >>> CPUs. Then here, you said the memory/CPUs are carved out and dedicated >>> from the primary VM. Can you elaborate which one is accurate? or a mixed >>> model? >> Memory and CPUs are carved out of the primary VM and are dedicated for >> the enclave VM. I mentioned above as "its own" in the sense that the >> primary VM doesn't use these carved out resources while the enclave is >> running, as they are dedicated to the enclave. >> >> Hope that now it's more clear. > yes, it's clearer. Good, glad to hear that. > >>>> These two components need to reflect the same state e.g. when the >>>> enclave abstraction process (1) is terminated, the enclave VM (2) is >>>> terminated as well. 
>>>> >>>> With regard to the communication channel, the primary VM has its own >>>> emulated virtio-vsock PCI device. The enclave VM has its own emulated >>>> virtio-vsock device as well. This channel is used, for example, to fetch >>>> data in the enclave and then process it. An application that sets up the >>>> vsock socket and connects or listens, depending on the use case, is then >>>> developed to use this channel; this happens on both ends - primary VM >>>> and enclave VM. >>> How does the application in the primary VM assign task to be executed >>> in the enclave VM? I didn't see such command in this series, so suppose >>> it is also communicated through virtio-vsock? >> The application that runs in the enclave needs to be packaged in an >> enclave image together with the OS ( e.g. kernel, ramdisk, init ) that >> will run in the enclave VM. >> >> Then the enclave image is loaded in memory. After booting is finished, >> the application starts. Now, depending on the app implementation and use >> case, one example can be that the app in the enclave waits for data to >> be fetched in via the vsock channel. >> > OK, I thought the code/data was dynamically injected from the primary > VM and then run in the enclave. From your description it sounds like > a servicing model where an auto-running application waits for and responds > to service requests from the application in the primary VM. That was an example with a possible use case; in that one example, data can be dynamically injected e.g. fetch into the enclave VM a bunch of data, get back the results after processing, then fetch in another set of data and so on. The architecture of the solution depends on how the tasks are split between the primary VM and the enclave VM and what is sent via the vsock channel. The primary VM, the enclave VM and the communication between them is part of the foundational technology we provide.
What's running inside each of them can vary based on the customer use case, updated to fit this infrastructure where tasks are split and some of them run in the enclave VM. Thanks, Andra
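The vsock channel described above can be sketched from the primary VM side roughly as follows, using Python's AF_VSOCK support (Linux-only). The enclave CID and port are placeholders, and this is an illustrative sketch, not the sample promised for the patch series.

```python
import socket

# Minimal sketch of the vsock client side that would run in the primary VM,
# per the thread's description of the communication channel. ENCLAVE_CID and
# ENCLAVE_PORT are placeholders; AF_VSOCK is only available on Linux.
ENCLAVE_CID = 16     # placeholder context ID for the enclave VM
ENCLAVE_PORT = 9000  # placeholder service port inside the enclave

def send_to_enclave(payload: bytes, cid: int = ENCLAVE_CID,
                    port: int = ENCLAVE_PORT) -> bytes:
    """Connect to the enclave's vsock listener and exchange one message."""
    if not hasattr(socket, "AF_VSOCK"):
        raise OSError("AF_VSOCK not supported on this platform")
    with socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM) as s:
        s.connect((cid, port))
        s.sendall(payload)
        return s.recv(4096)

# The enclave-side application would mirror this with bind()/listen()/
# accept() on (socket.VMADDR_CID_ANY, ENCLAVE_PORT).
```

Whether the enclave app listens or connects depends on the use case; per the thread, both ends run an application developed against this channel.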
On 25.04.20 18:05, Paolo Bonzini wrote: > On 24/04/20 21:11, Alexander Graf wrote: >> What I was saying above is that maybe code is easier to transfer that >> than a .txt file that gets lost somewhere in the Documentation directory >> :). > > whynotboth.jpg :D Uh, sure? :) Let's first hammer out what we really want for the UABI though. Then we can document it. >>>> To answer the question though, the target file is in a newly invented >>>> file format called "EIF" and it needs to be loaded at offset 0x800000 of >>>> the address space donated to the enclave. >>> >>> What is this EIF? >> >> It's just a very dumb container format that has a trivial header, a >> section with the bzImage and one to many sections of initramfs. >> >> As mentioned earlier in this thread, it really is just "-kernel" and >> "-initrd", packed into a single binary for transmission to the host. > > Okay, got it. So, correct me if this is wrong, the information that is > needed to boot the enclave is: > > * the kernel, in bzImage format > > * the initrd It's a single EIF file for a good reason. There are checksums in there and potentially signatures too, so that the enclave can attest itself. For the sake of the user space API, the enclave image really should just be considered a blob. > > * a consecutive amount of memory, to be mapped with > KVM_SET_USER_MEMORY_REGION > > Off list, Alex and I discussed having a struct that points to kernel and > initrd off enclave memory, and have the driver build EIF at the > appropriate point in enclave memory (the 8 MiB offset that you mentioned).
> > This however has two disadvantages: > > 1) having the kernel and initrd loaded by the parent VM in enclave > memory has the advantage that you save memory outside the enclave memory > for something that is only needed inside the enclave > > 2) it is less extensible (what if you want to use PVH in the future for > example) and puts in the driver policy that should be in userspace. > > > So why not just start running the enclave at 0xfffffff0 in real mode? > Yes everybody hates it, but that's what OSes are written against. In > the simplest example, the parent VM can load bzImage and initrd at > 0x10000 and place firmware tables (MPTable and DMI) somewhere at > 0xf0000; the firmware would just be a few movs to segment registers > followed by a long jmp. There is a bit of initial attestation flow in the enclave, so that you can be sure that the code that is running is actually what you wanted to run. I would also in general prefer to disconnect the notion of "enclave memory" as much as possible from a memory location view. User space shouldn't be in the business of knowing at which enclave memory position its donated memory ended up. By disconnecting the view of the memory world, we can do some more optimizations, such as compacting memory ranges more efficiently in kernel space. > If you want to keep EIF, we measured in QEMU that there is no measurable > difference between loading the kernel in the host and doing it in the > guest, so Amazon could provide an EIF loader stub at 0xfffffff0 for > backwards compatibility. It's not about performance :). So the other thing we discussed was whether the KVM API really turned out to be a good fit here. After all, today we merely call: * CREATE_VM * SET_MEMORY_RANGE * CREATE_VCPU * START_ENCLAVE where we even butcher up CREATE_VCPU into a meaningless blob of overhead for no good reason. Why don't we build something like the following instead?
vm = ne_create(vcpus = 4)
ne_set_memory(vm, hva, len)
ne_load_image(vm, addr, len)
ne_start(vm)

That way we would get the EIF loading into kernel space. "LOAD_IMAGE" would only be available in the time window between set_memory and start. It basically implements a memcpy(), but it would completely hide the hidden semantics of where an EIF has to go, so future device versions (or even other enclave implementers) could change the logic. I think it also makes sense to just allocate those 4 ioctls from scratch. Paolo, would you still want to "donate" KVM ioctl space in that case? Overall, the above should address most of the concerns you raised in this mail, right? It still requires copying, but at least we don't have to keep the copy in kernel space. Alex Amazon Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879
On 27.04.20 13:44, Liran Alon wrote: > > On 27/04/2020 10:56, Paraschiv, Andra-Irina wrote: >> >> On 25/04/2020 18:25, Liran Alon wrote: >>> >>> On 23/04/2020 16:19, Paraschiv, Andra-Irina wrote: >>>> >>>> The memory and CPUs are carved out of the primary VM, they are >>>> dedicated for the enclave. The Nitro hypervisor running on the host >>>> ensures memory and CPU isolation between the primary VM and the >>>> enclave VM. >>> I hope you properly take into consideration Hyper-Threading >>> speculative side-channel vulnerabilities here. >>> i.e. Usually cloud providers designate each CPU core to be assigned >>> to run only vCPUs of specific guest. To avoid sharing a single CPU >>> core between multiple guests. >>> To handle this properly, you need to use some kind of core-scheduling >>> mechanism (Such that each CPU core either runs only vCPUs of enclave >>> or only vCPUs of primary VM at any given point in time). >>> >>> In addition, can you elaborate more on how the enclave memory is >>> carved out of the primary VM? >>> Does this involve performing a memory hot-unplug operation from >>> primary VM or just unmap enclave-assigned guest physical pages from >>> primary VM's SLAT (EPT/NPT) and map them now only in enclave's SLAT? >> >> Correct, we take into consideration the HT setup. The enclave gets >> dedicated physical cores. The primary VM and the enclave VM don't run >> on CPU siblings of a physical core. > The way I would imagine this to work is that Primary-VM just specifies > how many vCPUs will the Enclave-VM have and those vCPUs will be set with > affinity to run on same physical CPU cores as Primary-VM. > But with the exception that scheduler is modified to not run vCPUs of > Primary-VM and Enclave-VM as sibling on the same physical CPU core > (core-scheduling). i.e. This is different than primary-VM losing > physical CPU cores permanently as long as the Enclave-VM is running. 
> Or maybe this should even be controlled by a knob in virtual PCI device > interface to allow flexibility to customer to decide if Enclave-VM needs > dedicated CPU cores or is it ok to share them with Primary-VM > as long as core-scheduling is used to guarantee proper isolation. Running both parent and enclave on the same core can *potentially* lead to L2 cache leakage, so we decided not to go with it :). >> >> Regarding the memory carve out, the logic includes page table entries >> handling. > As I thought. Thanks for confirmation. >> >> IIRC, memory hot-unplug can be used for the memory blocks that were >> previously hot-plugged. >> >> https://www.kernel.org/doc/html/latest/admin-guide/mm/memory-hotplug.html >> >> >>> >>> I don't quite understand why Enclave VM needs to be >>> provisioned/teardown during primary VM's runtime. >>> >>> For example, an alternative could have been to just provision both >>> primary VM and Enclave VM on primary VM startup. >>> Then, wait for primary VM to setup a communication channel with >>> Enclave VM (E.g. via virtio-vsock). >>> Then, primary VM is free to request Enclave VM to perform various >>> tasks when required on the isolated environment. >>> >>> Such setup will mimic a common Enclave setup. Such as Microsoft >>> Windows VBS EPT-based Enclaves (That all runs on VTL1). It is also >>> similar to TEEs running on ARM TrustZone. >>> i.e. In my alternative proposed solution, the Enclave VM is similar >>> to VTL1/TrustZone. >>> It will also avoid requiring introducing a new PCI device and driver. >> >> True, this can be another option, to provision the primary VM and the >> enclave VM at launch time. >> >> In the proposed setup, the primary VM starts with the initial >> allocated resources (memory, CPUs).
The launch path of the enclave VM, >> as it's spawned on the same host, is done via the ioctl interface - >> PCI device - host hypervisor path. Short-running or long-running >> enclave can be bootstrapped during primary VM lifetime. Depending on >> the use case, a custom set of resources (memory and CPUs) is set for >> an enclave and then given back when the enclave is terminated; these >> resources can be used for another enclave spawned later on or the >> primary VM tasks. >> > Yes, I already understood this is how the mechanism work. I'm > questioning whether this is indeed a good approach that should also be > taken by upstream. I thought the point of Linux was to support devices that exist, rather than change the way the world works around it? ;) > The use-case of using Nitro Enclaves is for a Confidential-Computing > service. i.e. The ability to provision a compute instance that can be > trusted to perform a bunch of computation on sensitive > information with high confidence that it cannot be compromised as it's > highly isolated. Some technologies such as Intel SGX and AMD SEV > attempted to achieve this even with guarantees that > the computation is isolated from the hardware and hypervisor itself. Yeah, that worked really well, didn't it? ;) > I would have expected that for the vast majority of real customer > use-cases, the customer will provision a compute instance that runs some > confidential-computing task in an enclave which it > keeps running for the entire life-time of the compute instance. As the > sole purpose of the compute instance is to just expose a service that > performs some confidential-computing task. > For those cases, it should have been sufficient to just pre-provision a > single Enclave-VM that performs this task, together with the compute > instance and connect them via virtio-vsock. 
> Without introducing any new virtual PCI device, guest PCI driver and > unique semantics of stealing resources (CPUs and Memory) from primary-VM > at runtime. You would also need to preprovision the image that runs in the enclave, which is usually only determined at runtime. For that you need the PCI driver anyway, so why not make the creation dynamic too? > In this Nitro Enclave architecture, we de-facto put Compute > control-plane abilities in the hands of the guest VM. Instead of > introducing new control-plane primitives that allows building > the data-plane architecture desired by the customer in a flexible manner. > * What if the customer prefers to have its Enclave VM polling S3 bucket > for new tasks and produce results to S3 as-well? Without having any > "Primary-VM" or virtio-vsock connection of any kind? > * What if for some use-cases customer wants Enclave-VM to have dedicated > compute power (i.e. Not share physical CPU cores with primary-VM. Not > even with core-scheduling) but for other > use-cases, customer prefers to share physical CPU cores with Primary-VM > (Together with core-scheduling guarantees)? (Although this could be > addressed by extending the virtual PCI device > interface with a knob to control this) > > An alternative would have been to have the following new control-plane > primitives: > * Ability to provision a VM without boot-volume, but instead from an > Image that is used to boot from memory. Allowing to provision disk-less > VMs. > (E.g.
> * Extend AWS Fargate with ability to run multiple microVMs as a group > (Similar to above) connected with virtio-vsock. To allow on-demand scale > of confidential-computing task. Yes, there are a *lot* of different ways to implement enclaves in a cloud environment. This is the one that we focused on, but I'm sure others in the space will have more ideas. It's definitely an interesting space and I'm eager to see more innovation happening :). > Having said that, I do see a similar architecture to Nitro Enclaves > virtual PCI device used for a different purpose: For hypervisor-based > security isolation (Such as Windows VBS). > E.g. Linux boot-loader can detect the presence of this virtual PCI > device and use it to provision multiple VM security domains. Such that > when a security domain is created, > it is specified what hardware resources it has access to (Guest > memory pages, IOPorts, MSRs, etc.) and the blob it should run to > bootstrap. Similar, but superior to, > Hyper-V VSM. In addition, some security domains will be given special > abilities to control other security domains (For example, to control the > +XS,+XU EPT bits of other security > domains to enforce code-integrity. Similar to Windows VBS HVCI). Just an > idea... :) Yes, absolutely! So much fun to be had :D Alex
On 28/04/2020 18:25, Alexander Graf wrote: > > > On 27.04.20 13:44, Liran Alon wrote: >> >> On 27/04/2020 10:56, Paraschiv, Andra-Irina wrote: >>> >>> On 25/04/2020 18:25, Liran Alon wrote: >>>> >>>> On 23/04/2020 16:19, Paraschiv, Andra-Irina wrote: >>>>> >>>>> The memory and CPUs are carved out of the primary VM, they are >>>>> dedicated for the enclave. The Nitro hypervisor running on the host >>>>> ensures memory and CPU isolation between the primary VM and the >>>>> enclave VM. >>>> I hope you properly take into consideration Hyper-Threading >>>> speculative side-channel vulnerabilities here. >>>> i.e. Usually cloud providers designate each CPU core to be assigned >>>> to run only vCPUs of specific guest. To avoid sharing a single CPU >>>> core between multiple guests. >>>> To handle this properly, you need to use some kind of core-scheduling >>>> mechanism (Such that each CPU core either runs only vCPUs of enclave >>>> or only vCPUs of primary VM at any given point in time). >>>> >>>> In addition, can you elaborate more on how the enclave memory is >>>> carved out of the primary VM? >>>> Does this involve performing a memory hot-unplug operation from >>>> primary VM or just unmap enclave-assigned guest physical pages from >>>> primary VM's SLAT (EPT/NPT) and map them now only in enclave's SLAT? >>> >>> Correct, we take into consideration the HT setup. The enclave gets >>> dedicated physical cores. The primary VM and the enclave VM don't run >>> on CPU siblings of a physical core. >> The way I would imagine this to work is that Primary-VM just specifies >> how many vCPUs will the Enclave-VM have and those vCPUs will be set with >> affinity to run on same physical CPU cores as Primary-VM. >> But with the exception that scheduler is modified to not run vCPUs of >> Primary-VM and Enclave-VM as sibling on the same physical CPU core >> (core-scheduling). i.e. 
This is different than primary-VM losing >> physical CPU cores permanently as long as the Enclave-VM is running. >> Or maybe this should even be controlled by a knob in virtual PCI device >> interface to allow flexibility to customer to decide if Enclave-VM needs >> dedicated CPU cores or is it ok to share them with Primary-VM >> as long as core-scheduling is used to guarantee proper isolation. > > Running both parent and enclave on the same core can *potentially* > lead to L2 cache leakage, so we decided not to go with it :). Haven't thought about the L2 cache. Makes sense. Ack. > >>> >>> Regarding the memory carve out, the logic includes page table entries >>> handling. >> As I thought. Thanks for confirmation. >>> >>> IIRC, memory hot-unplug can be used for the memory blocks that were >>> previously hot-plugged. >>> >>> https://www.kernel.org/doc/html/latest/admin-guide/mm/memory-hotplug.html >>> >>> >>>> >>>> I don't quite understand why Enclave VM needs to be >>>> provisioned/teardown during primary VM's runtime. >>>> >>>> For example, an alternative could have been to just provision both >>>> primary VM and Enclave VM on primary VM startup. >>>> Then, wait for primary VM to setup a communication channel with >>>> Enclave VM (E.g. via virtio-vsock). >>>> Then, primary VM is free to request Enclave VM to perform various >>>> tasks when required on the isolated environment. >>>> >>>> Such setup will mimic a common Enclave setup. Such as Microsoft >>>> Windows VBS EPT-based Enclaves (That all runs on VTL1). It is also >>>> similar to TEEs running on ARM TrustZone. >>>> i.e. In my alternative proposed solution, the Enclave VM is similar >>>> to VTL1/TrustZone. >>>> It will also avoid requiring introducing a new PCI device and driver. >>> >>> True, this can be another option, to provision the primary VM and the >>> enclave VM at launch time.
>>> >>> In the proposed setup, the primary VM starts with the initial >>> allocated resources (memory, CPUs). The launch path of the enclave VM, >>> as it's spawned on the same host, is done via the ioctl interface - >>> PCI device - host hypervisor path. Short-running or long-running >>> enclave can be bootstrapped during primary VM lifetime. Depending on >>> the use case, a custom set of resources (memory and CPUs) is set for >>> an enclave and then given back when the enclave is terminated; these >>> resources can be used for another enclave spawned later on or the >>> primary VM tasks. >>> >> Yes, I already understood this is how the mechanism work. I'm >> questioning whether this is indeed a good approach that should also be >> taken by upstream. > > I thought the point of Linux was to support devices that exist, rather > than change the way the world works around it? ;) I agree. Just poking around to see if upstream wants to implement a different approach for Enclaves, regardless of accepting the Nitro Enclave virtual PCI driver for AWS use-case of course. > >> The use-case of using Nitro Enclaves is for a Confidential-Computing >> service. i.e. The ability to provision a compute instance that can be >> trusted to perform a bunch of computation on sensitive >> information with high confidence that it cannot be compromised as it's >> highly isolated. Some technologies such as Intel SGX and AMD SEV >> attempted to achieve this even with guarantees that >> the computation is isolated from the hardware and hypervisor itself. > > Yeah, that worked really well, didn't it? ;) You haven't seen me saying SGX worked well. :) AMD SEV though still has its shot (once SEV-SNP is GA). > >> I would have expected that for the vast majority of real customer >> use-cases, the customer will provision a compute instance that runs some >> confidential-computing task in an enclave which it >> keeps running for the entire life-time of the compute instance.
As the >> sole purpose of the compute instance is to just expose a service that >> performs some confidential-computing task. >> For those cases, it should have been sufficient to just pre-provision a >> single Enclave-VM that performs this task, together with the compute >> instance and connect them via virtio-vsock. >> Without introducing any new virtual PCI device, guest PCI driver and >> unique semantics of stealing resources (CPUs and Memory) from primary-VM >> at runtime. > > You would also need to preprovision the image that runs in the > enclave, which is usually only determined at runtime. For that you > need the PCI driver anyway, so why not make the creation dynamic too? The image doesn't have to be determined at runtime. It could be supplied to control-plane. As mentioned below. > >> In this Nitro Enclave architecture, we de-facto put Compute >> control-plane abilities in the hands of the guest VM. Instead of >> introducing new control-plane primitives that allows building >> the data-plane architecture desired by the customer in a flexible >> manner. >> * What if the customer prefers to have it's Enclave VM polling S3 bucket >> for new tasks and produce results to S3 as-well? Without having any >> "Primary-VM" or virtio-vsock connection of any kind? >> * What if for some use-cases customer wants Enclave-VM to have dedicated >> compute power (i.e. Not share physical CPU cores with primary-VM. Not >> even with core-scheduling) but for other >> use-cases, customer prefers to share physical CPU cores with Primary-VM >> (Together with core-scheduling guarantees)? (Although this could be >> addressed by extending the virtual PCI device >> interface with a knob to control this) >> >> An alternative would have been to have the following new control-plane >> primitives: >> * Ability to provision a VM without boot-volume, but instead from an >> Image that is used to boot from memory. Allowing to provision >> disk-less VMs. >> (E.g. 
Can be useful for other use-cases such as VMs not requiring EBS >> at all which could allow cheaper compute instance) >> * Ability to provision a group of VMs together as a group such that they >> are guaranteed to launch as sibling VMs on the same host. >> * Ability to create a fast-path connection between sibling VMs on the >> same host with virtio-vsock. Or even also other shared-memory mechanism. >> * Extend AWS Fargate with ability to run multiple microVMs as a group >> (Similar to above) connected with virtio-vsock. To allow on-demand scale >> of confidential-computing task. > > Yes, there are a *lot* of different ways to implement enclaves in a > cloud environment. This is the one that we focused on, but I'm sure > others in the space will have more ideas. It's definitely an > interesting space and I'm eager to see more innovation happening :). > >> Having said that, I do see a similar architecture to Nitro Enclaves >> virtual PCI device used for a different purpose: For hypervisor-based >> security isolation (Such as Windows VBS). >> E.g. Linux boot-loader can detect the presence of this virtual PCI >> device and use it to provision multiple VM security domains. Such that >> when a security domain is created, >> it is specified what hardware resources it has access to (Guest >> memory pages, IOPorts, MSRs, etc.) and the blob it should run to >> bootstrap. Similar, but superior to, >> Hyper-V VSM. In addition, some security domains will be given special >> abilities to control other security domains (For example, to control the >> +XS,+XU EPT bits of other security >> domains to enforce code-integrity. Similar to Windows VBS HVCI). Just an >> idea... :) > > Yes, absolutely! So much fun to be had :D :) -Liran > > > Alex
On 28/04/20 17:07, Alexander Graf wrote: >> So why not just start running the enclave at 0xfffffff0 in real mode? >> Yes everybody hates it, but that's what OSes are written against. In >> the simplest example, the parent enclave can load bzImage and initrd at >> 0x10000 and place firmware tables (MPTable and DMI) somewhere at >> 0xf0000; the firmware would just be a few movs to segment registers >> followed by a long jmp. > > There is a bit of initial attestation flow in the enclave, so that > you can be sure that the code that is running is actually what you wanted to > run. Can you explain this, since it's not documented? > vm = ne_create(vcpus = 4) > ne_set_memory(vm, hva, len) > ne_load_image(vm, addr, len) > ne_start(vm) > > That way we would get the EIF loading into kernel space. "LOAD_IMAGE" > would only be available in the time window between set_memory and start. > It basically implements a memcpy(), but it would completely hide the > hidden semantics of where an EIF has to go, so future device versions > (or even other enclave implementers) could change the logic. > > I think it also makes sense to just allocate those 4 ioctls from > scratch. Paolo, would you still want to "donate" KVM ioctl space in that > case? Sure, that's not a problem. Paolo > Overall, the above should address most of the concerns you raised in > this mail, right? It still requires copying, but at least we don't have > to keep the copy in kernel space.
On 28/04/20 17:07, Alexander Graf wrote: > > Why don't we build something like the following instead? > > vm = ne_create(vcpus = 4) > ne_set_memory(vm, hva, len) > ne_load_image(vm, addr, len) > ne_start(vm) > > That way we would get the EIF loading into kernel space. "LOAD_IMAGE" > would only be available in the time window between set_memory and start. > It basically implements a memcpy(), but it would completely hide the > hidden semantics of where an EIF has to go, so future device versions > (or even other enclave implementers) could change the logic. Can we add a file format argument and flags to ne_load_image, to avoid having a v2 ioctl later? Also, would you consider a mode where ne_load_image is not invoked and the enclave starts in real mode at 0xfffffff0? Thanks, Paolo
On 30.04.20 12:34, Paolo Bonzini wrote: > > On 28/04/20 17:07, Alexander Graf wrote: >> >> Why don't we build something like the following instead? >> >> vm = ne_create(vcpus = 4) >> ne_set_memory(vm, hva, len) >> ne_load_image(vm, addr, len) >> ne_start(vm) >> >> That way we would get the EIF loading into kernel space. "LOAD_IMAGE" >> would only be available in the time window between set_memory and start. >> It basically implements a memcpy(), but it would completely hide the >> hidden semantics of where an EIF has to go, so future device versions >> (or even other enclave implementers) could change the logic. > > Can we add a file format argument and flags to ne_load_image, to avoid > having a v2 ioctl later? I think flags alone should be enough, no? A new format would just be a flag. That said, any of the commands above should have flags IMHO. > Also, would you consider a mode where ne_load_image is not invoked and > the enclave starts in real mode at 0xfffffff0? Consider, sure. But I don't quite see any big benefit just yet. The current abstraction level for the booted payloads is much higher. That allows us to simplify the device model dramatically: There is no need to create a virtual flash region for example. In addition, by moving firmware into the trusted base, firmware can execute validation of the target image. If you make it all flat, how do you verify whether what you're booting is what you think you're booting? So in a nutshell, for a PV virtual machine spawning interface, I think it would make sense to have memory fully owned by the parent. In the enclave world, I would rather not like to give the parent too much control over what memory actually means, outside of donating a bucket of it. Alex
On 30/04/20 13:21, Alexander Graf wrote: >> Also, would you consider a mode where ne_load_image is not invoked and >> the enclave starts in real mode at 0xfffffff0? > > Consider, sure. But I don't quite see any big benefit just yet. The > current abstraction level for the booted payloads is much higher. That > allows us to simplify the device model dramatically: There is no need to > create a virtual flash region for example. It doesn't have to be flash, it can be just ROM. > In addition, by moving firmware into the trusted base, firmware can > execute validation of the target image. If you make it all flat, how do > you verify whether what you're booting is what you think you're booting? So the issue would be that a firmware image provided by the parent could be tampered with by something malicious running in the parent enclave? Paolo > So in a nutshell, for a PV virtual machine spawning interface, I think > it would make sense to have memory fully owned by the parent. In the > enclave world, I would rather not like to give the parent too much > control over what memory actually means, outside of donating a bucket of > it.
On 30.04.20 13:38, Paolo Bonzini wrote: > > On 30/04/20 13:21, Alexander Graf wrote: >>> Also, would you consider a mode where ne_load_image is not invoked and >>> the enclave starts in real mode at 0xfffffff0? >> >> Consider, sure. But I don't quite see any big benefit just yet. The >> current abstraction level for the booted payloads is much higher. That >> allows us to simplify the device model dramatically: There is no need to >> create a virtual flash region for example. > > It doesn't have to be flash, it can be just ROM. > >> In addition, by moving firmware into the trusted base, firmware can >> execute validation of the target image. If you make it all flat, how do >> you verify whether what you're booting is what you think you're booting? > > So the issue would be that a firmware image provided by the parent could > be tampered with by something malicious running in the parent enclave? You have to have a root of trust somewhere. That root then checks and attests everything it runs. What exactly would you attest for with a flat address space model? So the issue is that the enclave code can not trust its own integrity if it doesn't have anything at a higher level attesting it. The way this is usually solved on bare metal systems is that you trust your CPU which then checks the firmware integrity (Boot Guard). Where would you put that check in a VM model? How close would it be to a normal VM then? And if it's not, what's the point of sticking to such terrible legacy boot paths? Alex
On 30/04/20 13:47, Alexander Graf wrote: >> >> So the issue would be that a firmware image provided by the parent could >> be tampered with by something malicious running in the parent enclave? > > You have to have a root of trust somewhere. That root then checks and > attests everything it runs. What exactly would you attest for with a > flat address space model? > > So the issue is that the enclave code can not trust its own integrity if > it doesn't have anything at a higher level attesting it. The way this is > usually solved on bare metal systems is that you trust your CPU which > then checks the firmware integrity (Boot Guard). Where would you put > that check in a VM model? In the enclave device driver, I would just limit the attestation to the firmware image. So yeah it wouldn't be a mode where ne_load_image is not invoked and the enclave starts in real mode at 0xfffffff0. You would still need "load image" functionality. > How close would it be to a normal VM then? And > if it's not, what's the point of sticking to such terrible legacy boot > paths? The point is that there are already two plausible loaders for the kernel (bzImage and ELF), so I'd like to decouple the loader and the image. Paolo
On 30.04.20 13:58, Paolo Bonzini wrote: > > On 30/04/20 13:47, Alexander Graf wrote: >>> >>> So the issue would be that a firmware image provided by the parent could >>> be tampered with by something malicious running in the parent enclave? >> >> You have to have a root of trust somewhere. That root then checks and >> attests everything it runs. What exactly would you attest for with a >> flat address space model? >> >> So the issue is that the enclave code can not trust its own integrity if >> it doesn't have anything at a higher level attesting it. The way this is >> usually solved on bare metal systems is that you trust your CPU which >> then checks the firmware integrity (Boot Guard). Where would you put >> that check in a VM model? > > In the enclave device driver, I would just limit the attestation to the > firmware image. > > So yeah it wouldn't be a mode where ne_load_image is not invoked and > the enclave starts in real mode at 0xfffffff0. You would still need > "load image" functionality. > >> How close would it be to a normal VM then? And >> if it's not, what's the point of sticking to such terrible legacy boot >> paths? > > The point is that there are already two plausible loaders for the kernel > (bzImage and ELF), so I'd like to decouple the loader and the image. The loader is implemented by the enclave device. If it wishes to support bzImage and ELF it does that. Today, it only does bzImage though IIRC :). So yes, they are decoupled? Are you saying you would like to build your own code in any way you like? Well, that means we either need to add support for another loader in the enclave device or your workload just fakes a bzImage header and gets loaded regardless :). Alex
On 29/04/2020 16:20, Paolo Bonzini wrote: > On 28/04/20 17:07, Alexander Graf wrote: >>> So why not just start running the enclave at 0xfffffff0 in real mode? >>> Yes everybody hates it, but that's what OSes are written against. In >>> the simplest example, the parent enclave can load bzImage and initrd at >>> 0x10000 and place firmware tables (MPTable and DMI) somewhere at >>> 0xf0000; the firmware would just be a few movs to segment registers >>> followed by a long jmp. >> There is a bit of initial attestation flow in the enclave, so that >> you can be sure that the code that is running is actually what you wanted to >> run. > Can you explain this, since it's not documented? Hash values are computed for the entire enclave image (EIF), the kernel and ramdisk(s). That's used, for example, to check that the enclave image that is loaded in the enclave VM is the one that was intended to be run. These crypto measurements are included in a signed attestation document generated by the Nitro Hypervisor and further used to prove the identity of the enclave. KMS is an example of a service that NE is integrated with and that checks the attestation doc. > >> vm = ne_create(vcpus = 4) >> ne_set_memory(vm, hva, len) >> ne_load_image(vm, addr, len) >> ne_start(vm) >> >> That way we would get the EIF loading into kernel space. "LOAD_IMAGE" >> would only be available in the time window between set_memory and start. >> It basically implements a memcpy(), but it would completely hide the >> hidden semantics of where an EIF has to go, so future device versions >> (or even other enclave implementers) could change the logic. >> >> I think it also makes sense to just allocate those 4 ioctls from >> scratch. Paolo, would you still want to "donate" KVM ioctl space in that >> case? > Sure, that's not a problem. Ok, thanks for confirmation.
I've updated the ioctl number documentation to reflect the ioctl space update, taking into account the previous discussion; and now, given also the proposal above from Alex, the discussions we currently have and considering further easy extensibility of the user space interface. Thanks, Andra >> Overall, the above should address most of the concerns you raised in >> this mail, right? It still requires copying, but at least we don't have >> to keep the copy in kernel space. Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
Hi! > > it uses its own memory and CPUs + its virtio-vsock emulated device for > > communication with the primary VM. > > > > The memory and CPUs are carved out of the primary VM, they are dedicated > > for the enclave. The Nitro hypervisor running on the host ensures memory > > and CPU isolation between the primary VM and the enclave VM. > > > > These two components need to reflect the same state e.g. when the > > enclave abstraction process (1) is terminated, the enclave VM (2) is > > terminated as well. > > > > With regard to the communication channel, the primary VM has its own > > emulated virtio-vsock PCI device. The enclave VM has its own emulated > > virtio-vsock device as well. This channel is used, for example, to fetch > > data in the enclave and then process it. An application that sets up the > > vsock socket and connects or listens, depending on the use case, is then > > developed to use this channel; this happens on both ends - primary VM > > and enclave VM. > > > > Let me know if further clarifications are needed. > > Thanks, this is all useful. However can you please clarify the > low-level details here? Is the virtual machine manager open-source? If so, I guess pointer for sources would be useful. Best regards, Pavel
On 07/05/2020 20:44, Pavel Machek wrote: > > Hi! > >>> it uses its own memory and CPUs + its virtio-vsock emulated device for >>> communication with the primary VM. >>> >>> The memory and CPUs are carved out of the primary VM, they are dedicated >>> for the enclave. The Nitro hypervisor running on the host ensures memory >>> and CPU isolation between the primary VM and the enclave VM. >>> >>> These two components need to reflect the same state e.g. when the >>> enclave abstraction process (1) is terminated, the enclave VM (2) is >>> terminated as well. >>> >>> With regard to the communication channel, the primary VM has its own >>> emulated virtio-vsock PCI device. The enclave VM has its own emulated >>> virtio-vsock device as well. This channel is used, for example, to fetch >>> data in the enclave and then process it. An application that sets up the >>> vsock socket and connects or listens, depending on the use case, is then >>> developed to use this channel; this happens on both ends - primary VM >>> and enclave VM. >>> >>> Let me know if further clarifications are needed. >> Thanks, this is all useful. However can you please clarify the >> low-level details here? > Is the virtual machine manager open-source? If so, I guess pointer for sources > would be useful. Hi Pavel, Thanks for reaching out. The VMM that is used for the primary / parent VM is not open source. Andra
On Fri 2020-05-08 10:00:27, Paraschiv, Andra-Irina wrote: > > > On 07/05/2020 20:44, Pavel Machek wrote: > > > >Hi! > > > >>>it uses its own memory and CPUs + its virtio-vsock emulated device for > >>>communication with the primary VM. > >>> > >>>The memory and CPUs are carved out of the primary VM, they are dedicated > >>>for the enclave. The Nitro hypervisor running on the host ensures memory > >>>and CPU isolation between the primary VM and the enclave VM. > >>> > >>>These two components need to reflect the same state e.g. when the > >>>enclave abstraction process (1) is terminated, the enclave VM (2) is > >>>terminated as well. > >>> > >>>With regard to the communication channel, the primary VM has its own > >>>emulated virtio-vsock PCI device. The enclave VM has its own emulated > >>>virtio-vsock device as well. This channel is used, for example, to fetch > >>>data in the enclave and then process it. An application that sets up the > >>>vsock socket and connects or listens, depending on the use case, is then > >>>developed to use this channel; this happens on both ends - primary VM > >>>and enclave VM. > >>> > >>>Let me know if further clarifications are needed. > >>Thanks, this is all useful. However can you please clarify the > >>low-level details here? > >Is the virtual machine manager open-source? If so, I guess pointer for sources > >would be useful. > > Hi Pavel, > > Thanks for reaching out. > > The VMM that is used for the primary / parent VM is not open source. Do we want to merge code that opensource community can not test? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Sat, 2020-05-09 at 21:21 +0200, Pavel Machek wrote: > > On Fri 2020-05-08 10:00:27, Paraschiv, Andra-Irina wrote: > > > > > > On 07/05/2020 20:44, Pavel Machek wrote: > > > > > > Hi! > > > > > > > > it uses its own memory and CPUs + its virtio-vsock emulated device for > > > > > communication with the primary VM. > > > > > > > > > > The memory and CPUs are carved out of the primary VM, they are dedicated > > > > > for the enclave. The Nitro hypervisor running on the host ensures memory > > > > > and CPU isolation between the primary VM and the enclave VM. > > > > > > > > > > These two components need to reflect the same state e.g. when the > > > > > enclave abstraction process (1) is terminated, the enclave VM (2) is > > > > > terminated as well. > > > > > > > > > > With regard to the communication channel, the primary VM has its own > > > > > emulated virtio-vsock PCI device. The enclave VM has its own emulated > > > > > virtio-vsock device as well. This channel is used, for example, to fetch > > > > > data in the enclave and then process it. An application that sets up the > > > > > vsock socket and connects or listens, depending on the use case, is then > > > > > developed to use this channel; this happens on both ends - primary VM > > > > > and enclave VM. > > > > > > > > > > Let me know if further clarifications are needed. > > > > > > > > Thanks, this is all useful. However can you please clarify the > > > > low-level details here? > > > > > > Is the virtual machine manager open-source? If so, I guess pointer for sources > > > would be useful. > > > > Hi Pavel, > > > > Thanks for reaching out. > > > > The VMM that is used for the primary / parent VM is not open source. > > Do we want to merge code that opensource community can not test? Hehe.. this isn't quite the story Pavel :) We merge support for proprietary hypervisors, this is no different. You can test it, well at least you'll be able to ... when AWS deploys the functionality. 
You don't need the hypervisor itself to be open source. In fact, in this case, it's not even low level invasive arch code like some of the above can be. It's a driver for a PCI device :-) Granted a virtual one. We merge drivers for PCI devices routinely without the RTL or firmware of those devices being open source. So yes, we probably want this if it's going to be a useful feature to users when running on AWS EC2. (Disclaimer: I work for AWS these days). Cheers, Ben.
On 10/05/2020 14:02, Herrenschmidt, Benjamin wrote: > On Sat, 2020-05-09 at 21:21 +0200, Pavel Machek wrote: >> On Fri 2020-05-08 10:00:27, Paraschiv, Andra-Irina wrote: >>> >>> On 07/05/2020 20:44, Pavel Machek wrote: >>>> Hi! >>>> >>>>>> it uses its own memory and CPUs + its virtio-vsock emulated device for >>>>>> communication with the primary VM. >>>>>> >>>>>> The memory and CPUs are carved out of the primary VM, they are dedicated >>>>>> for the enclave. The Nitro hypervisor running on the host ensures memory >>>>>> and CPU isolation between the primary VM and the enclave VM. >>>>>> >>>>>> These two components need to reflect the same state e.g. when the >>>>>> enclave abstraction process (1) is terminated, the enclave VM (2) is >>>>>> terminated as well. >>>>>> >>>>>> With regard to the communication channel, the primary VM has its own >>>>>> emulated virtio-vsock PCI device. The enclave VM has its own emulated >>>>>> virtio-vsock device as well. This channel is used, for example, to fetch >>>>>> data in the enclave and then process it. An application that sets up the >>>>>> vsock socket and connects or listens, depending on the use case, is then >>>>>> developed to use this channel; this happens on both ends - primary VM >>>>>> and enclave VM. >>>>>> >>>>>> Let me know if further clarifications are needed. >>>>> Thanks, this is all useful. However can you please clarify the >>>>> low-level details here? >>>> Is the virtual machine manager open-source? If so, I guess pointer for sources >>>> would be useful. >>> Hi Pavel, >>> >>> Thanks for reaching out. >>> >>> The VMM that is used for the primary / parent VM is not open source. >> Do we want to merge code that opensource community can not test? > Hehe.. this isn't quite the story Pavel :) > > We merge support for proprietary hypervisors, this is no different. You > can test it, well at least you'll be able to ... when AWS deploys the > functionality. You don't need the hypervisor itself to be open source. 
> > In fact, in this case, it's not even low level invasive arch code like > some of the above can be. It's a driver for a PCI device :-) Granted a > virtual one. We merge drivers for PCI devices routinely without the RTL > or firmware of those devices being open source. > > So yes, we probably want this if it's going to be a useful features to > users when running on AWS EC2. (Disclaimer: I work for AWS these days). Indeed, it will be available for checking out how it works. The discussions are ongoing here on the LKML - understanding the context, clarifying items, sharing feedback and coming with codebase updates and basic example flow of the ioctl interface usage. This all helps with the path towards merging. Thanks, Ben, for the follow-up. Andra
On 10/05/2020 12:57, Li Qiang wrote: > > > Paraschiv, Andra-Irina <andraprs@amazon.com > <mailto:andraprs@amazon.com>> wrote on Fri, 24 Apr 2020, 10:03 PM: > > > > On 24/04/2020 12:59, Tian, Kevin wrote: > > > >> From: Paraschiv, Andra-Irina > >> Sent: Thursday, April 23, 2020 9:20 PM > >> > >> On 22/04/2020 00:46, Paolo Bonzini wrote: > >>> On 21/04/20 20:41, Andra Paraschiv wrote: > >>>> An enclave communicates with the primary VM via a local > communication > >> channel, > >>>> using virtio-vsock [2]. An enclave does not have a disk or a > network device > >>>> attached. > >>> Is it possible to have a sample of this in the samples/ directory? > >> I can add in v2 a sample file including the basic flow of how > to use the > >> ioctl interface to create / terminate an enclave. > >> > >> Then we can update / build on top it based on the ongoing > discussions on > >> the patch series and the received feedback. > >> > >>> I am interested especially in: > >>> > >>> - the initial CPU state: CPL0 vs. CPL3, initial program > counter, etc. > >>> > >>> - the communication channel; does the enclave see the usual > local APIC > >>> and IOAPIC interfaces in order to get interrupts from > virtio-vsock, and > >>> where is the virtio-vsock device (virtio-mmio I suppose) > placed in memory? > >>> > >>> - what the enclave is allowed to do: can it change privilege > levels, > >>> what happens if the enclave performs an access to nonexistent > memory, > >> etc. > >>> - whether there are special hypercall interfaces for the enclave > >> An enclave is a VM, running on the same host as the primary VM, > that > >> launched the enclave. They are siblings. > >> > >> Here we need to think of two components: > >> > >> 1. An enclave abstraction process - a process running in the > primary VM > >> guest, that uses the provided ioctl interface of the Nitro Enclaves > >> kernel driver to spawn an enclave VM (that's 2 below). > >> > >> How does all gets to an enclave VM running on the host? 
> >> > >> There is a Nitro Enclaves emulated PCI device exposed to the > primary VM. > >> The driver for this new PCI device is included in the current > patch series. > >> > >> The ioctl logic is mapped to PCI device commands e.g. the > >> NE_ENCLAVE_START ioctl maps to an enclave start PCI command or the > >> KVM_SET_USER_MEMORY_REGION maps to an add memory PCI command. > >> The PCI > >> device commands are then translated into actions taken on the > hypervisor > >> side; that's the Nitro hypervisor running on the host where the > primary > >> VM is running. > >> > >> 2. The enclave itself - a VM running on the same host as the > primary VM > >> that spawned it. > >> > >> The enclave VM has no persistent storage or network interface > attached, > >> it uses its own memory and CPUs + its virtio-vsock emulated > device for > >> communication with the primary VM. > > sounds like a firecracker VM? > > It's a VM crafted for enclave needs. > > > > >> The memory and CPUs are carved out of the primary VM, they are > dedicated > >> for the enclave. The Nitro hypervisor running on the host > ensures memory > >> and CPU isolation between the primary VM and the enclave VM. > > In last paragraph, you said that the enclave VM uses its own > memory and > > CPUs. Then here, you said the memory/CPUs are carved out and > dedicated > > from the primary VM. Can you elaborate which one is accurate? or > a mixed > > model? > > Memory and CPUs are carved out of the primary VM and are dedicated > for > the enclave VM. I mentioned above as "its own" in the sense that the > primary VM doesn't use these carved out resources while the > enclave is > running, as they are dedicated to the enclave. > > Hope that now it's more clear. > > > > >> > >> These two components need to reflect the same state e.g. when the > >> enclave abstraction process (1) is terminated, the enclave VM > (2) is > >> terminated as well. 
> >> > >> With regard to the communication channel, the primary VM has > its own > >> emulated virtio-vsock PCI device. The enclave VM has its own > emulated > >> virtio-vsock device as well. This channel is used, for example, > to fetch > >> data in the enclave and then process it. An application that > sets up the > >> vsock socket and connects or listens, depending on the use > case, is then > >> developed to use this channel; this happens on both ends - > primary VM > >> and enclave VM. > > How does the application in the primary VM assign task to be > executed > > in the enclave VM? I didn't see such command in this series, so > suppose > > it is also communicated through virtio-vsock? > > The application that runs in the enclave needs to be packaged in an > enclave image together with the OS ( e.g. kernel, ramdisk, init ) > that > will run in the enclave VM. > > Then the enclave image is loaded in memory. After booting is > finished, > the application starts. Now, depending on the app implementation > and use > case, one example can be that the app in the enclave waits for > data to > be fetched in via the vsock channel. > > > Hi Paraschiv, > > So here the custom's application should be programmed to respect the > enclave VM spec, > and can't be any binary, right? And also the application in enclave > can't use any other IO > except the vsock? Hi, The application running in the enclave should be built so that it uses the available exposed functionality e.g. the vsock comm channel. With regard to I/O, vsock is the means to interact with the primary / parent VM. The enclave VM doesn't have a network interface attached or persistent storage. There is also an exposed device in the enclave, for the attestation flow e.g. to get the signed attestation document generated by the Nitro Hypervisor on the host where the primary VM and the enclave VM run. 
From a previous mail thread on LKML, where I added a couple of clarifications on the attestation flow: " Hash values are computed for the entire enclave image (EIF), the kernel and ramdisk(s). That's used, for example, to check that the enclave image that is loaded in the enclave VM is the one that was intended to be run. These crypto measurements are included in a signed attestation document generated by the Nitro Hypervisor and further used to prove the identity of the enclave. KMS is an example of service that NE is integrated with and that checks the attestation doc. " Thanks, Andra > > > > >> Let me know if further clarifications are needed. > >> > >>>> The proposed solution is following the KVM model and uses the > KVM API > >> to be able > >>>> to create and set resources for enclaves. An additional ioctl > command, > >> besides > >>>> the ones provided by KVM, is used to start an enclave and > setup the > >> addressing > >>>> for the communication channel and an enclave unique id. > >>> Reusing some KVM ioctls is definitely a good idea, but I > wouldn't really > >>> say it's the KVM API since the VCPU file descriptor is > basically non > >>> functional (without KVM_RUN and mmap it's not really the KVM API). > >> It uses part of the KVM API or a set of KVM ioctls to model the > way a VM > >> is created / terminated. That's true, KVM_RUN and mmap-ing the > vcpu fd > >> are not included. > >> > >> Thanks for the feedback regarding the reuse of KVM ioctls. > >> > >> Andra > >> > > Thanks > > Kevin
On Sun, May 10, 2020 at 11:02:18AM +0000, Herrenschmidt, Benjamin wrote: > On Sat, 2020-05-09 at 21:21 +0200, Pavel Machek wrote: > > > > On Fri 2020-05-08 10:00:27, Paraschiv, Andra-Irina wrote: > > > > > > > > > On 07/05/2020 20:44, Pavel Machek wrote: > > > > > > > > Hi! > > > > > > > > > > it uses its own memory and CPUs + its virtio-vsock emulated device for > > > > > > communication with the primary VM. > > > > > > > > > > > > The memory and CPUs are carved out of the primary VM, they are dedicated > > > > > > for the enclave. The Nitro hypervisor running on the host ensures memory > > > > > > and CPU isolation between the primary VM and the enclave VM. > > > > > > > > > > > > These two components need to reflect the same state e.g. when the > > > > > > enclave abstraction process (1) is terminated, the enclave VM (2) is > > > > > > terminated as well. > > > > > > > > > > > > With regard to the communication channel, the primary VM has its own > > > > > > emulated virtio-vsock PCI device. The enclave VM has its own emulated > > > > > > virtio-vsock device as well. This channel is used, for example, to fetch > > > > > > data in the enclave and then process it. An application that sets up the > > > > > > vsock socket and connects or listens, depending on the use case, is then > > > > > > developed to use this channel; this happens on both ends - primary VM > > > > > > and enclave VM. > > > > > > > > > > > > Let me know if further clarifications are needed. > > > > > > > > > > Thanks, this is all useful. However can you please clarify the > > > > > low-level details here? > > > > > > > > Is the virtual machine manager open-source? If so, I guess pointer for sources > > > > would be useful. > > > > > > Hi Pavel, > > > > > > Thanks for reaching out. > > > > > > The VMM that is used for the primary / parent VM is not open source. > > > > Do we want to merge code that opensource community can not test? > > Hehe.. 
this isn't quite the story Pavel :) > > We merge support for proprietary hypervisors, this is no different. You > can test it, well at least you'll be able to ... when AWS deploys the > functionality. You don't need the hypervisor itself to be open source. > > In fact, in this case, it's not even low level invasive arch code like > some of the above can be. It's a driver for a PCI device :-) Granted a > virtual one. We merge drivers for PCI devices routinely without the RTL > or firmware of those devices being open source. > > So yes, we probably want this if it's going to be a useful features to > users when running on AWS EC2. (Disclaimer: I work for AWS these days). I agree that the VMM does not need to be open source. What is missing though are details of the enclave's initial state and the image format required to boot code. Until this documentation is available only Amazon can write a userspace application that does anything useful with this driver. Some of the people from Amazon are long-time Linux contributors (such as yourself!) and the intent to publish this information has been expressed, so I'm sure that will be done. Until then, it's cool but no one else can play with it. Stefan