Message ID | 20220722230241.1944655-1-avagin@google.com
---|---
Series | KVM/x86: add a new hypercall to execute host system
+x86 maintainers, patch 1 most definitely needs acceptance from folks beyond KVM.

On Fri, Jul 22, 2022, Andrei Vagin wrote:
> Another option is the KVM platform. In this case, the Sentry (gVisor
> kernel) can run in a guest ring0 and create/manage multiple address
> spaces. Its performance is much better than the ptrace one, but it is
> still not great compared with the native performance. This change
> optimizes the most critical part, which is the syscall overhead.

What exactly is the source of the syscall overhead, and what alternatives have been explored? Making arbitrary syscalls from within KVM is mildly terrifying.

> The idea of using vmcall to execute system calls isn’t new. Two large users
> of gVisor (Google and AntFinancial) have out-of-tree code to implement such
> hypercalls.
>
> In the Google kernel, we have a kvm-like subsystem designed especially
> for gVisor. This change is the first step of integrating it into the KVM
> code base and making it available to all Linux users.

Can you please lay out the complete set of changes that you will be proposing? Doesn't have to be gory details, but at a minimum there needs to be a high level description that very clearly defines the scope of what changes you want to make and what the end result will look like.

It's practically impossible to review this series without first understanding the bigger picture, e.g. if KVM_HC_HOST_SYSCALL is ultimately useless without the other bits you plan to upstream, then merging it without a high level of confidence that the other bits are acceptable is a bad idea since it commits KVM to supporting unused ABI.
On Fri, Jul 22, 2022 at 4:41 PM Sean Christopherson <seanjc@google.com> wrote:
>
> +x86 maintainers, patch 1 most definitely needs acceptance from folks beyond KVM.
>
> On Fri, Jul 22, 2022, Andrei Vagin wrote:
> > Another option is the KVM platform. In this case, the Sentry (gVisor
> > kernel) can run in a guest ring0 and create/manage multiple address
> > spaces. Its performance is much better than the ptrace one, but it is
> > still not great compared with the native performance. This change
> > optimizes the most critical part, which is the syscall overhead.
>
> What exactly is the source of the syscall overhead,

Here are perf traces for two cases: when "guest" syscalls are executed via hypercalls and when syscalls are executed by the user-space VMM:
https://gist.github.com/avagin/f50a6d569440c9ae382281448c187f4e

And here are two tests that I use to collect these traces:
https://github.com/avagin/linux-task-diag/commit/4e19c7007bec6a15645025c337f2e85689b81f99

If we compare these traces, we can find that in the second case, we spend extra time in vmx_prepare_switch_to_guest, fpu_swap_kvm_fpstate, vcpu_put, syscall_exit_to_user_mode.

> and what alternatives have been explored? Making arbitrary syscalls from
> within KVM is mildly terrifying.

"Mildly terrifying" is a fair way to put it in this case :). If I were in your place, I would think about it similarly.

I understand these concerns about calling syscalls from the KVM code, and this is why I hide this feature under a separate capability that can be enabled explicitly.

We can think about restricting the list of system calls that this hypercall can execute. In the user-space changes for gVisor, we have a list of system calls that are not executed via this hypercall. For example, sigprocmask is never executed by this hypercall, because the kvm vcpu has its own signal mask. Another example is the ioctl syscall, because it can be one of the kvm ioctls.

As for alternatives, we have explored a few different approaches:

== Host Ring3/Guest ring0 mixed mode ==

This is how the gVisor KVM platform works right now. We don't have a separate hypervisor; the Sentry takes on that role itself. The Sentry creates a KVM virtual machine instance, sets it up, and handles VMEXITs. As a result, the Sentry runs in the host ring3 and the guest ring0 and can transparently switch between these two contexts.

When the Sentry starts, it creates a new kernel VM instance and maps its memory into the guest physical address space. Then it builds a set of page tables for the Sentry that mirrors the host virtual address space. Because the host and guest address spaces are identical, the Sentry can switch between these two contexts.

The bluepill function switches the Sentry into guest ring0. It executes a privileged instruction (CLI) that is a no-op in the guest (by design, since we ensure interrupts are disabled for guest ring0 execution) and triggers a signal on the host. The signal is handled by bluepillHandler, which takes a virtual CPU and runs it with the current thread state grabbed from the signal frame.

As in regular VMs, user processes have their own address spaces (page tables) and run in guest ring3. So when the Sentry is going to execute a user process, it needs to be sure that it is running inside the VM, and this is exactly the point where it calls bluepill(). Then it executes the user process with its page tables until the process triggers an exception or a system call. All such events are trapped and handled in the Sentry.

The Sentry is a normal Linux process that can trigger a fault and execute system calls.
To handle these events, the Sentry returns to the host mode. If the ring0 sysenter or exception entry point detects an event from the Sentry, it saves the current thread state in a per-CPU structure and triggers a VMEXIT. This returns us to bluepillHandler, where we set the thread state in the signal frame and exit from the signal handler, so the Sentry resumes from the point where it was in the VM.

In this scheme, the sentry syscall time is 3600ns. This is for the case when a system call is issued from gr0.

The benefit of this approach is that only the first system call triggers a vmexit and all subsequent syscalls are executed on the host natively.

But it has downsides:
* Each sentry system call triggers a full exit to hr3.
* Each vmenter/vmexit requires triggering a signal, which is expensive.
* It doesn't allow us to support Confidential Computing (SEV-ES/SGX). The Sentry
has to be fully enclosed in a VM to be able to support these technologies.

== Execute system calls from a user-space VMM ==

In this case, the Sentry is always running in the VM: a syscall handler in GR0 triggers a vmexit to transfer control to the VMM (a user process running in hr3), the VMM executes the required system call and transfers control back to the Sentry. We can say that it implements the suggested hypercall in user-space.

The sentry syscall time is 2100ns in this case.

The new hypercall does the same but without switching to host ring 3. It reduces the sentry syscall time to 1000ns.

== A new BPF hook to handle vmexits ==

https://github.com/avagin/linux-task-diag/commits/kvm-bpf

This approach allows us to reach the same performance numbers, but it gives less control over who uses this functionality and how. Second, it requires adding a few questionable BPF helpers, such as calling syscalls from BPF hooks.

== Non-KVM platforms ==

We are experimenting with non-KVM platforms. We have the ptrace platform, but it is useful almost only for experiments due to the slowness of the ptrace interface.

Another idea was to add the process_vm_exec system call:
https://lwn.net/Articles/852662/

This system call can significantly increase performance compared with the ptrace platform, but it is still slower than the KVM platform in its current form (without the new hypercall). But this is true only if we run the KVM platform on bare metal. In the case of nested virtualization, the KVM platform becomes much slower, which is expected.

We have another idea to use seccomp notify to trap system calls, but it requires some kernel changes to reach reasonable performance. I am working on these changes and will present them soon.

I want to emphasize that non-KVM platforms don't allow us to implement the confidential computing concept in gVisor, but this is one of our main goals concerning the KVM platform.

All the numbers above were collected on the same host (Xeon(R) Gold 6268CL, 5.19-rc5).

>
> > The idea of using vmcall to execute system calls isn’t new. Two large users
> > of gVisor (Google and AntFinancial) have out-of-tree code to implement such
> > hypercalls.
> >
> > In the Google kernel, we have a kvm-like subsystem designed especially
> > for gVisor. This change is the first step of integrating it into the KVM
> > code base and making it available to all Linux users.
>
> Can you please lay out the complete set of changes that you will be proposing?
> Doesn't have to be gory details, but at a minimum there needs to be a high level
> description that very clearly defines the scope of what changes you want to make
> and what the end result will look like.
>
> It's practically impossible to review this series without first understanding the
> bigger picture, e.g. if KVM_HC_HOST_SYSCALL is ultimately useless without the other
> bits you plan to upstream, then merging it without a high level of confidence that
> the other bits are acceptable is a bad idea since it commits KVM to supporting
> unused ABI.

I was not precise in my description. This is the only change that we need right now. The gVisor KVM platform is a real thing that exists today and works on upstream kernels:
https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/platform/kvm/

This hypercall improves its performance and makes it comparable with the Google-internal platform.

Thanks,
Andrei
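For context, the per-syscall figures in the message above (3600ns, 2100ns, 1000ns) are the kind of number produced by timing a cheap system call in a tight loop. A minimal, generic sketch of such a measurement in C (an illustration, not the exact test from the linked linux-task-diag commit) is:

/*
 * Generic syscall-latency microbenchmark (illustrative sketch, not the
 * test from the linked commit). Run natively it measures the host syscall
 * cost; run inside a sandbox it measures the sandboxed syscall path.
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
    const long iterations = 1000000;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++)
        syscall(SYS_getpid);    /* a cheap syscall: measures round-trip cost */
    clock_gettime(CLOCK_MONOTONIC, &end);

    long long ns = (end.tv_sec - start.tv_sec) * 1000000000LL +
                   (end.tv_nsec - start.tv_nsec);
    printf("%lld ns per syscall\n", ns / iterations);
    return 0;
}

Running it natively gives the host syscall cost; running it under runsc gives the sentry syscall path being discussed.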
On 7/26/22 10:33, Andrei Vagin wrote:
> We can think about restricting the list of system calls that this hypercall can
> execute. In the user-space changes for gVisor, we have a list of system calls
> that are not executed via this hypercall. For example, sigprocmask is never
> executed by this hypercall, because the kvm vcpu has its own signal mask. Another
> example is the ioctl syscall, because it can be one of the kvm ioctls.

The main issue I have is that the system call addresses are not translated. On one hand, I understand why it's done like this; it's pretty much impossible to do it without duplicating half of the sentry in the host kernel. And the KVM API you're adding is certainly sensible. On the other hand, this makes the hypercall even more specialized, as it depends on the guest's memslot layout, and not self-sufficient, in the sense that the sandbox isn't secure without prior copying and validation of arguments in guest ring0.

> == Host Ring3/Guest ring0 mixed mode ==
>
> This is how the gVisor KVM platform works right now. We don't have a separate
> hypervisor; the Sentry takes on that role itself. The Sentry creates a KVM virtual
> machine instance, sets it up, and handles VMEXITs. As a result, the Sentry runs
> in the host ring3 and the guest ring0 and can transparently switch between
> these two contexts.
>
> In this scheme, the sentry syscall time is 3600ns. This is for the case when a
> system call is issued from gr0.
>
> The benefit of this approach is that only the first system call triggers a
> vmexit and all subsequent syscalls are executed on the host natively.
>
> But it has downsides:
> * Each sentry system call triggers a full exit to hr3.
> * Each vmenter/vmexit requires triggering a signal, which is expensive.
> * It doesn't allow us to support Confidential Computing (SEV-ES/SGX). The Sentry
> has to be fully enclosed in a VM to be able to support these technologies.
>
> == Execute system calls from a user-space VMM ==
>
> In this case, the Sentry is always running in the VM: a syscall handler in GR0
> triggers a vmexit to transfer control to the VMM (a user process running in
> hr3), the VMM executes the required system call and transfers control back to
> the Sentry. We can say that it implements the suggested hypercall in
> user-space.
>
> The sentry syscall time is 2100ns in this case.
>
> The new hypercall does the same but without switching to host ring 3. It
> reduces the sentry syscall time to 1000ns.

Yeah, ~3000 clock cycles is what I would expect.

What does it translate to in terms of benchmarks? For example a simple netperf/UDP_RR benchmark.

Paolo
On Tue, Jul 26, 2022, Andrei Vagin wrote:
> On Fri, Jul 22, 2022 at 4:41 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > +x86 maintainers, patch 1 most definitely needs acceptance from folks beyond KVM.
> >
> > On Fri, Jul 22, 2022, Andrei Vagin wrote:
> > > Another option is the KVM platform. In this case, the Sentry (gVisor
> > > kernel) can run in a guest ring0 and create/manage multiple address
> > > spaces. Its performance is much better than the ptrace one, but it is
> > > still not great compared with the native performance. This change
> > > optimizes the most critical part, which is the syscall overhead.
> >
> > What exactly is the source of the syscall overhead,
>
> Here are perf traces for two cases: when "guest" syscalls are executed via
> hypercalls and when syscalls are executed by the user-space VMM:
> https://gist.github.com/avagin/f50a6d569440c9ae382281448c187f4e
>
> And here are two tests that I use to collect these traces:
> https://github.com/avagin/linux-task-diag/commit/4e19c7007bec6a15645025c337f2e85689b81f99
>
> If we compare these traces, we can find that in the second case, we spend extra
> time in vmx_prepare_switch_to_guest, fpu_swap_kvm_fpstate, vcpu_put,
> syscall_exit_to_user_mode.

So of those, I think the only path a robust implementation can actually avoid, without significantly whittling down the allowed set of syscalls, is syscall_exit_to_user_mode().

The bulk of vcpu_put() is vmx_prepare_switch_to_host(), and KVM needs to run through that before calling out of KVM. E.g. prctl(ARCH_GET_GS) will read the wrong GS.base if MSR_KERNEL_GS_BASE isn't restored. And that necessitates calling vmx_prepare_switch_to_guest() when resuming the vCPU.

FPU state, i.e. fpu_swap_kvm_fpstate(), is likely a similar story: there's bound to be a syscall that accesses user FPU state and will do the wrong thing if guest state is loaded.

For gVisor, that's all presumably a non-issue because it uses a small set of syscalls (or has guest==host state?), but for a common KVM feature it's problematic.

> > and what alternatives have been explored? Making arbitrary syscalls from
> > within KVM is mildly terrifying.
>
> "Mildly terrifying" is a fair way to put it in this case :). If I were in your place,
> I would think about it similarly.
>
> I understand these concerns about calling syscalls from the KVM code, and this
> is why I hide this feature under a separate capability that can be enabled
> explicitly.
>
> We can think about restricting the list of system calls that this hypercall can
> execute. In the user-space changes for gVisor, we have a list of system calls
> that are not executed via this hypercall.

Can you provide that list?

> But it has downsides:
> * Each sentry system call triggers a full exit to hr3.
> * Each vmenter/vmexit requires triggering a signal, which is expensive.

Can you explain this one? I didn't quite follow what this is referring to.

> * It doesn't allow us to support Confidential Computing (SEV-ES/SGX). The Sentry
> has to be fully enclosed in a VM to be able to support these technologies.

Speaking of SGX, this reminds me a lot of Graphene, SCONEs, etc..., which IIRC tackled the "syscalls are crazy expensive" problem by using a message queue and a dedicated task outside of the enclave to handle syscalls. Would something like that work, or is having to burn a pCPU (or more) to handle syscalls in the host a non-starter?
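To make the GS.base example concrete, here is a hypothetical user-space illustration (not code from the series) of the kind of syscall that would observe stale guest state if KVM serviced it before vmx_prepare_switch_to_host() ran:

/* Hypothetical illustration: reading the calling thread's GS base.
 * If such a syscall were executed on the host before MSR_KERNEL_GS_BASE
 * is restored (i.e. before vmx_prepare_switch_to_host()), the value it
 * reports could be the guest's rather than the host thread's. */
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <asm/prctl.h>          /* ARCH_GET_GS */

int main(void)
{
    unsigned long gs_base = 0;

    if (syscall(SYS_arch_prctl, ARCH_GET_GS, &gs_base) == 0)
        printf("GS.base = %#lx\n", gs_base);
    return 0;
}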
On Fri, Jul 22 2022 at 23:41, Sean Christopherson wrote:
> +x86 maintainers, patch 1 most definitely needs acceptance from folks
> beyond KVM.

Thanks for putting us on CC. It seems to be incredibly hard to CC the relevant maintainers and to get the prefix in the subject straight.

> On Fri, Jul 22, 2022, Andrei Vagin wrote:
>> Another option is the KVM platform. In this case, the Sentry (gVisor
>> kernel) can run in a guest ring0 and create/manage multiple address
>> spaces. Its performance is much better than the ptrace one, but it is
>> still not great compared with the native performance. This change
>> optimizes the most critical part, which is the syscall overhead.
>
> What exactly is the source of the syscall overhead, and what alternatives have
> been explored? Making arbitrary syscalls from within KVM is mildly terrifying.

What's even worse is that this exposes a magic kernel syscall interface to random driver writers. Seriously no.

This approach is certainly a clever idea, but exposing this outside of a very restricted and controlled environment is a patently bad idea.

I skimmed the documentation on the project page:

  sudo modprobe kvm-intel && sudo chmod a+rw /dev/kvm

Can you spot the fail?

I gave up reading further, as shortly after that gem the page failed to render sensibly in Firefox. Hint: Graphics

What's completely missing from the cover letter _and_ from the project documentation is which subset of KVM functionality this is actually using and what the actual content of the "guest" looks like. It's all blurry handwaving and lots of marketing to me.

Thanks,

tglx
On Tue, Jul 26 2022 at 15:10, Sean Christopherson wrote:
> On Tue, Jul 26, 2022, Andrei Vagin wrote:
>> * It doesn't allow us to support Confidential Computing (SEV-ES/SGX). The Sentry
>> has to be fully enclosed in a VM to be able to support these technologies.
>
> Speaking of SGX, this reminds me a lot of Graphene, SCONEs, etc..., which IIRC
> tackled the "syscalls are crazy expensive" problem by using a message queue and
> a dedicated task outside of the enclave to handle syscalls. Would something like
> that work, or is having to burn a pCPU (or more) to handle syscalls in the host a
> non-starter?

Let's put VMs aside for a moment. The problem you are trying to solve is ptrace overhead because that requires context switching, right?

Did you ever try to solve this with SYSCALL_USER_DISPATCH? That requires signals, which are not cheap either, but we certainly could come up with a lightweight signal implementation for that particular use case.

Thanks,

tglx
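For reference, Syscall User Dispatch (available since Linux 5.11) is enabled per thread with prctl(); a minimal sketch of the mechanism Thomas refers to, with an illustrative selector variable and a trivial SIGSYS handler, looks like this:

/* Minimal Syscall User Dispatch sketch (Linux 5.11+ headers assumed).
 * A real user would emulate the trapped syscall in the handler and write
 * the result back into the ucontext registers. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <linux/prctl.h>
#include <unistd.h>

static volatile char selector = SYSCALL_DISPATCH_FILTER_ALLOW;
static volatile sig_atomic_t trapped;

static void sigsys_handler(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig; (void)ucontext;
    trapped = info->si_syscall;               /* the intercepted syscall number */
    selector = SYSCALL_DISPATCH_FILTER_ALLOW; /* let our own syscalls through again */
}

int main(void)
{
    struct sigaction sa = { .sa_sigaction = sigsys_handler, .sa_flags = SA_SIGINFO };

    sigaction(SIGSYS, &sa, NULL);

    /* No "always allowed" code region (offset=0, len=0); dispatch is
     * controlled purely by the selector byte. */
    prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, 0, 0, &selector);

    selector = SYSCALL_DISPATCH_FILTER_BLOCK;
    getpid();                                 /* trapped and delivered as SIGSYS */

    printf("trapped syscall nr %d\n", (int)trapped);
    return 0;
}

The prctl() affects only the calling thread, and the selector byte can be flipped cheaply from user space around regions whose syscalls must not be intercepted.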
On Tue, Jul 26, 2022 at 03:10:34PM +0000, Sean Christopherson wrote:
> On Tue, Jul 26, 2022, Andrei Vagin wrote:
> > On Fri, Jul 22, 2022 at 4:41 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > +x86 maintainers, patch 1 most definitely needs acceptance from folks beyond KVM.
> > >
> > > On Fri, Jul 22, 2022, Andrei Vagin wrote:
> > > > Another option is the KVM platform. In this case, the Sentry (gVisor
> > > > kernel) can run in a guest ring0 and create/manage multiple address
> > > > spaces. Its performance is much better than the ptrace one, but it is
> > > > still not great compared with the native performance. This change
> > > > optimizes the most critical part, which is the syscall overhead.
> > >
> > > What exactly is the source of the syscall overhead,
> >
> > Here are perf traces for two cases: when "guest" syscalls are executed via
> > hypercalls and when syscalls are executed by the user-space VMM:
> > https://gist.github.com/avagin/f50a6d569440c9ae382281448c187f4e
> >
> > And here are two tests that I use to collect these traces:
> > https://github.com/avagin/linux-task-diag/commit/4e19c7007bec6a15645025c337f2e85689b81f99
> >
> > If we compare these traces, we can find that in the second case, we spend extra
> > time in vmx_prepare_switch_to_guest, fpu_swap_kvm_fpstate, vcpu_put,
> > syscall_exit_to_user_mode.
>
> So of those, I think the only path a robust implementation can actually avoid,
> without significantly whittling down the allowed set of syscalls, is
> syscall_exit_to_user_mode().
>
> The bulk of vcpu_put() is vmx_prepare_switch_to_host(), and KVM needs to run
> through that before calling out of KVM. E.g. prctl(ARCH_GET_GS) will read the
> wrong GS.base if MSR_KERNEL_GS_BASE isn't restored. And that necessitates
> calling vmx_prepare_switch_to_guest() when resuming the vCPU.
>
> FPU state, i.e. fpu_swap_kvm_fpstate(), is likely a similar story: there's bound
> to be a syscall that accesses user FPU state and will do the wrong thing if guest
> state is loaded.
>
> For gVisor, that's all presumably a non-issue because it uses a small set of
> syscalls (or has guest==host state?), but for a common KVM feature it's problematic.

I think the number of system calls that touch state shared with KVM is very limited, and we can blocklist all of them. Another option is to have an allowlist of system calls to be sure that we don't miss anything.

>
> > > and what alternatives have been explored? Making arbitrary syscalls from
> > > within KVM is mildly terrifying.
> >
> > "Mildly terrifying" is a fair way to put it in this case :). If I were in your place,
> > I would think about it similarly.
> >
> > I understand these concerns about calling syscalls from the KVM code, and this
> > is why I hide this feature under a separate capability that can be enabled
> > explicitly.
> >
> > We can think about restricting the list of system calls that this hypercall can
> > execute. In the user-space changes for gVisor, we have a list of system calls
> > that are not executed via this hypercall.
>
> Can you provide that list?

Here is the list of system calls that are not executed via this hypercall: clone, exit, exit_group, ioctl, rt_sigreturn, mmap, arch_prctl, sigprocmask.
And here is the list of all system calls that we allow for the Sentry: clock_gettime, close, dup, dup3, epoll_create1, epoll_ctl, epoll_pwait, eventfd2, exit, exit_group, fallocate, fchmod, fcntl, fstat, fsync, ftruncate, futex, getcpu, getpid, getrandom, getsockopt, gettid, gettimeofday, ioctl, lseek, madvise, membarrier, mincore, mmap, mprotect, munmap, nanosleep, ppoll, pread64, preadv, preadv2, pwrite64, pwritev, pwritev2, read, recvmsg, recvmmsg, sendmsg, sendmmsg, restart_syscall, rt_sigaction, rt_sigprocmask, rt_sigreturn, sched_yield, setitimer, shutdown, sigaltstack, statx, sync_file_range, tee, timer_create, timer_delete, timer_settime, tgkill, utimensat, write, writev.

>
> > But it has downsides:
> > * Each sentry system call triggers a full exit to hr3.
> > * Each vmenter/vmexit requires triggering a signal, which is expensive.
>
> Can you explain this one? I didn't quite follow what this is referring to.

In my previous message, there was an explanation of how the gVisor KVM platform works right now, and these are two points about why it is slow. Each time the Sentry triggers a system call, it has to switch to the host ring3. When the Sentry wants to switch to the guest ring0, it triggers a signal so that it falls into a signal handler. There, we have a sigcontext that we use to get the current thread state to resume execution in gr0; then, when the Sentry needs to switch back to hr3, we write the Sentry state from gr0 into the sigcontext and return from the signal handler.

>
> > * It doesn't allow us to support Confidential Computing (SEV-ES/SGX). The Sentry
> > has to be fully enclosed in a VM to be able to support these technologies.
>
> Speaking of SGX, this reminds me a lot of Graphene, SCONEs, etc..., which IIRC
> tackled the "syscalls are crazy expensive" problem by using a message queue and
> a dedicated task outside of the enclave to handle syscalls. Would something like
> that work, or is having to burn a pCPU (or more) to handle syscalls in the host a
> non-starter?

Context switching is expensive... There have been a few attempts to implement synchronous context switching ([1], [2]) that could help in this case, but even with this sort of optimization, it is too expensive.

1. https://lwn.net/Articles/824409/
2. https://www.spinics.net/lists/linux-api/msg50417.html

Thanks,
Andrei
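Purely as an illustration of the kind of restriction being discussed (this is not code from the series, and the helper below is hypothetical), a host-side handler could check the syscall number against the blocklist quoted above before issuing it:

/* Hypothetical blocklist check mirroring the list above; illustrative only.
 * "sigprocmask" is represented by __NR_rt_sigprocmask on x86_64. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/syscall.h>

static const long blocked_syscalls[] = {
    __NR_clone, __NR_exit, __NR_exit_group, __NR_ioctl,
    __NR_rt_sigreturn, __NR_mmap, __NR_arch_prctl, __NR_rt_sigprocmask,
};

static bool syscall_is_blocked(long nr)
{
    for (size_t i = 0; i < sizeof(blocked_syscalls) / sizeof(blocked_syscalls[0]); i++)
        if (nr == blocked_syscalls[i])
            return true;
    return false;
}

int main(void)
{
    printf("ioctl blocked: %d\n", syscall_is_blocked(__NR_ioctl));  /* 1 */
    printf("write blocked: %d\n", syscall_is_blocked(__NR_write));  /* 0 */
    return 0;
}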
On Tue, Jul 26, 2022 at 3:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Tue, Jul 26 2022 at 15:10, Sean Christopherson wrote:
> > On Tue, Jul 26, 2022, Andrei Vagin wrote:
> >> * It doesn't allow us to support Confidential Computing (SEV-ES/SGX). The Sentry
> >> has to be fully enclosed in a VM to be able to support these technologies.
> >
> > Speaking of SGX, this reminds me a lot of Graphene, SCONEs, etc..., which IIRC
> > tackled the "syscalls are crazy expensive" problem by using a message queue and
> > a dedicated task outside of the enclave to handle syscalls. Would something like
> > that work, or is having to burn a pCPU (or more) to handle syscalls in the host a
> > non-starter?
>
> Let's put VMs aside for a moment. The problem you are trying to solve is
> ptrace overhead because that requires context switching, right?

Yes, you are right.

>
> Did you ever try to solve this with SYSCALL_USER_DISPATCH? That requires
> signals, which are not cheap either, but we certainly could come up with
> a lightweight signal implementation for that particular use case.

We thought about this interface and how it could be used for gVisor's needs. I think the main question is how to manage guest address spaces. gVisor can run multiple processes in one sandbox. Each process must have its address space isolated from the other address spaces. The gVisor kernel (Sentry) has to run in a separate address space that guest processes don't have access to, but the Sentry has to be able to access all other address spaces.

>
> Thanks,
>
> tglx
On Tue, Jul 26, 2022 at 3:27 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 7/26/22 10:33, Andrei Vagin wrote:
...
> > == Execute system calls from a user-space VMM ==
> >
> > In this case, the Sentry is always running in the VM: a syscall handler in GR0
> > triggers a vmexit to transfer control to the VMM (a user process running in
> > hr3), the VMM executes the required system call and transfers control back to
> > the Sentry. We can say that it implements the suggested hypercall in
> > user-space.
> >
> > The sentry syscall time is 2100ns in this case.
> >
> > The new hypercall does the same but without switching to host ring 3. It
> > reduces the sentry syscall time to 1000ns.
>
> Yeah, ~3000 clock cycles is what I would expect.
>
> What does it translate to in terms of benchmarks? For example a simple
> netperf/UDP_RR benchmark.

* netperf in gVisor with the syscall fast path:

$ ./runsc --platform kvm --network host --rootless do netperf -H ::1 -p 12865 -t UDP_RR
MIGRATED UDP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to ::1 (::1) port 0 AF_INET6 : interval : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

212992 212992 1        1       10.00    95965.18
212992 212992

* netperf in gVisor without the syscall fast path:

$ ./runsc.orig --platform kvm --network host --rootless do netperf -H ::1 -p 12865 -t UDP_RR
MIGRATED UDP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to ::1 (::1) port 0 AF_INET6 : interval : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

212992 212992 1        1       10.00    58709.17
212992 212992

* netperf executed on the host without gVisor:

$ netperf -H ::1 -p 12865 -t UDP_RR
MIGRATED UDP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to ::1 (::1) port 0 AF_INET6 : interval : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

212992 212992 1        1       10.00    146460.80
212992 212992

Thanks,
Andrei
On Tue, Jul 26, 2022 at 6:03 PM Andrei Vagin <avagin@google.com> wrote:
>
> On Tue, Jul 26, 2022 at 3:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > On Tue, Jul 26 2022 at 15:10, Sean Christopherson wrote:
> > > On Tue, Jul 26, 2022, Andrei Vagin wrote:
> > >> * It doesn't allow us to support Confidential Computing (SEV-ES/SGX). The Sentry
> > >> has to be fully enclosed in a VM to be able to support these technologies.
> > >
> > > Speaking of SGX, this reminds me a lot of Graphene, SCONEs, etc..., which IIRC
> > > tackled the "syscalls are crazy expensive" problem by using a message queue and
> > > a dedicated task outside of the enclave to handle syscalls. Would something like
> > > that work, or is having to burn a pCPU (or more) to handle syscalls in the host a
> > > non-starter?
> >
> > Let's put VMs aside for a moment. The problem you are trying to solve is
> > ptrace overhead because that requires context switching, right?
>
> Yes, you are right.
>
> >
> > Did you ever try to solve this with SYSCALL_USER_DISPATCH? That requires
> > signals, which are not cheap either, but we certainly could come up with
> > a lightweight signal implementation for that particular use case.

Thomas, I think the idea of a lightweight signal implementation could be interesting in a slightly different context.

I have a prototype of a gVisor platform that uses seccomp to trap guest system calls. Guest threads are running in stub processes that are used to manage guest address spaces. Each stub process sets seccomp filters to trap all system calls, and it has a signal handler for SIGTRAP, SIGSEGV, SIGFPE, SIGILL, and SIGBUS. Each time one of these signals is triggered, the signal handler notifies the Sentry about it.

This platform has two obvious problems:
* It requires context switching.
* Signals are expensive.

The first one can be solved with umcg, which allows doing synchronous context switches between processes. A lightweight signal implementation could solve the second problem. Do you have any concrete ideas on how to do that?

Thanks,
Andrei
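To make the seccomp-trap mechanism above concrete, here is a generic sketch of a stub process installing a trap-almost-everything filter with a SIGSYS handler. It illustrates the mechanism only, not the prototype platform itself: a few syscalls are left allowed so the demo can return from the handler, print, and exit, and a real filter would also check the architecture field.

/* Generic seccomp "trap (almost) every syscall" sketch; illustrative only. */
#define _GNU_SOURCE
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <signal.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static void sigsys_handler(int sig, siginfo_t *info, void *ucontext)
{
    /* A real handler would forward info->si_syscall to the supervisor and
     * write the emulated result back into the ucontext registers. */
    (void)sig; (void)info; (void)ucontext;
}

int main(void)
{
    struct sigaction sa = { .sa_sigaction = sigsys_handler, .sa_flags = SA_SIGINFO };
    struct sock_filter filter[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* Keep sigreturn, write and exit_group usable for this demo. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_rt_sigreturn, 3, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 2, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit_group, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };
    static const char msg[] = "getpid was trapped by the SIGSYS handler\n";

    sigaction(SIGSYS, &sa, NULL);
    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
    prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);

    getpid();   /* trapped: delivered to sigsys_handler, not executed */
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);
    return 0;
}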