diff mbox series

[RFC,30/33] KVM: x86: hyper-v: Introduce KVM_REQ_HV_INJECT_INTERCEPT request

Message ID 20231108111806.92604-31-nsaenz@amazon.com (mailing list archive)
State New, archived
Headers show
Series KVM: x86: hyperv: Introduce VSM support | expand

Commit Message

Nicolas Saenz Julienne Nov. 8, 2023, 11:18 a.m. UTC
Introduce a new request type, KVM_REQ_HV_INJECT_INTERCEPT which allows
injecting out-of-band Hyper-V secure intercepts. For now only memory
access intercepts are supported. These are triggered when access a GPA
protected by a higher VTL. The memory intercept metadata is filled based
on the GPA provided through struct kvm_vcpu_hv_intercept_info, and
injected into the guest through SynIC message.

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/include/asm/kvm_host.h |  10 +++
 arch/x86/kvm/hyperv.c           | 114 ++++++++++++++++++++++++++++++++
 arch/x86/kvm/hyperv.h           |   2 +
 arch/x86/kvm/x86.c              |   3 +
 4 files changed, 129 insertions(+)

Comments

Alexander Graf Nov. 8, 2023, 12:45 p.m. UTC | #1
On 08.11.23 12:18, Nicolas Saenz Julienne wrote:
> Introduce a new request type, KVM_REQ_HV_INJECT_INTERCEPT which allows
> injecting out-of-band Hyper-V secure intercepts. For now only memory
> access intercepts are supported. These are triggered when access a GPA
> protected by a higher VTL. The memory intercept metadata is filled based
> on the GPA provided through struct kvm_vcpu_hv_intercept_info, and
> injected into the guest through SynIC message.
>
> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>


IMHO memory protection violations should result in a user space exit. 
User space can then validate what to do with the violation and if 
necessary inject an intercept.

That means from an API point of view, you want a new exit reason 
(violation) and an ioctl that allows you to transmit the violating CPU 
state into the target vCPU. I don't think the injection should even know 
that the source of data for the violation was a vCPU.



Alex




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
Nicolas Saenz Julienne Nov. 8, 2023, 1:38 p.m. UTC | #2
On Wed Nov 8, 2023 at 12:45 PM UTC, Alexander Graf wrote:
>
> On 08.11.23 12:18, Nicolas Saenz Julienne wrote:
> > Introduce a new request type, KVM_REQ_HV_INJECT_INTERCEPT which allows
> > injecting out-of-band Hyper-V secure intercepts. For now only memory
> > access intercepts are supported. These are triggered when access a GPA
> > protected by a higher VTL. The memory intercept metadata is filled based
> > on the GPA provided through struct kvm_vcpu_hv_intercept_info, and
> > injected into the guest through SynIC message.
> >
> > Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
>
>
> IMHO memory protection violations should result in a user space exit. 

It already does, it's not very explicit from the patch itself, since the
functionality was introduced in through the "KVM: guest_memfd() and
per-page attributes" series [1].

See this snippet in patch #27:

+	if (kvm_hv_vsm_enabled(vcpu->kvm)) {
+		if (kvm_hv_faultin_pfn(vcpu, fault)) {
+			kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+			return -EFAULT;
+		}
+	}

Otherwise the doc in patch #33 also mentions this. :)

> User space can then validate what to do with the violation and if 
> necessary inject an intercept.

I do agree that secure intercept injection should be moved into to
user-space, and happen as a reaction to a user-space memory fault exit.
I was unable to do so yet, since the intercepts require a level of
introspection that is not yet available to QEMU. For example, providing
the length of the instruction that caused the fault. I'll work on
exposing the necessary information to user-space and move the whole
intercept concept there.

Nicolas

[1] https://lore.kernel.org/lkml/20231105163040.14904-1-pbonzini@redhat.com/.
Maxim Levitsky Nov. 28, 2023, 8:19 a.m. UTC | #3
On Wed, 2023-11-08 at 13:38 +0000, Nicolas Saenz Julienne wrote:
> On Wed Nov 8, 2023 at 12:45 PM UTC, Alexander Graf wrote:
> > On 08.11.23 12:18, Nicolas Saenz Julienne wrote:
> > > Introduce a new request type, KVM_REQ_HV_INJECT_INTERCEPT which allows
> > > injecting out-of-band Hyper-V secure intercepts. For now only memory
> > > access intercepts are supported. These are triggered when access a GPA
> > > protected by a higher VTL. The memory intercept metadata is filled based
> > > on the GPA provided through struct kvm_vcpu_hv_intercept_info, and
> > > injected into the guest through SynIC message.
> > > 
> > > Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> > 
> > IMHO memory protection violations should result in a user space exit. 
> 
> It already does, it's not very explicit from the patch itself, since the
> functionality was introduced in through the "KVM: guest_memfd() and
> per-page attributes" series [1].
> 
> See this snippet in patch #27:
> 
> +	if (kvm_hv_vsm_enabled(vcpu->kvm)) {
> +		if (kvm_hv_faultin_pfn(vcpu, fault)) {
> +			kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> +			return -EFAULT;
> +		}
> +	}
> 
> Otherwise the doc in patch #33 also mentions this. :)
> 
> > User space can then validate what to do with the violation and if 
> > necessary inject an intercept.
> 
> I do agree that secure intercept injection should be moved into to
> user-space, and happen as a reaction to a user-space memory fault exit.
> I was unable to do so yet, since the intercepts require a level of
> introspection that is not yet available to QEMU. For example, providing
> the length of the instruction that caused the fault. I'll work on
> exposing the necessary information to user-space and move the whole
> intercept concept there.

All the missing information should be included in the new userspace VM exit payload.


Also I would like to share my knowledge of SYNIC and how to deal with it in userspace,
because you will have to send lots of SYNC messages from userspace if we go with
the suggested approach of doing it in the userspace.

- SYNIC has one message per channel, so there is no way to queue more than one
message in the same channel. Usually only channel 0 is used (but I haven't researched
this much).

- In-kernel STIMER emulation queues Synic messages, but it always does this in
the vCPU thread by processing the request 'KVM_REQ_HV_STIMER', and when userspace
wants to queue something with SYNIC it also does this on vCPU thread, this is how
races are avoided.

kvm_hv_process_stimers -> stimer_expiration -> stimer_send_msg(stimer);

If the delivery fails (that is if SYNIC slot already has a message pending there),
then the timer remains pending and the next KVM_REQ_HV_STIMER request will attempt to
deliver it again.
 
After the guest processes a SYNIC message, it erases it by overwriting its message type with 0,
and then the guest notifies the hypervisor about a free slot by either doing a write
to a special MSR (HV_X64_MSR_EOM) or by EOI'ing the APIC interrupt.

According to my observation windows uses the second approach (EOI),
which thankfully works even on AVIC because the Sync Interrupts happen to be level triggered,
and AVIC does intercept level triggered EOI.

Once intercepted the EOI event triggers a delivery of an another stimer message via the vCPU thread,
by raising another KVM_REQ_HV_STIMER request on it.

kvm_hv_notify_acked_sint -> stimer_mark_pending -> kvm_make_request(KVM_REQ_HV_STIMER, vcpu);


Now if the userspace faces already full SYNIC slot, it has to wait, and I don't know if
it can be notified of an EOI or it busy waits somehow.

Note that Qemu's VMBUS/SYNC implementation was never tested in production IMHO, it was once implemented
and is only used currently by a few unit tests.

It might make sense to add userspace SYNIC message queuing to the kernel,
so that userspace could queue as many messages as it wants and let the kernel
copy the first message in the queue to the actual SYNIC slot, every time
it becomes free.


Final note on SYNIC is that qemu's synic code actually installs an overlay
page over sync slots area and writes to it when it queues a message, 
but the in-kernel stimer code just writes to the GPA regardless
if there is an overlay memslot or not.

Another benefit of a proper way of queuing SYNIC messages from the userspace,
is that it might enable the kernel's STIMER to queue the SYNIC message
directly from the timer interrupt routine, which will remove about 1000 vmexits per second that
are caused by KVM_REQ_HV_STIMER on vCPU0, even when posted timer interrupts are used.

I can implement this if you think that this makes sense.

Those are my 0.2 cents.

Best regards,
	Maxim Levitsky


> 
> Nicolas
> 
> [1] https://lore.kernel.org/lkml/20231105163040.14904-1-pbonzini@redhat.com/.
>
diff mbox series

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1a854776d91e..39671e075555 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -113,6 +113,7 @@ 
 	KVM_ARCH_REQ_FLAGS(31, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_HV_TLB_FLUSH \
 	KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+#define KVM_REQ_HV_INJECT_INTERCEPT	KVM_ARCH_REQ(33)
 
 #define CR0_RESERVED_BITS                                               \
 	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
@@ -639,6 +640,13 @@  struct kvm_vcpu_hv_tlb_flush_fifo {
 	DECLARE_KFIFO(entries, u64, KVM_HV_TLB_FLUSH_FIFO_SIZE);
 };
 
+struct kvm_vcpu_hv_intercept_info {
+	struct kvm_vcpu *vcpu;
+	int type;
+	u64 gpa;
+	u8 access;
+};
+
 /* Hyper-V per vcpu emulation context */
 struct kvm_vcpu_hv {
 	struct kvm_vcpu *vcpu;
@@ -673,6 +681,8 @@  struct kvm_vcpu_hv {
 		u64 vm_id;
 		u32 vp_id;
 	} nested;
+
+	struct kvm_vcpu_hv_intercept_info intercept_info;
 };
 
 struct kvm_hypervisor_cpuid {
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index eb6a4848e306..38ee3abdef9c 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -2789,6 +2789,120 @@  int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+static void store_kvm_segment(const struct kvm_segment *kvmseg,
+			      struct hv_x64_segment_register *reg)
+{
+	reg->base = kvmseg->base;
+	reg->limit = kvmseg->limit;
+	reg->selector = kvmseg->selector;
+	reg->segment_type = kvmseg->type;
+	reg->present = kvmseg->present;
+	reg->descriptor_privilege_level = kvmseg->dpl;
+	reg->_default = kvmseg->db;
+	reg->non_system_segment = kvmseg->s;
+	reg->_long = kvmseg->l;
+	reg->granularity = kvmseg->g;
+	reg->available = kvmseg->avl;
+}
+
+static void deliver_gpa_intercept(struct kvm_vcpu *target_vcpu,
+				  struct kvm_vcpu *intercepted_vcpu, u64 gpa,
+				  u64 gva, u8 access_type_mask)
+{
+	ulong cr0;
+	struct hv_message msg = { 0 };
+	struct hv_memory_intercept_message *intercept = (struct hv_memory_intercept_message *)msg.u.payload;
+	struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(target_vcpu);
+	struct x86_exception e;
+	struct kvm_segment kvmseg;
+
+	msg.header.message_type = HVMSG_GPA_INTERCEPT;
+	msg.header.payload_size = sizeof(*intercept);
+
+	intercept->header.vp_index = to_hv_vcpu(intercepted_vcpu)->vp_index;
+	intercept->header.instruction_length = intercepted_vcpu->arch.exit_instruction_len;
+	intercept->header.access_type_mask = access_type_mask;
+	kvm_x86_ops.get_segment(intercepted_vcpu, &kvmseg, VCPU_SREG_CS);
+	store_kvm_segment(&kvmseg, &intercept->header.cs);
+
+	cr0 = kvm_read_cr0(intercepted_vcpu);
+	intercept->header.exec_state.cr0_pe = (cr0 & X86_CR0_PE);
+	intercept->header.exec_state.cr0_am = (cr0 & X86_CR0_AM);
+	intercept->header.exec_state.cpl = kvm_x86_ops.get_cpl(intercepted_vcpu);
+	intercept->header.exec_state.efer_lma = is_long_mode(intercepted_vcpu);
+	intercept->header.exec_state.debug_active = 0;
+	intercept->header.exec_state.interruption_pending = 0;
+	intercept->header.rip = kvm_rip_read(intercepted_vcpu);
+	intercept->header.rflags = kvm_get_rflags(intercepted_vcpu);
+
+	/*
+	 * For exec violations we don't have a way to decode an instruction that issued a fetch
+	 * to a non-X page because CPU points RIP and GPA to the fetch destination in the faulted page.
+	 * Instruction length though is the length of the fetch source.
+	 * Seems like Hyper-V is aware of that and is not trying to access those fields.
+	 */
+	if (access_type_mask == HV_INTERCEPT_ACCESS_EXECUTE) {
+		intercept->instruction_byte_count = 0;
+	} else {
+		intercept->instruction_byte_count = intercepted_vcpu->arch.exit_instruction_len;
+		if (intercept->instruction_byte_count > sizeof(intercept->instruction_bytes))
+			intercept->instruction_byte_count = sizeof(intercept->instruction_bytes);
+		if (kvm_read_guest_virt(intercepted_vcpu,
+					kvm_rip_read(intercepted_vcpu),
+					intercept->instruction_bytes,
+					intercept->instruction_byte_count, &e))
+			goto inject_ud;
+	}
+
+	intercept->memory_access_info.gva_valid = (gva != 0);
+	intercept->gva = gva;
+	intercept->gpa = gpa;
+	intercept->cache_type = HV_X64_CACHE_TYPE_WRITEBACK;
+	kvm_x86_ops.get_segment(intercepted_vcpu, &kvmseg, VCPU_SREG_DS);
+	store_kvm_segment(&kvmseg, &intercept->ds);
+	kvm_x86_ops.get_segment(intercepted_vcpu, &kvmseg, VCPU_SREG_SS);
+	store_kvm_segment(&kvmseg, &intercept->ss);
+	intercept->rax = kvm_rax_read(intercepted_vcpu);
+	intercept->rcx = kvm_rcx_read(intercepted_vcpu);
+	intercept->rdx = kvm_rdx_read(intercepted_vcpu);
+	intercept->rbx = kvm_rbx_read(intercepted_vcpu);
+	intercept->rsp = kvm_rsp_read(intercepted_vcpu);
+	intercept->rbp = kvm_rbp_read(intercepted_vcpu);
+	intercept->rsi = kvm_rsi_read(intercepted_vcpu);
+	intercept->rdi = kvm_rdi_read(intercepted_vcpu);
+	intercept->r8 = kvm_r8_read(intercepted_vcpu);
+	intercept->r9 = kvm_r9_read(intercepted_vcpu);
+	intercept->r10 = kvm_r10_read(intercepted_vcpu);
+	intercept->r11 = kvm_r11_read(intercepted_vcpu);
+	intercept->r12 = kvm_r12_read(intercepted_vcpu);
+	intercept->r13 = kvm_r13_read(intercepted_vcpu);
+	intercept->r14 = kvm_r14_read(intercepted_vcpu);
+	intercept->r15 = kvm_r15_read(intercepted_vcpu);
+
+	if (synic_deliver_msg(&hv_vcpu->synic, 0, &msg, true))
+		goto inject_ud;
+
+	return;
+
+inject_ud:
+	kvm_queue_exception(target_vcpu, UD_VECTOR);
+}
+
+void kvm_hv_deliver_intercept(struct kvm_vcpu *vcpu)
+{
+	struct kvm_vcpu_hv_intercept_info *info = &to_hv_vcpu(vcpu)->intercept_info;
+
+	switch (info->type) {
+	case HVMSG_GPA_INTERCEPT:
+		deliver_gpa_intercept(vcpu, info->vcpu, info->gpa, 0,
+				      info->access);
+		break;
+	default:
+		pr_warn("Unknown exception\n");
+	}
+}
+EXPORT_SYMBOL_GPL(kvm_hv_deliver_intercept);
+
 void kvm_hv_init_vm(struct kvm *kvm)
 {
 	struct kvm_hv *hv = to_kvm_hv(kvm);
diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h
index ae781b4d4669..8efc4916e0cb 100644
--- a/arch/x86/kvm/hyperv.h
+++ b/arch/x86/kvm/hyperv.h
@@ -273,4 +273,6 @@  int kvm_hv_vtl_dev_register(void);
 void kvm_hv_vtl_dev_unregister(void);
 int kvm_hv_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
 
+void kvm_hv_deliver_intercept(struct kvm_vcpu *vcpu);
+
 #endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 82d3b86d9c93..f2581eec39a9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10707,6 +10707,9 @@  static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
 		if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu))
 			static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
+
+		if (kvm_check_request(KVM_REQ_HV_INJECT_INTERCEPT, vcpu))
+			kvm_hv_deliver_intercept(vcpu);
 	}
 
 	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win ||