diff mbox series

[v4,51/75] x86/sev-es: Handle MMIO events

Message ID 20200714120917.11253-52-joro@8bytes.org (mailing list archive)
State New, archived
Headers show
Series x86: SEV-ES Guest Support | expand

Commit Message

Joerg Roedel July 14, 2020, 12:08 p.m. UTC
From: Tom Lendacky <thomas.lendacky@amd.com>

Add handler for VC exceptions caused by MMIO intercepts. These
intercepts come along as nested page faults on pages with reserved
bits set.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
[ jroedel@suse.de: Adapt to VC handling framework ]
Co-developed-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/uapi/asm/svm.h |   5 +
 arch/x86/kernel/sev-es.c        | 219 ++++++++++++++++++++++++++++++++
 2 files changed, 224 insertions(+)

Comments

Mike Stunes July 21, 2020, 9:01 p.m. UTC | #1
Hi Joerg,

Thanks for the new patch-set!

> On Jul 14, 2020, at 5:08 AM, Joerg Roedel <joro@8bytes.org> wrote:
> 
> From: Tom Lendacky <thomas.lendacky@amd.com>
> 
> Add handler for VC exceptions caused by MMIO intercepts. These
> intercepts come along as nested page faults on pages with reserved
> bits set.
> 
> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
> [ jroedel@suse.de: Adapt to VC handling framework ]
> Co-developed-by: Joerg Roedel <jroedel@suse.de>
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> 
> <snip>

I’m running into an MMIO-related bug when I try testing this on our hypervisor.

During boot, probe_roms (arch/x86/kernel/probe_roms.c) uses romchecksum over the video ROM and extension ROM regions. In my test VM, the video ROM romchecksum starts at virtual address 0xffff8880000c0000 and has length 65536. But, at address 0xffff8880000c4000, we switch from being video-ROM-backed to being unbacked by anything.

With SEV-ES enabled, our platform handles reads and writes to unbacked memory by treating them as MMIO. So, the read from 0xffff8880000c4000 causes a #VC, which is handled by do_early_exception.

In handling the #VC, vc_slow_virt_to_phys fails for that address. My understanding is that the #VC handler should then add an entry to the page tables and retry the faulting access. Somehow, that isn’t happening. From the hypervisor side, it looks like the guest is looping somehow. (I think the VCPU is mostly unhalted and making progress, but the guest never gets past that romchecksum.) The guest never actually makes an MMIO vmgexit for that address.

If I remove the call to probe_roms from setup_arch, or remove the calls to romchecksum from probe_roms, this kernel boots normally.

Please let me know of other tests I should run or data that I can collect. Thanks!

Mike
Joerg Roedel July 22, 2020, 7:55 a.m. UTC | #2
Hi Mike,

On Tue, Jul 21, 2020 at 09:01:44PM +0000, Mike Stunes wrote:
> I’m running into an MMIO-related bug when I try testing this on our hypervisor.
> 
> During boot, probe_roms (arch/x86/kernel/probe_roms.c) uses
> romchecksum over the video ROM and extension ROM regions. In my test
> VM, the video ROM romchecksum starts at virtual address
> 0xffff8880000c0000 and has length 65536. But, at address
> 0xffff8880000c4000, we switch from being video-ROM-backed to being
> unbacked by anything.
> 
> With SEV-ES enabled, our platform handles reads and writes to unbacked
> memory by treating them as MMIO. So, the read from 0xffff8880000c4000
> causes a #VC, which is handled by do_early_exception.
> 
> In handling the #VC, vc_slow_virt_to_phys fails for that address. My
> understanding is that the #VC handler should then add an entry to the
> page tables and retry the faulting access. Somehow, that isn’t
> happening. From the hypervisor side, it looks like the guest is
> looping somehow. (I think the VCPU is mostly unhalted and making
> progress, but the guest never gets past that romchecksum.) The guest
> never actually makes an MMIO vmgexit for that address.

That sounds like your guest is in a page-fault loop, but I can't yet
explain why. Can you please find out the instruction which is causing
the #VC exception?

If a page-fault happens during #VC emulation it is forwared to the
page-fault handler, so this should work. But somehow this isn't
happening or the page-fault handler can't map the faulting address.


Regards,

	Joerg
Joerg Roedel July 22, 2020, 8:05 a.m. UTC | #3
Hmm, I have a theory ...

On Tue, Jul 21, 2020 at 09:01:44PM +0000, Mike Stunes wrote:
> If I remove the call to probe_roms from setup_arch, or remove the calls to romchecksum from probe_roms, this kernel boots normally.
> 
> Please let me know of other tests I should run or data that I can collect. Thanks!

... can you please try the attached diff?

diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 251d0aabc55a..e1fea7a38019 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -389,7 +389,8 @@ static bool vc_slow_virt_to_phys(struct ghcb *ghcb, struct es_em_ctxt *ctxt,
 	pgd_t *pgd;
 	pte_t *pte;
 
-	pgd = pgd_offset(current->active_mm, va);
+	pgd = __va(read_cr3_pa());
+	pgd = &pgd[pgd_index(va)];
 	pte = lookup_address_in_pgd(pgd, va, &level);
 	if (!pte) {
 		ctxt->fi.vector     = X86_TRAP_PF;
Mike Stunes July 22, 2020, 10:53 p.m. UTC | #4
> On Jul 22, 2020, at 1:05 AM, Joerg Roedel <jroedel@suse.de> wrote:
> 
> Hmm, I have a theory ...
> 
> On Tue, Jul 21, 2020 at 09:01:44PM +0000, Mike Stunes wrote:
>> If I remove the call to probe_roms from setup_arch, or remove the calls to romchecksum from probe_roms, this kernel boots normally.
>> 
>> Please let me know of other tests I should run or data that I can collect. Thanks!
> 
> ... can you please try the attached diff?
> 
> diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
> index 251d0aabc55a..e1fea7a38019 100644
> --- a/arch/x86/kernel/sev-es.c
> +++ b/arch/x86/kernel/sev-es.c
> @@ -389,7 +389,8 @@ static bool vc_slow_virt_to_phys(struct ghcb *ghcb, struct es_em_ctxt *ctxt,
> 	pgd_t *pgd;
> 	pte_t *pte;
> 
> -	pgd = pgd_offset(current->active_mm, va);
> +	pgd = __va(read_cr3_pa());
> +	pgd = &pgd[pgd_index(va)];
> 	pte = lookup_address_in_pgd(pgd, va, &level);
> 	if (!pte) {
> 		ctxt->fi.vector     = X86_TRAP_PF;

Thanks Joerg! With that change in place, this kernel boots normally. What was the problem?
Joerg Roedel July 23, 2020, 7:21 a.m. UTC | #5
Hi Mike,

On Wed, Jul 22, 2020 at 10:53:02PM +0000, Mike Stunes wrote:
> Thanks Joerg! With that change in place, this kernel boots normally.
> What was the problem?

The problem was that the code got its page-table from
current->active_mm. But these pointers are not set up during early boot,
so that the #VC handler can't walk the page-table and propagates a
page-fault every time. This loops forever.

Getting the page-table from CR3 instead works at all stages of the
systems runtime.

Regards,

	Joerg
diff mbox series

Patch

diff --git a/arch/x86/include/uapi/asm/svm.h b/arch/x86/include/uapi/asm/svm.h
index c68d1618c9b0..8f36ae021a7f 100644
--- a/arch/x86/include/uapi/asm/svm.h
+++ b/arch/x86/include/uapi/asm/svm.h
@@ -81,6 +81,11 @@ 
 #define SVM_EXIT_AVIC_INCOMPLETE_IPI		0x401
 #define SVM_EXIT_AVIC_UNACCELERATED_ACCESS	0x402
 
+/* SEV-ES software-defined VMGEXIT events */
+#define SVM_VMGEXIT_MMIO_READ			0x80000001
+#define SVM_VMGEXIT_MMIO_WRITE			0x80000002
+#define SVM_VMGEXIT_UNSUPPORTED_EVENT		0x8000ffff
+
 #define SVM_EXIT_ERR           -1
 
 #define SVM_EXIT_REASONS \
diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 08b66fd5ae9a..44995c3ad17e 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -374,6 +374,36 @@  static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
 	return ES_EXCEPTION;
 }
 
+static bool vc_slow_virt_to_phys(struct ghcb *ghcb, struct es_em_ctxt *ctxt,
+				 unsigned long vaddr, phys_addr_t *paddr)
+{
+	unsigned long va = (unsigned long)vaddr;
+	unsigned int level;
+	phys_addr_t pa;
+	pgd_t *pgd;
+	pte_t *pte;
+
+	pgd = pgd_offset(current->active_mm, va);
+	pte = lookup_address_in_pgd(pgd, va, &level);
+	if (!pte) {
+		ctxt->fi.vector     = X86_TRAP_PF;
+		ctxt->fi.cr2        = vaddr;
+		ctxt->fi.error_code = 0;
+
+		if (user_mode(ctxt->regs))
+			ctxt->fi.error_code |= X86_PF_USER;
+
+		return false;
+	}
+
+	pa = (phys_addr_t)pte_pfn(*pte) << PAGE_SHIFT;
+	pa |= va & ~page_level_mask(level);
+
+	*paddr = pa;
+
+	return true;
+}
+
 /* Include code shared with pre-decompression boot stage */
 #include "sev-es-shared.c"
 
@@ -456,6 +486,192 @@  static void __init vc_early_forward_exception(struct es_em_ctxt *ctxt)
 	do_early_exception(ctxt->regs, trapnr);
 }
 
+static long *vc_insn_get_reg(struct es_em_ctxt *ctxt)
+{
+	long *reg_array;
+	int offset;
+
+	reg_array = (long *)ctxt->regs;
+	offset    = insn_get_modrm_reg_off(&ctxt->insn, ctxt->regs);
+
+	if (offset < 0)
+		return NULL;
+
+	offset /= sizeof(long);
+
+	return reg_array + offset;
+}
+
+static enum es_result vc_do_mmio(struct ghcb *ghcb, struct es_em_ctxt *ctxt,
+				 unsigned int bytes, bool read)
+{
+	u64 exit_code, exit_info_1, exit_info_2;
+	unsigned long ghcb_pa = __pa(ghcb);
+	phys_addr_t paddr;
+	void __user *ref;
+
+	ref = insn_get_addr_ref(&ctxt->insn, ctxt->regs);
+	if (ref == (void __user *)-1L)
+		return ES_UNSUPPORTED;
+
+	exit_code = read ? SVM_VMGEXIT_MMIO_READ : SVM_VMGEXIT_MMIO_WRITE;
+
+	if (!vc_slow_virt_to_phys(ghcb, ctxt, (unsigned long)ref, &paddr)) {
+		if (!read)
+			ctxt->fi.error_code |= X86_PF_WRITE;
+
+		return ES_EXCEPTION;
+	}
+
+	exit_info_1 = paddr;
+	exit_info_2 = bytes;    /* Can never be greater than 8 */
+
+	ghcb->save.sw_scratch = ghcb_pa + offsetof(struct ghcb, shared_buffer);
+
+	return sev_es_ghcb_hv_call(ghcb, ctxt, exit_code, exit_info_1, exit_info_2);
+}
+
+static enum es_result vc_handle_mmio_twobyte_ops(struct ghcb *ghcb,
+						 struct es_em_ctxt *ctxt)
+{
+	struct insn *insn = &ctxt->insn;
+	unsigned int bytes = 0;
+	enum es_result ret;
+	int sign_byte;
+	long *reg_data;
+
+	switch (insn->opcode.bytes[1]) {
+		/* MMIO Read w/ zero-extension */
+	case 0xb6:
+		bytes = 1;
+		fallthrough;
+	case 0xb7:
+		if (!bytes)
+			bytes = 2;
+
+		ret = vc_do_mmio(ghcb, ctxt, bytes, true);
+		if (ret)
+			break;
+
+		/* Zero extend based on operand size */
+		reg_data = vc_insn_get_reg(ctxt);
+		if (!reg_data)
+			return ES_DECODE_FAILED;
+
+		memset(reg_data, 0, insn->opnd_bytes);
+
+		memcpy(reg_data, ghcb->shared_buffer, bytes);
+		break;
+
+		/* MMIO Read w/ sign-extension */
+	case 0xbe:
+		bytes = 1;
+		fallthrough;
+	case 0xbf:
+		if (!bytes)
+			bytes = 2;
+
+		ret = vc_do_mmio(ghcb, ctxt, bytes, true);
+		if (ret)
+			break;
+
+		/* Sign extend based on operand size */
+		reg_data = vc_insn_get_reg(ctxt);
+		if (!reg_data)
+			return ES_DECODE_FAILED;
+
+		if (bytes == 1) {
+			u8 *val = (u8 *)ghcb->shared_buffer;
+
+			sign_byte = (*val & 0x80) ? 0xff : 0x00;
+		} else {
+			u16 *val = (u16 *)ghcb->shared_buffer;
+
+			sign_byte = (*val & 0x8000) ? 0xff : 0x00;
+		}
+		memset(reg_data, sign_byte, insn->opnd_bytes);
+
+		memcpy(reg_data, ghcb->shared_buffer, bytes);
+		break;
+
+	default:
+		ret = ES_UNSUPPORTED;
+	}
+
+	return ret;
+}
+
+static enum es_result vc_handle_mmio(struct ghcb *ghcb,
+				     struct es_em_ctxt *ctxt)
+{
+	struct insn *insn = &ctxt->insn;
+	unsigned int bytes = 0;
+	enum es_result ret;
+	long *reg_data;
+
+	switch (insn->opcode.bytes[0]) {
+	/* MMIO Write */
+	case 0x88:
+		bytes = 1;
+		fallthrough;
+	case 0x89:
+		if (!bytes)
+			bytes = insn->opnd_bytes;
+
+		reg_data = vc_insn_get_reg(ctxt);
+		if (!reg_data)
+			return ES_DECODE_FAILED;
+
+		memcpy(ghcb->shared_buffer, reg_data, bytes);
+
+		ret = vc_do_mmio(ghcb, ctxt, bytes, false);
+		break;
+
+	case 0xc6:
+		bytes = 1;
+		fallthrough;
+	case 0xc7:
+		if (!bytes)
+			bytes = insn->opnd_bytes;
+
+		memcpy(ghcb->shared_buffer, insn->immediate1.bytes, bytes);
+
+		ret = vc_do_mmio(ghcb, ctxt, bytes, false);
+		break;
+
+		/* MMIO Read */
+	case 0x8a:
+		bytes = 1;
+		fallthrough;
+	case 0x8b:
+		if (!bytes)
+			bytes = insn->opnd_bytes;
+
+		ret = vc_do_mmio(ghcb, ctxt, bytes, true);
+		if (ret)
+			break;
+
+		reg_data = vc_insn_get_reg(ctxt);
+		if (!reg_data)
+			return ES_DECODE_FAILED;
+
+		if (bytes == 4)
+			*reg_data = 0;  /* Zero-extend for 32-bit operation */
+
+		memcpy(reg_data, ghcb->shared_buffer, bytes);
+		break;
+
+		/* Two-Byte Opcodes */
+	case 0x0f:
+		ret = vc_handle_mmio_twobyte_ops(ghcb, ctxt);
+		break;
+	default:
+		ret = ES_UNSUPPORTED;
+	}
+
+	return ret;
+}
+
 static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
 					 struct ghcb *ghcb,
 					 unsigned long exit_code)
@@ -469,6 +685,9 @@  static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
 	case SVM_EXIT_IOIO:
 		result = vc_handle_ioio(ghcb, ctxt);
 		break;
+	case SVM_EXIT_NPF:
+		result = vc_handle_mmio(ghcb, ctxt);
+		break;
 	default:
 		/*
 		 * Unexpected #VC exception