From patchwork Sat Nov 10 15:43:42 2012
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Christoffer Dall <c.dall@virtualopensystems.com>
X-Patchwork-Id: 1723981
Return-Path: <kvm-owner@vger.kernel.org>
X-Original-To: patchwork-kvm@patchwork.kernel.org
Delivered-To: patchwork-process-083081@patchwork1.kernel.org
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by patchwork1.kernel.org (Postfix) with ESMTP id 9CE263FC8F
	for <patchwork-kvm@patchwork.kernel.org>;
	Sat, 10 Nov 2012 15:43:48 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752076Ab2KJPnp (ORCPT
	<rfc822;patchwork-kvm@patchwork.kernel.org>);
	Sat, 10 Nov 2012 10:43:45 -0500
Received: from mail-we0-f174.google.com ([74.125.82.174]:56477 "EHLO
	mail-we0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751975Ab2KJPno (ORCPT <rfc822; kvm@vger.kernel.org>);
	Sat, 10 Nov 2012 10:43:44 -0500
Received: by mail-we0-f174.google.com with SMTP id t9so2183999wey.19
	for <kvm@vger.kernel.org>; Sat, 10 Nov 2012 07:43:44 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=google.com; s=20120113;
	h=subject:to:from:cc:date:message-id:in-reply-to:references
	:user-agent:mime-version:content-type:content-transfer-encoding
	:x-gm-message-state;
	bh=UNltaei3JPETjnFhfOd4oR6AjWjyRndL7bThBB6ep00=;
	b=CxMxN1o0xRCl+1W2ySG5+JtBMyifg/VZZerVdhn+jVkI6N2AuOVP46+fHV5OKPQXny
	jHqzdmAWxjpt/mhDfrWHZUtiZ0TkSCbVMHiDYJkBaD2lmOa0jiENw+SQukD+ekVnWcQ0
	rXSkwvnN2UMwUBfFfsbpI7vkrON9llW/nkH+6EaYCoCmXqbts/7o6Vgio+vpTVTa7q2u
	XBjewr1R3BtU8rFQ/0gUIpHZzKpTVJnE+VasHoRXEN+BEKMlqOJSQlQJ+o8lNKouXbZX
	h09VQW/MZEWJElEmQJj9JguQWJgEUnyAwtZbQp2pI/pKUUnaLkB2+LkB8cKy7azQxGP0
	jdRQ==
Received: by 10.216.217.214 with SMTP id i64mr5682008wep.124.1352562223744;
	Sat, 10 Nov 2012 07:43:43 -0800 (PST)
Received: from [127.0.1.1] (ip1.c116.obr91.cust.comxnet.dk. [87.72.8.103])
	by mx.google.com with ESMTPS id fg6sm6730128wib.3.2012.11.10.07.43.42
	(version=TLSv1/SSLv3 cipher=OTHER);
	Sat, 10 Nov 2012 07:43:43 -0800 (PST)
Subject: [PATCH v4 13/14] KVM: ARM: Handle guest faults in KVM
To: kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	kvmarm@lists.cs.columbia.edu
From: Christoffer Dall <c.dall@virtualopensystems.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>,
	Marcelo Tosatti <mtosatti@redhat.com>
Date: Sat, 10 Nov 2012 16:43:42 +0100
Message-ID: <20121110154342.2836.9669.stgit@chazy-air>
In-Reply-To: <20121110154203.2836.46686.stgit@chazy-air>
References: <20121110154203.2836.46686.stgit@chazy-air>
User-Agent: StGit/0.15
MIME-Version: 1.0
X-Gm-Message-State: 
 ALoCoQlbGx/vG+6K3oDWJK2UcrRMPav2r/4BJcWoGCLh2lgfKpCdR7gT73P6cYy0Xf8MVb9W0PNd
Sender: kvm-owner@vger.kernel.org
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

Handles the guest faults in KVM by mapping in corresponding user pages
in the 2nd stage page tables.

We invalidate the instruction cache by MVA whenever we map a page to the
guest (no, we cannot only do it when we have an iabt because the guest
may happily read/write a page before hitting the icache) if the hardware
uses VIPT or PIPT.  In the latter case, we can invalidate only that
physical page.  In the first case, all bets are off and we simply must
invalidate the whole affair.  Not that VIVT icaches are tagged with
vmids, and we are out of the woods on that one.  Alexander Graf was nice
enough to remind us of this massive pain.

There is also  a subtle bug hidden somewhere, which we currently hide by
marking all pages dirty even when the pages are only mapped read-only.  The
current hypothesis is that marking pages dirty may exercise the IO system and
data cache more and therefore we don't see stale data in the guest, but it's
purely guesswork.  The bug is manifested by seemingly random kernel crashes in
guests when the host is under extreme memory pressure and swapping is enabled.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com>
---
 arch/arm/include/asm/kvm_arm.h |    9 ++
 arch/arm/include/asm/kvm_asm.h |    2 +
 arch/arm/kvm/mmu.c             |  145 ++++++++++++++++++++++++++++++++++++++++
 arch/arm/kvm/trace.h           |   26 +++++++
 4 files changed, 181 insertions(+), 1 deletion(-)


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

diff --git a/arch/arm/include/asm/kvm_arm.h b/arch/arm/include/asm/kvm_arm.h
index e2cf681..368db1d 100644
--- a/arch/arm/include/asm/kvm_arm.h
+++ b/arch/arm/include/asm/kvm_arm.h
@@ -156,11 +156,20 @@
 #define HSR_ISS		(HSR_IL - 1)
 #define HSR_ISV_SHIFT	(24)
 #define HSR_ISV		(1U << HSR_ISV_SHIFT)
+#define HSR_FSC		(0x3f)
+#define HSR_FSC_TYPE	(0x3c)
+#define HSR_WNR		(1 << 6)
 #define HSR_CV_SHIFT	(24)
 #define HSR_CV		(1U << HSR_CV_SHIFT)
 #define HSR_COND_SHIFT	(20)
 #define HSR_COND	(0xfU << HSR_COND_SHIFT)
 
+#define FSC_FAULT	(0x04)
+#define FSC_PERM	(0x0c)
+
+/* Hyp Prefetch Fault Address Register (HPFAR/HDFAR) */
+#define HPFAR_MASK	(~0xf)
+
 #define HSR_EC_UNKNOWN	(0x00)
 #define HSR_EC_WFI	(0x01)
 #define HSR_EC_CP15_32	(0x03)
diff --git a/arch/arm/include/asm/kvm_asm.h b/arch/arm/include/asm/kvm_asm.h
index 47a0e57..6fccdb3 100644
--- a/arch/arm/include/asm/kvm_asm.h
+++ b/arch/arm/include/asm/kvm_asm.h
@@ -71,6 +71,8 @@ extern char __kvm_hyp_vector[];
 extern char __kvm_hyp_code_start[];
 extern char __kvm_hyp_code_end[];
 
+extern void __kvm_tlb_flush_vmid(struct kvm *kvm);
+
 extern void __kvm_flush_vm_context(void);
 extern void __kvm_tlb_flush_vmid(struct kvm *kvm);
 
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index f45be86..6c9ee3a 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -21,9 +21,11 @@
 #include <linux/io.h>
 #include <asm/idmap.h>
 #include <asm/pgalloc.h>
+#include <asm/cacheflush.h>
 #include <asm/kvm_arm.h>
 #include <asm/kvm_mmu.h>
 #include <asm/kvm_asm.h>
+#include <asm/kvm_emulate.h>
 #include <asm/mach/map.h>
 #include <trace/events/kvm.h>
 
@@ -503,9 +505,150 @@ out:
 	return ret;
 }
 
+static void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
+{
+	/*
+	 * If we are going to insert an instruction page and the icache is
+	 * either VIPT or PIPT, there is a potential problem where the host
+	 * (or another VM) may have used this page at the same virtual address
+	 * as this guest, and we read incorrect data from the icache.  If
+	 * we're using a PIPT cache, we can invalidate just that page, but if
+	 * we are using a VIPT cache we need to invalidate the entire icache -
+	 * damn shame - as written in the ARM ARM (DDI 0406C - Page B3-1384)
+	 */
+	if (icache_is_pipt()) {
+		unsigned long hva = gfn_to_hva(kvm, gfn);
+		__cpuc_coherent_user_range(hva, hva + PAGE_SIZE);
+	} else if (!icache_is_vivt_asid_tagged()) {
+		/* any kind of VIPT cache */
+		__flush_icache_all();
+	}
+}
+
+static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
+			  gfn_t gfn, struct kvm_memory_slot *memslot,
+			  bool is_iabt, unsigned long fault_status)
+{
+	pte_t new_pte;
+	pfn_t pfn;
+	int ret;
+	bool write_fault, writable;
+	unsigned long mmu_seq;
+	struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
+
+	if (is_iabt)
+		write_fault = false;
+	else if ((vcpu->arch.hsr & HSR_ISV) && !(vcpu->arch.hsr & HSR_WNR))
+		write_fault = false;
+	else
+		write_fault = true;
+
+	if (fault_status == FSC_PERM && !write_fault) {
+		kvm_err("Unexpected L2 read permission error\n");
+		return -EFAULT;
+	}
+
+	/* We need minimum second+third level pages */
+	ret = mmu_topup_memory_cache(memcache, 2, KVM_NR_MEM_OBJS);
+	if (ret)
+		return ret;
+
+	mmu_seq = vcpu->kvm->mmu_notifier_seq;
+	smp_rmb();
+
+	pfn = gfn_to_pfn_prot(vcpu->kvm, gfn, write_fault, &writable);
+	if (is_error_pfn(pfn))
+		return -EFAULT;
+
+	new_pte = pfn_pte(pfn, PAGE_S2);
+	coherent_icache_guest_page(vcpu->kvm, gfn);
+
+	spin_lock(&vcpu->kvm->mmu_lock);
+	if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
+		goto out_unlock;
+	if (writable) {
+		pte_val(new_pte) |= L_PTE_S2_RDWR;
+		kvm_set_pfn_dirty(pfn);
+	}
+	stage2_set_pte(vcpu->kvm, memcache, fault_ipa, &new_pte, false);
+
+out_unlock:
+	spin_unlock(&vcpu->kvm->mmu_lock);
+	/*
+	 * XXX TODO FIXME:
+-        * This is _really_ *weird* !!!
+-        * We should be calling the _clean version, because we set the pfn dirty
+	 * if we map the page writable, but this causes memory failures in
+	 * guests under heavy memory pressure on the host and heavy swapping.
+	 */
+	kvm_release_pfn_dirty(pfn);
+	return 0;
+}
+
+/**
+ * kvm_handle_guest_abort - handles all 2nd stage aborts
+ * @vcpu:	the VCPU pointer
+ * @run:	the kvm_run structure
+ *
+ * Any abort that gets to the host is almost guaranteed to be caused by a
+ * missing second stage translation table entry, which can mean that either the
+ * guest simply needs more memory and we must allocate an appropriate page or it
+ * can mean that the guest tried to access I/O memory, which is emulated by user
+ * space. The distinction is based on the IPA causing the fault and whether this
+ * memory region has been registered as standard RAM by user space.
+ */
 int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
 {
-	return -EINVAL;
+	unsigned long hsr_ec;
+	unsigned long fault_status;
+	phys_addr_t fault_ipa;
+	struct kvm_memory_slot *memslot = NULL;
+	bool is_iabt;
+	gfn_t gfn;
+	int ret;
+
+	hsr_ec = vcpu->arch.hsr >> HSR_EC_SHIFT;
+	is_iabt = (hsr_ec == HSR_EC_IABT);
+	fault_ipa = ((phys_addr_t)vcpu->arch.hpfar & HPFAR_MASK) << 8;
+
+	trace_kvm_guest_fault(*vcpu_pc(vcpu), vcpu->arch.hsr,
+			      vcpu->arch.hxfar, fault_ipa);
+
+	/* Check the stage-2 fault is trans. fault or write fault */
+	fault_status = (vcpu->arch.hsr & HSR_FSC_TYPE);
+	if (fault_status != FSC_FAULT && fault_status != FSC_PERM) {
+		kvm_err("Unsupported fault status: EC=%#lx DFCS=%#lx\n",
+			hsr_ec, fault_status);
+		return -EFAULT;
+	}
+
+	gfn = fault_ipa >> PAGE_SHIFT;
+	if (!kvm_is_visible_gfn(vcpu->kvm, gfn)) {
+		if (is_iabt) {
+			/* Prefetch Abort on I/O address */
+			kvm_inject_pabt(vcpu, vcpu->arch.hxfar);
+			return 1;
+		}
+
+		if (fault_status != FSC_FAULT) {
+			kvm_err("Unsupported fault status on io memory: %#lx\n",
+				fault_status);
+			return -EFAULT;
+		}
+
+		kvm_pr_unimpl("I/O address abort...");
+		return 0;
+	}
+
+	memslot = gfn_to_memslot(vcpu->kvm, gfn);
+	if (!memslot->user_alloc) {
+		kvm_err("non user-alloc memslots not supported\n");
+		return -EINVAL;
+	}
+
+	ret = user_mem_abort(vcpu, fault_ipa, gfn, memslot,
+			     is_iabt, fault_status);
+	return ret ? ret : 1;
 }
 
 static void handle_hva_to_gpa(struct kvm *kvm,
diff --git a/arch/arm/kvm/trace.h b/arch/arm/kvm/trace.h
index c86a513..e2a9d2e 100644
--- a/arch/arm/kvm/trace.h
+++ b/arch/arm/kvm/trace.h
@@ -39,6 +39,32 @@ TRACE_EVENT(kvm_exit,
 	TP_printk("PC: 0x%08lx", __entry->vcpu_pc)
 );
 
+TRACE_EVENT(kvm_guest_fault,
+	TP_PROTO(unsigned long vcpu_pc, unsigned long hsr,
+		 unsigned long hxfar,
+		 unsigned long ipa),
+	TP_ARGS(vcpu_pc, hsr, hxfar, ipa),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,	vcpu_pc		)
+		__field(	unsigned long,	hsr		)
+		__field(	unsigned long,	hxfar		)
+		__field(	unsigned long,	ipa		)
+	),
+
+	TP_fast_assign(
+		__entry->vcpu_pc		= vcpu_pc;
+		__entry->hsr			= hsr;
+		__entry->hxfar			= hxfar;
+		__entry->ipa			= ipa;
+	),
+
+	TP_printk("guest fault at PC %#08lx (hxfar %#08lx, "
+		  "ipa %#08lx, hsr %#08lx",
+		  __entry->vcpu_pc, __entry->hxfar,
+		  __entry->hsr, __entry->ipa)
+);
+
 TRACE_EVENT(kvm_irq_line,
 	TP_PROTO(unsigned int type, int vcpu_idx, int irq_num, int level),
 	TP_ARGS(type, vcpu_idx, irq_num, level),