[V7,0/10] KVM: X86: Introducing ROE Protection Kernel Hardening

Message ID 20181207124803.10828-1-ahmedsoliman@mena.vt.edu
State New

Commit Message

Ahmed Abd El Mawgood Dec. 7, 2018, 12:47 p.m. UTC
-- Summary --

ROE is a hypercall that enables the host operating system to restrict a guest's
access to its own memory. This provides a hardening mechanism that can be used
to stop rootkits from manipulating the kernel's static data structures and code.
Once a memory region is protected, the guest kernel can't even request undoing
the protection.

Memory protected by ROE should be non-swappable, because even if a ROE-protected
page got swapped out, it won't be possible to write anything in its place.

The ROE hypercall should be capable of protecting either a whole memory frame or
parts of it. With these two, it should be possible for the guest kernel to
protect its memory and all the page table entries for that memory inside the
page table. I am still not sure whether this should be part of ROE's job or the
guest's job.

The reason it is better to implement this inside KVM instead of (host) user
space is the need to access SPTEs to modify the permissions. mprotect() from
user space can work in theory, but taking a vmexit and switching to user space
on every fault would be a big performance hit. Having the permissions handled
by EPT, on the other hand, should give a noticeable performance gain.
Our threat model assumes that an attacker has gained full root access to a
running guest and that their goal is to manipulate kernel code/data (hook
syscalls, overwrite the IDT, etc.).

There is future work in progress to also put some sort of protection on the page
table register CR3 and other critical registers that can be intercepted by KVM.
This way it won't be possible for an attacker to manipulate any part of the
guest's page table.

-- Test Case --

I was asked to add a test to tools/testing/selftests/kvm/, but the existing
test suite didn't work on my machine: with the current tests I experienced a
shutdown due to a triple fault caused by an EPT fault. I tried bisecting, but
the triple fault was there from the very first commit.

So instead I provide here a demo kernel module to test the current
implementation:

	#include <linux/init.h>
	#include <linux/module.h>
	#include <linux/kernel.h>
	#include <linux/slab.h>
	#include <linux/string.h>
	#include <linux/kvm_para.h>
	#include <linux/mm.h>

	MODULE_DESCRIPTION("ROE Hello world Module");
	MODULE_LICENSE("GPL v2");

	#define KVM_HC_ROE 11
	#define ROE_VERSION 0
	#define ROE_MPROTECT 1
	/* Assumed value, missing from this copy; check kvm_para.h in the series. */
	#define ROE_MPROTECT_CHUNK 2

	/* Ask the host which ROE version it implements. */
	static long roe_version(void)
	{
		return kvm_hypercall1(KVM_HC_ROE, ROE_VERSION);
	}

	/* Make pg_count whole pages starting at addr read-only. */
	static long roe_mprotect(void *addr, long pg_count)
	{
		return kvm_hypercall3(KVM_HC_ROE, ROE_MPROTECT, (u64)addr, pg_count);
	}

	/* Make size bytes starting at addr read-only (byte granularity). */
	static long roe_mprotect_chunk(void *addr, long size)
	{
		return kvm_hypercall3(KVM_HC_ROE, ROE_MPROTECT_CHUNK, (u64)addr, size);
	}

	static int __init hello(void)
	{
		int x;
		struct page *pg1, *pg2;
		void *memory;

		pg1 = alloc_page(GFP_KERNEL);
		pg2 = alloc_page(GFP_KERNEL);

		/* Case 1: protect a whole page, then try to overwrite it. */
		memory = page_to_virt(pg1);
		pr_info("ROE_VERSION: %ld\n", roe_version());
		pr_info("Allocated memory: 0x%llx\n", (u64)memory);
		pr_info("Physical Address: 0x%llx\n", virt_to_phys(memory));
		strcpy((char *)memory, "ROE PROTECTED");
		pr_info("memory_content: %s\n", (char *)memory);
		x = roe_mprotect((void *)memory, 1);
		strcpy((char *)memory, "The strcpy should silently fail and "
					"memory content won't be modified");
		pr_info("memory_content: %s\n", (char *)memory);

		/* Case 2: protect only the string, then write past its end. */
		memory = page_to_virt(pg2);
		pr_info("Allocated memory: 0x%llx\n", (u64)memory);
		pr_info("Physical Address: 0x%llx\n", virt_to_phys(memory));
		strcpy((char *)memory, "ROE PROTECTED PARTIALLY");
		roe_mprotect_chunk((void *)memory, strlen((char *)memory));
		pr_info("memory_content: %s\n", (char *)memory);
		strcpy((char *)memory, "XXXXXXXXXXXXXXXXXXXXXXX"
					" <- Text here not modified still Can concat");
		pr_info("memory_content: %s\n", (char *)memory);
		return 0;
	}

	static void __exit bye(void)
	{
		pr_info("Allocated Memory May never be freed at all!\n");
		pr_info("Actually this is more of an ABI demonstration\n");
		pr_info("than actual use case\n");
	}

	module_init(hello);
	module_exit(bye);


I tried this on a Gentoo host with an Ubuntu guest and QEMU built from git,
after applying the QEMU changes shown in the diff at the end of this message.

-- Change log V6 -> V7 --

- Completely remove CONFIG_KVM_ROE, ROE is always enabled, since it is opt in
- Bug fixes regarding how each element in the protection bitmap maps to the
  equivalent SPTE.
- General code cleanup.
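
For readers unfamiliar with the protection bitmap mentioned in the change log,
here is a minimal user-space sketch of the general idea, assuming one bit per
guest frame in a memory slot, indexed by (gfn - base_gfn). The patch set maps
bitmap elements to SPTEs; this sketch models only the frame-number indexing.
All names (roe_slot, roe_slot_protect, roe_slot_is_protected) are hypothetical
and are not taken from the patches.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical model of a memory slot: npages frames starting at base_gfn. */
struct roe_slot {
	uint64_t base_gfn;
	uint64_t npages;
	unsigned long *bitmap;	/* one bit per frame; a set bit means read-only */
};

/* Allocate a zeroed bitmap covering the slot; returns 0 on success. */
static int roe_slot_init(struct roe_slot *s, uint64_t base_gfn, uint64_t npages)
{
	size_t words = (npages + BITS_PER_WORD - 1) / BITS_PER_WORD;

	s->base_gfn = base_gfn;
	s->npages = npages;
	s->bitmap = calloc(words, sizeof(unsigned long));
	return s->bitmap ? 0 : -1;
}

static bool roe_slot_contains(const struct roe_slot *s, uint64_t gfn)
{
	return gfn >= s->base_gfn && gfn - s->base_gfn < s->npages;
}

/* Mark one guest frame as read-only. */
static void roe_slot_protect(struct roe_slot *s, uint64_t gfn)
{
	uint64_t idx;

	assert(roe_slot_contains(s, gfn));
	idx = gfn - s->base_gfn;
	s->bitmap[idx / BITS_PER_WORD] |= 1UL << (idx % BITS_PER_WORD);
}

/* Frames outside the slot are reported as unprotected. */
static bool roe_slot_is_protected(const struct roe_slot *s, uint64_t gfn)
{
	uint64_t idx;

	if (!roe_slot_contains(s, gfn))
		return false;
	idx = gfn - s->base_gfn;
	return (s->bitmap[idx / BITS_PER_WORD] >> (idx % BITS_PER_WORD)) & 1UL;
}
```

An off-by-one in the idx computation is the kind of mapping bug that would
protect the neighbouring frame instead of the requested one.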

-- Known Issues --

- THP is not supported yet. In general, ROE is not supported when the guest
  frame size is not the same as the equivalent EPT frame size.

The previous version (V6) of the patch set can be found at [1]

-- links --

[1] https://lkml.org/lkml/2018/11/4/417

-- List of patches --

[PATCH V7 01/10] KVM: State whether memory should be freed in
[PATCH V7 02/10] KVM: X86: Add arbitrary data pointer in kvm memslot
[PATCH V7 03/10] KVM: X86: Add helper function to convert SPTE to GFN
[PATCH V7 04/10] KVM: Document Memory ROE
[PATCH V7 05/10] KVM: Create architecture independent ROE skeleton
[PATCH V7 06/10] KVM: X86: Enable ROE for x86
[PATCH V7 07/10] KVM: Add support for byte granular memory ROE
[PATCH V7 08/10] KVM: X86: Port ROE_MPROTECT_CHUNK to x86
[PATCH V7 09/10] KVM: Add new exit reason For ROE violations
[PATCH V7 10/10] KVM: Log ROE violations in system log

-- Diffstat --

 Documentation/virtual/kvm/hypercalls.txt |  40 ++++
 arch/x86/include/asm/kvm_host.h          |   2 +-
 arch/x86/kvm/Makefile                    |   4 +-
 arch/x86/kvm/mmu.c                       | 121 +++++------
 arch/x86/kvm/mmu.h                       |  31 ++-
 arch/x86/kvm/roe.c                       | 104 ++++++++++
 arch/x86/kvm/roe_arch.h                  |  28 +++
 arch/x86/kvm/x86.c                       |  21 +-
 include/kvm/roe.h                        |  28 +++
 include/linux/kvm_host.h                 |  25 +++
 include/uapi/linux/kvm.h                 |   2 +-
 include/uapi/linux/kvm_para.h            |   5 +
 virt/kvm/kvm_main.c                      |  56 +++--
 virt/kvm/roe.c                           | 342 +++++++++++++++++++++++++++++++
 virt/kvm/roe_generic.h                   |  18 ++
 15 files changed, 732 insertions(+), 95 deletions(-)

Signed-off-by: Ahmed Abd El Mawgood <ahmedsoliman@mena.vt.edu>


Ahmed Abd El Mawgood Dec. 13, 2018, 4 p.m. UTC | #1

> Given that writes to these areas should be exceptional occurrences,

No, not in the case of a partially protected page.

> I don't understand why this path needs to be optimized. To me it seems, a
> straightforward userspace implementation with no additional code in the kernel
> achieves the same feature. Can you elaborate?

The performance hit I was talking about occurs when dealing with a page with
mixed content, i.e. a page that is partially read-only and partially writable.
This is handled by emulating the writes from inside KVM. If this were done from
the host's user space, every write operation (at most one CPU word in size, I
guess?) would require switching from the guest to KVM and then to host user
space, which is a major performance hit. I originally made it for protecting
the parts of the guest's page table that shouldn't ever be remapped. Since the
page table gets modified a lot, emulating the writes from host user space
instead of the kernel would add unnecessary overhead. Also, it doesn't sound
right to me to place the protection inside the page table when it can be placed
inside the virtualization EPT.

But aside from that, I think I was wrong to hint that it is simple when done
from user space. Handling cases like THP and huge pages (which I do not support
yet) doesn't seem to be easy from user space, and register ROE has some
arch-specific details. That's why I think it is better to continue doing it
from the KVM kernel module.

Ahmed Abd El Mawgood Dec. 21, 2018, 2:05 p.m. UTC | #2

> > I don't understand why this path needs to be optimized. To me it seems, a
> > straightforward userspace implementation with no additional code in the
> > kernel achieves the same feature. Can you elaborate?

I was doing some benchmarking to figure out the overhead introduced by ROE, so
I can add more details about the overhead I am talking about. First, I will
explain the existing paths for a memory write:
[1] Normal memory write, done directly in guest kernel space.
[2] Writing into a fully ROE-protected page (the write operation will fail).
[3] Writing into a partially ROE-protected region (the write operation will
    fail).
[4] Writing into writable memory in a page that contains a partially
    ROE-protected region (the write operation is committed to memory).
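
The four paths above can be read as a decision made for each write. The
following user-space sketch is only an illustration of that decision, assuming
a page carries a "fully protected" flag plus a list of protected byte ranges;
the types and names (page_state, chunk, classify_write) are hypothetical and
do not come from the patch set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum write_path {
	PATH_NORMAL = 1,	/* [1] no ROE state on the page */
	PATH_FULL_ROE,		/* [2] whole page is read-only, write fails */
	PATH_CHUNK_DENIED,	/* [3] write touches a protected chunk, fails */
	PATH_CHUNK_EMULATED,	/* [4] write misses every chunk, is committed */
};

/* Protected byte range [start, start + len) within a page. */
struct chunk {
	uint64_t start;
	uint64_t len;
};

struct page_state {
	bool fully_protected;
	const struct chunk *chunks;	/* protected chunks inside this page */
	size_t nchunks;
};

/* True if the write [addr, addr + len) intersects chunk c. */
static bool overlaps(const struct chunk *c, uint64_t addr, uint64_t len)
{
	return addr < c->start + c->len && c->start < addr + len;
}

static enum write_path classify_write(const struct page_state *pg,
				      uint64_t addr, uint64_t len)
{
	size_t i;

	assert(len > 0);
	if (pg->fully_protected)
		return PATH_FULL_ROE;
	if (pg->nchunks == 0)
		return PATH_NORMAL;
	/* Linear scan, mirroring the linked list described in this thread. */
	for (i = 0; i < pg->nchunks; i++)
		if (overlaps(&pg->chunks[i], addr, len))
			return PATH_CHUNK_DENIED;
	return PATH_CHUNK_EMULATED;
}
```

Only PATH_NORMAL avoids a vmexit; the other three all trap to the host, which
is why their cost is dominated by the number of context switches.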

Path [1] is the normal write. The guest kernel does not have to switch to the
host, and the performance was almost the same between host and guest: writing
1 MB (a byte at a time) took no more than 4 milliseconds. This is not affected
by whether ROE is done from user space or kernel space.

Path [2] switches from the guest kernel to the host kernel, and then the host
kernel switches to user space to decide what should be done. The guest kernel
-> host kernel -> host user space switch is done on every separate write
attempt (approx. 1000000 in this case). It took ~5000 milliseconds to fail the
1M write attempts. As with the path above, user space ROE will not affect this
one that much, and I am not aware of any possible optimization, yet ideas are
welcome.

Path [3] also switches from the guest kernel to the host kernel to host user
space. However, the 1M write attempts took ~5000 ms when the guest had fewer
than 32 protected chunks system-wide. As the number of chunks increased, the
time also increased linearly until it reached 20 seconds for 1M write attempts
when the system had about 2048 separate protected chunks. For this benchmark I
allocated a page and protected every other byte :). I think this can be
optimized by replacing the linked list used to keep track of chunks with maybe
a skip list or a red-black tree, and it will be available in the next patch
set. As in the previous cases, user space vs. kernel space will not affect
performance here at all.
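
To illustrate why an ordered structure helps here, below is a small user-space
sketch of a logarithmic chunk lookup. It is not the skip list or red-black
tree proposed above: it uses a sorted array with binary search as a simpler
stand-in, and it assumes the chunks are non-overlapping and sorted by start
address. The names (chunk, chunk_lookup) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Protected byte range [start, start + len). */
struct chunk {
	uint64_t start;
	uint64_t len;
};

/*
 * Return true if addr falls inside any chunk.
 * Precondition: chunks[] is sorted by start and the ranges do not overlap.
 * Takes O(log n) steps, where a linked-list walk takes O(n).
 */
static bool chunk_lookup(const struct chunk *chunks, size_t n, uint64_t addr)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (addr < chunks[mid].start)
			hi = mid;		/* addr is left of this chunk */
		else if (addr - chunks[mid].start >= chunks[mid].len)
			lo = mid + 1;		/* addr is right of this chunk */
		else
			return true;		/* addr is inside this chunk */
	}
	return false;
}
```

With the "every other byte" layout from the benchmark (2048 one-byte chunks in
a page), a lookup needs at most about 12 comparisons instead of walking up to
2048 list nodes.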

Path [4]: the guest kernel switches to the host kernel and the write operation
is done in the host kernel (note that we saved a switch from host kernel to
host user space). The host kernel emulates the write operation and gets back to
the guest kernel. The writing speed was notably slow, but on average twice the
speed of Path [3] (~2900 ms for fewer than 32 chunks, going up to 11 seconds
for 2048 chunks). Path [4] can be optimized the same way as Path [3].

Note that the dominating factor here is how many switches are done. If ROE
were implemented in user space, Path [4] would be at least as slow as Path [3],
which is about 2x slower.
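
To put the totals quoted in this thread on a per-write basis, here is a small
sketch that divides each reported total by the number of write attempts. It
assumes exactly 1000000 attempts per run (the thread says "approx 1000000");
the struct and function names are hypothetical, and the totals are the ones
reported above, not new measurements.

```c
#include <assert.h>
#include <stdint.h>

#define WRITES 1000000ULL	/* assumed attempts per run ("approx 1000000") */

/* Average cost of one write attempt, in nanoseconds. */
static uint64_t per_write_ns(uint64_t total_ms, uint64_t writes)
{
	assert(writes > 0);
	return total_ms * 1000000ULL / writes;
}

struct measurement {
	const char *label;
	uint64_t total_ms;	/* total time as reported in this thread */
};

static const struct measurement reported[] = {
	{ "path [1], normal write (upper bound)", 4 },
	{ "path [2], fully protected page", 5000 },
	{ "path [3], fewer than 32 chunks", 5000 },
	{ "path [3], about 2048 chunks", 20000 },
	{ "path [4], fewer than 32 chunks", 2900 },
	{ "path [4], 2048 chunks", 11000 },
};
```

Under that assumption a trapped write costs roughly 3 to 5 microseconds with
few chunks, against at most about 4 nanoseconds for a normal write.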

I hope it is less ambiguous now.


Junior Researcher, IoT and Cyber Security lab, SmartCI, Alexandria
University, & CIS @  VMI


diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 4880a05399..57d0973aca 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2035,6 +2035,9 @@  int kvm_cpu_exec(CPUState *cpu)
             ret = 0;
+	case KVM_EXIT_ROE:
+	    ret = 0;
+	    break;
             ret = EXCP_INTERRUPT;
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index f11a7eb49c..67aded8f00 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -235,7 +235,7 @@  struct kvm_hyperv_exit {
 #define KVM_EXIT_S390_STSI        25
 #define KVM_EXIT_IOAPIC_EOI       26
 #define KVM_EXIT_HYPERV           27
+#define KVM_EXIT_ROE              28
 /* Emulate instruction failed. */