From patchwork Thu Dec 20 18:28:31 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Adalbert_Laz=C4=83r?= X-Patchwork-Id: 10739303 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 781596C2 for ; Thu, 20 Dec 2018 18:36:19 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 60AC928E58 for ; Thu, 20 Dec 2018 18:36:19 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 54C4328F39; Thu, 20 Dec 2018 18:36:19 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 4DAF728E58 for ; Thu, 20 Dec 2018 18:36:18 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2389324AbeLTSgQ (ORCPT ); Thu, 20 Dec 2018 13:36:16 -0500 Received: from mx01.bbu.dsd.mx.bitdefender.com ([91.199.104.161]:44330 "EHLO mx01.bbu.dsd.mx.bitdefender.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2389300AbeLTSgH (ORCPT ); Thu, 20 Dec 2018 13:36:07 -0500 Received: from smtp.bitdefender.com (smtp02.buh.bitdefender.net [10.17.80.76]) by mx01.bbu.dsd.mx.bitdefender.com (Postfix) with ESMTPS id 633DB305FFAE; Thu, 20 Dec 2018 20:29:37 +0200 (EET) Received: from host.bbu.bitdefender.biz (unknown [10.10.193.111]) by smtp.bitdefender.com (Postfix) with ESMTPSA id 551B1306E479; Thu, 20 Dec 2018 20:29:37 +0200 (EET) From: =?utf-8?q?Adalbert_Laz=C4=83r?= To: kvm@vger.kernel.org Cc: Paolo Bonzini , =?utf-8?b?UmFkaW0gS3LEjW3DocWZ?= , =?utf-8?q?Mihai_Don?= =?utf-8?q?=C8=9Bu?= , =?utf-8?q?Adalbert_Laz=C4=83r?= Subject: [RFC PATCH v5 01/20] kvm: document the VM introspection API Date: Thu, 20 Dec 2018 20:28:31 +0200 Message-Id: <20181220182850.4579-2-alazar@bitdefender.com> In-Reply-To: <20181220182850.4579-1-alazar@bitdefender.com> References: <20181220182850.4579-1-alazar@bitdefender.com> MIME-Version: 1.0 Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Mihai DONTU This includes the protocol, the commands and events. Signed-off-by: Adalbert Lazăr Signed-off-by: Mihai Donțu --- Documentation/virtual/kvm/kvmi.rst | 1351 ++++++++++++++++++++++++++++ 1 file changed, 1351 insertions(+) create mode 100644 Documentation/virtual/kvm/kvmi.rst diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst new file mode 100644 index 000000000000..1cb11fdb2471 --- /dev/null +++ b/Documentation/virtual/kvm/kvmi.rst @@ -0,0 +1,1351 @@ +========================================================= +KVMI - The kernel virtual machine introspection subsystem +========================================================= + +The KVM introspection subsystem provides a facility for applications running +on the host or in a separate VM, to control the execution of other VM-s +(pause, resume, shutdown), query the state of the vCPUs (GPRs, MSRs etc.), +alter the page access bits in the shadow page tables (only for the hardware +backed ones, eg. 
Intel's EPT) and receive notifications when events of +interest have taken place (shadow page table level faults, key MSR writes, +hypercalls etc.). Some notifications can be responded to with an action +(like preventing an MSR from being written), while others are merely informative +(like breakpoint events which can be used for execution tracing). +With few exceptions, all events are optional. An application using this +subsystem will explicitly register for them. + +The use case that led to the creation of this subsystem is to monitor +the guest OS and as such the ABI/API is highly influenced by how the guest +software (kernel, applications) sees the world. For example, some events +provide information specific to the host CPU architecture +(eg. MSR_IA32_SYSENTER_EIP) merely because it is leveraged by guest software +to implement a critical feature (fast system calls). + +At the moment, the target audience for KVMI is security software authors +that wish to perform forensics on newly discovered threats (exploits) or +to implement another layer of security like preventing a large set of +kernel rootkits simply by "locking" the kernel image in the shadow page +tables (ie. enforce .text r-x, .rodata rw- etc.). It's the latter case that +made KVMI a separate subsystem, even though many of these features are +available in the device manager (eg. QEMU). The ability to build a security +application that does not interfere (in terms of performance) with the +guest software calls for a specialized interface that is designed for minimum +overhead. + +API/ABI +======= + +This chapter describes the VMI interface used to monitor and control local +guests from a user application. + +Overview +-------- + +The interface is socket-based, one connection for every VM. One end is in the +host kernel while the other is held by the user application (introspection +tool). + +The initial connection is established by an application running on the host +(eg. QEMU) that connects to the introspection tool and after a handshake the +socket is passed to the host kernel making all further communication take +place between it and the introspection tool. The initiating party (QEMU) can +close its end so that any potential exploits cannot take hold of it. + +The socket protocol allows for commands and events to be multiplexed over +the same connection. As such, it is possible for the introspection tool to +receive an event while waiting for the result of a command. Also, it can +send a command while the host kernel is waiting for a reply to an event. + +The kernel side of the socket communication is blocking and will wait for +an answer from its peer indefinitely or until the guest is powered off +(killed), restarted or the peer goes away, at which point it will wake +up and properly clean up as if the introspection subsystem had never been +used on that guest. Obviously, whether the guest can really continue +normal execution depends on whether the introspection tool has made any +modifications that require an active KVMI channel. + +All messages (commands or events) have a common header:: + + struct kvmi_msg_hdr { + __u16 id; + __u16 size; + __u32 seq; + }; + +and all need a reply with the same kind of header, having the same +sequence number (``seq``) and the same message id (``id``).
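
The sketch below is not part of the ABI; it is a hypothetical pair of user-space
helpers (the names ``kvmi_send_cmd``/``kvmi_recv_reply`` and the error handling
are invented for illustration) showing how a tool might send one command, header
and payload in a single ``sendmsg()``, and then pick up the matching reply by
``seq``::

    /* Illustrative only; assumes <sys/socket.h>, <sys/uio.h> and the
     * kvmi_msg_hdr/kvmi_error_code definitions from linux/kvmi.h. */
    static int kvmi_send_cmd(int fd, __u16 id, __u32 seq,
                             const void *data, __u16 size)
    {
        struct kvmi_msg_hdr hdr = { .id = id, .size = size, .seq = seq };
        struct iovec iov[2] = {
            { .iov_base = &hdr,         .iov_len = sizeof(hdr) },
            { .iov_base = (void *)data, .iov_len = size },
        };
        struct msghdr msg = { .msg_iov = iov, .msg_iovlen = size ? 2 : 1 };

        /* the header and its data must go out with one sendmsg() call */
        return sendmsg(fd, &msg, 0) == (ssize_t)(sizeof(hdr) + size) ? 0 : -1;
    }

    static int kvmi_recv_reply(int fd, __u32 seq, void *buf, size_t len)
    {
        struct kvmi_msg_hdr hdr;

        /* a real tool would loop here, dispatching events by hdr.id and
         * replies by hdr.seq; this sketch expects only the reply */
        if (recv(fd, &hdr, sizeof(hdr), MSG_WAITALL) != sizeof(hdr) ||
            hdr.seq != seq || hdr.size > len)
            return -1;
        return recv(fd, buf, hdr.size, MSG_WAITALL) == hdr.size ? 0 : -1;
    }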
+ +Because events from different vCPU threads can send messages at the same +time and the replies can come in any order, the receiver loop uses the +sequence number (seq) to identify which reply belongs to which vCPU, in +order to dispatch the message to the right thread waiting for it. + +After ``kvmi_msg_hdr``, ``id`` specific data of ``size`` bytes will +follow. + +The message header and its data must be sent with one ``sendmsg()`` call +to the socket. This simplifies the receiver loop and avoids +the reconstruction of messages on the other side. + +The wire protocol uses the host native byte-order. The introspection tool +must check this during the handshake and do the necessary conversion. + +A command reply begins with:: + + struct kvmi_error_code { + __s32 err; + __u32 padding; + } + +followed by the command specific data if the error code ``err`` is zero. + +The error code -KVM_ENOSYS (packed in a ``kvmi_error_code``) is returned for +unsupported commands. + +The error code -KVM_EACCES is returned for disallowed commands (see **Hooking**). + +The error code is related to the message processing. For all the other +errors (socket errors, incomplete messages, wrong sequence numbers +etc.) the socket must be closed. The device manager will be notified +and it will reconnect. + +While all commands will have a reply as soon as possible, the replies +to events will probably be delayed until a set of (new) commands will +complete:: + + Host kernel Tool + ----------- ---- + event 1 -> + <- command 1 + command 1 reply -> + <- command 2 + command 2 reply -> + <- event 1 reply + +If both ends send a message at the same time:: + + Host kernel Tool + ----------- ---- + event X -> <- command X + +the host kernel will reply to 'command X', regardless of the receive time +(before or after the 'event X' was sent). + +As it can be seen below, the wire protocol specifies occasional padding. This +is to permit working with the data by directly using C structures or to round +the structure size to a multiple of 8 bytes (64bit) to improve the copy +operations that happen during ``recvmsg()`` or ``sendmsg()``. The members +should have the native alignment of the host (4 bytes on x86). All padding +must be initialized with zero otherwise the respective commands will fail +with -KVM_EINVAL. + +To describe the commands/events, we reuse some conventions from api.txt: + + - Architectures: which instruction set architectures provide this command/event + + - Versions: which versions provide this command/event + + - Parameters: incoming message data + + - Returns: outgoing/reply message data + +Handshake +--------- + +Although this falls out of the scope of the introspection subsystem, below +is a proposal of a handshake that can be used by implementors. + +Based on the system administration policies, the management tool +(eg. libvirt) starts device managers (eg. QEMU) with some extra arguments: +what introspector could monitor/control that specific guest (and how to +connect to) and what introspection commands/events are allowed. + +The device manager will connect to the introspection tool and wait for a +cryptographic hash of a cookie that should be known by both peers. If the +hash is correct (the destination has been "authenticated"), the device +manager will send another cryptographic hash and random salt. The peer +recomputes the hash of the cookie bytes including the salt and if they match, +the device manager has been "authenticated" too. 
This is a rather crude +system that makes it difficult for device manager exploits to trick the +introspection tool into believing it is working OK. + +The cookie would normally be generated by a management tool (eg. libvirt) +and made available to the device manager and to a properly authenticated +client. It is the job of a third party to retrieve the cookie from the +management application and pass it over a secure channel to the introspection +tool. + +Once the basic "authentication" has taken place, the introspection tool +can receive information on the guest (its UUID) and other flags (endianness +or features supported by the host kernel). + +In the end, the device manager will pass the file handle (plus the allowed +commands/events) to KVM, and forget about it. It will be notified by +KVM when the introspection tool closes the file handle (in case of +errors), and should reinitiate the handshake. + +Once the file handle reaches KVM, the introspection tool should use the +*KVMI_GET_VERSION* command to get the API version, the commands and the +events (see *KVMI_CONTROL_VM_EVENTS* and *KVMI_CONTROL_EVENTS*) which +are allowed for this guest. The error code -KVM_EPERM will be returned +if the introspection tool uses a command or enables an event which is +not allowed. + +Unhooking +--------- + +During a VMI session it is possible for the guest to be patched and for +some of these patches to "talk" with the introspection tool. It thus +becomes necessary to remove them before the guest is suspended, moved +(migrated) or a snapshot with memory is created. + +The actions are normally performed by the device manager. In the case +of QEMU, it will use the *KVM_INTROSPECTION_UNHOOK* ioctl to trigger +the *KVMI_EVENT_UNHOOK* event and wait for a limited amount of time (a +few seconds) for a confirmation from the introspection tool +that it is OK to proceed. + +Live migrations +--------------- + +Before the live migration takes place, the introspection tool has to be +notified and have a chance to unhook (see **Unhooking**). + +The QEMU instance on the receiving end, if configured for KVMI, will need to +establish a connection to the introspection tool after the migration has +completed. + +Obviously, this creates a window in which the guest is not introspected. The +user will need to be aware of this detail. Future introspection +technologies can choose not to disconnect and instead transfer the necessary +context to the introspection tool at the migration destination via a separate +channel. + +Memory access safety +-------------------- + +The KVMI API gives access to the entire guest physical address space but +provides no information on which parts of it are system RAM and which are +device-specific memory (DMA, emulated MMIO, reserved by a passthrough +device etc.). It is up to the user to determine, using the guest operating +system data structures, the areas that are safe to access (code, stack, heap +etc.). + +Commands +-------- + +The following C structures are meant to be used directly when communicating +over the wire. The peer that detects any size mismatch should simply close +the connection and report the error. + +0. KVMI_GET_VERSION +------------------- + +:Architectures: all
:Versions: >= 1 +:Parameters: none +:Returns: + +:: + + struct kvmi_error_code; + struct kvmi_get_version_reply { + __u32 version; + __u32 commands; + __u32 events; + __u32 padding; + }; + +Returns the introspection API version, the bit mask with allowed commands +and the bit mask with allowed events.
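
As a purely illustrative example, building on the hypothetical ``kvmi_send_cmd()``
and ``kvmi_recv_reply()`` helpers sketched under **Overview**, the first exchange
over a freshly hooked channel might look like this::

    /* Illustrative only; KVMI_GET_VERSION, KVMI_VERSION and the reply
     * structures come from linux/kvmi.h. */
    static int check_version(int fd)
    {
        struct {
            struct kvmi_error_code ec;
            struct kvmi_get_version_reply r;
        } rep;

        if (kvmi_send_cmd(fd, KVMI_GET_VERSION, 1 /* seq */, NULL, 0) ||
            kvmi_recv_reply(fd, 1, &rep, sizeof(rep)) || rep.ec.err)
            return -1;

        /* rep.r.commands and rep.r.events hold the allowed bit masks */
        return rep.r.version >= KVMI_VERSION ? 0 : -1;
    }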
+ +These two masks represent all the features allowed by the management tool +(see **Handshake**) or supported by the host. + +The host kernel and the userland can use these macros to check if +a command/event is allowed for a guest:: + + KVMI_ALLOWED_COMMAND(cmd_id, cmd_mask) + KVMI_ALLOWED_VM_EVENT(event_id, event_mask) + KVMI_ALLOWED_VCPU_EVENT(event_id, event_mask) + +This command is always successful. + +1. KVMI_GET_GUEST_INFO +---------------------- + +:Architectures: x86 +:Versions: >= 1 +:Parameters: + +:: + + struct kvmi_get_guest_info { + __u16 vcpu; + __u16 padding[3]; + }; + +:Returns: + +:: + + struct kvmi_error_code; + struct kvmi_get_guest_info_reply { + __u32 vcpu_count; + __u32 padding; + __u64 tsc_speed; + }; + +Returns the number of online vCPUs and the TSC frequency (in HZ) +if available. + +The parameter ``vcpu`` must be zero. It is required for consistency with +all other commands and in the future it might be used to return true +vCPU-specific information. + +:Errors: + +* -KVM_EINVAL - the selected vCPU is invalid +* -KVM_EAGAIN - the selected vCPU can't be introspected yet + +2. KVMI_PAUSE_ALL_VCPUS +------------------------ + +:Architecture: all +:Versions: >= 1 +:Parameters: none + +:Returns: + +:: + + struct kvmi_error_code; + struct kvmi_pause_all_vcpus_reply { + __u32 vcpu_count; + __u32 padding; + }; + +Kicks all the vCPUs from guest. Once this command returns, no VCPU will +run in guest mode and some of them may be ready to accept introspection +commands. + +This command returns the number of kicked vCPUs. All of these will send +a *KVMI_EVENT_PAUSE_VCPU* event before entering in guest mode. + +Please note that new vCPUs may be created later. The introspector should +use *KVMI_CONTROL_VM_EVENTS* to enable the *KVMI_EVENT_CREATE_VCPU* event +in order to stop these new vCPUs as well (by delaying the event reply). + +This command is always successful (unless disallowed). + +3. KVMI_GET_REGISTERS +--------------------- + +:Architectures: x86 +:Versions: >= 1 +:Parameters: + +:: + + struct kvmi_get_registers { + __u16 vcpu; + __u16 nmsrs; + __u16 padding[2]; + __u32 msrs_idx[0]; + }; + +:Returns: + +:: + + struct kvmi_error_code; + struct kvmi_get_registers_reply { + __u32 mode; + __u32 padding; + struct kvm_regs regs; + struct kvm_sregs sregs; + struct kvm_msrs msrs; + }; + +For the given vCPU and the ``nmsrs`` sized array of MSRs registers, +returns the current vCPU mode (in bytes: 2, 4 or 8), the general purpose +registers, the special registers and the requested set of MSRs. + +:Errors: + +* -KVM_EINVAL - the selected vCPU is invalid +* -KVM_EINVAL - one of the indicated MSR-s is invalid +* -KVM_EAGAIN - the selected vCPU can't be introspected yet +* -KVM_ENOMEM - not enough memory to allocate the reply + +4. KVMI_SET_REGISTERS +--------------------- + +:Architectures: x86 +:Versions: >= 1 +:Parameters: + +:: + + struct kvmi_set_registers { + __u16 vcpu; + __u16 padding[3]; + struct kvm_regs regs; + }; + +:Returns: + +:: + + struct kvmi_error_code + +Sets the general purpose registers for the given vCPU. The changes become +visible to other threads accessing the KVM vCPU structure after the event +currently being handled is replied to. + +:Errors: + +* -KVM_EINVAL - the selected vCPU is invalid +* -KVM_EAGAIN - the selected vCPU can't be introspected yet + +5. 
KVMI_GET_CPUID +----------------- + +:Architectures: x86 +:Versions: >= 1 +:Parameters: + +:: + + struct kvmi_get_cpuid { + __u16 vcpu; + __u16 padding[3]; + __u32 function; + __u32 index; + }; + +:Returns: + +:: + + struct kvmi_error_code; + struct kvmi_get_cpuid_reply { + __u32 eax; + __u32 ebx; + __u32 ecx; + __u32 edx; + }; + +Returns a CPUID leaf (as seen by the guest OS). + +:Errors: + +* -KVM_EINVAL - the selected vCPU is invalid +* -KVM_EAGAIN - the selected vCPU can't be introspected yet +* -KVM_ENOENT - the selected leaf is not present or is invalid + +6. KVMI_GET_PAGE_ACCESS +----------------------- + +:Architectures: all +:Versions: >= 1 +:Parameters: + +:: + + struct kvmi_get_page_access { + __u16 vcpu; + __u16 count; + __u16 view; + __u16 padding; + __u64 gpa[0]; + }; + +:Returns: + +:: + + struct kvmi_error_code; + struct kvmi_get_page_access_reply { + __u8 access[0]; + }; + +Returns the spte access bits (rwx) for the specified vCPU and for an array of +``count`` guest physical addresses. + +The valid access bits for *KVMI_GET_PAGE_ACCESS* and *KVMI_SET_PAGE_ACCESS* +are:: + + KVMI_PAGE_ACCESS_R + KVMI_PAGE_ACCESS_W + KVMI_PAGE_ACCESS_X + +By default, for any guest physical address, the returned access mode will +be 'rwx' (all the above bits). If the introspection tool must prevent +the code execution from a guest page, for example, it should use the +KVMI_SET_PAGE_ACCESS command to set the 'rw' bits for any guest physical +addresses contained in that page. Of course, in order to receive +page fault events when these violations take place, the KVMI_CONTROL_EVENTS +command must be used to enable this type of event (KVMI_EVENT_PF). + +On Intel hardware with multiple EPT views, the ``view`` argument selects the +EPT view (0 is primary). On all other hardware it must be zero. + +:Errors: + +* -KVM_EINVAL - the selected vCPU is invalid +* -KVM_EINVAL - the selected SPT view is invalid +* -KVM_EINVAL - a SPT view was selected but the hardware doesn't support it +* -KVM_EAGAIN - the selected vCPU can't be introspected yet +* -KVM_ENOMEM - not enough memory to allocate the reply + +7. KVMI_SET_PAGE_ACCESS +----------------------- + +:Architectures: all +:Versions: >= 1 +:Parameters: + +:: + + struct kvmi_page_access_entry { + __u64 gpa; + __u8 access; + __u8 padding[7]; + }; + + struct kvmi_set_page_access { + __u16 vcpu; + __u16 count; + __u16 view; + __u16 padding; + struct kvmi_page_access_entry entries[0]; + }; + +:Returns: + +:: + + struct kvmi_error_code + +Sets the spte access bits (rwx) for an array of ``count`` guest physical +addresses. + +The command will fail with -KVM_EINVAL if any of the specified combination +of access bits is not supported. + +The command will make the changes in order and it will not stop on errors. The +introspector tool should handle the rollback. + +In order to 'forget' an address, all the access bits ('rwx') must be set. + +:Errors: + +* -KVM_EINVAL - the selected vCPU is invalid +* -KVM_EINVAL - the specified access bits combination is invalid +* -KVM_EINVAL - the selected SPT view is invalid +* -KVM_EINVAL - a SPT view was selected but the hardware doesn't support it +* -KVM_EAGAIN - the selected vCPU can't be introspected yet +* -KVM_ENOMEM - not enough memory to add the page tracking structures + +8. 
KVMI_INJECT_EXCEPTION +------------------------ + +:Architectures: x86 +:Versions: >= 1 +:Parameters: + +:: + + struct kvmi_inject_exception { + __u16 vcpu; + __u8 nr; + __u8 has_error; + __u16 error_code; + __u16 padding; + __u64 address; + }; + +:Returns: + +:: + + struct kvmi_error_code + +Injects a vCPU exception with or without an error code. In case of page fault +exception, the guest virtual address has to be specified. + +:Errors: + +* -KVM_EINVAL - the selected vCPU is invalid +* -KVM_EINVAL - the specified exception number is invalid +* -KVM_EINVAL - the specified address is invalid +* -KVM_EAGAIN - the selected vCPU can't be introspected yet + +9. KVMI_READ_PHYSICAL +--------------------- + +:Architectures: all +:Versions: >= 1 +:Parameters: + +:: + + struct kvmi_read_physical { + __u64 gpa; + __u64 size; + }; + +:Returns: + +:: + + struct kvmi_error_code; + __u8 data[0]; + +Reads from the guest memory. + +Currently, the size must be non-zero and the read must be restricted to +one page (offset + size <= PAGE_SIZE). + +:Errors: + +* -KVM_EINVAL - the specified gpa is invalid + +10. KVMI_WRITE_PHYSICAL +----------------------- + +:Architectures: all +:Versions: >= 1 +:Parameters: + +:: + + struct kvmi_write_physical { + __u64 gpa; + __u64 size; + __u8 data[0]; + }; + +:Returns: + +:: + + struct kvmi_error_code + +Writes into the guest memory. + +Currently, the size must be non-zero and the write must be restricted to +one page (offset + size <= PAGE_SIZE). + +:Errors: + +* -KVM_EINVAL - the specified gpa is invalid + +11. KVMI_CONTROL_EVENTS +----------------------- + +:Architectures: all +:Versions: >= 1 +:Parameters: + +:: + + struct kvmi_control_events { + __u16 vcpu; + __u16 padding; + __u32 events; + }; + +:Returns: + +:: + + struct kvmi_error_code + +Enables/disables vCPU introspection events, by setting or clearing one or +more of the following bits:: + + KVMI_EVENT_CR_FLAG + KVMI_EVENT_MSR_FLAG + KVMI_EVENT_XSETBV_FLAG + KVMI_EVENT_BREAKPOINT_FLAG + KVMI_EVENT_HYPERCALL_FLAG + KVMI_EVENT_PF_FLAG + KVMI_EVENT_TRAP_FLAG + KVMI_EVENT_DESCRIPTOR_FLAG + +For example: + + ``events = KVMI_EVENT_BREAKPOINT_FLAG | KVMI_EVENT_PF`` + +it will disable all events but breakpoints and page faults. + +When an event is enabled, the introspection tool is notified and it +must return a reply: allow, skip, etc. (see 'Events' below). + +The *KVMI_EVENT_PAUSE_VCPU* event is always allowed because +it is triggered by the *KVMI_PAUSE_ALL_VCPUS* command. The +*KVMI_EVENT_CREATE_VCPU* and *KVMI_UNHOOK_EVENT* events are controlled +by the *KVMI_CONTROL_VM_EVENTS* command. + +:Errors: + +* -KVM_EINVAL - the selected vCPU is invalid +* -KVM_EINVAL - the specified mask of events is invalid +* -KVM_EAGAIN - the selected vCPU can't be introspected yet +* -KVM_EPERM - access to one or more events specified in the events mask is + restricted by the host + +12. KVMI_CONTROL_CR +------------------- + +:Architectures: x86 +:Versions: >= 1 +:Parameters: + +:: + + struct kvmi_control_cr { + __u16 vcpu; + __u8 enable; + __u8 padding; + __u32 cr; + }; + +:Returns: + +:: + + struct kvmi_error_code + +Enables/disables introspection for a specific control register and must +be used in addition to *KVMI_CONTROL_EVENTS* with the *KVMI_EVENT_CR_FLAG* +bit set. + +:Errors: + +* -KVM_EINVAL - the selected vCPU is invalid +* -KVM_EINVAL - the specified control register is not part of the CR0, CR3 + or CR4 set +* -KVM_EAGAIN - the selected vCPU can't be introspected yet + +13. 
KVMI_CONTROL_MSR +-------------------- + +:Architectures: x86 +:Versions: >= 1 +:Parameters: + +:: + + struct kvmi_control_msr { + __u16 vcpu; + __u8 enable; + __u8 padding; + __u32 msr; + }; + +:Returns: + +:: + + struct kvmi_error_code + +Enables/disables introspection for a specific MSR and must be used +in addition to *KVMI_CONTROL_EVENTS* with the *KVMI_EVENT_MSR_FLAG* bit set. + +Currently, only MSRs within the following two ranges are supported. Trying +to control events for any other register will fail with -KVM_EINVAL:: + + 0 ... 0x00001fff + 0xc0000000 ... 0xc0001fff + +:Errors: + +* -KVM_EINVAL - the selected vCPU is invalid +* -KVM_EINVAL - the specified MSR is invalid +* -KVM_EAGAIN - the selected vCPU can't be introspected yet + +14. KVMI_CONTROL_VE +------------------- + +:Architecture: x86 +:Versions: >= 1 +:Parameters: + +:: + + struct kvmi_control_ve { + __u16 vcpu; + __u16 count; + __u8 enable; + __u8 padding[3]; + __u64 gpa[0] + }; + +:Returns: + +:: + + struct kvmi_error_code + +On hardware supporting virtualized exceptions, this command can control +the #VE bit for the listed guest physical addresses. + +:Errors: + +* -KVM_EINVAL - the selected vCPU is invalid +* -KVM_EINVAL - one of the specified gpa-s is invalid +* -KVM_EINVAL - the hardware does not support #VE +* -KVM_EAGAIN - the selected vCPU can't be introspected yet + +.. note:: + + Virtualized exceptions are designed such that they can be controlled by + the guest itself and used for (among others) accelerate network + operations. Since this will obviously interfere with VMI, the guest + is denied access to VE while the introspection channel is active. + +15. KVMI_CONTROL_VM_EVENTS +-------------------------- + +:Architectures: all +:Versions: >= 1 +:Parameters: + +:: + + struct kvmi_control_vm_events { + __u32 events; + __u32 padding; + }; + +:Returns: + +:: + + struct kvmi_error_code + +Enables/disables VM introspection events, by setting or clearing one or +more of the following bits:: + + KVMI_EVENT_CREATE_VCPU_FLAG + KVMI_EVENT_UNHOOK_FLAG + +For example: + + ``events = KVMI_EVENT_UNHOOK_FLAG`` + +it will enable *KVMI_EVENT_UNHOOK_FLAG* event only. + +When an event is enabled, the introspection tool is notified and it +must return a reply: allow, skip, etc. (see 'Events' below). + +:Errors: + +* -KVM_EINVAL - the specified mask of events is invalid +* -KVM_EPERM - access to one or more events specified in the events mask is + restricted by the host + + +16. KVMI_GET_MAP_TOKEN +---------------------- + +:Architecture: all +:Versions: >= 1 +:Parameters: none +:Returns: + +:: + + struct kvmi_error_code; + struct kvmi_get_map_token_reply { + struct kvmi_map_mem_token token; + }; + +Where:: + + struct kvmi_map_mem_token { + __u64 token[4]; + }; + +Requests a token for a memory map operation. + +On this command, the host generates a random token to be used (once) +to map a physical page from the introspected guest. The introspector +could use the token with the KVM_INTRO_MEM_MAP ioctl (on /dev/kvmmem) +to map a guest physical page to one of its memory pages. The ioctl, +in turn, will use the KVM_HC_MEM_MAP hypercall (see hypercalls.txt). + +The guest kernel exposing /dev/kvmmem keeps a list with all the mappings +(to all the guests introspected by the tool) in order to unmap them +(using the KVM_HC_MEM_UNMAP hypercall) when /dev/kvmmem is closed or on +demand (using the KVM_INTRO_MEM_UNMAP ioctl). 
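
A rough sketch of the expected flow, as far as it can be inferred from the
structures above and from hypercalls.txt, is shown below. The helper name, the
page size, the ``open()`` flags and the error handling are assumptions made for
illustration only::

    /* Sketch: maps one page of the introspected guest at a local address.
     * Assumes <fcntl.h>, <sys/ioctl.h>, <sys/mman.h> and linux/kvmi.h.
     * 'token' is the kvmi_map_mem_token received in the KVMI_GET_MAP_TOKEN
     * reply and 'gpa' is the remote page of interest. */
    static void *map_remote_page(const struct kvmi_map_mem_token *token,
                                 __u64 gpa)
    {
        struct kvmi_mem_map map;
        void *local;
        int fd;

        local = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        fd = open("/dev/kvmmem", O_RDWR);
        if (local == MAP_FAILED || fd < 0)
            return NULL;

        map.token = *token;                     /* one-shot token */
        map.gpa = gpa;                          /* remote page */
        map.gva = (__u64)(unsigned long)local;  /* local mapping to back */

        /* fd is deliberately kept open: closing /dev/kvmmem tears down
         * all mappings; KVM_INTRO_MEM_UNMAP undoes a single one. */
        return ioctl(fd, KVM_INTRO_MEM_MAP, &map) == 0 ? local : NULL;
    }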
+ +:Errors: + +* -KVM_EAGAIN - too many tokens have accumulated +* -KVM_ENOMEM - not enough memory to allocate a new token + +17. KVMI_GET_XSAVE +------------------ + +:Architecture: x86 +:Versions: >= 1 +:Parameters: + +:: + + struct kvmi_get_xsave { + __u16 vcpu; + __u16 padding[3]; + }; + +:Returns: + +:: + + struct kvmi_error_code; + struct kvmi_get_xsave_reply { + __u32 region[0]; + }; + +Returns a buffer containing the XSAVE area. Currently, the size of +``kvm_xsave`` is used, but it could change. The userspace should get +the buffer size from the message size. + +:Errors: + +* -KVM_EINVAL - the selected vCPU is invalid +* -KVM_EAGAIN - the selected vCPU can't be introspected yet +* -KVM_ENOMEM - not enough memory to allocate the reply + +Events +====== + +All vCPU events are sent using the *KVMI_EVENT* message id. No event will +be sent unless it is explicitly enabled with a *KVMI_CONTROL_EVENTS* +or a *KVMI_CONTROL_VM_EVENTS* command or requested, as it is the case +with the *KVMI_EVENT_PAUSE_VCPU* event (see **KVMI_PAUSE_ALL_VCPUS**). + +The message data begins with a common structure, having the vCPU id, +its mode (in bytes: 2, 4 and 8) and the event:: + + struct kvmi_event { + __u32 event; + __u16 vcpu; + __u8 mode; + __u8 padding; + /* arch/event specific data */ + } + +On x86 the structure looks like this:: + + struct kvmi_event { + __u32 event; + __u16 vcpu; + __u8 mode; + __u8 padding; + struct kvm_regs regs; + struct kvm_sregs sregs; + struct { + __u64 sysenter_cs; + __u64 sysenter_esp; + __u64 sysenter_eip; + __u64 efer; + __u64 star; + __u64 lstar; + __u64 cstar; + __u64 pat; + __u64 shadow_gs; + } msrs; + }; + +It contains information about the vCPU state at the time of the event. + +The replies to events have the *KVMI_EVENT_REPLY* message id and begin +with a common structure:: + + struct kvmi_event_reply { + __u32 action; + __u32 event; + }; + +All events accept the KVMI_EVENT_ACTION_CRASH action, which stops the +guest ungracefully but as soon as possible. + +Most of the events accept the KVMI_EVENT_ACTION_CONTINUE action, which +lets the instruction that caused the event to continue (unless specified +otherwise). + +Some of the events accept the KVMI_EVENT_ACTION_RETRY action, to continue +by re-entering the guest. + +Specific data can follow these common structures. + +0. KVMI_EVENT_PAUSE_VCPU +------------------------ + +:Architectures: all +:Versions: >= 1 +:Actions: CONTINUE, CRASH +:Parameters: + +:: + + struct kvmi_event + +:Returns: + +:: + + struct kvmi_event_reply + +This event is sent in response to a *KVMI_PAUSE_ALL_VCPUS* command and +cannot be disabled via *KVMI_CONTROL_EVENTS*. + +One event is sent for every *KVMI_PAUSE_ALL_VCPUS* command. + +This event has a low priority. It will be sent after any other vCPU +introspection event and when no vCPU introspection command is queued. + +1. KVMI_EVENT_CR +---------------- + +:Architectures: x86 +:Versions: >= 1 +:Actions: CONTINUE, CRASH +:Parameters: + +:: + + struct kvmi_event; + struct kvmi_event_cr { + __u16 cr; + __u16 padding[3]; + __u64 old_value; + __u64 new_value; + }; + +:Returns: + +:: + + struct kvmi_event_reply; + struct kvmi_event_cr_reply { + __u64 new_val; + }; + +This event is sent when a control register is going to be changed and the +introspection has been enabled for this event and for this specific +register (see *KVMI_CONTROL_EVENTS* and *KVMI_CONTROL_CR_FLAG*). + +``kvmi_event``, the control register number, the old value and the new value +are sent to the introspector. 
The *CONTINUE* action will set the ``new_val``. + +2. KVMI_EVENT_MSR +----------------- + +:Architectures: x86 +:Versions: >= 1 +:Actions: CONTINUE, CRASH +:Parameters: + +:: + + struct kvmi_event; + struct kvmi_event_msr { + __u32 msr; + __u32 padding; + __u64 old_value; + __u64 new_value; + }; + +:Returns: + +:: + + struct kvmi_event_reply; + struct kvmi_event_msr_reply { + __u64 new_val; + }; + +This event is sent when a model specific register is going to be changed +and the introspection has been enabled for this event and for this specific +register (see *KVMI_CONTROL_EVENTS* and *KVMI_CONTROL_MSR_FLAG*). + +``kvmi_event``, the MSR number, the old value and the new value are +sent to the introspector. The *CONTINUE* action will set the ``new_val``. + +3. KVMI_EVENT_XSETBV +-------------------- + +:Architectures: x86 +:Versions: >= 1 +:Actions: CONTINUE, CRASH +:Parameters: + +:: + + struct kvmi_event; + +:Returns: + +:: + + struct kvmi_event_reply; + +This event is sent when the extended control register XCR0 is going +to be changed and the introspection has been enabled for this event +(see *KVMI_CONTROL_EVENTS*). + +``kvmi_event`` is sent to the introspector. + +4. KVMI_EVENT_BREAKPOINT +------------------------ + +:Architectures: x86 +:Versions: >= 1 +:Actions: CONTINUE, CRASH, RETRY +:Parameters: + +:: + + struct kvmi_event; + struct kvmi_event_breakpoint { + __u64 gpa; + }; + +:Returns: + +:: + + struct kvmi_event_reply; + +This event is sent when a breakpoint was reached and the introspection has +been enabled for this event (see *KVMI_CONTROL_EVENTS*). + +Some of these breakpoints could have been injected by the introspector, +placed in the slack space of various functions and used as notification +for when the OS or an application has reached a certain state or is +trying to perform a certain operation (like creating a process). + +``kvmi_event`` and the guest physical address are sent to the introspector. + +The *RETRY* action is used by the introspector for its own breakpoints. + +5. KVMI_EVENT_HYPERCALL +----------------------- + +:Architectures: x86 +:Versions: >= 1 +:Actions: CONTINUE, CRASH +:Parameters: + +:: + + struct kvmi_event + +:Returns: + +:: + + struct kvmi_event_reply + +This event is sent on a specific user hypercall when the introspection has +been enabled for this event (see *KVMI_CONTROL_EVENTS*). + +The hypercall number must be ``KVM_HC_XEN_HVM_OP`` with the +``KVM_HC_XEN_HVM_OP_GUEST_REQUEST_VM_EVENT`` sub-function +(see hypercalls.txt). + +It is used by the code residing inside the introspected guest to call the +introspection tool and to report certain details about its operation. For +example, a classic antimalware remediation tool can report what it has +found during a scan. + +6. KVMI_EVENT_PF +---------------- + +:Architectures: x86 +:Versions: >= 1 +:Actions: CONTINUE, CRASH, RETRY +:Parameters: + +:: + + struct kvmi_event; + struct kvmi_event_pf { + __u64 gva; + __u64 gpa; + __u32 mode; + __u32 padding; + }; + +:Returns: + +:: + + struct kvmi_event_reply; + struct kvmi_event_pf_reply { + __u8 singlestep; + __u8 rep_complete; + __u16 padding; + __u32 ctx_size; + __u8 ctx_data[256]; + }; + +This event is sent when a hypervisor page fault occurs due to a failed +permission check in the shadow page tables, the introspection has +been enabled for this event (see *KVMI_CONTROL_EVENTS*) and the event was +generated for a page in which the introspector has shown interest +(ie. has previously touched it by adjusting the spte permissions). 
+ +The shadow page tables can be used by the introspection tool to guarantee +the purpose of code areas inside the guest (code, rodata, stack, heap +etc.) Each attempt at an operation unfitting for a certain memory +range (eg. execute code in heap) triggers a page fault and gives the +introspection tool the chance to audit the code attempting the operation. + +``kvmi_event``, guest virtual address, guest physical address and the +exit qualification (mode) are sent to the introspector. + +The *CONTINUE* action will continue the page fault handling via emulation +(with custom input if ``ctx_size`` > 0). The use of custom input is +to trick the guest software into believing it has read certain data, +in order to hide the content of certain memory areas (eg. hide injected +code from integrity checkers). If ``rep_complete`` is not zero, the REP +prefixed instruction should be emulated just once (or at least no other +*KVMI_EVENT_PF* event should be sent for the current instruction). + +The *RETRY* action is used by the introspector to retry the execution of +the current instruction. Either using single-step (if ``singlestep`` is +not zero) or return to guest (if the introspector changed the instruction +pointer or the page restrictions). + +7. KVMI_EVENT_TRAP +------------------ + +:Architectures: x86 +:Versions: >= 1 +:Actions: CONTINUE, CRASH +:Parameters: + +:: + + struct kvmi_event; + struct kvmi_event_trap { + __u32 vector; + __u32 type; + __u32 error_code; + __u32 padding; + __u64 cr2; + }; + +:Returns: + +:: + + struct kvmi_event_reply; + +This event is sent if a previous *KVMI_INJECT_EXCEPTION* command has +been overwritten by an interrupt picked up during guest reentry and the +introspection has been enabled for this event (see *KVMI_CONTROL_EVENTS*). + +``kvmi_event``, exception/interrupt number (vector), exception/interrupt +type, exception code (``error_code``) and CR2 are sent to the introspector. + +8. KVMI_EVENT_CREATE_VCPU +------------------------- + +:Architectures: all +:Versions: >= 1 +:Actions: CONTINUE, CRASH +:Parameters: + +:: + + struct kvmi_event + +:Returns: + +:: + + struct kvmi_event_reply + +This event is sent when a new vCPU is created and the introspection has +been enabled for this event (see *KVMI_CONTROL_VM_EVENTS*). + +9. KVMI_EVENT_DESCRIPTOR +------------------------- + +:Architecture: x86 +:Versions: >= 1 +:Actions: CONTINUE, CRASH +:Parameters: + +:: + + struct kvmi_event + struct kvmi_event_descriptor { + union { + struct { + __u32 instr_info; + __u32 padding; + __u64 exit_qualification; + } vmx; + struct { + __u64 exit_info; + __u64 padding; + } svm; + } arch; + __u8 descriptor; + __u8 write; + __u8 padding[6]; + }; + +:Returns: + +:: + + struct kvmi_event_reply + +This event is sent when a descriptor table register is accessed and the +introspection has been enabled for this event (see *KVMI_CONTROL_EVENTS* +and *KVMI_EVENT_DESCRIPTOR_FLAG*). + +``kvmi_event`` and ``kvmi_event_descriptor`` are sent to the introspector. + +``descriptor`` can be one of:: + + KVMI_DESC_IDTR + KVMI_DESC_GDTR + KVMI_DESC_LDTR + KVMI_DESC_TR + +``write`` is 1 if the descriptor was written, 0 otherwise. + +10. KVMI_EVENT_UNHOOK +--------------------- + +:Architecture: all +:Versions: >= 1 +:Actions: CONTINUE, CRASH +:Parameters: + +:: + + struct kvmi_event + +:Returns: + +:: + + struct kvmi_event_reply + +This event is sent when the device manager (ie. 
QEMU) has to +pause/stop/migrate the guest (see **Unhooking**) and the introspection +has been enabled for this event (see **KVMI_CONTROL_VM_EVENTS**). +The introspection tool has a chance to unhook and close the KVMI channel +(signaling that the operation can proceed). From patchwork Thu Dec 20 18:28:32 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Adalbert_Laz=C4=83r?= X-Patchwork-Id: 10739285 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 16F0717E1 for ; Thu, 20 Dec 2018 18:36:06 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 07D0B28E58 for ; Thu, 20 Dec 2018 18:36:06 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id F06E028F39; Thu, 20 Dec 2018 18:36:05 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id AAE7C28E58 for ; Thu, 20 Dec 2018 18:36:05 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2389326AbeLTSgE (ORCPT ); Thu, 20 Dec 2018 13:36:04 -0500 Received: from mx01.bbu.dsd.mx.bitdefender.com ([91.199.104.161]:44332 "EHLO mx01.bbu.dsd.mx.bitdefender.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2389303AbeLTSgD (ORCPT ); Thu, 20 Dec 2018 13:36:03 -0500 Received: from smtp.bitdefender.com (smtp02.buh.bitdefender.net [10.17.80.76]) by mx01.bbu.dsd.mx.bitdefender.com (Postfix) with ESMTPS id 697AB305FFAF; Thu, 20 Dec 2018 20:29:37 +0200 (EET) Received: from host.bbu.bitdefender.biz (unknown [10.10.193.111]) by smtp.bitdefender.com (Postfix) with ESMTPSA id 627DF306E47A; Thu, 20 Dec 2018 20:29:37 +0200 (EET) From: =?utf-8?q?Adalbert_Laz=C4=83r?= To: kvm@vger.kernel.org Cc: Paolo Bonzini , =?utf-8?b?UmFkaW0gS3LEjW3DocWZ?= , =?utf-8?q?Mihai_Don?= =?utf-8?q?=C8=9Bu?= , =?utf-8?q?Adalbert_Laz=C4=83r?= Subject: [RFC PATCH v5 02/20] kvm: document the VM introspection ioctl-s and capability Date: Thu, 20 Dec 2018 20:28:32 +0200 Message-Id: <20181220182850.4579-3-alazar@bitdefender.com> In-Reply-To: <20181220182850.4579-1-alazar@bitdefender.com> References: <20181220182850.4579-1-alazar@bitdefender.com> MIME-Version: 1.0 Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP The uuid member of struct kvm_introspection is used to show the guest id in the error messages. Signed-off-by: Adalbert Lazăr --- Documentation/virtual/kvm/api.txt | 59 +++++++++++++++++++++++++++++++ 1 file changed, 59 insertions(+) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index cd209f7730af..e263e3d4b9e4 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -3753,6 +3753,58 @@ Coalesced pio is based on coalesced mmio. There is little difference between coalesced mmio and pio except that coalesced pio records accesses to I/O ports. 
+4.998 KVM_INTROSPECTION_HOOK + +Capability: KVM_CAP_INTROSPECTION +Architectures: x86 +Type: vm ioctl +Parameters: struct kvm_introspection (in) +Returns: 0 on success, a negative value on error + +This ioctl is used to enable the introspection of the current VM. + +struct kvm_introspection { + int fd; + __u32 padding; + __u32 commands; + __u32 events; + __u8 uuid[16]; +}; + +fd is the file handle of a socket connected to the introspection tool, + +commands is a bitmask with the introspection commands allowed on this VM, + +events is a bitmask with the introspection events allowed on this VM. + +uuid is used for debug and error messages. + +It can fail with -EFAULT if: + - memory allocation failed + - this VM is already introspected + - the file handle doesn't correspond to an active socket + +The KVMI version can be retrieved using the KVM_CAP_INTROSPECTION of +the KVM_CHECK_EXTENSION ioctl() at run-time. + +4.999 KVM_INTROSPECTION_UNHOOK + +Capability: KVM_CAP_INTROSPECTION +Architectures: x86 +Type: vm ioctl +Parameters: none +Returns: 0 on success, a negative value on error + +This ioctl is used to disable the introspection of the current VM. +It is useful when the VM is paused/suspended/migrated. + +It can fail with -EFAULT if: + - the introspection is not enabled + - the socket (passed with KVM_INTROSPECTION_HOOK) had an error + +If the ioctl is successful, the userspace should give the introspection +tool a chance to unhook the VM. + 5. The kvm_run structure ------------------------ @@ -4888,6 +4940,7 @@ CPU when the exception is taken. If this virtual SError is taken to EL1 using AArch64, this value will be reported in the ISS field of ESR_ELx. See KVM_CAP_VCPU_EVENTS for more details. + 8.20 KVM_CAP_HYPERV_SEND_IPI Architectures: x86 @@ -4895,3 +4948,9 @@ Architectures: x86 This capability indicates that KVM supports paravirtualized Hyper-V IPI send hypercalls: HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx. 
+ +8.99 KVM_CAP_INTROSPECTION + +Architectures: all +Parameters: none +Returns: KVMI version on success, 0 on error From patchwork Thu Dec 20 18:28:33 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Adalbert_Laz=C4=83r?= X-Patchwork-Id: 10739305 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 28D806C2 for ; Thu, 20 Dec 2018 18:36:21 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 1A5A928E58 for ; Thu, 20 Dec 2018 18:36:21 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 0F1F128F39; Thu, 20 Dec 2018 18:36:21 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id B197E28E58 for ; Thu, 20 Dec 2018 18:36:20 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2389350AbeLTSgT (ORCPT ); Thu, 20 Dec 2018 13:36:19 -0500 Received: from mx01.bbu.dsd.mx.bitdefender.com ([91.199.104.161]:44328 "EHLO mx01.bbu.dsd.mx.bitdefender.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2389301AbeLTSgE (ORCPT ); Thu, 20 Dec 2018 13:36:04 -0500 Received: from smtp.bitdefender.com (smtp02.buh.bitdefender.net [10.17.80.76]) by mx01.bbu.dsd.mx.bitdefender.com (Postfix) with ESMTPS id 6DBE2305FFB0; Thu, 20 Dec 2018 20:29:37 +0200 (EET) Received: from host.bbu.bitdefender.biz (unknown [10.10.193.111]) by smtp.bitdefender.com (Postfix) with ESMTPSA id 65F66306E47B; Thu, 20 Dec 2018 20:29:37 +0200 (EET) From: =?utf-8?q?Adalbert_Laz=C4=83r?= To: kvm@vger.kernel.org Cc: Paolo Bonzini , =?utf-8?b?UmFkaW0gS3LEjW3DocWZ?= , =?utf-8?q?Mihai_Don?= =?utf-8?q?=C8=9Bu?= , =?utf-8?q?Adalbert_Laz=C4=83r?= Subject: [RFC PATCH v5 03/20] kvm: document the VM introspection hypercalls Date: Thu, 20 Dec 2018 20:28:33 +0200 Message-Id: <20181220182850.4579-4-alazar@bitdefender.com> In-Reply-To: <20181220182850.4579-1-alazar@bitdefender.com> References: <20181220182850.4579-1-alazar@bitdefender.com> MIME-Version: 1.0 Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Mihai DONTU KVM_HC_XEN_HVM_OP is used by the agent injected in the introspected guest, while KVM_HC_MEM_MAP and KVM_HC_MEM_UNMAP are used by the introspection tool running in another guest. Signed-off-by: Adalbert Lazăr Signed-off-by: Mihai Donțu --- Documentation/virtual/kvm/hypercalls.txt | 68 +++++++++++++++++++++++- 1 file changed, 67 insertions(+), 1 deletion(-) diff --git a/Documentation/virtual/kvm/hypercalls.txt b/Documentation/virtual/kvm/hypercalls.txt index da24c138c8d1..f59fc2c9698e 100644 --- a/Documentation/virtual/kvm/hypercalls.txt +++ b/Documentation/virtual/kvm/hypercalls.txt @@ -122,7 +122,7 @@ compute the CLOCK_REALTIME for its clock, at the same instant. Returns KVM_EOPNOTSUPP if the host does not use TSC clocksource, or if clock type is different than KVM_CLOCK_PAIRING_WALLCLOCK. -6. KVM_HC_SEND_IPI +7. 
KVM_HC_SEND_IPI ------------------------ Architecture: x86 Status: active @@ -141,3 +141,69 @@ a0 corresponds to the APIC ID in the third argument (a2), bit 1 corresponds to the APIC ID a2+1, and so on. Returns the number of CPUs to which the IPIs were delivered successfully. + +8. KVM_HC_XEN_HVM_OP +-------------------- + +Architecture: x86 +Status: active +Purpose: To enable communication between a guest agent and a VMI application +Usage: + +An event will be sent to the VMI application (see kvmi.rst) if the following +registers, which differ between 32bit and 64bit, have the following values: + + 32bit 64bit value + --------------------------- + ebx (a0) rdi KVM_HC_XEN_HVM_OP_GUEST_REQUEST_VM_EVENT + ecx (a1) rsi 0 + +This specification copies Xen's { __HYPERVISOR_hvm_op, +HVMOP_guest_request_vm_event } hypercall and can originate from kernel or +userspace. + +It returns 0 if successful, or a negative POSIX.1 error code if it fails. The +absence of an active VMI application is not signaled in any way. + +The following registers are clobbered: + + * 32bit: edx, esi, edi, ebp + * 64bit: rdx, r10, r8, r9 + +In particular, for KVM_HC_XEN_HVM_OP_GUEST_REQUEST_VM_EVENT, the last two +registers can be poisoned deliberately and cannot be used for passing +information. + +9. KVM_HC_MEM_MAP +----------------- + +Architecture: x86 +Status: active +Purpose: Map a guest physical page to another VM (the introspector). +Usage: + +a0: pointer to a token obtained with a KVMI_GET_MAP_TOKEN command (see kvmi.rst) + struct kvmi_map_mem_token { + __u64 token[4]; + }; + +a1: guest physical address to be mapped + +a2: guest physical address from introspector that will be replaced + +Both guest physical addresses will end up poiting to the same physical page. + +Returns KVM_EFAULT in case of an error. + +10. KVM_HC_MEM_UNMAP +------------------- + +Architecture: x86 +Status: active +Purpose: Unmap a previously mapped page. +Usage: + +a0: guest physical address from introspector + +The address will stop pointing to the introspected page and a new physical +page is allocated for this gpa. 
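
As an illustration of section 8 above (not part of this patch), a 64-bit guest
agent could issue the request roughly as follows. This sketch assumes the
hypercall number is passed in RAX, as with the other KVM hypercalls, that
KVM_HC_XEN_HVM_OP_GUEST_REQUEST_VM_EVENT is provided by the KVMI headers, and
that the guest runs on Intel hardware (AMD would use VMMCALL instead of VMCALL):

    /* 64-bit guest-side sketch; register layout as documented above. */
    static long kvmi_guest_request_vm_event(void)
    {
        long ret;

        asm volatile("vmcall"
                     : "=a"(ret)
                     : "a"(KVM_HC_XEN_HVM_OP),
                       "D"(KVM_HC_XEN_HVM_OP_GUEST_REQUEST_VM_EVENT),
                       "S"(0L)
                     : "rdx", "r10", "r8", "r9", "memory");

        return ret;     /* 0 on success, negative POSIX.1 error otherwise */
    }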
From patchwork Thu Dec 20 18:28:34 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Adalbert_Laz=C4=83r?= X-Patchwork-Id: 10739301 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 2E5CA13B5 for ; Thu, 20 Dec 2018 18:36:19 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 1A92E28F35 for ; Thu, 20 Dec 2018 18:36:19 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 0F35128F3A; Thu, 20 Dec 2018 18:36:19 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 98C1828F35 for ; Thu, 20 Dec 2018 18:36:18 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2389335AbeLTSgG (ORCPT ); Thu, 20 Dec 2018 13:36:06 -0500 Received: from mx01.bbu.dsd.mx.bitdefender.com ([91.199.104.161]:44324 "EHLO mx01.bbu.dsd.mx.bitdefender.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2389306AbeLTSgG (ORCPT ); Thu, 20 Dec 2018 13:36:06 -0500 Received: from smtp.bitdefender.com (smtp02.buh.bitdefender.net [10.17.80.76]) by mx01.bbu.dsd.mx.bitdefender.com (Postfix) with ESMTPS id 72C9F305FFB1; Thu, 20 Dec 2018 20:29:37 +0200 (EET) Received: from host.bbu.bitdefender.biz (unknown [10.10.193.111]) by smtp.bitdefender.com (Postfix) with ESMTPSA id 699A9306E47C; Thu, 20 Dec 2018 20:29:37 +0200 (EET) From: =?utf-8?q?Adalbert_Laz=C4=83r?= To: kvm@vger.kernel.org Cc: Paolo Bonzini , =?utf-8?b?UmFkaW0gS3LEjW3DocWZ?= , =?utf-8?q?Mihai_Don?= =?utf-8?q?=C8=9Bu?= , =?utf-8?q?Adalbert_Laz=C4=83r?= , =?utf-8?b?Tmlj?= =?utf-8?b?dciZb3IgQ8OuyJt1?= , =?utf-8?q?Mircea_C?= =?utf-8?q?=C3=AErjaliu?= , Marian Rotariu Subject: [RFC PATCH v5 04/20] kvm: add the VM introspection API/ABI headers Date: Thu, 20 Dec 2018 20:28:34 +0200 Message-Id: <20181220182850.4579-5-alazar@bitdefender.com> In-Reply-To: <20181220182850.4579-1-alazar@bitdefender.com> References: <20181220182850.4579-1-alazar@bitdefender.com> MIME-Version: 1.0 Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Mihai DONTU And more KVM_E* error codes in kvm_para.h Signed-off-by: Mihai Donțu Signed-off-by: Adalbert Lazăr Signed-off-by: Nicușor Cîțu Signed-off-by: Mircea Cîrjaliu Signed-off-by: Marian Rotariu --- arch/x86/include/uapi/asm/kvmi.h | 234 +++++++++++++++++++++++++++++++ include/uapi/linux/kvm.h | 12 ++ include/uapi/linux/kvm_para.h | 11 +- include/uapi/linux/kvmi.h | 192 +++++++++++++++++++++++++ 4 files changed, 448 insertions(+), 1 deletion(-) create mode 100644 arch/x86/include/uapi/asm/kvmi.h create mode 100644 include/uapi/linux/kvmi.h diff --git a/arch/x86/include/uapi/asm/kvmi.h b/arch/x86/include/uapi/asm/kvmi.h new file mode 100644 index 000000000000..d114f82bc070 --- /dev/null +++ b/arch/x86/include/uapi/asm/kvmi.h @@ -0,0 +1,234 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _UAPI_ASM_X86_KVMI_H +#define _UAPI_ASM_X86_KVMI_H + +/* + * KVMI x86 specific structures and definitions + 
* + */ + +#include +#include + +#define KVMI_EVENT_UNHOOK 0 +#define KVMI_EVENT_CR 1 /* control register was modified */ +#define KVMI_EVENT_MSR 2 /* model specific reg. was modified */ +#define KVMI_EVENT_XSETBV 3 /* ext. control register was modified */ +#define KVMI_EVENT_BREAKPOINT 4 /* breakpoint was reached */ +#define KVMI_EVENT_HYPERCALL 5 /* user hypercall */ +#define KVMI_EVENT_PF 6 /* hyp. page fault was encountered */ +#define KVMI_EVENT_TRAP 7 /* trap was injected */ +#define KVMI_EVENT_DESCRIPTOR 8 /* descriptor table access */ +#define KVMI_EVENT_CREATE_VCPU 9 +#define KVMI_EVENT_PAUSE_VCPU 10 +/* TODO: find a way to split the events between common and arch dependent */ +#define KVMI_NUM_EVENTS 11 /* increment with each added event */ + +#define KVMI_EVENT_UNHOOK_FLAG (1 << KVMI_EVENT_UNHOOK) +#define KVMI_EVENT_CR_FLAG (1 << KVMI_EVENT_CR) +#define KVMI_EVENT_MSR_FLAG (1 << KVMI_EVENT_MSR) +#define KVMI_EVENT_XSETBV_FLAG (1 << KVMI_EVENT_XSETBV) +#define KVMI_EVENT_BREAKPOINT_FLAG (1 << KVMI_EVENT_BREAKPOINT) +#define KVMI_EVENT_HYPERCALL_FLAG (1 << KVMI_EVENT_HYPERCALL) +#define KVMI_EVENT_PF_FLAG (1 << KVMI_EVENT_PF) +#define KVMI_EVENT_TRAP_FLAG (1 << KVMI_EVENT_TRAP) +#define KVMI_EVENT_DESCRIPTOR_FLAG (1 << KVMI_EVENT_DESCRIPTOR) +#define KVMI_EVENT_CREATE_VCPU_FLAG (1 << KVMI_EVENT_CREATE_VCPU) +#define KVMI_EVENT_PAUSE_VCPU_FLAG (1 << KVMI_EVENT_PAUSE_VCPU) + +#define KVMI_EVENT_ACTION_CONTINUE 0 +#define KVMI_EVENT_ACTION_RETRY 1 +#define KVMI_EVENT_ACTION_CRASH 2 + +#define KVMI_KNOWN_VCPU_EVENTS ( \ + KVMI_EVENT_CR_FLAG | \ + KVMI_EVENT_MSR_FLAG | \ + KVMI_EVENT_XSETBV_FLAG | \ + KVMI_EVENT_BREAKPOINT_FLAG | \ + KVMI_EVENT_HYPERCALL_FLAG | \ + KVMI_EVENT_PF_FLAG | \ + KVMI_EVENT_TRAP_FLAG | \ + KVMI_EVENT_DESCRIPTOR_FLAG | \ + KVMI_EVENT_PAUSE_VCPU_FLAG) + +#define KVMI_KNOWN_VM_EVENTS ( \ + KVMI_EVENT_CREATE_VCPU_FLAG | \ + KVMI_EVENT_UNHOOK_FLAG) + +#define KVMI_KNOWN_EVENTS (KVMI_KNOWN_VCPU_EVENTS | KVMI_KNOWN_VM_EVENTS) + +#define KVMI_ALLOWED_VM_EVENT(event_id, event_mask) \ + ((1 << (event_id)) & ((event_mask) & KVMI_KNOWN_VM_EVENTS)) +#define KVMI_ALLOWED_VCPU_EVENT(event_id, event_mask) \ + ((1 << (event_id)) & ((event_mask) & KVMI_KNOWN_VCPU_EVENTS)) + +#define KVMI_PAGE_ACCESS_R (1 << 0) +#define KVMI_PAGE_ACCESS_W (1 << 1) +#define KVMI_PAGE_ACCESS_X (1 << 2) + +struct kvmi_event_cr { + __u16 cr; + __u16 padding[3]; + __u64 old_value; + __u64 new_value; +}; + +struct kvmi_event_msr { + __u32 msr; + __u32 padding; + __u64 old_value; + __u64 new_value; +}; + +struct kvmi_event_breakpoint { + __u64 gpa; +}; + +struct kvmi_event_pf { + __u64 gva; + __u64 gpa; + __u32 mode; + __u32 padding; +}; + +struct kvmi_event_trap { + __u32 vector; + __u32 type; + __u32 error_code; + __u32 padding; + __u64 cr2; +}; + +#define KVMI_DESC_IDTR 1 +#define KVMI_DESC_GDTR 2 +#define KVMI_DESC_LDTR 3 +#define KVMI_DESC_TR 4 + +struct kvmi_event_descriptor { + union { + struct { + __u32 instr_info; + __u32 padding; + __u64 exit_qualification; + } vmx; + struct { + __u64 exit_info; + __u64 padding; + } svm; + } arch; + __u8 descriptor; + __u8 write; + __u8 padding[6]; +}; + +struct kvmi_event { + __u32 event; + __u16 vcpu; + __u8 mode; /* 2, 4 or 8 */ + __u8 padding; + struct kvm_regs regs; + struct kvm_sregs sregs; + struct { + __u64 sysenter_cs; + __u64 sysenter_esp; + __u64 sysenter_eip; + __u64 efer; + __u64 star; + __u64 lstar; + __u64 cstar; + __u64 pat; + __u64 shadow_gs; + } msrs; +}; + +struct kvmi_event_cr_reply { + __u64 new_val; +}; + +struct kvmi_event_msr_reply { 
+ __u64 new_val; +}; + +struct kvmi_event_pf_reply { + __u8 singlestep; + __u8 rep_complete; + __u16 padding; + __u32 ctx_size; + __u8 ctx_data[256]; +}; + +struct kvmi_control_cr { + __u16 vcpu; + __u8 enable; + __u8 padding; + __u32 cr; +}; + +struct kvmi_control_msr { + __u16 vcpu; + __u8 enable; + __u8 padding; + __u32 msr; +}; + +struct kvmi_guest_info { + __u16 vcpu_count; + __u16 padding1; + __u32 padding2; + __u64 tsc_speed; +}; + +struct kvmi_inject_exception { + __u16 vcpu; + __u8 nr; + __u8 has_error; + __u16 error_code; + __u16 padding; + __u64 address; +}; + +struct kvmi_get_registers { + __u16 vcpu; + __u16 nmsrs; + __u16 padding[2]; + __u32 msrs_idx[0]; +}; + +struct kvmi_get_registers_reply { + __u32 mode; + __u32 padding; + struct kvm_regs regs; + struct kvm_sregs sregs; + struct kvm_msrs msrs; +}; + +struct kvmi_set_registers { + __u16 vcpu; + __u16 padding[3]; + struct kvm_regs regs; +}; + +struct kvmi_get_cpuid { + __u16 vcpu; + __u16 padding[3]; + __u32 function; + __u32 index; +}; + +struct kvmi_get_cpuid_reply { + __u32 eax; + __u32 ebx; + __u32 ecx; + __u32 edx; +}; + +struct kvmi_get_xsave { + __u16 vcpu; + __u16 padding[3]; +}; + +struct kvmi_get_xsave_reply { + __u32 region[0]; +}; + +#endif /* _UAPI_ASM_X86_KVMI_H */ diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 2b7a652c9fa4..f5028f20d441 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -976,6 +976,8 @@ struct kvm_ppc_resize_hpt { #define KVM_CAP_EXCEPTION_PAYLOAD 164 #define KVM_CAP_ARM_VM_IPA_SIZE 165 +#define KVM_CAP_INTROSPECTION 999 + #ifdef KVM_CAP_IRQ_ROUTING struct kvm_irq_routing_irqchip { @@ -1501,6 +1503,16 @@ struct kvm_sev_dbg { __u32 len; }; +struct kvm_introspection { + int fd; + __u32 padding; + __u32 commands; + __u32 events; + __u8 uuid[16]; +}; +#define KVM_INTROSPECTION_HOOK _IOW(KVMIO, 0xff, struct kvm_introspection) +#define KVM_INTROSPECTION_UNHOOK _IO(KVMIO, 0xfe) + #define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0) #define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1) #define KVM_DEV_ASSIGN_MASK_INTX (1 << 2) diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h index 6c0ce49931e5..2c2c32d99a5a 100644 --- a/include/uapi/linux/kvm_para.h +++ b/include/uapi/linux/kvm_para.h @@ -10,13 +10,18 @@ * - kvm_para_available */ -/* Return values for hypercalls */ +/* Return values for hypercalls and VM introspection */ #define KVM_ENOSYS 1000 #define KVM_EFAULT EFAULT #define KVM_EINVAL EINVAL #define KVM_E2BIG E2BIG #define KVM_EPERM EPERM #define KVM_EOPNOTSUPP 95 +#define KVM_EAGAIN 11 +#define KVM_EBUSY EBUSY +#define KVM_ENOENT ENOENT +#define KVM_ENOMEM ENOMEM +#define KVM_EACCES EACCES #define KVM_HC_VAPIC_POLL_IRQ 1 #define KVM_HC_MMU_OP 2 @@ -29,6 +34,10 @@ #define KVM_HC_CLOCK_PAIRING 9 #define KVM_HC_SEND_IPI 10 +#define KVM_HC_MEM_MAP 32 +#define KVM_HC_MEM_UNMAP 33 +#define KVM_HC_XEN_HVM_OP 34 /* Xen's __HYPERVISOR_hvm_op */ + /* * hypercalls use architecture specific */ diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h new file mode 100644 index 000000000000..6b24825627bf --- /dev/null +++ b/include/uapi/linux/kvmi.h @@ -0,0 +1,192 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef __UAPI_KVMI_H__ +#define __UAPI_KVMI_H__ + +/* + * KVMI specific structures and definitions + * + */ + +#include +#include +#include + +#define KVMI_VERSION 0x00000001 + +#define KVMI_GET_VERSION 1 +#define KVMI_GET_GUEST_INFO 3 +#define KVMI_GET_REGISTERS 6 +#define KVMI_SET_REGISTERS 7 +#define KVMI_GET_PAGE_ACCESS 
10 +#define KVMI_SET_PAGE_ACCESS 11 +#define KVMI_INJECT_EXCEPTION 12 +#define KVMI_READ_PHYSICAL 13 +#define KVMI_WRITE_PHYSICAL 14 +#define KVMI_GET_MAP_TOKEN 15 +#define KVMI_CONTROL_EVENTS 17 +#define KVMI_CONTROL_CR 18 +#define KVMI_CONTROL_MSR 19 +#define KVMI_EVENT 23 +#define KVMI_EVENT_REPLY 24 +#define KVMI_GET_CPUID 25 +#define KVMI_GET_XSAVE 26 +#define KVMI_PAUSE_ALL_VCPUS 27 +#define KVMI_CONTROL_VM_EVENTS 28 + +/* TODO: find a way to split the commands between common and arch dependent */ + +#define KVMI_GET_VERSION_FLAG (1 << KVMI_GET_VERSION) +#define KVMI_GET_GUEST_INFO_FLAG (1 << KVMI_GET_GUEST_INFO) +#define KVMI_GET_REGISTERS_FLAG (1 << KVMI_GET_REGISTERS) +#define KVMI_SET_REGISTERS_FLAG (1 << KVMI_SET_REGISTERS) +#define KVMI_GET_PAGE_ACCESS_FLAG (1 << KVMI_GET_PAGE_ACCESS) +#define KVMI_SET_PAGE_ACCESS_FLAG (1 << KVMI_SET_PAGE_ACCESS) +#define KVMI_INJECT_EXCEPTION_FLAG (1 << KVMI_INJECT_EXCEPTION) +#define KVMI_READ_PHYSICAL_FLAG (1 << KVMI_READ_PHYSICAL) +#define KVMI_WRITE_PHYSICAL_FLAG (1 << KVMI_WRITE_PHYSICAL) +#define KVMI_GET_MAP_TOKEN_FLAG (1 << KVMI_GET_MAP_TOKEN) +#define KVMI_CONTROL_EVENTS_FLAG (1 << KVMI_CONTROL_EVENTS) +#define KVMI_CONTROL_CR_FLAG (1 << KVMI_CONTROL_CR) +#define KVMI_CONTROL_MSR_FLAG (1 << KVMI_CONTROL_MSR) +#define KVMI_EVENT_FLAG (1 << KVMI_EVENT) +#define KVMI_EVENT_REPLY_FLAG (1 << KVMI_EVENT_REPLY) +#define KVMI_GET_CPUID_FLAG (1 << KVMI_GET_CPUID) +#define KVMI_GET_XSAVE_FLAG (1 << KVMI_GET_XSAVE) +#define KVMI_PAUSE_ALL_VCPUS_FLAG (1 << KVMI_PAUSE_ALL_VCPUS) +#define KVMI_CONTROL_VM_EVENTS_FLAG (1 << KVMI_CONTROL_VM_EVENTS) + +#define KVMI_KNOWN_COMMANDS (\ + KVMI_GET_VERSION_FLAG | \ + KVMI_GET_GUEST_INFO_FLAG | \ + KVMI_GET_REGISTERS_FLAG | \ + KVMI_SET_REGISTERS_FLAG | \ + KVMI_GET_PAGE_ACCESS_FLAG | \ + KVMI_SET_PAGE_ACCESS_FLAG | \ + KVMI_INJECT_EXCEPTION_FLAG | \ + KVMI_READ_PHYSICAL_FLAG | \ + KVMI_WRITE_PHYSICAL_FLAG | \ + KVMI_GET_MAP_TOKEN_FLAG | \ + KVMI_CONTROL_EVENTS_FLAG | \ + KVMI_CONTROL_CR_FLAG | \ + KVMI_CONTROL_MSR_FLAG | \ + KVMI_EVENT_FLAG | \ + KVMI_EVENT_REPLY_FLAG | \ + KVMI_GET_CPUID_FLAG | \ + KVMI_GET_XSAVE_FLAG | \ + KVMI_PAUSE_ALL_VCPUS_FLAG | \ + KVMI_CONTROL_VM_EVENTS_FLAG) + +#define KVMI_ALLOWED_COMMAND(cmd_id, cmd_mask) \ + ((1 << (cmd_id)) & ((cmd_mask) & KVMI_KNOWN_COMMANDS)) + +#define KVMI_MSG_SIZE 4096 + +struct kvmi_msg_hdr { + __u16 id; + __u16 size; + __u32 seq; +}; + +struct kvmi_error_code { + __s32 err; + __u32 padding; +}; + +struct kvmi_get_version_reply { + __u32 version; + __u32 commands; + __u32 events; + __u32 padding; +}; + +struct kvmi_get_guest_info { + __u16 vcpu; + __u16 padding[3]; +}; + +struct kvmi_get_guest_info_reply { + __u32 vcpu_count; + __u32 padding; + __u64 tsc_speed; +}; + +struct kvmi_pause_all_vcpus_reply { + __u32 vcpu_count; + __u32 padding; +}; + +struct kvmi_event_reply { + __u32 action; + __u32 event; +}; + +struct kvmi_control_events { + __u16 vcpu; + __u16 padding; + __u32 events; +}; + +struct kvmi_control_vm_events { + __u32 events; + __u32 padding; +}; + +struct kvmi_get_page_access { + __u16 vcpu; + __u16 count; + __u16 view; + __u16 padding; + __u64 gpa[0]; +}; + +struct kvmi_get_page_access_reply { + __u8 access[0]; +}; + +struct kvmi_page_access_entry { + __u64 gpa; + __u8 access; + __u8 padding[7]; +}; + +struct kvmi_set_page_access { + __u16 vcpu; + __u16 count; + __u16 view; + __u16 padding; + struct kvmi_page_access_entry entries[0]; +}; + +struct kvmi_read_physical { + __u64 gpa; + __u64 size; +}; + +struct kvmi_write_physical { + __u64 
gpa; + __u64 size; + __u8 data[0]; +}; + +struct kvmi_map_mem_token { + __u64 token[4]; +}; + +struct kvmi_get_map_token_reply { + struct kvmi_map_mem_token token; +}; + +/* Map other guest's gpa to local gva */ +struct kvmi_mem_map { + struct kvmi_map_mem_token token; + __u64 gpa; + __u64 gva; +}; + +/* + * ioctls for /dev/kvmmem + */ +#define KVM_INTRO_MEM_MAP _IOW('i', 0x01, struct kvmi_mem_map) +#define KVM_INTRO_MEM_UNMAP _IOW('i', 0x02, unsigned long) + +#endif /* __UAPI_KVMI_H__ */ From patchwork Thu Dec 20 18:28:35 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Adalbert_Laz=C4=83r?= X-Patchwork-Id: 10739287 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id ABE126C2 for ; Thu, 20 Dec 2018 18:36:06 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 9C3FE28E58 for ; Thu, 20 Dec 2018 18:36:06 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 908C928F39; Thu, 20 Dec 2018 18:36:06 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 4B15428E58 for ; Thu, 20 Dec 2018 18:36:06 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2389330AbeLTSgF (ORCPT ); Thu, 20 Dec 2018 13:36:05 -0500 Received: from mx01.bbu.dsd.mx.bitdefender.com ([91.199.104.161]:44338 "EHLO mx01.bbu.dsd.mx.bitdefender.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2389305AbeLTSgE (ORCPT ); Thu, 20 Dec 2018 13:36:04 -0500 Received: from smtp.bitdefender.com (smtp02.buh.bitdefender.net [10.17.80.76]) by mx01.bbu.dsd.mx.bitdefender.com (Postfix) with ESMTPS id 7DF57305FFB3; Thu, 20 Dec 2018 20:29:37 +0200 (EET) Received: from host.bbu.bitdefender.biz (unknown [10.10.193.111]) by smtp.bitdefender.com (Postfix) with ESMTPSA id 6F9E43074866; Thu, 20 Dec 2018 20:29:37 +0200 (EET) From: =?utf-8?q?Adalbert_Laz=C4=83r?= To: kvm@vger.kernel.org Cc: Paolo Bonzini , =?utf-8?b?UmFkaW0gS3LEjW3DocWZ?= , =?utf-8?q?Mihai_Don?= =?utf-8?q?=C8=9Bu?= , =?utf-8?q?Adalbert_Laz=C4=83r?= Subject: [RFC PATCH v5 05/20] kvm: x86: do not unconditionally patch the hypercall instruction during emulation Date: Thu, 20 Dec 2018 20:28:35 +0200 Message-Id: <20181220182850.4579-6-alazar@bitdefender.com> In-Reply-To: <20181220182850.4579-1-alazar@bitdefender.com> References: <20181220182850.4579-1-alazar@bitdefender.com> MIME-Version: 1.0 Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Mihai DONTU It can happen for us to end up emulating the VMCALL instruction as a result of the handling of an EPT write fault. In this situation, the emulator will try to unconditionally patch the correct hypercall opcode bytes using emulator_write_emulated(). However, this last call uses the fault GPA (if available) or walks the guest page tables at RIP, otherwise. 
The trouble begins when using KVMI, when we forbid the use of the fault GPA and fallback to the guest pt walk: in Windows (8.1 and newer) the page that we try to write into is marked read-execute and as such emulator_write_emulated() fails and we inject a write #PF, leading to a guest crash. The fix is rather simple: check the existing instruction bytes before doing the patching. This does not change the normal KVM behaviour, but does help when using KVMI as we no longer inject a write #PF. TODO: move KVM_HYPERCALL_INSN_LEN into a proper header and add a BUILD_BUG_ON() in vmx_patch_hypercall() and svm_patch_hypercall(). Signed-off-by: Mihai Donțu --- arch/x86/kvm/x86.c | 23 ++++++++++++++++++++--- 1 file changed, 20 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index f049ecfac7bb..1d4bab80617c 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -7016,16 +7016,33 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu) } EXPORT_SYMBOL_GPL(kvm_emulate_hypercall); +#define KVM_HYPERCALL_INSN_LEN 3 + static int emulator_fix_hypercall(struct x86_emulate_ctxt *ctxt) { + int err; struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt); - char instruction[3]; + char buf[KVM_HYPERCALL_INSN_LEN]; + char instruction[KVM_HYPERCALL_INSN_LEN]; unsigned long rip = kvm_rip_read(vcpu); + err = emulator_read_emulated(ctxt, rip, buf, sizeof(buf), + &ctxt->exception); + if (err != X86EMUL_CONTINUE) + return err; + kvm_x86_ops->patch_hypercall(vcpu, instruction); + if (!memcmp(instruction, buf, sizeof(instruction))) + /* + * The hypercall instruction is the correct one. Retry + * its execution maybe we got here as a result of an + * event other than #UD which has been resolved in the + * mean time. + */ + return X86EMUL_CONTINUE; - return emulator_write_emulated(ctxt, rip, instruction, 3, - &ctxt->exception); + return emulator_write_emulated(ctxt, rip, instruction, + sizeof(instruction), &ctxt->exception); } static int dm_request_for_irq_injection(struct kvm_vcpu *vcpu) From patchwork Thu Dec 20 18:28:36 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Adalbert_Laz=C4=83r?= X-Patchwork-Id: 10739293 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 6170813B5 for ; Thu, 20 Dec 2018 18:36:15 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 4AF3228E58 for ; Thu, 20 Dec 2018 18:36:15 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 3EA9528F39; Thu, 20 Dec 2018 18:36:15 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id F362228E58 for ; Thu, 20 Dec 2018 18:36:13 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2389346AbeLTSgN (ORCPT ); Thu, 20 Dec 2018 13:36:13 -0500 Received: from mx01.bbu.dsd.mx.bitdefender.com ([91.199.104.161]:44336 "EHLO mx01.bbu.dsd.mx.bitdefender.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2389304AbeLTSgM (ORCPT ); Thu, 20 Dec 2018 13:36:12 -0500 Received: from smtp.bitdefender.com 
(smtp02.buh.bitdefender.net [10.17.80.76]) by mx01.bbu.dsd.mx.bitdefender.com (Postfix) with ESMTPS id 972D2305FFB4; Thu, 20 Dec 2018 20:29:37 +0200 (EET) Received: from host.bbu.bitdefender.biz (unknown [10.10.193.111]) by smtp.bitdefender.com (Postfix) with ESMTPSA id 7D7843064495; Thu, 20 Dec 2018 20:29:37 +0200 (EET) From: =?utf-8?q?Adalbert_Laz=C4=83r?= To: kvm@vger.kernel.org Cc: Paolo Bonzini , =?utf-8?b?UmFkaW0gS3LEjW3DocWZ?= , =?utf-8?q?Mihai_Don?= =?utf-8?q?=C8=9Bu?= , =?utf-8?q?Adalbert_Laz=C4=83r?= , =?utf-8?q?Mircea_?= =?utf-8?q?C=C3=AErjaliu?= Subject: [RFC PATCH v5 06/20] mm: add support for remote mapping Date: Thu, 20 Dec 2018 20:28:36 +0200 Message-Id: <20181220182850.4579-7-alazar@bitdefender.com> In-Reply-To: <20181220182850.4579-1-alazar@bitdefender.com> References: <20181220182850.4579-1-alazar@bitdefender.com> MIME-Version: 1.0 Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Mircea Cîrjaliu The following two new mm exports are introduced: * mm_remote_map(struct mm_struct *req_mm, unsigned long req_hva, unsigned long map_hva) * mm_remote_unmap(unsigned long map_hva) This patch allows one process to map into its address space a page from another process. The previous page (if it exists) is dropped. There is no corresponding pair of system calls as this API is meant to be used by the kernel itself only. The targeted user is the upcoming KVM VM introspection subsystem (KVMI), where an introspector running in its own VM will map pages from the introspected guest in order to eliminate round trips to the host kernel (read/write guest pages). The flow is as follows: the introspector identifies a guest physical address where some information of interest is located. It creates a one page anonymous mapping with MAP_LOCKED | MAP_POPULATE and calls the kernel via an IOCTL on /dev/kvmmem giving the map virtual address and the guest physical address as arguments. The kernel converts the map va into a physical page (gpa in KVM-speak) and passes it to the host kernel via a hypercall, along with the introspected guest gpa. The host kernel converts the two gpa-s into their appropriate hva-s (host virtual addresses) and makes sure the vma backing up the page belonging to the VM in which the introspector runs, points to the indicated page into the introspected guest. I have not included here the use of the mapping token described in the KVMI documentation. An introspector tool will always keep the most used mappings around (tested with 4096). There are some restrictions here: * the anonymous mapping created by the introspector must not change its backing physical page; * the page that ends up being shared must be locked in memory (ie. it cannot be paged out); * the mapping created adds to the mappings belonging to the controlling KVM process (qemu) corresponding to the VM inside which the introspector runs. /proc/sys/vm/max_map_count must be adjusted accordingly (eg. 16k mappings x 100 VMs = 1.6 million). The following types might need some explaining, in order to ease reading the patch: * struct intro_db - there is one such structure per each VM in which an introspector process is running. All instances are placed in a global hash table in order to avoid using space in mm_struct; * struct target_db - there is one such structure per each introspected VM. An intro_db instance can point to multiple target_db instances; * struct page_db - there is one such structure for each mapping. 
Thus, a target_db instance can point to multiple page_db instances. TODO: * remove the module support and turn it into an optional built-in. This will avoid the need for exporting the guts of the memory manager; * figure out why this cannot be used with KSM enabled; * drop the intro/introspector naming; * test with more than one introspector (ie. multiple struct intro_db). Signed-off-by: Mircea Cîrjaliu --- include/linux/mm.h | 13 + include/linux/rmap.h | 1 + mm/Kconfig | 9 + mm/Makefile | 1 + mm/gup.c | 1 + mm/huge_memory.c | 1 + mm/internal.h | 5 - mm/mempolicy.c | 1 + mm/mmap.c | 1 + mm/mmu_notifier.c | 1 + mm/pgtable-generic.c | 1 + mm/remote_mapping.c | 1438 ++++++++++++++++++++++++++++++++++++++++++ mm/rmap.c | 39 +- mm/swapfile.c | 1 + 14 files changed, 1506 insertions(+), 7 deletions(-) create mode 100644 mm/remote_mapping.c diff --git a/include/linux/mm.h b/include/linux/mm.h index 5411de93a363..9b30ae83f821 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1206,6 +1206,9 @@ void page_address_init(void); #define page_address_init() do { } while(0) #endif +/* rmap.c */ +extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address); + extern void *page_rmapping(struct page *page); extern struct anon_vma *page_anon_vma(struct page *page); extern struct address_space *page_mapping(struct page *page); @@ -2786,6 +2789,16 @@ static inline bool debug_guardpage_enabled(void) { return false; } static inline bool page_is_guard(struct page *page) { return false; } #endif /* CONFIG_DEBUG_PAGEALLOC */ +#if IS_ENABLED(CONFIG_REMOTE_MAPPING) +extern int mm_remote_map(struct mm_struct *req_mm, unsigned long req_hva, + unsigned long map_hva); +extern int mm_remote_unmap(unsigned long map_hva); +#else /* CONFIG_REMOTE_MAPPING */ +static inline int mm_remote_map(struct mm_struct *req_mm, unsigned long req_hva, + unsigned long map_hva) { return -EINVAL; } +static inline int mm_remote_unmap(unsigned long map_hva) { return -EINVAL; } +#endif /* CONFIG_REMOTE_MAPPING */ + #if MAX_NUMNODES > 1 void __init setup_nr_node_ids(void); #else diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 988d176472df..2f5b3e1c2613 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -141,6 +141,7 @@ static inline void anon_vma_unlock_read(struct anon_vma *anon_vma) */ void anon_vma_init(void); /* create anon_vma_cachep */ int __anon_vma_prepare(struct vm_area_struct *); +int anon_vma_assign(struct vm_area_struct *vma, struct anon_vma *anon_vma); void unlink_anon_vmas(struct vm_area_struct *); int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *); int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *); diff --git a/mm/Kconfig b/mm/Kconfig index d85e39da47ae..3b149f743a2b 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -757,4 +757,13 @@ config GUP_BENCHMARK config ARCH_HAS_PTE_SPECIAL bool +config REMOTE_MAPPING + bool "Remote memory mapping" + depends on MMU && !KSM + default n + + help + Allows a given application to map pages of another application in its own + address space. 
+ endmenu diff --git a/mm/Makefile b/mm/Makefile index d210cc9d6f80..e69a3b15627a 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -99,3 +99,4 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o obj-$(CONFIG_HMM) += hmm.o obj-$(CONFIG_MEMFD_CREATE) += memfd.o +obj-$(CONFIG_REMOTE_MAPPING) += remote_mapping.o diff --git a/mm/gup.c b/mm/gup.c index 8cb68a50dbdf..ee5ca8bbc8e3 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -450,6 +450,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address, put_dev_pagemap(ctx.pgmap); return page; } +EXPORT_SYMBOL(follow_page_mask); static int get_gate_page(struct mm_struct *mm, unsigned long address, unsigned int gup_flags, struct vm_area_struct **vma, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 5da55b38b1b7..3584ed1530af 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2297,6 +2297,7 @@ void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address, __split_huge_pmd(vma, pmd, address, freeze, page); } +EXPORT_SYMBOL(split_huge_pmd_address); void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start, diff --git a/mm/internal.h b/mm/internal.h index 291eb2b6d1d8..935291b44ae0 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -92,11 +92,6 @@ extern unsigned long highest_memmap_pfn; extern int isolate_lru_page(struct page *page); extern void putback_lru_page(struct page *page); -/* - * in mm/rmap.c: - */ -extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address); - /* * in mm/page_alloc.c */ diff --git a/mm/mempolicy.c b/mm/mempolicy.c index d4496d9d34f5..f4637f9e900c 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2072,6 +2072,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, out: return page; } +EXPORT_SYMBOL(alloc_pages_vma); /** * alloc_pages_current - Allocate pages. diff --git a/mm/mmap.c b/mm/mmap.c index 6c04292e16a7..5929098c4a01 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2705,6 +2705,7 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma, return __split_vma(mm, vma, addr, new_below); } +EXPORT_SYMBOL(split_vma); /* Munmap is split into 2 main parts -- this part which finds * what needs doing, and the areas themselves, which do the diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index 5119ff846769..43e61b32f01e 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -173,6 +173,7 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address, } srcu_read_unlock(&srcu, id); } +EXPORT_SYMBOL(__mmu_notifier_change_pte); int __mmu_notifier_invalidate_range_start(struct mm_struct *mm, unsigned long start, unsigned long end, diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index 532c29276fce..59a8adeb1a71 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -88,6 +88,7 @@ pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address, flush_tlb_page(vma, address); return pte; } +EXPORT_SYMBOL(ptep_clear_flush); #endif #ifdef CONFIG_TRANSPARENT_HUGEPAGE diff --git a/mm/remote_mapping.c b/mm/remote_mapping.c new file mode 100644 index 000000000000..f9d25d1b5990 --- /dev/null +++ b/mm/remote_mapping.c @@ -0,0 +1,1438 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Remote memory mapping. + * + * Copyright (C) 2017-2018 Bitdefender S.R.L. 
+ * + * Author: + * Mircea Cirjaliu + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "internal.h" + +#define ASSERT(exp) BUG_ON(!(exp)) + +#define TAKEN_BIT 0 +#define TDB_HASH_BITS 4 +#define IDB_HASH_BITS 2 + +struct page_db { + /* Target for this mapping */ + struct mm_struct *target; + + /* HVAs of target & introspector */ + unsigned long req_hva; + unsigned long map_hva; + + /* Target-side link (interval tree) */ + union { + struct { + struct rb_node target_rb; + unsigned long rb_subtree_last; + }; + struct list_head temp; + }; + + /* Introspector-side link (RB tree) */ + struct rb_node intro_rb; + + unsigned long flags; +}; + +struct target_db { + struct mm_struct *mm; /* mm of this struct */ + struct hlist_node db_link; /* database link */ + + struct mmu_notifier mn; /* for notifications from mm */ + struct rcu_head rcu; /* for delayed freeing */ + refcount_t refcnt; + + spinlock_t lock; /* lock for the following */ + struct mm_struct *introspector; /* introspector for this target */ + struct rb_root_cached rb_root; /* mapped HVA from this target */ +}; + +struct intro_db { + struct mm_struct *mm; /* mm of this struct */ + struct hlist_node db_link; /* database link */ + + struct mmu_notifier mn; /* for notifications from mm */ + struct rcu_head rcu; /* for delayed freeing */ + refcount_t refcnt; + + spinlock_t lock; /* lock for the following */ + struct rb_root rb_root; /* for local mappings */ +}; + +/* forward declarations */ +static int mm_remote_unmap_action(struct mm_struct *map_mm, + unsigned long map_hva); + +static void mm_remote_db_target_release(struct target_db *tdb); +static void mm_remote_db_intro_release(struct intro_db *idb); + +static void tdb_release(struct mmu_notifier *mn, struct mm_struct *mm); +static void idb_release(struct mmu_notifier *mn, struct mm_struct *mm); + +static const struct mmu_notifier_ops tdb_notifier_ops = { + .release = tdb_release, +}; + +static const struct mmu_notifier_ops idb_notifier_ops = { + .release = idb_release, +}; + +static DEFINE_HASHTABLE(tdb_hash, TDB_HASH_BITS); +static DEFINE_SPINLOCK(tdb_lock); + +static DEFINE_HASHTABLE(idb_hash, IDB_HASH_BITS); +static DEFINE_SPINLOCK(idb_lock); + +static struct kmem_cache *pdb_cache; +static atomic_t pdb_count = ATOMIC_INIT(0); +static atomic_t map_count = ATOMIC_INIT(0); + +static struct dentry *mm_remote_debugfs_dir; + +static void target_db_init(struct target_db *tdb, struct mm_struct *mm) +{ + tdb->mm = mm; + tdb->mn.ops = &tdb_notifier_ops; + refcount_set(&tdb->refcnt, 1); + + tdb->introspector = NULL; + tdb->rb_root = RB_ROOT_CACHED; + spin_lock_init(&tdb->lock); +} + +static inline unsigned long page_db_start(const struct page_db *pdb) +{ + return pdb->req_hva; +} + +static inline unsigned long page_db_last(const struct page_db *pdb) +{ + return pdb->req_hva + PAGE_SIZE; +} + +INTERVAL_TREE_DEFINE(struct page_db, target_rb, unsigned long, + rb_subtree_last, page_db_start, page_db_last, + static inline, __page_db_interval_tree) + +static void target_db_insert(struct target_db *tdb, struct page_db *pdb) +{ + __page_db_interval_tree_insert(pdb, &tdb->rb_root); +} + +static bool target_db_empty(const struct target_db *tdb) +{ + return RB_EMPTY_ROOT(&tdb->rb_root.rb_root); +} + +static bool target_db_remove(struct target_db *tdb, struct page_db *pdb) +{ + bool result = false; + + if (!target_db_empty(tdb)) { + 
__page_db_interval_tree_remove(pdb, &tdb->rb_root); + result = true; + } + + RB_CLEAR_NODE(&pdb->target_rb); + pdb->rb_subtree_last = 0; + + return result; +} + +#define target_db_foreach(pdb, root, start, last) \ + for (pdb = __page_db_interval_tree_iter_first(root, start, last);\ + pdb; pdb = __page_db_interval_tree_iter_next(pdb, start, last)) + +static void target_db_get(struct target_db *tdb) +{ + refcount_inc(&tdb->refcnt); +} + +static void target_db_free_delayed(struct rcu_head *rcu) +{ + struct target_db *tdb = container_of(rcu, struct target_db, rcu); + + pr_debug("%s: for mm %016lx\n", __func__, (unsigned long)tdb->mm); + + kfree(tdb); +} + +static void target_db_put(struct target_db *tdb) +{ + if (refcount_dec_and_test(&tdb->refcnt)) { + pr_debug("%s: for MM %016lx\n", __func__, + (unsigned long)tdb->mm); + + mm_remote_db_target_release(tdb); + + ASSERT(target_db_empty(tdb)); + + mmu_notifier_call_srcu(&tdb->rcu, target_db_free_delayed); + } +} + +static struct target_db *target_db_lookup(const struct mm_struct *mm) +{ + struct target_db *tdb; + + spin_lock(&tdb_lock); + hash_for_each_possible(tdb_hash, tdb, db_link, (unsigned long)mm) + if (tdb->mm == mm) { + target_db_get(tdb); + spin_unlock(&tdb_lock); + + return tdb; + } + spin_unlock(&tdb_lock); + + return NULL; +} + +static void target_db_extract(struct target_db *tdb) +{ + spin_lock(&tdb_lock); + hash_del(&tdb->db_link); + spin_unlock(&tdb_lock); +} + +static struct target_db *target_db_lookup_or_add(struct mm_struct *mm) +{ + struct target_db *tdb; + int result; + + spin_lock(&tdb_lock); + + /* lookup in hash */ + hash_for_each_possible(tdb_hash, tdb, db_link, (unsigned long)mm) + if (tdb->mm == mm) { + target_db_get(tdb); + spin_unlock(&tdb_lock); + + return tdb; + } + + /* no tdb found, alloc one */ + tdb = kzalloc(sizeof(*tdb), GFP_ATOMIC); + if (tdb == NULL) { + spin_unlock(&tdb_lock); + return ERR_PTR(-ENOMEM); + } + + /* init & add to hash */ + target_db_init(tdb, mm); + hash_add(tdb_hash, &tdb->db_link, (unsigned long)mm); + + spin_unlock(&tdb_lock); + + /* + * register a mmu notifier when adding this entry to the list - at this + * point other threads may already have hold of this tdb + */ + result = mmu_notifier_register(&tdb->mn, mm); + if (IS_ERR_VALUE((long) result)) { + target_db_extract(tdb); + target_db_put(tdb); + return ERR_PTR((long) result); + } + + pr_debug("%s: new entry for mm %016lx\n", + __func__, (unsigned long)tdb->mm); + + /* return this entry to user with incremented reference count */ + target_db_get(tdb); + + return tdb; +} + +static void intro_db_init(struct intro_db *idb, struct mm_struct *mm) +{ + idb->mm = mm; + idb->mn.ops = &idb_notifier_ops; + refcount_set(&idb->refcnt, 1); + + idb->rb_root = RB_ROOT; + spin_lock_init(&idb->lock); +} + +static void intro_db_insert(struct intro_db *idb, struct page_db *pdb) +{ + struct rb_root *root = &idb->rb_root; + struct rb_node **new = &root->rb_node; + struct rb_node *parent = NULL; + + /* Figure out where to put new node */ + while (*new) { + struct page_db *this = rb_entry(*new, struct page_db, intro_rb); + + parent = *new; + if (pdb->map_hva < this->map_hva) + new = &((*new)->rb_left); + else if (pdb->map_hva > this->map_hva) + new = &((*new)->rb_right); + else { + ASSERT(pdb->map_hva != this->map_hva); + return; + } + } + + /* Add new node and rebalance tree. 
*/ + rb_link_node(&pdb->intro_rb, parent, new); + rb_insert_color(&pdb->intro_rb, root); +} + +static struct page_db *intro_db_search(struct intro_db *idb, + unsigned long map_hva) +{ + struct rb_root *root = &idb->rb_root; + struct rb_node *node = root->rb_node; + + while (node) { + struct page_db *pdb = rb_entry(node, struct page_db, intro_rb); + + if (map_hva < pdb->map_hva) + node = node->rb_left; + else if (map_hva > pdb->map_hva) + node = node->rb_right; + else + return pdb; + } + + return NULL; +} + +static bool intro_db_empty(const struct intro_db *idb) +{ + return RB_EMPTY_ROOT(&idb->rb_root); +} + +static bool intro_db_remove(struct intro_db *idb, struct page_db *pdb) +{ + bool result = false; + + if (!intro_db_empty(idb)) { + rb_erase(&pdb->intro_rb, &idb->rb_root); + result = true; + } + + RB_CLEAR_NODE(&pdb->intro_rb); + + return result; +} + +static void intro_db_get(struct intro_db *idb) +{ + refcount_inc(&idb->refcnt); +} + +static void intro_db_free_delayed(struct rcu_head *rcu) +{ + struct intro_db *idb = container_of(rcu, struct intro_db, rcu); + + pr_debug("%s: mm %016lx\n", __func__, (unsigned long)idb->mm); + + kfree(idb); +} + +static void intro_db_put(struct intro_db *idb) +{ + if (refcount_dec_and_test(&idb->refcnt)) { + pr_debug("%s: mm %016lx\n", __func__, (unsigned long)idb->mm); + + mm_remote_db_intro_release(idb); + + ASSERT(intro_db_empty(idb)); + + mmu_notifier_call_srcu(&idb->rcu, intro_db_free_delayed); + } +} + +static struct intro_db *intro_db_lookup(const struct mm_struct *mm) +{ + struct intro_db *idb; + + spin_lock(&idb_lock); + hash_for_each_possible(idb_hash, idb, db_link, (unsigned long)mm) + if (idb->mm == mm) { + intro_db_get(idb); + spin_unlock(&idb_lock); + + return idb; + } + spin_unlock(&idb_lock); + + return NULL; +} + +static void intro_db_extract(struct intro_db *idb) +{ + spin_lock(&idb_lock); + hash_del(&idb->db_link); + spin_unlock(&idb_lock); +} + +static struct intro_db *intro_db_lookup_or_add(struct mm_struct *mm) +{ + struct intro_db *idb; + int result; + + spin_lock(&idb_lock); + + /* lookup in hash */ + hash_for_each_possible(idb_hash, idb, db_link, (unsigned long)mm) + if (idb->mm == mm) { + intro_db_get(idb); + spin_unlock(&idb_lock); + + return idb; + } + + /* no mdb found, alloc one */ + idb = kzalloc(sizeof(*idb), GFP_ATOMIC); + if (idb == NULL) { + spin_unlock(&idb_lock); + return ERR_PTR(-ENOMEM); + } + + /* init & add to hash */ + intro_db_init(idb, mm); + hash_add(idb_hash, &idb->db_link, (unsigned long)mm); + + spin_unlock(&idb_lock); + + /* + * register a mmu notifier when adding this entry to the list - at this + * point other threads may already have hold of this idb + */ + result = mmu_notifier_register(&idb->mn, mm); + if (IS_ERR_VALUE((long)result)) { + intro_db_extract(idb); + intro_db_put(idb); + return ERR_PTR((long)result); + } + + pr_debug("%s: new entry for mm %016lx\n", + __func__, (unsigned long)idb->mm); + + /* return this entry to user with incremented reference count */ + intro_db_get(idb); + + return idb; +} + +static struct page_db *page_db_alloc(void) +{ + struct page_db *result; + + result = kmem_cache_alloc(pdb_cache, GFP_KERNEL); + if (result == NULL) + return NULL; + + memset(result, 0, sizeof(*result)); + atomic_inc(&pdb_count); + + return result; +} + +static void page_db_free(struct page_db *pdb) +{ + kmem_cache_free(pdb_cache, pdb); + BUG_ON(atomic_add_negative(-1, &pdb_count)); +} + +/* + * According to the new semantics, we first reserve a mapping entry in the + * introspector and we mark it 
as taken. Any other thread trying to insert + * the same mapping (identified by map_hva) will return with -EALREADY. The + * entry will be marked as taken as long as the owning thread works on it. + * The taken bit serves to synchronize with any unmapper thread trying to + * extract this entry from the database at the same time. Clearing this bit + * is not ordered relative to other instructions, so it may be cleared by + * the owner but read as set by an unmapper thread in the introspector + * critical region. + */ +static int +page_db_reserve(struct mm_struct *introspector, unsigned long map_hva, + struct mm_struct *target, unsigned long req_hva, + struct page_db **ppdb) +{ + struct intro_db *idb; + struct page_db *pdb; + int result = 0; + + /* + * returns a valid pointer or an error value, never NULL + * also gets reference to entry + */ + idb = intro_db_lookup_or_add(introspector); + if (IS_ERR_VALUE(idb)) + return PTR_ERR(idb); + + /* + * alloc mapping entry outside the introspector critical region - most + * likely the entry (identified by map_hva) isn't already reserved in + * the tree and we won't need to throw the allocation away + */ + pdb = page_db_alloc(); + if (unlikely(pdb == NULL)) { + result = -ENOMEM; + goto out; + } + + /* fill pdb */ + pdb->target = target; + pdb->req_hva = req_hva; + pdb->map_hva = map_hva; + + /* insert mapping entry into the introspector if not already there */ + spin_lock(&idb->lock); + + if (unlikely(intro_db_search(idb, map_hva))) { + page_db_free(pdb); + result = -EALREADY; + } else { + intro_db_insert(idb, pdb); + /* + * after the introspector critical region ends, this flag will + * be read as set because of the implicit memory barrier of the + * unlock op + */ + __set_bit(TAKEN_BIT, &pdb->flags); + } + + spin_unlock(&idb->lock); + + /* output this value */ + if (result == 0) + *ppdb = pdb; + +out: + /* + * do not free MDBs for the introspector/target, just unpin them; + * they will get freed by the mmu_notifier->release() callbacks + */ + intro_db_put(idb); + + return result; +} + +/* + * This function should be called at the beginning of the unmap function, it + * will take ownership of the entry if possible, then the entry can be removed + * from the target database. After removal, the entry can be unreserved. + */ +static int +page_db_acquire(struct mm_struct *introspector, unsigned long map_hva, + struct page_db **ppdb) +{ + struct intro_db *idb; + struct page_db *pdb; + int result = 0; + + /* also gets reference to entry */ + idb = intro_db_lookup(introspector); + if (idb == NULL) + return -EINVAL; + + spin_lock(&idb->lock); + + pdb = intro_db_search(idb, map_hva); + if (pdb == NULL) { + result = -ENOENT; + } else if (__test_and_set_bit(TAKEN_BIT, &pdb->flags)) { + /* + * other thread owns this entry and may map or unmap it (in + * which case the entry will be gone entirely), the only action + * suitable is to retry access and hope the entry is there + */ + result = -EAGAIN; + } + + spin_unlock(&idb->lock); + + /* output this value */ + if (result == 0) + *ppdb = pdb; + + /* + * do not free MDBs for the introspector/target, just unpin them; + * they will get freed by the mmu_notifier->release() callbacks + */ + intro_db_put(idb); + + return result; +} + +static void +page_db_release(struct page_db *pdb) +{ + __clear_bit(TAKEN_BIT, &pdb->flags); +} + +/* + * Reverse of page_db_reserve(), must be called by the same introspector thread + * that has acquired the mapping entry by page_db_reserve()/page_db_acquire(). 
+ */ +static int +page_db_unreserve(struct mm_struct *introspector, struct page_db *pdb) +{ + struct intro_db *idb; + bool removed; + int result = 0; + + /* also gets reference to entry */ + idb = intro_db_lookup(introspector); + if (idb == NULL) + return -EINVAL; + + spin_lock(&idb->lock); + removed = intro_db_remove(idb, pdb); + spin_unlock(&idb->lock); + + page_db_free(pdb); + + if (!removed) + pr_debug("%s: entry for map_hva %016lx already freed.\n", + __func__, pdb->map_hva); + + /* + * do not free MDBs for the introspector/target, just unpin them; + * they will get freed by the mmu_notifier->release() callbacks + */ + intro_db_put(idb); + + return result; +} + +static int +page_db_add_target(struct page_db *pdb, struct mm_struct *target, + struct mm_struct *introspector) +{ + struct target_db *tdb; + int result = 0; + + /* + * returns a valid pointer or an error value, never NULL + * also gets reference to entry + */ + tdb = target_db_lookup_or_add(target); + if (IS_ERR_VALUE(tdb)) + return PTR_ERR(tdb); + + /* target-side locking */ + spin_lock(&tdb->lock); + + /* check that target is not introspected by someone else */ + if (tdb->introspector != NULL && tdb->introspector != introspector) + result = -EINVAL; + else { + tdb->introspector = introspector; + target_db_insert(tdb, pdb); + } + + spin_unlock(&tdb->lock); + + /* + * do not free MDBs for the introspector/target, just unpin them; + * they will get freed by the mmu_notifier->release() callbacks + */ + target_db_put(tdb); + + return result; +} + +static int +page_db_remove_target(struct page_db *pdb) +{ + struct target_db *tdb; + int result = 0; + bool removed; + + /* find target entry in the database */ + tdb = target_db_lookup(pdb->target); + if (tdb == NULL) + return -EINVAL; + + /* target-side locking */ + spin_lock(&tdb->lock); + + /* remove mapping from target */ + removed = target_db_remove(tdb, pdb); + if (!removed) + pr_debug("%s: mapping for req_hva %016lx of %016lx already freed\n", + __func__, pdb->req_hva, (unsigned long)pdb->target); + + /* clear the introspector if no more mappings */ + if (target_db_empty(tdb)) { + tdb->introspector = NULL; + pr_debug("%s: all mappings gone for target mm %016lx\n", + __func__, (unsigned long)pdb->target); + } + + spin_unlock(&tdb->lock); + + /* + * do not free MDBs for the introspector/target, just unpin them; + * they will get freed by the mmu_notifier->release() callbacks + */ + target_db_put(tdb); + + return result; +} + +/* + * The target is referenced by a bunch of PDBs not reachable from introspector; + * go there and break the target-side links (by removing the tree) while at the + * same time clear the pointers from the PDBs to this target. In this way, the + * current target will be reachable a single time while walking a tree of PDBs + * extracted from the introspector. 
+ */ +static void mm_remote_db_cleanup_target(struct target_db *tdb) +{ + struct page_db *pdb, *npdb; + struct rb_root temp_rb; + struct mm_struct *introspector; + long result; + + /* target-side locking */ + spin_lock(&tdb->lock); + + /* if we ended up here the target must be introspected */ + ASSERT(tdb->introspector != NULL); + introspector = tdb->introspector; + tdb->introspector = NULL; + + /* take away the interval tree from the target */ + temp_rb.rb_node = tdb->rb_root.rb_root.rb_node; + tdb->rb_root = RB_ROOT_CACHED; + + spin_unlock(&tdb->lock); + + /* + * walk the tree & clear links to target - this function is serialized + * with respect to the main loop in mm_remote_db_intro_release() so + * there will be no race on pdb->target + */ + rbtree_postorder_for_each_entry_safe(pdb, npdb, &temp_rb, target_rb) { + /* clear links to target */ + pdb->target = NULL; + pdb->rb_subtree_last = 0; + RB_CLEAR_NODE(&pdb->target_rb); + + /* do the unmapping */ + result = mm_remote_unmap_action(introspector, pdb->map_hva); + if (IS_ERR_VALUE(result)) + pr_debug("%s: failed unmapping map_hva %016lx!\n", + __func__, pdb->map_hva); + } +} + +/* + * The introspector is closing. This means the normal mapping/unmapping logic + * does not work anymore. + * This function will not race against mm_remote_db_target_release(), since the + * introspector's MM is pinned during that call. + */ +static void mm_remote_db_intro_release(struct intro_db *idb) +{ + struct page_db *pdb, *npdb; + struct target_db *tdb; + struct rb_root temp_rb; + + /* introspector-side locking */ + spin_lock(&idb->lock); + + /* take away the internal tree */ + temp_rb.rb_node = idb->rb_root.rb_node; + idb->rb_root = RB_ROOT; + + spin_unlock(&idb->lock); + + if (!RB_EMPTY_ROOT(&temp_rb)) + pr_debug("%s: introspector mm %016lx has some mappings\n", + __func__, (unsigned long)idb->mm); + + /* iterate the tree over introspector entries */ + rbtree_postorder_for_each_entry_safe(pdb, npdb, &temp_rb, intro_rb) { + /* see comments in function above */ + if (pdb->target == NULL) + goto just_free; + + /* pin entry for target - maybe it has been released */ + tdb = target_db_lookup(pdb->target); + if (tdb == NULL) + goto just_free; + + /* see comments of this function */ + mm_remote_db_cleanup_target(tdb); + + /* unpin entry for target */ + target_db_put(tdb); + +just_free: + page_db_free(pdb); + } +} + +/* + * The target MM is closing. This means the pages are unmapped by the default + * kernel logic on the target side, but we must also clear the mappings on the + * introspector side. + * This function won't collide with the mapping function since we get here on + * target MM teardown and the mapping function won't be able to get a reference + * to the target MM. + * Thin function may collide with the unmapping function that acquires mappings + * in which case the acquired mappings are ignored. 
+ */ +static void mm_remote_db_target_release(struct target_db *tdb) +{ + struct page_db *pdb, *npdb; + struct intro_db *idb; + struct mm_struct *introspector; + struct rb_root temp_rb; + LIST_HEAD(temp_list); + long result; + + /* target-side locking */ + spin_lock(&tdb->lock); + + /* no introspector, nothing to do */ + if (tdb->introspector == NULL) { + ASSERT(target_db_empty(tdb)); + spin_unlock(&tdb->lock); + return; + } + + /* extract introspector */ + introspector = tdb->introspector; + tdb->introspector = NULL; + + /* take away the interval tree from the target */ + temp_rb.rb_node = tdb->rb_root.rb_root.rb_node; + tdb->rb_root = RB_ROOT_CACHED; + + spin_unlock(&tdb->lock); + + /* pin the introspector mm so it won't go away */ + if (!mmget_not_zero(introspector)) + return; + + /* + * acquire the entry of the introspector - can be NULL if the + * introspector failed to register a MMU notifier + */ + idb = intro_db_lookup(introspector); + if (idb == NULL) + goto out_introspector; + + /* introspector-side locking */ + spin_lock(&idb->lock); + + rbtree_postorder_for_each_entry_safe(pdb, npdb, &temp_rb, target_rb) { + /* + * this mapping entry happens to be taken (most likely) for + * unmapping individually, leave it alone + */ + if (__test_and_set_bit(TAKEN_BIT, &pdb->flags)) { + pr_debug("%s: skip acquired mapping for map_hva %016lx\n", + __func__, pdb->map_hva); + continue; + } + + /* add it to temp list for later processing */ + list_add(&pdb->temp, &temp_list); + } + + spin_unlock(&idb->lock); + + /* unmap entries outside introspector lock */ + list_for_each_entry(pdb, &temp_list, temp) { + pr_debug("%s: internal unmapping of map_hva %016lx\n", + __func__, pdb->map_hva); + + /* do the unmapping */ + result = mm_remote_unmap_action(introspector, pdb->map_hva); + if (IS_ERR_VALUE(result)) + pr_debug("%s: failed unmapping map_hva %016lx!\n", + __func__, pdb->map_hva); + } + + spin_lock(&idb->lock); + + /* loop over temp list & remove from introspector tree */ + list_for_each_entry_safe(pdb, npdb, &temp_list, temp) { + /* + * unmap & free only if found in the introspector tree, it may + * have been already extracted & processed by another code path + */ + if (!intro_db_remove(idb, pdb)) + continue; + + page_db_free(pdb); + } + + spin_unlock(&idb->lock); + + /* unpin this entry */ + intro_db_put(idb); + +out_introspector: + /* unpin the introspector mm */ + mmput(introspector); +} + +static void tdb_release(struct mmu_notifier *mn, struct mm_struct *mm) +{ + struct target_db *tdb = container_of(mn, struct target_db, mn); + + pr_debug("%s: mm %016lx\n", __func__, (unsigned long)mm); + + /* + * at this point other threads may already have hold of this tdb + */ + target_db_extract(tdb); + target_db_put(tdb); +} + +static void idb_release(struct mmu_notifier *mn, struct mm_struct *mm) +{ + struct intro_db *idb = container_of(mn, struct intro_db, mn); + + pr_debug("%s: mm %016lx\n", __func__, (unsigned long)mm); + + /* + * at this point other threads may already have hold of this idb + */ + intro_db_extract(idb); + intro_db_put(idb); +} + +static struct vm_area_struct * +isolate_page_vma(struct vm_area_struct *vma, unsigned long addr) +{ + int result; + + /* corner case */ + if (vma_pages(vma) == 1) + return vma; + + if (addr != vma->vm_start) { + /* first split only if address in the middle */ + result = split_vma(vma->vm_mm, vma, addr, false); + if (IS_ERR_VALUE((long)result)) + return ERR_PTR((long)result); + + vma = find_vma(vma->vm_mm, addr); + if (unlikely(vma == NULL)) + return 
ERR_PTR(-ENOENT); + + /* corner case (again) */ + if (vma_pages(vma) == 1) + return vma; + } + + result = split_vma(vma->vm_mm, vma, addr + PAGE_SIZE, true); + if (IS_ERR_VALUE((long)result)) + return ERR_PTR((long)result); + + vma = find_vma(vma->vm_mm, addr); + if (unlikely(vma == NULL)) + return ERR_PTR(-ENOENT); + + BUG_ON(vma_pages(vma) != 1); + + return vma; +} + +/* + * Lightweight version of vma_merge() to reduce the internal fragmentation of + * the mapping process' address space. It merges small VMAs that emerged by + * splitting a larger VMA with the function above. + */ +static int merge_page_vma(struct vm_area_struct *vma) +{ + struct vm_area_struct *prev = vma->vm_prev; + struct vm_area_struct *next = vma->vm_next; + int result = 0; + + if (prev->vm_end == vma->vm_start && prev->anon_vma == vma->anon_vma && + prev->vm_flags == vma->vm_flags) + result = __vma_adjust(prev, prev->vm_start, vma->vm_end, + prev->vm_pgoff, NULL, vma); + + if (unlikely(result != 0)) + return result; + + if (vma->vm_end == next->vm_start && vma->anon_vma == next->anon_vma && + vma->vm_flags == next->vm_flags) + result = __vma_adjust(vma, vma->vm_start, next->vm_end, + vma->vm_pgoff, NULL, next); + + return result; +} + +static int mm_remote_replace_pte(struct vm_area_struct *map_vma, + unsigned long map_hva, struct page *map_page, + struct page *new_page) +{ + struct mm_struct *map_mm = map_vma->vm_mm; + + pmd_t *pmd; + pte_t *ptep; + spinlock_t *ptl; + pte_t newpte; + + unsigned long mmun_start; + unsigned long mmun_end; + + /* classic replace_page() code */ + pmd = mm_find_pmd(map_mm, map_hva); + if (!pmd) + return -EFAULT; + + mmun_start = map_hva; + mmun_end = map_hva + PAGE_SIZE; + mmu_notifier_invalidate_range_start(map_mm, mmun_start, mmun_end); + + ptep = pte_offset_map_lock(map_mm, pmd, map_hva, &ptl); + + /* the caller needs to hold the pte lock */ + page_remove_rmap(map_page, false); + + /* create new PTE based on requested page */ + if (new_page != NULL) { + newpte = mk_pte(new_page, map_vma->vm_page_prot); + if (map_vma->vm_flags & VM_WRITE) + newpte = pte_mkwrite(pte_mkdirty(newpte)); + } else + newpte.pte = 0; + + flush_cache_page(map_vma, map_hva, pte_pfn(*ptep)); + ptep_clear_flush_notify(map_vma, map_hva, ptep); + set_pte_at_notify(map_mm, map_hva, ptep, newpte); + + pte_unmap_unlock(ptep, ptl); + + mmu_notifier_invalidate_range_end(map_mm, mmun_start, mmun_end); + + return 0; +} + +static void mm_remote_put_req(struct page *req_page, + struct anon_vma *req_anon_vma) +{ + if (req_anon_vma) + put_anon_vma(req_anon_vma); + + /* get_user_pages_remote() incremented page reference count */ + if (req_page) + put_page(req_page); +} + +static int mm_remote_get_req(struct mm_struct *req_mm, unsigned long req_hva, + struct page **preq_page, + struct anon_vma **preq_anon_vma) +{ + struct page *req_page = NULL; + struct anon_vma *req_anon_vma = NULL; + struct vm_area_struct *req_vma = NULL; + long nrpages; + int result = 0; + + down_read(&req_mm->mmap_sem); + + /* get host page corresponding to requested address */ + nrpages = get_user_pages_remote(NULL, req_mm, + req_hva, 1, FOLL_WRITE | FOLL_SPLIT, + &req_page, &req_vma, NULL); + if (unlikely(nrpages == 0)) { + pr_err("intro: no page for req_hva %016lx\n", req_hva); + result = -ENOENT; + goto out_err; + } else if (IS_ERR_VALUE(nrpages)) { + result = nrpages; + pr_err("intro: get_user_pages_remote() failed: %d\n", result); + goto out_err; + } + + /* limit introspection to anon memory */ + if (!PageAnon(req_page)) { + result = -EINVAL; + 
pr_err("intro: page at req_hva %016lx not anon\n", req_hva); + goto out_err; + } + + /* take & lock this anon vma */ + req_anon_vma = page_get_anon_vma(req_page); + if (unlikely(req_anon_vma == NULL)) { + result = -EINVAL; + pr_err("intro: no anon vma for req_hva %016lx\n", req_hva); + goto out_err; + } + + /* output these values only if successful */ + *preq_page = req_page; + *preq_anon_vma = req_anon_vma; + +out_err: + /* error handling local to the function */ + if (result) + mm_remote_put_req(req_page, req_anon_vma); + + up_read(&req_mm->mmap_sem); + + return result; +} + +static int mm_remote_remap(struct mm_struct *map_mm, unsigned long map_hva, + struct page *req_page, struct anon_vma *req_anon_vma) +{ + struct vm_area_struct *map_vma; + struct page *map_page = NULL; + int result = 0; + + /* VMA will be modified */ + down_write(&map_mm->mmap_sem); + + /* find VMA containing address */ + map_vma = find_vma(map_mm, map_hva); + if (unlikely(map_vma == NULL)) { + pr_err("intro: no local VMA found for remapping\n"); + result = -ENOENT; + goto out_finalize; + } + + /* split local VMA for rmap redirecting */ + map_vma = isolate_page_vma(map_vma, map_hva); + if (IS_ERR_VALUE(map_vma)) { + result = PTR_ERR(map_vma); + pr_debug("%s: isolate_page_vma() failed: %d\n", + __func__, result); + goto out_finalize; + } + + /* find (not get) local page corresponding to target address */ + map_page = follow_page(map_vma, map_hva, FOLL_SPLIT); + if (IS_ERR_VALUE(map_page)) { + result = PTR_ERR(map_page); + pr_debug("%s: follow_page() failed: %d\n", + __func__, result); + goto out_finalize; + } + + /* TODO: I assumed before that this page can be NULL in case a mapping + * request reuses the address that was left empty by a previous unmap, + * but I have never seen this case in practice + */ + if (unlikely(map_page == NULL)) { + pr_err("intro: no local page found for remapping\n"); + result = -ENOENT; + goto out_finalize; + } + + /* decouple anon_vma from small VMA; the original anon_vma will be kept + * as backup in vm_private_data and restored when the mapping is undone + */ + map_vma->vm_private_data = map_vma->anon_vma; + unlink_anon_vmas(map_vma); + map_vma->anon_vma = NULL; + + /* temporary anon_vma_lock_write()s req_anon_vma */ + result = anon_vma_assign(map_vma, req_anon_vma); + if (IS_ERR_VALUE((long)result)) + goto out_noanon; + + /* We're done working with this anon_vma, unpin it. + * TODO: is it safe to assume that as long as the degree was incremented + * during anon_vma_assign(), this anon_vma won't be released right + * after this call ??! 
+ */ + put_anon_vma(req_anon_vma); + req_anon_vma = NULL; /* guard against mm_remote_put_req() */ + + lock_page(req_page); + mlock_vma_page(req_page); + unlock_page(req_page); + + /* redirect PTE - this function can fail before altering any PTE */ + result = mm_remote_replace_pte(map_vma, map_hva, map_page, req_page); + if (IS_ERR_VALUE((long)result)) + goto out_nopte; + + /* increment PTE mappings as a result of referencing req_page */ + atomic_inc(&req_page->_mapcount); + + /* release this page only after references to it have been cleared */ + free_page_and_swap_cache(map_page); + + atomic_inc(&map_count); + up_write(&map_mm->mmap_sem); + + return 0; + +out_nopte: + /* map_vma->anon_vma will be req_anon_vma */ + unlink_anon_vmas(map_vma); + map_vma->anon_vma = NULL; + +out_noanon: + /* map_vma->anon_vma will be NULL at this point */ + anon_vma_assign(map_vma, map_vma->vm_private_data); + map_vma->vm_private_data = NULL; + merge_page_vma(map_vma); + +out_finalize: + /* just unpin these - req_anon_vma can be NULL */ + mm_remote_put_req(req_page, req_anon_vma); + + up_write(&map_mm->mmap_sem); + + return result; +} + +static int mm_remote_map_action(struct mm_struct *req_mm, unsigned long req_hva, + struct mm_struct *map_mm, unsigned long map_hva) +{ + struct page *req_page; + struct anon_vma *req_anon_vma; + int result; + + result = mm_remote_get_req(req_mm, req_hva, &req_page, &req_anon_vma); + if (IS_ERR_VALUE((long)result)) + return result; + + /* does its own error recovery */ + result = mm_remote_remap(map_mm, map_hva, req_page, req_anon_vma); + if (IS_ERR_VALUE((long)result)) + return result; + + return 0; +} + +int mm_remote_map(struct mm_struct *req_mm, unsigned long req_hva, + unsigned long map_hva) +{ + struct mm_struct *map_mm = current->mm; + struct page_db *pdb = NULL; + int result = 0; + + pr_debug("%s: req_mm %016lx, req_hva %016lx, map_hva %016lx\n", + __func__, (unsigned long)req_mm, (unsigned long)req_hva, + map_hva); + + /* try to pin the target MM so it won't go away (map_mm is ours) */ + if (!mmget_not_zero(req_mm)) + return -EINVAL; + + /* reserve mapping entry in the introspector */ + result = page_db_reserve(map_mm, map_hva, req_mm, req_hva, &pdb); + if (IS_ERR_VALUE((long)result)) + goto out; + + /* do the actual memory mapping */ + result = mm_remote_map_action(req_mm, req_hva, map_mm, map_hva); + if (IS_ERR_VALUE((long)result)) { + page_db_unreserve(map_mm, pdb); + goto out; + } + + /* add mapping to target database */ + result = page_db_add_target(pdb, req_mm, map_mm); + if (IS_ERR_VALUE((long)result)) { + mm_remote_unmap_action(map_mm, map_hva); + page_db_unreserve(map_mm, pdb); + goto out; + } + + /* we're done working with this one */ + page_db_release(pdb); + +out: + mmput(req_mm); + + return result; +} +EXPORT_SYMBOL_GPL(mm_remote_map); + +static int mm_remote_unmap_action(struct mm_struct *map_mm, + unsigned long map_hva) +{ + struct vm_area_struct *map_vma; + struct page *req_page = NULL; + int result; + + /* VMA will be modified */ + down_write(&map_mm->mmap_sem); + + /* find destination VMA for mapping */ + map_vma = find_vma(map_mm, map_hva); + if (unlikely(map_vma == NULL)) { + result = -ENOENT; + pr_err("intro: no local VMA found for unmapping\n"); + goto out_err; + } + + /* find (not get) page mapped to destination address */ + req_page = follow_page(map_vma, map_hva, 0); + if (IS_ERR_VALUE(req_page)) { + result = PTR_ERR(req_page); + req_page = NULL; + pr_err("intro: follow_page() failed: %d\n", result); + goto out_err; + } else if 
(unlikely(req_page == NULL)) { + result = -ENOENT; + pr_err("intro: follow_page() returned no page\n"); + goto out_err; + } + + /* page table fixing here */ + result = mm_remote_replace_pte(map_vma, map_hva, req_page, NULL); + if (IS_ERR_VALUE((long)result)) + goto out_err; + + /* decouple links to anon_vmas & restore original anon_vma */ + unlink_anon_vmas(map_vma); + map_vma->anon_vma = NULL; + + /* this function can fail before setting the anon_vma */ + result = anon_vma_assign(map_vma, map_vma->vm_private_data); + if (IS_ERR_VALUE((long)result)) + goto out_err; + map_vma->vm_private_data = NULL; + + /* now try merging the empty VMA with its neighbours */ + result = merge_page_vma(map_vma); + if (IS_ERR_VALUE((long)result)) + pr_err("intro: merge_page_vma() failed: %d\n", result); + + lock_page(req_page); + munlock_vma_page(req_page); + unlock_page(req_page); + + /* reference count was inc during get_user_pages_remote() */ + free_page_and_swap_cache(req_page); + dec_mm_counter(map_mm, MM_ANONPAGES); + + BUG_ON(atomic_add_negative(-1, &map_count)); + goto out_finalize; + +out_err: + /* reference count was inc during get_user_pages_remote() */ + if (req_page != NULL) + put_page(req_page); + +out_finalize: + up_write(&map_mm->mmap_sem); + + return result; +} + +int mm_remote_unmap(unsigned long map_hva) +{ + struct mm_struct *map_mm = current->mm; + struct page_db *pdb; + int result; + + pr_debug("%s: map_hva %016lx\n", __func__, map_hva); + + /* lookup mapping in the introspector database */ + result = page_db_acquire(map_mm, map_hva, &pdb); + if (IS_ERR_VALUE((long)result)) + return result; + + /* the unmapping is done on local mm only */ + result = mm_remote_unmap_action(map_mm, map_hva); + if (IS_ERR_VALUE((long)result)) + pr_debug("%s: mm_remote_unmap_action() failed: %d\n", + __func__, result); + + result = page_db_remove_target(pdb); + if (IS_ERR_VALUE((long)result)) + pr_debug("%s: page_db_remove_target() failed: %d\n", + __func__, result); + + result = page_db_unreserve(map_mm, pdb); + if (IS_ERR_VALUE((long)result)) + pr_debug("%s: page_db_unreserve() failed: %d\n", + __func__, result); + + return result; +} +EXPORT_SYMBOL_GPL(mm_remote_unmap); + +#ifdef CONFIG_DEBUG_FS +static void __init mm_remote_debugfs_init(void) +{ + mm_remote_debugfs_dir = debugfs_create_dir("remote_mapping", NULL); + if (mm_remote_debugfs_dir == NULL) + return; + + debugfs_create_atomic_t("map_count", 0444, mm_remote_debugfs_dir, + &map_count); + debugfs_create_atomic_t("pdb_count", 0444, mm_remote_debugfs_dir, + &pdb_count); +} + +static void __exit mm_remote_debugfs_exit(void) +{ + debugfs_remove_recursive(mm_remote_debugfs_dir); +} +#else /* CONFIG_DEBUG_FS */ +static void __init mm_remote_debugfs_init(void) +{ +} + +static void __exit mm_remote_debugfs_exit(void) +{ +} +#endif /* CONFIG_DEBUG_FS */ + +static int __init mm_remote_init(void) +{ + pdb_cache = KMEM_CACHE(page_db, SLAB_PANIC | SLAB_ACCOUNT); + if (!pdb_cache) + return -ENOMEM; + + mm_remote_debugfs_init(); + + return 0; +} + +static void __exit mm_remote_exit(void) +{ + mm_remote_debugfs_exit(); + + /* number of mappings & unmappings must match */ + BUG_ON(atomic_read(&map_count) != 0); + + /* check for leaks */ + BUG_ON(atomic_read(&pdb_count) != 0); +} + +module_init(mm_remote_init); +module_exit(mm_remote_exit); +MODULE_LICENSE("GPL"); diff --git a/mm/rmap.c b/mm/rmap.c index 85b7f9423352..7081b1aed14b 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -219,6 +219,34 @@ int __anon_vma_prepare(struct vm_area_struct *vma) out_enomem: return 
-ENOMEM; } +EXPORT_SYMBOL(__anon_vma_prepare); + + +int anon_vma_assign(struct vm_area_struct *vma, struct anon_vma *anon_vma) +{ + struct mm_struct *mm = vma->vm_mm; + struct anon_vma_chain *avc; + + avc = anon_vma_chain_alloc(GFP_KERNEL); + if (avc == NULL) + return -ENOMEM; + + anon_vma_lock_write(anon_vma); + /* page_table_lock to protect against threads */ + spin_lock(&mm->page_table_lock); + + /* link req_anon_vma to map_vma */ + vma->anon_vma = anon_vma; + anon_vma_chain_link(vma, avc, anon_vma); + /* vma reference or self-parent link for new root */ + anon_vma->degree++; + + spin_unlock(&mm->page_table_lock); + anon_vma_unlock_write(anon_vma); + + return 0; +} +EXPORT_SYMBOL(anon_vma_assign); /* * This is a useful helper function for locking the anon_vma root as @@ -372,6 +400,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma) unlink_anon_vmas(vma); return -ENOMEM; } +EXPORT_SYMBOL(anon_vma_fork); void unlink_anon_vmas(struct vm_area_struct *vma) { @@ -419,6 +448,7 @@ void unlink_anon_vmas(struct vm_area_struct *vma) anon_vma_chain_free(avc); } } +EXPORT_SYMBOL(unlink_anon_vmas); static void anon_vma_ctor(void *data) { @@ -496,6 +526,7 @@ struct anon_vma *page_get_anon_vma(struct page *page) return anon_vma; } +EXPORT_SYMBOL(page_get_anon_vma); /* * Similar to page_get_anon_vma() except it locks the anon_vma. @@ -740,6 +771,7 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address) out: return pmd; } +EXPORT_SYMBOL(mm_find_pmd); struct page_referenced_arg { int mapcount; @@ -1017,9 +1049,9 @@ void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma) /** * __page_set_anon_rmap - set up new anonymous rmap - * @page: Page to add to rmap + * @page: Page to add to rmap * @vma: VM area to add page to. - * @address: User virtual address of the mapping + * @address: User virtual address of the mapping * @exclusive: the page is exclusively owned by the current process */ static void __page_set_anon_rmap(struct page *page, @@ -1168,6 +1200,7 @@ void page_add_new_anon_rmap(struct page *page, __mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, nr); __page_set_anon_rmap(page, vma, address, 1); } +EXPORT_SYMBOL(page_add_new_anon_rmap); /** * page_add_file_rmap - add pte mapping to a file page @@ -1329,6 +1362,7 @@ void page_remove_rmap(struct page *page, bool compound) * faster for those pages still in swapcache. 
*/ } +EXPORT_SYMBOL(page_remove_rmap); /* * @arg: enum ttu_flags will be passed to this argument @@ -1756,6 +1790,7 @@ void __put_anon_vma(struct anon_vma *anon_vma) if (root != anon_vma && atomic_dec_and_test(&root->refcount)) anon_vma_free(root); } +EXPORT_SYMBOL(__put_anon_vma); static struct anon_vma *rmap_walk_anon_lock(struct page *page, struct rmap_walk_control *rwc) diff --git a/mm/swapfile.c b/mm/swapfile.c index 8688ae65ef58..4f5bfce18a2e 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1615,6 +1615,7 @@ int try_to_free_swap(struct page *page) SetPageDirty(page); return 1; } +EXPORT_SYMBOL(try_to_free_swap); /* * Free the swap entry like above, but also try to From patchwork Thu Dec 20 18:28:37 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Adalbert_Laz=C4=83r?= X-Patchwork-Id: 10739289 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 333106C2 for ; Thu, 20 Dec 2018 18:36:10 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 21EF728E58 for ; Thu, 20 Dec 2018 18:36:10 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 15FC228F39; Thu, 20 Dec 2018 18:36:10 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 83FB228E58 for ; Thu, 20 Dec 2018 18:36:09 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2389339AbeLTSgH (ORCPT ); Thu, 20 Dec 2018 13:36:07 -0500 Received: from mx01.bbu.dsd.mx.bitdefender.com ([91.199.104.161]:44334 "EHLO mx01.bbu.dsd.mx.bitdefender.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2389302AbeLTSgG (ORCPT ); Thu, 20 Dec 2018 13:36:06 -0500 Received: from smtp.bitdefender.com (smtp02.buh.bitdefender.net [10.17.80.76]) by mx01.bbu.dsd.mx.bitdefender.com (Postfix) with ESMTPS id 9CE13305FFB5; Thu, 20 Dec 2018 20:29:37 +0200 (EET) Received: from host.bbu.bitdefender.biz (unknown [10.10.193.111]) by smtp.bitdefender.com (Postfix) with ESMTPSA id 94CDB306E477; Thu, 20 Dec 2018 20:29:37 +0200 (EET) From: =?utf-8?q?Adalbert_Laz=C4=83r?= To: kvm@vger.kernel.org Cc: Paolo Bonzini , =?utf-8?b?UmFkaW0gS3LEjW3DocWZ?= , =?utf-8?q?Mihai_Don?= =?utf-8?q?=C8=9Bu?= , =?utf-8?q?Adalbert_Laz=C4=83r?= , =?utf-8?q?Mircea_?= =?utf-8?q?C=C3=AErjaliu?= Subject: [RFC PATCH v5 07/20] add memory map/unmap support for VM introspection on the guest side Date: Thu, 20 Dec 2018 20:28:37 +0200 Message-Id: <20181220182850.4579-8-alazar@bitdefender.com> In-Reply-To: <20181220182850.4579-1-alazar@bitdefender.com> References: <20181220182850.4579-1-alazar@bitdefender.com> MIME-Version: 1.0 Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Mircea Cîrjaliu An introspection tool running in a dedicated VM can use the new device (/dev/kvmmem) to map memory from other introspected VM-s. 
Two ioctl operations are supported: - KVM_HC_MEM_MAP/struct kvmi_mem_map - KVM_HC_MEM_UNMAP/unsigned long In order to map an introspected gpa to the local gva, the process using this device needs to obtain a token from the host KVMI subsystem (see Documentation/virtual/kvm/kvmi.rst - KVMI_GET_MAP_TOKEN). Both operations use hypercalls (KVM_HC_MEM_MAP, KVM_HC_MEM_UNMAP) to pass the requests to the host kernel/KVMi (see hypercalls.txt). Signed-off-by: Mircea Cîrjaliu --- arch/x86/Kconfig | 9 + arch/x86/include/asm/kvmi_guest.h | 10 + arch/x86/kernel/Makefile | 1 + arch/x86/kernel/kvmi_mem_guest.c | 26 +++ virt/kvm/kvmi_mem_guest.c | 364 ++++++++++++++++++++++++++++++ 5 files changed, 410 insertions(+) create mode 100644 arch/x86/include/asm/kvmi_guest.h create mode 100644 arch/x86/kernel/kvmi_mem_guest.c create mode 100644 virt/kvm/kvmi_mem_guest.c diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 8689e794a43c..e5fd12fafd85 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -804,6 +804,15 @@ config KVM_DEBUG_FS Statistics are displayed in debugfs filesystem. Enabling this option may incur significant overhead. +config KVM_INTROSPECTION_GUEST + bool "KVM Memory Introspection support on Guest" + depends on KVM_GUEST + default n + help + This option enables functions and hypercalls for security applications + running in a separate VM to control the execution of other VM-s, query + the state of the vCPU-s (GPR-s, MSR-s etc.). + config PARAVIRT_TIME_ACCOUNTING bool "Paravirtual steal time accounting" depends on PARAVIRT diff --git a/arch/x86/include/asm/kvmi_guest.h b/arch/x86/include/asm/kvmi_guest.h new file mode 100644 index 000000000000..c7ed53a938e0 --- /dev/null +++ b/arch/x86/include/asm/kvmi_guest.h @@ -0,0 +1,10 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef __KVMI_GUEST_H__ +#define __KVMI_GUEST_H__ + +long kvmi_arch_map_hc(struct kvmi_map_mem_token *tknp, + gpa_t req_gpa, gpa_t map_gpa); +long kvmi_arch_unmap_hc(gpa_t map_gpa); + + +#endif diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index 8824d01c0c35..4802b3544d72 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -115,6 +115,7 @@ obj-$(CONFIG_PARAVIRT) += paravirt.o paravirt_patch_$(BITS).o obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o obj-$(CONFIG_PARAVIRT_CLOCK) += pvclock.o obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o +obj-$(CONFIG_KVM_INTROSPECTION_GUEST) += kvmi_mem_guest.o ../../../virt/kvm/kvmi_mem_guest.o obj-$(CONFIG_JAILHOUSE_GUEST) += jailhouse.o diff --git a/arch/x86/kernel/kvmi_mem_guest.c b/arch/x86/kernel/kvmi_mem_guest.c new file mode 100644 index 000000000000..c4e2613f90f3 --- /dev/null +++ b/arch/x86/kernel/kvmi_mem_guest.c @@ -0,0 +1,26 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * KVM introspection guest implementation + * + * Copyright (C) 2017 Bitdefender S.R.L. + * + * Author: + * Mircea Cirjaliu + */ + +#include +#include +#include +#include + +long kvmi_arch_map_hc(struct kvmi_map_mem_token *tknp, + gpa_t req_gpa, gpa_t map_gpa) +{ + return kvm_hypercall3(KVM_HC_MEM_MAP, (unsigned long)tknp, + req_gpa, map_gpa); +} + +long kvmi_arch_unmap_hc(gpa_t map_gpa) +{ + return kvm_hypercall1(KVM_HC_MEM_UNMAP, map_gpa); +} diff --git a/virt/kvm/kvmi_mem_guest.c b/virt/kvm/kvmi_mem_guest.c new file mode 100644 index 000000000000..a78eba31786b --- /dev/null +++ b/virt/kvm/kvmi_mem_guest.c @@ -0,0 +1,364 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * KVM introspection guest implementation + * + * Copyright (C) 2017-2018 Bitdefender S.R.L. 
+ * + * Author: + * Mircea Cirjaliu + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define ASSERT(exp) BUG_ON(!(exp)) + + +static struct list_head file_list; +static spinlock_t file_lock; + +struct file_map { + struct list_head file_list; + struct file *file; + struct list_head map_list; + struct mutex lock; + bool active; /* for tearing down */ +}; + +struct page_map { + struct list_head map_list; + __u64 gpa; + unsigned long vaddr; + unsigned long paddr; +}; + + +static int kvm_dev_open(struct inode *inodep, struct file *filp) +{ + struct file_map *fmp; + + pr_debug("kvmi: file %016lx opened by process %s\n", + (unsigned long) filp, current->comm); + + /* link the file 1:1 with such a structure */ + fmp = kmalloc(sizeof(*fmp), GFP_KERNEL); + if (fmp == NULL) + return -ENOMEM; + + INIT_LIST_HEAD(&fmp->file_list); + fmp->file = filp; + filp->private_data = fmp; + INIT_LIST_HEAD(&fmp->map_list); + mutex_init(&fmp->lock); + fmp->active = true; + + /* add the entry to the global list */ + spin_lock(&file_lock); + list_add_tail(&fmp->file_list, &file_list); + spin_unlock(&file_lock); + + return 0; +} + +/* actually does the mapping of a page */ +static long _do_mapping(struct kvmi_mem_map *map_req, struct page_map *pmp) +{ + unsigned long paddr; + struct vm_area_struct *vma; + struct page *page; + long result; + + pr_debug("kvmi: mapping remote GPA %016llx into %016llx\n", + map_req->gpa, map_req->gva); + + /* check access to memory location */ + if (!access_ok(VERIFY_READ, map_req->gva, PAGE_SIZE)) { + pr_err("kvmi: invalid virtual address for mapping\n"); + return -EINVAL; + } + + down_read(¤t->mm->mmap_sem); + + /* find the page to be replaced */ + vma = find_vma(current->mm, map_req->gva); + if (IS_ERR_OR_NULL(vma)) { + result = PTR_ERR(vma); + pr_err("kvmi: find_vma() failed with result %ld\n", result); + goto out; + } + + page = follow_page(vma, map_req->gva, 0); + if (IS_ERR_OR_NULL(page)) { + result = PTR_ERR(page); + pr_err("kvmi: follow_page() failed with result %ld\n", result); + goto out; + } + + if (IS_ENABLED(CONFIG_DEBUG_VM)) + dump_page(page, "page to map_req into"); + + WARN(is_zero_pfn(page_to_pfn(page)), "zero-page still mapped"); + + /* get the physical address and store it in page_map */ + paddr = page_to_phys(page); + pr_debug("kvmi: page phys addr %016lx\n", paddr); + pmp->paddr = paddr; + + /* last thing to do is host mapping */ + result = kvmi_arch_map_hc(&map_req->token, map_req->gpa, paddr); + +out: + up_read(¤t->mm->mmap_sem); + + return result; +} + +/* actually does the unmapping of a page */ +static long _do_unmapping(unsigned long paddr) +{ + pr_debug("kvmi: unmapping request for phys addr %016lx\n", paddr); + + /* local GPA uniquely identifies the mapping on the host */ + return kvmi_arch_unmap_hc(paddr); +} + +static long kvm_dev_ioctl_map(struct file_map *fmp, struct kvmi_mem_map *map) +{ + struct page_map *pmp; + long result; + + if (!access_ok(VERIFY_READ, map->gva, PAGE_SIZE)) + return -EINVAL; + if (!access_ok(VERIFY_WRITE, map->gva, PAGE_SIZE)) + return -EINVAL; + + /* prepare list entry */ + pmp = kmalloc(sizeof(*pmp), GFP_KERNEL); + if (pmp == NULL) + return -ENOMEM; + + INIT_LIST_HEAD(&pmp->map_list); + pmp->gpa = map->gpa; + pmp->vaddr = map->gva; + + /* acquire the file mapping */ + mutex_lock(&fmp->lock); + + /* check if other thread is closing the file */ + if (!fmp->active) { + result = -ENODEV; 
+ pr_warn("kvmi: unable to map, file is being closed\n"); + goto out_err; + } + + /* do the actual mapping */ + result = _do_mapping(map, pmp); + if (IS_ERR_VALUE(result)) + goto out_err; + + /* link to list */ + list_add_tail(&pmp->map_list, &fmp->map_list); + + /* all fine */ + goto out_finalize; + +out_err: + kfree(pmp); + +out_finalize: + mutex_unlock(&fmp->lock); + + return result; +} + +static long kvm_dev_ioctl_unmap(struct file_map *fmp, unsigned long vaddr) +{ + struct list_head *cur; + struct page_map *pmp; + bool found = false; + long result = 0; + + /* acquire the file */ + mutex_lock(&fmp->lock); + + /* check if other thread is closing the file */ + if (!fmp->active) { + result = -ENODEV; + pr_warn("kvmi: unable to unmap, file is being closed\n"); + goto out_err; + } + + /* check that this address belongs to us */ + list_for_each(cur, &fmp->map_list) { + pmp = list_entry(cur, struct page_map, map_list); + + /* found */ + if (pmp->vaddr == vaddr) { + found = true; + break; + } + } + + /* not found ? */ + if (!found) { + result = -ENOENT; + pr_err("kvmi: address %016lx not mapped\n", vaddr); + goto out_err; + } + + /* decouple guest mapping */ + list_del(&pmp->map_list); + +out_err: + mutex_unlock(&fmp->lock); + + if (found) { + /* unmap & ignore result */ + _do_unmapping(pmp->paddr); + + /* free guest mapping */ + kfree(pmp); + } + + return result; +} + +static long kvm_dev_ioctl(struct file *filp, + unsigned int ioctl, unsigned long arg) +{ + void __user *argp = (void __user *) arg; + struct file_map *fmp; + long result; + + /* minor check */ + fmp = filp->private_data; + ASSERT(fmp->file == filp); + + switch (ioctl) { + case KVM_INTRO_MEM_MAP: { + struct kvmi_mem_map map; + + result = -EFAULT; + if (copy_from_user(&map, argp, sizeof(map))) + break; + + result = kvm_dev_ioctl_map(fmp, &map); + break; + } + case KVM_INTRO_MEM_UNMAP: { + unsigned long vaddr = (unsigned long) arg; + + result = kvm_dev_ioctl_unmap(fmp, vaddr); + break; + } + default: + pr_err("kvmi: ioctl %d not implemented\n", ioctl); + result = -ENOTTY; + } + + return result; +} + +static int kvm_dev_release(struct inode *inodep, struct file *filp) +{ + struct file_map *fmp; + struct list_head *cur, *next; + struct page_map *pmp; + + pr_debug("kvmi: file %016lx closed by process %s\n", + (unsigned long) filp, current->comm); + + /* acquire the file */ + fmp = filp->private_data; + mutex_lock(&fmp->lock); + + /* mark for teardown */ + fmp->active = false; + + /* release mappings taken on this instance of the file */ + list_for_each_safe(cur, next, &fmp->map_list) { + pmp = list_entry(cur, struct page_map, map_list); + + /* unmap address */ + _do_unmapping(pmp->paddr); + + /* decouple & free guest mapping */ + list_del(&pmp->map_list); + kfree(pmp); + } + + /* done processing this file mapping */ + mutex_unlock(&fmp->lock); + + /* decouple file mapping */ + spin_lock(&file_lock); + list_del(&fmp->file_list); + spin_unlock(&file_lock); + + /* free it */ + kfree(fmp); + + return 0; +} + + +static const struct file_operations kvmmem_ops = { + .open = kvm_dev_open, + .unlocked_ioctl = kvm_dev_ioctl, + .compat_ioctl = kvm_dev_ioctl, + .release = kvm_dev_release, +}; + +static struct miscdevice kvm_mem_dev = { + .minor = MISC_DYNAMIC_MINOR, + .name = "kvmmem", + .fops = &kvmmem_ops, +}; + +int __init kvm_intro_guest_init(void) +{ + int result; + + if (!kvm_para_available()) { + pr_err("kvmi: paravirt not available\n"); + return -EPERM; + } + + result = misc_register(&kvm_mem_dev); + if (result) { + pr_err("kvmi: misc 
device register failed: %d\n", result); + return result; + } + + INIT_LIST_HEAD(&file_list); + spin_lock_init(&file_lock); + + pr_info("kvmi: guest memory introspection device created\n"); + + return 0; +} + +void kvm_intro_guest_exit(void) +{ + misc_deregister(&kvm_mem_dev); +} + +module_init(kvm_intro_guest_init) +module_exit(kvm_intro_guest_exit) From patchwork Thu Dec 20 18:28:38 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Adalbert_Laz=C4=83r?= X-Patchwork-Id: 10739295 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 5944E13B5 for ; Thu, 20 Dec 2018 18:36:16 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 44B1D28F3B for ; Thu, 20 Dec 2018 18:36:16 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 332FC28F3A; Thu, 20 Dec 2018 18:36:16 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 6ACBF28E58 for ; Thu, 20 Dec 2018 18:36:15 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2389315AbeLTSgM (ORCPT ); Thu, 20 Dec 2018 13:36:12 -0500 Received: from mx01.bbu.dsd.mx.bitdefender.com ([91.199.104.161]:44322 "EHLO mx01.bbu.dsd.mx.bitdefender.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2389307AbeLTSgL (ORCPT ); Thu, 20 Dec 2018 13:36:11 -0500 Received: from smtp.bitdefender.com (smtp02.buh.bitdefender.net [10.17.80.76]) by mx01.bbu.dsd.mx.bitdefender.com (Postfix) with ESMTPS id A1522305FFB6; Thu, 20 Dec 2018 20:29:37 +0200 (EET) Received: from host.bbu.bitdefender.biz (unknown [10.10.193.111]) by smtp.bitdefender.com (Postfix) with ESMTPSA id 9A7FE306E47A; Thu, 20 Dec 2018 20:29:37 +0200 (EET) From: =?utf-8?q?Adalbert_Laz=C4=83r?= To: kvm@vger.kernel.org Cc: Paolo Bonzini , =?utf-8?b?UmFkaW0gS3LEjW3DocWZ?= , =?utf-8?q?Mihai_Don?= =?utf-8?q?=C8=9Bu?= , =?utf-8?q?Adalbert_Laz=C4=83r?= Subject: [RFC PATCH v5 08/20] kvm: page track: add support for preread, prewrite and preexec Date: Thu, 20 Dec 2018 20:28:38 +0200 Message-Id: <20181220182850.4579-9-alazar@bitdefender.com> In-Reply-To: <20181220182850.4579-1-alazar@bitdefender.com> References: <20181220182850.4579-1-alazar@bitdefender.com> MIME-Version: 1.0 Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Mihai DONTU These callbacks return a boolean value. If false, the emulation should stop and the instruction should be reexecuted in guest. The preread callback can return the bytes needed by the read operation. kvm_page_track_create_memslot() was extended in order to track gfn-s as soon as the memory slots are created. kvm_page_track_write() was extended to pass the gva. 
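As an illustration of the new API (not part of this patch), a minimal consumer of the PREWRITE tracking mode could look like the sketch below. The demo_* names are invented for the example, and the locking normally taken around kvm_slot_page_track_add_page() (kvm->mmu_lock, srcu) is omitted for brevity; only the callback signature, the tracking mode and the registration helpers come from this patch.

#include <linux/kvm_host.h>
#include <asm/kvm_page_track.h>

static gfn_t demo_protected_gfn;

/* Returning false stops the emulation; the instruction is re-executed in guest. */
static bool demo_track_prewrite(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
				const u8 *new, int bytes,
				struct kvm_page_track_notifier_node *node)
{
	return gpa_to_gfn(gpa) != demo_protected_gfn;
}

static struct kvm_page_track_notifier_node demo_node = {
	.track_prewrite = demo_track_prewrite,
};

/* register the notifier and write-track a single gfn */
static void demo_protect_gfn(struct kvm *kvm, struct kvm_memory_slot *slot,
			     gfn_t gfn)
{
	demo_protected_gfn = gfn;
	kvm_page_track_register_notifier(kvm, &demo_node);
	kvm_slot_page_track_add_page(kvm, slot, gfn, KVM_PAGE_TRACK_PREWRITE);
}

A track_preread() callback has the same shape but takes a writable buffer and a bool *data_ready output so it can supply the read bytes itself, while track_preexec() receives only (vcpu, gpa, gva, node).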
Signed-off-by: Mihai Donțu Signed-off-by: Adalbert Lazăr --- arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/include/asm/kvm_page_track.h | 33 ++++-- arch/x86/kvm/mmu.c | 145 ++++++++++++++++++++++--- arch/x86/kvm/mmu.h | 4 + arch/x86/kvm/page_track.c | 147 +++++++++++++++++++++++--- arch/x86/kvm/x86.c | 54 +++++++--- drivers/gpu/drm/i915/gvt/kvmgt.c | 2 +- 7 files changed, 334 insertions(+), 53 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index fbda5a917c5b..796fc5f1d76c 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1251,7 +1251,7 @@ void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages); int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3); bool pdptrs_changed(struct kvm_vcpu *vcpu); -int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa, +int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva, const void *val, int bytes); struct kvm_irq_mask_notifier { diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h index 172f9749dbb2..a431e5e1e5cb 100644 --- a/arch/x86/include/asm/kvm_page_track.h +++ b/arch/x86/include/asm/kvm_page_track.h @@ -3,7 +3,10 @@ #define _ASM_X86_KVM_PAGE_TRACK_H enum kvm_page_track_mode { + KVM_PAGE_TRACK_PREREAD, + KVM_PAGE_TRACK_PREWRITE, KVM_PAGE_TRACK_WRITE, + KVM_PAGE_TRACK_PREEXEC, KVM_PAGE_TRACK_MAX, }; @@ -22,6 +25,13 @@ struct kvm_page_track_notifier_head { struct kvm_page_track_notifier_node { struct hlist_node node; + bool (*track_preread)(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva, + u8 *new, int bytes, + struct kvm_page_track_notifier_node *node, + bool *data_ready); + bool (*track_prewrite)(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva, + const u8 *new, int bytes, + struct kvm_page_track_notifier_node *node); /* * It is called when guest is writing the write-tracked page * and write emulation is finished at that time. @@ -32,11 +42,17 @@ struct kvm_page_track_notifier_node { * @bytes: the written length. 
* @node: this node */ - void (*track_write)(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new, - int bytes, struct kvm_page_track_notifier_node *node); + void (*track_write)(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva, + const u8 *new, int bytes, + struct kvm_page_track_notifier_node *node); + bool (*track_preexec)(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva, + struct kvm_page_track_notifier_node *node); + void (*track_create_slot)(struct kvm *kvm, struct kvm_memory_slot *slot, + unsigned long npages, + struct kvm_page_track_notifier_node *node); /* * It is called when memory slot is being moved or removed - * users can drop write-protection for the pages in that memory slot + * users can drop active protection for the pages in that memory slot * * @kvm: the kvm where memory slot being moved or removed * @slot: the memory slot being moved or removed @@ -51,7 +67,7 @@ void kvm_page_track_cleanup(struct kvm *kvm); void kvm_page_track_free_memslot(struct kvm_memory_slot *free, struct kvm_memory_slot *dont); -int kvm_page_track_create_memslot(struct kvm_memory_slot *slot, +int kvm_page_track_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot, unsigned long npages); void kvm_slot_page_track_add_page(struct kvm *kvm, @@ -69,7 +85,12 @@ kvm_page_track_register_notifier(struct kvm *kvm, void kvm_page_track_unregister_notifier(struct kvm *kvm, struct kvm_page_track_notifier_node *n); -void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new, - int bytes); +bool kvm_page_track_preread(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva, + u8 *new, int bytes, bool *data_ready); +bool kvm_page_track_prewrite(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva, + const u8 *new, int bytes); +void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva, + const u8 *new, int bytes); +bool kvm_page_track_preexec(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva); void kvm_page_track_flush_slot(struct kvm *kvm, struct kvm_memory_slot *slot); #endif diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 7c03c0f35444..ff3252621fcf 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -1082,9 +1082,13 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp) slot = __gfn_to_memslot(slots, gfn); /* the non-leaf shadow pages are keeping readonly. 
*/ - if (sp->role.level > PT_PAGE_TABLE_LEVEL) - return kvm_slot_page_track_add_page(kvm, slot, gfn, - KVM_PAGE_TRACK_WRITE); + if (sp->role.level > PT_PAGE_TABLE_LEVEL) { + kvm_slot_page_track_add_page(kvm, slot, gfn, + KVM_PAGE_TRACK_PREWRITE); + kvm_slot_page_track_add_page(kvm, slot, gfn, + KVM_PAGE_TRACK_WRITE); + return; + } kvm_mmu_gfn_disallow_lpage(slot, gfn); } @@ -1099,9 +1103,13 @@ static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp) gfn = sp->gfn; slots = kvm_memslots_for_spte_role(kvm, sp->role); slot = __gfn_to_memslot(slots, gfn); - if (sp->role.level > PT_PAGE_TABLE_LEVEL) - return kvm_slot_page_track_remove_page(kvm, slot, gfn, - KVM_PAGE_TRACK_WRITE); + if (sp->role.level > PT_PAGE_TABLE_LEVEL) { + kvm_slot_page_track_remove_page(kvm, slot, gfn, + KVM_PAGE_TRACK_PREWRITE); + kvm_slot_page_track_remove_page(kvm, slot, gfn, + KVM_PAGE_TRACK_WRITE); + return; + } kvm_mmu_gfn_allow_lpage(slot, gfn); } @@ -1490,6 +1498,31 @@ static bool spte_write_protect(u64 *sptep, bool pt_protect) return mmu_spte_update(sptep, spte); } +static bool spte_read_protect(u64 *sptep) +{ + u64 spte = *sptep; + bool exec_only_supported = (shadow_present_mask == 0ull); + + rmap_printk("rmap_read_protect: spte %p %llx\n", sptep, *sptep); + + WARN_ON_ONCE(!exec_only_supported); + + spte = spte & ~(PT_WRITABLE_MASK | PT_PRESENT_MASK); + + return mmu_spte_update(sptep, spte); +} + +static bool spte_exec_protect(u64 *sptep) +{ + u64 spte = *sptep; + + rmap_printk("rmap_exec_protect: spte %p %llx\n", sptep, *sptep); + + spte = spte & ~PT_USER_MASK; + + return mmu_spte_update(sptep, spte); +} + static bool __rmap_write_protect(struct kvm *kvm, struct kvm_rmap_head *rmap_head, bool pt_protect) @@ -1504,6 +1537,32 @@ static bool __rmap_write_protect(struct kvm *kvm, return flush; } +static bool __rmap_read_protect(struct kvm *kvm, + struct kvm_rmap_head *rmap_head) +{ + u64 *sptep; + struct rmap_iterator iter; + bool flush = false; + + for_each_rmap_spte(rmap_head, &iter, sptep) + flush |= spte_read_protect(sptep); + + return flush; +} + +static bool __rmap_exec_protect(struct kvm *kvm, + struct kvm_rmap_head *rmap_head) +{ + u64 *sptep; + struct rmap_iterator iter; + bool flush = false; + + for_each_rmap_spte(rmap_head, &iter, sptep) + flush |= spte_exec_protect(sptep); + + return flush; +} + static bool spte_clear_dirty(u64 *sptep) { u64 spte = *sptep; @@ -1674,6 +1733,36 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm, return write_protected; } +bool kvm_mmu_slot_gfn_read_protect(struct kvm *kvm, + struct kvm_memory_slot *slot, u64 gfn) +{ + struct kvm_rmap_head *rmap_head; + int i; + bool read_protected = false; + + for (i = PT_PAGE_TABLE_LEVEL; i <= PT_MAX_HUGEPAGE_LEVEL; ++i) { + rmap_head = __gfn_to_rmap(gfn, i, slot); + read_protected |= __rmap_read_protect(kvm, rmap_head); + } + + return read_protected; +} + +bool kvm_mmu_slot_gfn_exec_protect(struct kvm *kvm, + struct kvm_memory_slot *slot, u64 gfn) +{ + struct kvm_rmap_head *rmap_head; + int i; + bool exec_protected = false; + + for (i = PT_PAGE_TABLE_LEVEL; i <= PT_MAX_HUGEPAGE_LEVEL; ++i) { + rmap_head = __gfn_to_rmap(gfn, i, slot); + exec_protected |= __rmap_exec_protect(kvm, rmap_head); + } + + return exec_protected; +} + static bool rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn) { struct kvm_memory_slot *slot; @@ -2366,6 +2455,20 @@ static void clear_sp_write_flooding_count(u64 *spte) __clear_sp_write_flooding_count(sp); } +static unsigned int kvm_mmu_page_track_acc(struct kvm_vcpu *vcpu, gfn_t gfn, + unsigned 
int acc) +{ + if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREREAD)) + acc &= ~ACC_USER_MASK; + if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREWRITE) || + kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_WRITE)) + acc &= ~ACC_WRITE_MASK; + if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREEXEC)) + acc &= ~ACC_EXEC_MASK; + + return acc; +} + static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gaddr, @@ -2386,7 +2489,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, role.direct = direct; if (role.direct) role.cr4_pae = 0; - role.access = access; + role.access = kvm_mmu_page_track_acc(vcpu, gfn, access); if (!vcpu->arch.mmu->direct_map && vcpu->arch.mmu->root_level <= PT32_ROOT_LEVEL) { quadrant = gaddr >> (PAGE_SHIFT + (PT64_PT_BITS * level)); @@ -2767,7 +2870,8 @@ static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn, { struct kvm_mmu_page *sp; - if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_WRITE)) + if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREWRITE) || + kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_WRITE)) return true; for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn) { @@ -3106,7 +3210,10 @@ static int __direct_map(struct kvm_vcpu *vcpu, int write, int map_writable, for_each_shadow_entry(vcpu, (u64)gfn << PAGE_SHIFT, iterator) { if (iterator.level == level) { - emulate = mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, + unsigned int acc = kvm_mmu_page_track_acc(vcpu, gfn, + ACC_ALL); + + emulate = mmu_set_spte(vcpu, iterator.sptep, acc, write, level, gfn, pfn, prefault, map_writable); direct_pte_prefetch(vcpu, iterator.sptep); @@ -3881,15 +3988,21 @@ static bool page_fault_handle_page_track(struct kvm_vcpu *vcpu, if (unlikely(error_code & PFERR_RSVD_MASK)) return false; - if (!(error_code & PFERR_PRESENT_MASK) || - !(error_code & PFERR_WRITE_MASK)) + if (!(error_code & PFERR_PRESENT_MASK)) return false; /* - * guest is writing the page which is write tracked which can + * guest is reading/writing/fetching the page which is + * read/write/execute tracked which can * not be fixed by page fault handler. */ - if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_WRITE)) + if (((error_code & PFERR_USER_MASK) + && kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREREAD)) + || ((error_code & PFERR_WRITE_MASK) + && (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREWRITE) + || kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_WRITE))) + || ((error_code & PFERR_FETCH_MASK) + && kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREEXEC))) return true; return false; @@ -5175,7 +5288,7 @@ static u64 *get_written_sptes(struct kvm_mmu_page *sp, gpa_t gpa, int *nspte) return spte; } -static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, +static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva, const u8 *new, int bytes, struct kvm_page_track_notifier_node *node) { @@ -5335,7 +5448,9 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u64 error_code, * explicitly shadowing L1's page tables, i.e. unprotecting something * for L1 isn't going to magically fix whatever issue cause L2 to fail. 
*/ - if (!mmio_info_in_cache(vcpu, cr2, direct) && !is_guest_mode(vcpu)) + if (!mmio_info_in_cache(vcpu, cr2, direct) && !is_guest_mode(vcpu) + && kvm_page_track_is_active(vcpu, gpa_to_gfn(cr2), + KVM_PAGE_TRACK_PREEXEC)) emulation_type = EMULTYPE_ALLOW_RETRY; emulate: /* diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index c7b333147c4a..45948dabe0b6 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -210,5 +210,9 @@ void kvm_mmu_gfn_disallow_lpage(struct kvm_memory_slot *slot, gfn_t gfn); void kvm_mmu_gfn_allow_lpage(struct kvm_memory_slot *slot, gfn_t gfn); bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm, struct kvm_memory_slot *slot, u64 gfn); +bool kvm_mmu_slot_gfn_read_protect(struct kvm *kvm, + struct kvm_memory_slot *slot, u64 gfn); +bool kvm_mmu_slot_gfn_exec_protect(struct kvm *kvm, + struct kvm_memory_slot *slot, u64 gfn); int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu); #endif diff --git a/arch/x86/kvm/page_track.c b/arch/x86/kvm/page_track.c index 3052a59a3065..fc792939a05c 100644 --- a/arch/x86/kvm/page_track.c +++ b/arch/x86/kvm/page_track.c @@ -1,5 +1,5 @@ /* - * Support KVM gust page tracking + * Support KVM guest page tracking * * This feature allows us to track page access in guest. Currently, only * write access is tracked. @@ -34,10 +34,13 @@ void kvm_page_track_free_memslot(struct kvm_memory_slot *free, } } -int kvm_page_track_create_memslot(struct kvm_memory_slot *slot, +int kvm_page_track_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot, unsigned long npages) { - int i; + struct kvm_page_track_notifier_head *head; + struct kvm_page_track_notifier_node *n; + int idx; + int i; for (i = 0; i < KVM_PAGE_TRACK_MAX; i++) { slot->arch.gfn_track[i] = @@ -47,6 +50,17 @@ int kvm_page_track_create_memslot(struct kvm_memory_slot *slot, goto track_free; } + head = &kvm->arch.track_notifier_head; + + if (hlist_empty(&head->track_notifier_list)) + return 0; + + idx = srcu_read_lock(&head->track_srcu); + hlist_for_each_entry_rcu(n, &head->track_notifier_list, node) + if (n->track_create_slot) + n->track_create_slot(kvm, slot, npages, n); + srcu_read_unlock(&head->track_srcu, idx); + return 0; track_free: @@ -87,7 +101,7 @@ static void update_gfn_track(struct kvm_memory_slot *slot, gfn_t gfn, * @kvm: the guest instance we are interested in. * @slot: the @gfn belongs to. * @gfn: the guest page. - * @mode: tracking mode, currently only write track is supported. + * @mode: tracking mode. */ void kvm_slot_page_track_add_page(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn, @@ -105,9 +119,16 @@ void kvm_slot_page_track_add_page(struct kvm *kvm, */ kvm_mmu_gfn_disallow_lpage(slot, gfn); - if (mode == KVM_PAGE_TRACK_WRITE) + if (mode == KVM_PAGE_TRACK_PREWRITE || mode == KVM_PAGE_TRACK_WRITE) { if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn)) kvm_flush_remote_tlbs(kvm); + } else if (mode == KVM_PAGE_TRACK_PREREAD) { + if (kvm_mmu_slot_gfn_read_protect(kvm, slot, gfn)) + kvm_flush_remote_tlbs(kvm); + } else if (mode == KVM_PAGE_TRACK_PREEXEC) { + if (kvm_mmu_slot_gfn_exec_protect(kvm, slot, gfn)) + kvm_flush_remote_tlbs(kvm); + } } EXPORT_SYMBOL_GPL(kvm_slot_page_track_add_page); @@ -122,7 +143,7 @@ EXPORT_SYMBOL_GPL(kvm_slot_page_track_add_page); * @kvm: the guest instance we are interested in. * @slot: the @gfn belongs to. * @gfn: the guest page. - * @mode: tracking mode, currently only write track is supported. + * @mode: tracking mode. 
*/ void kvm_slot_page_track_remove_page(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn, @@ -215,15 +236,84 @@ kvm_page_track_unregister_notifier(struct kvm *kvm, } EXPORT_SYMBOL_GPL(kvm_page_track_unregister_notifier); +/* + * Notify the node that a read access is about to happen. Returning false + * doesn't stop the other nodes from being called, but it will stop + * the emulation. + * + * The node should figure out if the written page is the one that the node + * is interested in by itself. + * + * The nodes will always be in conflict if they track the same page: + * - accepting a read won't guarantee that the next node will not override + * the data (filling new/bytes and setting data_ready) + * - filling new/bytes with custom data won't guarantee that the next node + * will not override that + */ +bool kvm_page_track_preread(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva, + u8 *new, int bytes, bool *data_ready) +{ + struct kvm_page_track_notifier_head *head; + struct kvm_page_track_notifier_node *n; + int idx; + bool ret = true; + + *data_ready = false; + + head = &vcpu->kvm->arch.track_notifier_head; + + if (hlist_empty(&head->track_notifier_list)) + return ret; + + idx = srcu_read_lock(&head->track_srcu); + hlist_for_each_entry_rcu(n, &head->track_notifier_list, node) + if (n->track_preread) + if (!n->track_preread(vcpu, gpa, gva, new, bytes, n, + data_ready)) + ret = false; + srcu_read_unlock(&head->track_srcu, idx); + return ret; +} + +/* + * Notify the node that a write access is about to happen. Returning false + * doesn't stop the other nodes from being called, but it will stop + * the emulation. + * + * The node should figure out if the written page is the one that the node + * is interested in by itself. + */ +bool kvm_page_track_prewrite(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva, + const u8 *new, int bytes) +{ + struct kvm_page_track_notifier_head *head; + struct kvm_page_track_notifier_node *n; + int idx; + bool ret = true; + + head = &vcpu->kvm->arch.track_notifier_head; + + if (hlist_empty(&head->track_notifier_list)) + return ret; + + idx = srcu_read_lock(&head->track_srcu); + hlist_for_each_entry_rcu(n, &head->track_notifier_list, node) + if (n->track_prewrite) + if (!n->track_prewrite(vcpu, gpa, gva, new, bytes, n)) + ret = false; + srcu_read_unlock(&head->track_srcu, idx); + return ret; +} + /* * Notify the node that write access is intercepted and write emulation is * finished at this time. * - * The node should figure out if the written page is the one that node is - * interested in by itself. + * The node should figure out if the written page is the one that the node + * is interested in by itself. */ -void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new, - int bytes) +void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva, + const u8 *new, int bytes) { struct kvm_page_track_notifier_head *head; struct kvm_page_track_notifier_node *n; @@ -237,16 +327,45 @@ void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new, idx = srcu_read_lock(&head->track_srcu); hlist_for_each_entry_rcu(n, &head->track_notifier_list, node) if (n->track_write) - n->track_write(vcpu, gpa, new, bytes, n); + n->track_write(vcpu, gpa, gva, new, bytes, n); + srcu_read_unlock(&head->track_srcu, idx); +} + +/* + * Notify the node that an instruction is about to be executed. + * Returning false doesn't stop the other nodes from being called, + * but it will stop the emulation with X86EMUL_RETRY_INSTR. 
+ * + * The node should figure out if the written page is the one that the node + * is interested in by itself. + */ +bool kvm_page_track_preexec(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva) +{ + struct kvm_page_track_notifier_head *head; + struct kvm_page_track_notifier_node *n; + int idx; + bool ret = true; + + head = &vcpu->kvm->arch.track_notifier_head; + + if (hlist_empty(&head->track_notifier_list)) + return ret; + + idx = srcu_read_lock(&head->track_srcu); + hlist_for_each_entry_rcu(n, &head->track_notifier_list, node) + if (n->track_preexec) + if (!n->track_preexec(vcpu, gpa, gva, n)) + ret = false; srcu_read_unlock(&head->track_srcu, idx); + return ret; } /* * Notify the node that memory slot is being removed or moved so that it can - * drop write-protection for the pages in the memory slot. + * drop active protection for the pages in the memory slot. * - * The node should figure out it has any write-protected pages in this slot - * by itself. + * The node should figure out if the written page is the one that the node + * is interested in by itself. */ void kvm_page_track_flush_slot(struct kvm *kvm, struct kvm_memory_slot *slot) { diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 1d4bab80617c..3f332b947c1b 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5024,6 +5024,9 @@ static int kvm_fetch_guest_virt(struct x86_emulate_ctxt *ctxt, if (unlikely(gpa == UNMAPPED_GVA)) return X86EMUL_PROPAGATE_FAULT; + if (!kvm_page_track_preexec(vcpu, gpa, addr)) + return X86EMUL_RETRY_INSTR; + offset = addr & (PAGE_SIZE-1); if (WARN_ON(offset + bytes > PAGE_SIZE)) bytes = (unsigned)PAGE_SIZE - offset; @@ -5192,22 +5195,35 @@ static int vcpu_mmio_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva, return vcpu_is_mmio_gpa(vcpu, gva, *gpa, write); } -int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa, +int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva, const void *val, int bytes) { - int ret; + if (!kvm_page_track_prewrite(vcpu, gpa, gva, val, bytes)) + return X86EMUL_RETRY_INSTR; + if (kvm_vcpu_write_guest(vcpu, gpa, val, bytes) < 0) + return X86EMUL_UNHANDLEABLE; + kvm_page_track_write(vcpu, gpa, gva, val, bytes); + return X86EMUL_CONTINUE; +} - ret = kvm_vcpu_write_guest(vcpu, gpa, val, bytes); - if (ret < 0) - return 0; - kvm_page_track_write(vcpu, gpa, val, bytes); - return 1; +static int emulator_read_phys(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva, + void *val, int bytes) +{ + bool data_ready; + + if (!kvm_page_track_preread(vcpu, gpa, gva, val, bytes, &data_ready)) + return X86EMUL_RETRY_INSTR; + if (data_ready) + return X86EMUL_CONTINUE; + if (kvm_vcpu_read_guest(vcpu, gpa, val, bytes) < 0) + return X86EMUL_UNHANDLEABLE; + return X86EMUL_CONTINUE; } struct read_write_emulator_ops { int (*read_write_prepare)(struct kvm_vcpu *vcpu, void *val, int bytes); - int (*read_write_emulate)(struct kvm_vcpu *vcpu, gpa_t gpa, + int (*read_write_emulate)(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva, void *val, int bytes); int (*read_write_mmio)(struct kvm_vcpu *vcpu, gpa_t gpa, int bytes, void *val); @@ -5228,16 +5244,16 @@ static int read_prepare(struct kvm_vcpu *vcpu, void *val, int bytes) return 0; } -static int read_emulate(struct kvm_vcpu *vcpu, gpa_t gpa, +static int read_emulate(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva, void *val, int bytes) { - return !kvm_vcpu_read_guest(vcpu, gpa, val, bytes); + return emulator_read_phys(vcpu, gpa, gva, val, bytes); } -static int write_emulate(struct kvm_vcpu *vcpu, gpa_t gpa, +static int write_emulate(struct 
kvm_vcpu *vcpu, gpa_t gpa, gva_t gva, void *val, int bytes) { - return emulator_write_phys(vcpu, gpa, val, bytes); + return emulator_write_phys(vcpu, gpa, gva, val, bytes); } static int write_mmio(struct kvm_vcpu *vcpu, gpa_t gpa, int bytes, void *val) @@ -5306,8 +5322,11 @@ static int emulator_read_write_onepage(unsigned long addr, void *val, return X86EMUL_PROPAGATE_FAULT; } - if (!ret && ops->read_write_emulate(vcpu, gpa, val, bytes)) - return X86EMUL_CONTINUE; + if (!ret) { + ret = ops->read_write_emulate(vcpu, gpa, addr, val, bytes); + if (ret == X86EMUL_CONTINUE || ret == X86EMUL_RETRY_INSTR) + return ret; + } /* * Is this MMIO handled locally? @@ -5442,6 +5461,9 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt, if (is_error_page(page)) goto emul_write; + if (!kvm_page_track_prewrite(vcpu, gpa, addr, new, bytes)) + return X86EMUL_RETRY_INSTR; + kaddr = kmap_atomic(page); kaddr += offset_in_page(gpa); switch (bytes) { @@ -5467,7 +5489,7 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt, return X86EMUL_CMPXCHG_FAILED; kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT); - kvm_page_track_write(vcpu, gpa, new, bytes); + kvm_page_track_write(vcpu, gpa, addr, new, bytes); return X86EMUL_CONTINUE; @@ -9243,7 +9265,7 @@ int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot, } } - if (kvm_page_track_create_memslot(slot, npages)) + if (kvm_page_track_create_memslot(kvm, slot, npages)) goto out_free; return 0; diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c b/drivers/gpu/drm/i915/gvt/kvmgt.c index c1072143da1d..8c4858603825 100644 --- a/drivers/gpu/drm/i915/gvt/kvmgt.c +++ b/drivers/gpu/drm/i915/gvt/kvmgt.c @@ -1540,7 +1540,7 @@ static int kvmgt_page_track_remove(unsigned long handle, u64 gfn) return 0; } -static void kvmgt_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, +static void kvmgt_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva, const u8 *val, int len, struct kvm_page_track_notifier_node *node) { From patchwork Thu Dec 20 18:28:39 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Adalbert_Laz=C4=83r?= X-Patchwork-Id: 10739297 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 886D36C2 for ; Thu, 20 Dec 2018 18:36:16 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 764F028F35 for ; Thu, 20 Dec 2018 18:36:16 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 6AC0528F3A; Thu, 20 Dec 2018 18:36:16 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id ACC7528F35 for ; Thu, 20 Dec 2018 18:36:15 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2389343AbeLTSgL (ORCPT ); Thu, 20 Dec 2018 13:36:11 -0500 Received: from mx01.bbu.dsd.mx.bitdefender.com ([91.199.104.161]:44340 "EHLO mx01.bbu.dsd.mx.bitdefender.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2389313AbeLTSgJ (ORCPT ); Thu, 20 Dec 2018 13:36:09 -0500 Received: from smtp.bitdefender.com 
(smtp02.buh.bitdefender.net [10.17.80.76]) by mx01.bbu.dsd.mx.bitdefender.com (Postfix) with ESMTPS id AA2EB305FFB7; Thu, 20 Dec 2018 20:29:37 +0200 (EET) Received: from host.bbu.bitdefender.biz (unknown [10.10.193.111]) by smtp.bitdefender.com (Postfix) with ESMTPSA id A1043306E47B; Thu, 20 Dec 2018 20:29:37 +0200 (EET) From: =?utf-8?q?Adalbert_Laz=C4=83r?= To: kvm@vger.kernel.org Cc: Paolo Bonzini , =?utf-8?b?UmFkaW0gS3LEjW3DocWZ?= , =?utf-8?q?Mihai_Don?= =?utf-8?q?=C8=9Bu?= , =?utf-8?q?Adalbert_Laz=C4=83r?= , =?utf-8?b?Tmlj?= =?utf-8?b?dciZb3IgQ8OuyJt1?= , Marian Rotariu Subject: [RFC PATCH v5 09/20] kvm: extend to accommodate the VM introspection subsystem Date: Thu, 20 Dec 2018 20:28:39 +0200 Message-Id: <20181220182850.4579-10-alazar@bitdefender.com> In-Reply-To: <20181220182850.4579-1-alazar@bitdefender.com> References: <20181220182850.4579-1-alazar@bitdefender.com> MIME-Version: 1.0 Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Add EMULATION_USER_EXIT to stop the emulator and re-execute the current instruction in guest. Add kvm_set_mtf(), kvm_set_interrupt_shadow() and kvm_spt_fault() to allow single-stepping for unimplemented x86 instructions. Add kvm_arch_msr_intercept() to enable/disable MSR interception and prevent MSR write exits from being disabled. Add kvm_mmu_nested_pagefault() and kvm_mmu_fault_gla() to filter #PF introspection events. Add kvm_arch_queue_bp() to reinject breakpoints. Export kvm_vcpu_ioctl_x86_get_xsave(). Add kvm_arch_vcpu_get_regs(), kvm_arch_vcpu_set_regs(), kvm_arch_vcpu_get_sregs() and kvm_arch_vcpu_set_guest_debug() to avoid vcpu_load()/vcpu_unload() calls. Export the availability of EPT views. This is used to validate the KVMI_GET_PAGE_ACCESS and KVMI_SET_PAGE_ACCESS commands when the introspection tool selects a different EPT view. Export kill_pid_info() to ungracefully shut down the guest. Export the #PF error code bits. 
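To illustrate how these hooks compose (not part of this patch), the sketch below shows a hypothetical introspection-side handler for an EPT/NPT fault. The demo_* name is invented, the actual KVMI event plumbing is elided, and only the wrappers declared in this patch (kvm_spt_fault(), kvm_mmu_nested_pagefault(), kvm_mmu_fault_gla(), kvm_set_mtf()) plus the new vcpu->arch.error_code field are assumed.

#include <linux/kvm_host.h>

static void demo_handle_spt_fault(struct kvm_vcpu *vcpu, gpa_t gpa)
{
	u64 gla;

	/* only faults coming from the EPT/NPT tables are of interest */
	if (!kvm_spt_fault(vcpu))
		return;

	/* skip faults triggered by the guest's own page-table walk */
	if (kvm_mmu_nested_pagefault(vcpu))
		return;

	/* ~0ull means no valid guest linear address was reported */
	gla = kvm_mmu_fault_gla(vcpu);

	/* here the real code would queue a KVMI event for the tool */
	pr_debug("kvmi: spt fault gpa %llx gla %llx error code %llx\n",
		 gpa, gla, vcpu->arch.error_code);

	/* single-step over the faulting instruction via the monitor trap flag */
	kvm_set_mtf(vcpu, true);
}

In the same spirit, kvm_arch_msr_intercept(vcpu, msr, true) would be called when the introspection tool registers for write events on a given MSR; the vcpu/svm argument threaded through set_msr_interception() and vmx_disable_intercept_for_msr() is what allows such an interception to be kept from being dropped later.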
Signed-off-by: Mihai Donțu Signed-off-by: Nicușor Cîțu Signed-off-by: Adalbert Lazăr Signed-off-by: Marian Rotariu --- arch/x86/include/asm/kvm_emulate.h | 1 + arch/x86/include/asm/kvm_host.h | 18 +++++ arch/x86/include/asm/vmx.h | 2 + arch/x86/kvm/emulate.c | 9 ++- arch/x86/kvm/mmu.c | 10 +++ arch/x86/kvm/svm.c | 68 ++++++++++++++---- arch/x86/kvm/vmx.c | 91 ++++++++++++++++++----- arch/x86/kvm/x86.c | 111 +++++++++++++++++++++++++---- include/linux/kvm_host.h | 10 +++ kernel/signal.c | 1 + 10 files changed, 277 insertions(+), 44 deletions(-) diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index 93c4bf598fb0..4ed7b702e9ce 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -444,6 +444,7 @@ bool x86_page_table_writing_insn(struct x86_emulate_ctxt *ctxt); #define EMULATION_OK 0 #define EMULATION_RESTART 1 #define EMULATION_INTERCEPTED 2 +#define EMULATION_USER_EXIT 3 void init_decode_cache(struct x86_emulate_ctxt *ctxt); int x86_emulate_insn(struct x86_emulate_ctxt *ctxt); int emulator_task_switch(struct x86_emulate_ctxt *ctxt, diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 796fc5f1d76c..db4248513900 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -757,6 +757,9 @@ struct kvm_vcpu_arch { /* set at EPT violation at this point */ unsigned long exit_qualification; + /* #PF translated error code from EPT/NPT exit reason */ + u64 error_code; + /* pv related host specific info */ struct { bool pv_unhalted; @@ -1186,6 +1189,13 @@ struct kvm_x86_ops { int (*nested_enable_evmcs)(struct kvm_vcpu *vcpu, uint16_t *vmcs_version); + + void (*msr_intercept)(struct kvm_vcpu *vcpu, unsigned int msr, + bool enable); + u64 (*fault_gla)(struct kvm_vcpu *vcpu); + void (*set_mtf)(struct kvm_vcpu *vcpu, bool enable); + bool (*nested_pagefault)(struct kvm_vcpu *vcpu); + bool (*spt_fault)(struct kvm_vcpu *vcpu); }; struct kvm_arch_async_pf { @@ -1283,6 +1293,7 @@ extern u64 kvm_max_tsc_scaling_ratio; extern u64 kvm_default_tsc_scaling_ratio; extern u64 kvm_mce_cap_supported; +extern bool kvm_eptp_switching_supported; enum emulation_result { EMULATE_DONE, /* no further processing */ @@ -1581,4 +1592,11 @@ static inline int kvm_cpu_get_apicid(int mps_cpu) #define put_smstate(type, buf, offset, val) \ *(type *)((buf) + (offset) - 0x7e00) = val +void kvm_arch_msr_intercept(struct kvm_vcpu *vcpu, unsigned int msr, + bool enable); +u64 kvm_mmu_fault_gla(struct kvm_vcpu *vcpu); +bool kvm_mmu_nested_pagefault(struct kvm_vcpu *vcpu); +bool kvm_spt_fault(struct kvm_vcpu *vcpu); +void kvm_set_mtf(struct kvm_vcpu *vcpu, bool enable); +void kvm_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask); #endif /* _ASM_X86_KVM_HOST_H */ diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h index ade0f153947d..9b18c1fa5068 100644 --- a/arch/x86/include/asm/vmx.h +++ b/arch/x86/include/asm/vmx.h @@ -526,6 +526,7 @@ struct vmx_msr_entry { #define EPT_VIOLATION_READABLE_BIT 3 #define EPT_VIOLATION_WRITABLE_BIT 4 #define EPT_VIOLATION_EXECUTABLE_BIT 5 +#define EPT_VIOLATION_GLA_VALID_BIT 7 #define EPT_VIOLATION_GVA_TRANSLATED_BIT 8 #define EPT_VIOLATION_ACC_READ (1 << EPT_VIOLATION_ACC_READ_BIT) #define EPT_VIOLATION_ACC_WRITE (1 << EPT_VIOLATION_ACC_WRITE_BIT) @@ -533,6 +534,7 @@ struct vmx_msr_entry { #define EPT_VIOLATION_READABLE (1 << EPT_VIOLATION_READABLE_BIT) #define EPT_VIOLATION_WRITABLE (1 << EPT_VIOLATION_WRITABLE_BIT) #define EPT_VIOLATION_EXECUTABLE (1 << 
EPT_VIOLATION_EXECUTABLE_BIT) +#define EPT_VIOLATION_GLA_VALID (1 << EPT_VIOLATION_GLA_VALID_BIT) #define EPT_VIOLATION_GVA_TRANSLATED (1 << EPT_VIOLATION_GVA_TRANSLATED_BIT) /* diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 78e430f4e15c..74baab83c526 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -5366,7 +5366,12 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len) ctxt->memopp->addr.mem.ea + ctxt->_eip); done: - return (rc != X86EMUL_CONTINUE) ? EMULATION_FAILED : EMULATION_OK; + if (rc == X86EMUL_RETRY_INSTR) + return EMULATION_USER_EXIT; + else if (rc == X86EMUL_CONTINUE) + return EMULATION_OK; + else + return EMULATION_FAILED; } bool x86_page_table_writing_insn(struct x86_emulate_ctxt *ctxt) @@ -5736,6 +5741,8 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt) if (rc == X86EMUL_INTERCEPTED) return EMULATION_INTERCEPTED; + if (rc == X86EMUL_RETRY_INSTR) + return EMULATION_USER_EXIT; if (rc == X86EMUL_CONTINUE) writeback_registers(ctxt); diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index ff3252621fcf..ed4b3c593bf5 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -6119,3 +6119,13 @@ void kvm_mmu_module_exit(void) unregister_shrinker(&mmu_shrinker); mmu_audit_disable(); } + +u64 kvm_mmu_fault_gla(struct kvm_vcpu *vcpu) +{ + return kvm_x86_ops->fault_gla(vcpu); +} + +bool kvm_mmu_nested_pagefault(struct kvm_vcpu *vcpu) +{ + return kvm_x86_ops->nested_pagefault(vcpu); +} diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index cc6467b35a85..b5714b8a2f51 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1054,7 +1054,8 @@ static bool msr_write_intercepted(struct kvm_vcpu *vcpu, unsigned msr) return !!test_bit(bit_write, &tmp); } -static void set_msr_interception(u32 *msrpm, unsigned msr, +static void set_msr_interception(struct vcpu_svm *svm, + u32 *msrpm, unsigned msr, int read, int write) { u8 bit_read, bit_write; @@ -1090,7 +1091,7 @@ static void svm_vcpu_init_msrpm(u32 *msrpm) if (!direct_access_msrs[i].always) continue; - set_msr_interception(msrpm, direct_access_msrs[i].index, 1, 1); + set_msr_interception(NULL, msrpm, direct_access_msrs[i].index, 1, 1); } } @@ -1142,10 +1143,10 @@ static void svm_enable_lbrv(struct vcpu_svm *svm) u32 *msrpm = svm->msrpm; svm->vmcb->control.virt_ext |= LBR_CTL_ENABLE_MASK; - set_msr_interception(msrpm, MSR_IA32_LASTBRANCHFROMIP, 1, 1); - set_msr_interception(msrpm, MSR_IA32_LASTBRANCHTOIP, 1, 1); - set_msr_interception(msrpm, MSR_IA32_LASTINTFROMIP, 1, 1); - set_msr_interception(msrpm, MSR_IA32_LASTINTTOIP, 1, 1); + set_msr_interception(svm, msrpm, MSR_IA32_LASTBRANCHFROMIP, 1, 1); + set_msr_interception(svm, msrpm, MSR_IA32_LASTBRANCHTOIP, 1, 1); + set_msr_interception(svm, msrpm, MSR_IA32_LASTINTFROMIP, 1, 1); + set_msr_interception(svm, msrpm, MSR_IA32_LASTINTTOIP, 1, 1); } static void svm_disable_lbrv(struct vcpu_svm *svm) @@ -1153,10 +1154,10 @@ static void svm_disable_lbrv(struct vcpu_svm *svm) u32 *msrpm = svm->msrpm; svm->vmcb->control.virt_ext &= ~LBR_CTL_ENABLE_MASK; - set_msr_interception(msrpm, MSR_IA32_LASTBRANCHFROMIP, 0, 0); - set_msr_interception(msrpm, MSR_IA32_LASTBRANCHTOIP, 0, 0); - set_msr_interception(msrpm, MSR_IA32_LASTINTFROMIP, 0, 0); - set_msr_interception(msrpm, MSR_IA32_LASTINTTOIP, 0, 0); + set_msr_interception(svm, msrpm, MSR_IA32_LASTBRANCHFROMIP, 0, 0); + set_msr_interception(svm, msrpm, MSR_IA32_LASTBRANCHTOIP, 0, 0); + set_msr_interception(svm, msrpm, MSR_IA32_LASTINTFROMIP, 0, 0); + set_msr_interception(svm, msrpm, 
MSR_IA32_LASTINTTOIP, 0, 0); } static void disable_nmi_singlestep(struct vcpu_svm *svm) @@ -2660,6 +2661,8 @@ static int pf_interception(struct vcpu_svm *svm) u64 fault_address = __sme_clr(svm->vmcb->control.exit_info_2); u64 error_code = svm->vmcb->control.exit_info_1; + svm->vcpu.arch.error_code = error_code; + return kvm_handle_page_fault(&svm->vcpu, error_code, fault_address, static_cpu_has(X86_FEATURE_DECODEASSISTS) ? svm->vmcb->control.insn_bytes : NULL, @@ -4262,7 +4265,7 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) * We update the L1 MSR bit as well since it will end up * touching the MSR anyway now. */ - set_msr_interception(svm->msrpm, MSR_IA32_SPEC_CTRL, 1, 1); + set_msr_interception(svm, svm->msrpm, MSR_IA32_SPEC_CTRL, 1, 1); break; case MSR_IA32_PRED_CMD: if (!msr->host_initiated && @@ -4278,7 +4281,7 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB); if (is_guest_mode(vcpu)) break; - set_msr_interception(svm->msrpm, MSR_IA32_PRED_CMD, 0, 1); + set_msr_interception(svm, svm->msrpm, MSR_IA32_PRED_CMD, 0, 1); break; case MSR_AMD64_VIRT_SPEC_CTRL: if (!msr->host_initiated && @@ -7058,6 +7061,41 @@ static int nested_enable_evmcs(struct kvm_vcpu *vcpu, return -ENODEV; } +static void svm_msr_intercept(struct kvm_vcpu *vcpu, unsigned int msr, + bool enable) +{ + struct vcpu_svm *svm = to_svm(vcpu); + u32 *msrpm = svm->msrpm; + + /* + * The below code enable or disable the msr interception for both + * read and write. The best way will be to get here the current + * bit status for read and send that value as argument. + */ + set_msr_interception(svm, msrpm, msr, enable, enable); +} + +static u64 svm_fault_gla(struct kvm_vcpu *vcpu) +{ + return ~0ull; +} + +static bool svm_nested_pagefault(struct kvm_vcpu *vcpu) +{ + return false; +} + +static bool svm_spt_fault(struct kvm_vcpu *vcpu) +{ + const struct vcpu_svm *svm = to_svm(vcpu); + + return (svm->vmcb->control.exit_code == SVM_EXIT_NPF); +} + +static void svm_set_mtf(struct kvm_vcpu *vcpu, bool enable) +{ +} + static struct kvm_x86_ops svm_x86_ops __ro_after_init = { .cpu_has_kvm_support = has_svm, .disabled_by_bios = is_disabled, @@ -7189,6 +7227,12 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = { .mem_enc_unreg_region = svm_unregister_enc_region, .nested_enable_evmcs = nested_enable_evmcs, + + .set_mtf = svm_set_mtf, + .msr_intercept = svm_msr_intercept, + .fault_gla = svm_fault_gla, + .nested_pagefault = svm_nested_pagefault, + .spt_fault = svm_spt_fault, }; static int __init svm_init(void) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 8d5d984541be..34e4cda9b1ba 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -1308,7 +1308,8 @@ static void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked); static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12, u16 error_code); static void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu); -static __always_inline void vmx_disable_intercept_for_msr(unsigned long *msr_bitmap, +static __always_inline void vmx_disable_intercept_for_msr(struct kvm_vcpu *vcpu, + unsigned long *msr_bitmap, u32 msr, int type); static DEFINE_PER_CPU(struct vmcs *, vmxarea); @@ -4255,7 +4256,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) * in the merging. We update the vmcs01 here for L1 as well * since it will end up touching the MSR anyway now. 
*/ - vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap, + vmx_disable_intercept_for_msr(vcpu, vmx->vmcs01.msr_bitmap, MSR_IA32_SPEC_CTRL, MSR_TYPE_RW); break; @@ -4283,7 +4284,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) * vmcs02.msr_bitmap here since it gets completely overwritten * in the merging. */ - vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap, MSR_IA32_PRED_CMD, + vmx_disable_intercept_for_msr(vcpu, vmx->vmcs01.msr_bitmap, MSR_IA32_PRED_CMD, MSR_TYPE_W); break; case MSR_IA32_ARCH_CAPABILITIES: @@ -5953,7 +5954,8 @@ static void free_vpid(int vpid) spin_unlock(&vmx_vpid_lock); } -static __always_inline void vmx_disable_intercept_for_msr(unsigned long *msr_bitmap, +static __always_inline void vmx_disable_intercept_for_msr(struct kvm_vcpu *vcpu, + unsigned long *msr_bitmap, u32 msr, int type) { int f = sizeof(unsigned long); @@ -6029,13 +6031,14 @@ static __always_inline void vmx_enable_intercept_for_msr(unsigned long *msr_bitm } } -static __always_inline void vmx_set_intercept_for_msr(unsigned long *msr_bitmap, +static __always_inline void vmx_set_intercept_for_msr(struct kvm_vcpu *vcpu, + unsigned long *msr_bitmap, u32 msr, int type, bool value) { if (value) vmx_enable_intercept_for_msr(msr_bitmap, msr, type); else - vmx_disable_intercept_for_msr(msr_bitmap, msr, type); + vmx_disable_intercept_for_msr(vcpu, msr_bitmap, msr, type); } /* @@ -6096,7 +6099,8 @@ static u8 vmx_msr_bitmap_mode(struct kvm_vcpu *vcpu) #define X2APIC_MSR(r) (APIC_BASE_MSR + ((r) >> 4)) -static void vmx_update_msr_bitmap_x2apic(unsigned long *msr_bitmap, +static void vmx_update_msr_bitmap_x2apic(struct kvm_vcpu *vcpu, + unsigned long *msr_bitmap, u8 mode) { int msr; @@ -6112,11 +6116,11 @@ static void vmx_update_msr_bitmap_x2apic(unsigned long *msr_bitmap, * TPR reads and writes can be virtualized even if virtual interrupt * delivery is not in use. */ - vmx_disable_intercept_for_msr(msr_bitmap, X2APIC_MSR(APIC_TASKPRI), MSR_TYPE_RW); + vmx_disable_intercept_for_msr(vcpu, msr_bitmap, X2APIC_MSR(APIC_TASKPRI), MSR_TYPE_RW); if (mode & MSR_BITMAP_MODE_X2APIC_APICV) { vmx_enable_intercept_for_msr(msr_bitmap, X2APIC_MSR(APIC_TMCCT), MSR_TYPE_R); - vmx_disable_intercept_for_msr(msr_bitmap, X2APIC_MSR(APIC_EOI), MSR_TYPE_W); - vmx_disable_intercept_for_msr(msr_bitmap, X2APIC_MSR(APIC_SELF_IPI), MSR_TYPE_W); + vmx_disable_intercept_for_msr(vcpu, msr_bitmap, X2APIC_MSR(APIC_EOI), MSR_TYPE_W); + vmx_disable_intercept_for_msr(vcpu, msr_bitmap, X2APIC_MSR(APIC_SELF_IPI), MSR_TYPE_W); } } } @@ -6132,7 +6136,7 @@ static void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu) return; if (changed & (MSR_BITMAP_MODE_X2APIC | MSR_BITMAP_MODE_X2APIC_APICV)) - vmx_update_msr_bitmap_x2apic(msr_bitmap, mode); + vmx_update_msr_bitmap_x2apic(vcpu, msr_bitmap, mode); vmx->msr_bitmap_mode = mode; } @@ -7713,10 +7717,11 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu) EPT_VIOLATION_EXECUTABLE)) ? PFERR_PRESENT_MASK : 0; - error_code |= (exit_qualification & 0x100) != 0 ? - PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK; + error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) + ? 
PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK; vcpu->arch.exit_qualification = exit_qualification; + vcpu->arch.error_code = error_code; return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0); } @@ -8026,6 +8031,8 @@ static __init int hardware_setup(void) if (enable_shadow_vmcs) init_vmcs_shadow_fields(); + kvm_eptp_switching_supported = cpu_has_vmx_vmfunc(); + kvm_set_posted_intr_wakeup_handler(wakeup_handler); nested_vmx_setup_ctls_msrs(&vmcs_config.nested, enable_apicv); @@ -11557,12 +11564,12 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id) goto free_msrs; msr_bitmap = vmx->vmcs01.msr_bitmap; - vmx_disable_intercept_for_msr(msr_bitmap, MSR_FS_BASE, MSR_TYPE_RW); - vmx_disable_intercept_for_msr(msr_bitmap, MSR_GS_BASE, MSR_TYPE_RW); - vmx_disable_intercept_for_msr(msr_bitmap, MSR_KERNEL_GS_BASE, MSR_TYPE_RW); - vmx_disable_intercept_for_msr(msr_bitmap, MSR_IA32_SYSENTER_CS, MSR_TYPE_RW); - vmx_disable_intercept_for_msr(msr_bitmap, MSR_IA32_SYSENTER_ESP, MSR_TYPE_RW); - vmx_disable_intercept_for_msr(msr_bitmap, MSR_IA32_SYSENTER_EIP, MSR_TYPE_RW); + vmx_disable_intercept_for_msr(NULL, msr_bitmap, MSR_FS_BASE, MSR_TYPE_RW); + vmx_disable_intercept_for_msr(NULL, msr_bitmap, MSR_GS_BASE, MSR_TYPE_RW); + vmx_disable_intercept_for_msr(NULL, msr_bitmap, MSR_KERNEL_GS_BASE, MSR_TYPE_RW); + vmx_disable_intercept_for_msr(NULL, msr_bitmap, MSR_IA32_SYSENTER_CS, MSR_TYPE_RW); + vmx_disable_intercept_for_msr(NULL, msr_bitmap, MSR_IA32_SYSENTER_ESP, MSR_TYPE_RW); + vmx_disable_intercept_for_msr(NULL, msr_bitmap, MSR_IA32_SYSENTER_EIP, MSR_TYPE_RW); vmx->msr_bitmap_mode = 0; vmx->loaded_vmcs = &vmx->vmcs01; @@ -14994,6 +15001,46 @@ static int vmx_set_nested_state(struct kvm_vcpu *vcpu, return 0; } +static void vmx_set_mtf(struct kvm_vcpu *vcpu, bool enable) +{ + if (enable) + vmcs_set_bits(CPU_BASED_VM_EXEC_CONTROL, + CPU_BASED_MONITOR_TRAP_FLAG); + else + vmcs_clear_bits(CPU_BASED_VM_EXEC_CONTROL, + CPU_BASED_MONITOR_TRAP_FLAG); +} + +static void vmx_msr_intercept(struct kvm_vcpu *vcpu, unsigned int msr, + bool enable) +{ + struct vcpu_vmx *vmx = to_vmx(vcpu); + unsigned long *msr_bitmap = vmx->vmcs01.msr_bitmap; + + vmx_set_intercept_for_msr(vcpu, msr_bitmap, msr, MSR_TYPE_W, enable); +} + +static u64 vmx_fault_gla(struct kvm_vcpu *vcpu) +{ + if (vcpu->arch.exit_qualification & EPT_VIOLATION_GLA_VALID) + return vmcs_readl(GUEST_LINEAR_ADDRESS); + return ~0ull; +} + +static bool vmx_nested_pagefault(struct kvm_vcpu *vcpu) +{ + if (vcpu->arch.exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) + return false; + return true; +} + +static bool vmx_spt_fault(struct kvm_vcpu *vcpu) +{ + const struct vcpu_vmx *vmx = to_vmx(vcpu); + + return (vmx->exit_reason == EXIT_REASON_EPT_VIOLATION); +} + static struct kvm_x86_ops vmx_x86_ops __ro_after_init = { .cpu_has_kvm_support = cpu_has_kvm_support, .disabled_by_bios = vmx_disabled_by_bios, @@ -15141,6 +15188,12 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = { .enable_smi_window = enable_smi_window, .nested_enable_evmcs = nested_enable_evmcs, + + .set_mtf = vmx_set_mtf, + .msr_intercept = vmx_msr_intercept, + .fault_gla = vmx_fault_gla, + .nested_pagefault = vmx_nested_pagefault, + .spt_fault = vmx_spt_fault, }; static void vmx_cleanup_l1d_flush(void) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 3f332b947c1b..94dbc270bd84 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -150,6 +150,9 @@ EXPORT_SYMBOL_GPL(enable_vmware_backdoor); static bool __read_mostly force_emulation_prefix = false; 
module_param(force_emulation_prefix, bool, S_IRUGO); +bool __read_mostly kvm_eptp_switching_supported; +EXPORT_SYMBOL_GPL(kvm_eptp_switching_supported); + #define KVM_NR_SHARED_MSRS 16 struct kvm_shared_msrs_global { @@ -3714,8 +3717,8 @@ static void load_xsave(struct kvm_vcpu *vcpu, u8 *src) } } -static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu, - struct kvm_xsave *guest_xsave) +void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu, + struct kvm_xsave *guest_xsave) { if (boot_cpu_has(X86_FEATURE_XSAVE)) { memset(guest_xsave, 0, sizeof(struct kvm_xsave)); @@ -6349,7 +6352,9 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, trace_kvm_emulate_insn_start(vcpu); ++vcpu->stat.insn_emulation; - if (r != EMULATION_OK) { + if (r == EMULATION_USER_EXIT) + return EMULATE_DONE; + if (r != EMULATION_OK) { if (emulation_type & EMULTYPE_TRAP_UD) return EMULATE_FAIL; if (reexecute_instruction(vcpu, cr2, write_fault_to_spt, @@ -6390,6 +6395,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, r = x86_emulate_insn(ctxt); + if (r == EMULATION_USER_EXIT) + return EMULATE_DONE; if (r == EMULATION_INTERCEPTED) return EMULATE_DONE; @@ -8169,6 +8176,11 @@ int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs) return 0; } +void kvm_arch_vcpu_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs) +{ + __get_regs(vcpu, regs); +} + static void __set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs) { vcpu->arch.emulate_regs_need_sync_from_vcpu = true; @@ -8209,6 +8221,39 @@ int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs) return 0; } +/* + * Similar to __set_regs() but it does not reset the exceptions + */ +void kvm_arch_vcpu_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs) +{ + vcpu->arch.emulate_regs_need_sync_from_vcpu = true; + vcpu->arch.emulate_regs_need_sync_to_vcpu = false; + + kvm_register_write(vcpu, VCPU_REGS_RAX, regs->rax); + kvm_register_write(vcpu, VCPU_REGS_RBX, regs->rbx); + kvm_register_write(vcpu, VCPU_REGS_RCX, regs->rcx); + kvm_register_write(vcpu, VCPU_REGS_RDX, regs->rdx); + kvm_register_write(vcpu, VCPU_REGS_RSI, regs->rsi); + kvm_register_write(vcpu, VCPU_REGS_RDI, regs->rdi); + kvm_register_write(vcpu, VCPU_REGS_RSP, regs->rsp); + kvm_register_write(vcpu, VCPU_REGS_RBP, regs->rbp); +#ifdef CONFIG_X86_64 + kvm_register_write(vcpu, VCPU_REGS_R8, regs->r8); + kvm_register_write(vcpu, VCPU_REGS_R9, regs->r9); + kvm_register_write(vcpu, VCPU_REGS_R10, regs->r10); + kvm_register_write(vcpu, VCPU_REGS_R11, regs->r11); + kvm_register_write(vcpu, VCPU_REGS_R12, regs->r12); + kvm_register_write(vcpu, VCPU_REGS_R13, regs->r13); + kvm_register_write(vcpu, VCPU_REGS_R14, regs->r14); + kvm_register_write(vcpu, VCPU_REGS_R15, regs->r15); +#endif + + kvm_rip_write(vcpu, regs->rip); + kvm_set_rflags(vcpu, regs->rflags | X86_EFLAGS_FIXED); + + kvm_make_request(KVM_REQ_EVENT, vcpu); +} + void kvm_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l) { struct kvm_segment cs; @@ -8264,6 +8309,11 @@ int kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu, return 0; } +void kvm_arch_vcpu_get_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs) +{ + __get_sregs(vcpu, sregs); +} + int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu, struct kvm_mp_state *mp_state) { @@ -8457,16 +8507,15 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu, return ret; } -int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu, - struct kvm_guest_debug *dbg) +int kvm_arch_vcpu_set_guest_debug(struct kvm_vcpu *vcpu, + struct 
kvm_guest_debug *dbg) { unsigned long rflags; - int i, r; - - vcpu_load(vcpu); + int i; + int ret = 0; if (dbg->control & (KVM_GUESTDBG_INJECT_DB | KVM_GUESTDBG_INJECT_BP)) { - r = -EBUSY; + ret = -EBUSY; if (vcpu->arch.exception.pending) goto out; if (dbg->control & KVM_GUESTDBG_INJECT_DB) @@ -8507,11 +8556,19 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu, kvm_x86_ops->update_bp_intercept(vcpu); - r = 0; - out: + return ret; +} + +int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu, + struct kvm_guest_debug *dbg) +{ + int ret; + + vcpu_load(vcpu); + ret = kvm_arch_vcpu_set_guest_debug(vcpu, dbg); vcpu_put(vcpu); - return r; + return ret; } /* @@ -9762,6 +9819,36 @@ bool kvm_vector_hashing_enabled(void) } EXPORT_SYMBOL_GPL(kvm_vector_hashing_enabled); +void kvm_arch_msr_intercept(struct kvm_vcpu *vcpu, unsigned int msr, + bool enable) +{ + kvm_x86_ops->msr_intercept(vcpu, msr, enable); +} +EXPORT_SYMBOL_GPL(kvm_arch_msr_intercept); + +void kvm_arch_queue_bp(struct kvm_vcpu *vcpu) +{ + kvm_queue_exception(vcpu, BP_VECTOR); +} + +void kvm_set_mtf(struct kvm_vcpu *vcpu, bool enable) +{ + kvm_x86_ops->set_mtf(vcpu, enable); +} +EXPORT_SYMBOL(kvm_set_mtf); + +void kvm_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask) +{ + kvm_x86_ops->set_interrupt_shadow(vcpu, mask); +} +EXPORT_SYMBOL(kvm_set_interrupt_shadow); + +bool kvm_spt_fault(struct kvm_vcpu *vcpu) +{ + return kvm_x86_ops->spt_fault(vcpu); +} +EXPORT_SYMBOL(kvm_spt_fault); + EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit); EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio); EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq); diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index c926698040e0..1f3fa3c50133 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -775,9 +775,13 @@ int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu, struct kvm_translation *tr); int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs); +void kvm_arch_vcpu_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs); int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs); +void kvm_arch_vcpu_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs); int kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs); +void kvm_arch_vcpu_get_sregs(struct kvm_vcpu *vcpu, + struct kvm_sregs *sregs); int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs); int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu, @@ -786,7 +790,11 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu, struct kvm_mp_state *mp_state); int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu, struct kvm_guest_debug *dbg); +int kvm_arch_vcpu_set_guest_debug(struct kvm_vcpu *vcpu, + struct kvm_guest_debug *dbg); int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run); +void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu, + struct kvm_xsave *guest_xsave); int kvm_arch_init(void *opaque); void kvm_arch_exit(void); @@ -1299,4 +1307,6 @@ static inline int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu) } #endif /* CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE */ +void kvm_arch_queue_bp(struct kvm_vcpu *vcpu); + #endif diff --git a/kernel/signal.c b/kernel/signal.c index 9a32bc2088c9..11a9766b6df1 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1372,6 +1372,7 @@ int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid) */ } } +EXPORT_SYMBOL(kill_pid_info); static int kill_proc_info(int sig, struct kernel_siginfo 
*info, pid_t pid) { From patchwork Thu Dec 20 18:28:40 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Adalbert_Laz=C4=83r?= X-Patchwork-Id: 10739299 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 7662E13B5 for ; Thu, 20 Dec 2018 18:36:17 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 658BE28E58 for ; Thu, 20 Dec 2018 18:36:17 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 59E1B28F39; Thu, 20 Dec 2018 18:36:17 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 3E88728E58 for ; Thu, 20 Dec 2018 18:36:16 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2389318AbeLTSgJ (ORCPT ); Thu, 20 Dec 2018 13:36:09 -0500 Received: from mx01.bbu.dsd.mx.bitdefender.com ([91.199.104.161]:44344 "EHLO mx01.bbu.dsd.mx.bitdefender.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2389319AbeLTSgH (ORCPT ); Thu, 20 Dec 2018 13:36:07 -0500 Received: from smtp.bitdefender.com (smtp02.buh.bitdefender.net [10.17.80.76]) by mx01.bbu.dsd.mx.bitdefender.com (Postfix) with ESMTPS id B10A9305FFB8; Thu, 20 Dec 2018 20:29:37 +0200 (EET) Received: from host.bbu.bitdefender.biz (unknown [10.10.193.111]) by smtp.bitdefender.com (Postfix) with ESMTPSA id A787130228B3; Thu, 20 Dec 2018 20:29:37 +0200 (EET) From: =?utf-8?q?Adalbert_Laz=C4=83r?= To: kvm@vger.kernel.org Cc: Paolo Bonzini , =?utf-8?b?UmFkaW0gS3LEjW3DocWZ?= , =?utf-8?q?Mihai_Don?= =?utf-8?q?=C8=9Bu?= , =?utf-8?q?Adalbert_Laz=C4=83r?= , =?utf-8?b?Tmlj?= =?utf-8?b?dciZb3IgQ8OuyJt1?= Subject: [RFC PATCH v5 10/20] kvm: x86: add tracepoints for interrupt and exception injections Date: Thu, 20 Dec 2018 20:28:40 +0200 Message-Id: <20181220182850.4579-11-alazar@bitdefender.com> In-Reply-To: <20181220182850.4579-1-alazar@bitdefender.com> References: <20181220182850.4579-1-alazar@bitdefender.com> MIME-Version: 1.0 Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Nicușor Cîțu This patch introduces additional tracepoints that are meant to help in following the flow of interrupts and exceptions queued to a guest VM. At the same time the kvm_exit tracepoint is enhanced with the vCPU ID. One scenario in which these help is debugging lost interrupts due to a buggy VMEXIT handler. 
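As a rough illustration (not part of this patch), a minimal userspace consumer of the new tracepoints could look like the sketch below. It assumes tracefs is mounted at the usual /sys/kernel/debug/tracing location, enables kvm_exit plus the kvm_inj_*/kvm_cancel_inj events added here, and streams trace_pipe so injections can be matched against exits by vCPU id:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;
	if (write(fd, val, strlen(val)) < 0) {
		close(fd);
		return -1;
	}
	return close(fd);
}

int main(void)
{
	/* event names come from this patch; the mount point is assumed */
	static const char *events[] = {
		"/sys/kernel/debug/tracing/events/kvm/kvm_exit/enable",
		"/sys/kernel/debug/tracing/events/kvm/kvm_inj_interrupt/enable",
		"/sys/kernel/debug/tracing/events/kvm/kvm_inj_nmi/enable",
		"/sys/kernel/debug/tracing/events/kvm/kvm_inj_exception/enable",
		"/sys/kernel/debug/tracing/events/kvm/kvm_cancel_inj/enable",
	};
	char buf[4096];
	ssize_t n;
	unsigned int i;
	int fd;

	for (i = 0; i < sizeof(events) / sizeof(events[0]); i++)
		if (write_str(events[i], "1"))
			perror(events[i]);

	fd = open("/sys/kernel/debug/tracing/trace_pipe", O_RDONLY);
	if (fd < 0) {
		perror("trace_pipe");
		return 1;
	}
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		fwrite(buf, 1, n, stdout);
	close(fd);
	return 0;
}
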
Signed-off-by: Nicușor Cîțu --- arch/x86/kvm/svm.c | 9 +++- arch/x86/kvm/trace.h | 118 ++++++++++++++++++++++++++++++++++--------- arch/x86/kvm/vmx.c | 8 ++- arch/x86/kvm/x86.c | 12 +++-- 4 files changed, 116 insertions(+), 31 deletions(-) diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index b5714b8a2f51..f355860d845c 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -801,6 +801,8 @@ static void svm_queue_exception(struct kvm_vcpu *vcpu) bool reinject = vcpu->arch.exception.injected; u32 error_code = vcpu->arch.exception.error_code; + trace_kvm_inj_exception(vcpu); + /* * If we are within a nested VM we'd better #VMEXIT and let the guest * handle the exception @@ -5031,6 +5033,8 @@ static void svm_inject_nmi(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); + trace_kvm_inj_nmi(vcpu); + svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_NMI; vcpu->arch.hflags |= HF_NMI_MASK; set_intercept(svm, INTERCEPT_IRET); @@ -5056,7 +5060,8 @@ static void svm_set_irq(struct kvm_vcpu *vcpu) BUG_ON(!(gif_set(svm))); - trace_kvm_inj_virq(vcpu->arch.interrupt.nr); + trace_kvm_inj_interrupt(vcpu); + ++vcpu->stat.irq_injections; svm->vmcb->control.event_inj = vcpu->arch.interrupt.nr | @@ -5560,6 +5565,8 @@ static void svm_cancel_injection(struct kvm_vcpu *vcpu) struct vcpu_svm *svm = to_svm(vcpu); struct vmcb_control_area *control = &svm->vmcb->control; + trace_kvm_cancel_inj(vcpu); + control->exit_int_info = control->event_inj; control->exit_int_info_err = control->event_inj_err; control->event_inj = 0; diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h index 0659465a745c..36d17765b5d3 100644 --- a/arch/x86/kvm/trace.h +++ b/arch/x86/kvm/trace.h @@ -227,6 +227,7 @@ TRACE_EVENT(kvm_exit, TP_ARGS(exit_reason, vcpu, isa), TP_STRUCT__entry( + __field( unsigned int, vcpu_id ) __field( unsigned int, exit_reason ) __field( unsigned long, guest_rip ) __field( u32, isa ) @@ -235,6 +236,7 @@ TRACE_EVENT(kvm_exit, ), TP_fast_assign( + __entry->vcpu_id = vcpu->vcpu_id; __entry->exit_reason = exit_reason; __entry->guest_rip = kvm_rip_read(vcpu); __entry->isa = isa; @@ -242,7 +244,8 @@ TRACE_EVENT(kvm_exit, &__entry->info2); ), - TP_printk("reason %s rip 0x%lx info %llx %llx", + TP_printk("vcpu %u reason %s rip 0x%lx info %llx %llx", + __entry->vcpu_id, (__entry->isa == KVM_ISA_VMX) ? 
__print_symbolic(__entry->exit_reason, VMX_EXIT_REASONS) : __print_symbolic(__entry->exit_reason, SVM_EXIT_REASONS), @@ -252,19 +255,38 @@ TRACE_EVENT(kvm_exit, /* * Tracepoint for kvm interrupt injection: */ -TRACE_EVENT(kvm_inj_virq, - TP_PROTO(unsigned int irq), - TP_ARGS(irq), - +TRACE_EVENT(kvm_inj_interrupt, + TP_PROTO(struct kvm_vcpu *vcpu), + TP_ARGS(vcpu), TP_STRUCT__entry( - __field( unsigned int, irq ) + __field(__u32, vcpu_id) + __field(__u32, nr) ), - TP_fast_assign( - __entry->irq = irq; + __entry->vcpu_id = vcpu->vcpu_id; + __entry->nr = vcpu->arch.interrupt.nr; ), + TP_printk("vcpu %u irq %u", + __entry->vcpu_id, + __entry->nr + ) +); - TP_printk("irq %u", __entry->irq) +/* + * Tracepoint for kvm nmi injection: + */ +TRACE_EVENT(kvm_inj_nmi, + TP_PROTO(struct kvm_vcpu *vcpu), + TP_ARGS(vcpu), + TP_STRUCT__entry( + __field(__u32, vcpu_id) + ), + TP_fast_assign( + __entry->vcpu_id = vcpu->vcpu_id; + ), + TP_printk("vcpu %u", + __entry->vcpu_id + ) ); #define EXS(x) { x##_VECTOR, "#" #x } @@ -275,28 +297,76 @@ TRACE_EVENT(kvm_inj_virq, EXS(MF), EXS(AC), EXS(MC) /* - * Tracepoint for kvm interrupt injection: + * Tracepoint for kvm exception injection: */ -TRACE_EVENT(kvm_inj_exception, - TP_PROTO(unsigned exception, bool has_error, unsigned error_code), - TP_ARGS(exception, has_error, error_code), - +TRACE_EVENT( + kvm_inj_exception, + TP_PROTO(struct kvm_vcpu *vcpu), + TP_ARGS(vcpu), TP_STRUCT__entry( - __field( u8, exception ) - __field( u8, has_error ) - __field( u32, error_code ) + __field(__u32, vcpu_id) + __field(__u8, nr) + __field(__u64, address) + __field(__u16, error_code) + __field(bool, has_error_code) ), + TP_fast_assign( + __entry->vcpu_id = vcpu->vcpu_id; + __entry->nr = vcpu->arch.exception.nr; + __entry->address = vcpu->arch.exception.nested_apf ? + vcpu->arch.apf.nested_apf_token : vcpu->arch.cr2; + __entry->error_code = vcpu->arch.exception.error_code; + __entry->has_error_code = vcpu->arch.exception.has_error_code; + ), + TP_printk("vcpu %u %s address %llx error %x", + __entry->vcpu_id, + __print_symbolic(__entry->nr, kvm_trace_sym_exc), + __entry->nr == PF_VECTOR ? __entry->address : 0, + __entry->has_error_code ? __entry->error_code : 0 + ) +); +TRACE_EVENT( + kvm_inj_emul_exception, + TP_PROTO(struct kvm_vcpu *vcpu, struct x86_exception *fault), + TP_ARGS(vcpu, fault), + TP_STRUCT__entry( + __field(__u32, vcpu_id) + __field(__u8, vector) + __field(__u64, address) + __field(__u16, error_code) + __field(bool, error_code_valid) + ), TP_fast_assign( - __entry->exception = exception; - __entry->has_error = has_error; - __entry->error_code = error_code; + __entry->vcpu_id = vcpu->vcpu_id; + __entry->vector = fault->vector; + __entry->address = fault->address; + __entry->error_code = fault->error_code; + __entry->error_code_valid = fault->error_code_valid; ), + TP_printk("vcpu %u %s address %llx error %x", + __entry->vcpu_id, + __print_symbolic(__entry->vector, kvm_trace_sym_exc), + __entry->vector == PF_VECTOR ? __entry->address : 0, + __entry->error_code_valid ? __entry->error_code : 0 + ) +); - TP_printk("%s (0x%x)", - __print_symbolic(__entry->exception, kvm_trace_sym_exc), - /* FIXME: don't print error_code if not present */ - __entry->has_error ? 
__entry->error_code : 0) +/* + * Tracepoint for kvm cancel injection: + */ +TRACE_EVENT(kvm_cancel_inj, + TP_PROTO(struct kvm_vcpu *vcpu), + TP_ARGS(vcpu), + TP_STRUCT__entry( + __field(__u32, vcpu_id) + ), + TP_fast_assign( + __entry->vcpu_id = vcpu->vcpu_id; + ), + TP_printk("vcpu %u", + __entry->vcpu_id + ) ); /* diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 34e4cda9b1ba..fc0219336edb 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -3358,6 +3358,8 @@ static void vmx_queue_exception(struct kvm_vcpu *vcpu) u32 error_code = vcpu->arch.exception.error_code; u32 intr_info = nr | INTR_INFO_VALID_MASK; + trace_kvm_inj_exception(vcpu); + kvm_deliver_exception_payload(vcpu); if (has_error_code) { @@ -6856,7 +6858,7 @@ static void vmx_inject_irq(struct kvm_vcpu *vcpu) uint32_t intr; int irq = vcpu->arch.interrupt.nr; - trace_kvm_inj_virq(irq); + trace_kvm_inj_interrupt(vcpu); ++vcpu->stat.irq_injections; if (vmx->rmode.vm86_active) { @@ -6883,6 +6885,8 @@ static void vmx_inject_nmi(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); + trace_kvm_inj_nmi(vcpu); + if (!enable_vnmi) { /* * Tracking the NMI-blocked state in software is built upon @@ -11102,6 +11106,8 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx) static void vmx_cancel_injection(struct kvm_vcpu *vcpu) { + trace_kvm_cancel_inj(vcpu); + __vmx_complete_interrupts(vcpu, vmcs_read32(VM_ENTRY_INTR_INFO_FIELD), VM_ENTRY_INSTRUCTION_LEN, diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 94dbc270bd84..e413f81d6c46 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5960,6 +5960,9 @@ static void toggle_interruptibility(struct kvm_vcpu *vcpu, u32 mask) static bool inject_emulated_exception(struct kvm_vcpu *vcpu) { struct x86_emulate_ctxt *ctxt = &vcpu->arch.emulate_ctxt; + + trace_kvm_inj_emul_exception(vcpu, &ctxt->exception); + if (ctxt->exception.vector == PF_VECTOR) return kvm_propagate_fault(vcpu, &ctxt->exception); @@ -7162,10 +7165,6 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win) /* try to inject new event if pending */ if (vcpu->arch.exception.pending) { - trace_kvm_inj_exception(vcpu->arch.exception.nr, - vcpu->arch.exception.has_error_code, - vcpu->arch.exception.error_code); - WARN_ON_ONCE(vcpu->arch.exception.injected); vcpu->arch.exception.pending = false; vcpu->arch.exception.injected = true; @@ -9851,7 +9850,10 @@ EXPORT_SYMBOL(kvm_spt_fault); EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit); EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio); -EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq); +EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_interrupt); +EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_nmi); +EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_exception); +EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_cancel_inj); EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault); EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_msr); EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_cr); From patchwork Thu Dec 20 18:28:42 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Adalbert_Laz=C4=83r?= X-Patchwork-Id: 10739291 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id A112F6C2 for ; Thu, 20 Dec 2018 18:36:11 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 9050528E58 for ; Thu, 20 Dec 2018 18:36:11 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from 
userid 486) id 847F428F39; Thu, 20 Dec 2018 18:36:11 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 0D3C728E58 for ; Thu, 20 Dec 2018 18:36:11 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2389340AbeLTSgK (ORCPT ); Thu, 20 Dec 2018 13:36:10 -0500 Received: from mx01.bbu.dsd.mx.bitdefender.com ([91.199.104.161]:44320 "EHLO mx01.bbu.dsd.mx.bitdefender.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2389315AbeLTSgH (ORCPT ); Thu, 20 Dec 2018 13:36:07 -0500 Received: from smtp.bitdefender.com (smtp02.buh.bitdefender.net [10.17.80.76]) by mx01.bbu.dsd.mx.bitdefender.com (Postfix) with ESMTPS id DA402305FFBA; Thu, 20 Dec 2018 20:29:37 +0200 (EET) Received: from host.bbu.bitdefender.biz (unknown [10.10.193.111]) by smtp.bitdefender.com (Postfix) with ESMTPSA id D111A306E47C; Thu, 20 Dec 2018 20:29:37 +0200 (EET) From: =?utf-8?q?Adalbert_Laz=C4=83r?= To: kvm@vger.kernel.org Cc: Paolo Bonzini , =?utf-8?b?UmFkaW0gS3LEjW3DocWZ?= , =?utf-8?q?Mihai_Don?= =?utf-8?q?=C8=9Bu?= , =?utf-8?q?Adalbert_Laz=C4=83r?= , =?utf-8?b?Tmlj?= =?utf-8?b?dciZb3IgQ8OuyJt1?= , =?utf-8?q?Mircea_C?= =?utf-8?q?=C3=AErjaliu?= Subject: [RFC PATCH v5 12/20] kvm: x86: hook in the VM introspection subsystem Date: Thu, 20 Dec 2018 20:28:42 +0200 Message-Id: <20181220182850.4579-13-alazar@bitdefender.com> In-Reply-To: <20181220182850.4579-1-alazar@bitdefender.com> References: <20181220182850.4579-1-alazar@bitdefender.com> MIME-Version: 1.0 Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Mihai DONTU Notify KVMI on vCPU create/destroy, VM destroy events, the new ioctl-s (KVM_INTROSPECTION_HOOK, KVM_INTROSPECTION_UNHOOK) and when a KVMI command is received (KVM_REQ_INTROSPECTION). Change the x86 emulator and mmu behavior when KVMI is present (TODO: do it when active) or when dealing with a page of interest for the introspection tool. Also, the EPT AD bits feature is disabled by this patch. 
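For context only (this is not part of the patch), the sketch below shows the shape of the userspace side: the VMM checks KVM_CAP_INTROSPECTION on the VM fd and then hands an already-connected socket to KVM_INTROSPECTION_HOOK. It has to be built against the patched uapi headers, and the struct kvm_introspection field names used here (a socket fd plus a guest UUID) are assumptions made for the example:

#include <linux/kvm.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>

static int hook_introspection(int vm_fd, int kvmi_sock,
			      const unsigned char uuid[16])
{
	struct kvm_introspection hook;

	if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_INTROSPECTION) <= 0) {
		fprintf(stderr, "KVMI not supported by this kernel\n");
		return -1;
	}

	memset(&hook, 0, sizeof(hook));
	hook.fd = kvmi_sock;				/* assumed field name */
	memcpy(hook.uuid, uuid, sizeof(hook.uuid));	/* assumed field name */

	return ioctl(vm_fd, KVM_INTROSPECTION_HOOK, &hook);
}

Unhooking is the mirror operation; as the kvm_vm_ioctl() hunk in this patch shows, KVM_INTROSPECTION_UNHOOK takes no argument.
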
Signed-off-by: Adalbert Lazăr Signed-off-by: Mihai Donțu Signed-off-by: Nicușor Cîțu Signed-off-by: Mircea Cîrjaliu --- arch/x86/kvm/mmu.c | 13 +++++++++++-- arch/x86/kvm/svm.c | 2 ++ arch/x86/kvm/vmx.c | 4 +++- arch/x86/kvm/x86.c | 33 ++++++++++++++++++++++++++++----- virt/kvm/kvm_main.c | 34 +++++++++++++++++++++++++++++++++- 5 files changed, 77 insertions(+), 9 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index ed4b3c593bf5..85a980e5cc27 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -40,6 +40,7 @@ #include #include #include +#include #include #include @@ -2458,6 +2459,9 @@ static void clear_sp_write_flooding_count(u64 *spte) static unsigned int kvm_mmu_page_track_acc(struct kvm_vcpu *vcpu, gfn_t gfn, unsigned int acc) { + if (!kvmi_tracked_gfn(vcpu, gfn)) + return acc; + if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREREAD)) acc &= ~ACC_USER_MASK; if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREWRITE) || @@ -5433,8 +5437,13 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u64 error_code, */ if (vcpu->arch.mmu->direct_map && (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE) { - kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2)); - return 1; + if (kvmi_tracked_gfn(vcpu, gpa_to_gfn(cr2))) { + if (kvmi_update_ad_flags(vcpu)) + return 1; + } else { + kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2)); + return 1; + } } /* diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index f355860d845c..766578401d1c 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -18,6 +18,7 @@ #define pr_fmt(fmt) "SVM: " fmt #include +#include #include "irq.h" #include "mmu.h" @@ -50,6 +51,7 @@ #include #include #include +#include #include #include "trace.h" diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index fc0219336edb..e9de89bd7f1c 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -36,6 +36,7 @@ #include #include #include +#include #include "kvm_cache_regs.h" #include "x86.h" @@ -55,6 +56,7 @@ #include #include #include +#include #include "trace.h" #include "pmu.h" @@ -7939,7 +7941,7 @@ static __init int hardware_setup(void) !cpu_has_vmx_invept_global()) enable_ept = 0; - if (!cpu_has_vmx_ept_ad_bits() || !enable_ept) + if (!cpu_has_vmx_ept_ad_bits() || !enable_ept || kvmi_is_present()) enable_ept_ad_bits = 0; if (!cpu_has_vmx_unrestricted_guest() || !enable_ept) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index e413f81d6c46..f073fe596244 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -20,6 +20,8 @@ */ #include +#include +#include #include "irq.h" #include "mmu.h" #include "i8254.h" @@ -3076,6 +3078,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) r = kvm_x86_ops->get_nested_state ? kvm_x86_ops->get_nested_state(NULL, 0, 0) : 0; break; +#ifdef CONFIG_KVM_INTROSPECTION + case KVM_CAP_INTROSPECTION: + r = KVMI_VERSION; + break; +#endif default: break; } @@ -5314,7 +5321,7 @@ static int emulator_read_write_onepage(unsigned long addr, void *val, * operation using rep will only have the initial GPA from the NPF * occurred. 
*/ - if (vcpu->arch.gpa_available && + if (vcpu->arch.gpa_available && !kvmi_is_present() && emulator_can_use_gpa(ctxt) && (addr & ~PAGE_MASK) == (vcpu->arch.gpa_val & ~PAGE_MASK)) { gpa = vcpu->arch.gpa_val; @@ -6096,7 +6103,8 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t cr2, indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages; spin_unlock(&vcpu->kvm->mmu_lock); - if (indirect_shadow_pages) + if (indirect_shadow_pages + && !kvmi_tracked_gfn(vcpu, gpa_to_gfn(gpa))) kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)); return true; @@ -6107,7 +6115,8 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t cr2, * and it failed try to unshadow page and re-enter the * guest to let CPU execute the instruction. */ - kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)); + if (!kvmi_tracked_gfn(vcpu, gpa_to_gfn(gpa))) + kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)); /* * If the access faults on its page table, it can not @@ -6159,6 +6168,9 @@ static bool retry_instruction(struct x86_emulate_ctxt *ctxt, if (!vcpu->arch.mmu->direct_map) gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2, NULL); + if (kvmi_tracked_gfn(vcpu, gpa_to_gfn(gpa))) + return false; + kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)); return true; @@ -6324,6 +6336,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, vcpu->arch.l1tf_flush_l1d = true; + kvmi_init_emulate(vcpu); + /* * Clear write_fault_to_shadow_pgtable here to ensure it is * never reused. @@ -6360,6 +6374,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, if (r != EMULATION_OK) { if (emulation_type & EMULTYPE_TRAP_UD) return EMULATE_FAIL; + if (!kvmi_track_emul_unimplemented(vcpu, cr2)) + return EMULATE_DONE; if (reexecute_instruction(vcpu, cr2, write_fault_to_spt, emulation_type)) return EMULATE_DONE; @@ -6429,9 +6445,10 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, writeback = false; r = EMULATE_USER_EXIT; vcpu->arch.complete_userspace_io = complete_emulated_mmio; - } else if (r == EMULATION_RESTART) + } else if (r == EMULATION_RESTART) { + kvmi_activate_rep_complete(vcpu); goto restart; - else + } else r = EMULATE_DONE; if (writeback) { @@ -7677,6 +7694,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) */ if (kvm_check_request(KVM_REQ_HV_STIMER, vcpu)) kvm_hv_process_stimers(vcpu); + + if (kvm_check_request(KVM_REQ_INTROSPECTION, vcpu)) + kvmi_handle_requests(vcpu); } if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) { @@ -9492,6 +9512,9 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu) (vcpu->arch.smi_pending && !is_smm(vcpu))) return true; + if (kvm_test_request(KVM_REQ_INTROSPECTION, vcpu)) + return true; + if (kvm_arch_interrupt_allowed(vcpu) && (kvm_cpu_has_interrupt(vcpu) || kvm_guest_apic_has_interrupt(vcpu))) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 2679e476b6c3..8db2bf9ca0c2 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -51,6 +51,7 @@ #include #include #include +#include #include #include @@ -315,6 +316,11 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id) r = kvm_arch_vcpu_init(vcpu); if (r < 0) goto fail_free_run; + + r = kvmi_vcpu_init(vcpu); + if (r < 0) + goto fail_free_run; + return 0; fail_free_run: @@ -332,6 +338,7 @@ void kvm_vcpu_uninit(struct kvm_vcpu *vcpu) * descriptors are already gone. 
*/ put_pid(rcu_dereference_protected(vcpu->pid, 1)); + kvmi_vcpu_uninit(vcpu); kvm_arch_vcpu_uninit(vcpu); free_page((unsigned long)vcpu->run); } @@ -681,6 +688,8 @@ static struct kvm *kvm_create_vm(unsigned long type) if (r) goto out_err; + kvmi_create_vm(kvm); + spin_lock(&kvm_lock); list_add(&kvm->vm_list, &vm_list); spin_unlock(&kvm_lock); @@ -726,6 +735,7 @@ static void kvm_destroy_vm(struct kvm *kvm) int i; struct mm_struct *mm = kvm->mm; + kvmi_destroy_vm(kvm); kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm); kvm_destroy_vm_debugfs(kvm); kvm_arch_sync_events(kvm); @@ -1476,7 +1486,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, * Whoever called remap_pfn_range is also going to call e.g. * unmap_mapping_range before the underlying pages are freed, * causing a call to our MMU notifier. - */ + */ kvm_get_pfn(pfn); *p_pfn = pfn; @@ -3133,6 +3143,24 @@ static long kvm_vm_ioctl(struct file *filp, case KVM_CHECK_EXTENSION: r = kvm_vm_ioctl_check_extension_generic(kvm, arg); break; +#ifdef CONFIG_KVM_INTROSPECTION + case KVM_INTROSPECTION_HOOK: { + struct kvm_introspection i; + + r = -EFAULT; + if (copy_from_user(&i, argp, sizeof(i))) + goto out; + + r = kvmi_hook(kvm, &i); + break; + } + case KVM_INTROSPECTION_UNHOOK: + r = -EFAULT; + if (kvmi_notify_unhook(kvm)) + goto out; + r = 0; + break; +#endif /* CONFIG_KVM_INTROSPECTION */ default: r = kvm_arch_vm_ioctl(filp, ioctl, arg); } @@ -4075,6 +4103,9 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align, r = kvm_vfio_ops_init(); WARN_ON(r); + r = kvmi_init(); + WARN_ON(r); + return 0; out_unreg: @@ -4100,6 +4131,7 @@ EXPORT_SYMBOL_GPL(kvm_init); void kvm_exit(void) { + kvmi_uninit(); debugfs_remove_recursive(kvm_debugfs_dir); misc_deregister(&kvm_dev); kmem_cache_destroy(kvm_vcpu_cache); From patchwork Thu Dec 20 18:28:43 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Adalbert_Laz=C4=83r?= X-Patchwork-Id: 10739283 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 6DF136C2 for ; Thu, 20 Dec 2018 18:36:05 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 5D81D28E58 for ; Thu, 20 Dec 2018 18:36:05 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 514C528F39; Thu, 20 Dec 2018 18:36:05 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 0B98428E58 for ; Thu, 20 Dec 2018 18:36:05 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2389323AbeLTSgE (ORCPT ); Thu, 20 Dec 2018 13:36:04 -0500 Received: from mx01.bbu.dsd.mx.bitdefender.com ([91.199.104.161]:44308 "EHLO mx01.bbu.dsd.mx.bitdefender.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730844AbeLTSgB (ORCPT ); Thu, 20 Dec 2018 13:36:01 -0500 Received: from smtp.bitdefender.com (smtp02.buh.bitdefender.net [10.17.80.76]) by mx01.bbu.dsd.mx.bitdefender.com (Postfix) with ESMTPS id E15EB305FFBB; Thu, 20 Dec 2018 20:29:37 +0200 (EET) Received: from host.bbu.bitdefender.biz (unknown 
[10.10.193.111]) by smtp.bitdefender.com (Postfix) with ESMTPSA id D62CF30513AD; Thu, 20 Dec 2018 20:29:37 +0200 (EET) From: =?utf-8?q?Adalbert_Laz=C4=83r?= To: kvm@vger.kernel.org Cc: Paolo Bonzini , =?utf-8?b?UmFkaW0gS3LEjW3DocWZ?= , =?utf-8?q?Mihai_Don?= =?utf-8?q?=C8=9Bu?= , =?utf-8?q?Adalbert_Laz=C4=83r?= , =?utf-8?b?Tmlj?= =?utf-8?b?dciZb3IgQ8OuyJt1?= Subject: [RFC PATCH v5 13/20] kvm: x86: hook in kvmi_msr_event() Date: Thu, 20 Dec 2018 20:28:43 +0200 Message-Id: <20181220182850.4579-14-alazar@bitdefender.com> In-Reply-To: <20181220182850.4579-1-alazar@bitdefender.com> References: <20181220182850.4579-1-alazar@bitdefender.com> MIME-Version: 1.0 Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Inform the guest introspection tool that an MSR is going to be changed. The kvmi_msr_event() function will check a bitmap of MSR-s of interest (configured via a KVMI_CONTROL_EVENTS(KVMI_MSR_CONTROL) request) and, if the new value differs from the previous one, it will generate a notification. The introspection tool can respond by allowing the guest to continue with normal execution or by discarding the change. This is meant to prevent malicious changes to MSR-s such as MSR_IA32_SYSENTER_EIP. The patch ensures that write access tracking of these MSR-s of interest is not disabled. Signed-off-by: Mihai Donțu Signed-off-by: Nicușor Cîțu --- arch/x86/kvm/svm.c | 5 +++++ arch/x86/kvm/vmx.c | 5 +++++ arch/x86/kvm/x86.c | 3 +++ 3 files changed, 13 insertions(+) diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 766578401d1c..d6c7680e224b 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1066,6 +1066,11 @@ static void set_msr_interception(struct vcpu_svm *svm, unsigned long tmp; u32 offset; +#ifdef CONFIG_KVM_INTROSPECTION + if (!write && kvmi_monitored_msr(&svm->vcpu, msr)) + return; +#endif /* CONFIG_KVM_INTROSPECTION */ + /* * If this warning triggers extend the direct_access_msrs list at the * beginning of the file diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index e9de89bd7f1c..fd92267a7a65 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -5967,6 +5967,11 @@ static __always_inline void vmx_disable_intercept_for_msr(struct kvm_vcpu *vcpu, if (!cpu_has_vmx_msr_bitmap()) return; +#ifdef CONFIG_KVM_INTROSPECTION + if ((type & MSR_TYPE_W) && kvmi_monitored_msr(vcpu, msr)) + return; +#endif /* CONFIG_KVM_INTROSPECTION */ + if (static_branch_unlikely(&enable_evmcs)) evmcs_touch_msr_bitmap(); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index f073fe596244..6ab052fdb786 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1301,6 +1301,9 @@ EXPORT_SYMBOL_GPL(kvm_enable_efer_bits); */ int kvm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { + if (!kvmi_msr_event(vcpu, msr)) + return 1; + switch (msr->index) { case MSR_FS_BASE: case MSR_GS_BASE: From patchwork Thu Dec 20 18:28:44 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Adalbert_Laz=C4=83r?= X-Patchwork-Id: 10739317 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 739AA6C2 for ; Thu, 20 Dec 2018 18:36:30 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 6450C28E58 for ; Thu, 20 Dec 2018 18:36:30 +0000 (UTC) Received: by 
mail.wl.linuxfoundation.org (Postfix, from userid 486) id 588EE28F39; Thu, 20 Dec 2018 18:36:30 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 0FCB928E58 for ; Thu, 20 Dec 2018 18:36:30 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2389342AbeLTSg3 (ORCPT ); Thu, 20 Dec 2018 13:36:29 -0500 Received: from mx01.bbu.dsd.mx.bitdefender.com ([91.199.104.161]:44304 "EHLO mx01.bbu.dsd.mx.bitdefender.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725355AbeLTSgA (ORCPT ); Thu, 20 Dec 2018 13:36:00 -0500 X-Greylist: delayed 381 seconds by postgrey-1.27 at vger.kernel.org; Thu, 20 Dec 2018 13:35:59 EST Received: from smtp.bitdefender.com (smtp02.buh.bitdefender.net [10.17.80.76]) by mx01.bbu.dsd.mx.bitdefender.com (Postfix) with ESMTPS id E5FB3305FFBC; Thu, 20 Dec 2018 20:29:37 +0200 (EET) Received: from host.bbu.bitdefender.biz (unknown [10.10.193.111]) by smtp.bitdefender.com (Postfix) with ESMTPSA id E048130228B5; Thu, 20 Dec 2018 20:29:37 +0200 (EET) From: =?utf-8?q?Adalbert_Laz=C4=83r?= To: kvm@vger.kernel.org Cc: Paolo Bonzini , =?utf-8?b?UmFkaW0gS3LEjW3DocWZ?= , =?utf-8?q?Mihai_Don?= =?utf-8?q?=C8=9Bu?= , =?utf-8?q?Adalbert_Laz=C4=83r?= , =?utf-8?b?Tmlj?= =?utf-8?b?dciZb3IgQ8OuyJt1?= Subject: [RFC PATCH v5 14/20] kvm: x86: hook in kvmi_breakpoint_event() Date: Thu, 20 Dec 2018 20:28:44 +0200 Message-Id: <20181220182850.4579-15-alazar@bitdefender.com> In-Reply-To: <20181220182850.4579-1-alazar@bitdefender.com> References: <20181220182850.4579-1-alazar@bitdefender.com> MIME-Version: 1.0 Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Inform the guest introspection tool that a breakpoint instruction (INT3) is being executed. These one-byte intructions are placed in the slack space of various functions and used as notification for when the OS or an application has reached a certain state or is trying to perform a certain operation (like creating a process). 
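Purely as an illustration (not part of this patch), and assuming the introspection tool has taken ownership of the #BP intercept, a guest-side marker can be as small as the snippet below. The resulting VM exit reaches kvmi_breakpoint_event() with the guest linear address, so the tool can recognize its own markers, while any #BP it does not claim keeps flowing to the existing KVM_EXIT_DEBUG handling:

/* hypothetical in-guest helper, built into an agent or a hooked routine */
static inline void kvmi_marker(void)
{
	asm volatile("int3" ::: "memory");
}
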
Signed-off-by: Mihai Donțu Signed-off-by: Nicușor Cîțu --- arch/x86/kvm/svm.c | 4 ++++ arch/x86/kvm/vmx.c | 15 +++++++++++---- arch/x86/kvm/x86.c | 7 +++++++ 3 files changed, 22 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index d6c7680e224b..17902a9e518d 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -2721,6 +2721,10 @@ static int bp_interception(struct vcpu_svm *svm) { struct kvm_run *kvm_run = svm->vcpu.run; + if (!kvmi_breakpoint_event(&svm->vcpu, + svm->vmcb->save.cs.base + svm->vmcb->save.rip)) + return 1; + kvm_run->exit_reason = KVM_EXIT_DEBUG; kvm_run->debug.arch.pc = svm->vmcb->save.cs.base + svm->vmcb->save.rip; kvm_run->debug.arch.exception = BP_VECTOR; diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index fd92267a7a65..72d4b521fed7 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -7088,7 +7088,7 @@ static int handle_exception(struct kvm_vcpu *vcpu) struct vcpu_vmx *vmx = to_vmx(vcpu); struct kvm_run *kvm_run = vcpu->run; u32 intr_info, ex_no, error_code; - unsigned long cr2, rip, dr6; + unsigned long cr2, dr6; u32 vect_info; enum emulation_result er; @@ -7166,7 +7166,10 @@ static int handle_exception(struct kvm_vcpu *vcpu) kvm_run->debug.arch.dr6 = dr6 | DR6_FIXED_1; kvm_run->debug.arch.dr7 = vmcs_readl(GUEST_DR7); /* fall through */ - case BP_VECTOR: + case BP_VECTOR: { + unsigned long gva = vmcs_readl(GUEST_CS_BASE) + + kvm_rip_read(vcpu); + /* * Update instruction length as we may reinject #BP from * user space while in guest debugging mode. Reading it for @@ -7174,11 +7177,15 @@ static int handle_exception(struct kvm_vcpu *vcpu) */ vmx->vcpu.arch.event_exit_inst_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN); + + if (!kvmi_breakpoint_event(vcpu, gva)) + return 1; + kvm_run->exit_reason = KVM_EXIT_DEBUG; - rip = kvm_rip_read(vcpu); - kvm_run->debug.arch.pc = vmcs_readl(GUEST_CS_BASE) + rip; + kvm_run->debug.arch.pc = gva; kvm_run->debug.arch.exception = ex_no; break; + } default: kvm_run->exit_reason = KVM_EXIT_EXCEPTION; kvm_run->ex.exception = ex_no; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 6ab052fdb786..b1cbcf44cf70 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8546,6 +8546,13 @@ int kvm_arch_vcpu_set_guest_debug(struct kvm_vcpu *vcpu, kvm_queue_exception(vcpu, BP_VECTOR); } +#ifdef CONFIG_KVM_INTROSPECTION + if (kvmi_bp_intercepted(vcpu, dbg->control)) { + ret = -EBUSY; + goto out; + } +#endif + /* * Read rflags as long as potentially injected trace flags are still * filtered out. 
From patchwork Thu Dec 20 18:28:45 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Adalbert_Laz=C4=83r?= X-Patchwork-Id: 10739281 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id AAC4917E1 for ; Thu, 20 Dec 2018 18:36:02 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 991DC28E58 for ; Thu, 20 Dec 2018 18:36:02 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 8CDCA28F39; Thu, 20 Dec 2018 18:36:02 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 43AB728E58 for ; Thu, 20 Dec 2018 18:36:02 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2389298AbeLTSgB (ORCPT ); Thu, 20 Dec 2018 13:36:01 -0500 Received: from mx01.bbu.dsd.mx.bitdefender.com ([91.199.104.161]:44306 "EHLO mx01.bbu.dsd.mx.bitdefender.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730932AbeLTSgA (ORCPT ); Thu, 20 Dec 2018 13:36:00 -0500 Received: from smtp.bitdefender.com (smtp02.buh.bitdefender.net [10.17.80.76]) by mx01.bbu.dsd.mx.bitdefender.com (Postfix) with ESMTPS id EA63B305FFBD; Thu, 20 Dec 2018 20:29:37 +0200 (EET) Received: from host.bbu.bitdefender.biz (unknown [10.10.193.111]) by smtp.bitdefender.com (Postfix) with ESMTPSA id E467D306E477; Thu, 20 Dec 2018 20:29:37 +0200 (EET) From: =?utf-8?q?Adalbert_Laz=C4=83r?= To: kvm@vger.kernel.org Cc: Paolo Bonzini , =?utf-8?b?UmFkaW0gS3LEjW3DocWZ?= , =?utf-8?q?Mihai_Don?= =?utf-8?q?=C8=9Bu?= , =?utf-8?q?Adalbert_Laz=C4=83r?= , =?utf-8?b?Tmlj?= =?utf-8?b?dciZb3IgQ8OuyJt1?= Subject: [RFC PATCH v5 15/20] kvm: x86: intercept accesses to IDTR, GDTR, LDTR and TR Date: Thu, 20 Dec 2018 20:28:45 +0200 Message-Id: <20181220182850.4579-16-alazar@bitdefender.com> In-Reply-To: <20181220182850.4579-1-alazar@bitdefender.com> References: <20181220182850.4579-1-alazar@bitdefender.com> MIME-Version: 1.0 Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Nicușor Cîțu These will be used to implement a tiny agent which runs in the context of an introspected guest and uses virtualized exceptions (#VE) and alternate EPT views (VMFUNC #0) to filter converted VMEXITS. The events of interested will be suppressed (after some appropriate guest-side handling) while the rest will be sent to the introspector via a VMCALL. 
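As a side note (not part of this patch), the alternate EPT view switch mentioned above is the VMFUNC leaf 0 instruction, whose availability the earlier hardware_setup() change already records in kvm_eptp_switching_supported. A guest-side sketch of such a switch, with the view index as an assumed parameter, could look like:

/* hypothetical in-guest helper; works only if the hypervisor enabled
 * the EPTP-switching VM-function for this guest */
static inline void eptp_switch(unsigned long view)
{
	/* VMFUNC: EAX = 0 (EPTP switching), ECX = index in the EPTP list */
	asm volatile(".byte 0x0f, 0x01, 0xd4"
		     : : "a" (0UL), "c" (view)
		     : "memory");
}
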
Signed-off-by: Nicușor Cîțu --- arch/x86/kvm/svm.c | 45 +++++++++++++++++++++++++++++++++++++++++++++ arch/x86/kvm/vmx.c | 29 +++++++++++++++++++++++++++++ 2 files changed, 74 insertions(+) diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 17902a9e518d..d6b5baa513c3 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -4738,6 +4738,41 @@ static int avic_unaccelerated_access_interception(struct vcpu_svm *svm) return ret; } +#ifdef CONFIG_KVM_INTROSPECTION +static int descriptor_access_interception(struct vcpu_svm *svm) +{ + struct kvm_vcpu *vcpu = &svm->vcpu; + struct vmcb_control_area *c = &svm->vmcb->control; + + switch (c->exit_code) { + case SVM_EXIT_IDTR_READ: + case SVM_EXIT_IDTR_WRITE: + kvmi_descriptor_event(vcpu, c->exit_info_1, 0, + KVMI_DESC_IDTR, c->exit_code == SVM_EXIT_IDTR_WRITE); + break; + case SVM_EXIT_GDTR_READ: + case SVM_EXIT_GDTR_WRITE: + kvmi_descriptor_event(vcpu, c->exit_info_1, 0, + KVMI_DESC_GDTR, c->exit_code == SVM_EXIT_GDTR_WRITE); + break; + case SVM_EXIT_LDTR_READ: + case SVM_EXIT_LDTR_WRITE: + kvmi_descriptor_event(vcpu, c->exit_info_1, 0, + KVMI_DESC_LDTR, c->exit_code == SVM_EXIT_LDTR_WRITE); + break; + case SVM_EXIT_TR_READ: + case SVM_EXIT_TR_WRITE: + kvmi_descriptor_event(vcpu, c->exit_info_1, 0, + KVMI_DESC_TR, c->exit_code == SVM_EXIT_TR_WRITE); + break; + default: + break; + } + + return 1; +} +#endif /* CONFIG_KVM_INTROSPECTION */ + static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = { [SVM_EXIT_READ_CR0] = cr_interception, [SVM_EXIT_READ_CR3] = cr_interception, @@ -4803,6 +4838,16 @@ static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = { [SVM_EXIT_RSM] = rsm_interception, [SVM_EXIT_AVIC_INCOMPLETE_IPI] = avic_incomplete_ipi_interception, [SVM_EXIT_AVIC_UNACCELERATED_ACCESS] = avic_unaccelerated_access_interception, +#ifdef CONFIG_KVM_INTROSPECTION + [SVM_EXIT_IDTR_READ] = descriptor_access_interception, + [SVM_EXIT_GDTR_READ] = descriptor_access_interception, + [SVM_EXIT_LDTR_READ] = descriptor_access_interception, + [SVM_EXIT_TR_READ] = descriptor_access_interception, + [SVM_EXIT_IDTR_WRITE] = descriptor_access_interception, + [SVM_EXIT_GDTR_WRITE] = descriptor_access_interception, + [SVM_EXIT_LDTR_WRITE] = descriptor_access_interception, + [SVM_EXIT_TR_WRITE] = descriptor_access_interception, +#endif /* CONFIG_KVM_INTROSPECTION */ }; static void dump_vmcb(struct kvm_vcpu *vcpu) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 72d4b521fed7..b976903c925c 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -7293,6 +7293,35 @@ static int handle_set_cr4(struct kvm_vcpu *vcpu, unsigned long val) static int handle_desc(struct kvm_vcpu *vcpu) { +#ifdef CONFIG_KVM_INTROSPECTION + struct vcpu_vmx *vmx = to_vmx(vcpu); + u32 exit_reason = vmx->exit_reason; + unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION); + u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO); + unsigned char store = (vmx_instruction_info >> 29) & 0x1; + unsigned char descriptor = 0; + + if (exit_reason == EXIT_REASON_GDTR_IDTR) { + if ((vmx_instruction_info >> 28) & 0x1) + descriptor = KVMI_DESC_IDTR; + else + descriptor = KVMI_DESC_GDTR; + } else { + if ((vmx_instruction_info >> 28) & 0x1) + descriptor = KVMI_DESC_TR; + else + descriptor = KVMI_DESC_LDTR; + } + + /* + * For now, this function returns false only when the guest + * is ungracefully stopped (crashed) by the introspection tool. 
+ */ + if (!kvmi_descriptor_event(vcpu, vmx_instruction_info, + exit_qualification, descriptor, store)) + return false; +#endif /* CONFIG_KVM_INTROSPECTION */ + WARN_ON(!(vcpu->arch.cr4 & X86_CR4_UMIP)); return kvm_emulate_instruction(vcpu, 0) == EMULATE_DONE; } From patchwork Thu Dec 20 18:28:46 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Adalbert_Laz=C4=83r?= X-Patchwork-Id: 10739311 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id CA8786C2 for ; Thu, 20 Dec 2018 18:36:25 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id BA8CC28E58 for ; Thu, 20 Dec 2018 18:36:25 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id AEA0E28F39; Thu, 20 Dec 2018 18:36:25 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 687F428E58 for ; Thu, 20 Dec 2018 18:36:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2389310AbeLTSgC (ORCPT ); Thu, 20 Dec 2018 13:36:02 -0500 Received: from mx01.bbu.dsd.mx.bitdefender.com ([91.199.104.161]:44324 "EHLO mx01.bbu.dsd.mx.bitdefender.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2389299AbeLTSgB (ORCPT ); Thu, 20 Dec 2018 13:36:01 -0500 Received: from smtp.bitdefender.com (smtp02.buh.bitdefender.net [10.17.80.76]) by mx01.bbu.dsd.mx.bitdefender.com (Postfix) with ESMTPS id 02B78305FFBE; Thu, 20 Dec 2018 20:29:38 +0200 (EET) Received: from host.bbu.bitdefender.biz (unknown [10.10.193.111]) by smtp.bitdefender.com (Postfix) with ESMTPSA id E7BF3304BD66; Thu, 20 Dec 2018 20:29:37 +0200 (EET) From: =?utf-8?q?Adalbert_Laz=C4=83r?= To: kvm@vger.kernel.org Cc: Paolo Bonzini , =?utf-8?b?UmFkaW0gS3LEjW3DocWZ?= , =?utf-8?q?Mihai_Don?= =?utf-8?q?=C8=9Bu?= , =?utf-8?q?Adalbert_Laz=C4=83r?= , =?utf-8?b?Tmlj?= =?utf-8?b?dciZb3IgQ8OuyJt1?= Subject: [RFC PATCH v5 16/20] kvm: x86: hook the single-step events Date: Thu, 20 Dec 2018 20:28:46 +0200 Message-Id: <20181220182850.4579-17-alazar@bitdefender.com> In-Reply-To: <20181220182850.4579-1-alazar@bitdefender.com> References: <20181220182850.4579-1-alazar@bitdefender.com> MIME-Version: 1.0 Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Mihai DONTU A previous patch has introduced support for single-stepping a guest (currently only VMX is supported). This mechanism is used to cope with instructions that cannot be handled by the x86 emulator during the handling of a VMEXIT. In these situations, all other vCPU-s are kicked and held, the EPT-based protection is removed and the guest is single stepped by the vCPU that triggered the initial VMEXIT. Upon completion the EPT-base protection is reinstalled and all vCPU-s all allowed to return to the guest. This is a rather slow workaround that kicks in occasionally. In the future, the most frequently single-stepped instructions should be added to the emulator (usually, stores to and from memory - SSE/AVX). 
This patch adds the VMX-side of events meant to disable the single step mechanism. Signed-off-by: Nicușor Cîțu Signed-off-by: Mihai Donțu --- arch/x86/kvm/vmx.c | 6 ++++++ arch/x86/kvm/x86.c | 5 ++--- 2 files changed, 8 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index b976903c925c..48bbc35bd246 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -8142,6 +8142,7 @@ static int handle_invalid_op(struct kvm_vcpu *vcpu) static int handle_monitor_trap(struct kvm_vcpu *vcpu) { + kvmi_stop_ss(vcpu); return 1; } @@ -10681,6 +10682,11 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu) } } + if (kvmi_vcpu_enabled_ss(vcpu) + && exit_reason != EXIT_REASON_EPT_VIOLATION + && exit_reason != EXIT_REASON_MONITOR_TRAP_FLAG) + kvmi_stop_ss(vcpu); + if (exit_reason < kvm_vmx_max_exit_handlers && kvm_vmx_exit_handlers[exit_reason]) return kvm_vmx_exit_handlers[exit_reason](vcpu); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index b1cbcf44cf70..d163be11b2b5 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -7147,8 +7147,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win) int r; /* try to reinject previous events if any */ - - if (vcpu->arch.exception.injected) + if (vcpu->arch.exception.injected && !kvmi_vcpu_enabled_ss(vcpu)) kvm_x86_ops->queue_exception(vcpu); /* * Do not inject an NMI or interrupt if there is a pending @@ -7184,7 +7183,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win) } /* try to inject new event if pending */ - if (vcpu->arch.exception.pending) { + if (vcpu->arch.exception.pending && !kvmi_vcpu_enabled_ss(vcpu)) { WARN_ON_ONCE(vcpu->arch.exception.injected); vcpu->arch.exception.pending = false; vcpu->arch.exception.injected = true; From patchwork Thu Dec 20 18:28:47 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Adalbert_Laz=C4=83r?= X-Patchwork-Id: 10739279 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 2DCB96C2 for ; Thu, 20 Dec 2018 18:36:02 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 1E5EA28E58 for ; Thu, 20 Dec 2018 18:36:02 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 1214E28F39; Thu, 20 Dec 2018 18:36:02 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id C455028E58 for ; Thu, 20 Dec 2018 18:36:01 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2389292AbeLTSgA (ORCPT ); Thu, 20 Dec 2018 13:36:00 -0500 Received: from mx01.bbu.dsd.mx.bitdefender.com ([91.199.104.161]:44310 "EHLO mx01.bbu.dsd.mx.bitdefender.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729184AbeLTSgA (ORCPT ); Thu, 20 Dec 2018 13:36:00 -0500 Received: from smtp.bitdefender.com (smtp02.buh.bitdefender.net [10.17.80.76]) by mx01.bbu.dsd.mx.bitdefender.com (Postfix) with ESMTPS id 05B83305FFBF; Thu, 20 Dec 2018 20:29:38 +0200 (EET) Received: from host.bbu.bitdefender.biz (unknown [10.10.193.111]) by 
From patchwork Thu Dec 20 18:28:47 2018
X-Patchwork-Submitter: Adalbert Lazăr
X-Patchwork-Id: 10739279
From: Adalbert Lazăr
To: kvm@vger.kernel.org
Cc: Paolo Bonzini, Radim Krčmář, Mihai Donțu, Adalbert Lazăr
Subject: [RFC PATCH v5 17/20] kvm: x86: hook in kvmi_cr_event()
Date: Thu, 20 Dec 2018 20:28:47 +0200
Message-Id: <20181220182850.4579-18-alazar@bitdefender.com>
In-Reply-To: <20181220182850.4579-1-alazar@bitdefender.com>
References: <20181220182850.4579-1-alazar@bitdefender.com>

Notify the guest introspection tool that cr{0,3,4} is about to be
changed. kvmi_cr_event() stores the (possibly adjusted) new value in crX
only if the tool permits the change; otherwise the write is refused.

Signed-off-by: Mihai Donțu
---
 arch/x86/kvm/x86.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d163be11b2b5..40c966e83ff6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -776,6 +776,9 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
         if (!(cr0 & X86_CR0_PG) && kvm_read_cr4_bits(vcpu, X86_CR4_PCIDE))
                 return 1;
 
+        if (!kvmi_cr_event(vcpu, 0, old_cr0, &cr0))
+                return 1;
+
         kvm_x86_ops->set_cr0(vcpu, cr0);
 
         if ((cr0 ^ old_cr0) & X86_CR0_PG) {
@@ -920,6 +923,9 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
                         return 1;
         }
 
+        if (!kvmi_cr_event(vcpu, 4, old_cr4, &cr4))
+                return 1;
+
         if (kvm_x86_ops->set_cr4(vcpu, cr4))
                 return 1;
 
@@ -936,6 +942,7 @@ EXPORT_SYMBOL_GPL(kvm_set_cr4);
 
 int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 {
+        unsigned long old_cr3 = kvm_read_cr3(vcpu);
         bool skip_tlb_flush = false;
 #ifdef CONFIG_X86_64
         bool pcid_enabled = kvm_read_cr4_bits(vcpu, X86_CR4_PCIDE);
@@ -946,7 +953,7 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
         }
 #endif
 
-        if (cr3 == kvm_read_cr3(vcpu) && !pdptrs_changed(vcpu)) {
+        if (cr3 == old_cr3 && !pdptrs_changed(vcpu)) {
                 if (!skip_tlb_flush) {
                         kvm_mmu_sync_roots(vcpu);
                         kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
@@ -961,6 +968,9 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
             !load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3))
                 return 1;
 
+        if (!kvmi_cr_event(vcpu, 3, old_cr3, &cr3))
+                return 1;
+
         kvm_mmu_new_cr3(vcpu, cr3, skip_tlb_flush);
         vcpu->arch.cr3 = cr3;
         __set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
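[For illustration only, not part of the patch: roughly what the tool-side
policy behind kvmi_cr_event() could look like for CR4. The deny-clearing-
SMEP/SMAP rule and the function name on_cr4_write() are assumptions made
for this sketch; when such a check fails, the patched kvm_set_cr4() above
returns 1 and the guest write is refused.]

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define X86_CR4_SMEP (1ULL << 20)
#define X86_CR4_SMAP (1ULL << 21)

/* Return true to allow the CR4 write; the tool may also adjust *new_val. */
static bool on_cr4_write(uint64_t old_val, uint64_t *new_val)
{
        uint64_t cleared = old_val & ~*new_val;

        if (cleared & (X86_CR4_SMEP | X86_CR4_SMAP)) {
                printf("denying CR4 write: %#llx -> %#llx clears SMEP/SMAP\n",
                       (unsigned long long)old_val,
                       (unsigned long long)*new_val);
                return false;
        }
        return true;
}

int main(void)
{
        uint64_t cr4 = X86_CR4_SMEP | X86_CR4_SMAP;
        uint64_t attempted = 0;        /* a rootkit-style attempt to clear both bits */

        if (!on_cr4_write(cr4, &attempted))
                puts("the guest write would be blocked");
        return 0;
}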
From patchwork Thu Dec 20 18:28:48 2018
X-Patchwork-Submitter: Adalbert Lazăr
X-Patchwork-Id: 10739313
From: Adalbert Lazăr
To: kvm@vger.kernel.org
Cc: Paolo Bonzini, Radim Krčmář, Mihai Donțu, Adalbert Lazăr
Subject: [RFC PATCH v5 18/20] kvm: x86: hook in kvmi_xsetbv_event()
Date: Thu, 20 Dec 2018 20:28:48 +0200
Message-Id: <20181220182850.4579-19-alazar@bitdefender.com>
In-Reply-To: <20181220182850.4579-1-alazar@bitdefender.com>
References: <20181220182850.4579-1-alazar@bitdefender.com>

Notify the guest introspection tool that the extended control register
(XCR0) has been changed.

Signed-off-by: Mihai Donțu
---
 arch/x86/kvm/x86.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 40c966e83ff6..89e508ca7c8b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -866,6 +866,11 @@ static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
 
 int kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
 {
+#ifdef CONFIG_KVM_INTROSPECTION
+        if (xcr != vcpu->arch.xcr0)
+                kvmi_xsetbv_event(vcpu);
+#endif /* CONFIG_KVM_INTROSPECTION */
+
         if (kvm_x86_ops->get_cpl(vcpu) != 0 ||
             __kvm_set_xcr(vcpu, index, xcr)) {
                 kvm_inject_gp(vcpu, 0);
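[For illustration only, not part of the patch: a tiny user-space sketch of
how an introspection tool might decode the guest's XCR0 once the xsetbv
event tells it the register changed. The bit names follow the SDM; the
decode logic itself is an assumption, since the event payload is defined
elsewhere in the series.]

#include <stdint.h>
#include <stdio.h>

/* XCR0 feature bits, as defined by the SDM. */
#define XFEATURE_MASK_FP   (1ULL << 0)
#define XFEATURE_MASK_SSE  (1ULL << 1)
#define XFEATURE_MASK_YMM  (1ULL << 2)

static void report_xcr0(uint64_t xcr0)
{
        printf("guest XCR0 = %#llx (x87=%d sse=%d avx=%d)\n",
               (unsigned long long)xcr0,
               !!(xcr0 & XFEATURE_MASK_FP),
               !!(xcr0 & XFEATURE_MASK_SSE),
               !!(xcr0 & XFEATURE_MASK_YMM));
}

int main(void)
{
        /* e.g. the value read back after the guest enables AVX */
        report_xcr0(XFEATURE_MASK_FP | XFEATURE_MASK_SSE | XFEATURE_MASK_YMM);
        return 0;
}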
From patchwork Thu Dec 20 18:28:49 2018
X-Patchwork-Submitter: Adalbert Lazăr
X-Patchwork-Id: 10739315
From: Adalbert Lazăr
To: kvm@vger.kernel.org
Cc: Paolo Bonzini, Radim Krčmář, Mihai Donțu, Adalbert Lazăr, Mircea Cîrjaliu
Subject: [RFC PATCH v5 19/20] kvm: x86: handle the introspection hypercalls
Date: Thu, 20 Dec 2018 20:28:49 +0200
Message-Id: <20181220182850.4579-20-alazar@bitdefender.com>
In-Reply-To: <20181220182850.4579-1-alazar@bitdefender.com>
References: <20181220182850.4579-1-alazar@bitdefender.com>

Two hypercalls (KVM_HC_MEM_MAP, KVM_HC_MEM_UNMAP) are used by the
introspection tool running in a VM to map/unmap memory from the
introspected VM-s.

The third hypercall (KVM_HC_XEN_HVM_OP) is used by the code residing
inside the introspected guest to call the introspection tool and to
report certain details about its operation. For example, a classic
antimalware remediation tool can report what it has found during a scan.

Signed-off-by: Mihai Donțu
Signed-off-by: Mircea Cîrjaliu
---
 arch/x86/kvm/x86.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 89e508ca7c8b..6d5f801e4797 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7027,11 +7027,14 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 {
         unsigned long nr, a0, a1, a2, a3, ret;
         int op_64_bit;
+        bool kvmi_hc;
 
-        if (kvm_hv_hypercall_enabled(vcpu->kvm))
+        nr = kvm_register_read(vcpu, VCPU_REGS_RAX);
+        kvmi_hc = (u32)nr == KVM_HC_XEN_HVM_OP;
+
+        if (kvm_hv_hypercall_enabled(vcpu->kvm) && !kvmi_hc)
                 return kvm_hv_hypercall(vcpu);
 
-        nr = kvm_register_read(vcpu, VCPU_REGS_RAX);
         a0 = kvm_register_read(vcpu, VCPU_REGS_RBX);
         a1 = kvm_register_read(vcpu, VCPU_REGS_RCX);
         a2 = kvm_register_read(vcpu, VCPU_REGS_RDX);
@@ -7048,7 +7051,7 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
                 a3 &= 0xFFFFFFFF;
         }
 
-        if (kvm_x86_ops->get_cpl(vcpu) != 0) {
+        if (kvm_x86_ops->get_cpl(vcpu) != 0 && !kvmi_hc) {
                 ret = -KVM_EPERM;
                 goto out;
         }
@@ -7069,6 +7072,19 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
                 ret = kvm_pv_send_ipi(vcpu->kvm, a0, a1, a2, a3, op_64_bit);
                 break;
 #endif
+#ifdef CONFIG_KVM_INTROSPECTION
+        case KVM_HC_MEM_MAP:
+                ret = kvmi_host_mem_map(vcpu, (gva_t)a0, (gpa_t)a1, (gpa_t)a2);
+                break;
+        case KVM_HC_MEM_UNMAP:
+                ret = kvmi_host_mem_unmap(vcpu, (gpa_t)a0);
+                break;
+        case KVM_HC_XEN_HVM_OP:
+                ret = 0;
+                if (!kvmi_hypercall_event(vcpu))
+                        ret = -KVM_ENOSYS;
+                break;
+#endif /* CONFIG_KVM_INTROSPECTION */
         default:
                 ret = -KVM_ENOSYS;
                 break;
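[For illustration only, not part of the patch: roughly how an agent inside
the introspected guest could raise the KVM_HC_XEN_HVM_OP hypercall that
lands in kvmi_hypercall_event(). The hypercall number below and the use of
RBX/RCX for the two arguments are assumptions based on the generic KVM
hypercall ABI (number in RAX, arguments in RBX/RCX/RDX/RSI, result in RAX);
check the patched uapi headers for the real values. VMCALL is the Intel
mnemonic; AMD guests use VMMCALL instead.]

#include <stdint.h>

/* Assumed value -- take the real one from include/uapi/linux/kvm_para.h. */
#define KVM_HC_XEN_HVM_OP 34UL

long kvmi_guest_notify(unsigned long subop, unsigned long arg)
{
        long ret;

        /* Issue the hypercall; the host routes it to the introspection tool. */
        asm volatile("vmcall"
                     : "=a" (ret)
                     : "a" (KVM_HC_XEN_HVM_OP), "b" (subop), "c" (arg)
                     : "memory");
        return ret;        /* -KVM_ENOSYS when no introspection tool is attached */
}

This only makes sense inside a guest running on a host with this series
applied; elsewhere the VMCALL would fault or return an error.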
From patchwork Thu Dec 20 18:28:50 2018
X-Patchwork-Submitter: Adalbert Lazăr
X-Patchwork-Id: 10739307
From: Adalbert Lazăr
To: kvm@vger.kernel.org
Cc: Paolo Bonzini, Radim Krčmář, Mihai Donțu, Adalbert Lazăr
Subject: [RFC PATCH v5 20/20] kvm: x86: hook in kvmi_trap_event()
Date: Thu, 20 Dec 2018 20:28:50 +0200
Message-Id: <20181220182850.4579-21-alazar@bitdefender.com>
In-Reply-To: <20181220182850.4579-1-alazar@bitdefender.com>
References: <20181220182850.4579-1-alazar@bitdefender.com>

From: Mihai DONTU

Inform the guest introspection tool that the exception (from a previous
KVMI_INJECT_EXCEPTION command) was not successfully injected. The
introspection tool can queue a page fault only to have it overwritten by
an interrupt picked up during guest re-entry. kvmi_trap_event() is used
to inform the tool when such a situation has been detected.

Signed-off-by: Mihai Donțu
---
 arch/x86/kvm/x86.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6d5f801e4797..64572f7ec689 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7773,6 +7773,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
                 }
         }
 
+        if (kvmi_lost_exception(vcpu))
+                kvmi_trap_event(vcpu);
+
         r = kvm_mmu_reload(vcpu);
         if (unlikely(r)) {
                 goto cancel_injection;
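[For illustration only, not part of the patch: a tool reacting to this
event might keep per-vCPU bookkeeping along these lines. The structure and
field names are invented for this sketch.]

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Bookkeeping for one vCPU. */
struct pending_injection {
        bool    requested;      /* a KVMI_INJECT_EXCEPTION was sent */
        uint8_t vector;         /* e.g. 14 for #PF */
};

/* Called when the trap event for this vCPU arrives. */
static void on_trap_event(struct pending_injection *pend, uint8_t delivered_vector)
{
        if (pend->requested && delivered_vector != pend->vector) {
                printf("vector %u was superseded by %u, re-issuing the injection\n",
                       pend->vector, delivered_vector);
                /* resend KVMI_INJECT_EXCEPTION here */
                return;
        }
        pend->requested = false;        /* our exception reached the guest */
}

int main(void)
{
        struct pending_injection pend = { .requested = true, .vector = 14 };

        on_trap_event(&pend, 32);       /* an external interrupt got injected instead */
        return 0;
}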