From patchwork Thu Nov 11 14:13:39 2021
X-Patchwork-Submitter: Chao Peng
X-Patchwork-Id: 12614837
From: Chao Peng
To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
    qemu-devel@nongnu.org
Subject: [RFC PATCH 0/6] KVM: mm: fd-based approach for supporting KVM
    guest private memory
Date: Thu, 11 Nov 2021 22:13:39 +0800
Message-Id: <20211111141352.26311-1-chao.p.peng@linux.intel.com>
X-Mailer: git-send-email 2.17.1
Cc: Wanpeng Li, jun.nakajima@intel.com, david@redhat.com,
    "J . Bruce Fields", dave.hansen@intel.com, "H . Peter Anvin",
    Chao Peng, ak@linux.intel.com, Jonathan Corbet, Joerg Roedel,
    x86@kernel.org, Hugh Dickins, Ingo Molnar, Borislav Petkov,
    luto@kernel.org, Thomas Gleixner, Vitaly Kuznetsov, Jim Mattson,
    Sean Christopherson, susie.li@intel.com, Jeff Layton,
    john.ji@intel.com, Yu Zhang, Paolo Bonzini, Andrew Morton,
    "Kirill A . Shutemov"

This RFC series tries to implement the fd-based KVM guest private memory
proposal described at [1]. We have already had some offline discussions
on this series, which resulted in a different design proposal from
Paolo. This thread includes both the original RFC patch series for
proposal [1] and a summary of Paolo's new proposal so that we can
continue the discussion.

To understand the patches and the new proposal, you are strongly
encouraged to read the original proposal [1] first.

Patch Description
=================
The patches include a private memory implementation in the memfd/shmem
backing store and KVM support for a private memory slot, as well as its
counterpart in QEMU.

Patch1:    kernel part shmem/memfd support
Patch2-6:  KVM part
Patch7-13: QEMU part

QEMU Usage:
  -machine private-memory-backend=ram1 \
  -object memory-backend-memfd,id=ram1,size=5G,guest_private=on,seal=off

New Proposal
============
Below is a summary of the changes for the new proposal that was
discussed in the offline thread.

In general, the new proposal reuses the concept of an fd-based guest
memory backing store described in [1], but it coordinates the private
and shared parts within one single memslot instead of introducing a
dedicated private memslot.

- memslot extension
  The new proposal suggests adding the private fd and the offset to the
  existing 'shared' memslot so that both private and shared memory can
  live in one single memslot. A page in the memslot is either private or
  shared. A page is private only when it is allocated in the private fd;
  in all other cases it is treated as shared, which includes pages
  already mapped as shared as well as pages that have not been mapped
  yet.

- private memory map/unmap
  Userspace's map/unmap operations are done by calling fallocate() on
  the private fd.
  - map: default fallocate() with mode=0.
  - unmap: fallocate() with FALLOC_FL_PUNCH_HOLE.

  There would be two new callbacks registered by KVM and called by the
  memory backing store during the above map/unmap operations:
  - map(inode, offset, size): the memory backing store tells the related
    KVM memslot to do a shared->private conversion.
  - unmap(inode, offset, size): the memory backing store tells the
    related KVM memslot to do a private->shared conversion.
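
The userspace side of the map/unmap operations above can be sketched in
C as follows. This is only an illustration under stated assumptions: it
uses an ordinary memfd as a stand-in for the private fd (the
guest_private / F_SEAL_GUEST behaviour from this series is not part of
it), and the helper names private_fd_create(), private_fd_map(),
private_fd_unmap() and private_fd_blocks() are hypothetical. Note that
on Linux FALLOC_FL_PUNCH_HOLE has to be combined with
FALLOC_FL_KEEP_SIZE.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* memfd_create(), fallocate() */
#endif
#include <assert.h>
#include <fcntl.h>
#include <linux/falloc.h>       /* FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE */
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a plain memfd of the given size; stand-in for the private fd. */
int private_fd_create(off_t size)
{
    int fd = memfd_create("guest-private", MFD_CLOEXEC);

    if (fd < 0)
        return -1;
    if (ftruncate(fd, size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* "map": allocate the range in the private fd (fallocate with mode=0). */
int private_fd_map(int fd, off_t offset, off_t len)
{
    assert(offset >= 0 && len > 0);
    return fallocate(fd, 0, offset, len);
}

/* "unmap": punch a hole so the range is no longer allocated. */
int private_fd_unmap(int fd, off_t offset, off_t len)
{
    assert(offset >= 0 && len > 0);
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len);
}

/* Number of 512-byte blocks currently allocated to the fd, or -1. */
long long private_fd_blocks(int fd)
{
    struct stat st;

    if (fstat(fd, &st) < 0)
        return -1;
    return (long long)st.st_blocks;
}
```

A VMM would call private_fd_map() when handling a shared->private
conversion request and private_fd_unmap() for the reverse direction; in
the proposal, the backing store would then invoke the KVM-registered
map/unmap callbacks as a side effect of these calls.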

  The memory backing store also needs to provide a new callback for KVM
  to query whether a page is already allocated in the private fd, so
  that KVM can tell whether the page is private or not.
  - page_allocated(inode, offset): for shmem this would simply return
    pagecache_get_page().

There are two places in KVM that can exit to userspace to trigger a
private/shared conversion:
  - explicit conversion: happens when the guest calls into KVM to
    explicitly map a range (as private or shared). KVM then exits to
    userspace to do the above map/unmap operations.

  - implicit conversion: happens in the KVM page fault handler.
    * If the fault is due to a private memory access, cause a userspace
      exit for a shared->private conversion request when
      page_allocated() returns false; otherwise map the page directly
      without a userspace exit.
    * If the fault is due to a shared memory access, cause a userspace
      exit for a private->shared conversion request when
      page_allocated() returns true; otherwise map the page directly
      without a userspace exit.

An example flow:

  guest                      Linux                 userspace
  -------------------------  --------------------  -----------------------
                                                   ioctl(KVM_RUN)
  access private memory
         '--- EPT violation --.
                              v
                             userspace exit
                                  '------------------.
                                                     v
                                                   munmap shared memfd
                                                   fallocate private memfd
                                  .------------------'
                                  v
                             fallocate()
                               call guest_ops
                                 unmap shared PTE
                                 map private PTE
                             ...
                                                   ioctl(KVM_RUN)

Compared to the original proposal:
 - no need to introduce a KVM memslot hole punching API,
 - would avoid potential memslot performance/scalability/fragmentation
   issues,
 - may also reduce userspace complexity,
 - but requires additional callbacks between KVM and the memory backing
   store.
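
The implicit conversion rules for the page fault handler described
earlier can be restated as a small decision function. This is a sketch
for discussion only, not code from the series: the enum, its values and
decide_fault_action() are hypothetical names, and the second argument
stands for the result of the proposed page_allocated() query.

```c
#include <stdbool.h>

/* Hypothetical outcomes of a guest page fault under the new proposal. */
enum fault_action {
    FAULT_MAP_PRIVATE,      /* map the private page, no userspace exit */
    FAULT_MAP_SHARED,       /* map the shared page, no userspace exit */
    FAULT_EXIT_TO_PRIVATE,  /* exit: request shared->private conversion */
    FAULT_EXIT_TO_SHARED,   /* exit: request private->shared conversion */
};

/*
 * private_access:  the fault was caused by a private memory access.
 * allocated:       the page is already allocated in the private fd,
 *                  i.e. what page_allocated(inode, offset) would report.
 */
enum fault_action decide_fault_action(bool private_access, bool allocated)
{
    if (private_access)
        return allocated ? FAULT_MAP_PRIVATE : FAULT_EXIT_TO_PRIVATE;

    return allocated ? FAULT_EXIT_TO_SHARED : FAULT_MAP_SHARED;
}
```

In both directions a userspace exit is needed only when the access type
disagrees with the current allocation state of the private fd; when
they agree, the fault can be resolved entirely inside KVM.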

[1]
https://lkml.kernel.org/kvm/51a6f74f-6c05-74b9-3fd7-b7cd900fb8cc@redhat.com/t/

Thanks,
Chao
---
Chao Peng (6):
  mm: Add F_SEAL_GUEST to shmem/memfd
  kvm: x86: Introduce guest private memory address space to memslot
  kvm: x86: add private_ops to memslot
  kvm: x86: implement private_ops for memfd backing store
  kvm: x86: add KVM_EXIT_MEMORY_ERROR exit
  KVM: add KVM_SPLIT_MEMORY_REGION

 Documentation/virt/kvm/api.rst  |   1 +
 arch/x86/include/asm/kvm_host.h |   5 +-
 arch/x86/include/uapi/asm/kvm.h |   4 +
 arch/x86/kvm/Makefile           |   2 +-
 arch/x86/kvm/memfd.c            |  63 +++++++++++
 arch/x86/kvm/mmu/mmu.c          |  69 ++++++++++--
 arch/x86/kvm/mmu/paging_tmpl.h  |   3 +-
 arch/x86/kvm/x86.c              |   3 +-
 include/linux/kvm_host.h        |  41 ++++++-
 include/linux/memfd.h           |  22 ++++
 include/linux/shmem_fs.h        |   9 ++
 include/uapi/linux/fcntl.h      |   1 +
 include/uapi/linux/kvm.h        |  34 ++++++
 mm/memfd.c                      |  34 +++++-
 mm/shmem.c                      | 127 +++++++++++++++++++++-
 virt/kvm/kvm_main.c             | 185 +++++++++++++++++++++++++++++++-
 16 files changed, 581 insertions(+), 22 deletions(-)
 create mode 100644 arch/x86/kvm/memfd.c