From patchwork Thu Mar 5 17:17:45 2015
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Andrea Arcangeli
X-Patchwork-Id: 5948261
From: Andrea Arcangeli
To: qemu-devel@nongnu.org, kvm@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 linux-api@vger.kernel.org, Android Kernel Team
Cc: "Kirill A. Shutemov", Pavel Emelyanov, Sanidhya Kashyap,
 zhang.zhanghailiang@huawei.com, Linus Torvalds,
 Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
 Mel Gorman, Andy Lutomirski, Andrew Morton, Sasha Levin,
 Hugh Dickins, Peter Feiner, "Dr. David Alan Gilbert",
 Christopher Covington, Johannes Weiner, Robert Love,
 Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
 KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, Keith Packard,
 "Huangpeng (Peter)", Anthony Liguori, Stefan Hajnoczi, Wenchao Xia,
 Andrew Jones, Juan Quintela
Subject: [PATCH 02/21] userfaultfd: linux/Documentation/vm/userfaultfd.txt
Date: Thu, 5 Mar 2015 18:17:45 +0100
Message-Id: <1425575884-2574-3-git-send-email-aarcange@redhat.com>
In-Reply-To: <1425575884-2574-1-git-send-email-aarcange@redhat.com>
References: <1425575884-2574-1-git-send-email-aarcange@redhat.com>
X-Mailing-List: kvm@vger.kernel.org

Add documentation.

Signed-off-by: Andrea Arcangeli
---
 Documentation/vm/userfaultfd.txt | 97 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 97 insertions(+)
 create mode 100644 Documentation/vm/userfaultfd.txt

diff --git a/Documentation/vm/userfaultfd.txt b/Documentation/vm/userfaultfd.txt
new file mode 100644
index 0000000..2ec296c
--- /dev/null
+++ b/Documentation/vm/userfaultfd.txt
@@ -0,0 +1,97 @@
+= Userfaultfd =
+
+== Objective ==
+
+Userfaults allow on-demand paging to be implemented in userland and,
+more generally, they allow userland to take control of various memory
+page faults, something otherwise only kernel code could do.
+
+For example, userfaults allow a proper and more efficient
+implementation of the PROT_NONE+SIGSEGV trick.
+
+== Design ==
+
+Userfaults are delivered and resolved through the userfaultfd syscall.
+
+The userfaultfd (aside from registering and unregistering virtual
+memory ranges) provides two primary functionalities:
+
+1) read/POLLIN protocol to notify a userland thread of the faults
+   as they happen
+
+2) various UFFDIO_* ioctls that can manage the virtual memory
+   regions registered in the userfaultfd, allowing userland to
+   efficiently resolve the userfaults it receives via 1) or to manage
+   the virtual memory in the background
+
+The real advantage of userfaults compared to regular virtual memory
+management with mremap/mprotect is that userfaults, in all their
+operations, never involve heavyweight structures like vmas (in fact the
+userfaultfd runtime load never takes the mmap_sem for writing).
+
+Vmas are not suitable for page- (or hugepage-) granular fault tracking
+when dealing with virtual address spaces that could span
+terabytes. Too many vmas would be needed for that.
+
+The userfaultfd, once opened by invoking the syscall, can also be
+passed over Unix domain sockets to a manager process, so the same
+manager process can handle the userfaults of a multitude of
+different processes without them being aware of what is going on
+(unless, of course, they later try to use the userfaultfd themselves
+on the same region the manager is already tracking, which is a corner
+case that would currently return -EBUSY).
+
+== API ==
+
+When first opened, the userfaultfd must be enabled by invoking the
+UFFDIO_API ioctl with uffdio_api.api set to UFFD_API, which specifies
+the read/POLLIN protocol userland intends to speak on the UFFD. The
+UFFDIO_API ioctl, if successful (i.e. if the requested uffdio_api.api
+is also spoken by the running kernel), returns two 64-bit bitmasks in
+uffdio_api.bits and uffdio_api.ioctls: respectively, the activated
+feature bits below PAGE_SHIFT in the userfault addresses returned by
+read(2), and the generic ioctls that are available on this
+userfaultfd.
+
+Once the userfaultfd has been enabled, the UFFDIO_REGISTER ioctl should
+be invoked (if present in the returned uffdio_api.ioctls bitmask) to
+register a memory range in the userfaultfd by setting the
+uffdio_register structure accordingly. The uffdio_register.mode
+bitmask specifies to the kernel which kinds of faults to track for
+the range (UFFDIO_REGISTER_MODE_MISSING tracks missing
+pages). The UFFDIO_REGISTER ioctl returns the
+uffdio_register.ioctls bitmask of ioctls that are suitable for
+resolving userfaults on the registered range. Not all ioctls will
+necessarily be supported for all memory types, depending on the
+underlying virtual memory backend (anonymous memory vs tmpfs vs real
+file-backed mappings).
+
+Userland can use the uffdio_register.ioctls to manage the virtual
+address space in the background (to add or potentially also remove
+memory from the userfaultfd registered range). This means a userfault
+could be triggered just before userland maps the user-faulted page in
+the background. To avoid POLLIN resulting in an unexpected blocking
+read (if the UFFD is not opened in nonblocking mode in the first
+place), we don't allow the background thread to wake userfaults that
+haven't been read by userland yet. If we did allow that, the
+UFFDIO_WAKE ioctl could likely be dropped. This may change in the
+future (with a UFFD_API protocol bump combined with the removal of the
+UFFDIO_WAKE ioctl) if it is demonstrated to be a valid optimization
+that is worth forcing userland to always use the UFFD in nonblocking
+mode when combined with POLLIN.
+
+userfaultfd is also a generic enough feature that it allows KVM to
+implement postcopy live migration (one form of memory externalization
+consisting of a virtual machine running with part or all of its memory
+residing on a different node in the cloud) without having to modify a
+single line of KVM kernel code. Guest async page faults, FOLL_NOWAIT
+and all other GUP features work just fine in combination with
+userfaults (userfaults trigger async page faults in the guest
+scheduler so those guest processes that aren't waiting for userfaults
+can keep running in the guest vcpus).
+
+The primary ioctl to resolve userfaults is UFFDIO_COPY. It
+atomically copies a page into the userfault registered range and wakes
+up the blocked userfaults (unless uffdio_copy.mode &
+UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctls work similarly to
+UFFDIO_COPY.
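
For contrast with the userfaultfd flow described in the patch, the
PROT_NONE+SIGSEGV trick mentioned under "Objective" can be sketched
roughly as below. This is not code from the series and it does not use
userfaultfd at all: it only illustrates the older technique, where each
resolved fault needs an mprotect() call and therefore a vma change. All
names (lazy_region, lazy_segv_handler, lazy_demo, the 0x5a fill byte)
are hypothetical and chosen for this illustration only.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical names for this sketch; none of them come from the patch. */
static char *lazy_region;                 /* start of the PROT_NONE mapping */
static size_t lazy_len;                   /* length of the mapping in bytes */
static long lazy_page_size;               /* cached sysconf(_SC_PAGESIZE) */
static volatile sig_atomic_t lazy_faults; /* number of faults resolved */

/* SIGSEGV handler: make the faulting page accessible and fill it. */
static void lazy_segv_handler(int sig, siginfo_t *info, void *ctx)
{
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t start = (uintptr_t)lazy_region;
    char *page;

    (void)sig;
    (void)ctx;

    /* A fault outside the tracked region is a genuine bug: bail out. */
    if (addr < start || addr >= start + lazy_len)
        _exit(1);

    page = (char *)(addr & ~((uintptr_t)lazy_page_size - 1));

    /*
     * mprotect() is the step that modifies vmas. It is not formally
     * listed as async-signal-safe, although it is a plain syscall.
     */
    if (mprotect(page, (size_t)lazy_page_size, PROT_READ | PROT_WRITE) != 0)
        _exit(1);

    memset(page, 0x5a, (size_t)lazy_page_size); /* "page in" the data */
    lazy_faults++;
}

/*
 * Map `pages` inaccessible pages, read the first byte of page `index`
 * and return it, or return -1 if the setup fails.
 */
static int lazy_demo(size_t pages, size_t index)
{
    struct sigaction sa, old;
    volatile char *p;
    int first;

    assert(index < pages);

    lazy_page_size = sysconf(_SC_PAGESIZE);
    lazy_len = pages * (size_t)lazy_page_size;
    lazy_faults = 0;

    lazy_region = mmap(NULL, lazy_len, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (lazy_region == MAP_FAILED)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = lazy_segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0) {
        munmap(lazy_region, lazy_len);
        return -1;
    }

    /* The first read faults; the handler fills the page; the read restarts. */
    p = lazy_region + index * (size_t)lazy_page_size;
    first = (unsigned char)p[0];
    (void)p[1]; /* same page, already accessible: no second fault */

    sigaction(SIGSEGV, &old, NULL);
    munmap(lazy_region, lazy_len);
    return first;
}
```

Calling lazy_demo(4, 2) from a main() should return the fill byte and
leave lazy_faults at 1. With this approach every resolved page splits
or merges vmas through mprotect(), which is the cost the "Design"
section says userfaults avoid.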