From patchwork Fri Jul 28 11:53:01 2023
X-Patchwork-Submitter: "Fabio M. De Francesco"
X-Patchwork-Id: 13331765
From: "Fabio M. De Francesco"
To: Jonathan Corbet, Jonathan Cameron, Linus Walleij, Mike Rapoport,
 linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: "Fabio M. De Francesco", Andrew Morton, Ira Weiny, Matthew Wilcox,
 Randy Dunlap
Subject: [PATCH] Documentation/page_tables: Add info about MMU/TLB and Page Faults
Date: Fri, 28 Jul 2023 13:53:01 +0200
Message-ID: <20230728120054.12306-1-fmdefrancesco@gmail.com>
X-Mailer: git-send-email 2.41.0
MIME-Version: 1.0

Extend page_tables.rst by adding a section about the role of MMU and TLB
in translating between virtual addresses and physical page frames.
Furthermore explain the concept behind Page Faults and how the Linux
kernel handles TLB misses. Finally briefly explain how and why to disable
the page faults handler.

Cc: Andrew Morton
Cc: Ira Weiny
Cc: Jonathan Cameron
Cc: Jonathan Corbet
Cc: Linus Walleij
Cc: Matthew Wilcox
Cc: Mike Rapoport
Cc: Randy Dunlap
Signed-off-by: Fabio M. De Francesco
---

This has been an RFC PATCH in its 2nd version for a week or so. I received
comments and suggestions on it from Jonathan Cameron (thanks!), and so it has
now been modified to a real patch. I hope that other people want to add their
comments on this document in order to further improve and extend it.

The thread with the RFC PATCH v2 and the messages between Jonathan and me
starts at https://lore.kernel.org/all/20230723120721.7139-1-fmdefrancesco@gmail.com/#r

 Documentation/mm/page_tables.rst | 105 +++++++++++++++++++++++++++++++
 1 file changed, 105 insertions(+)

diff --git a/Documentation/mm/page_tables.rst b/Documentation/mm/page_tables.rst
index 7840c1891751..6ecfd6d2f1f3 100644
--- a/Documentation/mm/page_tables.rst
+++ b/Documentation/mm/page_tables.rst
@@ -152,3 +152,108 @@ Page table handling code that wishes to be architecture-neutral, such as the
 virtual memory manager, will need to be written so that it traverses all of the
 currently five levels. This style should also be preferred for
 architecture-specific code, so as to be robust to future changes.
+
+
+MMU, TLB, and Page Faults
+=========================
+
+The `Memory Management Unit (MMU)` is a hardware component that handles virtual
+to physical address translations. It may use relatively small caches in hardware
+called `Translation Lookaside Buffers (TLBs)` and `Page Walk Caches` to speed up
+these translations.
+
+When a process wants to access a memory location, the CPU provides a virtual
+address to the MMU, which checks access permissions and dirty bits and, if
+possible, resolves the physical address and permits the requested type of
+access to the corresponding physical memory.
+
+If the TLBs do not yet hold a translation for the address, the MMU may use the
+Page Walk Caches and complete or restart the page table walks until a physical
+address can finally be resolved. Permissions and dirty bits are checked.
+
+In the context of a virtual memory system, like the one used by the Linux
+kernel, each page of memory has associated permission and dirty bits.
+
+The dirty bit for a page is set (i.e., turned on) when the page is written
+to. This indicates that the page has been modified since it was loaded into
+memory. It probably needs to be written to disk, or other cores may need to
+be informed about previous changes before allowing further operations.
+
+If nothing prevents it, eventually the physical memory can be accessed and
+the requested operation on the physical frame is performed.
+
+There are several reasons why the MMU can't find certain translations. It
+could happen because the process is trying to access a range of memory that it
+is not allowed to access, or because the data is not present in RAM.
+
+When these conditions happen, the MMU triggers page faults, which are types
+of exceptions that signal the CPU to pause the current process and run a special
+function to handle the mentioned page faults.
+
+One cause of page faults is bugs (or maliciously crafted addresses): a fault
+happens when a process tries to access a range of memory that it has no
+permission to access. This could be because the memory is reserved for the
+kernel or for another process, or because the process is trying to write to a
+read-only section of memory. When this happens, the kernel sends a Segmentation
+Fault (SIGSEGV) signal to the process, which usually causes it to terminate.
+
+An expected and more common cause of page faults is an optimization called "lazy
+allocation". This is a technique used by the kernel to improve memory efficiency
+and reduce footprint. Instead of allocating physical memory to a process as soon
+as it's requested, the kernel waits until the process actually tries to use the
+memory. This can save a significant amount of memory in cases where a process
+requests a large block but only uses a small portion of it.
+
+A related technique is called "Copy-on-Write" (CoW), where the kernel allows
+multiple processes to share the same physical memory as long as they're only
+reading from it. If a process tries to write to the shared memory, the write
+triggers a page fault and the kernel allocates a separate copy of the memory
+for the process. This allows the kernel to save memory and avoid unnecessary
+data copying and, by doing so, it reduces latency and space occupation.
+
+Now, let's see how the Linux kernel handles these page faults:
+
+1. For most architectures, `do_page_fault()` is the primary interrupt handler
+   for page faults. It delegates the actual handling of the page fault to
+   `handle_mm_fault()`. This function checks the cause of the page fault and
+   takes the appropriate action, such as loading the required page into
+   memory, granting the process the necessary permissions, or sending a
+   SIGSEGV signal to the process.
+
+2. In the specific case of the x86 architecture, the interrupt handler is
+   defined by the `DEFINE_IDTENTRY_RAW_ERRORCODE()` macro, which calls
+   `handle_page_fault()`. This function then calls either
+   `do_user_addr_fault()` or `do_kern_addr_fault()`, depending on whether
+   the fault occurred in user space or kernel space. Both of these functions
+   eventually lead to `handle_mm_fault()`, similar to the workflow in other
+   architectures.
+
+`handle_mm_fault()` (likely) ends up calling `__handle_mm_fault()` to carry
+out the actual work of allocating the page tables. It works by using
+several functions to find the entry offsets in the 4 or 5 layers of tables
+and allocate the tables it needs to. The functions that look for the offset
+have names like `*_offset()`, where the "*" is for pgd, p4d, pud, pmd, pte;
+the functions that allocate the corresponding tables, layer by layer,
+are named `*_alloc`, following the same convention of naming them after
+the corresponding types of tables in the hierarchy.
+
+At the very end of the walk with allocations, if it didn't return errors,
+`__handle_mm_fault()` finally calls `handle_pte_fault()`, which via
+`do_fault()` performs one of `do_read_fault()`, `do_cow_fault()`,
+`do_shared_fault()`. "read", "cow", "shared" give hints about the reasons
+and the kind of fault it's handling.
+
+The actual implementation of the workflow is very complex. Its design allows
+Linux to handle page faults in a way that is tailored to the specific
+characteristics of each architecture, while still sharing a common overall
+structure.
+
+To conclude this high-level overview of how Linux handles page faults, let's
+add that the page fault handler can be disabled and enabled with
+`pagefault_disable()` and `pagefault_enable()`, respectively.
+
+Several code paths make use of these two functions because they need to
+disable traps into the page fault handler, mostly to prevent deadlocks.[1]
+
+[1] mm/userfaultfd: Replace kmap/kmap_atomic() with kmap_local_page()
+https://lore.kernel.org/all/20221025220136.2366143-1-ira.weiny@intel.com/
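
As an aside to the last two paragraphs of the patch, the nesting behaviour of
`pagefault_disable()` and `pagefault_enable()` can be sketched with a small
userspace model. This is only an illustration under stated assumptions, not
kernel code: every `model_*` name is hypothetical, the `resident` flag stands
in for "the page is already mapped", and the copy helper merely imitates the
convention of the kernel's "inatomic" user-copy helpers, which report the
number of bytes that could not be copied. In the kernel itself the counter is,
to my understanding, kept per task rather than in a global variable.

```c
/*
 * Userspace model only, NOT kernel code. All model_* names are
 * hypothetical and exist purely for illustration.
 */
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumption: one global counter stands in for the per-task kernel counter. */
static int model_pagefault_disabled_count;

/* Calls nest: each disable must be balanced by exactly one enable. */
static void model_pagefault_disable(void)
{
	model_pagefault_disabled_count++;
}

static void model_pagefault_enable(void)
{
	model_pagefault_disabled_count--;
}

static bool model_pagefault_disabled(void)
{
	return model_pagefault_disabled_count != 0;
}

/*
 * Hypothetical stand-in for an "inatomic" copy. Returns the number of
 * bytes NOT copied. If the page is not resident and fault handling is
 * disabled, the fault cannot be serviced, so nothing is copied and the
 * caller is expected to fall back to a path that may sleep.
 */
static size_t model_copy_inatomic(void *dst, const void *src, size_t len,
				  bool resident)
{
	if (!resident && model_pagefault_disabled())
		return len;

	/* Either the page is resident or the fault would be handled. */
	memcpy(dst, src, len);
	return 0;
}
```

A caller following this pattern would disable fault handling, attempt the
copy, re-enable fault handling, and only then, outside the critical section,
retry any bytes that were reported as not copied.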