[v2] Documentation/page_tables: Add info about MMU/TLB and Page Faults

Message ID 20230813182552.31792-1-fmdefrancesco@gmail.com (mailing list archive)
State New

Commit Message

Fabio M. De Francesco Aug. 13, 2023, 6:25 p.m. UTC
Extend page_tables.rst by adding a section about the role of the MMU and TLB
in translating between virtual addresses and physical page frames.
Furthermore, explain the concept behind page faults and how the Linux
kernel handles TLB misses. Finally, briefly explain how and why to disable
the page fault handler.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
---

v1 -> v2: This version takes into account the comments provided by Mike
(thanks!). I hope I haven't overlooked anything he suggested :-)
https://lore.kernel.org/all/20230807105010.GK2607694@kernel.org/

Furthermore, v2 adds some more information about swapping, which was not
present in v1.

Before becoming a "real" patch, this was an RFC PATCH, in its 2nd version for a
week or so, until I received comments and suggestions from Jonathan Cameron
(thanks!), and then it morphed into a real patch.

The thread with the RFC PATCH v2 and the messages between Jonathan and me
starts at https://lore.kernel.org/all/20230723120721.7139-1-fmdefrancesco@gmail.com/#r

 Documentation/mm/page_tables.rst | 128 +++++++++++++++++++++++++++++++
 1 file changed, 128 insertions(+)

Comments

Linus Walleij Aug. 15, 2023, 8:51 a.m. UTC | #1
Hi Fabio,

overall this v2 looks good!

The below are my grammar and spelling nitpicks.

On Sun, Aug 13, 2023 at 8:25 PM Fabio M. De Francesco
<fmdefrancesco@gmail.com> wrote:

> Extend page_tables.rst by adding a section about the role of the MMU and TLB
> in translating between virtual addresses and physical page frames.
> Furthermore, explain the concept behind page faults and how the Linux
> kernel handles TLB misses. Finally, briefly explain how and why to disable
> the page fault handler.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Linus Walleij <linus.walleij@linaro.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Randy Dunlap <rdunlap@infradead.org>
> Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
(...)
> +If the above-mentioned conditions happen in user-space, the kernel sends a
> +`Segmentation Fault` (SIGSEGV) signal to the current thread. That signal usually
> +causes the termination of the thread and of the process it belongs to.
> +
> +Instead, there are also common and expected other causes of page faults. These

The word you are looking for is "Additionally" right?

"Additionally, there are..."

> +These techniques improve memory efficiency, reduce latency, and minimize space
> +occupation. This document won't go deeper into the details of "Lazy Allocation"
> +and "Copy-on-Write" because these subjects are out of scope for they belong to

"for they belong" -> "as they belong"
(I think)

> +Swapping differentiate itself from the other mentioned techniques because it's

differentiates

> +not so desirable since it's performed as a means to reduce memory under heavy
> +pressure.

"not so desirable" -> "undesirable"

> +Swapping can't work for memory mapped by kernel logical addresses. These are a

"kernel logical addresses" -> "kernel-internal logical addresses"

> +If everything fails to make room for the data that must reside be present in

"If everything fails" -> "If the kernel fails"

> +This document is going to simplify and show an high altitude view of how the
> +Linux kernel handles these page faults, creates tables and tables' entries,
> +check if memory is present and, if not, requests to load data from persistent
> +storage or from other devices, and updates the MMU and its caches...

Skip "..." for just period "."

> +The first steps are architectures dependent. Most architectures jump to

architectures -> architecture

> +Whatever the routes, all architectures end up to the invocation of
> +`handle_mm_fault()` which, in turn, (likely) ends up calling
> +`__handle_mm_fault()` to carry out the actual work of allocation of the page
> +tables.

"of allocation of the" -> "of allocating the"

> +`__handle_mm_fault()` carries out its work by calling several functions to
> +find the entry's offsets of the upper layers of the page tables and allocate
> +the tables that it may need to.

Skip the last "to".

> +Linux supports larger page sizes than the usual 4KB (i.e., the so called
> +`huge pages`). When using these kinds of larger pages, higher level pages can
> +directly map them, with no need to use lower level page entries (PTE). Huge
> +pages contain large contiguos physical regions that usually span from 2MB to

contiguous

> +The huge pages bring with them several benefits like reduced TLB pressure,
> +reduced page table overhead, memory allocation efficiency, and performance
> +improvement for certain workloads. However, these benefits come with
> +trade-offs, like wasted memory and allocation challenges. Huge pages are out
> +of scope of the present document, therefore, it won't go into further details.

Since you explain what they are, it feels they are in scope?
I would just skip the last sentence.

> +To conclude this brief overview from very high altitude of how Linux handles

To conclude this high altitude view of...

> +Several code path make use of the latter two functions because they need to

code paths

With or without the above suggestions:
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>

Yours,
Linus Walleij
Fabio M. De Francesco Aug. 15, 2023, 12:27 p.m. UTC | #2
On Tuesday, 15 August 2023 10:51:24 CEST Linus Walleij wrote:
> Hi Fabio,
> 
> overall this v2 looks good!

Hi Linus,

Thanks for your review. I appreciated it.

I'm counting at least ten mistakes. Well, my poor English still needs to
improve if I am to work on documentation.

I agree with all the changes you are proposing, so I won't acknowledge them
line by line. Instead, I'll send a v3 and carry your tag forward.

I have only one doubt and one question.
I'll jump directly to the relevant parts.

> 
> The below are my grammar and spelling nitpicks.
>
> [snip] 
> 
> > +If the above-mentioned conditions happen in user-space, the kernel sends a
> > +`Segmentation Fault` (SIGSEGV) signal to the current thread. That signal usually
> > +causes the termination of the thread and of the process it belongs to.
> > +
> > +Instead, there are also common and expected other causes of page faults. These
> The word you are looking for is "Additionally" right?
> 
> "Additionally, there are..."

I was only able to use "Instead" to express that, contrary to the former
conditions, which are unexpected and uncommon, there are other expected and
common causes of page faults. I thought that "Instead" stresses that the
latter causes carry opposite, and desired, consequences.

I think of "additionally" as a means to introduce less important and less 
frequently occurring conditions.

Nevertheless, I'll change it to "Additionally" as you are asking for.

Everything that follows from here onward should surely be changed as you are 
suggesting.  

[snip]

> > +Swapping can't work for memory mapped by kernel logical addresses. These are a
> "kernel logical addresses" -> "kernel-internal logical addresses"

My only question is about why you prefer "kernel-internal" to a straight 
"kernel". Can you please say more about this?

[snip]
 
> With or without the above suggestions:

I'll do the v3 _with_ the above suggestions.

> Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
> 
> Yours,
> Linus Walleij

Again thanks,

Fabio
Linus Walleij Aug. 15, 2023, 12:58 p.m. UTC | #3
Hi Fabio!

trying to answer these things as best I can!

Note that I'm not a native English speaker either.

It's refreshing to have a discussion about formulations in text
in addition to our everyday technical churn!

On Tue, Aug 15, 2023 at 2:27 PM Fabio M. De Francesco
<fmdefrancesco@gmail.com> wrote:
> On Tuesday, 15 August 2023 10:51:24 CEST Linus Walleij wrote:
> > > +Instead, there are also common and expected other causes of page faults. These
> > The word you are looking for is "Additionally" right?
> >
> > "Additionally, there are..."
>
> I was only able to use "Instead" to express that, contrary to the former
> conditions, which are unexpected and uncommon, there are other expected and
> common causes of page faults. I thought that "Instead" stresses that the
> latter causes carry opposite, and desired, consequences.
>
> I think of "additionally" as a means to introduce less important and less
> frequently occurring conditions.
>
> Nevertheless, I'll change it to "Additionally" as you are asking for.

I think the following is the best:

"There are also other, common and expected causes of page faults".

No bridge words. I can't really explain it, it's just language intuition. :/

Another option is to move the section about the common case before the
exceptions, which may be more natural for the flow of the text.

> > > +Swapping can't work for memory mapped by kernel logical addresses. These are a
> > "kernel logical addresses" -> "kernel-internal logical addresses"
>
> My only question is about why you prefer "kernel-internal" to a straight
> "kernel". Can you please say more about this?

It's because the kernel handles many address spaces and is also aware of
the userspace address space and the physical address space.
So it's just to emphasize which one it is.

Yours,
Linus Walleij
Patch

diff --git a/Documentation/mm/page_tables.rst b/Documentation/mm/page_tables.rst
index 7840c1891751..ad9e52f2d7f1 100644
--- a/Documentation/mm/page_tables.rst
+++ b/Documentation/mm/page_tables.rst
@@ -152,3 +152,131 @@  Page table handling code that wishes to be architecture-neutral, such as the
 virtual memory manager, will need to be written so that it traverses all of the
 currently five levels. This style should also be preferred for
 architecture-specific code, so as to be robust to future changes.
+
+
+MMU, TLB, and Page Faults
+=========================
+
+The `Memory Management Unit (MMU)` is a hardware component that handles virtual
+to physical address translations. It may use relatively small caches in hardware
+called `Translation Lookaside Buffers (TLBs)` and `Page Walk Caches` to speed up
+these translations.
+
+When the CPU accesses a memory location, it provides a virtual address to the
+MMU, which checks whether a translation exists in the TLB or in the Page Walk
+Caches (on architectures that support them). If no translation is found, the
+MMU walks the page tables to find the physical address and create the map.
+
+Each page of memory has associated permission and dirty bits. The dirty bit is
+set (i.e., turned on) when the page is written to, which indicates that the
+page has been modified since it was loaded into memory.
+
+If nothing prevents it, eventually the physical memory can be accessed and the
+requested operation on the physical frame is performed.
+
+There are several reasons why the MMU can't find certain translations. It could
+happen because the CPU is trying to access memory that the current task is not
+permitted to access, or because the data is not present in physical memory.
+
+When these conditions happen, the MMU triggers page faults, which are types of
+exceptions that signal the CPU to pause the current execution and run a special
+function to handle them.
+
+Page faults may be caused by code bugs or by maliciously crafted addresses that
+the CPU is instructed to dereference and access. A thread of a process could
+use an instruction to address (non-shared) memory which does not belong to its
+own address space, or could try to execute an instruction that wants to write
+to a read-only location.
+
+If the above-mentioned conditions happen in user-space, the kernel sends a
+`Segmentation Fault` (SIGSEGV) signal to the current thread. That signal usually
+causes the termination of the thread and of the process it belongs to.
+
+There are also other, common and expected causes of page faults. These are
+triggered by process management optimization techniques called "Lazy
+Allocation" and "Copy-on-Write". Page faults may also happen when frames have
+been swapped out to persistent storage (swap partition or file) and evicted from
+their physical locations.
+
+These techniques improve memory efficiency, reduce latency, and minimize space
+occupation. This document won't go deeper into the details of "Lazy Allocation"
+and "Copy-on-Write" because these subjects are out of scope, as they belong to
+Process Address Management.
+
+Swapping differentiates itself from the other mentioned techniques because it's
+undesirable: it's performed as a means to reduce memory usage under heavy
+pressure.
+
+Swapping can't work for memory mapped by kernel-internal logical addresses.
+These are a subset of the kernel virtual space that directly maps a contiguous
+range of physical memory. Given any logical address, its physical address is
+determined with simple arithmetic on an offset. Accesses to logical addresses
+are fast because they avoid the need for complex page table lookups, at the
+expense of frames not being evictable or pageable out.
+
+If the kernel fails to make room for the data that must be present in the
+physical frames, the kernel invokes the out-of-memory (OOM) killer to make room
+by terminating lower priority processes until pressure reduces under a safe
+threshold.
+
+This document is going to simplify and show a high altitude view of how the
+Linux kernel handles these page faults, creates tables and tables' entries,
+checks if memory is present and, if not, requests to load data from persistent
+storage or from other devices, and updates the MMU and its caches.
+
+The first steps are architecture dependent. Most architectures jump to
+`do_page_fault()`, whereas the x86 interrupt handler is defined by the
+`DEFINE_IDTENTRY_RAW_ERRORCODE()` macro which calls `handle_page_fault()`.
+
+Whatever the routes, all architectures end up invoking
+`handle_mm_fault()` which, in turn, (likely) ends up calling
+`__handle_mm_fault()` to carry out the actual work of allocating the page
+tables.
+
+The unfortunate case of not being able to call `__handle_mm_fault()` means
+that the virtual address is pointing to areas of physical memory which are not
+permitted to be accessed (at least from the current context). This
+condition resolves to the kernel sending the above-mentioned SIGSEGV signal
+to the process and leads to the consequences already explained.
+
+`__handle_mm_fault()` carries out its work by calling several functions to
+find the entry's offsets of the upper layers of the page tables and allocate
+the tables that it may need.
+
+The functions that look for the offset have names like `*_offset()`, where the
+"*" is for pgd, p4d, pud, pmd, pte; the functions that allocate the
+corresponding tables, layer by layer, are instead called `*_alloc()`, using the
+above-mentioned convention to name them after the corresponding types of tables
+in the hierarchy.
+
+The page table walk may end at one of the middle or upper layers (PMD, PUD).
+
+Linux supports larger page sizes than the usual 4KB (i.e., the so called
+`huge pages`). When using these kinds of larger pages, higher level pages can
+directly map them, with no need to use lower level page entries (PTE). Huge
+pages contain large contiguous physical regions that usually span from 2MB to
+1GB. They are respectively mapped by the PMD and PUD page entries.
+
+The huge pages bring with them several benefits like reduced TLB pressure,
+reduced page table overhead, memory allocation efficiency, and performance
+improvement for certain workloads. However, these benefits come with
+trade-offs, like wasted memory and allocation challenges. Further details on
+huge pages are out of scope for the present document.
+
+At the very end of the walk with allocations, if it didn't return errors,
+`__handle_mm_fault()` finally calls `handle_pte_fault()`, which via `do_fault()`
+performs one of `do_read_fault()`, `do_cow_fault()`, `do_shared_fault()`.
+"read", "cow", "shared" give hints about the reasons and the kind of fault it's
+handling.
+
+The actual implementation of the workflow is very complex. Its design allows
+Linux to handle page faults in a way that is tailored to the specific
+characteristics of each architecture, while still sharing a common overall
+structure.
+
+To conclude this high altitude view of how Linux handles page faults, let's
+add that the page fault handler can be disabled and enabled with
+`pagefault_disable()` and `pagefault_enable()`, respectively.
+
+Several code paths make use of the latter two functions because they need to
+disable traps into the page fault handler, mostly to prevent deadlocks.
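
As a side note to the `*_offset()` lookups described in the patch, the following userspace sketch shows how a virtual address might be split into one table index per level plus an offset within the page. It is not kernel code and is not part of the patch. It assumes an x86-64-like layout with five levels, 4KB pages and 9 index bits per level; other architectures and configurations differ. The names `sketch_level_index()` and `sketch_page_offset()` are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout (x86-64-like, 5-level paging): 4KB pages, 9 bits per level. */
#define SKETCH_PAGE_SHIFT     12u
#define SKETCH_BITS_PER_LEVEL 9u
#define SKETCH_ENTRIES        (1u << SKETCH_BITS_PER_LEVEL) /* 512 entries */

/* Levels numbered from the bottom of the hierarchy up to the top. */
enum sketch_level { SK_PTE, SK_PMD, SK_PUD, SK_P4D, SK_PGD };

/*
 * Hypothetical helper: index of the entry that a virtual address selects
 * in the table at the given level. Each level consumes the next 9 bits
 * above the 12-bit page offset.
 */
static inline unsigned int sketch_level_index(uint64_t vaddr,
                                              enum sketch_level level)
{
    unsigned int shift;

    assert(level <= SK_PGD);
    shift = SKETCH_PAGE_SHIFT + SKETCH_BITS_PER_LEVEL * (unsigned int)level;
    return (unsigned int)((vaddr >> shift) & (SKETCH_ENTRIES - 1u));
}

/* Hypothetical helper: byte offset of the address within its 4KB page. */
static inline unsigned int sketch_page_offset(uint64_t vaddr)
{
    return (unsigned int)(vaddr & ((1u << SKETCH_PAGE_SHIFT) - 1u));
}
```

Under these assumptions, a walk would use the PGD index first and the PTE index last. When a huge page is mapped at the PMD or PUD level, the lower indices are not used for table lookups and the remaining low bits become part of the offset inside the larger page.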