Message ID | 20230723120721.7139-1-fmdefrancesco@gmail.com (mailing list archive) |
---|---|
State | New |
Series | [RFC,v2] Documentation/page_tables: Add info about MMU/TLB and Page Faults |
On Sun, 23 Jul 2023 14:07:09 +0200
"Fabio M. De Francesco" <fmdefrancesco@gmail.com> wrote:

> Extend page_tables.rst by adding a section about the role of MMU and TLB
> in translating between virtual addresses and physical page frames.
> Furthermore explain the concept behind Page Faults and how the Linux
> kernel handles TLB misses. Finally briefly explain how and why to disable
> the page faults handler.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Bagas Sanjaya <bagasdotme@gmail.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Linus Walleij <linus.walleij@linaro.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Randy Dunlap <rdunlap@infradead.org>
> Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>

Hi Fabio,

Some superficial comments...

> ---
>
> v1->v2: Add further information about lower level functions in the page
> fault handler and add information about how and why to disable / enable
> the page fault handler (provided a link to a Ira's patch that make use
> of pagefault_disable() to prevent deadlocks).
>
> This is an RFC PATCH because of two reasons:
>
> 1) I've heard that there is consensus about the need to revise and
> extend the MM documentation, but I'm not sure about whether or not
> developers need these kind of introductory information.
>
> 2) While preparing this little patch I decided to take a quicj look at

Spell check your intro text.

> the code and found out it currently is not how I thought I remembered
> it. I'm especially speaking about the x86 case. I'm not sure that I've
> been able to properly understand what I described as a difference in
> workflow compared to most of the other architecture.
>
> Therefore, for the two reasons explained above, I'd like to hear from
> people actively involved in MM. If this is not what you want, feel free
> to throw it away. Otherwise I'd be happy to write more on this and other
> MM topics. I'm looking forward for comments on this small work.
>
>  Documentation/mm/page_tables.rst | 87 ++++++++++++++++++++++++++++++++
>  1 file changed, 87 insertions(+)
>
> diff --git a/Documentation/mm/page_tables.rst b/Documentation/mm/page_tables.rst
> index 7840c1891751..2be56f50c88f 100644
> --- a/Documentation/mm/page_tables.rst
> +++ b/Documentation/mm/page_tables.rst
> @@ -152,3 +152,90 @@ Page table handling code that wishes to be architecture-neutral, such as the
>  virtual memory manager, will need to be written so that it traverses all of the
>  currently five levels. This style should also be preferred for
>  architecture-specific code, so as to be robust to future changes.
> +
> +
> +MMU, TLB, and Page Faults
> +=========================
> +
> +The Memory Management Unit (MMU) is a hardware component that handles virtual to
> +physical address translations. It uses a relatively small cache in hardware

It may use a relatively...
(I doubt Linux supports anything that doesn't have a TLB but they aren't
required by some architectures - just a performance optimization that you
'can' add to an implementation.)

> +called the Translation Lookaside Buffer (TLB) to speed up these translations.
> +When a process wants to access a memory location, the CPU provides a virtual
> +address to the MMU, which then uses the TLB to quickly find the corresponding
> +physical address.
> +
> +However, sometimes the MMU can't find a valid translation in the TLB. This
> +could be because the process is trying to access a range of memory that it's not
> +allowed to, or because the memory hasn't been loaded into RAM yet.

It might not find it because this is the first attempt to do the translation and
the MMU hasn't filled the TLB entry yet, or a capacity eviction has happened.
Basically failure to find it in the TLB doesn't mean we get a page fault
(unless you are on an ancient architecture where TLB entries are software
filled which is definitely not the case for most modern ones).

> When this
> +happens, the MMU triggers a page fault, which is a type of interrupt that

Hmm. Whilst similar to an interrupt I'd argue that it's not one..

> +signals the CPU to pause the current process and run a special function to
> +handle the fault.

...

Jonathan
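The distinction made in this review (a TLB miss is not the same thing as a page fault) can be illustrated with a small userspace toy model. This is only a sketch under stated assumptions: every name in it (`toy_mmu`, `toy_translate()`, the direct-mapped four-entry TLB, the single-level table) is hypothetical and made up for illustration. It is neither kernel code nor a description of how any real MMU is built.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT  12u  /* assumed 4 KiB pages */
#define TOY_NUM_VPAGES  16u  /* toy address space of 16 virtual pages */
#define TOY_TLB_ENTRIES 4u   /* tiny direct-mapped TLB */

enum toy_xlate {
	TOY_TLB_HIT,    /* translation served from the TLB */
	TOY_WALK_HIT,   /* TLB miss, but the table walk found a mapping */
	TOY_PAGE_FAULT, /* the walk found no valid mapping */
};

struct toy_pte {
	bool present;
	uint32_t pfn;
};

struct toy_tlb_entry {
	bool valid;
	uint32_t vpn;
	uint32_t pfn;
};

struct toy_mmu {
	struct toy_pte page_table[TOY_NUM_VPAGES]; /* single-level table */
	struct toy_tlb_entry tlb[TOY_TLB_ENTRIES];
};

/* Translate vaddr; on success store the physical address in *paddr. */
static enum toy_xlate toy_translate(struct toy_mmu *mmu, uint32_t vaddr,
				    uint32_t *paddr)
{
	uint32_t vpn = vaddr >> TOY_PAGE_SHIFT;
	uint32_t offset = vaddr & ((1u << TOY_PAGE_SHIFT) - 1u);
	struct toy_tlb_entry *slot;

	assert(mmu != NULL && paddr != NULL);

	if (vpn >= TOY_NUM_VPAGES)
		return TOY_PAGE_FAULT;

	/* Step 1: look in the TLB. */
	slot = &mmu->tlb[vpn % TOY_TLB_ENTRIES];
	if (slot->valid && slot->vpn == vpn) {
		*paddr = (slot->pfn << TOY_PAGE_SHIFT) | offset;
		return TOY_TLB_HIT;
	}

	/* Step 2: TLB miss, so walk the table. Only a failed walk faults. */
	if (!mmu->page_table[vpn].present)
		return TOY_PAGE_FAULT;

	/* Step 3: refill the TLB slot, evicting whatever was cached there. */
	slot->valid = true;
	slot->vpn = vpn;
	slot->pfn = mmu->page_table[vpn].pfn;
	*paddr = (slot->pfn << TOY_PAGE_SHIFT) | offset;
	return TOY_WALK_HIT;
}
```

In this model, the first access to a mapped page and an access after a capacity eviction both return `TOY_WALK_HIT` rather than `TOY_PAGE_FAULT`. Only an unmapped page faults.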
On Monday, 24 July 2023 11:55:05 CEST Jonathan Cameron wrote:
> On Sun, 23 Jul 2023 14:07:09 +0200
>
> "Fabio M. De Francesco" <fmdefrancesco@gmail.com> wrote:
> > Extend page_tables.rst by adding a section about the role of MMU and TLB
> > in translating between virtual addresses and physical page frames.
> > Furthermore explain the concept behind Page Faults and how the Linux
> > kernel handles TLB misses. Finally briefly explain how and why to disable
> > the page faults handler.
> >
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Bagas Sanjaya <bagasdotme@gmail.com>
> > Cc: Ira Weiny <ira.weiny@intel.com>
> > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > Cc: Jonathan Corbet <corbet@lwn.net>
> > Cc: Linus Walleij <linus.walleij@linaro.org>
> > Cc: Matthew Wilcox <willy@infradead.org>
> > Cc: Mike Rapoport <rppt@kernel.org>
> > Cc: Randy Dunlap <rdunlap@infradead.org>
> > Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
>
> Hi Fabio,
>

Hi Jonathan,

> Some superficial comments...

They may be "superficial", but they are indeed very welcome :-)

> > ---
> >
> > v1->v2: Add further information about lower level functions in the page
> > fault handler and add information about how and why to disable / enable
> > the page fault handler (provided a link to a Ira's patch that make use
> > of pagefault_disable() to prevent deadlocks).
> >
> > This is an RFC PATCH because of two reasons:
> >
> > 1) I've heard that there is consensus about the need to revise and
> > extend the MM documentation, but I'm not sure about whether or not
> > developers need these kind of introductory information.
> >
> > 2) While preparing this little patch I decided to take a quicj look at
>
> Spell check your intro text.

Sure, I'll s/quicj/quick/

> > the code and found out it currently is not how I thought I remembered
> > it. I'm especially speaking about the x86 case. I'm not sure that I've
> > been able to properly understand what I described as a difference in
> > workflow compared to most of the other architecture.
> >
> > Therefore, for the two reasons explained above, I'd like to hear from
> > people actively involved in MM. If this is not what you want, feel free
> > to throw it away. Otherwise I'd be happy to write more on this and other
> > MM topics. I'm looking forward for comments on this small work.
> >
> >  Documentation/mm/page_tables.rst | 87 ++++++++++++++++++++++++++++++++
> >  1 file changed, 87 insertions(+)
> >
> > diff --git a/Documentation/mm/page_tables.rst b/Documentation/mm/page_tables.rst
> > index 7840c1891751..2be56f50c88f 100644
> > --- a/Documentation/mm/page_tables.rst
> > +++ b/Documentation/mm/page_tables.rst
> > @@ -152,3 +152,90 @@ Page table handling code that wishes to be architecture-neutral, such as the
> >  virtual memory manager, will need to be written so that it traverses all of the
> >  currently five levels. This style should also be preferred for
> >  architecture-specific code, so as to be robust to future changes.
> >
> > +
> > +
> > +MMU, TLB, and Page Faults
> > +=========================
> > +
> > +The Memory Management Unit (MMU) is a hardware component that handles virtual to
> > +physical address translations. It uses a relatively small cache in hardware
>
> It may use a relatively...
> (I doubt Linux supports anything that doesn't have a TLB but they aren't
> required by some architectures - just a performance optimization that you
> 'can' add to an implementation.)

Oh, I didn't know that Linux supports non-MMU architectures. However I suspect
that the vast majority have MMU and TLB. Is it correct?

I'll change the statement to "it may use, and in the vast majority of
supported architecture it indeed uses [...]". How about this? Is it not yet
what you meant?

> > +called the Translation Lookaside Buffer (TLB) to speed up these translations.
> > +When a process wants to access a memory location, the CPU provides a virtual
> > +address to the MMU, which then uses the TLB to quickly find the corresponding
> > +physical address.
> > +
> > +However, sometimes the MMU can't find a valid translation in the TLB. This
> > +could be because the process is trying to access a range of memory that it's not
> > +allowed to, or because the memory hasn't been loaded into RAM yet.
>
> It might not find it because this is the first attempt to do the translation and
> the MMU hasn't filled the TLB entry yet, or a capacity eviction has happened.

I thought that "[...] hasn't been loaded into RAM yet" would have covered all
cases comprising lazy allocation, copy-on-write, and swapped out pages. I
talked about the first two later in the text, but I forgot to speak about
swapped out page frames to persistent storage so I'll add it in the next
version.

However, I am thinking that it is not the TLB misses that may cause an exception
to fault in memory, but it is the MMU itself if not able to fill the TLB with
the content of allocated page tables. If you confirm so, I'll need to rewrite
the first introductory paragraphs. Can you please confirm?

> Basically failure to find it in the TLB doesn't mean we get a page fault
> (unless you are on an ancient architecture where TLB entries are software
> filled which is definitely not the case for most modern ones).

Let me summarize so that you can confirm or deny whether or not I
understood...

1) TLB misses don't cause page faults unless MMU is not able to find the
entries in the hierarchy of page tables. If it finds the entries it
transparently refills the TLB buffer with the found translations.

2) Page faults happen only if MMU, after walking the hierarchy, cannot yet
find any suitable translation.

> > When this
> >
> > +happens, the MMU triggers a page fault, which is a type of interrupt that
>
> Hmm. Whilst similar to an interrupt I'd argue that it's not one..

3) I shouldn't define it as an "interrupt" because it technically is not. How
about "exception" or "software exception"?

> > +signals the CPU to pause the current process and run a special function to
> > +handle the fault.
>
> ...
>
> Jonathan

I don't see any other comments on the second part of the RFC. Does it mean
that the second part is OK from your POV?

It would be of great help if you could set aside some more minutes and clear
the doubts I just expressed and answer the questions I asked :-)

Thanks for the comments,

Fabio
On Mon, 24 Jul 2023 13:21:40 +0200
"Fabio M. De Francesco" <fmdefrancesco@gmail.com> wrote:

> On Monday, 24 July 2023 11:55:05 CEST Jonathan Cameron wrote:
> > On Sun, 23 Jul 2023 14:07:09 +0200
> >
> > "Fabio M. De Francesco" <fmdefrancesco@gmail.com> wrote:
> > > Extend page_tables.rst by adding a section about the role of MMU and TLB
> > > in translating between virtual addresses and physical page frames.
> > > Furthermore explain the concept behind Page Faults and how the Linux
> > > kernel handles TLB misses. Finally briefly explain how and why to disable
> > > the page faults handler.
> > >
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: Bagas Sanjaya <bagasdotme@gmail.com>
> > > Cc: Ira Weiny <ira.weiny@intel.com>
> > > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > > Cc: Jonathan Corbet <corbet@lwn.net>
> > > Cc: Linus Walleij <linus.walleij@linaro.org>
> > > Cc: Matthew Wilcox <willy@infradead.org>
> > > Cc: Mike Rapoport <rppt@kernel.org>
> > > Cc: Randy Dunlap <rdunlap@infradead.org>
> > > Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
> >
> > Hi Fabio,
> >
>
> Hi Jonathan,
>
> > Some superficial comments...
>
> They may be "superficial", but they are indeed very welcome :-)
>
> > > ---
> > >
> > > v1->v2: Add further information about lower level functions in the page
> > > fault handler and add information about how and why to disable / enable
> > > the page fault handler (provided a link to a Ira's patch that make use
> > > of pagefault_disable() to prevent deadlocks).
> > >
> > > This is an RFC PATCH because of two reasons:
> > >
> > > 1) I've heard that there is consensus about the need to revise and
> > > extend the MM documentation, but I'm not sure about whether or not
> > > developers need these kind of introductory information.
> > >
> > > 2) While preparing this little patch I decided to take a quicj look at
> >
> > Spell check your intro text.
>
> Sure, I'll s/quicj/quick/
>
> > > the code and found out it currently is not how I thought I remembered
> > > it. I'm especially speaking about the x86 case. I'm not sure that I've
> > > been able to properly understand what I described as a difference in
> > > workflow compared to most of the other architecture.
> > >
> > > Therefore, for the two reasons explained above, I'd like to hear from
> > > people actively involved in MM. If this is not what you want, feel free
> > > to throw it away. Otherwise I'd be happy to write more on this and other
> > > MM topics. I'm looking forward for comments on this small work.
> > >
> > >  Documentation/mm/page_tables.rst | 87 ++++++++++++++++++++++++++++++++
> > >  1 file changed, 87 insertions(+)
> > >
> > > diff --git a/Documentation/mm/page_tables.rst b/Documentation/mm/page_tables.rst
> > > index 7840c1891751..2be56f50c88f 100644
> > > --- a/Documentation/mm/page_tables.rst
> > > +++ b/Documentation/mm/page_tables.rst
> > > @@ -152,3 +152,90 @@ Page table handling code that wishes to be architecture-neutral, such as the
> > >  virtual memory manager, will need to be written so that it traverses all of the
> > >  currently five levels. This style should also be preferred for
> > >  architecture-specific code, so as to be robust to future changes.
> > >
> > > +
> > > +
> > > +MMU, TLB, and Page Faults
> > > +=========================
> > > +
> > > +The Memory Management Unit (MMU) is a hardware component that handles virtual to
> > > +physical address translations. It uses a relatively small cache in hardware
> >
> > It may use a relatively...
> > (I doubt Linux supports anything that doesn't have a TLB but they aren't
> > required by some architectures - just a performance optimization that you
> > 'can' add to an implementation.)
>
> Oh, I didn't know that Linux supports non-MMU architectures. However I suspect
> that the vast majority have MMU and TLB. Is it correct?

Yes - most do. Note that you can have an MMU without a TLB as well (or turn
the TLB off) though I'd assume that anyone doing that is half way through
implementing the MMU and will add one later given how shockingly bad
performance is without a TLB.

Whilst we are here, if talking about TLBs should consider also mentioning
Page Walk Caches which cache the result of part of the MMU page table walk
and are also flushed by (some) TLB Invalidations.

>
> I'll change the statement to "it may use, and in the vast majority of
> supported architecture it indeed uses [...]". How about this? Is it not yet
> what you meant?

I'd flip that around and avoid mention of architectures - what matters here
is implementations of those architectures, not the architectures themselves
(for instance RISC-V and ARM v8.5 are architectures, but they allow a lot of
flexibility on implementation - including what translation caches exist).

Also plural - there are normally multiple levels of TLB similar to how we have
L1 and L2 cache with different characteristics. Can also be multiple parallel
TLBs supporting different sizes of page etc. This stuff gets really complex
on a modern system.

Perhaps:
"In most supported CPUs, the MMU uses relatively small hardware caches called
the Translation Lookaside Buffers (TLBs) and Page Walk Caches to speed up
these translations."

Probably need to describe a Page Walk Cache in more detail though as not as
obvious a concept as a TLB.

>
> > > +called the Translation Lookaside Buffer (TLB) to speed up these translations.
> > > +When a process wants to access a memory location, the CPU provides a virtual
> > > +address to the MMU, which then uses the TLB to quickly find the corresponding
> > > +physical address.
> > > +
> > > +However, sometimes the MMU can't find a valid translation in the TLB. This
> > > +could be because the process is trying to access a range of memory that it's not
> > > +allowed to, or because the memory hasn't been loaded into RAM yet.
> >
> > It might not find it because this is the first attempt to do the translation and
> > the MMU hasn't filled the TLB entry yet, or a capacity eviction has happened.
>
> I thought that "[...] hasn't been loaded into RAM yet" would have covered all
> cases comprising lazy allocation, copy-on-write, and swapped out pages. I
> talked about the first two later in the text, but I forgot to speak about
> swapped out page frames to persistent storage so I'll add it in the next
> version.
>
> However, I am thinking that it is not the TLB misses that may cause an exception
> to fault in memory, but it is the MMU itself if not able to fill the TLB with
> the content of allocated page tables. If you confirm so, I'll need to rewrite
> the first introductory paragraphs. Can you please confirm?

That's correct - what matters is the MMU can't perform the translation
(either via cache or via page walk.)

As an aside there are software managed TLB architectures in which the TLB
is always filled by software but we can probably ignore those :)

>
> > Basically failure to find it in the TLB doesn't mean we get a page fault
> > (unless you are on an ancient architecture where TLB entries are software
> > filled which is definitely not the case for most modern ones).
>
> Let me summarize so that you can confirm or deny whether or not I
> understood...
>
> 1) TLB misses don't cause page faults unless MMU is not able to find the
> entries in the hierarchy of page tables. If it finds the entries it
> transparently refills the TLB buffer with the found translations.

In most architectures that's right. (It's computer arch - if there is a way to
do something differently, someone will have done it).

>
> 2) Page faults happen only if MMU, after walking the hierarchy, cannot yet
> find any suitable translation.

Yes, but suitable translation includes things like permissions faults and
related (the boundary there gets blurry between access control and tracking
of access via things like dirty bits in the page table)

>
> > > When this
> > >
> > > +happens, the MMU triggers a page fault, which is a type of interrupt that
> >
> > Hmm. Whilst similar to an interrupt I'd argue that it's not one..
>
> 3) I shouldn't define it as an "interrupt" because it technically is not. How
> about "exception" or "software exception"?

I'd just go with exception probably.

>
> > > +signals the CPU to pause the current process and run a special function to
> > > +handle the fault.
> >
> > ...
> >
> > Jonathan
>
> I don't see any other comments on the second part of the RFC. Does it mean
> that the second part is OK from your POV?
>
> It would be of great help if you could set aside some more minutes and clear
> the doubts I just expressed and answer the questions I asked :-)
>
> Thanks for the comments,
>
> Fabio
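The point above, that a "suitable translation" also has to pass permission checks, can be sketched in a few lines of userspace C. This is only an illustration. The flag names, the three-way result, and `toy_check_access()` are hypothetical, real page table entry formats differ per architecture, and the sketch deliberately ignores accessed/dirty tracking, which is where the thread notes the boundary gets blurry.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical PTE flag bits, invented for this sketch. */
#define TOY_PTE_PRESENT  (1u << 0)
#define TOY_PTE_WRITABLE (1u << 1)
#define TOY_PTE_USER     (1u << 2)
#define TOY_PTE_ALL      (TOY_PTE_PRESENT | TOY_PTE_WRITABLE | TOY_PTE_USER)

enum toy_access_result {
	TOY_ACCESS_OK,          /* the translation is usable for this access */
	TOY_FAULT_NOT_PRESENT,  /* no mapping at all */
	TOY_FAULT_PERMISSION,   /* a mapping exists but forbids this access */
};

/* Decide whether an access may use a mapping with the given flags. */
static enum toy_access_result toy_check_access(unsigned int pte_flags,
					       bool is_write, bool from_user)
{
	/* Reject flag bits this toy model does not define. */
	assert((pte_flags & ~TOY_PTE_ALL) == 0);

	if (!(pte_flags & TOY_PTE_PRESENT))
		return TOY_FAULT_NOT_PRESENT;
	if (from_user && !(pte_flags & TOY_PTE_USER))
		return TOY_FAULT_PERMISSION;
	if (is_write && !(pte_flags & TOY_PTE_WRITABLE))
		return TOY_FAULT_PERMISSION;
	return TOY_ACCESS_OK;
}
```

Under these assumptions, a user-mode write to a present, read-only, user-accessible page yields `TOY_FAULT_PERMISSION` even though the walk found an entry. That is the case a copy-on-write handler would typically be interested in.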
diff --git a/Documentation/mm/page_tables.rst b/Documentation/mm/page_tables.rst
index 7840c1891751..2be56f50c88f 100644
--- a/Documentation/mm/page_tables.rst
+++ b/Documentation/mm/page_tables.rst
@@ -152,3 +152,90 @@ Page table handling code that wishes to be architecture-neutral, such as the
 virtual memory manager, will need to be written so that it traverses all of the
 currently five levels. This style should also be preferred for
 architecture-specific code, so as to be robust to future changes.
+
+
+MMU, TLB, and Page Faults
+=========================
+
+The Memory Management Unit (MMU) is a hardware component that handles virtual to
+physical address translations. It uses a relatively small cache in hardware
+called the Translation Lookaside Buffer (TLB) to speed up these translations.
+When a process wants to access a memory location, the CPU provides a virtual
+address to the MMU, which then uses the TLB to quickly find the corresponding
+physical address.
+
+However, sometimes the MMU can't find a valid translation in the TLB. This
+could be because the process is trying to access a range of memory that it's not
+allowed to, or because the memory hasn't been loaded into RAM yet. When this
+happens, the MMU triggers a page fault, which is a type of interrupt that
+signals the CPU to pause the current process and run a special function to
+handle the fault.
+
+One cause of page faults is due to bugs (or maliciously crafted addresses) and
+happens when a process tries to access a range of memory that it doesn't have
+permission to. This could be because the memory is reserved for the kernel or
+for another process, or because the process is trying to write to a read-only
+section of memory. When this happens, the kernel sends a Segmentation Fault
+(SIGSEGV) signal to the process, which usually causes the process to terminate.
+
+An expected and more common cause of page faults is "lazy allocation". This is
+a technique used by the Kernel to improve memory efficiency and reduce
+footprint. Instead of allocating physical memory to a process as soon as it's
+requested, the kernel waits until the process actually tries to use the memory.
+This can save a significant amount of memory in cases where a process requests
+a large block but only uses a small portion of it.
+
+A related technique is "Copy-on-Write" (COW), where the Kernel allows multiple
+processes to share the same physical memory as long as they're only reading
+from it. If a process tries to write to the shared memory, the kernel triggers
+a page fault and allocates a separate copy of the memory for the process. This
+allows the kernel to save memory and avoid unnecessary data copying and, by
+doing so, it reduces latency.
+
+Now, let's see how the Linux kernel handles these page faults:
+
+1. For most architectures, `do_page_fault()` is the primary interrupt handler
+   for page faults. It delegates the actual handling of the page fault to
+   `handle_mm_fault()`. This function checks the cause of the page fault and
+   takes the appropriate action, such as loading the required page into
+   memory, granting the process the necessary permissions, or sending a
+   SIGSEGV signal to the process.
+
+2. In the specific case of the x86 architecture, the interrupt handler is
+   defined by the `DEFINE_IDTENTRY_RAW_ERRORCODE()` macro, which calls
+   `handle_page_fault()`. This function then calls either
+   `do_user_addr_fault()` or `do_kern_addr_fault()`, depending on whether
+   the fault occurred in user space or kernel space. Both of these functions
+   eventually lead to `handle_mm_fault()`, similar to the workflow in other
+   architectures.
+
+`handle_mm_fault()` (likely) ends up calling `__handle_mm_fault()` to carry
+out the actual work of allocation of the page tables. It works by using
+several functions to find the entry's offsets of the 4 - 5 layers of tables
+and allocate the tables it needs to. The functions that look for the offset
+have names like `*_offset()`, where the "*" is for pgd, p4d, pud, pmd, pte;
+instead the functions to allocate the corresponding tables, layer by layer,
+are named `*_alloc`, with the above mentioned convention to name them after
+the corresponding types of tables in the hierarchy.
+
+At the very end of the walk with allocations, if it didn't return errors,
+`__handle_mm_fault()` finally calls `handle_pte_fault()`, which via
+`do_fault()` performs one of `do_read_fault()`, `do_cow_fault()`,
+`do_shared_fault()`. "read", "cow", "shared" give hints about the reasons
+and the kind of fault it's handling.
+
+The actual implementation of the workflow is very complex. Its design allows
+Linux to handle page faults in a way that is tailored to the specific
+characteristics of each architecture, while still sharing a common overall
+structure.
+
+To conclude this brief overview from very high altitude of how Linux handles
+page faults, let's add that page faults handler can be disabled and enabled
+respectively with `pagefault_disable()` and `pagefault_enable()`.
+
+Several code path make use of the latter two functions because they need to
+disable traps into the page faults handler, mostly to prevent deadlocks.[1]
+
+[1] mm/userfaultfd: Replace kmap/kmap_atomic() with kmap_local_page()
+https://lore.kernel.org/all/20221025220136.2366143-1-ira.weiny@intel.com/
+
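As a companion to the patch's description of the `*_offset()` helpers, the userspace sketch below shows how a virtual address can be split into one table index per level. It is not the kernel's implementation: the real helpers return pointers to table entries, and the shifts and widths are architecture-specific. The sketch assumes the layout of x86-64 with 5-level paging and 4 KiB pages (9 index bits per level above a 12-bit page offset), and all `toy_*` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: x86-64, 5-level paging, 4 KiB pages. */
#define TOY_PAGE_SHIFT     12u
#define TOY_BITS_PER_LEVEL 9u
#define TOY_LEVEL_MASK     ((1u << TOY_BITS_PER_LEVEL) - 1u)

/* Levels ordered from the leaf (PTE) up to the root (PGD). */
enum toy_level {
	TOY_PTE,
	TOY_PMD,
	TOY_PUD,
	TOY_P4D,
	TOY_PGD,
	TOY_NUM_LEVELS,
};

/* Index into the table at the given level for this virtual address. */
static unsigned int toy_level_index(uint64_t vaddr, enum toy_level level)
{
	unsigned int shift;

	assert((unsigned int)level < (unsigned int)TOY_NUM_LEVELS);

	shift = TOY_PAGE_SHIFT + TOY_BITS_PER_LEVEL * (unsigned int)level;
	return (unsigned int)((vaddr >> shift) & TOY_LEVEL_MASK);
}

/* Byte offset of the address inside its 4 KiB page. */
static unsigned int toy_page_offset(uint64_t vaddr)
{
	return (unsigned int)(vaddr & ((1u << TOY_PAGE_SHIFT) - 1u));
}
```

A walk in this model would call `toy_level_index()` once per level, from `TOY_PGD` down to `TOY_PTE`, using each index to pick the entry that points at the next table. This mirrors the top-down order in which the patch lists pgd, p4d, pud, pmd and pte.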
Extend page_tables.rst by adding a section about the role of MMU and TLB
in translating between virtual addresses and physical page frames.
Furthermore explain the concept behind Page Faults and how the Linux
kernel handles TLB misses. Finally briefly explain how and why to disable
the page faults handler.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
---

v1->v2: Add further information about lower level functions in the page
fault handler and add information about how and why to disable / enable
the page fault handler (provided a link to a Ira's patch that make use
of pagefault_disable() to prevent deadlocks).

This is an RFC PATCH because of two reasons:

1) I've heard that there is consensus about the need to revise and
extend the MM documentation, but I'm not sure about whether or not
developers need these kind of introductory information.

2) While preparing this little patch I decided to take a quicj look at
the code and found out it currently is not how I thought I remembered
it. I'm especially speaking about the x86 case. I'm not sure that I've
been able to properly understand what I described as a difference in
workflow compared to most of the other architecture.

Therefore, for the two reasons explained above, I'd like to hear from
people actively involved in MM. If this is not what you want, feel free
to throw it away. Otherwise I'd be happy to write more on this and other
MM topics. I'm looking forward for comments on this small work.

 Documentation/mm/page_tables.rst | 87 ++++++++++++++++++++++++++++++++
 1 file changed, 87 insertions(+)