[v5] Documentation/mm: Initial page table documentation

Message ID	20230614072548.996940-1-linus.walleij@linaro.org (mailing list archive)
State	New
Headers
Return-Path: <owner-linux-mm@kvack.org>
From: Linus Walleij <linus.walleij@linaro.org>
To: Andrew Morton <akpm@linux-foundation.org>, Jonathan Corbet <corbet@lwn.net>
Cc: linux-mm@kvack.org, linux-doc@vger.kernel.org, Linus Walleij <linus.walleij@linaro.org>, Matthew Wilcox <willy@infradead.org>, Randy Dunlap <rdunlap@infradead.org>, Mike Rapoport <rppt@kernel.org>, Jonathan Cameron <Jonathan.Cameron@huawei.com>, Bagas Sanjaya <bagasdotme@gmail.com>
Subject: [PATCH v5] Documentation/mm: Initial page table documentation
Date: Wed, 14 Jun 2023 09:25:48 +0200
Message-Id: <20230614072548.996940-1-linus.walleij@linaro.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Linus Walleij June 14, 2023, 7:25 a.m. UTC

This is based on an earlier blog post at people.kernel.org,
it describes the concepts about page tables that were hardest
for me to grasp when dealing with them for the first time,
such as the prevalent three-letter acronyms pfn, pgd, p4d,
pud, pmd and pte.

I don't know if this is what people want, but it's what I would
have wanted. The wording, introduction, choice of initial subjects
and choice of style is mine.

I discussed at one point with Mike Rapoport to bring this into
the kernel documentation, so here is a small proposal.

The current form is augmented in response to feedback from
Mike Rapoport, Matthew Wilcox, Jonathan Cameron, Kuan-Ying Lee,
Randy Dunlap and Bagas Sanjaya.

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Mike Rapoport <rppt@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Link: https://people.kernel.org/linusw/arm32-page-tables
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
---
ChangeLog v4->v5:
- Drop the word "target" from the paragraph about virtual
  addresses as pointed out by Matthew Wilcox.
- Drop "program counter" mention in paragraph about physical and
  virtual addresses as pointed out by Matthew Wilcox.
- Update the changelog below to reflect who provided which
  feedback so everybody can see that their feedback is being
  taken into account.
- Collect Mike Rapoport's Review tag.
ChangeLog v3->v4:
- Singularis to pluralis fix pointed out by Jonathan Cameron
- Reword the origin story about hierarchical page tables a bit
  inspired by the input from Mike Rapoport.
ChangeLog v2->v3:
- Fix the page size example, also have examples for both 4K and
  16K pages since people will confront these in response to
  feedback from Kuan-Ying Lee.
- Add a section explaining a bit why we have hierarchical
  page tables at all.
ChangeLog v1->v2:
- Fixed speling mistakes
- Copyedit the paragraph on page frame numbers in response
  to feedback from Matthew Wilcox.
- Reverse the arrows in the page table hierarchy illustration in
  response to feedback from Matthew Wilcox.
- Reverse the order of description of the page hierarchy levels in
  response to feedback from Matthew Wilcox.
- Create a new section for folding
- Emphasize that architectures should try to be page hierarchy
  neutral in response to feedback from Mike Rapoport.
- Trying to better describe the fact that the lowest page table PTE
  is called like that for historical reasons, in response to
  several comments on earlier blog posts on the subject.
---
 Documentation/mm/page_tables.rst | 149 +++++++++++++++++++++++++++++++
 1 file changed, 149 insertions(+)

Jonathan Corbet June 16, 2023, 2:14 p.m. UTC | #1

Linus Walleij <linus.walleij@linaro.org> writes:

> This is based on an earlier blog post at people.kernel.org,
> it describes the concepts about page tables that were hardest
> for me to grasp when dealing with them for the first time,
> such as the prevalent three-letter acronyms pfn, pgd, p4d,
> pud, pmd and pte.
>
> I don't know if this is what people want, but it's what I would
> have wanted. The wording, introduction, choice of initial subjects
> and choice of style is mine.
>
> I discussed at one point with Mike Rapoport to bring this into
> the kernel documentation, so here is a small proposal.
>
> The current form is augmented in response to feedback from
> Mike Rapoport, Matthew Wilcox, Jonathan Cameron, Kuan-Ying Lee,
> Randy Dunlap and Bagas Sanjaya.
>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Randy Dunlap <rdunlap@infradead.org>
> Reviewed-by: Mike Rapoport <rppt@kernel.org>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
> Link: https://people.kernel.org/linusw/arm32-page-tables
> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>

So I think this has gone around enough and have decided to pick it up.
If there are parts that people aren't happy with, we can surely fix them
up as we go on.  Meanwhile, it's good to see an effort to fill in one of
the gaps here, thanks for doing it.

Thanks,

jon

Fabio M. De Francesco June 18, 2023, 1:16 p.m. UTC | #2

On Wednesday, 14 June 2023 09:25:48 CEST, Linus Walleij wrote:
> This is based on an earlier blog post at people.kernel.org,
> it describes the concepts about page tables that were hardest
> for me to grasp when dealing with them for the first time,
> such as the prevalent three-letter acronyms pfn, pgd, p4d,
> pud, pmd and pte.
> 
> I don't know if this is what people want, but it's what I would
> have wanted. The wording, introduction, choice of initial subjects
> and choice of style is mine.
> 
> I discussed at one point with Mike Rapoport to bring this into
> the kernel documentation, so here is a small proposal.
> 
> The current form is augmented in response to feedback from
> Mike Rapoport, Matthew Wilcox, Jonathan Cameron, Kuan-Ying Lee,
> Randy Dunlap and Bagas Sanjaya.
> 
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Randy Dunlap <rdunlap@infradead.org>
> Reviewed-by: Mike Rapoport <rppt@kernel.org>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
> Link: https://people.kernel.org/linusw/arm32-page-tables
> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
> ---

I am writing to express my dissent regarding the proposal to add basic 
information about the role of hierarchical (multi-level) page tables in 
mapping virtual memory to physical page frames. While I understand the 
importance of documentation, I believe that including such fundamental 
operating system concepts in the specialized Linux kernel documentation would 
be redundant and unnecessary.

The proposed addition appears to be a combination of trivia and a basic 
Operating Systems I course that one might encounter during their second year 
as an undergraduate student studying Computer Science or Computer Engineering. 

AFAIK, these concepts are already taught extensively to individuals pursuing a 
B.Sc. degree in Computer Science or a related field, both in Italy, where I 
live, and elsewhere. Therefore, it seems unlikely that Linux kernel developers 
would be unfamiliar with such fundamental topics, such as the mapping of 
virtual memory to physical page frames using multi-level (hierarchical) page 
tables.

I question the target audience of this documentation. How can we expect any 
developer working with Linux to be unaware of such basic concepts? Adding 
documentation about these foundational concepts would create a precedent, 
potentially leading to further documentation on other fundamental abstractions 
like "task," "multi-threading," and "scheduling" – concepts that are integral 
to kernel management. The inclusion of such basic topics could quickly clutter 
up the specialized Linux kernel documentation.

Let us not forget that there is a wealth of resources available outside the 
Linux kernel documentation. Books on OS theory or online courses from esteemed 
universities can easily provide individuals with the necessary knowledge on 
these fundamental concepts. Encouraging developers to explore these external 
resources fosters a culture of continuous learning and self-improvement, 
benefiting the entire Linux development community.

In conclusion, I respectfully oppose the proposal to add basic operating 
system concepts, such as the hierarchical page tables, to the official Linux 
kernel documentation. I believe that such information is readily accessible 
through existing resources and that the specialized documentation should focus 
on advanced topics and unique aspects specific to the Linux kernel.

Thank you for considering my perspective.

Regards,

Fabio M. De Francesco

P.S.: The only parts I find interesting enough are those regarding the names
of the types and a few other bits of information, only because these are indeed
Linux kernel focused and may not be found in the above-mentioned wealth of
resources available outside, like Tanenbaum's and Silberschatz's books
(however, I'm not entirely sure they lack that information).

> ChangeLog v4->v5:
> - Drop the word "target" from the paragraph about virtual
>   addresses as pointed out by Matthew Wilcox.
> - Drop "program counter" mention in paragraph about physical and
>   virtual addresses as pointed out by Matthew Wilcox.
> - Update the changelog below to reflect who provided which
>   feedback so everybody can see that their feedback is being
>   taken into account.
> - Collect Mike Rapoport's Review tag.
> ChangeLog v3->v4:
> - Singularis to pluralis fix pointed out by Jonathan Cameron
> - Reword the origin story about hierarchical page tables a bit
>   inspired by the input from Mike Rapoport.
> ChangeLog v2->v3:
> - Fix the page size example, also have examples for both 4K and
>   16K pages since people will confront these in response to
>   feedback from Kuan-Ying Lee.
> - Add a section explaining a bit why we have hierarchical
>   page tables at all.
> ChangeLog v1->v2:
> - Fixed speling mistakes
> - Copyedit the paragraph on page frame numbers in response
>   to feedback from Matthew Wilcox.
> - Reverse the arrows in the page table hierarchy illustration in
>   response to feedback from Matthew Wilcox.
> - Reverse the order of description of the page hierarchy levels in
>   response to feedback from Matthew Wilcox.
> - Create a new section for folding
> - Emphasize that architectures should try to be page hierarchy
>   neutral in response to feedback from Mike Rapoport.
> - Trying to better describe the fact that the lowest page table PTE
>   is called like that for historical reasons, in response to
>   several comments on earlier blog posts on the subject.
> ---
>  Documentation/mm/page_tables.rst | 149 +++++++++++++++++++++++++++++++
>  1 file changed, 149 insertions(+)
> 
> diff --git a/Documentation/mm/page_tables.rst b/Documentation/mm/page_tables.rst
> index 96939571d7bc..7840c1891751 100644
> --- a/Documentation/mm/page_tables.rst
> +++ b/Documentation/mm/page_tables.rst
> @@ -3,3 +3,152 @@
>  ===========
>  Page Tables
>  ===========
> +
> +Paged virtual memory was invented along with virtual memory as a concept in
> +1962 on the Ferranti Atlas Computer which was the first computer with paged
> +virtual memory. The feature migrated to newer computers and became a de facto
> +feature of all Unix-like systems as time went by. In 1985 the feature was
> +included in the Intel 80386, which was the CPU Linux 1.0 was developed on.
> +
> +Page tables map virtual addresses as seen by the CPU into physical addresses
> +as seen on the external memory bus.
> +
> +Linux defines page tables as a hierarchy which is currently five levels in
> +height. The architecture code for each supported architecture will then
> +map this to the restrictions of the hardware.
> +
> +The physical address corresponding to the virtual address is often referenced
> +by the underlying physical page frame. The **page frame number** or **pfn**
> +is the physical address of the page (as seen on the external memory bus)
> +divided by `PAGE_SIZE`.
> +
> +Physical memory address 0 will be *pfn 0* and the highest pfn will be
> +the last page of physical memory the external address bus of the CPU can
> +address.
> +
> +With a page granularity of 4KB and an address range of 32 bits, pfn 0 is at
> +address 0x00000000, pfn 1 is at address 0x00001000, pfn 2 is at 0x00002000
> +and so on until we reach pfn 0xfffff at 0xfffff000. With 16KB pages pfns are
> +at 0x00004000, 0x00008000 ... 0xffffc000 and pfn goes from 0 to 0x3ffff.
> +
> +As you can see, with 4KB pages the page base address uses bits 12-31 of the
> +address, and this is why `PAGE_SHIFT` in this case is defined as 12 and
> +`PAGE_SIZE` is usually defined in terms of the page shift as `(1 << PAGE_SHIFT)`.
> +
> +Over time a deeper hierarchy has been developed in response to increasing memory
> +sizes. When Linux was created, 4KB pages and a single page table called
> +`swapper_pg_dir` with 1024 entries were used, covering 4MB which coincided with
> +the fact that Torvalds's first computer had 4MB of physical memory. Entries in
> +this single table were referred to as *PTE*:s - page table entries.
> +
> +The software page table hierarchy reflects the fact that page table hardware has
> +become hierarchical and that in turn is done to save page table memory and
> +speed up mapping.
> +
> +One could of course imagine a single, linear page table with enormous amounts
> +of entries, breaking down the whole memory into single pages. Such a page table
> +would be very sparse, because large portions of the virtual memory usually
> +remain unused. By using hierarchical page tables large holes in the virtual
> +address space do not waste valuable page table memory, because it will suffice
> +to mark large areas as unmapped at a higher level in the page table hierarchy.
> +
> +Additionally, on modern CPUs, a higher level page table entry can point directly
> +to a physical memory range, which allows mapping a contiguous range of several
> +megabytes or even gigabytes in a single high-level page table entry, taking
> +shortcuts in mapping virtual memory to physical memory: there is no need to
> +traverse deeper in the hierarchy when you find a large mapped range like this.
> +
> +The page table hierarchy has now developed into this::
> +
> +  +-----+
> +  | PGD |
> +  +-----+
> +     |
> +     |   +-----+
> +     +-->| P4D |
> +         +-----+
> +            |
> +            |   +-----+
> +            +-->| PUD |
> +                +-----+
> +                   |
> +                   |   +-----+
> +                   +-->| PMD |
> +                       +-----+
> +                          |
> +                          |   +-----+
> +                          +-->| PTE |
> +                              +-----+
> +
> +
> +Symbols on the different levels of the page table hierarchy have the following
> +meaning beginning from the bottom:
> +
> +- **pte**, `pte_t`, `pteval_t` = **Page Table Entry** - mentioned earlier.
> +  The *pte* is an array of `PTRS_PER_PTE` elements of the `pteval_t` type, each
> +  mapping a single page of virtual memory to a single page of physical memory.
> +  The architecture defines the size and contents of `pteval_t`.
> +
> +  A typical example is that the `pteval_t` is a 32- or 64-bit value with the
> +  upper bits being a **pfn** (page frame number), and the lower bits being some
> +  architecture-specific bits such as memory protection.
> +
> +  The **entry** part of the name is a bit confusing because while in Linux 1.0
> +  this did refer to a single page table entry in the single top level page
> +  table, it was retrofitted to be an array of mapping elements when two-level
> +  page tables were first introduced, so the *pte* is the lowermost page
> +  *table*, not a page table *entry*.
> +
> +- **pmd**, `pmd_t`, `pmdval_t` = **Page Middle Directory**, the hierarchy right
> +  above the *pte*, with `PTRS_PER_PMD` references to the *pte*:s.
> +
> +- **pud**, `pud_t`, `pudval_t` = **Page Upper Directory** was introduced after
> +  the other levels to handle 4-level page tables. It is potentially unused,
> +  or *folded* as we will discuss later.
> +
> +- **p4d**, `p4d_t`, `p4dval_t` = **Page Level 4 Directory** was introduced to
> +  handle 5-level page tables after the *pud* was introduced. Now it was clear
> +  that we needed to replace *pgd*, *pmd*, *pud* etc with a figure indicating the
> +  directory level and that we cannot go on with ad hoc names any more. This
> +  is only used on systems which actually have 5 levels of page tables, otherwise
> +  it is folded.
> +
> +- **pgd**, `pgd_t`, `pgdval_t` = **Page Global Directory** - the Linux kernel
> +  main page table handling the PGD for the kernel memory is still found in
> +  `swapper_pg_dir`, but each userspace process in the system also has its own
> +  memory context and thus its own *pgd*, found in `struct mm_struct` which
> +  in turn is referenced in each `struct task_struct`. So tasks have memory
> +  context in the form of a `struct mm_struct` and this in turn has a
> +  `pgd_t *pgd` pointer to the corresponding page global directory.
> +
> +To repeat: each level in the page table hierarchy is an *array of pointers*, so
> +the **pgd** contains `PTRS_PER_PGD` pointers to the next level below, **p4d**
> +contains `PTRS_PER_P4D` pointers to **pud** items and so on. The number of
> +pointers on each level is architecture-defined.::
> +
> +        PMD
> +  --> +-----+           PTE
> +      | ptr |-------> +-----+
> +      | ptr |-        | ptr |-------> PAGE
> +      | ptr | \       | ptr |
> +      | ptr |  \        ...
> +      | ... |   \
> +      | ptr |    \         PTE
> +      +-----+     +----> +-----+
> +                         | ptr |-------> PAGE
> +                         | ptr |
> +                           ...
> +
> +
> +Page Table Folding
> +==================
> +
> +If the architecture does not use all the page table levels, they can be *folded*
> +which means skipped, and all operations performed on page tables will be
> +compile-time augmented to just skip a level when accessing the next lower
> +level.
> +
> +Page table handling code that wishes to be architecture-neutral, such as the
> +virtual memory manager, will need to be written so that it traverses all of the
> +currently five levels. This style should also be preferred for
> +architecture-specific code, so as to be robust to future changes.
> --
> 2.40.1
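
For readers following the pfn paragraphs in the quoted patch, the arithmetic can be checked with a few lines of Python. This is only a rough sketch, not kernel code: the helper names (`page_size`, `addr_to_pfn`, `pfn_to_addr`, `max_pfn`) are made up for illustration, and the shift values 12 and 14 are assumed to correspond to the 4KB and 16KB page sizes the patch uses as examples.

```python
# Illustrative only: hypothetical helpers mirroring the PAGE_SHIFT / PAGE_SIZE
# relationship described in the patch. None of these names exist in the kernel.

def page_size(page_shift: int) -> int:
    """PAGE_SIZE expressed in terms of the shift: 1 << PAGE_SHIFT."""
    return 1 << page_shift


def addr_to_pfn(phys_addr: int, page_shift: int) -> int:
    """Page frame number: the physical address divided by PAGE_SIZE."""
    return phys_addr >> page_shift


def pfn_to_addr(pfn: int, page_shift: int) -> int:
    """Base physical address of the page frame with the given pfn."""
    return pfn << page_shift


def max_pfn(addr_bits: int, page_shift: int) -> int:
    """Highest pfn reachable with an address bus of addr_bits bits."""
    return (1 << (addr_bits - page_shift)) - 1


if __name__ == "__main__":
    # 32-bit physical addresses, 4KB (shift 12) and 16KB (shift 14) pages.
    for shift in (12, 14):
        top = max_pfn(32, shift)
        print(
            f"PAGE_SIZE={page_size(shift):#x} "
            f"last pfn={top:#x} at {pfn_to_addr(top, shift):#010x}"
        )
```

Run as a script, this reports a last pfn of 0xfffff at 0xfffff000 for 4KB pages and a last pfn of 0x3ffff at 0xffffc000 for 16KB pages.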

Jonathan Corbet June 18, 2023, 6:54 p.m. UTC | #3

"Fabio M. De Francesco" <fmdefrancesco@gmail.com> writes:

> I question the target audience of this documentation. How can we expect any 
> developer working with Linux to be unaware of such basic concepts? Adding 
> documentation about these foundational concepts would create a precedent, 
> potentially leading to further documentation on other fundamental abstractions 
> like "task," "multi-threading," and "scheduling" – concepts that are integral 
> to kernel management. The inclusion of such basic topics could quickly clutter 
> up the specialized Linux kernel documentation.

Someday, if we find ourselves in the position of having too much
documentation, we can entertain patches to clean out material that is
deemed to be too elementary for kernel developers.  Before then, though,
if we are worried about clutter, we may want to put more effort into
addressing the large amount of duplicated and obsolete documentation in
the kernel now.

Until then, I see no reason to oppose the addition of material that,
even if you don't personally find it helpful, may indeed be helpful to
developers trying to come up to speed on just what the kernel is doing.

Thanks,

jon

Linus Walleij June 19, 2023, 8:16 a.m. UTC | #4

On Sun, Jun 18, 2023 at 3:16 PM Fabio M. De Francesco
<fmdefrancesco@gmail.com> wrote:

> I am writing to express my dissent regarding the proposal to add basic
> information about the role of hierarchical (multi-level) page tables in
> mapping virtual memory to physical page frames.

I have understood that some think this, perhaps the intro could use
some dieting, what about sending a patch to make it look like you
want it to?

> The proposed addition appears to be a combination of trivia and a basic
> Operating Systems I course that one might encounter during their second year
> as an undergraduate student studying Computer Science or Computer Engineering.
>
> AFAIK, these concepts are already taught extensively to individuals pursuing a
> B.Sc. degree in Computer Science or a related field, both in Italy, where I
> live, and elsewhere.

Knowing the audience is always the hard part of wording technical
documentation, not the contents per se. I might fail, I might be slightly off,
my co-developers are there to help.

Assuming that newcomers to the Linux kernel have formal academic
background or specifically operating system education is a bit thick
IMO, suffice to read pages 108-111 of Glyn Moody's book
"Rebel Code" about the background of the network maintainer.
There are a whole bunch of random people attracted to Linux
development.

Memory management may be different though? Mel having written
his PhD thesis about the Linux VMM and all might set the bar higher
for contributors. I don't know really. But the documentation is not there
just for the MM contributors, as the MM primitives are found sprinkled
all over the kernel.

Yours,
Linus Walleij

Fabio M. De Francesco June 21, 2023, 1:10 a.m. UTC | #5

On Monday, 19 June 2023 10:16:56 CEST, Linus Walleij wrote:
> On Sun, Jun 18, 2023 at 3:16 PM Fabio M. De Francesco
> 
> <fmdefrancesco@gmail.com> wrote:
> > I am writing to express my dissent regarding the proposal to add basic
> > information about the role of hierarchical (multi-level) page tables in
> > mapping virtual memory to physical page frames.

[...]

> Assuming that newcomers to the Linux kernel have formal academic
> background or specifically operating system education is a bit thick
> IMO, suffice to read pages 108-111 of Glyn Moody's book
> "Rebel Code" about the background of the network maintainer.
> There are a whole bunch of random people attracted to Linux
> development.

Linus,

I must admit that I have had a change of heart regarding the necessity of this 
documentation.

This change came about after reading Jon's reply, as well as your own.

However, it wasn't just because of the two of you. It was mainly due to my 
conversations with some colleagues I work with, who hold M.Sc. degrees in 
Computer Science. 

Despite not having a formal background in CS or CE myself, I have taken the 
time to self-teach the subject matter, which I expected them to be well-versed 
in.

To my surprise, they only have a vague understanding of page tables and the 
fact that processes use addresses that may not correspond to physical 
locations. That's about it!

Hence, I now fully support your initiative and want to express my gratitude 
for undertaking this task.

The only thing I would prefer not to see is the historical reference to the 
first implementation of hierarchical page tables. After all, many concepts 
implemented in Linux are derived or adapted from existing knowledge or 
implementations in other kernels. However, I can also understand why you 
prefer to have it as an introduction to the subject.

Once again thanks,

Fabio

Linus Walleij June 21, 2023, 7:35 a.m. UTC | #6

Hi Fabio!

thanks for your reply!

The ways of technical documentation are never easy, but what
we are using right now is the socratic method, dialogue at its best,
which is pretty much the best way I know.

WRT the problem of education:
In gloomy days I have been referring to something I tongue-in-cheek
call "the second software crisis" (not an established term). In
contrast with the first software crisis, which was about the
complexity of software development outgrowing hardware
development, the second software crisis is due to software
developers losing contact with and knowledge of hardware, with big
white spots on their mental map. That is part of what I am
trying to fix here.

Best regards,
Linus Walleij
