
docs/mm: Physical Memory: add "Memory map" section

Message ID 20230906074210.3051751-1-rppt@kernel.org (mailing list archive)
State New

Commit Message

Mike Rapoport Sept. 6, 2023, 7:42 a.m. UTC
From: "Mike Rapoport (IBM)" <rppt@kernel.org>

Briefly describe memory map and add sub-sections for pages, folios and
ptdescs.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
---
 Documentation/mm/physical_memory.rst | 338 ++++++++++++++++++++++++++-
 1 file changed, 332 insertions(+), 6 deletions(-)

Comments

Matthew Wilcox Sept. 6, 2023, 12:21 p.m. UTC | #1
On Wed, Sep 06, 2023 at 10:42:10AM +0300, Mike Rapoport wrote:
> +The basic memory descriptor is called :ref:`struct page <Pages>` and it is
> +essentially a union of several structures, each representing a page frame
> +metadata for a paricular usage.

"each representing page frame metadata".  And "particular".

>  Folios
> -======
> +------
>  
> -.. admonition:: Stub
> +`struct folio` represents a physically, virtually and logically contiguous
> +set of bytes. It is a power-of-two in size, and it is aligned to that same
> +power-of-two. It is at least as large as ``PAGE_SIZE``. If it is in the
> +page cache, it is at a file offset which is a multiple of that
> +power-of-two. It may be mapped into userspace at an address which is at an
> +arbitrary page offset, but its kernel virtual address is aligned to its
> +size.
>  
> -   This section is incomplete. Please list and describe the appropriate fields.
> +`struct folio` occupies several consecutive entries in the memory map and
> +has the following fields:
> +
> +``flags``
> +  Identical to the page flags.
> +
> +``lru``
> +  Least Recently Used list; tracks how recently this folio was used.
> +
> +``mlock_count``
> +  Number of times this folio has been pinned by mlock().
> +
> +``mapping``
> +  The file this page belongs to. Can be pagecache or swapcahe. For
> +  anonymous memory refers to the `struct anon_vma`.
> +
> +``index``
> +  Offset within the file, in units of pages. For anonymous memory, this is
> +  the index from the beginning of the mmap.
> +
> +``private``
> +  Filesystem per-folio data (see folio_attach_private()). Used for
> +  ``swp_entry_t`` if folio is in the swap cache
> +  (i.e. folio_test_swapcache() is true)
> +
> +``_mapcount``
> +  Do not access this member directly. Use folio_mapcount() to find out how
> +  many times this folio is mapped by userspace.
> +
> +``_refcount``
> +  Do not access this member directly. Use folio_ref_count() to find how
> +  many references there are to this folio.
> +
> +``memcg_data``
> +  Memory Control Group data.
> +
> +``_folio_dtor``
> +  Which destructor to use for this folio.
> +
> +``_folio_order``
> +  The allocation order of a folio. Do not use directly, call folio_order().
> +
> +``_entire_mapcount``
> +  How many times the entire folio is mapped as a single unit (for example
> +  by a PMD or PUD entry). Does not include PTE-mapped subpages. This might
> +  be useful for debugging, but to find out how many times the folio is
> +  mapped look at folio_mapcount() or page_mapcount() or total_mapcount()
> +  instead.
> +  Do not use directly, call folio_entire_mapcount().
> +
> +``_nr_pages_mapped``
> +  The total number of times the folio is mapped.
> +  Do not use directly, call folio_mapcount().
> +
> +``_pincount``
> +  Used to track pinning of the folio for DMA.
> +  Do not use directly, call folio_maybe_dma_pinned().
> +
> +``_folio_nr_pages``
> +  The number of pages in the folio.
> +  Do not use directly, call folio_nr_pages().
> +
> +``_hugetlb_subpool``
> +  HugeTLB subpool the folio beongs to.
> +  Do not use directly, use accessor in ``include/linux/hugetlb.h``.
> +
> +``_hugetlb_cgroup``
> +  Memory Control Group data for a HugeTLB folio.
> +  Do not use directly, use accessor in ``include/linux/hugetlb_cgroup.h``.
> +
> +``_hugetlb_cgroup_rsvd``
> +  Memory Control Group data for a HugeTLB folio.
> +  Do not use directly, use accessor in ``include/linux/hugetlb_cgroup.h``.
> +
> +``_hugetlb_hwpoison``
> +  List of failed (hwpoisoned) pages for a HugeTLB folio.
> +  Do not use directly, call raw_hwp_list_head().
> +
> +``_deferred_list``
> +  Folios to be split under memory pressure.

I don't understand why you've done all this instead of linking to the
kernel-doc I wrote.

Mike Rapoport Sept. 6, 2023, 12:52 p.m. UTC | #2
On Wed, Sep 06, 2023 at 01:21:02PM +0100, Matthew Wilcox wrote:
> On Wed, Sep 06, 2023 at 10:42:10AM +0300, Mike Rapoport wrote:
> > +The basic memory descriptor is called :ref:`struct page <Pages>` and it is
> > +essentially a union of several structures, each representing a page frame
> > +metadata for a paricular usage.
> 
> "each representing page frame metadata".  And "particular".

sure
 
> >  Folios
> > -======
> > +------
> >  
> > -.. admonition:: Stub
> > +`struct folio` represents a physically, virtually and logically contiguous
> > +set of bytes. It is a power-of-two in size, and it is aligned to that same
> > +power-of-two. It is at least as large as ``PAGE_SIZE``. If it is in the
> > +page cache, it is at a file offset which is a multiple of that
> > +power-of-two. It may be mapped into userspace at an address which is at an
> > +arbitrary page offset, but its kernel virtual address is aligned to its
> > +size.
> >  
> > -   This section is incomplete. Please list and describe the appropriate fields.
> > +`struct folio` occupies several consecutive entries in the memory map and
> > +has the following fields:
> > +
> > +``flags``
> > +  Identical to the page flags.
> > +
> > +``lru``
> > +  Least Recently Used list; tracks how recently this folio was used.
> > +
> > +``mlock_count``
> > +  Number of times this folio has been pinned by mlock().
> > +
> > +``mapping``
> > +  The file this page belongs to. Can be pagecache or swapcahe. For
> > +  anonymous memory refers to the `struct anon_vma`.
> > +
> > +``index``
> > +  Offset within the file, in units of pages. For anonymous memory, this is
> > +  the index from the beginning of the mmap.
> > +
> > +``private``
> > +  Filesystem per-folio data (see folio_attach_private()). Used for
> > +  ``swp_entry_t`` if folio is in the swap cache
> > +  (i.e. folio_test_swapcache() is true)
> > +
> > +``_mapcount``
> > +  Do not access this member directly. Use folio_mapcount() to find out how
> > +  many times this folio is mapped by userspace.
> > +
> > +``_refcount``
> > +  Do not access this member directly. Use folio_ref_count() to find how
> > +  many references there are to this folio.
> > +
> > +``memcg_data``
> > +  Memory Control Group data.
> > +
> > +``_folio_dtor``
> > +  Which destructor to use for this folio.
> > +
> > +``_folio_order``
> > +  The allocation order of a folio. Do not use directly, call folio_order().
> > +
> > +``_entire_mapcount``
> > +  How many times the entire folio is mapped as a single unit (for example
> > +  by a PMD or PUD entry). Does not include PTE-mapped subpages. This might
> > +  be useful for debugging, but to find out how many times the folio is
> > +  mapped look at folio_mapcount() or page_mapcount() or total_mapcount()
> > +  instead.
> > +  Do not use directly, call folio_entire_mapcount().
> > +
> > +``_nr_pages_mapped``
> > +  The total number of times the folio is mapped.
> > +  Do not use directly, call folio_mapcount().
> > +
> > +``_pincount``
> > +  Used to track pinning of the folio for DMA.
> > +  Do not use directly, call folio_maybe_dma_pinned().
> > +
> > +``_folio_nr_pages``
> > +  The number of pages in the folio.
> > +  Do not use directly, call folio_nr_pages().
> > +
> > +``_hugetlb_subpool``
> > +  HugeTLB subpool the folio beongs to.
> > +  Do not use directly, use accessor in ``include/linux/hugetlb.h``.
> > +
> > +``_hugetlb_cgroup``
> > +  Memory Control Group data for a HugeTLB folio.
> > +  Do not use directly, use accessor in ``include/linux/hugetlb_cgroup.h``.
> > +
> > +``_hugetlb_cgroup_rsvd``
> > +  Memory Control Group data for a HugeTLB folio.
> > +  Do not use directly, use accessor in ``include/linux/hugetlb_cgroup.h``.
> > +
> > +``_hugetlb_hwpoison``
> > +  List of failed (hwpoisoned) pages for a HugeTLB folio.
> > +  Do not use directly, call raw_hwp_list_head().
> > +
> > +``_deferred_list``
> > +  Folios to be split under memory pressure.
> 
> I don't understand why you've done all this instead of linking to the
> kernel-doc I wrote.

We can't have it both in Documentation/core-api/mm-api.rst and here without
Sphinx complaining:

Documentation/mm/physical_memory:561: include/linux/mm_types.h:3: WARNING: Duplicate C declaration, also defined at core-api/mm-api:3.
Declaration is '.. c:struct:: folio'.

Matthew Wilcox Sept. 6, 2023, 2:09 p.m. UTC | #3
On Wed, Sep 06, 2023 at 03:52:14PM +0300, Mike Rapoport wrote:
> On Wed, Sep 06, 2023 at 01:21:02PM +0100, Matthew Wilcox wrote:
> > I don't understand why you've done all this instead of linking to the
> > kernel-doc I wrote.
> 
> We can't have it both in Documentation/core-api/mm-api.rst and here without
> sphinx complaining: 

I didn't say "include it here", I said "link to it".

I don't see the value in documenting all of these structs here.  They
should be documented in the subsystem that actually uses them.  Just a
note that there's per-allocation data stored in struct page should be
enough, I think.

Jonathan Corbet Sept. 6, 2023, 2:41 p.m. UTC | #4
Mike Rapoport <rppt@kernel.org> writes:

> From: "Mike Rapoport (IBM)" <rppt@kernel.org>
>
> Briefly describe memory map and add sub-sections for pages, folios and
> ptdescs.
>
> Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
> ---
>  Documentation/mm/physical_memory.rst | 338 ++++++++++++++++++++++++++-
>  1 file changed, 332 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/mm/physical_memory.rst b/Documentation/mm/physical_memory.rst
> index 531e73b003dd..e3318897bf57 100644
> --- a/Documentation/mm/physical_memory.rst
> +++ b/Documentation/mm/physical_memory.rst
> @@ -343,23 +343,349 @@ Zones
>  
>     This section is incomplete. Please list and describe the appropriate fields.
>  
> +.. _memmap:

Please, let's not clutter the docs with labels that are never used.  We
don't do that with code!

> +Memory map and memory descriptors
> +=================================
> +
> +Every physical page frame in the systam has an associated descriptor which
> +is used to keep track of its status. The collection of these descriptors is
> +called `memory map` and it is arranged in one or more arrays, depending on

*the* memory map

Also, why the `backtick quotes`?  They don't have any particular
meaning to Sphinx here.

> +the selection of the memory model. Memory models are described in more
> +detail in Documentation/mm/memory-model.rst
> +
> +The basic memory descriptor is called :ref:`struct page <Pages>` and it is
> +essentially a union of several structures, each representing a page frame
> +metadata for a paricular usage.
> +
> +In many cases the entries in the memory map are not treated as `struct page`,
> +but rather as different types of descriptors such as :ref:`struct folio
> +<Folios>`, :ref:`struct ptdesc <ptdesc>` or `struct slab`.

I would hope that just saying "struct folio" would do the right thing;
did that not happen for you?

>  .. _pages:

I'd drop this label too

>  Pages
> -=====
> +-----
>  
> -.. admonition:: Stub
> +`struct page` tracks status of a single physical page frame. This structure

tracks *the* status

> +is a mixture of several types that represent metadata for different uses of
> +a page frame. To save memory these types partially overlap so the `struct
> +page` definition in ``include/linux/mm_types.h`` mixes scalar fields and
> +unions of structures.
>  
> -   This section is incomplete. Please list and describe the appropriate fields.
> +Common fields
> +~~~~~~~~~~~~~
> +
> +``flags``
> +  This field contains flags which describe the status of the page and
> +  additional information about the page, like, for instance, zone, section
> +  and node this page belongs to. Several flags determine how the page is
> +  used, sometimes in combination with ``page_type`` field. Other flags
> +  determine the state of the page, for instance if it is dirty or should be
> +  reclaimed, what LRU list this page is on and many others.
> +
> +  All flags are declared in ``include/linux/page-flags.h``. There are a
> +  number of macros defined for testing, clearing and setting the flags. Page
> +  flags should not be accessed directly, but only using these macros.

It would sure be nice if we had documentation for what all the flags
mean :)

> +  The layout of the ``flags`` field depends on the kernel configuration. It
> +  is affeted by selection of the memory model, section size for SPARSEMEM

affected

> +  without VMEMMAP, number of zone types, maximal number of nodes and other
> +  build time parameters, such as ``CONFIG_NUMA_BALANCING``,
> +  ``CONFIG_KASAN_SW_TAGS`` and ``CONFIG_LRU_GEN``.
> +
> +  For example, a kernel configured for 64-bit system with
> +  SPARSEMEM_VMEMMAP, four zone types and maximum of 64 nodes and other
> +  relevant options disabled layout of ``flags`` will be::
> +
> +    63   58 57  56 55                  23 22                      0
> +    +------+------+----------------------+------------------------+
> +    | node | zone |         ...          |         flags          |
> +    +------+------+----------------------+------------------------+
> +
> +  And for the same configuration with enabled ``CONFIG_LRU_GEN`` and
> +  ``CONFIG_NUMA_BALANCING`` it will be::
> +
> +    63   58 57  56 55    42 41     39 38      37 36  23 22        0
> +    +------+------+--------+---------+----------+------+----------+
> +    | node | zone | cpupid | lru_gen | lru_refs | ...  |  flags   |
> +    +------+------+--------+---------+----------+------+----------+
> +
> +  For the exact details refer to ``include/linux/page-flags-layout.h`` and
> +  ``include/linux/mmzone.h``.
> +
> +  Although in the above examples the page flags layout includes 23 flags,
> +  their number may vary with different kernel configurations.
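
For illustration, the first layout above can be decoded with plain shifts
and masks. This is a minimal userspace sketch, not kernel code: the `EX_`
constants and `ex_` helpers are hypothetical and hard-code the example
configuration (node in bits 63..58, zone in bits 57..56, flag bits in
22..0), whereas a real kernel derives the widths and shifts at build time.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical constants mirroring the first diagram above (64-bit,
 * SPARSEMEM_VMEMMAP, four zone types, up to 64 nodes). They are
 * assumptions for this sketch, not the kernel's definitions.
 */
#define EX_NODES_WIDTH   6
#define EX_ZONES_WIDTH   2
#define EX_NODES_SHIFT   58	/* node id occupies bits 63..58 */
#define EX_ZONES_SHIFT   56	/* zone index occupies bits 57..56 */
#define EX_NR_FLAG_BITS  23	/* flag bits occupy bits 22..0 */

/* Extract the node id from a flags word. */
static inline unsigned int ex_flags_to_nid(uint64_t flags)
{
	return (unsigned int)((flags >> EX_NODES_SHIFT) &
			      ((UINT64_C(1) << EX_NODES_WIDTH) - 1));
}

/* Extract the zone index from a flags word. */
static inline unsigned int ex_flags_to_zone(uint64_t flags)
{
	return (unsigned int)((flags >> EX_ZONES_SHIFT) &
			      ((UINT64_C(1) << EX_ZONES_WIDTH) - 1));
}

/* Extract the low flag bits from a flags word. */
static inline uint64_t ex_flags_to_bits(uint64_t flags)
{
	return flags & ((UINT64_C(1) << EX_NR_FLAG_BITS) - 1);
}

/* Build a flags word from its parts; bits 55..23 are left clear. */
static inline uint64_t ex_pack_flags(unsigned int nid, unsigned int zone,
				     uint64_t bits)
{
	assert(nid < (1u << EX_NODES_WIDTH));
	assert(zone < (1u << EX_ZONES_WIDTH));
	assert(bits < (UINT64_C(1) << EX_NR_FLAG_BITS));
	return ((uint64_t)nid << EX_NODES_SHIFT) |
	       ((uint64_t)zone << EX_ZONES_SHIFT) | bits;
}
```

Because the widths change with the configuration, code that hard-codes
them like this would break on a different build, which is one reason the
patch text says to go through the accessor macros.
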
> +
> +``_refcount``
> +  Usage count of the `struct page`. Should not be used directly. Use
> +  accessors defined in ``include/linux/page_ref.h``.
> +
> +``memcg_data``
> +  An opaque object used by memory cgroups. Defined only when
> +  ``CONFIG_MEMCG`` is enabled.
> +
> +``virtual``
> +  Virtual address in the kernel direct map. Will be ``NULL`` for highmem
> +  pages. Only defined for some architectures.

I'd say virtual is absent more often than present anymore, right?
Perhaps it's worth being more explicit about that.  And maybe say to use
page_address() rather than accessing it directly?

> +``kmsan_shadow``
> +  KMSAN shadow page: every bit indicates whether the corresponding bit of
> +  the original page is initialized (0) or not (1). Defined only when
> +  ``CONFIG_KMSAN`` is enabled.
> +
> +``kmsan_origin``
> +  KMSAN origin page: every 4 bytes contain an id of the stack trace where
> +  the uninitialized value was created. Defined only when ``CONFIG_KMSAN``
> +  is enabled.
> +
> +``_last_cpupid``
> +  IDs of last CPU and last process that accessed the page. Only enabled if
> +  there are not enough bits in the ``flags`` field.
> +  Do not use directly, use accessors defined in ``include/linux/mm.h``
> +
> +Fields shared between multiple types
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +``_mapcount``
> +  If the page can be mapped to userspace, encodes the number of times this
> +  page is referenced by a page table.
> +  Do not use directly, call page_mapcount().

Have we figured out what mapcount really means yet? :)

> +``page_type``
> +  If the page is neither ``PageSlab`` nor mappable to userspace, the value
> +  stored here may help determine what this page is used for. See
> +  ``include/linux/page-flags.h`` for a list of page types which are
> +  currently stored here.
> +
> +``rcu_head``
> +  You can use this to free a page by RCU. Available for page table pages
> +  and for page cache and anonymous pages not linked to any of the LRU
> +  lists.
> +
> +Page cache and anonymous pages
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +The following fields are used to link `struct page` to a linked list and
> +they overlap with each other:
> +
> +``lru``
> +  Linked list pointers for pages on LRU lists, for example active_list
> +  protected by ``lruvec->lru_lock``. Sometimes used as a generic list by
> +  the page owner.
> +
> +For pages on unevictable "LRU list" ``lru`` is overlayed with an anonymous
> +struct containing two fields:
> +
> +``__filler``
> +  A dummy field that must be always even to avoid conflict with compound
> +  page encoding.

Do we care about the constraints on this field's contents?  Presumably
that is taken care of somewhere and nobody should mess with it?

> +``mlock_count``
> +  Number of times the page has been pinned by mlock().
> +
> +Pages on free lists used by the page allocator are linked to the relevant
> +list with eithter of the two below fields:

Spellcheckers are your friend :)

> +``buddy_list``
> +  Links the page to one of the free lists in the buddy allocator. Overlaps
> +  with ``lru``.
> +
> +``pcp_list``
> +  Links the page to a per-cpu free list. Overlaps with ``lru``.
> +
> +``mapping``
> +  The file this page belongs to. Can be pagecache or swapcahe. For
> +  anonymous memory refers to the `struct anon_vma`.
> +  See also ``include/linux/page-flags.h`` for ``PAGE_MAPPING_FLAGS``

It seems like putting in the types for fields like this would be useful;
readers of the HTML docs can then follow the links and see what is
actually pointed to.

> +``index``
> +  Page offset within mapping. Overlaps with ``share``.
> +
> +``share``
> +  Share count for fsdax. Overlaps with ``index``.
> +
> +``private``
> +  Mapping-private opaque data. Usually used for buffer_heads if
> +  PagePrivate. Used for swp_entry_t if PageSwapCache. Indicates order in
> +  the buddy system if PageBuddy.
> +
> +Page pool
> +~~~~~~~~~
> +
> +The following fields are used by
> +`page_pool <Documentation/networking/page_pool.rst>`
> +allocator used by the networking stack.
> +
> +``pp_magic``
> +  Magic value to avoid recycling non page_pool allocated pages.
> +
> +``pp``
> +  `struct page_pool` holding the page.
> +
> +``_pp_mapping_pad``
> +  A padding to avoid collision of page_pool data with ``mapping``.
> +
> +``dma_addr``
> +  DMAable address of the page.
> +
> +``dma_addr_upper``
> +  Upper part of DMA address on 32-bit architectures that use 64-bit DMA
> +  addressing. Overlaps with ``pp_frag_count``.
> +
> +``pp_frag_count``
> +  Used by sub-page allocations in ``page_pool``. Not supported on 32-bit
> +  architectures with 64-bit DMA addresses. Overlaps with ``dma_addr_upper``.
> +
> +Tail pages of compound page
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +``compound_head``
> +  Pointer to the head page of compound page. Bit zero is always set for
> +  tail pages and cleared for head pages.
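
The bit-zero encoding described for ``compound_head`` can be sketched in
userspace as a tagged pointer. The `ex_page` type and `ex_` helpers below
are hypothetical stand-ins, assuming only what the text above states: a
tail page stores the head's address with bit zero set, and a head page
has bit zero clear in the same word.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the one word of struct page that matters here. */
struct ex_page {
	uintptr_t compound_head;	/* (address of head) | 1 on tail pages */
};

/* Mark @tail as a tail page of the compound page headed by @head. */
static inline void ex_set_compound_head(struct ex_page *tail,
					struct ex_page *head)
{
	/* The tag only works if real pointers never have bit zero set. */
	assert(((uintptr_t)head & 1) == 0);
	tail->compound_head = (uintptr_t)head | 1;
}

/* A page is a tail page exactly when bit zero of the word is set. */
static inline int ex_page_is_tail(const struct ex_page *page)
{
	return (int)(page->compound_head & 1);
}

/* Return the head page: strip the tag for tails, or the page itself. */
static inline struct ex_page *ex_compound_head(struct ex_page *page)
{
	uintptr_t head = page->compound_head;

	if (head & 1)
		return (struct ex_page *)(head - 1);
	return page;
}
```

The same reasoning explains the constraint on ``__filler`` quoted
earlier: a value that shares this word has to stay even, otherwise the
page would be mistaken for a tail page.
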
> +
> +ZONE_DEVICE pages
> +~~~~~~~~~~~~~~~~~
> +
> +``pgmap``
> +  Points to the hosting device page map.
> +
> +``zone_device_data``
> +  Private data used by the owning device.
>  
>  .. _folios:
>  
>  Folios
> -======
> +------

As Willy said, linking to the existing docs might be better here.

> -.. admonition:: Stub
> +`struct folio` represents a physically, virtually and logically contiguous
> +set of bytes. It is a power-of-two in size, and it is aligned to that same
> +power-of-two. It is at least as large as ``PAGE_SIZE``. If it is in the
> +page cache, it is at a file offset which is a multiple of that
> +power-of-two. It may be mapped into userspace at an address which is at an
> +arbitrary page offset, but its kernel virtual address is aligned to its
> +size.
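
The size and alignment rules in this paragraph reduce to simple
power-of-two arithmetic. A minimal sketch under stated assumptions:
`EX_PAGE_SHIFT` assumes 4 KiB pages, and the `ex_` helpers are
hypothetical illustrations rather than the kernel's folio_nr_pages() or
folio_size().

```c
#include <assert.h>
#include <stdint.h>

#define EX_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Number of pages in a folio of the given order: 2^order. */
static inline unsigned long ex_folio_nr_pages(unsigned int order)
{
	/* Keep both shifts within the width of unsigned long. */
	assert(order < 8 * sizeof(unsigned long) - EX_PAGE_SHIFT);
	return 1UL << order;
}

/* Size of the folio in bytes. */
static inline unsigned long ex_folio_size(unsigned int order)
{
	return ex_folio_nr_pages(order) << EX_PAGE_SHIFT;
}

/*
 * A page cache folio sits at a file offset that is a multiple of its
 * size, so its index (in pages) has the low @order bits clear.
 */
static inline int ex_index_is_aligned(unsigned long index, unsigned int order)
{
	return (index & (ex_folio_nr_pages(order) - 1)) == 0;
}
```

With these assumptions an order-9 folio covers 512 pages (2 MiB), and
index 512 is a valid starting index for it while index 513 is not.
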
>  
> -   This section is incomplete. Please list and describe the appropriate fields.
> +`struct folio` occupies several consecutive entries in the memory map and
> +has the following fields:
> +
> +``flags``
> +  Identical to the page flags.
> +
> +``lru``
> +  Least Recently Used list; tracks how recently this folio was used.
> +
> +``mlock_count``
> +  Number of times this folio has been pinned by mlock().
> +
> +``mapping``
> +  The file this page belongs to. Can be pagecache or swapcahe. For
> +  anonymous memory refers to the `struct anon_vma`.
> +
> +``index``
> +  Offset within the file, in units of pages. For anonymous memory, this is
> +  the index from the beginning of the mmap.
> +
> +``private``
> +  Filesystem per-folio data (see folio_attach_private()). Used for
> +  ``swp_entry_t`` if folio is in the swap cache
> +  (i.e. folio_test_swapcache() is true)
> +
> +``_mapcount``
> +  Do not access this member directly. Use folio_mapcount() to find out how
> +  many times this folio is mapped by userspace.
> +
> +``_refcount``
> +  Do not access this member directly. Use folio_ref_count() to find how
> +  many references there are to this folio.
> +
> +``memcg_data``
> +  Memory Control Group data.
> +
> +``_folio_dtor``
> +  Which destructor to use for this folio.
> +
> +``_folio_order``
> +  The allocation order of a folio. Do not use directly, call folio_order().
> +
> +``_entire_mapcount``
> +  How many times the entire folio is mapped as a single unit (for example
> +  by a PMD or PUD entry). Does not include PTE-mapped subpages. This might
> +  be useful for debugging, but to find out how many times the folio is
> +  mapped look at folio_mapcount() or page_mapcount() or total_mapcount()
> +  instead.
> +  Do not use directly, call folio_entire_mapcount().
> +
> +``_nr_pages_mapped``
> +  The total number of times the folio is mapped.
> +  Do not use directly, call folio_mapcount().
> +
> +``_pincount``
> +  Used to track pinning of the folio for DMA.
> +  Do not use directly, call folio_maybe_dma_pinned().
> +
> +``_folio_nr_pages``
> +  The number of pages in the folio.
> +  Do not use directly, call folio_nr_pages().
> +
> +``_hugetlb_subpool``
> +  HugeTLB subpool the folio beongs to.
> +  Do not use directly, use accessor in ``include/linux/hugetlb.h``.
> +
> +``_hugetlb_cgroup``
> +  Memory Control Group data for a HugeTLB folio.
> +  Do not use directly, use accessor in ``include/linux/hugetlb_cgroup.h``.
> +
> +``_hugetlb_cgroup_rsvd``
> +  Memory Control Group data for a HugeTLB folio.
> +  Do not use directly, use accessor in ``include/linux/hugetlb_cgroup.h``.
> +
> +``_hugetlb_hwpoison``
> +  List of failed (hwpoisoned) pages for a HugeTLB folio.
> +  Do not use directly, call raw_hwp_list_head().
> +
> +``_deferred_list``
> +  Folios to be split under memory pressure.
> +
> +.. _ptdesc:
> +
> +Page table descriptors
> +----------------------
> +
> +`struct ptdesc` describes the pages used by page tables. It has the
> +following fields:
> +
> +``_page_flags``
> +  Same as page flags. Unused for page tables.
> +
> +``pt_rcu_head``
> +  For freeing page table pages using RCU.
> +
> +``pt_list``
> +  List of used page tables. Used for s390 and x86.
> +
> +``pmd_huge_pte``
> +  Used by THP to track page tables that map huge pages. Protected by
> +  ``ptdesc->ptl`` or ``mm->page_table_lock``, depending on values of
> +  ``CONFIG_NR_CPUS`` and ``CONFIG_SPLIT_PTLOCK_CPUS`` configuration
> +  options.
> +
> +``pt_mm``
> +  Pointer to mm_struct owning the page table. Only used for PGDs on x86.
> +
> +``pt_frag_refcount``
> +  For fragmented page table tracking. Used on Powerpc and s390 only.
> +
> +``ptl``
> +  Page table lock. If the size of `spinlock_t` object is small enough the
> +  lock is embedded in `struct ptdesc`, otherwise this field points to a
> +  lock allocated for each page table page.
> +
> +``_refcount``
> +  Same as page refcount. Used for s390 page tables.
> +
> +``pt_memcg_data``
> +  Memcg data. Tracked for page tables here.

It's good to see this documentation being filled in!

An overall concern that comes to mind is that you're documenting
something that is very much a moving target.  It's already a bit of an
awkward fit with the page types that have been split out into their own
structures, and will become more so as that work proceeds.  The document
seems likely to go out of date quickly.

I wonder if it might be better to structure it as if the splitting of
struct page were already complete, with a section for each page
descriptor type, even the ones that don't exist as separate entities
yet?  Maybe that would make it easier for people to keep it current as
they hack pieces out of struct page?

Just a thought.

Thanks,

jon

Matthew Wilcox Sept. 6, 2023, 3:04 p.m. UTC | #5
On Wed, Sep 06, 2023 at 08:41:28AM -0600, Jonathan Corbet wrote:
> > +  All flags are declared in ``include/linux/page-flags.h``. There are a
> > +  number of macros defined for testing, clearing and setting the flags. Page
> > +  flags should not be accessed directly, but only using these macros.
> 
> It would sure be nice if we had documentation for what all the flags
> mean :)

When I figure them out, I'll let you know!

> > +``_mapcount``
> > +  If the page can be mapped to userspace, encodes the number of times this
> > +  page is referenced by a page table.
> > +  Do not use directly, call page_mapcount().
> 
> Have we figured out what mapcount really means yet? :)

Hah!  I know what this field means today!  In two hours time, I might
be less sure!  (Does LWN want to come along to that MM meeting and write
it up for an article?)

> > +``virtual``
> > +  Virtual address in the kernel direct map. Will be ``NULL`` for highmem
> > +  pages. Only defined for some architectures.
> 
> I'd say virtual is absent more often than present anymore, right?
> Perhaps it's worth being more explicit about that.  And maybe say to use
> page_address() rather than accessing it directly?

That's something I've been thinking about for the folio kernel-doc:
just stop documenting the things that you "shouldn't use".  Keep
non-kernel-doc comments in the source about what you should use instead,
but no kernel-doc comments that say "Use page_address() instead of this".

> > +For pages on unevictable "LRU list" ``lru`` is overlayed with an anonymous
> > +struct containing two fields:
> > +
> > +``__filler``
> > +  A dummy field that must be always even to avoid conflict with compound
> > +  page encoding.
> 
> Do we care about the constraints on this field's contents?  Presumably
> that is taken care of somewhere and nobody should mess with it?

I also think that documenting here which things are in a union with
other things is unnecessary.  If someone cares for such a level of
detail, they'd better be reading the source code instead of this.
Nobody should be using it, better to just leave it undocumented.

> > +``mapping``
> > +  The file this page belongs to. Can be pagecache or swapcahe. For

Oh, actually, no, it can't be swapcache.  If the page is in the
swapcache, you find its swapcache through swapcache_mapping().
That's because ->mapping is reused as an anon_vma pointer for anon
pages.
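
The reuse of ->mapping for anonymous pages relies on tagging the low bits
of the pointer. A minimal userspace sketch of that idea follows. The
`ex_` types and helpers are hypothetical, and the bit values are assumed
to match PAGE_MAPPING_ANON (0x1) and PAGE_MAPPING_FLAGS (0x3) from
include/linux/page-flags.h around the time of this thread; check the
header before relying on them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values; see the hedge in the paragraph above. */
#define EX_MAPPING_ANON   ((uintptr_t)0x1)
#define EX_MAPPING_FLAGS  ((uintptr_t)0x3)

/* Hypothetical stand-ins for the two kinds of object ->mapping can name. */
struct ex_anon_vma { int dummy; };
struct ex_address_space { int dummy; };

struct ex_page {
	uintptr_t mapping;	/* pointer with type tag in the low bits */
};

/* Store an anon_vma pointer, tagged so it is not taken for a file mapping. */
static inline void ex_set_anon(struct ex_page *page, struct ex_anon_vma *av)
{
	assert(((uintptr_t)av & EX_MAPPING_FLAGS) == 0);
	page->mapping = (uintptr_t)av | EX_MAPPING_ANON;
}

static inline int ex_page_is_anon(const struct ex_page *page)
{
	return (page->mapping & EX_MAPPING_ANON) != 0;
}

/* Return the anon_vma, or NULL if the word is not tagged as plain anon. */
static inline struct ex_anon_vma *ex_page_anon_vma(const struct ex_page *page)
{
	if ((page->mapping & EX_MAPPING_FLAGS) != EX_MAPPING_ANON)
		return NULL;
	return (struct ex_anon_vma *)(page->mapping - EX_MAPPING_ANON);
}

/* Return the file mapping, or NULL if any tag bit is set. */
static inline struct ex_address_space *
ex_page_file_mapping(const struct ex_page *page)
{
	if (page->mapping & EX_MAPPING_FLAGS)
		return NULL;
	return (struct ex_address_space *)page->mapping;
}
```

Since the word is already spoken for on an anonymous page, it cannot
also name the swap cache, which is why that has to be looked up another
way.
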

> > +  anonymous memory refers to the `struct anon_vma`.
> > +  See also ``include/linux/page-flags.h`` for ``PAGE_MAPPING_FLAGS``
> 
> It seems like putting in the types for fields like this would be useful;
> readers of the HTML docs can then follow the links and see what is
> actually pointed to.
> 
> > +``index``
> > +  Page offset within mapping. Overlaps with ``share``.
> > +
> > +``share``
> > +  Share count for fsdax. Overlaps with ``index``.

fsdax is not pagecache, so this probably shouldn't be documented here.

> I wonder if it might be better to structure it as if the splitting of
> struct page were already complete, with a section for each page
> descriptor type, even the ones that don't exist as separate entities
> yet?  Maybe that would make it easier for people to keep it current as
> they hack pieces out of struct page?

Yes.  Although I don't think we quite know what it's all going to
look like yet, which makes it challenging to document!

Jonathan Corbet Sept. 6, 2023, 3:24 p.m. UTC | #6
Matthew Wilcox <willy@infradead.org> writes:

>> Have we figured out what mapcount really means yet? :)
>
> Hah!  I know what this field means today!  In two hours time, I might
> be less sure!  (Does LWN want to come along to that MM meeting and write
> it up for an article?)

It's on my radar, not yet sure if I can make it or not.  Wednesdays are
a pain...

jon

Mike Rapoport Sept. 7, 2023, 2:20 p.m. UTC | #7
On Wed, Sep 06, 2023 at 08:41:28AM -0600, Jonathan Corbet wrote:
> Mike Rapoport <rppt@kernel.org> writes:
> 
> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> >
> > Briefly describe memory map and add sub-sections for pages, folios and
> > ptdescs.
> >
> > Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
> > ---
> >  Documentation/mm/physical_memory.rst | 338 ++++++++++++++++++++++++++-
> >  1 file changed, 332 insertions(+), 6 deletions(-)
> >
> > diff --git a/Documentation/mm/physical_memory.rst b/Documentation/mm/physical_memory.rst
> > index 531e73b003dd..e3318897bf57 100644
> > --- a/Documentation/mm/physical_memory.rst
> > +++ b/Documentation/mm/physical_memory.rst
> 
> > +the selection of the memory model. Memory models are described in more
> > +detail in Documentation/mm/memory-model.rst
> > +
> > +The basic memory descriptor is called :ref:`struct page <Pages>` and it is
> > +essentially a union of several structures, each representing a page frame
> > +metadata for a paricular usage.
> > +
> > +In many cases the entries in the memory map are not treated as `struct page`,
> > +but rather as different types of descriptors such as :ref:`struct folio
> > +<Folios>`, :ref:`struct ptdesc <ptdesc>` or `struct slab`.
> 
> I would hope that just saying "struct folio" would do the right thing;
> did that not happen for you?

`struct folio` links to core-api/mm-api.rst, I wanted a link to the "Folio"
section here.
 
> > -   This section is incomplete. Please list and describe the appropriate fields.
> > +Common fields
> > +~~~~~~~~~~~~~
> > +
> > +``flags``
> > +  This field contains flags which describe the status of the page and
> > +  additional information about the page, like, for instance, zone, section
> > +  and node this page belongs to. Several flags determine how the page is
> > +  used, sometimes in combination with ``page_type`` field. Other flags
> > +  determine the state of the page, for instance if it is dirty or should be
> > +  reclaimed, what LRU list this page is on and many others.
> > +
> > +  All flags are declared in ``include/linux/page-flags.h``. There are a
> > +  number of macros defined for testing, clearing and setting the flags. Page
> > +  flags should not be accessed directly, but only using these macros.
> 
> It would sure be nice if we had documentation for what all the flags
> mean :)

This alone would take another several months :)
 
> > +  The layout of the ``flags`` field depends on the kernel configuration. It
> > +  is affeted by selection of the memory model, section size for SPARSEMEM
> > +
> > +``virtual``
> > +  Virtual address in the kernel direct map. Will be ``NULL`` for highmem
> > +  pages. Only defined for some architectures.
> 
> I'd say virtual is absent more often than present anymore, right?
> Perhaps it's worth being more explicit about that.  And maybe say to use
> page_address() rather than accessing it directly?
> 
> > +Fields shared between multiple types
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +``_mapcount``
> > +  If the page can be mapped to userspace, encodes the number of times this
> > +  page is referenced by a page table.
> > +  Do not use directly, call page_mapcount().
> 
> Have we figured out what mapcount really means yet? :)

It's a number, isn't it? :)
 
> > +``mlock_count``
> > +  Number of times the page has been pinned by mlock().
> > +
> > +Pages on free lists used by the page allocator are linked to the relevant
> > +list with eithter of the two below fields:
> 
> Spellcheckers are your friend :)

I was sure checkpatch.pl does that :(
 
> > +``buddy_list``
> > +  Links the page to one of the free lists in the buddy allocator. Overlaps
> > +  with ``lru``.
> > +
> > +``pcp_list``
> > +  Links the page to a per-cpu free list. Overlaps with ``lru``.
> > +
> > +``mapping``
> > +  The file this page belongs to. Can be pagecache or swapcahe. For
> > +  anonymous memory refers to the `struct anon_vma`.
> > +  See also ``include/linux/page-flags.h`` for ``PAGE_MAPPING_FLAGS``
> 
> It seems like putting in the types for fields like this would be useful;
> readers of the HTML docs can then follow the links and see what is
> actually pointed to.
 
Does sphinx create references for defines? (presuming someone will add
kernel-doc description for them)

> >  Folios
> > -======
> > +------
> 
> As Willy said, linking to the existing docs might be better here.
 
...
 
> It's good to see this documentation being filled in!
> 
> An overall concern that comes to mind is that you're documenting
> something that is very much a moving target.  It's already a bit of an
> awkward fit with the page types that have been split out into their own
> structures, and will become more so as that work proceeds.  The document
> seems likely to go out of date quickly.

I hesitated quite a lot about writing this documentation exactly because of
page types being a moving target. And I decided to give it a go to have
something that will show up in git grep and hopefully people would update
the documentation along with the code changes. (Sorry Willy, I know it's
more work for you).

An alternative is to wait until we completely get rid of struct page and
have this undocumented for quite some time.
 
> I wonder if it might be better to structure it as if the splitting of
> struct page were already complete, with a section for each page
> descriptor type, even the ones that don't exist as separate entities
> yet?  Maybe that would make it easier for people to keep it current as
> they hack pieces out of struct page?

After reading your and Willy's comments, I think description of the fields
in "API reference" style is not what I want to see in this document in the
end. I'd prefer it to target people who want to dig deeper into mm and
understand how things work rather than how to use them.

That's why I think linking kernel-doc here would be suboptimal: kernel-doc
is an API reference and does not describe internal workings in the
majority of cases.

Starting with the comments we have in the code (both kernel-doc and not)
with some additions and alterations seems to me a good starting point.

Just some thoughts :)
 
> Just a thought.
> 
> Thanks,
> 
> jon

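For readers following the ``flags`` discussion, here is a small userspace
sketch of how the first layout diagram in the patch below (64-bit,
SPARSEMEM_VMEMMAP, four zone types, up to 64 nodes) could be unpacked. The
``EX_*`` constants and ``ex_*()`` helpers are made up for illustration and
hard-code that one configuration. The real widths and shifts are derived in
``include/linux/page-flags-layout.h`` and ``include/linux/mmzone.h``, and
kernel code should go through the existing accessors rather than open-coded
shifts.

```c
#include <stdint.h>

/* Hypothetical constants for the example layout only; not kernel definitions. */
#define EX_NODE_SHIFT 58                 /* node: bits 63..58 */
#define EX_NODE_MASK  0x3fULL            /* 6 bits, up to 64 nodes */
#define EX_ZONE_SHIFT 56                 /* zone: bits 57..56 */
#define EX_ZONE_MASK  0x3ULL             /* 2 bits, four zone types */
#define EX_FLAGS_MASK ((1ULL << 23) - 1) /* flags: bits 22..0 */

/* Pack node, zone and flag bits into one 64-bit word. */
static uint64_t ex_pack(unsigned int node, unsigned int zone, uint64_t bits)
{
	return ((node & EX_NODE_MASK) << EX_NODE_SHIFT) |
	       ((zone & EX_ZONE_MASK) << EX_ZONE_SHIFT) |
	       (bits & EX_FLAGS_MASK);
}

/* Extract the node id from the top six bits. */
static unsigned int ex_node(uint64_t word)
{
	return (unsigned int)((word >> EX_NODE_SHIFT) & EX_NODE_MASK);
}

/* Extract the zone index from bits 57..56. */
static unsigned int ex_zone(uint64_t word)
{
	return (unsigned int)((word >> EX_ZONE_SHIFT) & EX_ZONE_MASK);
}

/* Extract the 23 low flag bits. */
static uint64_t ex_flag_bits(uint64_t word)
{
	return word & EX_FLAGS_MASK;
}
```

For instance, ``ex_pack(5, 2, 0x11)`` yields a word from which ``ex_node()``
returns 5, ``ex_zone()`` returns 2 and ``ex_flag_bits()`` returns 0x11.
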
Patch

diff --git a/Documentation/mm/physical_memory.rst b/Documentation/mm/physical_memory.rst
index 531e73b003dd..e3318897bf57 100644
--- a/Documentation/mm/physical_memory.rst
+++ b/Documentation/mm/physical_memory.rst
@@ -343,23 +343,349 @@  Zones
 
    This section is incomplete. Please list and describe the appropriate fields.
 
+.. _memmap:
+
+Memory map and memory descriptors
+=================================
+
+Every physical page frame in the system has an associated descriptor which
+is used to keep track of its status. The collection of these descriptors is
+called the `memory map` and it is arranged in one or more arrays, depending
+on the selection of the memory model. Memory models are described in more
+detail in Documentation/mm/memory-model.rst.
+
+The basic memory descriptor is called :ref:`struct page <Pages>` and it is
+essentially a union of several structures, each representing page frame
+metadata for a particular usage.
+
+In many cases the entries in the memory map are not treated as `struct page`,
+but rather as different types of descriptors such as :ref:`struct folio
+<Folios>`, :ref:`struct ptdesc <ptdesc>` or `struct slab`.
+
 .. _pages:
 
 Pages
-=====
+-----
 
-.. admonition:: Stub
+`struct page` tracks the status of a single physical page frame. This
+structure is a mixture of several types that represent metadata for
+different uses of a page frame. To save memory, these types partially
+overlap, so the `struct page` definition in ``include/linux/mm_types.h``
+mixes scalar fields and unions of structures.
 
-   This section is incomplete. Please list and describe the appropriate fields.
+Common fields
+~~~~~~~~~~~~~
+
+``flags``
+  This field contains flags which describe the status of the page and
+  additional information about the page, such as the zone, section and
+  node this page belongs to. Several flags determine how the page is
+  used, sometimes in combination with the ``page_type`` field. Other flags
+  describe the state of the page, for instance whether it is dirty or should
+  be reclaimed, which LRU list the page is on, and many others.
+
+  All flags are declared in ``include/linux/page-flags.h``. There are a
+  number of macros defined for testing, clearing and setting the flags. Page
+  flags should not be accessed directly, but only using these macros.
+
+  The layout of the ``flags`` field depends on the kernel configuration. It
+  is affected by the selection of the memory model, section size for SPARSEMEM
+  without VMEMMAP, number of zone types, maximum number of nodes and other
+  build time parameters, such as ``CONFIG_NUMA_BALANCING``,
+  ``CONFIG_KASAN_SW_TAGS`` and ``CONFIG_LRU_GEN``.
+
+  For example, for a kernel configured for a 64-bit system with
+  SPARSEMEM_VMEMMAP, four zone types, a maximum of 64 nodes and other
+  relevant options disabled, the layout of ``flags`` will be::
+
+    63   58 57  56 55                  23 22                      0
+    +------+------+----------------------+------------------------+
+    | node | zone |         ...          |         flags          |
+    +------+------+----------------------+------------------------+
+
+  And for the same configuration with ``CONFIG_LRU_GEN`` and
+  ``CONFIG_NUMA_BALANCING`` enabled it will be::
+
+    63   58 57  56 55    42 41     39 38      37 36  23 22        0
+    +------+------+--------+---------+----------+------+----------+
+    | node | zone | cpupid | lru_gen | lru_refs | ...  |  flags   |
+    +------+------+--------+---------+----------+------+----------+
+
+  For the exact details refer to ``include/linux/page-flags-layout.h`` and
+  ``include/linux/mmzone.h``.
+
+  Although in the above examples the page flags layout includes 23 flags,
+  their number may vary with different kernel configurations.
+
+``_refcount``
+  Usage count of the `struct page`. Should not be used directly. Use
+  accessors defined in ``include/linux/page_ref.h``.
+
+``memcg_data``
+  An opaque object used by memory cgroups. Defined only when
+  ``CONFIG_MEMCG`` is enabled.
+
+``virtual``
+  Virtual address in the kernel direct map. Will be ``NULL`` for highmem
+  pages. Only defined for some architectures.
+
+``kmsan_shadow``
+  KMSAN shadow page: every bit indicates whether the corresponding bit of
+  the original page is initialized (0) or not (1). Defined only when
+  ``CONFIG_KMSAN`` is enabled.
+
+``kmsan_origin``
+  KMSAN origin page: every 4 bytes contain an id of the stack trace where
+  the uninitialized value was created. Defined only when ``CONFIG_KMSAN``
+  is enabled.
+
+``_last_cpupid``
+  IDs of the last CPU and last process that accessed the page. Only enabled
+  if there are not enough bits in the ``flags`` field.
+  Do not use directly; use the accessors defined in ``include/linux/mm.h``.
+
+Fields shared between multiple types
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``_mapcount``
+  If the page can be mapped to userspace, encodes the number of times this
+  page is referenced by a page table.
+  Do not use directly, call page_mapcount().
+
+``page_type``
+  If the page is neither ``PageSlab`` nor mappable to userspace, the value
+  stored here may help determine what this page is used for. See
+  ``include/linux/page-flags.h`` for a list of page types which are
+  currently stored here.
+
+``rcu_head``
+  You can use this to free a page by RCU. Available for page table pages
+  and for page cache and anonymous pages not linked to any of the LRU
+  lists.
+
+Page cache and anonymous pages
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following fields are used to link `struct page` to a linked list and
+they overlap with each other:
+
+``lru``
+  Linked list pointers for pages on LRU lists, for example active_list
+  protected by ``lruvec->lru_lock``. Sometimes used as a generic list by
+  the page owner.
+
+For pages on the unevictable "LRU list", ``lru`` is overlaid with an
+anonymous struct containing two fields:
+
+``__filler``
+  A dummy field that must always be even to avoid conflict with compound
+  page encoding.
+
+``mlock_count``
+  Number of times the page has been pinned by mlock().
+
+Pages on free lists used by the page allocator are linked to the relevant
+list with either of the two fields below:
+
+``buddy_list``
+  Links the page to one of the free lists in the buddy allocator. Overlaps
+  with ``lru``.
+
+``pcp_list``
+  Links the page to a per-cpu free list. Overlaps with ``lru``.
+
+``mapping``
+  The file this page belongs to. Can be pagecache or swapcache. For
+  anonymous memory, refers to the `struct anon_vma`.
+  See also ``include/linux/page-flags.h`` for ``PAGE_MAPPING_FLAGS``.
+
+``index``
+  Page offset within mapping. Overlaps with ``share``.
+
+``share``
+  Share count for fsdax. Overlaps with ``index``.
+
+``private``
+  Mapping-private opaque data. Usually used for buffer_heads if
+  PagePrivate. Used for swp_entry_t if PageSwapCache. Indicates order in
+  the buddy system if PageBuddy.
+
+Page pool
+~~~~~~~~~
+
+The following fields are used by the
+`page_pool <Documentation/networking/page_pool.rst>`
+allocator of the networking stack.
+
+``pp_magic``
+  Magic value to avoid recycling non page_pool allocated pages.
+
+``pp``
+  `struct page_pool` holding the page.
+
+``_pp_mapping_pad``
+  A padding to avoid collision of page_pool data with ``mapping``.
+
+``dma_addr``
+  DMAable address of the page.
+
+``dma_addr_upper``
+  Upper part of DMA address on 32-bit architectures that use 64-bit DMA
+  addressing. Overlaps with ``pp_frag_count``.
+
+``pp_frag_count``
+  Used by sub-page allocations in ``page_pool``. Not supported on 32-bit
+  architectures with 64-bit DMA addresses. Overlaps with ``dma_addr_upper``.
+
+Tail pages of compound page
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``compound_head``
+  Pointer to the head page of the compound page. Bit zero is always set for
+  tail pages and cleared for head pages.
+
+ZONE_DEVICE pages
+~~~~~~~~~~~~~~~~~
+
+``pgmap``
+  Points to the hosting device page map.
+
+``zone_device_data``
+  Private data used by the owning device.
 
 .. _folios:
 
 Folios
-======
+------
 
-.. admonition:: Stub
+`struct folio` represents a physically, virtually and logically contiguous
+set of bytes. It is a power-of-two in size, and it is aligned to that same
+power-of-two. It is at least as large as ``PAGE_SIZE``. If it is in the
+page cache, it is at a file offset which is a multiple of that
+power-of-two. It may be mapped into userspace at an address which is at an
+arbitrary page offset, but its kernel virtual address is aligned to its
+size.
 
-   This section is incomplete. Please list and describe the appropriate fields.
+`struct folio` occupies several consecutive entries in the memory map and
+has the following fields:
+
+``flags``
+  Identical to the page flags.
+
+``lru``
+  Least Recently Used list; tracks how recently this folio was used.
+
+``mlock_count``
+  Number of times this folio has been pinned by mlock().
+
+``mapping``
+  The file this page belongs to. Can be pagecache or swapcache. For
+  anonymous memory, refers to the `struct anon_vma`.
+
+``index``
+  Offset within the file, in units of pages. For anonymous memory, this is
+  the index from the beginning of the mmap.
+
+``private``
+  Filesystem per-folio data (see folio_attach_private()). Used for
+  ``swp_entry_t`` if the folio is in the swap cache
+  (i.e. folio_test_swapcache() is true).
+
+``_mapcount``
+  Do not access this member directly. Use folio_mapcount() to find out how
+  many times this folio is mapped by userspace.
+
+``_refcount``
+  Do not access this member directly. Use folio_ref_count() to find how
+  many references there are to this folio.
+
+``memcg_data``
+  Memory Control Group data.
+
+``_folio_dtor``
+  Which destructor to use for this folio.
+
+``_folio_order``
+  The allocation order of a folio. Do not use directly, call folio_order().
+
+``_entire_mapcount``
+  How many times the entire folio is mapped as a single unit (for example
+  by a PMD or PUD entry). Does not include PTE-mapped subpages. This might
+  be useful for debugging, but to find out how many times the folio is
+  mapped look at folio_mapcount() or page_mapcount() or total_mapcount()
+  instead.
+  Do not use directly, call folio_entire_mapcount().
+
+``_nr_pages_mapped``
+  The total number of times the folio is mapped.
+  Do not use directly, call folio_mapcount().
+
+``_pincount``
+  Used to track pinning of the folio for DMA.
+  Do not use directly, call folio_maybe_dma_pinned().
+
+``_folio_nr_pages``
+  The number of pages in the folio.
+  Do not use directly, call folio_nr_pages().
+
+``_hugetlb_subpool``
+  HugeTLB subpool the folio belongs to.
+  Do not use directly, use accessor in ``include/linux/hugetlb.h``.
+
+``_hugetlb_cgroup``
+  Memory Control Group data for a HugeTLB folio.
+  Do not use directly, use accessor in ``include/linux/hugetlb_cgroup.h``.
+
+``_hugetlb_cgroup_rsvd``
+  Memory Control Group data for a HugeTLB folio.
+  Do not use directly, use accessor in ``include/linux/hugetlb_cgroup.h``.
+
+``_hugetlb_hwpoison``
+  List of failed (hwpoisoned) pages for a HugeTLB folio.
+  Do not use directly, call raw_hwp_list_head().
+
+``_deferred_list``
+  Folios to be split under memory pressure.
+
+.. _ptdesc:
+
+Page table descriptors
+----------------------
+
+`struct ptdesc` describes the pages used by page tables. It has the
+following fields:
+
+``_page_flags``
+  Same as page flags. Unused for page tables.
+
+``pt_rcu_head``
+  For freeing page table pages using RCU.
+
+``pt_list``
+  List of used page tables. Used for s390 and x86.
+
+``pmd_huge_pte``
+  Used by THP to track page tables that map huge pages. Protected by
+  ``ptdesc->ptl`` or ``mm->page_table_lock``, depending on values of
+  ``CONFIG_NR_CPUS`` and ``CONFIG_SPLIT_PTLOCK_CPUS`` configuration
+  options.
+
+``pt_mm``
+  Pointer to mm_struct owning the page table. Only used for PGDs on x86.
+
+``pt_frag_refcount``
+  For fragmented page table tracking. Used on PowerPC and s390 only.
+
+``ptl``
+  Page table lock. If the `spinlock_t` object is small enough, the
+  lock is embedded in `struct ptdesc`; otherwise this field points to a
+  lock allocated for each page table page.
+
+``_refcount``
+  Same as page refcount. Used for s390 page tables.
+
+``pt_memcg_data``
+  Memcg data. Tracked for page tables here.
 
 .. _initialization: