
[v3,14/15] Documentation: add document for pte_ref

Message ID 20211110105428.32458-15-zhengqi.arch@bytedance.com (mailing list archive)
State New
Series Free user PTE page table pages

Commit Message

Qi Zheng Nov. 10, 2021, 10:54 a.m. UTC
This commit adds document for pte_ref under `Documentation/vm/`.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 Documentation/vm/pte_ref.rst | 212 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 212 insertions(+)
 create mode 100644 Documentation/vm/pte_ref.rst

Comments

Jonathan Corbet Nov. 10, 2021, 2:39 p.m. UTC | #1
Qi Zheng <zhengqi.arch@bytedance.com> writes:

> This commit adds document for pte_ref under `Documentation/vm/`.

Thanks for documenting this work!

> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>  Documentation/vm/pte_ref.rst | 212 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 212 insertions(+)
>  create mode 100644 Documentation/vm/pte_ref.rst

When you add a new RST file, you also need to add it to the associated
index.rst file or it won't be included in the docs build.  Instead,
you'll get the "not included in any toctree" warning that you surely saw
when you tested the docs build with this file :)

> diff --git a/Documentation/vm/pte_ref.rst b/Documentation/vm/pte_ref.rst
> new file mode 100644
> index 000000000000..c5323a263464
> --- /dev/null
> +++ b/Documentation/vm/pte_ref.rst
> @@ -0,0 +1,212 @@
> +.. _pte_ref:

Do you need this label anywhere?  If not, I'd leave it out.

Thanks,

jon
Qi Zheng Nov. 11, 2021, 5:40 a.m. UTC | #2
On 11/10/21 10:39 PM, Jonathan Corbet wrote:
> Qi Zheng <zhengqi.arch@bytedance.com> writes:
> 
>> This commit adds document for pte_ref under `Documentation/vm/`.
> 
> Thanks for documenting this work!
> 
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>>   Documentation/vm/pte_ref.rst | 212 +++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 212 insertions(+)
>>   create mode 100644 Documentation/vm/pte_ref.rst
> 
> When you add a new RST file, you also need to add it to the associated
> index.rst file or it won't be included in the docs build.  Instead,
> you'll get the "not included in any toctree" warning that you surely saw
> when you tested the docs build with this file :)

OK, I will add it to the associated index.rst in the next version.

> 
>> diff --git a/Documentation/vm/pte_ref.rst b/Documentation/vm/pte_ref.rst
>> new file mode 100644
>> index 000000000000..c5323a263464
>> --- /dev/null
>> +++ b/Documentation/vm/pte_ref.rst
>> @@ -0,0 +1,212 @@
>> +.. _pte_ref:
> 
> Do you need this label anywhere?  If not, I'd leave it out.

I will remove this label in the next version.

Thanks,
Qi

> 
> Thanks,
> 
> jon
>

Patch

diff --git a/Documentation/vm/pte_ref.rst b/Documentation/vm/pte_ref.rst
new file mode 100644
index 000000000000..c5323a263464
--- /dev/null
+++ b/Documentation/vm/pte_ref.rst
@@ -0,0 +1,212 @@ 
+.. _pte_ref:
+
+============================================================================
+pte_ref: Tracking the number of references to each user PTE page table page
+============================================================================
+
+.. contents:: :local:
+
+1. Preface
+==========
+
+To achieve high performance, many applications use high-performance user-mode
+memory allocators such as jemalloc or tcmalloc. These allocators release
+physical memory with ``madvise(MADV_DONTNEED or MADV_FREE)`` for the following
+reasons::
+
+ First, the write lock of mmap_lock should be taken as rarely as possible,
+ since the mmap_lock semaphore has long been a contention point in the memory
+ management subsystem. mmap()/munmap() take the write lock, while
+ madvise(MADV_DONTNEED or MADV_FREE) takes only the read lock, so using
+ madvise() instead of munmap() to release physical memory reduces contention
+ on the mmap_lock.
+
+ Second, after madvise() releases physical memory, there is no need to build
+ the vma and allocate page tables again when the same virtual address is
+ accessed again, which also saves time.
+
+The following table shows the maximum amount of user PTE and PMD page table
+memory that a single user process can allocate on 32-bit and 64-bit systems.
+
++---------------------------+--------+---------+
+|                           | 32-bit | 64-bit  |
++===========================+========+=========+
+| user PTE page table pages | 3 MiB  | 512 GiB |
++---------------------------+--------+---------+
+| user PMD page table pages | 3 KiB  | 1 GiB   |
++---------------------------+--------+---------+
+
+(For 32-bit, this assumes a 3G user address space and a 4K page size; for
+64-bit, it assumes a 48-bit address width and a 4K page size.)
+
+Using ``madvise()`` looks good at first, but as the table above shows, a single
+process can create a large number of PTE page tables on a 64-bit system,
+because neither ``MADV_DONTNEED`` nor ``MADV_FREE`` releases page table
+memory. Until the process exits or calls ``munmap()``, the kernel cannot
+reclaim these pages, even if the PTE page tables no longer map
+anything.
+
+Therefore, we introduce a reference count to manage the life cycle of PTE page
+tables, so that unused PTE page table memory in the system can be
+released dynamically.
+
+2. The reference count of user PTE page table pages
+===================================================
+
+We add two members to the ``struct page`` of the user PTE page table
+page::
+
+ union {
+	pgtable_t pmd_huge_pte; /* protected by page->ptl */
+	pmd_t *pmd;             /* PTE page only */
+ };
+ union {
+	struct mm_struct *pt_mm; /* x86 pgds only */
+	atomic_t pt_frag_refcount; /* powerpc */
+	atomic_t pte_refcount;  /* PTE page only */
+ };
+
+The ``pmd`` member records the pmd entry that maps the user PTE page table page,
+and the ``pte_refcount`` member tracks the number of references to the user PTE
+page table page.
+
+The following users hold a reference on the user PTE page table page::
+
+ Any !pte_none() entry, such as a regular page table entry that maps a
+ physical page, a swap entry, a migration entry, etc.
+
+ Any visitor to the PTE page table entries, such as a page table walker.
+
+Any ``!pte_none()`` entry and any visitor is regarded as a user of its PTE page
+table page. When ``pte_refcount`` drops to 0, no one is using the PTE page
+table page any longer, and the now-free page can be released back to the
+system.
+
+3. Race conditions
+==================
+
+Currently, user page tables are released only by ``free_pgtables()``, which runs
+when the process exits or when ``unmap_region()`` is called (e.g. on the
+``munmap()`` path). Other threads therefore only need mutual exclusion with
+these paths to ensure that the page table is not released. For example::
+
+	thread A			thread B
+	page table walker		munmap
+	=================		======
+
+	mmap_read_lock()
+	if (!pmd_none() && pmd_present() && !pmd_trans_unstable()) {
+		pte_offset_map_lock()
+		*walk page table*
+		pte_unmap_unlock()
+	}
+	mmap_read_unlock()
+
+					mmap_write_lock_killable()
+					detach_vmas_to_be_unmapped()
+					unmap_region()
+					--> free_pgtables()
+
+Once a reference count is introduced for the user PTE page table page, this
+guarantee no longer holds: the page can be released as soon as its
+``pte_refcount`` drops to 0. Therefore, the following case may
+happen::
+
+	thread A		thread B			thread C
+	page table walker	madvise(MADV_DONTNEED)		page fault
+	=================	======================		==========
+
+	mmap_read_lock()
+	if (!pmd_none() && pmd_present() && !pmd_trans_unstable()) {
+
+				mmap_read_lock()
+				unmap_page_range()
+				--> zap_pte_range()
+				    *the pte_refcount is reduced to 0*
+				    --> *free PTE page table page*
+
+		/* broken!! */					mmap_read_lock()
+		pte_offset_map_lock()
+
+As shown above, threads A, B and C all hold the read lock of mmap_lock, so
+they can run concurrently. When thread B releases the PTE page table page,
+the value in the corresponding pmd entry becomes unstable: it may be none, a
+huge pmd, or an entry mapping a new PTE page table page. This can corrupt the
+system and even cause a panic.
+
+Therefore, as described in the section "The reference count of user PTE page
+table pages", a walker must try to take a reference on the PTE page table page
+before walking the page table, which makes the walk safe again::
+
+	thread A		thread B
+	page table walker	madvise(MADV_DONTNEED)
+	=================	======================
+
+	mmap_read_lock()
+	if (!pmd_none() && pmd_present() && !pmd_trans_unstable()) {
+		pte_try_get()
+		--> pte_get_unless_zero
+		*if successful, then:*
+
+				mmap_read_lock()
+				unmap_page_range()
+				--> zap_pte_range()
+				    *the pte_refcount is reduced to 1*
+
+		pte_offset_map_lock()
+		*walk page table*
+		pte_unmap_unlock()
+		pte_put()
+		--> *the pte_refcount is reduced to 0*
+		    --> *free PTE page table page*
+
+There are also lockless paths, such as fast GUP. Fortunately, no additional
+handling is needed to keep them safe, because they run with local interrupts
+disabled. Take fast GUP as an example::
+
+	thread A		thread B
+	fast GUP		madvise(MADV_DONTNEED)
+	========		======================
+
+	get_user_pages_fast_only()
+	--> local_irq_save();
+				*free PTE page table page*
+				--> unhook page
+				    /* The CPU running thread A has disabled
+				     * local interrupts and cannot respond to the
+				     * IPI, so thread B blocks here */
+				    TLB invalidate page
+	    gup_pgd_range();
+	    local_irq_restore();
+	    			    *free page*
+
+4. Helpers
+==========
+
++---------------------+-------------------------------------------------+
+| pte_ref_init        | Initialize the pte_refcount and pmd             |
++---------------------+-------------------------------------------------+
+| pte_to_pmd          | Get the corresponding pmd                       |
++---------------------+-------------------------------------------------+
+| pte_update_pmd      | Update the corresponding pmd                    |
++---------------------+-------------------------------------------------+
+| pte_get             | Increment a pte_refcount                        |
++---------------------+-------------------------------------------------+
+| pte_get_many        | Add a value to a pte_refcount                   |
++---------------------+-------------------------------------------------+
+| pte_get_unless_zero | Increment a pte_refcount unless it is 0         |
++---------------------+-------------------------------------------------+
+| pte_try_get         | Try to increment a pte_refcount                 |
++---------------------+-------------------------------------------------+
+| pte_tryget_map      | Try to increment a pte_refcount before          |
+|                     | pte_offset_map()                                |
++---------------------+-------------------------------------------------+
+| pte_tryget_map_lock | Try to increment a pte_refcount before          |
+|                     | pte_offset_map_lock()                           |
++---------------------+-------------------------------------------------+
+| pte_put             | Decrement a pte_refcount                        |
++---------------------+-------------------------------------------------+
+| pte_put_many        | Subtract a value from a pte_refcount            |
++---------------------+-------------------------------------------------+
+| pte_put_vmf         | Decrement a pte_refcount in the page fault path |
++---------------------+-------------------------------------------------+