diff mbox series

[v3,06/15] mm: introduce refcount for user PTE page table page

Message ID 20211110105428.32458-7-zhengqi.arch@bytedance.com (mailing list archive)
State New
Headers show
Series Free user PTE page table pages | expand

Commit Message

Qi Zheng Nov. 10, 2021, 10:54 a.m. UTC
1. Preface
==========

Now in order to pursue high performance, applications mostly use some
high-performance user-mode memory allocators, such as jemalloc or tcmalloc.
These memory allocators use madvise(MADV_DONTNEED or MADV_FREE) to release
physical memory for the following reasons::

 First of all, we should hold as few write locks of mmap_lock as possible,
 since the mmap_lock semaphore has long been a contention point in the
 memory management subsystem. The mmap()/munmap() hold the write lock, and
 the madvise(MADV_DONTNEED or MADV_FREE) hold the read lock, so using
 madvise() instead of munmap() to released physical memory can reduce the
 competition of the mmap_lock.

 Secondly, after using madvise() to release physical memory, there is no
 need to build vma and allocate page tables again when accessing the same
 virtual address again, which can also save some time.

The following is the largest user PTE page table memory that can be
allocated by a single user process in a 32-bit and a 64-bit system.

+---------------------------+--------+---------+
|                           | 32-bit | 64-bit  |
+===========================+========+=========+
| user PTE page table pages | 3 MiB  | 512 GiB |
+---------------------------+--------+---------+
| user PMD page table pages | 3 KiB  | 1 GiB   |
+---------------------------+--------+---------+

(for 32-bit, take 3G user address space, 4K page size as an example;
 for 64-bit, take 48-bit address width, 4K page size as an example.)

After using madvise(), everything looks good, but as can be seen from the
above table, a single process can create a large number of PTE page tables
on a 64-bit system, since both of the MADV_DONTNEED and MADV_FREE will not
release page table memory. And before the process exits or calls munmap(),
the kernel cannot reclaim these pages even if these PTE page tables do not
map anything.

Therefore, we decided to introduce reference count to manage the PTE page
table life cycle, so that some free PTE page table memory in the system
can be dynamically released.

2. The reference count of user PTE page table pages
===================================================

We introduce two members for the struct page of the user PTE page table
page::

 union {
	pgtable_t pmd_huge_pte; /* protected by page->ptl */
	pmd_t *pmd;             /* PTE page only */
 };
 union {
	struct mm_struct *pt_mm; /* x86 pgds only */
	atomic_t pt_frag_refcount; /* powerpc */
	atomic_t pte_refcount;  /* PTE page only */
 };

The pmd member record the pmd entry that maps the user PTE page table page,
the pte_refcount member keep track of how many references to the user PTE
page table page.

The following people will hold a reference on the user PTE page table
page::

 The !pte_none() entry, such as regular page table entry that map physical
 pages, or swap entry, or migrate entry, etc.

 Visitor to the PTE page table entries, such as page table walker.

Any ``!pte_none()`` entry and visitor can be regarded as the user of its
PTE page table page. When the ``pte_refcount`` is reduced to 0, it means
that no one is using the PTE page table page, then this free PTE page
table page can be released back to the system at this time.

3. Helpers
==========

+---------------------+-------------------------------------------------+
| pte_ref_init        | Initialize the pte_refcount and pmd             |
+---------------------+-------------------------------------------------+
| pte_to_pmd          | Get the corresponding pmd                       |
+---------------------+-------------------------------------------------+
| pte_update_pmd      | Update the corresponding pmd                    |
+---------------------+-------------------------------------------------+
| pte_get             | Increment a pte_refcount                        |
+---------------------+-------------------------------------------------+
| pte_get_many        | Add a value to a pte_refcount                   |
+---------------------+-------------------------------------------------+
| pte_get_unless_zero | Increment a pte_refcount unless it is 0         |
+---------------------+-------------------------------------------------+
| pte_try_get         | Try to increment a pte_refcount                 |
+---------------------+-------------------------------------------------+
| pte_tryget_map      | Try to increment a pte_refcount before          |
|                     | pte_offset_map()                                |
+---------------------+-------------------------------------------------+
| pte_tryget_map_lock | Try to increment a pte_refcount before          |
|                     | pte_offset_map_lock()                           |
+---------------------+-------------------------------------------------+
| pte_put             | Decrement a pte_refcount                        |
+---------------------+-------------------------------------------------+
| pte_put_many        | Sub a value to a pte_refcount                   |
+---------------------+-------------------------------------------------+
| pte_put_vmf         | Decrement a pte_refcount in the page fault path |
+---------------------+-------------------------------------------------+

4. About this commit
====================
This commit just introduces some dummy helpers, the actual logic will
be implemented in future commits.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/mm_types.h |  6 +++-
 include/linux/pte_ref.h  | 87 ++++++++++++++++++++++++++++++++++++++++++++++++
 mm/Makefile              |  4 +--
 mm/pte_ref.c             | 55 ++++++++++++++++++++++++++++++
 4 files changed, 149 insertions(+), 3 deletions(-)
 create mode 100644 include/linux/pte_ref.h
 create mode 100644 mm/pte_ref.c

Comments

kernel test robot Nov. 11, 2021, 12:37 a.m. UTC | #1
Hi Qi,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on hnaz-mm/master]
[also build test ERROR on linus/master next-20211110]
[cannot apply to tip/perf/core tip/x86/core v5.15]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Qi-Zheng/Free-user-PTE-page-table-pages/20211110-185837
base:   https://github.com/hnaz/linux-mm master
config: mips-allyesconfig (attached as .config)
compiler: mips-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/e03404013f81d7b11aa6f5c3fef3816320b2baf0
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Qi-Zheng/Free-user-PTE-page-table-pages/20211110-185837
        git checkout e03404013f81d7b11aa6f5c3fef3816320b2baf0
        # save the attached .config to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=mips prepare

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   error: no override and no default toolchain set
   init/Kconfig:70:warning: 'RUSTC_VERSION': number is invalid
   In file included from include/linux/mmzone.h:21,
                    from include/linux/gfp.h:6,
                    from include/linux/radix-tree.h:12,
                    from include/linux/fs.h:15,
                    from include/linux/compat.h:17,
                    from arch/mips/kernel/asm-offsets.c:12:
>> include/linux/mm_types.h:154:33: error: unknown type name 'pmd_t'
     154 |                                 pmd_t *pmd;             /* PTE page only */
         |                                 ^~~~~
   In file included from include/asm-generic/div64.h:27,
                    from arch/mips/include/asm/div64.h:89,
                    from include/linux/math.h:5,
                    from include/linux/math64.h:6,
                    from include/linux/time.h:6,
                    from include/linux/compat.h:10,
                    from arch/mips/kernel/asm-offsets.c:12:
   include/linux/mm.h: In function 'pte_alloc':
   include/linux/mm.h:2318:22: error: implicit declaration of function 'is_huge_pmd'; did you mean 'zap_huge_pmd'? [-Werror=implicit-function-declaration]
    2318 |         if (unlikely(is_huge_pmd(*pmd)))
         |                      ^~~~~~~~~~~
   include/linux/compiler.h:78:45: note: in definition of macro 'unlikely'
      78 | # define unlikely(x)    __builtin_expect(!!(x), 0)
         |                                             ^
   arch/mips/kernel/asm-offsets.c: At top level:
   arch/mips/kernel/asm-offsets.c:26:6: error: no previous prototype for 'output_ptreg_defines' [-Werror=missing-prototypes]
      26 | void output_ptreg_defines(void)
         |      ^~~~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:78:6: error: no previous prototype for 'output_task_defines' [-Werror=missing-prototypes]
      78 | void output_task_defines(void)
         |      ^~~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:92:6: error: no previous prototype for 'output_thread_info_defines' [-Werror=missing-prototypes]
      92 | void output_thread_info_defines(void)
         |      ^~~~~~~~~~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:108:6: error: no previous prototype for 'output_thread_defines' [-Werror=missing-prototypes]
     108 | void output_thread_defines(void)
         |      ^~~~~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:136:6: error: no previous prototype for 'output_thread_fpu_defines' [-Werror=missing-prototypes]
     136 | void output_thread_fpu_defines(void)
         |      ^~~~~~~~~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:179:6: error: no previous prototype for 'output_mm_defines' [-Werror=missing-prototypes]
     179 | void output_mm_defines(void)
         |      ^~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:218:6: error: no previous prototype for 'output_sc_defines' [-Werror=missing-prototypes]
     218 | void output_sc_defines(void)
         |      ^~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:253:6: error: no previous prototype for 'output_signal_defined' [-Werror=missing-prototypes]
     253 | void output_signal_defined(void)
         |      ^~~~~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:320:6: error: no previous prototype for 'output_pbe_defines' [-Werror=missing-prototypes]
     320 | void output_pbe_defines(void)
         |      ^~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:332:6: error: no previous prototype for 'output_pm_defines' [-Werror=missing-prototypes]
     332 | void output_pm_defines(void)
         |      ^~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:346:6: error: no previous prototype for 'output_kvm_defines' [-Werror=missing-prototypes]
     346 | void output_kvm_defines(void)
         |      ^~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:390:6: error: no previous prototype for 'output_cps_defines' [-Werror=missing-prototypes]
     390 | void output_cps_defines(void)
         |      ^~~~~~~~~~~~~~~~~~
   cc1: all warnings being treated as errors
   make[2]: *** [scripts/Makefile.build:122: arch/mips/kernel/asm-offsets.s] Error 1
   make[2]: Target '__build' not remade because of errors.
   make[1]: *** [Makefile:1288: prepare0] Error 2
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [Makefile:226: __sub-make] Error 2
   make: Target 'prepare' not remade because of errors.


vim +/pmd_t +154 include/linux/mm_types.h

    70	
    71	struct page {
    72		unsigned long flags;		/* Atomic flags, some possibly
    73						 * updated asynchronously */
    74		/*
    75		 * Five words (20/40 bytes) are available in this union.
    76		 * WARNING: bit 0 of the first word is used for PageTail(). That
    77		 * means the other users of this union MUST NOT use the bit to
    78		 * avoid collision and false-positive PageTail().
    79		 */
    80		union {
    81			struct {	/* Page cache and anonymous pages */
    82				/**
    83				 * @lru: Pageout list, eg. active_list protected by
    84				 * lruvec->lru_lock.  Sometimes used as a generic list
    85				 * by the page owner.
    86				 */
    87				struct list_head lru;
    88				/* See page-flags.h for PAGE_MAPPING_FLAGS */
    89				struct address_space *mapping;
    90				pgoff_t index;		/* Our offset within mapping. */
    91				/**
    92				 * @private: Mapping-private opaque data.
    93				 * Usually used for buffer_heads if PagePrivate.
    94				 * Used for swp_entry_t if PageSwapCache.
    95				 * Indicates order in the buddy system if PageBuddy.
    96				 */
    97				unsigned long private;
    98			};
    99			struct {	/* page_pool used by netstack */
   100				/**
   101				 * @pp_magic: magic value to avoid recycling non
   102				 * page_pool allocated pages.
   103				 */
   104				unsigned long pp_magic;
   105				struct page_pool *pp;
   106				unsigned long _pp_mapping_pad;
   107				unsigned long dma_addr;
   108				atomic_long_t pp_frag_count;
   109			};
   110			struct {	/* slab, slob and slub */
   111				union {
   112					struct list_head slab_list;
   113					struct {	/* Partial pages */
   114						struct page *next;
   115	#ifdef CONFIG_64BIT
   116						int pages;	/* Nr of pages left */
   117	#else
   118						short int pages;
   119	#endif
   120					};
   121				};
   122				struct kmem_cache *slab_cache; /* not slob */
   123				/* Double-word boundary */
   124				void *freelist;		/* first free object */
   125				union {
   126					void *s_mem;	/* slab: first object */
   127					unsigned long counters;		/* SLUB */
   128					struct {			/* SLUB */
   129						unsigned inuse:16;
   130						unsigned objects:15;
   131						unsigned frozen:1;
   132					};
   133				};
   134			};
   135			struct {	/* Tail pages of compound page */
   136				unsigned long compound_head;	/* Bit zero is set */
   137	
   138				/* First tail page only */
   139				unsigned char compound_dtor;
   140				unsigned char compound_order;
   141				atomic_t compound_mapcount;
   142				unsigned int compound_nr; /* 1 << compound_order */
   143			};
   144			struct {	/* Second tail page of compound page */
   145				unsigned long _compound_pad_1;	/* compound_head */
   146				atomic_t hpage_pinned_refcount;
   147				/* For both global and memcg */
   148				struct list_head deferred_list;
   149			};
   150			struct {	/* Page table pages */
   151				unsigned long _pt_pad_1;	/* compound_head */
   152				union {
   153					pgtable_t pmd_huge_pte; /* protected by page->ptl */
 > 154					pmd_t *pmd;             /* PTE page only */
   155				};
   156				unsigned long _pt_pad_2;	/* mapping */
   157				union {
   158					struct mm_struct *pt_mm; /* x86 pgds only */
   159					atomic_t pt_frag_refcount; /* powerpc */
   160					atomic_t pte_refcount;  /* PTE page only */
   161				};
   162	#if ALLOC_SPLIT_PTLOCKS
   163				spinlock_t *ptl;
   164	#else
   165				spinlock_t ptl;
   166	#endif
   167			};
   168			struct {	/* ZONE_DEVICE pages */
   169				/** @pgmap: Points to the hosting device page map. */
   170				struct dev_pagemap *pgmap;
   171				void *zone_device_data;
   172				/*
   173				 * ZONE_DEVICE private pages are counted as being
   174				 * mapped so the next 3 words hold the mapping, index,
   175				 * and private fields from the source anonymous or
   176				 * page cache page while the page is migrated to device
   177				 * private memory.
   178				 * ZONE_DEVICE MEMORY_DEVICE_FS_DAX pages also
   179				 * use the mapping, index, and private fields when
   180				 * pmem backed DAX files are mapped.
   181				 */
   182			};
   183	
   184			/** @rcu_head: You can use this to free a page by RCU. */
   185			struct rcu_head rcu_head;
   186		};
   187	
   188		union {		/* This union is 4 bytes in size. */
   189			/*
   190			 * If the page can be mapped to userspace, encodes the number
   191			 * of times this page is referenced by a page table.
   192			 */
   193			atomic_t _mapcount;
   194	
   195			/*
   196			 * If the page is neither PageSlab nor mappable to userspace,
   197			 * the value stored here may help determine what this page
   198			 * is used for.  See page-flags.h for a list of page types
   199			 * which are currently stored here.
   200			 */
   201			unsigned int page_type;
   202	
   203			unsigned int active;		/* SLAB */
   204			int units;			/* SLOB */
   205		};
   206	
   207		/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
   208		atomic_t _refcount;
   209	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
diff mbox series

Patch

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index bb8c6f5f19bc..c599008d54fe 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -149,11 +149,15 @@  struct page {
 		};
 		struct {	/* Page table pages */
 			unsigned long _pt_pad_1;	/* compound_head */
-			pgtable_t pmd_huge_pte; /* protected by page->ptl */
+			union {
+				pgtable_t pmd_huge_pte; /* protected by page->ptl */
+				pmd_t *pmd;             /* PTE page only */
+			};
 			unsigned long _pt_pad_2;	/* mapping */
 			union {
 				struct mm_struct *pt_mm; /* x86 pgds only */
 				atomic_t pt_frag_refcount; /* powerpc */
+				atomic_t pte_refcount;  /* PTE page only */
 			};
 #if ALLOC_SPLIT_PTLOCKS
 			spinlock_t *ptl;
diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h
new file mode 100644
index 000000000000..b6d8335bdc59
--- /dev/null
+++ b/include/linux/pte_ref.h
@@ -0,0 +1,87 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2021, ByteDance. All rights reserved.
+ *
+ * 	Author: Qi Zheng <zhengqi.arch@bytedance.com>
+ */
+#ifndef _LINUX_PTE_REF_H
+#define _LINUX_PTE_REF_H
+
+#include <linux/pgtable.h>
+
+enum pte_tryget_type {
+	TRYGET_SUCCESSED,
+	TRYGET_FAILED_ZERO,
+	TRYGET_FAILED_NONE,
+	TRYGET_FAILED_HUGE_PMD,
+};
+
+bool pte_get_unless_zero(pmd_t *pmd);
+enum pte_tryget_type pte_try_get(pmd_t *pmd);
+void pte_put_vmf(struct vm_fault *vmf);
+
+static inline void pte_ref_init(pgtable_t pte, pmd_t *pmd, int count)
+{
+}
+
+static inline pmd_t *pte_to_pmd(pte_t *pte)
+{
+	return NULL;
+}
+
+static inline void pte_update_pmd(pmd_t old_pmd, pmd_t *new_pmd)
+{
+}
+
+static inline void pte_get_many(pmd_t *pmd, unsigned int nr)
+{
+}
+
+/*
+ * pte_get - Increment refcount for the PTE page table.
+ * @pmd: a pointer to the pmd entry corresponding to the PTE page table.
+ *
+ * Similar to the mechanism of page refcount, the user of PTE page table
+ * should hold a refcount to it before accessing.
+ */
+static inline void pte_get(pmd_t *pmd)
+{
+	pte_get_many(pmd, 1);
+}
+
+static inline pte_t *pte_tryget_map(pmd_t *pmd, unsigned long address)
+{
+	if (pte_try_get(pmd))
+		return NULL;
+
+	return pte_offset_map(pmd, address);
+}
+
+static inline pte_t *pte_tryget_map_lock(struct mm_struct *mm, pmd_t *pmd,
+					 unsigned long address, spinlock_t **ptlp)
+{
+	if (pte_try_get(pmd))
+		return NULL;
+
+	return pte_offset_map_lock(mm, pmd, address, ptlp);
+}
+
+static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmd,
+				unsigned long addr, unsigned int nr)
+{
+}
+
+/*
+ * pte_put - Decrement refcount for the PTE page table.
+ * @mm: the mm_struct of the target address space.
+ * @pmd: a pointer to the pmd entry corresponding to the PTE page table.
+ * @addr: the start address of the tlb range to be flushed.
+ *
+ * The PTE page table page will be freed when the last refcount is dropped.
+ */
+static inline void pte_put(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
+{
+	pte_put_many(mm, pmd, addr, 1);
+}
+
+#endif
diff --git a/mm/Makefile b/mm/Makefile
index d6c0042e3aa0..ea679bf75a5f 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -38,8 +38,8 @@  mmu-y			:= nommu.o
 mmu-$(CONFIG_MMU)	:= highmem.o memory.o mincore.o \
 			   mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \
 			   msync.o page_vma_mapped.o pagewalk.o \
-			   pgtable-generic.o rmap.o vmalloc.o
-
+			   pgtable-generic.o rmap.o vmalloc.o \
+			   pte_ref.o
 
 ifdef CONFIG_CROSS_MEMORY_ATTACH
 mmu-$(CONFIG_MMU)	+= process_vm_access.o
diff --git a/mm/pte_ref.c b/mm/pte_ref.c
new file mode 100644
index 000000000000..de109905bc8f
--- /dev/null
+++ b/mm/pte_ref.c
@@ -0,0 +1,55 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2021, ByteDance. All rights reserved.
+ *
+ * 	Author: Qi Zheng <zhengqi.arch@bytedance.com>
+ */
+
+#include <linux/pte_ref.h>
+#include <linux/mm.h>
+
+/*
+ * pte_get_unless_zero - Increment refcount for the PTE page table
+ *			 unless it is zero.
+ * @pmd: a pointer to the pmd entry corresponding to the PTE page table.
+ */
+bool pte_get_unless_zero(pmd_t *pmd)
+{
+	return true;
+}
+
+/*
+ * pte_try_get - Try to increment refcount for the PTE page table.
+ * @pmd: a pointer to the pmd entry corresponding to the PTE page table.
+ *
+ * Return true if the increment succeeded. Otherwise return false.
+ *
+ * Before Operating the PTE page table, we need to hold a refcount
+ * to protect against the concurrent release of the PTE page table.
+ * But we will fail in the following case:
+ * 	- The content mapped in @pmd is not a PTE page
+ * 	- The refcount of the PTE page table is zero, it will be freed
+ */
+enum pte_tryget_type pte_try_get(pmd_t *pmd)
+{
+	if (unlikely(pmd_none(*pmd)))
+		return TRYGET_FAILED_NONE;
+	if (unlikely(is_huge_pmd(*pmd)))
+		return TRYGET_FAILED_HUGE_PMD;
+
+	return TRYGET_SUCCESSED;
+}
+
+/*
+ * pte_put_vmf - Decrement refcount for the PTE page table.
+ * @vmf: fault information
+ *
+ * The mmap_lock may be unlocked in advance in some cases
+ * in handle_pte_fault(), then the pmd entry will no longer
+ * be stable. For example, the corresponds of the PTE page may
+ * be replaced(e.g. mremap), so we should ensure the pte_put()
+ * is performed in the critical section of the mmap_lock.
+ */
+void pte_put_vmf(struct vm_fault *vmf)
+{
+}