
[2/2] x86/vmemmap: Handle unpopulated sub-pmd ranges

Message ID 20210202112450.11932-3-osalvador@suse.de (mailing list archive)
State New, archived
Series Cleanup and fixups for vmemmap handling

Commit Message

Oscar Salvador Feb. 2, 2021, 11:24 a.m. UTC
When the size of the memmap for a section is not a multiple of 2MB, sections
do not span a whole PMD anymore, and so when populating them some parts of
the PMD will remain unused.
Because of this, PMDs will be left behind when depopulating sections, since
remove_pmd_table() thinks that those unused parts are still in use.

Fix this by marking the unused parts with PAGE_UNUSED, so memchr_inv()
will do the right thing and will let us free the PMD when the last user
of it is gone.
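
For illustration, here is a minimal user-space sketch of the idea: mark the
unused tail of a PMD-backed memmap range with a marker byte, and only free the
PMD once no byte in the whole range differs from that marker. PMD_SIZE and
PAGE_UNUSED mirror the patch; memchr_inv() is a kernel helper, so a trivial
stand-in is defined here; none of this is the kernel code itself.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define PMD_SIZE	(2UL * 1024 * 1024)	/* one 2MB mapping */
#define PAGE_UNUSED	0xFD			/* marker byte from the patch */

/* Pretend this buffer is the vmemmap range backed by a single PMD. */
static unsigned char pmd_range[PMD_SIZE];

/* Minimal stand-in for the kernel's memchr_inv(): first byte != c, or NULL. */
static void *memchr_inv(const void *s, int c, size_t n)
{
	const unsigned char *p = s;
	size_t i;

	for (i = 0; i < n; i++)
		if (p[i] != (unsigned char)c)
			return (void *)(p + i);
	return NULL;
}

/* Mark a sub-range as unused, as done when a section is depopulated. */
static void unuse_range(size_t start, size_t end)
{
	memset(pmd_range + start, PAGE_UNUSED, end - start);
}

/* The PMD may only be freed once every byte reads PAGE_UNUSED. */
static bool pmd_fully_unused(void)
{
	return !memchr_inv(pmd_range, PAGE_UNUSED, PMD_SIZE);
}

int main(void)
{
	memset(pmd_range, 0xAB, PMD_SIZE);	/* contents "in use" */

	unuse_range(0, PMD_SIZE / 2);		/* first user goes away */
	printf("half unused:  %s\n", pmd_fully_unused() ? "free it" : "keep it");

	unuse_range(PMD_SIZE / 2, PMD_SIZE);	/* last user goes away */
	printf("fully unused: %s\n", pmd_fully_unused() ? "free it" : "keep it");

	return 0;
}

Built as a plain C program, this prints "keep it" after only the first half is
marked and "free it" once the whole range reads PAGE_UNUSED, which is the
decision remove_pmd_table() needs to make.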

This patch is based on similar patches by David Hildenbrand:

https://lore.kernel.org/linux-mm/20200722094558.9828-9-david@redhat.com/
https://lore.kernel.org/linux-mm/20200722094558.9828-10-david@redhat.com/

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 arch/x86/mm/init_64.c | 87 ++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 77 insertions(+), 10 deletions(-)

Comments

David Hildenbrand Feb. 2, 2021, 1:29 p.m. UTC | #1
> @@ -1088,10 +1150,10 @@ remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
>   				pages++;
>   			} else {
>   				/* If here, we are freeing vmemmap pages. */
> -				memset((void *)addr, PAGE_INUSE, next - addr);
> +				memset((void *)addr, PAGE_UNUSED, next - addr);
>   
>   				page_addr = page_address(pud_page(*pud));
> -				if (!memchr_inv(page_addr, PAGE_INUSE,
> +				if (!memchr_inv(page_addr, PAGE_UNUSED,
>   						PUD_SIZE)) {
>   					free_pagetable(pud_page(*pud),
>   						       get_order(PUD_SIZE));

I'm sorry to bother you again, but isn't that dead code as well?

How do we ever end up using 1GB pages for the vmemmap? At least not via 
vmemmap_populate() - so I guess never? There are not many occurrences of 
"PUD_SIZE" in the file after all ...

I think we can simplify that code.
Oscar Salvador Feb. 2, 2021, 1:52 p.m. UTC | #2
On Tue, Feb 02, 2021 at 02:29:11PM +0100, David Hildenbrand wrote:
> > @@ -1088,10 +1150,10 @@ remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
> >   				pages++;
> >   			} else {
> >   				/* If here, we are freeing vmemmap pages. */
> > -				memset((void *)addr, PAGE_INUSE, next - addr);
> > +				memset((void *)addr, PAGE_UNUSED, next - addr);
> >   				page_addr = page_address(pud_page(*pud));
> > -				if (!memchr_inv(page_addr, PAGE_INUSE,
> > +				if (!memchr_inv(page_addr, PAGE_UNUSED,
> >   						PUD_SIZE)) {
> >   					free_pagetable(pud_page(*pud),
> >   						       get_order(PUD_SIZE));
> 
> I'm sorry to bother you again, but isn't that dead code as well?

Heh, I spotted that earlier, but I did not think much of it honestly.

All this was introduced by:

 commit ae9aae9eda2db71bf4b592f15618b0160eb07731
 Author: Wen Congyang <wency@cn.fujitsu.com>
 Date:   Fri Feb 22 16:33:04 2013 -0800

     memory-hotplug: common APIs to support page tables hot-remove


> How do we ever end up using 1GB pages for the vmemmap? At least not via
> vmemmap_populate() - so I guess never? There are not many occurrences of
> "PUD_SIZE" in the file after all ...

AFAICT, we don't. The largest we populate for vmemmap is 2MB.
I see init_memory_mapping can use 1G, but that should not affect us.

I guess that the vmemmap handling for 1GB can go as well.
I will update the patchset.
kernel test robot Feb. 2, 2021, 8:17 p.m. UTC | #3
Hi Oscar,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/x86/mm]
[also build test ERROR on hnaz-linux-mm/master v5.11-rc6 next-20210125]
[cannot apply to luto/next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Oscar-Salvador/Cleanup-and-fixups-for-vmemmap-handling/20210202-192636
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 167dcfc08b0b1f964ea95d410aa496fd78adf475
config: x86_64-randconfig-r004-20210202 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project 275c6af7d7f1ed63a03d05b4484413e447133269)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install x86_64 cross compiling tool for clang build
        # apt-get install binutils-x86-64-linux-gnu
        # https://github.com/0day-ci/linux/commit/2995155f4651bbb8c0d5f2e58e6e77321c5a889a
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Oscar-Salvador/Cleanup-and-fixups-for-vmemmap-handling/20210202-192636
        git checkout 2995155f4651bbb8c0d5f2e58e6e77321c5a889a
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64 

If you fix the issue, kindly add the following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> arch/x86/mm/init_64.c:1588:6: error: implicit declaration of function 'vmemmap_use_new_sub_pmd' [-Werror,-Wimplicit-function-declaration]
                                           vmemmap_use_new_sub_pmd(addr, next);
                                           ^
>> arch/x86/mm/init_64.c:1594:4: error: implicit declaration of function 'vmemmap_use_sub_pmd' [-Werror,-Wimplicit-function-declaration]
                           vmemmap_use_sub_pmd(addr, next);
                           ^
   2 errors generated.


vim +/vmemmap_use_new_sub_pmd +1588 arch/x86/mm/init_64.c

  1535	
  1536	static int __meminit vmemmap_populate_hugepages(unsigned long start,
  1537			unsigned long end, int node, struct vmem_altmap *altmap)
  1538	{
  1539		unsigned long addr;
  1540		unsigned long next;
  1541		pgd_t *pgd;
  1542		p4d_t *p4d;
  1543		pud_t *pud;
  1544		pmd_t *pmd;
  1545	
  1546		for (addr = start; addr < end; addr = next) {
  1547			next = pmd_addr_end(addr, end);
  1548	
  1549			pgd = vmemmap_pgd_populate(addr, node);
  1550			if (!pgd)
  1551				return -ENOMEM;
  1552	
  1553			p4d = vmemmap_p4d_populate(pgd, addr, node);
  1554			if (!p4d)
  1555				return -ENOMEM;
  1556	
  1557			pud = vmemmap_pud_populate(p4d, addr, node);
  1558			if (!pud)
  1559				return -ENOMEM;
  1560	
  1561			pmd = pmd_offset(pud, addr);
  1562			if (pmd_none(*pmd)) {
  1563				void *p;
  1564	
  1565				p = vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);
  1566				if (p) {
  1567					pte_t entry;
  1568	
  1569					entry = pfn_pte(__pa(p) >> PAGE_SHIFT,
  1570							PAGE_KERNEL_LARGE);
  1571					set_pmd(pmd, __pmd(pte_val(entry)));
  1572	
  1573					/* check to see if we have contiguous blocks */
  1574					if (p_end != p || node_start != node) {
  1575						if (p_start)
  1576							pr_debug(" [%lx-%lx] PMD -> [%p-%p] on node %d\n",
  1577							       addr_start, addr_end-1, p_start, p_end-1, node_start);
  1578						addr_start = addr;
  1579						node_start = node;
  1580						p_start = p;
  1581					}
  1582	
  1583					addr_end = addr + PMD_SIZE;
  1584					p_end = p + PMD_SIZE;
  1585	
  1586					if (!IS_ALIGNED(addr, PMD_SIZE) ||
  1587					    !IS_ALIGNED(next, PMD_SIZE))
> 1588						vmemmap_use_new_sub_pmd(addr, next);
  1589					continue;
  1590				} else if (altmap)
  1591					return -ENOMEM; /* no fallback */
  1592			} else if (pmd_large(*pmd)) {
  1593				vmemmap_verify((pte_t *)pmd, node, addr, next);
> 1594				vmemmap_use_sub_pmd(addr, next);
  1595				continue;
  1596			}
  1597			if (vmemmap_populate_basepages(addr, next, node, NULL))
  1598				return -ENOMEM;
  1599		}
  1600		return 0;
  1601	}
  1602	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

Patch

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 4cfa902ec861..b239708e504e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -871,7 +871,72 @@  int arch_add_memory(int nid, u64 start, u64 size,
 	return add_pages(nid, start_pfn, nr_pages, params);
 }
 
-#define PAGE_INUSE 0xFD
+#define PAGE_UNUSED 0xFD
+
+/*
+ * The unused vmemmap range, which was not yet memset(PAGE_UNUSED), ranges
+ * from unused_pmd_start to the next PMD_SIZE boundary.
+ */
+static unsigned long unused_pmd_start __meminitdata;
+
+static void __meminit vmemmap_flush_unused_pmd(void)
+{
+	if (!unused_pmd_start)
+		return;
+	/*
+	 * Clears (unused_pmd_start, PMD_END]
+	 */
+	memset((void *)unused_pmd_start, PAGE_UNUSED,
+	       ALIGN(unused_pmd_start, PMD_SIZE) - unused_pmd_start);
+	unused_pmd_start = 0;
+}
+
+/* Returns true if the PMD is completely unused and thus it can be freed */
+static bool __meminit vmemmap_unuse_sub_pmd(unsigned long addr, unsigned long end)
+{
+	unsigned long start = ALIGN_DOWN(addr, PMD_SIZE);
+
+	vmemmap_flush_unused_pmd();
+	memset((void *)addr, PAGE_UNUSED, end - addr);
+
+	return !memchr_inv((void *)start, PAGE_UNUSED, PMD_SIZE);
+}
+
+static void __meminit vmemmap_use_sub_pmd(unsigned long start, unsigned long end)
+{
+	/*
+	 * We only optimize if the new used range directly follows the
+	 * previously unused range (esp., when populating consecutive sections).
+	 */
+	if (unused_pmd_start == start) {
+		if (likely(IS_ALIGNED(end, PMD_SIZE)))
+			unused_pmd_start = 0;
+		else
+			unused_pmd_start = end;
+		return;
+	}
+
+	vmemmap_flush_unused_pmd();
+}
+
+static void __meminit vmemmap_use_new_sub_pmd(unsigned long start, unsigned long end)
+{
+	vmemmap_flush_unused_pmd();
+
+	/*
+	 * Mark the unused parts of the new memmap range
+	 */
+	if (!IS_ALIGNED(start, PMD_SIZE))
+		memset((void *)ALIGN_DOWN(start, PMD_SIZE), PAGE_UNUSED,
+		       start - ALIGN_DOWN(start, PMD_SIZE));
+	/*
+	 * We want to avoid memset(PAGE_UNUSED) when populating the vmemmap of
+	 * consecutive sections. Remember for the last added PMD the last
+	 * unused range in the populated PMD.
+	 */
+	if (!IS_ALIGNED(end, PMD_SIZE))
+		unused_pmd_start = end;
+}
 
 static void __meminit free_pagetable(struct page *page, int order)
 {
@@ -1010,7 +1075,6 @@  remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
 	unsigned long next, pages = 0;
 	pte_t *pte_base;
 	pmd_t *pmd;
-	void *page_addr;
 
 	pmd = pmd_start + pmd_index(addr);
 	for (; addr < end; addr = next, pmd++) {
@@ -1031,12 +1095,10 @@  remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
 				spin_unlock(&init_mm.page_table_lock);
 				pages++;
 			} else {
-				/* If here, we are freeing vmemmap pages. */
-				memset((void *)addr, PAGE_INUSE, next - addr);
-
-				page_addr = page_address(pmd_page(*pmd));
-				if (!memchr_inv(page_addr, PAGE_INUSE,
-						PMD_SIZE)) {
+				/*
+				 * Free the PMD if the whole range is unused.
+				 */
+				if (vmemmap_unuse_sub_pmd(addr, next)) {
 					free_hugepage_table(pmd_page(*pmd),
 							    altmap);
 
@@ -1088,10 +1150,10 @@  remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
 				pages++;
 			} else {
 				/* If here, we are freeing vmemmap pages. */
-				memset((void *)addr, PAGE_INUSE, next - addr);
+				memset((void *)addr, PAGE_UNUSED, next - addr);
 
 				page_addr = page_address(pud_page(*pud));
-				if (!memchr_inv(page_addr, PAGE_INUSE,
+				if (!memchr_inv(page_addr, PAGE_UNUSED,
 						PUD_SIZE)) {
 					free_pagetable(pud_page(*pud),
 						       get_order(PUD_SIZE));
@@ -1520,11 +1582,16 @@  static int __meminit vmemmap_populate_hugepages(unsigned long start,
 
 				addr_end = addr + PMD_SIZE;
 				p_end = p + PMD_SIZE;
+
+				if (!IS_ALIGNED(addr, PMD_SIZE) ||
+				    !IS_ALIGNED(next, PMD_SIZE))
+					vmemmap_use_new_sub_pmd(addr, next);
 				continue;
 			} else if (altmap)
 				return -ENOMEM; /* no fallback */
 		} else if (pmd_large(*pmd)) {
 			vmemmap_verify((pte_t *)pmd, node, addr, next);
+			vmemmap_use_sub_pmd(addr, next);
 			continue;
 		}
 		if (vmemmap_populate_basepages(addr, next, node, NULL))