
[v3,27/34] mm: pagewalk: Add 'depth' parameter to pte_hole

Message ID 20190227170608.27963-28-steven.price@arm.com (mailing list archive)
State New, archived
Series Convert x86 & arm64 to use generic page walk

Commit Message

Steven Price Feb. 27, 2019, 5:06 p.m. UTC
The pte_hole() callback is called at multiple levels of the page tables.
Code dumping the kernel page tables needs to know at what depth
the missing entry is. Add this as an extra parameter to pte_hole().
When the depth isn't known (e.g. processing a vma) then -1 is passed.

The depth that is reported is the actual level where the entry is
missing (ignoring any folding that is in place), i.e. any levels where
PTRS_PER_P?D is set to 1 are ignored.

Note that depth starts at 0 for a PGD so that PUD/PMD/PTE retain their
natural numbers as levels 2/3/4.

Signed-off-by: Steven Price <steven.price@arm.com>
---
 fs/proc/task_mmu.c |  4 ++--
 include/linux/mm.h |  6 ++++--
 mm/hmm.c           |  2 +-
 mm/migrate.c       |  1 +
 mm/mincore.c       |  1 +
 mm/pagewalk.c      | 31 +++++++++++++++++++++++++------
 6 files changed, 34 insertions(+), 11 deletions(-)
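
The depth numbering described in the commit message (-1 when unknown, 0 for PGD through 4 for PTE) could be consumed by a page table dumper roughly as follows. This is only a sketch: hole_depth_name() is a hypothetical helper that does not appear in the patch, and the lower-case level names are an assumption about what a dumper might print.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of the patch): map the 'depth' value
 * passed to pte_hole() to a printable level name, following the
 * numbering documented in the patch: -1 unknown, 0 PGD ... 4 PTE.
 */
static const char *hole_depth_name(int depth)
{
	static const char *const names[] = {
		"pgd", "p4d", "pud", "pmd", "pte",
	};

	/* Anything outside 0..4, including the -1 sentinel, is unknown. */
	if (depth < 0 || (size_t)depth >= sizeof(names) / sizeof(names[0]))
		return "unknown";
	return names[depth];
}
```

A pte_hole() implementation in a dumper would call this with the depth it receives; callers that pass -1 (the hugetlb and VM_PFNMAP paths in this patch) would get "unknown".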

Comments

Dave Hansen Feb. 27, 2019, 5:38 p.m. UTC | #1
On 2/27/19 9:06 AM, Steven Price wrote:
>  #ifdef CONFIG_SHMEM
>  static int smaps_pte_hole(unsigned long addr, unsigned long end,
> -		struct mm_walk *walk)
> +			  __always_unused int depth, struct mm_walk *walk)
>  {

I think this 'depth' argument is a mistake.  It's synthetic and it's
surely going to be a source of bugs.

The page table dumpers seem to be using this to dump out the "name" of a
hole which seems a bit bogus in the first place.  I'd much rather teach
the dumpers about the length of the hole, "the hole is 0x12340000 bytes
long", rather than "there's a hole at this level".

Steven Price Feb. 28, 2019, 11:28 a.m. UTC | #2
On 27/02/2019 17:38, Dave Hansen wrote:
> On 2/27/19 9:06 AM, Steven Price wrote:
>>  #ifdef CONFIG_SHMEM
>>  static int smaps_pte_hole(unsigned long addr, unsigned long end,
>> -		struct mm_walk *walk)
>> +			  __always_unused int depth, struct mm_walk *walk)
>>  {
> 
> I think this 'depth' argument is a mistake.  It's synthetic and it's
> surely going to be a source of bugs.
> 
> The page table dumpers seem to be using this to dump out the "name" of a
> hole which seems a bit bogus in the first place.  I'd much rather teach
> the dumpers about the length of the hole, "the hole is 0x12340000 bytes
> long", rather than "there's a hole at this level".

I originally started by trying to calculate the 'depth' from (end -
addr), e.g. for arm64:

level = 4 - (ilog2(end - addr) - PAGE_SHIFT) / (PAGE_SHIFT - 3)

However there are two issues that I encountered:

* walk_page_range() takes a range of addresses to walk. This means that
holes at the beginning/end of the range are clamped to the address
range. This particularly shows up at the end of the range as I use ~0ULL
as the end which leads to (~0ULL - addr) which is 1 byte short of the
desired size. Obviously that particular corner-case is easy to work
round, but it seemed fragile.

* The above definition for arm64 isn't correct in all cases. You need to
account for things like CONFIG_PGTABLE_LEVELS. Other architectures also
have various quirks in their page tables.
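
To make the first issue concrete, the formula can be evaluated in userspace. The sketch below assumes 4 KiB pages (PAGE_SHIFT of 12, so 9 bits per level); ex_ilog2() and ex_level_from_range() are hypothetical names standing in for the kernel's ilog2() and for the expression above.

```c
#include <stdint.h>

#define EX_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Floor of log2(v) for v > 0; a portable stand-in for the kernel's ilog2(). */
static int ex_ilog2(uint64_t v)
{
	int n = -1;

	while (v) {
		v >>= 1;
		n++;
	}
	return n;
}

/* The formula quoted above, evaluated with integer division. */
static int ex_level_from_range(uint64_t addr, uint64_t end)
{
	return 4 - (ex_ilog2(end - addr) - EX_PAGE_SHIFT) / (EX_PAGE_SHIFT - 3);
}
```

With these assumptions a 2 MiB hole maps to level 3 and a 1 GiB hole to level 2. A 1 GiB hole clamped by an end of ~0ULL, e.g. addr = 0xffffffffc0000000, has (end - addr) = 0x3fffffff, so ilog2 drops from 30 to 29 and the formula yields level 3 instead of 2.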

I guess I could try something like:

static int get_level(unsigned long addr, unsigned long end)
{
	/* Add 1 to account for ~0ULL */
	unsigned long size = (end - addr) + 1;
	if (size < PMD_SIZE)
		return 4;
	else if (size < PUD_SIZE)
		return 3;
	else if (size < P4D_SIZE)
		return 2;
	else if (size < PGD_SIZE)
		return 1;
	return 0;
}

There are two immediate problems with that:

 * The "+1" to deal with ~0ULL is fragile

 * PGD_SIZE isn't what you might expect, it's not defined for most
architectures and arm64/x86 use it as the size of the PGD table.
Although that's easy enough to fix up.

Do you think a function like above would be preferable?
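
For reference, the fragility of the "+1" can be seen by running the same logic outside the kernel. This is a sketch under stated assumptions: the EX_* sizes are hypothetical constants loosely modelled on a 5-level layout with 4 KiB pages, and EX_PGDIR_SIZE is taken to be the span of one PGD entry rather than the size of the PGD table.

```c
#include <stdint.h>

/* Assumed sizes (hypothetical): 4 KiB pages, 9 bits per level, 5 levels. */
#define EX_PMD_SIZE	(UINT64_C(1) << 21)
#define EX_PUD_SIZE	(UINT64_C(1) << 30)
#define EX_P4D_SIZE	(UINT64_C(1) << 39)
#define EX_PGDIR_SIZE	(UINT64_C(1) << 48)	/* span of one PGD entry */

static int ex_get_level(uint64_t addr, uint64_t end)
{
	/* Add 1 to account for ~0ULL; this wraps to 0 when addr == 0. */
	uint64_t size = (end - addr) + 1;

	if (size < EX_PMD_SIZE)
		return 4;
	else if (size < EX_PUD_SIZE)
		return 3;
	else if (size < EX_P4D_SIZE)
		return 2;
	else if (size < EX_PGDIR_SIZE)
		return 1;
	return 0;
}
```

A hole covering the last 1 GiB of the address space (end = ~0ULL) is classified as level 2 thanks to the "+1", but a hole covering the whole address space (addr = 0, end = ~0ULL) wraps to a size of 0 and is reported as level 4.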

The other option would of course be to just drop the information from
the debugfs file about at which level the holes are. But it can be
useful information to see whether there are empty levels in the page
table structure. Although this is an area where x86 and arm64 differ
currently (x86 explicitly shows the gaps, arm64 doesn't), so if x86
doesn't mind losing that functionality that would certainly simplify things!

Thanks,

Steve

Dave Hansen Feb. 28, 2019, 7 p.m. UTC | #3
On 2/28/19 3:28 AM, Steven Price wrote:
> static int get_level(unsigned long addr, unsigned long end)
> {
> 	/* Add 1 to account for ~0ULL */
> 	unsigned long size = (end - addr) + 1;
> 	if (size < PMD_SIZE)
> 		return 4;
> 	else if (size < PUD_SIZE)
> 		return 3;
> 	else if (size < P4D_SIZE)
> 		return 2;
> 	else if (size < PGD_SIZE)
> 		return 1;
> 	return 0;
> }
> 
> There are two immediate problems with that:
> 
>  * The "+1" to deal with ~0ULL is fragile
> 
>  * PGD_SIZE isn't what you might expect, it's not defined for most
> architectures and arm64/x86 use it as the size of the PGD table.
> Although that's easy enough to fix up.
> 
> Do you think a function like above would be preferable?

The question still stands of why we *need* the depth/level in the first
place.  As I said, we obviously need it for printing out the "name" of
the level.  Is that it?

> The other option would of course be to just drop the information from
> the debugfs file about at which level the holes are. But it can be
> useful information to see whether there are empty levels in the page
> table structure. Although this is an area where x86 and arm64 differ
> currently (x86 explicitly shows the gaps, arm64 doesn't), so if x86
> doesn't mind losing that functionality that would certainly simplify things!

I think I'd actually be OK with the holes just not showing up.  I
actually find it kinda hard to read sometimes with the holes in there.
I'd be curious what others think though.

Steven Price March 1, 2019, 11:24 a.m. UTC | #4
On 28/02/2019 19:00, Dave Hansen wrote:
> On 2/28/19 3:28 AM, Steven Price wrote:
>> static int get_level(unsigned long addr, unsigned long end)
>> {
>> 	/* Add 1 to account for ~0ULL */
>> 	unsigned long size = (end - addr) + 1;
>> 	if (size < PMD_SIZE)
>> 		return 4;
>> 	else if (size < PUD_SIZE)
>> 		return 3;
>> 	else if (size < P4D_SIZE)
>> 		return 2;
>> 	else if (size < PGD_SIZE)
>> 		return 1;
>> 	return 0;
>> }
>>
>> There are two immediate problems with that:
>>
>>  * The "+1" to deal with ~0ULL is fragile
>>
>>  * PGD_SIZE isn't what you might expect, it's not defined for most
>> architectures and arm64/x86 use it as the size of the PGD table.
>> Although that's easy enough to fix up.
>>
>> Do you think a function like above would be preferable?
> 
> The question still stands of why we *need* the depth/level in the first
> place.  As I said, we obviously need it for printing out the "name" of
> the level.  Is that it?

That is the only use I'm currently aware of.

>> The other option would of course be to just drop the information from
>> the debugfs file about at which level the holes are. But it can be
>> useful information to see whether there are empty levels in the page
>> table structure. Although this is an area where x86 and arm64 differ
>> currently (x86 explicitly shows the gaps, arm64 doesn't), so if x86
>> doesn't mind losing that functionality that would certainly simplify things!
> 
> I think I'd actually be OK with the holes just not showing up.  I
> actually find it kinda hard to read sometimes with the holes in there.
> I'd be curious what others think though.

If no-one has any objections to dropping the holes in the output, then I
can rebase on something like below and drop this 'depth' patch.

Steve

----8<----
From a9eabadfc212389068ec5cc60265c7a55585bb76 Mon Sep 17 00:00:00 2001
From: Steven Price <steven.price@arm.com>
Date: Fri, 1 Mar 2019 10:06:33 +0000
Subject: [PATCH] x86: mm: Hide page table holes in debugfs

For the /sys/kernel/debug/page_tables/ files, rather than outputting a
mostly empty line when a block of memory isn't present just skip the
line. This keeps the output shorter and will help with a future change
switching to using the generic page walk code as we no longer care about
the 'level' that the page table holes are at.

Signed-off-by: Steven Price <steven.price@arm.com>
---
 arch/x86/mm/dump_pagetables.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index e3cdc85ce5b6..a0f4139631dd 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -304,8 +304,8 @@ static void note_page(struct seq_file *m, struct pg_state *st,
 		/*
 		 * Now print the actual finished series
 		 */
-		if (!st->marker->max_lines ||
-		    st->lines < st->marker->max_lines) {
+		if ((cur & _PAGE_PRESENT) && (!st->marker->max_lines ||
+		    st->lines < st->marker->max_lines)) {
 			pt_dump_seq_printf(m, st->to_dmesg,
 					   "0x%0*lx-0x%0*lx   ",
 					   width, st->start_address,
@@ -321,7 +321,9 @@ static void note_page(struct seq_file *m, struct pg_state *st,
 			printk_prot(m, st->current_prot, st->level,
 				    st->to_dmesg);
 		}
-		st->lines++;
+		if (cur & _PAGE_PRESENT) {
+			st->lines++;
+		}

 		/*
 		 * We print markers for special areas of address space,

Patch

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f0ec9edab2f3..91131cd4e9e0 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -474,7 +474,7 @@  static void smaps_account(struct mem_size_stats *mss, struct page *page,
 
 #ifdef CONFIG_SHMEM
 static int smaps_pte_hole(unsigned long addr, unsigned long end,
-		struct mm_walk *walk)
+			  __always_unused int depth, struct mm_walk *walk)
 {
 	struct mem_size_stats *mss = walk->private;
 
@@ -1203,7 +1203,7 @@  static int add_to_pagemap(unsigned long addr, pagemap_entry_t *pme,
 }
 
 static int pagemap_pte_hole(unsigned long start, unsigned long end,
-				struct mm_walk *walk)
+			    __always_unused int depth, struct mm_walk *walk)
 {
 	struct pagemapread *pm = walk->private;
 	unsigned long addr = start;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1a4b1615d012..4ae3634a9118 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1420,7 +1420,9 @@  void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
  *	       pmd_trans_huge() pmds.  They may simply choose to
  *	       split_huge_page() instead of handling it explicitly.
  * @pte_entry: if set, called for each non-empty PTE (4th-level) entry
- * @pte_hole: if set, called for each hole at all levels
+ * @pte_hole: if set, called for each hole at all levels,
+ *            depth is -1 if not known, 0:PGD, 1:P4D, 2:PUD, 3:PMD, 4:PTE
+ *            any depths where PTRS_PER_P?D is equal to 1 are skipped
  * @hugetlb_entry: if set, called for each hugetlb entry
  * @test_walk: caller specific callback function to determine whether
  *             we walk over the current vma or not. Returning 0
@@ -1445,7 +1447,7 @@  struct mm_walk {
 	int (*pte_entry)(pte_t *pte, unsigned long addr,
 			 unsigned long next, struct mm_walk *walk);
 	int (*pte_hole)(unsigned long addr, unsigned long next,
-			struct mm_walk *walk);
+			int depth, struct mm_walk *walk);
 	int (*hugetlb_entry)(pte_t *pte, unsigned long hmask,
 			     unsigned long addr, unsigned long next,
 			     struct mm_walk *walk);
diff --git a/mm/hmm.c b/mm/hmm.c
index a04e4b810610..e3e6b8fda437 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -440,7 +440,7 @@  static void hmm_range_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
 }
 
 static int hmm_vma_walk_hole(unsigned long addr, unsigned long end,
-			     struct mm_walk *walk)
+			     __always_unused int depth, struct mm_walk *walk)
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct hmm_range *range = hmm_vma_walk->range;
diff --git a/mm/migrate.c b/mm/migrate.c
index d4fd680be3b0..8b62a9fecb5c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2121,6 +2121,7 @@  struct migrate_vma {
 
 static int migrate_vma_collect_hole(unsigned long start,
 				    unsigned long end,
+				    __always_unused int depth,
 				    struct mm_walk *walk)
 {
 	struct migrate_vma *migrate = walk->private;
diff --git a/mm/mincore.c b/mm/mincore.c
index 218099b5ed31..c4edbc688241 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -104,6 +104,7 @@  static int __mincore_unmapped_range(unsigned long addr, unsigned long end,
 }
 
 static int mincore_unmapped_range(unsigned long addr, unsigned long end,
+				   __always_unused int depth,
 				   struct mm_walk *walk)
 {
 	walk->private += __mincore_unmapped_range(addr, end,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index dac0c848b458..57946bcd810c 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -4,6 +4,22 @@ 
 #include <linux/sched.h>
 #include <linux/hugetlb.h>
 
+/*
+ * We want to know the real level where an entry is located ignoring any
+ * folding of levels which may be happening. For example if p4d is folded then
+ * a missing entry found at level 1 (p4d) is actually at level 0 (pgd).
+ */
+static int real_depth(int depth)
+{
+	if (depth == 3 && PTRS_PER_PMD == 1)
+		depth = 2;
+	if (depth == 2 && PTRS_PER_PUD == 1)
+		depth = 1;
+	if (depth == 1 && PTRS_PER_P4D == 1)
+		depth = 0;
+	return depth;
+}
+
 static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 			  struct mm_walk *walk)
 {
@@ -31,6 +47,7 @@  static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 	pmd_t *pmd;
 	unsigned long next;
 	int err = 0;
+	int depth = real_depth(3);
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -38,7 +55,7 @@  static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 		next = pmd_addr_end(addr, end);
 		if (pmd_none(*pmd)) {
 			if (walk->pte_hole)
-				err = walk->pte_hole(addr, next, walk);
+				err = walk->pte_hole(addr, next, depth, walk);
 			if (err)
 				break;
 			continue;
@@ -81,6 +98,7 @@  static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 	pud_t *pud;
 	unsigned long next;
 	int err = 0;
+	int depth = real_depth(2);
 
 	pud = pud_offset(p4d, addr);
 	do {
@@ -88,7 +106,7 @@  static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 		next = pud_addr_end(addr, end);
 		if (pud_none(*pud)) {
 			if (walk->pte_hole)
-				err = walk->pte_hole(addr, next, walk);
+				err = walk->pte_hole(addr, next, depth, walk);
 			if (err)
 				break;
 			continue;
@@ -123,13 +141,14 @@  static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 	p4d_t *p4d;
 	unsigned long next;
 	int err = 0;
+	int depth = real_depth(1);
 
 	p4d = p4d_offset(pgd, addr);
 	do {
 		next = p4d_addr_end(addr, end);
 		if (p4d_none_or_clear_bad(p4d)) {
 			if (walk->pte_hole)
-				err = walk->pte_hole(addr, next, walk);
+				err = walk->pte_hole(addr, next, depth, walk);
 			if (err)
 				break;
 			continue;
@@ -160,7 +179,7 @@  static int walk_pgd_range(unsigned long addr, unsigned long end,
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd)) {
 			if (walk->pte_hole)
-				err = walk->pte_hole(addr, next, walk);
+				err = walk->pte_hole(addr, next, 0, walk);
 			if (err)
 				break;
 			continue;
@@ -206,7 +225,7 @@  static int walk_hugetlb_range(unsigned long addr, unsigned long end,
 		if (pte)
 			err = walk->hugetlb_entry(pte, hmask, addr, next, walk);
 		else if (walk->pte_hole)
-			err = walk->pte_hole(addr, next, walk);
+			err = walk->pte_hole(addr, next, -1, walk);
 
 		if (err)
 			break;
@@ -249,7 +268,7 @@  static int walk_page_test(unsigned long start, unsigned long end,
 	if (vma->vm_flags & VM_PFNMAP) {
 		int err = 1;
 		if (walk->pte_hole)
-			err = walk->pte_hole(start, end, walk);
+			err = walk->pte_hole(start, end, -1, walk);
 		return err ? err : 1;
 	}
 	return 0;
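
The folding logic of real_depth() in the patch above can be exercised outside the kernel by turning the PTRS_PER_P?D constants into parameters. The following is a sketch only: struct ex_pgtable_cfg and ex_real_depth() are hypothetical names, and the entry counts used with them are assumptions rather than values taken from any particular architecture.

```c
/* Hypothetical per-level entry counts; a value of 1 marks a folded level. */
struct ex_pgtable_cfg {
	int ptrs_per_p4d;
	int ptrs_per_pud;
	int ptrs_per_pmd;
};

/*
 * Same cascade as real_depth() in the patch: a hole found at a folded
 * level is attributed to the nearest non-folded level above it.
 */
static int ex_real_depth(const struct ex_pgtable_cfg *cfg, int depth)
{
	if (depth == 3 && cfg->ptrs_per_pmd == 1)
		depth = 2;
	if (depth == 2 && cfg->ptrs_per_pud == 1)
		depth = 1;
	if (depth == 1 && cfg->ptrs_per_p4d == 1)
		depth = 0;
	return depth;
}
```

With only the P4D folded (counts 1, 512, 512), a hole seen at depth 1 is reported as depth 0 while depths 2 and 3 are unchanged. With P4D and PUD both folded (1, 1, 512), a hole seen at depth 2 cascades through both checks and is also reported as depth 0.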