
mm: Speed up mremap on large regions

Message ID 20181009201400.168705-1-joel@joelfernandes.org (mailing list archive)
State New, archived
Series: mm: Speed up mremap on large regions

Commit Message

Joel Fernandes Oct. 9, 2018, 8:14 p.m. UTC
Android needs to mremap large regions of memory during memory management
related operations. The mremap system call can be really slow if THP is
not enabled. The bottleneck is move_page_tables, which copies one pte
at a time, and can be really slow across a large map. Turning on THP
may not be a viable option, and is not for us. This patch speeds up the
performance for non-THP systems by copying at the PMD level when possible.

The speedup is three orders of magnitude. On a 1GB mremap, the mremap
completion time drops from 160-250 milliseconds to 380-400 microseconds.

Before:
Total mremap time for 1GB data: 242321014 nanoseconds.
Total mremap time for 1GB data: 196842467 nanoseconds.
Total mremap time for 1GB data: 167051162 nanoseconds.

After:
Total mremap time for 1GB data: 385781 nanoseconds.
Total mremap time for 1GB data: 388959 nanoseconds.
Total mremap time for 1GB data: 402813 nanoseconds.

In case THP is enabled, the optimization is skipped. I also flush the
TLB every time we do this optimization since I couldn't find a way to
determine if the low-level PTEs are dirty. The cost of doing so is small
compared to the improvement, on both x86-64 and arm64.

Cc: minchan@google.com
Cc: hughd@google.com
Cc: lokeshgidra@google.com
Cc: kernel-team@android.com
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 mm/mremap.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)
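
The test program used for these measurements is not included in the posting;
a minimal userspace sketch that would produce output in the quoted format
(assuming an anonymous, pre-faulted 1GB mapping moved with MREMAP_MAYMOVE)
could look like this:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define SZ (1UL << 30)	/* 1GB */

int main(void)
{
	struct timespec t0, t1;

	/* Map and fault in 1GB of anonymous memory. */
	char *old = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (old == MAP_FAILED)
		return 1;
	memset(old, 0xaa, SZ);

	/* Time a move of the whole region to a new address. */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	void *new = mremap(old, SZ, SZ, MREMAP_MAYMOVE);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	if (new == MAP_FAILED)
		return 1;

	printf("Total mremap time for 1GB data: %ld nanoseconds.\n",
	       (t1.tv_sec - t0.tv_sec) * 1000000000L +
	       (t1.tv_nsec - t0.tv_nsec));
	return 0;
}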

Comments

Andrew Morton Oct. 9, 2018, 9:38 p.m. UTC | #1
On Tue,  9 Oct 2018 13:14:00 -0700 "Joel Fernandes (Google)" <joel@joelfernandes.org> wrote:

> Android needs to mremap large regions of memory during memory management
> related operations. The mremap system call can be really slow if THP is
> not enabled. The bottleneck is move_page_tables, which is copying each
> pte at a time, and can be really slow across a large map. Turning on THP
> may not be a viable option, and is not for us. This patch speeds up the
> performance for non-THP system by copying at the PMD level when possible.
> 
> The speed up is three orders of magnitude. On a 1GB mremap, the mremap
> completion times drops from 160-250 millesconds to 380-400 microseconds.
> 
> Before:
> Total mremap time for 1GB data: 242321014 nanoseconds.
> Total mremap time for 1GB data: 196842467 nanoseconds.
> Total mremap time for 1GB data: 167051162 nanoseconds.
> 
> After:
> Total mremap time for 1GB data: 385781 nanoseconds.
> Total mremap time for 1GB data: 388959 nanoseconds.
> Total mremap time for 1GB data: 402813 nanoseconds.
> 
> Incase THP is enabled, the optimization is skipped. I also flush the
> tlb every time we do this optimization since I couldn't find a way to
> determine if the low-level PTEs are dirty. It is seen that the cost of
> doing so is not much compared the improvement, on both x86-64 and arm64.

Looks tasty.

> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
>  		drop_rmap_locks(vma);
>  }
>  
> +bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,

I'll park this for now, shall plan to add a `static' in there then
merge it up after 4.20-rc1.
Kirill A. Shutemov Oct. 9, 2018, 10:02 p.m. UTC | #2
On Tue, Oct 09, 2018 at 01:14:00PM -0700, Joel Fernandes (Google) wrote:
> Android needs to mremap large regions of memory during memory management
> related operations. The mremap system call can be really slow if THP is
> not enabled. The bottleneck is move_page_tables, which is copying each
> pte at a time, and can be really slow across a large map. Turning on THP
> may not be a viable option, and is not for us. This patch speeds up the
> performance for non-THP system by copying at the PMD level when possible.
> 
> The speed up is three orders of magnitude. On a 1GB mremap, the mremap
> completion times drops from 160-250 millesconds to 380-400 microseconds.
> 
> Before:
> Total mremap time for 1GB data: 242321014 nanoseconds.
> Total mremap time for 1GB data: 196842467 nanoseconds.
> Total mremap time for 1GB data: 167051162 nanoseconds.
> 
> After:
> Total mremap time for 1GB data: 385781 nanoseconds.
> Total mremap time for 1GB data: 388959 nanoseconds.
> Total mremap time for 1GB data: 402813 nanoseconds.
> 
> Incase THP is enabled, the optimization is skipped. I also flush the
> tlb every time we do this optimization since I couldn't find a way to
> determine if the low-level PTEs are dirty. It is seen that the cost of
> doing so is not much compared the improvement, on both x86-64 and arm64.

Okay. That's interesting.

It makes me wonder why we pass a virtual address to pte_alloc() (and
pte_alloc_one() inside).

If an arch has a real requirement to tie a page table to a virtual address,
then the optimization cannot be used as it is. A per-arch opt-in should be
fine for this case, I guess.

If nobody uses the address, we should just drop the argument as a
preparation for the patch.

Anyway, I think the optimization requires some groundwork before it can be
accepted. At least some explanation of why it is safe to move a page table
from one spot in the virtual address space to another.
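
For context, the generic plumbing in question, roughly as it stood at the
time (the signatures are visible in the mm/memory.c hunk later in this
thread; the address is handed down to the per-arch allocators, but none of
them consume it):

/* include/linux/mm.h, simplified: the address is threaded through. */
#define pte_alloc(mm, pmd, address) \
	(unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd, address))

/* mm/memory.c: calls the per-arch hook pte_alloc_one(mm, address). */
int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address);

/* Per-arch hooks whose 'address' argument is under discussion. */
pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long address);
pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address);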
Joel Fernandes Oct. 9, 2018, 11:04 p.m. UTC | #3
On Wed, Oct 10, 2018 at 01:02:22AM +0300, Kirill A. Shutemov wrote:
> On Tue, Oct 09, 2018 at 01:14:00PM -0700, Joel Fernandes (Google) wrote:
> > Android needs to mremap large regions of memory during memory management
> > related operations. The mremap system call can be really slow if THP is
> > not enabled. The bottleneck is move_page_tables, which is copying each
> > pte at a time, and can be really slow across a large map. Turning on THP
> > may not be a viable option, and is not for us. This patch speeds up the
> > performance for non-THP system by copying at the PMD level when possible.
> > 
> > The speed up is three orders of magnitude. On a 1GB mremap, the mremap
> > completion times drops from 160-250 millesconds to 380-400 microseconds.
> > 
> > Before:
> > Total mremap time for 1GB data: 242321014 nanoseconds.
> > Total mremap time for 1GB data: 196842467 nanoseconds.
> > Total mremap time for 1GB data: 167051162 nanoseconds.
> > 
> > After:
> > Total mremap time for 1GB data: 385781 nanoseconds.
> > Total mremap time for 1GB data: 388959 nanoseconds.
> > Total mremap time for 1GB data: 402813 nanoseconds.
> > 
> > Incase THP is enabled, the optimization is skipped. I also flush the
> > tlb every time we do this optimization since I couldn't find a way to
> > determine if the low-level PTEs are dirty. It is seen that the cost of
> > doing so is not much compared the improvement, on both x86-64 and arm64.
> 
> Okay. That's interesting.
> 
> It makes me wounder why do we pass virtual address to pte_alloc() (and
> pte_alloc_one() inside).
> 
> If an arch has real requirement to tight a page table to a virtual address
> than the optimization cannot be used as it is. Per-arch should be fine
> for this case, I guess.
> 
> If nobody uses the address we should just drop the argument as a
> preparation to the patch.

I couldn't find any use of the address. But I am wondering why you feel
passing the address is something that can't be done with the optimization.
The pte_alloc only happens if the optimization is not triggered.

Also, the cleanup of the argument that you're proposing is a bit out of scope
for this patch, but yeah, we could clean it up in a separate patch if needed.
I don't feel too strongly about that. It seems cosmetic, and in the future, if
the address that's passed in is needed, then the architecture can use it.

> Anyway, I think the optimization requires some groundwork before it can be
> accepted. At least some explanation why it is safe to move page table from
> one spot in virtual address space to another.

So I did go through several scenarios and it looks fine to my eyes. I tested
it too and couldn't find any issue. Could you describe your concern a bit
more? The mm->mmap_sem lock is held throughout the mremap. Further, we are
acquiring the rmap locks if needed, and the ptl locks of the old and new
page-table pages. And this same path is already copying pmds for hugepages.

thanks,

 - Joel
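
Since the patch body itself is not reproduced in this archive, here is a
simplified sketch (not the literal patch) of the PMD-level move and the
locking described above:

/* Sketch only: both page-table locks are taken, the old pmd entry is
 * cleared and re-installed at the new address, and the old range is
 * TLB-flushed unconditionally since the PTE dirty state is unknown. */
static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
			    unsigned long new_addr, unsigned long old_end,
			    pmd_t *old_pmd, pmd_t *new_pmd)
{
	struct mm_struct *mm = vma->vm_mm;
	spinlock_t *old_ptl, *new_ptl;
	pmd_t pmd;

	/* Only whole, aligned PMDs qualify; the destination must be empty. */
	if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK) ||
	    old_end - old_addr < PMD_SIZE || !pmd_none(*new_pmd))
		return false;

	/* mmap_sem is held for write, so lock ordering cannot deadlock. */
	old_ptl = pmd_lock(mm, old_pmd);
	new_ptl = pmd_lockptr(mm, new_pmd);
	if (new_ptl != old_ptl)
		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);

	pmd = *old_pmd;
	pmd_clear(old_pmd);
	set_pmd_at(mm, new_addr, new_pmd, pmd);

	flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE);

	if (new_ptl != old_ptl)
		spin_unlock(new_ptl);
	spin_unlock(old_ptl);
	return true;
}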
Joel Fernandes Oct. 9, 2018, 11:08 p.m. UTC | #4
On Tue, Oct 09, 2018 at 02:38:59PM -0700, Andrew Morton wrote:
> On Tue,  9 Oct 2018 13:14:00 -0700 "Joel Fernandes (Google)" <joel@joelfernandes.org> wrote:
> 
> > Android needs to mremap large regions of memory during memory management
> > related operations. The mremap system call can be really slow if THP is
> > not enabled. The bottleneck is move_page_tables, which is copying each
> > pte at a time, and can be really slow across a large map. Turning on THP
> > may not be a viable option, and is not for us. This patch speeds up the
> > performance for non-THP system by copying at the PMD level when possible.
> > 
> > The speed up is three orders of magnitude. On a 1GB mremap, the mremap
> > completion times drops from 160-250 millesconds to 380-400 microseconds.
> > 
> > Before:
> > Total mremap time for 1GB data: 242321014 nanoseconds.
> > Total mremap time for 1GB data: 196842467 nanoseconds.
> > Total mremap time for 1GB data: 167051162 nanoseconds.
> > 
> > After:
> > Total mremap time for 1GB data: 385781 nanoseconds.
> > Total mremap time for 1GB data: 388959 nanoseconds.
> > Total mremap time for 1GB data: 402813 nanoseconds.
> > 
> > Incase THP is enabled, the optimization is skipped. I also flush the
> > tlb every time we do this optimization since I couldn't find a way to
> > determine if the low-level PTEs are dirty. It is seen that the cost of
> > doing so is not much compared the improvement, on both x86-64 and arm64.
> 
> Looks tasty.

Thanks :)

> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> > @@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
> >  		drop_rmap_locks(vma);
> >  }
> >  
> > +bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> 
> I'll park this for now, shall plan to add a `static' in there then
> merge it up after 4.20-rc1.

Thanks, I will also add static to the function in my own tree for the
future, in case I do another revision.

- Joel
Kirill A. Shutemov Oct. 10, 2018, 10 a.m. UTC | #5
On Tue, Oct 09, 2018 at 04:04:47PM -0700, Joel Fernandes wrote:
> On Wed, Oct 10, 2018 at 01:02:22AM +0300, Kirill A. Shutemov wrote:
> > On Tue, Oct 09, 2018 at 01:14:00PM -0700, Joel Fernandes (Google) wrote:
> > > Android needs to mremap large regions of memory during memory management
> > > related operations. The mremap system call can be really slow if THP is
> > > not enabled. The bottleneck is move_page_tables, which is copying each
> > > pte at a time, and can be really slow across a large map. Turning on THP
> > > may not be a viable option, and is not for us. This patch speeds up the
> > > performance for non-THP system by copying at the PMD level when possible.
> > > 
> > > The speed up is three orders of magnitude. On a 1GB mremap, the mremap
> > > completion times drops from 160-250 millesconds to 380-400 microseconds.
> > > 
> > > Before:
> > > Total mremap time for 1GB data: 242321014 nanoseconds.
> > > Total mremap time for 1GB data: 196842467 nanoseconds.
> > > Total mremap time for 1GB data: 167051162 nanoseconds.
> > > 
> > > After:
> > > Total mremap time for 1GB data: 385781 nanoseconds.
> > > Total mremap time for 1GB data: 388959 nanoseconds.
> > > Total mremap time for 1GB data: 402813 nanoseconds.
> > > 
> > > Incase THP is enabled, the optimization is skipped. I also flush the
> > > tlb every time we do this optimization since I couldn't find a way to
> > > determine if the low-level PTEs are dirty. It is seen that the cost of
> > > doing so is not much compared the improvement, on both x86-64 and arm64.
> > 
> > Okay. That's interesting.
> > 
> > It makes me wounder why do we pass virtual address to pte_alloc() (and
> > pte_alloc_one() inside).
> > 
> > If an arch has real requirement to tight a page table to a virtual address
> > than the optimization cannot be used as it is. Per-arch should be fine
> > for this case, I guess.
> > 
> > If nobody uses the address we should just drop the argument as a
> > preparation to the patch.
> 
> I couldn't find any use of the address. But I am wondering why you feel
> passing the address is something that can't be done with the optimization.
> The pte_alloc only happens if the optimization is not triggered.

Yes, I know.

My worry is that some architecture has to allocate page tables differently
depending on the virtual address (due to aliasing or something). The original
page table was allocated for one virtual address, and moving the page table to
a different spot in the virtual address space may break that invariant.

> Also the clean up of the argument that you're proposing is a bit out of scope
> of this patch but yeah we could clean it up in a separate patch if needed. I
> don't feel too strongly about that. It seems cosmetic and in the future if
> the address that's passed in is needed, then the architecture can use it.

Please do. This should be a pretty mechanical change, but it will help to
make sure that no obscure architecture will be broken by the change.
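
To make the worry concrete: no in-tree architecture is known to do this (the
scripts below check exactly that), but a hypothetical allocator that did use
the address, e.g. for cache coloring, would have its invariant broken by
silently relocating the page table:

/* Purely hypothetical example; NR_PT_COLORS and alloc_colored_pt_page()
 * do not exist in the tree. The table is colored by the virtual address
 * it will map, so moving it to a differently-colored address is invalid. */
static inline pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long addr)
{
	int color = (addr >> PMD_SHIFT) & (NR_PT_COLORS - 1);

	return alloc_colored_pt_page(mm, color);
}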
Joel Fernandes Oct. 11, 2018, 12:46 a.m. UTC | #6
On Wed, Oct 10, 2018 at 01:00:11PM +0300, Kirill A. Shutemov wrote:
> On Tue, Oct 09, 2018 at 04:04:47PM -0700, Joel Fernandes wrote:
> > On Wed, Oct 10, 2018 at 01:02:22AM +0300, Kirill A. Shutemov wrote:
> > > On Tue, Oct 09, 2018 at 01:14:00PM -0700, Joel Fernandes (Google) wrote:
> > > > Android needs to mremap large regions of memory during memory management
> > > > related operations. The mremap system call can be really slow if THP is
> > > > not enabled. The bottleneck is move_page_tables, which is copying each
> > > > pte at a time, and can be really slow across a large map. Turning on THP
> > > > may not be a viable option, and is not for us. This patch speeds up the
> > > > performance for non-THP system by copying at the PMD level when possible.
> > > > 
> > > > The speed up is three orders of magnitude. On a 1GB mremap, the mremap
> > > > completion times drops from 160-250 millesconds to 380-400 microseconds.
> > > > 
> > > > Before:
> > > > Total mremap time for 1GB data: 242321014 nanoseconds.
> > > > Total mremap time for 1GB data: 196842467 nanoseconds.
> > > > Total mremap time for 1GB data: 167051162 nanoseconds.
> > > > 
> > > > After:
> > > > Total mremap time for 1GB data: 385781 nanoseconds.
> > > > Total mremap time for 1GB data: 388959 nanoseconds.
> > > > Total mremap time for 1GB data: 402813 nanoseconds.
> > > > 
> > > > Incase THP is enabled, the optimization is skipped. I also flush the
> > > > tlb every time we do this optimization since I couldn't find a way to
> > > > determine if the low-level PTEs are dirty. It is seen that the cost of
> > > > doing so is not much compared the improvement, on both x86-64 and arm64.
> > > 
> > > Okay. That's interesting.
> > > 
> > > It makes me wounder why do we pass virtual address to pte_alloc() (and
> > > pte_alloc_one() inside).
> > > 
> > > If an arch has real requirement to tight a page table to a virtual address
> > > than the optimization cannot be used as it is. Per-arch should be fine
> > > for this case, I guess.
> > > 
> > > If nobody uses the address we should just drop the argument as a
> > > preparation to the patch.
> > 
> > I couldn't find any use of the address. But I am wondering why you feel
> > passing the address is something that can't be done with the optimization.
> > The pte_alloc only happens if the optimization is not triggered.
> 
> Yes, I now.
> 
> My worry is that some architecture has to allocate page table differently
> depending on virtual address (due to aliasing or something). Original page
> table was allocated for one virtual address and moving the page table to
> different spot in virtual address space may break the invariant.
> 
> > Also the clean up of the argument that you're proposing is a bit out of scope
> > of this patch but yeah we could clean it up in a separate patch if needed. I
> > don't feel too strongly about that. It seems cosmetic and in the future if
> > the address that's passed in is needed, then the architecture can use it.
> 
> Please, do. This should be pretty mechanical change, but it will help to
> make sure that none of obscure architecture will be broken by the change.
> 

The thing is, it's quite a lot of change. I wrote a coccinelle script to do it
tree-wide; the following is the diffstat:
 48 files changed, 91 insertions(+), 124 deletions(-)

Imagine then having to add the address argument back in the future in case
it's ever needed. Is it really worth doing? Anyway, I confirmed that the
address is NOT used for anything at the moment, so your fears of the
optimization doing anything wonky really don't exist at the moment. I really
feel this is unnecessary but I am Ok with others agree the second arg to
pte_alloc should be removed in light of this change. Andrew, what do you
think?

Anyway, the following are the coccinelle scripts and the resulting diff. The
first script removes the second argument (address) from pte_alloc and friends
(there is still some hand removal to do, but those cases are few). The second
script verifies that the 'address' arg is indeed unused, before the removal.
And the third thing I added below is the diff from applying the first script.

First script: Remove the address argument from pte_alloc and friends:

// Options: --no-includes --include-headers

virtual patch

//////////////// REMOVE ARG FROM FUNC DEFINITION

// Do for pte_alloc
@pte_args1 depends on patch exists@
identifier E1, E2;
type T1, T2;
@@

- pte_alloc(T1 E1, T2 E2)
+ pte_alloc(T1 E1)
{ ... }

// Do same thing for pte_alloc_kernel
@pte_args4 depends on patch exists@
identifier E1, E2;
type T1, T2;
@@

- pte_alloc_kernel(T1 E1, T2 E2)
+ pte_alloc_kernel(T1 E1)
{ ... }

// Do for pte_alloc_one_kernel
@pte_args depends on patch exists@
identifier E1, E2;
type T1, T2;
@@

- pte_alloc_one_kernel(T1 E1, T2 E2)
+ pte_alloc_one_kernel(T1 E1)
{ ... }

// Do same thing for pte_alloc_one
@pte_args6 depends on patch exists@
identifier E1, E2;
type T1, T2;
@@

- pte_alloc_one(T1 E1, T2 E2)
+ pte_alloc_one(T1 E1)
{ ... }


//////////////// REMOVE ARG FROM FUNC DECLARATION
@pte_args12 depends on patch exists@
identifier E1, E2;
type T1, T2, T3;
@@

(
- T3 pte_alloc(T1 E1, T2 E2);
+ T3 pte_alloc(T1 E1);
|
- T3 pte_alloc_kernel(T1 E1, T2 E2);
+ T3 pte_alloc_kernel(T1 E1);
|
- T3 pte_alloc_one_kernel(T1 E1, T2 E2);
+ T3 pte_alloc_one_kernel(T1 E1);
|
- T3 pte_alloc_one(T1 E1, T2 E2);
+ T3 pte_alloc_one(T1 E1);
)

//////////////// REMOVE ARG FROM CALLERS
@pte_args13 depends on patch exists@
expression E1, E2;
@@
(
- pte_alloc(E1, E2)
+ pte_alloc(E1)
|
- pte_alloc_kernel(E1, E2)
+ pte_alloc_kernel(E1)
|
- pte_alloc_one_kernel(E1, E2)
+ pte_alloc_one_kernel(E1)
|
- pte_alloc_one(E1, E2)
+ pte_alloc_one(E1)
)

----------

Second script: Verify that the second argument is indeed unused.
This is just for pte_alloc, but I wrote similar scripts for the other pte_alloc
variants. Only pte_alloc_one is using the address, because it's passing it to
pte_alloc_one_kernel. But that function no longer uses it.

// Options: --include-headers

virtual report

@pte_args depends on report@
identifier E1, E2;
type T1, T2;
position p;
@@

 pte_alloc@p(T1 E1, T2 E2)
 {
<+...
 E2
...+>
 }

@script:python depends on report@
p << pte_args.p;
@@
coccilib.report.print_report(p[0], "WARNING: found definition of pte_alloc with address used in the body")

------------

Diff of applying the first .cocci script:

diff --git a/arch/alpha/include/asm/pgalloc.h b/arch/alpha/include/asm/pgalloc.h
index ab3e3a8638fb..02f9f91bb4f0 100644
--- a/arch/alpha/include/asm/pgalloc.h
+++ b/arch/alpha/include/asm/pgalloc.h
@@ -52,7 +52,7 @@ pmd_free(struct mm_struct *mm, pmd_t *pmd)
 }
 
 static inline pte_t *
-pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
+pte_alloc_one_kernel(struct mm_struct *mm)
 {
 	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
 	return pte;
@@ -65,9 +65,9 @@ pte_free_kernel(struct mm_struct *mm, pte_t *pte)
 }
 
 static inline pgtable_t
-pte_alloc_one(struct mm_struct *mm, unsigned long address)
+pte_alloc_one(struct mm_struct *mm)
 {
-	pte_t *pte = pte_alloc_one_kernel(mm, address);
+	pte_t *pte = pte_alloc_one_kernel(mm);
 	struct page *page;
 
 	if (!pte)
diff --git a/arch/arc/include/asm/pgalloc.h b/arch/arc/include/asm/pgalloc.h
index 3749234b7419..9c9b5a5ebf2e 100644
--- a/arch/arc/include/asm/pgalloc.h
+++ b/arch/arc/include/asm/pgalloc.h
@@ -90,8 +90,7 @@ static inline int __get_order_pte(void)
 	return get_order(PTRS_PER_PTE * sizeof(pte_t));
 }
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
-					unsigned long address)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 {
 	pte_t *pte;
 
@@ -102,7 +101,7 @@ static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 }
 
 static inline pgtable_t
-pte_alloc_one(struct mm_struct *mm, unsigned long address)
+pte_alloc_one(struct mm_struct *mm)
 {
 	pgtable_t pte_pg;
 	struct page *page;
diff --git a/arch/arm/include/asm/pgalloc.h b/arch/arm/include/asm/pgalloc.h
index 2d7344f0e208..17ab72f0cc4e 100644
--- a/arch/arm/include/asm/pgalloc.h
+++ b/arch/arm/include/asm/pgalloc.h
@@ -81,7 +81,7 @@ static inline void clean_pte_table(pte_t *pte)
  *  +------------+
  */
 static inline pte_t *
-pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr)
+pte_alloc_one_kernel(struct mm_struct *mm)
 {
 	pte_t *pte;
 
@@ -93,7 +93,7 @@ pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr)
 }
 
 static inline pgtable_t
-pte_alloc_one(struct mm_struct *mm, unsigned long addr)
+pte_alloc_one(struct mm_struct *mm)
 {
 	struct page *pte;
 
diff --git a/arch/arm64/include/asm/pgalloc.h b/arch/arm64/include/asm/pgalloc.h
index 2e05bcd944c8..52fa47c73bf0 100644
--- a/arch/arm64/include/asm/pgalloc.h
+++ b/arch/arm64/include/asm/pgalloc.h
@@ -91,13 +91,13 @@ extern pgd_t *pgd_alloc(struct mm_struct *mm);
 extern void pgd_free(struct mm_struct *mm, pgd_t *pgdp);
 
 static inline pte_t *
-pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr)
+pte_alloc_one_kernel(struct mm_struct *mm)
 {
 	return (pte_t *)__get_free_page(PGALLOC_GFP);
 }
 
 static inline pgtable_t
-pte_alloc_one(struct mm_struct *mm, unsigned long addr)
+pte_alloc_one(struct mm_struct *mm)
 {
 	struct page *pte;
 
diff --git a/arch/hexagon/include/asm/pgalloc.h b/arch/hexagon/include/asm/pgalloc.h
index eeebf862c46c..d36183887b60 100644
--- a/arch/hexagon/include/asm/pgalloc.h
+++ b/arch/hexagon/include/asm/pgalloc.h
@@ -59,8 +59,7 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 	free_page((unsigned long) pgd);
 }
 
-static inline struct page *pte_alloc_one(struct mm_struct *mm,
-					 unsigned long address)
+static inline struct page *pte_alloc_one(struct mm_struct *mm)
 {
 	struct page *pte;
 
@@ -75,8 +74,7 @@ static inline struct page *pte_alloc_one(struct mm_struct *mm,
 }
 
 /* _kernel variant gets to use a different allocator */
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
-					  unsigned long address)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 {
 	gfp_t flags =  GFP_KERNEL | __GFP_ZERO;
 	return (pte_t *) __get_free_page(flags);
diff --git a/arch/ia64/include/asm/pgalloc.h b/arch/ia64/include/asm/pgalloc.h
index 3ee5362f2661..c9e481023c25 100644
--- a/arch/ia64/include/asm/pgalloc.h
+++ b/arch/ia64/include/asm/pgalloc.h
@@ -83,7 +83,7 @@ pmd_populate_kernel(struct mm_struct *mm, pmd_t * pmd_entry, pte_t * pte)
 	pmd_val(*pmd_entry) = __pa(pte);
 }
 
-static inline pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
 {
 	struct page *page;
 	void *pg;
@@ -99,8 +99,7 @@ static inline pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long addr)
 	return page;
 }
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
-					  unsigned long addr)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 {
 	return quicklist_alloc(0, GFP_KERNEL, NULL);
 }
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 3b85c3ecac38..183b8194ce1f 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -224,7 +224,7 @@ put_kernel_page (struct page *page, unsigned long address, pgprot_t pgprot)
 		pmd = pmd_alloc(&init_mm, pud, address);
 		if (!pmd)
 			goto out;
-		pte = pte_alloc_kernel(pmd, address);
+		pte = pte_alloc_kernel(pmd);
 		if (!pte)
 			goto out;
 		if (!pte_none(*pte))
diff --git a/arch/m68k/include/asm/mcf_pgalloc.h b/arch/m68k/include/asm/mcf_pgalloc.h
index 12fe700632f4..9ac47b4aed41 100644
--- a/arch/m68k/include/asm/mcf_pgalloc.h
+++ b/arch/m68k/include/asm/mcf_pgalloc.h
@@ -12,8 +12,7 @@ extern inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
 
 extern const char bad_pmd_string[];
 
-extern inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
-	unsigned long address)
+extern inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 {
 	unsigned long page = __get_free_page(GFP_DMA);
 
@@ -32,7 +31,7 @@ extern inline pmd_t *pmd_alloc_kernel(pgd_t *pgd, unsigned long address)
 #define pmd_alloc_one_fast(mm, address) ({ BUG(); ((pmd_t *)1); })
 #define pmd_alloc_one(mm, address)      ({ BUG(); ((pmd_t *)2); })
 
-#define pte_alloc_one_fast(mm, addr) pte_alloc_one(mm, addr)
+#define pte_alloc_one_fast(mm, addr) pte_alloc_one(mm)
 
 #define pmd_populate(mm, pmd, page) (pmd_val(*pmd) = \
 	(unsigned long)(page_address(page)))
@@ -50,8 +49,7 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t page,
 
 #define __pmd_free_tlb(tlb, pmd, address) do { } while (0)
 
-static inline struct page *pte_alloc_one(struct mm_struct *mm,
-	unsigned long address)
+static inline struct page *pte_alloc_one(struct mm_struct *mm)
 {
 	struct page *page = alloc_pages(GFP_DMA, 0);
 	pte_t *pte;
diff --git a/arch/m68k/include/asm/motorola_pgalloc.h b/arch/m68k/include/asm/motorola_pgalloc.h
index 7859a86319cf..d04d9ba9b976 100644
--- a/arch/m68k/include/asm/motorola_pgalloc.h
+++ b/arch/m68k/include/asm/motorola_pgalloc.h
@@ -8,7 +8,7 @@
 extern pmd_t *get_pointer_table(void);
 extern int free_pointer_table(pmd_t *);
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 {
 	pte_t *pte;
 
@@ -28,7 +28,7 @@ static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
 	free_page((unsigned long) pte);
 }
 
-static inline pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long address)
+static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
 {
 	struct page *page;
 	pte_t *pte;
diff --git a/arch/m68k/include/asm/sun3_pgalloc.h b/arch/m68k/include/asm/sun3_pgalloc.h
index 11485d38de4e..1456c5eecbd9 100644
--- a/arch/m68k/include/asm/sun3_pgalloc.h
+++ b/arch/m68k/include/asm/sun3_pgalloc.h
@@ -35,8 +35,7 @@ do {							\
 	tlb_remove_page((tlb), pte);			\
 } while (0)
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
-					  unsigned long address)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 {
 	unsigned long page = __get_free_page(GFP_KERNEL);
 
@@ -47,8 +46,7 @@ static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 	return (pte_t *) (page);
 }
 
-static inline pgtable_t pte_alloc_one(struct mm_struct *mm,
-					unsigned long address)
+static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
 {
         struct page *page = alloc_pages(GFP_KERNEL, 0);
 
diff --git a/arch/m68k/mm/kmap.c b/arch/m68k/mm/kmap.c
index 40a3b327da07..883d1cdcc34b 100644
--- a/arch/m68k/mm/kmap.c
+++ b/arch/m68k/mm/kmap.c
@@ -208,7 +208,7 @@ void __iomem *__ioremap(unsigned long physaddr, unsigned long size, int cachefla
 			virtaddr += PTRTREESIZE;
 			size -= PTRTREESIZE;
 		} else {
-			pte_dir = pte_alloc_kernel(pmd_dir, virtaddr);
+			pte_dir = pte_alloc_kernel(pmd_dir);
 			if (!pte_dir) {
 				printk("ioremap: no mem for pte_dir\n");
 				return NULL;
diff --git a/arch/m68k/sun3x/dvma.c b/arch/m68k/sun3x/dvma.c
index b2acbc862f60..87935a30d628 100644
--- a/arch/m68k/sun3x/dvma.c
+++ b/arch/m68k/sun3x/dvma.c
@@ -109,7 +109,7 @@ inline int dvma_map_cpu(unsigned long kaddr,
 			pte_t *pte;
 			unsigned long end3;
 
-			if((pte = pte_alloc_kernel(pmd, vaddr)) == NULL) {
+			if((pte = pte_alloc_kernel(pmd)) == NULL) {
 				ret = -ENOMEM;
 				goto out;
 			}
diff --git a/arch/microblaze/include/asm/pgalloc.h b/arch/microblaze/include/asm/pgalloc.h
index 7c89390c0c13..08984b05713f 100644
--- a/arch/microblaze/include/asm/pgalloc.h
+++ b/arch/microblaze/include/asm/pgalloc.h
@@ -108,10 +108,9 @@ static inline void free_pgd_slow(pgd_t *pgd)
 #define pmd_alloc_one_fast(mm, address)	({ BUG(); ((pmd_t *)1); })
 #define pmd_alloc_one(mm, address)	({ BUG(); ((pmd_t *)2); })
 
-extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr);
+extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm);
 
-static inline struct page *pte_alloc_one(struct mm_struct *mm,
-		unsigned long address)
+static inline struct page *pte_alloc_one(struct mm_struct *mm)
 {
 	struct page *ptepage;
 
diff --git a/arch/microblaze/mm/pgtable.c b/arch/microblaze/mm/pgtable.c
index 7f525962cdfa..fd9013f2b8ee 100644
--- a/arch/microblaze/mm/pgtable.c
+++ b/arch/microblaze/mm/pgtable.c
@@ -144,7 +144,7 @@ int map_page(unsigned long va, phys_addr_t pa, int flags)
 	/* Use upper 10 bits of VA to index the first level map */
 	pd = pmd_offset(pgd_offset_k(va), va);
 	/* Use middle 10 bits of VA to index the second-level map */
-	pg = pte_alloc_kernel(pd, va); /* from powerpc - pgtable.c */
+	pg = pte_alloc_kernel(pd); /* from powerpc - pgtable.c */
 	/* pg = pte_alloc_kernel(&init_mm, pd, va); */
 
 	if (pg != NULL) {
@@ -235,8 +235,7 @@ unsigned long iopa(unsigned long addr)
 	return pa;
 }
 
-__ref pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
-		unsigned long address)
+__ref pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 {
 	pte_t *pte;
 	if (mem_init_done) {
diff --git a/arch/mips/include/asm/pgalloc.h b/arch/mips/include/asm/pgalloc.h
index 39b9f311c4ef..27808d9461f4 100644
--- a/arch/mips/include/asm/pgalloc.h
+++ b/arch/mips/include/asm/pgalloc.h
@@ -50,14 +50,12 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 	free_pages((unsigned long)pgd, PGD_ORDER);
 }
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
-	unsigned long address)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 {
 	return (pte_t *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, PTE_ORDER);
 }
 
-static inline struct page *pte_alloc_one(struct mm_struct *mm,
-	unsigned long address)
+static inline struct page *pte_alloc_one(struct mm_struct *mm)
 {
 	struct page *pte;
 
diff --git a/arch/mips/mm/ioremap.c b/arch/mips/mm/ioremap.c
index 1601d90b087b..18c2be1b69c0 100644
--- a/arch/mips/mm/ioremap.c
+++ b/arch/mips/mm/ioremap.c
@@ -56,7 +56,7 @@ static inline int remap_area_pmd(pmd_t * pmd, unsigned long address,
 	phys_addr -= address;
 	BUG_ON(address >= end);
 	do {
-		pte_t * pte = pte_alloc_kernel(pmd, address);
+		pte_t * pte = pte_alloc_kernel(pmd);
 		if (!pte)
 			return -ENOMEM;
 		remap_area_pte(pte, address, end - address, address + phys_addr, flags);
diff --git a/arch/nds32/include/asm/pgalloc.h b/arch/nds32/include/asm/pgalloc.h
index 27448869131a..3c5fee5b5759 100644
--- a/arch/nds32/include/asm/pgalloc.h
+++ b/arch/nds32/include/asm/pgalloc.h
@@ -22,8 +22,7 @@ extern void pgd_free(struct mm_struct *mm, pgd_t * pgd);
 
 #define check_pgt_cache()		do { } while (0)
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
-					  unsigned long addr)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 {
 	pte_t *pte;
 
@@ -34,7 +33,7 @@ static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 	return pte;
 }
 
-static inline pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
 {
 	pgtable_t pte;
 
diff --git a/arch/nds32/kernel/dma.c b/arch/nds32/kernel/dma.c
index d0dbd4fe9645..17b485ab631a 100644
--- a/arch/nds32/kernel/dma.c
+++ b/arch/nds32/kernel/dma.c
@@ -310,7 +310,7 @@ static int __init consistent_init(void)
 		 * It's not necessary to warn here. */
 		/* WARN_ON(!pmd_none(*pmd)); */
 
-		pte = pte_alloc_kernel(pmd, CONSISTENT_BASE);
+		pte = pte_alloc_kernel(pmd);
 		if (!pte) {
 			ret = -ENOMEM;
 			break;
diff --git a/arch/nios2/include/asm/pgalloc.h b/arch/nios2/include/asm/pgalloc.h
index bb47d08c8ef7..3a149ead1207 100644
--- a/arch/nios2/include/asm/pgalloc.h
+++ b/arch/nios2/include/asm/pgalloc.h
@@ -37,8 +37,7 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 	free_pages((unsigned long)pgd, PGD_ORDER);
 }
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
-	unsigned long address)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 {
 	pte_t *pte;
 
@@ -47,8 +46,7 @@ static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 	return pte;
 }
 
-static inline pgtable_t pte_alloc_one(struct mm_struct *mm,
-	unsigned long address)
+static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
 {
 	struct page *pte;
 
diff --git a/arch/nios2/mm/ioremap.c b/arch/nios2/mm/ioremap.c
index 3a28177a01eb..00d1434c2f28 100644
--- a/arch/nios2/mm/ioremap.c
+++ b/arch/nios2/mm/ioremap.c
@@ -61,7 +61,7 @@ static inline int remap_area_pmd(pmd_t *pmd, unsigned long address,
 	if (address >= end)
 		BUG();
 	do {
-		pte_t *pte = pte_alloc_kernel(pmd, address);
+		pte_t *pte = pte_alloc_kernel(pmd);
 
 		if (!pte)
 			return -ENOMEM;
diff --git a/arch/openrisc/include/asm/pgalloc.h b/arch/openrisc/include/asm/pgalloc.h
index 8999b9226512..149c82ee4b8b 100644
--- a/arch/openrisc/include/asm/pgalloc.h
+++ b/arch/openrisc/include/asm/pgalloc.h
@@ -70,10 +70,9 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 	free_page((unsigned long)pgd);
 }
 
-extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address);
+extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm);
 
-static inline struct page *pte_alloc_one(struct mm_struct *mm,
-					 unsigned long address)
+static inline struct page *pte_alloc_one(struct mm_struct *mm)
 {
 	struct page *pte;
 	pte = alloc_pages(GFP_KERNEL, 0);
diff --git a/arch/openrisc/mm/ioremap.c b/arch/openrisc/mm/ioremap.c
index 2175e4bfd9fc..24fb1021c75a 100644
--- a/arch/openrisc/mm/ioremap.c
+++ b/arch/openrisc/mm/ioremap.c
@@ -118,8 +118,7 @@ EXPORT_SYMBOL(iounmap);
  * the memblock infrastructure.
  */
 
-pte_t __ref *pte_alloc_one_kernel(struct mm_struct *mm,
-					 unsigned long address)
+pte_t __ref *pte_alloc_one_kernel(struct mm_struct *mm)
 {
 	pte_t *pte;
 
diff --git a/arch/parisc/include/asm/pgalloc.h b/arch/parisc/include/asm/pgalloc.h
index cf13275f7c6d..d05c678c77c4 100644
--- a/arch/parisc/include/asm/pgalloc.h
+++ b/arch/parisc/include/asm/pgalloc.h
@@ -122,7 +122,7 @@ pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd, pte_t *pte)
 #define pmd_pgtable(pmd) pmd_page(pmd)
 
 static inline pgtable_t
-pte_alloc_one(struct mm_struct *mm, unsigned long address)
+pte_alloc_one(struct mm_struct *mm)
 {
 	struct page *page = alloc_page(GFP_KERNEL|__GFP_ZERO);
 	if (!page)
@@ -135,7 +135,7 @@ pte_alloc_one(struct mm_struct *mm, unsigned long address)
 }
 
 static inline pte_t *
-pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr)
+pte_alloc_one_kernel(struct mm_struct *mm)
 {
 	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
 	return pte;
diff --git a/arch/parisc/kernel/pci-dma.c b/arch/parisc/kernel/pci-dma.c
index 04c48f1ef3fb..d5588c80bda2 100644
--- a/arch/parisc/kernel/pci-dma.c
+++ b/arch/parisc/kernel/pci-dma.c
@@ -113,7 +113,7 @@ static inline int map_pmd_uncached(pmd_t * pmd, unsigned long vaddr,
 	if (end > PGDIR_SIZE)
 		end = PGDIR_SIZE;
 	do {
-		pte_t * pte = pte_alloc_kernel(pmd, vaddr);
+		pte_t * pte = pte_alloc_kernel(pmd);
 		if (!pte)
 			return -ENOMEM;
 		if (map_pte_uncached(pte, orig_vaddr, end - vaddr, paddr_ptr))
diff --git a/arch/powerpc/include/asm/book3s/32/pgalloc.h b/arch/powerpc/include/asm/book3s/32/pgalloc.h
index 82e44b1a00ae..af9e13555d95 100644
--- a/arch/powerpc/include/asm/book3s/32/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/32/pgalloc.h
@@ -82,8 +82,8 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmdp,
 #define pmd_pgtable(pmd) pmd_page(pmd)
 #endif
 
-extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr);
-extern pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long addr);
+extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm);
+extern pgtable_t pte_alloc_one(struct mm_struct *mm);
 
 static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
 {
diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index 391ed2c3b697..8a33f2044923 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -192,14 +192,12 @@ static inline pgtable_t pmd_pgtable(pmd_t pmd)
 	return (pgtable_t)pmd_page_vaddr(pmd);
 }
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
-					  unsigned long address)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 {
 	return (pte_t *)pte_fragment_alloc(mm, address, 1);
 }
 
-static inline pgtable_t pte_alloc_one(struct mm_struct *mm,
-				      unsigned long address)
+static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
 {
 	return (pgtable_t)pte_fragment_alloc(mm, address, 0);
 }
diff --git a/arch/powerpc/include/asm/nohash/32/pgalloc.h b/arch/powerpc/include/asm/nohash/32/pgalloc.h
index 8825953c225b..16623f53f0d4 100644
--- a/arch/powerpc/include/asm/nohash/32/pgalloc.h
+++ b/arch/powerpc/include/asm/nohash/32/pgalloc.h
@@ -83,8 +83,8 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmdp,
 #define pmd_pgtable(pmd) pmd_page(pmd)
 #endif
 
-extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr);
-extern pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long addr);
+extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm);
+extern pgtable_t pte_alloc_one(struct mm_struct *mm);
 
 static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
 {
diff --git a/arch/powerpc/include/asm/nohash/64/pgalloc.h b/arch/powerpc/include/asm/nohash/64/pgalloc.h
index e2d62d033708..2e7e0230edf4 100644
--- a/arch/powerpc/include/asm/nohash/64/pgalloc.h
+++ b/arch/powerpc/include/asm/nohash/64/pgalloc.h
@@ -96,14 +96,12 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 }
 
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
-					  unsigned long address)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 {
 	return (pte_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
 }
 
-static inline pgtable_t pte_alloc_one(struct mm_struct *mm,
-				      unsigned long address)
+static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
 {
 	struct page *page;
 	pte_t *pte;
diff --git a/arch/powerpc/mm/pgtable-book3e.c b/arch/powerpc/mm/pgtable-book3e.c
index a2298930f990..c931a3f536ca 100644
--- a/arch/powerpc/mm/pgtable-book3e.c
+++ b/arch/powerpc/mm/pgtable-book3e.c
@@ -86,7 +86,7 @@ int map_kernel_page(unsigned long ea, unsigned long pa, unsigned long flags)
 		pmdp = pmd_alloc(&init_mm, pudp, ea);
 		if (!pmdp)
 			return -ENOMEM;
-		ptep = pte_alloc_kernel(pmdp, ea);
+		ptep = pte_alloc_kernel(pmdp);
 		if (!ptep)
 			return -ENOMEM;
 		set_pte_at(&init_mm, ea, ptep, pfn_pte(pa >> PAGE_SHIFT,
diff --git a/arch/powerpc/mm/pgtable-hash64.c b/arch/powerpc/mm/pgtable-hash64.c
index 692bfc9e372c..49eae44db5b2 100644
--- a/arch/powerpc/mm/pgtable-hash64.c
+++ b/arch/powerpc/mm/pgtable-hash64.c
@@ -158,7 +158,7 @@ int hash__map_kernel_page(unsigned long ea, unsigned long pa, unsigned long flag
 		pmdp = pmd_alloc(&init_mm, pudp, ea);
 		if (!pmdp)
 			return -ENOMEM;
-		ptep = pte_alloc_kernel(pmdp, ea);
+		ptep = pte_alloc_kernel(pmdp);
 		if (!ptep)
 			return -ENOMEM;
 		set_pte_at(&init_mm, ea, ptep, pfn_pte(pa >> PAGE_SHIFT,
diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index c879979faa73..8002696c338e 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -163,7 +163,7 @@ static int __map_kernel_page(unsigned long ea, unsigned long pa,
 		ptep = pmdp_ptep(pmdp);
 		goto set_the_pte;
 	}
-	ptep = pte_alloc_kernel(pmdp, ea);
+	ptep = pte_alloc_kernel(pmdp);
 	if (!ptep)
 		return -ENOMEM;
 
@@ -212,7 +212,7 @@ void radix__change_memory_range(unsigned long start, unsigned long end,
 			ptep = pmdp_ptep(pmdp);
 			goto update_the_pte;
 		}
-		ptep = pte_alloc_kernel(pmdp, idx);
+		ptep = pte_alloc_kernel(pmdp);
 		if (!ptep)
 			continue;
 update_the_pte:
diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c
index 120a49bfb9c6..12d5520586cc 100644
--- a/arch/powerpc/mm/pgtable_32.c
+++ b/arch/powerpc/mm/pgtable_32.c
@@ -43,7 +43,7 @@ EXPORT_SYMBOL(ioremap_bot);	/* aka VMALLOC_END */
 
 extern char etext[], _stext[], _sinittext[], _einittext[];
 
-__ref pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
+__ref pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 {
 	pte_t *pte;
 
@@ -57,7 +57,7 @@ __ref pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 	return pte;
 }
 
-pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long address)
+pgtable_t pte_alloc_one(struct mm_struct *mm)
 {
 	struct page *ptepage;
 
@@ -218,7 +218,7 @@ int map_kernel_page(unsigned long va, phys_addr_t pa, int flags)
 	/* Use upper 10 bits of VA to index the first level map */
 	pd = pmd_offset(pud_offset(pgd_offset_k(va), va), va);
 	/* Use middle 10 bits of VA to index the second-level map */
-	pg = pte_alloc_kernel(pd, va);
+	pg = pte_alloc_kernel(pd);
 	if (pg != 0) {
 		err = 0;
 		/* The PTE should never be already set nor present in the
diff --git a/arch/riscv/include/asm/pgalloc.h b/arch/riscv/include/asm/pgalloc.h
index a79ed5faff3a..94043cf83c90 100644
--- a/arch/riscv/include/asm/pgalloc.h
+++ b/arch/riscv/include/asm/pgalloc.h
@@ -82,15 +82,13 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 
 #endif /* __PAGETABLE_PMD_FOLDED */
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
-	unsigned long address)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 {
 	return (pte_t *)__get_free_page(
 		GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_ZERO);
 }
 
-static inline struct page *pte_alloc_one(struct mm_struct *mm,
-	unsigned long address)
+static inline struct page *pte_alloc_one(struct mm_struct *mm)
 {
 	struct page *pte;
 
diff --git a/arch/sh/include/asm/pgalloc.h b/arch/sh/include/asm/pgalloc.h
index ed053a359ab7..8ad73cb31121 100644
--- a/arch/sh/include/asm/pgalloc.h
+++ b/arch/sh/include/asm/pgalloc.h
@@ -32,14 +32,12 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
 /*
  * Allocate and free page tables.
  */
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
-					  unsigned long address)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 {
 	return quicklist_alloc(QUICK_PT, GFP_KERNEL, NULL);
 }
 
-static inline pgtable_t pte_alloc_one(struct mm_struct *mm,
-					unsigned long address)
+static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
 {
 	struct page *page;
 	void *pg;
diff --git a/arch/sparc/include/asm/pgalloc_32.h b/arch/sparc/include/asm/pgalloc_32.h
index 90459481c6c7..282be50a4adf 100644
--- a/arch/sparc/include/asm/pgalloc_32.h
+++ b/arch/sparc/include/asm/pgalloc_32.h
@@ -58,10 +58,9 @@ void pmd_populate(struct mm_struct *mm, pmd_t *pmdp, struct page *ptep);
 void pmd_set(pmd_t *pmdp, pte_t *ptep);
 #define pmd_populate_kernel(MM, PMD, PTE) pmd_set(PMD, PTE)
 
-pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long address);
+pgtable_t pte_alloc_one(struct mm_struct *mm);
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
-					  unsigned long address)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 {
 	return srmmu_get_nocache(PTE_SIZE, PTE_SIZE);
 }
diff --git a/arch/sparc/include/asm/pgalloc_64.h b/arch/sparc/include/asm/pgalloc_64.h
index 874632f34f62..48abccba4991 100644
--- a/arch/sparc/include/asm/pgalloc_64.h
+++ b/arch/sparc/include/asm/pgalloc_64.h
@@ -60,10 +60,8 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 	kmem_cache_free(pgtable_cache, pmd);
 }
 
-pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
-			    unsigned long address);
-pgtable_t pte_alloc_one(struct mm_struct *mm,
-			unsigned long address);
+pte_t *pte_alloc_one_kernel(struct mm_struct *mm);
+pgtable_t pte_alloc_one(struct mm_struct *mm);
 void pte_free_kernel(struct mm_struct *mm, pte_t *pte);
 void pte_free(struct mm_struct *mm, pgtable_t ptepage);
 
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index f396048a0d68..6133f21811e9 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2921,8 +2921,7 @@ void __flush_tlb_all(void)
 			     : : "r" (pstate));
 }
 
-pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
-			    unsigned long address)
+pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 {
 	struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);
 	pte_t *pte = NULL;
@@ -2933,8 +2932,7 @@ pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 	return pte;
 }
 
-pgtable_t pte_alloc_one(struct mm_struct *mm,
-			unsigned long address)
+pgtable_t pte_alloc_one(struct mm_struct *mm)
 {
 	struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);
 	if (!page)
diff --git a/arch/sparc/mm/srmmu.c b/arch/sparc/mm/srmmu.c
index be9cb0065179..0c59e690aead 100644
--- a/arch/sparc/mm/srmmu.c
+++ b/arch/sparc/mm/srmmu.c
@@ -364,12 +364,12 @@ pgd_t *get_pgd_fast(void)
  * Alignments up to the page size are the same for physical and virtual
  * addresses of the nocache area.
  */
-pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long address)
+pgtable_t pte_alloc_one(struct mm_struct *mm)
 {
 	unsigned long pte;
 	struct page *page;
 
-	if ((pte = (unsigned long)pte_alloc_one_kernel(mm, address)) == 0)
+	if ((pte = (unsigned long) pte_alloc_one_kernel(mm)) == 0)
 		return NULL;
 	page = pfn_to_page(__nocache_pa(pte) >> PAGE_SHIFT);
 	if (!pgtable_page_ctor(page)) {
diff --git a/arch/um/kernel/mem.c b/arch/um/kernel/mem.c
index 3c0e470ea646..1f277191fbf3 100644
--- a/arch/um/kernel/mem.c
+++ b/arch/um/kernel/mem.c
@@ -197,7 +197,7 @@ void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 	free_page((unsigned long) pgd);
 }
 
-pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
+pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 {
 	pte_t *pte;
 
@@ -205,7 +205,7 @@ pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 	return pte;
 }
 
-pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long address)
+pgtable_t pte_alloc_one(struct mm_struct *mm)
 {
 	struct page *pte;
 
diff --git a/arch/unicore32/include/asm/pgalloc.h b/arch/unicore32/include/asm/pgalloc.h
index f0fdb268f8f2..7cceabecf4e3 100644
--- a/arch/unicore32/include/asm/pgalloc.h
+++ b/arch/unicore32/include/asm/pgalloc.h
@@ -34,7 +34,7 @@ extern void free_pgd_slow(struct mm_struct *mm, pgd_t *pgd);
  * Allocate one PTE table.
  */
 static inline pte_t *
-pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr)
+pte_alloc_one_kernel(struct mm_struct *mm)
 {
 	pte_t *pte;
 
@@ -46,7 +46,7 @@ pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr)
 }
 
 static inline pgtable_t
-pte_alloc_one(struct mm_struct *mm, unsigned long addr)
+pte_alloc_one(struct mm_struct *mm)
 {
 	struct page *pte;
 
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 089e78c4effd..a2eff247377b 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -23,12 +23,12 @@ EXPORT_SYMBOL(physical_mask);
 
 gfp_t __userpte_alloc_gfp = PGALLOC_GFP | PGALLOC_USER_GFP;
 
-pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
+pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 {
 	return (pte_t *)__get_free_page(PGALLOC_GFP & ~__GFP_ACCOUNT);
 }
 
-pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long address)
+pgtable_t pte_alloc_one(struct mm_struct *mm)
 {
 	struct page *pte;
 
diff --git a/arch/xtensa/include/asm/pgalloc.h b/arch/xtensa/include/asm/pgalloc.h
index 1065bc8bcae5..b3b388ff2f01 100644
--- a/arch/xtensa/include/asm/pgalloc.h
+++ b/arch/xtensa/include/asm/pgalloc.h
@@ -38,8 +38,7 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 	free_page((unsigned long)pgd);
 }
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
-					 unsigned long address)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 {
 	pte_t *ptep;
 	int i;
@@ -52,13 +51,12 @@ static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 	return ptep;
 }
 
-static inline pgtable_t pte_alloc_one(struct mm_struct *mm,
-					unsigned long addr)
+static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
 {
 	pte_t *pte;
 	struct page *page;
 
-	pte = pte_alloc_one_kernel(mm, addr);
+	pte = pte_alloc_one_kernel(mm);
 	if (!pte)
 		return NULL;
 	page = virt_to_page(pte);
diff --git a/lib/ioremap.c b/lib/ioremap.c
index 517f5853ffed..c0fc9204b4a5 100644
--- a/lib/ioremap.c
+++ b/lib/ioremap.c
@@ -65,7 +65,7 @@ static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
 	u64 pfn;
 
 	pfn = phys_addr >> PAGE_SHIFT;
-	pte = pte_alloc_kernel(pmd, addr);
+	pte = pte_alloc_kernel(pmd);
 	if (!pte)
 		return -ENOMEM;
 	do {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 00704060b7f7..fd7e8714e5a1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -558,7 +558,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 		return VM_FAULT_FALLBACK;
 	}
 
-	pgtable = pte_alloc_one(vma->vm_mm, haddr);
+	pgtable = pte_alloc_one(vma->vm_mm);
 	if (unlikely(!pgtable)) {
 		ret = VM_FAULT_OOM;
 		goto release;
@@ -683,7 +683,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 		struct page *zero_page;
 		bool set;
 		vm_fault_t ret;
-		pgtable = pte_alloc_one(vma->vm_mm, haddr);
+		pgtable = pte_alloc_one(vma->vm_mm);
 		if (unlikely(!pgtable))
 			return VM_FAULT_OOM;
 		zero_page = mm_get_huge_zero_page(vma->vm_mm);
@@ -772,7 +772,7 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 		return VM_FAULT_SIGBUS;
 
 	if (arch_needs_pgtable_deposit()) {
-		pgtable = pte_alloc_one(vma->vm_mm, addr);
+		pgtable = pte_alloc_one(vma->vm_mm);
 		if (!pgtable)
 			return VM_FAULT_OOM;
 	}
@@ -910,7 +910,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (!vma_is_anonymous(vma))
 		return 0;
 
-	pgtable = pte_alloc_one(dst_mm, addr);
+	pgtable = pte_alloc_one(dst_mm);
 	if (unlikely(!pgtable))
 		goto out;
 
diff --git a/mm/kasan/kasan_init.c b/mm/kasan/kasan_init.c
index 7a2a2f13f86f..272849cd2007 100644
--- a/mm/kasan/kasan_init.c
+++ b/mm/kasan/kasan_init.c
@@ -121,7 +121,7 @@ static int __ref zero_pmd_populate(pud_t *pud, unsigned long addr,
 			pte_t *p;
 
 			if (slab_is_available())
-				p = pte_alloc_one_kernel(&init_mm, addr);
+				p = pte_alloc_one_kernel(&init_mm);
 			else
 				p = early_alloc(PAGE_SIZE, NUMA_NO_NODE);
 			if (!p)
diff --git a/mm/memory.c b/mm/memory.c
index c467102a5cbc..6feb96ae5599 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -650,7 +650,7 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
 int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
 {
 	spinlock_t *ptl;
-	pgtable_t new = pte_alloc_one(mm, address);
+	pgtable_t new = pte_alloc_one(mm);
 	if (!new)
 		return -ENOMEM;
 
@@ -683,7 +683,7 @@ int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
 
 int __pte_alloc_kernel(pmd_t *pmd, unsigned long address)
 {
-	pte_t *new = pte_alloc_one_kernel(&init_mm, address);
+	pte_t *new = pte_alloc_one_kernel(&init_mm);
 	if (!new)
 		return -ENOMEM;
 
@@ -2192,7 +2192,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 	spinlock_t *uninitialized_var(ptl);
 
 	pte = (mm == &init_mm) ?
-		pte_alloc_kernel(pmd, addr) :
+		pte_alloc_kernel(pmd) :
 		pte_alloc_map_lock(mm, pmd, addr, &ptl);
 	if (!pte)
 		return -ENOMEM;
@@ -3365,7 +3365,7 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 	 * related to pte entry. Use the preallocated table for that.
 	 */
 	if (arch_needs_pgtable_deposit() && !vmf->prealloc_pte) {
-		vmf->prealloc_pte = pte_alloc_one(vma->vm_mm, vmf->address);
+		vmf->prealloc_pte = pte_alloc_one(vma->vm_mm);
 		if (!vmf->prealloc_pte)
 			return VM_FAULT_OOM;
 		smp_wmb(); /* See comment in __pte_alloc() */
@@ -3603,8 +3603,7 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
 			start_pgoff + nr_pages - 1);
 
 	if (pmd_none(*vmf->pmd)) {
-		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm,
-						  vmf->address);
+		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm);
 		if (!vmf->prealloc_pte)
 			goto out;
 		smp_wmb(); /* See comment in __pte_alloc() */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index a728fc492557..d5115fda9fd2 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -141,7 +141,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
 	 * callers keep track of where we're up to.
 	 */
 
-	pte = pte_alloc_kernel(pmd, addr);
+	pte = pte_alloc_kernel(pmd);
 	if (!pte)
 		return -ENOMEM;
 	do {
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index ed162a6c57c5..3f8180414301 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -628,7 +628,7 @@ static int create_hyp_pmd_mappings(pud_t *pud, unsigned long start,
 		BUG_ON(pmd_sect(*pmd));
 
 		if (pmd_none(*pmd)) {
-			pte = pte_alloc_one_kernel(NULL, addr);
+			pte = pte_alloc_one_kernel(NULL);
 			if (!pte) {
 				kvm_err("Cannot allocate Hyp pte\n");
 				return -ENOMEM;
Joel Fernandes Oct. 11, 2018, 12:50 a.m. UTC | #7
On Wed, Oct 10, 2018 at 5:46 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
> On Wed, Oct 10, 2018 at 01:00:11PM +0300, Kirill A. Shutemov wrote:
[...]
>>
>> My worry is that some architecture has to allocate page table differently
>> depending on virtual address (due to aliasing or something). Original page
>> table was allocated for one virtual address and moving the page table to
>> different spot in virtual address space may break the invariant.
>>
>> > Also the clean up of the argument that you're proposing is a bit out of scope
>> > of this patch but yeah we could clean it up in a separate patch if needed. I
>> > don't feel too strongly about that. It seems cosmetic and in the future if
>> > the address that's passed in is needed, then the architecture can use it.
>>
>> Please, do. This should be pretty mechanical change, but it will help to
>> make sure that none of obscure architecture will be broken by the change.
>>
>
> The thing is its quite a lot of change, I wrote a coccinelle script to do it
> tree wide, following is the diffstat:
>  48 files changed, 91 insertions(+), 124 deletions(-)
>
> Imagine then having to add the address argument back in the future in case
> its ever needed. Is it really worth doing it? Anyway I confirmed that the
> address is NOT used for anything at the moment so your fears of the
> optimization doing anything wonky really don't exist at the moment. I really
> feel this is unnecessary but I am Ok with others agree the second arg to
> pte_alloc should be removed in light of this change. Andrew, what do you
> think?

I meant to say here, "I am Ok if others agree the second arg to
pte_alloc should be removed", but I would really like some input from
the others as well on what they think.
Kirill A. Shutemov Oct. 11, 2018, 5:14 a.m. UTC | #8
On Wed, Oct 10, 2018 at 05:46:18PM -0700, Joel Fernandes wrote:
> diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h b/arch/powerpc/include/asm/book3s/64/pgalloc.h
> index 391ed2c3b697..8a33f2044923 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
> @@ -192,14 +192,12 @@ static inline pgtable_t pmd_pgtable(pmd_t pmd)
>  	return (pgtable_t)pmd_page_vaddr(pmd);
>  }
>  
> -static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
> -					  unsigned long address)
> +static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
>  {
>  	return (pte_t *)pte_fragment_alloc(mm, address, 1);
>  }

This is obviously broken.
Kirill A. Shutemov Oct. 11, 2018, 8:11 a.m. UTC | #9
On Thu, Oct 11, 2018 at 08:14:19AM +0300, Kirill A. Shutemov wrote:
> On Wed, Oct 10, 2018 at 05:46:18PM -0700, Joel Fernandes wrote:
> > diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h b/arch/powerpc/include/asm/book3s/64/pgalloc.h
> > index 391ed2c3b697..8a33f2044923 100644
> > --- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
> > +++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
> > @@ -192,14 +192,12 @@ static inline pgtable_t pmd_pgtable(pmd_t pmd)
> >  	return (pgtable_t)pmd_page_vaddr(pmd);
> >  }
> >  
> > -static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
> > -					  unsigned long address)
> > +static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
> >  {
> >  	return (pte_t *)pte_fragment_alloc(mm, address, 1);
> >  }
> 
> This is obviously broken.

I've checked pte_fragment_alloc() and it doesn't use the address either.
We need to modify it too.
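
For illustration only, a minimal sketch of how the powerpc book3s/64 helper could end up looking once pte_fragment_alloc() drops its address parameter as well (this is an assumption about the follow-up fix, not the patch that was posted):

static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
{
	/* Hypothetical signature: pte_fragment_alloc(mm, kernel), with the
	 * trailing 1 still meaning "this is a kernel page table".
	 */
	return (pte_t *)pte_fragment_alloc(mm, 1);
}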
Kirill A. Shutemov Oct. 11, 2018, 8:17 a.m. UTC | #10
On Wed, Oct 10, 2018 at 05:46:18PM -0700, Joel Fernandes wrote:
> On Wed, Oct 10, 2018 at 01:00:11PM +0300, Kirill A. Shutemov wrote:
> > On Tue, Oct 09, 2018 at 04:04:47PM -0700, Joel Fernandes wrote:
> > > On Wed, Oct 10, 2018 at 01:02:22AM +0300, Kirill A. Shutemov wrote:
> > > > On Tue, Oct 09, 2018 at 01:14:00PM -0700, Joel Fernandes (Google) wrote:
> > > > > Android needs to mremap large regions of memory during memory management
> > > > > related operations. The mremap system call can be really slow if THP is
> > > > > not enabled. The bottleneck is move_page_tables, which is copying each
> > > > > pte at a time, and can be really slow across a large map. Turning on THP
> > > > > may not be a viable option, and is not for us. This patch speeds up the
> > > > > performance for non-THP system by copying at the PMD level when possible.
> > > > > 
> > > > > The speed up is three orders of magnitude. On a 1GB mremap, the mremap
> > > > > completion times drops from 160-250 millesconds to 380-400 microseconds.
> > > > > 
> > > > > Before:
> > > > > Total mremap time for 1GB data: 242321014 nanoseconds.
> > > > > Total mremap time for 1GB data: 196842467 nanoseconds.
> > > > > Total mremap time for 1GB data: 167051162 nanoseconds.
> > > > > 
> > > > > After:
> > > > > Total mremap time for 1GB data: 385781 nanoseconds.
> > > > > Total mremap time for 1GB data: 388959 nanoseconds.
> > > > > Total mremap time for 1GB data: 402813 nanoseconds.
> > > > > 
> > > > > Incase THP is enabled, the optimization is skipped. I also flush the
> > > > > tlb every time we do this optimization since I couldn't find a way to
> > > > > determine if the low-level PTEs are dirty. It is seen that the cost of
> > > > > doing so is not much compared the improvement, on both x86-64 and arm64.
> > > > 
> > > > Okay. That's interesting.
> > > > 
> > > > It makes me wonder why we pass the virtual address to pte_alloc() (and
> > > > pte_alloc_one() inside).
> > > > 
> > > > If an arch has a real requirement to tie a page table to a virtual address
> > > > then the optimization cannot be used as it is. A per-arch approach should be fine
> > > > for this case, I guess.
> > > > 
> > > > If nobody uses the address we should just drop the argument as a
> > > > preparation to the patch.
> > > 
> > > I couldn't find any use of the address. But I am wondering why you feel
> > > passing the address is something that can't be done with the optimization.
> > > The pte_alloc only happens if the optimization is not triggered.
> > 
> > Yes, I know.
> > 
> > My worry is that some architecture has to allocate page table differently
> > depending on virtual address (due to aliasing or something). Original page
> > table was allocated for one virtual address and moving the page table to
> > different spot in virtual address space may break the invariant.
> > 
> > > Also the clean up of the argument that you're proposing is a bit out of scope
> > > of this patch but yeah we could clean it up in a separate patch if needed. I
> > > don't feel too strongly about that. It seems cosmetic and in the future if
> > > the address that's passed in is needed, then the architecture can use it.
> > 
> > Please, do. This should be a pretty mechanical change, but it will help to
> > make sure that none of the obscure architectures will be broken by the change.
> > 
> 
> The thing is, it's quite a lot of change; I wrote a coccinelle script to do it
> tree wide, following is the diffstat:
>  48 files changed, 91 insertions(+), 124 deletions(-)
> 
> Imagine then having to add the address argument back in the future in case
> it's ever needed. Is it really worth doing it?

This is the point. It will give us a chance to consider if the optimization
is still safe.

And it shouldn't be hard: [partially] revert the commit and get the address
back into the interface.

> First script : Remove the address argument from pte_alloc and friends:

...

Please add __pte_alloc(), __pte_alloc_kernel() and pte_alloc() to the
list for patching.
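
As a sketch of what that implies for the generic wrappers, assuming the macro shapes of the 4.19-era include/linux/mm.h (a hedged reconstruction, not a quote of the final series):

/* Before: the address is threaded through but never used by __pte_alloc(). */
#define pte_alloc(mm, pmd, address)				\
	(unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd, address))

/* After the cleanup: the argument disappears from the wrapper and from
 * __pte_alloc() / __pte_alloc_kernel() as well.
 */
#define pte_alloc(mm, pmd)					\
	(unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd))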
Michal Hocko Oct. 11, 2018, 12:02 p.m. UTC | #11
On Thu 11-10-18 11:17:19, Kirill A. Shutemov wrote:
[...]
> > The thing is, it's quite a lot of change; I wrote a coccinelle script to do it
> > tree wide, following is the diffstat:
> >  48 files changed, 91 insertions(+), 124 deletions(-)
> > 
> > Imagine then having to add the address argument back in the future in case
> > it's ever needed. Is it really worth doing it?
> 
> This is the point. It will give us a chance to consider if the optimization
> is still safe.
> 
> And it shouldn't be hard: [partially] revert the commit and get the address
> back into the interface.

I agree with Kirill. This will also remove quite a lot of pointless
code and make it clearer. It is impossible to see what the address is
good for, and I couldn't trace it back to the commit introducing it to
guess that either. So making sure nobody does anything with it is a
good prerequisite for further changes on top.

The change itself is really interesting. I still have to digest it
completely to see that there are no corner cases, but from a quick
glance it looks reasonable.
Joel Fernandes Oct. 12, 2018, 1:47 a.m. UTC | #12
On Thu, Oct 11, 2018 at 11:11:11AM +0300, Kirill A. Shutemov wrote:
> On Thu, Oct 11, 2018 at 08:14:19AM +0300, Kirill A. Shutemov wrote:
> > On Wed, Oct 10, 2018 at 05:46:18PM -0700, Joel Fernandes wrote:
> > > diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h b/arch/powerpc/include/asm/book3s/64/pgalloc.h
> > > index 391ed2c3b697..8a33f2044923 100644
> > > --- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
> > > +++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
> > > @@ -192,14 +192,12 @@ static inline pgtable_t pmd_pgtable(pmd_t pmd)
> > >  	return (pgtable_t)pmd_page_vaddr(pmd);
> > >  }
> > >  
> > > -static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
> > > -					  unsigned long address)
> > > +static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
> > >  {
> > >  	return (pte_t *)pte_fragment_alloc(mm, address, 1);
> > >  }
> > 
> > This is obviously broken.

I was actually aware of this; I was in the early stages of writing the
script and just shared the diff to give an idea of the number of changes.

> I've checked pte_fragment_alloc() and it doesn't use the address too.
> We need to modify it too.

I rewrote the Coccinelle script and manually fixed pte_fragment_alloc in
that. I sent an update with you and Michal on CC, please check it. Thanks a
lot.

 - Joel
Jann Horn Oct. 12, 2018, 3:21 a.m. UTC | #13
+cc xen maintainers and kvm folks

On Fri, Oct 12, 2018 at 4:40 AM Joel Fernandes (Google)
<joel@joelfernandes.org> wrote:
> Android needs to mremap large regions of memory during memory management
> related operations. The mremap system call can be really slow if THP is
> not enabled. The bottleneck is move_page_tables, which is copying each
> pte at a time, and can be really slow across a large map. Turning on THP
> may not be a viable option, and is not for us. This patch speeds up the
> performance for non-THP system by copying at the PMD level when possible.
[...]
> +bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> +                 unsigned long new_addr, unsigned long old_end,
> +                 pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
> +{
[...]
> +       /*
> +        * We don't have to worry about the ordering of src and dst
> +        * ptlocks because exclusive mmap_sem prevents deadlock.
> +        */
> +       old_ptl = pmd_lock(vma->vm_mm, old_pmd);
> +       if (old_ptl) {
> +               pmd_t pmd;
> +
> +               new_ptl = pmd_lockptr(mm, new_pmd);
> +               if (new_ptl != old_ptl)
> +                       spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> +
> +               /* Clear the pmd */
> +               pmd = *old_pmd;
> +               pmd_clear(old_pmd);
> +
> +               VM_BUG_ON(!pmd_none(*new_pmd));
> +
> +               /* Set the new pmd */
> +               set_pmd_at(mm, new_addr, new_pmd, pmd);
> +               if (new_ptl != old_ptl)
> +                       spin_unlock(new_ptl);
> +               spin_unlock(old_ptl);

How does this interact with Xen PV? From a quick look at the Xen PV
integration code in xen_alloc_ptpage(), it looks to me as if, in a
config that doesn't use split ptlocks, this is going to temporarily
drop Xen's type count for the page to zero, causing Xen to de-validate
and then re-validate the L1 pagetable; if you first set the new pmd
before clearing the old one, that wouldn't happen. I don't know how
this interacts with shadow paging implementations.

> +               *need_flush = true;
> +               return true;
> +       }
> +       return false;
> +}
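
(The alternative ordering Jann is alluding to would look roughly like the fragment below, shown only to make the question concrete rather than as a proposed fix; Juergen's reply below addresses whether it is needed at all.)

		/* Hypothetical reordering: populate the destination PMD first,
		 * so the PTE page is never momentarily unreferenced from any
		 * PMD, and only then clear the source entry.
		 */
		pmd = *old_pmd;
		set_pmd_at(mm, new_addr, new_pmd, pmd);
		pmd_clear(old_pmd);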
Jürgen Groß Oct. 12, 2018, 5:29 a.m. UTC | #14
On 12/10/2018 05:21, Jann Horn wrote:
> +cc xen maintainers and kvm folks
> 
> On Fri, Oct 12, 2018 at 4:40 AM Joel Fernandes (Google)
> <joel@joelfernandes.org> wrote:
>> Android needs to mremap large regions of memory during memory management
>> related operations. The mremap system call can be really slow if THP is
>> not enabled. The bottleneck is move_page_tables, which is copying each
>> pte at a time, and can be really slow across a large map. Turning on THP
>> may not be a viable option, and is not for us. This patch speeds up the
>> performance for non-THP system by copying at the PMD level when possible.
> [...]
>> +bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
>> +                 unsigned long new_addr, unsigned long old_end,
>> +                 pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
>> +{
> [...]
>> +       /*
>> +        * We don't have to worry about the ordering of src and dst
>> +        * ptlocks because exclusive mmap_sem prevents deadlock.
>> +        */
>> +       old_ptl = pmd_lock(vma->vm_mm, old_pmd);
>> +       if (old_ptl) {
>> +               pmd_t pmd;
>> +
>> +               new_ptl = pmd_lockptr(mm, new_pmd);
>> +               if (new_ptl != old_ptl)
>> +                       spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
>> +
>> +               /* Clear the pmd */
>> +               pmd = *old_pmd;
>> +               pmd_clear(old_pmd);
>> +
>> +               VM_BUG_ON(!pmd_none(*new_pmd));
>> +
>> +               /* Set the new pmd */
>> +               set_pmd_at(mm, new_addr, new_pmd, pmd);
>> +               if (new_ptl != old_ptl)
>> +                       spin_unlock(new_ptl);
>> +               spin_unlock(old_ptl);
> 
> How does this interact with Xen PV? From a quick look at the Xen PV
> integration code in xen_alloc_ptpage(), it looks to me as if, in a
> config that doesn't use split ptlocks, this is going to temporarily
> drop Xen's type count for the page to zero, causing Xen to de-validate
> and then re-validate the L1 pagetable; if you first set the new pmd
> before clearing the old one, that wouldn't happen. I don't know how
> this interacts with shadow paging implementations.

No, this isn't an issue. As the L1 pagetable isn't being released it
will stay pinned, so there will be no need to revalidate it.

For Xen in shadow mode I'm quite sure it just doesn't matter. If
another thread of the process is accessing the memory in parallel it
might even be better not to have an L1 pagetable with 2 references at
the same time, but this is an academic problem which doesn't need to
be tuned for performance IMO.


Juergen
Jann Horn Oct. 12, 2018, 5:34 a.m. UTC | #15
On Fri, Oct 12, 2018 at 7:29 AM Juergen Gross <jgross@suse.com> wrote:
> On 12/10/2018 05:21, Jann Horn wrote:
> > +cc xen maintainers and kvm folks
> >
> > On Fri, Oct 12, 2018 at 4:40 AM Joel Fernandes (Google)
> > <joel@joelfernandes.org> wrote:
> >> Android needs to mremap large regions of memory during memory management
> >> related operations. The mremap system call can be really slow if THP is
> >> not enabled. The bottleneck is move_page_tables, which is copying each
> >> pte at a time, and can be really slow across a large map. Turning on THP
> >> may not be a viable option, and is not for us. This patch speeds up the
> >> performance for non-THP system by copying at the PMD level when possible.
> > [...]
> >> +bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> >> +                 unsigned long new_addr, unsigned long old_end,
> >> +                 pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
> >> +{
> > [...]
> >> +       /*
> >> +        * We don't have to worry about the ordering of src and dst
> >> +        * ptlocks because exclusive mmap_sem prevents deadlock.
> >> +        */
> >> +       old_ptl = pmd_lock(vma->vm_mm, old_pmd);
> >> +       if (old_ptl) {
> >> +               pmd_t pmd;
> >> +
> >> +               new_ptl = pmd_lockptr(mm, new_pmd);
> >> +               if (new_ptl != old_ptl)
> >> +                       spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> >> +
> >> +               /* Clear the pmd */
> >> +               pmd = *old_pmd;
> >> +               pmd_clear(old_pmd);
> >> +
> >> +               VM_BUG_ON(!pmd_none(*new_pmd));
> >> +
> >> +               /* Set the new pmd */
> >> +               set_pmd_at(mm, new_addr, new_pmd, pmd);
> >> +               if (new_ptl != old_ptl)
> >> +                       spin_unlock(new_ptl);
> >> +               spin_unlock(old_ptl);
> >
> > How does this interact with Xen PV? From a quick look at the Xen PV
> > integration code in xen_alloc_ptpage(), it looks to me as if, in a
> > config that doesn't use split ptlocks, this is going to temporarily
> > drop Xen's type count for the page to zero, causing Xen to de-validate
> > and then re-validate the L1 pagetable; if you first set the new pmd
> > before clearing the old one, that wouldn't happen. I don't know how
> > this interacts with shadow paging implementations.
>
> No, this isn't an issue. As the L1 pagetable isn't being released it
> will stay pinned, so there will be no need to revalidate it.

Where exactly is the L1 pagetable pinned? xen_alloc_ptpage() does:

        if (static_branch_likely(&xen_struct_pages_ready))
            SetPagePinned(page);

        if (!PageHighMem(page)) {
            xen_mc_batch();

            __set_pfn_prot(pfn, PAGE_KERNEL_RO);

            if (level == PT_PTE && USE_SPLIT_PTE_PTLOCKS)
                __pin_pagetable_pfn(MMUEXT_PIN_L1_TABLE, pfn);

            xen_mc_issue(PARAVIRT_LAZY_MMU);
        } else {
            /* make sure there are no stray mappings of
               this page */
            kmap_flush_unused();
        }

which means that if USE_SPLIT_PTE_PTLOCKS is false, the table doesn't
get pinned and only stays typed as long as it is referenced by an L2
table, right?
Paolo Bonzini Oct. 12, 2018, 7:24 a.m. UTC | #16
On 12/10/2018 05:21, Jann Horn wrote:
> I don't know how this interacts with shadow paging implementations.

Shadow paging simply uses MMU notifiers and that does not assume that
PTE invalidation is atomic.  The invalidate_range_start and
invalidate_range_end calls are not affected by Joel's patch, so it's ok
for KVM and also for other users of MMU notifiers.

Paolo
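
(For reference, the reason this holds: in the 4.19-era mm/mremap.c, move_page_tables() already brackets the whole copy loop, including the new PMD-level move, with the notifier calls. Roughly, reconstructed from memory rather than quoted:)

	mmun_start = old_addr;
	mmun_end   = old_end;
	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);

	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
		/* ... per-PMD work, including move_normal_pmd() ... */
	}

	if (need_flush)
		flush_tlb_range(vma, old_end - len, old_addr);

	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);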
Jürgen Groß Oct. 12, 2018, 7:29 a.m. UTC | #17
On 12/10/2018 07:34, Jann Horn wrote:
> On Fri, Oct 12, 2018 at 7:29 AM Juergen Gross <jgross@suse.com> wrote:
>> On 12/10/2018 05:21, Jann Horn wrote:
>>> +cc xen maintainers and kvm folks
>>>
>>> On Fri, Oct 12, 2018 at 4:40 AM Joel Fernandes (Google)
>>> <joel@joelfernandes.org> wrote:
>>>> Android needs to mremap large regions of memory during memory management
>>>> related operations. The mremap system call can be really slow if THP is
>>>> not enabled. The bottleneck is move_page_tables, which is copying each
>>>> pte at a time, and can be really slow across a large map. Turning on THP
>>>> may not be a viable option, and is not for us. This patch speeds up the
>>>> performance for non-THP system by copying at the PMD level when possible.
>>> [...]
>>>> +bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
>>>> +                 unsigned long new_addr, unsigned long old_end,
>>>> +                 pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
>>>> +{
>>> [...]
>>>> +       /*
>>>> +        * We don't have to worry about the ordering of src and dst
>>>> +        * ptlocks because exclusive mmap_sem prevents deadlock.
>>>> +        */
>>>> +       old_ptl = pmd_lock(vma->vm_mm, old_pmd);
>>>> +       if (old_ptl) {
>>>> +               pmd_t pmd;
>>>> +
>>>> +               new_ptl = pmd_lockptr(mm, new_pmd);
>>>> +               if (new_ptl != old_ptl)
>>>> +                       spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
>>>> +
>>>> +               /* Clear the pmd */
>>>> +               pmd = *old_pmd;
>>>> +               pmd_clear(old_pmd);
>>>> +
>>>> +               VM_BUG_ON(!pmd_none(*new_pmd));
>>>> +
>>>> +               /* Set the new pmd */
>>>> +               set_pmd_at(mm, new_addr, new_pmd, pmd);
>>>> +               if (new_ptl != old_ptl)
>>>> +                       spin_unlock(new_ptl);
>>>> +               spin_unlock(old_ptl);
>>>
>>> How does this interact with Xen PV? From a quick look at the Xen PV
>>> integration code in xen_alloc_ptpage(), it looks to me as if, in a
>>> config that doesn't use split ptlocks, this is going to temporarily
>>> drop Xen's type count for the page to zero, causing Xen to de-validate
>>> and then re-validate the L1 pagetable; if you first set the new pmd
>>> before clearing the old one, that wouldn't happen. I don't know how
>>> this interacts with shadow paging implementations.
>>
>> No, this isn't an issue. As the L1 pagetable isn't being released it
>> will stay pinned, so there will be no need to revalidate it.
> 
> Where exactly is the L1 pagetable pinned? xen_alloc_ptpage() does:
> 
>         if (static_branch_likely(&xen_struct_pages_ready))
>             SetPagePinned(page);

This marks the pagetable as to-be-pinned, so that it gets pinned via

xen_activate_mm()
  xen_pgd_pin()
    __xen_pgd_pin()
      __xen_pgd_walk()
        xen_pin_page()
          xen_do_pin()

> 
>         if (!PageHighMem(page)) {
>             xen_mc_batch();
> 
>             __set_pfn_prot(pfn, PAGE_KERNEL_RO);
> 
>             if (level == PT_PTE && USE_SPLIT_PTE_PTLOCKS)
>                 __pin_pagetable_pfn(MMUEXT_PIN_L1_TABLE, pfn);
> 
>             xen_mc_issue(PARAVIRT_LAZY_MMU);
>         } else {
>             /* make sure there are no stray mappings of
>                this page */
>             kmap_flush_unused();
>         }
> 
> which means that if USE_SPLIT_PTE_PTLOCKS is false, the table doesn't
> get pinned and only stays typed as long as it is referenced by an L2
> table, right?

If the pagetable has been allocated after activation of the address
space it seems indeed not to be pinned yet. IMO this is not meant to
be that way, but probably most kernel configs for 32-bit PV guests
have NR_CPUS > 4, so they do use split ptlocks.

In fact this seems to be a bug, as at deactivation of the address
space the kernel will try to unpin the pagetable and the hypervisor
would issue a warning if it has been built with debug messages
enabled. The same applies to suspend()/resume() cycles.


Juergen
diff mbox series

Patch

diff --git a/mm/mremap.c b/mm/mremap.c
index 5c2e18505f75..68ddc9e9dfde 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -191,6 +191,54 @@  static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 		drop_rmap_locks(vma);
 }
 
+bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
+		  unsigned long new_addr, unsigned long old_end,
+		  pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
+{
+	spinlock_t *old_ptl, *new_ptl;
+	struct mm_struct *mm = vma->vm_mm;
+
+	if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
+	    || old_end - old_addr < PMD_SIZE)
+		return false;
+
+	/*
+	 * The destination pmd shouldn't be established; free_pgtables()
+	 * should have released it.
+	 */
+	if (WARN_ON(!pmd_none(*new_pmd)))
+		return false;
+
+	/*
+	 * We don't have to worry about the ordering of src and dst
+	 * ptlocks because exclusive mmap_sem prevents deadlock.
+	 */
+	old_ptl = pmd_lock(vma->vm_mm, old_pmd);
+	if (old_ptl) {
+		pmd_t pmd;
+
+		new_ptl = pmd_lockptr(mm, new_pmd);
+		if (new_ptl != old_ptl)
+			spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
+
+		/* Clear the pmd */
+		pmd = *old_pmd;
+		pmd_clear(old_pmd);
+
+		VM_BUG_ON(!pmd_none(*new_pmd));
+
+		/* Set the new pmd */
+		set_pmd_at(mm, new_addr, new_pmd, pmd);
+		if (new_ptl != old_ptl)
+			spin_unlock(new_ptl);
+		spin_unlock(old_ptl);
+
+		*need_flush = true;
+		return true;
+	}
+	return false;
+}
+
 unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len,
@@ -239,7 +287,21 @@  unsigned long move_page_tables(struct vm_area_struct *vma,
 			split_huge_pmd(vma, old_pmd, old_addr);
 			if (pmd_trans_unstable(old_pmd))
 				continue;
+		} else if (extent == PMD_SIZE) {
+			bool moved;
+
+			/* See comment in move_ptes() */
+			if (need_rmap_locks)
+				take_rmap_locks(vma);
+			moved = move_normal_pmd(vma, old_addr, new_addr,
+					old_end, old_pmd, new_pmd,
+					&need_flush);
+			if (need_rmap_locks)
+				drop_rmap_locks(vma);
+			if (moved)
+				continue;
 		}
+
 		if (pte_alloc(new_vma->vm_mm, new_pmd, new_addr))
 			break;
 		next = (new_addr + PMD_SIZE) & PMD_MASK;