diff mbox series

[5/6] mm/mremap: Use pmd/pud_poplulate to update page table entries

Message ID 20210610083549.386085-6-aneesh.kumar@linux.ibm.com (mailing list archive)
State New, archived
Headers show
Series mremap fixes | expand

Commit Message

Aneesh Kumar K.V June 10, 2021, 8:35 a.m. UTC
pmd/pud_populate is the right interface to be used to set the respective
page table entries. Some architectures like ppc64 do assume that set_pmd/pud_at
can only be used to set a hugepage PTE. Since we are not setting up a hugepage
PTE here, use the pmd/pud_populate interface.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 mm/mremap.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

Comments

Aneesh Kumar K.V June 13, 2021, 9:06 a.m. UTC | #1
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Thu, Jun 10, 2021 at 1:36 AM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> @@ -306,8 +306,7 @@ static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
>>
>>         VM_BUG_ON(!pud_none(*new_pud));
>>
>> -       /* Set the new pud */
>> -       set_pud_at(mm, new_addr, new_pud, pud);
>> +       pud_populate(mm, new_pud, (pmd_t *)pud_page_vaddr(pud));
>>         flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE);
>>         if (new_ptl != old_ptl)
>>                 spin_unlock(new_ptl);
>
> That "(pmd_t *)pud_page_vaddr(pud)" looks a bit odd and doesn't match
> the pattern.
>
> Should we perhaps have a wrapper for it? Maybe called "pud_pgtable()",
> the same way we have pmd_pgtable()?

IIUC the reason why we do have pmd_pgtable() is that pgtable_t type
is arch dependent. On some architecture it is pte_t * and on the other
struct page *. The reason being highmem and level 4 page table can
be located in highmem. 

The above is not true with the level 3 table and hence we didn't add an
extra type to point to level 3 table.

We could add pud_pgtable(), but then it will essentially be a typecast
as I did above?

Even if we want to do that, that should be done as a separate patch than
this series?

>
> And that wrapper would be good to have a comment or two about what it
> actually is. The same way that pmd_pgtable() should but doesn't ;(
>
>             Linus
Matthew Wilcox June 13, 2021, 10:50 a.m. UTC | #2
On Sun, Jun 13, 2021 at 02:36:13PM +0530, Aneesh Kumar K.V wrote:
> IIUC the reason why we do have pmd_pgtable() is that pgtable_t type
> is arch dependent. On some architecture it is pte_t * and on the other
> struct page *. The reason being highmem and level 4 page table can
> be located in highmem. 

That is ahistorical.  See 2f569afd9ced9ebec9a6eb3dbf6f83429be0a7b4 --
we have pgtable_t for the benefit of s390's crazy sub-page page table
sizes.

Also, please stop numbering page tables upside down.  PTEs are first
level, not fourth.
Aneesh Kumar K.V June 13, 2021, 11:13 a.m. UTC | #3
On 6/13/21 4:20 PM, Matthew Wilcox wrote:
> On Sun, Jun 13, 2021 at 02:36:13PM +0530, Aneesh Kumar K.V wrote:
>> IIUC the reason why we do have pmd_pgtable() is that pgtable_t type
>> is arch dependent. On some architecture it is pte_t * and on the other
>> struct page *. The reason being highmem and level 4 page table can
>> be located in highmem.
> 
> That is ahistorical.  See 2f569afd9ced9ebec9a6eb3dbf6f83429be0a7b4 --
> we have pgtable_t for the benefit of s390's crazy sub-page page table
> sizes.

That is also true with ppc64. We do sub-page page table size. I was 
trying to explain why it can't be pte_t * everywhere and why we have
it as struct page *.

> 
> Also, please stop numbering page tables upside down.  PTEs are first
> level, not fourth.
> 

POWER ISA do name it the other way. I also see some pages explaining 
levels the other way

https://www.bottomupcs.com/virtual_memory_linux.xhtml

whereas 
https://en.wikipedia.org/wiki/Intel_5-level_paging#/media/File:Page_Tables_(5_levels).svg

I am pretty sure I had commits that explained page table level as I did 
in this thread. I will switch to your suggestion in further discussions. 
May be the best solution is to attribute it with more details like level 
1 (pte_t *)?


-aneesh
Linus Torvalds June 13, 2021, 6:53 p.m. UTC | #4
On Sun, Jun 13, 2021 at 2:06 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> IIUC the reason why we do have pmd_pgtable() is that pgtable_t type
> is arch dependent. On some architecture it is pte_t * and on the other
> struct page *. The reason being highmem and level 4 page table can
> be located in highmem.

Honestly, the same confusion is real - in a different way - about
pud_page_vaddr().

I really hate that function.

Just grep for the uses, and the definitions, to see what I mean. It's crazy.

I'm perfectly happy not having a "pud_pagetable()" function, but that
cast on pud_page_vaddr() is indicative of real problems.

One solution might be to just say "pud_page_vaddr()" must return a "pmd_t *".

I think it's what all the users actually want anyway.

              Linus
Christophe Leroy June 14, 2021, 5:27 a.m. UTC | #5
Le 13/06/2021 à 13:13, Aneesh Kumar K.V a écrit :
> On 6/13/21 4:20 PM, Matthew Wilcox wrote:
>> On Sun, Jun 13, 2021 at 02:36:13PM +0530, Aneesh Kumar K.V wrote:
>>> IIUC the reason why we do have pmd_pgtable() is that pgtable_t type
>>> is arch dependent. On some architecture it is pte_t * and on the other
>>> struct page *. The reason being highmem and level 4 page table can
>>> be located in highmem.
>>
>> That is ahistorical.  See 2f569afd9ced9ebec9a6eb3dbf6f83429be0a7b4 --
>> we have pgtable_t for the benefit of s390's crazy sub-page page table
>> sizes.
> 
> That is also true with ppc64. We do sub-page page table size. I was trying to explain why it can't 
> be pte_t * everywhere and why we have
> it as struct page *.

ppc32 as well. On the 8xx, with 16k size pages, the HW still use 4k page tables, so we do use sub-pages.

In order too keep the code simple, we have converted all powerpc to sub-pages for that, allthough 
some powerpc platforms have only one sub-page per page.

Christophe
diff mbox series

Patch

diff --git a/mm/mremap.c b/mm/mremap.c
index 795a7d628b53..dacfa9111ab1 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -26,6 +26,7 @@ 
 
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
+#include <asm/pgalloc.h>
 
 #include "internal.h"
 
@@ -258,8 +259,7 @@  static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 
 	VM_BUG_ON(!pmd_none(*new_pmd));
 
-	/* Set the new pmd */
-	set_pmd_at(mm, new_addr, new_pmd, pmd);
+	pmd_populate(mm, new_pmd, pmd_pgtable(pmd));
 	flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE);
 	if (new_ptl != old_ptl)
 		spin_unlock(new_ptl);
@@ -306,8 +306,7 @@  static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
 
 	VM_BUG_ON(!pud_none(*new_pud));
 
-	/* Set the new pud */
-	set_pud_at(mm, new_addr, new_pud, pud);
+	pud_populate(mm, new_pud, (pmd_t *)pud_page_vaddr(pud));
 	flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE);
 	if (new_ptl != old_ptl)
 		spin_unlock(new_ptl);