Message ID | 20210610083549.386085-6-aneesh.kumar@linux.ibm.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | mremap fixes | expand |
Linus Torvalds <torvalds@linux-foundation.org> writes: > On Thu, Jun 10, 2021 at 1:36 AM Aneesh Kumar K.V > <aneesh.kumar@linux.ibm.com> wrote: >> >> @@ -306,8 +306,7 @@ static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr, >> >> VM_BUG_ON(!pud_none(*new_pud)); >> >> - /* Set the new pud */ >> - set_pud_at(mm, new_addr, new_pud, pud); >> + pud_populate(mm, new_pud, (pmd_t *)pud_page_vaddr(pud)); >> flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE); >> if (new_ptl != old_ptl) >> spin_unlock(new_ptl); > > That "(pmd_t *)pud_page_vaddr(pud)" looks a bit odd and doesn't match > the pattern. > > Should we perhaps have a wrapper for it? Maybe called "pud_pgtable()", > the same way we have pmd_pgtable()? IIUC the reason why we do have pmd_pgtable() is that pgtable_t type is arch dependent. On some architecture it is pte_t * and on the other struct page *. The reason being highmem and level 4 page table can be located in highmem. The above is not true with the level 3 table and hence we didn't add an extra type to point to level 3 table. We could add pud_pgtable(), but then it will essentially be a typecast as I did above? Even if we want to do that, that should be done as a separate patch than this series? > > And that wrapper would be good to have a comment or two about what it > actually is. The same way that pmd_pgtable() should but doesn't ;( > > Linus
On Sun, Jun 13, 2021 at 02:36:13PM +0530, Aneesh Kumar K.V wrote: > IIUC the reason why we do have pmd_pgtable() is that pgtable_t type > is arch dependent. On some architecture it is pte_t * and on the other > struct page *. The reason being highmem and level 4 page table can > be located in highmem. That is ahistorical. See 2f569afd9ced9ebec9a6eb3dbf6f83429be0a7b4 -- we have pgtable_t for the benefit of s390's crazy sub-page page table sizes. Also, please stop numbering page tables upside down. PTEs are first level, not fourth.
On 6/13/21 4:20 PM, Matthew Wilcox wrote: > On Sun, Jun 13, 2021 at 02:36:13PM +0530, Aneesh Kumar K.V wrote: >> IIUC the reason why we do have pmd_pgtable() is that pgtable_t type >> is arch dependent. On some architecture it is pte_t * and on the other >> struct page *. The reason being highmem and level 4 page table can >> be located in highmem. > > That is ahistorical. See 2f569afd9ced9ebec9a6eb3dbf6f83429be0a7b4 -- > we have pgtable_t for the benefit of s390's crazy sub-page page table > sizes. That is also true with ppc64. We do sub-page page table size. I was trying to explain why it can't be pte_t * everywhere and why we have it as struct page *. > > Also, please stop numbering page tables upside down. PTEs are first > level, not fourth. > POWER ISA do name it the other way. I also see some pages explaining levels the other way https://www.bottomupcs.com/virtual_memory_linux.xhtml whereas https://en.wikipedia.org/wiki/Intel_5-level_paging#/media/File:Page_Tables_(5_levels).svg I am pretty sure I had commits that explained page table level as I did in this thread. I will switch to your suggestion in further discussions. May be the best solution is to attribute it with more details like level 1 (pte_t *)? -aneesh
On Sun, Jun 13, 2021 at 2:06 AM Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> wrote: > > IIUC the reason why we do have pmd_pgtable() is that pgtable_t type > is arch dependent. On some architecture it is pte_t * and on the other > struct page *. The reason being highmem and level 4 page table can > be located in highmem. Honestly, the same confusion is real - in a different way - about pud_page_vaddr(). I really hate that function. Just grep for the uses, and the definitions, to see what I mean. It's crazy. I'm perfectly happy not having a "pud_pagetable()" function, but that cast on pud_page_vaddr() is indicative of real problems. One solution might be to just say "pud_page_vaddr()" must return a "pmd_t *". I think it's what all the users actually want anyway. Linus
Le 13/06/2021 à 13:13, Aneesh Kumar K.V a écrit : > On 6/13/21 4:20 PM, Matthew Wilcox wrote: >> On Sun, Jun 13, 2021 at 02:36:13PM +0530, Aneesh Kumar K.V wrote: >>> IIUC the reason why we do have pmd_pgtable() is that pgtable_t type >>> is arch dependent. On some architecture it is pte_t * and on the other >>> struct page *. The reason being highmem and level 4 page table can >>> be located in highmem. >> >> That is ahistorical. See 2f569afd9ced9ebec9a6eb3dbf6f83429be0a7b4 -- >> we have pgtable_t for the benefit of s390's crazy sub-page page table >> sizes. > > That is also true with ppc64. We do sub-page page table size. I was trying to explain why it can't > be pte_t * everywhere and why we have > it as struct page *. ppc32 as well. On the 8xx, with 16k size pages, the HW still use 4k page tables, so we do use sub-pages. In order too keep the code simple, we have converted all powerpc to sub-pages for that, allthough some powerpc platforms have only one sub-page per page. Christophe
diff --git a/mm/mremap.c b/mm/mremap.c index 795a7d628b53..dacfa9111ab1 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -26,6 +26,7 @@ #include <asm/cacheflush.h> #include <asm/tlbflush.h> +#include <asm/pgalloc.h> #include "internal.h" @@ -258,8 +259,7 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr, VM_BUG_ON(!pmd_none(*new_pmd)); - /* Set the new pmd */ - set_pmd_at(mm, new_addr, new_pmd, pmd); + pmd_populate(mm, new_pmd, pmd_pgtable(pmd)); flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE); if (new_ptl != old_ptl) spin_unlock(new_ptl); @@ -306,8 +306,7 @@ static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr, VM_BUG_ON(!pud_none(*new_pud)); - /* Set the new pud */ - set_pud_at(mm, new_addr, new_pud, pud); + pud_populate(mm, new_pud, (pmd_t *)pud_page_vaddr(pud)); flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE); if (new_ptl != old_ptl) spin_unlock(new_ptl);
pmd/pud_populate is the right interface to be used to set the respective page table entries. Some architectures like ppc64 do assume that set_pmd/pud_at can only be used to set a hugepage PTE. Since we are not setting up a hugepage PTE here, use the pmd/pud_populate interface. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> --- mm/mremap.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-)