diff mbox series

mm, thp: Fix mlocking THP page with migration enabled

Message ID 20180911103403.38086-1-kirill.shutemov@linux.intel.com (mailing list archive)
State New, archived
Headers show
Series mm, thp: Fix mlocking THP page with migration enabled | expand

Commit Message

Kirill A . Shutemov Sept. 11, 2018, 10:34 a.m. UTC
A transparent huge page is represented by a single entry on an LRU list.
Therefore, we can only make unevictable an entire compound page, not
individual subpages.

If a user tries to mlock() part of a huge page, we want the rest of the
page to be reclaimable.

We handle this by keeping PTE-mapped huge pages on normal LRU lists: the
PMD on border of VM_LOCKED VMA will be split into PTE table.

Introduction of THP migration breaks the rules around mlocking THP
pages. If we had a single PMD mapping of the page in mlocked VMA, the
page will get mlocked, regardless of PTE mappings of the page.

For tmpfs/shmem it's easy to fix by checking PageDoubleMap() in
remove_migration_pmd().

Anon THP pages can only be shared between processes via fork(). Mlocked
page can only be shared if parent mlocked it before forking, otherwise
CoW will be triggered on mlock().

For Anon-THP, we can fix the issue by munlocking the page on removing PTE
migration entry for the page. PTEs for the page will always come after
mlocked PMD: rmap walks VMAs from oldest to newest.

Test-case:

	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/wait.h>
	#include <linux/mempolicy.h>
	#include <numaif.h>

	int main(void)
	{
	        unsigned long nodemask = 4;
	        void *addr;

		addr = mmap((void *)0x20000000UL, 2UL << 20, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);

	        if (fork()) {
			wait(NULL);
			return 0;
	        }

	        mlock(addr, 4UL << 10);
	        mbind(addr, 2UL << 20, MPOL_PREFERRED | MPOL_F_RELATIVE_NODES,
	                &nodemask, 4, MPOL_MF_MOVE | MPOL_MF_MOVE_ALL);

	        return 0;
	}

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
Fixes: 616b8371539a ("mm: thp: enable thp migration in generic path")
Cc: <stable@vger.kernel.org> [v4.14+]
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c | 2 +-
 mm/migrate.c     | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

Comments

Zi Yan Sept. 11, 2018, 8:30 p.m. UTC | #1
Hi Kirill,

On 11 Sep 2018, at 6:34, Kirill A. Shutemov wrote:

> A transparent huge page is represented by a single entry on an LRU list.
> Therefore, we can only make unevictable an entire compound page, not
> individual subpages.
>
> If a user tries to mlock() part of a huge page, we want the rest of the
> page to be reclaimable.
>
> We handle this by keeping PTE-mapped huge pages on normal LRU lists: the
> PMD on border of VM_LOCKED VMA will be split into PTE table.
>
> Introduction of THP migration breaks the rules around mlocking THP
> pages. If we had a single PMD mapping of the page in mlocked VMA, the
> page will get mlocked, regardless of PTE mappings of the page.
>
> For tmpfs/shmem it's easy to fix by checking PageDoubleMap() in
> remove_migration_pmd().
>
> Anon THP pages can only be shared between processes via fork(). Mlocked
> page can only be shared if parent mlocked it before forking, otherwise
> CoW will be triggered on mlock().
>
> For Anon-THP, we can fix the issue by munlocking the page on removing PTE
> migration entry for the page. PTEs for the page will always come after
> mlocked PMD: rmap walks VMAs from oldest to newest.
>
> Test-case:
>
> 	#include <unistd.h>
> 	#include <sys/mman.h>
> 	#include <sys/wait.h>
> 	#include <linux/mempolicy.h>
> 	#include <numaif.h>
>
> 	int main(void)
> 	{
> 	        unsigned long nodemask = 4;
> 	        void *addr;
>
> 		addr = mmap((void *)0x20000000UL, 2UL << 20, PROT_READ | PROT_WRITE,
> 			MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);
>
> 	        if (fork()) {
> 			wait(NULL);
> 			return 0;
> 	        }
>
> 	        mlock(addr, 4UL << 10);
> 	        mbind(addr, 2UL << 20, MPOL_PREFERRED | MPOL_F_RELATIVE_NODES,
> 	                &nodemask, 4, MPOL_MF_MOVE | MPOL_MF_MOVE_ALL);
>
> 	        return 0;
> 	}
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
> Fixes: 616b8371539a ("mm: thp: enable thp migration in generic path")
> Cc: <stable@vger.kernel.org> [v4.14+]
> Cc: Zi Yan <zi.yan@cs.rutgers.edu>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  mm/huge_memory.c | 2 +-
>  mm/migrate.c     | 3 +++
>  2 files changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 533f9b00147d..00704060b7f7 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2931,7 +2931,7 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
>  	else
>  		page_add_file_rmap(new, true);
>  	set_pmd_at(mm, mmun_start, pvmw->pmd, pmde);
> -	if (vma->vm_flags & VM_LOCKED)
> +	if ((vma->vm_flags & VM_LOCKED) && !PageDoubleMap(new))
>  		mlock_vma_page(new);
>  	update_mmu_cache_pmd(vma, address, pvmw->pmd);
>  }
> diff --git a/mm/migrate.c b/mm/migrate.c
> index d6a2e89b086a..01dad96b25b5 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -275,6 +275,9 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
>  		if (vma->vm_flags & VM_LOCKED && !PageTransCompound(new))
>  			mlock_vma_page(new);
>
> +		if (PageTransCompound(new) && PageMlocked(page))
> +			clear_page_mlock(page);
> +
>  		/* No need to invalidate - it was non-present before */
>  		update_mmu_cache(vma, pvmw.address, pvmw.pte);
>  	}
> -- 
> 2.18.0

Thanks for your patch. It fixes the mlock problem demonstrated by your test program.

I want to understand the Anon THP part of the problem more clearly. For Anon THPs,
you said, PTEs for the page will always come after mlocked PMD. I just wonder that if
a process forks a child1 which forks its own child2 and the child1 mlocks a subpage causing
split_pmd_page() and migrates its PTE-mapped THP, will the kernel see the sequence of PMD-mapped THP,
PTE-mapped THP, and PMD-mapped THP while walking VMAs? Will the second PMD-mapped THP
reset the mlock on the page?

In addition, I also discover that PageDoubleMap is not set for double mapped Anon THPs after migration,
the following patch fixes it. Do you want me to send it separately or you can merge it
with your patch?

diff --git a/mm/rmap.c b/mm/rmap.c
index eb477809a5c0..defe8fc265e3 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1121,6 +1121,16 @@ void do_page_add_anon_rmap(struct page *page,
                 */
                if (compound)
                        __inc_node_page_state(page, NR_ANON_THPS);
+               else {
+                       if (PageTransCompound(page) && compound_mapcount(page) > 0) {
+                               struct page *head = compound_head(page);
+                               int i;
+
+                               SetPageDoubleMap(head);
+                               for (i = 0; i < HPAGE_PMD_NR; i++)
+                                       atomic_inc(&head[i]._mapcount);
+                       }
+               }
                __mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, nr);
        }
        if (unlikely(PageKsm(page)))

—
Best Regards,
Yan Zi
Kirill A. Shutemov Sept. 11, 2018, 10:02 p.m. UTC | #2
On Tue, Sep 11, 2018 at 04:30:02PM -0400, Zi Yan wrote:
> I want to understand the Anon THP part of the problem more clearly. For Anon THPs,
> you said, PTEs for the page will always come after mlocked PMD. I just wonder that if
> a process forks a child1 which forks its own child2 and the child1 mlocks a subpage causing
> split_pmd_page() and migrates its PTE-mapped THP, will the kernel see the sequence of PMD-mapped THP,
> PTE-mapped THP, and PMD-mapped THP while walking VMAs? Will the second PMD-mapped THP
> reset the mlock on the page?

VM_LOCKED is not inheritable. child2 will not have the VMA mlocked and
will not try to mlock the page. If child2 will do mlock() manually it will
cause CoW and the process will ge new pages in the new VMA.

> In addition, I also discover that PageDoubleMap is not set for double mapped Anon THPs after migration,
> the following patch fixes it. Do you want me to send it separately or you can merge it
> with your patch?

I think we can live without DoubleMap for anon-THP: rmap walking order
semantics and the fact that page can be shared only over fork() should be
enough to get it under control. DoubleMap comes with overhead and I would
like to avoid it where possible.
Aneesh Kumar K.V Sept. 12, 2018, 9:58 a.m. UTC | #3
On 9/11/18 4:04 PM, Kirill A. Shutemov wrote:
> A transparent huge page is represented by a single entry on an LRU list.
> Therefore, we can only make unevictable an entire compound page, not
> individual subpages.
> 
> If a user tries to mlock() part of a huge page, we want the rest of the
> page to be reclaimable.
> 
> We handle this by keeping PTE-mapped huge pages on normal LRU lists: the
> PMD on border of VM_LOCKED VMA will be split into PTE table.
> 
> Introduction of THP migration breaks the rules around mlocking THP
> pages. If we had a single PMD mapping of the page in mlocked VMA, the
> page will get mlocked, regardless of PTE mappings of the page.
> 
> For tmpfs/shmem it's easy to fix by checking PageDoubleMap() in
> remove_migration_pmd().
> 
> Anon THP pages can only be shared between processes via fork(). Mlocked
> page can only be shared if parent mlocked it before forking, otherwise
> CoW will be triggered on mlock().
> 
> For Anon-THP, we can fix the issue by munlocking the page on removing PTE
> migration entry for the page. PTEs for the page will always come after
> mlocked PMD: rmap walks VMAs from oldest to newest.
> 
> Test-case:
> 
> 	#include <unistd.h>
> 	#include <sys/mman.h>
> 	#include <sys/wait.h>
> 	#include <linux/mempolicy.h>
> 	#include <numaif.h>
> 
> 	int main(void)
> 	{
> 	        unsigned long nodemask = 4;
> 	        void *addr;
> 
> 		addr = mmap((void *)0x20000000UL, 2UL << 20, PROT_READ | PROT_WRITE,
> 			MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);
> 
> 	        if (fork()) {
> 			wait(NULL);
> 			return 0;
> 	        }
> 
> 	        mlock(addr, 4UL << 10);
> 	        mbind(addr, 2UL << 20, MPOL_PREFERRED | MPOL_F_RELATIVE_NODES,
> 	                &nodemask, 4, MPOL_MF_MOVE | MPOL_MF_MOVE_ALL);
> 
> 	        return 0;
> 	}
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
> Fixes: 616b8371539a ("mm: thp: enable thp migration in generic path")
> Cc: <stable@vger.kernel.org> [v4.14+]
> Cc: Zi Yan <zi.yan@cs.rutgers.edu>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> ---
>   mm/huge_memory.c | 2 +-
>   mm/migrate.c     | 3 +++
>   2 files changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 533f9b00147d..00704060b7f7 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2931,7 +2931,7 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
>   	else
>   		page_add_file_rmap(new, true);
>   	set_pmd_at(mm, mmun_start, pvmw->pmd, pmde);
> -	if (vma->vm_flags & VM_LOCKED)
> +	if ((vma->vm_flags & VM_LOCKED) && !PageDoubleMap(new))
>   		mlock_vma_page(new);
>   	update_mmu_cache_pmd(vma, address, pvmw->pmd);
>   }
> diff --git a/mm/migrate.c b/mm/migrate.c
> index d6a2e89b086a..01dad96b25b5 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -275,6 +275,9 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
>   		if (vma->vm_flags & VM_LOCKED && !PageTransCompound(new))
>   			mlock_vma_page(new);
>   
> +		if (PageTransCompound(new) && PageMlocked(page))
> +			clear_page_mlock(page);
> +

Can you explain this more? I am confused by the usage of 'new' and 
'page' there. I guess the idea is if we are removing the migration pte 
at level 4 table, and if we found the backing page compound don't mark 
the page Mlocked?


-aneesh
Kirill A. Shutemov Sept. 12, 2018, 10:24 a.m. UTC | #4
On Wed, Sep 12, 2018 at 03:28:24PM +0530, Aneesh Kumar K.V wrote:
> On 9/11/18 4:04 PM, Kirill A. Shutemov wrote:
> > A transparent huge page is represented by a single entry on an LRU list.
> > Therefore, we can only make unevictable an entire compound page, not
> > individual subpages.
> > 
> > If a user tries to mlock() part of a huge page, we want the rest of the
> > page to be reclaimable.
> > 
> > We handle this by keeping PTE-mapped huge pages on normal LRU lists: the
> > PMD on border of VM_LOCKED VMA will be split into PTE table.
> > 
> > Introduction of THP migration breaks the rules around mlocking THP
> > pages. If we had a single PMD mapping of the page in mlocked VMA, the
> > page will get mlocked, regardless of PTE mappings of the page.
> > 
> > For tmpfs/shmem it's easy to fix by checking PageDoubleMap() in
> > remove_migration_pmd().
> > 
> > Anon THP pages can only be shared between processes via fork(). Mlocked
> > page can only be shared if parent mlocked it before forking, otherwise
> > CoW will be triggered on mlock().
> > 
> > For Anon-THP, we can fix the issue by munlocking the page on removing PTE
> > migration entry for the page. PTEs for the page will always come after
> > mlocked PMD: rmap walks VMAs from oldest to newest.
> > 
> > Test-case:
> > 
> > 	#include <unistd.h>
> > 	#include <sys/mman.h>
> > 	#include <sys/wait.h>
> > 	#include <linux/mempolicy.h>
> > 	#include <numaif.h>
> > 
> > 	int main(void)
> > 	{
> > 	        unsigned long nodemask = 4;
> > 	        void *addr;
> > 
> > 		addr = mmap((void *)0x20000000UL, 2UL << 20, PROT_READ | PROT_WRITE,
> > 			MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);
> > 
> > 	        if (fork()) {
> > 			wait(NULL);
> > 			return 0;
> > 	        }
> > 
> > 	        mlock(addr, 4UL << 10);
> > 	        mbind(addr, 2UL << 20, MPOL_PREFERRED | MPOL_F_RELATIVE_NODES,
> > 	                &nodemask, 4, MPOL_MF_MOVE | MPOL_MF_MOVE_ALL);
> > 
> > 	        return 0;
> > 	}
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
> > Fixes: 616b8371539a ("mm: thp: enable thp migration in generic path")
> > Cc: <stable@vger.kernel.org> [v4.14+]
> > Cc: Zi Yan <zi.yan@cs.rutgers.edu>
> > Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > ---
> >   mm/huge_memory.c | 2 +-
> >   mm/migrate.c     | 3 +++
> >   2 files changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 533f9b00147d..00704060b7f7 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2931,7 +2931,7 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
> >   	else
> >   		page_add_file_rmap(new, true);
> >   	set_pmd_at(mm, mmun_start, pvmw->pmd, pmde);
> > -	if (vma->vm_flags & VM_LOCKED)
> > +	if ((vma->vm_flags & VM_LOCKED) && !PageDoubleMap(new))
> >   		mlock_vma_page(new);
> >   	update_mmu_cache_pmd(vma, address, pvmw->pmd);
> >   }
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index d6a2e89b086a..01dad96b25b5 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -275,6 +275,9 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
> >   		if (vma->vm_flags & VM_LOCKED && !PageTransCompound(new))
> >   			mlock_vma_page(new);
> > +		if (PageTransCompound(new) && PageMlocked(page))
> > +			clear_page_mlock(page);
> > +
> 
> Can you explain this more? I am confused by the usage of 'new' and 'page'
> there.

'new' is the PTE subpage of 'page'. clear_page_mlock() wants to see head
page.

I guess we can rewrite it more clearly:

+		if (PageTransHuge(page) && PageMlocked(page))
+			clear_page_mlock(page);
+

> I guess the idea is if we are removing the migration pte at level 4
> table, and if we found the backing page compound don't mark the page
> Mlocked?

We has to clear mlock, not only don't mark it as such.
remove_migration_pmd() for other VMA may mark it as mlock and we want to
revert it on the first PTE mapping of the page.
Vegard Nossum Sept. 12, 2018, 8:10 p.m. UTC | #5
On Tue, 11 Sep 2018 at 12:34, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> A transparent huge page is represented by a single entry on an LRU list.
> Therefore, we can only make unevictable an entire compound page, not
> individual subpages.
>
> If a user tries to mlock() part of a huge page, we want the rest of the
> page to be reclaimable.
>
> We handle this by keeping PTE-mapped huge pages on normal LRU lists: the
> PMD on border of VM_LOCKED VMA will be split into PTE table.
>
> Introduction of THP migration breaks the rules around mlocking THP
> pages. If we had a single PMD mapping of the page in mlocked VMA, the
> page will get mlocked, regardless of PTE mappings of the page.
>
> For tmpfs/shmem it's easy to fix by checking PageDoubleMap() in
> remove_migration_pmd().
>
> Anon THP pages can only be shared between processes via fork(). Mlocked
> page can only be shared if parent mlocked it before forking, otherwise
> CoW will be triggered on mlock().
>
> For Anon-THP, we can fix the issue by munlocking the page on removing PTE
> migration entry for the page. PTEs for the page will always come after
> mlocked PMD: rmap walks VMAs from oldest to newest.
>
> Test-case:
>
>         #include <unistd.h>
>         #include <sys/mman.h>
>         #include <sys/wait.h>
>         #include <linux/mempolicy.h>
>         #include <numaif.h>
>
>         int main(void)
>         {
>                 unsigned long nodemask = 4;
>                 void *addr;
>
>                 addr = mmap((void *)0x20000000UL, 2UL << 20, PROT_READ | PROT_WRITE,
>                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);
>
>                 if (fork()) {
>                         wait(NULL);
>                         return 0;
>                 }
>
>                 mlock(addr, 4UL << 10);
>                 mbind(addr, 2UL << 20, MPOL_PREFERRED | MPOL_F_RELATIVE_NODES,
>                         &nodemask, 4, MPOL_MF_MOVE | MPOL_MF_MOVE_ALL);

MPOL_MF_MOVE_ALL is actually not required to trigger the bug.

>
>                 return 0;
>         }
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reported-by: Vegard Nossum <vegard.nossum@gmail.com>

Would you mind putting vegard.nossum@oracle.com instead?

> Fixes: 616b8371539a ("mm: thp: enable thp migration in generic path")

The commit I bisected the problem to was actually a different one:

commit c8633798497ce894c22ab083eb884c8294c537b2
Author: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Date:   Fri Sep 8 16:11:08 2017 -0700

    mm: mempolicy: mbind and migrate_pages support thp migration

But maybe you had a good reason to choose the other one instead. They
are close together in any case, so I guess it would be hard to find a
kernel with one commit and not the other.

> Cc: <stable@vger.kernel.org> [v4.14+]
> Cc: Zi Yan <zi.yan@cs.rutgers.edu>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Andrea Arcangeli <aarcange@redhat.com>

You could also add:

Link: https://lkml.org/lkml/2018/8/30/464

Thanks for debugging this.


Vegard
diff mbox series

Patch

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 533f9b00147d..00704060b7f7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2931,7 +2931,7 @@  void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 	else
 		page_add_file_rmap(new, true);
 	set_pmd_at(mm, mmun_start, pvmw->pmd, pmde);
-	if (vma->vm_flags & VM_LOCKED)
+	if ((vma->vm_flags & VM_LOCKED) && !PageDoubleMap(new))
 		mlock_vma_page(new);
 	update_mmu_cache_pmd(vma, address, pvmw->pmd);
 }
diff --git a/mm/migrate.c b/mm/migrate.c
index d6a2e89b086a..01dad96b25b5 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -275,6 +275,9 @@  static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 		if (vma->vm_flags & VM_LOCKED && !PageTransCompound(new))
 			mlock_vma_page(new);
 
+		if (PageTransCompound(new) && PageMlocked(page))
+			clear_page_mlock(page);
+
 		/* No need to invalidate - it was non-present before */
 		update_mmu_cache(vma, pvmw.address, pvmw.pte);
 	}