diff mbox

mm: pmd dirty emulation in page fault handler

Message ID 20161222145203.GA18970@bbox (mailing list archive)
State New, archived
Headers show

Commit Message

Minchan Kim Dec. 22, 2016, 2:52 p.m. UTC
Hello,

On Thu, Dec 22, 2016 at 11:17:13AM +0300, Kirill A. Shutemov wrote:

< snip >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 36c774f..7408ddc 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3637,18 +3637,20 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> >  			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
> >  				return do_huge_pmd_numa_page(&vmf, orig_pmd);
> >  
> > -			if ((vmf.flags & FAULT_FLAG_WRITE) &&
> > -					!pmd_write(orig_pmd)) {
> > -				ret = wp_huge_pmd(&vmf, orig_pmd);
> > -				if (!(ret & VM_FAULT_FALLBACK))
> > +			if (vmf.flags & FAULT_FLAG_WRITE) {
> > +				if (!pmd_write(orig_pmd)) {
> > +					ret = wp_huge_pmd(&vmf, orig_pmd);
> > +					if (ret == VM_FAULT_FALLBACK)
> 
> In theory, more than one flag can be set and it would lead to
> false-negative. Bit check was the right thing.
> 
> And I don't understand why do you need to change code in
> __handle_mm_fault() at all.
> From what I see change to huge_pmd_set_accessed() should be enough.

Yeb. Thanks for the review. Here v2 goes.

From b3ec95c0df91ad113525968a4a6b53030fd0b48d Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Thu, 22 Dec 2016 23:43:49 +0900
Subject: [PATCH v2] mm: pmd dirty emulation in page fault handler

Andreas reported [1] made a test in jemalloc hang in THP mode in arm64.
http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de

The problem is page fault handler supports only accessed flag emulation
for THP page of SW-dirty/accessed architecture.

This patch enables dirty-bit emulation for those architectures.
Without it, MADV_FREE makes application hang by repeated fault forever.

[1] b8d3c4c3009d, mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called

Cc: Jason Evans <je@fb.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: <stable@vger.kernel.org> [4.5+]
Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called")
Reported-by: Andreas Schwab <schwab@suse.de>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
* from v1
  * Remove __handle_mm_fault part - Kirill

 mm/huge_memory.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Comments

Kirill A . Shutemov Dec. 22, 2016, 6:35 p.m. UTC | #1
On Thu, Dec 22, 2016 at 11:52:03PM +0900, Minchan Kim wrote:
> Hello,
> 
> On Thu, Dec 22, 2016 at 11:17:13AM +0300, Kirill A. Shutemov wrote:
> 
> < snip >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 36c774f..7408ddc 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -3637,18 +3637,20 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> > >  			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
> > >  				return do_huge_pmd_numa_page(&vmf, orig_pmd);
> > >  
> > > -			if ((vmf.flags & FAULT_FLAG_WRITE) &&
> > > -					!pmd_write(orig_pmd)) {
> > > -				ret = wp_huge_pmd(&vmf, orig_pmd);
> > > -				if (!(ret & VM_FAULT_FALLBACK))
> > > +			if (vmf.flags & FAULT_FLAG_WRITE) {
> > > +				if (!pmd_write(orig_pmd)) {
> > > +					ret = wp_huge_pmd(&vmf, orig_pmd);
> > > +					if (ret == VM_FAULT_FALLBACK)
> > 
> > In theory, more than one flag can be set and it would lead to
> > false-negative. Bit check was the right thing.
> > 
> > And I don't understand why do you need to change code in
> > __handle_mm_fault() at all.
> > From what I see change to huge_pmd_set_accessed() should be enough.
> 
> Yeb. Thanks for the review. Here v2 goes.
> 
> From b3ec95c0df91ad113525968a4a6b53030fd0b48d Mon Sep 17 00:00:00 2001
> From: Minchan Kim <minchan@kernel.org>
> Date: Thu, 22 Dec 2016 23:43:49 +0900
> Subject: [PATCH v2] mm: pmd dirty emulation in page fault handler
> 
> Andreas reported [1] made a test in jemalloc hang in THP mode in arm64.
> http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de
> 
> The problem is page fault handler supports only accessed flag emulation
> for THP page of SW-dirty/accessed architecture.
> 
> This patch enables dirty-bit emulation for those architectures.
> Without it, MADV_FREE makes application hang by repeated fault forever.
> 
> [1] b8d3c4c3009d, mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called
> 
> Cc: Jason Evans <je@fb.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: linux-arch@vger.kernel.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: <stable@vger.kernel.org> [4.5+]
> Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called")
> Reported-by: Andreas Schwab <schwab@suse.de>
> Signed-off-by: Minchan Kim <minchan@kernel.org>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Andreas Schwab Dec. 22, 2016, 10:12 p.m. UTC | #2
On Dez 22 2016, Minchan Kim <minchan@kernel.org> wrote:

> From b3ec95c0df91ad113525968a4a6b53030fd0b48d Mon Sep 17 00:00:00 2001
> From: Minchan Kim <minchan@kernel.org>
> Date: Thu, 22 Dec 2016 23:43:49 +0900
> Subject: [PATCH v2] mm: pmd dirty emulation in page fault handler
>
> Andreas reported [1] made a test in jemalloc hang in THP mode in arm64.
> http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de
>
> The problem is page fault handler supports only accessed flag emulation
> for THP page of SW-dirty/accessed architecture.
>
> This patch enables dirty-bit emulation for those architectures.
> Without it, MADV_FREE makes application hang by repeated fault forever.
>
> [1] b8d3c4c3009d, mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called

Successfully tested a backport to 4.9.

Andreas.
Michal Hocko Dec. 23, 2016, 9:17 a.m. UTC | #3
On Thu 22-12-16 23:52:03, Minchan Kim wrote:
[...]
> >From b3ec95c0df91ad113525968a4a6b53030fd0b48d Mon Sep 17 00:00:00 2001
> From: Minchan Kim <minchan@kernel.org>
> Date: Thu, 22 Dec 2016 23:43:49 +0900
> Subject: [PATCH v2] mm: pmd dirty emulation in page fault handler
> 
> Andreas reported [1] made a test in jemalloc hang in THP mode in arm64.
> http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de
> 
> The problem is page fault handler supports only accessed flag emulation
> for THP page of SW-dirty/accessed architecture.
> 
> This patch enables dirty-bit emulation for those architectures.
> Without it, MADV_FREE makes application hang by repeated fault forever.

The changelog is rather terse and considering the issue is rather subtle
and it aims the stable tree I think it could see more information. How
do we end up looping in the page fault and why the dirty pmd stops it.
Could you update the changelog to be more verbose, please? I am still
digesting this patch but I believe it is correct fwiw...

Thanks!

> [1] b8d3c4c3009d, mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called
> 
> Cc: Jason Evans <je@fb.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: linux-arch@vger.kernel.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: <stable@vger.kernel.org> [4.5+]
> Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called")
> Reported-by: Andreas Schwab <schwab@suse.de>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
> * from v1
>   * Remove __handle_mm_fault part - Kirill
> 
>  mm/huge_memory.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 10eedbf..29ec8a4 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -883,15 +883,17 @@ void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd)
>  {
>  	pmd_t entry;
>  	unsigned long haddr;
> +	bool write = vmf->flags & FAULT_FLAG_WRITE;
>  
>  	vmf->ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
>  	if (unlikely(!pmd_same(*vmf->pmd, orig_pmd)))
>  		goto unlock;
>  
>  	entry = pmd_mkyoung(orig_pmd);
> +	if (write)
> +		entry = pmd_mkdirty(entry);
>  	haddr = vmf->address & HPAGE_PMD_MASK;
> -	if (pmdp_set_access_flags(vmf->vma, haddr, vmf->pmd, entry,
> -				vmf->flags & FAULT_FLAG_WRITE))
> +	if (pmdp_set_access_flags(vmf->vma, haddr, vmf->pmd, entry, write))
>  		update_mmu_cache_pmd(vmf->vma, vmf->address, vmf->pmd);
>  
>  unlock:
> -- 
> 2.7.4
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
Minchan Kim Dec. 23, 2016, 9:53 a.m. UTC | #4
Hi,

On Fri, Dec 23, 2016 at 10:17:25AM +0100, Michal Hocko wrote:
> On Thu 22-12-16 23:52:03, Minchan Kim wrote:
> [...]
> > >From b3ec95c0df91ad113525968a4a6b53030fd0b48d Mon Sep 17 00:00:00 2001
> > From: Minchan Kim <minchan@kernel.org>
> > Date: Thu, 22 Dec 2016 23:43:49 +0900
> > Subject: [PATCH v2] mm: pmd dirty emulation in page fault handler
> > 
> > Andreas reported [1] made a test in jemalloc hang in THP mode in arm64.
> > http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de
> > 
> > The problem is page fault handler supports only accessed flag emulation
> > for THP page of SW-dirty/accessed architecture.
> > 
> > This patch enables dirty-bit emulation for those architectures.
> > Without it, MADV_FREE makes application hang by repeated fault forever.
> 
> The changelog is rather terse and considering the issue is rather subtle
> and it aims the stable tree I think it could see more information. How
> do we end up looping in the page fault and why the dirty pmd stops it.
> Could you update the changelog to be more verbose, please? I am still
> digesting this patch but I believe it is correct fwiw...
> 

How about this? Feel free to suggest better wording.

Andreas reported [1] made a test in jemalloc hang in THP mode in arm64.
http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de

The problem is currently page fault handler doesn't supports dirty bit
emulation of pte for non-HW dirty-bit architecture so that application
stucks until VM marked the pmd dirty.

How the emulation work depends on the architecture. In case of arm64,
when it set up pte firstly, it sets pte PTE_RDONLY to get a chance to
mark the pte dirty via triggering page fault when store access happens.
Once the page fault occurs, VM marks the pte dirty and arch code for
setting pte will clear PTE_RDONLY for application to proceed.

IOW, if VM doesn't mark the pte dirty, application hangs forever by
repeated fault(i.e., store op but the pte is PTE_RDONLY).

This patch enables dirty-bit emulation for those architectures.
Michal Hocko Dec. 23, 2016, 11:54 a.m. UTC | #5
On Fri 23-12-16 18:53:36, Minchan Kim wrote:
> Hi,
> 
> On Fri, Dec 23, 2016 at 10:17:25AM +0100, Michal Hocko wrote:
> > On Thu 22-12-16 23:52:03, Minchan Kim wrote:
> > [...]
> > > >From b3ec95c0df91ad113525968a4a6b53030fd0b48d Mon Sep 17 00:00:00 2001
> > > From: Minchan Kim <minchan@kernel.org>
> > > Date: Thu, 22 Dec 2016 23:43:49 +0900
> > > Subject: [PATCH v2] mm: pmd dirty emulation in page fault handler
> > > 
> > > Andreas reported [1] made a test in jemalloc hang in THP mode in arm64.
> > > http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de
> > > 
> > > The problem is page fault handler supports only accessed flag emulation
> > > for THP page of SW-dirty/accessed architecture.
> > > 
> > > This patch enables dirty-bit emulation for those architectures.
> > > Without it, MADV_FREE makes application hang by repeated fault forever.
> > 
> > The changelog is rather terse and considering the issue is rather subtle
> > and it aims the stable tree I think it could see more information. How
> > do we end up looping in the page fault and why the dirty pmd stops it.
> > Could you update the changelog to be more verbose, please? I am still
> > digesting this patch but I believe it is correct fwiw...
> > 
> 
> How about this? Feel free to suggest better wording.
> 
> Andreas reported [1] made a test in jemalloc hang in THP mode in arm64.
> http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de
> 
> The problem is currently page fault handler doesn't supports dirty bit
> emulation of pte for non-HW dirty-bit architecture so that application

s@pte@pmd@ ?

> stucks until VM marked the pmd dirty.
> 
> How the emulation work depends on the architecture. In case of arm64,
> when it set up pte firstly, it sets pte PTE_RDONLY to get a chance to
> mark the pte dirty via triggering page fault when store access happens.
> Once the page fault occurs, VM marks the pte dirty and arch code for
> setting pte will clear PTE_RDONLY for application to proceed.
> 
> IOW, if VM doesn't mark the pte dirty, application hangs forever by
> repeated fault(i.e., store op but the pte is PTE_RDONLY).
> 
> This patch enables dirty-bit emulation for those architectures.

Yes this is helpful and much more clear, thank you. One thing that is
still not clear to me is why cannot we handle that in the arch specific
code. I mean what is the side effect of doing pmd_mkdirty for
architectures which do not need it?
Minchan Kim Dec. 23, 2016, 2:01 p.m. UTC | #6
On Fri, Dec 23, 2016 at 12:54:21PM +0100, Michal Hocko wrote:
> On Fri 23-12-16 18:53:36, Minchan Kim wrote:
> > Hi,
> > 
> > On Fri, Dec 23, 2016 at 10:17:25AM +0100, Michal Hocko wrote:
> > > On Thu 22-12-16 23:52:03, Minchan Kim wrote:
> > > [...]
> > > > >From b3ec95c0df91ad113525968a4a6b53030fd0b48d Mon Sep 17 00:00:00 2001
> > > > From: Minchan Kim <minchan@kernel.org>
> > > > Date: Thu, 22 Dec 2016 23:43:49 +0900
> > > > Subject: [PATCH v2] mm: pmd dirty emulation in page fault handler
> > > > 
> > > > Andreas reported [1] made a test in jemalloc hang in THP mode in arm64.
> > > > http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de
> > > > 
> > > > The problem is page fault handler supports only accessed flag emulation
> > > > for THP page of SW-dirty/accessed architecture.
> > > > 
> > > > This patch enables dirty-bit emulation for those architectures.
> > > > Without it, MADV_FREE makes application hang by repeated fault forever.
> > > 
> > > The changelog is rather terse and considering the issue is rather subtle
> > > and it aims the stable tree I think it could see more information. How
> > > do we end up looping in the page fault and why the dirty pmd stops it.
> > > Could you update the changelog to be more verbose, please? I am still
> > > digesting this patch but I believe it is correct fwiw...
> > > 
> > 
> > How about this? Feel free to suggest better wording.
> > 
> > Andreas reported [1] made a test in jemalloc hang in THP mode in arm64.
> > http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de
> > 
> > The problem is currently page fault handler doesn't supports dirty bit
> > emulation of pte for non-HW dirty-bit architecture so that application
> 
> s@pte@pmd@ ?

It would be more clear. Will update with it.

> 
> > stucks until VM marked the pmd dirty.
> > 
> > How the emulation work depends on the architecture. In case of arm64,
> > when it set up pte firstly, it sets pte PTE_RDONLY to get a chance to
> > mark the pte dirty via triggering page fault when store access happens.
> > Once the page fault occurs, VM marks the pte dirty and arch code for
> > setting pte will clear PTE_RDONLY for application to proceed.
> > 
> > IOW, if VM doesn't mark the pte dirty, application hangs forever by
> > repeated fault(i.e., store op but the pte is PTE_RDONLY).
> > 
> > This patch enables dirty-bit emulation for those architectures.
> 
> Yes this is helpful and much more clear, thank you. One thing that is
> still not clear to me is why cannot we handle that in the arch specific
> code. I mean what is the side effect of doing pmd_mkdirty for
> architectures which do not need it?

For architecture which supports H/W access/dirty bit, it couldn't be
reached there code path so there is no side effect, I think. A thing
I can think of is just increasing code size little bit. Maybe, we
could optimize away some ifdef magic but not sure worth it. We have
been same way pte(not pmd) emulation handling for several decacdes.
Anyway, it should be off-topic, I think.

Thanks.

> 
> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko Dec. 23, 2016, 2:53 p.m. UTC | #7
On Fri 23-12-16 23:01:31, Minchan Kim wrote:
> On Fri, Dec 23, 2016 at 12:54:21PM +0100, Michal Hocko wrote:
> > On Fri 23-12-16 18:53:36, Minchan Kim wrote:
[...]
> > > stucks until VM marked the pmd dirty.
> > > 
> > > How the emulation work depends on the architecture. In case of arm64,
> > > when it set up pte firstly, it sets pte PTE_RDONLY to get a chance to
> > > mark the pte dirty via triggering page fault when store access happens.
> > > Once the page fault occurs, VM marks the pte dirty and arch code for
> > > setting pte will clear PTE_RDONLY for application to proceed.
> > > 
> > > IOW, if VM doesn't mark the pte dirty, application hangs forever by
> > > repeated fault(i.e., store op but the pte is PTE_RDONLY).
> > > 
> > > This patch enables dirty-bit emulation for those architectures.
> > 
> > Yes this is helpful and much more clear, thank you. One thing that is
> > still not clear to me is why cannot we handle that in the arch specific
> > code. I mean what is the side effect of doing pmd_mkdirty for
> > architectures which do not need it?
> 
> For architecture which supports H/W access/dirty bit, it couldn't be
> reached there code path so there is no side effect, I think.

ahh, I knew I was missing something. It definitely wasn't obvious to me
and my x86 config it simply generates code to call
huge_pmd_set_accessed.

> A thing
> I can think of is just increasing code size little bit. Maybe, we
> could optimize away some ifdef magic but not sure worth it.

it is not
diff mbox

Patch

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 10eedbf..29ec8a4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -883,15 +883,17 @@  void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd)
 {
 	pmd_t entry;
 	unsigned long haddr;
+	bool write = vmf->flags & FAULT_FLAG_WRITE;
 
 	vmf->ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
 	if (unlikely(!pmd_same(*vmf->pmd, orig_pmd)))
 		goto unlock;
 
 	entry = pmd_mkyoung(orig_pmd);
+	if (write)
+		entry = pmd_mkdirty(entry);
 	haddr = vmf->address & HPAGE_PMD_MASK;
-	if (pmdp_set_access_flags(vmf->vma, haddr, vmf->pmd, entry,
-				vmf->flags & FAULT_FLAG_WRITE))
+	if (pmdp_set_access_flags(vmf->vma, haddr, vmf->pmd, entry, write))
 		update_mmu_cache_pmd(vmf->vma, vmf->address, vmf->pmd);
 
 unlock: