diff mbox

mm: Avoid returning VM_FAULT_RETRY from ->page_mkwrite handlers

Message ID 20170203150729.15863-1-jack@suse.cz (mailing list archive)
State New, archived
Headers show

Commit Message

Jan Kara Feb. 3, 2017, 3:07 p.m. UTC
Some ->page_mkwrite handlers may return VM_FAULT_RETRY as its return
code (GFS2 or Lustre can definitely do this). However VM_FAULT_RETRY
from ->page_mkwrite is completely unhandled by the mm code and results
in locking and writeably mapping the page which definitely is not what
the caller wanted. Fix Lustre and block_page_mkwrite_ret() used by other
filesystems (notably GFS2) to return VM_FAULT_NOPAGE instead which
results in bailing out from the fault code, the CPU then retries the
access, and we fault again effectively doing what the handler wanted.

CC: lustre-devel@lists.lustre.org
CC: cluster-devel@redhat.com
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 drivers/staging/lustre/lustre/llite/llite_mmap.c | 4 +---
 include/linux/buffer_head.h                      | 4 +---
 2 files changed, 2 insertions(+), 6 deletions(-)

Comments

Matthew Wilcox (Oracle) Feb. 3, 2017, 3:13 p.m. UTC | #1
On Fri, Feb 03, 2017 at 04:07:29PM +0100, Jan Kara wrote:
> Some ->page_mkwrite handlers may return VM_FAULT_RETRY as its return
> code (GFS2 or Lustre can definitely do this). However VM_FAULT_RETRY
> from ->page_mkwrite is completely unhandled by the mm code and results
> in locking and writeably mapping the page which definitely is not what
> the caller wanted. Fix Lustre and block_page_mkwrite_ret() used by other
> filesystems (notably GFS2) to return VM_FAULT_NOPAGE instead which
> results in bailing out from the fault code, the CPU then retries the
> access, and we fault again effectively doing what the handler wanted.

Reading this commit message makes me wonder if this is the best fix.
It would seem logical that if I want the fault to be retried that I should
return VM_FAULT_RETRY, not VM_FAULT_NOPAGE.  Why don't we have the MM
treat VM_FAULT_RETRY the same way that it treats VM_FAULT_NOPAGE and give
driver / filesystem writers one fewer way to shoot themselves in the foot?
Jan Kara Feb. 3, 2017, 3:46 p.m. UTC | #2
On Fri 03-02-17 07:13:59, Matthew Wilcox wrote:
> On Fri, Feb 03, 2017 at 04:07:29PM +0100, Jan Kara wrote:
> > Some ->page_mkwrite handlers may return VM_FAULT_RETRY as its return
> > code (GFS2 or Lustre can definitely do this). However VM_FAULT_RETRY
> > from ->page_mkwrite is completely unhandled by the mm code and results
> > in locking and writeably mapping the page which definitely is not what
> > the caller wanted. Fix Lustre and block_page_mkwrite_ret() used by other
> > filesystems (notably GFS2) to return VM_FAULT_NOPAGE instead which
> > results in bailing out from the fault code, the CPU then retries the
> > access, and we fault again effectively doing what the handler wanted.
> 
> Reading this commit message makes me wonder if this is the best fix.
> It would seem logical that if I want the fault to be retried that I should
> return VM_FAULT_RETRY, not VM_FAULT_NOPAGE.  Why don't we have the MM
> treat VM_FAULT_RETRY the same way that it treats VM_FAULT_NOPAGE and give
> driver / filesystem writers one fewer way to shoot themselves in the foot?

VM_FAULT_RETRY is special, it may be used only if FAULT_FLAG_ALLOW_RETRY
was set in page fault flags and it means - we have dropped mmap_sem, we
loaded page needed to satisfy the fault and now we need to try again (have
a look at __lock_page_or_retry()). I have my reservations about this
interface but it works...

								Honza
Matthew Wilcox (Oracle) Feb. 3, 2017, 3:53 p.m. UTC | #3
On Fri, Feb 03, 2017 at 04:46:40PM +0100, Jan Kara wrote:
> On Fri 03-02-17 07:13:59, Matthew Wilcox wrote:
> > On Fri, Feb 03, 2017 at 04:07:29PM +0100, Jan Kara wrote:
> > > Some ->page_mkwrite handlers may return VM_FAULT_RETRY as its return
> > > code (GFS2 or Lustre can definitely do this). However VM_FAULT_RETRY
> > > from ->page_mkwrite is completely unhandled by the mm code and results
> > > in locking and writeably mapping the page which definitely is not what
> > > the caller wanted. Fix Lustre and block_page_mkwrite_ret() used by other
> > > filesystems (notably GFS2) to return VM_FAULT_NOPAGE instead which
> > > results in bailing out from the fault code, the CPU then retries the
> > > access, and we fault again effectively doing what the handler wanted.
> > 
> > Reading this commit message makes me wonder if this is the best fix.
> > It would seem logical that if I want the fault to be retried that I should
> > return VM_FAULT_RETRY, not VM_FAULT_NOPAGE.  Why don't we have the MM
> > treat VM_FAULT_RETRY the same way that it treats VM_FAULT_NOPAGE and give
> > driver / filesystem writers one fewer way to shoot themselves in the foot?
> 
> VM_FAULT_RETRY is special, it may be used only if FAULT_FLAG_ALLOW_RETRY
> was set in page fault flags and it means - we have dropped mmap_sem, we
> loaded page needed to satisfy the fault and now we need to try again (have
> a look at __lock_page_or_retry()). I have my reservations about this
> interface but it works...

Oh, I understand what it's *supposed* to be used for ;-)  It's just
a bit of an attractive nuisance.  Maybe renaming it to something like
VM_FAULT_PAGE_RETRY would stop people from thinking that it meant "retry
the fault".  And we could #define VM_FAULT_RETRY VM_FAULT_NOPAGE so that
people who want to retry the fault in a normal way could use a return
value that sounds like it does what they want instead of a return value
that is supposed to be used to indicate that we put a PFN into the
page table?
Jan Kara Feb. 3, 2017, 4:10 p.m. UTC | #4
On Fri 03-02-17 07:53:26, Matthew Wilcox wrote:
> On Fri, Feb 03, 2017 at 04:46:40PM +0100, Jan Kara wrote:
> > On Fri 03-02-17 07:13:59, Matthew Wilcox wrote:
> > > On Fri, Feb 03, 2017 at 04:07:29PM +0100, Jan Kara wrote:
> > > > Some ->page_mkwrite handlers may return VM_FAULT_RETRY as its return
> > > > code (GFS2 or Lustre can definitely do this). However VM_FAULT_RETRY
> > > > from ->page_mkwrite is completely unhandled by the mm code and results
> > > > in locking and writeably mapping the page which definitely is not what
> > > > the caller wanted. Fix Lustre and block_page_mkwrite_ret() used by other
> > > > filesystems (notably GFS2) to return VM_FAULT_NOPAGE instead which
> > > > results in bailing out from the fault code, the CPU then retries the
> > > > access, and we fault again effectively doing what the handler wanted.
> > > 
> > > Reading this commit message makes me wonder if this is the best fix.
> > > It would seem logical that if I want the fault to be retried that I should
> > > return VM_FAULT_RETRY, not VM_FAULT_NOPAGE.  Why don't we have the MM
> > > treat VM_FAULT_RETRY the same way that it treats VM_FAULT_NOPAGE and give
> > > driver / filesystem writers one fewer way to shoot themselves in the foot?
> > 
> > VM_FAULT_RETRY is special, it may be used only if FAULT_FLAG_ALLOW_RETRY
> > was set in page fault flags and it means - we have dropped mmap_sem, we
> > loaded page needed to satisfy the fault and now we need to try again (have
> > a look at __lock_page_or_retry()). I have my reservations about this
> > interface but it works...
> 
> Oh, I understand what it's *supposed* to be used for ;-)  It's just
> a bit of an attractive nuisance.  Maybe renaming it to something like
> VM_FAULT_PAGE_RETRY would stop people from thinking that it meant "retry
> the fault".  And we could #define VM_FAULT_RETRY VM_FAULT_NOPAGE so that
> people who want to retry the fault in a normal way could use a return
> value that sounds like it does what they want instead of a return value
> that is supposed to be used to indicate that we put a PFN into the
> page table?

So a better name for VM_FAULT_RETRY and disentangling VM_FAULT_NOPAGE so
that it is not misused for retrying the fault would be nice. But it is a
much larger endeavor (I actually had a look into this some time ago) than
this simple bugfix...

								Honza
Andrew Morton Feb. 3, 2017, 11:20 p.m. UTC | #5
On Fri,  3 Feb 2017 16:07:29 +0100 Jan Kara <jack@suse.cz> wrote:

> Some ->page_mkwrite handlers may return VM_FAULT_RETRY as its return
> code (GFS2 or Lustre can definitely do this). However VM_FAULT_RETRY
> from ->page_mkwrite is completely unhandled by the mm code and results
> in locking and writeably mapping the page which definitely is not what
> the caller wanted. Fix Lustre and block_page_mkwrite_ret() used by other
> filesystems (notably GFS2) to return VM_FAULT_NOPAGE instead which
> results in bailing out from the fault code, the CPU then retries the
> access, and we fault again effectively doing what the handler wanted.

I'm not getting any sense of the urgency of this fix.  The bug *sounds*
bad?  Which kernel versions need fixing?
Xiong, Jinshan Feb. 3, 2017, 11:44 p.m. UTC | #6
Hi Jan,

Thanks for the patch. 

The proposed patch should be able to fix the problem, however, do you think it would be a better approach by revising it as:

…
case -EAGAIN:
	if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
		up_read(&mm->mmap_sem);
		return VM_FAULT_RETRY;
	}
	return VM_FAULT_NOPAGE;
…

This way it can retry fault routine in mm instead of letting CPU have a new fault access.

Thanks,
Jinshan

> On Feb 3, 2017, at 7:07 AM, Jan Kara <jack@suse.cz> wrote:

> 

> Some ->page_mkwrite handlers may return VM_FAULT_RETRY as its return

> code (GFS2 or Lustre can definitely do this). However VM_FAULT_RETRY

> from ->page_mkwrite is completely unhandled by the mm code and results

> in locking and writeably mapping the page which definitely is not what

> the caller wanted. Fix Lustre and block_page_mkwrite_ret() used by other

> filesystems (notably GFS2) to return VM_FAULT_NOPAGE instead which

> results in bailing out from the fault code, the CPU then retries the

> access, and we fault again effectively doing what the handler wanted.

> 

> CC: lustre-devel@lists.lustre.org

> CC: cluster-devel@redhat.com

> Reported-by: Al Viro <viro@ZenIV.linux.org.uk>

> Signed-off-by: Jan Kara <jack@suse.cz>

> ---

> drivers/staging/lustre/lustre/llite/llite_mmap.c | 4 +---

> include/linux/buffer_head.h                      | 4 +---

> 2 files changed, 2 insertions(+), 6 deletions(-)

> 

> diff --git a/drivers/staging/lustre/lustre/llite/llite_mmap.c b/drivers/staging/lustre/lustre/llite/llite_mmap.c

> index ee01f20d8b11..9afa6bec3e6f 100644

> --- a/drivers/staging/lustre/lustre/llite/llite_mmap.c

> +++ b/drivers/staging/lustre/lustre/llite/llite_mmap.c

> @@ -390,15 +390,13 @@ static int ll_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)

> 		result = VM_FAULT_LOCKED;

> 		break;

> 	case -ENODATA:

> +	case -EAGAIN:

> 	case -EFAULT:

> 		result = VM_FAULT_NOPAGE;

> 		break;

> 	case -ENOMEM:

> 		result = VM_FAULT_OOM;

> 		break;

> -	case -EAGAIN:

> -		result = VM_FAULT_RETRY;

> -		break;

> 	default:

> 		result = VM_FAULT_SIGBUS;

> 		break;

> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h

> index d67ab83823ad..79591c3660cc 100644

> --- a/include/linux/buffer_head.h

> +++ b/include/linux/buffer_head.h

> @@ -243,12 +243,10 @@ static inline int block_page_mkwrite_return(int err)

> {

> 	if (err == 0)

> 		return VM_FAULT_LOCKED;

> -	if (err == -EFAULT)

> +	if (err == -EFAULT || err == -EAGAIN)

> 		return VM_FAULT_NOPAGE;

> 	if (err == -ENOMEM)

> 		return VM_FAULT_OOM;

> -	if (err == -EAGAIN)

> -		return VM_FAULT_RETRY;

> 	/* -ENOSPC, -EDQUOT, -EIO ... */

> 	return VM_FAULT_SIGBUS;

> }

> -- 

> 2.10.2

> 

> _______________________________________________

> lustre-devel mailing list

> lustre-devel@lists.lustre.org

> http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org
Jan Kara Feb. 6, 2017, 8:59 a.m. UTC | #7
Hi Xiong,

On Fri 03-02-17 23:44:57, Xiong, Jinshan wrote:
> Thanks for the patch. 
> 
> The proposed patch should be able to fix the problem, however, do you
> think it would be a better approach by revising it as:
> 
> …
> case -EAGAIN:
> 	if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
> 		up_read(&mm->mmap_sem);
> 		return VM_FAULT_RETRY;
> 	}
> 	return VM_FAULT_NOPAGE;
> …
> 
> This way it can retry fault routine in mm instead of letting CPU have a
> new fault access.

Well, we could do that but IMHO that is a negligible benefit not worth the
complications in the code. After all these retries should better be rare or
you have bigger problems with your fault handler... What would be
worthwhile is something like:

	if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
		up_read(&mm->mmap_sem);
		<wait for condition causing EAGAIN to resolve>
		return VM_FAULT_RETRY;
	}

However that wait is specific to the fault handler so we cannot do that in
the generic code.

								Honza

> > On Feb 3, 2017, at 7:07 AM, Jan Kara <jack@suse.cz> wrote:
> > 
> > Some ->page_mkwrite handlers may return VM_FAULT_RETRY as its return
> > code (GFS2 or Lustre can definitely do this). However VM_FAULT_RETRY
> > from ->page_mkwrite is completely unhandled by the mm code and results
> > in locking and writeably mapping the page which definitely is not what
> > the caller wanted. Fix Lustre and block_page_mkwrite_ret() used by other
> > filesystems (notably GFS2) to return VM_FAULT_NOPAGE instead which
> > results in bailing out from the fault code, the CPU then retries the
> > access, and we fault again effectively doing what the handler wanted.
> > 
> > CC: lustre-devel@lists.lustre.org
> > CC: cluster-devel@redhat.com
> > Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> > drivers/staging/lustre/lustre/llite/llite_mmap.c | 4 +---
> > include/linux/buffer_head.h                      | 4 +---
> > 2 files changed, 2 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/staging/lustre/lustre/llite/llite_mmap.c b/drivers/staging/lustre/lustre/llite/llite_mmap.c
> > index ee01f20d8b11..9afa6bec3e6f 100644
> > --- a/drivers/staging/lustre/lustre/llite/llite_mmap.c
> > +++ b/drivers/staging/lustre/lustre/llite/llite_mmap.c
> > @@ -390,15 +390,13 @@ static int ll_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > 		result = VM_FAULT_LOCKED;
> > 		break;
> > 	case -ENODATA:
> > +	case -EAGAIN:
> > 	case -EFAULT:
> > 		result = VM_FAULT_NOPAGE;
> > 		break;
> > 	case -ENOMEM:
> > 		result = VM_FAULT_OOM;
> > 		break;
> > -	case -EAGAIN:
> > -		result = VM_FAULT_RETRY;
> > -		break;
> > 	default:
> > 		result = VM_FAULT_SIGBUS;
> > 		break;
> > diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> > index d67ab83823ad..79591c3660cc 100644
> > --- a/include/linux/buffer_head.h
> > +++ b/include/linux/buffer_head.h
> > @@ -243,12 +243,10 @@ static inline int block_page_mkwrite_return(int err)
> > {
> > 	if (err == 0)
> > 		return VM_FAULT_LOCKED;
> > -	if (err == -EFAULT)
> > +	if (err == -EFAULT || err == -EAGAIN)
> > 		return VM_FAULT_NOPAGE;
> > 	if (err == -ENOMEM)
> > 		return VM_FAULT_OOM;
> > -	if (err == -EAGAIN)
> > -		return VM_FAULT_RETRY;
> > 	/* -ENOSPC, -EDQUOT, -EIO ... */
> > 	return VM_FAULT_SIGBUS;
> > }
> > -- 
> > 2.10.2
> > 
> > _______________________________________________
> > lustre-devel mailing list
> > lustre-devel@lists.lustre.org
> > http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org
>
Jan Kara Feb. 6, 2017, 9:24 a.m. UTC | #8
On Fri 03-02-17 15:20:54, Andrew Morton wrote:
> On Fri,  3 Feb 2017 16:07:29 +0100 Jan Kara <jack@suse.cz> wrote:
> 
> > Some ->page_mkwrite handlers may return VM_FAULT_RETRY as its return
> > code (GFS2 or Lustre can definitely do this). However VM_FAULT_RETRY
> > from ->page_mkwrite is completely unhandled by the mm code and results
> > in locking and writeably mapping the page which definitely is not what
> > the caller wanted. Fix Lustre and block_page_mkwrite_ret() used by other
> > filesystems (notably GFS2) to return VM_FAULT_NOPAGE instead which
> > results in bailing out from the fault code, the CPU then retries the
> > access, and we fault again effectively doing what the handler wanted.
> 
> I'm not getting any sense of the urgency of this fix.  The bug *sounds*
> bad?  Which kernel versions need fixing?

So I did more analysis of GFS2 and Lustre behavior. AFAICS GFS2 returns
EAGAIN only for truncated page, when we then return with VM_FAULT_RETRY,
do_page_mkwrite() locks the page, sees it is truncated and bails out
properly thus silently fixes up the problem. The Lustre bug looks like it
could actually result in some real problems and the bug is there since the
initial commit in which Lustre was added in 3.11 (d7e09d0397e84).

So overall the issue doesn't look like too serious currently but it is
certainly a serious bug waiting to happen.

								Honza
Xiong, Jinshan Feb. 6, 2017, 8:52 p.m. UTC | #9
looks good to me.

Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com>

> On Feb 3, 2017, at 7:07 AM, Jan Kara <jack@suse.cz> wrote:
> 
> Some ->page_mkwrite handlers may return VM_FAULT_RETRY as its return
> code (GFS2 or Lustre can definitely do this). However VM_FAULT_RETRY
> from ->page_mkwrite is completely unhandled by the mm code and results
> in locking and writeably mapping the page which definitely is not what
> the caller wanted. Fix Lustre and block_page_mkwrite_ret() used by other
> filesystems (notably GFS2) to return VM_FAULT_NOPAGE instead which
> results in bailing out from the fault code, the CPU then retries the
> access, and we fault again effectively doing what the handler wanted.
> 
> CC: lustre-devel@lists.lustre.org
> CC: cluster-devel@redhat.com
> Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
> drivers/staging/lustre/lustre/llite/llite_mmap.c | 4 +---
> include/linux/buffer_head.h                      | 4 +---
> 2 files changed, 2 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/staging/lustre/lustre/llite/llite_mmap.c b/drivers/staging/lustre/lustre/llite/llite_mmap.c
> index ee01f20d8b11..9afa6bec3e6f 100644
> --- a/drivers/staging/lustre/lustre/llite/llite_mmap.c
> +++ b/drivers/staging/lustre/lustre/llite/llite_mmap.c
> @@ -390,15 +390,13 @@ static int ll_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> 		result = VM_FAULT_LOCKED;
> 		break;
> 	case -ENODATA:
> +	case -EAGAIN:
> 	case -EFAULT:
> 		result = VM_FAULT_NOPAGE;
> 		break;
> 	case -ENOMEM:
> 		result = VM_FAULT_OOM;
> 		break;
> -	case -EAGAIN:
> -		result = VM_FAULT_RETRY;
> -		break;
> 	default:
> 		result = VM_FAULT_SIGBUS;
> 		break;
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index d67ab83823ad..79591c3660cc 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -243,12 +243,10 @@ static inline int block_page_mkwrite_return(int err)
> {
> 	if (err == 0)
> 		return VM_FAULT_LOCKED;
> -	if (err == -EFAULT)
> +	if (err == -EFAULT || err == -EAGAIN)
> 		return VM_FAULT_NOPAGE;
> 	if (err == -ENOMEM)
> 		return VM_FAULT_OOM;
> -	if (err == -EAGAIN)
> -		return VM_FAULT_RETRY;
> 	/* -ENOSPC, -EDQUOT, -EIO ... */
> 	return VM_FAULT_SIGBUS;
> }
> -- 
> 2.10.2
> 
> _______________________________________________
> lustre-devel mailing list
> lustre-devel@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org
diff mbox

Patch

diff --git a/drivers/staging/lustre/lustre/llite/llite_mmap.c b/drivers/staging/lustre/lustre/llite/llite_mmap.c
index ee01f20d8b11..9afa6bec3e6f 100644
--- a/drivers/staging/lustre/lustre/llite/llite_mmap.c
+++ b/drivers/staging/lustre/lustre/llite/llite_mmap.c
@@ -390,15 +390,13 @@  static int ll_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 		result = VM_FAULT_LOCKED;
 		break;
 	case -ENODATA:
+	case -EAGAIN:
 	case -EFAULT:
 		result = VM_FAULT_NOPAGE;
 		break;
 	case -ENOMEM:
 		result = VM_FAULT_OOM;
 		break;
-	case -EAGAIN:
-		result = VM_FAULT_RETRY;
-		break;
 	default:
 		result = VM_FAULT_SIGBUS;
 		break;
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index d67ab83823ad..79591c3660cc 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -243,12 +243,10 @@  static inline int block_page_mkwrite_return(int err)
 {
 	if (err == 0)
 		return VM_FAULT_LOCKED;
-	if (err == -EFAULT)
+	if (err == -EFAULT || err == -EAGAIN)
 		return VM_FAULT_NOPAGE;
 	if (err == -ENOMEM)
 		return VM_FAULT_OOM;
-	if (err == -EAGAIN)
-		return VM_FAULT_RETRY;
 	/* -ENOSPC, -EDQUOT, -EIO ... */
 	return VM_FAULT_SIGBUS;
 }