Message ID | 20170203150729.15863-1-jack@suse.cz (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Fri, Feb 03, 2017 at 04:07:29PM +0100, Jan Kara wrote: > Some ->page_mkwrite handlers may return VM_FAULT_RETRY as its return > code (GFS2 or Lustre can definitely do this). However VM_FAULT_RETRY > from ->page_mkwrite is completely unhandled by the mm code and results > in locking and writeably mapping the page which definitely is not what > the caller wanted. Fix Lustre and block_page_mkwrite_ret() used by other > filesystems (notably GFS2) to return VM_FAULT_NOPAGE instead which > results in bailing out from the fault code, the CPU then retries the > access, and we fault again effectively doing what the handler wanted. Reading this commit message makes me wonder if this is the best fix. It would seem logical that if I want the fault to be retried that I should return VM_FAULT_RETRY, not VM_FAULT_NOPAGE. Why don't we have the MM treat VM_FAULT_RETRY the same way that it treats VM_FAULT_NOPAGE and give driver / filesystem writers one fewer way to shoot themselves in the foot?
On Fri 03-02-17 07:13:59, Matthew Wilcox wrote: > On Fri, Feb 03, 2017 at 04:07:29PM +0100, Jan Kara wrote: > > Some ->page_mkwrite handlers may return VM_FAULT_RETRY as its return > > code (GFS2 or Lustre can definitely do this). However VM_FAULT_RETRY > > from ->page_mkwrite is completely unhandled by the mm code and results > > in locking and writeably mapping the page which definitely is not what > > the caller wanted. Fix Lustre and block_page_mkwrite_ret() used by other > > filesystems (notably GFS2) to return VM_FAULT_NOPAGE instead which > > results in bailing out from the fault code, the CPU then retries the > > access, and we fault again effectively doing what the handler wanted. > > Reading this commit message makes me wonder if this is the best fix. > It would seem logical that if I want the fault to be retried that I should > return VM_FAULT_RETRY, not VM_FAULT_NOPAGE. Why don't we have the MM > treat VM_FAULT_RETRY the same way that it treats VM_FAULT_NOPAGE and give > driver / filesystem writers one fewer way to shoot themselves in the foot? VM_FAULT_RETRY is special, it may be used only if FAULT_FLAG_ALLOW_RETRY was set in page fault flags and it means - we have dropped mmap_sem, we loaded page needed to satisfy the fault and now we need to try again (have a look at __lock_page_or_retry()). I have my reservations about this interface but it works... Honza
On Fri, Feb 03, 2017 at 04:46:40PM +0100, Jan Kara wrote: > On Fri 03-02-17 07:13:59, Matthew Wilcox wrote: > > On Fri, Feb 03, 2017 at 04:07:29PM +0100, Jan Kara wrote: > > > Some ->page_mkwrite handlers may return VM_FAULT_RETRY as its return > > > code (GFS2 or Lustre can definitely do this). However VM_FAULT_RETRY > > > from ->page_mkwrite is completely unhandled by the mm code and results > > > in locking and writeably mapping the page which definitely is not what > > > the caller wanted. Fix Lustre and block_page_mkwrite_ret() used by other > > > filesystems (notably GFS2) to return VM_FAULT_NOPAGE instead which > > > results in bailing out from the fault code, the CPU then retries the > > > access, and we fault again effectively doing what the handler wanted. > > > > Reading this commit message makes me wonder if this is the best fix. > > It would seem logical that if I want the fault to be retried that I should > > return VM_FAULT_RETRY, not VM_FAULT_NOPAGE. Why don't we have the MM > > treat VM_FAULT_RETRY the same way that it treats VM_FAULT_NOPAGE and give > > driver / filesystem writers one fewer way to shoot themselves in the foot? > > VM_FAULT_RETRY is special, it may be used only if FAULT_FLAG_ALLOW_RETRY > was set in page fault flags and it means - we have dropped mmap_sem, we > loaded page needed to satisfy the fault and now we need to try again (have > a look at __lock_page_or_retry()). I have my reservations about this > interface but it works... Oh, I understand what it's *supposed* to be used for ;-) It's just a bit of an attractive nuisance. Maybe renaming it to something like VM_FAULT_PAGE_RETRY would stop people from thinking that it meant "retry the fault". And we could #define VM_FAULT_RETRY VM_FAULT_NOPAGE so that people who want to retry the fault in a normal way could use a return value that sounds like it does what they want instead of a return value that is supposed to be used to indicate that we put a PFN into the page table?
On Fri 03-02-17 07:53:26, Matthew Wilcox wrote: > On Fri, Feb 03, 2017 at 04:46:40PM +0100, Jan Kara wrote: > > On Fri 03-02-17 07:13:59, Matthew Wilcox wrote: > > > On Fri, Feb 03, 2017 at 04:07:29PM +0100, Jan Kara wrote: > > > > Some ->page_mkwrite handlers may return VM_FAULT_RETRY as its return > > > > code (GFS2 or Lustre can definitely do this). However VM_FAULT_RETRY > > > > from ->page_mkwrite is completely unhandled by the mm code and results > > > > in locking and writeably mapping the page which definitely is not what > > > > the caller wanted. Fix Lustre and block_page_mkwrite_ret() used by other > > > > filesystems (notably GFS2) to return VM_FAULT_NOPAGE instead which > > > > results in bailing out from the fault code, the CPU then retries the > > > > access, and we fault again effectively doing what the handler wanted. > > > > > > Reading this commit message makes me wonder if this is the best fix. > > > It would seem logical that if I want the fault to be retried that I should > > > return VM_FAULT_RETRY, not VM_FAULT_NOPAGE. Why don't we have the MM > > > treat VM_FAULT_RETRY the same way that it treats VM_FAULT_NOPAGE and give > > > driver / filesystem writers one fewer way to shoot themselves in the foot? > > > > VM_FAULT_RETRY is special, it may be used only if FAULT_FLAG_ALLOW_RETRY > > was set in page fault flags and it means - we have dropped mmap_sem, we > > loaded page needed to satisfy the fault and now we need to try again (have > > a look at __lock_page_or_retry()). I have my reservations about this > > interface but it works... > > Oh, I understand what it's *supposed* to be used for ;-) It's just > a bit of an attractive nuisance. Maybe renaming it to something like > VM_FAULT_PAGE_RETRY would stop people from thinking that it meant "retry > the fault". And we could #define VM_FAULT_RETRY VM_FAULT_NOPAGE so that > people who want to retry the fault in a normal way could use a return > value that sounds like it does what they want instead of a return value > that is supposed to be used to indicate that we put a PFN into the > page table? So a better name for VM_FAULT_RETRY and disentangling VM_FAULT_NOPAGE so that it is not misused for retrying the fault would be nice. But it is a much larger endeavor (I actually had a look into this some time ago) than this simple bugfix... Honza
On Fri, 3 Feb 2017 16:07:29 +0100 Jan Kara <jack@suse.cz> wrote: > Some ->page_mkwrite handlers may return VM_FAULT_RETRY as its return > code (GFS2 or Lustre can definitely do this). However VM_FAULT_RETRY > from ->page_mkwrite is completely unhandled by the mm code and results > in locking and writeably mapping the page which definitely is not what > the caller wanted. Fix Lustre and block_page_mkwrite_ret() used by other > filesystems (notably GFS2) to return VM_FAULT_NOPAGE instead which > results in bailing out from the fault code, the CPU then retries the > access, and we fault again effectively doing what the handler wanted. I'm not getting any sense of the urgency of this fix. The bug *sounds* bad? Which kernel versions need fixing?
Hi Jan, Thanks for the patch. The proposed patch should be able to fix the problem, however, do you think it would be a better approach by revising it as: … case -EAGAIN: if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) { up_read(&mm->mmap_sem); return VM_FAULT_RETRY; } return VM_FAULT_NOPAGE; … This way it can retry fault routine in mm instead of letting CPU have a new fault access. Thanks, Jinshan > On Feb 3, 2017, at 7:07 AM, Jan Kara <jack@suse.cz> wrote: > > Some ->page_mkwrite handlers may return VM_FAULT_RETRY as its return > code (GFS2 or Lustre can definitely do this). However VM_FAULT_RETRY > from ->page_mkwrite is completely unhandled by the mm code and results > in locking and writeably mapping the page which definitely is not what > the caller wanted. Fix Lustre and block_page_mkwrite_ret() used by other > filesystems (notably GFS2) to return VM_FAULT_NOPAGE instead which > results in bailing out from the fault code, the CPU then retries the > access, and we fault again effectively doing what the handler wanted. > > CC: lustre-devel@lists.lustre.org > CC: cluster-devel@redhat.com > Reported-by: Al Viro <viro@ZenIV.linux.org.uk> > Signed-off-by: Jan Kara <jack@suse.cz> > --- > drivers/staging/lustre/lustre/llite/llite_mmap.c | 4 +--- > include/linux/buffer_head.h | 4 +--- > 2 files changed, 2 insertions(+), 6 deletions(-) > > diff --git a/drivers/staging/lustre/lustre/llite/llite_mmap.c b/drivers/staging/lustre/lustre/llite/llite_mmap.c > index ee01f20d8b11..9afa6bec3e6f 100644 > --- a/drivers/staging/lustre/lustre/llite/llite_mmap.c > +++ b/drivers/staging/lustre/lustre/llite/llite_mmap.c > @@ -390,15 +390,13 @@ static int ll_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > result = VM_FAULT_LOCKED; > break; > case -ENODATA: > + case -EAGAIN: > case -EFAULT: > result = VM_FAULT_NOPAGE; > break; > case -ENOMEM: > result = VM_FAULT_OOM; > break; > - case -EAGAIN: > - result = VM_FAULT_RETRY; > - break; > default: > result = VM_FAULT_SIGBUS; > break; > diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h > index d67ab83823ad..79591c3660cc 100644 > --- a/include/linux/buffer_head.h > +++ b/include/linux/buffer_head.h > @@ -243,12 +243,10 @@ static inline int block_page_mkwrite_return(int err) > { > if (err == 0) > return VM_FAULT_LOCKED; > - if (err == -EFAULT) > + if (err == -EFAULT || err == -EAGAIN) > return VM_FAULT_NOPAGE; > if (err == -ENOMEM) > return VM_FAULT_OOM; > - if (err == -EAGAIN) > - return VM_FAULT_RETRY; > /* -ENOSPC, -EDQUOT, -EIO ... */ > return VM_FAULT_SIGBUS; > } > -- > 2.10.2 > > _______________________________________________ > lustre-devel mailing list > lustre-devel@lists.lustre.org > http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org
Hi Xiong, On Fri 03-02-17 23:44:57, Xiong, Jinshan wrote: > Thanks for the patch. > > The proposed patch should be able to fix the problem, however, do you > think it would be a better approach by revising it as: > > … > case -EAGAIN: > if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) { > up_read(&mm->mmap_sem); > return VM_FAULT_RETRY; > } > return VM_FAULT_NOPAGE; > … > > This way it can retry fault routine in mm instead of letting CPU have a > new fault access. Well, we could do that but IMHO that is a negligible benefit not worth the complications in the code. After all these retries should better be rare or you have bigger problems with your fault handler... What would be worthwhile is something like: if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) { up_read(&mm->mmap_sem); <wait for condition causing EAGAIN to resolve> return VM_FAULT_RETRY; } However that wait is specific to the fault handler so we cannot do that in the generic code. Honza > > On Feb 3, 2017, at 7:07 AM, Jan Kara <jack@suse.cz> wrote: > > > > Some ->page_mkwrite handlers may return VM_FAULT_RETRY as its return > > code (GFS2 or Lustre can definitely do this). However VM_FAULT_RETRY > > from ->page_mkwrite is completely unhandled by the mm code and results > > in locking and writeably mapping the page which definitely is not what > > the caller wanted. Fix Lustre and block_page_mkwrite_ret() used by other > > filesystems (notably GFS2) to return VM_FAULT_NOPAGE instead which > > results in bailing out from the fault code, the CPU then retries the > > access, and we fault again effectively doing what the handler wanted. > > > > CC: lustre-devel@lists.lustre.org > > CC: cluster-devel@redhat.com > > Reported-by: Al Viro <viro@ZenIV.linux.org.uk> > > Signed-off-by: Jan Kara <jack@suse.cz> > > --- > > drivers/staging/lustre/lustre/llite/llite_mmap.c | 4 +--- > > include/linux/buffer_head.h | 4 +--- > > 2 files changed, 2 insertions(+), 6 deletions(-) > > > > diff --git a/drivers/staging/lustre/lustre/llite/llite_mmap.c b/drivers/staging/lustre/lustre/llite/llite_mmap.c > > index ee01f20d8b11..9afa6bec3e6f 100644 > > --- a/drivers/staging/lustre/lustre/llite/llite_mmap.c > > +++ b/drivers/staging/lustre/lustre/llite/llite_mmap.c > > @@ -390,15 +390,13 @@ static int ll_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > > result = VM_FAULT_LOCKED; > > break; > > case -ENODATA: > > + case -EAGAIN: > > case -EFAULT: > > result = VM_FAULT_NOPAGE; > > break; > > case -ENOMEM: > > result = VM_FAULT_OOM; > > break; > > - case -EAGAIN: > > - result = VM_FAULT_RETRY; > > - break; > > default: > > result = VM_FAULT_SIGBUS; > > break; > > diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h > > index d67ab83823ad..79591c3660cc 100644 > > --- a/include/linux/buffer_head.h > > +++ b/include/linux/buffer_head.h > > @@ -243,12 +243,10 @@ static inline int block_page_mkwrite_return(int err) > > { > > if (err == 0) > > return VM_FAULT_LOCKED; > > - if (err == -EFAULT) > > + if (err == -EFAULT || err == -EAGAIN) > > return VM_FAULT_NOPAGE; > > if (err == -ENOMEM) > > return VM_FAULT_OOM; > > - if (err == -EAGAIN) > > - return VM_FAULT_RETRY; > > /* -ENOSPC, -EDQUOT, -EIO ... */ > > return VM_FAULT_SIGBUS; > > } > > -- > > 2.10.2 > > > > _______________________________________________ > > lustre-devel mailing list > > lustre-devel@lists.lustre.org > > http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org >
On Fri 03-02-17 15:20:54, Andrew Morton wrote: > On Fri, 3 Feb 2017 16:07:29 +0100 Jan Kara <jack@suse.cz> wrote: > > > Some ->page_mkwrite handlers may return VM_FAULT_RETRY as its return > > code (GFS2 or Lustre can definitely do this). However VM_FAULT_RETRY > > from ->page_mkwrite is completely unhandled by the mm code and results > > in locking and writeably mapping the page which definitely is not what > > the caller wanted. Fix Lustre and block_page_mkwrite_ret() used by other > > filesystems (notably GFS2) to return VM_FAULT_NOPAGE instead which > > results in bailing out from the fault code, the CPU then retries the > > access, and we fault again effectively doing what the handler wanted. > > I'm not getting any sense of the urgency of this fix. The bug *sounds* > bad? Which kernel versions need fixing? So I did more analysis of GFS2 and Lustre behavior. AFAICS GFS2 returns EAGAIN only for truncated page, when we then return with VM_FAULT_RETRY, do_page_mkwrite() locks the page, sees it is truncated and bails out properly thus silently fixes up the problem. The Lustre bug looks like it could actually result in some real problems and the bug is there since the initial commit in which Lustre was added in 3.11 (d7e09d0397e84). So overall the issue doesn't look like too serious currently but it is certainly a serious bug waiting to happen. Honza
looks good to me. Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com> > On Feb 3, 2017, at 7:07 AM, Jan Kara <jack@suse.cz> wrote: > > Some ->page_mkwrite handlers may return VM_FAULT_RETRY as its return > code (GFS2 or Lustre can definitely do this). However VM_FAULT_RETRY > from ->page_mkwrite is completely unhandled by the mm code and results > in locking and writeably mapping the page which definitely is not what > the caller wanted. Fix Lustre and block_page_mkwrite_ret() used by other > filesystems (notably GFS2) to return VM_FAULT_NOPAGE instead which > results in bailing out from the fault code, the CPU then retries the > access, and we fault again effectively doing what the handler wanted. > > CC: lustre-devel@lists.lustre.org > CC: cluster-devel@redhat.com > Reported-by: Al Viro <viro@ZenIV.linux.org.uk> > Signed-off-by: Jan Kara <jack@suse.cz> > --- > drivers/staging/lustre/lustre/llite/llite_mmap.c | 4 +--- > include/linux/buffer_head.h | 4 +--- > 2 files changed, 2 insertions(+), 6 deletions(-) > > diff --git a/drivers/staging/lustre/lustre/llite/llite_mmap.c b/drivers/staging/lustre/lustre/llite/llite_mmap.c > index ee01f20d8b11..9afa6bec3e6f 100644 > --- a/drivers/staging/lustre/lustre/llite/llite_mmap.c > +++ b/drivers/staging/lustre/lustre/llite/llite_mmap.c > @@ -390,15 +390,13 @@ static int ll_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > result = VM_FAULT_LOCKED; > break; > case -ENODATA: > + case -EAGAIN: > case -EFAULT: > result = VM_FAULT_NOPAGE; > break; > case -ENOMEM: > result = VM_FAULT_OOM; > break; > - case -EAGAIN: > - result = VM_FAULT_RETRY; > - break; > default: > result = VM_FAULT_SIGBUS; > break; > diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h > index d67ab83823ad..79591c3660cc 100644 > --- a/include/linux/buffer_head.h > +++ b/include/linux/buffer_head.h > @@ -243,12 +243,10 @@ static inline int block_page_mkwrite_return(int err) > { > if (err == 0) > return VM_FAULT_LOCKED; > - if (err == -EFAULT) > + if (err == -EFAULT || err == -EAGAIN) > return VM_FAULT_NOPAGE; > if (err == -ENOMEM) > return VM_FAULT_OOM; > - if (err == -EAGAIN) > - return VM_FAULT_RETRY; > /* -ENOSPC, -EDQUOT, -EIO ... */ > return VM_FAULT_SIGBUS; > } > -- > 2.10.2 > > _______________________________________________ > lustre-devel mailing list > lustre-devel@lists.lustre.org > http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org
diff --git a/drivers/staging/lustre/lustre/llite/llite_mmap.c b/drivers/staging/lustre/lustre/llite/llite_mmap.c index ee01f20d8b11..9afa6bec3e6f 100644 --- a/drivers/staging/lustre/lustre/llite/llite_mmap.c +++ b/drivers/staging/lustre/lustre/llite/llite_mmap.c @@ -390,15 +390,13 @@ static int ll_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) result = VM_FAULT_LOCKED; break; case -ENODATA: + case -EAGAIN: case -EFAULT: result = VM_FAULT_NOPAGE; break; case -ENOMEM: result = VM_FAULT_OOM; break; - case -EAGAIN: - result = VM_FAULT_RETRY; - break; default: result = VM_FAULT_SIGBUS; break; diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h index d67ab83823ad..79591c3660cc 100644 --- a/include/linux/buffer_head.h +++ b/include/linux/buffer_head.h @@ -243,12 +243,10 @@ static inline int block_page_mkwrite_return(int err) { if (err == 0) return VM_FAULT_LOCKED; - if (err == -EFAULT) + if (err == -EFAULT || err == -EAGAIN) return VM_FAULT_NOPAGE; if (err == -ENOMEM) return VM_FAULT_OOM; - if (err == -EAGAIN) - return VM_FAULT_RETRY; /* -ENOSPC, -EDQUOT, -EIO ... */ return VM_FAULT_SIGBUS; }
Some ->page_mkwrite handlers may return VM_FAULT_RETRY as its return code (GFS2 or Lustre can definitely do this). However VM_FAULT_RETRY from ->page_mkwrite is completely unhandled by the mm code and results in locking and writeably mapping the page which definitely is not what the caller wanted. Fix Lustre and block_page_mkwrite_ret() used by other filesystems (notably GFS2) to return VM_FAULT_NOPAGE instead which results in bailing out from the fault code, the CPU then retries the access, and we fault again effectively doing what the handler wanted. CC: lustre-devel@lists.lustre.org CC: cluster-devel@redhat.com Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Jan Kara <jack@suse.cz> --- drivers/staging/lustre/lustre/llite/llite_mmap.c | 4 +--- include/linux/buffer_head.h | 4 +--- 2 files changed, 2 insertions(+), 6 deletions(-)