[3/4] filemap: drop the mmap_sem for all blocking operations

Message ID 20181130195812.19536-4-josef@toxicpanda.com (mailing list archive)
State New, archived
Series drop the mmap_sem when doing IO in the fault path

Commit Message

Josef Bacik Nov. 30, 2018, 7:58 p.m. UTC
Currently we only drop the mmap_sem if there is contention on the page
lock.  The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.

The problem with this is the assumption that the readahead does
anything.  In the case that the box is under extreme memory or IO
pressure we may end up not reading anything at all for readahead, which
means we will end up reading in the page under the mmap_sem.

Instead, rework the filemap fault path to drop the mmap_sem at any point
where we may do IO or block for an extended period of time.  This includes
while issuing readahead, locking the page, or needing to call ->readpage
because readahead did not occur.  Then once we have a fully uptodate
page we can return with VM_FAULT_RETRY and come back again to find our
nicely in-cache page that was read in outside of the mmap_sem.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 mm/filemap.c | 113 ++++++++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 93 insertions(+), 20 deletions(-)
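
For context, the VM_FAULT_RETRY round trip described above is driven by the
arch fault handler, which allows exactly one retry.  A simplified sketch of
the caller side (modeled on how x86 consumed these flags at the time; vma,
mm and address setup elided):

	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
	vm_fault_t fault;

retry:
	down_read(&mm->mmap_sem);
	vma = find_vma(mm, address);
	fault = handle_mm_fault(vma, address, flags);
	if ((fault & VM_FAULT_RETRY) && (flags & FAULT_FLAG_ALLOW_RETRY)) {
		/*
		 * The fault handler dropped mmap_sem before returning.
		 * Retry once with ALLOW_RETRY cleared; on the second pass
		 * filemap_fault() has to block with the sem held.
		 */
		flags &= ~FAULT_FLAG_ALLOW_RETRY;
		flags |= FAULT_FLAG_TRIED;
		goto retry;
	}
	up_read(&mm->mmap_sem);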

Comments

Andrew Morton Dec. 4, 2018, 10:50 p.m. UTC | #1
On Fri, 30 Nov 2018 14:58:11 -0500 Josef Bacik <josef@toxicpanda.com> wrote:

> Currently we only drop the mmap_sem if there is contention on the page
> lock.  The idea is that we issue readahead and then go to lock the page
> while it is under IO and we want to not hold the mmap_sem during the IO.
> 
> The problem with this is the assumption that the readahead does
> anything.  In the case that the box is under extreme memory or IO
> pressure we may end up not reading anything at all for readahead, which
> means we will end up reading in the page under the mmap_sem.

Please describe here why this is considered to be a problem.
Application stalling, I assume?  Some description of in-the-field
observations would be appropriate, i.e., how serious is the problem
which is being addressed.

> Instead, rework the filemap fault path to drop the mmap_sem at any point
> where we may do IO or block for an extended period of time.  This includes
> while issuing readahead, locking the page, or needing to call ->readpage
> because readahead did not occur.  Then once we have a fully uptodate
> page we can return with VM_FAULT_RETRY and come back again to find our
> nicely in-cache page that was read in outside of the mmap_sem.
> 
> ...
>
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2304,28 +2304,44 @@ EXPORT_SYMBOL(generic_file_read_iter);
>  
>  #ifdef CONFIG_MMU
>  #define MMAP_LOTSAMISS  (100)
> +static struct file *maybe_unlock_mmap_for_io(struct file *fpin,
> +					     struct vm_area_struct *vma,
> +					     int flags)
> +{
> +	if (fpin)
> +		return fpin;
> +	if ((flags & (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT)) ==
> +	    FAULT_FLAG_ALLOW_RETRY) {
> +		fpin = get_file(vma->vm_file);
> +		up_read(&vma->vm_mm->mmap_sem);
> +	}
> +	return fpin;
> +}

A code comment would be nice.  What it does and, especially, why it does it.
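
One possible shape for such a comment, paraphrasing the flag logic above (an
illustrative sketch, not wording from the thread):

	/*
	 * maybe_unlock_mmap_for_io - pin the file and drop mmap_sem before
	 * blocking on IO, if the fault flags let us return VM_FAULT_RETRY.
	 *
	 * If fpin is already set, a previous call pinned the file and
	 * dropped the sem, so there is nothing to do.  Otherwise we only
	 * drop the sem if the caller can handle a retry
	 * (FAULT_FLAG_ALLOW_RETRY set) and is willing to wait for the IO
	 * (FAULT_FLAG_RETRY_NOWAIT clear).  The file pin keeps
	 * vma->vm_file alive after mmap_sem is released.
	 */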

> -	if (!lock_page_or_retry(page, vmf->vma->vm_mm, vmf->flags)) {
> -		put_page(page);
> -		return ret | VM_FAULT_RETRY;
> +	/*
> +	 * We are open-coding lock_page_or_retry here because we want to do the
> +	 * readpage if necessary while the mmap_sem is dropped.  If there
> +	 * happens to be a lock on the page but it wasn't being faulted in we'd
> +	 * come back around without ALLOW_RETRY set and then have to do the IO
> +	 * under the mmap_sem, which would be a bummer.

Expanding on "a bummer" would help here ;)

> +	 */
> +	if (!trylock_page(page)) {
> +		fpin = maybe_unlock_mmap_for_io(fpin, vmf->vma, vmf->flags);
> +		if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
> +			goto out_retry;
> +		if (vmf->flags & FAULT_FLAG_KILLABLE) {
> +			if (__lock_page_killable(page)) {
> +				/*
> +				 * If we don't have the right flags for
> +				 * maybe_unlock_mmap_for_io to do it's thing we

"its"

> +				 * still need to drop the sem and return
> +				 * VM_FAULT_RETRY so the upper layer checks the
> +				 * signal and takes the appropriate action.
> +				 */
> +				if (!fpin)
> +					up_read(&vmf->vma->vm_mm->mmap_sem);
> +				goto out_retry;
> +			}
> +		} else
> +			__lock_page(page);
>  	}
>  
>  	/* Did it get truncated? */
Johannes Weiner Dec. 5, 2018, 10:23 p.m. UTC | #2
On Fri, Nov 30, 2018 at 02:58:11PM -0500, Josef Bacik wrote:
> Currently we only drop the mmap_sem if there is contention on the page
> lock.  The idea is that we issue readahead and then go to lock the page
> while it is under IO and we want to not hold the mmap_sem during the IO.
> 
> The problem with this is the assumption that the readahead does
> anything.  In the case that the box is under extreme memory or IO
> pressure we may end up not reading anything at all for readahead, which
> means we will end up reading in the page under the mmap_sem.

I'd also add that even if readahead did something, the block request
queues could be contended enough that merely submitting the IO could
itself become IO bound if it has to wait for in-flight requests.

Not really a concern with cgroup IO control, but this has always
somewhat defeated the original purpose of the mmap_sem dropping
(avoiding serializing page faults when there is a writer queued).

> Instead, rework the filemap fault path to drop the mmap_sem at any point
> where we may do IO or block for an extended period of time.  This includes
> while issuing readahead, locking the page, or needing to call ->readpage
> because readahead did not occur.  Then once we have a fully uptodate
> page we can return with VM_FAULT_RETRY and come back again to find our
> nicely in-cache page that was read in outside of the mmap_sem.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Keeping the fpin throughout the fault handler makes things a lot
simpler than the -EAGAIN and wait_on_page_locked dance from earlier
versions. Nice.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Jan Kara Dec. 7, 2018, 11:01 a.m. UTC | #3
On Fri 30-11-18 14:58:11, Josef Bacik wrote:
> Currently we only drop the mmap_sem if there is contention on the page
> lock.  The idea is that we issue readahead and then go to lock the page
> while it is under IO and we want to not hold the mmap_sem during the IO.
> 
> The problem with this is the assumption that the readahead does
> anything.  In the case that the box is under extreme memory or IO
> pressure we may end up not reading anything at all for readahead, which
> means we will end up reading in the page under the mmap_sem.
> 
> Instead, rework the filemap fault path to drop the mmap_sem at any point
> where we may do IO or block for an extended period of time.  This includes
> while issuing readahead, locking the page, or needing to call ->readpage
> because readahead did not occur.  Then once we have a fully uptodate
> page we can return with VM_FAULT_RETRY and come back again to find our
> nicely in-cache page that was read in outside of the mmap_sem.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  mm/filemap.c | 113 ++++++++++++++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 93 insertions(+), 20 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index f068712c2525..5e76b24b2a0f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2304,28 +2304,44 @@ EXPORT_SYMBOL(generic_file_read_iter);
>  
>  #ifdef CONFIG_MMU
>  #define MMAP_LOTSAMISS  (100)
> +static struct file *maybe_unlock_mmap_for_io(struct file *fpin,
> +					     struct vm_area_struct *vma,
> +					     int flags)
> +{
> +	if (fpin)
> +		return fpin;
> +	if ((flags & (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT)) ==
> +	    FAULT_FLAG_ALLOW_RETRY) {
> +		fpin = get_file(vma->vm_file);
> +		up_read(&vma->vm_mm->mmap_sem);
> +	}
> +	return fpin;
> +}
>  
>  /*
>   * Synchronous readahead happens when we don't even find
>   * a page in the page cache at all.
>   */
> -static void do_sync_mmap_readahead(struct vm_area_struct *vma,
> -				   struct file_ra_state *ra,
> -				   struct file *file,
> -				   pgoff_t offset)
> +static struct file *do_sync_mmap_readahead(struct vm_area_struct *vma,
> +					   struct file_ra_state *ra,
> +					   struct file *file,
> +					   pgoff_t offset,
> +					   int flags)
>  {

IMO it would be nicer to pass vmf here at this point. Everything this
function needs is there and the number of arguments is already quite big.
But I don't insist.
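
Concretely, that would look something like this (a sketch; everything the
function reads is reachable from vmf):

	static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
	{
		struct file *file = vmf->vma->vm_file;
		struct file_ra_state *ra = &file->f_ra;
		struct address_space *mapping = file->f_mapping;
		pgoff_t offset = vmf->pgoff;
		struct file *fpin = NULL;

		/* ... body as in the patch, using vmf->vma and vmf->flags ... */
	}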

>  /*
>   * Asynchronous readahead happens when we find the page and PG_readahead,
>   * so we want to possibly extend the readahead further..
>   */
> -static void do_async_mmap_readahead(struct vm_area_struct *vma,
> -				    struct file_ra_state *ra,
> -				    struct file *file,
> -				    struct page *page,
> -				    pgoff_t offset)
> +static struct file *do_async_mmap_readahead(struct vm_area_struct *vma,
> +					    struct file_ra_state *ra,
> +					    struct file *file,
> +					    struct page *page,
> +					    pgoff_t offset, int flags)
>  {

The same here (except for 'page' which needs to be kept).

> @@ -2433,9 +2458,32 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>  			return vmf_error(-ENOMEM);
>  	}
>  
> -	if (!lock_page_or_retry(page, vmf->vma->vm_mm, vmf->flags)) {
> -		put_page(page);
> -		return ret | VM_FAULT_RETRY;
> +	/*
> +	 * We are open-coding lock_page_or_retry here because we want to do the
> +	 * readpage if necessary while the mmap_sem is dropped.  If there
> +	 * happens to be a lock on the page but it wasn't being faulted in we'd
> +	 * come back around without ALLOW_RETRY set and then have to do the IO
> +	 * under the mmap_sem, which would be a bummer.
> +	 */

Hum, lock_page_or_retry() has two callers and you've just killed one. I
think it would be better to modify the function to suit both callers rather
than opencoding? Maybe something like lock_page_maybe_drop_mmap() which
would unconditionally acquire the lock and return whether it has dropped
mmap sem or not? Callers can then decide what to do.
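
A rough sketch of that shape (hypothetical -- the name and return convention
follow the suggestion above, and the body just refactors the open-coded logic
from the patch; it cannot be fully unconditional because RETRY_NOWAIT and
killed callers still have to bail without the lock):

	/* Returns 1 with the page locked, 0 if the caller should retry. */
	static int lock_page_maybe_drop_mmap(struct vm_fault *vmf,
					     struct page *page,
					     struct file **fpin)
	{
		if (trylock_page(page))
			return 1;
		if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
			return 0;
		*fpin = maybe_unlock_mmap_for_io(*fpin, vmf->vma, vmf->flags);
		if (vmf->flags & FAULT_FLAG_KILLABLE) {
			if (__lock_page_killable(page)) {
				/* no fpin means the sem wasn't dropped yet */
				if (!*fpin)
					up_read(&vmf->vma->vm_mm->mmap_sem);
				return 0;
			}
		} else {
			__lock_page(page);
		}
		return 1;
	}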

BTW I'm not sure this complication is really worth it. The "drop mmap_sem
for IO" is never going to be a 100% thing, if nothing else because only one
retry is allowed in do_user_addr_fault(). So the second time we get to
filemap_fault(), we will not have FAULT_FLAG_ALLOW_RETRY set and thus do
blocking locking. So I think your code needs to catch common cases you
observe in practice but not those super-rare corner cases...

								Honza
Josef Bacik Dec. 10, 2018, 6:44 p.m. UTC | #4
On Fri, Dec 07, 2018 at 12:01:38PM +0100, Jan Kara wrote:
> On Fri 30-11-18 14:58:11, Josef Bacik wrote:
> > Currently we only drop the mmap_sem if there is contention on the page
> > lock.  The idea is that we issue readahead and then go to lock the page
> > while it is under IO and we want to not hold the mmap_sem during the IO.
> > 
> > The problem with this is the assumption that the readahead does
> > anything.  In the case that the box is under extreme memory or IO
> > pressure we may end up not reading anything at all for readahead, which
> > means we will end up reading in the page under the mmap_sem.
> > 
> > Instead, rework the filemap fault path to drop the mmap_sem at any point
> > where we may do IO or block for an extended period of time.  This includes
> > while issuing readahead, locking the page, or needing to call ->readpage
> > because readahead did not occur.  Then once we have a fully uptodate
> > page we can return with VM_FAULT_RETRY and come back again to find our
> > nicely in-cache page that was read in outside of the mmap_sem.
> > 
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> >  mm/filemap.c | 113 ++++++++++++++++++++++++++++++++++++++++++++++++-----------
> >  1 file changed, 93 insertions(+), 20 deletions(-)
> > 
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index f068712c2525..5e76b24b2a0f 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -2304,28 +2304,44 @@ EXPORT_SYMBOL(generic_file_read_iter);
> >  
> >  #ifdef CONFIG_MMU
> >  #define MMAP_LOTSAMISS  (100)
> > +static struct file *maybe_unlock_mmap_for_io(struct file *fpin,
> > +					     struct vm_area_struct *vma,
> > +					     int flags)
> > +{
> > +	if (fpin)
> > +		return fpin;
> > +	if ((flags & (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT)) ==
> > +	    FAULT_FLAG_ALLOW_RETRY) {
> > +		fpin = get_file(vma->vm_file);
> > +		up_read(&vma->vm_mm->mmap_sem);
> > +	}
> > +	return fpin;
> > +}
> >  
> >  /*
> >   * Synchronous readahead happens when we don't even find
> >   * a page in the page cache at all.
> >   */
> > -static void do_sync_mmap_readahead(struct vm_area_struct *vma,
> > -				   struct file_ra_state *ra,
> > -				   struct file *file,
> > -				   pgoff_t offset)
> > +static struct file *do_sync_mmap_readahead(struct vm_area_struct *vma,
> > +					   struct file_ra_state *ra,
> > +					   struct file *file,
> > +					   pgoff_t offset,
> > +					   int flags)
> >  {
> 
> IMO it would be nicer to pass vmf here at this point. Everything this
> function needs is there and the number of arguments is already quite big.
> But I don't insist.
> 
> >  /*
> >   * Asynchronous readahead happens when we find the page and PG_readahead,
> >   * so we want to possibly extend the readahead further..
> >   */
> > -static void do_async_mmap_readahead(struct vm_area_struct *vma,
> > -				    struct file_ra_state *ra,
> > -				    struct file *file,
> > -				    struct page *page,
> > -				    pgoff_t offset)
> > +static struct file *do_async_mmap_readahead(struct vm_area_struct *vma,
> > +					    struct file_ra_state *ra,
> > +					    struct file *file,
> > +					    struct page *page,
> > +					    pgoff_t offset, int flags)
> >  {
> 
> The same here (except for 'page' which needs to be kept).
> 
> > @@ -2433,9 +2458,32 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> >  			return vmf_error(-ENOMEM);
> >  	}
> >  
> > -	if (!lock_page_or_retry(page, vmf->vma->vm_mm, vmf->flags)) {
> > -		put_page(page);
> > -		return ret | VM_FAULT_RETRY;
> > +	/*
> > +	 * We are open-coding lock_page_or_retry here because we want to do the
> > +	 * readpage if necessary while the mmap_sem is dropped.  If there
> > +	 * happens to be a lock on the page but it wasn't being faulted in we'd
> > +	 * come back around without ALLOW_RETRY set and then have to do the IO
> > +	 * under the mmap_sem, which would be a bummer.
> > +	 */
> 
> Hum, lock_page_or_retry() has two callers and you've just killed one. I
> think it would be better to modify the function to suit both callers rather
> than opencoding? Maybe something like lock_page_maybe_drop_mmap() which
> would unconditionally acquire the lock and return whether it has dropped
> mmap sem or not? Callers can then decide what to do.
> 

I tried this, but it ends up being convoluted: since swap doesn't have a file
to pin, we have to add extra cases for that, and then change the return value
to indicate whether we locked the page _and_ dropped the mmap_sem, or just
locked the page, etc.  It didn't seem worth the extra complication, so I just
broke the open coding out into its own helper.

> BTW I'm not sure this complication is really worth it. The "drop mmap_sem
> for IO" is never going to be a 100% thing, if nothing else because only one
> retry is allowed in do_user_addr_fault(). So the second time we get to
> filemap_fault(), we will not have FAULT_FLAG_ALLOW_RETRY set and thus do
> blocking locking. So I think your code needs to catch common cases you
> observe in practice but not those super-rare corner cases...

I had counters in all of these paths because I was sure some things weren't
getting hit at all, but it turns out each of these cases gets hit with
surprisingly high regularity.  The lock_page_or_retry() case in particular gets
hit a lot with multi-threaded applications that got paged out because of heavy
memory pressure.  By no means is it as high as just the normal readpage or
readahead cases, but it's not 0, so I'd rather have the extra helper here to
make sure we're never getting screwed.  Thanks,

Josef
Jan Kara Dec. 11, 2018, 9:40 a.m. UTC | #5
On Mon 10-12-18 13:44:39, Josef Bacik wrote:
> On Fri, Dec 07, 2018 at 12:01:38PM +0100, Jan Kara wrote:
> > On Fri 30-11-18 14:58:11, Josef Bacik wrote:
> > > @@ -2433,9 +2458,32 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> > >  			return vmf_error(-ENOMEM);
> > >  	}
> > >  
> > > -	if (!lock_page_or_retry(page, vmf->vma->vm_mm, vmf->flags)) {
> > > -		put_page(page);
> > > -		return ret | VM_FAULT_RETRY;
> > > +	/*
> > > +	 * We are open-coding lock_page_or_retry here because we want to do the
> > > +	 * readpage if necessary while the mmap_sem is dropped.  If there
> > > +	 * happens to be a lock on the page but it wasn't being faulted in we'd
> > > +	 * come back around without ALLOW_RETRY set and then have to do the IO
> > > +	 * under the mmap_sem, which would be a bummer.
> > > +	 */
> > 
> > Hum, lock_page_or_retry() has two callers and you've just killed one. I
> > think it would be better to modify the function to suit both callers rather
> > than opencoding? Maybe something like lock_page_maybe_drop_mmap() which
> > would unconditionally acquire the lock and return whether it has dropped
> > mmap sem or not? Callers can then decide what to do.
> > 
> 
> I tried this, but it ends up being convoluted: since swap doesn't have a file
> to pin, we have to add extra cases for that, and then change the return value
> to indicate whether we locked the page _and_ dropped the mmap_sem, or just
> locked the page, etc.  It didn't seem worth the extra complication, so I just
> broke the open coding out into its own helper.

OK. Thanks for looking into this!

> > BTW I'm not sure this complication is really worth it. The "drop mmap_sem
> > for IO" is never going to be a 100% thing, if nothing else because only one
> > retry is allowed in do_user_addr_fault(). So the second time we get to
> > filemap_fault(), we will not have FAULT_FLAG_ALLOW_RETRY set and thus do
> > blocking locking. So I think your code needs to catch common cases you
> > observe in practice but not those super-rare corner cases...
> 
> I had counters in all of these paths because I was sure some things weren't
> getting hit at all, but it turns out each of these cases gets hit with
> surprisingly high regularity. 

Cool! Could you share these counters? I'd be curious and they'd be actually
nice as a motivation in the changelog of this patch to show the benefit.

> The lock_page_or_retry() case in particular gets hit a lot with
> multi-threaded applications that got paged out because of heavy memory
> pressure.  By no means is it as high as just the normal readpage or
> readahead cases, but it's not 0, so I'd rather have the extra helper here
> to make sure we're never getting screwed.

Do you mean the case where the page is locked in filemap_fault() (so
that lock_page_or_retry() bails after waiting) and when the page becomes
unlocked it is not uptodate? Because that is the reason why you opencode
lock_page_or_retry(), right? I'm not aware of any normal code path that
would create a page in the page cache and not try to fill it with data
before unlocking it, so that's why I'm really trying to make sure we
understand each other.

								Honza
Josef Bacik Dec. 11, 2018, 4:08 p.m. UTC | #6
On Tue, Dec 11, 2018 at 10:40:34AM +0100, Jan Kara wrote:
> On Mon 10-12-18 13:44:39, Josef Bacik wrote:
> > On Fri, Dec 07, 2018 at 12:01:38PM +0100, Jan Kara wrote:
> > > On Fri 30-11-18 14:58:11, Josef Bacik wrote:
> > > > @@ -2433,9 +2458,32 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> > > >  			return vmf_error(-ENOMEM);
> > > >  	}
> > > >  
> > > > -	if (!lock_page_or_retry(page, vmf->vma->vm_mm, vmf->flags)) {
> > > > -		put_page(page);
> > > > -		return ret | VM_FAULT_RETRY;
> > > > +	/*
> > > > +	 * We are open-coding lock_page_or_retry here because we want to do the
> > > > +	 * readpage if necessary while the mmap_sem is dropped.  If there
> > > > +	 * happens to be a lock on the page but it wasn't being faulted in we'd
> > > > +	 * come back around without ALLOW_RETRY set and then have to do the IO
> > > > +	 * under the mmap_sem, which would be a bummer.
> > > > +	 */
> > > 
> > > Hum, lock_page_or_retry() has two callers and you've just killed one. I
> > > think it would be better to modify the function to suit both callers rather
> > > than opencoding? Maybe something like lock_page_maybe_drop_mmap() which
> > > would unconditionally acquire the lock and return whether it has dropped
> > > mmap sem or not? Callers can then decide what to do.
> > > 
> > 
> > I tried this, but it ends up being convoluted: since swap doesn't have a file
> > to pin, we have to add extra cases for that, and then change the return value
> > to indicate whether we locked the page _and_ dropped the mmap_sem, or just
> > locked the page, etc.  It didn't seem worth the extra complication, so I just
> > broke the open coding out into its own helper.
> 
> OK. Thanks for looking into this!
> 
> > > BTW I'm not sure this complication is really worth it. The "drop mmap_sem
> > > for IO" is never going to be a 100% thing, if nothing else because only one
> > > retry is allowed in do_user_addr_fault(). So the second time we get to
> > > filemap_fault(), we will not have FAULT_FLAG_ALLOW_RETRY set and thus do
> > > blocking locking. So I think your code needs to catch common cases you
> > > observe in practice but not those super-rare corner cases...
> > 
> > I had counters in all of these paths because I was sure some things weren't
> > getting hit at all, but it turns out each of these cases gets hit with
> > surprisingly high regularity. 
> 
> Cool! Could you share these counters? I'd be curious and they'd be actually
> nice as a motivation in the changelog of this patch to show the benefit.
> 

Hmm, I can't seem to find anything other than the scratch txt file I used to
keep track of stuff, but the un-labeled numbers are 18953 and 879.  I assume
the largest number is the number of times we went through the readpage path
where we weren't doing any mmap_sem dropping, and the 879 was for the
lock_page path, but I have no way of knowing at this point.

> > The lock_page_or_retry() case in particular gets hit a lot with
> > multi-threaded applications that got paged out because of heavy memory
> > pressure.  By no means is it as high as just the normal readpage or
> > readahead cases, but it's not 0, so I'd rather have the extra helper here
> > to make sure we're never getting screwed.
> 
> Do you mean the case where the page is locked in filemap_fault() (so
> that lock_page_or_retry() bails after waiting) and when the page becomes
> unlocked it is not uptodate? Because that is the reason why you opencode
> lock_page_or_retry(), right? I'm not aware of any normal code path that
> would create a page in the page cache and not try to fill it with data
> before unlocking it, so that's why I'm really trying to make sure we
> understand each other.

Uhh, so that's embarrassing.  We have an internal patchset that I thought was
upstream but hasn't come along yet.  Basically, before this patchset the way
we dealt with this problem was to short-circuit readahead IOs by checking to
see if the blkcg was congested (or if there was a fatal signal pending) and
doing bio_wouldblock_error on the bio.  So this very case came up a lot:
readahead would go through because it got in before we were congested, but
would then get throttled, and then once the throttling was over would get
aborted.  Other threads would run into these pages that had been locked, but
they were never read in, which means they waited for the lock to be dropped,
did the VM_FAULT_RETRY, came back unable to drop the mmap_sem, did the actual
readpage(), and would get throttled.
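
(For illustration, the short-circuit being described would look roughly like
this -- a hypothetical sketch, not the actual internal patch:

	/* Abort readahead bios instead of throttling them. */
	if ((bio->bi_opf & REQ_RAHEAD) &&
	    (blk_cgroup_congested() || fatal_signal_pending(current))) {
		bio_wouldblock_error(bio);
		return;
	}

so pages allocated and locked for readahead can end up unlocked but never
uptodate, which is exactly the case the open-coded page locking handles.)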

This means this particular part of the patch isn't helpful for upstream at
the moment, but now that I know these patches aren't upstream yet, that'll
be my next project, so I'd like to keep this bit here as it is.  Thanks,

Josef
Jan Kara Dec. 11, 2018, 4:38 p.m. UTC | #7
On Tue 11-12-18 11:08:53, Josef Bacik wrote:
> On Tue, Dec 11, 2018 at 10:40:34AM +0100, Jan Kara wrote:
> > > The lock_page_or_retry() case in particular gets hit a lot with
> > > multi-threaded applications that got paged out because of heavy memory
> > > pressure.  By no means is it as high as just the normal readpage or
> > > readahead cases, but it's not 0, so I'd rather have the extra helper here
> > > to make sure we're never getting screwed.
> > 
> > Do you mean the case where the page is locked in filemap_fault() (so
> > that lock_page_or_retry() bails after waiting) and when the page becomes
> > unlocked it is not uptodate? Because that is the reason why you opencode
> > lock_page_or_retry(), right? I'm not aware of any normal code path that
> > would create a page in the page cache and not try to fill it with data
> > before unlocking it, so that's why I'm really trying to make sure we
> > understand each other.
> 
> Uhh, so that's embarrassing.  We have an internal patchset that I thought
> was upstream but hasn't come along yet.  Basically, before this patchset
> the way we dealt with this problem was to short-circuit readahead IOs by
> checking to see if the blkcg was congested (or if there was a fatal
> signal pending) and doing bio_wouldblock_error on the bio.  So this very
> case came up a lot: readahead would go through because it got in before
> we were congested, but would then get throttled, and then once the
> throttling was over would get aborted.  Other threads would run into
> these pages that had been locked, but they were never read in, which
> means they waited for the lock to be dropped, did the VM_FAULT_RETRY,
> came back unable to drop the mmap_sem, did the actual readpage(), and
> would get throttled.

OK, I'm somewhat unsure why we throttle on bios that actually get aborted,
but that's a separate discussion over different patches. Overall it makes
sense that some submitted readahead may actually get aborted on congestion
and thus unlocked pages will not be uptodate. So I agree that this case is
actually reasonably likely to happen. Just please mention a case like aborted
readahead in the comment so that we don't wonder about the reason again in a
few years.
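
Something along these lines appended to the comment would do (a sketch):

	/*
	 * ... If there happens to be a lock on the page but it wasn't being
	 * faulted in -- e.g. the readahead against it was aborted on blkcg
	 * congestion -- we'd come back around without ALLOW_RETRY set and
	 * then have to do the IO under the mmap_sem.
	 */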

								Honza

Patch

diff --git a/mm/filemap.c b/mm/filemap.c
index f068712c2525..5e76b24b2a0f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2304,28 +2304,44 @@  EXPORT_SYMBOL(generic_file_read_iter);
 
 #ifdef CONFIG_MMU
 #define MMAP_LOTSAMISS  (100)
+static struct file *maybe_unlock_mmap_for_io(struct file *fpin,
+					     struct vm_area_struct *vma,
+					     int flags)
+{
+	if (fpin)
+		return fpin;
+	if ((flags & (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT)) ==
+	    FAULT_FLAG_ALLOW_RETRY) {
+		fpin = get_file(vma->vm_file);
+		up_read(&vma->vm_mm->mmap_sem);
+	}
+	return fpin;
+}
 
 /*
  * Synchronous readahead happens when we don't even find
  * a page in the page cache at all.
  */
-static void do_sync_mmap_readahead(struct vm_area_struct *vma,
-				   struct file_ra_state *ra,
-				   struct file *file,
-				   pgoff_t offset)
+static struct file *do_sync_mmap_readahead(struct vm_area_struct *vma,
+					   struct file_ra_state *ra,
+					   struct file *file,
+					   pgoff_t offset,
+					   int flags)
 {
 	struct address_space *mapping = file->f_mapping;
+	struct file *fpin = NULL;
 
 	/* If we don't want any read-ahead, don't bother */
 	if (vma->vm_flags & VM_RAND_READ)
-		return;
+		return fpin;
 	if (!ra->ra_pages)
-		return;
+		return fpin;
 
 	if (vma->vm_flags & VM_SEQ_READ) {
+		fpin = maybe_unlock_mmap_for_io(fpin, vma, flags);
 		page_cache_sync_readahead(mapping, ra, file, offset,
 					  ra->ra_pages);
-		return;
+		return fpin;
 	}
 
 	/* Avoid banging the cache line if not needed */
@@ -2337,37 +2353,43 @@  static void do_sync_mmap_readahead(struct vm_area_struct *vma,
 	 * stop bothering with read-ahead. It will only hurt.
 	 */
 	if (ra->mmap_miss > MMAP_LOTSAMISS)
-		return;
+		return fpin;
 
 	/*
 	 * mmap read-around
 	 */
+	fpin = maybe_unlock_mmap_for_io(fpin, vma, flags);
 	ra->start = max_t(long, 0, offset - ra->ra_pages / 2);
 	ra->size = ra->ra_pages;
 	ra->async_size = ra->ra_pages / 4;
 	ra_submit(ra, mapping, file);
+	return fpin;
 }
 
 /*
  * Asynchronous readahead happens when we find the page and PG_readahead,
  * so we want to possibly extend the readahead further..
  */
-static void do_async_mmap_readahead(struct vm_area_struct *vma,
-				    struct file_ra_state *ra,
-				    struct file *file,
-				    struct page *page,
-				    pgoff_t offset)
+static struct file *do_async_mmap_readahead(struct vm_area_struct *vma,
+					    struct file_ra_state *ra,
+					    struct file *file,
+					    struct page *page,
+					    pgoff_t offset, int flags)
 {
 	struct address_space *mapping = file->f_mapping;
+	struct file *fpin = NULL;
 
 	/* If we don't want any read-ahead, don't bother */
 	if (vma->vm_flags & VM_RAND_READ)
-		return;
+		return fpin;
 	if (ra->mmap_miss > 0)
 		ra->mmap_miss--;
-	if (PageReadahead(page))
+	if (PageReadahead(page)) {
+		fpin = maybe_unlock_mmap_for_io(fpin, vma, flags);
 		page_cache_async_readahead(mapping, ra, file,
 					   page, offset, ra->ra_pages);
+	}
+	return fpin;
 }
 
 /**
@@ -2397,6 +2419,7 @@  vm_fault_t filemap_fault(struct vm_fault *vmf)
 {
 	int error;
 	struct file *file = vmf->vma->vm_file;
+	struct file *fpin = NULL;
 	struct address_space *mapping = file->f_mapping;
 	struct file_ra_state *ra = &file->f_ra;
 	struct inode *inode = mapping->host;
@@ -2418,10 +2441,12 @@  vm_fault_t filemap_fault(struct vm_fault *vmf)
 		 * We found the page, so try async readahead before
 		 * waiting for the lock.
 		 */
-		do_async_mmap_readahead(vmf->vma, ra, file, page, offset);
+		fpin = do_async_mmap_readahead(vmf->vma, ra, file, page, offset,
+					       vmf->flags);
 	} else if (!page) {
 		/* No page in the page cache at all */
-		do_sync_mmap_readahead(vmf->vma, ra, file, offset);
+		fpin = do_sync_mmap_readahead(vmf->vma, ra, file, offset,
+					      vmf->flags);
 		count_vm_event(PGMAJFAULT);
 		count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
 		ret = VM_FAULT_MAJOR;
@@ -2433,9 +2458,32 @@  vm_fault_t filemap_fault(struct vm_fault *vmf)
 			return vmf_error(-ENOMEM);
 	}
 
-	if (!lock_page_or_retry(page, vmf->vma->vm_mm, vmf->flags)) {
-		put_page(page);
-		return ret | VM_FAULT_RETRY;
+	/*
+	 * We are open-coding lock_page_or_retry here because we want to do the
+	 * readpage if necessary while the mmap_sem is dropped.  If there
+	 * happens to be a lock on the page but it wasn't being faulted in we'd
+	 * come back around without ALLOW_RETRY set and then have to do the IO
+	 * under the mmap_sem, which would be a bummer.
+	 */
+	if (!trylock_page(page)) {
+		fpin = maybe_unlock_mmap_for_io(fpin, vmf->vma, vmf->flags);
+		if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
+			goto out_retry;
+		if (vmf->flags & FAULT_FLAG_KILLABLE) {
+			if (__lock_page_killable(page)) {
+				/*
+				 * If we don't have the right flags for
+				 * maybe_unlock_mmap_for_io to do it's thing we
+				 * still need to drop the sem and return
+				 * VM_FAULT_RETRY so the upper layer checks the
+				 * signal and takes the appropriate action.
+				 */
+				if (!fpin)
+					up_read(&vmf->vma->vm_mm->mmap_sem);
+				goto out_retry;
+			}
+		} else
+			__lock_page(page);
 	}
 
 	/* Did it get truncated? */
@@ -2453,6 +2501,16 @@  vm_fault_t filemap_fault(struct vm_fault *vmf)
 	if (unlikely(!PageUptodate(page)))
 		goto page_not_uptodate;
 
+	/*
+	 * We've made it this far and we had to drop our mmap_sem, now is the
+	 * time to return to the upper layer and have it re-find the vma and
+	 * redo the fault.
+	 */
+	if (fpin) {
+		unlock_page(page);
+		goto out_retry;
+	}
+
 	/*
 	 * Found the page and have a reference on it.
 	 * We must recheck i_size under page lock.
@@ -2475,12 +2533,15 @@  vm_fault_t filemap_fault(struct vm_fault *vmf)
 	 * and we need to check for errors.
 	 */
 	ClearPageError(page);
+	fpin = maybe_unlock_mmap_for_io(fpin, vmf->vma, vmf->flags);
 	error = mapping->a_ops->readpage(file, page);
 	if (!error) {
 		wait_on_page_locked(page);
 		if (!PageUptodate(page))
 			error = -EIO;
 	}
+	if (fpin)
+		goto out_retry;
 	put_page(page);
 
 	if (!error || error == AOP_TRUNCATED_PAGE)
@@ -2489,6 +2550,18 @@  vm_fault_t filemap_fault(struct vm_fault *vmf)
 	/* Things didn't work out. Return zero to tell the mm layer so. */
 	shrink_readahead_size_eio(file, ra);
 	return VM_FAULT_SIGBUS;
+
+out_retry:
+	/*
+	 * We dropped the mmap_sem, we need to return to the fault handler to
+	 * re-find the vma and come back and find our hopefully still populated
+	 * page.
+	 */
+	if (page)
+		put_page(page);
+	if (fpin)
+		fput(fpin);
+	return ret | VM_FAULT_RETRY;
 }
 EXPORT_SYMBOL(filemap_fault);