
[2/7] mm: Protect operations adding pages to page cache with i_mapping_lock

Message ID 20210413112859.32249-2-jack@suse.cz (mailing list archive)
State New, archived
Series fs: Hole punch vs page cache filling races

Commit Message

Jan Kara April 13, 2021, 11:28 a.m. UTC
Currently, serializing operations such as page fault, read, or readahead
against hole punching is rather difficult. The basic race scheme is
like:

fallocate(FALLOC_FL_PUNCH_HOLE)			read / fault / ..
  truncate_inode_pages_range()
						  <create pages in page
						   cache here>
  <update fs block mapping and free blocks>

Now the problem is that this way read / page fault / readahead can
instantiate pages in the page cache with potentially stale data (if blocks
get quickly reused). Avoiding this race is not simple - page locks do
not work because we want to make sure there are *no* pages in the given
range. inode->i_rwsem does not work because page faults happen under
mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes
the performance of mixed read-write workloads suffer.

So create a new rw_semaphore in the inode - i_mapping_sem - that
protects adding of pages to page cache for page faults / reads /
readahead.
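
The intended usage on the filesystem side is that the hole punching path takes
the new semaphore exclusively around both the page cache invalidation and the
block mapping update, while the filling paths above take it shared. A minimal
sketch of such a hole punch path (the myfs_* names are purely illustrative and
not part of this patch):

static long myfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
{
	long error;

	/* The caller is expected to already hold i_rwsem to exclude writes. */
	down_write(&inode->i_mapping_sem);
	/* Throw away cached pages over the punched range... */
	truncate_inode_pages_range(inode->i_mapping, offset,
				   offset + len - 1);
	/*
	 * ...and update the block mapping / free the blocks while reads,
	 * faults, and readahead are still blocked, so they cannot
	 * reinstantiate pages backed by now-freed (possibly reused) blocks.
	 */
	error = myfs_remove_blocks(inode, offset, len);
	up_write(&inode->i_mapping_sem);
	return error;
}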

Signed-off-by: Jan Kara <jack@suse.cz>
---
 Documentation/filesystems/locking.rst | 38 ++++++++++++++------
 fs/inode.c                            |  3 ++
 include/linux/fs.h                    |  2 ++
 mm/filemap.c                          | 51 ++++++++++++++++++++++++---
 mm/readahead.c                        |  2 ++
 mm/rmap.c                             | 37 +++++++++----------
 mm/truncate.c                         |  2 +-
 7 files changed, 101 insertions(+), 34 deletions(-)

Comments

Christoph Hellwig April 13, 2021, 12:57 p.m. UTC | #1
>  	if (error == AOP_TRUNCATED_PAGE)
>  		put_page(page);
> +	up_read(&mapping->host->i_mapping_sem);
>  	return error;

Please add an unlock_mapping label above this up_read and consolidate
most of the other unlocks by jumping there (put_and_wait_on_page_locked
probably can't use it).

>  truncated:
>  	unlock_page(page);
> @@ -2309,6 +2324,7 @@ static int filemap_update_page(struct kiocb *iocb,
>  	return AOP_TRUNCATED_PAGE;

The truncated case actually seems to miss the unlock.

Similarly I think filemap_fault would benefit from a common
unlock path.
Jan Kara April 13, 2021, 1:56 p.m. UTC | #2
On Tue 13-04-21 13:57:46, Christoph Hellwig wrote:
> >  	if (error == AOP_TRUNCATED_PAGE)
> >  		put_page(page);
> > +	up_read(&mapping->host->i_mapping_sem);
> >  	return error;
> 
> Please add an unlock_mapping label above this up_read and consolidate
> most of the other unlocks by jumping there (put_and_wait_on_page_locked
> probably can't use it).

Yeah, I've actually simplified the labels even a bit more like:

...
        error = filemap_read_page(iocb->ki_filp, mapping, page);
        goto unlock_mapping;
unlock:
        unlock_page(page);
unlock_mapping:
        up_read(&mapping->host->i_mapping_sem);
        if (error == AOP_TRUNCATED_PAGE)
                put_page(page);
        return error;

and everything now jumps to either unlock or unlock_mapping (except for the
put_and_wait_on_page_locked() case).
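
With that, the truncated case that was missing the unlock just funnels into
the same exit, roughly like this (a sketch, not necessarily the exact final
code):

        error = AOP_TRUNCATED_PAGE;
        if (!page->mapping)
                goto unlock;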

> >  truncated:
> >  	unlock_page(page);
> > @@ -2309,6 +2324,7 @@ static int filemap_update_page(struct kiocb *iocb,
> >  	return AOP_TRUNCATED_PAGE;
> 
> The truncated case actually seems to miss the unlock.
> 
> Similarly I think filemap_fault would benefit from a common
> unlock path.

Right, thanks for catching that!

								Honza
Dave Chinner April 14, 2021, 12:01 a.m. UTC | #3
On Tue, Apr 13, 2021 at 01:28:46PM +0200, Jan Kara wrote:
> Currently, serializing operations such as page fault, read, or readahead
> against hole punching is rather difficult. The basic race scheme is
> like:
> 
> fallocate(FALLOC_FL_PUNCH_HOLE)			read / fault / ..
>   truncate_inode_pages_range()
> 						  <create pages in page
> 						   cache here>
>   <update fs block mapping and free blocks>
> 
> Now the problem is that this way read / page fault / readahead can
> instantiate pages in the page cache with potentially stale data (if blocks
> get quickly reused). Avoiding this race is not simple - page locks do
> not work because we want to make sure there are *no* pages in the given
> range. inode->i_rwsem does not work because page faults happen under
> mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes
> the performance of mixed read-write workloads suffer.
> 
> So create a new rw_semaphore in the inode - i_mapping_sem - that
> protects adding of pages to page cache for page faults / reads /
> readahead.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
....
> diff --git a/mm/filemap.c b/mm/filemap.c
> index bd7c50e060a9..bc82a7856d3e 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -77,7 +77,8 @@
>   *        ->i_pages lock
>   *
>   *  ->i_rwsem
> - *    ->i_mmap_rwsem		(truncate->unmap_mapping_range)
> + *    ->i_mapping_sem		(acquired by fs in truncate path)
> + *      ->i_mmap_rwsem		(truncate->unmap_mapping_range)

This is now officially a confusing mess.

We have inode->i_mapping that points to an address space, which has
an internal i_mmap variable and an i_mmap_rwsem that serialises
access to the i_mmap tree.

Now we have an inode->i_mapping_sem (which is actually an rwsem, not a
sem) that sorta serialises additions to the mapping tree
(inode->i_mapping->i_pages) against truncate, but it does not
serialise accesses to the rest of the inode->i_mapping structure
itself despite the similarities in naming.

Then we have the inode_lock() API and the i_mmap_lock() API that
wrap around the i_rwsem and i_mmap_rwsem, but there's no API for
inode_mapping_lock()...

Then we have the mmap_lock in the page fault path as well, which is
also an rwsem despite the name, and that protects something
completely different to the i_mmap and the i_mapping.

IOWs, we have 4 layers of entwined structures and locking here
that pretty much all have the same name but protect different things
and not always the obvious thing the name suggests. This makes it
really difficult to actually read the code and understand that the
correct lock is being used in the correct place...


>   *
>   *  ->mmap_lock
>   *    ->i_mmap_rwsem
> @@ -85,7 +86,8 @@
>   *        ->i_pages lock	(arch-dependent flush_dcache_mmap_lock)
>   *
>   *  ->mmap_lock
> - *    ->lock_page		(access_process_vm)
> + *    ->i_mapping_sem		(filemap_fault)
> + *      ->lock_page		(filemap_fault, access_process_vm)
>   *
>   *  ->i_rwsem			(generic_perform_write)
>   *    ->mmap_lock		(fault_in_pages_readable->do_page_fault)
> @@ -2276,16 +2278,28 @@ static int filemap_update_page(struct kiocb *iocb,
>  {
>  	int error;
>  
> +	if (iocb->ki_flags & IOCB_NOWAIT) {
> +		if (!down_read_trylock(&mapping->host->i_mapping_sem))
> +			return -EAGAIN;
> +	} else {
> +		down_read(&mapping->host->i_mapping_sem);
> +	}

We really need a lock primitive for this. The number of times this
exact lock pattern is being replicated all through the IO path is
getting out of hand.

static inline bool
down_read_try_or_lock(struct rw_semaphore *sem, bool try)
{
	if (try) {
		if (!down_read_trylock(sem))
			return false;
	} else {
		down_read(sem);
	}
	return true;
}

and the callers become:

	if (!down_read_try_or_lock(sem, (iocb->ki_flags & IOCB_NOWAIT)))
		return -EAGAIN;

We can do the same with mutex_try_or_lock(), down_try_or_lock(), etc
and we don't need to rely on cargo cult knowledge to propagate this
pattern anymore. Because I'm betting relatively few people actually
know why the code is written this way because the only place it is
documented is in an XFS commit message....

Doing this is a separate cleanup, though, and not something that
needs to be done in this patchset.

> index c5b0457415be..ac5bb50b3a4c 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -192,6 +192,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>  	 */
>  	unsigned int nofs = memalloc_nofs_save();
>  
> +	down_read(&mapping->host->i_mapping_sem);
>  	/*
>  	 * Preallocate as many pages as we will need.
>  	 */

I can't say I'm a great fan of having the mapping reach back up to
the host to lock the host. This seems the wrong way around to me
given that most of the locking in the IO path is in "host locks
mapping" and "mapping locks internal mapping structures" order...

I also come back to the naming confusion here, in that when we look
at this in long hand from the inode perspective, this chain actually
looks like:

	lock(inode->i_mapping->inode->i_mapping_sem)

i.e. the mapping is reaching back up outside its scope to lock
itself against other inode->i_mapping operations. Smells of layering
violations to me.

So, next question: should this truncate semaphore actually be part
of the address space, not the inode? This patch is actually moving
the page fault serialisation from the inode into the address space
operations when page faults and page cache operations are done, so
maybe the lock should also make that move? That would help clear up
the naming problem, because now we can name it based around what it
serialises in the address space, not the address space as a whole...

Cheers,

Dave.
Jan Kara April 14, 2021, 12:23 p.m. UTC | #4
On Wed 14-04-21 10:01:13, Dave Chinner wrote:
> On Tue, Apr 13, 2021 at 01:28:46PM +0200, Jan Kara wrote:
> >   *
> >   *  ->mmap_lock
> >   *    ->i_mmap_rwsem
> > @@ -85,7 +86,8 @@
> >   *        ->i_pages lock	(arch-dependent flush_dcache_mmap_lock)
> >   *
> >   *  ->mmap_lock
> > - *    ->lock_page		(access_process_vm)
> > + *    ->i_mapping_sem		(filemap_fault)
> > + *      ->lock_page		(filemap_fault, access_process_vm)
> >   *
> >   *  ->i_rwsem			(generic_perform_write)
> >   *    ->mmap_lock		(fault_in_pages_readable->do_page_fault)
> > @@ -2276,16 +2278,28 @@ static int filemap_update_page(struct kiocb *iocb,
> >  {
> >  	int error;
> >  
> > +	if (iocb->ki_flags & IOCB_NOWAIT) {
> > +		if (!down_read_trylock(&mapping->host->i_mapping_sem))
> > +			return -EAGAIN;
> > +	} else {
> > +		down_read(&mapping->host->i_mapping_sem);
> > +	}
> 
> We really need a lock primitive for this. The number of times this
> exact lock pattern is being replicated all through the IO path is
> getting out of hand.
> 
> static inline bool
> down_read_try_or_lock(struct rw_semaphore *sem, bool try)
> {
> 	if (try) {
> 		if (!down_read_trylock(sem))
> 			return false;
> 	} else {
> 		down_read(sem);
> 	}
> 	return true;
> }
> 
> and the callers become:
> 
> 	if (!down_read_try_or_lock(sem, (iocb->ki_flags & IOCB_NOWAIT)))
> 		return -EAGAIN;
> 
> We can do the same with mutex_try_or_lock(), down_try_or_lock(), etc
> and we don't need to rely on cargo cult knowledge to propagate this
> pattern anymore. Because I'm betting relatively few people actually
> know why the code is written this way because the only place it is
> documented is in an XFS commit message....
> 
> Doing this is a separate cleanup, though, and not something that
> needs to be done in this patchset.

Yep, good idea but let's do it in a separate patch set.

> > index c5b0457415be..ac5bb50b3a4c 100644
> > --- a/mm/readahead.c
> > +++ b/mm/readahead.c
> > @@ -192,6 +192,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> >  	 */
> >  	unsigned int nofs = memalloc_nofs_save();
> >  
> > +	down_read(&mapping->host->i_mapping_sem);
> >  	/*
> >  	 * Preallocate as many pages as we will need.
> >  	 */
> 
> I can't say I'm a great fan of having the mapping reach back up to
> the host to lock the host. This seems the wrong way around to me
> given that most of the locking in the IO path is in "host locks
> mapping" and "mapping locks internal mapping structures" order...
> 
> I also come back to the naming confusion here, in that when we look
> at this in long hand from the inode perspective, this chain actually
> looks like:
> 
> 	lock(inode->i_mapping->inode->i_mapping_sem)
> 
> i.e. the mapping is reaching back up outside its scope to lock
> itself against other inode->i_mapping operations. Smells of layering
> violations to me.
> 
> So, next question: should this truncate semaphore actually be part
> of the address space, not the inode? This patch is actually moving
> the page fault serialisation from the inode into the address space
> operations when page faults and page cache operations are done, so
> maybe the lock should also make that move? That would help clear up
> the naming problem, because now we can name it based around what it
> serialises in the address space, not the address space as a whole...

I think that moving the lock to address_space makes some sense, although the
lock actually protects consistency of inode->i_mapping->i_pages with
whatever the filesystem has in its file_offset->disk_block mapping
structures (which are generally associated with the inode). So it is not
only about inode->i_mapping contents but I agree that struct address_space
is probably a bit more logical place than struct inode.

Regarding the name: How about i_pages_rwsem? The lock is protecting
invalidation of mapping->i_pages and needs to be held until insertion of
pages into i_pages is safe again...

								Honza
Dave Chinner April 14, 2021, 9:57 p.m. UTC | #5
On Wed, Apr 14, 2021 at 02:23:19PM +0200, Jan Kara wrote:
> On Wed 14-04-21 10:01:13, Dave Chinner wrote:
> > On Tue, Apr 13, 2021 at 01:28:46PM +0200, Jan Kara wrote:
> > > index c5b0457415be..ac5bb50b3a4c 100644
> > > --- a/mm/readahead.c
> > > +++ b/mm/readahead.c
> > > @@ -192,6 +192,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> > >  	 */
> > >  	unsigned int nofs = memalloc_nofs_save();
> > >  
> > > +	down_read(&mapping->host->i_mapping_sem);
> > >  	/*
> > >  	 * Preallocate as many pages as we will need.
> > >  	 */
> > 
> > I can't say I'm a great fan of having the mapping reach back up to
> > the host to lock the host. This seems the wrong way around to me
> > given that most of the locking in the IO path is in "host locks
> > mapping" and "mapping locks internal mapping structures" order...
> > 
> > I also come back to the naming confusion here, in that when we look
> > at this in long hand from the inode perspective, this chain actually
> > looks like:
> > 
> > 	lock(inode->i_mapping->inode->i_mapping_sem)
> > 
> > i.e. the mapping is reaching back up outside its scope to lock
> > itself against other inode->i_mapping operations. Smells of layering
> > violations to me.
> > 
> > So, next question: should this truncate semaphore actually be part
> > of the address space, not the inode? This patch is actually moving
> > the page fault serialisation from the inode into the address space
> > operations when page faults and page cache operations are done, so
> > maybe the lock should also make that move? That would help clear up
> > the naming problem, because now we can name it based around what it
> > serialises in the address space, not the address space as a whole...
> 
> I think that moving the lock to address_space makes some sense, although the
> lock actually protects consistency of inode->i_mapping->i_pages with
> whatever the filesystem has in its file_offset->disk_block mapping
> structures (which are generally associated with the inode).

Well, I look at it as a mechanism that the filesystem uses to ensure
coherency of the page cache accesses w.r.t. physical layout changes.
The layout is a property of the inode, but changes to the physical
layout of the inode are serialised by other inode based mechanisms.
The page cache isn't part of the inode - it's part of the address
space - but coherency with the inode is required. Hence inode
operations need to be able to ensure coherency of the address space
content and accesses w.r.t. physical layout changes of the inode,
but the address space really knows nothing about the physical layout
of the inode or how it gets changed...

Hence it's valid for the inode operations to lock the address space
to ensure coherency of the page cache when making physical layout
changes, but locking the address space, by itself, is not sufficient
to safely serialise against physical changes to the inode layout.

> So it is not
> only about inode->i_mapping contents but I agree that struct address_space
> is probably a bit more logical place than struct inode.

Yup. Remember that the XFS_MMAPLOCK arose at the inode level because
that was the only way the filesystem could achieve the necessary
serialisation of page cache accesses whilst doing physical layout
changes. So the lock became an "inode property" because of
implementation constraints, not because it was the best way to
implement the necessary coherency hooks.

> Regarding the name: How about i_pages_rwsem? The lock is protecting
> invalidation of mapping->i_pages and needs to be held until insertion of
> pages into i_pages is safe again...

I don't actually have a good name for this right now. :(

The i_pages structure has its own internal locking, so
i_pages_rwsem implies things that aren't necessarily true, and
taking a read lock for insertion for something that is named like a
structure protection lock creates cognitive dissonance...

I keep wanting to say "lock for invalidation" and "lock to exclude
invalidation" because those are the two actions that we need for
coherency of operations. But they are way too verbose for an actual
API...

So I want to call this an "invalidation lock" of some kind (no need
to encode the type in the name!), but haven't worked out a good
shorthand for "address space invalidation coherency mechanism"...

Naming is hard. :/

Cheers,

Dave.
Matthew Wilcox April 14, 2021, 10:25 p.m. UTC | #6
On Wed, Apr 14, 2021 at 10:01:13AM +1000, Dave Chinner wrote:
> > +	if (iocb->ki_flags & IOCB_NOWAIT) {
> > +		if (!down_read_trylock(&mapping->host->i_mapping_sem))
> > +			return -EAGAIN;
> > +	} else {
> > +		down_read(&mapping->host->i_mapping_sem);
> > +	}
> 
> We really need a lock primitive for this. The number of times this
> exact lock pattern is being replicated all through the IO path is
> getting out of hand.
> 
> static inline bool
> down_read_try_or_lock(struct rwsem *sem, bool try)
> {
> 	if (try) {
> 		if (!down_read_trylock(sem))
> 			return false;
> 	} else {
> 		down_read(&mapping->host->i_mapping_sem);
> 	}
> 	return true;
> }
> 
> and the callers become:
> 
> 	if (!down_read_try_or_lock(sem, (iocb->ki_flags & IOCB_NOWAIT)))
> 		return -EAGAIN;

I think that should be written:

	error = iocb_read_lock(iocb, &rwsem);
	if (error)
		return error;

and implemented as:

static inline int iocb_read_lock(struct kiocb *iocb, struct rw_semaphore *sem)
{
	if (iocb->ki_flags & IOCB_NOWAIT)
		return down_read_trylock(sem) ? 0 : -EAGAIN;
	return down_read_killable(sem);
}
Dave Chinner April 15, 2021, 2:05 a.m. UTC | #7
On Wed, Apr 14, 2021 at 11:25:31PM +0100, Matthew Wilcox wrote:
> On Wed, Apr 14, 2021 at 10:01:13AM +1000, Dave Chinner wrote:
> > > +	if (iocb->ki_flags & IOCB_NOWAIT) {
> > > +		if (!down_read_trylock(&mapping->host->i_mapping_sem))
> > > +			return -EAGAIN;
> > > +	} else {
> > > +		down_read(&mapping->host->i_mapping_sem);
> > > +	}
> > 
> > We really need a lock primitive for this. The number of times this
> > exact lock pattern is being replicated all through the IO path is
> > getting out of hand.
> > 
> > static inline bool
> > down_read_try_or_lock(struct rw_semaphore *sem, bool try)
> > {
> > 	if (try) {
> > 		if (!down_read_trylock(sem))
> > 			return false;
> > 	} else {
> > 		down_read(sem);
> > 	}
> > 	return true;
> > }
> > 
> > and the callers become:
> > 
> > 	if (!down_read_try_or_lock(sem, (iocb->ki_flags & IOCB_NOWAIT)))
> > 		return -EAGAIN;
> 
> I think that should be written:
> 
> 	error = iocb_read_lock(iocb, &rwsem);
> 	if (error)
> 		return error;
> 
> and implemented as:
> 
> static inline int iocb_read_lock(struct kiocb *iocb, struct rw_semaphore *sem)
> {
> 	if (iocb->ki_flags & IOCB_NOWAIT)
> 		return down_read_trylock(sem) ? 0 : -EAGAIN;
> 	return down_read_killable(sem);
> }

Yup, we already have done that with xfs_ilock_iocb(), but my point
is that this "non blocking try lock or lock" pattern is slowly being
used in more places than just IOCB_NOWAIT situations.  e.g. We use
it for IOMAP_NOWAIT locking in XFS, too, and ISTR other places where
optimistic locking is used are replicating it, too.
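
For reference, that XFS helper looks roughly like this (from memory, so
treat it as a sketch rather than the exact upstream code):

static int
xfs_ilock_iocb(struct kiocb *iocb, unsigned int lock_mode)
{
	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));

	if (iocb->ki_flags & IOCB_NOWAIT) {
		if (!xfs_ilock_nowait(ip, lock_mode))
			return -EAGAIN;
	} else {
		xfs_ilock(ip, lock_mode);
	}
	return 0;
}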

Hence my suggestion that it is moved up into the locking primitives
rather than merely adding context-specific wrappers...

Cheers,

Dave.
Jan Kara April 15, 2021, 1:11 p.m. UTC | #8
On Thu 15-04-21 07:57:39, Dave Chinner wrote:
> On Wed, Apr 14, 2021 at 02:23:19PM +0200, Jan Kara wrote:
> > Regarding the name: How about i_pages_rwsem? The lock is protecting
> > invalidation of mapping->i_pages and needs to be held until insertion of
> > pages into i_pages is safe again...
> 
> I don't actually have a good name for this right now. :(
> 
> The i_pages structure has it's own internal locking, so
> i_pages_rwsem implies things that aren't necessarily true, and
> taking a read lock for insertion for something that is named like a
> structure protection lock creates cognitive dissonance...
> 
> I keep wanting to say "lock for invalidation" and "lock to exclude
> invalidation" because those are the two actions that we need for
> coherency of operations. But they are way too verbose for an actual
> API...
> 
> So I want to call this an "invalidation lock" of some kind (no need
> to encode the type in the name!), but haven't worked out a good
> shorthand for "address space invalidation coherency mechanism"...

So "invalidate_lock" was just next on my list of things to suggest so I'm
fine with that name. Or maybe block_invalidate_lock, block_remove_lock,
map_remove_lock, ... Dunno :).
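
To make that concrete, the lock could move into struct address_space as a new
rw_semaphore member (say "invalidate_lock") with helpers named after what they
serialise - just a sketch, none of these names are settled:

/*
 * Sketch: "invalidate_lock" would be a new rw_semaphore in struct
 * address_space, with helpers named after what the lock serialises.
 */
static inline void filemap_invalidate_lock(struct address_space *mapping)
{
	/* Taken exclusively around truncate / hole punch style invalidation. */
	down_write(&mapping->invalidate_lock);
}

static inline void filemap_invalidate_unlock(struct address_space *mapping)
{
	up_write(&mapping->invalidate_lock);
}

static inline void filemap_invalidate_lock_shared(struct address_space *mapping)
{
	/* Taken shared by paths filling the page cache (fault, read, readahead). */
	down_read(&mapping->invalidate_lock);
}

static inline void filemap_invalidate_unlock_shared(struct address_space *mapping)
{
	up_read(&mapping->invalidate_lock);
}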

								Honza

Patch

diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index b7dcc86c92a4..67ba0a81301a 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -266,19 +266,19 @@  prototypes::
 locking rules:
 	All except set_page_dirty and freepage may block
 
-======================	======================== =========
-ops			PageLocked(page)	 i_rwsem
-======================	======================== =========
+======================	======================== =========	===============
+ops			PageLocked(page)	 i_rwsem	i_mapping_sem
+======================	======================== =========	===============
 writepage:		yes, unlocks (see below)
-readpage:		yes, unlocks
+readpage:		yes, unlocks				shared
 writepages:
 set_page_dirty		no
-readahead:		yes, unlocks
-readpages:		no
+readahead:		yes, unlocks				shared
+readpages:		no					shared
 write_begin:		locks the page		 exclusive
 write_end:		yes, unlocks		 exclusive
 bmap:
-invalidatepage:		yes
+invalidatepage:		yes					exclusive
 releasepage:		yes
 freepage:		yes
 direct_IO:
@@ -373,7 +373,10 @@  keep it that way and don't breed new callers.
 ->invalidatepage() is called when the filesystem must attempt to drop
 some or all of the buffers from the page when it is being truncated. It
 returns zero on success. If ->invalidatepage is zero, the kernel uses
-block_invalidatepage() instead.
+block_invalidatepage() instead. The filesystem should exclusively acquire
+i_mapping_sem before invalidating page cache in truncate / hole punch path (and
+thus calling into ->invalidatepage) to block races between page cache
+invalidation and page cache filling functions (fault, read, ...).
 
 ->releasepage() is called when the kernel is about to try to drop the
 buffers from the page in preparation for freeing it.  It returns zero to
@@ -567,6 +570,19 @@  in sys_read() and friends.
 the lease within the individual filesystem to record the result of the
 operation
 
+->fallocate implementation must be really careful to maintain page cache
+consistency when punching holes or performing other operations that invalidate
+page cache contents. Usually the filesystem needs to call
+truncate_inode_pages_range() to invalidate the relevant range of the page cache.
+However the filesystem usually also needs to update its internal (and on disk)
+view of file offset -> disk block mapping. Until this update is finished, the
+filesystem needs to block page faults and reads from reloading now-stale page
+cache contents from the disk. VFS provides inode->i_mapping_sem for this and
+acquires it in shared mode in paths loading pages from disk (filemap_fault(),
+filemap_read(), readahead paths). The filesystem is responsible for taking this
+lock in its fallocate implementation and generally whenever the page cache
+contents need to be invalidated because a block is moving from under a page.
+
 dquot_operations
 ================
 
@@ -628,9 +644,9 @@  access:		yes
 to be faulted in. The filesystem must find and return the page associated
 with the passed in "pgoff" in the vm_fault structure. If it is possible that
 the page may be truncated and/or invalidated, then the filesystem must lock
-the page, then ensure it is not already truncated (the page lock will block
-subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
-locked. The VM will unlock the page.
+i_mapping_sem, then ensure the page is not already truncated (i_mapping_sem
+will block subsequent truncate), and then return with VM_FAULT_LOCKED, and the
+page locked. The VM will unlock the page.
 
 ->map_pages() is called when VM asks to map easy accessible pages.
 Filesystem should find and map pages associated with offsets from "start_pgoff"
diff --git a/fs/inode.c b/fs/inode.c
index a047ab306f9a..e23e707a507d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -175,6 +175,9 @@  int inode_init_always(struct super_block *sb, struct inode *inode)
 
 	init_rwsem(&inode->i_rwsem);
 	lockdep_set_class(&inode->i_rwsem, &sb->s_type->i_mutex_key);
+	init_rwsem(&inode->i_mapping_sem);
+	lockdep_set_class(&inode->i_mapping_sem,
+			  &sb->s_type->i_mapping_sem_key);
 
 	atomic_set(&inode->i_dio_count, 0);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ec8f3ddf4a6a..c020c105d2d9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -660,6 +660,7 @@  struct inode {
 	/* Misc */
 	unsigned long		i_state;
 	struct rw_semaphore	i_rwsem;
+	struct rw_semaphore	i_mapping_sem;
 
 	unsigned long		dirtied_when;	/* jiffies of first dirtying */
 	unsigned long		dirtied_time_when;
@@ -2351,6 +2352,7 @@  struct file_system_type {
 
 	struct lock_class_key i_lock_key;
 	struct lock_class_key i_mutex_key;
+	struct lock_class_key i_mapping_sem_key;
 	struct lock_class_key i_mutex_dir_key;
 };
 
diff --git a/mm/filemap.c b/mm/filemap.c
index bd7c50e060a9..bc82a7856d3e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -77,7 +77,8 @@ 
  *        ->i_pages lock
  *
  *  ->i_rwsem
- *    ->i_mmap_rwsem		(truncate->unmap_mapping_range)
+ *    ->i_mapping_sem		(acquired by fs in truncate path)
+ *      ->i_mmap_rwsem		(truncate->unmap_mapping_range)
  *
  *  ->mmap_lock
  *    ->i_mmap_rwsem
@@ -85,7 +86,8 @@ 
  *        ->i_pages lock	(arch-dependent flush_dcache_mmap_lock)
  *
  *  ->mmap_lock
- *    ->lock_page		(access_process_vm)
+ *    ->i_mapping_sem		(filemap_fault)
+ *      ->lock_page		(filemap_fault, access_process_vm)
  *
  *  ->i_rwsem			(generic_perform_write)
  *    ->mmap_lock		(fault_in_pages_readable->do_page_fault)
@@ -2276,16 +2278,28 @@  static int filemap_update_page(struct kiocb *iocb,
 {
 	int error;
 
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (!down_read_trylock(&mapping->host->i_mapping_sem))
+			return -EAGAIN;
+	} else {
+		down_read(&mapping->host->i_mapping_sem);
+	}
+
 	if (!trylock_page(page)) {
-		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO))
+		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO)) {
+			up_read(&mapping->host->i_mapping_sem);
 			return -EAGAIN;
+		}
 		if (!(iocb->ki_flags & IOCB_WAITQ)) {
+			up_read(&mapping->host->i_mapping_sem);
 			put_and_wait_on_page_locked(page, TASK_KILLABLE);
 			return AOP_TRUNCATED_PAGE;
 		}
 		error = __lock_page_async(page, iocb->ki_waitq);
-		if (error)
+		if (error) {
+			up_read(&mapping->host->i_mapping_sem);
 			return error;
+		}
 	}
 
 	if (!page->mapping)
@@ -2302,6 +2316,7 @@  static int filemap_update_page(struct kiocb *iocb,
 	error = filemap_read_page(iocb->ki_filp, mapping, page);
 	if (error == AOP_TRUNCATED_PAGE)
 		put_page(page);
+	up_read(&mapping->host->i_mapping_sem);
 	return error;
 truncated:
 	unlock_page(page);
@@ -2309,6 +2324,7 @@  static int filemap_update_page(struct kiocb *iocb,
 	return AOP_TRUNCATED_PAGE;
 unlock:
 	unlock_page(page);
+	up_read(&mapping->host->i_mapping_sem);
 	return error;
 }
 
@@ -2323,6 +2339,19 @@  static int filemap_create_page(struct file *file,
 	if (!page)
 		return -ENOMEM;
 
+	/*
+	 * Protect against truncate / hole punch. Grabbing i_mapping_sem here
+	 * assures we cannot instantiate and bring uptodate new pagecache pages
+	 * after evicting page cache during truncate and before actually
+	 * freeing blocks.  Note that we could release i_mapping_sem after
+	 * inserting the page into page cache as the locked page would then be
+	 * enough to synchronize with hole punching. But there are code paths
+	 * such as filemap_update_page() filling in partially uptodate pages or
+	 * ->readpages() that need to hold i_mapping_sem while mapping blocks
+	 * for IO so let's hold the lock here as well to keep locking rules
+	 * simple.
+	 */
+	down_read(&mapping->host->i_mapping_sem);
 	error = add_to_page_cache_lru(page, mapping, index,
 			mapping_gfp_constraint(mapping, GFP_KERNEL));
 	if (error == -EEXIST)
@@ -2334,9 +2363,11 @@  static int filemap_create_page(struct file *file,
 	if (error)
 		goto error;
 
+	up_read(&mapping->host->i_mapping_sem);
 	pagevec_add(pvec, page);
 	return 0;
 error:
+	up_read(&mapping->host->i_mapping_sem);
 	put_page(page);
 	return error;
 }
@@ -2896,6 +2927,13 @@  vm_fault_t filemap_fault(struct vm_fault *vmf)
 		count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
 		ret = VM_FAULT_MAJOR;
 		fpin = do_sync_mmap_readahead(vmf);
+	}
+
+	/*
+	 * See comment in filemap_create_page() why we need i_mapping_sem
+	 */
+	down_read(&inode->i_mapping_sem);
+	if (!page) {
 retry_find:
 		page = pagecache_get_page(mapping, offset,
 					  FGP_CREAT|FGP_FOR_MMAP,
@@ -2903,6 +2941,7 @@  vm_fault_t filemap_fault(struct vm_fault *vmf)
 		if (!page) {
 			if (fpin)
 				goto out_retry;
+			up_read(&inode->i_mapping_sem);
 			return VM_FAULT_OOM;
 		}
 	}
@@ -2943,9 +2982,11 @@  vm_fault_t filemap_fault(struct vm_fault *vmf)
 	if (unlikely(offset >= max_off)) {
 		unlock_page(page);
 		put_page(page);
+		up_read(&inode->i_mapping_sem);
 		return VM_FAULT_SIGBUS;
 	}
 
+	up_read(&inode->i_mapping_sem);
 	vmf->page = page;
 	return ret | VM_FAULT_LOCKED;
 
@@ -2971,6 +3012,7 @@  vm_fault_t filemap_fault(struct vm_fault *vmf)
 	if (!error || error == AOP_TRUNCATED_PAGE)
 		goto retry_find;
 
+	up_read(&inode->i_mapping_sem);
 	shrink_readahead_size_eio(ra);
 	return VM_FAULT_SIGBUS;
 
@@ -2982,6 +3024,7 @@  vm_fault_t filemap_fault(struct vm_fault *vmf)
 	 */
 	if (page)
 		put_page(page);
+	up_read(&inode->i_mapping_sem);
 	if (fpin)
 		fput(fpin);
 	return ret | VM_FAULT_RETRY;
diff --git a/mm/readahead.c b/mm/readahead.c
index c5b0457415be..ac5bb50b3a4c 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -192,6 +192,7 @@  void page_cache_ra_unbounded(struct readahead_control *ractl,
 	 */
 	unsigned int nofs = memalloc_nofs_save();
 
+	down_read(&mapping->host->i_mapping_sem);
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
@@ -236,6 +237,7 @@  void page_cache_ra_unbounded(struct readahead_control *ractl,
 	 * will then handle the error.
 	 */
 	read_pages(ractl, &page_pool, false);
+	up_read(&mapping->host->i_mapping_sem);
 	memalloc_nofs_restore(nofs);
 }
 EXPORT_SYMBOL_GPL(page_cache_ra_unbounded);
diff --git a/mm/rmap.c b/mm/rmap.c
index dba8cb8a5578..37e5dceb4351 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -22,24 +22,25 @@ 
  *
  * inode->i_rwsem	(while writing or truncating, not reading or faulting)
  *   mm->mmap_lock
- *     page->flags PG_locked (lock_page)   * (see hugetlbfs below)
- *       hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
- *         mapping->i_mmap_rwsem
- *           hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
- *           anon_vma->rwsem
- *             mm->page_table_lock or pte_lock
- *               swap_lock (in swap_duplicate, swap_info_get)
- *                 mmlist_lock (in mmput, drain_mmlist and others)
- *                 mapping->private_lock (in __set_page_dirty_buffers)
- *                   lock_page_memcg move_lock (in __set_page_dirty_buffers)
- *                     i_pages lock (widely used)
- *                       lruvec->lru_lock (in lock_page_lruvec_irq)
- *                 inode->i_lock (in set_page_dirty's __mark_inode_dirty)
- *                 bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
- *                   sb_lock (within inode_lock in fs/fs-writeback.c)
- *                   i_pages lock (widely used, in set_page_dirty,
- *                             in arch-dependent flush_dcache_mmap_lock,
- *                             within bdi.wb->list_lock in __sync_single_inode)
+ *     inode->i_mapping_sem (in filemap_fault)
+ *       page->flags PG_locked (lock_page)   * (see hugetlbfs below)
+ *         hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
+ *           mapping->i_mmap_rwsem
+ *             hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
+ *             anon_vma->rwsem
+ *               mm->page_table_lock or pte_lock
+ *                 swap_lock (in swap_duplicate, swap_info_get)
+ *                   mmlist_lock (in mmput, drain_mmlist and others)
+ *                   mapping->private_lock (in __set_page_dirty_buffers)
+ *                     lock_page_memcg move_lock (in __set_page_dirty_buffers)
+ *                       i_pages lock (widely used)
+ *                         lruvec->lru_lock (in lock_page_lruvec_irq)
+ *                   inode->i_lock (in set_page_dirty's __mark_inode_dirty)
+ *                   bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
+ *                     sb_lock (within inode_lock in fs/fs-writeback.c)
+ *                     i_pages lock (widely used, in set_page_dirty,
+ *                               in arch-dependent flush_dcache_mmap_lock,
+ *                               within bdi.wb->list_lock in __sync_single_inode)
  *
  * anon_vma->rwsem,mapping->i_mmap_rwsem   (memory_failure, collect_procs_anon)
  *   ->tasklist_lock
diff --git a/mm/truncate.c b/mm/truncate.c
index 2cf71d8c3c62..464ad70a081f 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -416,7 +416,7 @@  EXPORT_SYMBOL(truncate_inode_pages_range);
  * @mapping: mapping to truncate
  * @lstart: offset from which to truncate
  *
- * Called under (and serialised by) inode->i_rwsem.
+ * Called under (and serialised by) inode->i_rwsem and inode->i_mapping_sem.
  *
  * Note: When this function returns, there can be a page in the process of
  * deletion (inside __delete_from_page_cache()) in the specified range.  Thus