mbox series

[RFC,v2,0/9] Replacing the readpages a_op

Message ID 20200115023843.31325-1-willy@infradead.org (mailing list archive)
Headers show
Series Replacing the readpages a_op | expand

Message

Matthew Wilcox Jan. 15, 2020, 2:38 a.m. UTC
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

This is an attempt to add a ->readahead op to replace ->readpages.  I've
converted two users, iomap/xfs and cifs.  The cifs conversion is lacking
fscache support, and that's just because I didn't want to do that work;
I don't believe there's anything fundamental to it.  But I wanted to do
iomap because it is The Infrastructure Of The Future and cifs because it
is the sole remaining user of add_to_page_cache_locked(), which enables
the last two patches in the series.  By the way, that gives CIFS access
to the workingset shadow infrastructure, which it had to ignore before
because it couldn't put pages onto the lru list at the right time.

v2: Chris asked me to show what this would look like if we just have
the implementation look up the pages in the page cache, and I managed
to figure out some things I'd done wrong last time.  It's even simpler
than v1 (net 104 lines deleted).

Matthew Wilcox (Oracle) (9):
  mm: Fix the return type of __do_page_cache_readahead
  readahead: Ignore return value of ->readpages
  XArray: Add xarray_for_each_range
  readahead: Put pages in cache earlier
  mm: Add readahead address space operation
  iomap,xfs: Convert from readpages to readahead
  cifs: Convert from readpages to readahead
  mm: Remove add_to_page_cache_locked
  mm: Unify all add_to_page_cache variants

 Documentation/core-api/xarray.rst     |  10 +-
 Documentation/filesystems/locking.rst |   7 +-
 Documentation/filesystems/vfs.rst     |  11 ++
 fs/cifs/file.c                        | 143 +++++---------------------
 fs/iomap/buffered-io.c                |  72 +++----------
 fs/iomap/trace.h                      |   2 +-
 fs/xfs/xfs_aops.c                     |  10 +-
 include/linux/fs.h                    |   2 +
 include/linux/iomap.h                 |   2 +-
 include/linux/pagemap.h               |  25 ++---
 include/linux/xarray.h                |  30 ++++++
 mm/filemap.c                          |  72 ++++---------
 mm/internal.h                         |   2 +-
 mm/readahead.c                        |  76 +++++++++-----
 14 files changed, 180 insertions(+), 284 deletions(-)

Comments

Matthew Wilcox Jan. 18, 2020, 11:13 p.m. UTC | #1
On Tue, Jan 14, 2020 at 06:38:34PM -0800, Matthew Wilcox wrote:
> This is an attempt to add a ->readahead op to replace ->readpages.  I've
> converted two users, iomap/xfs and cifs.  The cifs conversion is lacking
> fscache support, and that's just because I didn't want to do that work;
> I don't believe there's anything fundamental to it.  But I wanted to do
> iomap because it is The Infrastructure Of The Future and cifs because it
> is the sole remaining user of add_to_page_cache_locked(), which enables
> the last two patches in the series.  By the way, that gives CIFS access
> to the workingset shadow infrastructure, which it had to ignore before
> because it couldn't put pages onto the lru list at the right time.
> 
> v2: Chris asked me to show what this would look like if we just have
> the implementation look up the pages in the page cache, and I managed
> to figure out some things I'd done wrong last time.  It's even simpler
> than v1 (net 104 lines deleted).

I want to discuss whether to change the page refcount guarantees while we're
changing the API.  Currently,

page is allocated with a refcount of 1
page is locked, and inserted into page cache and refcount is bumped to 2
->readahead is called
callee is supposed to call put_page() after submitting I/O
I/O completion will unlock the page after I/O completes, leaving the refcount at 1.

So, what if we leave the refcount at 1 throughout the submission process,
saving ourselves two atomic ops per page?  We have to ensure that after
the page is submitted for I/O, the submission path no longer touches
the page.  So the process of converting a filesystem to ->readahead
becomes slightly more complex, but there's a bugger win as a result.

Opinions?
Jan Kara Jan. 21, 2020, 11:36 a.m. UTC | #2
Hello Matthew!

On Tue 14-01-20 18:38:34, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> This is an attempt to add a ->readahead op to replace ->readpages.  I've
> converted two users, iomap/xfs and cifs.  The cifs conversion is lacking
> fscache support, and that's just because I didn't want to do that work;
> I don't believe there's anything fundamental to it.  But I wanted to do
> iomap because it is The Infrastructure Of The Future and cifs because it
> is the sole remaining user of add_to_page_cache_locked(), which enables
> the last two patches in the series.  By the way, that gives CIFS access
> to the workingset shadow infrastructure, which it had to ignore before
> because it couldn't put pages onto the lru list at the right time.
> 
> v2: Chris asked me to show what this would look like if we just have
> the implementation look up the pages in the page cache, and I managed
> to figure out some things I'd done wrong last time.  It's even simpler
> than v1 (net 104 lines deleted).

I have an unfinished patch series laying around that pulls the ->readpage
/ ->readpages API in somewhat different direction so I'd like to discuss
whether it's possible to solve my problem using your API. The problem I
have is that currently some operations such as hole punching can race with
->readpage / ->readpages like:

CPU0						CPU1
fallocate(fd, FALLOC_FL_PUNCH_HOLE, off, len)
  filemap_write_and_wait_range()
  down_write(inode->i_rwsem);
  truncate_pagecache_range();
						readahead(fd, off, len)
						  creates pages in page cache
						  looks up block mapping
  removes blocks from inode and frees them
						  issues bio
						    - reads stale data -
						      potential security
						      issue

Now how I wanted to address this is that I'd change the API convention for
->readpage() so that we call it with the page unlocked and the function
would lock the page, check it's still OK, and do what it needs. And this
will allow ->readpage() and also ->readpages() to grab lock
(EXT4_I(inode)->i_mmap_sem in case of ext4) to synchronize with hole punching
while we are adding pages to page cache and mapping underlying blocks.

Now your API makes even ->readpages() (actually ->readahead) called with
pages locked so that makes this approach problematic because of lock
inversions. So I'd prefer if we could keep the situation that ->readpages /
->readahead gets called without any pages in page cache locked...

								Honza
Matthew Wilcox Jan. 21, 2020, 9:48 p.m. UTC | #3
On Tue, Jan 21, 2020 at 12:36:27PM +0100, Jan Kara wrote:
> > v2: Chris asked me to show what this would look like if we just have
> > the implementation look up the pages in the page cache, and I managed
> > to figure out some things I'd done wrong last time.  It's even simpler
> > than v1 (net 104 lines deleted).
> 
> I have an unfinished patch series laying around that pulls the ->readpage
> / ->readpages API in somewhat different direction so I'd like to discuss
> whether it's possible to solve my problem using your API. The problem I
> have is that currently some operations such as hole punching can race with
> ->readpage / ->readpages like:
> 
> CPU0						CPU1
> fallocate(fd, FALLOC_FL_PUNCH_HOLE, off, len)
>   filemap_write_and_wait_range()
>   down_write(inode->i_rwsem);
>   truncate_pagecache_range();
> 						readahead(fd, off, len)
> 						  creates pages in page cache
> 						  looks up block mapping
>   removes blocks from inode and frees them
> 						  issues bio
> 						    - reads stale data -
> 						      potential security
> 						      issue
> 
> Now how I wanted to address this is that I'd change the API convention for
> ->readpage() so that we call it with the page unlocked and the function
> would lock the page, check it's still OK, and do what it needs. And this
> will allow ->readpage() and also ->readpages() to grab lock
> (EXT4_I(inode)->i_mmap_sem in case of ext4) to synchronize with hole punching
> while we are adding pages to page cache and mapping underlying blocks.
> 
> Now your API makes even ->readpages() (actually ->readahead) called with
> pages locked so that makes this approach problematic because of lock
> inversions. So I'd prefer if we could keep the situation that ->readpages /
> ->readahead gets called without any pages in page cache locked...

I'm not a huge fan of that approach because it increases the number of
atomic ops (right now, we __SetPageLocked on the page before adding it
to i_pages).  Holepunch is a rather rare operation while readpage and
readpages/readahead are extremely common, so can we make holepunch take
a lock that will prevent new readpage(s) succeeding?

I have an idea to move the lock entries from DAX to being a generic page
cache concept.  That way, holepunch could insert lock entries into the
pagecache to cover the range being punched, and readpage(s) would either
skip lock entries or block on them.

Maybe there's a better approach though.
Jan Kara Jan. 22, 2020, 9:44 a.m. UTC | #4
On Tue 21-01-20 13:48:45, Matthew Wilcox wrote:
> On Tue, Jan 21, 2020 at 12:36:27PM +0100, Jan Kara wrote:
> > > v2: Chris asked me to show what this would look like if we just have
> > > the implementation look up the pages in the page cache, and I managed
> > > to figure out some things I'd done wrong last time.  It's even simpler
> > > than v1 (net 104 lines deleted).
> > 
> > I have an unfinished patch series laying around that pulls the ->readpage
> > / ->readpages API in somewhat different direction so I'd like to discuss
> > whether it's possible to solve my problem using your API. The problem I
> > have is that currently some operations such as hole punching can race with
> > ->readpage / ->readpages like:
> > 
> > CPU0						CPU1
> > fallocate(fd, FALLOC_FL_PUNCH_HOLE, off, len)
> >   filemap_write_and_wait_range()
> >   down_write(inode->i_rwsem);
> >   truncate_pagecache_range();
> > 						readahead(fd, off, len)
> > 						  creates pages in page cache
> > 						  looks up block mapping
> >   removes blocks from inode and frees them
> > 						  issues bio
> > 						    - reads stale data -
> > 						      potential security
> > 						      issue
> > 
> > Now how I wanted to address this is that I'd change the API convention for
> > ->readpage() so that we call it with the page unlocked and the function
> > would lock the page, check it's still OK, and do what it needs. And this
> > will allow ->readpage() and also ->readpages() to grab lock
> > (EXT4_I(inode)->i_mmap_sem in case of ext4) to synchronize with hole punching
> > while we are adding pages to page cache and mapping underlying blocks.
> > 
> > Now your API makes even ->readpages() (actually ->readahead) called with
> > pages locked so that makes this approach problematic because of lock
> > inversions. So I'd prefer if we could keep the situation that ->readpages /
> > ->readahead gets called without any pages in page cache locked...
> 
> I'm not a huge fan of that approach because it increases the number of
> atomic ops (right now, we __SetPageLocked on the page before adding it
> to i_pages).

Yeah, good point. The per-page cost of locking may be noticeable.

> Holepunch is a rather rare operation while readpage and
> readpages/readahead are extremely common, so can we make holepunch take a
> lock that will prevent new readpage(s) succeeding?

I'm not opposed - in fact my solution would do exactly that (with
EXT4_I(inode)->i_mmap_sem), just the lock ordering wrt page lock and
mmap_sem is causing troubles and that's why I need the change in the
API for readpage and friends.

> I have an idea to move the lock entries from DAX to being a generic page
> cache concept.  That way, holepunch could insert lock entries into the
> pagecache to cover the range being punched, and readpage(s) would either
> skip lock entries or block on them.

Two notes on the entry locks:

The additional traffic in the xarray creating locked entries and then
removing them is going to cost as well. But if that's only for hole
punching, it would be bearable I guess.

This does not solve the problem with the lock ordering. There are quite
some constraints on this synchronization scheme. Generally we want to
prevent creation of pages in the page cache in a certain range. That means
we need to block read(2), readahead(2), madvise(2) MADV_WILLNEED, page
faults.  The page faults constrain us that the lock has to rank below
mmap_sem. On the other hand hole punching needs to hold the lock while
evicting pages so that mandates that the lock needs to rank above page
lock. Also note that read(2) can cause page faults (to copy data to
userspace) so to avoid lock inversion against mmap_sem, any protection must
not cover that part of the read path which basically leaves us with
->readpage()/->readpages() as the only place where we can grab the lock.
Which nicely covers also all the other places creating pages in the page
cache we need to block. Except that ->readpage has the unfortunate property
of being called under page lock. But maybe we could create a new hook
somewhere in the paths creating pages to acquire the lock early. But so far
I don't have an idea for something nice.

								Honza
Dave Chinner Jan. 22, 2020, 11:47 p.m. UTC | #5
On Tue, Jan 21, 2020 at 01:48:45PM -0800, Matthew Wilcox wrote:
> On Tue, Jan 21, 2020 at 12:36:27PM +0100, Jan Kara wrote:
> > > v2: Chris asked me to show what this would look like if we just have
> > > the implementation look up the pages in the page cache, and I managed
> > > to figure out some things I'd done wrong last time.  It's even simpler
> > > than v1 (net 104 lines deleted).
> > 
> > I have an unfinished patch series laying around that pulls the ->readpage
> > / ->readpages API in somewhat different direction so I'd like to discuss
> > whether it's possible to solve my problem using your API. The problem I
> > have is that currently some operations such as hole punching can race with
> > ->readpage / ->readpages like:
> > 
> > CPU0						CPU1
> > fallocate(fd, FALLOC_FL_PUNCH_HOLE, off, len)
> >   filemap_write_and_wait_range()
> >   down_write(inode->i_rwsem);
> >   truncate_pagecache_range();

shouldn't fallocate be holding EXT4_I(inode)->i_mmap_sem before it
truncates the page cache? Otherwise it's not serialised against
page faults. Looks at code ... oh, it does hold the i_mmap_sem in
write mode, so....

> > 						readahead(fd, off, len)
> > 						  creates pages in page cache
> > 						  looks up block mapping
> >   removes blocks from inode and frees them
> > 						  issues bio
> > 						    - reads stale data -
> > 						      potential security
> > 						      issue

.... I'm not sure that this race condition should exist anymore
as readahead should not run until the filesystem drops it's inode
and mmap locks after the entire extent freeing operation is
complete...

> > Now how I wanted to address this is that I'd change the API convention for
> > ->readpage() so that we call it with the page unlocked and the function
> > would lock the page, check it's still OK, and do what it needs. And this
> > will allow ->readpage() and also ->readpages() to grab lock
> > (EXT4_I(inode)->i_mmap_sem in case of ext4) to synchronize with hole punching
> > while we are adding pages to page cache and mapping underlying blocks.
> > 
> > Now your API makes even ->readpages() (actually ->readahead) called with
> > pages locked so that makes this approach problematic because of lock
> > inversions. So I'd prefer if we could keep the situation that ->readpages /
> > ->readahead gets called without any pages in page cache locked...
> 
> I'm not a huge fan of that approach because it increases the number of
> atomic ops (right now, we __SetPageLocked on the page before adding it
> to i_pages).  Holepunch is a rather rare operation while readpage and
> readpages/readahead are extremely common, so can we make holepunch take
> a lock that will prevent new readpage(s) succeeding?
> 
> I have an idea to move the lock entries from DAX to being a generic page
> cache concept.  That way, holepunch could insert lock entries into the
> pagecache to cover the range being punched, and readpage(s) would either
> skip lock entries or block on them.
> 
> Maybe there's a better approach though.

Can we step back for a moment and look at how we already serialise
readahead against truncate/hole punch? While the readahead code
itself doesn't serialise against truncate, in all cases we should be
running through the filesystem at a higher layer and provides the
truncate/holepunch serialisation before we get to the readahead
code.

The read() syscall IO path:

  read()
    ->read_iter()
      filesystem takes truncate serialisation lock
      generic_file_read_iter()
        generic_file_buffered_read()
	  page_cache_sync_readahead()
	    ....
	  page_cache_async_readahead()
	    ....
      .....
      filesystem drops truncate serialisation lock

The page fault IO path:

page fault
  ->fault
    xfs_vm_filemap_fault
      filesystem takes mmap truncate serialisation lock
	filemap_fault
	  do_async_mmap_readahead
	    page_cache_async_readahead
	  ....
	  do_sync_mmap_readahead
	    page_cache_sync_readahead
	.....
      filesystem drops mmap truncate serialisation lock

Then there is fadvise(WILLNEED) which calls
force_page_cache_readahead() directly. We now have ->fadvise for
filesystems to add locking around this:

fadvise64
  vfs_fadvise
    ->fadvise
      xfs_file_fadvise
        filesystem takes truncate serialisation lock
	generic_fadvise()
        filesystem drops truncate serialisation lock

XFS and overlay both have ->fadvise methods to provide this
serialisation of readahead, but ext4 does not so users could trigger
the stated race condition through fadvise(WILLNEED) or readahead()
syscalls.

Hence, AFAICT, ext4 only needs to implement ->fadvise and the stated
readahead vs truncate_pagecache_range() race condition goes away.
Unless, of course, I've missed some other entry point into the
readahead code that does not vector through the filesystem first.
What have I missed?

Cheers,

Dave.
Jan Kara Jan. 23, 2020, 10:21 a.m. UTC | #6
On Thu 23-01-20 10:47:40, Dave Chinner wrote:
> On Tue, Jan 21, 2020 at 01:48:45PM -0800, Matthew Wilcox wrote:
> > On Tue, Jan 21, 2020 at 12:36:27PM +0100, Jan Kara wrote:
> > > > v2: Chris asked me to show what this would look like if we just have
> > > > the implementation look up the pages in the page cache, and I managed
> > > > to figure out some things I'd done wrong last time.  It's even simpler
> > > > than v1 (net 104 lines deleted).
> > > 
> > > I have an unfinished patch series laying around that pulls the ->readpage
> > > / ->readpages API in somewhat different direction so I'd like to discuss
> > > whether it's possible to solve my problem using your API. The problem I
> > > have is that currently some operations such as hole punching can race with
> > > ->readpage / ->readpages like:
> > > 
> > > CPU0						CPU1
> > > fallocate(fd, FALLOC_FL_PUNCH_HOLE, off, len)
> > >   filemap_write_and_wait_range()
> > >   down_write(inode->i_rwsem);
> > >   truncate_pagecache_range();
> 
> shouldn't fallocate be holding EXT4_I(inode)->i_mmap_sem before it
> truncates the page cache? Otherwise it's not serialised against
> page faults. Looks at code ... oh, it does hold the i_mmap_sem in
> write mode, so....

Yes.

> > > 						readahead(fd, off, len)
> > > 						  creates pages in page cache
> > > 						  looks up block mapping
> > >   removes blocks from inode and frees them
> > > 						  issues bio
> > > 						    - reads stale data -
> > > 						      potential security
> > > 						      issue
> 
> .... I'm not sure that this race condition should exist anymore
> as readahead should not run until the filesystem drops it's inode
> and mmap locks after the entire extent freeing operation is
> complete...

Not for XFS but for all the other filesystems see below..

> > > Now how I wanted to address this is that I'd change the API convention for
> > > ->readpage() so that we call it with the page unlocked and the function
> > > would lock the page, check it's still OK, and do what it needs. And this
> > > will allow ->readpage() and also ->readpages() to grab lock
> > > (EXT4_I(inode)->i_mmap_sem in case of ext4) to synchronize with hole punching
> > > while we are adding pages to page cache and mapping underlying blocks.
> > > 
> > > Now your API makes even ->readpages() (actually ->readahead) called with
> > > pages locked so that makes this approach problematic because of lock
> > > inversions. So I'd prefer if we could keep the situation that ->readpages /
> > > ->readahead gets called without any pages in page cache locked...
> > 
> > I'm not a huge fan of that approach because it increases the number of
> > atomic ops (right now, we __SetPageLocked on the page before adding it
> > to i_pages).  Holepunch is a rather rare operation while readpage and
> > readpages/readahead are extremely common, so can we make holepunch take
> > a lock that will prevent new readpage(s) succeeding?
> > 
> > I have an idea to move the lock entries from DAX to being a generic page
> > cache concept.  That way, holepunch could insert lock entries into the
> > pagecache to cover the range being punched, and readpage(s) would either
> > skip lock entries or block on them.
> > 
> > Maybe there's a better approach though.
> 
> Can we step back for a moment and look at how we already serialise
> readahead against truncate/hole punch? While the readahead code
> itself doesn't serialise against truncate, in all cases we should be
> running through the filesystem at a higher layer and provides the
> truncate/holepunch serialisation before we get to the readahead
> code.
> 
> The read() syscall IO path:
> 
>   read()
>     ->read_iter()
>       filesystem takes truncate serialisation lock
>       generic_file_read_iter()
>         generic_file_buffered_read()
> 	  page_cache_sync_readahead()
> 	    ....
> 	  page_cache_async_readahead()
> 	    ....
>       .....
>       filesystem drops truncate serialisation lock

Yes, this is the scheme XFS uses. But ext4 and other filesystems use a
scheme where read is serialized against truncate only by page locks and
i_size checks. Which works for truncate but is not enough for hole
punching. And locking read(2) and readahead(2) in all these filesystem with
i_rwsem is going to cause heavy regressions with mixed read-write workloads
and unnecessarily so because we don't need to lock reads against writes,
just against truncate or hole punching.

So I wanted to use i_mmap_sem for the serialization of the read path against
truncate. But due to lock ordering with mmap_sem and because reads do take
page faults to copy data it is not straightforward - hence my messing with
->readpage(). Now that I'm thinking about it, there's also a possibility of
introducing yet another rwsem into the inode that would rank above
mmap_sem and be used to serialize ->read_iter and ->fadvise against
truncate. But having three rwsems in the inode for serialization seems a
bit too convoluted for my taste.

								Honza
Jan Kara Jan. 23, 2020, 10:31 a.m. UTC | #7
On Wed 22-01-20 10:44:14, Jan Kara wrote:
> On Tue 21-01-20 13:48:45, Matthew Wilcox wrote:
> > On Tue, Jan 21, 2020 at 12:36:27PM +0100, Jan Kara wrote:
> > > > v2: Chris asked me to show what this would look like if we just have
> > > > the implementation look up the pages in the page cache, and I managed
> > > > to figure out some things I'd done wrong last time.  It's even simpler
> > > > than v1 (net 104 lines deleted).
> > > 
> > > I have an unfinished patch series laying around that pulls the ->readpage
> > > / ->readpages API in somewhat different direction so I'd like to discuss
> > > whether it's possible to solve my problem using your API. The problem I
> > > have is that currently some operations such as hole punching can race with
> > > ->readpage / ->readpages like:
> > > 
> > > CPU0						CPU1
> > > fallocate(fd, FALLOC_FL_PUNCH_HOLE, off, len)
> > >   filemap_write_and_wait_range()
> > >   down_write(inode->i_rwsem);
> > >   truncate_pagecache_range();
> > > 						readahead(fd, off, len)
> > > 						  creates pages in page cache
> > > 						  looks up block mapping
> > >   removes blocks from inode and frees them
> > > 						  issues bio
> > > 						    - reads stale data -
> > > 						      potential security
> > > 						      issue
> > > 
> > > Now how I wanted to address this is that I'd change the API convention for
> > > ->readpage() so that we call it with the page unlocked and the function
> > > would lock the page, check it's still OK, and do what it needs. And this
> > > will allow ->readpage() and also ->readpages() to grab lock
> > > (EXT4_I(inode)->i_mmap_sem in case of ext4) to synchronize with hole punching
> > > while we are adding pages to page cache and mapping underlying blocks.
> > > 
> > > Now your API makes even ->readpages() (actually ->readahead) called with
> > > pages locked so that makes this approach problematic because of lock
> > > inversions. So I'd prefer if we could keep the situation that ->readpages /
> > > ->readahead gets called without any pages in page cache locked...
> > 
> > I'm not a huge fan of that approach because it increases the number of
> > atomic ops (right now, we __SetPageLocked on the page before adding it
> > to i_pages).
> 
> Yeah, good point. The per-page cost of locking may be noticeable.

Thinking about this a bit more, we should be using ->readpages() to fill
most of the data. And for ->readpages() there would be no additional
overhead. Just for ->readpage() which should be rarely needed. We just need
to come up with a good solution for filesystems that have ->readpage but
not ->readpages.

								Honza
Dave Chinner Jan. 23, 2020, 10:29 p.m. UTC | #8
On Thu, Jan 23, 2020 at 11:21:01AM +0100, Jan Kara wrote:
> On Thu 23-01-20 10:47:40, Dave Chinner wrote:
> > On Tue, Jan 21, 2020 at 01:48:45PM -0800, Matthew Wilcox wrote:
> > > On Tue, Jan 21, 2020 at 12:36:27PM +0100, Jan Kara wrote:
> > > > > v2: Chris asked me to show what this would look like if we just have
> > > > > the implementation look up the pages in the page cache, and I managed
> > > > > to figure out some things I'd done wrong last time.  It's even simpler
> > > > > than v1 (net 104 lines deleted).
> > > > 
> > > > I have an unfinished patch series laying around that pulls the ->readpage
> > > > / ->readpages API in somewhat different direction so I'd like to discuss
> > > > whether it's possible to solve my problem using your API. The problem I
> > > > have is that currently some operations such as hole punching can race with
> > > > ->readpage / ->readpages like:
> > > > 
> > > > CPU0						CPU1
> > > > fallocate(fd, FALLOC_FL_PUNCH_HOLE, off, len)
> > > >   filemap_write_and_wait_range()
> > > >   down_write(inode->i_rwsem);
> > > >   truncate_pagecache_range();
> > 
> > shouldn't fallocate be holding EXT4_I(inode)->i_mmap_sem before it
> > truncates the page cache? Otherwise it's not serialised against
> > page faults. Looks at code ... oh, it does hold the i_mmap_sem in
> > write mode, so....
> 
> Yes.
> 
> > > > 						readahead(fd, off, len)
> > > > 						  creates pages in page cache
> > > > 						  looks up block mapping
> > > >   removes blocks from inode and frees them
> > > > 						  issues bio
> > > > 						    - reads stale data -
> > > > 						      potential security
> > > > 						      issue
> > 
> > .... I'm not sure that this race condition should exist anymore
> > as readahead should not run until the filesystem drops it's inode
> > and mmap locks after the entire extent freeing operation is
> > complete...
> 
> Not for XFS but for all the other filesystems see below..
> 
> > > > Now how I wanted to address this is that I'd change the API convention for
> > > > ->readpage() so that we call it with the page unlocked and the function
> > > > would lock the page, check it's still OK, and do what it needs. And this
> > > > will allow ->readpage() and also ->readpages() to grab lock
> > > > (EXT4_I(inode)->i_mmap_sem in case of ext4) to synchronize with hole punching
> > > > while we are adding pages to page cache and mapping underlying blocks.
> > > > 
> > > > Now your API makes even ->readpages() (actually ->readahead) called with
> > > > pages locked so that makes this approach problematic because of lock
> > > > inversions. So I'd prefer if we could keep the situation that ->readpages /
> > > > ->readahead gets called without any pages in page cache locked...
> > > 
> > > I'm not a huge fan of that approach because it increases the number of
> > > atomic ops (right now, we __SetPageLocked on the page before adding it
> > > to i_pages).  Holepunch is a rather rare operation while readpage and
> > > readpages/readahead are extremely common, so can we make holepunch take
> > > a lock that will prevent new readpage(s) succeeding?
> > > 
> > > I have an idea to move the lock entries from DAX to being a generic page
> > > cache concept.  That way, holepunch could insert lock entries into the
> > > pagecache to cover the range being punched, and readpage(s) would either
> > > skip lock entries or block on them.
> > > 
> > > Maybe there's a better approach though.
> > 
> > Can we step back for a moment and look at how we already serialise
> > readahead against truncate/hole punch? While the readahead code
> > itself doesn't serialise against truncate, in all cases we should be
> > running through the filesystem at a higher layer and provides the
> > truncate/holepunch serialisation before we get to the readahead
> > code.
> > 
> > The read() syscall IO path:
> > 
> >   read()
> >     ->read_iter()
> >       filesystem takes truncate serialisation lock
> >       generic_file_read_iter()
> >         generic_file_buffered_read()
> > 	  page_cache_sync_readahead()
> > 	    ....
> > 	  page_cache_async_readahead()
> > 	    ....
> >       .....
> >       filesystem drops truncate serialisation lock
> 
> Yes, this is the scheme XFS uses. But ext4 and other filesystems use a
> scheme where read is serialized against truncate only by page locks and
> i_size checks.

Right, which I've characterised in the past as a "performance over
correctness" hack that has resulted in infrastructure that can only
protect against truncate and not general page invalidation
operations. It's also been the source of many, many truncate bugs
over the past 20 years and has never worked for hole punch, yet....

> Which works for truncate but is not enough for hole
> punching. And locking read(2) and readahead(2) in all these filesystem with
> i_rwsem is going to cause heavy regressions with mixed read-write workloads
> and unnecessarily so because we don't need to lock reads against writes,
> just against truncate or hole punching.

.... performance over correctness is still used as the justification
for keeping this ancient, troublesome architecture in place even
though it cannot support the requirements of modern filesystems.

Indeed, this problem had to be fixed with DAX, and the ext4 DAX read
path uses shared inode locks. And now the ext4 DIO path uses shared
inode locking, too. Only the ext4 buffered read path does not use
shared read locks to avoid this race condition.

I don't have a "maintain the existing mixed rw performance" solution
for you here, except to say that I'm working (slowly) towards fixing
this problem on XFS with range locking for IO instead of using the
i_rwsem. That fixes so many other concurrency issues, too, such as
allowing fallocate() and ftruncate() to run concurrently with
non-overlapping IO to the same file. IOWs, moving IO locking away
from the page cache will allow much greater concurrency and
performance for a much wider range of filesystem operations than
trying to screw around with individual page locks.

> So I wanted to use i_mmap_sem for the serialization of the read path against
> truncate. But due to lock ordering with mmap_sem and because reads do take
> page faults to copy data it is not straightforward - hence my messing with
> ->readpage().

Another manifestation of the age old "can't take the same lock in
the syscall IO path and the page fault IO path" problem. We've
previously looked hacking about readpages in XFS, too, but it was
too complex, didn't improve performance, and in general such changes
were heading in the opposite direction we have been moving the IO
path infrastructure towards with iomap.

i.e. improving buffered IO efficiency requires use to reduce the
amount of work we do per page, not increase it....

> Now that I'm thinking about it, there's also a possibility of
> introducing yet another rwsem into the inode that would rank above
> mmap_sem and be used to serialize ->read_iter and ->fadvise against
> truncate. But having three rwsems in the inode for serialization seems a
> bit too convoluted for my taste.

Yup, that's called "falling off the locking cliff". :/

Cheers,

Dave.