
[7/7] nfs: page cache invalidation for dio

Message ID 20140124105213.2c40a783@tlielax.poochiereds.net (mailing list archive)
State New, archived

Commit Message

Jeff Layton Jan. 24, 2014, 3:52 p.m. UTC
On Wed, 22 Jan 2014 07:04:09 -0500
Jeff Layton <jlayton@redhat.com> wrote:

> On Wed, 22 Jan 2014 00:24:14 -0800
> Christoph Hellwig <hch@infradead.org> wrote:
> 
> > On Tue, Jan 21, 2014 at 02:21:59PM -0500, Jeff Layton wrote:
> > > In any case, this helps but it's a little odd. With this patch, you add
> > > an invalidate_inode_pages2 call prior to doing the DIO. But, you've
> > > also left in the call to nfs_zap_mapping in the completion codepath.
> > > 
> > > So now, we shoot down the mapping prior to doing a DIO write, and then
> > > mark the mapping for invalidation again when the write completes. Was
> > > that intentional?
> > > 
> > > It seems a little excessive and might hurt performance in some cases.
> > > OTOH, if you mix buffered and DIO you're asking for trouble anyway and
> > > this approach seems to give better cache coherency.
> > 
> > This follows the model implemented and documented in
> > generic_file_direct_write().
> > 
> 
> Ok, thanks. That makes sense, and the problem described in those
> comments is almost exactly the one I've seen in practice.
> 
> I'm still not 100% thrilled with the way that the NFS_INO_INVALID_DATA
> flag is handled, but that really has nothing to do with this patchset.
> 
> You can add my Tested-by to the set if you like...
> 

(re-sending with Trond's address fixed)

I may have spoken too soon...

This patchset didn't fix the problem once I cranked up the concurrency
from 100 child tasks to 1000. I think that HCH's patchset makes sense
and helps narrow the race window some, but the way that
nfs_revalidate_mapping/nfs_invalidate_mapping work is just racy.

The following patch does seem to fix it however. It's a combination of
a test patch that Trond gave me a while back and another change to
serialize the nfs_invalidate_mapping ops.

I think it's a reasonable approach to deal with the problem, but we
likely have some other areas that will need similar treatment since
they also check NFS_INO_INVALID_DATA: 

    nfs_write_pageuptodate
    nfs_readdir_search_for_cookie
    nfs_update_inode

Trond, thoughts? It's not quite ready for merge, but I'd like to get an
opinion on the basic approach, or whether you have an idea of how
to better handle the races here:

------------------8<--------------------

NFS: fix the handling of NFS_INO_INVALID_DATA flag in nfs_revalidate_mapping

There is a possible race in how the nfs_invalidate_mapping is handled.
Currently, we go and invalidate the pages in the file and then clear
NFS_INO_INVALID_DATA.

The problem is that it's possible for a stale page to creep into the
mapping after the page was invalidated (e.g., via readahead). If another
writer comes along and sets the flag after that happens but before
invalidate_inode_pages2 returns then we could clear the flag
without the cache having been properly invalidated.
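An illustrative interleaving of that race (editorial sketch; the task names
are made up, the ordering follows the description above):

    revalidating task                      other tasks
    -----------------                      -----------
    invalidate_inode_pages2() starts,
      shoots down page X
                                           readahead pulls stale page X back in
                                           a writer sets NFS_INO_INVALID_DATA
    invalidate_inode_pages2() returns
    clear NFS_INO_INVALID_DATA             <- flag lost, stale page X still cached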

So, we must clear the flag first and then invalidate the pages.  This
however, opens another race:

It's possible to have two concurrent read() calls that end up in
nfs_revalidate_mapping at the same time. The first one clears the
NFS_INO_INVALID_DATA flag and then goes to call nfs_invalidate_mapping.

Just before calling that though, the other task races in, checks the
flag and finds it cleared. At that point, it sees that the mapping is
good and gets the lock on the page, allowing the read() to be satisfied
from the cache even though the data is no longer valid.

This effect is easily manifested by running diotest3 from the LTP test
suite on NFS. That program does a series of DIO writes and buffered
reads. The operations are serialized and page-aligned, but the existing
code fails the test since it occasionally allows a read to come out of
the cache instead of being done on the wire when it should. While mixing
direct and buffered I/O isn't recommended, I believe it's possible to
hit this in other ways that just use buffered I/O, even though that
makes it harder to reproduce.

The problem is that the checking/clearing of that flag and the
invalidation of the mapping need to happen as a unit. Fix this by
serializing concurrent invalidations with a bitlock.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/nfs/inode.c         | 32 +++++++++++++++++++++++++++-----
 include/linux/nfs_fs.h |  1 +
 2 files changed, 28 insertions(+), 5 deletions(-)

Comments

Trond Myklebust Jan. 24, 2014, 5:11 p.m. UTC | #1
On Jan 24, 2014, at 8:52, Jeff Layton <jlayton@redhat.com> wrote:

> On Wed, 22 Jan 2014 07:04:09 -0500
> Jeff Layton <jlayton@redhat.com> wrote:
> 
>> On Wed, 22 Jan 2014 00:24:14 -0800
>> Christoph Hellwig <hch@infradead.org> wrote:
>> 
>>> On Tue, Jan 21, 2014 at 02:21:59PM -0500, Jeff Layton wrote:
>>>> In any case, this helps but it's a little odd. With this patch, you add
>>>> an invalidate_inode_pages2 call prior to doing the DIO. But, you've
>>>> also left in the call to nfs_zap_mapping in the completion codepath.
>>>> 
>>>> So now, we shoot down the mapping prior to doing a DIO write, and then
>>>> mark the mapping for invalidation again when the write completes. Was
>>>> that intentional?
>>>> 
>>>> It seems a little excessive and might hurt performance in some cases.
>>>> OTOH, if you mix buffered and DIO you're asking for trouble anyway and
>>>> this approach seems to give better cache coherency.
>>> 
>>> This follows the model implemented and documented in
>>> generic_file_direct_write().
>>> 
>> 
>> Ok, thanks. That makes sense, and the problem described in those
>> comments is almost exactly the one I've seen in practice.
>> 
>> I'm still not 100% thrilled with the way that the NFS_INO_INVALID_DATA
>> flag is handled, but that really has nothing to do with this patchset.
>> 
>> You can add my Tested-by to the set if you like...
>> 
> 
> (re-sending with Trond's address fixed)
> 
> I may have spoken too soon...
> 
> This patchset didn't fix the problem once I cranked up the concurrency
> from 100 child tasks to 1000. I think that HCH's patchset makes sense
> and helps narrow the race window some, but the way that
> nfs_revalidate_mapping/nfs_invalidate_mapping work is just racy.
> 
> The following patch does seem to fix it however. It's a combination of
> a test patch that Trond gave me a while back and another change to
> serialize the nfs_invalidate_mapping ops.
> 
> I think it's a reasonable approach to deal with the problem, but we
> likely have some other areas that will need similar treatment since
> they also check NFS_INO_INVALID_DATA: 
> 
>    nfs_write_pageuptodate
>    nfs_readdir_search_for_cookie
>    nfs_update_inode
> 
> Trond, thoughts? It's not quite ready for merge, but I'd like to get an
> opinion on the basic approach, or whether you have an idea of how
> to better handle the races here:

I think that it is reasonable for nfs_revalidate_mapping, but I don’t see how it is relevant to nfs_update_inode or nfs_write_pageuptodate.
Readdir already has its own locking at the VFS level, so we shouldn’t need to care there.

Cheers
  Trond
--
Trond Myklebust
Linux NFS client maintainer

Jeff Layton Jan. 24, 2014, 5:29 p.m. UTC | #2
On Fri, 24 Jan 2014 10:11:11 -0700
Trond Myklebust <trond.myklebust@primarydata.com> wrote:

> 
> On Jan 24, 2014, at 8:52, Jeff Layton <jlayton@redhat.com> wrote:
> 
> > On Wed, 22 Jan 2014 07:04:09 -0500
> > Jeff Layton <jlayton@redhat.com> wrote:
> > 
> >> On Wed, 22 Jan 2014 00:24:14 -0800
> >> Christoph Hellwig <hch@infradead.org> wrote:
> >> 
> >>> On Tue, Jan 21, 2014 at 02:21:59PM -0500, Jeff Layton wrote:
> >>>> In any case, this helps but it's a little odd. With this patch, you add
> >>>> an invalidate_inode_pages2 call prior to doing the DIO. But, you've
> >>>> also left in the call to nfs_zap_mapping in the completion codepath.
> >>>> 
> >>>> So now, we shoot down the mapping prior to doing a DIO write, and then
> >>>> mark the mapping for invalidation again when the write completes. Was
> >>>> that intentional?
> >>>> 
> >>>> It seems a little excessive and might hurt performance in some cases.
> >>>> OTOH, if you mix buffered and DIO you're asking for trouble anyway and
> >>>> this approach seems to give better cache coherency.
> >>> 
> >>> This follows the model implemented and documented in
> >>> generic_file_direct_write().
> >>> 
> >> 
> >> Ok, thanks. That makes sense, and the problem described in those
> >> comments is almost exactly the one I've seen in practice.
> >> 
> >> I'm still not 100% thrilled with the way that the NFS_INO_INVALID_DATA
> >> flag is handled, but that really has nothing to do with this patchset.
> >> 
> >> You can add my Tested-by to the set if you like...
> >> 
> > 
> > (re-sending with Trond's address fixed)
> > 
> > I may have spoken too soon...
> > 
> > This patchset didn't fix the problem once I cranked up the concurrency
> > from 100 child tasks to 1000. I think that HCH's patchset makes sense
> > and helps narrow the race window some, but the way that
> > nfs_revalidate_mapping/nfs_invalidate_mapping work is just racy.
> > 
> > The following patch does seem to fix it however. It's a combination of
> > a test patch that Trond gave me a while back and another change to
> > serialize the nfs_invalidate_mapping ops.
> > 
> > I think it's a reasonable approach to deal with the problem, but we
> > likely have some other areas that will need similar treatment since
> > they also check NFS_INO_INVALID_DATA: 
> > 
> >    nfs_write_pageuptodate
> >    nfs_readdir_search_for_cookie
> >    nfs_update_inode
> > 
> > Trond, thoughts? It's not quite ready for merge, but I'd like to get an
> > opinion on the basic approach, or whether you have an idea of how
> > to better handle the races here:
> 
> I think that it is reasonable for nfs_revalidate_mapping, but I don’t see how it is relevant to nfs_update_inode or nfs_write_pageuptodate.
> Readdir already has its own locking at the VFS level, so we shouldn’t need to care there.
> 


nfs_write_pageuptodate does this:

---------------8<-----------------
        if (NFS_I(inode)->cache_validity & (NFS_INO_INVALID_DATA|NFS_INO_REVAL_PAGECACHE))
                return false;
out:
        return PageUptodate(page) != 0;
---------------8<-----------------

With the proposed patch, NFS_INO_INVALID_DATA would be cleared first and
only later would the page be invalidated. So, there's a race window in
there where the bit could be cleared but the page flag is still set,
even though it's on its way out the cache. So, I think we'd need to do
some similar sort of locking in there to make sure that doesn't happen.

nfs_update_inode just does this:

        if (invalid & NFS_INO_INVALID_DATA)
                nfs_fscache_invalidate(inode);

...again, since we clear the bit first with this patch, I think we have
a potential race window there too. We might not see it set in a
situation where we would have before. That case is a bit more
problematic since we can't sleep to wait on the bitlock there.

It might be best to just get rid of that call altogether and move it
into nfs_invalidate_mapping. It seems to me that we ought to just
handle fscache the same way we do the pagecache when it comes to
invalidation.
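
A rough sketch of that idea (hypothetical and untested; the S_ISDIR
cookieverf handling and tracing in the real function are omitted for
brevity):

	/*
	 * Hypothetical: invalidate fscache alongside the pagecache in
	 * nfs_invalidate_mapping(), instead of from nfs_update_inode().
	 */
	static int nfs_invalidate_mapping(struct inode *inode,
					  struct address_space *mapping)
	{
		if (mapping->nrpages != 0) {
			int ret = invalidate_inode_pages2(mapping);

			if (ret < 0)
				return ret;
		}
		nfs_fscache_invalidate(inode);	/* moved from nfs_update_inode() */
		nfs_inc_stats(inode, NFSIOS_DATAINVALIDATE);
		nfs_fscache_wait_on_invalidate(inode);
		return 0;
	}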

As far as the readdir code goes, I haven't looked as closely at that
yet. I just noticed that it checked for NFS_INO_INVALID_DATA. Once we
settle the other two cases, I'll give that closer scrutiny.

Thanks,
Trond Myklebust Jan. 24, 2014, 5:40 p.m. UTC | #3
On Fri, 2014-01-24 at 12:29 -0500, Jeff Layton wrote:
> On Fri, 24 Jan 2014 10:11:11 -0700
> Trond Myklebust <trond.myklebust@primarydata.com> wrote:
> 
> > 
> > On Jan 24, 2014, at 8:52, Jeff Layton <jlayton@redhat.com> wrote:
> > 
> > > On Wed, 22 Jan 2014 07:04:09 -0500
> > > Jeff Layton <jlayton@redhat.com> wrote:
> > > 
> > >> On Wed, 22 Jan 2014 00:24:14 -0800
> > >> Christoph Hellwig <hch@infradead.org> wrote:
> > >> 
> > >>> On Tue, Jan 21, 2014 at 02:21:59PM -0500, Jeff Layton wrote:
> > >>>> In any case, this helps but it's a little odd. With this patch, you add
> > >>>> an invalidate_inode_pages2 call prior to doing the DIO. But, you've
> > >>>> also left in the call to nfs_zap_mapping in the completion codepath.
> > >>>> 
> > >>>> So now, we shoot down the mapping prior to doing a DIO write, and then
> > >>>> mark the mapping for invalidation again when the write completes. Was
> > >>>> that intentional?
> > >>>> 
> > >>>> It seems a little excessive and might hurt performance in some cases.
> > >>>> OTOH, if you mix buffered and DIO you're asking for trouble anyway and
> > >>>> this approach seems to give better cache coherency.
> > >>> 
> > >>> This follows the model implemented and documented in
> > >>> generic_file_direct_write().
> > >>> 
> > >> 
> > >> Ok, thanks. That makes sense, and the problem described in those
> > >> comments is almost exactly the one I've seen in practice.
> > >> 
> > >> I'm still not 100% thrilled with the way that the NFS_INO_INVALID_DATA
> > >> flag is handled, but that really has nothing to do with this patchset.
> > >> 
> > >> You can add my Tested-by to the set if you like...
> > >> 
> > > 
> > > (re-sending with Trond's address fixed)
> > > 
> > > I may have spoken too soon...
> > > 
> > > This patchset didn't fix the problem once I cranked up the concurrency
> > > from 100 child tasks to 1000. I think that HCH's patchset makes sense
> > > and helps narrow the race window some, but the way that
> > > nfs_revalidate_mapping/nfs_invalidate_mapping work is just racy.
> > > 
> > > The following patch does seem to fix it however. It's a combination of
> > > a test patch that Trond gave me a while back and another change to
> > > serialize the nfs_invalidate_mapping ops.
> > > 
> > > I think it's a reasonable approach to deal with the problem, but we
> > > likely have some other areas that will need similar treatment since
> > > they also check NFS_INO_INVALID_DATA: 
> > > 
> > >    nfs_write_pageuptodate
> > >    nfs_readdir_search_for_cookie
> > >    nfs_update_inode
> > > 
> > > Trond, thoughts? It's not quite ready for merge, but I'd like to get an
> > > opinion on the basic approach, or whether you have an idea of how
> > > to better handle the races here:
> > 
> > I think that it is reasonable for nfs_revalidate_mapping, but I don’t see how it is relevant to nfs_update_inode or nfs_write_pageuptodate.
> > Readdir already has its own locking at the VFS level, so we shouldn’t need to care there.
> > 
> 
> 
> nfs_write_pageuptodate does this:
> 
> ---------------8<-----------------
>         if (NFS_I(inode)->cache_validity & (NFS_INO_INVALID_DATA|NFS_INO_REVAL_PAGECACHE))
>                 return false;
> out:
>         return PageUptodate(page) != 0;
> ---------------8<-----------------
> 
> With the proposed patch, NFS_INO_INVALID_DATA would be cleared first and
> only later would the page be invalidated. So, there's a race window in
> there where the bit could be cleared but the page flag is still set,
> even though it's on its way out the cache. So, I think we'd need to do
> some similar sort of locking in there to make sure that doesn't happen.

We _cannot_ lock against nfs_revalidate_mapping() here, because we could
end up deadlocking with invalidate_inode_pages2().

If you like, we could add a test for NFS_INO_INVALIDATING, to turn off
the optimisation in that case, but I'd like to understand what the race
would be: don't forget that the page is marked as PageUptodate(), which
means that either invalidate_inode_pages2() has not yet reached this
page, or that a read of the page succeeded after the invalidation was
made.

> nfs_update_inode just does this:
> 
>         if (invalid & NFS_INO_INVALID_DATA)
>                 nfs_fscache_invalidate(inode);
> 
> ...again, since we clear the bit first with this patch, I think we have
> a potential race window there too. We might not see it set in a
> situation where we would have before. That case is a bit more
> problematic since we can't sleep to wait on the bitlock there.

Umm... That test in nfs_update_inode() is there because we might just
have _set_ the NFS_INO_INVALID_DATA bit.

> 
> It might be best to just get rid of that call altogether and move it
> into nfs_invalidate_mapping. It seems to me that we ought to just
> handle fscache the same way we do the pagecache when it comes to
> invalidation.
> 
> As far as the readdir code goes, I haven't looked as closely at that
> yet. I just noticed that it checked for NFS_INO_INVALID_DATA. Once we
> settle the other two cases, I'll give that closer scrutiny.
> 
> Thanks,
Jeff Layton Jan. 24, 2014, 6 p.m. UTC | #4
On Fri, 24 Jan 2014 10:40:06 -0700
Trond Myklebust <trond.myklebust@primarydata.com> wrote:

> On Fri, 2014-01-24 at 12:29 -0500, Jeff Layton wrote:
> > On Fri, 24 Jan 2014 10:11:11 -0700
> > Trond Myklebust <trond.myklebust@primarydata.com> wrote:
> > 
> > > 
> > > On Jan 24, 2014, at 8:52, Jeff Layton <jlayton@redhat.com> wrote:
> > > 
> > > > On Wed, 22 Jan 2014 07:04:09 -0500
> > > > Jeff Layton <jlayton@redhat.com> wrote:
> > > > 
> > > >> On Wed, 22 Jan 2014 00:24:14 -0800
> > > >> Christoph Hellwig <hch@infradead.org> wrote:
> > > >> 
> > > >>> On Tue, Jan 21, 2014 at 02:21:59PM -0500, Jeff Layton wrote:
> > > >>>> In any case, this helps but it's a little odd. With this patch, you add
> > > >>>> an invalidate_inode_pages2 call prior to doing the DIO. But, you've
> > > >>>> also left in the call to nfs_zap_mapping in the completion codepath.
> > > >>>> 
> > > >>>> So now, we shoot down the mapping prior to doing a DIO write, and then
> > > >>>> mark the mapping for invalidation again when the write completes. Was
> > > >>>> that intentional?
> > > >>>> 
> > > >>>> It seems a little excessive and might hurt performance in some cases.
> > > >>>> OTOH, if you mix buffered and DIO you're asking for trouble anyway and
> > > >>>> this approach seems to give better cache coherency.
> > > >>> 
> > > >>> This follows the model implemented and documented in
> > > >>> generic_file_direct_write().
> > > >>> 
> > > >> 
> > > >> Ok, thanks. That makes sense, and the problem described in those
> > > >> comments is almost exactly the one I've seen in practice.
> > > >> 
> > > >> I'm still not 100% thrilled with the way that the NFS_INO_INVALID_DATA
> > > >> flag is handled, but that really has nothing to do with this patchset.
> > > >> 
> > > >> You can add my Tested-by to the set if you like...
> > > >> 
> > > > 
> > > > (re-sending with Trond's address fixed)
> > > > 
> > > > I may have spoken too soon...
> > > > 
> > > > This patchset didn't fix the problem once I cranked up the concurrency
> > > > from 100 child tasks to 1000. I think that HCH's patchset makes sense
> > > > and helps narrow the race window some, but the way that
> > > > nfs_revalidate_mapping/nfs_invalidate_mapping work is just racy.
> > > > 
> > > > The following patch does seem to fix it however. It's a combination of
> > > > a test patch that Trond gave me a while back and another change to
> > > > serialize the nfs_invalidate_mapping ops.
> > > > 
> > > > I think it's a reasonable approach to deal with the problem, but we
> > > > likely have some other areas that will need similar treatment since
> > > > they also check NFS_INO_INVALID_DATA: 
> > > > 
> > > >    nfs_write_pageuptodate
> > > >    nfs_readdir_search_for_cookie
> > > >    nfs_update_inode
> > > > 
> > > > Trond, thoughts? It's not quite ready for merge, but I'd like to get an
> > > > opinion on the basic approach, or whether you have an idea of how
> > > > to better handle the races here:
> > > 
> > > I think that it is reasonable for nfs_revalidate_mapping, but I don’t see how it is relevant to nfs_update_inode or nfs_write_pageuptodate.
> > > Readdir already has its own locking at the VFS level, so we shouldn’t need to care there.
> > > 
> > 
> > 
> > nfs_write_pageuptodate does this:
> > 
> > ---------------8<-----------------
> >         if (NFS_I(inode)->cache_validity & (NFS_INO_INVALID_DATA|NFS_INO_REVAL_PAGECACHE))
> >                 return false;
> > out:
> >         return PageUptodate(page) != 0;
> > ---------------8<-----------------
> > 
> > With the proposed patch, NFS_INO_INVALID_DATA would be cleared first and
> > only later would the page be invalidated. So, there's a race window in
> > there where the bit could be cleared but the page flag is still set,
> > even though it's on its way out the cache. So, I think we'd need to do
> > some similar sort of locking in there to make sure that doesn't happen.
> 
> We _cannot_ lock against nfs_revalidate_mapping() here, because we could
> end up deadlocking with invalidate_inode_pages2().
> 
> If you like, we could add a test for NFS_INO_INVALIDATING, to turn off
> the optimisation in that case, but I'd like to understand what the race
> would be: don't forget that the page is marked as PageUptodate(), which
> means that either invalidate_inode_pages2() has not yet reached this
> page, or that a read of the page succeeded after the invalidation was
> made.
> 

Right. The first situation seems wrong to me. We've marked the file as
INVALID and then cleared the bit to start the process of invalidating
the actual pages. It seems like nfs_write_pageuptodate ought not return
true even if PageUptodate() is still set at that point.

We could check NFS_INO_INVALIDATING, but we might miss that
optimization in a lot of cases just because something happens to be
in nfs_revalidate_mapping. Maybe that means that this bitlock isn't
sufficient and we need some other mechanism. I'm not sure what that
should be though.
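
For concreteness, the sort of check being discussed might look roughly like
this (a sketch only, simplified from the snippet quoted above; whether losing
the optimisation while an invalidation is in flight is acceptable is exactly
the open question):

	static bool nfs_write_pageuptodate(struct page *page, struct inode *inode)
	{
		struct nfs_inode *nfsi = NFS_I(inode);

		if (nfsi->cache_validity & NFS_INO_REVAL_PAGECACHE)
			return false;
		/* don't trust PageUptodate() while an invalidation is in flight */
		if (test_bit(NFS_INO_INVALIDATING, &nfsi->flags))
			return false;
		if (nfsi->cache_validity & NFS_INO_INVALID_DATA)
			return false;
		return PageUptodate(page) != 0;
	}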

> > nfs_update_inode just does this:
> > 
> >         if (invalid & NFS_INO_INVALID_DATA)
> >                 nfs_fscache_invalidate(inode);
> > 
> > ...again, since we clear the bit first with this patch, I think we have
> > a potential race window there too. We might not see it set in a
> > situation where we would have before. That case is a bit more
> > problematic since we can't sleep to wait on the bitlock there.
> 
> Umm... That test in nfs_update_inode() is there because we might just
> have _set_ the NFS_INO_INVALID_DATA bit.
> 

Correct. But do we need to force a fscache invalidation at that point,
or can it wait until we're going to invalidate the mapping too?

> > 
> > It might be best to just get rid of that call altogether and move it
> > into nfs_invalidate_mapping. It seems to me that we ought to just
> > handle fscache the same way we do the pagecache when it comes to
> > invalidation.
> > 
> > As far as the readdir code goes, I haven't looked as closely at that
> > yet. I just noticed that it checked for NFS_INO_INVALID_DATA. Once we
> > settle the other two cases, I'll give that closer scrutiny.
> >
Trond Myklebust Jan. 24, 2014, 6:46 p.m. UTC | #5
On Fri, 2014-01-24 at 13:00 -0500, Jeff Layton wrote:
> On Fri, 24 Jan 2014 10:40:06 -0700
> Trond Myklebust <trond.myklebust@primarydata.com> wrote:
> 
> > On Fri, 2014-01-24 at 12:29 -0500, Jeff Layton wrote:
> > > On Fri, 24 Jan 2014 10:11:11 -0700
> > > Trond Myklebust <trond.myklebust@primarydata.com> wrote:
> > > 
> > > > 
> > > > On Jan 24, 2014, at 8:52, Jeff Layton <jlayton@redhat.com> wrote:
> > > > 
> > > > > On Wed, 22 Jan 2014 07:04:09 -0500
> > > > > Jeff Layton <jlayton@redhat.com> wrote:
> > > > > 
> > > > >> On Wed, 22 Jan 2014 00:24:14 -0800
> > > > >> Christoph Hellwig <hch@infradead.org> wrote:
> > > > >> 
> > > > >>> On Tue, Jan 21, 2014 at 02:21:59PM -0500, Jeff Layton wrote:
> > > > >>>> In any case, this helps but it's a little odd. With this patch, you add
> > > > >>>> an invalidate_inode_pages2 call prior to doing the DIO. But, you've
> > > > >>>> also left in the call to nfs_zap_mapping in the completion codepath.
> > > > >>>> 
> > > > >>>> So now, we shoot down the mapping prior to doing a DIO write, and then
> > > > >>>> mark the mapping for invalidation again when the write completes. Was
> > > > >>>> that intentional?
> > > > >>>> 
> > > > >>>> It seems a little excessive and might hurt performance in some cases.
> > > > >>>> OTOH, if you mix buffered and DIO you're asking for trouble anyway and
> > > > >>>> this approach seems to give better cache coherency.
> > > > >>> 
> > > > >>> This follows the model implemented and documented in
> > > > >>> generic_file_direct_write().
> > > > >>> 
> > > > >> 
> > > > >> Ok, thanks. That makes sense, and the problem described in those
> > > > >> comments is almost exactly the one I've seen in practice.
> > > > >> 
> > > > >> I'm still not 100% thrilled with the way that the NFS_INO_INVALID_DATA
> > > > >> flag is handled, but that really has nothing to do with this patchset.
> > > > >> 
> > > > >> You can add my Tested-by to the set if you like...
> > > > >> 
> > > > > 
> > > > > (re-sending with Trond's address fixed)
> > > > > 
> > > > > I may have spoken too soon...
> > > > > 
> > > > > This patchset didn't fix the problem once I cranked up the concurrency
> > > > > from 100 child tasks to 1000. I think that HCH's patchset makes sense
> > > > > and helps narrow the race window some, but the way that
> > > > > nfs_revalidate_mapping/nfs_invalidate_mapping work is just racy.
> > > > > 
> > > > > The following patch does seem to fix it however. It's a combination of
> > > > > a test patch that Trond gave me a while back and another change to
> > > > > serialize the nfs_invalidate_mapping ops.
> > > > > 
> > > > > I think it's a reasonable approach to deal with the problem, but we
> > > > > likely have some other areas that will need similar treatment since
> > > > > they also check NFS_INO_INVALID_DATA: 
> > > > > 
> > > > >    nfs_write_pageuptodate
> > > > >    nfs_readdir_search_for_cookie
> > > > >    nfs_update_inode
> > > > > 
> > > > > Trond, thoughts? It's not quite ready for merge, but I'd like to get an
> > > > > opinion on the basic approach, or whether you have an idea of how
> > > > > to better handle the races here:
> > > > 
> > > > I think that it is reasonable for nfs_revalidate_mapping, but I don’t see how it is relevant to nfs_update_inode or nfs_write_pageuptodate.
> > > > Readdir already has its own locking at the VFS level, so we shouldn’t need to care there.
> > > > 
> > > 
> > > 
> > > nfs_write_pageuptodate does this:
> > > 
> > > ---------------8<-----------------
> > >         if (NFS_I(inode)->cache_validity & (NFS_INO_INVALID_DATA|NFS_INO_REVAL_PAGECACHE))
> > >                 return false;
> > > out:
> > >         return PageUptodate(page) != 0;
> > > ---------------8<-----------------
> > > 
> > > With the proposed patch, NFS_INO_INVALID_DATA would be cleared first and
> > > only later would the page be invalidated. So, there's a race window in
> > > there where the bit could be cleared but the page flag is still set,
> > > even though it's on its way out the cache. So, I think we'd need to do
> > > some similar sort of locking in there to make sure that doesn't happen.
> > 
> > We _cannot_ lock against nfs_revalidate_mapping() here, because we could
> > end up deadlocking with invalidate_inode_pages2().
> > 
> > If you like, we could add a test for NFS_INO_INVALIDATING, to turn off
> > the optimisation in that case, but I'd like to understand what the race
> > would be: don't forget that the page is marked as PageUptodate(), which
> > means that either invalidate_inode_pages2() has not yet reached this
> > page, or that a read of the page succeeded after the invalidation was
> > made.
> > 
> 
> Right. The first situation seems wrong to me. We've marked the file as
> INVALID and then cleared the bit to start the process of invalidating
> the actual pages. It seems like nfs_write_pageuptodate ought not return
> true even if PageUptodate() is still set at that point.
> 
> We could check NFS_INO_INVALIDATING, but we might miss that
> optimization in a lot of cases just because something happens to be
> in nfs_revalidate_mapping. Maybe that means that this bitlock isn't
> sufficient and we need some other mechanism. I'm not sure what that
> should be though.

Convert your patch to use wait_on_bit(), and then to call
wait_on_bit_lock() if and only if you see NFS_INO_INVALID_DATA is set.

> > > nfs_update_inode just does this:
> > > 
> > >         if (invalid & NFS_INO_INVALID_DATA)
> > >                 nfs_fscache_invalidate(inode);
> > > 
> > > ...again, since we clear the bit first with this patch, I think we have
> > > a potential race window there too. We might not see it set in a
> > > situation where we would have before. That case is a bit more
> > > problematic since we can't sleep to wait on the bitlock there.
> > 
> > Umm... That test in nfs_update_inode() is there because we might just
> > have _set_ the NFS_INO_INVALID_DATA bit.
> > 
> 
> Correct. But do we need to force a fscache invalidation at that point,
> or can it wait until we're going to invalidate the mapping too?

That's a question for David. My assumption is that since invalidation is
handled asynchronously by the fscache layer itself, that we need to let
it start that process as soon as possible, but perhaps these races are
an indication that we should actually do it at the time when we call
invalidate_inode_pages2() (or at the latest, when we're evicting the
inode from the icache)...
Jeff Layton Jan. 24, 2014, 9:21 p.m. UTC | #6
On Fri, 24 Jan 2014 11:46:41 -0700
Trond Myklebust <trond.myklebust@primarydata.com> wrote:

> On Fri, 2014-01-24 at 13:00 -0500, Jeff Layton wrote:
> > On Fri, 24 Jan 2014 10:40:06 -0700
> > Trond Myklebust <trond.myklebust@primarydata.com> wrote:
> > 
> > > On Fri, 2014-01-24 at 12:29 -0500, Jeff Layton wrote:
> > > > On Fri, 24 Jan 2014 10:11:11 -0700
> > > > Trond Myklebust <trond.myklebust@primarydata.com> wrote:
> > > > 
> > > > > 
> > > > > On Jan 24, 2014, at 8:52, Jeff Layton <jlayton@redhat.com> wrote:
> > > > > 
> > > > > > On Wed, 22 Jan 2014 07:04:09 -0500
> > > > > > Jeff Layton <jlayton@redhat.com> wrote:
> > > > > > 
> > > > > >> On Wed, 22 Jan 2014 00:24:14 -0800
> > > > > >> Christoph Hellwig <hch@infradead.org> wrote:
> > > > > >> 
> > > > > >>> On Tue, Jan 21, 2014 at 02:21:59PM -0500, Jeff Layton wrote:
> > > > > >>>> In any case, this helps but it's a little odd. With this patch, you add
> > > > > >>>> an invalidate_inode_pages2 call prior to doing the DIO. But, you've
> > > > > >>>> also left in the call to nfs_zap_mapping in the completion codepath.
> > > > > >>>> 
> > > > > >>>> So now, we shoot down the mapping prior to doing a DIO write, and then
> > > > > >>>> mark the mapping for invalidation again when the write completes. Was
> > > > > >>>> that intentional?
> > > > > >>>> 
> > > > > >>>> It seems a little excessive and might hurt performance in some cases.
> > > > > >>>> OTOH, if you mix buffered and DIO you're asking for trouble anyway and
> > > > > >>>> this approach seems to give better cache coherency.
> > > > > >>> 
> > > > > >>> This follows the model implemented and documented in
> > > > > >>> generic_file_direct_write().
> > > > > >>> 
> > > > > >> 
> > > > > >> Ok, thanks. That makes sense, and the problem described in those
> > > > > >> comments is almost exactly the one I've seen in practice.
> > > > > >> 
> > > > > >> I'm still not 100% thrilled with the way that the NFS_INO_INVALID_DATA
> > > > > >> flag is handled, but that really has nothing to do with this patchset.
> > > > > >> 
> > > > > >> You can add my Tested-by to the set if you like...
> > > > > >> 
> > > > > > 
> > > > > > (re-sending with Trond's address fixed)
> > > > > > 
> > > > > > I may have spoken too soon...
> > > > > > 
> > > > > > This patchset didn't fix the problem once I cranked up the concurrency
> > > > > > from 100 child tasks to 1000. I think that HCH's patchset makes sense
> > > > > > and helps narrow the race window some, but the way that
> > > > > > nfs_revalidate_mapping/nfs_invalidate_mapping work is just racy.
> > > > > > 
> > > > > > The following patch does seem to fix it however. It's a combination of
> > > > > > a test patch that Trond gave me a while back and another change to
> > > > > > serialize the nfs_invalidate_mapping ops.
> > > > > > 
> > > > > > I think it's a reasonable approach to deal with the problem, but we
> > > > > > likely have some other areas that will need similar treatment since
> > > > > > they also check NFS_INO_INVALID_DATA: 
> > > > > > 
> > > > > >    nfs_write_pageuptodate
> > > > > >    nfs_readdir_search_for_cookie
> > > > > >    nfs_update_inode
> > > > > > 
> > > > > > Trond, thoughts? It's not quite ready for merge, but I'd like to get an
> > > > > > opinion on the basic approach, or whether you have an idea of how
> > > > > > to better handle the races here:
> > > > > 
> > > > > I think that it is reasonable for nfs_revalidate_mapping, but I don’t see how it is relevant to nfs_update_inode or nfs_write_pageuptodate.
> > > > > Readdir already has its own locking at the VFS level, so we shouldn’t need to care there.
> > > > > 
> > > > 
> > > > 
> > > > nfs_write_pageuptodate does this:
> > > > 
> > > > ---------------8<-----------------
> > > >         if (NFS_I(inode)->cache_validity & (NFS_INO_INVALID_DATA|NFS_INO_REVAL_PAGECACHE))
> > > >                 return false;
> > > > out:
> > > >         return PageUptodate(page) != 0;
> > > > ---------------8<-----------------
> > > > 
> > > > With the proposed patch, NFS_INO_INVALID_DATA would be cleared first and
> > > > only later would the page be invalidated. So, there's a race window in
> > > > there where the bit could be cleared but the page flag is still set,
> > > > even though it's on its way out the cache. So, I think we'd need to do
> > > > some similar sort of locking in there to make sure that doesn't happen.
> > > 
> > > We _cannot_ lock against nfs_revalidate_mapping() here, because we could
> > > end up deadlocking with invalidate_inode_pages2().
> > > 
> > > If you like, we could add a test for NFS_INO_INVALIDATING, to turn off
> > > the optimisation in that case, but I'd like to understand what the race
> > > would be: don't forget that the page is marked as PageUptodate(), which
> > > means that either invalidate_inode_pages2() has not yet reached this
> > > page, or that a read of the page succeeded after the invalidation was
> > > made.
> > > 
> > 
> > Right. The first situation seems wrong to me. We've marked the file as
> > INVALID and then cleared the bit to start the process of invalidating
> > the actual pages. It seems like nfs_write_pageuptodate ought not return
> > true even if PageUptodate() is still set at that point.
> > 
> > We could check NFS_INO_INVALIDATING, but we might miss that
> > optimization in a lot of cases just because something happens to be
> > in nfs_revalidate_mapping. Maybe that means that this bitlock isn't
> > sufficient and we need some other mechanism. I'm not sure what that
> > should be though.
> 
> Convert your patch to use wait_on_bit(), and then to call
> wait_on_bit_lock() if and only if you see NFS_INO_INVALID_DATA is set.
> 

I think that too would be racy...

We have to clear NFS_INO_INVALID_DATA while holding the i_lock, but we
can't wait_on_bit_lock() under that. So (pseudocode):

wait_on_bit
take i_lock
check and clear NFS_INO_INVALID_DATA
drop i_lock
wait_on_bit_lock

...so between dropping the i_lock and wait_on_bit_lock, we have a place
where another task could check the flag and find it clear.

I think the upshot here is that a bit_lock may not be the appropriate
thing to use to handle this. I'll have to ponder what might be better...
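
Roughly, the interleaving being described (illustrative only):

    task A                                     task B
    ------                                     ------
    wait_on_bit(NFS_INO_INVALIDATING)
    take i_lock
    test and clear NFS_INO_INVALID_DATA
    drop i_lock
                                               checks NFS_INO_INVALID_DATA: clear
                                               treats the mapping as valid
    wait_on_bit_lock(NFS_INO_INVALIDATING)
    invalidate pages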

> > > > nfs_update_inode just does this:
> > > > 
> > > >         if (invalid & NFS_INO_INVALID_DATA)
> > > >                 nfs_fscache_invalidate(inode);
> > > > 
> > > > ...again, since we clear the bit first with this patch, I think we have
> > > > a potential race window there too. We might not see it set in a
> > > > situation where we would have before. That case is a bit more
> > > > problematic since we can't sleep to wait on the bitlock there.
> > > 
> > > Umm... That test in nfs_update_inode() is there because we might just
> > > have _set_ the NFS_INO_INVALID_DATA bit.
> > > 
> > 
> > Correct. But do we need to force a fscache invalidation at that point,
> > or can it wait until we're going to invalidate the mapping too?
> 
> That's a question for David. My assumption is that since invalidation is
> handled asynchronously by the fscache layer itself, that we need to let
> it start that process as soon as possible, but perhaps these races are
> an indication that we should actually do it at the time when we call
> invalidate_inode_pages2() (or at the latest, when we're evicting the
> inode from the icache)...
> 
> 

Ok, looks like it just sets a flag, so if we can handle this somehow
w/o sleeping then it may not matter. Again, I'll have to ponder what
may be better than a bit_lock.

Thanks,
Trond Myklebust Jan. 25, 2014, 12:39 a.m. UTC | #7
On Jan 24, 2014, at 14:21, Jeff Layton <jlayton@redhat.com> wrote:

> On Fri, 24 Jan 2014 11:46:41 -0700
> Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>> 
>> Convert your patch to use wait_on_bit(), and then to call
>> wait_on_bit_lock() if and only if you see NFS_INO_INVALID_DATA is set.
>> 
> 
> I think that too would be racy...
> 
> We have to clear NFS_INO_INVALID_DATA while holding the i_lock, but we
> can't wait_on_bit_lock() under that. So (pseudocode):
> 
> wait_on_bit
> take i_lock
> check and clear NFS_INO_INVALID_DATA
> drop i_lock
> wait_on_bit_lock
> 
> ...so between dropping the i_lock and wait_on_bit_lock, we have a place
> where another task could check the flag and find it clear.


	for(;;) {
		wait_on_bit(NFS_INO_INVALIDATING)
		/* Optimisation: don’t lock NFS_INO_INVALIDATING
		 * if NFS_INO_INVALID_DATA was cleared while we waited.
		 */
		if (!test_bit(NFS_INO_INVALID_DATA))
			return;
		if (!test_and_set_bit_lock(NFS_INO_INVALIDATING))
			break;
	}
	spin_lock(inode->i_lock);
	if (!test_and_clear_bit(NFS_INO_INVALID_DATA)) {
		spin_unlock(inode->i_lock);
		goto out_raced;
	}
….
out_raced:
	clear_bit(NFS_INO_INVALIDATING)
	wake_up_bit(NFS_INO_INVALIDATING)


--
Trond Myklebust
Linux NFS client maintainer

Jeff Layton Jan. 25, 2014, 12:54 a.m. UTC | #8
On Fri, 24 Jan 2014 17:39:45 -0700
Trond Myklebust <trond.myklebust@primarydata.com> wrote:

> 
> On Jan 24, 2014, at 14:21, Jeff Layton <jlayton@redhat.com> wrote:
> 
> > On Fri, 24 Jan 2014 11:46:41 -0700
> > Trond Myklebust <trond.myklebust@primarydata.com> wrote:
> >> 
> >> Convert your patch to use wait_on_bit(), and then to call
> >> wait_on_bit_lock() if and only if you see NFS_INO_INVALID_DATA is set.
> >> 
> > 
> > I think that too would be racy...
> > 
> > We have to clear NFS_INO_INVALID_DATA while holding the i_lock, but we
> > can't wait_on_bit_lock() under that. So (pseudocode):
> > 
> > wait_on_bit
> > take i_lock
> > check and clear NFS_INO_INVALID_DATA
> > drop i_lock
> > wait_on_bit_lock
> > 
> > ...so between dropping the i_lock and wait_on_bit_lock, we have a place
> > where another task could check the flag and find it clear.
> 
> 
> 	for(;;) {
> 		wait_on_bit(NFS_INO_INVALIDATING)
> 		/* Optimisation: don’t lock NFS_INO_INVALIDATING
> 		 * if NFS_INO_INVALID_DATA was cleared while we waited.
> 		 */
> 		if (!test_bit(NFS_INO_INVALID_DATA))
> 			return;
> 		if (!test_and_set_bit_lock(NFS_INO_INVALIDATING))
> 			break;
> 	}
> 	spin_lock(inode->i_lock);
> 	if (!test_and_clear_bit(NFS_INO_INVALID_DATA)) {
> 		spin_unlock(inode->i_lock);
> 		goto out_raced;
> 	}
> ….
> out_raced:
> 	clear_bit(NFS_INO_INVALIDATING)
> 	wake_up_bit(NFS_INO_INVALIDATING)
> 
> 
> --
> Trond Myklebust
> Linux NFS client maintainer
> 

Hmm maybe. OTOH, if we're using atomic bitops do we need to deal with
the spinlock?  I'll ponder it over the weekend and give it a harder
look on Monday.

Thanks for the thoughts so far...
Trond Myklebust Jan. 25, 2014, 1:05 a.m. UTC | #9
On Jan 24, 2014, at 17:54, Jeff Layton <jlayton@redhat.com> wrote:

> On Fri, 24 Jan 2014 17:39:45 -0700
> Trond Myklebust <trond.myklebust@primarydata.com> wrote:
> 
>> 
>> On Jan 24, 2014, at 14:21, Jeff Layton <jlayton@redhat.com> wrote:
>> 
>>> On Fri, 24 Jan 2014 11:46:41 -0700
>>> Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>>>> 
>>>> Convert your patch to use wait_on_bit(), and then to call
>>>> wait_on_bit_lock() if and only if you see NFS_INO_INVALID_DATA is set.
>>>> 
>>> 
>>> I think that too would be racy...
>>> 
>>> We have to clear NFS_INO_INVALID_DATA while holding the i_lock, but we
>>> can't wait_on_bit_lock() under that. So (pseudocode):
>>> 
>>> wait_on_bit
>>> take i_lock
>>> check and clear NFS_INO_INVALID_DATA
>>> drop i_lock
>>> wait_on_bit_lock
>>> 
>>> ...so between dropping the i_lock and wait_on_bit_lock, we have a place
>>> where another task could check the flag and find it clear.
>> 
>> 
>> 	for(;;) {
>> 		wait_on_bit(NFS_INO_INVALIDATING)
>> 		/* Optimisation: don’t lock NFS_INO_INVALIDATING
>> 		 * if NFS_INO_INVALID_DATA was cleared while we waited.
>> 		 */
>> 		if (!test_bit(NFS_INO_INVALID_DATA))
>> 			return;
>> 		if (!test_and_set_bit_lock(NFS_INO_INVALIDATING))
>> 			break;
>> 	}
>> 	spin_lock(inode->i_lock);
>> 	if (!test_and_clear_bit(NFS_INO_INVALID_DATA)) {
>> 		spin_unlock(inode->i_lock);
>> 		goto out_raced;
>> 	}
>> ….
>> out_raced:
>> 	clear_bit(NFS_INO_INVALIDATING)
>> 	wake_up_bit(NFS_INO_INVALIDATING)
>> 
>> 
>> --
>> Trond Myklebust
>> Linux NFS client maintainer
>> 
> 
> Hmm maybe. OTOH, if we're using atomic bitops do we need to deal with
> the spinlock?  I'll ponder it over the weekend and give it a harder
> look on Monday.
> 

The NFS_I(inode)->cache_validity doesn’t use bitops, so the correct behaviour is to put NFS_INO_INVALIDATING inside NFS_I(inode)->flags (which is an atomic bit op field), and then continue to use the spin lock for NFS_INO_INVALID_DATA.

--
Trond Myklebust
Linux NFS client maintainer

Trond Myklebust Jan. 25, 2014, 1:11 a.m. UTC | #10
On Jan 24, 2014, at 18:05, Trond Myklebust <trond.myklebust@primarydata.com> wrote:

> 
> On Jan 24, 2014, at 17:54, Jeff Layton <jlayton@redhat.com> wrote:
> 
>> On Fri, 24 Jan 2014 17:39:45 -0700
>> Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>> 
>>> 
>>> On Jan 24, 2014, at 14:21, Jeff Layton <jlayton@redhat.com> wrote:
>>> 
>>>> On Fri, 24 Jan 2014 11:46:41 -0700
>>>> Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>>>>> 
>>>>> Convert your patch to use wait_on_bit(), and then to call
>>>>> wait_on_bit_lock() if and only if you see NFS_INO_INVALID_DATA is set.
>>>>> 
>>>> 
>>>> I think that too would be racy...
>>>> 
>>>> We have to clear NFS_INO_INVALID_DATA while holding the i_lock, but we
>>>> can't wait_on_bit_lock() under that. So (pseudocode):
>>>> 
>>>> wait_on_bit
>>>> take i_lock
>>>> check and clear NFS_INO_INVALID_DATA
>>>> drop i_lock
>>>> wait_on_bit_lock
>>>> 
>>>> ...so between dropping the i_lock and wait_on_bit_lock, we have a place
>>>> where another task could check the flag and find it clear.
>>> 
>>> 
>>> 	for(;;) {
>>> 		wait_on_bit(NFS_INO_INVALIDATING)
>>> 		/* Optimisation: don’t lock NFS_INO_INVALIDATING
>>> 		 * if NFS_INO_INVALID_DATA was cleared while we waited.
>>> 		 */
>>> 		if (!test_bit(NFS_INO_INVALID_DATA))
>>> 			return;
>>> 		if (!test_and_set_bit_lock(NFS_INO_INVALIDATING))
>>> 			break;
>>> 	}
>>> 	spin_lock(inode->i_lock);
>>> 	if (!test_and_clear_bit(NFS_INO_INVALID_DATA)) {
>>> 		spin_unlock(inode->i_lock);
>>> 		goto out_raced;
>>> 	}
>>> ….
>>> out_raced:
>>> 	clear_bit(NFS_INO_INVALIDATING)
>>> 	wake_up_bit(NFS_INO_INVALIDATING)
>>> 
>>> 
>>> --
>>> Trond Myklebust
>>> Linux NFS client maintainer
>>> 
>> 
>> Hmm maybe. OTOH, if we're using atomic bitops do we need to deal with
>> the spinlock?  I'll ponder it over the weekend and give it a harder
>> look on Monday.
>> 
> 
> The NFS_I(inode)->cache_validity doesn’t use bitops, so the correct behaviour is to put NFS_INO_INVALIDATING inside NFS_I(inode)->flags (which is an atomic bit op field), and then continue to use the spin lock for NFS_INO_INVALID_DATA.

In other words, please replace the atomic test_bit(NFS_INO_INVALID_DATA) and test_and_clear_bit(NFS_INO_INVALID_DATA) in the above pseudocode with the appropriate tests and clears of NFS_I(inode)->cache_validity.

--
Trond Myklebust
Linux NFS client maintainer
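
Putting the pieces of this sub-thread together, a rough rendering of the loop
with NFS_INO_INVALIDATING kept in NFS_I(inode)->flags and NFS_INO_INVALID_DATA
still tested and cleared under the i_lock might look like this (illustrative
only, not a tested or merged patch; the function name is made up):

	static int nfs_revalidate_mapping_serialized(struct inode *inode,
						     struct address_space *mapping)
	{
		struct nfs_inode *nfsi = NFS_I(inode);
		unsigned long *bitlock = &nfsi->flags;
		int ret;

		for (;;) {
			ret = wait_on_bit(bitlock, NFS_INO_INVALIDATING,
					  nfs_wait_bit_killable, TASK_KILLABLE);
			if (ret)
				return ret;
			spin_lock(&inode->i_lock);
			if (!(nfsi->cache_validity & NFS_INO_INVALID_DATA)) {
				/* someone else finished the invalidation */
				spin_unlock(&inode->i_lock);
				return 0;
			}
			spin_unlock(&inode->i_lock);
			if (!test_and_set_bit_lock(NFS_INO_INVALIDATING, bitlock))
				break;
		}

		spin_lock(&inode->i_lock);
		if (!(nfsi->cache_validity & NFS_INO_INVALID_DATA)) {
			/* raced: another task cleared the flag and invalidated */
			spin_unlock(&inode->i_lock);
			goto out_unlock;
		}
		nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
		spin_unlock(&inode->i_lock);

		ret = nfs_invalidate_mapping(inode, mapping);

	out_unlock:
		clear_bit_unlock(NFS_INO_INVALIDATING, bitlock);
		smp_mb__after_clear_bit();
		wake_up_bit(bitlock, NFS_INO_INVALIDATING);
		return ret;
	}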


Patch

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 00ad1c2..6fa07e1 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -977,11 +977,11 @@  static int nfs_invalidate_mapping(struct inode *inode, struct address_space *map
 		if (ret < 0)
 			return ret;
 	}
-	spin_lock(&inode->i_lock);
-	nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
-	if (S_ISDIR(inode->i_mode))
+	if (S_ISDIR(inode->i_mode)) {
+		spin_lock(&inode->i_lock);
 		memset(nfsi->cookieverf, 0, sizeof(nfsi->cookieverf));
-	spin_unlock(&inode->i_lock);
+		spin_unlock(&inode->i_lock);
+	}
 	nfs_inc_stats(inode, NFSIOS_DATAINVALIDATE);
 	nfs_fscache_wait_on_invalidate(inode);
 
@@ -1007,6 +1007,7 @@  static bool nfs_mapping_need_revalidate_inode(struct inode *inode)
 int nfs_revalidate_mapping(struct inode *inode, struct address_space *mapping)
 {
 	struct nfs_inode *nfsi = NFS_I(inode);
+	unsigned long *bitlock = &NFS_I(inode)->flags;
 	int ret = 0;
 
 	/* swapfiles are not supposed to be shared. */
@@ -1018,12 +1019,33 @@  int nfs_revalidate_mapping(struct inode *inode, struct address_space *mapping)
 		if (ret < 0)
 			goto out;
 	}
+
+	/*
+	 * We must clear NFS_INO_INVALID_DATA first to ensure that
+	 * invalidations that come in while we're shooting down the mappings
+	 * are respected. But, that leaves a race window where one revalidator
+	 * can clear the flag, and then another checks it before the mapping
+	 * gets invalidated. Fix that by serializing access to this part of
+	 * the function.
+	 */
+	ret = wait_on_bit_lock(bitlock, NFS_INO_INVALIDATING,
+				nfs_wait_bit_killable, TASK_KILLABLE);
+	if (ret)
+		goto out;
+
+	spin_lock(&inode->i_lock);
 	if (nfsi->cache_validity & NFS_INO_INVALID_DATA) {
+		nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
+		spin_unlock(&inode->i_lock);
 		trace_nfs_invalidate_mapping_enter(inode);
 		ret = nfs_invalidate_mapping(inode, mapping);
 		trace_nfs_invalidate_mapping_exit(inode, ret);
-	}
+	} else
+		spin_unlock(&inode->i_lock);
 
+	clear_bit_unlock(NFS_INO_INVALIDATING, bitlock);
+	smp_mb__after_clear_bit();
+	wake_up_bit(bitlock, NFS_INO_INVALIDATING);
 out:
 	return ret;
 }
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 4899737..18fb16f 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -215,6 +215,7 @@  struct nfs_inode {
 #define NFS_INO_ADVISE_RDPLUS	(0)		/* advise readdirplus */
 #define NFS_INO_STALE		(1)		/* possible stale inode */
 #define NFS_INO_ACL_LRU_SET	(2)		/* Inode is on the LRU list */
+#define NFS_INO_INVALIDATING	(3)		/* inode is being invalidated */
 #define NFS_INO_FLUSHING	(4)		/* inode is flushing out data */
 #define NFS_INO_FSCACHE		(5)		/* inode can be cached by FS-Cache */
 #define NFS_INO_FSCACHE_LOCK	(6)		/* FS-Cache cookie management lock */