
[7/7] nfs: page cache invalidation for dio

Message ID 20140124105213.2c40a783@tlielax.poochiereds.net (mailing list archive)
State New, archived

Commit Message

Jeff Layton Jan. 24, 2014, 3:52 p.m. UTC
On Wed, 22 Jan 2014 07:04:09 -0500
Jeff Layton <jlayton@redhat.com> wrote:

> On Wed, 22 Jan 2014 00:24:14 -0800
> Christoph Hellwig <hch@infradead.org> wrote:
> 
> > On Tue, Jan 21, 2014 at 02:21:59PM -0500, Jeff Layton wrote:
> > > In any case, this helps but it's a little odd. With this patch, you add
> > > an invalidate_inode_pages2 call prior to doing the DIO. But, you've
> > > also left in the call to nfs_zap_mapping in the completion codepath.
> > > 
> > > So now, we shoot down the mapping prior to doing a DIO write, and then
> > > mark the mapping for invalidation again when the write completes. Was
> > > that intentional?
> > > 
> > > It seems a little excessive and might hurt performance in some cases.
> > > OTOH, if you mix buffered and DIO you're asking for trouble anyway and
> > > this approach seems to give better cache coherency.
> > 
> > This follows the model implemented and documented in
> > generic_file_direct_write().
> > 
> 
> Ok, thanks. That makes sense, and the problem described in those
> comments is almost exactly the one I've seen in practice.
> 
> I'm still not 100% thrilled with the way that the NFS_INO_INVALID_DATA
> flag is handled, but that really has nothing to do with this patchset.
> 
> You can add my Tested-by to the set if you like...
> 

(re-sending with Trond's address fixed)

I may have spoken too soon...

This patchset didn't fix the problem once I cranked up the concurrency
from 100 child tasks to 1000. I think that HCH's patchset makes sense
and helps narrow the race window some, but the way that
nfs_revalidate_mapping/nfs_invalidate_mapping work is just racy.

The following patch does seem to fix it however. It's a combination of
a test patch that Trond gave me a while back and another change to
serialize the nfs_invalidate_mapping ops.

I think it's a reasonable approach to deal with the problem, but we
likely have some other areas that will need similar treatment since
they also check NFS_INO_INVALID_DATA: 

    nfs_write_pageuptodate
    nfs_readdir_search_for_cookie
    nfs_update_inode

Trond, thoughts? It's not quite ready for merge, but I'd like to get an
opinion on the basic approach, or whether you have an idea of how
to better handle the races here:

------------------8<--------------------

NFS: fix the handling of NFS_INO_INVALID_DATA flag in nfs_revalidate_mapping

There is a possible race in how the nfs_invalidate_mapping is handled.
Currently, we go and invalidate the pages in the file and then clear
NFS_INO_INVALID_DATA.

The problem is that it's possible for a stale page to creep into the
mapping after the page was invalidated (e.g., via readahead). If another
writer comes along and sets the flag after that happens but before
invalidate_inode_pages2 returns then we could clear the flag
without the cache having been properly invalidated.
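An illustrative interleaving of that race (editorial sketch; the task names
are made up, the ordering follows the description above):

    revalidating task                      other tasks
    -----------------                      -----------
    invalidate_inode_pages2() starts,
      shoots down page X
                                           readahead pulls stale page X back in
                                           a writer sets NFS_INO_INVALID_DATA
    invalidate_inode_pages2() returns
    clear NFS_INO_INVALID_DATA             <- flag lost, stale page X still cached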

So, we must clear the flag first and then invalidate the pages.  This
however, opens another race:

It's possible to have two concurrent read() calls that end up in
nfs_revalidate_mapping at the same time. The first one clears the
NFS_INO_INVALID_DATA flag and then goes to call nfs_invalidate_mapping.

Just before calling that though, the other task races in, checks the
flag and finds it cleared. At that point, it sees that the mapping is
good and gets the lock on the page, allowing the read() to be satisfied
from the cache even though the data is no longer valid.

This effect is easily manifested by running diotest3 from the LTP test
suite on NFS. That program does a series of DIO writes and buffered
reads. The operations are serialized and page-aligned, but the existing
code fails the test since it occasionally allows a read to come out of
the cache instead of being done on the wire when it should. While mixing
direct and buffered I/O isn't recommended, I believe it's possible to
hit this in other ways that just use buffered I/O, even though that
makes it harder to reproduce.

The problem is that the checking/clearing of that flag and the
invalidation of the mapping need to happen as a unit. Fix this by
serializing concurrent invalidations with a bitlock.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/nfs/inode.c         | 32 +++++++++++++++++++++++++++-----
 include/linux/nfs_fs.h |  1 +
 2 files changed, 28 insertions(+), 5 deletions(-)

Comments

Trond Myklebust Jan. 24, 2014, 5:11 p.m. UTC | #1
On Jan 24, 2014, at 8:52, Jeff Layton <jlayton@redhat.com> wrote:

> On Wed, 22 Jan 2014 07:04:09 -0500
> Jeff Layton <jlayton@redhat.com> wrote:
> 
>> On Wed, 22 Jan 2014 00:24:14 -0800
>> Christoph Hellwig <hch@infradead.org> wrote:
>> 
>>> On Tue, Jan 21, 2014 at 02:21:59PM -0500, Jeff Layton wrote:
>>>> In any case, this helps but it's a little odd. With this patch, you add
>>>> an invalidate_inode_pages2 call prior to doing the DIO. But, you've
>>>> also left in the call to nfs_zap_mapping in the completion codepath.
>>>> 
>>>> So now, we shoot down the mapping prior to doing a DIO write, and then
>>>> mark the mapping for invalidation again when the write completes. Was
>>>> that intentional?
>>>> 
>>>> It seems a little excessive and might hurt performance in some cases.
>>>> OTOH, if you mix buffered and DIO you're asking for trouble anyway and
>>>> this approach seems to give better cache coherency.
>>> 
>>> This follows the model implemented and documented in
>>> generic_file_direct_write().
>>> 
>> 
>> Ok, thanks. That makes sense, and the problem described in those
>> comments is almost exactly the one I've seen in practice.
>> 
>> I'm still not 100% thrilled with the way that the NFS_INO_INVALID_DATA
>> flag is handled, but that really has nothing to do with this patchset.
>> 
>> You can add my Tested-by to the set if you like...
>> 
> 
> (re-sending with Trond's address fixed)
> 
> I may have spoken too soon...
> 
> This patchset didn't fix the problem once I cranked up the concurrency
> from 100 child tasks to 1000. I think that HCH's patchset makes sense
> and helps narrow the race window some, but the way that
> nfs_revalidate_mapping/nfs_invalidate_mapping work is just racy.
> 
> The following patch does seem to fix it however. It's a combination of
> a test patch that Trond gave me a while back and another change to
> serialize the nfs_invalidate_mapping ops.
> 
> I think it's a reasonable approach to deal with the problem, but we
> likely have some other areas that will need similar treatment since
> they also check NFS_INO_INVALID_DATA: 
> 
>    nfs_write_pageuptodate
>    nfs_readdir_search_for_cookie
>    nfs_update_inode
> 
> Trond, thoughts? It's not quite ready for merge, but I'd like to get an
> opinion on the basic approach, or whether you have an idea of how
> to better handle the races here:

I think that it is reasonable for nfs_revalidate_mapping, but I don’t see how it is relevant to nfs_update_inode or nfs_write_pageuptodate.
Readdir already has its own locking at the VFS level, so we shouldn’t need to care there.

Cheers
  Trond
--
Trond Myklebust
Linux NFS client maintainer

Jeff Layton Jan. 24, 2014, 5:29 p.m. UTC | #2
On Fri, 24 Jan 2014 10:11:11 -0700
Trond Myklebust <trond.myklebust@primarydata.com> wrote:

> 
> On Jan 24, 2014, at 8:52, Jeff Layton <jlayton@redhat.com> wrote:
> 
> > On Wed, 22 Jan 2014 07:04:09 -0500
> > Jeff Layton <jlayton@redhat.com> wrote:
> > 
> >> On Wed, 22 Jan 2014 00:24:14 -0800
> >> Christoph Hellwig <hch@infradead.org> wrote:
> >> 
> >>> On Tue, Jan 21, 2014 at 02:21:59PM -0500, Jeff Layton wrote:
> >>>> In any case, this helps but it's a little odd. With this patch, you add
> >>>> an invalidate_inode_pages2 call prior to doing the DIO. But, you've
> >>>> also left in the call to nfs_zap_mapping in the completion codepath.
> >>>> 
> >>>> So now, we shoot down the mapping prior to doing a DIO write, and then
> >>>> mark the mapping for invalidation again when the write completes. Was
> >>>> that intentional?
> >>>> 
> >>>> It seems a little excessive and might hurt performance in some cases.
> >>>> OTOH, if you mix buffered and DIO you're asking for trouble anyway and
> >>>> this approach seems to give better cache coherency.
> >>> 
> >>> This follows the model implemented and documented in
> >>> generic_file_direct_write().
> >>> 
> >> 
> >> Ok, thanks. That makes sense, and the problem described in those
> >> comments is almost exactly the one I've seen in practice.
> >> 
> >> I'm still not 100% thrilled with the way that the NFS_INO_INVALID_DATA
> >> flag is handled, but that really has nothing to do with this patchset.
> >> 
> >> You can add my Tested-by to the set if you like...
> >> 
> > 
> > (re-sending with Trond's address fixed)
> > 
> > I may have spoken too soon...
> > 
> > This patchset didn't fix the problem once I cranked up the concurrency
> > from 100 child tasks to 1000. I think that HCH's patchset makes sense
> > and helps narrow the race window some, but the way that
> > nfs_revalidate_mapping/nfs_invalidate_mapping work is just racy.
> > 
> > The following patch does seem to fix it however. It's a combination of
> > a test patch that Trond gave me a while back and another change to
> > serialize the nfs_invalidate_mapping ops.
> > 
> > I think it's a reasonable approach to deal with the problem, but we
> > likely have some other areas that will need similar treatment since
> > they also check NFS_INO_INVALID_DATA: 
> > 
> >    nfs_write_pageuptodate
> >    nfs_readdir_search_for_cookie
> >    nfs_update_inode
> > 
> > Trond, thoughts? It's not quite ready for merge, but I'd like to get an
> > opinion on the basic approach, or whether you have an idea of how
> > to better handle the races here:
> 
> I think that it is reasonable for nfs_revalidate_mapping, but I don’t see how it is relevant to nfs_update_inode or nfs_write_pageuptodate.
> Readdir already has its own locking at the VFS level, so we shouldn’t need to care there.
> 


nfs_write_pageuptodate does this:

---------------8<-----------------
        if (NFS_I(inode)->cache_validity & (NFS_INO_INVALID_DATA|NFS_INO_REVAL_PAGECACHE))
                return false;
out:
        return PageUptodate(page) != 0;
---------------8<-----------------

With the proposed patch, NFS_INO_INVALID_DATA would be cleared first and
only later would the page be invalidated. So, there's a race window in
there where the bit could be cleared but the page flag is still set,
even though it's on its way out the cache. So, I think we'd need to do
some similar sort of locking in there to make sure that doesn't happen.

nfs_update_inode just does this:

        if (invalid & NFS_INO_INVALID_DATA)
                nfs_fscache_invalidate(inode);

...again, since we clear the bit first with this patch, I think we have
a potential race window there too. We might not see it set in a
situation where we would have before. That case is a bit more
problematic since we can't sleep to wait on the bitlock there.

It might be best to just get rid of that call altogether and move it
into nfs_invalidate_mapping. It seems to me that we ought to just
handle fscache the same way we do the pagecache when it comes to
invalidation.
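
A rough sketch of that idea (hypothetical and untested; the S_ISDIR
cookieverf handling and tracing in the real function are omitted for
brevity):

	/*
	 * Hypothetical: invalidate fscache alongside the pagecache in
	 * nfs_invalidate_mapping(), instead of from nfs_update_inode().
	 */
	static int nfs_invalidate_mapping(struct inode *inode,
					  struct address_space *mapping)
	{
		if (mapping->nrpages != 0) {
			int ret = invalidate_inode_pages2(mapping);

			if (ret < 0)
				return ret;
		}
		nfs_fscache_invalidate(inode);	/* moved from nfs_update_inode() */
		nfs_inc_stats(inode, NFSIOS_DATAINVALIDATE);
		nfs_fscache_wait_on_invalidate(inode);
		return 0;
	}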

As far as the readdir code goes, I haven't looked as closely at that
yet. I just noticed that it checked for NFS_INO_INVALID_DATA. Once we
settle the other two cases, I'll give that closer scrutiny.

Thanks,
Trond Myklebust Jan. 24, 2014, 5:40 p.m. UTC | #3
On Fri, 2014-01-24 at 12:29 -0500, Jeff Layton wrote:
> On Fri, 24 Jan 2014 10:11:11 -0700
> Trond Myklebust <trond.myklebust@primarydata.com> wrote:
> 
> > 
> > On Jan 24, 2014, at 8:52, Jeff Layton <jlayton@redhat.com> wrote:
> > 
> > > On Wed, 22 Jan 2014 07:04:09 -0500
> > > Jeff Layton <jlayton@redhat.com> wrote:
> > > 
> > >> On Wed, 22 Jan 2014 00:24:14 -0800
> > >> Christoph Hellwig <hch@infradead.org> wrote:
> > >> 
> > >>> On Tue, Jan 21, 2014 at 02:21:59PM -0500, Jeff Layton wrote:
> > >>>> In any case, this helps but it's a little odd. With this patch, you add
> > >>>> an invalidate_inode_pages2 call prior to doing the DIO. But, you've
> > >>>> also left in the call to nfs_zap_mapping in the completion codepath.
> > >>>> 
> > >>>> So now, we shoot down the mapping prior to doing a DIO write, and then
> > >>>> mark the mapping for invalidation again when the write completes. Was
> > >>>> that intentional?
> > >>>> 
> > >>>> It seems a little excessive and might hurt performance in some cases.
> > >>>> OTOH, if you mix buffered and DIO you're asking for trouble anyway and
> > >>>> this approach seems to give better cache coherency.
> > >>> 
> > >>> This follows the model implemented and documented in
> > >>> generic_file_direct_write().
> > >>> 
> > >> 
> > >> Ok, thanks. That makes sense, and the problem described in those
> > >> comments is almost exactly the one I've seen in practice.
> > >> 
> > >> I'm still not 100% thrilled with the way that the NFS_INO_INVALID_DATA
> > >> flag is handled, but that really has nothing to do with this patchset.
> > >> 
> > >> You can add my Tested-by to the set if you like...
> > >> 
> > > 
> > > (re-sending with Trond's address fixed)
> > > 
> > > I may have spoken too soon...
> > > 
> > > This patchset didn't fix the problem once I cranked up the concurrency
> > > from 100 child tasks to 1000. I think that HCH's patchset makes sense
> > > and helps narrow the race window some, but the way that
> > > nfs_revalidate_mapping/nfs_invalidate_mapping work is just racy.
> > > 
> > > The following patch does seem to fix it however. It's a combination of
> > > a test patch that Trond gave me a while back and another change to
> > > serialize the nfs_invalidate_mapping ops.
> > > 
> > > I think it's a reasonable approach to deal with the problem, but we
> > > likely have some other areas that will need similar treatment since
> > > they also check NFS_INO_INVALID_DATA: 
> > > 
> > >    nfs_write_pageuptodate
> > >    nfs_readdir_search_for_cookie
> > >    nfs_update_inode
> > > 
> > > Trond, thoughts? It's not quite ready for merge, but I'd like to get an
> > > opinion on the basic approach, or whether you have an idea of how
> > > to better handle the races here:
> > 
> > I think that it is reasonable for nfs_revalidate_mapping, but I don’t see how it is relevant to nfs_update_inode or nfs_write_pageuptodate.
> > Readdir already has its own locking at the VFS level, so we shouldn’t need to care there.
> > 
> 
> 
> nfs_write_pageuptodate does this:
> 
> ---------------8<-----------------
>         if (NFS_I(inode)->cache_validity & (NFS_INO_INVALID_DATA|NFS_INO_REVAL_PAGECACHE))
>                 return false;
> out:
>         return PageUptodate(page) != 0;
> ---------------8<-----------------
> 
> With the proposed patch, NFS_INO_INVALID_DATA would be cleared first and
> only later would the page be invalidated. So, there's a race window in
> there where the bit could be cleared but the page flag is still set,
> even though it's on its way out the cache. So, I think we'd need to do
> some similar sort of locking in there to make sure that doesn't happen.

We _cannot_ lock against nfs_revalidate_mapping() here, because we could
end up deadlocking with invalidate_inode_pages2().

If you like, we could add a test for NFS_INO_INVALIDATING, to turn off
the optimisation in that case, but I'd like to understand what the race
would be: don't forget that the page is marked as PageUptodate(), which
means that either invalidate_inode_pages2() has not yet reached this
page, or that a read of the page succeeded after the invalidation was
made.

> nfs_update_inode just does this:
> 
>         if (invalid & NFS_INO_INVALID_DATA)
>                 nfs_fscache_invalidate(inode);
> 
> ...again, since we clear the bit first with this patch, I think we have
> a potential race window there too. We might not see it set in a
> situation where we would have before. That case is a bit more
> problematic since we can't sleep to wait on the bitlock there.

Umm... That test in nfs_update_inode() is there because we might just
have _set_ the NFS_INO_INVALID_DATA bit.

> 
> It might be best to just get rid of that call altogether and move it
> into nfs_invalidate_mapping. It seems to me that we ought to just
> handle fscache the same way we do the pagecache when it comes to
> invalidation.
> 
> As far as the readdir code goes, I haven't looked as closely at that
> yet. I just noticed that it checked for NFS_INO_INVALID_DATA. Once we
> settle the other two cases, I'll give that closer scrutiny.
> 
> Thanks,
Jeff Layton Jan. 24, 2014, 6 p.m. UTC | #4
On Fri, 24 Jan 2014 10:40:06 -0700
Trond Myklebust <trond.myklebust@primarydata.com> wrote:

> On Fri, 2014-01-24 at 12:29 -0500, Jeff Layton wrote:
> > On Fri, 24 Jan 2014 10:11:11 -0700
> > Trond Myklebust <trond.myklebust@primarydata.com> wrote:
> > 
> > > 
> > > On Jan 24, 2014, at 8:52, Jeff Layton <jlayton@redhat.com> wrote:
> > > 
> > > > On Wed, 22 Jan 2014 07:04:09 -0500
> > > > Jeff Layton <jlayton@redhat.com> wrote:
> > > > 
> > > >> On Wed, 22 Jan 2014 00:24:14 -0800
> > > >> Christoph Hellwig <hch@infradead.org> wrote:
> > > >> 
> > > >>> On Tue, Jan 21, 2014 at 02:21:59PM -0500, Jeff Layton wrote:
> > > >>>> In any case, this helps but it's a little odd. With this patch, you add
> > > >>>> an invalidate_inode_pages2 call prior to doing the DIO. But, you've
> > > >>>> also left in the call to nfs_zap_mapping in the completion codepath.
> > > >>>> 
> > > >>>> So now, we shoot down the mapping prior to doing a DIO write, and then
> > > >>>> mark the mapping for invalidation again when the write completes. Was
> > > >>>> that intentional?
> > > >>>> 
> > > >>>> It seems a little excessive and might hurt performance in some cases.
> > > >>>> OTOH, if you mix buffered and DIO you're asking for trouble anyway and
> > > >>>> this approach seems to give better cache coherency.
> > > >>> 
> > > >>> This follows the model implemented and documented in
> > > >>> generic_file_direct_write().
> > > >>> 
> > > >> 
> > > >> Ok, thanks. That makes sense, and the problem described in those
> > > >> comments is almost exactly the one I've seen in practice.
> > > >> 
> > > >> I'm still not 100% thrilled with the way that the NFS_INO_INVALID_DATA
> > > >> flag is handled, but that really has nothing to do with this patchset.
> > > >> 
> > > >> You can add my Tested-by to the set if you like...
> > > >> 
> > > > 
> > > > (re-sending with Trond's address fixed)
> > > > 
> > > > I may have spoken too soon...
> > > > 
> > > > This patchset didn't fix the problem once I cranked up the concurrency
> > > > from 100 child tasks to 1000. I think that HCH's patchset makes sense
> > > > and helps narrow the race window some, but the way that
> > > > nfs_revalidate_mapping/nfs_invalidate_mapping work is just racy.
> > > > 
> > > > The following patch does seem to fix it however. It's a combination of
> > > > a test patch that Trond gave me a while back and another change to
> > > > serialize the nfs_invalidate_mapping ops.
> > > > 
> > > > I think it's a reasonable approach to deal with the problem, but we
> > > > likely have some other areas that will need similar treatment since
> > > > they also check NFS_INO_INVALID_DATA: 
> > > > 
> > > >    nfs_write_pageuptodate
> > > >    nfs_readdir_search_for_cookie
> > > >    nfs_update_inode
> > > > 
> > > > Trond, thoughts? It's not quite ready for merge, but I'd like to get an
> > > > opinion on the basic approach, or whether you have an idea of how
> > > > to better handle the races here:
> > > 
> > > I think that it is reasonable for nfs_revalidate_mapping, but I don’t see how it is relevant to nfs_update_inode or nfs_write_pageuptodate.
> > > Readdir already has its own locking at the VFS level, so we shouldn’t need to care there.
> > > 
> > 
> > 
> > nfs_write_pageuptodate does this:
> > 
> > ---------------8<-----------------
> >         if (NFS_I(inode)->cache_validity & (NFS_INO_INVALID_DATA|NFS_INO_REVAL_PAGECACHE))
> >                 return false;
> > out:
> >         return PageUptodate(page) != 0;
> > ---------------8<-----------------
> > 
> > With the proposed patch, NFS_INO_INVALID_DATA would be cleared first and
> > only later would the page be invalidated. So, there's a race window in
> > there where the bit could be cleared but the page flag is still set,
> > even though it's on its way out the cache. So, I think we'd need to do
> > some similar sort of locking in there to make sure that doesn't happen.
> 
> We _cannot_ lock against nfs_revalidate_mapping() here, because we could
> end up deadlocking with invalidate_inode_pages2().
> 
> If you like, we could add a test for NFS_INO_INVALIDATING, to turn off
> the optimisation in that case, but I'd like to understand what the race
> would be: don't forget that the page is marked as PageUptodate(), which
> means that either invalidate_inode_pages2() has not yet reached this
> page, or that a read of the page succeeded after the invalidation was
> made.
> 

Right. The first situation seems wrong to me. We've marked the file as
INVALID and then cleared the bit to start the process of invalidating
the actual pages. It seems like nfs_write_pageuptodate ought not return
true even if PageUptodate() is still set at that point.

We could check NFS_INO_INVALIDATING, but we might miss that
optimization in a lot of cases just because something happens to be
in nfs_revalidate_mapping. Maybe that means that this bitlock isn't
sufficient and we need some other mechanism. I'm not sure what that
should be though.
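
For concreteness, the sort of check being discussed might look roughly like
this (a sketch only, simplified from the snippet quoted above; whether losing
the optimisation while an invalidation is in flight is acceptable is exactly
the open question):

	static bool nfs_write_pageuptodate(struct page *page, struct inode *inode)
	{
		struct nfs_inode *nfsi = NFS_I(inode);

		if (nfsi->cache_validity & NFS_INO_REVAL_PAGECACHE)
			return false;
		/* don't trust PageUptodate() while an invalidation is in flight */
		if (test_bit(NFS_INO_INVALIDATING, &nfsi->flags))
			return false;
		if (nfsi->cache_validity & NFS_INO_INVALID_DATA)
			return false;
		return PageUptodate(page) != 0;
	}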

> > nfs_update_inode just does this:
> > 
> >         if (invalid & NFS_INO_INVALID_DATA)
> >                 nfs_fscache_invalidate(inode);
> > 
> > ...again, since we clear the bit first with this patch, I think we have
> > a potential race window there too. We might not see it set in a
> > situation where we would have before. That case is a bit more
> > problematic since we can't sleep to wait on the bitlock there.
> 
> Umm... That test in nfs_update_inode() is there because we might just
> have _set_ the NFS_INO_INVALID_DATA bit.
> 

Correct. But do we need to force a fscache invalidation at that point,
or can it wait until we're going to invalidate the mapping too?

> > 
> > It might be best to just get rid of that call altogether and move it
> > into nfs_invalidate_mapping. It seems to me that we ought to just
> > handle fscache the same way we do the pagecache when it comes to
> > invalidation.
> > 
> > As far as the readdir code goes, I haven't looked as closely at that
> > yet. I just noticed that it checked for NFS_INO_INVALID_DATA. Once we
> > settle the other two cases, I'll give that closer scrutiny.
> >
Trond Myklebust Jan. 24, 2014, 6:46 p.m. UTC | #5
On Fri, 2014-01-24 at 13:00 -0500, Jeff Layton wrote:
> On Fri, 24 Jan 2014 10:40:06 -0700
> Trond Myklebust <trond.myklebust@primarydata.com> wrote:
> 
> > On Fri, 2014-01-24 at 12:29 -0500, Jeff Layton wrote:
> > > On Fri, 24 Jan 2014 10:11:11 -0700
> > > Trond Myklebust <trond.myklebust@primarydata.com> wrote:
> > > 
> > > > 
> > > > On Jan 24, 2014, at 8:52, Jeff Layton <jlayton@redhat.com> wrote:
> > > > 
> > > > > On Wed, 22 Jan 2014 07:04:09 -0500
> > > > > Jeff Layton <jlayton@redhat.com> wrote:
> > > > > 
> > > > >> On Wed, 22 Jan 2014 00:24:14 -0800
> > > > >> Christoph Hellwig <hch@infradead.org> wrote:
> > > > >> 
> > > > >>> On Tue, Jan 21, 2014 at 02:21:59PM -0500, Jeff Layton wrote:
> > > > >>>> In any case, this helps but it's a little odd. With this patch, you add
> > > > >>>> an invalidate_inode_pages2 call prior to doing the DIO. But, you've
> > > > >>>> also left in the call to nfs_zap_mapping in the completion codepath.
> > > > >>>> 
> > > > >>>> So now, we shoot down the mapping prior to doing a DIO write, and then
> > > > >>>> mark the mapping for invalidation again when the write completes. Was
> > > > >>>> that intentional?
> > > > >>>> 
> > > > >>>> It seems a little excessive and might hurt performance in some cases.
> > > > >>>> OTOH, if you mix buffered and DIO you're asking for trouble anyway and
> > > > >>>> this approach seems to give better cache coherency.
> > > > >>> 
> > > > >>> This follows the model implemented and documented in
> > > > >>> generic_file_direct_write().
> > > > >>> 
> > > > >> 
> > > > >> Ok, thanks. That makes sense, and the problem described in those
> > > > >> comments is almost exactly the one I've seen in practice.
> > > > >> 
> > > > >> I'm still not 100% thrilled with the way that the NFS_INO_INVALID_DATA
> > > > >> flag is handled, but that really has nothing to do with this patchset.
> > > > >> 
> > > > >> You can add my Tested-by to the set if you like...
> > > > >> 
> > > > > 
> > > > > (re-sending with Trond's address fixed)
> > > > > 
> > > > > I may have spoken too soon...
> > > > > 
> > > > > This patchset didn't fix the problem once I cranked up the concurrency
> > > > > from 100 child tasks to 1000. I think that HCH's patchset makes sense
> > > > > and helps narrow the race window some, but the way that
> > > > > nfs_revalidate_mapping/nfs_invalidate_mapping work is just racy.
> > > > > 
> > > > > The following patch does seem to fix it however. It's a combination of
> > > > > a test patch that Trond gave me a while back and another change to
> > > > > serialize the nfs_invalidate_mapping ops.
> > > > > 
> > > > > I think it's a reasonable approach to deal with the problem, but we
> > > > > likely have some other areas that will need similar treatment since
> > > > > they also check NFS_INO_INVALID_DATA: 
> > > > > 
> > > > >    nfs_write_pageuptodate
> > > > >    nfs_readdir_search_for_cookie
> > > > >    nfs_update_inode
> > > > > 
> > > > > Trond, thoughts? It's not quite ready for merge, but I'd like to get an
> > > > > opinion on the basic approach, or whether you have an idea of how
> > > > > to better handle the races here:
> > > > 
> > > > I think that it is reasonable for nfs_revalidate_mapping, but I don’t see how it is relevant to nfs_update_inode or nfs_write_pageuptodate.
> > > > Readdir already has its own locking at the VFS level, so we shouldn’t need to care there.
> > > > 
> > > 
> > > 
> > > nfs_write_pageuptodate does this:
> > > 
> > > ---------------8<-----------------
> > >         if (NFS_I(inode)->cache_validity & (NFS_INO_INVALID_DATA|NFS_INO_REVAL_PAGECACHE))
> > >                 return false;
> > > out:
> > >         return PageUptodate(page) != 0;
> > > ---------------8<-----------------
> > > 
> > > With the proposed patch, NFS_INO_INVALID_DATA would be cleared first and
> > > only later would the page be invalidated. So, there's a race window in
> > > there where the bit could be cleared but the page flag is still set,
> > > even though it's on its way out the cache. So, I think we'd need to do
> > > some similar sort of locking in there to make sure that doesn't happen.
> > 
> > We _cannot_ lock against nfs_revalidate_mapping() here, because we could
> > end up deadlocking with invalidate_inode_pages2().
> > 
> > If you like, we could add a test for NFS_INO_INVALIDATING, to turn off
> > the optimisation in that case, but I'd like to understand what the race
> > would be: don't forget that the page is marked as PageUptodate(), which
> > means that either invalidate_inode_pages2() has not yet reached this
> > page, or that a read of the page succeeded after the invalidation was
> > made.
> > 
> 
> Right. The first situation seems wrong to me. We've marked the file as
> INVALID and then cleared the bit to start the process of invalidating
> the actual pages. It seems like nfs_write_pageuptodate ought not return
> true even if PageUptodate() is still set at that point.
> 
> We could check NFS_INO_INVALIDATING, but we might miss that
> optimization in a lot of cases just because something happens to be
> in nfs_revalidate_mapping. Maybe that means that this bitlock isn't
> sufficient and we need some other mechanism. I'm not sure what that
> should be though.

Convert your patch to use wait_on_bit(), and then to call
wait_on_bit_lock() if and only if you see NFS_INO_INVALID_DATA is set.

> > > nfs_update_inode just does this:
> > > 
> > >         if (invalid & NFS_INO_INVALID_DATA)
> > >                 nfs_fscache_invalidate(inode);
> > > 
> > > ...again, since we clear the bit first with this patch, I think we have
> > > a potential race window there too. We might not see it set in a
> > > situation where we would have before. That case is a bit more
> > > problematic since we can't sleep to wait on the bitlock there.
> > 
> > Umm... That test in nfs_update_inode() is there because we might just
> > have _set_ the NFS_INO_INVALID_DATA bit.
> > 
> 
> Correct. But do we need to force a fscache invalidation at that point,
> or can it wait until we're going to invalidate the mapping too?

That's a question for David. My assumption is that since invalidation is
handled asynchronously by the fscache layer itself, that we need to let
it start that process as soon as possible, but perhaps these races are
an indication that we should actually do it at the time when we call
invalidate_inode_pages2() (or at the latest, when we're evicting the
inode from the icache)...
Jeff Layton Jan. 24, 2014, 9:21 p.m. UTC | #6
On Fri, 24 Jan 2014 11:46:41 -0700
Trond Myklebust <trond.myklebust@primarydata.com> wrote:

> On Fri, 2014-01-24 at 13:00 -0500, Jeff Layton wrote:
> > On Fri, 24 Jan 2014 10:40:06 -0700
> > Trond Myklebust <trond.myklebust@primarydata.com> wrote:
> > 
> > > On Fri, 2014-01-24 at 12:29 -0500, Jeff Layton wrote:
> > > > On Fri, 24 Jan 2014 10:11:11 -0700
> > > > Trond Myklebust <trond.myklebust@primarydata.com> wrote:
> > > > 
> > > > > 
> > > > > On Jan 24, 2014, at 8:52, Jeff Layton <jlayton@redhat.com> wrote:
> > > > > 
> > > > > > On Wed, 22 Jan 2014 07:04:09 -0500
> > > > > > Jeff Layton <jlayton@redhat.com> wrote:
> > > > > > 
> > > > > >> On Wed, 22 Jan 2014 00:24:14 -0800
> > > > > >> Christoph Hellwig <hch@infradead.org> wrote:
> > > > > >> 
> > > > > >>> On Tue, Jan 21, 2014 at 02:21:59PM -0500, Jeff Layton wrote:
> > > > > >>>> In any case, this helps but it's a little odd. With this patch, you add
> > > > > >>>> an invalidate_inode_pages2 call prior to doing the DIO. But, you've
> > > > > >>>> also left in the call to nfs_zap_mapping in the completion codepath.
> > > > > >>>> 
> > > > > >>>> So now, we shoot down the mapping prior to doing a DIO write, and then
> > > > > >>>> mark the mapping for invalidation again when the write completes. Was
> > > > > >>>> that intentional?
> > > > > >>>> 
> > > > > >>>> It seems a little excessive and might hurt performance in some cases.
> > > > > >>>> OTOH, if you mix buffered and DIO you're asking for trouble anyway and
> > > > > >>>> this approach seems to give better cache coherency.
> > > > > >>> 
> > > > > >>> This follows the model implemented and documented in
> > > > > >>> generic_file_direct_write().
> > > > > >>> 
> > > > > >> 
> > > > > >> Ok, thanks. That makes sense, and the problem described in those
> > > > > >> comments is almost exactly the one I've seen in practice.
> > > > > >> 
> > > > > >> I'm still not 100% thrilled with the way that the NFS_INO_INVALID_DATA
> > > > > >> flag is handled, but that really has nothing to do with this patchset.
> > > > > >> 
> > > > > >> You can add my Tested-by to the set if you like...
> > > > > >> 
> > > > > > 
> > > > > > (re-sending with Trond's address fixed)
> > > > > > 
> > > > > > I may have spoken too soon...
> > > > > > 
> > > > > > This patchset didn't fix the problem once I cranked up the concurrency
> > > > > > from 100 child tasks to 1000. I think that HCH's patchset makes sense
> > > > > > and helps narrow the race window some, but the way that
> > > > > > nfs_revalidate_mapping/nfs_invalidate_mapping work is just racy.
> > > > > > 
> > > > > > The following patch does seem to fix it however. It's a combination of
> > > > > > a test patch that Trond gave me a while back and another change to
> > > > > > serialize the nfs_invalidate_mapping ops.
> > > > > > 
> > > > > > I think it's a reasonable approach to deal with the problem, but we
> > > > > > likely have some other areas that will need similar treatment since
> > > > > > they also check NFS_INO_INVALID_DATA: 
> > > > > > 
> > > > > >    nfs_write_pageuptodate
> > > > > >    nfs_readdir_search_for_cookie
> > > > > >    nfs_update_inode
> > > > > > 
> > > > > > Trond, thoughts? It's not quite ready for merge, but I'd like to get an
> > > > > > opinion on the basic approach, or whether you have an idea of how
> > > > > > to better handle the races here:
> > > > > 
> > > > > I think that it is reasonable for nfs_revalidate_mapping, but I don’t see how it is relevant to nfs_update_inode or nfs_write_pageuptodate.
> > > > > Readdir already has its own locking at the VFS level, so we shouldn’t need to care there.
> > > > > 
> > > > 
> > > > 
> > > > nfs_write_pageuptodate does this:
> > > > 
> > > > ---------------8<-----------------
> > > >         if (NFS_I(inode)->cache_validity & (NFS_INO_INVALID_DATA|NFS_INO_REVAL_PAGECACHE))
> > > >                 return false;
> > > > out:
> > > >         return PageUptodate(page) != 0;
> > > > ---------------8<-----------------
> > > > 
> > > > With the proposed patch, NFS_INO_INVALID_DATA would be cleared first and
> > > > only later would the page be invalidated. So, there's a race window in
> > > > there where the bit could be cleared but the page flag is still set,
> > > > even though it's on its way out the cache. So, I think we'd need to do
> > > > some similar sort of locking in there to make sure that doesn't happen.
> > > 
> > > We _cannot_ lock against nfs_revalidate_mapping() here, because we could
> > > end up deadlocking with invalidate_inode_pages2().
> > > 
> > > If you like, we could add a test for NFS_INO_INVALIDATING, to turn off
> > > the optimisation in that case, but I'd like to understand what the race
> > > would be: don't forget that the page is marked as PageUptodate(), which
> > > means that either invalidate_inode_pages2() has not yet reached this
> > > page, or that a read of the page succeeded after the invalidation was
> > > made.
> > > 
> > 
> > Right. The first situation seems wrong to me. We've marked the file as
> > INVALID and then cleared the bit to start the process of invalidating
> > the actual pages. It seems like nfs_write_pageuptodate ought not return
> > true even if PageUptodate() is still set at that point.
> > 
> > We could check NFS_INO_INVALIDATING, but we might miss that
> > optimization in a lot of cases just because something happens to be
> > in nfs_revalidate_mapping. Maybe that means that this bitlock isn't
> > sufficient and we need some other mechanism. I'm not sure what that
> > should be though.
> 
> Convert your patch to use wait_on_bit(), and then to call
> wait_on_bit_lock() if and only if you see NFS_INO_INVALID_DATA is set.
> 

I think that too would be racy...

We have to clear NFS_INO_INVALID_DATA while holding the i_lock, but we
can't wait_on_bit_lock() under that. So (pseudocode):

wait_on_bit
take i_lock
check and clear NFS_INO_INVALID_DATA
drop i_lock
wait_on_bit_lock

...so between dropping the i_lock and wait_on_bit_lock, we have a place
where another task could check the flag and find it clear.

I think the upshot here is that a bit_lock may not be the appropriate
thing to use to handle this. I'll have to ponder what might be better...
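
Roughly, the interleaving being described (illustrative only):

    task A                                     task B
    ------                                     ------
    wait_on_bit(NFS_INO_INVALIDATING)
    take i_lock
    test and clear NFS_INO_INVALID_DATA
    drop i_lock
                                               checks NFS_INO_INVALID_DATA: clear
                                               treats the mapping as valid
    wait_on_bit_lock(NFS_INO_INVALIDATING)
    invalidate pages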

> > > > nfs_update_inode just does this:
> > > > 
> > > >         if (invalid & NFS_INO_INVALID_DATA)
> > > >                 nfs_fscache_invalidate(inode);
> > > > 
> > > > ...again, since we clear the bit first with this patch, I think we have
> > > > a potential race window there too. We might not see it set in a
> > > > situation where we would have before. That case is a bit more
> > > > problematic since we can't sleep to wait on the bitlock there.
> > > 
> > > Umm... That test in nfs_update_inode() is there because we might just
> > > have _set_ the NFS_INO_INVALID_DATA bit.
> > > 
> > 
> > Correct. But do we need to force a fscache invalidation at that point,
> > or can it wait until we're going to invalidate the mapping too?
> 
> That's a question for David. My assumption is that since invalidation is
> handled asynchronously by the fscache layer itself, that we need to let
> it start that process as soon as possible, but perhaps these races are
> an indication that we should actually do it at the time when we call
> invalidate_inode_pages2() (or at the latest, when we're evicting the
> inode from the icache)...
> 
> 

Ok, looks like it just sets a flag, so if we can handle this somehow
w/o sleeping then it may not matter. Again, I'll have to ponder what
may be better than a bit_lock.

Thanks,
Trond Myklebust Jan. 25, 2014, 12:39 a.m. UTC | #7
On Jan 24, 2014, at 14:21, Jeff Layton <jlayton@redhat.com> wrote:

> On Fri, 24 Jan 2014 11:46:41 -0700
> Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>> 
>> Convert your patch to use wait_on_bit(), and then to call
>> wait_on_bit_lock() if and only if you see NFS_INO_INVALID_DATA is set.
>> 
> 
> I think that too would be racy...
> 
> We have to clear NFS_INO_INVALID_DATA while holding the i_lock, but we
> can't wait_on_bit_lock() under that. So (pseudocode):
> 
> wait_on_bit
> take i_lock
> check and clear NFS_INO_INVALID_DATA
> drop i_lock
> wait_on_bit_lock
> 
> ...so between dropping the i_lock and wait_on_bit_lock, we have a place
> where another task could check the flag and find it clear.


	for(;;) {
		wait_on_bit(NFS_INO_INVALIDATING)
		/* Optimisation: don’t lock NFS_INO_INVALIDATING
		 * if NFS_INO_INVALID_DATA was cleared while we waited.
		 */
		if (!test_bit(NFS_INO_INVALID_DATA))
			return;
		if (!test_and_set_bit_lock(NFS_INO_INVALIDATING))
			break;
	}
	spin_lock(inode->i_lock);
	if (!test_and_clear_bit(NFS_INO_INVALID_DATA)) {
		spin_unlock(inode->i_lock);
		goto out_raced;
	}
….
out_raced:
	clear_bit(NFS_INO_INVALIDATING)
	wake_up_bit(NFS_INO_INVALIDATING)


--
Trond Myklebust
Linux NFS client maintainer

Jeff Layton Jan. 25, 2014, 12:54 a.m. UTC | #8
On Fri, 24 Jan 2014 17:39:45 -0700
Trond Myklebust <trond.myklebust@primarydata.com> wrote:

> 
> On Jan 24, 2014, at 14:21, Jeff Layton <jlayton@redhat.com> wrote:
> 
> > On Fri, 24 Jan 2014 11:46:41 -0700
> > Trond Myklebust <trond.myklebust@primarydata.com> wrote:
> >> 
> >> Convert your patch to use wait_on_bit(), and then to call
> >> wait_on_bit_lock() if and only if you see NFS_INO_INVALID_DATA is set.
> >> 
> > 
> > I think that too would be racy...
> > 
> > We have to clear NFS_INO_INVALID_DATA while holding the i_lock, but we
> > can't wait_on_bit_lock() under that. So (pseudocode):
> > 
> > wait_on_bit
> > take i_lock
> > check and clear NFS_INO_INVALID_DATA
> > drop i_lock
> > wait_on_bit_lock
> > 
> > ...so between dropping the i_lock and wait_on_bit_lock, we have a place
> > where another task could check the flag and find it clear.
> 
> 
> 	for(;;) {
> 		wait_on_bit(NFS_INO_INVALIDATING)
> 		/* Optimisation: don’t lock NFS_INO_INVALIDATING
> 		 * if NFS_INO_INVALID_DATA was cleared while we waited.
> 		 */
> 		if (!test_bit(NFS_INO_INVALID_DATA))
> 			return;
> 		if (!test_and_set_bit_lock(NFS_INO_INVALIDATING))
> 			break;
> 	}
> 	spin_lock(inode->i_lock);
> 	if (!test_and_clear_bit(NFS_INO_INVALID_DATA)) {
> 		spin_unlock(inode->i_lock);
> 		goto out_raced;
> 	}
> ….
> out_raced:
> 	clear_bit(NFS_INO_INVALIDATING)
> 	wake_up_bit(NFS_INO_INVALIDATING)
> 
> 
> --
> Trond Myklebust
> Linux NFS client maintainer
> 

Hmm maybe. OTOH, if we're using atomic bitops do we need to deal with
the spinlock?  I'll ponder it over the weekend and give it a harder
look on Monday.

Thanks for the thoughts so far...
Trond Myklebust Jan. 25, 2014, 1:05 a.m. UTC | #9
On Jan 24, 2014, at 17:54, Jeff Layton <jlayton@redhat.com> wrote:

> On Fri, 24 Jan 2014 17:39:45 -0700
> Trond Myklebust <trond.myklebust@primarydata.com> wrote:
> 
>> 
>> On Jan 24, 2014, at 14:21, Jeff Layton <jlayton@redhat.com> wrote:
>> 
>>> On Fri, 24 Jan 2014 11:46:41 -0700
>>> Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>>>> 
>>>> Convert your patch to use wait_on_bit(), and then to call
>>>> wait_on_bit_lock() if and only if you see NFS_INO_INVALID_DATA is set.
>>>> 
>>> 
>>> I think that too would be racy...
>>> 
>>> We have to clear NFS_INO_INVALID_DATA while holding the i_lock, but we
>>> can't wait_on_bit_lock() under that. So (pseudocode):
>>> 
>>> wait_on_bit
>>> take i_lock
>>> check and clear NFS_INO_INVALID_DATA
>>> drop i_lock
>>> wait_on_bit_lock
>>> 
>>> ...so between dropping the i_lock and wait_on_bit_lock, we have a place
>>> where another task could check the flag and find it clear.
>> 
>> 
>> 	for(;;) {
>> 		wait_on_bit(NFS_INO_INVALIDATING)
>> 		/* Optimisation: don’t lock NFS_INO_INVALIDATING
>> 		 * if NFS_INO_INVALID_DATA was cleared while we waited.
>> 		 */
>> 		if (!test_bit(NFS_INO_INVALID_DATA))
>> 			return;
>> 		if (!test_and_set_bit_lock(NFS_INO_INVALIDATING))
>> 			break;
>> 	}
>> 	spin_lock(inode->i_lock);
>> 	if (!test_and_clear_bit(NFS_INO_INVALID_DATA)) {
>> 		spin_unlock(inode->i_lock);
>> 		goto out_raced;
>> 	}
>> ….
>> out_raced:
>> 	clear_bit(NFS_INO_INVALIDATING)
>> 	wake_up_bit(NFS_INO_INVALIDATING)
>> 
>> 
>> --
>> Trond Myklebust
>> Linux NFS client maintainer
>> 
> 
> Hmm maybe. OTOH, if we're using atomic bitops do we need to deal with
> the spinlock?  I'll ponder it over the weekend and give it a harder
> look on Monday.
> 

The NFS_I(inode)->cache_validity doesn’t use bitops, so the correct behaviour is to put NFS_INO_INVALIDATING inside NFS_I(inode)->flags (which is an atomic bit op field), and then continue to use the spin lock for NFS_INO_INVALID_DATA.

--
Trond Myklebust
Linux NFS client maintainer

Trond Myklebust Jan. 25, 2014, 1:11 a.m. UTC | #10
On Jan 24, 2014, at 18:05, Trond Myklebust <trond.myklebust@primarydata.com> wrote:

> 
> On Jan 24, 2014, at 17:54, Jeff Layton <jlayton@redhat.com> wrote:
> 
>> On Fri, 24 Jan 2014 17:39:45 -0700
>> Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>> 
>>> 
>>> On Jan 24, 2014, at 14:21, Jeff Layton <jlayton@redhat.com> wrote:
>>> 
>>>> On Fri, 24 Jan 2014 11:46:41 -0700
>>>> Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>>>>> 
>>>>> Convert your patch to use wait_on_bit(), and then to call
>>>>> wait_on_bit_lock() if and only if you see NFS_INO_INVALID_DATA is set.
>>>>> 
>>>> 
>>>> I think that too would be racy...
>>>> 
>>>> We have to clear NFS_INO_INVALID_DATA while holding the i_lock, but we
>>>> can't wait_on_bit_lock() under that. So (pseudocode):
>>>> 
>>>> wait_on_bit
>>>> take i_lock
>>>> check and clear NFS_INO_INVALID_DATA
>>>> drop i_lock
>>>> wait_on_bit_lock
>>>> 
>>>> ...so between dropping the i_lock and wait_on_bit_lock, we have a place
>>>> where another task could check the flag and find it clear.
>>> 
>>> 
>>> 	for(;;) {
>>> 		wait_on_bit(NFS_INO_INVALIDATING)
>>> 		/* Optimisation: don’t lock NFS_INO_INVALIDATING
>>> 		 * if NFS_INO_INVALID_DATA was cleared while we waited.
>>> 		 */
>>> 		if (!test_bit(NFS_INO_INVALID_DATA))
>>> 			return;
>>> 		if (!test_and_set_bit_lock(NFS_INO_INVALIDATING))
>>> 			break;
>>> 	}
>>> 	spin_lock(inode->i_lock);
>>> 	if (!test_and_clear_bit(NFS_INO_INVALID_DATA)) {
>>> 		spin_unlock(inode->i_lock);
>>> 		goto out_raced;
>>> 	}
>>> ….
>>> out_raced:
>>> 	clear_bit(NFS_INO_INVALIDATING)
>>> 	wake_up_bit(NFS_INO_INVALIDATING)
>>> 
>>> 
>>> --
>>> Trond Myklebust
>>> Linux NFS client maintainer
>>> 
>> 
>> Hmm maybe. OTOH, if we're using atomic bitops do we need to deal with
>> the spinlock?  I'll ponder it over the weekend and give it a harder
>> look on Monday.
>> 
> 
> The NFS_I(inode)->cache_validity doesn’t use bitops, so the correct behaviour is to put NFS_INO_INVALIDATING inside NFS_I(inode)->flags (which is an atomic bit op field), and then continue to use the spin lock for NFS_INO_INVALID_DATA.

In other words, please replace the atomic test_bit(NFS_INO_INVALID_DATA) and test_and_clear_bit(NFS_INO_INVALID_DATA) in the above pseudocode with the appropriate tests and clears of NFS_I(inode)->cache_validity.

--
Trond Myklebust
Linux NFS client maintainer
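
Putting the pieces of this sub-thread together, a rough rendering of the loop
with NFS_INO_INVALIDATING kept in NFS_I(inode)->flags and NFS_INO_INVALID_DATA
still tested and cleared under the i_lock might look like this (illustrative
only, not a tested or merged patch; the function name is made up):

	static int nfs_revalidate_mapping_serialized(struct inode *inode,
						     struct address_space *mapping)
	{
		struct nfs_inode *nfsi = NFS_I(inode);
		unsigned long *bitlock = &nfsi->flags;
		int ret;

		for (;;) {
			ret = wait_on_bit(bitlock, NFS_INO_INVALIDATING,
					  nfs_wait_bit_killable, TASK_KILLABLE);
			if (ret)
				return ret;
			spin_lock(&inode->i_lock);
			if (!(nfsi->cache_validity & NFS_INO_INVALID_DATA)) {
				/* someone else finished the invalidation */
				spin_unlock(&inode->i_lock);
				return 0;
			}
			spin_unlock(&inode->i_lock);
			if (!test_and_set_bit_lock(NFS_INO_INVALIDATING, bitlock))
				break;
		}

		spin_lock(&inode->i_lock);
		if (!(nfsi->cache_validity & NFS_INO_INVALID_DATA)) {
			/* raced: another task cleared the flag and invalidated */
			spin_unlock(&inode->i_lock);
			goto out_unlock;
		}
		nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
		spin_unlock(&inode->i_lock);

		ret = nfs_invalidate_mapping(inode, mapping);

	out_unlock:
		clear_bit_unlock(NFS_INO_INVALIDATING, bitlock);
		smp_mb__after_clear_bit();
		wake_up_bit(bitlock, NFS_INO_INVALIDATING);
		return ret;
	}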


Patch

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 00ad1c2..6fa07e1 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -977,11 +977,11 @@  static int nfs_invalidate_mapping(struct inode *inode, struct address_space *map
 		if (ret < 0)
 			return ret;
 	}
-	spin_lock(&inode->i_lock);
-	nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
-	if (S_ISDIR(inode->i_mode))
+	if (S_ISDIR(inode->i_mode)) {
+		spin_lock(&inode->i_lock);
 		memset(nfsi->cookieverf, 0, sizeof(nfsi->cookieverf));
-	spin_unlock(&inode->i_lock);
+		spin_unlock(&inode->i_lock);
+	}
 	nfs_inc_stats(inode, NFSIOS_DATAINVALIDATE);
 	nfs_fscache_wait_on_invalidate(inode);
 
@@ -1007,6 +1007,7 @@  static bool nfs_mapping_need_revalidate_inode(struct inode *inode)
 int nfs_revalidate_mapping(struct inode *inode, struct address_space *mapping)
 {
 	struct nfs_inode *nfsi = NFS_I(inode);
+	unsigned long *bitlock = &NFS_I(inode)->flags;
 	int ret = 0;
 
 	/* swapfiles are not supposed to be shared. */
@@ -1018,12 +1019,33 @@  int nfs_revalidate_mapping(struct inode *inode, struct address_space *mapping)
 		if (ret < 0)
 			goto out;
 	}
+
+	/*
+	 * We must clear NFS_INO_INVALID_DATA first to ensure that
+	 * invalidations that come in while we're shooting down the mappings
+	 * are respected. But, that leaves a race window where one revalidator
+	 * can clear the flag, and then another checks it before the mapping
+	 * gets invalidated. Fix that by serializing access to this part of
+	 * the function.
+	 */
+	ret = wait_on_bit_lock(bitlock, NFS_INO_INVALIDATING,
+				nfs_wait_bit_killable, TASK_KILLABLE);
+	if (ret)
+		goto out;
+
+	spin_lock(&inode->i_lock);
 	if (nfsi->cache_validity & NFS_INO_INVALID_DATA) {
+		nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
+		spin_unlock(&inode->i_lock);
 		trace_nfs_invalidate_mapping_enter(inode);
 		ret = nfs_invalidate_mapping(inode, mapping);
 		trace_nfs_invalidate_mapping_exit(inode, ret);
-	}
+	} else
+		spin_unlock(&inode->i_lock);
 
+	clear_bit_unlock(NFS_INO_INVALIDATING, bitlock);
+	smp_mb__after_clear_bit();
+	wake_up_bit(bitlock, NFS_INO_INVALIDATING);
 out:
 	return ret;
 }
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 4899737..18fb16f 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -215,6 +215,7 @@  struct nfs_inode {
 #define NFS_INO_ADVISE_RDPLUS	(0)		/* advise readdirplus */
 #define NFS_INO_STALE		(1)		/* possible stale inode */
 #define NFS_INO_ACL_LRU_SET	(2)		/* Inode is on the LRU list */
+#define NFS_INO_INVALIDATING	(3)		/* inode is being invalidated */
 #define NFS_INO_FLUSHING	(4)		/* inode is flushing out data */
 #define NFS_INO_FSCACHE		(5)		/* inode can be cached by FS-Cache */
 #define NFS_INO_FSCACHE_LOCK	(6)		/* FS-Cache cookie management lock */