Message ID | 20170414140753.16108-2-aryabinin@virtuozzo.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Fri, Apr 14, 2017 at 05:07:50PM +0300, Andrey Ryabinin wrote: > Some direct write fs hooks call invalidate_inode_pages2[_range]() > conditionally iff mapping->nrpages is not zero. If page cache is empty, > buffered read following after direct IO write would get stale data from > the cleancache. > > Also it doesn't feel right to check only for ->nrpages because > invalidate_inode_pages2[_range] invalidates exceptional entries as well. > > Fix this by calling invalidate_inode_pages2[_range]() regardless of nrpages > state. > > Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache") > Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> > --- <> > diff --git a/fs/dax.c b/fs/dax.c > index 2e382fe..1e8cca0 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -1047,7 +1047,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data, > * into page tables. We have to tear down these mappings so that data > * written by write(2) is visible in mmap. > */ > - if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) { > + if ((iomap->flags & IOMAP_F_NEW)) { > invalidate_inode_pages2_range(inode->i_mapping, > pos >> PAGE_SHIFT, > (end - 1) >> PAGE_SHIFT); tl;dr: I think the old code is correct, and that you don't need this change. This should be harmless, but could slow us down a little if we keep calling invalidate_inode_pages2_range() without really needing to. Really for DAX I think we need to call invalidate_inode_page2_range() only if we have zero pages mapped over the place where we are doing I/O, which is why we check nrpages. Is DAX even allowed to be used at the same time as cleancache? From a brief look at Documentation/vm/cleancache.txt, it seems like these two features are incompatible. With DAX we already are avoiding the page cache completely. Anyway, I don't see how this change in DAX can save us from a data corruption (which is what you're seeing, right?), and I think it could slow us down, so I'd prefer to leave things as they are.
On Fri, 14 Apr 2017 17:07:50 +0300 Andrey Ryabinin <aryabinin@virtuozzo.com> wrote: > Some direct write fs hooks call invalidate_inode_pages2[_range]() > conditionally iff mapping->nrpages is not zero. If page cache is empty, > buffered read following after direct IO write would get stale data from > the cleancache. > > Also it doesn't feel right to check only for ->nrpages because > invalidate_inode_pages2[_range] invalidates exceptional entries as well. > > Fix this by calling invalidate_inode_pages2[_range]() regardless of nrpages > state. I'm not understanding this. I can buy the argument about nrexceptional, but why does cleancache require the invalidate_inode_pages2_range) call even when ->nrpages is zero? I *assume* it's because invalidate_inode_pages2_range() calls cleancache_invalidate_inode(), yes? If so, can we please add this to the changelog? If not then please explain further.
On 04/18/2017 10:38 PM, Ross Zwisler wrote: > On Fri, Apr 14, 2017 at 05:07:50PM +0300, Andrey Ryabinin wrote: >> Some direct write fs hooks call invalidate_inode_pages2[_range]() >> conditionally iff mapping->nrpages is not zero. If page cache is empty, >> buffered read following after direct IO write would get stale data from >> the cleancache. >> >> Also it doesn't feel right to check only for ->nrpages because >> invalidate_inode_pages2[_range] invalidates exceptional entries as well. >> >> Fix this by calling invalidate_inode_pages2[_range]() regardless of nrpages >> state. >> >> Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache") >> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> >> --- > <> >> diff --git a/fs/dax.c b/fs/dax.c >> index 2e382fe..1e8cca0 100644 >> --- a/fs/dax.c >> +++ b/fs/dax.c >> @@ -1047,7 +1047,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data, >> * into page tables. We have to tear down these mappings so that data >> * written by write(2) is visible in mmap. >> */ >> - if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) { >> + if ((iomap->flags & IOMAP_F_NEW)) { >> invalidate_inode_pages2_range(inode->i_mapping, >> pos >> PAGE_SHIFT, >> (end - 1) >> PAGE_SHIFT); > > tl;dr: I think the old code is correct, and that you don't need this change. > > This should be harmless, but could slow us down a little if we keep > calling invalidate_inode_pages2_range() without really needing to. Really for > DAX I think we need to call invalidate_inode_page2_range() only if we have > zero pages mapped over the place where we are doing I/O, which is why we check > nrpages. > Check for ->nrpages only looks strange, because invalidate_inode_pages2_range() also invalidates exceptional radix tree entries. Is that correct that we invalidate exceptional entries only if ->nrpages > 0 and skip invalidation otherwise? > Is DAX even allowed to be used at the same time as cleancache? From a brief > look at Documentation/vm/cleancache.txt, it seems like these two features are > incompatible. With DAX we already are avoiding the page cache completely. tl;dr: I think you're right. cleancache may store any PageUptodate && PageMappedToDisk page evicted from page cache (see __delete_from_page_cache) DAX deletes hole page via __delete_from_page_cache(), but I don't see we mark such page as Uptodate or MappedToDisk so it will never go into the cleancache. Latter cleancache_get_page() (e.g. it's called from mpage_readpages() which is called from blkdev_read_pages()) I assume that DAX doesn't use a_ops->readpages() method so cleancache_get_page() is never called from DAX. > Anyway, I don't see how this change in DAX can save us from a data corruption > (which is what you're seeing, right?), and I think it could slow us down, so > I'd prefer to leave things as they are. > I'll remove this hunk from v2. Thanks.
On 04/19/2017 01:46 AM, Andrew Morton wrote: > On Fri, 14 Apr 2017 17:07:50 +0300 Andrey Ryabinin <aryabinin@virtuozzo.com> wrote: > >> Some direct write fs hooks call invalidate_inode_pages2[_range]() >> conditionally iff mapping->nrpages is not zero. If page cache is empty, >> buffered read following after direct IO write would get stale data from >> the cleancache. >> >> Also it doesn't feel right to check only for ->nrpages because >> invalidate_inode_pages2[_range] invalidates exceptional entries as well. >> >> Fix this by calling invalidate_inode_pages2[_range]() regardless of nrpages >> state. > > I'm not understanding this. I can buy the argument about > nrexceptional, but why does cleancache require the > invalidate_inode_pages2_range) call even when ->nrpages is zero? > > I *assume* it's because invalidate_inode_pages2_range() calls > cleancache_invalidate_inode(), yes? If so, can we please add this to > the changelog? If not then please explain further. > Yes, your assumption is correct. I'll fix the changelog.
On Wed, Apr 19, 2017 at 06:11:31PM +0300, Andrey Ryabinin wrote: > On 04/18/2017 10:38 PM, Ross Zwisler wrote: > > On Fri, Apr 14, 2017 at 05:07:50PM +0300, Andrey Ryabinin wrote: > >> Some direct write fs hooks call invalidate_inode_pages2[_range]() > >> conditionally iff mapping->nrpages is not zero. If page cache is empty, > >> buffered read following after direct IO write would get stale data from > >> the cleancache. > >> > >> Also it doesn't feel right to check only for ->nrpages because > >> invalidate_inode_pages2[_range] invalidates exceptional entries as well. > >> > >> Fix this by calling invalidate_inode_pages2[_range]() regardless of nrpages > >> state. > >> > >> Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache") > >> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> > >> --- > > <> > >> diff --git a/fs/dax.c b/fs/dax.c > >> index 2e382fe..1e8cca0 100644 > >> --- a/fs/dax.c > >> +++ b/fs/dax.c > >> @@ -1047,7 +1047,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data, > >> * into page tables. We have to tear down these mappings so that data > >> * written by write(2) is visible in mmap. > >> */ > >> - if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) { > >> + if ((iomap->flags & IOMAP_F_NEW)) { > >> invalidate_inode_pages2_range(inode->i_mapping, > >> pos >> PAGE_SHIFT, > >> (end - 1) >> PAGE_SHIFT); > > > > tl;dr: I think the old code is correct, and that you don't need this change. > > > > This should be harmless, but could slow us down a little if we keep > > calling invalidate_inode_pages2_range() without really needing to. Really for > > DAX I think we need to call invalidate_inode_page2_range() only if we have > > zero pages mapped over the place where we are doing I/O, which is why we check > > nrpages. > > > > Check for ->nrpages only looks strange, because invalidate_inode_pages2_range() also > invalidates exceptional radix tree entries. Is that correct that we invalidate > exceptional entries only if ->nrpages > 0 and skip invalidation otherwise? For DAX we only invalidate clean DAX exceptional entries so that we can keep dirty entries around for writeback, but yes you're correct that we only do the invalidation if nrpages > 0. And yes, it does seem a bit weird. :)
On Wed 19-04-17 13:28:36, Ross Zwisler wrote: > On Wed, Apr 19, 2017 at 06:11:31PM +0300, Andrey Ryabinin wrote: > > On 04/18/2017 10:38 PM, Ross Zwisler wrote: > > > On Fri, Apr 14, 2017 at 05:07:50PM +0300, Andrey Ryabinin wrote: > > >> Some direct write fs hooks call invalidate_inode_pages2[_range]() > > >> conditionally iff mapping->nrpages is not zero. If page cache is empty, > > >> buffered read following after direct IO write would get stale data from > > >> the cleancache. > > >> > > >> Also it doesn't feel right to check only for ->nrpages because > > >> invalidate_inode_pages2[_range] invalidates exceptional entries as well. > > >> > > >> Fix this by calling invalidate_inode_pages2[_range]() regardless of nrpages > > >> state. > > >> > > >> Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache") > > >> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> > > >> --- > > > <> > > >> diff --git a/fs/dax.c b/fs/dax.c > > >> index 2e382fe..1e8cca0 100644 > > >> --- a/fs/dax.c > > >> +++ b/fs/dax.c > > >> @@ -1047,7 +1047,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data, > > >> * into page tables. We have to tear down these mappings so that data > > >> * written by write(2) is visible in mmap. > > >> */ > > >> - if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) { > > >> + if ((iomap->flags & IOMAP_F_NEW)) { > > >> invalidate_inode_pages2_range(inode->i_mapping, > > >> pos >> PAGE_SHIFT, > > >> (end - 1) >> PAGE_SHIFT); > > > > > > tl;dr: I think the old code is correct, and that you don't need this change. > > > > > > This should be harmless, but could slow us down a little if we keep > > > calling invalidate_inode_pages2_range() without really needing to. Really for > > > DAX I think we need to call invalidate_inode_page2_range() only if we have > > > zero pages mapped over the place where we are doing I/O, which is why we check > > > nrpages. > > > > > > > Check for ->nrpages only looks strange, because invalidate_inode_pages2_range() also > > invalidates exceptional radix tree entries. Is that correct that we invalidate > > exceptional entries only if ->nrpages > 0 and skip invalidation otherwise? > > For DAX we only invalidate clean DAX exceptional entries so that we can keep > dirty entries around for writeback, but yes you're correct that we only do the > invalidation if nrpages > 0. And yes, it does seem a bit weird. :) Actually in this place the nrpages check is deliberate since there should only be hole pages or nothing in the invalidated range - see the comment before the if. But thinking more about it this assumption actually is not right in presence of zero PMD entries in the radix tree. So this change actually also fixes a possible bug for DAX but we should do it as a separate patch with a proper changelog. Honza
diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c index 3de3b4a8..786d0de 100644 --- a/fs/9p/vfs_file.c +++ b/fs/9p/vfs_file.c @@ -423,7 +423,7 @@ v9fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from) unsigned long pg_start, pg_end; pg_start = origin >> PAGE_SHIFT; pg_end = (origin + retval - 1) >> PAGE_SHIFT; - if (inode->i_mapping && inode->i_mapping->nrpages) + if (inode->i_mapping) invalidate_inode_pages2_range(inode->i_mapping, pg_start, pg_end); iocb->ki_pos += retval; diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c index c3b2fa0..6539fa3 100644 --- a/fs/cifs/inode.c +++ b/fs/cifs/inode.c @@ -1857,7 +1857,7 @@ cifs_invalidate_mapping(struct inode *inode) { int rc = 0; - if (inode->i_mapping && inode->i_mapping->nrpages != 0) { + if (inode->i_mapping) { rc = invalidate_inode_pages2(inode->i_mapping); if (rc) cifs_dbg(VFS, "%s: could not invalidate inode %p\n", diff --git a/fs/dax.c b/fs/dax.c index 2e382fe..1e8cca0 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -1047,7 +1047,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data, * into page tables. We have to tear down these mappings so that data * written by write(2) is visible in mmap. */ - if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) { + if ((iomap->flags & IOMAP_F_NEW)) { invalidate_inode_pages2_range(inode->i_mapping, pos >> PAGE_SHIFT, (end - 1) >> PAGE_SHIFT); diff --git a/fs/iomap.c b/fs/iomap.c index 0b457ff..7e1f947 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -880,16 +880,14 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, flags |= IOMAP_WRITE; } - if (mapping->nrpages) { - ret = filemap_write_and_wait_range(mapping, start, end); - if (ret) - goto out_free_dio; + ret = filemap_write_and_wait_range(mapping, start, end); + if (ret) + goto out_free_dio; - ret = invalidate_inode_pages2_range(mapping, + ret = invalidate_inode_pages2_range(mapping, start >> PAGE_SHIFT, end >> PAGE_SHIFT); - WARN_ON_ONCE(ret); - ret = 0; - } + WARN_ON_ONCE(ret); + ret = 0; inode_dio_begin(inode); @@ -944,7 +942,7 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, * one is a pretty crazy thing to do, so we don't support it 100%. If * this invalidation fails, tough, the write still worked... */ - if (iov_iter_rw(iter) == WRITE && mapping->nrpages) { + if (iov_iter_rw(iter) == WRITE) { int err = invalidate_inode_pages2_range(mapping, start >> PAGE_SHIFT, end >> PAGE_SHIFT); WARN_ON_ONCE(err); diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c index aab32fc..183ab4d 100644 --- a/fs/nfs/direct.c +++ b/fs/nfs/direct.c @@ -1024,10 +1024,8 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter) result = nfs_direct_write_schedule_iovec(dreq, iter, pos); - if (mapping->nrpages) { - invalidate_inode_pages2_range(mapping, - pos >> PAGE_SHIFT, end); - } + invalidate_inode_pages2_range(mapping, + pos >> PAGE_SHIFT, end); nfs_end_io_direct(inode); diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c index f489a5a..b727ec8 100644 --- a/fs/nfs/inode.c +++ b/fs/nfs/inode.c @@ -1118,10 +1118,12 @@ static int nfs_invalidate_mapping(struct inode *inode, struct address_space *map if (ret < 0) return ret; } - ret = invalidate_inode_pages2(mapping); - if (ret < 0) - return ret; } + + ret = invalidate_inode_pages2(mapping); + if (ret < 0) + return ret; + if (S_ISDIR(inode->i_mode)) { spin_lock(&inode->i_lock); memset(nfsi->cookieverf, 0, sizeof(nfsi->cookieverf)); diff --git a/mm/filemap.c b/mm/filemap.c index e9e5f7b..d233d59 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2721,18 +2721,16 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from) * about to write. We do this *before* the write so that we can return * without clobbering -EIOCBQUEUED from ->direct_IO(). */ - if (mapping->nrpages) { - written = invalidate_inode_pages2_range(mapping, + written = invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT, end); - /* - * If a page can not be invalidated, return 0 to fall back - * to buffered write. - */ - if (written) { - if (written == -EBUSY) - return 0; - goto out; - } + /* + * If a page can not be invalidated, return 0 to fall back + * to buffered write. + */ + if (written) { + if (written == -EBUSY) + return 0; + goto out; } data = *from; @@ -2746,10 +2744,8 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from) * so we don't support it 100%. If this invalidation * fails, tough, the write still worked... */ - if (mapping->nrpages) { - invalidate_inode_pages2_range(mapping, - pos >> PAGE_SHIFT, end); - } + invalidate_inode_pages2_range(mapping, + pos >> PAGE_SHIFT, end); if (written > 0) { pos += written;
Some direct write fs hooks call invalidate_inode_pages2[_range]() conditionally iff mapping->nrpages is not zero. If page cache is empty, buffered read following after direct IO write would get stale data from the cleancache. Also it doesn't feel right to check only for ->nrpages because invalidate_inode_pages2[_range] invalidates exceptional entries as well. Fix this by calling invalidate_inode_pages2[_range]() regardless of nrpages state. Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache") Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> --- fs/9p/vfs_file.c | 2 +- fs/cifs/inode.c | 2 +- fs/dax.c | 2 +- fs/iomap.c | 16 +++++++--------- fs/nfs/direct.c | 6 ++---- fs/nfs/inode.c | 8 +++++--- mm/filemap.c | 26 +++++++++++--------------- 7 files changed, 28 insertions(+), 34 deletions(-)