diff mbox

[1/4] fs: fix data invalidation in the cleancache during direct IO

Message ID 20170414140753.16108-2-aryabinin@virtuozzo.com (mailing list archive)
State New, archived
Headers show

Commit Message

Andrey Ryabinin April 14, 2017, 2:07 p.m. UTC
Some direct write fs hooks call invalidate_inode_pages2[_range]()
conditionally iff mapping->nrpages is not zero. If page cache is empty,
buffered read following after direct IO write would get stale data from
the cleancache.

Also it doesn't feel right to check only for ->nrpages because
invalidate_inode_pages2[_range] invalidates exceptional entries as well.

Fix this by calling invalidate_inode_pages2[_range]() regardless of nrpages
state.

Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
---
 fs/9p/vfs_file.c |  2 +-
 fs/cifs/inode.c  |  2 +-
 fs/dax.c         |  2 +-
 fs/iomap.c       | 16 +++++++---------
 fs/nfs/direct.c  |  6 ++----
 fs/nfs/inode.c   |  8 +++++---
 mm/filemap.c     | 26 +++++++++++---------------
 7 files changed, 28 insertions(+), 34 deletions(-)

Comments

Ross Zwisler April 18, 2017, 7:38 p.m. UTC | #1
On Fri, Apr 14, 2017 at 05:07:50PM +0300, Andrey Ryabinin wrote:
> Some direct write fs hooks call invalidate_inode_pages2[_range]()
> conditionally iff mapping->nrpages is not zero. If page cache is empty,
> buffered read following after direct IO write would get stale data from
> the cleancache.
> 
> Also it doesn't feel right to check only for ->nrpages because
> invalidate_inode_pages2[_range] invalidates exceptional entries as well.
> 
> Fix this by calling invalidate_inode_pages2[_range]() regardless of nrpages
> state.
> 
> Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
> ---
<>
> diff --git a/fs/dax.c b/fs/dax.c
> index 2e382fe..1e8cca0 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1047,7 +1047,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  	 * into page tables. We have to tear down these mappings so that data
>  	 * written by write(2) is visible in mmap.
>  	 */
> -	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
> +	if ((iomap->flags & IOMAP_F_NEW)) {
>  		invalidate_inode_pages2_range(inode->i_mapping,
>  					      pos >> PAGE_SHIFT,
>  					      (end - 1) >> PAGE_SHIFT);

tl;dr: I think the old code is correct, and that you don't need this change.

This should be harmless, but could slow us down a little if we keep
calling invalidate_inode_pages2_range() without really needing to.  Really for
DAX I think we need to call invalidate_inode_page2_range() only if we have
zero pages mapped over the place where we are doing I/O, which is why we check
nrpages.

Is DAX even allowed to be used at the same time as cleancache?  From a brief
look at Documentation/vm/cleancache.txt, it seems like these two features are
incompatible.  With DAX we already are avoiding the page cache completely.

Anyway, I don't see how this change in DAX can save us from a data corruption
(which is what you're seeing, right?), and I think it could slow us down, so
I'd prefer to leave things as they are.
Andrew Morton April 18, 2017, 10:46 p.m. UTC | #2
On Fri, 14 Apr 2017 17:07:50 +0300 Andrey Ryabinin <aryabinin@virtuozzo.com> wrote:

> Some direct write fs hooks call invalidate_inode_pages2[_range]()
> conditionally iff mapping->nrpages is not zero. If page cache is empty,
> buffered read following after direct IO write would get stale data from
> the cleancache.
> 
> Also it doesn't feel right to check only for ->nrpages because
> invalidate_inode_pages2[_range] invalidates exceptional entries as well.
> 
> Fix this by calling invalidate_inode_pages2[_range]() regardless of nrpages
> state.

I'm not understanding this.  I can buy the argument about
nrexceptional, but why does cleancache require the
invalidate_inode_pages2_range) call even when ->nrpages is zero?

I *assume* it's because invalidate_inode_pages2_range() calls
cleancache_invalidate_inode(), yes?  If so, can we please add this to
the changelog?  If not then please explain further.
Andrey Ryabinin April 19, 2017, 3:11 p.m. UTC | #3
On 04/18/2017 10:38 PM, Ross Zwisler wrote:
> On Fri, Apr 14, 2017 at 05:07:50PM +0300, Andrey Ryabinin wrote:
>> Some direct write fs hooks call invalidate_inode_pages2[_range]()
>> conditionally iff mapping->nrpages is not zero. If page cache is empty,
>> buffered read following after direct IO write would get stale data from
>> the cleancache.
>>
>> Also it doesn't feel right to check only for ->nrpages because
>> invalidate_inode_pages2[_range] invalidates exceptional entries as well.
>>
>> Fix this by calling invalidate_inode_pages2[_range]() regardless of nrpages
>> state.
>>
>> Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
>> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
>> ---
> <>
>> diff --git a/fs/dax.c b/fs/dax.c
>> index 2e382fe..1e8cca0 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -1047,7 +1047,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>>  	 * into page tables. We have to tear down these mappings so that data
>>  	 * written by write(2) is visible in mmap.
>>  	 */
>> -	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
>> +	if ((iomap->flags & IOMAP_F_NEW)) {
>>  		invalidate_inode_pages2_range(inode->i_mapping,
>>  					      pos >> PAGE_SHIFT,
>>  					      (end - 1) >> PAGE_SHIFT);
> 
> tl;dr: I think the old code is correct, and that you don't need this change.
> 
> This should be harmless, but could slow us down a little if we keep
> calling invalidate_inode_pages2_range() without really needing to.  Really for
> DAX I think we need to call invalidate_inode_page2_range() only if we have
> zero pages mapped over the place where we are doing I/O, which is why we check
> nrpages.
> 

Check for ->nrpages only looks strange, because invalidate_inode_pages2_range() also
invalidates exceptional radix tree entries. Is that correct that we invalidate
exceptional entries only if ->nrpages > 0 and skip invalidation otherwise?


> Is DAX even allowed to be used at the same time as cleancache?  From a brief
> look at Documentation/vm/cleancache.txt, it seems like these two features are
> incompatible.  With DAX we already are avoiding the page cache completely.

tl;dr: I think you're right.

cleancache may store any PageUptodate && PageMappedToDisk page evicted from page cache (see __delete_from_page_cache)
DAX deletes hole page via __delete_from_page_cache(), but I don't see we mark such page as Uptodate or MappedToDisk
so it will never go into the cleancache.

Latter cleancache_get_page() (e.g. it's called from mpage_readpages() which is called from blkdev_read_pages())
I assume that DAX doesn't use a_ops->readpages() method so cleancache_get_page() is never called from DAX.


> Anyway, I don't see how this change in DAX can save us from a data corruption
> (which is what you're seeing, right?), and I think it could slow us down, so
> I'd prefer to leave things as they are.
> 

I'll remove this hunk from v2.

Thanks.
Andrey Ryabinin April 19, 2017, 3:15 p.m. UTC | #4
On 04/19/2017 01:46 AM, Andrew Morton wrote:
> On Fri, 14 Apr 2017 17:07:50 +0300 Andrey Ryabinin <aryabinin@virtuozzo.com> wrote:
> 
>> Some direct write fs hooks call invalidate_inode_pages2[_range]()
>> conditionally iff mapping->nrpages is not zero. If page cache is empty,
>> buffered read following after direct IO write would get stale data from
>> the cleancache.
>>
>> Also it doesn't feel right to check only for ->nrpages because
>> invalidate_inode_pages2[_range] invalidates exceptional entries as well.
>>
>> Fix this by calling invalidate_inode_pages2[_range]() regardless of nrpages
>> state.
> 
> I'm not understanding this.  I can buy the argument about
> nrexceptional, but why does cleancache require the
> invalidate_inode_pages2_range) call even when ->nrpages is zero?
> 
> I *assume* it's because invalidate_inode_pages2_range() calls
> cleancache_invalidate_inode(), yes?  If so, can we please add this to
> the changelog?  If not then please explain further.
> 

Yes, your assumption is correct. I'll fix the changelog.
Ross Zwisler April 19, 2017, 7:28 p.m. UTC | #5
On Wed, Apr 19, 2017 at 06:11:31PM +0300, Andrey Ryabinin wrote:
> On 04/18/2017 10:38 PM, Ross Zwisler wrote:
> > On Fri, Apr 14, 2017 at 05:07:50PM +0300, Andrey Ryabinin wrote:
> >> Some direct write fs hooks call invalidate_inode_pages2[_range]()
> >> conditionally iff mapping->nrpages is not zero. If page cache is empty,
> >> buffered read following after direct IO write would get stale data from
> >> the cleancache.
> >>
> >> Also it doesn't feel right to check only for ->nrpages because
> >> invalidate_inode_pages2[_range] invalidates exceptional entries as well.
> >>
> >> Fix this by calling invalidate_inode_pages2[_range]() regardless of nrpages
> >> state.
> >>
> >> Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
> >> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
> >> ---
> > <>
> >> diff --git a/fs/dax.c b/fs/dax.c
> >> index 2e382fe..1e8cca0 100644
> >> --- a/fs/dax.c
> >> +++ b/fs/dax.c
> >> @@ -1047,7 +1047,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> >>  	 * into page tables. We have to tear down these mappings so that data
> >>  	 * written by write(2) is visible in mmap.
> >>  	 */
> >> -	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
> >> +	if ((iomap->flags & IOMAP_F_NEW)) {
> >>  		invalidate_inode_pages2_range(inode->i_mapping,
> >>  					      pos >> PAGE_SHIFT,
> >>  					      (end - 1) >> PAGE_SHIFT);
> > 
> > tl;dr: I think the old code is correct, and that you don't need this change.
> > 
> > This should be harmless, but could slow us down a little if we keep
> > calling invalidate_inode_pages2_range() without really needing to.  Really for
> > DAX I think we need to call invalidate_inode_page2_range() only if we have
> > zero pages mapped over the place where we are doing I/O, which is why we check
> > nrpages.
> > 
> 
> Check for ->nrpages only looks strange, because invalidate_inode_pages2_range() also
> invalidates exceptional radix tree entries. Is that correct that we invalidate
> exceptional entries only if ->nrpages > 0 and skip invalidation otherwise?

For DAX we only invalidate clean DAX exceptional entries so that we can keep
dirty entries around for writeback, but yes you're correct that we only do the
invalidation if nrpages > 0.  And yes, it does seem a bit weird. :)
Jan Kara April 20, 2017, 2:35 p.m. UTC | #6
On Wed 19-04-17 13:28:36, Ross Zwisler wrote:
> On Wed, Apr 19, 2017 at 06:11:31PM +0300, Andrey Ryabinin wrote:
> > On 04/18/2017 10:38 PM, Ross Zwisler wrote:
> > > On Fri, Apr 14, 2017 at 05:07:50PM +0300, Andrey Ryabinin wrote:
> > >> Some direct write fs hooks call invalidate_inode_pages2[_range]()
> > >> conditionally iff mapping->nrpages is not zero. If page cache is empty,
> > >> buffered read following after direct IO write would get stale data from
> > >> the cleancache.
> > >>
> > >> Also it doesn't feel right to check only for ->nrpages because
> > >> invalidate_inode_pages2[_range] invalidates exceptional entries as well.
> > >>
> > >> Fix this by calling invalidate_inode_pages2[_range]() regardless of nrpages
> > >> state.
> > >>
> > >> Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
> > >> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
> > >> ---
> > > <>
> > >> diff --git a/fs/dax.c b/fs/dax.c
> > >> index 2e382fe..1e8cca0 100644
> > >> --- a/fs/dax.c
> > >> +++ b/fs/dax.c
> > >> @@ -1047,7 +1047,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> > >>  	 * into page tables. We have to tear down these mappings so that data
> > >>  	 * written by write(2) is visible in mmap.
> > >>  	 */
> > >> -	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
> > >> +	if ((iomap->flags & IOMAP_F_NEW)) {
> > >>  		invalidate_inode_pages2_range(inode->i_mapping,
> > >>  					      pos >> PAGE_SHIFT,
> > >>  					      (end - 1) >> PAGE_SHIFT);
> > > 
> > > tl;dr: I think the old code is correct, and that you don't need this change.
> > > 
> > > This should be harmless, but could slow us down a little if we keep
> > > calling invalidate_inode_pages2_range() without really needing to.  Really for
> > > DAX I think we need to call invalidate_inode_page2_range() only if we have
> > > zero pages mapped over the place where we are doing I/O, which is why we check
> > > nrpages.
> > > 
> > 
> > Check for ->nrpages only looks strange, because invalidate_inode_pages2_range() also
> > invalidates exceptional radix tree entries. Is that correct that we invalidate
> > exceptional entries only if ->nrpages > 0 and skip invalidation otherwise?
> 
> For DAX we only invalidate clean DAX exceptional entries so that we can keep
> dirty entries around for writeback, but yes you're correct that we only do the
> invalidation if nrpages > 0.  And yes, it does seem a bit weird. :)

Actually in this place the nrpages check is deliberate since there should
only be hole pages or nothing in the invalidated range - see the comment
before the if. But thinking more about it this assumption actually is not
right in presence of zero PMD entries in the radix tree. So this change
actually also fixes a possible bug for DAX but we should do it as a
separate patch with a proper changelog.

								Honza
diff mbox

Patch

diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
index 3de3b4a8..786d0de 100644
--- a/fs/9p/vfs_file.c
+++ b/fs/9p/vfs_file.c
@@ -423,7 +423,7 @@  v9fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		unsigned long pg_start, pg_end;
 		pg_start = origin >> PAGE_SHIFT;
 		pg_end = (origin + retval - 1) >> PAGE_SHIFT;
-		if (inode->i_mapping && inode->i_mapping->nrpages)
+		if (inode->i_mapping)
 			invalidate_inode_pages2_range(inode->i_mapping,
 						      pg_start, pg_end);
 		iocb->ki_pos += retval;
diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c
index c3b2fa0..6539fa3 100644
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -1857,7 +1857,7 @@  cifs_invalidate_mapping(struct inode *inode)
 {
 	int rc = 0;
 
-	if (inode->i_mapping && inode->i_mapping->nrpages != 0) {
+	if (inode->i_mapping) {
 		rc = invalidate_inode_pages2(inode->i_mapping);
 		if (rc)
 			cifs_dbg(VFS, "%s: could not invalidate inode %p\n",
diff --git a/fs/dax.c b/fs/dax.c
index 2e382fe..1e8cca0 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1047,7 +1047,7 @@  dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 	 * into page tables. We have to tear down these mappings so that data
 	 * written by write(2) is visible in mmap.
 	 */
-	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
+	if ((iomap->flags & IOMAP_F_NEW)) {
 		invalidate_inode_pages2_range(inode->i_mapping,
 					      pos >> PAGE_SHIFT,
 					      (end - 1) >> PAGE_SHIFT);
diff --git a/fs/iomap.c b/fs/iomap.c
index 0b457ff..7e1f947 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -880,16 +880,14 @@  iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		flags |= IOMAP_WRITE;
 	}
 
-	if (mapping->nrpages) {
-		ret = filemap_write_and_wait_range(mapping, start, end);
-		if (ret)
-			goto out_free_dio;
+	ret = filemap_write_and_wait_range(mapping, start, end);
+	if (ret)
+		goto out_free_dio;
 
-		ret = invalidate_inode_pages2_range(mapping,
+	ret = invalidate_inode_pages2_range(mapping,
 				start >> PAGE_SHIFT, end >> PAGE_SHIFT);
-		WARN_ON_ONCE(ret);
-		ret = 0;
-	}
+	WARN_ON_ONCE(ret);
+	ret = 0;
 
 	inode_dio_begin(inode);
 
@@ -944,7 +942,7 @@  iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	 * one is a pretty crazy thing to do, so we don't support it 100%.  If
 	 * this invalidation fails, tough, the write still worked...
 	 */
-	if (iov_iter_rw(iter) == WRITE && mapping->nrpages) {
+	if (iov_iter_rw(iter) == WRITE) {
 		int err = invalidate_inode_pages2_range(mapping,
 				start >> PAGE_SHIFT, end >> PAGE_SHIFT);
 		WARN_ON_ONCE(err);
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index aab32fc..183ab4d 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -1024,10 +1024,8 @@  ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
 
 	result = nfs_direct_write_schedule_iovec(dreq, iter, pos);
 
-	if (mapping->nrpages) {
-		invalidate_inode_pages2_range(mapping,
-					      pos >> PAGE_SHIFT, end);
-	}
+	invalidate_inode_pages2_range(mapping,
+				pos >> PAGE_SHIFT, end);
 
 	nfs_end_io_direct(inode);
 
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index f489a5a..b727ec8 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -1118,10 +1118,12 @@  static int nfs_invalidate_mapping(struct inode *inode, struct address_space *map
 			if (ret < 0)
 				return ret;
 		}
-		ret = invalidate_inode_pages2(mapping);
-		if (ret < 0)
-			return ret;
 	}
+
+	ret = invalidate_inode_pages2(mapping);
+	if (ret < 0)
+		return ret;
+
 	if (S_ISDIR(inode->i_mode)) {
 		spin_lock(&inode->i_lock);
 		memset(nfsi->cookieverf, 0, sizeof(nfsi->cookieverf));
diff --git a/mm/filemap.c b/mm/filemap.c
index e9e5f7b..d233d59 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2721,18 +2721,16 @@  generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	 * about to write.  We do this *before* the write so that we can return
 	 * without clobbering -EIOCBQUEUED from ->direct_IO().
 	 */
-	if (mapping->nrpages) {
-		written = invalidate_inode_pages2_range(mapping,
+	written = invalidate_inode_pages2_range(mapping,
 					pos >> PAGE_SHIFT, end);
-		/*
-		 * If a page can not be invalidated, return 0 to fall back
-		 * to buffered write.
-		 */
-		if (written) {
-			if (written == -EBUSY)
-				return 0;
-			goto out;
-		}
+	/*
+	 * If a page can not be invalidated, return 0 to fall back
+	 * to buffered write.
+	 */
+	if (written) {
+		if (written == -EBUSY)
+			return 0;
+		goto out;
 	}
 
 	data = *from;
@@ -2746,10 +2744,8 @@  generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	 * so we don't support it 100%.  If this invalidation
 	 * fails, tough, the write still worked...
 	 */
-	if (mapping->nrpages) {
-		invalidate_inode_pages2_range(mapping,
-					      pos >> PAGE_SHIFT, end);
-	}
+	invalidate_inode_pages2_range(mapping,
+				pos >> PAGE_SHIFT, end);
 
 	if (written > 0) {
 		pos += written;