diff mbox

[6/6] dax: Avoid page invalidation races and unnecessary radix tree traversals

Message ID 1474994615-29553-7-git-send-email-jack@suse.cz (mailing list archive)
State Not Applicable
Headers show

Commit Message

Jan Kara Sept. 27, 2016, 4:43 p.m. UTC
Currently each filesystem (possibly through generic_file_direct_write()
or iomap_dax_rw()) takes care of invalidating page tables and evicting
hole pages from the radix tree when write(2) to the file happens. This
invalidation is only necessary when there is some block allocation
resulting from write(2). Furthermore in current place the invalidation
is racy wrt page fault instantiating a hole page just after we have
invalidated it.

So perform the page invalidation inside dax_do_io() where we can do it
only when really necessary and after blocks have been allocated so
nobody will be instantiating new hole pages anymore.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 40 +++++++++++++++++++++++-----------------
 1 file changed, 23 insertions(+), 17 deletions(-)

Comments

Dave Chinner Sept. 28, 2016, 12:20 a.m. UTC | #1
On Tue, Sep 27, 2016 at 06:43:35PM +0200, Jan Kara wrote:
> @@ -1410,6 +1422,17 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  	if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
>  		return -EIO;
>  
> +	/*
> +	 * Write can allocate block for an area which has a hole page mapped
> +	 * into page tables. We have to tear down these mappings so that data
> +	 * written by write(2) is visible in mmap.
> +	 */
> +	if (iomap->flags & IOMAP_F_NEW && inode->i_mapping->nrpages) {

gcc should be throwing warnings about that:

	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {

Cheers,

Dave.
Christoph Hellwig Sept. 30, 2016, 8:55 a.m. UTC | #2
On Tue, Sep 27, 2016 at 06:43:35PM +0200, Jan Kara wrote:
> Currently each filesystem (possibly through generic_file_direct_write()
> or iomap_dax_rw()) takes care of invalidating page tables and evicting
> hole pages from the radix tree when write(2) to the file happens. This
> invalidation is only necessary when there is some block allocation
> resulting from write(2). Furthermore in current place the invalidation
> is racy wrt page fault instantiating a hole page just after we have
> invalidated it.
> 
> So perform the page invalidation inside dax_do_io() where we can do it
> only when really necessary and after blocks have been allocated so
> nobody will be instantiating new hole pages anymore.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

This looks fine with the comment from Dave addressed:

Reviewed-by: Christoph Hellwig <hch@lst.de>

> +				if (buffer_new(bh) &&
> +				    inode->i_mapping->nrpages) {

Btw, it would be nice if the nrpages check could move into
invalidate_inode_pages2_range instead of having to bother with it in
the callers.
Jan Kara Oct. 3, 2016, 2:58 p.m. UTC | #3
On Wed 28-09-16 10:20:34, Dave Chinner wrote:
> On Tue, Sep 27, 2016 at 06:43:35PM +0200, Jan Kara wrote:
> > @@ -1410,6 +1422,17 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> >  	if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
> >  		return -EIO;
> >  
> > +	/*
> > +	 * Write can allocate block for an area which has a hole page mapped
> > +	 * into page tables. We have to tear down these mappings so that data
> > +	 * written by write(2) is visible in mmap.
> > +	 */
> > +	if (iomap->flags & IOMAP_F_NEW && inode->i_mapping->nrpages) {
> 
> gcc should be throwing warnings about that:
> 
> 	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {

Actually the bitwise '&' takes precedense over the logical '&&' so the
evaluation order ends up being correct. But I agree it's better to be
explicit with parenthesis here. Fixed.

								Honza
Jan Kara Oct. 3, 2016, 3:02 p.m. UTC | #4
On Fri 30-09-16 01:55:44, Christoph Hellwig wrote:
> On Tue, Sep 27, 2016 at 06:43:35PM +0200, Jan Kara wrote:
> > Currently each filesystem (possibly through generic_file_direct_write()
> > or iomap_dax_rw()) takes care of invalidating page tables and evicting
> > hole pages from the radix tree when write(2) to the file happens. This
> > invalidation is only necessary when there is some block allocation
> > resulting from write(2). Furthermore in current place the invalidation
> > is racy wrt page fault instantiating a hole page just after we have
> > invalidated it.
> > 
> > So perform the page invalidation inside dax_do_io() where we can do it
> > only when really necessary and after blocks have been allocated so
> > nobody will be instantiating new hole pages anymore.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> This looks fine with the comment from Dave addressed:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Thanks.

> > +				if (buffer_new(bh) &&
> > +				    inode->i_mapping->nrpages) {
> 
> Btw, it would be nice if the nrpages check could move into
> invalidate_inode_pages2_range instead of having to bother with it in
> the callers.

We cannot do that - someone can be possibly calling
invalidate_inode_pages2_range() to invalidate exceptional entries in the
given range (although I don't think there's currently any such caller). DAX
code currently needs it only to invalidate hole pages so that's why it can
do an optimization to call invalidate_inode_pages2_range() only when
mapping->nrpages > 0. But in general it would be a trap for
invalidate_inode_pages2_range() to work only if mapping->nrpages > 0...

								Honza
Dave Chinner Oct. 3, 2016, 11:40 p.m. UTC | #5
On Mon, Oct 03, 2016 at 04:58:02PM +0200, Jan Kara wrote:
> On Wed 28-09-16 10:20:34, Dave Chinner wrote:
> > On Tue, Sep 27, 2016 at 06:43:35PM +0200, Jan Kara wrote:
> > > @@ -1410,6 +1422,17 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> > >  	if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
> > >  		return -EIO;
> > >  
> > > +	/*
> > > +	 * Write can allocate block for an area which has a hole page mapped
> > > +	 * into page tables. We have to tear down these mappings so that data
> > > +	 * written by write(2) is visible in mmap.
> > > +	 */
> > > +	if (iomap->flags & IOMAP_F_NEW && inode->i_mapping->nrpages) {
> > 
> > gcc should be throwing warnings about that:
> > 
> > 	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
> 
> Actually the bitwise '&' takes precedense over the logical '&&' so the
> evaluation order ends up being correct.

Yes, I know that.

However, my concern is that such expressions don't indicate the
/intent/ of the author and so it can be difficult when reading the
code to determine if the logic is correct or whether it is a typo
and is wrong. In many cases, & and && will function identically for
the tested cases, so neither cursory review or testing would catch
something like this being wrong.

> But I agree it's better to be
> explicit with parenthesis here. Fixed.

Yup, a little bit of "documentation" goes a long way :P

Cheers,

Dave.
diff mbox

Patch

diff --git a/fs/dax.c b/fs/dax.c
index c8a639d2214e..2f69ca891aab 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -186,6 +186,18 @@  static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 				 */
 				WARN_ON_ONCE(rw == WRITE &&
 					     buffer_unwritten(bh));
+				/*
+				 * Write can allocate block for an area which
+				 * has a hole page mapped into page tables. We
+				 * have to tear down these mappings so that
+				 * data written by write(2) is visible in mmap.
+				 */
+				if (buffer_new(bh) &&
+				    inode->i_mapping->nrpages) {
+					invalidate_inode_pages2_range(
+						  inode->i_mapping, page,
+						  (bh_max - 1) >> PAGE_SHIFT);
+				}
 			} else {
 				unsigned done = bh->b_size -
 						(bh_max - (pos - first));
@@ -1410,6 +1422,17 @@  iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 	if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
 		return -EIO;
 
+	/*
+	 * Write can allocate block for an area which has a hole page mapped
+	 * into page tables. We have to tear down these mappings so that data
+	 * written by write(2) is visible in mmap.
+	 */
+	if (iomap->flags & IOMAP_F_NEW && inode->i_mapping->nrpages) {
+		invalidate_inode_pages2_range(inode->i_mapping,
+					      pos >> PAGE_SHIFT,
+					      (end - 1) >> PAGE_SHIFT);
+	}
+
 	while (pos < end) {
 		unsigned offset = pos & (PAGE_SIZE - 1);
 		struct blk_dax_ctl dax = { 0 };
@@ -1469,23 +1492,6 @@  iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (iov_iter_rw(iter) == WRITE)
 		flags |= IOMAP_WRITE;
 
-	/*
-	 * Yes, even DAX files can have page cache attached to them:  A zeroed
-	 * page is inserted into the pagecache when we have to serve a write
-	 * fault on a hole.  It should never be dirtied and can simply be
-	 * dropped from the pagecache once we get real data for the page.
-	 *
-	 * XXX: This is racy against mmap, and there's nothing we can do about
-	 * it. We'll eventually need to shift this down even further so that
-	 * we can check if we allocated blocks over a hole first.
-	 */
-	if (mapping->nrpages) {
-		ret = invalidate_inode_pages2_range(mapping,
-				pos >> PAGE_SHIFT,
-				(pos + iov_iter_count(iter) - 1) >> PAGE_SHIFT);
-		WARN_ON_ONCE(ret);
-	}
-
 	while (iov_iter_count(iter)) {
 		ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
 				iter, iomap_dax_actor);