
[5/6] dax: Call ->iomap_begin without entry lock during dax fault

Message ID 1479980796-26161-6-git-send-email-jack@suse.cz
State New, archived

Commit Message

Jan Kara Nov. 24, 2016, 9:46 a.m. UTC
Currently the ->iomap_begin() handler is called with the entry lock held.
If the filesystem held any locks between ->iomap_begin() and ->iomap_end()
(such as ext4, which will want to hold a transaction open), this would
cause a lock inversion with iomap_apply() in the standard IO path, which
first calls ->iomap_begin() and only then calls the ->actor() callback
that grabs entry locks for DAX.

Fix the problem by nesting the grabbing of the entry lock inside the
->iomap_begin() - ->iomap_end() pair.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 120 ++++++++++++++++++++++++++++++++++-----------------------------
 1 file changed, 65 insertions(+), 55 deletions(-)
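
The inversion described in the changelog is the classic ABBA pattern. As a
minimal userspace analogy (a sketch, not kernel code: pthread mutexes stand
in for the DAX entry lock and a filesystem lock such as the journal, and
all names are invented for illustration), the program below takes the two
locks in opposite orders in two threads and will typically deadlock when
run, which is exactly the hazard removed by making the fault path take the
locks in the same order as the IO path:

	#include <pthread.h>
	#include <stdio.h>
	#include <unistd.h>

	static pthread_mutex_t entry_lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_mutex_t fs_lock = PTHREAD_MUTEX_INITIALIZER;

	/* Old fault-path ordering: entry lock first, then filesystem lock. */
	static void *fault_path(void *arg)
	{
		pthread_mutex_lock(&entry_lock);
		sleep(1);			/* widen the race window */
		pthread_mutex_lock(&fs_lock);	/* blocks: io_path holds it */
		pthread_mutex_unlock(&fs_lock);
		pthread_mutex_unlock(&entry_lock);
		return NULL;
	}

	/* IO-path ordering: filesystem lock first, then entry lock. */
	static void *io_path(void *arg)
	{
		pthread_mutex_lock(&fs_lock);
		sleep(1);
		pthread_mutex_lock(&entry_lock); /* blocks: fault_path holds it */
		pthread_mutex_unlock(&entry_lock);
		pthread_mutex_unlock(&fs_lock);
		return NULL;
	}

	int main(void)
	{
		pthread_t t1, t2;

		pthread_create(&t1, NULL, fault_path, NULL);
		pthread_create(&t2, NULL, io_path, NULL);
		pthread_join(t1, NULL);		/* never returns: ABBA deadlock */
		pthread_join(t2, NULL);
		printf("no deadlock\n");
		return 0;
	}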

Comments

Ross Zwisler Dec. 1, 2016, 10:24 p.m. UTC | #1
On Thu, Nov 24, 2016 at 10:46:35AM +0100, Jan Kara wrote:
> Currently the ->iomap_begin() handler is called with the entry lock held.
> If the filesystem held any locks between ->iomap_begin() and ->iomap_end()
> (such as ext4, which will want to hold a transaction open), this would
> cause a lock inversion with iomap_apply() in the standard IO path, which
> first calls ->iomap_begin() and only then calls the ->actor() callback
> that grabs entry locks for DAX.

I don't see the dax_iomap_actor() grabbing any entry locks for DAX?  Is this
an issue currently, or are you just trying to make the code consistent so we
don't run into issues in the future?
Ross Zwisler Dec. 1, 2016, 11:27 p.m. UTC | #2
On Thu, Dec 01, 2016 at 03:24:47PM -0700, Ross Zwisler wrote:
> On Thu, Nov 24, 2016 at 10:46:35AM +0100, Jan Kara wrote:
> > Currently the ->iomap_begin() handler is called with the entry lock held.
> > If the filesystem held any locks between ->iomap_begin() and ->iomap_end()
> > (such as ext4, which will want to hold a transaction open), this would
> > cause a lock inversion with iomap_apply() in the standard IO path, which
> > first calls ->iomap_begin() and only then calls the ->actor() callback
> > that grabs entry locks for DAX.
> 
> I don't see the dax_iomap_actor() grabbing any entry locks for DAX?  Is this
> an issue currently, or are you just trying to make the code consistent so we
> don't run into issues in the future?

Ah, I see that you use this new ordering in patch 6/6 so that you can change
your interaction with the ext4 journal.  I'm still curious if we have a lock
ordering inversion within DAX, but if this ordering helps you with ext4, good
enough.

One quick comment:

> @@ -1337,19 +1353,10 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>        */
>       entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
>       if (IS_ERR(entry))
> -             goto fallback;
> +             goto finish_iomap;
>
> -     /*
> -      * Note that we don't use iomap_apply here.  We aren't doing I/O, only
> -      * setting up a mapping, so really we're using iomap_begin() as a way
> -      * to look up our filesystem block.
> -      */
> -     pos = (loff_t)pgoff << PAGE_SHIFT;
> -     error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
> -     if (error)
> -             goto unlock_entry;
>       if (iomap.offset + iomap.length < pos + PMD_SIZE)
> -             goto finish_iomap;
> +             goto unlock_entry;

I think this offset+length bounds check could be moved, along with the
iomap_begin() call, up above the grab_mapping_entry() call.  You would then
goto 'finish_iomap' if you hit this error condition, allowing you to avoid
grabbing and releasing the mapping entry.
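
Concretely, the reordered flow would look roughly like this (a sketch based
on the code in this patch, with declarations and the remaining error
handling elided):

	pos = (loff_t)pgoff << PAGE_SHIFT;
	error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
	if (error)
		goto fallback;
	if (iomap.offset + iomap.length < pos + PMD_SIZE)
		goto finish_iomap;	/* no entry grabbed yet, nothing to unlock */

	entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
	if (IS_ERR(entry))
		goto finish_iomap;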

Other than that one small nit, this looks fine to me:
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Jan Kara Dec. 2, 2016, 10:08 a.m. UTC | #3
On Thu 01-12-16 15:24:47, Ross Zwisler wrote:
> On Thu, Nov 24, 2016 at 10:46:35AM +0100, Jan Kara wrote:
> > Currently the ->iomap_begin() handler is called with the entry lock held.
> > If the filesystem held any locks between ->iomap_begin() and ->iomap_end()
> > (such as ext4, which will want to hold a transaction open), this would
> > cause a lock inversion with iomap_apply() in the standard IO path, which
> > first calls ->iomap_begin() and only then calls the ->actor() callback
> > that grabs entry locks for DAX.
> 
> I don't see the dax_iomap_actor() grabbing any entry locks for DAX?  Is this
> an issue currently, or are you just trying to make the code consistent so we
> don't run into issues in the future?

So dax_iomap_actor() copies data from / to a user-provided buffer. That copy
can fault, and if the buffer happens to be an mmapped file on a DAX
filesystem, the fault will end up grabbing entry locks. Sample evil test:

	fd = open("some_file", O_RDWR);
	buf = mmap(NULL, 65536, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	write(fd, buf, 4096);
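
Spelled out as a complete program (a sketch with minimal error handling;
"some_file" stands for any sufficiently large file on a DAX-mounted
filesystem):

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("some_file", O_RDWR);
		char *buf;

		if (fd < 0) {
			perror("open");
			return 1;
		}
		buf = mmap(NULL, 65536, PROT_READ | PROT_WRITE, MAP_SHARED,
			   fd, 0);
		if (buf == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		/*
		 * write() copies from buf while the filesystem sits between
		 * ->iomap_begin() and ->iomap_end(); the copy faults on the
		 * mapping of the same file and the fault handler grabs the
		 * DAX entry lock, nesting it inside the iomap pair.
		 */
		if (write(fd, buf, 4096) < 0)
			perror("write");
		return 0;
	}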

								Honza
Jan Kara Dec. 2, 2016, 10:12 a.m. UTC | #4
On Thu 01-12-16 16:27:04, Ross Zwisler wrote:
> On Thu, Dec 01, 2016 at 03:24:47PM -0700, Ross Zwisler wrote:
> > On Thu, Nov 24, 2016 at 10:46:35AM +0100, Jan Kara wrote:
> > > Currently the ->iomap_begin() handler is called with the entry lock held.
> > > If the filesystem held any locks between ->iomap_begin() and ->iomap_end()
> > > (such as ext4, which will want to hold a transaction open), this would
> > > cause a lock inversion with iomap_apply() in the standard IO path, which
> > > first calls ->iomap_begin() and only then calls the ->actor() callback
> > > that grabs entry locks for DAX.
> > 
> > I don't see the dax_iomap_actor() grabbing any entry locks for DAX?  Is this
> > an issue currently, or are you just trying to make the code consistent so we
> > don't run into issues in the future?
> 
> Ah, I see that you use this new ordering in patch 6/6 so that you can change
> your interaction with the ext4 journal.  I'm still curious if we have a lock
> ordering inversion within DAX, but if this ordering helps you with ext4, good
> enough.
> 
> One quick comment:
> 
> > @@ -1337,19 +1353,10 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >        */
> >       entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
> >       if (IS_ERR(entry))
> > -             goto fallback;
> > +             goto finish_iomap;
> >
> > -     /*
> > -      * Note that we don't use iomap_apply here.  We aren't doing I/O, only
> > -      * setting up a mapping, so really we're using iomap_begin() as a way
> > -      * to look up our filesystem block.
> > -      */
> > -     pos = (loff_t)pgoff << PAGE_SHIFT;
> > -     error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
> > -     if (error)
> > -             goto unlock_entry;
> >       if (iomap.offset + iomap.length < pos + PMD_SIZE)
> > -             goto finish_iomap;
> > +             goto unlock_entry;
> 
> I think this offset+length bounds check could be moved, along with the
> iomap_begin() call, up above the grab_mapping_entry() call.  You would then
> goto 'finish_iomap' if you hit this error condition, allowing you to avoid
> grabbing and releasing the mapping entry.

Yes, that is nicer. Changed.

> Other than that one small nit, this looks fine to me:
> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Thanks.

								Honza

Patch

diff --git a/fs/dax.c b/fs/dax.c
index 38f996976ebf..be39633d346e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1077,6 +1077,15 @@  dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 }
 EXPORT_SYMBOL_GPL(dax_iomap_rw);
 
+static int dax_fault_return(int error)
+{
+	if (error == 0)
+		return VM_FAULT_NOPAGE;
+	if (error == -ENOMEM)
+		return VM_FAULT_OOM;
+	return VM_FAULT_SIGBUS;
+}
+
 /**
  * dax_iomap_fault - handle a page fault on a DAX file
  * @vma: The virtual memory area where the fault occurred
@@ -1109,12 +1118,6 @@  int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	if (pos >= i_size_read(inode))
 		return VM_FAULT_SIGBUS;
 
-	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
-	if (IS_ERR(entry)) {
-		error = PTR_ERR(entry);
-		goto out;
-	}
-
 	if ((vmf->flags & FAULT_FLAG_WRITE) && !vmf->cow_page)
 		flags |= IOMAP_WRITE;
 
@@ -1125,9 +1128,15 @@  int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	 */
 	error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap);
 	if (error)
-		goto unlock_entry;
+		return dax_fault_return(error);
 	if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
-		error = -EIO;		/* fs corruption? */
+		vmf_ret = dax_fault_return(-EIO);	/* fs corruption? */
+		goto finish_iomap;
+	}
+
+	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
+	if (IS_ERR(entry)) {
+		vmf_ret = dax_fault_return(PTR_ERR(entry));
 		goto finish_iomap;
 	}
 
@@ -1150,13 +1159,13 @@  int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		}
 
 		if (error)
-			goto finish_iomap;
+			goto error_unlock_entry;
 
 		__SetPageUptodate(vmf->cow_page);
 		vmf_ret = finish_fault(vmf);
 		if (!vmf_ret)
 			vmf_ret = VM_FAULT_DONE_COW;
-		goto finish_iomap;
+		goto unlock_entry;
 	}
 
 	switch (iomap.type) {
@@ -1168,12 +1177,15 @@  int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		}
 		error = dax_insert_mapping(mapping, iomap.bdev, sector,
 				PAGE_SIZE, &entry, vma, vmf);
+		/* -EBUSY is fine, somebody else faulted on the same PTE */
+		if (error == -EBUSY)
+			error = 0;
 		break;
 	case IOMAP_UNWRITTEN:
 	case IOMAP_HOLE:
 		if (!(vmf->flags & FAULT_FLAG_WRITE)) {
 			vmf_ret = dax_load_hole(mapping, &entry, vmf);
-			goto finish_iomap;
+			goto unlock_entry;
 		}
 		/*FALLTHRU*/
 	default:
@@ -1182,30 +1194,25 @@  int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		break;
 	}
 
- finish_iomap:
-	if (ops->iomap_end) {
-		if (error || (vmf_ret & VM_FAULT_ERROR)) {
-			/* keep previous error */
-			ops->iomap_end(inode, pos, PAGE_SIZE, 0, flags,
-					&iomap);
-		} else {
-			error = ops->iomap_end(inode, pos, PAGE_SIZE,
-					PAGE_SIZE, flags, &iomap);
-		}
-	}
+ error_unlock_entry:
+	vmf_ret = dax_fault_return(error) | major;
  unlock_entry:
 	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
- out:
-	if (error == -ENOMEM)
-		return VM_FAULT_OOM | major;
-	/* -EBUSY is fine, somebody else faulted on the same PTE */
-	if (error < 0 && error != -EBUSY)
-		return VM_FAULT_SIGBUS | major;
-	if (vmf_ret) {
-		WARN_ON_ONCE(error); /* -EBUSY from ops->iomap_end? */
-		return vmf_ret;
+ finish_iomap:
+	if (ops->iomap_end) {
+		int copied = PAGE_SIZE;
+
+		if (vmf_ret & VM_FAULT_ERROR)
+			copied = 0;
+		/*
+		 * The fault is done by now and there's no way back (other
+		 * thread may be already happily using PTE we have installed).
+		 * Just ignore error from ->iomap_end since we cannot do much
+		 * with it.
+		 */
+		ops->iomap_end(inode, pos, PAGE_SIZE, copied, flags, &iomap);
 	}
-	return VM_FAULT_NOPAGE | major;
+	return vmf_ret;
 }
 EXPORT_SYMBOL_GPL(dax_iomap_fault);
 
@@ -1330,6 +1337,15 @@  int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		goto fallback;
 
 	/*
+	 * Note that we don't use iomap_apply here.  We aren't doing I/O, only
+	 * setting up a mapping, so really we're using iomap_begin() as a way
+	 * to look up our filesystem block.
+	 */
+	pos = (loff_t)pgoff << PAGE_SHIFT;
+	error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
+	if (error)
+		goto fallback;
+	/*
 	 * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX
 	 * PMD or a HZP entry.  If it can't (because a 4k page is already in
 	 * the tree, for instance), it will return -EEXIST and we just fall
@@ -1337,19 +1353,10 @@  int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	 */
 	entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
 	if (IS_ERR(entry))
-		goto fallback;
+		goto finish_iomap;
 
-	/*
-	 * Note that we don't use iomap_apply here.  We aren't doing I/O, only
-	 * setting up a mapping, so really we're using iomap_begin() as a way
-	 * to look up our filesystem block.
-	 */
-	pos = (loff_t)pgoff << PAGE_SHIFT;
-	error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
-	if (error)
-		goto unlock_entry;
 	if (iomap.offset + iomap.length < pos + PMD_SIZE)
-		goto finish_iomap;
+		goto unlock_entry;
 
 	vmf.pgoff = pgoff;
 	vmf.flags = flags;
@@ -1363,7 +1370,7 @@  int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	case IOMAP_UNWRITTEN:
 	case IOMAP_HOLE:
 		if (WARN_ON_ONCE(write))
-			goto finish_iomap;
+			goto unlock_entry;
 		result = dax_pmd_load_hole(vma, pmd, &vmf, address, &iomap,
 				&entry);
 		break;
@@ -1372,20 +1379,23 @@  int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		break;
 	}
 
+ unlock_entry:
+	put_locked_mapping_entry(mapping, pgoff, entry);
  finish_iomap:
 	if (ops->iomap_end) {
-		if (result == VM_FAULT_FALLBACK) {
-			ops->iomap_end(inode, pos, PMD_SIZE, 0, iomap_flags,
-					&iomap);
-		} else {
-			error = ops->iomap_end(inode, pos, PMD_SIZE, PMD_SIZE,
-					iomap_flags, &iomap);
-			if (error)
-				result = VM_FAULT_FALLBACK;
-		}
+		int copied = PMD_SIZE;
+
+		if (result == VM_FAULT_FALLBACK)
+			copied = 0;
+		/*
+		 * The fault is done by now and there's no way back (other
+		 * thread may be already happily using PMD we have installed).
+		 * Just ignore error from ->iomap_end since we cannot do much
+		 * with it.
+		 */
+		ops->iomap_end(inode, pos, PMD_SIZE, copied, iomap_flags,
+				&iomap);
 	}
- unlock_entry:
-	put_locked_mapping_entry(mapping, pgoff, entry);
  fallback:
 	if (result == VM_FAULT_FALLBACK) {
 		split_huge_pmd(vma, pmd, address);