Message ID | 1479980796-26161-6-git-send-email-jack@suse.cz (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Thu, Nov 24, 2016 at 10:46:35AM +0100, Jan Kara wrote: > Currently ->iomap_begin() handler is called with entry lock held. If the > filesystem held any locks between ->iomap_begin() and ->iomap_end() > (such as ext4 which will want to hold transaction open), this would cause > lock inversion with the iomap_apply() from standard IO path which first > calls ->iomap_begin() and only then calls ->actor() callback which grabs > entry locks for DAX. I don't see the dax_iomap_actor() grabbing any entry locks for DAX? Is this an issue currently, or are you just trying to make the code consistent so we don't run into issues in the future?
On Thu, Dec 01, 2016 at 03:24:47PM -0700, Ross Zwisler wrote: > On Thu, Nov 24, 2016 at 10:46:35AM +0100, Jan Kara wrote: > > Currently ->iomap_begin() handler is called with entry lock held. If the > > filesystem held any locks between ->iomap_begin() and ->iomap_end() > > (such as ext4 which will want to hold transaction open), this would cause > > lock inversion with the iomap_apply() from standard IO path which first > > calls ->iomap_begin() and only then calls ->actor() callback which grabs > > entry locks for DAX. > > I don't see the dax_iomap_actor() grabbing any entry locks for DAX? Is this > an issue currently, or are you just trying to make the code consistent so we > don't run into issues in the future? Ah, I see that you use this new ordering in patch 6/6 so that you can change your interaction with the ext4 journal. I'm still curious if we have a lock ordering inversion within DAX, but if this ordering helps you with ext4, good enough. One quick comment: > @@ -1337,19 +1353,10 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address, > */ > entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD); > if (IS_ERR(entry)) > - goto fallback; > + goto finish_iomap; > > - /* > - * Note that we don't use iomap_apply here. We aren't doing I/O, only > - * setting up a mapping, so really we're using iomap_begin() as a way > - * to look up our filesystem block. > - */ > - pos = (loff_t)pgoff << PAGE_SHIFT; > - error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap); > - if (error) > - goto unlock_entry; > if (iomap.offset + iomap.length < pos + PMD_SIZE) > - goto finish_iomap; > + goto unlock_entry; I think this offset+length bounds check could be moved along with the iomap_begin() call up above the grab_mapping_entry(). You would then goto 'finish_iomap' if you hit this error condition, allowing you to avoid grabbing and releasing of the mapping entry. Other than that one small nit, this looks fine to me: Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
On Thu 01-12-16 15:24:47, Ross Zwisler wrote: > On Thu, Nov 24, 2016 at 10:46:35AM +0100, Jan Kara wrote: > > Currently ->iomap_begin() handler is called with entry lock held. If the > > filesystem held any locks between ->iomap_begin() and ->iomap_end() > > (such as ext4 which will want to hold transaction open), this would cause > > lock inversion with the iomap_apply() from standard IO path which first > > calls ->iomap_begin() and only then calls ->actor() callback which grabs > > entry locks for DAX. > > I don't see the dax_iomap_actor() grabbing any entry locks for DAX? Is this > an issue currently, or are you just trying to make the code consistent so we > don't run into issues in the future? So dax_iomap_actor() copies data from / to user provided buffer. That can fault and if the buffer happens to be mmaped file on DAX filesystem, the fault will end up grabbing entry locks. Sample evil test: fd = open("some_file", O_RDWR); buf = mmap(NULL, 65536, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); write(fd, buf, 4096); Honza
On Thu 01-12-16 16:27:04, Ross Zwisler wrote: > On Thu, Dec 01, 2016 at 03:24:47PM -0700, Ross Zwisler wrote: > > On Thu, Nov 24, 2016 at 10:46:35AM +0100, Jan Kara wrote: > > > Currently ->iomap_begin() handler is called with entry lock held. If the > > > filesystem held any locks between ->iomap_begin() and ->iomap_end() > > > (such as ext4 which will want to hold transaction open), this would cause > > > lock inversion with the iomap_apply() from standard IO path which first > > > calls ->iomap_begin() and only then calls ->actor() callback which grabs > > > entry locks for DAX. > > > > I don't see the dax_iomap_actor() grabbing any entry locks for DAX? Is this > > an issue currently, or are you just trying to make the code consistent so we > > don't run into issues in the future? > > Ah, I see that you use this new ordering in patch 6/6 so that you can change > your interaction with the ext4 journal. I'm still curious if we have a lock > ordering inversion within DAX, but if this ordering helps you with ext4, good > enough. > > One quick comment: > > > @@ -1337,19 +1353,10 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address, > > */ > > entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD); > > if (IS_ERR(entry)) > > - goto fallback; > > + goto finish_iomap; > > > > - /* > > - * Note that we don't use iomap_apply here. We aren't doing I/O, only > > - * setting up a mapping, so really we're using iomap_begin() as a way > > - * to look up our filesystem block. > > - */ > > - pos = (loff_t)pgoff << PAGE_SHIFT; > > - error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap); > > - if (error) > > - goto unlock_entry; > > if (iomap.offset + iomap.length < pos + PMD_SIZE) > > - goto finish_iomap; > > + goto unlock_entry; > > I think this offset+length bounds check could be moved along with the > iomap_begin() call up above the grab_mapping_entry(). You would then goto > 'finish_iomap' if you hit this error condition, allowing you to avoid grabbing > and releasing of the mapping entry. Yes, that is nicer. Changed. > Other than that one small nit, this looks fine to me: > Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Thanks. Honza
diff --git a/fs/dax.c b/fs/dax.c index 38f996976ebf..be39633d346e 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -1077,6 +1077,15 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter, } EXPORT_SYMBOL_GPL(dax_iomap_rw); +static int dax_fault_return(int error) +{ + if (error == 0) + return VM_FAULT_NOPAGE; + if (error == -ENOMEM) + return VM_FAULT_OOM; + return VM_FAULT_SIGBUS; +} + /** * dax_iomap_fault - handle a page fault on a DAX file * @vma: The virtual memory area where the fault occurred @@ -1109,12 +1118,6 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf, if (pos >= i_size_read(inode)) return VM_FAULT_SIGBUS; - entry = grab_mapping_entry(mapping, vmf->pgoff, 0); - if (IS_ERR(entry)) { - error = PTR_ERR(entry); - goto out; - } - if ((vmf->flags & FAULT_FLAG_WRITE) && !vmf->cow_page) flags |= IOMAP_WRITE; @@ -1125,9 +1128,15 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf, */ error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap); if (error) - goto unlock_entry; + return dax_fault_return(error); if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) { - error = -EIO; /* fs corruption? */ + vmf_ret = dax_fault_return(-EIO); /* fs corruption? */ + goto finish_iomap; + } + + entry = grab_mapping_entry(mapping, vmf->pgoff, 0); + if (IS_ERR(entry)) { + vmf_ret = dax_fault_return(PTR_ERR(entry)); goto finish_iomap; } @@ -1150,13 +1159,13 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf, } if (error) - goto finish_iomap; + goto error_unlock_entry; __SetPageUptodate(vmf->cow_page); vmf_ret = finish_fault(vmf); if (!vmf_ret) vmf_ret = VM_FAULT_DONE_COW; - goto finish_iomap; + goto unlock_entry; } switch (iomap.type) { @@ -1168,12 +1177,15 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf, } error = dax_insert_mapping(mapping, iomap.bdev, sector, PAGE_SIZE, &entry, vma, vmf); + /* -EBUSY is fine, somebody else faulted on the same PTE */ + if (error == -EBUSY) + error = 0; break; case IOMAP_UNWRITTEN: case IOMAP_HOLE: if (!(vmf->flags & FAULT_FLAG_WRITE)) { vmf_ret = dax_load_hole(mapping, &entry, vmf); - goto finish_iomap; + goto unlock_entry; } /*FALLTHRU*/ default: @@ -1182,30 +1194,25 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf, break; } - finish_iomap: - if (ops->iomap_end) { - if (error || (vmf_ret & VM_FAULT_ERROR)) { - /* keep previous error */ - ops->iomap_end(inode, pos, PAGE_SIZE, 0, flags, - &iomap); - } else { - error = ops->iomap_end(inode, pos, PAGE_SIZE, - PAGE_SIZE, flags, &iomap); - } - } + error_unlock_entry: + vmf_ret = dax_fault_return(error) | major; unlock_entry: put_locked_mapping_entry(mapping, vmf->pgoff, entry); - out: - if (error == -ENOMEM) - return VM_FAULT_OOM | major; - /* -EBUSY is fine, somebody else faulted on the same PTE */ - if (error < 0 && error != -EBUSY) - return VM_FAULT_SIGBUS | major; - if (vmf_ret) { - WARN_ON_ONCE(error); /* -EBUSY from ops->iomap_end? */ - return vmf_ret; + finish_iomap: + if (ops->iomap_end) { + int copied = PAGE_SIZE; + + if (vmf_ret & VM_FAULT_ERROR) + copied = 0; + /* + * The fault is done by now and there's no way back (other + * thread may be already happily using PTE we have installed). + * Just ignore error from ->iomap_end since we cannot do much + * with it. + */ + ops->iomap_end(inode, pos, PAGE_SIZE, copied, flags, &iomap); } - return VM_FAULT_NOPAGE | major; + return vmf_ret; } EXPORT_SYMBOL_GPL(dax_iomap_fault); @@ -1330,6 +1337,15 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address, goto fallback; /* + * Note that we don't use iomap_apply here. We aren't doing I/O, only + * setting up a mapping, so really we're using iomap_begin() as a way + * to look up our filesystem block. + */ + pos = (loff_t)pgoff << PAGE_SHIFT; + error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap); + if (error) + goto fallback; + /* * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX * PMD or a HZP entry. If it can't (because a 4k page is already in * the tree, for instance), it will return -EEXIST and we just fall @@ -1337,19 +1353,10 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address, */ entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD); if (IS_ERR(entry)) - goto fallback; + goto finish_iomap; - /* - * Note that we don't use iomap_apply here. We aren't doing I/O, only - * setting up a mapping, so really we're using iomap_begin() as a way - * to look up our filesystem block. - */ - pos = (loff_t)pgoff << PAGE_SHIFT; - error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap); - if (error) - goto unlock_entry; if (iomap.offset + iomap.length < pos + PMD_SIZE) - goto finish_iomap; + goto unlock_entry; vmf.pgoff = pgoff; vmf.flags = flags; @@ -1363,7 +1370,7 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address, case IOMAP_UNWRITTEN: case IOMAP_HOLE: if (WARN_ON_ONCE(write)) - goto finish_iomap; + goto unlock_entry; result = dax_pmd_load_hole(vma, pmd, &vmf, address, &iomap, &entry); break; @@ -1372,20 +1379,23 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address, break; } + unlock_entry: + put_locked_mapping_entry(mapping, pgoff, entry); finish_iomap: if (ops->iomap_end) { - if (result == VM_FAULT_FALLBACK) { - ops->iomap_end(inode, pos, PMD_SIZE, 0, iomap_flags, - &iomap); - } else { - error = ops->iomap_end(inode, pos, PMD_SIZE, PMD_SIZE, - iomap_flags, &iomap); - if (error) - result = VM_FAULT_FALLBACK; - } + int copied = PMD_SIZE; + + if (result == VM_FAULT_FALLBACK) + copied = 0; + /* + * The fault is done by now and there's no way back (other + * thread may be already happily using PMD we have installed). + * Just ignore error from ->iomap_end since we cannot do much + * with it. + */ + ops->iomap_end(inode, pos, PMD_SIZE, copied, iomap_flags, + &iomap); } - unlock_entry: - put_locked_mapping_entry(mapping, pgoff, entry); fallback: if (result == VM_FAULT_FALLBACK) { split_huge_pmd(vma, pmd, address);
Currently ->iomap_begin() handler is called with entry lock held. If the filesystem held any locks between ->iomap_begin() and ->iomap_end() (such as ext4 which will want to hold transaction open), this would cause lock inversion with the iomap_apply() from standard IO path which first calls ->iomap_begin() and only then calls ->actor() callback which grabs entry locks for DAX. Fix the problem by nesting grabbing of entry lock inside ->iomap_begin() - ->iomap_end() pair. Signed-off-by: Jan Kara <jack@suse.cz> --- fs/dax.c | 120 ++++++++++++++++++++++++++++++++++----------------------------- 1 file changed, 65 insertions(+), 55 deletions(-)