diff mbox series

[v4] vfs: fix page locking deadlocks when deduping files

Message ID 20190815164940.GA15198@magnolia (mailing list archive)
State New, archived
Headers show
Series [v4] vfs: fix page locking deadlocks when deduping files | expand

Commit Message

Darrick J. Wong Aug. 15, 2019, 4:49 p.m. UTC
From: Darrick J. Wong <darrick.wong@oracle.com>

When dedupe wants to use the page cache to compare parts of two files
for dedupe, we must be very careful to handle locking correctly.  The
current code doesn't do this.  It must lock and unlock the page only
once if the two pages are the same, since the overlapping range check
doesn't catch this when blocksize < pagesize.  If the pages are distinct
but from the same file, we must observe page locking order and lock them
in order of increasing offset to avoid clashing with writeback locking.

Fixes: 876bec6f9bbfcb3 ("vfs: refactor clone/dedupe_file_range common functions")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v4: drop the unnecessary page offset checks
v3: revalidate page after locking it
v2: provide an unlock helper
---
 fs/read_write.c |   48 ++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 40 insertions(+), 8 deletions(-)

Comments

Matthew Wilcox (Oracle) Aug. 15, 2019, 6:18 p.m. UTC | #1
On Thu, Aug 15, 2019 at 09:49:40AM -0700, Darrick J. Wong wrote:
> Fixes: 876bec6f9bbfcb3 ("vfs: refactor clone/dedupe_file_range common functions")
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>

(I approve of this patch going upstream)

However, I think there are further simplifications to be made here.
With this patch applied, vfs_dedupe_get_page() now looks like this:

static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset)
{
        struct page *page;

        page = read_mapping_page(inode->i_mapping, offset >> PAGE_SHIFT, NULL);
        if (IS_ERR(page))
                return page;
        if (!PageUptodate(page)) {
                put_page(page);
                return ERR_PTR(-EIO);
        }
        return page;
}

But I don't think read_mapping_page() can return a page which doesn't have
PageUptodate set.  Follow the path down through read_cache_page() into
do_read_cache_page().

Other than the locations which return an ERR_PTR, the only return point
is at the out: label.  Three of the gotos to 'out' are guarded by 'if
(PageUptodate)'.  The fourth is after calling wait_on_page_read().  Which
will return ERR_PTR(-EIO) if the page isn't Uptodate after being unlocked.

Subtracting that never-exercised check from the routine leaves us with:

static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset)
{
        struct page *page;

        page = read_mapping_page(inode->i_mapping, offset >> PAGE_SHIFT, NULL);
        if (IS_ERR(page))
                return page;
        return page;
}

which is fundamentally just:

static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset)
{
        return read_mapping_page(inode->i_mapping, offset >> PAGE_SHIFT, NULL);
}

which seems like it might have a better name and be located in pagemap.h?

I might also like to see

 out:
+	VM_BUG_ON(!PageUptodate(page));
 	mark_page_accessed(page);

at the end of do_read_cache_page(), just to be sure nobody ever screws
that up.
Christoph Hellwig Aug. 16, 2019, 6:47 a.m. UTC | #2
On Thu, Aug 15, 2019 at 11:18:04AM -0700, Matthew Wilcox wrote:
> But I don't think read_mapping_page() can return a page which doesn't have
> PageUptodate set.  Follow the path down through read_cache_page() into
> do_read_cache_page().

read_mapping_page() can't, but I think we need the check after the 
lock_page still.
Matthew Wilcox (Oracle) Aug. 16, 2019, 4:57 p.m. UTC | #3
On Thu, Aug 15, 2019 at 11:47:53PM -0700, Christoph Hellwig wrote:
> On Thu, Aug 15, 2019 at 11:18:04AM -0700, Matthew Wilcox wrote:
> > But I don't think read_mapping_page() can return a page which doesn't have
> > PageUptodate set.  Follow the path down through read_cache_page() into
> > do_read_cache_page().
> 
> read_mapping_page() can't, but I think we need the check after the 
> lock_page still.

The current code checks before locking the page:

        if (!PageUptodate(page)) {
                put_page(page);
                return ERR_PTR(-EIO);
        }
        lock_page(page);

so the race with clearing Uptodate already existed.

XFS can ClearPageUptodate if it gets an error while doing writeback of
a dirty page.  But we shouldn't be able to get a dirty page.  Before we
get to this point, we call filemap_write_and_wait_range() while holding
the inode's modification lock.

So I think we can't see a !Uptodate page, but of course I'd be happy to
see a sequence of events that breaks this reasoning.
diff mbox series

Patch

diff --git a/fs/read_write.c b/fs/read_write.c
index 1f5088dec566..68bcdd65b9f6 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1811,10 +1811,7 @@  static int generic_remap_check_len(struct inode *inode_in,
 	return (remap_flags & REMAP_FILE_DEDUP) ? -EBADE : -EINVAL;
 }
 
-/*
- * Read a page's worth of file data into the page cache.  Return the page
- * locked.
- */
+/* Read a page's worth of file data into the page cache. */
 static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset)
 {
 	struct page *page;
@@ -1826,10 +1823,32 @@  static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset)
 		put_page(page);
 		return ERR_PTR(-EIO);
 	}
-	lock_page(page);
 	return page;
 }
 
+/*
+ * Lock two pages, ensuring that we lock in offset order if the pages are from
+ * the same file.
+ */
+static void vfs_lock_two_pages(struct page *page1, struct page *page2)
+{
+	/* Always lock in order of increasing index. */
+	if (page1->index > page2->index)
+		swap(page1, page2);
+
+	lock_page(page1);
+	if (page1 != page2)
+		lock_page(page2);
+}
+
+/* Unlock two pages, being careful not to unlock the same page twice. */
+static void vfs_unlock_two_pages(struct page *page1, struct page *page2)
+{
+	unlock_page(page1);
+	if (page1 != page2)
+		unlock_page(page2);
+}
+
 /*
  * Compare extents of two files to see if they are the same.
  * Caller must have locked both inodes to prevent write races.
@@ -1867,10 +1886,23 @@  static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 		dest_page = vfs_dedupe_get_page(dest, destoff);
 		if (IS_ERR(dest_page)) {
 			error = PTR_ERR(dest_page);
-			unlock_page(src_page);
 			put_page(src_page);
 			goto out_error;
 		}
+
+		vfs_lock_two_pages(src_page, dest_page);
+
+		/*
+		 * Now that we've locked both pages, make sure they're still
+		 * mapped to the file we're interested in.  If not, someone
+		 * is invalidating pages on us and we lose.
+		 */
+		if (src_page->mapping != src->i_mapping ||
+		    dest_page->mapping != dest->i_mapping) {
+			same = false;
+			goto unlock;
+		}
+
 		src_addr = kmap_atomic(src_page);
 		dest_addr = kmap_atomic(dest_page);
 
@@ -1882,8 +1914,8 @@  static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 
 		kunmap_atomic(dest_addr);
 		kunmap_atomic(src_addr);
-		unlock_page(dest_page);
-		unlock_page(src_page);
+unlock:
+		vfs_unlock_two_pages(src_page, dest_page);
 		put_page(dest_page);
 		put_page(src_page);