Message ID | 20190815164940.GA15198@magnolia (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v4] vfs: fix page locking deadlocks when deduping files | expand |
On Thu, Aug 15, 2019 at 09:49:40AM -0700, Darrick J. Wong wrote: > Fixes: 876bec6f9bbfcb3 ("vfs: refactor clone/dedupe_file_range common functions") > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> (I approve of this patch going upstream) However, I think there are further simplifications to be made here. With this patch applied, vfs_dedupe_get_page() now looks like this: static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset) { struct page *page; page = read_mapping_page(inode->i_mapping, offset >> PAGE_SHIFT, NULL); if (IS_ERR(page)) return page; if (!PageUptodate(page)) { put_page(page); return ERR_PTR(-EIO); } return page; } But I don't think read_mapping_page() can return a page which doesn't have PageUptodate set. Follow the path down through read_cache_page() into do_read_cache_page(). Other than the locations which return an ERR_PTR, the only return point is at the out: label. Three of the gotos to 'out' are guarded by 'if (PageUptodate)'. The fourth is after calling wait_on_page_read(). Which will return ERR_PTR(-EIO) if the page isn't Uptodate after being unlocked. Subtracting that never-exercised check from the routine leaves us with: static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset) { struct page *page; page = read_mapping_page(inode->i_mapping, offset >> PAGE_SHIFT, NULL); if (IS_ERR(page)) return page; return page; } which is fundamentally just: static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset) { return read_mapping_page(inode->i_mapping, offset >> PAGE_SHIFT, NULL); } which seems like it might have a better name and be located in pagemap.h? I might also like to see out: + VM_BUG_ON(!PageUptodate(page)); mark_page_accessed(page); at the end of do_read_cache_page(), just to be sure nobody ever screws that up.
On Thu, Aug 15, 2019 at 11:18:04AM -0700, Matthew Wilcox wrote: > But I don't think read_mapping_page() can return a page which doesn't have > PageUptodate set. Follow the path down through read_cache_page() into > do_read_cache_page(). read_mapping_page() can't, but I think we need the check after the lock_page still.
On Thu, Aug 15, 2019 at 11:47:53PM -0700, Christoph Hellwig wrote: > On Thu, Aug 15, 2019 at 11:18:04AM -0700, Matthew Wilcox wrote: > > But I don't think read_mapping_page() can return a page which doesn't have > > PageUptodate set. Follow the path down through read_cache_page() into > > do_read_cache_page(). > > read_mapping_page() can't, but I think we need the check after the > lock_page still. The current code checks before locking the page: if (!PageUptodate(page)) { put_page(page); return ERR_PTR(-EIO); } lock_page(page); so the race with clearing Uptodate already existed. XFS can ClearPageUptodate if it gets an error while doing writeback of a dirty page. But we shouldn't be able to get a dirty page. Before we get to this point, we call filemap_write_and_wait_range() while holding the inode's modification lock. So I think we can't see a !Uptodate page, but of course I'd be happy to see a sequence of events that breaks this reasoning.
diff --git a/fs/read_write.c b/fs/read_write.c index 1f5088dec566..68bcdd65b9f6 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1811,10 +1811,7 @@ static int generic_remap_check_len(struct inode *inode_in, return (remap_flags & REMAP_FILE_DEDUP) ? -EBADE : -EINVAL; } -/* - * Read a page's worth of file data into the page cache. Return the page - * locked. - */ +/* Read a page's worth of file data into the page cache. */ static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset) { struct page *page; @@ -1826,10 +1823,32 @@ static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset) put_page(page); return ERR_PTR(-EIO); } - lock_page(page); return page; } +/* + * Lock two pages, ensuring that we lock in offset order if the pages are from + * the same file. + */ +static void vfs_lock_two_pages(struct page *page1, struct page *page2) +{ + /* Always lock in order of increasing index. */ + if (page1->index > page2->index) + swap(page1, page2); + + lock_page(page1); + if (page1 != page2) + lock_page(page2); +} + +/* Unlock two pages, being careful not to unlock the same page twice. */ +static void vfs_unlock_two_pages(struct page *page1, struct page *page2) +{ + unlock_page(page1); + if (page1 != page2) + unlock_page(page2); +} + /* * Compare extents of two files to see if they are the same. * Caller must have locked both inodes to prevent write races. @@ -1867,10 +1886,23 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff, dest_page = vfs_dedupe_get_page(dest, destoff); if (IS_ERR(dest_page)) { error = PTR_ERR(dest_page); - unlock_page(src_page); put_page(src_page); goto out_error; } + + vfs_lock_two_pages(src_page, dest_page); + + /* + * Now that we've locked both pages, make sure they're still + * mapped to the file we're interested in. If not, someone + * is invalidating pages on us and we lose. + */ + if (src_page->mapping != src->i_mapping || + dest_page->mapping != dest->i_mapping) { + same = false; + goto unlock; + } + src_addr = kmap_atomic(src_page); dest_addr = kmap_atomic(dest_page); @@ -1882,8 +1914,8 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff, kunmap_atomic(dest_addr); kunmap_atomic(src_addr); - unlock_page(dest_page); - unlock_page(src_page); +unlock: + vfs_unlock_two_pages(src_page, dest_page); put_page(dest_page); put_page(src_page);