Message ID | 20201026182019.1547662-1-bfoster@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | iomap: support partial page discard on writeback block mapping failure | expand |
On Mon, Oct 26, 2020 at 02:20:19PM -0400, Brian Foster wrote: > iomap writeback mapping failure only calls into ->discard_page() if > the current page has not been added to the ioend. Accordingly, the > XFS callback assumes a full page discard and invalidation. This is > problematic for sub-page block size filesystems where some portion > of a page might have been mapped successfully before a failure to > map a delalloc block occurs. ->discard_page() is not called in that > error scenario and the bio is explicitly failed by iomap via the > error return from ->prepare_ioend(). As a result, the filesystem > leaks delalloc blocks and corrupts the filesystem block counters. > > Since XFS is the only user of ->discard_page(), tweak the semantics > to invoke the callback unconditionally on mapping errors and provide > the first offset in the page that failed to map. Update > xfs_discard_page() to discard the corresponding portion of the file > and pass the range along to iomap_invalidatepage(). The latter > already properly handles both full and sub-page scenarios by not > changing any iomap or page state on sub-page invalidations. > > Signed-off-by: Brian Foster <bfoster@redhat.com> > --- > > One additional thing I'm trying to rectify that is slightly related to > this patch is how iomap handles the page in the partial writepage error > case. The comments suggest the page should be kept dirty, but > write_cache_pages() clears the dirty state for each page before calling > into ->writepage(). iomap_writepage_map() does call > clear_page_dirty_for_io() in the success path, which seems harmless but > superfluous. That aside, we don't seem to actually redirty the page in > the partial writepage case, so the set_page_writeback_keepwrite() call > seems insufficient. I.e., even if we did cycle back into > write_cache_pages() and find the TOWRITE page, we just skip it since the > page isn't actually dirty. > > Unless I'm missing something, this all seems slightly broken to me. I > think we can drop the clear_page_dirty_for_io() call from iomap, and > instead we need to add a call to redirty_page_for_writepage() in the > _keepwrite() error case. Beyond that, I'm kind of wondering if there's a > reason for using _keepwrite() to revisit pages as such at all. AFAICT > write_cache_pages() doesn't cycle around until it's invoked again, at > which point it retags a new set of dirty pages and as above, we > presumably have to redirty the page for _keepwrite() to have any > practical effect anyways. Thoughts? Am I missing something here? > I have a bit more information on this after some investigation. For reference, the whole keepwrite thing was originally introduced for ext4 in commit 1c8349a17137 ("ext4: fix data integrity sync in ordered mode"). It was subsequently used by XFS in commit 0d085a529b42 ("xfs: ensure WB_SYNC_ALL writeback handles partial pages correctly"). It was specifically required for cluster writeback (xfs_cluster_write()), which was invoked from within ->writepage() to locate and write out additional dirty pages covering the same extent mapping as for the current ->writepage() page. The context of cluster writeout was slightly different in that the page dirty bit wasn't cleared on entry (because we were jumping out ahead of write_cache_pages()) and TOWRITE had to be maintained for the higher level writeback code to revisit the page if it was partially processed. The primary writeback code in XFS was eventually reworked and the whole cluster write thing removed. Since much of the code was duplicate between the two paths, I think some of the partial/error handling bits leaked over into the traditional writeback path during that process. For example, commit 150d5be09ce4 ("xfs: remove xfs_cancel_ioend") reworked the writepage error handling using xfs_start_page_writeback(), but that helper included clear_dirty logic that 1.) isn't quite relevant/correct for ->writepage() given the aforementioned contextual differences and 2.) managed the keepwrite tag as a side effect. IOW, traditional writepage expected the dirty bit already cleared and had to either redirty the page or call set_page_writeback(). The cluster writepage path expected the dirty bit to be set and either processed the entire page (and cleared the dirty bit and called set_page_writeback()) or partially processed the page (and did not clear the dirty bit and called set_page_writeback_keepwrite()). The latter usage was eventually folded into ->writepage() where the underlying assumptions of cluster write did not hold true. From there, various reworks/refactors may have occurred in XFS and I suspect eventually lead to this code being lifted out into iomap. Brian > Brian > > fs/iomap/buffered-io.c | 16 +++++++++------- > fs/xfs/xfs_aops.c | 13 +++++++------ > include/linux/iomap.h | 2 +- > 3 files changed, 17 insertions(+), 14 deletions(-) > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > index bcfc288dba3f..a99964c4b93f 100644 > --- a/fs/iomap/buffered-io.c > +++ b/fs/iomap/buffered-io.c > @@ -1412,14 +1412,16 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc, > * appropriately. > */ > if (unlikely(error)) { > + unsigned int pageoff = offset_in_page(file_offset); > + /* > + * Let the filesystem know what portion of the current page > + * failed to map. If the page wasn't been added to ioend, it > + * won't be affected by I/O completion and we must unlock it > + * now. > + */ > + if (wpc->ops->discard_page) > + wpc->ops->discard_page(page, pageoff); > if (!count) { > - /* > - * If the current page hasn't been added to ioend, it > - * won't be affected by I/O completions and we must > - * discard and unlock it right here. > - */ > - if (wpc->ops->discard_page) > - wpc->ops->discard_page(page); > ClearPageUptodate(page); > unlock_page(page); > goto done; > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c > index b35611882ff9..8a17b46a3978 100644 > --- a/fs/xfs/xfs_aops.c > +++ b/fs/xfs/xfs_aops.c > @@ -527,13 +527,14 @@ xfs_prepare_ioend( > */ > static void > xfs_discard_page( > - struct page *page) > + struct page *page, > + unsigned int pageoff) > { > struct inode *inode = page->mapping->host; > struct xfs_inode *ip = XFS_I(inode); > struct xfs_mount *mp = ip->i_mount; > - loff_t offset = page_offset(page); > - xfs_fileoff_t start_fsb = XFS_B_TO_FSBT(mp, offset); > + loff_t fileoff = page_offset(page) + pageoff; > + xfs_fileoff_t start_fsb = XFS_B_TO_FSBT(mp, fileoff); > int error; > > if (XFS_FORCED_SHUTDOWN(mp)) > @@ -541,14 +542,14 @@ xfs_discard_page( > > xfs_alert_ratelimited(mp, > "page discard on page "PTR_FMT", inode 0x%llx, offset %llu.", > - page, ip->i_ino, offset); > + page, ip->i_ino, fileoff); > > error = xfs_bmap_punch_delalloc_range(ip, start_fsb, > - PAGE_SIZE / i_blocksize(inode)); > + (PAGE_SIZE - pageoff) / i_blocksize(inode)); > if (error && !XFS_FORCED_SHUTDOWN(mp)) > xfs_alert(mp, "page discard unable to remove delalloc mapping."); > out_invalidate: > - iomap_invalidatepage(page, 0, PAGE_SIZE); > + iomap_invalidatepage(page, pageoff, PAGE_SIZE - pageoff); > } > > static const struct iomap_writeback_ops xfs_writeback_ops = { > diff --git a/include/linux/iomap.h b/include/linux/iomap.h > index 4d1d3c3469e9..646aaefe0dae 100644 > --- a/include/linux/iomap.h > +++ b/include/linux/iomap.h > @@ -220,7 +220,7 @@ struct iomap_writeback_ops { > * Optional, allows the file system to discard state on a page where > * we failed to submit any I/O. > */ > - void (*discard_page)(struct page *page); > + void (*discard_page)(struct page *page, unsigned int pageoff); > }; > > struct iomap_writepage_ctx { > -- > 2.25.4 >
> if (unlikely(error)) { > + unsigned int pageoff = offset_in_page(file_offset); > + /* > + * Let the filesystem know what portion of the current page > + * failed to map. If the page wasn't been added to ioend, it > + * won't be affected by I/O completion and we must unlock it > + * now. > + */ > + if (wpc->ops->discard_page) > + wpc->ops->discard_page(page, pageoff); I don't think we need the pageoff variable here. Also it would seem more natural to pass the full file_offset offset instead of having to recreate it in the file system.
On Wed, Oct 28, 2020 at 07:31:27AM +0000, Christoph Hellwig wrote: > > if (unlikely(error)) { > > + unsigned int pageoff = offset_in_page(file_offset); > > + /* > > + * Let the filesystem know what portion of the current page > > + * failed to map. If the page wasn't been added to ioend, it > > + * won't be affected by I/O completion and we must unlock it > > + * now. > > + */ > > + if (wpc->ops->discard_page) > > + wpc->ops->discard_page(page, pageoff); > > I don't think we need the pageoff variable here. Also it would > seem more natural to pass the full file_offset offset instead of > having to recreate it in the file system. > I used the variable just to avoid having to split the function call into multiple lines. I.e., it just looked more readable to me than: if (wpc->ops->discard_page) wpc->ops->discard_page(page, offset_in_page(file_offset)); I can change it back if that is preferred (or possibly use a function pointer variable instead). I suppose that's also avoided by passing file_offset directly, but that seems a little odd to me for a page oriented callback. Brian
On Wed, Oct 28, 2020 at 07:32:00AM -0400, Brian Foster wrote: > I used the variable just to avoid having to split the function call into > multiple lines. I.e., it just looked more readable to me than: > > if (wpc->ops->discard_page) > wpc->ops->discard_page(page, > offset_in_page(file_offset)); > > I can change it back if that is preferred (or possibly use a function > pointer variable instead). I suppose that's also avoided by passing > file_offset directly, but that seems a little odd to me for a page > oriented callback. I think passing the file offset makes more sense, especially as the only instance needs it anyway.
On Wed, Oct 28, 2020 at 02:04:08PM +0000, Christoph Hellwig wrote: > On Wed, Oct 28, 2020 at 07:32:00AM -0400, Brian Foster wrote: > > I used the variable just to avoid having to split the function call into > > multiple lines. I.e., it just looked more readable to me than: > > > > if (wpc->ops->discard_page) > > wpc->ops->discard_page(page, > > offset_in_page(file_offset)); > > > > I can change it back if that is preferred (or possibly use a function > > pointer variable instead). I suppose that's also avoided by passing > > file_offset directly, but that seems a little odd to me for a page > > oriented callback. > > I think passing the file offset makes more sense, especially as the > only instance needs it anyway. > The callback uses both, but I'm not tied to it either way. I'll make that tweak for the next version.. Brian
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c index bcfc288dba3f..a99964c4b93f 100644 --- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -1412,14 +1412,16 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc, * appropriately. */ if (unlikely(error)) { + unsigned int pageoff = offset_in_page(file_offset); + /* + * Let the filesystem know what portion of the current page + * failed to map. If the page wasn't been added to ioend, it + * won't be affected by I/O completion and we must unlock it + * now. + */ + if (wpc->ops->discard_page) + wpc->ops->discard_page(page, pageoff); if (!count) { - /* - * If the current page hasn't been added to ioend, it - * won't be affected by I/O completions and we must - * discard and unlock it right here. - */ - if (wpc->ops->discard_page) - wpc->ops->discard_page(page); ClearPageUptodate(page); unlock_page(page); goto done; diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c index b35611882ff9..8a17b46a3978 100644 --- a/fs/xfs/xfs_aops.c +++ b/fs/xfs/xfs_aops.c @@ -527,13 +527,14 @@ xfs_prepare_ioend( */ static void xfs_discard_page( - struct page *page) + struct page *page, + unsigned int pageoff) { struct inode *inode = page->mapping->host; struct xfs_inode *ip = XFS_I(inode); struct xfs_mount *mp = ip->i_mount; - loff_t offset = page_offset(page); - xfs_fileoff_t start_fsb = XFS_B_TO_FSBT(mp, offset); + loff_t fileoff = page_offset(page) + pageoff; + xfs_fileoff_t start_fsb = XFS_B_TO_FSBT(mp, fileoff); int error; if (XFS_FORCED_SHUTDOWN(mp)) @@ -541,14 +542,14 @@ xfs_discard_page( xfs_alert_ratelimited(mp, "page discard on page "PTR_FMT", inode 0x%llx, offset %llu.", - page, ip->i_ino, offset); + page, ip->i_ino, fileoff); error = xfs_bmap_punch_delalloc_range(ip, start_fsb, - PAGE_SIZE / i_blocksize(inode)); + (PAGE_SIZE - pageoff) / i_blocksize(inode)); if (error && !XFS_FORCED_SHUTDOWN(mp)) xfs_alert(mp, "page discard unable to remove delalloc mapping."); out_invalidate: - iomap_invalidatepage(page, 0, PAGE_SIZE); + iomap_invalidatepage(page, pageoff, PAGE_SIZE - pageoff); } static const struct iomap_writeback_ops xfs_writeback_ops = { diff --git a/include/linux/iomap.h b/include/linux/iomap.h index 4d1d3c3469e9..646aaefe0dae 100644 --- a/include/linux/iomap.h +++ b/include/linux/iomap.h @@ -220,7 +220,7 @@ struct iomap_writeback_ops { * Optional, allows the file system to discard state on a page where * we failed to submit any I/O. */ - void (*discard_page)(struct page *page); + void (*discard_page)(struct page *page, unsigned int pageoff); }; struct iomap_writepage_ctx {
iomap writeback mapping failure only calls into ->discard_page() if the current page has not been added to the ioend. Accordingly, the XFS callback assumes a full page discard and invalidation. This is problematic for sub-page block size filesystems where some portion of a page might have been mapped successfully before a failure to map a delalloc block occurs. ->discard_page() is not called in that error scenario and the bio is explicitly failed by iomap via the error return from ->prepare_ioend(). As a result, the filesystem leaks delalloc blocks and corrupts the filesystem block counters. Since XFS is the only user of ->discard_page(), tweak the semantics to invoke the callback unconditionally on mapping errors and provide the first offset in the page that failed to map. Update xfs_discard_page() to discard the corresponding portion of the file and pass the range along to iomap_invalidatepage(). The latter already properly handles both full and sub-page scenarios by not changing any iomap or page state on sub-page invalidations. Signed-off-by: Brian Foster <bfoster@redhat.com> --- One additional thing I'm trying to rectify that is slightly related to this patch is how iomap handles the page in the partial writepage error case. The comments suggest the page should be kept dirty, but write_cache_pages() clears the dirty state for each page before calling into ->writepage(). iomap_writepage_map() does call clear_page_dirty_for_io() in the success path, which seems harmless but superfluous. That aside, we don't seem to actually redirty the page in the partial writepage case, so the set_page_writeback_keepwrite() call seems insufficient. I.e., even if we did cycle back into write_cache_pages() and find the TOWRITE page, we just skip it since the page isn't actually dirty. Unless I'm missing something, this all seems slightly broken to me. I think we can drop the clear_page_dirty_for_io() call from iomap, and instead we need to add a call to redirty_page_for_writepage() in the _keepwrite() error case. Beyond that, I'm kind of wondering if there's a reason for using _keepwrite() to revisit pages as such at all. AFAICT write_cache_pages() doesn't cycle around until it's invoked again, at which point it retags a new set of dirty pages and as above, we presumably have to redirty the page for _keepwrite() to have any practical effect anyways. Thoughts? Am I missing something here? Brian fs/iomap/buffered-io.c | 16 +++++++++------- fs/xfs/xfs_aops.c | 13 +++++++------ include/linux/iomap.h | 2 +- 3 files changed, 17 insertions(+), 14 deletions(-)