Message ID | 157915535801.2406747.10502356876965505327.stgit@magnolia (mailing list archive) |
---|---|
State | New, archived |
Series | xfs: fix stale disk exposure after crash |
On Wed, Jan 15, 2020 at 10:15:58PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
>
> In the previous patch, we solved a stale disk contents exposure problem
> by forcing the delalloc write path to create unwritten extents, write
> the data, and convert the extents to written after writeback completes.
>
> This is a pretty huge hammer to use, so we'll relax the delalloc write
> strategy to go straight to written extents (as we once did) if someone
> tells us to write the entire file to disk. This reopens the exposure
> window slightly, but we'll only be affected if writeback completes out
> of order and the system crashes during writeback.
>
> Because once again we can map written extents past EOF, we also
> enlarge the writepages window downward if the window is beyond the
> on-disk size and there are written extents after the EOF block. This
> ensures that speculative post-EOF preallocations are not left uncovered.

This does sound really sketchy. Do you have any performance numbers
justifying something this nasty?

On Thu, Jan 16, 2020 at 08:49:00AM -0800, Christoph Hellwig wrote:
> On Wed, Jan 15, 2020 at 10:15:58PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> >
> > In the previous patch, we solved a stale disk contents exposure problem
> > by forcing the delalloc write path to create unwritten extents, write
> > the data, and convert the extents to written after writeback completes.
> >
> > This is a pretty huge hammer to use, so we'll relax the delalloc write
> > strategy to go straight to written extents (as we once did) if someone
> > tells us to write the entire file to disk. This reopens the exposure
> > window slightly, but we'll only be affected if writeback completes out
> > of order and the system crashes during writeback.
> >
> > Because once again we can map written extents past EOF, we also
> > enlarge the writepages window downward if the window is beyond the
> > on-disk size and there are written extents after the EOF block. This
> > ensures that speculative post-EOF preallocations are not left uncovered.
>
> This does sound really sketchy. Do you have any performance numbers
> justifying something this nasty?

Nope! :D

IIRC Dave also expressed interest in performance impacts the last time
I sent this series, albeit more from the perspective of quantifying how
much pain we'd incur from forcing all writes to perform an unwritten
extent conversion at the end.

FWIW after months of running this on my internal systems, I haven't been
able to quantify any significant difference before and after, even with
rmap enabled. There's slightly more log traffic from the extra
bmbt/rmapbt/inode core updates, but even then the log is fairly good at
deduping repeated updates. Both transactions usually commit before the
log checkpoints.

Frankly I wouldn't apply this patch (or 'xfs: extend the range of
flush_unmap ranges') on the grounds that re-opening potential disclosure
flaws is never worth the risk.

I'm also pretty sure that being careful to convert delalloc data fork
extents to unwritten extents fixes the stale disclosure flaw that Ritesh
wrote about in ('iomap: direct-io: Move inode_dio_begin before
filemap_write_and_wait_range').

(As far as ext4 goes, I talked to Jan and Ted this morning and they
seemed to think that they could solve the race on their end by retaining
the unwritten state in the incore extent cache because ext4 apparently
doesn't commit the extent map update transaction until after writeback
completes.)

--D

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 220ea1dc67ab..65b2bd12720e 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4545,7 +4545,8 @@ xfs_bmapi_convert_delalloc(
 	int			whichfork,
 	xfs_off_t		offset,
 	struct iomap		*iomap,
-	unsigned int		*seq)
+	unsigned int		*seq,
+	bool			full_writeback)
 {
 	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
 	struct xfs_mount	*mp = ip->i_mount;
@@ -4610,11 +4611,12 @@ xfs_bmapi_convert_delalloc(
 	 *
 	 * New data fork extents must be mapped in as unwritten and converted
 	 * to real extents after the write succeeds to avoid exposing stale
-	 * disk contents if we crash.
+	 * disk contents if we crash. We relax this requirement if we've been
+	 * told to flush all data to disk.
 	 */
 	if (whichfork == XFS_COW_FORK)
 		bma.flags = XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC;
-	else
+	else if (!full_writeback)
 		bma.flags = XFS_BMAPI_PREALLOC;
 
 	if (!xfs_iext_peek_prev_extent(ifp, &bma.icur, &bma.prev))
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 14d25e0b7d9c..9d0b0ed83c9f 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -228,7 +228,8 @@ int xfs_bmapi_reserve_delalloc(struct xfs_inode *ip, int whichfork,
 		struct xfs_bmbt_irec *got, struct xfs_iext_cursor *cur,
 		int eof);
 int	xfs_bmapi_convert_delalloc(struct xfs_inode *ip, int whichfork,
-		xfs_off_t offset, struct iomap *iomap, unsigned int *seq);
+		xfs_off_t offset, struct iomap *iomap, unsigned int *seq,
+		bool full_writeback);
 int	xfs_bmap_add_extent_unwritten_real(struct xfs_trans *tp,
 		struct xfs_inode *ip, int whichfork,
 		struct xfs_iext_cursor *icur, struct xfs_btree_cur **curp,
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 3a688eb5c5ae..45174dfa0b7d 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -18,10 +18,13 @@
 #include "xfs_bmap_util.h"
 #include "xfs_reflink.h"
 
+#define XFS_WRITEPAGE_FULL_RANGE	(1 << 0)
+
 struct xfs_writepage_ctx {
 	struct iomap_writepage_ctx ctx;
 	unsigned int		data_seq;
 	unsigned int		cow_seq;
+	unsigned int		flags;
 };
 
 static inline struct xfs_writepage_ctx *
@@ -327,7 +330,8 @@ xfs_convert_blocks(
 	 */
 	do {
 		error = xfs_bmapi_convert_delalloc(ip, whichfork, offset,
-				&wpc->iomap, seq);
+				&wpc->iomap, seq,
+				XFS_WPC(wpc)->flags & XFS_WRITEPAGE_FULL_RANGE);
 		if (error)
 			return error;
 	} while (wpc->iomap.offset + wpc->iomap.length <= offset);
@@ -567,6 +571,48 @@ xfs_vm_writepage(
 	return iomap_writepage(page, wbc, &wpc.ctx, &xfs_writeback_ops);
 }
 
+/*
+ * If we've been told to write a range of the file that is beyond the on-disk
+ * file size and there's a written extent beyond the EOF block, we conclude
+ * that we previously wrote a speculative post-EOF preallocation to disk (as
+ * written extents) and later extended the incore file size.
+ *
+ * To prevent exposure of the contents of those speculative preallocations
+ * after a crash, extend the writeback range all the way down to the old file
+ * size to make sure that those pages get flushed.
+ */
+static void
+xfs_vm_adjust_posteof_writepages(
+	struct xfs_inode		*ip,
+	struct writeback_control	*wbc)
+{
+	struct xfs_iext_cursor		icur;
+	struct xfs_bmbt_irec		irec;
+
+	xfs_ilock(ip, XFS_ILOCK_SHARED);
+	if (ip->i_d.di_size >= wbc->range_start)
+		goto out;
+
+	/* We're done if we can't find a real extent past EOF. */
+	if (!xfs_iext_lookup_extent(ip, XFS_IFORK_PTR(ip, XFS_DATA_FORK),
+			XFS_B_TO_FSB(ip->i_mount, ip->i_d.di_size), &icur,
+			&irec))
+		goto out;
+	if (irec.br_startblock == HOLESTARTBLOCK)
+		goto out;
+
+	wbc->range_start = ip->i_d.di_size;
+
+	/* Adjust the number of pages to write, if needed. */
+	if (wbc->nr_to_write == LONG_MAX)
+		goto out;
+
+	wbc->nr_to_write += (wbc->range_start >> PAGE_SHIFT) -
+			    (ip->i_d.di_size >> PAGE_SHIFT);
+out:
+	xfs_iunlock(ip, XFS_ILOCK_SHARED);
+}
+
 STATIC int
 xfs_vm_writepages(
 	struct address_space	*mapping,
@@ -574,6 +620,10 @@ xfs_vm_writepages(
 {
 	struct xfs_writepage_ctx wpc = { };
 
+	xfs_vm_adjust_posteof_writepages(XFS_I(mapping->host), wbc);
+	if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
+		wpc.flags |= XFS_WRITEPAGE_FULL_RANGE;
+
 	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
 	return iomap_writepages(mapping, wbc, &wpc.ctx, &xfs_writeback_ops);
 }
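
For readers who want to poke at the window arithmetic outside the kernel, here is a rough userspace model of the three decisions the patch makes: widening the writepages window down to the on-disk size, detecting a full-file flush, and choosing between unwritten and written mappings. It is only a sketch. Every `model_`/`MODEL_` name is hypothetical, the 4 KiB page size is an assumption, and the extent lookup is reduced to a boolean the caller supplies. One deliberate difference from the diff as posted: there, `wbc->range_start` is overwritten with `di_size` before the `nr_to_write` delta is computed, so the delta evaluates to zero. The model saves the old start first, on the assumption that growing the page budget by the size of the added window was the intent.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT		12	/* assumed 4 KiB pages */
#define MODEL_WRITEPAGE_FULL_RANGE	(1u << 0)

/* Hypothetical stand-in for the writeback_control fields used by the patch. */
struct model_wbc {
	long long	range_start;	/* byte offset, inclusive */
	long long	range_end;	/* byte offset, inclusive */
	long		nr_to_write;	/* page budget; LONG_MAX means unlimited */
};

/*
 * Widen the writeback window down to the on-disk size when the window starts
 * beyond it and a real (non-hole) extent exists past EOF.  ondisk_size plays
 * the role of ip->i_d.di_size; real_extent_past_eof replaces the extent
 * lookup done under the ILOCK in the kernel.
 */
void
model_adjust_posteof(
	struct model_wbc	*wbc,
	long long		ondisk_size,
	bool			real_extent_past_eof)
{
	long long		old_start;

	assert(wbc != NULL);

	if (ondisk_size >= wbc->range_start)
		return;
	if (!real_extent_past_eof)
		return;

	/* Remember the old start before overwriting it. */
	old_start = wbc->range_start;
	wbc->range_start = ondisk_size;

	/* An unlimited budget needs no adjustment. */
	if (wbc->nr_to_write == LONG_MAX)
		return;

	wbc->nr_to_write += (long)((old_start >> MODEL_PAGE_SHIFT) -
				   (ondisk_size >> MODEL_PAGE_SHIFT));
}

/* A request covering [0, LLONG_MAX] is treated as a full-file flush. */
unsigned int
model_writepage_flags(
	const struct model_wbc	*wbc)
{
	assert(wbc != NULL);

	if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
		return MODEL_WRITEPAGE_FULL_RANGE;
	return 0;
}

/*
 * Returns true when a converted delalloc extent should be mapped unwritten
 * (XFS_BMAPI_PREALLOC in the patch).  COW fork extents always are; data fork
 * extents are unless this is a full-file flush.
 */
bool
model_map_unwritten(
	bool			cow_fork,
	unsigned int		flags)
{
	if (cow_fork)
		return true;
	return !(flags & MODEL_WRITEPAGE_FULL_RANGE);
}
```

Calling `model_adjust_posteof()` on a window that starts at byte 32768 with an on-disk size of 8192 and a budget of 4 pages moves the start to 8192 and raises the budget to 10 pages. Since the adjusted window no longer starts at zero, `model_writepage_flags()` leaves the full-range flag clear and data fork extents stay unwritten until I/O completion.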