Message ID: 155259894630.30230.10064390935593758177.stgit@magnolia (mailing list archive)
State: Superseded
Series: xfs: various rough fixes
On Thu, Mar 14, 2019 at 02:29:06PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
>
> When writing to a delalloc region in the data fork, commit the new
> allocations (of the da reservation) as unwritten so that the mappings
> are only marked written once writeback completes successfully. This
> fixes the problem of stale data exposure if the system goes down during
> targeted writeback of a specific region of a file, as tested by
> generic/042.

So what does this do to buffered sequential and random write
performance?

Next, if the entire delalloc extent being allocated is beyond the
on-disk EOF (typical for extending sequential buffered writes), why
do we want those to be allocated as unwritten? i.e. isn't "allocate
as unwritten" only necessary for delalloc extent allocation
inside EOF?

Cheers,

Dave.
On Fri, Mar 15, 2019 at 10:08:41AM +1100, Dave Chinner wrote:
> On Thu, Mar 14, 2019 at 02:29:06PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> >
> > When writing to a delalloc region in the data fork, commit the new
> > allocations (of the da reservation) as unwritten so that the mappings
> > are only marked written once writeback completes successfully. This
> > fixes the problem of stale data exposure if the system goes down during
> > targeted writeback of a specific region of a file, as tested by
> > generic/042.
>
> So what does this do to buffered sequential and random write
> performance?

Whacks it quite a bit -- 10-20% depending on how fast the storage is in
the first place. I barely noticed on my usb thumb drive, however. :P

That said, shovelling that many writes through the completion workqueues
creates a thundering herd problem on the ILOCK, so maybe it wouldn't be
so bad if we simply dumped the completions on a per-inode queue and only
had one thread handling the completions.

(As we've been discussing on IRC.)

> Next, if the entire delalloc extent being allocated is beyond the
> on-disk EOF (typical for extending sequential buffered writes), why
> do we want those to be allocated as unwritten? i.e. isn't "allocate
> as unwritten" only necessary for delalloc extent allocation
> inside EOF?

I wasn't sure on this point -- for delalloc extents that start at or
above EOF, we can continue to go straight to real like we do today. We
still have to use the setfilesize transaction to update isize on disk,
so it probably doesn't make that big of a difference.

If the delalloc extent crosses EOF then I think it makes sense to map it
in as unwritten, issue the write, and convert the extent to real during
io completion, and not split it up just to avoid having unwritten
extents past EOF.

IOWs, there wasn't any particular intentionality behind having the code
not set PREALLOC if the extent is totally beyond EOF; this was just the
bare minimum to get a discussion started. :)

--D

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
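The allocate-as-unwritten scheme being discussed can be sketched as a toy state model. This is illustrative Python only; none of these names exist in XFS, which tracks extent state in its on-disk extent records, and the byte strings merely stand in for what a post-crash read would return.

```python
# Toy model (not XFS code) of the delalloc -> unwritten -> written flow.

DELALLOC, UNWRITTEN, WRITTEN = "delalloc", "unwritten", "written"

class Extent:
    def __init__(self):
        self.state = DELALLOC      # buffered write has only reserved space

    def allocate(self, as_unwritten):
        # writeback maps the delalloc reservation to real blocks
        self.state = UNWRITTEN if as_unwritten else WRITTEN

    def io_complete(self):
        # conversion to written happens only after the data hits the disk
        if self.state == UNWRITTEN:
            self.state = WRITTEN

    def read_after_crash(self):
        # unwritten extents read back as zeroes; a WRITTEN extent whose
        # data never made it to disk exposes whatever the blocks held
        return b"\0" if self.state != WRITTEN else b"data-or-stale"

# Old behavior: allocate as written; crash before the data lands -> stale.
old = Extent(); old.allocate(as_unwritten=False)
# Patched behavior: allocate unwritten; crash before completion -> zeroes.
new = Extent(); new.allocate(as_unwritten=True)
```

The cost Dave asks about comes from the extra `io_complete()` step: every buffered write now needs an unwritten-extent conversion transaction at I/O completion.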
On Thu, Mar 14, 2019 at 08:13:03PM -0700, Darrick J. Wong wrote: > On Fri, Mar 15, 2019 at 10:08:41AM +1100, Dave Chinner wrote: > > On Thu, Mar 14, 2019 at 02:29:06PM -0700, Darrick J. Wong wrote: > > > From: Darrick J. Wong <darrick.wong@oracle.com> > > > > > > When writing to a delalloc region in the data fork, commit the new > > > allocations (of the da reservation) as unwritten so that the mappings > > > are only marked written once writeback completes successfully. This > > > fixes the problem of stale data exposure if the system goes down during > > > targeted writeback of a specific region of a file, as tested by > > > generic/042. > > > > So what does this do to buffered sequential and random write > > performance? > > Whacks it quite a bit -- 10-20% depending on how fast the storage is in > the first place. I barely noticed on my usb thumb drive, however. :P I figured that would be the case, with random writes being much worse... > That said, shovelling that many writes through the completion workqueues > creates a thundering herd problem on the ILOCK so maybe it wouldn't be > so bad if we simply dumped the completions on a per-inode queue and only > had one thread handling the completions. > > (As we've been discussing on IRC.) *nod* > > Next, if the entire delalloc extent being allocated is beyond the > > on-disk EOF (typical for extending sequential buffered writes), why > > do we want those to be allocated as unwritten? i.e. isn't "allocate > > as unwritten" only necessary for delalloc extent allocation > > inside EOF? > > I wasn't sure on this point -- for delalloc extents that start at or > above EOF, we can continue go straight to real like we do today. We > still have to use the setfilesize transaction to update isize on disk, > so it probably doesn't make that big of a difference. 
We have to keep the setfilesize completion code anyway (think partial
last block file extensions), but the setfilesize stuff is much, much
lower overhead than unwritten extent conversion, so I think we want to
avoid unwritten extents wherever they are unnecessary.

> If the delalloc extent crosses EOF then I think it makes sense to map it
> in as unwritten, issue the write, and convert the extent to real during
> io completion, and not split it up just to avoid having unwritten
> extents past EOF.

Yup, makes sense. For a sequential write, they should always be
beyond EOF. For anything else, we use unwritten.

> IOWS, there wasn't any particular intentionality behind having the code
> not set PREALLOC if the extent is totally beyond EOF; this was just the
> bare minimum to get a discussion started. :)

*nod*

Cheers,

Dave.
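The policy the thread converges on here can be expressed as a small decision function. This is a hypothetical sketch, not the kernel code: the real logic lives in the delalloc conversion path and works in filesystem blocks, not bytes.

```python
# Illustrative sketch of the agreed policy: delalloc conversions wholly
# at-or-beyond the on-disk EOF may go straight to written; anything that
# touches space inside EOF is allocated unwritten so a crash before
# writeback completion cannot expose stale blocks.

def alloc_state(extent_start, extent_end, disk_eof):
    """Return the allocation state for a delalloc conversion.

    extent_start/extent_end are file offsets of the new allocation and
    disk_eof is the on-disk inode size (all in bytes, for simplicity).
    """
    if extent_start >= disk_eof:
        # wholly beyond EOF: invisible until a size-update transaction
        # commits after the data write, so there is no stale-data window
        return "written"
    # inside EOF, or crossing it: map the whole extent unwritten rather
    # than splitting it just to avoid unwritten blocks past EOF
    return "unwritten"
```

A sequential extending write always lands in the first branch, which is why it avoids the unwritten-conversion overhead entirely.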
On Fri, Mar 15, 2019 at 02:40:38PM +1100, Dave Chinner wrote: > On Thu, Mar 14, 2019 at 08:13:03PM -0700, Darrick J. Wong wrote: > > On Fri, Mar 15, 2019 at 10:08:41AM +1100, Dave Chinner wrote: > > > On Thu, Mar 14, 2019 at 02:29:06PM -0700, Darrick J. Wong wrote: > > > > From: Darrick J. Wong <darrick.wong@oracle.com> > > > > > > > > When writing to a delalloc region in the data fork, commit the new > > > > allocations (of the da reservation) as unwritten so that the mappings > > > > are only marked written once writeback completes successfully. This > > > > fixes the problem of stale data exposure if the system goes down during > > > > targeted writeback of a specific region of a file, as tested by > > > > generic/042. > > > > > > So what does this do to buffered sequential and random write > > > performance? > > > > Whacks it quite a bit -- 10-20% depending on how fast the storage is in > > the first place. I barely noticed on my usb thumb drive, however. :P > > I figured that would be the case, with random writes being much > worse... > > > That said, shovelling that many writes through the completion workqueues > > creates a thundering herd problem on the ILOCK so maybe it wouldn't be > > so bad if we simply dumped the completions on a per-inode queue and only > > had one thread handling the completions. > > > > (As we've been discussing on IRC.) > > *nod* > > > > Next, if the entire delalloc extent being allocated is beyond the > > > on-disk EOF (typical for extending sequential buffered writes), why > > > do we want those to be allocated as unwritten? i.e. isn't "allocate > > > as unwritten" only necessary for delalloc extent allocation > > > inside EOF? > > > > I wasn't sure on this point -- for delalloc extents that start at or > > above EOF, we can continue go straight to real like we do today. We > > still have to use the setfilesize transaction to update isize on disk, > > so it probably doesn't make that big of a difference. 
> We have to keep the setfilesize completion code, anyway (think
> partial last block file extensions), but the setfilesize stuff is
> much, much lower overhead than unwritten extent conversion so i
> think we want to avoid unwritten extents where-ever they are
> unnecessary.

The additional tradeoff to consider for fully post-eof extents is
eliminating the need to zero them out entirely on extending writes (or
truncates) that jump over them but pull them within EOF. That could
eliminate quite a bit of data I/O with large enough preallocations in
place. Of course, this is a different (and probably less common) write
pattern from straightforward sequential writes, but IIRC it comes for
free simply by mapping extents as unwritten since the zeroing code is
smart enough to know when zeroing is unnecessary. It would be nice if we
could figure out a way to take advantage of that optimization without
disrupting performance of the sequential case, though I wouldn't suggest
this as a blocker to addressing the stale data exposure problem, of
course. (Another angle could be to determine when extent conversions
might be preferable to data zeroing and do the former in place of the
latter on file extension..)

> > If the delalloc extent crosses EOF then I think it makes sense to map it
> > in as unwritten, issue the write, and convert the extent to real during
> > io completion, and not split it up just to avoid having unwritten
> > extents past EOF.
>
> Yup, makes sense. For a sequential write, they should always be
> beyond EOF. For anything else, we use unwritten.

I'm not quite sure I follow the suggested change in behavior here tbh so
maybe this is not an issue, but another thing to consider is whether
selective delalloc -> real conversion for post-eof extents translates to
conditional stale data exposure vectors in certain file extension
scenarios.
I think even the post-eof zeroing code may be susceptible to this problem if the log happens to push after a truncate transaction commits but before the associated dirty pages are written back.. Brian > > IOWS, there wasn't any particular intentionality behind having the code > > not set PREALLOC if the extent is totally beyond EOF; this was just the > > bare minimum to get a discussion started. :) > > *nod* > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Fri, Mar 15, 2019 at 08:29:40AM -0400, Brian Foster wrote:
> On Fri, Mar 15, 2019 at 02:40:38PM +1100, Dave Chinner wrote:
> > > If the delalloc extent crosses EOF then I think it makes sense to map it
> > > in as unwritten, issue the write, and convert the extent to real during
> > > io completion, and not split it up just to avoid having unwritten
> > > extents past EOF.
> >
> > Yup, makes sense. For a sequential write, they should always be
> > beyond EOF. For anything else, we use unwritten.
>
> I'm not quite sure I follow the suggested change in behavior here tbh so
> maybe this is not an issue, but another thing to consider is whether
> selective delalloc -> real conversion for post-eof extents translates to
> conditional stale data exposure vectors in certain file extension
> scenarios.

No, it doesn't, because the transaction that moves EOF is done after
the data write into the region it exposes is complete. Hence the
device cache flush that occurs before the log write containing the
transaction that moves the EOF will always result in the data being
on stable storage *before* the extending size transaction hits the
journal and exposes it.

> I think even the post-eof zeroing code may be susceptible to
> this problem if the log happens to push after a truncate transaction
> commits but before the associated dirty pages are written back..

That's what this code in xfs_setattr_size() handles:

	/*
	 * We are going to log the inode size change in this transaction so
	 * any previous writes that are beyond the on disk EOF and the new
	 * EOF that have not been written out need to be written here. If we
	 * do not write the data out, we expose ourselves to the null files
	 * problem. Note that this includes any block zeroing we did above;
	 * otherwise those blocks may not be zeroed after a crash.
	 */
	if (did_zeroing ||
	    (newsize > ip->i_d.di_size && oldsize != ip->i_d.di_size)) {
		error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
						ip->i_d.di_size, newsize - 1);
		if (error)
			return error;
	}

	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);

i.e. it ensures the data and post-eof zeroing is on disk before the
truncate transaction is started. We already hold the IOLOCK at this
point and have drained in-flight direct IO, so no IO that can affect the
truncate region will begin or be in progress after the flush and
wait above is complete.

Cheers,

Dave.
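The flush-before-commit rule that xfs_setattr_size() enforces can be modelled in a few lines. This is illustrative Python only; `setattr_size`, `disk`, and `cache` are invented stand-ins for stable storage and the page cache, not kernel interfaces.

```python
# Toy ordering model: dirty pages between the on-disk EOF and the new
# size must reach "disk" before the size-change transaction is allowed
# to commit, or a crash would replay the new size over never-written data.

def setattr_size(disk, cache, disk_size, newsize, did_zeroing):
    if did_zeroing or newsize > disk_size:
        # analogue of filemap_write_and_wait_range(): push the cached
        # range [disk_size, newsize) to disk and wait for completion
        for off in sorted(cache):
            if disk_size <= off < newsize:
                disk[off] = cache.pop(off)
    # only now may the transaction logging the new size commit; any size
    # a crash replays is covered entirely by data already on disk
    return newsize

# extend a file from 0 to 8192 bytes with dirty zeroed + written pages
disk, cache = {}, {0: b"zeroed", 4096: b"data"}
new_size = setattr_size(disk, cache, 0, 8192, did_zeroing=True)
```

The point of the model is the ordering, not the data movement: the flush happens strictly before the size update becomes visible.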
On Mon, Mar 18, 2019 at 09:40:11AM +1100, Dave Chinner wrote: > On Fri, Mar 15, 2019 at 08:29:40AM -0400, Brian Foster wrote: > > On Fri, Mar 15, 2019 at 02:40:38PM +1100, Dave Chinner wrote: > > > > If the delalloc extent crosses EOF then I think it makes sense to map it > > > > in as unwritten, issue the write, and convert the extent to real during > > > > io completion, and not split it up just to avoid having unwritten > > > > extents past EOF. > > > > > > Yup, makes sense. For a sequential write, they should always be > > > beyond EOF. For anything else, we use unwritten. > > > > > > > I'm not quite sure I follow the suggested change in behavior here tbh so > > maybe this is not an issue, but another thing to consider is whether > > selective delalloc -> real conversion for post-eof extents translates to > > conditional stale data exposure vectors in certain file extension > > scenarios. > > No, it doesn't because the transaction that moves EOF is done after > the data write into the region it exposes is complete. hence the > device cache flush that occurs before the log write containing the > transaction that moves the EOF will always result in the data being > on stable storage *before* the extending szie transaction hits the > journal and exposes it. > Ok, but this ordering alone doesn't guarantee the data we've just written is on-disk before the transaction hits the log. > > I think even the post-eof zeroing code may be susceptible to > > this problem if the log happens to push after a truncate transaction > > commits but before the associated dirty pages are written back.. > > That's what this code in xfs_setattr_size() handles: > > /* > * We are going to log the inode size change in this transaction so > * any previous writes that are beyond the on disk EOF and the new > * EOF that have not been written out need to be written here. If we > * do not write the data out, we expose ourselves to the null files > * problem. 
Note that this includes any block zeroing we did above; > * otherwise those blocks may not be zeroed after a crash. > */ > if (did_zeroing || > (newsize > ip->i_d.di_size && oldsize != ip->i_d.di_size)) { > error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping, > ip->i_d.di_size, newsize - 1); > if (error) > return error; > } > Ah, I forgot about this code. So this combined with the above provides correctness for this particular case with respect to stale data exposure after a crash. AFAICT this is still a performance tradeoff as the zeroing may not be necessary if the post-eof extents happened to be unwritten. This optimization is also achieved "for free" in this path because did_zeroing is only set if iomap had to physically zero some pages. This also only covers one extension scenario. It looks like the typical write extend scenario is intended to be handled similarly by submitting the di_size update transaction from data write completion. It looks to me that this should be fairly robust in most common cases (i.e., background writeback and/or fsync()), but I'm not totally convinced this is always robust because write time extension doesn't perform the same kind of writeback ordering as the truncate example above. Consider sync_file_range() for example. Suppose we have a large, physical post-eof prealloc extent and perform a sparse extended write to the last block of that extent followed by a sync_file_range() of the just written data. The file write should physically zero the pagecache for the preallocated region before the write, then the proper write completes and returns. Nothing has changed on disk so far. A sync_file_range() call sets the range start/end in the writeback_control and works its way through __filemap_fdatawrite_range() to write_cache_pages() and our writeback mechanism for the requested pages only. From there, the same logic applies as for background writeback: we issue writeback and I/O completion commits a file extension transaction. 
AFAICT, nothing has guaranteed writeback has even initiated on any of
the zeroed pages over the non-unwritten extent that has just been pulled
within EOF, which means we are susceptible to stale data exposure until
that happens.

While reasoning about this I realized this should be pretty easy to
test:

- pre-seed a small fs with a data pattern (0xab)
- mount with a 1mb alloc size (for ease of test)
- run something like the following (uses default 0xcd data pattern)

	# xfs_io -f -x -c "pwrite 0 4k" -c fsync -c "pwrite 32k 4k" \
		-c "sync_range -w 32k 4k" -c "sync_range -b 32k 4k" \
		-c "shutdown -f" /mnt/file

- remount and check the file:

	# hexdump /mnt/file
	0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
	*
	0001000 abab abab abab abab abab abab abab abab
	*
	0008000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
	*
	0009000

... which clearly shows stale data exposure. Hmm, this might actually be
a good fstest if we don't have one already.

Brian

> error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
>
> i.e. it ensures the data and post-eof zeroing is on disk before the
> truncate transaction is started. We already hold the IOLOCK at this
> point and drained in-flight direct IO, so no IO that can affect the
> truncate region will begin or be in progress after the flush and
> wait above is complete.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
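The mechanics of Brian's reproducer can also be captured in a toy model, for readers without an XFS setup at hand. This is plain illustrative Python, not filesystem code; the dicts stand in for stable storage and the page cache, and the byte values mirror the 0xab/0xcd patterns above.

```python
# Toy model of the exposure in the xfs_io test above: a written
# (non-unwritten) prealloc extent beyond EOF, zeroed in the page cache,
# plus a ranged writeback that completes -- and extends the on-disk
# size -- before the zeroed pages ever reach the disk.

STALE = b"\xab"                        # old data sitting in the prealloc extent

def crash_view(disk_blocks, disk_size):
    # post-crash read: any block inside EOF that was never written back
    # exposes whatever the "written" extent already contained on disk
    return [disk_blocks.get(i, STALE) for i in range(disk_size)]

disk = {}                              # blocks that reached stable storage
dirty = {i: b"\x00" for i in range(8)} # zeroed pagecache over the extent
dirty[8] = b"\xcd"                     # sparse write at the extent's end

# sync_file_range()-style ranged writeback: only block 8 is pushed out,
# and its I/O completion commits the on-disk size extension
disk[8] = dirty.pop(8)
disk_size = 9

exposed = crash_view(disk, disk_size)  # blocks 0-7 now read stale 0xab
```

Mapping the extent unwritten instead would make blocks 0-7 read as zeroes regardless of writeback order, which is the point of the patch under discussion.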
On Mon, Mar 18, 2019 at 08:40:39AM -0400, Brian Foster wrote: > On Mon, Mar 18, 2019 at 09:40:11AM +1100, Dave Chinner wrote: > > On Fri, Mar 15, 2019 at 08:29:40AM -0400, Brian Foster wrote: > > > On Fri, Mar 15, 2019 at 02:40:38PM +1100, Dave Chinner wrote: > > > > > If the delalloc extent crosses EOF then I think it makes sense to map it > > > > > in as unwritten, issue the write, and convert the extent to real during > > > > > io completion, and not split it up just to avoid having unwritten > > > > > extents past EOF. > > > > > > > > Yup, makes sense. For a sequential write, they should always be > > > > beyond EOF. For anything else, we use unwritten. > > > > > > > > > > I'm not quite sure I follow the suggested change in behavior here tbh so > > > maybe this is not an issue, but another thing to consider is whether > > > selective delalloc -> real conversion for post-eof extents translates to > > > conditional stale data exposure vectors in certain file extension > > > scenarios. > > > > No, it doesn't because the transaction that moves EOF is done after > > the data write into the region it exposes is complete. hence the > > device cache flush that occurs before the log write containing the > > transaction that moves the EOF will always result in the data being > > on stable storage *before* the extending szie transaction hits the > > journal and exposes it. > > > > Ok, but this ordering alone doesn't guarantee the data we've just > written is on-disk before the transaction hits the log. Which transaction are you talking about? This ordering guarantees that the data is on stable storage before the EOF size change transactions that exposes the region is on disk (that's the whole point of updates after data writes). 
If you are talking about the allocation transaction that we are
going to write the data into, then you are right that it doesn't
guarantee that the data is on disk before that commits, but when the
allocation is beyond EOF it doesn't matter because it won't be
exposed until after the data is written and the EOF moving
transaction is committed.

As I've previously proposed, what I suspect we should be doing is
not committing the allocation transaction until IO completion, which
closes this stale data exposure hole completely and removes the need
for allocating unwritten extents and then having to convert them.
Direct IO could also do this, and so it could stop using unwritten
extents, too....

> > > I think even the post-eof zeroing code may be susceptible to
> > > this problem if the log happens to push after a truncate transaction
> > > commits but before the associated dirty pages are written back..
> >
> > That's what this code in xfs_setattr_size() handles:
> >
> > 	/*
> > 	 * We are going to log the inode size change in this transaction so
> > 	 * any previous writes that are beyond the on disk EOF and the new
> > 	 * EOF that have not been written out need to be written here. If we
> > 	 * do not write the data out, we expose ourselves to the null files
> > 	 * problem. Note that this includes any block zeroing we did above;
> > 	 * otherwise those blocks may not be zeroed after a crash.
> > 	 */
> > 	if (did_zeroing ||
> > 	    (newsize > ip->i_d.di_size && oldsize != ip->i_d.di_size)) {
> > 		error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
> > 						ip->i_d.di_size, newsize - 1);
> > 		if (error)
> > 			return error;
> > 	}
>
> Ah, I forgot about this code. So this combined with the above provides
> correctness for this particular case with respect to stale data exposure
> after a crash. AFAICT this is still a performance tradeoff as the
> zeroing may not be necessary if the post-eof extents happened to be
> unwritten.
> This optimization is also achieved "for free" in this path
> because did_zeroing is only set if iomap had to physically zero some
> pages.
>
> This also only covers one extension scenario. It looks like the typical
> write extend scenario is intended to be handled similarly by submitting
> the di_size update transaction from data write completion. It looks to
> me that this should be fairly robust in most common cases (i.e.,
> background writeback and/or fsync()), but I'm not totally convinced this
> is always robust because write time extension doesn't perform the same
> kind of writeback ordering as the truncate example above.

It depends on IO completion order as to whether writeback is truly
robust. Calling filemap_write_and_wait_range() doesn't change that,
because it doesn't strictly order IO completions and so we can still
expose a region under IO that hasn't been signalled as complete by
the hardware.

> Consider sync_file_range() for example.

sync_file_range() is not a data integrity operation - it does not
ensure disk caches are flushed and so cannot provide any crash
guarantees. Hence we can completely ignore it when we talk about data
integrity operations - it's no different from normal writeback.

> Suppose we have a large,
> physical post-eof prealloc extent and perform a sparse extended write to
> the last block of that extent followed by a sync_file_range() of the
> just written data. The file write should physically zero the pagecache
> for the preallocated region before the write, then the proper write
> completes and returns. Nothing has changed on disk so far. A
> sync_file_range() call sets the range start/end in the writeback_control
> and works its way through __filemap_fdatawrite_range() to
> write_cache_pages() and our writeback mechanism for the requested pages
> only. From there, the same logic applies as for background writeback: we
> issue writeback and I/O completion commits a file extension transaction.
> AFAICT, nothing has guaranteed writeback has even initiated on any of
> the zeroed pages over the non-unwritten extent that has just been pulled
> within EOF, which means we are susceptible to stale data exposure until
> that happens.

Yup, yet another reason why sync_file_range() is useless as a data
integrity operation. I've had to explain this repeatedly to people
over the past 10-15 years, and eventually the man page was updated
with this:

	Warning
	This system call is extremely dangerous and should not be
	used in portable programs. None of these operations writes
	out the file's metadata. Therefore, unless the application
	is strictly performing overwrites of already-instantiated
	disk blocks, there are no guarantees that the data will be
	available after a crash. There is no user interface to know
	if a write is purely an overwrite. On filesystems using
	copy-on-write semantics (e.g., btrfs) an overwrite of
	existing allocated blocks is impossible. When writing into
	preallocated space, many filesystems also require calls
	into the block allocator, which this system call does not
	sync out to disk. This system call does not flush disk
	write caches and thus does not provide any data integrity
	on systems with volatile disk write caches.

> While reasoning about this I realized this should be pretty easy to
> test:
>
> - pre-seed a small fs with a data pattern (0xab)
> - mount with a 1mb alloc size (for ease of test)
> - run something like the following (uses default 0xcd data pattern)
>
> # xfs_io -f -x -c "pwrite 0 4k" -c fsync -c "pwrite 32k 4k" \
> 	-c "sync_range -w 32k 4k" -c "sync_range -b 32k 4k" \
> 	-c "shutdown -f" /mnt/file
>
> - remount and check the file:
>
> # hexdump /mnt/file
> 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
> *
> 0001000 abab abab abab abab abab abab abab abab
> *
> 0008000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
> *
> 0009000
>
> ... which clearly shows stale data exposure.
Hmm, this might actually be > a good fstest if we don't have one already. Yup, but keep in mind this is exactly what is expected from using sync_file_range(). Friends don't let friends use sync_file_range() because it's a guaranteed vector for user data corruption and stale data exposure. Cheers, Dave.
On Tue, Mar 19, 2019 at 07:35:06AM +1100, Dave Chinner wrote: > On Mon, Mar 18, 2019 at 08:40:39AM -0400, Brian Foster wrote: > > On Mon, Mar 18, 2019 at 09:40:11AM +1100, Dave Chinner wrote: > > > On Fri, Mar 15, 2019 at 08:29:40AM -0400, Brian Foster wrote: > > > > On Fri, Mar 15, 2019 at 02:40:38PM +1100, Dave Chinner wrote: > > > > > > If the delalloc extent crosses EOF then I think it makes sense to map it > > > > > > in as unwritten, issue the write, and convert the extent to real during > > > > > > io completion, and not split it up just to avoid having unwritten > > > > > > extents past EOF. > > > > > > > > > > Yup, makes sense. For a sequential write, they should always be > > > > > beyond EOF. For anything else, we use unwritten. > > > > > > > > > > > > > I'm not quite sure I follow the suggested change in behavior here tbh so > > > > maybe this is not an issue, but another thing to consider is whether > > > > selective delalloc -> real conversion for post-eof extents translates to > > > > conditional stale data exposure vectors in certain file extension > > > > scenarios. > > > > > > No, it doesn't because the transaction that moves EOF is done after > > > the data write into the region it exposes is complete. hence the > > > device cache flush that occurs before the log write containing the > > > transaction that moves the EOF will always result in the data being > > > on stable storage *before* the extending szie transaction hits the > > > journal and exposes it. > > > > > > > Ok, but this ordering alone doesn't guarantee the data we've just > > written is on-disk before the transaction hits the log. > > Which transaction are you talking about? This ordering guarantees > that the data is on stable storage before the EOF size change > transactions that exposes the region is on disk (that's the whole > point of updates after data writes). 
> > If you are talking about the allocation transaction taht we are > going to write the data into, then you are right that it doesn't > guarantee that the data is on disk before that commits, but when the > allocation is beyond EOF it doesn't matter ebcause is won't be > exposed until after the data is written and the EOF moving > transaction is committed. > The extending transaction that you referred to above and with respect to the device cache flush. I'm simply saying that the transaction has to be ordered with respect to full writeback completion as well, which is also what you are saying further down. (We're in agreement... ;)). > As I've previously proposed, what I suspect we should be doing is > not commiting the allocation transaction until IO completion, which > closes this stale data exposure hole completely and removes the need > for allocating unwritten extentsi and then having to convert them. > Direct IO could also do this, and so it could stop using unwritten > extents, too.... > That sounds like an interesting (and potentially more efficient) alternative to using unwritten extents across the board only to convert them as I/Os complete. I guess we'd need to make sure that the allocation transaction holds across potentially many ioends considering the max extents size of 8GB or so. I suppose the locking and handling of fragmented allocation cases could get a bit interesting there as well, but it's probably worth a prototype attempt at least to survey the concept... (I wonder if Darrick is still paying attention or found something more interesting to do..? :D) I also assume we'd still need to use unwritten extents for cases where the allocation completes before all of the extent is within EOF (i.e., the extent zeroing idea you mentioned on irc..). 
> > > > I think even the post-eof zeroing code may be susceptible to > > > > this problem if the log happens to push after a truncate transaction > > > > commits but before the associated dirty pages are written back.. > > > > > > That's what this code in xfs_setattr_size() handles: > > > > > > /* > > > * We are going to log the inode size change in this transaction so > > > * any previous writes that are beyond the on disk EOF and the new > > > * EOF that have not been written out need to be written here. If we > > > * do not write the data out, we expose ourselves to the null files > > > * problem. Note that this includes any block zeroing we did above; > > > * otherwise those blocks may not be zeroed after a crash. > > > */ > > > if (did_zeroing || > > > (newsize > ip->i_d.di_size && oldsize != ip->i_d.di_size)) { > > > error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping, > > > ip->i_d.di_size, newsize - 1); > > > if (error) > > > return error; > > > } > > > > > > > Ah, I forgot about this code. So this combined with the above provides > > correctness for this particular case with respect to stale data exposure > > after a crash. AFAICT this is still a performance tradeoff as the > > zeroing may not be necessary if the post-eof extents happened to be > > unwritten. This optimization is also achieved "for free" in this path > > because did_zeroing is only set if iomap had to physically zero some > > pages. > > > > This also only covers one extension scenario. It looks like the typical > > write extend scenario is intended to be handled similarly by submitting > > the di_size update transaction from data write completion. It looks to > > me that this should be fairly robust in most common cases (i.e., > > background writeback and/or fsync()), but I'm not totally convinced this > > is always robust because write time extension doesn't perform the same > > kind of writeback ordering as the truncate example above. 
> > It depends on IO completion order as to whether writeback is truly > robust. Calling filemap_write_and_wait_range() doesn't change that, > because it doesn't strictly order IO completions and so we can still > expose a region under IO that hasn't been signalled as complete by > the hardware. > Indeed. > > Consider sync_file_range() for example. > > sync_file_range() is not a data integrity operation - it does not > ensure disk caches are flushed and so cannot provide any crash > guarantees. hence we can completely ignore when we talk about data > integrity operations - it's no different from normal writeback. > Of course. I am by no means claiming sync_file_range() is an integrity operation. Technically neither is background writeback, even though the latter has fairly predictable behavior. I'm simply using sync_file_range() as an example and means to simulate out of order writeback for test purposes. > > Suppose we have a large, > > physical post-eof prealloc extent and perform a sparse extended write to > > the last block of that extent followed by a sync_file_range() of the > > just written data. The file write should physically zero the pagecache > > for the preallocated region before the write, then the proper write > > completes and returns. Nothing has changed on disk so far. A > > sync_file_range() call sets the range start/end in the writeback_control > > and works its way through __filemap_fdatawrite_range() to > > write_cache_pages() and our writeback mechanism for the requested pages > > only. From there, the same logic applies as for background writeback: we > > issue writeback and I/O completion commits a file extension transaction. > > AFAICT, nothing has guaranteed writeback has even initiated on any of > > the zeroed pages over the non-unwritten extent that has just been pulled > > within EOF, which means we are susceptible to stale data exposure until > > that happens. 
> > Yup, yet another reason why sync_file_range() is useless as a data > integrity operation. I've had to explain this repeatedly to people > over the past 10-15 years, and eventually the man page was updated > with this: > > Warning > This system call is extremely dangerous and should not be > used in portable programs. None of these operations writes > out the file's metadata. Therefore, unless the > application is strictly performing overwrites of > already-instantiated disk blocks, there are no guarantees > that the data will be available after a crash. There is no > user interface to know if a write is purely an overwrite. > On filesystems using copy-on-write semantics (e.g., btrfs) > an overwrite of existing allocated blocks is impossible. > When writing into preallocated space, many filesystems > also require calls into the block allocator, which this > system call does not sync out to disk. This system call > does not flush disk write caches and thus does not provide > any data integrity on systems with volatile disk write > caches. > > > ... > > Yup, but keep in mind this is exactly what is expected from using > sync_file_range(). Friends don't let friends use sync_file_range() > because it's a guaranteed vector for user data corruption and stale > data exposure. > Again, I'm using sync_file_range() as a dumb lever to control writeback completion ordering. It's a test tool for my purposes and not something I'd suggest for use otherwise or that should inherently provide integrity guarantees. Please don't get distracted reading more into it than that. We can induce the exact same writeback behavior and stale data exposure problem using something like hole punch instead, for example, and it's just as reliable a reproducer. The broader point is that by controlling writeback ordering, we can explicitly demonstrate stale data exposure. 
If fixed properly, we should never expect stale data exposure even in the event of out of order writeback completion, regardless of the cause. Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Mon, Mar 18, 2019 at 05:50:14PM -0400, Brian Foster wrote: > On Tue, Mar 19, 2019 at 07:35:06AM +1100, Dave Chinner wrote: > > On Mon, Mar 18, 2019 at 08:40:39AM -0400, Brian Foster wrote: > > > On Mon, Mar 18, 2019 at 09:40:11AM +1100, Dave Chinner wrote: > > > > On Fri, Mar 15, 2019 at 08:29:40AM -0400, Brian Foster wrote: > > > > > On Fri, Mar 15, 2019 at 02:40:38PM +1100, Dave Chinner wrote: > > > > > > > If the delalloc extent crosses EOF then I think it makes sense to map it > > > > > > > in as unwritten, issue the write, and convert the extent to real during > > > > > > > io completion, and not split it up just to avoid having unwritten > > > > > > > extents past EOF. > > > > > > > > > > > > Yup, makes sense. For a sequential write, they should always be > > > > > > beyond EOF. For anything else, we use unwritten. > > > > > > > > > > > > > > > > I'm not quite sure I follow the suggested change in behavior here tbh so > > > > > maybe this is not an issue, but another thing to consider is whether > > > > > selective delalloc -> real conversion for post-eof extents translates to > > > > > conditional stale data exposure vectors in certain file extension > > > > > scenarios. > > > > > > > > No, it doesn't because the transaction that moves EOF is done after > > > > the data write into the region it exposes is complete. Hence the > > > > device cache flush that occurs before the log write containing the > > > > transaction that moves the EOF will always result in the data being > > > > on stable storage *before* the extending size transaction hits the > > > > journal and exposes it. > > > > > > > > > > Ok, but this ordering alone doesn't guarantee the data we've just > > > written is on-disk before the transaction hits the log. > > > > Which transaction are you talking about? 
This ordering guarantees > > that the data is on stable storage before the EOF size change > > transaction that exposes the region is on disk (that's the whole > > point of updates after data writes). > > > > If you are talking about the allocation transaction that we are > > going to write the data into, then you are right that it doesn't > > guarantee that the data is on disk before that commits, but when the > > allocation is beyond EOF it doesn't matter because it won't be > > exposed until after the data is written and the EOF moving > > transaction is committed. > > > > The extending transaction that you referred to above and with respect to > the device cache flush. I'm simply saying that the transaction has to be > ordered with respect to full writeback completion as well, which is also > what you are saying further down. (We're in agreement... ;)). > > > As I've previously proposed, what I suspect we should be doing is > > not committing the allocation transaction until IO completion, which > > closes this stale data exposure hole completely and removes the need > > for allocating unwritten extents and then having to convert them. > > Direct IO could also do this, and so it could stop using unwritten > > extents, too.... > > > > That sounds like an interesting (and potentially more efficient) > alternative to using unwritten extents across the board only to convert > them as I/Os complete. I guess we'd need to make sure that the > allocation transaction holds across potentially many ioends considering > the max extents size of 8GB or so. I suppose the locking and handling of > fragmented allocation cases could get a bit interesting there as well, > but it's probably worth a prototype attempt at least to survey the > concept... (I wonder if Darrick is still paying attention or found > something more interesting to do..? 
:D) /me wandered off into fixing the io completion mess; if either of you want to work on a better prototype, I encourage you to go for it. :) Though I wonder how bad would be the effects of holding the allocation transaction open across a (potentially very slow) writeout. We'd still be holding onto the dirty inode's ilock as well as the AGF header and whatever bnobt/cntbt blocks are dirty. Wouldn't that stall any other thread that was walking the AGs looking for free space? > I also assume we'd still need to use unwritten extents for cases where > the allocation completes before all of the extent is within EOF (i.e., > the extent zeroing idea you mentioned on irc..). Yes, I think it would still be necessary. My guess is that xfs_bmapi_convert_delalloc has to create unwritten extents for any extent starting before the on-disk EOF but can create real extents for anything past the on-disk isize (because we don't set the on-disk isize to match the in-core isize until writes complete). I guess the "zero pages between old isize and new isize" code can simply reset any real extents it finds to be unwritten too, as we've been discussing. > > > > > I think even the post-eof zeroing code may be susceptible to > > > > > this problem if the log happens to push after a truncate transaction > > > > > commits but before the associated dirty pages are written back.. > > > > > > > > That's what this code in xfs_setattr_size() handles: > > > > > > > > /* > > > > * We are going to log the inode size change in this transaction so > > > > * any previous writes that are beyond the on disk EOF and the new > > > > * EOF that have not been written out need to be written here. If we > > > > * do not write the data out, we expose ourselves to the null files > > > > * problem. Note that this includes any block zeroing we did above; > > > > * otherwise those blocks may not be zeroed after a crash. 
> > > > */ > > > > if (did_zeroing || > > > > (newsize > ip->i_d.di_size && oldsize != ip->i_d.di_size)) { > > > > error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping, > > > > ip->i_d.di_size, newsize - 1); > > > > if (error) > > > > return error; > > > > } > > > > > > > > > > Ah, I forgot about this code. So this combined with the above provides > > > correctness for this particular case with respect to stale data exposure > > > after a crash. AFAICT this is still a performance tradeoff as the > > > zeroing may not be necessary if the post-eof extents happened to be > > > unwritten. This optimization is also achieved "for free" in this path > > > because did_zeroing is only set if iomap had to physically zero some > > > pages. > > > > > > This also only covers one extension scenario. It looks like the typical > > > write extend scenario is intended to be handled similarly by submitting > > > the di_size update transaction from data write completion. It looks to > > > me that this should be fairly robust in most common cases (i.e., > > > background writeback and/or fsync()), but I'm not totally convinced this > > > is always robust because write time extension doesn't perform the same > > > kind of writeback ordering as the truncate example above. > > > > It depends on IO completion order as to whether writeback is truly > > robust. Calling filemap_write_and_wait_range() doesn't change that, > > because it doesn't strictly order IO completions and so we can still > > expose a region under IO that hasn't been signalled as complete by > > the hardware. > > > > Indeed. > > > > Consider sync_file_range() for example. > > > > sync_file_range() is not a data integrity operation - it does not > > ensure disk caches are flushed and so cannot provide any crash > > guarantees. Hence we can completely ignore it when we talk about data > > integrity operations - it's no different from normal writeback. > > > > Of course. 
I am by no means claiming sync_file_range() is an integrity > operation. Technically neither is background writeback, even though the > latter has fairly predictable behavior. I'm simply using > sync_file_range() as an example and means to simulate out of order > writeback for test purposes. > > > > Suppose we have a large, > > > physical post-eof prealloc extent and perform a sparse extended write to > > > the last block of that extent followed by a sync_file_range() of the > > > just written data. The file write should physically zero the pagecache > > > for the preallocated region before the write, then the proper write > > > completes and returns. Nothing has changed on disk so far. A > > > sync_file_range() call sets the range start/end in the writeback_control > > > and works its way through __filemap_fdatawrite_range() to > > > write_cache_pages() and our writeback mechanism for the requested pages > > > only. From there, the same logic applies as for background writeback: we > > > issue writeback and I/O completion commits a file extension transaction. > > > AFAICT, nothing has guaranteed writeback has even initiated on any of > > > the zeroed pages over the non-unwritten extent that has just been pulled > > > within EOF, which means we are susceptible to stale data exposure until > > > that happens. > > > > Yup, yet another reason why sync_file_range() is useless as a data > > integrity operation. I've had to explain this repeatedly to people > > over the past 10-15 years, and eventually the man page was updated > > with this: > > > > Warning > > This system call is extremely dangerous and should not be > > used in portable programs. None of these operations writes > > out the file's metadata. Therefore, unless the > > application is strictly performing overwrites of > > already-instantiated disk blocks, there are no guarantees > > that the data will be available after a crash. There is no > > user interface to know if a write is purely an overwrite. 
> > On filesystems using copy-on-write semantics (e.g., btrfs) > > an overwrite of existing allocated blocks is impossible. > > When writing into preallocated space, many filesystems > > also require calls into the block allocator, which this > > system call does not sync out to disk. This system call > > does not flush disk write caches and thus does not provide > > any data integrity on systems with volatile disk write > > caches. > > > > > > ... > > > > Yup, but keep in mind this is exactly what is expected from using > > sync_file_range(). Friends don't let friends use sync_file_range() > > because it's a guaranteed vector for user data corruption and stale > > data exposure. > > > > Again, I'm using sync_file_range() as a dumb lever to control writeback > completion ordering. It's a test tool for my purposes and not something > I'd suggest for use otherwise or that should inherently provide > integrity guarantees. Please don't get distracted reading more into it > than that. We can induce the exact same writeback behavior and stale > data exposure problem using something like hole punch instead, for > example, and it's just as reliable a reproducer. > > The broader point is that by controlling writeback ordering, we can > explicitly demonstrate stale data exposure. If fixed properly, we should > never expect stale data exposure even in the event of out of order > writeback completion, regardless of the cause. Agree. Even if s_f_r is a giant trash fire, if we're exposing stale disk contents it's game over anyway and we ought to fix it. --D > Brian > > > Cheers, > > > > Dave. > > -- > > Dave Chinner > > david@fromorbit.com
On Tue, Mar 19, 2019 at 11:02:49AM -0700, Darrick J. Wong wrote: > On Mon, Mar 18, 2019 at 05:50:14PM -0400, Brian Foster wrote: > > On Tue, Mar 19, 2019 at 07:35:06AM +1100, Dave Chinner wrote: > > > On Mon, Mar 18, 2019 at 08:40:39AM -0400, Brian Foster wrote: > > > > On Mon, Mar 18, 2019 at 09:40:11AM +1100, Dave Chinner wrote: > > > > > On Fri, Mar 15, 2019 at 08:29:40AM -0400, Brian Foster wrote: > > > > > > On Fri, Mar 15, 2019 at 02:40:38PM +1100, Dave Chinner wrote: > > > > > > > > If the delalloc extent crosses EOF then I think it makes sense to map it > > > > > > > > in as unwritten, issue the write, and convert the extent to real during > > > > > > > > io completion, and not split it up just to avoid having unwritten > > > > > > > > extents past EOF. > > > > > > > > > > > > > > Yup, makes sense. For a sequential write, they should always be > > > > > > > beyond EOF. For anything else, we use unwritten. > > > > > > > > > > > > > > > > > > > I'm not quite sure I follow the suggested change in behavior here tbh so > > > > > > maybe this is not an issue, but another thing to consider is whether > > > > > > selective delalloc -> real conversion for post-eof extents translates to > > > > > > conditional stale data exposure vectors in certain file extension > > > > > > scenarios. > > > > > > > > > > No, it doesn't because the transaction that moves EOF is done after > > > > > the data write into the region it exposes is complete. Hence the > > > > > device cache flush that occurs before the log write containing the > > > > > transaction that moves the EOF will always result in the data being > > > > > on stable storage *before* the extending size transaction hits the > > > > > journal and exposes it. > > > > > > > > > > > > > Ok, but this ordering alone doesn't guarantee the data we've just > > > > written is on-disk before the transaction hits the log. > > > > > > Which transaction are you talking about? 
This ordering guarantees > > > that the data is on stable storage before the EOF size change > > > transaction that exposes the region is on disk (that's the whole > > > point of updates after data writes). > > > > > > If you are talking about the allocation transaction that we are > > > going to write the data into, then you are right that it doesn't > > > guarantee that the data is on disk before that commits, but when the > > > allocation is beyond EOF it doesn't matter because it won't be > > > exposed until after the data is written and the EOF moving > > > transaction is committed. > > > > > > > The extending transaction that you referred to above and with respect to > > the device cache flush. I'm simply saying that the transaction has to be > > ordered with respect to full writeback completion as well, which is also > > what you are saying further down. (We're in agreement... ;)). > > > > > As I've previously proposed, what I suspect we should be doing is > > > not committing the allocation transaction until IO completion, which > > > closes this stale data exposure hole completely and removes the need > > > for allocating unwritten extents and then having to convert them. > > > Direct IO could also do this, and so it could stop using unwritten > > > extents, too.... > > > > > > > That sounds like an interesting (and potentially more efficient) > > alternative to using unwritten extents across the board only to convert > > them as I/Os complete. I guess we'd need to make sure that the > > allocation transaction holds across potentially many ioends considering > > the max extents size of 8GB or so. I suppose the locking and handling of > > fragmented allocation cases could get a bit interesting there as well, > > but it's probably worth a prototype attempt at least to survey the > > concept... (I wonder if Darrick is still paying attention or found > > something more interesting to do..? 
:D) > > /me wandered off into fixing the io completion mess; if either of you > want to work on a better prototype, I encourage you to go for it. :) > > Though I wonder how bad would be the effects of holding the allocation > transaction open across a (potentially very slow) writeout. We'd still > be holding onto the dirty inode's ilock as well as the AGF header and > whatever bnobt/cntbt blocks are dirty. Wouldn't that stall any other > thread that was walking the AGs looking for free space? Yeah, which is why I mentioned on IRC that it may be better to use an intent/done transaction pair similar to an EFI/EFD. i.e. on IO completion we only have to log the "write done" and it can contain just the range for the specific IO, hence we can do large allocations and only mark the areas we write as containing data. That would allow zeroing of the regions that aren't covered by a done intent within EOF during recovery.... > > I also assume we'd still need to use unwritten extents for cases where > > the allocation completes before all of the extent is within EOF (i.e., > > the extent zeroing idea you mentioned on irc..). > > Yes, I think it would still be necessary. My guess is that > xfs_bmapi_convert_delalloc has to create unwritten extents for any > extent starting before the on-disk EOF but can create real extents for > anything past the on-disk isize (because we don't set the on-disk isize > to match the in-core isize until writes complete). Yes, that was my assumption too, but I suspect that with an intent/done pair, even that is not necessary. > I guess the "zero > pages between old isize and new isize" code can simply reset any real > extents it finds to be unwritten too, as we've been discussing. *nod* > > > The broader point is that by controlling writeback ordering, we can > > explicitly demonstrate stale data exposure. 
If fixed properly, we should > > never expect stale data exposure even in the event of out of order > > writeback completion, regardless of the cause. > > Agree. Even if s_f_r is a giant trash fire, if we're exposing stale > disk contents it's game over anyway and we ought to fix it. Well, that's exactly the problem with SFR, isn't it? The dirty data is outside the range the user asks to sync and the implementation doesn't go anywhere near the filesystem so we can't actually do the right thing here. So we are left to either hack fixes around a shitty, broken interface and everyone pays the penalty, or we ignore it. We've simply ignored it for the past 10+ years because it really can't be used safely for its intended purposes (as everyone who tries to use it finds out eventually)... Cheers, Dave.
On Wed, Mar 20, 2019 at 07:25:37AM +1100, Dave Chinner wrote: > On Tue, Mar 19, 2019 at 11:02:49AM -0700, Darrick J. Wong wrote: > > On Mon, Mar 18, 2019 at 05:50:14PM -0400, Brian Foster wrote: > > > On Tue, Mar 19, 2019 at 07:35:06AM +1100, Dave Chinner wrote: > > > > On Mon, Mar 18, 2019 at 08:40:39AM -0400, Brian Foster wrote: > > > > > On Mon, Mar 18, 2019 at 09:40:11AM +1100, Dave Chinner wrote: > > > > > > On Fri, Mar 15, 2019 at 08:29:40AM -0400, Brian Foster wrote: > > > > > > > On Fri, Mar 15, 2019 at 02:40:38PM +1100, Dave Chinner wrote: > > > > > > > > > If the delalloc extent crosses EOF then I think it makes sense to map it > > > > > > > > > in as unwritten, issue the write, and convert the extent to real during > > > > > > > > > io completion, and not split it up just to avoid having unwritten > > > > > > > > > extents past EOF. > > > > > > > > > > > > > > > > Yup, makes sense. For a sequential write, they should always be > > > > > > > > beyond EOF. For anything else, we use unwritten. > > > > > > > > > > > > > > > > > > > > > > I'm not quite sure I follow the suggested change in behavior here tbh so > > > > > > > maybe this is not an issue, but another thing to consider is whether > > > > > > > selective delalloc -> real conversion for post-eof extents translates to > > > > > > > conditional stale data exposure vectors in certain file extension > > > > > > > scenarios. > > > > > > > > > > > > No, it doesn't because the transaction that moves EOF is done after > > > > > > the data write into the region it exposes is complete. Hence the > > > > > > device cache flush that occurs before the log write containing the > > > > > > transaction that moves the EOF will always result in the data being > > > > > > on stable storage *before* the extending size transaction hits the > > > > > > journal and exposes it. 
> > > > > > > > > > > > > > > > Ok, but this ordering alone doesn't guarantee the data we've just > > > > > written is on-disk before the transaction hits the log. > > > > > > > > Which transaction are you talking about? This ordering guarantees > > > > that the data is on stable storage before the EOF size change > > > > transaction that exposes the region is on disk (that's the whole > > > > point of updates after data writes). > > > > > > > > If you are talking about the allocation transaction that we are > > > > going to write the data into, then you are right that it doesn't > > > > guarantee that the data is on disk before that commits, but when the > > > > allocation is beyond EOF it doesn't matter because it won't be > > > > exposed until after the data is written and the EOF moving > > > > transaction is committed. > > > > > > > > > > The extending transaction that you referred to above and with respect to > > > the device cache flush. I'm simply saying that the transaction has to be > > > ordered with respect to full writeback completion as well, which is also > > > what you are saying further down. (We're in agreement... ;)). > > > > > > > As I've previously proposed, what I suspect we should be doing is > > > > not committing the allocation transaction until IO completion, which > > > > closes this stale data exposure hole completely and removes the need > > > > for allocating unwritten extents and then having to convert them. > > > > Direct IO could also do this, and so it could stop using unwritten > > > > extents, too.... > > > > > > > > > > That sounds like an interesting (and potentially more efficient) > > > alternative to using unwritten extents across the board only to convert > > > them as I/Os complete. I guess we'd need to make sure that the > > > allocation transaction holds across potentially many ioends considering > > > the max extents size of 8GB or so. 
I suppose the locking and handling of > > > fragmented allocation cases could get a bit interesting there as well, > > > but it's probably worth a prototype attempt at least to survey the > > > concept... (I wonder if Darrick is still paying attention or found > > > something more interesting to do..? :D) > > > > /me wandered off into fixing the io completion mess; if either of you > > want to work on a better prototype, I encourage you to go for it. :) > > > > Though I wonder how bad would be the effects of holding the allocation > > transaction open across a (potentially very slow) writeout. We'd still > > be holding onto the dirty inode's ilock as well as the AGF header and > > whatever bnobt/cntbt blocks are dirty. Wouldn't that stall any other > > thread that was walking the AGs looking for free space? > Indeed, that's the type of stuff I was wondering about as well. > Yeah, which is why I mentioned on IRC that it may be better to use > an intent/done transaction pair similar to an EFI/EFD. i.e. on IO > completion we only have to log the "write done" and it can contain > just the range for the specific IO, hence we can do large > allocations and only mark the areas we write as containing data. > That would allow zeroing of the regions that aren't covered by a > done intent within EOF during recovery.... > This wasn't exactly what came to mind for me with the thought of using intent items. I was thinking this meant to separate the block allocation from block mapping such that I/O can be issued to allocated blocks, but new blocks aren't mapped into the file until the I/O (or conversion to unwritten, etc.) completes, similar to how COW remapping works. The objective here would be to never map uninitialized blocks to accessible regions of a file. This of course would somehow or another need to make sure new extent regions that aren't the target of I/O still end up properly mapped, that allocated but unmapped blocks are handled on log recovery, etc. 
(reflink is essentially a reference implementation of all this). It sounds like what you're suggesting above is more along the lines of leaving writeback as it works today, but log a write start intent in the transaction that maps new blocks and pair it with write done completions that occur as blocks are initialized (again via I/O completion, extent conversion, etc.), such that in the event of a crash log recovery can do whatever is necessary to initialize the blocks correctly. The objective here is to track uninitialized regions of a file such that log recovery can do the initialization. This would more naturally handle the post-eof scenario as we already issue zeroing buffered writes to those blocks on file extension, but we'd need to be Ok with either issuing quite a bit of I/O from log recovery, converting newly mapped extents to unwritten (putting us in a similar performance situation to the original patch to always use unwritten extents for affected files), or doing something wonky like punching holes, etc. Am I following the high level idea correctly? If so, both seem like reasonable ideas, the latter probably simpler at a glance. I'd need to think about it more to have more concrete feedback on feasibility, etc. > > > I also assume we'd still need to use unwritten extents for cases where > > > the allocation completes before all of the extent is within EOF (i.e., > > > the extent zeroing idea you mentioned on irc..). > > > > Yes, I think it would still be necessary. My guess is that > > xfs_bmapi_convert_delalloc has to create unwritten extents for any > > extent starting before the on-disk EOF but can create real extents for > > anything past the on-disk isize (because we don't set the on-disk isize > > to match the in-core isize until writes complete). > > Yes, that was my assumption too, but I suspect that with an > intent/done pair, even that is not necessary. 
> Ok, so then this is an implementation side effect and probably not worth digging too far into without a higher level design. > > I guess the "zero > > pages between old isize and new isize" code can simply reset any real > > extents it finds to be unwritten too, as we've been discussing. > > *nod* > > > > The broader point is that by controlling writeback ordering, we can > > > explicitly demonstrate stale data exposure. If fixed properly, we should > > > never expect stale data exposure even in the event of out of order > > > writeback completion, regardless of the cause. > > > > Agree. Even if s_f_r is a giant trash fire, if we're exposing stale > > disk contents it's game over anyway and we ought to fix it. > > Well, that's exactly the problem with SFR, isn't it? The dirty data > is outside the range the user asks to sync and the implementation > doesn't go anywhere near the filesystem so we can't actually do the > right thing here. So we are left to either hack fixes around a > shitty, broken interface and everyone pays the penalty, or we ignore > it. We've simply ignored it for the past 10+ years because it really > can't be used safely for it's intended purposes (as everyone who > tries to use it finds out eventually)... > Just to make this abundantly clear, there is nothing unique, magic or special required of sync_file_range() to reproduce this stale data exposure. See the following variation of a command sequence from the fstest I posted the other day: ... # xfs_io -fc "pwrite 0 68k" -c fsync -c "pwrite 96k 4k" \ -c "fpunch 96k 4k" <file> ... # hexdump <file> 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd * 0011000 abab abab abab abab abab abab abab abab * 0018000 0000 0000 0000 0000 0000 0000 0000 0000 * 0019000 Same problem, and it can be reproduced just as easily with a reflink or even an overlapping direct I/O. I haven't confirmed, but I suspect with enough effort background writeback is susceptible to this problem as well. Brian > Cheers, > > Dave. 
> -- > Dave Chinner > david@fromorbit.com
On Wed, Mar 20, 2019 at 08:02:05AM -0400, Brian Foster wrote: > On Wed, Mar 20, 2019 at 07:25:37AM +1100, Dave Chinner wrote: > > > > The broader point is that by controlling writeback ordering, we can > > > > explicitly demonstrate stale data exposure. If fixed properly, we should > > > > never expect stale data exposure even in the event of out of order > > > > writeback completion, regardless of the cause. > > > > > > Agree. Even if s_f_r is a giant trash fire, if we're exposing stale > > > disk contents it's game over anyway and we ought to fix it. > > > > Well, that's exactly the problem with SFR, isn't it? The dirty data > > is outside the range the user asks to sync and the implementation > > doesn't go anywhere near the filesystem so we can't actually do the > > right thing here. So we are left to either hack fixes around a > > shitty, broken interface and everyone pays the penalty, or we ignore > > it. We've simply ignored it for the past 10+ years because it really > > can't be used safely for its intended purposes (as everyone who > > tries to use it finds out eventually)... > > > > Just to make this abundantly clear, there is nothing unique, magic or > special required of sync_file_range() to reproduce this stale data > exposure. See the following variation of a command sequence from the > fstest I posted the other day: > > ... > # xfs_io -fc "pwrite 0 68k" -c fsync -c "pwrite 96k 4k" \ > -c "fpunch 96k 4k" <file> > ... > # hexdump <file> > 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd > * > 0011000 abab abab abab abab abab abab abab abab > * > 0018000 0000 0000 0000 0000 0000 0000 0000 0000 > * > 0019000 > > Same problem, and it can be reproduced just as easily with a reflink or > even an overlapping direct I/O. How does overlapping direct IO cause stale data exposure given that it uses unwritten extent allocation? 
That, of course, is ignoring the fact that the result of overlapping concurrent direct IO is, by definition, undefined and an application bug if it occurs.... > I haven't confirmed, but I suspect with > enough effort background writeback is susceptible to this problem as > well. Well, the similarity here is with the NULL files problems and using the XFS_ITRUNCATE case to avoid NULL files after truncating them down. i.e. if that flag is set, we do extra flushing where appropriate to ensure the data at risk of being lost hits the disk before the loss mechanism can be triggered. i.e. this is the mirror equivalent of XFS_ITRUNCATE - it is the truncate-up flushing case in that we need to wait for the zeroing to hit the disk before we change the file size. Essentially, what this says to me is we need to track when we extended the file via a write beyond EOF via an inode flag. Then if we need to flush a data range from the file for integrity reasons (such as a fpunch) and this flag is set, we ignore the "start" parameter of the range and flush from the start of the file instead. i.e. we guarantee that any dirty zeroed pages between the start of the file and the end of the range we are operating on get written back. This will capture all these "didn't flush regions zeroed on file extension" problems in one go, without needing to change the way we do allocation or EOF zeroing. I suspect we could just use the XFS_ITRUNCATE flag for this - set it in the EOF zeroing code, check it in xfs_flush_unmap_range() and other places that do range based flushing.... Cheers, Dave.
On Thu, Mar 21, 2019 at 08:10:19AM +1100, Dave Chinner wrote: > On Wed, Mar 20, 2019 at 08:02:05AM -0400, Brian Foster wrote: > > On Wed, Mar 20, 2019 at 07:25:37AM +1100, Dave Chinner wrote: > > > > > The broader point is that by controlling writeback ordering, we can > > > > > explicitly demonstrate stale data exposure. If fixed properly, we should > > > > > never expect stale data exposure even in the event of out of order > > > > > writeback completion, regardless of the cause. > > > > > > > > Agree. Even if s_f_r is a giant trash fire, if we're exposing stale > > > > disk contents it's game over anyway and we ought to fix it. > > > > > > Well, that's exactly the problem with SFR, isn't it? The dirty data > > > is outside the range the user asks to sync and the implementation > > > doesn't go anywhere near the filesystem so we can't actually do the > > > right thing here. So we are left to either hack fixes around a > > > shitty, broken interface and everyone pays the penalty, or we ignore > > > it. We've simply ignored it for the past 10+ years because it really > > > can't be used safely for it's intended purposes (as everyone who > > > tries to use it finds out eventually)... > > > > > > > Just to make this abundantly clear, there is nothing unique, magic or > > special required of sync_file_range() to reproduce this stale data > > exposure. See the following variation of a command sequence from the > > fstest I posted the other day: > > > > ... > > # xfs_io -fc "pwrite 0 68k" -c fsync -c "pwrite 96k 4k" \ > > -c "fpunch 96k 4k" <file> > > ... > > # hexdump <file> > > 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd > > * > > 0011000 abab abab abab abab abab abab abab abab > > * > > 0018000 0000 0000 0000 0000 0000 0000 0000 0000 > > * > > 0019000 > > > > Same problem, and it can be reproduced just as easily with a reflink or > > even an overlapping direct I/O. 
> How does overlapping direct IO cause stale data exposure given that
> it uses unwritten extent allocation? That, of course, is ignoring
> the fact that the result of overlapping concurrent direct IO is, by
> definition, undefined and an application bug if it occurs....
>

It's overlapping a buffered write and so starts with a flush instead of
an unwritten allocation (so even a dio read is sufficient). Also, the
target of the dio isn't where stale data is exposed, so even if there
was an existing underlying unwritten block before the initial buffered
write it wouldn't change anything.

Yes, this is technically usage with undefined behavior, but that still
misses the forest for the trees. I'm not claiming this is good or
supported practice, just as I'm not claiming sync_file_range() should
be used for anything in practice. I'm just pointing out it's one of
several possible, obvious data exposure vectors. I suspect there are
others; these were just the easy variants to reproduce. If you prefer
to reason about reproducer patterns that don't use operations with
quirks or undefined behaviors, that's fine; just use the reflink or
hole punch examples posted above that produce the same behavior.

> > I haven't confirmed, but I suspect with
> > enough effort background writeback is susceptible to this problem as
> > well.
>
> Well, the similarity here is with the NULL files problems and using
> the XFS_ITRUNCATE case to avoid NULL files after truncating them down.
> i.e. if that flag is set, we do extra flushing where appropriate to
> ensure the data at risk of being lost hits the disk before the loss
> mechanism can be triggered.
>
> i.e. this is the mirror equivalent of XFS_ITRUNCATE - it is the
> truncate-up flushing case in that we need to wait for the zeroing to
> hit the disk before we change the file size.
>
> Essentially, what this says to me is we need to track when we
> extended the file via a write beyond EOF via an inode flag.
> Then if we need to
> flush a data range from the file for integrity reasons (such as a
> fpunch) and this flag is set, we ignore the "start" parameter of the
> range and flush from the start of the file instead. i.e. we
> guarantee that any dirty zeroed pages between the start of the file
> and the end of the range we are operating on get written back. This
> will capture all these "didn't flush regions zeroed on file
> extension" problems in one go, without needing to change the way we
> do allocation or EOF zeroing.
>

I don't think that would prevent the problem outside of the common
write extension case. If the file is sparse (i.e. truncated up ahead
of time) such that the file size is on disk, then any flush sequence
and block allocation within EOF is susceptible to stale data exposure
in the window between when the extent is allocated and writeback
completes over the entire extent. This variant is reproducible even
with a plain old fsync() call; it's just hard enough to reproduce to
require instrumentation (as opposed to the test sequences I've posted
in this thread so far).

It may be possible to selectively (i.e. as an optimization) use an
XFS_ITRUNCATE-like algorithm as you propose here to always force in
order writeback in the cases where writeback extends the on-disk isize
(and thus isize protects from stale exposure), but I think we'd still
need something tied to extent allocation one way or another in
addition (like Darrick's original unwritten extent patch or your
proposed intent/done idea in the previous reply) to address the
fundamental problem. The question is what is the ideal solution..

Brian

> I suspect we could just use the XFS_ITRUNCATE flag for this - set it
> in the EOF zeroing code, check it in xfs_flush_unmap_range() and
> other places that do range based flushing....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
On Thu, Mar 21, 2019 at 08:27:40AM -0400, Brian Foster wrote: > On Thu, Mar 21, 2019 at 08:10:19AM +1100, Dave Chinner wrote: > > On Wed, Mar 20, 2019 at 08:02:05AM -0400, Brian Foster wrote: > > > On Wed, Mar 20, 2019 at 07:25:37AM +1100, Dave Chinner wrote: > > > > > > The broader point is that by controlling writeback ordering, we can > > > > > > explicitly demonstrate stale data exposure. If fixed properly, we should > > > > > > never expect stale data exposure even in the event of out of order > > > > > > writeback completion, regardless of the cause. > > > > > > > > > > Agree. Even if s_f_r is a giant trash fire, if we're exposing stale > > > > > disk contents it's game over anyway and we ought to fix it. > > > > > > > > Well, that's exactly the problem with SFR, isn't it? The dirty data > > > > is outside the range the user asks to sync and the implementation > > > > doesn't go anywhere near the filesystem so we can't actually do the > > > > right thing here. So we are left to either hack fixes around a > > > > shitty, broken interface and everyone pays the penalty, or we ignore > > > > it. We've simply ignored it for the past 10+ years because it really > > > > can't be used safely for it's intended purposes (as everyone who > > > > tries to use it finds out eventually)... > > > > > > > > > > Just to make this abundantly clear, there is nothing unique, magic or > > > special required of sync_file_range() to reproduce this stale data > > > exposure. See the following variation of a command sequence from the > > > fstest I posted the other day: > > > > > > ... > > > # xfs_io -fc "pwrite 0 68k" -c fsync -c "pwrite 96k 4k" \ > > > -c "fpunch 96k 4k" <file> > > > ... 
> > > # hexdump <file> > > > 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd > > > * > > > 0011000 abab abab abab abab abab abab abab abab > > > * > > > 0018000 0000 0000 0000 0000 0000 0000 0000 0000 > > > * > > > 0019000 > > > > > > Same problem, and it can be reproduced just as easily with a reflink or > > > even an overlapping direct I/O. > > > > How does overlapping direct IO cause stale data exposure given that > > it uses unwritten extent allocation? That, of course, is ignoring > > the fact that the result of overlapping concurent direct IO is, by > > definition, undefined and an application bug if it occurs.... > > > > It's overlapping a buffered write and so starts with a flush instead of Ok, but that's not the normal usage of the phrase "overlapping direct IO". Overlapping direct IO means two concurrent direct IOs to the same LBA range in the device underlying the filesystem. i.e. they overlap in both the temporal and physical (device LBA) domains. Subtle difference, yes, but a very important one. > an unwritten allocation (so even a dio read is sufficient). Also, the > target of the dio isn't where stale data is exposed, so even if there > was an existing underlying unwritten block before the initial buffered > write it wouldn't change anything. > > Yes, this is technically usage with undefined behavior, but that still > misses the forest from the trees. No, I'm not missing the forest for the trees. > I'm not claiming this is good or > supported practice just as I'm not not claiming sync_file_range() should > be used for anything in practice. I'm just pointing out it's one of > several possible, obvious data exposure vectors. I suspect there are > others; these were just the easy variants to reproduce. I'm aware that we have these problems - we've known about them for 20+ years and the many ways they can be exposed. 
Indeed, We've been talking about using unwritten extents for delalloc for as long as I can remember, but never done it because the actual occurrence of these problems in production systems is /extremely/ rare and the cost of using unwritten extents everywhere will affect /everyone/. i.e. It's always been easy to create test conditions to expose stale data, but reported cases of stale data exposure after a production system crash are pretty much non-existent. This just isn't a common problem that users are experiencing, but solving it with "unwritten extents for everyone" comes at a substantial cost for /everyone/, /all the time/. That's why a) it hasn't been done yet; and b) any discussion about "use unwritten extents everywhere" always ends up in trying to find a solution that isn't just applying the "big hammer" and ignoring the problems that causes. > If you prefer to > reason about reproducer patterns that don't use operations with quirks > or undefined behaviors, that's fine, just use the reflink or hole punch > example posted above that produce the same behavior. Well, yes. But it does go both ways - if you don't want people to say "that's undefined" or "that makes no sense", then don't use those quirky examples if you can use better, unambiguous examples. For example, that's exactly how I phrased this extension flushing mechanism trigger: > > Eseentially, what this says to me is we need to track when we > > extended the file via a write beyond EOF via an inode flag. Then if we need to > > flush a data range from the file for integrity reasons (such as a > > fpnuch) and this flag is set, we ignore the "start" parameter of the See? Entirely generic trigger description, uses hole punch as an example, but also covers all the quirky cases such as direct IO range flushing without having to mention them explicitly. > > range and flush from the start of the file instead. i.e. 
we > > guarantee that any dirty zeroed pages between the start of the file > > and the end of the range we are operating on gets written back. This > > will capture all these "didn't flush regions zeroed on file > > extension" problems in one go, without needing to change the way we > > do allocation or EOF zeroing. > > > > I don't think that would prevent the problem outside of the common write > extension case. I feel like we're going around in circles now, Brian. :/ I didn't propose this as the "only solution we need". I've proposed it in the context that we use "only use unwritten extents for allocations within EOF" to solve the common "write within EOF" case. This, however, doesn't solve this common write extension case where we don't want to use unwritten extents because of the performance cost. And now you're saying that "well, this extending write thing doesn't solve the inside EOF problem"...... > It may be possible to selectively (i.e. as an optimization) use an > XFS_ITRUNCATED like algorithm as you propose here to always force in > order writeback in the cases where writeback extends the on-disk isize > (and thus isize protects from stale exposure), but I think we'd still > need something tied to extent allocation one way or another in addition > (like Darrick's original unwritten extent patch or your proposed > intent/done idea in the previous reply) to address the fundamental > problem. The question is what is the ideal solution.. The solution will be a combination of things that work together, not one big hammer. The more we can avoid exposure vectors by operational ordering rather than transactional state changes, the better the solution will perform. Cheers, Dave.
On Fri, Mar 22, 2019 at 12:52:40PM +1100, Dave Chinner wrote: > On Thu, Mar 21, 2019 at 08:27:40AM -0400, Brian Foster wrote: > > On Thu, Mar 21, 2019 at 08:10:19AM +1100, Dave Chinner wrote: > > > On Wed, Mar 20, 2019 at 08:02:05AM -0400, Brian Foster wrote: > > > > On Wed, Mar 20, 2019 at 07:25:37AM +1100, Dave Chinner wrote: > > > > > > > The broader point is that by controlling writeback ordering, we can > > > > > > > explicitly demonstrate stale data exposure. If fixed properly, we should > > > > > > > never expect stale data exposure even in the event of out of order > > > > > > > writeback completion, regardless of the cause. > > > > > > > > > > > > Agree. Even if s_f_r is a giant trash fire, if we're exposing stale > > > > > > disk contents it's game over anyway and we ought to fix it. > > > > > > > > > > Well, that's exactly the problem with SFR, isn't it? The dirty data > > > > > is outside the range the user asks to sync and the implementation > > > > > doesn't go anywhere near the filesystem so we can't actually do the > > > > > right thing here. So we are left to either hack fixes around a > > > > > shitty, broken interface and everyone pays the penalty, or we ignore > > > > > it. We've simply ignored it for the past 10+ years because it really > > > > > can't be used safely for it's intended purposes (as everyone who > > > > > tries to use it finds out eventually)... > > > > > > > > > > > > > Just to make this abundantly clear, there is nothing unique, magic or > > > > special required of sync_file_range() to reproduce this stale data > > > > exposure. See the following variation of a command sequence from the > > > > fstest I posted the other day: > > > > > > > > ... > > > > # xfs_io -fc "pwrite 0 68k" -c fsync -c "pwrite 96k 4k" \ > > > > -c "fpunch 96k 4k" <file> > > > > ... 
> > > > # hexdump <file> > > > > 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd > > > > * > > > > 0011000 abab abab abab abab abab abab abab abab > > > > * > > > > 0018000 0000 0000 0000 0000 0000 0000 0000 0000 > > > > * > > > > 0019000 > > > > > > > > Same problem, and it can be reproduced just as easily with a reflink or > > > > even an overlapping direct I/O. > > > > > > How does overlapping direct IO cause stale data exposure given that > > > it uses unwritten extent allocation? That, of course, is ignoring > > > the fact that the result of overlapping concurent direct IO is, by > > > definition, undefined and an application bug if it occurs.... > > > > > > > It's overlapping a buffered write and so starts with a flush instead of > > Ok, but that's not the normal usage of the phrase "overlapping > direct IO". Overlapping direct IO means two concurrent direct IOs > to the same LBA range in the device underlying the filesystem. i.e. > they overlap in both the temporal and physical (device LBA) domains. > > Subtle difference, yes, but a very important one. > Yeah, I just referenced it in passing and was unclear. The point is just that there are multiple vectors to the problem, some of which are sane sequences and some apparently not, and I'd rather not get caught up in the minutiae of why some of those sequences are not sane. I'm trying to see if we can find a solution to the fundamental data exposure problem. > > an unwritten allocation (so even a dio read is sufficient). Also, the > > target of the dio isn't where stale data is exposed, so even if there > > was an existing underlying unwritten block before the initial buffered > > write it wouldn't change anything. > > > > Yes, this is technically usage with undefined behavior, but that still > > misses the forest from the trees. > > No, I'm not missing the forest for the trees. 
> > > I'm not claiming this is good or > > supported practice just as I'm not not claiming sync_file_range() should > > be used for anything in practice. I'm just pointing out it's one of > > several possible, obvious data exposure vectors. I suspect there are > > others; these were just the easy variants to reproduce. > > I'm aware that we have these problems - we've known about them for > 20+ years and the many ways they can be exposed. Indeed, We've been > talking about using unwritten extents for delalloc for as long as I > can remember, but never done it because the actual occurrence of > these problems in production systems is /extremely/ rare and the > cost of using unwritten extents everywhere will affect /everyone/. > > i.e. It's always been easy to create test conditions to expose stale > data, but reported cases of stale data exposure after a production > system crash are pretty much non-existent. This just isn't a common > problem that users are experiencing, but solving it with "unwritten > extents for everyone" comes at a substantial cost for /everyone/, > /all the time/. > I know you're aware. I'm pretty sure we've discussed this directly in the past (with regard to how buffer heads were an impediment to a solution) and that you made me aware of the issue in the first place. ;) > That's why a) it hasn't been done yet; and b) any discussion about > "use unwritten extents everywhere" always ends up in trying to find > a solution that isn't just applying the "big hammer" and ignoring > the problems that causes. > Right.. > > If you prefer to > > reason about reproducer patterns that don't use operations with quirks > > or undefined behaviors, that's fine, just use the reflink or hole punch > > example posted above that produce the same behavior. > > Well, yes. But it does go both ways - if you don't want people to > say "that's undefined" or "that makes no sense", then don't use > those quirky examples if you can use better, unambiguous examples. 
> The sync_file_range() reproducer was a poor example in that regard. I'm not concerned over that being pointed out, but I've explained several times subsequently that we can replace the sync_file_range() calls with a sane command (hole punch) to reproduce the same general behavior and resulting data exposure problem and thus the quirks of the former are not a primary factor. Despite that, this discussion has somehow turned into an appraisal of sync_file_range(). Moving on.. I posted some comments/questions around your log item suggestion for extent zeroing during recovery in my previous reply and that part apparently got completely snipped out and ignored. :/ Any feedback on that? > For example, that's exactly how I phrased this extension flushing > mechanism trigger: > > > > Eseentially, what this says to me is we need to track when we > > > extended the file via a write beyond EOF via an inode flag. Then if we need to > > > flush a data range from the file for integrity reasons (such as a > > > fpnuch) and this flag is set, we ignore the "start" parameter of the > > See? Entirely generic trigger description, uses hole punch as an > example, but also covers all the quirky cases such as direct IO > range flushing without having to mention them explicitly. > > > > range and flush from the start of the file instead. i.e. we > > > guarantee that any dirty zeroed pages between the start of the file > > > and the end of the range we are operating on gets written back. This > > > will capture all these "didn't flush regions zeroed on file > > > extension" problems in one go, without needing to change the way we > > > do allocation or EOF zeroing. > > > > > > > I don't think that would prevent the problem outside of the common write > > extension case. > > I feel like we're going around in circles now, Brian. :/ > Yes, let's try to reset here... > I didn't propose this as the "only solution we need". 
> I've proposed
> it in the context that we use "only use unwritten extents for
> allocations within EOF" to solve the common "write within EOF" case.
> This, however, doesn't solve this common write extension case where
> we don't want to use unwritten extents because of the performance
> cost.
>
> And now you're saying that "well, this extending write thing
> doesn't solve the inside EOF problem"......
>

Ok, I lost that context. As I said above, I can see how this could be
useful in combination with another approach to handle inside-EOF
allocations. There are still some loose ends (in my mind) to consider
with this technique, however...

This requires that we always guarantee in-order writeback, right? I
can see how we can do that in the cases we control based on the
flag-based approach you suggested. But do we really have enough
control within the fs to always prevent out of order writeback? What
about non-integrity background writeback? AFAICT it's technically
possible to write back out of expected order (the cyclic range logic
looks particularly suspicious to me if you consider a mapping can be
truncated and repopulated many times over without resetting the cycle
index). What about memory reclaim? Is it not possible to hit the
->writepage() path via reclaim for an arbitrary page in a mapping?
It's not clear to me if page migration is a factor for us as well, but
it also looks like it has some specific page writeback hooks.

I suppose there are technically ways around these things. We could
skip writeback of certain pages in some cases (but that may have
unacceptable consequences) or make the decision to use one exposure
prevention approach or another dynamically at writepage time based on
the state of the inode, mapping and associated page.
I'd think background writeback is the common case as opposed to things like hole punch and whatnot, however, so a dynamic solution leaves the effectiveness of this approach somewhat open ended unless we characterize how all possible writeback scenarios behave. Given that, it seems more useful to me to work through the inside eof extent allocation scenario first and work back from there. Then we can at least try to establish potential performance impact and determine if we should try to work around it in certain cases, etc. On that note, please see the part that was snipped out of my earlier reply[1], in particular the part that ended with "Am I following the high level idea correctly?" :P Thanks. Brian [1] https://marc.info/?l=linux-xfs&m=155308332901975&w=2 > > It may be possible to selectively (i.e. as an optimization) use an > > XFS_ITRUNCATED like algorithm as you propose here to always force in > > order writeback in the cases where writeback extends the on-disk isize > > (and thus isize protects from stale exposure), but I think we'd still > > need something tied to extent allocation one way or another in addition > > (like Darrick's original unwritten extent patch or your proposed > > intent/done idea in the previous reply) to address the fundamental > > problem. The question is what is the ideal solution.. > > The solution will be a combination of things that work together, not > one big hammer. The more we can avoid exposure vectors by > operational ordering rather than transactional state changes, the > better the solution will perform. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Fri, Mar 22, 2019 at 08:46:15AM -0400, Brian Foster wrote: > On Fri, Mar 22, 2019 at 12:52:40PM +1100, Dave Chinner wrote: > > On Thu, Mar 21, 2019 at 08:27:40AM -0400, Brian Foster wrote: > > > On Thu, Mar 21, 2019 at 08:10:19AM +1100, Dave Chinner wrote: > > > > On Wed, Mar 20, 2019 at 08:02:05AM -0400, Brian Foster wrote: > > > > > On Wed, Mar 20, 2019 at 07:25:37AM +1100, Dave Chinner wrote: > > > > > > > > The broader point is that by controlling writeback ordering, we can > > > > > > > > explicitly demonstrate stale data exposure. If fixed properly, we should > > > > > > > > never expect stale data exposure even in the event of out of order > > > > > > > > writeback completion, regardless of the cause. > > > > > > > > > > > > > > Agree. Even if s_f_r is a giant trash fire, if we're exposing stale > > > > > > > disk contents it's game over anyway and we ought to fix it. > > > > > > > > > > > > Well, that's exactly the problem with SFR, isn't it? The dirty data > > > > > > is outside the range the user asks to sync and the implementation > > > > > > doesn't go anywhere near the filesystem so we can't actually do the > > > > > > right thing here. So we are left to either hack fixes around a > > > > > > shitty, broken interface and everyone pays the penalty, or we ignore > > > > > > it. We've simply ignored it for the past 10+ years because it really > > > > > > can't be used safely for it's intended purposes (as everyone who > > > > > > tries to use it finds out eventually)... > > > > > > > > > > > > > > > > Just to make this abundantly clear, there is nothing unique, magic or > > > > > special required of sync_file_range() to reproduce this stale data > > > > > exposure. See the following variation of a command sequence from the > > > > > fstest I posted the other day: > > > > > > > > > > ... > > > > > # xfs_io -fc "pwrite 0 68k" -c fsync -c "pwrite 96k 4k" \ > > > > > -c "fpunch 96k 4k" <file> > > > > > ... 
> > > > > # hexdump <file> > > > > > 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd > > > > > * > > > > > 0011000 abab abab abab abab abab abab abab abab > > > > > * > > > > > 0018000 0000 0000 0000 0000 0000 0000 0000 0000 > > > > > * > > > > > 0019000 > > > > > > > > > > Same problem, and it can be reproduced just as easily with a reflink or > > > > > even an overlapping direct I/O. > > > > > > > > How does overlapping direct IO cause stale data exposure given that > > > > it uses unwritten extent allocation? That, of course, is ignoring > > > > the fact that the result of overlapping concurent direct IO is, by > > > > definition, undefined and an application bug if it occurs.... > > > > > > > > > > It's overlapping a buffered write and so starts with a flush instead of > > > > Ok, but that's not the normal usage of the phrase "overlapping > > direct IO". Overlapping direct IO means two concurrent direct IOs > > to the same LBA range in the device underlying the filesystem. i.e. > > they overlap in both the temporal and physical (device LBA) domains. > > > > Subtle difference, yes, but a very important one. > > > > Yeah, I just referenced it in passing and was unclear. The point is just > that there are multiple vectors to the problem, some of which are sane > sequences and some apparently not, and I'd rather not get caught up in > the minutiae of why some of those sequences are not sane. I'm trying to > see if we can find a solution to the fundamental data exposure problem. Is there a way to increase the amount of writeback? Say for example that we have a delalloc reservation for blocks 0-9, and writeback asks us to write back block 7. If we succeed at converting the entire 10-block reservation into a real extent, why can't we come back and say "Hey, we also need to flush the pages for blocks 0-6,8-9"? Are we going to need something like that for blocksize > pagesize filesystems too? > > > > an unwritten allocation (so even a dio read is sufficient). 
Also, the > > > target of the dio isn't where stale data is exposed, so even if there > > > was an existing underlying unwritten block before the initial buffered > > > write it wouldn't change anything. > > > > > > Yes, this is technically usage with undefined behavior, but that still > > > misses the forest from the trees. > > > > No, I'm not missing the forest for the trees. > > > > > I'm not claiming this is good or > > > supported practice just as I'm not not claiming sync_file_range() should > > > be used for anything in practice. I'm just pointing out it's one of > > > several possible, obvious data exposure vectors. I suspect there are > > > others; these were just the easy variants to reproduce. > > > > I'm aware that we have these problems - we've known about them for > > 20+ years and the many ways they can be exposed. Indeed, We've been > > talking about using unwritten extents for delalloc for as long as I > > can remember, but never done it because the actual occurrence of > > these problems in production systems is /extremely/ rare and the > > cost of using unwritten extents everywhere will affect /everyone/. > > > > i.e. It's always been easy to create test conditions to expose stale > > data, but reported cases of stale data exposure after a production > > system crash are pretty much non-existent. This just isn't a common > > problem that users are experiencing, but solving it with "unwritten > > extents for everyone" comes at a substantial cost for /everyone/, > > /all the time/. > > > > I know you're aware. I'm pretty sure we've discussed this directly in > the past (with regard to how buffer heads were an impediment to a > solution) and that you made me aware of the issue in the first place. ;) > > > That's why a) it hasn't been done yet; and b) any discussion about > > "use unwritten extents everywhere" always ends up in trying to find > > a solution that isn't just applying the "big hammer" and ignoring > > the problems that causes. 
> > > > Right.. > > > > If you prefer to > > > reason about reproducer patterns that don't use operations with quirks > > > or undefined behaviors, that's fine, just use the reflink or hole punch > > > example posted above that produce the same behavior. > > > > Well, yes. But it does go both ways - if you don't want people to > > say "that's undefined" or "that makes no sense", then don't use > > those quirky examples if you can use better, unambiguous examples. > > > > The sync_file_range() reproducer was a poor example in that regard. I'm (I don't want to stir this up again but just so we're all clear I had figured the s_f_r call was really just a way to simulate random drive-by scattershot background writeback for testing purposes.) > not concerned over that being pointed out, but I've explained several > times subsequently that we can replace the sync_file_range() calls with > a sane command (hole punch) to reproduce the same general behavior and > resulting data exposure problem and thus the quirks of the former are > not a primary factor. Despite that, this discussion has somehow turned > into an appraisal of sync_file_range(). Moving on.. > > I posted some comments/questions around your log item suggestion for > extent zeroing during recovery in my previous reply and that part > apparently got completely snipped out and ignored. :/ Any feedback on > that? A(nother) big downside of adding a new "write intent": now we need to add a log incompat flag which will occasionally generate "I wrote something, crashed, and this old knoppix cd won't recover the fs for some reason" reports. > > For example, that's exactly how I phrased this extension flushing > > mechanism trigger: > > > > > > Eseentially, what this says to me is we need to track when we > > > > extended the file via a write beyond EOF via an inode flag. 
Then if we need to > > > > flush a data range from the file for integrity reasons (such as a > > > > fpnuch) and this flag is set, we ignore the "start" parameter of the > > > > See? Entirely generic trigger description, uses hole punch as an > > example, but also covers all the quirky cases such as direct IO > > range flushing without having to mention them explicitly. > > > > > > range and flush from the start of the file instead. i.e. we > > > > guarantee that any dirty zeroed pages between the start of the file > > > > and the end of the range we are operating on gets written back. This > > > > will capture all these "didn't flush regions zeroed on file > > > > extension" problems in one go, without needing to change the way we > > > > do allocation or EOF zeroing. > > > > > > > > > > I don't think that would prevent the problem outside of the common write > > > extension case. > > > > I feel like we're going around in circles now, Brian. :/ > > > > Yes, let's try to reset here... > > > I didn't propose this as the "only solution we need". I've proposed > > it in the context that we use "only use unwritten extents for > > allocations within EOF" to solve the common "write within EOF" case. > > This, however, doesn't solve this common write extension case where > > we don't want to use unwritten extents because of the performance > > cost. > > > > And now you're saying that "well, this extending write thing > > doesn't solve the inside EOF problem"...... > > > > Ok, I lost that context. As I said above, I can see how this could be > useful in combination with another approach to handle inside eof > allocations. There are still some loose ends (in my mind) to consider > with this technique, however... > > This requires that we always guarantee in-order writeback, right? I can > see how we can do that in the cases we control based on the flag based > approach you suggested. 
But do we really have enough control within the > fs to always prevent out of order writeback? What about non-integrity > background writeback? AFAICT it's technically possible to writeback out > of expected order (the cyclic range logic looks particularly suspicious > to me if you consider a mapping can be truncated and repopulated many > times over without resetting the cycle index). What about memory > reclaim? Is it not possible to hit the ->writepage() path via reclaim > for an arbitrary page in a mapping? It's not clear to me if page > migration is a factor for us as well, but it also looks like it has some > specific page writeback hooks. > > I suppose there are technically ways around these things. We could skip > writeback of certain pages in some cases (but that may have unacceptable > consequences) or make the decision to use one exposure prevention > approach or another dynamically at writepage time based on the state of > the inode, mapping and associated page. I'd think background writeback > is the common case as opposed to things like hole punch and whatnot, > however, so a dynamic solution leaves the effectiveness of this approach > somewhat open ended unless we characterize how all possible writeback > scenarios behave. > > Given that, it seems more useful to me to work through the inside eof > extent allocation scenario first and work back from there. Then we can > at least try to establish potential performance impact and determine if > we should try to work around it in certain cases, etc. <nod> I'll get back to working on two parts of fixing this: a) adapting the "convert delalloc to unwritten" patch that started this thread so that it only does that for conversions that start within eof, b) fixing your punch case by extending the range that xfs_flush_unmap_range asks to flush if the caller is trying to flush a range past the ondisk eof. 
Though I think (b) is unnecessary if we had the ability to flush all the pages fronting a delalloc reservation that we just converted to a real extent, as I muttered above. > On that note, > please see the part that was snipped out of my earlier reply[1], in > particular the part that ended with "Am I following the high level idea > correctly?" :P Thanks. I thought your take at least matched my interpretation of Dave's comments. Thoughts? --D > > Brian > > [1] https://marc.info/?l=linux-xfs&m=155308332901975&w=2 > > > > It may be possible to selectively (i.e. as an optimization) use an > > > XFS_ITRUNCATED like algorithm as you propose here to always force in > > > order writeback in the cases where writeback extends the on-disk isize > > > (and thus isize protects from stale exposure), but I think we'd still > > > need something tied to extent allocation one way or another in addition > > > (like Darrick's original unwritten extent patch or your proposed > > > intent/done idea in the previous reply) to address the fundamental > > > problem. The question is what is the ideal solution.. > > > > The solution will be a combination of things that work together, not > > one big hammer. The more we can avoid exposure vectors by > > operational ordering rather than transactional state changes, the > > better the solution will perform. > > > > Cheers, > > > > Dave. > > -- > > Dave Chinner > > david@fromorbit.com
On Wed, Apr 03, 2019 at 03:38:09PM -0700, Darrick J. Wong wrote: > On Fri, Mar 22, 2019 at 08:46:15AM -0400, Brian Foster wrote: > > On Fri, Mar 22, 2019 at 12:52:40PM +1100, Dave Chinner wrote: > > > On Thu, Mar 21, 2019 at 08:27:40AM -0400, Brian Foster wrote: > > > > On Thu, Mar 21, 2019 at 08:10:19AM +1100, Dave Chinner wrote: > > > > > On Wed, Mar 20, 2019 at 08:02:05AM -0400, Brian Foster wrote: > > > > > > On Wed, Mar 20, 2019 at 07:25:37AM +1100, Dave Chinner wrote: > > > > > > > > > The broader point is that by controlling writeback ordering, we can > > > > > > > > > explicitly demonstrate stale data exposure. If fixed properly, we should > > > > > > > > > never expect stale data exposure even in the event of out of order > > > > > > > > > writeback completion, regardless of the cause. > > > > > > > > > > > > > > > > Agree. Even if s_f_r is a giant trash fire, if we're exposing stale > > > > > > > > disk contents it's game over anyway and we ought to fix it. > > > > > > > > > > > > > > Well, that's exactly the problem with SFR, isn't it? The dirty data > > > > > > > is outside the range the user asks to sync and the implementation > > > > > > > doesn't go anywhere near the filesystem so we can't actually do the > > > > > > > right thing here. So we are left to either hack fixes around a > > > > > > > shitty, broken interface and everyone pays the penalty, or we ignore > > > > > > > it. We've simply ignored it for the past 10+ years because it really > > > > > > > can't be used safely for its intended purposes (as everyone who > > > > > > > tries to use it finds out eventually)... > > > > > > > > > > > > > > > > > > Just to make this abundantly clear, there is nothing unique, magic or > > > > > > special required of sync_file_range() to reproduce this stale data > > > > > > exposure. See the following variation of a command sequence from the > > > > > > fstest I posted the other day: > > > > > > > > > > > > ... 
> > > > > > # xfs_io -fc "pwrite 0 68k" -c fsync -c "pwrite 96k 4k" \ > > > > > > -c "fpunch 96k 4k" <file> > > > > > > ... > > > > > > # hexdump <file> > > > > > > 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd > > > > > > * > > > > > > 0011000 abab abab abab abab abab abab abab abab > > > > > > * > > > > > > 0018000 0000 0000 0000 0000 0000 0000 0000 0000 > > > > > > * > > > > > > 0019000 > > > > > > > > > > > > Same problem, and it can be reproduced just as easily with a reflink or > > > > > > even an overlapping direct I/O. > > > > > > > > > > How does overlapping direct IO cause stale data exposure given that > > > > > it uses unwritten extent allocation? That, of course, is ignoring > > > > > the fact that the result of overlapping concurrent direct IO is, by > > > > > definition, undefined and an application bug if it occurs.... > > > > > > > > > > > > > It's overlapping a buffered write and so starts with a flush instead of > > > > > > Ok, but that's not the normal usage of the phrase "overlapping > > > direct IO". Overlapping direct IO means two concurrent direct IOs > > > to the same LBA range in the device underlying the filesystem. i.e. > > > they overlap in both the temporal and physical (device LBA) domains. > > > > > > Subtle difference, yes, but a very important one. > > > > > > > Yeah, I just referenced it in passing and was unclear. The point is just > > that there are multiple vectors to the problem, some of which are sane > > sequences and some apparently not, and I'd rather not get caught up in > > the minutiae of why some of those sequences are not sane. I'm trying to > > see if we can find a solution to the fundamental data exposure problem. > > Is there a way to increase the amount of writeback? Say for example > that we have a delalloc reservation for blocks 0-9, and writeback asks > us to write back block 7. 
If we succeed at converting the entire > 10-block reservation into a real extent, why can't we come back and say > "Hey, we also need to flush the pages for blocks 0-6,8-9"? Are we going > to need something like that for blocksize > pagesize filesystems too? > That might be possible in various places. In fact, I think we used to do something like this in XFS (perhaps not for this purpose). I just jumped back to v2.6.32 and see that the xfs_cluster_write() mechanism still exists at that point down in the ->writepage() path. I'm not sure if doing that alone is worth the effort if you consider it's more of a mitigation than a solution to the underlying problem. Once we convert the associated delalloc extent, we're open to stale data exposure (assuming we're not protected by on-disk EOF and whatnot) until all of the remaining pages over the extent are written back, whether that occurs in the current writeback cycle or a subsequent one. I suppose it might make sense in service to another solution, such as writing back everything preceding a page that is going to update on-disk EOF (which is something I think Dave suggested earlier as an incremental improvement, I've kind of lost track). We'd probably just have to make sure that the on-disk eof can never get ahead of the lower index writeback pages by I/O reordering or whatever.. > > > > > > an unwritten allocation (so even a dio read is sufficient). Also, the > > > > target of the dio isn't where stale data is exposed, so even if there > > > > was an existing underlying unwritten block before the initial buffered > > > > write it wouldn't change anything. > > > > > > > > Yes, this is technically usage with undefined behavior, but that still > > > > misses the forest for the trees. > > > > > > No, I'm not missing the forest for the trees. > > > > > > > I'm not claiming this is good or > > > > supported practice just as I'm not claiming sync_file_range() should > > > > be used for anything in practice. 
I'm just pointing out it's one of > > > > several possible, obvious data exposure vectors. I suspect there are > > > > others; these were just the easy variants to reproduce. > > > > > > I'm aware that we have these problems - we've known about them for > > > 20+ years and the many ways they can be exposed. Indeed, We've been > > > talking about using unwritten extents for delalloc for as long as I > > > can remember, but never done it because the actual occurrence of > > > these problems in production systems is /extremely/ rare and the > > > cost of using unwritten extents everywhere will affect /everyone/. > > > > > > i.e. It's always been easy to create test conditions to expose stale > > > data, but reported cases of stale data exposure after a production > > > system crash are pretty much non-existent. This just isn't a common > > > problem that users are experiencing, but solving it with "unwritten > > > extents for everyone" comes at a substantial cost for /everyone/, > > > /all the time/. > > > > > > > I know you're aware. I'm pretty sure we've discussed this directly in > > the past (with regard to how buffer heads were an impediment to a > > solution) and that you made me aware of the issue in the first place. ;) > > > > > That's why a) it hasn't been done yet; and b) any discussion about > > > "use unwritten extents everywhere" always ends up in trying to find > > > a solution that isn't just applying the "big hammer" and ignoring > > > the problems that causes. > > > > > > > Right.. > > > > > > If you prefer to > > > > reason about reproducer patterns that don't use operations with quirks > > > > or undefined behaviors, that's fine, just use the reflink or hole punch > > > > example posted above that produce the same behavior. > > > > > > Well, yes. But it does go both ways - if you don't want people to > > > say "that's undefined" or "that makes no sense", then don't use > > > those quirky examples if you can use better, unambiguous examples. 
> > > > > > The sync_file_range() reproducer was a poor example in that regard. I'm > > (I don't want to stir this up again but just so we're all clear I had > > figured the s_f_r call was really just a way to simulate random drive-by > > scattershot background writeback for testing purposes.) > > Ack. > > not concerned over that being pointed out, but I've explained several > > times subsequently that we can replace the sync_file_range() calls with > > a sane command (hole punch) to reproduce the same general behavior and > > resulting data exposure problem and thus the quirks of the former are > > not a primary factor. Despite that, this discussion has somehow turned > > into an appraisal of sync_file_range(). Moving on.. > > > > I posted some comments/questions around your log item suggestion for > > extent zeroing during recovery in my previous reply and that part > > apparently got completely snipped out and ignored. :/ Any feedback on > > that? > > A(nother) big downside of adding a new "write intent": now we need to > > add a log incompat flag which will occasionally generate "I wrote > > something, crashed, and this old knoppix cd won't recover the fs for > > some reason" reports. > > Yeah, though I think it's not really recommended to process an fs with a > dirty log across kernels like that. That may be, but it seems to work most often enough that our service techs think they can do it 100%. And I have the bug reports from when that doesn't work to prove it. :( > We already have some recovery > restrictions that prevent this, such as on-disk order for example. Of > course that's different from just going back to an older kernel version > on the same hardware and getting stuck that way because of a feature bit > or something. <nod> > > > > For example, that's exactly how I phrased this extension flushing > > > > mechanism trigger: > > > > > > > > > > Essentially, what this says to me is we need to track when we > > > > > > extended the file via a write beyond EOF via an inode flag. Then if we need to > > > > > > flush a data range from the file for integrity reasons (such as a > > > > > > fpnuch) and this flag is set, we ignore the "start" parameter of the > > > > > > > > See? 
Then if we need to > > > > > flush a data range from the file for integrity reasons (such as a > > > > > fpunch) and this flag is set, we ignore the "start" parameter of the > > > See? Entirely generic trigger description, uses hole punch as an > > > example, but also covers all the quirky cases such as direct IO > > > range flushing without having to mention them explicitly. > > > > > range and flush from the start of the file instead. i.e. we > > > > > guarantee that any dirty zeroed pages between the start of the file > > > > > and the end of the range we are operating on gets written back. This > > > > > will capture all these "didn't flush regions zeroed on file > > > > > extension" problems in one go, without needing to change the way we > > > > > do allocation or EOF zeroing. > > > > > > > > > > > > > I don't think that would prevent the problem outside of the common write > > > > extension case. > > > > > > I feel like we're going around in circles now, Brian. :/ > > > > > > > Yes, let's try to reset here... > > > > > I didn't propose this as the "only solution we need". I've proposed > > > it in the context that we use "only use unwritten extents for > > > allocations within EOF" to solve the common "write within EOF" case. > > > This, however, doesn't solve this common write extension case where > > > we don't want to use unwritten extents because of the performance > > > cost. > > > > > > And now you're saying that "well, this extending write thing > > > doesn't solve the inside EOF problem"...... > > > > > > > > Ok, I lost that context. As I said above, I can see how this could be > > > useful in combination with another approach to handle inside eof > > > allocations. There are still some loose ends (in my mind) to consider > > > with this technique, however... > > > > > > This requires that we always guarantee in-order writeback, right? I can > > > see how we can do that in the cases we control based on the flag based > > > approach you suggested. But do we really have enough control within the > > > fs to always prevent out of order writeback? 
But do we really have enough control within the > > fs to always prevent out of order writeback? What about non-integrity > > background writeback? AFAICT it's technically possible to writeback out > > of expected order (the cyclic range logic looks particularly suspicious > > to me if you consider a mapping can be truncated and repopulated many > > times over without resetting the cycle index). What about memory > > reclaim? Is it not possible to hit the ->writepage() path via reclaim > > for an arbitrary page in a mapping? It's not clear to me if page > > migration is a factor for us as well, but it also looks like it has some > > specific page writeback hooks. > > > > I suppose there are technically ways around these things. We could skip > > writeback of certain pages in some cases (but that may have unacceptable > > consequences) or make the decision to use one exposure prevention > > approach or another dynamically at writepage time based on the state of > > the inode, mapping and associated page. I'd think background writeback > > is the common case as opposed to things like hole punch and whatnot, > > however, so a dynamic solution leaves the effectiveness of this approach > > somewhat open ended unless we characterize how all possible writeback > > scenarios behave. > > > > Given that, it seems more useful to me to work through the inside eof > > extent allocation scenario first and work back from there. Then we can > > at least try to establish potential performance impact and determine if > > we should try to work around it in certain cases, etc. > > <nod> I'll get back to working on two parts of fixing this: > > a) adapting the "convert delalloc to unwritten" patch that started this > thread so that it only does that for conversions that start within eof, > > b) fixing your punch case by extending the range that xfs_flush_unmap_range > asks to flush if the caller is trying to flush a range past the ondisk > eof. 
> > Though I think (b) is unnecessary if we had the ability to flush all the > pages fronting a delalloc reservation that we just converted to a real > extent, as I muttered above. > I think it's more that we have to order all such page writebacks against the on-disk size update so the latter protects us from exposure in the event of a crash, but otherwise this sounds like a reasonable next step to me.. Brian > > On that note, > > please see the part that was snipped out of my earlier reply[1], in > > particular the part that ended with "Am I following the high level idea > > correctly?" :P Thanks. > > I thought your take at least matched my interpretation of Dave's > comments. > > Thoughts? > > --D > > > > > Brian > > > > [1] https://marc.info/?l=linux-xfs&m=155308332901975&w=2 > > > > > > It may be possible to selectively (i.e. as an optimization) use an > > > > XFS_ITRUNCATED like algorithm as you propose here to always force in > > > > order writeback in the cases where writeback extends the on-disk isize > > > > (and thus isize protects from stale exposure), but I think we'd still > > > > need something tied to extent allocation one way or another in addition > > > > (like Darrick's original unwritten extent patch or your proposed > > > > intent/done idea in the previous reply) to address the fundamental > > > > problem. The question is what is the ideal solution.. > > > > > > The solution will be a combination of things that work together, not > > > one big hammer. The more we can avoid exposure vectors by > > > operational ordering rather than transactional state changes, the > > > better the solution will perform. > > > > > > Cheers, > > > > > > Dave. > > > -- > > > Dave Chinner > > > david@fromorbit.com
On Thu, Apr 04, 2019 at 08:50:54AM -0400, Brian Foster wrote: > On Wed, Apr 03, 2019 at 03:38:09PM -0700, Darrick J. Wong wrote: > > On Fri, Mar 22, 2019 at 08:46:15AM -0400, Brian Foster wrote: > > > On Fri, Mar 22, 2019 at 12:52:40PM +1100, Dave Chinner wrote: > > > > On Thu, Mar 21, 2019 at 08:27:40AM -0400, Brian Foster wrote: > > > > > On Thu, Mar 21, 2019 at 08:10:19AM +1100, Dave Chinner wrote: > > > > > > On Wed, Mar 20, 2019 at 08:02:05AM -0400, Brian Foster wrote: > > > > > > > On Wed, Mar 20, 2019 at 07:25:37AM +1100, Dave Chinner wrote: > > > > > > > > > > The broader point is that by controlling writeback ordering, we can > > > > > > > > > > explicitly demonstrate stale data exposure. If fixed properly, we should > > > > > > > > > > never expect stale data exposure even in the event of out of order > > > > > > > > > > writeback completion, regardless of the cause. > > > > > > > > > > > > > > > > > > Agree. Even if s_f_r is a giant trash fire, if we're exposing stale > > > > > > > > > disk contents it's game over anyway and we ought to fix it. > > > > > > > > > > > > > > > > Well, that's exactly the problem with SFR, isn't it? The dirty data > > > > > > > > is outside the range the user asks to sync and the implementation > > > > > > > > doesn't go anywhere near the filesystem so we can't actually do the > > > > > > > > right thing here. So we are left to either hack fixes around a > > > > > > > > shitty, broken interface and everyone pays the penalty, or we ignore > > > > > > > > it. We've simply ignored it for the past 10+ years because it really > > > > > > > > can't be used safely for its intended purposes (as everyone who > > > > > > > > tries to use it finds out eventually)... > > > > > > > > > > > > > > > > > > > > > > Just to make this abundantly clear, there is nothing unique, magic or > > > > > > > special required of sync_file_range() to reproduce this stale data > > > > > > > exposure. 
See the following variation of a command sequence from the > > > > > > > fstest I posted the other day: > > > > > > > > > > > > > > ... > > > > > > > # xfs_io -fc "pwrite 0 68k" -c fsync -c "pwrite 96k 4k" \ > > > > > > > -c "fpunch 96k 4k" <file> > > > > > > > ... > > > > > > > # hexdump <file> > > > > > > > 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd > > > > > > > * > > > > > > > 0011000 abab abab abab abab abab abab abab abab > > > > > > > * > > > > > > > 0018000 0000 0000 0000 0000 0000 0000 0000 0000 > > > > > > > * > > > > > > > 0019000 > > > > > > > > > > > > > > Same problem, and it can be reproduced just as easily with a reflink or > > > > > > > even an overlapping direct I/O. > > > > > > > > > > > > How does overlapping direct IO cause stale data exposure given that > > > > > > it uses unwritten extent allocation? That, of course, is ignoring > > > > > > the fact that the result of overlapping concurrent direct IO is, by > > > > > > definition, undefined and an application bug if it occurs.... > > > > > > > > > > > > > > > > It's overlapping a buffered write and so starts with a flush instead of > > > > > > > > Ok, but that's not the normal usage of the phrase "overlapping > > > > direct IO". Overlapping direct IO means two concurrent direct IOs > > > > to the same LBA range in the device underlying the filesystem. i.e. > > > > they overlap in both the temporal and physical (device LBA) domains. > > > > > > > > Subtle difference, yes, but a very important one. > > > > > > > > > > Yeah, I just referenced it in passing and was unclear. The point is just > > > that there are multiple vectors to the problem, some of which are sane > > > sequences and some apparently not, and I'd rather not get caught up in > > > the minutiae of why some of those sequences are not sane. I'm trying to > > > see if we can find a solution to the fundamental data exposure problem. > > > > Is there a way to increase the amount of writeback? 
Say for example > > that we have a delalloc reservation for blocks 0-9, and writeback asks > > us to write back block 7. If we succeed at converting the entire > > 10-block reservation into a real extent, why can't we come back and say > > "Hey, we also need to flush the pages for blocks 0-6,8-9"? Are we going > > to need something like that for blocksize > pagesize filesystems too? > > > > That might be possible in various places. In fact, I think we used to do > something like this in XFS (perhaps not for this purpose). I just jumped > back to v2.6.32 and see that the xfs_cluster_write() mechanism still > exists at that point down in the ->writepage() path. > > I'm not sure if doing that alone is worth the effort if you consider > it's more of a mitigation than a solution to the underlying problem. > Once we convert the associated delalloc extent, we're open to stale data > exposure (assuming we're not protected by on-disk EOF and whatnot) until > all of the remaining pages over the extent are written back, whether > that occurs in the current writeback cycle or a subsequent one. > > I suppose it might make sense in service to another solution, such as > writing back everything preceding a page that is going to update on-disk > EOF (which is something I think Dave suggested earlier as an incremental > improvement, I've kind of lost track). We'd probably just have to make > sure that the on-disk eof can never get ahead of the lower index > writeback pages by I/O reordering or whatever.. ...which would be easi(er) to handle if we could chain all the file-extending writeback bios to a single ioend like you suggested in the per-inode ioend completion thread, right? 
Though such a beast would still be racy, I think, because background writeback could start writing back a /single/ page from the middle of the post-eof range before the integrity writeback flushes everything else, and if the integrity writeback completes before the background writeback does, we're still exposed. I think? Dunno, maybe the per-inode completion can help here, since at least if the completions all come in at the same time we'll reorder and combine the metadata work, though that's not a full solution either. > > > > > > > > an unwritten allocation (so even a dio read is sufficient). Also, the > > > > > target of the dio isn't where stale data is exposed, so even if there > > > > > was an existing underlying unwritten block before the initial buffered > > > > > write it wouldn't change anything. > > > > > > > > > > Yes, this is technically usage with undefined behavior, but that still > > > > > misses the forest for the trees. > > > > > > > > No, I'm not missing the forest for the trees. > > > > > > > > > I'm not claiming this is good or > > > > > supported practice just as I'm not claiming sync_file_range() should > > > > > be used for anything in practice. I'm just pointing out it's one of > > > > > several possible, obvious data exposure vectors. I suspect there are > > > > > others; these were just the easy variants to reproduce. > > > > > > > > I'm aware that we have these problems - we've known about them for > > > > 20+ years and the many ways they can be exposed. Indeed, We've been > > > > talking about using unwritten extents for delalloc for as long as I > > > > can remember, but never done it because the actual occurrence of > > > > these problems in production systems is /extremely/ rare and the > > > > cost of using unwritten extents everywhere will affect /everyone/. > > > > > > > > i.e. 
It's always been easy to create test conditions to expose stale > > > > data, but reported cases of stale data exposure after a production > > > > system crash are pretty much non-existent. This just isn't a common > > > > problem that users are experiencing, but solving it with "unwritten > > > > extents for everyone" comes at a substantial cost for /everyone/, > > > > /all the time/. > > > > > > > > > > I know you're aware. I'm pretty sure we've discussed this directly in > > > the past (with regard to how buffer heads were an impediment to a > > > solution) and that you made me aware of the issue in the first place. ;) > > > > > > > That's why a) it hasn't been done yet; and b) any discussion about > > > > "use unwritten extents everywhere" always ends up in trying to find > > > > a solution that isn't just applying the "big hammer" and ignoring > > > > the problems that causes. > > > > > > > > > > Right.. > > > > > > > > If you prefer to > > > > > reason about reproducer patterns that don't use operations with quirks > > > > > or undefined behaviors, that's fine, just use the reflink or hole punch > > > > > example posted above that produce the same behavior. > > > > > > > > Well, yes. But it does go both ways - if you don't want people to > > > > say "that's undefined" or "that makes no sense", then don't use > > > > those quirky examples if you can use better, unambiguous examples. > > > > > > > > > > The sync_file_range() reproducer was a poor example in that regard. I'm > > > > (I don't want to stir this up again but just so we're all clear I had > > figured the s_f_r call was really just a way to simulate random drive-by > > scattershot background writeback for testing purposes.) > > > > Ack. 
> > > > not concerned over that being pointed out, but I've explained several > > > times subsequently that we can replace the sync_file_range() calls with > > > a sane command (hole punch) to reproduce the same general behavior and > > > resulting data exposure problem and thus the quirks of the former are > > > not a primary factor. Despite that, this discussion has somehow turned > > > into an appraisal of sync_file_range(). Moving on.. > > > > > > I posted some comments/questions around your log item suggestion for > > > extent zeroing during recovery in my previous reply and that part > > > apparently got completely snipped out and ignored. :/ Any feedback on > > > that? > > > > A(nother) big downside of adding a new "write intent": now we need to > > add a log incompat flag which will occasionally generate "I wrote > > something, crashed, and this old knoppix cd won't recover the fs for > > some reason" reports. > > > > Yeah, though I think it's not really recommended to process an fs with a > dirty log across kernels like that. That may be, but it seems to work most often enough that our service techs think they can do it 100%. And I have the bug reports from when that doesn't work to prove it. :( > We already have some recovery > restrictions that prevent this, such on-disk order for example. Of > course that's different from just going back to an older kernel version > on the same hardware and getting stuck that way because of a feature bit > or something. <nod> > > > > For example, that's exactly how I phrased this extension flushing > > > > mechanism trigger: > > > > > > > > > > Essentially, what this says to me is we need to track when we > > > > > > extended the file via a write beyond EOF via an inode flag. Then if we need to > > > > > > flush a data range from the file for integrity reasons (such as a > > > > > > fpunch) and this flag is set, we ignore the "start" parameter of the > > > > > > > > See? 
Entirely generic trigger description, uses hole punch as an > > > > example, but also covers all the quirky cases such as direct IO > > > > range flushing without having to mention them explicitly. > > > > > > > > > > range and flush from the start of the file instead. i.e. we > > > > > > guarantee that any dirty zeroed pages between the start of the file > > > > > > and the end of the range we are operating on gets written back. This > > > > > > will capture all these "didn't flush regions zeroed on file > > > > > > extension" problems in one go, without needing to change the way we > > > > > > do allocation or EOF zeroing. > > > > > > > > > > > > > > > > I don't think that would prevent the problem outside of the common write > > > > > extension case. > > > > > > > > I feel like we're going around in circles now, Brian. :/ > > > > > > > > > > Yes, let's try to reset here... > > > > > > > I didn't propose this as the "only solution we need". I've proposed > > > > it in the context that we use "only use unwritten extents for > > > > allocations within EOF" to solve the common "write within EOF" case. > > > > This, however, doesn't solve this common write extension case where > > > > we don't want to use unwritten extents because of the performance > > > > cost. > > > > > > > > And now you're saying that "well, this extending write thing > > > > doesn't solve the inside EOF problem"...... > > > > > > > > > > Ok, I lost that context. As I said above, I can see how this could be > > > useful in combination with another approach to handle inside eof > > > allocations. There are still some loose ends (in my mind) to consider > > > with this technique, however... > > > > > > This requires that we always guarantee in-order writeback, right? I can > > > see how we can do that in the cases we control based on the flag based > > > approach you suggested. But do we really have enough control within the > > > fs to always prevent out of order writeback? 
What about non-integrity > > > background writeback? AFAICT it's technically possible to writeback out > > > of expected order (the cyclic range logic looks particularly suspicious > > > to me if you consider a mapping can be truncated and repopulated many > > > times over without resetting the cycle index). What about memory > > > reclaim? Is it not possible to hit the ->writepage() path via reclaim > > > for an arbitrary page in a mapping? It's not clear to me if page > > > migration is a factor for us as well, but it also looks like it has some > > > specific page writeback hooks. > > > > > > I suppose there are technically ways around these things. We could skip > > > writeback of certain pages in some cases (but that may have unacceptable > > > consequences) or make the decision to use one exposure prevention > > > approach or another dynamically at writepage time based on the state of > > > the inode, mapping and associated page. I'd think background writeback > > > is the common case as opposed to things like hole punch and whatnot, > > > however, so a dynamic solution leaves the effectiveness of this approach > > > somewhat open ended unless we characterize how all possible writeback > > > scenarios behave. > > > > > > Given that, it seems more useful to me to work through the inside eof > > > extent allocation scenario first and work back from there. Then we can > > > at least try to establish potential performance impact and determine if > > > we should try to work around it in certain cases, etc. > > > > <nod> I'll get back to working on two parts of fixing this: > > > > a) adapting the "convert delalloc to unwritten" patch that started this > > thread so that it only does that for conversions that start within eof, > > > > b) fixing your punch case by extending the range that xfs_flush_unmap_range > > asks to flush if the caller is trying to flush a range past the ondisk > > eof. 
> > > > Though I think (b) is unnecessary if we had the ability to flush all the > > pages fronting a delalloc reservation that we just converted to a real > > extent, as I muttered above. > > > > I think it's more that we have to order all such page writebacks against > the on-disk size update so the latter protects us from exposure in the > event of a crash, but otherwise this sounds like a reasonable next step > to me.. <nod> Figuring out ordered completion is going to take some thought. --D > Brian > > > > On that note, > > > please see the part that was snipped out of my earlier reply[1], in > > > particular the part that ended with "Am I following the high level idea > > > correctly?" :P Thanks. > > > > I thought your take at least matched my interpretation of Dave's > > comments. > > > > Thoughts? > > > > --D > > > > > > > > Brian > > > > > > [1] https://marc.info/?l=linux-xfs&m=155308332901975&w=2 > > > > > > > > It may be possible to selectively (i.e. as an optimization) use an > > > > > XFS_ITRUNCATED like algorithm as you propose here to always force in > > > > > order writeback in the cases where writeback extends the on-disk isize > > > > > (and thus isize protects from stale exposure), but I think we'd still > > > > > need something tied to extent allocation one way or another in addition > > > > > (like Darrick's original unwritten extent patch or your proposed > > > > > intent/done idea in the previous reply) to address the fundamental > > > > > problem. The question is what is the ideal solution.. > > > > > > > > The solution will be a combination of things that work together, not > > > > one big hammer. The more we can avoid exposure vectors by > > > > operational ordering rather than transactional state changes, the > > > > better the solution will perform. > > > > > > > > Cheers, > > > > > > > > Dave. > > > > -- > > > > Dave Chinner > > > > david@fromorbit.com
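The flush-widening fix described in (b) above can be modelled in a few lines. This is a userspace sketch under stated assumptions: the helper name and the plain integer offsets are inventions for illustration, not the kernel's actual xfs_flush_unmap_range() interface.

```c
/*
 * Userspace model of the flush-widening idea from this thread: when an
 * integrity flush (e.g. ahead of a hole punch) covers a range that
 * extends past the on-disk EOF, widen the flush to start at the
 * beginning of the file so any dirty zeroed pages inside EOF are
 * written back first.  Helper name and integer offsets are
 * illustrative assumptions only.
 */
#include <assert.h>
#include <stdint.h>

typedef int64_t xoff_t;		/* stand-in for xfs_off_t (byte offsets) */

static xoff_t widen_flush_start(xoff_t start, xoff_t end, xoff_t disk_eof)
{
	/* Range reaches past on-disk EOF: flush from the file start. */
	if (end > disk_eof)
		return 0;
	/* Wholly inside on-disk EOF: the requested start is fine. */
	return start;
}
```

With the reproducer quoted earlier in the thread (68k of written data, punch at 96k), the punched range lies entirely past the on-disk EOF, so the flush would be widened back to offset zero.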
On Thu, Apr 04, 2019 at 08:22:54AM -0700, Darrick J. Wong wrote: > On Thu, Apr 04, 2019 at 08:50:54AM -0400, Brian Foster wrote: > > On Wed, Apr 03, 2019 at 03:38:09PM -0700, Darrick J. Wong wrote: > > > On Fri, Mar 22, 2019 at 08:46:15AM -0400, Brian Foster wrote: > > > > On Fri, Mar 22, 2019 at 12:52:40PM +1100, Dave Chinner wrote: > > > > > On Thu, Mar 21, 2019 at 08:27:40AM -0400, Brian Foster wrote: > > > > > > On Thu, Mar 21, 2019 at 08:10:19AM +1100, Dave Chinner wrote: > > > > > > > On Wed, Mar 20, 2019 at 08:02:05AM -0400, Brian Foster wrote: > > > > > > > > On Wed, Mar 20, 2019 at 07:25:37AM +1100, Dave Chinner wrote: > > > > > > > > > > > The broader point is that by controlling writeback ordering, we can > > > > > > > > > > > explicitly demonstrate stale data exposure. If fixed properly, we should > > > > > > > > > > > never expect stale data exposure even in the event of out of order > > > > > > > > > > > writeback completion, regardless of the cause. > > > > > > > > > > > > > > > > > > > > Agree. Even if s_f_r is a giant trash fire, if we're exposing stale > > > > > > > > > > disk contents it's game over anyway and we ought to fix it. > > > > > > > > > > > > > > > > > > Well, that's exactly the problem with SFR, isn't it? The dirty data > > > > > > > > > is outside the range the user asks to sync and the implementation > > > > > > > > > doesn't go anywhere near the filesystem so we can't actually do the > > > > > > > > > right thing here. So we are left to either hack fixes around a > > > > > > > > > shitty, broken interface and everyone pays the penalty, or we ignore > > > > > > > > > it. We've simply ignored it for the past 10+ years because it really > > > > > > > > > can't be used safely for it's intended purposes (as everyone who > > > > > > > > > tries to use it finds out eventually)... 
> > > > > > > > > > > > > > > > > > > > > > > > > Just to make this abundantly clear, there is nothing unique, magic or > > > > > > > > special required of sync_file_range() to reproduce this stale data > > > > > > > > exposure. See the following variation of a command sequence from the > > > > > > > > fstest I posted the other day: > > > > > > > > > > > > > > > > ... > > > > > > > > # xfs_io -fc "pwrite 0 68k" -c fsync -c "pwrite 96k 4k" \ > > > > > > > > -c "fpunch 96k 4k" <file> > > > > > > > > ... > > > > > > > > # hexdump <file> > > > > > > > > 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd > > > > > > > > * > > > > > > > > 0011000 abab abab abab abab abab abab abab abab > > > > > > > > * > > > > > > > > 0018000 0000 0000 0000 0000 0000 0000 0000 0000 > > > > > > > > * > > > > > > > > 0019000 > > > > > > > > > > > > > > > > Same problem, and it can be reproduced just as easily with a reflink or > > > > > > > > even an overlapping direct I/O. > > > > > > > > > > > > > > How does overlapping direct IO cause stale data exposure given that > > > > > > > it uses unwritten extent allocation? That, of course, is ignoring > > > > > > > the fact that the result of overlapping concurent direct IO is, by > > > > > > > definition, undefined and an application bug if it occurs.... > > > > > > > > > > > > > > > > > > > It's overlapping a buffered write and so starts with a flush instead of > > > > > > > > > > Ok, but that's not the normal usage of the phrase "overlapping > > > > > direct IO". Overlapping direct IO means two concurrent direct IOs > > > > > to the same LBA range in the device underlying the filesystem. i.e. > > > > > they overlap in both the temporal and physical (device LBA) domains. > > > > > > > > > > Subtle difference, yes, but a very important one. > > > > > > > > > > > > > Yeah, I just referenced it in passing and was unclear. 
The point is just > > > > that there are multiple vectors to the problem, some of which are sane > > > > sequences and some apparently not, and I'd rather not get caught up in > > > > the minutiae of why some of those sequences are not sane. I'm trying to > > > > see if we can find a solution to the fundamental data exposure problem. > > > > > > Is there a way to increase the amount of writeback? Say for example > > > that we have a delalloc reservation for blocks 0-9, and writeback asks > > > us to write back block 7. If we succeed at converting the entire > > > 10-block reservation into a real extent, why can't we come back and say > > > "Hey, we also need to flush the pages for blocks 0-6,8-9"? Are we going > > > to need something like that for blocksize > pagesize filesystems too? > > > > > > > That might be possible in various places. In fact, I think we used to do > > something like this in XFS (perhaps not for this purpose). I just jumped > > back to v2.6.32 and see that the xfs_cluster_write() mechanism still > > exists at that point down in the ->writepage() path. > > > > I'm not sure if doing that alone is worth the effort if you consider > > it's more of a mitigation than a solution to the underlying problem. > > Once we convert the associated delalloc extent, we're open to stale data > > exposure (assuming we're not protected by on-disk EOF and whatnot) until > > all of the remaining pages over the extent are written back, whether > > that occurs in the current writeback cycle or a subsequent one. > > > > I suppose it might make sense in service to another solution, such as > > writing back everything preceding a page that is going to update on-disk > > EOF (which is something I think Dave suggested earlier as an incremental > > improvement, I've kind of lost track). We'd probably just have to make > > sure that the on-disk eof can never get ahead of the lower index > > writeback pages by I/O reordering or whatever.. 
> > ...which would be easi(er) to handle if we could chain all the > file-extending writeback bios to a single ioend like you suggested in > the per-inode ioend completion thread, right? > Hmm, I think that could be a useful tool in some scenarios... to combine all append bios to a single completion and i_size update for example. As you suggest below, that by itself probably doesn't cover all of the possible scenarios. > Though such a beast would still be racy, I think, because background > writeback could start writing back a /single/ page from the middle of > the post-eof range before the integrity writeback flushes everything > else, and if the integrity writeback completes before the background > writeback does, we're still exposed. I think? > Basically, I think we're exposed if we ever allow an on-disk size update to commit where delalloc extents have been converted within that EOF and not fully written back. I think there's probably a matrix of different ways to avoid that. We could batch append ioends as noted above, perhaps restrict or defer append transactions until the problematic state no longer exists on an inode (i.e., defer the size update until the range within the new on-disk EOF is naturally stabilized instead of trying to force it to happen in various writeback scenarios), and/or incorporate more restrictive delalloc conversion and selective use of unwritten extents (probably at some performance/fragmentation cost), etc. > Dunno, maybe the per-inode completion can help here, since at least if > the completions all come in at the same time we'll reorder and combine > the metadata work, though that's not a full solution either. > Yeah, we'd be relying on timing to facilitate completion batching, at least in the current implementation. Perhaps we could reuse that mechanism to explicitly defer append transactions such that we don't expose stale data. 
For example, keep the ioend/tx around until there are no delalloc or "stale" blocks within the new on-disk EOF, assuming we can find a way to track "stale" blocks (in-core extent state?). The tradeoff for something like that would be on-disk eof may not update as fast in the event of a crash, but that doesn't seem unreasonable to me. > > > > > > > > > > an unwritten allocation (so even a dio read is sufficient). Also, the > > > > > > target of the dio isn't where stale data is exposed, so even if there > > > > > > was an existing underlying unwritten block before the initial buffered > > > > > > write it wouldn't change anything. > > > > > > > > > > > > Yes, this is technically usage with undefined behavior, but that still > > > > > > misses the forest from the trees. > > > > > > > > > > No, I'm not missing the forest for the trees. > > > > > > > > > > > I'm not claiming this is good or > > > > > > supported practice just as I'm not not claiming sync_file_range() should > > > > > > be used for anything in practice. I'm just pointing out it's one of > > > > > > several possible, obvious data exposure vectors. I suspect there are > > > > > > others; these were just the easy variants to reproduce. > > > > > > > > > > I'm aware that we have these problems - we've known about them for > > > > > 20+ years and the many ways they can be exposed. Indeed, We've been > > > > > talking about using unwritten extents for delalloc for as long as I > > > > > can remember, but never done it because the actual occurrence of > > > > > these problems in production systems is /extremely/ rare and the > > > > > cost of using unwritten extents everywhere will affect /everyone/. > > > > > > > > > > i.e. It's always been easy to create test conditions to expose stale > > > > > data, but reported cases of stale data exposure after a production > > > > > system crash are pretty much non-existent. 
This just isn't a common > > > > > problem that users are experiencing, but solving it with "unwritten > > > > > extents for everyone" comes at a substantial cost for /everyone/, > > > > > /all the time/. > > > > > > > > > > > > > I know you're aware. I'm pretty sure we've discussed this directly in > > > > the past (with regard to how buffer heads were an impediment to a > > > > solution) and that you made me aware of the issue in the first place. ;) > > > > > > > > > That's why a) it hasn't been done yet; and b) any discussion about > > > > > "use unwritten extents everywhere" always ends up in trying to find > > > > > a solution that isn't just applying the "big hammer" and ignoring > > > > > the problems that causes. > > > > > > > > > > > > > Right.. > > > > > > > > > > If you prefer to > > > > > > reason about reproducer patterns that don't use operations with quirks > > > > > > or undefined behaviors, that's fine, just use the reflink or hole punch > > > > > > example posted above that produce the same behavior. > > > > > > > > > > Well, yes. But it does go both ways - if you don't want people to > > > > > say "that's undefined" or "that makes no sense", then don't use > > > > > those quirky examples if you can use better, unambiguous examples. > > > > > > > > > > > > > The sync_file_range() reproducer was a poor example in that regard. I'm > > > > > > (I don't want to stir this up again but just so we're all clear I had > > > figured the s_f_r call was really just a way to simulate random drive-by > > > scattershot background writeback for testing purposes.) > > > > > > > Ack. > > > > > > not concerned over that being pointed out, but I've explained several > > > > times subsequently that we can replace the sync_file_range() calls with > > > > a sane command (hole punch) to reproduce the same general behavior and > > > > resulting data exposure problem and thus the quirks of the former are > > > > not a primary factor. 
Despite that, this discussion has somehow turned > > > > into an appraisal of sync_file_range(). Moving on.. > > > > > > > > I posted some comments/questions around your log item suggestion for > > > > extent zeroing during recovery in my previous reply and that part > > > > apparently got completely snipped out and ignored. :/ Any feedback on > > > > that? > > > > > > A(nother) big downside of adding a new "write intent": now we need to > > > add a log incompat flag which will occasionally generate "I wrote > > > something, crashed, and this old knoppix cd won't recover the fs for > > > some reason" reports. > > > > > > > Yeah, though I think it's not really recommended to process an fs with a > > dirty log across kernels like that. > > That may be, but it seems to work most often enough that our service > techs think they can do it 100%. And I have the bug reports from when > that doesn't work to prove it. :( > Heh. :P > > We already have some recovery > > restrictions that prevent this, such on-disk order for example. Of > > course that's different from just going back to an older kernel version > > on the same hardware and getting stuck that way because of a feature bit > > or something. > > <nod> > > > > > > For example, that's exactly how I phrased this extension flushing > > > > > mechanism trigger: > > > > > > > > > > > > Eseentially, what this says to me is we need to track when we > > > > > > > extended the file via a write beyond EOF via an inode flag. Then if we need to > > > > > > > flush a data range from the file for integrity reasons (such as a > > > > > > > fpnuch) and this flag is set, we ignore the "start" parameter of the > > > > > > > > > > See? Entirely generic trigger description, uses hole punch as an > > > > > example, but also covers all the quirky cases such as direct IO > > > > > range flushing without having to mention them explicitly. > > > > > > > > > > > > range and flush from the start of the file instead. i.e. 
we > > > > > > > guarantee that any dirty zeroed pages between the start of the file > > > > > > > and the end of the range we are operating on gets written back. This > > > > > > > will capture all these "didn't flush regions zeroed on file > > > > > > > extension" problems in one go, without needing to change the way we > > > > > > > do allocation or EOF zeroing. > > > > > > > > > > > > > > > > > > > I don't think that would prevent the problem outside of the common write > > > > > > extension case. > > > > > > > > > > I feel like we're going around in circles now, Brian. :/ > > > > > > > > > > > > > Yes, let's try to reset here... > > > > > > > > > I didn't propose this as the "only solution we need". I've proposed > > > > > it in the context that we use "only use unwritten extents for > > > > > allocations within EOF" to solve the common "write within EOF" case. > > > > > This, however, doesn't solve this common write extension case where > > > > > we don't want to use unwritten extents because of the performance > > > > > cost. > > > > > > > > > > And now you're saying that "well, this extending write thing > > > > > doesn't solve the inside EOF problem"...... > > > > > > > > > > > > > Ok, I lost that context. As I said above, I can see how this could be > > > > useful in combination with another approach to handle inside eof > > > > allocations. There are still some loose ends (in my mind) to consider > > > > with this technique, however... > > > > > > > > This requires that we always guarantee in-order writeback, right? I can > > > > see how we can do that in the cases we control based on the flag based > > > > approach you suggested. But do we really have enough control within the > > > > fs to always prevent out of order writeback? What about non-integrity > > > > background writeback? 
AFAICT it's technically possible to writeback out > > > > of expected order (the cyclic range logic looks particularly suspicious > > > > to me if you consider a mapping can be truncated and repopulated many > > > > times over without resetting the cycle index). What about memory > > > > reclaim? Is it not possible to hit the ->writepage() path via reclaim > > > > for an arbitrary page in a mapping? It's not clear to me if page > > > > migration is a factor for us as well, but it also looks like it has some > > > > specific page writeback hooks. > > > > > > > > I suppose there are technically ways around these things. We could skip > > > > writeback of certain pages in some cases (but that may have unacceptable > > > > consequences) or make the decision to use one exposure preventation > > > > approach or another dynamically at writepage time based on the state of > > > > the inode, mapping and associated page. I'd think background writeback > > > > is the common case as opposed to things like hole punch and whatnot, > > > > however, so a dynamic solution leaves the effectiveness of this approach > > > > somewhat open ended unless we characterize how all possible writeback > > > > scenarios behave. > > > > > > > > Given that, it seems more useful to me to work through the inside eof > > > > extent allocation scenario first and work back from there. Then we can > > > > at least try to establish potential performance impact and determine if > > > > we should try to work around it in certain cases, etc. > > > > > > <nod> I'll get back to working on two parts of fixing this: > > > > > > a) adapting the "convert delalloc to unwritten" patch that started this > > > thread so that it only does that for conversions that start within eof, > > > > > > b) fixing your punch case by extending the range that xfs_flush_unmap_range > > > asks to flush if the caller is trying to flush a range past the ondisk > > > eof. 
> > > > > > Though I think (b) is unnecessary if we had the ability to flush all the > > > pages fronting a delalloc reservation that we just converted to a real > > > extent, as I muttered above. > > > > > > > I think it's more that we have to order all such page writebacks against > > the on-disk size update so the latter protects us from exposure in the > > event of a crash, but otherwise this sounds like a reasonable next step > > to me.. > > <nod> Figuring out ordered completion is going to take some thought. > Indeed, it's not really trivial. I think we've kind of established various potential techniques to use with various associated tradeoffs. Just using unwritten extents has performance issues, the write intent thing has backwards compatibility tradeoffs, etc. Perhaps if we can find some way to cleverly order writeback and size updates in the most common cases, we could then pick a suitably limited fallback and not have to worry as much about the associated tradeoffs. Brian > --D > > > Brian > > > > > > On that note, > > > > please see the part that was snipped out of my earlier reply[1], in > > > > particular the part that ended with "Am I following the high level idea > > > > correctly?" :P Thanks. > > > > > > I thought your take at least matched my interpretation of Dave's > > > comments. > > > > > > Thoughts? > > > > > > --D > > > > > > > > > > > Brian > > > > > > > > [1] https://marc.info/?l=linux-xfs&m=155308332901975&w=2 > > > > > > > > > > It may be possible to selectively (i.e. 
as an optimization) use an > > > > > > XFS_ITRUNCATED like algorithm as you propose here to always force in > > > > > > order writeback in the cases where writeback extends the on-disk isize > > > > > > (and thus isize protects from stale exposure), but I think we'd still > > > > > > need something tied to extent allocation one way or another in addition > > > > > > (like Darrick's original unwritten extent patch or your proposed > > > > > > intent/done idea in the previous reply) to address the fundamental > > > > > > problem. The question is what is the ideal solution.. > > > > > > > > > > The solution will be a combination of things that work together, not > > > > > one big hammer. The more we can avoid exposure vectors by > > > > > operational ordering rather than transactional state changes, the > > > > > better the solution will perform. > > > > > > > > > > Cheers, > > > > > > > > > > Dave. > > > > > -- > > > > > Dave Chinner > > > > > david@fromorbit.com
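The "defer the append transaction until nothing unstable remains below the new EOF" rule sketched in this message can be modelled as a small userspace check. The extent states, structure layout, and function name below are assumptions for illustration, not XFS's in-core extent format.

```c
/*
 * Model of the deferred size-update rule: an append I/O completion may
 * only move the on-disk EOF forward once no delalloc or "stale"
 * (allocated but not yet fully written) blocks remain below the new
 * size.  All names and states here are illustrative assumptions.
 */
#include <assert.h>
#include <stdint.h>

enum ext_state { EXT_WRITTEN, EXT_DELALLOC, EXT_STALE };

struct ext {
	int64_t		start;	/* byte offset of the extent */
	int64_t		len;
	enum ext_state	state;
};

/* Return nonzero iff committing new_eof cannot expose stale data. */
static int can_update_disk_eof(const struct ext *exts, int n, int64_t new_eof)
{
	for (int i = 0; i < n; i++) {
		if (exts[i].state == EXT_WRITTEN)
			continue;
		/* Unstable blocks inside the new EOF: defer the update. */
		if (exts[i].start < new_eof)
			return 0;
	}
	return 1;
}
```

The tradeoff discussed above shows up directly here: until the delalloc/stale extents below the new size are written back, the on-disk EOF simply stays where it was, which is safe after a crash but delays the size update.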
On Thu, Apr 04, 2019 at 08:50:54AM -0400, Brian Foster wrote: > On Wed, Apr 03, 2019 at 03:38:09PM -0700, Darrick J. Wong wrote: > > On Fri, Mar 22, 2019 at 08:46:15AM -0400, Brian Foster wrote: > > > On Fri, Mar 22, 2019 at 12:52:40PM +1100, Dave Chinner wrote: > > > > On Thu, Mar 21, 2019 at 08:27:40AM -0400, Brian Foster wrote: > > > > > On Thu, Mar 21, 2019 at 08:10:19AM +1100, Dave Chinner wrote: > > > > > > On Wed, Mar 20, 2019 at 08:02:05AM -0400, Brian Foster wrote: > > > > > > > On Wed, Mar 20, 2019 at 07:25:37AM +1100, Dave Chinner wrote: > > > > > > > > > > The broader point is that by controlling writeback ordering, we can > > > > > > > > > > explicitly demonstrate stale data exposure. If fixed properly, we should > > > > > > > > > > never expect stale data exposure even in the event of out of order > > > > > > > > > > writeback completion, regardless of the cause. > > > > > > > > > > > > > > > > > > Agree. Even if s_f_r is a giant trash fire, if we're exposing stale > > > > > > > > > disk contents it's game over anyway and we ought to fix it. > > > > > > > > > > > > > > > > Well, that's exactly the problem with SFR, isn't it? The dirty data > > > > > > > > is outside the range the user asks to sync and the implementation > > > > > > > > doesn't go anywhere near the filesystem so we can't actually do the > > > > > > > > right thing here. So we are left to either hack fixes around a > > > > > > > > shitty, broken interface and everyone pays the penalty, or we ignore > > > > > > > > it. We've simply ignored it for the past 10+ years because it really > > > > > > > > can't be used safely for it's intended purposes (as everyone who > > > > > > > > tries to use it finds out eventually)... > > > > > > > > > > > > > > > > > > > > > > Just to make this abundantly clear, there is nothing unique, magic or > > > > > > > special required of sync_file_range() to reproduce this stale data > > > > > > > exposure. 
See the following variation of a command sequence from the > > > > > > > fstest I posted the other day: > > > > > > > > > > > > > > ... > > > > > > > # xfs_io -fc "pwrite 0 68k" -c fsync -c "pwrite 96k 4k" \ > > > > > > > -c "fpunch 96k 4k" <file> > > > > > > > ... > > > > > > > # hexdump <file> > > > > > > > 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd > > > > > > > * > > > > > > > 0011000 abab abab abab abab abab abab abab abab > > > > > > > * > > > > > > > 0018000 0000 0000 0000 0000 0000 0000 0000 0000 > > > > > > > * > > > > > > > 0019000 > > > > > > > > > > > > > > Same problem, and it can be reproduced just as easily with a reflink or > > > > > > > even an overlapping direct I/O. > > > > > > > > > > > > How does overlapping direct IO cause stale data exposure given that > > > > > > it uses unwritten extent allocation? That, of course, is ignoring > > > > > > the fact that the result of overlapping concurent direct IO is, by > > > > > > definition, undefined and an application bug if it occurs.... > > > > > > > > > > > > > > > > It's overlapping a buffered write and so starts with a flush instead of > > > > > > > > Ok, but that's not the normal usage of the phrase "overlapping > > > > direct IO". Overlapping direct IO means two concurrent direct IOs > > > > to the same LBA range in the device underlying the filesystem. i.e. > > > > they overlap in both the temporal and physical (device LBA) domains. > > > > > > > > Subtle difference, yes, but a very important one. > > > > > > > > > > Yeah, I just referenced it in passing and was unclear. The point is just > > > that there are multiple vectors to the problem, some of which are sane > > > sequences and some apparently not, and I'd rather not get caught up in > > > the minutiae of why some of those sequences are not sane. I'm trying to > > > see if we can find a solution to the fundamental data exposure problem. > > > > Is there a way to increase the amount of writeback? 
Say for example > > that we have a delalloc reservation for blocks 0-9, and writeback asks > > us to write back block 7. If we succeed at converting the entire > > 10-block reservation into a real extent, why can't we come back and say > > "Hey, we also need to flush the pages for blocks 0-6,8-9"? Are we going > > to need something like that for blocksize > pagesize filesystems too? > > > > That might be possible in various places. In fact, I think we used to do > something like this in XFS (perhaps not for this purpose). I just jumped > back to v2.6.32 and see that the xfs_cluster_write() mechanism still > exists at that point down in the ->writepage() path. We /really/ don't want to go back to that. This caused us to have to essentially implement write_cache_pages() inside ->writepage. If we want to expand the writeback range (i.e. move the offset we start at), then we need to do that in xfs_vm_writepages(), before we call write_cache_pages(). This means all the tagged writepages stuff will continue to work, and we don't end up doing redundant page cache lookups because we're trying to write back the same range from two different places. > I'm not sure if doing that alone is worth the effort if you consider > it's more of a mitigation than a solution to the underlying problem. > Once we convert the associated delalloc extent, we're open to stale data > exposure (assuming we're not protected by on-disk EOF and whatnot) until > all of the remaining pages over the extent are written back, whether > that occurs in the current writeback cycle or a subsequent one. If we were to do a read iomap lookup in xfs_vm_writepages() we could determine if the delalloc conversion would allocate a range outside the current write that is being flushed, and modify the writeback control range/offset to match the delalloc extent. 
However, this runs the risk of breaking all the fairness algorithms in high-level writeback, so we'd need to be very careful about how we extend the write range for writeback inside EOF.... Cheers, Dave.
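The writeback-expansion idea above can be sketched as pure offset arithmetic: before handing the range to write_cache_pages(), pull the start offset back to the start of the delalloc extent backing it, so the whole extent's pages go out in one pass. A real implementation would find the extent via an iomap lookup in ->writepages; here the extent bounds are passed in directly, and the function name is an assumption.

```c
/*
 * Sketch of expanding the writeback start offset to cover the whole
 * delalloc extent being converted.  Byte offsets only; not the
 * kernel's writeback_control interface.
 */
#include <assert.h>
#include <stdint.h>

static int64_t expand_writeback_start(int64_t wb_start,
				      int64_t da_start, int64_t da_end)
{
	/* Only expand when writeback starts inside the delalloc extent,
	 * and never move the start forward. */
	if (wb_start >= da_start && wb_start < da_end)
		return da_start;
	return wb_start;
}
```

In the thread's example of a blocks 0-9 delalloc reservation with writeback asked to flush block 7, this pulls the start back to block 0 so blocks 0-6 are not left exposed after the conversion.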
On Fri, Apr 05, 2019 at 09:08:45AM +1100, Dave Chinner wrote: > On Thu, Apr 04, 2019 at 08:50:54AM -0400, Brian Foster wrote: > > On Wed, Apr 03, 2019 at 03:38:09PM -0700, Darrick J. Wong wrote: > > > On Fri, Mar 22, 2019 at 08:46:15AM -0400, Brian Foster wrote: > > > > On Fri, Mar 22, 2019 at 12:52:40PM +1100, Dave Chinner wrote: > > > > > On Thu, Mar 21, 2019 at 08:27:40AM -0400, Brian Foster wrote: > > > > > > On Thu, Mar 21, 2019 at 08:10:19AM +1100, Dave Chinner wrote: > > > > > > > On Wed, Mar 20, 2019 at 08:02:05AM -0400, Brian Foster wrote: > > > > > > > > On Wed, Mar 20, 2019 at 07:25:37AM +1100, Dave Chinner wrote: > > > > > > > > > > > The broader point is that by controlling writeback ordering, we can > > > > > > > > > > > explicitly demonstrate stale data exposure. If fixed properly, we should > > > > > > > > > > > never expect stale data exposure even in the event of out of order > > > > > > > > > > > writeback completion, regardless of the cause. > > > > > > > > > > > > > > > > > > > > Agree. Even if s_f_r is a giant trash fire, if we're exposing stale > > > > > > > > > > disk contents it's game over anyway and we ought to fix it. > > > > > > > > > > > > > > > > > > Well, that's exactly the problem with SFR, isn't it? The dirty data > > > > > > > > > is outside the range the user asks to sync and the implementation > > > > > > > > > doesn't go anywhere near the filesystem so we can't actually do the > > > > > > > > > right thing here. So we are left to either hack fixes around a > > > > > > > > > shitty, broken interface and everyone pays the penalty, or we ignore > > > > > > > > > it. We've simply ignored it for the past 10+ years because it really > > > > > > > > > can't be used safely for it's intended purposes (as everyone who > > > > > > > > > tries to use it finds out eventually)... 
> > > > > > > > > > > > > > > > > > > > > > > > > Just to make this abundantly clear, there is nothing unique, magic or > > > > > > > > special required of sync_file_range() to reproduce this stale data > > > > > > > > exposure. See the following variation of a command sequence from the > > > > > > > > fstest I posted the other day: > > > > > > > > > > > > > > > > ... > > > > > > > > # xfs_io -fc "pwrite 0 68k" -c fsync -c "pwrite 96k 4k" \ > > > > > > > > -c "fpunch 96k 4k" <file> > > > > > > > > ... > > > > > > > > # hexdump <file> > > > > > > > > 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd > > > > > > > > * > > > > > > > > 0011000 abab abab abab abab abab abab abab abab > > > > > > > > * > > > > > > > > 0018000 0000 0000 0000 0000 0000 0000 0000 0000 > > > > > > > > * > > > > > > > > 0019000 > > > > > > > > > > > > > > > > Same problem, and it can be reproduced just as easily with a reflink or > > > > > > > > even an overlapping direct I/O. > > > > > > > > > > > > > > How does overlapping direct IO cause stale data exposure given that > > > > > > > it uses unwritten extent allocation? That, of course, is ignoring > > > > > > > the fact that the result of overlapping concurent direct IO is, by > > > > > > > definition, undefined and an application bug if it occurs.... > > > > > > > > > > > > > > > > > > > It's overlapping a buffered write and so starts with a flush instead of > > > > > > > > > > Ok, but that's not the normal usage of the phrase "overlapping > > > > > direct IO". Overlapping direct IO means two concurrent direct IOs > > > > > to the same LBA range in the device underlying the filesystem. i.e. > > > > > they overlap in both the temporal and physical (device LBA) domains. > > > > > > > > > > Subtle difference, yes, but a very important one. > > > > > > > > > > > > > Yeah, I just referenced it in passing and was unclear. 
The point is just > > > > that there are multiple vectors to the problem, some of which are sane > > > > sequences and some apparently not, and I'd rather not get caught up in > > > > the minutiae of why some of those sequences are not sane. I'm trying to > > > > see if we can find a solution to the fundamental data exposure problem. > > > > > > Is there a way to increase the amount of writeback? Say for example > > > that we have a delalloc reservation for blocks 0-9, and writeback asks > > > us to write back block 7. If we succeed at converting the entire > > > 10-block reservation into a real extent, why can't we come back and say > > > "Hey, we also need to flush the pages for blocks 0-6,8-9"? Are we going > > > to need something like that for blocksize > pagesize filesystems too? > > > > > > > That might be possible in various places. In fact, I think we used to do > > something like this in XFS (perhaps not for this purpose). I just jumped > > back to v2.6.32 and see that the xfs_cluster_write() mechanism still > > exists at that point down in the ->writepage() path. > > We /really/ don't want to go back to that. This caused us to have to > essentially implement write_cache_pages() inside ->writepage. > Agreed.. > If we want to expand the writeback range, (i.e. move the offset we > start at) then we need to do that in xfs_vm_writepages(), before we > call write_cache_pages(). This means all the tagged writepages stuff > will continue to work, and we don't end up doing redundant page > cache lookups because we're trying to write back the same range from > two different places. > IMO, messing around with writeback like this is a mitigation and/or a contributor to a broader solution, not a fundamental fix by itself. For example, I could see doing something like that if we already put proper writeback vs. 
append transaction ordering in place and we identify certain corner cases where we want to prevent delaying an on-disk i_size update for too long because a user indirectly triggered writeback of a high index page. > > I'm not sure if doing that alone is worth the effort if you consider > > it's more of a mitigation than a solution to the underlying problem. > > Once we convert the associated delalloc extent, we're open to stale data > > exposure (assuming we're not protected by on-disk EOF and whatnot) until > > all of the remaining pages over the extent are written back, whether > > that occurs in the current writeback cycle or a subsequent one. > > If we were to do a read iomap lookup in xfs_vm_writepages() we could > determine if the delalloc conversion would allocate a range outside > the current write that is being flushed, and either modify the > writeback control range/offset to match the delalloc extent. > > however, this runs the risk of breaking all the fairness algorithms > in high level writeback, so we'd need to be very careful about how > we extend the write range for writeback inside EOF.... > Indeed. The more I think about this, the more I think we'd have an easier time trying to schedule the on-disk i_size update vs. trying to control the size and range of writeback in all possible scenarios. The former would still require non-trivial tracking and logic, but that's something the writeback control described above could actually help with. For example, we don't currently have any way to track "stale" state of just allocated but not yet fully written extents. If we established a rule that we always wrote back such extents in entirety and via a single ioend, then writeback completion code would simply need to check for delalloc extents within in-core eof to determine when it's safe to update on-disk eof.
If there are cases where we determine we can't write back an entire delalloc extent because of some fairness reasons or whatever, we could decide to only physically allocate the subset of that delalloc extent that we can write back rather than the whole thing (trading off stale data exposure for some increased fragmentation). That way writeback completion still sees existence of delalloc blocks and knows to hold off on the on-disk size update. One challenge with an approach like that is if we do skip completion time on-disk i_size updates, we'd have to make sure that those updates still happen eventually if the address space happens to change outside of writeback. For example, if those remaining dirty delalloc pages somehow end up invalidated and never become real blocks, we may never see another writeback cycle on the inode and we'd need to update on-disk size from some other context (in a timely fashion). Of course this only deals with the append case. We'd still need to select a mechanism for dealing with inside EOF allocations (such as the whole write-intent thing or unwritten extents). To me, the intent thing still seems like the most simple/elegant option we've discussed so far because of the potential for low impact. Part of me thinks that if we determine we'll fall back to that approach for certain cases anyways (like inside EOF allocs), it might be worth just doing that first and then evaluating whether to use it unconditionally. Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Fri, Apr 05, 2019 at 07:24:44AM -0400, Brian Foster wrote: > On Fri, Apr 05, 2019 at 09:08:45AM +1100, Dave Chinner wrote: > > On Thu, Apr 04, 2019 at 08:50:54AM -0400, Brian Foster wrote: > > > On Wed, Apr 03, 2019 at 03:38:09PM -0700, Darrick J. Wong wrote: > > > > On Fri, Mar 22, 2019 at 08:46:15AM -0400, Brian Foster wrote: > > > > > On Fri, Mar 22, 2019 at 12:52:40PM +1100, Dave Chinner wrote: > > > > > > On Thu, Mar 21, 2019 at 08:27:40AM -0400, Brian Foster wrote: > > > > > > > On Thu, Mar 21, 2019 at 08:10:19AM +1100, Dave Chinner wrote: > > > > > > > > On Wed, Mar 20, 2019 at 08:02:05AM -0400, Brian Foster wrote: > > > > > > > > > On Wed, Mar 20, 2019 at 07:25:37AM +1100, Dave Chinner wrote: > > > > > > > > > > > > The broader point is that by controlling writeback ordering, we can > > > > > > > > > > > > explicitly demonstrate stale data exposure. If fixed properly, we should > > > > > > > > > > > > never expect stale data exposure even in the event of out of order > > > > > > > > > > > > writeback completion, regardless of the cause. > > > > > > > > > > > > > > > > > > > > > > Agree. Even if s_f_r is a giant trash fire, if we're exposing stale > > > > > > > > > > > disk contents it's game over anyway and we ought to fix it. > > > > > > > > > > > > > > > > > > > > Well, that's exactly the problem with SFR, isn't it? The dirty data > > > > > > > > > > is outside the range the user asks to sync and the implementation > > > > > > > > > > doesn't go anywhere near the filesystem so we can't actually do the > > > > > > > > > > right thing here. So we are left to either hack fixes around a > > > > > > > > > > shitty, broken interface and everyone pays the penalty, or we ignore > > > > > > > > > > it. We've simply ignored it for the past 10+ years because it really > > > > > > > > > > can't be used safely for it's intended purposes (as everyone who > > > > > > > > > > tries to use it finds out eventually)... 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Just to make this abundantly clear, there is nothing unique, magic or > > > > > > > > > special required of sync_file_range() to reproduce this stale data > > > > > > > > > exposure. See the following variation of a command sequence from the > > > > > > > > > fstest I posted the other day: > > > > > > > > > > > > > > > > > > ... > > > > > > > > > # xfs_io -fc "pwrite 0 68k" -c fsync -c "pwrite 96k 4k" \ > > > > > > > > > -c "fpunch 96k 4k" <file> > > > > > > > > > ... > > > > > > > > > # hexdump <file> > > > > > > > > > 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd > > > > > > > > > * > > > > > > > > > 0011000 abab abab abab abab abab abab abab abab > > > > > > > > > * > > > > > > > > > 0018000 0000 0000 0000 0000 0000 0000 0000 0000 > > > > > > > > > * > > > > > > > > > 0019000 > > > > > > > > > > > > > > > > > > Same problem, and it can be reproduced just as easily with a reflink or > > > > > > > > > even an overlapping direct I/O. > > > > > > > > > > > > > > > > How does overlapping direct IO cause stale data exposure given that > > > > > > > > it uses unwritten extent allocation? That, of course, is ignoring > > > > > > > > the fact that the result of overlapping concurent direct IO is, by > > > > > > > > definition, undefined and an application bug if it occurs.... > > > > > > > > > > > > > > > > > > > > > > It's overlapping a buffered write and so starts with a flush instead of > > > > > > > > > > > > Ok, but that's not the normal usage of the phrase "overlapping > > > > > > direct IO". Overlapping direct IO means two concurrent direct IOs > > > > > > to the same LBA range in the device underlying the filesystem. i.e. > > > > > > they overlap in both the temporal and physical (device LBA) domains. > > > > > > > > > > > > Subtle difference, yes, but a very important one. > > > > > > > > > > > > > > > > Yeah, I just referenced it in passing and was unclear. 
The point is just > > > > > that there are multiple vectors to the problem, some of which are sane > > > > > sequences and some apparently not, and I'd rather not get caught up in > > > > > the minutiae of why some of those sequences are not sane. I'm trying to > > > > > see if we can find a solution to the fundamental data exposure problem. > > > > > > > > Is there a way to increase the amount of writeback? Say for example > > > > that we have a delalloc reservation for blocks 0-9, and writeback asks > > > > us to write back block 7. If we succeed at converting the entire > > > > 10-block reservation into a real extent, why can't we come back and say > > > > "Hey, we also need to flush the pages for blocks 0-6,8-9"? Are we going > > > > to need something like that for blocksize > pagesize filesystems too? > > > > > > > > > > That might be possible in various places. In fact, I think we used to do > > > something like this in XFS (perhaps not for this purpose). I just jumped > > > back to v2.6.32 and see that the xfs_cluster_write() mechanism still > > > exists at that point down in the ->writepage() path. > > > > We /really/ don't want to go back to that. This caused us to have to > > essentially implement write_cache_pages() inside ->writepage. > > > > Agreed.. > > > If we want to expand the writeback range, (i.e. move the offset we > > start at) then we need to do that in xfs_vm_writepages(), before we > > call write_cache_pages(). This means all the tagged writepages stuff > > will continue to work, and we don't end up doing redundant page > > cache lookups because we're trying to write back the same range from > > two different places. > > > > IMO, messing around with writeback like this is a mitigation and/or a > contributer to a broader solution, not a fundamental fix by itself. For > example, I could see doing something like that if we already put proper > writeback vs. 
append transaction ordering in place and we identify > certain corner cases where we want to prevent delaying an on-disk i_size > update for too long because a user indirectly triggered writeback of a > high index page. > > > > I'm not sure if doing that alone is worth the effort if you consider > > > it's more of a mitigation than a solution to the underlying problem. > > > Once we convert the associated delalloc extent, we're open to stale data > > > exposure (assuming we're not protected by on-disk EOF and whatnot) until > > > all of the remaining pages over the extent are written back, whether > > > that occurs in the current writeback cycle or a subsequent one. > > > > If we were to do a read iomap lookup in xfs_vm_writepages() we could > > determine if the delalloc conversion would allocate a range outside > > the current write that is bing flushed, and either modify the > > writeback control range/offset to match the delalloc extent. > > > > however, this runs the risk of breaking all the fairness algorithms > > in high level writeback, so we'd need to be very careful about how > > we extend the write range for writeback inside EOF.... > > > > Indeed. The more I think about this, the more I think we'd have an > easier time trying to schedule the on-disk i_size update vs. trying to > control the size and range of writeback in all possible scenarios. The > former would still require non-trivial tracking and logic, but that's > something writeback control described above could actually help with. > > For example, we don't currently have any way to track "stale" state of > just allocated but not yet fully written extents. If we established a > rule that we always wrote back such extents in entirety and via a single > ioend, then writeback completion code would simply need to check for > delalloc extents within in-core eof to determine when it's safe to > update on-disk eof. 
If there are cases where we determine we can't > writeback an entire delalloc extent because of some fairness reasons or > whatever, we could decide to only physically allocate the subset of that > delalloc extent that we can write back rather than the whole thing > (trading off stale data exposure for some increased fragmentation). That > way writeback completion still sees existence of delalloc blocks and > knows to hold off on the on-disk size update. > > One challenge with an approach like that is if we do skip completion > time on-disk i_size updates, we'd have to make sure that those updates > still happen eventually if the address space happens to change outside > of writeback. For example, if those remaining dirty delalloc pages > somehow end up invalidated and never become real blocks, we may never > see another writeback cycle on the inode and we'd need to update on-disk > size from some other context (in a timely fashion). > > Of course this only deals with the append case. We'd still need to > select a mechanism for dealing with inside EOF allocations (such as the > whole write-intent thing or unwritten extents). To me, the intent thing > still seems like the most simple/elegant option we've discussed so far > because the potential for low impact. Part of me thinks that if we > determine we'll fall back to that approach for certain cases anyways > (like inside EOF allocs), it might be worth just doing that first and > evaluate using it unconditionally. /me wonders if we could solve this by using a bitmap to track all the in-flight post-ondisk-eof writeback such that when a particular writeback completes it can check if there aren't any writeback at a lower offset still in progress and only then update the ondisk eof? (Where "a bitmap or something" could just use another instance of Dave's range lock structure...) --D > Brian > > > Cheers, > > > > Dave. > > -- > > Dave Chinner > > david@fromorbit.com
On Fri, Apr 05, 2019 at 08:12:32AM -0700, Darrick J. Wong wrote: > On Fri, Apr 05, 2019 at 07:24:44AM -0400, Brian Foster wrote: > > On Fri, Apr 05, 2019 at 09:08:45AM +1100, Dave Chinner wrote: > > > On Thu, Apr 04, 2019 at 08:50:54AM -0400, Brian Foster wrote: > > > If we want to expand the writeback range, (i.e. move the offset we > > > start at) then we need to do that in xfs_vm_writepages(), before we > > > call write_cache_pages(). This means all the tagged writepages stuff > > > will continue to work, and we don't end up doing redundant page > > > cache lookups because we're trying to write back the same range from > > > two different places. > > > > > > > IMO, messing around with writeback like this is a mitigation and/or a > > contributer to a broader solution, not a fundamental fix by itself. For > > example, I could see doing something like that if we already put proper > > writeback vs. append transaction ordering in place and we identify > > certain corner cases where we want to prevent delaying an on-disk i_size > > update for too long because a user indirectly triggered writeback of a > > high index page. *nod* Yeah, the problem is when we come to really large files on large memory machines where there might be gigabytes of dirty pages over a relatively slow disk... > > > > I'm not sure if doing that alone is worth the effort if you consider > > > > it's more of a mitigation than a solution to the underlying problem. > > > > Once we convert the associated delalloc extent, we're open to stale data > > > > exposure (assuming we're not protected by on-disk EOF and whatnot) until > > > > all of the remaining pages over the extent are written back, whether > > > > that occurs in the current writeback cycle or a subsequent one. 
> > > > > > If we were to do a read iomap lookup in xfs_vm_writepages() we could > > > determine if the delalloc conversion would allocate a range outside > > > the current write that is bing flushed, and either modify the > > > writeback control range/offset to match the delalloc extent. > > > > > > however, this runs the risk of breaking all the fairness algorithms > > > in high level writeback, so we'd need to be very careful about how > > > we extend the write range for writeback inside EOF.... > > > > > > > Indeed. The more I think about this, the more I think we'd have an > > easier time trying to schedule the on-disk i_size update vs. trying to > > control the size and range of writeback in all possible scenarios. The > > former would still require non-trivial tracking and logic, but that's > > something writeback control described above could actually help with. > > > > For example, we don't currently have any way to track "stale" state of > > just allocated but not yet fully written extents. If we established a > > rule that we always wrote back such extents in entirety and via a single > > ioend, then writeback completion code would simply need to check for > > delalloc extents within in-core eof to determine when it's safe to > > update on-disk eof. The problem with this is that it either limits the size of the allocation we can make (due to writeback chunk size) or we break the writeback fairness algorithms by having to write back allocation sized chunks... > > If there are cases where we determine we can't > > writeback an entire delalloc extent because of some fairness reasons or > > whatever, we could decide to only physically allocate the subset of that > > delalloc extent that we can write back rather than the whole thing > > (trading off stale data exposure for some increased fragmentation). That > > way writeback completion still sees existence of delalloc blocks and > > knows to hold off on the on-disk size update. 
The moment we get concurrent writes we are going to interleave and fragment the files with this approach. It's the large speculative delalloc beyond EOF that prevents fragmentation in this case, so the moment we decide that we aren't going to allocate the entire delalloc extent we essentially throw away a significant amount of the benefit that the speculative delalloc provides us with. > > One challenge with an approach like that is if we do skip completion > > time on-disk i_size updates, we'd have to make sure that those updates > > still happen eventually if the address space happens to change outside > > of writeback. For example, if those remaining dirty delalloc pages > > somehow end up invalidated and never become real blocks, we may never > > see another writeback cycle on the inode and we'd need to update on-disk > > size from some other context (in a timely fashion). > > > > Of course this only deals with the append case. We'd still need to > > select a mechanism for dealing with inside EOF allocations (such as the > > whole write-intent thing or unwritten extents). To me, the intent thing > > still seems like the most simple/elegant option we've discussed so far > > because the potential for low impact. Part of me thinks that if we > > determine we'll fall back to that approach for certain cases anyways > > (like inside EOF allocs), it might be worth just doing that first and > > evaluate using it unconditionally. > > /me wonders if we could solve this by using a bitmap to track all the > in-flight post-ondisk-eof writeback such that when a particular > writeback completes it can check if there aren't any writeback at a > lower offset still in progress and only then update the ondisk eof? > > (Where "a bitmap or something" could just use another instance of Dave's > range lock structure...) I'm not sure we need to do that. Perhaps all we need is a new in-core iext_tree record extent type flag. i.e.
we have an unwritten flag that mirrors the on-disk unwritten flag, but what we are really talking about here is having a new in-memory extent state that is "allocated but contains no data". This flag never makes it to disk. When we allocate a delalloc region we tag the entire extent with this magic flag, and as IO completes we do the equivalent of unwritten extent conversion except it only touches the in-memory tree records. When the entire range between the current on-disk EOF and the end of this completing IO has the "not yet written" flag cleared, we then log the EOF size change. For non-random IO, we'll essentially just be trimming one extent and extending another, so it should be pretty fast. If we want to add intents (so it works inside EOF), then we can simply log a new DONE (or UPDATE) intent every time we convert a range. That would leave log recovery with an allocation intent plus a series of updates that document written ranges. This seems to me like it solves the tracking problem for both cases and provides a simple hook for logging intent completions. It also seems to me like it would work for direct IO, too, getting rid of the need for unconditional unwritten extent allocation and conversion. That's potentially a big win for database and VM folks who want to write into sparse files safely.... Cheers, Dave.
On Mon, Apr 08, 2019 at 10:17:41AM +1000, Dave Chinner wrote: > On Fri, Apr 05, 2019 at 08:12:32AM -0700, Darrick J. Wong wrote: > > On Fri, Apr 05, 2019 at 07:24:44AM -0400, Brian Foster wrote: > > > On Fri, Apr 05, 2019 at 09:08:45AM +1100, Dave Chinner wrote: > > > > On Thu, Apr 04, 2019 at 08:50:54AM -0400, Brian Foster wrote: > > > > If we want to expand the writeback range, (i.e. move the offset we > > > > start at) then we need to do that in xfs_vm_writepages(), before we > > > > call write_cache_pages(). This means all the tagged writepages stuff > > > > will continue to work, and we don't end up doing redundant page > > > > cache lookups because we're trying to write back the same range from > > > > two different places. > > > > > > > > > > IMO, messing around with writeback like this is a mitigation and/or a > > > contributer to a broader solution, not a fundamental fix by itself. For > > > example, I could see doing something like that if we already put proper > > > writeback vs. append transaction ordering in place and we identify > > > certain corner cases where we want to prevent delaying an on-disk i_size > > > update for too long because a user indirectly triggered writeback of a > > > high index page. > > *nod* > > Yeah, the problem is when we come to really large files on large > memory machines where there might be gigabytes of dirty pages over a > relatively slow disk... > > > > > > I'm not sure if doing that alone is worth the effort if you consider > > > > > it's more of a mitigation than a solution to the underlying problem. > > > > > Once we convert the associated delalloc extent, we're open to stale data > > > > > exposure (assuming we're not protected by on-disk EOF and whatnot) until > > > > > all of the remaining pages over the extent are written back, whether > > > > > that occurs in the current writeback cycle or a subsequent one. 
> > > > > > > > If we were to do a read iomap lookup in xfs_vm_writepages() we could > > > > determine if the delalloc conversion would allocate a range outside > > > > the current write that is bing flushed, and either modify the > > > > writeback control range/offset to match the delalloc extent. > > > > > > > > however, this runs the risk of breaking all the fairness algorithms > > > > in high level writeback, so we'd need to be very careful about how > > > > we extend the write range for writeback inside EOF.... > > > > > > > > > > Indeed. The more I think about this, the more I think we'd have an > > > easier time trying to schedule the on-disk i_size update vs. trying to > > > control the size and range of writeback in all possible scenarios. The > > > former would still require non-trivial tracking and logic, but that's > > > something writeback control described above could actually help with. > > > > > > For example, we don't currently have any way to track "stale" state of > > > just allocated but not yet fully written extents. If we established a > > > rule that we always wrote back such extents in entirety and via a single > > > ioend, then writeback completion code would simply need to check for > > > delalloc extents within in-core eof to determine when it's safe to > > > update on-disk eof. > > The problem with this is that it either limits the size of the > allocation we can make (due to writeback chunk size) or we break the > writeback fairness algorithms by having to write back allocation > sized chunks... > > > > If there are cases where we determine we can't > > > writeback an entire delalloc extent because of some fairness reasons or > > > whatever, we could decide to only physically allocate the subset of that > > > delalloc extent that we can write back rather than the whole thing > > > (trading off stale data exposure for some increased fragmentation). 
That > > > way writeback completion still sees existence of delalloc blocks and > > > knows to hold off on the on-disk size update. > > The moment we get concurrent writes we are going to interleave and > fragment the files with this approach. It's the large speculative > delalloc beyond EOF that prevents fragmentation in the this case, so > the moment we decide that we aren't going to allocate the entire > delalloc extent we essentially throw away a significant amount of > the benefit that the speculative delalloc provides us with. > Yep, that's the tradeoff. Note that the idea was not to restrict delalloc conversions to writeback chunk size. That would outright lead to serious fragmentation. The thought was based on the previously mentioned concept of artificially increasing writeback ranges in XFS. So if writeback only asked for a few pages but we decided to do the old "cluster write" thing because we're over a multi-GB delalloc extent, then we don't necessarily have to make a black and white decision between writing back just the pages asked for or GBs of pages over the entire extent. We could always limit the delalloc conversion to something in-between if there is a reasonable balance between increasing the writeback range and not introducing too much fragmentation. That is completely non-deterministic and probably would require some experimentation to determine practicality. I also think it's reasonable to say "we have delalloc for a reason, don't do that." :P Anyways, my preference is still to avoid messing with writeback at all if we can help it so I wouldn't go down this path unless we ruled out other options. Note that the obvious next step in complexity to that is to reuse the COW fork infrastructure for newly allocated extents such that we only map allocated blocks to the file after they are written. That seems interesting to me because it relies on mechanisms we already have in place. 
Not only do the mechanisms exist, but this is arguably already (partially) prototyped via Christoph's debug mode always_cow stuff. generic/536 passes on a quick test of a reflink=1 fs with always_cow enabled, for example. With that, we wouldn't need to play games with writeback ranges, size updates or new intent types. This would depend on appropriate feature infrastructure being enabled on the fs and I'm not sure how performance would stack up against more passive approaches like those we're discussing below.. > > > One challenge with an approach like that is if we do skip completion > > > time on-disk i_size updates, we'd have to make sure that those updates > > > still happen eventually if the address space happens to change outside > > > of writeback. For example, if those remaining dirty delalloc pages > > > somehow end up invalidated and never become real blocks, we may never > > > see another writeback cycle on the inode and we'd need to update on-disk > > > size from some other context (in a timely fashion). > > > > > > Of course this only deals with the append case. We'd still need to > > > select a mechanism for dealing with inside EOF allocations (such as the > > > whole write-intent thing or unwritten extents). To me, the intent thing > > > still seems like the most simple/elegant option we've discussed so far > > > because the potential for low impact. Part of me thinks that if we > > > determine we'll fall back to that approach for certain cases anyways > > > (like inside EOF allocs), it might be worth just doing that first and > > > evaluate using it unconditionally. > > > > /me wonders if we could solve this by using a bitmap to track all the > > in-flight post-ondisk-eof writeback such that when a particular > > writeback completes it can check if there aren't any writeback at a > > lower offset still in progress and only then update the ondisk eof? 
> > > > (Where "a bitmap or something" could just use another instance of Dave's > > range lock structure...) > > I'm not sure we need to do that. Perhaps all we need is a new > in-core iext_tree record extent type flag. i.e. we have an > unwritten flag that mirrors the on-disk unwritten flag, but what we > are really talking about here is having a new in-memory extent state > that is "allocated but contains no data". > > This flag never makes it to disk. When we allocate a delalloc region > we tag the entire extent in with this magic flag, and as IO > completes we do the equivalent of unwritten extent conversion except > it only touches the in-memory tree records. When the entire range > between the current on-disk EOF and end of this completing IO has > the "not yet written" flag cleared, we then log the EOF size change. > This is pretty much what I was handwaving a bit about earlier with regard to using existing in-core extent tracking to reflect "stale" state (where "stale" means these physical blocks currently contain stale data). The core mechanism sounds reasonable, but there's still some engineering required to work out the details. For example, we'd need to be able to identify the last "not yet written" extent of a particular file range so we know when to log the size update, we may need to be able to make sure the size update happens if it's possible for some of those "not yet written" extents to become invalidated without ever being written back... > For non-random IO, we'll essentially just be trimming one extent and > extending another, so it should be pretty fast. If we want to add > intents (so it works inside EOF), then we can simply log a new DONE > (or UPDATE) intent every time we convert a range. That would leave > log recovery with an allocation intent plus a series of updates that > document written ranges. > ... and we may need to consider how to balance deferring a size update vs. 
making a size update and causing all subsequent (within eof) writebacks require intents (or whatever). I think these are all solvable problems though. That said, if the intent thing turned out to be dirt cheap, it could make the added "stale" extent tracking logic kind of pointless. OTOH, I suppose the opposite is certainly possible as well, I haven't thought either through enough beyond that they both seem plausible options. > This seems to me like it solves the tracking problem for both cases > and provides a simple hook for logging intent completions. It also > seems to me like it would work for direct IO, too, getting rid of > the need for unconditional unwritten extent allocation and > conversion. That's potentially a big win for database and VM folks > who want to write into sparse files safely.... > I hadn't thought about direct I/O, but that makes sense. It sounds like there's enough potential for an RFC/POC here. ;) Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index 931edfdca22e..dae5f1734297 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -4077,16 +4077,18 @@ xfs_bmapi_allocate( bma->got.br_state = XFS_EXT_NORM; /* - * In the data fork, a wasdelay extent has been initialized, so - * shouldn't be flagged as unwritten. + * In the data fork, the pages backed by a delalloc extent have been + * dirtied but not yet written to disk, so the allocation should be + * marked unwritten. Only after writeback completes successfully + * should we convert those mappings to real, so that a crash during + * writeback won't expose stale disk contents. * * For the cow fork, however, we convert delalloc reservations * (extents allocated for speculative preallocation) to * allocated unwritten extents, and only convert the unwritten * extents to real extents when we're about to write the data. */ - if ((!bma->wasdel || (bma->flags & XFS_BMAPI_COWFORK)) && - (bma->flags & XFS_BMAPI_PREALLOC)) + if (bma->flags & XFS_BMAPI_PREALLOC) bma->got.br_state = XFS_EXT_UNWRITTEN; if (bma->wasdel) @@ -4496,8 +4498,9 @@ xfs_bmapi_convert_delalloc( bma.length = max_t(xfs_filblks_t, bma.got.br_blockcount, MAXEXTLEN); bma.total = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK); bma.minleft = xfs_bmapi_minleft(tp, ip, whichfork); + bma.flags = XFS_BMAPI_PREALLOC; if (whichfork == XFS_COW_FORK) - bma.flags = XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC; + bma.flags |= XFS_BMAPI_COWFORK; if (!xfs_iext_peek_prev_extent(ifp, &bma.icur, &bma.prev)) bma.prev.br_startoff = NULLFILEOFF;