[2/4] xfs: force writes to delalloc regions to unwritten

Message ID 155259894630.30230.10064390935593758177.stgit@magnolia
State New
Series
  • xfs: various rough fixes

Commit Message

Darrick J. Wong March 14, 2019, 9:29 p.m. UTC
From: Darrick J. Wong <darrick.wong@oracle.com>

When writing to a delalloc region in the data fork, commit the new
allocations (of the da reservation) as unwritten so that the mappings
are only marked written once writeback completes successfully.  This
fixes the problem of stale data exposure if the system goes down during
targeted writeback of a specific region of a file, as tested by
generic/042.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c |   13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

Comments

Dave Chinner March 14, 2019, 11:08 p.m. UTC | #1
On Thu, Mar 14, 2019 at 02:29:06PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> When writing to a delalloc region in the data fork, commit the new
> allocations (of the da reservation) as unwritten so that the mappings
> are only marked written once writeback completes successfully.  This
> fixes the problem of stale data exposure if the system goes down during
> targeted writeback of a specific region of a file, as tested by
> generic/042.

So what does this do to buffered sequential and random write
performance?

Next, if the entire delalloc extent being allocated is beyond the
on-disk EOF (typical for extending sequential buffered writes), why
do we want those to be allocated as unwritten? i.e. isn't "allocate
as unwritten" only necessary for delalloc extent allocation
inside EOF?

Cheers,

Dave.
Darrick J. Wong March 15, 2019, 3:13 a.m. UTC | #2
On Fri, Mar 15, 2019 at 10:08:41AM +1100, Dave Chinner wrote:
> On Thu, Mar 14, 2019 at 02:29:06PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > When writing to a delalloc region in the data fork, commit the new
> > allocations (of the da reservation) as unwritten so that the mappings
> > are only marked written once writeback completes successfully.  This
> > fixes the problem of stale data exposure if the system goes down during
> > targeted writeback of a specific region of a file, as tested by
> > generic/042.
> 
> So what does this do to buffered sequential and random write
> performance?

Whacks it quite a bit -- 10-20% depending on how fast the storage is in
the first place.  I barely noticed on my usb thumb drive, however. :P

That said, shovelling that many writes through the completion workqueues
creates a thundering herd problem on the ILOCK so maybe it wouldn't be
so bad if we simply dumped the completions on a per-inode queue and only
had one thread handling the completions.

(As we've been discussing on IRC.)

> Next, if the entire delalloc extent being allocated is beyond the
> on-disk EOF (typical for extending sequential buffered writes), why
> do we want those to be allocated as unwritten? i.e. isn't "allocate
> as unwritten" only necessary for delalloc extent allocation
> inside EOF?

I wasn't sure on this point -- for delalloc extents that start at or
above EOF, we can continue to go straight to real like we do today.  We
still have to use the setfilesize transaction to update isize on disk,
so it probably doesn't make that big of a difference.

If the delalloc extent crosses EOF then I think it makes sense to map it
in as unwritten, issue the write, and convert the extent to real during
io completion, and not split it up just to avoid having unwritten
extents past EOF.

IOWS, there wasn't any particular intentionality behind having the code
not set PREALLOC if the extent is totally beyond EOF; this was just the
bare minimum to get a discussion started. :)

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner March 15, 2019, 3:40 a.m. UTC | #3
On Thu, Mar 14, 2019 at 08:13:03PM -0700, Darrick J. Wong wrote:
> On Fri, Mar 15, 2019 at 10:08:41AM +1100, Dave Chinner wrote:
> > On Thu, Mar 14, 2019 at 02:29:06PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > When writing to a delalloc region in the data fork, commit the new
> > > allocations (of the da reservation) as unwritten so that the mappings
> > > are only marked written once writeback completes successfully.  This
> > > fixes the problem of stale data exposure if the system goes down during
> > > targeted writeback of a specific region of a file, as tested by
> > > generic/042.
> > 
> > So what does this do to buffered sequential and random write
> > performance?
> 
> Whacks it quite a bit -- 10-20% depending on how fast the storage is in
> the first place.  I barely noticed on my usb thumb drive, however. :P

I figured that would be the case, with random writes being much
worse...

> That said, shovelling that many writes through the completion workqueues
> creates a thundering herd problem on the ILOCK so maybe it wouldn't be
> so bad if we simply dumped the completions on a per-inode queue and only
> had one thread handling the completions.
> 
> (As we've been discussing on IRC.)

*nod*

> > Next, if the entire delalloc extent being allocated is beyond the
> > on-disk EOF (typical for extending sequential buffered writes), why
> > do we want those to be allocated as unwritten? i.e. isn't "allocate
> > as unwritten" only necessary for delalloc extent allocation
> > inside EOF?
> 
> I wasn't sure on this point -- for delalloc extents that start at or
> above EOF, we can continue to go straight to real like we do today.  We
> still have to use the setfilesize transaction to update isize on disk,
> so it probably doesn't make that big of a difference.

We have to keep the setfilesize completion code, anyway (think
partial last block file extensions), but the setfilesize stuff is
much, much lower overhead than unwritten extent conversion so I
think we want to avoid unwritten extents wherever they are
unnecessary.


> If the delalloc extent crosses EOF then I think it makes sense to map it
> in as unwritten, issue the write, and convert the extent to real during
> io completion, and not split it up just to avoid having unwritten
> extents past EOF.

Yup, makes sense. For a sequential write, they should always be
beyond EOF. For anything else, we use unwritten.

> IOWS, there wasn't any particular intentionality behind having the code
> not set PREALLOC if the extent is totally beyond EOF; this was just the
> bare minimum to get a discussion started. :)

*nod*

Cheers,

Dave.
Brian Foster March 15, 2019, 12:29 p.m. UTC | #4
On Fri, Mar 15, 2019 at 02:40:38PM +1100, Dave Chinner wrote:
> On Thu, Mar 14, 2019 at 08:13:03PM -0700, Darrick J. Wong wrote:
> > On Fri, Mar 15, 2019 at 10:08:41AM +1100, Dave Chinner wrote:
> > > On Thu, Mar 14, 2019 at 02:29:06PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > When writing to a delalloc region in the data fork, commit the new
> > > > allocations (of the da reservation) as unwritten so that the mappings
> > > > are only marked written once writeback completes successfully.  This
> > > > fixes the problem of stale data exposure if the system goes down during
> > > > targeted writeback of a specific region of a file, as tested by
> > > > generic/042.
> > > 
> > > So what does this do to buffered sequential and random write
> > > performance?
> > 
> > Whacks it quite a bit -- 10-20% depending on how fast the storage is in
> > the first place.  I barely noticed on my usb thumb drive, however. :P
> 
> I figured that would be the case, with random writes being much
> worse...
> 
> > That said, shovelling that many writes through the completion workqueues
> > creates a thundering herd problem on the ILOCK so maybe it wouldn't be
> > so bad if we simply dumped the completions on a per-inode queue and only
> > had one thread handling the completions.
> > 
> > (As we've been discussing on IRC.)
> 
> *nod*
> 
> > > Next, if the entire delalloc extent being allocated is beyond the
> > > on-disk EOF (typical for extending sequential buffered writes), why
> > > do we want those to be allocated as unwritten? i.e. isn't "allocate
> > > as unwritten" only necessary for delalloc extent allocation
> > > inside EOF?
> > 
> > I wasn't sure on this point -- for delalloc extents that start at or
> > above EOF, we can continue to go straight to real like we do today.  We
> > still have to use the setfilesize transaction to update isize on disk,
> > so it probably doesn't make that big of a difference.
> 
> We have to keep the setfilesize completion code, anyway (think
> partial last block file extensions), but the setfilesize stuff is
> much, much lower overhead than unwritten extent conversion so I
> think we want to avoid unwritten extents wherever they are
> unnecessary.
> 

The additional tradeoff to consider for fully post-eof extents is
eliminating the need to zero them out entirely on extending writes (or
truncates) that jump over them but pull them within EOF. That could
eliminate quite a bit of data I/O with large enough preallocations in
place.

Of course, this is a different (and probably less common) write pattern
from straightforward sequential writes, but IIRC it comes for free
simply by mapping extents as unwritten since the zeroing code is smart
enough to know when zeroing is unnecessary. It would be nice if we could
figure out a way to take advantage of that optimization without
disrupting performance or whatever of the sequential case. Though I
wouldn't suggest this as a blocker to addressing the stale data exposure
problem, of course. (Another angle could be to determine when extent
conversion is cheaper than data zeroing and do the former in place
of the latter on file extension..).

> 
> > If the delalloc extent crosses EOF then I think it makes sense to map it
> > in as unwritten, issue the write, and convert the extent to real during
> > io completion, and not split it up just to avoid having unwritten
> > extents past EOF.
> 
> Yup, makes sense. For a sequential write, they should always be
> beyond EOF. For anything else, we use unwritten.
> 

I'm not quite sure I follow the suggested change in behavior here tbh so
maybe this is not an issue, but another thing to consider is whether
selective delalloc -> real conversion for post-eof extents translates to
conditional stale data exposure vectors in certain file extension
scenarios. I think even the post-eof zeroing code may be susceptible to
this problem if the log happens to push after a truncate transaction
commits but before the associated dirty pages are written back..

Brian

> > IOWS, there wasn't any particular intentionality behind having the code
> > not set PREALLOC if the extent is totally beyond EOF; this was just the
> > bare minimum to get a discussion started. :)
> 
> *nod*
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner March 17, 2019, 10:40 p.m. UTC | #5
On Fri, Mar 15, 2019 at 08:29:40AM -0400, Brian Foster wrote:
> On Fri, Mar 15, 2019 at 02:40:38PM +1100, Dave Chinner wrote:
> > > If the delalloc extent crosses EOF then I think it makes sense to map it
> > > in as unwritten, issue the write, and convert the extent to real during
> > > io completion, and not split it up just to avoid having unwritten
> > > extents past EOF.
> > 
> > Yup, makes sense. For a sequential write, they should always be
> > beyond EOF. For anything else, we use unwritten.
> > 
> 
> I'm not quite sure I follow the suggested change in behavior here tbh so
> maybe this is not an issue, but another thing to consider is whether
> selective delalloc -> real conversion for post-eof extents translates to
> conditional stale data exposure vectors in certain file extension
> scenarios.

No, it doesn't because the transaction that moves EOF is done after
the data write into the region it exposes is complete. Hence the
device cache flush that occurs before the log write containing the
transaction that moves the EOF will always result in the data being
on stable storage *before* the extending size transaction hits the
journal and exposes it.

> I think even the post-eof zeroing code may be susceptible to
> this problem if the log happens to push after a truncate transaction
> commits but before the associated dirty pages are written back..

That's what this code in xfs_setattr_size() handles:

        /*
         * We are going to log the inode size change in this transaction so
         * any previous writes that are beyond the on disk EOF and the new
         * EOF that have not been written out need to be written here.  If we
         * do not write the data out, we expose ourselves to the null files
         * problem. Note that this includes any block zeroing we did above;
         * otherwise those blocks may not be zeroed after a crash.
         */
        if (did_zeroing ||
            (newsize > ip->i_d.di_size && oldsize != ip->i_d.di_size)) {
                error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
                                                ip->i_d.di_size, newsize - 1);
                if (error)
                        return error;
        }

        error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);

i.e. it ensures the data and post-eof zeroing is on disk before the
truncate transaction is started. We already hold the IOLOCK at this
point and drained in-flight direct IO, so no IO that can affect the
truncate region will begin or be in progress after the flush and
wait above is complete.

Cheers,

Dave.
Brian Foster March 18, 2019, 12:40 p.m. UTC | #6
On Mon, Mar 18, 2019 at 09:40:11AM +1100, Dave Chinner wrote:
> On Fri, Mar 15, 2019 at 08:29:40AM -0400, Brian Foster wrote:
> > On Fri, Mar 15, 2019 at 02:40:38PM +1100, Dave Chinner wrote:
> > > > If the delalloc extent crosses EOF then I think it makes sense to map it
> > > > in as unwritten, issue the write, and convert the extent to real during
> > > > io completion, and not split it up just to avoid having unwritten
> > > > extents past EOF.
> > > 
> > > Yup, makes sense. For a sequential write, they should always be
> > > beyond EOF. For anything else, we use unwritten.
> > > 
> > 
> > I'm not quite sure I follow the suggested change in behavior here tbh so
> > maybe this is not an issue, but another thing to consider is whether
> > selective delalloc -> real conversion for post-eof extents translates to
> > conditional stale data exposure vectors in certain file extension
> > scenarios.
> 
> No, it doesn't because the transaction that moves EOF is done after
> the data write into the region it exposes is complete. Hence the
> device cache flush that occurs before the log write containing the
> transaction that moves the EOF will always result in the data being
> on stable storage *before* the extending size transaction hits the
> journal and exposes it.
> 

Ok, but this ordering alone doesn't guarantee the data we've just
written is on-disk before the transaction hits the log.

> > I think even the post-eof zeroing code may be susceptible to
> > this problem if the log happens to push after a truncate transaction
> > commits but before the associated dirty pages are written back..
> 
> That's what this code in xfs_setattr_size() handles:
> 
>         /*
>          * We are going to log the inode size change in this transaction so
>          * any previous writes that are beyond the on disk EOF and the new
>          * EOF that have not been written out need to be written here.  If we
>          * do not write the data out, we expose ourselves to the null files
>          * problem. Note that this includes any block zeroing we did above;
>          * otherwise those blocks may not be zeroed after a crash.
>          */
>         if (did_zeroing ||
>             (newsize > ip->i_d.di_size && oldsize != ip->i_d.di_size)) {
>                 error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
>                                                 ip->i_d.di_size, newsize - 1);
>                 if (error)
>                         return error;
>         }
> 

Ah, I forgot about this code. So this combined with the above provides
correctness for this particular case with respect to stale data exposure
after a crash. AFAICT this is still a performance tradeoff as the
zeroing may not be necessary if the post-eof extents happened to be
unwritten. This optimization is also achieved "for free" in this path
because did_zeroing is only set if iomap had to physically zero some
pages.

This also only covers one extension scenario. It looks like the typical
write extend scenario is intended to be handled similarly by submitting
the di_size update transaction from data write completion. It looks to
me that this should be fairly robust in most common cases (i.e.,
background writeback and/or fsync()), but I'm not totally convinced this
is always robust because write time extension doesn't perform the same
kind of writeback ordering as the truncate example above.

Consider sync_file_range() for example. Suppose we have a large,
physical post-eof prealloc extent and perform a sparse extended write to
the last block of that extent followed by a sync_file_range() of the
just written data. The file write should physically zero the pagecache
for the preallocated region before the write, then the proper write
completes and returns. Nothing has changed on disk so far. A
sync_file_range() call sets the range start/end in the writeback_control
and works its way through __filemap_fdatawrite_range() to
write_cache_pages() and our writeback mechanism for the requested pages
only. From there, the same logic applies as for background writeback: we
issue writeback and I/O completion commits a file extension transaction.
AFAICT, nothing has guaranteed writeback has even initiated on any of
the zeroed pages over the non-unwritten extent that has just been pulled
within EOF, which means we are susceptible to stale data exposure until
that happens.

While reasoning about this I realized this should be pretty easy to
test:

- pre-seed a small fs with a data pattern (0xab)
- mount with a 1mb alloc size (for ease of test)
- run something like the following (uses default 0xcd data pattern)

# xfs_io -f -x -c "pwrite 0 4k" -c fsync -c "pwrite 32k 4k" \
	-c "sync_range -w 32k 4k" -c "sync_range -b 32k 4k" \
	-c "shutdown -f" /mnt/file

- remount and check the file:

# hexdump /mnt/file
0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
*
0001000 abab abab abab abab abab abab abab abab
*
0008000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
*
0009000

... which clearly shows stale data exposure. Hmm, this might actually be
a good fstest if we don't have one already. 

Brian

>         error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
> 
> i.e. it ensures the data and post-eof zeroing is on disk before the
> truncate transaction is started. We already hold the IOLOCK at this
> point and drained in-flight direct IO, so no IO that can affect the
> truncate region will begin or be in progress after the flush and
> wait above is complete.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner March 18, 2019, 8:35 p.m. UTC | #7
On Mon, Mar 18, 2019 at 08:40:39AM -0400, Brian Foster wrote:
> On Mon, Mar 18, 2019 at 09:40:11AM +1100, Dave Chinner wrote:
> > On Fri, Mar 15, 2019 at 08:29:40AM -0400, Brian Foster wrote:
> > > On Fri, Mar 15, 2019 at 02:40:38PM +1100, Dave Chinner wrote:
> > > > > If the delalloc extent crosses EOF then I think it makes sense to map it
> > > > > in as unwritten, issue the write, and convert the extent to real during
> > > > > io completion, and not split it up just to avoid having unwritten
> > > > > extents past EOF.
> > > > 
> > > > Yup, makes sense. For a sequential write, they should always be
> > > > beyond EOF. For anything else, we use unwritten.
> > > > 
> > > 
> > > I'm not quite sure I follow the suggested change in behavior here tbh so
> > > maybe this is not an issue, but another thing to consider is whether
> > > selective delalloc -> real conversion for post-eof extents translates to
> > > conditional stale data exposure vectors in certain file extension
> > > scenarios.
> > 
> > No, it doesn't because the transaction that moves EOF is done after
> > the data write into the region it exposes is complete. Hence the
> > device cache flush that occurs before the log write containing the
> > transaction that moves the EOF will always result in the data being
> > on stable storage *before* the extending size transaction hits the
> > journal and exposes it.
> > 
> 
> Ok, but this ordering alone doesn't guarantee the data we've just
> written is on-disk before the transaction hits the log.

Which transaction are you talking about? This ordering guarantees
that the data is on stable storage before the EOF size change
transaction that exposes the region is on disk (that's the whole
point of updates after data writes).

If you are talking about the allocation transaction that we are
going to write the data into, then you are right that it doesn't
guarantee that the data is on disk before that commits, but when the
allocation is beyond EOF it doesn't matter because it won't be
exposed until after the data is written and the EOF moving
transaction is committed.

As I've previously proposed, what I suspect we should be doing is
not committing the allocation transaction until IO completion, which
closes this stale data exposure hole completely and removes the need
for allocating unwritten extents and then having to convert them.
Direct IO could also do this, and so it could stop using unwritten
extents, too....

> > > I think even the post-eof zeroing code may be susceptible to
> > > this problem if the log happens to push after a truncate transaction
> > > commits but before the associated dirty pages are written back..
> > 
> > That's what this code in xfs_setattr_size() handles:
> > 
> >         /*
> >          * We are going to log the inode size change in this transaction so
> >          * any previous writes that are beyond the on disk EOF and the new
> >          * EOF that have not been written out need to be written here.  If we
> >          * do not write the data out, we expose ourselves to the null files
> >          * problem. Note that this includes any block zeroing we did above;
> >          * otherwise those blocks may not be zeroed after a crash.
> >          */
> >         if (did_zeroing ||
> >             (newsize > ip->i_d.di_size && oldsize != ip->i_d.di_size)) {
> >                 error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
> >                                                 ip->i_d.di_size, newsize - 1);
> >                 if (error)
> >                         return error;
> >         }
> > 
> 
> Ah, I forgot about this code. So this combined with the above provides
> correctness for this particular case with respect to stale data exposure
> after a crash. AFAICT this is still a performance tradeoff as the
> zeroing may not be necessary if the post-eof extents happened to be
> unwritten. This optimization is also achieved "for free" in this path
> because did_zeroing is only set if iomap had to physically zero some
> pages.
> 
> This also only covers one extension scenario. It looks like the typical
> write extend scenario is intended to be handled similarly by submitting
> the di_size update transaction from data write completion. It looks to
> me that this should be fairly robust in most common cases (i.e.,
> background writeback and/or fsync()), but I'm not totally convinced this
> is always robust because write time extension doesn't perform the same
> kind of writeback ordering as the truncate example above.

It depends on IO completion order as to whether writeback is truly
robust. Calling filemap_write_and_wait_range() doesn't change that,
because it doesn't strictly order IO completions and so we can still
expose a region under IO that hasn't been signalled as complete by
the hardware.

> Consider sync_file_range() for example.

sync_file_range() is not a data integrity operation - it does not
ensure disk caches are flushed and so cannot provide any crash
guarantees. Hence we can completely ignore it when we talk about data
integrity operations - it's no different from normal writeback.

> Suppose we have a large,
> physical post-eof prealloc extent and perform a sparse extended write to
> the last block of that extent followed by a sync_file_range() of the
> just written data. The file write should physically zero the pagecache
> for the preallocated region before the write, then the proper write
> completes and returns. Nothing has changed on disk so far. A
> sync_file_range() call sets the range start/end in the writeback_control
> and works its way through __filemap_fdatawrite_range() to
> write_cache_pages() and our writeback mechanism for the requested pages
> only. From there, the same logic applies as for background writeback: we
> issue writeback and I/O completion commits a file extension transaction.
> AFAICT, nothing has guaranteed writeback has even initiated on any of
> the zeroed pages over the non-unwritten extent that has just been pulled
> within EOF, which means we are susceptible to stale data exposure until
> that happens.

Yup, yet another reason why sync_file_range() is useless as a data
integrity operation. I've had to explain this repeatedly to people
over the past 10-15 years, and eventually the man page was updated
with this:

Warning
	This system call is extremely dangerous and should not be
	used in portable programs.  None of these operations writes
	out the file's metadata.  Therefore, unless the
	application is strictly performing overwrites of
	already-instantiated disk blocks, there are no guarantees
	that the data will be available after a crash.  There is no
	user interface to know if a write is purely an overwrite.
	On filesystems using copy-on-write semantics (e.g., btrfs)
	an overwrite of existing allocated blocks is impossible.
	When writing into preallocated space, many filesystems
	also require calls into the block allocator, which this
	system call does not sync out to disk.  This system call
	does not flush disk write caches and thus does not provide
	any data integrity on systems with volatile disk write
	caches.

> 
> While reasoning about this I realized this should be pretty easy to
> test:
> 
> - pre-seed a small fs with a data pattern (0xab)
> - mount with a 1mb alloc size (for ease of test)
> - run something like the following (uses default 0xcd data pattern)
> 
> # xfs_io -f -x -c "pwrite 0 4k" -c fsync -c "pwrite 32k 4k" \
> 	-c "sync_range -w 32k 4k" -c "sync_range -b 32k 4k" \
> 	-c "shutdown -f" /mnt/file
> 
> - remount and check the file:
> 
> # hexdump /mnt/file
> 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
> *
> 0001000 abab abab abab abab abab abab abab abab
> *
> 0008000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
> *
> 0009000
> 
> ... which clearly shows stale data exposure. Hmm, this might actually be
> a good fstest if we don't have one already. 

Yup, but keep in mind this is exactly what is expected from using
sync_file_range(). Friends don't let friends use sync_file_range()
because it's a guaranteed vector for user data corruption and stale
data exposure.

Cheers,

Dave.
Brian Foster March 18, 2019, 9:50 p.m. UTC | #8
On Tue, Mar 19, 2019 at 07:35:06AM +1100, Dave Chinner wrote:
> On Mon, Mar 18, 2019 at 08:40:39AM -0400, Brian Foster wrote:
> > On Mon, Mar 18, 2019 at 09:40:11AM +1100, Dave Chinner wrote:
> > > On Fri, Mar 15, 2019 at 08:29:40AM -0400, Brian Foster wrote:
> > > > On Fri, Mar 15, 2019 at 02:40:38PM +1100, Dave Chinner wrote:
> > > > > > If the delalloc extent crosses EOF then I think it makes sense to map it
> > > > > > in as unwritten, issue the write, and convert the extent to real during
> > > > > > io completion, and not split it up just to avoid having unwritten
> > > > > > extents past EOF.
> > > > > 
> > > > > Yup, makes sense. For a sequential write, they should always be
> > > > > beyond EOF. For anything else, we use unwritten.
> > > > > 
> > > > 
> > > > I'm not quite sure I follow the suggested change in behavior here tbh so
> > > > maybe this is not an issue, but another thing to consider is whether
> > > > selective delalloc -> real conversion for post-eof extents translates to
> > > > conditional stale data exposure vectors in certain file extension
> > > > scenarios.
> > > 
> > > No, it doesn't because the transaction that moves EOF is done after
> > > the data write into the region it exposes is complete. Hence the
> > > device cache flush that occurs before the log write containing the
> > > transaction that moves the EOF will always result in the data being
> > > on stable storage *before* the extending size transaction hits the
> > > journal and exposes it.
> > > 
> > 
> > Ok, but this ordering alone doesn't guarantee the data we've just
> > written is on-disk before the transaction hits the log.
> 
> Which transaction are you talking about? This ordering guarantees
> that the data is on stable storage before the EOF size change
> transaction that exposes the region is on disk (that's the whole
> point of updates after data writes).
> 
> If you are talking about the allocation transaction that we are
> going to write the data into, then you are right that it doesn't
> guarantee that the data is on disk before that commits, but when the
> allocation is beyond EOF it doesn't matter because it won't be
> exposed until after the data is written and the EOF moving
> transaction is committed.
> 

The extending transaction that you referred to above and with respect to
the device cache flush. I'm simply saying that the transaction has to be
ordered with respect to full writeback completion as well, which is also
what you are saying further down. (We're in agreement... ;)).

> As I've previously proposed, what I suspect we should be doing is
> not committing the allocation transaction until IO completion, which
> closes this stale data exposure hole completely and removes the need
> for allocating unwritten extents and then having to convert them.
> Direct IO could also do this, and so it could stop using unwritten
> extents, too....
> 

That sounds like an interesting (and potentially more efficient)
alternative to using unwritten extents across the board only to convert
them as I/Os complete. I guess we'd need to make sure that the
allocation transaction holds across potentially many ioends considering
the max extents size of 8GB or so. I suppose the locking and handling of
fragmented allocation cases could get a bit interesting there as well,
but it's probably worth a prototype attempt at least to survey the
concept... (I wonder if Darrick is still paying attention or found
something more interesting to do..? :D)

I also assume we'd still need to use unwritten extents for cases where
the allocation completes before all of the extent is within EOF (i.e.,
the extent zeroing idea you mentioned on irc..).

> > > > I think even the post-eof zeroing code may be susceptible to
> > > > this problem if the log happens to push after a truncate transaction
> > > > commits but before the associated dirty pages are written back..
> > > 
> > > That's what this code in xfs_setattr_size() handles:
> > > 
> > >         /*
> > >          * We are going to log the inode size change in this transaction so
> > >          * any previous writes that are beyond the on disk EOF and the new
> > >          * EOF that have not been written out need to be written here.  If we
> > >          * do not write the data out, we expose ourselves to the null files
> > >          * problem. Note that this includes any block zeroing we did above;
> > >          * otherwise those blocks may not be zeroed after a crash.
> > >          */
> > >         if (did_zeroing ||
> > >             (newsize > ip->i_d.di_size && oldsize != ip->i_d.di_size)) {
> > >                 error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
> > >                                                 ip->i_d.di_size, newsize - 1);
> > >                 if (error)
> > >                         return error;
> > >         }
> > > 
> > 
> > Ah, I forgot about this code. So this combined with the above provides
> > correctness for this particular case with respect to stale data exposure
> > after a crash. AFAICT this is still a performance tradeoff as the
> > zeroing may not be necessary if the post-eof extents happened to be
> > unwritten. This optimization is also achieved "for free" in this path
> > because did_zeroing is only set if iomap had to physically zero some
> > pages.
> > 
> > This also only covers one extension scenario. It looks like the typical
> > write extend scenario is intended to be handled similarly by submitting
> > the di_size update transaction from data write completion. It looks to
> > me that this should be fairly robust in most common cases (i.e.,
> > background writeback and/or fsync()), but I'm not totally convinced this
> > is always robust because write time extension doesn't perform the same
> > kind of writeback ordering as the truncate example above.
> 
> It depends on IO completion order as to whether writeback is truly
> robust. Calling filemap_write_and_wait_range() doesn't change that,
> because it doesn't strictly order IO completions and so we can still
> expose a region under IO that hasn't been signalled as complete by
> the hardware.
> 

Indeed.

> > Consider sync_file_range() for example.
> 
> sync_file_range() is not a data integrity operation - it does not
> ensure disk caches are flushed and so cannot provide any crash
> guarantees. Hence we can completely ignore it when we talk about data
> integrity operations - it's no different from normal writeback.
> 

Of course. I am by no means claiming sync_file_range() is an integrity
operation. Technically neither is background writeback, even though the
latter has fairly predictable behavior. I'm simply using
sync_file_range() as an example and means to simulate out of order
writeback for test purposes.

> > Suppose we have a large,
> > physical post-eof prealloc extent and perform a sparse extended write to
> > the last block of that extent followed by a sync_file_range() of the
> > just written data. The file write should physically zero the pagecache
> > for the preallocated region before the write, then the proper write
> > completes and returns. Nothing has changed on disk so far. A
> > sync_file_range() call sets the range start/end in the writeback_control
> > and works its way through __filemap_fdatawrite_range() to
> > write_cache_pages() and our writeback mechanism for the requested pages
> > only. From there, the same logic applies as for background writeback: we
> > issue writeback and I/O completion commits a file extension transaction.
> > AFAICT, nothing has guaranteed writeback has even initiated on any of
> > the zeroed pages over the non-unwritten extent that has just been pulled
> > within EOF, which means we are susceptible to stale data exposure until
> > that happens.
> 
> Yup, yet another reason why sync_file_range() is useless as a data
> integrity operation. I've had to explain this repeatedly to people
> over the past 10-15 years, and eventually the man page was updated
> with this:
> 
> Warning
> 	This system call is extremely dangerous and should not be
> 	used in portable programs.  None of these operations writes
> 	out the file's metadata.  Therefore, unless the
> 	application is strictly performing overwrites of
> 	already-instantiated disk blocks, there are no guarantees
> 	that the data will be available after a crash.  There is no
> 	user interface to know if a write is purely an overwrite.
> 	On filesystems using copy-on-write semantics (e.g., btrfs)
> 	an overwrite of existing allocated blocks is impossible.
> 	When writing into preallocated space, many filesystems
> 	also require calls into the block allocator, which this
> 	system call does not sync out to disk.  This system call
> 	does not flush disk write caches and thus does not provide
> 	any data integrity on systems with volatile disk write
> 	caches.
> 
> > 
...
> 
> Yup, but keep in mind this is exactly what is expected from using
> sync_file_range(). Friends don't let friends use sync_file_range()
> because it's a guaranteed vector for user data corruption and stale
> data exposure.
> 

Again, I'm using sync_file_range() as a dumb lever to control writeback
completion ordering. It's a test tool for my purposes and not something
I'd suggest for use otherwise or that should inherently provide
integrity guarantees. Please don't get distracted reading more into it
than that. We can induce the exact same writeback behavior and stale
data exposure problem using something like hole punch instead, for
example, and it's just as reliable a reproducer.

The broader point is that by controlling writeback ordering, we can
explicitly demonstrate stale data exposure. If fixed properly, we should
never expect stale data exposure even in the event of out of order
writeback completion, regardless of the cause.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Darrick J. Wong March 19, 2019, 6:02 p.m. UTC | #9
On Mon, Mar 18, 2019 at 05:50:14PM -0400, Brian Foster wrote:
> On Tue, Mar 19, 2019 at 07:35:06AM +1100, Dave Chinner wrote:
> > On Mon, Mar 18, 2019 at 08:40:39AM -0400, Brian Foster wrote:
> > > On Mon, Mar 18, 2019 at 09:40:11AM +1100, Dave Chinner wrote:
> > > > On Fri, Mar 15, 2019 at 08:29:40AM -0400, Brian Foster wrote:
> > > > > On Fri, Mar 15, 2019 at 02:40:38PM +1100, Dave Chinner wrote:
> > > > > > > If the delalloc extent crosses EOF then I think it makes sense to map it
> > > > > > > in as unwritten, issue the write, and convert the extent to real during
> > > > > > > io completion, and not split it up just to avoid having unwritten
> > > > > > > extents past EOF.
> > > > > > 
> > > > > > Yup, makes sense. For a sequential write, they should always be
> > > > > > beyond EOF. For anything else, we use unwritten.
> > > > > > 
> > > > > 
> > > > > I'm not quite sure I follow the suggested change in behavior here tbh so
> > > > > maybe this is not an issue, but another thing to consider is whether
> > > > > selective delalloc -> real conversion for post-eof extents translates to
> > > > > conditional stale data exposure vectors in certain file extension
> > > > > scenarios.
> > > > 
> > > > No, it doesn't because the transaction that moves EOF is done after
> > > > the data write into the region it exposes is complete. Hence the
> > > > device cache flush that occurs before the log write containing the
> > > > transaction that moves the EOF will always result in the data being
> > > > on stable storage *before* the extending size transaction hits the
> > > > journal and exposes it.
> > > > 
> > > 
> > > Ok, but this ordering alone doesn't guarantee the data we've just
> > > written is on-disk before the transaction hits the log.
> > 
> > Which transaction are you talking about? This ordering guarantees
> > that the data is on stable storage before the EOF size change
> > transactions that exposes the region is on disk (that's the whole
> > point of updates after data writes).
> > 
> > If you are talking about the allocation transaction that we are
> > going to write the data into, then you are right that it doesn't
> > guarantee that the data is on disk before that commits, but when the
> > allocation is beyond EOF it doesn't matter because it won't be
> > exposed until after the data is written and the EOF moving
> > transaction is committed.
> > 
> 
> The extending transaction that you referred to above and with respect to
> the device cache flush. I'm simply saying that the transaction has to be
> ordered with respect to full writeback completion as well, which is also
> what you are saying further down. (We're in agreement... ;)).
> 
> > As I've previously proposed, what I suspect we should be doing is
> > not committing the allocation transaction until IO completion, which
> > closes this stale data exposure hole completely and removes the need
> > for allocating unwritten extents and then having to convert them.
> > Direct IO could also do this, and so it could stop using unwritten
> > extents, too....
> > 
> 
> That sounds like an interesting (and potentially more efficient)
> alternative to using unwritten extents across the board only to convert
> them as I/Os complete. I guess we'd need to make sure that the
> allocation transaction holds across potentially many ioends considering
> the max extents size of 8GB or so. I suppose the locking and handling of
> fragmented allocation cases could get a bit interesting there as well,
> but it's probably worth a prototype attempt at least to survey the
> concept... (I wonder if Darrick is still paying attention or found
> something more interesting to do..? :D)

/me wandered off into fixing the io completion mess; if either of you
want to work on a better prototype, I encourage you to go for it. :)

Though I wonder how bad would be the effects of holding the allocation
transaction open across a (potentially very slow) writeout.  We'd still
be holding onto the dirty inode's ilock as well as the AGF header and
whatever bnobt/cntbt blocks are dirty.  Wouldn't that stall any other
thread that was walking the AGs looking for free space?

> I also assume we'd still need to use unwritten extents for cases where
> the allocation completes before all of the extent is within EOF (i.e.,
> the extent zeroing idea you mentioned on irc..).

Yes, I think it would still be necessary.  My guess is that
xfs_bmapi_convert_delalloc has to create unwritten extents for any
extent starting before the on-disk EOF but can create real extents for
anything past the on-disk isize (because we don't set the on-disk isize
to match the in-core isize until writes complete).  I guess the "zero
pages between old isize and new isize" code can simply reset any real
extents it finds to be unwritten too, as we've been discussing.

> > > > > I think even the post-eof zeroing code may be susceptible to
> > > > > this problem if the log happens to push after a truncate transaction
> > > > > commits but before the associated dirty pages are written back..
> > > > 
> > > > That's what this code in xfs_setattr_size() handles:
> > > > 
> > > >         /*
> > > >          * We are going to log the inode size change in this transaction so
> > > >          * any previous writes that are beyond the on disk EOF and the new
> > > >          * EOF that have not been written out need to be written here.  If we
> > > >          * do not write the data out, we expose ourselves to the null files
> > > >          * problem. Note that this includes any block zeroing we did above;
> > > >          * otherwise those blocks may not be zeroed after a crash.
> > > >          */
> > > >         if (did_zeroing ||
> > > >             (newsize > ip->i_d.di_size && oldsize != ip->i_d.di_size)) {
> > > >                 error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
> > > >                                                 ip->i_d.di_size, newsize - 1);
> > > >                 if (error)
> > > >                         return error;
> > > >         }
> > > > 
> > > 
> > > Ah, I forgot about this code. So this combined with the above provides
> > > correctness for this particular case with respect to stale data exposure
> > > after a crash. AFAICT this is still a performance tradeoff as the
> > > zeroing may not be necessary if the post-eof extents happened to be
> > > unwritten. This optimization is also achieved "for free" in this path
> > > because did_zeroing is only set if iomap had to physically zero some
> > > pages.
> > > 
> > > This also only covers one extension scenario. It looks like the typical
> > > write extend scenario is intended to be handled similarly by submitting
> > > the di_size update transaction from data write completion. It looks to
> > > me that this should be fairly robust in most common cases (i.e.,
> > > background writeback and/or fsync()), but I'm not totally convinced this
> > > is always robust because write time extension doesn't perform the same
> > > kind of writeback ordering as the truncate example above.
> > 
> > It depends on IO completion order as to whether writeback is truly
> > robust. Calling filemap_write_and_wait_range() doesn't change that,
> > because it doesn't strictly order IO completions and so we can still
> > expose a region under IO that hasn't been signalled as complete by
> > the hardware.
> > 
> 
> Indeed.
> 
> > > Consider sync_file_range() for example.
> > 
> > sync_file_range() is not a data integrity operation - it does not
> > ensure disk caches are flushed and so cannot provide any crash
> > guarantees. Hence we can completely ignore it when we talk about data
> > integrity operations - it's no different from normal writeback.
> > 
> 
> Of course. I am by no means claiming sync_file_range() is an integrity
> operation. Technically neither is background writeback, even though the
> latter has fairly predictable behavior. I'm simply using
> sync_file_range() as an example and means to simulate out of order
> writeback for test purposes.
> 
> > > Suppose we have a large,
> > > physical post-eof prealloc extent and perform a sparse extended write to
> > > the last block of that extent followed by a sync_file_range() of the
> > > just written data. The file write should physically zero the pagecache
> > > for the preallocated region before the write, then the proper write
> > > completes and returns. Nothing has changed on disk so far. A
> > > sync_file_range() call sets the range start/end in the writeback_control
> > > and works its way through __filemap_fdatawrite_range() to
> > > write_cache_pages() and our writeback mechanism for the requested pages
> > > only. From there, the same logic applies as for background writeback: we
> > > issue writeback and I/O completion commits a file extension transaction.
> > > AFAICT, nothing has guaranteed writeback has even initiated on any of
> > > the zeroed pages over the non-unwritten extent that has just been pulled
> > > within EOF, which means we are susceptible to stale data exposure until
> > > that happens.
> > 
> > Yup, yet another reason why sync_file_range() is useless as a data
> > integrity operation. I've had to explain this repeatedly to people
> > over the past 10-15 years, and eventually the man page was updated
> > with this:
> > 
> > Warning
> > 	This system call is extremely dangerous and should not be
> > 	used in portable programs.  None of these operations writes
> > 	out the file's metadata.  Therefore, unless the
> > 	application is strictly performing overwrites of
> > 	already-instantiated disk blocks, there are no guarantees
> > 	that the data will be available after a crash.  There is no
> > 	user interface to know if a write is purely an overwrite.
> > 	On filesystems using copy-on-write semantics (e.g., btrfs)
> > 	an overwrite of existing allocated blocks is impossible.
> > 	When writing into preallocated space, many filesystems
> > 	also require calls into the block allocator, which this
> > 	system call does not sync out to disk.  This system call
> > 	does not flush disk write caches and thus does not provide
> > 	any data integrity on systems with volatile disk write
> > 	caches.
> > 
> > > 
> ...
> > 
> > Yup, but keep in mind this is exactly what is expected from using
> > sync_file_range(). Friends don't let friends use sync_file_range()
> > because it's a guaranteed vector for user data corruption and stale
> > data exposure.
> > 
> 
> Again, I'm using sync_file_range() as a dumb lever to control writeback
> completion ordering. It's a test tool for my purposes and not something
> I'd suggest for use otherwise or that should inherently provide
> integrity guarantees. Please don't get distracted reading more into it
> than that. We can induce the exact same writeback behavior and stale
> data exposure problem using something like hole punch instead, for
> example, and it's just as reliable a reproducer.
> 
> The broader point is that by controlling writeback ordering, we can
> explicitly demonstrate stale data exposure. If fixed properly, we should
> never expect stale data exposure even in the event of out of order
> writeback completion, regardless of the cause.

Agree.  Even if s_f_r is a giant trash fire, if we're exposing stale
disk contents it's game over anyway and we ought to fix it.

--D

> Brian
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
Dave Chinner March 19, 2019, 8:25 p.m. UTC | #10
On Tue, Mar 19, 2019 at 11:02:49AM -0700, Darrick J. Wong wrote:
> On Mon, Mar 18, 2019 at 05:50:14PM -0400, Brian Foster wrote:
> > On Tue, Mar 19, 2019 at 07:35:06AM +1100, Dave Chinner wrote:
> > > On Mon, Mar 18, 2019 at 08:40:39AM -0400, Brian Foster wrote:
> > > > On Mon, Mar 18, 2019 at 09:40:11AM +1100, Dave Chinner wrote:
> > > > > On Fri, Mar 15, 2019 at 08:29:40AM -0400, Brian Foster wrote:
> > > > > > On Fri, Mar 15, 2019 at 02:40:38PM +1100, Dave Chinner wrote:
> > > > > > > > If the delalloc extent crosses EOF then I think it makes sense to map it
> > > > > > > > in as unwritten, issue the write, and convert the extent to real during
> > > > > > > > io completion, and not split it up just to avoid having unwritten
> > > > > > > > extents past EOF.
> > > > > > > 
> > > > > > > Yup, makes sense. For a sequential write, they should always be
> > > > > > > beyond EOF. For anything else, we use unwritten.
> > > > > > > 
> > > > > > 
> > > > > > I'm not quite sure I follow the suggested change in behavior here tbh so
> > > > > > maybe this is not an issue, but another thing to consider is whether
> > > > > > selective delalloc -> real conversion for post-eof extents translates to
> > > > > > conditional stale data exposure vectors in certain file extension
> > > > > > scenarios.
> > > > > 
> > > > > No, it doesn't because the transaction that moves EOF is done after
> > > > > the data write into the region it exposes is complete. Hence the
> > > > > device cache flush that occurs before the log write containing the
> > > > > transaction that moves the EOF will always result in the data being
> > > > > on stable storage *before* the extending size transaction hits the
> > > > > journal and exposes it.
> > > > > 
> > > > 
> > > > Ok, but this ordering alone doesn't guarantee the data we've just
> > > > written is on-disk before the transaction hits the log.
> > > 
> > > Which transaction are you talking about? This ordering guarantees
> > > that the data is on stable storage before the EOF size change
> > > transactions that exposes the region is on disk (that's the whole
> > > point of updates after data writes).
> > > 
> > > If you are talking about the allocation transaction that we are
> > > going to write the data into, then you are right that it doesn't
> > > guarantee that the data is on disk before that commits, but when the
> > > allocation is beyond EOF it doesn't matter because it won't be
> > > exposed until after the data is written and the EOF moving
> > > transaction is committed.
> > > 
> > 
> > The extending transaction that you referred to above and with respect to
> > the device cache flush. I'm simply saying that the transaction has to be
> > ordered with respect to full writeback completion as well, which is also
> > what you are saying further down. (We're in agreement... ;)).
> > 
> > > As I've previously proposed, what I suspect we should be doing is
> > > not committing the allocation transaction until IO completion, which
> > > closes this stale data exposure hole completely and removes the need
> > > for allocating unwritten extents and then having to convert them.
> > > Direct IO could also do this, and so it could stop using unwritten
> > > extents, too....
> > > 
> > 
> > That sounds like an interesting (and potentially more efficient)
> > alternative to using unwritten extents across the board only to convert
> > them as I/Os complete. I guess we'd need to make sure that the
> > allocation transaction holds across potentially many ioends considering
> > the max extents size of 8GB or so. I suppose the locking and handling of
> > fragmented allocation cases could get a bit interesting there as well,
> > but it's probably worth a prototype attempt at least to survey the
> > concept... (I wonder if Darrick is still paying attention or found
> > something more interesting to do..? :D)
> 
> /me wandered off into fixing the io completion mess; if either of you
> want to work on a better prototype, I encourage you to go for it. :)
> 
> Though I wonder how bad would be the effects of holding the allocation
> transaction open across a (potentially very slow) writeout.  We'd still
> be holding onto the dirty inode's ilock as well as the AGF header and
> whatever bnobt/cntbt blocks are dirty.  Wouldn't that stall any other
> thread that was walking the AGs looking for free space?

Yeah, which is why I mentioned on IRC that it may be better to use
an intent/done transaction pair similar to an EFI/EFD. i.e. on IO
completion we only have to log the "write done" and it can contain
just the range for the specific IO, hence we can do large
allocations and only mark the areas we write as containing data.
That would allow zeroing of the regions that aren't covered by a
done intent within EOF during recovery....

> > I also assume we'd still need to use unwritten extents for cases where
> > the allocation completes before all of the extent is within EOF (i.e.,
> > the extent zeroing idea you mentioned on irc..).
> 
> Yes, I think it would still be necessary.  My guess is that
> xfs_bmapi_convert_delalloc has to create unwritten extents for any
> extent starting before the on-disk EOF but can create real extents for
> anything past the on-disk isize (because we don't set the on-disk isize
> to match the in-core isize until writes complete).

Yes, that was my assumption too, but I suspect that with an
intent/done pair, even that is not necessary.

> I guess the "zero
> pages between old isize and new isize" code can simply reset any real
> extents it finds to be unwritten too, as we've been discussing.

*nod*

> > The broader point is that by controlling writeback ordering, we can
> > explicitly demonstrate stale data exposure. If fixed properly, we should
> > never expect stale data exposure even in the event of out of order
> > writeback completion, regardless of the cause.
> 
> Agree.  Even if s_f_r is a giant trash fire, if we're exposing stale
> disk contents it's game over anyway and we ought to fix it.

Well, that's exactly the problem with SFR, isn't it? The dirty data
is outside the range the user asks to sync and the implementation
doesn't go anywhere near the filesystem so we can't actually do the
right thing here. So we are left to either hack fixes around a
shitty, broken interface and everyone pays the penalty, or we ignore
it. We've simply ignored it for the past 10+ years because it really
can't be used safely for its intended purposes (as everyone who
tries to use it finds out eventually)...

Cheers,

Dave.
Brian Foster March 20, 2019, 12:02 p.m. UTC | #11
On Wed, Mar 20, 2019 at 07:25:37AM +1100, Dave Chinner wrote:
> On Tue, Mar 19, 2019 at 11:02:49AM -0700, Darrick J. Wong wrote:
> > On Mon, Mar 18, 2019 at 05:50:14PM -0400, Brian Foster wrote:
> > > On Tue, Mar 19, 2019 at 07:35:06AM +1100, Dave Chinner wrote:
> > > > On Mon, Mar 18, 2019 at 08:40:39AM -0400, Brian Foster wrote:
> > > > > On Mon, Mar 18, 2019 at 09:40:11AM +1100, Dave Chinner wrote:
> > > > > > On Fri, Mar 15, 2019 at 08:29:40AM -0400, Brian Foster wrote:
> > > > > > > On Fri, Mar 15, 2019 at 02:40:38PM +1100, Dave Chinner wrote:
> > > > > > > > > If the delalloc extent crosses EOF then I think it makes sense to map it
> > > > > > > > > in as unwritten, issue the write, and convert the extent to real during
> > > > > > > > > io completion, and not split it up just to avoid having unwritten
> > > > > > > > > extents past EOF.
> > > > > > > > 
> > > > > > > > Yup, makes sense. For a sequential write, they should always be
> > > > > > > > beyond EOF. For anything else, we use unwritten.
> > > > > > > > 
> > > > > > > 
> > > > > > > I'm not quite sure I follow the suggested change in behavior here tbh so
> > > > > > > maybe this is not an issue, but another thing to consider is whether
> > > > > > > selective delalloc -> real conversion for post-eof extents translates to
> > > > > > > conditional stale data exposure vectors in certain file extension
> > > > > > > scenarios.
> > > > > > 
> > > > > > No, it doesn't because the transaction that moves EOF is done after
> > > > > > the data write into the region it exposes is complete. Hence the
> > > > > > device cache flush that occurs before the log write containing the
> > > > > > transaction that moves the EOF will always result in the data being
> > > > > > on stable storage *before* the extending size transaction hits the
> > > > > > journal and exposes it.
> > > > > > 
> > > > > 
> > > > > Ok, but this ordering alone doesn't guarantee the data we've just
> > > > > written is on-disk before the transaction hits the log.
> > > > 
> > > > Which transaction are you talking about? This ordering guarantees
> > > > that the data is on stable storage before the EOF size change
> > > > transactions that exposes the region is on disk (that's the whole
> > > > point of updates after data writes).
> > > > 
> > > > If you are talking about the allocation transaction that we are
> > > > going to write the data into, then you are right that it doesn't
> > > > guarantee that the data is on disk before that commits, but when the
> > > > allocation is beyond EOF it doesn't matter because it won't be
> > > > exposed until after the data is written and the EOF moving
> > > > transaction is committed.
> > > > 
> > > 
> > > The extending transaction that you referred to above and with respect to
> > > the device cache flush. I'm simply saying that the transaction has to be
> > > ordered with respect to full writeback completion as well, which is also
> > > what you are saying further down. (We're in agreement... ;)).
> > > 
> > > > As I've previously proposed, what I suspect we should be doing is
> > > > not committing the allocation transaction until IO completion, which
> > > > closes this stale data exposure hole completely and removes the need
> > > > for allocating unwritten extents and then having to convert them.
> > > > Direct IO could also do this, and so it could stop using unwritten
> > > > extents, too....
> > > > 
> > > 
> > > That sounds like an interesting (and potentially more efficient)
> > > alternative to using unwritten extents across the board only to convert
> > > them as I/Os complete. I guess we'd need to make sure that the
> > > allocation transaction holds across potentially many ioends considering
> > > the max extents size of 8GB or so. I suppose the locking and handling of
> > > fragmented allocation cases could get a bit interesting there as well,
> > > but it's probably worth a prototype attempt at least to survey the
> > > concept... (I wonder if Darrick is still paying attention or found
> > > something more interesting to do..? :D)
> > 
> > /me wandered off into fixing the io completion mess; if either of you
> > want to work on a better prototype, I encourage you to go for it. :)
> > 
> > Though I wonder how bad would be the effects of holding the allocation
> > transaction open across a (potentially very slow) writeout.  We'd still
> > be holding onto the dirty inode's ilock as well as the AGF header and
> > whatever bnobt/cntbt blocks are dirty.  Wouldn't that stall any other
> > thread that was walking the AGs looking for free space?
> 

Indeed, that's the type of stuff I was wondering about as well.

> Yeah, which is why I mentioned on IRC that it may be better to use
> an intent/done transaction pair similar to an EFI/EFD. i.e. on IO
> completion we only have to log the "write done" and it can contain
> just hte range for the specific IO, hence we can do large
> allocations and only mark the areas we write as containing data.
> That would allow zeroing of the regions that aren't covered by a
> done intent within EOF during recovery....
> 

This wasn't exactly what comes to mind for me with the thought of using
intent items. I was thinking this meant to separate the block allocation
from block mapping such that I/O can be issued to allocated blocks, but
new blocks aren't mapped into the file until the I/O (or conversion to
unwritten, etc.) completes, similar to how COW remapping works. The
objective here would be to never map uninitialized blocks to accessible
regions of a file. This of course would somehow or another need to make
sure new extent regions that aren't the target of I/O still end up
properly mapped, that allocated but unmapped blocks are handled on log
recovery, etc. (reflink is essentially a reference implementation of all
this).

It sounds like what you're suggesting above is more along the lines of
leaving writeback as it works today, but log a write start intent in the
transaction that maps new blocks and pair it with write done completions
that occurs as blocks are initialized (again via I/O completion, extent
conversion, etc.), such that in the event of a crash log recovery can do
whatever is necessary to initialize the blocks correctly. The objective
here is to track uninitialized regions of a file such that log recovery
can do the initialization. This would more naturally handle the post-eof
scenario as we already issue zeroing buffered writes to those blocks on
file extension, but we'd need to be Ok with either issuing quite a bit
of I/O from log recovery, converting newly mapped extents to unwritten
(putting us in a similar performance situation as the original patch to
always use unwritten extents for affected files), or doing something
wonky like punching holes, etc.

Am I following the high level idea correctly? If so, both seem like
reasonable ideas, the latter probably more simple at a glance. I'd need
to think about it more to have more concrete feedback on feasibility,
etc..

> > > I also assume we'd still need to use unwritten extents for cases where
> > > the allocation completes before all of the extent is within EOF (i.e.,
> > > the extent zeroing idea you mentioned on irc..).
> > 
> > Yes, I think it would still be necessary.  My guess is that
> > xfs_bmapi_convert_delalloc has to create unwritten extents for any
> > extent starting before the on-disk EOF but can create real extents for
> > anything past the on-disk isize (because we don't set the on-disk isize
> > to match the in-core isize until writes complete).
> 
> Yes, that was my assumption too, but I suspect that with an
> intent/done pair, even that is not necessary.
> 

Ok, so then this is an implementation side effect and probably not worth
digging too far into without a higher level design.

> > I guess the "zero
> > pages between old isize and new isize" code can simply reset any real
> > extents it finds to be unwritten too, as we've been discussing.
> 
> *nod*
> 
> > > The broader point is that by controlling writeback ordering, we can
> > > explicitly demonstrate stale data exposure. If fixed properly, we should
> > > never expect stale data exposure even in the event of out of order
> > > writeback completion, regardless of the cause.
> > 
> > Agree.  Even if s_f_r is a giant trash fire, if we're exposing stale
> > disk contents it's game over anyway and we ought to fix it.
> 
> Well, that's exactly the problem with SFR, isn't it? The dirty data
> is outside the range the user asks to sync and the implementation
> doesn't go anywhere near the filesystem so we can't actually do the
> right thing here. So we are left to either hack fixes around a
> shitty, broken interface and everyone pays the penalty, or we ignore
> it. We've simply ignored it for the past 10+ years because it really
> can't be used safely for it's intended purposes (as everyone who
> tries to use it finds out eventually)...
> 

Just to make this abundantly clear, there is nothing unique, magic or
special required of sync_file_range() to reproduce this stale data
exposure. See the following variation of a command sequence from the
fstest I posted the other day:

...
# xfs_io -fc "pwrite 0 68k" -c fsync -c "pwrite 96k 4k" \
	-c "fpunch 96k 4k" <file>
...
# hexdump <file>
0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
*
0011000 abab abab abab abab abab abab abab abab
*
0018000 0000 0000 0000 0000 0000 0000 0000 0000
*
0019000

Same problem, and it can be reproduced just as easily with a reflink or
even an overlapping direct I/O. I haven't confirmed, but I suspect with
enough effort background writeback is susceptible to this problem as
well.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner March 20, 2019, 9:10 p.m. UTC | #12
On Wed, Mar 20, 2019 at 08:02:05AM -0400, Brian Foster wrote:
> On Wed, Mar 20, 2019 at 07:25:37AM +1100, Dave Chinner wrote:
> > > > The broader point is that by controlling writeback ordering, we can
> > > > explicitly demonstrate stale data exposure. If fixed properly, we should
> > > > never expect stale data exposure even in the event of out of order
> > > > writeback completion, regardless of the cause.
> > > 
> > > Agree.  Even if s_f_r is a giant trash fire, if we're exposing stale
> > > disk contents it's game over anyway and we ought to fix it.
> > 
> > Well, that's exactly the problem with SFR, isn't it? The dirty data
> > is outside the range the user asks to sync and the implementation
> > doesn't go anywhere near the filesystem so we can't actually do the
> > right thing here. So we are left to either hack fixes around a
> > shitty, broken interface and everyone pays the penalty, or we ignore
> > it. We've simply ignored it for the past 10+ years because it really
> > can't be used safely for it's intended purposes (as everyone who
> > tries to use it finds out eventually)...
> > 
> 
> Just to make this abundantly clear, there is nothing unique, magic or
> special required of sync_file_range() to reproduce this stale data
> exposure. See the following variation of a command sequence from the
> fstest I posted the other day:
> 
> ...
> # xfs_io -fc "pwrite 0 68k" -c fsync -c "pwrite 96k 4k" \
> 	-c "fpunch 96k 4k" <file>
> ...
> # hexdump <file>
> 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
> *
> 0011000 abab abab abab abab abab abab abab abab
> *
> 0018000 0000 0000 0000 0000 0000 0000 0000 0000
> *
> 0019000
> 
> Same problem, and it can be reproduced just as easily with a reflink or
> even an overlapping direct I/O.

How does overlapping direct IO cause stale data exposure given that
it uses unwritten extent allocation? That, of course, is ignoring
the fact that the result of overlapping concurent direct IO is, by
definition, undefined and an application bug if it occurs....

> I haven't confirmed, but I suspect with
> enough effort background writeback is susceptible to this problem as
> well.

Well, the similarity here is with the NULL files problem and using the
XFS_ITRUNCATE case to avoid NULL files after truncating them down.
i.e. if that flag is set, we do extra flushing where appropriate to
ensure the data at risk of being lost hits the disk before the loss
mechanism can be triggered.

i.e. this is the mirror equivalent of XFS_ITRUNCATE - it is the
truncate-up flushing case in that we need to wait for the zeroing to
hit the disk before we change the file size.

Essentially, what this says to me is we need to track, via an inode
flag, when we extended the file via a write beyond EOF. Then if we need to
flush a data range from the file for integrity reasons (such as an
fpunch) and this flag is set, we ignore the "start" parameter of the
range and flush from the start of the file instead. i.e. we
guarantee that any dirty zeroed pages between the start of the file
and the end of the range we are operating on get written back. This
will capture all these "didn't flush regions zeroed on file
extension" problems in one go, without needing to change the way we
do allocation or EOF zeroing.

I suspect we could just use the XFS_ITRUNCATE flag for this - set it
in the EOF zeroing code, check it in xfs_flush_unmap_range() and
other places that do range based flushing....

Cheers,

Dave.
Brian Foster March 21, 2019, 12:27 p.m. UTC | #13
On Thu, Mar 21, 2019 at 08:10:19AM +1100, Dave Chinner wrote:
> On Wed, Mar 20, 2019 at 08:02:05AM -0400, Brian Foster wrote:
> > On Wed, Mar 20, 2019 at 07:25:37AM +1100, Dave Chinner wrote:
> > > > > The broader point is that by controlling writeback ordering, we can
> > > > > explicitly demonstrate stale data exposure. If fixed properly, we should
> > > > > never expect stale data exposure even in the event of out of order
> > > > > writeback completion, regardless of the cause.
> > > > 
> > > > Agree.  Even if s_f_r is a giant trash fire, if we're exposing stale
> > > > disk contents it's game over anyway and we ought to fix it.
> > > 
> > > Well, that's exactly the problem with SFR, isn't it? The dirty data
> > > is outside the range the user asks to sync and the implementation
> > > doesn't go anywhere near the filesystem so we can't actually do the
> > > right thing here. So we are left to either hack fixes around a
> > > shitty, broken interface and everyone pays the penalty, or we ignore
> > > it. We've simply ignored it for the past 10+ years because it really
> > > can't be used safely for it's intended purposes (as everyone who
> > > tries to use it finds out eventually)...
> > > 
> > 
> > Just to make this abundantly clear, there is nothing unique, magic or
> > special required of sync_file_range() to reproduce this stale data
> > exposure. See the following variation of a command sequence from the
> > fstest I posted the other day:
> > 
> > ...
> > # xfs_io -fc "pwrite 0 68k" -c fsync -c "pwrite 96k 4k" \
> > 	-c "fpunch 96k 4k" <file>
> > ...
> > # hexdump <file>
> > 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
> > *
> > 0011000 abab abab abab abab abab abab abab abab
> > *
> > 0018000 0000 0000 0000 0000 0000 0000 0000 0000
> > *
> > 0019000
> > 
> > Same problem, and it can be reproduced just as easily with a reflink or
> > even an overlapping direct I/O.
> 
> How does overlapping direct IO cause stale data exposure given that
> it uses unwritten extent allocation? That, of course, is ignoring
> the fact that the result of overlapping concurent direct IO is, by
> definition, undefined and an application bug if it occurs....
> 

It's overlapping a buffered write and so starts with a flush instead of
an unwritten allocation (so even a dio read is sufficient). Also, the
target of the dio isn't where stale data is exposed, so even if there
was an existing underlying unwritten block before the initial buffered
write it wouldn't change anything.

Yes, this is technically usage with undefined behavior, but that still
misses the forest for the trees. I'm not claiming this is good or
supported practice, just as I'm not claiming sync_file_range() should
be used for anything in practice. I'm just pointing out it's one of
several possible, obvious data exposure vectors. I suspect there are
others; these were just the easy variants to reproduce. If you prefer to
reason about reproducer patterns that don't use operations with quirks
or undefined behaviors, that's fine, just use the reflink or hole punch
example posted above that produce the same behavior.

> > I haven't confirmed, but I suspect with
> > enough effort background writeback is susceptible to this problem as
> > well.
> 
> Well, the similarity here is with the NULL files problems and using
> XFS_ITRUNCATE case to avoid NULL files after truncating them down.
> i.e.  if that flag is set, we do extra flushing where appropriate to
> ensure ithe data at risk of being lost hits the disk before the loss
> mechanism can be triggered.
> 
> i.e. this is the mirror equivalent of XFS_ITRUNCATE - it is the
> truncate-up flushing case in that we need to wait for the zeroing to
> hit the disk before we change the file size.
> 
> Eseentially, what this says to me is we need to track when we
> extended the file via a write beyond EOF via an inode flag. Then if we need to
> flush a data range from the file for integrity reasons (such as a
> fpnuch) and this flag is set, we ignore the "start" parameter of the
> range and flush from the start of the file instead. i.e. we
> guarantee that any dirty zeroed pages between the start of the file
> and the end of the range we are operating on gets written back. This
> will capture all these "didn't flush regions zeroed on file
> extension" problems in one go, without needing to change the way we
> do allocation or EOF zeroing.
> 

I don't think that would prevent the problem outside of the common write
extension case. If the file is sparse (i.e. truncated up ahead of time)
such that the file size is on disk, then any flush sequence and block
allocation within EOF is susceptible to stale data exposure in the
window between when the extent is allocated and writeback completes over
the entire extent. This variant is reproducible even with a plain old
fsync() call; it's just hard enough to reproduce that it requires
instrumentation (as opposed to the test sequences I've posted in this
thread so far).

It may be possible to selectively (i.e. as an optimization) use an
XFS_ITRUNCATE-like algorithm as you propose here to always force in
order writeback in the cases where writeback extends the on-disk isize
(and thus isize protects from stale exposure), but I think we'd still
need something tied to extent allocation one way or another in addition
(like Darrick's original unwritten extent patch or your proposed
intent/done idea in the previous reply) to address the fundamental
problem. The question is what is the ideal solution..

Brian

> I suspect we could just use the XFS_ITRUNCATE flag for this - set it
> in the EOF zeroing code, check it in xfs_flush_unmap_range() and
> other places that do range based flushing....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner March 22, 2019, 1:52 a.m. UTC | #14
On Thu, Mar 21, 2019 at 08:27:40AM -0400, Brian Foster wrote:
> On Thu, Mar 21, 2019 at 08:10:19AM +1100, Dave Chinner wrote:
> > On Wed, Mar 20, 2019 at 08:02:05AM -0400, Brian Foster wrote:
> > > On Wed, Mar 20, 2019 at 07:25:37AM +1100, Dave Chinner wrote:
> > > > > > The broader point is that by controlling writeback ordering, we can
> > > > > > explicitly demonstrate stale data exposure. If fixed properly, we should
> > > > > > never expect stale data exposure even in the event of out of order
> > > > > > writeback completion, regardless of the cause.
> > > > > 
> > > > > Agree.  Even if s_f_r is a giant trash fire, if we're exposing stale
> > > > > disk contents it's game over anyway and we ought to fix it.
> > > > 
> > > > Well, that's exactly the problem with SFR, isn't it? The dirty data
> > > > is outside the range the user asks to sync and the implementation
> > > > doesn't go anywhere near the filesystem so we can't actually do the
> > > > right thing here. So we are left to either hack fixes around a
> > > > shitty, broken interface and everyone pays the penalty, or we ignore
> > > > it. We've simply ignored it for the past 10+ years because it really
> > > > can't be used safely for it's intended purposes (as everyone who
> > > > tries to use it finds out eventually)...
> > > > 
> > > 
> > > Just to make this abundantly clear, there is nothing unique, magic or
> > > special required of sync_file_range() to reproduce this stale data
> > > exposure. See the following variation of a command sequence from the
> > > fstest I posted the other day:
> > > 
> > > ...
> > > # xfs_io -fc "pwrite 0 68k" -c fsync -c "pwrite 96k 4k" \
> > > 	-c "fpunch 96k 4k" <file>
> > > ...
> > > # hexdump <file>
> > > 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
> > > *
> > > 0011000 abab abab abab abab abab abab abab abab
> > > *
> > > 0018000 0000 0000 0000 0000 0000 0000 0000 0000
> > > *
> > > 0019000
> > > 
> > > Same problem, and it can be reproduced just as easily with a reflink or
> > > even an overlapping direct I/O.
> > 
> > How does overlapping direct IO cause stale data exposure given that
> > it uses unwritten extent allocation? That, of course, is ignoring
> > the fact that the result of overlapping concurent direct IO is, by
> > definition, undefined and an application bug if it occurs....
> > 
> 
> It's overlapping a buffered write and so starts with a flush instead of

Ok, but that's not the normal usage of the phrase "overlapping
direct IO".  Overlapping direct IO means two concurrent direct IOs
to the same LBA range in the device underlying the filesystem. i.e.
they overlap in both the temporal and physical (device LBA) domains.

Subtle difference, yes, but a very important one.

> an unwritten allocation (so even a dio read is sufficient). Also, the
> target of the dio isn't where stale data is exposed, so even if there
> was an existing underlying unwritten block before the initial buffered
> write it wouldn't change anything.
> 
> Yes, this is technically usage with undefined behavior, but that still
> misses the forest from the trees.

No, I'm not missing the forest for the trees.

> I'm not claiming this is good or
> supported practice just as I'm not not claiming sync_file_range() should
> be used for anything in practice. I'm just pointing out it's one of
> several possible, obvious data exposure vectors. I suspect there are
> others; these were just the easy variants to reproduce.

I'm aware that we have these problems - we've known about them for
20+ years and the many ways they can be exposed. Indeed, We've been
talking about using unwritten extents for delalloc for as long as I
can remember, but never done it because the actual occurrence of
these problems in production systems is /extremely/ rare and the
cost of using unwritten extents everywhere will affect /everyone/.

i.e. It's always been easy to create test conditions to expose stale
data, but reported cases of stale data exposure after a production
system crash are pretty much non-existent. This just isn't a common
problem that users are experiencing, but solving it with "unwritten
extents for everyone" comes at a substantial cost for /everyone/,
/all the time/.

That's why a) it hasn't been done yet; and b) any discussion about
"use unwritten extents everywhere" always ends up in trying to find
a solution that isn't just applying the "big hammer" and ignoring
the problems that causes.

> If you prefer to
> reason about reproducer patterns that don't use operations with quirks
> or undefined behaviors, that's fine, just use the reflink or hole punch
> example posted above that produce the same behavior.

Well, yes. But it does go both ways - if you don't want people to
say "that's undefined" or "that makes no sense", then don't use
those quirky examples if you can use better, unambiguous examples.

For example, that's exactly how I phrased this extension flushing
mechanism trigger:

> > Eseentially, what this says to me is we need to track when we
> > extended the file via a write beyond EOF via an inode flag. Then if we need to
> > flush a data range from the file for integrity reasons (such as a
> > fpnuch) and this flag is set, we ignore the "start" parameter of the

See? Entirely generic trigger description, uses hole punch as an
example, but also covers all the quirky cases such as direct IO
range flushing without having to mention them explicitly.

> > range and flush from the start of the file instead. i.e. we
> > guarantee that any dirty zeroed pages between the start of the file
> > and the end of the range we are operating on gets written back. This
> > will capture all these "didn't flush regions zeroed on file
> > extension" problems in one go, without needing to change the way we
> > do allocation or EOF zeroing.
> > 
> 
> I don't think that would prevent the problem outside of the common write
> extension case.

I feel like we're going around in circles now, Brian. :/

I didn't propose this as the "only solution we need". I've proposed
it in the context that we use "only use unwritten extents for
allocations within EOF" to solve the common "write within EOF" case.
This, however, doesn't solve this common write extension case where
we don't want to use unwritten extents because of the performance
cost.

And now you're saying that "well, this extending write thing
doesn't solve the inside EOF problem"......

> It may be possible to selectively (i.e. as an optimization) use an
> XFS_ITRUNCATED like algorithm as you propose here to always force in
> order writeback in the cases where writeback extends the on-disk isize
> (and thus isize protects from stale exposure), but I think we'd still
> need something tied to extent allocation one way or another in addition
> (like Darrick's original unwritten extent patch or your proposed
> intent/done idea in the previous reply) to address the fundamental
> problem. The question is what is the ideal solution..

The solution will be a combination of things that work together, not
one big hammer. The more we can avoid exposure vectors by
operational ordering rather than transactional state changes, the
better the solution will perform.

Cheers,

Dave.
Brian Foster March 22, 2019, 12:46 p.m. UTC | #15
On Fri, Mar 22, 2019 at 12:52:40PM +1100, Dave Chinner wrote:
> On Thu, Mar 21, 2019 at 08:27:40AM -0400, Brian Foster wrote:
> > On Thu, Mar 21, 2019 at 08:10:19AM +1100, Dave Chinner wrote:
> > > On Wed, Mar 20, 2019 at 08:02:05AM -0400, Brian Foster wrote:
> > > > On Wed, Mar 20, 2019 at 07:25:37AM +1100, Dave Chinner wrote:
> > > > > > > The broader point is that by controlling writeback ordering, we can
> > > > > > > explicitly demonstrate stale data exposure. If fixed properly, we should
> > > > > > > never expect stale data exposure even in the event of out of order
> > > > > > > writeback completion, regardless of the cause.
> > > > > > 
> > > > > > Agree.  Even if s_f_r is a giant trash fire, if we're exposing stale
> > > > > > disk contents it's game over anyway and we ought to fix it.
> > > > > 
> > > > > Well, that's exactly the problem with SFR, isn't it? The dirty data
> > > > > is outside the range the user asks to sync and the implementation
> > > > > doesn't go anywhere near the filesystem so we can't actually do the
> > > > > right thing here. So we are left to either hack fixes around a
> > > > > shitty, broken interface and everyone pays the penalty, or we ignore
> > > > > it. We've simply ignored it for the past 10+ years because it really
> > > > > can't be used safely for it's intended purposes (as everyone who
> > > > > tries to use it finds out eventually)...
> > > > > 
> > > > 
> > > > Just to make this abundantly clear, there is nothing unique, magic or
> > > > special required of sync_file_range() to reproduce this stale data
> > > > exposure. See the following variation of a command sequence from the
> > > > fstest I posted the other day:
> > > > 
> > > > ...
> > > > # xfs_io -fc "pwrite 0 68k" -c fsync -c "pwrite 96k 4k" \
> > > > 	-c "fpunch 96k 4k" <file>
> > > > ...
> > > > # hexdump <file>
> > > > 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
> > > > *
> > > > 0011000 abab abab abab abab abab abab abab abab
> > > > *
> > > > 0018000 0000 0000 0000 0000 0000 0000 0000 0000
> > > > *
> > > > 0019000
> > > > 
> > > > Same problem, and it can be reproduced just as easily with a reflink or
> > > > even an overlapping direct I/O.
> > > 
> > > How does overlapping direct IO cause stale data exposure given that
> > > it uses unwritten extent allocation? That, of course, is ignoring
> > > the fact that the result of overlapping concurent direct IO is, by
> > > definition, undefined and an application bug if it occurs....
> > > 
> > 
> > It's overlapping a buffered write and so starts with a flush instead of
> 
> Ok, but that's not the normal usage of the phrase "overlapping
> direct IO".  Overlapping direct IO means two concurrent direct IOs
> to the same LBA range in the device underlying the filesystem. i.e.
> they overlap in both the temporal and physical (device LBA) domains.
> 
> Subtle difference, yes, but a very important one.
> 

Yeah, I just referenced it in passing and was unclear. The point is just
that there are multiple vectors to the problem, some of which are sane
sequences and some apparently not, and I'd rather not get caught up in
the minutiae of why some of those sequences are not sane. I'm trying to
see if we can find a solution to the fundamental data exposure problem.

> > an unwritten allocation (so even a dio read is sufficient). Also, the
> > target of the dio isn't where stale data is exposed, so even if there
> > was an existing underlying unwritten block before the initial buffered
> > write it wouldn't change anything.
> > 
> > Yes, this is technically usage with undefined behavior, but that still
> > misses the forest from the trees.
> 
> No, I'm not missing the forest for the trees.
> 
> > I'm not claiming this is good or
> > supported practice just as I'm not not claiming sync_file_range() should
> > be used for anything in practice. I'm just pointing out it's one of
> > several possible, obvious data exposure vectors. I suspect there are
> > others; these were just the easy variants to reproduce.
> 
> I'm aware that we have these problems - we've known about them for
> 20+ years and the many ways they can be exposed. Indeed, We've been
> talking about using unwritten extents for delalloc for as long as I
> can remember, but never done it because the actual occurrence of
> these problems in production systems is /extremely/ rare and the
> cost of using unwritten extents everywhere will affect /everyone/.
> 
> i.e. It's always been easy to create test conditions to expose stale
> data, but reported cases of stale data exposure after a production
> system crash are pretty much non-existent. This just isn't a common
> problem that users are experiencing, but solving it with "unwritten
> extents for everyone" comes at a substantial cost for /everyone/,
> /all the time/.
> 

I know you're aware. I'm pretty sure we've discussed this directly in
the past (with regard to how buffer heads were an impediment to a
solution) and that you made me aware of the issue in the first place. ;)

> That's why a) it hasn't been done yet; and b) any discussion about
> "use unwritten extents everywhere" always ends up in trying to find
> a solution that isn't just applying the "big hammer" and ignoring
> the problems that causes.
> 

Right..

> > If you prefer to
> > reason about reproducer patterns that don't use operations with quirks
> > or undefined behaviors, that's fine, just use the reflink or hole punch
> > example posted above that produce the same behavior.
> 
> Well, yes. But it does go both ways - if you don't want people to
> say "that's undefined" or "that makes no sense", then don't use
> those quirky examples if you can use better, unambiguous examples.
> 

The sync_file_range() reproducer was a poor example in that regard. I'm
not concerned over that being pointed out, but I've explained several
times subsequently that we can replace the sync_file_range() calls with
a sane command (hole punch) to reproduce the same general behavior and
resulting data exposure problem and thus the quirks of the former are
not a primary factor. Despite that, this discussion has somehow turned
into an appraisal of sync_file_range(). Moving on..

I posted some comments/questions around your log item suggestion for
extent zeroing during recovery in my previous reply and that part
apparently got completely snipped out and ignored. :/ Any feedback on
that?

> For example, that's exactly how I phrased this extension flushing
> mechanism trigger:
> 
> > > Eseentially, what this says to me is we need to track when we
> > > extended the file via a write beyond EOF via an inode flag. Then if we need to
> > > flush a data range from the file for integrity reasons (such as a
> > > fpnuch) and this flag is set, we ignore the "start" parameter of the
> 
> See? Entirely generic trigger description, uses hole punch as an
> example, but also covers all the quirky cases such as direct IO
> range flushing without having to mention them explicitly.
> 
> > > range and flush from the start of the file instead. i.e. we
> > > guarantee that any dirty zeroed pages between the start of the file
> > > and the end of the range we are operating on gets written back. This
> > > will capture all these "didn't flush regions zeroed on file
> > > extension" problems in one go, without needing to change the way we
> > > do allocation or EOF zeroing.
> > > 
> > 
> > I don't think that would prevent the problem outside of the common write
> > extension case.
> 
> I feel like we're going around in circles now, Brian. :/
> 

Yes, let's try to reset here...

> I didn't propose this as the "only solution we need". I've proposed
> it in the context that we use "only use unwritten extents for
> allocations within EOF" to solve the common "write within EOF" case.
> This, however, doesn't solve this common write extension case where
> we don't want to use unwritten extents because of the performance
> cost.
> 
> And now you're saying that "well, this extending write thing
> doesn't solve the inside EOF problem"......
> 

Ok, I lost that context. As I said above, I can see how this could be
useful in combination with another approach to handle inside eof
allocations. There are still some loose ends (in my mind) to consider
with this technique, however...

This requires that we always guarantee in-order writeback, right? I can
see how we can do that in the cases we control based on the flag based
approach you suggested. But do we really have enough control within the
fs to always prevent out of order writeback? What about non-integrity
background writeback? AFAICT it's technically possible to writeback out
of expected order (the cyclic range logic looks particularly suspicious
to me if you consider a mapping can be truncated and repopulated many
times over without resetting the cycle index). What about memory
reclaim? Is it not possible to hit the ->writepage() path via reclaim
for an arbitrary page in a mapping? It's not clear to me if page
migration is a factor for us as well, but it also looks like it has some
specific page writeback hooks.

I suppose there are technically ways around these things. We could skip
writeback of certain pages in some cases (but that may have unacceptable
consequences) or make the decision to use one exposure prevention
approach or another dynamically at writepage time based on the state of
the inode, mapping and associated page. I'd think background writeback
is the common case as opposed to things like hole punch and whatnot,
however, so a dynamic solution leaves the effectiveness of this approach
somewhat open ended unless we characterize how all possible writeback
scenarios behave.

Given that, it seems more useful to me to work through the inside eof
extent allocation scenario first and work back from there. Then we can
at least try to establish potential performance impact and determine if
we should try to work around it in certain cases, etc. On that note,
please see the part that was snipped out of my earlier reply[1], in
particular the part that ended with "Am I following the high level idea
correctly?" :P Thanks.

Brian

[1] https://marc.info/?l=linux-xfs&m=155308332901975&w=2

> > It may be possible to selectively (i.e. as an optimization) use an
> > XFS_ITRUNCATED like algorithm as you propose here to always force in
> > order writeback in the cases where writeback extends the on-disk isize
> > (and thus isize protects from stale exposure), but I think we'd still
> > need something tied to extent allocation one way or another in addition
> > (like Darrick's original unwritten extent patch or your proposed
> > intent/done idea in the previous reply) to address the fundamental
> > problem. The question is what is the ideal solution..
> 
> The solution will be a combination of things that work together, not
> one big hammer. The more we can avoid exposure vectors by
> operational ordering rather than transactional state changes, the
> better the solution will perform.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Darrick J. Wong April 3, 2019, 10:38 p.m. UTC | #16
On Fri, Mar 22, 2019 at 08:46:15AM -0400, Brian Foster wrote:
> On Fri, Mar 22, 2019 at 12:52:40PM +1100, Dave Chinner wrote:
> > On Thu, Mar 21, 2019 at 08:27:40AM -0400, Brian Foster wrote:
> > > On Thu, Mar 21, 2019 at 08:10:19AM +1100, Dave Chinner wrote:
> > > > On Wed, Mar 20, 2019 at 08:02:05AM -0400, Brian Foster wrote:
> > > > > On Wed, Mar 20, 2019 at 07:25:37AM +1100, Dave Chinner wrote:
> > > > > > > > The broader point is that by controlling writeback ordering, we can
> > > > > > > > explicitly demonstrate stale data exposure. If fixed properly, we should
> > > > > > > > never expect stale data exposure even in the event of out of order
> > > > > > > > writeback completion, regardless of the cause.
> > > > > > > 
> > > > > > > Agree.  Even if s_f_r is a giant trash fire, if we're exposing stale
> > > > > > > disk contents it's game over anyway and we ought to fix it.
> > > > > > 
> > > > > > Well, that's exactly the problem with SFR, isn't it? The dirty data
> > > > > > is outside the range the user asks to sync and the implementation
> > > > > > doesn't go anywhere near the filesystem so we can't actually do the
> > > > > > right thing here. So we are left to either hack fixes around a
> > > > > > shitty, broken interface and everyone pays the penalty, or we ignore
> > > > > > it. We've simply ignored it for the past 10+ years because it really
> > > > > > can't be used safely for it's intended purposes (as everyone who
> > > > > > tries to use it finds out eventually)...
> > > > > > 
> > > > > 
> > > > > Just to make this abundantly clear, there is nothing unique, magic or
> > > > > special required of sync_file_range() to reproduce this stale data
> > > > > exposure. See the following variation of a command sequence from the
> > > > > fstest I posted the other day:
> > > > > 
> > > > > ...
> > > > > # xfs_io -fc "pwrite 0 68k" -c fsync -c "pwrite 96k 4k" \
> > > > > 	-c "fpunch 96k 4k" <file>
> > > > > ...
> > > > > # hexdump <file>
> > > > > 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
> > > > > *
> > > > > 0011000 abab abab abab abab abab abab abab abab
> > > > > *
> > > > > 0018000 0000 0000 0000 0000 0000 0000 0000 0000
> > > > > *
> > > > > 0019000
> > > > > 
> > > > > Same problem, and it can be reproduced just as easily with a reflink or
> > > > > even an overlapping direct I/O.
> > > > 
> > > > How does overlapping direct IO cause stale data exposure given that
> > > > it uses unwritten extent allocation? That, of course, is ignoring
> > > > the fact that the result of overlapping concurrent direct IO is, by
> > > > definition, undefined and an application bug if it occurs....
> > > > 
> > > 
> > > It's overlapping a buffered write and so starts with a flush instead of
> > 
> > Ok, but that's not the normal usage of the phrase "overlapping
> > direct IO".  Overlapping direct IO means two concurrent direct IOs
> > to the same LBA range in the device underlying the filesystem. i.e.
> > they overlap in both the temporal and physical (device LBA) domains.
> > 
> > Subtle difference, yes, but a very important one.
> > 
> 
> Yeah, I just referenced it in passing and was unclear. The point is just
> that there are multiple vectors to the problem, some of which are sane
> sequences and some apparently not, and I'd rather not get caught up in
> the minutiae of why some of those sequences are not sane. I'm trying to
> see if we can find a solution to the fundamental data exposure problem.

Is there a way to increase the amount of writeback?  Say for example
that we have a delalloc reservation for blocks 0-9, and writeback asks
us to write back block 7.  If we succeed at converting the entire
10-block reservation into a real extent, why can't we come back and say
"Hey, we also need to flush the pages for blocks 0-6,8-9"?  Are we going
to need something like that for blocksize > pagesize filesystems too?

> 
> > > an unwritten allocation (so even a dio read is sufficient). Also, the
> > > target of the dio isn't where stale data is exposed, so even if there
> > > was an existing underlying unwritten block before the initial buffered
> > > write it wouldn't change anything.
> > > 
> > > Yes, this is technically usage with undefined behavior, but that still
> > > misses the forest for the trees.
> > 
> > No, I'm not missing the forest for the trees.
> > 
> > > I'm not claiming this is good or
> > > supported practice just as I'm not not claiming sync_file_range() should
> > > be used for anything in practice. I'm just pointing out it's one of
> > > several possible, obvious data exposure vectors. I suspect there are
> > > others; these were just the easy variants to reproduce.
> > 
> > I'm aware that we have these problems - we've known about them for
> > 20+ years and the many ways they can be exposed. Indeed, we've been
> > talking about using unwritten extents for delalloc for as long as I
> > can remember, but never done it because the actual occurrence of
> > these problems in production systems is /extremely/ rare and the
> > cost of using unwritten extents everywhere will affect /everyone/.
> > 
> > i.e. It's always been easy to create test conditions to expose stale
> > data, but reported cases of stale data exposure after a production
> > system crash are pretty much non-existent. This just isn't a common
> > problem that users are experiencing, but solving it with "unwritten
> > extents for everyone" comes at a substantial cost for /everyone/,
> > /all the time/.
> > 
> 
> I know you're aware. I'm pretty sure we've discussed this directly in
> the past (with regard to how buffer heads were an impediment to a
> solution) and that you made me aware of the issue in the first place. ;)
> 
> > That's why a) it hasn't been done yet; and b) any discussion about
> > "use unwritten extents everywhere" always ends up in trying to find
> > a solution that isn't just applying the "big hammer" and ignoring
> > the problems that causes.
> > 
> 
> Right..
> 
> > > If you prefer to
> > > reason about reproducer patterns that don't use operations with quirks
> > > or undefined behaviors, that's fine, just use the reflink or hole punch
> > > example posted above that produce the same behavior.
> > 
> > Well, yes. But it does go both ways - if you don't want people to
> > say "that's undefined" or "that makes no sense", then don't use
> > those quirky examples if you can use better, unambiguous examples.
> > 
> 
> The sync_file_range() reproducer was a poor example in that regard. I'm

(I don't want to stir this up again but just so we're all clear I had
figured the s_f_r call was really just a way to simulate random drive-by
scattershot background writeback for testing purposes.)

> not concerned over that being pointed out, but I've explained several
> times subsequently that we can replace the sync_file_range() calls with
> a sane command (hole punch) to reproduce the same general behavior and
> resulting data exposure problem and thus the quirks of the former are
> not a primary factor. Despite that, this discussion has somehow turned
> into an appraisal of sync_file_range(). Moving on..
> 
> I posted some comments/questions around your log item suggestion for
> extent zeroing during recovery in my previous reply and that part
> apparently got completely snipped out and ignored. :/ Any feedback on
> that?

A(nother) big downside of adding a new "write intent": now we need to
add a log incompat flag which will occasionally generate "I wrote
something, crashed, and this old knoppix cd won't recover the fs for
some reason" reports.

> > For example, that's exactly how I phrased this extension flushing
> > mechanism trigger:
> > 
> > > > Essentially, what this says to me is we need to track when we
> > > > extended the file via a write beyond EOF using an inode flag. Then if we need to
> > > > flush a data range from the file for integrity reasons (such as a
> > > > fpunch) and this flag is set, we ignore the "start" parameter of the
> > 
> > See? Entirely generic trigger description, uses hole punch as an
> > example, but also covers all the quirky cases such as direct IO
> > range flushing without having to mention them explicitly.
> > 
> > > > range and flush from the start of the file instead. i.e. we
> > > > guarantee that any dirty zeroed pages between the start of the file
> > > > and the end of the range we are operating on gets written back. This
> > > > will capture all these "didn't flush regions zeroed on file
> > > > extension" problems in one go, without needing to change the way we
> > > > do allocation or EOF zeroing.
> > > > 
> > > 
> > > I don't think that would prevent the problem outside of the common write
> > > extension case.
> > 
> > I feel like we're going around in circles now, Brian. :/
> > 
> 
> Yes, let's try to reset here...
> 
> > I didn't propose this as the "only solution we need". I've proposed
> > it in the context that we use "only use unwritten extents for
> > allocations within EOF" to solve the common "write within EOF" case.
> > This, however, doesn't solve this common write extension case where
> > we don't want to use unwritten extents because of the performance
> > cost.
> > 
> > And now you're saying that "well, this extending write thing
> > doesn't solve the inside EOF problem"......
> > 
> 
> Ok, I lost that context. As I said above, I can see how this could be
> useful in combination with another approach to handle inside eof
> allocations. There are still some loose ends (in my mind) to consider
> with this technique, however...
> 
> This requires that we always guarantee in-order writeback, right? I can
> see how we can do that in the cases we control based on the flag based
> approach you suggested. But do we really have enough control within the
> fs to always prevent out of order writeback? What about non-integrity
> background writeback? AFAICT it's technically possible to writeback out
> of expected order (the cyclic range logic looks particularly suspicious
> to me if you consider a mapping can be truncated and repopulated many
> times over without resetting the cycle index). What about memory
> reclaim? Is it not possible to hit the ->writepage() path via reclaim
> for an arbitrary page in a mapping? It's not clear to me if page
> migration is a factor for us as well, but it also looks like it has some
> specific page writeback hooks.
> 
> I suppose there are technically ways around these things. We could skip
> writeback of certain pages in some cases (but that may have unacceptable
> consequences) or make the decision to use one exposure prevention
> approach or another dynamically at writepage time based on the state of
> the inode, mapping and associated page. I'd think background writeback
> is the common case as opposed to things like hole punch and whatnot,
> however, so a dynamic solution leaves the effectiveness of this approach
> somewhat open ended unless we characterize how all possible writeback
> scenarios behave.
> 
> Given that, it seems more useful to me to work through the inside eof
> extent allocation scenario first and work back from there. Then we can
> at least try to establish potential performance impact and determine if
> we should try to work around it in certain cases, etc.

<nod> I'll get back to working on two parts of fixing this:

a) adapting the "convert delalloc to unwritten" patch that started this
thread so that it only does that for conversions that start within eof,

b) fixing your punch case by extending the range that xfs_flush_unmap_range
asks to flush if the caller is trying to flush a range past the ondisk
eof.

Though I think (b) would be unnecessary if we had the ability to flush all the
pages fronting a delalloc reservation that we just converted to a real
extent, as I muttered above.
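A minimal model of the policy in (a) might look like the following (hypothetical helper, not the actual patch): commit as unwritten only when the allocation starts inside the on-disk size, since a purely size-extending allocation is already fenced off by the on-disk inode size after a crash.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hedged sketch of (a), not the actual patch: delalloc conversions
 * that begin inside the on-disk EOF are committed as unwritten so
 * a crash before writeback completes cannot expose stale blocks;
 * allocations entirely beyond the on-disk EOF stay written, since
 * the on-disk inode size hides those blocks after a crash.
 */
static bool commit_as_unwritten(uint64_t alloc_start, uint64_t ondisk_isize)
{
	return alloc_start < ondisk_isize;
}
```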

> On that note,
> please see the part that was snipped out of my earlier reply[1], in
> particular the part that ended with "Am I following the high level idea
> correctly?" :P Thanks.

I thought your take at least matched my interpretation of Dave's
comments.

Thoughts?

--D

> 
> Brian
> 
> [1] https://marc.info/?l=linux-xfs&m=155308332901975&w=2
> 
> > > It may be possible to selectively (i.e. as an optimization) use an
> > > XFS_ITRUNCATED like algorithm as you propose here to always force in
> > > order writeback in the cases where writeback extends the on-disk isize
> > > (and thus isize protects from stale exposure), but I think we'd still
> > > need something tied to extent allocation one way or another in addition
> > > (like Darrick's original unwritten extent patch or your proposed
> > > intent/done idea in the previous reply) to address the fundamental
> > > problem. The question is what is the ideal solution..
> > 
> > The solution will be a combination of things that work together, not
> > one big hammer. The more we can avoid exposure vectors by
> > operational ordering rather than transactional state changes, the
> > better the solution will perform.
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
Brian Foster April 4, 2019, 12:50 p.m. UTC | #17
On Wed, Apr 03, 2019 at 03:38:09PM -0700, Darrick J. Wong wrote:
> On Fri, Mar 22, 2019 at 08:46:15AM -0400, Brian Foster wrote:
> > On Fri, Mar 22, 2019 at 12:52:40PM +1100, Dave Chinner wrote:
> > > On Thu, Mar 21, 2019 at 08:27:40AM -0400, Brian Foster wrote:
> > > > On Thu, Mar 21, 2019 at 08:10:19AM +1100, Dave Chinner wrote:
> > > > > On Wed, Mar 20, 2019 at 08:02:05AM -0400, Brian Foster wrote:
> > > > > > On Wed, Mar 20, 2019 at 07:25:37AM +1100, Dave Chinner wrote:
> > > > > > > > > The broader point is that by controlling writeback ordering, we can
> > > > > > > > > explicitly demonstrate stale data exposure. If fixed properly, we should
> > > > > > > > > never expect stale data exposure even in the event of out of order
> > > > > > > > > writeback completion, regardless of the cause.
> > > > > > > > 
> > > > > > > > Agree.  Even if s_f_r is a giant trash fire, if we're exposing stale
> > > > > > > > disk contents it's game over anyway and we ought to fix it.
> > > > > > > 
> > > > > > > Well, that's exactly the problem with SFR, isn't it? The dirty data
> > > > > > > is outside the range the user asks to sync and the implementation
> > > > > > > doesn't go anywhere near the filesystem so we can't actually do the
> > > > > > > right thing here. So we are left to either hack fixes around a
> > > > > > > shitty, broken interface and everyone pays the penalty, or we ignore
> > > > > > > it. We've simply ignored it for the past 10+ years because it really
> > > > > > > can't be used safely for its intended purposes (as everyone who
> > > > > > > tries to use it finds out eventually)...
> > > > > > > 
> > > > > > 
> > > > > > Just to make this abundantly clear, there is nothing unique, magic or
> > > > > > special required of sync_file_range() to reproduce this stale data
> > > > > > exposure. See the following variation of a command sequence from the
> > > > > > fstest I posted the other day:
> > > > > > 
> > > > > > ...
> > > > > > # xfs_io -fc "pwrite 0 68k" -c fsync -c "pwrite 96k 4k" \
> > > > > > 	-c "fpunch 96k 4k" <file>
> > > > > > ...
> > > > > > # hexdump <file>
> > > > > > 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
> > > > > > *
> > > > > > 0011000 abab abab abab abab abab abab abab abab
> > > > > > *
> > > > > > 0018000 0000 0000 0000 0000 0000 0000 0000 0000
> > > > > > *
> > > > > > 0019000
> > > > > > 
> > > > > > Same problem, and it can be reproduced just as easily with a reflink or
> > > > > > even an overlapping direct I/O.
> > > > > 
> > > > > How does overlapping direct IO cause stale data exposure given that
> > > > > it uses unwritten extent allocation? That, of course, is ignoring
> > > > > the fact that the result of overlapping concurrent direct IO is, by
> > > > > definition, undefined and an application bug if it occurs....
> > > > > 
> > > > 
> > > > It's overlapping a buffered write and so starts with a flush instead of
> > > 
> > > Ok, but that's not the normal usage of the phrase "overlapping
> > > direct IO".  Overlapping direct IO means two concurrent direct IOs
> > > to the same LBA range in the device underlying the filesystem. i.e.
> > > they overlap in both the temporal and physical (device LBA) domains.
> > > 
> > > Subtle difference, yes, but a very important one.
> > > 
> > 
> > Yeah, I just referenced it in passing and was unclear. The point is just
> > that there are multiple vectors to the problem, some of which are sane
> > sequences and some apparently not, and I'd rather not get caught up in
> > the minutiae of why some of those sequences are not sane. I'm trying to
> > see if we can find a solution to the fundamental data exposure problem.
> 
> Is there a way to increase the amount of writeback?  Say for example
> that we have a delalloc reservation for blocks 0-9, and writeback asks
> us to write back block 7.  If we succeed at converting the entire
> 10-block reservation into a real extent, why can't we come back and say
> "Hey, we also need to flush the pages for blocks 0-6,8-9"?  Are we going
> to need something like that for blocksize > pagesize filesystems too?
> 

That might be possible in various places. In fact, I think we used to do
something like this in XFS (perhaps not for this purpose). I just jumped
back to v2.6.32 and see that the xfs_cluster_write() mechanism still
exists at that point down in the ->writepage() path.

I'm not sure if doing that alone is worth the effort if you consider
it's more of a mitigation than a solution to the underlying problem.
Once we convert the associated delalloc extent, we're open to stale data
exposure (assuming we're not protected by on-disk EOF and whatnot) until
all of the remaining pages over the extent are written back, whether
that occurs in the current writeback cycle or a subsequent one.

I suppose it might make sense in service to another solution, such as
writing back everything preceding a page that is going to update on-disk
EOF (which is something I think Dave suggested earlier as an incremental
improvement, I've kind of lost track). We'd probably just have to make
sure that the on-disk eof can never get ahead of the lower index
writeback pages by I/O reordering or whatever..
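One way to picture the ordering constraint in the last paragraph is a toy model (not XFS code) in which the on-disk size may only advance across writeback completions that are contiguous from the old on-disk EOF:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* [start, end) byte ranges whose writeback completed, sorted by start */
struct io_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Toy model, not XFS code: advance the on-disk EOF only across
 * contiguously completed writeback, so an I/O that completes out
 * of order beyond a still-dirty gap can never drag the on-disk
 * size past blocks whose pages have not been written back yet.
 */
static uint64_t advance_ondisk_eof(uint64_t ondisk_eof,
				   const struct io_range *done, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (done[i].start > ondisk_eof)
			break;	/* gap: a lower range hasn't completed */
		if (done[i].end > ondisk_eof)
			ondisk_eof = done[i].end;
	}
	return ondisk_eof;
}
```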

> > 
> > > > an unwritten allocation (so even a dio read is sufficient). Also, the
> > > > target of the dio isn't where stale data is exposed, so even if there
> > > > was an existing underlying unwritten block before the initial buffered
> > > > write it wouldn't change anything.
> > > > 
> > > > Yes, this is technically usage with undefined behavior, but that still
> > > > misses the forest for the trees.
> > > 
> > > No, I'm not missing the forest for the trees.
> > > 
> > > > I'm not claiming this is good or
> > > > supported practice just as I'm not not claiming sync_file_range() should
> > > > be used for anything in practice. I'm just pointing out it's one of
> > > > several possible, obvious data exposure vectors. I suspect there are
> > > > others; these were just the easy variants to reproduce.
> > > 
> > > I'm aware that we have these problems - we've known about them for
> > > 20+ years and the many ways they can be exposed. Indeed, we've been
> > > talking about using unwritten extents for delalloc for as long as I
> > > can remember, but never done it because the actual occurrence of
> > > these problems in production systems is /extremely/ rare and the
> > > cost of using unwritten extents everywhere will affect /everyone/.
> > > 
> > > i.e. It's always been easy to create test conditions to expose stale
> > > data, but reported cases of stale data exposure after a production
> > > system crash are pretty much non-existent. This just isn't a common
> > > problem that users are experiencing, but solving it with "unwritten
> > > extents for everyone" comes at a substantial cost for /everyone/,
> > > /all the time/.
> > > 
> > 
> > I know you're aware. I'm pretty sure we've discussed this directly in
> > the past (with regard to how buffer heads were an impediment to a
> > solution) and that you made me aware of the issue in the first place. ;)
> > 
> > > That's why a) it hasn't been done yet; and b) any discussion about
> > > "use unwritten extents everywhere" always ends up in trying to find
> > > a solution that isn't just applying the "big hammer" and ignoring
> > > the problems that causes.
> > > 
> > 
> > Right..
> > 
> > > > If you prefer to
> > > > reason about reproducer patterns that don't use operations with quirks
> > > > or undefined behaviors, that's fine, just use the reflink or hole punch
> > > > example posted above that produce the same behavior.
> > > 
> > > Well, yes. But it does go both ways - if you don't want people to
> > > say "that's undefined" or "that makes no sense", then don't use
> > > those quirky examples if you can use better, unambiguous examples.
> > > 
> > 
> > The sync_file_range() reproducer was a poor example in that regard. I'm
> 
> (I don't want to stir this up again but just so we're all clear I had
> figured the s_f_r call was really just a way to simulate random drive-by
> scattershot background writeback for testing purposes.)
> 

Ack.

> > not concerned over that being pointed out, but I've explained several
> > times subsequently that we can replace the sync_file_range() calls with
> > a sane command (hole punch) to reproduce the same general behavior and
> > resulting data exposure problem and thus the quirks of the former are
> > not a primary factor. Despite that, this discussion has somehow turned
> > into an appraisal of sync_file_range(). Moving on..
> > 
> > I posted some comments/questions around your log item suggestion for
> > extent zeroing during recovery in my previous reply and that part
> > apparently got completely snipped out and ignored. :/ Any feedback on
> > that?
> 
> A(nother) big downside of adding a new "write intent": now we need to
> add a log incompat flag which will occasionally generate "I wrote
> something, crashed, and this old knoppix cd won't recover the fs for
> some reason" reports.
> 

Yeah, though I think it's not really recommended to process an fs with a
dirty log across kernels like that. We already have some recovery
restrictions that prevent this, such as on-disk order, for example. Of
course that's different from just going back to an older kernel version
on the same hardware and getting stuck that way because of a feature bit
or something.

> > > For example, that's exactly how I phrased this extension flushing
> > > mechanism trigger:
> > > 
> > > > > Essentially, what this says to me is we need to track when we
> > > > > extended the file via a write beyond EOF using an inode flag. Then if we need to
> > > > > flush a data range from the file for integrity reasons (such as a
> > > > > fpunch) and this flag is set, we ignore the "start" parameter of the
> > > 
> > > See? Entirely generic trigger description, uses hole punch as an
> > > example, but also covers all the quirky cases such as direct IO
> > > range flushing without having to mention them explicitly.
> > > 
> > > > > range and flush from the start of the file instead. i.e. we
> > > > > guarantee that any dirty zeroed pages between the start of the file
> > > > > and the end of the range we are operating on gets written back. This
> > > > > will capture all these "didn't flush regions zeroed on file
> > > > > extension" problems in one go, without needing to change the way we
> > > > > do allocation or EOF zeroing.
> > > > > 
> > > > 
> > > > I don't think that would prevent the problem outside of the common write
> > > > extension case.
> > > 
> > > I feel like we're going around in circles now, Brian. :/
> > > 
> > 
> > Yes, let's try to reset here...
> > 
> > > I didn't propose this as the "only solution we need". I've proposed
> > > it in the context that we use "only use unwritten extents for
> > > allocations within EOF" to solve the common "write within EOF" case.
> > > This, however, doesn't solve this common write extension case where
> > > we don't want to use unwritten extents because of the performance
> > > cost.
> > > 
> > > And now you're saying that "well, this extending write thing
> > > doesn't solve the inside EOF problem"......
> > > 
> > 
> > Ok, I lost that context. As I said above, I can see how this could be
> > useful in combination with another approach to handle inside eof
> > allocations. There are still some loose ends (in my mind) to consider
> > with this technique, however...
> > 
> > This requires that we always guarantee in-order writeback, right? I can
> > see how we can do that in the cases we control based on the flag based
> > approach you suggested. But do we really have enough control within the
> > fs to always prevent out of order writeback? What about non-integrity
> > background writeback? AFAICT it's technically possible to writeback out
> > of expected order (the cyclic range logic looks particularly suspicious
> > to me if you consider a mapping can be truncated and repopulated many
> > times over without resetting the cycle index). What about memory
> > reclaim? Is it not possible to hit the ->writepage() path via reclaim
> > for an arbitrary page in a mapping? It's not clear to me if page
> > migration is a factor for us as well, but it also looks like it has some
> > specific page writeback hooks.
> > 
> > I suppose there are technically ways around these things. We could skip
> > writeback of certain pages in some cases (but that may have unacceptable
> > consequences) or make the decision to use one exposure prevention
> > approach or another dynamically at writepage time based on the state of
> > the inode, mapping and associated page. I'd think background writeback
> > is the common case as opposed to things like hole punch and whatnot,
> > however, so a dynamic solution leaves the effectiveness of this approach
> > somewhat open ended unless we characterize how all possible writeback
> > scenarios behave.
> > 
> > Given that, it seems more useful to me to work through the inside eof
> > extent allocation scenario first and work back from there. Then we can
> > at least try to establish potential performance impact and determine if
> > we should try to work around it in certain cases, etc.
> 
> <nod> I'll get back to working on two parts of fixing this:
> 
> a) adapting the "convert delalloc to unwritten" patch that started this
> thread so that it only does that for conversions that start within eof,
> 
> b) fixing your punch case by extending the range that xfs_flush_unmap_range
> asks to flush if the caller is trying to flush a range past the ondisk
> eof.
> 
> Though I think (b) would be unnecessary if we had the ability to flush all the
> pages fronting a delalloc reservation that we just converted to a real
> extent, as I muttered above.
> 

I think it's more that we have to order all such page writebacks against
the on-disk size update so the latter protects us from exposure in the
event of a crash, but otherwise this sounds like a reasonable next step
to me..

Brian

> > On that note,
> > please see the part that was snipped out of my earlier reply[1], in
> > particular the part that ended with "Am I following the high level idea
> > correctly?" :P Thanks.
> 
> I thought your take at least matched my interpretation of Dave's
> comments.
> 
> Thoughts?
> 
> --D
> 
> > 
> > Brian
> > 
> > [1] https://marc.info/?l=linux-xfs&m=155308332901975&w=2
> > 
> > > > It may be possible to selectively (i.e. as an optimization) use an
> > > > XFS_ITRUNCATED like algorithm as you propose here to always force in
> > > > order writeback in the cases where writeback extends the on-disk isize
> > > > (and thus isize protects from stale exposure), but I think we'd still
> > > > need something tied to extent allocation one way or another in addition
> > > > (like Darrick's original unwritten extent patch or your proposed
> > > > intent/done idea in the previous reply) to address the fundamental
> > > > problem. The question is what is the ideal solution..
> > > 
> > > The solution will be a combination of things that work together, not
> > > one big hammer. The more we can avoid exposure vectors by
> > > operational ordering rather than transactional state changes, the
> > > better the solution will perform.
> > > 
> > > Cheers,
> > > 
> > > Dave.
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com
Darrick J. Wong April 4, 2019, 3:22 p.m. UTC | #18
On Thu, Apr 04, 2019 at 08:50:54AM -0400, Brian Foster wrote:
> On Wed, Apr 03, 2019 at 03:38:09PM -0700, Darrick J. Wong wrote:
> > On Fri, Mar 22, 2019 at 08:46:15AM -0400, Brian Foster wrote:
> > > On Fri, Mar 22, 2019 at 12:52:40PM +1100, Dave Chinner wrote:
> > > > On Thu, Mar 21, 2019 at 08:27:40AM -0400, Brian Foster wrote:
> > > > > On Thu, Mar 21, 2019 at 08:10:19AM +1100, Dave Chinner wrote:
> > > > > > On Wed, Mar 20, 2019 at 08:02:05AM -0400, Brian Foster wrote:
> > > > > > > On Wed, Mar 20, 2019 at 07:25:37AM +1100, Dave Chinner wrote:
> > > > > > > > > > The broader point is that by controlling writeback ordering, we can
> > > > > > > > > > explicitly demonstrate stale data exposure. If fixed properly, we should
> > > > > > > > > > never expect stale data exposure even in the event of out of order
> > > > > > > > > > writeback completion, regardless of the cause.
> > > > > > > > > 
> > > > > > > > > Agree.  Even if s_f_r is a giant trash fire, if we're exposing stale
> > > > > > > > > disk contents it's game over anyway and we ought to fix it.
> > > > > > > > 
> > > > > > > > Well, that's exactly the problem with SFR, isn't it? The dirty data
> > > > > > > > is outside the range the user asks to sync and the implementation
> > > > > > > > doesn't go anywhere near the filesystem so we can't actually do the
> > > > > > > > right thing here. So we are left to either hack fixes around a
> > > > > > > > shitty, broken interface and everyone pays the penalty, or we ignore
> > > > > > > > it. We've simply ignored it for the past 10+ years because it really
> > > > > > > > can't be used safely for its intended purposes (as everyone who
> > > > > > > > tries to use it finds out eventually)...
> > > > > > > > 
> > > > > > > 
> > > > > > > Just to make this abundantly clear, there is nothing unique, magic or
> > > > > > > special required of sync_file_range() to reproduce this stale data
> > > > > > > exposure. See the following variation of a command sequence from the
> > > > > > > fstest I posted the other day:
> > > > > > > 
> > > > > > > ...
> > > > > > > # xfs_io -fc "pwrite 0 68k" -c fsync -c "pwrite 96k 4k" \
> > > > > > > 	-c "fpunch 96k 4k" <file>
> > > > > > > ...
> > > > > > > # hexdump <file>
> > > > > > > 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
> > > > > > > *
> > > > > > > 0011000 abab abab abab abab abab abab abab abab
> > > > > > > *
> > > > > > > 0018000 0000 0000 0000 0000 0000 0000 0000 0000
> > > > > > > *
> > > > > > > 0019000
> > > > > > > 
> > > > > > > Same problem, and it can be reproduced just as easily with a reflink or
> > > > > > > even an overlapping direct I/O.
> > > > > > 
> > > > > > How does overlapping direct IO cause stale data exposure given that
> > > > > > it uses unwritten extent allocation? That, of course, is ignoring
> > > > > > the fact that the result of overlapping concurrent direct IO is, by
> > > > > > definition, undefined and an application bug if it occurs....
> > > > > > 
> > > > > 
> > > > > It's overlapping a buffered write and so starts with a flush instead of
> > > > 
> > > > Ok, but that's not the normal usage of the phrase "overlapping
> > > > direct IO".  Overlapping direct IO means two concurrent direct IOs
> > > > to the same LBA range in the device underlying the filesystem. i.e.
> > > > they overlap in both the temporal and physical (device LBA) domains.
> > > > 
> > > > Subtle difference, yes, but a very important one.
> > > > 
> > > 
> > > Yeah, I just referenced it in passing and was unclear. The point is just
> > > that there are multiple vectors to the problem, some of which are sane
> > > sequences and some apparently not, and I'd rather not get caught up in
> > > the minutiae of why some of those sequences are not sane. I'm trying to
> > > see if we can find a solution to the fundamental data exposure problem.
> > 
> > Is there a way to increase the amount of writeback?  Say for example
> > that we have a delalloc reservation for blocks 0-9, and writeback asks
> > us to write back block 7.  If we succeed at converting the entire
> > 10-block reservation into a real extent, why can't we come back and say
> > "Hey, we also need to flush the pages for blocks 0-6,8-9"?  Are we going
> > to need something like that for blocksize > pagesize filesystems too?
> > 
> 
> That might be possible in various places. In fact, I think we used to do
> something like this in XFS (perhaps not for this purpose). I just jumped
> back to v2.6.32 and see that the xfs_cluster_write() mechanism still
> exists at that point down in the ->writepage() path.
> 
> I'm not sure if doing that alone is worth the effort if you consider
> it's more of a mitigation than a solution to the underlying problem.
> Once we convert the associated delalloc extent, we're open to stale data
> exposure (assuming we're not protected by on-disk EOF and whatnot) until
> all of the remaining pages over the extent are written back, whether
> that occurs in the current writeback cycle or a subsequent one.
> 
> I suppose it might make sense in service to another solution, such as
> writing back everything preceding a page that is going to update on-disk
> EOF (which is something I think Dave suggested earlier as an incremental
> improvement, I've kind of lost track). We'd probably just have to make
> sure that the on-disk eof can never get ahead of the lower index
> writeback pages by I/O reordering or whatever..

...which would be easi(er) to handle if we could chain all the
file-extending writeback bios to a single ioend like you suggested in
the per-inode ioend completion thread, right?

Though such a beast would still be racy, I think, because background
writeback could start writing back a /single/ page from the middle of
the post-eof range before the integrity writeback flushes everything
else, and if the integrity writeback completes before the background
writeback does, we're still exposed.  I think?

Dunno, maybe the per-inode completion can help here, since at least if
the completions all come in at the same time we'll reorder and combine
the metadata work, though that's not a full solution either.
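
A throwaway userspace sketch of the kind of coalescing I mean (made-up
names, nothing like the real ioend structures): if completions are kept
sorted by file offset, contiguous ranges can be folded together so that
one completion pass, and hence one metadata update, covers the whole run.

```c
/*
 * Coalesce contiguous [off, off+len) ranges in offset-sorted parallel
 * arrays, in place.  Returns the number of ranges remaining.  Purely
 * illustrative; a real ioend would carry bios, inode state, etc.
 */
int ioend_merge(long long *off, long long *len, int n)
{
	int out = 0;

	for (int i = 0; i < n; i++) {
		if (out && off[out - 1] + len[out - 1] == off[i]) {
			len[out - 1] += len[i];	/* absorb the neighbour */
		} else {
			off[out] = off[i];
			len[out] = len[i];
			out++;
		}
	}
	return out;
}
```

Three back-to-back 4k completions collapse into a single 12k one, while
a disjoint range past a hole stays a separate completion.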

> > > 
> > > > > an unwritten allocation (so even a dio read is sufficient). Also, the
> > > > > target of the dio isn't where stale data is exposed, so even if there
> > > > > was an existing underlying unwritten block before the initial buffered
> > > > > write it wouldn't change anything.
> > > > > 
> > > > > Yes, this is technically usage with undefined behavior, but that still
> > > > > misses the forest from the trees.
> > > > 
> > > > No, I'm not missing the forest for the trees.
> > > > 
> > > > > I'm not claiming this is good or
> > > > > supported practice just as I'm not claiming sync_file_range() should
> > > > > be used for anything in practice. I'm just pointing out it's one of
> > > > > several possible, obvious data exposure vectors. I suspect there are
> > > > > others; these were just the easy variants to reproduce.
> > > > 
> > > > I'm aware that we have these problems - we've known about them for
> > > > 20+ years and the many ways they can be exposed. Indeed, we've been
> > > > talking about using unwritten extents for delalloc for as long as I
> > > > can remember, but never done it because the actual occurrence of
> > > > these problems in production systems is /extremely/ rare and the
> > > > cost of using unwritten extents everywhere will affect /everyone/.
> > > > 
> > > > i.e. It's always been easy to create test conditions to expose stale
> > > > data, but reported cases of stale data exposure after a production
> > > > system crash are pretty much non-existent. This just isn't a common
> > > > problem that users are experiencing, but solving it with "unwritten
> > > > extents for everyone" comes at a substantial cost for /everyone/,
> > > > /all the time/.
> > > > 
> > > 
> > > I know you're aware. I'm pretty sure we've discussed this directly in
> > > the past (with regard to how buffer heads were an impediment to a
> > > solution) and that you made me aware of the issue in the first place. ;)
> > > 
> > > > That's why a) it hasn't been done yet; and b) any discussion about
> > > > "use unwritten extents everywhere" always ends up in trying to find
> > > > a solution that isn't just applying the "big hammer" and ignoring
> > > > the problems that causes.
> > > > 
> > > 
> > > Right..
> > > 
> > > > > If you prefer to
> > > > > reason about reproducer patterns that don't use operations with quirks
> > > > > or undefined behaviors, that's fine, just use the reflink or hole punch
> > > > > example posted above that produce the same behavior.
> > > > 
> > > > Well, yes. But it does go both ways - if you don't want people to
> > > > say "that's undefined" or "that makes no sense", then don't use
> > > > those quirky examples if you can use better, unambiguous examples.
> > > > 
> > > 
> > > The sync_file_range() reproducer was a poor example in that regard. I'm
> > 
> > (I don't want to stir this up again but just so we're all clear I had
> > figured the s_f_r call was really just a way to simulate random drive-by
> > scattershot background writeback for testing purposes.)
> > 
> 
> Ack.
> 
> > > not concerned over that being pointed out, but I've explained several
> > > times subsequently that we can replace the sync_file_range() calls with
> > > a sane command (hole punch) to reproduce the same general behavior and
> > > resulting data exposure problem and thus the quirks of the former are
> > > not a primary factor. Despite that, this discussion has somehow turned
> > > into an appraisal of sync_file_range(). Moving on..
> > > 
> > > I posted some comments/questions around your log item suggestion for
> > > extent zeroing during recovery in my previous reply and that part
> > > apparently got completely snipped out and ignored. :/ Any feedback on
> > > that?
> > 
> > A(nother) big downside of adding a new "write intent": now we need to
> > add a log incompat flag which will occasionally generate "I wrote
> > something, crashed, and this old knoppix cd won't recover the fs for
> > some reason" reports.
> > 
> 
> Yeah, though I think it's not really recommended to process an fs with a
> dirty log across kernels like that.

That may be, but it seems to work most often enough that our service
techs think they can do it 100%.  And I have the bug reports from when
that doesn't work to prove it. :(

> We already have some recovery
> restrictions that prevent this, such as on-disk order for example. Of
> course that's different from just going back to an older kernel version
> on the same hardware and getting stuck that way because of a feature bit
> or something.

<nod>

> > > > For example, that's exactly how I phrased this extension flushing
> > > > mechanism trigger:
> > > > 
> > > > > > Essentially, what this says to me is we need to track when we
> > > > > > extended the file via a write beyond EOF via an inode flag. Then if we need to
> > > > > > flush a data range from the file for integrity reasons (such as a
> > > > > > fpunch) and this flag is set, we ignore the "start" parameter of the
> > > > 
> > > > See? Entirely generic trigger description, uses hole punch as an
> > > > example, but also covers all the quirky cases such as direct IO
> > > > range flushing without having to mention them explicitly.
> > > > 
> > > > > > range and flush from the start of the file instead. i.e. we
> > > > > > guarantee that any dirty zeroed pages between the start of the file
> > > > > > and the end of the range we are operating on gets written back. This
> > > > > > will capture all these "didn't flush regions zeroed on file
> > > > > > extension" problems in one go, without needing to change the way we
> > > > > > do allocation or EOF zeroing.
> > > > > > 
> > > > > 
> > > > > I don't think that would prevent the problem outside of the common write
> > > > > extension case.
> > > > 
> > > > I feel like we're going around in circles now, Brian. :/
> > > > 
> > > 
> > > Yes, let's try to reset here...
> > > 
> > > > I didn't propose this as the "only solution we need". I've proposed
> > > > it in the context that we use "only use unwritten extents for
> > > > allocations within EOF" to solve the common "write within EOF" case.
> > > > This, however, doesn't solve this common write extension case where
> > > > we don't want to use unwritten extents because of the performance
> > > > cost.
> > > > 
> > > > And now you're saying that "well, this extending write thing
> > > > doesn't solve the inside EOF problem"......
> > > > 
> > > 
> > > Ok, I lost that context. As I said above, I can see how this could be
> > > useful in combination with another approach to handle inside eof
> > > allocations. There are still some loose ends (in my mind) to consider
> > > with this technique, however...
> > > 
> > > This requires that we always guarantee in-order writeback, right? I can
> > > see how we can do that in the cases we control based on the flag based
> > > approach you suggested. But do we really have enough control within the
> > > fs to always prevent out of order writeback? What about non-integrity
> > > background writeback? AFAICT it's technically possible to writeback out
> > > of expected order (the cyclic range logic looks particularly suspicious
> > > to me if you consider a mapping can be truncated and repopulated many
> > > times over without resetting the cycle index). What about memory
> > > reclaim? Is it not possible to hit the ->writepage() path via reclaim
> > > for an arbitrary page in a mapping? It's not clear to me if page
> > > migration is a factor for us as well, but it also looks like it has some
> > > specific page writeback hooks.
> > > 
> > > I suppose there are technically ways around these things. We could skip
> > > writeback of certain pages in some cases (but that may have unacceptable
> > > consequences) or make the decision to use one exposure prevention
> > > approach or another dynamically at writepage time based on the state of
> > > the inode, mapping and associated page. I'd think background writeback
> > > is the common case as opposed to things like hole punch and whatnot,
> > > however, so a dynamic solution leaves the effectiveness of this approach
> > > somewhat open ended unless we characterize how all possible writeback
> > > scenarios behave.
> > > 
> > > Given that, it seems more useful to me to work through the inside eof
> > > extent allocation scenario first and work back from there. Then we can
> > > at least try to establish potential performance impact and determine if
> > > we should try to work around it in certain cases, etc.
> > 
> > <nod> I'll get back to working on two parts of fixing this:
> > 
> > a) adapting the "convert delalloc to unwritten" patch that started this
> > thread so that it only does that for conversions that start within eof,
> > 
> > b) fixing your punch case by extending the range that xfs_flush_unmap_range
> > asks to flush if the caller is trying to flush a range past the ondisk
> > eof.
> > 
> > Though I think (b) is unnecessary if we had the ability to flush all the
> > pages fronting a delalloc reservation that we just converted to a real
> > extent, as I muttered above.
> > 
> 
> I think it's more that we have to order all such page writebacks against
> the on-disk size update so the latter protects us from exposure in the
> event of a crash, but otherwise this sounds like a reasonable next step
> to me..

<nod> Figuring out ordered completion is going to take some thought.

--D

> Brian
> 
> > > On that note,
> > > please see the part that was snipped out of my earlier reply[1], in
> > > particular the part that ended with "Am I following the high level idea
> > > correctly?" :P Thanks.
> > 
> > I thought your take at least matched my interpretation of Dave's
> > comments.
> > 
> > Thoughts?
> > 
> > --D
> > 
> > > 
> > > Brian
> > > 
> > > [1] https://marc.info/?l=linux-xfs&m=155308332901975&w=2
> > > 
> > > > > It may be possible to selectively (i.e. as an optimization) use an
> > > > > XFS_ITRUNCATED like algorithm as you propose here to always force in
> > > > > order writeback in the cases where writeback extends the on-disk isize
> > > > > (and thus isize protects from stale exposure), but I think we'd still
> > > > > need something tied to extent allocation one way or another in addition
> > > > > (like Darrick's original unwritten extent patch or your proposed
> > > > > intent/done idea in the previous reply) to address the fundamental
> > > > > problem. The question is what is the ideal solution..
> > > > 
> > > > The solution will be a combination of things that work together, not
> > > > one big hammer. The more we can avoid exposure vectors by
> > > > operational ordering rather than transactional state changes, the
> > > > better the solution will perform.
> > > > 
> > > > Cheers,
> > > > 
> > > > Dave.
> > > > -- 
> > > > Dave Chinner
> > > > david@fromorbit.com
Brian Foster April 4, 2019, 4:16 p.m. UTC | #19
On Thu, Apr 04, 2019 at 08:22:54AM -0700, Darrick J. Wong wrote:
> On Thu, Apr 04, 2019 at 08:50:54AM -0400, Brian Foster wrote:
> > On Wed, Apr 03, 2019 at 03:38:09PM -0700, Darrick J. Wong wrote:
> > > On Fri, Mar 22, 2019 at 08:46:15AM -0400, Brian Foster wrote:
> > > > On Fri, Mar 22, 2019 at 12:52:40PM +1100, Dave Chinner wrote:
> > > > > On Thu, Mar 21, 2019 at 08:27:40AM -0400, Brian Foster wrote:
> > > > > > On Thu, Mar 21, 2019 at 08:10:19AM +1100, Dave Chinner wrote:
> > > > > > > On Wed, Mar 20, 2019 at 08:02:05AM -0400, Brian Foster wrote:
> > > > > > > > On Wed, Mar 20, 2019 at 07:25:37AM +1100, Dave Chinner wrote:
> > > > > > > > > > > The broader point is that by controlling writeback ordering, we can
> > > > > > > > > > > explicitly demonstrate stale data exposure. If fixed properly, we should
> > > > > > > > > > > never expect stale data exposure even in the event of out of order
> > > > > > > > > > > writeback completion, regardless of the cause.
> > > > > > > > > > 
> > > > > > > > > > Agree.  Even if s_f_r is a giant trash fire, if we're exposing stale
> > > > > > > > > > disk contents it's game over anyway and we ought to fix it.
> > > > > > > > > 
> > > > > > > > > Well, that's exactly the problem with SFR, isn't it? The dirty data
> > > > > > > > > is outside the range the user asks to sync and the implementation
> > > > > > > > > doesn't go anywhere near the filesystem so we can't actually do the
> > > > > > > > > right thing here. So we are left to either hack fixes around a
> > > > > > > > > shitty, broken interface and everyone pays the penalty, or we ignore
> > > > > > > > > it. We've simply ignored it for the past 10+ years because it really
> > > > > > > > > can't be used safely for its intended purposes (as everyone who
> > > > > > > > > tries to use it finds out eventually)...
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Just to make this abundantly clear, there is nothing unique, magic or
> > > > > > > > special required of sync_file_range() to reproduce this stale data
> > > > > > > > exposure. See the following variation of a command sequence from the
> > > > > > > > fstest I posted the other day:
> > > > > > > > 
> > > > > > > > ...
> > > > > > > > # xfs_io -fc "pwrite 0 68k" -c fsync -c "pwrite 96k 4k" \
> > > > > > > > 	-c "fpunch 96k 4k" <file>
> > > > > > > > ...
> > > > > > > > # hexdump <file>
> > > > > > > > 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
> > > > > > > > *
> > > > > > > > 0011000 abab abab abab abab abab abab abab abab
> > > > > > > > *
> > > > > > > > 0018000 0000 0000 0000 0000 0000 0000 0000 0000
> > > > > > > > *
> > > > > > > > 0019000
> > > > > > > > 
> > > > > > > > Same problem, and it can be reproduced just as easily with a reflink or
> > > > > > > > even an overlapping direct I/O.
> > > > > > > 
> > > > > > > How does overlapping direct IO cause stale data exposure given that
> > > > > > > it uses unwritten extent allocation? That, of course, is ignoring
> > > > > > > the fact that the result of overlapping concurrent direct IO is, by
> > > > > > > definition, undefined and an application bug if it occurs....
> > > > > > > 
> > > > > > 
> > > > > > It's overlapping a buffered write and so starts with a flush instead of
> > > > > 
> > > > > Ok, but that's not the normal usage of the phrase "overlapping
> > > > > direct IO".  Overlapping direct IO means two concurrent direct IOs
> > > > > to the same LBA range in the device underlying the filesystem. i.e.
> > > > > they overlap in both the temporal and physical (device LBA) domains.
> > > > > 
> > > > > Subtle difference, yes, but a very important one.
> > > > > 
> > > > 
> > > > Yeah, I just referenced it in passing and was unclear. The point is just
> > > > that there are multiple vectors to the problem, some of which are sane
> > > > sequences and some apparently not, and I'd rather not get caught up in
> > > > the minutiae of why some of those sequences are not sane. I'm trying to
> > > > see if we can find a solution to the fundamental data exposure problem.
> > > 
> > > Is there a way to increase the amount of writeback?  Say for example
> > > that we have a delalloc reservation for blocks 0-9, and writeback asks
> > > us to write back block 7.  If we succeed at converting the entire
> > > 10-block reservation into a real extent, why can't we come back and say
> > > "Hey, we also need to flush the pages for blocks 0-6,8-9"?  Are we going
> > > to need something like that for blocksize > pagesize filesystems too?
> > > 
> > 
> > That might be possible in various places. In fact, I think we used to do
> > something like this in XFS (perhaps not for this purpose). I just jumped
> > back to v2.6.32 and see that the xfs_cluster_write() mechanism still
> > exists at that point down in the ->writepage() path.
> > 
> > I'm not sure if doing that alone is worth the effort if you consider
> > it's more of a mitigation than a solution to the underlying problem.
> > Once we convert the associated delalloc extent, we're open to stale data
> > exposure (assuming we're not protected by on-disk EOF and whatnot) until
> > all of the remaining pages over the extent are written back, whether
> > that occurs in the current writeback cycle or a subsequent one.
> > 
> > I suppose it might make sense in service to another solution, such as
> > writing back everything preceding a page that is going to update on-disk
> > EOF (which is something I think Dave suggested earlier as an incremental
> > improvement, I've kind of lost track). We'd probably just have to make
> > sure that the on-disk eof can never get ahead of the lower index
> > writeback pages by I/O reordering or whatever..
> 
> ...which would be easi(er) to handle if we could chain all the
> file-extending writeback bios to a single ioend like you suggested in
> the per-inode ioend completion thread, right?
> 

Hmm, I think that could be a useful tool in some scenarios... to combine
all append bios to a single completion and i_size update for example. As
you suggest below, that by itself probably doesn't cover all of the
possible scenarios.

> Though such a beast would still be racy, I think, because background
> writeback could start writing back a /single/ page from the middle of
> the post-eof range before the integrity writeback flushes everything
> else, and if the integrity writeback completes before the background
> writeback does, we're still exposed.  I think?
> 

Basically, I think we're exposed if we ever allow an on-disk size update
to commit where delalloc extents have been converted within that EOF and
not fully written back. I think there's probably a matrix of different
ways to avoid that. We could batch append ioends as noted above, perhaps
restrict or defer append transactions until the problematic state no
longer exists on an inode (i.e., defer the size update until the range
within the new on-disk EOF is naturally stabilized instead of trying to
force it to happen in various writeback scenarios), and/or incorporate
more restrictive delalloc conversion and selective use of unwritten
extents (probably at some performance/fragmentation cost), etc.
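
To make the "defer the size update" option concrete, here's a minimal
userspace sketch (hypothetical helper, not XFS code) of the invariant:
the on-disk size only advances to the end of the longest fully
written-back prefix, so a crash can never land converted but not yet
written blocks inside on-disk EOF.

```c
/*
 * written[b] is nonzero once block b has been written back.  Starting
 * from the current on-disk EOF (in blocks), return the furthest EOF we
 * could safely commit: the end of the fully written-back prefix.
 * Illustrative only -- block granularity, no locking, no extent tree.
 */
int safe_ondisk_eof(const int *written, int nblocks, int ondisk_eof)
{
	int eof = ondisk_eof;

	while (eof < nblocks && written[eof])
		eof++;
	return eof;
}
```

With blocks 0-2 written back but block 3 still dirty, the size update
stops at block 3 regardless of how far block 4's completion has gotten.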

> Dunno, maybe the per-inode completion can help here, since at least if
> the completions all come in at the same time we'll reorder and combine
> the metadata work, though that's not a full solution either.
> 

Yeah, we'd be relying on timing to facilitate completion batching, at
least in the current implementation. Perhaps we could reuse that
mechanism to explicitly defer append transactions such that we don't
expose stale data. For example, keep the ioend/tx around until there are
no delalloc or "stale" blocks within the new on-disk EOF, assuming we
can find a way to track "stale" blocks (in-core extent state?). The
tradeoff for something like that would be on-disk eof may not update as
fast in the event of a crash, but that doesn't seem unreasonable to me.
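
As a sketch of that gating (again hypothetical userspace code, where
"stale" means converted to a real extent but not yet written back): the
queued append ioend/tx stays deferred until the predicate below holds
for its target size.

```c
enum blk_state { BS_WRITTEN, BS_DELALLOC, BS_STALE };

/*
 * May the deferred append transaction move the on-disk size out to
 * 'new_eof' (in blocks)?  Only once no delalloc or stale blocks remain
 * below it; otherwise keep the ioend/tx queued.
 */
int append_tx_can_run(const enum blk_state *state, int new_eof)
{
	for (int b = 0; b < new_eof; b++)
		if (state[b] != BS_WRITTEN)
			return 0;
	return 1;
}
```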

> > > > 
> > > > > > an unwritten allocation (so even a dio read is sufficient). Also, the
> > > > > > target of the dio isn't where stale data is exposed, so even if there
> > > > > > was an existing underlying unwritten block before the initial buffered
> > > > > > write it wouldn't change anything.
> > > > > > 
> > > > > > Yes, this is technically usage with undefined behavior, but that still
> > > > > > misses the forest from the trees.
> > > > > 
> > > > > No, I'm not missing the forest for the trees.
> > > > > 
> > > > > > I'm not claiming this is good or
> > > > > > supported practice just as I'm not claiming sync_file_range() should
> > > > > > be used for anything in practice. I'm just pointing out it's one of
> > > > > > several possible, obvious data exposure vectors. I suspect there are
> > > > > > others; these were just the easy variants to reproduce.
> > > > > 
> > > > > I'm aware that we have these problems - we've known about them for
> > > > > 20+ years and the many ways they can be exposed. Indeed, we've been
> > > > > talking about using unwritten extents for delalloc for as long as I
> > > > > can remember, but never done it because the actual occurrence of
> > > > > these problems in production systems is /extremely/ rare and the
> > > > > cost of using unwritten extents everywhere will affect /everyone/.
> > > > > 
> > > > > i.e. It's always been easy to create test conditions to expose stale
> > > > > data, but reported cases of stale data exposure after a production
> > > > > system crash are pretty much non-existent. This just isn't a common
> > > > > problem that users are experiencing, but solving it with "unwritten
> > > > > extents for everyone" comes at a substantial cost for /everyone/,
> > > > > /all the time/.
> > > > > 
> > > > 
> > > > I know you're aware. I'm pretty sure we've discussed this directly in
> > > > the past (with regard to how buffer heads were an impediment to a
> > > > solution) and that you made me aware of the issue in the first place. ;)
> > > > 
> > > > > That's why a) it hasn't been done yet; and b) any discussion about
> > > > > "use unwritten extents everywhere" always ends up in trying to find
> > > > > a solution that isn't just applying the "big hammer" and ignoring
> > > > > the problems that causes.
> > > > > 
> > > > 
> > > > Right..
> > > > 
> > > > > > If you prefer to
> > > > > > reason about reproducer patterns that don't use operations with quirks
> > > > > > or undefined behaviors, that's fine, just use the reflink or hole punch
> > > > > > example posted above that produce the same behavior.
> > > > > 
> > > > > Well, yes. But it does go both ways - if you don't want people to
> > > > > say "that's undefined" or "that makes no sense", then don't use
> > > > > those quirky examples if you can use better, unambiguous examples.
> > > > > 
> > > > 
> > > > The sync_file_range() reproducer was a poor example in that regard. I'm
> > > 
> > > (I don't want to stir this up again but just so we're all clear I had
> > > figured the s_f_r call was really just a way to simulate random drive-by
> > > scattershot background writeback for testing purposes.)
> > > 
> > 
> > Ack.
> > 
> > > > not concerned over that being pointed out, but I've explained several
> > > > times subsequently that we can replace the sync_file_range() calls with
> > > > a sane command (hole punch) to reproduce the same general behavior and
> > > > resulting data exposure problem and thus the quirks of the former are
> > > > not a primary factor. Despite that, this discussion has somehow turned
> > > > into an appraisal of sync_file_range(). Moving on..
> > > > 
> > > > I posted some comments/questions around your log item suggestion for
> > > > extent zeroing during recovery in my previous reply and that part
> > > > apparently got completely snipped out and ignored. :/ Any feedback on
> > > > that?
> > > 
> > > A(nother) big downside of adding a new "write intent": now we need to
> > > add a log incompat flag which will occasionally generate "I wrote
> > > something, crashed, and this old knoppix cd won't recover the fs for
> > > some reason" reports.
> > > 
> > 
> > Yeah, though I think it's not really recommended to process an fs with a
> > dirty log across kernels like that.
> 
> That may be, but it seems to work most often enough that our service
> techs think they can do it 100%.  And I have the bug reports from when
> that doesn't work to prove it. :(
> 

Heh. :P

> > We already have some recovery
> > restrictions that prevent this, such as on-disk order for example. Of
> > course that's different from just going back to an older kernel version
> > on the same hardware and getting stuck that way because of a feature bit
> > or something.
> 
> <nod>
> 
> > > > > For example, that's exactly how I phrased this extension flushing
> > > > > mechanism trigger:
> > > > > 
> > > > > > > Essentially, what this says to me is we need to track when we
> > > > > > > extended the file via a write beyond EOF via an inode flag. Then if we need to
> > > > > > > flush a data range from the file for integrity reasons (such as a
> > > > > > > fpunch) and this flag is set, we ignore the "start" parameter of the
> > > > > 
> > > > > See? Entirely generic trigger description, uses hole punch as an
> > > > > example, but also covers all the quirky cases such as direct IO
> > > > > range flushing without having to mention them explicitly.
> > > > > 
> > > > > > > range and flush from the start of the file instead. i.e. we
> > > > > > > guarantee that any dirty zeroed pages between the start of the file
> > > > > > > and the end of the range we are operating on gets written back. This
> > > > > > > will capture all these "didn't flush regions zeroed on file
> > > > > > > extension" problems in one go, without needing to change the way we
> > > > > > > do allocation or EOF zeroing.
> > > > > > > 
> > > > > > 
> > > > > > I don't think that would prevent the problem outside of the common write
> > > > > > extension case.
> > > > > 
> > > > > I feel like we're going around in circles now, Brian. :/
> > > > > 
> > > > 
> > > > Yes, let's try to reset here...
> > > > 
> > > > > I didn't propose this as the "only solution we need". I've proposed
> > > > > it in the context that we use "only use unwritten extents for
> > > > > allocations within EOF" to solve the common "write within EOF" case.
> > > > > This, however, doesn't solve this common write extension case where
> > > > > we don't want to use unwritten extents because of the performance
> > > > > cost.
> > > > > 
> > > > > And now you're saying that "well, this extending write thing
> > > > > doesn't solve the inside EOF problem"......
> > > > > 
> > > > 
> > > > Ok, I lost that context. As I said above, I can see how this could be
> > > > useful in combination with another approach to handle inside eof
> > > > allocations. There are still some loose ends (in my mind) to consider
> > > > with this technique, however...
> > > > 
> > > > This requires that we always guarantee in-order writeback, right? I can
> > > > see how we can do that in the cases we control based on the flag based
> > > > approach you suggested. But do we really have enough control within the
> > > > fs to always prevent out of order writeback? What about non-integrity
> > > > background writeback? AFAICT it's technically possible to writeback out
> > > > of expected order (the cyclic range logic looks particularly suspicious
> > > > to me if you consider a mapping can be truncated and repopulated many
> > > > times over without resetting the cycle index). What about memory
> > > > reclaim? Is it not possible to hit the ->writepage() path via reclaim
> > > > for an arbitrary page in a mapping? It's not clear to me if page
> > > > migration is a factor for us as well, but it also looks like it has some
> > > > specific page writeback hooks.
> > > > 
> > > > I suppose there are technically ways around these things. We could skip
> > > > writeback of certain pages in some cases (but that may have unacceptable
> > > > consequences) or make the decision to use one exposure prevention
> > > > approach or another dynamically at writepage time based on the state of
> > > > the inode, mapping and associated page. I'd think background writeback
> > > > is the common case as opposed to things like hole punch and whatnot,
> > > > however, so a dynamic solution leaves the effectiveness of this approach
> > > > somewhat open ended unless we characterize how all possible writeback
> > > > scenarios behave.
> > > > 
> > > > Given that, it seems more useful to me to work through the inside eof
> > > > extent allocation scenario first and work back from there. Then we can
> > > > at least try to establish potential performance impact and determine if
> > > > we should try to work around it in certain cases, etc.
> > > 
> > > <nod> I'll get back to working on two parts of fixing this:
> > > 
> > > a) adapting the "convert delalloc to unwritten" patch that started this
> > > thread so that it only does that for conversions that start within eof,
> > > 
> > > b) fixing your punch case by extending the range that xfs_flush_unmap_range
> > > asks to flush if the caller is trying to flush a range past the ondisk
> > > eof.
> > > 
> > > Though I think (b) is unnecessary if we had the ability to flush all the
> > > pages fronting a delalloc reservation that we just converted to a real
> > > extent, as I muttered above.
> > > 
> > 
> > I think it's more that we have to order all such page writebacks against
> > the on-disk size update so the latter protects us from exposure in the
> > event of a crash, but otherwise this sounds like a reasonable next step
> > to me..
> 
> <nod> Figuring out ordered completion is going to take some thought.
> 

Indeed, it's not really trivial. I think we've kind of established
various potential techniques to use with various associated tradeoffs.
Just using unwritten extents has performance issues, the write intent
thing has backwards compatibility tradeoffs, etc.

Perhaps if we can find some way to cleverly order writeback and size
updates in the most common cases, we could then pick a suitably limited
fallback and not have to worry as much about the associated tradeoffs.

Brian

> --D
> 
> > Brian
> > 
> > > > On that note,
> > > > please see the part that was snipped out of my earlier reply[1], in
> > > > particular the part that ended with "Am I following the high level idea
> > > > correctly?" :P Thanks.
> > > 
> > > I thought your take at least matched my interpretation of Dave's
> > > comments.
> > > 
> > > Thoughts?
> > > 
> > > --D
> > > 
> > > > 
> > > > Brian
> > > > 
> > > > [1] https://marc.info/?l=linux-xfs&m=155308332901975&w=2
> > > > 
> > > > > > It may be possible to selectively (i.e. as an optimization) use an
> > > > > > XFS_ITRUNCATED like algorithm as you propose here to always force in
> > > > > > order writeback in the cases where writeback extends the on-disk isize
> > > > > > (and thus isize protects from stale exposure), but I think we'd still
> > > > > > need something tied to extent allocation one way or another in addition
> > > > > > (like Darrick's original unwritten extent patch or your proposed
> > > > > > intent/done idea in the previous reply) to address the fundamental
> > > > > > problem. The question is what is the ideal solution..
> > > > > 
> > > > > The solution will be a combination of things that work together, not
> > > > > one big hammer. The more we can avoid exposure vectors by
> > > > > operational ordering rather than transactional state changes, the
> > > > > better the solution will perform.
> > > > > 
> > > > > Cheers,
> > > > > 
> > > > > Dave.
> > > > > -- 
> > > > > Dave Chinner
> > > > > david@fromorbit.com
Dave Chinner April 4, 2019, 10:08 p.m. UTC | #20
On Thu, Apr 04, 2019 at 08:50:54AM -0400, Brian Foster wrote:
> On Wed, Apr 03, 2019 at 03:38:09PM -0700, Darrick J. Wong wrote:
> > On Fri, Mar 22, 2019 at 08:46:15AM -0400, Brian Foster wrote:
> > > On Fri, Mar 22, 2019 at 12:52:40PM +1100, Dave Chinner wrote:
> > > > On Thu, Mar 21, 2019 at 08:27:40AM -0400, Brian Foster wrote:
> > > > > On Thu, Mar 21, 2019 at 08:10:19AM +1100, Dave Chinner wrote:
> > > > > > On Wed, Mar 20, 2019 at 08:02:05AM -0400, Brian Foster wrote:
> > > > > > > On Wed, Mar 20, 2019 at 07:25:37AM +1100, Dave Chinner wrote:
> > > > > > > > > > The broader point is that by controlling writeback ordering, we can
> > > > > > > > > > explicitly demonstrate stale data exposure. If fixed properly, we should
> > > > > > > > > > never expect stale data exposure even in the event of out of order
> > > > > > > > > > writeback completion, regardless of the cause.
> > > > > > > > > 
> > > > > > > > > Agree.  Even if s_f_r is a giant trash fire, if we're exposing stale
> > > > > > > > > disk contents it's game over anyway and we ought to fix it.
> > > > > > > > 
> > > > > > > > Well, that's exactly the problem with SFR, isn't it? The dirty data
> > > > > > > > is outside the range the user asks to sync and the implementation
> > > > > > > > doesn't go anywhere near the filesystem so we can't actually do the
> > > > > > > > right thing here. So we are left to either hack fixes around a
> > > > > > > > shitty, broken interface and everyone pays the penalty, or we ignore
> > > > > > > > it. We've simply ignored it for the past 10+ years because it really
> > > > > > > > can't be used safely for its intended purposes (as everyone who
> > > > > > > > tries to use it finds out eventually)...
> > > > > > > > 
> > > > > > > 
> > > > > > > Just to make this abundantly clear, there is nothing unique, magic or
> > > > > > > special required of sync_file_range() to reproduce this stale data
> > > > > > > exposure. See the following variation of a command sequence from the
> > > > > > > fstest I posted the other day:
> > > > > > > 
> > > > > > > ...
> > > > > > > # xfs_io -fc "pwrite 0 68k" -c fsync -c "pwrite 96k 4k" \
> > > > > > > 	-c "fpunch 96k 4k" <file>
> > > > > > > ...
> > > > > > > # hexdump <file>
> > > > > > > 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
> > > > > > > *
> > > > > > > 0011000 abab abab abab abab abab abab abab abab
> > > > > > > *
> > > > > > > 0018000 0000 0000 0000 0000 0000 0000 0000 0000
> > > > > > > *
> > > > > > > 0019000
> > > > > > > 
> > > > > > > Same problem, and it can be reproduced just as easily with a reflink or
> > > > > > > even an overlapping direct I/O.
> > > > > > 
> > > > > > How does overlapping direct IO cause stale data exposure given that
> > > > > > it uses unwritten extent allocation? That, of course, is ignoring
> > > > > > the fact that the result of overlapping concurrent direct IO is, by
> > > > > > definition, undefined and an application bug if it occurs....
> > > > > > 
> > > > > 
> > > > > It's overlapping a buffered write and so starts with a flush instead of
> > > > 
> > > > Ok, but that's not the normal usage of the phrase "overlapping
> > > > direct IO".  Overlapping direct IO means two concurrent direct IOs
> > > > to the same LBA range in the device underlying the filesystem. i.e.
> > > > they overlap in both the temporal and physical (device LBA) domains.
> > > > 
> > > > Subtle difference, yes, but a very important one.
> > > > 
> > > 
> > > Yeah, I just referenced it in passing and was unclear. The point is just
> > > that there are multiple vectors to the problem, some of which are sane
> > > sequences and some apparently not, and I'd rather not get caught up in
> > > the minutiae of why some of those sequences are not sane. I'm trying to
> > > see if we can find a solution to the fundamental data exposure problem.
> > 
> > Is there a way to increase the amount of writeback?  Say for example
> > that we have a delalloc reservation for blocks 0-9, and writeback asks
> > us to write back block 7.  If we succeed at converting the entire
> > 10-block reservation into a real extent, why can't we come back and say
> > "Hey, we also need to flush the pages for blocks 0-6,8-9"?  Are we going
> > to need something like that for blocksize > pagesize filesystems too?
> > 
> 
> That might be possible in various places. In fact, I think we used to do
> something like this in XFS (perhaps not for this purpose). I just jumped
> back to v2.6.32 and see that the xfs_cluster_write() mechanism still
> exists at that point down in the ->writepage() path.

We /really/ don't want to go back to that. This caused us to have to
essentially implement write_cache_pages() inside ->writepage.

If we want to expand the writeback range, (i.e. move the offset we
start at) then we need to do that in xfs_vm_writepages(), before we
call write_cache_pages(). This means all the tagged writepages stuff
will continue to work, and we don't end up doing redundant page
cache lookups because we're trying to write back the same range from
two different places.

> I'm not sure if doing that alone is worth the effort if you consider
> it's more of a mitigation than a solution to the underlying problem.
> Once we convert the associated delalloc extent, we're open to stale data
> exposure (assuming we're not protected by on-disk EOF and whatnot) until
> all of the remaining pages over the extent are written back, whether
> that occurs in the current writeback cycle or a subsequent one.

If we were to do a read iomap lookup in xfs_vm_writepages() we could
determine if the delalloc conversion would allocate a range outside
the current write that is being flushed, and modify the
writeback control range/offset to match the delalloc extent.

however, this runs the risk of breaking all the fairness algorithms
in high level writeback, so we'd need to be very careful about how
we extend the write range for writeback inside EOF....

Cheers,

Dave.
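[Editorial note: the range-expansion idea above — do a mapping lookup before
write_cache_pages() and widen the flush range so it covers the whole delalloc
extent being converted — can be sketched in a few lines. This is a toy model
in Python, not XFS code; the function name and the byte-range representation
are invented for illustration.]

```python
def expand_writeback_range(wb_start, wb_end, delalloc_extents):
    """Widen a writeback byte range [wb_start, wb_end) to cover any
    delalloc extent it overlaps.

    Models the proposal: if converting an overlapping delalloc extent
    would allocate blocks outside the range being flushed, extend the
    range to match the extent so no just-allocated block is left
    without writeback scheduled against it.
    """
    for ext_start, ext_end in delalloc_extents:
        if ext_start < wb_end and wb_start < ext_end:  # ranges overlap
            wb_start = min(wb_start, ext_start)
            wb_end = max(wb_end, ext_end)
    return wb_start, wb_end
```

As noted above, doing this for ranges inside EOF risks upsetting the
high-level writeback fairness algorithms, so a real implementation would
presumably need to cap how far the range may grow.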
Brian Foster April 5, 2019, 11:24 a.m. UTC | #21
On Fri, Apr 05, 2019 at 09:08:45AM +1100, Dave Chinner wrote:
> On Thu, Apr 04, 2019 at 08:50:54AM -0400, Brian Foster wrote:
> > On Wed, Apr 03, 2019 at 03:38:09PM -0700, Darrick J. Wong wrote:
> > > On Fri, Mar 22, 2019 at 08:46:15AM -0400, Brian Foster wrote:
> > > > On Fri, Mar 22, 2019 at 12:52:40PM +1100, Dave Chinner wrote:
> > > > > On Thu, Mar 21, 2019 at 08:27:40AM -0400, Brian Foster wrote:
> > > > > > On Thu, Mar 21, 2019 at 08:10:19AM +1100, Dave Chinner wrote:
> > > > > > > On Wed, Mar 20, 2019 at 08:02:05AM -0400, Brian Foster wrote:
> > > > > > > > On Wed, Mar 20, 2019 at 07:25:37AM +1100, Dave Chinner wrote:
> > > > > > > > > > > The broader point is that by controlling writeback ordering, we can
> > > > > > > > > > > explicitly demonstrate stale data exposure. If fixed properly, we should
> > > > > > > > > > > never expect stale data exposure even in the event of out of order
> > > > > > > > > > > writeback completion, regardless of the cause.
> > > > > > > > > > 
> > > > > > > > > > Agree.  Even if s_f_r is a giant trash fire, if we're exposing stale
> > > > > > > > > > disk contents it's game over anyway and we ought to fix it.
> > > > > > > > > 
> > > > > > > > > Well, that's exactly the problem with SFR, isn't it? The dirty data
> > > > > > > > > is outside the range the user asks to sync and the implementation
> > > > > > > > > doesn't go anywhere near the filesystem so we can't actually do the
> > > > > > > > > right thing here. So we are left to either hack fixes around a
> > > > > > > > > shitty, broken interface and everyone pays the penalty, or we ignore
> > > > > > > > > it. We've simply ignored it for the past 10+ years because it really
> > > > > > > > > can't be used safely for its intended purposes (as everyone who
> > > > > > > > > tries to use it finds out eventually)...
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Just to make this abundantly clear, there is nothing unique, magic or
> > > > > > > > special required of sync_file_range() to reproduce this stale data
> > > > > > > > exposure. See the following variation of a command sequence from the
> > > > > > > > fstest I posted the other day:
> > > > > > > > 
> > > > > > > > ...
> > > > > > > > # xfs_io -fc "pwrite 0 68k" -c fsync -c "pwrite 96k 4k" \
> > > > > > > > 	-c "fpunch 96k 4k" <file>
> > > > > > > > ...
> > > > > > > > # hexdump <file>
> > > > > > > > 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
> > > > > > > > *
> > > > > > > > 0011000 abab abab abab abab abab abab abab abab
> > > > > > > > *
> > > > > > > > 0018000 0000 0000 0000 0000 0000 0000 0000 0000
> > > > > > > > *
> > > > > > > > 0019000
> > > > > > > > 
> > > > > > > > Same problem, and it can be reproduced just as easily with a reflink or
> > > > > > > > even an overlapping direct I/O.
> > > > > > > 
> > > > > > > How does overlapping direct IO cause stale data exposure given that
> > > > > > > it uses unwritten extent allocation? That, of course, is ignoring
> > > > > > > the fact that the result of overlapping concurrent direct IO is, by
> > > > > > > definition, undefined and an application bug if it occurs....
> > > > > > > 
> > > > > > 
> > > > > > It's overlapping a buffered write and so starts with a flush instead of
> > > > > 
> > > > > Ok, but that's not the normal usage of the phrase "overlapping
> > > > > direct IO".  Overlapping direct IO means two concurrent direct IOs
> > > > > to the same LBA range in the device underlying the filesystem. i.e.
> > > > > they overlap in both the temporal and physical (device LBA) domains.
> > > > > 
> > > > > Subtle difference, yes, but a very important one.
> > > > > 
> > > > 
> > > > Yeah, I just referenced it in passing and was unclear. The point is just
> > > > that there are multiple vectors to the problem, some of which are sane
> > > > sequences and some apparently not, and I'd rather not get caught up in
> > > > the minutiae of why some of those sequences are not sane. I'm trying to
> > > > see if we can find a solution to the fundamental data exposure problem.
> > > 
> > > Is there a way to increase the amount of writeback?  Say for example
> > > that we have a delalloc reservation for blocks 0-9, and writeback asks
> > > us to write back block 7.  If we succeed at converting the entire
> > > 10-block reservation into a real extent, why can't we come back and say
> > > "Hey, we also need to flush the pages for blocks 0-6,8-9"?  Are we going
> > > to need something like that for blocksize > pagesize filesystems too?
> > > 
> > 
> > That might be possible in various places. In fact, I think we used to do
> > something like this in XFS (perhaps not for this purpose). I just jumped
> > back to v2.6.32 and see that the xfs_cluster_write() mechanism still
> > exists at that point down in the ->writepage() path.
> 
> We /really/ don't want to go back to that. This caused us to have to
> essentially implement write_cache_pages() inside ->writepage.
> 

Agreed..

> If we want to expand the writeback range, (i.e. move the offset we
> start at) then we need to do that in xfs_vm_writepages(), before we
> call write_cache_pages(). This means all the tagged writepages stuff
> will continue to work, and we don't end up doing redundant page
> cache lookups because we're trying to write back the same range from
> two different places.
> 

IMO, messing around with writeback like this is a mitigation and/or a
contributor to a broader solution, not a fundamental fix by itself. For
example, I could see doing something like that if we already put proper
writeback vs. append transaction ordering in place and we identify
certain corner cases where we want to prevent delaying an on-disk i_size
update for too long because a user indirectly triggered writeback of a
high index page.

> > I'm not sure if doing that alone is worth the effort if you consider
> > it's more of a mitigation than a solution to the underlying problem.
> > Once we convert the associated delalloc extent, we're open to stale data
> > exposure (assuming we're not protected by on-disk EOF and whatnot) until
> > all of the remaining pages over the extent are written back, whether
> > that occurs in the current writeback cycle or a subsequent one.
> 
> If we were to do a read iomap lookup in xfs_vm_writepages() we could
> determine if the delalloc conversion would allocate a range outside
> the current write that is being flushed, and modify the
> writeback control range/offset to match the delalloc extent.
> 
> however, this runs the risk of breaking all the fairness algorithms
> in high level writeback, so we'd need to be very careful about how
> we extend the write range for writeback inside EOF....
> 

Indeed. The more I think about this, the more I think we'd have an
easier time trying to schedule the on-disk i_size update vs. trying to
control the size and range of writeback in all possible scenarios. The
former would still require non-trivial tracking and logic, but that's
something writeback control described above could actually help with.

For example, we don't currently have any way to track "stale" state of
just allocated but not yet fully written extents. If we established a
rule that we always wrote back such extents in entirety and via a single
ioend, then writeback completion code would simply need to check for
delalloc extents within in-core eof to determine when it's safe to
update on-disk eof. If there are cases where we determine we can't
writeback an entire delalloc extent because of some fairness reasons or
whatever, we could decide to only physically allocate the subset of that
delalloc extent that we can write back rather than the whole thing
(trading off stale data exposure for some increased fragmentation). That
way writeback completion still sees existence of delalloc blocks and
knows to hold off on the on-disk size update.

One challenge with an approach like that is if we do skip completion
time on-disk i_size updates, we'd have to make sure that those updates
still happen eventually if the address space happens to change outside
of writeback. For example, if those remaining dirty delalloc pages
somehow end up invalidated and never become real blocks, we may never
see another writeback cycle on the inode and we'd need to update on-disk
size from some other context (in a timely fashion).

Of course this only deals with the append case. We'd still need to
select a mechanism for dealing with inside EOF allocations (such as the
whole write-intent thing or unwritten extents). To me, the intent thing
still seems like the most simple/elegant option we've discussed so far
because of the potential for low impact. Part of me thinks that if we
determine we'll fall back to that approach for certain cases anyways
(like inside EOF allocs), it might be worth just doing that first and
evaluate using it unconditionally.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
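[Editorial note: the completion-time rule proposed above — hold off the
on-disk size update while delalloc blocks still exist inside the in-core
EOF — reduces to a simple predicate. The sketch below is a hedged
illustration using a flat list-of-tuples extent map with invented names,
not the actual XFS in-core extent tree.]

```python
def can_update_ondisk_size(extents, incore_eof):
    """Return True if a completing writeback may move the on-disk i_size.

    Models the proposed rule: if any delalloc extent still exists inside
    the in-core EOF, some reserved range has not yet become real written
    blocks, so the on-disk size update must be held off.  `extents` is a
    list of (start, end, state) tuples, state "delalloc" or "real".
    """
    return not any(state == "delalloc" and start < incore_eof
                   for start, end, state in extents)
```

This is also why only allocating the writable subset of a delalloc extent
(rather than the whole thing) keeps the scheme honest: completion still
sees the remaining delalloc blocks and knows not to update the size.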
Darrick J. Wong April 5, 2019, 3:12 p.m. UTC | #22
On Fri, Apr 05, 2019 at 07:24:44AM -0400, Brian Foster wrote:
> On Fri, Apr 05, 2019 at 09:08:45AM +1100, Dave Chinner wrote:
> > On Thu, Apr 04, 2019 at 08:50:54AM -0400, Brian Foster wrote:
> > > On Wed, Apr 03, 2019 at 03:38:09PM -0700, Darrick J. Wong wrote:
> > > > On Fri, Mar 22, 2019 at 08:46:15AM -0400, Brian Foster wrote:
> > > > > On Fri, Mar 22, 2019 at 12:52:40PM +1100, Dave Chinner wrote:
> > > > > > On Thu, Mar 21, 2019 at 08:27:40AM -0400, Brian Foster wrote:
> > > > > > > On Thu, Mar 21, 2019 at 08:10:19AM +1100, Dave Chinner wrote:
> > > > > > > > On Wed, Mar 20, 2019 at 08:02:05AM -0400, Brian Foster wrote:
> > > > > > > > > On Wed, Mar 20, 2019 at 07:25:37AM +1100, Dave Chinner wrote:
> > > > > > > > > > > > The broader point is that by controlling writeback ordering, we can
> > > > > > > > > > > > explicitly demonstrate stale data exposure. If fixed properly, we should
> > > > > > > > > > > > never expect stale data exposure even in the event of out of order
> > > > > > > > > > > > writeback completion, regardless of the cause.
> > > > > > > > > > > 
> > > > > > > > > > > Agree.  Even if s_f_r is a giant trash fire, if we're exposing stale
> > > > > > > > > > > disk contents it's game over anyway and we ought to fix it.
> > > > > > > > > > 
> > > > > > > > > > Well, that's exactly the problem with SFR, isn't it? The dirty data
> > > > > > > > > > is outside the range the user asks to sync and the implementation
> > > > > > > > > > doesn't go anywhere near the filesystem so we can't actually do the
> > > > > > > > > > right thing here. So we are left to either hack fixes around a
> > > > > > > > > > shitty, broken interface and everyone pays the penalty, or we ignore
> > > > > > > > > > it. We've simply ignored it for the past 10+ years because it really
> > > > > > > > > > can't be used safely for its intended purposes (as everyone who
> > > > > > > > > > tries to use it finds out eventually)...
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Just to make this abundantly clear, there is nothing unique, magic or
> > > > > > > > > special required of sync_file_range() to reproduce this stale data
> > > > > > > > > exposure. See the following variation of a command sequence from the
> > > > > > > > > fstest I posted the other day:
> > > > > > > > > 
> > > > > > > > > ...
> > > > > > > > > # xfs_io -fc "pwrite 0 68k" -c fsync -c "pwrite 96k 4k" \
> > > > > > > > > 	-c "fpunch 96k 4k" <file>
> > > > > > > > > ...
> > > > > > > > > # hexdump <file>
> > > > > > > > > 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
> > > > > > > > > *
> > > > > > > > > 0011000 abab abab abab abab abab abab abab abab
> > > > > > > > > *
> > > > > > > > > 0018000 0000 0000 0000 0000 0000 0000 0000 0000
> > > > > > > > > *
> > > > > > > > > 0019000
> > > > > > > > > 
> > > > > > > > > Same problem, and it can be reproduced just as easily with a reflink or
> > > > > > > > > even an overlapping direct I/O.
> > > > > > > > 
> > > > > > > > How does overlapping direct IO cause stale data exposure given that
> > > > > > > > it uses unwritten extent allocation? That, of course, is ignoring
> > > > > > > > the fact that the result of overlapping concurrent direct IO is, by
> > > > > > > > definition, undefined and an application bug if it occurs....
> > > > > > > > 
> > > > > > > 
> > > > > > > It's overlapping a buffered write and so starts with a flush instead of
> > > > > > 
> > > > > > Ok, but that's not the normal usage of the phrase "overlapping
> > > > > > direct IO".  Overlapping direct IO means two concurrent direct IOs
> > > > > > to the same LBA range in the device underlying the filesystem. i.e.
> > > > > > they overlap in both the temporal and physical (device LBA) domains.
> > > > > > 
> > > > > > Subtle difference, yes, but a very important one.
> > > > > > 
> > > > > 
> > > > > Yeah, I just referenced it in passing and was unclear. The point is just
> > > > > that there are multiple vectors to the problem, some of which are sane
> > > > > sequences and some apparently not, and I'd rather not get caught up in
> > > > > the minutiae of why some of those sequences are not sane. I'm trying to
> > > > > see if we can find a solution to the fundamental data exposure problem.
> > > > 
> > > > Is there a way to increase the amount of writeback?  Say for example
> > > > that we have a delalloc reservation for blocks 0-9, and writeback asks
> > > > us to write back block 7.  If we succeed at converting the entire
> > > > 10-block reservation into a real extent, why can't we come back and say
> > > > "Hey, we also need to flush the pages for blocks 0-6,8-9"?  Are we going
> > > > to need something like that for blocksize > pagesize filesystems too?
> > > > 
> > > 
> > > That might be possible in various places. In fact, I think we used to do
> > > something like this in XFS (perhaps not for this purpose). I just jumped
> > > back to v2.6.32 and see that the xfs_cluster_write() mechanism still
> > > exists at that point down in the ->writepage() path.
> > 
> > We /really/ don't want to go back to that. This caused us to have to
> > essentially implement write_cache_pages() inside ->writepage.
> > 
> 
> Agreed..
> 
> > If we want to expand the writeback range, (i.e. move the offset we
> > start at) then we need to do that in xfs_vm_writepages(), before we
> > call write_cache_pages(). This means all the tagged writepages stuff
> > will continue to work, and we don't end up doing redundant page
> > cache lookups because we're trying to write back the same range from
> > two different places.
> > 
> 
> IMO, messing around with writeback like this is a mitigation and/or a
> contributor to a broader solution, not a fundamental fix by itself. For
> example, I could see doing something like that if we already put proper
> writeback vs. append transaction ordering in place and we identify
> certain corner cases where we want to prevent delaying an on-disk i_size
> update for too long because a user indirectly triggered writeback of a
> high index page.
> 
> > > I'm not sure if doing that alone is worth the effort if you consider
> > > it's more of a mitigation than a solution to the underlying problem.
> > > Once we convert the associated delalloc extent, we're open to stale data
> > > exposure (assuming we're not protected by on-disk EOF and whatnot) until
> > > all of the remaining pages over the extent are written back, whether
> > > that occurs in the current writeback cycle or a subsequent one.
> > 
> > If we were to do a read iomap lookup in xfs_vm_writepages() we could
> > determine if the delalloc conversion would allocate a range outside
> > the current write that is being flushed, and modify the
> > writeback control range/offset to match the delalloc extent.
> > 
> > however, this runs the risk of breaking all the fairness algorithms
> > in high level writeback, so we'd need to be very careful about how
> > we extend the write range for writeback inside EOF....
> > 
> 
> Indeed. The more I think about this, the more I think we'd have an
> easier time trying to schedule the on-disk i_size update vs. trying to
> control the size and range of writeback in all possible scenarios. The
> former would still require non-trivial tracking and logic, but that's
> something writeback control described above could actually help with.
> 
> For example, we don't currently have any way to track "stale" state of
> just allocated but not yet fully written extents. If we established a
> rule that we always wrote back such extents in entirety and via a single
> ioend, then writeback completion code would simply need to check for
> delalloc extents within in-core eof to determine when it's safe to
> update on-disk eof. If there are cases where we determine we can't
> writeback an entire delalloc extent because of some fairness reasons or
> whatever, we could decide to only physically allocate the subset of that
> delalloc extent that we can write back rather than the whole thing
> (trading off stale data exposure for some increased fragmentation). That
> way writeback completion still sees existence of delalloc blocks and
> knows to hold off on the on-disk size update.
> 
> One challenge with an approach like that is if we do skip completion
> time on-disk i_size updates, we'd have to make sure that those updates
> still happen eventually if the address space happens to change outside
> of writeback. For example, if those remaining dirty delalloc pages
> somehow end up invalidated and never become real blocks, we may never
> see another writeback cycle on the inode and we'd need to update on-disk
> size from some other context (in a timely fashion).
> 
> Of course this only deals with the append case. We'd still need to
> select a mechanism for dealing with inside EOF allocations (such as the
> whole write-intent thing or unwritten extents). To me, the intent thing
> still seems like the most simple/elegant option we've discussed so far
> because of the potential for low impact. Part of me thinks that if we
> determine we'll fall back to that approach for certain cases anyways
> (like inside EOF allocs), it might be worth just doing that first and
> evaluate using it unconditionally.

/me wonders if we could solve this by using a bitmap to track all the
in-flight post-ondisk-eof writeback such that when a particular
writeback completes it can check if there aren't any writeback at a
lower offset still in progress and only then update the ondisk eof?

(Where "a bitmap or something" could just use another instance of Dave's
range lock structure...)

--D

> Brian
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
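[Editorial note: the ordering sketched above — only move the on-disk EOF
forward once no writeback at a lower offset is still in progress — behaves
like a completion watermark. A minimal Python model follows, with a plain
set of in-flight ranges standing in for the proposed bitmap or range-lock
structure; all names are invented for illustration.]

```python
class EofWatermark:
    """Defer on-disk EOF updates until all lower-offset writeback is done.

    In-flight post-EOF writeback ranges are tracked as (start, end)
    pairs; a completed range is reflected in the on-disk EOF only once
    nothing at a lower offset is still in flight.
    """
    def __init__(self, ondisk_eof=0):
        self.ondisk_eof = ondisk_eof
        self.inflight = set()   # ranges submitted but not yet completed
        self.completed = []     # completed ranges awaiting the watermark

    def submit(self, start, end):
        self.inflight.add((start, end))

    def complete(self, start, end):
        self.inflight.discard((start, end))
        self.completed.append((start, end))
        # Lowest offset still being written back; completions beyond it
        # must not yet be reflected in the on-disk EOF.
        limit = min((s for s, _ in self.inflight), default=float("inf"))
        for s, e in sorted(self.completed):
            if e <= limit and s <= self.ondisk_eof:
                self.ondisk_eof = max(self.ondisk_eof, e)
```

With this bookkeeping, an out-of-order completion at a high offset leaves
the on-disk EOF untouched until every lower range has also completed.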
Dave Chinner April 8, 2019, 12:17 a.m. UTC | #23
On Fri, Apr 05, 2019 at 08:12:32AM -0700, Darrick J. Wong wrote:
> On Fri, Apr 05, 2019 at 07:24:44AM -0400, Brian Foster wrote:
> > On Fri, Apr 05, 2019 at 09:08:45AM +1100, Dave Chinner wrote:
> > > On Thu, Apr 04, 2019 at 08:50:54AM -0400, Brian Foster wrote:
> > > If we want to expand the writeback range, (i.e. move the offset we
> > > start at) then we need to do that in xfs_vm_writepages(), before we
> > > call write_cache_pages(). This means all the tagged writepages stuff
> > > will continue to work, and we don't end up doing redundant page
> > > cache lookups because we're trying to write back the same range from
> > > two different places.
> > > 
> > 
> > IMO, messing around with writeback like this is a mitigation and/or a
> > contributor to a broader solution, not a fundamental fix by itself. For
> > example, I could see doing something like that if we already put proper
> > writeback vs. append transaction ordering in place and we identify
> > certain corner cases where we want to prevent delaying an on-disk i_size
> > update for too long because a user indirectly triggered writeback of a
> > high index page.

*nod*

Yeah, the problem is when we come to really large files on large
memory machines where there might be gigabytes of dirty pages over a
relatively slow disk...

> > > > I'm not sure if doing that alone is worth the effort if you consider
> > > > it's more of a mitigation than a solution to the underlying problem.
> > > > Once we convert the associated delalloc extent, we're open to stale data
> > > > exposure (assuming we're not protected by on-disk EOF and whatnot) until
> > > > all of the remaining pages over the extent are written back, whether
> > > > that occurs in the current writeback cycle or a subsequent one.
> > > 
> > > If we were to do a read iomap lookup in xfs_vm_writepages() we could
> > > determine if the delalloc conversion would allocate a range outside
> > > the current write that is being flushed, and modify the
> > > writeback control range/offset to match the delalloc extent.
> > > 
> > > however, this runs the risk of breaking all the fairness algorithms
> > > in high level writeback, so we'd need to be very careful about how
> > > we extend the write range for writeback inside EOF....
> > > 
> > 
> > Indeed. The more I think about this, the more I think we'd have an
> > easier time trying to schedule the on-disk i_size update vs. trying to
> > control the size and range of writeback in all possible scenarios. The
> > former would still require non-trivial tracking and logic, but that's
> > something writeback control described above could actually help with.
> >
> > For example, we don't currently have any way to track "stale" state of
> > just allocated but not yet fully written extents. If we established a
> > rule that we always wrote back such extents in entirety and via a single
> > ioend, then writeback completion code would simply need to check for
> > delalloc extents within in-core eof to determine when it's safe to
> > update on-disk eof.

The problem with this is that it either limits the size of the
allocation we can make (due to writeback chunk size) or breaks the
writeback fairness algorithms by having to write back allocation
sized chunks...

> > If there are cases where we determine we can't
> > writeback an entire delalloc extent because of some fairness reasons or
> > whatever, we could decide to only physically allocate the subset of that
> > delalloc extent that we can write back rather than the whole thing
> > (trading off stale data exposure for some increased fragmentation). That
> > way writeback completion still sees existence of delalloc blocks and
> > knows to hold off on the on-disk size update.

The moment we get concurrent writes we are going to interleave and
fragment the files with this approach. It's the large speculative
delalloc beyond EOF that prevents fragmentation in this case, so
the moment we decide that we aren't going to allocate the entire
delalloc extent we essentially throw away a significant amount of
the benefit that the speculative delalloc provides us with.

> > One challenge with an approach like that is if we do skip completion
> > time on-disk i_size updates, we'd have to make sure that those updates
> > still happen eventually if the address space happens to change outside
> > of writeback. For example, if those remaining dirty delalloc pages
> > somehow end up invalidated and never become real blocks, we may never
> > see another writeback cycle on the inode and we'd need to update on-disk
> > size from some other context (in a timely fashion).
> > 
> > Of course this only deals with the append case. We'd still need to
> > select a mechanism for dealing with inside EOF allocations (such as the
> > whole write-intent thing or unwritten extents). To me, the intent thing
> > still seems like the most simple/elegant option we've discussed so far
> > because of the potential for low impact. Part of me thinks that if we
> > determine we'll fall back to that approach for certain cases anyways
> > (like inside EOF allocs), it might be worth just doing that first and
> > evaluate using it unconditionally.
> 
> /me wonders if we could solve this by using a bitmap to track all the
> in-flight post-ondisk-eof writeback such that when a particular
> writeback completes it can check if there aren't any writeback at a
> lower offset still in progress and only then update the ondisk eof?
> 
> (Where "a bitmap or something" could just use another instance of Dave's
> range lock structure...)

I'm not sure we need to do that. Perhaps all we need is a new
in-core iext_tree record extent type flag. i.e. we have an
unwritten flag that mirrors the on-disk unwritten flag, but what we
are really talking about here is having a new in-memory extent state
that is "allocated but contains no data".

This flag never makes it to disk. When we allocate a delalloc region
we tag the entire extent in with this magic flag, and as IO
completes we do the equivalent of unwritten extent conversion except
it only touches the in-memory tree records. When the entire range
between the current on-disk EOF and end of this completing IO has
the "not yet written" flag cleared, we then log the EOF size change.

For non-random IO, we'll essentially just be trimming one extent and
extending another, so it should be pretty fast. If we want to add
intents (so it works inside EOF), then we can simply log a new DONE
(or UPDATE) intent every time we convert a range. That would leave
log recovery with an allocation intent plus a series of updates that
document written ranges.

This seems to me like it solves the tracking problem for both cases
and provides a simple hook for logging intent completions. It also
seems to me like it would work for direct IO, too, getting rid of
the need for unconditional unwritten extent allocation and
conversion. That's potentially a big win for database and VM folks
who want to write into sparse files safely....

Cheers,

Dave.
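[Editorial note: the in-memory "allocated but contains no data" state
proposed above can be modeled as a written-range map per allocation: each
IO completion performs an in-memory-only conversion of its sub-range, and
the EOF size change is logged only once everything between the current
on-disk EOF and the end of the completing IO is written. A rough Python
sketch follows; the names and data structure are invented, whereas the
actual proposal is a new iext_tree record state flag.]

```python
class NotYetWrittenMap:
    """Model of the proposed in-memory-only "not yet written" state.

    A fresh delalloc allocation starts with no written coverage; each IO
    completion converts its sub-range in memory, and the on-disk EOF is
    advanced only when the written ranges form a contiguous run covering
    it -- the in-memory analogue of unwritten extent conversion.
    """
    def __init__(self, ondisk_eof=0):
        self.ondisk_eof = ondisk_eof
        self.written = []  # sorted, merged (start, end) written ranges

    def io_complete(self, start, end):
        # In-memory conversion: record this sub-range as written, then
        # merge adjacent/overlapping ranges.
        self.written.append((start, end))
        self.written.sort()
        merged = []
        for s, e in self.written:
            if merged and s <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], e))
            else:
                merged.append((s, e))
        self.written = merged
        # Log an EOF update only if [ondisk_eof, end) is now all written.
        for s, e in self.written:
            if s <= self.ondisk_eof < e:
                self.ondisk_eof = e
                break
```

For sequential IO the merge step just trims one range and extends another,
matching the "should be pretty fast" expectation above; a random completion
ahead of the on-disk EOF leaves the size untouched until the gap fills in.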
Brian Foster April 8, 2019, 12:02 p.m. UTC | #24
On Mon, Apr 08, 2019 at 10:17:41AM +1000, Dave Chinner wrote:
> On Fri, Apr 05, 2019 at 08:12:32AM -0700, Darrick J. Wong wrote:
> > On Fri, Apr 05, 2019 at 07:24:44AM -0400, Brian Foster wrote:
> > > On Fri, Apr 05, 2019 at 09:08:45AM +1100, Dave Chinner wrote:
> > > > On Thu, Apr 04, 2019 at 08:50:54AM -0400, Brian Foster wrote:
> > > > If we want to expand the writeback range, (i.e. move the offset we
> > > > start at) then we need to do that in xfs_vm_writepages(), before we
> > > > call write_cache_pages(). This means all the tagged writepages stuff
> > > > will continue to work, and we don't end up doing redundant page
> > > > cache lookups because we're trying to write back the same range from
> > > > two different places.
> > > > 
> > > 
> > > IMO, messing around with writeback like this is a mitigation and/or a
> > > contributor to a broader solution, not a fundamental fix by itself. For
> > > example, I could see doing something like that if we already put proper
> > > writeback vs. append transaction ordering in place and we identify
> > > certain corner cases where we want to prevent delaying an on-disk i_size
> > > update for too long because a user indirectly triggered writeback of a
> > > high index page.
> 
> *nod*
> 
> Yeah, the problem is when we come to really large files on large
> memory machines where there might be gigabytes of dirty pages over a
> relatively slow disk...
> 
> > > > > I'm not sure if doing that alone is worth the effort if you consider
> > > > > it's more of a mitigation than a solution to the underlying problem.
> > > > > Once we convert the associated delalloc extent, we're open to stale data
> > > > > exposure (assuming we're not protected by on-disk EOF and whatnot) until
> > > > > all of the remaining pages over the extent are written back, whether
> > > > > that occurs in the current writeback cycle or a subsequent one.
> > > > 
> > > > If we were to do a read iomap lookup in xfs_vm_writepages() we could
> > > > determine if the delalloc conversion would allocate a range outside
> > > > the current write that is being flushed, and modify the
> > > > writeback control range/offset to match the delalloc extent.
> > > > 
> > > > however, this runs the risk of breaking all the fairness algorithms
> > > > in high level writeback, so we'd need to be very careful about how
> > > > we extend the write range for writeback inside EOF....
> > > > 
> > > 
> > > Indeed. The more I think about this, the more I think we'd have an
> > > easier time trying to schedule the on-disk i_size update vs. trying to
> > > control the size and range of writeback in all possible scenarios. The
> > > former would still require non-trivial tracking and logic, but that's
> > > something writeback control described above could actually help with.
> > >
> > > For example, we don't currently have any way to track "stale" state of
> > > just allocated but not yet fully written extents. If we established a
> > > rule that we always wrote back such extents in entirety and via a single
> > > ioend, then writeback completion code would simply need to check for
> > > delalloc extents within in-core eof to determine when it's safe to
> > > update on-disk eof.
> 
> The problem with this is that it either limits the size of the
> allocation we can make (due to writeback chunk size) or we break the
> writeback fairness algorithms by having to write back
> allocation-sized chunks...
> 
> > > If there are cases where we determine we can't
> > > writeback an entire delalloc extent because of some fairness reasons or
> > > whatever, we could decide to only physically allocate the subset of that
> > > delalloc extent that we can write back rather than the whole thing
> > > (trading off stale data exposure for some increased fragmentation). That
> > > way writeback completion still sees existence of delalloc blocks and
> > > knows to hold off on the on-disk size update.
> 
> The moment we get concurrent writes we are going to interleave and
> fragment the files with this approach. It's the large speculative
> delalloc beyond EOF that prevents fragmentation in this case, so
> the moment we decide that we aren't going to allocate the entire
> delalloc extent we essentially throw away a significant amount of
> the benefit that the speculative delalloc provides us with.
> 

Yep, that's the tradeoff. Note that the idea was not to restrict
delalloc conversions to writeback chunk size. That would outright lead
to serious fragmentation.

The thought was based on the previously mentioned concept of
artificially increasing writeback ranges in XFS. So if writeback only
asked for a few pages but we decided to do the old "cluster write" thing
because we're over a multi-GB delalloc extent, then we don't necessarily
have to make a black and white decision between writing back just the
pages asked for or GBs of pages over the entire extent. We could always
limit the delalloc conversion to something in-between if there is a
reasonable balance between increasing the writeback range and not
introducing too much fragmentation. That is completely non-deterministic
and probably would require some experimentation to determine
practicality. I also think it's reasonable to say "we have delalloc for
a reason, don't do that." :P Anyways, my preference is still to avoid
messing with writeback at all if we can help it so I wouldn't go down
this path unless we ruled out other options.
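
For illustration only, the kind of in-between clamp I mean might look
like this (the cap is a made-up tuning knob, not an existing XFS
constant, and the real decision would obviously need to weigh a lot
more state than two lengths):

```c
#include <stdint.h>

/* Hypothetical upper bound on how far a conversion grows past the request. */
#define CONVERT_CAP_BLOCKS 2048

/*
 * Pick how much of a delalloc extent to convert: at least what
 * writeback asked for, opportunistically more to reduce the stale
 * window, but capped to limit fragmentation from partial conversions.
 */
static uint64_t convert_len(uint64_t requested, uint64_t delalloc_left)
{
	uint64_t len = requested;

	if (len < delalloc_left) {
		len = delalloc_left < CONVERT_CAP_BLOCKS ?
				delalloc_left : CONVERT_CAP_BLOCKS;
		if (len < requested)
			len = requested; /* never less than writeback needs */
	}
	return len;
}
```

Whether any fixed cap like that actually balances exposure against
fragmentation is exactly the experimentation question above.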

Note that the obvious next step in complexity to that is to reuse the
COW fork infrastructure for newly allocated extents such that we only
map allocated blocks to the file after they are written. That seems
interesting to me because it relies on mechanisms we already have in
place. Not only do the mechanisms exist, but this is arguably already
(partially) prototyped via Christoph's debug mode always_cow stuff.
generic/536 passes on a quick test of a reflink=1 fs with always_cow
enabled, for example.

With that, we wouldn't need to play games with writeback ranges, size
updates or new intent types. This would depend on appropriate feature
infrastructure being enabled on the fs and I'm not sure how performance
would stack up against more passive approaches like those we're
discussing below..
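
Roughly, the invariant that approach buys us can be sketched as a toy
userspace model (invented names, not the actual COW fork code): a
block never becomes visible to readers until its data has hit disk.

```c
/*
 * Toy model of "COW fork for new allocations": writes land in staged
 * blocks that are only remapped into the visible data fork once
 * writeback completes, so a crash mid-writeback exposes a hole, not
 * stale disk contents.
 */
enum blk_state { BLK_HOLE, BLK_STAGED, BLK_MAPPED };

#define COW_MODEL_BLOCKS 32

struct cow_model {
	enum blk_state state[COW_MODEL_BLOCKS];
};

/* Buffered write: allocate staging blocks outside the visible mapping. */
static void cow_write(struct cow_model *m, int start, int len)
{
	for (int b = start; b < start + len; b++)
		if (m->state[b] == BLK_HOLE)
			m->state[b] = BLK_STAGED;
}

/* Writeback completion: remap the staged range into the data fork. */
static void cow_io_done(struct cow_model *m, int start, int len)
{
	for (int b = start; b < start + len; b++)
		if (m->state[b] == BLK_STAGED)
			m->state[b] = BLK_MAPPED;
}

/* Read path: only mapped blocks are visible; staged reads as a hole. */
static int cow_block_visible(const struct cow_model *m, int b)
{
	return m->state[b] == BLK_MAPPED;
}
```

The performance question is what that extra remap step costs relative
to the passive tracking schemes below.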

> > > One challenge with an approach like that is if we do skip completion
> > > time on-disk i_size updates, we'd have to make sure that those updates
> > > still happen eventually if the address space happens to change outside
> > > of writeback. For example, if those remaining dirty delalloc pages
> > > somehow end up invalidated and never become real blocks, we may never
> > > see another writeback cycle on the inode and we'd need to update on-disk
> > > size from some other context (in a timely fashion).
> > > 
> > > Of course this only deals with the append case. We'd still need to
> > > select a mechanism for dealing with inside EOF allocations (such as the
> > > whole write-intent thing or unwritten extents). To me, the intent thing
> > > still seems like the most simple/elegant option we've discussed so far
> > > because the potential for low impact. Part of me thinks that if we
> > > determine we'll fall back to that approach for certain cases anyways
> > > (like inside EOF allocs), it might be worth just doing that first and
> > > evaluate using it unconditionally.
> > 
> > /me wonders if we could solve this by using a bitmap to track all the
> > in-flight post-ondisk-eof writeback such that when a particular
> > writeback completes it can check if there aren't any writeback at a
> > lower offset still in progress and only then update the ondisk eof?
> > 
> > (Where "a bitmap or something" could just use another instance of Dave's
> > range lock structure...)
> 
> I'm not sure we need to do that. Perhaps all we need is a new
> in-core iext_tree record extent type flag. i.e. we have an
> unwritten flag that mirrors the on-disk unwritten flag, but what we
> are really talking about here is having a new in-memory extent state
> that is "allocated but contains no data".
> 
> This flag never makes it to disk. When we allocate a delalloc region
> we tag the entire extent in with this magic flag, and as IO
> completes we do the equivalent of unwritten extent conversion except
> it only touches the in-memory tree records. When the entire range
> between the current on-disk EOF and end of this completing IO has
> the "not yet written" flag cleared, we then log the EOF size change.
> 

This is pretty much what I was handwaving a bit about earlier with
regard to using existing in-core extent tracking to reflect "stale"
state (where "stale" means these physical blocks currently contain stale
data). The core mechanism sounds reasonable, but there's still some
engineering required to work out the details. For example, we'd need to
be able to identify the last "not yet written" extent of a particular
file range so we know when to log the size update, we may need to be
able to make sure the size update happens if it's possible for some of
those "not yet written" extents to become invalidated without ever being
written back...

> For non-random IO, we'll essentially just be trimming one extent and
> extending another, so it should be pretty fast. If we want to add
> intents (so it works inside EOF), then we can simply log a new DONE
> (or UPDATE) intent every time we convert a range. That would leave
> log recovery with an allocation intent plus a series of updates that
> document written ranges.
> 

... and we may need to consider how to balance deferring a size update
vs. making a size update and causing all subsequent (within eof)
writebacks to require intents (or whatever). I think these are all solvable
problems though.

That said, if the intent thing turned out to be dirt cheap, it could
make the added "stale" extent tracking logic kind of pointless. OTOH, I
suppose the opposite is certainly possible as well; I haven't thought
either one through beyond concluding that both seem plausible options.

> This seems to me like it solves the tracking problem for both cases
> and provides a simple hook for logging intent completions. It also
> seems to me like it would work for direct IO, too, getting rid of
> the need for unconditional unwritten extent allocation and
> conversion. That's potentially a big win for database and VM folks
> who want to write into sparse files safely....
> 

I hadn't thought about direct I/O, but that makes sense. It sounds like
there's enough potential for an RFC/POC here. ;)

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Patch

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 931edfdca22e..dae5f1734297 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4077,16 +4077,18 @@  xfs_bmapi_allocate(
 	bma->got.br_state = XFS_EXT_NORM;
 
 	/*
-	 * In the data fork, a wasdelay extent has been initialized, so
-	 * shouldn't be flagged as unwritten.
+	 * In the data fork, the pages backed by a delalloc extent have been
+	 * dirtied but not yet written to disk, so the allocation should be
+	 * marked unwritten.  Only after writeback completes successfully
+	 * should we convert those mappings to real, so that a crash during
+	 * writeback won't expose stale disk contents.
 	 *
 	 * For the cow fork, however, we convert delalloc reservations
 	 * (extents allocated for speculative preallocation) to
 	 * allocated unwritten extents, and only convert the unwritten
 	 * extents to real extents when we're about to write the data.
 	 */
-	if ((!bma->wasdel || (bma->flags & XFS_BMAPI_COWFORK)) &&
-	    (bma->flags & XFS_BMAPI_PREALLOC))
+	if (bma->flags & XFS_BMAPI_PREALLOC)
 		bma->got.br_state = XFS_EXT_UNWRITTEN;
 
 	if (bma->wasdel)
@@ -4496,8 +4498,9 @@  xfs_bmapi_convert_delalloc(
 	bma.length = max_t(xfs_filblks_t, bma.got.br_blockcount, MAXEXTLEN);
 	bma.total = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
 	bma.minleft = xfs_bmapi_minleft(tp, ip, whichfork);
+	bma.flags = XFS_BMAPI_PREALLOC;
 	if (whichfork == XFS_COW_FORK)
-		bma.flags = XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC;
+		bma.flags |= XFS_BMAPI_COWFORK;
 
 	if (!xfs_iext_peek_prev_extent(ifp, &bma.icur, &bma.prev))
 		bma.prev.br_startoff = NULLFILEOFF;