diff mbox

fstests: generic test for fsync followed by truncate and link

Message ID 1423831674-25678-1-git-send-email-fdmanana@suse.com (mailing list archive)
State Not Applicable
Headers show

Commit Message

Filipe Manana Feb. 13, 2015, 12:47 p.m. UTC
This test is motivated by an fsync issue discovered in btrfs.
The issue was that we could lose file data, that was previously fsync'ed
successfully, if we end up shrinking (via truncate) the file, add a hard
link to our file and then persist the fsync log later via an fsync of
other inode for example. After a power loss our file content wouldn't
match what it had when we last fsync'ed, but instead it had the data
prior to that fsync.

The btrfs issue was fixed by the following linux kernel patch:

  Btrfs: don't remove extents and xattrs when logging new names

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 tests/generic/044     | 131 ++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/generic/044.out |  18 +++++++
 tests/generic/group   |   1 +
 3 files changed, 150 insertions(+)
 create mode 100755 tests/generic/044
 create mode 100644 tests/generic/044.out

Comments

Dave Chinner Feb. 16, 2015, 12:33 a.m. UTC | #1
On Fri, Feb 13, 2015 at 12:47:54PM +0000, Filipe Manana wrote:
> This test is motivated by an fsync issue discovered in btrfs.
> The issue was that we could lose file data, that was previously fsync'ed
> successfully, if we end up shrinking (via truncate) the file, add a hard
> link to our file and then persist the fsync log later via an fsync of
> other inode for example. After a power loss our file content wouldn't
> match what it had when we last fsync'ed, but instead it had the data
> prior to that fsync.

Prior to which fsync?

XFS has strictly ordered metadata journalling, which means
transactions committed prior to *any* fsync will be present on disk
after the fsync. ext4 is not quite so strict, but has essentially
the same behaviour. ext3 in ordered mode behaves like XFS.

As such, ext3, ext4 and XFS return the state of the file as "0xaa for
0 to 5k, 0x00 out to 32k" and the hardlink foo_link is present after
the filesystem is remounted and the log replayed.

So, as this test is written, it does not encapsulate the
longstanding, expected behaviour of existing filesystems. btrfs
should behave like XFS, ext3 and ext4 if possible - being different
is only going to cause confusion and pain for application
developers.

> The btrfs issue was fixed by the following linux kernel patch:
> 
>   Btrfs: don't remove extents and xattrs when logging new names

Sounds like you've just introduced an ordered mode behavioural
journalling regression into btrfs...

Cheers,

Dave.
Liu Bo Feb. 16, 2015, 4:19 a.m. UTC | #2
On Mon, Feb 16, 2015 at 11:33:37AM +1100, Dave Chinner wrote:
> On Fri, Feb 13, 2015 at 12:47:54PM +0000, Filipe Manana wrote:
> > This test is motivated by an fsync issue discovered in btrfs.
> > The issue was that we could lose file data, that was previously fsync'ed
> > successfully, if we end up shrinking (via truncate) the file, add a hard
> > link to our file and then persist the fsync log later via an fsync of
> > other inode for example. After a power loss our file content wouldn't
> > match what it had when we last fsync'ed, but instead it had the data
> > prior to that fsync.
> 
> Prior to which fsync?
> 
> XFS has strictly ordered metadata journalling, which means
> transactions committed prior to *any* fsync will be present on disk
> after the fsync. ext4 is not quite so strict, but has essentially
> the same behaviour. ext3 in ordered mode behaves like XFS.
> 
> As such, ext3, ext4 and XFS return the state of the file as "0xaa for
> 0 to 5k, 0x00 out to 32k" and the hardlink foo_link is present after
> the filesystem is remounted and the log replayed.

But xfs doesn't work as above, something wrong?  
This my config file,

------------------------------------------
[btrfs]
TEST_DEV=/dev/vdd
TEST_DIR=/mnt/test
SCRATCH_DEV=/dev/vdc
SCRATCH_MNT=/mnt/scratch
FSTYP=btrfs
MKFS_OPTIONS="-f -b2G"
RESULT_BASE="`pwd`/results/`date +%m%d%y_%H%M`"
MOUNT_OPTIONS=""
RECREATE_TEST_DEV=true

[xfs]
FSTYP=xfs
MKFS_OPTIONS="-f"

[ext4]
FSTYP=ext4
MKFS_OPTIONS=""
------------------------------------------

Results:
-------------------------------------
cat btrfs/generic/044.out.bad

QA output created by 044
wrote 8192/8192 bytes at offset 0
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
wrote 4096/4096 bytes at offset 8192
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
File content before:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0011600 aa aa aa aa aa aa aa aa 00 00 00 00 00 00 00 00
0011620 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0100000
File content after:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0020000

-------------------------------------
cat xfs/generic/044.out.bad

QA output created by 044
wrote 8192/8192 bytes at offset 0
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
wrote 4096/4096 bytes at offset 8192
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
File content before:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0011600 aa aa aa aa aa aa aa aa 00 00 00 00 00 00 00 00
0011620 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0100000
File content after:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0100000

-------------------------------------
cat ext4/generic/044.out.bad

QA output created by 044
wrote 8192/8192 bytes at offset 0
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
wrote 4096/4096 bytes at offset 8192
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
File content before:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0011600 aa aa aa aa aa aa aa aa 00 00 00 00 00 00 00 00
0011620 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0100000
File content after:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0011600 aa aa aa aa aa aa aa aa 00 00 00 00 00 00 00 00
0011620 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0100000


Thanks,

-liubo
> 
> So, as this test is written, it does not encapsulate the
> longstanding, expected behaviour of existing filesystems. btrfs
> should behave like XFS, ext3 and ext4 if possible - being different
> is only going to cause confusion and pain for application
> developers.
> 
> > The btrfs issue was fixed by the following linux kernel patch:
> > 
> >   Btrfs: don't remove extents and xattrs when logging new names
> 
> Sounds like you've just introduced an ordered mode behavioural
> journalling regression into btrfs...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Filipe Manana Feb. 16, 2015, 9:45 a.m. UTC | #3
On Mon, Feb 16, 2015 at 12:33 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Fri, Feb 13, 2015 at 12:47:54PM +0000, Filipe Manana wrote:
>> This test is motivated by an fsync issue discovered in btrfs.
>> The issue was that we could lose file data, that was previously fsync'ed
>> successfully, if we end up shrinking (via truncate) the file, add a hard
>> link to our file and then persist the fsync log later via an fsync of
>> other inode for example. After a power loss our file content wouldn't
>> match what it had when we last fsync'ed, but instead it had the data
>> prior to that fsync.
>
> Prior to which fsync?

The fsync of file "foo" - this is the file we're testing for
correctness and we only fsync it once.

>
> XFS has strictly ordered metadata journalling, which means
> transactions committed prior to *any* fsync will be present on disk
> after the fsync. ext4 is not quite so strict, but has essentially
> the same behaviour. ext3 in ordered mode behaves like XFS.
>
> As such, ext3, ext4 and XFS return the state of the file as "0xaa for
> 0 to 5k, 0x00 out to 32k" and the hardlink foo_link is present after
> the filesystem is remounted and the log replayed.

Well, that's not I get with XFS, at least on a 3.19 kernel.
After log replay, I get a file with a size of 32Kb, the first 8192
bytes have the value 0xaa and the remaining 24Kb have value 0x00 -
this doesn't match the file content after any operation.

That seems unexpected to me.
I think only 3 results are acceptable after the log replay:

1) A file with a size of 12Kb, first 8Kb of data with value 0xaa, and
the last 4Kb of data with the value 0xcc. This is the file content
after it was explicitly fsync'ed;

2) A file with a size of 5000 bytes, all bytes having the value 0xaa.
This is the result after the shrinking truncate;

3) A file with a size of 32Kb, where the first 5000 bytes all have the
value 0xaa and the remaining bytes with value 0xaa. This is the result
of the expanding truncate.

Result 1) is the one I most expected. Results 2) and 3) are possible -
the fs might decide to commit the current transaction after any of the
truncate operations - it's still correct behaviour.
Also, after adding the hard link to foo, it might decide update the
fsync log, in this case we should get to 3).

I'm ok with changing the test to consider valid any of those 3 cases.

>
> So, as this test is written, it does not encapsulate the
> longstanding, expected behaviour of existing filesystems. btrfs
> should behave like XFS, ext3 and ext4 if possible - being different
> is only going to cause confusion and pain for application
> developers.
>
>> The btrfs issue was fixed by the following linux kernel patch:
>>
>>   Btrfs: don't remove extents and xattrs when logging new names
>
> Sounds like you've just introduced an ordered mode behavioural
> journalling regression into btrfs...

How can it be a regression?
Before this change, after the fsync log replay, the file content
matched what it had when "sync" was called - i.e. doesn't match the
content when the fsync (on "foo") was performed nor the content after
any of the subsequent truncate operations.

thanks

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe fstests" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Filipe Manana Feb. 16, 2015, 9:50 a.m. UTC | #4
On Mon, Feb 16, 2015 at 9:45 AM, Filipe David Manana <fdmanana@gmail.com> wrote:
> On Mon, Feb 16, 2015 at 12:33 AM, Dave Chinner <david@fromorbit.com> wrote:
>> On Fri, Feb 13, 2015 at 12:47:54PM +0000, Filipe Manana wrote:
>>> This test is motivated by an fsync issue discovered in btrfs.
>>> The issue was that we could lose file data, that was previously fsync'ed
>>> successfully, if we end up shrinking (via truncate) the file, add a hard
>>> link to our file and then persist the fsync log later via an fsync of
>>> other inode for example. After a power loss our file content wouldn't
>>> match what it had when we last fsync'ed, but instead it had the data
>>> prior to that fsync.
>>
>> Prior to which fsync?
>
> The fsync of file "foo" - this is the file we're testing for
> correctness and we only fsync it once.
>
>>
>> XFS has strictly ordered metadata journalling, which means
>> transactions committed prior to *any* fsync will be present on disk
>> after the fsync. ext4 is not quite so strict, but has essentially
>> the same behaviour. ext3 in ordered mode behaves like XFS.
>>
>> As such, ext3, ext4 and XFS return the state of the file as "0xaa for
>> 0 to 5k, 0x00 out to 32k" and the hardlink foo_link is present after
>> the filesystem is remounted and the log replayed.
>
> Well, that's not I get with XFS, at least on a 3.19 kernel.
> After log replay, I get a file with a size of 32Kb, the first 8192
> bytes have the value 0xaa and the remaining 24Kb have value 0x00 -
> this doesn't match the file content after any operation.
>
> That seems unexpected to me.
> I think only 3 results are acceptable after the log replay:
>
> 1) A file with a size of 12Kb, first 8Kb of data with value 0xaa, and
> the last 4Kb of data with the value 0xcc. This is the file content
> after it was explicitly fsync'ed;
>
> 2) A file with a size of 5000 bytes, all bytes having the value 0xaa.
> This is the result after the shrinking truncate;
>
> 3) A file with a size of 32Kb, where the first 5000 bytes all have the
> value 0xaa and the remaining bytes with value 0xaa. This is the result
> of the expanding truncate.

Sorry, typo. Obviously I meant to say the remaining bytes (5001 to
32Kb) having a value of 0x00, and not 0xaa.

>
> Result 1) is the one I most expected. Results 2) and 3) are possible -
> the fs might decide to commit the current transaction after any of the
> truncate operations - it's still correct behaviour.
> Also, after adding the hard link to foo, it might decide update the
> fsync log, in this case we should get to 3).
>
> I'm ok with changing the test to consider valid any of those 3 cases.
>
>>
>> So, as this test is written, it does not encapsulate the
>> longstanding, expected behaviour of existing filesystems. btrfs
>> should behave like XFS, ext3 and ext4 if possible - being different
>> is only going to cause confusion and pain for application
>> developers.
>>
>>> The btrfs issue was fixed by the following linux kernel patch:
>>>
>>>   Btrfs: don't remove extents and xattrs when logging new names
>>
>> Sounds like you've just introduced an ordered mode behavioural
>> journalling regression into btrfs...
>
> How can it be a regression?
> Before this change, after the fsync log replay, the file content
> matched what it had when "sync" was called - i.e. doesn't match the
> content when the fsync (on "foo") was performed nor the content after
> any of the subsequent truncate operations.
>
> thanks
>
>>
>> Cheers,
>>
>> Dave.
>> --
>> Dave Chinner
>> david@fromorbit.com
>> --
>> To unsubscribe from this list: send the line "unsubscribe fstests" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Filipe David Manana,
>
> "Reasonable men adapt themselves to the world.
>  Unreasonable men adapt the world to themselves.
>  That's why all progress depends on unreasonable men."
Dave Chinner Feb. 16, 2015, 9:10 p.m. UTC | #5
On Mon, Feb 16, 2015 at 12:19:58PM +0800, Liu Bo wrote:
> On Mon, Feb 16, 2015 at 11:33:37AM +1100, Dave Chinner wrote:
> > On Fri, Feb 13, 2015 at 12:47:54PM +0000, Filipe Manana wrote:
> > > This test is motivated by an fsync issue discovered in btrfs.
> > > The issue was that we could lose file data, that was previously fsync'ed
> > > successfully, if we end up shrinking (via truncate) the file, add a hard
> > > link to our file and then persist the fsync log later via an fsync of
> > > other inode for example. After a power loss our file content wouldn't
> > > match what it had when we last fsync'ed, but instead it had the data
> > > prior to that fsync.
> > 
> > Prior to which fsync?
> > 
> > XFS has strictly ordered metadata journalling, which means
> > transactions committed prior to *any* fsync will be present on disk
> > after the fsync. ext4 is not quite so strict, but has essentially
> > the same behaviour. ext3 in ordered mode behaves like XFS.
> > 
> > As such, ext3, ext4 and XFS return the state of the file as "0xaa for
> > 0 to 5k, 0x00 out to 32k" and the hardlink foo_link is present after
> > the filesystem is remounted and the log replayed.
> 
> But xfs doesn't work as above, something wrong?  

Maybe. I didn't check the exact output XFS or ext4 were giving, just
that is was different to what the test expected.

XFS and ext4 also behave differently w.r.t. data vs metadata
ordering, so there's a good change what you see here is the XFS
metadata being written, but not the corresponding block zeroing
that occurred on extenstion....

> -------------------------------------
> cat xfs/generic/044.out.bad
....
> File content after:
> 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> *
> 0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> *
> 0100000

So the truncate down to 5k should have resulted in zeros past 5k,
but the zeros only start at 8k. Yup, that looks like setattr didn't
flush the zeroed tail page on truncate up. I'll have a further look
at that. Thanks for the head's up!

Cheers,

Dave.
Dave Chinner Feb. 16, 2015, 10:02 p.m. UTC | #6
On Mon, Feb 16, 2015 at 09:45:22AM +0000, Filipe David Manana wrote:
> On Mon, Feb 16, 2015 at 12:33 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Fri, Feb 13, 2015 at 12:47:54PM +0000, Filipe Manana wrote:
> >> This test is motivated by an fsync issue discovered in btrfs.
> >> The issue was that we could lose file data, that was previously fsync'ed
> >> successfully, if we end up shrinking (via truncate) the file, add a hard
> >> link to our file and then persist the fsync log later via an fsync of
> >> other inode for example. After a power loss our file content wouldn't
> >> match what it had when we last fsync'ed, but instead it had the data
> >> prior to that fsync.
> >
> > Prior to which fsync?
> 
> The fsync of file "foo" - this is the file we're testing for
> correctness and we only fsync it once.

Ok. But there's also an implied fsync because btrfs is supposed to
have an ordered metadata transaction subsystem.

> > XFS has strictly ordered metadata journalling, which means
> > transactions committed prior to *any* fsync will be present on disk
> > after the fsync. ext4 is not quite so strict, but has essentially
> > the same behaviour. ext3 in ordered mode behaves like XFS.
> >
> > As such, ext3, ext4 and XFS return the state of the file as "0xaa for
> > 0 to 5k, 0x00 out to 32k" and the hardlink foo_link is present after
> > the filesystem is remounted and the log replayed.
> 
> Well, that's not I get with XFS, at least on a 3.19 kernel.
> After log replay, I get a file with a size of 32Kb, the first 8192
> bytes have the value 0xaa and the remaining 24Kb have value 0x00 -
> this doesn't match the file content after any operation.

See my previous reply in the thread. In memory behaviour appears
correct, which means fsx will pass but we won't have picked up a
regression in power failure behaviour.

> I'm ok with changing the test to consider valid any of those 3 cases.
> 
> >
> > So, as this test is written, it does not encapsulate the
> > longstanding, expected behaviour of existing filesystems. btrfs
> > should behave like XFS, ext3 and ext4 if possible - being different
> > is only going to cause confusion and pain for application
> > developers.
> >
> >> The btrfs issue was fixed by the following linux kernel patch:
> >>
> >>   Btrfs: don't remove extents and xattrs when logging new names
> >
> > Sounds like you've just introduced an ordered mode behavioural
> > journalling regression into btrfs...
> 
> How can it be a regression?

Maybe it's not a regression - I don't know what the old behaviour
was - but it certainly seems like a bug to me.  Btrfs is supposed to
have a strongly ordered transaction model, similar to XFS and ext4
and the test as you've written it implies transactional change
ordering is not preserved in btrfs (ouch!).

Ordered mode on this test is tricky because you are running fsync on
a newly created file. That implies fsync on the parent directory,
because fsync means you are stabilising all the *transactions* that
changed metadata on the new file. That means the dirent created to
point to the new file should be fsync()d, too.

So, order of modifications:

file foo data written and synced
file foo truncate/extent
hard link foo_link to foo created
new file bar created
new file bar fsync

In this case, it's the hard link of foo to foo_link that creates the
ordered dependency between foo, the parent directory and bar. The
directory is then modified in a future transaction by creating bar,
and the fsync on bar stabilises the parent directory as
transactional atomicity requires all objects in a transaction to be
stabilised when one object is asked to be stabilised. And for the
latest modification to an object to be stabilised, ordered metadata
requires all previous modifications to that object also need to be
stabilised at the same time.

IOWs, the fsync on bar has this ordered change dependency
resolution:

fsync(bar)
  - implied parent directory fsync for dirent pointing to bar
    - stabilises parent directory, which
      - also stabilises new hard link dirent to foo_link, which
	- stabilises new link count in foo, which
	  - stabilises inode core of foo, which
	    - stabilises changed inode size from truncate/extent
	    - stabilises extent list changes from truncate/extent

IOWs, ordered metadata transactions mean the second fsync, through
the chain of dependent modifications previously made, should also
stabilise all the metadata changes made on foo since the fsync was
called on it.

> Before this change, after the fsync log replay, the file content
> matched what it had when "sync" was called - i.e. doesn't match the
> content when the fsync (on "foo") was performed nor the content after
> any of the subsequent truncate operations.

IOWs, btrfs did not follow ordered metadata behavioural rules before
the change, nor does it follow them after the change, but it's
now broken in a different, more subtle and surprising way.

Cheers,

Dave.
Filipe Manana Feb. 19, 2015, 6:16 p.m. UTC | #7
On Mon, Feb 16, 2015 at 10:02 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Feb 16, 2015 at 09:45:22AM +0000, Filipe David Manana wrote:
>> On Mon, Feb 16, 2015 at 12:33 AM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Fri, Feb 13, 2015 at 12:47:54PM +0000, Filipe Manana wrote:
>> >> This test is motivated by an fsync issue discovered in btrfs.
>> >> The issue was that we could lose file data, that was previously fsync'ed
>> >> successfully, if we end up shrinking (via truncate) the file, add a hard
>> >> link to our file and then persist the fsync log later via an fsync of
>> >> other inode for example. After a power loss our file content wouldn't
>> >> match what it had when we last fsync'ed, but instead it had the data
>> >> prior to that fsync.
>> >
>> > Prior to which fsync?
>>
>> The fsync of file "foo" - this is the file we're testing for
>> correctness and we only fsync it once.
>
> Ok. But there's also an implied fsync because btrfs is supposed to
> have an ordered metadata transaction subsystem.
>
>> > XFS has strictly ordered metadata journalling, which means
>> > transactions committed prior to *any* fsync will be present on disk
>> > after the fsync. ext4 is not quite so strict, but has essentially
>> > the same behaviour. ext3 in ordered mode behaves like XFS.
>> >
>> > As such, ext3, ext4 and XFS return the state of the file as "0xaa for
>> > 0 to 5k, 0x00 out to 32k" and the hardlink foo_link is present after
>> > the filesystem is remounted and the log replayed.
>>
>> Well, that's not I get with XFS, at least on a 3.19 kernel.
>> After log replay, I get a file with a size of 32Kb, the first 8192
>> bytes have the value 0xaa and the remaining 24Kb have value 0x00 -
>> this doesn't match the file content after any operation.
>
> See my previous reply in the thread. In memory behaviour appears
> correct, which means fsx will pass but we won't have picked up a
> regression in power failure behaviour.
>
>> I'm ok with changing the test to consider valid any of those 3 cases.
>>
>> >
>> > So, as this test is written, it does not encapsulate the
>> > longstanding, expected behaviour of existing filesystems. btrfs
>> > should behave like XFS, ext3 and ext4 if possible - being different
>> > is only going to cause confusion and pain for application
>> > developers.
>> >
>> >> The btrfs issue was fixed by the following linux kernel patch:
>> >>
>> >>   Btrfs: don't remove extents and xattrs when logging new names
>> >
>> > Sounds like you've just introduced an ordered mode behavioural
>> > journalling regression into btrfs...
>>
>> How can it be a regression?
>
> Maybe it's not a regression - I don't know what the old behaviour
> was - but it certainly seems like a bug to me.  Btrfs is supposed to
> have a strongly ordered transaction model, similar to XFS and ext4
> and the test as you've written it implies transactional change
> ordering is not preserved in btrfs (ouch!).
>
> Ordered mode on this test is tricky because you are running fsync on
> a newly created file. That implies fsync on the parent directory,
> because fsync means you are stabilising all the *transactions* that
> changed metadata on the new file. That means the dirent created to
> point to the new file should be fsync()d, too.
>
> So, order of modifications:
>
> file foo data written and synced
> file foo truncate/extent
> hard link foo_link to foo created
> new file bar created
> new file bar fsync
>
> In this case, it's the hard link of foo to foo_link that creates the
> ordered dependency between foo, the parent directory and bar. The
> directory is then modified in a future transaction by creating bar,
> and the fsync on bar stabilises the parent directory as
> transactional atomicity requires all objects in a transaction to be
> stabilised when one object is asked to be stabilised. And for the
> latest modification to an object to be stabilised, ordered metadata
> requires all previous modifications to that object also need to be
> stabilised at the same time.
>
> IOWs, the fsync on bar has this ordered change dependency
> resolution:
>
> fsync(bar)
>   - implied parent directory fsync for dirent pointing to bar
>     - stabilises parent directory, which
>       - also stabilises new hard link dirent to foo_link, which
>         - stabilises new link count in foo, which
>           - stabilises inode core of foo, which
>             - stabilises changed inode size from truncate/extent
>             - stabilises extent list changes from truncate/extent
>
> IOWs, ordered metadata transactions mean the second fsync, through
> the chain of dependent modifications previously made, should also
> stabilise all the metadata changes made on foo since the fsync was
> called on it.
>
>> Before this change, after the fsync log replay, the file content
>> matched what it had when "sync" was called - i.e. doesn't match the
>> content when the fsync (on "foo") was performed nor the content after
>> any of the subsequent truncate operations.
>
> IOWs, btrfs did not follow ordered metadata behavioural rules before
> the change, nor does it follow them after the change, but it's
> now broken in a different, more subtle and surprising way.

Dave, thanks for the detailed explanation.
I'll update the test once I get btrfs to have the same behaviour as
xfs and ext3/4.



>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
diff mbox

Patch

diff --git a/tests/generic/044 b/tests/generic/044
new file mode 100755
index 0000000..95a42cc
--- /dev/null
+++ b/tests/generic/044
@@ -0,0 +1,131 @@ 
+#! /bin/bash
+# FS QA Test No. 044
+#
+# This test is motivated by an fsync issue discovered in btrfs.
+# The issue was that we could lose file data, that was previously fsync'ed
+# successfully, if we end up shrinking (via truncate) the file, add a hard
+# link to our file and then persist the fsync log later via an fsync of other
+# inode for example. After a power loss our file content wouldn't match
+# what it had when we last fsync'ed, but instead it had the data prior to
+# that fsync.
+#
+# The btrfs issue was fixed by the following linux kernel patch:
+#
+#  Btrfs: don't remove extents and xattrs when logging new names
+#
+#-----------------------------------------------------------------------
+# Copyright (C) 2015 SUSE Linux Products GmbH. All Rights Reserved.
+# Author: Filipe Manana <fdmanana@suse.com>
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+
+_cleanup()
+{
+	_cleanup_flakey
+	rm -f $tmp.*
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmflakey
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_need_to_be_root
+_require_scratch
+_require_dm_flakey
+
+rm -f $seqres.full
+
+_scratch_mkfs >> $seqres.full 2>&1
+_init_flakey
+_mount_flakey
+
+# Create our test file with some data.
+$XFS_IO_PROG -f -c "pwrite -S 0xaa -b 8K 0 8K" \
+	$SCRATCH_MNT/foo | _filter_xfs_io
+
+# Make sure the file is durably persisted.
+sync
+
+# Append some data to our file, to increase its size.
+$XFS_IO_PROG -f -c "pwrite -S 0xcc -b 4K 8K 4K" \
+	$SCRATCH_MNT/foo | _filter_xfs_io
+
+# Fsync the file, so from this point on if a crash/power failure happens, our
+# new data is guaranteed to be there next time the fs is mounted.
+$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
+
+# Now shrink our file to 5000 bytes.
+$XFS_IO_PROG -c "truncate 5000" $SCRATCH_MNT/foo
+
+# Now do an expanding truncate to a size larger than what we had when we last
+# fsync'ed our file. This is just to verify that after power failure and
+# replaying the fsync log, our file matches what it was when we last fsync'ed
+# it - 12Kb size, first 8Kb of data had a value of 0xaa and the last 4Kb of
+# data had a value of 0xcc.
+$XFS_IO_PROG -c "truncate 32K" $SCRATCH_MNT/foo
+
+# Add one hard link to our file. This made btrfs drop all of our file's
+# metadata from the fsync log, including the metadata relative to the
+# extent we just wrote and fsync'ed. This change was made only to the fsync
+# log in memory, so adding the hard link alone doesn't change the persisted
+# fsync log. This happened because the previous truncates set the runtime
+# flag BTRFS_INODE_NEEDS_FULL_SYNC in the btrfs inode structure.
+ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link
+
+# Now make sure the in memory fsync log is durably persisted.
+# Creating and fsync'ing another file will do it.
+# After this our persisted fsync log will no longer have metadata for our file
+# foo that points to the extent we wrote and fsync'ed before.
+touch $SCRATCH_MNT/bar
+$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar
+
+# As expected, before the crash/power failure, we should be able to see a file
+# with a size of 32Kb, with its first 5000 bytes having the value 0xaa and all
+# the remaining bytes with value 0x00.
+echo "File content before:"
+od -t x1 $SCRATCH_MNT/foo
+
+# Simulate a crash/power loss.
+_load_flakey_table $FLAKEY_DROP_WRITES
+_unmount_flakey
+
+_load_flakey_table $FLAKEY_ALLOW_WRITES
+_mount_flakey
+
+# After mounting the fs again, the fsync log was replayed.
+# The expected result is to see a file with a size of 12Kb, with its first 8Kb
+# of data having the value 0xaa and its last 4Kb of data having a value of 0xcc.
+# The btrfs bug used to leave the file as it used te be as of the last
+# transaction commit - that is, with a size of 8Kb with all bytes having a
+# value of 0xaa.
+echo "File content after:"
+od -t x1 $SCRATCH_MNT/foo
+
+status=0
+exit
diff --git a/tests/generic/044.out b/tests/generic/044.out
new file mode 100644
index 0000000..397dbae
--- /dev/null
+++ b/tests/generic/044.out
@@ -0,0 +1,18 @@ 
+QA output created by 044
+wrote 8192/8192 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 4096/4096 bytes at offset 8192
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+File content before:
+0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
+*
+0011600 aa aa aa aa aa aa aa aa 00 00 00 00 00 00 00 00
+0011620 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+*
+0100000
+File content after:
+0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
+*
+0020000 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
+*
+0030000
diff --git a/tests/generic/group b/tests/generic/group
index d686f2f..dc16972 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -46,6 +46,7 @@ 
 041 metadata auto quick
 042 metadata auto quick
 043 metadata auto quick
+044 metadata auto quick
 053 acl repair auto quick
 062 attr udf auto quick
 068 other auto freeze dangerous stress