generic: skip dm-log-writes tests on XFS v5 superblock filesystems

Message ID 20190226181407.44612-1-bfoster@redhat.com
State New

Commit Message

Brian Foster Feb. 26, 2019, 6:14 p.m. UTC
The dm-log-writes mechanism runs a workload against a filesystem,
tracks underlying FUAs and restores the filesystem to various points
in time based on FUA marks. This allows fstests to check fs
consistency at various points and verify log recovery works as
expected.

This mechanism does not play well with LSN based log recovery
ordering behavior on XFS v5 superblocks, however. For example,
generic/482 can reproduce false positive corruptions based on extent
to btree conversion of an inode if the inode and associated btree
block are written back after different checkpoints. Even though both
items are logged correctly in the extent-to-btree transaction, the
btree block can be relogged (multiple times) and only written back
once when the filesystem unmounts. If the inode was written back
after the initial conversion, recovery points between that mark and
when the btree block is ultimately written back will show corruption
because log recovery sees that the destination buffer is newer than
the recovered buffer and intentionally skips the buffer. This is a
false positive because the destination buffer was resiliently
written back after being physically relogged one or more times.

Update the dm-log-writes require checks to enforce v4 superblocks
when running against XFS and skip the test otherwise.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 common/dmlogwrites |  6 ++++++
 common/xfs         | 11 +++++++++++
 tests/generic/482  |  1 -
 3 files changed, 17 insertions(+), 1 deletion(-)
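The gist of the new require check is detecting a v5 (crc-enabled) superblock and skipping the test. A minimal sketch of the detection side (the helper name and parsing here are illustrative, not necessarily what the patch's common/xfs hunk does; xfs_info reports crc=1 on v5 filesystems):

```shell
# Illustrative sketch only: decide from xfs_info(8) output whether a
# filesystem uses a v5 (crc=1) superblock.  The real fstests helper may
# parse this differently.
_xfs_info_is_v5()
{
	# v5 superblocks are reported as "crc=1" in the meta-data section.
	echo "$1" | grep -q "crc=1"
}

# Hypothetical usage in a _require-style check:
#	_xfs_info_is_v5 "$(xfs_info $SCRATCH_MNT)" && \
#		_notrun "dm-log-writes is not reliable on XFS v5 superblocks"
```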

Comments

Amir Goldstein Feb. 26, 2019, 9:10 p.m. UTC | #1
On Tue, Feb 26, 2019 at 8:14 PM Brian Foster <bfoster@redhat.com> wrote:
>
> The dm-log-writes mechanism runs a workload against a filesystem,
> tracks underlying FUAs and restores the filesystem to various points
> in time based on FUA marks. This allows fstests to check fs
> consistency at various points and verify log recovery works as
> expected.
>

Inaccurate. generic/482 restores to FUA points.
generic/45[57] restore to user-defined points in time (marks).
The dm-log-writes mechanism is capable of restoring either.

> This mechanism does not play well with LSN based log recovery
> ordering behavior on XFS v5 superblocks, however. For example,
> generic/482 can reproduce false positive corruptions based on extent
> to btree conversion of an inode if the inode and associated btree
> block are written back after different checkpoints. Even though both
> items are logged correctly in the extent-to-btree transaction, the
> btree block can be relogged (multiple times) and only written back
> once when the filesystem unmounts. If the inode was written back
> after the initial conversion, recovery points between that mark and
> when the btree block is ultimately written back will show corruption
> because log recovery sees that the destination buffer is newer than
> the recovered buffer and intentionally skips the buffer. This is a
> false positive because the destination buffer was resiliently
> written back after being physically relogged one or more times.
>

This story doesn't add up.
Either dm-log-writes emulates power failure correctly or it doesn't.
My understanding is that the issue you are seeing is a result of
XFS seeing "data from the future" after a restore of a power failure
snapshot, because the scratch device is not a clean slate.
If I am right, then the correct solution is to wipe the journal before
starting to replay restore points.

Am I misunderstanding what's going on?

IIRC, some of Josef's earlier versions used dm snapshots to restore
the blockdev to a clean state before replaying log-writes.
I think that one of the earlier versions of generic/482 also took
that approach, but that resulted in longer test runtime (not sure).

> Update the dm-log-writes require checks to enforce v4 superblocks
> when running against XFS and skip the test otherwise.

You might as well disable dm-log-writes test for XFS completely.
Who cares about v4 superblocks these days?
We need a tool to make sure the NEW features are crash resilient.

dm-log-writes proved itself to be a powerful generic test tool that found
some serious crash consistency bugs in every one of the major filesystems
and it found bugs with XFS reflink log recovery as well, so IMO
disabling dm-log-writes for v5 would be "very unwise!".

Thanks,
Amir.
Dave Chinner Feb. 26, 2019, 11:22 p.m. UTC | #2
On Tue, Feb 26, 2019 at 11:10:02PM +0200, Amir Goldstein wrote:
> On Tue, Feb 26, 2019 at 8:14 PM Brian Foster <bfoster@redhat.com> wrote:
> >
> > The dm-log-writes mechanism runs a workload against a filesystem,
> > tracks underlying FUAs and restores the filesystem to various points
> > in time based on FUA marks. This allows fstests to check fs
> > consistency at various points and verify log recovery works as
> > expected.
> >
> 
> Inaccurate. generic/482 restores to FUA points.
> generic/45[57] restore to user defined points in time (marks).
> dm-log-writes mechanism is capable of restoring either.
> 
> > This mechanism does not play well with LSN based log recovery
> > ordering behavior on XFS v5 superblocks, however. For example,
> > generic/482 can reproduce false positive corruptions based on extent
> > to btree conversion of an inode if the inode and associated btree
> > block are written back after different checkpoints. Even though both
> > items are logged correctly in the extent-to-btree transaction, the
> > btree block can be relogged (multiple times) and only written back
> > once when the filesystem unmounts. If the inode was written back
> > after the initial conversion, recovery points between that mark and
> > when the btree block is ultimately written back will show corruption
> > because log recovery sees that the destination buffer is newer than
> > the recovered buffer and intentionally skips the buffer. This is a
> > false positive because the destination buffer was resiliently
> > written back after being physically relogged one or more times.
> >
> 
> This story doesn't add up.
> Either dm-log-writes emulates power failure correctly or it doesn't.
> My understanding is that the issue you are seeing is a result of
> XFS seeing "data from the future" after a restore of a power failure
> snapshot, because the scratch device is not a clean slate.
> If I am right, then the correct solution is to wipe the journal before
> starting to replay restore points.

If that is the problem, then I think we should be wiping the entire
block device before replaying the recorded logwrite.

i.e. this sounds like a "block device we are replaying onto has
stale data in it" problem because we are replaying the same
filesystem over the top of itself.  Hence there are no unique
identifiers in the metadata that can detect stale metadata in
the block device.

I'm surprised that we haven't tripped over this much earlier than
this...
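A full wipe along the lines Dave suggests could be sketched like this (illustrative only; the helper name and exact commands are assumptions, not existing fstests code):

```shell
# Sketch of wiping the replay target before a log-writes replay, so no
# stale metadata "from the future" survives from a previous replay pass.
_wipe_replay_target()
{
	local dev="$1"

	# Prefer a full-device discard where supported (fast on SSD/thin LVs).
	blkdiscard -f "$dev" 2>/dev/null && return 0

	# Fall back to overwriting every byte with zeroes.  Works for both
	# block devices and image files.
	local bytes
	bytes=$(blockdev --getsize64 "$dev" 2>/dev/null || stat -c %s "$dev")
	head -c "$bytes" /dev/zero > "$dev" && sync
}
```

Something like this would run once per iteration of the test's replay loop, before each checkpoint is restored.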

> > Update the dm-log-writes require checks to enforce v4 superblocks
> > when running against XFS and skip the test otherwise.
> 
> You might as well disable dm-log-writes test for XFS completely.
> Who cares about v4 superblocks these days?

Enough of the inflammatory hyperbole, Amir. Statements like this
serve no useful purpose.

Cheers,

Dave.
Amir Goldstein Feb. 27, 2019, 4:06 a.m. UTC | #3
On Wed, Feb 27, 2019 at 1:22 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Feb 26, 2019 at 11:10:02PM +0200, Amir Goldstein wrote:
> > On Tue, Feb 26, 2019 at 8:14 PM Brian Foster <bfoster@redhat.com> wrote:
> > >
> > > The dm-log-writes mechanism runs a workload against a filesystem,
> > > tracks underlying FUAs and restores the filesystem to various points
> > > in time based on FUA marks. This allows fstests to check fs
> > > consistency at various points and verify log recovery works as
> > > expected.
> > >
> >
> > Inaccurate. generic/482 restores to FUA points.
> > generic/45[57] restore to user defined points in time (marks).
> > dm-log-writes mechanism is capable of restoring either.
> >
> > > This mechanism does not play well with LSN based log recovery
> > > ordering behavior on XFS v5 superblocks, however. For example,
> > > generic/482 can reproduce false positive corruptions based on extent
> > > to btree conversion of an inode if the inode and associated btree
> > > block are written back after different checkpoints. Even though both
> > > items are logged correctly in the extent-to-btree transaction, the
> > > btree block can be relogged (multiple times) and only written back
> > > once when the filesystem unmounts. If the inode was written back
> > > after the initial conversion, recovery points between that mark and
> > > when the btree block is ultimately written back will show corruption
> > > because log recovery sees that the destination buffer is newer than
> > > the recovered buffer and intentionally skips the buffer. This is a
> > > false positive because the destination buffer was resiliently
> > > written back after being physically relogged one or more times.
> > >
> >
> > This story doesn't add up.
> > Either dm-log-writes emulates power failure correctly or it doesn't.
> > My understanding is that the issue you are seeing is a result of
> > XFS seeing "data from the future" after a restore of a power failure
> > snapshot, because the scratch device is not a clean slate.
> > If I am right, then the correct solution is to wipe the journal before
> > starting to replay restore points.
>
> If that is the problem, then I think we should be wiping the entire
> block device before replaying the recorded logwrite.
>

Indeed.

> i.e. this sounds like a "block device we are replaying onto has
> stale data in it" problem because we are replaying the same
> filesystem over the top of itself.  Hence there are no unique
> identifiers in the metadata that can detect stale metadata in
> the block device.
>
> I'm surprised that we haven't tripped over this much earlier than
> this...
>

I remember asking myself the same thing... it's coming back to me
now. I really remember having this discussion during test review.
generic/482 is an adaptation of Josef's test script [1], which
does log recovery onto a snapshot on every FUA checkpoint.

[1] https://github.com/josefbacik/log-writes/blob/master/replay-fsck-wrapper.sh

Setting up snapshots for every checkpoint was found empirically to take
more test runtime than replaying the log from the start for each checkpoint.
That observation was limited to the systems that Qu and Eryu tested on.

IIRC, what usually took care of cleaning the block device was replaying the
"discard everything" IO from mkfs time. The dm-log-writes driver should take
care of zeroing blocks upon replaying a discard IO even on a target device
that doesn't support discard, but maybe if the original device doesn't support
discard, the IO is not recorded at all and therefore not replayed?

Brian, can you check if your log-writes stream contains a discard IO for
the entire block device? I do remember that the initial log-writes tests worked
very reliably on my laptop with SSD and were a bit flaky on another test machine
with spinning rust, but that machine had other hardware reliability issues at
the time (bad SATA cable) so I attributed all issues to that problem.

BTW, looking closer at generic/482, $prev does not seem to be used at all
in the replay loop.

> > > Update the dm-log-writes require checks to enforce v4 superblocks
> > > when running against XFS and skip the test otherwise.
> >
> > You might as well disable dm-log-writes test for XFS completely.
> > Who cares about v4 superblocks these days?
>
> Enough of the inflammatory hyperbole, Amir. Statements like this
> serve no useful purpose.
>

<deep breath> Agreed.

Thanks,
Amir.
Qu Wenruo Feb. 27, 2019, 4:19 a.m. UTC | #4
On 2019/2/27 12:06 PM, Amir Goldstein wrote:
> On Wed, Feb 27, 2019 at 1:22 AM Dave Chinner <david@fromorbit.com> wrote:
>>
>> On Tue, Feb 26, 2019 at 11:10:02PM +0200, Amir Goldstein wrote:
>>> On Tue, Feb 26, 2019 at 8:14 PM Brian Foster <bfoster@redhat.com> wrote:
>>>>
>>>> The dm-log-writes mechanism runs a workload against a filesystem,
>>>> tracks underlying FUAs and restores the filesystem to various points
>>>> in time based on FUA marks. This allows fstests to check fs
>>>> consistency at various points and verify log recovery works as
>>>> expected.
>>>>
>>>
>>> Inaccurate. generic/482 restores to FUA points.
>>> generic/45[57] restore to user defined points in time (marks).
>>> dm-log-writes mechanism is capable of restoring either.
>>>
>>>> This mechanism does not play well with LSN based log recovery
>>>> ordering behavior on XFS v5 superblocks, however. For example,
>>>> generic/482 can reproduce false positive corruptions based on extent
>>>> to btree conversion of an inode if the inode and associated btree
>>>> block are written back after different checkpoints. Even though both
>>>> items are logged correctly in the extent-to-btree transaction, the
>>>> btree block can be relogged (multiple times) and only written back
>>>> once when the filesystem unmounts. If the inode was written back
>>>> after the initial conversion, recovery points between that mark and
>>>> when the btree block is ultimately written back will show corruption
>>>> because log recovery sees that the destination buffer is newer than
>>>> the recovered buffer and intentionally skips the buffer. This is a
>>>> false positive because the destination buffer was resiliently
>>>> written back after being physically relogged one or more times.
>>>>
>>>
>>> This story doesn't add up.
>>> Either dm-log-writes emulates power failure correctly or it doesn't.
>>> My understanding is that the issue you are seeing is a result of
>>> XFS seeing "data from the future" after a restore of a power failure
>>> snapshot, because the scratch device is not a clean slate.
>>> If I am right, then the correct solution is to wipe the journal before
>>> starting to replay restore points.
>>
>> If that is the problem, then I think we should be wiping the entire
>> block device before replaying the recorded logwrite.
>>
> 
> Indeed.

May I ask a stupid question?

How does it matter whether the device is clean or not?
Shouldn't the journal/metadata or whatever be self-contained?

> 
>> i.e. this sounds like a "block device we are replaying onto has
>> stale data in it" problem because we are replaying the same
>> filesystem over the top of itself.  Hence there are no unique
>> identifiers in the metadata that can detect stale metadata in
>> the block device.
>>
>> I'm surprised that we haven't tripped over this much earlier than
>> this...
>>
> 
> I remember asking myself the same thing... it's coming back to me
> now. I really remember having this discussion during test review.
> generic/482 is an adaptation of Josef's test script [1], which
> does log recovery onto a snapshot on every FUA checkpoint.
> 
> [1] https://github.com/josefbacik/log-writes/blob/master/replay-fsck-wrapper.sh
> 
> Setting up snapshots for every checkpoint was found empirically to take
> more test runtime, than replaying log from the start for each checkpoint.
> That observation was limited to the systems that Qu and Eryu tested on.
> 
> IIRC, what usually took care of cleaning the block device is replaying the
> "discard everything" IO from mkfs time.

This "discard everything" assumption doesn't look right to me.
Although most mkfs implementations would discard at least part of the
device, even without discarding, the newly created fs should be
self-contained, with no wild pointers to garbage.

I thought all metadata/journal writes should be self-contained, even for
later fs writes.

Am I missing something? Or am I too poisoned by btrfs CoW?

Thanks,
Qu

> dm-log-writes driver should take care
> of zeroing blocks upon replaying a discard IO even on a target device that
> doesn't support discard, but maybe if original device doesn't support discard,
> the IO is not recorded at all and therefore not replayed?
> 
> Brian, can you check if your log-writes stream contains a discard IO for
> the entire block device? I do remember that the initial log-writes tests worked
> very reliably on my laptop with SSD and were a bit flaky on another test machine
> with spinning rust, but that machine had other hardware reliability issues at
> the time (bad SATA cable) so I attributed all issues to that problem.
> 
> BTW, looking closer at generic/482, $prev does not seem to be used at all
> in the replay loop.
> 
>>>> Update the dm-log-writes require checks to enforce v4 superblocks
>>>> when running against XFS and skip the test otherwise.
>>>
>>> You might as well disable dm-log-writes test for XFS completely.
>>> Who cares about v4 superblocks these days?
>>
>> Enough of the inflammatory hyperbole, Amir. Statements like this
>> serve no useful purpose.
>>
> 
> <deep breath> Agreed.
> 
> Thanks,
> Amir.
>
Amir Goldstein Feb. 27, 2019, 4:49 a.m. UTC | #5
On Wed, Feb 27, 2019 at 6:19 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2019/2/27 12:06 PM, Amir Goldstein wrote:
> > On Wed, Feb 27, 2019 at 1:22 AM Dave Chinner <david@fromorbit.com> wrote:
> >>
> >> On Tue, Feb 26, 2019 at 11:10:02PM +0200, Amir Goldstein wrote:
> >>> On Tue, Feb 26, 2019 at 8:14 PM Brian Foster <bfoster@redhat.com> wrote:
> >>>>
> >>>> The dm-log-writes mechanism runs a workload against a filesystem,
> >>>> tracks underlying FUAs and restores the filesystem to various points
> >>>> in time based on FUA marks. This allows fstests to check fs
> >>>> consistency at various points and verify log recovery works as
> >>>> expected.
> >>>>
> >>>
> >>> Inaccurate. generic/482 restores to FUA points.
> >>> generic/45[57] restore to user defined points in time (marks).
> >>> dm-log-writes mechanism is capable of restoring either.
> >>>
> >>>> This mechanism does not play well with LSN based log recovery
> >>>> ordering behavior on XFS v5 superblocks, however. For example,
> >>>> generic/482 can reproduce false positive corruptions based on extent
> >>>> to btree conversion of an inode if the inode and associated btree
> >>>> block are written back after different checkpoints. Even though both
> >>>> items are logged correctly in the extent-to-btree transaction, the
> >>>> btree block can be relogged (multiple times) and only written back
> >>>> once when the filesystem unmounts. If the inode was written back
> >>>> after the initial conversion, recovery points between that mark and
> >>>> when the btree block is ultimately written back will show corruption
> >>>> because log recovery sees that the destination buffer is newer than
> >>>> the recovered buffer and intentionally skips the buffer. This is a
> >>>> false positive because the destination buffer was resiliently
> >>>> written back after being physically relogged one or more times.
> >>>>
> >>>
> >>> This story doesn't add up.
> >>> Either dm-log-writes emulates power failure correctly or it doesn't.
> >>> My understanding is that the issue you are seeing is a result of
> >>> XFS seeing "data from the future" after a restore of a power failure
> >>> snapshot, because the scratch device is not a clean slate.
> >>> If I am right, then the correct solution is to wipe the journal before
> >>> starting to replay restore points.
> >>
> >> If that is the problem, then I think we should be wiping the entire
> >> block device before replaying the recorded logwrite.
> >>
> >
> > Indeed.
>
> May I ask a stupid question?
>
> How does it matter whether the device is clean or not?
> Shouldn't the journal/metadata or whatever be self-contained?
>

Yes and no.

The simplest example (not limited to xfs, and I'm not sure it works like that
in xfs) is how you find the last valid journal commit entry: it should have a
correct CRC and the largest LSN. But if you replay IO on top of an existing
journal without wiping it first, journal recovery will continue past the point
you meant to replay to, or worse.

The problem that Brian describes is more complicated than that, and not
limited to the data in the journal IIUC, but I think what I described above
may also plague ext4 and xfs v4.

> >
> >> i.e. this sounds like a "block device we are replaying onto has
> >> stale data in it" problem because we are replaying the same
> >> filesystem over the top of itself.  Hence there are no unique
> >> identifiers in the metadata that can detect stale metadata in
> >> the block device.
> >>
> >> I'm surprised that we haven't tripped over this much earlier than
> >> this...
> >>
> >
> > I remember asking myself the same thing... it's coming back to me
> > now. I really remember having this discussion during test review.
> > generic/482 is an adaptation of Josef's test script [1], which
> > does log recovery onto a snapshot on every FUA checkpoint.
> >
> > [1] https://github.com/josefbacik/log-writes/blob/master/replay-fsck-wrapper.sh
> >
> > Setting up snapshots for every checkpoint was found empirically to take
> > more test runtime, than replaying log from the start for each checkpoint.
> > That observation was limited to the systems that Qu and Eryu tested on.
> >
> > IIRC, what usually took care of cleaning the block device is replaying the
> > "discard everything" IO from mkfs time.
>
> This "discard everything" assumption doesn't look right to me.
> Although most mkfs would discard at least part of the device, even
> without discarding the newly created fs should be self-contained, no
> wild pointer points to some garbage.
>

It's true. We shouldn't make this assumption.
That was my explanation for Dave's question of how come we didn't see
this before.

Here is my log-writes info from generic/482:
./src/log-writes/replay-log -vv --find --end-mark mkfs --log
$LOGWRITES_DEV |grep DISCARD
seek entry 0@2: 0, size 8388607, flags 0x4(DISCARD)
seek entry 1@3: 8388607, size 8388607, flags 0x4(DISCARD)
seek entry 2@4: 16777214, size 4194306, flags 0x4(DISCARD)

> I thought all metadata/journal writes should be self-contained even for
> later fs writes.
>
> Am I missing something? Or do I get too poisoned by btrfs CoW?
>

I'd be very surprised if btrfs cannot be flipped by seeing stale data "from
the future" in the block device. Seems to me like the entire concept of
CoW and metadata checksums is completely subverted by the existence
of correct checksums on "stale metadata from the future".

Thanks,
Amir.
Qu Wenruo Feb. 27, 2019, 5:01 a.m. UTC | #6
On 2019/2/27 12:49 PM, Amir Goldstein wrote:
[snip]
>>> Indeed.
>>
>> May I ask a stupid question?
>>
>> How does it matter whether the device is clean or not?
>> Shouldn't the journal/metadata or whatever be self-contained?
>>
> 
> Yes and no.
> 
> The most simple example (not limited to xfs and not sure it is like that in xfs)
> is how you find the last valid journal commit entry. It should have correct CRC
> and the largest LSN. But if you replay IO on top of existing journal without
> wiping it first, then journal recovery will continue past the point you meant to
> replay to, or worse.

The journal doesn't have some superblock-like mechanism to tell where to
stop replaying?

If not, then indeed we need to discard the journal before writing a new one.

> 
> The problem that Brian describes is more complicated than that and not
> limited to the data in the journal IIUC, but I think what I described above
> may also plague ext4 and xfs v4.
> 
>>>
>>>> i.e. this sounds like a "block device we are replaying onto has
>>>> stale data in it" problem because we are replaying the same
>>>> filesystem over the top of itself.  Hence there are no unique
>>>> identifiers in the metadata that can detect stale metadata in
>>>> the block device.
>>>>
>>>> I'm surprised that we haven't tripped over this much earlier than
>>>> this...
>>>>
>>>
>>> I remember asking myself the same thing... it's coming back to me
>>> now. I really remember having this discussion during test review.
>>> generic/482 is an adaptation of Josef's test script [1], which
>>> does log recovery onto a snapshot on every FUA checkpoint.
>>>
>>> [1] https://github.com/josefbacik/log-writes/blob/master/replay-fsck-wrapper.sh
>>>
>>> Setting up snapshots for every checkpoint was found empirically to take
>>> more test runtime, than replaying log from the start for each checkpoint.
>>> That observation was limited to the systems that Qu and Eryu tested on.
>>>
>>> IIRC, what usually took care of cleaning the block device is replaying the
>>> "discard everything" IO from mkfs time.
>>
>> This "discard everything" assumption doesn't look right to me.
>> Although most mkfs would discard at least part of the device, even
>> without discarding the newly created fs should be self-contained, no
>> wild pointer points to some garbage.
>>
> 
> It's true. We shouldn't make this assumption.
> That was my explanation to Dave's question, how come we didn't see
> this before?
> 
> Here is my log-writes info from generic/482:
> ./src/log-writes/replay-log -vv --find --end-mark mkfs --log
> $LOGWRITES_DEV |grep DISCARD
> seek entry 0@2: 0, size 8388607, flags 0x4(DISCARD)
> seek entry 1@3: 8388607, size 8388607, flags 0x4(DISCARD)
> seek entry 2@4: 16777214, size 4194306, flags 0x4(DISCARD)
> 
>> I thought all metadata/journal writes should be self-contained even for
>> later fs writes.
>>
>> Am I missing something? Or do I get too poisoned by btrfs CoW?
>>
> 
> I'd be very surprised if btrfs cannot be flipped by seeing stale data "from
> the future" in the block device. Seems to me like the entire concept of
> CoW and metadata checksums is completely subverted by the existence
> of correct checksums on "stale metadata from the future".

It seems that metadata CoW makes it impossible to see future data.

All btrees get updated via CoW, so no metadata will be overwritten
during one transaction.
Only the superblock is overwritten, and the superblock is normally
updated atomically.

So either the old superblock is still there, and all we can see is old tree
pointers, or the new superblock is there, and all we can see is new tree
pointers. New metadata will never be written over old metadata, so there
is no way to see future metadata.

Thanks,
Qu

> 
> Thanks,
> Amir.
>
Amir Goldstein Feb. 27, 2019, 5:19 a.m. UTC | #7
On Wed, Feb 27, 2019 at 7:01 AM Qu Wenruo <wqu@suse.de> wrote:
>
>
>
> On 2019/2/27 12:49 PM, Amir Goldstein wrote:
> [snip]
> >>> Indeed.
> >>
> >> May I ask a stupid question?
> >>
> >> How does it matter whether the device is clean or not?
> >> Shouldn't the journal/metadata or whatever be self-contained?
> >>
> >
> > Yes and no.
> >
> > The most simple example (not limited to xfs and not sure it is like that in xfs)
> > is how you find the last valid journal commit entry. It should have correct CRC
> > and the largest LSN. But if you replay IO on top of existing journal without
> > wiping it first, then journal recovery will continue past the point you meant to
> > replay to, or worse.
>
> The journal doesn't have some superblock like system to tell where to
> stop replaying?
>

Yes, it does. My statement was inaccurate.
The point is: how can the filesystem trust that the journal superblock is not
corrupted? If the filesystem observes data from the future that is not expected
to be there, whether in the journal or in any checksummed metadata on the
device, it may fail the sanity checks meant to verify that the superblock or
other metadata are not corrupted.

Suppose a filesystem has a bug that overwrites new metadata from time N with
an old stale metadata buffer from time N-1. That would look not much different
from a log-writes replay to time N-1 over a non-clean device from time N.

> If not, then indeed we need to discard the journal before writing new one.
>
...

> >> Am I missing something? Or do I get too poisoned by btrfs CoW?
> >>
> >
> > I'd be very surprised if btrfs cannot be flipped by seeing stale data "from
> > the future" in the block device. Seems to me like the entire concept of
> > CoW and metadata checksums is completely subverted by the existence
> > of correct checksums on "stale metadata from the future".
>
> It seems that metadata CoW makes it impossible to see future data.
>
> All btree trees get updated CoW, so no metadata will be overwritten
> during one transaction.
> Only super block is overwritten and normally superblock is updated
> atomically.
>
> So either old superblock is still here, all we can see is old tree pointers.
> Or new superblock is here, all we can see is new tree pointers.
> And new metadata will never be written into old metadata, there is no
> way to see future metadata.
>

Those assumptions could fail if you have unreliable hardware that reorders
IO across FUA or just drops IO on the floor, or if there is a bug in the
filesystem or block layer.

The existence of metadata from the future could look like any of the above
has happened.

Thanks,
Amir.
Qu Wenruo Feb. 27, 2019, 5:32 a.m. UTC | #8
On 2019/2/27 1:19 PM, Amir Goldstein wrote:
> On Wed, Feb 27, 2019 at 7:01 AM Qu Wenruo <wqu@suse.de> wrote:
>>
>>
>>
>> On 2019/2/27 12:49 PM, Amir Goldstein wrote:
>> [snip]
>>>>> Indeed.
>>>>
>>>> May I ask a stupid question?
>>>>
>>>> How does it matter whether the device is clean or not?
>>>> Shouldn't the journal/metadata or whatever be self-contained?
>>>>
>>>
>>> Yes and no.
>>>
>>> The most simple example (not limited to xfs and not sure it is like that in xfs)
>>> is how you find the last valid journal commit entry. It should have correct CRC
>>> and the largest LSN. But if you replay IO on top of existing journal without
>>> wiping it first, then journal recovery will continue past the point you meant to
>>> replay to, or worse.
>>
>> The journal doesn't have some superblock like system to tell where to
>> stop replaying?
>>
> 
> Yes, it does. My statement was inaccurate.
> The point is how can filesystem trust that journal superblock is not corrupted?
> If filesystem observes data from the future in the journal that is not
> expected to
> be there, or any checksumed metadata on the device, it may fail sanity checks
> that superblock or other metadata are not corrupted.
> 
> Suppose filesystem has a bug that overwrites new metadata from time N
> with old stale metadata buffer from time N-1. That would look not much different
> than log-writes replay to time N-1 over non clean device from time N.
> 
>> If not, then indeed we need to discard the journal before writing new one.
>>
> ...
> 
>>>> Am I missing something? Or do I get too poisoned by btrfs CoW?
>>>>
>>>
>>> I'd be very surprised if btrfs cannot be flipped by seeing stale data "from
>>> the future" in the block device. Seems to me like the entire concept of
>>> CoW and metadata checksums is completely subverted by the existence
>>> of correct checksums on "stale metadata from the future".
>>
>> It seems that metadata CoW makes it impossible to see future data.
>>
>> All btree trees get updated CoW, so no metadata will be overwritten
>> during one transaction.
>> Only super block is overwritten and normally superblock is updated
>> atomically.
>>
>> So either old superblock is still here, all we can see is old tree pointers.
>> Or new superblock is here, all we can see is new tree pointers.
>> And new metadata will never be written into old metadata, there is no
>> way to see future metadata.
>>
> 
> Those assumptions could fail if you have unreliable hardware that
> reorders IO across FUA, just drops IO on the floor or a bug in the filesystem
> or block layer.

Well, if hardware has problems, we can't really do anything to help.

Maybe that's the reason why there are more corruption reports for btrfs, as
there is more problematic hardware than we thought?

> 
> Existence of metadata from the future could look like any of
> the above has happened.

Btrfs has an extra layer to prevent such problems from happening: each
metadata pointer records its expected generation.

If a metadata block's generation mismatches what its parent expects, the
kernel will know something went wrong.

And in fact, that's the most common failure mode for btrfs (although most
of those cases involve seeing too-old metadata), and we're looking into
the problem (but not much progress yet).
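That generation check can be sketched with a toy model (illustrative
Python only, not btrfs code; the dict layout and field names are made up
for the example):

```python
# Toy sketch of a btrfs-style generation check: a parent pointer records
# the transaction generation it expects the child block to carry, and a
# mismatch is flagged instead of silently trusting the (checksummed but
# stale) block.

def read_child(parent_ptr, blocks):
    """Follow a parent pointer and validate the child's generation."""
    block = blocks[parent_ptr["addr"]]
    if block["generation"] != parent_ptr["expected_gen"]:
        raise ValueError("parent/child generation mismatch")
    return block

blocks = {100: {"generation": 7, "data": "leaf"}}
ptr_ok = {"addr": 100, "expected_gen": 7}
ptr_stale = {"addr": 100, "expected_gen": 8}   # expects a newer copy

assert read_child(ptr_ok, blocks)["data"] == "leaf"

caught = False
try:
    read_child(ptr_stale, blocks)              # sees too-old metadata
except ValueError:
    caught = True
assert caught
```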

Thanks,
Qu

> 
> Thanks,
> Amir.
>
Amir Goldstein Feb. 27, 2019, 5:58 a.m. UTC | #9
> >> It seems that metadata CoW makes it impossible to see future data.
> >>
> >> All btree trees get updated CoW, so no metadata will be overwritten
> >> during one transaction.
> >> Only super block is overwritten and normally superblock is updated
> >> atomically.
> >>
> >> So either old superblock is still here, all we can see is old tree pointers.
> >> Or new superblock is here, all we can see is new tree pointers.
> >> And new metadata will never be written into old metadata, there is no
> >> way to see future metadata.
> >>
> >
> > Those assumptions could fail if you have unreliable hardware that
> > reorders IO across FUA, just drops IO on the floor or a bug in the filesystem
> > or block layer.
>
> Well, if the hardware has problems, we can't really do anything to help.
> 
> Maybe that's the reason there are more corruption reports for btrfs: there
> is more problematic hardware out there than we thought?
>

I think we are not understanding each other.

dm-log-writes and xfstests are not expected to test unreliable hardware.
Filesystems are not expected to work properly on unreliable hardware, but
they do have sanity checks that try to detect metadata corruptions that
could be the result of an unreliable hardware/storage stack and put up a
big red sign.

If we don't wipe the block device before a log-writes replay, the
filesystem trips over those sanity checks, because a replay to time N-1
without wiping the writes from time N looks awfully similar to a flaky
hardware/storage stack.
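A minimal sketch of that failure mode (illustrative Python; the log-entry
representation and mark names are invented and bear no relation to the
real dm-log-writes on-disk format):

```python
# Why replaying to an earlier mark without wiping leaves "writes from
# the future" on the device: replay only re-applies writes up to the
# target mark, so blocks touched after that mark keep their later
# contents unless the device is cleared first.

def restore_to_mark(log, mark, device, wipe_first=False):
    """Replay recorded writes in order until the named mark is reached."""
    if wipe_first:
        device.clear()                      # emulate zeroing the device
    for entry in log:
        if entry[0] == "mark":
            if entry[1] == mark:
                break
        else:
            _, off, data = entry
            device[off] = data
    return device

log = [("write", 0, "A1"), ("mark", "N-1"),
       ("write", 0, "A2"), ("write", 8, "B1"), ("mark", "N")]

# Full replay to mark N, then jump back to mark N-1 without a wipe:
dev = restore_to_mark(log, "N", {})
dev = restore_to_mark(log, "N-1", dev)
assert dev == {0: "A1", 8: "B1"}   # block 8 survives from the future

# With a wipe first, the device matches what existed at mark N-1:
dev = restore_to_mark(log, "N-1", dev, wipe_first=True)
assert dev == {0: "A1"}
```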

> >
> > Existence of metadata from the future could look like any of
> > the above has happened.
>
> Btrfs has an extra layer to prevent such problems from happening: each
> metadata pointer records its expected generation.
>
> If a metadata block's generation mismatches what its parent expects, the
> kernel will know something went wrong.
>
> And in fact, that's the most common failure mode for btrfs (although most
> of those cases involve seeing too-old metadata), and we're looking into
> the problem (but not much progress yet).
>

Yes, and that is exactly the reason why dm-log-writes has the potential
to generate false positive sanity check failures. Granted, I do not know
the inner workings of btrfs CoW well enough to point out a specific
problematic case.

Anyway, we do need to wipe the block device before log replay, regardless
of whether btrfs needs it or not.

Thanks,
Amir.
Dave Chinner Feb. 27, 2019, 6:15 a.m. UTC | #10
On Wed, Feb 27, 2019 at 06:49:56AM +0200, Amir Goldstein wrote:
> On Wed, Feb 27, 2019 at 6:19 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >
> >
> >
> > On 2019/2/27 下午12:06, Amir Goldstein wrote:
> > > On Wed, Feb 27, 2019 at 1:22 AM Dave Chinner <david@fromorbit.com> wrote:
> > >>
> > >> On Tue, Feb 26, 2019 at 11:10:02PM +0200, Amir Goldstein wrote:
> > >>> On Tue, Feb 26, 2019 at 8:14 PM Brian Foster <bfoster@redhat.com> wrote:
> > >>>>
> > >>>> The dm-log-writes mechanism runs a workload against a filesystem,
> > >>>> tracks underlying FUAs and restores the filesystem to various points
> > >>>> in time based on FUA marks. This allows fstests to check fs
> > >>>> consistency at various points and verify log recovery works as
> > >>>> expected.
> > >>>>
> > >>>
> > >>> Inaccurate. generic/482 restores to FUA points.
> > >>> generic/45[57] restore to user defined points in time (marks).
> > >>> dm-log-writes mechanism is capable of restoring either.
> > >>>
> > >>>> This mechanism does not play well with LSN based log recovery
> > >>>> ordering behavior on XFS v5 superblocks, however. For example,
> > >>>> generic/482 can reproduce false positive corruptions based on extent
> > >>>> to btree conversion of an inode if the inode and associated btree
> > >>>> block are written back after different checkpoints. Even though both
> > >>>> items are logged correctly in the extent-to-btree transaction, the
> > >>>> btree block can be relogged (multiple times) and only written back
> > >>>> once when the filesystem unmounts. If the inode was written back
> > >>>> after the initial conversion, recovery points between that mark and
> > >>>> when the btree block is ultimately written back will show corruption
> > >>>> because log recovery sees that the destination buffer is newer than
> > >>>> the recovered buffer and intentionally skips the buffer. This is a
> > >>>> false positive because the destination buffer was resiliently
> > >>>> written back after being physically relogged one or more times.
> > >>>>
> > >>>
> > >>> This story doesn't add up.
> > >>> Either dm-log-writes emulated power failure correctly or it doesn't.
> > >>> My understanding is that the issue you are seeing is a result of
> > >>> XFS seeing "data from the future" after a restore of a power failure
> > >>> snapshot, because the scratch device is not a clean slate.
> > >>> If I am right, then the correct solution is to wipe the journal before
> > >>> starting to replay restore points.
> > >>
> > >> If that is the problem, then I think we should be wiping the entire
> > >> block device before replaying the recorded logwrite.
> > >>
> > >
> > > Indeed.
> >
> > May I ask a stupid question?
> >
> > How does it matter whether the device is clean or not?
> > Shouldn't the journal/metadata or whatever be self-contained?
> >
> 
> Yes and no.
> 
> The simplest example (not limited to xfs, and I am not sure it works like
> that in xfs) is how you find the last valid journal commit entry. It should
> have a correct CRC and the largest LSN. But if you replay IO on top of an
> existing journal without wiping it first, then journal recovery will
> continue past the point you meant to replay to, or worse.

No, that's not the problem we have with XFS. The problem is that XFS
will not recover the changes in the log if the object on disk it
would recover into is more recent than the information found in the
log. i.e. it's already been written back and so the journal entry
does not need to be replayed.

IOWs, if the block device is not wiped, the first read of a piece
of a newly allocated and modified metadata object in the log will
see the future state of the object, not whatever was there when the
block was first allocated.

i.e.
			in memory	in journal	on disk
initial contents	0000		n/a		0000
allocate,		0001		n/a		0000
modify, checkpoint 1	0002		0002		0000
modify, checkpoint 2	0003		0003		0000
modify, checkpoint 3	0004		0004		0000
write back		0004		0004		0004
checkpoint 4		0004		n/a		0004

Now when we replay up to checkpoint 1, log recovery will read the
object from disk. If the disk has been zeroed before we replay, the
read in log recovery will see 0000 and replay 0002 over the top,
and all will be good. However, if the device hasn't been zeroed,
recovery will read "0004", which is more recent than 0002, and it
will not replay 0002 because it knows there are future changes to
that object in the journal that will be replayed.

IOWs, stale metadata (from the future) prevents log recovery from
replaying the objects it should be replaying.
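The skip decision in that table can be sketched with a toy model
(illustrative Python, not actual XFS log recovery code; the data
structures are invented for the example):

```python
# Toy model of LSN-ordered recovery: a journalled change is only
# replayed if its LSN is newer than the LSN already stamped on the
# on-disk copy of the same object.  Blank (zeroed) blocks carry no
# LSN, so they always accept the recovered contents.

def replay(journal, disk):
    """Replay journal entries (block -> (lsn, data)) onto disk in order."""
    for block, (lsn, data) in journal.items():
        on_disk = disk.get(block)
        if on_disk is None or on_disk[0] < lsn:
            disk[block] = (lsn, data)
        # else: on-disk copy is "from the future" -> intentionally skipped
    return disk

# Journal contents as of checkpoint 1: btree block logged with LSN 2.
journal = {"btree_block": (2, "0002")}

# Case 1: replay onto a zeroed device -> the change is recovered.
clean = replay(journal, {})
assert clean["btree_block"] == (2, "0002")

# Case 2: replay onto a device still holding the final writeback
# (LSN 4) from later in the recorded log -> recovery skips the buffer,
# leaving a mix of states that never coexisted at checkpoint 1.
stale = replay(journal, {"btree_block": (4, "0004")})
assert stale["btree_block"] == (4, "0004")
```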

> The problem that Brian describes is more complicated than that and, IIUC,
> not limited to the data in the journal, but I think what I described above
> may also plague ext4 and xfs v4.

It will affect xfs v4, but we can't detect it at all because we
don't have sequence numbers in the metadata. ext4 is in the same
boat as xfs v4, while btrfs is like XFS v5 with transaction
identifiers in the metadata to indicate when it was written...

> > This "discard everything" assumption doesn't look right to me.
> > Although most mkfs implementations would discard at least part of the
> > device, even without discarding, the newly created fs should be
> > self-contained; no wild pointer should point to garbage.

If the block device is in an uninitialised state when we start,
then all bets are off - it is not a "self contained" test because
the initial state is completely unknown. We need to zero the device so
that filesystem creation and each replay that follows start from the
same initial conditions.

> > Am I missing something? Or do I get too poisoned by btrfs CoW?
> 
> I'd be very surprised if btrfs cannot be flipped by seeing stale data "from
> the future" in the block device. Seems to me like the entire concept of
> CoW and metadata checksums is completely subverted by the existence
> of correct checksums on "stale metadata from the future".

No, this is a journal recovery issue - recovery is a
read-modify-write operation and so if the contents that are read are
stale in a specific way we can be exposed to problems like this.
btrfs is not a journalling filesystem, so it shouldn't be doing RMW
cycles on metadata to bring it into a consistent state during
recovery - it should be doing atomic updates of the tree root to
switch from one consistent state on disk to the next....

Cheers,

Dave.
Brian Foster Feb. 27, 2019, 1:18 p.m. UTC | #11
On Tue, Feb 26, 2019 at 11:10:02PM +0200, Amir Goldstein wrote:
> On Tue, Feb 26, 2019 at 8:14 PM Brian Foster <bfoster@redhat.com> wrote:
> >
> > The dm-log-writes mechanism runs a workload against a filesystem,
> > tracks underlying FUAs and restores the filesystem to various points
> > in time based on FUA marks. This allows fstests to check fs
> > consistency at various points and verify log recovery works as
> > expected.
> >
> 
> Inaccurate. generic/482 restores to FUA points.
> generic/45[57] restore to user defined points in time (marks).
> dm-log-writes mechanism is capable of restoring either.
> 

The above is poorly worded. I'm aware of the separate tests and I've
used the mechanism to bounce around to various marks. Note that my
understanding of the mechanism beyond that is rudimentary. I'll reword
this if the patch survives, but it sounds like there may be opportunity
to fix the mechanism, which clearly would be ideal.

> > This mechanism does not play well with LSN based log recovery
> > ordering behavior on XFS v5 superblocks, however. For example,
> > generic/482 can reproduce false positive corruptions based on extent
> > to btree conversion of an inode if the inode and associated btree
> > block are written back after different checkpoints. Even though both
> > items are logged correctly in the extent-to-btree transaction, the
> > btree block can be relogged (multiple times) and only written back
> > once when the filesystem unmounts. If the inode was written back
> > after the initial conversion, recovery points between that mark and
> > when the btree block is ultimately written back will show corruption
> > because log recovery sees that the destination buffer is newer than
> > the recovered buffer and intentionally skips the buffer. This is a
> > false positive because the destination buffer was resiliently
> > written back after being physically relogged one or more times.
> >
> 
> This story doesn't add up.
> Either dm-log-writes emulated power failure correctly or it doesn't.

It doesn't. It leaves the log and broader filesystem in a state that
makes no sense with respect to a power failure.

> My understanding is that the issue you are seeing is a result of
> XFS seeing "data from the future" after a restore of a power failure
> snapshot, because the scratch device is not a clean slate.
> If I am right, then the correct solution is to wipe the journal before
> starting to replay restore points.
> 
> Am I misunderstanding whats going on?
> 

Slightly. Wiping the journal will not help. I _think_ that a wipe of the
broader filesystem before recovering from the initial fua and replaying
in order from there would mitigate the problem. Is there an easy way to
test that theory? For example, would a mkfs of the scratch device before
the replay sequence of generic/482 begins allow the test to still
otherwise function correctly?

I was going to elaborate further on the sequence of events, but I see
Dave has already nicely described this generically in his most recent
reply.

> IIRC, some of Josef's earlier versions used dm snapshots to restore
> the blockdev to a clean state before replying log-writes.
> I think that one of the earlier versions of generic/482 also took
> that approach, but that resulted in longer test runtime (not sure).
> 
> > Update the dm-log-writes require checks to enforce v4 superblocks
> > when running against XFS and skip the test otherwise.
> 
> You might as well disable dm-log-writes test for XFS completely.
> Who cares about v4 superblocks these days?
> We need a tool to make sure the NEW features are crash resilient.
> 
> dm-log-writes proved itself to be a powerful generic test tool that found
> some serious crash consistency bugs in every one of the major filesystems
> and it found bugs with XFS reflink log recovery as well, so IMO
> disabling dm-log-writes for v5 would be "very unwise!".
> 

Thanks for the insight

Brian

> Thanks,
> Amir.
Brian Foster Feb. 27, 2019, 1:23 p.m. UTC | #12
On Wed, Feb 27, 2019 at 06:06:57AM +0200, Amir Goldstein wrote:
> On Wed, Feb 27, 2019 at 1:22 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Tue, Feb 26, 2019 at 11:10:02PM +0200, Amir Goldstein wrote:
> > > On Tue, Feb 26, 2019 at 8:14 PM Brian Foster <bfoster@redhat.com> wrote:
> > > >
> > > > The dm-log-writes mechanism runs a workload against a filesystem,
> > > > tracks underlying FUAs and restores the filesystem to various points
> > > > in time based on FUA marks. This allows fstests to check fs
> > > > consistency at various points and verify log recovery works as
> > > > expected.
> > > >
> > >
> > > Inaccurate. generic/482 restores to FUA points.
> > > generic/45[57] restore to user defined points in time (marks).
> > > dm-log-writes mechanism is capable of restoring either.
> > >
> > > > This mechanism does not play well with LSN based log recovery
> > > > ordering behavior on XFS v5 superblocks, however. For example,
> > > > generic/482 can reproduce false positive corruptions based on extent
> > > > to btree conversion of an inode if the inode and associated btree
> > > > block are written back after different checkpoints. Even though both
> > > > items are logged correctly in the extent-to-btree transaction, the
> > > > btree block can be relogged (multiple times) and only written back
> > > > once when the filesystem unmounts. If the inode was written back
> > > > after the initial conversion, recovery points between that mark and
> > > > when the btree block is ultimately written back will show corruption
> > > > because log recovery sees that the destination buffer is newer than
> > > > the recovered buffer and intentionally skips the buffer. This is a
> > > > false positive because the destination buffer was resiliently
> > > > written back after being physically relogged one or more times.
> > > >
> > >
> > > This story doesn't add up.
> > > Either dm-log-writes emulated power failure correctly or it doesn't.
> > > My understanding is that the issue you are seeing is a result of
> > > XFS seeing "data from the future" after a restore of a power failure
> > > snapshot, because the scratch device is not a clean slate.
> > > If I am right, then the correct solution is to wipe the journal before
> > > starting to replay restore points.
> >
> > If that is the problem, then I think we should be wiping the entire
> > block device before replaying the recorded logwrite.
> >
> 
> Indeed.
> 
> > i.e. this sounds like a "block device we are replaying onto has
> > stale data in it" problem because we are replaying the same
> > filesystem over the top of itself.  Hence there are no unique
> > identifiers in the metadata that can detect stale metadata in
> > the block device.
> >
> > I'm surprised that we haven't tripped over this much earlier that
> > this...
> >
> 
> I remember asking myself the same thing... it's coming back to me
> now. I really remember having this discussion during test review.
> generic/482 is an adaptation of Josef's test script [1], which
> does log recovery onto a snapshot on every FUA checkpoint.
> 
> [1] https://github.com/josefbacik/log-writes/blob/master/replay-fsck-wrapper.sh
> 
> Setting up snapshots for every checkpoint was found empirically to take
> more test runtime than replaying the log from the start for each checkpoint.
> That observation was limited to the systems that Qu and Eryu tested on.
> 
> IIRC, what usually took care of cleaning the block device is replaying the
> "discard everything" IO from mkfs time. The dm-log-writes driver should take
> care of zeroing blocks upon replaying a discard IO even on a target device
> that doesn't support discard, but maybe if the original device doesn't
> support discard, the IO is not recorded at all and therefore not replayed?
> 
> Brian, can you check if your log-writes stream contains a discard IO for
> the entire block device? I do remember that the initial log-writes tests worked
> very reliably on my laptop with SSD and were a bit flaky on another test machine
> with spinning rust, but that machine had other hardware reliability issues at
> the time (bad SATA cable) so I attributed all issues to that problem.
> 

FYI, the command from your other mail on a logwrites dev that
demonstrates this problem shows the following:

# ./src/log-writes/replay-log -vv --find --end-mark mkfs --log /dev/test/tmp | grep DISCARD
seek entry 0@2: 0, size 8388607, flags 0x4(DISCARD)
seek entry 1@3: 8388607, size 8388607, flags 0x4(DISCARD)
seek entry 2@4: 16777214, size 8388607, flags 0x4(DISCARD)
seek entry 3@5: 25165821, size 6291459, flags 0x4(DISCARD)

... which appears to cover the entire device. Is the intention that this
should wipe the scratch device?
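As a quick sanity check (illustrative Python; the numbers are copied from
the replay-log output above and are in sectors), those four entries do
tile the device contiguously from sector 0:

```python
# Verify the DISCARD entries cover the device back-to-back with no
# gaps or overlaps: each entry must start exactly where the previous
# one ended.

entries = [(0, 8388607), (8388607, 8388607),
           (16777214, 8388607), (25165821, 6291459)]

expected_start = 0
for start, size in entries:
    assert start == expected_start, "gap or overlap in discard coverage"
    expected_start = start + size

print("contiguous, total sectors:", expected_start)  # 31457280
```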

Brian

> BTW, looking closer at generic/482, $prev does not seem to be used at all
> in the replay loop.
> 
> > > > Update the dm-log-writes require checks to enforce v4 superblocks
> > > > when running against XFS and skip the test otherwise.
> > >
> > > You might as well disable dm-log-writes test for XFS completely.
> > > Who cares about v4 superblocks these days?
> >
> > Enough of the inflammatory hyperbole, Amir. Statements like this
> > serve no useful purpose.
> >
> 
> <deep breath> Agreed.
> 
> Thanks,
> Amir.
Brian Foster Feb. 27, 2019, 2:17 p.m. UTC | #13
On Wed, Feb 27, 2019 at 08:18:39AM -0500, Brian Foster wrote:
> On Tue, Feb 26, 2019 at 11:10:02PM +0200, Amir Goldstein wrote:
> > On Tue, Feb 26, 2019 at 8:14 PM Brian Foster <bfoster@redhat.com> wrote:
> > >
> > > The dm-log-writes mechanism runs a workload against a filesystem,
> > > tracks underlying FUAs and restores the filesystem to various points
> > > in time based on FUA marks. This allows fstests to check fs
> > > consistency at various points and verify log recovery works as
> > > expected.
> > >
> > 
> > Inaccurate. generic/482 restores to FUA points.
> > generic/45[57] restore to user defined points in time (marks).
> > dm-log-writes mechanism is capable of restoring either.
> > 
> 
> The above is poorly worded. I'm aware of the separate tests and I've
> used the mechanism to bounce around to various marks. Note that my
> understanding of the mechanism beyond that is rudimentary. I'll reword
> this if the patch survives, but it sounds like there may be opportunity
> to fix the mechanism, which clearly would be ideal.
> 
> > > This mechanism does not play well with LSN based log recovery
> > > ordering behavior on XFS v5 superblocks, however. For example,
> > > generic/482 can reproduce false positive corruptions based on extent
> > > to btree conversion of an inode if the inode and associated btree
> > > block are written back after different checkpoints. Even though both
> > > items are logged correctly in the extent-to-btree transaction, the
> > > btree block can be relogged (multiple times) and only written back
> > > once when the filesystem unmounts. If the inode was written back
> > > after the initial conversion, recovery points between that mark and
> > > when the btree block is ultimately written back will show corruption
> > > because log recovery sees that the destination buffer is newer than
> > > the recovered buffer and intentionally skips the buffer. This is a
> > > false positive because the destination buffer was resiliently
> > > written back after being physically relogged one or more times.
> > >
> > 
> > This story doesn't add up.
> > Either dm-log-writes emulated power failure correctly or it doesn't.
> 
> It doesn't. It leaves the log and broader filesystem in a state that
> makes no sense with respect to a power failure.
> 
> > My understanding is that the issue you are seeing is a result of
> > XFS seeing "data from the future" after a restore of a power failure
> > snapshot, because the scratch device is not a clean slate.
> > If I am right, then the correct solution is to wipe the journal before
> > starting to replay restore points.
> > 
> > Am I misunderstanding whats going on?
> > 
> 
> Slightly. Wiping the journal will not help. I _think_ that a wipe of the
> broader filesystem before recovering from the initial fua and replaying
> in order from there would mitigate the problem. Is there an easy way to
> test that theory? For example, would a mkfs of the scratch device before
> the replay sequence of generic/482 begins allow the test to still
> otherwise function correctly?
> 

FYI, I gave this a try and it didn't ultimately work because mkfs didn't
clear the device either. I ended up reproducing the problem, physically
zeroing the device, replaying the associated FUA and observing the
problem go away. From there, if I replay to the final FUA mark and go
back to the (originally) problematic FUA, the problem is reintroduced.

Brian

> I was going to elaborate further on the sequence of events, but I see
> Dave has already nicely described this generically in his most recent
> reply.
> 
> > IIRC, some of Josef's earlier versions used dm snapshots to restore
> > the blockdev to a clean state before replying log-writes.
> > I think that one of the earlier versions of generic/482 also took
> > that approach, but that resulted in longer test runtime (not sure).
> > 
> > > Update the dm-log-writes require checks to enforce v4 superblocks
> > > when running against XFS and skip the test otherwise.
> > 
> > You might as well disable dm-log-writes test for XFS completely.
> > Who cares about v4 superblocks these days?
> > We need a tool to make sure the NEW features are crash resilient.
> > 
> > dm-log-writes proved itself to be a powerful generic test tool that found
> > some serious crash consistency bugs in every one of the major filesystems
> > and it found bugs with XFS reflink log recovery as well, so IMO
> > disabling dm-log-writes for v5 would be "very unwise!".
> > 
> 
> Thanks for the insight
> 
> Brian
> 
> > Thanks,
> > Amir.
Josef Bacik Feb. 27, 2019, 3:54 p.m. UTC | #14
On Wed, Feb 27, 2019 at 09:17:32AM -0500, Brian Foster wrote:
> On Wed, Feb 27, 2019 at 08:18:39AM -0500, Brian Foster wrote:
> > On Tue, Feb 26, 2019 at 11:10:02PM +0200, Amir Goldstein wrote:
> > > On Tue, Feb 26, 2019 at 8:14 PM Brian Foster <bfoster@redhat.com> wrote:
> > > >
> > > > The dm-log-writes mechanism runs a workload against a filesystem,
> > > > tracks underlying FUAs and restores the filesystem to various points
> > > > in time based on FUA marks. This allows fstests to check fs
> > > > consistency at various points and verify log recovery works as
> > > > expected.
> > > >
> > > 
> > > Inaccurate. generic/482 restores to FUA points.
> > > generic/45[57] restore to user defined points in time (marks).
> > > dm-log-writes mechanism is capable of restoring either.
> > > 
> > 
> > The above is poorly worded. I'm aware of the separate tests and I've
> > used the mechanism to bounce around to various marks. Note that my
> > understanding of the mechanism beyond that is rudimentary. I'll reword
> > this if the patch survives, but it sounds like there may be opportunity
> > to fix the mechanism, which clearly would be ideal.
> > 
> > > > This mechanism does not play well with LSN based log recovery
> > > > ordering behavior on XFS v5 superblocks, however. For example,
> > > > generic/482 can reproduce false positive corruptions based on extent
> > > > to btree conversion of an inode if the inode and associated btree
> > > > block are written back after different checkpoints. Even though both
> > > > items are logged correctly in the extent-to-btree transaction, the
> > > > btree block can be relogged (multiple times) and only written back
> > > > once when the filesystem unmounts. If the inode was written back
> > > > after the initial conversion, recovery points between that mark and
> > > > when the btree block is ultimately written back will show corruption
> > > > because log recovery sees that the destination buffer is newer than
> > > > the recovered buffer and intentionally skips the buffer. This is a
> > > > false positive because the destination buffer was resiliently
> > > > written back after being physically relogged one or more times.
> > > >
> > > 
> > > This story doesn't add up.
> > > Either dm-log-writes emulated power failure correctly or it doesn't.
> > 
> > It doesn't. It leaves the log and broader filesystem in a state that
> > makes no sense with respect to a power failure.
> > 
> > > My understanding is that the issue you are seeing is a result of
> > > XFS seeing "data from the future" after a restore of a power failure
> > > snapshot, because the scratch device is not a clean slate.
> > > If I am right, then the correct solution is to wipe the journal before
> > > starting to replay restore points.
> > > 
> > > Am I misunderstanding whats going on?
> > > 
> > 
> > Slightly. Wiping the journal will not help. I _think_ that a wipe of the
> > broader filesystem before recovering from the initial fua and replaying
> > in order from there would mitigate the problem. Is there an easy way to
> > test that theory? For example, would a mkfs of the scratch device before
> > the replay sequence of generic/482 begins allow the test to still
> > otherwise function correctly?
> > 
> 
> FYI, I gave this a try and it didn't ultimately work because mkfs didn't
> clear the device either. I ended up reproducing the problem, physically
> zeroing the device, replaying the associated FUA and observing the
> problem go away. From there, if I replay to the final FUA mark and go
> back to the (originally) problematic FUA, the problem is reintroduced.
> 

Sorry guys, whenever I run log-writes on xfs I use my helper script here

https://github.com/josefbacik/log-writes

specifically replay-individual-faster.sh.  This creates a snapshot at every
replay point, mounts and checks the fs, and then destroys the snapshot and keeps
going.  This way you don't end up with the "new" data still being on the device.
It's not super fast, but this is usually a fire and forget sort of thing.  I
could probably integrate this into xfstests for our log-writes tests, those tend
to not generate large logs so wouldn't take super long.  Does this fix the
problem for you Brian?

Josef
Amir Goldstein Feb. 27, 2019, 5:11 p.m. UTC | #15
> > FYI, I gave this a try and it didn't ultimately work because mkfs didn't
> > clear the device either. I ended up reproducing the problem, physically
> > zeroing the device, replaying the associated FUA and observing the
> > problem go away. From there, if I replay to the final FUA mark and go
> > back to the (originally) problematic FUA, the problem is reintroduced.
> >
>
> Sorry guys, whenever I run log-writes on xfs I use my helper script here
>
> https://github.com/josefbacik/log-writes
>
> specifically replay-individual-faster.sh.  This creates a snapshot at every
> replay point, mounts and checks the fs, and then destroys the snapshot and keeps
> going.  This way you don't end up with the "new" data still being on the device.
> It's not super fast, but this is usually a fire and forget sort of thing.  I
> could probably integrate this into xfstests for our log-writes tests, those tend
> to not generate large logs so wouldn't take super long.  Does this fix the
> problem for you Brian?
>

Chronicles of generic/482:
- [RFC] version used dm snapshot and reported that XFS survived the test:
https://marc.info/?l=fstests&m=152101818830823&w=1
- Alas, this RFC abused $TEST_DEV as the snapshot CoW device.
During review, the need for the snapshot was questioned and it was
speculated that replay alone should be sufficient and faster.
- [v2] version dropped dm snapshots saying:
"Get rid of dm-snapshot which is pretty slow if we're creating and
 deleting snapshots repeatedly.
 (Maybe LVM thin provision would be much better, but current replay
 solution is good so far, and no slower than dm-snapshot)"
https://marc.info/?l=fstests&m=152221632920271&w=1
- This version already reported failures on XFS
- Dave had already identified the problem back then:
https://marc.info/?l=fstests&m=152228022208188&w=1
His conclusion was that this "implies a write ordering problem".
What we have is:
- writing in the future
- going back to the past without wiping the future writes
I guess you can call that a write ordering problem ;-)

Qu,

Will you have time to fix the test to use dm thinp snapshots?

Thanks,
Amir.
Brian Foster Feb. 27, 2019, 5:13 p.m. UTC | #16
On Wed, Feb 27, 2019 at 10:54:20AM -0500, Josef Bacik wrote:
> On Wed, Feb 27, 2019 at 09:17:32AM -0500, Brian Foster wrote:
> > On Wed, Feb 27, 2019 at 08:18:39AM -0500, Brian Foster wrote:
> > > On Tue, Feb 26, 2019 at 11:10:02PM +0200, Amir Goldstein wrote:
> > > > On Tue, Feb 26, 2019 at 8:14 PM Brian Foster <bfoster@redhat.com> wrote:
> > > > >
> > > > > The dm-log-writes mechanism runs a workload against a filesystem,
> > > > > tracks underlying FUAs and restores the filesystem to various points
> > > > > in time based on FUA marks. This allows fstests to check fs
> > > > > consistency at various points and verify log recovery works as
> > > > > expected.
> > > > >
> > > > 
> > > > Inaccurate. generic/482 restores to FUA points.
> > > > generic/45[57] restore to user defined points in time (marks).
> > > > dm-log-writes mechanism is capable of restoring either.
> > > > 
> > > 
> > > The above is poorly worded. I'm aware of the separate tests and I've
> > > used the mechanism to bounce around to various marks. Note that my
> > > understanding of the mechanism beyond that is rudimentary. I'll reword
> > > this if the patch survives, but it sounds like there may be opportunity
> > > to fix the mechanism, which clearly would be ideal.
> > > 
> > > > > This mechanism does not play well with LSN based log recovery
> > > > > ordering behavior on XFS v5 superblocks, however. For example,
> > > > > generic/482 can reproduce false positive corruptions based on extent
> > > > > to btree conversion of an inode if the inode and associated btree
> > > > > block are written back after different checkpoints. Even though both
> > > > > items are logged correctly in the extent-to-btree transaction, the
> > > > > btree block can be relogged (multiple times) and only written back
> > > > > once when the filesystem unmounts. If the inode was written back
> > > > > after the initial conversion, recovery points between that mark and
> > > > > when the btree block is ultimately written back will show corruption
> > > > > because log recovery sees that the destination buffer is newer than
> > > > > the recovered buffer and intentionally skips the buffer. This is a
> > > > > false positive because the destination buffer was resiliently
> > > > > written back after being physically relogged one or more times.
> > > > >
> > > > 
> > > > This story doesn't add up.
> > > > Either dm-log-writes emulated power failure correctly or it doesn't.
> > > 
> > > It doesn't. It leaves the log and broader filesystem in a state that
> > > makes no sense with respect to a power failure.
> > > 
> > > > My understanding is that the issue you are seeing is a result of
> > > > XFS seeing "data from the future" after a restore of a power failure
> > > > snapshot, because the scratch device is not a clean slate.
> > > > If I am right, then the correct solution is to wipe the journal before
> > > > starting to replay restore points.
> > > > 
> > > > Am I misunderstanding what's going on?
> > > > 
> > > 
> > > Slightly. Wiping the journal will not help. I _think_ that a wipe of the
> > > broader filesystem before recovering from the initial fua and replaying
> > > in order from there would mitigate the problem. Is there an easy way to
> > > test that theory? For example, would a mkfs of the scratch device before
> > > the replay sequence of generic/482 begins allow the test to still
> > > otherwise function correctly?
> > > 
> > 
> > FYI, I gave this a try and it didn't ultimately work because mkfs didn't
> > clear the device either. I ended up reproducing the problem, physically
> > zeroing the device, replaying the associated FUA and observing the
> > problem go away. From there, if I replay to the final FUA mark and go
> > back to the (originally) problematic FUA, the problem is reintroduced.
> > 
> 
> Sorry guys, whenever I run log-writes on xfs I use my helper script here
> 
> https://github.com/josefbacik/log-writes
> 
> specifically replay-individual-faster.sh.  This creates a snapshot at every
> replay point, mounts and checks the fs, and then destroys the snapshot and keeps
> going.  This way you don't end up with the "new" data still being on the device.
> It's not super fast, but this is usually a fire and forget sort of thing.  I
> could probably integrate this into xfstests for our log-writes tests, those tend
> to not generate large logs so wouldn't take super long.  Does this fix the
> problem for you Brian?
> 

Thanks Josef. At a glance at that script I'm not quite following how
this fits together. Are you taking a snapshot of the original device
before the workload being tested is run against dm-log-writes, then
replaying on top of that? In general, anything that puts the device back
into the state from before the workload ran and replays from there
should be enough to fix the problem I think. As long as a replay
sequence runs in order, I don't think snapshots of each replay point
should technically be necessary (vs. a single replay snapshot, e.g.).

Note that I ended up testing generic/482 with a loop device for a
scratch device and that also worked around the problem, I suspect
because that allowed the aforementioned discards to actually reset the
underlying device via loop discard->hole punch (which doesn't occur
with my original, non-loop scratch dev). So it seems that another option
for more deterministic behavior in fstests could be to just enforce the
use of a loop device as the underlying target in these particular tests.
For example, have the log-writes init create a file on the test device
and loop mount that rather than using the scratch device. Just a thought
though, it's not clear to me that's any more simple than what you have
already implemented here..
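
The loop-device idea above could be sketched roughly as follows. This is only an illustration of the setup, not fstests code; the helper names, image path, and size are hypothetical:

```shell
# Hypothetical sketch: back the log-writes replay target with a loop
# device over a file on the test device, so that replay-log's discards
# turn into loop discard -> hole punch and actually zero the image.
setup_loop_scratch() {
	local img="$1" size="$2"

	# Create a sparse backing file and attach it to a free loop dev.
	truncate -s "$size" "$img"
	losetup --find --show "$img"	# prints the loop device path
}

teardown_loop_scratch() {
	local loopdev="$1" img="$2"

	losetup -d "$loopdev"
	rm -f "$img"
}
```

The printed loop device would then be used in place of SCRATCH_DEV for the duration of the test.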

Brian

> Josef
Amir Goldstein Feb. 27, 2019, 6:46 p.m. UTC | #17
> > [...]
> >
> Note that I ended up testing generic/482 with a loop device for a
> scratch device and that also worked around the problem, I suspect
> because that allowed the aforementioned discards to actually reset the
> > underlying device via loop discard->hole punch (which doesn't occur
> with my original, non-loop scratch dev). So it seems that another option

Urrgh! Take a look at src/log-writes/log-writes.c:log_discard()->zero_range().
I remembered correctly that the replay log falls back to zeroing the device
when it has no discard support, but I wasn't aware that it drops large
discard requests (> 128MB), so there we have it.

> for more deterministic behavior in fstests could be to just enforce the
> use of a loop device as the underlying target in these particular tests.
> For example, have the log-writes init create a file on the test device
> and loop mount that rather than using the scratch device. Just a thought
> though, it's not clear to me that's any more simple than what you have
> already implemented here..
>

Seems like a pretty good idea to me, but in doing that we may lose
diversity in the tested block devices. On different test machines, mkfs
and the later fsstress ops will result in slightly different io patterns.
Since generic/482 is not a deterministic test (random fsstress seed),
we cannot argue that reproducible results are better than
deterministic results.

Another option is to relax the log->max_zero_size limit of
log_discard() and allow it to zero out devices that don't support
discard, but that is likely going to cost much more runtime.
Note that this simple patch could also work for you:

--- a/src/log-writes/log-writes.c
+++ b/src/log-writes/log-writes.c
@@ -95,7 +95,7 @@ int log_discard(struct log *log, struct log_write_entry *entry)
 {
 	u64 start = le64_to_cpu(entry->sector) * log->sectorsize;
 	u64 size = le64_to_cpu(entry->nr_sectors) * log->sectorsize;
-	u64 max_chunk = 1 * 1024 * 1024 * 1024;
+	u64 max_chunk = log->max_zero_size;

---

Brian,

Can you try this patch on your test machine and say how much
worse the runtime is compared to the loop device setup?

In general, I am in favor of the simplest possible fix
and in keeping runtime shortest, so if the above fix
results in longer runtime, I would prefer to make the test always
run on loopdev as you suggested.

Thanks,
Amir.
Josef Bacik Feb. 27, 2019, 7:27 p.m. UTC | #18
On Wed, Feb 27, 2019 at 12:13:37PM -0500, Brian Foster wrote:
> [...]
>
> Thanks Josef. At a glance at that script I'm not quite following how
> this fits together. Are you taking a snapshot of the original device
> before the workload being tested is run against dm-log-writes, then
> replaying on top of that? In general, anything that puts the device back
> into the state from before the workload ran and replays from there
> should be enough to fix the problem I think. As long as a replay
> sequence runs in order, I don't think snapshots of each replay point
> should technically be necessary (vs a single replay snapshot, for e.g.).
> 

Well we do this so we can mount/unmount the fs to get the log replay to happen,
and then run fsck.  We want the snapshot so that we can roll back the changes of
the log replay.

> Note that I ended up testing generic/482 with a loop device for a
> scratch device and that also worked around the problem, I suspect
> because that allowed the aforementioned discards to actually reset the
> underlying device via loop discard->hole punch (which doesn't occur
> with my original, non-loop scratch dev). So it seems that another option
> for more deterministic behavior in fstests could be to just enforce the
> use of a loop device as the underlying target in these particular tests.
> For example, have the log-writes init create a file on the test device
> and loop mount that rather than using the scratch device. Just a thought
> though, it's not clear to me that's any more simple than what you have
> already implemented here..
> 

I'm good either way.  What I've done here isn't simple; it's more meant for my
long-running tests.  Doing loop and discarding the whole device would be cool,
but I'm afraid of punch hole bugs in the underlying file system making things
weird, or running on kernels where loop doesn't do punch hole or the fs doesn't
support punch hole.

Each solution has its drawbacks.  Adding a bunch of infrastructure to do
snapshots isn't super awesome, but may be the most flexible.  The loop solution
has the added benefit of also testing punch hole/discard ;).
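
For illustration, the snapshot-per-replay-point approach could be sketched like this. This is NOT the actual replay-individual-faster.sh script; the replay-log flag names, LVM commands, and loop control flow are shown as assumptions to make the idea concrete:

```shell
# Rough sketch: replay the write log one entry at a time, checking each
# state on a throwaway LVM snapshot so that mount/log-replay/fsck never
# dirty the replay target itself.  Flag names and termination condition
# are illustrative, not taken from the real script.
replay_and_check() {
	local log_dev="$1" vg="$2" lv="$3" mnt="$4" entry=0

	# Loop presumed to end when replay-log runs out of entries.
	while replay-log --log "$log_dev" --replay "/dev/$vg/$lv" \
			--limit 1 --start-entry "$entry"; do
		# Snapshot the current replay state and check it.
		lvcreate -s -L 1G -n checksnap "$vg/$lv"
		mount -o nouuid "/dev/$vg/checksnap" "$mnt"	# nouuid: XFS clone
		umount "$mnt"
		fsck -fn "/dev/$vg/checksnap"
		lvremove -f "$vg/checksnap"
		entry=$((entry + 1))
	done
}
```

Destroying the snapshot after each check is what keeps "new" data from ever landing on the device being replayed.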

Josef
Brian Foster Feb. 27, 2019, 8:45 p.m. UTC | #19
On Wed, Feb 27, 2019 at 08:46:42PM +0200, Amir Goldstein wrote:
> [...]
> 
> Urrgh! take a look at src/log-writes/log-writes.c:log_discard()->zero_range().
> I remembered correctly the replay log falls back to zeroing device with no
> discard support, but wasn't aware that it drops large discard requests
> (> 128MB), so there we have it.
> 

Heh, Ok.

> [...]
> 
> Brian,
> 
> Can you try this patch on you test machine and say how much
> worse the runtime is compared to the loop device setup?
> 
> In general, I am in favor of the simplest possible fix
> and in keeping runtime shortest, so if the above fix
> results in longer runtime, I would prefer to make the test always
> run on loopdev as you suggested.
> 

Note that this is a low resource VM with a spinning disk. The scratch
device is a 15G LVM volume. My last run of generic/482 was 139s (which I
think was using loop for a scratch device). With this change and back to
using the LVM volume, I ended up stopping it because I started seeing a
~160s or so delay between XFS mount cycles in the syslog (i.e., per
replay iteration). ;P

Brian

> Thanks,
> Amir.
Brian Foster Feb. 27, 2019, 8:47 p.m. UTC | #20
On Wed, Feb 27, 2019 at 02:27:49PM -0500, Josef Bacik wrote:
> [...]
> 
> I'm good either way.  What I've done here isn't simple, it's more meant for my
> long running tests.  Doing loop and discarding the whole device would be cool,
> but I'm afraid of punch hole bugs in the underlying file system making things
> weird, or running on kernels where loop doesn't do punch hole or the fs doesn't
> support punch hole.
> 

True..

> Each solution has its drawbacks.  Adding a bunch of infrastructure to do
> snapshots isn't super awesome, but may be the most flexible.  The loop solution
> has the added benefit of also testing punch hole/discard ;).
> 

As yet another option, it looks like fstests already has thin pool
infrastructure in common/dmthin. I ran the same test using a manually
created thin volume for SCRATCH_DEV instead of a loop device. I see a
performance drop, but no failures so far and the test still runs in the
low-to-mid 200s range, which seems reasonable enough to me.
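
A manually created thin volume like the one used for that run can be set up with stock LVM along these lines (volume group, names, and sizes are illustrative, not what fstests' common/dmthin does):

```shell
# Illustrative: carve a 15G thin volume out of a thin pool and use it
# as SCRATCH_DEV.  Discards against the thin LV unmap the backing
# blocks, which gives replay-log's log_discard a working discard path.
make_thin_scratch() {
	local vg="$1"

	lvcreate --type thin-pool -L 16G -n scratchpool "$vg"
	lvcreate --type thin -V 15G -n scratchthin \
		--thinpool "$vg/scratchpool"
	echo "/dev/$vg/scratchthin"	# point SCRATCH_DEV here
}
```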

Brian

> Josef

Patch
diff mbox series

diff --git a/common/dmlogwrites b/common/dmlogwrites
index b9cbae7a..f2fbb733 100644
--- a/common/dmlogwrites
+++ b/common/dmlogwrites
@@ -12,6 +12,9 @@  _require_log_writes()
 	_exclude_scratch_mount_option dax
 	_require_dm_target log-writes
 	_require_test_program "log-writes/replay-log"
+	if [ "$FSTYP" == "xfs" ]; then
+		_require_scratch_xfs_nocrc
+	fi
 }
 
 # Starting from v4.15-rc1, DAX support was added to dm-log-writes, but note
@@ -30,6 +33,9 @@  _require_log_writes_dax()
 
 	_require_dm_target log-writes
 	_require_test_program "log-writes/replay-log"
+	if [ "$FSTYP" == "xfs" ]; then
+		_require_scratch_xfs_nocrc
+	fi
 
 	local ret=0
 	_log_writes_init
diff --git a/common/xfs b/common/xfs
index 24065813..525cb9cb 100644
--- a/common/xfs
+++ b/common/xfs
@@ -240,6 +240,17 @@  _require_scratch_xfs_crc()
 	_scratch_unmount
 }
 
+# this test requires the scratch dev to be formatted as v4
+#
+_require_scratch_xfs_nocrc()
+{
+	_scratch_mkfs_xfs >/dev/null 2>&1
+	_scratch_mount >/dev/null 2>&1
+	$XFS_INFO_PROG $SCRATCH_MNT | grep -q 'crc=0' ||
+		_notrun "XFS v5 superblocks not supported by this test"
+	_scratch_unmount
+}
+
 # this test requires the finobt feature to be available in mkfs.xfs
 #
 _require_xfs_mkfs_finobt()
diff --git a/tests/generic/482 b/tests/generic/482
index 7595aa59..7870ac5b 100755
--- a/tests/generic/482
+++ b/tests/generic/482
@@ -45,7 +45,6 @@  _require_scratch
 # and we need extra device as log device
 _require_log_writes
 
-
 nr_cpus=$("$here/src/feature" -o)
 # cap nr_cpus to 8 to avoid spending too much time on hosts with many cpus
 if [ $nr_cpus -gt 8 ]; then
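
With the patch applied, the dm-log-writes tests still run against XFS when the scratch filesystem is formatted with v4 superblocks, which can be forced via fstests' MKFS_OPTIONS, e.g. (test list shown as an example, run from an fstests checkout):

```shell
# Force XFS v4 superblocks so _require_scratch_xfs_nocrc passes,
# then run the dm-log-writes tests discussed in this thread.
export MKFS_OPTIONS="-m crc=0"
if [ -x ./check ]; then
	./check generic/455 generic/457 generic/482
fi
```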