Message ID | 20190226181407.44612-1-bfoster@redhat.com (mailing list archive) |
---|---|
State | Deferred, archived |
Series | generic: skip dm-log-writes tests on XFS v5 superblock filesystems |
On Tue, Feb 26, 2019 at 8:14 PM Brian Foster <bfoster@redhat.com> wrote: > > The dm-log-writes mechanism runs a workload against a filesystem, > tracks underlying FUAs and restores the filesystem to various points > in time based on FUA marks. This allows fstests to check fs > consistency at various points and verify log recovery works as > expected. > Inaccurate. generic/482 restores to FUA points. generic/45[57] restore to user-defined points in time (marks). dm-log-writes mechanism is capable of restoring either. > This mechanism does not play well with LSN based log recovery > ordering behavior on XFS v5 superblocks, however. For example, > generic/482 can reproduce false positive corruptions based on extent > to btree conversion of an inode if the inode and associated btree > block are written back after different checkpoints. Even though both > items are logged correctly in the extent-to-btree transaction, the > btree block can be relogged (multiple times) and only written back > once when the filesystem unmounts. If the inode was written back > after the initial conversion, recovery points between that mark and > when the btree block is ultimately written back will show corruption > because log recovery sees that the destination buffer is newer than > the recovered buffer and intentionally skips the buffer. This is a > false positive because the destination buffer was resiliently > written back after being physically relogged one or more times. > This story doesn't add up. Either dm-log-writes emulated power failure correctly or it doesn't. My understanding is that the issue you are seeing is a result of XFS seeing "data from the future" after a restore of a power failure snapshot, because the scratch device is not a clean slate. If I am right, then the correct solution is to wipe the journal before starting to replay restore points. Am I misunderstanding what's going on? IIRC, some of Josef's earlier versions used dm snapshots to restore the blockdev to a clean state before replaying log-writes. I think that one of the earlier versions of generic/482 also took that approach, but that resulted in longer test runtime (not sure). > Update the dm-log-writes require checks to enforce v4 superblocks > when running against XFS and skip the test otherwise. You might as well disable dm-log-writes test for XFS completely. Who cares about v4 superblocks these days? We need a tool to make sure the NEW features are crash resilient. dm-log-writes proved itself to be a powerful generic test tool that found some serious crash consistency bugs in every one of the major filesystems and it found bugs with XFS reflink log recovery as well, so IMO disabling dm-log-writes for v5 would be "very unwise!". Thanks, Amir.
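In rough script form, the two restore styles driven by the replay-log tool look something like the following sketch (the device variables, the mark name "last_fsync" and the $entry count are placeholders, not the exact invocations the tests use):

    # Restore to a user-defined mark (generic/45[57] style); "last_fsync" is
    # a made-up mark name.
    ./src/log-writes/replay-log --log $LOGWRITES_DEV --replay $SCRATCH_DEV \
            --end-mark last_fsync

    # Restore to a FUA point (generic/482 style): replay the first $entry
    # log entries, where $entry was found by scanning the log for a FUA.
    ./src/log-writes/replay-log --log $LOGWRITES_DEV --replay $SCRATCH_DEV \
            --limit $entry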
On Tue, Feb 26, 2019 at 11:10:02PM +0200, Amir Goldstein wrote: > On Tue, Feb 26, 2019 at 8:14 PM Brian Foster <bfoster@redhat.com> wrote: > > > > The dm-log-writes mechanism runs a workload against a filesystem, > > tracks underlying FUAs and restores the filesystem to various points > > in time based on FUA marks. This allows fstests to check fs > > consistency at various points and verify log recovery works as > > expected. > > > > Inaccurate. generic/482 restores to FUA points. > generic/45[57] restore to user defined points in time (marks). > dm-log-writes mechanism is capable of restoring either. > > > This mechanism does not play well with LSN based log recovery > > ordering behavior on XFS v5 superblocks, however. For example, > > generic/482 can reproduce false positive corruptions based on extent > > to btree conversion of an inode if the inode and associated btree > > block are written back after different checkpoints. Even though both > > items are logged correctly in the extent-to-btree transaction, the > > btree block can be relogged (multiple times) and only written back > > once when the filesystem unmounts. If the inode was written back > > after the initial conversion, recovery points between that mark and > > when the btree block is ultimately written back will show corruption > > because log recovery sees that the destination buffer is newer than > > the recovered buffer and intentionally skips the buffer. This is a > > false positive because the destination buffer was resiliently > > written back after being physically relogged one or more times. > > > > This story doesn't add up. > Either dm-log-writes emulated power failure correctly or it doesn't. > My understanding is that the issue you are seeing is a result of > XFS seeing "data from the future" after a restore of a power failure > snapshot, because the scratch device is not a clean slate. > If I am right, then the correct solution is to wipe the journal before > starting to replay restore points. If that is the problem, then I think we should be wiping the entire block device before replaying the recorded logwrite. i.e. this sounds like a "block device we are replaying onto has stale data in it" problem because we are replaying the same filesystem over the top of itself. Hence there are no unique identifiers in the metadata that can detect stale metadata in the block device. I'm surprised that we haven't tripped over this much earlier that this... > > Update the dm-log-writes require checks to enforce v4 superblocks > > when running against XFS and skip the test otherwise. > > You might as well disable dm-log-writes test for XFS completely. > Who cares about v4 superblocks these days? Enough of the inflammatory hyperbole, Amir. Statements like this serve no useful purpose. Cheers, Dave.
On Wed, Feb 27, 2019 at 1:22 AM Dave Chinner <david@fromorbit.com> wrote: > > On Tue, Feb 26, 2019 at 11:10:02PM +0200, Amir Goldstein wrote: > > On Tue, Feb 26, 2019 at 8:14 PM Brian Foster <bfoster@redhat.com> wrote: > > > > > > The dm-log-writes mechanism runs a workload against a filesystem, > > > tracks underlying FUAs and restores the filesystem to various points > > > in time based on FUA marks. This allows fstests to check fs > > > consistency at various points and verify log recovery works as > > > expected. > > > > > > > Inaccurate. generic/482 restores to FUA points. > > generic/45[57] restore to user defined points in time (marks). > > dm-log-writes mechanism is capable of restoring either. > > > > > This mechanism does not play well with LSN based log recovery > > > ordering behavior on XFS v5 superblocks, however. For example, > > > generic/482 can reproduce false positive corruptions based on extent > > > to btree conversion of an inode if the inode and associated btree > > > block are written back after different checkpoints. Even though both > > > items are logged correctly in the extent-to-btree transaction, the > > > btree block can be relogged (multiple times) and only written back > > > once when the filesystem unmounts. If the inode was written back > > > after the initial conversion, recovery points between that mark and > > > when the btree block is ultimately written back will show corruption > > > because log recovery sees that the destination buffer is newer than > > > the recovered buffer and intentionally skips the buffer. This is a > > > false positive because the destination buffer was resiliently > > > written back after being physically relogged one or more times. > > > > > > > This story doesn't add up. > > Either dm-log-writes emulated power failure correctly or it doesn't. > > My understanding is that the issue you are seeing is a result of > > XFS seeing "data from the future" after a restore of a power failure > > snapshot, because the scratch device is not a clean slate. > > If I am right, then the correct solution is to wipe the journal before > > starting to replay restore points. > > If that is the problem, then I think we should be wiping the entire > block device before replaying the recorded logwrite. > Indeed. > i.e. this sounds like a "block device we are replaying onto has > stale data in it" problem because we are replaying the same > filesystem over the top of itself. Hence there are no unique > identifiers in the metadata that can detect stale metadata in > the block device. > > I'm surprised that we haven't tripped over this much earlier that > this... > I remember asking myself the same thing... it's coming back to me now. I really remember having this discussion during test review. generic/482 is an adaptation of Josef's test script [1], which does log recovery onto a snapshot on every FUA checkpoint. [1] https://github.com/josefbacik/log-writes/blob/master/replay-fsck-wrapper.sh Setting up snapshots for every checkpoint was found empirically to take more test runtime than replaying the log from the start for each checkpoint. That observation was limited to the systems that Qu and Eryu tested on. IIRC, what usually took care of cleaning the block device is replaying the "discard everything" IO from mkfs time.
dm-log-writes driver should take care of zeroing blocks upon replaying a discard IO even on a target device that doesn't support discard, but maybe if original device doesn't support discard, the IO is not recorded at all and therefore not replayed? Brian, can you check if your log-writes stream contains a discard IO for the entire block device? I do remember that the initial log-writes tests worked very reliably on my laptop with SSD and were a bit flaky on another test machine with spinning rust, but that machine had other hardware reliability issues at the time (bad SATA cable) so I attributed all issues to that problem. BTW, looking closer at generic/482, $prev does not seem to be used at all in the replay loop. > > > Update the dm-log-writes require checks to enforce v4 superblocks > > > when running against XFS and skip the test otherwise. > > > > You might as well disable dm-log-writes test for XFS completely. > > Who cares about v4 superblocks these days? > > Enough of the inflammatory hyperbole, Amir. Statements like this > serve no useful purpose. > <deep breath> Agreed. Thanks, Amir.
On 2019/2/27 下午12:06, Amir Goldstein wrote: > On Wed, Feb 27, 2019 at 1:22 AM Dave Chinner <david@fromorbit.com> wrote: >> >> On Tue, Feb 26, 2019 at 11:10:02PM +0200, Amir Goldstein wrote: >>> On Tue, Feb 26, 2019 at 8:14 PM Brian Foster <bfoster@redhat.com> wrote: >>>> >>>> The dm-log-writes mechanism runs a workload against a filesystem, >>>> tracks underlying FUAs and restores the filesystem to various points >>>> in time based on FUA marks. This allows fstests to check fs >>>> consistency at various points and verify log recovery works as >>>> expected. >>>> >>> >>> Inaccurate. generic/482 restores to FUA points. >>> generic/45[57] restore to user defined points in time (marks). >>> dm-log-writes mechanism is capable of restoring either. >>> >>>> This mechanism does not play well with LSN based log recovery >>>> ordering behavior on XFS v5 superblocks, however. For example, >>>> generic/482 can reproduce false positive corruptions based on extent >>>> to btree conversion of an inode if the inode and associated btree >>>> block are written back after different checkpoints. Even though both >>>> items are logged correctly in the extent-to-btree transaction, the >>>> btree block can be relogged (multiple times) and only written back >>>> once when the filesystem unmounts. If the inode was written back >>>> after the initial conversion, recovery points between that mark and >>>> when the btree block is ultimately written back will show corruption >>>> because log recovery sees that the destination buffer is newer than >>>> the recovered buffer and intentionally skips the buffer. This is a >>>> false positive because the destination buffer was resiliently >>>> written back after being physically relogged one or more times. >>>> >>> >>> This story doesn't add up. >>> Either dm-log-writes emulated power failure correctly or it doesn't. >>> My understanding is that the issue you are seeing is a result of >>> XFS seeing "data from the future" after a restore of a power failure >>> snapshot, because the scratch device is not a clean slate. >>> If I am right, then the correct solution is to wipe the journal before >>> starting to replay restore points. >> >> If that is the problem, then I think we should be wiping the entire >> block device before replaying the recorded logwrite. >> > > Indeed. May I ask a stupid question? How does it matter whether the device is clean or not? Shouldn't the journal/metadata or whatever be self-contained? > >> i.e. this sounds like a "block device we are replaying onto has >> stale data in it" problem because we are replaying the same >> filesystem over the top of itself. Hence there are no unique >> identifiers in the metadata that can detect stale metadata in >> the block device. >> >> I'm surprised that we haven't tripped over this much earlier that >> this... >> > > I remember asking myself the same thing... it's coming back to me > now. I really remember having this discussion during test review. > generic/482 is an adaptation of Josef's test script [1], which > does log recovery onto a snapshot on every FUA checkpoint. > > [1] https://github.com/josefbacik/log-writes/blob/master/replay-fsck-wrapper.sh > > Setting up snapshots for every checkpoint was found empirically to take > more test runtime, than replaying log from the start for each checkpoint. > That observation was limited to the systems that Qu and Eryu tested on. > > IRC, what usually took care of cleaning the block device is replaying the > "discard everything" IO from mkfs time. 
This "discard everything" assumption doesn't look right to me. Although most mkfs would discard at least part of the device, even without discarding the newly created fs should be self-contained, no wild pointer points to some garbage. I though all metadata/journal write should be self-contained even for later fs writes. Am I missing something? Or do I get too poisoned by btrfs CoW? Thanks, Qu > dm-log-writes driver should take care > of zeroing blocks upon replaying a discard IO even on a target device that > doesn't support discard, but maybe if original device doesn't support discard, > the IO is not recorded at all and therefore not replayed? > > Brian, can you check if your log-writes stream contains a discard IO for > the entire block device? I do remember that the initial log-writes tests worked > very reliably on my laptop with SSD and were a bit flaky on another test machine > with spinning rust, but that machine had other hardware reliability issues at > the time (bad SATA cable) so I attributed all issues to that problem. > > BTW, looking closer at generic/482, $prev does not seem to be used at all > in the replay loop. > >>>> Update the dm-log-writes require checks to enforce v4 superblocks >>>> when running against XFS and skip the test otherwise. >>> >>> You might as well disable dm-log-writes test for XFS completely. >>> Who cares about v4 superblocks these days? >> >> Enough of the inflammatory hyperbole, Amir. Statements like this >> serve no useful purpose. >> > > <deep breath> Agreed. > > Thanks, > Amir. >
On Wed, Feb 27, 2019 at 6:19 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > > > On 2019/2/27 下午12:06, Amir Goldstein wrote: > > On Wed, Feb 27, 2019 at 1:22 AM Dave Chinner <david@fromorbit.com> wrote: > >> > >> On Tue, Feb 26, 2019 at 11:10:02PM +0200, Amir Goldstein wrote: > >>> On Tue, Feb 26, 2019 at 8:14 PM Brian Foster <bfoster@redhat.com> wrote: > >>>> > >>>> The dm-log-writes mechanism runs a workload against a filesystem, > >>>> tracks underlying FUAs and restores the filesystem to various points > >>>> in time based on FUA marks. This allows fstests to check fs > >>>> consistency at various points and verify log recovery works as > >>>> expected. > >>>> > >>> > >>> Inaccurate. generic/482 restores to FUA points. > >>> generic/45[57] restore to user defined points in time (marks). > >>> dm-log-writes mechanism is capable of restoring either. > >>> > >>>> This mechanism does not play well with LSN based log recovery > >>>> ordering behavior on XFS v5 superblocks, however. For example, > >>>> generic/482 can reproduce false positive corruptions based on extent > >>>> to btree conversion of an inode if the inode and associated btree > >>>> block are written back after different checkpoints. Even though both > >>>> items are logged correctly in the extent-to-btree transaction, the > >>>> btree block can be relogged (multiple times) and only written back > >>>> once when the filesystem unmounts. If the inode was written back > >>>> after the initial conversion, recovery points between that mark and > >>>> when the btree block is ultimately written back will show corruption > >>>> because log recovery sees that the destination buffer is newer than > >>>> the recovered buffer and intentionally skips the buffer. This is a > >>>> false positive because the destination buffer was resiliently > >>>> written back after being physically relogged one or more times. > >>>> > >>> > >>> This story doesn't add up. > >>> Either dm-log-writes emulated power failure correctly or it doesn't. > >>> My understanding is that the issue you are seeing is a result of > >>> XFS seeing "data from the future" after a restore of a power failure > >>> snapshot, because the scratch device is not a clean slate. > >>> If I am right, then the correct solution is to wipe the journal before > >>> starting to replay restore points. > >> > >> If that is the problem, then I think we should be wiping the entire > >> block device before replaying the recorded logwrite. > >> > > > > Indeed. > > May I ask a stupid question? > > How does it matter whether the device is clean or not? > > Shouldn't the journal/metadata or whatever be self-contained? > Yes and no. The most simple example (not limited to xfs and not sure it is like that in xfs) is how you find the last valid journal commit entry. It should have correct CRC and the largest LSN. But if you replay IO on top of the existing journal without wiping it first, then journal recovery will continue past the point you meant to replay to, or worse. The problem that Brian describes is more complicated than that and not limited to the data in the journal IIUC, but I think what I described above may plague also ext4 and xfs v4. > > > >> i.e. this sounds like a "block device we are replaying onto has > >> stale data in it" problem because we are replaying the same > >> filesystem over the top of itself. Hence there are no unique > >> identifiers in the metadata that can detect stale metadata in > >> the block device.
> >> > >> I'm surprised that we haven't tripped over this much earlier that > >> this... > >> > > > > I remember asking myself the same thing... it's coming back to me > > now. I really remember having this discussion during test review. > > generic/482 is an adaptation of Josef's test script [1], which > > does log recovery onto a snapshot on every FUA checkpoint. > > > > [1] https://github.com/josefbacik/log-writes/blob/master/replay-fsck-wrapper.sh > > > > Setting up snapshots for every checkpoint was found empirically to take > > more test runtime, than replaying log from the start for each checkpoint. > > That observation was limited to the systems that Qu and Eryu tested on. > > > > IRC, what usually took care of cleaning the block device is replaying the > > "discard everything" IO from mkfs time. > > This "discard everything" assumption doesn't look right to me. > Although most mkfs would discard at least part of the device, even > without discarding the newly created fs should be self-contained, no > wild pointer points to some garbage. > It's true. We shouldn't make this assumption. That was my explanation to Dave's question, how come we didn't see this before? Here is my log-writes info from generic/482: ./src/log-writes/replay-log -vv --find --end-mark mkfs --log $LOGWRITES_DEV |grep DISCARD seek entry 0@2: 0, size 8388607, flags 0x4(DISCARD) seek entry 1@3: 8388607, size 8388607, flags 0x4(DISCARD) seek entry 2@4: 16777214, size 4194306, flags 0x4(DISCARD) > I though all metadata/journal write should be self-contained even for > later fs writes. > > Am I missing something? Or do I get too poisoned by btrfs CoW? > I'd be very surprised if btrfs cannot be flipped by seeing stale data "from the future" in the block device. Seems to me like the entire concept of CoW and metadata checksums is completely subverted by the existence of correct checksums on "stale metadata from the future". Thanks, Amir.
On 2019/2/27 下午12:49, Amir Goldstein wrote: [snip] >>> Indeed. >> >> May I ask a stupid question? >> >> How does it matter whether the device is clean or not? >> Shouldn't the journal/metadata or whatever be self-contained? >> > > Yes and no. > > The most simple example (not limited to xfs and not sure it is like that in xfs) > is how you find the last valid journal commit entry. It should have correct CRC > and the largest LSN. But it you replay IO on top of existing journal without > wiping it first, then journal recovery will continue past the point to meant to > replay or worse. The journal doesn't have some superblock like system to tell where to stop replaying? If not, then indeed we need to discard the journal before writing new one. > > The problem that Brian describes is more complicated than that and not > limited to the data in the journal IIUC, but I think what I described above > may plague also ext4 and xfs v4. > >>> >>>> i.e. this sounds like a "block device we are replaying onto has >>>> stale data in it" problem because we are replaying the same >>>> filesystem over the top of itself. Hence there are no unique >>>> identifiers in the metadata that can detect stale metadata in >>>> the block device. >>>> >>>> I'm surprised that we haven't tripped over this much earlier that >>>> this... >>>> >>> >>> I remember asking myself the same thing... it's coming back to me >>> now. I really remember having this discussion during test review. >>> generic/482 is an adaptation of Josef's test script [1], which >>> does log recovery onto a snapshot on every FUA checkpoint. >>> >>> [1] https://github.com/josefbacik/log-writes/blob/master/replay-fsck-wrapper.sh >>> >>> Setting up snapshots for every checkpoint was found empirically to take >>> more test runtime, than replaying log from the start for each checkpoint. >>> That observation was limited to the systems that Qu and Eryu tested on. >>> >>> IRC, what usually took care of cleaning the block device is replaying the >>> "discard everything" IO from mkfs time. >> >> This "discard everything" assumption doesn't look right to me. >> Although most mkfs would discard at least part of the device, even >> without discarding the newly created fs should be self-contained, no >> wild pointer points to some garbage. >> > > It's true. We shouldn't make this assumption. > That was my explanation to Dave's question, how come we didn't see > this before? > > Here is my log-writes info from generic/482: > ./src/log-writes/replay-log -vv --find --end-mark mkfs --log > $LOGWRITES_DEV |grep DISCARD > seek entry 0@2: 0, size 8388607, flags 0x4(DISCARD) > seek entry 1@3: 8388607, size 8388607, flags 0x4(DISCARD) > seek entry 2@4: 16777214, size 4194306, flags 0x4(DISCARD) > >> I though all metadata/journal write should be self-contained even for >> later fs writes. >> >> Am I missing something? Or do I get too poisoned by btrfs CoW? >> > > I'd be very surprised if btrfs cannot be flipped by seeing stale data "from > the future" in the block device. Seems to me like the entire concept of > CoW and metadata checksums is completely subverted by the existence > of correct checksums on "stale metadata from the future". It seems that metadata CoW makes it impossible to see future data. All btree trees get updated CoW, so no metadata will be overwritten during one transaction. Only super block is overwritten and normally superblock is updated atomically. So either old superblock is still here, all we can see is old tree pointers. 
Or new superblock is here, all we can see is new tree pointers. And new metadata will never be written into old metadata, there is no way to see future metadata. Thanks, Qu > > Thanks, > Amir. >
On Wed, Feb 27, 2019 at 7:01 AM Qu Wenruo <wqu@suse.de> wrote: > > > > On 2019/2/27 下午12:49, Amir Goldstein wrote: > [snip] > >>> Indeed. > >> > >> May I ask a stupid question? > >> > >> How does it matter whether the device is clean or not? > >> Shouldn't the journal/metadata or whatever be self-contained? > >> > > > > Yes and no. > > > > The most simple example (not limited to xfs and not sure it is like that in xfs) > > is how you find the last valid journal commit entry. It should have correct CRC > > and the largest LSN. But it you replay IO on top of existing journal without > > wiping it first, then journal recovery will continue past the point to meant to > > replay or worse. > > The journal doesn't have some superblock like system to tell where to > stop replaying? > Yes, it does. My statement was inaccurate. The point is: how can the filesystem trust that the journal superblock is not corrupted? If the filesystem observes data from the future in the journal that is not expected to be there, or any checksummed metadata on the device, it may fail the sanity checks that verify the superblock or other metadata are not corrupted. Suppose the filesystem has a bug that overwrites new metadata from time N with an old stale metadata buffer from time N-1. That would look not much different from a log-writes replay to time N-1 over a non-clean device from time N. > If not, then indeed we need to discard the journal before writing new one. > ... > >> Am I missing something? Or do I get too poisoned by btrfs CoW? > >> > > > > I'd be very surprised if btrfs cannot be flipped by seeing stale data "from > > the future" in the block device. Seems to me like the entire concept of > > CoW and metadata checksums is completely subverted by the existence > > of correct checksums on "stale metadata from the future". > > It seems that metadata CoW makes it impossible to see future data. > > All btree trees get updated CoW, so no metadata will be overwritten > during one transaction. > Only super block is overwritten and normally superblock is updated > atomically. > > So either old superblock is still here, all we can see is old tree pointers. > Or new superblock is here, all we can see is new tree pointers. > And new metadata will never be written into old metadata, there is no > way to see future metadata. > Those assumptions could fail if you have unreliable hardware that reorders IO across FUA or just drops IO on the floor, or if there is a bug in the filesystem or block layer. Existence of metadata from the future could look like any of the above has happened. Thanks, Amir.
On 2019/2/27 下午1:19, Amir Goldstein wrote: > On Wed, Feb 27, 2019 at 7:01 AM Qu Wenruo <wqu@suse.de> wrote: >> >> >> >> On 2019/2/27 下午12:49, Amir Goldstein wrote: >> [snip] >>>>> Indeed. >>>> >>>> May I ask a stupid question? >>>> >>>> How does it matter whether the device is clean or not? >>>> Shouldn't the journal/metadata or whatever be self-contained? >>>> >>> >>> Yes and no. >>> >>> The most simple example (not limited to xfs and not sure it is like that in xfs) >>> is how you find the last valid journal commit entry. It should have correct CRC >>> and the largest LSN. But it you replay IO on top of existing journal without >>> wiping it first, then journal recovery will continue past the point to meant to >>> replay or worse. >> >> The journal doesn't have some superblock like system to tell where to >> stop replaying? >> > > Yes, it does. My statement was inaccurate. > The point is how can filesystem trust that journal superblock is not corrupted? > If filesystem observes data from the future in the journal that is not > expected to > be there, or any checksumed metadata on the device, it may fail sanity checks > that superblock or other metadata are not corrupted. > > Suppose filesystem has a bug that overwrites new metadata from time N > with old stale metadata buffer from time N-1. That would look not much different > than log-writes replay to time N-1 over non clean device from time N. > >> If not, then indeed we need to discard the journal before writing new one. >> > ... > >>>> Am I missing something? Or do I get too poisoned by btrfs CoW? >>>> >>> >>> I'd be very surprised if btrfs cannot be flipped by seeing stale data "from >>> the future" in the block device. Seems to me like the entire concept of >>> CoW and metadata checksums is completely subverted by the existence >>> of correct checksums on "stale metadata from the future". >> >> It seems that metadata CoW makes it impossible to see future data. >> >> All btree trees get updated CoW, so no metadata will be overwritten >> during one transaction. >> Only super block is overwritten and normally superblock is updated >> atomically. >> >> So either old superblock is still here, all we can see is old tree pointers. >> Or new superblock is here, all we can see is new tree pointers. >> And new metadata will never be written into old metadata, there is no >> way to see future metadata. >> > > Those assumptions could fail if you have unreliable hardware that > reorders IO across FUA, just drops IO on the floor or a bug in the filesystem > or block layer. Well, if the hardware has a problem, we can't really do anything to help. Maybe that's the reason why there are more corruption reports for btrfs, as there is more problematic hardware than we thought? > > Existence of metadata from the future could look like any of > the above has happened. Btrfs has an extra layer to prevent such a problem from happening: each metadata pointer has its expected generation. If one metadata block has a generation that mismatches its parent, then the kernel will know something went wrong. And in fact, that's the most common failure mode for btrfs (although in most cases it is seeing too-old metadata), and we're looking into the problem (but not much progress yet). Thanks, Qu > > Thanks, > Amir. >
> >> It seems that metadata CoW makes it impossible to see future data. > >> > >> All btree trees get updated CoW, so no metadata will be overwritten > >> during one transaction. > >> Only super block is overwritten and normally superblock is updated > >> atomically. > >> > >> So either old superblock is still here, all we can see is old tree pointers. > >> Or new superblock is here, all we can see is new tree pointers. > >> And new metadata will never be written into old metadata, there is no > >> way to see future metadata. > >> > > > > Those assumptions could fail if you have unreliable hardware that > > reorders IO across FUA, just drops IO on the floor or a bug in the filesystem > > or block layer. > > Well, if hardware has problem, we can't really do anything to help. > > Maybe that's reason why there are more corruption report for btrfs as > there are more problematic hardware than we thought? > I think we are not understanding each other. dm-log-writes and xfstests are not expected to test unreliable hardware. filesystems are not expected to work properly on unreliable hardware, but they do have sanity checks to try and detect metadata corruptions that could be a result of unreliable hardware/storage stack and put up a big red sign. If we don't wipe block device before log-writes replay, filesystem trips over those sanity checks, because replay to time N-1 without wipe of writes from time N looks awfully similar to flaky hardware/storage stack. > > > > Existence of metadata from the future could look like any of > > the above has happened. > > Btrfs has an extra layer to prevent such problem from happening, each > metadata pointer has its expected generation. > > If one metadata has a mismatch generation with its parent, then kernel > will know something went wrong. > > And in fact, that's the most common failure mode for btrfs (although > most of them is seeing too old metadata), and we're looking into the > problem (but not much progress yet). > Yes, and that is exactly the reason why dm-log-writes has the potential to generate false positive sanity check failures. Granted, I do not know the inner works of btrfs CoW to point out a specific problematic use case. Anyway, we do need to wipe block device before log replay regardless if btrfs needs it or not. Thanks, Amir.
On Wed, Feb 27, 2019 at 06:49:56AM +0200, Amir Goldstein wrote: > On Wed, Feb 27, 2019 at 6:19 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > > > > > > > On 2019/2/27 下午12:06, Amir Goldstein wrote: > > > On Wed, Feb 27, 2019 at 1:22 AM Dave Chinner <david@fromorbit.com> wrote: > > >> > > >> On Tue, Feb 26, 2019 at 11:10:02PM +0200, Amir Goldstein wrote: > > >>> On Tue, Feb 26, 2019 at 8:14 PM Brian Foster <bfoster@redhat.com> wrote: > > >>>> > > >>>> The dm-log-writes mechanism runs a workload against a filesystem, > > >>>> tracks underlying FUAs and restores the filesystem to various points > > >>>> in time based on FUA marks. This allows fstests to check fs > > >>>> consistency at various points and verify log recovery works as > > >>>> expected. > > >>>> > > >>> > > >>> Inaccurate. generic/482 restores to FUA points. > > >>> generic/45[57] restore to user defined points in time (marks). > > >>> dm-log-writes mechanism is capable of restoring either. > > >>> > > >>>> This mechanism does not play well with LSN based log recovery > > >>>> ordering behavior on XFS v5 superblocks, however. For example, > > >>>> generic/482 can reproduce false positive corruptions based on extent > > >>>> to btree conversion of an inode if the inode and associated btree > > >>>> block are written back after different checkpoints. Even though both > > >>>> items are logged correctly in the extent-to-btree transaction, the > > >>>> btree block can be relogged (multiple times) and only written back > > >>>> once when the filesystem unmounts. If the inode was written back > > >>>> after the initial conversion, recovery points between that mark and > > >>>> when the btree block is ultimately written back will show corruption > > >>>> because log recovery sees that the destination buffer is newer than > > >>>> the recovered buffer and intentionally skips the buffer. This is a > > >>>> false positive because the destination buffer was resiliently > > >>>> written back after being physically relogged one or more times. > > >>>> > > >>> > > >>> This story doesn't add up. > > >>> Either dm-log-writes emulated power failure correctly or it doesn't. > > >>> My understanding is that the issue you are seeing is a result of > > >>> XFS seeing "data from the future" after a restore of a power failure > > >>> snapshot, because the scratch device is not a clean slate. > > >>> If I am right, then the correct solution is to wipe the journal before > > >>> starting to replay restore points. > > >> > > >> If that is the problem, then I think we should be wiping the entire > > >> block device before replaying the recorded logwrite. > > >> > > > > > > Indeed. > > > > May I ask a stupid question? > > > > How does it matter whether the device is clean or not? > > Shouldn't the journal/metadata or whatever be self-contained? > > > > Yes and no. > > The most simple example (not limited to xfs and not sure it is like that in xfs) > is how you find the last valid journal commit entry. It should have correct CRC > and the largest LSN. But it you replay IO on top of existing journal without > wiping it first, then journal recovery will continue past the point to meant to > replay or worse. No, that's not the problem we have with XFS. THe problem is that XFS will not recover the changes in the log if the object on disk it would recover into is more recent than the information found in the log. i.e. it's already been written back and so the journal entry does not need to be replayed. 
IOWs, if the block device is not wiped, the first read of a piece of a newly allocated and modified metadata object in the log will see the future state of the object, not whatever was there when the block was first allocated. i.e.

                          in memory    in journal    on disk
    initial contents      0000         n/a           0000
    allocate              0001         n/a           0000
    modify, checkpoint 1  0002         0002          0000
    modify, checkpoint 2  0003         0003          0000
    modify, checkpoint 3  0004         0004          0000
    write back            0004         0004          0004
    checkpoint 4          0004         n/a           0004

Now when we replay up to checkpoint 1, log recovery will read the object from disk. If the disk has been zeroed before we replay, the read in log recovery will see 0000 and replay 0002 over the top, and all will be good. However, if the device hasn't been zeroed, recovery will read "0004", which is more recent than 0002, and it will not replay 0002 because it knows there are future changes to that object in the journal that will be replayed. IOWs, stale metadata (from the future) prevents log recovery from replaying the objects it should be replaying. > The problem that Brian describes is more complicated than that and not > limited to the data in the journal IIUC, but I think what I described above > may plague also ext4 and xfs v4. It will affect xfs v4, but we can't detect it at all because we don't have sequence numbers in the metadata. ext4 is in the same boat as xfs v4, while btrfs is like XFS v5 with transaction identifiers in the metadata to indicate when it was written... > > This "discard everything" assumption doesn't look right to me. > > Although most mkfs would discard at least part of the device, even > > without discarding the newly created fs should be self-contained, no > > wild pointer points to some garbage. If the block device is in an uninitialised state when we start, then all bets are off - it is not a "self contained" test because the initial state is completely unknown. We need to zero so that the initial state for both creation and each replay that occurs starts from the same initial conditions. > > Am I missing something? Or do I get too poisoned by btrfs CoW? > I'd be very surprised if btrfs cannot be flipped by seeing stale data "from > the future" in the block device. Seems to me like the entire concept of > CoW and metadata checksums is completely subverted by the existence > of correct checksums on "stale metadata from the future". No, this is a journal recovery issue - recovery is a read-modify-write operation and so if the contents that are read are stale in a specific way we can be exposed to problems like this. btrfs is not a journalling filesystem, so it shouldn't be doing RMW cycles on metadata to bring it into a consistent state during recovery - it should be doing atomic updates of the tree root to switch from one consistent state on disk to the next.... Cheers, Dave.
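In script form, the wipe-before-replay idea discussed above might look roughly like this (a sketch only; it assumes a util-linux blkdiscard with --zeroout and uses dd as the fallback, and $mark stands for whatever checkpoint is under test):

    # Return the replay target to a known-clean state so recovery never
    # reads metadata "from the future" left over from a previous replay.
    blkdiscard -z $SCRATCH_DEV 2>/dev/null || \
            dd if=/dev/zero of=$SCRATCH_DEV bs=1M oflag=direct

    # Then replay from the start of the log up to the checkpoint under test.
    ./src/log-writes/replay-log --log $LOGWRITES_DEV --replay $SCRATCH_DEV \
            --end-mark $mark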
On Tue, Feb 26, 2019 at 11:10:02PM +0200, Amir Goldstein wrote: > On Tue, Feb 26, 2019 at 8:14 PM Brian Foster <bfoster@redhat.com> wrote: > > > > The dm-log-writes mechanism runs a workload against a filesystem, > > tracks underlying FUAs and restores the filesystem to various points > > in time based on FUA marks. This allows fstests to check fs > > consistency at various points and verify log recovery works as > > expected. > > > > Inaccurate. generic/482 restores to FUA points. > generic/45[57] restore to user defined points in time (marks). > dm-log-writes mechanism is capable of restoring either. > The above is poorly worded. I'm aware of the separate tests and I've used the mechanism to bounce around to various marks. Note that my understanding of the mechanism beyond that is rudimentary. I'll reword this if the patch survives, but it sounds like there may be opportunity to fix the mechanism, which clearly would be ideal. > > This mechanism does not play well with LSN based log recovery > > ordering behavior on XFS v5 superblocks, however. For example, > > generic/482 can reproduce false positive corruptions based on extent > > to btree conversion of an inode if the inode and associated btree > > block are written back after different checkpoints. Even though both > > items are logged correctly in the extent-to-btree transaction, the > > btree block can be relogged (multiple times) and only written back > > once when the filesystem unmounts. If the inode was written back > > after the initial conversion, recovery points between that mark and > > when the btree block is ultimately written back will show corruption > > because log recovery sees that the destination buffer is newer than > > the recovered buffer and intentionally skips the buffer. This is a > > false positive because the destination buffer was resiliently > > written back after being physically relogged one or more times. > > > > This story doesn't add up. > Either dm-log-writes emulated power failure correctly or it doesn't. It doesn't. It leaves the log and broader filesystem in a state that makes no sense with respect to a power failure. > My understanding is that the issue you are seeing is a result of > XFS seeing "data from the future" after a restore of a power failure > snapshot, because the scratch device is not a clean slate. > If I am right, then the correct solution is to wipe the journal before > starting to replay restore points. > > Am I misunderstanding whats going on? > Slightly. Wiping the journal will not help. I _think_ that a wipe of the broader filesystem before recovering from the initial fua and replaying in order from there would mitigate the problem. Is there an easy way to test that theory? For example, would a mkfs of the scratch device before the replay sequence of generic/482 begins allow the test to still otherwise function correctly? I was going to elaborate further on the sequence of events, but I see Dave has already nicely described this generically in his most recent reply. > IIRC, some of Josef's earlier versions used dm snapshots to restore > the blockdev to a clean state before replying log-writes. > I think that one of the earlier versions of generic/482 also took > that approach, but that resulted in longer test runtime (not sure). > > > Update the dm-log-writes require checks to enforce v4 superblocks > > when running against XFS and skip the test otherwise. > > You might as well disable dm-log-writes test for XFS completely. > Who cares about v4 superblocks these days? 
> We need a tool to make sure the NEW features are crash resilient. > > dm-log-writes proved itself to be a powerful generic test tool that found > some serious crash consistency bugs in every one of the major filesystems > and it found bugs with XFS reflink log recovery as well, so IMO > disabling dm-log-writes for v5 would be "very unwise!". > Thanks for the insight Brian > Thanks, > Amir.
On Wed, Feb 27, 2019 at 06:06:57AM +0200, Amir Goldstein wrote: > On Wed, Feb 27, 2019 at 1:22 AM Dave Chinner <david@fromorbit.com> wrote: > > > > On Tue, Feb 26, 2019 at 11:10:02PM +0200, Amir Goldstein wrote: > > > On Tue, Feb 26, 2019 at 8:14 PM Brian Foster <bfoster@redhat.com> wrote: > > > > > > > > The dm-log-writes mechanism runs a workload against a filesystem, > > > > tracks underlying FUAs and restores the filesystem to various points > > > > in time based on FUA marks. This allows fstests to check fs > > > > consistency at various points and verify log recovery works as > > > > expected. > > > > > > > > > > Inaccurate. generic/482 restores to FUA points. > > > generic/45[57] restore to user defined points in time (marks). > > > dm-log-writes mechanism is capable of restoring either. > > > > > > > This mechanism does not play well with LSN based log recovery > > > > ordering behavior on XFS v5 superblocks, however. For example, > > > > generic/482 can reproduce false positive corruptions based on extent > > > > to btree conversion of an inode if the inode and associated btree > > > > block are written back after different checkpoints. Even though both > > > > items are logged correctly in the extent-to-btree transaction, the > > > > btree block can be relogged (multiple times) and only written back > > > > once when the filesystem unmounts. If the inode was written back > > > > after the initial conversion, recovery points between that mark and > > > > when the btree block is ultimately written back will show corruption > > > > because log recovery sees that the destination buffer is newer than > > > > the recovered buffer and intentionally skips the buffer. This is a > > > > false positive because the destination buffer was resiliently > > > > written back after being physically relogged one or more times. > > > > > > > > > > This story doesn't add up. > > > Either dm-log-writes emulated power failure correctly or it doesn't. > > > My understanding is that the issue you are seeing is a result of > > > XFS seeing "data from the future" after a restore of a power failure > > > snapshot, because the scratch device is not a clean slate. > > > If I am right, then the correct solution is to wipe the journal before > > > starting to replay restore points. > > > > If that is the problem, then I think we should be wiping the entire > > block device before replaying the recorded logwrite. > > > > Indeed. > > > i.e. this sounds like a "block device we are replaying onto has > > stale data in it" problem because we are replaying the same > > filesystem over the top of itself. Hence there are no unique > > identifiers in the metadata that can detect stale metadata in > > the block device. > > > > I'm surprised that we haven't tripped over this much earlier that > > this... > > > > I remember asking myself the same thing... it's coming back to me > now. I really remember having this discussion during test review. > generic/482 is an adaptation of Josef's test script [1], which > does log recovery onto a snapshot on every FUA checkpoint. > > [1] https://github.com/josefbacik/log-writes/blob/master/replay-fsck-wrapper.sh > > Setting up snapshots for every checkpoint was found empirically to take > more test runtime, than replaying log from the start for each checkpoint. > That observation was limited to the systems that Qu and Eryu tested on. > > IRC, what usually took care of cleaning the block device is replaying the > "discard everything" IO from mkfs time. 
dm-log-writes driver should take care > of zeroing blocks upon replaying a discard IO even on a target device that > doesn't support discard, but maybe if original device doesn't support discard, > the IO is not recorded at all and therefore not replayed? > > Brian, can you check if your log-writes stream contains a discard IO for > the entire block device? I do remember that the initial log-writes tests worked > very reliably on my laptop with SSD and were a bit flaky on another test machine > with spinning rust, but that machine had other hardware reliability issues at > the time (bad SATA cable) so I attributed all issues to that problem. > FYI, the command from your other mail on a logwrites dev that demonstrates this problem shows the following: # ./src/log-writes/replay-log -vv --find --end-mark mkfs --log /dev/test/tmp | grep DISCARD seek entry 0@2: 0, size 8388607, flags 0x4(DISCARD) seek entry 1@3: 8388607, size 8388607, flags 0x4(DISCARD) seek entry 2@4: 16777214, size 8388607, flags 0x4(DISCARD) seek entry 3@5: 25165821, size 6291459, flags 0x4(DISCARD) ... which appears to cover the entire device. Is the intention that this should wipe the scratch device? Brian > BTW, looking closer at generic/482, $prev does not seem to be used at all > in the replay loop. > > > > > Update the dm-log-writes require checks to enforce v4 superblocks > > > > when running against XFS and skip the test otherwise. > > > > > > You might as well disable dm-log-writes test for XFS completely. > > > Who cares about v4 superblocks these days? > > > > Enough of the inflammatory hyperbole, Amir. Statements like this > > serve no useful purpose. > > > > <deep breath> Agreed. > > Thanks, > Amir.
On Wed, Feb 27, 2019 at 08:18:39AM -0500, Brian Foster wrote: > On Tue, Feb 26, 2019 at 11:10:02PM +0200, Amir Goldstein wrote: > > On Tue, Feb 26, 2019 at 8:14 PM Brian Foster <bfoster@redhat.com> wrote: > > > > > > The dm-log-writes mechanism runs a workload against a filesystem, > > > tracks underlying FUAs and restores the filesystem to various points > > > in time based on FUA marks. This allows fstests to check fs > > > consistency at various points and verify log recovery works as > > > expected. > > > > > > > Inaccurate. generic/482 restores to FUA points. > > generic/45[57] restore to user defined points in time (marks). > > dm-log-writes mechanism is capable of restoring either. > > > > The above is poorly worded. I'm aware of the separate tests and I've > used the mechanism to bounce around to various marks. Note that my > understanding of the mechanism beyond that is rudimentary. I'll reword > this if the patch survives, but it sounds like there may be opportunity > to fix the mechanism, which clearly would be ideal. > > > > This mechanism does not play well with LSN based log recovery > > > ordering behavior on XFS v5 superblocks, however. For example, > > > generic/482 can reproduce false positive corruptions based on extent > > > to btree conversion of an inode if the inode and associated btree > > > block are written back after different checkpoints. Even though both > > > items are logged correctly in the extent-to-btree transaction, the > > > btree block can be relogged (multiple times) and only written back > > > once when the filesystem unmounts. If the inode was written back > > > after the initial conversion, recovery points between that mark and > > > when the btree block is ultimately written back will show corruption > > > because log recovery sees that the destination buffer is newer than > > > the recovered buffer and intentionally skips the buffer. This is a > > > false positive because the destination buffer was resiliently > > > written back after being physically relogged one or more times. > > > > > > > This story doesn't add up. > > Either dm-log-writes emulated power failure correctly or it doesn't. > > It doesn't. It leaves the log and broader filesystem in a state that > makes no sense with respect to a power failure. > > > My understanding is that the issue you are seeing is a result of > > XFS seeing "data from the future" after a restore of a power failure > > snapshot, because the scratch device is not a clean slate. > > If I am right, then the correct solution is to wipe the journal before > > starting to replay restore points. > > > > Am I misunderstanding whats going on? > > > > Slightly. Wiping the journal will not help. I _think_ that a wipe of the > broader filesystem before recovering from the initial fua and replaying > in order from there would mitigate the problem. Is there an easy way to > test that theory? For example, would a mkfs of the scratch device before > the replay sequence of generic/482 begins allow the test to still > otherwise function correctly? > FYI, I gave this a try and it didn't ultimately work because mkfs didn't clear the device either. I ended up reproducing the problem, physically zeroing the device, replaying the associated FUA and observing the problem go away. From there, if I replay to the final FUA mark and go back to the (originally) problematic FUA, the problem is reintroduced. 
Brian > I was going to elaborate further on the sequence of events, but I see > Dave has already nicely described this generically in his most recent > reply. > > > IIRC, some of Josef's earlier versions used dm snapshots to restore > > the blockdev to a clean state before replying log-writes. > > I think that one of the earlier versions of generic/482 also took > > that approach, but that resulted in longer test runtime (not sure). > > > > > Update the dm-log-writes require checks to enforce v4 superblocks > > > when running against XFS and skip the test otherwise. > > > > You might as well disable dm-log-writes test for XFS completely. > > Who cares about v4 superblocks these days? > > We need a tool to make sure the NEW features are crash resilient. > > > > dm-log-writes proved itself to be a powerful generic test tool that found > > some serious crash consistency bugs in every one of the major filesystems > > and it found bugs with XFS reflink log recovery as well, so IMO > > disabling dm-log-writes for v5 would be "very unwise!". > > > > Thanks for the insight > > Brian > > > Thanks, > > Amir.
On Wed, Feb 27, 2019 at 09:17:32AM -0500, Brian Foster wrote: > On Wed, Feb 27, 2019 at 08:18:39AM -0500, Brian Foster wrote: > > On Tue, Feb 26, 2019 at 11:10:02PM +0200, Amir Goldstein wrote: > > > On Tue, Feb 26, 2019 at 8:14 PM Brian Foster <bfoster@redhat.com> wrote: > > > > > > > > The dm-log-writes mechanism runs a workload against a filesystem, > > > > tracks underlying FUAs and restores the filesystem to various points > > > > in time based on FUA marks. This allows fstests to check fs > > > > consistency at various points and verify log recovery works as > > > > expected. > > > > > > > > > > Inaccurate. generic/482 restores to FUA points. > > > generic/45[57] restore to user defined points in time (marks). > > > dm-log-writes mechanism is capable of restoring either. > > > > > > > The above is poorly worded. I'm aware of the separate tests and I've > > used the mechanism to bounce around to various marks. Note that my > > understanding of the mechanism beyond that is rudimentary. I'll reword > > this if the patch survives, but it sounds like there may be opportunity > > to fix the mechanism, which clearly would be ideal. > > > > > > This mechanism does not play well with LSN based log recovery > > > > ordering behavior on XFS v5 superblocks, however. For example, > > > > generic/482 can reproduce false positive corruptions based on extent > > > > to btree conversion of an inode if the inode and associated btree > > > > block are written back after different checkpoints. Even though both > > > > items are logged correctly in the extent-to-btree transaction, the > > > > btree block can be relogged (multiple times) and only written back > > > > once when the filesystem unmounts. If the inode was written back > > > > after the initial conversion, recovery points between that mark and > > > > when the btree block is ultimately written back will show corruption > > > > because log recovery sees that the destination buffer is newer than > > > > the recovered buffer and intentionally skips the buffer. This is a > > > > false positive because the destination buffer was resiliently > > > > written back after being physically relogged one or more times. > > > > > > > > > > This story doesn't add up. > > > Either dm-log-writes emulated power failure correctly or it doesn't. > > > > It doesn't. It leaves the log and broader filesystem in a state that > > makes no sense with respect to a power failure. > > > > > My understanding is that the issue you are seeing is a result of > > > XFS seeing "data from the future" after a restore of a power failure > > > snapshot, because the scratch device is not a clean slate. > > > If I am right, then the correct solution is to wipe the journal before > > > starting to replay restore points. > > > > > > Am I misunderstanding whats going on? > > > > > > > Slightly. Wiping the journal will not help. I _think_ that a wipe of the > > broader filesystem before recovering from the initial fua and replaying > > in order from there would mitigate the problem. Is there an easy way to > > test that theory? For example, would a mkfs of the scratch device before > > the replay sequence of generic/482 begins allow the test to still > > otherwise function correctly? > > > > FYI, I gave this a try and it didn't ultimately work because mkfs didn't > clear the device either. I ended up reproducing the problem, physically > zeroing the device, replaying the associated FUA and observing the > problem go away. 
From there, if I replay to the final FUA mark and go > back to the (originally) problematic FUA, the problem is reintroduced. > Sorry guys, whenever I run log-writes on xfs I use my helper script here https://github.com/josefbacik/log-writes specifically replay-individual-faster.sh. This creates a snapshot at every replay point, mounts and checks the fs, and then destroys the snapshot and keeps going. This way you don't end up with the "new" data still being on the device. It's not super fast, but this is usually a fire and forget sort of thing. I could probably integrate this into xfstests for our log-writes tests, those tend to not generate large logs so wouldn't take super long. Does this fix the problem for you Brian? Josef
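In outline, the snapshot-per-replay-point approach looks something like the following sketch (an illustration, not the contents of replay-individual-faster.sh; the COW device, chunk size and entry counting are assumptions):

    # Replay the next chunk of the log onto the base device...
    ./src/log-writes/replay-log --log $LOGWRITES_DEV --replay $REPLAY_DEV \
            --limit $entry

    # ...then mount/check a throwaway non-persistent snapshot, so the check
    # (journal recovery, fsck) never dirties the base device.
    size=$(blockdev --getsz $REPLAY_DEV)
    dmsetup create replay-snap --table "0 $size snapshot $REPLAY_DEV $COW_DEV N 8"
    mount /dev/mapper/replay-snap $MNT && umount $MNT
    xfs_repair -n /dev/mapper/replay-snap
    dmsetup remove replay-snap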
> > FYI, I gave this a try and it didn't ultimately work because mkfs didn't > > clear the device either. I ended up reproducing the problem, physically > > zeroing the device, replaying the associated FUA and observing the > > problem go away. From there, if I replay to the final FUA mark and go > > back to the (originally) problematic FUA, the problem is reintroduced. > > > > Sorry guys, whenever I run log-writes on xfs I use my helper script here > > https://github.com/josefbacik/log-writes > > specifically replay-individual-faster.sh. This creates a snapshot at every > replay point, mounts and checks the fs, and then destroys the snapshot and keeps > going. This way you don't end up with the "new" data still being on the device. > It's not super fast, but this is usually a fire and forget sort of thing. I > could probably integrate this into xfstests for our log-writes tests, those tend > to not generate large logs so wouldn't take super long. Does this fix the > problem for you Brian? >

Chronicles of generic/482:
- [RFC] version used dm snapshot and reported that XFS survived the test: https://marc.info/?l=fstests&m=152101818830823&w=1
- Alas, this RFC abused $TEST_DEV as the snapshot CoW device. During review, the need for the snapshot was questioned and it was speculated that replay should be sufficient and faster.
- [v2] version dropped dm snapshots saying: "Get rid of dm-snapshot which is pretty slow if we're creating and deleting snapshots repeatedly. (Maybe LVM thin provision would be much better, but current replay solution is good so far, and no slower than dm-snapshot)" https://marc.info/?l=fstests&m=152221632920271&w=1
- This version already reported failures on XFS.
- Dave has already identified the problem back then: https://marc.info/?l=fstests&m=152228022208188&w=1 His conclusion was that this "implies a write ordering problem".

What we have is:
- writing in the future
- going back to the past without wiping the future writes

I guess you can call that a write ordering problem ;-)

Qu, will you have time to fix the test to use dm thinp snapshots?

Thanks, Amir.
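For reference, a dm thin-provisioning flavour of the snapshot approach might look roughly like this (a sketch following the kernel's thin-provisioning documentation; the pool/device names, block size and device ids are placeholders):

    # One-time setup: a thin pool plus thin volume 0 as the replay target.
    size=$(blockdev --getsz $DATA_DEV)
    dmsetup create pool --table "0 $size thin-pool $META_DEV $DATA_DEV 128 0"
    dmsetup message /dev/mapper/pool 0 "create_thin 0"
    dmsetup create replay-thin --table "0 $size thin /dev/mapper/pool 0"

    # Per checkpoint: snapshot thin volume 0, check the snapshot, delete it.
    dmsetup suspend /dev/mapper/replay-thin
    dmsetup message /dev/mapper/pool 0 "create_snap 1 0"
    dmsetup resume /dev/mapper/replay-thin
    dmsetup create replay-snap --table "0 $size thin /dev/mapper/pool 1"
    # ... mount/check /dev/mapper/replay-snap ...
    dmsetup remove replay-snap
    dmsetup message /dev/mapper/pool 0 "delete 1"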
On Wed, Feb 27, 2019 at 10:54:20AM -0500, Josef Bacik wrote: > On Wed, Feb 27, 2019 at 09:17:32AM -0500, Brian Foster wrote: > > On Wed, Feb 27, 2019 at 08:18:39AM -0500, Brian Foster wrote: > > > On Tue, Feb 26, 2019 at 11:10:02PM +0200, Amir Goldstein wrote: > > > > On Tue, Feb 26, 2019 at 8:14 PM Brian Foster <bfoster@redhat.com> wrote: > > > > > > > > > > The dm-log-writes mechanism runs a workload against a filesystem, > > > > > tracks underlying FUAs and restores the filesystem to various points > > > > > in time based on FUA marks. This allows fstests to check fs > > > > > consistency at various points and verify log recovery works as > > > > > expected. > > > > > > > > > > > > > Inaccurate. generic/482 restores to FUA points. > > > > generic/45[57] restore to user defined points in time (marks). > > > > dm-log-writes mechanism is capable of restoring either. > > > > > > > > > > The above is poorly worded. I'm aware of the separate tests and I've > > > used the mechanism to bounce around to various marks. Note that my > > > understanding of the mechanism beyond that is rudimentary. I'll reword > > > this if the patch survives, but it sounds like there may be opportunity > > > to fix the mechanism, which clearly would be ideal. > > > > > > > > This mechanism does not play well with LSN based log recovery > > > > > ordering behavior on XFS v5 superblocks, however. For example, > > > > > generic/482 can reproduce false positive corruptions based on extent > > > > > to btree conversion of an inode if the inode and associated btree > > > > > block are written back after different checkpoints. Even though both > > > > > items are logged correctly in the extent-to-btree transaction, the > > > > > btree block can be relogged (multiple times) and only written back > > > > > once when the filesystem unmounts. If the inode was written back > > > > > after the initial conversion, recovery points between that mark and > > > > > when the btree block is ultimately written back will show corruption > > > > > because log recovery sees that the destination buffer is newer than > > > > > the recovered buffer and intentionally skips the buffer. This is a > > > > > false positive because the destination buffer was resiliently > > > > > written back after being physically relogged one or more times. > > > > > > > > > > > > > This story doesn't add up. > > > > Either dm-log-writes emulated power failure correctly or it doesn't. > > > > > > It doesn't. It leaves the log and broader filesystem in a state that > > > makes no sense with respect to a power failure. > > > > > > > My understanding is that the issue you are seeing is a result of > > > > XFS seeing "data from the future" after a restore of a power failure > > > > snapshot, because the scratch device is not a clean slate. > > > > If I am right, then the correct solution is to wipe the journal before > > > > starting to replay restore points. > > > > > > > > Am I misunderstanding whats going on? > > > > > > > > > > Slightly. Wiping the journal will not help. I _think_ that a wipe of the > > > broader filesystem before recovering from the initial fua and replaying > > > in order from there would mitigate the problem. Is there an easy way to > > > test that theory? For example, would a mkfs of the scratch device before > > > the replay sequence of generic/482 begins allow the test to still > > > otherwise function correctly? > > > > > > > FYI, I gave this a try and it didn't ultimately work because mkfs didn't > > clear the device either. 
I ended up reproducing the problem, physically > > zeroing the device, replaying the associated FUA and observing the > > problem go away. From there, if I replay to the final FUA mark and go > > back to the (originally) problematic FUA, the problem is reintroduced. > > > > Sorry guys, whenever I run log-writes on xfs I use my helper script here > > https://github.com/josefbacik/log-writes > > specifically replay-individual-faster.sh. This creates a snapshot at every > replay point, mounts and checks the fs, and then destroys the snapshot and keeps > going. This way you don't end up with the "new" data still being on the device. > It's not super fast, but this is usually a fire and forget sort of thing. I > could probably integrate this into xfstests for our log-writes tests, those tend > to not generate large logs so wouldn't take super long. Does this fix the > problem for you Brian? >

Thanks, Josef. At a glance at that script, I'm not quite following how this fits together. Are you taking a snapshot of the original device before the workload being tested is run against dm-log-writes, then replaying on top of that? In general, anything that puts the device back into the state from before the workload ran and replays from there should be enough to fix the problem, I think. As long as a replay sequence runs in order, I don't think snapshots of each replay point should technically be necessary (vs. a single replay snapshot, e.g.). Note that I ended up testing generic/482 with a loop device for a scratch device and that also worked around the problem, I suspect because that allowed the aforementioned discards to actually reset the underlying device via loop discard->hole punch (which doesn't occur with my original, non-loop scratch dev). So it seems that another option for more deterministic behavior in fstests could be to just enforce the use of a loop device as the underlying target in these particular tests. For example, have the log-writes init create a file on the test device and loop mount that rather than using the scratch device. Just a thought though; it's not clear to me that's any simpler than what you have already implemented here... Brian > Josef
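To make the loop-device idea above concrete, a minimal sketch of what the log-writes init could do instead of pointing replay at $SCRATCH_DEV directly might look like this (the path, size and variable names are made up for illustration; this is not current fstests code):

    # back the replay target with a file on the test fs so that discards
    # issued by replay-log become hole punches and really clear the old
    # contents between restore points
    img=$TEST_DIR/replay-target.img
    truncate -s 15G "$img"
    replay_dev=$(losetup -f --show "$img")

    # ... run the workload and the replay-log sequence against $replay_dev ...

    losetup -d "$replay_dev"
    rm -f "$img"

The caveat raised later in the thread still applies: this only helps on kernels where the loop driver supports discard and the backing filesystem supports hole punching.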
> > > FYI, I gave this a try and it didn't ultimately work because mkfs didn't > > > clear the device either. I ended up reproducing the problem, physically > > > zeroing the device, replaying the associated FUA and observing the > > > problem go away. From there, if I replay to the final FUA mark and go > > > back to the (originally) problematic FUA, the problem is reintroduced. > > > > > > > Sorry guys, whenever I run log-writes on xfs I use my helper script here > > > > https://github.com/josefbacik/log-writes > > > > specifically replay-individual-faster.sh. This creates a snapshot at every > > replay point, mounts and checks the fs, and then destroys the snapshot and keeps > > going. This way you don't end up with the "new" data still being on the device. > > It's not super fast, but this is usually a fire and forget sort of thing. I > > could probably integrate this into xfstests for our log-writes tests, those tend > > to not generate large logs so wouldn't take super long. Does this fix the > > problem for you Brian? > > > > Thanks Josef. At a glance at that script I'm not quite following how > this fits together. Are you taking a snapshot of the original device > before the workload being tested is run against dm-log-writes, then > replaying on top of that? In general, anything that puts the device back > into the state from before the workload ran and replays from there > should be enough to fix the problem I think. As long as a replay > sequence runs in order, I don't think snapshots of each replay point > should technically be necessary (vs a single replay snapshot, for e.g.). > > Note that I ended up testing generic/482 with a loop device for a > scratch device and that also worked around the problem, I suspect > because that allowed the aforementioned discards to actually reset the > underlying the device via loop discard->hole punch (which doesn't occur > with my original, non-loop scratch dev). So it seems that another option

Urrgh! Take a look at src/log-writes/log-writes.c:log_discard()->zero_range(). I remembered correctly that the replay log falls back to zeroing the device when it has no discard support, but I wasn't aware that it drops large discard requests (> 128MB), so there we have it.

> for more deterministic behavior in fstests could be to just enforce the > use of a loop device as the underlying target in these particular tests. > For example, have the log-writes init create a file on the test device > and loop mount that rather than using the scratch device. Just a thought > though, it's not clear to me that's any more simple than what you have > already implemented here.. >

Seems like a pretty good idea to me, but in doing that we may lose diversity in the tested block devices. On different test machines, mkfs and the later fsstress ops will result in slightly different io patterns. Since generic/482 is not a deterministic test (random fsstress seed), we cannot really argue that reproducible results are better than diverse ones. Another option is to relax the log->max_zero_size limit of log_discard() and allow it to zero out devices that don't support discard, but that is likely going to cost much more runtime.
Note that this simple patch could also work for you:

--- a/src/log-writes/log-writes.c
+++ b/src/log-writes/log-writes.c
@@ -95,7 +95,7 @@ int log_discard(struct log *log, struct log_write_entry *entry)
 {
 	u64 start = le64_to_cpu(entry->sector) * log->sectorsize;
 	u64 size = le64_to_cpu(entry->nr_sectors) * log->sectorsize;
-	u64 max_chunk = 1 * 1024 * 1024 * 1024;
+	u64 max_chunk = log->max_zero_size;
---

Brian,

Can you try this patch on your test machine and say how much worse the runtime is compared to the loop device setup?

In general, I am in favor of the simplest possible fix and of keeping the runtime shortest, so if the above fix results in longer runtime, I would prefer to make the test always run on loopdev as you suggested.

Thanks,
Amir.
On Wed, Feb 27, 2019 at 12:13:37PM -0500, Brian Foster wrote: > On Wed, Feb 27, 2019 at 10:54:20AM -0500, Josef Bacik wrote: > > On Wed, Feb 27, 2019 at 09:17:32AM -0500, Brian Foster wrote: > > > On Wed, Feb 27, 2019 at 08:18:39AM -0500, Brian Foster wrote: > > > > On Tue, Feb 26, 2019 at 11:10:02PM +0200, Amir Goldstein wrote: > > > > > On Tue, Feb 26, 2019 at 8:14 PM Brian Foster <bfoster@redhat.com> wrote: > > > > > > > > > > > > The dm-log-writes mechanism runs a workload against a filesystem, > > > > > > tracks underlying FUAs and restores the filesystem to various points > > > > > > in time based on FUA marks. This allows fstests to check fs > > > > > > consistency at various points and verify log recovery works as > > > > > > expected. > > > > > > > > > > > > > > > > Inaccurate. generic/482 restores to FUA points. > > > > > generic/45[57] restore to user defined points in time (marks). > > > > > dm-log-writes mechanism is capable of restoring either. > > > > > > > > > > > > > The above is poorly worded. I'm aware of the separate tests and I've > > > > used the mechanism to bounce around to various marks. Note that my > > > > understanding of the mechanism beyond that is rudimentary. I'll reword > > > > this if the patch survives, but it sounds like there may be opportunity > > > > to fix the mechanism, which clearly would be ideal. > > > > > > > > > > This mechanism does not play well with LSN based log recovery > > > > > > ordering behavior on XFS v5 superblocks, however. For example, > > > > > > generic/482 can reproduce false positive corruptions based on extent > > > > > > to btree conversion of an inode if the inode and associated btree > > > > > > block are written back after different checkpoints. Even though both > > > > > > items are logged correctly in the extent-to-btree transaction, the > > > > > > btree block can be relogged (multiple times) and only written back > > > > > > once when the filesystem unmounts. If the inode was written back > > > > > > after the initial conversion, recovery points between that mark and > > > > > > when the btree block is ultimately written back will show corruption > > > > > > because log recovery sees that the destination buffer is newer than > > > > > > the recovered buffer and intentionally skips the buffer. This is a > > > > > > false positive because the destination buffer was resiliently > > > > > > written back after being physically relogged one or more times. > > > > > > > > > > > > > > > > This story doesn't add up. > > > > > Either dm-log-writes emulated power failure correctly or it doesn't. > > > > > > > > It doesn't. It leaves the log and broader filesystem in a state that > > > > makes no sense with respect to a power failure. > > > > > > > > > My understanding is that the issue you are seeing is a result of > > > > > XFS seeing "data from the future" after a restore of a power failure > > > > > snapshot, because the scratch device is not a clean slate. > > > > > If I am right, then the correct solution is to wipe the journal before > > > > > starting to replay restore points. > > > > > > > > > > Am I misunderstanding whats going on? > > > > > > > > > > > > > Slightly. Wiping the journal will not help. I _think_ that a wipe of the > > > > broader filesystem before recovering from the initial fua and replaying > > > > in order from there would mitigate the problem. Is there an easy way to > > > > test that theory? 
For example, would a mkfs of the scratch device before > > > > the replay sequence of generic/482 begins allow the test to still > > > > otherwise function correctly? > > > > > > > > > > FYI, I gave this a try and it didn't ultimately work because mkfs didn't > > > clear the device either. I ended up reproducing the problem, physically > > > zeroing the device, replaying the associated FUA and observing the > > > problem go away. From there, if I replay to the final FUA mark and go > > > back to the (originally) problematic FUA, the problem is reintroduced. > > > > > > > Sorry guys, whenever I run log-writes on xfs I use my helper script here > > > > https://github.com/josefbacik/log-writes > > > > specifically replay-individual-faster.sh. This creates a snapshot at every > > replay point, mounts and checks the fs, and then destroys the snapshot and keeps > > going. This way you don't end up with the "new" data still being on the device. > > It's not super fast, but this is usually a fire and forget sort of thing. I > > could probably integrate this into xfstests for our log-writes tests, those tend > > to not generate large logs so wouldn't take super long. Does this fix the > > problem for you Brian? > > > > Thanks Josef. At a glance at that script I'm not quite following how > this fits together. Are you taking a snapshot of the original device > before the workload being tested is run against dm-log-writes, then > replaying on top of that? In general, anything that puts the device back > into the state from before the workload ran and replays from there > should be enough to fix the problem I think. As long as a replay > sequence runs in order, I don't think snapshots of each replay point > should technically be necessary (vs a single replay snapshot, for e.g.). > Well we do this so we can mount/unmount the fs to get the log replay to happen, and then run fsck. We want the snapshot so that we can roll back the changes of the log replay. > Note that I ended up testing generic/482 with a loop device for a > scratch device and that also worked around the problem, I suspect > because that allowed the aforementioned discards to actually reset the > underlying the device via loop discard->hole punch (which doesn't occur > with my original, non-loop scratch dev). So it seems that another option > for more deterministic behavior in fstests could be to just enforce the > use of a loop device as the underlying target in these particular tests. > For example, have the log-writes init create a file on the test device > and loop mount that rather than using the scratch device. Just a thought > though, it's not clear to me that's any more simple than what you have > already implemented here.. > I'm good either way. What I've done here isn't simple, it's more meant for my long running tests. Doing loop and discarding the whole device would be cool, but I'm afraid of punch hole bugs in the underlying file system making things weird, or running on kernels where loop doesn't do punch hole or the fs doesn't support punch hole. Each solution has its drawbacks. Adding a bunch of infrastructure to do snapshots isn't super awesome, but may be the most flexible. The loop solution has the added benefit of also testing punch hole/discard ;). Josef
On Wed, Feb 27, 2019 at 08:46:42PM +0200, Amir Goldstein wrote: > > > > FYI, I gave this a try and it didn't ultimately work because mkfs didn't > > > > clear the device either. I ended up reproducing the problem, physically > > > > zeroing the device, replaying the associated FUA and observing the > > > > problem go away. From there, if I replay to the final FUA mark and go > > > > back to the (originally) problematic FUA, the problem is reintroduced. > > > > > > > > > > Sorry guys, whenever I run log-writes on xfs I use my helper script here > > > > > > https://github.com/josefbacik/log-writes > > > > > > specifically replay-individual-faster.sh. This creates a snapshot at every > > > replay point, mounts and checks the fs, and then destroys the snapshot and keeps > > > going. This way you don't end up with the "new" data still being on the device. > > > It's not super fast, but this is usually a fire and forget sort of thing. I > > > could probably integrate this into xfstests for our log-writes tests, those tend > > > to not generate large logs so wouldn't take super long. Does this fix the > > > problem for you Brian? > > > > > > > Thanks Josef. At a glance at that script I'm not quite following how > > this fits together. Are you taking a snapshot of the original device > > before the workload being tested is run against dm-log-writes, then > > replaying on top of that? In general, anything that puts the device back > > into the state from before the workload ran and replays from there > > should be enough to fix the problem I think. As long as a replay > > sequence runs in order, I don't think snapshots of each replay point > > should technically be necessary (vs a single replay snapshot, for e.g.). > > > > Note that I ended up testing generic/482 with a loop device for a > > scratch device and that also worked around the problem, I suspect > > because that allowed the aforementioned discards to actually reset the > > underlying the device via loop discard->hole punch (which doesn't occur > > with my original, non-loop scratch dev). So it seems that another option > > Urrgh! take a look at src/log-writes/log-writes.c:log_discard()->zero_range(). > I remembered correctly the replay log falls back to zeroing device with no > discard support, but wasn't aware that it drops large discard requests > (> 128MB), so there we have it. > Heh, Ok. > > for more deterministic behavior in fstests could be to just enforce the > > use of a loop device as the underlying target in these particular tests. > > For example, have the log-writes init create a file on the test device > > and loop mount that rather than using the scratch device. Just a thought > > though, it's not clear to me that's any more simple than what you have > > already implemented here.. > > > > Seems like a pretty good idea to me, but in doing that we may loose > diversity in the tested block devices. On different test machines mkfs > and later fsstress ops will result in slightly different io patterns. > Since generic/482 is not a deterministic test (random fsstress seed) > than we cannot argue that reproducible results are better than > deterministic results. > > Another option is to relax the log->max_zero_size limit of > log_discard() and allow it to zero out devices that don't support > discard, but that is likely going to cost much more runtime. 
> Note that this simple patch could also work for you: > > --- a/src/log-writes/log-writes.c > +++ b/src/log-writes/log-writes.c > @@ -95,7 +95,7 @@ int log_discard(struct log *log, struct > log_write_entry *entry) > { > u64 start = le64_to_cpu(entry->sector) * log->sectorsize; > u64 size = le64_to_cpu(entry->nr_sectors) * log->sectorsize; > - u64 max_chunk = 1 * 1024 * 1024 * 1024; > + u64 max_chunk = log->max_zero_size; > > --- > > Brian, > > Can you try this patch on you test machine and say how much > worse the runtime is compared to the loop device setup? > > In general, I am in favor of the simplest possible fix > and in keeping runtime shortest, so if the above fix > results in longer runtime, I would prefer to make the test always > run on loopdev as you suggested. > Note that this is a low resource VM with a spinning disk. The scratch device is a 15G LVM volume. My last run of generic/482 was 139s (which I think was using loop for a scratch device). With this change and back to using the LVM volume, I ended up stopping it because I started seeing a ~160s or so delay between XFS mount cycles in the syslog (i.e., per replay iteration). ;P Brian > Thanks, > Amir.
On Wed, Feb 27, 2019 at 02:27:49PM -0500, Josef Bacik wrote: > On Wed, Feb 27, 2019 at 12:13:37PM -0500, Brian Foster wrote: > > On Wed, Feb 27, 2019 at 10:54:20AM -0500, Josef Bacik wrote: > > > On Wed, Feb 27, 2019 at 09:17:32AM -0500, Brian Foster wrote: > > > > On Wed, Feb 27, 2019 at 08:18:39AM -0500, Brian Foster wrote: > > > > > On Tue, Feb 26, 2019 at 11:10:02PM +0200, Amir Goldstein wrote: > > > > > > On Tue, Feb 26, 2019 at 8:14 PM Brian Foster <bfoster@redhat.com> wrote: > > > > > > > > > > > > > > The dm-log-writes mechanism runs a workload against a filesystem, > > > > > > > tracks underlying FUAs and restores the filesystem to various points > > > > > > > in time based on FUA marks. This allows fstests to check fs > > > > > > > consistency at various points and verify log recovery works as > > > > > > > expected. > > > > > > > > > > > > > > > > > > > Inaccurate. generic/482 restores to FUA points. > > > > > > generic/45[57] restore to user defined points in time (marks). > > > > > > dm-log-writes mechanism is capable of restoring either. > > > > > > > > > > > > > > > > The above is poorly worded. I'm aware of the separate tests and I've > > > > > used the mechanism to bounce around to various marks. Note that my > > > > > understanding of the mechanism beyond that is rudimentary. I'll reword > > > > > this if the patch survives, but it sounds like there may be opportunity > > > > > to fix the mechanism, which clearly would be ideal. > > > > > > > > > > > > This mechanism does not play well with LSN based log recovery > > > > > > > ordering behavior on XFS v5 superblocks, however. For example, > > > > > > > generic/482 can reproduce false positive corruptions based on extent > > > > > > > to btree conversion of an inode if the inode and associated btree > > > > > > > block are written back after different checkpoints. Even though both > > > > > > > items are logged correctly in the extent-to-btree transaction, the > > > > > > > btree block can be relogged (multiple times) and only written back > > > > > > > once when the filesystem unmounts. If the inode was written back > > > > > > > after the initial conversion, recovery points between that mark and > > > > > > > when the btree block is ultimately written back will show corruption > > > > > > > because log recovery sees that the destination buffer is newer than > > > > > > > the recovered buffer and intentionally skips the buffer. This is a > > > > > > > false positive because the destination buffer was resiliently > > > > > > > written back after being physically relogged one or more times. > > > > > > > > > > > > > > > > > > > This story doesn't add up. > > > > > > Either dm-log-writes emulated power failure correctly or it doesn't. > > > > > > > > > > It doesn't. It leaves the log and broader filesystem in a state that > > > > > makes no sense with respect to a power failure. > > > > > > > > > > > My understanding is that the issue you are seeing is a result of > > > > > > XFS seeing "data from the future" after a restore of a power failure > > > > > > snapshot, because the scratch device is not a clean slate. > > > > > > If I am right, then the correct solution is to wipe the journal before > > > > > > starting to replay restore points. > > > > > > > > > > > > Am I misunderstanding whats going on? > > > > > > > > > > > > > > > > Slightly. Wiping the journal will not help. 
I _think_ that a wipe of the > > > > > broader filesystem before recovering from the initial fua and replaying > > > > > in order from there would mitigate the problem. Is there an easy way to > > > > > test that theory? For example, would a mkfs of the scratch device before > > > > > the replay sequence of generic/482 begins allow the test to still > > > > > otherwise function correctly? > > > > > > > > > > > > > FYI, I gave this a try and it didn't ultimately work because mkfs didn't > > > > clear the device either. I ended up reproducing the problem, physically > > > > zeroing the device, replaying the associated FUA and observing the > > > > problem go away. From there, if I replay to the final FUA mark and go > > > > back to the (originally) problematic FUA, the problem is reintroduced. > > > > > > > > > > Sorry guys, whenever I run log-writes on xfs I use my helper script here > > > > > > https://github.com/josefbacik/log-writes > > > > > > specifically replay-individual-faster.sh. This creates a snapshot at every > > > replay point, mounts and checks the fs, and then destroys the snapshot and keeps > > > going. This way you don't end up with the "new" data still being on the device. > > > It's not super fast, but this is usually a fire and forget sort of thing. I > > > could probably integrate this into xfstests for our log-writes tests, those tend > > > to not generate large logs so wouldn't take super long. Does this fix the > > > problem for you Brian? > > > > > > > Thanks Josef. At a glance at that script I'm not quite following how > > this fits together. Are you taking a snapshot of the original device > > before the workload being tested is run against dm-log-writes, then > > replaying on top of that? In general, anything that puts the device back > > into the state from before the workload ran and replays from there > > should be enough to fix the problem I think. As long as a replay > > sequence runs in order, I don't think snapshots of each replay point > > should technically be necessary (vs a single replay snapshot, for e.g.). > > > > Well we do this so we can mount/unmount the fs to get the log replay to happen, > and then run fsck. We want the snapshot so that we can roll back the changes of > the log replay. > > > Note that I ended up testing generic/482 with a loop device for a > > scratch device and that also worked around the problem, I suspect > > because that allowed the aforementioned discards to actually reset the > > underlying the device via loop discard->hole punch (which doesn't occur > > with my original, non-loop scratch dev). So it seems that another option > > for more deterministic behavior in fstests could be to just enforce the > > use of a loop device as the underlying target in these particular tests. > > For example, have the log-writes init create a file on the test device > > and loop mount that rather than using the scratch device. Just a thought > > though, it's not clear to me that's any more simple than what you have > > already implemented here.. > > > > I'm good either way. What I've done here isn't simple, it's more meant for my > long running tests. Doing loop and discarding the whole device would be cool, > but I'm afraid of punch hole bugs in the underlying file system making things > weird, or running on kernels where loop doesn't do punch hole or the fs doesn't > support punch hole. > True.. > Each solution has its drawbacks. Adding a bunch of infrastructure to do > snapshots isn't super awesome, but may be the most flexible. 
The loop solution > has the added benefit of also testing punch hole/discard ;). > As yet another option, it looks like fstests already has thin pool infrastructure in common/dmthin. I ran the same test using a manually created thin volume for SCRATCH_DEV instead of a loop device. I see a performance drop, but no failures so far and the test still runs in the low-to-mid 200s range, which seems reasonable enough to me. Brian > Josef
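For reference, a "manually created thin volume" for SCRATCH_DEV amounts to something like the following dmsetup sequence (device names, sizes and the 64k block size are placeholders; the common/dmthin helpers wrap essentially the same steps):

    POOL_META=/dev/sdc1   # small device for pool metadata, zeroed beforehand
    POOL_DATA=/dev/sdc2   # pool data device
    SIZE=$(blockdev --getsz "$POOL_DATA")

    # 128-sector (64k) data blocks, low water mark of 0
    dmsetup create replay-pool \
        --table "0 $SIZE thin-pool $POOL_META $POOL_DATA 128 0"
    # provision thin device id 0 and expose it as the scratch/replay target
    dmsetup message /dev/mapper/replay-pool 0 "create_thin 0"
    dmsetup create replay-thin \
        --table "0 $SIZE thin /dev/mapper/replay-pool 0"
    # SCRATCH_DEV=/dev/mapper/replay-thin

Discards from replay-log then unmap previously provisioned blocks instead of being dropped, and reads of unprovisioned blocks return zeroes, so every restore point effectively starts from a clean device.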
diff --git a/common/dmlogwrites b/common/dmlogwrites
index b9cbae7a..f2fbb733 100644
--- a/common/dmlogwrites
+++ b/common/dmlogwrites
@@ -12,6 +12,9 @@ _require_log_writes()
 	_exclude_scratch_mount_option dax
 	_require_dm_target log-writes
 	_require_test_program "log-writes/replay-log"
+	if [ "$FSTYP" == "xfs" ]; then
+		_require_scratch_xfs_nocrc
+	fi
 }
 
 # Starting from v4.15-rc1, DAX support was added to dm-log-writes, but note
@@ -30,6 +33,9 @@ _require_log_writes_dax()
 	_require_dm_target log-writes
 	_require_test_program "log-writes/replay-log"
 
+	if [ "$FSTYP" == "xfs" ]; then
+		_require_scratch_xfs_nocrc
+	fi
 	local ret=0
 	_log_writes_init
diff --git a/common/xfs b/common/xfs
index 24065813..525cb9cb 100644
--- a/common/xfs
+++ b/common/xfs
@@ -240,6 +240,17 @@ _require_scratch_xfs_crc()
 	_scratch_unmount
 }
 
+# this test requires the scratch dev to be formatted as v4
+#
+_require_scratch_xfs_nocrc()
+{
+	_scratch_mkfs_xfs >/dev/null 2>&1
+	_scratch_mount >/dev/null 2>&1
+	$XFS_INFO_PROG $SCRATCH_MNT | grep -q 'crc=0' ||
+		_notrun "XFS v5 superblocks not supported by this test"
+	_scratch_unmount
+}
+
 # this test requires the finobt feature to be available in mkfs.xfs
 #
 _require_xfs_mkfs_finobt()
diff --git a/tests/generic/482 b/tests/generic/482
index 7595aa59..7870ac5b 100755
--- a/tests/generic/482
+++ b/tests/generic/482
@@ -45,7 +45,6 @@ _require_scratch
 # and we need extra device as log device
 _require_log_writes
 
-
 nr_cpus=$("$here/src/feature" -o)
 # cap nr_cpus to 8 to avoid spending too much time on hosts with many cpus
 if [ $nr_cpus -gt 8 ]; then
The dm-log-writes mechanism runs a workload against a filesystem, tracks underlying FUAs and restores the filesystem to various points in time based on FUA marks. This allows fstests to check fs consistency at various points and verify log recovery works as expected.

This mechanism does not play well with LSN based log recovery ordering behavior on XFS v5 superblocks, however. For example, generic/482 can reproduce false positive corruptions based on extent to btree conversion of an inode if the inode and associated btree block are written back after different checkpoints. Even though both items are logged correctly in the extent-to-btree transaction, the btree block can be relogged (multiple times) and only written back once when the filesystem unmounts. If the inode was written back after the initial conversion, recovery points between that mark and when the btree block is ultimately written back will show corruption because log recovery sees that the destination buffer is newer than the recovered buffer and intentionally skips the buffer. This is a false positive because the destination buffer was resiliently written back after being physically relogged one or more times.

Update the dm-log-writes require checks to enforce v4 superblocks when running against XFS and skip the test otherwise.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 common/dmlogwrites |  6 ++++++
 common/xfs         | 11 +++++++++++
 tests/generic/482  |  1 -
 3 files changed, 17 insertions(+), 1 deletion(-)