dm-log-writes: invalidate the bdev's for both of our devices

Message ID 1511890225-16601-1-git-send-email-josef@toxicpanda.com
State Rejected, archived
Delegated to: Mike Snitzer

Commit Message

Josef Bacik Nov. 28, 2017, 5:30 p.m. UTC
From: Josef Bacik <jbacik@fb.com>

Amir noticed that sometimes the xfstests using dm-log-writes would fail
randomly but would work fine after trying again manually.  This is
because dm-log-writes writes directly to the device, but the log replay
tools read and write via the block device page cache.  Sometimes this
resulted in stale data being in the block device's page cache which
would result in random failures.  To handle this simply invalidate the
block device page cache on destruction so any replay of the log device
that follows will be forced to read the new real contents.

Reported-and-tested-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 drivers/md/dm-log-writes.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Amir Goldstein Nov. 28, 2017, 7:29 p.m. UTC | #1
On Tue, Nov 28, 2017 at 7:30 PM, Josef Bacik <josef@toxicpanda.com> wrote:
> From: Josef Bacik <jbacik@fb.com>
>
> Amir noticed that sometimes the xfstests using dm-log-writes would fail
> randomly but would work fine after trying again manually.  This is
> because dm-log-writes writes directly to the device, but the log replay
> tools read and write via the block device page cache.  Sometimes this
> resulted in stale data being in the block device's page cache which
> would result in random failures.  To handle this simply invalidate the
> block device page cache on destruction so any replay of the log device
> that follows will be forced to read the new real contents.
>
> Reported-and-tested-by: Amir Goldstein <amir73il@gmail.com>

I'm fine with the Reported-by, but let's wait a while with this patch so
I have more time to torture it.
The incidents I got even before the patch did not happen more than
a handful of times after running for a few days, so I need some more
days to validate the fix.
I had already sent you some weird output. Let's see what else comes
along.

Thanks,
Amir.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Amir Goldstein Nov. 28, 2017, 8:40 p.m. UTC | #2
On Tue, Nov 28, 2017 at 9:29 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Tue, Nov 28, 2017 at 7:30 PM, Josef Bacik <josef@toxicpanda.com> wrote:
>> From: Josef Bacik <jbacik@fb.com>
>>
>> Amir noticed that sometimes the xfstests using dm-log-writes would fail
>> randomly but would work fine after trying again manually.  This is
>> because dm-log-writes writes directly to the device, but the log replay
>> tools read and write via the block device page cache.  Sometimes this
>> resulted in stale data being in the block device's page cache which
>> would result in random failures.  To handle this simply invalidate the
>> block device page cache on destruction so any replay of the log device
>> that follows will be forced to read the new real contents.
>>
>> Reported-and-tested-by: Amir Goldstein <amir73il@gmail.com>
>
> I'm fine with the Reported-by, but let's wait a while with this patch so
> I have more time to torture it.
> The incidents I got even before the patch did not happen more than
> a handful of times after running for a few days, so I need some more
> days to validate the fix.
> I had already sent you some weird output. Let's see what else comes
> along.
>

Sorry, no cigar.
Another run just completed with Malformed log and corrupted fs

The _check_scratch_fs that fails is the one right after _log_writes_remove
just like the report that I sent before this patch
and the LOGWRITES_DEV itself has malformed entry before the "end" mark
or even the last fsync mark:

./src/log-writes/replay-log -v --log $LOGWRITES_DEV --find --end-mark
testfile1.mark17
Malformed entry @112134

For what it's worth, I am testing on spinning disks, 100G scratch dev.
Right now, I zoomed in on the following fsx seeds that managed to fail the test
a few times already, but in different ways, so I'm not sure the seeds are more
than voodoo:
seeds=(4597 4598 4599 4600)

I'll start running the same test but with fsx running on test partition, just
to get the feel for running the same fsx threads on bare xfs.

Any other ideas?

Amir.
Josef Bacik Nov. 28, 2017, 9:05 p.m. UTC | #3
On Tue, Nov 28, 2017 at 10:40:24PM +0200, Amir Goldstein wrote:
> On Tue, Nov 28, 2017 at 9:29 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> > On Tue, Nov 28, 2017 at 7:30 PM, Josef Bacik <josef@toxicpanda.com> wrote:
> >> From: Josef Bacik <jbacik@fb.com>
> >>
> >> Amir noticed that sometimes the xfstests using dm-log-writes would fail
> >> randomly but would work fine after trying again manually.  This is
> >> because dm-log-writes writes directly to the device, but the log replay
> >> tools read and write via the block device page cache.  Sometimes this
> >> resulted in stale data being in the block device's page cache which
> >> would result in random failures.  To handle this simply invalidate the
> >> block device page cache on destruction so any replay of the log device
> >> that follows will be forced to read the new real contents.
> >>
> >> Reported-and-tested-by: Amir Goldstein <amir73il@gmail.com>
> >
> > I'm fine with the Reported-by, but let's wait a while with this patch so
> > I have more time to torture it.
> > The incidents I got even before the patch did not happen more than
> > a handful of times after running for a few days, so I need some more
> > days to validate the fix.
> > I had already sent you some weird output. Let's see what else comes
> > along.
> >
> 
> Sorry, no cigar.
> Another run just completed with Malformed log and corrupted fs
> 
> The _check_scratch_fs that fails is the one right after _log_writes_remove
> just like the report that I sent before this patch
> and the LOGWRITES_DEV itself has malformed entry before the "end" mark
> or even the last fsync mark:
> 
> ./src/log-writes/replay-log -v --log $LOGWRITES_DEV --find --end-mark
> testfile1.mark17
> Malformed entry @112134
> 
> For what it's worth, I am testing on spinning disks, 100G scratch dev.
> Right now, I zoomed in on the following fsx seeds that managed to fail the test
> a few times already, but in different ways, so I'm not sure the seeds are more
> than voodoo:
> seeds=(4597 4598 4599 4600)
> 
> I'll start running the same test but with fsx running on test partition, just
> to get the feel for running the same fsx threads on bare xfs.
> 
> Any other ideas?
> 

Is there anything special about your devices?  Are they 4k drives?  The corrupt
log is not awesome, was it still corrupt after the test bailed out?  Thanks,

Josef

Amir Goldstein Nov. 28, 2017, 9:22 p.m. UTC | #4
On Tue, Nov 28, 2017 at 11:05 PM, Josef Bacik <josef@toxicpanda.com> wrote:
> On Tue, Nov 28, 2017 at 10:40:24PM +0200, Amir Goldstein wrote:
>> On Tue, Nov 28, 2017 at 9:29 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>> > On Tue, Nov 28, 2017 at 7:30 PM, Josef Bacik <josef@toxicpanda.com> wrote:
>> >> From: Josef Bacik <jbacik@fb.com>
>> >>
>> >> Amir noticed that sometimes the xfstests using dm-log-writes would fail
>> >> randomly but would work fine after trying again manually.  This is
>> >> because dm-log-writes writes directly to the device, but the log replay
>> >> tools read and write via the block device page cache.  Sometimes this
>> >> resulted in stale data being in the block device's page cache which
>> >> would result in random failures.  To handle this simply invalidate the
>> >> block device page cache on destruction so any replay of the log device
>> >> that follows will be forced to read the new real contents.
>> >>
>> >> Reported-and-tested-by: Amir Goldstein <amir73il@gmail.com>
>> >
>> > I'm fine with the Reported-by, but let's wait a while with this patch so
>> > I have more time to torture it.
>> > The incidents I got even before the patch did not happen more than
>> > a handful of times after running for a few days, so I need some more
>> > days to validate the fix.
>> > I had already sent you some weird output. Let's see what else comes
>> > along.
>> >
>>
>> Sorry, no cigar.
>> Another run just completed with Malformed log and corrupted fs
>>
>> The _check_scratch_fs that fails is the one right after _log_writes_remove
>> just like the report that I sent before this patch
>> and the LOGWRITES_DEV itself has malformed entry before the "end" mark
>> or even the last fsync mark:
>>
>> ./src/log-writes/replay-log -v --log $LOGWRITES_DEV --find --end-mark
>> testfile1.mark17
>> Malformed entry @112134
>>
>> For what it's worth, I am testing on spinning disks, 100G scratch dev.
>> Right now, I zoomed in on the following fsx seeds that managed to fail the test
>> a few times already, but in different ways, so I'm not sure the seeds are more
>> than voodoo:
>> seeds=(4597 4598 4599 4600)
>>
>> I'll start running the same test but with fsx running on test partition, just
>> to get the feel for running the same fsx threads on bare xfs.
>>
>> Any other ideas?
>>
>
> Is there anything special about your devices?  Are they 4k drives?  The corrupt
> log is not awesome, was it still corrupt after the test bailed out?  Thanks,
>

No, nothing special. Boring 4TB WD drive.
I just reported on the xfstests thread that the problem was reproduced with xfs
on the scratch partition, where dm-log-writes is not in the picture, so for now
dm-log-writes is off the hook.
I still need to explain the malformed log, but will follow the xfs corruption
lead first.

Thanks,
Amir.


Patch

diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
index 8b80a9ce9ea9..1c502930af5e 100644
--- a/drivers/md/dm-log-writes.c
+++ b/drivers/md/dm-log-writes.c
@@ -545,6 +545,8 @@  static void log_writes_dtr(struct dm_target *ti)
 		   !atomic_read(&lc->pending_blocks));
 	kthread_stop(lc->log_kthread);
 
+	invalidate_bdev(lc->logdev->bdev);
+	invalidate_bdev(lc->dev->bdev);
 	WARN_ON(!list_empty(&lc->logging_blocks));
 	WARN_ON(!list_empty(&lc->unflushed_blocks));
 	dm_put_device(ti, lc->dev);
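
For readers who want to experiment with the stale-cache theory from userspace, the sketch below shows a rough analogue of what the patch does in the kernel. It assumes a Linux system and uses the BLKFLSBUF ioctl, which writes back and then invalidates a block device's cached buffers (the same operation `blockdev --flushbufs` issues). The helper name `flush_bdev_cache` is hypothetical; it is not part of the patch, of dm-log-writes, or of xfstests, and the call normally needs root privileges on a real block device.

```c
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>   /* BLKFLSBUF */
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper (an assumption for illustration, not code from the
 * patch): flush and drop the cached buffers of the block device at `path`,
 * so that a later buffered read has to go back to the device.
 *
 * Returns 0 on success, or -1 with errno set on failure.
 */
int flush_bdev_cache(const char *path)
{
	int fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* BLKFLSBUF syncs dirty buffers, then invalidates the cached ones. */
	int ret = ioctl(fd, BLKFLSBUF, 0);
	int saved_errno = errno;

	close(fd);
	errno = saved_errno;	/* report the ioctl error, not close()'s */
	return ret < 0 ? -1 : 0;
}
```

One possible use would be to call this helper on both the log device and the target device after tearing down the dm-log-writes target and before running the replay tool. Whether that changes the failure rate is an open question; as the thread notes, the patch itself did not make the failures go away.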