[RFC] dm: add dm-power-fail target
diff mbox

Message ID 1416607231-8588-1-git-send-email-jbacik@fb.com
State Not Applicable, archived
Headers show

Commit Message

Josef Bacik Nov. 21, 2014, 10 p.m. UTC
Hello,

I'm hoping some FS guys can weigh in and verify my approach for testing power
fail conditions, and the DM guys to of course verify I didn't completely fail at
making a DM target.  All suggestions welcome, I want to have a nice robust tool
for testing power fail so we can be sure our fsync/commit code is all working
properly.  Thanks,

Josef


For power fail testing we want to be able to catch cases where the fs isn't
waiting for IO to complete properly.  This target aims to do this by creating a
linear mapping on a device and keeping track of all WRITEs and FLUSH/FUA
operations.  Any WRITEs are added to a global list on completion, then when we
submit a FLUSH this list is spliced onto the request and once that request
completes then it marks those blocks as valid.  FUA bypasses this logic
altogether and is considered valid once it completes.  There are two modes of
operation here

1) Zero - any WRITE that wasn't flushed will return 0's when we try to read from
it.  This is meant for BTRFS (or any COW fs) where we only write blocks once, so
if we try to read a block we didn't properly flush then we can just send back
0's.

2) Split - the device is split in half and written to alternating sides.  This
is for overwriting fs'es (ext*/xfs).  Once a FLUSH occurs we walk all the
completed WRITEs and set their mirror to whichever mirror they last wrote to.
Anything that wasn't flushed properly will point to its previous mirror.

We then have 3 different power fail events we can do

1) DROP_WRITES - just drop all writes immediately, any new writes are
immediately completed and outstanding ones are treated as if they never
completed.

2) DROP_AFTER_FUA - this allows a FUA to go through, and as soon as it completes
we drop all writes.  This is meant to be really evil about making sure your
commit code is doing the right thing by dropping at the worst possible moment.

3) DROP_AFTER_FLUSH - this allows a FLUSH to go through, and as soon as it
completes we drop all writes.  This can be good for testing fdatasync on
overwrite fs'es that may only issue a FLUSH and not have to update metadata.

There is also an option to return -EIO as soon as we have our power fail event
to make it easier to script stress tests.

The idea is to be as evil as possible wrt how a device's cache would react to a
powerfail event.  I had originally bolted this onto dm-flakey but it got
complicated trying to work in these new options with its original behavior so I
created a new target instead.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 drivers/md/Kconfig         |  14 +
 drivers/md/Makefile        |   1 +
 drivers/md/dm-power-fail.c | 691 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 706 insertions(+)
 create mode 100644 drivers/md/dm-power-fail.c

Comments

Zach Brown Nov. 24, 2014, 6:45 p.m. UTC | #1
On Fri, Nov 21, 2014 at 05:00:31PM -0500, Josef Bacik wrote:
> Hello,
> 
> I'm hoping some FS guys can weigh in and verify my approach for testing power
> fail conditions, and the DM guys to of course verify I didn't completely fail at
> making a DM target.  All suggestions welcome, I want to have a nice robust tool
> for testing power fail so we can be sure our fsync/commit code is all working
> properly.  Thanks,

*All* suggestions welcome?  OK, I'll put on my skeptic hat.

This implements a writeback cache in kernel data structures so that you
can race to throw away cached blocks that haven't been flushed.  How is
that meaningfully different than using an actual writeback caching dm
target and racing to invalidate it?

Using real caching dm target configurations would let you reuse their
testing and corner case handling that is, presumably, already slightly
more advanced than printk() swearing.

Implementation of this specific mechanism aside, the architectural
observation is that device ram disappearing during a power interruption
is just one, and possibly the most forgiving, outcome.  Pulling the
power out from under concurrnet cached writes can result in any
combination of them being written, or not, or scrambled to the point of
returning read errors.

If we were to justify developing a specific power failure target, I'd
like to see something that tracks write history and can replay the
history to offer a resonably exhaustive set of possible write results.
Verify *those* and you have much more confidence that the file system
can handle reading the results of its interrupted writes.

- z 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Josef Bacik Nov. 24, 2014, 7:04 p.m. UTC | #2
On 11/24/2014 01:45 PM, Zach Brown wrote:
> On Fri, Nov 21, 2014 at 05:00:31PM -0500, Josef Bacik wrote:
>> Hello,
>>
>> I'm hoping some FS guys can weigh in and verify my approach for testing power
>> fail conditions, and the DM guys to of course verify I didn't completely fail at
>> making a DM target.  All suggestions welcome, I want to have a nice robust tool
>> for testing power fail so we can be sure our fsync/commit code is all working
>> properly.  Thanks,
>
> *All* suggestions welcome?  OK, I'll put on my skeptic hat.
>
> This implements a writeback cache in kernel data structures so that you
> can race to throw away cached blocks that haven't been flushed.  How is
> that meaningfully different than using an actual writeback caching dm
> target and racing to invalidate it?

I didn't think of the dm-cache target, but do we want to add data loss 
testing code to something people actually use in production?  I feel 
like that's a recipe for disaster.  I suppose it could work, but my 
target adds some specific scenarios like blow up after FUA/FLUSH to test 
for specific races.

>
> Using real caching dm target configurations would let you reuse their
> testing and corner case handling that is, presumably, already slightly
> more advanced than printk() swearing.
>

Well that's just an unfair jab, I missed _one_ debug printk.

> Implementation of this specific mechanism aside, the architectural
> observation is that device ram disappearing during a power interruption
> is just one, and possibly the most forgiving, outcome.  Pulling the
> power out from under concurrnet cached writes can result in any
> combination of them being written, or not, or scrambled to the point of
> returning read errors.
>

So that's what this target tries to get around, we don't care about what 
happens to the data that is outstanding, we act like it never happened, 
because if we are going to rely on reading that data at all then we've 
already lost.  So simply acting like it disappeared is just as bad as it 
being garbage or returning an EIO (though returning an EIO would be 
pretty cool too and easily be added to the target).

> If we were to justify developing a specific power failure target, I'd
> like to see something that tracks write history and can replay the
> history to offer a resonably exhaustive set of possible write results.
> Verify *those* and you have much more confidence that the file system
> can handle reading the results of its interrupted writes.

This sounds like a pretty cool idea, it would be weird trying to order 
everything out though to catch problems where we don't properly wait on 
IO to complete before we do flushing.  You'd probably have to keep track 
of when things were submitted and when they completed in the log in 
order to replay them in a way to expose problems with the flushing.  But 
you're right it would allow us to more exhaustively test all different 
scenarios.  Thanks,

Josef

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Zach Brown Nov. 24, 2014, 7:57 p.m. UTC | #3
> >This implements a writeback cache in kernel data structures so that you
> >can race to throw away cached blocks that haven't been flushed.  How is
> >that meaningfully different than using an actual writeback caching dm
> >target and racing to invalidate it?
> 
> I didn't think of the dm-cache target, but do we want to add data loss
> testing code to something people actually use in production?  I feel like
> that's a recipe for disaster.  I suppose it could work, but my target adds
> some specific scenarios like blow up after FUA/FLUSH to test for specific
> races.

I don't know if we'd even need code changes.  Can't you forcibly fiddle
with the target tables to remove the caching target at any point?  No
hablo dm.

> >Using real caching dm target configurations would let you reuse their
> >testing and corner case handling that is, presumably, already slightly
> >more advanced than printk() swearing.
> >
> 
> Well that's just an unfair jab, I missed _one_ debug printk.

And it was a hilarious printk :).

> >If we were to justify developing a specific power failure target, I'd
> >like to see something that tracks write history and can replay the
> >history to offer a resonably exhaustive set of possible write results.
> >Verify *those* and you have much more confidence that the file system
> >can handle reading the results of its interrupted writes.
> 
> This sounds like a pretty cool idea, it would be weird trying to order
> everything out though to catch problems where we don't properly wait on IO
> to complete before we do flushing.  You'd probably have to keep track of
> when things were submitted and when they completed in the log in order to
> replay them in a way to expose problems with the flushing.  But you're right
> it would allow us to more exhaustively test all different scenarios.

Well, I think it'd be more about tracking write submission and flush
completion to maintain sets of writes that could have become persistent
in any order.  Then you provide an interface for iterating over devices
that represent possible persistent outcomes.

Say you have a tree of flush events and each flush has a tree of blocks
that were dirty at the time of the flush.  After the flush you can walk
the blocks and record their tree position (or maintain them with the
_augmented callbacks.)

Then each device full of possible outcomes can be described by the flush
event and a giant bitmap with a few bits { .written, .corrupt } for each
block version in the flush.  Satisfy reads of a block by walking back
through the flushes.  Blocks in the current flush look up their tree
position in the device state bitmap to find their fate.   The most
recent dirty block in completed flushes is used, otherwise the backing
device is used if you're building from an existing known state.

Iterate over possible device states of write outcomes by adding bits
with carry in the giant bitmap.  (complexity++ for using the bitmaps to
represent which of multiple versions of one block should be used..)

Something like that, anyway.  Email is easy :).

It'd be interesting to see how far a simple prototype could go that
keeps everything in memory and has sane static limits on how much
history it tracks.

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Josef Bacik Nov. 24, 2014, 8:15 p.m. UTC | #4
On 11/24/2014 02:57 PM, Zach Brown wrote:
>>> This implements a writeback cache in kernel data structures so that you
>>> can race to throw away cached blocks that haven't been flushed.  How is
>>> that meaningfully different than using an actual writeback caching dm
>>> target and racing to invalidate it?
>>
>> I didn't think of the dm-cache target, but do we want to add data loss
>> testing code to something people actually use in production?  I feel like
>> that's a recipe for disaster.  I suppose it could work, but my target adds
>> some specific scenarios like blow up after FUA/FLUSH to test for specific
>> races.
>
> I don't know if we'd even need code changes.  Can't you forcibly fiddle
> with the target tables to remove the caching target at any point?  No
> hablo dm.
>
>>> Using real caching dm target configurations would let you reuse their
>>> testing and corner case handling that is, presumably, already slightly
>>> more advanced than printk() swearing.
>>>
>>
>> Well that's just an unfair jab, I missed _one_ debug printk.
>
> And it was a hilarious printk :).
>
>>> If we were to justify developing a specific power failure target, I'd
>>> like to see something that tracks write history and can replay the
>>> history to offer a resonably exhaustive set of possible write results.
>>> Verify *those* and you have much more confidence that the file system
>>> can handle reading the results of its interrupted writes.
>>
>> This sounds like a pretty cool idea, it would be weird trying to order
>> everything out though to catch problems where we don't properly wait on IO
>> to complete before we do flushing.  You'd probably have to keep track of
>> when things were submitted and when they completed in the log in order to
>> replay them in a way to expose problems with the flushing.  But you're right
>> it would allow us to more exhaustively test all different scenarios.
>
> Well, I think it'd be more about tracking write submission and flush
> completion to maintain sets of writes that could have become persistent
> in any order.  Then you provide an interface for iterating over devices
> that represent possible persistent outcomes.
>
> Say you have a tree of flush events and each flush has a tree of blocks
> that were dirty at the time of the flush.  After the flush you can walk
> the blocks and record their tree position (or maintain them with the
> _augmented callbacks.)
>
> Then each device full of possible outcomes can be described by the flush
> event and a giant bitmap with a few bits { .written, .corrupt } for each
> block version in the flush.  Satisfy reads of a block by walking back
> through the flushes.  Blocks in the current flush look up their tree
> position in the device state bitmap to find their fate.   The most
> recent dirty block in completed flushes is used, otherwise the backing
> device is used if you're building from an existing known state.
>
> Iterate over possible device states of write outcomes by adding bits
> with carry in the giant bitmap.  (complexity++ for using the bitmaps to
> represent which of multiple versions of one block should be used..)
>
> Something like that, anyway.  Email is easy :).
>
> It'd be interesting to see how far a simple prototype could go that
> keeps everything in memory and has sane static limits on how much
> history it tracks.
>

That is way complicated, I was just going to take two devices, one 
that's a linear mapping and the other that's the log, and then write to 
the log the sector+data that was written in order that it completes, and 
then have userspace do the replay.  So basically do the flush tracking 
like I am, then write out chunks to the log device to keep a semblance 
of how the flushing would have affected stuff, something like this

write a, write b, a complete, flush, b complete, flush complete

would log out

wrote a, flush, write b, <other writes>, <next flush>

and then we have a userspace thing that could do something like replay 
all writes to a flush, do fs consistency and data consistency checks, 
walk to the next flush, rinse repeat, and that way we could be sure that 
we always have a consistent fs.  This would make it easier to check 
complex fs operations (like btrfs's balance) without having to come up 
with special hacks in those operations to check them.  I like this 
better because it's less DM code which means less swearing printks, but 
whichever we think will be the best thing for this sort of testing.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chris Mason Nov. 24, 2014, 8:18 p.m. UTC | #5
On Mon, Nov 24, 2014 at 3:15 PM, Josef Bacik <jbacik@fb.com> wrote:
> On 11/24/2014 02:57 PM, Zach Brown wrote:
> 
> That is way complicated, I was just going to take two devices, one 
> that's a linear mapping and the other that's the log, and then write 
> to the log the sector+data that was written in order that it 
> completes, and then have userspace do the replay.  So basically do 
> the flush tracking like I am, then write out chunks to the log device 
> to keep a semblance of how the flushing would have affected stuff, 
> something like this
> 
> write a, write b, a complete, flush, b complete, flush complete
> 
> would log out
> 
> wrote a, flush, write b, <other writes>, <next flush>
> 
> and then we have a userspace thing that could do something like 
> replay all writes to a flush, do fs consistency and data consistency 
> checks, walk to the next flush, rinse repeat, and that way we could 
> be sure that we always have a consistent fs.  This would make it 
> easier to check complex fs operations (like btrfs's balance) without 
> having to come up with special hacks in those operations to check 
> them.  I like this better because it's less DM code which means less 
> swearing printks, but whichever we think will be the best thing for 
> this sort of testing.  Thanks,

I vote for whatever is the easiest to fit in our little programmer 
brains.  The more complex the tool, the less we'll trust it.

-chris



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Zach Brown Nov. 24, 2014, 10:10 p.m. UTC | #6
On Mon, Nov 24, 2014 at 03:15:25PM -0500, Josef Bacik wrote:
> On 11/24/2014 02:57 PM, Zach Brown wrote:
> >>>This implements a writeback cache in kernel data structures so that you
> >>>can race to throw away cached blocks that haven't been flushed.  How is
> >>>that meaningfully different than using an actual writeback caching dm
> >>>target and racing to invalidate it?
> >>
> >>I didn't think of the dm-cache target, but do we want to add data loss
> >>testing code to something people actually use in production?  I feel like
> >>that's a recipe for disaster.  I suppose it could work, but my target adds
> >>some specific scenarios like blow up after FUA/FLUSH to test for specific
> >>races.
> >
> >I don't know if we'd even need code changes.  Can't you forcibly fiddle
> >with the target tables to remove the caching target at any point?  No
> >hablo dm.
> >
> >>>Using real caching dm target configurations would let you reuse their
> >>>testing and corner case handling that is, presumably, already slightly
> >>>more advanced than printk() swearing.
> >>>
> >>
> >>Well that's just an unfair jab, I missed _one_ debug printk.
> >
> >And it was a hilarious printk :).
> >
> >>>If we were to justify developing a specific power failure target, I'd
> >>>like to see something that tracks write history and can replay the
> >>>history to offer a resonably exhaustive set of possible write results.
> >>>Verify *those* and you have much more confidence that the file system
> >>>can handle reading the results of its interrupted writes.
> >>
> >>This sounds like a pretty cool idea, it would be weird trying to order
> >>everything out though to catch problems where we don't properly wait on IO
> >>to complete before we do flushing.  You'd probably have to keep track of
> >>when things were submitted and when they completed in the log in order to
> >>replay them in a way to expose problems with the flushing.  But you're right
> >>it would allow us to more exhaustively test all different scenarios.
> >
> >Well, I think it'd be more about tracking write submission and flush
> >completion to maintain sets of writes that could have become persistent
> >in any order.  Then you provide an interface for iterating over devices
> >that represent possible persistent outcomes.
> >
> >Say you have a tree of flush events and each flush has a tree of blocks
> >that were dirty at the time of the flush.  After the flush you can walk
> >the blocks and record their tree position (or maintain them with the
> >_augmented callbacks.)
> >
> >Then each device full of possible outcomes can be described by the flush
> >event and a giant bitmap with a few bits { .written, .corrupt } for each
> >block version in the flush.  Satisfy reads of a block by walking back
> >through the flushes.  Blocks in the current flush look up their tree
> >position in the device state bitmap to find their fate.   The most
> >recent dirty block in completed flushes is used, otherwise the backing
> >device is used if you're building from an existing known state.
> >
> >Iterate over possible device states of write outcomes by adding bits
> >with carry in the giant bitmap.  (complexity++ for using the bitmaps to
> >represent which of multiple versions of one block should be used..)
> >
> >Something like that, anyway.  Email is easy :).
> >
> >It'd be interesting to see how far a simple prototype could go that
> >keeps everything in memory and has sane static limits on how much
> >history it tracks.
> >
> 
> That is way complicated, I was just going to take two devices, one that's a
> linear mapping and the other that's the log, and then write to the log the
> sector+data that was written in order that it completes, and then have
> userspace do the replay.  So basically do the flush tracking like I am, then
> write out chunks to the log device to keep a semblance of how the flushing
> would have affected stuff, something like this
> 
> write a, write b, a complete, flush, b complete, flush complete
> 
> would log out
> 
> wrote a, flush, write b, <other writes>, <next flush>
> 
> and then we have a userspace thing that could do something like replay all
> writes to a flush, do fs consistency and data consistency checks, walk to
> the next flush, rinse repeat, and that way we could be sure that we always
> have a consistent fs.

I guess that'd be an ok start, but I don't think you need any clever
kernel code to do that.  I've hacked up something like this in bash with
blktrace, loopback files, and dd :/.

What I'm trying to say with this thread is that I think that only
testing persistence in the order of submission or completion, and
especially only around flushes, makes life too easy for the fs.  It
doesn't reflect the real device state that users can be stuck with.  For
example, I think we should test only b being written in that first
sequence you describe.

Maybe I'll throw something together to try and demonstrate what I'm on
about. 

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Josef Bacik Nov. 24, 2014, 10:21 p.m. UTC | #7
On 11/24/2014 05:10 PM, Zach Brown wrote:
> On Mon, Nov 24, 2014 at 03:15:25PM -0500, Josef Bacik wrote:
>> On 11/24/2014 02:57 PM, Zach Brown wrote:
>>>>> This implements a writeback cache in kernel data structures so that you
>>>>> can race to throw away cached blocks that haven't been flushed.  How is
>>>>> that meaningfully different than using an actual writeback caching dm
>>>>> target and racing to invalidate it?
>>>>
>>>> I didn't think of the dm-cache target, but do we want to add data loss
>>>> testing code to something people actually use in production?  I feel like
>>>> that's a recipe for disaster.  I suppose it could work, but my target adds
>>>> some specific scenarios like blow up after FUA/FLUSH to test for specific
>>>> races.
>>>
>>> I don't know if we'd even need code changes.  Can't you forcibly fiddle
>>> with the target tables to remove the caching target at any point?  No
>>> hablo dm.
>>>
>>>>> Using real caching dm target configurations would let you reuse their
>>>>> testing and corner case handling that is, presumably, already slightly
>>>>> more advanced than printk() swearing.
>>>>>
>>>>
>>>> Well that's just an unfair jab, I missed _one_ debug printk.
>>>
>>> And it was a hilarious printk :).
>>>
>>>>> If we were to justify developing a specific power failure target, I'd
>>>>> like to see something that tracks write history and can replay the
>>>>> history to offer a resonably exhaustive set of possible write results.
>>>>> Verify *those* and you have much more confidence that the file system
>>>>> can handle reading the results of its interrupted writes.
>>>>
>>>> This sounds like a pretty cool idea, it would be weird trying to order
>>>> everything out though to catch problems where we don't properly wait on IO
>>>> to complete before we do flushing.  You'd probably have to keep track of
>>>> when things were submitted and when they completed in the log in order to
>>>> replay them in a way to expose problems with the flushing.  But you're right
>>>> it would allow us to more exhaustively test all different scenarios.
>>>
>>> Well, I think it'd be more about tracking write submission and flush
>>> completion to maintain sets of writes that could have become persistent
>>> in any order.  Then you provide an interface for iterating over devices
>>> that represent possible persistent outcomes.
>>>
>>> Say you have a tree of flush events and each flush has a tree of blocks
>>> that were dirty at the time of the flush.  After the flush you can walk
>>> the blocks and record their tree position (or maintain them with the
>>> _augmented callbacks.)
>>>
>>> Then each device full of possible outcomes can be described by the flush
>>> event and a giant bitmap with a few bits { .written, .corrupt } for each
>>> block version in the flush.  Satisfy reads of a block by walking back
>>> through the flushes.  Blocks in the current flush look up their tree
>>> position in the device state bitmap to find their fate.   The most
>>> recent dirty block in completed flushes is used, otherwise the backing
>>> device is used if you're building from an existing known state.
>>>
>>> Iterate over possible device states of write outcomes by adding bits
>>> with carry in the giant bitmap.  (complexity++ for using the bitmaps to
>>> represent which of multiple versions of one block should be used..)
>>>
>>> Something like that, anyway.  Email is easy :).
>>>
>>> It'd be interesting to see how far a simple prototype could go that
>>> keeps everything in memory and has sane static limits on how much
>>> history it tracks.
>>>
>>
>> That is way complicated, I was just going to take two devices, one that's a
>> linear mapping and the other that's the log, and then write to the log the
>> sector+data that was written in order that it completes, and then have
>> userspace do the replay.  So basically do the flush tracking like I am, then
>> write out chunks to the log device to keep a semblance of how the flushing
>> would have affected stuff, something like this
>>
>> write a, write b, a complete, flush, b complete, flush complete
>>
>> would log out
>>
>> wrote a, flush, write b, <other writes>, <next flush>
>>
>> and then we have a userspace thing that could do something like replay all
>> writes to a flush, do fs consistency and data consistency checks, walk to
>> the next flush, rinse repeat, and that way we could be sure that we always
>> have a consistent fs.
>
> I guess that'd be an ok start, but I don't think you need any clever
> kernel code to do that.  I've hacked up something like this in bash with
> blktrace, loopback files, and dd :/.

I don't think blktrace gives us the data being written though does it? 
If it does then hooray I'm done playing a device mapper developer.

>
> What I'm trying to say with this thread is that I think that only
> testing persistence in the order of submission or completion, and
> especially only around flushes, makes life too easy for the fs.  It
> doesn't reflect the real device state that users can be stuck with.  For
> example, I think we should test only b being written in that first
> sequence you describe.
>
> Maybe I'll throw something together to try and demonstrate what I'm on
> about.
>

Sure it's a really simple test, but what I currently have rigged up just 
does random writes+fsync and then uses the -EIO part of dm-power-fail. 
Then when my test gets an EIO it stops, saves the good file, unmounts 
and remounts the fs, and checks the good file against what is on the 
disk.  Nobody passes this test.  Btrfs, xfs and ext4 all fail at some 
point, takes an hour or two but eventually each one of them fall over. 
Now this could just be a bug in the test somewhere, but I'm pretty sure 
I've shaken all the bugs out.

With the logging approach then it is completely up to us how we replay 
the log, so we can always go back and do more horrible things with the 
replay, like replay for a while, skip a flush and write some of the next 
random crap and see what happens.  Doing horrible things is awesome and 
that is what I want, but I also want to make sure we're not failing in 
the simple things too.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Zach Brown Nov. 24, 2014, 10:35 p.m. UTC | #8
> >>That is way complicated, I was just going to take two devices, one that's a
> >>linear mapping and the other that's the log, and then write to the log the
> >>sector+data that was written in order that it completes, and then have
> >>userspace do the replay.  So basically do the flush tracking like I am, then
> >>write out chunks to the log device to keep a semblance of how the flushing
> >>would have affected stuff, something like this
> >>
> >>write a, write b, a complete, flush, b complete, flush complete
> >>
> >>would log out
> >>
> >>wrote a, flush, write b, <other writes>, <next flush>
> >>
> >>and then we have a userspace thing that could do something like replay all
> >>writes to a flush, do fs consistency and data consistency checks, walk to
> >>the next flush, rinse repeat, and that way we could be sure that we always
> >>have a consistent fs.
> >
> >I guess that'd be an ok start, but I don't think you need any clever
> >kernel code to do that.  I've hacked up something like this in bash with
> >blktrace, loopback files, and dd :/.
> 
> I don't think blktrace gives us the data being written though does it? If it
> does then hooray I'm done playing a device mapper developer.

Right, that script was only working with the dumb simple case of using
manual double buffering and simple test loads that wouldn't overwrite
blocks.

> With the logging approach then it is completely up to us how we replay the
> log, so we can always go back and do more horrible things with the replay,
> like replay for a while, skip a flush and write some of the next random crap
> and see what happens.  Doing horrible things is awesome and that is what I
> want, but I also want to make sure we're not failing in the simple things
> too.  Thanks,

Sure, that's fair.  So maybe expand blktrace to include the data
payloads? :).

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 5bdedf6..bc3d6ca 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -432,4 +432,18 @@  config DM_SWITCH
 
 	  If unsure, say N.
 
+config DM_POWER_FAIL
+	tristate "Power fail target support"
+	depends on BLK_DEV_DM
+	---help---
+	  This device-mapper target creates a device that can be used for
+	  testing a file systems ability to survive power fails.  There are
+	  several modes of operation in order to test a variety of power fail
+	  scenarios.
+
+	  To compile this code as a module, choos M here: the module will be
+	  called dm-power-fail.
+
+	  If unsure, say N.
+
 endif # MD
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index a2da532..c667218 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -55,6 +55,7 @@  obj-$(CONFIG_DM_CACHE)		+= dm-cache.o
 obj-$(CONFIG_DM_CACHE_MQ)	+= dm-cache-mq.o
 obj-$(CONFIG_DM_CACHE_CLEANER)	+= dm-cache-cleaner.o
 obj-$(CONFIG_DM_ERA)		+= dm-era.o
+obj-$(CONFIG_DM_POWER_FAIL)	+= dm-power-fail.o
 
 ifeq ($(CONFIG_DM_UEVENT),y)
 dm-mod-objs			+= dm-uevent.o
diff --git a/drivers/md/dm-power-fail.c b/drivers/md/dm-power-fail.c
new file mode 100644
index 0000000..44c5045
--- /dev/null
+++ b/drivers/md/dm-power-fail.c
@@ -0,0 +1,691 @@ 
+/*
+ * Copyright (C) 2014 Facebook. All rights reserved.
+ *
+ * This file is released under the GPL.
+ */
+
+#include <linux/device-mapper.h>
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/slab.h>
+
+#define DM_MSG_PREFIX "power-fail"
+
+/*
+ * The way this interface is meant to be used is like this
+ *
+ * dmsetup create powerfail
+ * mkfs /dev/powerfail
+ * mount /dev/powerfail /mnt/test
+ * do some stuff &
+ * sleep 30
+ * dmsetup message powerfail (drop_after_flush|drop_after_fua|drop_writes)
+ * umount /mnt/test
+ * dmsetup message powerfail redirect_reads
+ * fsck /dev/powerfail || exit 1
+ * mount /dev/powerfail /mnt/test
+ * <verify contents>
+ *
+ * You can set redirect_reads whenever you want, but the idea is that you want
+ * the teardown stuff to work like normal, and then flip the switch for us to
+ * start returning garbage on any writes that would have failed, and then do
+ * verification checks.  A perfectly functioning fs will recover properly and
+ * not give you IO errors and such.
+ *
+ * There are two modes for this target.
+ *
+ * zero - any write IO's that are lost once the power fail event happens will
+ * return 0's when read after redirect_reads is set.  This is meant for btrfs or
+ * any other future COW fs that comes along.
+ *
+ * split - we split the device in half, and write to alternating sides of the
+ * device.  We track which is the last good mirror to have completed.  Then
+ * whenever the power fail event occurs we will stop updating the last good
+ * mirror for writes and when redirect_reads is turned on we will read from the
+ * last good mirror.
+ */
+
+struct pfail_ctx {
+	struct dm_dev *dev;
+	spinlock_t pending_blocks_lock;
+	struct rb_root pending_blocks;
+	struct list_head unflushed_blocks;
+	long endio_delay;
+	unsigned long flags;
+	bool split;
+	bool error;
+};
+
+enum pfail_flag_bits {
+	DROP_WRITES,
+	DROP_AFTER_FUA,
+	DROP_AFTER_FLUSH,
+	REDIRECT_READS,
+};
+
+struct pending_block {
+	struct rb_node node;
+	struct list_head list;
+	unsigned long long bytenr;
+	unsigned bytes;
+	unsigned good_mirror;
+	unsigned cur_mirror;
+	unsigned rewrite;
+};
+
+struct per_bio_data {
+	struct pending_block *block;
+	struct pfail_ctx *pc;
+	struct work_struct work;
+	int error;
+	bool track;
+};
+
+static int contains(unsigned long long a, unsigned a_len,
+		    unsigned long long b, unsigned b_len)
+{
+	if (a >= (b + b_len))
+		return 0;
+	if (b >= (a + a_len))
+		return 0;
+	return 1;
+}
+
+static struct pending_block *insert_pending_block(struct pfail_ctx *pc,
+						  struct pending_block *block)
+{
+	struct rb_node **p = &pc->pending_blocks.rb_node;
+	struct rb_node *parent = NULL;
+	struct pending_block *entry;
+
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct pending_block, node);
+
+		if (contains(block->bytenr, block->bytes, entry->bytenr,
+			     entry->bytes)) {
+			if (!pc->split) {
+				/*
+				 * With zero mode we free up blocks once they
+				 * successfully complete, and assume the fs
+				 * doesn't write to the same block until it has
+				 * been completely written, so if this happens
+				 * we have a problem.
+				 */
+				DMERR("existing block %llu-%u insert "
+				      "%llu-%u", entry->bytenr, entry->bytes,
+				       block->bytenr, block->bytes);
+			} else if (block->bytes > entry->bytes) {
+				printk(KERN_ERR "well shit\n");
+			}
+			kfree(block);
+			return entry;
+		} else if (entry->bytenr > block->bytenr)
+			p = &(*p)->rb_left;
+		else if (entry->bytenr < block->bytenr)
+			p = &(*p)->rb_right;
+	}
+
+	rb_link_node(&block->node, parent, p);
+	rb_insert_color(&block->node, &pc->pending_blocks);
+	return block;
+}
+
+static struct pending_block *find_pending_block(struct pfail_ctx *pc,
+						unsigned long long bytenr,
+						unsigned bytes)
+{
+	struct rb_node *n = pc->pending_blocks.rb_node;
+	struct pending_block *block;
+
+	while (n) {
+		block = rb_entry(n, struct pending_block, node);
+		if (contains(block->bytenr, block->bytes, bytenr, bytes))
+			return block;
+		else if (block->bytenr > bytenr)
+			n = n->rb_left;
+		else if (block->bytenr < bytenr)
+			n = n->rb_right;
+	}
+	return NULL;
+}
+
+static int parse_pfail_features(struct dm_arg_set *as, struct pfail_ctx *pc,
+				struct dm_target *ti)
+{
+	const char *arg_name;
+	unsigned argc;
+	int ret;
+	static struct dm_arg _args[] = {
+		{0, 3, "Invalid number of power fail feature arguments"},
+	};
+
+	if (!as->argc)
+		return 0;
+	ret = dm_read_arg_group(_args, as, &argc, &ti->error);
+	if (ret)
+		return -EINVAL;
+
+	while (argc) {
+		arg_name = dm_shift_arg(as);
+		argc--;
+
+		if (!strcasecmp(arg_name, "split")) {
+			pc->split = true;
+		} else if (!strcasecmp(arg_name, "error_on_fail")) {
+			pc->error = true;
+		} else {
+			ti->error = "Unrecognized power fail feature "
+				"requested";
+			ret = -EINVAL;
+			break;
+		}
+	}
+
+	return ret;
+}
+
+/*
+ * Construct a power-fail mapping:
+ * power-fail <dev_path> <endio_delay> [<#feature args [<arg>]*]
+ *
+ * endio_delay is in jiffies, if it is 0 there will be no delay.  This is
+ * helpful for widening suspected races in your waiting logic.
+ *
+ * Option feature arguments are:
+ *
+ * split - Meant for overwrite fs'es where we need to return the old busted data
+ *	   for reads on uncompleted blocks.  Must pass in a length of >= half of
+ *	   the disk so that we can use one side for pending blocks and one side
+ *	   for completed blocks
+ * error_on_fail - Instead of just pretending that the writes completed normally
+ *		   after the failure event, we will return -EIO.
+ */
+static int pfail_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+{
+	struct pfail_ctx *pc;
+	struct dm_arg_set as;
+	const char *devname;
+	sector_t sectors;
+	long endio_delay;
+	char dummy;
+	int ret;
+
+	as.argc = argc;
+	as.argv = argv;
+
+	if (argc < 2) {
+		ti->error = "Invalid argument count";
+		return -EINVAL;
+	}
+
+	pc = kzalloc(sizeof(*pc), GFP_KERNEL);
+	if (!pc) {
+		ti->error = "Cannot allocate context";
+		return -ENOMEM;
+	}
+	pc->pending_blocks = RB_ROOT;
+	spin_lock_init(&pc->pending_blocks_lock);
+	INIT_LIST_HEAD(&pc->unflushed_blocks);
+
+	devname = dm_shift_arg(&as);
+	if (sscanf(dm_shift_arg(&as), "%ld%c", &endio_delay, &dummy) != 1) {
+		ti->error = "Invalid endio delay";
+		goto bad;
+	}
+	pc->endio_delay = endio_delay;
+	ret = parse_pfail_features(&as, pc, ti);
+	if (ret)
+		goto bad;
+
+	if (dm_get_device(ti, devname, dm_table_get_mode(ti->table), &pc->dev)) {
+		ti->error = "Device lookup failed";
+		goto bad;
+	}
+
+	sectors = PAGE_CACHE_SIZE >> 9;
+	if (pc->split)
+		sectors = 512 >> 9;
+	ret = dm_set_target_max_io_len(ti, sectors);
+	if (ret)
+		goto bad;
+
+	sectors = (i_size_read(pc->dev->bdev->bd_inode) >> 9) - ti->begin;
+	sectors /= 2;
+	if (pc->split && ti->len > sectors) {
+		ti->error = "Using split but specified a device size that is "
+			"too large";
+		dm_put_device(ti, pc->dev);
+		goto bad;
+	}
+
+	ti->num_flush_bios = 1;
+	ti->flush_supported = true;
+	ti->num_discard_bios = 1;
+	ti->per_bio_data_size = sizeof(struct per_bio_data);
+	ti->private = pc;
+	return 0;
+
+bad:
+	kfree(pc);
+	return -EINVAL;
+}
+
+static void pfail_dtr(struct dm_target *ti)
+{
+	struct pfail_ctx *pc = ti->private;
+	struct pending_block *block;
+	struct rb_node *n;
+
+	dm_put_device(ti, pc->dev);
+	while ((n = rb_last(&pc->pending_blocks)) != NULL) {
+		block = rb_entry(n, struct pending_block, node);
+		rb_erase(&block->node, &pc->pending_blocks);
+		kfree(block);
+	}
+	kfree(pc);
+}
+
+static sector_t pfail_map_sector(struct dm_target *ti, sector_t bi_sector,
+				 int mirror)
+{
+	WARN_ON(bi_sector >= ti->len);
+	if (mirror == 2)
+		bi_sector += ti->len;
+	return dm_target_offset(ti, bi_sector);
+}
+
+static void pfail_map_bio(struct dm_target *ti, struct bio *bio, int mirror)
+{
+	struct pfail_ctx *pc = ti->private;
+
+	bio->bi_bdev = pc->dev->bdev;
+	if (bio_sectors(bio))
+		bio->bi_iter.bi_sector =
+			pfail_map_sector(ti, bio->bi_iter.bi_sector, mirror);
+}
+
+static int maybe_zero_bio(struct pfail_ctx *pc, struct bio *bio, int *mirror)
+{
+	struct pending_block *block;
+	char *data = bio_data(bio);
+	unsigned long long bytenr = bio->bi_iter.bi_sector << 9;
+	int good_mirror, cur_mirror;
+	int ret = 0;
+
+	if (!bio_has_data(bio))
+		return 0;
+
+	spin_lock(&pc->pending_blocks_lock);
+	block = find_pending_block(pc, bytenr, bio_cur_bytes(bio));
+	if (block) {
+		good_mirror = block->good_mirror;
+		cur_mirror = block->cur_mirror;
+	}
+	spin_unlock(&pc->pending_blocks_lock);
+	if (!block)
+		return 0;
+
+	if (test_bit(REDIRECT_READS, &pc->flags)) {
+		if (pc->split) {
+			*mirror = good_mirror;
+		} else {
+			memset(data, 0, bio_cur_bytes(bio));
+			ret = 1;
+		}
+	} else if (pc->split) {
+		*mirror = cur_mirror;
+	}
+
+	return ret;
+}
+
+static unsigned bio_bytes(struct bio *bio)
+{
+	if (!bio_has_data(bio))
+		return 0;
+
+	return bio_cur_bytes(bio);
+}
+
+static int pfail_map(struct dm_target *ti, struct bio *bio)
+{
+	struct pfail_ctx *pc = ti->private;
+	struct per_bio_data *pb = dm_per_bio_data(bio, sizeof(struct per_bio_data));
+	struct pending_block *block;
+	unsigned mirror = 0;
+	bool fua_bio = (bio->bi_rw & REQ_FUA);
+	bool flush_bio = (bio->bi_rw & REQ_FLUSH);
+
+	if (bio_sectors(bio) || flush_bio)
+		pb->track = true;
+	else
+		pb->track = false;
+	pb->pc = pc;
+	pb->block = NULL;
+
+	/*
+	 * Map reads as normal.
+	 */
+	if (bio_data_dir(bio) == READ) {
+		if (maybe_zero_bio(pc, bio, &mirror)) {
+			bio_endio(bio, 0);
+			return DM_MAPIO_SUBMITTED;
+		}
+		goto map_bio;
+	}
+
+	if (test_bit(DROP_WRITES, &pc->flags)) {
+		if (pc->error)
+			return -EIO;
+		pb->track = false;
+		bio_endio(bio, 0);
+		return DM_MAPIO_SUBMITTED;
+	}
+
+	/*
+	 * If we have just a naked REQ_FLUSH then we need to do alloc a dummy
+	 * block to hold the unflushed blocks.
+	 */
+	if (flush_bio && !fua_bio) {
+		struct pending_block *block;
+
+		block = kzalloc(sizeof(struct pending_block), GFP_NOIO);
+		if (!block) {
+			DMERR("Error allocating pending block");
+			return -ENOMEM;
+		}
+		INIT_LIST_HEAD(&block->list);
+
+		/*
+		 * IMPORTANT NOTE FOR FS DEVELOPERS!!!!
+		 *
+		 * We only take blocks that have already come through the end_io
+		 * handler as flushable blocks.  Anything that is still
+		 * outstanding IO at this point is assumed to be unflushable.
+		 */
+		spin_lock(&pc->pending_blocks_lock);
+		list_splice_init(&pc->unflushed_blocks, &block->list);
+		spin_unlock(&pc->pending_blocks_lock);
+		pb->block = block;
+	} else if (bio_data_dir(bio) == WRITE && bio_sectors(bio)) {
+		if (!pc->split && fua_bio)
+			goto map_bio;
+
+		block = kzalloc(sizeof(struct pending_block), GFP_NOIO);
+		if (!block) {
+			DMERR("Error allocating pending block");
+			return -ENOMEM;
+		}
+		block->bytenr = bio->bi_iter.bi_sector << 9;
+		block->bytes = bio_bytes(bio);
+		INIT_LIST_HEAD(&block->list);
+		spin_lock(&pc->pending_blocks_lock);
+		block = insert_pending_block(pc, block);
+		if (pc->split) {
+			if (block->good_mirror == 0 ||
+			    block->good_mirror == 1)
+				block->cur_mirror = 2;
+			else
+				block->cur_mirror = 1;
+			mirror = block->cur_mirror;
+		}
+		pb->block = block;
+
+		/*
+		 * If we do REQ_FLUSH|REQ_FUA we'll have an actual block and we
+		 * want to do the unflushed blocks dance here.
+		 */
+		if (flush_bio)
+			list_splice_init(&pc->unflushed_blocks, &block->list);
+		spin_unlock(&pc->pending_blocks_lock);
+	}
+map_bio:
+	pfail_map_bio(ti, bio, mirror);
+	return DM_MAPIO_REMAPPED;
+}
+
+static void pfail_write_end_io(struct work_struct *work)
+{
+	struct per_bio_data *pb = container_of(work, struct per_bio_data,
+					       work);
+	struct bio *bio;
+	struct pfail_ctx *pc = pb->pc;
+	struct pending_block *block;
+	bool flush, fua;
+
+	bio = dm_bio_from_per_bio_data(pb, sizeof(struct per_bio_data));
+	flush = bio->bi_rw & REQ_FLUSH;
+	fua = bio->bi_rw & REQ_FUA;
+	pb->track = false;
+	block = pb->block;
+
+	if (pc->endio_delay)
+		schedule_timeout(pc->endio_delay);
+
+	if (test_bit(DROP_WRITES, &pc->flags)) {
+		if (pc->error && !pb->error)
+			pb->error = -EIO;
+		if (flush && block && block->bytes == 0)
+			kfree(block);
+		goto out;
+	}
+
+	if (fua && test_bit(DROP_AFTER_FUA, &pc->flags)) {
+		if (test_and_set_bit(DROP_WRITES, &pc->flags)) {
+			if (pc->error && !pb->error)
+				pb->error = -EIO;
+			goto out;
+		}
+	}
+
+	if (flush && test_bit(DROP_AFTER_FLUSH, &pc->flags)) {
+		if (test_and_set_bit(DROP_WRITES, &pc->flags)) {
+			if (pc->error && !pb->error)
+				pb->error = -EIO;
+			goto out;
+		}
+	}
+
+	if (flush) {
+		struct pending_block *tmp, *n;
+
+		if (!block)
+			goto out;
+
+		spin_lock(&pc->pending_blocks_lock);
+		list_for_each_entry_safe(tmp, n, &block->list, list) {
+			list_del_init(&tmp->list);
+			if (pc->split) {
+				tmp->good_mirror = tmp->cur_mirror;
+			} else {
+				rb_erase(&tmp->node, &pc->pending_blocks);
+				kfree(tmp);
+			}
+			cond_resched_lock(&pc->pending_blocks_lock);
+		}
+		if (block->bytes != 0 && pc->split) {
+			block->good_mirror = block->cur_mirror;
+		} else if (block->bytes == 0) {
+			kfree(block);
+		}
+		spin_unlock(&pc->pending_blocks_lock);
+	} else if (fua) {
+		/*
+		 * FUA goes straight to disk, if it completes then the cur
+		 * mirror is the good mirror.
+		 */
+		if (pc->split) {
+			spin_lock(&pc->pending_blocks_lock);
+			block->good_mirror = block->cur_mirror;
+			spin_unlock(&pc->pending_blocks_lock);
+		}
+	} else if (block) {
+		spin_lock(&pc->pending_blocks_lock);
+		list_move_tail(&block->list, &pc->unflushed_blocks);
+		spin_unlock(&pc->pending_blocks_lock);
+	}
+out:
+	bio_endio(bio, pb->error);
+}
+
+static int pfail_end_io(struct dm_target *ti, struct bio *bio, int error)
+{
+	struct pfail_ctx *pc = ti->private;
+	struct per_bio_data *pb = dm_per_bio_data(bio, sizeof(struct per_bio_data));
+
+	if (bio_data_dir(bio) == WRITE && pb->track &&
+	    !test_bit(DROP_WRITES, &pc->flags)) {
+		pb->error = error;
+
+		atomic_inc(&bio->bi_remaining);
+		INIT_WORK(&pb->work, pfail_write_end_io);
+		schedule_work(&pb->work);
+		return DM_ENDIO_INCOMPLETE;
+	}
+
+	return error;
+}
+
+static void pfail_status(struct dm_target *ti, status_type_t type,
+			 unsigned status_flags, char *result, unsigned maxlen)
+{
+	unsigned sz = 0;
+	struct pfail_ctx *pc = ti->private;
+
+	switch (type) {
+	case STATUSTYPE_INFO:
+		result[0] = '\0';
+		break;
+
+	case STATUSTYPE_TABLE:
+		DMEMIT("%s %ld ", pc->dev->name, pc->endio_delay);
+
+		DMEMIT("%u ", pc->split + pc->error);
+
+		if (pc->split)
+			DMEMIT("split ");
+		if (pc->error)
+			DMEMIT("error_on_fail ");
+		break;
+	}
+}
+
+static int pfail_ioctl(struct dm_target *ti, unsigned int cmd, unsigned long arg)
+{
+	struct pfail_ctx *pc = ti->private;
+	struct dm_dev *dev = pc->dev;
+	int r = 0;
+
+	/*
+	 * Only pass ioctls through if the device sizes match exactly.
+	 */
+	if (ti->len != i_size_read(dev->bdev->bd_inode) >> SECTOR_SHIFT)
+		r = scsi_verify_blk_ioctl(NULL, cmd);
+
+	return r ? : __blkdev_driver_ioctl(dev->bdev, dev->mode, cmd, arg);
+}
+
+static int pfail_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
+			struct bio_vec *biovec, int max_size)
+{
+	struct pfail_ctx *pc = ti->private;
+	struct request_queue *q = bdev_get_queue(pc->dev->bdev);
+
+	if (!q->merge_bvec_fn)
+		return max_size;
+
+	bvm->bi_bdev = pc->dev->bdev;
+	bvm->bi_sector = pfail_map_sector(ti, bvm->bi_sector, 0);
+
+	return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
+}
+
+static int pfail_iterate_devices(struct dm_target *ti,
+				 iterate_devices_callout_fn fn, void *data)
+{
+	struct pfail_ctx *pc = ti->private;
+
+	return fn(ti, pc->dev, 0, ti->len, data);
+}
+
+/*
+ * Valid messages
+ *
+ * drop_after_fua - drop all writes after the next seen fua write.
+ * drop_after_flush - drop all writes after the next seen flush write.
+ * drop_writes - drop all writes from now on.
+ * redirect_reads - start returning 0's/old data for reads done on unflushed
+ *	blocks.
+ * allow_writes - allow writes to start back up again.
+ */
+static int pfail_message(struct dm_target *ti, unsigned argc, char **argv)
+{
+	struct pfail_ctx *pc = ti->private;
+
+	if (argc != 1) {
+		DMWARN("Invalid power-fail message arguments, expect 1 "
+		       "argument, got %d", argc);
+		return -EINVAL;
+	}
+
+	if (!strcasecmp(argv[0], "redirect_reads")) {
+		set_bit(REDIRECT_READS, &pc->flags);
+	} else if (!strcasecmp(argv[0], "drop_after_fua")) {
+		set_bit(DROP_AFTER_FUA, &pc->flags);
+	} else if (!strcasecmp(argv[0], "drop_after_flush")) {
+		set_bit(DROP_AFTER_FLUSH, &pc->flags);
+	} else if (!strcasecmp(argv[0], "drop_writes")) {
+		set_bit(DROP_WRITES, &pc->flags);
+	} else if (!strcasecmp(argv[0], "allow_writes")) {
+		clear_bit(DROP_WRITES, &pc->flags);
+		clear_bit(DROP_AFTER_FUA, &pc->flags);
+		clear_bit(DROP_AFTER_FLUSH, &pc->flags);
+	} else {
+		DMWARN("Invalid argument %s", argv[0]);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static struct target_type pfail_target = {
+	.name   = "power-fail",
+	.version = {1, 3, 1},
+	.module = THIS_MODULE,
+	.ctr    = pfail_ctr,
+	.dtr    = pfail_dtr,
+	.map    = pfail_map,
+	.end_io = pfail_end_io,
+	.status = pfail_status,
+	.ioctl	= pfail_ioctl,
+	.merge	= pfail_merge,
+	.message = pfail_message,
+	.iterate_devices = pfail_iterate_devices,
+};
+
+static int __init dm_pfail_init(void)
+{
+	int r = dm_register_target(&pfail_target);
+
+	if (r < 0)
+		DMERR("register failed %d", r);
+
+	return r;
+}
+
+static void __exit dm_pfail_exit(void)
+{
+	dm_unregister_target(&pfail_target);
+}
+
+/* Module hooks */
+module_init(dm_pfail_init);
+module_exit(dm_pfail_exit);
+
+MODULE_DESCRIPTION(DM_NAME " power fail target");
+MODULE_AUTHOR("Josef Bacik <jbacik@fb.com>");
+MODULE_LICENSE("GPL");