diff mbox series

[RFC] dm-bow working prototype

Message ID 20181023212358.60292-1-paullawrence@google.com (mailing list archive)
State Rejected, archived
Delegated to: Mike Snitzer
Headers show
Series [RFC] dm-bow working prototype | expand

Commit Message

Paul Lawrence Oct. 23, 2018, 9:23 p.m. UTC
bow == backup on write

Similar to dm-snap, add the ability to take a snapshot of a device.
Unlike dm-snap, a separate volume is not required.

dm-bow can be in one of three states.

In state one, the free blocks on the device are identified by issuing
an FSTRIM to the filesystem.

In state two, any writes cause the overwritten data to be backup up
to the available free space. While in this state, the device can be
restored by unmounting the filesystem, removing the dm-bow device
and running a usermode tool over the underlying device.

In state three, the changes are committed, dm-bow is in pass-through
mode and the drive can no longer be restored.

It is planned to use this driver to enable restoration of a failed
update attempt on Android devices using ext4.

Test: Can boot Android with userdata mounted on this device. Can commit
userdata after SUW has run. Can then reboot, make changes and roll back.

Known issues:

Recovery tool needed

Remove BUG_ONs

Assumtion is that bios are always page aligned. This is not in general
true. Fix to work with file systems which use 1k blocks.

Mutex is held around entire flush operation, including lengthy I/O. Plan
is to convert to state machine with pending queues.

Add a linked list of free ranges to avoid linear searches of the tree.
Similarly sector0_current should be stored in bow_context. There should
be no linear searches of the map.

Interaction with block encryption is unknown, especially with respect
to sector 0.

Signed-off-by: Paul Lawrence <paullawrence@google.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shaohua Li <shli@kernel.org>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-raid@vger.kernel.org

---
 Documentation/device-mapper/dm-bow.txt |  103 +++
 drivers/md/Kconfig                     |   12 +
 drivers/md/Makefile                    |    1 +
 drivers/md/dm-bow.c                    | 1033 ++++++++++++++++++++++++
 4 files changed, 1149 insertions(+)
 create mode 100644 Documentation/device-mapper/dm-bow.txt
 create mode 100644 drivers/md/dm-bow.c

Comments

Alasdair G Kergon Oct. 23, 2018, 10:18 p.m. UTC | #1
On Tue, Oct 23, 2018 at 02:23:28PM -0700, Paul Lawrence wrote:
> It is planned to use this driver to enable restoration of a failed
> update attempt on Android devices using ext4.

Could you say a bit more about the reason for this new dm target so we
can understand better what parameters you are trying to optimise and
within what new constraints?  What are the failure modes that you need
to handle better by using this?  (We can guess some answers, but it
would better if you can lay them out so we don't need to make
assumptions.)

Alasdair

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Paul Lawrence Oct. 24, 2018, 6:42 p.m. UTC | #2
Android has had the concept of A/B updates for since Android N, which 
means that if an update is unable to boot for any reason three times, we 
revert to the older system. However, if the failure occurs after the new 
system has started modifying userdata, we will be attempting to start an 
older system with a newer userdata, which is an unsupported state. Thus 
to make A/B able to fully deliver on its promise of safe updates, we 
need to be able to revert userdata in the event of a failure.

For those cases where the file system on userdata supports 
snapshots/checkpoints, we should clearly use them. However, there are 
many Android devices using filesystems that do not support checkpoints, 
so we need a generic solution. Here we had two options. One was to use 
overlayfs to manage the changes, then on merge have a script that copies 
the files to the underlying fs. This was rejected on the grounds of 
compatibility concerns and managing the merge through reboots, though it 
is definitely a plausible strategy. The second was to work at the block 
layer.

At the block layer, dm-snap would have given us a ready-made solution, 
except that there is no sufficiently large spare partition on Android 
devices. But in general there is free space on userdata, just scattered 
over the device, and of course likely to get modified as soon as 
userdata is written to. We also decided that the merge phase was a high 
risk component of any design. Since the normal path is that the update 
succeeds, we anticipate merges happening 99% of the time, and we want to 
guarantee their success even in the event of unexpected failure during 
the merge. Thus we decided we preferred a strategy where the device is 
in the committed state at all times, and rollback requires work, to one 
where the device remains in the original state but the merge is complex.


On 10/23/2018 03:18 PM, Alasdair G Kergon wrote:
> On Tue, Oct 23, 2018 at 02:23:28PM -0700, Paul Lawrence wrote:
>> It is planned to use this driver to enable restoration of a failed
>> update attempt on Android devices using ext4.
> Could you say a bit more about the reason for this new dm target so we
> can understand better what parameters you are trying to optimise and
> within what new constraints?  What are the failure modes that you need
> to handle better by using this?  (We can guess some answers, but it
> would better if you can lay them out so we don't need to make
> assumptions.)
>
> Alasdair
>

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Mikulas Patocka Oct. 24, 2018, 7:24 p.m. UTC | #3
On Wed, 24 Oct 2018, Paul Lawrence wrote:

> Android has had the concept of A/B updates for since Android N, which means
> that if an update is unable to boot for any reason three times, we revert to
> the older system. However, if the failure occurs after the new system has
> started modifying userdata, we will be attempting to start an older system
> with a newer userdata, which is an unsupported state. Thus to make A/B able to
> fully deliver on its promise of safe updates, we need to be able to revert
> userdata in the event of a failure.
> 
> For those cases where the file system on userdata supports
> snapshots/checkpoints, we should clearly use them. However, there are many
> Android devices using filesystems that do not support checkpoints, so we need
> a generic solution. Here we had two options. One was to use overlayfs to
> manage the changes, then on merge have a script that copies the files to the
> underlying fs. This was rejected on the grounds of compatibility concerns and
> managing the merge through reboots, though it is definitely a plausible
> strategy. The second was to work at the block layer.
> 
> At the block layer, dm-snap would have given us a ready-made solution, except
> that there is no sufficiently large spare partition on Android devices. But in
> general there is free space on userdata, just scattered over the device, and
> of course likely to get modified as soon as userdata is written to. We also
> decided that the merge phase was a high risk component of any design. Since
> the normal path is that the update succeeds, we anticipate merges happening
> 99% of the time, and we want to guarantee their success even in the event of
> unexpected failure during the merge. Thus we decided we preferred a strategy
> where the device is in the committed state at all times, and rollback requires
> work, to one where the device remains in the original state but the merge is
> complex.

What about allocating a big file, using the FIEMAP ioctl to find the 
physical locations of the file, creating a dm device with many linear 
targets to map the big file and using it as a snapshot store? I think it 
would be way easier than re-implementing the snapshot functionality in a 
new target.

You can mount the whole filesystem using the "origin" target and you can 
attach a "snapshot" target that uses the mapped big file as its snapshot 
store - all writes will be placed directly to the device and the old data 
will be copied to the snapshot store in the big file.

If you decide that rollback is no longer needed, you just unload the 
snapshot target and delete the big file. If you decide that you want to 
rollback, you can use the snapshot merge functionality (or you can write a 
userspace utility that does offline merge).

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Alasdair G Kergon Oct. 25, 2018, 12:01 a.m. UTC | #4
On Wed, Oct 24, 2018 at 03:24:29PM -0400, Mikulas Patocka wrote:
> What about allocating a big file, using the FIEMAP ioctl to find the 

For reference, dmfilemapd in the lvm2 tree (in daemons/) uses FIEMAP
(in libdm/libdm-stats.c) for monitoring I/O by file.

> If you decide that rollback is no longer needed, you just unload the 
> snapshot target and delete the big file. If you decide that you want to 
> rollback, you can use the snapshot merge functionality (or you can write a 
> userspace utility that does offline merge).
 
There's some old code from Mark McLoughlin for userspace snapshot merging here:
  https://people.gnome.org/~markmc/code/merge-dm-snapshot.c

Alasdair

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Bryn M. Reeves Oct. 25, 2018, 10:20 a.m. UTC | #5
On Wed, Oct 24, 2018 at 03:24:29PM -0400, Mikulas Patocka wrote:
> 
> 
> On Wed, 24 Oct 2018, Paul Lawrence wrote:
> 
> > Android has had the concept of A/B updates for since Android N, which means
> > that if an update is unable to boot for any reason three times, we revert to
> > the older system. However, if the failure occurs after the new system has
> > started modifying userdata, we will be attempting to start an older system
> > with a newer userdata, which is an unsupported state. Thus to make A/B able to
> > fully deliver on its promise of safe updates, we need to be able to revert
> > userdata in the event of a failure.
> > 
> > For those cases where the file system on userdata supports
> > snapshots/checkpoints, we should clearly use them. However, there are many
> > Android devices using filesystems that do not support checkpoints, so we need
> > a generic solution. Here we had two options. One was to use overlayfs to
> > manage the changes, then on merge have a script that copies the files to the
> > underlying fs. This was rejected on the grounds of compatibility concerns and
> > managing the merge through reboots, though it is definitely a plausible
> > strategy. The second was to work at the block layer.
> > 
> > At the block layer, dm-snap would have given us a ready-made solution, except
> > that there is no sufficiently large spare partition on Android devices. But in
> > general there is free space on userdata, just scattered over the device, and
> > of course likely to get modified as soon as userdata is written to. We also
> > decided that the merge phase was a high risk component of any design. Since
> > the normal path is that the update succeeds, we anticipate merges happening
> > 99% of the time, and we want to guarantee their success even in the event of
> > unexpected failure during the merge. Thus we decided we preferred a strategy
> > where the device is in the committed state at all times, and rollback requires
> > work, to one where the device remains in the original state but the merge is
> > complex.
> 
> What about allocating a big file, using the FIEMAP ioctl to find the 
> physical locations of the file, creating a dm device with many linear 
> targets to map the big file and using it as a snapshot store? I think it 
> would be way easier than re-implementing the snapshot functionality in a 
> new target.

libdevmapper already has code to handle enumerating physical file
extents via the dm-stats file mapping support. It should be fairly easy
to adapt this to create dm tables rather than dm-stats regions.

See dm_stats_create_regions_from_fd() and _stats_map_file_regions().

Bryn.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
MegaBrutal Oct. 25, 2018, 4:30 p.m. UTC | #6
Paul Lawrence <paullawrence@google.com> ezt írta (időpont: 2018. okt.
23., K, 23:27):
>
> bow == backup on write
>
> Similar to dm-snap, add the ability to take a snapshot of a device.
> Unlike dm-snap, a separate volume is not required.
>

The concept intrigued me, so I actually went on to try your prototype.
I could apply it on v4.12 mainline (newer kernel versions introduce
changes in "struct bio" in "include/linux/blk_types.h" those don't let
the module compile – I think minor changes would be necessary to adapt
to the new struct, though I didn't go into that).

My test scenario:
On a KVM, I created a 64M partition and formatted it to ext4, then put
some random files on it and unmounted the FS. I then called "dmsetup
create bowdev --table "0 131072 bow /dev/vdb1"". The
"/dev/mapper/bowdev" file appeared as expected. I mounted it in
read-only mode ("mount -vo ro /dev/mapper/bowdev /mnt") and run
"fstrim -v /mnt". At this point, I tried to advance to STATE 1 ("echo
1 > /sys/block/dm-2/bow/state"), but I got a kernel BUG alert. The
STATE did not change. I unmounted bowdev and removed the device
("dmsetup remove bowdev") which resulted in 2 subsequent kernel
alerts. The device disappeared but it brought the kernel to an
unstable state (various actions, like sync or trying to recreate the
bow device, resulted in a hang). I could not get any further than
this. I attached all the 3 kernel alerts in "dm-bow.dmesg.log".

I have some questions about dm-bow:
– How file system agnostic this feature is planned to be? While it is
designed with ext4 in mind, is it going to work when used over other
file systems, like FAT or BTRFS for example?
– Especially that BTRFS uses a CoW mechanism for even overwriting
files (overwritten segments are written to a free area and only then
gets the old data freed – except some specific conditions when
NO_COW/nodatacow is involved). Won't BTRFS CoW mechanism confuse BoW,
e.g. BTRFS will try to use space that BoW wants to use for backups?
Note however, using BoW on BTRFS wouldn't have much point, since BTRFS
has built-in features for snapshots. This leads me to my next
question.
– Why don't you just use BTRFS on Android? It basically provides a
similar feature like BoW, and it is matured enough, switching
snapshots are easy, etc.. However I see why it wouldn't be feasible
for you, e.g. it is slower than ext4, which would matter for an
Android device.
– What if you run out of free disk space while updating? I guess you
can just revert to the original state with BoW, but an update might
require more disk space with BoW (and this is a thing, my Android
always complains about not having enough space).
– Can I really expect dm-bow to work on non-Android systems (like I
tried it on an Ubuntu KVM)?
– Do you have any prototype for the command line utility to be used
for recovery?


MegaBrutal
[  174.206735] atkbd serio0: Unknown key pressed (translated set 2, code 0x63 on isa0060/serio0).
[  174.206745] atkbd serio0: Use 'setkeycodes 63 <keycode>' to make it known.
[  174.215981] atkbd serio0: Unknown key released (translated set 2, code 0x63 on isa0060/serio0).
[  174.215983] atkbd serio0: Use 'setkeycodes 63 <keycode>' to make it known.
[  187.976875] EXT4-fs (dm-2): mounted filesystem with ordered data mode. Opts: (null)
[  231.633634] device-mapper: bow: Switching to state Checkpoint
[  231.639541] ------------[ cut here ]------------
[  231.639543] kernel BUG at drivers/md/dm-bow.c:224!
[  231.639990] invalid opcode: 0000 [#1] SMP
[  231.640320] Modules linked in: dm_bow dm_bufio nls_iso8859_1 snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm joydev snd_timer snd ppdev input_leds soundcore serio_raw parport_pc parport qemu_fw_cfg mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid qxl ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops psmouse virtio_blk ahci libahci floppy i2c_piix4 pata_acpi drm virtio_net
[  231.642197] CPU: 1 PID: 1182 Comm: bash Not tainted 4.12.0-dmbow #109
[  231.642496] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
[  231.642860] task: ffff9f37568ad500 task.stack: ffffb925405c4000
[  231.645551] RIP: 0010:sector_to_page.part.5+0x9/0x10 [dm_bow]
[  231.645938] RSP: 0018:ffffb925405c7d20 EFLAGS: 00010202
[  231.646229] RAX: ffff9f3756abaf00 RBX: 0000000000000000 RCX: 0000000000000000
[  231.646489] RDX: 0000000000000000 RSI: 0000000000001242 RDI: ffff9f375e71bf00
[  231.646750] RBP: ffffb925405c7d20 R08: 00000035eeca0000 R09: 0000000000000000
[  231.647077] R10: 0000000000000000 R11: 0000000000000278 R12: ffff9f3756aba9c0
[  231.647345] R13: 0000000000000000 R14: 0000000000000000 R15: ffff9f37568c5000
[  231.647628] FS:  00007f9f1b8a0740(0000) GS:ffff9f375e700000(0000) knlGS:0000000000000000
[  231.648077] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  231.648501] CR2: 00007f08c431af78 CR3: 00000000168ff000 CR4: 00000000000006e0
[  231.648929] Call Trace:
[  231.649417]  copy_data+0x16f/0x1e0 [dm_bow]
[  231.649794]  state_store+0x216/0x320 [dm_bow]
[  231.650127]  ? do_wp_page+0x135/0x4e0
[  231.650427]  ? __handle_mm_fault+0x889/0xf70
[  231.650704]  kobj_attr_store+0xf/0x20
[  231.651012]  sysfs_kf_write+0x37/0x40
[  231.651268]  kernfs_fop_write+0x11c/0x1a0
[  231.651527]  __vfs_write+0x28/0x130
[  231.651767]  ? handle_mm_fault+0xd8/0x230
[  231.652127]  ? rw_verify_area+0x4e/0xb0
[  231.652392]  vfs_write+0xb1/0x1a0
[  231.652648]  SyS_write+0x46/0xa0
[  231.652893]  entry_SYSCALL_64_fastpath+0x1e/0xa9
[  231.653111] RIP: 0033:0x7f9f1af75154
[  231.653394] RSP: 002b:00007ffeccba8db8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  231.653692] RAX: ffffffffffffffda RBX: 00007f9f1b251760 RCX: 00007f9f1af75154
[  231.654009] RDX: 0000000000000002 RSI: 000055e168555390 RDI: 0000000000000001
[  231.654324] RBP: 0000000000000001 R08: 000000000000000a R09: 0000000000000001
[  231.654618] R10: 000000000000000a R11: 0000000000000246 R12: 0000000000000001
[  231.654929] R13: 000055e168448781 R14: 0000000000000000 R15: 0000000000000000
[  231.655163] Code: 7e 67 ad f3 48 85 c0 75 0f eb 15 48 89 c7 e8 ff 68 ad f3 48 85 c0 74 08 83 78 20 01 75 ed 5d c3 0f 0b 66 66 66 66 90 55 48 89 e5 <0f> 0b 0f 1f 44 00 00 66 66 66 66 90 55 48 89 e5 41 57 41 56 41 
[  231.655619] RIP: sector_to_page.part.5+0x9/0x10 [dm_bow] RSP: ffffb925405c7d20
[  231.655896] ---[ end trace 4a06f43bc2732b69 ]---
[  352.245478] ------------[ cut here ]------------
[  352.245914] WARNING: CPU: 0 PID: 1233 at drivers/md/dm-bufio.c:1513 dm_bufio_client_destroy+0x23d/0x250 [dm_bufio]
[  352.246244] Modules linked in: dm_bow dm_bufio nls_iso8859_1 snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm joydev snd_timer snd ppdev input_leds soundcore serio_raw parport_pc parport qemu_fw_cfg mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid qxl ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops psmouse virtio_blk ahci libahci floppy i2c_piix4 pata_acpi drm virtio_net
[  352.247976] CPU: 0 PID: 1233 Comm: dmsetup Tainted: G      D         4.12.0-dmbow #109
[  352.248213] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
[  352.248456] task: ffff9f3756af1540 task.stack: ffffb92540884000
[  352.248713] RIP: 0010:dm_bufio_client_destroy+0x23d/0x250 [dm_bufio]
[  352.248953] RSP: 0018:ffffb92540887be0 EFLAGS: 00010246
[  352.249255] RAX: ffff9f375b2be618 RBX: ffff9f375673dc00 RCX: 0000000000000001
[  352.249569] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9f375673dc00
[  352.249878] RBP: ffffb92540887c10 R08: ffffffffc04ff430 R09: 0000000000000001
[  352.250195] R10: ffffb92540887bb0 R11: 00000000000001be R12: ffff9f375673dc20
[  352.250481] R13: ffff9f375692e000 R14: 0000000000000000 R15: ffff9f375673dc20
[  352.250799] FS:  00007fda1d2d7040(0000) GS:ffff9f375e600000(0000) knlGS:0000000000000000
[  352.251117] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  352.251408] CR2: 00007ffdc1142ff8 CR3: 0000000016fb3000 CR4: 00000000000006f0
[  352.251699] Call Trace:
[  352.252013]  dm_bow_dtr+0x5f/0x90 [dm_bow]
[  352.252320]  dm_table_destroy+0x63/0x120
[  352.252580]  __dm_destroy+0x141/0x230
[  352.252828]  dm_destroy+0x13/0x20
[  352.253092]  dev_remove+0xde/0x120
[  352.253425]  ? remove_all+0x30/0x30
[  352.253693]  ctl_ioctl+0x1c2/0x4a0
[  352.253919]  dm_ctl_ioctl+0x13/0x20
[  352.254153]  do_vfs_ioctl+0x92/0x5d0
[  352.254383]  ? ____fput+0xe/0x10
[  352.254630]  ? task_work_run+0x7b/0x90
[  352.254837]  SyS_ioctl+0x79/0x90
[  352.255034]  entry_SYSCALL_64_fastpath+0x1e/0xa9
[  352.255238] RIP: 0033:0x7fda1cb755d7
[  352.255430] RSP: 002b:00007ffe974ee9b8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
[  352.255630] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fda1cb755d7
[  352.255828] RDX: 0000561e91675c90 RSI: 00000000c138fd04 RDI: 0000000000000003
[  352.256026] RBP: 00007ffe974eea50 R08: 00007fda1ceac120 R09: 00007ffe974ee820
[  352.256246] R10: 0000561e91675d40 R11: 0000000000000206 R12: 0000561e90d6b8e0
[  352.256504] R13: 00007ffe974eed20 R14: 0000000000000000 R15: 0000000000000000
[  352.256840] Code: 0f 84 70 ff ff ff 0f 0b 31 f6 48 c7 c7 78 f4 4f c0 e8 fa 11 8a f3 48 8b 53 48 48 85 d2 75 c4 48 83 7b 40 00 75 e0 e9 4b ff ff ff <0f> ff e9 7b ff ff ff 66 90 66 2e 0f 1f 84 00 00 00 00 00 66 66 
[  352.257464] ---[ end trace 4a06f43bc2732b6a ]---
[  352.257929] device-mapper: bufio: leaked buffer 0, hold count 1, list 0
[  352.258711] ------------[ cut here ]------------
[  352.259094] kernel BUG at drivers/md/dm-bufio.c:1529!
[  352.259384] invalid opcode: 0000 [#2] SMP
[  352.259708] Modules linked in: dm_bow dm_bufio nls_iso8859_1 snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm joydev snd_timer snd ppdev input_leds soundcore serio_raw parport_pc parport qemu_fw_cfg mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid qxl ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops psmouse virtio_blk ahci libahci floppy i2c_piix4 pata_acpi drm virtio_net
[  352.261620] CPU: 0 PID: 1233 Comm: dmsetup Tainted: G      D W       4.12.0-dmbow #109
[  352.261964] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
[  352.262285] task: ffff9f3756af1540 task.stack: ffffb92540884000
[  352.262552] RIP: 0010:dm_bufio_client_destroy+0x1b3/0x250 [dm_bufio]
[  352.262808] RSP: 0018:ffffb92540887be0 EFLAGS: 00010293
[  352.263085] RAX: ffff9f375b2be618 RBX: ffff9f375673dc00 RCX: 0000000000000006
[  352.263333] RDX: 0000000000000001 RSI: 0000000000000096 RDI: ffff9f375e60dcc0
[  352.263554] RBP: ffffb92540887c10 R08: ffffffffc04ff430 R09: 00000000000002b3
[  352.263773] R10: ffffb92540887bb0 R11: 00000000ffffffff R12: ffff9f375673dc40
[  352.264018] R13: ffff9f375673dc08 R14: 0000000000000001 R15: ffff9f375673dc20
[  352.264273] FS:  00007fda1d2d7040(0000) GS:ffff9f375e600000(0000) knlGS:0000000000000000
[  352.264497] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  352.264734] CR2: 00007ffdc1142ff8 CR3: 0000000016fb3000 CR4: 00000000000006f0
[  352.264998] Call Trace:
[  352.265248]  dm_bow_dtr+0x5f/0x90 [dm_bow]
[  352.265520]  dm_table_destroy+0x63/0x120
[  352.265771]  __dm_destroy+0x141/0x230
[  352.266008]  dm_destroy+0x13/0x20
[  352.266321]  dev_remove+0xde/0x120
[  352.266657]  ? remove_all+0x30/0x30
[  352.267000]  ctl_ioctl+0x1c2/0x4a0
[  352.267344]  dm_ctl_ioctl+0x13/0x20
[  352.267700]  do_vfs_ioctl+0x92/0x5d0
[  352.268071]  ? ____fput+0xe/0x10
[  352.268375]  ? task_work_run+0x7b/0x90
[  352.268658]  SyS_ioctl+0x79/0x90
[  352.268901]  entry_SYSCALL_64_fastpath+0x1e/0xa9
[  352.269121] RIP: 0033:0x7fda1cb755d7
[  352.269361] RSP: 002b:00007ffe974ee9b8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
[  352.269599] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fda1cb755d7
[  352.269820] RDX: 0000561e91675c90 RSI: 00000000c138fd04 RDI: 0000000000000003
[  352.270038] RBP: 00007ffe974eea50 R08: 00007fda1ceac120 R09: 00007ffe974ee820
[  352.270253] R10: 0000561e91675d40 R11: 0000000000000206 R12: 0000561e90d6b8e0
[  352.270488] R13: 00007ffe974eed20 R14: 0000000000000000 R15: 0000000000000000
[  352.270739] Code: 48 8b 7b 78 e8 cf b2 e0 f3 48 89 df e8 17 d5 90 f3 48 83 c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 41 be 01 00 00 00 e9 b4 fe ff ff <0f> 0b 0f 0b 0f 0b 0f 0b 84 d2 74 7e 4c 8d 68 e8 41 8b 55 40 49 
[  352.271217] RIP: dm_bufio_client_destroy+0x1b3/0x250 [dm_bufio] RSP: ffffb92540887be0
[  352.271489] ---[ end trace 4a06f43bc2732b6b ]---
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Paul Lawrence Oct. 25, 2018, 5:23 p.m. UTC | #7
Thank you for the suggestion. I spent part of yesterday experimenting 
with this idea, and it is certainly very promising. However, it does 
have some disadvantages as compared to dm-bow, if I am understanding the 
setup correctly:

1) Since dm-snap has no concept of the free space on the underlying file 
system any write into the free space will trigger a backup, so using 
twice the space of dm-bow. Changing existing data will create a backup 
with both drivers, but since we have to reserve the space for the 
backups up-front with dm-snap, we would likely only have half the space 
for that. Either way, it seems that dm-bow is likely to double the 
amount of changes we could make.

(Might it be possible to dynamically resize the backup file if it is 
mostly used up? This would fix the problem of only having half the space 
for changing existing data. The documentation seems to indicate that you 
can increase the size of the snapshot partition, and it seems like it 
should be possible to grow the underlying file without triggering a lot 
of writes. OTOH this would have to happen in userspace which creates 
other issues.)

2) Similarly, since writes into free space do not trigger a backup in 
dm-bow, dm-bow is likely to have a lower performance overhead in many 
circumstances. On the flip side, dm-bow's backup is in free space and 
will collide with other writes, so this advantage will reduce as free 
space fills up. But by choosing a suitable algorithm for how we use free 
space we might be able to retain most of this advantage.

I intend to put together a fully working prototype of your suggestion 
next to better compare with dm-bow. But I do believe there is value in 
tracking free space and utilizing it in any such solution.


On 10/24/2018 12:24 PM, Mikulas Patocka wrote:
>
> On Wed, 24 Oct 2018, Paul Lawrence wrote:
>
>> Android has had the concept of A/B updates for since Android N, which means
>> that if an update is unable to boot for any reason three times, we revert to
>> the older system. However, if the failure occurs after the new system has
>> started modifying userdata, we will be attempting to start an older system
>> with a newer userdata, which is an unsupported state. Thus to make A/B able to
>> fully deliver on its promise of safe updates, we need to be able to revert
>> userdata in the event of a failure.
>>
>> For those cases where the file system on userdata supports
>> snapshots/checkpoints, we should clearly use them. However, there are many
>> Android devices using filesystems that do not support checkpoints, so we need
>> a generic solution. Here we had two options. One was to use overlayfs to
>> manage the changes, then on merge have a script that copies the files to the
>> underlying fs. This was rejected on the grounds of compatibility concerns and
>> managing the merge through reboots, though it is definitely a plausible
>> strategy. The second was to work at the block layer.
>>
>> At the block layer, dm-snap would have given us a ready-made solution, except
>> that there is no sufficiently large spare partition on Android devices. But in
>> general there is free space on userdata, just scattered over the device, and
>> of course likely to get modified as soon as userdata is written to. We also
>> decided that the merge phase was a high risk component of any design. Since
>> the normal path is that the update succeeds, we anticipate merges happening
>> 99% of the time, and we want to guarantee their success even in the event of
>> unexpected failure during the merge. Thus we decided we preferred a strategy
>> where the device is in the committed state at all times, and rollback requires
>> work, to one where the device remains in the original state but the merge is
>> complex.
> What about allocating a big file, using the FIEMAP ioctl to find the
> physical locations of the file, creating a dm device with many linear
> targets to map the big file and using it as a snapshot store? I think it
> would be way easier than re-implementing the snapshot functionality in a
> new target.
>
> You can mount the whole filesystem using the "origin" target and you can
> attach a "snapshot" target that uses the mapped big file as its snapshot
> store - all writes will be placed directly to the device and the old data
> will be copied to the snapshot store in the big file.
>
> If you decide that rollback is no longer needed, you just unload the
> snapshot target and delete the big file. If you decide that you want to
> rollback, you can use the snapshot merge functionality (or you can write a
> userspace utility that does offline merge).
>
> Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Paul Lawrence Oct. 25, 2018, 6:13 p.m. UTC | #8
> The concept intrigued me, so I actually went on to try your prototype.
> I could apply it on v4.12 mainline (newer kernel versions introduce
> changes in "struct bio" in "include/linux/blk_types.h" those don't let
> the module compile – I think minor changes would be necessary to adapt
> to the new struct, though I didn't go into that).
>
> My test scenario:
> On a KVM, I created a 64M partition and formatted it to ext4, then put
> some random files on it and unmounted the FS. I then called "dmsetup
> create bowdev --table "0 131072 bow /dev/vdb1"". The
> "/dev/mapper/bowdev" file appeared as expected. I mounted it in
> read-only mode ("mount -vo ro /dev/mapper/bowdev /mnt") and run
> "fstrim -v /mnt". At this point, I tried to advance to STATE 1 ("echo
> 1 > /sys/block/dm-2/bow/state"), but I got a kernel BUG alert. The
> STATE did not change. I unmounted bowdev and removed the device
> ("dmsetup remove bowdev") which resulted in 2 subsequent kernel
> alerts. The device disappeared but it brought the kernel to an
> unstable state (various actions, like sync or trying to recreate the
> bow device, resulted in a hang). I could not get any further than
> this. I attached all the 3 kernel alerts in "dm-bow.dmesg.log".
This BUG_ON is caused if your file system writes blocks in sizes less 
than your page size. I will fix that before I attempt to upstream this 
driver assuming it gets accepted. If you can make your file system have 
4k blocks, you should be able to proceed (I hit this when I created a 
16MB ext4 fs on a loopback device)
> I have some questions about dm-bow:
> – How file system agnostic this feature is planned to be? While it is
> designed with ext4 in mind, is it going to work when used over other
> file systems, like FAT or BTRFS for example?
So long as the file system supports fstrim, it should work. If the file 
system creates a lot of churn say by running garbage collection, I'd not 
recommend it. And I really don't see the use case if the file system has 
any sort of snapshot capability - that will always be a superior 
solution to a block level one IMO.
> – Especially that BTRFS uses a CoW mechanism for even overwriting
> files (overwritten segments are written to a free area and only then
> gets the old data freed – except some specific conditions when
> NO_COW/nodatacow is involved). Won't BTRFS CoW mechanism confuse BoW,
> e.g. BTRFS will try to use space that BoW wants to use for backups?
> Note however, using BoW on BTRFS wouldn't have much point, since BTRFS
> has built-in features for snapshots. This leads me to my next
> question.
> – Why don't you just use BTRFS on Android? It basically provides a
> similar feature like BoW, and it is matured enough, switching
> snapshots are easy, etc.. However I see why it wouldn't be feasible
> for you, e.g. it is slower than ext4, which would matter for an
> Android device.
I'm not the ideal person to answer that question, but yes, I believe 
performance is an issue, along with the lack of file based encryption.
> – What if you run out of free disk space while updating? I guess you
> can just revert to the original state with BoW, but an update might
> require more disk space with BoW (and this is a thing, my Android
> always complains about not having enough space).
Well this question remains with any snapshot system, and indeed is there 
even before you have snapshots. There are really only two choices - 
throw away the snapshot and keep going, or fail the update and revert 
(with presumably the intent of freeing up more space and trying again.) 
Which we choose would be a policy decision - my goal would be to make 
sure either option is possible.
> – Can I really expect dm-bow to work on non-Android systems (like I
> tried it on an Ubuntu KVM)?
Yes, absolutely, but for the moment it's a work in progress and it 
contains an assumption about IO accesses being page aligned that is the 
reason for the failure you are seeing.
> – Do you have any prototype for the command line utility to be used
> for recovery?
Yes, and I will be uploading that. For the moment it is embedded in some 
Android specific code. It won't take long to extricate it though. It's 
actually very simple.

Paul

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Wols Lists Oct. 25, 2018, 9:15 p.m. UTC | #9
On 25/10/18 19:13, Paul Lawrence wrote:
>> I have some questions about dm-bow:
>> – How file system agnostic this feature is planned to be? While it is
>> designed with ext4 in mind, is it going to work when used over other
>> file systems, like FAT or BTRFS for example?

> So long as the file system supports fstrim, it should work. If the file
> system creates a lot of churn say by running garbage collection, I'd not
> recommend it. And I really don't see the use case if the file system has
> any sort of snapshot capability - that will always be a superior
> solution to a block level one IMO.

Sorry for being dense, but why is this posted to linux-raid, then? Raid
does not support fstrim, and is filesystem-agnostic.

I can imagine people here being interested, but it feels to me as though
your functionality is completely orthogonal to raid. Sorry.

Cheers,
Wol

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Darrick J. Wong Oct. 25, 2018, 9:43 p.m. UTC | #10
On Tue, Oct 23, 2018 at 02:23:28PM -0700, Paul Lawrence wrote:
> bow == backup on write
> 
> Similar to dm-snap, add the ability to take a snapshot of a device.
> Unlike dm-snap, a separate volume is not required.
> 
> dm-bow can be in one of three states.
> 
> In state one, the free blocks on the device are identified by issuing
> an FSTRIM to the filesystem.

So, uh, what happens if the filesystem is so full that you run out of
space for stashing old block contents?

Also, what happens with things like ext4 that don't send discard
requests for uninitialized blockgroups and blockgroups that haven't been
touched since the last FITRIM call?

> In state two, any writes cause the overwritten data to be backup up
> to the available free space. While in this state, the device can be
> restored by unmounting the filesystem, removing the dm-bow device
> and running a usermode tool over the underlying device.
> 
> In state three, the changes are committed, dm-bow is in pass-through
> mode and the drive can no longer be restored.
> 
> It is planned to use this driver to enable restoration of a failed
> update attempt on Android devices using ext4.
> 
> Test: Can boot Android with userdata mounted on this device. Can commit
> userdata after SUW has run. Can then reboot, make changes and roll back.
> 
> Known issues:
> 
> Recovery tool needed
> 
> Remove BUG_ONs
> 
> Assumtion is that bios are always page aligned. This is not in general
> true. Fix to work with file systems which use 1k blocks.

Or filesystems with 512b blocks?

> Mutex is held around entire flush operation, including lengthy I/O. Plan
> is to convert to state machine with pending queues.
> 
> Add a linked list of free ranges to avoid linear searches of the tree.

(Don't we usually remove linked lists to avoid linear searches?
Anyway...)

> Similarly sector0_current should be stored in bow_context. There should
> be no linear searches of the map.
> 
> Interaction with block encryption is unknown, especially with respect
> to sector 0.
> 
> Signed-off-by: Paul Lawrence <paullawrence@google.com>
> Cc: Alasdair Kergon <agk@redhat.com>
> Cc: Mike Snitzer <snitzer@redhat.com>
> Cc: dm-devel@redhat.com
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Shaohua Li <shli@kernel.org>
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-raid@vger.kernel.org
> 
> ---
>  Documentation/device-mapper/dm-bow.txt |  103 +++
>  drivers/md/Kconfig                     |   12 +
>  drivers/md/Makefile                    |    1 +
>  drivers/md/dm-bow.c                    | 1033 ++++++++++++++++++++++++
>  4 files changed, 1149 insertions(+)
>  create mode 100644 Documentation/device-mapper/dm-bow.txt
>  create mode 100644 drivers/md/dm-bow.c
> 
> diff --git a/Documentation/device-mapper/dm-bow.txt b/Documentation/device-mapper/dm-bow.txt
> new file mode 100644
> index 000000000000..a77d5bf8f98b
> --- /dev/null
> +++ b/Documentation/device-mapper/dm-bow.txt
> @@ -0,0 +1,103 @@
> +dm_bow (backup on write)
> +========================
> +
> +dm_bow is a device mapper driver that uses the free space on a device to back up
> +data that is overwritten. The changes can then be committed by a simple state
> +change, or rolled back by removing the dm_bow device and running a command line
> +utility over the underlying device.
> +
> +dm_bow has three states, set by writing ‘1’ or ‘2’ to /sys/block/dm-?/bow/state.
> +It is only possible to go from state 0 (initial state) to state 1, and then from
> +state 1 to state 2.
> +
> +State 0: dm_bow collects all trims to the device and assumes that these mark
> +free space on the overlying file system that can be safely used. Typically the
> +mount code would create the dm_bow device, mount the file system, call the
> +FITRIM ioctl on the file system then switch to state 1. These trims are not
> +propagated to the underlying device.
> +
> +TODO: There are some race conditions if there are writes in state 0. Test
> +mounting the drive ro and see if trims are still allowed. If that fails, test
> +holding all writes in a queue until we switch to state 1. If that fails,
> +consider implementing a ram disk type affair, which will be complex and risky.
> +
> +State 1: All writes to the device cause the underlying data to be backed up to
> +the free (trimmed) area as needed in such a way as they can be restored.
> +However, the writes, with one exception, then happen exactly as they would
> +without dm_bow, so the device is always in a good final state. The exception is
> +that sector 0 is used to keep a log of the latest changes, both to indicate that
> +we are in this state and to allow rollback. See below for all details.
> +
> +State 2: The transition to state 2 triggers replacing the special sector 0 with
> +the normal sector 0, and the freeing of all state information. dm_bow then
> +becomes a pass-through driver, allowing the device to continue to be used with
> +minimal performance impact.

What about state 3?

> +Usage
> +=====
> +dm-bow takes one command line parameter, the name of the underlying device.
> +
> +dm-bow will typically be used in the following way. dm-bow will be loaded with a
> +suitable underlying device and the resultant device will be mounted. A file
> +system trim will be issued via the FITRIM ioctl, then the device will be
> +switched to state 1. The file system will now be used as normal. At some point,
> +the changes can either be committed by switching to state 2, or rolled back by
> +unmounting the file system, removing the dm-bow device and running the command
> +line utility. Note that rebooting the device will be equivalent to unmounting
> +and removing, but the command line utility must still be run
> +
> +Details of operation in state 1
> +===============================
> +
> +dm_bow maintains a type for all sectors. A sector can be any of:
> +
> +SECTOR0
> +SECTOR0_CURRENT
> +UNCHANGED
> +FREE
> +CHANGED
> +BACKUP
> +
> +SECTOR0 is the first sector on the device, and is used to hold the log of
> +changes. This is the one exception.
> +
> +SECTOR0_CURRENT is a sector picked from the FREE sectors, and is where reads and
> +writes from the true sector zero are redirected to. Note that like any backup
> +sector, if the sector is written to directly, it must be moved again.
> +
> +UNCHANGED means that the sector has not been changed since we entered state 1.
> +Thus if it is written to or trimmed, the contents must first be backed up.
> +
> +FREE means that the sector was trimmed in state 0 and has not yet been written
> +to or used for backup. On being written to, a FREE sector is changed to CHANGED.
> +
> +CHANGED means that the sector has been modified, and can be further modified
> +without further backup.
> +
> +BACKUP means that this is a free sector being used as a backup. On being written
> +to, the contents must first be backed up again.
> +
> +All backup operations are logged to the first sector. The log sector has the
> +format:
> +--------------------------------------------------------
> +| Magic | Count | Sequence | Log entry | Log entry | …
> +--------------------------------------------------------
> +
> +Magic is a magic number. Count is the number of log entries. Sequence is 0
> +initially. A log entry is
> +
> +-----------------------------------
> +| Source | Dest | Size | Checksum |
> +-----------------------------------
> +
> +When SECTOR0 is full, the log sector is backup up and another empty log sector
> +created with sequence number one higher. The first entry in any log entry with
> +sequence > 0 therefore must be the log of the backing up of the previous log
> +sector. Note that sequence is not strictly needed, but is a useful sanity check
> +and potentially limits the time spent trying to restore a corrupted snapshot.
> +
> +On entering state 1, dm_bow has a list of free sectors. All other sectors are
> +unchanged. Sector0_current is selected from the free sectors and the contents of
> +sector 0 are copied there. The sector 0 is backed up, which triggers the first
> +log entry to be written.

Hmm.  I guess it's neat how you can discard the old fs simply by copying
the SECTOR0_CURRENT to sector 0 and exiting.

ext4 writes its superblock 512 bytes into the disk (so you can cram a
x86 boot sector and an ext4 fs into a floopy drive!) which means that a
bowed (bow'd?) block device will have both the bow log at byte offset 0
and a ext4 superblock at byte offset 512...?

> diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
> index 7baee86ded7c..21c95032f698 100644
> --- a/drivers/md/Kconfig
> +++ b/drivers/md/Kconfig
> @@ -569,4 +569,16 @@ config DM_ANDROID_VERITY
>  	  of the metadata contents are verified against the key included
>  	  in the system keyring. Upon success, the underlying verity
>  	  target is setup.
> +
> +config DM_BOW
> +	tristate "Create a device that backs up all changes to free blocks"
> +	depends on BLK_DEV_DM
> +	---help---
> +	  This device-mapper target takes a device and keeps a log of all
> +	  changes using free blocks identified by issuing a trim command.
> +	  This can then be restored by running a command line utility,
> +	  or committed by simply replacing the target.
> +
> +	  If unsure, say N.
> +
>  endif # MD
> diff --git a/drivers/md/Makefile b/drivers/md/Makefile
> index c3bf33b35674..7503427e2f3b 100644
> --- a/drivers/md/Makefile
> +++ b/drivers/md/Makefile
> @@ -62,6 +62,7 @@ obj-$(CONFIG_DM_ERA)		+= dm-era.o
>  obj-$(CONFIG_DM_LOG_WRITES)	+= dm-log-writes.o
>  obj-$(CONFIG_DM_REQ_CRYPT)	+= dm-req-crypt.o
>  obj-$(CONFIG_DM_ANDROID_VERITY) += dm-android-verity.o
> +obj-$(CONFIG_DM_BOW)		+= dm-bow.o
>  
>  ifeq ($(CONFIG_DM_UEVENT),y)
>  dm-mod-objs			+= dm-uevent.o
> diff --git a/drivers/md/dm-bow.c b/drivers/md/dm-bow.c
> new file mode 100644
> index 000000000000..2f3d6dfd3d6b
> --- /dev/null
> +++ b/drivers/md/dm-bow.c
> @@ -0,0 +1,1033 @@
> +/*
> + * Copyright (C) 2018 Google Limited.
> + *
> + * This file is released under the GPL.
> + */
> +
> +#include "dm.h"
> +#include "dm-bufio.h"
> +#include "dm-core.h"
> +
> +#include <linux/crc32.h>
> +#include <linux/module.h>
> +
> +#define DM_MSG_PREFIX "bow"
> +#define SECTOR_SIZE 512
> +
> +struct log_entry {
> +	u64 source;
> +	u64 dest;
> +	u32 size;
> +	u32 checksum;
> +} __packed;

On-disk structure is host-endian?

--D

> +
> +struct log_sector {
> +	u32 magic;
> +	u32 count;
> +	u32 sequence;
> +	struct log_entry entries[];
> +} __packed;
> +
> +/*
> + * MAGIC is BOW in ascii
> + */
> +#define MAGIC 0x00574f42
> +
> +struct bow_range {
> +	struct rb_node		node;
> +	sector_t		sector;
> +	enum {
> +		SECTOR0,	/* First sector - holds log record */
> +		SECTOR0_CURRENT,/* Live contents of sector0 */
> +		UNCHANGED,	/* Original contents */
> +		TRIMMED,	/* Range has been trimmed */
> +		CHANGED,	/* Range has been changed */
> +		BACKUP,		/* Range is being used as a backup */
> +		TOP,		/* Final range - sector is size of device */
> +	} type;
> +};
> +
> +static const char * const readable_type[] = {
> +	"Sector0",
> +	"Sector0_current",
> +	"Unchanged",
> +	"Free",
> +	"Changed",
> +	"Backup",
> +	"Top",
> +};
> +
> +enum state {
> +	TRIM,
> +	CHECKPOINT,
> +	COMMITTED,
> +};
> +
> +struct bow_context {
> +	struct dm_dev *dev;
> +	struct workqueue_struct *workqueue;
> +	struct dm_bufio_client *bufio;
> +	struct mutex ranges_lock;
> +	struct rb_root ranges;
> +	struct dm_kobject_holder kobj_holder;	/* for sysfs attributes */
> +	atomic_t state; /* One of the enum state values above */
> +	union {
> +		u8			raw_log_sector[PAGE_SIZE];
> +		struct log_sector	log_sector;
> +	};
> +};
> +
> +sector_t range_top(struct bow_range *br)
> +{
> +	return container_of(rb_next(&br->node), struct bow_range, node)
> +		->sector;
> +}
> +
> +unsigned int range_size(struct bow_range *br)
> +{
> +	return (range_top(br) - br->sector) * SECTOR_SIZE;
> +}
> +
> +static sector_t bvec_top(struct bvec_iter *bi_iter)
> +{
> +	return bi_iter->bi_sector + bi_iter->bi_size / SECTOR_SIZE;
> +}
> +
> +static struct bow_range *find_first_overlapping_range(struct rb_root *ranges,
> +						      struct bvec_iter *bi_iter)
> +{
> +	struct rb_node *node = ranges->rb_node;
> +
> +	while (node) {
> +		struct bow_range *br;
> +
> +		br = container_of(node, struct bow_range, node);
> +		if (br->sector <= bi_iter->bi_sector
> +		    && bi_iter->bi_sector < range_top(br))
> +			return br;
> +		if (bi_iter->bi_sector < br->sector)
> +			node = node->rb_left;
> +		else
> +			node = node->rb_right;
> +	}
> +
> +	BUG_ON(true);
> +	return NULL;
> +}
> +
> +void add_before(struct rb_root *ranges, struct bow_range *new_br,
> +		struct bow_range *existing)
> +{
> +	struct rb_node *parent = &(existing->node);
> +	struct rb_node **link = &(parent->rb_left);
> +
> +	while (*link) {
> +		parent = *link;
> +		link = &((*link)->rb_right);
> +	}
> +
> +	rb_link_node(&new_br->node, parent, link);
> +	rb_insert_color(&new_br->node, ranges);
> +}
> +
> +/*
> + * Given a range br returned by find_first_overlapping_range, split br
> + * into a leading range, a range matching the bio and a trailing range.
> + * Leading and trailing may end up size 0 and will then be deleted. The
> + * new range matching the bio is then returned and should have its type
> + * and type specific fields populated.
> + * If the bio runs off the end of the range, the bio is truncated
> + * accordingly
> + */
> +static int split_range(struct rb_root *ranges, struct bow_range **br,
> +		       struct bvec_iter *bi_iter)
> +{
> +	struct bow_range *new_br;
> +
> +	BUG_ON(bi_iter->bi_sector < (*br)->sector);
> +
> +	if (bi_iter->bi_sector > (*br)->sector) {
> +		struct bow_range *leading_br =
> +			kzalloc(sizeof(*leading_br), GFP_KERNEL);
> +
> +		if (!leading_br)
> +			return -ENOMEM;
> +
> +		*leading_br = **br;
> +		add_before(ranges, leading_br, *br);
> +		(*br)->sector = bi_iter->bi_sector;
> +	}
> +
> +	if (bvec_top(bi_iter) >= range_top(*br)) {
> +		bi_iter->bi_size = (range_top(*br) - (*br)->sector)
> +					* SECTOR_SIZE;
> +		return 0;
> +	}
> +
> +	/* new_br will be the beginning, existing br will be the tail */
> +	new_br = kzalloc(sizeof(*new_br), GFP_KERNEL);
> +	if (!new_br)
> +		return -ENOMEM;
> +
> +	new_br->sector = (*br)->sector;
> +	(*br)->sector = bvec_top(bi_iter);
> +	add_before(ranges, new_br, *br);
> +	*br = new_br;
> +
> +	return 0;
> +}
> +
> +/*
> + * Sets type of a range. May merge range into surrounding ranges
> + * Since br may be invalidated, always sets br to NULL to prevent
> + * usage after this is called
> + */
> +static void set_type(struct rb_root *ranges, struct bow_range **br, int type)
> +{
> +	struct bow_range *prev = container_of(rb_prev(&(*br)->node),
> +						      struct bow_range, node);
> +	struct bow_range *next = container_of(rb_next(&(*br)->node),
> +						      struct bow_range, node);
> +
> +	(*br)->type = type;
> +
> +	if (next->type == type) {
> +		rb_erase(&next->node, ranges);
> +		kfree(next);
> +	}
> +
> +	if (prev->type == type) {
> +		rb_erase(&(*br)->node, ranges);
> +		kfree(*br);
> +	}
> +
> +	*br = NULL;
> +}
> +
> +static struct bow_range *find_free_range(struct rb_root *ranges)
> +{
> +	struct rb_node *i;
> +
> +	for (i = rb_first(ranges); i; i = rb_next(i)) {
> +		struct bow_range *br = container_of(i, struct bow_range, node);
> +
> +		if (br->type == TRIMMED)
> +			return br;
> +	}
> +
> +	DMERR("Unable to find free space to back up to");
> +	return NULL;
> +}
> +
> +static sector_t sector_to_page(sector_t sector)
> +{
> +	BUG_ON(sector % (PAGE_SIZE / SECTOR_SIZE) != 0);
> +	return sector >> (PAGE_SHIFT - SECTOR_SHIFT);
> +}
> +
> +static int copy_data(struct dm_bufio_client *bufio, struct bow_range *source,
> +		     struct bow_range *dest, u32 *checksum)
> +{
> +	int i;
> +
> +	BUG_ON(range_size(source) != range_size(dest));
> +
> +	if (checksum)
> +		*checksum = sector_to_page(source->sector);
> +
> +	for (i = 0; i < range_size(source) / PAGE_SIZE; ++i) {
> +		struct dm_buffer *read_buffer, *write_buffer;
> +		u8 *read, *write;
> +
> +		read = dm_bufio_read(bufio, sector_to_page(source->sector) + i,
> +				     &read_buffer);
> +		if (IS_ERR(read)) {
> +			DMERR("Cannot read sector");
> +			dm_bufio_release(read_buffer);
> +			return PTR_ERR(read);
> +		}
> +
> +		if (checksum)
> +			*checksum = crc32(*checksum, read, PAGE_SIZE);
> +
> +		write = dm_bufio_new(bufio, sector_to_page(dest->sector) + i,
> +				      &write_buffer);
> +		if (IS_ERR(write)) {
> +			DMERR("Cannot write sector");
> +			dm_bufio_release(write_buffer);
> +			dm_bufio_release(read_buffer);
> +			return PTR_ERR(write);
> +		}
> +
> +		memcpy(write, read, PAGE_SIZE);
> +
> +		dm_bufio_mark_buffer_dirty(write_buffer);
> +		dm_bufio_release(write_buffer);
> +		dm_bufio_release(read_buffer);
> +	}
> +
> +	dm_bufio_write_dirty_buffers(bufio);
> +	return 0;
> +}
> +
> +/****** logging functions ******/
> +
> +static int add_log_entry(struct bow_context *bc, sector_t source, sector_t dest,
> +			 unsigned int size, u32 checksum);
> +
> +static int backup_log_sector(struct bow_context *bc)
> +{
> +	struct bow_range *first_br, *free_br;
> +	struct bvec_iter bi_iter;
> +	u32 checksum;
> +	int ret;
> +
> +	first_br = container_of(rb_first(&bc->ranges), struct bow_range, node);
> +	BUG_ON(first_br->type != SECTOR0);
> +	BUG_ON(range_size(first_br) != PAGE_SIZE);
> +
> +	free_br = find_free_range(&bc->ranges);
> +	/* No space left - return this error to userspace */
> +	if (!free_br)
> +		return -ENOSPC;
> +	bi_iter.bi_sector = free_br->sector;
> +	bi_iter.bi_size = PAGE_SIZE;
> +	ret = split_range(&bc->ranges, &free_br, &bi_iter);
> +	if (ret)
> +		return ret;
> +	BUG_ON(bi_iter.bi_size != PAGE_SIZE);
> +
> +	ret = copy_data(bc->bufio, first_br, free_br, &checksum);
> +	if (ret)
> +		return ret;
> +
> +	bc->log_sector.count = 0;
> +	bc->log_sector.sequence++;
> +	ret = add_log_entry(bc, first_br->sector, free_br->sector,
> +			    range_size(first_br), checksum);
> +	if (ret)
> +		return ret;
> +
> +	set_type(&bc->ranges, &free_br, BACKUP);
> +	return 0;
> +}
> +
> +static int add_log_entry(struct bow_context *bc, sector_t source, sector_t dest,
> +			 unsigned int size, u32 checksum)
> +{
> +	struct dm_buffer *sector_buffer;
> +	u8 *sector;
> +
> +	if (sizeof(struct log_sector)
> +			+ sizeof(struct log_entry) * (bc->log_sector.count + 1)
> +		   > sizeof(bc->raw_log_sector)) {
> +		int ret = backup_log_sector(bc);
> +
> +		if (ret)
> +			return ret;
> +	}
> +
> +	bc->log_sector.entries[bc->log_sector.count].source = source;
> +	bc->log_sector.entries[bc->log_sector.count].dest = dest;
> +	bc->log_sector.entries[bc->log_sector.count].size = size;
> +	bc->log_sector.entries[bc->log_sector.count].checksum = checksum;
> +	bc->log_sector.count++;
> +
> +	sector = dm_bufio_new(bc->bufio, 0, &sector_buffer);
> +	if (IS_ERR(sector)) {
> +		DMERR("Cannot write boot sector");
> +		dm_bufio_release(sector_buffer);
> +		return -ENOSPC;
> +	}
> +	memcpy(sector, bc->raw_log_sector, PAGE_SIZE);
> +	dm_bufio_mark_buffer_dirty(sector_buffer);
> +	dm_bufio_release(sector_buffer);
> +	dm_bufio_write_dirty_buffers(bc->bufio);
> +	return 0;
> +}
> +
> +static int prepare_log(struct bow_context *bc)
> +{
> +	struct bow_range *free_br, *first_br;
> +	struct bvec_iter bi_iter;
> +	u32 checksum;
> +	int ret;
> +
> +	/* Carve out first sector as log sector */
> +	first_br = container_of(rb_first(&bc->ranges), struct bow_range, node);
> +	BUG_ON(first_br->type != UNCHANGED || range_size(first_br) < PAGE_SIZE);
> +	bi_iter.bi_sector = 0;
> +	bi_iter.bi_size = PAGE_SIZE;
> +	ret = split_range(&bc->ranges, &first_br, &bi_iter);
> +	if (ret)
> +		return ret;
> +	first_br->type = SECTOR0;
> +	BUG_ON(range_size(first_br) != PAGE_SIZE);
> +
> +	/* Find free sector for active sector0 reads/writes */
> +	free_br = find_free_range(&bc->ranges);
> +	if (!free_br)
> +		return -ENOSPC;
> +	bi_iter.bi_sector = free_br->sector;
> +	bi_iter.bi_size = PAGE_SIZE;
> +	ret = split_range(&bc->ranges, &free_br, &bi_iter);
> +	if (ret)
> +		return ret;
> +	free_br->type = SECTOR0_CURRENT;
> +
> +	/* Copy data */
> +	ret = copy_data(bc->bufio, first_br, free_br, NULL);
> +	if (ret)
> +		return ret;
> +
> +	/* Find free sector to back up first sector */
> +	free_br = find_free_range(&bc->ranges);
> +	if (!free_br)
> +		return -ENOSPC;
> +	bi_iter.bi_sector = free_br->sector;
> +	bi_iter.bi_size = PAGE_SIZE;
> +	ret = split_range(&bc->ranges, &free_br, &bi_iter);
> +	if (ret)
> +		return ret;
> +
> +	/* Back up */
> +	ret = copy_data(bc->bufio, first_br, free_br, &checksum);
> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * Set up our replacement boot sector - it will get written when we
> +	 * add the first log entry, which we do immediately
> +	 */
> +	bc->log_sector.magic = MAGIC;
> +	bc->log_sector.count = 0;
> +	bc->log_sector.sequence = 0;
> +
> +	/* Add log entry */
> +	ret = add_log_entry(bc, first_br->sector, free_br->sector,
> +			    range_size(first_br), checksum);
> +	if (ret)
> +		return ret;
> +
> +	set_type(&bc->ranges, &free_br, BACKUP);
> +	return 0;
> +}
> +
> +static struct bow_range *find_sector0_current(struct bow_context *bc)
> +{
> +	struct rb_node *i;
> +
> +	for (i = rb_first(&bc->ranges); i; i = rb_next(i)) {
> +		struct bow_range *br = container_of(i, struct bow_range, node);
> +
> +		if (br->type == SECTOR0_CURRENT)
> +			return br;
> +	}
> +
> +	BUG_ON(true);
> +}
> +
> +/****** sysfs interface functions ******/
> +
> +static ssize_t state_show(struct kobject *kobj, struct kobj_attribute *attr,
> +			  char *buf)
> +{
> +	struct bow_context *bc = container_of(kobj, struct bow_context,
> +					      kobj_holder.kobj);
> +
> +	return scnprintf(buf, PAGE_SIZE, "%d\n", atomic_read(&bc->state));
> +}
> +
> +static ssize_t state_store(struct kobject *kobj, struct kobj_attribute *attr,
> +			   const char *buf, size_t count)
> +{
> +	struct bow_context *bc = container_of(kobj, struct bow_context,
> +					      kobj_holder.kobj);
> +	enum state state, original_state;
> +	int ret;
> +
> +	state = buf[0] - '0';
> +	if (state < TRIM || state > COMMITTED) {
> +		DMERR("State value %d out of range", state);
> +		return -EINVAL;
> +	}
> +
> +	mutex_lock(&bc->ranges_lock);
> +	original_state = atomic_read(&bc->state);
> +	if (state != original_state + 1) {
> +		DMERR("Invalid state change from %d to %d",
> +		      original_state, state);
> +		ret = -EINVAL;
> +		goto bad;
> +	}
> +
> +	DMINFO("Switching to state %s", state == CHECKPOINT ? "Checkpoint"
> +	       : state == COMMITTED ? "Committed" : "Unknown");
> +
> +	if (state == CHECKPOINT) {
> +		ret = prepare_log(bc);
> +		if (ret) {
> +			DMERR("Failed to switch to checkpoint state");
> +			goto bad;
> +		}
> +	} else if (state == COMMITTED) {
> +		struct bow_range *br = find_sector0_current(bc);
> +		struct bow_range *sector0_br =
> +			container_of(rb_first(&bc->ranges), struct bow_range,
> +				     node);
> +
> +		ret = copy_data(bc->bufio, br, sector0_br, 0);
> +		if (ret) {
> +			DMERR("Failed to switch to committed state");
> +			goto bad;
> +		}
> +	}
> +	atomic_inc(&bc->state);
> +	ret = count;
> +
> +bad:
> +	mutex_unlock(&bc->ranges_lock);
> +	return ret;
> +}
> +
> +static int prepare_one_range(struct bow_context *bc, struct bvec_iter *bi_iter);
> +static ssize_t write_store(struct kobject *kobj, struct kobj_attribute *attr,
> +			   const char *buf, size_t count)
> +{
> +	struct bow_context *bc = container_of(kobj, struct bow_context,
> +					      kobj_holder.kobj);
> +	sector_t sector = *(sector_t *)buf;
> +	unsigned int size = *(unsigned int *)(buf + sizeof(sector));
> +	struct bvec_iter bio_bi_iter, bi_iter;
> +	int ret, i;
> +
> +	bio_bi_iter.bi_sector = sector;
> +	bio_bi_iter.bi_size = size;
> +	bi_iter = bio_bi_iter;
> +
> +	DMINFO("Simulate Write: %lu, %u",
> +	       bio_bi_iter.bi_sector,
> +	       bio_bi_iter.bi_size);
> +
> +	mutex_lock(&bc->ranges_lock);
> +	do {
> +		ret = prepare_one_range(bc, &bi_iter);
> +		bi_iter.bi_sector += bi_iter.bi_size / SECTOR_SIZE;
> +		bi_iter.bi_size = bio_bi_iter.bi_size
> +			- (bi_iter.bi_sector - bio_bi_iter.bi_sector)
> +			  * SECTOR_SIZE;
> +	} while (!ret && bi_iter.bi_size);
> +	mutex_unlock(&bc->ranges_lock);
> +
> +	if (ret)
> +		return ret;
> +
> +	for (i = 0; i < size / PAGE_SIZE; i++) {
> +		struct dm_buffer *dm_buffer;
> +		u8 *ptr;
> +
> +		ptr = dm_bufio_new(bc->bufio, sector_to_page(sector) + i,
> +				   &dm_buffer);
> +		dm_bufio_mark_buffer_dirty(dm_buffer);
> +		dm_bufio_release(dm_buffer);
> +	}
> +	dm_bufio_write_dirty_buffers(bc->bufio);
> +
> +	return count;
> +}
> +
> +static struct kobj_attribute attr_state = __ATTR_RW(state);
> +static struct kobj_attribute attr_write = __ATTR_WO(write);
> +
> +static struct attribute *bow_attrs[] = {
> +	&attr_state.attr,
> +	&attr_write.attr,
> +	NULL
> +};
> +
> +static struct kobj_type bow_ktype = {
> +	.sysfs_ops = &kobj_sysfs_ops,
> +	.default_attrs = bow_attrs,
> +	.release = dm_kobject_release
> +};
> +
> +/****** constructor/destructor ******/
> +
> +static void dm_bow_dtr(struct dm_target *ti)
> +{
> +	struct bow_context *bc = (struct bow_context *) ti->private;
> +	struct list_head *i, *q;
> +	struct kobject *kobj;
> +
> +	while (rb_first(&bc->ranges)) {
> +		struct bow_range *br = container_of(rb_first(&bc->ranges),
> +						    struct bow_range, node);
> +
> +		rb_erase(&br->node, &bc->ranges);
> +		kfree(br);
> +	}
> +	if (bc->workqueue)
> +		destroy_workqueue(bc->workqueue);
> +	if (bc->bufio)
> +		dm_bufio_client_destroy(bc->bufio);
> +
> +	kobj = &bc->kobj_holder.kobj;
> +	if (kobj->state_initialized) {
> +		kobject_put(kobj);
> +		wait_for_completion(dm_get_completion_from_kobject(kobj));
> +	}
> +	kfree(bc);
> +}
> +
> +static int dm_bow_ctr(struct dm_target *ti, unsigned int argc, char **argv)
> +{
> +	struct bow_context *bc;
> +	struct bow_range *br;
> +	int ret;
> +	struct mapped_device *md = dm_table_get_md(ti->table);
> +
> +	if (argc != 1) {
> +		ti->error = "Invalid argument count";
> +		return -EINVAL;
> +	}
> +
> +	bc = kzalloc(sizeof(*bc), GFP_KERNEL);
> +	if (!bc) {
> +		ti->error = "Cannot allocate bow context";
> +		return -ENOMEM;
> +	}
> +
> +	ti->num_flush_bios = 1;
> +	ti->num_discard_bios = 1;
> +	ti->num_write_same_bios = 1;
> +	ti->private = bc;
> +
> +	ret = dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
> +			    &bc->dev);
> +	if (ret) {
> +		ti->error = "Device lookup failed";
> +		goto bad;
> +	}
> +
> +	init_completion(&bc->kobj_holder.completion);
> +	ret = kobject_init_and_add(&bc->kobj_holder.kobj, &bow_ktype,
> +				   &disk_to_dev(dm_disk(md))->kobj, "%s",
> +				   "bow");
> +	if (ret) {
> +		ti->error = "Cannot create sysfs node";
> +		goto bad;
> +	}
> +
> +	mutex_init(&bc->ranges_lock);
> +	bc->ranges = RB_ROOT;
> +	bc->bufio = dm_bufio_client_create(bc->dev->bdev, PAGE_SIZE, 1, 0,
> +					   NULL, NULL);
> +	if (IS_ERR(bc->bufio)) {
> +		ti->error = "Cannot initialize dm-bufio";
> +		ret = PTR_ERR(bc->bufio);
> +		bc->bufio = NULL;
> +		goto bad;
> +	}
> +
> +	bc->workqueue = alloc_workqueue("dm-bow",
> +					WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM
> +					| WQ_UNBOUND, num_online_cpus());
> +	if (!bc->workqueue) {
> +		ti->error = "Cannot allocate workqueue";
> +		ret = -ENOMEM;
> +		goto bad;
> +	}
> +
> +	br = kzalloc(sizeof(*br), GFP_KERNEL);
> +	if (!br) {
> +		ti->error = "Cannot allocate ranges";
> +		ret = -ENOMEM;
> +		goto bad;
> +	}
> +
> +	br->sector = ti->len;
> +	br->type = TOP;
> +	rb_link_node(&br->node, NULL, &bc->ranges.rb_node);
> +	rb_insert_color(&br->node, &bc->ranges);
> +
> +	br = kzalloc(sizeof(*br), GFP_KERNEL);
> +	if (!br) {
> +		ti->error = "Cannot allocate ranges";
> +		ret = -ENOMEM;
> +		goto bad;
> +	}
> +
> +	br->sector = 0;
> +	br->type = UNCHANGED;
> +	rb_link_node(&br->node, bc->ranges.rb_node,
> +		     &bc->ranges.rb_node->rb_left);
> +	rb_insert_color(&br->node, &bc->ranges);
> +
> +	ti->discards_supported = true;
> +
> +	return 0;
> +
> +bad:
> +	dm_bow_dtr(ti);
> +	return ret;
> +}
> +
> +/****** Handle writes ******/
> +
> +static int prepare_unchanged_range(struct bow_context *bc, struct bow_range *br,
> +				   struct bvec_iter *bi_iter)
> +{
> +	struct bow_range *backup_br;
> +	struct bvec_iter backup_bi;
> +	sector_t log_source, log_dest;
> +	unsigned int log_size;
> +	u32 checksum;
> +	int ret;
> +
> +	/* Find a free range */
> +	backup_br = find_free_range(&bc->ranges);
> +	if (!backup_br)
> +		return -ENOSPC;
> +
> +	/* Carve out a backup range. This may be smaller than the br given */
> +	backup_bi.bi_sector = backup_br->sector;
> +	backup_bi.bi_size = min(range_size(backup_br), bi_iter->bi_size);
> +	ret = split_range(&bc->ranges, &backup_br, &backup_bi);
> +	if (ret)
> +		return ret; /* TODO make sure structures are consistent */
> +
> +	/*
> +	 * Carve out a changed range. This will not be smaller than the backup
> +	 * br since the backup br is smaller than the source range and iterator
> +	 */
> +	bi_iter->bi_size = backup_bi.bi_size;
> +	ret = split_range(&bc->ranges, &br, bi_iter);
> +	if (ret)
> +		return ret; /* TODO make sure structures are consistent */
> +	BUG_ON(range_size(br) != range_size(backup_br));
> +
> +	/* Copy data over */
> +	ret = copy_data(bc->bufio, br, backup_br, &checksum);
> +	if (ret)
> +		return ret;
> +
> +	/* Add an entry to the log */
> +	log_source = br->sector;
> +	log_dest = backup_br->sector;
> +	log_size = range_size(br);
> +
> +	/*
> +	 * Set the types last. Note that since set_type also amalgamates ranges
> +	 * we have to be careful of the order here.
> +	 * TODO: This is brittle. Work out better solution, maybe separate
> +	 * set_type and amalgamate_ranges?
> +	 */
> +	backup_br->type = br->type == SECTOR0_CURRENT ? SECTOR0_CURRENT
> +						      : BACKUP;
> +	br->type = CHANGED;
> +
> +	set_type(&bc->ranges, &backup_br, backup_br->type);
> +	set_type(&bc->ranges, &br, br->type);
> +
> +	/*
> +	 * Actually add the log entry at the end, since adding a log can cause
> +	 * another backup, which is not safe until this one is committed
> +	 */
> +	return add_log_entry(bc, log_source, log_dest, log_size, checksum);
> +}
> +
> +static int prepare_free_range(struct bow_context *bc, struct bow_range *br,
> +			      struct bvec_iter *bi_iter)
> +{
> +	int ret;
> +
> +	ret = split_range(&bc->ranges, &br, bi_iter);
> +	if (ret)
> +		return ret;
> +	set_type(&bc->ranges, &br, CHANGED);
> +	return 0;
> +}
> +
> +static int prepare_changed_range(struct bow_context *bc, struct bow_range *br,
> +				 struct bvec_iter *bi_iter)
> +{
> +	/* Nothing to do ... */
> +	return 0;
> +}
> +
> +static int prepare_one_range(struct bow_context *bc,
> +			     struct bvec_iter *bi_iter)
> +{
> +	struct bow_range *br = find_first_overlapping_range(&bc->ranges,
> +							    bi_iter);
> +	bi_iter->bi_size = min(bi_iter->bi_size,
> +			       (unsigned int)(range_top(br)
> +					      - bi_iter->bi_sector)
> +				* SECTOR_SIZE);
> +
> +	switch (br->type) {
> +	case CHANGED:
> +		return prepare_changed_range(bc, br, bi_iter);
> +
> +	case TRIMMED:
> +		return prepare_free_range(bc, br, bi_iter);
> +
> +	case UNCHANGED:
> +	case BACKUP:
> +	case SECTOR0_CURRENT:
> +		return prepare_unchanged_range(bc, br, bi_iter);
> +
> +	case SECTOR0:	/* Handled in the dm_bow_map */
> +	case TOP:	/* Illegal - top is off the end of the device */
> +	default:
> +		BUG_ON(true);
> +	}
> +}
> +
> +struct write_work {
> +	struct work_struct work;
> +	struct bow_context *bc;
> +	struct bio *bio;
> +};
> +
> +static void bow_write(struct work_struct *work)
> +{
> +	struct write_work *ww = container_of(work, struct write_work, work);
> +	struct bow_context *bc = ww->bc;
> +	struct bio *bio = ww->bio;
> +	struct bvec_iter bi_iter = bio->bi_iter;
> +	int ret = 0;
> +
> +	kfree(ww);
> +
> +	mutex_lock(&bc->ranges_lock);
> +	do {
> +		ret = prepare_one_range(bc, &bi_iter);
> +		bi_iter.bi_sector += bi_iter.bi_size / SECTOR_SIZE;
> +		bi_iter.bi_size = bio->bi_iter.bi_size
> +			- (bi_iter.bi_sector - bio->bi_iter.bi_sector)
> +			  * SECTOR_SIZE;
> +	} while (!ret && bi_iter.bi_size);
> +
> +	mutex_unlock(&bc->ranges_lock);
> +
> +	if (!ret) {
> +		bio->bi_bdev = bc->dev->bdev;
> +		submit_bio(bio);
> +	} else {
> +		DMERR("Write failure with error %d", -ret);
> +		bio->bi_error = ret;
> +		bio_endio(bio);
> +	}
> +}
> +
> +static int queue_write(struct bow_context *bc, struct bio *bio)
> +{
> +	struct write_work *ww = kmalloc(sizeof(*ww), GFP_NOIO | __GFP_NORETRY
> +					| __GFP_NOMEMALLOC | __GFP_NOWARN);
> +	if (!ww) {
> +		DMERR("Failed to allocate write_work");
> +		return -ENOMEM;
> +	}
> +
> +	INIT_WORK(&ww->work, bow_write);
> +	ww->bc = bc;
> +	ww->bio = bio;
> +	queue_work(bc->workqueue, &ww->work);
> +	return DM_MAPIO_SUBMITTED;
> +}
> +
> +static int handle_sector0(struct bow_context *bc, struct bio *bio)
> +{
> +	int ret = DM_MAPIO_REMAPPED;
> +
> +	if (bio->bi_iter.bi_size > PAGE_SIZE) {
> +		struct bio *split = bio_split(bio,
> +					      1 << (PAGE_SHIFT - SECTOR_SHIFT),
> +					      GFP_NOIO, fs_bio_set);
> +		bio_chain(split, bio);
> +		if (bio_data_dir(split) == WRITE) {
> +			ret = queue_write(bc, split);
> +			if (ret == DM_MAPIO_SUBMITTED)
> +				ret = DM_MAPIO_REMAPPED;
> +		} else {
> +			submit_bio(split);
> +		}
> +	}
> +
> +	bio->bi_iter.bi_sector = find_sector0_current(bc)->sector;
> +	return ret;
> +}
> +
> +/****** dm interface ******/
> +
> +static int dm_bow_map(struct dm_target *ti, struct bio *bio)
> +{
> +	int ret = DM_MAPIO_REMAPPED;
> +	struct bow_context *bc = ti->private;
> +
> +	if (atomic_read(&bc->state) != COMMITTED) {
> +		enum state state;
> +
> +		mutex_lock(&bc->ranges_lock);
> +		state = atomic_read(&bc->state);
> +		if (state == TRIM && bio_op(bio) == REQ_OP_DISCARD) {
> +			struct bow_range *br;
> +
> +			DMDEBUG("Trim: %lu, %u",
> +			       bio->bi_iter.bi_sector,
> +			       bio->bi_iter.bi_size);
> +
> +			br = find_first_overlapping_range(&bc->ranges,
> +							  &bio->bi_iter);
> +
> +			/*
> +			 * TODO - we assume this will always be an unmapped
> +			 * block containing the whole bio
> +			 */
> +			BUG_ON(br->type != UNCHANGED);
> +			if (!split_range(&bc->ranges, &br, &bio->bi_iter))
> +				set_type(&bc->ranges, &br, TRIMMED);
> +			bio_endio(bio);
> +			ret = DM_MAPIO_SUBMITTED;
> +		} else if (state == CHECKPOINT) {
> +			if (bio->bi_iter.bi_sector == 0)
> +				ret = handle_sector0(bc, bio);
> +			else if (bio_data_dir(bio) == WRITE)
> +				ret = queue_write(bc, bio);
> +			else
> +				/* passthrough */;
> +		} else {
> +			/* passthrough */
> +		}
> +		mutex_unlock(&bc->ranges_lock);
> +	}
> +	if (ret == DM_MAPIO_REMAPPED)
> +		bio->bi_bdev = bc->dev->bdev;
> +	return ret;
> +}
> +
> +static void dm_bow_tablestatus(struct dm_target *ti, char *result,
> +			       unsigned int maxlen)
> +{
> +	char *end = result + maxlen;
> +	struct bow_context *bc = ti->private;
> +	struct rb_node *i;
> +
> +	if (maxlen == 0)
> +		return;
> +	result[0] = 0;
> +
> +	if (!rb_first(&bc->ranges)) {
> +		scnprintf(result, end - result, "ERROR: Empty ranges");
> +		return;
> +	}
> +
> +	if (container_of(rb_first(&bc->ranges), struct bow_range, node)
> +	    ->sector) {
> +		scnprintf(result, end - result,
> +			 "ERROR: First range does not start at sector 0");
> +		return;
> +	}
> +
> +	for (i = rb_first(&bc->ranges); i; i = rb_next(i)) {
> +		struct bow_range *br = container_of(i, struct bow_range, node);
> +
> +		result += scnprintf(result, end - result, "%s: %lu",
> +				    readable_type[br->type], br->sector);
> +		if (result >= end)
> +			return;
> +
> +		result += scnprintf(result, end - result, "\n");
> +		if (result >= end)
> +			return;
> +
> +		switch (br->type) {
> +		case TOP:
> +			if (br->sector != ti->len) {
> +				scnprintf(result, end - result,
> +					 "\nERROR: Top sector is incorrect");
> +			}
> +
> +			if (&br->node != rb_last(&bc->ranges)) {
> +				scnprintf(result, end - result,
> +					  "\nERROR: Top sector is not last");
> +			}
> +			return;
> +
> +		default:
> +			break;
> +		}
> +
> +		if (br->sector > range_top(br)) {
> +			scnprintf(result, end - result,
> +				  "\nERROR: sector too big");
> +			return;
> +		}
> +	}
> +}
> +
> +static void dm_bow_status(struct dm_target *ti, status_type_t type,
> +			  unsigned int status_flags, char *result,
> +			  unsigned int maxlen)
> +{
> +	switch (type) {
> +	case STATUSTYPE_INFO:
> +		result[0] = 0;
> +		break;
> +
> +	case STATUSTYPE_TABLE:
> +		dm_bow_tablestatus(ti, result, maxlen);
> +		break;
> +	}
> +}
> +
> +int dm_bow_prepare_ioctl(struct dm_target *ti,
> +			 struct block_device **bdev, fmode_t *mode)
> +{
> +	struct bow_context *bc = ti->private;
> +	struct dm_dev *dev = bc->dev;
> +
> +	*bdev = dev->bdev;
> +	/* Only pass ioctls through if the device sizes match exactly. */
> +	return ti->len != i_size_read(dev->bdev->bd_inode) >> SECTOR_SHIFT;
> +}
> +
> +static int dm_bow_iterate_devices(struct dm_target *ti,
> +				  iterate_devices_callout_fn fn, void *data)
> +{
> +	struct bow_context *bc = ti->private;
> +
> +	return fn(ti, bc->dev, 0, ti->len, data);
> +}
> +
> +static struct target_type bow_target = {
> +	.name   = "bow",
> +	.version = {1, 1, 1},
> +	.module = THIS_MODULE,
> +	.ctr    = dm_bow_ctr,
> +	.dtr    = dm_bow_dtr,
> +	.map    = dm_bow_map,
> +	.status = dm_bow_status,
> +	.prepare_ioctl  = dm_bow_prepare_ioctl,
> +	.iterate_devices = dm_bow_iterate_devices,
> +};
> +
> +int __init dm_bow_init(void)
> +{
> +	int r = dm_register_target(&bow_target);
> +
> +	if (r < 0)
> +		DMERR("registering bow failed %d", r);
> +	return r;
> +}
> +
> +void dm_bow_exit(void)
> +{
> +	dm_unregister_target(&bow_target);
> +}
> +
> +MODULE_LICENSE("GPL");
> +
> +module_init(dm_bow_init);
> +module_exit(dm_bow_exit);
> -- 
> 2.19.1.568.g152ad8e336-goog
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Mikulas Patocka Oct. 26, 2018, 8:03 p.m. UTC | #11
On Thu, 25 Oct 2018, Paul Lawrence wrote:

> Thank you for the suggestion. I spent part of yesterday experimenting with
> this idea, and it is certainly very promising. However, it does have some
> disadvantages as compared to dm-bow, if I am understanding the setup
> correctly:
> 
> 1) Since dm-snap has no concept of the free space on the underlying file
> system any write into the free space will trigger a backup, so using twice the
> space of dm-bow. Changing existing data will create a backup with both
> drivers, but since we have to reserve the space for the backups up-front with
> dm-snap, we would likely only have half the space for that. Either way, it
> seems that dm-bow is likely to double the amount of changes we could make.
> 
> (Might it be possible to dynamically resize the backup file if it is mostly
> used up? This would fix the problem of only having half the space for changing

Yes - the snapshot store can be extended while the snapshot is active.

> existing data. The documentation seems to indicate that you can increase the
> size of the snapshot partition, and it seems like it should be possible to
> grow the underlying file without triggering a lot of writes. OTOH this would
> have to happen in userspace which creates other issues.)
> 
> 2) Similarly, since writes into free space do not trigger a backup in dm-bow,
> dm-bow is likely to have a lower performance overhead in many circumstances.
> On the flip side, dm-bow's backup is in free space and will collide with other
> writes, so this advantage will reduce as free space fills up. But by choosing
> a suitable algorithm for how we use free space we might be able to retain most
> of this advantage.
> 
> I intend to put together a fully working prototype of your suggestion next to
> better compare with dm-bow. But I do believe there is value in tracking free
> space and utilizing it in any such solution.

The snapshot target could be hacked so that it remembers space trimmed 
with REQ_OP_DISCARD and won't reallocate these blocks.

But I suspect that running discard over the whole device would degrade 
performance more than copying some unneeded data.

How much data do you intend to backup with this solution?

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Paul Lawrence Oct. 29, 2018, 4:51 p.m. UTC | #12
> The snapshot target could be hacked so that it remembers space trimmed
> with REQ_OP_DISCARD and won't reallocate these blocks.
>
> But I suspect that running discard over the whole device would degrade
> performance more than copying some unneeded data.
>
> How much data do you intend to backup with this solution?
>
>
We are space-constrained - we will have to free up space for the backup 
before we apply the update, so we have to predict the size and keeping 
usage as low as possible is thus very important.

Also, we've discussed the resizing requirement of the dm-snap solution 
and that part is not attractive at all - it seems it would be impossible 
to guarantee that the resizing happens in a timely fashion during the 
(very busy) update cycle.

Thanks everyone for the insights, especially into how dm-snap works, 
which I hadn't fully appreciated. At the moment, and for the above 
reasons, we intend to continue with the dm-bow solution, but do want to 
keep this discussion open. If anyone is going to be at Linux Plumbers, 
I'll be presenting this work and would love to chat about it more.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Mikulas Patocka Nov. 15, 2018, 11:15 p.m. UTC | #13
On Mon, 29 Oct 2018, Paul Lawrence wrote:

> 
> > The snapshot target could be hacked so that it remembers space trimmed
> > with REQ_OP_DISCARD and won't reallocate these blocks.
> > 
> > But I suspect that running discard over the whole device would degrade
> > performance more than copying some unneeded data.
> > 
> > How much data do you intend to backup with this solution?
> > 
> > 
> We are space-constrained - we will have to free up space for the backup before
> we apply the update, so we have to predict the size and keeping usage as low
> as possible is thus very important.
> 
> Also, we've discussed the resizing requirement of the dm-snap solution and
> that part is not attractive at all - it seems it would be impossible to
> guarantee that the resizing happens in a timely fashion during the (very busy)
> update cycle.
> 
> Thanks everyone for the insights, especially into how dm-snap works, which I
> hadn't fully appreciated. At the moment, and for the above reasons, we intend
> to continue with the dm-bow solution, but do want to keep this discussion
> open. If anyone is going to be at Linux Plumbers, I'll be presenting this work
> and would love to chat about it more.

dm-snapshot took 9 years to fix the last data corruption bug (2004-2013 - 
the commit e9c6a182649f4259db704ae15a91ac820e63b0ca).

And with the new target duplicating the snapshot functionality, it may be 
the same.

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Paul Lawrence Nov. 19, 2018, 3:35 p.m. UTC | #14
Understand. I will be looking for a way to upstream the functionality we
need, but will spend time trying to work out how to integrate it into
dm-snap or leverage dm-snap in some other way to avoid duplication.

Thanks,

Paul

On Thu, Nov 15, 2018 at 3:15 PM Mikulas Patocka <mpatocka@redhat.com> wrote:

>
>
> On Mon, 29 Oct 2018, Paul Lawrence wrote:
>
> >
> > > The snapshot target could be hacked so that it remembers space trimmed
> > > with REQ_OP_DISCARD and won't reallocate these blocks.
> > >
> > > But I suspect that running discard over the whole device would degrade
> > > performance more than copying some unneeded data.
> > >
> > > How much data do you intend to backup with this solution?
> > >
> > >
> > We are space-constrained - we will have to free up space for the backup
> before
> > we apply the update, so we have to predict the size and keeping usage as
> low
> > as possible is thus very important.
> >
> > Also, we've discussed the resizing requirement of the dm-snap solution
> and
> > that part is not attractive at all - it seems it would be impossible to
> > guarantee that the resizing happens in a timely fashion during the (very
> busy)
> > update cycle.
> >
> > Thanks everyone for the insights, especially into how dm-snap works,
> which I
> > hadn't fully appreciated. At the moment, and for the above reasons, we
> intend
> > to continue with the dm-bow solution, but do want to keep this discussion
> > open. If anyone is going to be at Linux Plumbers, I'll be presenting
> this work
> > and would love to chat about it more.
>
> dm-snapshot took 9 years to fix the last data corruption bug (2004-2013 -
> the commit e9c6a182649f4259db704ae15a91ac820e63b0ca).
>
> And with the new target duplicating the snapshot functionality, it may be
> the same.
>
> Mikulas
>
<div dir="ltr"><div class="gmail_default" style="font-family:tahoma,sans-serif;font-size:small">Understand. I will be looking for a way to upstream the functionality we need, but will spend time trying to work out how to integrate it into dm-snap or leverage dm-snap in some other way to avoid duplication.</div><div class="gmail_default" style="font-family:tahoma,sans-serif;font-size:small"><br></div><div class="gmail_default" style="font-family:tahoma,sans-serif;font-size:small">Thanks,</div><div class="gmail_default" style="font-family:tahoma,sans-serif;font-size:small"><br></div><div class="gmail_default" style="font-family:tahoma,sans-serif;font-size:small">Paul</div></div><br><div class="gmail_quote"><div dir="ltr">On Thu, Nov 15, 2018 at 3:15 PM Mikulas Patocka &lt;<a href="mailto:mpatocka@redhat.com">mpatocka@redhat.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
<br>
On Mon, 29 Oct 2018, Paul Lawrence wrote:<br>
<br>
&gt; <br>
&gt; &gt; The snapshot target could be hacked so that it remembers space trimmed<br>
&gt; &gt; with REQ_OP_DISCARD and won&#39;t reallocate these blocks.<br>
&gt; &gt; <br>
&gt; &gt; But I suspect that running discard over the whole device would degrade<br>
&gt; &gt; performance more than copying some unneeded data.<br>
&gt; &gt; <br>
&gt; &gt; How much data do you intend to backup with this solution?<br>
&gt; &gt; <br>
&gt; &gt; <br>
&gt; We are space-constrained - we will have to free up space for the backup before<br>
&gt; we apply the update, so we have to predict the size and keeping usage as low<br>
&gt; as possible is thus very important.<br>
&gt; <br>
&gt; Also, we&#39;ve discussed the resizing requirement of the dm-snap solution and<br>
&gt; that part is not attractive at all - it seems it would be impossible to<br>
&gt; guarantee that the resizing happens in a timely fashion during the (very busy)<br>
&gt; update cycle.<br>
&gt; <br>
&gt; Thanks everyone for the insights, especially into how dm-snap works, which I<br>
&gt; hadn&#39;t fully appreciated. At the moment, and for the above reasons, we intend<br>
&gt; to continue with the dm-bow solution, but do want to keep this discussion<br>
&gt; open. If anyone is going to be at Linux Plumbers, I&#39;ll be presenting this work<br>
&gt; and would love to chat about it more.<br>
<br>
dm-snapshot took 9 years to fix the last data corruption bug (2004-2013 - <br>
the commit e9c6a182649f4259db704ae15a91ac820e63b0ca).<br>
<br>
And with the new target duplicating the snapshot functionality, it may be <br>
the same.<br>
<br>
Mikulas<br>
</blockquote></div>
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Sandeep Patil Dec. 2, 2018, 10:07 a.m. UTC | #15
Hi Mikulas,

On Thu, Nov 15, 2018 at 06:15:34PM -0500, Mikulas Patocka wrote:
> 
> 
> On Mon, 29 Oct 2018, Paul Lawrence wrote:
> 
> > 
> > > The snapshot target could be hacked so that it remembers space trimmed
> > > with REQ_OP_DISCARD and won't reallocate these blocks.
> > > 
> > > But I suspect that running discard over the whole device would degrade
> > > performance more than copying some unneeded data.
> > > 
> > > How much data do you intend to backup with this solution?
> > > 
> > > 
> > We are space-constrained - we will have to free up space for the backup before
> > we apply the update, so we have to predict the size and keeping usage as low
> > as possible is thus very important.
> > 
> > Also, we've discussed the resizing requirement of the dm-snap solution and
> > that part is not attractive at all - it seems it would be impossible to
> > guarantee that the resizing happens in a timely fashion during the (very busy)
> > update cycle.
> > 
> > Thanks everyone for the insights, especially into how dm-snap works, which I
> > hadn't fully appreciated. At the moment, and for the above reasons, we intend
> > to continue with the dm-bow solution, but do want to keep this discussion
> > open. If anyone is going to be at Linux Plumbers, I'll be presenting this work
> > and would love to chat about it more.
> 
> dm-snapshot took 9 years to fix the last data corruption bug (2004-2013 - 
> the commit e9c6a182649f4259db704ae15a91ac820e63b0ca).
> 
> And with the new target duplicating the snapshot functionality, it may be 
> the same.
> 

Thanks for that. We are as much sensitive to not duplicating functionality
and of course reliability of the implementation.

So we did spend considerable amount of time trying to make dm-snapshot work
for us (including the approach suggested here now). However, the additional
space needed to make dm-snapshot work in this situation is unfortunate and
won't work for Android. Especially given that we will be taking that space
away from the user all in one go too.

Anyway, I wanted to ask if there is any way we can make dm-snapshot work the
way dm-bow does? With patches is fine, we can work on that :).

I think Paul is planning to send a v2 with more description and the block
size fix that caused problems for others trying it out.

FWIW, dm-bow itself suffers from a mutex for each write that stalls for
longer when the write is to a block being modified (as opposed to a free
block being written to). We are hoping to iterate over that problem once the
general idea is acceptable to everyone.

Thanks for your help.

- ssp


> Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
diff mbox series

Patch

diff --git a/Documentation/device-mapper/dm-bow.txt b/Documentation/device-mapper/dm-bow.txt
new file mode 100644
index 000000000000..a77d5bf8f98b
--- /dev/null
+++ b/Documentation/device-mapper/dm-bow.txt
@@ -0,0 +1,103 @@ 
+dm_bow (backup on write)
+========================
+
+dm_bow is a device mapper driver that uses the free space on a device to back up
+data that is overwritten. The changes can then be committed by a simple state
+change, or rolled back by removing the dm_bow device and running a command line
+utility over the underlying device.
+
+dm_bow has three states, set by writing ‘1’ or ‘2’ to /sys/block/dm-?/bow/state.
+It is only possible to go from state 0 (initial state) to state 1, and then from
+state 1 to state 2.
+
+State 0: dm_bow collects all trims to the device and assumes that these mark
+free space on the overlying file system that can be safely used. Typically the
+mount code would create the dm_bow device, mount the file system, call the
+FITRIM ioctl on the file system then switch to state 1. These trims are not
+propagated to the underlying device.
+
+TODO: There are some race conditions if there are writes in state 0. Test
+mounting the drive ro and see if trims are still allowed. If that fails, test
+holding all writes in a queue until we switch to state 1. If that fails,
+consider implementing a ram disk type affair, which will be complex and risky.
+
+State 1: All writes to the device cause the underlying data to be backed up to
+the free (trimmed) area as needed in such a way as they can be restored.
+However, the writes, with one exception, then happen exactly as they would
+without dm_bow, so the device is always in a good final state. The exception is
+that sector 0 is used to keep a log of the latest changes, both to indicate that
+we are in this state and to allow rollback. See below for all details.
+
+State 2: The transition to state 2 triggers replacing the special sector 0 with
+the normal sector 0, and the freeing of all state information. dm_bow then
+becomes a pass-through driver, allowing the device to continue to be used with
+minimal performance impact.
+
+Usage
+=====
+dm-bow takes one command line parameter, the name of the underlying device.
+
+dm-bow will typically be used in the following way. dm-bow will be loaded with a
+suitable underlying device and the resultant device will be mounted. A file
+system trim will be issued via the FITRIM ioctl, then the device will be
+switched to state 1. The file system will now be used as normal. At some point,
+the changes can either be committed by switching to state 2, or rolled back by
+unmounting the file system, removing the dm-bow device and running the command
+line utility. Note that rebooting the device will be equivalent to unmounting
+and removing, but the command line utility must still be run
+
+Details of operation in state 1
+===============================
+
+dm_bow maintains a type for all sectors. A sector can be any of:
+
+SECTOR0
+SECTOR0_CURRENT
+UNCHANGED
+FREE
+CHANGED
+BACKUP
+
+SECTOR0 is the first sector on the device, and is used to hold the log of
+changes. This is the one exception.
+
+SECTOR0_CURRENT is a sector picked from the FREE sectors, and is where reads and
+writes from the true sector zero are redirected to. Note that like any backup
+sector, if the sector is written to directly, it must be moved again.
+
+UNCHANGED means that the sector has not been changed since we entered state 1.
+Thus if it is written to or trimmed, the contents must first be backed up.
+
+FREE means that the sector was trimmed in state 0 and has not yet been written
+to or used for backup. On being written to, a FREE sector is changed to CHANGED.
+
+CHANGED means that the sector has been modified, and can be further modified
+without further backup.
+
+BACKUP means that this is a free sector being used as a backup. On being written
+to, the contents must first be backed up again.
+
+All backup operations are logged to the first sector. The log sector has the
+format:
+--------------------------------------------------------
+| Magic | Count | Sequence | Log entry | Log entry | …
+--------------------------------------------------------
+
+Magic is a magic number. Count is the number of log entries. Sequence is 0
+initially. A log entry is
+
+-----------------------------------
+| Source | Dest | Size | Checksum |
+-----------------------------------
+
+When SECTOR0 is full, the log sector is backup up and another empty log sector
+created with sequence number one higher. The first entry in any log entry with
+sequence > 0 therefore must be the log of the backing up of the previous log
+sector. Note that sequence is not strictly needed, but is a useful sanity check
+and potentially limits the time spent trying to restore a corrupted snapshot.
+
+On entering state 1, dm_bow has a list of free sectors. All other sectors are
+unchanged. Sector0_current is selected from the free sectors and the contents of
+sector 0 are copied there. The sector 0 is backed up, which triggers the first
+log entry to be written.
+
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 7baee86ded7c..21c95032f698 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -569,4 +569,16 @@  config DM_ANDROID_VERITY
 	  of the metadata contents are verified against the key included
 	  in the system keyring. Upon success, the underlying verity
 	  target is setup.
+
+config DM_BOW
+	tristate "Create a device that backs up all changes to free blocks"
+	depends on BLK_DEV_DM
+	---help---
+	  This device-mapper target takes a device and keeps a log of all
+	  changes using free blocks identified by issuing a trim command.
+	  This can then be restored by running a command line utility,
+	  or committed by simply replacing the target.
+
+	  If unsure, say N.
+
 endif # MD
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index c3bf33b35674..7503427e2f3b 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -62,6 +62,7 @@  obj-$(CONFIG_DM_ERA)		+= dm-era.o
 obj-$(CONFIG_DM_LOG_WRITES)	+= dm-log-writes.o
 obj-$(CONFIG_DM_REQ_CRYPT)	+= dm-req-crypt.o
 obj-$(CONFIG_DM_ANDROID_VERITY) += dm-android-verity.o
+obj-$(CONFIG_DM_BOW)		+= dm-bow.o
 
 ifeq ($(CONFIG_DM_UEVENT),y)
 dm-mod-objs			+= dm-uevent.o
diff --git a/drivers/md/dm-bow.c b/drivers/md/dm-bow.c
new file mode 100644
index 000000000000..2f3d6dfd3d6b
--- /dev/null
+++ b/drivers/md/dm-bow.c
@@ -0,0 +1,1033 @@ 
+/*
+ * Copyright (C) 2018 Google Limited.
+ *
+ * This file is released under the GPL.
+ */
+
+#include "dm.h"
+#include "dm-bufio.h"
+#include "dm-core.h"
+
+#include <linux/crc32.h>
+#include <linux/module.h>
+
+#define DM_MSG_PREFIX "bow"
+#define SECTOR_SIZE 512
+
+struct log_entry {
+	u64 source;
+	u64 dest;
+	u32 size;
+	u32 checksum;
+} __packed;
+
+struct log_sector {
+	u32 magic;
+	u32 count;
+	u32 sequence;
+	struct log_entry entries[];
+} __packed;
+
+/*
+ * MAGIC is BOW in ascii
+ */
+#define MAGIC 0x00574f42
+
+struct bow_range {
+	struct rb_node		node;
+	sector_t		sector;
+	enum {
+		SECTOR0,	/* First sector - holds log record */
+		SECTOR0_CURRENT,/* Live contents of sector0 */
+		UNCHANGED,	/* Original contents */
+		TRIMMED,	/* Range has been trimmed */
+		CHANGED,	/* Range has been changed */
+		BACKUP,		/* Range is being used as a backup */
+		TOP,		/* Final range - sector is size of device */
+	} type;
+};
+
+static const char * const readable_type[] = {
+	"Sector0",
+	"Sector0_current",
+	"Unchanged",
+	"Free",
+	"Changed",
+	"Backup",
+	"Top",
+};
+
+enum state {
+	TRIM,
+	CHECKPOINT,
+	COMMITTED,
+};
+
+struct bow_context {
+	struct dm_dev *dev;
+	struct workqueue_struct *workqueue;
+	struct dm_bufio_client *bufio;
+	struct mutex ranges_lock;
+	struct rb_root ranges;
+	struct dm_kobject_holder kobj_holder;	/* for sysfs attributes */
+	atomic_t state; /* One of the enum state values above */
+	union {
+		u8			raw_log_sector[PAGE_SIZE];
+		struct log_sector	log_sector;
+	};
+};
+
+sector_t range_top(struct bow_range *br)
+{
+	return container_of(rb_next(&br->node), struct bow_range, node)
+		->sector;
+}
+
+unsigned int range_size(struct bow_range *br)
+{
+	return (range_top(br) - br->sector) * SECTOR_SIZE;
+}
+
+static sector_t bvec_top(struct bvec_iter *bi_iter)
+{
+	return bi_iter->bi_sector + bi_iter->bi_size / SECTOR_SIZE;
+}
+
+static struct bow_range *find_first_overlapping_range(struct rb_root *ranges,
+						      struct bvec_iter *bi_iter)
+{
+	struct rb_node *node = ranges->rb_node;
+
+	while (node) {
+		struct bow_range *br;
+
+		br = container_of(node, struct bow_range, node);
+		if (br->sector <= bi_iter->bi_sector
+		    && bi_iter->bi_sector < range_top(br))
+			return br;
+		if (bi_iter->bi_sector < br->sector)
+			node = node->rb_left;
+		else
+			node = node->rb_right;
+	}
+
+	BUG_ON(true);
+	return NULL;
+}
+
+void add_before(struct rb_root *ranges, struct bow_range *new_br,
+		struct bow_range *existing)
+{
+	struct rb_node *parent = &(existing->node);
+	struct rb_node **link = &(parent->rb_left);
+
+	while (*link) {
+		parent = *link;
+		link = &((*link)->rb_right);
+	}
+
+	rb_link_node(&new_br->node, parent, link);
+	rb_insert_color(&new_br->node, ranges);
+}
+
+/*
+ * Given a range br returned by find_first_overlapping_range, split br
+ * into a leading range, a range matching the bio and a trailing range.
+ * Leading and trailing may end up size 0 and will then be deleted. The
+ * new range matching the bio is then returned and should have its type
+ * and type specific fields populated.
+ * If the bio runs off the end of the range, the bio is truncated
+ * accordingly
+ */
+static int split_range(struct rb_root *ranges, struct bow_range **br,
+		       struct bvec_iter *bi_iter)
+{
+	struct bow_range *new_br;
+
+	BUG_ON(bi_iter->bi_sector < (*br)->sector);
+
+	if (bi_iter->bi_sector > (*br)->sector) {
+		struct bow_range *leading_br =
+			kzalloc(sizeof(*leading_br), GFP_KERNEL);
+
+		if (!leading_br)
+			return -ENOMEM;
+
+		*leading_br = **br;
+		add_before(ranges, leading_br, *br);
+		(*br)->sector = bi_iter->bi_sector;
+	}
+
+	if (bvec_top(bi_iter) >= range_top(*br)) {
+		bi_iter->bi_size = (range_top(*br) - (*br)->sector)
+					* SECTOR_SIZE;
+		return 0;
+	}
+
+	/* new_br will be the beginning, existing br will be the tail */
+	new_br = kzalloc(sizeof(*new_br), GFP_KERNEL);
+	if (!new_br)
+		return -ENOMEM;
+
+	new_br->sector = (*br)->sector;
+	(*br)->sector = bvec_top(bi_iter);
+	add_before(ranges, new_br, *br);
+	*br = new_br;
+
+	return 0;
+}
+
+/*
+ * Sets type of a range. May merge range into surrounding ranges
+ * Since br may be invalidated, always sets br to NULL to prevent
+ * usage after this is called
+ */
+static void set_type(struct rb_root *ranges, struct bow_range **br, int type)
+{
+	struct bow_range *prev = container_of(rb_prev(&(*br)->node),
+						      struct bow_range, node);
+	struct bow_range *next = container_of(rb_next(&(*br)->node),
+						      struct bow_range, node);
+
+	(*br)->type = type;
+
+	if (next->type == type) {
+		rb_erase(&next->node, ranges);
+		kfree(next);
+	}
+
+	if (prev->type == type) {
+		rb_erase(&(*br)->node, ranges);
+		kfree(*br);
+	}
+
+	*br = NULL;
+}
+
+static struct bow_range *find_free_range(struct rb_root *ranges)
+{
+	struct rb_node *i;
+
+	for (i = rb_first(ranges); i; i = rb_next(i)) {
+		struct bow_range *br = container_of(i, struct bow_range, node);
+
+		if (br->type == TRIMMED)
+			return br;
+	}
+
+	DMERR("Unable to find free space to back up to");
+	return NULL;
+}
+
+static sector_t sector_to_page(sector_t sector)
+{
+	BUG_ON(sector % (PAGE_SIZE / SECTOR_SIZE) != 0);
+	return sector >> (PAGE_SHIFT - SECTOR_SHIFT);
+}
+
+static int copy_data(struct dm_bufio_client *bufio, struct bow_range *source,
+		     struct bow_range *dest, u32 *checksum)
+{
+	int i;
+
+	BUG_ON(range_size(source) != range_size(dest));
+
+	if (checksum)
+		*checksum = sector_to_page(source->sector);
+
+	for (i = 0; i < range_size(source) / PAGE_SIZE; ++i) {
+		struct dm_buffer *read_buffer, *write_buffer;
+		u8 *read, *write;
+
+		read = dm_bufio_read(bufio, sector_to_page(source->sector) + i,
+				     &read_buffer);
+		if (IS_ERR(read)) {
+			DMERR("Cannot read sector");
+			dm_bufio_release(read_buffer);
+			return PTR_ERR(read);
+		}
+
+		if (checksum)
+			*checksum = crc32(*checksum, read, PAGE_SIZE);
+
+		write = dm_bufio_new(bufio, sector_to_page(dest->sector) + i,
+				      &write_buffer);
+		if (IS_ERR(write)) {
+			DMERR("Cannot write sector");
+			dm_bufio_release(write_buffer);
+			dm_bufio_release(read_buffer);
+			return PTR_ERR(write);
+		}
+
+		memcpy(write, read, PAGE_SIZE);
+
+		dm_bufio_mark_buffer_dirty(write_buffer);
+		dm_bufio_release(write_buffer);
+		dm_bufio_release(read_buffer);
+	}
+
+	dm_bufio_write_dirty_buffers(bufio);
+	return 0;
+}
+
+/****** logging functions ******/
+
+static int add_log_entry(struct bow_context *bc, sector_t source, sector_t dest,
+			 unsigned int size, u32 checksum);
+
+static int backup_log_sector(struct bow_context *bc)
+{
+	struct bow_range *first_br, *free_br;
+	struct bvec_iter bi_iter;
+	u32 checksum;
+	int ret;
+
+	first_br = container_of(rb_first(&bc->ranges), struct bow_range, node);
+	BUG_ON(first_br->type != SECTOR0);
+	BUG_ON(range_size(first_br) != PAGE_SIZE);
+
+	free_br = find_free_range(&bc->ranges);
+	/* No space left - return this error to userspace */
+	if (!free_br)
+		return -ENOSPC;
+	bi_iter.bi_sector = free_br->sector;
+	bi_iter.bi_size = PAGE_SIZE;
+	ret = split_range(&bc->ranges, &free_br, &bi_iter);
+	if (ret)
+		return ret;
+	BUG_ON(bi_iter.bi_size != PAGE_SIZE);
+
+	ret = copy_data(bc->bufio, first_br, free_br, &checksum);
+	if (ret)
+		return ret;
+
+	bc->log_sector.count = 0;
+	bc->log_sector.sequence++;
+	ret = add_log_entry(bc, first_br->sector, free_br->sector,
+			    range_size(first_br), checksum);
+	if (ret)
+		return ret;
+
+	set_type(&bc->ranges, &free_br, BACKUP);
+	return 0;
+}
+
+static int add_log_entry(struct bow_context *bc, sector_t source, sector_t dest,
+			 unsigned int size, u32 checksum)
+{
+	struct dm_buffer *sector_buffer;
+	u8 *sector;
+
+	if (sizeof(struct log_sector)
+			+ sizeof(struct log_entry) * (bc->log_sector.count + 1)
+		   > sizeof(bc->raw_log_sector)) {
+		int ret = backup_log_sector(bc);
+
+		if (ret)
+			return ret;
+	}
+
+	bc->log_sector.entries[bc->log_sector.count].source = source;
+	bc->log_sector.entries[bc->log_sector.count].dest = dest;
+	bc->log_sector.entries[bc->log_sector.count].size = size;
+	bc->log_sector.entries[bc->log_sector.count].checksum = checksum;
+	bc->log_sector.count++;
+
+	sector = dm_bufio_new(bc->bufio, 0, &sector_buffer);
+	if (IS_ERR(sector)) {
+		DMERR("Cannot write boot sector");
+		dm_bufio_release(sector_buffer);
+		return -ENOSPC;
+	}
+	memcpy(sector, bc->raw_log_sector, PAGE_SIZE);
+	dm_bufio_mark_buffer_dirty(sector_buffer);
+	dm_bufio_release(sector_buffer);
+	dm_bufio_write_dirty_buffers(bc->bufio);
+	return 0;
+}
+
+static int prepare_log(struct bow_context *bc)
+{
+	struct bow_range *free_br, *first_br;
+	struct bvec_iter bi_iter;
+	u32 checksum;
+	int ret;
+
+	/* Carve out first sector as log sector */
+	first_br = container_of(rb_first(&bc->ranges), struct bow_range, node);
+	BUG_ON(first_br->type != UNCHANGED || range_size(first_br) < PAGE_SIZE);
+	bi_iter.bi_sector = 0;
+	bi_iter.bi_size = PAGE_SIZE;
+	ret = split_range(&bc->ranges, &first_br, &bi_iter);
+	if (ret)
+		return ret;
+	first_br->type = SECTOR0;
+	BUG_ON(range_size(first_br) != PAGE_SIZE);
+
+	/* Find free sector for active sector0 reads/writes */
+	free_br = find_free_range(&bc->ranges);
+	if (!free_br)
+		return -ENOSPC;
+	bi_iter.bi_sector = free_br->sector;
+	bi_iter.bi_size = PAGE_SIZE;
+	ret = split_range(&bc->ranges, &free_br, &bi_iter);
+	if (ret)
+		return ret;
+	free_br->type = SECTOR0_CURRENT;
+
+	/* Copy data */
+	ret = copy_data(bc->bufio, first_br, free_br, NULL);
+	if (ret)
+		return ret;
+
+	/* Find free sector to back up first sector */
+	free_br = find_free_range(&bc->ranges);
+	if (!free_br)
+		return -ENOSPC;
+	bi_iter.bi_sector = free_br->sector;
+	bi_iter.bi_size = PAGE_SIZE;
+	ret = split_range(&bc->ranges, &free_br, &bi_iter);
+	if (ret)
+		return ret;
+
+	/* Back up */
+	ret = copy_data(bc->bufio, first_br, free_br, &checksum);
+	if (ret)
+		return ret;
+
+	/*
+	 * Set up our replacement boot sector - it will get written when we
+	 * add the first log entry, which we do immediately
+	 */
+	bc->log_sector.magic = MAGIC;
+	bc->log_sector.count = 0;
+	bc->log_sector.sequence = 0;
+
+	/* Add log entry */
+	ret = add_log_entry(bc, first_br->sector, free_br->sector,
+			    range_size(first_br), checksum);
+	if (ret)
+		return ret;
+
+	set_type(&bc->ranges, &free_br, BACKUP);
+	return 0;
+}
+
+static struct bow_range *find_sector0_current(struct bow_context *bc)
+{
+	struct rb_node *i;
+
+	for (i = rb_first(&bc->ranges); i; i = rb_next(i)) {
+		struct bow_range *br = container_of(i, struct bow_range, node);
+
+		if (br->type == SECTOR0_CURRENT)
+			return br;
+	}
+
+	BUG_ON(true);
+}
+
+/****** sysfs interface functions ******/
+
+static ssize_t state_show(struct kobject *kobj, struct kobj_attribute *attr,
+			  char *buf)
+{
+	struct bow_context *bc = container_of(kobj, struct bow_context,
+					      kobj_holder.kobj);
+
+	return scnprintf(buf, PAGE_SIZE, "%d\n", atomic_read(&bc->state));
+}
+
+static ssize_t state_store(struct kobject *kobj, struct kobj_attribute *attr,
+			   const char *buf, size_t count)
+{
+	struct bow_context *bc = container_of(kobj, struct bow_context,
+					      kobj_holder.kobj);
+	enum state state, original_state;
+	int ret;
+
+	state = buf[0] - '0';
+	if (state < TRIM || state > COMMITTED) {
+		DMERR("State value %d out of range", state);
+		return -EINVAL;
+	}
+
+	mutex_lock(&bc->ranges_lock);
+	original_state = atomic_read(&bc->state);
+	if (state != original_state + 1) {
+		DMERR("Invalid state change from %d to %d",
+		      original_state, state);
+		ret = -EINVAL;
+		goto bad;
+	}
+
+	DMINFO("Switching to state %s", state == CHECKPOINT ? "Checkpoint"
+	       : state == COMMITTED ? "Committed" : "Unknown");
+
+	if (state == CHECKPOINT) {
+		ret = prepare_log(bc);
+		if (ret) {
+			DMERR("Failed to switch to checkpoint state");
+			goto bad;
+		}
+	} else if (state == COMMITTED) {
+		struct bow_range *br = find_sector0_current(bc);
+		struct bow_range *sector0_br =
+			container_of(rb_first(&bc->ranges), struct bow_range,
+				     node);
+
+		ret = copy_data(bc->bufio, br, sector0_br, 0);
+		if (ret) {
+			DMERR("Failed to switch to committed state");
+			goto bad;
+		}
+	}
+	atomic_inc(&bc->state);
+	ret = count;
+
+bad:
+	mutex_unlock(&bc->ranges_lock);
+	return ret;
+}
+
+static int prepare_one_range(struct bow_context *bc, struct bvec_iter *bi_iter);
+static ssize_t write_store(struct kobject *kobj, struct kobj_attribute *attr,
+			   const char *buf, size_t count)
+{
+	struct bow_context *bc = container_of(kobj, struct bow_context,
+					      kobj_holder.kobj);
+	sector_t sector = *(sector_t *)buf;
+	unsigned int size = *(unsigned int *)(buf + sizeof(sector));
+	struct bvec_iter bio_bi_iter, bi_iter;
+	int ret, i;
+
+	bio_bi_iter.bi_sector = sector;
+	bio_bi_iter.bi_size = size;
+	bi_iter = bio_bi_iter;
+
+	DMINFO("Simulate Write: %lu, %u",
+	       bio_bi_iter.bi_sector,
+	       bio_bi_iter.bi_size);
+
+	mutex_lock(&bc->ranges_lock);
+	do {
+		ret = prepare_one_range(bc, &bi_iter);
+		bi_iter.bi_sector += bi_iter.bi_size / SECTOR_SIZE;
+		bi_iter.bi_size = bio_bi_iter.bi_size
+			- (bi_iter.bi_sector - bio_bi_iter.bi_sector)
+			  * SECTOR_SIZE;
+	} while (!ret && bi_iter.bi_size);
+	mutex_unlock(&bc->ranges_lock);
+
+	if (ret)
+		return ret;
+
+	for (i = 0; i < size / PAGE_SIZE; i++) {
+		struct dm_buffer *dm_buffer;
+		u8 *ptr;
+
+		ptr = dm_bufio_new(bc->bufio, sector_to_page(sector) + i,
+				   &dm_buffer);
+		dm_bufio_mark_buffer_dirty(dm_buffer);
+		dm_bufio_release(dm_buffer);
+	}
+	dm_bufio_write_dirty_buffers(bc->bufio);
+
+	return count;
+}
+
+static struct kobj_attribute attr_state = __ATTR_RW(state);
+static struct kobj_attribute attr_write = __ATTR_WO(write);
+
+static struct attribute *bow_attrs[] = {
+	&attr_state.attr,
+	&attr_write.attr,
+	NULL
+};
+
+static struct kobj_type bow_ktype = {
+	.sysfs_ops = &kobj_sysfs_ops,
+	.default_attrs = bow_attrs,
+	.release = dm_kobject_release
+};
+
+/****** constructor/destructor ******/
+
+static void dm_bow_dtr(struct dm_target *ti)
+{
+	struct bow_context *bc = (struct bow_context *) ti->private;
+	struct list_head *i, *q;
+	struct kobject *kobj;
+
+	while (rb_first(&bc->ranges)) {
+		struct bow_range *br = container_of(rb_first(&bc->ranges),
+						    struct bow_range, node);
+
+		rb_erase(&br->node, &bc->ranges);
+		kfree(br);
+	}
+	if (bc->workqueue)
+		destroy_workqueue(bc->workqueue);
+	if (bc->bufio)
+		dm_bufio_client_destroy(bc->bufio);
+
+	kobj = &bc->kobj_holder.kobj;
+	if (kobj->state_initialized) {
+		kobject_put(kobj);
+		wait_for_completion(dm_get_completion_from_kobject(kobj));
+	}
+	kfree(bc);
+}
+
+static int dm_bow_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+{
+	struct bow_context *bc;
+	struct bow_range *br;
+	int ret;
+	struct mapped_device *md = dm_table_get_md(ti->table);
+
+	if (argc != 1) {
+		ti->error = "Invalid argument count";
+		return -EINVAL;
+	}
+
+	bc = kzalloc(sizeof(*bc), GFP_KERNEL);
+	if (!bc) {
+		ti->error = "Cannot allocate bow context";
+		return -ENOMEM;
+	}
+
+	ti->num_flush_bios = 1;
+	ti->num_discard_bios = 1;
+	ti->num_write_same_bios = 1;
+	ti->private = bc;
+
+	ret = dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
+			    &bc->dev);
+	if (ret) {
+		ti->error = "Device lookup failed";
+		goto bad;
+	}
+
+	init_completion(&bc->kobj_holder.completion);
+	ret = kobject_init_and_add(&bc->kobj_holder.kobj, &bow_ktype,
+				   &disk_to_dev(dm_disk(md))->kobj, "%s",
+				   "bow");
+	if (ret) {
+		ti->error = "Cannot create sysfs node";
+		goto bad;
+	}
+
+	mutex_init(&bc->ranges_lock);
+	bc->ranges = RB_ROOT;
+	bc->bufio = dm_bufio_client_create(bc->dev->bdev, PAGE_SIZE, 1, 0,
+					   NULL, NULL);
+	if (IS_ERR(bc->bufio)) {
+		ti->error = "Cannot initialize dm-bufio";
+		ret = PTR_ERR(bc->bufio);
+		bc->bufio = NULL;
+		goto bad;
+	}
+
+	bc->workqueue = alloc_workqueue("dm-bow",
+					WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM
+					| WQ_UNBOUND, num_online_cpus());
+	if (!bc->workqueue) {
+		ti->error = "Cannot allocate workqueue";
+		ret = -ENOMEM;
+		goto bad;
+	}
+
+	br = kzalloc(sizeof(*br), GFP_KERNEL);
+	if (!br) {
+		ti->error = "Cannot allocate ranges";
+		ret = -ENOMEM;
+		goto bad;
+	}
+
+	br->sector = ti->len;
+	br->type = TOP;
+	rb_link_node(&br->node, NULL, &bc->ranges.rb_node);
+	rb_insert_color(&br->node, &bc->ranges);
+
+	br = kzalloc(sizeof(*br), GFP_KERNEL);
+	if (!br) {
+		ti->error = "Cannot allocate ranges";
+		ret = -ENOMEM;
+		goto bad;
+	}
+
+	br->sector = 0;
+	br->type = UNCHANGED;
+	rb_link_node(&br->node, bc->ranges.rb_node,
+		     &bc->ranges.rb_node->rb_left);
+	rb_insert_color(&br->node, &bc->ranges);
+
+	ti->discards_supported = true;
+
+	return 0;
+
+bad:
+	dm_bow_dtr(ti);
+	return ret;
+}
+
+/****** Handle writes ******/
+
+static int prepare_unchanged_range(struct bow_context *bc, struct bow_range *br,
+				   struct bvec_iter *bi_iter)
+{
+	struct bow_range *backup_br;
+	struct bvec_iter backup_bi;
+	sector_t log_source, log_dest;
+	unsigned int log_size;
+	u32 checksum;
+	int ret;
+
+	/* Find a free range */
+	backup_br = find_free_range(&bc->ranges);
+	if (!backup_br)
+		return -ENOSPC;
+
+	/* Carve out a backup range. This may be smaller than the br given */
+	backup_bi.bi_sector = backup_br->sector;
+	backup_bi.bi_size = min(range_size(backup_br), bi_iter->bi_size);
+	ret = split_range(&bc->ranges, &backup_br, &backup_bi);
+	if (ret)
+		return ret; /* TODO make sure structures are consistent */
+
+	/*
+	 * Carve out a changed range. This will not be smaller than the backup
+	 * br since the backup br is smaller than the source range and iterator
+	 */
+	bi_iter->bi_size = backup_bi.bi_size;
+	ret = split_range(&bc->ranges, &br, bi_iter);
+	if (ret)
+		return ret; /* TODO make sure structures are consistent */
+	BUG_ON(range_size(br) != range_size(backup_br));
+
+	/* Copy data over */
+	ret = copy_data(bc->bufio, br, backup_br, &checksum);
+	if (ret)
+		return ret;
+
+	/* Add an entry to the log */
+	log_source = br->sector;
+	log_dest = backup_br->sector;
+	log_size = range_size(br);
+
+	/*
+	 * Set the types last. Note that since set_type also amalgamates ranges
+	 * we have to be careful of the order here.
+	 * TODO: This is brittle. Work out better solution, maybe separate
+	 * set_type and amalgamate_ranges?
+	 */
+	backup_br->type = br->type == SECTOR0_CURRENT ? SECTOR0_CURRENT
+						      : BACKUP;
+	br->type = CHANGED;
+
+	set_type(&bc->ranges, &backup_br, backup_br->type);
+	set_type(&bc->ranges, &br, br->type);
+
+	/*
+	 * Actually add the log entry at the end, since adding a log can cause
+	 * another backup, which is not safe until this one is committed
+	 */
+	return add_log_entry(bc, log_source, log_dest, log_size, checksum);
+}
+
+static int prepare_free_range(struct bow_context *bc, struct bow_range *br,
+			      struct bvec_iter *bi_iter)
+{
+	int ret;
+
+	ret = split_range(&bc->ranges, &br, bi_iter);
+	if (ret)
+		return ret;
+	set_type(&bc->ranges, &br, CHANGED);
+	return 0;
+}
+
+static int prepare_changed_range(struct bow_context *bc, struct bow_range *br,
+				 struct bvec_iter *bi_iter)
+{
+	/* Nothing to do ... */
+	return 0;
+}
+
+static int prepare_one_range(struct bow_context *bc,
+			     struct bvec_iter *bi_iter)
+{
+	struct bow_range *br = find_first_overlapping_range(&bc->ranges,
+							    bi_iter);
+	bi_iter->bi_size = min(bi_iter->bi_size,
+			       (unsigned int)(range_top(br)
+					      - bi_iter->bi_sector)
+				* SECTOR_SIZE);
+
+	switch (br->type) {
+	case CHANGED:
+		return prepare_changed_range(bc, br, bi_iter);
+
+	case TRIMMED:
+		return prepare_free_range(bc, br, bi_iter);
+
+	case UNCHANGED:
+	case BACKUP:
+	case SECTOR0_CURRENT:
+		return prepare_unchanged_range(bc, br, bi_iter);
+
+	case SECTOR0:	/* Handled in the dm_bow_map */
+	case TOP:	/* Illegal - top is off the end of the device */
+	default:
+		BUG_ON(true);
+	}
+}
+
+struct write_work {
+	struct work_struct work;
+	struct bow_context *bc;
+	struct bio *bio;
+};
+
+static void bow_write(struct work_struct *work)
+{
+	struct write_work *ww = container_of(work, struct write_work, work);
+	struct bow_context *bc = ww->bc;
+	struct bio *bio = ww->bio;
+	struct bvec_iter bi_iter = bio->bi_iter;
+	int ret = 0;
+
+	kfree(ww);
+
+	mutex_lock(&bc->ranges_lock);
+	do {
+		ret = prepare_one_range(bc, &bi_iter);
+		bi_iter.bi_sector += bi_iter.bi_size / SECTOR_SIZE;
+		bi_iter.bi_size = bio->bi_iter.bi_size
+			- (bi_iter.bi_sector - bio->bi_iter.bi_sector)
+			  * SECTOR_SIZE;
+	} while (!ret && bi_iter.bi_size);
+
+	mutex_unlock(&bc->ranges_lock);
+
+	if (!ret) {
+		bio->bi_bdev = bc->dev->bdev;
+		submit_bio(bio);
+	} else {
+		DMERR("Write failure with error %d", -ret);
+		bio->bi_error = ret;
+		bio_endio(bio);
+	}
+}
+
+static int queue_write(struct bow_context *bc, struct bio *bio)
+{
+	struct write_work *ww = kmalloc(sizeof(*ww), GFP_NOIO | __GFP_NORETRY
+					| __GFP_NOMEMALLOC | __GFP_NOWARN);
+	if (!ww) {
+		DMERR("Failed to allocate write_work");
+		return -ENOMEM;
+	}
+
+	INIT_WORK(&ww->work, bow_write);
+	ww->bc = bc;
+	ww->bio = bio;
+	queue_work(bc->workqueue, &ww->work);
+	return DM_MAPIO_SUBMITTED;
+}
+
+static int handle_sector0(struct bow_context *bc, struct bio *bio)
+{
+	int ret = DM_MAPIO_REMAPPED;
+
+	if (bio->bi_iter.bi_size > PAGE_SIZE) {
+		struct bio *split = bio_split(bio,
+					      1 << (PAGE_SHIFT - SECTOR_SHIFT),
+					      GFP_NOIO, fs_bio_set);
+		bio_chain(split, bio);
+		if (bio_data_dir(split) == WRITE) {
+			ret = queue_write(bc, split);
+			if (ret == DM_MAPIO_SUBMITTED)
+				ret = DM_MAPIO_REMAPPED;
+		} else {
+			submit_bio(split);
+		}
+	}
+
+	bio->bi_iter.bi_sector = find_sector0_current(bc)->sector;
+	return ret;
+}
+
+/****** dm interface ******/
+
+static int dm_bow_map(struct dm_target *ti, struct bio *bio)
+{
+	int ret = DM_MAPIO_REMAPPED;
+	struct bow_context *bc = ti->private;
+
+	if (atomic_read(&bc->state) != COMMITTED) {
+		enum state state;
+
+		mutex_lock(&bc->ranges_lock);
+		state = atomic_read(&bc->state);
+		if (state == TRIM && bio_op(bio) == REQ_OP_DISCARD) {
+			struct bow_range *br;
+
+			DMDEBUG("Trim: %lu, %u",
+			       bio->bi_iter.bi_sector,
+			       bio->bi_iter.bi_size);
+
+			br = find_first_overlapping_range(&bc->ranges,
+							  &bio->bi_iter);
+
+			/*
+			 * TODO - we assume this will always be an unmapped
+			 * block containing the whole bio
+			 */
+			BUG_ON(br->type != UNCHANGED);
+			if (!split_range(&bc->ranges, &br, &bio->bi_iter))
+				set_type(&bc->ranges, &br, TRIMMED);
+			bio_endio(bio);
+			ret = DM_MAPIO_SUBMITTED;
+		} else if (state == CHECKPOINT) {
+			if (bio->bi_iter.bi_sector == 0)
+				ret = handle_sector0(bc, bio);
+			else if (bio_data_dir(bio) == WRITE)
+				ret = queue_write(bc, bio);
+			else
+				/* passthrough */;
+		} else {
+			/* passthrough */
+		}
+		mutex_unlock(&bc->ranges_lock);
+	}
+	if (ret == DM_MAPIO_REMAPPED)
+		bio->bi_bdev = bc->dev->bdev;
+	return ret;
+}
+
+static void dm_bow_tablestatus(struct dm_target *ti, char *result,
+			       unsigned int maxlen)
+{
+	char *end = result + maxlen;
+	struct bow_context *bc = ti->private;
+	struct rb_node *i;
+
+	if (maxlen == 0)
+		return;
+	result[0] = 0;
+
+	if (!rb_first(&bc->ranges)) {
+		scnprintf(result, end - result, "ERROR: Empty ranges");
+		return;
+	}
+
+	if (container_of(rb_first(&bc->ranges), struct bow_range, node)
+	    ->sector) {
+		scnprintf(result, end - result,
+			 "ERROR: First range does not start at sector 0");
+		return;
+	}
+
+	for (i = rb_first(&bc->ranges); i; i = rb_next(i)) {
+		struct bow_range *br = container_of(i, struct bow_range, node);
+
+		result += scnprintf(result, end - result, "%s: %lu",
+				    readable_type[br->type], br->sector);
+		if (result >= end)
+			return;
+
+		result += scnprintf(result, end - result, "\n");
+		if (result >= end)
+			return;
+
+		switch (br->type) {
+		case TOP:
+			if (br->sector != ti->len) {
+				scnprintf(result, end - result,
+					 "\nERROR: Top sector is incorrect");
+			}
+
+			if (&br->node != rb_last(&bc->ranges)) {
+				scnprintf(result, end - result,
+					  "\nERROR: Top sector is not last");
+			}
+			return;
+
+		default:
+			break;
+		}
+
+		if (br->sector > range_top(br)) {
+			scnprintf(result, end - result,
+				  "\nERROR: sector too big");
+			return;
+		}
+	}
+}
+
+static void dm_bow_status(struct dm_target *ti, status_type_t type,
+			  unsigned int status_flags, char *result,
+			  unsigned int maxlen)
+{
+	switch (type) {
+	case STATUSTYPE_INFO:
+		result[0] = 0;
+		break;
+
+	case STATUSTYPE_TABLE:
+		dm_bow_tablestatus(ti, result, maxlen);
+		break;
+	}
+}
+
+int dm_bow_prepare_ioctl(struct dm_target *ti,
+			 struct block_device **bdev, fmode_t *mode)
+{
+	struct bow_context *bc = ti->private;
+	struct dm_dev *dev = bc->dev;
+
+	*bdev = dev->bdev;
+	/* Only pass ioctls through if the device sizes match exactly. */
+	return ti->len != i_size_read(dev->bdev->bd_inode) >> SECTOR_SHIFT;
+}
+
+static int dm_bow_iterate_devices(struct dm_target *ti,
+				  iterate_devices_callout_fn fn, void *data)
+{
+	struct bow_context *bc = ti->private;
+
+	return fn(ti, bc->dev, 0, ti->len, data);
+}
+
+static struct target_type bow_target = {
+	.name   = "bow",
+	.version = {1, 1, 1},
+	.module = THIS_MODULE,
+	.ctr    = dm_bow_ctr,
+	.dtr    = dm_bow_dtr,
+	.map    = dm_bow_map,
+	.status = dm_bow_status,
+	.prepare_ioctl  = dm_bow_prepare_ioctl,
+	.iterate_devices = dm_bow_iterate_devices,
+};
+
+int __init dm_bow_init(void)
+{
+	int r = dm_register_target(&bow_target);
+
+	if (r < 0)
+		DMERR("registering bow failed %d", r);
+	return r;
+}
+
+void dm_bow_exit(void)
+{
+	dm_unregister_target(&bow_target);
+}
+
+MODULE_LICENSE("GPL");
+
+module_init(dm_bow_init);
+module_exit(dm_bow_exit);