From patchwork Mon Mar 4 22:45:36 2019
X-Patchwork-Submitter: Paul Lawrence
X-Patchwork-Id: 10838743
Date: Mon, 4 Mar 2019 14:45:36 -0800
Message-Id: <20190304224536.178551-1-paullawrence@google.com>
From: Paul Lawrence
To: dm-devel@redhat.com
Cc: Paul Lawrence
Subject: [dm-devel] [PATCH v2][RFC] dm-bow working prototype
List-Id: device-mapper development

Second version of dm-bow. Key changes:

Support for block sizes other than 4k
Handles writes during trim phase, and overlapping trims
Numerous small bug fixes

bow == backup on write

USE CASE:

dm-bow takes a snapshot of an existing file system before mounting. The user
may, before removing the device, commit the snapshot. Alternatively the user
may remove the device and then run a command line utility to restore the
device to its original state.

dm-bow does not require an external device. dm-bow efficiently uses all the
available free space on the file system.

IMPLEMENTATION:

dm-bow can be in one of three states. In state one, the free blocks on the
device are identified by issuing an FSTRIM to the filesystem. In state two,
any writes cause the overwritten data to be backed up to the available free
space. While in this state, the device can be restored by unmounting the
filesystem, removing the dm-bow device and running a usermode tool over the
underlying device. In state three, the changes are committed, dm-bow is in
pass-through mode and the drive can no longer be restored.
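For illustration, the free-space discovery in state one relies on the
filesystem's FITRIM ioctl. A minimal userspace sketch of triggering it is
below (a hypothetical helper, not part of this patch; the target only
observes the discards that result):

/*
 * Hypothetical helper, not part of this patch: issue FITRIM on a mounted
 * filesystem so that dm-bow (still collecting trims) learns which blocks
 * are free.
 */
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FITRIM, struct fstrim_range */

static int trim_whole_fs(const char *mountpoint)
{
	struct fstrim_range range = {
		.start = 0,
		.len = UINT64_MAX,	/* trim everything that is free */
		.minlen = 0,
	};
	int fd = open(mountpoint, O_RDONLY);
	int ret;

	if (fd < 0)
		return -1;
	ret = ioctl(fd, FITRIM, &range);
	close(fd);
	return ret;
}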
It is planned to use this driver to enable restoration of a failed update
attempt on Android devices using ext4.

Test: Can boot Android with userdata mounted on this device. Can commit
userdata after SUW has run. Can then reboot, make changes and roll back.

Known issues:

Mutex is held around entire flush operation, including lengthy I/O. Plan is
to convert to state machine with pending queues.

Add a linked list of free ranges to avoid linear searches of the tree.
Similarly sector0_current should be stored in bow_context. There should be
no linear searches of the map.

Interaction with block encryption is unknown, especially with respect to
sector 0.

Signed-off-by: Paul Lawrence <paullawrence@google.com>
---
 Documentation/device-mapper/dm-bow.txt |   99 ++
 drivers/md/Kconfig                     |   12 +
 drivers/md/Makefile                    |    1 +
 drivers/md/dm-bow.c                    | 1170 ++++++++++++++++++++++++
 4 files changed, 1282 insertions(+)
 create mode 100644 Documentation/device-mapper/dm-bow.txt
 create mode 100644 drivers/md/dm-bow.c

diff --git a/Documentation/device-mapper/dm-bow.txt b/Documentation/device-mapper/dm-bow.txt
new file mode 100644
index 000000000000..e3fc4d22e0f4
--- /dev/null
+++ b/Documentation/device-mapper/dm-bow.txt
@@ -0,0 +1,99 @@
+dm_bow (backup on write)
+========================
+
+dm_bow is a device mapper driver that uses the free space on a device to back up
+data that is overwritten. The changes can then be committed by a simple state
+change, or rolled back by removing the dm_bow device and running a command line
+utility over the underlying device.
+
+dm_bow has three states, set by writing ‘1’ or ‘2’ to /sys/block/dm-?/bow/state.
+It is only possible to go from state 0 (initial state) to state 1, and then from
+state 1 to state 2.
+
+State 0: dm_bow collects all trims to the device and assumes that these mark
+free space on the overlying file system that can be safely used. Typically the
+mount code would create the dm_bow device, mount the file system, call the
+FITRIM ioctl on the file system then switch to state 1. These trims are not
+propagated to the underlying device.
+
+State 1: All writes to the device cause the underlying data to be backed up to
+the free (trimmed) area as needed in such a way as they can be restored.
+However, the writes, with one exception, then happen exactly as they would
+without dm_bow, so the device is always in a good final state. The exception is
+that sector 0 is used to keep a log of the latest changes, both to indicate that
+we are in this state and to allow rollback. See below for all details. If there
+isn't enough free space, writes are failed with -ENOSPC.
+
+State 2: The transition to state 2 triggers replacing the special sector 0 with
+the normal sector 0, and the freeing of all state information. dm_bow then
+becomes a pass-through driver, allowing the device to continue to be used with
+minimal performance impact.
+
+Usage
+=====
+dm-bow takes one command line parameter, the name of the underlying device.
+
+dm-bow will typically be used in the following way. dm-bow will be loaded with a
+suitable underlying device and the resultant device will be mounted. A file
+system trim will be issued via the FITRIM ioctl, then the device will be
+switched to state 1. The file system will now be used as normal. At some point,
+the changes can either be committed by switching to state 2, or rolled back by
+unmounting the file system, removing the dm-bow device and running the command
+line utility. Note that rebooting the device will be equivalent to unmounting
+and removing, but the command line utility must still be run.
+
+Details of operation in state 1
+===============================
+
+dm_bow maintains a type for all sectors. A sector can be any of:
+
+SECTOR0
+SECTOR0_CURRENT
+UNCHANGED
+FREE
+CHANGED
+BACKUP
+
+SECTOR0 is the first sector on the device, and is used to hold the log of
+changes. This is the one exception.
+
+SECTOR0_CURRENT is a sector picked from the FREE sectors, and is where reads and
+writes from the true sector zero are redirected to. Note that like any backup
+sector, if the sector is written to directly, it must be moved again.
+
+UNCHANGED means that the sector has not been changed since we entered state 1.
+Thus if it is written to or trimmed, the contents must first be backed up.
+
+FREE means that the sector was trimmed in state 0 and has not yet been written
+to or used for backup. On being written to, a FREE sector is changed to CHANGED.
+
+CHANGED means that the sector has been modified, and can be further modified
+without further backup.
+
+BACKUP means that this is a free sector being used as a backup. On being written
+to, the contents must first be backed up again.
+
+All backup operations are logged to the first sector. The log sector has the
+format:
+
+--------------------------------------------------------
+| Magic | Count | Sequence | Log entry | Log entry | …
+--------------------------------------------------------
+
+Magic is a magic number. Count is the number of log entries. Sequence is 0
+initially. A log entry is
+
+-----------------------------------
+| Source | Dest | Size | Checksum |
+-----------------------------------
+
+When SECTOR0 is full, the log sector is backed up and another empty log sector
+created with sequence number one higher. The first entry in any log sector with
+sequence > 0 therefore must be the log of the backing up of the previous log
+sector. Note that sequence is not strictly needed, but is a useful sanity check
+and potentially limits the time spent trying to restore a corrupted snapshot.
+
+On entering state 1, dm_bow has a list of free sectors. All other sectors are
+unchanged. Sector0_current is selected from the free sectors and the contents of
+sector 0 are copied there. The sector 0 is backed up, which triggers the first
+log entry to be written.
+
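To make the sysfs interface described above concrete, a userspace sequence
for the checkpoint and commit transitions might look like the sketch below.
This is not part of the patch; the device name dm-0 and the helper name are
assumptions.

/*
 * Sketch (not part of this patch): drive the dm-bow state machine from
 * userspace. The sysfs path assumes the bow device is dm-0.
 */
#include <stdio.h>

static int bow_set_state(char state)
{
	FILE *f = fopen("/sys/block/dm-0/bow/state", "w");
	int ret;

	if (!f)
		return -1;
	ret = (fputc(state, f) == EOF) ? -1 : 0;
	if (fclose(f) != 0)
		ret = -1;	/* the state change itself may have failed */
	return ret;
}

/*
 * Typical sequence: mount, FITRIM, then bow_set_state('1'); later either
 * bow_set_state('2') to commit, or unmount and run the restore utility.
 */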
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 3db222509e44..327e20424559 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -548,4 +548,16 @@ config DM_ZONED
 
 	  If unsure, say N.
 
+config DM_BOW
+	tristate "Backup block device"
+	depends on BLK_DEV_DM
+	select DM_BUFIO
+	---help---
+	  This device-mapper target takes a device and keeps a log of all
+	  changes using free blocks identified by issuing a trim command.
+	  This can then be restored by running a command line utility,
+	  or committed by simply replacing the target.
+
+	  If unsure, say N.
+
 endif # MD

diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 822f4e8753bc..f93731e760ee 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -68,6 +68,7 @@ obj-$(CONFIG_DM_LOG_WRITES)	+= dm-log-writes.o
 obj-$(CONFIG_DM_INTEGRITY)	+= dm-integrity.o
 obj-$(CONFIG_DM_ZONED)		+= dm-zoned.o
 obj-$(CONFIG_DM_WRITECACHE)	+= dm-writecache.o
+obj-$(CONFIG_DM_BOW)		+= dm-bow.o
 
 ifeq ($(CONFIG_DM_UEVENT),y)
 dm-mod-objs			+= dm-uevent.o
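The restore utility referred to above is not included in this patch. To make
the on-disk log format concrete: with 4096-byte blocks and a 64-bit sector_t,
the packed header is 28 bytes and each entry 24 bytes, so one log sector
holds up to 169 entries before it must itself be backed up. Below is a rough
sketch of the core of such a tool, mirroring the struct log_sector and
struct log_entry layout defined in dm-bow.c that follows. copy_back() is a
hypothetical helper, and the chained log sectors with sequence > 0 are left
out.

/* Hypothetical userspace restore sketch -- not part of this patch. */
#include <stdint.h>

struct log_entry {
	uint64_t source;	/* original location, in 512-byte sectors */
	uint64_t dest;		/* backup location, in 512-byte sectors */
	uint32_t size;		/* bytes */
	uint32_t checksum;
} __attribute__((packed));

struct log_sector {
	uint32_t magic;		/* 0x00574f42, "BOW" in ascii */
	uint16_t header_version;
	uint16_t header_size;
	uint32_t block_size;
	uint32_t count;
	uint32_t sequence;
	uint64_t sector0;
	struct log_entry entries[];
} __attribute__((packed));

/* Assumed helper: copy size bytes from byte offset src to dst on the device */
extern int copy_back(int fd, uint64_t dst, uint64_t src, uint32_t size);

/*
 * Undo every backup recorded in one log sector, newest first, so that each
 * source range gets its original contents back.
 */
static int replay_log_sector(int fd, const struct log_sector *log)
{
	uint32_t i;

	if (log->magic != 0x00574f42)
		return -1;

	for (i = log->count; i-- > 0; ) {
		const struct log_entry *e = &log->entries[i];

		if (copy_back(fd, e->source * 512, e->dest * 512, e->size))
			return -1;
	}
	return 0;
}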
diff --git a/drivers/md/dm-bow.c b/drivers/md/dm-bow.c
new file mode 100644
index 000000000000..ce002aa1fbd5
--- /dev/null
+++ b/drivers/md/dm-bow.c
@@ -0,0 +1,1170 @@
+/*
+ * Copyright (C) 2018 Google Limited.
+ *
+ * This file is released under the GPL.
+ */
+
+#include "dm.h"
+#include "dm-core.h"
+
+#include <linux/crc32.h>
+#include <linux/module.h>
+#include <linux/device-mapper.h>
+
+#define DM_MSG_PREFIX "bow"
+
+struct log_entry {
+	u64 source;
+	u64 dest;
+	u32 size;
+	u32 checksum;
+} __packed;
+
+struct log_sector {
+	u32 magic;
+	u16 header_version;
+	u16 header_size;
+	u32 block_size;
+	u32 count;
+	u32 sequence;
+	sector_t sector0;
+	struct log_entry entries[];
+} __packed;
+
+/*
+ * MAGIC is BOW in ascii
+ */
+#define MAGIC 0x00574f42
+#define HEADER_VERSION 0x0100
+
+/*
+ * A sorted set of ranges representing the state of the data on the device.
+ * Use an rb_tree for fast lookup of a given sector
+ * Consecutive ranges are always of different type - operations on this
+ * set must merge matching consecutive ranges.
+ *
+ * Top range is always of type TOP
+ */
+struct bow_range {
+	struct rb_node		node;
+	sector_t		sector;
+	enum {
+		SECTOR0,	/* First sector - holds log record */
+		SECTOR0_CURRENT,/* Live contents of sector0 */
+		UNCHANGED,	/* Original contents */
+		TRIMMED,	/* Range has been trimmed */
+		CHANGED,	/* Range has been changed */
+		BACKUP,		/* Range is being used as a backup */
+		TOP,		/* Final range - sector is size of device */
	} type;
+};
+
+static const char * const readable_type[] = {
+	"Sector0",
+	"Sector0_current",
+	"Unchanged",
+	"Free",
+	"Changed",
+	"Backup",
+	"Top",
+};
+
+enum state {
+	TRIM,
+	CHECKPOINT,
+	COMMITTED,
+};
+
+struct bow_context {
+	struct dm_dev *dev;
+	u32 block_size;
+	u32 block_shift;
+	struct workqueue_struct *workqueue;
+	struct dm_bufio_client *bufio;
+	struct mutex ranges_lock; /* Hold to access this struct and/or ranges */
+	struct rb_root ranges;
+	struct dm_kobject_holder kobj_holder; /* for sysfs attributes */
+	atomic_t state; /* One of the enum state values above */
+	u64 trims_total;
+	struct log_sector *log_sector;
+};
+
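+/*
+ * Ranges store only their start sector; each range ends where its
+ * successor begins. For example, a 1000 sector device might be divided:
+ *
+ *	[0 SECTOR0) [8 SECTOR0_CURRENT) [16 UNCHANGED) [1000 TOP]
+ *
+ * range_top() is therefore just the successor's start sector, and the
+ * TOP sentinel guarantees every real range has a successor.
+ */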
+sector_t range_top(struct bow_range *br)
+{
+	return container_of(rb_next(&br->node), struct bow_range, node)
+		->sector;
+}
+
+unsigned int range_size(struct bow_range *br)
+{
+	return (range_top(br) - br->sector) * SECTOR_SIZE;
+}
+
+static sector_t bvec_top(struct bvec_iter *bi_iter)
+{
+	return bi_iter->bi_sector + bi_iter->bi_size / SECTOR_SIZE;
+}
+
+/*
+ * Find the first range that overlaps with bi_iter
+ * bi_iter is set to the size of the overlapping sub-range
+ */
+static struct bow_range *find_first_overlapping_range(struct rb_root *ranges,
+						      struct bvec_iter *bi_iter)
+{
+	struct rb_node *node = ranges->rb_node;
+	struct bow_range *br;
+
+	while (node) {
+		br = container_of(node, struct bow_range, node);
+
+		if (br->sector <= bi_iter->bi_sector
+		    && bi_iter->bi_sector < range_top(br))
+			break;
+
+		if (bi_iter->bi_sector < br->sector)
+			node = node->rb_left;
+		else
+			node = node->rb_right;
+	}
+
+	WARN_ON(!node);
+	if (!node)
+		return NULL;
+
+	if (range_top(br) - bi_iter->bi_sector
+	    < bi_iter->bi_size >> SECTOR_SHIFT)
+		bi_iter->bi_size = (range_top(br) - bi_iter->bi_sector)
+			<< SECTOR_SHIFT;
+
+	return br;
+}
+
+void add_before(struct rb_root *ranges, struct bow_range *new_br,
+		struct bow_range *existing)
+{
+	struct rb_node *parent = &(existing->node);
+	struct rb_node **link = &(parent->rb_left);
+
+	while (*link) {
+		parent = *link;
+		link = &((*link)->rb_right);
+	}
+
+	rb_link_node(&new_br->node, parent, link);
+	rb_insert_color(&new_br->node, ranges);
+}
+
+/*
+ * Given a range br returned by find_first_overlapping_range, split br into a
+ * leading range, a range matching the bi_iter and a trailing range.
+ * Leading and trailing may end up size 0 and will then be deleted. The
+ * new range matching the bi_iter is then returned and should have its type
+ * and type specific fields populated.
+ * If bi_iter runs off the end of the range, bi_iter is truncated accordingly
+ */
+static int split_range(struct rb_root *ranges, struct bow_range **br,
+		       struct bvec_iter *bi_iter)
+{
+	struct bow_range *new_br;
+
+	if (bi_iter->bi_sector < (*br)->sector) {
+		WARN_ON(true);
+		return BLK_STS_IOERR;
+	}
+
+	if (bi_iter->bi_sector > (*br)->sector) {
+		struct bow_range *leading_br =
+			kzalloc(sizeof(*leading_br), GFP_KERNEL);
+
+		if (!leading_br)
+			return BLK_STS_RESOURCE;
+
+		*leading_br = **br;
+		add_before(ranges, leading_br, *br);
+		(*br)->sector = bi_iter->bi_sector;
+	}
+
+	if (bvec_top(bi_iter) >= range_top(*br)) {
+		bi_iter->bi_size = (range_top(*br) - (*br)->sector)
+			* SECTOR_SIZE;
+		return BLK_STS_OK;
+	}
+
+	/* new_br will be the beginning, existing br will be the tail */
+	new_br = kzalloc(sizeof(*new_br), GFP_KERNEL);
+	if (!new_br)
+		return BLK_STS_RESOURCE;
+
+	new_br->sector = (*br)->sector;
+	(*br)->sector = bvec_top(bi_iter);
+	add_before(ranges, new_br, *br);
+	*br = new_br;
+
+	return BLK_STS_OK;
+}
+
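+/*
+ * Example of split_range() behaviour: splitting an UNCHANGED range that
+ * starts at sector 0 with a bi_iter covering sectors [8, 24) leaves three
+ * ranges: [0, 8) UNCHANGED (the leading split), [8, 24) returned through
+ * *br ready for retyping, and [24, top) UNCHANGED. A bi_iter running past
+ * range_top() is truncated rather than split.
+ */
+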
+/*
+ * Sets type of a range. May merge range into surrounding ranges
+ * Since br may be invalidated, always sets br to NULL to prevent
+ * usage after this is called
+ */
+static void set_type(struct bow_context *bc, struct bow_range **br, int type)
+{
+	struct bow_range *prev = container_of(rb_prev(&(*br)->node),
+					      struct bow_range, node);
+	struct bow_range *next = container_of(rb_next(&(*br)->node),
+					      struct bow_range, node);
+
+	if ((*br)->type == TRIMMED)
+		bc->trims_total -= range_size(*br);
+
+	if (type == TRIMMED)
+		bc->trims_total += range_size(*br);
+
+	(*br)->type = type;
+
+	if (next->type == type) {
+		rb_erase(&next->node, &bc->ranges);
+		kfree(next);
+	}
+
+	if (prev->type == type) {
+		rb_erase(&(*br)->node, &bc->ranges);
+		kfree(*br);
+	}
+
+	*br = NULL;
+}
+
+static struct bow_range *find_free_range(struct rb_root *ranges)
+{
+	struct rb_node *i;
+
+	for (i = rb_first(ranges); i; i = rb_next(i)) {
+		struct bow_range *br = container_of(i, struct bow_range, node);
+
+		if (br->type == TRIMMED)
+			return br;
+	}
+
+	DMERR("Unable to find free space to back up to");
+	return NULL;
+}
+
+static sector_t sector_to_page(struct bow_context const *bc, sector_t sector)
+{
+	WARN_ON(sector % (bc->block_size / SECTOR_SIZE) != 0);
+	return sector >> (bc->block_shift - SECTOR_SHIFT);
+}
+
+static int copy_data(struct bow_context const *bc,
+		     struct bow_range *source, struct bow_range *dest,
+		     u32 *checksum)
+{
+	int i;
+
+	if (range_size(source) != range_size(dest)) {
+		WARN_ON(1);
+		return BLK_STS_IOERR;
+	}
+
+	if (checksum)
+		*checksum = sector_to_page(bc, source->sector);
+
+	for (i = 0; i < range_size(source) >> bc->block_shift; ++i) {
+		struct dm_buffer *read_buffer, *write_buffer;
+		u8 *read, *write;
+		sector_t page = sector_to_page(bc, source->sector) + i;
+
+		read = dm_bufio_read(bc->bufio, page, &read_buffer);
+		if (IS_ERR(read)) {
+			DMERR("Cannot read page %lu", page);
+			dm_bufio_release(read_buffer);
+			return PTR_ERR(read);
+		}
+
+		if (checksum)
+			*checksum = crc32(*checksum, read, bc->block_size);
+
+		write = dm_bufio_new(bc->bufio,
+				     sector_to_page(bc, dest->sector) + i,
+				     &write_buffer);
+		if (IS_ERR(write)) {
+			DMERR("Cannot write sector");
+			dm_bufio_release(write_buffer);
+			dm_bufio_release(read_buffer);
+			return PTR_ERR(write);
+		}
+
+		memcpy(write, read, bc->block_size);
+
+		dm_bufio_mark_buffer_dirty(write_buffer);
+		dm_bufio_release(write_buffer);
+		dm_bufio_release(read_buffer);
+	}
+
+	dm_bufio_write_dirty_buffers(bc->bufio);
+	return BLK_STS_OK;
+}
+
+/****** logging functions ******/
+
+static int add_log_entry(struct bow_context *bc, sector_t source, sector_t dest,
+			 unsigned int size, u32 checksum);
+
+static int backup_log_sector(struct bow_context *bc)
+{
+	struct bow_range *first_br, *free_br;
+	struct bvec_iter bi_iter;
+	u32 checksum = 0;
+	int ret;
+
+	first_br = container_of(rb_first(&bc->ranges), struct bow_range, node);
+
+	if (first_br->type != SECTOR0) {
+		WARN_ON(1);
+		return BLK_STS_IOERR;
+	}
+
+	if (range_size(first_br) != bc->block_size) {
+		WARN_ON(1);
+		return BLK_STS_IOERR;
+	}
+
+	free_br = find_free_range(&bc->ranges);
+	/* No space left - return this error to userspace */
+	if (!free_br)
+		return BLK_STS_NOSPC;
+	bi_iter.bi_sector = free_br->sector;
+	bi_iter.bi_size = bc->block_size;
+	ret = split_range(&bc->ranges, &free_br, &bi_iter);
+	if (ret)
+		return ret;
+	if (bi_iter.bi_size != bc->block_size) {
+		WARN_ON(1);
+		return BLK_STS_IOERR;
+	}
+
+	ret = copy_data(bc, first_br, free_br, &checksum);
+	if (ret)
+		return ret;
+
+	bc->log_sector->count = 0;
+	bc->log_sector->sequence++;
+	ret = add_log_entry(bc, first_br->sector, free_br->sector,
+			    range_size(first_br), checksum);
+	if (ret)
+		return ret;
+
+	set_type(bc, &free_br, BACKUP);
+	return BLK_STS_OK;
+}
+
+static int add_log_entry(struct bow_context *bc, sector_t source, sector_t dest,
+			 unsigned int size, u32 checksum)
+{
+	struct dm_buffer *sector_buffer;
+	u8 *sector;
+
+	if (sizeof(struct log_sector)
+	    + sizeof(struct log_entry) * (bc->log_sector->count + 1)
+	    > bc->block_size) {
+		int ret = backup_log_sector(bc);
+
+		if (ret)
+			return ret;
+	}
+
+	sector = dm_bufio_new(bc->bufio, 0, &sector_buffer);
+	if (IS_ERR(sector)) {
+		DMERR("Cannot write boot sector");
+		dm_bufio_release(sector_buffer);
+		return BLK_STS_NOSPC;
+	}
+
+	bc->log_sector->entries[bc->log_sector->count].source = source;
+	bc->log_sector->entries[bc->log_sector->count].dest = dest;
+	bc->log_sector->entries[bc->log_sector->count].size = size;
+	bc->log_sector->entries[bc->log_sector->count].checksum = checksum;
+	bc->log_sector->count++;
+
+	memcpy(sector, bc->log_sector, bc->block_size);
+	dm_bufio_mark_buffer_dirty(sector_buffer);
+	dm_bufio_release(sector_buffer);
+	dm_bufio_write_dirty_buffers(bc->bufio);
+	return BLK_STS_OK;
+}
+
+static int prepare_log(struct bow_context *bc)
+{
+	struct bow_range *free_br, *first_br;
+	struct bvec_iter bi_iter;
+	u32 checksum = 0;
+	int ret;
+
+	/* Carve out first sector as log sector */
+	first_br = container_of(rb_first(&bc->ranges), struct bow_range, node);
+	if (first_br->type != UNCHANGED) {
+		WARN_ON(1);
+		return BLK_STS_IOERR;
+	}
+	if (range_size(first_br) < bc->block_size) {
+		WARN_ON(1);
+		return BLK_STS_IOERR;
+	}
+	bi_iter.bi_sector = 0;
+	bi_iter.bi_size = bc->block_size;
+	ret = split_range(&bc->ranges, &first_br, &bi_iter);
+	if (ret)
+		return ret;
+	first_br->type = SECTOR0;
+	if (range_size(first_br) != bc->block_size) {
+		WARN_ON(1);
+		return BLK_STS_IOERR;
+	}
+
+	/* Find free sector for active sector0 reads/writes */
+	free_br = find_free_range(&bc->ranges);
+	if (!free_br)
+		return BLK_STS_NOSPC;
+	bi_iter.bi_sector = free_br->sector;
+	bi_iter.bi_size = bc->block_size;
+	ret = split_range(&bc->ranges, &free_br, &bi_iter);
+	if (ret)
+		return ret;
+	free_br->type = SECTOR0_CURRENT;
+
+	/* Copy data */
+	ret = copy_data(bc, first_br, free_br, NULL);
+	if (ret)
+		return ret;
+
+	bc->log_sector->sector0 = free_br->sector;
+
+	/* Find free sector to back up original sector zero */
+	free_br = find_free_range(&bc->ranges);
+	if (!free_br)
+		return BLK_STS_NOSPC;
+	bi_iter.bi_sector = free_br->sector;
+	bi_iter.bi_size = bc->block_size;
+	ret = split_range(&bc->ranges, &free_br, &bi_iter);
+	if (ret)
+		return ret;
+
+	/* Back up */
+	ret = copy_data(bc, first_br, free_br, &checksum);
+	if (ret)
+		return ret;
+
+	/*
+	 * Set up our replacement boot sector - it will get written when we
+	 * add the first log entry, which we do immediately
+	 */
+	bc->log_sector->magic = MAGIC;
+	bc->log_sector->header_version = HEADER_VERSION;
+	bc->log_sector->header_size = sizeof(*bc->log_sector);
+	bc->log_sector->block_size = bc->block_size;
+	bc->log_sector->count = 0;
+	bc->log_sector->sequence = 0;
+
+	/* Add log entry */
+	ret = add_log_entry(bc, first_br->sector, free_br->sector,
+			    range_size(first_br), checksum);
+	if (ret)
+		return ret;
+
+	set_type(bc, &free_br, BACKUP);
+	return BLK_STS_OK;
+}
+
+static struct bow_range *find_sector0_current(struct bow_context *bc)
+{
+	struct bvec_iter bi_iter;
+
+	bi_iter.bi_sector = bc->log_sector->sector0;
+	bi_iter.bi_size = bc->block_size;
+	return find_first_overlapping_range(&bc->ranges, &bi_iter);
+}
+
+/****** sysfs interface functions ******/
+
+static ssize_t state_show(struct kobject *kobj, struct kobj_attribute *attr,
+			  char *buf)
+{
+	struct bow_context *bc = container_of(kobj, struct bow_context,
+					      kobj_holder.kobj);
+
+	return scnprintf(buf, PAGE_SIZE, "%d\n", atomic_read(&bc->state));
+}
+
+static ssize_t state_store(struct kobject *kobj, struct kobj_attribute *attr,
+			   const char *buf, size_t count)
+{
+	struct bow_context *bc = container_of(kobj, struct bow_context,
+					      kobj_holder.kobj);
+	enum state state, original_state;
+	int ret;
+
+	state = buf[0] - '0';
+	if (state < TRIM || state > COMMITTED) {
+		DMERR("State value %d out of range", state);
+		return -EINVAL;
+	}
+
+	mutex_lock(&bc->ranges_lock);
+	original_state = atomic_read(&bc->state);
+	if (state != original_state + 1) {
+		DMERR("Invalid state change from %d to %d",
+		      original_state, state);
+		ret = -EINVAL;
+		goto bad;
+	}
+
+	DMINFO("Switching to state %s", state == CHECKPOINT ? "Checkpoint"
+	       : state == COMMITTED ? "Committed" : "Unknown");
+
+	if (state == CHECKPOINT) {
+		ret = prepare_log(bc);
+		if (ret) {
+			DMERR("Failed to switch to checkpoint state");
+			goto bad;
+		}
+	} else if (state == COMMITTED) {
+		struct bow_range *br = find_sector0_current(bc);
+		struct bow_range *sector0_br =
+			container_of(rb_first(&bc->ranges), struct bow_range,
+				     node);
+
+		ret = copy_data(bc, br, sector0_br, 0);
+		if (ret) {
+			DMERR("Failed to switch to committed state");
+			goto bad;
+		}
+	}
+	atomic_inc(&bc->state);
+	ret = count;
+
+bad:
+	mutex_unlock(&bc->ranges_lock);
+	return ret;
+}
+
+static ssize_t free_show(struct kobject *kobj, struct kobj_attribute *attr,
+			 char *buf)
+{
+	struct bow_context *bc = container_of(kobj, struct bow_context,
+					      kobj_holder.kobj);
+	u64 trims_total;
+
+	mutex_lock(&bc->ranges_lock);
+	trims_total = bc->trims_total;
+	mutex_unlock(&bc->ranges_lock);
+
+	return scnprintf(buf, PAGE_SIZE, "%llu\n", trims_total);
+}
+
+static struct kobj_attribute attr_state = __ATTR_RW(state);
+static struct kobj_attribute attr_free = __ATTR_RO(free);
+
+static struct attribute *bow_attrs[] = {
+	&attr_state.attr,
+	&attr_free.attr,
+	NULL
+};
+
+static struct kobj_type bow_ktype = {
+	.sysfs_ops = &kobj_sysfs_ops,
+	.default_attrs = bow_attrs,
+	.release = dm_kobject_release
+};
+
+/****** constructor/destructor ******/
+
+static void dm_bow_dtr(struct dm_target *ti)
+{
+	struct bow_context *bc = (struct bow_context *) ti->private;
+	struct kobject *kobj;
+
+	while (rb_first(&bc->ranges)) {
+		struct bow_range *br = container_of(rb_first(&bc->ranges),
+						    struct bow_range, node);
+
+		rb_erase(&br->node, &bc->ranges);
+		kfree(br);
+	}
+	if (bc->workqueue)
+		destroy_workqueue(bc->workqueue);
+	if (bc->bufio)
+		dm_bufio_client_destroy(bc->bufio);
+
+	kobj = &bc->kobj_holder.kobj;
+	if (kobj->state_initialized) {
+		kobject_put(kobj);
+		wait_for_completion(dm_get_completion_from_kobject(kobj));
+	}
+
+	kfree(bc->log_sector);
+	kfree(bc);
+}
+
+static int dm_bow_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+{
+	struct bow_context *bc;
+	struct bow_range *br;
+	int ret;
+	struct mapped_device *md = dm_table_get_md(ti->table);
+
+	if (argc != 1) {
+		ti->error = "Invalid argument count";
+		return -EINVAL;
+	}
+
+	bc = kzalloc(sizeof(*bc), GFP_KERNEL);
+	if (!bc) {
+		ti->error = "Cannot allocate bow context";
+		return -ENOMEM;
+	}
+
+	ti->num_flush_bios = 1;
+	ti->num_discard_bios = 1;
+	ti->num_write_same_bios = 1;
+	ti->private = bc;
+
+	ret = dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
+			    &bc->dev);
+	if (ret) {
+		ti->error = "Device lookup failed";
+		goto bad;
+	}
+
+	bc->block_size = bc->dev->bdev->bd_queue->limits.logical_block_size;
+	bc->block_shift = ilog2(bc->block_size);
+	bc->log_sector = kzalloc(bc->block_size, GFP_KERNEL);
+	if (!bc->log_sector) {
+		ti->error = "Cannot allocate log sector";
+		goto bad;
+	}
+
+	init_completion(&bc->kobj_holder.completion);
+	ret = kobject_init_and_add(&bc->kobj_holder.kobj, &bow_ktype,
+				   &disk_to_dev(dm_disk(md))->kobj, "%s",
+				   "bow");
+	if (ret) {
+		ti->error = "Cannot create sysfs node";
+		goto bad;
+	}
+
+	mutex_init(&bc->ranges_lock);
+	bc->ranges = RB_ROOT;
+	bc->bufio = dm_bufio_client_create(bc->dev->bdev, bc->block_size, 1, 0,
+					   NULL, NULL);
+	if (IS_ERR(bc->bufio)) {
+		ti->error = "Cannot initialize dm-bufio";
+		ret = PTR_ERR(bc->bufio);
+		bc->bufio = NULL;
+		goto bad;
+	}
+
+	bc->workqueue = alloc_workqueue("dm-bow",
+					WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM
+					| WQ_UNBOUND, num_online_cpus());
+	if (!bc->workqueue) {
+		ti->error = "Cannot allocate workqueue";
+		ret = -ENOMEM;
+		goto bad;
+	}
+
+	br = kzalloc(sizeof(*br), GFP_KERNEL);
+	if (!br) {
+		ti->error = "Cannot allocate ranges";
+		ret = -ENOMEM;
+		goto bad;
+	}
+
+	br->sector = ti->len;
+	br->type = TOP;
+	rb_link_node(&br->node, NULL, &bc->ranges.rb_node);
+	rb_insert_color(&br->node, &bc->ranges);
+
+	br = kzalloc(sizeof(*br), GFP_KERNEL);
+	if (!br) {
+		ti->error = "Cannot allocate ranges";
+		ret = -ENOMEM;
+		goto bad;
+	}
+
+	br->sector = 0;
+	br->type = UNCHANGED;
+	rb_link_node(&br->node, bc->ranges.rb_node,
+		     &bc->ranges.rb_node->rb_left);
+	rb_insert_color(&br->node, &bc->ranges);
+
+	ti->discards_supported = true;
+
+	return 0;
+
+bad:
+	dm_bow_dtr(ti);
+	return ret;
+}
+
+/****** Handle writes ******/
+
+static int prepare_unchanged_range(struct bow_context *bc, struct bow_range *br,
+				   struct bvec_iter *bi_iter,
+				   bool record_checksum)
+{
+	struct bow_range *backup_br;
+	struct bvec_iter backup_bi;
+	sector_t log_source, log_dest;
+	unsigned int log_size;
+	u32 checksum = 0;
+	int ret;
+	int original_type;
+	sector_t sector0;
+
+	/* Find a free range */
+	backup_br = find_free_range(&bc->ranges);
+	if (!backup_br)
+		return BLK_STS_NOSPC;
+
+	/* Carve out a backup range. This may be smaller than the br given */
+	backup_bi.bi_sector = backup_br->sector;
+	backup_bi.bi_size = min(range_size(backup_br), bi_iter->bi_size);
+	ret = split_range(&bc->ranges, &backup_br, &backup_bi);
+	if (ret)
+		return ret;
+
+	/*
+	 * Carve out a changed range. This will not be smaller than the backup
+	 * br since the backup br is smaller than the source range and iterator
+	 */
+	bi_iter->bi_size = backup_bi.bi_size;
+	ret = split_range(&bc->ranges, &br, bi_iter);
+	if (ret)
+		return ret;
+	if (range_size(br) != range_size(backup_br)) {
+		WARN_ON(1);
+		return BLK_STS_IOERR;
+	}
+
+	/* Copy data over */
+	ret = copy_data(bc, br, backup_br, record_checksum ? &checksum : NULL);
+	if (ret)
+		return ret;
+
+	/* Add an entry to the log */
+	log_source = br->sector;
+	log_dest = backup_br->sector;
+	log_size = range_size(br);
+
+	/*
+	 * Set the types. Note that since set_type also amalgamates ranges
+	 * we have to set both sectors to their final type before calling
+	 * set_type on either
+	 */
+	original_type = br->type;
+	sector0 = backup_br->sector;
+	backup_br->type = br->type == SECTOR0_CURRENT ? SECTOR0_CURRENT
+						      : BACKUP;
+	br->type = CHANGED;
+	set_type(bc, &backup_br, backup_br->type);
+
+	/*
+	 * Add the log entry after marking the backup sector, since adding a log
+	 * can cause another backup
+	 */
+	ret = add_log_entry(bc, log_source, log_dest, log_size, checksum);
+	if (ret) {
+		br->type = original_type;
+		return ret;
+	}
+
+	/* Now it is safe to mark this backup successful */
+	if (original_type == SECTOR0_CURRENT)
+		bc->log_sector->sector0 = sector0;
+
+	set_type(bc, &br, br->type);
+	return ret;
+}
+
+static int prepare_free_range(struct bow_context *bc, struct bow_range *br,
+			      struct bvec_iter *bi_iter)
+{
+	int ret;
+
+	ret = split_range(&bc->ranges, &br, bi_iter);
+	if (ret)
+		return ret;
+	set_type(bc, &br, CHANGED);
+	return BLK_STS_OK;
+}
+
+static int prepare_changed_range(struct bow_context *bc, struct bow_range *br,
+				 struct bvec_iter *bi_iter)
+{
+	/* Nothing to do ... */
+	return BLK_STS_OK;
+}
+
+static int prepare_one_range(struct bow_context *bc,
+			     struct bvec_iter *bi_iter)
+{
+	struct bow_range *br = find_first_overlapping_range(&bc->ranges,
+							    bi_iter);
+	switch (br->type) {
+	case CHANGED:
+		return prepare_changed_range(bc, br, bi_iter);
+
+	case TRIMMED:
+		return prepare_free_range(bc, br, bi_iter);
+
+	case UNCHANGED:
+	case BACKUP:
+		return prepare_unchanged_range(bc, br, bi_iter, true);
+
+	/*
+	 * We cannot track the checksum for the active sector0, since it
+	 * may change at any point.
+	 */
+	case SECTOR0_CURRENT:
+		return prepare_unchanged_range(bc, br, bi_iter, false);
+
+	case SECTOR0:	/* Handled in the dm_bow_map */
+	case TOP:	/* Illegal - top is off the end of the device */
+	default:
+		WARN_ON(1);
+		return BLK_STS_IOERR;
+	}
+}
+
+struct write_work {
+	struct work_struct work;
+	struct bow_context *bc;
+	struct bio *bio;
+};
+
+static void bow_write(struct work_struct *work)
+{
+	struct write_work *ww = container_of(work, struct write_work, work);
+	struct bow_context *bc = ww->bc;
+	struct bio *bio = ww->bio;
+	struct bvec_iter bi_iter = bio->bi_iter;
+	int ret = BLK_STS_OK;
+
+	kfree(ww);
+
+	mutex_lock(&bc->ranges_lock);
+	do {
+		ret = prepare_one_range(bc, &bi_iter);
+		bi_iter.bi_sector += bi_iter.bi_size / SECTOR_SIZE;
+		bi_iter.bi_size = bio->bi_iter.bi_size
+			- (bi_iter.bi_sector - bio->bi_iter.bi_sector)
+			* SECTOR_SIZE;
+	} while (!ret && bi_iter.bi_size);
+
+	mutex_unlock(&bc->ranges_lock);
+
+	if (!ret) {
+		bio_set_dev(bio, bc->dev->bdev);
+		submit_bio(bio);
+	} else {
+		DMERR("Write failure with error %d", -ret);
+		bio->bi_status = ret;
+		bio_endio(bio);
+	}
+}
+
+static int queue_write(struct bow_context *bc, struct bio *bio)
+{
+	struct write_work *ww = kmalloc(sizeof(*ww), GFP_NOIO | __GFP_NORETRY
+					| __GFP_NOMEMALLOC | __GFP_NOWARN);
+	if (!ww) {
+		DMERR("Failed to allocate write_work");
+		return -ENOMEM;
+	}
+
+	INIT_WORK(&ww->work, bow_write);
+	ww->bc = bc;
+	ww->bio = bio;
+	queue_work(bc->workqueue, &ww->work);
+	return DM_MAPIO_SUBMITTED;
+}
+
+static int handle_sector0(struct bow_context *bc, struct bio *bio)
+{
+	int ret = DM_MAPIO_REMAPPED;
+
+	if (bio->bi_iter.bi_size > bc->block_size) {
+		struct bio *split = bio_split(bio,
+					      1 << (PAGE_SHIFT - SECTOR_SHIFT),
+					      GFP_NOIO, &fs_bio_set);
+		if (!split) {
+			DMERR("Failed to split bio");
+			bio->bi_status = BLK_STS_RESOURCE;
+			bio_endio(bio);
+			return DM_MAPIO_SUBMITTED;
+		}
+
+		bio_chain(split, bio);
+		split->bi_iter.bi_sector = bc->log_sector->sector0;
+		bio_set_dev(split, bc->dev->bdev);
+		submit_bio(split);
+
+		if (bio_data_dir(bio) == WRITE)
+			ret = queue_write(bc, bio);
+	} else {
+		bio->bi_iter.bi_sector = bc->log_sector->sector0;
+	}
+
+	return ret;
+}
+
+static int add_trim(struct bow_context *bc, struct bio *bio)
+{
+	struct bow_range *br;
+	struct bvec_iter bi_iter = bio->bi_iter;
+
+	DMDEBUG("add_trim: %lu, %u",
+		bio->bi_iter.bi_sector, bio->bi_iter.bi_size);
+
+	do {
+		br = find_first_overlapping_range(&bc->ranges, &bi_iter);
+
+		switch (br->type) {
+		case UNCHANGED:
+			if (!split_range(&bc->ranges, &br, &bi_iter))
+				set_type(bc, &br, TRIMMED);
+			break;
+
+		case TRIMMED:
+			/* Nothing to do */
+			break;
+
+		default:
+			/* No other case is legal in TRIM state */
+			WARN_ON(true);
+			break;
+		}
+
+		bi_iter.bi_sector += bi_iter.bi_size / SECTOR_SIZE;
+		bi_iter.bi_size = bio->bi_iter.bi_size
+			- (bi_iter.bi_sector - bio->bi_iter.bi_sector)
+			* SECTOR_SIZE;
+
+	} while (bi_iter.bi_size);
+
+	bio_endio(bio);
+	return DM_MAPIO_SUBMITTED;
+}
+
+static int remove_trim(struct bow_context *bc, struct bio *bio)
+{
+	struct bow_range *br;
+	struct bvec_iter bi_iter = bio->bi_iter;
+
+	DMDEBUG("remove_trim: %lu, %u",
+		bio->bi_iter.bi_sector, bio->bi_iter.bi_size);
+
+	do {
+		br = find_first_overlapping_range(&bc->ranges, &bi_iter);
+
+		switch (br->type) {
+		case UNCHANGED:
+			/* Nothing to do */
+			break;
+
+		case TRIMMED:
+			if (!split_range(&bc->ranges, &br, &bi_iter))
+				set_type(bc, &br, UNCHANGED);
+			break;
+
+		default:
+			/* No other case is legal in TRIM state */
+			WARN_ON(true);
+			break;
+		}
+
+		bi_iter.bi_sector += bi_iter.bi_size / SECTOR_SIZE;
+		bi_iter.bi_size = bio->bi_iter.bi_size
+			- (bi_iter.bi_sector - bio->bi_iter.bi_sector)
+			* SECTOR_SIZE;
+
+	} while (bi_iter.bi_size);
+
+	return DM_MAPIO_REMAPPED;
+}
+
+/****** dm interface ******/
+
+static int dm_bow_map(struct dm_target *ti, struct bio *bio)
+{
+	int ret = DM_MAPIO_REMAPPED;
+	struct bow_context *bc = ti->private;
+
+	if (likely(bc->state.counter == COMMITTED)) {
+		bio_set_dev(bio, bc->dev->bdev);
+		return DM_MAPIO_REMAPPED;
+	}
+
+	if (atomic_read(&bc->state) != COMMITTED) {
+		enum state state;
+
+		mutex_lock(&bc->ranges_lock);
+		state = atomic_read(&bc->state);
+		if (state == TRIM) {
+			if (bio_op(bio) == REQ_OP_DISCARD)
+				ret = add_trim(bc, bio);
+			else if (bio_data_dir(bio) == WRITE)
+				ret = remove_trim(bc, bio);
+			else
+				/* passthrough */;
+		} else if (state == CHECKPOINT) {
+			if (bio->bi_iter.bi_sector == 0)
+				ret = handle_sector0(bc, bio);
+			else if (bio_data_dir(bio) == WRITE)
+				ret = queue_write(bc, bio);
+			else
+				/* passthrough */;
+		} else {
+			/* passthrough */
+		}
+		mutex_unlock(&bc->ranges_lock);
+	}
+	if (ret == DM_MAPIO_REMAPPED)
+		bio_set_dev(bio, bc->dev->bdev);
+	return ret;
+}
+
+static void dm_bow_tablestatus(struct dm_target *ti, char *result,
+			       unsigned int maxlen)
+{
+	char *end = result + maxlen;
+	struct bow_context *bc = ti->private;
+	struct rb_node *i;
+
+	if (maxlen == 0)
+		return;
+	result[0] = 0;
+
+	if (!rb_first(&bc->ranges)) {
+		scnprintf(result, end - result, "ERROR: Empty ranges");
+		return;
+	}
+
+	if (container_of(rb_first(&bc->ranges), struct bow_range, node)
+	    ->sector) {
+		scnprintf(result, end - result,
+			  "ERROR: First range does not start at sector 0");
+		return;
+	}
+
+	for (i = rb_first(&bc->ranges); i; i = rb_next(i)) {
+		struct bow_range *br = container_of(i, struct bow_range, node);
+
+		result += scnprintf(result, end - result, "%s: %lu",
+				    readable_type[br->type], br->sector);
+		if (result >= end)
+			return;
+
+		result += scnprintf(result, end - result, "\n");
+		if (result >= end)
+			return;
+
+		switch (br->type) {
+		case TOP:
+			if (br->sector != ti->len) {
+				scnprintf(result, end - result,
+					  "\nERROR: Top sector is incorrect");
+			}
+
+			if (&br->node != rb_last(&bc->ranges)) {
+				scnprintf(result, end - result,
+					  "\nERROR: Top sector is not last");
+			}
+			return;
+
+		default:
+			break;
+		}
+
+		if (!rb_next(i)) {
+			scnprintf(result, end - result,
+				  "\nLast range not of type TOP");
+			return;
+		}
+
+		if (br->sector > range_top(br)) {
+			scnprintf(result, end - result,
+				  "\nERROR: sectors out of order");
+			return;
+		}
+	}
+}
+
+static void dm_bow_status(struct dm_target *ti, status_type_t type,
+			  unsigned int status_flags, char *result,
+			  unsigned int maxlen)
+{
+	switch (type) {
+	case STATUSTYPE_INFO:
+		if (maxlen)
+			result[0] = 0;
+		break;
+
+	case STATUSTYPE_TABLE:
+		dm_bow_tablestatus(ti, result, maxlen);
+		break;
+	}
+}
+
+int dm_bow_prepare_ioctl(struct dm_target *ti, struct block_device **bdev)
+{
+	struct bow_context *bc = ti->private;
+	struct dm_dev *dev = bc->dev;
+
+	*bdev = dev->bdev;
+	/* Only pass ioctls through if the device sizes match exactly. */
+	return ti->len != i_size_read(dev->bdev->bd_inode) >> SECTOR_SHIFT;
+}
+
+static int dm_bow_iterate_devices(struct dm_target *ti,
+				  iterate_devices_callout_fn fn, void *data)
+{
+	struct bow_context *bc = ti->private;
+
+	return fn(ti, bc->dev, 0, ti->len, data);
+}
+
+static struct target_type bow_target = {
+	.name   = "bow",
+	.version = {1, 1, 1},
+	.module = THIS_MODULE,
+	.ctr    = dm_bow_ctr,
+	.dtr    = dm_bow_dtr,
+	.map    = dm_bow_map,
+	.status = dm_bow_status,
+	.prepare_ioctl = dm_bow_prepare_ioctl,
+	.iterate_devices = dm_bow_iterate_devices,
+};
+
+int __init dm_bow_init(void)
+{
+	int r = dm_register_target(&bow_target);
+
+	if (r < 0)
+		DMERR("registering bow failed %d", r);
+	return r;
+}
+
+void dm_bow_exit(void)
+{
+	dm_unregister_target(&bow_target);
+}
+
+MODULE_LICENSE("GPL");
+
+module_init(dm_bow_init);
+module_exit(dm_bow_exit);
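A final usage sketch (assumed sysfs path and helper name, not part of the
patch): the free attribute above exposes trims_total, the number of bytes
still available for backups, which a supervisor process might poll to warn
before writes start failing with -ENOSPC.

/* Hypothetical monitor: poll dm-bow's remaining backup space. */
#include <stdio.h>

static long long bow_free_bytes(void)
{
	FILE *f = fopen("/sys/block/dm-0/bow/free", "r");	/* assumed dm-0 */
	long long bytes;

	if (!f)
		return -1;
	if (fscanf(f, "%lld", &bytes) != 1)
		bytes = -1;
	fclose(f);
	return bytes;
}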