From patchwork Mon Dec 28 00:31:56 2020
X-Patchwork-Submitter: Qu Wenruo
X-Patchwork-Id: 11990761
From: Qu Wenruo
To: linux-btrfs@vger.kernel.org
Subject: [PATCH v5 1/4] btrfs-progs: image: introduce framework for more dump versions
Date: Mon, 28 Dec 2020 08:31:56 +0800
Message-Id: <20201228003159.115343-2-wqu@suse.com>
In-Reply-To: <20201228003159.115343-1-wqu@suse.com>
References: <20201228003159.115343-1-wqu@suse.com>

The original dump format only contains a @magic member to verify the
format. This means that if we want to introduce a new on-disk format or
change a size limit, the only option is to introduce a new magic as the
version.

This patch introduces a framework that allows multiple magic numbers to
co-exist, for use by future features.

Each dump version has the following members:

- max_pending_size
  The threshold size for a cluster. It is a soft limit, not a hard one:
  a single item can make a cluster larger than max_pending_size, but the
  next item then goes into the next cluster.

- magic_cpu
  The magic number in CPU endianness.

- extra_sb_flags
  Whether the super block of the restored image needs extra super block
  flags such as BTRFS_SUPER_FLAG_METADUMP_V2. The upcoming data dump
  feature doesn't need any extra super block flags.

This change also implies that an image dump uses the same magic for all
of its clusters. Mixing is not allowed, as the first cluster is used to
determine the dump version.

Signed-off-by: Qu Wenruo
---
 image/main.c     | 72 ++++++++++++++++++++++++++++++++++++++++++------
 image/metadump.h | 13 +++++++--
 2 files changed, 74 insertions(+), 11 deletions(-)

diff --git a/image/main.c b/image/main.c
index 48070e52c21f..8f4cf5ff7e0d 100644
--- a/image/main.c
+++ b/image/main.c
@@ -44,6 +44,19 @@
 
 #define MAX_WORKER_THREADS	(32)
 
+const struct dump_version dump_versions[NR_DUMP_VERSIONS] = {
+	/*
+	 * The original format, which only supports tree blocks and
+	 * free space cache dump.
+	 */
+	{ .version = 0,
+	  .max_pending_size = SZ_256K,
+	  .magic_cpu = 0xbd5c25e27295668bULL,
+	  .extra_sb_flags = 1 }
+};
+
+const struct dump_version *current_version = &dump_versions[0];
+
 struct async_work {
 	struct list_head list;
 	struct list_head ordered;
@@ -405,7 +418,7 @@ static void meta_cluster_init(struct metadump_struct *md, u64 start)
 	md->num_items = 0;
 	md->num_ready = 0;
 	header = &md->cluster.header;
-	header->magic = cpu_to_le64(HEADER_MAGIC);
+	header->magic = cpu_to_le64(current_version->magic_cpu);
 	header->bytenr = cpu_to_le64(start);
 	header->nritems = cpu_to_le32(0);
 	header->compress = md->compress_level > 0 ?
@@ -717,7 +730,7 @@ static int add_extent(u64 start, u64 size, struct metadump_struct *md,
 {
 	int ret;
 	if (md->data != data ||
-	    md->pending_size + size > MAX_PENDING_SIZE ||
+	    md->pending_size + size > current_version->max_pending_size ||
 	    md->pending_start + md->pending_size != start) {
 		ret = flush_pending(md, 0);
 		if (ret)
@@ -1046,7 +1059,8 @@ static void update_super_old(u8 *buffer)
 	u32 sectorsize = btrfs_super_sectorsize(super);
 	u64 flags = btrfs_super_flags(super);
 
-	flags |= BTRFS_SUPER_FLAG_METADUMP;
+	if (current_version->extra_sb_flags)
+		flags |= BTRFS_SUPER_FLAG_METADUMP;
 	btrfs_set_super_flags(super, flags);
 
 	key = (struct btrfs_disk_key *)(super->sys_chunk_array);
@@ -1146,7 +1160,8 @@ finish:
 	if (mdres->clear_space_cache)
 		btrfs_set_super_cache_generation(super, 0);
 
-	flags |= BTRFS_SUPER_FLAG_METADUMP_V2;
+	if (current_version->extra_sb_flags)
+		flags |= BTRFS_SUPER_FLAG_METADUMP_V2;
 	btrfs_set_super_flags(super, flags);
 	btrfs_set_super_sys_array_size(super, new_array_size);
 	btrfs_set_super_num_devices(super, 1);
@@ -1336,7 +1351,7 @@ static void *restore_worker(void *data)
 	u8 *outbuf;
 	int outfd;
 	int ret;
-	int compress_size = MAX_PENDING_SIZE * 4;
+	int compress_size = current_version->max_pending_size * 4;
 
 	outfd = fileno(mdres->out);
 	buffer = malloc(compress_size);
@@ -1489,6 +1504,42 @@ static void mdrestore_destroy(struct mdrestore_struct *mdres, int num_threads)
 	free(mdres->original_super);
 }
 
+static int detect_version(FILE *in)
+{
+	struct meta_cluster *cluster;
+	u8 buf[BLOCK_SIZE];
+	bool found = false;
+	int i;
+	int ret;
+
+	if (fseek(in, 0, SEEK_SET) < 0) {
+		error("seek failed: %m");
+		return -errno;
+	}
+	ret = fread(buf, BLOCK_SIZE, 1, in);
+	if (!ret) {
+		error("failed to read header");
+		return -EIO;
+	}
+
+	fseek(in, 0, SEEK_SET);
+	cluster = (struct meta_cluster *)buf;
+	for (i = 0; i < NR_DUMP_VERSIONS; i++) {
+		if (le64_to_cpu(cluster->header.magic) ==
+		    dump_versions[i].magic_cpu) {
+			found = true;
+			current_version = &dump_versions[i];
+			break;
+		}
+	}
+
+	if (!found) {
+		error("unrecognized header format");
+		return -EINVAL;
+	}
+	return 0;
+}
+
 static int mdrestore_init(struct mdrestore_struct *mdres,
 			  FILE *in, FILE *out, int old_restore,
 			  int num_threads, int fixup_offset,
@@ -1496,6 +1547,9 @@ static int mdrestore_init(struct mdrestore_struct *mdres,
 {
 	int i, ret = 0;
 
+	ret = detect_version(in);
+	if (ret < 0)
+		return ret;
 	memset(mdres, 0, sizeof(*mdres));
 	pthread_cond_init(&mdres->cond, NULL);
 	pthread_mutex_init(&mdres->mutex, NULL);
@@ -1849,7 +1903,7 @@ static int search_for_chunk_blocks(struct mdrestore_struct *mdres)
 	u64 current_cluster = 0, bytenr;
 	u64 item_bytenr;
 	u32 bufsize, nritems, i;
-	u32 max_size = MAX_PENDING_SIZE * 2;
+	u32 max_size = current_version->max_pending_size * 2;
 	u8 *buffer, *tmp = NULL;
 	int ret = 0;
 
@@ -1902,7 +1956,7 @@ static int search_for_chunk_blocks(struct mdrestore_struct *mdres)
 
 		ret = 0;
 		header = &cluster->header;
-		if (le64_to_cpu(header->magic) != HEADER_MAGIC ||
+		if (le64_to_cpu(header->magic) != current_version->magic_cpu ||
 		    le64_to_cpu(header->bytenr) != current_cluster) {
 			error("bad header in metadump image");
 			ret = -EIO;
@@ -2101,7 +2155,7 @@ static int build_chunk_tree(struct mdrestore_struct *mdres,
 
 	ret = 0;
 	header = &cluster->header;
-	if (le64_to_cpu(header->magic) != HEADER_MAGIC ||
+	if (le64_to_cpu(header->magic) != current_version->magic_cpu ||
 	    le64_to_cpu(header->bytenr) != 0) {
 		error("bad header in metadump image");
 		return -EIO;
@@ -2673,7 +2727,7 @@ static int restore_metadump(const char *input, FILE *out, int old_restore,
 			break;
 
 		header = &cluster->header;
-		if (le64_to_cpu(header->magic) != HEADER_MAGIC ||
+		if (le64_to_cpu(header->magic) != current_version->magic_cpu ||
 		    le64_to_cpu(header->bytenr) != bytenr) {
 			error("bad header in metadump image");
 			ret = -EIO;
diff --git a/image/metadump.h b/image/metadump.h
index 57bc3bf285b0..7bdddc7b853c 100644
--- a/image/metadump.h
+++ b/image/metadump.h
@@ -22,8 +22,6 @@
 #include "kernel-lib/list.h"
 #include "kernel-shared/ctree.h"
 
-#define HEADER_MAGIC		0xbd5c25e27295668bULL
-#define MAX_PENDING_SIZE	SZ_256K
 #define BLOCK_SIZE		SZ_1K
 #define BLOCK_MASK		(BLOCK_SIZE - 1)
 
@@ -33,6 +31,17 @@
 #define COMPRESS_NONE		0
 #define COMPRESS_ZLIB		1
 
+struct dump_version {
+	u64 magic_cpu;
+	int version;
+	int max_pending_size;
+	unsigned int extra_sb_flags:1;
+};
+
+#define NR_DUMP_VERSIONS	1
+extern const struct dump_version dump_versions[NR_DUMP_VERSIONS];
+const extern struct dump_version *current_version;
+
 struct meta_cluster_item {
 	__le64 bytenr;
 	__le32 size;

From patchwork Mon Dec 28 00:31:57 2020
X-Patchwork-Submitter: Qu Wenruo
X-Patchwork-Id: 11990767
From: Qu Wenruo
To: linux-btrfs@vger.kernel.org
Subject: [PATCH v5 2/4] btrfs-progs: image: introduce -d option to dump data
Date: Mon, 28 Dec 2020 08:31:57 +0800
Message-Id: <20201228003159.115343-3-wqu@suse.com>
In-Reply-To: <20201228003159.115343-1-wqu@suse.com>
References: <20201228003159.115343-1-wqu@suse.com>

This new data dump feature dumps the whole image: not only the existing
tree blocks but also all of its data extents(*).

This feature relies on the new dump format (_DUmP_v1), as it needs a
much larger extent size limit, and the older btrfs-image dump format
can't handle such large item/cluster sizes.
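
For reference, the new magic reads as _DUmP_v1 because the cluster header stores it little-endian (cpu_to_le64()), so its eight bytes on disk are the ASCII characters of that string. The standalone C sketch below illustrates this together with the magic-to-version lookup that detect_version() performs. It is not btrfs-progs code: struct dump_version_sketch, magic_to_ascii() and find_dump_version() are hypothetical names, and only the magic values and size limits are copied from dump_versions[] in this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of struct dump_version from this series. */
struct dump_version_sketch {
	int version;
	uint64_t max_pending_size;
	uint64_t magic_cpu;
	unsigned int extra_sb_flags:1;
};

/* Magic values and limits copied from dump_versions[] (v0 and v1). */
static const struct dump_version_sketch versions[] = {
	{ .version = 0, .max_pending_size = 256ULL * 1024,
	  .magic_cpu = 0xbd5c25e27295668bULL, .extra_sb_flags = 1 },
	{ .version = 1, .max_pending_size = 256ULL * 1024 * 1024,
	  .magic_cpu = 0x31765f506d55445fULL, .extra_sb_flags = 0 },
};

/*
 * Lay out @magic in little-endian byte order, the way cpu_to_le64()
 * stores it in the cluster header, and NUL-terminate the result.
 */
static void magic_to_ascii(uint64_t magic, char out[9])
{
	for (int i = 0; i < 8; i++)
		out[i] = (char)((magic >> (8 * i)) & 0xff);
	out[8] = '\0';
}

/* Linear search over the table, like the loop in detect_version(). */
static const struct dump_version_sketch *find_dump_version(uint64_t magic)
{
	for (size_t i = 0; i < sizeof(versions) / sizeof(versions[0]); i++) {
		if (versions[i].magic_cpu == magic)
			return &versions[i];
	}
	return NULL;
}
```

For example, magic_to_ascii(0x31765f506d55445fULL, buf) fills buf with "_DUmP_v1", while the v0 magic decodes to non-printable bytes, so only the v1 magic is recognizable in a hex dump of an image.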

Since we're dumping all extents, including data extents, the restored
image doesn't need any extra super block flags to inform the kernel. The
kernel should treat the restored image as an ordinary btrfs.

*: Data extents are dumped as is. That is, even for a preallocated
   extent, its (meaningless) data is read out and dumped. This costs
   extra space in the image, but lets us skip all the complex checks for
   partially shared preallocated extents.

Signed-off-by: Qu Wenruo
---
 image/main.c     | 53 +++++++++++++++++++++++++++++++++++++-----------
 image/metadump.h |  2 +-
 2 files changed, 42 insertions(+), 13 deletions(-)

diff --git a/image/main.c b/image/main.c
index 8f4cf5ff7e0d..d5822d61b05e 100644
--- a/image/main.c
+++ b/image/main.c
@@ -52,7 +52,15 @@ const struct dump_version dump_versions[NR_DUMP_VERSIONS] = {
 	{ .version = 0,
 	  .max_pending_size = SZ_256K,
 	  .magic_cpu = 0xbd5c25e27295668bULL,
-	  .extra_sb_flags = 1 }
+	  .extra_sb_flags = 1 },
+	/*
+	 * The newer format, with much larger item size to contain
+	 * any data extent.
+	 */
+	{ .version = 1,
+	  .max_pending_size = SZ_256M,
+	  .magic_cpu = 0x31765f506d55445fULL, /* ascii _DUmP_v1, no null */
+	  .extra_sb_flags = 0 },
 };
 
 const struct dump_version *current_version = &dump_versions[0];
@@ -454,10 +462,14 @@ static void metadump_destroy(struct metadump_struct *md, int num_threads)
 
 static int metadump_init(struct metadump_struct *md, struct btrfs_root *root,
 			 FILE *out, int num_threads, int compress_level,
-			 enum sanitize_mode sanitize_names)
+			 bool dump_data, enum sanitize_mode sanitize_names)
 {
 	int i, ret = 0;
 
+	/* We need larger item/cluster limit for data extents */
+	if (dump_data)
+		current_version = &dump_versions[1];
+
 	memset(md, 0, sizeof(*md));
 	INIT_LIST_HEAD(&md->list);
 	INIT_LIST_HEAD(&md->ordered);
@@ -885,7 +897,7 @@ static int copy_space_cache(struct btrfs_root *root,
 }
 
 static int copy_from_extent_tree(struct metadump_struct *metadump,
-				 struct btrfs_path *path)
+				 struct btrfs_path *path, bool dump_data)
 {
 	struct btrfs_root *extent_root;
 	struct extent_buffer *leaf;
@@ -950,9 +962,15 @@ static int copy_from_extent_tree(struct metadump_struct *metadump,
 			ei = btrfs_item_ptr(leaf, path->slots[0],
 					    struct btrfs_extent_item);
 			if (btrfs_extent_flags(leaf, ei) &
-			    BTRFS_EXTENT_FLAG_TREE_BLOCK) {
+			    BTRFS_EXTENT_FLAG_TREE_BLOCK ||
+			    (dump_data && (btrfs_extent_flags(leaf, ei) &
+					   BTRFS_EXTENT_FLAG_DATA))) {
+				bool is_data;
+
+				is_data = btrfs_extent_flags(leaf, ei) &
+					  BTRFS_EXTENT_FLAG_DATA;
 				ret = add_extent(bytenr, num_bytes, metadump,
-						 0);
+						 is_data);
 				if (ret) {
 					error("unable to add block %llu: %d",
 						(unsigned long long)bytenr, ret);
@@ -975,7 +993,7 @@ static int copy_from_extent_tree(struct metadump_struct *metadump,
 
 static int create_metadump(const char *input, FILE *out, int num_threads,
 			   int compress_level, enum sanitize_mode sanitize,
-			   int walk_trees)
+			   int walk_trees, bool dump_data)
 {
 	struct btrfs_root *root;
 	struct btrfs_path path;
@@ -990,7 +1008,7 @@ static int create_metadump(const char *input, FILE *out, int num_threads,
 	}
 
 	ret = metadump_init(&metadump, root, out, num_threads,
-			    compress_level, sanitize);
+			    compress_level, dump_data, sanitize);
 	if (ret) {
 		error("failed to initialize metadump: %d", ret);
 		close_ctree(root);
@@ -1022,7 +1040,7 @@ static int create_metadump(const char *input, FILE *out, int num_threads,
 			goto out;
 		}
 	} else {
-		ret = copy_from_extent_tree(&metadump, &path);
+		ret = copy_from_extent_tree(&metadump, &path, dump_data);
 		if (ret) {
 			err = ret;
 			goto out;
@@ -2890,6 +2908,7 @@ static void print_usage(int ret)
 	printf("\t-s \tsanitize file names, use once to just use garbage, use twice if you want crc collisions\n");
 	printf("\t-w \twalk all trees instead of using extent tree, do this if your extent tree is broken\n");
 	printf("\t-m \trestore for multiple devices\n");
+	printf("\t-d \talso dump data, conflicts with -w\n");
 	printf("\n");
 	printf("\tIn the dump mode, source is the btrfs device and target is the output file (use '-' for stdout).\n");
 	printf("\tIn the restore mode, source is the dumped image and target is the btrfs device/file.\n");
@@ -2909,6 +2928,7 @@ int BOX_MAIN(image)(int argc, char *argv[])
 	int ret;
 	enum sanitize_mode sanitize = SANITIZE_NONE;
 	int dev_cnt = 0;
+	bool dump_data = false;
 	int usage_error = 0;
 	FILE *out;
 
@@ -2917,7 +2937,7 @@ int BOX_MAIN(image)(int argc, char *argv[])
 			{ "help", no_argument, NULL, GETOPT_VAL_HELP},
 			{ NULL, 0, NULL, 0 }
 		};
-		int c = getopt_long(argc, argv, "rc:t:oswm", long_options, NULL);
+		int c = getopt_long(argc, argv, "rc:t:oswmd", long_options, NULL);
 		if (c < 0)
 			break;
 		switch (c) {
@@ -2957,6 +2977,9 @@ int BOX_MAIN(image)(int argc, char *argv[])
 			create = 0;
 			multi_devices = 1;
 			break;
+		case 'd':
+			dump_data = true;
+			break;
 		case GETOPT_VAL_HELP:
 		default:
 			print_usage(c != GETOPT_VAL_HELP);
@@ -2975,10 +2998,15 @@ int BOX_MAIN(image)(int argc, char *argv[])
 			    "create and restore cannot be used at the same time");
 			usage_error++;
 		}
+		if (dump_data && walk_trees) {
+			error("-d conflicts with -w option");
+			usage_error++;
+		}
 	} else {
-		if (walk_trees || sanitize != SANITIZE_NONE || compress_level) {
+		if (walk_trees || sanitize != SANITIZE_NONE || compress_level ||
+		    dump_data) {
 			error(
-			"using -w, -s, -c options for restore makes no sense");
+			"using -w, -s, -c, -d options for restore makes no sense");
 			usage_error++;
 		}
 		if (multi_devices && dev_cnt < 2) {
@@ -3031,7 +3059,8 @@ int BOX_MAIN(image)(int argc, char *argv[])
 		}
 
 		ret = create_metadump(source, out, num_threads,
-				      compress_level, sanitize, walk_trees);
+				      compress_level, sanitize, walk_trees,
+				      dump_data);
 	} else {
 		ret = restore_metadump(source, out, old_restore, num_threads,
 				       0, target, multi_devices);
diff --git a/image/metadump.h b/image/metadump.h
index 7bdddc7b853c..db56add42f1c 100644
--- a/image/metadump.h
+++ b/image/metadump.h
@@ -38,7 +38,7 @@ struct dump_version {
 	unsigned int extra_sb_flags:1;
 };
 
-#define NR_DUMP_VERSIONS	1
+#define NR_DUMP_VERSIONS	2
 extern const struct dump_version dump_versions[NR_DUMP_VERSIONS];
 const extern struct dump_version *current_version;
 

From patchwork Mon Dec 28 00:31:58 2020
X-Patchwork-Submitter: Qu Wenruo
X-Patchwork-Id: 11990765
From: Qu Wenruo
To: linux-btrfs@vger.kernel.org
Subject: [PATCH v5 3/4] btrfs-progs: image: reduce memory requirement for decompression
Date: Mon, 28 Dec 2020 08:31:58 +0800
Message-Id: <20201228003159.115343-4-wqu@suse.com>
In-Reply-To: <20201228003159.115343-1-wqu@suse.com>
References: <20201228003159.115343-1-wqu@suse.com>

With the recent change enlarging max_pending_size to 256M for data dump,
the decompression code requires a lot of memory (256M * 4).

The main reason is that we use the wrapper function uncompress(), which
needs the output buffer to be large enough to contain all of the
decompressed data.

This patch reworks decompression to use inflate(), which can resume
decompression, so that a much smaller buffer can be used.

This patch uses a 512K buffer.

Now the memory consumption for restore is reduced to

	Cluster data size + 512K * nr_running_threads

instead of the original:

	Cluster data size + 1G * nr_running_threads

Signed-off-by: Qu Wenruo
---
 image/main.c | 222 +++++++++++++++++++++++++++++++------------------------
 1 file changed, 146 insertions(+), 76 deletions(-)

diff --git a/image/main.c b/image/main.c
index d5822d61b05e..5fa6fa5aba17 100644
--- a/image/main.c
+++ b/image/main.c
@@ -1360,130 +1360,200 @@ static void write_backup_supers(int fd, u8 *buf)
 	}
 }
 
-static void *restore_worker(void *data)
+/*
+ * Restore one item.
+ *
+ * For uncompressed data, it's just reading from work->buf then write to output.
+ * For compressed data, since we can have very large decompressed data
+ * (up to 256M), we need to consider memory usage. So here we will fill buffer
+ * then write the decompressed buffer to output.
+ */
+static int restore_one_work(struct mdrestore_struct *mdres,
+			    struct async_work *async, u8 *buffer, int bufsize)
 {
-	struct mdrestore_struct *mdres = (struct mdrestore_struct *)data;
-	struct async_work *async;
-	size_t size;
-	u8 *buffer;
-	u8 *outbuf;
-	int outfd;
+	z_stream strm;
+	int buf_offset = 0;	/* offset inside work->buffer */
+	int out_offset = 0;	/* offset for output */
+	int out_len;
+	int outfd = fileno(mdres->out);
+	int compress_method = mdres->compress_method;
 	int ret;
-	int compress_size = current_version->max_pending_size * 4;
 
-	outfd = fileno(mdres->out);
-	buffer = malloc(compress_size);
-	if (!buffer) {
-		error("not enough memory for restore worker buffer");
-		pthread_mutex_lock(&mdres->mutex);
-		if (!mdres->error)
-			mdres->error = -ENOMEM;
-		pthread_mutex_unlock(&mdres->mutex);
-		pthread_exit(NULL);
+	ASSERT(is_power_of_2(bufsize));
+
+	if (compress_method == COMPRESS_ZLIB) {
+		strm.zalloc = Z_NULL;
+		strm.zfree = Z_NULL;
+		strm.opaque = Z_NULL;
+		strm.avail_in = async->bufsize;
+		strm.next_in = async->buffer;
+		strm.avail_out = 0;
+		strm.next_out = Z_NULL;
+		ret = inflateInit(&strm);
+		if (ret != Z_OK) {
+			error("failed to initialize decompress parameters: %d",
+			      ret);
+			return ret;
+		}
 	}
+	while (buf_offset < async->bufsize) {
+		bool compress_end = false;
+		int read_size = min_t(u64, async->bufsize - buf_offset,
+				      bufsize);
 
-	while (1) {
-		u64 bytenr, physical_dup;
-		off_t offset = 0;
-		int err = 0;
-
-		pthread_mutex_lock(&mdres->mutex);
-		while (!mdres->nodesize || list_empty(&mdres->list)) {
-			if (mdres->done) {
-				pthread_mutex_unlock(&mdres->mutex);
-				goto out;
+		/* Read part */
+		if (compress_method == COMPRESS_ZLIB) {
+			if (strm.avail_out == 0) {
+				strm.avail_out = bufsize;
+				strm.next_out = buffer;
 			}
-			pthread_cond_wait(&mdres->cond, &mdres->mutex);
-		}
-		async = list_entry(mdres->list.next, struct async_work, list);
-		list_del_init(&async->list);
-
-		if (mdres->compress_method == COMPRESS_ZLIB) {
-			size = compress_size;
 			pthread_mutex_unlock(&mdres->mutex);
-			ret = uncompress(buffer, (unsigned long *)&size,
-					 async->buffer, async->bufsize);
+			ret = inflate(&strm, Z_NO_FLUSH);
 			pthread_mutex_lock(&mdres->mutex);
-			if (ret != Z_OK) {
-				error("decompression failed with %d", ret);
-				err = -EIO;
+			switch (ret) {
+			case Z_NEED_DICT:
+				ret = Z_DATA_ERROR;
+				__attribute__ ((fallthrough));
+			case Z_DATA_ERROR:
+			case Z_MEM_ERROR:
+				goto out;
+			}
+			if (ret == Z_STREAM_END) {
+				ret = 0;
+				compress_end = true;
 			}
-			outbuf = buffer;
+			out_len = bufsize - strm.avail_out;
 		} else {
-			outbuf = async->buffer;
-			size = async->bufsize;
+			/* No compress, read as many data as possible */
+			memcpy(buffer, async->buffer + buf_offset, read_size);
+
+			buf_offset += read_size;
+			out_len = read_size;
 		}
 
+		/* Fixup part */
 		if (!mdres->multi_devices) {
 			if (async->start == BTRFS_SUPER_INFO_OFFSET) {
-				memcpy(mdres->original_super, outbuf,
+				memcpy(mdres->original_super, buffer,
 				       BTRFS_SUPER_INFO_SIZE);
 				if (mdres->old_restore) {
-					update_super_old(outbuf);
+					update_super_old(buffer);
 				} else {
-					ret = update_super(mdres, outbuf);
-					if (ret)
-						err = ret;
+					ret = update_super(mdres, buffer);
+					if (ret < 0)
+						goto out;
 				}
 			} else if (!mdres->old_restore) {
-				ret = fixup_chunk_tree_block(mdres, async, outbuf, size);
+				ret = fixup_chunk_tree_block(mdres, async,
+							     buffer, out_len);
 				if (ret)
-					err = ret;
+					goto out;
 			}
 		}
 
+		/* Write part */
 		if (!mdres->fixup_offset) {
+			int size = out_len;
+			off_t offset = 0;
+
 			while (size) {
+				u64 logical = async->start + out_offset + offset;
 				u64 chunk_size = size;
-				physical_dup = 0;
+				u64 physical_dup = 0;
+				u64 bytenr;
+
 				if (!mdres->multi_devices && !mdres->old_restore)
 					bytenr = logical_to_physical(mdres,
-						     async->start + offset,
-						     &chunk_size,
-						     &physical_dup);
+							logical, &chunk_size,
+							&physical_dup);
 				else
-					bytenr = async->start + offset;
+					bytenr = logical;
 
-				ret = pwrite64(outfd, outbuf+offset, chunk_size,
-					       bytenr);
+				ret = pwrite64(outfd, buffer + offset, chunk_size, bytenr);
 				if (ret != chunk_size)
-					goto error;
+					goto write_error;
 
 				if (physical_dup)
-					ret = pwrite64(outfd, outbuf+offset,
-						       chunk_size,
-						       physical_dup);
+					ret = pwrite64(outfd, buffer + offset,
+						       chunk_size, physical_dup);
 				if (ret != chunk_size)
-					goto error;
+					goto write_error;
 
 				size -= chunk_size;
 				offset += chunk_size;
 				continue;
-
-error:
-				if (ret < 0) {
-					error("unable to write to device: %m");
-					err = errno;
-				} else {
-					error("short write");
-					err = -EIO;
-				}
 			}
 		} else if (async->start != BTRFS_SUPER_INFO_OFFSET) {
-			ret = write_data_to_disk(mdres->info, outbuf, async->start, size, 0);
+			ret = write_data_to_disk(mdres->info, buffer,
+						 async->start, out_len, 0);
 			if (ret) {
 				error("failed to write data");
 				exit(1);
 			}
 		}
 
-		/* backup super blocks are already there at fixup_offset stage */
-		if (!mdres->multi_devices && async->start == BTRFS_SUPER_INFO_OFFSET)
-			write_backup_supers(outfd, outbuf);
+		if (async->start == BTRFS_SUPER_INFO_OFFSET &&
+		    !mdres->multi_devices)
+			write_backup_supers(outfd, buffer);
+		out_offset += out_len;
+		if (compress_end) {
+			inflateEnd(&strm);
+			break;
+		}
+	}
+	return ret;
+
+write_error:
+	if (ret < 0) {
+		error("unable to write to device: %m");
+		ret = -errno;
+	} else {
+		error("short write");
+		ret = -EIO;
+	}
+out:
+	if (compress_method == COMPRESS_ZLIB)
+		inflateEnd(&strm);
+	return ret;
+}
+
+static void *restore_worker(void *data)
+{
+	struct mdrestore_struct *mdres = (struct mdrestore_struct *)data;
+	struct async_work *async;
+	u8 *buffer;
+	int ret;
+	int buffer_size = SZ_512K;
+
+	buffer = malloc(buffer_size);
+	if (!buffer) {
+		error("not enough memory for restore worker buffer");
+		pthread_mutex_lock(&mdres->mutex);
+		if (!mdres->error)
+			mdres->error = -ENOMEM;
+		pthread_mutex_unlock(&mdres->mutex);
+		pthread_exit(NULL);
+	}
+
+	while (1) {
+		pthread_mutex_lock(&mdres->mutex);
+		while (!mdres->nodesize || list_empty(&mdres->list)) {
+			if (mdres->done) {
+				pthread_mutex_unlock(&mdres->mutex);
+				goto out;
+			}
+			pthread_cond_wait(&mdres->cond, &mdres->mutex);
+		}
+		async = list_entry(mdres->list.next, struct async_work, list);
+		list_del_init(&async->list);
 
-		if (err && !mdres->error)
-			mdres->error = err;
+		ret = restore_one_work(mdres, async, buffer, buffer_size);
+		if (ret < 0) {
+			mdres->error = ret;
+			pthread_mutex_unlock(&mdres->mutex);
+			goto out;
+		}
 		mdres->num_items--;
 		pthread_mutex_unlock(&mdres->mutex);

From patchwork Mon Dec 28 00:31:59 2020
X-Patchwork-Submitter: Qu Wenruo
X-Patchwork-Id: 11990763
From: Qu Wenruo
To: linux-btrfs@vger.kernel.org
Subject: [PATCH v5 4/4] btrfs-progs: image: fix restored image size misalignment
Date: Mon, 28 Dec 2020 08:31:59 +0800
Message-Id: <20201228003159.115343-5-wqu@suse.com>
In-Reply-To: <20201228003159.115343-1-wqu@suse.com>
References: <20201228003159.115343-1-wqu@suse.com>

[BUG]
There is a small device size mismatch between the super block device
size and the device extent size:

	total_bytes		10737418240	<<<
	bytes_used		15097856
	dev_item.total_bytes	10737418240
	dev_item.bytes_used	1094713344

	item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 16185 itemsize 98
		devid 1 total_bytes 1095761920 bytes_used 1094713344
		                    ^^^^^^^^^^

[CAUSE]
In fixup_device_size(), we only reset the device item size in the super
block, which is later overwritten by write_dev_supers() using
btrfs_device::total_bytes. The function doesn't touch
btrfs_superblock::total_bytes either.

[FIX]
Fix the mismatch by also resetting btrfs_device::total_bytes,
btrfs_device::bytes_used and btrfs_superblock::total_bytes.

Thankfully, since commit
73dd4e3c87c9 ("btrfs-progs: image: Don't modify the chunk and device tree if the source dump is single device")
single-device dumps no longer have this problem, but the fix is still
worthwhile for multi-device dumps.

Signed-off-by: Qu Wenruo
---
 image/main.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/image/main.c b/image/main.c
index 5fa6fa5aba17..42564b1d2f44 100644
--- a/image/main.c
+++ b/image/main.c
@@ -2374,6 +2374,7 @@ static int fixup_device_size(struct btrfs_trans_handle *trans,
 	struct btrfs_fs_info *fs_info = trans->fs_info;
 	struct btrfs_dev_item *dev_item;
 	struct btrfs_dev_extent *dev_ext;
+	struct btrfs_device *dev;
 	struct btrfs_path path;
 	struct extent_buffer *leaf;
 	struct btrfs_root *root = fs_info->chunk_root;
@@ -2392,6 +2393,8 @@ static int fixup_device_size(struct btrfs_trans_handle *trans,
 	key.type = BTRFS_DEV_EXTENT_KEY;
 	key.offset = (u64)-1;
 
+	dev = list_first_entry(&fs_info->fs_devices->devices,
+			       struct btrfs_device, dev_list);
 	ret = btrfs_search_slot(NULL, fs_info->dev_root, &key, &path, 0, 0);
 	if (ret < 0) {
 		errno = -ret;
@@ -2425,6 +2428,9 @@ static int fixup_device_size(struct btrfs_trans_handle *trans,
 
 	btrfs_set_stack_device_total_bytes(dev_item, dev_size);
 	btrfs_set_stack_device_bytes_used(dev_item, mdres->alloced_chunks);
+	dev->total_bytes = dev_size;
+	dev->bytes_used = mdres->alloced_chunks;
+	btrfs_set_super_total_bytes(fs_info->super_copy, dev_size);
 	ret = fstat(out_fd, &buf);
 	if (ret < 0) {
 		error("failed to stat result image: %m");
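
The [CAUSE] above can be modelled in a few lines: the fixed-up size lives in several places, and a later writer refreshes one copy from another copy that was never updated. The self-contained sketch below is hypothetical (struct sketch_device, struct sketch_super, write_supers_sketch(), fixup_sizes_old() and fixup_sizes_new() are illustrative names, not btrfs-progs API). It only assumes what the commit message states: write_dev_supers() overwrites the super block's dev_item total_bytes with btrfs_device::total_bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the in-memory struct btrfs_device. */
struct sketch_device {
	uint64_t total_bytes;
	uint64_t bytes_used;
};

/* Hypothetical stand-in for the super block and its embedded dev_item. */
struct sketch_super {
	uint64_t total_bytes;
	struct {
		uint64_t total_bytes;
		uint64_t bytes_used;
	} dev_item;
};

/*
 * Models the overwrite described in [CAUSE]: before writing, the
 * embedded dev_item total_bytes is refreshed from the in-memory device.
 */
static void write_supers_sketch(struct sketch_super *sb,
				const struct sketch_device *dev)
{
	sb->dev_item.total_bytes = dev->total_bytes;
}

/* Pre-fix behaviour: only the super block's dev_item copy is updated. */
static void fixup_sizes_old(struct sketch_super *sb, uint64_t dev_size,
			    uint64_t used)
{
	sb->dev_item.total_bytes = dev_size;
	sb->dev_item.bytes_used = used;
}

/* Post-fix behaviour: every copy receives the same values. */
static void fixup_sizes_new(struct sketch_super *sb,
			    struct sketch_device *dev,
			    uint64_t dev_size, uint64_t used)
{
	sb->dev_item.total_bytes = dev_size;
	sb->dev_item.bytes_used = used;
	dev->total_bytes = dev_size;
	dev->bytes_used = used;
	sb->total_bytes = dev_size;
}
```

Starting every copy at 10737418240 and feeding in the numbers from [BUG] reproduces the report: after fixup_sizes_old() and a write, the dev_item is back to 10737418240, while fixup_sizes_new() keeps all copies at 1095761920.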