From patchwork Tue Jul 12 21:40:13 2016
X-Patchwork-Submitter: Goffredo Baroncelli
X-Patchwork-Id: 9226281
From: Goffredo Baroncelli
Subject: New btrfs sub command: btrfs inspect physical-find
Reply-To: kreijack@inwind.it
To: linux-btrfs
Cc: David Sterba
Message-ID: <7c20be09-1ac2-1d1f-f545-e2d56578a77b@libero.it>
Date: Tue, 12 Jul 2016 23:40:13 +0200
X-Mailing-List: linux-btrfs@vger.kernel.org

Hi All,

the enclosed patch adds a new btrfs sub command: "btrfs inspect physical-find".
The aim of this new command is to show the physical placement of a file on the
disk. Currently it handles all the profiles (single, dup, raid1/10/5/6). I
developed this command in order to show some bugs in the btrfs RAID5 profile
(see next email).

You can pull the code from:

	https://github.com/kreijack/btrfs-progs.git
	branch insp-phy

The syntax of this new command is simple:

# btrfs inspect physical-find <file> [<offset>]

where:
	<file>   is the file to inspect
	<offset> is the offset of the file to inspect (default 0)

Below are some examples:

** Single

$ sudo mkfs.btrfs -f -d single -m single /dev/loop0
$ sudo mount /dev/loop0 mnt/
$ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
mnt/out.txt: 0
        devid 1, /dev/loop0 : 12582912 LINEAR
$ dd 2>/dev/null if=/dev/loop0 skip=12582912 bs=1 count=5; echo
adaaa

** Dup

The command shows both the copies.

$ sudo mkfs.btrfs -f -d dup -m dup /dev/loop0
$ sudo mount /dev/loop0 mnt/
$ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
mnt/out.txt: 0
        devid 1, /dev/loop0 : 71303168 DUP
        devid 1, /dev/loop0 : 104857600 DUP
$ dd 2>/dev/null if=/dev/loop0 skip=104857600 bs=1 count=5 ; echo
adaaa

** Raid1

The command shows both the copies.

$ sudo mkfs.btrfs -f -d raid1 -m raid1 /dev/loop0 /dev/loop1
$ sudo mount /dev/loop0 mnt/
$ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
mnt/out.txt: 0
        devid 2, /dev/loop1 : 61865984 RAID1
        devid 1, /dev/loop0 : 81788928 RAID1
$ dd 2>/dev/null if=/dev/loop0 skip=81788928 bs=1 count=5; echo
adaaa

** Raid10

The command shows both the copies; if you set the offset to the next
disk-stripe, you can see the next pair of disk-stripes.

$ sudo mkfs.btrfs -f -d raid10 -m raid10 /dev/loop[0123]
$ sudo mount /dev/loop0 mnt/
$ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
mnt/out.txt: 0
        devid 4, /dev/loop3 : 61931520 RAID10
        devid 3, /dev/loop2 : 61931520 RAID10
$ dd 2>/dev/null if=/dev/loop2 skip=61931520 bs=1 count=5; echo
adaaa
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt 65536
mnt/out.txt: 65536
        devid 2, /dev/loop1 : 61931520 RAID10
        devid 1, /dev/loop0 : 81854464 RAID10
$ dd 2>/dev/null if=/dev/loop0 skip=81854464 bs=1 count=5; echo
bdbbb

** Raid5

Depending on the offset, you can see which disk-stripe is used.

$ sudo mkfs.btrfs -f -d raid5 -m raid5 /dev/loop[012]
$ sudo mount /dev/loop0 mnt/
$ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
mnt/out.txt: 0
        devid 2, /dev/loop1 : 61931520 DATA
        devid 1, /dev/loop0 : 81854464 OTHER
        devid 3, /dev/loop2 : 61931520 PARITY
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt 65536
mnt/out.txt: 65536
        devid 2, /dev/loop1 : 61931520 OTHER
        devid 1, /dev/loop0 : 81854464 DATA
        devid 3, /dev/loop2 : 61931520 PARITY
$ dd 2>/dev/null if=/dev/loop1 skip=61931520 bs=1 count=5; echo
adaaa
$ dd 2>/dev/null if=/dev/loop0 skip=81854464 bs=1 count=5; echo
bdbbb
$ dd 2>/dev/null if=/dev/loop2 skip=61931520 bs=1 count=5 | xxd
00000000: 0300 0303 03                             .....

The parity is computed as: parity=disk1^disk2.
So "adaa" ^ "bdbb" == "\x03\x00\x03\x03"

** Raid6

$ sudo mkfs.btrfs -f -mraid6 -draid6 /dev/loop[0-4]^C
$ sudo mount /dev/loop0 mnt/
$ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
mnt/out.txt: 0
        devid 3, /dev/loop2 : 61931520 DATA
        devid 2, /dev/loop1 : 61931520 OTHER
        devid 1, /dev/loop0 : 81854464 PARITY
        devid 4, /dev/loop3 : 61931520 PARITY
$ dd 2>/dev/null if=/dev/loop2 skip=61931520 bs=1 count=5 ; echo
adaaa

diff --git a/cmds-inspect.c b/cmds-inspect.c
index dd7b9dd..a604c2b 100644
--- a/cmds-inspect.c
+++ b/cmds-inspect.c
@@ -22,6 +22,11 @@
 #include
 #include
 #include
+#include
+#include
+#include
+#include
+#include
 
 #include "kerncompat.h"
 #include "ioctl.h"
@@ -623,6 +628,450 @@ out:
 	return !!ret;
 }
 
+
+static const char* const cmd_inspect_physical_find_usage[] = {
+	"btrfs inspect-internal physical-find [options] [...]",
+	"Show the physical address of each blocks",
+	"-m the output is machine readable",
+	NULL
+};
+
+static void dump_stripes(int ndisks, struct btrfs_ioctl_dev_info_args *disks,
+			 struct btrfs_chunk *chunk, u64 logical_start) {
+	struct btrfs_stripe *stripes;
+	stripes = &chunk->stripe;
+
+	if ((chunk->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) == 0 ) {
+		/* LINEAR: each chunk has (should have) only one disk */
+		int j;
+		char *dname = "";
+
+		assert(chunk->num_stripes == 1);
+
+		u64 phy_start = stripes[0].offset +
+			+logical_start;
+		for (j = 0 ; j < ndisks ; j++)
+			if (stripes[0].devid == disks[j].devid) {
+				dname = (char*)disks[j].path;
+				break;
+			}
+		printf("\tdevid %llu, %s : %llu LINEAR\n",
+			stripes[0].devid, dname, phy_start);
+	} else if (chunk->type & BTRFS_BLOCK_GROUP_RAID0) {
+		/*
+		 * RAID0: each chunk is composed by more disks;
+		 * each stripe_len bytes are in a different disk:
+		 *
+		 *  file: ABC...NMOP....
+		 *
+		 *      disk1   disk2    disk3  ....  disksN
+		 *
+		 *        A      B         C    ....    N
+		 *        M      O         P    ....
+		 *
+		 */
+		u64 disks_number = chunk->num_stripes;
+		u64 disk_stripe_size = chunk->stripe_len;
+		u64 stripe_capacity ;
+		u64 stripe_nr;
+		u64 disk_stripe_start;
+		int sidx;
+		int j;
+		char *dname = "";
+
+		stripe_capacity = disks_number * disk_stripe_size;
+		stripe_nr = logical_start / stripe_capacity;
+		disk_stripe_start = logical_start % disk_stripe_size;
+
+		sidx = (logical_start / disk_stripe_size) % disks_number;
+
+		u64 phy_start = stripes[sidx].offset +
+			stripe_nr * disk_stripe_size +
+			disk_stripe_start;
+
+		for (j = 0 ; j < ndisks ; j++)
+			if (stripes[sidx].devid == disks[j].devid) {
+				dname = (char*)disks[j].path;
+				break;
+			}
+		printf("\tdevid %llu, %s : %llu RAID0\n",
+			stripes[sidx].devid, dname, phy_start);
+
+	} else if (chunk->type & BTRFS_BLOCK_GROUP_RAID1) {
+		/*
+		 * RAID1: each chunk is composed by more disks;
+		 * each disk has its own copy of the data:
+		 *
+		 *  file: ABC...
+		 *
+		 *      disk1   disk2    disk3  ....
+		 *
+		 *        A       A
+		 *        B       B
+		 *        C       C
+		 *
+		 */
+		int sidx;
+		for (sidx = 0; sidx < chunk->num_stripes; sidx++) {
+			int j;
+			char *dname = "";
+			u64 phy_start = stripes[sidx].offset +
+				+logical_start;
+
+			for (j = 0 ; j < ndisks ; j++)
+				if (stripes[sidx].devid == disks[j].devid) {
+					dname = (char*)disks[j].path;
+					break;
+				}
+			printf("\tdevid %llu, %s : %llu RAID1\n",
+				stripes[sidx].devid, dname, phy_start);
+		}
+	} else if (chunk->type & BTRFS_BLOCK_GROUP_DUP) {
+		/*
+		 * DUP: each chunk has 'num_stripes' disk_stripe. Each
+		 * disk_stripe has its own copy of data
+		 *
+		 *  file: ABCD....
+		 *
+		 *      disk1   disk2    disk3
+		 *
+		 *        A
+		 *        B
+		 *        C
+		 *      [...]
+		 *        A
+		 *        B
+		 *        C
+		 *
+		 *
+		 * NOTE: the difference between DUP and RAID1 is that
+		 * in RAID1 each disk_stripe is in a different disk, in DUP
+		 * each disk chunk is in the same disk
+		 */
+		int sidx;
+		/* TBD: check what happens with the stripes */
+		for (sidx = 0; sidx < chunk->num_stripes; sidx++) {
+			int j;
+			char *dname = "";
+			u64 phy_start = stripes[sidx].offset +
+				+logical_start;
+
+			for (j = 0 ; j < ndisks ; j++)
+				if (stripes[sidx].devid == disks[j].devid) {
+					dname = (char*)disks[j].path;
+					break;
+				}
+			printf("\tdevid %llu, %s : %llu DUP\n",
+				stripes[sidx].devid, dname, phy_start);
+		}
+	} else if (chunk->type & BTRFS_BLOCK_GROUP_RAID10) {
+		/*
+		 * RAID10: each chunk is composed by more disks;
+		 * each stripe_len bytes are in a different disk:
+		 *
+		 *  file: ABCD....
+		 *
+		 *      disk1   disk2    disk3  disk4
+		 *
+		 *        A      A         B      B
+		 *        C      C         D      D
+		 *
+		 *
+		 */
+		int i;
+		u64 disks_number = chunk->num_stripes;
+		u64 disk_stripe_size = chunk->stripe_len;
+		u64 stripe_capacity ;
+		u64 stripe_nr;
+		u64 stripe_start;
+		u64 disk_stripe_start;
+
+		stripe_capacity = disks_number * disk_stripe_size / chunk->sub_stripes;
+		stripe_nr = logical_start / stripe_capacity;
+		stripe_start = logical_start % stripe_capacity;
+		disk_stripe_start = logical_start % disk_stripe_size;
+
+		for (i = 0; i < chunk->sub_stripes; i++) {
+			int j;
+			char *dname = "";
+			int sidx = (i +
+				stripe_start/disk_stripe_size*chunk->sub_stripes) %
+				disks_number;
+
+			u64 phy_start = stripes[sidx].offset +
+				+stripe_nr*disk_stripe_size + disk_stripe_start;
+
+			for (j = 0 ; j < ndisks ; j++)
+				if (stripes[sidx].devid == disks[j].devid) {
+					dname = (char*)disks[j].path;
+					break;
+				}
+			printf("\tdevid %llu, %s : %llu RAID10\n",
+				stripes[sidx].devid, dname, phy_start);
+		}
+	} else if (chunk->type & BTRFS_BLOCK_GROUP_RAID5 ||
+			chunk->type & BTRFS_BLOCK_GROUP_RAID6 ) {
+		/*
+		 * RAID5: each chunk is spread on a different disk; however one
+		 * disk is used for parity
+		 *
+		 *  file: ABCDEFGHIJK....
+		 *
+		 *      disk1   disk2    disk3  disk4   disk5
+		 *
+		 *        A      B         C      D       P
+		 *        P      D         E      F       G
+		 *        H      P         I      J       K
+		 *
+		 * Note: P == parity
+		 *
+		 * RAID6: each chunk is spread on a different disk; however two
+		 * disks are used for parity
+		 *
+		 *  file: ABCDEFGHI...
+		 *
+		 *      disk1   disk2    disk3  disk4   disk5
+		 *
+		 *        A      B         C      P       Q
+		 *        Q      D         E      F       P
+		 *        P      Q         G      H       I
+		 *
+		 * Note: P,Q == parity
+		 *
+		 */
+		int parities_nr = 1;
+		u64 disks_number = chunk->num_stripes;
+		u64 disk_stripe_size = chunk->stripe_len;
+		u64 stripe_capacity ;
+		u64 stripe_nr;
+		u64 stripe_start;
+		u64 pos = 0;
+		u64 disk_stripe_start;
+		int sidx;
+
+		if (chunk->type & BTRFS_BLOCK_GROUP_RAID6)
+			parities_nr = 2;
+
+		stripe_capacity = (disks_number - parities_nr) *
+						disk_stripe_size;
+		stripe_nr = logical_start / stripe_capacity;
+		stripe_start = logical_start % stripe_capacity;
+		disk_stripe_start = logical_start % disk_stripe_size;
+
+		for (sidx = 0; sidx < disks_number ; sidx++) {
+			int j;
+			char *dname = "";
+			u64 stripe_index = (sidx + stripe_nr) % disks_number;
+			u64 phy_start = stripes[stripe_index].offset + /* chunk start */
+				+ stripe_nr*disk_stripe_size + /* stripe start */
+				+ disk_stripe_start;
+
+			for (j = 0 ; j < ndisks ; j++)
+				if (stripes[stripe_index].devid == disks[j].devid) {
+					dname = (char*)disks[j].path;
+					break;
+				}
+
+			if (sidx >= (disks_number - parities_nr)) {
+				printf("\tdevid %llu, %s : %llu PARITY\n",
+					stripes[stripe_index].devid, dname,
+					phy_start);
+				continue;
+			}
+
+			if (stripe_start >= pos && stripe_start < (pos+disk_stripe_size)) {
+				printf("\tdevid %llu, %s : %llu DATA\n",
+					stripes[stripe_index].devid,
+					dname, phy_start);
+			} else {
+				printf("\tdevid %llu, %s : %llu OTHER\n",
+					stripes[stripe_index].devid,
+					dname, phy_start);
+			}
+
+			pos += disk_stripe_size;
+		}
+		assert(pos == stripe_capacity);
+	} else {
+		error("Unknown chunk type = 0x%016llx\n", chunk->type);
+		return;
+	}
+
+}
+
+static int dump_extent(char *fname, int fd, u64 logical_start) {
+
+	struct btrfs_ioctl_search_args args;
+	struct btrfs_ioctl_search_key *sk = &args.key;
+	struct btrfs_ioctl_search_header sh;
+	unsigned long off = 0;
+	int i;
+	int e;
+	struct btrfs_ioctl_dev_info_args *disks = NULL;
+	struct btrfs_ioctl_fs_info_args fi_args = {0};
+
+	e = get_fs_info(fname, &fi_args, &disks);
+	if ( e< 0) {
+		error("Cannot get info for the filesystem: may be it is not a btrfs filesystem ?\n");
+		free(disks);
+		return -1;
+	}
+
+	memset(&args, 0, sizeof(args));
+	sk->tree_id = BTRFS_CHUNK_TREE_OBJECTID;
+	sk->min_objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
+	sk->max_objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
+	sk->min_type = BTRFS_CHUNK_ITEM_KEY;
+	sk->max_type = BTRFS_CHUNK_ITEM_KEY;
+	sk->max_offset = (u64)-1;
+	sk->min_offset = 0;
+	sk->max_transid = (u64)-1;
+
+	while (1) {
+		int ret;
+
+		sk->nr_items = 1;
+		ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args);
+		e = errno;
+		if (ret < 0) {
+			error("cannot perform the search: %s", strerror(e));
+			free(disks);
+			return -1;
+		}
+		if (sk->nr_items == 0)
+			break;
+
+		off = 0;
+		for (i = 0; i < sk->nr_items; i++) {
+			struct btrfs_chunk *item;
+
+			memcpy(&sh, args.buf + off, sizeof(sh));
+			off += sizeof(sh);
+			item = (struct btrfs_chunk*)(args.buf + off);
+			off += sh.len;
+
+			if (logical_start >= sh.offset &&
+			    logical_start <= sh.offset+item->length) {
+				dump_stripes(fi_args.num_devices, disks,
+					item,
+					logical_start-sh.offset);
+				free(disks);
+				return 0;
+			}
+
+
+			sk->min_objectid = sh.objectid;
+			sk->min_type = sh.type;
+			sk->min_offset = sh.offset;
+		}
+
+		if (sk->min_offset < (u64)-1)
+			sk->min_offset++;
+		else
+			break;
+	}
+
+	free(disks);
+	return 0;
+}
+
+/*
+ * Inline extents are skipped because they do not take data space,
+ * delalloc and unknown are skipped because we do not know how much
+ * space they will use yet.
+ */
+#define SKIP_FLAGS	(FIEMAP_EXTENT_UNKNOWN|FIEMAP_EXTENT_DELALLOC| \
+			 FIEMAP_EXTENT_DATA_INLINE)
+static int cmd_inspect_physical_find(int argc, char **argv)
+{
+	int ret = 0;
+	u64 logical = 0ull;
+	int fd;
+	int last = 0;
+	char buf[16384];
+	char *fname;
+	int found = 0;
+	struct fiemap *fiemap = (struct fiemap*)buf;
+	struct fiemap_extent *fm_ext = &fiemap->fm_extents[0];
+	const int count = (sizeof(buf) - sizeof(*fiemap)) /
+					sizeof(struct fiemap_extent);
+
+	int minargc = 1;
+
+	memset(fiemap, 0, sizeof(struct fiemap));
+
+	if (check_argc_min(argc - minargc, 1) ||
+	    check_argc_max(argc - minargc, 2) )
+		usage(cmd_inspect_physical_find_usage);
+
+	if (argc - minargc == 2)
+		logical = strtoull(argv[minargc+1], NULL, 0);
+	fname = argv[minargc];
+
+	printf("%s: %llu\n", fname, logical);
+
+	fd = open(fname, O_RDONLY);
+	if (fd < 0) {
+		error("Can't open '%s' for reading\n", fname);
+		ret = -errno;
+		goto out;
+	}
+
+	do {
+
+		int rc;
+		int j;
+
+		fiemap->fm_length = ~0ULL;
+		fiemap->fm_extent_count = count;
+		fiemap->fm_flags = FIEMAP_FLAG_SYNC;
+		rc = ioctl(fd, FS_IOC_FIEMAP, (unsigned long) fiemap);
+		if (rc < 0) {
+			error("Can't do ioctl()\n");
+			close(fd);
+			ret = -errno;
+			goto out;
+		}
+
+		for (j = 0; j < fiemap->fm_mapped_extents; j++) {
+			u32 flags = fm_ext[j].fe_flags;
+
+			fiemap->fm_start = (fm_ext[j].fe_logical +
+					fm_ext[j].fe_length);
+
+			if (flags & FIEMAP_EXTENT_LAST)
+				last = 1;
+
+			if (flags & SKIP_FLAGS)
+				continue;
+
+			if (logical > fm_ext[j].fe_logical +
+			    fm_ext[j].fe_length)
+				continue;
+
+			found = 1;
+
+			rc = dump_extent(fname, fd,
+				fm_ext[j].fe_physical + logical -
+				fm_ext[j].fe_logical);
+			if (rc < 0)
+				ret = -errno;
+			last = 1;
+			break;
+		}
+	} while (last == 0);
+
+	close(fd);
+
+	if (!found) {
+		error("Can't find the extent: the file is too short, or the file is stored in a leaf.\n");
+		ret = 10;
+	}
+
+out:
+	return ret;
+}
+
 static const char inspect_cmd_group_info[] =
 "query various internal information";
 
@@ -644,6 +1093,8 @@ const struct cmd_group inspect_cmd_group = {
 		cmd_inspect_dump_super_usage, NULL, 0 },
 	{ "tree-stats", cmd_inspect_tree_stats,
 		cmd_inspect_tree_stats_usage, NULL, 0 },
+	{ "physical-find", cmd_inspect_physical_find,
+		cmd_inspect_physical_find_usage, NULL, 0 },
 	NULL_CMD_STRUCT
 	}
 };
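
For anyone who wants to check the RAID5/6 arithmetic without a btrfs filesystem at hand, here is a minimal standalone sketch of the same computation done in the RAID5/6 branch of dump_stripes(), plus the XOR used in the Raid5 example. It is only an illustration: the names raid56_map(), xor_parity(), struct stripe_pos and enum stripe_role are hypothetical and are not part of the patch or of btrfs-progs, and the offsets are relative to the start of the chunk and of each disk-stripe, not absolute disk addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names: none of these exist in the patch or in btrfs-progs. */
enum stripe_role { ROLE_DATA, ROLE_OTHER, ROLE_PARITY };

struct stripe_pos {
	uint64_t stripe_index;	/* index into the chunk's stripe array */
	uint64_t offset;	/* byte offset from the start of that disk-stripe */
	enum stripe_role role;
};

/*
 * Same arithmetic as the RAID5/6 branch of dump_stripes() in the patch.
 * 'logical' is the offset from the start of the chunk; 'out' must have
 * room for 'ndisks' entries. 'nparity' is 1 for RAID5 and 2 for RAID6.
 */
static void raid56_map(uint64_t logical, uint64_t ndisks, uint64_t nparity,
		       uint64_t stripe_len, struct stripe_pos *out)
{
	assert(stripe_len > 0 && ndisks > nparity);

	uint64_t capacity = (ndisks - nparity) * stripe_len;
	uint64_t stripe_nr = logical / capacity;
	uint64_t stripe_start = logical % capacity;
	uint64_t pos = 0;
	uint64_t sidx;

	for (sidx = 0; sidx < ndisks; sidx++) {
		/* the starting disk rotates by one for every full stripe */
		out[sidx].stripe_index = (sidx + stripe_nr) % ndisks;
		out[sidx].offset = stripe_nr * stripe_len + logical % stripe_len;

		/* the last 'nparity' positions of each full stripe hold parity */
		if (sidx >= ndisks - nparity) {
			out[sidx].role = ROLE_PARITY;
			continue;
		}
		out[sidx].role = (stripe_start >= pos &&
				  stripe_start < pos + stripe_len) ?
				 ROLE_DATA : ROLE_OTHER;
		pos += stripe_len;
	}
}

/* RAID5 parity of two data disk-stripes: a byte-wise XOR. */
static void xor_parity(const unsigned char *a, const unsigned char *b,
		       unsigned char *parity, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		parity[i] = a[i] ^ b[i];
}
```

Assuming three disks, one parity and a stripe length of 65536, calling raid56_map() with offset 65536 marks the first entry as OTHER, the second as DATA and the third as PARITY, which is the same order as the Raid5 example output for offset 65536. Calling xor_parity() on "adaaa" and "bdbbb" gives the bytes 03 00 03 03 03, the same values xxd prints for the parity disk-stripe.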