[RFC] btrfs-progs: inspect: new subcommand to dump chunks
diff mbox

Message ID 1466616406-28087-1-git-send-email-dsterba@suse.com
State New
Headers show

Commit Message

David Sterba June 22, 2016, 5:26 p.m. UTC
From: David Sterba <dsterba@suse.cz>

Hi,

the chunk dump is a useful thing, for debugging or balance filters.

Example output:

Chunks on device id: 1
PNumber            Type        PStart        Length          PEnd     Age         LStart  Usage
-----------------------------------------------------------------------------------------------
      0   System/RAID1        1.00MiB      32.00MiB      33.00MiB      47        1.40TiB   0.06
      1 Metadata/RAID1       33.00MiB       1.00GiB       1.03GiB      31        1.36TiB  64.03
      2 Metadata/RAID1        1.03GiB      32.00MiB       1.06GiB      36        1.36TiB  77.28
      3     Data/single       1.06GiB       1.00GiB       2.06GiB      12      422.30GiB  78.90
      4     Data/single       2.06GiB       1.00GiB       3.06GiB      11      420.30GiB  78.47
...

(full output is at http://susepaste.org/view/raw/089a9877)

This patch does the basic output, not filtering or sorting besides physical and
logical offset. There are possiblities to do more or enhance the output (eg.
starting with logical chunk and list the related physcial chunks together).
Or filter by type/profile, or understand the balance filters format.

As it is now, it's per-device dump of physical layout, which was the original
idea.

Printing 'usage' is not default as it's quite slow, it uses the search ioctl
and probably not in the best way, or there's some other issue in the
implementation.

I'll add the patch to devel branch but will not add it for any particular
release yet, I'd like some feedback first. Thanks.

-------------
New command 'btrfs inspect-internal dump-chunks' will dump layout of
chunks as stored on the devices. This corresponds to the physical
layout, sorted by the physical offset. The block group usage can be
shown as well, but the search is too slow so it's off by default.

If the physical offset sorting is selected, the empty space between
chunks is also shown.

Signed-off-by: David Sterba <dsterba@suse.com>
---
 cmds-inspect.c | 364 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 364 insertions(+)

Comments

Hans van Kranenburg June 22, 2016, 10:20 p.m. UTC | #1
On 06/22/2016 07:26 PM, David Sterba wrote:
> From: David Sterba <dsterba@suse.cz>
>
> Hi,
>
> the chunk dump is a useful thing, for debugging or balance filters.

Nice!

> Example output:
>
> Chunks on device id: 1
> PNumber            Type        PStart        Length          PEnd     Age         LStart  Usage
> -----------------------------------------------------------------------------------------------
>       0   System/RAID1        1.00MiB      32.00MiB      33.00MiB      47        1.40TiB   0.06
>       1 Metadata/RAID1       33.00MiB       1.00GiB       1.03GiB      31        1.36TiB  64.03
>       2 Metadata/RAID1        1.03GiB      32.00MiB       1.06GiB      36        1.36TiB  77.28
>       3     Data/single       1.06GiB       1.00GiB       2.06GiB      12      422.30GiB  78.90
>       4     Data/single       2.06GiB       1.00GiB       3.06GiB      11      420.30GiB  78.47
> ...
>
> (full output is at http://susepaste.org/view/raw/089a9877)
>
> This patch does the basic output, not filtering or sorting besides physical and
> logical offset. There are possiblities to do more or enhance the output (eg.
> starting with logical chunk and list the related physcial chunks together).
> Or filter by type/profile, or understand the balance filters format.
>
> As it is now, it's per-device dump of physical layout, which was the original
> idea.
>
> Printing 'usage' is not default as it's quite slow, it uses the search ioctl
> and probably not in the best way, or there's some other issue in the
> implementation.

Interesting.

So after reading this, I wrote a little test to test some scenarios:

https://github.com/knorrie/python-btrfs/commit/1ca99880dfa0e14b148f3d9e2b6b381b781eb52d

It's very clear that the most optimal way of doing this search is to 
have nr_items=1 and if possible, specify the length in offset.

I have no idea yet what happens inside the kernel when nr_items is not 
1, in combination with a large extent tree. I'll try to follow the code 
and see if I can find what's making this big difference.

Also, the filesystem with the 1200 second search result is Debian 
3.16.7-ckt25-2, I haven't looked yet if there's different behaviour 
inside handling searches between that kernel and e.g. 4.5.

After loading a huge amount of data in memory, the search with nr_items 
 > 1 only takes 4.6 seconds to come up with the block group item itself, 
and another 4.6 seconds to come up with 0 additional results, only using 
cpu time.

It seems that searching in the empty space between
  (vaddr BLOCK_GROUP_ITEM length+1) and
  (vaddr BLOCK_GROUP_ITEM ULLONG_MAX)
is really expensive, while there's absolutely nothing to find.

I ran into this two days ago, caused by doing an unnecessary extra 
search after finding the block item already:

https://github.com/knorrie/python-btrfs/commit/5a69d9cf0477515ce2d6e50f1741276a49b33af8

> I'll add the patch to devel branch but will not add it for any particular
> release yet, I'd like some feedback first. Thanks.
>
> -------------
> New command 'btrfs inspect-internal dump-chunks' will dump layout of
> chunks as stored on the devices. This corresponds to the physical
> layout, sorted by the physical offset. The block group usage can be
> shown as well, but the search is too slow so it's off by default.
>
> If the physical offset sorting is selected, the empty space between
> chunks is also shown.
>
> Signed-off-by: David Sterba <dsterba@suse.com>
> ---
>  cmds-inspect.c | 364 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 364 insertions(+)
> [...]
> +}
> +
> +static u64 fill_usage(int fd, u64 lstart)
> +{
> +	struct btrfs_ioctl_search_args args;
> +	struct btrfs_ioctl_search_key *sk = &args.key;
> +	struct btrfs_ioctl_search_header sh;
> +	struct btrfs_block_group_item *item;
> +	int ret;
> +
> +	memset(&args, 0, sizeof(args));
> +	sk->tree_id = BTRFS_EXTENT_TREE_OBJECTID;
> +	sk->min_objectid = lstart;
> +	sk->max_objectid = lstart;
> +	sk->min_type = BTRFS_BLOCK_GROUP_ITEM_KEY;
> +	sk->max_type = BTRFS_BLOCK_GROUP_ITEM_KEY;
> +	sk->min_offset = 0;
> +	sk->max_offset = (u64)-1;

The chunk length is already known here, so it can go in offset.

> +	sk->max_transid = (u64)-1;
> +
> +	sk->nr_items = 4096;

Or set to 1 because we already know there can only be one result.
Hans van Kranenburg June 23, 2016, 1:10 a.m. UTC | #2
On 06/22/2016 07:26 PM, David Sterba wrote:
> From: David Sterba <dsterba@suse.cz>
>
> Hi,
>
> the chunk dump is a useful thing, for debugging or balance filters.
>
> Example output:
>
> Chunks on device id: 1
> PNumber            Type        PStart        Length          PEnd     Age         LStart  Usage
> -----------------------------------------------------------------------------------------------
>       0   System/RAID1        1.00MiB      32.00MiB      33.00MiB      47        1.40TiB   0.06
>       1 Metadata/RAID1       33.00MiB       1.00GiB       1.03GiB      31        1.36TiB  64.03
>       2 Metadata/RAID1        1.03GiB      32.00MiB       1.06GiB      36        1.36TiB  77.28
>       3     Data/single       1.06GiB       1.00GiB       2.06GiB      12      422.30GiB  78.90
>       4     Data/single       2.06GiB       1.00GiB       3.06GiB      11      420.30GiB  78.47
> ...

On RAID0, it looks funny :)

# ./btrfs inspect-internal dump-chunks /mnt/extents/
Chunks on device id: 1
PNumber            Type        PStart        Length          PEnd 
Age         LStart
----------------------------------------------------------------------------------------
       0 Metadata/RAID0        1.00MiB     256.00MiB     257.00MiB 
1       13.36GiB
       .           empty             .      16.00EiB             . 
.              .
       1   System/RAID0      129.00MiB      64.00MiB     193.00MiB 
2       13.61GiB
       .           empty             .      16.00EiB             . 
.              .
       2     Data/RAID0      161.00MiB       2.00GiB       2.16GiB 
3       15.68GiB
       .           empty             .     115.00MiB             . 
.              .
       3     Data/RAID0        2.27GiB       2.00GiB       4.27GiB 
0       11.36GiB

Chunks on device id: 2
PNumber            Type        PStart        Length          PEnd 
Age         LStart
----------------------------------------------------------------------------------------
       0     Data/RAID0        1.00MiB       2.00GiB       2.00GiB 
0       11.36GiB
       .           empty             .      16.00EiB             . 
.              .
       1 Metadata/RAID0        1.00GiB     256.00MiB       1.25GiB 
1       13.36GiB
       .           empty             .      16.00EiB             . 
.              .
       2   System/RAID0        1.13GiB      64.00MiB       1.19GiB 
2       13.61GiB
       .           empty             .      16.00EiB             . 
.              .
       3     Data/RAID0        1.16GiB       2.00GiB       3.16GiB 
3       15.68GiB

For the correct sizes, it would be needed to either do some math, like 
if RAID0 then divide by 2, but I guess this gets horrible really soon 
with RAID5 etc... Or, if you want to print everything in the physical 
order, why not just dump the device tree with DEV_EXTENT which has the 
exact length on disk, and fill in the missing information using the 
reference to the chunk they belong to?

https://github.com/knorrie/python-btrfs/commit/be85c2a0e2774909b532326e99e0c23ab8467263

Sorry, couldn't resist /:D\
Qu Wenruo June 23, 2016, 1:20 a.m. UTC | #3
At 06/23/2016 01:26 AM, David Sterba wrote:
> From: David Sterba <dsterba@suse.cz>
>
> Hi,
>
> the chunk dump is a useful thing, for debugging or balance filters.

Yes, quite useful.
I don't ever need to manually check btrfs-debug-tree output to figure 
out the dev-extent layout.

But according to the output, it's more like dev-extent dump, not chunk dump.

It's better to make things more clear for objects at different logical 
level.

>
> Example output:
>
> Chunks on device id: 1
> PNumber            Type        PStart        Length          PEnd     Age         LStart  Usage
> -----------------------------------------------------------------------------------------------
>       0   System/RAID1        1.00MiB      32.00MiB      33.00MiB      47        1.40TiB   0.06

[Mixed chunk/dev-extent output]
For "PStart/Length/Pend/Lstart" that's OK.
(Although I prefer "Plength" to "Length", as it's a little confusing for 
chunk logical length and dev_extent physical length)

Since "pstart" = key.offset, "length" = dev_extent_length, "Pend" = 
pstart + length and "Lstart" = dev_extent_chunk_offset, these values are 
all straightforward.

But for Usage, it's completely a chunk/block group level thing, so 
personally speaking it's better not to output per chunk usage in 
dev-extent dump

If there is real "chunk dump", then that's the best place to output 
per-chunk "usage".

For dev-extent level usage, it should be the usage of a device, not per 
chunk.

["Age" calculation]
Another concern is about "age", since there is no generation in 
chunk_item nor dev_extent, so you're using logical bytenr as a hint.

This method implies the fact that btrfs will always allocates new chunk 
with increasing logical bytenr, never reuse lower logical bytenr.

Such behavior provides the basis for a lot of other btrfs functions, 
like balance/qgroup rescan.
IMHO it won't be changed in any time soon. But I'm still a little 
concerned about using such assumption.


Since "age" here is more like a sorting index, not the generation when 
the chunk is created.
It may not really reflect how long one chunk really lived.

It maybe completely OK for most people, but at least it's not as 
straightforward as generation.


[Support for offline btrfs]
BTW, it's also a good idea to support not only mounted fs, but also 
offlined one.

As for offlined one, no need to bother about the tree search ioctl, just 
straightforward search_slots().
(That's why I like offline btrfs tools)

Thanks,
Qu

>       1 Metadata/RAID1       33.00MiB       1.00GiB       1.03GiB      31        1.36TiB  64.03
>       2 Metadata/RAID1        1.03GiB      32.00MiB       1.06GiB      36        1.36TiB  77.28
>       3     Data/single       1.06GiB       1.00GiB       2.06GiB      12      422.30GiB  78.90
>       4     Data/single       2.06GiB       1.00GiB       3.06GiB      11      420.30GiB  78.47
> ...
>
> (full output is at http://susepaste.org/view/raw/089a9877)
>
> This patch does the basic output, not filtering or sorting besides physical and
> logical offset. There are possiblities to do more or enhance the output (eg.
> starting with logical chunk and list the related physcial chunks together).
> Or filter by type/profile, or understand the balance filters format.
>
> As it is now, it's per-device dump of physical layout, which was the original
> idea.
>
> Printing 'usage' is not default as it's quite slow, it uses the search ioctl
> and probably not in the best way, or there's some other issue in the
> implementation.
>
> I'll add the patch to devel branch but will not add it for any particular
> release yet, I'd like some feedback first. Thanks.
>
> -------------
> New command 'btrfs inspect-internal dump-chunks' will dump layout of
> chunks as stored on the devices. This corresponds to the physical
> layout, sorted by the physical offset. The block group usage can be
> shown as well, but the search is too slow so it's off by default.
>
> If the physical offset sorting is selected, the empty space between
> chunks is also shown.
>
> Signed-off-by: David Sterba <dsterba@suse.com>
> ---
>  cmds-inspect.c | 364 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 364 insertions(+)
>
> diff --git a/cmds-inspect.c b/cmds-inspect.c
> index dd7b9dd278f2..4eace63d8517 100644
> --- a/cmds-inspect.c
> +++ b/cmds-inspect.c
> @@ -623,6 +623,368 @@ static int cmd_inspect_min_dev_size(int argc, char **argv)
>  	return !!ret;
>  }
>
> +static const char * const cmd_dump_chunks_usage[] = {
> +	"btrfs inspect-internal chunk-stats [options] <path>",
> +	"Show chunks (block groups) layout",
> +	"Show chunks (block groups) layout for all devices",
> +	"",
> +	HELPINFO_UNITS_LONG,
> +	"--sort=MODE        sort by the physical or logical chunk start",
> +	"                   MODE is one of pstart or lstart (default: pstart)",
> +	"--usage            show usage per block group, note this can be slow",
> +	NULL
> +};
> +
> +enum {
> +	CHUNK_SORT_PSTART,
> +	CHUNK_SORT_LSTART,
> +	CHUNK_SORT_DEFAULT = CHUNK_SORT_PSTART
> +};
> +
> +struct dump_chunks_entry {
> +	u64 devid;
> +	u64 start;
> +	u64 lstart;
> +	u64 length;
> +	u64 flags;
> +	u64 age;
> +	u64 used;
> +	u32 pnumber;
> +};
> +
> +struct dump_chunks_ctx {
> +	unsigned length;
> +	unsigned size;
> +	struct dump_chunks_entry *stats;
> +};
> +
> +int cmp_cse_devid_start(const void *va, const void *vb)
> +{
> +	const struct dump_chunks_entry *a = va;
> +	const struct dump_chunks_entry *b = vb;
> +
> +	if (a->devid < b->devid)
> +		return -1;
> +	if (a->devid > b->devid)
> +		return 1;
> +
> +	if (a->start < b->start)
> +		return -1;
> +	if (a->start == b->start) {
> +		error(
> +	"chunks start on same offset in the same device: devid %llu start %llu",
> +		    (unsigned long long)a->devid, (unsigned long long)a->start);
> +		return 0;
> +	}
> +	return 1;
> +}
> +
> +int cmp_cse_devid_lstart(const void *va, const void *vb)
> +{
> +	const struct dump_chunks_entry *a = va;
> +	const struct dump_chunks_entry *b = vb;
> +
> +	if (a->devid < b->devid)
> +		return -1;
> +	if (a->devid > b->devid)
> +		return 1;
> +
> +	if (a->lstart < b->lstart)
> +		return -1;
> +	if (a->lstart == b->lstart) {
> +		error(
> +"chunks logically start on same offset in the same device: devid %llu start %llu",
> +		    (unsigned long long)a->devid, (unsigned long long)a->lstart);
> +		return 0;
> +	}
> +	return 1;
> +}
> +
> +void print_dump_chunks(struct dump_chunks_ctx *ctx, unsigned sort_mode,
> +		unsigned unit_mode, int with_usage)
> +{
> +	u64 devid;
> +	struct dump_chunks_entry e;
> +	int i;
> +	int chidx;
> +	u64 lastend = 0;
> +	u64 age;
> +
> +	/*
> +	 * Chunks are sorted logically as found by the ioctl, we need to sort
> +	 * them once to find the physical ordering. This is the default mode.
> +	 */
> +	qsort(ctx->stats, ctx->length, sizeof(ctx->stats[0]), cmp_cse_devid_start);
> +	devid = 0;
> +	age = 0;
> +	for (i = 0; i < ctx->length; i++) {
> +		e = ctx->stats[i];
> +		if (e.devid != devid) {
> +			devid = e.devid;
> +			age = 0;
> +		}
> +		ctx->stats[i].pnumber = age;
> +		age++;
> +	}
> +
> +	if (sort_mode == CHUNK_SORT_LSTART)
> +		qsort(ctx->stats, ctx->length, sizeof(ctx->stats[0]), cmp_cse_devid_lstart);
> +
> +	devid = 0;
> +	for (i = 0; i < ctx->length; i++) {
> +		e = ctx->stats[i];
> +		if (e.devid != devid) {
> +			devid = e.devid;
> +			if (i != 0)
> +				putchar('\n');
> +			printf("Chunks on device id: %llu\n", devid);
> +			printf("PNumber            Type        PStart        Length          PEnd     Age         LStart%s\n",
> +					with_usage ? "  Usage" : "");
> +			printf("----------------------------------------------------------------------------------------%s\n",
> +					with_usage ? "-------" : "");
> +			chidx = 0;
> +			lastend = 0;
> +		}
> +		if (sort_mode == CHUNK_SORT_PSTART && lastend > 0
> +		    && e.start != lastend) {
> +			printf("      .           empty             .  ");
> +			printf("%12s  ",
> +				pretty_size_mode(e.start - lastend, unit_mode));
> +			printf("           .       .              .\n");
> +		}
> +
> +		printf("%7u ", e.pnumber);
> +		printf("%8s/%-6s  ", btrfs_group_type_str(e.flags),
> +				btrfs_group_profile_str(e.flags));
> +		printf("%12s  ", pretty_size_mode(e.start, unit_mode));
> +		printf("%12s  ", pretty_size_mode(e.length, unit_mode));
> +		printf("%12s  ",
> +			pretty_size_mode(e.start + e.length - 1, unit_mode));
> +		printf("%6llu ", e.age);
> +		printf("%14s", pretty_size_mode(e.lstart, unit_mode));
> +		if (with_usage)
> +			printf("  %5.2f", (float)e.used / e.length * 100);
> +		printf("\n");
> +
> +		lastend = e.start + e.length;
> +		chidx++;
> +	}
> +}
> +
> +static u64 fill_usage(int fd, u64 lstart)
> +{
> +	struct btrfs_ioctl_search_args args;
> +	struct btrfs_ioctl_search_key *sk = &args.key;
> +	struct btrfs_ioctl_search_header sh;
> +	struct btrfs_block_group_item *item;
> +	int ret;
> +
> +	memset(&args, 0, sizeof(args));
> +	sk->tree_id = BTRFS_EXTENT_TREE_OBJECTID;
> +	sk->min_objectid = lstart;
> +	sk->max_objectid = lstart;
> +	sk->min_type = BTRFS_BLOCK_GROUP_ITEM_KEY;
> +	sk->max_type = BTRFS_BLOCK_GROUP_ITEM_KEY;
> +	sk->min_offset = 0;
> +	sk->max_offset = (u64)-1;
> +	sk->max_transid = (u64)-1;
> +
> +	sk->nr_items = 4096;
> +	ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args);
> +	if (ret < 0) {
> +		error("cannot perform the search: %s", strerror(errno));
> +		return 1;
> +	}
> +	if (sk->nr_items == 0) {
> +		warning("blockgroup %llu not found",
> +				(unsigned long long)lstart);
> +		return 0;
> +	}
> +	if (sk->nr_items > 1) {
> +		warning("found more than one blockgroup %llu",
> +				(unsigned long long)lstart);
> +	}
> +
> +	memcpy(&sh, args.buf, sizeof(sh));
> +	item = (struct btrfs_block_group_item*)(args.buf + sizeof(sh));
> +
> +	return item->used;
> +}
> +
> +static int cmd_dump_chunks(int argc, char **argv)
> +{
> +	struct btrfs_ioctl_search_args args;
> +	struct btrfs_ioctl_search_key *sk = &args.key;
> +	struct btrfs_ioctl_search_header sh;
> +	unsigned long off = 0;
> +	u64 *age = 0;
> +	unsigned age_size = 128;
> +	int ret;
> +	int fd;
> +	int i;
> +	int e;
> +	DIR *dirstream = NULL;
> +	unsigned unit_mode;
> +	unsigned sort_mode = 0;
> +	int with_usage = 0;
> +	const char *path;
> +	struct dump_chunks_ctx ctx = {
> +		.length = 0,
> +		.size = 1024,
> +		.stats = NULL
> +	};
> +
> +	unit_mode = get_unit_mode_from_arg(&argc, argv, 0);
> +
> +	while (1) {
> +		int c;
> +		enum { GETOPT_VAL_SORT = 256, GETOPT_VAL_USAGE };
> +		static const struct option long_options[] = {
> +			{"sort", required_argument, NULL, GETOPT_VAL_SORT },
> +			{"usage", no_argument, NULL, GETOPT_VAL_USAGE },
> +			{NULL, 0, NULL, 0}
> +		};
> +
> +		c = getopt_long(argc, argv, "", long_options, NULL);
> +		if (c < 0)
> +			break;
> +
> +		switch (c) {
> +		case GETOPT_VAL_SORT:
> +			if (strcmp(optarg, "pstart") == 0) {
> +				sort_mode = CHUNK_SORT_PSTART;
> +			} else if (strcmp(optarg, "lstart") == 0) {
> +				sort_mode = CHUNK_SORT_LSTART;
> +			} else {
> +				error("unknown sort mode: %s", optarg);
> +				exit(1);
> +			}
> +			break;
> +		case GETOPT_VAL_USAGE:
> +			with_usage = 1;
> +			break;
> +		default:
> +			usage(cmd_dump_chunks_usage);
> +		}
> +	}
> +
> +	if (check_argc_exact(argc - optind, 1))
> +		usage(cmd_dump_chunks_usage);
> +
> +	ctx.stats = calloc(ctx.size, sizeof(ctx.stats[0]));
> +	if (!ctx.stats)
> +		goto out_nomem;
> +
> +	path = argv[optind];
> +
> +	fd = open_file_or_dir(path, &dirstream);
> +	if (fd < 0) {
> +	        error("cannot access '%s': %s", path, strerror(errno));
> +		return 1;
> +	}
> +
> +	memset(&args, 0, sizeof(args));
> +	sk->tree_id = BTRFS_CHUNK_TREE_OBJECTID;
> +	sk->min_objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
> +	sk->max_objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
> +	sk->min_type = BTRFS_CHUNK_ITEM_KEY;
> +	sk->max_type = BTRFS_CHUNK_ITEM_KEY;
> +	sk->max_offset = (u64)-1;
> +	sk->max_transid = (u64)-1;
> +	age = calloc(age_size, sizeof(u64));
> +	if (!age)
> +		goto out_nomem;
> +
> +	while (1) {
> +		sk->nr_items = 4096;
> +		ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args);
> +		e = errno;
> +		if (ret < 0) {
> +			error("cannot perform the search: %s", strerror(e));
> +			return 1;
> +		}
> +		if (sk->nr_items == 0)
> +			break;
> +
> +		off = 0;
> +		for (i = 0; i < sk->nr_items; i++) {
> +			struct btrfs_chunk *item;
> +			struct btrfs_stripe *stripes;
> +			int sidx;
> +			u64 used = (u64)-1;
> +
> +			memcpy(&sh, args.buf + off, sizeof(sh));
> +			off += sizeof(sh);
> +			item = (struct btrfs_chunk*)(args.buf + off);
> +			off += sh.len;
> +
> +			stripes = &item->stripe;
> +			for (sidx = 0; sidx < item->num_stripes; sidx++) {
> +				struct dump_chunks_entry *e;
> +				u64 devid;
> +
> +				e = &ctx.stats[ctx.length];
> +				devid = stripes[sidx].devid;
> +				e->devid = devid;
> +				e->start = stripes[sidx].offset;
> +				e->lstart = sh.offset;
> +				e->length = item->length;
> +				e->flags = item->type;
> +				e->pnumber = -1;
> +				while (devid > age_size) {
> +					u64 *tmp;
> +					unsigned old_size = age_size;
> +
> +					age_size += 128;
> +					tmp = calloc(age_size, sizeof(u64));
> +					if (!tmp) {
> +						free(age);
> +						goto out_nomem;
> +					}
> +					memcpy(tmp, age, sizeof(u64) * old_size);
> +					age = tmp;
> +				}
> +				e->age = age[devid]++;
> +				if (with_usage) {
> +					if (used == (u64)-1)
> +						used = fill_usage(fd, sh.offset);
> +					e->used = used;
> +				} else {
> +					e->used = 0;
> +				}
> +
> +				ctx.length++;
> +
> +				if (ctx.length == ctx.size) {
> +					ctx.size += 1024;
> +					ctx.stats = realloc(ctx.stats, ctx.size
> +						* sizeof(ctx.stats[0]));
> +					if (!ctx.stats)
> +						goto out_nomem;
> +				}
> +			}
> +
> +			sk->min_objectid = sh.objectid;
> +			sk->min_type = sh.type;
> +			sk->min_offset = sh.offset;
> +		}
> +		if (sk->min_offset < (u64)-1)
> +			sk->min_offset++;
> +		else
> +			break;
> +	}
> +
> +	print_dump_chunks(&ctx, sort_mode, unit_mode, with_usage);
> +	free(ctx.stats);
> +
> +	close_file_or_dir(fd, dirstream);
> +	return 0;
> +
> +out_nomem:
> +	error("not enough memory");
> +	return 1;
> +}
> +
>  static const char inspect_cmd_group_info[] =
>  "query various internal information";
>
> @@ -644,6 +1006,8 @@ const struct cmd_group inspect_cmd_group = {
>  				cmd_inspect_dump_super_usage, NULL, 0 },
>  		{ "tree-stats", cmd_inspect_tree_stats,
>  				cmd_inspect_tree_stats_usage, NULL, 0 },
> +		{ "dump-chunks", cmd_dump_chunks, cmd_dump_chunks_usage, NULL,
> +			0 },
>  		NULL_CMD_STRUCT
>  	}
>  };
>


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liu Bo June 23, 2016, 1:53 a.m. UTC | #4
On Wed, Jun 22, 2016 at 07:26:46PM +0200, David Sterba wrote:
> From: David Sterba <dsterba@suse.cz>
> 
> Hi,
> 
> the chunk dump is a useful thing, for debugging or balance filters.
> 
> Example output:
> 
> Chunks on device id: 1
> PNumber            Type        PStart        Length          PEnd     Age         LStart  Usage
> -----------------------------------------------------------------------------------------------
>       0   System/RAID1        1.00MiB      32.00MiB      33.00MiB      47        1.40TiB   0.06
>       1 Metadata/RAID1       33.00MiB       1.00GiB       1.03GiB      31        1.36TiB  64.03
>       2 Metadata/RAID1        1.03GiB      32.00MiB       1.06GiB      36        1.36TiB  77.28
>       3     Data/single       1.06GiB       1.00GiB       2.06GiB      12      422.30GiB  78.90
>       4     Data/single       2.06GiB       1.00GiB       3.06GiB      11      420.30GiB  78.47
> ...
> 
> (full output is at http://susepaste.org/view/raw/089a9877)
> 
> This patch does the basic output, not filtering or sorting besides physical and
> logical offset. There are possiblities to do more or enhance the output (eg.
> starting with logical chunk and list the related physcial chunks together).
> Or filter by type/profile, or understand the balance filters format.
> 
> As it is now, it's per-device dump of physical layout, which was the original
> idea.
> 
> Printing 'usage' is not default as it's quite slow, it uses the search ioctl
> and probably not in the best way, or there's some other issue in the
> implementation.
> 
> I'll add the patch to devel branch but will not add it for any particular
> release yet, I'd like some feedback first. Thanks.
> 
> -------------
> New command 'btrfs inspect-internal dump-chunks' will dump layout of
> chunks as stored on the devices. This corresponds to the physical
> layout, sorted by the physical offset. The block group usage can be
> shown as well, but the search is too slow so it's off by default.
> 
> If the physical offset sorting is selected, the empty space between
> chunks is also shown.
> 
> Signed-off-by: David Sterba <dsterba@suse.com>
> ---
>  cmds-inspect.c | 364 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 364 insertions(+)
> 
> diff --git a/cmds-inspect.c b/cmds-inspect.c
> index dd7b9dd278f2..4eace63d8517 100644
> --- a/cmds-inspect.c
> +++ b/cmds-inspect.c
> @@ -623,6 +623,368 @@ static int cmd_inspect_min_dev_size(int argc, char **argv)
>  	return !!ret;
>  }
>  
> +static const char * const cmd_dump_chunks_usage[] = {
> +	"btrfs inspect-internal chunk-stats [options] <path>",
> +	"Show chunks (block groups) layout",
> +	"Show chunks (block groups) layout for all devices",
> +	"",
> +	HELPINFO_UNITS_LONG,
> +	"--sort=MODE        sort by the physical or logical chunk start",
> +	"                   MODE is one of pstart or lstart (default: pstart)",
> +	"--usage            show usage per block group, note this can be slow",
> +	NULL
> +};
> +
> +enum {
> +	CHUNK_SORT_PSTART,
> +	CHUNK_SORT_LSTART,
> +	CHUNK_SORT_DEFAULT = CHUNK_SORT_PSTART
> +};
> +
> +struct dump_chunks_entry {
> +	u64 devid;
> +	u64 start;
> +	u64 lstart;
> +	u64 length;
> +	u64 flags;
> +	u64 age;
> +	u64 used;
> +	u32 pnumber;
> +};
> +
> +struct dump_chunks_ctx {
> +	unsigned length;
> +	unsigned size;
> +	struct dump_chunks_entry *stats;
> +};
> +
> +int cmp_cse_devid_start(const void *va, const void *vb)
> +{
> +	const struct dump_chunks_entry *a = va;
> +	const struct dump_chunks_entry *b = vb;
> +
> +	if (a->devid < b->devid)
> +		return -1;
> +	if (a->devid > b->devid)
> +		return 1;
> +
> +	if (a->start < b->start)
> +		return -1;
> +	if (a->start == b->start) {
> +		error(
> +	"chunks start on same offset in the same device: devid %llu start %llu",
> +		    (unsigned long long)a->devid, (unsigned long long)a->start);
> +		return 0;
> +	}
> +	return 1;
> +}
> +
> +int cmp_cse_devid_lstart(const void *va, const void *vb)
> +{
> +	const struct dump_chunks_entry *a = va;
> +	const struct dump_chunks_entry *b = vb;
> +
> +	if (a->devid < b->devid)
> +		return -1;
> +	if (a->devid > b->devid)
> +		return 1;
> +
> +	if (a->lstart < b->lstart)
> +		return -1;
> +	if (a->lstart == b->lstart) {
> +		error(
> +"chunks logically start on same offset in the same device: devid %llu start %llu",
> +		    (unsigned long long)a->devid, (unsigned long long)a->lstart);
> +		return 0;
> +	}
> +	return 1;
> +}
> +
> +void print_dump_chunks(struct dump_chunks_ctx *ctx, unsigned sort_mode,
> +		unsigned unit_mode, int with_usage)
> +{
> +	u64 devid;
> +	struct dump_chunks_entry e;
> +	int i;
> +	int chidx;
> +	u64 lastend = 0;
> +	u64 age;
> +
> +	/*
> +	 * Chunks are sorted logically as found by the ioctl, we need to sort
> +	 * them once to find the physical ordering. This is the default mode.
> +	 */
> +	qsort(ctx->stats, ctx->length, sizeof(ctx->stats[0]), cmp_cse_devid_start);
> +	devid = 0;
> +	age = 0;
> +	for (i = 0; i < ctx->length; i++) {
> +		e = ctx->stats[i];
> +		if (e.devid != devid) {
> +			devid = e.devid;
> +			age = 0;
> +		}
> +		ctx->stats[i].pnumber = age;
> +		age++;
> +	}
> +
> +	if (sort_mode == CHUNK_SORT_LSTART)
> +		qsort(ctx->stats, ctx->length, sizeof(ctx->stats[0]), cmp_cse_devid_lstart);
> +
> +	devid = 0;
> +	for (i = 0; i < ctx->length; i++) {
> +		e = ctx->stats[i];
> +		if (e.devid != devid) {
> +			devid = e.devid;
> +			if (i != 0)
> +				putchar('\n');
> +			printf("Chunks on device id: %llu\n", devid);
> +			printf("PNumber            Type        PStart        Length          PEnd     Age         LStart%s\n",
> +					with_usage ? "  Usage" : "");
> +			printf("----------------------------------------------------------------------------------------%s\n",
> +					with_usage ? "-------" : "");
> +			chidx = 0;
> +			lastend = 0;
> +		}
> +		if (sort_mode == CHUNK_SORT_PSTART && lastend > 0
> +		    && e.start != lastend) {
> +			printf("      .           empty             .  ");
> +			printf("%12s  ",
> +				pretty_size_mode(e.start - lastend, unit_mode));
> +			printf("           .       .              .\n");
> +		}
> +
> +		printf("%7u ", e.pnumber);
> +		printf("%8s/%-6s  ", btrfs_group_type_str(e.flags),
> +				btrfs_group_profile_str(e.flags));
> +		printf("%12s  ", pretty_size_mode(e.start, unit_mode));
> +		printf("%12s  ", pretty_size_mode(e.length, unit_mode));
> +		printf("%12s  ",
> +			pretty_size_mode(e.start + e.length - 1, unit_mode));
> +		printf("%6llu ", e.age);
> +		printf("%14s", pretty_size_mode(e.lstart, unit_mode));
> +		if (with_usage)
> +			printf("  %5.2f", (float)e.used / e.length * 100);
> +		printf("\n");
> +
> +		lastend = e.start + e.length;
> +		chidx++;
> +	}
> +}
> +
> +static u64 fill_usage(int fd, u64 lstart)
> +{
> +	struct btrfs_ioctl_search_args args;
> +	struct btrfs_ioctl_search_key *sk = &args.key;
> +	struct btrfs_ioctl_search_header sh;
> +	struct btrfs_block_group_item *item;
> +	int ret;
> +
> +	memset(&args, 0, sizeof(args));
> +	sk->tree_id = BTRFS_EXTENT_TREE_OBJECTID;
> +	sk->min_objectid = lstart;
> +	sk->max_objectid = lstart;
> +	sk->min_type = BTRFS_BLOCK_GROUP_ITEM_KEY;
> +	sk->max_type = BTRFS_BLOCK_GROUP_ITEM_KEY;
> +	sk->min_offset = 0;
> +	sk->max_offset = (u64)-1;
> +	sk->max_transid = (u64)-1;
> +
> +	sk->nr_items = 4096;

What if we set nr_items = 1?  From the code review, this can let us
stop and return immediately.

Thanks,

-liubo
> +	ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args);
> +	if (ret < 0) {
> +		error("cannot perform the search: %s", strerror(errno));
> +		return 1;
> +	}
> +	if (sk->nr_items == 0) {
> +		warning("blockgroup %llu not found",
> +				(unsigned long long)lstart);
> +		return 0;
> +	}
> +	if (sk->nr_items > 1) {
> +		warning("found more than one blockgroup %llu",
> +				(unsigned long long)lstart);
> +	}
> +
> +	memcpy(&sh, args.buf, sizeof(sh));
> +	item = (struct btrfs_block_group_item*)(args.buf + sizeof(sh));
> +
> +	return item->used;
> +}
> +
> +static int cmd_dump_chunks(int argc, char **argv)
> +{
> +	struct btrfs_ioctl_search_args args;
> +	struct btrfs_ioctl_search_key *sk = &args.key;
> +	struct btrfs_ioctl_search_header sh;
> +	unsigned long off = 0;
> +	u64 *age = 0;
> +	unsigned age_size = 128;
> +	int ret;
> +	int fd;
> +	int i;
> +	int e;
> +	DIR *dirstream = NULL;
> +	unsigned unit_mode;
> +	unsigned sort_mode = 0;
> +	int with_usage = 0;
> +	const char *path;
> +	struct dump_chunks_ctx ctx = {
> +		.length = 0,
> +		.size = 1024,
> +		.stats = NULL
> +	};
> +
> +	unit_mode = get_unit_mode_from_arg(&argc, argv, 0);
> +
> +	while (1) {
> +		int c;
> +		enum { GETOPT_VAL_SORT = 256, GETOPT_VAL_USAGE };
> +		static const struct option long_options[] = {
> +			{"sort", required_argument, NULL, GETOPT_VAL_SORT },
> +			{"usage", no_argument, NULL, GETOPT_VAL_USAGE },
> +			{NULL, 0, NULL, 0}
> +		};
> +
> +		c = getopt_long(argc, argv, "", long_options, NULL);
> +		if (c < 0)
> +			break;
> +
> +		switch (c) {
> +		case GETOPT_VAL_SORT:
> +			if (strcmp(optarg, "pstart") == 0) {
> +				sort_mode = CHUNK_SORT_PSTART;
> +			} else if (strcmp(optarg, "lstart") == 0) {
> +				sort_mode = CHUNK_SORT_LSTART;
> +			} else {
> +				error("unknown sort mode: %s", optarg);
> +				exit(1);
> +			}
> +			break;
> +		case GETOPT_VAL_USAGE:
> +			with_usage = 1;
> +			break;
> +		default:
> +			usage(cmd_dump_chunks_usage);
> +		}
> +	}
> +
> +	if (check_argc_exact(argc - optind, 1))
> +		usage(cmd_dump_chunks_usage);
> +
> +	ctx.stats = calloc(ctx.size, sizeof(ctx.stats[0]));
> +	if (!ctx.stats)
> +		goto out_nomem;
> +
> +	path = argv[optind];
> +
> +	fd = open_file_or_dir(path, &dirstream);
> +	if (fd < 0) {
> +	        error("cannot access '%s': %s", path, strerror(errno));
> +		return 1;
> +	}
> +
> +	memset(&args, 0, sizeof(args));
> +	sk->tree_id = BTRFS_CHUNK_TREE_OBJECTID;
> +	sk->min_objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
> +	sk->max_objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
> +	sk->min_type = BTRFS_CHUNK_ITEM_KEY;
> +	sk->max_type = BTRFS_CHUNK_ITEM_KEY;
> +	sk->max_offset = (u64)-1;
> +	sk->max_transid = (u64)-1;
> +	age = calloc(age_size, sizeof(u64));
> +	if (!age)
> +		goto out_nomem;
> +
> +	while (1) {
> +		sk->nr_items = 4096;
> +		ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args);
> +		e = errno;
> +		if (ret < 0) {
> +			error("cannot perform the search: %s", strerror(e));
> +			return 1;
> +		}
> +		if (sk->nr_items == 0)
> +			break;
> +
> +		off = 0;
> +		for (i = 0; i < sk->nr_items; i++) {
> +			struct btrfs_chunk *item;
> +			struct btrfs_stripe *stripes;
> +			int sidx;
> +			u64 used = (u64)-1;
> +
> +			memcpy(&sh, args.buf + off, sizeof(sh));
> +			off += sizeof(sh);
> +			item = (struct btrfs_chunk*)(args.buf + off);
> +			off += sh.len;
> +
> +			stripes = &item->stripe;
> +			for (sidx = 0; sidx < item->num_stripes; sidx++) {
> +				struct dump_chunks_entry *e;
> +				u64 devid;
> +
> +				e = &ctx.stats[ctx.length];
> +				devid = stripes[sidx].devid;
> +				e->devid = devid;
> +				e->start = stripes[sidx].offset;
> +				e->lstart = sh.offset;
> +				e->length = item->length;
> +				e->flags = item->type;
> +				e->pnumber = -1;
> +				while (devid > age_size) {
> +					u64 *tmp;
> +					unsigned old_size = age_size;
> +
> +					age_size += 128;
> +					tmp = calloc(age_size, sizeof(u64));
> +					if (!tmp) {
> +						free(age);
> +						goto out_nomem;
> +					}
> +					memcpy(tmp, age, sizeof(u64) * old_size);
> +					age = tmp;
> +				}
> +				e->age = age[devid]++;
> +				if (with_usage) {
> +					if (used == (u64)-1)
> +						used = fill_usage(fd, sh.offset);
> +					e->used = used;
> +				} else {
> +					e->used = 0;
> +				}
> +
> +				ctx.length++;
> +
> +				if (ctx.length == ctx.size) {
> +					ctx.size += 1024;
> +					ctx.stats = realloc(ctx.stats, ctx.size
> +						* sizeof(ctx.stats[0]));
> +					if (!ctx.stats)
> +						goto out_nomem;
> +				}
> +			}
> +
> +			sk->min_objectid = sh.objectid;
> +			sk->min_type = sh.type;
> +			sk->min_offset = sh.offset;
> +		}
> +		if (sk->min_offset < (u64)-1)
> +			sk->min_offset++;
> +		else
> +			break;
> +	}
> +
> +	print_dump_chunks(&ctx, sort_mode, unit_mode, with_usage);
> +	free(ctx.stats);
> +
> +	close_file_or_dir(fd, dirstream);
> +	return 0;
> +
> +out_nomem:
> +	error("not enough memory");
> +	return 1;
> +}
> +
>  static const char inspect_cmd_group_info[] =
>  "query various internal information";
>  
> @@ -644,6 +1006,8 @@ const struct cmd_group inspect_cmd_group = {
>  				cmd_inspect_dump_super_usage, NULL, 0 },
>  		{ "tree-stats", cmd_inspect_tree_stats,
>  				cmd_inspect_tree_stats_usage, NULL, 0 },
> +		{ "dump-chunks", cmd_dump_chunks, cmd_dump_chunks_usage, NULL,
> +			0 },
>  		NULL_CMD_STRUCT
>  	}
>  };
> -- 
> 2.7.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Sterba June 23, 2016, 12:43 p.m. UTC | #5
On Wed, Jun 22, 2016 at 06:53:52PM -0700, Liu Bo wrote:
> > +static u64 fill_usage(int fd, u64 lstart)
> > +{
> > +	struct btrfs_ioctl_search_args args;
> > +	struct btrfs_ioctl_search_key *sk = &args.key;
> > +	struct btrfs_ioctl_search_header sh;
> > +	struct btrfs_block_group_item *item;
> > +	int ret;
> > +
> > +	memset(&args, 0, sizeof(args));
> > +	sk->tree_id = BTRFS_EXTENT_TREE_OBJECTID;
> > +	sk->min_objectid = lstart;
> > +	sk->max_objectid = lstart;
> > +	sk->min_type = BTRFS_BLOCK_GROUP_ITEM_KEY;
> > +	sk->max_type = BTRFS_BLOCK_GROUP_ITEM_KEY;
> > +	sk->min_offset = 0;
> > +	sk->max_offset = (u64)-1;
> > +	sk->max_transid = (u64)-1;
> > +
> > +	sk->nr_items = 4096;
> 
> What if we set nr_items = 1?  From the code review, this can let us
> stop and return immediately.

Indeed, I've copy-pasted the search code. When it's 1 it returns
immediatelly if the metadata are cached, cold read took a few seconds on
a terabyte-sized filesystem.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Sterba June 23, 2016, 1:07 p.m. UTC | #6
On Thu, Jun 23, 2016 at 09:20:40AM +0800, Qu Wenruo wrote:
> Yes, quite useful.
> I don't ever need to manually check btrfs-debug-tree output to figure 
> out the dev-extent layout.
> 
> But according to the output, it's more like dev-extent dump, not chunk dump.
> 
> It's better to make things more clear for objects at different logical 
> level.

Agreed, therefore I'd create 2 commands, one for dumping the logical
chunks and another for the dev-extents (which is unfortunatelly not as
short for a command name).

> > Example output:
> >
> > Chunks on device id: 1
> > PNumber            Type        PStart        Length          PEnd     Age         LStart  Usage
> > -----------------------------------------------------------------------------------------------
> >       0   System/RAID1        1.00MiB      32.00MiB      33.00MiB      47        1.40TiB   0.06
> 
> [Mixed chunk/dev-extent output]
> For "PStart/Length/Pend/Lstart" that's OK.
> (Although I prefer "Plength" to "Length", as it's a little confusing for 
> chunk logical length and dev_extent physical length)

This should be sorted once the commands are split. Using the 'P' or 'L'
prefixes for clarity is a good thing.

> Since "pstart" = key.offset, "length" = dev_extent_length, "Pend" = 
> pstart + length and "Lstart" = dev_extent_chunk_offset, these values are 
> all straightforward.
> 
> But for Usage, it's completely a chunk/block group level thing, so 
> personally speaking it's better not to output per chunk usage in 
> dev-extent dump

Agreed.

> If there is real "chunk dump", then that's the best place to output 
> per-chunk "usage".
> 
> For dev-extent level usage, it should be the usage of a device, not per 
> chunk.
> 
> ["Age" calculation]
> Another concern is about "age", since there is no generation in 
> chunk_item nor dev_extent, so you're using logical bytenr as a hint.
> 
> This method implies the fact that btrfs will always allocates new chunk 
> with increasing logical bytenr, never reuse lower logical bytenr.
> 
> Such behavior provides the basis for a lot of other btrfs functions, 
> like balance/qgroup rescan.
> IMHO it won't be changed in any time soon. But I'm still a little 
> concerned about using such assumption.
> 
> 
> Since "age" here is more like a sorting index, not the generation when 
> the chunk is created.
> It may not really reflect how long one chunk really lived.
> 
> It maybe completely OK for most people, but at least it's not as 
> straightforward as generation.

Yes, I'll change it to LNumber.

> [Support for offline btrfs]
> BTW, it's also a good idea to support not only mounted fs, but also 
> offlined one.
> 
> As for offlined one, no need to bother about the tree search ioctl, just 
> straightforward search_slots().
> (That's why I like offline btrfs tools)

Yeah, the offline mode would be useful and not only for this command, so
this should be sorted first on the whole tool level so we don't stitch
it to each command separately.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Sterba June 23, 2016, 1:13 p.m. UTC | #7
On Thu, Jun 23, 2016 at 12:20:38AM +0200, Hans van Kranenburg wrote:
> > Printing 'usage' is not default as it's quite slow, it uses the search ioctl
> > and probably not in the best way, or there's some other issue in the
> > implementation.
> 
> Interesting.
> 
> So after reading this, I wrote a little test to test some scenarios:
> 
> https://github.com/knorrie/python-btrfs/commit/1ca99880dfa0e14b148f3d9e2b6b381b781eb52d
> 
> It's very clear that the most optimal way of doing this search is to 
> have nr_items=1 and if possible, specify the length in offset.

And that solved it.

[...]

> It seems that searching in the empty space between
>   (vaddr BLOCK_GROUP_ITEM length+1) and
>   (vaddr BLOCK_GROUP_ITEM ULLONG_MAX)
> is really expensive, while there's absolutely nothing to find.

Yeah, the few BLOCK_GROUP_ITEMs are scattered among tons of EXTENT_ITEMs 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hans van Kranenburg June 23, 2016, 1:17 p.m. UTC | #8
On 06/23/2016 03:13 PM, David Sterba wrote:
> On Thu, Jun 23, 2016 at 12:20:38AM +0200, Hans van Kranenburg wrote:
>>> Printing 'usage' is not default as it's quite slow, it uses the search ioctl
>>> and probably not in the best way, or there's some other issue in the
>>> implementation.
>>
>> Interesting.
>>
>> So after reading this, I wrote a little test to test some scenarios:
>>
>> https://github.com/knorrie/python-btrfs/commit/1ca99880dfa0e14b148f3d9e2b6b381b781eb52d
>>
>> It's very clear that the most optimal way of doing this search is to
>> have nr_items=1 and if possible, specify the length in offset.
>
> And that solved it.
>
> [...]
>
>> It seems that searching in the empty space between
>>    (vaddr BLOCK_GROUP_ITEM length+1) and
>>    (vaddr BLOCK_GROUP_ITEM ULLONG_MAX)
>> is really expensive, while there's absolutely nothing to find.
>
> Yeah, the few BLOCK_GROUP_ITEMs are scattered among tons of EXTENT_ITEMs

On the same vaddr objectid, there should only be at most one extent, and 
EXTENT_ITEM_KEY < BLOCK_GROUP_ITEM_KEY, so the space after the block 
group item should always contain exactly 0 items?

Still it takes very long...
David Sterba June 23, 2016, 1:27 p.m. UTC | #9
On Thu, Jun 23, 2016 at 03:10:09AM +0200, Hans van Kranenburg wrote:
> On 06/22/2016 07:26 PM, David Sterba wrote:
> > From: David Sterba <dsterba@suse.cz>
> >
> > Hi,
> >
> > the chunk dump is a useful thing, for debugging or balance filters.
> >
> > Example output:
> >
> > Chunks on device id: 1
> > PNumber            Type        PStart        Length          PEnd     Age         LStart  Usage
> > -----------------------------------------------------------------------------------------------
> >       0   System/RAID1        1.00MiB      32.00MiB      33.00MiB      47        1.40TiB   0.06
> >       1 Metadata/RAID1       33.00MiB       1.00GiB       1.03GiB      31        1.36TiB  64.03
> >       2 Metadata/RAID1        1.03GiB      32.00MiB       1.06GiB      36        1.36TiB  77.28
> >       3     Data/single       1.06GiB       1.00GiB       2.06GiB      12      422.30GiB  78.90
> >       4     Data/single       2.06GiB       1.00GiB       3.06GiB      11      420.30GiB  78.47
> > ...
> 
> On RAID0, it looks funny :)
> 
> # ./btrfs inspect-internal dump-chunks /mnt/extents/
> Chunks on device id: 1
> PNumber            Type        PStart        Length          PEnd Age         LStart
> ----------------------------------------------------------------------------------------
>        0 Metadata/RAID0        1.00MiB     256.00MiB     257.00MiB 1       13.36GiB
>        .           empty             .      16.00EiB             .  .              .
>        1   System/RAID0      129.00MiB      64.00MiB     193.00MiB 2       13.61GiB
>        .           empty             .      16.00EiB             .  .              .
>        2     Data/RAID0      161.00MiB       2.00GiB       2.16GiB 3       15.68GiB
>        .           empty             .     115.00MiB             .  .              .
>        3     Data/RAID0        2.27GiB       2.00GiB       4.27GiB 0       11.36GiB
> 
> Chunks on device id: 2
> PNumber            Type        PStart        Length          PEnd 
> Age         LStart
> ----------------------------------------------------------------------------------------
>        0     Data/RAID0        1.00MiB       2.00GiB       2.00GiB 0       11.36GiB
>        .           empty             .      16.00EiB             .  .              .
>        1 Metadata/RAID0        1.00GiB     256.00MiB       1.25GiB 1       13.36GiB
>        .           empty             .      16.00EiB             .  .              .
>        2   System/RAID0        1.13GiB      64.00MiB       1.19GiB 2       13.61GiB
>        .           empty             .      16.00EiB             .  .              .
>        3     Data/RAID0        1.16GiB       2.00GiB       3.16GiB 3       15.68GiB
> 
> For the correct sizes, it would be needed to either do some math, like 
> if RAID0 then divide by 2, but I guess this gets horrible really soon 
> with RAID5 etc... Or, if you want to print everything in the physical 
> order, why not just dump the device tree with DEV_EXTENT which has the 
> exact length on disk, and fill in the missing information using the 
> reference to the chunk they belong to?

That's a good idea and would avoid the guesswork that makes the current
numbers funny.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/cmds-inspect.c b/cmds-inspect.c
index dd7b9dd278f2..4eace63d8517 100644
--- a/cmds-inspect.c
+++ b/cmds-inspect.c
@@ -623,6 +623,368 @@  static int cmd_inspect_min_dev_size(int argc, char **argv)
 	return !!ret;
 }
 
+static const char * const cmd_dump_chunks_usage[] = {
+	"btrfs inspect-internal chunk-stats [options] <path>",
+	"Show chunks (block groups) layout",
+	"Show chunks (block groups) layout for all devices",
+	"",
+	HELPINFO_UNITS_LONG,
+	"--sort=MODE        sort by the physical or logical chunk start",
+	"                   MODE is one of pstart or lstart (default: pstart)",
+	"--usage            show usage per block group, note this can be slow",
+	NULL
+};
+
+enum {
+	CHUNK_SORT_PSTART,
+	CHUNK_SORT_LSTART,
+	CHUNK_SORT_DEFAULT = CHUNK_SORT_PSTART
+};
+
+struct dump_chunks_entry {
+	u64 devid;
+	u64 start;
+	u64 lstart;
+	u64 length;
+	u64 flags;
+	u64 age;
+	u64 used;
+	u32 pnumber;
+};
+
+struct dump_chunks_ctx {
+	unsigned length;
+	unsigned size;
+	struct dump_chunks_entry *stats;
+};
+
+int cmp_cse_devid_start(const void *va, const void *vb)
+{
+	const struct dump_chunks_entry *a = va;
+	const struct dump_chunks_entry *b = vb;
+
+	if (a->devid < b->devid)
+		return -1;
+	if (a->devid > b->devid)
+		return 1;
+
+	if (a->start < b->start)
+		return -1;
+	if (a->start == b->start) {
+		error(
+	"chunks start on same offset in the same device: devid %llu start %llu",
+		    (unsigned long long)a->devid, (unsigned long long)a->start);
+		return 0;
+	}
+	return 1;
+}
+
+int cmp_cse_devid_lstart(const void *va, const void *vb)
+{
+	const struct dump_chunks_entry *a = va;
+	const struct dump_chunks_entry *b = vb;
+
+	if (a->devid < b->devid)
+		return -1;
+	if (a->devid > b->devid)
+		return 1;
+
+	if (a->lstart < b->lstart)
+		return -1;
+	if (a->lstart == b->lstart) {
+		error(
+"chunks logically start on same offset in the same device: devid %llu start %llu",
+		    (unsigned long long)a->devid, (unsigned long long)a->lstart);
+		return 0;
+	}
+	return 1;
+}
+
+void print_dump_chunks(struct dump_chunks_ctx *ctx, unsigned sort_mode,
+		unsigned unit_mode, int with_usage)
+{
+	u64 devid;
+	struct dump_chunks_entry e;
+	int i;
+	int chidx;
+	u64 lastend = 0;
+	u64 age;
+
+	/*
+	 * Chunks are sorted logically as found by the ioctl, we need to sort
+	 * them once to find the physical ordering. This is the default mode.
+	 */
+	qsort(ctx->stats, ctx->length, sizeof(ctx->stats[0]), cmp_cse_devid_start);
+	devid = 0;
+	age = 0;
+	for (i = 0; i < ctx->length; i++) {
+		e = ctx->stats[i];
+		if (e.devid != devid) {
+			devid = e.devid;
+			age = 0;
+		}
+		ctx->stats[i].pnumber = age;
+		age++;
+	}
+
+	if (sort_mode == CHUNK_SORT_LSTART)
+		qsort(ctx->stats, ctx->length, sizeof(ctx->stats[0]), cmp_cse_devid_lstart);
+
+	devid = 0;
+	for (i = 0; i < ctx->length; i++) {
+		e = ctx->stats[i];
+		if (e.devid != devid) {
+			devid = e.devid;
+			if (i != 0)
+				putchar('\n');
+			printf("Chunks on device id: %llu\n", devid);
+			printf("PNumber            Type        PStart        Length          PEnd     Age         LStart%s\n",
+					with_usage ? "  Usage" : "");
+			printf("----------------------------------------------------------------------------------------%s\n",
+					with_usage ? "-------" : "");
+			chidx = 0;
+			lastend = 0;
+		}
+		if (sort_mode == CHUNK_SORT_PSTART && lastend > 0
+		    && e.start != lastend) {
+			printf("      .           empty             .  ");
+			printf("%12s  ",
+				pretty_size_mode(e.start - lastend, unit_mode));
+			printf("           .       .              .\n");
+		}
+
+		printf("%7u ", e.pnumber);
+		printf("%8s/%-6s  ", btrfs_group_type_str(e.flags),
+				btrfs_group_profile_str(e.flags));
+		printf("%12s  ", pretty_size_mode(e.start, unit_mode));
+		printf("%12s  ", pretty_size_mode(e.length, unit_mode));
+		printf("%12s  ",
+			pretty_size_mode(e.start + e.length - 1, unit_mode));
+		printf("%6llu ", e.age);
+		printf("%14s", pretty_size_mode(e.lstart, unit_mode));
+		if (with_usage)
+			printf("  %5.2f", (float)e.used / e.length * 100);
+		printf("\n");
+
+		lastend = e.start + e.length;
+		chidx++;
+	}
+}
+
+static u64 fill_usage(int fd, u64 lstart)
+{
+	struct btrfs_ioctl_search_args args;
+	struct btrfs_ioctl_search_key *sk = &args.key;
+	struct btrfs_ioctl_search_header sh;
+	struct btrfs_block_group_item *item;
+	int ret;
+
+	memset(&args, 0, sizeof(args));
+	sk->tree_id = BTRFS_EXTENT_TREE_OBJECTID;
+	sk->min_objectid = lstart;
+	sk->max_objectid = lstart;
+	sk->min_type = BTRFS_BLOCK_GROUP_ITEM_KEY;
+	sk->max_type = BTRFS_BLOCK_GROUP_ITEM_KEY;
+	sk->min_offset = 0;
+	sk->max_offset = (u64)-1;
+	sk->max_transid = (u64)-1;
+
+	sk->nr_items = 4096;
+	ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args);
+	if (ret < 0) {
+		error("cannot perform the search: %s", strerror(errno));
+		return 1;
+	}
+	if (sk->nr_items == 0) {
+		warning("blockgroup %llu not found",
+				(unsigned long long)lstart);
+		return 0;
+	}
+	if (sk->nr_items > 1) {
+		warning("found more than one blockgroup %llu",
+				(unsigned long long)lstart);
+	}
+
+	memcpy(&sh, args.buf, sizeof(sh));
+	item = (struct btrfs_block_group_item*)(args.buf + sizeof(sh));
+
+	return item->used;
+}
+
+static int cmd_dump_chunks(int argc, char **argv)
+{
+	struct btrfs_ioctl_search_args args;
+	struct btrfs_ioctl_search_key *sk = &args.key;
+	struct btrfs_ioctl_search_header sh;
+	unsigned long off = 0;
+	u64 *age = 0;
+	unsigned age_size = 128;
+	int ret;
+	int fd;
+	int i;
+	int e;
+	DIR *dirstream = NULL;
+	unsigned unit_mode;
+	unsigned sort_mode = 0;
+	int with_usage = 0;
+	const char *path;
+	struct dump_chunks_ctx ctx = {
+		.length = 0,
+		.size = 1024,
+		.stats = NULL
+	};
+
+	unit_mode = get_unit_mode_from_arg(&argc, argv, 0);
+
+	while (1) {
+		int c;
+		enum { GETOPT_VAL_SORT = 256, GETOPT_VAL_USAGE };
+		static const struct option long_options[] = {
+			{"sort", required_argument, NULL, GETOPT_VAL_SORT },
+			{"usage", no_argument, NULL, GETOPT_VAL_USAGE },
+			{NULL, 0, NULL, 0}
+		};
+
+		c = getopt_long(argc, argv, "", long_options, NULL);
+		if (c < 0)
+			break;
+
+		switch (c) {
+		case GETOPT_VAL_SORT:
+			if (strcmp(optarg, "pstart") == 0) {
+				sort_mode = CHUNK_SORT_PSTART;
+			} else if (strcmp(optarg, "lstart") == 0) {
+				sort_mode = CHUNK_SORT_LSTART;
+			} else {
+				error("unknown sort mode: %s", optarg);
+				exit(1);
+			}
+			break;
+		case GETOPT_VAL_USAGE:
+			with_usage = 1;
+			break;
+		default:
+			usage(cmd_dump_chunks_usage);
+		}
+	}
+
+	if (check_argc_exact(argc - optind, 1))
+		usage(cmd_dump_chunks_usage);
+
+	ctx.stats = calloc(ctx.size, sizeof(ctx.stats[0]));
+	if (!ctx.stats)
+		goto out_nomem;
+
+	path = argv[optind];
+
+	fd = open_file_or_dir(path, &dirstream);
+	if (fd < 0) {
+	        error("cannot access '%s': %s", path, strerror(errno));
+		return 1;
+	}
+
+	memset(&args, 0, sizeof(args));
+	sk->tree_id = BTRFS_CHUNK_TREE_OBJECTID;
+	sk->min_objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
+	sk->max_objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
+	sk->min_type = BTRFS_CHUNK_ITEM_KEY;
+	sk->max_type = BTRFS_CHUNK_ITEM_KEY;
+	sk->max_offset = (u64)-1;
+	sk->max_transid = (u64)-1;
+	age = calloc(age_size, sizeof(u64));
+	if (!age)
+		goto out_nomem;
+
+	while (1) {
+		sk->nr_items = 4096;
+		ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args);
+		e = errno;
+		if (ret < 0) {
+			error("cannot perform the search: %s", strerror(e));
+			return 1;
+		}
+		if (sk->nr_items == 0)
+			break;
+
+		off = 0;
+		for (i = 0; i < sk->nr_items; i++) {
+			struct btrfs_chunk *item;
+			struct btrfs_stripe *stripes;
+			int sidx;
+			u64 used = (u64)-1;
+
+			memcpy(&sh, args.buf + off, sizeof(sh));
+			off += sizeof(sh);
+			item = (struct btrfs_chunk*)(args.buf + off);
+			off += sh.len;
+
+			stripes = &item->stripe;
+			for (sidx = 0; sidx < item->num_stripes; sidx++) {
+				struct dump_chunks_entry *e;
+				u64 devid;
+
+				e = &ctx.stats[ctx.length];
+				devid = stripes[sidx].devid;
+				e->devid = devid;
+				e->start = stripes[sidx].offset;
+				e->lstart = sh.offset;
+				e->length = item->length;
+				e->flags = item->type;
+				e->pnumber = -1;
+				while (devid > age_size) {
+					u64 *tmp;
+					unsigned old_size = age_size;
+
+					age_size += 128;
+					tmp = calloc(age_size, sizeof(u64));
+					if (!tmp) {
+						free(age);
+						goto out_nomem;
+					}
+					memcpy(tmp, age, sizeof(u64) * old_size);
+					age = tmp;
+				}
+				e->age = age[devid]++;
+				if (with_usage) {
+					if (used == (u64)-1)
+						used = fill_usage(fd, sh.offset);
+					e->used = used;
+				} else {
+					e->used = 0;
+				}
+
+				ctx.length++;
+
+				if (ctx.length == ctx.size) {
+					ctx.size += 1024;
+					ctx.stats = realloc(ctx.stats, ctx.size
+						* sizeof(ctx.stats[0]));
+					if (!ctx.stats)
+						goto out_nomem;
+				}
+			}
+
+			sk->min_objectid = sh.objectid;
+			sk->min_type = sh.type;
+			sk->min_offset = sh.offset;
+		}
+		if (sk->min_offset < (u64)-1)
+			sk->min_offset++;
+		else
+			break;
+	}
+
+	print_dump_chunks(&ctx, sort_mode, unit_mode, with_usage);
+	free(ctx.stats);
+
+	close_file_or_dir(fd, dirstream);
+	return 0;
+
+out_nomem:
+	error("not enough memory");
+	return 1;
+}
+
 static const char inspect_cmd_group_info[] =
 "query various internal information";
 
@@ -644,6 +1006,8 @@  const struct cmd_group inspect_cmd_group = {
 				cmd_inspect_dump_super_usage, NULL, 0 },
 		{ "tree-stats", cmd_inspect_tree_stats,
 				cmd_inspect_tree_stats_usage, NULL, 0 },
+		{ "dump-chunks", cmd_dump_chunks, cmd_dump_chunks_usage, NULL,
+			0 },
 		NULL_CMD_STRUCT
 	}
 };