Btrfs-progs: add check-only option for balance
diff mbox

Message ID 1465508801-14163-1-git-send-email-ashish.samant@oracle.com
State New
Headers show

Commit Message

Ashish Samant June 9, 2016, 9:46 p.m. UTC
From: Liu Bo <bo.li.lu@oracle.com>

This aims to decide whether a balance can reduce the number of
data block groups and if it can, this shows the '-dvrange' block
group's objectid.

With this, you can run
'btrfs balance start -c mnt' or 'btrfs balance start --check-only mnt'

 --------------------------------------------------------------
$ btrfs balance start -c /mnt/btrfs
Checking data block groups...
block_group        12582912 (len     8388608 used      786432)
block_group      1103101952 (len  1073741824 used   536870912)
block_group      2176843776 (len  1073741824 used  1073741824)
total bgs 3 total_free 544473088 min_used bg 12582912 has (min_used 786432 free 7602176)
run 'btrfs balance start -dvrange=12582912..12582913 your_mnt'

$ btrfs balance start -dvrange=12582912..12582913 /mnt/btrfs
Done, had to relocate 1 out of 5 chunks

$ btrfs balance start -c /mnt/btrfs
Checking data block groups...
block_group      1103101952 (len  1073741824 used   537395200)
block_group      2176843776 (len  1073741824 used  1073741824)
total bgs 2 total_free 536346624 min_used bg 1103101952 has (min_used 537395200 free 536346624)
 --------------------------------------------------------------

So you now know how to babysit your btrfs in a smart way.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
---
 cmds-balance.c |  127 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 126 insertions(+), 1 deletions(-)

Comments

Goffredo Baroncelli June 10, 2016, 5:57 p.m. UTC | #1
Hi all,

On 2016-06-09 23:46, Ashish Samant wrote:
> From: Liu Bo <bo.li.lu@oracle.com>
> 
> This aims to decide whether a balance can reduce the number of
> data block groups and if it can, this shows the '-dvrange' block
> group's objectid.
> 
> With this, you can run
> 'btrfs balance start -c mnt' or 'btrfs balance start --check-only mnt'
> 
>  --------------------------------------------------------------
> $ btrfs balance start -c /mnt/btrfs
> Checking data block groups...
> block_group        12582912 (len     8388608 used      786432)
> block_group      1103101952 (len  1073741824 used   536870912)
> block_group      2176843776 (len  1073741824 used  1073741824)
> total bgs 3 total_free 544473088 min_used bg 12582912 has (min_used 786432 free 7602176)
> run 'btrfs balance start -dvrange=12582912..12582913 your_mnt'
> 
> $ btrfs balance start -dvrange=12582912..12582913 /mnt/btrfs
> Done, had to relocate 1 out of 5 chunks
> 
> $ btrfs balance start -c /mnt/btrfs
> Checking data block groups...
> block_group      1103101952 (len  1073741824 used   537395200)
> block_group      2176843776 (len  1073741824 used  1073741824)
> total bgs 2 total_free 536346624 min_used bg 1103101952 has (min_used 537395200 free 536346624)
>  --------------------------------------------------------------
> 
> So you now know how to babysit your btrfs in a smart way.

I think that it is an excellent tool. However I have some suggestions, most of these are from an user interface POV:

1) this should be a real command; it doesn't make sense at all that this command is a "sub command" of "btrfs bal start". I have two suggestion about that:
a) we could add a new sub-command to the "balance" family. Something like "btrfs bal analisys", where we could put some suggestions for a good balance
b) we could add a new sub-command to the "inspect" family. We could also add some feature like showing other block_gruop (system and metadata), and print their profile: i.e.

  # btrfs inspect block-group-analisys /
  Type           Mode              Start         Len        Used
  Data           single      83806388224     1.00GiB   945.64MiB
  Data           single      84880130048     1.00GiB   890.60MiB
  Data           single      85953871872     1.00GiB   818.18MiB
  Data           single      87027613696     1.00GiB   835.58MiB
  Data           single      88101355520     1.00GiB  1023.91MiB
  System         single      89175097344    32.00MiB    16.00KiB
  Metadata       single      89208651776     1.00GiB   614.88MiB
  [...]

further options could be added like showing only the most empty chunks, sorting by the Used value, filtering by type and/or profile....

2) From a readability POV, I suggest to use the pretty_size() function to display a more readable "len" and "used".
3) For the same reason, I suggest to switch to a "tabular" format, like my example: it doesn't make sense to write for every line "block_group/len/used"...
4) when the usual balance command fails because ENOSPACE, we could suggest to use this new command....

more notes below

BR
G.Baroncelli

> 
> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
> ---
>  cmds-balance.c |  127 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 files changed, 126 insertions(+), 1 deletions(-)
> 
> diff --git a/cmds-balance.c b/cmds-balance.c
> index 8f3bf5b..e2aab6c 100644
> --- a/cmds-balance.c
> +++ b/cmds-balance.c
> @@ -493,6 +493,116 @@ out:
>  	return ret;
>  }
>  
> +/* return 0 if balance can remove a data block group, otherwise return 1 */
> +static int search_data_bgs(const char *path)
> +{
> +	struct btrfs_ioctl_search_args args;
> +	struct btrfs_ioctl_search_key *sk;
> +	struct btrfs_ioctl_search_header *header;
> +	struct btrfs_block_group_item *bg;
> +	unsigned long off = 0;
> +	DIR *dirstream = NULL;
> +	int e;
> +	int fd;
> +	int i;
> +	u64 total_free = 0;
> +	u64 min_used = (u64)-1;
> +	u64 free_of_min_used = 0;
> +	u64 bg_of_min_used = 0;
> +	u64 flags;
> +	u64 used;
> +	int ret = 0;
> +	int nr_data_bgs = 0;
> +
> +	fd = btrfs_open_dir(path, &dirstream, 1);
> +	if (fd < 0)
> +		return 1;
> +
> +	memset(&args, 0, sizeof(args));
> +	sk = &args.key;
> +
> +	sk->tree_id = BTRFS_EXTENT_TREE_OBJECTID;
> +	sk->min_objectid = sk->min_offset = sk->min_transid = 0;
> +	sk->max_objectid = sk->max_offset = sk->max_transid = (u64)-1;
> +	sk->max_type = sk->min_type = BTRFS_BLOCK_GROUP_ITEM_KEY;
> +	sk->nr_items = 65536;
> +
> +	while (1) {
> +		ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args);
> +		e = errno;
> +		if (ret < 0) {
> +			fprintf(stderr, "ret %d error '%s'\n", ret,
> +				strerror(e));
> +			return ret;
> +		}
> +		/*
> +		 * it should not happen.
> +		 */
> +		if (sk->nr_items == 0)
> +			break;
> +
> +		off = 0;
> +		for (i = 0; i < sk->nr_items; i++) {
> +			header = (struct btrfs_ioctl_search_header *)(args.buf
> +								      + off);
> +
> +			off += sizeof(*header);
> +			if (header->type == BTRFS_BLOCK_GROUP_ITEM_KEY) {
> +				bg = (struct btrfs_block_group_item *)
> +					(args.buf + off);
> +				flags = btrfs_block_group_flags(bg);
> +				if (flags & BTRFS_BLOCK_GROUP_DATA) {
> +					nr_data_bgs++;
> +					used = btrfs_block_group_used(bg);
> +					printf(
> +					"block_group %15llu (len %11llu used %11llu)\n",
> +							header->objectid,
> +							header->offset, used);

                                                        ^^^^^^^^^^^^^^^^^^^^^
Use pretty_size() around "header->offset" and "used"...

> +					total_free += header->offset - used;
> +					if (min_used >= used) {

I think that it would be better to track the chunk with the max space free that we can reallocate: the chunk from which we should start must be selected from the ones that have the properties:
	used <= total_free
and this chunk must have
	max(size - used)
This to free the max space with one shot
	
	

> +						min_used = used;
> +						free_of_min_used =
> +							header->offset - used;
> +						bg_of_min_used =
> +							header->objectid;
> +					}
> +				}
> +			}
> +
> +			off += header->len;
> +			sk->min_objectid = header->objectid;
> +			sk->min_type = header->type;
> +			sk->min_offset = header->offset;
> +		}
> +		sk->nr_items = 65536;
> +
> +		if (sk->min_objectid < sk->max_objectid)
> +			sk->min_objectid += 1;
> +		else
> +			break;
> +	}
> +
> +	if (nr_data_bgs <= 1) {
> +		printf("Data block groups in fs = %d, no need to do balance.\n",
> +				nr_data_bgs);
> +		return 1;
> +	}
> +
> +	printf("total bgs %d total_free %llu min_used bg %llu has (min_used %llu free %llu)\n",
> +			nr_data_bgs, total_free, bg_of_min_used, min_used,
> +			free_of_min_used);
> +	if (total_free - free_of_min_used > min_used) {
> +		printf("run 'btrfs balance start -dvrange=%llu..%llu <mountpoint>'\n",
> +				bg_of_min_used, bg_of_min_used + 1);
> +		ret = 0;
> +	} else {
> +		printf("Please don't balance data block groups, it won't work.\n");

We should add the reason because the balance will not fail

"WARNING: don't balance data block groups; it won't work, because there is no enough space free".


> +		ret = 1;
> +	}
> +
> +	return ret;
> +}
> +
>  static const char * const cmd_balance_start_usage[] = {
>  	"btrfs balance start [options] <path>",
>  	"Balance chunks across the devices",
> @@ -508,6 +618,7 @@ static const char * const cmd_balance_start_usage[] = {
>  	"-m[filters]    act on metadata chunks",
>  	"-s[filters]    act on system chunks (only under -f)",
>  	"-v             be verbose",
> +	"-c             only check if balance would make sense, not doing real job",
>  	"-f             force reducing of metadata integrity",
>  	"--full-balance do not print warning and do not delay start",
>  	NULL
> @@ -521,6 +632,7 @@ static int cmd_balance_start(int argc, char **argv)
>  	int force = 0;
>  	int verbose = 0;
>  	unsigned start_flags = 0;
> +	int check_data_bgs = 0;
>  	int i;
>  
>  	memset(&args, 0, sizeof(args));
> @@ -534,12 +646,14 @@ static int cmd_balance_start(int argc, char **argv)
>  			{ "system", optional_argument, NULL, 's' },
>  			{ "force", no_argument, NULL, 'f' },
>  			{ "verbose", no_argument, NULL, 'v' },
> +			{ "check-only", no_argument, NULL, 'c' },
>  			{ "full-balance", no_argument, NULL,
>  				GETOPT_VAL_FULL_BALANCE },
>  			{ NULL, 0, NULL, 0 }
>  		};
>  
> -		int opt = getopt_long(argc, argv, "d::s::m::fv", longopts, NULL);
> +		int opt = getopt_long(argc, argv, "d::s::m::fvc", longopts,
> +				      NULL);
>  		if (opt < 0)
>  			break;
>  
> @@ -571,6 +685,9 @@ static int cmd_balance_start(int argc, char **argv)
>  		case 'v':
>  			verbose = 1;
>  			break;
> +		case 'c':
> +			check_data_bgs = 1;
> +			break;
>  		case GETOPT_VAL_FULL_BALANCE:
>  			start_flags |= BALANCE_START_NOWARN;
>  			break;
> @@ -582,6 +699,14 @@ static int cmd_balance_start(int argc, char **argv)
>  	if (check_argc_exact(argc - optind, 1))
>  		usage(cmd_balance_start_usage);
>  
> +	if (check_data_bgs) {
> +		if (verbose)
> +			dump_ioctl_balance_args(&args);
> +
> +		printf("Checking data block groups...\n");
> +		return search_data_bgs(argv[optind]);
> +	}
> +
>  	/*
>  	 * allow -s only under --force, otherwise do with system chunks
>  	 * the same thing we were ordered to do with meta chunks
>
Hans van Kranenburg June 10, 2016, 8:47 p.m. UTC | #2
Hi,

Correct me if I'm wrong,

On 06/09/2016 11:46 PM, Ashish Samant wrote:
> +/* return 0 if balance can remove a data block group, otherwise return 1 */
> +static int search_data_bgs(const char *path)
> +{
> +	struct btrfs_ioctl_search_args args;
> +	struct btrfs_ioctl_search_key *sk;
> +	struct btrfs_ioctl_search_header *header;
> +	struct btrfs_block_group_item *bg;
> +	unsigned long off = 0;
> +	DIR *dirstream = NULL;
> +	int e;
> +	int fd;
> +	int i;
> +	u64 total_free = 0;
> +	u64 min_used = (u64)-1;
> +	u64 free_of_min_used = 0;
> +	u64 bg_of_min_used = 0;
> +	u64 flags;
> +	u64 used;
> +	int ret = 0;
> +	int nr_data_bgs = 0;
> +
> +	fd = btrfs_open_dir(path, &dirstream, 1);
> +	if (fd < 0)
> +		return 1;
> +
> +	memset(&args, 0, sizeof(args));
> +	sk = &args.key;
> +
> +	sk->tree_id = BTRFS_EXTENT_TREE_OBJECTID;
> +	sk->min_objectid = sk->min_offset = sk->min_transid = 0;
> +	sk->max_objectid = sk->max_offset = sk->max_transid = (u64)-1;
> +	sk->max_type = sk->min_type = BTRFS_BLOCK_GROUP_ITEM_KEY;
> +	sk->nr_items = 65536;

This search returns not only block group information, but also 
everything else. You're first retrieving the complete extent tree to 
userspace, in buffers...

> +
> +	while (1) {
> +		ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args);
> +		e = errno;
> +		if (ret < 0) {
> +			fprintf(stderr, "ret %d error '%s'\n", ret,
> +				strerror(e));
> +			return ret;
> +		}
> +		/*
> +		 * it should not happen.
> +		 */
> +		if (sk->nr_items == 0)
> +			break;
> +
> +		off = 0;
> +		for (i = 0; i < sk->nr_items; i++) {
> +			header = (struct btrfs_ioctl_search_header *)(args.buf
> +								      + off);
> +
> +			off += sizeof(*header);
> +			if (header->type == BTRFS_BLOCK_GROUP_ITEM_KEY) {

...and then just throwing 99.999999% of the results away again. This is 
going to take a phenomenal amount of effort on a huge filesystem, 
copying unnessecary data around between the kernel and your program.

The first thing I learned myself when starting to play with the search 
ioctl is that the search doesn't happen in some kind of 3 dimensional 
space. You can't just filter on a type of object when walking the tree.

http://logs.tvrrug.org.uk/logs/%23btrfs/2016-02-13.html#2016-02-13T22:32:52

The sk->max_type = sk->min_type = BTRFS_BLOCK_GROUP_ITEM_KEY only makes 
the search space start somewhere halfway objid 0 and end halfway objid 
max, including all other possible values for the type field for all 
objids in between.

> +				bg = (struct btrfs_block_group_item *)
> +					(args.buf + off);
> +				flags = btrfs_block_group_flags(bg);
> +				if (flags & BTRFS_BLOCK_GROUP_DATA) {
> +					nr_data_bgs++;
> +					used = btrfs_block_group_used(bg);
> +					printf(
> +					"block_group %15llu (len %11llu used %11llu)\n",
> +							header->objectid,
> +							header->offset, used);
> +					total_free += header->offset - used;
> +					if (min_used >= used) {
> +						min_used = used;
> +						free_of_min_used =
> +							header->offset - used;
> +						bg_of_min_used =
> +							header->objectid;
> +					}
> +				}
> +			}
> +
> +			off += header->len;
> +			sk->min_objectid = header->objectid;
> +			sk->min_type = header->type;
> +			sk->min_offset = header->offset;

When the following is a part of your extent tree...

key (289406976 EXTENT_ITEM 19193856) itemoff 15718 itemsize 53
extent refs 1 gen 11 flags DATA
extent data backref root 5 objectid 258 offset 0 count 1

key (289406976 BLOCK_GROUP_ITEM 1073741824) itemoff 15694 itemsize 24
block group used 24612864 chunk_objectid 256 flags DATA

...and when the extent_item just manages to squeeze in as last result 
into the current result buffer from the ioctl...

...then your search key looks like (289406976 168 19193856) after 
copying the values from the last seen object...

> +		}
> +		sk->nr_items = 65536;
> +
> +		if (sk->min_objectid < sk->max_objectid)
> +			sk->min_objectid += 1;

...and now it's (289406977 168 19193856), which means you're continuing 
your search *after* the block group item!

(289406976 168 19193856) is actually (289406976 << 72) + (168 << 64) + 
19193856, which is 1366685806470112827871857008640

The search is continued at 1366685811192479310741502222336, which skips 
4722366482869645213696 possible places where an object could live in the 
tree.

> +		else
> +			break;
> +	}
> +
> +	if (nr_data_bgs <= 1) {
> +		printf("Data block groups in fs = %d, no need to do balance.\n",
> +				nr_data_bgs);
> +		return 1;
> +	}
> +
> +	printf("total bgs %d total_free %llu min_used bg %llu has (min_used %llu free %llu)\n",
> +			nr_data_bgs, total_free, bg_of_min_used, min_used,
> +			free_of_min_used);
> +	if (total_free - free_of_min_used > min_used) {
> +		printf("run 'btrfs balance start -dvrange=%llu..%llu <mountpoint>'\n",
> +				bg_of_min_used, bg_of_min_used + 1);
> +		ret = 0;
> +	} else {
> +		printf("Please don't balance data block groups, it won't work.\n");
> +		ret = 1;
> +	}
> +
> +	return ret;
> +}

A more efficient way to do this search is to first look at the chunk 
tree, and for every chunk listed in there look up 1 exact match in the 
extent tree, and then read the used value of it.

Ironically, this is exact the thing that I was looking for when that IRC 
conversation referenced above happened, resulting in a very similar 
thing, playing around with searches, done in python instead of C. :-]

https://github.com/knorrie/btrfs-heatmap (see show_usage.py)

Moo! Have fun,
Goffredo Baroncelli June 12, 2016, 6:41 p.m. UTC | #3
Hi All,

On 2016-06-10 22:47, Hans van Kranenburg wrote:
>> +        if (sk->min_objectid < sk->max_objectid) 
>> + 		sk->min_objectid += 1;
> 
> ...and now it's (289406977 168 19193856), which means you're
> continuing your search *after* the block group item!
> 
> (289406976 168 19193856) is actually (289406976 << 72) + (168 << 64) 
> + 19193856, which is 1366685806470112827871857008640
> 
> The search is continued at 1366685811192479310741502222336, which
> skips 4722366482869645213696 possible places where an object could
> live in the tree.

I am not sure to follow you. The extent tree (the tree involved in the search), contains only two kind of object:

- BLOCK_GROUP_ITEM  where the key means (logical address, 0xc0, size in bytes) 
- EXTENT_ITEM, where the key means (logical address, 0xa8, size in bytes) 

So it seems that for each (possible) "logical address", only two items might exist; the two item are completely identified by (objectid, type, ). It should not possible (for the extent tree) to have two item with the same objectid,key and different offset.
So, for the extent tree, it is safe to advance only the objectid field.

I am wrong ?

BR
Hans van Kranenburg June 12, 2016, 6:53 p.m. UTC | #4
Hi!

On 06/12/2016 08:41 PM, Goffredo Baroncelli wrote:
> Hi All,
>
> On 2016-06-10 22:47, Hans van Kranenburg wrote:
>>> +        if (sk->min_objectid < sk->max_objectid) +
>>> sk->min_objectid += 1;
>>
>> ...and now it's (289406977 168 19193856), which means you're
>> continuing your search *after* the block group item!
>>
>> (289406976 168 19193856) is actually (289406976 << 72) + (168 <<
>> 64) + 19193856, which is 1366685806470112827871857008640
>>
>> The search is continued at 1366685811192479310741502222336, which
>> skips 4722366482869645213696 possible places where an object could
>> live in the tree.
>
> I am not sure to follow you. The extent tree (the tree involved in
> the search), contains only two kind of object:
>
> - BLOCK_GROUP_ITEM  where the key means (logical address, 0xc0, size
> in bytes)
> - EXTENT_ITEM, where the key means (logical address, 0xa8,
> size in bytes)
>
> So it seems that for each (possible) "logical address", only two
> items might exist; the two item are completely identified by
> (objectid, type, ). It should not possible (for the extent tree) to
> have two item with the same objectid,key and different offset. So,
> for the extent tree, it is safe to advance only the objectid field.
>
> I am wrong ?

When calling the search ioctl, the caller has to provide a memory buffer 
that the kernel is going to fill with results. For BTRFS_IOC_TREE_SEARCH 
used here, this buffer has a fixed size of 4096 bytes. Without some 
headers etc, this leaves a bit less than 4000 bytes of space for the 
kernel to write search result objects to.

If I do a search that will result in far more objects to be returned 
than possible to fit in those <4096 bytes, the kernel will just put a 
few in there until the next one does not fit any more.

It's the responsibility of the caller to change the start of the search 
to point just after the last received object and do the search again, in 
order to retrieve a few extra results.

So, the important line here was: "...when the extent_item just manages 
to squeeze in as last result into the current result buffer from the 
ioctl..."
Goffredo Baroncelli June 14, 2016, 6:11 p.m. UTC | #5
On 2016-06-12 20:53, Hans van Kranenburg wrote:
> Hi!
> 
> On 06/12/2016 08:41 PM, Goffredo Baroncelli wrote:
>> Hi All,
>> 
>> On 2016-06-10 22:47, Hans van Kranenburg wrote:
>>>> +        if (sk->min_objectid < sk->max_objectid) + 
>>>> sk->min_objectid += 1;
>>> 
>>> ...and now it's (289406977 168 19193856), which means you're 
>>> continuing your search *after* the block group item!
>>> 
>>> (289406976 168 19193856) is actually (289406976 << 72) + (168 << 
>>> 64) + 19193856, which is 1366685806470112827871857008640
>>> 
>>> The search is continued at 1366685811192479310741502222336,
>>> which skips 4722366482869645213696 possible places where an
>>> object could live in the tree.
>> 
>> I am not sure to follow you. The extent tree (the tree involved in 
>> the search), contains only two kind of object:
>> 
>> - BLOCK_GROUP_ITEM  where the key means (logical address, 0xc0,
>> size in bytes) - EXTENT_ITEM, where the key means (logical address,
>> 0xa8, size in bytes)
>> 
>> So it seems that for each (possible) "logical address", only two 
>> items might exist; the two item are completely identified by 
>> (objectid, type, ). It should not possible (for the extent tree)
>> to have two item with the same objectid,key and different offset.
>> So, for the extent tree, it is safe to advance only the objectid
>> field.
>> 
>> I am wrong ?
> 
> When calling the search ioctl, the caller has to provide a memory
> buffer that the kernel is going to fill with results. For
> BTRFS_IOC_TREE_SEARCH used here, this buffer has a fixed size of 4096
> bytes. Without some headers etc, this leaves a bit less than 4000
> bytes of space for the kernel to write search result objects to.
> 
> If I do a search that will result in far more objects to be returned
> than possible to fit in those <4096 bytes, the kernel will just put a
> few in there until the next one does not fit any more.
> 
> It's the responsibility of the caller to change the start of the
> search to point just after the last received object and do the search
> again, in order to retrieve a few extra results.

You are right. If the last item in the buffer is a EXTENT_ITEM, and the 
next item in the disk is a BLOCK_GROUP_ITEM with the same object id,
the latter would be skipped.

I was find always terrible the BTRFS_IOC_TREE_SEARCH; if the min_*
fields was separate from the key, the use of this ioctl would
be a lot simpler. Moreover in most case (like this one), it would be 
reduced the context switches, because the ioctl would return
only valid data.



> 
> So, the important line here was: "...when the extent_item just
> manages to squeeze in as last result into the current result buffer
> from the ioctl..."
>
Hugo Mills June 14, 2016, 6:16 p.m. UTC | #6
On Tue, Jun 14, 2016 at 08:11:59PM +0200, Goffredo Baroncelli wrote:
> On 2016-06-12 20:53, Hans van Kranenburg wrote:
> > Hi!
> > 
> > On 06/12/2016 08:41 PM, Goffredo Baroncelli wrote:
> >> Hi All,
> >> 
> >> On 2016-06-10 22:47, Hans van Kranenburg wrote:
> >>>> +        if (sk->min_objectid < sk->max_objectid) + 
> >>>> sk->min_objectid += 1;
> >>> 
> >>> ...and now it's (289406977 168 19193856), which means you're 
> >>> continuing your search *after* the block group item!
> >>> 
> >>> (289406976 168 19193856) is actually (289406976 << 72) + (168 << 
> >>> 64) + 19193856, which is 1366685806470112827871857008640
> >>> 
> >>> The search is continued at 1366685811192479310741502222336,
> >>> which skips 4722366482869645213696 possible places where an
> >>> object could live in the tree.
> >> 
> >> I am not sure to follow you. The extent tree (the tree involved in 
> >> the search), contains only two kind of object:
> >> 
> >> - BLOCK_GROUP_ITEM  where the key means (logical address, 0xc0,
> >> size in bytes) - EXTENT_ITEM, where the key means (logical address,
> >> 0xa8, size in bytes)
> >> 
> >> So it seems that for each (possible) "logical address", only two 
> >> items might exist; the two item are completely identified by 
> >> (objectid, type, ). It should not possible (for the extent tree)
> >> to have two item with the same objectid,key and different offset.
> >> So, for the extent tree, it is safe to advance only the objectid
> >> field.
> >> 
> >> I am wrong ?
> > 
> > When calling the search ioctl, the caller has to provide a memory
> > buffer that the kernel is going to fill with results. For
> > BTRFS_IOC_TREE_SEARCH used here, this buffer has a fixed size of 4096
> > bytes. Without some headers etc, this leaves a bit less than 4000
> > bytes of space for the kernel to write search result objects to.
> > 
> > If I do a search that will result in far more objects to be returned
> > than possible to fit in those <4096 bytes, the kernel will just put a
> > few in there until the next one does not fit any more.
> > 
> > It's the responsibility of the caller to change the start of the
> > search to point just after the last received object and do the search
> > again, in order to retrieve a few extra results.
> 
> You are right. If the last item in the buffer is a EXTENT_ITEM, and the 
> next item in the disk is a BLOCK_GROUP_ITEM with the same object id,
> the latter would be skipped.
> 
> I was find always terrible the BTRFS_IOC_TREE_SEARCH; if the min_*
> fields was separate from the key, the use of this ioctl would
> be a lot simpler. Moreover in most case (like this one), it would be 
> reduced the context switches, because the ioctl would return
> only valid data.

   There's an argument for implementing it. However, given the way the
indexing works (concatenation of the key elements, resulting in
lexical ordering of keys), you'd still have to do exactly the same
work, only in the kernel instead. The only thing you really win is the
number of context switches.

   It would really have to be a new ioctl, too. You can't change the
behaviour of the existing one.

   Hugo.

> > 
> > So, the important line here was: "...when the extent_item just
> > manages to squeeze in as last result into the current result buffer
> > from the ioctl..."
> > 
> 
>
Ashish Samant June 14, 2016, 6:21 p.m. UTC | #7
On 06/10/2016 01:47 PM, Hans van Kranenburg wrote:
> Hi,
>
> Correct me if I'm wrong,
>
> On 06/09/2016 11:46 PM, Ashish Samant wrote:
>> +/* return 0 if balance can remove a data block group, otherwise 
>> return 1 */
>> +static int search_data_bgs(const char *path)
>> +{
>> +    struct btrfs_ioctl_search_args args;
>> +    struct btrfs_ioctl_search_key *sk;
>> +    struct btrfs_ioctl_search_header *header;
>> +    struct btrfs_block_group_item *bg;
>> +    unsigned long off = 0;
>> +    DIR *dirstream = NULL;
>> +    int e;
>> +    int fd;
>> +    int i;
>> +    u64 total_free = 0;
>> +    u64 min_used = (u64)-1;
>> +    u64 free_of_min_used = 0;
>> +    u64 bg_of_min_used = 0;
>> +    u64 flags;
>> +    u64 used;
>> +    int ret = 0;
>> +    int nr_data_bgs = 0;
>> +
>> +    fd = btrfs_open_dir(path, &dirstream, 1);
>> +    if (fd < 0)
>> +        return 1;
>> +
>> +    memset(&args, 0, sizeof(args));
>> +    sk = &args.key;
>> +
>> +    sk->tree_id = BTRFS_EXTENT_TREE_OBJECTID;
>> +    sk->min_objectid = sk->min_offset = sk->min_transid = 0;
>> +    sk->max_objectid = sk->max_offset = sk->max_transid = (u64)-1;
>> +    sk->max_type = sk->min_type = BTRFS_BLOCK_GROUP_ITEM_KEY;
>> +    sk->nr_items = 65536;
>
> This search returns not only block group information, but also 
> everything else. You're first retrieving the complete extent tree to 
> userspace, in buffers...
>
>> +
>> +    while (1) {
>> +        ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args);
>> +        e = errno;
>> +        if (ret < 0) {
>> +            fprintf(stderr, "ret %d error '%s'\n", ret,
>> +                strerror(e));
>> +            return ret;
>> +        }
>> +        /*
>> +         * it should not happen.
>> +         */
>> +        if (sk->nr_items == 0)
>> +            break;
>> +
>> +        off = 0;
>> +        for (i = 0; i < sk->nr_items; i++) {
>> +            header = (struct btrfs_ioctl_search_header *)(args.buf
>> +                                      + off);
>> +
>> +            off += sizeof(*header);
>> +            if (header->type == BTRFS_BLOCK_GROUP_ITEM_KEY) {
>
> ...and then just throwing 99.999999% of the results away again. This 
> is going to take a phenomenal amount of effort on a huge filesystem, 
> copying unnessecary data around between the kernel and your program.
>
> The first thing I learned myself when starting to play with the search 
> ioctl is that the search doesn't happen in some kind of 3 dimensional 
> space. You can't just filter on a type of object when walking the tree.
>
> http://logs.tvrrug.org.uk/logs/%23btrfs/2016-02-13.html#2016-02-13T22:32:52 
>
>
> The sk->max_type = sk->min_type = BTRFS_BLOCK_GROUP_ITEM_KEY only 
> makes the search space start somewhere halfway objid 0 and end halfway 
> objid max, including all other possible values for the type field for 
> all objids in between.
>
>> +                bg = (struct btrfs_block_group_item *)
>> +                    (args.buf + off);
>> +                flags = btrfs_block_group_flags(bg);
>> +                if (flags & BTRFS_BLOCK_GROUP_DATA) {
>> +                    nr_data_bgs++;
>> +                    used = btrfs_block_group_used(bg);
>> +                    printf(
>> +                    "block_group %15llu (len %11llu used %11llu)\n",
>> +                            header->objectid,
>> +                            header->offset, used);
>> +                    total_free += header->offset - used;
>> +                    if (min_used >= used) {
>> +                        min_used = used;
>> +                        free_of_min_used =
>> +                            header->offset - used;
>> +                        bg_of_min_used =
>> +                            header->objectid;
>> +                    }
>> +                }
>> +            }
>> +
>> +            off += header->len;
>> +            sk->min_objectid = header->objectid;
>> +            sk->min_type = header->type;
>> +            sk->min_offset = header->offset;
>
> When the following is a part of your extent tree...
>
> key (289406976 EXTENT_ITEM 19193856) itemoff 15718 itemsize 53
> extent refs 1 gen 11 flags DATA
> extent data backref root 5 objectid 258 offset 0 count 1
>
> key (289406976 BLOCK_GROUP_ITEM 1073741824) itemoff 15694 itemsize 24
> block group used 24612864 chunk_objectid 256 flags DATA
>
> ...and when the extent_item just manages to squeeze in as last result 
> into the current result buffer from the ioctl...
>
> ...then your search key looks like (289406976 168 19193856) after 
> copying the values from the last seen object...
>
>> +        }
>> +        sk->nr_items = 65536;
>> +
>> +        if (sk->min_objectid < sk->max_objectid)
>> +            sk->min_objectid += 1;
>
> ...and now it's (289406977 168 19193856), which means you're 
> continuing your search *after* the block group item!
>
> (289406976 168 19193856) is actually (289406976 << 72) + (168 << 64) + 
> 19193856, which is 1366685806470112827871857008640
>
> The search is continued at 1366685811192479310741502222336, which 
> skips 4722366482869645213696 possible places where an object could 
> live in the tree.
Ah, yes you are right.
>
>> +        else
>> +            break;
>> +    }
>> +
>> +    if (nr_data_bgs <= 1) {
>> +        printf("Data block groups in fs = %d, no need to do 
>> balance.\n",
>> +                nr_data_bgs);
>> +        return 1;
>> +    }
>> +
>> +    printf("total bgs %d total_free %llu min_used bg %llu has 
>> (min_used %llu free %llu)\n",
>> +            nr_data_bgs, total_free, bg_of_min_used, min_used,
>> +            free_of_min_used);
>> +    if (total_free - free_of_min_used > min_used) {
>> +        printf("run 'btrfs balance start -dvrange=%llu..%llu 
>> <mountpoint>'\n",
>> +                bg_of_min_used, bg_of_min_used + 1);
>> +        ret = 0;
>> +    } else {
>> +        printf("Please don't balance data block groups, it won't 
>> work.\n");
>> +        ret = 1;
>> +    }
>> +
>> +    return ret;
>> +}
>
> A more efficient way to do this search is to first look at the chunk 
> tree, and for every chunk listed in there look up 1 exact match in the 
> extent tree, and then read the used value of it.
>
> Ironically, this is exact the thing that I was looking for when that 
> IRC conversation referenced above happened, resulting in a very 
> similar thing, playing around with searches, done in python instead of 
> C. :-]
>
> https://github.com/knorrie/btrfs-heatmap (see show_usage.py)
Thanks, will take a look.

Ashish
>
> Moo! Have fun,
>

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Goffredo Baroncelli June 14, 2016, 6:55 p.m. UTC | #8
On 2016-06-14 20:16, Hugo Mills wrote:
[....]
>>
>> You are right. If the last item in the buffer is a EXTENT_ITEM, and the 
>> next item in the disk is a BLOCK_GROUP_ITEM with the same object id,
>> the latter would be skipped.
>>
>> I was find always terrible the BTRFS_IOC_TREE_SEARCH; if the min_*
>> fields was separate from the key, the use of this ioctl would
>> be a lot simpler. Moreover in most case (like this one), it would be 
>> reduced the context switches, because the ioctl would return
>> only valid data.
> 
>    There's an argument for implementing it. However, given the way the
> indexing works (concatenation of the key elements, resulting in
> lexical ordering of keys), you'd still have to do exactly the same
> work, only in the kernel instead. The only thing you really win is the
> number of context switches.
> 
>    It would really have to be a new ioctl, too. You can't change the
> behaviour of the existing one.
> 
>    Hugo.

It was 2010...

http://www.spinics.net/lists/linux-btrfs/msg07636.html


> 
>>>
>>> So, the important line here was: "...when the extent_item just
>>> manages to squeeze in as last result into the current result buffer
>>> from the ioctl..."
>>>
>>
>>
>

Patch
diff mbox

diff --git a/cmds-balance.c b/cmds-balance.c
index 8f3bf5b..e2aab6c 100644
--- a/cmds-balance.c
+++ b/cmds-balance.c
@@ -493,6 +493,116 @@  out:
 	return ret;
 }
 
+/* return 0 if balance can remove a data block group, otherwise return 1 */
+static int search_data_bgs(const char *path)
+{
+	struct btrfs_ioctl_search_args args;
+	struct btrfs_ioctl_search_key *sk;
+	struct btrfs_ioctl_search_header *header;
+	struct btrfs_block_group_item *bg;
+	unsigned long off = 0;
+	DIR *dirstream = NULL;
+	int e;
+	int fd;
+	int i;
+	u64 total_free = 0;
+	u64 min_used = (u64)-1;
+	u64 free_of_min_used = 0;
+	u64 bg_of_min_used = 0;
+	u64 flags;
+	u64 used;
+	int ret = 0;
+	int nr_data_bgs = 0;
+
+	fd = btrfs_open_dir(path, &dirstream, 1);
+	if (fd < 0)
+		return 1;
+
+	memset(&args, 0, sizeof(args));
+	sk = &args.key;
+
+	sk->tree_id = BTRFS_EXTENT_TREE_OBJECTID;
+	sk->min_objectid = sk->min_offset = sk->min_transid = 0;
+	sk->max_objectid = sk->max_offset = sk->max_transid = (u64)-1;
+	sk->max_type = sk->min_type = BTRFS_BLOCK_GROUP_ITEM_KEY;
+	sk->nr_items = 65536;
+
+	while (1) {
+		ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args);
+		e = errno;
+		if (ret < 0) {
+			fprintf(stderr, "ret %d error '%s'\n", ret,
+				strerror(e));
+			return ret;
+		}
+		/*
+		 * it should not happen.
+		 */
+		if (sk->nr_items == 0)
+			break;
+
+		off = 0;
+		for (i = 0; i < sk->nr_items; i++) {
+			header = (struct btrfs_ioctl_search_header *)(args.buf
+								      + off);
+
+			off += sizeof(*header);
+			if (header->type == BTRFS_BLOCK_GROUP_ITEM_KEY) {
+				bg = (struct btrfs_block_group_item *)
+					(args.buf + off);
+				flags = btrfs_block_group_flags(bg);
+				if (flags & BTRFS_BLOCK_GROUP_DATA) {
+					nr_data_bgs++;
+					used = btrfs_block_group_used(bg);
+					printf(
+					"block_group %15llu (len %11llu used %11llu)\n",
+							header->objectid,
+							header->offset, used);
+					total_free += header->offset - used;
+					if (min_used >= used) {
+						min_used = used;
+						free_of_min_used =
+							header->offset - used;
+						bg_of_min_used =
+							header->objectid;
+					}
+				}
+			}
+
+			off += header->len;
+			sk->min_objectid = header->objectid;
+			sk->min_type = header->type;
+			sk->min_offset = header->offset;
+		}
+		sk->nr_items = 65536;
+
+		if (sk->min_objectid < sk->max_objectid)
+			sk->min_objectid += 1;
+		else
+			break;
+	}
+
+	if (nr_data_bgs <= 1) {
+		printf("Data block groups in fs = %d, no need to do balance.\n",
+				nr_data_bgs);
+		return 1;
+	}
+
+	printf("total bgs %d total_free %llu min_used bg %llu has (min_used %llu free %llu)\n",
+			nr_data_bgs, total_free, bg_of_min_used, min_used,
+			free_of_min_used);
+	if (total_free - free_of_min_used > min_used) {
+		printf("run 'btrfs balance start -dvrange=%llu..%llu <mountpoint>'\n",
+				bg_of_min_used, bg_of_min_used + 1);
+		ret = 0;
+	} else {
+		printf("Please don't balance data block groups, it won't work.\n");
+		ret = 1;
+	}
+
+	return ret;
+}
+
 static const char * const cmd_balance_start_usage[] = {
 	"btrfs balance start [options] <path>",
 	"Balance chunks across the devices",
@@ -508,6 +618,7 @@  static const char * const cmd_balance_start_usage[] = {
 	"-m[filters]    act on metadata chunks",
 	"-s[filters]    act on system chunks (only under -f)",
 	"-v             be verbose",
+	"-c             only check if balance would make sense, not doing real job",
 	"-f             force reducing of metadata integrity",
 	"--full-balance do not print warning and do not delay start",
 	NULL
@@ -521,6 +632,7 @@  static int cmd_balance_start(int argc, char **argv)
 	int force = 0;
 	int verbose = 0;
 	unsigned start_flags = 0;
+	int check_data_bgs = 0;
 	int i;
 
 	memset(&args, 0, sizeof(args));
@@ -534,12 +646,14 @@  static int cmd_balance_start(int argc, char **argv)
 			{ "system", optional_argument, NULL, 's' },
 			{ "force", no_argument, NULL, 'f' },
 			{ "verbose", no_argument, NULL, 'v' },
+			{ "check-only", no_argument, NULL, 'c' },
 			{ "full-balance", no_argument, NULL,
 				GETOPT_VAL_FULL_BALANCE },
 			{ NULL, 0, NULL, 0 }
 		};
 
-		int opt = getopt_long(argc, argv, "d::s::m::fv", longopts, NULL);
+		int opt = getopt_long(argc, argv, "d::s::m::fvc", longopts,
+				      NULL);
 		if (opt < 0)
 			break;
 
@@ -571,6 +685,9 @@  static int cmd_balance_start(int argc, char **argv)
 		case 'v':
 			verbose = 1;
 			break;
+		case 'c':
+			check_data_bgs = 1;
+			break;
 		case GETOPT_VAL_FULL_BALANCE:
 			start_flags |= BALANCE_START_NOWARN;
 			break;
@@ -582,6 +699,14 @@  static int cmd_balance_start(int argc, char **argv)
 	if (check_argc_exact(argc - optind, 1))
 		usage(cmd_balance_start_usage);
 
+	if (check_data_bgs) {
+		if (verbose)
+			dump_ioctl_balance_args(&args);
+
+		printf("Checking data block groups...\n");
+		return search_data_bgs(argv[optind]);
+	}
+
 	/*
 	 * allow -s only under --force, otherwise do with system chunks
 	 * the same thing we were ordered to do with meta chunks