
Again, no space left on device while rebalancing and recipe doesn't work

Message ID 56D54393.8060307@cn.fujitsu.com (mailing list archive)
State New, archived

Commit Message

Qu Wenruo March 1, 2016, 7:24 a.m. UTC
Marc Haber wrote on 2016/03/01 07:54 +0100:
> On Tue, Mar 01, 2016 at 08:45:21AM +0800, Qu Wenruo wrote:
>> Didn't see the attachment though, seems to be filtered by maillist police.
>
> Trying again.

OK, I got the attachment.

And, surprisingly, btrfs balance on data chunks works without problem, 
but the plain btrfs balance command fails.

>
>>> I now have a kworker and a btrfs-transact kernel process taking most of
>>> one CPU core each, even after the userspace programs have terminated.
>>> Is there a way to find out what these threads are actually doing?
>>
>> Did btrfs balance status give any hint?
>
> It says 'No balance found on /mnt/fanbtr'. I do have a second btrfs on
> the box, which is acting up as well (it has a five digit number of
> snapshots, and deleting a single snapshot takes about five to ten
> minutes. I was planning to write another mailing list article once
> this balance issue is through).

I assume the large number of snapshots is related to the high CPU usage, 
as that many snapshots make btrfs spend a long time calculating backrefs, 
and the backtrace seems to confirm that.

As a workaround, I'd suggest removing unused snapshots and keeping their 
number to 4 digits.

But I'm still not sure whether it's related to the ENOSPC problem.

It would be a great help if you could modify your kernel and apply the 
following debug patch (same as the attachment):

------
 From f2cc7af0aea659a522b97d3776b719f14532bce9 Mon Sep 17 00:00:00 2001
From: Qu Wenruo <quwenruo@cn.fujitsu.com>
Date: Tue, 1 Mar 2016 15:21:18 +0800
Subject: [PATCH] btrfs: debug patch

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
  fs/btrfs/extent-tree.c | 15 +++++++++++++--
  1 file changed, 13 insertions(+), 2 deletions(-)


diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 083783b..70b284b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9393,8 +9393,10 @@ int btrfs_can_relocate(struct btrfs_root *root, u64 bytenr)
 	block_group = btrfs_lookup_block_group(root->fs_info, bytenr);
 
 	/* odd, couldn't find the block group, leave it alone */
-	if (!block_group)
+	if (!block_group) {
+		pr_info("no such chunk: %llu\n", bytenr);
 		return -1;
+	}
 
 	min_free = btrfs_block_group_used(&block_group->item);
 
@@ -9419,6 +9421,11 @@ int btrfs_can_relocate(struct btrfs_root *root, u64 bytenr)
 	     space_info->bytes_pinned + space_info->bytes_readonly +
 	     min_free < space_info->total_bytes)) {
 		spin_unlock(&space_info->lock);
+		pr_info("no space: total:%llu, bg_len:%llu, used:%llu, reseved:%llu, pinned:%llu, ro:%llu, min_free:%llu\n",
+			space_info->total_bytes, block_group->key.offset,
+			space_info->bytes_used, space_info->bytes_reserved,
+			space_info->bytes_pinned, space_info->bytes_readonly,
+			min_free);
 		goto out;
 	}
 	spin_unlock(&space_info->lock);
@@ -9448,8 +9455,10 @@ int btrfs_can_relocate(struct btrfs_root *root, u64 bytenr)
 		 * this is just a balance, so if we were marked as full
 		 * we know there is no space for a new chunk
 		 */
-		if (full)
+		if (full) {
+			pr_info("space full\n");
 			goto out;
+		}
 
 		index = get_block_group_index(block_group);
 	}
@@ -9496,6 +9505,8 @@ int btrfs_can_relocate(struct btrfs_root *root, u64 bytenr)
 			ret = -1;
 		}
 	}
+	if (ret == -1)
+		pr_info("no new chunk allocatable\n");
 	mutex_unlock(&root->fs_info->chunk_mutex);
 	btrfs_end_transaction(trans, root);
 out:
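
(In case it helps, a rough sketch of how to apply it and collect the 
output; this assumes a git kernel tree and that this mail is saved as 
debug.patch:

   git am debug.patch          # or: patch -p1 < debug.patch
   # rebuild and boot the patched kernel, rerun the balance, then:
   dmesg | grep -E "no space|space full|no new chunk|no such chunk"

The grep strings are just the messages added by the pr_info() calls 
above.)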

Comments

Qu Wenruo March 1, 2016, 8:13 a.m. UTC | #1
Qu Wenruo wrote on 2016/03/01 15:24 +0800:
>
>
> Marc Haber wrote on 2016/03/01 07:54 +0100:
>> On Tue, Mar 01, 2016 at 08:45:21AM +0800, Qu Wenruo wrote:
>>> Didn't see the attachment though, seems to be filtered by maillist
>>> police.
>>
>> Trying again.
>
> OK, I got the attachment.
>
> And, surprisingly, btrfs balance on data chunks works without problem,
> but the plain btrfs balance command fails.
>
>>
>>>> I now have a kworker and a btrfs-transact kernel process taking most of
>>>> one CPU core each, even after the userspace programs have terminated.
>>>> Is there a way to find out what these threads are actually doing?
>>>
>>> Did btrfs balance status give any hint?
>>
>> It says 'No balance found on /mnt/fanbtr'. I do have a second btrfs on
>> the box, which is acting up as well (it has a five digit number of
>> snapshots, and deleting a single snapshot takes about five to ten
>> minutes. I was planning to write another mailing list article once
>> this balance issue is through).
>
> I assume the large number of snapshots is related to the high CPU usage,
> as that many snapshots make btrfs spend a long time calculating backrefs,
> and the backtrace seems to confirm that.
>
> As a workaround, I'd suggest removing unused snapshots and keeping their
> number to 4 digits.
>
> But I'm still not sure whether it's related to the ENOSPC problem.
>
> It would be a great help if you could modify your kernel and apply the
> following debug patch (same as the attachment):
>
> ------
>  From f2cc7af0aea659a522b97d3776b719f14532bce9 Mon Sep 17 00:00:00 2001
> From: Qu Wenruo <quwenruo@cn.fujitsu.com>
> Date: Tue, 1 Mar 2016 15:21:18 +0800
> Subject: [PATCH] btrfs: debug patch
>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
>   fs/btrfs/extent-tree.c | 15 +++++++++++++--
>   1 file changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 083783b..70b284b 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -9393,8 +9393,10 @@ int btrfs_can_relocate(struct btrfs_root *root, u64 bytenr)
>       block_group = btrfs_lookup_block_group(root->fs_info, bytenr);
>
>       /* odd, couldn't find the block group, leave it alone */
> -    if (!block_group)
> +    if (!block_group) {
> +        pr_info("no such chunk: %llu\n", bytenr);
>           return -1;
> +    }
>
>       min_free = btrfs_block_group_used(&block_group->item);
>
> @@ -9419,6 +9421,11 @@ int btrfs_can_relocate(struct btrfs_root *root, u64 bytenr)
>            space_info->bytes_pinned + space_info->bytes_readonly +
>            min_free < space_info->total_bytes)) {
>           spin_unlock(&space_info->lock);
> +        pr_info("no space: total:%llu, bg_len:%llu, used:%llu, reseved:%llu, pinned:%llu, ro:%llu, min_free:%llu\n",
> +            space_info->total_bytes, block_group->key.offset,
> +            space_info->bytes_used, space_info->bytes_reserved,
> +            space_info->bytes_pinned, space_info->bytes_readonly,
> +            min_free);
Oh, I'm sorry, that output is not necessary; it's better to use the 
newer patch:
https://patchwork.kernel.org/patch/8462881/

With the newer patch, you will need to use the enospc_debug mount option 
to get the debug information.
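
For example (a minimal sketch, assuming the filesystem is still mounted 
at /mnt/fanbtr as in your earlier mails):

   mount -o remount,enospc_debug /mnt/fanbtr

or add enospc_debug to the mount options in fstab before re-running the 
balance.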

Sorry for the inconvenience.

Thanks,
Qu

>           goto out;
>       }
>       spin_unlock(&space_info->lock);
> @@ -9448,8 +9455,10 @@ int btrfs_can_relocate(struct btrfs_root *root, u64 bytenr)
>            * this is just a balance, so if we were marked as full
>            * we know there is no space for a new chunk
>            */
> -        if (full)
> +        if (full) {
> +            pr_info("space full\n");
>               goto out;
> +        }
>
>           index = get_block_group_index(block_group);
>       }
> @@ -9496,6 +9505,8 @@ int btrfs_can_relocate(struct btrfs_root *root, u64 bytenr)
>               ret = -1;
>           }
>       }
> +    if (ret == -1)
> +        pr_info("no new chunk allocatable\n");
>       mutex_unlock(&root->fs_info->chunk_mutex);
>       btrfs_end_transaction(trans, root);
>   out:


Duncan March 1, 2016, 8:51 p.m. UTC | #2
Qu Wenruo posted on Tue, 01 Mar 2016 15:24:03 +0800 as excerpted:


> 
> Marc Haber wrote on 2016/03/01 07:54 +0100:
>> On Tue, Mar 01, 2016 at 08:45:21AM +0800, Qu Wenruo wrote:
>>> Didn't see the attachment though, seems to be filtered by maillist
>>> police.
>>
>> Trying again.
> 
> OK, I got the attachment.
> 
> And, surprisingly, btrfs balance on data chunks works without problem,
> but the plain btrfs balance command fails.

There has been something bothering me about this thread that I wasn't 
quite pinning down, but here it is.

If you look at the btrfs fi df/usage numbers, data chunk total vs. used 
are very close to one another (113 GiB total, 112.77 GiB used, single 
profile, assuming GiB data chunks, that's only a fraction of a single 
data chunk unused), so balance would seem to be getting thru them just 
fine.

But there's a /huge/ spread between total vs. used metadata (32 GiB 
total, under 4 GiB used, clearly _many_ empty or nearly empty chunks), 
implying that has not been successfully balanced in quite some time, if 
ever.  So I'd surmise the problem is in metadata, not in data.

Which would explain why balancing data works fine, but a whole-filesystem 
balance doesn't, because it's getting stuck on the metadata, not the data.

Now the balance metadata filters include system as well, by default, and 
the -mprofiles=dup and -sprofiles=dup balances finished, apparently 
without error, which throws a wrench into my theory.

But while we have the btrfs fi df from before the attempt with the 
profiles filters, we don't have the same output from after.

If btrfs fi df still shows more than a GiB spread between metadata total 
and used /after/ the supposedly successful profiles-filter runs, then 
obviously they're not balancing what they should be balancing, which is a 
bug right there.  An educated guess suggests that once that bug is fixed, 
the metadata and possibly system balances will likely fail, due to 
whatever on-filesystem problem is keeping the full balance from 
completing as well.

Of course, if the post-filtered-balance btrfs fi df shows a metadata 
spread of under a gig (given 256 MiB metadata chunks, but dup, so 
possibly nearly a half-gig free, plus the 512 MiB global reserve, which 
counts as unused metadata as well and adds another half-gig that's 
reported free but actually accounted for, yielding a spread of up to a 
gig even after a successful balance), then the problem is elsewhere.  But 
I'm guessing it's still going to be well over a gig, and may still be the 
full 28+ gig spread (32 gig total, under 4 gig used), indicating the 
metadata-filtered balance didn't actually work at all.

Meanwhile, the metadata filters also include system, so while it's 
possible to balance system specifically, without (other) metadata, to my 
knowledge it's impossible to balance (other) metadata exclusively, 
without balancing system.

Which, now assuming we still have that huge metadata spread, means if the 
on-filesystem bug is in the system chunks, both system and metadata 
filtered balances *should* fail, while if it's in non-system metadata, a 
system filtered balance *should* succeed, while a metadata filtered 
balance *should* fail.
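
For concreteness, a sketch of the two filtered balances I have in mind, 
using the /mnt/fanbtr mountpoint from earlier in the thread (if I recall 
correctly, btrfs-progs only accepts the -s filter together with 
-f/--force):

   # system chunks only
   btrfs balance start -f -sprofiles=dup /mnt/fanbtr
   # metadata, which by default drags system along with it
   btrfs balance start -mprofiles=dup /mnt/fanbtr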

>>>> I now have a kworker and a btrfs-transact kernel process taking most
>>>> of one CPU core each, even after the userspace programs have
>>>> terminated. Is there a way to find out what these threads are
>>>> actually doing?
>>>
>>> Did btrfs balance status give any hint?
>>
>> It says 'No balance found on /mnt/fanbtr'. I do have a second btrfs on
>> the box, which is acting up as well (it has a five digit number of
>> snapshots, and deleting a single snapshot takes about five to ten
>> minutes. I was planning to write another mailing list article once this
>> balance issue is through).
> 
> I assume the large number of snapshots is related to the high CPU usage,
> as that many snapshots make btrfs spend a long time calculating backrefs,
> and the backtrace seems to confirm that.
> 
> As a workaround, I'd suggest removing unused snapshots and keeping their
> number to 4 digits.


I'll strongly second that recommendation.  Btrfs is known to have 
snapshot scaling issues at 10K snapshots and above.  My strong 
recommendation is to limit snapshots per filesystem to 3000 or less, with 
a target of 2000 per filesystem or less if possible, and an ideal of 1000 
per filesystem or less if it's practical to keep it to that, which it 
should be with thinning, if you're only snapshotting 1-2 subvolumes, but 
may not be if you're snapshotting more.

You can actually do scheduled snapshotting on a pretty tight schedule, 
say twice or 3X per hour (every 20-30 minutes), provided you have a good 
snapshot thinning program in place as well.  Thin to, say, one snapshot 
an hour after 2-12 hours; every other hour after say 25 hours (giving you 
a bit over a day of at least hourly); every six hours after 8 days (so 
you have over a week of every other hour); twice a day after a couple 
weeks; daily after four weeks; and weekly after 90 days, by which time 
you should have an off-system backup available to fall back on as well 
if you're that concerned about old snapshots.  That way, after six months 
or a year you can delete all snapshots and finally free the space the old 
ones were taking.

Having posted the same suggestion and done the math multiple times: 
that's 250-500 snapshots per subvolume, depending primarily on how fast 
you thin down in the early stages, which means 2-4 snapshotted subvolumes 
per thousand snapshots total per filesystem.  With a strict enough 
thinning program, you can therefore snapshot up to 8 subvolumes per 
filesystem and stay under the 2000-total-snapshots target.

By 3000 snapshots per filesystem, you'll be beginning to notice slowdowns 
in some btrfs maintenance commands if you're sensitive to it, tho it's 
still at least practical to work with, and by 10K, it's generally 
noticeable by all, at least once they thin down to 2K or so, as it's 
suddenly faster again!  Above 100K, some btrfs maintenance commands slow 
to a crawl and doing that sort of maintenance really becomes impractical 
enough that it's generally easier to backup what you need to and blow 
away the filesystem to start again with a new one, than it is to try to 
recover the existing filesystem to a workable state, given that 
maintenance can at that point take days to weeks.

So five digits of snapshots on a filesystem is definitely well outside 
the recommended range, to the point that in some cases, particularly 
approaching six digits of snapshots, it'll be more practical to simply 
ditch the filesystem and start over than to try to work with it any 
longer.  Just don't do it; set up your thinning schedule so your peak is 
3000 snapshots per filesystem or under, and you won't have that problem 
to worry about. =:^)
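
If you want to check where a filesystem currently stands against those 
numbers, something like the following should do it (assuming a reasonably 
current btrfs-progs; -s restricts the listing to snapshot subvolumes):

   btrfs subvolume list -s /mnt/fanbtr | wc -l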

Oh, and btrfs quota management exacerbates the scaling issues 
dramatically.  If you're using btrfs quotas, either halve the 
max-snapshots-per-filesystem recommendations above, or reconsider whether 
you really need quota functionality and, if not, turn it off and 
eliminate the existing quota data. =:^(
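
If you decide you don't need quotas, turning them off should be as 
simple as the following sketch (which also drops the existing qgroup 
data):

   btrfs quota disable /mnt/fanbtr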
Qu Wenruo March 3, 2016, 2:02 a.m. UTC | #3
Thanks for the output.

At least for the mprofile ENOSPC error, the problem itself is very 
straightforward: we're simply unable to allocate a tree block.

I'll check the code to see if we can improve it, by finding out why we 
can't allocate a new chunk to resolve the problem.

But I'm still a little concerned about the dprofile case, as this time 
dprofile doesn't trigger an ENOSPC and my debug output is not triggered.

Thanks,
Qu

Marc Haber wrote on 2016/03/01 17:16 +0100:
> Hi,
>
> On Tue, Mar 01, 2016 at 04:13:35PM +0800, Qu Wenruo wrote:
>> Oh, I'm sorry, that output is not necessary; it's better to use the newer
>> patch:
>> https://patchwork.kernel.org/patch/8462881/
>>
>> With the newer patch, you will need to use enospc_debug mount option to get
>> the debug information.
>
> I'll copy the log this time in the body of the message so that it gets
> through to the list. Let me know if you'd prefer an attachment.
>
> This time, I didn't see the busy kernel threads, and I now see ENOSPC
> during the mprofiles part of the manual balance.
>
> Hope this helps.
>
> Greetings
> Marc
>


Marc Haber March 5, 2016, 2:28 p.m. UTC | #4
Hi,

I have not seen this message coming back to the mailing list. Was it
again too long?

I have pastebinned the log at http://paste.debian.net/412118/

On Tue, Mar 01, 2016 at 08:51:32PM +0000, Duncan wrote:
> There has been something bothering me about this thread that I wasn't 
> quite pinning down, but here it is.
> 
> If you look at the btrfs fi df/usage numbers, data chunk total vs. used 
> are very close to one another (113 GiB total, 112.77 GiB used, single 
> profile, assuming GiB data chunks, that's only a fraction of a single 
> data chunk unused), so balance would seem to be getting thru them just 
> fine.

Where would you see those numbers? I have those, pre-balance:

Mar  2 20:28:01 fan root: Data, single: total=77.00GiB, used=76.35GiB
Mar  2 20:28:01 fan root: System, DUP: total=32.00MiB, used=48.00KiB
Mar  2 20:28:01 fan root: Metadata, DUP: total=86.50GiB, used=2.11GiB
Mar  2 20:28:01 fan root: GlobalReserve, single: total=512.00MiB, used=0.00B

> But there's a /huge/ spread between total vs. used metadata (32 GiB 
> total, under 4 GiB used, clearly _many_ empty or nearly empty chunks), 
> implying that has not been successfully balanced in quite some time, if 
> ever.

This is possible, yes.

>   So I'd surmise the problem is in metadata, not in data.
> 
> Which would explain why balancing data works fine, but a whole-filesystem 
> balance doesn't, because it's getting stuck on the metadata, not the data.
> 
> Now the balance metadata filters include system as well, by default, and 
> the -mprofiles=dup and -sprofiles=dup balances finished, apparently 
> without error, which throws a wrench into my theory.

Also finishes without changing things, post-balance:
Mar  2 21:55:37 fan root: Data, single: total=77.00GiB, used=76.36GiB
Mar  2 21:55:37 fan root: System, DUP: total=32.00MiB, used=80.00KiB
Mar  2 21:55:37 fan root: Metadata, DUP: total=99.00GiB, used=2.11GiB
Mar  2 21:55:37 fan root: GlobalReserve, single: total=512.00MiB, used=0.00B

Wait, the Metadata total (allocated space) actually _grew_???
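
(In other words, allocated-but-unused metadata went from roughly 
86.50 - 2.11 = 84.39 GiB to 99.00 - 2.11 = 96.89 GiB, so the balance 
allocated about 12.5 GiB of additional metadata chunks rather than 
freeing any.)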

> But while we have the btrfs fi df from before the attempt with the 
> profiles filters, we don't have the same output from after.
We now have everything. New log attached.

> > As a workaround, I'd suggest removing unused snapshots and keeping their
> > number to 4 digits.
> 
> I'll strongly second that recommendation.  Btrfs is known to have 
> snapshot scaling issues at 10K snapshots and above.  My strong 
> recommendation is to limit snapshots per filesystem to 3000 or less, with 
> a target of 2000 per filesystem or less if possible, and an ideal of 1000 
> per filesystem or less if it's practical to keep it to that, which it 
> should be with thinning, if you're only snapshotting 1-2 subvolumes, but 
> may not be if you're snapshotting more.

I'm snapshotting /home every 10 minutes; the filesystem that I have
been posting logs from has about 400 snapshots, and snapshot cleanup
works fine. The slow snapshot removal is on a different filesystem on the
same host, which is on a rotating-rust HDD and is much bigger.

> By 3000 snapshots per filesystem, you'll be beginning to notice slowdowns 
> in some btrfs maintenance commands if you're sensitive to it, tho it's 
> still at least practical to work with, and by 10K, it's generally 
> noticeable by all, at least once they thin down to 2K or so, as it's 
> suddenly faster again!  Above 100K, some btrfs maintenance commands slow 
> to a crawl and doing that sort of maintenance really becomes impractical 
> enough that it's generally easier to backup what you need to and blow 
> away the filesystem to start again with a new one, than it is to try to 
> recover the existing filesystem to a workable state, given that 
> maintenance can at that point take days to weeks.

Ouch. This should not be the case, or btrfs subvolume snapshot should
at least emit a warning. It is not good that it is so easy to get a
filesystem into a state this bad.

> So five digits of snapshots on a filesystem is definitely well outside 
> the recommended range, to the point that in some cases, particularly 
> approaching six digits of snapshots, it'll be more practical to simply 
> ditch the filesystem and start over than to try to work with it any 
> longer.  Just don't do it; set up your thinning schedule so your peak is 
> 3000 snapshots per filesystem or under, and you won't have that problem 
> to worry about. =:^)

That needs to be documented prominently. The ZFS fanbois will love that.

> Oh, and btrfs quota management exacerbates the scaling issues 
> dramatically.  If you're using btrfs quotas

Am not, thankfully.

Greetings
Marc

Patch

From f2cc7af0aea659a522b97d3776b719f14532bce9 Mon Sep 17 00:00:00 2001
From: Qu Wenruo <quwenruo@cn.fujitsu.com>
Date: Tue, 1 Mar 2016 15:21:18 +0800
Subject: [PATCH] btrfs: debug patch

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/extent-tree.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 083783b..70b284b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9393,8 +9393,10 @@  int btrfs_can_relocate(struct btrfs_root *root, u64 bytenr)
 	block_group = btrfs_lookup_block_group(root->fs_info, bytenr);
 
 	/* odd, couldn't find the block group, leave it alone */
-	if (!block_group)
+	if (!block_group) {
+		pr_info("no such chunk: %llu\n", bytenr);
 		return -1;
+	}
 
 	min_free = btrfs_block_group_used(&block_group->item);
 
@@ -9419,6 +9421,11 @@  int btrfs_can_relocate(struct btrfs_root *root, u64 bytenr)
 	     space_info->bytes_pinned + space_info->bytes_readonly +
 	     min_free < space_info->total_bytes)) {
 		spin_unlock(&space_info->lock);
+		pr_info("no space: total:%llu, bg_len:%llu, used:%llu, reseved:%llu, pinned:%llu, ro:%llu, min_free:%llu\n",
+			space_info->total_bytes, block_group->key.offset,
+			space_info->bytes_used, space_info->bytes_reserved,
+			space_info->bytes_pinned, space_info->bytes_readonly,
+			min_free);
 		goto out;
 	}
 	spin_unlock(&space_info->lock);
@@ -9448,8 +9455,10 @@  int btrfs_can_relocate(struct btrfs_root *root, u64 bytenr)
 		 * this is just a balance, so if we were marked as full
 		 * we know there is no space for a new chunk
 		 */
-		if (full)
+		if (full) {
+			pr_info("space full\n");
 			goto out;
+		}
 
 		index = get_block_group_index(block_group);
 	}
@@ -9496,6 +9505,8 @@  int btrfs_can_relocate(struct btrfs_root *root, u64 bytenr)
 			ret = -1;
 		}
 	}
+	if (ret == -1)
+		pr_info("no new chunk allocatable\n");
 	mutex_unlock(&root->fs_info->chunk_mutex);
 	btrfs_end_transaction(trans, root);
 out:
-- 
2.7.2