[v2,1/2] btrfs: account for pinned bytes in should_alloc_chunk
diff mbox

Message ID 20170622135148.16730-1-jeffm@suse.com
State New
Headers show

Commit Message

Jeff Mahoney June 22, 2017, 1:51 p.m. UTC
From: Jeff Mahoney <jeffm@suse.com>

In a heavy write scenario, we can end up with a large number of pinned bytes.
This can translate into (very) premature ENOSPC because pinned bytes
must be accounted for when allowing a reservation but aren't accounted for
when deciding whether to create a new chunk.

This patch adds the accounting to should_alloc_chunk so that we can
create the chunk.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
---
 fs/btrfs/extent-tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Omar Sandoval June 29, 2017, 7:21 p.m. UTC | #1
On Thu, Jun 22, 2017 at 09:51:47AM -0400, jeffm@suse.com wrote:
> From: Jeff Mahoney <jeffm@suse.com>
> 
> In a heavy write scenario, we can end up with a large number of pinned bytes.
> This can translate into (very) premature ENOSPC because pinned bytes
> must be accounted for when allowing a reservation but aren't accounted for
> when deciding whether to create a new chunk.
> 
> This patch adds the accounting to should_alloc_chunk so that we can
> create the chunk.

Hey, Jeff,

Does this fix your ENOSPC problem on a fresh filesystem? I just tracked
down an ENOSPC issue someone here reported when doing a btrfs send to a
fresh filesystem and it sounds a lot like your issue: metadata
bytes_may_use shoots up but we don't allocate any chunks for it. I'm not
seeing how including bytes_pinned will help for this case. We won't have
any pinned bytes when populating a new fs, right?

I don't have a good solution. Allocating chunks based on bytes_may_use
is going to way over-allocate because of our worst-case estimations. I'm
double-checking now that the flusher is doing the right thing and not
missing anything. I'll keep digging, just wanted to know if you had any
thoughts.

> Signed-off-by: Jeff Mahoney <jeffm@suse.com>
> ---
>  fs/btrfs/extent-tree.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 33d979e9ea2a..88b04742beea 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4377,7 +4377,7 @@ static int should_alloc_chunk(struct btrfs_fs_info *fs_info,
>  {
>  	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
>  	u64 num_bytes = sinfo->total_bytes - sinfo->bytes_readonly;
> -	u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved;
> +	u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved + sinfo->bytes_pinned;
>  	u64 thresh;
>  
>  	if (force == CHUNK_ALLOC_FORCE)
> -- 
> 2.11.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jeff Mahoney June 29, 2017, 7:49 p.m. UTC | #2
On 6/29/17 3:21 PM, Omar Sandoval wrote:
> On Thu, Jun 22, 2017 at 09:51:47AM -0400, jeffm@suse.com wrote:
>> From: Jeff Mahoney <jeffm@suse.com>
>>
>> In a heavy write scenario, we can end up with a large number of pinned bytes.
>> This can translate into (very) premature ENOSPC because pinned bytes
>> must be accounted for when allowing a reservation but aren't accounted for
>> when deciding whether to create a new chunk.
>>
>> This patch adds the accounting to should_alloc_chunk so that we can
>> create the chunk.
> 
> Hey, Jeff,

Hi Omar -

> Does this fix your ENOSPC problem on a fresh filesystem? I just tracked

No, it didn't.  It helped somewhat, but we were still hitting it
frequently.  What did help was reverting "Btrfs: skip commit transaction
if we don't have enough pinned bytes" (not upstream yet, on the list).

> down an ENOSPC issue someone here reported when doing a btrfs send to a
> fresh filesystem and it sounds a lot like your issue: metadata
> bytes_may_use shoots up but we don't allocate any chunks for it. I'm not
> seeing how including bytes_pinned will help for this case. We won't have
> any pinned bytes when populating a new fs, right?

Our test environment is just installing the OS.  That means lots of
creates, writes, and then renames, so there's a fair amount of metadata
churn that results in elevated pinned_bytes.  Rsync can cause the same
workload pretty easily too.  Nikolay was going to look into coming up
with a configuration for fsstress that would emulate it.

> I don't have a good solution. Allocating chunks based on bytes_may_use
> is going to way over-allocate because of our worst-case estimations. I'm
> double-checking now that the flusher is doing the right thing and not
> missing anything. I'll keep digging, just wanted to know if you had any
> thoughts.

My suspicion is that it all just happens to work and that there are
several bugs working together that approximate a correct result.  My
reasoning is that the patch I referenced above is correct.  The logic in
may_commit_transaction is inverted and causing a ton of additional
transaction commits.  I think that having the additional transaction
commits is serving to free pinned bytes more quickly so things just work
for the most part and pinned bytes doesn't play as much of a role.  But
once the transaction count comes down, that pinned bytes count gets
elevated and becomes more important.  I think it should be taken into
account to determine whether committing a transaction early will result
in releasing enough space to honor the reservation without allocating a
new chunk.  If the answer is yes, flush it.  If no, there's no point in
flushing it now, so just allocate the chunk and move on.

The big question is where this 80% number comes into play.

There is a caveat here: almost all of our testing has been on 4.4 with a
bunch of these patches backported.  I believe the same issue will still
be there on mainline, but we're in release crunch mode and I haven't had
a chance to test more fully.

-Jeff

>> Signed-off-by: Jeff Mahoney <jeffm@suse.com>
>> ---
>>  fs/btrfs/extent-tree.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 33d979e9ea2a..88b04742beea 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -4377,7 +4377,7 @@ static int should_alloc_chunk(struct btrfs_fs_info *fs_info,
>>  {
>>  	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
>>  	u64 num_bytes = sinfo->total_bytes - sinfo->bytes_readonly;
>> -	u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved;
>> +	u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved + sinfo->bytes_pinned;
>>  	u64 thresh;
>>  
>>  	if (force == CHUNK_ALLOC_FORCE)
>> -- 
>> 2.11.0
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
Omar Sandoval June 29, 2017, 8:01 p.m. UTC | #3
On Thu, Jun 29, 2017 at 03:49:05PM -0400, Jeff Mahoney wrote:
> On 6/29/17 3:21 PM, Omar Sandoval wrote:
> > On Thu, Jun 22, 2017 at 09:51:47AM -0400, jeffm@suse.com wrote:
> >> From: Jeff Mahoney <jeffm@suse.com>
> >>
> >> In a heavy write scenario, we can end up with a large number of pinned bytes.
> >> This can translate into (very) premature ENOSPC because pinned bytes
> >> must be accounted for when allowing a reservation but aren't accounted for
> >> when deciding whether to create a new chunk.
> >>
> >> This patch adds the accounting to should_alloc_chunk so that we can
> >> create the chunk.
> > 
> > Hey, Jeff,
> 
> Hi Omar -
> 
> > Does this fix your ENOSPC problem on a fresh filesystem? I just tracked
> 
> No, it didn't.  It helped somewhat, but we were still hitting it
> frequently.  What did help was reverting "Btrfs: skip commit transaction
> if we don't have enough pinned bytes" (not upstream yet, on the list).

I see, that makes sense.

> > down an ENOSPC issue someone here reported when doing a btrfs send to a
> > fresh filesystem and it sounds a lot like your issue: metadata
> > bytes_may_use shoots up but we don't allocate any chunks for it. I'm not
> > seeing how including bytes_pinned will help for this case. We won't have
> > any pinned bytes when populating a new fs, right?
> 
> Our test environment is just installing the OS.  That means lots of
> creates, writes, and then renames, so there's a fair amount of metadata
> churn that results in elevated pinned_bytes.  Rsync can cause the same
> workload pretty easily too.  Nikolay was going to look into coming up
> with a configuration for fsstress that would emulate it.

The reproducer I have is a ~1.7GB btrfs receive onto a brand new 3GB
filesystem. In my case, nothing (or very little) was getting pinned, but
it makes sense that it's different for your case.

> > I don't have a good solution. Allocating chunks based on bytes_may_use
> > is going to way over-allocate because of our worst-case estimations. I'm
> > double-checking now that the flusher is doing the right thing and not
> > missing anything. I'll keep digging, just wanted to know if you had any
> > thoughts.
> 
> My suspicion is that it all just happens to work and that there are
> several bugs working together that approximate a correct result.

It certainly feels that way :)

> My
> reasoning is that the patch I referenced above is correct.  The logic in
> may_commit_transaction is inverted and causing a ton of additional
> transaction commits.  I think that having the additional transaction
> commits is serving to free pinned bytes more quickly so things just work
> for the most part and pinned bytes doesn't play as much of a role.  But
> once the transaction count comes down, that pinned bytes count gets
> elevated and becomes more important.  I think it should be taken into
> account to determine whether committing a transaction early will result
> in releasing enough space to honor the reservation without allocating a
> new chunk.  If the answer is yes, flush it.  If no, there's no point in
> flushing it now, so just allocate the chunk and move on.
> 
> The big question is where this 80% number comes into play.
> 
> There is a caveat here: almost all of our testing has been on 4.4 with a
> bunch of these patches backported.  I believe the same issue will still
> be there on mainline, but we're in release crunch mode and I haven't had
> a chance to test more fully.

What's weird is that my reproducer hits this very frequently (>50% of
the time) on our internal kernel build, which is 4.6 + backports up to
4.12, but upstream 4.12-rc7 hits it much less frequently (~5% of the
time). Anyways, this is all getting messy in my head, so I'm just going
to go head down on this for a little while and see what I can come up
with. Thanks for the reply!

> -Jeff
> 
> >> Signed-off-by: Jeff Mahoney <jeffm@suse.com>
> >> ---
> >>  fs/btrfs/extent-tree.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> >> index 33d979e9ea2a..88b04742beea 100644
> >> --- a/fs/btrfs/extent-tree.c
> >> +++ b/fs/btrfs/extent-tree.c
> >> @@ -4377,7 +4377,7 @@ static int should_alloc_chunk(struct btrfs_fs_info *fs_info,
> >>  {
> >>  	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
> >>  	u64 num_bytes = sinfo->total_bytes - sinfo->bytes_readonly;
> >> -	u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved;
> >> +	u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved + sinfo->bytes_pinned;
> >>  	u64 thresh;
> >>  
> >>  	if (force == CHUNK_ALLOC_FORCE)
> >> -- 
> >> 2.11.0
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> 
> 
> -- 
> Jeff Mahoney
> SUSE Labs
> 



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nikolay Borisov June 29, 2017, 8:25 p.m. UTC | #4
On 29.06.2017 22:49, Jeff Mahoney wrote:
> Our test environment is just installing the OS.  That means lots of
> creates, writes, and then renames, so there's a fair amount of metadata
> churn that results in elevated pinned_bytes.  Rsync can cause the same
> workload pretty easily too.  Nikolay was going to look into coming up
> with a configuration for fsstress that would emulate it.

I did experiment with fsstress -f rename=65 -f write=35 but this thing
just exhausted the filesystem completely, no premature ENOSPC. I also
tried doing just rename on a fs which has around 1gb free space and
again the usage was steadily incrased but no enospc was observed ;\
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Sterba July 10, 2017, 5:23 p.m. UTC | #5
On Thu, Jun 22, 2017 at 09:51:47AM -0400, jeffm@suse.com wrote:
> From: Jeff Mahoney <jeffm@suse.com>
> 
> In a heavy write scenario, we can end up with a large number of pinned bytes.
> This can translate into (very) premature ENOSPC because pinned bytes
> must be accounted for when allowing a reservation but aren't accounted for
> when deciding whether to create a new chunk.
> 
> This patch adds the accounting to should_alloc_chunk so that we can
> create the chunk.
> 
> Signed-off-by: Jeff Mahoney <jeffm@suse.com>

I'm adding the two patches to for-next. More reviews welcome.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 33d979e9ea2a..88b04742beea 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4377,7 +4377,7 @@  static int should_alloc_chunk(struct btrfs_fs_info *fs_info,
 {
 	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
 	u64 num_bytes = sinfo->total_bytes - sinfo->bytes_readonly;
-	u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved;
+	u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved + sinfo->bytes_pinned;
 	u64 thresh;
 
 	if (force == CHUNK_ALLOC_FORCE)