diff mbox

btrfs metadata reclaim behavior/performance characteristics

Message ID 20170518214724.GA10554@lim.localdomain (mailing list archive)
State New, archived
Headers show

Commit Message

Liu Bo May 18, 2017, 9:47 p.m. UTC
On Thu, May 18, 2017 at 06:41:08PM +0300, Nikolay Borisov wrote:
> 
> 
> On 18.05.2017 17:45, Liu Bo wrote:
> > On Thu, May 18, 2017 at 11:40:05AM +0300, Nikolay Borisov wrote:
> 
> <ommitted for brevity>
> 
> > 
> > Just some random thoughts here.
> > 
> > Hmm, not sure if this matters, but fstests now doesn't set --mixed even if the
> > disk size is as small as 256mb.  So are you testing a mixed btrfs or not?
> 
> You are right that fstest only sets mixed mode in case the filesystem is
> less than or equal to 100mb. IN this case the fs is 256mb which means
> mixed mode _is not_ set.
> 
> > 
> > So now we've observed there're too many 'commit transaction' happening, I think it's because via commiting transaction it doesn't reclaim enough metadata space, esp. looks like space->bytes_may_use is not reduced somehow.
> 
> You've given me the idea to basically compare the state of the various
> space_info counters after each transaction commit. Before and after the
> ticketed work. Also what makes you believe it's the commit transaction
> itself not freeing enough memory and not some of the other, "cheaper"
> flush states?
>

Firstly, commit transaction is a checkpoint, where all queued delayed refs are
supposed to be processed and this's one of the factors to update space_info
counters (bytes_may_use -> reserve, pinned -> bytes_used), secondly, it's the
last resort, if any cheaper flush works, ie. offers enough space to tickets, it
won't commit, will it?

> > 
> > The metadata space_info->bytes_may_use may come from:
> > 1) 1K file with buffered IO ends up living in btree leaf, so it will contribute to the number,
> > 2) if it's mixed btrfs, then 1k file with direct IO ends up with creating a 4k extent in mixed block group.
> > 3) while writing 1k files, metadata is reserved to make it run, and when to release depends on writeback (in the buffered IO case) or endio (in the direct IO case)
> > 
> > When running several commit transaction concurrently, if one has entered TRANS_STATE_COMMIT_START state, others just wait there, have you observed that if each commit transaction actually writes superblock in the end?
> 
> Haven't gone that far, will have to instrument the code to confirm this.
> What exactly should writing the superblock reveal?
>

On a second thinking, it shouldn't matter whether they go to the part of writing
superblock.

Commit transaction is costly for sure, I think we're trying to find
why there're so many commit transaction with ticket patch set.

In btrfs_async_reclaim_metadata_space(), flush_state++ happens when one ticket is not satisfied, and we do commit transaction only if there're enough pinned bytes, but what I don't understand is the link shows 'pinned' is 0 and I spotted something wrong (shown in the included patch).

Thanks,

-liubo

From: Liu Bo <bo.li.liu@oracle.com>

Subject: [PATCH] Btrfs: skip commit transaction if we don't have enough pinned bytes

We commit transaction in order to reclaim space from pinned bytes because it could process delayed refs, and in may_commit_transaction(), we check first if pinned bytes are enough for the required space, we then check if that plus bytes reserved for delayed insert are enough for the required space.

This changes the code to the above logic.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
 fs/btrfs/extent-tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Nikolay Borisov May 19, 2017, 9:54 a.m. UTC | #1
> From: Liu Bo <bo.li.liu@oracle.com>
> 
> Subject: [PATCH] Btrfs: skip commit transaction if we don't have enough pinned bytes
> 
> We commit transaction in order to reclaim space from pinned bytes because it could process delayed refs, and in may_commit_transaction(), we check first if pinned bytes are enough for the required space, we then check if that plus bytes reserved for delayed insert are enough for the required space.
> 
> This changes the code to the above logic.
> 
> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> ---
>  fs/btrfs/extent-tree.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index e390451c72e6..bded1ddd1bb6 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4837,7 +4837,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
>  
> 	spin_lock(&delayed_rsv->lock);
>  	if (percpu_counter_compare(&space_info->total_bytes_pinned,
> -			   bytes - delayed_rsv->size) >= 0) {
> +			   	   		         bytes - delayed_rsv->size) < 0) {
>  							       spin_unlock(&delayed_rsv->lock);
> 								return -ENOSPC;
>  								}
> 

Your patch does make a very big difference. Here are a couple of runs of
slow-rm:



root@ubuntu-virtual:~# ./slow-rm.sh
Created 837 files before returning error, time taken 3
Created 920 files before returning error, time taken 3
Created 949 files before returning error, time taken 3
Created 930 files before returning error, time taken 3
Created 1101 files before returning error, time taken 4
Created 1082 files before returning error, time taken 4
Created 1608 files before returning error, time taken 5
Created 1735 files before returning error, time taken 5
rming took 1 seconds

root@ubuntu-virtual:~# ./slow-rm.sh
Created 801 files before returning error, time taken 3
Created 829 files before returning error, time taken 3
Created 983 files before returning error, time taken 3
Created 978 files before returning error, time taken 3
Created 1023 files before returning error, time taken 3
Created 1126 files before returning error, time taken 3
Created 1538 files before returning error, time taken 4
Created 1737 files before returning error, time taken 5
rming took 2 seconds

root@ubuntu-virtual:~# ./slow-rm.sh
Created 875 files before returning error, time taken 3
Created 891 files before returning error, time taken 3
Created 969 files before returning error, time taken 4
Created 1002 files before returning error, time taken 4
Created 1039 files before returning error, time taken 4
Created 1051 files before returning error, time taken 4
Created 1191 files before returning error, time taken 4
Created 2137 files before returning error, time taken 8
rming took 2 seconds

So rming is a lot faster, but we create less files all in all and get
ENOSPC earlier. This means that most of the time bytes_pinned is not
enough to satisfy the allocation hence we are hitting the second
percpu_counter comparison.

Also, the reason why the previous links showed 0 for bytes_pinned was
due to me having completely forgotten that bytes_pinned is a percpu
counter, hence my stap script wasn't actually reading it correctly.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liu Bo May 19, 2017, 6:32 p.m. UTC | #2
On Fri, May 19, 2017 at 12:54:59PM +0300, Nikolay Borisov wrote:
> > From: Liu Bo <bo.li.liu@oracle.com>
> > 
> > Subject: [PATCH] Btrfs: skip commit transaction if we don't have enough pinned bytes
> > 
> > We commit transaction in order to reclaim space from pinned bytes because it could process delayed refs, and in may_commit_transaction(), we check first if pinned bytes are enough for the required space, we then check if that plus bytes reserved for delayed insert are enough for the required space.
> > 
> > This changes the code to the above logic.
> > 
> > Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> > ---
> >  fs/btrfs/extent-tree.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index e390451c72e6..bded1ddd1bb6 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -4837,7 +4837,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
> >  
> > 	spin_lock(&delayed_rsv->lock);
> >  	if (percpu_counter_compare(&space_info->total_bytes_pinned,
> > -			   bytes - delayed_rsv->size) >= 0) {
> > +			   	   		         bytes - delayed_rsv->size) < 0) {
> >  							       spin_unlock(&delayed_rsv->lock);
> > 								return -ENOSPC;
> >  								}
> > 
> 
> Your patch does make a very big difference. Here are a couple of runs of
> slow-rm:
> 
> 
> 
> root@ubuntu-virtual:~# ./slow-rm.sh
> Created 837 files before returning error, time taken 3
> Created 920 files before returning error, time taken 3
> Created 949 files before returning error, time taken 3
> Created 930 files before returning error, time taken 3
> Created 1101 files before returning error, time taken 4
> Created 1082 files before returning error, time taken 4
> Created 1608 files before returning error, time taken 5
> Created 1735 files before returning error, time taken 5
> rming took 1 seconds
> 
> root@ubuntu-virtual:~# ./slow-rm.sh
> Created 801 files before returning error, time taken 3
> Created 829 files before returning error, time taken 3
> Created 983 files before returning error, time taken 3
> Created 978 files before returning error, time taken 3
> Created 1023 files before returning error, time taken 3
> Created 1126 files before returning error, time taken 3
> Created 1538 files before returning error, time taken 4
> Created 1737 files before returning error, time taken 5
> rming took 2 seconds
> 
> root@ubuntu-virtual:~# ./slow-rm.sh
> Created 875 files before returning error, time taken 3
> Created 891 files before returning error, time taken 3
> Created 969 files before returning error, time taken 4
> Created 1002 files before returning error, time taken 4
> Created 1039 files before returning error, time taken 4
> Created 1051 files before returning error, time taken 4
> Created 1191 files before returning error, time taken 4
> Created 2137 files before returning error, time taken 8
> rming took 2 seconds
> 
> So rming is a lot faster, but we create less files all in all and get
> ENOSPC earlier. This means that most of the time bytes_pinned is not
> enough to satisfy the allocation hence we are hitting the second
> percpu_counter comparison.
>

Right, it's sort of my expected bahavior because all 1K buffered IO ends up
being inline extent, it's likely to run out of metadata space very soon.

> Also, the reason why the previous links showed 0 for bytes_pinned was
> due to me having completely forgotten that bytes_pinned is a percpu
> counter, hence my stap script wasn't actually reading it correctly.

I see, bytes_pinned in space_info is different from the percpu one, they're
updated at different time, but overall the percpu one is the the preciser
counter.

-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nikolay Borisov May 21, 2017, 12:45 p.m. UTC | #3
On 19.05.2017 21:32, Liu Bo wrote:
> On Fri, May 19, 2017 at 12:54:59PM +0300, Nikolay Borisov wrote:
>>> From: Liu Bo <bo.li.liu@oracle.com>
>>>
>>> Subject: [PATCH] Btrfs: skip commit transaction if we don't have enough pinned bytes
>>>
>>> We commit transaction in order to reclaim space from pinned bytes because it could process delayed refs, and in may_commit_transaction(), we check first if pinned bytes are enough for the required space, we then check if that plus bytes reserved for delayed insert are enough for the required space.
>>>
>>> This changes the code to the above logic.
>>>
>>> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
>>> ---
>>>  fs/btrfs/extent-tree.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index e390451c72e6..bded1ddd1bb6 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -4837,7 +4837,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
>>>  
>>> 	spin_lock(&delayed_rsv->lock);
>>>  	if (percpu_counter_compare(&space_info->total_bytes_pinned,
>>> -			   bytes - delayed_rsv->size) >= 0) {
>>> +			   	   		         bytes - delayed_rsv->size) < 0) {
>>>  							       spin_unlock(&delayed_rsv->lock);
>>> 								return -ENOSPC;
>>>  								}
>>>
>>
>> Your patch does make a very big difference. Here are a couple of runs of
>> slow-rm:
>>
>>
>>
>> root@ubuntu-virtual:~# ./slow-rm.sh
>> Created 837 files before returning error, time taken 3
>> Created 920 files before returning error, time taken 3
>> Created 949 files before returning error, time taken 3
>> Created 930 files before returning error, time taken 3
>> Created 1101 files before returning error, time taken 4
>> Created 1082 files before returning error, time taken 4
>> Created 1608 files before returning error, time taken 5
>> Created 1735 files before returning error, time taken 5
>> rming took 1 seconds
>>
>> root@ubuntu-virtual:~# ./slow-rm.sh
>> Created 801 files before returning error, time taken 3
>> Created 829 files before returning error, time taken 3
>> Created 983 files before returning error, time taken 3
>> Created 978 files before returning error, time taken 3
>> Created 1023 files before returning error, time taken 3
>> Created 1126 files before returning error, time taken 3
>> Created 1538 files before returning error, time taken 4
>> Created 1737 files before returning error, time taken 5
>> rming took 2 seconds
>>
>> root@ubuntu-virtual:~# ./slow-rm.sh
>> Created 875 files before returning error, time taken 3
>> Created 891 files before returning error, time taken 3
>> Created 969 files before returning error, time taken 4
>> Created 1002 files before returning error, time taken 4
>> Created 1039 files before returning error, time taken 4
>> Created 1051 files before returning error, time taken 4
>> Created 1191 files before returning error, time taken 4
>> Created 2137 files before returning error, time taken 8
>> rming took 2 seconds
>>
>> So rming is a lot faster, but we create less files all in all and get
>> ENOSPC earlier. This means that most of the time bytes_pinned is not
>> enough to satisfy the allocation hence we are hitting the second
>> percpu_counter comparison.
>>
> 
> Right, it's sort of my expected bahavior because all 1K buffered IO ends up
> being inline extent, it's likely to run out of metadata space very soon.

Are you going to send this as an official patch to the ML ?

> 
>> Also, the reason why the previous links showed 0 for bytes_pinned was
>> due to me having completely forgotten that bytes_pinned is a percpu
>> counter, hence my stap script wasn't actually reading it correctly.
> 
> I see, bytes_pinned in space_info is different from the percpu one, they're
> updated at different time, but overall the percpu one is the the preciser
> counter.
> 
> -liubo
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liu Bo May 22, 2017, 10:57 p.m. UTC | #4
On Sun, May 21, 2017 at 03:45:02PM +0300, Nikolay Borisov wrote:
> 
> 
> On 19.05.2017 21:32, Liu Bo wrote:
> > On Fri, May 19, 2017 at 12:54:59PM +0300, Nikolay Borisov wrote:
> >>> From: Liu Bo <bo.li.liu@oracle.com>
> >>>
> >>> Subject: [PATCH] Btrfs: skip commit transaction if we don't have enough pinned bytes
> >>>
> >>> We commit transaction in order to reclaim space from pinned bytes because it could process delayed refs, and in may_commit_transaction(), we check first if pinned bytes are enough for the required space, we then check if that plus bytes reserved for delayed insert are enough for the required space.
> >>>
> >>> This changes the code to the above logic.
> >>>
> >>> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> >>> ---
> >>>  fs/btrfs/extent-tree.c | 2 +-
> >>>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>>
> >>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> >>> index e390451c72e6..bded1ddd1bb6 100644
> >>> --- a/fs/btrfs/extent-tree.c
> >>> +++ b/fs/btrfs/extent-tree.c
> >>> @@ -4837,7 +4837,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
> >>>  
> >>> 	spin_lock(&delayed_rsv->lock);
> >>>  	if (percpu_counter_compare(&space_info->total_bytes_pinned,
> >>> -			   bytes - delayed_rsv->size) >= 0) {
> >>> +			   	   		         bytes - delayed_rsv->size) < 0) {
> >>>  							       spin_unlock(&delayed_rsv->lock);
> >>> 								return -ENOSPC;
> >>>  								}
> >>>
> >>
> >> Your patch does make a very big difference. Here are a couple of runs of
> >> slow-rm:
> >>
> >>
> >>
> >> root@ubuntu-virtual:~# ./slow-rm.sh
> >> Created 837 files before returning error, time taken 3
> >> Created 920 files before returning error, time taken 3
> >> Created 949 files before returning error, time taken 3
> >> Created 930 files before returning error, time taken 3
> >> Created 1101 files before returning error, time taken 4
> >> Created 1082 files before returning error, time taken 4
> >> Created 1608 files before returning error, time taken 5
> >> Created 1735 files before returning error, time taken 5
> >> rming took 1 seconds
> >>
> >> root@ubuntu-virtual:~# ./slow-rm.sh
> >> Created 801 files before returning error, time taken 3
> >> Created 829 files before returning error, time taken 3
> >> Created 983 files before returning error, time taken 3
> >> Created 978 files before returning error, time taken 3
> >> Created 1023 files before returning error, time taken 3
> >> Created 1126 files before returning error, time taken 3
> >> Created 1538 files before returning error, time taken 4
> >> Created 1737 files before returning error, time taken 5
> >> rming took 2 seconds
> >>
> >> root@ubuntu-virtual:~# ./slow-rm.sh
> >> Created 875 files before returning error, time taken 3
> >> Created 891 files before returning error, time taken 3
> >> Created 969 files before returning error, time taken 4
> >> Created 1002 files before returning error, time taken 4
> >> Created 1039 files before returning error, time taken 4
> >> Created 1051 files before returning error, time taken 4
> >> Created 1191 files before returning error, time taken 4
> >> Created 2137 files before returning error, time taken 8
> >> rming took 2 seconds
> >>
> >> So rming is a lot faster, but we create less files all in all and get
> >> ENOSPC earlier. This means that most of the time bytes_pinned is not
> >> enough to satisfy the allocation hence we are hitting the second
> >> percpu_counter comparison.
> >>
> > 
> > Right, it's sort of my expected bahavior because all 1K buffered IO ends up
> > being inline extent, it's likely to run out of metadata space very soon.
> 
> Are you going to send this as an official patch to the ML ?
> 

Yes, here is the link,
https://patchwork.kernel.org/patch/9737947/

Thanks,
-liubo
> > 
> >> Also, the reason why the previous links showed 0 for bytes_pinned was
> >> due to me having completely forgotten that bytes_pinned is a percpu
> >> counter, hence my stap script wasn't actually reading it correctly.
> > 
> > I see, bytes_pinned in space_info is different from the percpu one, they're
> > updated at different time, but overall the percpu one is the the preciser
> > counter.
> > 
> > -liubo
> > 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e390451c72e6..bded1ddd1bb6 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4837,7 +4837,7 @@  static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 
	spin_lock(&delayed_rsv->lock);
 	if (percpu_counter_compare(&space_info->total_bytes_pinned,
-			   bytes - delayed_rsv->size) >= 0) {
+			   	   		         bytes - delayed_rsv->size) < 0) {
 							       spin_unlock(&delayed_rsv->lock);
								return -ENOSPC;
 								}