Message ID: 1534967513-117382-1-git-send-email-bo.liu@linux.alibaba.com (mailing list archive)
Series: rb_first to rb_first_cached conversion
On 2018/8/23 3:51 AM, Liu Bo wrote:
> Several structs in btrfs are using rb_first() in a while loop, it'd be
> more efficient to do this with rb_first_cached() which has the O(1)
> complexity.

I'm a little curious about this rb_first()-inside-a-loop usage. What I
can find is mostly deletion + iteration code for an rb_tree, like
btrfs_free_qgroup_config() or btrfs_qgroup_account_extents().

I'm wondering whether there is any special way to make such usage
faster, or whether rb_root_cached is the only solution here.

Thanks,
Qu

>
> This patch set updates five structs which may have a large rb tree in
> practice
>
> Liu Bo (5):
>   Btrfs: href_root: use rb_first_cached
>   Btrfs: href->ref_tree: use rb_first_cached
>   Btrfs: delayed_items: use rb_first_cached
>   Btrfs: extent_map: use rb_first_cached
>   Btrfs: preftree: use rb_first_cached
>
>  fs/btrfs/backref.c                | 34 +++++++++++++-------------
>  fs/btrfs/delayed-inode.c          | 29 +++++++++++++----------
>  fs/btrfs/delayed-inode.h          |  4 ++--
>  fs/btrfs/delayed-ref.c            | 50 +++++++++++++++++++++++----------------
>  fs/btrfs/delayed-ref.h            |  4 ++--
>  fs/btrfs/disk-io.c                |  8 +++----
>  fs/btrfs/extent-tree.c            | 19 ++++++++-------
>  fs/btrfs/extent_map.c             | 27 +++++++++++----------
>  fs/btrfs/extent_map.h             |  2 +-
>  fs/btrfs/inode.c                  |  4 ++--
>  fs/btrfs/tests/extent-map-tests.c |  4 ++--
>  fs/btrfs/transaction.c            |  5 ++--
>  fs/btrfs/volumes.c                |  5 ++--
>  13 files changed, 107 insertions(+), 88 deletions(-)
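[Editor's illustration] The deletion + iteration shape Qu describes can be sketched in userspace C. This is a hedged sketch, not kernel code: a plain unbalanced BST stands in for the kernel rbtree, and the bst_* names are invented for illustration.

```c
#include <assert.h>
#include <stdlib.h>

struct node {
	long key;
	struct node *left, *right;
};

/* Insert without any balancing -- enough to demonstrate the pattern. */
static struct node *bst_insert(struct node *root, long key)
{
	struct node *n = calloc(1, sizeof(*n));
	struct node **link = &root;

	n->key = key;
	while (*link)
		link = key < (*link)->key ? &(*link)->left : &(*link)->right;
	*link = n;
	return root;
}

/* Analogue of rb_first(): walk the left spine, O(depth) on every call. */
static struct node *bst_first(struct node *root)
{
	while (root && root->left)
		root = root->left;
	return root;
}

/* Unlink the leftmost node (it has no left child); return the new root. */
static struct node *bst_erase_min(struct node *root)
{
	if (!root->left)
		return root->right;
	root->left = bst_erase_min(root->left);
	return root;
}

/*
 * The deletion + iteration pattern: look up the first node, erase it,
 * free it, repeat.  bst_first() re-walks the left spine on every pass,
 * which is exactly the per-iteration cost that caching the first node
 * is meant to avoid.  Returns the sum of freed keys for verification.
 */
static long drain(struct node *root)
{
	long sum = 0;

	while (root) {
		struct node *min = bst_first(root);

		sum += min->key;
		root = bst_erase_min(root);
		free(min);
	}
	return sum;
}
```

Under this toy model, draining an n-node tree costs O(n log n) pointer chases just in the repeated first-node lookups.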
On 08/22/18 21:51, Liu Bo wrote:
> Several structs in btrfs are using rb_first() in a while loop, it'd be
> more efficient to do this with rb_first_cached() which has the O(1)
> complexity.
>
> This patch set updates five structs which may have a large rb tree in
> practice

Great idea, works for me with no bad side effects so far. So:

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>

cheers
Holger
On 22.08.2018 22:51, Liu Bo wrote:
> Several structs in btrfs are using rb_first() in a while loop, it'd be
> more efficient to do this with rb_first_cached() which has the O(1)
> complexity.
>
> This patch set updates five structs which may have a large rb tree in
> practice
>
> Liu Bo (5):
>   Btrfs: href_root: use rb_first_cached
>   Btrfs: href->ref_tree: use rb_first_cached
>   Btrfs: delayed_items: use rb_first_cached
>   Btrfs: extent_map: use rb_first_cached
>   Btrfs: preftree: use rb_first_cached
>
>  fs/btrfs/backref.c                | 34 +++++++++++++-------------
>  fs/btrfs/delayed-inode.c          | 29 +++++++++++++----------
>  fs/btrfs/delayed-inode.h          |  4 ++--
>  fs/btrfs/delayed-ref.c            | 50 +++++++++++++++++++++++----------------
>  fs/btrfs/delayed-ref.h            |  4 ++--
>  fs/btrfs/disk-io.c                |  8 +++----
>  fs/btrfs/extent-tree.c            | 19 ++++++++-------
>  fs/btrfs/extent_map.c             | 27 +++++++++++----------
>  fs/btrfs/extent_map.h             |  2 +-
>  fs/btrfs/inode.c                  |  4 ++--
>  fs/btrfs/tests/extent-map-tests.c |  4 ++--
>  fs/btrfs/transaction.c            |  5 ++--
>  fs/btrfs/volumes.c                |  5 ++--
>  13 files changed, 107 insertions(+), 88 deletions(-)

Do you have any performance data proving this is in fact helping?
On Thu, Aug 23, 2018 at 03:51:48AM +0800, Liu Bo wrote:
> Several structs in btrfs are using rb_first() in a while loop, it'd be
> more efficient to do this with rb_first_cached() which has the O(1)
> complexity.

We'd like to see some numbers, though just by reading the code I think
there's going to be a noticeable improvement for some workloads.

There's a common pattern:

  while (node = rb_first) {
          entry = rb_entry(node)
          next = rb_next(node)
          rb_erase(node)
          cleanup(entry)
  }

rb_first needs to traverse the tree up to logN depth, rb_erase can
completely reshuffle the tree. With the caching we'll skip the traversal
in rb_first. That's a cached memory access vs looped pointer dereference
trade-off that IMHO has a clear winner.

As the pattern in many cases traverses the whole tree and removes all
entries, ideally we could get rid of the rebalancing too. The entries
would be cleaned up in a postorder way, ie. reset the data and kfree in
the end. This requires that the order of the freeing does not matter,
which might not apply to some trees.

I did not find suitable rbtree functions or helpers for that,
rbtree_postorder_for_each_entry_safe does not allow rb_erase, and
rb_erase itself forces the rebalancing internally. But I think we can
implement that.

We could possibly use rb_next_postorder, which makes sure that all
children are traversed before the parent, so this is at least something
that could be considered.

In cases where the order of rbnode processing must be preserved, ie. the
rb_erase must be there, the cached rb tree is all we can do.
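[Editor's illustration] The caching idea above can be sketched in userspace C. This is a hedged sketch of the rb_root_cached/rb_erase_cached idea, not the kernel implementation: a toy unbalanced BST with parent pointers stands in for the rbtree, and the cached_tree/tree_* names are invented for illustration.

```c
#include <assert.h>
#include <stdlib.h>

struct node {
	long key;
	struct node *left, *right, *parent;
};

/* Analogue of rb_root_cached: the root plus a cached leftmost pointer. */
struct cached_tree {
	struct node *root;
	struct node *leftmost;
};

static void tree_insert(struct cached_tree *t, long key)
{
	struct node *n = calloc(1, sizeof(*n));
	struct node **link = &t->root, *parent = NULL;

	n->key = key;
	while (*link) {
		parent = *link;
		link = key < parent->key ? &parent->left : &parent->right;
	}
	n->parent = parent;
	*link = n;
	/* Keep the first-node cache up to date on insert. */
	if (!t->leftmost || key < t->leftmost->key)
		t->leftmost = n;
}

/* In-order successor, using parent pointers. */
static struct node *tree_next(struct node *n)
{
	struct node *p;

	if (n->right) {
		n = n->right;
		while (n->left)
			n = n->left;
		return n;
	}
	while ((p = n->parent) && n == p->right)
		n = p;
	return p;
}

/*
 * Detach the cached leftmost node and return it.  No descent from the
 * root is needed -- the cache hands us the node directly, and the new
 * leftmost is the old one's successor (as rb_erase_cached maintains it).
 */
static struct node *tree_pop_first(struct cached_tree *t)
{
	struct node *min = t->leftmost;
	struct node **link;

	if (!min)
		return NULL;
	t->leftmost = tree_next(min);
	/* min has no left child; splice its right child into its place. */
	link = min->parent ? &min->parent->left : &t->root;
	*link = min->right;
	if (min->right)
		min->right->parent = min->parent;
	return min;
}
```

Draining with tree_pop_first() then follows the same while-loop shape as the pattern above, but pays a pointer read per iteration instead of a left-spine walk.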
On Tue, Sep 11, 2018 at 05:34:03PM +0200, David Sterba wrote:
> On Thu, Aug 23, 2018 at 03:51:48AM +0800, Liu Bo wrote:
> > Several structs in btrfs are using rb_first() in a while loop, it'd be
> > more efficient to do this with rb_first_cached() which has the O(1)
> > complexity.
>
> We'd like to see some numbers, though just by reading the code I think
> there's going to be a noticeable improvement for some workloads.
>
> There's a common pattern:
>
>   while (node = rb_first) {
>           entry = rb_entry(node)
>           next = rb_next(node)
>           rb_erase(node)
>           cleanup(entry)
>   }
>
> rb_first needs to traverse the tree up to logN depth, rb_erase can
> completely reshuffle the tree. With the caching we'll skip the traversal
> in rb_first. That's a cached memory access vs looped pointer dereference
> trade-off that IMHO has a clear winner.
>
> As the pattern in many cases traverses the whole tree and removes all
> entries, ideally we could get rid of the rebalancing too. The entries
> would be cleaned up in a postorder way, ie. reset the data and kfree in
> the end. This requires that the order of the freeing does not matter,
> which might not apply to some trees.

The order of freeing does not matter, but we need the tree to return the
correct next node to free; IOW, we have to maintain the rbtree until the
last two nodes.

> I did not find suitable rbtree functions or helpers for that,
> rbtree_postorder_for_each_entry_safe does not allow rb_erase, and
> rb_erase itself forces the rebalancing internally. But I think we can
> implement that.
>
> We could possibly use rb_next_postorder, which makes sure that all
> children are traversed before the parent, so this is at least something
> that could be considered.

Qu was inquiring about the same thing; I haven't got time to dig into it
further.

The question we need to answer is whether we care about how fast the
destruction can be, as some of these are on critical paths while some
are not.

thanks,
-liubo

> In cases where the order of rbnode processing must be preserved, ie. the
> rb_erase must be there, the cached rb tree is all we can do.
On Tue, Sep 11, 2018 at 11:31:49AM -0700, Liu Bo wrote:
> On Tue, Sep 11, 2018 at 05:34:03PM +0200, David Sterba wrote:
> > As the pattern in many cases traverses the whole tree and removes all
> > entries, ideally we could get rid of the rebalancing too. The entries
> > would be cleaned up in a postorder way, ie. reset the data and kfree in
> > the end. This requires that the order of the freeing does not matter,
> > which might not apply to some trees.
>
> The order of freeing does not matter, but we need the tree to return the
> correct next node to free; IOW, we have to maintain the rbtree until the
> last two nodes.
>
> [...]
>
> Qu was inquiring about the same thing; I haven't got time to dig into it
> further.
>
> The question we need to answer is whether we care about how fast the
> destruction can be, as some of these are on critical paths while some
> are not.

Yeah, I take __btrfs_return_cluster_to_free_space as an example where
the more efficient tree destruction would shorten the time a lock is
held.

In other cases it might complicate the code too much, though the
performance is not critical, eg. at umount time. Although, it'd be good
to optimize that too now that we know how.

And in the remaining cases it's not possible to avoid rb_erase. I had a
closer look at btrfs_cleanup_defrag_inodes and btrfs_put_tree_mod_seq.
Based on that I'd like to add this series to for-next; the first-node
caching is clear enough and safe. We can do the other optimizations
later.
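[Editor's illustration] The "postorder teardown without rebalancing" idea discussed above can be sketched in userspace C. This is a hedged sketch, not the kernel's rbtree_postorder_for_each_entry_safe: a toy unbalanced BST stands in for the rbtree, and the bst_insert/free_postorder names are invented for illustration.

```c
#include <assert.h>
#include <stdlib.h>

struct node {
	long key;
	struct node *left, *right;
};

static struct node *bst_insert(struct node *root, long key)
{
	struct node *n = calloc(1, sizeof(*n));
	struct node **link = &root;

	n->key = key;
	while (*link)
		link = key < (*link)->key ? &(*link)->left : &(*link)->right;
	*link = n;
	return root;
}

/*
 * Postorder teardown: free both children before the parent, so no node
 * is touched after it is freed and no rb_erase/rebalancing is needed --
 * the tree is simply abandoned bottom-up.  Returns the number of nodes
 * freed for verification.  This is only valid when, as discussed above,
 * the order of freeing does not matter.
 */
static long free_postorder(struct node *n)
{
	long freed = 0;

	if (!n)
		return 0;
	freed += free_postorder(n->left);
	freed += free_postorder(n->right);
	free(n);
	return freed + 1;
}
```

The trade-off is exactly the one raised in the thread: this is O(n) with no rebalancing work at all, but the nodes are not visited in key order, so it only fits destruction paths that don't care about ordering.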
On Fri, Sep 14, 2018 at 05:11:02PM +0200, David Sterba wrote:
> On Tue, Sep 11, 2018 at 11:31:49AM -0700, Liu Bo wrote:
> [...]
> > The question we need to answer is whether we care about how fast the
> > destruction can be, as some of these are on critical paths while some
> > are not.
>
> Yeah, I take __btrfs_return_cluster_to_free_space as an example where
> the more efficient tree destruction would shorten the time a lock is
> held.
>
> In other cases it might complicate the code too much, though the
> performance is not critical, eg. at umount time. Although, it'd be good
> to optimize that too now that we know how.
>
> And in the remaining cases it's not possible to avoid rb_erase. I had a
> closer look at btrfs_cleanup_defrag_inodes and btrfs_put_tree_mod_seq.
> Based on that I'd like to add this series to for-next; the first-node
> caching is clear enough and safe. We can do the other optimizations
> later.

This needs more fine-grained debugging to see where the cost lies.

About the perf numbers for rb_first_cached, I measured the time spent in
the extent map loop in btrfs_evict_inode(),

  evict_inode_truncate_pages()
      while (!RB_EMPTY_ROOT(&map_tree->map)) {
              rb_first();
              rb_erase();
      }

For an em tree containing 10,000 ems,

- with rb_first_cached, the loop takes 4880454 ns,
- with rb_first, the loop takes 4584148 ns,

so the difference is not that much, ~6%. After all it's about
scalability: the bigger the rb tree gets, the more rb_first_cached()
wins.

thanks,
-liubo
On Fri, Sep 14, 2018 at 12:14:57PM -0700, Liu Bo wrote:
> On Fri, Sep 14, 2018 at 05:11:02PM +0200, David Sterba wrote:
> [...]
> This needs more fine-grained debugging to see where the cost lies.
>
> About the perf numbers for rb_first_cached, I measured the time spent in
> the extent map loop in btrfs_evict_inode(),
>
>   evict_inode_truncate_pages()
>       while (!RB_EMPTY_ROOT(&map_tree->map)) {
>               rb_first();
>               rb_erase();
>       }
>
> For an em tree containing 10,000 ems,

The maximum depth of the rbtree for this number of nodes is roughly
2 * lg(10000) = 2 * 14 = 28. This is not the average case so it's going
to be better in practice, but still 10+ pointer dereferences can be
exchanged for a single pointer update.

> - with rb_first_cached, the loop takes 4880454 ns,
> - with rb_first, the loop takes 4584148 ns,
>
> so the difference is not that much, ~6%. After all it's about
> scalability: the bigger the rb tree gets, the more rb_first_cached()
> wins.

Yeah, there's some difference, but I think the cache effects will make
the measurements hard to estimate, so I take the numbers as a sort of
confirmation that there's not going to be a big difference.

The delayed refs and delayed inode can get some speedup from that, as
there's usually only rb_first without rb_erase.

I'm going to update the first patch with the analysis we did and
reference it from the others, so we'll have a starting point for further
optimizations and can verify whether the described expected speedups
happen.
On Thu, Aug 23, 2018 at 03:51:48AM +0800, Liu Bo wrote:
> Several structs in btrfs are using rb_first() in a while loop, it'd be
> more efficient to do this with rb_first_cached() which has the O(1)
> complexity.
>
> This patch set updates five structs which may have a large rb tree in
> practice
>
> Liu Bo (5):
>   Btrfs: href_root: use rb_first_cached
>   Btrfs: href->ref_tree: use rb_first_cached
>   Btrfs: delayed_items: use rb_first_cached
>   Btrfs: extent_map: use rb_first_cached
>   Btrfs: preftree: use rb_first_cached

For the record, the patches have been merged to misc-next. I've changed
the subject titles to

  Btrfs: delayed-refs: use rb_first_cached for href_root
  Btrfs: delayed-refs: use rb_first_cached for ref_tree
  Btrfs: delayed-inode: use rb_first_cached for ins_root and del_root
  Btrfs: extent_map: use rb_first_cached
  Btrfs: preftree: use rb_first_cached

and added Holger's Tested-by and my Reviewed-by.
On Thu, Sep 20, 2018 at 06:49:59PM +0200, David Sterba wrote:
> On Thu, Aug 23, 2018 at 03:51:48AM +0800, Liu Bo wrote:
> [...]
>
> For the record, the patches have been merged to misc-next. I've changed
> the subject titles to
>
>   Btrfs: delayed-refs: use rb_first_cached for href_root
>   Btrfs: delayed-refs: use rb_first_cached for ref_tree
>   Btrfs: delayed-inode: use rb_first_cached for ins_root and del_root
>   Btrfs: extent_map: use rb_first_cached
>   Btrfs: preftree: use rb_first_cached
>
> and added Holger's Tested-by and my Reviewed-by.

Thanks so much for the efforts.

thanks,
-liubo