Message ID: 1534967513-117382-1-git-send-email-bo.liu@linux.alibaba.com (mailing list archive)
Series: rb_first to rb_first_cached conversion
On 2018/8/23 3:51 AM, Liu Bo wrote:
> Several structs in btrfs are using rb_first() in a while loop, it'd be
> more efficient to do this with rb_first_cached() which has the O(1)
> complexity.

I'm a little curious about this rb_first()-inside-a-loop usage. What I
can find is mostly deletion + iteration code for an rb_tree, like
btrfs_free_qgroup_config() or btrfs_qgroup_account_extents().

I'm wondering whether there is any special way to make such usage
faster, or whether rb_root_cached is the only solution here.

Thanks,
Qu

>
> This patch set updates five structs which may have a large rb tree in
> practice
>
> Liu Bo (5):
>   Btrfs: href_root: use rb_first_cached
>   Btrfs: href->ref_tree: use rb_first_cached
>   Btrfs: delayed_items: use rb_first_cached
>   Btrfs: extent_map: use rb_first_cached
>   Btrfs: preftree: use rb_first_cached
>
>  fs/btrfs/backref.c                | 34 +++++++++++++-------------
>  fs/btrfs/delayed-inode.c          | 29 +++++++++++++----------
>  fs/btrfs/delayed-inode.h          |  4 ++--
>  fs/btrfs/delayed-ref.c            | 50 +++++++++++++++++++++++----------------
>  fs/btrfs/delayed-ref.h            |  4 ++--
>  fs/btrfs/disk-io.c                |  8 +++----
>  fs/btrfs/extent-tree.c            | 19 ++++++++-------
>  fs/btrfs/extent_map.c             | 27 +++++++++++----------
>  fs/btrfs/extent_map.h             |  2 +-
>  fs/btrfs/inode.c                  |  4 ++--
>  fs/btrfs/tests/extent-map-tests.c |  4 ++--
>  fs/btrfs/transaction.c            |  5 ++--
>  fs/btrfs/volumes.c                |  5 ++--
>  13 files changed, 107 insertions(+), 88 deletions(-)
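[Editor's illustration] The deletion + iteration shape Qu describes can be sketched in userspace C. This is a hedged sketch, not kernel code: a plain unbalanced BST stands in for the kernel rbtree, and the bst_* names are invented for illustration.

```c
#include <assert.h>
#include <stdlib.h>

struct node {
	long key;
	struct node *left, *right;
};

/* Insert without any balancing -- enough to demonstrate the pattern. */
static struct node *bst_insert(struct node *root, long key)
{
	struct node *n = calloc(1, sizeof(*n));
	struct node **link = &root;

	n->key = key;
	while (*link)
		link = key < (*link)->key ? &(*link)->left : &(*link)->right;
	*link = n;
	return root;
}

/* Analogue of rb_first(): walk the left spine, O(depth) on every call. */
static struct node *bst_first(struct node *root)
{
	while (root && root->left)
		root = root->left;
	return root;
}

/* Unlink the leftmost node (it has no left child); return the new root. */
static struct node *bst_erase_min(struct node *root)
{
	if (!root->left)
		return root->right;
	root->left = bst_erase_min(root->left);
	return root;
}

/*
 * The deletion + iteration pattern: look up the first node, erase it,
 * free it, repeat.  bst_first() re-walks the left spine on every pass,
 * which is exactly the per-iteration cost that caching the first node
 * is meant to avoid.  Returns the sum of freed keys for verification.
 */
static long drain(struct node *root)
{
	long sum = 0;

	while (root) {
		struct node *min = bst_first(root);

		sum += min->key;
		root = bst_erase_min(root);
		free(min);
	}
	return sum;
}
```

Under this toy model, draining an n-node tree costs O(n log n) pointer chases just in the repeated first-node lookups.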
On 08/22/18 21:51, Liu Bo wrote:
> Several structs in btrfs are using rb_first() in a while loop, it'd be
> more efficient to do this with rb_first_cached() which has the O(1)
> complexity.
>
> This patch set updates five structs which may have a large rb tree in
> practice

Great idea, works for me with no bad side effects so far. So:

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>

cheers
Holger
On 22.08.2018 22:51, Liu Bo wrote:
> Several structs in btrfs are using rb_first() in a while loop, it'd be
> more efficient to do this with rb_first_cached() which has the O(1)
> complexity.
>
> This patch set updates five structs which may have a large rb tree in
> practice
>
> Liu Bo (5):
>   Btrfs: href_root: use rb_first_cached
>   Btrfs: href->ref_tree: use rb_first_cached
>   Btrfs: delayed_items: use rb_first_cached
>   Btrfs: extent_map: use rb_first_cached
>   Btrfs: preftree: use rb_first_cached
>
>  fs/btrfs/backref.c                | 34 +++++++++++++-------------
>  fs/btrfs/delayed-inode.c          | 29 +++++++++++++----------
>  fs/btrfs/delayed-inode.h          |  4 ++--
>  fs/btrfs/delayed-ref.c            | 50 +++++++++++++++++++++++----------------
>  fs/btrfs/delayed-ref.h            |  4 ++--
>  fs/btrfs/disk-io.c                |  8 +++----
>  fs/btrfs/extent-tree.c            | 19 ++++++++-------
>  fs/btrfs/extent_map.c             | 27 +++++++++++----------
>  fs/btrfs/extent_map.h             |  2 +-
>  fs/btrfs/inode.c                  |  4 ++--
>  fs/btrfs/tests/extent-map-tests.c |  4 ++--
>  fs/btrfs/transaction.c            |  5 ++--
>  fs/btrfs/volumes.c                |  5 ++--
>  13 files changed, 107 insertions(+), 88 deletions(-)

Do you have any performance data proving this is in fact helping?
On Thu, Aug 23, 2018 at 03:51:48AM +0800, Liu Bo wrote:
> Several structs in btrfs are using rb_first() in a while loop, it'd be
> more efficient to do this with rb_first_cached() which has the O(1)
> complexity.

We'd like to see some numbers, though just by reading the code I think
there's going to be a noticeable improvement for some workloads.

There's a common pattern:

  while (node = rb_first) {
          entry = rb_entry(node)
          next = rb_next(node)
          rb_erase(node)
          cleanup(entry)
  }

rb_first needs to traverse the tree up to logN depth, rb_erase can
completely reshuffle the tree. With the caching we'll skip the traversal
in rb_first. That's a cached memory access vs looped pointer dereference
trade-off that IMHO has a clear winner.

As the pattern in many cases traverses the whole tree and removes all
entries, ideally we could get rid of the rebalancing too. The entries
would be cleaned up in a postorder way, ie. reset the data and kfree in
the end. This requires that the order of the freeing does not matter,
which might not apply to some trees.

I did not find suitable rbtree functions or helpers for that,
rbtree_postorder_for_each_entry_safe does not allow rb_erase, and
rb_erase itself forces the rebalancing internally. But I think we can
implement that.

We could possibly use rb_next_postorder, which makes sure that all
children are traversed before the parent, so this is at least something
that could be considered.

In cases where the order of rbnode processing must be preserved, ie. the
rb_erase must be there, the cached rb tree is all we can do.
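[Editor's illustration] The caching idea above can be sketched in userspace C. This is a hedged sketch of the rb_root_cached/rb_erase_cached idea, not the kernel implementation: a toy unbalanced BST with parent pointers stands in for the rbtree, and the cached_tree/tree_* names are invented for illustration.

```c
#include <assert.h>
#include <stdlib.h>

struct node {
	long key;
	struct node *left, *right, *parent;
};

/* Analogue of rb_root_cached: the root plus a cached leftmost pointer. */
struct cached_tree {
	struct node *root;
	struct node *leftmost;
};

static void tree_insert(struct cached_tree *t, long key)
{
	struct node *n = calloc(1, sizeof(*n));
	struct node **link = &t->root, *parent = NULL;

	n->key = key;
	while (*link) {
		parent = *link;
		link = key < parent->key ? &parent->left : &parent->right;
	}
	n->parent = parent;
	*link = n;
	/* Keep the first-node cache up to date on insert. */
	if (!t->leftmost || key < t->leftmost->key)
		t->leftmost = n;
}

/* In-order successor, using parent pointers. */
static struct node *tree_next(struct node *n)
{
	struct node *p;

	if (n->right) {
		n = n->right;
		while (n->left)
			n = n->left;
		return n;
	}
	while ((p = n->parent) && n == p->right)
		n = p;
	return p;
}

/*
 * Detach the cached leftmost node and return it.  No descent from the
 * root is needed -- the cache hands us the node directly, and the new
 * leftmost is the old one's successor (as rb_erase_cached maintains it).
 */
static struct node *tree_pop_first(struct cached_tree *t)
{
	struct node *min = t->leftmost;
	struct node **link;

	if (!min)
		return NULL;
	t->leftmost = tree_next(min);
	/* min has no left child; splice its right child into its place. */
	link = min->parent ? &min->parent->left : &t->root;
	*link = min->right;
	if (min->right)
		min->right->parent = min->parent;
	return min;
}
```

Draining with tree_pop_first() then follows the same while-loop shape as the pattern above, but pays a pointer read per iteration instead of a left-spine walk.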
On Tue, Sep 11, 2018 at 05:34:03PM +0200, David Sterba wrote:
> On Thu, Aug 23, 2018 at 03:51:48AM +0800, Liu Bo wrote:
> > Several structs in btrfs are using rb_first() in a while loop, it'd be
> > more efficient to do this with rb_first_cached() which has the O(1)
> > complexity.
>
> We'd like to see some numbers, though just by reading the code I think
> there's going to be a noticeable improvement for some workloads.
>
> There's a common pattern:
>
>   while (node = rb_first) {
>           entry = rb_entry(node)
>           next = rb_next(node)
>           rb_erase(node)
>           cleanup(entry)
>   }
>
> rb_first needs to traverse the tree up to logN depth, rb_erase can
> completely reshuffle the tree. With the caching we'll skip the traversal
> in rb_first. That's a cached memory access vs looped pointer dereference
> trade-off that IMHO has a clear winner.
>
> As the pattern in many cases traverses the whole tree and removes all
> entries, ideally we could get rid of the rebalancing too. The entries
> would be cleaned up in a postorder way, ie. reset the data and kfree in
> the end. This requires that the order of the freeing does not matter,
> which might not apply to some trees.

The order of freeing does not matter, but we need the tree to return the
correct next node to free; IOW, we have to maintain the rbtree until the
last two nodes.

> I did not find suitable rbtree functions or helpers for that,
> rbtree_postorder_for_each_entry_safe does not allow rb_erase, and
> rb_erase itself forces the rebalancing internally. But I think we can
> implement that.
>
> We could possibly use rb_next_postorder, which makes sure that all
> children are traversed before the parent, so this is at least something
> that could be considered.

Qu was inquiring about the same thing; I haven't got time to dig into it
further.

The question we need to answer is whether we care about how fast the
destruction can be, as some of these are on critical paths while some
are not.

thanks,
-liubo

> In cases where the order of rbnode processing must be preserved, ie. the
> rb_erase must be there, the cached rb tree is all we can do.
On Tue, Sep 11, 2018 at 11:31:49AM -0700, Liu Bo wrote:
> On Tue, Sep 11, 2018 at 05:34:03PM +0200, David Sterba wrote:
> > As the pattern in many cases traverses the whole tree and removes all
> > entries, ideally we could get rid of the rebalancing too. The entries
> > would be cleaned up in a postorder way, ie. reset the data and kfree in
> > the end. This requires that the order of the freeing does not matter,
> > which might not apply to some trees.
>
> The order of freeing does not matter, but we need the tree to return the
> correct next node to free; IOW, we have to maintain the rbtree until the
> last two nodes.
>
> [...]
>
> Qu was inquiring about the same thing; I haven't got time to dig into it
> further.
>
> The question we need to answer is whether we care about how fast the
> destruction can be, as some of these are on critical paths while some
> are not.

Yeah, I take __btrfs_return_cluster_to_free_space as an example where
the more efficient tree destruction would shorten the time a lock is
held.

In other cases it might complicate the code too much, though the
performance is not critical, eg. at umount time. Although, it'd be good
to optimize that too now that we know how.

And in the remaining cases it's not possible to avoid rb_erase. I had a
closer look at btrfs_cleanup_defrag_inodes and btrfs_put_tree_mod_seq.
Based on that I'd like to add this series to for-next; the first-node
caching is clear enough and safe. We can do the other optimizations
later.
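[Editor's illustration] The "postorder teardown without rebalancing" idea discussed above can be sketched in userspace C. This is a hedged sketch, not the kernel's rbtree_postorder_for_each_entry_safe: a toy unbalanced BST stands in for the rbtree, and the bst_insert/free_postorder names are invented for illustration.

```c
#include <assert.h>
#include <stdlib.h>

struct node {
	long key;
	struct node *left, *right;
};

static struct node *bst_insert(struct node *root, long key)
{
	struct node *n = calloc(1, sizeof(*n));
	struct node **link = &root;

	n->key = key;
	while (*link)
		link = key < (*link)->key ? &(*link)->left : &(*link)->right;
	*link = n;
	return root;
}

/*
 * Postorder teardown: free both children before the parent, so no node
 * is touched after it is freed and no rb_erase/rebalancing is needed --
 * the tree is simply abandoned bottom-up.  Returns the number of nodes
 * freed for verification.  This is only valid when, as discussed above,
 * the order of freeing does not matter.
 */
static long free_postorder(struct node *n)
{
	long freed = 0;

	if (!n)
		return 0;
	freed += free_postorder(n->left);
	freed += free_postorder(n->right);
	free(n);
	return freed + 1;
}
```

The trade-off is exactly the one raised in the thread: this is O(n) with no rebalancing work at all, but the nodes are not visited in key order, so it only fits destruction paths that don't care about ordering.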
On Fri, Sep 14, 2018 at 05:11:02PM +0200, David Sterba wrote:
> On Tue, Sep 11, 2018 at 11:31:49AM -0700, Liu Bo wrote:
> [...]
> > The question we need to answer is whether we care about how fast the
> > destruction can be, as some of these are on critical paths while some
> > are not.
>
> Yeah, I take __btrfs_return_cluster_to_free_space as an example where
> the more efficient tree destruction would shorten the time a lock is
> held.
>
> In other cases it might complicate the code too much, though the
> performance is not critical, eg. at umount time. Although, it'd be good
> to optimize that too now that we know how.
>
> And in the remaining cases it's not possible to avoid rb_erase. I had a
> closer look at btrfs_cleanup_defrag_inodes and btrfs_put_tree_mod_seq.
> Based on that I'd like to add this series to for-next; the first-node
> caching is clear enough and safe. We can do the other optimizations
> later.

This needs more fine-grained debugging to see where the cost lies.

About the perf numbers for rb_first_cached, I measured the time spent in
the extent map loop in btrfs_evict_inode(),

  evict_inode_truncate_pages()
      while (!RB_EMPTY_ROOT(&map_tree->map)) {
              rb_first();
              rb_erase();
      }

For an em tree containing 10,000 ems,

- with rb_first_cached, the loop takes 4880454 ns,
- with rb_first, the loop takes 4584148 ns,

so the difference is not that much, ~6%. After all it's about
scalability: the bigger the rb tree gets, the more rb_first_cached()
wins.

thanks,
-liubo
On Fri, Sep 14, 2018 at 12:14:57PM -0700, Liu Bo wrote:
> On Fri, Sep 14, 2018 at 05:11:02PM +0200, David Sterba wrote:
> [...]
> This needs more fine-grained debugging to see where the cost lies.
>
> About the perf numbers for rb_first_cached, I measured the time spent in
> the extent map loop in btrfs_evict_inode(),
>
>   evict_inode_truncate_pages()
>       while (!RB_EMPTY_ROOT(&map_tree->map)) {
>               rb_first();
>               rb_erase();
>       }
>
> For an em tree containing 10,000 ems,

The maximum depth of the rbtree for this number of nodes is roughly
2 * lg(10000) = 2 * 14 = 28. This is not the average case so it's going
to be better in practice, but still 10+ pointer dereferences can be
exchanged for a single pointer update.

> - with rb_first_cached, the loop takes 4880454 ns,
> - with rb_first, the loop takes 4584148 ns,
>
> so the difference is not that much, ~6%. After all it's about
> scalability: the bigger the rb tree gets, the more rb_first_cached()
> wins.

Yeah, there's some difference, but I think the cache effects will make
the measurements hard to estimate, so I take the numbers as a sort of
confirmation that there's not going to be a big difference.

The delayed refs and delayed inode can get some speedup from that, as
there's usually only rb_first without rb_erase.

I'm going to update the first patch with the analysis we did and
reference it from the others, so we'll have a starting point for further
optimizations and can verify whether the described expected speedups
happen.
On Thu, Aug 23, 2018 at 03:51:48AM +0800, Liu Bo wrote:
> Several structs in btrfs are using rb_first() in a while loop, it'd be
> more efficient to do this with rb_first_cached() which has the O(1)
> complexity.
>
> This patch set updates five structs which may have a large rb tree in
> practice
>
> Liu Bo (5):
>   Btrfs: href_root: use rb_first_cached
>   Btrfs: href->ref_tree: use rb_first_cached
>   Btrfs: delayed_items: use rb_first_cached
>   Btrfs: extent_map: use rb_first_cached
>   Btrfs: preftree: use rb_first_cached

For the record, the patches have been merged to misc-next. I've changed
the subject titles to

  Btrfs: delayed-refs: use rb_first_cached for href_root
  Btrfs: delayed-refs: use rb_first_cached for ref_tree
  Btrfs: delayed-inode: use rb_first_cached for ins_root and del_root
  Btrfs: extent_map: use rb_first_cached
  Btrfs: preftree: use rb_first_cached

and added Holger's Tested-by and my Reviewed-by.
On Thu, Sep 20, 2018 at 06:49:59PM +0200, David Sterba wrote:
> On Thu, Aug 23, 2018 at 03:51:48AM +0800, Liu Bo wrote:
> [...]
>
> For the record, the patches have been merged to misc-next. I've changed
> the subject titles to
>
>   Btrfs: delayed-refs: use rb_first_cached for href_root
>   Btrfs: delayed-refs: use rb_first_cached for ref_tree
>   Btrfs: delayed-inode: use rb_first_cached for ins_root and del_root
>   Btrfs: extent_map: use rb_first_cached
>   Btrfs: preftree: use rb_first_cached
>
> and added Holger's Tested-by and my Reviewed-by.

Thanks so much for the efforts.

thanks,
-liubo