mbox series

[v4,0/6] Track node vacancy to reduce worst case allocation counts

Message ID 20250407184102.2155415-1-sidhartha.kumar@oracle.com (mailing list archive)
Headers show
Series Track node vacancy to reduce worst case allocation counts | expand

Message

Sidhartha Kumar April 7, 2025, 6:40 p.m. UTC
v3[3] -> v4:
  - fix bug reported by Vasily Gorbik by fixing condition for
    mast_overflow() and add test case for this condition. This fix
    was also tested on s390 by Vasily.

  - cleanup new_height variable usage in mas_spanning_rebalance()
    and add an additional test to make sure mt_height() is correct
    after a rebalance causes a tree's height to decrease per Liam.

v2[2] -> v3:
  - add r-b to patches 1,4, and 6
  
  - update function parameter comments in patch 2 

  - remove line that sets mas->depth in patch 2

  - remove redundant code for checking for a spanning write in patch 3

  - rewrite commit message of patch 5 for additonal context and clarity


v1[1] -> v2:
  - fix comment for vacant_height which refers to depth per Wei 

  - add a patch to reorder switch case statements in mas_prealloc_calc and
    mas_wr_store_entry

  - use sufficient height in spanning stores

  - modify patch 2 to use a counter to track ascending the tree rather
    than overloading mas->depth to have this function.

================ overview ========================
Currently, the maple tree preallocates the worst case number of nodes for
given store type by taking into account the whole height of the tree. This
comes from a worst case scenario of every node in the tree being full and
having to propagate node allocation upwards until we reach the root of the
tree. This can be optimized if there are vacancies in nodes that are at a
lower depth than the root node. This series implements tracking the level
at which there is a vacant node so we only need to allocate until this
level is reached, rather than always using the full height of the tree.
The ma_wr_state struct is modified to add a field which keeps track of the
vacant height and is updated during walks of the tree. This value is then
read in mas_prealloc_calc() when we decide how many nodes to allocate.

For rebalancing and spanning stores, we also need to track the lowest
height at which a node has 1 more entry than the minimum sufficient number
of entries. This is because rebalancing can cause a parent node to become
insufficient which results in further node allocations. In this case, we
need to use the sufficient height as the worst case rather than the vacant
height.

patch 1-2: preparatory patches
patch 3: implement vacant height tracking + update the tests
patch 4: support vacant height tracking for rebalancing writes
patch 5: implement sufficient height tracking
patch 6: reorder switch case statements

================ results =========================
Bpftrace was used to profile the allocation path for requesting new maple
nodes while running stress-ng mmap 120s. The histograms below represent
requests to kmem_cache_alloc_bulk() and show the count argument. This
represnts how many maple nodes the caller is requesting in
kmem_cache_alloc_bulk()

command: stress-ng --mmap 4 --timeout 120

mm-unstable 

@bulk_alloc_req:
[3, 4)                 4 |                                                    |
[4, 5)             54170 |@                                                   |
[5, 6)                 0 |                                                    |
[6, 7)            893057 |@@@@@@@@@@@@@@@@@@@@                                |
[7, 8)                 4 |                                                    |
[8, 9)           2230287 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[9, 10)            55811 |@                                                   |
[10, 11)           77834 |@                                                   |
[11, 12)               0 |                                                    |
[12, 13)         1368684 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                     |
[13, 14)               0 |                                                    |
[14, 15)               0 |                                                    |
[15, 16)          367197 |@@@@@@@@                                            |


@maple_node_total: 46,630,160
@total_vmas: 46184591

mm-unstable + this series

@bulk_alloc_req:
[2, 3)               198 |                                                    |
[3, 4)                 4 |                                                    |
[4, 5)                43 |                                                    |
[5, 6)                 0 |                                                    |
[6, 7)           1069503 |@@@@@@@@@@@@@@@@@@@@@                               |
[7, 8)                 4 |                                                    |
[8, 9)           2597268 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[9, 10)           472191 |@@@@@@@@@                                           |
[10, 11)          191904 |@@@                                                 |
[11, 12)               0 |                                                    |
[12, 13)          247316 |@@@@                                                |
[13, 14)               0 |                                                    |
[14, 15)               0 |                                                    |
[15, 16)           98769 |@                                                   |


@maple_node_total: 37,813,856
@total_vmas: 43493287

This represents a ~19% reduction in the number of bulk maple nodes allocated.

For more reproducible results, a historgram of the return value of
mas_prealloc_calc() is displayed while running the maple_tree_tests
whcih have a deterministic store pattern

mas_prealloc_calc() return value mm-unstable
1   :                                                    (12068)
3   :                                                    (11836)
5   : *****                                              (271192)
7   : ************************************************** (2329329)
9   : ***********                                        (534186)
10  :                                                    (435)
11  : ***************                                    (704306)
13  : ********                                           (409781)

mas_prealloc_calc() return value mm-unstable + this series
1   :                                                    (12070)
3   : ************************************************** (3548777)
5   : ********                                           (633458)
7   :                                                    (65081)
9   :                                                    (11224)
10  :                                                    (341)
11  :                                                    (2973)
13  :                                                    (68)

do_mmap latency was also measured for regressions:
command: stress-ng --mmap 4 --timeout 120

mm-unstable:
avg = 7162 nsecs, total: 16101821292 nsecs, count: 2248034

mm-unstable + this series:
avg = 6689 nsecs, total: 15135391764 nsecs, count: 2262726


stress-ng --mmap4 --timeout 120

with vacant_height:
stress-ng: info:  [257]                   21526312 Maple Tree Read                0.176 M/sec
stress-ng: info:  [257]                  339979348 Maple Tree Write               2.774 M/sec

without vacant_height:
stress-ng: info:  [8228]                   20968900 Maple Tree Read                0.171 M/sec
stress-ng: info:  [8228]                  312214648 Maple Tree Write               2.547 M/sec

This respresents an increase of ~3% read throughput and ~9% increase in write throughput.

[1]: https://lore.kernel.org/lkml/20241114170524.64391-1-sidhartha.kumar@oracle.com/T/
[2]: https://lore.kernel.org/lkml/20250221163610.578409-1-sidhartha.kumar@oracle.com/
[3]: https://lore.kernel.org/lkml/20250227204823.758784-1-sidhartha.kumar@oracle.com/
Sidhartha Kumar (6):
  maple_tree: convert mas_prealloc_calc() to take in a maple write state
  maple_tree: use height and depth consistently
  maple_tree: use vacant nodes to reduce worst case allocations
  maple_tree: break on convergence in mas_spanning_rebalance()
  maple_tree: add sufficient height
  maple_tree: reorder mas->store_type case statements

 include/linux/maple_tree.h       |   4 +
 lib/maple_tree.c                 | 189 ++++++++++++++++++-------------
 tools/testing/radix-tree/maple.c | 126 +++++++++++++++++++--
 3 files changed, 233 insertions(+), 86 deletions(-)

Comments

Sidhartha Kumar April 9, 2025, 5:51 p.m. UTC | #1
On 4/7/25 2:40 PM, Sidhartha Kumar wrote:
> v3[3] -> v4:
>    - fix bug reported by Vasily Gorbik by fixing condition for
>      mast_overflow() and add test case for this condition. This fix
>      was also tested on s390 by Vasily.
> 
>    - cleanup new_height variable usage in mas_spanning_rebalance()
>      and add an additional test to make sure mt_height() is correct
>      after a rebalance causes a tree's height to decrease per Liam.
> 
Hi Andrew,

Could the following fixup patch be applied to the end of this series? If 
you'd prefer a new version of the series I can do that as well.

Thanks,
Sid


commit 1be83a7dccbef0a5973a3723de5a4460270d90c7 (HEAD)
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Wed Apr 9 17:44:51 2025 +0000

     maple_tree: vacant_height series fixup

     l_mas.depth is no longer overloaded to track the height in
     mas_spanning_rebalance() so we can remove its initialization.

     Fix comments that refer to variables that track height as depth.

     Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>

diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
index 37dc9525dff6..9ef129038224 100644
--- a/include/linux/maple_tree.h
+++ b/include/linux/maple_tree.h
@@ -463,8 +463,8 @@ struct ma_wr_state {
         void __rcu **slots;             /* mas->node->slots pointer */
         void *entry;                    /* The entry to write */
         void *content;                  /* The existing entry that is 
being overwritten */
-       unsigned char vacant_height;    /* Depth of lowest node with 
free space */
-       unsigned char sufficient_height;/* Depth of lowest node with min 
sufficiency + 1 nodes */
+       unsigned char vacant_height;    /* Height of lowest node with 
free space */
+       unsigned char sufficient_height;/* Height of lowest node with 
min sufficiency + 1 nodes */
  };

  #define mas_lock(mas)           spin_lock(&((mas)->tree->ma_lock))
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 8425728e3c5a..affe979bd14d 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -2850,8 +2850,6 @@ static void mas_spanning_rebalance(struct ma_state 
*mas,
             unlikely(mast->bn->b_end <= mt_min_slots[mast->bn->type]))
                 mast_spanning_rebalance(mast);

-       l_mas.depth = 0;
-
         /*
          * Each level of the tree is examined and balanced, pushing 
data to the left or
          * right, or rebalancing against left or right nodes is 
employed to avoid

> v2[2] -> v3:
>    - add r-b to patches 1,4, and 6
>    
>    - update function parameter comments in patch 2
> 
>    - remove line that sets mas->depth in patch 2
> 
>    - remove redundant code for checking for a spanning write in patch 3
> 
>    - rewrite commit message of patch 5 for additonal context and clarity
> 
> 
> v1[1] -> v2:
>    - fix comment for vacant_height which refers to depth per Wei
> 
>    - add a patch to reorder switch case statements in mas_prealloc_calc and
>      mas_wr_store_entry
> 
>    - use sufficient height in spanning stores
> 
>    - modify patch 2 to use a counter to track ascending the tree rather
>      than overloading mas->depth to have this function.
> 
> ================ overview ========================
> Currently, the maple tree preallocates the worst case number of nodes for
> given store type by taking into account the whole height of the tree. This
> comes from a worst case scenario of every node in the tree being full and
> having to propagate node allocation upwards until we reach the root of the
> tree. This can be optimized if there are vacancies in nodes that are at a
> lower depth than the root node. This series implements tracking the level
> at which there is a vacant node so we only need to allocate until this
> level is reached, rather than always using the full height of the tree.
> The ma_wr_state struct is modified to add a field which keeps track of the
> vacant height and is updated during walks of the tree. This value is then
> read in mas_prealloc_calc() when we decide how many nodes to allocate.
> 
> For rebalancing and spanning stores, we also need to track the lowest
> height at which a node has 1 more entry than the minimum sufficient number
> of entries. This is because rebalancing can cause a parent node to become
> insufficient which results in further node allocations. In this case, we
> need to use the sufficient height as the worst case rather than the vacant
> height.
> 
> patch 1-2: preparatory patches
> patch 3: implement vacant height tracking + update the tests
> patch 4: support vacant height tracking for rebalancing writes
> patch 5: implement sufficient height tracking
> patch 6: reorder switch case statements
> 
> ================ results =========================
> Bpftrace was used to profile the allocation path for requesting new maple
> nodes while running stress-ng mmap 120s. The histograms below represent
> requests to kmem_cache_alloc_bulk() and show the count argument. This
> represnts how many maple nodes the caller is requesting in
> kmem_cache_alloc_bulk()
> 
> command: stress-ng --mmap 4 --timeout 120
> 
> mm-unstable
> 
> @bulk_alloc_req:
> [3, 4)                 4 |                                                    |
> [4, 5)             54170 |@                                                   |
> [5, 6)                 0 |                                                    |
> [6, 7)            893057 |@@@@@@@@@@@@@@@@@@@@                                |
> [7, 8)                 4 |                                                    |
> [8, 9)           2230287 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [9, 10)            55811 |@                                                   |
> [10, 11)           77834 |@                                                   |
> [11, 12)               0 |                                                    |
> [12, 13)         1368684 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                     |
> [13, 14)               0 |                                                    |
> [14, 15)               0 |                                                    |
> [15, 16)          367197 |@@@@@@@@                                            |
> 
> 
> @maple_node_total: 46,630,160
> @total_vmas: 46184591
> 
> mm-unstable + this series
> 
> @bulk_alloc_req:
> [2, 3)               198 |                                                    |
> [3, 4)                 4 |                                                    |
> [4, 5)                43 |                                                    |
> [5, 6)                 0 |                                                    |
> [6, 7)           1069503 |@@@@@@@@@@@@@@@@@@@@@                               |
> [7, 8)                 4 |                                                    |
> [8, 9)           2597268 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [9, 10)           472191 |@@@@@@@@@                                           |
> [10, 11)          191904 |@@@                                                 |
> [11, 12)               0 |                                                    |
> [12, 13)          247316 |@@@@                                                |
> [13, 14)               0 |                                                    |
> [14, 15)               0 |                                                    |
> [15, 16)           98769 |@                                                   |
> 
> 
> @maple_node_total: 37,813,856
> @total_vmas: 43493287
> 
> This represents a ~19% reduction in the number of bulk maple nodes allocated.
> 
> For more reproducible results, a historgram of the return value of
> mas_prealloc_calc() is displayed while running the maple_tree_tests
> whcih have a deterministic store pattern
> 
> mas_prealloc_calc() return value mm-unstable
> 1   :                                                    (12068)
> 3   :                                                    (11836)
> 5   : *****                                              (271192)
> 7   : ************************************************** (2329329)
> 9   : ***********                                        (534186)
> 10  :                                                    (435)
> 11  : ***************                                    (704306)
> 13  : ********                                           (409781)
> 
> mas_prealloc_calc() return value mm-unstable + this series
> 1   :                                                    (12070)
> 3   : ************************************************** (3548777)
> 5   : ********                                           (633458)
> 7   :                                                    (65081)
> 9   :                                                    (11224)
> 10  :                                                    (341)
> 11  :                                                    (2973)
> 13  :                                                    (68)
> 
> do_mmap latency was also measured for regressions:
> command: stress-ng --mmap 4 --timeout 120
> 
> mm-unstable:
> avg = 7162 nsecs, total: 16101821292 nsecs, count: 2248034
> 
> mm-unstable + this series:
> avg = 6689 nsecs, total: 15135391764 nsecs, count: 2262726
> 
> 
> stress-ng --mmap4 --timeout 120
> 
> with vacant_height:
> stress-ng: info:  [257]                   21526312 Maple Tree Read                0.176 M/sec
> stress-ng: info:  [257]                  339979348 Maple Tree Write               2.774 M/sec
> 
> without vacant_height:
> stress-ng: info:  [8228]                   20968900 Maple Tree Read                0.171 M/sec
> stress-ng: info:  [8228]                  312214648 Maple Tree Write               2.547 M/sec
> 
> This respresents an increase of ~3% read throughput and ~9% increase in write throughput.
> 
> [1]: https://lore.kernel.org/lkml/20241114170524.64391-1-sidhartha.kumar@oracle.com/T/
> [2]: https://lore.kernel.org/lkml/20250221163610.578409-1-sidhartha.kumar@oracle.com/
> [3]: https://lore.kernel.org/lkml/20250227204823.758784-1-sidhartha.kumar@oracle.com/
> Sidhartha Kumar (6):
>    maple_tree: convert mas_prealloc_calc() to take in a maple write state
>    maple_tree: use height and depth consistently
>    maple_tree: use vacant nodes to reduce worst case allocations
>    maple_tree: break on convergence in mas_spanning_rebalance()
>    maple_tree: add sufficient height
>    maple_tree: reorder mas->store_type case statements
> 
>   include/linux/maple_tree.h       |   4 +
>   lib/maple_tree.c                 | 189 ++++++++++++++++++-------------
>   tools/testing/radix-tree/maple.c | 126 +++++++++++++++++++--
>   3 files changed, 233 insertions(+), 86 deletions(-)
>
Andrew Morton April 10, 2025, 12:29 a.m. UTC | #2
On Wed, 9 Apr 2025 13:51:29 -0400 Sidhartha Kumar <sidhartha.kumar@oracle.com> wrote:

> On 4/7/25 2:40 PM, Sidhartha Kumar wrote:
> > v3[3] -> v4:
> >    - fix bug reported by Vasily Gorbik by fixing condition for
> >      mast_overflow() and add test case for this condition. This fix
> >      was also tested on s390 by Vasily.
> > 
> >    - cleanup new_height variable usage in mas_spanning_rebalance()
> >      and add an additional test to make sure mt_height() is correct
> >      after a rebalance causes a tree's height to decrease per Liam.
> > 
> Hi Andrew,
> 
> Could the following fixup patch be applied to the end of this series? If 
> you'd prefer a new version of the series I can do that as well.

Liam mentioned a couple of "nits" so I think a V5 would be appropriate,
thanks.