Message ID:   20240131141858.1149719-1-elver@google.com (mailing list archive)
State:        Changes Requested
Delegated to: BPF
Series:       bpf: Separate bpf_local_storage_lookup() fast and slow paths

Checks:
  bpf/vmtest-bpf-PR     | fail    | merge-conflict
  netdev/tree_selection | success | Not a local patch
On 1/31/24 6:18 AM, Marco Elver wrote: > To allow the compiler to inline the bpf_local_storage_lookup() fast- > path, factor it out by making bpf_local_storage_lookup() a static inline > function and move the slow-path to bpf_local_storage_lookup_slowpath(). > > Base on results from './benchs/run_bench_local_storage.sh' this produces > improvements in throughput and latency in the majority of cases: > > | Hashmap Control > | =============== > | num keys: 10 > | hashmap (control) sequential get: > | <before> | <after> > | hits throughput: 13.895 ± 0.024 M ops/s | 14.022 ± 0.095 M ops/s (+0.9%) > | hits latency: 71.968 ns/op | 71.318 ns/op (-0.9%) > | important_hits throughput: 13.895 ± 0.024 M ops/s | 14.022 ± 0.095 M ops/s (+0.9%) > | > | num keys: 1000 > | hashmap (control) sequential get: > | <before> | <after> > | hits throughput: 11.793 ± 0.018 M ops/s | 11.645 ± 0.370 M ops/s (-1.3%) > | hits latency: 84.794 ns/op | 85.874 ns/op (+1.3%) > | important_hits throughput: 11.793 ± 0.018 M ops/s | 11.645 ± 0.370 M ops/s (-1.3%) > | > | num keys: 10000 > | hashmap (control) sequential get: > | <before> | <after> > | hits throughput: 7.113 ± 0.012 M ops/s | 7.037 ± 0.051 M ops/s (-1.1%) > | hits latency: 140.581 ns/op | 142.113 ns/op (+1.1%) > | important_hits throughput: 7.113 ± 0.012 M ops/s | 7.037 ± 0.051 M ops/s (-1.1%) My understanding is the change in this patch should not affect the hashmap control result, so the above +/- ~1% change could be mostly noise. > | > | num keys: 100000 > | hashmap (control) sequential get: > | <before> | <after> > | hits throughput: 4.793 ± 0.034 M ops/s | 4.990 ± 0.025 M ops/s (+4.1%) > | hits latency: 208.623 ns/op | 200.401 ns/op (-3.9%) > | important_hits throughput: 4.793 ± 0.034 M ops/s | 4.990 ± 0.025 M ops/s (+4.1%) > | > | num keys: 4194304 > | hashmap (control) sequential get: > | <before> | <after> > | hits throughput: 2.088 ± 0.008 M ops/s | 2.962 ± 0.004 M ops/s (+41.9%) > | hits latency: 478.851 ns/op | 337.648 ns/op (-29.5%) > | important_hits throughput: 2.088 ± 0.008 M ops/s | 2.962 ± 0.004 M ops/s (+41.9%) The last one has a big difference. Did you run it a couple of times without the change and check if the result was consistent ? 
> | > | Local Storage > | ============= > | num_maps: 1 > | local_storage cache sequential get: > | <before> | <after> > | hits throughput: 32.598 ± 0.008 M ops/s | 38.480 ± 0.054 M ops/s (+18.0%) > | hits latency: 30.676 ns/op | 25.988 ns/op (-15.3%) > | important_hits throughput: 32.598 ± 0.008 M ops/s | 38.480 ± 0.054 M ops/s (+18.0%) > | local_storage cache interleaved get: > | <before> | <after> > | hits throughput: 36.963 ± 0.045 M ops/s | 43.847 ± 0.037 M ops/s (+18.6%) > | hits latency: 27.054 ns/op | 22.807 ns/op (-15.7%) > | important_hits throughput: 36.963 ± 0.045 M ops/s | 43.847 ± 0.037 M ops/s (+18.6%) > | > | num_maps: 10 > | local_storage cache sequential get: > | <before> | <after> > | hits throughput: 32.078 ± 0.004 M ops/s | 37.813 ± 0.020 M ops/s (+17.9%) > | hits latency: 31.174 ns/op | 26.446 ns/op (-15.2%) > | important_hits throughput: 3.208 ± 0.000 M ops/s | 3.781 ± 0.002 M ops/s (+17.9%) > | local_storage cache interleaved get: > | <before> | <after> > | hits throughput: 34.564 ± 0.011 M ops/s | 40.082 ± 0.037 M ops/s (+16.0%) > | hits latency: 28.932 ns/op | 24.949 ns/op (-13.8%) > | important_hits throughput: 12.344 ± 0.004 M ops/s | 14.315 ± 0.013 M ops/s (+16.0%) > | > | num_maps: 16 > | local_storage cache sequential get: > | <before> | <after> > | hits throughput: 32.493 ± 0.023 M ops/s | 38.147 ± 0.029 M ops/s (+17.4%) > | hits latency: 30.776 ns/op | 26.215 ns/op (-14.8%) > | important_hits throughput: 2.031 ± 0.001 M ops/s | 2.384 ± 0.002 M ops/s (+17.4%) > | local_storage cache interleaved get: > | <before> | <after> > | hits throughput: 34.380 ± 0.521 M ops/s | 41.605 ± 0.095 M ops/s (+21.0%) > | hits latency: 29.087 ns/op | 24.035 ns/op (-17.4%) > | important_hits throughput: 10.939 ± 0.166 M ops/s | 13.238 ± 0.030 M ops/s (+21.0%) > | > | num_maps: 17 > | local_storage cache sequential get: > | <before> | <after> > | hits throughput: 28.748 ± 0.028 M ops/s | 32.248 ± 0.080 M ops/s (+12.2%) > | hits latency: 34.785 ns/op | 31.009 ns/op (-10.9%) > | important_hits throughput: 1.693 ± 0.002 M ops/s | 1.899 ± 0.005 M ops/s (+12.2%) > | local_storage cache interleaved get: > | <before> | <after> > | hits throughput: 31.313 ± 0.030 M ops/s | 35.911 ± 0.020 M ops/s (+14.7%) > | hits latency: 31.936 ns/op | 27.847 ns/op (-12.8%) > | important_hits throughput: 9.533 ± 0.009 M ops/s | 10.933 ± 0.006 M ops/s (+14.7%) > | > | num_maps: 24 > | local_storage cache sequential get: > | <before> | <after> > | hits throughput: 18.475 ± 0.027 M ops/s | 19.000 ± 0.006 M ops/s (+2.8%) > | hits latency: 54.127 ns/op | 52.632 ns/op (-2.8%) > | important_hits throughput: 0.770 ± 0.001 M ops/s | 0.792 ± 0.000 M ops/s (+2.9%) > | local_storage cache interleaved get: > | <before> | <after> > | hits throughput: 21.361 ± 0.028 M ops/s | 22.388 ± 0.099 M ops/s (+4.8%) > | hits latency: 46.814 ns/op | 44.667 ns/op (-4.6%) > | important_hits throughput: 6.009 ± 0.008 M ops/s | 6.298 ± 0.028 M ops/s (+4.8%) > | > | num_maps: 32 > | local_storage cache sequential get: > | <before> | <after> > | hits throughput: 14.220 ± 0.006 M ops/s | 14.168 ± 0.020 M ops/s (-0.4%) > | hits latency: 70.323 ns/op | 70.580 ns/op (+0.4%) > | important_hits throughput: 0.445 ± 0.000 M ops/s | 0.443 ± 0.001 M ops/s (-0.4%) > | local_storage cache interleaved get: > | <before> | <after> > | hits throughput: 17.250 ± 0.011 M ops/s | 16.650 ± 0.021 M ops/s (-3.5%) > | hits latency: 57.971 ns/op | 60.061 ns/op (+3.6%) > | important_hits throughput: 4.815 ± 0.003 M ops/s | 4.647 ± 0.006 M ops/s (-3.5%) > | > 
| num_maps: 100 > | local_storage cache sequential get: > | <before> | <after> > | hits throughput: 5.212 ± 0.012 M ops/s | 5.878 ± 0.004 M ops/s (+12.8%) > | hits latency: 191.877 ns/op | 170.116 ns/op (-11.3%) > | important_hits throughput: 0.052 ± 0.000 M ops/s | 0.059 ± 0.000 M ops/s (+13.5%) > | local_storage cache interleaved get: > | <before> | <after> > | hits throughput: 6.521 ± 0.053 M ops/s | 7.086 ± 0.010 M ops/s (+8.7%) > | hits latency: 153.343 ns/op | 141.116 ns/op (-8.0%) > | important_hits throughput: 1.703 ± 0.014 M ops/s | 1.851 ± 0.003 M ops/s (+8.7%) > | > | num_maps: 1000 > | local_storage cache sequential get: > | <before> | <after> > | hits throughput: 0.357 ± 0.005 M ops/s | 0.325 ± 0.005 M ops/s (-9.0%) > | hits latency: 2803.738 ns/op | 3076.923 ns/op (+9.7%) Is it understood why the slow down here? The same goes for the "num_maps: 32" case above but not as bad as here. > | important_hits throughput: 0.000 ± 0.000 M ops/s | 0.000 ± 0.000 M ops/s The important_hits is very little in this case? > | local_storage cache interleaved get: > | <before> | <after> > | hits throughput: 0.434 ± 0.007 M ops/s | 0.447 ± 0.007 M ops/s (+3.0%) > | hits latency: 2306.539 ns/op | 2237.687 ns/op (-3.0%) > | important_hits throughput: 0.109 ± 0.002 M ops/s | 0.112 ± 0.002 M ops/s (+2.8%) > > Signed-off-by: Marco Elver <elver@google.com> > --- > include/linux/bpf_local_storage.h | 17 ++++++++++++++++- > kernel/bpf/bpf_local_storage.c | 14 ++++---------- > .../selftests/bpf/progs/cgrp_ls_recursion.c | 2 +- > .../selftests/bpf/progs/task_ls_recursion.c | 2 +- > 4 files changed, 22 insertions(+), 13 deletions(-) > > diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h > index 173ec7f43ed1..c8cecf7fff87 100644 > --- a/include/linux/bpf_local_storage.h > +++ b/include/linux/bpf_local_storage.h > @@ -130,9 +130,24 @@ bpf_local_storage_map_alloc(union bpf_attr *attr, > bool bpf_ma); > > struct bpf_local_storage_data * > +bpf_local_storage_lookup_slowpath(struct bpf_local_storage *local_storage, > + struct bpf_local_storage_map *smap, > + bool cacheit_lockit); > +static inline struct bpf_local_storage_data * > bpf_local_storage_lookup(struct bpf_local_storage *local_storage, > struct bpf_local_storage_map *smap, > - bool cacheit_lockit); > + bool cacheit_lockit) > +{ > + struct bpf_local_storage_data *sdata; > + > + /* Fast path (cache hit) */ > + sdata = rcu_dereference_check(local_storage->cache[smap->cache_idx], > + bpf_rcu_lock_held()); > + if (likely(sdata && rcu_access_pointer(sdata->smap) == smap)) > + return sdata; > + > + return bpf_local_storage_lookup_slowpath(local_storage, smap, cacheit_lockit); > +} > > void bpf_local_storage_destroy(struct bpf_local_storage *local_storage); > > diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c > index 146824cc9689..2ef782a1bd6f 100644 > --- a/kernel/bpf/bpf_local_storage.c > +++ b/kernel/bpf/bpf_local_storage.c > @@ -415,20 +415,14 @@ void bpf_selem_unlink(struct bpf_local_storage_elem *selem, bool reuse_now) > } > > /* If cacheit_lockit is false, this lookup function is lockless */ > -struct bpf_local_storage_data * > -bpf_local_storage_lookup(struct bpf_local_storage *local_storage, > - struct bpf_local_storage_map *smap, > - bool cacheit_lockit) > +noinline struct bpf_local_storage_data * Is noinline needed ? 
> +bpf_local_storage_lookup_slowpath(struct bpf_local_storage *local_storage, > + struct bpf_local_storage_map *smap, > + bool cacheit_lockit) > { > struct bpf_local_storage_data *sdata; > struct bpf_local_storage_elem *selem; > > - /* Fast path (cache hit) */ > - sdata = rcu_dereference_check(local_storage->cache[smap->cache_idx], > - bpf_rcu_lock_held()); > - if (sdata && rcu_access_pointer(sdata->smap) == smap) > - return sdata; > - > /* Slow path (cache miss) */ > hlist_for_each_entry_rcu(selem, &local_storage->list, snode, > rcu_read_lock_trace_held()) > diff --git a/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c b/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c > index a043d8fefdac..9895087a9235 100644 > --- a/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c > +++ b/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c > @@ -21,7 +21,7 @@ struct { > __type(value, long); > } map_b SEC(".maps"); > > -SEC("fentry/bpf_local_storage_lookup") > +SEC("fentry/bpf_local_storage_lookup_slowpath") The selftest is trying to catch recursion. The change here cannot test the same thing because the slowpath will never be hit in the test_progs. I don't have a better idea for now also. It has a conflict with the bpf-next tree also. Was the patch created against an internal tree?
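The recursion selftest referred to above attaches an fentry program to
bpf_local_storage_lookup() and expects the storage helper called from inside
that hook to be rejected. A minimal sketch of that pattern (map and program
names are illustrative, not the exact selftest sources; it assumes the usual
selftest includes):

/* Sketch of the recursion-test pattern discussed above; illustrative only. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

struct {
	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, long);
} map_a SEC(".maps");

SEC("fentry/bpf_local_storage_lookup")
int BPF_PROG(on_lookup)
{
	struct task_struct *task = bpf_get_current_task_btf();

	/* Runs while bpf_local_storage_lookup() is already active, so this
	 * recursed storage operation is expected to fail.
	 */
	bpf_task_storage_delete(&map_a, task);
	return 0;
}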
On 1/31/24 6:18 AM, Marco Elver wrote:
> To allow the compiler to inline the bpf_local_storage_lookup() fast-
> path, factor it out by making bpf_local_storage_lookup() a static inline
> function and move the slow-path to bpf_local_storage_lookup_slowpath().
>
> [... benchmark results and patch quoted in full above elided ...]
>
> Signed-off-by: Marco Elver <elver@google.com>

Thanks for the patch. This indeed could improve performance with inlining.
In Meta, we started to use an LTO kernel which already has
bpf_local_storage_lookup() cross-file inlined into bpf_sk_storage_lookup(),
so we won't be able to reap this benefit.

In the subject, please use the tag [PATCH bpf-next] so CI can pick it up for
testing properly.

[...]
On Wed, 31 Jan 2024 at 20:52, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 1/31/24 6:18 AM, Marco Elver wrote:
> > To allow the compiler to inline the bpf_local_storage_lookup() fast-
> > path, factor it out by making bpf_local_storage_lookup() a static inline
> > function and move the slow-path to bpf_local_storage_lookup_slowpath().
> >
> > [... hashmap control results for num keys 10/1000/10000 elided ...]
>
> My understanding is the change in this patch should not affect the hashmap
> control result, so the above +/- ~1% change could be mostly noise.

Yes, I think they are noise.

> > [... num keys: 100000 results elided ...]
> >
> > | num keys: 4194304
> > | hashmap (control) sequential get:
> > |            <before>                      | <after>
> > | hits throughput: 2.088 ± 0.008 M ops/s | 2.962 ± 0.004 M ops/s (+41.9%)
> > | hits latency: 478.851 ns/op | 337.648 ns/op (-29.5%)
> > | important_hits throughput: 2.088 ± 0.008 M ops/s | 2.962 ± 0.004 M ops/s (+41.9%)
>
> The last one has a big difference. Did you run it a couple of times without the
> change and check if the result was consistent ?

Based on what you say above this might be noise. I will rerun a few
times (and also rebase against the latest v6.8-rc).

> > [... local storage results for num_maps 1..100 elided ...]
> >
> > | num_maps: 1000
> > | local_storage cache sequential get:
> > |            <before>                      | <after>
> > | hits throughput: 0.357 ± 0.005 M ops/s | 0.325 ± 0.005 M ops/s (-9.0%)
> > | hits latency: 2803.738 ns/op | 3076.923 ns/op (+9.7%)
>
> Is it understood why the slow down here? The same goes for the "num_maps: 32"
> case above but not as bad as here.

num_maps:32 could be noise.

> > | important_hits throughput: 0.000 ± 0.000 M ops/s | 0.000 ± 0.000 M ops/s
>
> The important_hits is very little in this case?

It seems to be below 0.000M on the test machine.

> > [... remainder of the results and most of the patch elided ...]
> >
> > -struct bpf_local_storage_data *
> > -bpf_local_storage_lookup(struct bpf_local_storage *local_storage,
> > -			 struct bpf_local_storage_map *smap,
> > -			 bool cacheit_lockit)
> > +noinline struct bpf_local_storage_data *
>
> Is noinline needed ?

Yes, so that this TU or LTO kernels do not inline the slowpath, which
would cause worse codegen in the caller.

> > [...]
> >
> > -SEC("fentry/bpf_local_storage_lookup")
> > +SEC("fentry/bpf_local_storage_lookup_slowpath")
>
> The selftest is trying to catch recursion. The change here cannot test the same
> thing because the slowpath will never be hit in the test_progs. I don't have a
> better idea for now also.
>
> It has a conflict with the bpf-next tree also. Was the patch created against an
> internal tree?

Base was v6.7. I will do a rebase and rerun benchmarks.
On Wed, 31 Jan 2024 at 20:52, Martin KaFai Lau <martin.lau@linux.dev> wrote:
[...]
> > | num_maps: 1000
> > | local_storage cache sequential get:
> > |            <before>                      | <after>
> > | hits throughput: 0.357 ± 0.005 M ops/s | 0.325 ± 0.005 M ops/s (-9.0%)
> > | hits latency: 2803.738 ns/op | 3076.923 ns/op (+9.7%)
>
> Is it understood why the slow down here? The same goes for the "num_maps: 32"
> case above but not as bad as here.

It turned out that there's a real slowdown due to the outlined
slowpath. If I inline everything except for inserting the entry into
the cache (cacheit_lockit codepath is still outlined), the results
look much better even for the case where it always misses the cache.

[...]
> > diff --git a/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c b/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c
> > index a043d8fefdac..9895087a9235 100644
> > --- a/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c
> > +++ b/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c
> > @@ -21,7 +21,7 @@ struct {
> >      __type(value, long);
> > } map_b SEC(".maps");
> >
> > -SEC("fentry/bpf_local_storage_lookup")
> > +SEC("fentry/bpf_local_storage_lookup_slowpath")
>
> The selftest is trying to catch recursion. The change here cannot test the same
> thing because the slowpath will never be hit in the test_progs. I don't have a
> better idea for now also.

Trying to prepare a v2, and for the test, the only option I see is to
introduce a tracepoint ("bpf_local_storage_lookup"). If unused, should
be a no-op due to static branch.

Or can you suggest different functions to hook to for the recursion test?

Preferences?
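For illustration of the tracepoint idea: a bare tracepoint declared with
DECLARE_TRACE() gives fentry/tp_btf programs something to attach to, and its
call site is guarded by a static key, so it is effectively a NOP while nothing
is attached. A rough sketch only; the argument list below is made up, not a
proposed interface:

/* Sketch only: a bare tracepoint (no trace event) for the lookup path.
 * The arguments are illustrative; a matching DEFINE_TRACE() would be
 * needed in a C file.
 */
#include <linux/tracepoint.h>

DECLARE_TRACE(bpf_local_storage_lookup,
	      TP_PROTO(struct bpf_local_storage *local_storage,
		       struct bpf_local_storage_map *smap),
	      TP_ARGS(local_storage, smap));

/*
 * At the call site in the lookup path the emitted code is just a
 * static-branch-guarded call, i.e. a NOP unless the tracepoint is enabled:
 *
 *	trace_bpf_local_storage_lookup(local_storage, smap);
 */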
On Wed, Jan 31, 2024 at 6:19 AM Marco Elver <elver@google.com> wrote:
>
[...]
>
> Signed-off-by: Marco Elver <elver@google.com>
> ---
[... diffstat and the beginning of the bpf_local_storage.h hunk elided ...]
> +	struct bpf_local_storage_data *sdata;
> +
> +	/* Fast path (cache hit) */
> +	sdata = rcu_dereference_check(local_storage->cache[smap->cache_idx],
> +				      bpf_rcu_lock_held());
> +	if (likely(sdata && rcu_access_pointer(sdata->smap) == smap))
> +		return sdata;

We have two changes here: 1) inlining; 2) likely() annotation. Could you
please include in the commit log how much the two contribute to the
performance improvement?

Thanks,
Song

> +
> +	return bpf_local_storage_lookup_slowpath(local_storage, smap, cacheit_lockit);
> +}
[...]
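(For reference on the second point: likely()/unlikely() are thin wrappers
around __builtin_expect() that only hint at branch layout, so their effect is
typically much smaller than that of inlining. Simplified from
include/linux/compiler.h, with the branch-profiling variants omitted:)

/* Simplified; the CONFIG_TRACE_BRANCH_PROFILING variants are omitted. */
#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)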
On 2/5/24 7:00 AM, Marco Elver wrote:
> On Wed, 31 Jan 2024 at 20:52, Martin KaFai Lau <martin.lau@linux.dev> wrote:
> [...]
>>> | num_maps: 1000
>>> | local_storage cache sequential get:
>>> |            <before>                      | <after>
>>> | hits throughput: 0.357 ± 0.005 M ops/s | 0.325 ± 0.005 M ops/s (-9.0%)
>>> | hits latency: 2803.738 ns/op | 3076.923 ns/op (+9.7%)
>>
>> Is it understood why the slow down here? The same goes for the "num_maps: 32"
>> case above but not as bad as here.
>
> It turned out that there's a real slowdown due to the outlined
> slowpath. If I inline everything except for inserting the entry into
> the cache (cacheit_lockit codepath is still outlined), the results
> look much better even for the case where it always misses the cache.
>
> [...]
>>> diff --git a/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c b/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c
>>> index a043d8fefdac..9895087a9235 100644
>>> --- a/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c
>>> +++ b/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c
>>> @@ -21,7 +21,7 @@ struct {
>>>      __type(value, long);
>>> } map_b SEC(".maps");
>>>
>>> -SEC("fentry/bpf_local_storage_lookup")
>>> +SEC("fentry/bpf_local_storage_lookup_slowpath")
>>
>> The selftest is trying to catch recursion. The change here cannot test the same
>> thing because the slowpath will never be hit in the test_progs. I don't have a
>> better idea for now also.
>
> Trying to prepare a v2, and for the test, the only option I see is to
> introduce a tracepoint ("bpf_local_storage_lookup"). If unused, should
> be a no-op due to static branch.
>
> Or can you suggest different functions to hook to for the recursion test?

I don't prefer to add another tracepoint for the selftest.

The test in "SEC("fentry/bpf_local_storage_lookup")" is testing that the
initial bpf_local_storage_lookup() should work and the immediate recurred
bpf_task_storage_delete() will fail.

Depends on how the new slow path function will look like in v2. The test can
probably be made to go through the slow path, e.g. by creating a lot of task
storage maps before triggering the lookup.
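To make the last suggestion concrete: the local-storage cache has a fixed
number of slots (BPF_LOCAL_STORAGE_CACHE_SIZE, 16 at the time of writing), so
a test that declares and touches more task-storage maps than that makes it
likely that the map under test is evicted from the cache and the lookup goes
through the slow path. A purely illustrative sketch; map names, the count,
and the helper are placeholders (usual selftest includes assumed):

/* Illustrative only: filler maps to crowd the 16-slot local-storage cache. */
#define FILLER_TASK_STORAGE_MAP(name)				\
	struct {						\
		__uint(type, BPF_MAP_TYPE_TASK_STORAGE);	\
		__uint(map_flags, BPF_F_NO_PREALLOC);		\
		__type(key, int);				\
		__type(value, long);				\
	} name SEC(".maps")

FILLER_TASK_STORAGE_MAP(filler_map_01);
FILLER_TASK_STORAGE_MAP(filler_map_02);
/* ... declare more than BPF_LOCAL_STORAGE_CACHE_SIZE maps in total ... */
FILLER_TASK_STORAGE_MAP(filler_map_17);

static __always_inline void touch_filler_maps(struct task_struct *task)
{
	/* Populating each filler map rotates entries through the shared
	 * cache slots, so a later lookup of the map under test is likely
	 * to miss the cache and take the slow path.
	 */
	bpf_task_storage_get(&filler_map_01, task, 0, BPF_LOCAL_STORAGE_GET_F_CREATE);
	bpf_task_storage_get(&filler_map_02, task, 0, BPF_LOCAL_STORAGE_GET_F_CREATE);
	/* ... and so on for the remaining filler maps ... */
}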
On Mon, Feb 05, 2024 at 03:24PM -0800, Martin KaFai Lau wrote:
[...]
> > Or can you suggest different functions to hook to for the recursion test?
>
> I don't prefer to add another tracepoint for the selftest.

Ok - I also checked, even though it should be a no-op, it wasn't
(compiler generated worse code).

> The test in "SEC("fentry/bpf_local_storage_lookup")" is testing that the
> initial bpf_local_storage_lookup() should work and the immediate recurred
> bpf_task_storage_delete() will fail.
>
> Depends on how the new slow path function will look like in v2. The test can
> probably be made to go through the slow path, e.g. by creating a lot of task
> storage maps before triggering the lookup.

Below is tentative v2, but I'm struggling with fixing up the test. In
particular, bpf_task_storage_delete() now only calls out to
migrate_disable/enable() and bpf_selem_unlink(), because the compiler
just ends up inlining everything it can:

<bpf_task_storage_delete>:
   endbr64
   nopl   0x0(%rax,%rax,1)
   push   %r14
   push   %rbx
   test   %rsi,%rsi
   je     ffffffff81280015 <bpf_task_storage_delete+0x75>
   mov    %rsi,%rbx
   mov    %rdi,%r14
   call   ffffffff810f2e40 <migrate_disable>
   incl   %gs:0x7eda9ba5(%rip)        # 29b68 <bpf_task_storage_busy>
   mov    0xb38(%rbx),%rax
   mov    $0xfffffffffffffffe,%rbx
   test   %rax,%rax
   je     ffffffff8128002f <bpf_task_storage_delete+0x8f>
   movzwl 0x10e(%r14),%ecx

   mov    (%rax,%rcx,8),%rdi
   test   %rdi,%rdi
   je     ffffffff8127ffef <bpf_task_storage_delete+0x4f>
   mov    (%rdi),%rcx
   cmp    %r14,%rcx
   je     ffffffff81280022 <bpf_task_storage_delete+0x82>
   mov    0x88(%rax),%rdi
   test   %rdi,%rdi
   je     ffffffff8128002f <bpf_task_storage_delete+0x8f>
   add    $0xfffffffffffffff0,%rdi
   je     ffffffff8128002f <bpf_task_storage_delete+0x8f>
   mov    0x40(%rdi),%rax
   cmp    %r14,%rax
   je     ffffffff8128001e <bpf_task_storage_delete+0x7e>
   mov    0x10(%rdi),%rdi
   test   %rdi,%rdi
   jne    ffffffff8127fffb <bpf_task_storage_delete+0x5b>
   jmp    ffffffff8128002f <bpf_task_storage_delete+0x8f>
   mov    $0xffffffffffffffea,%rbx
   jmp    ffffffff8128003b <bpf_task_storage_delete+0x9b>
   add    $0x40,%rdi
   add    $0xffffffffffffffc0,%rdi
   xor    %ebx,%ebx
   xor    %esi,%esi
   call   ffffffff8127e820 <bpf_selem_unlink>
   decl   %gs:0x7eda9b32(%rip)        # 29b68 <bpf_task_storage_busy>
   call   ffffffff810f2ed0 <migrate_enable>
   mov    %rbx,%rax
   pop    %rbx
   pop    %r14
   cs jmp ffffffff82324ea0 <__x86_return_thunk>

Could you suggest how we can fix up the tests? I'm a little stuck
because there's not much left we can hook to.

Thanks,
-- Marco

------ >8 ------

From: Marco Elver <elver@google.com>
Date: Tue, 30 Jan 2024 17:57:45 +0100
Subject: [PATCH v2] bpf: Allow compiler to inline most of
 bpf_local_storage_lookup()

In various performance profiles of kernels with BPF programs attached,
bpf_local_storage_lookup() appears as a significant portion of CPU
cycles spent. To enable the compiler to generate more optimal code, turn
bpf_local_storage_lookup() into a static inline function, where only the
cache insertion code path is outlined (the call instruction can be elided
entirely if cacheit_lockit is a constant expression).
Based on results from './benchs/run_bench_local_storage.sh' (21 trials; reboot between each trial) this produces improvements in throughput and latency in the majority of cases, with an average (geomean) improvement of 8%: +---- Hashmap Control -------------------- | | + num keys: 10 | : <before> | <after> | +-+ hashmap (control) sequential get +----------------------+---------------------- | +- hits throughput | 14.789 M ops/s | 14.745 M ops/s ( ~ ) | +- hits latency | 67.679 ns/op | 67.879 ns/op ( ~ ) | +- important_hits throughput | 14.789 M ops/s | 14.745 M ops/s ( ~ ) | | + num keys: 1000 | : <before> | <after> | +-+ hashmap (control) sequential get +----------------------+---------------------- | +- hits throughput | 12.233 M ops/s | 12.170 M ops/s ( ~ ) | +- hits latency | 81.754 ns/op | 82.185 ns/op ( ~ ) | +- important_hits throughput | 12.233 M ops/s | 12.170 M ops/s ( ~ ) | | + num keys: 10000 | : <before> | <after> | +-+ hashmap (control) sequential get +----------------------+---------------------- | +- hits throughput | 7.220 M ops/s | 7.204 M ops/s ( ~ ) | +- hits latency | 138.522 ns/op | 138.842 ns/op ( ~ ) | +- important_hits throughput | 7.220 M ops/s | 7.204 M ops/s ( ~ ) | | + num keys: 100000 | : <before> | <after> | +-+ hashmap (control) sequential get +----------------------+---------------------- | +- hits throughput | 5.061 M ops/s | 5.165 M ops/s (+2.1%) | +- hits latency | 198.483 ns/op | 194.270 ns/op (-2.1%) | +- important_hits throughput | 5.061 M ops/s | 5.165 M ops/s (+2.1%) | | + num keys: 4194304 | : <before> | <after> | +-+ hashmap (control) sequential get +----------------------+---------------------- | +- hits throughput | 2.864 M ops/s | 2.882 M ops/s ( ~ ) | +- hits latency | 365.220 ns/op | 361.418 ns/op (-1.0%) | +- important_hits throughput | 2.864 M ops/s | 2.882 M ops/s ( ~ ) | +---- Local Storage ---------------------- | | + num_maps: 1 | : <before> | <after> | +-+ local_storage cache sequential get +----------------------+---------------------- | +- hits throughput | 33.005 M ops/s | 39.068 M ops/s (+18.4%) | +- hits latency | 30.300 ns/op | 25.598 ns/op (-15.5%) | +- important_hits throughput | 33.005 M ops/s | 39.068 M ops/s (+18.4%) | : | : <before> | <after> | +-+ local_storage cache interleaved get +----------------------+---------------------- | +- hits throughput | 37.151 M ops/s | 44.926 M ops/s (+20.9%) | +- hits latency | 26.919 ns/op | 22.259 ns/op (-17.3%) | +- important_hits throughput | 37.151 M ops/s | 44.926 M ops/s (+20.9%) | | + num_maps: 10 | : <before> | <after> | +-+ local_storage cache sequential get +----------------------+---------------------- | +- hits throughput | 32.288 M ops/s | 38.099 M ops/s (+18.0%) | +- hits latency | 30.972 ns/op | 26.248 ns/op (-15.3%) | +- important_hits throughput | 3.229 M ops/s | 3.810 M ops/s (+18.0%) | : | : <before> | <after> | +-+ local_storage cache interleaved get +----------------------+---------------------- | +- hits throughput | 34.473 M ops/s | 41.145 M ops/s (+19.4%) | +- hits latency | 29.010 ns/op | 24.307 ns/op (-16.2%) | +- important_hits throughput | 12.312 M ops/s | 14.695 M ops/s (+19.4%) | | + num_maps: 16 | : <before> | <after> | +-+ local_storage cache sequential get +----------------------+---------------------- | +- hits throughput | 32.524 M ops/s | 38.341 M ops/s (+17.9%) | +- hits latency | 30.748 ns/op | 26.083 ns/op (-15.2%) | +- important_hits throughput | 2.033 M ops/s | 2.396 M ops/s (+17.9%) | : | : <before> | <after> | +-+ local_storage cache interleaved 
get +----------------------+---------------------- | +- hits throughput | 34.575 M ops/s | 41.338 M ops/s (+19.6%) | +- hits latency | 28.925 ns/op | 24.193 ns/op (-16.4%) | +- important_hits throughput | 11.001 M ops/s | 13.153 M ops/s (+19.6%) | | + num_maps: 17 | : <before> | <after> | +-+ local_storage cache sequential get +----------------------+---------------------- | +- hits throughput | 28.861 M ops/s | 32.756 M ops/s (+13.5%) | +- hits latency | 34.649 ns/op | 30.530 ns/op (-11.9%) | +- important_hits throughput | 1.700 M ops/s | 1.929 M ops/s (+13.5%) | : | : <before> | <after> | +-+ local_storage cache interleaved get +----------------------+---------------------- | +- hits throughput | 31.529 M ops/s | 36.110 M ops/s (+14.5%) | +- hits latency | 31.719 ns/op | 27.697 ns/op (-12.7%) | +- important_hits throughput | 9.598 M ops/s | 10.993 M ops/s (+14.5%) | | + num_maps: 24 | : <before> | <after> | +-+ local_storage cache sequential get +----------------------+---------------------- | +- hits throughput | 18.602 M ops/s | 19.937 M ops/s (+7.2%) | +- hits latency | 53.767 ns/op | 50.166 ns/op (-6.7%) | +- important_hits throughput | 0.776 M ops/s | 0.831 M ops/s (+7.2%) | : | : <before> | <after> | +-+ local_storage cache interleaved get +----------------------+---------------------- | +- hits throughput | 21.718 M ops/s | 23.332 M ops/s (+7.4%) | +- hits latency | 46.047 ns/op | 42.865 ns/op (-6.9%) | +- important_hits throughput | 6.110 M ops/s | 6.564 M ops/s (+7.4%) | | + num_maps: 32 | : <before> | <after> | +-+ local_storage cache sequential get +----------------------+---------------------- | +- hits throughput | 14.118 M ops/s | 14.626 M ops/s (+3.6%) | +- hits latency | 70.856 ns/op | 68.381 ns/op (-3.5%) | +- important_hits throughput | 0.442 M ops/s | 0.458 M ops/s (+3.6%) | : | : <before> | <after> | +-+ local_storage cache interleaved get +----------------------+---------------------- | +- hits throughput | 17.111 M ops/s | 17.906 M ops/s (+4.6%) | +- hits latency | 58.451 ns/op | 55.865 ns/op (-4.4%) | +- important_hits throughput | 4.776 M ops/s | 4.998 M ops/s (+4.6%) | | + num_maps: 100 | : <before> | <after> | +-+ local_storage cache sequential get +----------------------+---------------------- | +- hits throughput | 5.281 M ops/s | 5.528 M ops/s (+4.7%) | +- hits latency | 192.398 ns/op | 183.059 ns/op (-4.9%) | +- important_hits throughput | 0.053 M ops/s | 0.055 M ops/s (+4.9%) | : | : <before> | <after> | +-+ local_storage cache interleaved get +----------------------+---------------------- | +- hits throughput | 6.265 M ops/s | 6.498 M ops/s (+3.7%) | +- hits latency | 161.436 ns/op | 152.877 ns/op (-5.3%) | +- important_hits throughput | 1.636 M ops/s | 1.697 M ops/s (+3.7%) | | + num_maps: 1000 | : <before> | <after> | +-+ local_storage cache sequential get +----------------------+---------------------- | +- hits throughput | 0.355 M ops/s | 0.354 M ops/s ( ~ ) | +- hits latency | 2826.538 ns/op | 2827.139 ns/op ( ~ ) | +- important_hits throughput | 0.000 M ops/s | 0.000 M ops/s ( ~ ) | : | : <before> | <after> | +-+ local_storage cache interleaved get +----------------------+---------------------- | +- hits throughput | 0.404 M ops/s | 0.403 M ops/s ( ~ ) | +- hits latency | 2481.190 ns/op | 2487.555 ns/op ( ~ ) | +- important_hits throughput | 0.102 M ops/s | 0.101 M ops/s ( ~ ) Signed-off-by: Marco Elver <elver@google.com> --- v2: * Inline most of bpf_local_storage_lookup(), which produces greater speedup and avoids regressing the cases with large map 
arrays. * Drop "unlikely()" hint, it didn't produce much benefit. * Re-run benchmark and collect 21 trials of results. --- include/linux/bpf_local_storage.h | 30 ++++++++++- kernel/bpf/bpf_local_storage.c | 52 +++++-------------- .../selftests/bpf/progs/cgrp_ls_recursion.c | 2 +- .../selftests/bpf/progs/task_ls_recursion.c | 2 +- 4 files changed, 43 insertions(+), 43 deletions(-) diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h index 173ec7f43ed1..dcddb0aef7d8 100644 --- a/include/linux/bpf_local_storage.h +++ b/include/linux/bpf_local_storage.h @@ -129,10 +129,36 @@ bpf_local_storage_map_alloc(union bpf_attr *attr, struct bpf_local_storage_cache *cache, bool bpf_ma); -struct bpf_local_storage_data * +void __bpf_local_storage_insert_cache(struct bpf_local_storage *local_storage, + struct bpf_local_storage_map *smap, + struct bpf_local_storage_elem *selem); +/* If cacheit_lockit is false, this lookup function is lockless */ +static inline struct bpf_local_storage_data * bpf_local_storage_lookup(struct bpf_local_storage *local_storage, struct bpf_local_storage_map *smap, - bool cacheit_lockit); + bool cacheit_lockit) +{ + struct bpf_local_storage_data *sdata; + struct bpf_local_storage_elem *selem; + + /* Fast path (cache hit) */ + sdata = rcu_dereference_check(local_storage->cache[smap->cache_idx], + bpf_rcu_lock_held()); + if (sdata && rcu_access_pointer(sdata->smap) == smap) + return sdata; + + /* Slow path (cache miss) */ + hlist_for_each_entry_rcu(selem, &local_storage->list, snode, + rcu_read_lock_trace_held()) + if (rcu_access_pointer(SDATA(selem)->smap) == smap) + break; + + if (!selem) + return NULL; + if (cacheit_lockit) + __bpf_local_storage_insert_cache(local_storage, smap, selem); + return SDATA(selem); +} void bpf_local_storage_destroy(struct bpf_local_storage *local_storage); diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c index 146824cc9689..bdea1a459153 100644 --- a/kernel/bpf/bpf_local_storage.c +++ b/kernel/bpf/bpf_local_storage.c @@ -414,47 +414,21 @@ void bpf_selem_unlink(struct bpf_local_storage_elem *selem, bool reuse_now) bpf_selem_unlink_storage(selem, reuse_now); } -/* If cacheit_lockit is false, this lookup function is lockless */ -struct bpf_local_storage_data * -bpf_local_storage_lookup(struct bpf_local_storage *local_storage, - struct bpf_local_storage_map *smap, - bool cacheit_lockit) +void __bpf_local_storage_insert_cache(struct bpf_local_storage *local_storage, + struct bpf_local_storage_map *smap, + struct bpf_local_storage_elem *selem) { - struct bpf_local_storage_data *sdata; - struct bpf_local_storage_elem *selem; - - /* Fast path (cache hit) */ - sdata = rcu_dereference_check(local_storage->cache[smap->cache_idx], - bpf_rcu_lock_held()); - if (sdata && rcu_access_pointer(sdata->smap) == smap) - return sdata; - - /* Slow path (cache miss) */ - hlist_for_each_entry_rcu(selem, &local_storage->list, snode, - rcu_read_lock_trace_held()) - if (rcu_access_pointer(SDATA(selem)->smap) == smap) - break; - - if (!selem) - return NULL; - - sdata = SDATA(selem); - if (cacheit_lockit) { - unsigned long flags; - - /* spinlock is needed to avoid racing with the - * parallel delete. Otherwise, publishing an already - * deleted sdata to the cache will become a use-after-free - * problem in the next bpf_local_storage_lookup(). 
- */ - raw_spin_lock_irqsave(&local_storage->lock, flags); - if (selem_linked_to_storage(selem)) - rcu_assign_pointer(local_storage->cache[smap->cache_idx], - sdata); - raw_spin_unlock_irqrestore(&local_storage->lock, flags); - } + unsigned long flags; - return sdata; + /* spinlock is needed to avoid racing with the + * parallel delete. Otherwise, publishing an already + * deleted sdata to the cache will become a use-after-free + * problem in the next bpf_local_storage_lookup(). + */ + raw_spin_lock_irqsave(&local_storage->lock, flags); + if (selem_linked_to_storage(selem)) + rcu_assign_pointer(local_storage->cache[smap->cache_idx], SDATA(selem)); + raw_spin_unlock_irqrestore(&local_storage->lock, flags); } static int check_flags(const struct bpf_local_storage_data *old_sdata, diff --git a/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c b/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c index 610c2427fd93..6e93f3c8b318 100644 --- a/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c +++ b/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c @@ -33,7 +33,7 @@ static void __on_lookup(struct cgroup *cgrp) bpf_cgrp_storage_delete(&map_b, cgrp); } -SEC("fentry/bpf_local_storage_lookup") +SEC("fentry/??????????????????????????") int BPF_PROG(on_lookup) { struct task_struct *task = bpf_get_current_task_btf(); diff --git a/tools/testing/selftests/bpf/progs/task_ls_recursion.c b/tools/testing/selftests/bpf/progs/task_ls_recursion.c index 4542dc683b44..d73b33a4c153 100644 --- a/tools/testing/selftests/bpf/progs/task_ls_recursion.c +++ b/tools/testing/selftests/bpf/progs/task_ls_recursion.c @@ -27,7 +27,7 @@ struct { __type(value, long); } map_b SEC(".maps"); -SEC("fentry/bpf_local_storage_lookup") +SEC("fentry/??????????????????????????") int BPF_PROG(on_lookup) { struct task_struct *task = bpf_get_current_task_btf();
On 2/6/24 9:04 AM, Marco Elver wrote:
> On Mon, Feb 05, 2024 at 03:24PM -0800, Martin KaFai Lau wrote:
> [...]
>>> Or can you suggest different functions to hook to for the recursion test?
>>
>> I don't prefer to add another tracepoint for the selftest.
>
> Ok - I also checked, even though it should be a no-op, it wasn't
> (compiler generated worse code).

I am interested in how the tracepoint generates worse code. Can you share
some details?

>> The test in "SEC("fentry/bpf_local_storage_lookup")" is testing that the
>> initial bpf_local_storage_lookup() should work and the immediate recurred
>> bpf_task_storage_delete() will fail.
>>
>> Depends on how the new slow path function will look like in v2. The test can
>> probably be made to go through the slow path, e.g. by creating a lot of task
>> storage maps before triggering the lookup.
>
> Below is tentative v2, but I'm struggling with fixing up the test. In
> particular, bpf_task_storage_delete() now only calls out to
> migrate_disable/enable() and bpf_selem_unlink(), because the compiler
> just ends up inlining everything it can:
>
> [... disassembly of bpf_task_storage_delete() elided ...]
>
> Could you suggest how we can fix up the tests? I'm a little stuck
> because there's not much left we can hook to.

I don't see a solution either if only the cache insertion code path is in a
traceable function.

The prog->active counter has already been covered in another test. This test
is mostly only covering the lookup => delete recursion case, and that code
path is contained within the bpf local storage logic, so future code review
should be able to cover it. I would make an exception here and remove this
test case, considering anything (e.g. a tracepoint) we do here is likely to
make it worse. (More on the test removal below.)

> [... tentative v2 patch quoted ...]
>
> In various performance profiles of kernels with BPF programs attached,
> bpf_local_storage_lookup() appears as a significant portion of CPU
> cycles spent. To enable the compiler to generate more optimal code, turn
> bpf_local_storage_lookup() into a static inline function, where only the
> cache insertion code path is outlined (the call instruction can be elided
> entirely if cacheit_lockit is a constant expression).

Can you share more about why only putting the cache insertion code into a
separate function improves the larger-number-of-maps case? In the benchmark,
cacheit_lockit should always be true and __bpf_local_storage_insert_cache()
should always be called.

> [... v2 benchmark results elided ...]
<before> | <after> > | +-+ local_storage cache sequential get +----------------------+---------------------- > | +- hits throughput | 0.355 M ops/s | 0.354 M ops/s ( ~ ) > | +- hits latency | 2826.538 ns/op | 2827.139 ns/op ( ~ ) > | +- important_hits throughput | 0.000 M ops/s | 0.000 M ops/s ( ~ ) > | : > | : <before> | <after> > | +-+ local_storage cache interleaved get +----------------------+---------------------- > | +- hits throughput | 0.404 M ops/s | 0.403 M ops/s ( ~ ) > | +- hits latency | 2481.190 ns/op | 2487.555 ns/op ( ~ ) > | +- important_hits throughput | 0.102 M ops/s | 0.101 M ops/s ( ~ ) > > Signed-off-by: Marco Elver <elver@google.com> > --- > v2: > * Inline most of bpf_local_storage_lookup(), which produces greater > speedup and avoids regressing the cases with large map arrays. > * Drop "unlikely()" hint, it didn't produce much benefit. > * Re-run benchmark and collect 21 trials of results. > --- > include/linux/bpf_local_storage.h | 30 ++++++++++- > kernel/bpf/bpf_local_storage.c | 52 +++++-------------- > .../selftests/bpf/progs/cgrp_ls_recursion.c | 2 +- > .../selftests/bpf/progs/task_ls_recursion.c | 2 +- > 4 files changed, 43 insertions(+), 43 deletions(-) > > diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h > index 173ec7f43ed1..dcddb0aef7d8 100644 > --- a/include/linux/bpf_local_storage.h > +++ b/include/linux/bpf_local_storage.h > @@ -129,10 +129,36 @@ bpf_local_storage_map_alloc(union bpf_attr *attr, > struct bpf_local_storage_cache *cache, > bool bpf_ma); > > -struct bpf_local_storage_data * > +void __bpf_local_storage_insert_cache(struct bpf_local_storage *local_storage, > + struct bpf_local_storage_map *smap, > + struct bpf_local_storage_elem *selem); > +/* If cacheit_lockit is false, this lookup function is lockless */ > +static inline struct bpf_local_storage_data * > bpf_local_storage_lookup(struct bpf_local_storage *local_storage, > struct bpf_local_storage_map *smap, > - bool cacheit_lockit); > + bool cacheit_lockit) > +{ > + struct bpf_local_storage_data *sdata; > + struct bpf_local_storage_elem *selem; > + > + /* Fast path (cache hit) */ > + sdata = rcu_dereference_check(local_storage->cache[smap->cache_idx], > + bpf_rcu_lock_held()); > + if (sdata && rcu_access_pointer(sdata->smap) == smap) > + return sdata; > + > + /* Slow path (cache miss) */ > + hlist_for_each_entry_rcu(selem, &local_storage->list, snode, > + rcu_read_lock_trace_held()) > + if (rcu_access_pointer(SDATA(selem)->smap) == smap) > + break; > + > + if (!selem) > + return NULL; > + if (cacheit_lockit) > + __bpf_local_storage_insert_cache(local_storage, smap, selem); > + return SDATA(selem); > +} > > void bpf_local_storage_destroy(struct bpf_local_storage *local_storage); > > diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c > index 146824cc9689..bdea1a459153 100644 > --- a/kernel/bpf/bpf_local_storage.c > +++ b/kernel/bpf/bpf_local_storage.c > @@ -414,47 +414,21 @@ void bpf_selem_unlink(struct bpf_local_storage_elem *selem, bool reuse_now) > bpf_selem_unlink_storage(selem, reuse_now); > } > > -/* If cacheit_lockit is false, this lookup function is lockless */ > -struct bpf_local_storage_data * > -bpf_local_storage_lookup(struct bpf_local_storage *local_storage, > - struct bpf_local_storage_map *smap, > - bool cacheit_lockit) > +void __bpf_local_storage_insert_cache(struct bpf_local_storage *local_storage, > + struct bpf_local_storage_map *smap, > + struct bpf_local_storage_elem *selem) > { > - struct bpf_local_storage_data 
*sdata; > - struct bpf_local_storage_elem *selem; > - > - /* Fast path (cache hit) */ > - sdata = rcu_dereference_check(local_storage->cache[smap->cache_idx], > - bpf_rcu_lock_held()); > - if (sdata && rcu_access_pointer(sdata->smap) == smap) > - return sdata; > - > - /* Slow path (cache miss) */ > - hlist_for_each_entry_rcu(selem, &local_storage->list, snode, > - rcu_read_lock_trace_held()) > - if (rcu_access_pointer(SDATA(selem)->smap) == smap) > - break; > - > - if (!selem) > - return NULL; > - > - sdata = SDATA(selem); > - if (cacheit_lockit) { > - unsigned long flags; > - > - /* spinlock is needed to avoid racing with the > - * parallel delete. Otherwise, publishing an already > - * deleted sdata to the cache will become a use-after-free > - * problem in the next bpf_local_storage_lookup(). > - */ > - raw_spin_lock_irqsave(&local_storage->lock, flags); > - if (selem_linked_to_storage(selem)) > - rcu_assign_pointer(local_storage->cache[smap->cache_idx], > - sdata); > - raw_spin_unlock_irqrestore(&local_storage->lock, flags); > - } > + unsigned long flags; > > - return sdata; > + /* spinlock is needed to avoid racing with the > + * parallel delete. Otherwise, publishing an already > + * deleted sdata to the cache will become a use-after-free > + * problem in the next bpf_local_storage_lookup(). > + */ > + raw_spin_lock_irqsave(&local_storage->lock, flags); > + if (selem_linked_to_storage(selem)) > + rcu_assign_pointer(local_storage->cache[smap->cache_idx], SDATA(selem)); > + raw_spin_unlock_irqrestore(&local_storage->lock, flags); > } > > static int check_flags(const struct bpf_local_storage_data *old_sdata, > diff --git a/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c b/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c > index 610c2427fd93..6e93f3c8b318 100644 > --- a/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c > +++ b/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c > @@ -33,7 +33,7 @@ static void __on_lookup(struct cgroup *cgrp) > bpf_cgrp_storage_delete(&map_b, cgrp); > } > > -SEC("fentry/bpf_local_storage_lookup") > +SEC("fentry/??????????????????????????") > int BPF_PROG(on_lookup) Remove this BPF_PROG. > { > struct task_struct *task = bpf_get_current_task_btf(); > diff --git a/tools/testing/selftests/bpf/progs/task_ls_recursion.c b/tools/testing/selftests/bpf/progs/task_ls_recursion.c > index 4542dc683b44..d73b33a4c153 100644 > --- a/tools/testing/selftests/bpf/progs/task_ls_recursion.c > +++ b/tools/testing/selftests/bpf/progs/task_ls_recursion.c > @@ -27,7 +27,7 @@ struct { > __type(value, long); > } map_b SEC(".maps"); > > -SEC("fentry/bpf_local_storage_lookup") > +SEC("fentry/??????????????????????????") Same here. The checks related to on_lookup in prog_tests/task_local_storage.c need to be removed also. > int BPF_PROG(on_lookup) > { > struct task_struct *task = bpf_get_current_task_btf();
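For context, the recursion selftest under discussion has roughly the following shape (a minimal sketch abbreviated from the task_ls_recursion.c hunks quoted above; the real test uses two task storage maps and reports results back to userspace, which is elided here). It attaches by kernel function name, which is why it stops working once bpf_local_storage_lookup() is inlined away:

// Minimal sketch (not the full selftest): an fentry hook on
// bpf_local_storage_lookup() that immediately recurses into task
// storage; the test expects this nested delete to fail.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, long);
} map_a SEC(".maps");

/* Only attachable while bpf_local_storage_lookup() is a real symbol. */
SEC("fentry/bpf_local_storage_lookup")
int BPF_PROG(on_lookup)
{
	struct task_struct *task = bpf_get_current_task_btf();

	/* Recursing from inside the lookup path must fail. */
	bpf_task_storage_delete(&map_a, task);
	return 0;
}

char _license[] SEC("license") = "GPL";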
On Tue, Feb 06, 2024 at 05:22PM -0800, Martin KaFai Lau wrote: > On 2/6/24 9:04 AM, Marco Elver wrote: > > On Mon, Feb 05, 2024 at 03:24PM -0800, Martin KaFai Lau wrote: > > [...] > > > > Or can you suggest different functions to hook to for the recursion test? > > > > > > I don't prefer to add another tracepoint for the selftest. > > > > Ok - I also checked, even though it should be a no-op, it wasn't > > (compiler generated worse code). > > I am interested to how the tracepoint generates worse code. Can you share > some details ? My guess is that it produces enough code that some inlinable functions are no longer being inlined, specifically __bpf_task_storage_get(). > > > > > The test in "SEC("fentry/bpf_local_storage_lookup")" is testing that the > > > initial bpf_local_storage_lookup() should work and the immediate recurred > > > bpf_task_storage_delete() will fail. > > > > > > Depends on how the new slow path function will look like in v2. The test can > > > probably be made to go through the slow path, e.g. by creating a lot of task > > > storage maps before triggering the lookup. [...] > > Could you suggest how we can fix up the tests? I'm a little stuck > > because there's not much we can hook to left. > > I don't see a solution either if only the cache insertion code path is in a > traceable function. > > The prog->active counter has already been covered in another test. This test > is mostly only covering the lookup => delete recur case and the code path is > contained within the bpf storage logic. The future code review should be > able to cover. I would make an exception here and remove this test case > considering anything (e.g. tracepoint) we do here is likely to make it > worse. (more on the test removal below). > > > > > Thanks, > > -- Marco > > > > ------ >8 ------ > > > > From: Marco Elver <elver@google.com> > > Date: Tue, 30 Jan 2024 17:57:45 +0100 > > Subject: [PATCH v2] bpf: Allow compiler to inline most of > > bpf_local_storage_lookup() > > > > In various performance profiles of kernels with BPF programs attached, > > bpf_local_storage_lookup() appears as a significant portion of CPU > > cycles spent. To enable the compiler generate more optimal code, turn > > bpf_local_storage_lookup() into a static inline function, where only the > > cache insertion code path is outlined (call instruction can be elided > > entirely if cacheit_lockit is a constant expression). > > Can you share more why only putting the cache insertion code to a function > improves the larger number of maps case. In the benchmark, cacheit_lockit > should always be true and __bpf_local_storage_insert_cache() should always > be called. Keeping bpf_local_storage_lookup() smaller (even if just outlining the cache insertion) makes a difference, as it allows the compiler to generate more optimal code; specifically, we avoid duplicating the setup of calls to _raw_spin_lock/unlock. E.g. __bpf_task_storage_get() is no longer inlined if bpf_local_storage_lookup() becomes too large (i.e. when everything, incl. the cache insertion, is up for inlining). Also, on x86 preempt builds, spin_lock/unlock aren't inlinable, so we have to pay the price of 2 calls regardless: previously, calls to _raw_spin_lock_irqsave and _raw_spin_unlock_irqrestore. However, with the version of __bpf_local_storage_insert_cache in my patch, the call to _raw_spin_unlock_irqrestore is a tail call, which allows the compiler to perform TCO, i.e.
we still only pay the price of 2 calls: one to __bpf_local_storage_insert_cache and one to _raw_spin_lock_irqsave (but no call to _raw_spin_unlock_irqrestore, which can just be jumped to):

<__bpf_local_storage_insert_cache>:
  endbr64
  nopl   0x0(%rax,%rax,1)
  push   %r15
  push   %r14
  push   %r12
  push   %rbx
  mov    %rdx,%rbx
  mov    %rsi,%r12
  mov    %rdi,%r15
  lea    0xa8(%rdi),%r14
  mov    %r14,%rdi
  call   ffffffff82323650 <_raw_spin_lock_irqsave>
  cmpq   $0x0,0x18(%rbx)
  je     ffffffff8127ea80 <__bpf_local_storage_insert_cache+0x40>
  add    $0x40,%rbx
  movzwl 0x10e(%r12),%ecx
  mov    %rbx,(%r15,%rcx,8)
  mov    %r14,%rdi
  mov    %rax,%rsi
  pop    %rbx
  pop    %r12
  pop    %r14
  pop    %r15
  jmp    ffffffff823237d0 <_raw_spin_unlock_irqrestore>   <--- TCO

I also compared a version where _everything_ is inlined vs. the one with __bpf_local_storage_insert_cache outlined: the one where everything is inlined nullifies any performance improvements and is significantly worse than the one with __bpf_local_storage_insert_cache outlined.

[...]

> > -SEC("fentry/bpf_local_storage_lookup") > > +SEC("fentry/??????????????????????????") > int BPF_PROG(on_lookup) > > Remove this BPF_PROG. > > > { > > struct task_struct *task = bpf_get_current_task_btf(); > > diff --git a/tools/testing/selftests/bpf/progs/task_ls_recursion.c b/tools/testing/selftests/bpf/progs/task_ls_recursion.c > > index 4542dc683b44..d73b33a4c153 100644 > > --- a/tools/testing/selftests/bpf/progs/task_ls_recursion.c > > +++ b/tools/testing/selftests/bpf/progs/task_ls_recursion.c > > @@ -27,7 +27,7 @@ struct { > > __type(value, long); > > } map_b SEC(".maps"); > > -SEC("fentry/bpf_local_storage_lookup") > > +SEC("fentry/??????????????????????????") > > Same here. The checks related to on_lookup in > prog_tests/task_local_storage.c need to be removed also. > > > int BPF_PROG(on_lookup) > > { > > struct task_struct *task = bpf_get_current_task_btf(); > Thanks, -- Marco
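To see the shape Marco describes outside of the kernel sources, here is a small self-contained C sketch of the same pattern; all identifiers are invented for illustration, and the toy spinlock merely stands in for the out-of-line _raw_spin_lock_irqsave/_raw_spin_unlock_irqrestore calls on x86 preempt builds. The lookup fast path stays small and inlinable, and only the cold cache-insertion step is a separate function whose final unlock can become a tail call:

/* Sketch only: a header-style inline fast path with the cold cache
 * insertion outlined, mirroring the structure of the v2 patch.
 */
struct elem {
	struct elem *next;
	int key;
	long val;
};

struct storage {
	struct elem *cache;	/* one-slot cache, like cache[smap->cache_idx] */
	struct elem *list;	/* linked list, like local_storage->list */
	int lock;
};

/* Toy spinlock: in the kernel these are out-of-line calls on preempt
 * builds, which is why keeping the unlock as the helper's last action
 * matters.
 */
static void lock_acquire(int *lock)
{
	while (__sync_lock_test_and_set(lock, 1))
		;
}

static void lock_release(int *lock)
{
	__sync_lock_release(lock);
}

/* Outlined cold path: publish the found element in the one-slot cache. */
void cache_insert_slow(struct storage *s, struct elem *e)
{
	lock_acquire(&s->lock);
	s->cache = e;
	lock_release(&s->lock);	/* last action -> candidate for a tail call */
}

/* Header-style fast path: small enough that its callers keep being
 * inlined; the call below disappears entirely when 'cacheit' is a
 * compile-time constant 0.
 */
static inline struct elem *cache_lookup(struct storage *s, int key, int cacheit)
{
	struct elem *e = s->cache;

	if (e && e->key == key)			/* fast path: cache hit */
		return e;

	for (e = s->list; e; e = e->next)	/* slow path: walk the list */
		if (e->key == key)
			break;

	if (e && cacheit)
		cache_insert_slow(s, e);
	return e;
}

Whether the unlock really becomes a tail call depends on the compiler and on the lock functions being out of line, as in the disassembly above.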
diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h index 173ec7f43ed1..c8cecf7fff87 100644 --- a/include/linux/bpf_local_storage.h +++ b/include/linux/bpf_local_storage.h @@ -130,9 +130,24 @@ bpf_local_storage_map_alloc(union bpf_attr *attr, bool bpf_ma); struct bpf_local_storage_data * +bpf_local_storage_lookup_slowpath(struct bpf_local_storage *local_storage, + struct bpf_local_storage_map *smap, + bool cacheit_lockit); +static inline struct bpf_local_storage_data * bpf_local_storage_lookup(struct bpf_local_storage *local_storage, struct bpf_local_storage_map *smap, - bool cacheit_lockit); + bool cacheit_lockit) +{ + struct bpf_local_storage_data *sdata; + + /* Fast path (cache hit) */ + sdata = rcu_dereference_check(local_storage->cache[smap->cache_idx], + bpf_rcu_lock_held()); + if (likely(sdata && rcu_access_pointer(sdata->smap) == smap)) + return sdata; + + return bpf_local_storage_lookup_slowpath(local_storage, smap, cacheit_lockit); +} void bpf_local_storage_destroy(struct bpf_local_storage *local_storage); diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c index 146824cc9689..2ef782a1bd6f 100644 --- a/kernel/bpf/bpf_local_storage.c +++ b/kernel/bpf/bpf_local_storage.c @@ -415,20 +415,14 @@ void bpf_selem_unlink(struct bpf_local_storage_elem *selem, bool reuse_now) } /* If cacheit_lockit is false, this lookup function is lockless */ -struct bpf_local_storage_data * -bpf_local_storage_lookup(struct bpf_local_storage *local_storage, - struct bpf_local_storage_map *smap, - bool cacheit_lockit) +noinline struct bpf_local_storage_data * +bpf_local_storage_lookup_slowpath(struct bpf_local_storage *local_storage, + struct bpf_local_storage_map *smap, + bool cacheit_lockit) { struct bpf_local_storage_data *sdata; struct bpf_local_storage_elem *selem; - /* Fast path (cache hit) */ - sdata = rcu_dereference_check(local_storage->cache[smap->cache_idx], - bpf_rcu_lock_held()); - if (sdata && rcu_access_pointer(sdata->smap) == smap) - return sdata; - /* Slow path (cache miss) */ hlist_for_each_entry_rcu(selem, &local_storage->list, snode, rcu_read_lock_trace_held()) diff --git a/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c b/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c index a043d8fefdac..9895087a9235 100644 --- a/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c +++ b/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c @@ -21,7 +21,7 @@ struct { __type(value, long); } map_b SEC(".maps"); -SEC("fentry/bpf_local_storage_lookup") +SEC("fentry/bpf_local_storage_lookup_slowpath") int BPF_PROG(on_lookup) { struct task_struct *task = bpf_get_current_task_btf(); diff --git a/tools/testing/selftests/bpf/progs/task_ls_recursion.c b/tools/testing/selftests/bpf/progs/task_ls_recursion.c index 4542dc683b44..d73b33a4c153 100644 --- a/tools/testing/selftests/bpf/progs/task_ls_recursion.c +++ b/tools/testing/selftests/bpf/progs/task_ls_recursion.c @@ -27,7 +27,7 @@ struct { __type(value, long); } map_b SEC(".maps"); -SEC("fentry/bpf_local_storage_lookup") +SEC("fentry/bpf_local_storage_lookup_slowpath") int BPF_PROG(on_lookup) { struct task_struct *task = bpf_get_current_task_btf();
To allow the compiler to inline the bpf_local_storage_lookup() fast- path, factor it out by making bpf_local_storage_lookup() a static inline function and move the slow-path to bpf_local_storage_lookup_slowpath(). Base on results from './benchs/run_bench_local_storage.sh' this produces improvements in throughput and latency in the majority of cases: | Hashmap Control | =============== | num keys: 10 | hashmap (control) sequential get: | <before> | <after> | hits throughput: 13.895 ± 0.024 M ops/s | 14.022 ± 0.095 M ops/s (+0.9%) | hits latency: 71.968 ns/op | 71.318 ns/op (-0.9%) | important_hits throughput: 13.895 ± 0.024 M ops/s | 14.022 ± 0.095 M ops/s (+0.9%) | | num keys: 1000 | hashmap (control) sequential get: | <before> | <after> | hits throughput: 11.793 ± 0.018 M ops/s | 11.645 ± 0.370 M ops/s (-1.3%) | hits latency: 84.794 ns/op | 85.874 ns/op (+1.3%) | important_hits throughput: 11.793 ± 0.018 M ops/s | 11.645 ± 0.370 M ops/s (-1.3%) | | num keys: 10000 | hashmap (control) sequential get: | <before> | <after> | hits throughput: 7.113 ± 0.012 M ops/s | 7.037 ± 0.051 M ops/s (-1.1%) | hits latency: 140.581 ns/op | 142.113 ns/op (+1.1%) | important_hits throughput: 7.113 ± 0.012 M ops/s | 7.037 ± 0.051 M ops/s (-1.1%) | | num keys: 100000 | hashmap (control) sequential get: | <before> | <after> | hits throughput: 4.793 ± 0.034 M ops/s | 4.990 ± 0.025 M ops/s (+4.1%) | hits latency: 208.623 ns/op | 200.401 ns/op (-3.9%) | important_hits throughput: 4.793 ± 0.034 M ops/s | 4.990 ± 0.025 M ops/s (+4.1%) | | num keys: 4194304 | hashmap (control) sequential get: | <before> | <after> | hits throughput: 2.088 ± 0.008 M ops/s | 2.962 ± 0.004 M ops/s (+41.9%) | hits latency: 478.851 ns/op | 337.648 ns/op (-29.5%) | important_hits throughput: 2.088 ± 0.008 M ops/s | 2.962 ± 0.004 M ops/s (+41.9%) | | Local Storage | ============= | num_maps: 1 | local_storage cache sequential get: | <before> | <after> | hits throughput: 32.598 ± 0.008 M ops/s | 38.480 ± 0.054 M ops/s (+18.0%) | hits latency: 30.676 ns/op | 25.988 ns/op (-15.3%) | important_hits throughput: 32.598 ± 0.008 M ops/s | 38.480 ± 0.054 M ops/s (+18.0%) | local_storage cache interleaved get: | <before> | <after> | hits throughput: 36.963 ± 0.045 M ops/s | 43.847 ± 0.037 M ops/s (+18.6%) | hits latency: 27.054 ns/op | 22.807 ns/op (-15.7%) | important_hits throughput: 36.963 ± 0.045 M ops/s | 43.847 ± 0.037 M ops/s (+18.6%) | | num_maps: 10 | local_storage cache sequential get: | <before> | <after> | hits throughput: 32.078 ± 0.004 M ops/s | 37.813 ± 0.020 M ops/s (+17.9%) | hits latency: 31.174 ns/op | 26.446 ns/op (-15.2%) | important_hits throughput: 3.208 ± 0.000 M ops/s | 3.781 ± 0.002 M ops/s (+17.9%) | local_storage cache interleaved get: | <before> | <after> | hits throughput: 34.564 ± 0.011 M ops/s | 40.082 ± 0.037 M ops/s (+16.0%) | hits latency: 28.932 ns/op | 24.949 ns/op (-13.8%) | important_hits throughput: 12.344 ± 0.004 M ops/s | 14.315 ± 0.013 M ops/s (+16.0%) | | num_maps: 16 | local_storage cache sequential get: | <before> | <after> | hits throughput: 32.493 ± 0.023 M ops/s | 38.147 ± 0.029 M ops/s (+17.4%) | hits latency: 30.776 ns/op | 26.215 ns/op (-14.8%) | important_hits throughput: 2.031 ± 0.001 M ops/s | 2.384 ± 0.002 M ops/s (+17.4%) | local_storage cache interleaved get: | <before> | <after> | hits throughput: 34.380 ± 0.521 M ops/s | 41.605 ± 0.095 M ops/s (+21.0%) | hits latency: 29.087 ns/op | 24.035 ns/op (-17.4%) | important_hits throughput: 10.939 ± 0.166 M ops/s | 13.238 ± 0.030 M ops/s (+21.0%) | 
| num_maps: 17 | local_storage cache sequential get: | <before> | <after> | hits throughput: 28.748 ± 0.028 M ops/s | 32.248 ± 0.080 M ops/s (+12.2%) | hits latency: 34.785 ns/op | 31.009 ns/op (-10.9%) | important_hits throughput: 1.693 ± 0.002 M ops/s | 1.899 ± 0.005 M ops/s (+12.2%) | local_storage cache interleaved get: | <before> | <after> | hits throughput: 31.313 ± 0.030 M ops/s | 35.911 ± 0.020 M ops/s (+14.7%) | hits latency: 31.936 ns/op | 27.847 ns/op (-12.8%) | important_hits throughput: 9.533 ± 0.009 M ops/s | 10.933 ± 0.006 M ops/s (+14.7%) | | num_maps: 24 | local_storage cache sequential get: | <before> | <after> | hits throughput: 18.475 ± 0.027 M ops/s | 19.000 ± 0.006 M ops/s (+2.8%) | hits latency: 54.127 ns/op | 52.632 ns/op (-2.8%) | important_hits throughput: 0.770 ± 0.001 M ops/s | 0.792 ± 0.000 M ops/s (+2.9%) | local_storage cache interleaved get: | <before> | <after> | hits throughput: 21.361 ± 0.028 M ops/s | 22.388 ± 0.099 M ops/s (+4.8%) | hits latency: 46.814 ns/op | 44.667 ns/op (-4.6%) | important_hits throughput: 6.009 ± 0.008 M ops/s | 6.298 ± 0.028 M ops/s (+4.8%) | | num_maps: 32 | local_storage cache sequential get: | <before> | <after> | hits throughput: 14.220 ± 0.006 M ops/s | 14.168 ± 0.020 M ops/s (-0.4%) | hits latency: 70.323 ns/op | 70.580 ns/op (+0.4%) | important_hits throughput: 0.445 ± 0.000 M ops/s | 0.443 ± 0.001 M ops/s (-0.4%) | local_storage cache interleaved get: | <before> | <after> | hits throughput: 17.250 ± 0.011 M ops/s | 16.650 ± 0.021 M ops/s (-3.5%) | hits latency: 57.971 ns/op | 60.061 ns/op (+3.6%) | important_hits throughput: 4.815 ± 0.003 M ops/s | 4.647 ± 0.006 M ops/s (-3.5%) | | num_maps: 100 | local_storage cache sequential get: | <before> | <after> | hits throughput: 5.212 ± 0.012 M ops/s | 5.878 ± 0.004 M ops/s (+12.8%) | hits latency: 191.877 ns/op | 170.116 ns/op (-11.3%) | important_hits throughput: 0.052 ± 0.000 M ops/s | 0.059 ± 0.000 M ops/s (+13.5%) | local_storage cache interleaved get: | <before> | <after> | hits throughput: 6.521 ± 0.053 M ops/s | 7.086 ± 0.010 M ops/s (+8.7%) | hits latency: 153.343 ns/op | 141.116 ns/op (-8.0%) | important_hits throughput: 1.703 ± 0.014 M ops/s | 1.851 ± 0.003 M ops/s (+8.7%) | | num_maps: 1000 | local_storage cache sequential get: | <before> | <after> | hits throughput: 0.357 ± 0.005 M ops/s | 0.325 ± 0.005 M ops/s (-9.0%) | hits latency: 2803.738 ns/op | 3076.923 ns/op (+9.7%) | important_hits throughput: 0.000 ± 0.000 M ops/s | 0.000 ± 0.000 M ops/s | local_storage cache interleaved get: | <before> | <after> | hits throughput: 0.434 ± 0.007 M ops/s | 0.447 ± 0.007 M ops/s (+3.0%) | hits latency: 2306.539 ns/op | 2237.687 ns/op (-3.0%) | important_hits throughput: 0.109 ± 0.002 M ops/s | 0.112 ± 0.002 M ops/s (+2.8%) Signed-off-by: Marco Elver <elver@google.com> --- include/linux/bpf_local_storage.h | 17 ++++++++++++++++- kernel/bpf/bpf_local_storage.c | 14 ++++---------- .../selftests/bpf/progs/cgrp_ls_recursion.c | 2 +- .../selftests/bpf/progs/task_ls_recursion.c | 2 +- 4 files changed, 22 insertions(+), 13 deletions(-)
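Since the selftests in question attach by kernel function name, the following user-space sketch shows why the SEC() rename in the hunks above is needed: libbpf resolves the fentry target against kernel BTF at load time, so a target that has been inlined away (or renamed) makes the load fail. The object file name "task_ls_recursion.bpf.o" and the program name "on_lookup" are placeholders, and error handling is reduced to the essentials.

/* Sketch: how a test runner would load and attach the fentry program. */
#include <stdio.h>
#include <bpf/libbpf.h>

int main(void)
{
	struct bpf_object *obj;
	struct bpf_program *prog;
	struct bpf_link *link;

	obj = bpf_object__open_file("task_ls_recursion.bpf.o", NULL);
	if (!obj)
		return 1;

	/* Fails if the target named in SEC("fentry/...") cannot be
	 * resolved against kernel BTF, e.g. because the function was
	 * inlined away or renamed.
	 */
	if (bpf_object__load(obj)) {
		fprintf(stderr, "load failed (fentry target missing?)\n");
		bpf_object__close(obj);
		return 1;
	}

	prog = bpf_object__find_program_by_name(obj, "on_lookup");
	link = prog ? bpf_program__attach(prog) : NULL;
	if (!link) {
		bpf_object__close(obj);
		return 1;
	}

	/* ... trigger task storage lookups and check the test's maps ... */

	bpf_link__destroy(link);
	bpf_object__close(obj);
	return 0;
}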