Message ID | 20241213223641.564002-8-ihor.solodrai@pm.me (mailing list archive) |
---|---|
State | Not Applicable |
Headers | show |
Series | pahole: shared ELF and faster reproducible BTF encoding | expand |
Context | Check | Description |
---|---|---|
netdev/tree_selection | success | Not a local patch |
On Fri, 2024-12-13 at 22:37 +0000, Ihor Solodrai wrote: > Introduce a static struct holding global data necessary for BTF > encoding: elf_functions tables and btf_encoder structs. > > The context has init()/exit() interface that should be used to > indicate when BTF encoding work has started and ended. > > I considered freeing everything contained in the context exclusively > on exit(), however it turns out this unnecessarily increases max > RSS. Probably because the work done in btf_encoder__encode() requires > relatively more memory, and if encoders and tables are freed earlier, > that space is reused. > > Compare: > -j4: Maximum resident set size (kbytes): 868484 > -j8: Maximum resident set size (kbytes): 1003040 > -j16: Maximum resident set size (kbytes): 1039416 > -j32: Maximum resident set size (kbytes): 1145312 > vs > -j4: Maximum resident set size (kbytes): 972692 > -j8: Maximum resident set size (kbytes): 1043184 > -j16: Maximum resident set size (kbytes): 1081156 > -j32: Maximum resident set size (kbytes): 1218184 > > Signed-off-by: Ihor Solodrai <ihor.solodrai@pm.me> > --- After patch #10 "dwarf_loader: multithreading with a job/worker model" from this series I do not understand why this patch is necessary. After patch #10 there is only one BTF encoder, thus: - there is no need to track btf_encoder_list; - elf_functions_list can now be a part of the encoder; - it should be possible to forgo global variable for encoder and pass it as a parameter for each btf_encoder__* func. So it seems that this patch should be dropped and replaced by one that follows patch #10 and applies the above simplifications. Wdyt? [...]
On Mon, 2024-12-16 at 18:39 -0800, Eduard Zingerman wrote: > On Fri, 2024-12-13 at 22:37 +0000, Ihor Solodrai wrote: > > Introduce a static struct holding global data necessary for BTF > > encoding: elf_functions tables and btf_encoder structs. > > > > The context has init()/exit() interface that should be used to > > indicate when BTF encoding work has started and ended. > > > > I considered freeing everything contained in the context exclusively > > on exit(), however it turns out this unnecessarily increases max > > RSS. Probably because the work done in btf_encoder__encode() requires > > relatively more memory, and if encoders and tables are freed earlier, > > that space is reused. > > > > Compare: > > -j4: Maximum resident set size (kbytes): 868484 > > -j8: Maximum resident set size (kbytes): 1003040 > > -j16: Maximum resident set size (kbytes): 1039416 > > -j32: Maximum resident set size (kbytes): 1145312 > > vs > > -j4: Maximum resident set size (kbytes): 972692 > > -j8: Maximum resident set size (kbytes): 1043184 > > -j16: Maximum resident set size (kbytes): 1081156 > > -j32: Maximum resident set size (kbytes): 1218184 > > > > Signed-off-by: Ihor Solodrai <ihor.solodrai@pm.me> > > --- > > After patch #10 "dwarf_loader: multithreading with a job/worker model" > from this series I do not understand why this patch is necessary. > After patch #10 there is only one BTF encoder, thus: > - there is no need to track btf_encoder_list; > - elf_functions_list can now be a part of the encoder; > - it should be possible to forgo global variable for encoder > and pass it as a parameter for each btf_encoder__* func. > > So it seems that this patch should be dropped and replaced by one that > follows patch #10 and applies the above simplifications. > Wdyt? Meaning that patch #6 "btf_encoder: switch to shared elf_functions table" is not necessary. Strictly speaking, patches 1,2,4 might not be necessary as well, but could be viewed as a refactoring. Switch to single-threaded BTF encoder significantly changes this patch-set.
On Monday, December 16th, 2024 at 7:15 PM, Eduard Zingerman <eddyz87@gmail.com> wrote: > > On Mon, 2024-12-16 at 18:39 -0800, Eduard Zingerman wrote: > > > On Fri, 2024-12-13 at 22:37 +0000, Ihor Solodrai wrote: > > > > > Introduce a static struct holding global data necessary for BTF > > > encoding: elf_functions tables and btf_encoder structs. > [...] > > > > After patch #10 "dwarf_loader: multithreading with a job/worker model" > > from this series I do not understand why this patch is necessary. > > After patch #10 there is only one BTF encoder, thus: > > - there is no need to track btf_encoder_list; > > - elf_functions_list can now be a part of the encoder; > > - it should be possible to forgo global variable for encoder > > and pass it as a parameter for each btf_encoder__* func. > > > > So it seems that this patch should be dropped and replaced by one that > > follows patch #10 and applies the above simplifications. > > Wdyt? > > > Meaning that patch #6 "btf_encoder: switch to shared elf_functions table" > is not necessary. Strictly speaking, patches 1,2,4 might not be necessary > as well, but could be viewed as a refactoring. > Switch to single-threaded BTF encoder significantly changes this patch-set. Eduard, thanks for the review again. You are correct: if we focus on the multithreading changes in dwarf_loader.c and make a decision that there is always a single btf_encoder, then much of this series can be discarded. At the same time I think most of the patches are useful. At the very least they enabled experiments that in the end lead me to the dwarf_loader changes. The changes making ELF functions table shared were beneficial in isolation, because they eliminated unnecessary duplication of information between encoders, leading to reduced memory usage. The changes splitting ELF and BTF function information in btf_encoder.c and simplifying function processing are also good in isolation. In my opinion, it's not wise to discard all of that, because it turned out that a single btf_encoder works better in the use-case we care about now. Later we might want to revisit parallel BTF encoding. Then some version of the refactoring changes here will have to be re-done. So I think it makes sense to land most of this series without significant re-work. But of course I am biased here, as I wrote most of the patches, and it's always painful to "throw away" effort. Let's see what others think.
On Tue, Dec 17, 2024 at 10:06 AM Ihor Solodrai <ihor.solodrai@pm.me> wrote: > > On Monday, December 16th, 2024 at 7:15 PM, Eduard Zingerman <eddyz87@gmail.com> wrote: > > > > > On Mon, 2024-12-16 at 18:39 -0800, Eduard Zingerman wrote: > > > > > On Fri, 2024-12-13 at 22:37 +0000, Ihor Solodrai wrote: > > > > > > > Introduce a static struct holding global data necessary for BTF > > > > encoding: elf_functions tables and btf_encoder structs. > > [...] > > > > > > After patch #10 "dwarf_loader: multithreading with a job/worker model" > > > from this series I do not understand why this patch is necessary. > > > After patch #10 there is only one BTF encoder, thus: > > > - there is no need to track btf_encoder_list; > > > - elf_functions_list can now be a part of the encoder; > > > - it should be possible to forgo global variable for encoder > > > and pass it as a parameter for each btf_encoder__* func. > > > > > > So it seems that this patch should be dropped and replaced by one that > > > follows patch #10 and applies the above simplifications. > > > Wdyt? > > > > > > Meaning that patch #6 "btf_encoder: switch to shared elf_functions table" > > is not necessary. Strictly speaking, patches 1,2,4 might not be necessary > > as well, but could be viewed as a refactoring. > > Switch to single-threaded BTF encoder significantly changes this patch-set. > > Eduard, thanks for the review again. > > You are correct: if we focus on the multithreading changes in > dwarf_loader.c and make a decision that there is always a single > btf_encoder, then much of this series can be discarded. > > At the same time I think most of the patches are useful. At the very > least they enabled experiments that in the end lead me to the > dwarf_loader changes. > > The changes making ELF functions table shared were beneficial in > isolation, because they eliminated unnecessary duplication of > information between encoders, leading to reduced memory usage. > > The changes splitting ELF and BTF function information in > btf_encoder.c and simplifying function processing are also good in > isolation. > > In my opinion, it's not wise to discard all of that, because it turned > out that a single btf_encoder works better in the use-case we care > about now. Later we might want to revisit parallel BTF encoding. Then > some version of the refactoring changes here will have to be re-done. > > So I think it makes sense to land most of this series without > significant re-work. But of course I am biased here, as I wrote most > of the patches, and it's always painful to "throw away" effort. > > Let's see what others think. I agree with Ihor. I think he invested a lot of time into these improvements, and asking him to re-do the series just to shuffle a few patches around is just an unnecessary overhead (which also delays the ultimate outcome: faster BTF generation with pahole). And as Ihor mentioned, we might improve upon this series by parallelizing encoders to gain some further improvements, so I think all the internal refactoring and preparations are setting up a good base for further work. Let's try to finalize patch #10's implementation and land this. It's a nice improvement both performance-wise. It's also good that we don't need to care about reproducible or not and can effectively deprecate that short-lived feature we added not so long ago.
On Tue, 2024-12-17 at 16:03 -0800, Andrii Nakryiko wrote: [...] > I agree with Ihor. I think he invested a lot of time into these > improvements, and asking him to re-do the series just to shuffle a few > patches around is just an unnecessary overhead (which also delays the > ultimate outcome: faster BTF generation with pahole). Patches #1-4 are good refactorings. Patches #5-7 introduce a global state and complication where this could be avoided. (These were necessary before Ihor figured out the trick with patch #10). E.g. 'elf_functions_list': - there is no reason for it to be global, it could be a part of the encoder, as it is now; - there is no reason for it to be a list, encoding works with a single function table and keeping it as a list just confuses reader. Same for btf_encoder_list_lock / btf_encoder_list. > And as Ihor mentioned, we might improve upon this series by > parallelizing encoders to gain some further improvements, so I think > all the internal refactoring and preparations are setting up a good > base for further work. Not really, the main insight found by Ihor is that parallel BTF encoding doesn't make much sense: constructing BTF is as cheap as copying it. I don't see much room for improvement here. [...]
Hi everyone. I had a discussion off-list with Eduard and Andrii and we came to an agreement on the next iteration of this series. Patches #3 and #7 will be changed: since there is only one encoder now, there is no reason to maintain global state, such as list of btf_encoders and list of elf_functions structs and corresponding locks. Refactoring about separating elf_functions from btf_encoder will largely remain, but the object itself will live in the btf_encoder. Patch #10 (dwarf_loader workers) will be modified to use only one conditional variable as Eduard suggested [1]. Andrii and I also ran a bunch of experiments to understand why after a certain point adding more threads noticeably slows down the processing. It turns out that with the number of jobs growing, the dwarf loading threads start competing for memory allocation in cu__zalloc. This can be partially mitigated by setting a larger obstack chunk size. I will add a relevant patch in the next version of the series. [1] https://lore.kernel.org/dwarves/58dc053c9d47a18124d8711604b08acbc6400340.camel@gmail.com
On Tue, Dec 17, 2024 at 04:40:40PM -0800, Eduard Zingerman wrote: > On Tue, 2024-12-17 at 16:03 -0800, Andrii Nakryiko wrote: > > [...] > > > I agree with Ihor. I think he invested a lot of time into these > > improvements, and asking him to re-do the series just to shuffle a few > > patches around is just an unnecessary overhead (which also delays the > > ultimate outcome: faster BTF generation with pahole). > > Patches #1-4 are good refactorings. > Patches #5-7 introduce a global state and complication where this > could be avoided. (These were necessary before Ihor figured out the > trick with patch #10). > E.g. 'elf_functions_list': > - there is no reason for it to be global, it could be a part of the > encoder, as it is now; > - there is no reason for it to be a list, encoding works with a single > function table and keeping it as a list just confuses reader. agree, with just single encoder object I don't see a reason for the btf_encoding_context, keeping functions in encoder will be simpler thanks, jirka > > Same for btf_encoder_list_lock / btf_encoder_list. > > > And as Ihor mentioned, we might improve upon this series by > > parallelizing encoders to gain some further improvements, so I think > > all the internal refactoring and preparations are setting up a good > > base for further work. > > Not really, the main insight found by Ihor is that parallel BTF > encoding doesn't make much sense: constructing BTF is as cheap as > copying it. I don't see much room for improvement here. > > [...] > >
diff --git a/btf_encoder.c b/btf_encoder.c index 4a4f6c8..a362fb2 100644 --- a/btf_encoder.c +++ b/btf_encoder.c @@ -150,19 +150,73 @@ struct btf_kfunc_set_range { uint64_t end; }; +static struct { + bool initialized; + + /* In principle, multiple ELFs can be processed in one pahole + * run, so we have to store elf_functions table per ELF. + * An elf_functions struct is added to the list in + * btf_encoder__pre_load_module(). + * The list is cleared at the end of btf_encoder__add_saved_funcs(). + */ + struct list_head elf_functions_list; + + /* The mutex only needed for add/delete, as this can happen in + * multiple encoding threads. A btf_encoder is added to this + * list in btf_encoder__new(), and removed in btf_encoder__delete(). + * All encoders except the main one (`btf_encoder` in pahole.c) + * are deleted in pahole_threads_collect(). + */ + pthread_mutex_t btf_encoder_list_lock; + struct list_head btf_encoder_list; + +} btf_encoding_context; + +int btf_encoding_context__init(void) +{ + int err = 0; + + if (btf_encoding_context.initialized) { + fprintf(stderr, "%s was called while context is already initialized\n", __func__); + err = -1; + goto out; + } + + INIT_LIST_HEAD(&btf_encoding_context.elf_functions_list); + INIT_LIST_HEAD(&btf_encoding_context.btf_encoder_list); + pthread_mutex_init(&btf_encoding_context.btf_encoder_list_lock, NULL); + btf_encoding_context.initialized = true; + +out: + return err; +} + +static inline void btf_encoder__delete_all(void); +static inline void elf_functions__delete_all(void); + +void btf_encoding_context__exit(void) +{ + if (!btf_encoding_context.initialized) { + fprintf(stderr, "%s was called while context is not initialized\n", __func__); + return; + } + + if (!list_empty(&btf_encoding_context.elf_functions_list)) + elf_functions__delete_all(); + + if (!list_empty(&btf_encoding_context.btf_encoder_list)) + btf_encoder__delete_all(); + + pthread_mutex_destroy(&btf_encoding_context.btf_encoder_list_lock); + btf_encoding_context.initialized = false; +} -/* In principle, multiple ELFs can be processed in one pahole run, - * so we have to store elf_functions table per ELF. - * An element is added to the list on btf_encoder__pre_load_module, - * and removed after btf_encoder__encode is done. - */ -static LIST_HEAD(elf_functions_list); static struct elf_functions *elf_functions__get(Elf *elf) { struct list_head *pos; - list_for_each(pos, &elf_functions_list) { + list_for_each(pos, &btf_encoding_context.elf_functions_list) { struct elf_functions *funcs = list_entry(pos, struct elf_functions, node); if (funcs->elf == elf) @@ -186,7 +240,7 @@ static inline void elf_functions__delete_all(void) struct elf_functions *funcs; struct list_head *pos, *tmp; - list_for_each_safe(pos, tmp, &elf_functions_list) { + list_for_each_safe(pos, tmp, &btf_encoding_context.elf_functions_list) { funcs = list_entry(pos, struct elf_functions, node); elf_functions__delete(funcs); } @@ -216,7 +270,7 @@ int btf_encoder__pre_load_module(Dwfl_Module *mod, Elf *elf) if (err) goto out_delete; - list_add_tail(&funcs->node, &elf_functions_list); + list_add_tail(&funcs->node, &btf_encoding_context.elf_functions_list); return 0; @@ -225,29 +279,21 @@ out_delete: return err; } - -static LIST_HEAD(encoders); -static pthread_mutex_t encoders__lock = PTHREAD_MUTEX_INITIALIZER; - -/* mutex only needed for add/delete, as this can happen in multiple encoding - * threads. Traversal of the list is currently confined to thread collection. - */ - #define btf_encoders__for_each_encoder(encoder) \ - list_for_each_entry(encoder, &encoders, node) + list_for_each_entry(encoder, &btf_encoding_context.btf_encoder_list, node) static void btf_encoders__add(struct btf_encoder *encoder) { - pthread_mutex_lock(&encoders__lock); - list_add_tail(&encoder->node, &encoders); - pthread_mutex_unlock(&encoders__lock); + pthread_mutex_lock(&btf_encoding_context.btf_encoder_list_lock); + list_add_tail(&encoder->node, &btf_encoding_context.btf_encoder_list); + pthread_mutex_unlock(&btf_encoding_context.btf_encoder_list_lock); } static void btf_encoders__delete(struct btf_encoder *encoder) { struct btf_encoder *existing = NULL; - pthread_mutex_lock(&encoders__lock); + pthread_mutex_lock(&btf_encoding_context.btf_encoder_list_lock); /* encoder may not have been added to list yet; check. */ btf_encoders__for_each_encoder(existing) { if (encoder == existing) @@ -255,7 +301,7 @@ static void btf_encoders__delete(struct btf_encoder *encoder) } if (encoder == existing) list_del(&encoder->node); - pthread_mutex_unlock(&encoders__lock); + pthread_mutex_unlock(&btf_encoding_context.btf_encoder_list_lock); } #define PERCPU_SECTION ".data..percpu" @@ -1385,6 +1431,9 @@ int btf_encoder__add_saved_funcs(struct btf_encoder *encoder) free(saved_fns); btf_encoders__for_each_encoder(e) btf_encoder__delete_saved_funcs(e); + + elf_functions__delete_all(); + return 0; } @@ -2178,7 +2227,6 @@ int btf_encoder__encode(struct btf_encoder *encoder) err = btf_encoder__write_elf(encoder, encoder->btf, BTF_ELF_SEC); } - elf_functions__delete_all(); return err; } @@ -2565,6 +2613,18 @@ void btf_encoder__delete(struct btf_encoder *encoder) free(encoder); } +static inline void btf_encoder__delete_all(void) +{ + struct btf_encoder *encoder; + struct list_head *pos, *tmp; + + list_for_each_safe(pos, tmp, &btf_encoding_context.btf_encoder_list) { + encoder = list_entry(pos, struct btf_encoder, node); + btf_encoder__delete(encoder); + } +} + + int btf_encoder__encode_cu(struct btf_encoder *encoder, struct cu *cu, struct conf_load *conf_load) { struct llvm_annotation *annot; diff --git a/btf_encoder.h b/btf_encoder.h index 7fa0390..f14edc1 100644 --- a/btf_encoder.h +++ b/btf_encoder.h @@ -23,6 +23,9 @@ enum btf_var_option { BTF_VAR_GLOBAL = 2, }; +int btf_encoding_context__init(void); +void btf_encoding_context__exit(void); + struct btf_encoder *btf_encoder__new(struct cu *cu, const char *detached_filename, struct btf *base_btf, bool verbose, struct conf_load *conf_load); void btf_encoder__delete(struct btf_encoder *encoder); diff --git a/pahole.c b/pahole.c index eb2e71a..6bbc9e4 100644 --- a/pahole.c +++ b/pahole.c @@ -3741,8 +3741,12 @@ int main(int argc, char *argv[]) conf_load.threads_collect = pahole_threads_collect; } - if (btf_encode) + if (btf_encode) { conf_load.pre_load_module = btf_encoder__pre_load_module; + err = btf_encoding_context__init(); + if (err < 0) + goto out; + } // Make 'pahole --header type < file' a shorter form of 'pahole -C type --count 1 < file' if (conf.header_type && !class_name && prettify_input) { @@ -3859,7 +3863,11 @@ try_sole_arg_as_class_names: goto out_cus_delete; } } + out_ok: + if (btf_encode) + btf_encoding_context__exit(); + if (stats_formatter != NULL) print_stats();
Introduce a static struct holding global data necessary for BTF encoding: elf_functions tables and btf_encoder structs. The context has init()/exit() interface that should be used to indicate when BTF encoding work has started and ended. I considered freeing everything contained in the context exclusively on exit(), however it turns out this unnecessarily increases max RSS. Probably because the work done in btf_encoder__encode() requires relatively more memory, and if encoders and tables are freed earlier, that space is reused. Compare: -j4: Maximum resident set size (kbytes): 868484 -j8: Maximum resident set size (kbytes): 1003040 -j16: Maximum resident set size (kbytes): 1039416 -j32: Maximum resident set size (kbytes): 1145312 vs -j4: Maximum resident set size (kbytes): 972692 -j8: Maximum resident set size (kbytes): 1043184 -j16: Maximum resident set size (kbytes): 1081156 -j32: Maximum resident set size (kbytes): 1218184 Signed-off-by: Ihor Solodrai <ihor.solodrai@pm.me> --- btf_encoder.c | 108 +++++++++++++++++++++++++++++++++++++++----------- btf_encoder.h | 3 ++ pahole.c | 10 ++++- 3 files changed, 96 insertions(+), 25 deletions(-)