diff mbox series

[dwarves,v2,07/10] btf_encoder: introduce btf_encoding_context

Message ID 20241213223641.564002-8-ihor.solodrai@pm.me (mailing list archive)
State Not Applicable
Headers show
Series pahole: shared ELF and faster reproducible BTF encoding | expand

Checks

Context Check Description
netdev/tree_selection success Not a local patch

Commit Message

Ihor Solodrai Dec. 13, 2024, 10:37 p.m. UTC
Introduce a static struct holding global data necessary for BTF
encoding: elf_functions tables and btf_encoder structs.

The context has init()/exit() interface that should be used to
indicate when BTF encoding work has started and ended.

I considered freeing everything contained in the context exclusively
on exit(), however it turns out this unnecessarily increases max
RSS. Probably because the work done in btf_encoder__encode() requires
relatively more memory, and if encoders and tables are freed earlier,
that space is reused.

Compare:
    -j4: 	Maximum resident set size (kbytes): 868484
    -j8: 	Maximum resident set size (kbytes): 1003040
    -j16: 	Maximum resident set size (kbytes): 1039416
    -j32: 	Maximum resident set size (kbytes): 1145312
vs
    -j4: 	Maximum resident set size (kbytes): 972692
    -j8: 	Maximum resident set size (kbytes): 1043184
    -j16: 	Maximum resident set size (kbytes): 1081156
    -j32: 	Maximum resident set size (kbytes): 1218184

Signed-off-by: Ihor Solodrai <ihor.solodrai@pm.me>
---
 btf_encoder.c | 108 +++++++++++++++++++++++++++++++++++++++-----------
 btf_encoder.h |   3 ++
 pahole.c      |  10 ++++-
 3 files changed, 96 insertions(+), 25 deletions(-)

Comments

Eduard Zingerman Dec. 17, 2024, 2:39 a.m. UTC | #1
On Fri, 2024-12-13 at 22:37 +0000, Ihor Solodrai wrote:
> Introduce a static struct holding global data necessary for BTF
> encoding: elf_functions tables and btf_encoder structs.
> 
> The context has init()/exit() interface that should be used to
> indicate when BTF encoding work has started and ended.
> 
> I considered freeing everything contained in the context exclusively
> on exit(), however it turns out this unnecessarily increases max
> RSS. Probably because the work done in btf_encoder__encode() requires
> relatively more memory, and if encoders and tables are freed earlier,
> that space is reused.
> 
> Compare:
>     -j4: 	Maximum resident set size (kbytes): 868484
>     -j8: 	Maximum resident set size (kbytes): 1003040
>     -j16: 	Maximum resident set size (kbytes): 1039416
>     -j32: 	Maximum resident set size (kbytes): 1145312
> vs
>     -j4: 	Maximum resident set size (kbytes): 972692
>     -j8: 	Maximum resident set size (kbytes): 1043184
>     -j16: 	Maximum resident set size (kbytes): 1081156
>     -j32: 	Maximum resident set size (kbytes): 1218184
> 
> Signed-off-by: Ihor Solodrai <ihor.solodrai@pm.me>
> ---

After patch #10 "dwarf_loader: multithreading with a job/worker model"
from this series I do not understand why this patch is necessary.
After patch #10 there is only one BTF encoder, thus:
- there is no need to track btf_encoder_list;
- elf_functions_list can now be a part of the encoder;
- it should be possible to forgo global variable for encoder
  and pass it as a parameter for each btf_encoder__* func.

So it seems that this patch should be dropped and replaced by one that
follows patch #10 and applies the above simplifications.
Wdyt?

[...]
Eduard Zingerman Dec. 17, 2024, 3:15 a.m. UTC | #2
On Mon, 2024-12-16 at 18:39 -0800, Eduard Zingerman wrote:
> On Fri, 2024-12-13 at 22:37 +0000, Ihor Solodrai wrote:
> > Introduce a static struct holding global data necessary for BTF
> > encoding: elf_functions tables and btf_encoder structs.
> > 
> > The context has init()/exit() interface that should be used to
> > indicate when BTF encoding work has started and ended.
> > 
> > I considered freeing everything contained in the context exclusively
> > on exit(), however it turns out this unnecessarily increases max
> > RSS. Probably because the work done in btf_encoder__encode() requires
> > relatively more memory, and if encoders and tables are freed earlier,
> > that space is reused.
> > 
> > Compare:
> >     -j4: 	Maximum resident set size (kbytes): 868484
> >     -j8: 	Maximum resident set size (kbytes): 1003040
> >     -j16: 	Maximum resident set size (kbytes): 1039416
> >     -j32: 	Maximum resident set size (kbytes): 1145312
> > vs
> >     -j4: 	Maximum resident set size (kbytes): 972692
> >     -j8: 	Maximum resident set size (kbytes): 1043184
> >     -j16: 	Maximum resident set size (kbytes): 1081156
> >     -j32: 	Maximum resident set size (kbytes): 1218184
> > 
> > Signed-off-by: Ihor Solodrai <ihor.solodrai@pm.me>
> > ---
> 
> After patch #10 "dwarf_loader: multithreading with a job/worker model"
> from this series I do not understand why this patch is necessary.
> After patch #10 there is only one BTF encoder, thus:
> - there is no need to track btf_encoder_list;
> - elf_functions_list can now be a part of the encoder;
> - it should be possible to forgo global variable for encoder
>   and pass it as a parameter for each btf_encoder__* func.
> 
> So it seems that this patch should be dropped and replaced by one that
> follows patch #10 and applies the above simplifications.
> Wdyt?

Meaning that patch #6 "btf_encoder: switch to shared elf_functions table"
is not necessary. Strictly speaking, patches 1,2,4 might not be necessary
as well, but could be viewed as a refactoring.
Switch to single-threaded BTF encoder significantly changes this patch-set.
Ihor Solodrai Dec. 17, 2024, 6:06 p.m. UTC | #3
On Monday, December 16th, 2024 at 7:15 PM, Eduard Zingerman <eddyz87@gmail.com> wrote:

> 
> On Mon, 2024-12-16 at 18:39 -0800, Eduard Zingerman wrote:
> 
> > On Fri, 2024-12-13 at 22:37 +0000, Ihor Solodrai wrote:
> > 
> > > Introduce a static struct holding global data necessary for BTF
> > > encoding: elf_functions tables and btf_encoder structs.
> [...]
> > 
> > After patch #10 "dwarf_loader: multithreading with a job/worker model"
> > from this series I do not understand why this patch is necessary.
> > After patch #10 there is only one BTF encoder, thus:
> > - there is no need to track btf_encoder_list;
> > - elf_functions_list can now be a part of the encoder;
> > - it should be possible to forgo global variable for encoder
> > and pass it as a parameter for each btf_encoder__* func.
> > 
> > So it seems that this patch should be dropped and replaced by one that
> > follows patch #10 and applies the above simplifications.
> > Wdyt?
> 
> 
> Meaning that patch #6 "btf_encoder: switch to shared elf_functions table"
> is not necessary. Strictly speaking, patches 1,2,4 might not be necessary
> as well, but could be viewed as a refactoring.
> Switch to single-threaded BTF encoder significantly changes this patch-set.

Eduard, thanks for the review again.

You are correct: if we focus on the multithreading changes in
dwarf_loader.c and make a decision that there is always a single
btf_encoder, then much of this series can be discarded.

At the same time I think most of the patches are useful. At the very
least they enabled experiments that in the end lead me to the
dwarf_loader changes.

The changes making ELF functions table shared were beneficial in
isolation, because they eliminated unnecessary duplication of
information between encoders, leading to reduced memory usage.

The changes splitting ELF and BTF function information in
btf_encoder.c and simplifying function processing are also good in
isolation.

In my opinion, it's not wise to discard all of that, because it turned
out that a single btf_encoder works better in the use-case we care
about now. Later we might want to revisit parallel BTF encoding. Then
some version of the refactoring changes here will have to be re-done.

So I think it makes sense to land most of this series without
significant re-work. But of course I am biased here, as I wrote most
of the patches, and it's always painful to "throw away" effort.

Let's see what others think.
Andrii Nakryiko Dec. 18, 2024, 12:03 a.m. UTC | #4
On Tue, Dec 17, 2024 at 10:06 AM Ihor Solodrai <ihor.solodrai@pm.me> wrote:
>
> On Monday, December 16th, 2024 at 7:15 PM, Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> >
> > On Mon, 2024-12-16 at 18:39 -0800, Eduard Zingerman wrote:
> >
> > > On Fri, 2024-12-13 at 22:37 +0000, Ihor Solodrai wrote:
> > >
> > > > Introduce a static struct holding global data necessary for BTF
> > > > encoding: elf_functions tables and btf_encoder structs.
> > [...]
> > >
> > > After patch #10 "dwarf_loader: multithreading with a job/worker model"
> > > from this series I do not understand why this patch is necessary.
> > > After patch #10 there is only one BTF encoder, thus:
> > > - there is no need to track btf_encoder_list;
> > > - elf_functions_list can now be a part of the encoder;
> > > - it should be possible to forgo global variable for encoder
> > > and pass it as a parameter for each btf_encoder__* func.
> > >
> > > So it seems that this patch should be dropped and replaced by one that
> > > follows patch #10 and applies the above simplifications.
> > > Wdyt?
> >
> >
> > Meaning that patch #6 "btf_encoder: switch to shared elf_functions table"
> > is not necessary. Strictly speaking, patches 1,2,4 might not be necessary
> > as well, but could be viewed as a refactoring.
> > Switch to single-threaded BTF encoder significantly changes this patch-set.
>
> Eduard, thanks for the review again.
>
> You are correct: if we focus on the multithreading changes in
> dwarf_loader.c and make a decision that there is always a single
> btf_encoder, then much of this series can be discarded.
>
> At the same time I think most of the patches are useful. At the very
> least they enabled experiments that in the end lead me to the
> dwarf_loader changes.
>
> The changes making ELF functions table shared were beneficial in
> isolation, because they eliminated unnecessary duplication of
> information between encoders, leading to reduced memory usage.
>
> The changes splitting ELF and BTF function information in
> btf_encoder.c and simplifying function processing are also good in
> isolation.
>
> In my opinion, it's not wise to discard all of that, because it turned
> out that a single btf_encoder works better in the use-case we care
> about now. Later we might want to revisit parallel BTF encoding. Then
> some version of the refactoring changes here will have to be re-done.
>
> So I think it makes sense to land most of this series without
> significant re-work. But of course I am biased here, as I wrote most
> of the patches, and it's always painful to "throw away" effort.
>
> Let's see what others think.

I agree with Ihor. I think he invested a lot of time into these
improvements, and asking him to re-do the series just to shuffle a few
patches around is just an unnecessary overhead (which also delays the
ultimate outcome: faster BTF generation with pahole). And as Ihor
mentioned, we might improve upon this series by parallelizing encoders
to gain some further improvements, so I think all the internal
refactoring and preparations are setting up a good base for further
work.

Let's try to finalize patch #10's implementation and land this. It's a
nice improvement both performance-wise. It's also good that we don't
need to care about reproducible or not and can effectively deprecate
that short-lived feature we added not so long ago.
Eduard Zingerman Dec. 18, 2024, 12:40 a.m. UTC | #5
On Tue, 2024-12-17 at 16:03 -0800, Andrii Nakryiko wrote:

[...]

> I agree with Ihor. I think he invested a lot of time into these
> improvements, and asking him to re-do the series just to shuffle a few
> patches around is just an unnecessary overhead (which also delays the
> ultimate outcome: faster BTF generation with pahole).

Patches #1-4 are good refactorings.
Patches #5-7 introduce a global state and complication where this
could be avoided. (These were necessary before Ihor figured out the
trick with patch #10).
E.g. 'elf_functions_list':
- there is no reason for it to be global, it could be a part of the
  encoder, as it is now;
- there is no reason for it to be a list, encoding works with a single
  function table and keeping it as a list just confuses reader.

Same for btf_encoder_list_lock / btf_encoder_list.

> And as Ihor mentioned, we might improve upon this series by
> parallelizing encoders to gain some further improvements, so I think
> all the internal refactoring and preparations are setting up a good
> base for further work.

Not really, the main insight found by Ihor is that parallel BTF
encoding doesn't make much sense: constructing BTF is as cheap as
copying it. I don't see much room for improvement here.

[...]
Ihor Solodrai Dec. 18, 2024, 8:07 p.m. UTC | #6
Hi everyone.

I had a discussion off-list with Eduard and Andrii and we came to an
agreement on the next iteration of this series.

Patches #3 and #7 will be changed: since there is only one encoder
now, there is no reason to maintain global state, such as list of
btf_encoders and list of elf_functions structs and corresponding
locks. Refactoring about separating elf_functions from btf_encoder
will largely remain, but the object itself will live in the
btf_encoder.

Patch #10 (dwarf_loader workers) will be modified to use only one
conditional variable as Eduard suggested [1].

Andrii and I also ran a bunch of experiments to understand why after a
certain point adding more threads noticeably slows down the
processing. It turns out that with the number of jobs growing, the
dwarf loading threads start competing for memory allocation in
cu__zalloc. This can be partially mitigated by setting a larger
obstack chunk size. I will add a relevant patch in the next version of
the series.

[1] https://lore.kernel.org/dwarves/58dc053c9d47a18124d8711604b08acbc6400340.camel@gmail.com
Jiri Olsa Dec. 19, 2024, 2:59 p.m. UTC | #7
On Tue, Dec 17, 2024 at 04:40:40PM -0800, Eduard Zingerman wrote:
> On Tue, 2024-12-17 at 16:03 -0800, Andrii Nakryiko wrote:
> 
> [...]
> 
> > I agree with Ihor. I think he invested a lot of time into these
> > improvements, and asking him to re-do the series just to shuffle a few
> > patches around is just an unnecessary overhead (which also delays the
> > ultimate outcome: faster BTF generation with pahole).
> 
> Patches #1-4 are good refactorings.
> Patches #5-7 introduce a global state and complication where this
> could be avoided. (These were necessary before Ihor figured out the
> trick with patch #10).
> E.g. 'elf_functions_list':
> - there is no reason for it to be global, it could be a part of the
>   encoder, as it is now;
> - there is no reason for it to be a list, encoding works with a single
>   function table and keeping it as a list just confuses reader.

agree, with just single encoder object I don't see a reason for the
btf_encoding_context, keeping functions in encoder will be simpler

thanks,
jirka

> 
> Same for btf_encoder_list_lock / btf_encoder_list.
> 
> > And as Ihor mentioned, we might improve upon this series by
> > parallelizing encoders to gain some further improvements, so I think
> > all the internal refactoring and preparations are setting up a good
> > base for further work.
> 
> Not really, the main insight found by Ihor is that parallel BTF
> encoding doesn't make much sense: constructing BTF is as cheap as
> copying it. I don't see much room for improvement here.
> 
> [...]
> 
>
diff mbox series

Patch

diff --git a/btf_encoder.c b/btf_encoder.c
index 4a4f6c8..a362fb2 100644
--- a/btf_encoder.c
+++ b/btf_encoder.c
@@ -150,19 +150,73 @@  struct btf_kfunc_set_range {
 	uint64_t end;
 };
 
+static struct {
+	bool initialized;
+
+	/* In principle, multiple ELFs can be processed in one pahole
+	 * run, so we have to store elf_functions table per ELF.
+	 * An elf_functions struct is added to the list in
+	 * btf_encoder__pre_load_module().
+	 * The list is cleared at the end of btf_encoder__add_saved_funcs().
+	 */
+	struct list_head elf_functions_list;
+
+	/* The mutex only needed for add/delete, as this can happen in
+	 * multiple encoding threads.  A btf_encoder is added to this
+	 * list in btf_encoder__new(), and removed in btf_encoder__delete().
+	 * All encoders except the main one (`btf_encoder` in pahole.c)
+	 * are deleted in pahole_threads_collect().
+	 */
+	pthread_mutex_t btf_encoder_list_lock;
+	struct list_head btf_encoder_list;
+
+} btf_encoding_context;
+
+int btf_encoding_context__init(void)
+{
+	int err = 0;
+
+	if (btf_encoding_context.initialized) {
+		fprintf(stderr, "%s was called while context is already initialized\n", __func__);
+		err = -1;
+		goto out;
+	}
+
+	INIT_LIST_HEAD(&btf_encoding_context.elf_functions_list);
+	INIT_LIST_HEAD(&btf_encoding_context.btf_encoder_list);
+	pthread_mutex_init(&btf_encoding_context.btf_encoder_list_lock, NULL);
+	btf_encoding_context.initialized = true;
+
+out:
+	return err;
+}
+
+static inline void btf_encoder__delete_all(void);
+static inline void elf_functions__delete_all(void);
+
+void btf_encoding_context__exit(void)
+{
+	if (!btf_encoding_context.initialized) {
+		fprintf(stderr, "%s was called while context is not initialized\n", __func__);
+		return;
+	}
+
+	if (!list_empty(&btf_encoding_context.elf_functions_list))
+		elf_functions__delete_all();
+
+	if (!list_empty(&btf_encoding_context.btf_encoder_list))
+		btf_encoder__delete_all();
+
+	pthread_mutex_destroy(&btf_encoding_context.btf_encoder_list_lock);
+	btf_encoding_context.initialized = false;
+}
 
-/* In principle, multiple ELFs can be processed in one pahole run,
- * so we have to store elf_functions table per ELF.
- * An element is added to the list on btf_encoder__pre_load_module,
- * and removed after btf_encoder__encode is done.
- */
-static LIST_HEAD(elf_functions_list);
 
 static struct elf_functions *elf_functions__get(Elf *elf)
 {
 	struct list_head *pos;
 
-	list_for_each(pos, &elf_functions_list) {
+	list_for_each(pos, &btf_encoding_context.elf_functions_list) {
 		struct elf_functions *funcs = list_entry(pos, struct elf_functions, node);
 
 		if (funcs->elf == elf)
@@ -186,7 +240,7 @@  static inline void elf_functions__delete_all(void)
 	struct elf_functions *funcs;
 	struct list_head *pos, *tmp;
 
-	list_for_each_safe(pos, tmp, &elf_functions_list) {
+	list_for_each_safe(pos, tmp, &btf_encoding_context.elf_functions_list) {
 		funcs = list_entry(pos, struct elf_functions, node);
 		elf_functions__delete(funcs);
 	}
@@ -216,7 +270,7 @@  int btf_encoder__pre_load_module(Dwfl_Module *mod, Elf *elf)
 	if (err)
 		goto out_delete;
 
-	list_add_tail(&funcs->node, &elf_functions_list);
+	list_add_tail(&funcs->node, &btf_encoding_context.elf_functions_list);
 
 	return 0;
 
@@ -225,29 +279,21 @@  out_delete:
 	return err;
 }
 
-
-static LIST_HEAD(encoders);
-static pthread_mutex_t encoders__lock = PTHREAD_MUTEX_INITIALIZER;
-
-/* mutex only needed for add/delete, as this can happen in multiple encoding
- * threads.  Traversal of the list is currently confined to thread collection.
- */
-
 #define btf_encoders__for_each_encoder(encoder)		\
-	list_for_each_entry(encoder, &encoders, node)
+	list_for_each_entry(encoder, &btf_encoding_context.btf_encoder_list, node)
 
 static void btf_encoders__add(struct btf_encoder *encoder)
 {
-	pthread_mutex_lock(&encoders__lock);
-	list_add_tail(&encoder->node, &encoders);
-	pthread_mutex_unlock(&encoders__lock);
+	pthread_mutex_lock(&btf_encoding_context.btf_encoder_list_lock);
+	list_add_tail(&encoder->node, &btf_encoding_context.btf_encoder_list);
+	pthread_mutex_unlock(&btf_encoding_context.btf_encoder_list_lock);
 }
 
 static void btf_encoders__delete(struct btf_encoder *encoder)
 {
 	struct btf_encoder *existing = NULL;
 
-	pthread_mutex_lock(&encoders__lock);
+	pthread_mutex_lock(&btf_encoding_context.btf_encoder_list_lock);
 	/* encoder may not have been added to list yet; check. */
 	btf_encoders__for_each_encoder(existing) {
 		if (encoder == existing)
@@ -255,7 +301,7 @@  static void btf_encoders__delete(struct btf_encoder *encoder)
 	}
 	if (encoder == existing)
 		list_del(&encoder->node);
-	pthread_mutex_unlock(&encoders__lock);
+	pthread_mutex_unlock(&btf_encoding_context.btf_encoder_list_lock);
 }
 
 #define PERCPU_SECTION ".data..percpu"
@@ -1385,6 +1431,9 @@  int btf_encoder__add_saved_funcs(struct btf_encoder *encoder)
 	free(saved_fns);
 	btf_encoders__for_each_encoder(e)
 		btf_encoder__delete_saved_funcs(e);
+
+	elf_functions__delete_all();
+
 	return 0;
 }
 
@@ -2178,7 +2227,6 @@  int btf_encoder__encode(struct btf_encoder *encoder)
 		err = btf_encoder__write_elf(encoder, encoder->btf, BTF_ELF_SEC);
 	}
 
-	elf_functions__delete_all();
 	return err;
 }
 
@@ -2565,6 +2613,18 @@  void btf_encoder__delete(struct btf_encoder *encoder)
 	free(encoder);
 }
 
+static inline void btf_encoder__delete_all(void)
+{
+	struct btf_encoder *encoder;
+	struct list_head *pos, *tmp;
+
+	list_for_each_safe(pos, tmp, &btf_encoding_context.btf_encoder_list) {
+		encoder = list_entry(pos, struct btf_encoder, node);
+		btf_encoder__delete(encoder);
+	}
+}
+
+
 int btf_encoder__encode_cu(struct btf_encoder *encoder, struct cu *cu, struct conf_load *conf_load)
 {
 	struct llvm_annotation *annot;
diff --git a/btf_encoder.h b/btf_encoder.h
index 7fa0390..f14edc1 100644
--- a/btf_encoder.h
+++ b/btf_encoder.h
@@ -23,6 +23,9 @@  enum btf_var_option {
 	BTF_VAR_GLOBAL = 2,
 };
 
+int btf_encoding_context__init(void);
+void btf_encoding_context__exit(void);
+
 struct btf_encoder *btf_encoder__new(struct cu *cu, const char *detached_filename, struct btf *base_btf, bool verbose, struct conf_load *conf_load);
 void btf_encoder__delete(struct btf_encoder *encoder);
 
diff --git a/pahole.c b/pahole.c
index eb2e71a..6bbc9e4 100644
--- a/pahole.c
+++ b/pahole.c
@@ -3741,8 +3741,12 @@  int main(int argc, char *argv[])
 		conf_load.threads_collect = pahole_threads_collect;
 	}
 
-	if (btf_encode)
+	if (btf_encode) {
 		conf_load.pre_load_module = btf_encoder__pre_load_module;
+		err = btf_encoding_context__init();
+		if (err < 0)
+			goto out;
+	}
 
 	// Make 'pahole --header type < file' a shorter form of 'pahole -C type --count 1 < file'
 	if (conf.header_type && !class_name && prettify_input) {
@@ -3859,7 +3863,11 @@  try_sole_arg_as_class_names:
 			goto out_cus_delete;
 		}
 	}
+
 out_ok:
+	if (btf_encode)
+		btf_encoding_context__exit();
+
 	if (stats_formatter != NULL)
 		print_stats();