fast-import: fix incomplete conversion with multiple mark files
diff mbox series

Message ID 20200606002241.1578150-1-sandals@crustytoothpaste.net
State New
Headers show
Series
  • fast-import: fix incomplete conversion with multiple mark files
Related show

Commit Message

brian m. carlson June 6, 2020, 12:22 a.m. UTC
When ddddf8d7e2 ("fast-import: permit reading multiple marks files",
2020-02-22) converted fast-import to handle multiple marks files in
preparation for submodule support, the conversion was incomplete.  With
a large number of marks, we would actually modify the marks variable
even though we had passed in a different variable to operate on.  In
addition, we didn't consider the fact that the code can replace the mark
set passed in, so when we did so we happened to leak quite a bit of
memory, since we never reused the structure we created, instead
reallocating a new one each time.

It doesn't appear from some testing that we actually produce incorrect
results in this case, only that we leak a substantial amount of memory.
To make things work properly and avoid leaking, pass a pointer to
pointer to struct mark_set, which allows us to modify the set of marks
when the number of marks is large.

With this patch, importing a dump of git.git with a set of exported
marks goes from taking in excess of 15 GiB of memory (and being killed
by the Linux OOM killer) to using a maximum of 1.4 GiB of memory.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 fast-import.c | 28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

Comments

Tibor Billes June 8, 2020, 3:52 p.m. UTC | #1
Hi,

On Sat, 6 Jun 2020, brian m. carlson wrote:

> When ddddf8d7e2 ("fast-import: permit reading multiple marks files",
> 2020-02-22) converted fast-import to handle multiple marks files in
> preparation for submodule support, the conversion was incomplete.  With
> a large number of marks, we would actually modify the marks variable
> even though we had passed in a different variable to operate on.  In
> addition, we didn't consider the fact that the code can replace the mark
> set passed in, so when we did so we happened to leak quite a bit of
> memory, since we never reused the structure we created, instead
> reallocating a new one each time.
>
> It doesn't appear from some testing that we actually produce incorrect
> results in this case, only that we leak a substantial amount of memory.
> To make things work properly and avoid leaking, pass a pointer to
> pointer to struct mark_set, which allows us to modify the set of marks
> when the number of marks is large.
>
> With this patch, importing a dump of git.git with a set of exported
> marks goes from taking in excess of 15 GiB of memory (and being killed
> by the Linux OOM killer) to using a maximum of 1.4 GiB of memory.
>
> Signed-off-by: Junio C Hamano <gitster@pobox.com>
> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

Thanks for the quickly patching it! I tested the patch and I can confirm this
solves the memory leak for me.

Thanks,
Tibor Billes
Junio C Hamano June 8, 2020, 4:47 p.m. UTC | #2
Tibor Billes <tbilles@gmx.com> writes:

> On Sat, 6 Jun 2020, brian m. carlson wrote:
>
>> When ddddf8d7e2 ("fast-import: permit reading multiple marks files",
>> 2020-02-22) converted fast-import to handle multiple marks files in
>> preparation for submodule support, the conversion was incomplete.
>...
>
> Thanks for the quickly patching it! I tested the patch and I can confirm this
> solves the memory leak for me.

Tibor, thanks for testing.

Brian, I notice that the singleton global "marks_set_count", even
though the number could be counted per instance of the mark_set
structure, is still singleton.  I didn't bother thinking about it
when I wrote my "perhaps along this line" patch, but now I read it
again, I think it is OK to leave it (and other stats counters) as
is, primarily because we don't have a need (yet) to show stats per
mark_set.  Did you leave it as a global for the same reason?  Just
sanity-checking.

Thanks.
brian m. carlson June 8, 2020, 10:58 p.m. UTC | #3
On 2020-06-08 at 16:47:58, Junio C Hamano wrote:
> Brian, I notice that the singleton global "marks_set_count", even
> though the number could be counted per instance of the mark_set
> structure, is still singleton.  I didn't bother thinking about it
> when I wrote my "perhaps along this line" patch, but now I read it
> again, I think it is OK to leave it (and other stats counters) as
> is, primarily because we don't have a need (yet) to show stats per
> mark_set.  Did you leave it as a global for the same reason?  Just
> sanity-checking.

Yes, I did; sorry for not mentioning that in the commit message.  I
think it's fine to count the total number of marks processed as a
statistic, and that would be independent of how we processed them.  It's
an interesting metric, but not so interesting that folks will have a
need to see a detailed breakdown, I think.

Patch
diff mbox series

diff --git a/fast-import.c b/fast-import.c
index 0dfa14dc8c..ed87d6e380 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -150,7 +150,7 @@  struct recent_command {
 	char *buf;
 };
 
-typedef void (*mark_set_inserter_t)(struct mark_set *s, struct object_id *oid, uintmax_t mark);
+typedef void (*mark_set_inserter_t)(struct mark_set **s, struct object_id *oid, uintmax_t mark);
 typedef void (*each_mark_fn_t)(uintmax_t mark, void *obj, void *cbp);
 
 /* Configured limits on output */
@@ -534,13 +534,15 @@  static char *pool_strdup(const char *s)
 	return r;
 }
 
-static void insert_mark(struct mark_set *s, uintmax_t idnum, struct object_entry *oe)
+static void insert_mark(struct mark_set **sp, uintmax_t idnum, struct object_entry *oe)
 {
+	struct mark_set *s = *sp;
+
 	while ((idnum >> s->shift) >= 1024) {
 		s = mem_pool_calloc(&fi_mem_pool, 1, sizeof(struct mark_set));
-		s->shift = marks->shift + 10;
-		s->data.sets[0] = marks;
-		marks = s;
+		s->shift = (*sp)->shift + 10;
+		s->data.sets[0] = (*sp);
+		(*sp) = s;
 	}
 	while (s->shift) {
 		uintmax_t i = idnum >> s->shift;
@@ -958,7 +960,7 @@  static int store_object(
 
 	e = insert_object(&oid);
 	if (mark)
-		insert_mark(marks, mark, e);
+		insert_mark(&marks, mark, e);
 	if (e->idx.offset) {
 		duplicate_count_by_type[type]++;
 		return 1;
@@ -1156,7 +1158,7 @@  static void stream_blob(uintmax_t len, struct object_id *oidout, uintmax_t mark)
 	e = insert_object(&oid);
 
 	if (mark)
-		insert_mark(marks, mark, e);
+		insert_mark(&marks, mark, e);
 
 	if (e->idx.offset) {
 		duplicate_count_by_type[OBJ_BLOB]++;
@@ -1731,7 +1733,7 @@  static void dump_marks(void)
 	}
 }
 
-static void insert_object_entry(struct mark_set *s, struct object_id *oid, uintmax_t mark)
+static void insert_object_entry(struct mark_set **s, struct object_id *oid, uintmax_t mark)
 {
 	struct object_entry *e;
 	e = find_object(oid);
@@ -1748,12 +1750,12 @@  static void insert_object_entry(struct mark_set *s, struct object_id *oid, uintm
 	insert_mark(s, mark, e);
 }
 
-static void insert_oid_entry(struct mark_set *s, struct object_id *oid, uintmax_t mark)
+static void insert_oid_entry(struct mark_set **s, struct object_id *oid, uintmax_t mark)
 {
 	insert_mark(s, mark, xmemdupz(oid, sizeof(*oid)));
 }
 
-static void read_mark_file(struct mark_set *s, FILE *f, mark_set_inserter_t inserter)
+static void read_mark_file(struct mark_set **s, FILE *f, mark_set_inserter_t inserter)
 {
 	char line[512];
 	while (fgets(line, sizeof(line), f)) {
@@ -1786,7 +1788,7 @@  static void read_marks(void)
 		goto done; /* Marks file does not exist */
 	else
 		die_errno("cannot read '%s'", import_marks_file);
-	read_mark_file(marks, f, insert_object_entry);
+	read_mark_file(&marks, f, insert_object_entry);
 	fclose(f);
 done:
 	import_marks_file_done = 1;
@@ -3242,7 +3244,7 @@  static void parse_alias(void)
 		die(_("Expected 'to' command, got %s"), command_buf.buf);
 	e = find_object(&b.oid);
 	assert(e);
-	insert_mark(marks, next_mark, e);
+	insert_mark(&marks, next_mark, e);
 }
 
 static char* make_fast_import_path(const char *path)
@@ -3340,7 +3342,7 @@  static void option_rewrite_submodules(const char *arg, struct string_list *list)
 	fp = fopen(f, "r");
 	if (!fp)
 		die_errno("cannot read '%s'", f);
-	read_mark_file(ms, fp, insert_oid_entry);
+	read_mark_file(&ms, fp, insert_oid_entry);
 	fclose(fp);
 }