[v4,5/5] unpack-objects: unpack_non_delta_entry() read data in a stream

Message ID 20211203093530.93589-6-chiyutianyi@gmail.com (mailing list archive)
State Superseded
Series unpack large objects in stream

Commit Message

Han Xin Dec. 3, 2021, 9:35 a.m. UTC
From: Han Xin <hanxin.hx@alibaba-inc.com>

We used to call "get_data()" in "unpack_non_delta_entry()" to read the
entire contents of a blob object, no matter how big it is. This
implementation may consume all the memory and cause OOM.

By implementing a zstream version of input_stream interface, we can use
a small fixed buffer for "unpack_non_delta_entry()".

However, unpacking non-delta objects from a stream instead of from an
entire buffer incurs about a 10% performance penalty. Therefore, only
unpack objects larger than the "big_file_threshold" in zstream. See the
following
benchmarks:

    hyperfine \
      --setup \
      'if ! test -d scalar.git; then git clone --bare https://github.com/microsoft/scalar.git; cp scalar.git/objects/pack/*.pack small.pack; fi' \
      --prepare 'rm -rf dest.git && git init --bare dest.git' \
      -n 'old' 'git -C dest.git unpack-objects <small.pack' \
      -n 'new' 'new/git -C dest.git unpack-objects <small.pack' \
      -n 'new (small threshold)' \
      'new/git -c core.bigfilethreshold=16k -C dest.git unpack-objects <small.pack'
    Benchmark 1: old
      Time (mean ± σ):      6.075 s ±  0.069 s    [User: 5.047 s, System: 0.991 s]
      Range (min … max):    6.018 s …  6.189 s    10 runs

    Benchmark 2: new
      Time (mean ± σ):      6.090 s ±  0.033 s    [User: 5.075 s, System: 0.976 s]
      Range (min … max):    6.030 s …  6.142 s    10 runs

    Benchmark 3: new (small threshold)
      Time (mean ± σ):      6.755 s ±  0.029 s    [User: 5.150 s, System: 1.560 s]
      Range (min … max):    6.711 s …  6.809 s    10 runs

    Summary
      'old' ran
        1.00 ± 0.01 times faster than 'new'
        1.11 ± 0.01 times faster than 'new (small threshold)'

Helped-by: Derrick Stolee <stolee@gmail.com>
Helped-by: Jiang Xin <zhiyou.jx@alibaba-inc.com>
Signed-off-by: Han Xin <hanxin.hx@alibaba-inc.com>
---
 builtin/unpack-objects.c            | 77 ++++++++++++++++++++++++++++-
 object-file.c                       |  6 +--
 object-store.h                      |  4 ++
 t/t5590-unpack-non-delta-objects.sh | 76 ++++++++++++++++++++++++++++
 4 files changed, 159 insertions(+), 4 deletions(-)
 create mode 100755 t/t5590-unpack-non-delta-objects.sh

Comments

Ævar Arnfjörð Bjarmason Dec. 3, 2021, 1:07 p.m. UTC | #1
On Fri, Dec 03 2021, Han Xin wrote:

> From: Han Xin <hanxin.hx@alibaba-inc.com>
>
> We used to call "get_data()" in "unpack_non_delta_entry()" to read the
> entire contents of a blob object, no matter how big it is. This
> implementation may consume all the memory and cause OOM.
>
> By implementing a zstream version of input_stream interface, we can use
> a small fixed buffer for "unpack_non_delta_entry()".
>
> However, unpacking non-delta objects from a stream instead of from an
> entire buffer incurs about a 10% performance penalty. Therefore, only
> unpack objects larger than the "big_file_threshold" in zstream. See the
> following
> benchmarks:
>
>     hyperfine \
>       --setup \
>       'if ! test -d scalar.git; then git clone --bare https://github.com/microsoft/scalar.git; cp scalar.git/objects/pack/*.pack small.pack; fi' \
>       --prepare 'rm -rf dest.git && git init --bare dest.git' \
>       -n 'old' 'git -C dest.git unpack-objects <small.pack' \
>       -n 'new' 'new/git -C dest.git unpack-objects <small.pack' \
>       -n 'new (small threshold)' \
>       'new/git -c core.bigfilethreshold=16k -C dest.git unpack-objects <small.pack'
>     Benchmark 1: old
>       Time (mean ± σ):      6.075 s ±  0.069 s    [User: 5.047 s, System: 0.991 s]
>       Range (min … max):    6.018 s …  6.189 s    10 runs
>
>     Benchmark 2: new
>       Time (mean ± σ):      6.090 s ±  0.033 s    [User: 5.075 s, System: 0.976 s]
>       Range (min … max):    6.030 s …  6.142 s    10 runs
>
>     Benchmark 3: new (small threshold)
>       Time (mean ± σ):      6.755 s ±  0.029 s    [User: 5.150 s, System: 1.560 s]
>       Range (min … max):    6.711 s …  6.809 s    10 runs
>
>     Summary
>       'old' ran
>         1.00 ± 0.01 times faster than 'new'
>         1.11 ± 0.01 times faster than 'new (small threshold)'

So before this we used core.bigfilethreshold for two things (or more?):
whether we show a diff for it (we mark it "binary") and whether it's
split into a loose object.

Now it's three things: we've added a "this is a threshold when we'll
stream the object" to that.

Might it make sense to squash something like this in, so we can have our
cake & eat it too?

With this I get, where HEAD~0 is this change:
    
    Summary
      './git -C dest.git -c core.bigfilethreshold=512m unpack-objects <small.pack' in 'HEAD~0' ran
        1.00 ± 0.01 times faster than './git -C dest.git -c core.bigfilethreshold=512m unpack-objects <small.pack' in 'HEAD~1'
        1.00 ± 0.01 times faster than './git -C dest.git -c core.bigfilethreshold=512m unpack-objects <small.pack' in 'origin/master'
        1.01 ± 0.01 times faster than './git -C dest.git -c core.bigfilethreshold=16k unpack-objects <small.pack' in 'HEAD~0'
        1.06 ± 0.14 times faster than './git -C dest.git -c core.bigfilethreshold=16k unpack-objects <small.pack' in 'origin/master'
        1.20 ± 0.01 times faster than './git -C dest.git -c core.bigfilethreshold=16k unpack-objects <small.pack' in 'HEAD~1'

I.e. it's 5% slower, not 20% (haven't looked into why), but we'll not
stream out 16k..128MB objects (maybe the repo has even bigger ones?)

diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
index c04f62a54a1..601b7a2418f 100644
--- a/Documentation/config/core.txt
+++ b/Documentation/config/core.txt
@@ -424,6 +424,17 @@ be delta compressed, but larger binary media files won't be.
 +
 Common unit suffixes of 'k', 'm', or 'g' are supported.
 
+core.bigFileStreamingThreshold::
+	Files larger than this will be streamed out to a temporary
+	object file while being hashed, which will then be renamed
+	in-place to a loose object, particularly if the
+	`core.bigFileThreshold` setting dictates that they're always
+	written out as loose objects.
++
+Default is 128 MiB on all platforms.
++
+Common unit suffixes of 'k', 'm', or 'g' are supported.
+
 core.excludesFile::
 	Specifies the pathname to the file that contains patterns to
 	describe paths that are not meant to be tracked, in addition
diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
index bedc494e2db..94ce275c807 100644
--- a/builtin/unpack-objects.c
+++ b/builtin/unpack-objects.c
@@ -400,7 +400,7 @@ static void unpack_non_delta_entry(enum object_type type, unsigned long size,
 	void *buf;
 
 	/* Write large blob in stream without allocating full buffer. */
-	if (!dry_run && type == OBJ_BLOB && size > big_file_threshold) {
+	if (!dry_run && type == OBJ_BLOB && size > big_file_streaming_threshold) {
 		write_stream_blob(nr, size);
 		return;
 	}
diff --git a/cache.h b/cache.h
index eba12487b99..4037c7fd849 100644
--- a/cache.h
+++ b/cache.h
@@ -964,6 +964,7 @@ extern size_t packed_git_window_size;
 extern size_t packed_git_limit;
 extern size_t delta_base_cache_limit;
 extern unsigned long big_file_threshold;
+extern unsigned long big_file_streaming_threshold;
 extern unsigned long pack_size_limit_cfg;
 
 /*
diff --git a/config.c b/config.c
index c5873f3a706..7b122a142a8 100644
--- a/config.c
+++ b/config.c
@@ -1408,6 +1408,11 @@ static int git_default_core_config(const char *var, const char *value, void *cb)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.bigfilestreamingthreshold")) {
+		big_file_streaming_threshold = git_config_ulong(var, value);
+		return 0;
+	}
+
 	if (!strcmp(var, "core.packedgitlimit")) {
 		packed_git_limit = git_config_ulong(var, value);
 		return 0;
diff --git a/environment.c b/environment.c
index 9da7f3c1a19..4fcc3de7417 100644
--- a/environment.c
+++ b/environment.c
@@ -46,6 +46,7 @@ size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE;
 size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT;
 size_t delta_base_cache_limit = 96 * 1024 * 1024;
 unsigned long big_file_threshold = 512 * 1024 * 1024;
+unsigned long big_file_streaming_threshold = 128 * 1024 * 1024;
 int pager_use_color = 1;
 const char *editor_program;
 const char *askpass_program;
Ævar Arnfjörð Bjarmason Dec. 3, 2021, 1:54 p.m. UTC | #2
On Fri, Dec 03 2021, Han Xin wrote:

> diff --git a/t/t5590-unpack-non-delta-objects.sh b/t/t5590-unpack-non-delta-objects.sh
> new file mode 100755
> index 0000000000..01d950d119
> --- /dev/null
> +++ b/t/t5590-unpack-non-delta-objects.sh
> @@ -0,0 +1,76 @@
> +#!/bin/sh
> +#
> +# Copyright (c) 2021 Han Xin
> +#
> +
> +test_description='Test unpack-objects when receiving a pack'
> +
> +GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
> +export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
> +
> +. ./test-lib.sh
> +
> +test_expect_success "create commit with big blobs (1.5 MB)" '
> +	test-tool genrandom foo 1500000 >big-blob &&
> +	test_commit --append foo big-blob &&
> +	test-tool genrandom bar 1500000 >big-blob &&
> +	test_commit --append bar big-blob &&
> +	(
> +		cd .git &&
> +		find objects/?? -type f | sort

...are these...

> +	) >expect &&
> +	PACK=$(echo main | git pack-objects --progress --revs test)

Is --progress needed?

> +'
> +
> +test_expect_success 'setup GIT_ALLOC_LIMIT to 1MB' '
> +	GIT_ALLOC_LIMIT=1m &&
> +	export GIT_ALLOC_LIMIT
> +'
> +
> +test_expect_success 'prepare dest repository' '
> +	git init --bare dest.git &&
> +	git -C dest.git config core.bigFileThreshold 2m &&
> +	git -C dest.git config receive.unpacklimit 100

I think it would be better to just (could roll this into a function):

	test_when_finished "rm -rf dest.git" &&
	git init dest.git &&
	git -C dest.git config ...

Then you can use it with e.g. --run=3-4 and not have it error out
because of skipped setup.

A lot of our tests fail like that, but in this case fixing it seems
trivial.
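
E.g. something like this (a sketch only; "prepare_dest" and taking the
threshold as an argument are just one way to spell it):

	# Hypothetical helper: make a throwaway dest.git with the
	# given core.bigFileThreshold, cleaned up when the test ends.
	prepare_dest () {
		test_when_finished "rm -rf dest.git" &&
		git init --bare dest.git &&
		git -C dest.git config core.bigFileThreshold "$1" &&
		git -C dest.git config receive.unpacklimit 100
	}

Then each test that needs a destination repository calls e.g.
"prepare_dest 2m" itself, instead of depending on an earlier setup test
having run.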



> +'
> +
> +test_expect_success 'fail to unpack-objects: cannot allocate' '
> +	test_must_fail git -C dest.git unpack-objects <test-$PACK.pack 2>err &&
> +	test_i18ngrep "fatal: attempting to allocate" err &&

nit: just "grep", not "test_i18ngrep"

> +	(
> +		cd dest.git &&
> +		find objects/?? -type f | sort

..."find" needed over just globbing?:

    obj=$(echo objects/*/*)

?
Ævar Arnfjörð Bjarmason Dec. 3, 2021, 2:05 p.m. UTC | #3
On Fri, Dec 03 2021, Han Xin wrote:

> From: Han Xin <hanxin.hx@alibaba-inc.com>
> [..]
> +static void write_stream_blob(unsigned nr, unsigned long size)
> +{
> +	char hdr[32];
> +	int hdrlen;
> +	git_zstream zstream;
> +	struct input_zstream_data data;
> +	struct input_stream in_stream = {
> +		.read = feed_input_zstream,
> +		.data = &data,
> +		.size = size,
> +	};
> +	struct object_id *oid = &obj_list[nr].oid;
> +	int ret;
> +
> +	memset(&zstream, 0, sizeof(zstream));
> +	memset(&data, 0, sizeof(data));
> +	data.zstream = &zstream;
> +	git_inflate_init(&zstream);
> +
> +	/* Generate the header */
> +	hdrlen = xsnprintf(hdr, sizeof(hdr), "%s %"PRIuMAX, type_name(OBJ_BLOB), (uintmax_t)size) + 1;
> +
> +	if ((ret = write_loose_object(oid, hdr, hdrlen, &in_stream, 0, 0)))
> +		die(_("failed to write object in stream %d"), ret);
> +
> +	if (zstream.total_out != size || data.status != Z_STREAM_END)
> +		die(_("inflate returned %d"), data.status);
> +	git_inflate_end(&zstream);
> +
> +	if (strict && !dry_run) {
> +		struct blob *blob = lookup_blob(the_repository, oid);
> +		if (blob)
> +			blob->object.flags |= FLAG_WRITTEN;
> +		else
> +			die("invalid blob object from stream");
> +	}
> +	obj_list[nr].obj = NULL;
> +}

Just a side-note, I think (but am not 100% sure) that these existing
occurrences aren't needed due to our use of CALLOC_ARRAY():
    
    diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
    index 4a9466295ba..00b349412c5 100644
    --- a/builtin/unpack-objects.c
    +++ b/builtin/unpack-objects.c
    @@ -248,7 +248,6 @@ static void write_object(unsigned nr, enum object_type type,
                            die("failed to write object");
                    added_object(nr, type, buf, size);
                    free(buf);
    -               obj_list[nr].obj = NULL;
            } else if (type == OBJ_BLOB) {
                    struct blob *blob;
                    if (write_object_file(buf, size, type_name(type),
    @@ -262,7 +261,6 @@ static void write_object(unsigned nr, enum object_type type,
                            blob->object.flags |= FLAG_WRITTEN;
                    else
                            die("invalid blob object");
    -               obj_list[nr].obj = NULL;
            } else {
                    struct object *obj;
                    int eaten;

The reason I'm noting it is that the same seems to be true of your new
addition here. I.e. are these assignments to NULL needed?

Anyway, the reason I started poking at this is that this
write_stream_blob() seems to duplicate much of write_object(). AFAICT
only the writing part is really different, the part where we
lookup_blob() after, set FLAG_WRITTEN etc. is all the same.

Why can't we call write_object() here?

The obvious answer seems to be that the call to write_object_file()
isn't prepared to do the sort of streaming that you want, so instead
you're bypassing it and calling write_loose_object() directly.

I haven't tried this myself, but isn't a better and cleaner approach
here to not add another meaning to what is_null_oid() means, but to just
add a HASH_STREAM flag that'll get passed down as "unsigned flags" to
write_loose_object()? See FLAG_BITS in object.h.

Then the "obj_list[nr].obj" here could also become
"obj_list[nr].obj.flags |= (1u<<12)" or whatever (but that wouldn't
strictly be needed, I think).

But by adding the "HASH_STREAM" flag you could I think stop duplicating
the "Generate the header" etc. here and call write_object_file_flags().

I don't so much care about how it's done within unpack-objects.c, but
not having another meaning to is_null_oid() in play would be really
nice, and in this case it seems entirely avoidable.
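
I.e. a rough sketch of what I mean (HASH_STREAM and smuggling the
input_stream through the "buf" pointer are hypothetical here, just to
show the shape of it):

	/* cache.h: a hypothetical new flag next to the HASH_* ones */
	#define HASH_STREAM 8

	/* object-file.c */
	int write_object_file_flags(const void *buf, unsigned long len,
				    const char *type, struct object_id *oid,
				    unsigned flags)
	{
		char hdr[32];
		int hdrlen = xsnprintf(hdr, sizeof(hdr), "%s %"PRIuMAX,
				       type, (uintmax_t)len) + 1;

		if (flags & HASH_STREAM) {
			/* "buf" carries a struct input_stream *, not
			 * the object contents themselves */
			struct input_stream *in = (struct input_stream *)buf;
			return write_loose_object(oid, hdr, hdrlen, in,
						  0, flags);
		}
		/* ... existing in-memory path unchanged ... */
	}

With something like that in place write_stream_blob() could go through
write_object_file_flags() just as write_object() does, and drop its own
header generation.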
Han Xin Dec. 7, 2021, 6:17 a.m. UTC | #4
On Fri, Dec 3, 2021 at 9:59 PM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>
>
> On Fri, Dec 03 2021, Han Xin wrote:
>
> > diff --git a/t/t5590-unpack-non-delta-objects.sh b/t/t5590-unpack-non-delta-objects.sh
> > new file mode 100755
> > index 0000000000..01d950d119
> > --- /dev/null
> > +++ b/t/t5590-unpack-non-delta-objects.sh
> > @@ -0,0 +1,76 @@
> > +#!/bin/sh
> > +#
> > +# Copyright (c) 2021 Han Xin
> > +#
> > +
> > +test_description='Test unpack-objects when receiving a pack'
> > +
> > +GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
> > +export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
> > +
> > +. ./test-lib.sh
> > +
> > +test_expect_success "create commit with big blobs (1.5 MB)" '
> > +     test-tool genrandom foo 1500000 >big-blob &&
> > +     test_commit --append foo big-blob &&
> > +     test-tool genrandom bar 1500000 >big-blob &&
> > +     test_commit --append bar big-blob &&
> > +     (
> > +             cd .git &&
> > +             find objects/?? -type f | sort
>
> ...are these...
>
> > +     ) >expect &&
> > +     PACK=$(echo main | git pack-objects --progress --revs test)
>
> Is --progress needed?
>

"--progress" is not necessary.

> > +'
> > +
> > +test_expect_success 'setup GIT_ALLOC_LIMIT to 1MB' '
> > +     GIT_ALLOC_LIMIT=1m &&
> > +     export GIT_ALLOC_LIMIT
> > +'
> > +
> > +test_expect_success 'prepare dest repository' '
> > +     git init --bare dest.git &&
> > +     git -C dest.git config core.bigFileThreshold 2m &&
> > +     git -C dest.git config receive.unpacklimit 100
>
> I think it would be better to just (could roll this into a function):
>
>         test_when_finished "rm -rf dest.git" &&
>         git init dest.git &&
>         git -C dest.git config ...
>
> Then you can use it with e.g. --run=3-4 and not have it error out
> because of skipped setup.
>
> A lot of our tests fail like that, but in this case fixing it seems
> trivial.
>
>

OK, I will take it.

>
> > +'
> > +
> > +test_expect_success 'fail to unpack-objects: cannot allocate' '
> > +     test_must_fail git -C dest.git unpack-objects <test-$PACK.pack 2>err &&
> > +     test_i18ngrep "fatal: attempting to allocate" err &&
>
> nit: just "grep", not "test_i18ngrep"
>
> > +     (
> > +             cd dest.git &&
> > +             find objects/?? -type f | sort
>
> ..."find" needed over just globbing?:
>
>     obj=$(echo objects/*/*)
>
> ?

I tried to use "echo" instead of "find". It works well on my personal
computer, but fails in CI on GitHub because of the "info/commit-graph"
file that gets generated there.
So it seems that ".git/objects/??" is more rigorous?
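
For example, in my dest.git (the object names below are made up for
illustration):

	$ echo objects/*/*
	objects/3a/... objects/info/commit-graph objects/pack/pack-....pack
	$ find objects/?? -type f
	objects/3a/...

The "??" pattern only matches the two-hexdigit fan-out directories, so
"objects/info/commit-graph" and the pack files never end up in the
comparison.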
Han Xin Dec. 7, 2021, 6:42 a.m. UTC | #5
On Fri, Dec 3, 2021 at 9:19 PM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>
>
> On Fri, Dec 03 2021, Han Xin wrote:
>
> > From: Han Xin <hanxin.hx@alibaba-inc.com>
> >
> > We used to call "get_data()" in "unpack_non_delta_entry()" to read the
> > entire contents of a blob object, no matter how big it is. This
> > implementation may consume all the memory and cause OOM.
> >
> > By implementing a zstream version of input_stream interface, we can use
> > a small fixed buffer for "unpack_non_delta_entry()".
> >
> > However, unpacking non-delta objects from a stream instead of from an
> > entire buffer incurs about a 10% performance penalty. Therefore, only
> > unpack objects larger than the "big_file_threshold" in zstream. See the
> > following
> > benchmarks:
> >
> >     hyperfine \
> >       --setup \
> >       'if ! test -d scalar.git; then git clone --bare https://github.com/microsoft/scalar.git; cp scalar.git/objects/pack/*.pack small.pack; fi' \
> >       --prepare 'rm -rf dest.git && git init --bare dest.git' \
> >       -n 'old' 'git -C dest.git unpack-objects <small.pack' \
> >       -n 'new' 'new/git -C dest.git unpack-objects <small.pack' \
> >       -n 'new (small threshold)' \
> >       'new/git -c core.bigfilethreshold=16k -C dest.git unpack-objects <small.pack'
> >     Benchmark 1: old
> >       Time (mean ± σ):      6.075 s ±  0.069 s    [User: 5.047 s, System: 0.991 s]
> >       Range (min … max):    6.018 s …  6.189 s    10 runs
> >
> >     Benchmark 2: new
> >       Time (mean ± σ):      6.090 s ±  0.033 s    [User: 5.075 s, System: 0.976 s]
> >       Range (min … max):    6.030 s …  6.142 s    10 runs
> >
> >     Benchmark 3: new (small threshold)
> >       Time (mean ± σ):      6.755 s ±  0.029 s    [User: 5.150 s, System: 1.560 s]
> >       Range (min … max):    6.711 s …  6.809 s    10 runs
> >
> >     Summary
> >       'old' ran
> >         1.00 ± 0.01 times faster than 'new'
> >         1.11 ± 0.01 times faster than 'new (small threshold)'
>
> So before this we used core.bigfilethreshold for two things (or more?):
> whether we show a diff for it (we mark it "binary") and whether it's
> split into a loose object.
>
> Now it's three things: we've added a "this is a threshold when we'll
> stream the object" to that.
>
> Might it make sense to squash something like this in, so we can have our
> cake & eat it too?
>
> With this I get, where HEAD~0 is this change:
>
>     Summary
>       './git -C dest.git -c core.bigfilethreshold=512m unpack-objects <small.pack' in 'HEAD~0' ran
>         1.00 ± 0.01 times faster than './git -C dest.git -c core.bigfilethreshold=512m unpack-objects <small.pack' in 'HEAD~1'
>         1.00 ± 0.01 times faster than './git -C dest.git -c core.bigfilethreshold=512m unpack-objects <small.pack' in 'origin/master'
>         1.01 ± 0.01 times faster than './git -C dest.git -c core.bigfilethreshold=16k unpack-objects <small.pack' in 'HEAD~0'
>         1.06 ± 0.14 times faster than './git -C dest.git -c core.bigfilethreshold=16k unpack-objects <small.pack' in 'origin/master'
>         1.20 ± 0.01 times faster than './git -C dest.git -c core.bigfilethreshold=16k unpack-objects <small.pack' in 'HEAD~1'
>
> I.e. it's 5% slower, not 20% (haven't looked into why), but we'll not
> stream out 16k..128MB objects (maybe the repo has even bigger ones?)
>
> diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
> index c04f62a54a1..601b7a2418f 100644
> --- a/Documentation/config/core.txt
> +++ b/Documentation/config/core.txt
> @@ -424,6 +424,17 @@ be delta compressed, but larger binary media files won't be.
>  +
>  Common unit suffixes of 'k', 'm', or 'g' are supported.
>
> +core.bigFileStreamingThreshold::
> +       Files larger than this will be streamed out to a temporary
> +       object file while being hashed, which will then be renamed
> +       in-place to a loose object, particularly if the
> +       `core.bigFileThreshold` setting dictates that they're always
> +       written out as loose objects.
> ++
> +Default is 128 MiB on all platforms.
> ++
> +Common unit suffixes of 'k', 'm', or 'g' are supported.
> +
>  core.excludesFile::
>         Specifies the pathname to the file that contains patterns to
>         describe paths that are not meant to be tracked, in addition
> diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
> index bedc494e2db..94ce275c807 100644
> --- a/builtin/unpack-objects.c
> +++ b/builtin/unpack-objects.c
> @@ -400,7 +400,7 @@ static void unpack_non_delta_entry(enum object_type type, unsigned long size,
>         void *buf;
>
>         /* Write large blob in stream without allocating full buffer. */
> -       if (!dry_run && type == OBJ_BLOB && size > big_file_threshold) {
> +       if (!dry_run && type == OBJ_BLOB && size > big_file_streaming_threshold) {
>                 write_stream_blob(nr, size);
>                 return;
>         }
> diff --git a/cache.h b/cache.h
> index eba12487b99..4037c7fd849 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -964,6 +964,7 @@ extern size_t packed_git_window_size;
>  extern size_t packed_git_limit;
>  extern size_t delta_base_cache_limit;
>  extern unsigned long big_file_threshold;
> +extern unsigned long big_file_streaming_threshold;
>  extern unsigned long pack_size_limit_cfg;
>
>  /*
> diff --git a/config.c b/config.c
> index c5873f3a706..7b122a142a8 100644
> --- a/config.c
> +++ b/config.c
> @@ -1408,6 +1408,11 @@ static int git_default_core_config(const char *var, const char *value, void *cb)
>                 return 0;
>         }
>
> +       if (!strcmp(var, "core.bigfilestreamingthreshold")) {
> +               big_file_streaming_threshold = git_config_ulong(var, value);
> +               return 0;
> +       }
> +
>         if (!strcmp(var, "core.packedgitlimit")) {
>                 packed_git_limit = git_config_ulong(var, value);
>                 return 0;
> diff --git a/environment.c b/environment.c
> index 9da7f3c1a19..4fcc3de7417 100644
> --- a/environment.c
> +++ b/environment.c
> @@ -46,6 +46,7 @@ size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE;
>  size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT;
>  size_t delta_base_cache_limit = 96 * 1024 * 1024;
>  unsigned long big_file_threshold = 512 * 1024 * 1024;
> +unsigned long big_file_streaming_threshold = 128 * 1024 * 1024;
>  int pager_use_color = 1;
>  const char *editor_program;
>  const char *askpass_program;

I'm not sure if we need an additional "core.bigFileStreamingThreshold"
here, because "core.bigFileThreshold" has been widely used in
"index-pack", "read_object" and so on.

In the test case which uses "core.bigFileStreamingThreshold" instead of
"core.bigFileThreshold", I found that the test failed because "fsck"
tried to allocate 15MB of memory.
In the process of "fsck_loose()", "read_loose_object()" will be called,
which contains the following:

  if (*oi->typep == OBJ_BLOB && *size > big_file_threshold) {
    if (check_stream_oid(&stream, hdr, *size, path, expected_oid) < 0)
      goto out;
  } else {
    /* this will allocate 15MB of memory */
    *contents = unpack_loose_rest(&stream, hdr, *size, expected_oid);
    ...
  }

The same case can be found in "unpack_entry_data()":

  static char fixed_buf[8192];
  ...
  if (type == OBJ_BLOB && size > big_file_threshold)
    buf = fixed_buf;
  else
    buf = xmallocz(size);
 ...

Although I know that setting "core.bigfilethreshold" smaller than the
default value on the server side does not help me prevent users from
creating large delta objects on the client side, it can still
effectively help me reduce memory allocation in "receive-pack".

If this is not the correct way to use "core.bigfilethreshold", maybe
you can share a better solution with me, if you want.

Thanks.
-Han Xin
Han Xin Dec. 7, 2021, 6:48 a.m. UTC | #6
On Fri, Dec 3, 2021 at 10:29 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Fri, Dec 03 2021, Han Xin wrote:
>
> > From: Han Xin <hanxin.hx@alibaba-inc.com>
> > [..]
> > +static void write_stream_blob(unsigned nr, unsigned long size)
> > +{
> > +     char hdr[32];
> > +     int hdrlen;
> > +     git_zstream zstream;
> > +     struct input_zstream_data data;
> > +     struct input_stream in_stream = {
> > +             .read = feed_input_zstream,
> > +             .data = &data,
> > +             .size = size,
> > +     };
> > +     struct object_id *oid = &obj_list[nr].oid;
> > +     int ret;
> > +
> > +     memset(&zstream, 0, sizeof(zstream));
> > +     memset(&data, 0, sizeof(data));
> > +     data.zstream = &zstream;
> > +     git_inflate_init(&zstream);
> > +
> > +     /* Generate the header */
> > +     hdrlen = xsnprintf(hdr, sizeof(hdr), "%s %"PRIuMAX, type_name(OBJ_BLOB), (uintmax_t)size) + 1;
> > +
> > +     if ((ret = write_loose_object(oid, hdr, hdrlen, &in_stream, 0, 0)))
> > +             die(_("failed to write object in stream %d"), ret);
> > +
> > +     if (zstream.total_out != size || data.status != Z_STREAM_END)
> > +             die(_("inflate returned %d"), data.status);
> > +     git_inflate_end(&zstream);
> > +
> > +     if (strict && !dry_run) {
> > +             struct blob *blob = lookup_blob(the_repository, oid);
> > +             if (blob)
> > +                     blob->object.flags |= FLAG_WRITTEN;
> > +             else
> > +                     die("invalid blob object from stream");
> > +     }
> > +     obj_list[nr].obj = NULL;
> > +}
>
> Just a side-note, I think (but am not 100% sure) that these existing
> occurrences aren't needed due to our use of CALLOC_ARRAY():
>
>     diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
>     index 4a9466295ba..00b349412c5 100644
>     --- a/builtin/unpack-objects.c
>     +++ b/builtin/unpack-objects.c
>     @@ -248,7 +248,6 @@ static void write_object(unsigned nr, enum object_type type,
>                             die("failed to write object");
>                     added_object(nr, type, buf, size);
>                     free(buf);
>     -               obj_list[nr].obj = NULL;
>             } else if (type == OBJ_BLOB) {
>                     struct blob *blob;
>                     if (write_object_file(buf, size, type_name(type),
>     @@ -262,7 +261,6 @@ static void write_object(unsigned nr, enum object_type type,
>                             blob->object.flags |= FLAG_WRITTEN;
>                     else
>                             die("invalid blob object");
>     -               obj_list[nr].obj = NULL;
>             } else {
>                     struct object *obj;
>                     int eaten;
>
> The reason I'm noting it is that the same seems to be true of your new
> addition here. I.e. are these assignments to NULL needed?
>
> Anyway, the reason I started poking at this is that this
> write_stream_blob() seems to duplicate much of write_object(). AFAICT
> only the writing part is really different, the part where we
> lookup_blob() after, set FLAG_WRITTEN etc. is all the same.
>
> Why can't we call write_object() here?
>
> The obvious answer seems to be that the call to write_object_file()
> isn't prepared to do the sort of streaming that you want, so instead
> you're bypassing it and calling write_loose_object() directly.
>
> I haven't tried this myself, but isn't a better and cleaner approach
> here to not add another meaning to what is_null_oid() means, but to just
> add a HASH_STREAM flag that'll get passed down as "unsigned flags" to
> write_loose_object()? See FLAG_BITS in object.h.
>
> Then the "obj_list[nr].obj" here could also become
> "obj_list[nr].obj.flags |= (1u<<12)" or whatever (but that wouldn't
> strictly be needed, I think).
>
> But by adding the "HASH_STREAM" flag you could I think stop duplicating
> the "Generate the header" etc. here and call write_object_file_flags().
>
> I don't so much care about how it's done within unpack-objects.c, but
> not having another meaning to is_null_oid() in play would be really
> nice, and in this case it seems entirely avoidable.

I have refactored it according to your suggestions for my next patch
version. Using a HASH_STREAM flag is indeed a better way to deal with
it, and it also reduces the changes I have to make to the original code.

Thanks.
-Han Xin
Patch

diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
index 8d68acd662..bedc494e2d 100644
--- a/builtin/unpack-objects.c
+++ b/builtin/unpack-objects.c
@@ -326,11 +326,86 @@  static void added_object(unsigned nr, enum object_type type,
 	}
 }
 
+struct input_zstream_data {
+	git_zstream *zstream;
+	unsigned char buf[8192];
+	int status;
+};
+
+static const void *feed_input_zstream(struct input_stream *in_stream, unsigned long *readlen)
+{
+	struct input_zstream_data *data = in_stream->data;
+	git_zstream *zstream = data->zstream;
+	void *in = fill(1);
+
+	if (!len || data->status == Z_STREAM_END) {
+		*readlen = 0;
+		return NULL;
+	}
+
+	zstream->next_out = data->buf;
+	zstream->avail_out = sizeof(data->buf);
+	zstream->next_in = in;
+	zstream->avail_in = len;
+
+	data->status = git_inflate(zstream, 0);
+	use(len - zstream->avail_in);
+	*readlen = sizeof(data->buf) - zstream->avail_out;
+
+	return data->buf;
+}
+
+static void write_stream_blob(unsigned nr, unsigned long size)
+{
+	char hdr[32];
+	int hdrlen;
+	git_zstream zstream;
+	struct input_zstream_data data;
+	struct input_stream in_stream = {
+		.read = feed_input_zstream,
+		.data = &data,
+		.size = size,
+	};
+	struct object_id *oid = &obj_list[nr].oid;
+	int ret;
+
+	memset(&zstream, 0, sizeof(zstream));
+	memset(&data, 0, sizeof(data));
+	data.zstream = &zstream;
+	git_inflate_init(&zstream);
+
+	/* Generate the header */
+	hdrlen = xsnprintf(hdr, sizeof(hdr), "%s %"PRIuMAX, type_name(OBJ_BLOB), (uintmax_t)size) + 1;
+
+	if ((ret = write_loose_object(oid, hdr, hdrlen, &in_stream, 0, 0)))
+		die(_("failed to write object in stream %d"), ret);
+
+	if (zstream.total_out != size || data.status != Z_STREAM_END)
+		die(_("inflate returned %d"), data.status);
+	git_inflate_end(&zstream);
+
+	if (strict && !dry_run) {
+		struct blob *blob = lookup_blob(the_repository, oid);
+		if (blob)
+			blob->object.flags |= FLAG_WRITTEN;
+		else
+			die("invalid blob object from stream");
+	}
+	obj_list[nr].obj = NULL;
+}
+
 static void unpack_non_delta_entry(enum object_type type, unsigned long size,
 				   unsigned nr)
 {
-	void *buf = get_data(size, dry_run);
+	void *buf;
+
+	/* Write large blob in stream without allocating full buffer. */
+	if (!dry_run && type == OBJ_BLOB && size > big_file_threshold) {
+		write_stream_blob(nr, size);
+		return;
+	}
 
+	buf = get_data(size, dry_run);
 	if (!dry_run && buf)
 		write_object(nr, type, buf, size);
 	else
diff --git a/object-file.c b/object-file.c
index fa54e39c2c..71d510614b 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1878,9 +1878,9 @@  static const void *feed_simple_input_stream(struct input_stream *in_stream, unsi
 	return data->buf;
 }
 
-static int write_loose_object(const struct object_id *oid, char *hdr,
-			      int hdrlen, struct input_stream *in_stream,
-			      time_t mtime, unsigned flags)
+int write_loose_object(const struct object_id *oid, char *hdr,
+		       int hdrlen, struct input_stream *in_stream,
+		       time_t mtime, unsigned flags)
 {
 	int fd, ret;
 	unsigned char compressed[4096];
diff --git a/object-store.h b/object-store.h
index a84d891d60..ac5b11ec16 100644
--- a/object-store.h
+++ b/object-store.h
@@ -229,6 +229,10 @@  int hash_object_file(const struct git_hash_algo *algo, const void *buf,
 		     unsigned long len, const char *type,
 		     struct object_id *oid);
 
+int write_loose_object(const struct object_id *oid, char *hdr,
+		       int hdrlen, struct input_stream *in_stream,
+		       time_t mtime, unsigned flags);
+
 int write_object_file_flags(const void *buf, unsigned long len,
 			    const char *type, struct object_id *oid,
 			    unsigned flags);
diff --git a/t/t5590-unpack-non-delta-objects.sh b/t/t5590-unpack-non-delta-objects.sh
new file mode 100755
index 0000000000..01d950d119
--- /dev/null
+++ b/t/t5590-unpack-non-delta-objects.sh
@@ -0,0 +1,76 @@ 
+#!/bin/sh
+#
+# Copyright (c) 2021 Han Xin
+#
+
+test_description='Test unpack-objects when receiving a pack'
+
+GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
+export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
+
+. ./test-lib.sh
+
+test_expect_success "create commit with big blobs (1.5 MB)" '
+	test-tool genrandom foo 1500000 >big-blob &&
+	test_commit --append foo big-blob &&
+	test-tool genrandom bar 1500000 >big-blob &&
+	test_commit --append bar big-blob &&
+	(
+		cd .git &&
+		find objects/?? -type f | sort
+	) >expect &&
+	PACK=$(echo main | git pack-objects --progress --revs test)
+'
+
+test_expect_success 'setup GIT_ALLOC_LIMIT to 1MB' '
+	GIT_ALLOC_LIMIT=1m &&
+	export GIT_ALLOC_LIMIT
+'
+
+test_expect_success 'prepare dest repository' '
+	git init --bare dest.git &&
+	git -C dest.git config core.bigFileThreshold 2m &&
+	git -C dest.git config receive.unpacklimit 100
+'
+
+test_expect_success 'fail to unpack-objects: cannot allocate' '
+	test_must_fail git -C dest.git unpack-objects <test-$PACK.pack 2>err &&
+	test_i18ngrep "fatal: attempting to allocate" err &&
+	(
+		cd dest.git &&
+		find objects/?? -type f | sort
+	) >actual &&
+	! test_cmp expect actual
+'
+
+test_expect_success 'set a lower bigfile threshold' '
+	git -C dest.git config core.bigFileThreshold 1m
+'
+
+test_expect_success 'unpack big object in stream' '
+	git -C dest.git unpack-objects <test-$PACK.pack &&
+	git -C dest.git fsck &&
+	(
+		cd dest.git &&
+		find objects/?? -type f | sort
+	) >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'setup for unpack-objects dry-run test' '
+	git init --bare unpack-test.git
+'
+
+test_expect_success 'unpack-objects dry-run' '
+	(
+		cd unpack-test.git &&
+		git unpack-objects -n <../test-$PACK.pack
+	) &&
+	(
+		cd unpack-test.git &&
+		find objects/ -type f
+	) >actual &&
+	test_must_be_empty actual
+'
+
+test_done