[v7,1/5] unpack-objects.c: add dry_run mode for get_data()

Message ID	20211221115201.12120-2-chiyutianyi@gmail.com (mailing list archive)
State	Superseded
Headers	show Return-Path: <git-owner@kernel.org> From: Han Xin <chiyutianyi@gmail.com> To: Junio C Hamano <gitster@pobox.com>, Git List <git@vger.kernel.org>, Jeff King <peff@peff.net>, Jiang Xin <zhiyou.jx@alibaba-inc.com>, Philip Oakley <philipoakley@iee.email>, =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBC?= =?utf-8?b?amFybWFzb24=?= <avarab@gmail.com>, Derrick Stolee <stolee@gmail.com>, =?utf-8?q?Ren=C3=A9_Scharfe?= <l.s.r@web.de> Cc: Han Xin <hanxin.hx@alibaba-inc.com> Subject: [PATCH v7 1/5] unpack-objects.c: add dry_run mode for get_data() Date: Tue, 21 Dec 2021 19:51:57 +0800 Message-Id: <20211221115201.12120-2-chiyutianyi@gmail.com> In-Reply-To: <20211217112629.12334-1-chiyutianyi@gmail.com> References: <20211217112629.12334-1-chiyutianyi@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	unpack large blobs in stream \| expand [v7,0/5] unpack large blobs in stream [v7,1/5] unpack-objects.c: add dry_run mode for get_data() [v7,2/5] object-file API: add a format_object_header() function [v7,3/5] object-file.c: refactor write_loose_object() to reuse in stream version [v7,4/5] object-file.c: add "write_stream_object_file()" to support read in stream [v7,5/5] unpack-objects: unpack_non_delta_entry() read data in a stream

Han Xin Dec. 21, 2021, 11:51 a.m. UTC

From: Han Xin <hanxin.hx@alibaba-inc.com>

In dry_run mode, "get_data()" is used to verify the inflation of data,
and the returned buffer will not be used at all and will be freed
immediately. Even in dry_run mode, it is dangerous to allocate a
full-size buffer for a large blob object. Therefore, only allocate a
low memory footprint when calling "get_data()" in dry_run mode.

Suggested-by: Jiang Xin <zhiyou.jx@alibaba-inc.com>
Signed-off-by: Han Xin <hanxin.hx@alibaba-inc.com>
---
 builtin/unpack-objects.c            | 23 +++++++++---
 t/t5590-unpack-non-delta-objects.sh | 57 +++++++++++++++++++++++++++++
 2 files changed, 74 insertions(+), 6 deletions(-)
 create mode 100755 t/t5590-unpack-non-delta-objects.sh

Ævar Arnfjörð Bjarmason Dec. 21, 2021, 2:09 p.m. UTC | #1

On Tue, Dec 21 2021, Han Xin wrote:

> From: Han Xin <hanxin.hx@alibaba-inc.com>
>
> In dry_run mode, "get_data()" is used to verify the inflation of data,
> and the returned buffer will not be used at all and will be freed
> immediately. Even in dry_run mode, it is dangerous to allocate a
> full-size buffer for a large blob object. Therefore, only allocate a
> low memory footprint when calling "get_data()" in dry_run mode.
>
> Suggested-by: Jiang Xin <zhiyou.jx@alibaba-inc.com>
> Signed-off-by: Han Xin <hanxin.hx@alibaba-inc.com>
> ---
>  builtin/unpack-objects.c            | 23 +++++++++---
>  t/t5590-unpack-non-delta-objects.sh | 57 +++++++++++++++++++++++++++++
>  2 files changed, 74 insertions(+), 6 deletions(-)
>  create mode 100755 t/t5590-unpack-non-delta-objects.sh
>
> diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
> index 4a9466295b..9104eb48da 100644
> --- a/builtin/unpack-objects.c
> +++ b/builtin/unpack-objects.c
> @@ -96,15 +96,21 @@ static void use(int bytes)
>  	display_throughput(progress, consumed_bytes);
>  }
>  
> -static void *get_data(unsigned long size)
> +static void *get_data(size_t size, int dry_run)
>  {
>  	git_zstream stream;
> -	void *buf = xmallocz(size);
> +	size_t bufsize;
> +	void *buf;
>  
>  	memset(&stream, 0, sizeof(stream));
> +	if (dry_run && size > 8192)
> +		bufsize = 8192;
> +	else
> +		bufsize = size;
> +	buf = xmallocz(bufsize);

Maybe I'm misunderstanding this, but the commit message says it would be
dangerous to allocate a very larger buffer, but here we only limit the
size under "dry_run".

Removing that "&& size > 8192" makes all the tests pass still, so there
seems to be some missing coverage here in any case.

> diff --git a/t/t5590-unpack-non-delta-objects.sh b/t/t5590-unpack-non-delta-objects.sh
> new file mode 100755
> index 0000000000..48c4fb1ba3
> --- /dev/null
> +++ b/t/t5590-unpack-non-delta-objects.sh
> @@ -0,0 +1,57 @@
> +#!/bin/sh
> +#
> +# Copyright (c) 2021 Han Xin
> +#
> +
> +test_description='Test unpack-objects with non-delta objects'
> +
> +GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
> +export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
> +
> +. ./test-lib.sh
> +
> +prepare_dest () {
> +	test_when_finished "rm -rf dest.git" &&
> +	git init --bare dest.git
> +}
> +
> +test_expect_success "setup repo with big blobs (1.5 MB)" '
> +	test-tool genrandom foo 1500000 >big-blob &&
> +	test_commit --append foo big-blob &&
> +	test-tool genrandom bar 1500000 >big-blob &&
> +	test_commit --append bar big-blob &&
> +	(
> +		cd .git &&
> +		find objects/?? -type f | sort
> +	) >expect &&
> +	PACK=$(echo main | git pack-objects --revs test)
> +'
> +
> +test_expect_success 'setup env: GIT_ALLOC_LIMIT to 1MB' '
> +	GIT_ALLOC_LIMIT=1m &&
> +	export GIT_ALLOC_LIMIT
> +'
> +
> +test_expect_success 'fail to unpack-objects: cannot allocate' '
> +	prepare_dest &&
> +	test_must_fail git -C dest.git unpack-objects <test-$PACK.pack 2>err &&
> +	grep "fatal: attempting to allocate" err &&
> +	(
> +		cd dest.git &&
> +		find objects/?? -type f | sort
> +	) >actual &&
> +	test_file_not_empty actual &&
> +	! test_cmp expect actual
> +'
> +
> +test_expect_success 'unpack-objects dry-run' '
> +	prepare_dest &&
> +	git -C dest.git unpack-objects -n <test-$PACK.pack &&
> +	(
> +		cd dest.git &&
> +		find objects/ -type f
> +	) >actual &&
> +	test_must_be_empty actual
> +'
> +
> +test_done

I commented on this "find" usage in an earlier round, I think there's a
much easier way to do this. You're really just going back and forth
between checking whether or not all the objects are loose.

I think that the below fix-up on top of this series is a better way to
do that, and more accurate. I.e. in your test here you check "!
test_cmp", which means that we could have some packed and some loose,
but really what you're meaning to check is a flip-flop between "all
loose?" and "no loose?.

In addition to that there was no reason to hardcode "main", we can just
use HEAD. All in all I think the below fix-up makes sense:

diff --git a/t/t5590-unpack-non-delta-objects.sh b/t/t5590-unpack-non-delta-objects.sh
index 8436cbf8db6..d78bb89225d 100755
--- a/t/t5590-unpack-non-delta-objects.sh
+++ b/t/t5590-unpack-non-delta-objects.sh
@@ -5,9 +5,6 @@
 
 test_description='Test unpack-objects with non-delta objects'
 
-GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
-export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
-
 . ./test-lib.sh
 
 prepare_dest () {
@@ -20,16 +17,22 @@ prepare_dest () {
 	fi
 }
 
+assert_no_loose () {
+	glob=dest.git/objects/?? &&
+	echo "$glob" >expect &&
+	echo $glob >actual &&
+	test_cmp expect actual
+}
+
 test_expect_success "setup repo with big blobs (1.5 MB)" '
 	test-tool genrandom foo 1500000 >big-blob &&
 	test_commit --append foo big-blob &&
 	test-tool genrandom bar 1500000 >big-blob &&
 	test_commit --append bar big-blob &&
-	(
-		cd .git &&
-		find objects/?? -type f | sort
-	) >expect &&
-	PACK=$(echo main | git pack-objects --revs test)
+
+	# Everything is loose
+	rmdir .git/objects/pack &&
+	PACK=$(echo HEAD | git pack-objects --revs test)
 '
 
 test_expect_success 'setup env: GIT_ALLOC_LIMIT to 1MB' '
@@ -41,51 +44,27 @@ test_expect_success 'fail to unpack-objects: cannot allocate' '
 	prepare_dest 2m &&
 	test_must_fail git -C dest.git unpack-objects <test-$PACK.pack 2>err &&
 	grep "fatal: attempting to allocate" err &&
-	(
-		cd dest.git &&
-		find objects/?? -type f | sort
-	) >actual &&
-	test_file_not_empty actual &&
-	! test_cmp expect actual
+	rmdir dest.git/objects/pack
 '
 
 test_expect_success 'unpack big object in stream' '
 	prepare_dest 1m &&
 	mkdir -p dest.git/objects/05 &&
 	git -C dest.git unpack-objects <test-$PACK.pack &&
-	git -C dest.git fsck &&
-	(
-		cd dest.git &&
-		find objects/?? -type f | sort
-	) >actual &&
-	test_cmp expect actual
+	rmdir dest.git/objects/pack
 '
 
 test_expect_success 'unpack big object in stream with existing oids' '
 	prepare_dest 1m &&
 	git -C dest.git index-pack --stdin <test-$PACK.pack &&
-	(
-		cd dest.git &&
-		find objects/?? -type f | sort
-	) >actual &&
-	test_must_be_empty actual &&
 	git -C dest.git unpack-objects <test-$PACK.pack &&
-	git -C dest.git fsck &&
-	(
-		cd dest.git &&
-		find objects/?? -type f | sort
-	) >actual &&
-	test_must_be_empty actual
+	assert_no_loose
 '
 
 test_expect_success 'unpack-objects dry-run' '
 	prepare_dest &&
 	git -C dest.git unpack-objects -n <test-$PACK.pack &&
-	(
-		cd dest.git &&
-		find objects/ -type f
-	) >actual &&
-	test_must_be_empty actual
+	assert_no_loose
 '
 
 test_done

René Scharfe Dec. 21, 2021, 2:43 p.m. UTC | #2

Am 21.12.21 um 15:09 schrieb Ævar Arnfjörð Bjarmason:
>
> On Tue, Dec 21 2021, Han Xin wrote:
>
>> From: Han Xin <hanxin.hx@alibaba-inc.com>
>>
>> In dry_run mode, "get_data()" is used to verify the inflation of data,
>> and the returned buffer will not be used at all and will be freed
>> immediately. Even in dry_run mode, it is dangerous to allocate a
>> full-size buffer for a large blob object. Therefore, only allocate a
>> low memory footprint when calling "get_data()" in dry_run mode.
>>
>> Suggested-by: Jiang Xin <zhiyou.jx@alibaba-inc.com>
>> Signed-off-by: Han Xin <hanxin.hx@alibaba-inc.com>
>> ---
>>  builtin/unpack-objects.c            | 23 +++++++++---
>>  t/t5590-unpack-non-delta-objects.sh | 57 +++++++++++++++++++++++++++++
>>  2 files changed, 74 insertions(+), 6 deletions(-)
>>  create mode 100755 t/t5590-unpack-non-delta-objects.sh
>>
>> diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
>> index 4a9466295b..9104eb48da 100644
>> --- a/builtin/unpack-objects.c
>> +++ b/builtin/unpack-objects.c
>> @@ -96,15 +96,21 @@ static void use(int bytes)
>>  	display_throughput(progress, consumed_bytes);
>>  }
>>
>> -static void *get_data(unsigned long size)
>> +static void *get_data(size_t size, int dry_run)
>>  {
>>  	git_zstream stream;
>> -	void *buf = xmallocz(size);
>> +	size_t bufsize;
>> +	void *buf;
>>
>>  	memset(&stream, 0, sizeof(stream));
>> +	if (dry_run && size > 8192)
>> +		bufsize = 8192;
>> +	else
>> +		bufsize = size;
>> +	buf = xmallocz(bufsize);
>
> Maybe I'm misunderstanding this, but the commit message says it would be
> dangerous to allocate a very larger buffer, but here we only limit the
> size under "dry_run".

This patch reduces the memory usage of dry runs, as its commit message
says.  The memory usage of one type of actual (non-dry) unpack is reduced
by patch 5.

> Removing that "&& size > 8192" makes all the tests pass still, so there
> seems to be some missing coverage here in any case.

How would you test that an 8KB buffer is allocated even though a smaller
one would suffice?  And why?  Wasting a few KB shouldn't be noticeable.

René

Ævar Arnfjörð Bjarmason Dec. 21, 2021, 3:04 p.m. UTC | #3

On Tue, Dec 21 2021, René Scharfe wrote:

> Am 21.12.21 um 15:09 schrieb Ævar Arnfjörð Bjarmason:
>>
>> On Tue, Dec 21 2021, Han Xin wrote:
>>
>>> From: Han Xin <hanxin.hx@alibaba-inc.com>
>>>
>>> In dry_run mode, "get_data()" is used to verify the inflation of data,
>>> and the returned buffer will not be used at all and will be freed
>>> immediately. Even in dry_run mode, it is dangerous to allocate a
>>> full-size buffer for a large blob object. Therefore, only allocate a
>>> low memory footprint when calling "get_data()" in dry_run mode.
>>>
>>> Suggested-by: Jiang Xin <zhiyou.jx@alibaba-inc.com>
>>> Signed-off-by: Han Xin <hanxin.hx@alibaba-inc.com>
>>> ---
>>>  builtin/unpack-objects.c            | 23 +++++++++---
>>>  t/t5590-unpack-non-delta-objects.sh | 57 +++++++++++++++++++++++++++++
>>>  2 files changed, 74 insertions(+), 6 deletions(-)
>>>  create mode 100755 t/t5590-unpack-non-delta-objects.sh
>>>
>>> diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
>>> index 4a9466295b..9104eb48da 100644
>>> --- a/builtin/unpack-objects.c
>>> +++ b/builtin/unpack-objects.c
>>> @@ -96,15 +96,21 @@ static void use(int bytes)
>>>  	display_throughput(progress, consumed_bytes);
>>>  }
>>>
>>> -static void *get_data(unsigned long size)
>>> +static void *get_data(size_t size, int dry_run)
>>>  {
>>>  	git_zstream stream;
>>> -	void *buf = xmallocz(size);
>>> +	size_t bufsize;
>>> +	void *buf;
>>>
>>>  	memset(&stream, 0, sizeof(stream));
>>> +	if (dry_run && size > 8192)
>>> +		bufsize = 8192;
>>> +	else
>>> +		bufsize = size;
>>> +	buf = xmallocz(bufsize);
>>
>> Maybe I'm misunderstanding this, but the commit message says it would be
>> dangerous to allocate a very larger buffer, but here we only limit the
>> size under "dry_run".
>
> This patch reduces the memory usage of dry runs, as its commit message
> says.  The memory usage of one type of actual (non-dry) unpack is reduced
> by patch 5.
>
>> Removing that "&& size > 8192" makes all the tests pass still, so there
>> seems to be some missing coverage here in any case.
>
> How would you test that an 8KB buffer is allocated even though a smaller
> one would suffice?  And why?  Wasting a few KB shouldn't be noticeable.

That doesn't sound like it needs to be tested. I was just trying to grok
what this was all doing. Thanks!

Jiang Xin Dec. 22, 2021, 11:15 a.m. UTC | #4

On Wed, Dec 22, 2021 at 9:53 AM René Scharfe <l.s.r@web.de> wrote:
>
> Am 21.12.21 um 15:09 schrieb Ævar Arnfjörð Bjarmason:
> > Maybe I'm misunderstanding this, but the commit message says it would be
> > dangerous to allocate a very larger buffer, but here we only limit the
> > size under "dry_run".
>
> This patch reduces the memory usage of dry runs, as its commit message
> says.  The memory usage of one type of actual (non-dry) unpack is reduced
> by patch 5.
>

For Han Xin and me, it is very challenging to write better commit log
in English.  Since the commit is moved to the beginning, the commit
log should be rewritten as follows:

unpack-objects.c: low memory footprint for get_data() in dry_run mode

As the name implies, "get_data(size)" will allocate and return a given
size of memory. Allocating memory for a large blob object may cause the
system to run out of memory. Before preparing to replace calling of
"get_data()" to resolve unpack issue of large blob objects, refactor
"get_data()" to reduce memory footprint for dry_run mode. Because
in dry_run mode, "get_data()" is only used to check the integrity of
data, and the returned buffer is not used at all.

Therefore, add the flag "dry_run" as an additional parameter of
"get_data()" and reuse a small buffer in dry_run mode. Because in
dry_run mode, the return buffer is not the entire data that the user
wants, for this reason, we will release the buffer and return NULL.

Han Xin, I think you can try to free the allocated buffer for dry_run
mode inside "get_data()".

--
Jiang Xin

Jiang Xin Dec. 22, 2021, 11:29 a.m. UTC | #5

On Wed, Dec 22, 2021 at 8:37 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> On Tue, Dec 21 2021, Han Xin wrote:
>
> I commented on this "find" usage in an earlier round, I think there's a
> much easier way to do this. You're really just going back and forth
> between checking whether or not all the objects are loose.
>
> I think that the below fix-up on top of this series is a better way to
> do that, and more accurate. I.e. in your test here you check "!
> test_cmp", which means that we could have some packed and some loose,
> but really what you're meaning to check is a flip-flop between "all
> loose?" and "no loose?.
>
> In addition to that there was no reason to hardcode "main", we can just
> use HEAD. All in all I think the below fix-up makes sense:
>
> diff --git a/t/t5590-unpack-non-delta-objects.sh b/t/t5590-unpack-non-delta-objects.sh
> index 8436cbf8db6..d78bb89225d 100755
> --- a/t/t5590-unpack-non-delta-objects.sh
> +++ b/t/t5590-unpack-non-delta-objects.sh
> @@ -5,9 +5,6 @@
>
>  test_description='Test unpack-objects with non-delta objects'
>
> -GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
> -export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
> -
>  . ./test-lib.sh
>
>  prepare_dest () {
> @@ -20,16 +17,22 @@ prepare_dest () {
>         fi
>  }
>
> +assert_no_loose () {
> +       glob=dest.git/objects/?? &&
> +       echo "$glob" >expect &&
> +       echo $glob >actual &&

Incompatible for zsh. This may work:

    eval "echo $glob" >actual &&

--
Jiang Xin

Jiang Xin Dec. 31, 2021, 3:06 a.m. UTC | #6

On Wed, Dec 22, 2021 at 2:33 AM Han Xin <chiyutianyi@gmail.com> wrote:
>
> From: Han Xin <hanxin.hx@alibaba-inc.com>
>
> In dry_run mode, "get_data()" is used to verify the inflation of data,
> and the returned buffer will not be used at all and will be freed
> immediately. Even in dry_run mode, it is dangerous to allocate a
> full-size buffer for a large blob object. Therefore, only allocate a
> low memory footprint when calling "get_data()" in dry_run mode.
>
> Suggested-by: Jiang Xin <zhiyou.jx@alibaba-inc.com>
> Signed-off-by: Han Xin <hanxin.hx@alibaba-inc.com>
> ---
>  builtin/unpack-objects.c            | 23 +++++++++---
>  t/t5590-unpack-non-delta-objects.sh | 57 +++++++++++++++++++++++++++++
>  2 files changed, 74 insertions(+), 6 deletions(-)
>  create mode 100755 t/t5590-unpack-non-delta-objects.sh
>
> diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
> index 4a9466295b..9104eb48da 100644
> --- a/builtin/unpack-objects.c
> +++ b/builtin/unpack-objects.c
> @@ -96,15 +96,21 @@ static void use(int bytes)
>         display_throughput(progress, consumed_bytes);
>  }
>
> -static void *get_data(unsigned long size)
> +static void *get_data(size_t size, int dry_run)

After a offline talk with Han Xin, we feel it is not necessary to pass
"dry_run" as a argument, use the file-scope static variable directly
in "get_data()".

>  {
>         git_zstream stream;
> -       void *buf = xmallocz(size);
> +       size_t bufsize;
> +       void *buf;
>
>         memset(&stream, 0, sizeof(stream));
> +       if (dry_run && size > 8192)

Use the file-scope static variable "dry_run".

[v7,1/5] unpack-objects.c: add dry_run mode for get_data()

Commit Message

Comments

Patch