diff mbox series

[v3,2/5] object-file.c: handle undetermined oid in write_loose_object()

Message ID 20211122033220.32883-3-chiyutianyi@gmail.com (mailing list archive)
State New, archived
Headers show
Series None | expand

Commit Message

Han Xin Nov. 22, 2021, 3:32 a.m. UTC
From: Han Xin <hanxin.hx@alibaba-inc.com>

When streaming a large blob object to "write_loose_object()", we have no
chance to run "write_object_file_prepare()" to calculate the oid in
advance. So we need to handle undetermined oid in function
"write_loose_object()".

In the original implementation, we know the oid and we can write the
temporary file in the same directory as the final object, but for an
object with an undetermined oid, we don't know the exact directory for
the object, so we have to save the temporary file in ".git/objects/"
directory instead.

Helped-by: Jiang Xin <zhiyou.jx@alibaba-inc.com>
Signed-off-by: Han Xin <hanxin.hx@alibaba-inc.com>
---
 object-file.c | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

Comments

Derrick Stolee Nov. 29, 2021, 3:10 p.m. UTC | #1
On 11/21/2021 10:32 PM, Han Xin wrote:
> From: Han Xin <hanxin.hx@alibaba-inc.com>
> 
> When streaming a large blob object to "write_loose_object()", we have no
> chance to run "write_object_file_prepare()" to calculate the oid in
> advance. So we need to handle undetermined oid in function
> "write_loose_object()".
> 
> In the original implementation, we know the oid and we can write the
> temporary file in the same directory as the final object, but for an
> object with an undetermined oid, we don't know the exact directory for
> the object, so we have to save the temporary file in ".git/objects/"
> directory instead.

My first reaction is to not write into .git/objects/ directly, but
instead make a .git/objects/tmp/ directory and write within that
directory. The idea is to prevent leaving stale files in the
.git/objects/ directory if the process terminates strangely (say,
a power outage or segfault).

If this was an interesting idea to pursue, it does leave a question:
should we clean up the tmp/ directory when it is empty? That would
require adding a check in finalize_object_file() that is probably
best left unchecked (the lstat() would add a cost per loose object
write that is probably too costly). I would rather leave an empty
tmp/ directory than add that cost per loose object write.

I suppose another way to do it would be to register the check as
an event at the end of the process, so we only check once, and
that only happens if we created a loose object with this streaming
method.

With all of these complications in mind, I think cleaning up the
stale tmp/ directory could (at the very least) be delayed to another
commit or patch series. Hopefully adding the directory is not too
much complication to add here.

> -	loose_object_path(the_repository, &filename, oid);
> +	if (is_null_oid(oid)) {
> +		/* When oid is not determined, save tmp file to odb path. */
> +		strbuf_reset(&filename);
> +		strbuf_addstr(&filename, the_repository->objects->odb->path);
> +		strbuf_addch(&filename, '/');

Here, you could instead of the strbuf_addch() do

	strbuf_add(&filename, "/tmp/", 5);
	if (safe_create_leading_directories(filename.buf)) {
		error(_("failed to create '%s'"));
		strbuf_release(&filename);
		return -1;
	}		

> +	} else {
> +		loose_object_path(the_repository, &filename, oid);
> +	}
>  
>  	fd = create_tmpfile(&tmp_file, filename.buf);
>  	if (fd < 0) {
> @@ -1939,12 +1946,31 @@ static int write_loose_object(const struct object_id *oid, char *hdr,
>  		die(_("deflateEnd on object %s failed (%d)"), oid_to_hex(oid),
>  		    ret);
>  	the_hash_algo->final_oid_fn(&parano_oid, &c);
> -	if (!oideq(oid, &parano_oid))
> +	if (!is_null_oid(oid) && !oideq(oid, &parano_oid))
>  		die(_("confused by unstable object source data for %s"),
>  		    oid_to_hex(oid));
>  
>  	close_loose_object(fd);
>  
> +	if (is_null_oid(oid)) {
> +		int dirlen;
> +
> +		oidcpy((struct object_id *)oid, &parano_oid);
> +		loose_object_path(the_repository, &filename, oid);
> +
> +		/* We finally know the object path, and create the missing dir. */
> +		dirlen = directory_size(filename.buf);
> +		if (dirlen) {
> +			struct strbuf dir = STRBUF_INIT;
> +			strbuf_add(&dir, filename.buf, dirlen - 1);
> +			if (mkdir(dir.buf, 0777) && errno != EEXIST)
> +				return -1;
> +			if (adjust_shared_perm(dir.buf))
> +				return -1;
> +			strbuf_release(&dir);
> +		}
> +	}
> +

Upon first reading I was asking "where is the file rename?" but
it is part of finalize_object_file() which is called further down.

Thanks,
-Stolee
Junio C Hamano Nov. 29, 2021, 8:44 p.m. UTC | #2
Derrick Stolee <stolee@gmail.com> writes:

> My first reaction is to not write into .git/objects/ directly, but
> instead make a .git/objects/tmp/ directory and write within that
> directory. The idea is to prevent leaving stale files in the
> .git/objects/ directory if the process terminates strangely (say,
> a power outage or segfault).

Even if we know the name of the object we are writing beforehand, I
do not think it is a good idea to open-write-close the final object
file.  The approach we already use everywhere is to write into a
tmpfile/lockfile and rename it to the final name 

object-file.c::write_loose_object() uses create_tmpfile() to prepare
a temporary file whose name begins with "tmp_obj_", so that "gc" can
recognize stale ones and remove them.

> If this was an interesting idea to pursue, it does leave a question:
> should we clean up the tmp/ directory when it is empty? That would
> require adding a check in finalize_object_file() that is probably
> best left unchecked (the lstat() would add a cost per loose object
> write that is probably too costly). I would rather leave an empty
> tmp/ directory than add that cost per loose object write.

I am not sure why we want a new tmp/ directory.
Derrick Stolee Nov. 29, 2021, 10:18 p.m. UTC | #3
On 11/29/2021 3:44 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> 
>> My first reaction is to not write into .git/objects/ directly, but
>> instead make a .git/objects/tmp/ directory and write within that
>> directory. The idea is to prevent leaving stale files in the
>> .git/objects/ directory if the process terminates strangely (say,
>> a power outage or segfault).
> 
> Even if we know the name of the object we are writing beforehand, I
> do not think it is a good idea to open-write-close the final object
> file.  The approach we already use everywhere is to write into a
> tmpfile/lockfile and rename it to the final name 
> 
> object-file.c::write_loose_object() uses create_tmpfile() to prepare
> a temporary file whose name begins with "tmp_obj_", so that "gc" can
> recognize stale ones and remove them.

The only difference is that the tmp_obj_* file would go into the
loose object directory corresponding to the first two hex characters
of the OID, but that no longer happens now.
 
>> If this was an interesting idea to pursue, it does leave a question:
>> should we clean up the tmp/ directory when it is empty? That would
>> require adding a check in finalize_object_file() that is probably
>> best left unchecked (the lstat() would add a cost per loose object
>> write that is probably too costly). I would rather leave an empty
>> tmp/ directory than add that cost per loose object write.
> 
> I am not sure why we want a new tmp/ directory.

I'm just thinking of a case where this fails repeatedly I would
rather have those failed tmp_obj_* files isolated in their own
directory. It's an extremely minor point, so I'm fine to drop
the recommendation.

Thanks,
-Stolee
Han Xin Nov. 30, 2021, 3:23 a.m. UTC | #4
On Tue, Nov 30, 2021 at 6:18 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 11/29/2021 3:44 PM, Junio C Hamano wrote:
> > Derrick Stolee <stolee@gmail.com> writes:
> >
> >> My first reaction is to not write into .git/objects/ directly, but
> >> instead make a .git/objects/tmp/ directory and write within that
> >> directory. The idea is to prevent leaving stale files in the
> >> .git/objects/ directory if the process terminates strangely (say,
> >> a power outage or segfault).
> >
> > Even if we know the name of the object we are writing beforehand, I
> > do not think it is a good idea to open-write-close the final object
> > file.  The approach we already use everywhere is to write into a
> > tmpfile/lockfile and rename it to the final name
> >
> > object-file.c::write_loose_object() uses create_tmpfile() to prepare
> > a temporary file whose name begins with "tmp_obj_", so that "gc" can
> > recognize stale ones and remove them.
>
> The only difference is that the tmp_obj_* file would go into the
> loose object directory corresponding to the first two hex characters
> of the OID, but that no longer happens now.
>

At the beginning of this patch, I did save the temporary object in a
two hex characters directory of "null_oid", but this is also a very
strange behavior. "Gc" will indeed clean up these tmp_obj_* files, no
matter if they are in .git/objects/ or .git/objects/xx.

Thanks,
-Han Xin
diff mbox series

Patch

diff --git a/object-file.c b/object-file.c
index 227f53a0de..78fd2a5d39 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1892,7 +1892,14 @@  static int write_loose_object(const struct object_id *oid, char *hdr,
 	const void *buf;
 	unsigned long len;
 
-	loose_object_path(the_repository, &filename, oid);
+	if (is_null_oid(oid)) {
+		/* When oid is not determined, save tmp file to odb path. */
+		strbuf_reset(&filename);
+		strbuf_addstr(&filename, the_repository->objects->odb->path);
+		strbuf_addch(&filename, '/');
+	} else {
+		loose_object_path(the_repository, &filename, oid);
+	}
 
 	fd = create_tmpfile(&tmp_file, filename.buf);
 	if (fd < 0) {
@@ -1939,12 +1946,31 @@  static int write_loose_object(const struct object_id *oid, char *hdr,
 		die(_("deflateEnd on object %s failed (%d)"), oid_to_hex(oid),
 		    ret);
 	the_hash_algo->final_oid_fn(&parano_oid, &c);
-	if (!oideq(oid, &parano_oid))
+	if (!is_null_oid(oid) && !oideq(oid, &parano_oid))
 		die(_("confused by unstable object source data for %s"),
 		    oid_to_hex(oid));
 
 	close_loose_object(fd);
 
+	if (is_null_oid(oid)) {
+		int dirlen;
+
+		oidcpy((struct object_id *)oid, &parano_oid);
+		loose_object_path(the_repository, &filename, oid);
+
+		/* We finally know the object path, and create the missing dir. */
+		dirlen = directory_size(filename.buf);
+		if (dirlen) {
+			struct strbuf dir = STRBUF_INIT;
+			strbuf_add(&dir, filename.buf, dirlen - 1);
+			if (mkdir(dir.buf, 0777) && errno != EEXIST)
+				return -1;
+			if (adjust_shared_perm(dir.buf))
+				return -1;
+			strbuf_release(&dir);
+		}
+	}
+
 	if (mtime) {
 		struct utimbuf utb;
 		utb.actime = mtime;