mbox series

[v2,0/4] Don't lazy-fetch commits when parsing them

Message ID cover.1669922792.git.jonathantanmy@google.com (mailing list archive)
Headers show
Series Don't lazy-fetch commits when parsing them | expand

Message

Jonathan Tan Dec. 1, 2022, 7:27 p.m. UTC
Thanks everyone for your reviews. Here is a reroll with the requested change
(just one small one).

Jonathan Tan (4):
  object-file: reread object with exact same args
  object-file: refactor corrupt object diagnosis
  object-file: refactor replace object lookup
  commit: don't lazy-fetch commits

 commit.c       | 18 ++++++++++++++--
 object-file.c  | 57 ++++++++++++++++++++++++++++++++++----------------
 object-store.h | 10 +++++++++
 3 files changed, 65 insertions(+), 20 deletions(-)

Range-diff against v1:
1:  604160e79c = 1:  604160e79c object-file: reread object with exact same args
2:  a8c5fcd9f8 ! 2:  1be60f1bf2 object-file: refactor corrupt object diagnosis
    @@ object-file.c: void *read_object_file_extended(struct repository *r,
      
      	/* die if we replaced an object with one that does not exist */
     -	if (repl != oid)
    -+	if (real_oid != oid)
    ++	if (!oideq(real_oid, oid))
      		die(_("replacement %s not found for %s"),
     -		    oid_to_hex(repl), oid_to_hex(oid));
     +		    oid_to_hex(real_oid), oid_to_hex(oid));
3:  940396307f = 3:  28935ba1b0 object-file: refactor replace object lookup
4:  6af8dcebd1 = 4:  a38229c42a commit: don't lazy-fetch commits

Comments

Jeff King Dec. 1, 2022, 7:54 p.m. UTC | #1
On Thu, Dec 01, 2022 at 11:27:29AM -0800, Jonathan Tan wrote:

> Thanks everyone for your reviews. Here is a reroll with the requested change
> (just one small one).

Thanks, this looks OK to me. However Junio noted in "What's cooking"
that it seems to break CI on windows. The problem is in t5318.93:

  2022-12-01T09:26:44.8887018Z ++ cat test_err
  2022-12-01T09:26:44.8887414Z error: Could not read 0000000000000000000000000000000000000000
  2022-12-01T09:26:44.8887825Z error: Could not read 0000000000000000000000000000000000000000
  2022-12-01T09:26:44.8888240Z error: Could not read 0000000000000000000000000000000000000000
  2022-12-01T09:26:44.8888639Z error: Could not read 0000000000000000000000000000000000000000
  2022-12-01T09:26:44.8889052Z error: Could not read 0000000000000000000000000000000000000000
  2022-12-01T09:26:44.8889512Z error: Could not read 0000000000000000000000000000000000000000
  2022-12-01T09:26:44.8889991Z fatal: failed to read object 0000000000000000000000000000000000000000: Function not implemented
  2022-12-01T09:26:44.8890401Z ++ return 1
  2022-12-01T09:26:44.8890761Z error: last command exited with $?=1
  2022-12-01T09:26:44.8891263Z not ok 93 - corrupt commit-graph write (broken parent)

Looks like the check in die_if_corrupt() is seeing a different errno
value than ENOENT. I wonder if we need to take more care to preserve it
across calls. It does look like we hit the same sequence of functions
that read_object_file_extended() did, but perhaps this was buggy all
along, and you're now exposing it through a new code path.

In particular I wonder if obj_read_unlock() might be the culprit here,
and something like this might help:

diff --git a/object-file.c b/object-file.c
index 8adef99a7c..db2d35519e 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1641,9 +1641,12 @@ int oid_object_info_extended(struct repository *r, const struct object_id *oid,
 			     struct object_info *oi, unsigned flags)
 {
 	int ret;
+	int save_errno;
 	obj_read_lock();
 	ret = do_oid_object_info_extended(r, oid, oi, flags);
+	save_errno = errno;
 	obj_read_unlock();
+	errno = save_errno;
 	return ret;
 }
 

-Peff
Jonathan Tan Dec. 1, 2022, 9:26 p.m. UTC | #2
Jeff King <peff@peff.net> writes:
> On Thu, Dec 01, 2022 at 11:27:29AM -0800, Jonathan Tan wrote:
> 
> > Thanks everyone for your reviews. Here is a reroll with the requested change
> > (just one small one).
> 
> Thanks, this looks OK to me. However Junio noted in "What's cooking"
> that it seems to break CI on windows. The problem is in t5318.93:
> 
>   2022-12-01T09:26:44.8887018Z ++ cat test_err
>   2022-12-01T09:26:44.8887414Z error: Could not read 0000000000000000000000000000000000000000
>   2022-12-01T09:26:44.8887825Z error: Could not read 0000000000000000000000000000000000000000
>   2022-12-01T09:26:44.8888240Z error: Could not read 0000000000000000000000000000000000000000
>   2022-12-01T09:26:44.8888639Z error: Could not read 0000000000000000000000000000000000000000
>   2022-12-01T09:26:44.8889052Z error: Could not read 0000000000000000000000000000000000000000
>   2022-12-01T09:26:44.8889512Z error: Could not read 0000000000000000000000000000000000000000
>   2022-12-01T09:26:44.8889991Z fatal: failed to read object 0000000000000000000000000000000000000000: Function not implemented
>   2022-12-01T09:26:44.8890401Z ++ return 1
>   2022-12-01T09:26:44.8890761Z error: last command exited with $?=1
>   2022-12-01T09:26:44.8891263Z not ok 93 - corrupt commit-graph write (broken parent)
> 
> Looks like the check in die_if_corrupt() is seeing a different errno
> value than ENOENT. I wonder if we need to take more care to preserve it
> across calls. It does look like we hit the same sequence of functions
> that read_object_file_extended() did, but perhaps this was buggy all
> along, and you're now exposing it through a new code path.
> 
> In particular I wonder if obj_read_unlock() might be the culprit here,
> and something like this might help:
> 
> diff --git a/object-file.c b/object-file.c
> index 8adef99a7c..db2d35519e 100644
> --- a/object-file.c
> +++ b/object-file.c
> @@ -1641,9 +1641,12 @@ int oid_object_info_extended(struct repository *r, const struct object_id *oid,
>  			     struct object_info *oi, unsigned flags)
>  {
>  	int ret;
> +	int save_errno;
>  	obj_read_lock();
>  	ret = do_oid_object_info_extended(r, oid, oi, flags);
> +	save_errno = errno;
>  	obj_read_unlock();
> +	errno = save_errno;
>  	return ret;
>  }
 
Copying die_if_corrupt() until "failed to read object":

> 1734 void die_if_corrupt(struct repository *r,                                                                                                                                                       
> 1735                     const struct object_id *oid,                                                                                                                                                
> 1736                     const struct object_id *real_oid)                                                                                                                                           
> 1737 {                                                                                                                                                                                               
> 1738         const struct packed_git *p;                                                                                                                                                             
> 1739         const char *path;                                                                                                                                                                       
> 1740         struct stat st;                                                                                                                                                                         
> 1741                                                                                                                                                                                                 
> 1742         obj_read_lock();                                                                                                                                                                        
> 1743         if (errno && errno != ENOENT)                                                                                                                                                           
> 1744                 die_errno(_("failed to read object %s"), oid_to_hex(oid));

I wonder if we could just remove this check. Even as it is, I don't think that
there is any guarantee that obj_read_lock() would not clobber errno. Removing
it makes all tests pass locally, but I haven't tried it on CI.

(One argument that could be made is that we shouldn't have any die_if_corrupt()
refactoring or other refactoring of the sort, because previously its contents
was part of a function and it could thus rely on the errno of what has happened
previously. But I think that even without my patches, we couldn't rely on it
in the first place - looking at obj_read_lock(), it looks like it could init a
mutex, and depending on the implementation of that, it could clobber errno.)
Junio C Hamano Dec. 1, 2022, 11:09 p.m. UTC | #3
Jeff King <peff@peff.net> writes:

> On Thu, Dec 01, 2022 at 11:27:29AM -0800, Jonathan Tan wrote:
>
>> Thanks everyone for your reviews. Here is a reroll with the requested change
>> (just one small one).
>
> Thanks, this looks OK to me. However Junio noted in "What's cooking"
> that it seems to break CI on windows. The problem is in t5318.93:
> ...
> In particular I wonder if obj_read_unlock() might be the culprit here,
> and something like this might help:

Thanks for following-up.
Jeff King Dec. 2, 2022, 12:23 a.m. UTC | #4
On Thu, Dec 01, 2022 at 01:26:50PM -0800, Jonathan Tan wrote:

> > 1734 void die_if_corrupt(struct repository *r,
> > 1735                     const struct object_id *oid,
> > 1736                     const struct object_id *real_oid)
> > 1737 {
> > 1738         const struct packed_git *p;
> > 1739         const char *path;
> > 1740         struct stat st;
> > 1741
> > 1742         obj_read_lock();
> > 1743         if (errno && errno != ENOENT)
> > 1744                 die_errno(_("failed to read object %s"), oid_to_hex(oid));
> 
> I wonder if we could just remove this check. Even as it is, I don't think that
> there is any guarantee that obj_read_lock() would not clobber errno. Removing
> it makes all tests pass locally, but I haven't tried it on CI.
> 
> (One argument that could be made is that we shouldn't have any die_if_corrupt()
> refactoring or other refactoring of the sort, because previously its contents
> was part of a function and it could thus rely on the errno of what has happened
> previously. But I think that even without my patches, we couldn't rely on it
> in the first place - looking at obj_read_lock(), it looks like it could init a
> mutex, and depending on the implementation of that, it could clobber errno.)

Yeah, I don't see any difference in the new caller versus what the
original was doing. The errno we care about comes from inside
oid_object_info_extended(). So in any case, we'll see at least
obj_read_unlock() followed by obj_read_lock() between the syscalls of
interest and this check. And I don't even really see any indication that
oid_object_info_extended() tries to set or preserve errno itself. The
likely sequence is:

  - find_pack_entry() fails to find it; errno isn't set at all
  - loose_object_info() tries to open it and probably gets ENOENT
  - we check find_pack_entry() again after reprepare_packed_git()
  - that fails so we return -1, barring submodule or partial clone
    tricks

So it really seems like we're quite likely to get an errno from opening
or mapping packs. Which implies the original suffers from the same
issue, but we simply never triggered it meaningfully in a test.

I'm not entirely sure on just removing the check. It comes from
3ba7a06552 (A loose object is not corrupt if it cannot be read due to
EMFILE, 2010-10-28), so we'd lose what that commit is trying to do.
Though I think even back then, I think it would have suffered from the
same problems (minus the lock/unlock; I'm still unclear which syscall is
the actual culprit here).

If we assume that errno from reading the object isn't reliable, I think
you'd have to actually re-check things. Something like:

  if (find_pack_entry(...) || !stat_loose_object(...))
    /* ok, it's not missing */

but of course we don't have the actual errno that _did_ cause us to
fail, which makes the error message we'd print a lot less useful. Maybe
this check should be ditched and we should complain much closer to the
source of the problem:

diff --git a/object-file.c b/object-file.c
index 26290554bb..743ba8210e 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1460,8 +1460,12 @@ static int loose_object_info(struct repository *r,
 	}
 
 	map = map_loose_object(r, oid, &mapsize);
-	if (!map)
+	if (!map) {
+		if (errno != ENOENT)
+			error_errno("unable to open loose object %s",
+				    oid_to_hex(oid));
 		return -1;
+	}
 
 	if (!oi->sizep)
 		oi->sizep = &size_scratch;

That might make things more verbose for other code paths, but that kind
of seems like a good thing. If you have an object file that we can't
open, we probably _should_ be complaining loudly about it.

We may need to be a little more careful about preserving errno in
map_loose_object_1(), though (gee, another place where the existing
check could run into trouble).

-Peff
Jonathan Tan Dec. 6, 2022, 12:49 a.m. UTC | #5
Jeff King <peff@peff.net> writes:
> Yeah, I don't see any difference in the new caller versus what the
> original was doing. The errno we care about comes from inside
> oid_object_info_extended(). So in any case, we'll see at least
> obj_read_unlock() followed by obj_read_lock() between the syscalls of
> interest and this check. And I don't even really see any indication that
> oid_object_info_extended() tries to set or preserve errno itself. The
> likely sequence is:
> 
>   - find_pack_entry() fails to find it; errno isn't set at all
>   - loose_object_info() tries to open it and probably gets ENOENT
>   - we check find_pack_entry() again after reprepare_packed_git()
>   - that fails so we return -1, barring submodule or partial clone
>     tricks
> 
> So it really seems like we're quite likely to get an errno from opening
> or mapping packs. Which implies the original suffers from the same
> issue, but we simply never triggered it meaningfully in a test.

Thanks for checking. I'm still not sure how the current code passes CI, but my
patches don't. 

> I'm not entirely sure on just removing the check. It comes from
> 3ba7a06552 (A loose object is not corrupt if it cannot be read due to
> EMFILE, 2010-10-28), so we'd lose what that commit is trying to do.
> Though I think even back then, I think it would have suffered from the
> same problems (minus the lock/unlock; I'm still unclear which syscall is
> the actual culprit here).

Ah, thanks for the pointer to that commit. Without that, my patch would report
corruption even if the real issue was EMFILE, as the commit message of that
commit describes.

> If we assume that errno from reading the object isn't reliable, I think
> you'd have to actually re-check things. Something like:
> 
>   if (find_pack_entry(...) || !stat_loose_object(...))
>     /* ok, it's not missing */
> 
> but of course we don't have the actual errno that _did_ cause us to
> fail, which makes the error message we'd print a lot less useful. Maybe
> this check should be ditched and we should complain much closer to the
> source of the problem:
> 
> diff --git a/object-file.c b/object-file.c
> index 26290554bb..743ba8210e 100644
> --- a/object-file.c
> +++ b/object-file.c
> @@ -1460,8 +1460,12 @@ static int loose_object_info(struct repository *r,
>  	}
>  
>  	map = map_loose_object(r, oid, &mapsize);
> -	if (!map)
> +	if (!map) {
> +		if (errno != ENOENT)
> +			error_errno("unable to open loose object %s",
> +				    oid_to_hex(oid));
>  		return -1;
> +	}
>  
>  	if (!oi->sizep)
>  		oi->sizep = &size_scratch;
> 
> That might make things more verbose for other code paths, but that kind
> of seems like a good thing. If you have an object file that we can't
> open, we probably _should_ be complaining loudly about it.
> 
> We may need to be a little more careful about preserving errno in
> map_loose_object_1(), though (gee, another place where the existing
> check could run into trouble).

Besides needing to be careful in map_loose_object_1(), I'm not sure if this
fully solves the problem. This is non-fatal, so the EMFILE commit's work would
still remain undone. If this were made fatal, I think this would change the
behavior of too much code, especially those that can tolerate loose objects
being missing.
 
What do you think of not putting any die_if_corrupt() calls in the commit
parsing code at all? The error message printed would then be different (just a
generic message about being unable to parse a commit, versus the specific one
here) but it does pass CI [1]. Also, I don't think that we should be doing errno
diagnostics separate from what causes the errno anyway.

[1] https://github.com/jonathantanmy/git/actions/runs/3624495729
Jeff King Dec. 6, 2022, 2:03 a.m. UTC | #6
On Mon, Dec 05, 2022 at 04:49:34PM -0800, Jonathan Tan wrote:

> > So it really seems like we're quite likely to get an errno from opening
> > or mapping packs. Which implies the original suffers from the same
> > issue, but we simply never triggered it meaningfully in a test.
> 
> Thanks for checking. I'm still not sure how the current code passes CI, but my
> patches don't. 

Hmm. Actually, I am now, too.

My assumption was that only certain tests would notice the problem,
because both outcomes are an error, and they only differ in what stderr
says ("this does not exist" versus "we got this weird errno"). And so I
assumed that the old spot in read_object_file() did not happen to
trigger any tests which check stderr, but your new caller in
repo_parse_commit_internal() was unlucky enough to do so. But since that
new caller was calling repo_read_object_file() before, I'd think it
would have triggered the same thing.

> > That might make things more verbose for other code paths, but that kind
> > of seems like a good thing. If you have an object file that we can't
> > open, we probably _should_ be complaining loudly about it.
> > 
> > We may need to be a little more careful about preserving errno in
> > map_loose_object_1(), though (gee, another place where the existing
> > check could run into trouble).
> 
> Besides needing to be careful in map_loose_object_1(), I'm not sure if this
> fully solves the problem. This is non-fatal, so the EMFILE commit's work would
> still remain undone. If this were made fatal, I think this would change the
> behavior of too much code, especially those that can tolerate loose objects
> being missing.

True, in the sense that we'd still say "X is corrupt". But I think the
real sin prior to 3ba7a0655 is that we _only_ said "hey, this looks
corrupt". If the output is:

  error: unable to mmap .git/objects/12/3456abcd...
  fatal: object 123456abcd is corrupt or missing

then that is at least not actively misleading (and is broadly similar to
other cases in Git, where higher-level code only knows "I expected us to
have this object and for some reason we don't", but without knowing
whether it was missing, or a system error, or corrupt).

That said I think all of these die() statements that you moved into
die_if_corrupt() are already doing the wrong thing. We should probably
mention errors (besides "missing object") to the user at the lowest
level where we can give the most detail, and then return errors up the
stack. That makes Git more verbose, but remember we're talking about
corrupted or broken repositories here. You shouldn't see these under
normal circumstances.

> What do you think of not putting any die_if_corrupt() calls in the commit
> parsing code at all? The error message printed would then be different (just a
> generic message about being unable to parse a commit, versus the specific one
> here) but it does pass CI [1]. Also, I don't think that we should be doing errno
> diagnostics separate from what causes the errno anyway.

So I think that's a lesser version of what I'm proposing above. ;) In
the sense that yes, I would say that repo_parse_commit_internal() does
not need to do anything more specific than its existing "could not read"
message. But I would go further and say:

  - it would be nice the low-level code where errno _is_ valid said more
    about what happened

  - we should take read_object_file_extended() in the same direction

-Peff