Message ID | cover.1669922792.git.jonathantanmy@google.com (mailing list archive) |
---|---|
Headers | show |
Series | Don't lazy-fetch commits when parsing them | expand |
On Thu, Dec 01, 2022 at 11:27:29AM -0800, Jonathan Tan wrote: > Thanks everyone for your reviews. Here is a reroll with the requested change > (just one small one). Thanks, this looks OK to me. However Junio noted in "What's cooking" that it seems to break CI on windows. The problem is in t5318.93: 2022-12-01T09:26:44.8887018Z ++ cat test_err 2022-12-01T09:26:44.8887414Z error: Could not read 0000000000000000000000000000000000000000 2022-12-01T09:26:44.8887825Z error: Could not read 0000000000000000000000000000000000000000 2022-12-01T09:26:44.8888240Z error: Could not read 0000000000000000000000000000000000000000 2022-12-01T09:26:44.8888639Z error: Could not read 0000000000000000000000000000000000000000 2022-12-01T09:26:44.8889052Z error: Could not read 0000000000000000000000000000000000000000 2022-12-01T09:26:44.8889512Z error: Could not read 0000000000000000000000000000000000000000 2022-12-01T09:26:44.8889991Z fatal: failed to read object 0000000000000000000000000000000000000000: Function not implemented 2022-12-01T09:26:44.8890401Z ++ return 1 2022-12-01T09:26:44.8890761Z error: last command exited with $?=1 2022-12-01T09:26:44.8891263Z not ok 93 - corrupt commit-graph write (broken parent) Looks like the check in die_if_corrupt() is seeing a different errno value than ENOENT. I wonder if we need to take more care to preserve it across calls. It does look like we hit the same sequence of functions that read_object_file_extended() did, but perhaps this was buggy all along, and you're now exposing it through a new code path. In particular I wonder if obj_read_unlock() might be the culprit here, and something like this might help: diff --git a/object-file.c b/object-file.c index 8adef99a7c..db2d35519e 100644 --- a/object-file.c +++ b/object-file.c @@ -1641,9 +1641,12 @@ int oid_object_info_extended(struct repository *r, const struct object_id *oid, struct object_info *oi, unsigned flags) { int ret; + int save_errno; obj_read_lock(); ret = do_oid_object_info_extended(r, oid, oi, flags); + save_errno = errno; obj_read_unlock(); + errno = save_errno; return ret; } -Peff
Jeff King <peff@peff.net> writes: > On Thu, Dec 01, 2022 at 11:27:29AM -0800, Jonathan Tan wrote: > > > Thanks everyone for your reviews. Here is a reroll with the requested change > > (just one small one). > > Thanks, this looks OK to me. However Junio noted in "What's cooking" > that it seems to break CI on windows. The problem is in t5318.93: > > 2022-12-01T09:26:44.8887018Z ++ cat test_err > 2022-12-01T09:26:44.8887414Z error: Could not read 0000000000000000000000000000000000000000 > 2022-12-01T09:26:44.8887825Z error: Could not read 0000000000000000000000000000000000000000 > 2022-12-01T09:26:44.8888240Z error: Could not read 0000000000000000000000000000000000000000 > 2022-12-01T09:26:44.8888639Z error: Could not read 0000000000000000000000000000000000000000 > 2022-12-01T09:26:44.8889052Z error: Could not read 0000000000000000000000000000000000000000 > 2022-12-01T09:26:44.8889512Z error: Could not read 0000000000000000000000000000000000000000 > 2022-12-01T09:26:44.8889991Z fatal: failed to read object 0000000000000000000000000000000000000000: Function not implemented > 2022-12-01T09:26:44.8890401Z ++ return 1 > 2022-12-01T09:26:44.8890761Z error: last command exited with $?=1 > 2022-12-01T09:26:44.8891263Z not ok 93 - corrupt commit-graph write (broken parent) > > Looks like the check in die_if_corrupt() is seeing a different errno > value than ENOENT. I wonder if we need to take more care to preserve it > across calls. It does look like we hit the same sequence of functions > that read_object_file_extended() did, but perhaps this was buggy all > along, and you're now exposing it through a new code path. > > In particular I wonder if obj_read_unlock() might be the culprit here, > and something like this might help: > > diff --git a/object-file.c b/object-file.c > index 8adef99a7c..db2d35519e 100644 > --- a/object-file.c > +++ b/object-file.c > @@ -1641,9 +1641,12 @@ int oid_object_info_extended(struct repository *r, const struct object_id *oid, > struct object_info *oi, unsigned flags) > { > int ret; > + int save_errno; > obj_read_lock(); > ret = do_oid_object_info_extended(r, oid, oi, flags); > + save_errno = errno; > obj_read_unlock(); > + errno = save_errno; > return ret; > } Copying die_if_corrupt() until "failed to read object": > 1734 void die_if_corrupt(struct repository *r, > 1735 const struct object_id *oid, > 1736 const struct object_id *real_oid) > 1737 { > 1738 const struct packed_git *p; > 1739 const char *path; > 1740 struct stat st; > 1741 > 1742 obj_read_lock(); > 1743 if (errno && errno != ENOENT) > 1744 die_errno(_("failed to read object %s"), oid_to_hex(oid)); I wonder if we could just remove this check. Even as it is, I don't think that there is any guarantee that obj_read_lock() would not clobber errno. Removing it makes all tests pass locally, but I haven't tried it on CI. (One argument that could be made is that we shouldn't have any die_if_corrupt() refactoring or other refactoring of the sort, because previously its contents was part of a function and it could thus rely on the errno of what has happened previously. But I think that even without my patches, we couldn't rely on it in the first place - looking at obj_read_lock(), it looks like it could init a mutex, and depending on the implementation of that, it could clobber errno.)
Jeff King <peff@peff.net> writes: > On Thu, Dec 01, 2022 at 11:27:29AM -0800, Jonathan Tan wrote: > >> Thanks everyone for your reviews. Here is a reroll with the requested change >> (just one small one). > > Thanks, this looks OK to me. However Junio noted in "What's cooking" > that it seems to break CI on windows. The problem is in t5318.93: > ... > In particular I wonder if obj_read_unlock() might be the culprit here, > and something like this might help: Thanks for following-up.
On Thu, Dec 01, 2022 at 01:26:50PM -0800, Jonathan Tan wrote: > > 1734 void die_if_corrupt(struct repository *r, > > 1735 const struct object_id *oid, > > 1736 const struct object_id *real_oid) > > 1737 { > > 1738 const struct packed_git *p; > > 1739 const char *path; > > 1740 struct stat st; > > 1741 > > 1742 obj_read_lock(); > > 1743 if (errno && errno != ENOENT) > > 1744 die_errno(_("failed to read object %s"), oid_to_hex(oid)); > > I wonder if we could just remove this check. Even as it is, I don't think that > there is any guarantee that obj_read_lock() would not clobber errno. Removing > it makes all tests pass locally, but I haven't tried it on CI. > > (One argument that could be made is that we shouldn't have any die_if_corrupt() > refactoring or other refactoring of the sort, because previously its contents > was part of a function and it could thus rely on the errno of what has happened > previously. But I think that even without my patches, we couldn't rely on it > in the first place - looking at obj_read_lock(), it looks like it could init a > mutex, and depending on the implementation of that, it could clobber errno.) Yeah, I don't see any difference in the new caller versus what the original was doing. The errno we care about comes from inside oid_object_info_extended(). So in any case, we'll see at least obj_read_unlock() followed by obj_read_lock() between the syscalls of interest and this check. And I don't even really see any indication that oid_object_info_extended() tries to set or preserve errno itself. The likely sequence is: - find_pack_entry() fails to find it; errno isn't set at all - loose_object_info() tries to open it and probably gets ENOENT - we check find_pack_entry() again after reprepare_packed_git() - that fails so we return -1, barring submodule or partial clone tricks So it really seems like we're quite likely to get an errno from opening or mapping packs. Which implies the original suffers from the same issue, but we simply never triggered it meaningfully in a test. I'm not entirely sure on just removing the check. It comes from 3ba7a06552 (A loose object is not corrupt if it cannot be read due to EMFILE, 2010-10-28), so we'd lose what that commit is trying to do. Though I think even back then, I think it would have suffered from the same problems (minus the lock/unlock; I'm still unclear which syscall is the actual culprit here). If we assume that errno from reading the object isn't reliable, I think you'd have to actually re-check things. Something like: if (find_pack_entry(...) || !stat_loose_object(...)) /* ok, it's not missing */ but of course we don't have the actual errno that _did_ cause us to fail, which makes the error message we'd print a lot less useful. Maybe this check should be ditched and we should complain much closer to the source of the problem: diff --git a/object-file.c b/object-file.c index 26290554bb..743ba8210e 100644 --- a/object-file.c +++ b/object-file.c @@ -1460,8 +1460,12 @@ static int loose_object_info(struct repository *r, } map = map_loose_object(r, oid, &mapsize); - if (!map) + if (!map) { + if (errno != ENOENT) + error_errno("unable to open loose object %s", + oid_to_hex(oid)); return -1; + } if (!oi->sizep) oi->sizep = &size_scratch; That might make things more verbose for other code paths, but that kind of seems like a good thing. If you have an object file that we can't open, we probably _should_ be complaining loudly about it. We may need to be a little more careful about preserving errno in map_loose_object_1(), though (gee, another place where the existing check could run into trouble). -Peff
Jeff King <peff@peff.net> writes: > Yeah, I don't see any difference in the new caller versus what the > original was doing. The errno we care about comes from inside > oid_object_info_extended(). So in any case, we'll see at least > obj_read_unlock() followed by obj_read_lock() between the syscalls of > interest and this check. And I don't even really see any indication that > oid_object_info_extended() tries to set or preserve errno itself. The > likely sequence is: > > - find_pack_entry() fails to find it; errno isn't set at all > - loose_object_info() tries to open it and probably gets ENOENT > - we check find_pack_entry() again after reprepare_packed_git() > - that fails so we return -1, barring submodule or partial clone > tricks > > So it really seems like we're quite likely to get an errno from opening > or mapping packs. Which implies the original suffers from the same > issue, but we simply never triggered it meaningfully in a test. Thanks for checking. I'm still not sure how the current code passes CI, but my patches don't. > I'm not entirely sure on just removing the check. It comes from > 3ba7a06552 (A loose object is not corrupt if it cannot be read due to > EMFILE, 2010-10-28), so we'd lose what that commit is trying to do. > Though I think even back then, I think it would have suffered from the > same problems (minus the lock/unlock; I'm still unclear which syscall is > the actual culprit here). Ah, thanks for the pointer to that commit. Without that, my patch would report corruption even if the real issue was EMFILE, as the commit message of that commit describes. > If we assume that errno from reading the object isn't reliable, I think > you'd have to actually re-check things. Something like: > > if (find_pack_entry(...) || !stat_loose_object(...)) > /* ok, it's not missing */ > > but of course we don't have the actual errno that _did_ cause us to > fail, which makes the error message we'd print a lot less useful. Maybe > this check should be ditched and we should complain much closer to the > source of the problem: > > diff --git a/object-file.c b/object-file.c > index 26290554bb..743ba8210e 100644 > --- a/object-file.c > +++ b/object-file.c > @@ -1460,8 +1460,12 @@ static int loose_object_info(struct repository *r, > } > > map = map_loose_object(r, oid, &mapsize); > - if (!map) > + if (!map) { > + if (errno != ENOENT) > + error_errno("unable to open loose object %s", > + oid_to_hex(oid)); > return -1; > + } > > if (!oi->sizep) > oi->sizep = &size_scratch; > > That might make things more verbose for other code paths, but that kind > of seems like a good thing. If you have an object file that we can't > open, we probably _should_ be complaining loudly about it. > > We may need to be a little more careful about preserving errno in > map_loose_object_1(), though (gee, another place where the existing > check could run into trouble). Besides needing to be careful in map_loose_object_1(), I'm not sure if this fully solves the problem. This is non-fatal, so the EMFILE commit's work would still remain undone. If this were made fatal, I think this would change the behavior of too much code, especially those that can tolerate loose objects being missing. What do you think of not putting any die_if_corrupt() calls in the commit parsing code at all? The error message printed would then be different (just a generic message about being unable to parse a commit, versus the specific one here) but it does pass CI [1]. Also, I don't think that we should be doing errno diagnostics separate from what causes the errno anyway. [1] https://github.com/jonathantanmy/git/actions/runs/3624495729
On Mon, Dec 05, 2022 at 04:49:34PM -0800, Jonathan Tan wrote: > > So it really seems like we're quite likely to get an errno from opening > > or mapping packs. Which implies the original suffers from the same > > issue, but we simply never triggered it meaningfully in a test. > > Thanks for checking. I'm still not sure how the current code passes CI, but my > patches don't. Hmm. Actually, I am now, too. My assumption was that only certain tests would notice the problem, because both outcomes are an error, and they only differ in what stderr says ("this does not exist" versus "we got this weird errno"). And so I assumed that the old spot in read_object_file() did not happen to trigger any tests which check stderr, but your new caller in repo_parse_commit_internal() was unlucky enough to do so. But since that new caller was calling repo_read_object_file() before, I'd think it would have triggered the same thing. > > That might make things more verbose for other code paths, but that kind > > of seems like a good thing. If you have an object file that we can't > > open, we probably _should_ be complaining loudly about it. > > > > We may need to be a little more careful about preserving errno in > > map_loose_object_1(), though (gee, another place where the existing > > check could run into trouble). > > Besides needing to be careful in map_loose_object_1(), I'm not sure if this > fully solves the problem. This is non-fatal, so the EMFILE commit's work would > still remain undone. If this were made fatal, I think this would change the > behavior of too much code, especially those that can tolerate loose objects > being missing. True, in the sense that we'd still say "X is corrupt". But I think the real sin prior to 3ba7a0655 is that we _only_ said "hey, this looks corrupt". If the output is: error: unable to mmap .git/objects/12/3456abcd... fatal: object 123456abcd is corrupt or missing then that is at least not actively misleading (and is broadly similar to other cases in Git, where higher-level code only knows "I expected us to have this object and for some reason we don't", but without knowing whether it was missing, or a system error, or corrupt). That said I think all of these die() statements that you moved into die_if_corrupt() are already doing the wrong thing. We should probably mention errors (besides "missing object") to the user at the lowest level where we can give the most detail, and then return errors up the stack. That makes Git more verbose, but remember we're talking about corrupted or broken repositories here. You shouldn't see these under normal circumstances. > What do you think of not putting any die_if_corrupt() calls in the commit > parsing code at all? The error message printed would then be different (just a > generic message about being unable to parse a commit, versus the specific one > here) but it does pass CI [1]. Also, I don't think that we should be doing errno > diagnostics separate from what causes the errno anyway. So I think that's a lesser version of what I'm proposing above. ;) In the sense that yes, I would say that repo_parse_commit_internal() does not need to do anything more specific than its existing "could not read" message. But I would go further and say: - it would be nice the low-level code where errno _is_ valid said more about what happened - we should take read_object_file_extended() in the same direction -Peff