sha1-file: remove OBJECT_INFO_SKIP_CACHED

Message ID	20191230211027.37002-1-jonathantanmy@google.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=o5qA=2U=vger.kernel.org=git-owner@kernel.org> Date: Mon, 30 Dec 2019 13:10:27 -0800 Message-Id: <20191230211027.37002-1-jonathantanmy@google.com> Mime-Version: 1.0 Subject: [PATCH] sha1-file: remove OBJECT_INFO_SKIP_CACHED From: Jonathan Tan <jonathantanmy@google.com> To: git@vger.kernel.org Cc: Jonathan Tan <jonathantanmy@google.com> Content-Type: text/plain; charset="UTF-8" Sender: git-owner@vger.kernel.org Precedence: bulk
Series	sha1-file: remove OBJECT_INFO_SKIP_CACHED \| expand sha1-file: remove OBJECT_INFO_SKIP_CACHED

Message ID

20191230211027.37002-1-jonathantanmy@google.com (mailing list archive)

State

New, archived

Headers

Date: Mon, 30 Dec 2019 13:10:27 -0800
Message-Id: <20191230211027.37002-1-jonathantanmy@google.com>
Mime-Version: 1.0
Subject: [PATCH] sha1-file: remove OBJECT_INFO_SKIP_CACHED
From: Jonathan Tan <jonathantanmy@google.com>
To: git@vger.kernel.org
Cc: Jonathan Tan <jonathantanmy@google.com>
Content-Type: text/plain; charset="UTF-8"
Sender: git-owner@vger.kernel.org
Precedence: bulk

Series

sha1-file: remove OBJECT_INFO_SKIP_CACHED | expand

Commit Message

Jonathan Tan Dec. 30, 2019, 9:10 p.m. UTC

In a partial clone, if a user provides the hash of the empty tree ("git
mktree </dev/null" - for SHA-1, this is 4b825d...) to a command which
requires that that object be parsed, for example:

  git diff-tree 4b825d <a non-empty tree>

then Git will lazily fetch the empty tree. This fetch would merely be
inconvenient if the promisor remote could supply that tree, but at
$DAYJOB we discovered that some repositories do not (e.g. [1]).

There are 2 functions: repo_has_object_file() which does not consult
find_cached_object() (which, among other things, knows about the empty
tree); and repo_read_object_file() which does. This issue occurs
because, as an optimization to avoid reading blobs into memory,
parse_object() calls repo_has_object_file() before
repo_read_object_file(). In the case of a regular repository (that is,
not a partial clone), repo_has_object_file() will return false for the
empty tree (thus bypassing the optimization) and repo_read_object_file()
will nevertheless succeed, thus things coincidentally work. But in a
partial clone, repo_has_object_file() triggers a lazy fetch of the
missing empty tree. This optimization was introduced by 090ea12671
("parse_object: avoid putting whole blob in core", 2012-03-07), and the
empty-tree special-case dichotomy between repo_has_object_file() (then,
has_sha1_file()) and repo_read_object_file() (then, sha1_object_info())
has existed since then.

(The flag OBJECT_INFO_SKIP_CACHED, introduced later in dfdd4afcf9
("sha1_file: teach sha1_object_info_extended more flags", 2017-06-26)
and used in e83e71c5e1 ("sha1_file: refactor has_sha1_file_with_flags",
2017-06-26), was introduced to preserve the empty-tree special-case
dichotomy.)

The best solution to the problem introduced at the start of this commit
message seems to be to eliminate this dichotomy. Not only will this fix
the problem, but it reduces a potential avenue of surprise for future
developers of Git - it will no longer be the case that
repo_has_object_file() doesn't know about empty trees, but
repo_read_object_file() does. A cost is that repo_has_object_file() will
now need to oideq upon each invocation, but that is trivial compared to
the filesystem lookup or the pack index search required anyway. (And if
find_cached_object() needs to do more because of previous invocations to
pretend_object_file(), all the more reason to be consistent in whether
we present cached objects.) Therefore, remove OBJECT_INFO_SKIP_CACHED.

[1] https://dart.googlesource.com/json_rpc_2

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
I put a lot of historical references which makes the commit message
rather long - let me know if you think that some can be omitted.
---
 object-store.h |  2 --
 sha1-file.c    | 38 ++++++++++++++++++--------------------
 2 files changed, 18 insertions(+), 22 deletions(-)

Comments

Junio C Hamano Dec. 30, 2019, 9:43 p.m. UTC | #1

Jonathan Tan <jonathantanmy@google.com> writes:

> In a partial clone, if a user provides the hash of the empty tree ("git
> mktree </dev/null" - for SHA-1, this is 4b825d...) to a command which
> requires that that object be parsed, for example:
>
>   git diff-tree 4b825d <a non-empty tree>
>
> then Git will lazily fetch the empty tree.

That sounds like a bug.  Shouldn't some objects like empty tree and
empty blob whose names are hardcoded come internally without being
fetched from anywhere?

> There are 2 functions: repo_has_object_file() which does not consult
> find_cached_object() (which, among other things, knows about the empty
> tree); and repo_read_object_file() which does. This issue occurs
> because, as an optimization to avoid reading blobs into memory,
> parse_object() calls repo_has_object_file() before
> repo_read_object_file(). In the case of a regular repository (that is,
> not a partial clone), repo_has_object_file() will return false for the
> empty tree (thus bypassing the optimization) and...

OK, that bypassing already sounds wrong.  

> ... it will no longer be the case that
> repo_has_object_file() doesn't know about empty trees, but
> repo_read_object_file() does.

Yup, that sounds like the right solution to the bug, with or without
lazy cloning.

Jonathan Nieder Dec. 30, 2019, 10:01 p.m. UTC | #2

Hi,

Jonathan Tan wrote:

> In a partial clone, if a user provides the hash of the empty tree ("git
> mktree </dev/null" - for SHA-1, this is 4b825d...) to a command which
> requires that that object be parsed, for example:
>
>   git diff-tree 4b825d <a non-empty tree>
>
> then Git will lazily fetch the empty tree. This fetch would merely be
> inconvenient if the promisor remote could supply that tree, but at
> $DAYJOB we discovered that some repositories do not (e.g. [1]).

Ooh, I think there's something subtle hiding in this paragraph.

When I first read it, I thought you meant that the repositories are
not self-contained --- that they contain references to the empty tree
but do not fulfill "want"s for them.  I don't believe that's what you
mean, though.

My second reading is the repository genuinely doesn't contain the
empty tree but different Git server implementations handle that
differently.  I tried to reproduce this with

	empty_tree=$(git mktree </dev/null)
	git init --bare x
	git clone --filter=blob:none file://$(pwd)/x y
	cd y
	echo hi >README
	git add README
	git commit -m 'nonempty tree'
	GIT_TRACE=1 git diff-tree "$empty_tree" HEAD

and indeed, it looks like Git serves the empty tree even from
repositories that don't contain it.  By comparison, JGit does not do
the same trick.  So we don't need to refer to a specific repository in
the wild to reproduce this.

All that said, having to fetch this object in the first place is
unexpected.  The question of the promisor remote serving it is only
relevant from the point of view of "why didn't we discover this
sooner?"

> There are 2 functions: repo_has_object_file() which does not consult
> find_cached_object() (which, among other things, knows about the empty
> tree); and repo_read_object_file() which does.

Hm, where does this dichotomy come from?  E.g. is the latter a
lower-level function used by the former?

>                                                This issue occurs
> because,

nit: on first reading I had trouble figuring out what "this issue"
refers to here.

[...]
>          as an optimization to avoid reading blobs into memory,
> parse_object() calls repo_has_object_file() before
> repo_read_object_file(). In the case of a regular repository (that is,
> not a partial clone), repo_has_object_file() will return false for the
> empty tree (thus bypassing the optimization) and repo_read_object_file()
> will nevertheless succeed, thus things coincidentally work.

This might be easier to understand if phrased in terms of the
intention behind the code instead of the specific call stacks used.
See f06ab027 for an example of this kind of thing.  For example:

  Applications within and outside of Git benefit from being able to
  assume that every repository contains the empty tree as an object
  (see, for example, commit 9abd46a347 "Make sure the empty tree
  exists when needed in merge-recursive", 2006-12-07).  To this end,
  since 346245a1bb (hard-code the empty tree object, 2008-02-13), Git
  has made the empty tree available in all repositories via
  find_cached_object, which all object access paths can use.

  Object existence checks (has_object_file), however, do not use
  find_cached_object.  <describe reason here>

>                                                             But in a
> partial clone, repo_has_object_file() triggers a lazy fetch of the
> missing empty tree.

  This particularly affects partial clones: has_object_file does not
  only report false in this case but contacts the promisor remote in
  order to obtain that answer.  The cost of these repeated negative
  lookups can add up.

  For example, in an optimization introduced in 090ea12671
  ("parse_object: avoid putting whole blob in core", 2012-03-07),
  object parsing uses has_object_file before read_object_file to check
  for a fast-path, so this negative lookup is triggered whenever we
  try to parse the absent empty tree.

When I state it this way, it's not obvious why this particular caller
of has_object_file is more relevant than others.  Did I miss some
subtlety?

[...]
>                                            This fetch would merely be
> inconvenient if the promisor remote could supply that tree, but at
> $DAYJOB we discovered that some repositories do not (e.g. [1]).

  If the promisor remote is running standard Git then it *does* have a
  copy of the empty tree, via the cached object itself.  This
  guarantee is not a documented part of the protocol, however, and
  other Git servers do not implement it.

> The best solution to the problem introduced at the start of this commit
> message seems to be to eliminate this dichotomy.

Indeed.  Can we justify the change even more simply in those terms?

  Object existence checks (has_object_file), however, do not use
  find_cached_object.  <describe reason here>

  This makes the API unnecessarily difficult to reason about.
  Instead, let's consistently view the empty tree as a genuine part of
  the repository, even in has_object_file.  As a side effect, this
  allows us to simplify the common 'has_object_file ||
  find_cached_object' pattern to a more simple existence check.

[...]
>                               A cost is that repo_has_object_file() will
> now need to oideq upon each invocation, but that is trivial compared to
> the filesystem lookup or the pack index search required anyway. (And if
> find_cached_object() needs to do more because of previous invocations to
> pretend_object_file(), all the more reason to be consistent in whether
> we present cached objects.) Therefore, remove OBJECT_INFO_SKIP_CACHED.

Thanks for discussing the possible costs, and I agree that they're
trivial relative to the I/O that these functions already incur.

[...]
>  object-store.h |  2 --
>  sha1-file.c    | 38 ++++++++++++++++++--------------------
>  2 files changed, 18 insertions(+), 22 deletions(-)

As hinted above, we should be able to simplify away has_sha1_file ||
find_cached_object checks in this change.

Thanks and hpoe that helps,
Jonathan

Jonathan Tan Dec. 31, 2019, 12:39 a.m. UTC | #3

> Ooh, I think there's something subtle hiding in this paragraph.
> 
> When I first read it, I thought you meant that the repositories are
> not self-contained --- that they contain references to the empty tree
> but do not fulfill "want"s for them.  I don't believe that's what you
> mean, though.
> 
> My second reading is the repository genuinely doesn't contain the
> empty tree but different Git server implementations handle that
> differently.  I tried to reproduce this with
> 
> 	empty_tree=$(git mktree </dev/null)
> 	git init --bare x
> 	git clone --filter=blob:none file://$(pwd)/x y
> 	cd y
> 	echo hi >README
> 	git add README
> 	git commit -m 'nonempty tree'
> 	GIT_TRACE=1 git diff-tree "$empty_tree" HEAD
> 
> and indeed, it looks like Git serves the empty tree even from
> repositories that don't contain it.  By comparison, JGit does not do
> the same trick.  So we don't need to refer to a specific repository in
> the wild to reproduce this.
> 
> All that said, having to fetch this object in the first place is
> unexpected.  The question of the promisor remote serving it is only
> relevant from the point of view of "why didn't we discover this
> sooner?"

Yes, that's true. I'll reword it to emphasize the spurious fetching (and
mention some implementations like JGit not serving those, which
exacerbates the problem).

I think we didn't discover this sooner because not many people directly
enter the empty tree on the command line :-) (but this is a problem that
we should solve, most certainly).

> > There are 2 functions: repo_has_object_file() which does not consult
> > find_cached_object() (which, among other things, knows about the empty
> > tree); and repo_read_object_file() which does.
> 
> Hm, where does this dichotomy come from?  E.g. is the latter a
> lower-level function used by the former?

I don't know the reason for the dichotomy - perhaps it was an oversight.
The relevant commits are:

d66b37bb19 ("Add pretend_sha1_file() interface.", 2007-02-05)
Adds a way to pretend that some objects are present by including them in
a cache - empty-tree-pervasiveness is built on top of this later. Only
read_sha1_file() was updated to make use of this; has_sha1_file() and
sha1_object_info() were already present at this time, but were not
updated. (Maybe the latter 2 had no need for pretending?)

346245a1bb ("hard-code the empty tree object", 2008-02-13)
Empty tree pervasiveness built on top of the cache.

c4d9986f5f ("sha1_object_info: examine cached_object store too",
2011-02-07)
sha1_object_info() was updated to use the cache. has_sha1_file() is
never mentioned.

> > as an optimization to avoid reading blobs into memory,
> > parse_object() calls repo_has_object_file() before
> > repo_read_object_file(). In the case of a regular repository (that is,
> > not a partial clone), repo_has_object_file() will return false for the
> > empty tree (thus bypassing the optimization) and repo_read_object_file()
> > will nevertheless succeed, thus things coincidentally work.
> 
> This might be easier to understand if phrased in terms of the
> intention behind the code instead of the specific call stacks used.
> See f06ab027 for an example of this kind of thing.  For example:
> 
>   Applications within and outside of Git benefit from being able to
>   assume that every repository contains the empty tree as an object
>   (see, for example, commit 9abd46a347 "Make sure the empty tree
>   exists when needed in merge-recursive", 2006-12-07).  To this end,
>   since 346245a1bb (hard-code the empty tree object, 2008-02-13), Git
>   has made the empty tree available in all repositories via
>   find_cached_object, which all object access paths can use.
> 
>   Object existence checks (has_object_file), however, do not use
>   find_cached_object.  <describe reason here>
> 
> When I state it this way, it's not obvious why this particular caller
> of has_object_file is more relevant than others.  Did I miss some
> subtlety?

This particular caller is what caused us to notice this problem.

> Indeed.  Can we justify the change even more simply in those terms?
> 
>   Object existence checks (has_object_file), however, do not use
>   find_cached_object.  <describe reason here>
> 
>   This makes the API unnecessarily difficult to reason about.
>   Instead, let's consistently view the empty tree as a genuine part of
>   the repository, even in has_object_file.  As a side effect, this
>   allows us to simplify the common 'has_object_file ||
>   find_cached_object' pattern to a more simple existence check.

OK, let me see if I can rewrite it to emphasize the reasoning-about-API
part. I still want to include the user-facing part that caused us to
observe this problem, but I can deemphasize it.

This might be a moot point, but what do you mean by the
"'has_object_file || find_cached_object' pattern"?

Jonathan Nieder Dec. 31, 2019, 1:03 a.m. UTC | #4

Jonathan Tan wrote:
> Jonathan Nieder wrote:

>> Hm, where does this dichotomy come from?  E.g. is the latter a
>> lower-level function used by the former?
>
> I don't know the reason for the dichotomy - perhaps it was an oversight.

Ah, that makes sense.

[...]
> This might be a moot point, but what do you mean by the
> "'has_object_file || find_cached_object' pattern"?

Oh!  Thanks, I should have been more precise.  And calling it a pattern
was probably overreaching --- I can only find one instance, in
pretend_object_file:

	if (has_object_file(oid) || find_cached_object(oid))
		return 0;

Since there's only one, I was on the wrong track.  (Except that with
this change, that can indeed change to

	if (has_object_file(oid))
		return 0;

Thanks,
Jonathan

diff --git a/object-store.h b/object-store.h
index 55ee639350..61b8b13e3b 100644
--- a/object-store.h
+++ b/object-store.h
@@ -292,8 +292,6 @@  struct object_info {
 #define OBJECT_INFO_LOOKUP_REPLACE 1
 /* Allow reading from a loose object file of unknown/bogus type */
 #define OBJECT_INFO_ALLOW_UNKNOWN_TYPE 2
-/* Do not check cached storage */
-#define OBJECT_INFO_SKIP_CACHED 4
 /* Do not retry packed storage after checking packed and loose storage */
 #define OBJECT_INFO_QUICK 8
 /* Do not check loose object */
diff --git a/sha1-file.c b/sha1-file.c
index 188de57634..03ae9ae93a 100644
--- a/sha1-file.c
+++ b/sha1-file.c
@@ -1417,6 +1417,7 @@  int oid_object_info_extended(struct repository *r, const struct object_id *oid,
 			     struct object_info *oi, unsigned flags)
 {
 	static struct object_info blank_oi = OBJECT_INFO_INIT;
+	struct cached_object *co;
 	struct pack_entry e;
 	int rtype;
 	const struct object_id *real = oid;
@@ -1431,24 +1432,22 @@  int oid_object_info_extended(struct repository *r, const struct object_id *oid,
 	if (!oi)
 		oi = &blank_oi;
 
-	if (!(flags & OBJECT_INFO_SKIP_CACHED)) {
-		struct cached_object *co = find_cached_object(real);
-		if (co) {
-			if (oi->typep)
-				*(oi->typep) = co->type;
-			if (oi->sizep)
-				*(oi->sizep) = co->size;
-			if (oi->disk_sizep)
-				*(oi->disk_sizep) = 0;
-			if (oi->delta_base_sha1)
-				hashclr(oi->delta_base_sha1);
-			if (oi->type_name)
-				strbuf_addstr(oi->type_name, type_name(co->type));
-			if (oi->contentp)
-				*oi->contentp = xmemdupz(co->buf, co->size);
-			oi->whence = OI_CACHED;
-			return 0;
-		}
+	co = find_cached_object(real);
+	if (co) {
+		if (oi->typep)
+			*(oi->typep) = co->type;
+		if (oi->sizep)
+			*(oi->sizep) = co->size;
+		if (oi->disk_sizep)
+			*(oi->disk_sizep) = 0;
+		if (oi->delta_base_sha1)
+			hashclr(oi->delta_base_sha1);
+		if (oi->type_name)
+			strbuf_addstr(oi->type_name, type_name(co->type));
+		if (oi->contentp)
+			*oi->contentp = xmemdupz(co->buf, co->size);
+		oi->whence = OI_CACHED;
+		return 0;
 	}
 
 	while (1) {
@@ -1932,8 +1931,7 @@  int repo_has_object_file_with_flags(struct repository *r,
 {
 	if (!startup_info->have_repository)
 		return 0;
-	return oid_object_info_extended(r, oid, NULL,
-					flags | OBJECT_INFO_SKIP_CACHED) >= 0;
+	return oid_object_info_extended(r, oid, NULL, flags) >= 0;
 }
 
 int repo_has_object_file(struct repository *r,

sha1-file: remove OBJECT_INFO_SKIP_CACHED

Commit Message

Comments

Patch