apply: do not fetch when checking object existence

Message ID	20200728010403.95142-1-jonathantanmy@google.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=DRt7=BH=vger.kernel.org=git-owner@kernel.org> Date: Mon, 27 Jul 2020 18:04:03 -0700 Message-Id: <20200728010403.95142-1-jonathantanmy@google.com> Mime-Version: 1.0 Subject: [PATCH] apply: do not fetch when checking object existence From: Jonathan Tan <jonathantanmy@google.com> To: git@vger.kernel.org Cc: Jonathan Tan <jonathantanmy@google.com> Content-Type: text/plain; charset="UTF-8" Sender: git-owner@vger.kernel.org Precedence: bulk
Series	apply: do not fetch when checking object existence \| expand apply: do not fetch when checking object existence

Message ID

20200728010403.95142-1-jonathantanmy@google.com (mailing list archive)

State

New, archived

Headers

Date: Mon, 27 Jul 2020 18:04:03 -0700
Message-Id: <20200728010403.95142-1-jonathantanmy@google.com>
Mime-Version: 1.0
Subject: [PATCH] apply: do not fetch when checking object existence
From: Jonathan Tan <jonathantanmy@google.com>
To: git@vger.kernel.org
Cc: Jonathan Tan <jonathantanmy@google.com>
Content-Type: text/plain; charset="UTF-8"
Sender: git-owner@vger.kernel.org
Precedence: bulk

Series

apply: do not fetch when checking object existence | expand

Commit Message

Jonathan Tan July 28, 2020, 1:04 a.m. UTC

There have been a few bugs wherein Git fetches missing objects whenever
the existence of an object is checked, even though it does not need to
perform such a fetch. To resolve these bugs, we could look at all the
places that has_object_file() (or a similar function) is used. As a
first step, introduce a new function has_object() that checks for the
existence of an object, with a default behavior of not fetching if the
object is missing and the repository is a partial clone. As we verify
each has_object_file() (or similar) usage, we can replace it with
has_object(), and we will know that we are done when we can delete
has_object_file() (and the other similar functions).

Also, the new function has_object() has more appropriate defaults:
besides not fetching, it also does not recheck packed storage.

Then, use this new function to fix a bug in apply.c.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
I mentioned the idea for this change here:
https://lore.kernel.org/git/20200721225020.1352772-1-jonathantanmy@google.com/

I included the same have_repository check introduced in 3e8b7d3c77
("has_sha1_file: don't bother if we are not in a repository",
2017-04-13), but I couldn't verify if it is still needed. In particular,
even at that commit, all the tests pass (after I make a small change
to an irrelevant test about the order of entries in a cookie file).
---
 apply.c        |  2 +-
 object-store.h | 25 +++++++++++++++++++++++--
 sha1-file.c    | 12 ++++++++++++
 t/t4150-am.sh  | 16 ++++++++++++++++
 4 files changed, 52 insertions(+), 3 deletions(-)

Comments

Junio C Hamano July 28, 2020, 1:19 a.m. UTC | #1

Jonathan Tan <jonathantanmy@google.com> writes:

> There have been a few bugs wherein Git fetches missing objects whenever
> the existence of an object is checked, even though it does not need to
> perform such a fetch. To resolve these bugs, we could look at all the
> places that has_object_file() (or a similar function) is used. As a
> first step, introduce a new function has_object() that checks for the
> existence of an object, with a default behavior of not fetching if the
> object is missing and the repository is a partial clone. As we verify
> each has_object_file() (or similar) usage, we can replace it with
> has_object(), and we will know that we are done when we can delete
> has_object_file() (and the other similar functions).

I wonder if we want to name the two (i.e. one variant that refuses
to go to network because it is trying to see if a lazy fetch is
needed, and the other that goes to network behind caller's back for
ease of use in a lazy clone) a bit more distinctly so that which one
could potentially go outside.

Depending on one's view which one is _normal_ access pattern, giving
an explicit adverb to one variant while leaving the other one bland
might be sufficient.  For example, I _think_ most of the places do
not want to handle the details of lazily fetching themselves, and I
suspect that the traditional has_object_file() semantics without "do
not trigger lazy fetch" option would be the normal access pattern.

In which case, renaming your new "has_object" to something like
"has_object_locally()" would be a good name for a special case
codepath that wants to care---if the object does not exist locally
and needs to be obtained lazily from elsewhere, the function would
say "no".

And all the other names like has_object_file() that by default gives
callers a transparent access to lazily fetched objects can stay the
same.

> I mentioned the idea for this change here:
> https://lore.kernel.org/git/20200721225020.1352772-1-jonathantanmy@google.com/

Yup, I think that is going in a good direction.  I suspect that
apply will not be the only remaining case we need to "fix", and
using the new helper function, codepaths that have already been
"fixed" by passing "do not lazily fetch" option to the traditional
API functions would become easier to read.  And if that is the case,
let's have the introduction of the helper function as a separate
patch, with each of [PATCH 2-N/N] be a fix for separate codepaths.

Thanks.

Jonathan Tan July 28, 2020, 6:23 p.m. UTC | #2

> Jonathan Tan <jonathantanmy@google.com> writes:
> 
> > There have been a few bugs wherein Git fetches missing objects whenever
> > the existence of an object is checked, even though it does not need to
> > perform such a fetch. To resolve these bugs, we could look at all the
> > places that has_object_file() (or a similar function) is used. As a
> > first step, introduce a new function has_object() that checks for the
> > existence of an object, with a default behavior of not fetching if the
> > object is missing and the repository is a partial clone. As we verify
> > each has_object_file() (or similar) usage, we can replace it with
> > has_object(), and we will know that we are done when we can delete
> > has_object_file() (and the other similar functions).
> 
> I wonder if we want to name the two (i.e. one variant that refuses
> to go to network because it is trying to see if a lazy fetch is
> needed, and the other that goes to network behind caller's back for
> ease of use in a lazy clone) a bit more distinctly so that which one
> could potentially go outside.
> 
> Depending on one's view which one is _normal_ access pattern, giving
> an explicit adverb to one variant while leaving the other one bland
> might be sufficient.  For example, I _think_ most of the places do
> not want to handle the details of lazily fetching themselves, and I
> suspect that the traditional has_object_file() semantics without "do
> not trigger lazy fetch" option would be the normal access pattern.

Right now, I think that most (if not all) places would not want to fetch
at all - so *with* "do not trigger lazy fetch" would be the normal
access pattern. This is because (in my opinion) if a caller checks the
existence of an object, it most likely can tolerate the object's
absence; if the caller couldn't tolerate it, it would just directly
query for its type or contents or something like that.

I tried to communicate this in my documentation of the deprecated
functions/macros, but perhaps it could be written better.

(One other option to consider is to just change has_object_file() to
never fetch, although I think this is more risky.)

> In which case, renaming your new "has_object" to something like
> "has_object_locally()" would be a good name for a special case
> codepath that wants to care---if the object does not exist locally
> and needs to be obtained lazily from elsewhere, the function would
> say "no".
> 
> And all the other names like has_object_file() that by default gives
> callers a transparent access to lazily fetched objects can stay the
> same.

If my analysis above is wrong, then yes I agree that we should do this.
But we might need to find another way to indicate which has_object_file()
has been checked and which hasn't - changing away from has_object_file()
completely gives us a way to indicate this, but if we're sticking with
has_object_file(), we have to find another way of indicating that we've
looked at this call and it is OK.

> > I mentioned the idea for this change here:
> > https://lore.kernel.org/git/20200721225020.1352772-1-jonathantanmy@google.com/
> 
> Yup, I think that is going in a good direction.  I suspect that
> apply will not be the only remaining case we need to "fix", and
> using the new helper function, codepaths that have already been
> "fixed" by passing "do not lazily fetch" option to the traditional
> API functions would become easier to read.  And if that is the case,
> let's have the introduction of the helper function as a separate
> patch, with each of [PATCH 2-N/N] be a fix for separate codepaths.
> 
> Thanks.

OK - I'll separate out the helper function into its own patch in version
2.

diff --git a/apply.c b/apply.c
index 8bff604dbe..402d80602a 100644
--- a/apply.c
+++ b/apply.c
@@ -3178,7 +3178,7 @@  static int apply_binary(struct apply_state *state,
 		return 0; /* deletion patch */
 	}
 
-	if (has_object_file(&oid)) {
+	if (has_object(the_repository, &oid, 0)) {
 		/* We already have the postimage */
 		enum object_type type;
 		unsigned long size;
diff --git a/object-store.h b/object-store.h
index f439d47af8..c4fc9dd74e 100644
--- a/object-store.h
+++ b/object-store.h
@@ -239,12 +239,33 @@  int read_loose_object(const char *path,
 		      unsigned long *size,
 		      void **contents);
 
+/* Retry packed storage after checking packed and loose storage */
+#define HAS_OBJECT_RECHECK_PACKED 1
+
+/*
+ * Returns 1 if the object exists. This function will not lazily fetch objects
+ * in a partial clone.
+ */
+int has_object(struct repository *r, const struct object_id *oid,
+	       unsigned flags);
+
+/*
+ * These macros and functions are deprecated. If checking existence for an
+ * object that is likely to be missing and/or whose absence is relatively
+ * inconsequential (or is consequential but the caller is prepared to handle
+ * it), use has_object(), which has better defaults (no lazy fetch in a partial
+ * clone and no rechecking of packed storage). In the unlikely event that a
+ * caller needs to assert existence of an object that it fully expects to
+ * exist, and wants to trigger a lazy fetch in a partial clone, use
+ * oid_object_info_extended() with a NULL struct object_info.
+ *
+ * These functions can be removed once all callers have migrated to
+ * has_object() and/or oid_object_info_extended().
+ */
 #ifndef NO_THE_REPOSITORY_COMPATIBILITY_MACROS
 #define has_sha1_file_with_flags(sha1, flags) repo_has_sha1_file_with_flags(the_repository, sha1, flags)
 #define has_sha1_file(sha1) repo_has_sha1_file(the_repository, sha1)
 #endif
-
-/* Same as the above, except for struct object_id. */
 int repo_has_object_file(struct repository *r, const struct object_id *oid);
 int repo_has_object_file_with_flags(struct repository *r,
 				    const struct object_id *oid, int flags);
diff --git a/sha1-file.c b/sha1-file.c
index ccd34dd9e8..ff444d7abb 100644
--- a/sha1-file.c
+++ b/sha1-file.c
@@ -1988,6 +1988,18 @@  int force_object_loose(const struct object_id *oid, time_t mtime)
 	return ret;
 }
 
+int has_object(struct repository *r, const struct object_id *oid,
+	       unsigned flags)
+{
+	int quick = !(flags & HAS_OBJECT_RECHECK_PACKED);
+	unsigned object_info_flags = OBJECT_INFO_SKIP_FETCH_OBJECT |
+		(quick ? OBJECT_INFO_QUICK : 0);
+
+	if (!startup_info->have_repository)
+		return 0;
+	return oid_object_info_extended(r, oid, NULL, object_info_flags) >= 0;
+}
+
 int repo_has_object_file_with_flags(struct repository *r,
 				    const struct object_id *oid, int flags)
 {
diff --git a/t/t4150-am.sh b/t/t4150-am.sh
index bda4586a79..94a2c76522 100755
--- a/t/t4150-am.sh
+++ b/t/t4150-am.sh
@@ -1133,4 +1133,20 @@  test_expect_success 'am and .gitattibutes' '
 	)
 '
 
+test_expect_success 'apply binary blob in partial clone' '
+	printf "\\000" >binary &&
+	git add binary &&
+	git commit -m "binary blob" &&
+	git format-patch --stdout -m HEAD^ >patch &&
+
+	test_create_repo server &&
+	test_config -C server uploadpack.allowfilter 1 &&
+	test_config -C server uploadpack.allowanysha1inwant 1 &&
+	git clone --filter=blob:none "file://$(pwd)/server" client &&
+	test_when_finished "rm -rf client" &&
+
+	# Exercise to make sure that it works
+	git -C client am ../patch
+'
+
 test_done

apply: do not fetch when checking object existence

Commit Message

Comments

Patch