sha1-file: make pretend_object_file() not prefetch

Message ID	20200721225020.1352772-1-jonathantanmy@google.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=GLOD=BA=vger.kernel.org=git-owner@kernel.org> Date: Tue, 21 Jul 2020 15:50:20 -0700 Message-Id: <20200721225020.1352772-1-jonathantanmy@google.com> Mime-Version: 1.0 Subject: [PATCH] sha1-file: make pretend_object_file() not prefetch From: Jonathan Tan <jonathantanmy@google.com> To: git@vger.kernel.org Cc: Jonathan Tan <jonathantanmy@google.com> Content-Type: text/plain; charset="UTF-8" Sender: git-owner@vger.kernel.org Precedence: bulk
Series	sha1-file: make pretend_object_file() not prefetch \| expand sha1-file: make pretend_object_file() not prefetch

Message ID

20200721225020.1352772-1-jonathantanmy@google.com (mailing list archive)

State

New, archived

Headers

Date: Tue, 21 Jul 2020 15:50:20 -0700
Message-Id: <20200721225020.1352772-1-jonathantanmy@google.com>
Mime-Version: 1.0
Subject: [PATCH] sha1-file: make pretend_object_file() not prefetch
From: Jonathan Tan <jonathantanmy@google.com>
To: git@vger.kernel.org
Cc: Jonathan Tan <jonathantanmy@google.com>
Content-Type: text/plain; charset="UTF-8"
Sender: git-owner@vger.kernel.org
Precedence: bulk

Series

sha1-file: make pretend_object_file() not prefetch | expand

Commit Message

Jonathan Tan July 21, 2020, 10:50 p.m. UTC

When pretend_object_file() is invoked with an object that does not exist
(as is the typical case), there is no need to fetch anything from the
promisor remote, because the caller already knows what the object is
supposed to contain. Therefore, suppress the fetch. (The
OBJECT_INFO_QUICK flag is added for the same reason.)

This was noticed at $DAYJOB when "blame" was run on a file that had
uncommitted modifications.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
This is another instance of the issue Junio raised here [1]:

> Fixing issues hit by real users reactively is a necessary and good
> thing, but this is not the first time we patch callers of
> has_object_file() for this kind of "we are merely trying to
> determine the boundary of what we have, so that we know what we need
> to add to this repository" queries, I am afraid.
>
> Perhaps it is a good idea to sweep all the hits from "git grep -e
> has_object_file \*.c" and audit the codebase to see if there are
> other problematic ones?

To make progress towards solving this, I'm thinking of making a new
function has_object() that takes a repository, an object ID, and flags,
and only supports one flag: HAS_OBJECT_RECHECK_PACKED (which has the
opposite meaning of OBJECT_INFO_QUICK). Checks that should not fetch
should use has_object(), and checks that should fetch (if they exist - I
think that most would want additional information about the object, so
they would use oid_object_info() or similar already) should use
oid_object_info() or a similar function. That way we can see how many
uses of has_object_file() we have checked, and at the same time make
behavior we usually want into the default behavior. Any opinions?

[1] https://lore.kernel.org/git/xmqqpn8wie21.fsf@gitster.c.googlers.com/
---
 sha1-file.c      |  3 ++-
 t/t8002-blame.sh | 11 +++++++++++
 2 files changed, 13 insertions(+), 1 deletion(-)

Comments

Junio C Hamano July 21, 2020, 11:25 p.m. UTC | #1

Jonathan Tan <jonathantanmy@google.com> writes:

> To make progress towards solving this, I'm thinking of making a new
> function has_object() that takes a repository, an object ID, and flags,
> and only supports one flag: HAS_OBJECT_RECHECK_PACKED (which has the
> opposite meaning of OBJECT_INFO_QUICK). Checks that should not fetch
> should use has_object(), and checks that should fetch (if they exist - I
> think that most would want additional information about the object, so
> they would use oid_object_info() or similar already) should use
> oid_object_info() or a similar function. That way we can see how many
> uses of has_object_file() we have checked, and at the same time make
> behavior we usually want into the default behavior. Any opinions?

Yes, a new function (or a pair of functions) makes it easy to see
how much progress we made and also gives us easier greppability.

Junio C Hamano July 21, 2020, 11:27 p.m. UTC | #2

Jonathan Tan <jonathantanmy@google.com> writes:

> When pretend_object_file() is invoked with an object that does not exist
> (as is the typical case), there is no need to fetch anything from the
> promisor remote, because the caller already knows what the object is
> supposed to contain. Therefore, suppress the fetch. (The
> OBJECT_INFO_QUICK flag is added for the same reason.)

Yes, "pretend" is also a way to lie about the contents IIRC, so even
if the object is available elsewhere, we should *not* fetch from the
promisor.  Makes sense to me.

Jeff King July 23, 2020, 5:47 p.m. UTC | #3

On Tue, Jul 21, 2020 at 04:27:15PM -0700, Junio C Hamano wrote:

> Jonathan Tan <jonathantanmy@google.com> writes:
> 
> > When pretend_object_file() is invoked with an object that does not exist
> > (as is the typical case), there is no need to fetch anything from the
> > promisor remote, because the caller already knows what the object is
> > supposed to contain. Therefore, suppress the fetch. (The
> > OBJECT_INFO_QUICK flag is added for the same reason.)
> 
> Yes, "pretend" is also a way to lie about the contents IIRC, so even
> if the object is available elsewhere, we should *not* fetch from the
> promisor.  Makes sense to me.

I agree this patch is fine, but I wonder if it could go even further. If
we are pretending some particular contents, shouldn't we override
anything that might be in the object database? I.e., could we eliminate
this has_object_file() entirely?

That should be OK for the same reason that it's OK to use QUICK.

There's only one caller of this function (git-blame), which I think
would be happy with such a change.

-Peff

Junio C Hamano July 23, 2020, 6:06 p.m. UTC | #4

Jeff King <peff@peff.net> writes:

> I agree this patch is fine, but I wonder if it could go even further. If
> we are pretending some particular contents, shouldn't we override
> anything that might be in the object database? I.e., could we eliminate
> this has_object_file() entirely?
>
> That should be OK for the same reason that it's OK to use QUICK.
>
> There's only one caller of this function (git-blame), which I think
> would be happy with such a change.

I actually have to take that "we could even lie to say that content
that does not hash to X is object X" back---that was never the
intention of this mechanism.

It was to ensure that operations that are supposedly read-only can
avoid writing into the repository---"blame" uses it to pretend as if
the working tree file already has a corresponding object in the
object store, IIRC.

The only reason why hash_object_file() is used here is to allow us
discarding the memory held for that working tree copy if it happens
to match what is stored in the object database.  The saving would be
within a few hundred kilobytes to a single digit megabyte range at
most, hopefully, so we could drop it (oh, saying "a single digit
megabyte" reminds me that my first Linux computer was 486dx with
4MB ram---on that box, it may have mattered).

diff --git a/sha1-file.c b/sha1-file.c
index ccd34dd9e8..45fdb8415c 100644
--- a/sha1-file.c
+++ b/sha1-file.c
@@ -1600,7 +1600,8 @@  int pretend_object_file(void *buf, unsigned long len, enum object_type type,
 	struct cached_object *co;
 
 	hash_object_file(the_hash_algo, buf, len, type_name(type), oid);
-	if (has_object_file(oid) || find_cached_object(oid))
+	if (has_object_file_with_flags(oid, OBJECT_INFO_QUICK | OBJECT_INFO_SKIP_FETCH_OBJECT) ||
+	    find_cached_object(oid))
 		return 0;
 	ALLOC_GROW(cached_objects, cached_object_nr + 1, cached_object_alloc);
 	co = &cached_objects[cached_object_nr++];
diff --git a/t/t8002-blame.sh b/t/t8002-blame.sh
index eea048e52c..2ed6aaae35 100755
--- a/t/t8002-blame.sh
+++ b/t/t8002-blame.sh
@@ -122,4 +122,15 @@  test_expect_success '--exclude-promisor-objects does not BUG-crash' '
 	test_must_fail git blame --exclude-promisor-objects one
 '
 
+test_expect_success 'blame with uncommitted edits in partial clone does not crash' '
+	git init server &&
+	echo foo >server/file.txt &&
+	git -C server add file.txt &&
+	git -C server commit -m file &&
+
+	git clone --filter=blob:none "file://$(pwd)/server" client &&
+	echo bar >>client/file.txt &&
+	git -C client blame file.txt
+'
+
 test_done

sha1-file: make pretend_object_file() not prefetch

Commit Message

Comments

Patch