[v2] sha1-file: remove OBJECT_INFO_SKIP_CACHED

Message ID	20200102201630.180969-1-jonathantanmy@google.com (mailing list archive)
State	New, archived
Headers	Return-Path: <SRS0=b7P8=2X=vger.kernel.org=git-owner@kernel.org>
	Date: Thu, 2 Jan 2020 12:16:30 -0800
	In-Reply-To: <20191230211027.37002-1-jonathantanmy@google.com>
	Message-Id: <20200102201630.180969-1-jonathantanmy@google.com>
	Mime-Version: 1.0
	References: <20191230211027.37002-1-jonathantanmy@google.com>
	Subject: [PATCH v2] sha1-file: remove OBJECT_INFO_SKIP_CACHED
	From: Jonathan Tan <jonathantanmy@google.com>
	To: git@vger.kernel.org
	Cc: Jonathan Tan <jonathantanmy@google.com>, gitster@pobox.com, jrnieder@gmail.com
	Content-Type: text/plain; charset="UTF-8"
	Sender: git-owner@vger.kernel.org
	Precedence: bulk
Series	[v2] sha1-file: remove OBJECT_INFO_SKIP_CACHED

Jonathan Tan Jan. 2, 2020, 8:16 p.m. UTC

In a partial clone, if a user provides the hash of the empty tree ("git
mktree </dev/null" - for SHA-1, this is 4b825d...) to a command which
requires that that object be parsed, for example:

  git diff-tree 4b825d <a non-empty tree>

then Git will lazily fetch the empty tree, unnecessarily, because
parsing of that object invokes repo_has_object_file(), which does not
special-case the empty tree.

Instead, teach repo_has_object_file() to consult find_cached_object()
(which handles the empty tree), thus bringing it in line with the rest
of the object-store-accessing functions. A cost is that
repo_has_object_file() will now need to oideq upon each invocation, but
that is trivial compared to the filesystem lookup or the pack index
search required anyway. (And if find_cached_object() needs to do more
because of previous invocations to pretend_object_file(), all the more
reason to be consistent in whether we present cached objects.)
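
The lookup order described above can be illustrated with a toy model. This
is only a sketch, not the patch itself: every toy_* name is hypothetical,
object ids are plain hex strings, and the "disk" is a stub that never finds
anything (as in a fresh partial clone). The only fact taken from Git is the
SHA-1 of the empty tree.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-ins; none of these are Git's real types or functions. */
#define TOY_HEXSZ 40
#define EMPTY_TREE_HEX "4b825dc642cb6eb9a060e54bf8d69288fbee4904"

struct toy_cached_object {
	const char *hex;
	const char *type;
};

/* The empty tree is always "cached", even if it was never written. */
static const struct toy_cached_object toy_empty_tree = {
	EMPTY_TREE_HEX, "tree"
};

/* Counts simulated filesystem or pack-index lookups. */
static int toy_disk_lookups;

static const struct toy_cached_object *toy_find_cached_object(const char *hex)
{
	if (!strcmp(hex, toy_empty_tree.hex))
		return &toy_empty_tree;
	return NULL;
}

/* Simulated on-disk store: always empty, but each call is counted. */
static int toy_disk_has(const char *hex)
{
	(void)hex;
	toy_disk_lookups++;
	return 0;
}

/* Consult the cached objects first; only then pay for the disk lookup. */
static int toy_has_object_file(const char *hex)
{
	assert(strlen(hex) == TOY_HEXSZ);
	if (toy_find_cached_object(hex))
		return 1;
	return toy_disk_has(hex);
}
```

In this model, asking for the empty tree returns 1 without touching the
simulated disk, which is the analogue of not triggering a lazy fetch; any
other id falls through to the (counted) disk lookup.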

As a historical note, the function now known as repo_read_object_file()
was taught the empty tree in 346245a1bb ("hard-code the empty tree
object", 2008-02-13), and the function now known as oid_object_info()
was taught the empty tree in c4d9986f5f ("sha1_object_info: examine
cached_object store too", 2011-02-07). repo_has_object_file() was never
updated, perhaps due to oversight. The flag OBJECT_INFO_SKIP_CACHED,
introduced later in dfdd4afcf9 ("sha1_file: teach
sha1_object_info_extended more flags", 2017-06-26) and used in
e83e71c5e1 ("sha1_file: refactor has_sha1_file_with_flags", 2017-06-26),
was introduced to preserve this difference in empty-tree handling, but
now it can be removed.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
Forgot to add v2 to the other email, so resending it with the correct
email subject.

Difference from v1: updated commit message in response to Jonathan
Nieder's feedback. Hopefully I didn't remove too much.
---
 object-store.h |  2 --
 sha1-file.c    | 38 ++++++++++++++++++--------------------
 2 files changed, 18 insertions(+), 22 deletions(-)

Junio C Hamano Jan. 2, 2020, 9:41 p.m. UTC | #1

Jonathan Tan <jonathantanmy@google.com> writes:

> As a historical note, the function now known as repo_read_object_file()
> was taught the empty tree in 346245a1bb ("hard-code the empty tree
> object", 2008-02-13), and the function now known as oid_object_info()
> was taught the empty tree in c4d9986f5f ("sha1_object_info: examine
> cached_object store too", 2011-02-07). repo_has_object_file() was never
> updated, perhaps due to oversight. The flag OBJECT_INFO_SKIP_CACHED,
> introduced later in dfdd4afcf9 ("sha1_file: teach
> sha1_object_info_extended more flags", 2017-06-26) and used in
> e83e71c5e1 ("sha1_file: refactor has_sha1_file_with_flags", 2017-06-26),
> was introduced to preserve this difference in empty-tree handling, but
> now it can be removed.

I am not 100% sure if the implication of this change is safe to
allow us to say "now it can be".

The has_object_file() helper wanted to say "no" when given a
non-existing object registered via pretend_object_file(),
presumably because we wanted to allow a use pattern like:

 - prepare an in-core representation of an object we tentatively
   expect, but are not absolutely sure, to be necessary.

 - perform operations, using the object data obtained via
   read_object() API, which is capable of yielding data even for
   such "pretend" objects (perhaps we are creating tentative merge
   parents during a recursive merge).

 - write out final set of objects by enumerating those that do not
   really exist yet (via has_object_file() API).

Teaching about the empty tree to has_object_file() is a good thing
(especially because we do not necessarily write an empty tree object
to our repositories), but as collateral damage of doing so, we
make such a use pattern impossible.

It is not a large loss---the third bullet in the above list can just
be made to unconditionally call write_object_file() without
filtering with has_object_file() and write_object_file() will apply
the right optimization anyway, so it probably is OK.
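
A toy model of the two write-out strategies may make the trade-off concrete.
This is a sketch under stated assumptions: the toy_* names and the
string-keyed store are hypothetical, the existence check models the behaviour
after this patch (in-memory objects count as present), and the writer simply
skips objects that are already persisted.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX 8

/* Toy object store; ids are plain strings, not real object ids. */
struct toy_store {
	const char *disk[TOY_MAX];    /* persisted objects */
	int nr_disk;
	const char *pretend[TOY_MAX]; /* in-memory only ("pretend") objects */
	int nr_pretend;
};

static int toy_on_disk(struct toy_store *s, const char *id)
{
	for (int i = 0; i < s->nr_disk; i++)
		if (!strcmp(s->disk[i], id))
			return 1;
	return 0;
}

static void toy_pretend(struct toy_store *s, const char *id)
{
	assert(s->nr_pretend < TOY_MAX);
	s->pretend[s->nr_pretend++] = id;
}

/* Modelled post-patch behaviour: in-memory objects count as present. */
static int toy_has_object(struct toy_store *s, const char *id)
{
	for (int i = 0; i < s->nr_pretend; i++)
		if (!strcmp(s->pretend[i], id))
			return 1;
	return toy_on_disk(s, id);
}

/* Returns 1 if a real write happened, 0 if already persisted. */
static int toy_write_object(struct toy_store *s, const char *id)
{
	if (toy_on_disk(s, id))
		return 0;
	assert(s->nr_disk < TOY_MAX);
	s->disk[s->nr_disk++] = id;
	return 1;
}

/* Third bullet as described: filter with the existence check first. */
static int toy_flush_filtered(struct toy_store *s, const char **ids, int nr)
{
	int written = 0;
	for (int i = 0; i < nr; i++)
		if (!toy_has_object(s, ids[i]))
			written += toy_write_object(s, ids[i]);
	return written;
}

/* Alternative: write everything and let the writer skip duplicates. */
static int toy_flush_unconditional(struct toy_store *s, const char **ids, int nr)
{
	int written = 0;
	for (int i = 0; i < nr; i++)
		written += toy_write_object(s, ids[i]);
	return written;
}
```

With one object already persisted and one only pretended, the filtered flush
writes nothing (the pretended object looks present), while the unconditional
flush persists exactly the one object that was missing from disk.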

Will queue.

Thanks.

Jonathan Nieder Jan. 4, 2020, 12:13 a.m. UTC | #2

Jonathan Tan wrote:

> In a partial clone, if a user provides the hash of the empty tree ("git
> mktree </dev/null" - for SHA-1, this is 4b825d...) to a command which
> requires that that object be parsed, for example:
>
>   git diff-tree 4b825d <a non-empty tree>
>
> then Git will lazily fetch the empty tree, unnecessarily, because
> parsing of that object invokes repo_has_object_file(), which does not
> special-case the empty tree.
>
> Instead, teach repo_has_object_file() to consult find_cached_object()
> (which handles the empty tree), thus bringing it in line with the rest
> of the object-store-accessing functions. A cost is

Lovely, thank you.

> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
>  object-store.h |  2 --
>  sha1-file.c    | 38 ++++++++++++++++++--------------------
>  2 files changed, 18 insertions(+), 22 deletions(-)

Reviewed-by: Jonathan Nieder <jrnieder@gmail.com>

To follow up on Junio's hint in his review: callers can inject
additional cached objects by using pretend_object_file.  Junio
described how this would make sense as a mechanism for building
the virtual ancestor object, but we don't do that.  In fact, the
only caller is fake_working_tree_commit in "git blame", a read-only
code path. *phew*

-- >8 --
Subject: sha1-file: document how to use pretend_object_file

Like in-memory alternates, pretend_object_file contains a trap for the
unwary: careless callers can use it to create references to an object
that does not exist in the on-disk object store.

Add a comment documenting how to use the function without risking such
problems.

The only current caller is blame, which uses pretend_object_file to
create an in-memory commit representing the working tree state.
Noticed during a discussion of how to safely use this function in
operations like "git merge" which, unlike blame, are not read-only.

Inspired-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
---
 object-store.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/object-store.h b/object-store.h
index 55ee639350..d0fc7b091b 100644
--- a/object-store.h
+++ b/object-store.h
@@ -208,6 +208,14 @@ int hash_object_file_literally(const void *buf, unsigned long len,
 			       const char *type, struct object_id *oid,
 			       unsigned flags);

+/*
+ * Add an object file to the in-memory object store, without writing it
+ * to disk.
+ *
+ * Callers are responsible for calling write_object_file to record the
+ * object in persistent storage before writing any other new objects
+ * that reference it.
+ */
 int pretend_object_file(void *, unsigned long, enum object_type,
 			struct object_id *oid);

Jeff King Jan. 6, 2020, 9:14 p.m. UTC | #3

On Thu, Jan 02, 2020 at 01:41:27PM -0800, Junio C Hamano wrote:

> Jonathan Tan <jonathantanmy@google.com> writes:
> 
> > As a historical note, the function now known as repo_read_object_file()
> > was taught the empty tree in 346245a1bb ("hard-code the empty tree
> > object", 2008-02-13), and the function now known as oid_object_info()
> > was taught the empty tree in c4d9986f5f ("sha1_object_info: examine
> > cached_object store too", 2011-02-07). repo_has_object_file() was never
> > updated, perhaps due to oversight. The flag OBJECT_INFO_SKIP_CACHED,
> > introduced later in dfdd4afcf9 ("sha1_file: teach
> > sha1_object_info_extended more flags", 2017-06-26) and used in
> > e83e71c5e1 ("sha1_file: refactor has_sha1_file_with_flags", 2017-06-26),
> > was introduced to preserve this difference in empty-tree handling, but
> > now it can be removed.
> 
> I am not 100% sure if the implication of this change is safe to
> allow us to say "now it can be".
> 
> The has_object_file() helper wanted to say "no" when given a
> non-existing object registered via pretend_object_file(),
> presumably because we wanted to allow a use pattern like:
> 
>  - prepare an in-core representation of an object we tentatively
>    expect, but are not absolutely sure, to be necessary.
> 
>  - perform operations, using the object data obtained via
>    read_object() API, which is capable of yielding data even for
>    such "pretend" objects (perhaps we are creating tentative merge
>    parents during a recursive merge).
> 
>  - write out final set of objects by enumerating those that do not
>    really exist yet (via has_object_file() API).
> 
> Teaching about the empty tree to has_object_file() is a good thing
> (especially because we do not necessarily write an empty tree object
> to our repositories), but as collateral damage of doing so, we
> make such a use pattern impossible.
> 
> It is not a large loss---the third bullet in the above list can just
> be made to unconditionally call write_object_file() without
> filtering with has_object_file() and write_object_file() will apply
> the right optimization anyway, so it probably is OK.

I agree that whoever called pretend_object_file() can be careful and
write out the final set of objects itself via write_object_file(). But
I'd worry a bit about a caller who doesn't necessarily realize that they
need to do that. E.g., imagine we call pretend_object_file() for some
blob oid, expecting it to be read-only. And then in the same process,
some other bit of the code writes out a tree that mentions that blob.
Oops, that tree is now corrupt after we exit the process. And IMHO
neither the pretend-caller nor the tree-writer are to blame; the problem
is that they shared global state they were not expecting.
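
The scenario above can be reproduced in a toy model. This is a sketch, not
Git code: all toy_* names are hypothetical, a "tree" references exactly one
blob, and the arrays standing in for the in-memory and on-disk stores are
deliberately process-wide globals to mirror the shared state.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX 8

struct toy_tree {
	const char *id;
	const char *blob_id; /* the single blob this toy tree references */
};

/* Process-wide state, deliberately global to mirror the hazard. */
static const char *toy_pretend_blobs[TOY_MAX];
static int toy_nr_pretend;
static const char *toy_disk_blobs[TOY_MAX];
static int toy_nr_disk_blobs;
static struct toy_tree toy_disk_trees[TOY_MAX];
static int toy_nr_disk_trees;

static int toy_find(const char **ids, int nr, const char *id)
{
	for (int i = 0; i < nr; i++)
		if (!strcmp(ids[i], id))
			return 1;
	return 0;
}

/* Caller A: registers a blob in memory only, expecting read-only use. */
static void toy_pretend_blob(const char *id)
{
	assert(toy_nr_pretend < TOY_MAX);
	toy_pretend_blobs[toy_nr_pretend++] = id;
}

/* Persists a blob for real. */
static void toy_write_blob(const char *id)
{
	assert(toy_nr_disk_blobs < TOY_MAX);
	toy_disk_blobs[toy_nr_disk_blobs++] = id;
}

/* Existence check that also consults the in-memory objects. */
static int toy_has_blob(const char *id)
{
	return toy_find(toy_pretend_blobs, toy_nr_pretend, id) ||
	       toy_find(toy_disk_blobs, toy_nr_disk_blobs, id);
}

/* Caller B: unrelated code that persists a tree if its blob "exists". */
static int toy_write_tree(const char *id, const char *blob_id)
{
	if (!toy_has_blob(blob_id))
		return -1;
	assert(toy_nr_disk_trees < TOY_MAX);
	toy_disk_trees[toy_nr_disk_trees].id = id;
	toy_disk_trees[toy_nr_disk_trees].blob_id = blob_id;
	toy_nr_disk_trees++;
	return 0;
}

/* What survives process exit: persisted trees whose blob is missing. */
static int toy_count_dangling_trees(void)
{
	int dangling = 0;
	for (int i = 0; i < toy_nr_disk_trees; i++)
		if (!toy_find(toy_disk_blobs, toy_nr_disk_blobs,
			      toy_disk_trees[i].blob_id))
			dangling++;
	return dangling;
}
```

After toy_pretend_blob("blob-1"), a later toy_write_tree("tree-1", "blob-1")
succeeds, yet the persisted tree points at a blob that exists only in memory.
Neither caller did anything wrong in isolation; persisting the blob with
toy_write_blob() is what repairs the reference.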

This is pretty far-fetched given that the only user of
pretend_object_file() is in git-blame right now. But it does give me
pause. Overall, though, I'm more inclined to say that we should be
dropping SKIP_CACHED here and considering pretend_object_file() to be a
bit dangerous (i.e., to keep it in mind if somebody proposes more
calls).

Another point of reference (in favor of Jonathan's patch):

  https://lore.kernel.org/git/20190304174053.GA27497@sigill.intra.peff.net/

is a bug that would not have happened if this patch had been applied
(there's also some discussion of the greater issue, but nothing that wasn't
already brought up here, I think).

-Peff

Jeff King Jan. 6, 2020, 9:17 p.m. UTC | #4

On Fri, Jan 03, 2020 at 04:13:31PM -0800, Jonathan Nieder wrote:

> To follow up on Junio's hint in his review: callers can inject
> additional cached objects by using pretend_object_file.  Junio
> described how this would make sense as a mechanism for building
> the virtual ancestor object, but we don't do that.  In fact, the
> only caller is fake_working_tree_commit in "git blame", a read-only
> code path. *phew*
> 
> -- >8 --
> Subject: sha1-file: document how to use pretend_object_file
> [...]
> +/*
> + * Add an object file to the in-memory object store, without writing it
> + * to disk.
> + *
> + * Callers are responsible for calling write_object_file to record the
> + * object in persistent storage before writing any other new objects
> + * that reference it.
> + */
>  int pretend_object_file(void *, unsigned long, enum object_type,
>  			struct object_id *oid);
>  

I think this is an improvement over the status quo, but it's still a
potential trap for code which happens to run in the same process (see my
other email in the thread).

Should the message perhaps be even more scary?

-Peff

Jonathan Nieder Jan. 6, 2020, 11:47 p.m. UTC | #5

Hi,

Jeff King wrote:
> On Fri, Jan 03, 2020 at 04:13:31PM -0800, Jonathan Nieder wrote:

>> To follow up on Junio's hint in his review: callers can inject
>> additional cached objects by using pretend_object_file.  Junio
>> described how this would make sense as a mechanism for building
>> the virtual ancestor object, but we don't do that.  In fact, the
>> only caller is fake_working_tree_commit in "git blame", a read-only
>> code path. *phew*
>>
>> -- >8 --
>> Subject: sha1-file: document how to use pretend_object_file
>> [...]
>> +/*
>> + * Add an object file to the in-memory object store, without writing it
>> + * to disk.
>> + *
>> + * Callers are responsible for calling write_object_file to record the
>> + * object in persistent storage before writing any other new objects
>> + * that reference it.
>> + */
>>  int pretend_object_file(void *, unsigned long, enum object_type,
>>  			struct object_id *oid);
>
> I think this is an improvement over the status quo, but it's still a
> potential trap for code which happens to run in the same process (see my
> other email in the thread).
>
> Should the message perhaps be even more scary?

A pet peeve of mine is warning volume escalation: if it becomes common
for us to say

 * Warning: callers are responsible for [...]

then new warnings trying to stand out might say

 * WARNING: callers are responsible for [...]

and then after we are desensitized to that, we may switch to

 * WARNING WARNING WARNING, not the usual blah-blah: callers are

and so on.  The main way I have found to counteract that is to make
the "dangerous curve" markers context-specific enough that people
don't learn to ignore them.  After all, sometimes a concurrency
warning is important to me, at other times warnings about clarity may
be what attract my interest, and so on.

I don't have a good suggestion here.  Perhaps "Callers are responsible
for" is too slow and something more terse would help?

 /*
  * Adds an object to the in-memory object store, without writing it
  * to disk.
  *
  * Use with caution!  Unless you call write_object_file to record the
  * in-memory object to persistent storage, any other new objects that
  * reference it will point to a missing (in memory only) object,
  * resulting in a corrupt repository.
  */

It would be even better if we have some automated way to catch this
kind of issue.  Should tests run "git fsck" after each test?  Should
write_object_file have a paranoid mode that checks integrity?

I don't know an efficient way to do that.  Ultimately I am comfortable
counting on reviewers to be aware of this kind of pitfall.  While
nonlocal invariants are always hard to maintain, this pitfall is
inherent in the semantics of the function, so I am not too worried
that reviewers will overlook it.

A less error-prone interface would tie the result of
pretend_object_file to a short-lived overlay on the_repository without
affecting global state.  We could even enforce read-only access in
that overlay.  I don't think the "struct repository" interface and
callers are ready for that yet, though.
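
One possible shape for such an overlay, as a toy sketch only: the toy_*
types and functions are hypothetical and do not correspond to anything in
"struct repository" today. The overlay holds the pretend objects and a const
pointer to its base, readers that want pretend objects take the overlay
explicitly, and writers take only the base repository.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX 8

struct toy_repo {
	const char *disk[TOY_MAX]; /* persisted object ids */
	int nr_disk;
};

/* Short-lived overlay: extra in-memory objects over a read-only base. */
struct toy_overlay {
	const struct toy_repo *base; /* const: no writes via the overlay */
	const char *extra[TOY_MAX];
	int nr_extra;
};

static int toy_repo_has(const struct toy_repo *r, const char *id)
{
	for (int i = 0; i < r->nr_disk; i++)
		if (!strcmp(r->disk[i], id))
			return 1;
	return 0;
}

/* Registers an in-memory object; returns -1 if the overlay is full. */
static int toy_overlay_add(struct toy_overlay *o, const char *id)
{
	if (o->nr_extra >= TOY_MAX)
		return -1;
	o->extra[o->nr_extra++] = id;
	return 0;
}

/* Readers that want pretend objects must hold the overlay explicitly. */
static int toy_overlay_has(const struct toy_overlay *o, const char *id)
{
	for (int i = 0; i < o->nr_extra; i++)
		if (!strcmp(o->extra[i], id))
			return 1;
	return toy_repo_has(o->base, id);
}

/*
 * Writers take the repository, never the overlay, so they cannot see
 * overlay-only objects and cannot persist references to them.
 */
static int toy_repo_write_tree(struct toy_repo *r, const char *tree_id,
			       const char *blob_id)
{
	if (!toy_repo_has(r, blob_id))
		return -1;
	assert(r->nr_disk < TOY_MAX);
	r->disk[r->nr_disk++] = tree_id;
	return 0;
}
```

In this model a blob added to the overlay is visible through
toy_overlay_has() but not through toy_repo_has(), so an unrelated
toy_repo_write_tree() call that references it is refused instead of
silently producing a dangling tree.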

Thanks,
Jonathan

Jeff King Jan. 7, 2020, 11:22 a.m. UTC | #6

On Mon, Jan 06, 2020 at 03:47:53PM -0800, Jonathan Nieder wrote:

> >> + * Callers are responsible for calling write_object_file to record the
> >> + * object in persistent storage before writing any other new objects
> >> + * that reference it.
> >> + */
> >>  int pretend_object_file(void *, unsigned long, enum object_type,
> >>  			struct object_id *oid);
> >
> > I think this is an improvement over the status quo, but it's still a
> > potential trap for code which happens to run in the same process (see my
> > other email in the thread).
> >
> > Should the message perhaps be even more scary?
> 
> A pet peeve of mine is warning volume escalation: if it becomes common
> for us to say
> 
>  * Warning: callers are responsible for [...]
> 
> then new warnings trying to stand out might say
> 
>  * WARNING: callers are responsible for [...]
> 
> and then after we are desensitized to that, we may switch to
> 
>  * WARNING WARNING WARNING, not the usual blah-blah: callers are
> 
> and so on.  The main way I have found to counteract that is to make
> the "dangerous curve" markers context-specific enough that people
> don't learn to ignore them.  After all, sometimes a concurrency
> warning is important to me, at other times warnings about clarity may
> be what attract my interest, and so on.

I meant less about the number of capital letters, and more that we
should be saying "this interface is dangerous; don't use it". Because
it's not just "callers are responsible". It's "this can cause subtle
and hard-to-debug issues because it's writing to global state".

My preferred solution would actually be to rip it out entirely, but we'd
need some solution for git-blame, the sole caller. Possibly it could
insert the value straight into the diff_filespec. But according to the
thread that I linked earlier, I poked at that last year but it didn't
look trivial.

> I don't have a good suggestion here.  Perhaps "Callers are responsible
> for" is too slow and something more terse would help?
> 
>  /*
>   * Adds an object to the in-memory object store, without writing it
>   * to disk.
>   *
>   * Use with caution!  Unless you call write_object_file to record the
>   * in-memory object to persistent storage, any other new objects that
>   * reference it will point to a missing (in memory only) object,
>   * resulting in a corrupt repository.
>   */

Yeah, that's more what I had in mind.

> It would be even better if we have some automated way to catch this
> kind of issue.  Should tests run "git fsck" after each test?  Should
> write_object_file have a paranoid mode that checks integrity?
> 
> I don't know an efficient way to do that.  Ultimately I am comfortable
> counting on reviewers to be aware of this kind of pitfall.  While
> nonlocal invariants are always hard to maintain, this pitfall is
> inherent in the semantics of the function, so I am not too worried
> that reviewers will overlook it.

Yeah, given the scope of the problem (we have a single caller, and this
mechanism is over a decade old) I'm fine with review as the enforcement
mechanism, too.

> A less error-prone interface would tie the result of
> pretend_object_file to a short-lived overlay on the_repository without
> affecting global state.  We could even enforce read-only access in
> that overlay.  I don't think the "struct repository" interface and
> callers are ready for that yet, though.

I agree that would be better, though it's still kind-of global (in that
the repository object is effectively a global for most processes).

-Peff
