diff mbox series

[2/3] hash-object doc: elaborate on -w and --literally promises

Message ID 20190520215312.10363-3-avarab@gmail.com (mailing list archive)
State New, archived
Headers show
Series hash-object doc: small fixes | expand

Commit Message

Ævar Arnfjörð Bjarmason May 20, 2019, 9:53 p.m. UTC
Clarify the hash-object docs to explicitly note that the --literally
option guarantees that a loose object will be written, but that a
normal -w ("write") invocation doesn't.

At first I thought talking about "loose object" in the docs was a
mistake in 83115ac4a8 ("git-hash-object.txt: document --literally
option", 2015-05-04), but as is clear from 5ba9a93b39 ("hash-object:
add --literally option", 2014-09-11) this was intended all along.

As seen in 4dd1fbc7b1 ("Bigfile: teach "git add" to send a large file
straight to a pack", 2011-05-08) we'll otherwise not guarantee that we
write loose objects, so let's explicitly note that, using vague enough
language to leave the door open to any arbitrary future storage
format, not just loose objects and packs.

While I'm at it remove the mention of "--stdin" from the "--literally"
documentation. This wasn't correct, the "--literally" option combines
with any other option (e.g. "--stdin-paths"), not just "--stdin".

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Documentation/git-hash-object.txt | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

Comments

Jeff King May 22, 2019, 5:08 a.m. UTC | #1
On Mon, May 20, 2019 at 11:53:11PM +0200, Ævar Arnfjörð Bjarmason wrote:

> Clarify the hash-object docs to explicitly note that the --literally
> option guarantees that a loose object will be written, but that a
> normal -w ("write") invocation doesn't.

I had to double-check here: you mean that _when_ we are writing an
object, "--literally" would always write loose, right?

> At first I thought talking about "loose object" in the docs was a
> mistake in 83115ac4a8 ("git-hash-object.txt: document --literally
> option", 2015-05-04), but as is clear from 5ba9a93b39 ("hash-object:
> add --literally option", 2014-09-11) this was intended all along.

Hmm. After reading both of those, I do think it's mostly an
implementation detail. I would not be at all surprised to find that the
test suite relies on this (e.g., cleaning up with rm
.git/objects/ab/cd1234). But I suspect we also rely on that for the
non-literal case, too. ;)

So I am on the fence. In some sense it doesn't hurt to document the
behavior, but I'm not sure I would want to lock us in to any particular
behavior, even for --literally. The intent of the option (as I recall)
really is just "let us write whatever trash we want as an object,
ignoring all quality checks".

>  --literally::
> -	Allow `--stdin` to hash any garbage into a loose object which might not
> +	Allow for hashing arbitrary data which might not
>  	otherwise pass standard object parsing or git-fsck checks. Useful for
>  	stress-testing Git itself or reproducing characteristics of corrupt or
> -	bogus objects encountered in the wild.
> +	bogus objects encountered in the wild. When writing objects guarantees
> +	that the written object will be a loose object, for ease of debugging.

I had to read this last sentence a few times to parse it. Maybe a comma
before guarantees would help? Or even:

  When writing objects, this option guarantees that the written object
  will be a loose object, for ease of debugging.

-Peff
Jakub Narębski May 24, 2019, 10:04 a.m. UTC | #2
Jeff King <peff@peff.net> writes:
> On Mon, May 20, 2019 at 11:53:11PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> Clarify the hash-object docs to explicitly note that the --literally
>> option guarantees that a loose object will be written, but that a
>> normal -w ("write") invocation doesn't.
>
> I had to double-check here: you mean that _when_ we are writing an
> object, "--literally" would always write loose, right?
>
>> At first I thought talking about "loose object" in the docs was a
>> mistake in 83115ac4a8 ("git-hash-object.txt: document --literally
>> option", 2015-05-04), but as is clear from 5ba9a93b39 ("hash-object:
>> add --literally option", 2014-09-11) this was intended all along.
>
> Hmm. After reading both of those, I do think it's mostly an
> implementation detail. I would not be at all surprised to find that the
> test suite relies on this (e.g., cleaning up with rm
> .git/objects/ab/cd1234). But I suspect we also rely on that for the
> non-literal case, too. ;)
>
> So I am on the fence. In some sense it doesn't hurt to document the
> behavior, but I'm not sure I would want to lock us in to any particular
> behavior, even for --literally. The intent of the option (as I recall)
> really is just "let us write whatever trash we want as an object,
> ignoring all quality checks".

I thik that this implemetation detail of `--literally` is here to stay;
how would you otherwise fix the issue if garbage object makes Git crash?

However, I would prefer to have options state _intent_; if there is
legitimate need for a tool that creates loose objects, it would be
better to have separate `--loose` option to `git hash-object` (which
would imply `-w`, otherwise it doesn't have sense).

>>  --literally::
>> -	Allow `--stdin` to hash any garbage into a loose object which might not
>> +	Allow for hashing arbitrary data which might not
>>  	otherwise pass standard object parsing or git-fsck checks. Useful for
>>  	stress-testing Git itself or reproducing characteristics of corrupt or
>> -	bogus objects encountered in the wild.
>> +	bogus objects encountered in the wild. When writing objects guarantees
>> +	that the written object will be a loose object, for ease of debugging.
>
> I had to read this last sentence a few times to parse it. Maybe a comma
> before guarantees would help? Or even:
>
>   When writing objects, this option guarantees that the written object
>   will be a loose object, for ease of debugging.

I agree that this reads better.

Regards,
--
Jakub Narębski
Ævar Arnfjörð Bjarmason May 24, 2019, 10:12 a.m. UTC | #3
On Fri, May 24 2019, Jakub Narebski wrote:

> Jeff King <peff@peff.net> writes:
>> On Mon, May 20, 2019 at 11:53:11PM +0200, Ævar Arnfjörð Bjarmason wrote:
>>
>>> Clarify the hash-object docs to explicitly note that the --literally
>>> option guarantees that a loose object will be written, but that a
>>> normal -w ("write") invocation doesn't.
>>
>> I had to double-check here: you mean that _when_ we are writing an
>> object, "--literally" would always write loose, right?
>>
>>> At first I thought talking about "loose object" in the docs was a
>>> mistake in 83115ac4a8 ("git-hash-object.txt: document --literally
>>> option", 2015-05-04), but as is clear from 5ba9a93b39 ("hash-object:
>>> add --literally option", 2014-09-11) this was intended all along.
>>
>> Hmm. After reading both of those, I do think it's mostly an
>> implementation detail. I would not be at all surprised to find that the
>> test suite relies on this (e.g., cleaning up with rm
>> .git/objects/ab/cd1234). But I suspect we also rely on that for the
>> non-literal case, too. ;)
>>
>> So I am on the fence. In some sense it doesn't hurt to document the
>> behavior, but I'm not sure I would want to lock us in to any particular
>> behavior, even for --literally. The intent of the option (as I recall)
>> really is just "let us write whatever trash we want as an object,
>> ignoring all quality checks".
>
> I thik that this implemetation detail of `--literally` is here to stay;
> how would you otherwise fix the issue if garbage object makes Git crash?
>
> However, I would prefer to have options state _intent_; if there is
> legitimate need for a tool that creates loose objects, it would be
> better to have separate `--loose` option to `git hash-object` (which
> would imply `-w`, otherwise it doesn't have sense).

I wonder if we can just remove this option and replace it with a
GIT_TEST_* env variable, or even a test-tool helper. I can't see why
anyone other than our own test suite wants this, and that's why it was
added. So why document it & expose it in a plumbing tool?

>>>  --literally::
>>> -	Allow `--stdin` to hash any garbage into a loose object which might not
>>> +	Allow for hashing arbitrary data which might not
>>>  	otherwise pass standard object parsing or git-fsck checks. Useful for
>>>  	stress-testing Git itself or reproducing characteristics of corrupt or
>>> -	bogus objects encountered in the wild.
>>> +	bogus objects encountered in the wild. When writing objects guarantees
>>> +	that the written object will be a loose object, for ease of debugging.
>>
>> I had to read this last sentence a few times to parse it. Maybe a comma
>> before guarantees would help? Or even:
>>
>>   When writing objects, this option guarantees that the written object
>>   will be a loose object, for ease of debugging.
>
> I agree that this reads better.
>
> Regards,
Jeff King May 28, 2019, 6:06 a.m. UTC | #4
On Fri, May 24, 2019 at 12:12:10PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > I thik that this implemetation detail of `--literally` is here to stay;
> > how would you otherwise fix the issue if garbage object makes Git crash?
> >
> > However, I would prefer to have options state _intent_; if there is
> > legitimate need for a tool that creates loose objects, it would be
> > better to have separate `--loose` option to `git hash-object` (which
> > would imply `-w`, otherwise it doesn't have sense).
> 
> I wonder if we can just remove this option and replace it with a
> GIT_TEST_* env variable, or even a test-tool helper. I can't see why
> anyone other than our own test suite wants this, and that's why it was
> added. So why document it & expose it in a plumbing tool?

I can think of a few reasons you might want it in the general toolbox:

  - you could be recreating a known-buggy history and want to overcome
    consistency checks (e.g., imagine a tool that imports or modifies
    git history, and needs to recreate objects literally, warts and
    all). I think we don't do a lot of quality checks in hash-object
    now, but we have discussed it (and --literally would be the obvious
    escape hatch).

  - folks like security researchers who are poking at Git and want to
    see how it reacts to various almost-correct inputs. They _could_ use
    the test-tool helper, but being able to demonstrate bugs with the
    standard toolbox is helpful for communication.

  - likewise for general non-security debugging ("it behaves in this
    funny way if I violate this constraint...")

So I actually think it's nice to have it as part of the plumbing, as
long as it's buried and documented sufficiently that unsuspecting users
do not stumble onto it.

And at any rate, now that it has been in the wild for a while, we risk
breaking somebody who has come to depend on it.

-Peff
Junio C Hamano May 28, 2019, 4:49 p.m. UTC | #5
Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:

> Clarify the hash-object docs to explicitly note that the --literally
> option guarantees that a loose object will be written, but that a
> normal -w ("write") invocation doesn't.
>
> At first I thought talking about "loose object" in the docs was a
> mistake in 83115ac4a8 ("git-hash-object.txt: document --literally
> option", 2015-05-04), but as is clear from 5ba9a93b39 ("hash-object:
> add --literally option", 2014-09-11) this was intended all along.

I have to admit that this "loose only" was the doing of my
defeatism.  IOW, I was utterly pessimistic that I would be able to
add more types (and more importantly, unbounded number of random
types) of objects in the packstream.

So, "loose object" limitation is a practical one for those of us who
cannot think of a reasonable way to cram arbitrary number of random
new types into just 3 bits of the "type" bitfield, and not inherent
to the "hash-object --literally" command.

So I am very happy to see the first hunk of this patch, but I doubt
there is much value in the last sentence the second hunk adds.

Thanks.
Junio C Hamano May 28, 2019, 4:56 p.m. UTC | #6
Jakub Narebski <jnareb@gmail.com> writes:

> I thik that this implemetation detail of `--literally` is here to stay;
> how would you otherwise fix the issue if garbage object makes Git crash?

By repacking, presumably ;-).

More importantly, there needs a way to extend "enum object_type" to
allow unbounded number of arbitrary (garbage) types before we can
allow --literally to record such a garbage type in a pack stream.

So I'd expect the implementation detail would stay for a long time.
But there is nothing that says `--literally` inherently must write
loose.  It is plausible that a new implementation writes objects of
known/valid types to a pack stream, while unknown/garbage types to
loose objects.

> However, I would prefer to have options state _intent_; if there is
> legitimate need for a tool that creates loose objects, it would be
> better to have separate `--loose` option to `git hash-object` (which
> would imply `-w`, otherwise it doesn't have sense).

Yes, I very much agree with that.
diff mbox series

Patch

diff --git a/Documentation/git-hash-object.txt b/Documentation/git-hash-object.txt
index df9e2c58bd..100630d021 100644
--- a/Documentation/git-hash-object.txt
+++ b/Documentation/git-hash-object.txt
@@ -28,6 +28,12 @@  OPTIONS
 
 -w::
 	Actually write the object into the object database.
++
+No guarantees are made as to how the object is added to the database
+(e.g. whether as a loose object, or streamed to a pack), except in the
+case of the `--literally` option. How it's written may depend on
+heuristics such as the `core.bigFileThreshold` configuration variable
+(see linkgit:git-config[1]).
 
 --stdin::
 	Read the object from standard input instead of from a file.
@@ -53,10 +59,11 @@  OPTIONS
 	is always implied, unless the `--path` option is given.
 
 --literally::
-	Allow `--stdin` to hash any garbage into a loose object which might not
+	Allow for hashing arbitrary data which might not
 	otherwise pass standard object parsing or git-fsck checks. Useful for
 	stress-testing Git itself or reproducing characteristics of corrupt or
-	bogus objects encountered in the wild.
+	bogus objects encountered in the wild. When writing objects guarantees
+	that the written object will be a loose object, for ease of debugging.
 
 GIT
 ---