[v3] doc: describe Git bundle format

Message ID 20200207204225.123764-1-masayasuzuki@google.com
State New

Commit Message

Masaya Suzuki Feb. 7, 2020, 8:42 p.m. UTC
The bundle format was not documented. Describe the format with ABNF and
explain the meaning of each part.

Signed-off-by: Masaya Suzuki <masayasuzuki@google.com>
---
Changes from v2:

* Change "sender" and "receiver" to "writer" and "reader".
* Add an example of a case where a bundle can reference an object outside of
  the bundle.
* Mention that the prerequisites are different from the shallow
  boundary, and the bundle format cannot represent a shallow clone repository.


 Documentation/technical/bundle-format.txt | 48 +++++++++++++++++++++++
 1 file changed, 48 insertions(+)
 create mode 100644 Documentation/technical/bundle-format.txt

Comments

Masaya Suzuki Feb. 7, 2020, 8:44 p.m. UTC | #1
On Fri, Feb 7, 2020 at 12:42 PM Masaya Suzuki <masayasuzuki@google.com> wrote:
> +=== Note on the shallow clone and a Git bundle
> +
> +Note that the prerequisites does not represent a shallow-clone boundary. The

the prerequisites do not
Junio C Hamano Feb. 7, 2020, 8:59 p.m. UTC | #2
Masaya Suzuki <masayasuzuki@google.com> writes:

> On Fri, Feb 7, 2020 at 12:42 PM Masaya Suzuki <masayasuzuki@google.com> wrote:
>> +=== Note on the shallow clone and a Git bundle
>> +
>> +Note that the prerequisites does not represent a shallow-clone boundary. The
>
> the prerequisites do not

Grammo aside, I am not sure if that particular Note is beneficial to
begin with.  I would imagine that you can get a bundle that holds
all the objects in a shallow repository by specifying the range that
matches the shallow-clone boundary when you run "git bundle create"
while disabling thin-pack generation.

The support of shallow-clone by Git may be incomplete and it may not
be easy to form such a range, and "git bundle create" command may
not have a knob to disable thin-pack generation, but that does not
mean that the bundle *format* cannot be used to represent the
shallow boundary.
Masaya Suzuki Feb. 7, 2020, 10:21 p.m. UTC | #3
On Fri, Feb 7, 2020 at 12:59 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Masaya Suzuki <masayasuzuki@google.com> writes:
>
> > On Fri, Feb 7, 2020 at 12:42 PM Masaya Suzuki <masayasuzuki@google.com> wrote:
> >> +=== Note on the shallow clone and a Git bundle
> >> +
> >> +Note that the prerequisites does not represent a shallow-clone boundary. The
> >
> > the prerequisites do not
>
> Grammo aside, I am not sure if that particular Note is beneficial to
> begin with.  I would imagine that you can get a bundle that holds
> all the objects in a shallow repository by specifying the range that
> matches the shallow-clone boundary when you run "git bundle create"
> while disabling thin-pack generation.

Yes. The reason that I've been trying to check the semantics of the
prerequisites is that I DO recognize that this is possible
format-wise. I'm not sure if this Git implementation can create such
bundles, but format-wise such bundles can be created.

When writing a Git bundle parser in other implementations (like JGit),
it's not clear whether, as a library, I should support such use cases.
If such usage is supported in the format, then the semantics of the
prerequisites change. Currently the prerequisites are defined as the
objects that are NOT included in the bundle and that the reader of the
bundle MUST already have in order to use the data in the bundle. If
the format supports shallow-cloned repositories, they will instead be
defined as the objects that are NOT included in the bundle. If the
reader wants to read this bundle as if it's a non-shallow clone, the
reader of the bundle MUST have the objects that are reachable from
these prerequisites. If the reader wants to read this bundle as if
it's a shallow clone, the reader MUST treat these as a shallow
boundary.

Also, this change will put further restrictions on the pack. "Pack" is
the pack data stream "git fetch" would send. If the writer of a bundle
wants to write as a shallow-clone pack, the pack MUST NOT reference
objects outside of the shallow boundary from the pack file as a delta
base. The writer MAY reference the commit objects outside of the
shallow boundary as a parent.

The readers and the writers of bundles MUST communicate by other means
whether a bundle represents a shallow clone repository. The bundle
file does not have any indicator of whether it's a shallow clone
bundle or not.

> The support of shallow-clone by Git may be incomplete and it may not
> be easy to form such a range, and "git bundle create" command may
> not have a knob to disable thin-pack generation, but that does not
> mean that the bundle *format* cannot be used to represent the
> shallow boundary.

As I wrote above, if this bundle format supports the shallow clone
state, the semantics will change and writers and readers have
different constraints on the packs. In order to do so, the readers and
the writers have to agree by other means whether it's a shallow clone
or not, since the bundle file doesn't have such indicators. I think
it's better to prohibit such use cases (or at least mark them as
unintended usage), and then create a different bundle format version
that supports a shallow clone boundary (so that the bundle file can be
closer to the frozen git-fetch response).
Junio C Hamano Feb. 8, 2020, 1:49 a.m. UTC | #4
Masaya Suzuki <masayasuzuki@google.com> writes:

> Yes. The reason that I've been trying to check the semantics of the
> prerequisites is that I DO recognize that this is possible
> format-wise. I'm not sure if this Git implementation can create such
> bundles, but format-wise such bundles can be created.

Yeah, now I get it.  

The problem is *not* that v2 format "cannot represent a shallow
clone repository", but is that there is nothing that prevents a
bundle in v2 format from depending on objects behind (not just at)
the shallow boundary, making it impossible for a reader to guarantee
that a bundle with prereqs can be used to create an equivalent
shallow repository with shallow boundary at the same place as
prereqs.  IOW, bundle with prereqs in the v2 format allows more
objects to be omitted than an equivalent shallow repository omits,
because prereqs and shallow cutoff points mean different things.

While we are at it, I suspect that with reachability bitmap, a "git
fetch" that updates a history up to commit A to a new history up to
commit B can omit more objects than what is directly reachable from
the commit A.  That is, if A's direct child (call it C) is a commit
that reverts A, a blob in A's tree won't be in the bundle (because A
is a prereq), but the blob at the same path in C is the same blob as
the blob at the same path in A's parent (that is what it means for
that A's direct child to be a revert of A).  In the normal
enumeration based on object-walk to decide which objects to send,
such a blob in C will be included in the pack, but a reachability
bitmap can say "if we assume the reader has A, it must have A^1, so
that blob should exist at the reader, hence can be omitted from the
transfer even though we are sending commit C."
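
To make the revert scenario above concrete, here is a toy Python model.
It is only an illustration under loud assumptions: the history is
linear, each commit's tree is reduced to a set of blob names, and
COMMITS, blobs_to_send, have_boundary_only and have_full_history are
hypothetical names, not anything taken from Git or JGit. It contrasts
a sender that only treats the prerequisite's own tree as present with
one that treats everything reachable from the prerequisite as present.

```python
# Toy object model (hypothetical): commit -> (parent, blobs in its tree).
COMMITS = {
    "P": (None, {"blob-old"}),  # A's parent, i.e. A^1
    "A": ("P", {"blob-new"}),   # the prerequisite; changes the path
    "C": ("A", {"blob-old"}),   # reverts A, so the old blob reappears
}


def have_boundary_only(prereq):
    """Assume the reader has only the blobs in the prerequisite's own tree."""
    return set(COMMITS[prereq][1])


def have_full_history(prereq):
    """Assume the reader has every blob reachable from the prerequisite."""
    have = set()
    commit = prereq
    while commit is not None:
        parent, blobs = COMMITS[commit]
        have |= blobs
        commit = parent
    return have


def blobs_to_send(tip, prereq, have):
    """Collect blobs of commits between prereq (exclusive) and tip that
    are not already in the assumed 'have' set."""
    needed = set()
    commit = tip
    while commit is not None and commit != prereq:
        parent, blobs = COMMITS[commit]
        needed |= blobs - have
        commit = parent
    return needed
```

With A as the prerequisite and C as the tip, the boundary-only
assumption sends "blob-old", while the full-history assumption sends
nothing, because "blob-old" is reachable from A through A^1.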
Masaya Suzuki Feb. 12, 2020, 10:13 p.m. UTC | #5
On Fri, Feb 7, 2020 at 5:49 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Masaya Suzuki <masayasuzuki@google.com> writes:
>
> > Yes. The reason that I've been trying to check the semantics of the
> > prerequisites is that I DO recognize that this is possible
> > format-wise. I'm not sure if this Git implementation can create such
> > bundles, but format-wise such bundles can be created.
>
> Yeah, now I get it.
>
> The problem is *not* that v2 format "cannot represent a shallow
> clone repository", but is that there is nothing that prevents a
> bundle in v2 format from depending on objects behind (not just at)
> the shallow boundary, making it impossible for a reader to guarantee
> that a bundle with prereqs can be used to create an equivalent
> shallow repository with shallow boundary at the same place as
> prereqs.  IOW, bundle with prereqs in the v2 format allows more
> objects to be omitted than an equivalent shallow repository omits,
> because prereqs and shallow cutoff points mean different things.

Yes. So, I think it's better to say prereqs and shallow boundaries are
different.

> While we are at it, I suspect that with reachability bitmap, a "git
> fetch" that updates a history up to commit A to a new history up to
> commit B can omit more objects than what is directly reachable from
> the commit A.  That is, if A's direct child (call it C) is a commit
> that reverts A, a blob in A's tree won't be in the bundle (because A
> is a prereq), but the blob at the same path in C is the same blob as
> the blob at the same path in A's parent (that is what it means for
> that A's direct child to be a revert of A).  In the normal
> enumeration based on object-walk to decide which objects to send,
> such a blob in C will be included in the pack,

That's interesting. I have never looked at CGit's implementation, but I
think JGit would omit those objects. (At least that was my
understanding. Not confirmed with the code.)

Anyway. Is it OK with adding this small note on "prereq is not a
shallow boundary"? In practice, there are not many Git implementations
that handle Git bundles, so it's not that big a deal as long as those few
implementers recognize this, but this document is meant for those
implementers.
Junio C Hamano Feb. 12, 2020, 10:43 p.m. UTC | #6
Masaya Suzuki <masayasuzuki@google.com> writes:

> Anyway. Is it OK with adding this small note on "prereq is not a
> shallow boundary"?

I thought the text in the latest round is good as-is.

Thanks.

Patch

diff --git a/Documentation/technical/bundle-format.txt b/Documentation/technical/bundle-format.txt
new file mode 100644
index 0000000000..0e828151a5
--- /dev/null
+++ b/Documentation/technical/bundle-format.txt
@@ -0,0 +1,48 @@ 
+= Git bundle v2 format
+
+The Git bundle format is a format that represents both refs and Git objects.
+
+== Format
+
+We will use ABNF notation to define the Git bundle format. See
+protocol-common.txt for the details.
+
+----
+bundle    = signature *prerequisite *reference LF pack
+signature = "# v2 git bundle" LF
+
+prerequisite = "-" obj-id SP comment LF
+comment      = *CHAR
+reference    = obj-id SP refname LF
+
+pack         = ... ; packfile
+----
+
+== Semantics
+
+A Git bundle consists of three parts.
+
+* "Prerequisites" lists the objects that are NOT included in the bundle and the
+  reader of the bundle MUST already have, in order to use the data in the
+  bundle. The objects stored in the bundle may refer to prerequisite objects and
+  anything reachable from them (e.g. a tree object in the bundle can reference
+  a blob that is reachable from a prerequisite) and/or expressed as a delta
+  against prerequisite objects.
+
+* "References" record the tips of the history graph, iow, what the reader of the
+  bundle CAN "git fetch" from it.
+
+* "Pack" is the pack data stream "git fetch" would send, if you fetch from a
+  repository that has the references recorded in the "References" above into a
+  repository that has references pointing at the objects listed in
+  "Prerequisites" above.
+
+In the bundle format, there can be a comment following a prerequisite obj-id.
+This is a comment and it has no specific meaning. The writer of the bundle MAY
+put any string here. The reader of the bundle MUST ignore the comment.
+
+=== Note on the shallow clone and a Git bundle
+
+Note that the prerequisites does not represent a shallow-clone boundary. The
+semantics of the prerequisites and the shallow-clone boundaries are different,
+and the Git bundle v2 format cannot represent a shallow clone repository.
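
As a reading aid for the grammar in the patch above, here is a minimal
Python sketch of a header reader. It is not taken from Git or JGit:
parse_bundle_header and BundleHeader are hypothetical names, and the
sketch assumes header lines can be decoded as UTF-8 (undecodable bytes
are replaced). It does not validate obj-id or refname syntax and does
not touch the pack data.

```python
import io
from typing import BinaryIO, List, NamedTuple, Tuple

SIGNATURE = b"# v2 git bundle\n"


class BundleHeader(NamedTuple):
    prerequisites: List[Tuple[str, str]]  # (obj-id, comment)
    references: List[Tuple[str, str]]     # (obj-id, refname)


def parse_bundle_header(stream: BinaryIO) -> BundleHeader:
    """Read signature, prerequisites and references; stop before the pack."""
    if stream.readline() != SIGNATURE:
        raise ValueError("not a v2 Git bundle")
    prerequisites: List[Tuple[str, str]] = []
    references: List[Tuple[str, str]] = []
    while True:
        line = stream.readline()
        if line == b"\n":
            break  # the lone LF that separates the header from the pack
        if not line.endswith(b"\n"):
            raise ValueError("truncated bundle header")
        text = line[:-1].decode("utf-8", errors="replace")
        if text.startswith("-"):
            if references:
                # The grammar lists all prerequisites before any reference.
                raise ValueError("prerequisite after reference")
            obj_id, _, comment = text[1:].partition(" ")
            # The comment is kept for display only; it carries no meaning.
            prerequisites.append((obj_id, comment))
        else:
            obj_id, sep, refname = text.partition(" ")
            if not sep:
                raise ValueError("malformed reference line")
            references.append((obj_id, refname))
    return BundleHeader(prerequisites, references)
```

After the function returns, the stream is positioned at the first byte
of the pack, so a caller could hand the rest of the stream to a pack
reader. Wrapping a bytes value in io.BytesIO is enough to try it out.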