[00/15] Refactor chunk-format into an API

Message ID: pull.804.git.1607012215.gitgitgadget@gmail.com

Johannes Schindelin via GitGitGadget Dec. 3, 2020, 4:16 p.m. UTC
I was thinking about file formats recently and realized that the "chunks"
that are common to the commit-graph and multi-pack-index could inform future
file formats. To make that process easier, let's combine the process of
writing and reading chunks into a common API that both of these existing
formats use.

There is some extra benefit immediately: the writing and reading code for
each gets a bit cleaner. Also, there were different checks in each that made
the process more robust. Now, these share a common set of checks.

In particular, Szeder made some updates to the commit-graph writing process
that form the model for this API.

Thanks, -Stolee
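[Editorial note: a rough sketch of the TOC-first layout the series builds on.
The identifiers below are hypothetical, not the actual API from the patches;
the layout shown (header, then a table of contents of (id, offset) pairs with
a terminating entry, then the chunk payloads) matches the existing
commit-graph and multi-pack-index files.]

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CHUNKS 8
#define CHUNK_TOC_ENTRY_SIZE 12 /* 4-byte chunk id + 8-byte offset */

struct chunk_info {
	uint32_t id;
	uint64_t size;
};

struct chunkfile {
	struct chunk_info chunks[MAX_CHUNKS];
	int num_chunks;
};

static void add_chunk(struct chunkfile *cf, uint32_t id, uint64_t size)
{
	cf->chunks[cf->num_chunks].id = id;
	cf->chunks[cf->num_chunks].size = size;
	cf->num_chunks++;
}

/*
 * Offset of chunk i's payload: past the file header and the TOC
 * (num_chunks + 1 entries; the extra terminating entry lets readers
 * compute each chunk's size by subtracting adjacent offsets), plus
 * the sizes of all preceding chunks.
 */
static uint64_t chunk_offset(const struct chunkfile *cf, int i,
			     uint64_t header_size)
{
	uint64_t off = header_size +
		(uint64_t)(cf->num_chunks + 1) * CHUNK_TOC_ENTRY_SIZE;
	int j;
	for (j = 0; j < i; j++)
		off += cf->chunks[j].size;
	return off;
}
```

Because the TOC comes first, every chunk's size must be known before any
payload is written; this is the "two passes" point raised later in the thread.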

Derrick Stolee (15):
  commit-graph: anonymize data in chunk_write_fn
  chunk-format: add API for writing table of contents
  midx: rename pack_info to write_midx_context
  midx: use context in write_midx_pack_names()
  midx: add entries to write_midx_context
  midx: add pack_perm to write_midx_context
  midx: add num_large_offsets to write_midx_context
  midx: convert chunk write methods to return int
  midx: drop chunk progress during write
  midx: use chunk-format API in write_midx_internal()
  midx: use 64-bit multiplication for chunk sizes
  chunk-format: create write_chunks()
  chunk-format: create chunk reading API
  commit-graph: restore duplicate chunk checks
  chunk-format: add technical docs

 Documentation/technical/chunk-format.txt      |  54 ++
 .../technical/commit-graph-format.txt         |   3 +
 Documentation/technical/pack-format.txt       |   3 +
 Makefile                                      |   1 +
 chunk-format.c                                | 105 ++++
 chunk-format.h                                |  69 +++
 commit-graph.c                                | 298 ++++++-----
 midx.c                                        | 466 ++++++++----------
 t/t5318-commit-graph.sh                       |   2 +-
 t/t5319-multi-pack-index.sh                   |   6 +-
 10 files changed, 623 insertions(+), 384 deletions(-)
 create mode 100644 Documentation/technical/chunk-format.txt
 create mode 100644 chunk-format.c
 create mode 100644 chunk-format.h


base-commit: 72ffeb997eaf999f6938b2a7e0d9a75dcceaa311
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-804%2Fderrickstolee%2Fchunk-format%2Frefactor-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-804/derrickstolee/chunk-format/refactor-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/804

Comments

René Scharfe Dec. 4, 2020, 12:48 p.m. UTC | #1
Am 03.12.20 um 17:16 schrieb Derrick Stolee via GitGitGadget:
> I was thinking about file formats recently and realized that the "chunks"
> that are common to the commit-graph and multi-pack-index could inform future
> file formats. To make that process easier, let's combine the process of
> writing and reading chunks into a common API that both of these existing
> formats use.
>
> There is some extra benefit immediately: the writing and reading code for
> each gets a bit cleaner. Also, there were different checks in each that made
> the process more robust. Now, these share a common set of checks.

>  Documentation/technical/chunk-format.txt      |  54 ++
>  .../technical/commit-graph-format.txt         |   3 +
>  Documentation/technical/pack-format.txt       |   3 +
>  Makefile                                      |   1 +
>  chunk-format.c                                | 105 ++++
>  chunk-format.h                                |  69 +++
>  commit-graph.c                                | 298 ++++++-----
>  midx.c                                        | 466 ++++++++----------
>  t/t5318-commit-graph.sh                       |   2 +-
>  t/t5319-multi-pack-index.sh                   |   6 +-
>  10 files changed, 623 insertions(+), 384 deletions(-)

623-384-54-3-3-1-69-2-6 = 101

So if we ignore changes to documentation, headers, tests and build
script this spends ca. 100 more lines of code than the current version.
That's roughly the size of the new file chunk-format.c -- from this
bird's-eye-view the new API seems to be pure overhead.

In the new code I see several magic numbers, use of void pointers and
casting as well as repetition -- is this really going in the right
direction?  I get the feeling that YAGNI.

René
Derrick Stolee Dec. 4, 2020, 1:57 p.m. UTC | #2
On 12/4/2020 7:48 AM, René Scharfe wrote:
> Am 03.12.20 um 17:16 schrieb Derrick Stolee via GitGitGadget:
...
>>  Documentation/technical/chunk-format.txt      |  54 ++
>>  .../technical/commit-graph-format.txt         |   3 +
>>  Documentation/technical/pack-format.txt       |   3 +
>>  Makefile                                      |   1 +
>>  chunk-format.c                                | 105 ++++
>>  chunk-format.h                                |  69 +++
>>  commit-graph.c                                | 298 ++++++-----
>>  midx.c                                        | 466 ++++++++----------
>>  t/t5318-commit-graph.sh                       |   2 +-
>>  t/t5319-multi-pack-index.sh                   |   6 +-
>>  10 files changed, 623 insertions(+), 384 deletions(-)
> 
> 623-384-54-3-3-1-69-2-6 = 101
> 
> So if we ignore changes to documentation, headers, tests and build
> script this spends ca. 100 more lines of code than the current version.
> That's roughly the size of the new file chunk-format.c -- from this
> bird's-eye-view the new API seems to be pure overhead.

Overhead in terms of lines of code, but many of those are function
prototypes and single lines containing only "{" and "}". So yes,
the code files are a bit longer, but the amount of executed code is
not meaningfully different.

Extra lines of code are an expected cost of refactoring. The remaining
question is, "is it worth the cost?" I believe it is.
 
> In the new code I see several magic numbers, use of void pointers and
> casting as well as repetition -- is this really going in the right
> direction?  I get the feeling that YAGNI.

void pointers are a cost of abstraction in C that we use all over the
codebase.

You (and Junio) are right to point out my magic numbers. Those should
be replaced with something better when possible.

As far as YAGNI, I doubt that very much. First, we have already seen
extensions to the commit-graph that added several new chunks, and
plugging into this (documented) API should be easier than the previous
ad-hoc mechanism.

I've CC'd Abhishek to get his opinion, since he's recently added chunks
to the commit-graph file. Outside of the fact that this series conflicts
with his series (which I will fix), it would be good to see if he
appreciates this model.

>> I was thinking about file formats recently and realized that the "chunks"
>> that are common to the commit-graph and multi-pack-index could inform future
>> file formats. To make that process easier, let's combine the process of
>> writing and reading chunks into a common API that both of these existing
>> formats use.

And another point on YAGNI: I'm literally prototyping a new file format and
want to use this API to build it instead of repeating myself. Specifically,
I noticed that the commit-graph and multi-pack-index were inconsistent in
how they protected the file format in different ways during writes and reads.
This leads to...

>> There is some extra benefit immediately: the writing and reading code for
>> each gets a bit cleaner. Also, there were different checks in each that made
>> the process more robust. Now, these share a common set of checks.

...my point that combining these checks makes both codepaths slightly more
robust. I didn't even include the potential extension of storing the size
of each chunk in "struct commit_graph" and "struct multi_pack_index" for
run-time bound checks during lookups. That seemed like too much new
behavior for a series that intends to only refactor.
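[Editorial note: a minimal sketch of the kind of run-time bounds check
alluded to here, assuming each chunk's size were stored alongside its
pointer. None of these names exist in the series; this only illustrates
the idea.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct chunk_ref {
	const unsigned char *data;
	uint64_t size; /* chunk size, as recovered from the TOC */
};

/*
 * Return a pointer to fixed-width record i inside the chunk, or NULL
 * if the record would extend past the chunk's stored size (i.e. the
 * file is truncated or corrupt).
 */
static const unsigned char *chunk_record(const struct chunk_ref *c,
					 uint64_t record_size, uint64_t i)
{
	if ((i + 1) * record_size > c->size)
		return NULL; /* out of bounds: refuse the lookup */
	return c->data + i * record_size;
}
```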

Thanks,
-Stolee
Junio C Hamano Dec. 4, 2020, 7:42 p.m. UTC | #3
René Scharfe <l.s.r@web.de> writes:

>>  Documentation/technical/chunk-format.txt      |  54 ++
>>  .../technical/commit-graph-format.txt         |   3 +
>>  Documentation/technical/pack-format.txt       |   3 +
>>  Makefile                                      |   1 +
>>  chunk-format.c                                | 105 ++++
>>  chunk-format.h                                |  69 +++
>>  commit-graph.c                                | 298 ++++++-----
>>  midx.c                                        | 466 ++++++++----------
>>  t/t5318-commit-graph.sh                       |   2 +-
>>  t/t5319-multi-pack-index.sh                   |   6 +-
>>  10 files changed, 623 insertions(+), 384 deletions(-)
>
> 623-384-54-3-3-1-69-2-6 = 101
>
> So if we ignore changes to documentation, headers, tests and build
> script this spends ca. 100 more lines of code than the current version.
> That's roughly the size of the new file chunk-format.c -- from this
> bird's-eye-view the new API seems to be pure overhead.
>
> In the new code I see several magic numbers, use of void pointers and
> casting as well as repetition -- is this really going in the right
> direction?  I get the feeling that YAGNI.

Hmph, two existing users consolidated into one and still not losing
lines is not a very convincing sign.  Perhaps a third existing user
would purely lose lines when converted to use this (do we have a
third or fourth one?)  I dunno.
Taylor Blau Dec. 8, 2020, 6:49 p.m. UTC | #4
On Fri, Dec 04, 2020 at 01:48:31PM +0100, René Scharfe wrote:
> Am 03.12.20 um 17:16 schrieb Derrick Stolee via GitGitGadget:
> > I was thinking about file formats recently and realized that the "chunks"
> > that are common to the commit-graph and multi-pack-index could inform future
> > file formats. To make that process easier, let's combine the process of
> > writing and reading chunks into a common API that both of these existing
> > formats use.
> >
> > There is some extra benefit immediately: the writing and reading code for
> > each gets a bit cleaner. Also, there were different checks in each that made
> > the process more robust. Now, these share a common set of checks.
>
> >  Documentation/technical/chunk-format.txt      |  54 ++
> >  .../technical/commit-graph-format.txt         |   3 +
> >  Documentation/technical/pack-format.txt       |   3 +
> >  Makefile                                      |   1 +
> >  chunk-format.c                                | 105 ++++
> >  chunk-format.h                                |  69 +++
> >  commit-graph.c                                | 298 ++++++-----
> >  midx.c                                        | 466 ++++++++----------
> >  t/t5318-commit-graph.sh                       |   2 +-
> >  t/t5319-multi-pack-index.sh                   |   6 +-
> >  10 files changed, 623 insertions(+), 384 deletions(-)
>
> 623-384-54-3-3-1-69-2-6 = 101
>
> So if we ignore changes to documentation, headers, tests and build
> script this spends ca. 100 more lines of code than the current version.
> That's roughly the size of the new file chunk-format.c -- from this
> bird's-eye-view the new API seems to be pure overhead.
>
> In the new code I see several magic numbers, use of void pointers and
> casting as well as repetition -- is this really going in the right
> direction?  I get the feeling that YAGNI.

I think that Stolee is going in the right direction. I suggested earlier
in the thread making a new "chunkfile" type which can handle allocating
new chunks, writing their tables of contents, and so on.

So, I think that we should pursue that direction a little further
before deciding whether or not this is worth continuing. My early
experiments showed that it does add a little more code to the
chunk-format.{c,h} files, but you get negative diffs in midx.c and
commit-graph.c, which is more in line with what I would expect from this
series.

I do think that the "overhead" here is more tolerable than we might
think; I'd rather have a well-documented "chunkfile" implementation
written once and called twice, than two near-identical implementations
of _writing_ the chunks / table of contents at each of the call sites.
So, even if this does end up being a net-lines-added kind of diff, I'd
still say that it's worth it.

With regards to the "YAGNI" comment... I do have thoughts about
extending the reachability bitmap format to use chunks (of course, this
would break compatibility with JGit, and it isn't something that I plan
to do in the short-term, or even necessarily in the future).

In any event, I'm sure that these two won't be the last chunk-based
formats that we have in Git.

> René

Thanks,
Taylor
René Scharfe Dec. 9, 2020, 5:13 p.m. UTC | #5
Am 08.12.20 um 19:49 schrieb Taylor Blau:
> So, I think that we should pursues that direction a little further
> before deciding whether or not this is worth continuing. My early
> experiments showed that it does add a little more code to the
> chunk-format.{c,h} files, but you get negative diffs in midx.c and
> commit-graph.c, which is more in line with what I would expect from this
> series.

OK.

> I do think that the "overhead" here is more tolerable than we might
> think; I'd rather have a well-documented "chunkfile" implementation
> written once and called twice, than two near-identical implementations
> of _writing_ the chunks / table of contents at each of the call sites.
> So, even if this does end up being a net-lines-added kind of diff, I'd
> still say that it's worth it.

Well, interfaces are hard, and having two similar-but-not-quite-equal
pieces of code instead of a central API implementation trying to
serve two callers can actually be better.

I'm not too familiar with the chunk producers and consumers, so I can
only offer some high-level observations.  And I don't have to use the
API, so go wild! ;)  I was just triggered by the appearance of two
working pieces of code being replaced by two slightly different pieces
of code plus a third one on top.

> With regards to the "YAGNI" comment... I do have thoughts about
> extending the reachability bitmap format to use chunks (of course, this
> would break compatibility with JGit, and it isn't something that I plan
> to do in the short-term, or even necessarily in the future).
>
> In any event, I'm sure that these two won't be the last chunk-based
> formats that we have in Git.

OK, so perhaps we can do better before this scheme is copied.  The write
side is complicated by the fact that the table of contents (TOC) is
written first, followed by the actual chunks.  That requires two passes
over the data.

The ZIP format solved a similar issue by placing the TOC at the end,
which allows for one-pass streaming.  Another way to achieve that would
be to put the TOC in a separate file, like we do for .pack and .idx
files.  This way you could have a single write function for chunks, and
writers would just be a single sequence of calls for the different
types.

But seeing that the read side just loads all of the chunks anyway
(skipping unknown IDs) I wonder why we need a TOC at all.  That would
only be useful if callers were trying to read just some small subset
of the whole file.  A collection of chunks for easy dumping and loading
could be serialized by writing just a small header for each chunk
containing its type and size followed by its payload.
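[Editorial note: a sketch of that header-per-chunk idea. Each chunk carries
its own (id, size) header, so the writer needs no up-front TOC and a reader
can skip unknown ids in one pass. Byte order and identifiers are simplified
for illustration; this is not code from the series.]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Append one chunk record: 4-byte id, 8-byte size, then the payload. */
static size_t emit_chunk(unsigned char *out, uint32_t id,
			 const void *payload, uint64_t size)
{
	memcpy(out, &id, sizeof(id));       /* host byte order, for brevity */
	memcpy(out + 4, &size, sizeof(size));
	memcpy(out + 12, payload, size);
	return 12 + size;
}

/*
 * Scan the stream for the chunk with the given id; return the offset
 * of its payload, or -1 if absent. Unknown ids are skipped, much as
 * the current readers do.
 */
static long find_chunk(const unsigned char *buf, size_t len, uint32_t want)
{
	size_t pos = 0;
	while (pos + 12 <= len) {
		uint32_t id;
		uint64_t size;
		memcpy(&id, buf + pos, sizeof(id));
		memcpy(&size, buf + pos + 4, sizeof(size));
		if (id == want)
			return (long)(pos + 12);
		pos += 12 + size;
	}
	return -1;
}
```

The trade-off, as the rest of the thread notes, is that locating a chunk now
requires a linear scan instead of a TOC lookup.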

René
Taylor Blau Dec. 10, 2020, 12:50 a.m. UTC | #6
On Wed, Dec 09, 2020 at 06:13:18PM +0100, René Scharfe wrote:
> I'm not too familiar with the chunk producers and consumers, so I can
> only offer some high-level observations.  And I don't have to use the
> API, so go wild! ;)  I was just triggered by the appearance of two
> working pieces of code being replaced by two slightly different pieces
> of code plus a third one on top.

:-).

> > With regards to the "YAGNI" comment... I do have thoughts about
> > extending the reachability bitmap format to use chunks (of course, this
> > would break compatibility with JGit, and it isn't something that I plan
> > to do in the short-term, or even necessarily in the future).
> >
> > In any event, I'm sure that these two won't be the last chunk-based
> > formats that we have in Git.
>
> OK, so perhaps we can do better before this scheme is copied.  The write
> side is complicated by the fact that the table of contents (TOC) is
> written first, followed by the actual chunks.  That requires two passes
> over the data.

"Two passes" meaning that we have to both compute the size of and then
write the data? This is relatively cheap to do, at least so I think.

For e.g., the OIDLOOKUP commit-graph chunk is just the_hash_algo->hashsz
* commits->nr bytes wide, so that can be done in constant time. A more
heavyweight case might be for e.g., the Bloom data section, where Bloom
filters have to first be computed, their lengths accounted for, and
_then_ written when we eventually get to writing that chunk.

This happens in compute_bloom_filters(); and write_chunk_bloom_indexes()
+ write_chunk_bloom_data(), respectively. Those Bloom filters are all
stored in a commit slab until they are written, so these "two passes"
are just paid for in memory.
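[Editorial note: for concreteness, the constant-time size computation
described above can be sketched as below. The parameters stand in for
the_hash_algo->hashsz and the commit count; the function names themselves
are hypothetical.]

```c
#include <assert.h>
#include <stdint.h>

/* OID fanout: 256 cumulative 4-byte counts, one per leading hash byte. */
static uint64_t oid_fanout_chunk_size(void)
{
	return 256 * sizeof(uint32_t);
}

/* OID lookup: one full object hash per commit. */
static uint64_t oid_lookup_chunk_size(uint64_t hashsz, uint64_t nr_commits)
{
	return hashsz * nr_commits;
}
```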

> The ZIP format solved a similar issue by placing the TOC at the end,
> which allows for one-pass streaming.  Another way to achieve that would
> be to put the TOC in a separate file, like we do for .pack and .idx
> files.  This way you could have a single write function for chunks, and
> writers would just be a single sequence of calls for the different
> types.

Interesting. I'm not opposed to changing any of these formats (and maybe
there is some compelling argument for doing so, I am not sure) but I
think that unifying the implementation for reading / writing the chunk
format _before_ changing it is a positive step.

> But seeing that the read side just loads all of the chunks anyway
> (skipping unknown IDs) I wonder why we need a TOC at all.  That would
> only be useful if callers were trying to read just some small subset
> of the whole file.  A collection of chunks for easy dumping and loading
> could be serialized by writing just a small header for each chunk
> containing its type and size followed by its payload.

AFAIK, we do use the table of contents to locate where the chunks are so
that we can for e.g., set up the commit_graph structure's pointers to
point at each chunk appropriately.

> René

Thanks,
Taylor
Derrick Stolee Dec. 10, 2020, 2:30 p.m. UTC | #7
On 12/9/2020 7:50 PM, Taylor Blau wrote:
> On Wed, Dec 09, 2020 at 06:13:18PM +0100, René Scharfe wrote:
>> I'm not too familiar with the chunk producers and consumers, so I can
>> only offer some high-level observations.  And I don't have to use the
>> API, so go wild! ;)  I was just triggered by the appearance of two
>> working pieces of code being replaced by two slightly different pieces
>> of code plus a third one on top.
> 
> :-).
> 
>>> With regards to the "YAGNI" comment... I do have thoughts about
>>> extending the reachability bitmap format to use chunks (of course, this
>>> would break compatibility with JGit, and it isn't something that I plan
>>> to do in the short-term, or even necessarily in the future).
>>>
>>> In any event, I'm sure that these two won't be the last chunk-based
>>> formats that we have in Git.
>>
>> OK, so perhaps we can do better before this scheme is copied.  The write
>> side is complicated by the fact that the table of contents (TOC) is
>> written first, followed by the actual chunks.  That requires two passes
>> over the data.
> 
> "Two passes" meaning that we have to both compute the size of and then
> write the data? This is relatively cheap to do, at least so I think.
> 
> For e.g., the OIDLOOKUP commit-graph chunk is just the_hash_algo->hashsz
> * commits->nr bytes wide, so that can be done in constant time. A more
> heavyweight case might be for e.g., the Bloom data section, where Bloom
> filters have to first be computed, their lengths accounted for, and
> _then_ written when we eventually get to writing that chunk.
> 
> This happens in compute_bloom_filters(); and write_chunk_bloom_indexes()
> + write_chunk_bloom_data(), respectively. Those Bloom filters are all
> stored in a commit slab until they are written, so these "two passes"
> are just paid for in memory.

The current design of the format (TOC first) does require that we can
predict the chunk sizes before we start writing the file. But also this
has _some_ desirable properties for the writer. Specifically, we keep
the file write handle open for a shorter time. How valuable is that?
:shrug:

>> The ZIP format solved a similar issue by placing the TOC at the end,
>> which allows for one-pass streaming.  Another way to achieve that would
>> be to put the TOC in a separate file, like we do for .pack and .idx
>> files.  This way you could have a single write function for chunks, and
>> writers would just be a single sequence of calls for the different
>> types.
> 
> Interesting. I'm not opposed to changing any of these formats (and maybe
> there is some compelling argument for doing so, I am not sure) but I
> think that unifying the implementation for reading / writing the chunk
> format _before_ changing it is a postive step.

Changing the TOC location would require a version increment in the file
formats, which is a bit painful.

I understand why the ZIP format does this, it is trying to stream data
through a compression algorithm and cannot store the result in memory
before writing. Perhaps we would want to consider that as a way to
reduce the memory load for things like changed-path Bloom filters, but
let's wait for that to be an actually noticeable problem before making
the change.

The TOC location is definitely optimized for readers, who are already
reading the initial header to discover some info about the file.

>> But seeing that the read side just loads all of the chunks anyway
>> (skipping unknown IDs) I wonder why we need a TOC at all.  That would
>> only be useful if callers were trying to read just some small subset
>> of the whole file.  A collection of chunks for easy dumping and loading
>> could be serialized by writing just a small header for each chunk
>> containing its type and size followed by its payload.
> 
> AFAIK, we do use the table of contents to locate where the chunks are so
> that we can for e.g., set up the commit_graph structure's pointers to
> point at each chunk appropriately.

The chunk-based format is really optimized for _structured data_ where
these sizes are mostly predictable. The chunk sizes that are not
predictable are things like the "extra edges" or changed-path Bloom
filter data, but that data is indexed by a structured chunk.

A natural fit for the chunk-based format is the reachability bitmap
format, since the current format requires a scan to discover which
commits have bitmaps. If we used chunks, then we could quickly
search a lookup table for the commits that have bitmaps, then navigate
to the binary data chunk for the bitmap itself.

Thanks,
-Stolee