[v5,02/17] pack-mtimes: support reading .mtimes files

Message ID	91a9d21b0b7d99023083c0bbb6f91ccdc1782736.1653088640.git.me@ttaylorr.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <git-owner@kernel.org> Date: Fri, 20 May 2022 19:17:35 -0400 From: Taylor Blau <me@ttaylorr.com> To: git@vger.kernel.org Cc: avarab@gmail.com, derrickstolee@github.com, gitster@pobox.com, jrnieder@gmail.com, larsxschneider@gmail.com, tytso@mit.edu Subject: [PATCH v5 02/17] pack-mtimes: support reading .mtimes files Message-ID: <91a9d21b0b7d99023083c0bbb6f91ccdc1782736.1653088640.git.me@ttaylorr.com> References: <cover.1638224692.git.me@ttaylorr.com> <cover.1653088640.git.me@ttaylorr.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <cover.1653088640.git.me@ttaylorr.com> Precedence: bulk
Series	cruft packs \| expand [v5,00/17] cruft packs [v5,01/17] Documentation/technical: add cruft-packs.txt [v5,02/17] pack-mtimes: support reading .mtimes files [v5,03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles' [v5,04/17] chunk-format.h: extract oid_version() [v5,05/17] pack-mtimes: support writing pack .mtimes files [v5,06/17] t/helper: add 'pack-mtimes' test-tool [v5,07/17] builtin/pack-objects.c: return from create_object_entry() [v5,08/17] builtin/pack-objects.c: --cruft without expiration [v5,09/17] reachable: add options to add_unseen_recent_objects_to_traversal [v5,10/17] reachable: report precise timestamps from objects in cruft packs [v5,11/17] builtin/pack-objects.c: --cruft with expiration [v5,12/17] builtin/repack.c: support generating a cruft pack [v5,13/17] builtin/repack.c: allow configuring cruft pack generation [v5,14/17] builtin/repack.c: use named flags for existing_packs [v5,15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack [v5,16/17] builtin/gc.c: conditionally avoid pruning objects via loose [v5,17/17] sha1-file.c: don't freshen cruft packs

Taylor Blau May 20, 2022, 11:17 p.m. UTC

To store the individual mtimes of objects in a cruft pack, introduce a
new `.mtimes` format that can optionally accompany a single pack in the
repository.

The format is defined in Documentation/technical/pack-format.txt, and
stores a 4-byte network order timestamp for each object in name (index)
order.

This patch prepares for cruft packs by defining the `.mtimes` format,
and introducing a basic API that callers can use to read out individual
mtimes.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/technical/pack-format.txt |  19 ++++
 Makefile                                |   1 +
 builtin/repack.c                        |   1 +
 object-store.h                          |   5 +-
 pack-mtimes.c                           | 126 ++++++++++++++++++++++++
 pack-mtimes.h                           |  15 +++
 packfile.c                              |  19 +++-
 7 files changed, 183 insertions(+), 3 deletions(-)
 create mode 100644 pack-mtimes.c
 create mode 100644 pack-mtimes.h

Jonathan Nieder May 24, 2022, 7:32 p.m. UTC | #1

Hi,

Taylor Blau wrote:

> This patch prepares for cruft packs by defining the `.mtimes` format,
> and introducing a basic API that callers can use to read out individual
> mtimes.

Makes sense.  Does this intend to produce any functional change?  I'm
guessing not (and the lack of tests agrees), but the commit message
doesn't say so.

By the way, is this something we could cover in tests, e.g. using a
test helper that exercises the new code?

[...]
> --- a/Documentation/technical/pack-format.txt
> +++ b/Documentation/technical/pack-format.txt
> @@ -294,6 +294,25 @@ Pack file entry: <+
>  
>  All 4-byte numbers are in network order.
>  
> +== pack-*.mtimes files have the format:
> +
> +All 4-byte numbers are in network byte order.
> +
> +  - A 4-byte magic number '0x4d544d45' ('MTME').
> +
> +  - A 4-byte version identifier (= 1).
> +
> +  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
> +
> +  - A table of 4-byte unsigned integers. The ith value is the
> +    modification time (mtime) of the ith object in the corresponding
> +    pack by lexicographic (index) order. The mtimes count standard
> +    epoch seconds.
> +
> +  - A trailer, containing a checksum of the corresponding packfile,
> +    and a checksum of all of the above (each having length according
> +    to the specified hash function).
> +

This describes the "syntax" but not the "semantics" of the file.
Should I look to a separate piece of documentation for the semantics?
If so, can this one include a mention of that piece of documentation
to make it easier to find?

[...]
> --- a/object-store.h
> +++ b/object-store.h
> @@ -115,12 +115,15 @@ struct packed_git {
>  		 freshened:1,
>  		 do_not_close:1,
>  		 pack_promisor:1,
> -		 multi_pack_index:1;
> +		 multi_pack_index:1,
> +		 is_cruft:1;
>  	unsigned char hash[GIT_MAX_RAWSZ];
>  	struct revindex_entry *revindex;
>  	const uint32_t *revindex_data;
>  	const uint32_t *revindex_map;
>  	size_t revindex_size;
> +	const uint32_t *mtimes_map;
> +	size_t mtimes_size;

What does mtimes_map contain?  A comment would help.


> --- /dev/null
> +++ b/pack-mtimes.c
> @@ -0,0 +1,126 @@
> +#include "pack-mtimes.h"
> +#include "object-store.h"
> +#include "packfile.h"

Missing #include of git-compat-util.h.

> +
> +static char *pack_mtimes_filename(struct packed_git *p)
> +{
> +	size_t len;
> +	if (!strip_suffix(p->pack_name, ".pack", &len))
> +		BUG("pack_name does not end in .pack");
> +	/* NEEDSWORK: this could reuse code from pack-revindex.c. */
> +	return xstrfmt("%.*s.mtimes", (int)len, p->pack_name);
> +}

This seems simple enough that it's not obvious we need more code
sharing.  Do you agree?  If so, I'd suggest just removing the
NEEDSWORK comment.

> +
> +#define MTIMES_HEADER_SIZE (12)
> +#define MTIMES_MIN_SIZE (MTIMES_HEADER_SIZE + (2 * the_hash_algo->rawsz))

Hm, the all-caps name makes this feel like a compile-time constant but
it contains a reference to the_hash_algo.  Could it be an inline
function instead?

> +
> +struct mtimes_header {
> +	uint32_t signature;
> +	uint32_t version;
> +	uint32_t hash_id;
> +};
> +
> +static int load_pack_mtimes_file(char *mtimes_file,
> +				 uint32_t num_objects,
> +				 const uint32_t **data_p, size_t *len_p)

What does this function do?  A comment would help.

> +{
> +	int fd, ret = 0;
> +	struct stat st;
> +	void *data = NULL;
> +	size_t mtimes_size;
> +	struct mtimes_header header;
> +	uint32_t *hdr;
> +
> +	fd = git_open(mtimes_file);
> +
> +	if (fd < 0) {

nit: this would be more readable without the blank line between
setting and checking fd (likewise for the other examples below).
> +		ret = -1;
> +		goto cleanup;
> +	}

[...]
> +	if (mtimes_size - MTIMES_MIN_SIZE != st_mult(sizeof(uint32_t), num_objects)) {

This presupposes that the hash_id matches the_hash_algo.  Maybe worth
a NEEDSWORK comment.

[...]
> +cleanup:
> +	if (ret) {
> +		if (data)
> +			munmap(data, mtimes_size);
> +	} else {
> +		*len_p = mtimes_size;
> +		*data_p = (const uint32_t *)data;

Do we know that 'data' is uint32_t aligned?  Casting earlier in the
function could make that more obvious.

[...]
> +int load_pack_mtimes(struct packed_git *p)

This could use a doc comment in the header file.  For example, what
requirements do we have on what the caller passes as 'p'?

[...]
> +uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos)

Likewise.

[...]
> --- a/packfile.c
> +++ b/packfile.c
[...]
> @@ -363,7 +373,7 @@ void close_object_store(struct raw_object_store *o)
>  
>  void unlink_pack_path(const char *pack_name, int force_delete)
>  {
> -	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor"};
> +	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor", ".mtimes"};

Are these in any particular order?  Should they be?

[...]
> @@ -718,6 +728,10 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
>  	if (!access(p->pack_name, F_OK))
>  		p->pack_promisor = 1;
>  
> +	xsnprintf(p->pack_name + path_len, alloc - path_len, ".mtimes");
> +	if (!access(p->pack_name, F_OK))
> +		p->is_cruft = 1;
> +
>  	xsnprintf(p->pack_name + path_len, alloc - path_len, ".pack");
>  	if (stat(p->pack_name, &st) || !S_ISREG(st.st_mode)) {
>  		free(p);
> @@ -869,7 +883,8 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
>  	    ends_with(file_name, ".pack") ||
>  	    ends_with(file_name, ".bitmap") ||
>  	    ends_with(file_name, ".keep") ||
> -	    ends_with(file_name, ".promisor"))
> +	    ends_with(file_name, ".promisor") ||
> +	    ends_with(file_name, ".mtimes"))

likewise

Thanks,
Jonathan

Randall S. Becker May 24, 2022, 7:44 p.m. UTC | #2

On May 24, 2022 3:32 PM, Taylor Blau wrote:
>Taylor Blau wrote:
>
>> This patch prepares for cruft packs by defining the `.mtimes` format,
>> and introducing a basic API that callers can use to read out
>> individual mtimes.
>
>Makes sense.  Does this intend to produce any functional change?  I'm
guessing
>not (and the lack of tests agrees), but the commit message doesn't say so.
>
>By the way, is this something we could cover in tests, e.g. using a test
helper that
>exercises the new code?
>
>[...]
>> --- a/Documentation/technical/pack-format.txt
>> +++ b/Documentation/technical/pack-format.txt
>> @@ -294,6 +294,25 @@ Pack file entry: <+
>>
>>  All 4-byte numbers are in network order.
>>
>> +== pack-*.mtimes files have the format:
>> +
>> +All 4-byte numbers are in network byte order.
>> +
>> +  - A 4-byte magic number '0x4d544d45' ('MTME').
>> +
>> +  - A 4-byte version identifier (= 1).
>> +
>> +  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
>> +
>> +  - A table of 4-byte unsigned integers. The ith value is the
>> +    modification time (mtime) of the ith object in the corresponding
>> +    pack by lexicographic (index) order. The mtimes count standard
>> +    epoch seconds.
>> +
>> +  - A trailer, containing a checksum of the corresponding packfile,
>> +    and a checksum of all of the above (each having length according
>> +    to the specified hash function).
>> +
>
>This describes the "syntax" but not the "semantics" of the file.
>Should I look to a separate piece of documentation for the semantics?
>If so, can this one include a mention of that piece of documentation to
make it
>easier to find?
>
>[...]
>> --- a/object-store.h
>> +++ b/object-store.h
>> @@ -115,12 +115,15 @@ struct packed_git {
>>  		 freshened:1,
>>  		 do_not_close:1,
>>  		 pack_promisor:1,
>> -		 multi_pack_index:1;
>> +		 multi_pack_index:1,
>> +		 is_cruft:1;
>>  	unsigned char hash[GIT_MAX_RAWSZ];
>>  	struct revindex_entry *revindex;
>>  	const uint32_t *revindex_data;
>>  	const uint32_t *revindex_map;
>>  	size_t revindex_size;
>> +	const uint32_t *mtimes_map;
>> +	size_t mtimes_size;
>
>What does mtimes_map contain?  A comment would help.
>
>
>> --- /dev/null
>> +++ b/pack-mtimes.c
>> @@ -0,0 +1,126 @@
>> +#include "pack-mtimes.h"
>> +#include "object-store.h"
>> +#include "packfile.h"
>
>Missing #include of git-compat-util.h.
>
>> +
>> +static char *pack_mtimes_filename(struct packed_git *p) {
>> +	size_t len;
>> +	if (!strip_suffix(p->pack_name, ".pack", &len))
>> +		BUG("pack_name does not end in .pack");
>> +	/* NEEDSWORK: this could reuse code from pack-revindex.c. */
>> +	return xstrfmt("%.*s.mtimes", (int)len, p->pack_name); }
>
>This seems simple enough that it's not obvious we need more code sharing.
Do
>you agree?  If so, I'd suggest just removing the NEEDSWORK comment.
>
>> +
>> +#define MTIMES_HEADER_SIZE (12)
>> +#define MTIMES_MIN_SIZE (MTIMES_HEADER_SIZE + (2 *
>> +the_hash_algo->rawsz))
>
>Hm, the all-caps name makes this feel like a compile-time constant but it
contains
>a reference to the_hash_algo.  Could it be an inline function instead?
>
>> +
>> +struct mtimes_header {
>> +	uint32_t signature;
>> +	uint32_t version;
>> +	uint32_t hash_id;
>> +};
>> +
>> +static int load_pack_mtimes_file(char *mtimes_file,
>> +				 uint32_t num_objects,
>> +				 const uint32_t **data_p, size_t *len_p)
>
>What does this function do?  A comment would help.
>
>> +{
>> +	int fd, ret = 0;
>> +	struct stat st;
>> +	void *data = NULL;
>> +	size_t mtimes_size;
>> +	struct mtimes_header header;
>> +	uint32_t *hdr;
>> +
>> +	fd = git_open(mtimes_file);
>> +
>> +	if (fd < 0) {
>
>nit: this would be more readable without the blank line between setting and
>checking fd (likewise for the other examples below).
>> +		ret = -1;
>> +		goto cleanup;
>> +	}
>
>[...]
>> +	if (mtimes_size - MTIMES_MIN_SIZE != st_mult(sizeof(uint32_t),
>> +num_objects)) {
>
>This presupposes that the hash_id matches the_hash_algo.  Maybe worth a
>NEEDSWORK comment.
>
>[...]
>> +cleanup:
>> +	if (ret) {
>> +		if (data)
>> +			munmap(data, mtimes_size);
>> +	} else {
>> +		*len_p = mtimes_size;
>> +		*data_p = (const uint32_t *)data;
>
>Do we know that 'data' is uint32_t aligned?  Casting earlier in the
function could
>make that more obvious.
>
>[...]
>> +int load_pack_mtimes(struct packed_git *p)
>
>This could use a doc comment in the header file.  For example, what
requirements
>do we have on what the caller passes as 'p'?
>
>[...]
>> +uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos)
>
>Likewise.
>
>[...]
>> --- a/packfile.c
>> +++ b/packfile.c
>[...]
>> @@ -363,7 +373,7 @@ void close_object_store(struct raw_object_store
>> *o)
>>
>>  void unlink_pack_path(const char *pack_name, int force_delete)  {
>> -	static const char *exts[] = {".pack", ".idx", ".rev", ".keep",
".bitmap",
>".promisor"};
>> +	static const char *exts[] = {".pack", ".idx", ".rev", ".keep",
>> +".bitmap", ".promisor", ".mtimes"};
>
>Are these in any particular order?  Should they be?
>
>[...]
>> @@ -718,6 +728,10 @@ struct packed_git *add_packed_git(const char *path,
>size_t path_len, int local)
>>  	if (!access(p->pack_name, F_OK))
>>  		p->pack_promisor = 1;
>>
>> +	xsnprintf(p->pack_name + path_len, alloc - path_len, ".mtimes");
>> +	if (!access(p->pack_name, F_OK))
>> +		p->is_cruft = 1;
>> +
>>  	xsnprintf(p->pack_name + path_len, alloc - path_len, ".pack");
>>  	if (stat(p->pack_name, &st) || !S_ISREG(st.st_mode)) {
>>  		free(p);
>> @@ -869,7 +883,8 @@ static void prepare_pack(const char *full_name,
size_t
>full_name_len,
>>  	    ends_with(file_name, ".pack") ||
>>  	    ends_with(file_name, ".bitmap") ||
>>  	    ends_with(file_name, ".keep") ||
>> -	    ends_with(file_name, ".promisor"))
>> +	    ends_with(file_name, ".promisor") ||
>> +	    ends_with(file_name, ".mtimes"))
>
>likewise

I am again concerned about 32-bit time_t assumptions. time_t is 32-bit on
some platforms, signed/unsigned, and sometimes 64-bit. We are talking about
potentially long-persistent files, as I understand this series, so we should
not be limiting times to end at 2038. That's only 16 years off and I would
wager that many clones that exist today will exist then.
--Randall

Taylor Blau May 24, 2022, 10:21 p.m. UTC | #3

On Tue, May 24, 2022 at 12:32:01PM -0700, Jonathan Nieder wrote:
> Hi,
>
> Taylor Blau wrote:
>
> > This patch prepares for cruft packs by defining the `.mtimes` format,
> > and introducing a basic API that callers can use to read out individual
> > mtimes.
>
> Makes sense.  Does this intend to produce any functional change?  I'm
> guessing not (and the lack of tests agrees), but the commit message
> doesn't say so.
>
> By the way, is this something we could cover in tests, e.g. using a
> test helper that exercises the new code?

This does not produce a functional change, no. This commit in isolation
adds a bunch of dead code that will be used (and tested) in the
following patches.

There is a test helper that is added (and then used extensively further
on in the series) four patches later, c.f., "t/helper: add 'pack-mtimes'
test-tool".

> [...]
> > --- a/Documentation/technical/pack-format.txt
> > +++ b/Documentation/technical/pack-format.txt
> > @@ -294,6 +294,25 @@ Pack file entry: <+
> >
> >  All 4-byte numbers are in network order.
> >
> > +== pack-*.mtimes files have the format:
> > +
> > +All 4-byte numbers are in network byte order.
> > +
> > +  - A 4-byte magic number '0x4d544d45' ('MTME').
> > +
> > +  - A 4-byte version identifier (= 1).
> > +
> > +  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
> > +
> > +  - A table of 4-byte unsigned integers. The ith value is the
> > +    modification time (mtime) of the ith object in the corresponding
> > +    pack by lexicographic (index) order. The mtimes count standard
> > +    epoch seconds.
> > +
> > +  - A trailer, containing a checksum of the corresponding packfile,
> > +    and a checksum of all of the above (each having length according
> > +    to the specified hash function).
> > +
>
> This describes the "syntax" but not the "semantics" of the file.
> Should I look to a separate piece of documentation for the semantics?
> If so, can this one include a mention of that piece of documentation
> to make it easier to find?
>
> [...]
> > --- a/object-store.h
> > +++ b/object-store.h
> > @@ -115,12 +115,15 @@ struct packed_git {
> >  		 freshened:1,
> >  		 do_not_close:1,
> >  		 pack_promisor:1,
> > -		 multi_pack_index:1;
> > +		 multi_pack_index:1,
> > +		 is_cruft:1;
> >  	unsigned char hash[GIT_MAX_RAWSZ];
> >  	struct revindex_entry *revindex;
> >  	const uint32_t *revindex_data;
> >  	const uint32_t *revindex_map;
> >  	size_t revindex_size;
> > +	const uint32_t *mtimes_map;
> > +	size_t mtimes_size;
>
> What does mtimes_map contain?  A comment would help.

It contains a pointer at the beginning of the mmapped region of the
.mtimes file, similar to revindex_map above it.

>
> > --- /dev/null
> > +++ b/pack-mtimes.c
> > @@ -0,0 +1,126 @@
> > +#include "pack-mtimes.h"
> > +#include "object-store.h"
> > +#include "packfile.h"
>
> Missing #include of git-compat-util.h.

Ah, good eyes: thanks.

Junio: would you like a replacement patch / a whole new copy of the
series / or can you amend this locally when queuing? Whatever is lowest
effort for you works for me.

> > +
> > +static char *pack_mtimes_filename(struct packed_git *p)
> > +{
> > +	size_t len;
> > +	if (!strip_suffix(p->pack_name, ".pack", &len))
> > +		BUG("pack_name does not end in .pack");
> > +	/* NEEDSWORK: this could reuse code from pack-revindex.c. */
> > +	return xstrfmt("%.*s.mtimes", (int)len, p->pack_name);
> > +}
>
> This seems simple enough that it's not obvious we need more code
> sharing.  Do you agree?  If so, I'd suggest just removing the
> NEEDSWORK comment.

Yeah, it is conceptually simple, though it feels like the sort of thing
that could benefit from not having to be written once for each
extension (hence the comment).

> > +
> > +#define MTIMES_HEADER_SIZE (12)
> > +#define MTIMES_MIN_SIZE (MTIMES_HEADER_SIZE + (2 * the_hash_algo->rawsz))
>
> Hm, the all-caps name makes this feel like a compile-time constant but
> it contains a reference to the_hash_algo.  Could it be an inline
> function instead?

Yes, it could be an inline function, but I don't think there is
necessarily anything wrong with it being a #define'd macro. There are
some other examples, e.g., RIDX_MIN_SIZE, MIDX_MIN_SIZE,
GRAPH_DATA_WIDTH, and PACK_SIZE_THRESHOLD (to name a few) which also use
the_hash_algo on the right-hand side of a `#define`.

> > +
> > +struct mtimes_header {
> > +	uint32_t signature;
> > +	uint32_t version;
> > +	uint32_t hash_id;
> > +};
> > +
> > +static int load_pack_mtimes_file(char *mtimes_file,
> > +				 uint32_t num_objects,
> > +				 const uint32_t **data_p, size_t *len_p)
>
> What does this function do?  A comment would help.

I know that I'm biased as the author of this code, but I think the
signature is clear here. At least, I'm not sure what information a
comment would add that the function name and its arguments don't already
convey.

> > +	if (mtimes_size - MTIMES_MIN_SIZE != st_mult(sizeof(uint32_t), num_objects)) {
>
> This presupposes that the hash_id matches the_hash_algo.  Maybe worth
> a NEEDSWORK comment.

Good catch.

> [...]
> > +cleanup:
> > +	if (ret) {
> > +		if (data)
> > +			munmap(data, mtimes_size);
> > +	} else {
> > +		*len_p = mtimes_size;
> > +		*data_p = (const uint32_t *)data;
>
> Do we know that 'data' is uint32_t aligned?  Casting earlier in the
> function could make that more obvious.

`data` is definitely uint32_t aligned, but this is a tradeoff, since if
we wrote:

    uint32_t *data = xmmap(...);

then I think we would have to change the case where ret is non-zero to be:

    if (data)
        munmap((void*)data, ...);

and likewise, data_p is const.

> > +int load_pack_mtimes(struct packed_git *p)
>
> This could use a doc comment in the header file.  For example, what
> requirements do we have on what the caller passes as 'p'?
>
> [...]
> > +uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos)
>
> Likewise.

Sure. I wonder when we should do that, though. I'm not trying to be
impatient to get this merged, but iterating on the documentation feels
like it could be done on top without having to re-send the substantive
parts of this series over and over.

> [...]
> > --- a/packfile.c
> > +++ b/packfile.c
> [...]
> > @@ -363,7 +373,7 @@ void close_object_store(struct raw_object_store *o)
> >
> >  void unlink_pack_path(const char *pack_name, int force_delete)
> >  {
> > -	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor"};
> > +	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor", ".mtimes"};
>
> Are these in any particular order?  Should they be?
>
> [...]

These aren't technically in any particular order (nor should they be),
though .idx should be first. I'm leaving it alone here
(semi-intentionally, since the race it opens up isn't related to this
series, and it's on my list to deal with after this code has settled).

> > @@ -718,6 +728,10 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
> >  	if (!access(p->pack_name, F_OK))
> >  		p->pack_promisor = 1;
> >
> > +	xsnprintf(p->pack_name + path_len, alloc - path_len, ".mtimes");
> > +	if (!access(p->pack_name, F_OK))
> > +		p->is_cruft = 1;
> > +
> >  	xsnprintf(p->pack_name + path_len, alloc - path_len, ".pack");
> >  	if (stat(p->pack_name, &st) || !S_ISREG(st.st_mode)) {
> >  		free(p);
> > @@ -869,7 +883,8 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
> >  	    ends_with(file_name, ".pack") ||
> >  	    ends_with(file_name, ".bitmap") ||
> >  	    ends_with(file_name, ".keep") ||
> > -	    ends_with(file_name, ".promisor"))
> > +	    ends_with(file_name, ".promisor") ||
> > +	    ends_with(file_name, ".mtimes"))
>
> likewise

No specific order here (since these are all OR'd together).

Thanks,
Taylor

Taylor Blau May 24, 2022, 10:25 p.m. UTC | #4

On Tue, May 24, 2022 at 03:44:00PM -0400, rsbecker@nexbridge.com wrote:
> I am again concerned about 32-bit time_t assumptions. time_t is 32-bit on
> some platforms, signed/unsigned, and sometimes 64-bit. We are talking about
> potentially long-persistent files, as I understand this series, so we should
> not be limiting times to end at 2038. That's only 16 years off and I would
> wager that many clones that exist today will exist then.

Note that we're using unsigned fields here, so we have until 2106 (see
my earlier response on this in
https://lore.kernel.org/git/YdiXecK6fAKl8++G@nand.local/).

Thanks,
Taylor

Randall S. Becker May 24, 2022, 11:24 p.m. UTC | #5

On May 24, 2022 6:25 PM ,Taylor Blau write:
>On Tue, May 24, 2022 at 03:44:00PM -0400, rsbecker@nexbridge.com wrote:
>> I am again concerned about 32-bit time_t assumptions. time_t is 32-bit
>> on some platforms, signed/unsigned, and sometimes 64-bit. We are
>> talking about potentially long-persistent files, as I understand this
>> series, so we should not be limiting times to end at 2038. That's only
>> 16 years off and I would wager that many clones that exist today will exist then.
>
>Note that we're using unsigned fields here, so we have until 2106 (see my earlier
>response on this in https://lore.kernel.org/git/YdiXecK6fAKl8++G@nand.local/).

I appreciate that, but 32-bit time_t is still signed on many platforms, so when cast, it still might, at some point in another series, cause issues. Please be cautious. I expect that this is the particular hill on which I will die.

Taylor Blau May 25, 2022, 12:07 a.m. UTC | #6

On Tue, May 24, 2022 at 07:24:14PM -0400, rsbecker@nexbridge.com wrote:
> On May 24, 2022 6:25 PM ,Taylor Blau write:
> >On Tue, May 24, 2022 at 03:44:00PM -0400, rsbecker@nexbridge.com wrote:
> >> I am again concerned about 32-bit time_t assumptions. time_t is 32-bit
> >> on some platforms, signed/unsigned, and sometimes 64-bit. We are
> >> talking about potentially long-persistent files, as I understand this
> >> series, so we should not be limiting times to end at 2038. That's only
> >> 16 years off and I would wager that many clones that exist today will exist then.
> >
> >Note that we're using unsigned fields here, so we have until 2106 (see my earlier
> >response on this in https://lore.kernel.org/git/YdiXecK6fAKl8++G@nand.local/).
>
> I appreciate that, but 32-bit time_t is still signed on many
> platforms, so when cast, it still might, at some point in another
> series, cause issues. Please be cautious. I expect that this is the
> particular hill on which I will die.

Randall S. Becker May 25, 2022, 12:20 a.m. UTC | #7

On May 24, 2022 8:08 PM, Taylor Blau wrote:
>On Tue, May 24, 2022 at 07:24:14PM -0400, rsbecker@nexbridge.com wrote:
>> On May 24, 2022 6:25 PM ,Taylor Blau write:
>> >On Tue, May 24, 2022 at 03:44:00PM -0400, rsbecker@nexbridge.com wrote:
>> >> I am again concerned about 32-bit time_t assumptions. time_t is
>> >> 32-bit on some platforms, signed/unsigned, and sometimes 64-bit. We
>> >> are talking about potentially long-persistent files, as I
>> >> understand this series, so we should not be limiting times to end
>> >> at 2038. That's only
>> >> 16 years off and I would wager that many clones that exist today will exist
>then.
>> >
>> >Note that we're using unsigned fields here, so we have until 2106
>> >(see my earlier response on this in
>https://lore.kernel.org/git/YdiXecK6fAKl8++G@nand.local/).
>>
>> I appreciate that, but 32-bit time_t is still signed on many
>> platforms, so when cast, it still might, at some point in another
>> series, cause issues. Please be cautious. I expect that this is the
>> particular hill on which I will die.

Jonathan Nieder May 25, 2022, 7:48 a.m. UTC | #8

Hi,

Taylor Blau wrote:
> On Tue, May 24, 2022 at 12:32:01PM -0700, Jonathan Nieder wrote:

>> Makes sense.  Does this intend to produce any functional change?  I'm
>> guessing not (and the lack of tests agrees), but the commit message
>> doesn't say so.
[...]
> This does not produce a functional change, no. This commit in isolation
> adds a bunch of dead code that will be used (and tested) in the
> following patches.
[...]
>> What does mtimes_map contain?  A comment would help.
>
> It contains a pointer at the beginning of the mmapped region of the
> .mtimes file, similar to revindex_map above it.

To be clear, in cases like this by "comment" I mean "in-code comment".
I.e., my interest is not that _I_ find out the answer but that the
code becomes more maintainable via the answer becoming easier to find.

[...]
>> This seems simple enough that it's not obvious we need more code
>> sharing.  Do you agree?  If so, I'd suggest just removing the
>> NEEDSWORK comment.
>
> Yeah, it is conceptually simple, though it feels like the sort of thing
> that could benefit from not having to be written once for each
> extension (hence the comment).

The reason I asked is that the NEEDSWORK here actually got in the way
of comprehension for me --- it made me wonder "is there some
complexity here I'm missing?"

That's why I'd suggest one of
- removing the NEEDSWORK comment
- going ahead and implementing the code sharing you mean, or
- fleshing out the NEEDSWORK comment so the reader can wonder less

>>> +
>>> +#define MTIMES_HEADER_SIZE (12)
>>> +#define MTIMES_MIN_SIZE (MTIMES_HEADER_SIZE + (2 * the_hash_algo->rawsz))
>>
>> Hm, the all-caps name makes this feel like a compile-time constant but
>> it contains a reference to the_hash_algo.  Could it be an inline
>> function instead?
>
> Yes, it could be an inline function, but I don't think there is
> necessarily anything wrong with it being a #define'd macro. There are
> some other examples, e.g., RIDX_MIN_SIZE, MIDX_MIN_SIZE,
> GRAPH_DATA_WIDTH, and PACK_SIZE_THRESHOLD (to name a few) which also use
> the_hash_algo on the right-hand side of a `#define`.

Those are due to an incomplete migration from use of the true constant
GIT_SHA1_RAWSZ to use of the dynamic value the_hash_algo->rawsz, no?
In other words, "other examples do it wrong" doesn't feel like a great
justification for making it worse in new code.

[...]
>>> +static int load_pack_mtimes_file(char *mtimes_file,
>>> +				 uint32_t num_objects,
>>> +				 const uint32_t **data_p, size_t *len_p)
>>
>> What does this function do?  A comment would help.
>
> I know that I'm biased as the author of this code, but I think the
> signature is clear here. At least, I'm not sure what information a
> comment would add that the function name and its arguments don't already
> convey.

Ah, thanks for this point of clarification.  What isn't clear from the
signature is
- when should I call this function?
- what does its return value represent?
- how does it handle errors?

I agree that the parameters are self-explanatory.

>>> +cleanup:
>>> +	if (ret) {
>>> +		if (data)
>>> +			munmap(data, mtimes_size);
>>> +	} else {
>>> +		*len_p = mtimes_size;
>>> +		*data_p = (const uint32_t *)data;
>>
>> Do we know that 'data' is uint32_t aligned?  Casting earlier in the
>> function could make that more obvious.
>
> `data` is definitely uint32_t aligned, but this is a tradeoff, since if
> we wrote:
>
>     uint32_t *data = xmmap(...);
>
> then I think we would have to change the case where ret is non-zero to be:
>
>     if (data)
>         munmap((void*)data, ...);
>
> and likewise, data_p is const.

Doing it that way sounds great to me.  That way, the type contains the
information we need up-front and the safety of the cast is obvious in
the place where the cast is needed.

(Although my understanding is also that in C it's fine to pass a
uint32_t* to a function expecting a void*, so the second cast would
also not be needed.)

[...]
>>> +int load_pack_mtimes(struct packed_git *p)
>>
>> This could use a doc comment in the header file.  For example, what
>> requirements do we have on what the caller passes as 'p'?
>>
>> [...]
>>> +uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos)
>>
>> Likewise.
>
> Sure. I wonder when we should do that, though. I'm not trying to be
> impatient to get this merged, but iterating on the documentation feels
> like it could be done on top without having to re-send the substantive
> parts of this series over and over.

In terms of re-sending patches, sending a "fixup!" patch with the
minor changes you want to make doesn't seem too problematic to me.  In
general a major benefit of code review is getting others' eyes on new
code from the standpoint of readability and maintainability; including
comments like this up front doesn't seem like a huge amount to ask
(versus getting those comments to be perfect, which would be
unreasonable to expect since it's not hard to update them over time).

> Thanks,
> Taylor

Thanks for looking it through.

Sincerely,
Jonathan

Ævar Arnfjörð Bjarmason May 25, 2022, 9:11 a.m. UTC | #9

On Tue, May 24 2022, Taylor Blau wrote:

> On Tue, May 24, 2022 at 07:24:14PM -0400, rsbecker@nexbridge.com wrote:
>> On May 24, 2022 6:25 PM ,Taylor Blau write:
>> >On Tue, May 24, 2022 at 03:44:00PM -0400, rsbecker@nexbridge.com wrote:
>> >> I am again concerned about 32-bit time_t assumptions. time_t is 32-bit
>> >> on some platforms, signed/unsigned, and sometimes 64-bit. We are
>> >> talking about potentially long-persistent files, as I understand this
>> >> series, so we should not be limiting times to end at 2038. That's only
>> >> 16 years off and I would wager that many clones that exist today will exist then.
>> >
>> >Note that we're using unsigned fields here, so we have until 2106 (see my earlier
>> >response on this in https://lore.kernel.org/git/YdiXecK6fAKl8++G@nand.local/).
>>
>> I appreciate that, but 32-bit time_t is still signed on many
>> platforms, so when cast, it still might, at some point in another
>> series, cause issues. Please be cautious. I expect that this is the
>> particular hill on which I will die.

Derrick Stolee May 25, 2022, 1:30 p.m. UTC | #10

On 5/25/2022 5:11 AM, Ævar Arnfjörð Bjarmason wrote:
> I must say that I really don't like this part of the format. Is it
> really necessary to optimize the storage space here in a way that leaves
> open questions about future time_t compatibility, and having to
> introduce the first use of unsigned 32 bit timestamps to git's codebase?

The commit-graph file format uses unsigned 34-bit timestamps (packed
with 30-bit topological levels in the CDAT chunk), so this "not-64-bit
signed timestamps" thing is something we've done before.
 
> Yes, this is its own self-contained format, so we don't *need* time_t
> here, but it's also really handy if we can eventually consistently use
> 64 time_t everywhere and not worry about any compatibility issues, or
> unsigned v.s. signed, or to create our own little ext4-like signed 32
> bit timestamp format.

We can also use a new file format version when it is necessary. We
have a lot of time to add that detail without overly complicating the
format right now.

> If we really are trying to micro-optimize storage space here I'm willing
> to bet that this is still a bad/premature optimization. There's much
> better ways to store this sort of data in a compact way if that's the
> concern. E.g. you'd store a 64 bit "base" timestamp in the header for
> the first entry, and have smaller (signed) "delta" timestamps storing
> offsets from that "base" timestamp.

This is a good idea for a v2 format when that is necessary.

Thanks,
-Stolee

Taylor Blau May 25, 2022, 9:13 p.m. UTC | #11

On Wed, May 25, 2022 at 09:30:55AM -0400, Derrick Stolee wrote:
> On 5/25/2022 5:11 AM, Ævar Arnfjörð Bjarmason wrote:
> > I must say that I really don't like this part of the format. Is it
> > really necessary to optimize the storage space here in a way that leaves
> > open questions about future time_t compatibility, and having to
> > introduce the first use of unsigned 32 bit timestamps to git's codebase?
>
> The commit-graph file format uses unsigned 34-bit timestamps (packed
> with 30-bit topological levels in the CDAT chunk), so this "not-64-bit
> signed timestamps" thing is something we've done before.
>
> > Yes, this is its own self-contained format, so we don't *need* time_t
> > here, but it's also really handy if we can eventually consistently use
> > 64 time_t everywhere and not worry about any compatibility issues, or
> > unsigned v.s. signed, or to create our own little ext4-like signed 32
> > bit timestamp format.
>
> We can also use a new file format version when it is necessary. We
> have a lot of time to add that detail without overly complicating the
> format right now.
>
> > If we really are trying to micro-optimize storage space here I'm willing
> > to bet that this is still a bad/premature optimization. There's much
> > better ways to store this sort of data in a compact way if that's the
> > concern. E.g. you'd store a 64 bit "base" timestamp in the header for
> > the first entry, and have smaller (signed) "delta" timestamps storing
> > offsets from that "base" timestamp.
>
> This is a good idea for a v2 format when that is necessary.

I agree here.

I'm not opposed to such a change (or even being the one to work on it!),
but I would encourage us to pursue that change outside of this series,
since it can easily be done on top.

Of course, if we ever did decide to implement 64-bit mtimes, we would
have to maintain support for reading both the 32-bit and 64-bit values.
But I think the code is well-equipped to do that, and it could be done
on top without significant additional complexity.

Thanks,
Taylor

Taylor Blau May 25, 2022, 9:36 p.m. UTC | #12

On Wed, May 25, 2022 at 12:48:54AM -0700, Jonathan Nieder wrote:
> >> What does mtimes_map contain?  A comment would help.
> >
> > It contains a pointer at the beginning of the mmapped region of the
> > .mtimes file, similar to revindex_map above it.
>
> To be clear, in cases like this by "comment" I mean "in-code comment".
> I.e., my interest is not that _I_ find out the answer but that the
> code becomes more maintainable via the answer becoming easier to find.

OK. I'll add a comment in the fixup! patch which I'm about to send.

> [...]
> >> This seems simple enough that it's not obvious we need more code
> >> sharing.  Do you agree?  If so, I'd suggest just removing the
> >> NEEDSWORK comment.
> >
> > Yeah, it is conceptually simple, though it feels like the sort of thing
> > that could benefit from not having to be written once for each
> > extension (hence the comment).
>
> The reason I asked is that the NEEDSWORK here actually got in the way
> of comprehension for me --- it made me wonder "is there some
> complexity here I'm missing?"
>
> That's why I'd suggest one of
> - removing the NEEDSWORK comment
> - going ahead and implementing the code sharing you mean, or
> - fleshing out the NEEDSWORK comment so the reader can wonder less

I am a little sad to remove it, since I thought it was useful as-is. But
I can just as easily remember to come back to this myself in the future,
so if it is distracting to you in the meantime, then I don't mind
holding onto it in my own head.

> >>> +
> >>> +#define MTIMES_HEADER_SIZE (12)
> >>> +#define MTIMES_MIN_SIZE (MTIMES_HEADER_SIZE + (2 * the_hash_algo->rawsz))
> >>
> >> Hm, the all-caps name makes this feel like a compile-time constant but
> >> it contains a reference to the_hash_algo.  Could it be an inline
> >> function instead?
> >
> > Yes, it could be an inline function, but I don't think there is
> > necessarily anything wrong with it being a #define'd macro. There are
> > some other examples, e.g., RIDX_MIN_SIZE, MIDX_MIN_SIZE,
> > GRAPH_DATA_WIDTH, and PACK_SIZE_THRESHOLD (to name a few) which also use
> > the_hash_algo on the right-hand side of a `#define`.
>
> Those are due to an incomplete migration from use of the true constant
> GIT_SHA1_RAWSZ to use of the dynamic value the_hash_algo->rawsz, no?
> In other words, "other examples do it wrong" doesn't feel like a great
> justification for making it worse in new code.

Fair point. I can imagine reasons for the existing pattern, but updating
it to handle the variable rawsz is easy to do (and it probably should
have been that way since the beginning).

> [...]
> >>> +static int load_pack_mtimes_file(char *mtimes_file,
> >>> +				 uint32_t num_objects,
> >>> +				 const uint32_t **data_p, size_t *len_p)
> >>
> >> What does this function do?  A comment would help.
> >
> > I know that I'm biased as the author of this code, but I think the
> > signature is clear here. At least, I'm not sure what information a
> > comment would add that the function name and its arguments don't already
> > convey.
>
> Ah, thanks for this point of clarification.  What isn't clear from the
> signature is
> - when should I call this function?
> - what does its return value represent?
> - how does it handle errors?
>
> I agree that the parameters are self-explanatory.

I'm hesitant to over-document a static function with a single caller,
but when looking at this, I think there is an opportunity to document
_its_ caller (`load_pack_mtimes()`) which isn't static, but was also
missing documentation.

> >>> +cleanup:
> >>> +	if (ret) {
> >>> +		if (data)
> >>> +			munmap(data, mtimes_size);
> >>> +	} else {
> >>> +		*len_p = mtimes_size;
> >>> +		*data_p = (const uint32_t *)data;
> >>
> >> Do we know that 'data' is uint32_t aligned?  Casting earlier in the
> >> function could make that more obvious.
> >
> > `data` is definitely uint32_t aligned, but this is a tradeoff, since if
> > we wrote:
> >
> >     uint32_t *data = xmmap(...);
> >
> > then I think we would have to change the case where ret is non-zero to be:
> >
> >     if (data)
> >         munmap((void*)data, ...);
> >
> > and likewise, data_p is const.
>
> Doing it that way sounds great to me.  That way, the type contains the
> information we need up-front and the safety of the cast is obvious in
> the place where the cast is needed.
>
> (Although my understanding is also that in C it's fine to pass a
> uint32_t* to a function expecting a void*, so the second cast would
> also not be needed.)
>
> [...]

Done, thanks for the suggestion.

Thanks,
Taylor

Randall S. Becker May 25, 2022, 9:58 p.m. UTC | #13

On May 25, 2022 5:37 PM, Taylor Blau wrote:
>On Wed, May 25, 2022 at 12:48:54AM -0700, Jonathan Nieder wrote:
>> >> What does mtimes_map contain?  A comment would help.
>> >
>> > It contains a pointer at the beginning of the mmapped region of the
>> > .mtimes file, similar to revindex_map above it.
>>
>> To be clear, in cases like this by "comment" I mean "in-code comment".
>> I.e., my interest is not that _I_ find out the answer but that the
>> code becomes more maintainable via the answer becoming easier to find.
>
>OK. I'll add a comment in the fixup! patch which I'm about to send.
>
>> [...]
>> >> This seems simple enough that it's not obvious we need more code
>> >> sharing.  Do you agree?  If so, I'd suggest just removing the
>> >> NEEDSWORK comment.
>> >
>> > Yeah, it is conceptually simple, though it feels like the sort of
>> > thing that could benefit from not having to be written once for each
>> > extension (hence the comment).
>>
>> The reason I asked is that the NEEDSWORK here actually got in the way
>> of comprehension for me --- it made me wonder "is there some
>> complexity here I'm missing?"
>>
>> That's why I'd suggest one of
>> - removing the NEEDSWORK comment
>> - going ahead and implementing the code sharing you mean, or
>> - fleshing out the NEEDSWORK comment so the reader can wonder less
>
>I am a little sad to remove it, since I thought it was useful as-is. But I can just as
>easily remember to come back to this myself in the future, so if it is distracting to
>you in the meantime, then I don't mind holding onto it in my own head.
>
>> >>> +
>> >>> +#define MTIMES_HEADER_SIZE (12)
>> >>> +#define MTIMES_MIN_SIZE (MTIMES_HEADER_SIZE + (2 *
>> >>> +the_hash_algo->rawsz))
>> >>
>> >> Hm, the all-caps name makes this feel like a compile-time constant
>> >> but it contains a reference to the_hash_algo.  Could it be an
>> >> inline function instead?
>> >
>> > Yes, it could be an inline function, but I don't think there is
>> > necessarily anything wrong with it being a #define'd macro. There
>> > are some other examples, e.g., RIDX_MIN_SIZE, MIDX_MIN_SIZE,
>> > GRAPH_DATA_WIDTH, and PACK_SIZE_THRESHOLD (to name a few) which
>also
>> > use the_hash_algo on the right-hand side of a `#define`.
>>
>> Those are due to an incomplete migration from use of the true constant
>> GIT_SHA1_RAWSZ to use of the dynamic value the_hash_algo->rawsz, no?
>> In other words, "other examples do it wrong" doesn't feel like a great
>> justification for making it worse in new code.
>
>Fair point. I can imagine reasons for the existing pattern, but updating it to handle
>the variable rawsz is easy to do (and it probably should have been that way since
>the beginning).
>
>> [...]
>> >>> +static int load_pack_mtimes_file(char *mtimes_file,
>> >>> +				 uint32_t num_objects,
>> >>> +				 const uint32_t **data_p, size_t *len_p)
>> >>
>> >> What does this function do?  A comment would help.
>> >
>> > I know that I'm biased as the author of this code, but I think the
>> > signature is clear here. At least, I'm not sure what information a
>> > comment would add that the function name and its arguments don't
>> > already convey.
>>
>> Ah, thanks for this point of clarification.  What isn't clear from the
>> signature is
>> - when should I call this function?
>> - what does its return value represent?
>> - how does it handle errors?
>>
>> I agree that the parameters are self-explanatory.
>
>I'm hesitant to over-document a static function with a single caller, but when
>looking at this, I think there is an opportunity to document _its_ caller
>(`load_pack_mtimes()`) which isn't static, but was also missing documentation.
>
>> >>> +cleanup:
>> >>> +	if (ret) {
>> >>> +		if (data)
>> >>> +			munmap(data, mtimes_size);
>> >>> +	} else {
>> >>> +		*len_p = mtimes_size;
>> >>> +		*data_p = (const uint32_t *)data;
>> >>
>> >> Do we know that 'data' is uint32_t aligned?  Casting earlier in the
>> >> function could make that more obvious.
>> >
>> > `data` is definitely uint32_t aligned, but this is a tradeoff, since
>> > if we wrote:
>> >
>> >     uint32_t *data = xmmap(...);
>> >
>> > then I think we would have to change the case where ret is non-zero to be:
>> >
>> >     if (data)
>> >         munmap((void*)data, ...);
>> >
>> > and likewise, data_p is const.
>>
>> Doing it that way sounds great to me.  That way, the type contains the
>> information we need up-front and the safety of the cast is obvious in
>> the place where the cast is needed.
>>
>> (Although my understanding is also that in C it's fine to pass a
>> uint32_t* to a function expecting a void*, so the second cast would
>> also not be needed.)

I do not think c99 allows this in 100% of cases - specifically if there a const void * involved. gcc does not care. I do not think c89 cares either. I will watch out for it when this is merged.
--Randall

Taylor Blau May 25, 2022, 10:59 p.m. UTC | #14

On Wed, May 25, 2022 at 05:58:49PM -0400, rsbecker@nexbridge.com wrote:
> >> > `data` is definitely uint32_t aligned, but this is a tradeoff, since
> >> > if we wrote:
> >> >
> >> >     uint32_t *data = xmmap(...);
> >> >
> >> > then I think we would have to change the case where ret is non-zero to be:
> >> >
> >> >     if (data)
> >> >         munmap((void*)data, ...);
> >> >
> >> > and likewise, data_p is const.
> >>
> >> Doing it that way sounds great to me.  That way, the type contains the
> >> information we need up-front and the safety of the cast is obvious in
> >> the place where the cast is needed.
> >>
> >> (Although my understanding is also that in C it's fine to pass a
> >> uint32_t* to a function expecting a void*, so the second cast would
> >> also not be needed.)
>
> I do not think c99 allows this in 100% of cases - specifically if
> there a const void * involved. gcc does not care. I do not think c89
> cares either. I will watch out for it when this is merged.

Thanks for the heads up. I looked through the results of "git grep '=
xmmap'" to see if we had contemporary examples of either assigning to a
non-'void *', or passing a non-'void *' variable to munmap.

Luckily, we have both, so this shouldn't cause a problem. fixup! patch
incoming shortly...

Thanks,
Taylor

Taylor Blau May 25, 2022, 11:02 p.m. UTC | #15

Junio,

On Fri, May 20, 2022 at 07:17:35PM -0400, Taylor Blau wrote:
> To store the individual mtimes of objects in a cruft pack, introduce a
> new `.mtimes` format that can optionally accompany a single pack in the
> repository.

Like I mentioned in this sub-thread, here is a small fixup! to apply on
top of this patch when queueing. I'm hoping this will be easier than
reapplying the dozen+ or so patches in this series (the rest of which
are unchanged). But if it isn't, please let me know and I can send you a
reroll of the whole thing.

In the meantime, here's the fixup...

--- 8< ---
Subject: [PATCH] fixup! pack-mtimes: support reading .mtimes files

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 object-store.h |  5 +++++
 pack-mtimes.c  | 35 +++++++++++++++++++----------------
 pack-mtimes.h  | 11 +++++++++++
 3 files changed, 35 insertions(+), 16 deletions(-)

diff --git a/object-store.h b/object-store.h
index 2c4671ed7a..05cc9a33ed 100644
--- a/object-store.h
+++ b/object-store.h
@@ -122,6 +122,11 @@ struct packed_git {
 	const uint32_t *revindex_data;
 	const uint32_t *revindex_map;
 	size_t revindex_size;
+	/*
+	 * mtimes_map points at the beginning of the memory mapped region of
+	 * this pack's corresponding .mtimes file, and mtimes_size is the size
+	 * of that .mtimes file
+	 */
 	const uint32_t *mtimes_map;
 	size_t mtimes_size;
 	/* something like ".git/objects/pack/xxxxx.pack" */
diff --git a/pack-mtimes.c b/pack-mtimes.c
index 46ad584af1..0e0aafdcb0 100644
--- a/pack-mtimes.c
+++ b/pack-mtimes.c
@@ -1,3 +1,4 @@
+#include "git-compat-util.h"
 #include "pack-mtimes.h"
 #include "object-store.h"
 #include "packfile.h"
@@ -7,12 +8,10 @@ static char *pack_mtimes_filename(struct packed_git *p)
 	size_t len;
 	if (!strip_suffix(p->pack_name, ".pack", &len))
 		BUG("pack_name does not end in .pack");
-	/* NEEDSWORK: this could reuse code from pack-revindex.c. */
 	return xstrfmt("%.*s.mtimes", (int)len, p->pack_name);
 }

 #define MTIMES_HEADER_SIZE (12)
-#define MTIMES_MIN_SIZE (MTIMES_HEADER_SIZE + (2 * the_hash_algo->rawsz))

 struct mtimes_header {
 	uint32_t signature;
@@ -26,10 +25,9 @@ static int load_pack_mtimes_file(char *mtimes_file,
 {
 	int fd, ret = 0;
 	struct stat st;
-	void *data = NULL;
-	size_t mtimes_size;
+	uint32_t *data = NULL;
+	size_t mtimes_size, expected_size;
 	struct mtimes_header header;
-	uint32_t *hdr;

 	fd = git_open(mtimes_file);

@@ -44,21 +42,16 @@ static int load_pack_mtimes_file(char *mtimes_file,

 	mtimes_size = xsize_t(st.st_size);

-	if (mtimes_size < MTIMES_MIN_SIZE) {
+	if (mtimes_size < MTIMES_HEADER_SIZE) {
 		ret = error(_("mtimes file %s is too small"), mtimes_file);
 		goto cleanup;
 	}

-	if (mtimes_size - MTIMES_MIN_SIZE != st_mult(sizeof(uint32_t), num_objects)) {
-		ret = error(_("mtimes file %s is corrupt"), mtimes_file);
-		goto cleanup;
-	}
+	data = xmmap(NULL, mtimes_size, PROT_READ, MAP_PRIVATE, fd, 0);

-	data = hdr = xmmap(NULL, mtimes_size, PROT_READ, MAP_PRIVATE, fd, 0);
-
-	header.signature = ntohl(hdr[0]);
-	header.version = ntohl(hdr[1]);
-	header.hash_id = ntohl(hdr[2]);
+	header.signature = ntohl(data[0]);
+	header.version = ntohl(data[1]);
+	header.hash_id = ntohl(data[2]);

 	if (header.signature != MTIMES_SIGNATURE) {
 		ret = error(_("mtimes file %s has unknown signature"), mtimes_file);
@@ -77,13 +70,23 @@ static int load_pack_mtimes_file(char *mtimes_file,
 		goto cleanup;
 	}

+
+	expected_size = MTIMES_HEADER_SIZE;
+	expected_size = st_add(expected_size, st_mult(sizeof(uint32_t), num_objects));
+	expected_size = st_add(expected_size, 2 * (header.hash_id == 1 ? GIT_SHA1_RAWSZ : GIT_SHA256_RAWSZ));
+
+	if (mtimes_size != expected_size) {
+		ret = error(_("mtimes file %s is corrupt"), mtimes_file);
+		goto cleanup;
+	}
+
 cleanup:
 	if (ret) {
 		if (data)
 			munmap(data, mtimes_size);
 	} else {
 		*len_p = mtimes_size;
-		*data_p = (const uint32_t *)data;
+		*data_p = data;
 	}

 	close(fd);
diff --git a/pack-mtimes.h b/pack-mtimes.h
index 38ddb9f893..cc957b3e85 100644
--- a/pack-mtimes.h
+++ b/pack-mtimes.h
@@ -8,8 +8,19 @@

 struct packed_git;

+/*
+ * Loads the .mtimes file corresponding to "p", if any, returning zero
+ * on success.
+ */
 int load_pack_mtimes(struct packed_git *p);

+/* Returns the mtime associated with the object at position "pos" (in
+ * lexicographic/index order) in pack "p".
+ *
+ * Note that it is a BUG() to call this function if either (a) "p" does
+ * not have a corresponding .mtimes file, or (b) it does, but it hasn't
+ * been loaded
+ */
 uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos);

 #endif
--
2.36.1.94.gb0d54bedca

--- >8 ---

Ævar Arnfjörð Bjarmason May 26, 2022, 12:02 a.m. UTC | #16

On Wed, May 25 2022, Taylor Blau wrote:

> On Wed, May 25, 2022 at 09:30:55AM -0400, Derrick Stolee wrote:
>> On 5/25/2022 5:11 AM, Ævar Arnfjörð Bjarmason wrote:
>> > I must say that I really don't like this part of the format. Is it
>> > really necessary to optimize the storage space here in a way that leaves
>> > open questions about future time_t compatibility, and having to
>> > introduce the first use of unsigned 32 bit timestamps to git's codebase?
>>
>> The commit-graph file format uses unsigned 34-bit timestamps (packed
>> with 30-bit topological levels in the CDAT chunk), so this "not-64-bit
>> signed timestamps" thing is something we've done before.
>>
>> > Yes, this is its own self-contained format, so we don't *need* time_t
>> > here, but it's also really handy if we can eventually consistently use
>> > 64 time_t everywhere and not worry about any compatibility issues, or
>> > unsigned v.s. signed, or to create our own little ext4-like signed 32
>> > bit timestamp format.
>>
>> We can also use a new file format version when it is necessary. We
>> have a lot of time to add that detail without overly complicating the
>> format right now.
>>
>> > If we really are trying to micro-optimize storage space here I'm willing
>> > to bet that this is still a bad/premature optimization. There's much
>> > better ways to store this sort of data in a compact way if that's the
>> > concern. E.g. you'd store a 64 bit "base" timestamp in the header for
>> > the first entry, and have smaller (signed) "delta" timestamps storing
>> > offsets from that "base" timestamp.
>>
>> This is a good idea for a v2 format when that is necessary.
>
> I agree here.
>
> I'm not opposed to such a change (or even being the one to work on it!),
> but I would encourage us to pursue that change outside of this series,
> since it can easily be done on top.
>
> Of course, if we ever did decide to implement 64-bit mtimes, we would
> have to maintain support for reading both the 32-bit and 64-bit values.
> But I think the code is well-equipped to do that, and it could be done
> on top without significant additional complexity.

Do you mean "on top" in the sense that we'd expect that before the next
release, so that we wouldn't need to deal with bumping the format, and
have some phase-out period for the older version etc.

Or that we would need to treat what's landing here as something we'll
need to support going forward?

I think if a format change is worthwhile doing at all that it's worth
just doing it now if it's going to be the latter of those, as changing
file formats before they're in the wild is easy, but after that it's at
best a bit tedious. E.g. we'll need testing to see how we deal with
mixed new/old format files etc. etc.

Taylor Blau May 26, 2022, 12:12 a.m. UTC | #17

On Thu, May 26, 2022 at 02:02:39AM +0200, Ævar Arnfjörð Bjarmason wrote:
> >> > If we really are trying to micro-optimize storage space here I'm willing
> >> > to bet that this is still a bad/premature optimization. There's much
> >> > better ways to store this sort of data in a compact way if that's the
> >> > concern. E.g. you'd store a 64 bit "base" timestamp in the header for
> >> > the first entry, and have smaller (signed) "delta" timestamps storing
> >> > offsets from that "base" timestamp.
> >>
> >> This is a good idea for a v2 format when that is necessary.
> >
> > I agree here.
> >
> > I'm not opposed to such a change (or even being the one to work on it!),
> > but I would encourage us to pursue that change outside of this series,
> > since it can easily be done on top.
> >
> > Of course, if we ever did decide to implement 64-bit mtimes, we would
> > have to maintain support for reading both the 32-bit and 64-bit values.
> > But I think the code is well-equipped to do that, and it could be done
> > on top without significant additional complexity.
>
> Do you mean "on top" in the sense that we'd expect that before the next
> release, so that we wouldn't need to deal with bumping the format, and
> have some phase-out period for the older version etc.
>
> Or that we would need to treat what's landing here as something we'll
> need to support going forward?

My plan is to treat what will hopefully land here as something we're
going to support.

I meant "on top" in the sense that the format implemented here does not
restrict us against making changes (like adding support for wider
records) in the future. IOW, I did not mean to suggest that we should
expect more patches from me in this cycle to deprecate parts of the v1
format.

In other words (again ;-)), I would like to see us ship this format with
the existing 32-bit records.

> I think if a format change is worthwhile doing at all that it's worth
> just doing it now if it's going to be the latter of those, as changing
> file formats before they're in the wild is easy, but after that it's at
> best a bit tedious. E.g. we'll need testing to see how we deal with
> mixed new/old format files etc. etc.

I can understand where you're coming from, though as I noted earlier in
the thread, I don't think changing the format in the manner you suggest
would be that difficult in practice.

But in the meantime, the existing format is useful and works, and I
don't think we should go back to the drawing board for something that we
can do later if we decide to.

Thanks,
Taylor

Junio C Hamano May 26, 2022, 12:30 a.m. UTC | #18

Taylor Blau <me@ttaylorr.com> writes:

> diff --git a/pack-mtimes.c b/pack-mtimes.c
> index 46ad584af1..0e0aafdcb0 100644
> --- a/pack-mtimes.c
> +++ b/pack-mtimes.c
> @@ -1,3 +1,4 @@
> +#include "git-compat-util.h"
>  #include "pack-mtimes.h"
>  #include "object-store.h"
>  #include "packfile.h"
> @@ -7,12 +8,10 @@ static char *pack_mtimes_filename(struct packed_git *p)
>  	size_t len;
>  	if (!strip_suffix(p->pack_name, ".pack", &len))
>  		BUG("pack_name does not end in .pack");
> -	/* NEEDSWORK: this could reuse code from pack-revindex.c. */
>  	return xstrfmt("%.*s.mtimes", (int)len, p->pack_name);
>  }
>
>  #define MTIMES_HEADER_SIZE (12)
> -#define MTIMES_MIN_SIZE (MTIMES_HEADER_SIZE + (2 * the_hash_algo->rawsz))
>
>  struct mtimes_header {
>  	uint32_t signature;
> @@ -26,10 +25,9 @@ static int load_pack_mtimes_file(char *mtimes_file,
>  {
>  	int fd, ret = 0;
>  	struct stat st;
> -	void *data = NULL;
> -	size_t mtimes_size;
> +	uint32_t *data = NULL;
> +	size_t mtimes_size, expected_size;
>  	struct mtimes_header header;
> -	uint32_t *hdr;
>
>  	fd = git_open(mtimes_file);
>
> @@ -44,21 +42,16 @@ static int load_pack_mtimes_file(char *mtimes_file,
>
>  	mtimes_size = xsize_t(st.st_size);
>
> -	if (mtimes_size < MTIMES_MIN_SIZE) {
> +	if (mtimes_size < MTIMES_HEADER_SIZE) {
>  		ret = error(_("mtimes file %s is too small"), mtimes_file);
>  		goto cleanup;
>  	}
>
> -	if (mtimes_size - MTIMES_MIN_SIZE != st_mult(sizeof(uint32_t), num_objects)) {
> -		ret = error(_("mtimes file %s is corrupt"), mtimes_file);
> -		goto cleanup;
> -	}
> +	data = xmmap(NULL, mtimes_size, PROT_READ, MAP_PRIVATE, fd, 0);
>
> -	data = hdr = xmmap(NULL, mtimes_size, PROT_READ, MAP_PRIVATE, fd, 0);
> -
> -	header.signature = ntohl(hdr[0]);
> -	header.version = ntohl(hdr[1]);
> -	header.hash_id = ntohl(hdr[2]);
> +	header.signature = ntohl(data[0]);
> +	header.version = ntohl(data[1]);
> +	header.hash_id = ntohl(data[2]);

So, instead of assuming that the size of the file is cast in stone
to match the size the current implementation happens to give and
reject a file from a future version, we check the header first to
give a more readable error when we see a version of the file that
we do not understand.

Makes sense.

At least, "here is a small fixup!" should have been accompanied by a
brief explanation to say something like that, i.e. why a fixup is
needed, what shortcoming in the original it is meant to address,
etc.

Will queue between 2/17 and 3/17 without squashing (yet).

Thanks.

>  	if (header.signature != MTIMES_SIGNATURE) {
>  		ret = error(_("mtimes file %s has unknown signature"), mtimes_file);
> @@ -77,13 +70,23 @@ static int load_pack_mtimes_file(char *mtimes_file,
>  		goto cleanup;
>  	}
>
> +
> +	expected_size = MTIMES_HEADER_SIZE;
> +	expected_size = st_add(expected_size, st_mult(sizeof(uint32_t), num_objects));
> +	expected_size = st_add(expected_size, 2 * (header.hash_id == 1 ? GIT_SHA1_RAWSZ : GIT_SHA256_RAWSZ));
> +
> +	if (mtimes_size != expected_size) {
> +		ret = error(_("mtimes file %s is corrupt"), mtimes_file);
> +		goto cleanup;
> +	}
> +
>  cleanup:
>  	if (ret) {
>  		if (data)
>  			munmap(data, mtimes_size);
>  	} else {
>  		*len_p = mtimes_size;
> -		*data_p = (const uint32_t *)data;
> +		*data_p = data;
>  	}
>
>  	close(fd);
> diff --git a/pack-mtimes.h b/pack-mtimes.h
> index 38ddb9f893..cc957b3e85 100644
> --- a/pack-mtimes.h
> +++ b/pack-mtimes.h
> @@ -8,8 +8,19 @@
>
>  struct packed_git;
>
> +/*
> + * Loads the .mtimes file corresponding to "p", if any, returning zero
> + * on success.
> + */
>  int load_pack_mtimes(struct packed_git *p);
>
> +/* Returns the mtime associated with the object at position "pos" (in
> + * lexicographic/index order) in pack "p".
> + *
> + * Note that it is a BUG() to call this function if either (a) "p" does
> + * not have a corresponding .mtimes file, or (b) it does, but it hasn't
> + * been loaded
> + */
>  uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos);
>
>  #endif
> --
> 2.36.1.94.gb0d54bedca
>
> --- >8 ---

Andreas Schwab June 1, 2023, 1:01 p.m. UTC | #19

On Mai 20 2022, Taylor Blau wrote:

> diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
> index 6d3efb7d16..b520aa9c45 100644
> --- a/Documentation/technical/pack-format.txt
> +++ b/Documentation/technical/pack-format.txt
> @@ -294,6 +294,25 @@ Pack file entry: <+
>  
>  All 4-byte numbers are in network order.
>  
> +== pack-*.mtimes files have the format:
> +
> +All 4-byte numbers are in network byte order.
> +
> +  - A 4-byte magic number '0x4d544d45' ('MTME').

This is identified by file(1) as "Multitracker Version 4.05". ;-)

[v5,02/17] pack-mtimes: support reading .mtimes files

Commit Message

Comments

Patch