[v2,12/17] chunk-format: create read chunk API

Message ID	d8d8e9e2aa3faf0fdda5dc688fb92e924fec423a.1611759716.git.gitgitgadget@gmail.com (mailing list archive)
State	Superseded
Headers	show Return-Path: <git-owner@kernel.org> Message-Id: <d8d8e9e2aa3faf0fdda5dc688fb92e924fec423a.1611759716.git.gitgitgadget@gmail.com> In-Reply-To: <pull.848.v2.git.1611759716.gitgitgadget@gmail.com> References: <pull.848.git.1611676886.gitgitgadget@gmail.com> <pull.848.v2.git.1611759716.gitgitgadget@gmail.com> Date: Wed, 27 Jan 2021 15:01:51 +0000 Subject: [PATCH v2 12/17] chunk-format: create read chunk API Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MIME-Version: 1.0 To: git@vger.kernel.org Cc: me@ttaylorr.com, gitster@pobox.com, l.s.r@web.de, szeder.dev@gmail.com, Chris Torek <chris.torek@gmail.com>, Derrick Stolee <stolee@gmail.com>, Derrick Stolee <derrickstolee@github.com>, Derrick Stolee <dstolee@microsoft.com> Precedence: bulk From: Derrick Stolee <dstolee@microsoft.com>
Series	Refactor chunk-format into an API \| expand [v2,00/17] Refactor chunk-format into an API [v2,01/17] commit-graph: anonymize data in chunk_write_fn [v2,02/17] chunk-format: create chunk format write API [v2,03/17] commit-graph: use chunk-format write API [v2,04/17] midx: rename pack_info to write_midx_context [v2,05/17] midx: use context in write_midx_pack_names() [v2,06/17] midx: add entries to write_midx_context [v2,07/17] midx: add pack_perm to write_midx_context [v2,08/17] midx: add num_large_offsets to write_midx_context [v2,09/17] midx: return success/failure in chunk write methods [v2,10/17] midx: drop chunk progress during write [v2,11/17] midx: use chunk-format API in write_midx_internal() [v2,12/17] chunk-format: create read chunk API [v2,13/17] commit-graph: use chunk-format read API [v2,14/17] midx: use chunk-format read API [v2,15/17] midx: use 64-bit multiplication for chunk sizes [v2,16/17] chunk-format: restore duplicate chunk checks [v2,17/17] chunk-format: add technical docs

Derrick Stolee Jan. 27, 2021, 3:01 p.m. UTC

From: Derrick Stolee <dstolee@microsoft.com>

Add the capability to read the table of contents, then pair the chunks
with necessary logic using read_chunk_fn pointers. Callers will be added
in future changes, but the typical outline will be:

 1. initialize a 'struct chunkfile' with init_chunkfile(NULL).
 2. call read_table_of_contents().
 3. for each chunk to parse,
    a. call pair_chunk() to assign a pointer with the chunk position, or
    b. call read_chunk() to run a callback on the chunk start and size.
 4. call free_chunkfile() to clear the 'struct chunkfile' data.

We are re-using the anonymous 'struct chunkfile' data, as it is internal
to the chunk-format API. This gives it essentially two modes: write and
read. If the same struct instance was used for both reads and writes,
then there would be failures.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 chunk-format.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++++
 chunk-format.h | 33 +++++++++++++++++++++
 2 files changed, 113 insertions(+)

Junio C Hamano Feb. 4, 2021, 11:40 p.m. UTC | #1

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> We are re-using the anonymous 'struct chunkfile' data, as it is internal
> to the chunk-format API. This gives it essentially two modes: write and
> read. If the same struct instance was used for both reads and writes,
> then there would be failures.

Writing it here won't help future developers very much.  I think
that belongs to API/function doc for init_chunkfile().

> diff --git a/chunk-format.c b/chunk-format.c
> index ab914c55856..74501084cf8 100644
> --- a/chunk-format.c
> +++ b/chunk-format.c
> @@ -12,6 +12,8 @@ struct chunk_info {
>  	uint32_t id;
>  	uint64_t size;
>  	chunk_write_fn write_fn;
> +
> +	const void *start;
>  };
>  
>  struct chunkfile {
> @@ -89,3 +91,81 @@ int write_chunkfile(struct chunkfile *cf, void *data)
>  
>  	return 0;
>  }
> +
> +int read_table_of_contents(struct chunkfile *cf,
> +			   const unsigned char *mfile,
> +			   size_t mfile_size,
> +			   uint64_t toc_offset,
> +			   int toc_length)

It's a bit of mystery, having seen how the table-of-contents is laid
out by reading the writing side of the code, how toc_offset and
toc_length are discovered by the caller.  IIRC, the size to cover
everything from the beginning of the file to the end of the
table-of-contents was recorded as the length of a non-existent chunk
with id 0, but we need to be able to somehow find it to use that as
a way to get to the (end of) table-of-contents from the beginning of
the file.   I guess we'll learn enough when we get to read the code
that calls this function.

> +{
> +	uint32_t chunk_id;
> +	const unsigned char *table_of_contents = mfile + toc_offset;
> +
> +	ALLOC_GROW(cf->chunks, toc_length, cf->chunks_alloc);
> +
> +	while (toc_length--) {
> +		uint64_t chunk_offset, next_chunk_offset;
> +
> +		chunk_id = get_be32(table_of_contents);
> +		chunk_offset = get_be64(table_of_contents + 4);
> +
> +		if (!chunk_id) {
> +			error(_("terminating chunk id appears earlier than expected"));
> +			return 1;
> +		}
> +
> +		table_of_contents += CHUNK_LOOKUP_WIDTH;
> +		next_chunk_offset = get_be64(table_of_contents + 4);
> +
> +		if (next_chunk_offset < chunk_offset ||
> +		    next_chunk_offset > mfile_size - the_hash_algo->rawsz) {

The chunks have to be recorded in toc in the order they appear, and
there must be enough room to store the hashfile trailer checksum
after the last chunk.  OK.

> +			error(_("improper chunk offset(s) %"PRIx64" and %"PRIx64""),
> +			      chunk_offset, next_chunk_offset);
> +			return -1;
> +		}
> +
> +		cf->chunks[cf->chunks_nr].id = chunk_id;
> +		cf->chunks[cf->chunks_nr].start = mfile + chunk_offset;
> +		cf->chunks[cf->chunks_nr].size = next_chunk_offset - chunk_offset;
> +		cf->chunks_nr++;
> +	}
> +
> +	chunk_id = get_be32(table_of_contents);
> +	if (chunk_id) {
> +		error(_("final chunk has non-zero id %"PRIx32""), chunk_id);
> +		return -1;
> +	}

Shouldn't we be validating the size component associated with this
"id=0" fake chunk that appears at the end as well?  IIRC, the size
field should be pointing at the byte immediately after the TOC entry
for the last true chunk, immediately before this zero chunk id?

> +int pair_chunk(struct chunkfile *cf,
> +	       uint32_t chunk_id,
> +	       const unsigned char **p)
> +{
> +	int i;
> +
> +	for (i = 0; i < cf->chunks_nr; i++) {
> +		if (cf->chunks[i].id == chunk_id) {
> +			*p = cf->chunks[i].start;
> +			return 0;
> +		}

OK, the assumption here is that there will be at most one chunk that
has the chunk_id we seek to find (or putting it differently, second
and subsequent chunks with the same ID are ignored).  We may want to
write it down somewhere.

> +	}
> +
> +	return CHUNK_NOT_FOUND;
> +}
> +
> +int read_chunk(struct chunkfile *cf,
> +	       uint32_t chunk_id,
> +	       chunk_read_fn fn,
> +	       void *data)
> +{
> +	int i;
> +
> +	for (i = 0; i < cf->chunks_nr; i++) {
> +		if (cf->chunks[i].id == chunk_id)
> +			return fn(cf->chunks[i].start, cf->chunks[i].size, data);

It is curious why pair_chunk() exists in the first place.  With
something like this:

        int pair_chunk_fn(const unsigned char *chunk_start,
                          size_t chunk_size,
                          void *data)
        {
                const unsigned char **p = data;
                *p = chunk_start;
                return 0;
        }

instead of

	const unsigned char *location;

	pair_chunk(cf, chunk_id, &location);

we can write

	const unsigned char *location;

	read_chunk(cf, chunk_id, pair_chunk_fn, &location);

no?  That would allow us to reorganize the in-core TOC more easily
if it turns out to be necessary in the future.

> diff --git a/chunk-format.h b/chunk-format.h
> index bfaed672813..b62c9bf8ba1 100644
> --- a/chunk-format.h
> +++ b/chunk-format.h
> @@ -17,4 +17,37 @@ void add_chunk(struct chunkfile *cf,
>  	       size_t size);
>  int write_chunkfile(struct chunkfile *cf, void *data);
>  
> +int read_table_of_contents(struct chunkfile *cf,
> +			   const unsigned char *mfile,
> +			   size_t mfile_size,
> +			   uint64_t toc_offset,
> +			   int toc_length);
> +
> +#define CHUNK_NOT_FOUND (-2)
> +
> +/*
> + * Find 'chunk_id' in the given chunkfile and assign the
> + * given pointer to the position in the mmap'd file where
> + * that chunk begins.
> + *
> + * Returns CHUNK_NOT_FOUND if the chunk does not exist.
> + */
> +int pair_chunk(struct chunkfile *cf,
> +	       uint32_t chunk_id,
> +	       const unsigned char **p);
> +
> +typedef int (*chunk_read_fn)(const unsigned char *chunk_start,
> +			     size_t chunk_size, void *data);

Is the assumption throughout the chunked file API that reading is
actually not done as a "seek+read+process", but as a normal memory
access over mmapped region?  The reason I ask is because the answer
affects the choice of the right type for the offset.  The function
signature of read_table_of_contents() uses size_t to represent the
length of the entire mmapped region that holds the data from the
file, and that is better than off_t, especially if size_t were
smaller than off_t (i.e. we may not be able to mmap a huge size that
filesystem can handle and let us access with seek+read).

But the assumption that the whole mfile can be mmapped in as a whole
is only in read_table_of_contents(), and users of read_chunk() API
can be oblivious, I think---IOW, we could "page in" the chunk's data
in read_chunk() while the callback function works on it in-core, and
then discard it, if we wanted to change the implementation [*].

	Side note: for that to work, the API must say that the
	callback function MUST NOT assume that the memory region
	starting at chunk_start it is given will stay in memory
	after it returns.  Otherwise, we cannot "page in" and "page
	out".

I am not advocating that we should not assume the entire file can be
mapped in.  I would however advocate to be explicit in documenting
what the users of API can and cannot do (e.g. if we want the "read"
callbacks to take advantage of the fact that mfile will stay mapped
until the chunkfile is discarded, we should say so, so that they will
not make unnecessary copies out of the mmapped region).

> +/*
> + * Find 'chunk_id' in the given chunkfile and call the
> + * given chunk_read_fn method with the information for
> + * that chunk.
> + *
> + * Returns CHUNK_NOT_FOUND if the chunk does not exist.
> + */
> +int read_chunk(struct chunkfile *cf,
> +	       uint32_t chunk_id,
> +	       chunk_read_fn fn,
> +	       void *data);

Did I miss an update to free_chunkfile() to release resources used
to read this file?  For some reason, unlike the writing side, the
reading side of this API feels a bit incomplete to me.

Thanks.

Derrick Stolee Feb. 5, 2021, 12:19 p.m. UTC | #2

On 2/4/2021 6:40 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> We are re-using the anonymous 'struct chunkfile' data, as it is internal
>> to the chunk-format API. This gives it essentially two modes: write and
>> read. If the same struct instance was used for both reads and writes,
>> then there would be failures.
> 
> Writing it here won't help future developers very much.  I think
> that belongs to API/function doc for init_chunkfile().

I will add a clear comment there. Thanks.

>> +int read_table_of_contents(struct chunkfile *cf,
>> +			   const unsigned char *mfile,
>> +			   size_t mfile_size,
>> +			   uint64_t toc_offset,
>> +			   int toc_length)
> 
> It's a bit of mystery, having seen how the table-of-contents is laid
> out by reading the writing side of the code, how toc_offset and
> toc_length are discovered by the caller.  IIRC, the size to cover
> everything from the beginning of the file to the end of the
> table-of-contents was recorded as the length of a non-existent chunk
> with id 0, but we need to be able to somehow find it to use that as
> a way to get to the (end of) table-of-contents from the beginning of
> the file.   I guess we'll learn enough when we get to read the code
> that calls this function.

The existing formats have a byte in their header specifying how
many chunks are in the table of contents. That's how this information
is known in advance.

If we want to instead rely on the terminating chunk id with value 0,
then this method could be modified in the future. It does complicate
the allocation of cf->chunks slightly.

>> +	chunk_id = get_be32(table_of_contents);
>> +	if (chunk_id) {
>> +		error(_("final chunk has non-zero id %"PRIx32""), chunk_id);
>> +		return -1;
>> +	}
> 
> Shouldn't we be validating the size component associated with this
> "id=0" fake chunk that appears at the end as well?  IIRC, the size
> field should be pointing at the byte immediately after the TOC entry
> for the last true chunk, immediately before this zero chunk id?

During the loop, we scanned ahead to find the offset of the
terminating chunk and compute the size of the last "real" chunk.

Any size validation here would be to check that the offset points
_exactly_ to the terminating hash, but that might be too restrictive
on the format. (Who knows if there is a legit reason to have non-
chunked data between the chunked data and the trailing hash?)

As I re-roll, I plan to skip this final check. It would be easy to
add it in a forward fix, thought.

>> +int pair_chunk(struct chunkfile *cf,
>> +	       uint32_t chunk_id,
>> +	       const unsigned char **p)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < cf->chunks_nr; i++) {
>> +		if (cf->chunks[i].id == chunk_id) {
>> +			*p = cf->chunks[i].start;
>> +			return 0;
>> +		}
> 
> OK, the assumption here is that there will be at most one chunk that
> has the chunk_id we seek to find (or putting it differently, second
> and subsequent chunks with the same ID are ignored).  We may want to
> write it down somewhere.

This is enforced by a later patch, as you discovered.

>> +	for (i = 0; i < cf->chunks_nr; i++) {
>> +		if (cf->chunks[i].id == chunk_id)
>> +			return fn(cf->chunks[i].start, cf->chunks[i].size, data);
> 
> It is curious why pair_chunk() exists in the first place.  With
> something like this:
> 
>         int pair_chunk_fn(const unsigned char *chunk_start,
>                           size_t chunk_size,
>                           void *data)
>         {
>                 const unsigned char **p = data;
>                 *p = chunk_start;
>                 return 0;
>         }
> 
> instead of
> 
> 	const unsigned char *location;
> 
> 	pair_chunk(cf, chunk_id, &location);
> 
> we can write
> 
> 	const unsigned char *location;
> 
> 	read_chunk(cf, chunk_id, pair_chunk_fn, &location);
> 
> no?  That would allow us to reorganize the in-core TOC more easily
> if it turns out to be necessary in the future.

I like this, but why not just use pair_chunk_fn inside of
the implementation of pair_chunk() so callers have an easy
interface. This results in the implementation:

static int pair_chunk_fn(const unsigned char *chunk_start,
			 size_t chunk_size,
			 void *data)
{
	const unsigned char **p = data;
	*p = chunk_start;
	return 0;
}

int pair_chunk(struct chunkfile *cf,
	       uint32_t chunk_id,
	       const unsigned char **p)
{
	return read_chunk(cf, chunk_id, pair_chunk_fn, p);
}

>> +typedef int (*chunk_read_fn)(const unsigned char *chunk_start,
>> +			     size_t chunk_size, void *data);
> 
> Is the assumption throughout the chunked file API that reading is
> actually not done as a "seek+read+process", but as a normal memory
> access over mmapped region?  The reason I ask is because the answer
> affects the choice of the right type for the offset.  The function
> signature of read_table_of_contents() uses size_t to represent the
> length of the entire mmapped region that holds the data from the
> file, and that is better than off_t, especially if size_t were
> smaller than off_t (i.e. we may not be able to mmap a huge size that
> filesystem can handle and let us access with seek+read).
> 
> But the assumption that the whole mfile can be mmapped in as a whole
> is only in read_table_of_contents(), and users of read_chunk() API
> can be oblivious, I think---IOW, we could "page in" the chunk's data
> in read_chunk() while the callback function works on it in-core, and
> then discard it, if we wanted to change the implementation [*].
> 
> 	Side note: for that to work, the API must say that the
> 	callback function MUST NOT assume that the memory region
> 	starting at chunk_start it is given will stay in memory
> 	after it returns.  Otherwise, we cannot "page in" and "page
> 	out".
> 
> I am not advocating that we should not assume the entire file can be
> mapped in.  I would however advocate to be explicit in documenting
> what the users of API can and cannot do (e.g. if we want the "read"
> callbacks to take advantage of the fact that mfile will stay mapped
> until the chunkfile is discarded, we should say so, so that they will
> not make unnecessary copies out of the mmapped region).

I will add this expectation to the documentation.

>> +/*
>> + * Find 'chunk_id' in the given chunkfile and call the
>> + * given chunk_read_fn method with the information for
>> + * that chunk.
>> + *
>> + * Returns CHUNK_NOT_FOUND if the chunk does not exist.
>> + */
>> +int read_chunk(struct chunkfile *cf,
>> +	       uint32_t chunk_id,
>> +	       chunk_read_fn fn,
>> +	       void *data);
> 
> Did I miss an update to free_chunkfile() to release resources used
> to read this file?  For some reason, unlike the writing side, the
> reading side of this API feels a bit incomplete to me.

free_chunkfile(cf) already frees cf->chunks (and cf itself). No
other resources are allocated during the read. The caller is
responsible for the mmap'd file resource.

The reading side could use more documentation. I will think about
this.

Thanks,
-Stolee

Junio C Hamano Feb. 5, 2021, 7:37 p.m. UTC | #3

Derrick Stolee <stolee@gmail.com> writes:

>>> +	chunk_id = get_be32(table_of_contents);
>>> +	if (chunk_id) {
>>> +		error(_("final chunk has non-zero id %"PRIx32""), chunk_id);
>>> +		return -1;
>>> +	}
>> 
>> Shouldn't we be validating the size component associated with this
>> "id=0" fake chunk that appears at the end as well?

No, please disregard this comment, which was based on my incorrect
understanding of the "size" field associated with this fake ID==0
chunk (I incorrectly thought the size had something to do with the
file header plus TOC, but it is not---it is to allow discovering the
size of the last chunk by being a sentinel that records the offset
of an extra chunk at the end that does not actually exist).

> I like this, but why not just use pair_chunk_fn inside of
> the implementation of pair_chunk() so callers have an easy
> interface.

Yes, I didn't realize that earlier design iteration resulted in the
introduction of the "pair_chunk()" after discovering that it often
is necessary to just note the address where the data begins, so I
am OK to leave something like pair_chunk() as a public interface,
and implementing the pair_chunk() helper like you suggest would be a
perfectly fine way to do so.

It however is curious that the callers who use pair_chunk() do not
get the same quality of data as read_chunk() callers.

The users of pair_chunk() presumably are not ready to (or simply do
not want to) process the data immediately by using read_chunk() with
callback, but when they get ready to process the data, unlike
read_chunk callbacks, they do not get to learn how much they ought
to process---all they learn is the address of the beginning of the
chunk.  I do not see a way to write pair_chunk() users safely to
guarantee that they do not overrun at the tail end of the chunk they
are processing.

Thanks.

Junio C Hamano Feb. 8, 2021, 10:26 p.m. UTC | #4

Junio C Hamano <gitster@pobox.com> writes:

> The users of pair_chunk() presumably are not ready to (or simply do
> not want to) process the data immediately by using read_chunk() with
> callback, but when they get ready to process the data, unlike
> read_chunk callbacks, they do not get to learn how much they ought
> to process---all they learn is the address of the beginning of the
> chunk.  I do not see a way to write pair_chunk() users safely to
> guarantee that they do not overrun at the tail end of the chunk they
> are processing.

I've read through v3 and found it mostly done, but the above
question still stands.  I find it questionable why callers of
pair_chunk() only can learn where a chunk data begins, without
being able to learn how big the region of memory is.  IOW, why
can we get away without doing something like this?  The users
of pair_chunk() won't even know when they overrun the end of
the data the are given without something like this, no?

Thanks.

+struct memory_region {
+	const unsigned char *p;
+	size_t sz;
+};
+
 static int pair_chunk_fn(const unsigned char *chunk_start,
                          size_t chunk_size,
                          void *data)
 {
-        const unsigned char **p = data;
-        *p = chunk_start;
+        struct memory_region *x = data;
+        x->p = chunk_start;
+        x->sz = chunk_size;
         return 0;
 }



 int pair_chunk(struct chunkfile *cf,
                uint32_t chunk_id,
-                const unsigned char **p)
+                const unsigned char **p,
+                size_t *sz)
 {
+        int ret;
+        struct memory_region x;
=        return read_chunk(cf, chunk_id, pair_chunk_fn, &x);
+        ret = read_chunk(cf, chunk_id, pair_chunk_fn, &x);
+        *p = x.p;
+        *sz = x.sz;
+        return ret;
 }

Derrick Stolee Feb. 9, 2021, 1:33 a.m. UTC | #5

On 2/8/2021 5:26 PM, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
> 
>> The users of pair_chunk() presumably are not ready to (or simply do
>> not want to) process the data immediately by using read_chunk() with
>> callback, but when they get ready to process the data, unlike
>> read_chunk callbacks, they do not get to learn how much they ought
>> to process---all they learn is the address of the beginning of the
>> chunk.  I do not see a way to write pair_chunk() users safely to
>> guarantee that they do not overrun at the tail end of the chunk they
>> are processing.
> 
> I've read through v3 and found it mostly done, but the above
> question still stands.  I find it questionable why callers of
> pair_chunk() only can learn where a chunk data begins, without
> being able to learn how big the region of memory is.  IOW, why
> can we get away without doing something like this?  The users
> of pair_chunk() won't even know when they overrun the end of
> the data the are given without something like this, no?

I guess that the point is that if a caller wants to perform
logic on the size, then they should use read_chunk() instead.
We have some chunks that check the size is correct upon read,
but most chunks do not do this (currently).

In future series, additional protections could be added, and
I would expect that to be done by converting callers of
pair_chunk() into callers of read_chunk() with appropriate
callback functions.

Thanks,
-Stolee

Junio C Hamano Feb. 9, 2021, 8:47 p.m. UTC | #6

Derrick Stolee <stolee@gmail.com> writes:

> In future series, additional protections could be added, and
> I would expect that to be done by converting callers of
> pair_chunk() into callers of read_chunk() with appropriate
> callback functions.

I am perfectly OK with leaving the interface as-is, as apparently it
is sufficient for the current callers.  

It was just it looked insufficient for future callers that do not
want to use the callback interface, i.e. learn the <ptr, size> pair
with a single API call and then handle the data in that region
itself, instead of preparing a callback function and calling
read_chunk() on it.  As C does not have closures, it gets quickly
cumbersome if such a caller wants to share pieces of information
with the callback function that is given to read_chunk(), but a
caller that learns <ptr,size> from pair_chunk() and then works on
the region of memory itself do nto have to worry about that.

[v2,12/17] chunk-format: create read chunk API

Commit Message

Comments

Patch