diff mbox series

[01/30] object-file-convert: Stubs for converting from one object format to another

Message ID 20230927195537.1682-1-ebiederm@gmail.com (mailing list archive)
State New, archived
Headers show
Series Initial support for multiple hash functions | expand

Commit Message

Eric W. Biederman Sept. 27, 2023, 7:55 p.m. UTC
From: "Eric W. Biederman" <ebiederm@xmission.com>

Two basic functions are provided:
- convert_object_file Takes an object file it's type and hash algorithm
  and converts it into the equivalent object file that would
  have been generated with hash algorithm "to".

  For blob objects there is no converstion to be done and it is an
  error to use this function on them.

  For commit, tree, and tag objects embedded oids are replaced by the
  oids of the objects they refer to with those objects and their
  object ids reencoded in with the hash algorithm "to".  Signatures
  are rearranged so that they remain valid after the object has
  been reencoded.

- repo_oid_to_algop which takes an oid that refers to an object file
  and returns the oid of the equavalent object file generated
  with the target hash algorithm.

The pair of files object-file-convert.c and object-file-convert.h are
introduced to hold as much of this logic as possible to keep this
conversion logic cleanly separated from everything else and in the
hopes that someday the code will be clean enough git can support
compiling out support for sha1 and the various conversion functions.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 Makefile              |  1 +
 object-file-convert.c | 57 +++++++++++++++++++++++++++++++++++++++++++
 object-file-convert.h | 24 ++++++++++++++++++
 3 files changed, 82 insertions(+)
 create mode 100644 object-file-convert.c
 create mode 100644 object-file-convert.h

Comments

Eric Sunshine Sept. 27, 2023, 8:42 p.m. UTC | #1
On Wed, Sep 27, 2023 at 3:55 PM Eric W. Biederman <ebiederm@gmail.com> wrote:
> Two basic functions are provided:
> - convert_object_file Takes an object file it's type and hash algorithm
>   and converts it into the equivalent object file that would
>   have been generated with hash algorithm "to".
>
>   For blob objects there is no converstion to be done and it is an
>   error to use this function on them.

s/converstion/conversion/

>   For commit, tree, and tag objects embedded oids are replaced by the
>   oids of the objects they refer to with those objects and their
>   object ids reencoded in with the hash algorithm "to".  Signatures
>   are rearranged so that they remain valid after the object has
>   been reencoded.
>
> - repo_oid_to_algop which takes an oid that refers to an object file
>   and returns the oid of the equavalent object file generated
>   with the target hash algorithm.

s/equavalent/equivalent/

> The pair of files object-file-convert.c and object-file-convert.h are
> introduced to hold as much of this logic as possible to keep this
> conversion logic cleanly separated from everything else and in the
> hopes that someday the code will be clean enough git can support
> compiling out support for sha1 and the various conversion functions.
>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Just some minor comments below, many of which are subjective
style-related observations, thus not necessarily actionable, but also
one or two legitimate questions.

> diff --git a/object-file-convert.c b/object-file-convert.c
> @@ -0,0 +1,57 @@
> +int repo_oid_to_algop(struct repository *repo, const struct object_id *src,
> +                     const struct git_hash_algo *to, struct object_id *dest)
> +{
> +       /*
> +        * If the source alogirthm is not set, then we're using the
> +        * default hash algorithm for that object.
> +        */

s/alogirthm/algorithm/

> +       const struct git_hash_algo *from =
> +               src->algo ? &hash_algos[src->algo] : repo->hash_algo;
> +
> +       if (from == to) {
> +               if (src != dest)
> +                       oidcpy(dest, src);
> +               return 0;
> +       }
> +       return -1;
> +}

On this project, we usually get the simple cases out of the way first,
which often reduces the indentation level, making the code easier to
digest at a glance. So, it would be typical to write this as:

    if (from != to)
        return -1
    if (src != dest)
        oidcpy(dest, src);
    return 0;

or even:

    if (from != to)
        return -1
    if (src == dest)
        return 0;
    oidcpy(dest, src);
    return 0;

This way, for instance, the reader doesn't get to the end of the
function and then have to scan backward to understand the condition of
the `return -1`.

> +int convert_object_file(struct strbuf *outbuf,
> +                       const struct git_hash_algo *from,
> +                       const struct git_hash_algo *to,
> +                       const void *buf, size_t len,
> +                       enum object_type type,
> +                       int gentle)
> +{
> +       int ret;
> +
> +       /* Don't call this function when no conversion is necessary */
> +       if ((from == to) || (type == OBJ_BLOB))
> +               die("Refusing noop object file conversion");

Several comments...

Style: we usually reduce the noise level by dropping the extra parentheses:

    if (from == to || type == OBJ_BLOB)

Does this condition represent a programming error or a runtime error
triggerable by some input? If a programming error, then use BUG()
rather than die().

If a triggerable runtime error, then...

* start user-facing messages with lowercase rather than capitalized word

* make the user-facing message localizable so readers of other
languages can digest it

    die(_("refusing do-nothing object conversion"));

On the other hand, don't make BUG() messages localizable.

> +       switch (type) {
> +       case OBJ_COMMIT:
> +       case OBJ_TREE:
> +       case OBJ_TAG:
> +       default:
> +               /* Not implemented yet, so fail. */
> +               ret = -1;
> +               break;
> +       }
> +       if (!ret)
> +               return 0;
> +       if (gentle) {
> +               strbuf_release(outbuf);
> +               return ret;
> +       }

This function appears to be a mere skeleton at the moment, so it's
difficult to judge at this point whether you are using `outbuf` as a
bag of bytes or as a legitimate string container. If the latter, then
the API may be reasonable, but if you're using it as a bag-of-bytes,
then it feels like you're leaking an implementation detail into the
API.

> +       die(_("Failed to convert object from %s to %s"),
> +               from->name, to->name);

s/Failed/failed/

For people trying to diagnose this problem, would it be helpful to
present more information about the failed conversion, such as object
type and perhaps even its OID?

> diff --git a/object-file-convert.h b/object-file-convert.h
> @@ -0,0 +1,24 @@
> +int repo_oid_to_algop(struct repository *repo, const struct object_id *src,
> +                     const struct git_hash_algo *to, struct object_id *dest);

I suppose the function name is pretty much self-explanatory to those
familiar with the underlying concepts, but it might still be helpful
to add a comment explaining what the function does.

> +/*
> + * Convert an object file from one hash algorithm to another algorithm.
> + * Return -1 on failure, 0 on success.
> + */
> +int convert_object_file(struct strbuf *outbuf,
> +                       const struct git_hash_algo *from,
> +                       const struct git_hash_algo *to,
> +                       const void *buf, size_t len,
> +                       enum object_type type,
> +                       int gentle);
Eric W. Biederman Oct. 2, 2023, 1:22 a.m. UTC | #2
Eric Sunshine <sunshine@sunshineco.com> writes:

> On Wed, Sep 27, 2023 at 3:55 PM Eric W. Biederman <ebiederm@gmail.com> wrote:
>> Two basic functions are provided:
>> - convert_object_file Takes an object file it's type and hash algorithm
>>   and converts it into the equivalent object file that would
>>   have been generated with hash algorithm "to".
>>
>>   For blob objects there is no converstion to be done and it is an
>>   error to use this function on them.
>
> s/converstion/conversion/
>
>>   For commit, tree, and tag objects embedded oids are replaced by the
>>   oids of the objects they refer to with those objects and their
>>   object ids reencoded in with the hash algorithm "to".  Signatures
>>   are rearranged so that they remain valid after the object has
>>   been reencoded.
>>
>> - repo_oid_to_algop which takes an oid that refers to an object file
>>   and returns the oid of the equavalent object file generated
>>   with the target hash algorithm.
>
> s/equavalent/equivalent/
>
>> The pair of files object-file-convert.c and object-file-convert.h are
>> introduced to hold as much of this logic as possible to keep this
>> conversion logic cleanly separated from everything else and in the
>> hopes that someday the code will be clean enough git can support
>> compiling out support for sha1 and the various conversion functions.
>>
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>
> Just some minor comments below, many of which are subjective
> style-related observations, thus not necessarily actionable, but also
> one or two legitimate questions.
>
>> diff --git a/object-file-convert.c b/object-file-convert.c
>> @@ -0,0 +1,57 @@
>> +int repo_oid_to_algop(struct repository *repo, const struct object_id *src,
>> +                     const struct git_hash_algo *to, struct object_id *dest)
>> +{
>> +       /*
>> +        * If the source alogirthm is not set, then we're using the
>> +        * default hash algorithm for that object.
>> +        */
>
> s/alogirthm/algorithm/
>
>> +       const struct git_hash_algo *from =
>> +               src->algo ? &hash_algos[src->algo] : repo->hash_algo;
>> +
>> +       if (from == to) {
>> +               if (src != dest)
>> +                       oidcpy(dest, src);
>> +               return 0;
>> +       }
>> +       return -1;
>> +}
>
> On this project, we usually get the simple cases out of the way first,
> which often reduces the indentation level, making the code easier to
> digest at a glance. So, it would be typical to write this as:
>
>     if (from != to)
>         return -1
>     if (src != dest)
>         oidcpy(dest, src);
>     return 0;
>
> or even:
>
>     if (from != to)
>         return -1
>     if (src == dest)
>         return 0;
>     oidcpy(dest, src);
>     return 0;
>
> This way, for instance, the reader doesn't get to the end of the
> function and then have to scan backward to understand the condition of
> the `return -1`.

The "return -1" is there only because it is a stub, and it is there
where the rest of the code needs to go.

As for simple cases the "if (from == to)" case is a simple case I am
getting out of the way.  It unfortunately is cluttered by the fact
that "oidcpy(&oid, &oid)" is not valid so it has to guard the copy.

if (from == to) {
	if (src == dest)
        	return 0;
        oidcpy(dest, src);
        return 0;
}

Could be used but is wordier.  And duplicates the return code
for the same case so I am not enthusiastic about it.


>> +int convert_object_file(struct strbuf *outbuf,
>> +                       const struct git_hash_algo *from,
>> +                       const struct git_hash_algo *to,
>> +                       const void *buf, size_t len,
>> +                       enum object_type type,
>> +                       int gentle)
>> +{
>> +       int ret;
>> +
>> +       /* Don't call this function when no conversion is necessary */
>> +       if ((from == to) || (type == OBJ_BLOB))
>> +               die("Refusing noop object file conversion");
>
> Several comments...
>
> Style: we usually reduce the noise level by dropping the extra parentheses:
>
>     if (from == to || type == OBJ_BLOB)

I honestly can not be confident of C code that does that.

The precedence of the operators in C has been wrong for longer than I
have been programming, and I can never remember exactly how the
precedence is wrong.  So for the last 30 years I have been adding enough
parenthesis that I don't have to remember.

> Does this condition represent a programming error or a runtime error
> triggerable by some input? If a programming error, then use BUG()
> rather than die().

Agreed BUG would be better there.

> If a triggerable runtime error, then...
>
> * start user-facing messages with lowercase rather than capitalized word
>
> * make the user-facing message localizable so readers of other
> languages can digest it
>
>     die(_("refusing do-nothing object conversion"));
>
> On the other hand, don't make BUG() messages localizable.
>
>> +       switch (type) {
>> +       case OBJ_COMMIT:
>> +       case OBJ_TREE:
>> +       case OBJ_TAG:
>> +       default:
>> +               /* Not implemented yet, so fail. */
>> +               ret = -1;
>> +               break;
>> +       }
>> +       if (!ret)
>> +               return 0;
>> +       if (gentle) {
>> +               strbuf_release(outbuf);
>> +               return ret;
>> +       }
>
> This function appears to be a mere skeleton at the moment, so it's
> difficult to judge at this point whether you are using `outbuf` as a
> bag of bytes or as a legitimate string container. If the latter, then
> the API may be reasonable, but if you're using it as a bag-of-bytes,
> then it feels like you're leaking an implementation detail into the
> API.

It is a string that represents the entire object.

It might be arguable if tree objects are text given that trees represent
oids in binary, but for tag and commit objects they are definitely one
big text string.

This is an implementation detail that makes the code simpler, and less
error prone.

I have not encountered anything where a string buffer would not be
a reasonable fit.

>> +       die(_("Failed to convert object from %s to %s"),
>> +               from->name, to->name);
>
> s/Failed/failed/

I don't understand wanting to start a sentence with a lower case letter.
Can you explain?

> For people trying to diagnose this problem, would it be helpful to
> present more information about the failed conversion, such as object
> type and perhaps even its OID?

I expect some of that context will come from the conversion functions
themselves.  Those messages are almost certain to give the type
information one way or another because they are type specific.

That said it requires some version of corrupt repository to reach this
error.  Either missing mapping tables, or a corrupt object.

So I don't know how much it matters to get this perfect the first
time.  We can improve the error message we gain experience.

I would agree that including the OID makes sense.  Unfortunately the
code does not have the OID at this point.

>> diff --git a/object-file-convert.h b/object-file-convert.h
>> @@ -0,0 +1,24 @@
>> +int repo_oid_to_algop(struct repository *repo, const struct object_id *src,
>> +                     const struct git_hash_algo *to, struct object_id *dest);
>
> I suppose the function name is pretty much self-explanatory to those
> familiar with the underlying concepts, but it might still be helpful
> to add a comment explaining what the function does.

I could use words that repeat what is in the function signature.
But I don't think I could add anything.

I would have to say something like:

Look up the oid that an equivalent object would have in a repository
whose object format is "to".

Is that helpful?

Eric
Eric Sunshine Oct. 2, 2023, 2:27 a.m. UTC | #3
On Sun, Oct 1, 2023 at 9:22 PM Eric W. Biederman <ebiederm@gmail.com> wrote:
> Eric Sunshine <sunshine@sunshineco.com> writes:
> > On Wed, Sep 27, 2023 at 3:55 PM Eric W. Biederman <ebiederm@gmail.com> wrote:
> >> +       die(_("Failed to convert object from %s to %s"),
> >> +               from->name, to->name);
> >
> > s/Failed/failed/
>
> I don't understand wanting to start a sentence with a lower case letter.
> Can you explain?

Consistency with most of the rest of the messages emitted by Git.
CodingGuidelines call for lowercase and omission of the full-stop
(period).

The choice, of course, is subjective, but these are the guidelines
upon which the project has settled for better or worse. (You can
certainly find older code, which predates the guideline, using
capitalized messages, but it's good to adhere to the guideline for new
code if possible.)

> >> +int repo_oid_to_algop(struct repository *repo, const struct object_id *src,
> >> +                     const struct git_hash_algo *to, struct object_id *dest);
> >
> > I suppose the function name is pretty much self-explanatory to those
> > familiar with the underlying concepts, but it might still be helpful
> > to add a comment explaining what the function does.
>
> I could use words that repeat what is in the function signature.
> But I don't think I could add anything.
>
> I would have to say something like:
>
> Look up the oid that an equivalent object would have in a repository
> whose object format is "to".
>
> Is that helpful?

That would help clarify the function a bit for me. Explaining the
function's return value could also be useful.
diff mbox series

Patch

diff --git a/Makefile b/Makefile
index 577630936535..f7e824f25cda 100644
--- a/Makefile
+++ b/Makefile
@@ -1073,6 +1073,7 @@  LIB_OBJS += notes-cache.o
 LIB_OBJS += notes-merge.o
 LIB_OBJS += notes-utils.o
 LIB_OBJS += notes.o
+LIB_OBJS += object-file-convert.o
 LIB_OBJS += object-file.o
 LIB_OBJS += object-name.o
 LIB_OBJS += object.o
diff --git a/object-file-convert.c b/object-file-convert.c
new file mode 100644
index 000000000000..ba3e18f6af44
--- /dev/null
+++ b/object-file-convert.c
@@ -0,0 +1,57 @@ 
+#include "git-compat-util.h"
+#include "gettext.h"
+#include "strbuf.h"
+#include "repository.h"
+#include "hash-ll.h"
+#include "object.h"
+#include "object-file-convert.h"
+
+int repo_oid_to_algop(struct repository *repo, const struct object_id *src,
+		      const struct git_hash_algo *to, struct object_id *dest)
+{
+	/*
+	 * If the source alogirthm is not set, then we're using the
+	 * default hash algorithm for that object.
+	 */
+	const struct git_hash_algo *from =
+		src->algo ? &hash_algos[src->algo] : repo->hash_algo;
+
+	if (from == to) {
+		if (src != dest)
+			oidcpy(dest, src);
+		return 0;
+	}
+	return -1;
+}
+
+int convert_object_file(struct strbuf *outbuf,
+			const struct git_hash_algo *from,
+			const struct git_hash_algo *to,
+			const void *buf, size_t len,
+			enum object_type type,
+			int gentle)
+{
+	int ret;
+
+	/* Don't call this function when no conversion is necessary */
+	if ((from == to) || (type == OBJ_BLOB))
+		die("Refusing noop object file conversion");
+
+	switch (type) {
+	case OBJ_COMMIT:
+	case OBJ_TREE:
+	case OBJ_TAG:
+	default:
+		/* Not implemented yet, so fail. */
+		ret = -1;
+		break;
+	}
+	if (!ret)
+		return 0;
+	if (gentle) {
+		strbuf_release(outbuf);
+		return ret;
+	}
+	die(_("Failed to convert object from %s to %s"),
+		from->name, to->name);
+}
diff --git a/object-file-convert.h b/object-file-convert.h
new file mode 100644
index 000000000000..a4f802aa8eea
--- /dev/null
+++ b/object-file-convert.h
@@ -0,0 +1,24 @@ 
+#ifndef OBJECT_CONVERT_H
+#define OBJECT_CONVERT_H
+
+struct repository;
+struct object_id;
+struct git_hash_algo;
+struct strbuf;
+#include "object.h"
+
+int repo_oid_to_algop(struct repository *repo, const struct object_id *src,
+		      const struct git_hash_algo *to, struct object_id *dest);
+
+/*
+ * Convert an object file from one hash algorithm to another algorithm.
+ * Return -1 on failure, 0 on success.
+ */
+int convert_object_file(struct strbuf *outbuf,
+			const struct git_hash_algo *from,
+			const struct git_hash_algo *to,
+			const void *buf, size_t len,
+			enum object_type type,
+			int gentle);
+
+#endif /* OBJECT_CONVERT_H */