[v2,01/24] builtin/pack-objects: make hash agnostic
diff mbox series

Message ID 20200222201749.937983-2-sandals@crustytoothpaste.net
State New
Headers show
Series
  • SHA-256 stage 4 implementation, part 1/3
Related show

Commit Message

brian m. carlson Feb. 22, 2020, 8:17 p.m. UTC
Avoid hard-coding a hash size, instead preferring to use the_hash_algo.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 builtin/pack-objects.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Jeff King Feb. 23, 2020, 9:57 p.m. UTC | #1
On Sat, Feb 22, 2020 at 08:17:26PM +0000, brian m. carlson wrote:

> Avoid hard-coding a hash size, instead preferring to use the_hash_algo.
> [...]
> -			hashwrite(out, base_sha1, 20);
> +			hashwrite(out, base_sha1, the_hash_algo->rawsz);

Yeah, looks obviously correct (and I think this is new from the
pack-reuse patches of mine that Christian sent recently).

The name "base_sha1" is clearly not great either. It could be changed
trivially, but the real sin is that it comes from
nth_packed_object_sha1(). It could be replaced with
nth_packed_object_oid() at the cost of an extra hash copy, which isn't
too bad.

It would be nice to get rid of that function entirely. In most cases,
it's either free to do so (we end up copying the result into an oid
anyway) or we pay for one extra hashcpy (out of the mmap into a local
struct). But the one in pack-check.c:verify_packfile() is tricky; we
store a pointer per object into the index mmap, and we'd have to switch
to storing an oid per object. Given that the code isn't commonly run
(and other operations like _generating_ the index in the first place are
clearly OK with the extra memory hit), I think I'd be OK with that in
the name of cleanliness.

All of that is sort of orthogonal to your goals, I think, so I don't
mind at all calling it out of scope for your series. As long as we use
the_hash_algo->rawsz, these bare pointer accesses aren't wrong (it's
just that it's easy to accidentally get it wrong, as this code shows).

I think it would also be correct to cast the mmap'd bytes to a "struct
object_id" given that the struct contains the hash bytes as the first
member. I worry a little that we'd get no compiler warning of the
breakage if that assumption changes, but it also seems unlikely to do
so.

-Peff
Jeff King Feb. 24, 2020, 3:01 a.m. UTC | #2
On Sun, Feb 23, 2020 at 04:57:59PM -0500, Jeff King wrote:

> It would be nice to get rid of [nth_packed_object_sha1] entirely. In most cases,
> it's either free to do so (we end up copying the result into an oid
> anyway) or we pay for one extra hashcpy (out of the mmap into a local
> struct). But the one in pack-check.c:verify_packfile() is tricky; we
> store a pointer per object into the index mmap, and we'd have to switch
> to storing an oid per object. Given that the code isn't commonly run
> (and other operations like _generating_ the index in the first place are
> clearly OK with the extra memory hit), I think I'd be OK with that in
> the name of cleanliness.
> 
> All of that is sort of orthogonal to your goals, I think, so I don't
> mind at all calling it out of scope for your series. As long as we use
> the_hash_algo->rawsz, these bare pointer accesses aren't wrong (it's
> just that it's easy to accidentally get it wrong, as this code shows).

I looked into this a bit further. It turns out that the current code in
verify_packfile() was explicitly trying to avoid that extra memory. But
the good news is that I think we can improve it further, cleaning up the
existing type-punning and using even less memory than now.

I'll prepare a separate series to do that. I had thought the cleanup
might also make this whole 20 / the_hash_algo->rawsz thing go away, but
it doesn't (the name hashwrite() made me think we ought to be using an
oidwrite(), but of course that's silly; the "hash" here is a "file with
a hash checksum" and not "an object id hash"). So the two topics really
are independent.

-Peff
brian m. carlson Feb. 24, 2020, 3:42 a.m. UTC | #3
On 2020-02-23 at 21:57:59, Jeff King wrote:
> On Sat, Feb 22, 2020 at 08:17:26PM +0000, brian m. carlson wrote:
> 
> > Avoid hard-coding a hash size, instead preferring to use the_hash_algo.
> > [...]
> > -			hashwrite(out, base_sha1, 20);
> > +			hashwrite(out, base_sha1, the_hash_algo->rawsz);
> 
> Yeah, looks obviously correct (and I think this is new from the
> pack-reuse patches of mine that Christian sent recently).

I believe it is, which is why I CC'd you on it.

> The name "base_sha1" is clearly not great either. It could be changed
> trivially, but the real sin is that it comes from
> nth_packed_object_sha1(). It could be replaced with
> nth_packed_object_oid() at the cost of an extra hash copy, which isn't
> too bad.

I probably should send a series getting rid of the rest of the "sha1"
places in our code, because there are a lot of them, but I just haven't
gotten around to it yet.  And yeah, as you mentioned, there are still a
lot of places using nth_packed_object_sha1.

> It would be nice to get rid of that function entirely. In most cases,
> it's either free to do so (we end up copying the result into an oid
> anyway) or we pay for one extra hashcpy (out of the mmap into a local
> struct). But the one in pack-check.c:verify_packfile() is tricky; we
> store a pointer per object into the index mmap, and we'd have to switch
> to storing an oid per object. Given that the code isn't commonly run
> (and other operations like _generating_ the index in the first place are
> clearly OK with the extra memory hit), I think I'd be OK with that in
> the name of cleanliness.

Yeah, that's why I hadn't switched it out earlier.

> I think it would also be correct to cast the mmap'd bytes to a "struct
> object_id" given that the struct contains the hash bytes as the first
> member. I worry a little that we'd get no compiler warning of the
> breakage if that assumption changes, but it also seems unlikely to do
> so.

In the future, struct object_id will get a new member (an algorithm
value), but I think it's fine to make the assumption that the hash bytes
are first.
Jeff King Feb. 24, 2020, 4:20 a.m. UTC | #4
On Mon, Feb 24, 2020 at 03:42:59AM +0000, brian m. carlson wrote:

> > Yeah, looks obviously correct (and I think this is new from the
> > pack-reuse patches of mine that Christian sent recently).
> 
> I believe it is, which is why I CC'd you on it.

Heh, yeah. I knew you knew, but was saying so for the rest of the
audience. :)

> > I think it would also be correct to cast the mmap'd bytes to a "struct
> > object_id" given that the struct contains the hash bytes as the first
> > member. I worry a little that we'd get no compiler warning of the
> > breakage if that assumption changes, but it also seems unlikely to do
> > so.
> 
> In the future, struct object_id will get a new member (an algorithm
> value), but I think it's fine to make the assumption that the hash bytes
> are first.

Yeah, I think that would continue to work, although weirdness would
ensue if anybody ever dereferenced the algorithm member in one of the
type-punned structs. If we can avoid it entirely, I think we should (and
it was easy to remove the spot in pack-check).

Patches incoming.

-Peff

Patch
diff mbox series

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 940fbcb7b3..fceac7b462 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -879,7 +879,7 @@  static void write_reused_pack_one(size_t pos, struct hashfile *out,
 			len = encode_in_pack_object_header(header, sizeof(header),
 							   OBJ_REF_DELTA, size);
 			hashwrite(out, header, len);
-			hashwrite(out, base_sha1, 20);
+			hashwrite(out, base_sha1, the_hash_algo->rawsz);
 			copy_pack_data(out, reuse_packfile, w_curs, cur, next - cur);
 			return;
 		}